I have a MySQL InnoDB table which contains the following columns (table and column names changed): date, var_a, var_b, and rel_ab, where rel_ab is a column that describes the relationship between the two variables var_a and var_b on a given date (var_a and var_b refer to different tables).
Data is uploaded in batches once a day, totaling approximately 7 million rows per day. The problem is that after just a few weeks, it started taking hours to upload each new daily batch, so clearly we need to improve our table design. Here are some additional details about our table:

- It uses COMPRESSION="zlib".
- It has indexes on the var_a and var_b columns.
- We need to extract the data with queries like SELECT * FROM table WHERE date = ... ; the selection only takes a few minutes.
- Apart from updates cascading in from var_a and var_b, we almost never update or delete rows.
- Data is uploaded with df.to_sql('temp', con, if_exists='replace', index=False, method='multi'), after which we INSERT IGNORE temp into table and drop temp. (The SQL side of this load path is sketched below.)
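For reference, a minimal sketch of the SQL that runs after df.to_sql has rebuilt the temp staging table (table stands for the real table name):

```sql
-- Copy the new daily batch into the main table, silently skipping
-- any rows whose (date, var_a, var_b) primary key already exists.
INSERT IGNORE INTO `table`
SELECT * FROM temp;

-- The staging table is no longer needed.
DROP TABLE temp;
```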
Therefore, I plan to do at least one of the following:
- Remove the foreign key constraints (and indexes) on var_a and var_b, and rely on the data upload process to do everything correctly. This is because neither index actually improves query speed in our use case.
- Split the table into one table per date, e.g. table_230501, containing only the columns var_a, var_b, and rel_ab. This is because we only ever select one date at a time. (A sketch of such a table follows.)

I understand that the first solution may threaten data integrity, while the second would clutter our architecture. In my limited experience I have never come across the second option, and I can't find any examples of this design online. Is either of these options a sensible solution? Both would increase upload speed and reduce disk usage, but both have their drawbacks. Otherwise, what other ways are there to increase upload speed?
EDIT: My SHOW CREATE TABLE looks like:

```sql
CREATE TABLE `table` (
  `date` date NOT NULL,
  `var_a` int NOT NULL,
  `var_b` int NOT NULL,
  `rel_ab` decimal(19,16) NOT NULL,
  PRIMARY KEY (`date`,`var_a`,`var_b`),
  KEY `a_idx` (`var_a`),
  KEY `b_idx` (`var_b`),
  CONSTRAINT `a` FOREIGN KEY (`var_a`) REFERENCES `other_table_a` (`var_a`) ON DELETE RESTRICT ON UPDATE CASCADE,
  CONSTRAINT `b` FOREIGN KEY (`var_b`) REFERENCES `other_table_b` (`var_b`) ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMPRESSION="zlib"
```
P粉781235689 2023-09-10 13:05:25
There are a few potential solutions that could help improve the upload speed of your MySQL table:
- Drop the indexes on var_a and var_b: since these indexes are not speeding up your queries, dropping them can speed up the upload. Note, however, that InnoDB requires an index on every foreign-key column, so if you keep the constraints you must keep the indexes.
- Partition the table by date: partitioning can improve query performance because the database only has to scan the relevant partition for a given query. It does make maintenance and backups more complex, though, and may be unnecessary if your queries already perform well. (A DDL sketch covering this and the previous item follows this list.)
- Use a bulk insert method: instead of going through df.to_sql, try a bulk loading method such as LOAD DATA INFILE or MySQL's bulk-insert APIs. Bulk loading is much faster than inserting rows individually, especially when you can load the data in large batches rather than one row at a time. (A sketch appears at the end of this answer.)
- Use a different compression algorithm: you are currently using zlib, but other compression settings may be faster or more efficient for your data. It is worth experimenting with the available compression options to see whether they improve upload speed.
- Increase server resources: if you have the budget, upgrading the server hardware or adding servers may raise upload speed. This is not a viable option for everyone, but it is worth considering once you have exhausted the others.
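A rough sketch of the DDL for the first two suggestions, written against the SHOW CREATE TABLE above (note that MySQL does not allow foreign keys on partitioned tables, so the constraints must be dropped before partitioning; the partition names and boundary dates are illustrative):

```sql
-- Drop the foreign keys first: InnoDB requires an index behind each
-- constraint, so the indexes cannot be removed while the FKs exist.
ALTER TABLE `table`
  DROP FOREIGN KEY a,
  DROP FOREIGN KEY b;
ALTER TABLE `table`
  DROP INDEX a_idx,
  DROP INDEX b_idx;

-- Partition by date so each day's SELECT touches a single partition.
ALTER TABLE `table`
  PARTITION BY RANGE COLUMNS (`date`) (
    PARTITION p230501 VALUES LESS THAN ('2023-05-02'),
    PARTITION p230502 VALUES LESS THAN ('2023-05-03'),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
  );
```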
As for your own suggestions: removing the foreign key constraints can cause data integrity issues, so I don't recommend that approach. Partitioning by date may be a good solution if your queries are struggling, but since yours already run quickly it may not be necessary.
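A minimal sketch of the bulk-load route; the file path, CSV layout, and header line are assumptions about your export format:

```sql
-- Load one day's batch straight from a file; IGNORE skips rows
-- that would duplicate the (date, var_a, var_b) primary key.
LOAD DATA INFILE '/path/to/batch_2023-05-01.csv'
IGNORE INTO TABLE `table`
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(`date`, var_a, var_b, rel_ab);
```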
P粉098979048 2023-09-10 11:03:23
To speed up the uploads, get rid of the table. Seriously: if the only thing you do with the data is fetch exactly what is in a file for a given date, why put the data into a table at all? (Your comment points out that a single "file" is actually several files; combining them first might be a good idea.)

If you do need the data in a table, let's discuss these points...
- SHOW CREATE TABLE: what you have provided may be missing some subtleties.
- How is the loading done? LOAD DATA? Hopefully not one-row-at-a-time INSERTs. I don't know how pandas works (nor how the 99 other packages that "simplify" MySQL access work); understand what it does behind the scenes. You may have to bypass pandas to get better performance. Bulk loading is at least 10 times as fast as row-by-row loading.
- Would FLOAT (4 bytes, roughly 7 significant digits) be precise enough for rel_ab? It is far smaller than DECIMAL(19,16).
- MEDIUMINT [UNSIGNED] saves 1 byte per value compared with INT, which at 7 million rows per day is at least 7 MB per day per column.
- Multiple "identical" tables are always unwise; a single table is always better. But, as suggested above, zero tables is better still. (A slimmed-down schema along these lines is sketched below.)
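Putting the FLOAT and MEDIUMINT suggestions together, the table might shrink to something like this (a sketch; it assumes the IDs stay below 16,777,216 and that ~7 significant digits are enough for rel_ab):

```sql
CREATE TABLE `table` (
  `date` date NOT NULL,
  var_a  mediumint unsigned NOT NULL,  -- 3 bytes: saves 1 byte/row vs INT,
  var_b  mediumint unsigned NOT NULL,  -- i.e. ~7 MB/day per column
  rel_ab float NOT NULL,               -- 4 bytes instead of decimal(19,16)
  PRIMARY KEY (`date`, var_a, var_b)
) ENGINE=InnoDB;
```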