search

Home  >  Q&A  >  body text

Is it wise to create different tables for the same data for each date?

I have a MYSQL InnoDB table table which contains the following columns (table and column names changed):

where rel_ab is a column that describes the relationship between the 2 variables var_a and var_b on a given date. (var_a and var_b refer to different tables)

Data is uploaded in batches every day, totaling approximately 7 million rows per day. The problem was, after just a few weeks, it started taking hours to upload each new daily batch. Clearly we need to improve our table design. Here are some additional details about our form.

Therefore, I plan to do at least one of the following:

I understand that the first solution may threaten data integrity, while the second solution will clutter our architecture. In my limited experience, I've never heard of the second option either, and can't find any examples of this design online. Are any of these options a sensible solution? Both will increase upload speeds and reduce disk usage, but both have their drawbacks. Otherwise, what other ways are there to increase upload speed?

EDIT: My SHOW CREATE TABLE should look like

CREATE TABLE table (
  date date NOT NULL,
  var_a int NOT NULL,
  var_b int NOT NULL,
  rel_ab decimal(19,16) NOT NULL,
  PRIMARY KEY (date,`var_a`,`var_b`),
  KEY a_idx (var_a),
  KEY b_idx (var_b),
  CONSTRAINT a FOREIGN KEY (var_a) REFERENCES other_table_a (var_a) ON DELETE RESTRICT ON UPDATE CASCADE,
  CONSTRAINT b FOREIGN KEY (var_b) REFERENCES other_table_b (var_b) ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMPRESSION="zlib"

P粉665679053P粉665679053485 days ago566

reply all(2)I'll reply

  • P粉781235689

    P粉7812356892023-09-10 13:05:25

    There are some potential solutions that can help you improve the upload speed of your MySQL tables:

    Delete the indexes on var_a and var_b: Since you are not using these indexes to speed up queries, deleting them can help speed up the upload process. However, if you use foreign key constraints, it is generally recommended to keep indexes on the columns that belong to the foreign key.

    Partition the table by date: Partitioning helps improve query performance because it allows the database to scan only the relevant partitions for a given query. However, it also makes maintenance and backups more complex, which may not be necessary if your queries are already performing well.

    Use bulk insert method: Instead of using df.to_sql to insert individual rows, you can try using a bulk insert method such as LOAD DATA INFILE or the MySQL bulk insert API. This is faster than inserting individually, especially if you can upload the data in batches rather than one row at a time.

    Use a different compression algorithm: You are currently using zlib compression, but there are other compression algorithms that may be faster or more efficient for your data. You can try trying different compression options to see if they improve upload speeds.

    Increase server resources: If you have the budget and resources, upgrading server hardware or increasing the number of servers may help increase upload speeds. This may not be a viable option for everyone, but it's worth considering if you've exhausted your other options.

    As far as your suggested option is concerned, removing the foreign key constraints may cause data integrity issues, so I don't recommend this approach. If your query is already experiencing performance issues, partitioning by date may be a good solution, but if your query is already running quickly, it may not be necessary.

    reply
    0
  • P粉098979048

    P粉0989790482023-09-10 11:03:23

    To speed up uploads, delete them. Seriously, if the only thing you're doing is getting exactly what's in a file for a certain date, why put the data into a table? (Your comment points out that a single file is actually several files. It might be a good idea to combine them first.)

    If you do need the data in the table, let's discuss these...

    • Before determining the index, we must look at all major queries.
    • The order of columns in the PK is important for both loading and querying.
    • Partitioning may help with loading, but is unlikely to help with querying. Exceptions: Do you delete "old" data?
    • Please provide Show creation table; the content you provide may miss some subtleties.
    • How is loading done? A huge Loading data? Hopefully not inserting one row at a time. I don't know how pandas works. (Nor do you know how the 99 other packages that "simplify" MySQL access work.) Please understand what it does behind the scenes. You may have to bypass Pandas to get better performance. Bulk loading is at least 10 times faster than row-by-row loading.
    • I haven't seen the need for a temporary table while loading. Maybe. If you remove the FK (as you suggested), you can execute queries to verify the existence of var_a and var_b in other tables. That's "analog FK".
    • If feasible, sort incoming data based on PK. (This may be the root cause of slow economic growth.)
    • Are there any auxiliary keys? They affect loading speed.
    • I think your FK implies indexes on other tables.
    • Are you adding new rows to other tables?
    • "rel_ab (DECIMAL)" - how many decimal places? What is the exact statement? If it's some kind of measurement, have you considered FLOAT?
    • Now there are many rows in other tables. That is, do you really need a 4-byte INT to reference them? Switch to 3 bytes MEDIUMINT [UNSIGNED] Save at least 7MB per day.
    • How do you handle the 7 million rows in that SELECT?
    • No compression. InnoDB is very inefficient. Only one of the 4 columns may be compressible. Compression requires additional buffer_pool_space. Compression uses a lot of CPU. For InnoDB, 2x shrinkage is typical.

    Multiple "identical" tables are always unwise. A table is always better. However, as suggested above, a zero table is still better.

    reply
    0
  • Cancelreply