Table splitting in MySQL is suitable in the following situations: 1. The amount of data is so large that routine operation and maintenance affects business access: for example, backing up the database requires a large amount of disk I/O and network I/O, a DDL change can lock the entire table, and accessing and updating a large table causes lock waits; 2. As the business develops, some fields need to be split vertically; 3. The amount of data in a single table grows rapidly, and when performance approaches the bottleneck, horizontal splitting needs to be considered.
The operating environment of this tutorial: Windows 7, MySQL 8, Dell G3 computer.
Not every table needs to be split; it mainly depends on how fast the data grows. Splitting increases business complexity to some extent. Besides storing and querying data, one of the database's important tasks is to help the business meet its needs.
Do not resort to the heavy machinery of sharding databases and tables unless it is truly necessary, so as to avoid "over-design" and "premature optimization". Do not split just for the sake of splitting. First exhaust the simpler options, such as upgrading hardware, upgrading the network, separating reads and writes, and optimizing indexes. Only when the amount of data reaches the bottleneck of a single table should you consider sharding databases and tables.
So when should table splitting be considered in MySQL?
1. The amount of data is too large and normal operation and maintenance will affect business access
The operation and maintenance mentioned here refers to:
Database backup: if a single table is too large, the backup requires a large amount of disk I/O and network I/O. For example, transferring 1 TB of data over a network at 50 MB/s takes about 20,000 seconds (roughly 5.6 hours), and the whole process carries a relatively high risk.
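The backup estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a sustained throughput of 50 MB/s; the function name transfer_seconds is hypothetical.

```python
MB = 10**6   # assumption: decimal megabyte
TB = 10**12  # assumption: decimal terabyte


def transfer_seconds(size_bytes: int, throughput_bytes_per_s: int) -> float:
    """Time needed to move size_bytes at a constant throughput."""
    return size_bytes / throughput_bytes_per_s


seconds = transfer_seconds(1 * TB, 50 * MB)
hours = seconds / 3600
print(f"{seconds:.0f} s, about {hours:.1f} h")  # 20000 s, about 5.6 h
```

Halving the table size halves this window, which is one reason smaller tables are easier to back up safely.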
DDL changes: when DDL modifications are made to a large table, MySQL can lock the entire table for a long time. During this period the business cannot access the table, which has a large impact. If you use pt-online-schema-change, it creates triggers and a shadow table, which also takes a long time, and the whole duration of the operation counts as risk time. Splitting the table and reducing the total amount of data helps reduce this risk.
Lock waits: large tables are accessed and updated frequently, so lock waits are more likely to occur. Splitting the data trades space for time and indirectly reduces access pressure.
2. As the business develops, some fields need to be split vertically
For example, suppose the user table designed at the beginning of the project contains the fields id, name, last_login_time, and personal_info.
In the initial stage of the project, this design meets simple business needs and is convenient for rapid iterative development. When the business grows quickly, the number of users surges from 100,000 to 1 billion, and users are very active. The last_login_time field is updated every time a user logs in, so the user table is updated constantly and comes under heavy pressure. The other fields (id, name, personal_info) are unchanged or rarely updated. From a business perspective, it makes sense to split out last_login_time into a new user_time table.
The personal_info attribute is updated and queried less frequently, and as a text field it takes up a lot of space. In this case it should be split out vertically into a user_ext table.
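The vertical split described above can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a runnable stand-in for MySQL, the column types are assumptions, and the name user_core for the slimmed-down table is hypothetical (user_time and user_ext come from the text).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Original wide table with the four fields named in the text.
    CREATE TABLE "user" (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        last_login_time TEXT,
        personal_info TEXT
    );
    INSERT INTO "user" VALUES (1, 'alice', '2024-01-01 10:00:00', 'long profile text');

    -- The hot column goes to user_time; the bulky, rarely read column goes to user_ext.
    CREATE TABLE user_time (id INTEGER PRIMARY KEY, last_login_time TEXT);
    CREATE TABLE user_ext  (id INTEGER PRIMARY KEY, personal_info TEXT);
    INSERT INTO user_time SELECT id, last_login_time FROM "user";
    INSERT INTO user_ext  SELECT id, personal_info FROM "user";

    -- The remaining stable columns (hypothetical table name user_core).
    CREATE TABLE user_core AS SELECT id, name FROM "user";
""")

# A login now updates only the narrow user_time table.
conn.execute(
    "UPDATE user_time SET last_login_time = ? WHERE id = ?",
    ("2024-02-02 09:30:00", 1),
)

# The full record is still available by joining on id when needed.
row = conn.execute("""
    SELECT c.id, c.name, t.last_login_time, e.personal_info
    FROM user_core c
    JOIN user_time t ON t.id = c.id
    JOIN user_ext  e ON e.id = c.id
    WHERE c.id = 1
""").fetchone()
print(row)
```

The trade-off is that reading a complete user record now requires a join (or several lookups), in exchange for frequent writes touching a much smaller table.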
3. The amount of data grows rapidly
With the rapid development of the business, the amount of data in a single table keeps growing. When performance approaches the bottleneck, you need to consider horizontal splitting and shard the data into separate databases and tables. At that point you must choose appropriate splitting rules and estimate the data capacity in advance.
Business case analysis
1. User Center business scenario
The user center is a very common business, which mainly provides user registration, login, query/modification and other functions. Its core table is:
Any architecture design that is divorced from the business is irresponsible. Before sharding databases and tables, the business scenario requirements need to be sorted out:
User side: front-end access with high traffic, where high availability and high consistency must be ensured. There are two main types of requirements:
User login: query user information through login_name/phone/email, 1% of requests belong to this type
User information query: After logging in, query user information through uid, 99% of requests are of this type
2. Horizontal segmentation method
When the amount of data becomes larger and larger, the database needs to be split horizontally. The splitting methods include "based on numerical range" and "based on numerical modulo".
"Based on numerical range": use the primary key uid as the basis for division, and split the data horizontally into multiple databases according to the range of uid. For example, user-db1 stores data with uid from 0 to 10 million, and user-db2 stores data with uid from 10 million to 20 million.
The advantage is: expansion is simple. If the capacity is not enough, just add a new db.
The disadvantage is: the request volume is uneven. Newly registered users are generally more active, so the newer user-db2 carries a higher load than user-db1, and server utilization is unbalanced.
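Range-based routing as described above can be sketched in a few lines. This assumes half-open ranges (uid 0 to 9,999,999 goes to user-db1, since the text does not say which database owns the boundary value); the function name route_by_range is hypothetical.

```python
RANGE_SIZE = 10_000_000  # 10 million uids per database, as in the example


def route_by_range(uid: int) -> str:
    """Return the database name for a uid under range-based splitting."""
    if uid < 0:
        raise ValueError("uid must be non-negative")
    # Integer division picks the range; databases are numbered from 1.
    return f"user-db{uid // RANGE_SIZE + 1}"


print(route_by_range(0))           # user-db1
print(route_by_range(9_999_999))   # user-db1
print(route_by_range(10_000_000))  # user-db2
print(route_by_range(25_000_000))  # user-db3
```

The last call shows why expansion is simple: a uid beyond the existing ranges maps to a new database without moving any existing rows.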
"Based on numerical modulo": the primary key uid is again used as the basis for division, and the data is split horizontally into multiple databases according to uid modulo the number of databases. For example, with two databases, user-db1 stores rows where uid % 2 is 1, and user-db2 stores rows where uid % 2 is 0.
The advantage is: the data volume and request volume are evenly distributed.
The disadvantage is: expansion is troublesome. When capacity runs out and a new db is added, the data must be rehashed, and smooth migration of the data needs to be considered.
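The cost of rehashing can be made concrete with a small simulation. This is a sketch under assumptions: remainder r maps to user-db{r}, with remainder 0 mapped to the last database (which matches the two-database example), and uids 1 to 6000 are an arbitrary test population. The function name route_by_mod is hypothetical.

```python
def route_by_mod(uid: int, db_count: int) -> str:
    """Return the database name for a uid under modulo-based splitting."""
    remainder = uid % db_count
    # Remainder 0 goes to the last database, so with 2 databases:
    # uid % 2 == 1 -> user-db1, uid % 2 == 0 -> user-db2.
    index = remainder if remainder != 0 else db_count
    return f"user-db{index}"


uids = range(1, 6001)

# Count the rows that land in a different database after growing from 2 to 3.
moved = sum(1 for uid in uids if route_by_mod(uid, 2) != route_by_mod(uid, 3))
print(f"{moved} of {len(uids)} rows must move")  # 4000 of 6000 rows must move
```

In this setup two thirds of the rows change databases when a third database is added, which is why the migration has to be planned rather than done in place.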
Non-uid query method
After horizontal splitting, queries by uid are served well, because they can be routed directly to the specific database. For queries on non-uid attributes, such as login_name, it is not known which database to access. In that case all databases have to be traversed, and performance drops considerably.
For the user side, the solution of "establishing a mapping relationship from non-uid attributes to uid" can be adopted; for the operation side, the solution of "separating the front-end and back-end" can be adopted.
1. Establish a mapping relationship from non-uid attributes to uid
Mapping relationship
For example, login_name cannot be used directly to locate the database. A mapping of login_name → uid can be established and stored in an index table or a cache. When a request arrives by login_name, first look up the uid corresponding to login_name in the mapping table, and then locate the specific database through the uid.
The mapping table has only two columns and can hold a lot of data. When the amount of data becomes too large, the mapping table can also be split horizontally. This kind of key-value index structure can use a cache to optimize query performance, and because the mapping rarely changes, the cache hit rate will be very high.
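The lookup path described above (cache first, then mapping table, then route by uid) can be sketched like this. Everything here is a stand-in: the dictionaries replace a real index table and cache, the sample data is invented for illustration, and routing by uid % 8 to databases named user-db0 through user-db7 is an assumption.

```python
from typing import Dict, Optional

DB_COUNT = 8  # assumed number of databases

# Hypothetical in-memory stand-ins for the index table and the cache.
mapping_table: Dict[str, int] = {"alice": 10001, "bob": 10002}
cache: Dict[str, int] = {}
table_lookups = 0  # counts how often the mapping table itself is queried


def uid_for_login(login_name: str) -> Optional[int]:
    """Resolve login_name to uid, consulting the cache before the table."""
    global table_lookups
    if login_name in cache:
        return cache[login_name]
    table_lookups += 1
    uid = mapping_table.get(login_name)
    if uid is not None:
        cache[login_name] = uid  # the mapping rarely changes, so cache it
    return uid


def locate_db(login_name: str) -> Optional[str]:
    """Two-step routing: login_name -> uid -> database."""
    uid = uid_for_login(login_name)
    if uid is None:
        return None
    return f"user-db{uid % DB_COUNT}"


print(locate_db("alice"))  # user-db1
print(locate_db("alice"))  # user-db1, served from the cache this time
print(table_lookups)       # 1
```

The cost of this approach is one extra lookup per non-uid query, which the cache largely hides once the mapping is warm.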
Gene method
Sharding gene: if the data is sharded by uid into 8 databases and routed using uid % 8, then the last 3 bits of uid determine which database a given row of User data falls into. These 3 bits can be regarded as the sharding gene.
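The claim that uid % 8 depends only on the last 3 binary bits of uid can be checked directly. This is a minimal sketch; the names GENE_BITS, GENE_MASK, and shard_gene are hypothetical.

```python
DB_COUNT = 8                      # 8 databases, routed by uid % 8
GENE_BITS = 3                     # 2**3 == 8
GENE_MASK = (1 << GENE_BITS) - 1  # 0b111


def shard_gene(uid: int) -> int:
    """Extract the lowest 3 bits of uid, which decide the database."""
    return uid & GENE_MASK


for uid in (5, 8, 1234567):
    # Masking the low 3 bits gives the same result as uid % 8.
    assert shard_gene(uid) == uid % DB_COUNT
    print(uid, bin(uid), shard_gene(uid))
```

This equivalence holds because the number of databases is a power of two; with a database count such as 6 or 10, the remainder would not correspond to a fixed group of bits.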
2. Separation of front-end and back-end
On the user side, the main requirement is single-row queries. Establishing a mapping from login_name/phone/email to uid solves the query problem for these fields.
On the operation side, there are many queries with batch paging and various conditions. Such queries require a large amount of computation, return a large amount of data, and consume significant database resources. If they share the same services or databases with the user side, a small number of back-end requests may occupy a large share of database resources, causing degraded performance or timeouts for user-side access.
For this type of business it is best to adopt the "separation of front-end and back-end" solution. The back-end business on the operation side is extracted into independent services and databases to remove the coupling with the front-end business system. Since the operation side does not have high requirements for availability and consistency, it does not need to access the real-time database. Instead, data can be synchronized asynchronously to the operation database through binlog. When the amount of data is large, the ES search engine or Hive can also be used to serve the complex queries of the back end.