An ID, as the unique identifier of a business entity, appears throughout data design, for example:
•Product - product_id
•Order - order_id
•Message - message_id
These identifiers are usually the primary keys in the database. MySQL (InnoDB) builds a clustered index on the primary key, and its leaf nodes hold the row data directly. An ordinary (secondary) index points to the clustered index instead, so a lookup by primary key saves one index query and is very fast. Businesses such as messages and orders generally need to query data in reverse chronological order. One way is to create an index on the time column; an even better way is to rely on the insertion order of the ID itself. Therefore, a distributed ID needs to meet two core conditions:
• Globally unique
• Trend-ordered by time (roughly increasing)
Some people may ask: wouldn’t it be enough to just use MySQL’s auto_increment directly? In the early days of a startup, I would choose this solution too. It is simple, efficient and fast. Startups have to iterate quickly and ship as soon as possible, and products change frequently; an impressive architecture that takes too long to build may never be used, and valuable time is wasted. However, this solution has some problems:
• Affects parallel insertion - if record B depends on the primary key of record A, you must wait until record A is inserted successfully and read back A.id before you can insert record B
• Data recovery is difficult - after data is accidentally deleted or lost, the log contains no IDs, so the relationships between records cannot be determined directly
• Affects database and table sharding - since the ID is not known until the row is inserted, data cannot be routed to a database or table based on the business primary key
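The first and third problems can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only to keep the example self-contained); the table names and the `shard_for` routing rule are hypothetical.

```python
import sqlite3

# SQLite in memory stands in for MySQL here; AUTOINCREMENT plays the role of auto_increment.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY AUTOINCREMENT, note TEXT)")
conn.execute(
    "CREATE TABLE order_items (item_id INTEGER PRIMARY KEY AUTOINCREMENT, order_id INTEGER, sku TEXT)"
)

# Record A must be inserted first, because its ID does not exist until then.
cur = conn.execute("INSERT INTO orders (note) VALUES (?)", ("first order",))
order_id = cur.lastrowid  # only now is A.id known

# Record B can be written only after A.id has been read back.
conn.execute("INSERT INTO order_items (order_id, sku) VALUES (?, ?)", (order_id, "sku-1"))


def shard_for(known_id: int, shard_count: int = 4) -> int:
    """Hypothetical routing rule: it needs the ID before the row is written."""
    return known_id % shard_count
```

With a database-generated ID, `shard_for` cannot be called until the row already lives somewhere, which is why an ID generated ahead of the insert is easier to shard on.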
Therefore, after the business is stable, you must take time to pay off early technical debt.
Common solutions
Use the auto_increment of the database to generate a unique ID
Advantages
•Simple, uses existing database features, little development effort
•Fixed ID step size
Disadvantages
•Single point of writing, not highly available
• Even if you scale out to multiple master databases with different auto_increment starting points, availability improves but the strict order of IDs can no longer be guaranteed
• Every ID requires a database access, so it is easy to hit the performance ceiling
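The loss of strict ordering with multiple masters can be seen in a small simulation. This is only a sketch: the two "masters" are plain Python iterators, and the offset/increment pair mirrors what MySQL exposes as `auto_increment_offset` and `auto_increment_increment`; the traffic pattern is made up.

```python
from itertools import count, islice


def master_ids(offset: int, increment: int):
    """IDs issued by one master: offset, offset + increment, offset + 2 * increment, ..."""
    return count(start=offset, step=increment)


# Two hypothetical masters sharing the ID space: one issues odd IDs, the other even IDs.
master_a = master_ids(offset=1, increment=2)
master_b = master_ids(offset=2, increment=2)

# Master A happens to serve more writes than master B, so arrival order != numeric order.
issued = list(islice(master_a, 5)) + list(islice(master_b, 2)) + list(islice(master_a, 1))
print(issued)  # [1, 3, 5, 7, 9, 2, 4, 11]
```

Every ID is still unique, but an ID issued later (2) can be smaller than one issued earlier (9).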
Pull IDs in batches, allocate one by one
This solution also stores the ID state in the database. Each time, the ID service pulls N IDs from the database and updates the current maximum ID to the previous value + N. Whenever the ID service receives a request to generate an ID, it returns these N IDs one at a time, in sequence.
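A minimal sketch of this batch scheme follows. The database is replaced by an in-memory `FakeIdStore` (an assumption; a real service would update a row in the database atomically), and both class names are hypothetical.

```python
import threading


class FakeIdStore:
    """In-memory stand-in for the database row that holds the current maximum ID."""

    def __init__(self, start: int = 0):
        self.max_id = start
        self._lock = threading.Lock()

    def reserve(self, n: int) -> int:
        """Atomically advance max_id by n and return the first ID of the new batch."""
        with self._lock:
            first = self.max_id + 1
            self.max_id += n
            return first


class BatchIdService:
    """Hands out IDs one by one, refilling from the store N at a time."""

    def __init__(self, store: FakeIdStore, batch_size: int):
        self._store = store
        self._batch_size = batch_size
        self._next = 0
        self._end = -1  # last ID of the current batch; starts out empty
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            if self._next > self._end:  # batch used up: pull another N IDs
                self._next = self._store.reserve(self._batch_size)
                self._end = self._next + self._batch_size - 1
            value = self._next
            self._next += 1
            return value


store = FakeIdStore()
service = BatchIdService(store, batch_size=5)
ids = [service.next_id() for _ in range(7)]  # 1..7, using two batches
stored_max = store.max_id                    # 10: the store was touched only twice

# A "restart" discards the unused part of the batch (8, 9, 10).
restarted = BatchIdService(store, batch_size=5)
first_after_restart = restarted.next_id()    # 11
```

The store is accessed once per batch rather than once per ID, and the restart at the end shows where gaps in the sequence come from.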
Advantages
•Batch acquisition, no need to access the database every time, low database pressure
Disadvantages
•The entire service is still a single point
•A service crash or restart leaves gaps in the ID sequence
•Cannot be scaled horizontally
Improvements
Add a set of backup services. When the main service fails, traffic fails over to the backup service; this can be done with a VIP + keepalived or by adding a proxy.
UUID
Advantages
•Locally generated ID, no single point problem, no performance bottleneck
Disadvantages
•Cannot guarantee incremental order
•Too long, and performs poorly as a primary key
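Both disadvantages are easy to observe with Python's standard `uuid` module. The snippet below is only an illustration using random (version 4) UUIDs; the variable names are arbitrary.

```python
import uuid

# Generated locally, with no coordination between machines.
ids = [uuid.uuid4() for _ in range(5)]

text_length = len(str(ids[0]))      # 36 characters in the usual hyphenated form
bit_length = len(ids[0].bytes) * 8  # 128 bits, twice the size of a 64-bit integer key

# Random UUIDs carry no time information, so generation order is not sort order.
for u in ids:
    print(u)
```

Printing a few values shows that consecutive UUIDs land at unrelated positions in the key space, which is what makes them a poor fit for a clustered primary key.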
Snowflake-like algorithm
Snowflake is Twitter's open source distributed ID generation algorithm. Its core idea is to pack a 64-bit long ID from 41 bits of millisecond timestamp, 10 bits of machine number, and 12 bits of sequence number within the millisecond (the remaining sign bit is unused). In theory this allows up to 1000*(2^12) = 4,096,000, or about 4 million, IDs per second on a single machine, which is more than enough for most business needs.
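The capacity figures follow directly from the bit widths. The short calculation below works them out; the constant names are illustrative and the year length is taken as 365 days.

```python
TIMESTAMP_BITS = 41
WORKER_BITS = 10
SEQUENCE_BITS = 12

ids_per_ms = 2 ** SEQUENCE_BITS     # 4096 sequence numbers per millisecond
ids_per_second = 1000 * ids_per_ms  # 4_096_000 per machine per second
max_workers = 2 ** WORKER_BITS      # 1024 distinct machine numbers

ms_per_year = 1000 * 60 * 60 * 24 * 365
years_of_timestamps = 2 ** TIMESTAMP_BITS / ms_per_year  # a little under 70 years

total_bits = 1 + TIMESTAMP_BITS + WORKER_BITS + SEQUENCE_BITS  # 64, including the sign bit
```

So the 41-bit timestamp lasts roughly 69 years from whatever epoch is chosen, and the layout fills a signed 64-bit integer exactly.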
Borrowing Snowflake's ideas and adapting them to your company's business logic and concurrency level, you can implement your own distributed ID generation algorithm.
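As a starting point, here is a minimal Snowflake-like generator using the 41/10/12 layout described above. It is a sketch, not Twitter's implementation: the custom epoch, the class name, the injectable `clock` parameter, and the choice to raise an error when the clock moves backwards are all assumptions.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01T00:00:00Z, an arbitrary custom epoch (assumption)
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER_ID = (1 << WORKER_BITS) - 1
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1


class SnowflakeLikeGenerator:
    def __init__(self, worker_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id <= MAX_WORKER_ID:
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock = clock  # returns the current time in milliseconds
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards; refusing to generate IDs")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & SEQUENCE_MASK
                if self._sequence == 0:
                    # All 4096 sequence numbers used: spin until the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            # Timestamp in the high bits, then machine number, then sequence number.
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )


generator = SnowflakeLikeGenerator(worker_id=7)
ids = [generator.next_id() for _ in range(1000)]
```

On a single generator the IDs come out strictly increasing, because the timestamp occupies the high bits and the sequence number breaks ties within a millisecond.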
Advantages
•The timestamp occupies the high bits, so IDs trend upward over time
•Simple to implement, does not rely on other services, easy to expand
Disadvantages
•There is no global clock, so IDs are strictly ordered on a single machine but only trend-ordered across the whole cluster
Notes
•Since the ID is often used as the sharding key for databases and tables, IDs need a certain degree of randomness so that data is not distributed unevenly after sharding. For example, instead of starting from the same value at the beginning of every millisecond, the sequence number can start from a random value between 0 and 9.
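The effect of a random starting sequence number can be checked with a small simulation. Everything here is an assumption made for illustration: sharding by `id % 16`, one ID per millisecond (low traffic), a fixed machine number, and a seeded random generator so that the run is repeatable.

```python
import random
from collections import Counter

WORKER_BITS = 10
SEQUENCE_BITS = 12
SHARD_COUNT = 16  # hypothetical: shard = id % 16


def make_id(ms: int, worker_id: int, sequence: int) -> int:
    """Pack timestamp, machine number and sequence number in the Snowflake layout."""
    return (ms << (WORKER_BITS + SEQUENCE_BITS)) | (worker_id << SEQUENCE_BITS) | sequence


def shard_histogram(randomize_start: bool, samples: int = 10_000) -> Counter:
    rng = random.Random(42)  # seeded so the result is repeatable
    hist = Counter()
    for ms in range(samples):  # low traffic: only the first ID of each millisecond
        start = rng.randrange(10) if randomize_start else 0
        hist[make_id(ms, worker_id=3, sequence=start) % SHARD_COUNT] += 1
    return hist


fixed = shard_histogram(randomize_start=False)
randomized = shard_histogram(randomize_start=True)
print(sorted(fixed))       # [0]: every ID lands on the same shard
print(sorted(randomized))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With a fixed start, the low bits of every ID are identical and all rows fall on one shard. With a random start in 0-9 the rows spread over shards 0-9; in this particular setup shards 10-15 still receive nothing, so the range of the random start is worth tuning against the shard count.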