1. Introduction
With the widespread popularity of Internet applications, the storage and access of massive data have become bottlenecks in system design. For a large-scale Internet application, billions of page views (PVs) per day place a considerable load on the database, causing serious problems for system stability and scalability. Improving website performance through data segmentation, and thereby horizontally scaling the data layer, has become the preferred approach for architects. Horizontally sharding the database reduces the load on any single machine and minimizes the damage caused by downtime: a load balancing strategy effectively spreads access across machines and lowers the probability that any one of them fails; a cluster solution removes the single point of failure that would otherwise make the database inaccessible when one server goes down; and a read/write separation strategy maximizes the speed and concurrency of read operations in the application. At present, many large-scale Chinese Internet applications, such as Taobao, Alibaba, and Tencent, use such data segmentation solutions, and most of them have implemented their own distributed data access layer (DDAL). Classified by implementation method and level, these can be roughly divided into two kinds (taking Java applications as an example): encapsulation at the JDBC layer, and implementation at the ORM framework layer. As far as direct encapsulation of the JDBC layer is concerned, one of the better-developed domestic projects is "Amoeba", developed by the research institute of the Alibaba Group; it is still in the beta testing phase, and its operational efficiency and production readiness remain to be studied.
As for implementation at the ORM framework layer, Taobao's distributed data access layer based on iBATIS and Spring has been in use for many years, and its operational efficiency and production effectiveness have been recognized by both developers and users. This article discusses a distributed data access layer built at the ORM framework layer. The difficulty of this topic lies in formulating and selecting routing rules, and in maintaining scalability after the database has been partitioned; for example, how to expand database capacity (add machine nodes) with the least amount of data migration. The core issues revolve around the routing rules and load balancing strategies for database shards and tables.
2. Basic principles and concepts
2.1 Basic principles:
The process of human cognition of a problem always follows the pattern: what? why? how? Accordingly, this article will discuss and research these three questions in turn:
2.1.1 What is data segmentation
The English word "shard" means "fragment"; as a database-related technical term it seems to have first appeared in massively multiplayer online role-playing games. Here we tentatively translate "sharding" as "data segmentation". Sharding is not a new technology but a relatively simple software concept. As is well known, MySQL only gained table partitioning in version 5; before that, many potential users were concerned about MySQL's scalability, and whether a database had partitioning became a key (though certainly not the only) indicator of its scalability. Database scalability is an eternal topic, and MySQL promoters are often asked: if data processing on a single database is stretched to its limit and needs to be partitioned, what can be done? The answer is sharding. Sharding is not a feature attached to a specific database product but an abstraction built on top of specific technical details; it is a solution for horizontal expansion (scale-out) whose main purpose is to break through the I/O capacity limit of a single-node database server and thereby solve the problem of database scalability.
Through a series of segmentation rules, the data is horizontally distributed across different DBs or tables, and the specific DB or table that a query needs is then located through the corresponding DB-routing or table-routing rules. "Sharding" as used here usually refers to "horizontal partitioning", which is also the focus of this article. What specific segmentation and routing methods will be used? At this point the reader will inevitably have questions, so let us give a simple example using the logs in a blog application. Suppose the log article table has the following fields:
article_id(int), title(varchar(128)), content(varchar(1024)), user_id(int)
Faced with such a table, how do we segment it? How do we distribute this data across tables in different databases? Analyzing the blog application, it is not hard to reach a conclusion: in a blog application there are two kinds of users, viewers and blog owners. When a viewer browses a blog, he or she is actually browsing under a specific user's blog, and the owner managing his own blog also operates under a specific user's blog (in his own space). The "specific user" is represented in the database by the field user_id, and it is this user_id that is the basis for partitioning and for the rules we need. We can proceed as follows: put all article rows with user_id between 1 and 10000 into the article table in DB1, put all article rows with user_id between 10001 and 20000 into the article table in DB2, and so on up to DBn. In this way the article data is naturally divided among the databases, achieving the goal of data segmentation. The next problem to solve is how to find the specific database. The answer is simple and obvious: since we used the distinguishing field user_id when dividing the databases, the process of database routing naturally depends on user_id as well. Consider the blog application just presented: whether accessing someone else's blog or managing my own, I always need to know whose blog it is, that is, I know its user_id; applying the same rule used when dividing the databases, I can in turn locate the specific database. For example, if user_id is 234, the rule locates DB1; if user_id is 12343, the rule locates DB2. By analogy, the sharding rules are used to route back to a specific DB, a process called "DB routing".
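The range-based routing just described can be sketched in a few lines of Java. This is a minimal illustrative example, not part of any real framework; the class and method names are hypothetical.

```java
// Hypothetical sketch of the range-based DB routing described above:
// user_id 1..10000 -> DB1, 10001..20000 -> DB2, and so on up to DBn.
public class RangeRouter {
    private static final int RANGE_SIZE = 10000; // user_ids per database

    /** Returns the 1-based index of the DB that stores this user's articles. */
    public static int route(long userId) {
        if (userId < 1) {
            throw new IllegalArgumentException("user_id must be positive");
        }
        // user_id 1..10000 falls into DB1, 10001..20000 into DB2, ...
        return (int) ((userId - 1) / RANGE_SIZE) + 1;
    }

    public static void main(String[] args) {
        System.out.println("user_id 234   -> DB" + route(234));   // DB1
        System.out.println("user_id 12343 -> DB" + route(12343)); // DB2
    }
}
```

Given the routed index, the data layer would then obtain a connection to the corresponding physical database and issue the query there.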
Of course, a DB design that takes data segmentation into account is necessarily an unconventional, unorthodox DB design. So what is the orthodox DB design?
The orthodox design is basically what we usually do: we consciously design the database according to normal form, and when the load becomes high we may consider using a Replication mechanism to improve read/write throughput and performance. This can meet many needs, but the flaws of that mechanism are fairly obvious (they are discussed below). A DB design that takes data segmentation into account will violate these usual rules and constraints: for segmentation, we have to introduce redundant fields into the database tables to serve as distinguishing fields, also called partition marker fields, such as user_id in the article example above. (Admittedly, that example does not illustrate the redundancy of user_id very well, since the user_id field would appear even without partitioning; we can regard that as a bonus.) Redundant fields do not arise only in partitioning scenarios; in many large-scale applications, redundancy is also necessary. That touches on efficient DB design in general, which this article will not discuss further.
2.1.2 Why data segmentation is needed
The above gives a brief description of what data segmentation is. The reader may wonder why it is needed: is a mature and stable database like Oracle not sufficient to support the storage and querying of massive data? Why do we need data segmentation at all? Oracle is indeed very mature and stable, but its high licensing fees and high-end hardware requirements are not something every company can afford. Just imagine license costs of tens of millions per year plus a minicomputer costing tens of millions of dollars: is that something an ordinary company can bear? And even if we could afford it, if there is a better, cheaper solution with superior horizontal scalability, why would we not choose it?
However, things are never entirely satisfactory. Normally we consciously design our database according to normal form, and when the load is high we may consider using a Replication mechanism to improve read/write throughput and performance. This can meet many needs, but its flaws are fairly obvious. First, its effectiveness depends on the proportion of read operations: the Master often becomes the bottleneck, since write operations must queue and execute sequentially, and once overloaded the Master cannot keep up. The replication delay of the Slaves may also be relatively large, and replication consumes considerable CPU capacity, because every write executed on the Master must still be replayed on each Slave machine. So why can Sharding work where Replication cannot? The reason is simple: it scales well. We know that every machine, no matter how well configured, has a physical upper limit, so when our application reaches or far exceeds some limit of a single machine, we can only seek help from other machines or keep upgrading our hardware; the common solution, however, is to scale horizontally, adding more machines to share the pressure. We also have to ask: as our business logic keeps growing, can our machines meet the demand through linear growth? Sharding can easily distribute computation, storage, and I/O across multiple machines in parallel, making full use of their combined processing capacity while avoiding single points of failure, improving system availability, and providing good error isolation.
For the above reasons, data segmentation is very necessary, and the data segmentation we discuss here also takes MySQL as its background: for cost reasons, many companies have chosen the free and open-source MySQL. As noted earlier, MySQL only gained table partitioning in version 5, partitioning support long being a key (though not the only) indicator by which users measured a database's scalability. And the standard answer to the question of what to do when application data on a single database is stretched beyond its limits is, again, Sharding, which is exactly the data segmentation scheme we are discussing.
Using free MySQL and cheap servers, or even PCs, as a cluster, we can achieve the effect of a minicomputer plus a large commercial DB while greatly reducing capital investment and operating costs. Why not? Therefore, we choose Sharding and embrace Sharding.
2.1.3 How to achieve data segmentation
Turning now to how data segmentation is done, we will elaborate on its methods and forms in more detail.
Data segmentation can be physical: the data is distributed to different DB servers through a series of segmentation rules, and access is routed to the specific database through routing rules. In this way each access is served not by a single server but by one of N servers, so the load pressure on any single machine is reduced.
Data segmentation can also occur within one database: the data is distributed into different tables of a single database through a series of segmentation rules. For example, the article table can be divided into sub-tables such as article_001, article_002, and so on, with the sub-tables combined horizontally to form a logically complete article table. The purpose of doing this is actually very simple. Suppose the article table currently holds 50 million rows and we need to insert a new row. After the insert completes, the database must update the index over this table, and the system overhead of maintaining an index across 50 million rows cannot be ignored. Conversely, if we divide this table into 100 sub-tables, article_001 through article_100, the 50 million rows are spread evenly so that each sub-table holds only 500,000 rows. When inserting into a table of only 500,000 rows, the index maintenance time drops by orders of magnitude, which greatly improves the DB's runtime efficiency and increases its concurrency. And that is not the end of the benefits of splitting tables: there are many other obvious gains, such as reduced lock contention for write operations.
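The sub-table scheme just described amounts to mapping each row to one of article_001..article_100. A minimal sketch, assuming a hypothetical resolver class (the mapping here uses user_id modulo the table count; any stable rule over the distinguishing field would do):

```java
// Hypothetical sketch of in-database table partitioning: article rows are
// spread over article_001 .. article_100, so each sub-table holds roughly
// 1/100 of the data. Class and method names are illustrative.
public class SubTableResolver {
    private static final int TABLE_COUNT = 100;

    /** Maps a user_id to the sub-table that holds its article rows. */
    public static String subTableFor(long userId) {
        // Modulo gives 0..99; sub-tables are numbered 001..100.
        int index = (int) (userId % TABLE_COUNT) + 1;
        return String.format("article_%03d", index);
    }

    public static void main(String[] args) {
        // The resolved name is then spliced into the SQL the data layer issues.
        String sql = "SELECT title FROM " + subTableFor(42) + " WHERE user_id = 42";
        System.out.println(sql);
    }
}
```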
In summary, partitioning databases reduces the load on single-point machines, while partitioning tables improves the efficiency of data operations, especially write operations. So far we still have not touched on how the actual segmentation is performed; next, we will describe and explain the segmentation rules in detail.
As mentioned above, to achieve horizontal segmentation of data, every table must contain a redundant field that serves as the basis for segmentation, a marker field. In common applications we choose user_id as the distinguishing field, and on that basis there are the following three methods and rules for dividing the databases (of course there are others):
(1) Division by number segment: with user_id as the distinguishing field, 1 to 1000 corresponds to DB1, 1001 to 2000 corresponds to DB2, and so on;
Advantages: Partial migration is possible
Disadvantages: Uneven data distribution
(2) Hash modulus:
Hash the user_id (or use the value of user_id directly if it is numeric), then take it modulo a specific number. For example, if the application needs to divide the data into 4 databases, we take the hash value of user_id modulo 4, that is, user_id % 4. Each operation then has four possible results: when the result is 1, it corresponds to DB1; when the result is 2, DB2; when the result is 3, DB3; and when the result is 0, DB4. In this way the data is distributed very evenly across the 4 DBs.
Advantages: Evenly distributed data
Disadvantages: Data migration is troublesome, and data cannot be allocated according to machine performance
(3) Saving the database configuration in an authentication database:
That is, create a separate DB that stores the mapping from user_id to DB. Every time data is accessed, this mapping database is queried first to obtain the specific DB information, and only then can the query we actually need be performed.
Advantages: Strong flexibility; a one-to-one mapping from user to DB
Disadvantages: An extra query is required before every data access, which noticeably degrades performance
The above are the three methods we typically choose during development; in some complex projects a mixture of the three may be used. Through the above description we now have a basic understanding of sharding rules. Of course there will be better and more complete ways to partition databases, and we still need to continue to explore and discover them.
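The first two rules above can be sketched behind a common routing interface. This is a minimal illustrative example assuming a hypothetical RouteRule interface; rule (3) is omitted here because it requires a query against the mapping database rather than pure computation.

```java
// Hypothetical sketch of rules (1) and (2) behind a common RouteRule interface.
public class RouteRules {
    interface RouteRule {
        /** Returns the 1-based index of the target DB for this user_id. */
        int route(long userId);
    }

    /** Rule (1): division by number segment, 1..1000 -> DB1, 1001..2000 -> DB2, ... */
    static class NumberSegmentRule implements RouteRule {
        private final int segmentSize;
        NumberSegmentRule(int segmentSize) { this.segmentSize = segmentSize; }
        public int route(long userId) {
            return (int) ((userId - 1) / segmentSize) + 1;
        }
    }

    /** Rule (2): hash modulo over a fixed number of databases. */
    static class HashModRule implements RouteRule {
        private final int dbCount;
        HashModRule(int dbCount) { this.dbCount = dbCount; }
        public int route(long userId) {
            int mod = (int) (userId % dbCount); // user_id is numeric, use it directly
            return mod == 0 ? dbCount : mod;    // result 0 corresponds to the last DB
        }
    }

    public static void main(String[] args) {
        RouteRule bySegment = new NumberSegmentRule(1000);
        RouteRule byHash = new HashModRule(4);
        System.out.println("segment: user 1500 -> DB" + bySegment.route(1500));
        System.out.println("hash:    user 7    -> DB" + byHash.route(7));
    }
}
```

Having the rules share one interface is what lets a system embed several of them, or a mixture, behind a single routing step.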
3. The basic outline of this discussion
In the text above, following the human cognitive pattern of what, why, and how, we explained some concepts and significance of database segmentation and briefly introduced some conventional segmentation rules. The distributed data layer discussed in this topic is not just that, however; it is a complete data-layer solution. What does it look like? In the following text, I will elaborate on the complete ideas and implementation of this research topic.
The functions provided by the distributed data solution are as follows:
(1) Provide sharding rules and routing rules (RouteRule, abbreviated RR); the three segmentation rules described above are embedded directly into this system, and the specific embedding method will be explained and discussed in detail below;
(2) Introduce the concept of a cluster (Group) to ensure high availability of data;
(3) Introduce a load balancing policy (LoadBalancePolicy, abbreviated LB);
(4) Introduce an availability detection mechanism for cluster nodes that regularly checks the availability of single-point machines, to guarantee the correct application of the LB policy and a high degree of system stability;
(5) Introduce read/write separation to improve data query speed.
A data-layer design with only database and table partitioning is not yet complete. What happens when the DB server on some node goes down? Since we have adopted a database segmentation scheme, meaning that N machines together form one complete DB, if one machine goes down only 1/N of the DB's data becomes inaccessible. That is acceptable; at least it is much better than the situation before splitting, where the entire DB would become inaccessible. In ordinary applications, data being unavailable after such a machine failure may be tolerable. But what if our system is a high-concurrency e-commerce website? The economic loss caused by a single node going down can be very serious. In other words, our current solution still has problems, and its fault tolerance cannot stand the test. Of course, problems always have solutions. We introduce the concept of a cluster, which I will call a Group here: we attach multiple machines to each partition node, and each of these machines holds identical data. Under normal circumstances these machines share the load, and when one goes down, the load balancer distributes the load that would have gone to the downed machine among the others. In this way the fault-tolerance problem is solved. So we introduced the concept of clusters, embedded it into our framework, and made it part of the framework.
As shown in the figure above, the entire data layer consists of three clusters, Group1, Group2, and Group3; these three clusters are the result of horizontally segmenting the data, and together they form one DB containing the complete data set. Each Group includes one Master (there can of course be multiple Masters) and N Slaves, and the data on these Masters and Slaves is consistent. For example, if one Slave in Group1 goes down, the remaining two Slaves can still be used. This model will never cause part of the data to become inaccessible unless all the machines in an entire Group go down at once, and the probability of that happening is very small (barring an event like a power outage, it is unlikely).
Before the cluster was introduced, our query process was roughly as follows: request the data layer, passing the necessary partition-distinguishing field (usually user_id); the data layer routes to the specific DB based on that field; data operations are then performed within that DB. That is the situation without clusters. What does it look like once clusters are introduced? As can be seen from Figure 1, the rules and policies of our router can actually only route to a specific Group, that is, to a virtual Group that is not a specific physical server. What needs to happen next is to find the specific physical DB server on which to perform the actual data operation. For this step we introduce the concept of a load balancer (LB). The load balancer's responsibility is to locate a specific DB server. The specific rules are as follows: the load balancer analyzes the read/write characteristics of the current SQL; if it is a write operation, or an operation requiring strong real-time consistency, it allocates the query directly to the Master; if it is a read operation, a Slave is allocated through the load balancing policy. The main research direction of our load balancer is the load distribution strategy. Load balancing usually includes random load balancing and weighted load balancing. Random load balancing is easy to understand: one Slave is randomly chosen from the N Slaves. Such random balancing does not consider machine performance; it assumes every machine performs identically. If that matches reality, there is nothing wrong with it. But what if it does not? The physical performance and configuration of each Slave machine may differ, and using random load balancing without considering performance is very unscientific.
This would impose an unnecessarily high load on machines with poor performance, even risking downtime, while high-performance database servers could not fully exploit their capacity. Based on this consideration we introduce weighted load balancing: through an interface within our system, each DB server can be assigned a weight, and at runtime the LB allocates load to each DB server in proportion to its weight within the cluster. Of course, introducing such a concept undoubtedly increases the complexity and maintenance burden of the system. There are gains and losses, and there is no escaping them.
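The weighted policy just described can be sketched as a weighted random draw. This is a minimal illustrative sketch, not the actual framework code; the class and field names are hypothetical, and the weights here would come from the configuration interface mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of weighted load balancing: each Slave carries a weight,
// and the LB picks a Slave with probability proportional to its weight.
public class WeightedLoadBalancer {
    static class SlaveNode {
        final String name;
        final int weight; // assigned via some configuration interface
        SlaveNode(String name, int weight) { this.name = name; this.weight = weight; }
    }

    private final List<SlaveNode> slaves = new ArrayList<>();
    private final Random random;
    private int totalWeight = 0;

    WeightedLoadBalancer(Random random) { this.random = random; }

    void addSlave(String name, int weight) {
        slaves.add(new SlaveNode(name, weight));
        totalWeight += weight;
    }

    /** Picks a Slave; a node with weight 2w is chosen twice as often as one with weight w. */
    String selectSlave() {
        int ticket = random.nextInt(totalWeight);
        for (SlaveNode s : slaves) {
            ticket -= s.weight;
            if (ticket < 0) return s.name;
        }
        throw new IllegalStateException("no slaves registered");
    }

    public static void main(String[] args) {
        WeightedLoadBalancer lb = new WeightedLoadBalancer(new Random());
        lb.addSlave("slave1", 1); // low-end box
        lb.addSlave("slave2", 3); // roughly three times the capacity
        System.out.println("read goes to: " + lb.selectSlave());
    }
}
```

Setting every weight to 1 degenerates to plain random load balancing, so one mechanism covers both policies.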
With partitioned databases, clusters, and a load balancer, is everything fine? Things are not as simple as we might think. Although these measures can basically ensure that the data layer withstands heavy pressure, such a design still cannot completely avoid the harm of database downtime. If slave2 in Group1 goes down, the system's LB has no way of knowing. That is actually very dangerous: unaware, the LB still considers slave2 available and continues to allocate load to it. Problems then arise, and the client naturally encounters failed data operations or exceptions, which is very unfriendly. How do we solve this? We introduce an availability detection mechanism for cluster nodes, or alternatively an availability data push mechanism. What is the difference between the two? Consider the detection mechanism first. As the name suggests, the data-layer client probes the availability of each database in the cluster from time to time. The implementation principle is trial connection, or trial access to the database port; one can also attempt a connection via JDBC and use Java's exception mechanism to judge availability. The details will be covered later. What, then, is the data push mechanism? This question is best discussed in a real application scenario. Normally, when a production database goes down, the DBA will certainly know about it, and at that point the DBA can manually push the database's current status, through a program, to the client, that is, to the application side of the distributed data layer, which then updates a local list of DB statuses and informs the LB that this database node is unusable and should receive no load. One is an active monitoring mechanism, the other a passive notification mechanism; each has its own merits, but both achieve the same effect.
In this way, the problem hypothesized above will not occur, and even if it does occur, its probability is minimized.
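The JDBC trial-connection probe mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions: the JDBC URL, host name, and credentials are placeholders, and a real system would run this periodically from a scheduler and feed the result into the LB's node list.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical sketch of the active detection mechanism: the data-layer
// client attempts a JDBC connection and uses Java's exception mechanism
// to judge the node's availability.
public class AvailabilityDetector {
    /** Returns true if a JDBC connection to the node can be opened and validated. */
    public static boolean isAvailable(String jdbcUrl, String user, String password) {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
            // isValid() issues a lightweight ping with a 2-second timeout.
            return conn.isValid(2);
        } catch (SQLException e) {
            // Any connection failure marks the node as down for the LB.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder URL and credentials for illustration only.
        String url = "jdbc:mysql://slave2.example.com:3306/blog";
        boolean up = isAvailable(url, "probe", "secret");
        System.out.println("slave2 " + (up ? "available" : "down, remove from LB"));
    }
}
```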
We have not yet gone into much depth about the Master and Slave mentioned above. As shown in Figure 1, a Group consists of one Master and N Slaves. Why is it done this way? The Master bears the load of write operations, meaning all writes are performed on the Master, while read operations are distributed to the Slaves; this can greatly improve read efficiency. In typical Internet applications, data surveys have concluded that the read/write ratio is about 10:1, meaning the bulk of data operations are concentrated on reads, which is exactly why we have multiple Slaves. But why separate reads from writes? Developers familiar with databases all know that write operations involve locking, whether row locks, table locks, or block locks, all of which reduce system execution efficiency. Our separation concentrates write operations on one node while read operations are performed on the other N nodes, which effectively improves read efficiency and ensures high system availability. Read/write separation also introduces a new problem: how is the data on the Master kept synchronized and consistent with the other Slave machines in the cluster? This is a problem we do not need to pay much attentionment to; MySQL's replication mechanism can do this for us, and since that mechanism is not very relevant to this topic, we will not introduce it in detail here.
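The read/write analysis step that directs writes to the Master and reads to a Slave can be sketched with a simple keyword check. This is an illustrative simplification, not a full SQL parser, and the class name is hypothetical; a real system would also honor the "strong real-time" hint mentioned earlier, forcing certain reads to the Master.

```java
import java.util.Locale;

// Hypothetical sketch of how the LB might analyze the read/write
// characteristics of a SQL statement: writes go to the Master,
// plain reads to a Slave.
public class ReadWriteSplitter {
    /** Returns true if the statement must run on the Master. */
    public static boolean needsMaster(String sql) {
        String head = sql.trim().toLowerCase(Locale.ROOT);
        // Anything other than a plain SELECT is treated as a write.
        return !head.startsWith("select");
    }

    public static void main(String[] args) {
        String read  = "SELECT title FROM article WHERE user_id = 234";
        String write = "INSERT INTO article (title, user_id) VALUES ('hi', 234)";
        System.out.println(read  + " -> " + (needsMaster(read)  ? "Master" : "Slave"));
        System.out.println(write + " -> " + (needsMaster(write) ? "Master" : "Slave"));
    }
}
```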
To sum up, these are the general functions of the distributed data layer studied in this topic.