
Optimization experience of a production accident

PHPz
2017-03-12 16:24:13

After a routine promotional event, customer service started relaying complaints one after another: users could not open the web page or the app when bidding opened, and by the time it loaded the bids had already been snatched up. At first we did not pay much attention. Isn't that simply what rushing for bids is like, just as with rushing to buy a Xiaomi phone? As the event went on, more users protested strongly. Users holding interest-rate coupons or cash coupons could not get a bid, and they believed the platform was cheating by deliberately preventing the coupons from being used in order to save money.

Analysis process

In fact, similar complaints had been arriving steadily for some time, and we had been brushing them off with the Xiaomi phone comparison. This time the reaction was too strong to ignore, so we started to take it seriously. We have three front-end products: the app, the official website, and H5. The app is used the most and the official website second. H5 sees little traffic day to day, but its traffic rises sharply during events (events are mostly H5 games, and H5 is also convenient for promotion and marketing). Each of the three front-end products is load-balanced by LVS onto two back-end web service servers (as shown below). This time the complaints came mainly from the website and the app, so we focused on the four servers behind those two.

[Figure: the app, official website, and H5 load-balanced through LVS onto their back-end web servers]

First, I suspected that the network bandwidth was saturated and asked a network engineer to monitor it with tools. During bidding, peak bandwidth usage was only about 70%, so that was ruled out. Next I suspected that the web servers could no longer keep up. Checking the two official-website servers with the top command showed the load soaring to about 6-8 at the moment bidding opened and slowly returning to normal afterwards; the two app servers peaked at 10-12 and then returned to normal.

Tracing the web servers' business logs showed the database update layer reporting that no new database connection could be obtained, or that the connections had been used up. We assumed MySQL's maximum number of connections was set too low and raised it to 3 times the previous value. Watching the business logs during the next round of bidding, the database connection errors were gone, but many users still reported that pages would not open during bidding.

We kept tracing the web servers. During bidding, the command (ps -ef|grep httpd|wc -l) showed about 1,000 httpd processes, and a check of the Apache configuration file showed the maximum number of connections set to 1024 (Apache's default is 256). So the number of connections during bidding had reached the configured maximum, and many users could not get an HTTP connection, which is why pages became unresponsive or the app kept waiting. We raised the maximum number of connections in the Apache configuration file to 1024*3.
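The check described above can be sketched in Python. This is only an illustration: the function names and the 95% saturation threshold are assumptions made for the example, not part of the original tooling. It counts httpd lines in `ps -ef` output the same way the shell pipeline does (which also counts the grep process itself) and compares the count with the configured limit.

```python
def count_httpd_processes(ps_output: str) -> int:
    """Count lines mentioning httpd, like `ps -ef | grep httpd | wc -l`."""
    return sum(1 for line in ps_output.splitlines() if "httpd" in line)


def is_saturated(in_use: int, max_connections: int, threshold: float = 0.95) -> bool:
    """Return True when usage is at or above `threshold` of the configured limit.

    The 0.95 default is an assumption chosen for illustration.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return in_use >= threshold * max_connections


if __name__ == "__main__":
    sample = "\n".join(
        [
            "root      1201     1  0 10:00 ?  00:00:01 /usr/sbin/httpd",
            "apache    1202  1201  0 10:00 ?  00:00:00 /usr/sbin/httpd",
            "root      1300     1  0 10:00 ?  00:00:00 /usr/sbin/sshd",
        ]
    )
    print(count_httpd_processes(sample))
    print(is_saturated(1000, 1024))
```

With the figures from this incident, about 1,000 processes against a limit of 1024 counts as saturated under this threshold, which matches the symptom of users waiting for a connection.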

During the next rounds of bidding, the number of Apache connections still soared to 2600-2800. According to customer service, many users were still reporting bidding problems, though things were slightly better than before. There were also sporadic reports from users who had successfully grabbed a bid only to have it rolled back. We then turned to the database servers and used the top command and MySQL Workbench to look at the load on the MySQL master and slave. The result was a shock (as shown below): every indicator on the master had hit its peak, while the slave was under almost no pressure.

[Figure: load on the MySQL master and slave during bidding]

Tracing the code showed that all business code in the three front ends connected to the master, and only the back-office query functions used the slave, so we started refactoring immediately. Apart from the queries made during bidding itself, all queries on other pages and in other business functions were switched to the slave. After the change, the pressure on the master dropped significantly and the pressure on the slave began to rise, as shown below:

[Figure: load on the MySQL master and slave after moving queries to the slave]
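The routing rule described above (writes and bidding-time queries go to the master, all other queries go to a slave) can be sketched as follows. The class name `QueryRouter`, the `in_bidding` flag, and the host labels are hypothetical and exist only for this example; the article does not show the platform's data-access code. Keeping bidding-time reads on the master is presumably meant to avoid reading stale data from a lagging slave, though the article does not state the reason.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class QueryRouter:
    """Pick a database host for a statement (illustrative sketch only)."""

    master: str
    slaves: list[str]
    _cycle: itertools.cycle = field(init=False, repr=False)

    def __post_init__(self) -> None:
        if not self.slaves:
            raise ValueError("at least one slave is required")
        # Round-robin over the slaves so reads are spread evenly.
        self._cycle = itertools.cycle(self.slaves)

    def route(self, sql: str, in_bidding: bool = False) -> str:
        is_read = sql.lstrip().lower().startswith("select")
        # Writes, and any query issued as part of bidding, stay on the master.
        if not is_read or in_bidding:
            return self.master
        return next(self._cycle)


if __name__ == "__main__":
    router = QueryRouter(master="master", slaves=["slave-1"])
    print(router.route("SELECT * FROM bid_list"))
    print(router.route("UPDATE account SET balance = balance - 100"))
    print(router.route("SELECT remaining FROM bid", in_bidding=True))
```

The same router accepts several slaves, which fits the later move from one slave to several.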

According to customer service, after the change the problem of bids being rolled back almost disappeared. The problem of pages failing to open or opening slowly during bidding eased somewhat, but some users still reported it. From the analysis above we concluded:

  • 1 The two load-balanced servers have reached their processing limit, and more servers need to be added behind the load balancer.

  • 2 The pressure on the MySQL master has dropped significantly, but the pressure on the slave has increased. The current one master, one slave setup needs to become one master with multiple slaves.

  • 3 To solve these problems completely, the platform needs to be optimized as a whole, for example: business optimization (removing hotspots in the business), more caching, static pages (the front-end optimization rules from Yahoo and Google apply here, and there are many test websites on the Internet for evaluation), and so on.

Based on this, I wrote an optimization report, shown below:

Optimization Report

1 Background

As the company's business has grown, business volume and user numbers have surged. Official-website PV has risen from the initial xxx-xxx to the current xxx-xxxx, and the number of active app users has increased significantly. This poses a greater challenge to the platform's current technical architecture. In particular, with bids in short supply recently, bids fill up in less and less time and the pressure on the servers keeps growing. The current system architecture therefore needs to be upgraded to support larger user numbers and business volume.

2 User access diagram

[Figure: user access diagram]

The platform currently has three user-facing products: the official website, the app, and the small web page (H5). Of these, the official website and the app are under the most pressure.

3 Existing problems

The problems users encounter when competing for bids are concentrated in the following areas:
1. The web page or app cannot be opened.
2. The website or app is slow to open.
3. During bidding, after the transfer succeeded, the update failed because of heavy server pressure, and the money was refunded.
4. Database connections were exhausted, so investment records could not be added after the bid was full, and the bid's progress was rolled back.

4 Analysis

Through in-depth analysis of recent server parameters, concurrency, and system logs, we concluded:
1. The servers for the official website and the app are under huge pressure during bidding, and the app's problem is the more prominent. At peak bidding time, a single app server has reached nearly 2600 Apache connections, which is close to Apache's maximum processing capacity.
2. The database server is under huge pressure. The pressure is most pronounced in two situations:

1) When the platform runs an event, visits to the official website, small web page, and app increase dramatically, and query volume rises with them. Once the database reaches its processing limit, problems such as slow page loads appear.

2) When users compete for bids. The pressure here comes in two stages: before bidding and during bidding. Before bidding, because bids fill up so quickly, users open the bidding page in advance and refresh it continuously, which increases query pressure on the database. If enough users are competing, the database connections are used up before bidding even starts. During bidding, a single purchase changes and queries about 15 tables. Each bid is for 10 million, and about 100-200 people buy into it before it is full. Taking the midpoint of 150 people, that is roughly 150 × 15 = 2,250 table updates, so the data has to be updated 2,000-3,000 times within a few seconds (updates only, not counting queries). This creates heavy concurrency, which can cause update failures or connection timeouts and so affects both users' bidding and the system's normal handling of a full bid.

5 Solution

1. Web server solution

[Figure: schematic diagram of a single user accessing web services]

The current website and the platform app each use two servers for load balancing, and each server has Apache installed for server-side processing. Each Apache can handle a maximum of about 2,000 connections, so in theory the website or the app can currently handle a little over 4,000 user requests. Supporting 10,000 simultaneous requests would take 5 Apache servers, so the platform currently lacks 6 web servers.

[Figure: access diagram after upgrading the servers]

2. Database solution

[Figure: current database deployment plan]

1) Master-slave separation takes about 80% of the query pressure off the master. At present, the platform's official website and app both connect to the MySQL master, which multiplies the pressure on it. Migrating all queries in the services to the slave can greatly reduce the pressure on the master.

2) Add cache servers. When queries on the slave reach their peak, master-slave synchronization is also affected, which in turn affects transactions. Queries that users run frequently should therefore be cached to reduce the request pressure on the database. Three cache servers need to be added to build a Redis cluster.
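The caching idea in item 2) can be illustrated with a small cache-aside sketch. It uses an in-memory dictionary as a stand-in for the Redis cluster so that it runs on its own. The class name `TTLCache`, the `loader` callback, and the time-to-live values are assumptions for the example, not the platform's actual design.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache-aside with expiry; a dict stands in for Redis in this sketch."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry time, cached value)
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is not touched
        value = loader()  # cache miss or expired: query the database once
        self._store[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    db_queries = 0

    def load_bid_list() -> list[str]:
        global db_queries
        db_queries += 1  # stands in for a real database query
        return ["bid-1", "bid-2"]

    cache = TTLCache(ttl_seconds=5.0)
    for _ in range(1000):  # many users refreshing the same page
        cache.get_or_load("bid_list", load_bid_list)
    print(db_queries)
```

A short time-to-live fits pages that users refresh continuously before bidding starts: many refreshes within the window cost one database query, at the price of data that may be a few seconds old.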

3. Other optimizations
1) Make the official website's homepage static. According to cnzz statistics, the homepage accounts for about 15% of total visits to the website. Rendering the homepage data that does not change often as static content will make the site open more smoothly.

2) Optimize the Apache servers: enable gzip compression, configure a reasonable number of connections, and so on.

3) Remove the update hotspot in the investment process: the bid progress table. Every time a bid attempt succeeds or fails, the bid progress table has to be updated, and concurrent updates can run into problems such as optimistic-lock conflicts. Drop the updates during the process and write the progress information to the bid progress table only once the bid is full, which reduces the pressure on the database during investment.
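The hotspot removal in item 3) can be sketched as follows. This is a single-threaded, in-memory illustration with hypothetical names (`Bid`, `invest`, `progress_writes`); it only shows how the number of writes to the shared progress row changes, and it does not model the concurrency control a real system would still need when checking the remaining amount.

```python
from dataclasses import dataclass, field


@dataclass
class Bid:
    """One bid with append-only investment records (illustrative sketch)."""

    total: int  # amount needed for the bid to be full
    investments: list[int] = field(default_factory=list)
    progress: int = 0  # value stored in the shared progress row
    progress_writes: int = 0  # how often the progress row was written

    def invested(self) -> int:
        return sum(self.investments)

    def invest(self, amount: int) -> bool:
        """Record a purchase; write the progress row only when the bid is full."""
        if amount <= 0 or self.invested() + amount > self.total:
            return False
        self.investments.append(amount)  # each purchase inserts its own record
        if self.invested() == self.total:
            self.progress = self.total
            self.progress_writes += 1  # the only write to the hot row
        return True


if __name__ == "__main__":
    bid = Bid(total=300)
    for amount in (100, 100, 100):
        bid.invest(amount)
    print(bid.progress, bid.progress_writes)
```

Each purchase adds its own record instead of updating one shared row, so the shared row is written once per bid rather than once per purchase.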

6 Server upgrade plan

1. The biggest pressure on the platform comes from the database. The current one master, one slave setup needs to become one master, four slaves. The large volume of queries generated by the official website, app, and small web page is distributed across three slaves through a virtual IP, and back-office management queries go to the fourth slave. Three new database servers need to be added.
[Figure: schematic diagram after the database upgrade]

2. Add caching to reduce database pressure. Two new cache servers with large memory need to be added.
[Figure: deployment after adding the cache servers]

3. Three new web servers need to be added to spread user access requests.

The app needs two new servers.
The app servers are under the greatest pressure during bidding, so two new servers need to be added. Schematic diagram after the configuration is completed:
[Figure: app servers after the upgrade]

The official website needs one new server.
The official website is also under some pressure during bidding and needs one new server. The completed diagram is as follows:
[Figure: official-website servers after the upgrade]

In total, 8 servers need to be purchased, two of which require large memory (64 GB or more).
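The arithmetic behind the plan can be checked with a few lines of Python. The figures come from the report itself (about 2,000 connections per Apache server, a target of 10,000 simultaneous requests, and the per-role counts in this section); the function and dictionary names are made up for the example.

```python
import math


def servers_needed(target_connections: int, per_server_capacity: int) -> int:
    """Smallest number of servers whose combined capacity covers the target."""
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    return math.ceil(target_connections / per_server_capacity)


# New servers requested in this section, by role.
UPGRADE_PLAN = {
    "database slaves": 3,
    "cache": 2,
    "web (app)": 2,
    "web (official website)": 1,
}


def total_new_servers(plan: dict[str, int]) -> int:
    return sum(plan.values())


if __name__ == "__main__":
    print(servers_needed(10_000, 2_000))
    print(total_new_servers(UPGRADE_PLAN))
```

The per-role counts add up to the 8 servers stated above.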


Note: After all the optimization plans were put into production, the problems were solved and bidding was no longer a worry!


Author: Pure Smile
Source: http://www.php.cn/
Copyright belongs to the author, please indicate the source when reprinting.
