


Business is growing exponentially: how can availability construction stay stable?
1. Problems and Challenges
- Data-center-level failure risk (large and small companies alike will encounter it: fiber cuts, internal failures in the data center, etc.);
- Rapid business growth has significantly increased capacity requirements.
- How to prevent the occurrence of faults?
- How to find the fault as soon as possible?
- How to resolve the fault quickly?
- After service is restored, how to follow up?
1) Service perspective
A service is nothing more than a request as input, and normally it only needs to return the corresponding output. In real situations, many factors affect whether the service responds correctly. For some classic scenarios, the influencing factors can be summarized as follows:
- Capacity side: exponential growth in business requests can cause a single service to produce abnormal output;
- Service side: there is a bug in the software itself, and the service crashes as a result;
- Hardware side: abnormalities caused by host hardware, the data center, and the network.
- Service layer: Collaborative configuration is required between services. Incorrect configuration settings can also cause full-link abnormalities;
- Upstream and downstream dependencies: Abnormalities in some key services can cause abnormalities across the entire link.
From the perspective of the stability of the entire link: upstream and downstream dependencies, insufficient capacity, and abnormal service configurations are all important factors affecting stability.
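The effect of upstream and downstream dependencies can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that every service on the link is a strong dependency and that services fail independently (neither assumption comes from the talk). The function name `chain_availability` and the 99.9% figures are hypothetical.

```python
from math import prod


def chain_availability(strong_deps):
    """Availability of a link whose services are all strong dependencies.

    Assumes independent failures, so the link is up only when every
    service on it is up: the availabilities multiply.
    """
    return prod(strong_deps)


# Hypothetical figures: five services, each available 99.9% of the time.
link = chain_availability([0.999] * 5)
print(f"link availability: {link:.4%}")
```

Under these assumptions the link is noticeably less available than any single service on it, which is one reason key services on the link deserve special protection.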
3. Fault prevention construction
After analyzing the fault factors from the two perspectives of service and full link, the corresponding ideas for fault prevention construction are:
- Full-link abnormality: analyze which upstream and downstream dependencies are strong and which are weak, and provide special protection for key services to ensure the stability of the entire link;
- Change exceptions: establish change process specifications and change management platforms;
- Infrastructure exceptions: rely on a high-availability architecture, remove single points of risk, and build good redundancy and disaster recovery.
4. Fault prevention
The previous section covered the overall analysis and construction ideas. How does vivo actually do it?
We have built guarantees across the entire link, covering the access layer, business logic layer, middleware layer, storage layer, and infrastructure layer:
1) Unitization: reduce service calls across data centers so that the failure of a single data center does not affect services in all data centers;
2) Multiple entrances: in the past, many businesses had only a single access layer entrance. After building multi-entry capability across IDC and public cloud, an exception at a single entrance has a smaller impact on overall service access;
3) Overload protection: when business traffic suddenly increases, the access layer service can actively reject some burst requests according to its settings, so that excessive request traffic does not overwhelm downstream services;
4) Circuit breaking and degradation: circuit breaking and degrading dependent services shields the impact of abnormal services and avoids the avalanche effect.
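Overload protection and circuit breaking can be sketched in a few lines. The following is a minimal, single-threaded illustration and not vivo's implementation: the class names `LoadShedder` and `CircuitBreaker`, the in-flight limit, the failure threshold, and the cool-down period are all hypothetical, and a production version would need locking and metrics.

```python
import time


class LoadShedder:
    """Overload protection: reject requests above an in-flight limit."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self):
        if self.in_flight >= self.max_in_flight:
            return False  # shed the request instead of queueing it
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1


class CircuitBreaker:
    """Opens after consecutive failures and serves a fallback while open."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: do not touch the dependency
            # Cool-down elapsed: half-open, let this one call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the breaker
        return result
```

A caller would wrap each request in `try_acquire()` / `release()` at the access layer, and wrap each call to a dependency in `breaker.call(real_call, degraded_response)` so that a failing dependency returns a degraded answer instead of dragging down its callers.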
5. Fault discovery
We have built a fault detection capability based on the entire link, and the proactive fault detection rate currently reaches 90%. It includes client monitoring, server monitoring and basic monitoring:
1) Client monitoring: a self-built dial testing system that monitors the availability of each service through bypass simulated user access;
2) Server monitoring: includes domain name monitoring, log monitoring and call monitoring between services. By implementation method, it is mainly metrics/logs/trace;
3) Basic monitoring: monitors the hardware resource usage of the host, mainly in the form of metrics.
6. Troubleshooting
Mainly includes fault analysis and fault handling.
- Fault handling: failure plan construction, including plan formulation, drills, etc.
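The client-side dial testing idea (probing services from the outside as a simulated user) can be sketched as follows. This is an illustrative outline under assumptions, not the self-built system described above: the function names, the three-round alert rule, and the timeout are hypothetical. `http_check` uses only the standard library's `urllib.request.urlopen`.

```python
import urllib.request


def http_check(url, timeout=3.0):
    """Return True if the URL answers with a 2xx/3xx status.

    urlopen raises on 4xx/5xx and on network errors; run_probes
    treats any exception as a failed probe.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return 200 <= resp.status < 400


def run_probes(targets, check=http_check):
    """Run one probe round and return the names of targets that failed."""
    failed = []
    for name, url in targets.items():
        try:
            ok = check(url)
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def should_alert(history, consecutive=3):
    """Alert only after N consecutive failed rounds, to damp one-off blips.

    `history` is a list of booleans, oldest first, True meaning the probe passed.
    """
    recent = history[-consecutive:]
    return len(recent) == consecutive and not any(recent)
```

The `check` parameter is injectable so that the same loop can drive HTTP, TCP, or scripted user-flow probes, and so that it can be exercised without network access.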
7. Fault review
Fault review is a very important part of the entire high availability construction cycle.
We use business-based SLA grading to guarantee business stability in a targeted manner, and we record every business fault in order to improve and verify our capabilities:
1) Business classification: operation and maintenance resources are very limited, so it is not possible to guarantee the same SLA for every business, and tiered guarantees are necessary. Based on the reputation and revenue of the business, we divide businesses into four levels: core, important, general, and other. This guides the operation and maintenance manpower and guarantee effort invested in each business;
2) Fault record: improve review efficiency, and track online business faults for subsequent analysis to guide business optimization;
3) Fault improvement: conduct reverse verification based on chaos engineering to determine whether the improvement measures have taken effect.
This is our practice in fault review. We have also implemented these capabilities and practices into the platform and managed the fault review work through the platform.
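To show what tiered SLAs mean in practice, an availability target can be converted into an allowed downtime budget. The following is a minimal sketch: the four tier names come from the text above, but the target value attached to each tier is a hypothetical placeholder, since the source does not state per-tier SLAs.

```python
def downtime_budget_minutes(availability, days=30):
    """Minutes of downtime allowed over `days` at the given availability."""
    return (1 - availability) * days * 24 * 60


# Hypothetical targets per tier; the tiers are named in the text,
# their SLA values are not.
TIER_TARGETS = {
    "core": 0.9999,
    "important": 0.999,
    "general": 0.995,
    "other": 0.99,
}

for tier, target in TIER_TARGETS.items():
    budget = downtime_budget_minutes(target)
    print(f"{tier:<10} {target:.2%}  {budget:7.1f} min / 30 days")
```

Comparing the recorded fault duration of each business against its budget gives a simple way to decide where review and improvement effort should go first.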
8. Capacity management
- Resource elastic scalability: Build hybrid cloud-based resource guarantee capabilities to greatly improve resource elasticity;
- Resource delivery, operation and management capabilities: establish a management mechanism for the entire life cycle of resources to ensure the maximum supply and use efficiency of resources, including budget management, demand management, procurement management, and inventory operation management.
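Elastic scaling on a hybrid cloud ultimately comes down to a capacity calculation. The sketch below is one simple way to express it, under stated assumptions: the function names, the 60% target utilization, the single redundant instance, and the example numbers are all hypothetical and not taken from the talk.

```python
import math


def instances_needed(peak_qps, qps_per_instance, target_utilization=0.6, redundancy=1):
    """Instances required to serve peak_qps with headroom.

    Each instance is only loaded to target_utilization of its measured
    capacity, and `redundancy` spare instances are added on top.
    """
    base = math.ceil(peak_qps / (qps_per_instance * target_utilization))
    return base + redundancy


def burst_to_cloud(needed, idc_instances):
    """Instances to request from the public cloud beyond IDC stock."""
    return max(0, needed - idc_instances)


# Hypothetical example: 10,000 QPS peak, 500 QPS per instance, 30 IDC instances.
needed = instances_needed(peak_qps=10_000, qps_per_instance=500)
print(needed, burst_to_cloud(needed, idc_instances=30))
```

Keeping the IDC stock and the cloud burst as separate numbers matches the split between long-lead-time procurement and on-demand elasticity.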
3. Availability phase construction
After building the availability capabilities, we divide availability construction into three stages: the standardization stage, the process stage, and the platform stage.
1. Standardization stage
Why should we build standardization? Standardization can greatly reduce the complexity of business operation and maintenance, thereby reducing operation and maintenance costs. We have done a lot of standardization work at both the hardware and software levels.
- Hardware level: data center standardization, network standardization (public network, active Internet access, intranet dedicated line);
- Software level: OS standardization, host environment standardization, service catalog standardization, Agent standardization, access-layer nginx cluster standardization, and service capability standardization (middleware services).
2. Process stage
We condense the best practices and methods from the operation and maintenance process into process mechanisms and specifications, so that business stability is orderly and controllable. These include operation and maintenance hard rules, the fault response mechanism, public affairs specifications, guarantee specifications for large-scale events, etc.
For example, before the guarantee specifications for large-scale events were established, online failures occurred easily during large-scale operational activities or Spring Festival red envelope campaigns. Since the guarantee standards for large-scale events were established in 2018, major guaranteed events such as the Spring Festival have run smoothly.
3. Platform and system construction
- Availability capability building: fault prevention, fault discovery, fault handling, fault review
- Availability phase construction: standardization, process/specification, platform/automation
Q&A
Q1: What are the biggest difficulties encountered during the implementation of availability construction?
A1: The first point is the construction specifications for the underlying technical capabilities. Failure to comply with these specifications leads to great uncertainty in business availability results, so certain specifications and standards must be formulated for the team, and at the same time there must be a fallback mechanism as a safety net;
The second point is recognition from upper management. Each business has different demands at different stages, and if stability is not done well, it will affect the business, its reputation, and its revenue. Once upper management recognizes this, availability construction is easier to promote.
Q2: During the implementation of CMDB, in addition to the responsible developer, host and other information, what other information did your company associate in practice? For example, is middleware information associated?
A2: Many of our systems are currently based on CMDB. Not only the operation and maintenance system but many other systems are built on CMDB, and middleware services are also associated with CMDB. For example, dubbo in microservices also relies on CMDB for service discovery and governance.
Lecturer Introduction
Zhou Jiali is the operation and maintenance director of vivo, responsible for the operation and maintenance of vivo's Internet business. Zhou previously worked at Baidu and Tencent and has experience in the operation and maintenance of businesses such as client, internationalization, and offline big data algorithms. After joining vivo, Zhou led the construction of business high availability and raised business availability to the 99.99% level.