
From a CTO perspective: How to build operation and maintenance/SRE capabilities

WBOY (reposted) · 2023-06-09 12:37:08 · 857 views



There have been many articles recently discussing whether operation and maintenance positions should be kept or cut. The SRETalk public account I host has also published the opinions of many operation and maintenance directors, and I have personally talked with many people in the industry. I have a few thoughts of my own and am recording them here for CTOs and CIOs to consider. If you work in operation and maintenance/SRE and feel confused about your path, I also recommend reading this article carefully.


I consider this an in-depth piece. It may be dry, but it should help with career choices and team building. Well-founded discussion is welcome; arrogance is not. Many things are not black and white, and it will be enough if the article inspires you or brings new thinking to CXOs' decision-making.


SRETalk's interviews with operation and maintenance directors will continue and will keep presenting different views for your reference. My own views are not necessarily correct and are likewise for reference only.

About the title

First, the title: "How to build operation and maintenance/SRE capabilities". I deliberately write about building capabilities rather than building a team, because some goals do not necessarily require a team of your own. You need to decide carefully, weighing cost, predictability of results, and long-term investment and maintenance. A wrong decision leaves a mess later. This is discussed further below.

About the operation and maintenance/SRE team

One more point to clarify up front. The operation and maintenance/SRE teams discussed in this article all serve the business, and the success of the business comes first. Some operation and maintenance teams have built products and commercialized them externally, turning them into a business in their own right. That is a separate matter. Moreover, based on my experience at a former employer, external commercialization by an operation and maintenance/SRE team is not advisable, especially in a company that has no ToB DNA and no corresponding ToB organization.

Where to obtain operation and maintenance/SRE capabilities

Since everything is for the success of the business (ignoring the business and only thinking about getting promoted or impressing the boss is another matter), the focus is on which operation and maintenance capabilities the business requires (explained in detail later) and where those capabilities should come from. There are three typical ways to obtain them.


Self-built team

The first way is to provide the capabilities through a self-built team, which is the approach most familiar to everyone. What a self-built team delivers to the business usually has two parts: products and services. Let's start with products:

  • If the need is a general one, the product will most likely be an open source project that can be used directly. Consider the project's durability (do its developers have income from a commercial company? Most personal open source projects die without income), its activity (has it gone years without an update? Are issues and PRs handled promptly? Handling within a week can usually be regarded as active), and the health of its ecosystem (are many people contributing? Are many companies using it?).
  • Does the open source project require secondary development? If the custom code can be merged back upstream, it usually means the code is general and has been accepted by the project team. If it cannot be merged back, later maintenance becomes troublesome, especially after staff turnover. It is usually workable to write glue code against the project's API and integrate it with internal systems; since the open source code itself is not modified, you can still keep up with upstream upgrades.
  • There is also fully in-house development with no open source base (using only some open source libraries and writing the core product logic yourself). Be cautious here. If the open source community has no suitable product, in-house development is the only option, but you must then plan for long-term maintenance. R&D staff usually like 0-to-1 work; later, when the returns shrink and promotions and raises dry up, they tend to move on. In the operation and maintenance field the open source community already offers a dazzling array of products, and only a handful of needs may truly require in-house development, so think twice.
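The three criteria in the first bullet (durability, activity, ecosystem) can be written down as a simple checklist. The sketch below is only an illustration: the field names are hypothetical, and every threshold except the one-week response time mentioned above is an assumption to adjust to your own judgment.

```python
from dataclasses import dataclass


@dataclass
class ProjectSignals:
    """Hypothetical facts collected by hand for one open source project."""
    has_commercial_backing: bool        # durability: is anyone paid to maintain it?
    days_since_last_release: int        # activity: has it gone quiet?
    median_issue_response_days: float   # activity: are issues and PRs handled promptly?
    active_contributors: int            # ecosystem: are many people contributing?
    known_adopters: int                 # ecosystem: are many companies using it?


def evaluate(p: ProjectSignals) -> dict:
    """Return pass/fail per criterion. All thresholds except the one-week
    response time are illustrative assumptions."""
    return {
        "durability": p.has_commercial_backing,
        "activity": p.days_since_last_release <= 365
        and p.median_issue_response_days <= 7,
        "ecosystem": p.active_contributors >= 10 and p.known_adopters >= 5,
    }


def concerns(p: ProjectSignals) -> list:
    """Names of the criteria that failed, for a quick review note."""
    return [name for name, ok in evaluate(p).items() if not ok]


candidate = ProjectSignals(True, 40, 3.0, 25, 12)
hobby_project = ProjectSignals(False, 900, 30.0, 1, 0)
print(concerns(candidate))
print(concerns(hobby_project))
```

A checklist like this does not replace judgment, but it makes the selection discussion explicit and repeatable across candidate projects.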

The second part is service. Service here means the expert experience delivered to the business side. For example, if a self-built team builds a monitoring product, the team needs to deliver monitoring best practices to its internal "customers" and to resolve problems with the monitoring product quickly when they arise. Middle- and back-office teams inside a company need a strong sense of service and an understanding of industry best practices. Otherwise they are easily led by the business in a direction contrary to those best practices, which causes problems all round.

Service depends fundamentally on people (although it is even better to solidify best practices into products). As a manager, if you want this team to deliver good service, you need to consider several people questions: can you recruit the relevant talent, can you retain it (growth prospects, salary, and so on), can you staff at least two people in each direction so they can back each other up, and can you afford the cost.

Third-party suppliers

Obtaining operation and maintenance capabilities through third-party suppliers is another way. A supplier's deliverables likewise have two parts: products and services. Products come in two types, open source and closed source. What should you consider?

  • Open source products usually have more users and more scenarios to polish them, but some long-tail features are usually not open sourced. Either the open source team keeps them as paid items, or it considers them not general enough to be worth putting into the product.
  • Closed source products usually have a smaller audience and no large pool of open source users to help polish them. They need long-term polishing by commercial customers, or the supplier needs a very strong quality management system with thorough product testing, which means finding a supplier with a large business. Testers and end users are still different groups, so polishing by commercial customers remains indispensable, but a strong quality assurance team on the supplier side shortens that process.
  • Whether open source or closed source, the supplier arrives with a product. As Party A (the customer), you can test it directly, see how well it fits, and get feedback quickly. A self-built team may need several months or even a year or two of development, which the business may not be able to wait for. Whether the result meets expectations depends on many factors and is hard to predict.

The second part is service. Here suppliers usually have an advantage over self-built teams, for the following reasons:

  • Suppliers have seen more customer scenarios. For a ToB company, long-accumulated industry know-how is its core competitiveness. Suppliers keep learning from their most advanced customers and pass that experience on to less advanced ones, creating a virtuous cycle that benefits all parties.
  • Because suppliers have seen more scenarios, they can also abstract their products better, making them more general and more product-like. What self-built teams make is usually more tool-like. No offense intended; I mean usually.
  • Suppliers most likely started a business in the operation and maintenance field because they had already achieved something in it. Compared with self-built teams, suppliers usually have stronger top-level expertise. If you actually try to hire, you will find that the most talented people have either started a company, are too expensive, or are unwilling to come.

Then there is cost. A supplier's fee is most likely lower than the cost of hiring people yourself (assuming you could hire the right people at all); otherwise the supplier's business logic would not hold. This is obvious and needs no elaboration.

Third-party suppliers seem to beat self-built teams across the board, so is there any need to read on? Not necessarily. For a given operation and maintenance capability, you need to judge case by case whether product capability or service capability matters more to you. Below, I break down the operation and maintenance capabilities one by one from the business side's point of view.

What technical support capabilities are needed for the business?

Operation and maintenance is, in essence, a kind of technical support capability, very similar to what an infrastructure team provides. Some of these capabilities can sit in the operation and maintenance team, but putting them in the infrastructure team is not a big problem either. Some companies even place such people directly in business R&D teams. Let's set aside the division of labor for now and first sort out what technical support capabilities the business needs.

[Figure: the technical support capabilities the business needs]

This picture explains it well. Let me elaborate a little:

  • Reliable basic environment and components: Running business programs requires basic networks, hardware, operating systems, databases, middleware, and so on. These environments and components need to be stable and reliable.
  • Fast and safe change capability: The need for speed is easy to understand. As a developer, when you write a feature or a bug fix, you want to ship it quickly. But changes easily cause failures, so they need to be controlled and made as safe as possible.
  • Reliability assurance capability: Once software is deployed to production, it can run into all kinds of problems. How to quantify risks in advance, how to detect and locate problems quickly, and how to stop losses quickly: this is probably the business side's most important requirement of the operation and maintenance side.
  • Best practices: The business relies on many basic supporting capabilities. How should they be used? What is industry best practice? What do most other business lines in the company do? A basic support team is needed to feed this back to the business.

How to obtain each ability

How should the four capabilities above be obtained? Let's break them down and discuss each in turn.

Reliable basic environment and components

First, the basic hardware environment. There are obviously two options: cloud or self-built. If policy requires you to build it yourself, there is no choice; policy prevails. If the choice is yours, going to the cloud is most likely the better fit in this era. Only if the company is very large and runs a large number of machines might building it yourself have an advantage. Note that I say only "might". When calculating costs, remember to include labor costs, not just hardware costs.

On career choice: this does not look like good news for system and network operation and maintenance engineers. The cloud has indeed squeezed the room for some of these positions. There is no way around it. The wheel of the times rolls forward, and everyone is dust in its path.

Next, components such as MySQL, Redis, MongoDB, Kafka, Elasticsearch, Nginx, and Kubernetes. There are obviously three options: use cloud PaaS products, run them yourself, or supply your own hardware and have a third-party supplier provide the solution and services. Some comments on each:

  • Cloud PaaS products: If the scale is small and there is no relevant talent in reserve, cloud PaaS products are the more appropriate choice because they let you build the capability quickly. A Party A that chooses cloud PaaS products is usually already using virtual machines and a Kubernetes-like runtime environment on the same cloud, so buying PaaS products as well is relatively smooth and requires no onboarding of new suppliers.
  • Do it yourself: If a certain component runs at very large scale, such as Kafka, it may be necessary to run it yourself. That means hiring two people, one primary and one backup, good enough to handle whatever goes wrong. In Beijing that costs about 1 million a year. How large must the scale be to recover this money from savings on hardware and components? Of course, you can also hire some lower-cost operation and maintenance engineers (note: operation and maintenance engineers may well be needed here, but not at high ranks) who handle day-to-day problems and turn to an external provider's expert services for advanced problems they cannot solve.
  • Your own hardware plus a supplier's solutions and services: Compared with cloud vendors' PaaS products, third-party suppliers are usually more cost-effective and respond faster. However, with so many components, each supplier can probably cover only a limited number of them, so as Party A you may have to deal with several suppliers at once, which is somewhat troublesome. For products that need to work across clouds, such as unified monitoring, fault location, and FinOps-related products, a third-party supplier is very likely the better fit if the company uses a multi-cloud or hybrid cloud architecture.
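The question in the "do it yourself" option (how large must the scale be to recover about 1 million a year in staff cost?) is a break-even calculation. The sketch below is a rough illustration only: `self_host_ratio`, the assumed cost of self-hosted hardware as a fraction of the equivalent cloud PaaS bill, is a made-up input that you would need to estimate for your own component.

```python
def break_even_paas_spend(staff_cost: float, self_host_ratio: float) -> float:
    """Annual PaaS spend above which running the component yourself is cheaper.

    self_host_ratio is an assumption: self-hosted hardware cost divided by
    the cloud PaaS bill for the same workload (must be in [0, 1)).
    """
    if not 0 <= self_host_ratio < 1:
        raise ValueError("self_host_ratio must be in [0, 1)")
    return staff_cost / (1 - self_host_ratio)


def annual_saving(paas_spend: float, staff_cost: float, self_host_ratio: float) -> float:
    """Positive means self-hosting saves money; negative means it costs more."""
    return paas_spend * (1 - self_host_ratio) - staff_cost


STAFF_COST = 1_000_000   # two engineers, the article's Beijing estimate
RATIO = 0.5              # assumed: own hardware costs half the PaaS bill

threshold = break_even_paas_spend(STAFF_COST, RATIO)
print(threshold)
print(annual_saving(1_200_000, STAFF_COST, RATIO))
```

Under these assumed numbers, self-hosting only pays off once the PaaS bill for that component exceeds 2 million a year, which is why the option tends to make sense only at very large scale.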

On career choice: for experienced veterans of a given component, the first choice is to work for a cloud vendor or start a company to export that experience, and the second choice is to join a large company that runs its own components. Generally speaking, small and medium-sized companies find it hard to pay high salaries for this, since third-party expert services are very cost-effective.

The ability to change quickly and safely

The most common changes made by business R&D are binary and configuration changes. There are also changes to the basic environment and components.

Let's start with binary and configuration changes. How can we iterate quickly and safely? It can be done in stages. While the company is still small, you do not need to invest much in tooling; setting specifications and processes is enough. Specifications cover things such as which account services are deployed under, which directory they live in, where logs go, how processes are supervised, and the rule that every change must support rollback. Processes cover things such as the change notification mechanism, the mechanism for coordinating multi-module releases, and an approval mechanism for changes that cannot be rolled back.
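Even before a platform exists, rules like these can be expressed as a small check that a release script runs. The following is a minimal sketch under stated assumptions: the account name, directory prefix, and field names are all hypothetical stand-ins for whatever your own specification says.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical company conventions; real values come from your own specification.
REQUIRED_ACCOUNT = "app"
REQUIRED_DIR_PREFIX = "/home/app/"


@dataclass
class ChangeRequest:
    service: str
    deploy_account: str
    deploy_dir: str
    can_roll_back: bool
    notified: bool = False            # was the change announced beforehand?
    approver: Optional[str] = None    # only required when rollback is impossible


def violations(change: ChangeRequest) -> List[str]:
    """Return the list of rules this change breaks (empty means compliant)."""
    problems = []
    if change.deploy_account != REQUIRED_ACCOUNT:
        problems.append("wrong deploy account")
    if not change.deploy_dir.startswith(REQUIRED_DIR_PREFIX):
        problems.append("non-standard deploy directory")
    if not change.notified:
        problems.append("change not announced")
    if not change.can_roll_back and change.approver is None:
        problems.append("irreversible change lacks approval")
    return problems


ok = ChangeRequest("cart", "app", "/home/app/cart", True, notified=True)
risky = ChangeRequest("ledger", "root", "/opt/ledger", False)
print(violations(ok))
print(violations(risky))
```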

Next, you need quantitative data on past changes: how many changes a team made last quarter, what its rollback rate was, and what its failure rate was. Teams can then be compared, and the teams that did poorly are expected to improve in the next quarter.
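These per-team numbers are straightforward to compute from a change log. The sketch below assumes a hypothetical record format (a dict with `team`, `rolled_back`, and `caused_failure` keys); a real change log will look different, and the sample data is invented.

```python
from collections import defaultdict


def change_stats(records):
    """Aggregate per-team change count, rollback rate and failure rate.

    Each record is a dict with hypothetical keys: "team", "rolled_back"
    and "caused_failure" (the last two are booleans).
    """
    counts = defaultdict(lambda: {"changes": 0, "rollbacks": 0, "failures": 0})
    for r in records:
        c = counts[r["team"]]
        c["changes"] += 1
        c["rollbacks"] += int(r["rolled_back"])
        c["failures"] += int(r["caused_failure"])
    return {
        team: {
            "changes": c["changes"],
            "rollback_rate": c["rollbacks"] / c["changes"],
            "failure_rate": c["failures"] / c["changes"],
        }
        for team, c in counts.items()
    }


quarter = [
    {"team": "payments", "rolled_back": False, "caused_failure": False},
    {"team": "payments", "rolled_back": True, "caused_failure": True},
    {"team": "search", "rolled_back": False, "caused_failure": False},
    {"team": "search", "rolled_back": False, "caused_failure": False},
    {"team": "search", "rolled_back": True, "caused_failure": False},
    {"team": "search", "rolled_back": False, "caused_failure": False},
]
stats = change_stats(quarter)
# Rank teams from highest to lowest rollback rate for the quarterly comparison.
ranking = sorted(stats, key=lambda t: stats[t]["rollback_rate"], reverse=True)
print(ranking)
```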

As the company keeps growing, it can invest people in building a change platform that enforces the specifications and produces the quantitative data. Because companies differ so much, commercial change systems were rare in the era of traditional physical and virtual machines. Since the rise of Kubernetes, many of the underlying differences have been abstracted away, Kubernetes-based change platforms have become much more general, and related products have started to appear.

Changes to the production environment are not the same as changes to test and joint debugging environments. Production has stricter stability requirements, while test and joint debugging environments have relatively low ones. Most so-called CI/CD systems are designed for test and joint debugging environments. Only a handful of companies achieve CD for production.

Key point: a CI/CD system for test and joint debugging environments is mainly about R&D efficiency; a change system for the production environment is mainly about ensuring stability and enforcing the specifications. While the company is small, rules and regulations are enough. Later, rules and a platform need to work together.

Who sets these specifications? Who develops the change platform?

Specifications are set early; they may well exist before there is an operation and maintenance team. So they are most likely set by the CTO and the core team reporting to the CTO. If none exist yet, the operation and maintenance director (assuming one is in place) can take the lead in drafting them, the CTO's core team reviews them (so that everyone has a sense of participation), and the CTO finally signs off and publishes them (top-down) for everyone to follow.

The change platform is best developed by the operation and maintenance team. Other platforms will be introduced later, so setting up a dedicated operation and maintenance team is appropriate (I draw no distinction here between operation and maintenance and SRE; you can also call it the SRE team). A change platform has to implement the company's own specifications, so outsourcing is relatively rare. Once the company reaches a certain scale, building in-house on top of open source is the most likely choice.

On career choice: change management is an important part of an enterprise and also serves system stability. This is a typical DevOps position, and its ceiling is around the P7 level (purely a personal opinion, for reference only).

The other kind is changes to basic components and environment, typically MySQL table structures, Nginx configuration, DNS, VIPs, and so on. Such changes can be built into the component management and control platform, so that the provider of the component capability supplies the change entry point and the controls.

Reliability Guarantee Capability

This capability is very important. SRE stands for Site Reliability Engineering. From the CTO's perspective, once software is deployed to production, all kinds of problems may occur, and we want an engineering system that ensures reliability. This is a huge topic, and this article will not go into detail; it only clarifies what needs to be done and who is responsible.

Reliability is the process of fighting failures. So we follow the life cycle of a failure and, at each stage, work to defeat it or even strangle it in the cradle.


There is a lot of work to be done in prevention and risk control before the failure begins.

For example: define alarm completeness standards and assess each business line against them quantitatively; define principles and processes for fault location, along with standards for fault grading and assigning responsibility; map each business's core functions to service modules in advance and build a global stability view or war room for quickly identifying faulty modules or interfaces; optimize the architecture; maintain failure response plans and rehearse them regularly to keep them fresh, which is what chaos engineering is about; and so on.

Some of this, such as architecture optimization, has to be done by business R&D. For the rest, my suggestion is to let the operation and maintenance team lead with R&D cooperating. The CTO's core team will most likely include the head of operation and maintenance and the technical head of each business. Nominally the CTO makes the decision, authorizing the head of operation and maintenance to lead and the technical heads of each business to cooperate. In practice, the head of operation and maintenance may assign a capable person to do the actual work, and each business's technical head may designate an interface person to support it.

Apart from architecture optimization, these are all horizontal matters. There are methodologies and best practices that a horizontal team can gather and help spread. Some will ask: can we simply pick people from the R&D teams to form a virtual stability organization and push this forward jointly? You can try, but there will be a few problems:

  • Each business line usually has only one or two interface people. With few people and a lot of work, such a person will very likely struggle to balance business code development and stability work. If the person works on stability full-time, they are in effect already an SRE.
  • If the person is an SRE, the assessment system differs from that of business R&D staff. How should KPIs be set? The person may also lack a sense of belonging.
  • If the person handles both stability and business R&D, inertia may set in. When stability work hits a snag, they will naturally want to switch to business R&D; when business R&D hits a snag, they will want to retreat into stability work.

Key point: for prevention and risk control in advance, the CXO should hold the operation and maintenance director accountable for results, but must also give strong backing and push it from the top down. The SRE engineer who takes this on apparently needs to be a very professional, senior person; someone with under five years of experience most likely lacks the necessary understanding. Recruiting SREs from among senior R&D staff may be a good choice, and CXOs can give it a try.

Reduce the impact after the failure begins

Once a failure occurs, the primary goal becomes reducing its impact. The relevant teams immediately work together to locate the direct cause and stop the loss quickly, then investigate the root cause afterwards at leisure. The following work is involved:

  • Defining a fault: Usually, a fault has begun when business indicators go wrong, such as a drop in order volume, ride-hailing request volume, or payment volume. The boss pays special attention to indicators like these. A CPU spike or full disk on one machine may just be a problem the team handles internally. Kubernetes-like systems may even reschedule the workload automatically, which usually does not affect the customer's main flow, and the boss does not care. To avoid confusion, we need to distinguish faults from problems. Different business lines have different indicators, but the overall methodology is the same.
  • Responding to a fault: Who should receive the fault alarm: business R&D, SRE, or an on-call center? Companies differ enormously in practice. My personal view is to send it directly to the people capable of handling it. It is not black and white; different alarms call for different handling. A basic network problem obviously goes to the network engineers, and a problem with a given business goes to the corresponding operation and maintenance and R&D staff. Try to avoid hand-offs in between. If an alarm goes to Zhang San, who cannot handle it and then has to contact Li Si, time is wasted, and fault handling is a race against time.
  • Quick location: An effective fault location system is a killer tool. Such systems are usually built on observability data and can be regarded as cockpit-level products. Observability data is massive; without sorting and use, it cannot be turned into valuable information. Location usually requires three things: an observability system, fault location, and continuous operation. There is too much to cover here. If you want to discuss it in detail, you can reach me through the SRETalk public account.
  • Quick stop loss: Stopping losses quickly requires complete response plans. In every failure review, the CTO and the operation and maintenance director should look at plan effectiveness, that is, whether the loss was stopped with an existing plan or with an improvised fix. If it was improvised, the plans are not complete enough.
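The first two points above (telling faults apart from problems, and sending each alarm straight to whoever can handle it) can be sketched as a small routing function. This is only an illustration: the layer names, recipient names, and the `Alert` structure are hypothetical and do not correspond to any particular alerting product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alert:
    layer: str    # "business", "network" or "host" (hypothetical categories)
    metric: str
    team: str     # owning team; ignored for network alerts


def classify(alert: Alert) -> str:
    """Business-indicator alerts are faults; the rest are internal problems."""
    return "fault" if alert.layer == "business" else "problem"


def recipients(alert: Alert) -> List[str]:
    """Send each alert straight to the people able to handle it, with no relay."""
    if alert.layer == "network":
        return ["network-engineers"]
    if alert.layer == "business":
        return [f"{alert.team}-sre", f"{alert.team}-rd"]
    return [f"{alert.team}-oncall"]


orders_drop = Alert("business", "order_volume", "trade")
disk_full = Alert("host", "disk_used_percent", "search")
print(classify(orders_drop), recipients(orders_drop))
print(classify(disk_full), recipients(disk_full))
```

Keeping the routing table explicit like this makes it easy to review in a failure retrospective whether an alarm reached the right person on the first hop.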

That was a lot of words. Back to the question: for this body of work, whom should the CTO hold accountable for results? My suggestion is the SRE team (operation and maintenance and SRE appear many times in this article and mean basically the same thing here; operation and maintenance is not just Operations). Obviously SRE cannot resolve every fault. Most faults have to be resolved by people from other teams, but the CTO cannot chase team A and team B every time. So SRE must carry the CTO's mandate (the "Sword of Shang Fang", an emperor's sword symbolizing delegated authority) and take the lead on overall stability work, with each business's interface person cooperating fully. Stability work includes preventive risk control beforehand, coordination during an incident, and driving the review afterwards. This is also SRE's greatest value to the company.

Best Practices

This covers a lot, for example: which machine model and package is more suitable, which networking method is more suitable, which components the company has better control over and can get better support for (whether from an internal team or a third-party supplier), which programming languages and frameworks the company recommends or even requires, which access layer solutions the industry recommends, how changes should be made, how observability should be done, and so on.

Admittedly, a strong business R&D team is clear about these practices. But once there are more business lines, skill levels vary, and weaker teams inevitably need someone in a coaching role; they cannot go to the CTO for everything. As a horizontal technical team, the SRE team is particularly well suited to this. Obviously, though, this is a senior role that newcomers cannot fill. Recruiting senior people to act as BPs (business partners) to the business lines is an effective way to unify the technology stack. If the CTO does not use this lever well, the technology stack will blossom in every direction, with all kinds of governance dilemmas behind it.

That covers the four supporting capabilities: how the business side should obtain them, how the CTO should coordinate, and how the teams should cooperate. Two summaries follow.

Summary 1: How does the CTO help the business line obtain these supporting capabilities?

Obviously the CTO does not need to do everything personally, but the CTO must act as gatekeeper, issue policies, and serve as commander-in-chief. Horizontal work is handed to the SRE team, and the interface people in each team cooperate fully. This is most likely the best practice. If horizontal goals are dispersed entirely into each business team's own closed loop, you lose the spread of experience that a horizontal team brings. Moreover, where you sit determines how you think: people who are not in the position will not take on its concerns, and each business tends to have its own private calculations. A central horizontal organization is also a mechanism for curbing fiefdoms. Forgive the strong wording; the intention is good, and you will have to work out the rest for yourself.

One more note on FinOps. FinOps is also a horizontal capability. Should it be handed to SRE as well? Not necessarily. I think it is fine to let the business close the loop itself. Each business is responsible for its own profit and loss, and IT spending is the bulk of expenditure, so the business GM should care about it a great deal. The CEO assigns the business GM KPIs on revenue and net profit, and the business GM can make the trade-offs within that closed loop.

Summary 2: Operation and Maintenance/SRE Career Suggestions

If your expectations for rank and salary are not too high, you can do relatively basic Operations work. There is a high probability that this kind of position will not die out within 10 years. If you have higher expectations for rank and salary, an effective path is to go deep in one niche and become an industry expert, then broaden across multiple technical directions, and after that start a business or become a senior executive.

The author of this article

Qin Xiaohui, founding developer of Open-Falcon and Nightingale, author of the Geek Time column "Operation and Maintenance Monitoring System Practical Notes", host of the SRETalk public account, and founding partner of Kuaimao Nebula. The startup focuses on stability assurance. If you have any needs, please feel free to get in touch.


Statement: This article is reproduced from 51cto.com. In case of infringement, please contact admin@php.cn for removal.