Youpaiyun's Shao Haiyang: a 25-year Linux veteran talks about the Eight Honors and Eight Disgraces of DevOps

Reposted by PHPz, 2023-06-09 23:26:28

Through interviews and solicited articles, we invite veterans of the operations and maintenance (O&M) field to share their insights and exchange ideas, in the hope of building consensus and moving the industry forward.

In this issue we invite Shao Haiyang of Youpaiyun Technology, a 25-year Linux veteran. Mr. Shao is devoted to technology and has worked his way up step by step, which is a typical growth path for a rank-and-file operations engineer. I hope today's interview gives you some inspiration.

This is the 4th issue of the down-to-earth yet ambitious "Operation and Maintenance Hundreds Forum". Let's start!

Hello Mr. Shao. Please introduce yourself first and tell us about your background and current role, so that readers can get to know you and better follow the rest of the interview.

I am Shao Haiyang from Youpaiyun Technology. I have been using Linux since 1998, almost 25 years, which makes me a Linux veteran. I am a systems operations engineer and architect, an advocate of the Eight Honors and Eight Disgraces of DevOps, and an amateur writer. I am proficient in system optimization and network service management, Linux system customization, CDN acceleration, and security defense. I am good at high-performance Internet network and architecture design, KVM virtualization and OpenStack cloud platforms, K8S container clouds, Ceph distributed storage, and other newer technologies. I like to communicate and share, am active in the community, and have long been involved in organizing and promoting open source activities.

In the field of operation and maintenance, each company will formulate its own operation and maintenance guidelines or operating specifications. Can you share your company's experience and give us some reference?

Youpaiyun provides cloud storage, cloud distribution, and cloud processing services, and was the first professional cloud service provider in China to offer a programmable CDN. Its defining requirement is uninterrupted service, 7x24, all year round, so we follow some rules or principles for cloud operations, such as:

Ensure stability first, and then optimize

Over-design or premature optimization is likely to lead to more downtime, so we must first focus on improving the scalability and high availability of the system. We adhere to the strategy of "first complete it, then refine it, then perfect it", and projects likewise follow "first usable, then easy to use, then a pleasure to use".

Provide reliable test basis and time verification

Before introducing a new technology into the architecture, we must make sure it is stable and has been tested for long enough. More importantly, the tool chain that operations engineering builds around it must be complete. Being caught off guard by rework or changes made in production may itself be what triggers a failure.

Use controllable automation methods to improve efficiency

Automation such as automatic deployment, orchestration, inspection, and upgrades is increasingly used in cloud operations. This trend suits the cloud computing era, but with greater power comes greater responsibility. Beware of the avalanche and thundering-herd effects of automation, and do grayscale/blue-green deployments and thorough testing.
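To make "controllable" automation concrete, here is a minimal sketch of a grayscale rollout that widens in stages and stops at the first unhealthy batch, so a bad change cannot avalanche across the whole fleet. The function names, the batch fractions, and the `deploy`/`healthy` callbacks are hypothetical placeholders for illustration, not Youpaiyun's actual tooling.

```python
from typing import Callable, List, Sequence, Tuple


def plan_batches(hosts: Sequence[str], fractions: Sequence[float]) -> List[List[str]]:
    """Split hosts into growing batches; `fractions` are cumulative shares of the fleet."""
    batches: List[List[str]] = []
    done = 0
    for fraction in fractions:
        # Each stage must add at least one host, and never exceed the fleet size.
        upto = min(max(done + 1, round(len(hosts) * fraction)), len(hosts))
        if upto > done:
            batches.append(list(hosts[done:upto]))
            done = upto
    if done < len(hosts):
        batches.append(list(hosts[done:]))
    return batches


def grayscale_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    fractions: Sequence[float] = (0.01, 0.1, 0.5, 1.0),  # assumed stage sizes
) -> Tuple[bool, List[str]]:
    """Deploy batch by batch; stop at the first unhealthy batch.

    Returns (success, hosts_already_updated) so the caller knows what to roll back.
    """
    updated: List[str] = []
    for batch in plan_batches(hosts, fractions):
        for host in batch:
            deploy(host)
            updated.append(host)
        if not all(healthy(host) for host in batch):
            return False, updated
    return True, updated
```

In practice the `healthy` callback would wait for real monitoring signals before letting the next batch proceed; here it is left to the caller.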

Keep it simple, monitor everything

Keep things simple; don't overcomplicate them. Beyond the usual alerts for abnormal conditions, business indicators, market indicators, sales data, costs, and so on can all feed trend analysis. Regularly reviewing the peaks and troughs of each trend helps you gain insight.

Budget-oriented operation and maintenance

The operations team is usually the biggest spender. Without enough budget, operations can hardly keep up with the company's growing business scale, unless the business has stagnated or stopped growing explosively. Faced with this challenge, operations must learn to cut costs and improve efficiency, increase revenue and reduce expenditure, and use new technologies to improve energy efficiency.

Scenario-oriented intelligent operation and maintenance

Load scenarios vary, from high-concurrency processing to video transcoding, and from high-performance parallel computing to massive numbers of network requests. These scenarios place different demands on network bandwidth, processing, and IO. Intelligent operations requires a deep understanding of the business and a sensible allocation of resources and architecture to meet the needs of each scenario.

Continuous integration and release system

Continuous release covers grayscale releases, test releases, rolling releases, rollbacks, and other scenarios, and each scenario must be controllable.

Ensure that anyone can be replaced

The camp is made of iron while the soldiers come and go, so it is normal for people to move on. Manage shared documents well and make sure knowledge is transferred and shared among employees. In theory, everyone can be replaced, and no one should become the company's ceiling.

Although growth is ultimately your own business, with the right field, the right project opportunities, the right team, and the right mechanisms, engineers grow faster and teams become more effective. Can you describe systematically how you promote the growth of your operations colleagues?

The company has always actively encouraged employees to improve their skills and grow:

  • Monthly Open Day: the company's technical committee regularly holds lectures to share findings from cutting-edge research. Each talk must have a theme, a focus, application scenarios, and preferably examples.
  • Weekly sharing meeting: all developers are encouraged to regularly share new technologies, talk about the problems they face, or anything else they are thinking about. The shared content is archived as documents and videos, and is rewarded with bonuses and points according to its ratings.
  • Company bounty projects: either the company or employees can propose a project. Once the technical committee approves it, a team can be formed to complete it. The project bonus is awarded based on the resulting documents, data comparisons, and technology sharing, and there are additional bonuses for patent applications.
  • Cultivating personal influence: employees are encouraged to share engineering experience externally and to distill their work experience into articles or talks, which raises their personal influence. Manuscript and lecture fees are awarded as incentives based on audience feedback.
  • Subscriptions to newspapers, magazines, and printed books to keep up with the latest developments. Each department is allocated a book-purchase allowance.

Training within the Youpaiyun operations team includes:

  • Turn the "ceiling" into a "supporting board": take on the management role of cultivating newcomers, and do not let yourself become the company's bottleneck or the employees' ceiling. Encourage newcomers to try new things and cope with failure so that they build their own skills and practical experience. Trust them, help them, and inspire them, and they will keep surprising you;
  • Produce "automation tools": draw on your own experience to abstract the business into program models, write automation scripts or train others to write them, improve the team's efficiency, and free up employees' energy and time to learn other new things;
  • Take on "advanced, precise, and specialized" projects: research the latest knowledge and analyze feasibility in advance, write it up as documents for open training, then hand it to the team for in-depth study and implementation. The team turns it into productivity, accumulates front-line experience, and feeds improvements back into the documents, forming a virtuous cycle;
  • Actively promote "knowledge sharing": cases and pitfalls are written up as wiki documents and shared through documents and regular talks. Employees are encouraged to write high-quality, highly readable documentation and to give training talks, which builds their persuasiveness and self-confidence;
  • Encourage "participation in open source exchanges": the company encourages employees to attend technical conferences. Working behind closed doors costs time and effort and is no match for guidance from experts. There are also funds for book purchases, team-building activities, and a coffee-break culture;

One typical career path for an operations engineer is to become a manager, but the problems that managers and senior operations engineers have to solve are completely different. For senior operations engineers who have just moved into management, can you share some of your experience?

For those who have just moved into management, my suggestion is to sort out the remaining technical debt promptly, and to take inventory of and cultivate the team's talent and skills. Lay a solid foundation first, and there will be more room for progress later. For reference, see my talk "Eight Honors and Eight Disgraces of DevOps":

  • 1. Take pride in being configurable; be ashamed of hard-coding
  • 2. Take pride in mutual backup; be ashamed of single points of failure
  • 3. Take pride in being restartable at any time; be ashamed of being unable to migrate
  • 4. Take pride in delivering as a whole; be ashamed of partial delivery
  • 5. Take pride in being stateless; be ashamed of being stateful
  • 6. Take pride in standardization; be ashamed of special cases
  • 7. Take pride in automated tools; be ashamed of manual, brute-force labor
  • 8. Take pride in unattended operation; be ashamed of manual intervention
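The first honor, configurable rather than hard-coded, can be illustrated with a minimal sketch in which every deployment-specific value comes from the environment and falls back to a default. The `ServiceConfig` type and the `APP_*` variable names are hypothetical, chosen only for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class ServiceConfig:
    listen_port: int
    upstream_hosts: Tuple[str, ...]
    request_timeout_s: float


def load_config(env: Mapping[str, str] = os.environ) -> ServiceConfig:
    """Read settings from the environment instead of hard-coding them.

    The APP_* names and the defaults are assumptions made for illustration.
    """
    hosts = env.get("APP_UPSTREAM_HOSTS", "127.0.0.1")
    return ServiceConfig(
        listen_port=int(env.get("APP_LISTEN_PORT", "8080")),
        upstream_hosts=tuple(h.strip() for h in hosts.split(",") if h.strip()),
        request_timeout_s=float(env.get("APP_REQUEST_TIMEOUT_S", "3.0")),
    )
```

Because the same artifact reads its settings at start-up, it can be restarted or moved to another machine without a code change, which also supports the third and fifth items in the list.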

Taking inventory of the team's talent and skill tree is mainly done together with human resources, by placing people on a nine-box talent grid (for development or operations staff, replace "performance" on the left axis with "potential"; "performance" is used for sales). What this tests is the manager's ability to assess every aspect of each employee and to put people to good use.

This is combined with the company's OKR goal management to motivate employees. Its advantage is that, while aligning everyone on goals, it also:

  • Stimulates personal self-motivation and encourages employees to innovate and reflect;
  • Tests relative results and encourages difficult challenges and breakthroughs;
  • Assesses the ability to collaborate, and encourages employees to coordinate and drive work across all areas;

Kubernetes cannot solve the problems of every scenario. Having observed it over the past few years, which companies do you think are not suited to Kubernetes? Can you sketch a portrait of such a company and explain why?

Although Kubernetes represents the best DevOps engineering practice so far (it really is that good), it does not fit every situation. Cloud CDN edge servers, data-center log analysis platforms, and Ceph distributed storage, for example, still run mainly on physical machines. I therefore suggest finding some suitable scenarios and trying it there first, such as:

  • Machine resources are badly wasted during off-peak hours;
  • CPU, disk, and network IO are not intensive;
  • There is no need for persistent storage or resource preemption;
  • The software architecture has already been refactored into microservices;
  • The business workload scales elastically on a periodic basis;

Operations and R&D are the closest of partners. How does your company draw the boundary between their work? Also, can you share some experience on how to keep these two roles working closely together?

Operations engineer = the general charging into battle; software engineer = the strategist in the command tent.

In theory, an excellent software engineer can do some (or even all) of an operations engineer's work. Take monitoring the performance of business software: if programmers insert enough hooks or probes into the program, the data can be collected directly, without laborious monitoring by operations. Or, if programmers design for database and table sharding, high concurrency, and distribution from the start, operations only has to scale out by adding machines. If only the software did not have so many bugs... there are many such ifs. Reality is harsh, however: there are too few programmers at this level, especially in China, where everyone is busy implementing business features and is unwilling to write documentation or even comments, let alone think things through so thoroughly. Operations engineers, for their part, come into contact with a lot of excellent, mature open source software, from which we can learn how good software is designed. A well-written program, for example, logs in great detail, and we can monitor it through standard syslog or its log files. Therefore, senior operations engineers will:

  • Actively participate in up-front planning, work with development on drills, automate deployment, and help improve the architecture;
  • Make reasonable requests for requirements and resources, ideally backed by a budget, to prevent problems before they happen;
  • Monitor production, review failures, and feed the results back to the whole team, pushing everyone to coordinate and improve;

Of course, reaching this level of operations management capability takes focused study, acting as a link between upstream and downstream, coordinating the team, and years of hard practice. By then, operations no longer just takes responsibility for outcomes after the fact; its role changes to leading and coordinating the whole process. "Ability" here means not only technical skill but also understanding the business and handling the allocation and control of the whole project and its resources from the company's management perspective. In reality, then, operations engineers and software engineers complement each other. Everyone has different abilities and focuses, so everyone must pull together to win the battle, and neither can do without the other. It is a process of learning and progressing together.
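The earlier remark about programmers inserting hooks or probes so that statistics can be collected without extra monitoring work can be sketched as a small decorator. This is only an illustration under assumptions: the `probe` decorator, the `call_stats` table, and the `resize_image` function are hypothetical, and a real service would export these counters to its monitoring system rather than keep them in memory.

```python
import functools
import logging
import time
from collections import defaultdict

logger = logging.getLogger("app.probe")

# In-memory counters keyed by function name (illustrative only).
call_stats = defaultdict(lambda: {"calls": 0, "errors": 0, "total_s": 0.0})


def probe(func):
    """Record call count, error count, and cumulative latency for `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stats = call_stats[func.__name__]
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            stats["errors"] += 1
            # Detailed log line that a log pipeline can pick up.
            logger.exception("probe: %s failed", func.__name__)
            raise
        finally:
            stats["calls"] += 1
            stats["total_s"] += time.perf_counter() - started
    return wrapper


@probe
def resize_image(width: int, height: int) -> int:
    """Hypothetical business function used only to exercise the probe."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    return width * height
```

With the probe built into the program, operations can read `call_stats` or the log stream directly instead of inferring behavior from the outside.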
Finally, my personal opinion: "architect" may not be one person's role but a collective name for a team that can:

  • Stay out of the front line yet keep the whole picture in view, strategize, and schedule all resources (the function of the operations architect);
  • Lead and unite the team, build a tall structure on solid ground, and deliver solutions that keep pace with the times (the function of the software architect);
  • Grasp the direction and depth of the company's business, negotiate partnerships, and control costs (the function of the business architect);

Operations needs to communicate and collaborate with many other departments. Since each team's goals and concerns may differ, cooperation may not go smoothly. What approaches have you used to make the process smoother?

In fact, most poor communication stems from the consequences being hard to foresee. You talk about redundancy and he talks about budget; you talk about architecture and he talks about deadlines. Everyone has their own position and difficulties, but no one is responsible for the result. I have found in my work that when a failure occurs, departments cooperate with unprecedented unity and are at their most effective. So the key to communication and collaboration is that it requires both teamwork and clear responsibilities:

  • In cross-department communication before the work starts, agree on project expectations, costs, influencing factors, the consequences of failure, and who is responsible;
  • In the post-failure review, "pass the buck" according to the cause of the failure and with solid evidence, while also taking the lesson to heart and fixing the gaps;

For example, to support 100,000 (10W) concurrent online users, we need redundant bandwidth and twice the number of servers (x 2). If the budget is insufficient and the resources are halved, the consequences and the responsible person are recorded. Another example is poor software design: performance monitoring reveals the abnormal indicators, their consequences, and the responsible person. Of course, if an alert is not handled in time or a manual operation fails, it is only fair that operations is held accountable as well. Failure culture means focusing on the problem itself, on the matter rather than the person. Everyone grows through failures and becomes stronger through reviews.
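The x 2 redundancy arithmetic in this example can be written down as a tiny helper, which makes the cost of halving the budget explicit during planning. The per-server capacity is a made-up figure for illustration; the interview does not give one.

```python
import math


def servers_needed(target_concurrency: int,
                   per_server_capacity: int,
                   redundancy: float = 2.0) -> int:
    """Servers required for a concurrency target, including a redundancy multiplier."""
    base = math.ceil(target_concurrency / per_server_capacity)
    return math.ceil(base * redundancy)


def capacity_after_failures(servers: int, per_server_capacity: int, failed: int) -> int:
    """Concurrency the fleet can still carry after `failed` servers go down."""
    return max(servers - failed, 0) * per_server_capacity


# Assumption for illustration: one server handles 5,000 concurrent connections.
full_budget = servers_needed(100_000, 5_000)         # x 2 redundancy
half_budget = servers_needed(100_000, 5_000, 1.0)    # budget halved, no redundancy
```

Under that assumption the full budget buys 40 servers and the halved budget 20, so with the halved budget a single server failure already drops capacity below the 100,000 target. Writing this into the plan beforehand is what lets the review assign responsibility with evidence.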

What do you think are the most important goals of operation and maintenance work? How did you achieve these goals?

Operation and maintenance automation;

Monitoring normalization;

Log visualization!

This is too long a topic to cover in detail here. You can refer to "Enlightenment and Architecture Design of Cloud Operation and Maintenance".

When it comes to tool selection, how do you decide whether to develop it yourself, use open source, or use commercial products?

Youpaiyun usually does not reinvent the wheel; we first make good use of existing wheels, or modify them to fit better. Choosing to build in-house generally presumes a certain development capability, plus a compelling reason such as:

  • We cannot find open source software that meets the requirements, as with our self-developed cloud processing software...
  • The open source software has bugs or issues that the community cannot resolve in the short term, but the business needs a fix urgently, so we have to solve it ourselves, as with the memory leak problem of ats...
  • The features of the open source software do not match the company's business, so the software has to be modified; for example, the anti-hotlink module of nginx needs to be customized for customers...
  • The open source software has ambitious, general-purpose design goals and is versatile but bloated. If we only need one small function, we do not need something so elaborate, such as deciding where to place performance probes...
  • There are data protection or privacy requirements...

More and more companies are moving to the public cloud. Under a cloud native architecture, have the core functions of the SRE team changed? How should the team demonstrate its value?

Public cloud serves as the IaaS base, container cloud as the CaaS middle layer, and cloud native as the SaaS application layer. The whole cloud ecosystem changes by the day, and the core functions of the SRE team will focus more on top-level, systematic capacity planning, metrics monitoring, high availability, and distributed elastic design. Complementing functions across platforms and departments, collaborating as a team, improving continuously, and having the courage to take responsibility therefore include:

  • Actively participate in up-front planning, work with development on drills, and help improve the architecture;
  • Make reasonable requests for availability requirements and redundant resources, ideally backed by a budget, to prevent problems before they occur;
  • Monitor production, analyze failures, and feed the results back to the whole team, pushing everyone from top to bottom to coordinate and improve;

The value of a team lies in whether it can keep embracing new things and new challenges and play to its strengths, rather than being a frog at the bottom of a well or a frog slowly boiled in warm water. Then, when innovation or disruption arrives, we will not be left behind by the times.

For individual operations engineers, what is the path for transitioning to SRE? What should they pay attention to?

Technical field

  • Learn to abstract business models, standardize components, customize scripts, and automate deployment to improve overall efficiency;
  • Learn log collection, log analysis, and visualization to improve the efficiency of operations monitoring and early-warning alerts;
  • Mastering one or several programming languages helps you grow and makes you more effective;
  • Take notes frequently, review the old to learn the new, combine learning with thinking, learn to accumulate, and draw inferences from one example;
  • Be brave enough to face the challenges of emerging technologies, and learn them if you can't beat them;

Non-technical field

  • Learning ability: build broad knowledge;
  • Communication: understand the customer's precise needs;
  • Trade-offs: weigh technical risk, labor, schedule, and other costs;
  • Community activities: share actively and practice public speaking and communication skills;
  • Influence: raise your profile, learn to work alongside others, and make more friends;

Faced with the rapid development of basic technologies, do you have any career planning suggestions for operation and maintenance personnel who have just entered the industry and those who have been in the industry for a long time?

First of all, it is not the job that chooses the person, but the person who chooses the job. If someone is interested in something and has truly studied it hard for nearly 10,000 hours, they can do almost anything. When I graduated, for example, the emphasis was on well-rounded talent and there was no such thing as an operations role. We not only built our own machines (DIY) and taught ourselves the Linux operating system, but also learned programming, tinkered with networks, and wrote our own programs such as forums and chat rooms. Linux brought us innovative, fun, and excellent open source software every day, which kept our passion for tinkering and learning alive. When the opportunity came with the rise of the Internet, becoming an operations director followed naturally. Beyond that, I have also moved into pre-sales and technical support, worked in the field with the market, and often given talks and training. So a real master is someone for whom nothing is beyond learning, for whom extra skills are never a burden, and who is an operations engineer that understands the business and can also write code.

What do you think is the most important quality for operation and maintenance personnel? What messages do you have for new operations and maintenance personnel?

I think the most important ability is expression and communication, though this does not rule out the technical knowledge, hands-on skills, programming skills, and learning ability that operations itself requires. Since operations is still mostly a cost center, it takes skill to turn esoteric performance and bottleneck indicators into intuitive charts that win continued investment from upper management. Then, facing your colleagues and sibling departments, you also need influence to coordinate and drive the work. If you can do this, you have leadership ability, which will put you on a higher level in everything you do later: you can take an overall view to coordinate and plan an entire project, and allocate and control goals, people, schedules, and resources sensibly.

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it removed.