
Zuoyebang Nie An: How to transform operation and maintenance, listen to Zuoyebang's OPaS ideas

PHPz, reposted 2023-06-08 21:12:27



In the first issue, the boss of Yangqingjing shared many interesting views; some readers commented that it read like a guide to persuading people to quit operations and maintenance (O&M). The guest in this issue holds a different view. Please keep an open mind, hear out the different schools of thought, and make your own career and life plans. As the saying goes, listen to both sides and you will be enlightened; heed only one and you will be left in the dark. If you only listen to what suits your ears, you will likely miss out on deeper thinking and the clash of ideas, which would be a pity.


This is the second issue of the down-to-earth yet high-level "Operation and Maintenance Forum". Let's get started!

Guest Introduction

In this issue we invited Nie An, head of operations and maintenance (O&M) at Zuoyebang. Nie An is an industry veteran who has worked at Alibaba, Xiaomi, Didi, and Zuoyebang, with more than 10 years of experience in O&M, R&D, and management.

Key Points

  • Traditional O&M. It assembles industrial products into services, delivers them to users, and keeps the services running; it is characterized by strong dependence on the business.
  • Domain crisis. In the cloud native era, public clouds are widely used, microservice architecture and DevOps have been put into practice at scale, and the tool ecosystem keeps flourishing. Traditional O&M responsibilities are continually being outsourced, transferred, and replaced, and a domain crisis has emerged.
  • Organizational structure. Collaboration is gradually shifting from person-to-person collaboration to platform self-service, and the main theme of O&M changes from horizontal collaboration to service products and a technology middle platform.
  • O&M transformation. Technically, O&M service capabilities are offered through a self-service platform as OPaS (OP as Service), divided into two layers: objects and scenarios. The underlying objects are kept isomorphic, forming a sustainable O&M architecture.
  • Business O&M. The core of service-oriented transformation is role perception. O&M staff must shift from an operational role that depends on the business to an independent O&M service provider; the hyper-service perspective offers business O&M great potential.
  • Component O&M. Mastering the component itself goes a step beyond pure O&M management. It follows the onion model: start from resource delivery, build a management platform, and then go deep into the professional field of the component itself.
  • O&M development. Stripped of repetitive platform iteration work, it focuses on the O&M middle platform, with technical specialization and high leverage.

Stages of Operation and Maintenance

Internet O&M has gone through several stages: purely manual work, standardization, platformization, and data intelligence, as shown in the figure below. Among these, DevOps is a technology-driven organizational change rather than a change in the profession itself.

[Figure: stages of operation and maintenance, from manual work to data intelligence]

From the development history of operation and maintenance, we can see several characteristics:

  • Inheritance. Each new stage tends to inherit and carry forward the best experience of the previous one while innovating in concepts, technologies, and organization. For example, platformization inherits and strengthens the results of the standardization stage, and data intelligence inherits the achievements of platformization and introduces big data technology.
  • Responsibility transfer. DevOps is a watershed in the O&M management model. After DevOps, O&M on the one hand continues to specialize, maintaining the ability to manage higher-level O&M objects isomorphically; on the other hand, it emphasizes the integration of O&M and R&D, and O&M responsibilities are gradually transferred to business R&D.

Studying the development history of a field lets us learn from the past and ride the trend.

Traditional Operation and Maintenance

In the traditional O&M model, the objects served can basically be divided into three layers. The lowest layer is the hardware infrastructure (IaaS), which mainly consists of computing, network, and storage; the middle layer is the software infrastructure, including operating systems, virtualization technology, code frameworks, middleware, and so on; the top layer is the business layer, mainly application services.


The traditional responsibility of O&M is to assemble industrial products into services through a series of processes, technologies, and methods, deliver them to users, and keep the services running. It is usually expected to meet goals along several dimensions, such as stability, cost, security, and efficiency (operability). To a certain extent, traditional O&M has to attach itself to the business to generate value; many companies treat business understanding as one of the main criteria for assessing O&M staff (dependence). With the spread of cloud computing and cloud native technology, the traditional O&M model faces many challenges. For example:

  • Once an enterprise adopts public cloud, IaaS, PaaS, and even SaaS are largely delivered as services and available through APIs. A large amount of O&M construction work, such as hardware, systems, networks, databases, and big data, is completed with the help of cloud vendors; the enterprise itself only needs to retain a small amount of expertise in selection and integration (outsourcing).
  • With the spread of cloud native technology, microservice architecture and DevOps have been realized at scale. Operations previously performed by professional O&M staff, such as delivery, change, monitoring, and capacity management, are gradually handed over to business R&D as self-service. O&M responsibilities are largely transferred to business R&D (transfer).
  • The concentration of expertise in public clouds and the cloud native open source ecosystem keep improving the outlook for tooling. As tools improve efficiency, the same job requires fewer people; as tools encapsulate professional expertise, the technical threshold for operators keeps falling; and as tools evolve toward automation and intelligence, machines can replace people. The replacement of labor by platforms continues to deepen (replacement).

In summary, infrastructure is outsourced to public clouds and cloud native platforms, O&M responsibilities are transferred to business R&D, and platforms replace human expertise. Faced with these trends and facts, O&M practitioners need to transform.

Organizational Structure

First let’s talk about the organizational structure. In the long term, the organizational form of a company in the cloud native era will consist of the following parts:

[Figure: organizational structure of a company in the cloud native era]

At the top, end users are the enterprise's customers and its potential source of revenue. The business team is responsible for end users, with roles including product, business, marketing, and so on. Business R&D directly serves the business team, mainly providing SaaS applications and services. Platform R&D serves business R&D, provides various PaaS capabilities, and encapsulates cloud vendors. There are also some cross-functional organizations, such as cost operations (FinOps), efficiency operations (EP), and the IT administration team.

In the new organizational structure, everyone's ultimate goal is to do their own job well and serve end users well. The business team pays more attention to business value, while the R&D organization focuses on service quality. As information technology advances, the functions currently performed by cross-functional organizations will gradually be absorbed by the platform R&D team, and the main mode of organizational collaboration will be upgraded from person-to-person collaboration to platform self-service. O&M therefore has a new job goal: its main theme is the management platform and the resource and technology middle platform, not horizontal collaboration. O&M must apply high technical leverage, empower the business, and help the enterprise improve operating efficiency.


Technical Architecture

The goal of O&M transformation is to provide O&M management services to upper-level teams through a self-service platform; in essence, this is O&M as a service, OPaS (OP as Service). By content, O&M work can be divided into two categories: object management and scenario management, as shown in the figure below.

[Figure: object management and scenario management]

Object management is a vertical model: it revolves around O&M objects and builds a life cycle management platform for each. O&M objects can be classified along dimensions such as IaaS resources (machines, network, storage, cloud services), PaaS components (database, cache, MQ, gateway), SaaS applications (business middle platform, business applications), and service frameworks (runtime, code framework, name service); the classification granularity differs from company to company. Each type of object has an independent management platform (a "chimney"). The management platform should cover the complete life cycle of the O&M object, with key stages including modeling (metadata), delivery/change, monitoring/measurement, and decommissioning, similar to the management functions of a public cloud. The goal of object management is to produce vertically complete cloud products and build an internal cloud platform, ICSP.
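
To make the life cycle idea concrete, here is a minimal Python sketch of how a management platform might track one O&M object through the stages named above. The class names, the stage names, and the allowed-transition graph are all assumptions made for illustration; they are not taken from Zuoyebang's platform.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    MODELED = "modeled"      # metadata registered
    DELIVERED = "delivered"  # resource handed over; changes are re-delivered here
    MONITORED = "monitored"  # under monitoring/measurement
    OFFLINE = "offline"      # decommissioned


# Allowed transitions; this particular graph is an assumption for illustration.
ALLOWED = {
    Stage.MODELED: {Stage.DELIVERED},
    Stage.DELIVERED: {Stage.MONITORED, Stage.OFFLINE},
    Stage.MONITORED: {Stage.DELIVERED, Stage.OFFLINE},
    Stage.OFFLINE: set(),
}


@dataclass
class OpsObject:
    kind: str                                   # e.g. "database", "cache", "machine"
    name: str
    metadata: dict = field(default_factory=dict)
    stage: Stage = Stage.MODELED
    history: list = field(default_factory=list)

    def advance(self, target: Stage) -> None:
        """Move to another stage, rejecting transitions the platform does not allow."""
        if target not in ALLOWED[self.stage]:
            raise ValueError(f"{self.stage.value} -> {target.value} is not allowed")
        self.history.append((self.stage.value, target.value))
        self.stage = target


# Hypothetical object walking through its whole life cycle.
db = OpsObject(kind="database", name="orders-db", metadata={"owner": "trade"})
db.advance(Stage.DELIVERED)
db.advance(Stage.MONITORED)
db.advance(Stage.OFFLINE)
```

Because every object type goes through the same stages, the same state machine can in principle be reused by each vertical platform, which is one way to read the "similar to a public cloud" remark.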

Scenario management is a horizontal model: it manages the life cycle stages of various O&M objects by O&M scenario. O&M scenarios, such as delivery/change, monitoring/measurement, multi-cloud, and cost, map closely to the working habits of business R&D, cover a small number of high-frequency scenarios, and are similar across companies. Each type of scenario has an independent scenario management platform, such as a work order center, data center, or FinOps platform. Scenario management builds on object management: the scenario management platform manages O&M objects by unifying models, aggregating data, and orchestrating management and control APIs. The goal of scenario management is to provide self-service management capabilities to the business and build an internal developer platform, IDP.
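
As a rough illustration of "scenario management builds on object management", the Python sketch below has a cost scenario aggregate data from two object platforms through one shared interface. The platform classes, the `list_objects` method, and the sample figures are hypothetical and exist only for this example.

```python
from __future__ import annotations

from typing import Protocol


class ObjectPlatform(Protocol):
    """Unified model that every vertical object platform is assumed to expose."""
    kind: str

    def list_objects(self) -> list[dict]: ...


class DatabasePlatform:
    kind = "database"

    def list_objects(self) -> list[dict]:
        return [{"name": "orders-db", "owner": "trade", "monthly_cost": 1200.0}]


class CachePlatform:
    kind = "cache"

    def list_objects(self) -> list[dict]:
        return [
            {"name": "session-cache", "owner": "trade", "monthly_cost": 300.0},
            {"name": "feed-cache", "owner": "content", "monthly_cost": 450.0},
        ]


def cost_by_owner(platforms: list[ObjectPlatform]) -> dict[str, float]:
    """Cost scenario: aggregate spend per owning team across all object platforms."""
    totals: dict[str, float] = {}
    for platform in platforms:
        for obj in platform.list_objects():
            totals[obj["owner"]] = totals.get(obj["owner"], 0.0) + obj["monthly_cost"]
    return totals


report = cost_by_owner([DatabasePlatform(), CachePlatform()])
```

The scenario code never needs to know how a database or a cache is operated; it only relies on the unified model, which is why the isomorphism of the underlying objects matters so much.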

O&M objects commonly come from in-house development, building on open source, or external procurement (public cloud). Each O&M object can be further subdivided into categories, clusters, instances, and so on, reaching unprecedented scale and complexity. Only by keeping the management characteristics of O&M objects isomorphic can we build and maintain O&M services at scale and at low cost, and thereby achieve large-scale O&M (the technical leverage effect). The isomorphism of O&M objects is therefore the foundation and premise of the entire O&M architecture.

Isomorphic maintenance

Isomorphism maintenance targets the management characteristics of O&M objects, not all of their characteristics. The method is to control the increment, repair the inventory, and prevent fission. As shown in the figure below, the platform delivers demand and so controls the increment; measurement drives governance to repair the inventory; and a standardized service framework prevents large-scale fission of the technical system. Platforms and metrics strictly follow the specifications, and the specifications in turn are improved using the problems that metrics and platforms surface; the three complement each other. Specifications are divided into service specifications (corresponding to service governance), management specifications (corresponding to O&M control), and other types.

[Figure: isomorphism maintenance through platform, measurement, and specifications]
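
The "control the increment, repair the inventory" loop can be sketched in a few lines of Python. The three rules in `SPEC`, the field names, and the sample applications are invented for illustration; a real management specification would be defined by the O&M team.

```python
# Hypothetical management specification: each rule checks one management characteristic.
SPEC = {
    "has_owner": lambda app: bool(app.get("owner")),
    "uses_standard_framework": lambda app: app.get("framework") in {"std-go", "std-java"},
    "exposes_metrics": lambda app: app.get("metrics_endpoint") is not None,
}


def violations(app: dict) -> list:
    """Return the names of the rules this application breaks."""
    return [rule for rule, check in SPEC.items() if not check(app)]


def admit(app: dict) -> bool:
    """Control the increment: the delivery platform only accepts compliant applications."""
    return not violations(app)


def compliance_report(inventory: list) -> tuple:
    """Repair the inventory: measure drift so that governance has a concrete work list."""
    drift = {app["name"]: violations(app) for app in inventory if violations(app)}
    ratio = 1 - len(drift) / len(inventory) if inventory else 1.0
    return ratio, drift


inventory = [
    {"name": "order-api", "owner": "trade", "framework": "std-go", "metrics_endpoint": "/metrics"},
    {"name": "legacy-report", "owner": "bi", "framework": "custom-php", "metrics_endpoint": None},
]
ratio, drift = compliance_report(inventory)
```

Here the same specification feeds both the admission check and the measurement, mirroring the point that platforms and metrics follow the specifications while their findings feed back into them.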

Maintaining isomorphism relies on an organizational division of labor with clear responsibilities. For example, O&M focuses on management, stripping off business tools such as status quo governance, alarm response, and CD and returning them to business R&D; business R&D focuses on business implementation, stripping out the non-business logic of the service framework, such as service discovery and traffic control, and handing it to the infrastructure team; the infrastructure team focuses on middle platform capabilities such as the service framework, stripping off management functions such as demand delivery and change control and handing them to O&M. The influence of culture cannot be ignored either. O&M and architecture teams spread concepts and cultivate user habits through communication and guidance, for example by not providing SLA commitments for personalized needs while providing out-of-the-box observability for standard applications.

Building on the isomorphism maintenance of O&M objects and supporting the service-oriented O&M technology system above it, a sustainable O&M architecture takes shape, as shown in the figure below. At the current technical level, O&M services based on self-service platforms can meet about 70% of needs; the remaining 30% still require manual work, such as requirements communication, troubleshooting, result acceptance, and policy compliance. As technology and concepts advance, the share handled by O&M services is expected to increase further.

[Figure: sustainable operation and maintenance architecture]

Note: The service framework in this article covers not only the code frameworks and code libraries of years past, but also today's popular microservice governance. This is a transitional stage, and a better name is urgently needed.

Transformation Practice

Operation and Maintenance as a Service OPaS

Business operation and maintenance, also called application operation and maintenance, is the closest to cloud native and has been hardest hit. In addition to traditional cross-team responsibilities such as specification formulation, process construction, and global management, business operations and maintenance must be transformed in a service-oriented direction. The path is as follows:

[Figure: service-oriented transformation path for business operation and maintenance]

  • First, role perception needs to change. Shift from an operational role that relies on the business to generate value to an O&M service provider with independent value. The role change is the key.
  • In the organization, re-divide the main responsibilities. Business R&D is the party primarily responsible for the application; O&M is neither the primary owner of the application nor its add-on nanny, but the provider of management capabilities for it. Business R&D uses O&M services and carries out the operational work itself.
  • In the mechanism, rebuild the evaluation system. The performance of business O&M positions is no longer strongly tied to business teams and business R&D; it focuses more on service-oriented O&M, with less weight on subjective evaluation and more on technical evaluation.
  • Second, the four steps of O&M transformation: clarify the object --> abstract commonality --> build the platform --> achieve large-scale O&M.
  • The object of business O&M is first the application (also called the service), and then the application's extended scenarios (such as the business perspective and the company-wide perspective).
  • Abstracting commonality is both the difficulty and the key. Applications are numerous, their technology stacks are complex, and they have many personalized features. It is necessary to abstract the common management characteristics of applications to avoid getting bogged down in individual cases. Strictly speaking, it is the common characteristics of applications that are the objects of O&M management.
  • "Build the platform" refers to the application management platform; large-scale O&M is the sustainable end state.
  • Third, keep application objects isomorphic. Beyond building service-oriented capabilities, O&M staff should invest most of their energy in isomorphism maintenance.

O&M as a service, OPaS (OP as Service), was the goal we proposed midway through our transformation from the perspective of business O&M. It pointed out the general direction but lacked a path and was relatively abstract. Later, OPaS was gradually refined into the O&M architecture of ICSP and IDP, and its scope was extended to the entire O&M team, which gave it a clear path and starting point.

Hyper-service Perspective (Business Operation and Maintenance)

In addition to servitization, business O&M can also lead the construction of the hyper-service perspective (now renamed "scenario"). The DevOps technology puzzle under cloud native is not complete: it has only covered the application computing part, and there are capability gaps in other directions, especially the upward-looking business perspective, department perspective, company perspective, and so on. Let's call this the hyper-service perspective. Business R&D staff usually have neither the ability nor the motivation to take the lead here; department heads or architects can look after their own departments but, limited by their job responsibilities, find it hard to extend to the whole. The hyper-service perspective, on the other hand, is the old battlefield of traditional business O&M, which has unmatched experience, understanding, and cognitive advantages there. By leading the construction of the hyper-service perspective, business O&M can fill the gap in the cloud native field, bring its professional strengths into play, and seize the opportunity to transform. It is a win-win choice, as shown below.

[Figure: the hyper-service perspective]

The hyper-service perspective includes, but is not limited to:

  • Demand delivery: work order center, orchestration engine, execution engine
  • Change control: five catch-all rules, centralized management and control, orchestration approval, execution approval, service inspection, change measurement
  • Observation and measurement: aggregate and display observation and measurement data from the business perspective, with drill-down to application granularity
  • Multi-cloud architecture: measurement, governance, planning, and drills across the entire technical system
  • Cost control: billing, apportionment, management, control, and optimization of all the company's IT resources, handled independently as the FinOps direction
  • Standard formulation: formulate O&M specifications from a company-wide perspective and supervise process implementation, to avoid chimney-style duplicate construction by small teams
  • etc.

Looking downward, the DevOps technology puzzle under cloud native also has capability gaps. For example, support for basic services such as CDN, object storage, MQ, and EMR is not yet mature and was still in an exploratory period in 2022. From the perspective of O&M management, anything covered by the service framework (authentication, discovery, communication, perception, flow control) can be regarded as managed by cloud native.

Onion model (cloud services, middleware, big data operation and maintenance)

For O&M objects such as cloud services, middleware, and big data, the technology stack is convergent and the expertise is focused. When carrying out the transformation, O&M staff can follow the onion model.


  • The first stage is based on resource delivery: turn the original O&M objects into resource entities, ensure that services can be delivered to upstream teams, and establish the baseline value of the job.
  • In the second stage, invest heavily in building a management platform that manages the life cycle of resource entities, and free yourself from manual work. The platform must be able to offer ToC self-service to achieve decoupling.
  • The third stage is to go deep into the professional field of the component itself and build expertise in all aspects, such as architecture, code, performance, and O&M. At this point, O&M has become a service expert in the field, not just an administrator.

The onion model was first verified in database, big data, middleware, and other positions, and was later adopted for cloud services, where it also succeeded. For example, our company's cloud service O&M team, CloudOps, carried out its transformation according to the onion model, as follows.

  • The objects of this team are various cloud services, spread across cloud vendors such as Tencent, Alibaba, and Baidu.
  • Two years ago, we provided machines, storage, and other resources through various manual methods to support the rapid growth of the business (resource delivery).
  • After that, we started building a multi-cloud management platform to manage the life cycle of cloud services such as machines, bandwidth, object storage, and CDN. In this process, the CloudOps management platform successfully became the company's internal secondary cloud service provider, ICSP (platform capability).
  • Next, we will keep strengthening our learning, understanding, selection, and evolution of public cloud products, and strive to build more expertise in this field (the component itself).

Operation and maintenance middle platform (operation and maintenance development)

As roles such as business O&M, component O&M, and system O&M (resources, network, cloud services) began to take part in development work, the space left for the O&M development (OpDev) team gradually shrank, and the division of labor became unclear during the transformation. Drawing on the anticipated upgrades to organizational structure and technical architecture, we repositioned OpDev: it should not be outsourced development for, or a vassal of, O&M staff, but should offer its own independent services. The original O&M platform was therefore split into two parts. One part, focused on functional iteration and not reusable, was left to its original users to maintain themselves, such as the IDP resource console and ICSP scenario management tools. The other part, the public functions, was abstracted into the O&M middle platform owned by OpDev, such as unified account management (IAM), the work order orchestration engine, and the monitoring metrics collector, as shown below.

[Figure: the operation and maintenance middle platform]

The O&M middle platform is a subset of the original O&M platform. It does not need to rebuild domain knowledge, but it does need fresh domain abstraction and modeling and has relatively high code quality requirements (the same as for basic components). These are exactly the strengths of our OpDev colleagues. As responsibilities become more concentrated and narrower, OpDev must slim down at the same time and achieve higher leverage.

Some lessons

Here are some lessons from our company's transformation:

  • Compromise between transformation and staying put. The shift from traditional O&M to service provider will not happen overnight, nor will every employee move over; some will always stay behind (at the current technical level, about 73%). Once resources are concentrated, those who stay behind will receive greater returns.
  • Gradient of R&D capability. The development skills of colleagues moving from O&M to development are uneven. Start from iterating on business requirements, strictly control design and acceptance to ensure quality, deliberately fill in engineering theory, and provide excellent O&M middle platform capabilities to guarantee a clean foundation.
  • Platforms are not the only option. A platform is the most powerful way to carry service capabilities, but definitely not the only one. Organization, culture, norms, processes, and platforms are all indispensable (though the transfer cost may be slightly higher).
  • Be clear about the object of O&M management. In O&M, especially application O&M, the object being managed is not the application itself but the common characteristics of applications; the more characteristics applications share, the greater the value of application O&M (leverage).
  • Organizational guarantees cannot be ignored. Organizational structure is the primary productive force. The CTO must take action, with clear goals and a clear division of labor, such as clarifying primary responsibilities, setting up an independent acceptance body, and running measurement and governance cycles. This is the organizational guarantee for O&M transformation.
  • Beware of pure project thinking. O&M still needs to take part in some projects to deliver a burst of value and gain a short-term sense of accomplishment, but afterwards it is easy for people to drift away and the value to drop to zero; goals must be designed deliberately so that service capabilities accumulate during the project.
  • Prevention is more effective than emergency response. Stability problems need to be solved at the architecture level. Prioritize extending MTBF, then shortening MTTR.
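
One way to read the last lesson is through the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The Python sketch below compares doubling MTBF with halving MTTR; the baseline numbers (720 hours between failures, 2 hours to repair) are assumed purely for illustration.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def failures_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected number of failure/repair cycles in one year."""
    return HOURS_PER_YEAR / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 720 hours, 2 hours to recover.
baseline = (720.0, 2.0)
longer_mtbf = (1440.0, 2.0)   # prevention: failures become half as frequent
shorter_mttr = (720.0, 1.0)   # emergency response: recovery becomes twice as fast

same_availability = abs(availability(*longer_mtbf) - availability(*shorter_mttr)) < 1e-12
fewer_incidents = failures_per_year(*longer_mtbf) < failures_per_year(*shorter_mttr)
```

Under these assumptions both options yield the same availability, but doubling MTBF also halves the number of incidents the team has to handle, which is consistent with putting prevention first.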

The following is additional content, not the core of this article.

The evolution of demand delivery

Both public clouds and internal K8S platforms involve a large number of demand delivery operations. This kind of ToM (ToManager) delivery platform often lacks the necessary constraints and can only be opened to experienced people.

To optimize the division of labor and improve efficiency, the O&M management plane can be made ToC (ToRD) through "work order orchestration and approval". Because the workflow/work order itself embeds O&M management best practices, it can be safely opened to R&D. This is an important direction for turning O&M capabilities into services. The evolution path of self-service delivery is as follows:

[Figure: evolution path of self-service delivery]

Currently, the communication link from requirements to technical solutions is relatively difficult to turn into self-service or to automate; more attempts are needed in the future.
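
The idea that a work order can embed best practices and therefore be opened safely to R&D can be sketched as follows. This is a minimal Python illustration, not the actual work order engine: the `WorkOrder` fields, the `MAX_REPLICAS` guard rail, and the three execution steps are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_REPLICAS = 20  # hypothetical guard rail encoded once by O&M


@dataclass
class WorkOrder:
    requester: str
    service: str
    replicas: int
    approved_by: Optional[str] = None
    log: list = field(default_factory=list)


def validate(order: WorkOrder) -> None:
    """Constraints that a raw ToM console would leave to the operator's experience."""
    if not 1 <= order.replicas <= MAX_REPLICAS:
        raise ValueError(f"replicas must be between 1 and {MAX_REPLICAS}")


def approve(order: WorkOrder, approver: str) -> None:
    """Approval step: someone other than the requester signs off on a valid order."""
    if approver == order.requester:
        raise PermissionError("requesters cannot approve their own work orders")
    validate(order)
    order.approved_by = approver


def execute(order: WorkOrder) -> list:
    """Run the orchestrated steps in a fixed order once the work order is approved."""
    if order.approved_by is None:
        raise PermissionError("work order has not been approved")
    for step in ("precheck", "apply", "inspect"):
        order.log.append(f"{step}:{order.service}")
    return order.log


# A developer scales a hypothetical service without touching the underlying console.
order = WorkOrder(requester="alice", service="order-api", replicas=4)
approve(order, approver="bob")
steps = execute(order)
```

The requester only fills in parameters; validation, approval, and the step sequence live in the work order itself, so the operation can be self-served without the requester needing O&M experience.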

Marginal Point of Scale Operation and Maintenance

The economic essence of large-scale O&M is marginal cost: the interplay between the diminishing marginal cost of O&M management and the increasing marginal cost of isomorphism maintenance. As shown in the figure below, when the number of O&M objects is small, O&M management, such as building platforms and manual operations, accounts for most of the cost; as the number of O&M objects grows, isomorphism maintenance becomes the main cost. The marginal turning point is affected by environmental factors such as technology and concepts.

[Figure: marginal cost curves of operation and maintenance management and isomorphism maintenance]

Cloud native technology reduces the difficulty of maintaining isomorphism (shifting the isomorphism maintenance curve to the right) and improves O&M service capabilities (shifting the O&M management curve downward), allowing O&M staff to manage more O&M objects at lower cost and thus significantly improving production efficiency.
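
A toy model can make the turning point tangible. The Python sketch below assumes, purely for illustration, that the marginal cost of managing the n-th object falls as a / n and that the marginal cost of keeping it isomorphic grows as b * n; neither functional form nor the parameter values come from the article.

```python
from typing import Optional


def marginal_management_cost(n: int, a: float) -> float:
    """Cost of managing the n-th object; falls as the platform is amortized (assumed a / n)."""
    return a / n


def marginal_isomorphism_cost(n: int, b: float) -> float:
    """Cost of keeping the n-th object isomorphic; grows with scale (assumed b * n)."""
    return b * n


def turning_point(a: float, b: float, max_objects: int = 100_000) -> Optional[int]:
    """Smallest object count at which isomorphism maintenance becomes the dominant marginal cost."""
    for n in range(1, max_objects + 1):
        if marginal_isomorphism_cost(n, b) >= marginal_management_cost(n, a):
            return n
    return None


# Hypothetical parameters: cloud native makes isomorphism cheaper, i.e. a smaller b.
before = turning_point(a=400.0, b=0.25)
after = turning_point(a=400.0, b=0.0625)
```

In this toy model, lowering b moves the turning point from 40 objects to 80, which matches the direction of the rightward shift described above; the absolute numbers carry no meaning.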


Statement:
This article is reproduced from 51cto.com. In case of infringement, please contact admin@php.cn for removal.