
How to simplify cloud native operations and maintenance

王林
2023-04-08 20:31:04

Cloud computing brings resource consolidation, efficiency, elasticity, and business agility, but it also poses unprecedented challenges for cloud operation and maintenance. How to meet the challenges of new technology trends, build an intelligent monitoring platform for the cloud era, and better safeguard cloud applications is a hard problem that every enterprise faces today.

In the eighth session of the [T·Talk] event series, 51CTO Content Center invited Zhang Huaipeng, VP of Chengyun Products, to the live broadcast room to share his experience and thinking on building digital observation tools for the cloud era. [T·Talk] has compiled the highlights of this session below, and we hope you find them useful:

Pain points of digital operations under the wave of digital transformation

Digital transformation and the building of the digital economy are major trends of our time. Digital transformation has been called the fourth industrial revolution. How we work, pay, shop, and travel is constantly shaped by digitalization. Put simply, we have moved from the traditional IT era into the DT era.

In the DT era, digital transformation has all but redefined how enterprises run their business and how that business is experienced. However, as digital transformation deepens across industries, more and more incidents involving digital applications have begun to surface. For example, the health code outages and nucleic acid testing system failures in certain provinces and cities at the beginning of the year had a huge impact on society.

According to the survey, 60% of CEOs believe that digital transformation is very important, and under their leadership enterprises are making great strides toward digital transformation and artificial intelligence. In sharp contrast, 95% of enterprise applications are not effectively monitored or attended to.

Most current digital operation methods date from the traditional data center era, and many of the tools and technologies were not designed with cloud computing in mind. As cloud computing has spread, the IT landscape has changed dramatically. Application complexity has exploded: applications are increasingly distributed, their dependencies are increasingly complex, and software iterates faster and faster. In this situation, enterprises urgently need a solution for the DT era that is built around business and data flows.

The DT era has produced many new technologies and scenarios, such as cloud native, which is currently very popular. The requirements of cloud native have accelerated the evolution from traditional operation and maintenance to application operation and maintenance. Traditional scenarios involve a large amount of infrastructure, but as businesses move to the cloud, that infrastructure is hosted by operators. Enterprises no longer need to handle machine room management, low-voltage electrical management, hardware monitoring, bare metal monitoring, or UPS power distribution, and no longer need to worry about electricity, temperature, and humidity. Traditional equipment operation and maintenance has therefore evolved into application-focused operation and maintenance centered on site reliability, and enterprises will invest less and less in traditional operation and maintenance.

We are currently in the transition to the stage of intelligent operation and maintenance. What needs to be done now is to make digital and IT operation and maintenance lighter, faster, and less expensive. The operation and maintenance team needs to focus its energy on the enterprise's business itself, because the business is what operation and maintenance personnel most need to pay attention to. This is what drives the demand for intelligent operation and maintenance.


Typical technical paths for enterprises toward intelligent operation and maintenance

1. What is intelligent operation and maintenance

Forrester and Gartner have both defined intelligent operations in their reports: AIOps is a software system that applies AI and data science to business and operations data in order to establish correlations and provide real-time prescriptive and predictive answers. Because AIOps is a software system, it can be delivered as a concrete product. AIOps can enhance and partially replace the main traditional IT operation and maintenance functions, including availability and performance monitoring, event correlation and analysis, and IT service management and automation.

AIOps is oriented to Operations. Operations need to cover the three aspects of observation, management and disposal. However, the current overall level of the industry is more focused on the observation level. Forrester also gave a classic statement on this: AIOps promises stronger observability and stability.

Forrester believes that one of the core values of AIOps today is to strengthen proactive, pre-incident capabilities and to improve and extend observability.

2. What is observability

Observability originated in control theory, where it refers to the degree to which a system's internal state can be inferred from its external outputs. In IT, Gartner defines observability as a characteristic of software and systems: the ability to determine the current state and condition of a system from the telemetry data it generates.

Why is observability needed?

Traditional monitoring technologies and tools struggle to track communication paths and dependencies in today's increasingly distributed architectures. In cloud and cloud-native scenarios, dependencies are very complex, quite unlike those of traditional monolithic applications. Observability makes complex systems easier to control. Through the three data pillars of observability, we can understand all aspects of a complex system intuitively and in detail.

Observability serves not only operations and maintenance but also development, SRE, support, marketing, and business departments. Integrating AIOps and observability into a single platform therefore yields a product that serves both purposes at once.

3. Two typical technical paths for enterprises toward intelligent operation and maintenance (AIOps)

The two typical technical paths toward intelligent IT operation and maintenance can be summarized as "plug-in AIOps" and "endogenous AIOps". Plug-in AIOps attaches the AIOps platform to the enterprise IT operation and maintenance environment as a bypass. Here AIOps is an independent algorithm platform that ingests the enterprise's heterogeneous data; data engineers then sort out the dependencies among the data and use big data processing technology to deliver on a project basis.

Endogenous AIOps emphasizes an integrated technical route. The AIOps engine closes the loop over the entire data processing flow without the involvement of data engineers. The process resembles express delivery, where the sender's items correspond to the data. After picking up the items, the courier handles packaging, warehousing, dispatching, transportation, and so on, and in the end the recipient simply receives the item; neither the sender nor the recipient has to deal with any of the steps in between. Endogenous AIOps emphasizes exactly this capability, embedding AI into an integrated observation platform.

Differences in technical implementation:

Plug-in AIOps generally uses traditional machine learning. This technique is essentially statistical: it correlates metrics, logs, events, and other information to reduce alert noise and yields a set of correlated alerts. It therefore needs a certain amount of time to produce results. Generally speaking, plug-in AIOps relies on manual work or historical records to arrive at a recommendation or a probable root cause.
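To make the idea of alert noise reduction concrete, here is a minimal sketch that groups alerts arriving close together in time into one correlated set. The `Alert` class, the `group_alerts` function, and the 60-second window are illustrative assumptions, not part of any specific AIOps product; real platforms use far richer statistical models.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    source: str       # hypothetical label: "metric", "log", or "event"
    message: str


def group_alerts(alerts, window_seconds=60.0):
    """Group alerts whose timestamp is within window_seconds of the previous alert."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        # Extend the current group if this alert follows the last one closely enough.
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


alerts = [
    Alert(0.0, "metric", "CPU high on host-a"),
    Alert(20.0, "log", "timeout in checkout"),
    Alert(45.0, "event", "pod restarted"),
    Alert(600.0, "metric", "disk usage high on host-b"),
]
groups = group_alerts(alerts)
for group in groups:
    print([a.message for a in group])
```

Four raw alerts collapse into two groups here, so an operator sees two incidents instead of four notifications. Time proximity alone says nothing about causality, which is why this style of correlation still needs history or human judgment to propose a root cause.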

At the same time, plug-in AIOps depends heavily on external data, since plug-in AIOps vendors usually provide only the algorithm platform. Data cleaning, the dependencies between CMDB entities, and so on all rely on data from outside. To implement plug-in AIOps, an enterprise therefore needs a very mature IT operation and maintenance system: it must already be able to access the required data, have APM products in place, and have relatively complete observability.

Endogenous AIOps provides deterministic artificial intelligence analysis and aims for deterministic results: when a problem occurs, the root cause it identifies is definite and is available in near real time. Endogenous AIOps maintains a matrix dependency map that is updated in very close to real time. This approach does not rely on a traditional static CMDB. Instead, the dependency map itself acts as a real-time CMDB that reflects dependency changes as they happen, and analysis builds on these endogenous relationships.
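The contrast between a static CMDB and a dependency map that follows the running system can be sketched as follows. This is a minimal illustration under stated assumptions: the `DependencyMap` class, its `ttl_seconds` expiry rule, and the service names are hypothetical and do not describe how any particular platform is implemented.

```python
import time


class DependencyMap:
    """Keeps only dependencies that were actually observed recently."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl_seconds = ttl_seconds
        self._last_seen = {}  # (caller, callee) -> timestamp of last observed call

    def observe_call(self, caller, callee, now=None):
        # Record or refresh an edge each time a call is observed.
        self._last_seen[(caller, callee)] = time.time() if now is None else now

    def edges(self, now=None):
        # An edge is live only if it was seen within the last ttl_seconds.
        now = time.time() if now is None else now
        return {
            edge for edge, seen in self._last_seen.items()
            if now - seen <= self.ttl_seconds
        }

    def dependencies_of(self, service, now=None):
        return sorted(callee for caller, callee in self.edges(now) if caller == service)


dep = DependencyMap(ttl_seconds=300)
dep.observe_call("web", "orders", now=0)
dep.observe_call("web", "legacy-auth", now=0)
dep.observe_call("orders", "db", now=10)
dep.observe_call("web", "orders", now=400)  # still in use, so the edge is refreshed
print(dep.dependencies_of("web", now=400))
```

Because edges expire unless traffic keeps refreshing them, the map drops `legacy-auth` on its own once calls stop, whereas a static CMDB entry would stay until someone edits it.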

How do companies decide to choose the technology path that suits them?

When implementing AIOps, enterprises also need to consider many issues. From a business manager's perspective, beyond basics such as cost and team, there is the balance between different departments and the balance among cost, stability, and efficiency. The goal of AIOps is not only to solve problems but to solve them sensibly: to maximize business stability and efficiency while keeping costs under control.


A Forrester report notes that enterprises implementing AIOps need to consider the following key capabilities:

  • Whether the AIOps platform integrates seamlessly with the ITOM tool chain and can achieve a high degree of automation
  • The AIOps platform's emphasis on native data, including cloud-native dependencies and cloud-native machine data
  • Automated, panoramic construction of the full-service dependency map
  • The future of AIOps is intelligent observation and perception combined with automated practice
  • Automated root cause analysis and incident remediation plans
  • Modern technology operations require intelligence and automation

Differences between the two technical paths in the data processing flow:

The traditional AIOps platform, that is, the plug-in AIOps platform, pieces together many tools during data processing, resulting in a fragile big data system. If staff change, whoever takes over is likely to inherit a large amount of technical debt.

The first step, data collection, relies on a large number of open source and commercial tools. The second step is to load the data into the big data platform. The third step is to sort out data relationships and clean the data manually. These first three steps are very time-consuming. The fourth step is to discover and locate problems, and only here do AIOps vendors get involved: the vendor team is stationed at the customer site, asks about requirements, and builds to order. The fifth step is to build dashboards. The sixth step is system expansion: as the application system grows, the entire system grows linearly with it.


Across the entire process, data engineers spend nearly 80% of their time on data collection, cleaning, and organization. The solution requires top talent in operation and maintenance: people who are not only operation and maintenance experts but also understand algorithms and development. AIOps is meant to be a supporting system that solves problems, yet plug-in AIOps is likely to make operation and maintenance heavier, because a dedicated team is needed to maintain the AIOps platform itself.

The data processing flow of endogenous AIOps is very simple: a single tool handles data collection. And because it is a highly productized commercial offering, it comes with out-of-the-box capabilities, including dashboards and the analysis engine. The subsequent processing is a black box that the company does not need to pay much attention to, and business engineers do not need to understand the algorithms or have SRE-level skills.

At the same time, endogenous AIOps grows non-linearly as the enterprise's business systems scale: the whole system, including the user team and the product, grows non-linearly. Once the solution is in place, the enterprise only needs to install one Agent, and many of the subsequent capabilities are automated. This lets the enterprise's operation and maintenance personnel focus on the enterprise's own business.

Summary:

The industry needs a new generation of software intelligence platform that covers the entire data processing flow and delivers the results customers want directly, rather than presenting raw data. Of the two technical paths, plug-in AIOps and endogenous AIOps, endogenous AIOps is the one more recommended for enterprises; it represents a new paradigm of intelligent operation and maintenance.

Endogenous AIOps helps simplify cloud native operation and maintenance

The goal of the endogenous AIOps platform is to build an all-in-one platform that combines AIOps and observability. It needs observation capabilities, and those capabilities must be centered on application monitoring, because applications are the layer whose behavior end users actually see. Infrastructure monitoring also needs to be integrated, including cloud platform monitoring and black-box monitoring. Finally, the platform needs to cover front-end digital experience.

The new AIOps platform needs to create continuous automation, from data ingestion through to the output of results. It needs proactive capabilities: the ability to predict and to warn in advance.

The new AIOps platform needs to provide high-level observability. It does not merely show raw data and individual components to the enterprise; it also pays attention to symptoms and experience and provides accurate results, reducing as far as possible the impact and interference that massive noise causes.

The data processing model of endogenous AIOps differs in many ways. In data collection, it emphasizes the capability of a single Agent. In data processing, it emphasizes the metric system, which is built differently from the traditional approach. The key point is that endogenous AIOps is built into the integrated platform.


The endogenous AIOps platform helps simplify cloud-native operation and maintenance mainly in the following five ways:

  • It can directly obtain high-quality observation data
  • It can create continuous automation, making operation and maintenance work more efficient
  • It can build a real-time matrix topology that can be navigated like a map
  • It can output impact analysis in real time
  • It points to the root cause and verifies the result

1. Directly obtain high-quality observation data

The first aspect is obtaining high-quality monitoring data directly. A classic summary is that "high-quality observations come from high-quality telemetry": high-quality back-end analysis requires high-quality telemetry data from the front end. Observability focuses on three pillars. High-level observability and endogenous AIOps analysis need five: in addition to the traditional traces, metrics, and logs, topology data and code data are also critical. Data quality directly determines the upper limit of the model.

This data must be collected non-intrusively and automatically, without modifying source code, business logic, or applications, and it must be combined automatically with contextual information. Context makes true root cause analysis possible by supplying high-fidelity background information, and it helps the platform build real-time service flow diagrams and topology diagrams to identify dependencies. The matrix-style relationship topology also depends on this contextual information.

The topology diagram shows the dependencies of the entire application environment, both vertically across the stack and horizontally across services. The service flow diagram provides a view of an entire transaction from the perspective of a service or request. Together, the two diagrams explain the order in which services call each other: the service flow diagram shows the ordered, distributed sequence of a transaction, while the topology diagram is a higher-level abstraction that shows dependencies.

Obtaining high-quality monitoring data directly requires commercial Agent technology. Although there are many open source and free tools on the market, commercial Agent technology has the following advantages that they lack:

  • The stability, security, and reliability of the collection agent (probe) are guaranteed
  • The probe's resource overhead and performance impact on hosts and core business are kept under control
  • Deployment and instrumentation, including changes, require less manual work
  • Monitoring can be injected automatically into dynamic methods or container-type components
  • Metrics are sampled at fine granularity, with native high fidelity
  • Enough information and context are available to build a unified data model

Many free tools do not offer these advantages. The endogenous AIOps platform relies on One Agent technology. The Agent has an edge computing design and performs much of the data aggregation and data cleaning at the edge endpoints.
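As a rough illustration of aggregation and cleaning at the edge, the sketch below drops unusable samples and summarizes raw values into per-minute buckets before anything would be sent to a back end. The function name `aggregate_samples`, the sample layout, and the 60-second bucket are assumptions made for this example; they are not taken from any Agent product.

```python
import math
from collections import defaultdict
from statistics import mean


def aggregate_samples(samples, bucket_seconds=60):
    """Summarize (timestamp, metric_name, value) samples per time bucket.

    Samples whose value is None or NaN are discarded (the cleaning step).
    """
    buckets = defaultdict(list)
    for timestamp, name, value in samples:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        bucket_start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[(bucket_start, name)].append(value)
    # One compact summary per bucket replaces many raw samples.
    return {
        key: {"count": len(v), "min": min(v), "max": max(v), "avg": mean(v)}
        for key, v in buckets.items()
    }


samples = [
    (0, "cpu", 10.0),
    (30, "cpu", 30.0),
    (61, "cpu", 50.0),
    (62, "cpu", None),          # failed read, dropped during cleaning
    (65, "cpu", float("nan")),  # invalid value, dropped during cleaning
]
summary = aggregate_samples(samples)
print(summary)
```

Doing this work on the monitored host reduces the volume of data shipped over the network, at the cost of some CPU on the host, which is why the overhead guarantees in the list above matter.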

2. Create continuous automation

The endogenous AIOps platform is designed to build continuous automation. Monitoring complex cloud-native environments requires automation across the board, including automated deployment, adaptation, discovery, monitoring, injection, and cleaning. In a complex cloud-native environment, it is hard to understand end-to-end business flows manually, so a high degree of automation is needed to support automated operation and maintenance.

3. Build a real-time matrix relationship map

The endogenous AIOps platform can build a real-time matrix topology that can be navigated like a map. The horizontal direction shows dependencies within a layer, such as the dependency diagram of the service layer, and likewise for the container layer, host layer, process layer, and so on. The vertical direction shows which container a service runs on, which process that container corresponds to, and which cloud host that process runs on.
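The two directions of the matrix topology can be sketched with two plain mappings: one for horizontal service-to-service calls and one for the vertical "runs on" chain. All entity names and both helper functions are hypothetical and chosen only to mirror the description above.

```python
# Horizontal direction: dependencies between entities in the service layer.
CALLS = {
    "checkout-service": ["payment-service", "inventory-service"],
    "payment-service": [],
    "inventory-service": [],
}

# Vertical direction: service -> container -> process -> cloud host.
RUNS_ON = {
    "checkout-service": "checkout-container",
    "checkout-container": "checkout-process",
    "checkout-process": "cloud-host-a",
}


def horizontal_neighbors(entity, calls):
    """Entities that the given entity calls within its own layer."""
    return sorted(calls.get(entity, []))


def vertical_stack(entity, runs_on):
    """Follow the 'runs on' chain from an entity down to its host."""
    stack = [entity]
    # Stop at the bottom of the chain, or if the data accidentally contains a loop.
    while stack[-1] in runs_on and runs_on[stack[-1]] not in stack:
        stack.append(runs_on[stack[-1]])
    return stack


print(horizontal_neighbors("checkout-service", CALLS))
print(vertical_stack("checkout-service", RUNS_ON))
```

Keeping the two directions as separate relations makes it cheap to answer both kinds of question: which services a service depends on, and which host ultimately carries it.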

4. Real-time output impact analysis

Impact analysis is a way of thinking borrowed from network security, and it applies equally to operation and maintenance. When a system failure or anomaly occurs, the questions are what the scope of impact is, which users, services, and applications are affected, and what the root cause is. Automated techniques deliver these results to users directly, without manual analysis by operation and maintenance personnel.
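One simple way to compute a scope of impact, assuming a dependency map is already available, is to walk the call graph backwards from the failed component. The sketch below uses a breadth-first search; the `impacted_by` function and the service names are illustrative assumptions rather than a description of any product's algorithm.

```python
from collections import deque


def impacted_by(failed, calls):
    """Return every entity that directly or transitively depends on `failed`."""
    # Invert the call graph: for each callee, which callers depend on it?
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller != failed and caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


calls = {
    "mobile-app": ["api-gateway"],
    "api-gateway": ["orders", "search"],
    "orders": ["db"],
    "search": [],
    "db": [],
}
print(sorted(impacted_by("db", calls)))
```

With this sample graph, a failure in `db` reaches `orders`, `api-gateway`, and `mobile-app`, while `search` is unaffected because it never calls the database.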


5. Point to the root cause and verify the result

Finally, a very important capability of automated operation and maintenance is getting to the root cause and verifying the result. Traditional approaches rely on separate methods based on knowledge bases, the CMDB, and causal inference, whereas endogenous AIOps provides built-in root cause localization. It connects data dependencies: not only the dependencies between objects but also those between different data types, such as call chains, logs, and metrics. It provides real-time root cause localization that is highly adaptable, has low overhead, and is very accurate. It also uses unsupervised techniques, so these capabilities can be delivered without much manual assistance.
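A deliberately simplified heuristic shows how a dependency map can narrow down a root cause: among all unhealthy entities, keep only those whose own dependencies are healthy. The `root_cause_candidates` function and the sample data are assumptions for illustration; this is a sketch of the general idea, not the localization method of any specific platform.

```python
def root_cause_candidates(unhealthy, calls):
    """Unhealthy entities that do not depend on any other unhealthy entity."""
    unhealthy = set(unhealthy)
    return sorted(
        entity for entity in unhealthy
        # If a dependency is also unhealthy, this entity is more likely a victim.
        if not any(dep in unhealthy for dep in calls.get(entity, ()))
    )


calls = {
    "mobile-app": ["api-gateway"],
    "api-gateway": ["orders", "search"],
    "orders": ["db"],
    "search": [],
    "db": [],
}
unhealthy = {"mobile-app", "api-gateway", "orders", "db"}
print(root_cause_candidates(unhealthy, calls))
```

Four entities raise alarms in this example, but only `db` has no unhealthy dependency of its own, so it is the single candidate. The quality of the answer depends entirely on how accurate and current the dependency map is.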

Summary

If an enterprise wants to succeed in digital transformation, it needs to ensure that all of its applications, its digital services, and the dynamic multi-cloud platforms that support them work perfectly, every time.

These highly dynamic and distributed cloud-native technologies are completely different from traditional scenarios. As a result, the complexity brought about by microservices, containers, and software-defined cloud infrastructure is now spiraling out of control. These complexities exceed the limits of team management capabilities and continue to grow. If you want to understand everything happening in these rapidly changing environments at any time, you must improve your observability and intelligent operation and maintenance capabilities.

We need a high degree of automation and intelligent technology to make cloud-native operation and maintenance lighter, faster, and less expensive, so that enterprise teams can focus their energy on the business itself and truly move into the era of intelligent operation and maintenance.

Guest introduction

Zhang Huaipeng, VP of Chengyun Products. He joined Hangzhou Chengyun Digital Technology Co., Ltd. in 2017 and is responsible for the day-to-day management of the [DataBuff integrated observation and intelligent operation and maintenance] product line. He serves as manager of the IPD integrated product development team and takes part in market management, requirements analysis, team collaboration, process structuring, quality control, and more.

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn for removal.