


Dragon Lizard System Operation and Maintenance Alliance: How Kindling-OriginX integrates DeepFlow's data to enhance the explanation of network faults
Editor's note: In 2023, the Dragon Lizard Community officially established the System Operation and Maintenance Alliance, which was co-sponsored by 12 organizations including the Academy of Information and Communications Technology, Alibaba Cloud, ZTE, Fudan University, Tsinghua University, Zhejiang University, Yunguan Qiuhao, Chengyun Digital, Yunshan Network, Inspur Information, Tongxin Software and China Unicom Software Institute. This article is reproduced from Yunguan Qiuhao and introduces how Kindling-OriginX, a member of the System Operation and Maintenance Alliance, automatically generates explainable fault root cause reports by combining DeepFlow's complete network data capabilities.
DeepFlow is an open source project that uses eBPF to provide deep observability for complex cloud infrastructure and cloud native applications. With eBPF, DeepFlow collects fine-grained distributed tracing data along with network and application performance indicators, offering full-link coverage and rich TCP performance indicators. These features give professional users and network experts powerful support for troubleshooting and locating problems.
Kindling-OriginX is a fault root cause derivation product. Its goal is to provide users with an interpretable fault root cause report that lets them understand the root cause directly, together with the reasoning process needed to verify the accuracy of that root cause. Network faults are hard to explain simply. It is not enough to tell users which network segment has a problem; users need more indicators and illustrations to understand what fault occurred on the network and where it occurred.
This article introduces how Kindling-OriginX combines DeepFlow's complete network data capabilities to automatically generate interpretable fault root cause reports.
Simulating a network fault with soma-chaos
- Inject a simulated 200ms network delay fault into seat-service.
Next, we first walk through how to identify the 200ms network fault manually with DeepFlow.
Simplified manual troubleshooting process
Step 1: Use the Trace system to narrow the scope
In a microservice environment, when an interface has a performance problem, the first step is to use the tracing system to find out which hop is slow and how it is behaving.
With the tracing system, users can locate the specific Traces involved. Analysis of the Trace shows that seat-service took a long time to execute and that a long call to config-service occurred at the same time. In this case, correlating the network indicators helps pinpoint the source of the network problem.
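One common way to narrow the scope at this step is to compare how long a call took as seen by the caller with how long it took as seen by the callee; a large gap suggests the time was spent on the network rather than in application code. The following is a minimal sketch under that assumption. The `Call` structure, the durations and the 50ms threshold are all hypothetical and do not come from any tracing system's real data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    caller: str
    callee: str
    client_ms: float  # duration measured on the caller side (hypothetical field)
    server_ms: float  # duration measured on the callee side (hypothetical field)


def network_gap_ms(call: Call) -> float:
    """Time not accounted for by the callee, i.e. a rough estimate of network time."""
    return call.client_ms - call.server_ms


def suspicious_calls(calls: List[Call], threshold_ms: float = 50.0) -> List[Call]:
    """Return calls whose network gap reaches the threshold, largest gap first."""
    flagged = [c for c in calls if network_gap_ms(c) >= threshold_ms]
    return sorted(flagged, key=network_gap_ms, reverse=True)


# Illustrative numbers only, loosely shaped like the fault in this article.
calls = [
    Call("travel-service", "seat-service", client_ms=230.0, server_ms=225.0),
    Call("seat-service", "config-service", client_ms=210.0, server_ms=8.0),
]

for c in suspicious_calls(calls):
    print(f"{c.caller} -> {c.callee}: about {network_gap_ms(c):.0f}ms outside the callee")
```

In this made-up data only the seat-service call to config-service is flagged, which is the kind of hint that sends the user on to the network indicators.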
Step 2: Use the DeepFlow flame graph to determine which network segment the fault occurs in
Enter the trace ID of a representative faulty Trace into DeepFlow's flame graph to see how the Trace performed at the network level, then analyze the flame graph in depth. With a good understanding of flame graphs and expert network knowledge, you can work out manually from the flame graph that the fault occurred on the caller side, seat-service, in the interval between the syscall and the network card. In other words, the problem lies in the container network segment, which is consistent with the injected fault.
(Picture/DeepFlow network flame graph)
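The manual reading of the flame graph amounts to comparing the time at which the same request is observed at successive capture points and finding the segment where most of the time is lost. A minimal sketch of that calculation follows. The capture point names and timestamps are illustrative assumptions, not DeepFlow's actual schema or output.

```python
from typing import List, Tuple

CapturePoint = Tuple[str, float]  # (capture point name, timestamp in ms)


def segment_delays(points: List[CapturePoint]) -> List[Tuple[str, float]]:
    """Given capture points in path order, return the delay of each adjacent segment."""
    return [
        (f"{a_name} -> {b_name}", b_ts - a_ts)
        for (a_name, a_ts), (b_name, b_ts) in zip(points, points[1:])
    ]


def slowest_segment(points: List[CapturePoint]) -> Tuple[str, float]:
    """Return the (segment name, delay) pair with the largest delay."""
    return max(segment_delays(points), key=lambda seg: seg[1])


# Hypothetical observations of one request from seat-service to config-service.
request_path = [
    ("client process (syscall)", 0.0),
    ("client pod NIC", 200.5),
    ("client node NIC", 200.75),
    ("server node NIC", 201.0),
    ("server pod NIC", 201.25),
    ("server process (syscall)", 201.5),
]

name, delay = slowest_segment(request_path)
print(f"slowest segment: {name} ({delay}ms)")
```

With these made-up timestamps, almost all of the time is lost between the client syscall and the client pod's network card, which corresponds to the container network segment on the caller side.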
Step 3: Determine which network indicators are abnormal in the container network
Based on troubleshooting experience, the user needs to check the network indicators of the seat-service and config-service pods, which means jumping to DeepFlow's Pod-level network indicator page. On this page, the user can see a sudden 200ms jump in connection establishment delay and a sudden change in the RTT indicator.
(Figure/DeepFlow-pod level monitoring indicators)
(Figure/DeepFlow-pod level monitoring indicators)
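Spotting a "mutation" on a chart by eye can be approximated with a simple baseline comparison: take the median of the first few samples as normal behavior and flag the first later sample that is both several times larger and larger by a minimum absolute amount. The sketch below assumes this simple rule; the window size, factor, minimum jump and RTT samples are all illustrative, and real monitoring systems may use quite different detection methods.

```python
from statistics import median
from typing import List, Optional


def find_mutation(
    series: List[float],
    baseline_len: int = 5,
    factor: float = 3.0,
    min_jump_ms: float = 50.0,
) -> Optional[int]:
    """Return the index of the first sample that jumps well above the baseline, or None."""
    if len(series) <= baseline_len:
        return None  # not enough data to form a baseline and test later samples
    baseline = median(series[:baseline_len])
    for i in range(baseline_len, len(series)):
        value = series[i]
        if value > baseline * factor and value - baseline >= min_jump_ms:
            return i
    return None


# Hypothetical pod-level RTT samples in milliseconds, one per interval.
rtt_ms = [1.2, 1.1, 1.3, 1.2, 1.1, 1.2, 201.4, 200.9, 201.1]

idx = find_mutation(rtt_ms)
print("no mutation" if idx is None else f"RTT mutation starts at sample {idx}: {rtt_ms[idx]}ms")
```

Requiring both a relative and an absolute jump keeps very small baselines from triggering on harmless noise.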
Step 4: Eliminate possible interference factors
According to experience, when a host's CPU or bandwidth is saturated, packet loss and delay also occur in the virtual network. It is therefore necessary to check the CPU and node-level bandwidth of the nodes where seat-service and config-service were running at that time, to make sure that node-level resources were not saturated.
Use a k8s command to confirm the nodes where the two pods are located, then go to DeepFlow's node indicator monitoring page to check the corresponding indicators. The bps, pps and other indicators of the nodes turn out to be within a reasonable range.
(Picture/Find the node where the pod is located through k8s command)
(Figure/DeepFlow-node level monitoring indicators (client))
(Figure/DeepFlow-node level monitoring indicators (server))
Since the node-level network indicators showed no obvious abnormality, the final conclusion is that the pod-level RTT indicator of seat-service is abnormal.
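The interference check in this step can be thought of as a threshold test on each node: is the CPU close to full, or is traffic close to the capacity of the link? The sketch below illustrates that idea. The `NodeStats` fields, node names, thresholds and numbers are hypothetical, and the link capacity is an assumed input that must be greater than zero.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeStats:
    name: str
    cpu_util: float           # fraction of CPU in use, 0.0 to 1.0
    tx_bps: float             # transmitted bits per second
    rx_bps: float             # received bits per second
    link_capacity_bps: float  # assumed NIC capacity, must be > 0


def saturation_reasons(
    node: NodeStats, cpu_limit: float = 0.9, bw_limit: float = 0.8
) -> List[str]:
    """Return human-readable reasons why a node looks saturated (empty if it does not)."""
    reasons = []
    if node.cpu_util >= cpu_limit:
        reasons.append(f"{node.name}: CPU at {node.cpu_util:.0%}")
    bw_util = max(node.tx_bps, node.rx_bps) / node.link_capacity_bps
    if bw_util >= bw_limit:
        reasons.append(f"{node.name}: bandwidth at {bw_util:.0%}")
    return reasons


# Hypothetical readings for the nodes hosting the client and server pods.
nodes = [
    NodeStats("client-node", cpu_util=0.35, tx_bps=1.2e8, rx_bps=0.9e8, link_capacity_bps=1e10),
    NodeStats("server-node", cpu_util=0.40, tx_bps=1.0e8, rx_bps=1.5e8, link_capacity_bps=1e10),
]

findings = [reason for node in nodes for reason in saturation_reasons(node)]
print(findings or "no node-level saturation found")
```

An empty result for both nodes is what allows node-level resource contention to be ruled out as the cause.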
Manual Troubleshooting Summary
After this series of troubleshooting steps, the user can eventually locate the fault, but the process places the following demands on the user:
- Very rich network knowledge
- In-depth understanding of network flame graphs
- Proficiency with the related tools
How Kindling-OriginX combines DeepFlow metrics to produce explainable fault reports
Kindling-OriginX processes and presents DeepFlow data according to different user needs and usage scenarios.
Mirroring the simplified manual troubleshooting process, troubleshooting with Kindling-OriginX proceeds as follows:
Automatic analysis of each Trace
For the fault at this time, each Trace is analyzed automatically, and the listed Traces are grouped by fault node. The travel-service entries are caused by cascading faults. This article does not focus on cascading faults; if you are interested, you can refer to how to deal with microservice cascading faults.
Review the fault root cause report where the fault node is seat-service
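Grouping analyzed Traces by fault node is conceptually a simple group-by. The sketch below shows the idea with plain dictionaries; the `trace_id` and `fault_node` keys and the sample values are hypothetical and are not Kindling-OriginX's actual data format.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_fault_node(traces: List[dict]) -> Dict[str, List[str]]:
    """Map each fault node to the IDs of the Traces attributed to it."""
    groups = defaultdict(list)
    for trace in traces:
        groups[trace["fault_node"]].append(trace["trace_id"])
    return dict(groups)


# Hypothetical analysis results, one entry per Trace.
analyzed = [
    {"trace_id": "t-001", "fault_node": "seat-service"},
    {"trace_id": "t-002", "fault_node": "travel-service"},
    {"trace_id": "t-003", "fault_node": "seat-service"},
]

for node, trace_ids in group_by_fault_node(analyzed).items():
    print(f"{node}: {len(trace_ids)} trace(s) {trace_ids}")
```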
Fault root cause conclusion:
For the sub-request 10.244.1.254:50332->10.244.5.79:15679, the RTT indicator shows a delay of about 200ms.
Fault reasoning verification
Since Kindling-OriginX has already identified a problem in the network path where seat-service calls config-service, it does not need to present all the data from DeepFlow's flame graph to the user. It only needs to interface with DeepFlow and fetch the network data related to the seat-service call to config-service.
Using DeepFlow's data for the seat-service call to config-service, Kindling-OriginX automatically determines that the container network of the client pod has a delay of 201ms.
Kindling-OriginX simulates expert analysis experience and further correlates DeepFlow's retransmission and RTT indicators to determine exactly what causes the delay when seat-service calls config-service.
Kindling-OriginX also integrates the node's CPU utilization and bandwidth indicators to rule out interference factors.
Kindling-OriginX completes the entire fault reasoning in a one-page report, and each data source is trustworthy and verifiable.
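To make the idea of encoding expert experience concrete, the following sketch chains three checks in the order an engineer might apply them: rule out node saturation, then look at retransmissions, then look at RTT. It is only an illustration of rule-based reasoning under assumed thresholds. It is not Kindling-OriginX's actual reasoning engine, and the function name, parameters and labels are all hypothetical.

```python
from typing import List, Tuple


def reason_about_delay(
    rtt_increase_ms: float,
    retransmission_ratio: float,
    node_saturated: bool,
    rtt_threshold_ms: float = 50.0,
    retrans_threshold: float = 0.01,
) -> Tuple[str, List[str]]:
    """Return a (conclusion, evidence) pair from three simple, assumed rules."""
    rtt_abnormal = rtt_increase_ms >= rtt_threshold_ms
    retrans_abnormal = retransmission_ratio >= retrans_threshold

    # Record every observation so the conclusion can be verified by a reader.
    evidence = [
        f"node saturated: {'yes' if node_saturated else 'no'}",
        f"retransmission ratio: {retransmission_ratio:.2%} "
        f"({'abnormal' if retrans_abnormal else 'normal'})",
        f"RTT increase: {rtt_increase_ms:.0f}ms "
        f"({'abnormal' if rtt_abnormal else 'normal'})",
    ]

    if node_saturated:
        return "node resource saturation", evidence
    if retrans_abnormal:
        return "packet loss (retransmissions)", evidence
    if rtt_abnormal:
        return "network latency (RTT)", evidence
    return "no network cause found", evidence


# Illustrative inputs shaped like the fault in this article.
conclusion, evidence = reason_about_delay(
    rtt_increase_ms=200.0, retransmission_ratio=0.0, node_saturated=False
)
print(conclusion)
for line in evidence:
    print(" -", line)
```

Returning the evidence alongside the conclusion reflects the point made above: a root cause is more convincing when each supporting observation can be checked.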
Summary
Kindling-OriginX and DeepFlow both use eBPF technology and aim to provide flexible and efficient solutions for users with different needs in different scenarios. We also look forward to seeing the emergence of more domestic products with complementary capabilities in the future.
DeepFlow can provide very complete basic data of the full-link network, making cloud native applications deeply observable, and is very useful for troubleshooting network problems.
Kindling-OriginX uses eBPF to collect North Star troubleshooting indicators and combines them with AI algorithms and expert experience to build a fault reasoning engine that provides users with interpretable root cause reports.
