
[Nightingale Monitoring] Alarm management, great!

PHPz (reposted)
2023-06-09 08:31:30


Monitoring is the method, alerting is the means, and resolution is the purpose.

But have you ever run into this kind of confusion? You have collected a lot of indicators, but you don't know which ones should trigger alarms, how to route those alarms to the right teams or individuals, or how to escalate them.

When I used Prometheus Alertmanager before, I created a DingTalk group for each team, added a bunch of tags, and matched different tags to send alarms to different groups. When I wanted to escalate an alarm, in most cases I did it through a threshold-based escalation, but it was difficult to escalate the same alarm based on elapsed time.

Nightingale's alarm rule management is not that complicated (it does the complicated things for you), and it is also very elegant. In an earlier article, "[Nightingale Monitoring] First meeting Nightingale, and it's still strong!", I mentioned that Grafana is better at dashboard management, while N9e is better at managing alarm rules.

Today, let's take a look at how Nightingale does it.

Alarm rules

As the saying goes, before the troops move, the provisions go first.

To alert well, we must first know what we need, that is, which indicators should trigger alarms.

For example, at the system level we need to consider CPU, memory, disk, IO and similar indicators; at the application level we need to consider saturation, failure rate, latency and so on; at the business level we need to consider how many times a transaction failed, where it failed, and so on.

At different levels, the monitoring indicators and alarm strategies considered will be different.

Nightingale’s alarm rules are divided into built-in rules and custom rules.

The built-in rules are designed to lower the barrier to entry by providing everyone with a set of general-purpose rules.

The built-in alarm rules do not take effect until you add them to your own rules. If you like a certain rule, you can clone it into the active rules. For example, I cloned the Linux TIME_WAIT alarm rule into the default business group.

Then go to the alarm rule overview, and you will see that a new alarm rule has been added to the default business group.


Does this give you any ideas?

We can create multiple business groups to match our actual situation and then manage each group's alarm rules separately.

Assuming we have two teams, a front-end team and a middle-platform team, we can classify their indicators and rules separately.

In principle, imported rules do not take effect by default and require some additional configuration.

Click on the alarm rule name to enter the configuration page.

Here we can customize the alarm condition, data source, alarm level and other settings. The configuration used in this example is summarized as follows:

  • The data source of the alarm is local_prometheus, which indicates which cluster the alarm comes from.
  • The alarm condition is that the alarm is triggered only when the total number of TIME_WAIT connections is greater than 20000.
  • The alarm level is Level 2, which is the general important level.
  • The execution frequency is once every 15 seconds. If the alarm condition is still met for 60 consecutive seconds, an alarm is triggered.
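
The "check every 15 seconds, fire after 60 consecutive seconds" behavior can be sketched in a few lines of Python. This is only an illustration of the idea, not Nightingale's actual implementation; the class name RuleState, its fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleState:
    """Tracks how long the alarm condition has held without interruption."""
    threshold: float = 20000              # fire when the TIME_WAIT total exceeds this
    hold_seconds: int = 60                # condition must hold for this long
    breach_started: Optional[int] = None  # time of the first breaching sample

    def evaluate(self, now: int, value: float) -> bool:
        """Return True when an alarm should fire at time `now`."""
        if value <= self.threshold:
            self.breach_started = None    # condition broke, so reset the clock
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.hold_seconds


state = RuleState()
# Hypothetical samples, one every 15 seconds: (timestamp, TIME_WAIT total)
samples = [(0, 25000), (15, 26000), (30, 19000), (45, 21000),
           (60, 22000), (75, 23000), (90, 24000), (105, 25000)]
fired = [t for t, v in samples if state.evaluate(t, v)]
print(fired)  # [105]
```

The dip below the threshold at 30 seconds resets the clock, so the alarm only fires at 105 seconds, once the condition has held continuously since 45 seconds.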

The next step is the additional configuration.

The effective configuration sets the time period and business group in which the alarm rule takes effect. The notification configuration sets the notification medium, that is, which channels an alarm is sent through and where it is sent.

The notification configuration also offers some additional options:

  • Enable recovery notification: when the alarm recovers, the person in charge is also notified through the same channel.
  • Alarm receiving group: the group that receives the alarm.
  • Observation duration: after the alarm recovers, how long to keep observing before sending the recovery notification. This avoids flapping, where an alarm repeatedly fires and recovers.
  • Repeat notification: if the alarm has not been resolved within this period, it is sent again. Alarm escalation is not involved here.
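
To make the observation duration and repeat notification options more concrete, here is a minimal Python sketch of how the two timers might interact. It is an assumption-laden illustration, not Nightingale's code; the function plan_notifications and its parameters are hypothetical names.

```python
from typing import List, Optional, Tuple


def plan_notifications(
    states: List[Tuple[int, bool]],
    observe_seconds: int,
    repeat_seconds: int,
) -> List[Tuple[int, str]]:
    """Turn (timestamp, is_firing) samples into a list of notifications."""
    out: List[Tuple[int, str]] = []
    active = False                       # is there an open alarm?
    last_sent: Optional[int] = None      # time of the last alarm notification
    healthy_since: Optional[int] = None  # start of the current healthy streak
    for now, firing in states:
        if firing:
            healthy_since = None         # any firing sample cancels recovery
            if not active:
                active, last_sent = True, now
                out.append((now, "alarm"))
            elif now - last_sent >= repeat_seconds:
                last_sent = now
                out.append((now, "repeat"))
        elif active:
            if healthy_since is None:
                healthy_since = now
            # Only report recovery after a full observation period.
            if now - healthy_since >= observe_seconds:
                active = False
                out.append((now, "recovered"))
    return out


# Hypothetical timeline: the alarm briefly looks healthy at 120s, then fires again.
timeline = [(0, True), (60, True), (120, False), (180, True),
            (240, False), (300, False), (360, False)]
plan = plan_notifications(timeline, observe_seconds=120, repeat_seconds=180)
print(plan)  # [(0, 'alarm'), (180, 'repeat'), (360, 'recovered')]
```

The short healthy sample at 120 seconds does not produce a recovery message, which is exactly the flapping that the observation duration is meant to hide.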

By now, do you have a rough understanding of how common alarm rules are managed?

In addition to cloning the built-in alarm rules, we can also customize alarm rules, but the overall configuration is the same as above.

Alarm blocking

Generally, the alarms we block are ones that are not very important.

Under what circumstances will the alarm be blocked?

For example, when we release an application, some alarms will inevitably be triggered. We can set up blocking rules in advance to avoid generating alarm messages during the release.

Blocking rules are also divided by business group. We can add a new rule, for example one that blocks alarms from the message center.

In this way, within the fixed time window, the alarm information will no longer be sent.
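
Conceptually, a blocking rule is a tag match combined with a time window. The sketch below shows that idea in Python; the MuteRule class, the tag name service, and the timestamps are hypothetical and are not taken from Nightingale.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MuteRule:
    tags: Dict[str, str]  # every tag listed here must match the event
    start: int            # window start, Unix seconds
    end: int              # window end (exclusive), Unix seconds

    def mutes(self, event_tags: Dict[str, str], event_time: int) -> bool:
        """Return True if the event should be suppressed by this rule."""
        in_window = self.start <= event_time < self.end
        tags_match = all(event_tags.get(k) == v for k, v in self.tags.items())
        return in_window and tags_match


# Hypothetical rule: block message-center alarms during a one-hour release window.
rule = MuteRule(tags={"service": "message-center"}, start=1_000, end=4_600)
print(rule.mutes({"service": "message-center", "env": "prod"}, 2_000))  # True
print(rule.mutes({"service": "message-center"}, 5_000))                 # False
print(rule.mutes({"service": "gateway"}, 2_000))                        # False
```

Only events that match every tag and fall inside the window are suppressed; everything else is still delivered.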

Some readers may ask: isn't it a bit troublesome to add them one by one?

If it is an active alarm that has been generated, it can be blocked with one click.


If it is a historical alarm, it can also be blocked with one click.


What else?

If you want to block anything, just add it yourself!

Alarm escalation

What should I do if an alarm has not been processed within a period of time?

Either it is not an important alarm, in which case the rule is useless and can be deleted.

Or it is an alarm that nobody has been able to resolve, in which case it should be escalated so that more people know about it.

In Nightingale, alarm escalation can be implemented with subscription rules.

For example, suppose we configure a subscription rule so that if an alarm event tagged server=notice is not resolved within 1 hour, the alarm level is raised to Level 1 and the alarm is sent to a higher-level group.
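
The time-based escalation described here can be sketched as a small Python function. It only illustrates the logic under stated assumptions and is not how Nightingale's subscription rules are implemented; Event, escalate and their parameters are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Event:
    tags: Dict[str, str]
    level: int             # 1 is the most severe
    first_fired: int       # Unix seconds when the alarm first fired
    recovered: bool = False


def escalate(event: Event, now: int, match_tags: Dict[str, str],
             after_seconds: int = 3600, new_level: int = 1) -> Optional[Event]:
    """Return an escalated copy of the event, or None if no escalation applies."""
    if event.recovered:
        return None
    if any(event.tags.get(k) != v for k, v in match_tags.items()):
        return None                          # the subscription does not match
    if now - event.first_fired < after_seconds:
        return None                          # still within the grace period
    return replace(event, level=new_level)   # same event, higher severity


event = Event(tags={"server": "notice"}, level=2, first_fired=0)
print(escalate(event, now=1800, match_tags={"server": "notice"}))  # None
upgraded = escalate(event, now=3600, match_tags={"server": "notice"})
print(upgraded.level)  # 1
```

The escalated copy would then be routed to the higher-level group, while the original Level 2 notification flow stays unchanged.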

The rules here can also be classified and managed by business group.

In addition, Nightingale provides active alarm and historical alarm views, where you can see the current alarm information and the historical alarm records.

Alarm self-healing

The longer you work in operations, the more you will find that much of the work is repetitive. Handling simple, repetitive tasks with automated scripts not only improves efficiency but also reduces the risk of human error to a certain extent.

Nightingale provides an alarm self-healing function. The function is good, but don't get greedy with it.

When dealing with an alarm, you must first find the real cause behind it before you can solve the problem. So before enabling self-healing for an alarm, make sure the automated operation carries very low risk and has been tested many times. Do not use operations like cd /opt/aaa;rm -rf ./.
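
One reason a pattern like cd /opt/aaa;rm -rf ... is risky is that ";" does not stop on error: if the cd fails, the delete still runs in whatever directory the script happens to be in. A self-healing script can guard against this by checking the target first. Below is a minimal Python sketch of such a guard; the function clean_directory and the notion of an allowed root are assumptions made for illustration, not part of Nightingale or ibex.

```python
import shutil
import tempfile
from pathlib import Path


def clean_directory(target: str, allowed_root: str) -> int:
    """Delete the contents of `target`, but only if it is an existing
    directory strictly inside `allowed_root`. Returns the number of
    top-level entries removed."""
    root = Path(allowed_root).resolve()
    path = Path(target).resolve()
    if not path.is_dir():
        # The equivalent of a failed `cd`: stop instead of deleting elsewhere.
        raise FileNotFoundError(f"not a directory: {path}")
    if root not in path.parents:
        raise ValueError(f"{path} is not inside {root}")
    removed = 0
    for child in path.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
        removed += 1
    return removed


# Demo in a throwaway directory so that nothing real is touched.
with tempfile.TemporaryDirectory() as sandbox:
    work = Path(sandbox) / "aaa"
    (work / "logs").mkdir(parents=True)
    (work / "logs" / "old.log").write_text("x")
    (work / "cache.tmp").write_text("y")
    removed_count = clean_directory(str(work), sandbox)
    left_over = list(work.iterdir())
    try:
        clean_directory(str(Path(sandbox) / "missing"), sandbox)
        missing_rejected = False
    except FileNotFoundError:
        missing_rejected = True

print(removed_count, left_over, missing_rejected)  # 2 [] True
```

The script refuses to run when the directory is missing or lies outside the allowed root, which is the kind of low-risk behavior a self-healing action should have.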

Nightingale uses ibex to implement alarm self-healing. Currently, ibex-server has to be deployed separately, while ibex-agent has been integrated into Categraf.

Deploy ibex-server

Go to https://github.com/flashcatcloud/ibex/releases to download the binary package. After downloading and extracting it, you have the following files:

# ll
total 21536
drwxr-xr-x 3 root root     4096 Apr 19 10:44 etc
-rwxr-xr-x 1 root root 16105472 Nov 15  2021 ibex
-rw------- 1 root root  5931963 Jun  3  2022 ibex-1.0.0.tar.gz
drwxr-xr-x 2 root root     4096 Nov 15  2021 sql

Import database:

mysql -uroot -p <sql/ibex.sql

Then modify the etc/server.conf configuration file (in the extracted directory shown above), mainly the database configuration.

Finally start the server:

nohup ./ibex server &> server.log &

Configure the client

Under System configuration -> Notification configuration, set the server address for the alarm self-healing module.

Test self-healing

Then go to Alarm self-healing -> Self-healing scripts and add a script.

Save and exit, then click to create a task.

If the configuration does not need to be changed, or once you have made the necessary changes, choose to execute immediately.

At this point, what do you think? Is it good?

Anyway, I didn’t succeed. At this point I have to complain about this module:

  • Are there any prerequisites for deploying ibex-server?
  • Are there any prerequisites for ibex-agent (Categraf)?
  • The self-healing script failed to execute, yet there is no specific failure log on either the client or the server.
  • Why is the alarm self-healing configuration entry in N9e V6 placed in the message notification module? Strange.
  • The official documentation for this module is a bit too thin.

So I did not succeed here: the front end threw a timeout.

There were no logs in the backend.

Summary

Currently, Nightingale covers alarm rule management, alarm channel distribution, and alarm suppression and escalation fairly completely. Moreover, FlashDuty can take in alarms from different clusters, which is enough for most enterprises.

Only the alarm self-healing test failed for me. It is probably related to my environment:

  • The N9e modules as a whole are deployed to K8s using Helm.
  • The ibex-server side is deployed directly on the host in binary form.

However, the specific cause has not been found, and there is too little troubleshooting information available.


Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it deleted.