[Nightingale Monitoring] Alarm management, great!
Monitoring is the method, alarming is the means, and solving the problem is the purpose.
But have you ever run into this kind of confusion? You have collected a lot of metrics, but you don't know which of them should trigger alarms, how to send those alarms to the right teams or individuals, or how to escalate (upgrade) them.
When I used Prometheus Alertmanager before, I created a DingTalk group for each team, added a bunch of labels, and matched different labels to send alarms to different groups. When I wanted to escalate an alarm, in many cases I did it through thresholds, but it was difficult to escalate the same alarm based on how long it had been firing.
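The label-matching approach described above can be pictured with a small Python sketch. It is a simplified model, not Alertmanager's actual code or configuration format; the route table, label names, and group names are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical routing table: (required labels, DingTalk group name).
# Routes are checked in order, so more specific entries come first.
ROUTES: List[Tuple[Dict[str, str], str]] = [
    ({"team": "payments"}, "dingtalk-payments"),
    ({"team": "infra", "severity": "critical"}, "dingtalk-infra-oncall"),
    ({"team": "infra"}, "dingtalk-infra"),
]
DEFAULT_RECEIVER = "dingtalk-default"


def pick_receiver(labels: Dict[str, str]) -> str:
    """Return the first group whose required labels all match the alarm."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER
```

A table like this only looks at the labels an alarm carries at the moment it fires. It has no notion of how long the alarm has been open, which is why escalating by time is awkward in this model.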
Nightingale's alarm rule management is not that complicated (it does the complicated things for you), and it is also quite elegant. In the earlier article "[Nightingale Monitoring] First meeting Nightingale, and it's still strong!" I mentioned that Grafana is better at managing monitoring dashboards, while N9e is better at managing alarm rules.
Today, let's take a look at how Nightingale handles this.
After going through this, do you have a basic understanding of common alarm rule management?
In addition to cloning the built-in alarm rules, we can also define custom alarm rules; the overall configuration is the same as above.
Generally, the alarms we shield (mute) are the ones that are not very important.
Under what circumstances would an alarm be shielded?
For example, when we are releasing an application, some alarms are almost inevitable. In that case, we can create shielding rules in advance so that no alarm messages are sent during the release.
Shielding rules are also organized by business group. We can add a new rule as follows to shield alarms from the message center.
This way, within the configured time window, the alarm messages will no longer be sent.
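A shielding rule boils down to three checks: business group, tag match, and time window. The following Python sketch models that logic under those assumptions. It is not Nightingale's implementation, and the class, field names, and tag values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class MuteRule:
    """Hypothetical model of a shielding rule scoped to one business group."""
    business_group: str
    tags: Dict[str, str]   # every tag listed here must match the event
    start: datetime        # beginning of the shielding window
    end: datetime          # end of the shielding window

    def matches(self, group: str, event_tags: Dict[str, str],
                fired_at: datetime) -> bool:
        """Return True if an event should be suppressed by this rule."""
        if group != self.business_group:
            return False
        if not (self.start <= fired_at <= self.end):
            return False
        return all(event_tags.get(k) == v for k, v in self.tags.items())


# Example: shield message-center alarms during a one-hour release window.
release_rule = MuteRule(
    business_group="message-center",
    tags={"service": "message-center"},
    start=datetime(2024, 4, 19, 10, 0),
    end=datetime(2024, 4, 19, 11, 0),
)
```

An event that fires inside the window with matching tags is dropped. The same event fired a minute after the window ends is delivered as usual.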
Some readers may ask: isn't it a bit tedious to add rules one by one?
If it is an active alarm that has already fired, it can be shielded with one click.
If it is a historical alarm, it can also be shielded with one click.
What else?
If you want to shield anything else, just add a rule yourself!
What should we do if an alarm has not been handled within a period of time?
Either it is not an important alarm: delete the rule, since keeping it serves no purpose.
Or it is an alarm that cannot be resolved yet: escalate it so that more people know about it.
In Nightingale, alarm escalation can be implemented with subscription rules.
For example, our configuration is as follows:
If an alarm event tagged server=notice is not resolved within 1 hour, the alarm level is raised to level one and the alarm information is sent to a higher-level group.
These rules can also be categorized and managed by business group.
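The time-based upgrade in the example can be sketched in Python as follows. This only illustrates the idea (matching tag, unresolved for longer than a threshold, raise the level). It is not how Nightingale's subscription rules are implemented, and every name in it is hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass(frozen=True)
class ActiveEvent:
    """Hypothetical unresolved alarm event."""
    tags: Dict[str, str]
    severity: int            # 1 is the most severe level
    first_fired: datetime


def escalate(event: ActiveEvent, now: datetime, match_tags: Dict[str, str],
             after: timedelta = timedelta(hours=1),
             new_severity: int = 1) -> Optional[ActiveEvent]:
    """Return an upgraded copy of the event, or None if no upgrade applies."""
    if any(event.tags.get(k) != v for k, v in match_tags.items()):
        return None          # the rule does not subscribe to this event
    if now - event.first_fired < after:
        return None          # still within the grace period
    if event.severity <= new_severity:
        return None          # already at or above the target level
    return replace(event, severity=new_severity)
```

The upgraded copy would then be sent to the higher-level group, while the original notification path stays untouched.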
In addition, Nightingale provides active-alarm and historical-alarm views, where you can see the alarms that are currently firing and the historical alarm records.
The longer you work in operations, the more you find that much of the work is repetitive. Simple, repetitive tasks can be handled by automated scripts, which not only improves efficiency but also reduces the risk of manual mistakes to some extent.
Nightingale provides an alarm self-healing function. The function is good, but don't be greedy with it.
When dealing with an alarm, you must first find the real cause behind it before you can solve the problem. So for alarm self-healing, make sure that the automated operation carries very low risk and that you have tested it many times. Do not use an operation like cd /opt/aaa;rm -rf ./ in a self-healing script.
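The danger of cd /opt/aaa;rm -rf ./ is that if the cd fails, the rm still runs in whatever directory the script happens to be in. A lower-risk cleanup action could look like the Python sketch below. It is only an illustration: the function name, the allow-listed root, and the dry-run default are assumptions, not part of Nightingale or ibex.

```python
import shutil
from pathlib import Path
from typing import List


def clean_directory(target: str, allowed_root: str,
                    dry_run: bool = True) -> List[str]:
    """Delete the contents of target, but only if it lies strictly inside
    allowed_root. Returns the entries that were (or would be) removed."""
    root = Path(allowed_root).resolve()
    path = Path(target).resolve()
    # Refuse the root itself and anything outside it (for example "/" or "..").
    if path == root or root not in path.parents:
        raise ValueError(f"refusing to clean {path}: not inside {root}")
    # Fail loudly if the directory is missing instead of cleaning elsewhere.
    if not path.is_dir():
        raise ValueError(f"refusing to clean {path}: not a directory")
    entries = sorted(path.iterdir())
    if not dry_run:
        for entry in entries:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return [str(entry) for entry in entries]
```

Because dry_run defaults to True, a first run only lists what would be deleted. The script has to opt in explicitly with dry_run=False before anything is removed.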
In Nightingale, alarm self-healing is implemented through ibex. Currently, the ibex server has to be deployed separately, while the ibex agent has been integrated into Categraf.
Go to https://github.com/flashcatcloud/ibex/releases to download the binary package. After downloading and extracting it, you have the following files:
# ll
total 21536
drwxr-xr-x 3 root root     4096 Apr 19 10:44 etc
-rwxr-xr-x 1 root root 16105472 Nov 15  2021 ibex
-rw------- 1 root root  5931963 Jun  3  2022 ibex-1.0.0.tar.gz
drwxr-xr-x 2 root root     4096 Nov 15  2021 sql
Import the database schema:
mysql -uroot -p <sql/ibex.sql
Then modify the etc/server.conf configuration file in the extracted directory, mainly the database settings.
Finally, start the server:
nohup ./ibex server &> server.log &
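Since the server is started in the background, it helps to confirm that it is actually listening before pointing Nightingale at it. A minimal sketch, assuming you know the host and port set in etc/server.conf (the helper name is made up for illustration):

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

For example, is_port_open("127.0.0.1", 10090) checks a local server. The port 10090 here is an assumption, so use whatever listen address your server.conf defines. If the check fails, look at server.log first.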
In System Configuration -> Notification Configuration -> Alarm Self-healing Module, configure the corresponding server address:
Then go to Alarm Self-healing -> Self-healing Scripts and add a script, as follows:
Save and exit, then click to create a task:
If the configuration does not need to be changed, or once you have changed what is needed, choose to execute immediately:
At this point, what do you think? Is it good?
Anyway, I didn't succeed. At this point I have to complain about this module:
I did not get it to work here: the front end threw a timeout.
There were no logs on the backend.
Currently, Nightingale handles alarm rule management, alarm channel distribution, and alarm suppression and escalation fairly completely. Moreover, FlashDuty can take in alarms from different clusters, which is enough for most enterprises.
Only the alarm self-healing test failed for me. It is probably related to my environment, but the specific cause has not been found, and there is too little troubleshooting information available.