首页  >  文章  >  后端开发  >  使用 EventBridge 和 Lambda 进行自动故障排除和 ITSM 系统

使用 EventBridge 和 Lambda 进行自动故障排除和 ITSM 系统

王林
王林原创
2024-08-23 06:00:32500浏览

介绍 :

各位,在 IT 运营中,监视服务器指标(例如 CPU/内存和磁盘或文件系统的利用率)是一项非常通用的任务,但如果任何指标被触发为关键指标,则需要专门人员执行一些基本操作通过登录服务器进行故障排除,并找出使用的最初原因,如果该人收到多个相同的警报,导致无聊且根本没有生产力,则必须执行多次。因此,作为一种解决方法,可以开发一个系统,一旦触发警报,该系统就会做出反应,并通过执行一些基本的故障排除命令来对这些实例采取行动。只是总结问题陈述和期望 -

问题陈述:

开发一个能够满足低于预期的系统 -

  • 每个 EC2 实例都应由 CloudWatch 监控。
  • 一旦触发警报,就必须有一些东西可以登录到受影响的 EC2 实例并执行一些基本的故障排除命令。
  • 然后,创建一个 JIRA 问题来记录该事件,并在评论部分添加命令的输出。
  • 然后,发送一封自动电子邮件,其中提供所有警报详细信息和 JIRA 问题详细信息。

架构图:

Automatic Troubleshooting & ITSM System using EventBridge and Lambda

先决条件:

  1. EC2 实例
  2. CloudWatch 警报
  3. EventBridge 规则
  4. Lambda 函数
  5. JIRA 帐户
  6. 简单的通知服务

实施步骤:

  • A. CloudWatch 代理安装和配置设置:
    打开 Systems Manager 控制台并单击“文档”
    搜索“AWS-ConfigureAWSPackage”文档并通过提供所需的详细信息来执行。
    包名称 = AmazonCloudwatchAgent
    安装后,需要根据配置文件配置 CloudWatch 代理。为此,请执行 AmazonCloudWatch-ManageAgent 文档。另外,请确保 JSON CloudWatch 配置文件存储在 SSM 参数中。
    一旦您看到指标正在向 CloudWatch 控制台报告,请为 CPU 和内存利用率等创建警报。

  • B.设置EventBridge规则:
    为了跟踪警报状态的变化,这里,我们稍微定制了模式来跟踪警报状态从 OK 到 ALARM 的变化,而不是反向变化。然后,将此规则添加到 lambda 函数作为触发器。

{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "state": {
      "value": ["ALARM"]
    },
    "previousState": {
      "value": ["OK"]
    }
  }
}
  • C.创建 Lambda 函数以在 JIRA 中发送电子邮件和记录事件: 此 lambda 函数是为由 EventBridge 规则触发的多个活动创建的,并作为使用 AWS SDK(Boto3) 添加的目标 SNS 主题。一旦触发 EventBridge 规则,就会将 JSON 事件内容发送到 lambda,该函数通过该函数捕获多个详细信息以不同的方式进行处理。 到目前为止,我们已经研究了两种类型的警报 - i。 CPU 利用率和 ii.内存利用率。一旦这两个警报中的任何一个被触发并且警报状态从 OK 更改为 ALARM,就会触发 EventBridge,这也会触发 Lambda 函数来执行表单代码中提到的那些任务。

Lambda 先决条件:
我们需要导入以下模块才能使代码正常工作 -

  • >>操作系统
  • >>系统
  • >> json
  • >>波托3
  • >>时间
  • >>请求

注意: 上面的模块中,除了“requests”模块之外,其余的都默认在 lambda 底层基础设施中下载。 Lambda 不支持直接导入“requests”模块。因此,首先,通过执行以下命令在本地计算机(笔记本电脑)的文件夹中安装请求模块 -

pip3 install requests -t <directory path> --no-user

_之后,这将被下载到您执行上述命令的文件夹或您想要存储模块源代码的文件夹中,这里我希望 lambda 代码正在您的本地计算机中准备好。如果是,则使用 module.txt 创建整个 lambda 源代码的 zip 文件。之后,将 zip 文件上传到 lambda 函数。

所以,我们在这里执行以下两个场景 -

1。 CPU 利用率 - 如果触发 CPU 利用率警报,则 lambda 函数需要获取实例并登录到该实例并执行前 5 个高消耗进程。然后,它将创建一个 JIRA 问题并在评论部分添加流程详细信息。同时,它将发送一封电子邮件,其中包含警报详细信息和 jira 问题详细信息以及流程输出。

2。内存利用率 - 与上面的方法相同

Now, let me reframe the task details which lambda is supposed to perform -

  1. Login to Instance
  2. Perform Basic Troubleshooting Steps.
  3. Create a JIRA Issue
  4. Send Email to Recipient with all Details

Scenario 1: When alarm state has been changed from OK to ALARM

First Set (Define the cpu and memory function) :

################# Importing Required Modules ################
############################################################
import json
import boto3
import time
import os
import sys
sys.path.append('./python')   ## This will add requests module along with all dependencies into this script
import requests
from requests.auth import HTTPBasicAuth

################## Calling AWS Services ###################
###########################################################
ssm = boto3.client('ssm')
sns_client = boto3.client('sns')
ec2 = boto3.client('ec2')

################## Defining Blank Variable ################
###########################################################
cpu_process_op = ''
mem_process_op = ''
issueid = ''
issuekey = ''
issuelink = ''

################# Function for CPU Utilization ################
###############################################################
def cpu_utilization(instanceid, metric_name, previous_state, current_state):
    global cpu_process_op
    if previous_state == 'OK' and current_state == 'ALARM':
        command = 'ps -eo user,pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -5'
        print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
        # Start a session
        print(f'Starting session to {instanceid}')
        response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
        command_id = response['Command']['CommandId']
        print(f'Command ID: {command_id}')
        # Retrieve the command output
        time.sleep(4)
        output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
        print('Please find below output -\n', output['StandardOutputContent'])
        cpu_process_op = output['StandardOutputContent']
    else:
        print('None')

################# Function for Memory Utilization ################
############################################################### 
def mem_utilization(instanceid, metric_name, previous_state, current_state):
    global mem_process_op
    if previous_state == 'OK' and current_state == 'ALARM':
        command = 'ps -eo user,pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -5'
        print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
        # Start a session
        print(f'Starting session to {instanceid}')
        response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
        command_id = response['Command']['CommandId']
        print(f'Command ID: {command_id}')
        # Retrieve the command output
        time.sleep(4)
        output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
        print('Please find below output -\n', output['StandardOutputContent'])
        mem_process_op = output['StandardOutputContent']
    else:
        print('None')

Second Set (Create JIRA Issue) :

################## Create JIRA Issue ################
#####################################################
def create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val):
    ## Create Issue ##
    url ='https://<your-user-name>.atlassian.net//rest/api/2/issue'
    username = os.environ['username']
    api_token = os.environ['token']
    project = 'AnirbanSpace'
    issue_type = 'Incident'
    assignee = os.environ['username']
    summ_metric  = '%CPU Utilization' if 'CPU' in metric_name else '%Memory Utilization' if 'mem' in metric_name else '%Filesystem Utilization' if metric_name == 'disk_used_percent' else None
    metric_val = metric_val
    summary = f'Client | {account} | {instanceid} | {summ_metric} | Metric Value: {metric_val}'
    description = f'Client: Company\nAccount: {account}\nRegion: {region}\nInstanceID = {instanceid}\nTimestamp = {timestamp}\nCurrent State: {current_state}\nPrevious State = {previous_state}\nMetric Value = {metric_val}'

    issue_data = {
        "fields": {
            "project": {
                "key": "SCRUM"
            },
            "summary": summary,
            "description": description,
            "issuetype": {
                "name": issue_type
            },
            "assignee": {
                "name": assignee
            }
        }
    }
    data = json.dumps(issue_data)
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json"
    }
    auth = HTTPBasicAuth(username, api_token)
    response = requests.post(url, headers=headers, auth=auth, data=data)
    global issueid
    global issuekey
    global issuelink
    issueid = response.json().get('id')
    issuekey = response.json().get('key')
    issuelink = response.json().get('self')

    ################ Add Comment To Above Created JIRA Issue ###################
    output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
    comment_api_url = f"{url}/{issuekey}/comment"
    add_comment = requests.post(comment_api_url, headers=headers, auth=auth, data=json.dumps({"body": output}))

    ## Check the response
    if response.status_code == 201:
        print("Issue created successfully. Issue key:", response.json().get('key'))
    else:
        print(f"Failed to create issue. Status code: {response.status_code}, Response: {response.text}")

Third Set (Send an Email) :

################## Send An Email ################
#################################################
def send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink):
    ### Define a dictionary of custom input ###
    metric_list = {'mem_used_percent': 'Memory', 'disk_used_percent': 'Disk', 'CPUUtilization': 'CPU'}

    ### Conditions ###
    if previous_state == 'OK' and current_state == 'ALARM' and metric_name in list(metric_list.keys()):
        metric_msg = metric_list[metric_name]
        output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
        print('This is output', output)
        email_body = f"Hi Team, \n\nPlease be informed that {metric_msg} utilization is high for the instanceid {instanceid}. Please find below more information \n\nAlarm Details:\nMetricName = {metric_name}, \nAccount = {account}, \nTimestamp = {timestamp}, \nRegion = {region}, \nInstanceID = {instanceid}, \nCurrentState = {current_state}, \nReason = {current_reason}, \nMetricValue = {metric_val}, \nThreshold = 80.00 \n\nProcessOutput: \n{output}\nIncident Deatils:\nIssueID = {issueid}, \nIssueKey = {issuekey}, \nLink = {issuelink}\n\nRegards,\nAnirban Das,\nGlobal Cloud Operations Team"
        res = sns_client.publish(
            TopicArn = os.environ['snsarn'],
            Subject = f'High {metric_msg} Utilization Alert : {instanceid}',
            Message = str(email_body)
            )
        print('Mail has been sent') if res else print('Email not sent')
    else:
        email_body = str(0)

Fourth Set (Calling Lambda Handler Function) :

################## Lambda Handler Function ################
###########################################################
def lambda_handler(event, context):
    instanceid = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['dimensions']['InstanceId']
    metric_name = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['name']
    account = event['account']
    timestamp = event['time']
    region = event['region']
    current_state = event['detail']['state']['value']
    current_reason = event['detail']['state']['reason']
    previous_state = event['detail']['previousState']['value']
    previous_reason = event['detail']['previousState']['reason']
    metric_val = json.loads(event['detail']['state']['reasonData'])['evaluatedDatapoints'][0]['value']
    ##### function calling #####
    if metric_name == 'CPUUtilization':
        cpu_utilization(instanceid, metric_name, previous_state, current_state)
        create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
        send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
    elif metric_name == 'mem_used_percent':
        mem_utilization(instanceid, metric_name, previous_state, current_state)
        create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
        send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
    else:
        None

Alarm Email Screenshot :

Automatic Troubleshooting & ITSM System using EventBridge and Lambda

Note: In ideal scenario, threshold is 80%, but for testing I changed it to 10%. Please see the Reason.

Alarm JIRA Issue :

Automatic Troubleshooting & ITSM System using EventBridge and Lambda

Scenario 2: When alarm state has been changed from OK to Insufficient data

In this scenario, if any server cpu or memory utilization metrics data are not captured, then alarm state gets changed from OK to INSUFFICIENT_DATA. This state can be achieved in two ways - a.) If server is in stopped state b.) If CloudWatch agent is not running or went in dead state.
So, as per below script, you'll be able to see that when cpu or memory utilization alarm status gets insufficient data, then lambda will first check if instance is in running status or not. If instance is in running state, then it will login and check CloudWatch agent status. Post that, it will create a JIRA issue and post the agent status in comment section of JIRA issue. After that, it will send an email with alarm details and agent status.

Full Code :

################# Importing Required Modules ################
############################################################
import json
import boto3
import time
import os
import sys
sys.path.append('./python')   ## This will add requests module along with all dependencies into this script
import requests
from requests.auth import HTTPBasicAuth

################## Calling AWS Services ###################
###########################################################
ssm = boto3.client('ssm')
sns_client = boto3.client('sns')
ec2 = boto3.client('ec2')

################## Defining Blank Variable ################
###########################################################
cpu_process_op = ''
mem_process_op = ''
issueid = ''
issuekey = ''
issuelink = ''

################# Function for CPU Utilization ################
###############################################################
def cpu_utilization(instanceid, metric_name, previous_state, current_state):
    global cpu_process_op
    if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA':
        ec2_status = ec2.describe_instance_status(InstanceIds=[instanceid,])['InstanceStatuses'][0]['InstanceState']['Name']
        if ec2_status == 'running':
            command = 'systemctl status amazon-cloudwatch-agent;sleep 3;systemctl restart amazon-cloudwatch-agent'
            print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
            # Start a session
            print(f'Starting session to {instanceid}')
            response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
            command_id = response['Command']['CommandId']
            print(f'Command ID: {command_id}')
            # Retrieve the command output
            time.sleep(4)
            output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
            print('Please find below output -\n', output['StandardOutputContent'])
            cpu_process_op = output['StandardOutputContent']
        else:
            cpu_process_op = f'Instance current status is {ec2_status}. Not able to reach out!!'
            print(f'Instance current status is {ec2_status}. Not able to reach out!!')
    else:
        print('None')

################# Function for Memory Utilization ################
############################################################### 
def mem_utilization(instanceid, metric_name, previous_state, current_state):
    global mem_process_op
    if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA':
        ec2_status = ec2.describe_instance_status(InstanceIds=[instanceid,])['InstanceStatuses'][0]['InstanceState']['Name']
        if ec2_status == 'running':
            command = 'systemctl status amazon-cloudwatch-agent'
            print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
            # Start a session
            print(f'Starting session to {instanceid}')
            response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
            command_id = response['Command']['CommandId']
            print(f'Command ID: {command_id}')
            # Retrieve the command output
            time.sleep(4)
            output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
            print('Please find below output -\n', output['StandardOutputContent'])
            mem_process_op = output['StandardOutputContent']
            print(mem_process_op)
        else:
            mem_process_op = f'Instance current status is {ec2_status}. Not able to reach out!!'
            print(f'Instance current status is {ec2_status}. Not able to reach out!!')     
    else:
        print('None')

################## Create JIRA Issue ################
#####################################################
def create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val):
    ## Create Issue ##
    url ='https://<your-user-name>.atlassian.net//rest/api/2/issue'
    username = os.environ['username']
    api_token = os.environ['token']
    project = 'AnirbanSpace'
    issue_type = 'Incident'
    assignee = os.environ['username']
    summ_metric  = '%CPU Utilization' if 'CPU' in metric_name else '%Memory Utilization' if 'mem' in metric_name else '%Filesystem Utilization' if metric_name == 'disk_used_percent' else None
    metric_val = metric_val
    summary = f'Client | {account} | {instanceid} | {summ_metric} | Metric Value: {metric_val}'
    description = f'Client: Company\nAccount: {account}\nRegion: {region}\nInstanceID = {instanceid}\nTimestamp = {timestamp}\nCurrent State: {current_state}\nPrevious State = {previous_state}\nMetric Value = {metric_val}'

    issue_data = {
        "fields": {
            "project": {
                "key": "SCRUM"
            },
            "summary": summary,
            "description": description,
            "issuetype": {
                "name": issue_type
            },
            "assignee": {
                "name": assignee
            }
        }
    }
    data = json.dumps(issue_data)
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json"
    }
    auth = HTTPBasicAuth(username, api_token)
    response = requests.post(url, headers=headers, auth=auth, data=data)
    global issueid
    global issuekey
    global issuelink
    issueid = response.json().get('id')
    issuekey = response.json().get('key')
    issuelink = response.json().get('self')

    ################ Add Comment To Above Created JIRA Issue ###################
    output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
    comment_api_url = f"{url}/{issuekey}/comment"
    add_comment = requests.post(comment_api_url, headers=headers, auth=auth, data=json.dumps({"body": output}))

    ## Check the response
    if response.status_code == 201:
        print("Issue created successfully. Issue key:", response.json().get('key'))
    else:
        print(f"Failed to create issue. Status code: {response.status_code}, Response: {response.text}")

################## Send An Email ################
#################################################
def send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink):
    ### Define a dictionary of custom input ###
    metric_list = {'mem_used_percent': 'Memory', 'disk_used_percent': 'Disk', 'CPUUtilization': 'CPU'}

    ### Conditions ###
    if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA' and metric_name in list(metric_list.keys()):
        metric_msg = metric_list[metric_name]
        output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
        email_body = f"Hi Team, \n\nPlease be informed that {metric_msg} utilization alarm state has been changed to {current_state} for the instanceid {instanceid}. Please find below more information \n\nAlarm Details:\nMetricName = {metric_name}, \n Account = {account}, \nTimestamp = {timestamp}, \nRegion = {region},  \nInstanceID = {instanceid}, \nCurrentState = {current_state}, \nReason = {current_reason}, \nMetricValue = {metric_val}, \nThreshold = 80.00  \n\nProcessOutput = \n{output}\nIncident Deatils:\nIssueID = {issueid}, \nIssueKey = {issuekey}, \nLink = {issuelink}\n\nRegards,\nAnirban Das,\nGlobal Cloud Operations Team"
        res = sns_client.publish(
            TopicArn = os.environ['snsarn'],
            Subject = f'Insufficient {metric_msg} Utilization Alarm : {instanceid}',
            Message = str(email_body)
        )
        print('Mail has been sent') if res else print('Email not sent')
    else:
        email_body = str(0)

################## Lambda Handler Function ################
###########################################################
def lambda_handler(event, context):
    instanceid = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['dimensions']['InstanceId']
    metric_name = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['name']
    account = event['account']
    timestamp = event['time']
    region = event['region']
    current_state = event['detail']['state']['value']
    current_reason = event['detail']['state']['reason']
    previous_state = event['detail']['previousState']['value']
    previous_reason = event['detail']['previousState']['reason']
    metric_val = 'NA'
    ##### function calling #####
    if metric_name == 'CPUUtilization':
        cpu_utilization(instanceid, metric_name, previous_state, current_state)
        create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
        send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
    elif metric_name == 'mem_used_percent':
        mem_utilization(instanceid, metric_name, previous_state, current_state)
        create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
        send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
    else:
        None

Insufficient Data Email Screenshot :

Automatic Troubleshooting & ITSM System using EventBridge and Lambda

Insufficient data JIRA Issue :

Automatic Troubleshooting & ITSM System using EventBridge and Lambda

Conclusion :

In this article, we have tested scenarios on both cpu and memory utilization, but there can be lots of metrics on which we can configure auto-incident and auto-email functionality which will reduce significant efforts in terms of monitoring and creating incidents and all. This solution has given a initial approach how we can proceed further, but for sure there can be other possibilities to achieve this goal. I believe you all will understand the way we tried to make this relatable. Please like and comment if you love this article or have any other suggestions, so that we can populate in coming articles. ??

Thanks!!
Anirban Das

以上是使用 EventBridge 和 Lambda 进行自动故障排除和 ITSM 系统的详细内容。更多信息请关注PHP中文网其他相关文章!

声明:
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系admin@php.cn