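Hi all. In IT operations, monitoring server metrics such as CPU, memory, and disk or filesystem utilization is a very common task. When a metric crosses its critical threshold, though, someone has to log in to the server, run some basic troubleshooting, and find out what is causing the utilization. If that person receives the same alert several times, they have to repeat the same steps each time, which is tedious and unproductive. As a workaround, we can build a system that reacts as soon as an alarm is triggered and acts on the affected instances by running some basic troubleshooting commands. To summarize the problem statement and the expectations -
Develop a system that meets the expectations below -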
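A. CloudWatch agent installation and configuration:
Open the Systems Manager console and click "Documents".
Search for the "AWS-ConfigureAWSPackage" document and run it, providing the required details.
Package name = AmazonCloudWatchAgent
After installation, the CloudWatch agent has to be configured from a configuration file. To do this, run the AmazonCloudWatch-ManageAgent document. Also make sure the CloudWatch JSON configuration file is stored in an SSM parameter.
Once you see the metrics being reported in the CloudWatch console, create alarms for CPU utilization, memory utilization, and so on.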
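The alarms can also be created from code instead of the console. Below is a minimal sketch, assuming an 80% threshold (the value shown later in the alert email), a 5-minute period, and a single InstanceId dimension. The alarm naming scheme, the instance ID, and the helper names are hypothetical. The CloudWatch client is passed in, so the parameter builder can be checked without AWS access.

```python
def build_alarm_params(instance_id, metric_name, namespace, threshold=80.0):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"{metric_name}-high-{instance_id}",  # hypothetical naming scheme
        "Namespace": namespace,
        "MetricName": metric_name,
        # The Lambda code reads InstanceId from the alarm's dimensions.
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # seconds per datapoint (assumed)
        "EvaluationPeriods": 1,    # assumed
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_alarm(cloudwatch_client, params):
    """Create or update the alarm; pass boto3.client('cloudwatch') as the client."""
    return cloudwatch_client.put_metric_alarm(**params)


# CPUUtilization is published by EC2 itself; mem_used_percent comes from the agent.
cpu_params = build_alarm_params("i-0123456789abcdef0", "CPUUtilization", "AWS/EC2")
mem_params = build_alarm_params("i-0123456789abcdef0", "mem_used_percent", "CWAgent")
print(cpu_params["AlarmName"])
print(mem_params["AlarmName"])
```

Note that an alarm's dimensions have to match the published metric exactly. If the agent configuration appends more dimensions than InstanceId, the memory alarm needs the same set.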
B. Set up the EventBridge rule:
To track alarm state changes, we customized the pattern slightly so that it matches only the transition from OK to ALARM and not the reverse. Then add this rule to the Lambda function as a trigger.
{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "state": { "value": ["ALARM"] },
    "previousState": { "value": ["OK"] }
  }
}
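To see which transitions this pattern lets through, here is a small sketch of the matching logic. It is a simplified stand-in for EventBridge (exact value matching only), not the service's real matcher, and the `matches` and `state_change_event` helpers are hypothetical names used for illustration.

```python
# The rule pattern from above, written as a Python dict.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "state": {"value": ["ALARM"]},
        "previousState": {"value": ["OK"]},
    },
}


def matches(pattern, event):
    """Simplified illustration of pattern matching (exact values only).

    Each pattern leaf is a list of accepted values; nested dicts are matched
    recursively. Real EventBridge supports more operators than this.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def state_change_event(previous, current):
    """Build a trimmed, made-up alarm state change event."""
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "detail": {
            "state": {"value": current},
            "previousState": {"value": previous},
        },
    }


print(matches(ALARM_PATTERN, state_change_event("OK", "ALARM")))  # True
print(matches(ALARM_PATTERN, state_change_event("ALARM", "OK")))  # False
```

Because both the current and the previous state are pinned, a recovery (ALARM to OK) never invokes the Lambda function.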
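Lambda prerequisites:
We need to import the following modules for the code to work - json, boto3, time, os, sys, and requests.
Note: Of the modules above, all except "requests" are available by default in the Lambda runtime. The "requests" module is not bundled with the runtime, so it cannot be imported directly. First install it into a folder on your local machine (laptop) by running the following command -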
pip3 install requests -t <directory path> --no-user
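The module is then downloaded into the folder where you ran the command, or into whichever folder you chose for the module source code. I assume the Lambda code is already prepared on your local machine. If so, create a zip file of the entire Lambda source code together with the module, then upload the zip file to the Lambda function.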
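The zip step can be scripted as well. This is a minimal sketch using only the standard library. It assumes the handler file and the folder holding the installed module sit in one source directory, and the `zip_directory` helper name is hypothetical.

```python
import os
import zipfile


def zip_directory(source_dir, zip_path):
    """Zip everything under source_dir, storing paths relative to source_dir."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                if os.path.abspath(full_path) == zip_abs:
                    continue  # never add the archive to itself
                # Relative names keep the handler at the top level of the zip.
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path
```

For example, `zip_directory("lambda_src", "lambda_function.zip")` would package a hypothetical `lambda_src` folder. The code in this article calls `sys.path.append('./python')`, which suggests the module was installed into a `python` subfolder of the source directory; adjust the pip3 target path to match.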
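We cover the following two scenarios here -
1. CPU utilization - If the CPU utilization alarm is triggered, the Lambda function identifies the instance, runs a command on it through SSM, and collects the top 5 CPU-consuming processes. It then creates a JIRA issue and adds the process details in the comment section. It also sends an email containing the alarm details, the JIRA issue details, and the process output.
2. Memory utilization - Same approach as above.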
Now, let me break down the tasks the Lambda function is supposed to perform -
First Set (Define the CPU and memory functions) :
################# Importing Required Modules ################
############################################################
import json
import boto3
import time
import os
import sys
sys.path.append('./python')   ## Makes the bundled requests module and its dependencies importable
import requests
from requests.auth import HTTPBasicAuth

################## Calling AWS Services ###################
###########################################################
ssm = boto3.client('ssm')
sns_client = boto3.client('sns')
ec2 = boto3.client('ec2')

################## Defining Blank Variable ################
###########################################################
cpu_process_op = ''
mem_process_op = ''
issueid = ''
issuekey = ''
issuelink = ''

################# Function for CPU Utilization ################
###############################################################
def cpu_utilization(instanceid, metric_name, previous_state, current_state):
    global cpu_process_op
    if previous_state == 'OK' and current_state == 'ALARM':
        # head -6 keeps the header line plus the top 5 processes by CPU
        command = 'ps -eo user,pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -6'
        print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
        # Run the command on the instance through SSM Run Command
        print(f'Starting session to {instanceid}')
        response = ssm.send_command(InstanceIds=[instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
        command_id = response['Command']['CommandId']
        print(f'Command ID: {command_id}')
        # Give the command a few seconds to finish, then retrieve its output
        time.sleep(4)
        output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
        print('Please find below output -\n', output['StandardOutputContent'])
        cpu_process_op = output['StandardOutputContent']
    else:
        print('None')

################# Function for Memory Utilization ################
##################################################################
def mem_utilization(instanceid, metric_name, previous_state, current_state):
    global mem_process_op
    if previous_state == 'OK' and current_state == 'ALARM':
        # head -6 keeps the header line plus the top 5 processes by memory
        command = 'ps -eo user,pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -6'
        print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
        # Run the command on the instance through SSM Run Command
        print(f'Starting session to {instanceid}')
        response = ssm.send_command(InstanceIds=[instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
        command_id = response['Command']['CommandId']
        print(f'Command ID: {command_id}')
        # Give the command a few seconds to finish, then retrieve its output
        time.sleep(4)
        output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
        print('Please find below output -\n', output['StandardOutputContent'])
        mem_process_op = output['StandardOutputContent']
    else:
        print('None')
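A fixed pause before reading the command output works when the command is quick, but the result is not guaranteed to be ready by then. One alternative is to poll until the invocation leaves its pending states. The sketch below assumes a client object exposing `get_command_invocation` the way the boto3 SSM client does. The `wait_for_command` helper and its timeout values are hypothetical.

```python
import time

# Statuses that mean the command has not finished yet.
PENDING_STATUSES = {"Pending", "InProgress", "Delayed"}


def wait_for_command(ssm_client, command_id, instance_id, timeout=30.0, interval=2.0):
    """Poll get_command_invocation until the command leaves a pending status.

    Returns the final invocation dict, or raises TimeoutError if the command
    is still pending once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Sleep first: the invocation may not be registered immediately.
        time.sleep(interval)
        invocation = ssm_client.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        if invocation["Status"] not in PENDING_STATUSES:
            return invocation
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Command {command_id} is still {invocation['Status']}"
            )
```

Inside the functions above, this would replace the `time.sleep(4)` and `get_command_invocation` pair, for example `output = wait_for_command(ssm, command_id, instanceid)`. Either way, the Lambda function timeout has to be raised above its 3-second default, since the function waits on the instance.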
Second Set (Create JIRA Issue) :
################## Create JIRA Issue ################
#####################################################
def create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val):
    global issueid, issuekey, issuelink
    ## Create Issue ##
    url = 'https://<your-user-name>.atlassian.net/rest/api/2/issue'
    username = os.environ['username']
    api_token = os.environ['token']
    project = 'AnirbanSpace'   # not used below; the project key is set in issue_data
    issue_type = 'Incident'
    assignee = os.environ['username']
    summ_metric = '%CPU Utilization' if 'CPU' in metric_name else '%Memory Utilization' if 'mem' in metric_name else '%Filesystem Utilization' if metric_name == 'disk_used_percent' else None
    summary = f'Client | {account} | {instanceid} | {summ_metric} | Metric Value: {metric_val}'
    description = f'Client: Company\nAccount: {account}\nRegion: {region}\nInstanceID = {instanceid}\nTimestamp = {timestamp}\nCurrent State: {current_state}\nPrevious State = {previous_state}\nMetric Value = {metric_val}'
    issue_data = {
        "fields": {
            "project": {"key": "SCRUM"},
            "summary": summary,
            "description": description,
            "issuetype": {"name": issue_type},
            "assignee": {"name": assignee}
        }
    }
    data = json.dumps(issue_data)
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json"
    }
    auth = HTTPBasicAuth(username, api_token)
    response = requests.post(url, headers=headers, auth=auth, data=data)

    ## Check the response before using it ##
    if response.status_code != 201:
        print(f"Failed to create issue. Status code: {response.status_code}, Response: {response.text}")
        return
    issueid = response.json().get('id')
    issuekey = response.json().get('key')
    issuelink = response.json().get('self')
    print("Issue created successfully. Issue key:", issuekey)

    ################ Add Comment To Above Created JIRA Issue ###################
    output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
    comment_api_url = f"{url}/{issuekey}/comment"
    add_comment = requests.post(comment_api_url, headers=headers, auth=auth, data=json.dumps({"body": output}))
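The create-then-comment flow is easier to test when the HTTP call is passed in as a parameter. The following is a rough sketch of that idea and not the article's exact function: `create_issue_with_comment` is a hypothetical helper, and `post` stands for any callable that behaves like `requests.post`.

```python
import json


def create_issue_with_comment(post, issue_url, issue_data, comment, **request_kwargs):
    """Create a Jira issue, then add a comment only if creation succeeded.

    `post` is any callable with the calling convention of requests.post, so
    the flow can be exercised without network access.
    Returns the new issue key, or None when the issue was not created.
    """
    response = post(issue_url, data=json.dumps(issue_data), **request_kwargs)
    if response.status_code != 201:
        print(f"Failed to create issue. Status code: {response.status_code}")
        return None
    issue_key = response.json().get("key")
    if comment:  # skip the second call when there is no process output
        post(
            f"{issue_url}/{issue_key}/comment",
            data=json.dumps({"body": comment}),
            **request_kwargs,
        )
    return issue_key
```

In the Lambda function this could be called as `create_issue_with_comment(requests.post, url, issue_data, output, headers=headers, auth=auth)`. In a unit test a fake `post` takes the place of the network call.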
Third Set (Send an Email) :
################## Send An Email ################
#################################################
def send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink):
    ### Define a dictionary of custom input ###
    metric_list = {'mem_used_percent': 'Memory', 'disk_used_percent': 'Disk', 'CPUUtilization': 'CPU'}

    ### Conditions ###
    if previous_state == 'OK' and current_state == 'ALARM' and metric_name in metric_list:
        metric_msg = metric_list[metric_name]
        output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
        print('This is output', output)
        email_body = (
            f"Hi Team, \n\nPlease be informed that {metric_msg} utilization is high for the instanceid {instanceid}. Please find below more information \n\n"
            f"Alarm Details:\nMetricName = {metric_name}, \nAccount = {account}, \nTimestamp = {timestamp}, \nRegion = {region}, \nInstanceID = {instanceid}, \n"
            f"CurrentState = {current_state}, \nReason = {current_reason}, \nMetricValue = {metric_val}, \nThreshold = 80.00 \n\n"
            f"ProcessOutput: \n{output}\n"
            f"Incident Details:\nIssueID = {issueid}, \nIssueKey = {issuekey}, \nLink = {issuelink}\n\n"
            f"Regards,\nAnirban Das,\nGlobal Cloud Operations Team"
        )
        res = sns_client.publish(
            TopicArn=os.environ['snsarn'],
            Subject=f'High {metric_msg} Utilization Alert : {instanceid}',
            Message=email_body
        )
        if res:
            print('Mail has been sent')
        else:
            print('Email not sent')
    else:
        email_body = str(0)
Fourth Set (Calling Lambda Handler Function) :
################## Lambda Handler Function ################
###########################################################
def lambda_handler(event, context):
    instanceid = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['dimensions']['InstanceId']
    metric_name = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['name']
    account = event['account']
    timestamp = event['time']
    region = event['region']
    current_state = event['detail']['state']['value']
    current_reason = event['detail']['state']['reason']
    previous_state = event['detail']['previousState']['value']
    previous_reason = event['detail']['previousState']['reason']
    metric_val = json.loads(event['detail']['state']['reasonData'])['evaluatedDatapoints'][0]['value']

    ##### function calling #####
    if metric_name == 'CPUUtilization':
        cpu_utilization(instanceid, metric_name, previous_state, current_state)
        create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
        send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
    elif metric_name == 'mem_used_percent':
        mem_utilization(instanceid, metric_name, previous_state, current_state)
        create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
        send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
    else:
        pass
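To check the field extraction locally before wiring up the trigger, a trimmed sample event helps. The sketch below uses a made-up event that contains only the keys the handler reads. The account ID, instance ID, reasons, and datapoint value are invented for illustration, and `parse_alarm_event` is a hypothetical helper. It falls back to 'NA' when no datapoint value is present.

```python
import json


def parse_alarm_event(event):
    """Extract the fields the handler needs from an alarm state change event."""
    metric = event["detail"]["configuration"]["metrics"][0]["metricStat"]["metric"]
    state = event["detail"]["state"]
    # reasonData is a JSON string; it may be missing or carry no datapoints.
    reason_data = json.loads(state.get("reasonData") or "{}")
    datapoints = reason_data.get("evaluatedDatapoints") or [{}]
    return {
        "instanceid": metric["dimensions"]["InstanceId"],
        "metric_name": metric["name"],
        "current_state": state["value"],
        "previous_state": event["detail"]["previousState"]["value"],
        "metric_val": datapoints[0].get("value", "NA"),
    }


# Made-up event, trimmed to the keys used above.
SAMPLE_EVENT = {
    "account": "123456789012",
    "time": "2024-01-01T00:00:00Z",
    "region": "us-east-1",
    "detail": {
        "configuration": {"metrics": [{"metricStat": {"metric": {
            "name": "CPUUtilization",
            "dimensions": {"InstanceId": "i-0123456789abcdef0"},
        }}}]},
        "state": {
            "value": "ALARM",
            "reason": "Threshold Crossed",
            "reasonData": json.dumps({"evaluatedDatapoints": [{"value": 91.5}]}),
        },
        "previousState": {"value": "OK", "reason": "Initial state"},
    },
}

parsed = parse_alarm_event(SAMPLE_EVENT)
print(parsed)
```

The same sample dict can be pasted as a test event in the Lambda console, with `reasonData` written out as a JSON string.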
Alarm Email Screenshot :
Note: Ideally the threshold is 80%, but for testing I changed it to 10%. Please see the Reason field.
Alarm JIRA Issue :
In this scenario, if CPU or memory utilization data for a server is not captured, the alarm state changes from OK to INSUFFICIENT_DATA. This can happen in two ways - a.) the server is in the stopped state, or b.) the CloudWatch agent is not running or has died. For this scenario, the EventBridge rule pattern has to match a state value of INSUFFICIENT_DATA instead of ALARM.
As the script below shows, when the CPU or memory utilization alarm goes into the insufficient data state, Lambda first checks whether the instance is running. If it is, Lambda connects to it and checks the CloudWatch agent status. It then creates a JIRA issue and posts the agent status in the comment section of the issue. Finally, it sends an email with the alarm details and the agent status.
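One detail matters for the "is the instance running" check: by default, `describe_instance_status` reports only running instances, so a stopped instance yields an empty list unless `IncludeAllInstances=True` is passed. Here is a minimal sketch of a lookup that handles both cases. The `get_instance_state` helper and its "unknown" fallback are hypothetical, and the EC2 client is passed in.

```python
def get_instance_state(ec2_client, instance_id):
    """Return the instance state name, e.g. 'running' or 'stopped'.

    IncludeAllInstances=True makes the API report non-running instances too;
    without it a stopped instance produces an empty InstanceStatuses list.
    """
    response = ec2_client.describe_instance_status(
        InstanceIds=[instance_id], IncludeAllInstances=True
    )
    statuses = response["InstanceStatuses"]
    if not statuses:
        return "unknown"  # hypothetical fallback when nothing is reported
    return statuses[0]["InstanceState"]["Name"]
```

With the boto3 client used in this article, the call would be `get_instance_state(ec2, instanceid)`.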
Full Code :
################# Importing Required Modules ################
############################################################
import json
import boto3
import time
import os
import sys
sys.path.append('./python')   ## Makes the bundled requests module and its dependencies importable
import requests
from requests.auth import HTTPBasicAuth

################## Calling AWS Services ###################
###########################################################
ssm = boto3.client('ssm')
sns_client = boto3.client('sns')
ec2 = boto3.client('ec2')

################## Defining Blank Variable ################
###########################################################
cpu_process_op = ''
mem_process_op = ''
issueid = ''
issuekey = ''
issuelink = ''

################# Function for CPU Utilization ################
###############################################################
def cpu_utilization(instanceid, metric_name, previous_state, current_state):
    global cpu_process_op
    if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA':
        # IncludeAllInstances=True so that stopped instances are reported as well
        ec2_status = ec2.describe_instance_status(InstanceIds=[instanceid], IncludeAllInstances=True)['InstanceStatuses'][0]['InstanceState']['Name']
        if ec2_status == 'running':
            # Capture the agent status, then restart the agent
            command = 'systemctl status amazon-cloudwatch-agent;sleep 3;systemctl restart amazon-cloudwatch-agent'
            print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
            # Run the command on the instance through SSM Run Command
            print(f'Starting session to {instanceid}')
            response = ssm.send_command(InstanceIds=[instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
            command_id = response['Command']['CommandId']
            print(f'Command ID: {command_id}')
            # Give the command a few seconds, then retrieve its output
            time.sleep(4)
            output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
            print('Please find below output -\n', output['StandardOutputContent'])
            cpu_process_op = output['StandardOutputContent']
        else:
            cpu_process_op = f'Instance current status is {ec2_status}. Not able to reach out!!'
            print(f'Instance current status is {ec2_status}. Not able to reach out!!')
    else:
        print('None')

################# Function for Memory Utilization ################
##################################################################
def mem_utilization(instanceid, metric_name, previous_state, current_state):
    global mem_process_op
    if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA':
        # IncludeAllInstances=True so that stopped instances are reported as well
        ec2_status = ec2.describe_instance_status(InstanceIds=[instanceid], IncludeAllInstances=True)['InstanceStatuses'][0]['InstanceState']['Name']
        if ec2_status == 'running':
            command = 'systemctl status amazon-cloudwatch-agent'
            print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}')
            # Run the command on the instance through SSM Run Command
            print(f'Starting session to {instanceid}')
            response = ssm.send_command(InstanceIds=[instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]})
            command_id = response['Command']['CommandId']
            print(f'Command ID: {command_id}')
            # Give the command a few seconds, then retrieve its output
            time.sleep(4)
            output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
            print('Please find below output -\n', output['StandardOutputContent'])
            mem_process_op = output['StandardOutputContent']
            print(mem_process_op)
        else:
            mem_process_op = f'Instance current status is {ec2_status}. Not able to reach out!!'
            print(f'Instance current status is {ec2_status}. Not able to reach out!!')
    else:
        print('None')

################## Create JIRA Issue ################
#####################################################
def create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val):
    global issueid, issuekey, issuelink
    ## Create Issue ##
    url = 'https://<your-user-name>.atlassian.net/rest/api/2/issue'
    username = os.environ['username']
    api_token = os.environ['token']
    project = 'AnirbanSpace'   # not used below; the project key is set in issue_data
    issue_type = 'Incident'
    assignee = os.environ['username']
    summ_metric = '%CPU Utilization' if 'CPU' in metric_name else '%Memory Utilization' if 'mem' in metric_name else '%Filesystem Utilization' if metric_name == 'disk_used_percent' else None
    summary = f'Client | {account} | {instanceid} | {summ_metric} | Metric Value: {metric_val}'
    description = f'Client: Company\nAccount: {account}\nRegion: {region}\nInstanceID = {instanceid}\nTimestamp = {timestamp}\nCurrent State: {current_state}\nPrevious State = {previous_state}\nMetric Value = {metric_val}'
    issue_data = {
        "fields": {
            "project": {"key": "SCRUM"},
            "summary": summary,
            "description": description,
            "issuetype": {"name": issue_type},
            "assignee": {"name": assignee}
        }
    }
    data = json.dumps(issue_data)
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json"
    }
    auth = HTTPBasicAuth(username, api_token)
    response = requests.post(url, headers=headers, auth=auth, data=data)

    ## Check the response before using it ##
    if response.status_code != 201:
        print(f"Failed to create issue. Status code: {response.status_code}, Response: {response.text}")
        return
    issueid = response.json().get('id')
    issuekey = response.json().get('key')
    issuelink = response.json().get('self')
    print("Issue created successfully. Issue key:", issuekey)

    ################ Add Comment To Above Created JIRA Issue ###################
    output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
    comment_api_url = f"{url}/{issuekey}/comment"
    add_comment = requests.post(comment_api_url, headers=headers, auth=auth, data=json.dumps({"body": output}))

################## Send An Email ################
#################################################
def send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink):
    ### Define a dictionary of custom input ###
    metric_list = {'mem_used_percent': 'Memory', 'disk_used_percent': 'Disk', 'CPUUtilization': 'CPU'}

    ### Conditions ###
    if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA' and metric_name in metric_list:
        metric_msg = metric_list[metric_name]
        output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None
        email_body = (
            f"Hi Team, \n\nPlease be informed that {metric_msg} utilization alarm state has been changed to {current_state} for the instanceid {instanceid}. Please find below more information \n\n"
            f"Alarm Details:\nMetricName = {metric_name}, \nAccount = {account}, \nTimestamp = {timestamp}, \nRegion = {region}, \nInstanceID = {instanceid}, \n"
            f"CurrentState = {current_state}, \nReason = {current_reason}, \nMetricValue = {metric_val}, \nThreshold = 80.00 \n\n"
            f"ProcessOutput = \n{output}\n"
            f"Incident Details:\nIssueID = {issueid}, \nIssueKey = {issuekey}, \nLink = {issuelink}\n\n"
            f"Regards,\nAnirban Das,\nGlobal Cloud Operations Team"
        )
        res = sns_client.publish(
            TopicArn=os.environ['snsarn'],
            Subject=f'Insufficient {metric_msg} Utilization Alarm : {instanceid}',
            Message=email_body
        )
        if res:
            print('Mail has been sent')
        else:
            print('Email not sent')
    else:
        email_body = str(0)

################## Lambda Handler Function ################
###########################################################
def lambda_handler(event, context):
    instanceid = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['dimensions']['InstanceId']
    metric_name = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['name']
    account = event['account']
    timestamp = event['time']
    region = event['region']
    current_state = event['detail']['state']['value']
    current_reason = event['detail']['state']['reason']
    previous_state = event['detail']['previousState']['value']
    previous_reason = event['detail']['previousState']['reason']
    metric_val = 'NA'   # no datapoint value is available in the insufficient data state

    ##### function calling #####
    if metric_name == 'CPUUtilization':
        cpu_utilization(instanceid, metric_name, previous_state, current_state)
        create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
        send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
    elif metric_name == 'mem_used_percent':
        mem_utilization(instanceid, metric_name, previous_state, current_state)
        create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val)
        send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink)
    else:
        pass
Insufficient Data Email Screenshot :
Insufficient data JIRA Issue :
In this article we tested scenarios for both CPU and memory utilization, but the same auto-incident and auto-email functionality can be configured for many other metrics, which significantly reduces the effort spent on monitoring and creating incidents. This solution is only an initial approach, and there are certainly other ways to achieve the same goal. I hope the way we walked through it was easy to follow. Please like and comment if you enjoyed this article or have any suggestions, so that we can cover them in upcoming articles.
Thanks!!
Anirban Das