
ClassiSage: Terraform IaC Automated AWS SageMaker based HDFS Log Classification Model



A machine learning model made with AWS SageMaker and its Python SDK for HDFS log classification, using Terraform to automate the infrastructure setup.

Link: GitHub
Languages: HCL (Terraform), Python

Contents

  • Overview: project overview.
  • System Architecture: system architecture diagram.
  • ML Model: overview of the model.
  • Getting Started: how to run the project.
  • Console Observations: changes to instances and infrastructure that can be observed while running the project.
  • Ending and Cleanup: ensuring no additional charges are incurred.
  • Auto-Created Objects: files and folders created during execution.

  • First follow the directory structure for a smoother project setup.
  • For a fuller understanding, take the ClassiSage project repository uploaded on GitHub as the main reference.

Overview

  • The model is made with AWS SageMaker for HDFS log classification, together with S3 for storing the dataset, the notebook file (containing the code for the SageMaker instance), and the model output.
  • The infrastructure setup is automated using Terraform, an infrastructure-as-code tool created by HashiCorp.
  • The dataset used is HDFS_v1.
  • The project implements the SageMaker Python SDK with the XGBoost model, version 1.2.

System Architecture

[Image: system architecture diagram]

ML Model

  • Image URI
  # Required imports (SageMaker Python SDK v1, where get_image_uri is available)
  import boto3
  from sagemaker.amazon.amazon_estimator import get_image_uri

  # Looks for the XGBoost image URI and builds an XGBoost container. Specify the repo_version depending on preference.
  container = get_image_uri(boto3.Session().region_name,
                            'xgboost', 
                            repo_version='1.0-1')
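
Note that get_image_uri comes from SageMaker Python SDK v1. Under SDK v2 it was replaced by sagemaker.image_uris.retrieve; an equivalent call, keeping the same region and repo version, would be:

  # SDK v2 equivalent of the get_image_uri call above
  import boto3
  import sagemaker

  container = sagemaker.image_uris.retrieve('xgboost',
                                            boto3.Session().region_name,
                                            version='1.0-1')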


  • Initializing the hyperparameters and the estimator call to the container
  hyperparameters = {
        "max_depth":"5",                ## Maximum depth of a tree. Higher means more complex models but risk of overfitting.
        "eta":"0.2",                    ## Learning rate. Lower values make the learning process slower but more precise.
        "gamma":"4",                    ## Minimum loss reduction required to make a further partition on a leaf node. Controls the model’s complexity.
        "min_child_weight":"6",         ## Minimum sum of instance weight (hessian) needed in a child. Higher values prevent overfitting.
        "subsample":"0.7",              ## Fraction of training data used. Reduces overfitting by sampling part of the data. 
        "objective":"binary:logistic",  ## Specifies the learning task and corresponding objective. binary:logistic is for binary classification.
        "num_round":50                  ## Number of boosting rounds, essentially how many times the model is trained.
        }
  # A SageMaker estimator that calls the xgboost-container
  estimator = sagemaker.estimator.Estimator(image_uri=container,                  # Points to the XGBoost container we previously set up. This tells SageMaker which algorithm container to use.
                                          hyperparameters=hyperparameters,      # Passes the defined hyperparameters to the estimator. These are the settings that guide the training process.
                                          role=sagemaker.get_execution_role(),  # Specifies the IAM role that SageMaker assumes during the training job. This role allows access to AWS resources like S3.
                                          train_instance_count=1,               # Sets the number of training instances. Here, it’s using a single instance.
                                          train_instance_type='ml.m5.large',    # Specifies the type of instance to use for training. ml.m5.large is a general-purpose instance with a balance of compute, memory, and network resources.
                                          train_volume_size=5, # 5GB            # Sets the size of the storage volume attached to the training instance, in GB. Here, it’s 5 GB.
                                          output_path=output_path,              # Defines where the model artifacts and output of the training job will be saved in S3.
                                          train_use_spot_instances=True,        # Utilizes spot instances for training, which can be significantly cheaper than on-demand instances. Spot instances are spare EC2 capacity offered at a lower price.
                                          train_max_run=300,                    # Specifies the maximum runtime for the training job in seconds. Here, it's 300 seconds (5 minutes).
                                          train_max_wait=600)                   # Sets the maximum time to wait for the job to complete, including the time waiting for spot instances, in seconds. Here, it's 600 seconds (10 minutes).
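
The train_*-prefixed arguments above are SageMaker Python SDK v1 names. Under SDK v2 the same estimator would be written with the renamed parameters; a sketch, behavior unchanged:

  # SDK v2 naming of the same estimator (train_* prefixes dropped or renamed)
  estimator = sagemaker.estimator.Estimator(image_uri=container,
                                            hyperparameters=hyperparameters,
                                            role=sagemaker.get_execution_role(),
                                            instance_count=1,
                                            instance_type='ml.m5.large',
                                            volume_size=5,
                                            output_path=output_path,
                                            use_spot_instances=True,
                                            max_run=300,
                                            max_wait=600)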


  • Training job
  estimator.fit({'train': s3_input_train,'validation': s3_input_test})
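
The fit call assumes train and validation channels pointing at CSV data in the bucket. A minimal sketch of how s3_input_train and s3_input_test could be defined (SDK v1 style; the S3 prefix layout is an assumption):

  # Train/validation channels for estimator.fit (prefix layout assumed; renamed TrainingInput in SDK v2)
  s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket_name, prefix),
                                      content_type='csv')
  s3_input_test = sagemaker.s3_input(s3_data='s3://{}/{}/test'.format(bucket_name, prefix),
                                     content_type='csv')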


  • Deployment
  xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')


  • Validation
  # Sketch of the validation cell, following the standard SageMaker XGBoost pattern;
  # the test_data frame and its 'Target' label column are assumptions.
  import numpy as np
  from sagemaker.predictor import csv_serializer

  test_data_array = test_data.drop(['Target'], axis=1).values  # features only, labels dropped
  xgb_predictor.content_type = 'text/csv'                      # the endpoint expects CSV rows
  xgb_predictor.serializer = csv_serializer
  predictions = xgb_predictor.predict(test_data_array).decode('utf-8')
  predictions_array = np.fromstring(predictions, sep=',')      # parse the returned CSV string into an array
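
From the parsed predictions, a quick quality check can be computed locally. A sketch, assuming binary labels in a test_data['Target'] column and a 0.5 decision threshold:

  # Rough accuracy of the binary predictions against the held-out labels (assumptions as noted)
  import numpy as np

  actuals = test_data['Target'].values               # 'Target' label column name is an assumption
  predicted = (predictions_array > 0.5).astype(int)  # binary:logistic returns probabilities
  print('Validation accuracy: {:.2%}'.format(np.mean(predicted == actuals)))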


Getting Started

  • Clone the repository / download the .zip file / fork the repository using Git Bash.
  • Go to your AWS Management Console, click on your account profile in the top-right corner, and select My Security Credentials from the dropdown.
  • Create an access key: in the Access keys section, click Create New Access Key; a dialog will appear with your Access Key ID and Secret Access Key.
  • Download or copy the keys: (IMPORTANT) download the .csv file or copy the keys to a secure location. This is the only time you can view the secret access key.
  • Open the cloned repository in your VS Code.
  • Create a file named terraform.tfvars under ClassiSage with the following content.
  # terraform.tfvars -- a sketch only: the variable names are assumptions, so match them to the
  # variable declarations in the repository's Terraform configuration. Keep this file out of
  # version control, since it holds your credentials.
  access_key     = "<YOUR_AWS_ACCESS_KEY_ID>"
  secret_key     = "<YOUR_AWS_SECRET_ACCESS_KEY>"
  aws_account_id = "<YOUR_AWS_ACCOUNT_ID>"
  • Download and install all the dependencies for using Terraform and Python.
  • In the terminal, type/paste terraform init to initialize the backend.

  • Then type/paste terraform plan to view the plan, or simply terraform validate to ensure there are no errors.

  • Finally, type/paste terraform apply --auto-approve in the terminal.

  • This will show two outputs, one as bucket_name and another as pretrained_ml_instance_name (the third resource is the variable name given to the bucket, since buckets are global resources).


  • Once the command is shown as completed in the terminal, navigate to ClassiSage/ml_ops/function.py and, on line 11 of the file, add the code
  path = "C:\\path\\to\\your\\project\\directory"   # illustrative placeholder; the variable name is an assumption

and change it to the path where your project directory is located, then save it.

  • Then, on ClassiSage/ml_ops/data_upload.ipynb, run all the code cells up to cell number 25 to upload the dataset to the S3 bucket. A sketch of what the upload step amounts to follows.
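
For orientation, the upload cells boil down to pushing the dataset and the notebook into the bucket. A minimal boto3 sketch (object keys and local file names are assumptions; final_dataset.csv is the file name mentioned in the cleanup section):

  # Hypothetical core of data_upload.ipynb: copy the dataset and notebook to the bucket
  import boto3

  bucket_name = "<bucket_name output from terraform>"  # assumption: taken from the terraform output
  s3 = boto3.client("s3")
  s3.upload_file("final_dataset.csv", bucket_name, "final_dataset.csv")
  s3.upload_file("pretrained_sm.ipynb", bucket_name, "pretrained_sm.ipynb")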

  • Output of the code cell execution


  • After executing the notebook, reopen your AWS Management Console.
  • You can now search for the S3 and SageMaker services and will see an instance of each launched service (an S3 bucket and a SageMaker notebook).

An S3 bucket named 'data-bucket-' with 2 objects uploaded: the dataset and the pretrained_sm.ipynb file containing the model code.



  • Go to the notebook instance in AWS SageMaker, click on the created instance, and click on Open Jupyter.
  • After that, click on New in the top right of the window and select Terminal.
  • This will create a new terminal.

  • Paste the following into the terminal (replacing <bucket_name> with the bucket_name output shown in the VS Code terminal's output):
  # Copy the notebook from S3 into the instance's Jupyter workspace.
  # Sketch: the destination is assumed to be the SageMaker notebook's default working directory.
  aws s3 cp s3://<bucket_name>/pretrained_sm.ipynb /home/ec2-user/SageMaker/

Terminal command for uploading pretrained_sm.ipynb from S3 into the Notebook's Jupyter environment.



  • Get back to the opened Jupyter instance, click on the pretrained_sm.ipynb file to open it, and assign it the conda_python3 kernel.
  • Scroll down to the fourth cell and replace the value of the variable bucket_name with the VS Code terminal's output for bucket_name = ""
  # Cell 4 (sketch -- only bucket_name needs editing; the session/region lines are assumptions)
  import boto3

  bucket_name = "<bucket_name output from terraform>"   # paste the bucket name from the VS Code terminal
  my_region = boto3.session.Session().region_name       # region of the current SageMaker session
  print("Bucket: " + bucket_name + ", region: " + my_region)

Output of the code cell execution



  • At the top of the file, go to the Kernel tab and restart the kernel.
  • Execute the notebook up to code cell number 27.
  • You will get the intended output: the data will be fetched, split into train and test sets after being adjusted for labels and features with the defined output path, then a model using SageMaker's Python SDK will be trained, deployed as an endpoint, and validated to give different metrics. A sketch of the split step follows.
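
As a point of reference, the fetch-and-split step usually amounts to a shuffled split of the prepared frame. A sketch (the file name comes from the cleanup section; the 70/30 ratio and seed are assumptions):

  # Hypothetical shape of the split step: shuffle, then cut 70/30 into train and test
  import numpy as np
  import pandas as pd

  model_data = pd.read_csv("final_dataset.csv")             # dataset name taken from the cleanup section
  train_data, test_data = np.split(model_data.sample(frac=1, random_state=1729),
                                   [int(0.7 * len(model_data))])
  print(train_data.shape, test_data.shape)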

Console Observations

Executing the 8th cell

  # Cell 8 (sketch): define where the model data will be stored in S3.
  # The 'pretrained-algo' prefix is taken from the model output folder named later in this article.
  prefix = 'pretrained-algo'
  output_path = 's3://{}/{}/output'.format(bucket_name, prefix)
  print(output_path)
  • An output path will be set up in S3 to store the model data.

ClassiSage: Terraform IaC Automated AWS SageMaker based HDFS Log classification Model

ClassiSage: Terraform IaC Automated AWS SageMaker based HDFS Log classification Model

Executing the 23rd cell

  # Launches the training job on the train/validation channels defined earlier
  estimator.fit({'train': s3_input_train,'validation': s3_input_test})
  • The training job will start, and you can check it under the Training tab.

ClassiSage: Terraform IaC Automated AWS SageMaker based HDFS Log classification Model

  • After some time (around 3 minutes expected), it will be completed and will show the same.

ClassiSage: Terraform IaC Automated AWS SageMaker based HDFS Log classification Model

Executing the 24th code cell

  # Deploys the trained model as a real-time endpoint on a single ml.m5.large instance
  xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')
  • An endpoint will be deployed under the Inference tab; a direct invocation sketch follows.
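
To sanity-check the live endpoint outside the notebook, it can also be invoked directly through boto3 (the endpoint name and the illustrative feature row are assumptions):

  # Hypothetical direct invocation of the deployed endpoint via the runtime API
  import boto3

  runtime = boto3.client("sagemaker-runtime")
  response = runtime.invoke_endpoint(
      EndpointName="<endpoint-name-from-the-Inference-tab>",  # assumption: copy from the console
      ContentType="text/csv",
      Body="0.1,0.2,0.3",                                     # illustrative row; must match your feature count
  )
  print(response["Body"].read().decode("utf-8"))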


Additional console observations:

  • An endpoint configuration is created under the Inference tab.


  • A model is also created under the Inference tab.



Ending and Cleanup

  • Back in VS Code, run the last 2 code cells of data_upload.ipynb to download the S3 bucket's data onto the local system.
  • The folder will be named downloaded_bucket_content, with the directory structure of the downloaded folder as shown below.

[Image: directory structure of the downloaded_bucket_content folder]

  • You will get logs of the downloaded files in the output cell. It will contain the original pretrained_sm.ipynb, final_dataset.csv, and a model output folder named 'pretrained-algo' with the execution data of the SageMaker code file; a sketch of the download step follows.
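
The download cells effectively mirror the bucket to disk. A boto3 sketch of that step (the bucket name placeholder is an assumption):

  # Hypothetical core of the download cells: mirror the bucket into downloaded_bucket_content/
  import os
  import boto3

  bucket = boto3.resource("s3").Bucket("<bucket_name>")  # assumption: the data-bucket created by Terraform
  for obj in bucket.objects.all():
      local_path = os.path.join("downloaded_bucket_content", obj.key)
      os.makedirs(os.path.dirname(local_path), exist_ok=True)
      bucket.download_file(obj.key, local_path)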
  • Finally, go into pretrained_sm.ipynb inside the SageMaker instance and execute the last 2 code cells. The endpoint and the resources inside the S3 bucket will be deleted, ensuring no additional charges are incurred.
  • Deleting the endpoint
  # Sketch of the endpoint deletion cell (assumed): tear down the hosted endpoint
  sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)


  • Clearing out S3: (required before destroying the instances)
  # Sketch of the S3 cleanup cell (assumed): empty the bucket so terraform destroy can remove it
  bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
  bucket_to_delete.objects.all().delete()
  • Return to the VS Code terminal for the project files, then type/paste terraform destroy --auto-approve
  • All the created resource instances will be deleted.

Auto-Created Objects

ClassiSage/downloaded_bucket_content
ClassiSage/.terraform
ClassiSage/ml_ops/__pycache__
ClassiSage/.terraform.lock.hcl
ClassiSage/terraform.tfstate
ClassiSage/terraform.tfstate.backup

Note:
If you liked the idea and the implementation of this machine learning project, which uses AWS Cloud's S3 and SageMaker for HDFS log classification and Terraform for IaC (infrastructure setup automation), please consider liking this post and starring the project repository after checking it out on GitHub.

