
ClassiSage: Terraform IaC Automated AWS SageMaker based HDFS Log Classification Model

Barbara Streisand
2024-10-26 05:04:30

ClassiSage

A machine learning model built with AWS SageMaker and its Python SDK for classifying HDFS logs, with the infrastructure setup automated using Terraform.

Link: GitHub
Language: HCL (Terraform), Python

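Contents

  • Overview: an overview of the project.
  • System Architecture: the system architecture diagram.
  • ML Model: an overview of the model.
  • Getting Started: how to run the project.
  • Console Observations: changes in instances and infrastructure that can be observed while running the project.
  • Ending and Cleanup: ensuring no additional charges.
  • Auto Created Objects: files and folders created during execution.

  • Start by following the directory structure for a better project setup.
  • Take the ClassiSage project repository uploaded on GitHub as the primary reference for a better understanding.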

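Overview

  • The model is built with AWS SageMaker for classifying HDFS logs, along with S3 for storing the dataset, the notebook file (containing the code for the SageMaker instance), and the model output.
  • The infrastructure setup is automated using Terraform, a tool created by HashiCorp that provides infrastructure as code.
  • The dataset used is HDFS_v1.
  • The project implements the SageMaker Python SDK with the XGBoost model, version 1.2.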

System Architecture

[Image: system architecture diagram]

ML Model

  • Image URI
  # Looks for the XGBoost image URI and builds an XGBoost container. Specify the repo_version depending on preference.
  container = get_image_uri(boto3.Session().region_name,
                            'xgboost', 
                            repo_version='1.0-1')

  • Initializing the hyperparameters and the Estimator call to the container
  hyperparameters = {
        "max_depth":"5",                ## Maximum depth of a tree. Higher means more complex models but risk of overfitting.
        "eta":"0.2",                    ## Learning rate. Lower values make the learning process slower but more precise.
        "gamma":"4",                    ## Minimum loss reduction required to make a further partition on a leaf node. Controls the model’s complexity.
        "min_child_weight":"6",         ## Minimum sum of instance weight (hessian) needed in a child. Higher values prevent overfitting.
        "subsample":"0.7",              ## Fraction of training data used. Reduces overfitting by sampling part of the data. 
        "objective":"binary:logistic",  ## Specifies the learning task and corresponding objective. binary:logistic is for binary classification.
        "num_round":50                  ## Number of boosting rounds, essentially how many times the model is trained.
        }
  # A SageMaker estimator that calls the xgboost-container
  estimator = sagemaker.estimator.Estimator(image_uri=container,                  # Points to the XGBoost container we previously set up. This tells SageMaker which algorithm container to use.
                                          hyperparameters=hyperparameters,      # Passes the defined hyperparameters to the estimator. These are the settings that guide the training process.
                                          role=sagemaker.get_execution_role(),  # Specifies the IAM role that SageMaker assumes during the training job. This role allows access to AWS resources like S3.
                                          train_instance_count=1,               # Sets the number of training instances. Here, it’s using a single instance.
                                          train_instance_type='ml.m5.large',    # Specifies the type of instance to use for training. ml.m5.large is a general-purpose instance with a balance of compute, memory, and network resources.
                                          train_volume_size=5, # 5GB            # Sets the size of the storage volume attached to the training instance, in GB. Here, it’s 5 GB.
                                          output_path=output_path,              # Defines where the model artifacts and output of the training job will be saved in S3.
                                          train_use_spot_instances=True,        # Utilizes spot instances for training, which can be significantly cheaper than on-demand instances. Spot instances are spare EC2 capacity offered at a lower price.
                                          train_max_run=300,                    # Specifies the maximum runtime for the training job in seconds. Here, it's 300 seconds (5 minutes).
                                          train_max_wait=600)                   # Sets the maximum time to wait for the job to complete, including the time waiting for spot instances, in seconds. Here, it's 600 seconds (10 minutes).
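
Two constraints on this configuration are easy to get wrong: SageMaker training jobs receive hyperparameter values as strings, and with managed spot training the maximum wait time must be at least the maximum run time. The following is a minimal standalone sketch (plain Python, no AWS calls) that checks both before an estimator is built. The helper names normalize_hyperparameters and check_spot_timing and the particular range checks are illustrative assumptions, not part of the project or the SDK.

```python
def normalize_hyperparameters(params):
    """Return a copy with every value converted to str, after basic range checks."""
    if "num_round" not in params or int(params["num_round"]) < 1:
        raise ValueError("num_round is required and must be a positive integer")
    # 0.3 and 1.0 are XGBoost's own defaults for eta and subsample.
    eta = float(params.get("eta", 0.3))
    subsample = float(params.get("subsample", 1.0))
    if not 0.0 <= eta <= 1.0:
        raise ValueError(f"eta must be in [0, 1], got {eta}")
    if not 0.0 < subsample <= 1.0:
        raise ValueError(f"subsample must be in (0, 1], got {subsample}")
    return {name: str(value) for name, value in params.items()}


def check_spot_timing(max_run, max_wait, use_spot=True):
    """With spot training, the wait budget must cover at least the run budget."""
    if use_spot and max_wait < max_run:
        raise ValueError(f"max_wait ({max_wait}s) must be >= max_run ({max_run}s)")
    # Seconds that remain for waiting on spot capacity or interruptions.
    return max_wait - max_run


# Same values as the estimator configuration above.
hyperparameters = normalize_hyperparameters({
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "binary:logistic",
    "num_round": 50,
})
spot_slack = check_spot_timing(max_run=300, max_wait=600)
print(hyperparameters["num_round"], spot_slack)
```

With the values used here, half of the 600-second wait budget is left over for acquiring spot capacity.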

  • Training job
  estimator.fit({'train': s3_input_train,'validation': s3_input_test})

  • Deployment
  xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')

  • Validation
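
Because the objective is binary:logistic, the model returns a probability per record, which has to be thresholded before it can be compared with the true labels. Below is a minimal sketch of that validation step, assuming the scores and labels are already available as plain Python lists. The 0.5 threshold, the helper names, and the sample values are illustrative assumptions rather than the notebook's actual code.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(labels, scores, threshold=0.5):
    """Derive accuracy, precision, and recall from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels and endpoint scores, for illustration only.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.8, 0.1]
metrics = classification_metrics(labels, scores)
print(metrics)
```

For anomaly-style log classification the classes are often imbalanced, so precision and recall are usually more informative than accuracy alone.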

Getting Started

  • Clone the repository using Git Bash, download the .zip file, or fork the repository.
  • Go to your AWS Management Console, click your account profile in the top-right corner, and select My Security Credentials from the dropdown.
  • Create an access key: in the Access keys section, click Create New Access Key; a dialog will appear with your Access Key ID and Secret Access Key.
  • Download or copy the keys: (IMPORTANT) download the .csv file or copy the keys to a secure location. This is the only time you can view the secret access key.
  • Open the cloned repository in VS Code.
  • Create a file named terraform.tfvars under ClassiSage containing the access key ID and secret access key created above; the variable names it must define are in the project repository on GitHub.
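  • Download and install all the dependencies for using Terraform and Python.
  • In the terminal, type/paste terraform init to initialize the backend.

  • Then type/paste terraform plan to view the plan, or simply terraform validate to ensure that there are no errors.

  • Finally, in the terminal, type/paste terraform apply --auto-approve

  • This will show two outputs, one named bucket_name and the other pretrained_ml_instance_name (the third resource created is the value used to name the bucket, since S3 bucket names are global).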
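
The bucket_name output is needed again in later steps, so it can help to read it programmatically instead of copying it from the terminal. The terraform output -json command prints each output as an object with a value field, and the sketch below parses that JSON. The helper name read_terraform_outputs and the sample values are made up for illustration; the real values come from your own terraform apply run.

```python
import json


def read_terraform_outputs(raw_json):
    """Map each Terraform output name to its plain value."""
    document = json.loads(raw_json)
    return {name: entry["value"] for name, entry in document.items()}


# Sample in the shape printed by `terraform output -json`; values are hypothetical.
sample = """
{
  "bucket_name": {"sensitive": false, "type": "string", "value": "data-bucket-example"},
  "pretrained_ml_instance_name": {"sensitive": false, "type": "string", "value": "example-notebook"}
}
"""

outputs = read_terraform_outputs(sample)
print(outputs["bucket_name"])
```

In a real run, raw_json could be obtained with subprocess.run(["terraform", "output", "-json"], capture_output=True, text=True, check=True).stdout, executed from the ClassiSage directory.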

  • After the terminal shows that the command has completed, navigate to ClassiSage/ml_ops/function.py and, on line 11 of the file, change the path to the location of your project directory, then save.

  • Then run all code cells in ClassiSage/ml_ops/data_upload.ipynb up to cell number 25 to upload the dataset to the S3 bucket.

[Image: output of the code cell execution]

  • After executing the notebook, reopen your AWS Management Console.
  • Search for the S3 and SageMaker services; you will see an instance of each service that was started (an S3 bucket and a SageMaker Notebook).

An S3 bucket named "data-bucket-" with 2 objects uploaded: the dataset and the pretrained_sm.ipynb file containing the model code.


  • Go to Notebook instances in AWS SageMaker, click the created instance, then click Open Jupyter.
  • After that, click New at the top right of the window and select Terminal.
  • This will create a new terminal.

  • In the terminal, paste the command that copies pretrained_sm.ipynb from S3 into the Notebook's Jupyter environment, replacing the bucket name in it with the bucket_name output shown in the VS Code terminal.


  • Go back to the opened Jupyter instance, click the pretrained_sm.ipynb file to open it, and assign it the conda_python3 kernel.
  • Scroll down to the 4th cell and replace the value of the variable bucket_name with the bucket_name output from the VS Code terminal: bucket_name = ""

[Image: output of the code cell execution]


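  • At the top of the file, go to the Kernel tab and restart the kernel.
  • Execute the notebook up to code cell number 27.
  • You will get the intended result. The data is fetched, adjusted for labels and features, and split into train and test sets with a defined output path; a model is then trained using SageMaker's Python SDK, deployed as an endpoint, and validated to produce different metrics.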
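
SageMaker's built-in XGBoost expects CSV training data with the label in the first column and no header row, which is typically what adjusting a dataset for labels and features involves. Below is a rough sketch of that preparation using only the standard library. The column names, the 70/30 ratio, the fixed seed, and the tiny in-memory dataset are assumptions for illustration, not values taken from the project notebook.

```python
import csv
import io
import random


def label_first_rows(records, label_column):
    """Turn dict records into rows with the label as the first value."""
    rows = []
    for record in records:
        features = [value for name, value in record.items() if name != label_column]
        rows.append([record[label_column]] + features)
    return rows


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows reproducibly and split it into train and test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def to_headerless_csv(rows):
    """Serialize rows as CSV text without a header line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical records: two feature columns and a binary label.
records = [{"e1": i, "e2": i * 2, "label": i % 2} for i in range(10)]
rows = label_first_rows(records, "label")
train_rows, test_rows = split_rows(rows)
train_csv = to_headerless_csv(train_rows)
print(len(train_rows), len(test_rows))
```

The resulting CSV text is what would be uploaded to the train and validation locations in S3 before calling fit.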

Console Observation Notes

Execution of the 8th cell

  • An output path will be set up in S3 to store the model data.

Execution of the 23rd cell

  estimator.fit({'train': s3_input_train,'validation': s3_input_test})
  • A training job will start; you can check it under the Training tab.

  • After some time (an estimated 3 minutes), the job will finish and the console will show it as completed.

Execution of the 24th code cell

  xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')
  • An endpoint will be deployed under the Inference tab.

Additional Console Observations:

  • An endpoint configuration is created under the Inference tab.

  • A model is also created under the Inference tab.


Ending and Cleanup

  • In VS Code, go back to data_upload.ipynb and execute the last 2 code cells to download the S3 bucket's data to your local system.
  • The folder will be named downloaded_bucket_content.

[Image: directory structure of the downloaded folder]

  • You will get a log of the downloaded files in the output cell. It will contain the raw pretrained_sm.ipynb, final_dataset.csv, and a model output folder named 'pretrained-algo' with the execution data of the SageMaker code file.
  • Finally, go into pretrained_sm.ipynb inside the SageMaker instance and execute the final 2 code cells. The endpoint and the resources within the S3 bucket will be deleted to ensure no additional charges.
  • Deleting the endpoint
  • Clearing S3 (needed to destroy the instance)
  • Go back to the VS Code terminal for the project and type/paste terraform destroy --auto-approve
  • All the created resource instances will be deleted.
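
By default Terraform cannot delete an S3 bucket that still contains objects, which is why the bucket is cleared before terraform destroy. The following sketch shows one way such a clearing step could look, written against the list_objects_v2 and delete_objects calls of a boto3 S3 client. The function name empty_bucket and the in-memory FakeS3Client stand-in are assumptions added so the example runs without AWS credentials; this is not the code from the project notebook.

```python
def empty_bucket(s3_client, bucket_name):
    """Delete every object in the bucket page by page; return how many were deleted."""
    deleted = 0
    request = {"Bucket": bucket_name}
    while True:
        page = s3_client.list_objects_v2(**request)
        keys = [{"Key": item["Key"]} for item in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
            deleted += len(keys)
        if not page.get("IsTruncated"):
            return deleted
        request["ContinuationToken"] = page["NextContinuationToken"]


class FakeS3Client:
    """Hypothetical in-memory stand-in that mimics the two calls used above."""

    def __init__(self, keys, page_size=2):
        self.keys = sorted(keys)
        self.page_size = page_size

    def list_objects_v2(self, Bucket, ContinuationToken=None):
        # The token is the last key of the previous page; list keys after it.
        remaining = [k for k in self.keys if ContinuationToken is None or k > ContinuationToken]
        page = remaining[:self.page_size]
        response = {"Contents": [{"Key": k} for k in page],
                    "IsTruncated": len(remaining) > self.page_size}
        if response["IsTruncated"]:
            response["NextContinuationToken"] = page[-1]
        return response

    def delete_objects(self, Bucket, Delete):
        doomed = {entry["Key"] for entry in Delete["Objects"]}
        self.keys = [k for k in self.keys if k not in doomed]


fake_client = FakeS3Client(["a.csv", "b.ipynb", "c/model.tar.gz", "d.txt", "e.txt"])
deleted_count = empty_bucket(fake_client, "data-bucket-example")
print(deleted_count)
```

Against a real account the same function would be called as empty_bucket(boto3.client("s3"), bucket_name), with bucket_name taken from the Terraform output.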

Auto Created Objects

ClassiSage/downloaded_bucket_content
ClassiSage/.terraform
ClassiSage/ml_ops/__pycache__
ClassiSage/.terraform.lock.hcl
ClassiSage/terraform.tfstate
ClassiSage/terraform.tfstate.backup

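NOTE:
If you liked the idea and implementation of this machine learning project, which uses AWS Cloud's S3 and SageMaker for HDFS log classification and Terraform for IaC (infrastructure setup automation), please consider liking this post and starring the project repository on GitHub after checking it out.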
