ClassiSage: Terraform IaC Automated AWS SageMaker based HDFS Log classification Model

Barbara Streisand

Oct 26, 2024 am 05:04 AM

ClassiSage

A machine learning model built with AWS SageMaker and its Python SDK to classify HDFS logs, with Terraform automating the infrastructure setup.

Link: GitHub
Language: HCL (terraform), Python

Content

Overview: Project Overview.
System Architecture: System Architecture Diagram
ML Model: Model Overview.
Getting Started: How to run the project.
Console Observations: Changes in instances and infrastructure that can be observed while running the project.
Ending and Cleanup: Ensuring no additional charges.
Auto Created Objects: Files and Folders created during execution process.

First, follow the directory structure for a better project setup.
Refer to ClassiSage's project repository on GitHub for a better understanding.

Overview

The model is built with AWS SageMaker to classify HDFS logs, with S3 storing the dataset, the notebook file (containing the code for the SageMaker instance), and the model output.
The infrastructure setup is automated with Terraform, HashiCorp's infrastructure-as-code tool.
The dataset used is HDFS_v1.
The project uses the SageMaker Python SDK with the XGBoost model, version 1.2.

System Architecture

[Figure: system architecture diagram]

ML Model

Image URI

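  import boto3
  from sagemaker.amazon.amazon_estimator import get_image_uri

  # Looks up the URI of the built-in XGBoost container image for the current region.
  # Set repo_version to the XGBoost release you prefer.
  # get_image_uri is the older helper; SageMaker Python SDK v2 offers sagemaker.image_uris.retrieve for the same lookup.
  container = get_image_uri(boto3.Session().region_name,
                            'xgboost',
                            repo_version='1.0-1')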

Initializing Hyperparameters and Estimator call to the container

  import sagemaker

  hyperparameters = {
      "max_depth": "5",  # Maximum depth of a tree. Higher means more complex models but a higher risk of overfitting.
      "eta": "0.2",  # Learning rate. Lower values make the learning process slower but more precise.
      "gamma": "4",  # Minimum loss reduction required to make a further partition on a leaf node. Controls the model's complexity.
      "min_child_weight": "6",  # Minimum sum of instance weight (hessian) needed in a child. Higher values help prevent overfitting.
      "subsample": "0.7",  # Fraction of training data sampled in each boosting round. Reduces overfitting.
      "objective": "binary:logistic",  # Learning task and corresponding objective. binary:logistic is for binary classification.
      "num_round": 50,  # Number of boosting rounds, i.e. how many boosting iterations are run.
  }

  # A SageMaker estimator that calls the XGBoost container.
  # The train_* names are the older spellings; SageMaker Python SDK v2 names them
  # instance_count, instance_type, volume_size, use_spot_instances, max_run and max_wait.
  estimator = sagemaker.estimator.Estimator(
      image_uri=container,  # Points to the XGBoost container looked up earlier. Tells SageMaker which algorithm container to use.
      hyperparameters=hyperparameters,  # Passes the hyperparameters defined above, which guide the training process.
      role=sagemaker.get_execution_role(),  # IAM role that SageMaker assumes during the training job. Allows access to AWS resources such as S3.
      train_instance_count=1,  # Number of training instances. Here, a single instance.
      train_instance_type='ml.m5.large',  # Instance type used for training. ml.m5.large is a general-purpose instance balancing compute, memory, and network resources.
      train_volume_size=5,  # Size of the storage volume attached to the training instance, in GB. Here, 5 GB.
      output_path=output_path,  # Where the model artifacts and training job output are saved in S3.
      train_use_spot_instances=True,  # Uses spot instances (spare EC2 capacity), which can be significantly cheaper than on-demand instances.
      train_max_run=300,  # Maximum runtime of the training job in seconds. Here, 300 seconds (5 minutes).
      train_max_wait=600,  # Maximum time to wait for the job to complete, including time spent waiting for spot capacity. Here, 600 seconds (10 minutes).
  )
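
Before starting a paid training job, it can help to sanity-check the settings locally. The following is a minimal sketch, not part of the project: the helper name check_training_config is hypothetical, and the ranges it checks are the documented XGBoost ranges for these parameters plus the spot-training rule that the maximum wait time must be at least the maximum run time.

```python
def check_training_config(hyperparameters, max_run, max_wait):
    """Return a list of problems found; an empty list means none were detected.

    Hypothetical helper for illustration only.
    """
    problems = []

    # Hyperparameters are passed as strings, so convert them before comparing.
    eta = float(hyperparameters["eta"])
    if not 0.0 <= eta <= 1.0:
        problems.append(f"eta must be in [0, 1], got {eta}")

    subsample = float(hyperparameters["subsample"])
    if not 0.0 < subsample <= 1.0:
        problems.append(f"subsample must be in (0, 1], got {subsample}")

    for name in ("max_depth", "gamma", "min_child_weight"):
        if float(hyperparameters[name]) < 0:
            problems.append(f"{name} must be non-negative")

    if int(hyperparameters["num_round"]) < 1:
        problems.append("num_round must be at least 1")

    # With spot training, the wait budget covers waiting plus running time,
    # so it cannot be smaller than the run budget.
    if max_wait < max_run:
        problems.append("max_wait must be greater than or equal to max_run")

    return problems


hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "binary:logistic",
    "num_round": 50,
}

problems = check_training_config(hyperparameters, max_run=300, max_wait=600)
print(problems)
```

With the values used in this tutorial the list is empty; swapping the two time limits would report one problem.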

Training Job

  estimator.fit({'train': s3_input_train,'validation': s3_input_test})

Deployment

  xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')
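
Once the endpoint exists, requests to the built-in XGBoost container are commonly sent as CSV text and the scores come back as delimited text. The sketch below shows only that serialization step, under the assumption that the endpoint is used with the text/csv content type; the helper names rows_to_csv_payload and parse_scores are hypothetical and do not come from the project notebook.

```python
def rows_to_csv_payload(rows):
    """Serialize feature rows (without the label column) into a CSV request body."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def parse_scores(response_body):
    """Parse a comma- or newline-separated response body into a list of floats."""
    if isinstance(response_body, bytes):
        response_body = response_body.decode("utf-8")
    tokens = response_body.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


payload = rows_to_csv_payload([[1, 0, 3], [2, 5, 0]])
scores = parse_scores(b"0.12,0.93\n")
print(payload)
print(scores)
```

In a notebook, the payload would be what is handed to the predictor and the response body is what parse_scores receives; both steps are assumptions about usage rather than a copy of the project's code.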

Validation

After deployment, the model is validated on the test set to produce different metrics.
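
As an illustration of what validating a binary classifier can involve, the sketch below computes a confusion matrix and a few common metrics from predicted scores. It is a generic example, not the project's validation cell: the function name binary_metrics, the 0.5 threshold, and the sample labels and scores are all assumptions made for the example.

```python
def binary_metrics(y_true, scores, threshold=0.5):
    """Compute confusion-matrix counts and common metrics for binary labels."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1

    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "f1": f1,
    }


# Made-up labels and scores, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
metrics = binary_metrics(y_true, scores)
print(metrics)
```

The threshold is a design choice: with binary:logistic the model outputs probabilities, so moving the threshold trades precision against recall.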

Getting Started

Clone the repository using Git Bash, download it as a .zip file, or fork it.
Go to your AWS Management Console, click your account profile in the top-right corner, and select My Security Credentials from the dropdown.
Create Access Key: In the Access keys section, click Create New Access Key. A dialog will appear with your Access Key ID and Secret Access Key.
Download or Copy Keys: (IMPORTANT) Download the .csv file or copy the keys to a secure location. This is the only time you can view the secret access key.
Open the cloned repository in VS Code.
Create a file named terraform.tfvars under ClassiSage and fill in the values the Terraform configuration expects, following the project repository.

Download and install all the dependencies for using Terraform and Python.
In the terminal, type or paste terraform init to initialize the backend.
Then type or paste terraform plan to view the plan, or simply terraform validate to ensure that there are no errors.
Finally, in the terminal, type or paste terraform apply --auto-approve
This will show two outputs: bucket_name and pretrained_ml_instance_name (the third resource is the variable name given to the bucket, since S3 bucket names are global).
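
The two outputs are needed again in later steps, so it can be convenient to read them programmatically instead of copying them from the terminal. The sketch below assumes Terraform is installed and uses the terraform output -json command; the function names and the sample values are hypothetical, while the output names bucket_name and pretrained_ml_instance_name are the ones mentioned above.

```python
import json
import subprocess


def parse_terraform_outputs(raw_json):
    """Map each output name to its value, given `terraform output -json` text."""
    return {name: entry["value"] for name, entry in json.loads(raw_json).items()}


def read_terraform_outputs(project_dir):
    """Run `terraform output -json` in project_dir (needs an applied state)."""
    completed = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=project_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_terraform_outputs(completed.stdout)


# Sample text in the shape Terraform prints; the values are made up.
sample_output = """
{
  "bucket_name": {"sensitive": false, "type": "string", "value": "example-data-bucket"},
  "pretrained_ml_instance_name": {"sensitive": false, "type": "string", "value": "example-notebook-instance"}
}
"""

outputs = parse_terraform_outputs(sample_output)
print(outputs["bucket_name"])
```

Calling read_terraform_outputs with the path of the ClassiSage directory after terraform apply would return the real values.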

After the terminal shows that the command has completed, navigate to ClassiSage/ml_ops/function.py and, on the 11th line of the file, change the path to the location of the project directory on your system, then save the file.

Then, in ClassiSage/ml_ops/data_upload.ipynb, run all code cells up to cell number 25 to upload the dataset to the S3 bucket.
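
For readers curious what such an upload looks like in code, here is a minimal sketch rather than the notebook's actual cell. It assumes an S3 client with boto3's upload_file(Filename, Bucket, Key) method is passed in (in real use, boto3.client("s3")); the helper name upload_files, the local paths, and the bucket name are hypothetical. A small recording stand-in replaces the client so the example runs without AWS access.

```python
from pathlib import Path


def upload_files(s3_client, bucket_name, file_paths):
    """Upload each file to the bucket, using the file name as the object key."""
    uploaded_keys = []
    for file_path in file_paths:
        key = Path(file_path).name
        s3_client.upload_file(str(file_path), bucket_name, key)
        uploaded_keys.append(key)
    return uploaded_keys


class RecordingClient:
    """Stand-in for an S3 client that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def upload_file(self, Filename, Bucket, Key):
        self.calls.append((Filename, Bucket, Key))


client = RecordingClient()
keys = upload_files(
    client,
    "example-data-bucket",  # hypothetical bucket name
    ["data/final_dataset.csv", "ml_ops/pretrained_sm.ipynb"],  # hypothetical paths
)
print(keys)
```

Passing the client in as a parameter keeps the function easy to test and makes it explicit which credentials and region are being used.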

Output of the code cell execution

After the notebook has finished executing, re-open your AWS Management Console.
Search for the S3 and SageMaker services and you will see an instance of each service initiated (an S3 bucket and a SageMaker notebook).

The S3 bucket named 'data-bucket-' holds 2 uploaded objects: the dataset and the pretrained_sm.ipynb file containing the model code.

Go to the notebook instance in AWS SageMaker, click the created instance, and click Open Jupyter.
Then click New at the top right of the window and select Terminal.
This opens a new terminal.

In the terminal, run the command that copies pretrained_sm.ipynb from S3 into the notebook's Jupyter environment, using the bucket_name output shown in the VS Code terminal as the bucket name.
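
A copy of this kind is usually done with the AWS CLI's aws s3 cp command. The sketch below only builds such a command so that the bucket name can be substituted safely; it is an assumption about the shape of the command, not the project's exact line. The helper name build_copy_command and the bucket name are hypothetical, and the destination assumes the default working directory of a SageMaker notebook instance.

```python
import shlex


def build_copy_command(bucket_name,
                       notebook_name="pretrained_sm.ipynb",
                       destination="/home/ec2-user/SageMaker/"):
    """Build an `aws s3 cp` command that copies the notebook out of the bucket."""
    source = f"s3://{bucket_name}/{notebook_name}"
    return ["aws", "s3", "cp", source, destination]


command = build_copy_command("example-data-bucket")  # hypothetical bucket name
print(shlex.join(command))
```

The printed line can be pasted into the Jupyter terminal after replacing the example bucket name with the real bucket_name output.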

Go back to the opened Jupyter instance, click the pretrained_sm.ipynb file to open it, and assign it the conda_python3 kernel.
Scroll down to the 4th cell and set the value of the variable bucket_name (bucket_name = "") to the bucket_name shown in the VS Code terminal output.

Output of the code cell execution

At the top of the file, restart the kernel from the Kernel tab.
Execute the notebook up to code cell number 27.

You will get the intended result: the data is fetched, adjusted for labels and features, and split into train and test sets with a defined output path; a model is then trained with SageMaker's Python SDK, deployed as an endpoint, and validated to produce different metrics.
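
The data preparation step can be pictured with a small standard-library sketch. It is not the notebook's code: the function names, the 80/20 split, the seed, and the column names E1, E2 and Label are assumptions for illustration. What it does reflect is the CSV layout the built-in XGBoost algorithm expects for training: the label in the first column and no header row.

```python
import csv
import io
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


def to_xgboost_csv(rows, label_key):
    """Write rows as headerless CSV text with the label as the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        features = [value for key, value in row.items() if key != label_key]
        writer.writerow([row[label_key]] + features)
    return buffer.getvalue()


# Made-up rows with hypothetical column names.
rows = [{"E1": i, "E2": i % 3, "Label": i % 2} for i in range(10)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
print(to_xgboost_csv(train_rows[:2], label_key="Label"))
```

A fixed seed keeps the split reproducible between runs, which makes the validation metrics comparable.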

Console Observation Notes

Execution of 8th cell

An output path will be set up in S3 to store the model data.

Execution of 23rd cell

A training job will start; you can check it under the Training tab.

After some time (an estimated 3 minutes), the training job will complete and the console will show it as completed.

Execution of 24th code cell

An endpoint will be deployed under the Inference tab.

Additional Console Observation:

Creation of an Endpoint Configuration under the Inference tab.

Creation of a model, also under the Inference tab.

Ending and Cleanup

In VS Code, come back to data_upload.ipynb and execute the last 2 code cells to download the S3 bucket's data to the local system.
The downloaded folder will be named downloaded_bucket_content.

You will get a log of downloaded files in the output cell. The folder will contain the raw pretrained_sm.ipynb, final_dataset.csv, and a model output folder named 'pretrained-algo' with the execution data of the SageMaker code file.
Finally, go into pretrained_sm.ipynb inside the SageMaker instance and execute the final 2 code cells. The endpoint and the resources within the S3 bucket will be deleted to ensure no additional charges.
Deleting the endpoint: the deployed endpoint is removed.

Clearing S3 (needed to destroy the instance): the remaining objects in the S3 bucket are deleted.
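
The cleanup can be sketched with plain boto3-style calls. This is an illustration under stated assumptions, not the notebook's final cells: the function names, endpoint name, and bucket name are hypothetical, and the clients are passed in (in real use, boto3.client("sagemaker") and boto3.client("s3")). Small stand-in clients are used here so the example runs without AWS access. The bucket is emptied because Terraform generally cannot delete a bucket that still contains objects.

```python
def delete_endpoint(sagemaker_client, endpoint_name):
    """Delete a deployed endpoint so it stops accruing charges."""
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)


def empty_bucket(s3_client, bucket_name):
    """Delete every object in the bucket and return the deleted keys."""
    deleted = []
    while True:
        page = s3_client.list_objects_v2(Bucket=bucket_name)
        contents = page.get("Contents", [])
        if not contents:
            return deleted
        objects = [{"Key": item["Key"]} for item in contents]
        s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": objects})
        deleted.extend(item["Key"] for item in contents)


class FakeSageMaker:
    """Stand-in client that records which endpoints were deleted."""

    def __init__(self):
        self.deleted = []

    def delete_endpoint(self, EndpointName):
        self.deleted.append(EndpointName)


class FakeS3:
    """Stand-in client holding object keys in memory."""

    def __init__(self, keys):
        self.keys = list(keys)

    def list_objects_v2(self, Bucket):
        return {"Contents": [{"Key": k} for k in self.keys]} if self.keys else {}

    def delete_objects(self, Bucket, Delete):
        doomed = {obj["Key"] for obj in Delete["Objects"]}
        self.keys = [k for k in self.keys if k not in doomed]


sm = FakeSageMaker()
delete_endpoint(sm, "example-endpoint")  # hypothetical endpoint name
s3 = FakeS3(["final_dataset.csv", "pretrained_sm.ipynb"])
removed = empty_bucket(s3, "example-data-bucket")  # hypothetical bucket name
print(sm.deleted, removed, s3.keys)
```

This sketch does not handle versioned buckets, where object versions would also need to be removed.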

Come back to the VS Code terminal for the project and type or paste terraform destroy --auto-approve
All the created resource instances will be deleted.

Auto Created Objects

ClassiSage/downloaded_bucket_content
ClassiSage/.terraform
ClassiSage/ml_ops/__pycache__
ClassiSage/.terraform.lock.hcl
ClassiSage/terraform.tfstate
ClassiSage/terraform.tfstate.backup

NOTE:
If you liked the idea and implementation of this machine learning project, which uses AWS S3 and SageMaker for HDFS log classification and Terraform for IaC (infrastructure setup automation), kindly consider liking this post and starring the project repository on GitHub after checking it out.
