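In the AI revolution, most of us overlook a critical question: how do we maintain these complex AI systems? This is where Machine Learning Operations (MLOps) comes in. In this blog, we will understand the importance of MLOps by building an end-to-end project.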
This article was published as part of the Data Science Blogathon.
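Now, let's get straight into the project configuration. First, we need to download the Online Retail dataset from the UCI Machine Learning Repository. ZenML is not supported on Windows, so we need to use Linux (WSL on Windows) or macOS. Next, download requirements.txt. Now, let's move to the terminal for a few configuration steps.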
    # Make sure you have Python 3.10 or above installed
    python --version

    # Create a new Python environment using any method
    python3.10 -m venv myenv

    # Activate the environment
    source myenv/bin/activate

    # Install the requirements from the provided source above
    pip install -r requirements.txt

    # Install ZenML with the server extra
    # (quoted, and without spaces around ==, so that the shell and pip parse it correctly)
    pip install "zenml[server]==0.66.0"

    # Initialize ZenML in the project directory
    zenml init

    # Launch the ZenML dashboard
    zenml up
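Before running the commands above, it can help to confirm the two prerequisites this setup relies on: Python 3.10 or newer, and a platform other than native Windows. The following is a minimal sketch using only the standard library. The function name `check_environment` and its parameters are hypothetical and are not part of ZenML.

```python
import platform
import sys


def check_environment(version_info=None, system_name=None):
    """Return a list of human-readable problems; an empty list means all good.

    Both parameters are hypothetical hooks that make the check easy to test.
    By default the running interpreter and operating system are inspected.
    """
    major_minor = tuple(version_info or sys.version_info)[:2]
    system_name = system_name or platform.system()

    problems = []
    if major_minor < (3, 10):
        problems.append(
            f"Python 3.10+ is required, found {major_minor[0]}.{major_minor[1]}"
        )
    if system_name == "Windows":
        # Inside WSL, platform.system() reports "Linux", so WSL passes this check.
        problems.append("Use WSL, Linux or macOS instead of native Windows")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print(f"WARNING: {issue}")
    if not issues:
        print("Environment looks fine for the setup steps.")
```

Running the script inside the activated environment prints either a list of warnings or a confirmation line, which is a quick way to catch a wrong interpreter before installing anything.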
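Now, simply log in to the ZenML dashboard with the default login credentials (no password required).
Congratulations, you have successfully completed the project configuration.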
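Exploratory Data Analysis (EDA)
Pro tip: do your own analysis instead of just following mine.
Alternatively, you can simply follow this notebook, in which we created different data analysis methods to use in our project. Now, assuming you have done your share of the data analysis, let's jump straight to the spicy part.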
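To increase the modularity and reusability of our code, we use the @step decorator from ZenML. It organizes our code so that it can be passed into pipelines hassle-free, reducing the chance of errors.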
Sample code for data ingestion, ingest_data.py:
    import logging
    import pandas as pd
    from abc import ABC, abstractmethod

    # Setup logging configuration
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")


    # Abstract Base Class for Data Ingestion Strategy
    # ------------------------------------------------
    # This class defines a common interface for different data ingestion strategies.
    # Subclasses must implement the `ingest` method.
    class DataIngestionStrategy(ABC):
        @abstractmethod
        def ingest(self, file_path: str) -> pd.DataFrame:
            """
            Abstract method to ingest data from a file into a DataFrame.

            Parameters:
            file_path (str): The path to the data file to ingest.

            Returns:
            pd.DataFrame: A dataframe containing the ingested data.
            """
            pass


    # Concrete Strategy for XLSX File Ingestion
    # -----------------------------------------
    # This strategy handles the ingestion of data from an XLSX file.
    class XLSXIngestion(DataIngestionStrategy):
        def __init__(self, sheet_name=0):
            """
            Initializes the XLSXIngestion with optional sheet name.

            Parameters:
            sheet_name (str or int): The sheet name or index to read, default is the first sheet.
            """
            self.sheet_name = sheet_name

        def ingest(self, file_path: str) -> pd.DataFrame:
            """
            Ingests data from an XLSX file into a DataFrame.

            Parameters:
            file_path (str): The path to the XLSX file.

            Returns:
            pd.DataFrame: A dataframe containing the ingested data.
            """
            try:
                logging.info(f"Attempting to read XLSX file: {file_path}")
                df = pd.read_excel(
                    file_path,
                    dtype={'InvoiceNo': str, 'StockCode': str, 'Description': str},
                    sheet_name=self.sheet_name,
                )
                logging.info(f"Successfully read XLSX file: {file_path}")
                return df
            except FileNotFoundError:
                logging.error(f"File not found: {file_path}")
            except pd.errors.EmptyDataError:
                logging.error(f"File is empty: {file_path}")
            except Exception as e:
                logging.error(f"An error occurred while reading the XLSX file: {e}")
            # Reached only when reading failed: return an empty DataFrame
            return pd.DataFrame()


    # Context Class for Data Ingestion
    # --------------------------------
    # This class uses a DataIngestionStrategy to ingest data from a file.
    class DataIngestor:
        def __init__(self, strategy: DataIngestionStrategy):
            """
            Initializes the DataIngestor with a specific data ingestion strategy.

            Parameters:
            strategy (DataIngestionStrategy): The strategy to be used for data ingestion.
            """
            self._strategy = strategy

        def set_strategy(self, strategy: DataIngestionStrategy):
            """
            Sets a new strategy for the DataIngestor.

            Parameters:
            strategy (DataIngestionStrategy): The new strategy to be used for data ingestion.
            """
            logging.info("Switching data ingestion strategy.")
            self._strategy = strategy

        def ingest_data(self, file_path: str) -> pd.DataFrame:
            """
            Executes the data ingestion using the current strategy.

            Parameters:
            file_path (str): The path to the data file to ingest.

            Returns:
            pd.DataFrame: A dataframe containing the ingested data.
            """
            logging.info("Ingesting data using the current strategy.")
            return self._strategy.ingest(file_path)


    # Example usage
    if __name__ == "__main__":
        # Example file path for XLSX file
        # file_path = "../data/raw/your_data_file.xlsx"

        # XLSX Ingestion Example
        # xlsx_ingestor = DataIngestor(XLSXIngestion(sheet_name=0))
        # df = xlsx_ingestor.ingest_data(file_path)

        # Show the first few rows of the ingested DataFrame if successful
        # if not df.empty:
        #     logging.info("Displaying the first few rows of the ingested data:")
        #     print(df.head())
        pass
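The listing above follows the strategy pattern: the context class delegates to whichever strategy object it currently holds, so a new file format only needs a new strategy class. The sketch below illustrates that swap with the standard library alone. It is a simplified stand-in, not the project's code: the class names `TextIngestionStrategy`, `CSVTextIngestion`, `JSONTextIngestion` and `TextIngestor` are hypothetical, they read from in-memory text rather than files, and they return lists of dictionaries instead of DataFrames.

```python
import csv
import io
import json
from abc import ABC, abstractmethod


class TextIngestionStrategy(ABC):
    """Hypothetical counterpart of DataIngestionStrategy that parses raw text."""

    @abstractmethod
    def ingest(self, text: str) -> list:
        ...


class CSVTextIngestion(TextIngestionStrategy):
    def ingest(self, text: str) -> list:
        # csv.DictReader uses the first row as column names; values stay strings.
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]


class JSONTextIngestion(TextIngestionStrategy):
    def ingest(self, text: str) -> list:
        # Expects a JSON array of objects.
        return json.loads(text)


class TextIngestor:
    """Context class: delegates to whichever strategy is currently set."""

    def __init__(self, strategy: TextIngestionStrategy):
        self._strategy = strategy

    def set_strategy(self, strategy: TextIngestionStrategy) -> None:
        self._strategy = strategy

    def ingest_data(self, text: str) -> list:
        return self._strategy.ingest(text)


if __name__ == "__main__":
    ingestor = TextIngestor(CSVTextIngestion())
    print(ingestor.ingest_data("InvoiceNo,Quantity\n536365,6\n"))

    # Swap the strategy at runtime; the calling code stays the same.
    ingestor.set_strategy(JSONTextIngestion())
    print(ingestor.ingest_data('[{"InvoiceNo": "536365", "Quantity": 6}]'))
```

The calling code never changes when the format does, which is the property that lets the ingestion logic sit behind a single pipeline step.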
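Once all the methods are written, it's time to create the ZenML steps in the steps folder. All the methods we have created so far will be used inside these ZenML steps.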
Sample code for data_ingestion_step.py:
    import os
    import sys

    sys.path.append(os.path.dirname(os.path.dirname(__file__)))

    import pandas as pd
    from src.ingest_data import DataIngestor, XLSXIngestion
    from zenml import step


    @step
    def data_ingestion_step(file_path: str) -> pd.DataFrame:
        """
        Ingests data from an XLSX file into a DataFrame.

        Parameters:
        file_path (str): The path to the XLSX file.

        Returns:
        pd.DataFrame: A dataframe containing the ingested data.
        """
        # Initialize the DataIngestor with an XLSXIngestion strategy
        ingestor = DataIngestor(XLSXIngestion())

        # Ingest data from the specified file
        df = ingestor.ingest_data(file_path)

        return df
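Wow! Congratulations on creating and learning one of the most important parts of MLOps. It's okay to feel a little overwhelmed, since this is your first time. Don't stress too much: everything will make sense once you run your first production-grade ML model.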
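It's time to build our pipeline. No, not the kind that carries water or oil. A pipeline is a series of steps organized in a specific order to form our complete machine learning workflow. In ZenML, the @pipeline decorator defines a pipeline that contains the steps we created above. This approach ensures that the output of one step can be used as the input to the next.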
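To make the idea of chaining concrete, here is a toy sketch in plain Python in which each step's return value feeds the next step. It is only an illustration of the data flow and makes no claim about how ZenML implements its decorators: the `step` decorator, the `executed_steps` list, the three step functions and the sample values are all made up for this example.

```python
import functools

# Names of the steps in the order they ran (illustration only).
executed_steps = []


def step(func):
    """Toy stand-in for a step decorator: records the call, then runs the function."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        executed_steps.append(func.__name__)
        return func(*args, **kwargs)

    return wrapper


@step
def ingest():
    # Made-up order values; None marks a missing entry.
    return [12.5, 40.0, None, 7.5]


@step
def drop_missing(values):
    return [v for v in values if v is not None]


@step
def total_value(values):
    return sum(values)


def toy_pipeline():
    """Run the steps in a fixed order, passing each output to the next step."""
    executed_steps.clear()
    raw = ingest()
    cleaned = drop_missing(raw)
    return total_value(cleaned)


if __name__ == "__main__":
    print(toy_pipeline())
    print(executed_steps)
```

Because every step is an ordinary function with explicit inputs and outputs, a step can be replaced or tested on its own without touching the rest of the chain.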
Here is our training_pipeline.py:

    import os
    import sys

    sys.path.append(os.path.dirname(os.path.dirname(__file__)))

    from steps.data_ingestion_step import data_ingestion_step
    from steps.handling_missing_values_step import handling_missing_values_step
    from steps.dropping_columns_step import dropping_columns_step
    from steps.detecting_outliers_step import detecting_outliers_step
    from steps.feature_engineering_step import feature_engineering_step
    from steps.data_splitting_step import data_splitting_step
    from steps.model_building_step import model_building_step
    from steps.model_evaluating_step import model_evaluating_step
    from steps.data_resampling_step import data_resampling_step
    from zenml import Model, pipeline


    @pipeline(model=Model(name='CLTV_Prediction'))
    def training_pipeline():
        """
        Defines the complete training pipeline for CLTV Prediction.

        Steps:
        1. Data ingestion
        2. Dropping unnecessary columns
        3. Detecting and handling outliers
        4. Feature engineering
        5. Handling missing values
        6. Splitting data into train and test sets
        7. Resampling the training data
        8. Model training
        9. Model evaluation
        """
        # Step 1: Data ingestion
        raw_data = data_ingestion_step(file_path='data/Online_Retail.xlsx')

        # Step 2: Drop unnecessary columns
        columns_to_drop = ["Country", "Description", "InvoiceNo", "StockCode"]
        refined_data = dropping_columns_step(raw_data, columns_to_drop)

        # Step 3: Detect and handle outliers
        outlier_free_data = detecting_outliers_step(refined_data)

        # Step 4: Feature engineering
        features_data = feature_engineering_step(outlier_free_data)

        # Step 5: Handle missing values
        cleaned_data = handling_missing_values_step(features_data)

        # Step 6: Data splitting
        train_features, test_features, train_target, test_target = data_splitting_step(cleaned_data, "CLTV")

        # Step 7: Data resampling
        train_features_resampled, train_target_resampled = data_resampling_step(train_features, train_target)

        # Step 8: Model training
        trained_model = model_building_step(train_features_resampled, train_target_resampled)

        # Step 9: Model evaluation
        evaluation_metrics = model_evaluating_step(trained_model, test_features, test_target)

        # Return evaluation metrics
        return evaluation_metrics


    if __name__ == "__main__":
        # Run the pipeline
        training_pipeline()

Now we can train our ML model with a single run of training_pipeline.py. You can check the pipeline in the ZenML dashboard:

Creating the deployment pipeline

When we run the deployment pipeline, we get a view like this in the ZenML dashboard:

Creating the Flask app

To create the index.html file, follow the code below:
After execution, your app.py should look like this:
mlflow ui
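The final step is to commit these changes to your GitHub repository and deploy the model online on any cloud server. For this project, we will deploy app.py on a free Render server, and you can do the same.
That's it. You have successfully created your first MLOps project. Hope you enjoyed it!
Conclusion
MLOps has become an essential practice for managing the complexity of machine learning workflows, from data ingestion to model deployment. By leveraging ZenML, an open-source MLOps framework, we simplified the process of building, training, and deploying a production-grade ML model for Customer Lifetime Value (CLTV) prediction. Through modular coding, robust pipelines, and seamless integration, we demonstrated how to create an end-to-end project efficiently. As businesses increasingly rely on AI-driven solutions, frameworks like ZenML empower teams to maintain scalability, reproducibility, and performance with minimal manual intervention.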