
Understanding MLOps with a ZenML Project

Lisa Kudrow
2025-03-08 11:16:09

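Amid the AI revolution, most of us overlook a very critical question: how do we maintain these sophisticated AI systems? That is where Machine Learning Operations (MLOps) comes into play. In this blog, we will understand the importance of MLOps by building an end-to-end project.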

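This article was published as a part of the Data Science Blogathon.

    What is MLOps?

    MLOps empowers machine learning engineers to streamline the lifecycle of an ML model. Productionizing machine learning is difficult. The lifecycle consists of many complex components, such as data ingestion, data preparation, model training, model tuning, model deployment, model monitoring, explainability, and more. MLOps automates each step of the process through robust pipelines to reduce manual errors. It is a collaborative practice that eases your AI infrastructure with minimal manual effort and maximum operational efficiency. Think of MLOps as DevOps for the AI industry, with some spices.

    What is ZenML?

    ZenML is an open-source MLOps framework that simplifies the development, deployment, and management of machine learning workflows. By harnessing the principles of MLOps, it integrates with a variety of tools and infrastructure and offers a modular approach to maintaining AI workflows under a single workspace. ZenML provides features such as auto-logs, a metadata tracker, a model tracker, an experiment tracker, an artifact store, and simple Python decorators for core logic without complex configuration.

    Understanding MLOps with a Hands-on Project

    Now we will look at how MLOps is implemented with the help of a simple, end-to-end, production-grade data science project. In this project, we will create and deploy a machine learning model that predicts the customer lifetime value (CLTV) of a customer. CLTV is a key metric companies use to see how much they will profit or lose from a customer in the long term. Using this metric, a company can decide whether or not to spend further on that customer, for example on targeted ads.

    Let's start implementing the project in the next section.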
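The article does not spell out how CLTV is calculated, so the snippet below is only a rough sketch of one very simple proxy: total historical spend per customer, summed over invoice lines. The tuple layout mirrors the `CustomerID`, `Quantity`, and `UnitPrice` columns of the Online Retail dataset, but the values and the `historical_spend` helper are made up for illustration. The project itself trains a model rather than using this sum.

```python
from collections import defaultdict

# Hypothetical invoice lines: (customer_id, quantity, unit_price).
# The values are invented; only the column roles follow the dataset.
invoice_lines = [
    ("C001", 2, 10.0),
    ("C001", 1, 5.0),
    ("C002", 3, 4.0),
    ("C002", -1, 4.0),  # a negative quantity models a returned item here
]


def historical_spend(lines):
    """Sum quantity * unit_price per customer as a naive CLTV proxy."""
    totals = defaultdict(float)
    for customer_id, quantity, unit_price in lines:
        totals[customer_id] += quantity * unit_price
    return dict(totals)


spend = historical_spend(invoice_lines)
print(spend)
```

A proxy like this only looks backwards. The point of the project is to predict the value from engineered features instead.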

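    Initial Configuration

    Now, let's get straight into the project configuration. First, we need to download the Online Retail dataset from the UCI Machine Learning Repository. ZenML is not supported on Windows, so we need to use either Linux (WSL on Windows) or macOS. Next, download requirements.txt. Now let's move to the terminal for a few configurations.
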
    # Make sure you have Python 3.10 or above installed
    python --version
    
    # Make a new Python environment using any method
    python3.10 -m venv myenv 
    
    # Activate the environment
    source myenv/bin/activate
    
    # Install the requirements from the provided source above
    pip install -r requirements.txt
    
    # Install ZenML with the server extra (quoted so the shell does not expand the brackets)
    pip install "zenml[server]==0.66.0"
    
    # Initialize a ZenML repository in the current directory
    zenml init
    
    # Launch the Zenml dashboard
    zenml up

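    Now simply log in to the ZenML dashboard with the default login credentials (no password required).

    Congratulations, you have successfully completed the project configuration.
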

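    Exploratory Data Analysis (EDA)

    Now it's time to get our hands dirty with the data. We will create a Jupyter notebook for analyzing our data.

    Pro tip: do your own analysis without following me.

    Or you can simply follow this notebook, where we have created different data analysis methods to use in our project.

    Now, assuming you have done your share of data analysis, let's jump straight to the spicy part.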

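    Defining ZenML Steps as Modular Code

    To increase the modularity and reusability of our code, we use the @step decorator from ZenML. It organizes our code so that it can be passed into pipelines hassle-free, reducing the chance of errors.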
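To give a feel for what a decorator can add around a plain function, here is a toy stand-in. It is not ZenML's implementation and does not use the ZenML API. `toy_step`, `RUN_LOG`, and `drop_missing` are hypothetical names chosen only for this sketch, which logs each call and records the step name and output type.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)

RUN_LOG = []  # hypothetical in-memory record of executed steps


def toy_step(func):
    """Wrap a function so that each call is logged and recorded."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logging.info("Running step: %s", func.__name__)
        result = func(*args, **kwargs)
        # Keep the step name and the type of its output for later inspection.
        RUN_LOG.append((func.__name__, type(result).__name__))
        return result
    return wrapper


@toy_step
def drop_missing(rows):
    """Keep only rows that contain no None value."""
    return [row for row in rows if None not in row]


cleaned = drop_missing([(1, 2), (3, None)])
print(cleaned, RUN_LOG)
```

The decorated function keeps its own logic, and the bookkeeping lives in one place. That separation is the modularity the paragraph above refers to.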

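    In our src folder, we will write the methods for each step before initializing the steps themselves. We follow a system design pattern for each method by creating an abstract method for its strategies (data ingestion, data cleaning, feature engineering, and so on).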

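    Sample code for data ingestion, ingest_data.py, is shown below. We will follow this pattern to create the rest of the methods. You can copy the code from the given GitHub repository.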

    import logging
    import pandas as pd
    from abc import ABC, abstractmethod
    
    # Setup logging configuration
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
    
    # Abstract Base Class for Data Ingestion Strategy
    # ------------------------------------------------
    # This class defines a common interface for different data ingestion strategies.
    # Subclasses must implement the `ingest` method.
    class DataIngestionStrategy(ABC):
        @abstractmethod
        def ingest(self, file_path: str) -> pd.DataFrame:
            """
            Abstract method to ingest data from a file into a DataFrame.
    
            Parameters:
            file_path (str): The path to the data file to ingest.
    
            Returns:
            pd.DataFrame: A dataframe containing the ingested data.
            """
            pass
        
    # Concrete Strategy for XLSX File Ingestion
    # -----------------------------------------
    # This strategy handles the ingestion of data from an XLSX file.
    class XLSXIngestion(DataIngestionStrategy):
        def __init__(self, sheet_name=0):
            """
            Initializes the XLSXIngestion with optional sheet name.
    
            Parameters:
            sheet_name (str or int): The sheet name or index to read, default is the first sheet.
            """
            self.sheet_name = sheet_name
    
        def ingest(self, file_path: str) -> pd.DataFrame:
            """
            Ingests data from an XLSX file into a DataFrame.
    
            Parameters:
            file_path (str): The path to the XLSX file.
    
            Returns:
            pd.DataFrame: A dataframe containing the ingested data.
            """
            try:
                logging.info(f"Attempting to read XLSX file: {file_path}")
                df = pd.read_excel(file_path,dtype={'InvoiceNo': str, 'StockCode': str, 'Description':str}, sheet_name=self.sheet_name)
                logging.info(f"Successfully read XLSX file: {file_path}")
                return df
            except FileNotFoundError:
                logging.error(f"File not found: {file_path}")
            except pd.errors.EmptyDataError:
                logging.error(f"File is empty: {file_path}")
            except Exception as e:
                logging.error(f"An error occurred while reading the XLSX file: {e}")
            return pd.DataFrame()
    
    
    # Context Class for Data Ingestion
    # --------------------------------
    # This class uses a DataIngestionStrategy to ingest data from a file.
    class DataIngestor:
        def __init__(self, strategy: DataIngestionStrategy):
            """
            Initializes the DataIngestor with a specific data ingestion strategy.
    
            Parameters:
            strategy (DataIngestionStrategy): The strategy to be used for data ingestion.
            """
            self._strategy = strategy
    
        def set_strategy(self, strategy: DataIngestionStrategy):
            """
            Sets a new strategy for the DataIngestor.
    
            Parameters:
            strategy (DataIngestionStrategy): The new strategy to be used for data ingestion.
            """
            logging.info("Switching data ingestion strategy.")
            self._strategy = strategy
    
        def ingest_data(self, file_path: str) -> pd.DataFrame:
            """
            Executes the data ingestion using the current strategy.
    
            Parameters:
            file_path (str): The path to the data file to ingest.
    
            Returns:
            pd.DataFrame: A dataframe containing the ingested data.
            """
            logging.info("Ingesting data using the current strategy.")
            return self._strategy.ingest(file_path)
    
    
    # Example usage
    if __name__ == "__main__":
        # Example file path for XLSX file
        # file_path = "../data/raw/your_data_file.xlsx"
    
        # XLSX Ingestion Example
        # xlsx_ingestor = DataIngestor(XLSXIngestion(sheet_name=0))
        # df = xlsx_ingestor.ingest_data(file_path)
    
        # Show the first few rows of the ingested DataFrame if successful
        # if not df.empty:
        #     logging.info("Displaying the first few rows of the ingested data:")
        #     print(df.head())
        pass
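One benefit of the strategy pattern above is that a second file format can be added without touching the context class. The sketch below illustrates this with a CSV strategy. It is a standard-library-only illustration, so it returns a list of dicts instead of a pandas DataFrame, and the names `IngestionStrategy`, `CSVIngestion`, and `Ingestor` are hypothetical stand-ins rather than code from the project repository.

```python
import csv
import os
import tempfile
from abc import ABC, abstractmethod


class IngestionStrategy(ABC):
    """Common interface; plays the role of DataIngestionStrategy above."""

    @abstractmethod
    def ingest(self, file_path: str) -> list:
        ...


class CSVIngestion(IngestionStrategy):
    """Hypothetical second strategy: read a CSV file into a list of dicts."""

    def ingest(self, file_path: str) -> list:
        with open(file_path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))


class Ingestor:
    """Context class: delegates to whichever strategy it was given."""

    def __init__(self, strategy: IngestionStrategy):
        self._strategy = strategy

    def ingest_data(self, file_path: str) -> list:
        return self._strategy.ingest(file_path)


# Write a tiny made-up CSV file so the sketch runs end to end.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("CustomerID,Quantity\nC001,2\nC002,3\n")
    sample_path = tmp.name

rows = Ingestor(CSVIngestion()).ingest_data(sample_path)
os.remove(sample_path)
print(rows)
```

Note that `csv.DictReader` yields every value as a string, so a real strategy would also need to convert column types.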

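    After writing all the methods, it's time to initialize the ZenML steps in our steps folder. All the methods we have created so far will now be used in the corresponding ZenML steps.

    Sample code of data_ingestion_step.py is shown below. We will follow the same pattern to create the rest of the ZenML steps in our project. You can copy them from the project repository.
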

    import os
    import sys
    sys.path.append(os.path.dirname(os.path.dirname(__file__)))
    
    import pandas as pd
    from src.ingest_data import DataIngestor, XLSXIngestion
    from zenml import step
    
    @step
    def data_ingestion_step(file_path: str) -> pd.DataFrame:
        """
        Ingests data from an XLSX file into a DataFrame.
    
        Parameters:
        file_path (str): The path to the XLSX file.
    
        Returns:
        pd.DataFrame: A dataframe containing the ingested data.
        """
        # Initialize the DataIngestor with an XLSXIngestion strategy
        
        ingestor = DataIngestor(XLSXIngestion())
        
        # Ingest data from the specified file
        
        df = ingestor.ingest_data(file_path)
        
        return df

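    Wow! Congratulations on creating and learning one of the most important parts of MLOps. It is okay to feel a little overwhelmed, since this is your first time. Don't stress too much, because everything will make sense once you run your first production-grade ML model.
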

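    Building Pipelines

    It's time to build our pipelines. No, not the kind that carry water or oil. A pipeline is a series of steps organized in a specific order to form our complete machine learning workflow. The @pipeline decorator is used in ZenML to mark the function that contains the steps we created above. This approach ensures that we can use the output of one step as the input of the next.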
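The idea of feeding one step's output into the next can be shown with plain functions. The sketch below is only an analogy in standard Python and not the ZenML API. The step names and the records are invented for illustration.

```python
def ingest():
    """Step 1: return made-up (customer_id, amount) records."""
    return [("C001", 25.0), ("C002", None), ("C003", 8.0)]


def drop_missing(records):
    """Step 2: discard records whose amount is missing."""
    return [record for record in records if record[1] is not None]


def total_amount(records):
    """Step 3: aggregate the remaining amounts."""
    return sum(amount for _, amount in records)


def toy_pipeline():
    """Call the steps in order, passing each output to the next step."""
    raw = ingest()
    cleaned = drop_missing(raw)
    return total_amount(cleaned)


result = toy_pipeline()
print(result)
```

A pipeline function in the project has the same shape: each call's return value is named and handed to the following step.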

    Here is our training_pipeline.py:

    import os
    import sys
    sys.path.append(os.path.dirname(os.path.dirname(__file__)))
    from steps.data_ingestion_step import data_ingestion_step
    from steps.handling_missing_values_step import handling_missing_values_step
    from steps.dropping_columns_step import dropping_columns_step
    from steps.detecting_outliers_step import detecting_outliers_step
    from steps.feature_engineering_step import feature_engineering_step
    from steps.data_splitting_step import data_splitting_step
    from steps.model_building_step import model_building_step
    from steps.model_evaluating_step import model_evaluating_step
    from steps.data_resampling_step import data_resampling_step
    from zenml import Model, pipeline
    
    
    @pipeline(model=Model(name='CLTV_Prediction'))
    def training_pipeline():
        """
        Defines the complete training pipeline for CLTV Prediction.
        Steps:
        1. Data ingestion
        2. Dropping unnecessary columns
        3. Detecting and handling outliers
        4. Feature engineering
        5. Handling missing values
        6. Splitting data into train and test sets
        7. Resampling the training data
        8. Model training
        9. Model evaluation
        """
        # Step 1: Data ingestion
        raw_data = data_ingestion_step(file_path='data/Online_Retail.xlsx')
    
        # Step 2: Drop unnecessary columns
        columns_to_drop = ["Country", "Description", "InvoiceNo", "StockCode"]
        refined_data = dropping_columns_step(raw_data, columns_to_drop)
    
        # Step 3: Detect and handle outliers
        outlier_free_data = detecting_outliers_step(refined_data)
    
        # Step 4: Feature engineering
        features_data = feature_engineering_step(outlier_free_data)
        
        # Step 5: Handle missing values
        cleaned_data = handling_missing_values_step(features_data)
        
        # Step 6: Data splitting
        train_features, test_features, train_target, test_target = data_splitting_step(cleaned_data,"CLTV")
    
        # Step 7: Data resampling
        train_features_resampled, train_target_resampled = data_resampling_step(train_features, train_target)
    
        # Step 8: Model training
        trained_model = model_building_step(train_features_resampled, train_target_resampled)
    
        # Step 9: Model evaluation
        evaluation_metrics = model_evaluating_step(trained_model, test_features, test_target)
    
        # Return evaluation metrics
        return evaluation_metrics
    
    
    if __name__ == "__main__":
        # Run the pipeline
        training_pipeline()

    Now we can run training_pipeline.py to train our ML model in a single click. You can check the pipeline in your ZenML dashboard.

    We can check our model details, and also train multiple models and compare them in the MLflow dashboard, by running the following command in the terminal:

    mlflow ui

    Creating the Deployment Pipeline

    Next, we will create deployment_pipeline.py.

    When the deployment pipeline runs, it appears as a pipeline view in the ZenML dashboard.

    Congratulations, you have deployed the best model using MLflow and ZenML in your local instance.

    Creating the Flask App

    Our next step is to create a Flask app that will project our model to the end user. For that, we have to create app.py, plus an index.html inside the templates folder.
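For the Flask app, the framework-independent part of a prediction handler can be sketched as follows: turn submitted form fields into a feature row, pass it to the model, and return the prediction. This is a minimal sketch under stated assumptions and not the article's app.py. The field names in `FEATURE_NAMES`, the `StubModel` class, and both helper functions are hypothetical, and the stub merely sums its inputs in place of the deployed model.

```python
# Hypothetical feature names; the real form fields depend on the trained model.
FEATURE_NAMES = ["frequency", "total_amount", "avg_order_value"]


class StubModel:
    """Stand-in for the deployed model; it just sums the features."""

    def predict(self, rows):
        return [sum(row) for row in rows]


def parse_features(form):
    """Convert submitted form fields (strings) to floats in a fixed order."""
    try:
        return [float(form[name]) for name in FEATURE_NAMES]
    except KeyError as exc:
        raise ValueError(f"Missing field: {exc.args[0]}") from exc


def handle_prediction(form, model):
    """Build one feature row from the form and return the prediction."""
    features = parse_features(form)
    return {"prediction": model.predict([features])[0]}


response = handle_prediction(
    {"frequency": "3", "total_amount": "120.5", "avg_order_value": "40"},
    StubModel(),
)
print(response)
```

In a web framework, a route function would pass the request's form data to `handle_prediction` and render the result into index.html.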

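    The last step is to commit these changes to your GitHub repository and deploy the model online on any cloud server. For this project, we will deploy app.py on a free Render server, and you can do the same.

    Go to render.com and connect your GitHub repository to Render.

    That's it. You have successfully created your first MLOps project. Hope you enjoyed it!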

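    Conclusion

    MLOps has become an essential practice for managing the complexity of machine learning workflows, from data ingestion to model deployment. By using ZenML, an open-source MLOps framework, we streamlined the process of building, training, and deploying a production-grade ML model for customer lifetime value (CLTV) prediction. Through modular coding, robust pipelines, and seamless integrations, we demonstrated how to create an end-to-end project efficiently. As businesses rely more and more on AI-driven solutions, frameworks like ZenML empower teams to maintain scalability, reproducibility, and performance with minimal manual intervention.
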

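    Key Takeaways

    • MLOps simplifies the ML lifecycle, reducing errors and increasing efficiency through automated pipelines.
    • ZenML provides modular, reusable coding structures for managing machine learning workflows.
    • Building an end-to-end pipeline involves defining clear steps, from data ingestion to deployment.
    • Deployment pipelines and Flask apps ensure ML models are production-ready and accessible.
    • Tools like ZenML and MLflow enable seamless tracking, monitoring, and optimization of ML projects.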
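
    Frequently Asked Questions

    Q1. What is MLOps, and why is it important?
    MLOps (Machine Learning Operations) streamlines the ML lifecycle by automating processes such as data ingestion, model training, deployment, and monitoring, which ensures efficiency and scalability.

    Q2. What is ZenML used for?
    ZenML is an open-source MLOps framework that simplifies the development, deployment, and management of machine learning workflows with modular and reusable code.

    Q3. Can I use ZenML on Windows?
    ZenML is not directly supported on Windows, but it can be used with WSL (Windows Subsystem for Linux).

    Q4. What is the purpose of pipelines in ZenML?
    Pipelines in ZenML define a sequence of steps, ensuring a structured and reusable workflow for machine learning projects.

    Q5. How does the Flask app integrate with the ML model?
    The Flask app serves as a user interface, allowing end users to input data and receive predictions from the deployed ML model.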
