
Magic Mushrooms: Exploring and Handling Null Data with Mage

王林 | Original | 2024-08-18 06:02:02 | 809 views

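Mage is a powerful tool for ETL work. It offers features for data exploration and mining, quick visualizations through chart templates, and several other capabilities that make working with data feel a little like magic.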

When processing data, it is common to find missing values during the ETL process that can cause problems later on. Depending on what you intend to do with the dataset, null data can be quite disruptive.

To check whether a dataset has missing data, we can use Python and the pandas library to find the null values. We can also build charts that show even more clearly how much these null values affect our dataset.
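As a minimal sketch of that pandas check, the snippet below counts nulls per column and converts the counts to percentages. The DataFrame is a hypothetical toy sample (the column names only imitate the mushroom dataset), not the real data.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "cap_shape": ["x", "f", None, "x"],
    "veil_type": [None, None, None, "u"],
    "stem_height": [4.5, 3.9, 6.1, 5.0],
})

null_counts = df.isna().sum()          # number of nulls in each column
null_percent = df.isna().mean() * 100  # share of nulls in each column, in %

print(null_counts)
print(null_percent.round(1))
```

`isna().mean()` works because the boolean mask is averaged as 0/1 values, which gives the fraction of missing entries directly.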

Our pipeline has four steps: it starts with data loading, continues with two transformation steps, and ends with data export.


Data Loader

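In this article we use the Binary Prediction of Poisonous Mushrooms dataset, which is available on Kaggle as part of a competition. We will work with the training dataset provided on the site.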

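Let's create a Data Loader step in Python to load the data we will use. Before this step, I created a table in a Postgres database running locally on my machine so the data could be loaded from there. Since the data is in Postgres, we will use the Postgres loading template that Mage already provides.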

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from os import path

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@data_loader
def load_data_from_postgres(*args, **kwargs):
    """
    Template for loading data from a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.
    Docs: https://docs.mage.ai/design/data-loading#postgresql
    """
    query = 'SELECT * FROM mushroom'  # Specify your SQL query here
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:

        return loader.load(query)

@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """

    assert output is not None, 'The output is undefined'

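Inside the load_data_from_postgres() function, we define the query used to load the table from the database. In my case, the database connection settings are stored in the io_config.yaml file under the default profile, so we only need to pass the name default in the config_profile variable.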

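After running the block, we use the Add chart feature, which provides information about our data through predefined templates. Click the icon next to the play button, marked with a yellow line in the image.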

[Screenshot: the Add chart icon next to the play button, marked with a yellow line]

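To explore the dataset further, we select two options: summary_overview and feature_profiles. With summary_overview we get the number of columns and rows in the dataset, as well as the total number of columns by type, for example the totals of categorical, numeric, and boolean columns. feature_profiles, on the other hand, gives more descriptive information about the data, such as type, minimum value, and maximum value, and it also lets us see the missing values, which are the focus of our processing.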
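For readers who want similar numbers outside Mage, here is a rough pandas sketch of the same kind of overview. It is not how Mage computes its charts; the toy DataFrame and its column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data with one categorical, one numeric, and one boolean column.
df = pd.DataFrame({
    "cap_shape": ["x", "f", None, "x"],
    "stem_height": [4.5, 3.9, None, 5.0],
    "has_ring": [True, False, True, True],
})

# Overview: dataset size and number of columns per broad type.
n_rows, n_cols = df.shape
numeric_cols = df.select_dtypes(include="number").columns.tolist()
bool_cols = df.select_dtypes(include="bool").columns.tolist()
categorical_cols = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Profile of a single feature: type, minimum, maximum, and missing count.
stem_profile = {
    "dtype": str(df["stem_height"].dtype),
    "min": df["stem_height"].min(),
    "max": df["stem_height"].max(),
    "missing": int(df["stem_height"].isna().sum()),
}

print(n_rows, n_cols, numeric_cols, bool_cols, categorical_cols)
print(stem_profile)
```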

To focus on the missing data, let's use the bar chart template % of missing values, which shows the percentage of missing data in each column.


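The chart shows four columns in which missing values account for more than 80% of the content, plus other columns that also have missing values but in smaller amounts. With this information we can explore different strategies for handling the null data.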

Transformer: Drop Columns

For the columns with 80% or more null values, our strategy is to drop them from the DataFrame, selecting the columns we want to exclude. We use a TRANSFORMER block in Python and choose the Remove columns option.

from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:
    """
    Execute Transformer Action: ActionType.REMOVE
    Docs: https://docs.mage.ai/guides/transformer-blocks#remove-columns
    """
    action = build_transformer_action(
        df,
        action_type=ActionType.REMOVE,
        arguments=['veil_type', 'spore_print_color', 'stem_root', 'veil_color'],        
        axis=Axis.COLUMN,
    )
    return BaseAction(action).execute(df)

@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.

    """
    assert output is not None, 'The output is undefined'

Inside the execute_transformer_action() function, we put a list with the names of the columns we want to exclude from the dataset in the arguments variable. After that, we just run the block.
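The column list above was typed by hand after reading the chart. As a hedged alternative, the sketch below derives the list from an 80% threshold using plain pandas. The toy data, the THRESHOLD name, and the variable names are assumptions for illustration; this is not part of the Mage template.

```python
import pandas as pd

# Hypothetical toy data: two mostly-empty columns and one mostly-filled column.
df = pd.DataFrame({
    "cap_shape": ["x", "f", None, "x", "b", "s"],
    "veil_type": [None, None, None, None, None, "u"],   # 5 of 6 values missing
    "stem_root": [None, None, None, None, None, None],  # all values missing
})

THRESHOLD = 0.8  # drop columns whose share of nulls is at least 80%

null_ratio = df.isna().mean()
to_drop = null_ratio[null_ratio >= THRESHOLD].index.tolist()
cleaned = df.drop(columns=to_drop)

print(to_drop)
print(cleaned.columns.tolist())
```

A list built this way could be passed as the arguments value in the transformer block, which keeps the threshold explicit if the data changes.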

Transformer: Fill in Missing Values

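For the columns with less than 80% null values, we use the Fill in missing values strategy. In some cases, even though data is missing, replacing it with a value such as the mean or the mode can satisfy the data requirements without changing the dataset much, depending on your end goal.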

In some tasks, such as classification, replacing missing data with a value that is relevant to the dataset (mode, mean, or median) can help the classification algorithm, which might reach different conclusions if the data were deleted as in the previous strategy.

To decide which measure to use, we turn again to Mage's Add chart feature. With the Most frequent values template we can see the mode of each column and how often that value occurs.
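The same information can be approximated in pandas with value_counts(). This is only a sketch on hypothetical toy data, not the code behind the Mage chart.

```python
import pandas as pd

# Hypothetical toy data with missing values in both columns.
df = pd.DataFrame({
    "gill_spacing": ["c", "c", None, "d", "c"],
    "ring_type": ["f", None, "f", "z", None],
})

most_frequent = {}
for col in df.columns:
    counts = df[col].value_counts()  # sorted by frequency; nulls are excluded
    most_frequent[col] = (counts.index[0], int(counts.iloc[0]))

print(most_frequent)
```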


Following steps similar to the previous ones, we use the Fill in missing values transformer to replace the missing data with the mode of each of these columns: stem_surface, gill_spacing, cap_surface, gill_attachment, ring_type.

from mage_ai.data_cleaner.transformer_actions.constants import ImputationStrategy
from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:

    """
    Execute Transformer Action: ActionType.IMPUTE
    Docs: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values

    """
    action = build_transformer_action(
        df,
        action_type=ActionType.IMPUTE,
        arguments=df.columns,  # Specify columns to impute
        axis=Axis.COLUMN,
        options={'strategy': ImputationStrategy.MODE},  # Specify imputation strategy
    )

    return BaseAction(action).execute(df)


@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """
    assert output is not None, 'The output is undefined'

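In the execute_transformer_action() function, we define the imputation strategy in a Python dictionary. Note that the block as written passes df.columns as arguments, so the strategy is applied to every column; to restrict it to specific columns, pass a list of their names instead. For more imputation options, see the transformer documentation: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values.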
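To make the effect of mode imputation concrete, here is a plain pandas sketch restricted to a chosen list of columns. It does not reproduce Mage's internal implementation; the toy data and the columns_to_impute name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; cap_diameter is deliberately left out of the imputation.
df = pd.DataFrame({
    "gill_spacing": ["c", "c", None, "d"],
    "ring_type": ["f", None, "f", "z"],
    "cap_diameter": [8.8, None, 6.9, 3.9],
})

columns_to_impute = ["gill_spacing", "ring_type"]

imputed = df.copy()
for col in columns_to_impute:
    mode_value = imputed[col].mode().iloc[0]  # most frequent non-null value
    imputed[col] = imputed[col].fillna(mode_value)

print(imputed)
```

Columns outside the list keep their nulls, which is the difference between passing an explicit list and passing every column.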

Data Exporter

With all the transformations done, we save the cleaned dataset to the same Postgres database, under a different table name so we can tell it apart. Using the Data Exporter block and selecting Postgres, we define the schema and the table where we want to save the data. Remember that the database settings were saved beforehand in the io_config.yaml file.

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from pandas import DataFrame
from os import path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter

@data_exporter
def export_data_to_postgres(df: DataFrame, **kwargs) -> None:

    """
    Template for exporting data to a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.
    Docs: https://docs.mage.ai/design/data-loading#postgresql

    """

    schema_name = 'public'  # Specify the name of the schema to export data to
    table_name = 'mushroom_clean'  # Specify the name of the table to export data to
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:

        loader.export(
            df,
            schema_name,
            table_name,
            index=False,  # Specifies whether to include index in exported table
            if_exists='replace',  # Specify resolution policy if table name already exists
        )
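The index=False and if_exists='replace' options mirror the parameters of the same name in pandas' DataFrame.to_sql. The sketch below shows what they do, using an in-memory SQLite database in place of Postgres so that it runs without a server. The table contents are hypothetical, and this is not the Mage exporter itself.

```python
import sqlite3

import pandas as pd

# Hypothetical cleaned data.
df = pd.DataFrame({"cap_shape": ["x", "f"], "cap_color": ["n", "w"]})

conn = sqlite3.connect(":memory:")  # SQLite stands in for Postgres here

# Export twice: with if_exists="replace" the second call overwrites the table
# instead of failing or appending duplicate rows.
df.to_sql("mushroom_clean", conn, index=False, if_exists="replace")
df.to_sql("mushroom_clean", conn, index=False, if_exists="replace")

result = pd.read_sql_query("SELECT * FROM mushroom_clean", conn)
conn.close()

print(result)
```

With index=False the DataFrame index is not written as an extra column, so the table contains only the data columns.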

Thank you, and see you next time!

repo -> https://github.com/DeadPunnk/Mushrooms/tree/main
