Dataset when building a Machine Learning model-Python Tutorial-php.cn

Home

Backend Development

Python Tutorial

Dataset when building a Machine Learning model

Linda Hamilton

Oct 05, 2024 am 06:14 AM

Perhaps one of the biggest difficulties for those starting to study Machine Learning is working, processing the data, making small inferences, and then putting together your model.

In this article I will exemplify how to analyze a dataset to better build a Machine Learning model by going through:

Brief explanation of what Machine Learning is
Types and differences of Learning
Understand and extract important information from the dataset
Treat data from a dataset
Differences between algorithms of a model
Construction of a Linear Regression model

But let's start from the beginning, so we can contextualize, what is Machine Learning (ML)?

ML is one of the different branches of Artificial Intelligence (AI), as well as Neural Networks or Robotics, and others. The type of machine learning depends on how the data is structured, so it can be divided into different types, from there creating a model. An ML model is created using algorithms that process input data and learn to predict or classify the results.

The Importance of a Dataset

To create an ML model we need a dataset, within the dataset there must be our input features, which is basically our entire dataset except for the target column depending on our type of learning, if it is Supervised learning the dataset must contain the targets, or labels, or correct answers, as this information will be used for training and testing the model.
Some types of learning and the dataset structure for them:

Supervised Learning: Here the model learns through a set of labeled data, with correct outputs already provided, so the model aims to learn to associate inputs and outputs to be able to make its predictions correctly for the new data .
Unsupervised learning: Model training is done with unlabeled data, there is no correct output associated with the input, so the objective of the model is to identify patterns and groupings in the data.
Reinforcement Learning: In this, the model learns from interaction with the environment. He will take actions in the environment and receive a reward for correct actions or receive a penalty for wrong actions, with the objective of fully maximizing long-term rewards by maximizing the behavior that led to performing correct actions.

Therefore, the dataset basically defines the entire behavior and learning process of the model generated by the machine.
To continue with the examples, I will use a dataset with labels, exemplifying a model with Supervised Learning, where the objective will be to define the monthly value of life insurance for a specific audience.

Let's start by loading our dataset and see its first lines.


import pandas as pd

data = pd.read_csv('../dataset_seguro_vida.csv')
data.head()

Dataset na construção de um modelo de Machine Learning

Let's detail our data a little more, we can see its format, and discover the number of rows and columns in the dataset.


data.shape

Dataset na construção de um modelo de Machine Learning

We have here a data structure of 500 rows and 9 columns.

Now let's see what types of data we have and if we are missing any data.


data.info()

Dataset na construção de um modelo de Machine Learning

We have 3 numeric columns here, including 2 int (whole numbers) and 1 float (numbers with decimal places), and the other 6 are object. So we can move on to the next step of processing the data a little.

Working the Dataset

A good step towards improving our dataset is to understand that some types of data are processed and even understood more easily by the model than others. For example, data that is of type object is heavier and even limited to work with, so it is better to transform it to category, as this allows us to have several gains from performance to efficiency in memory use (in In the end, we can even improve this by making another transformation, but when the time comes I'll explain better).


object_columns = data.select_dtypes(include='object').columns

for col in object_columns:
    data[col] = data[col].astype('category')

data.dtypes

Dataset na construção de um modelo de Machine Learning

Como o nosso objetivo é conseguir estipular o valor da mensalidade de um seguro de vida, vamos dar uma olhada melhor nas nossas variáveis numéricas usando a transposição.


data.describe().T

Dataset na construção de um modelo de Machine Learning

Podemos aqui ver alguns detalhes e valores dos nossos inputs numéricos, como a média aritmética, o valor mínimo e máximo. Através desses dados podemos fazer a separação desses valores em grupos baseados em algum input de categoria, por gênero, se fuma ou não, entre outros, como demonstração vamos fazer a separação por sexo, para visualizar a media aritmética das colunas divididas por sexo.


value_based_on_sex = data.groupby("Sexo").mean("PrecoSeguro")
value_based_on_sex

Dataset na construção de um modelo de Machine Learning

Como podemos ver que no nosso dataset os homens acabam pagando um preço maior de seguro (lembrando que esse dataset é fictício).

Podemos ter uma melhor visualização dos dados através do seaborn, é uma biblioteca construída com base no matplotlib usada especificamente para plotar gráficos estatísticos.


import seaborn as sns

sns.set_style("whitegrid")

sns.pairplot(
    data[["Idade", "Salario", "PrecoSeguro", "Sexo"]],
    hue = "Sexo",
    height = 3,
    palette = "Set1")

Dataset na construção de um modelo de Machine Learning

Aqui podemos visualizar a distribuição desses valores através dos gráficos ficando mais claro a separação do conjunto, com base no grupo que escolhemos, como um teste você pode tentar fazer um agrupamento diferente e ver como os gráficos vão ficar.

Vamos criar uma matriz de correlação, sendo essa uma outra forma de visualizar a relação das variáveis numéricas do dataset, com o auxilio visual de um heatmap.


numeric_data = data.select_dtypes(include=['float64', 'int64'])

corr_matrix = numeric_data.corr()
sns.heatmap(corr_matrix, annot= True)

Dataset na construção de um modelo de Machine Learning

Essa matriz transposta nos mostra quais variáveis numéricas influenciam mais no nosso modelo, é um pouco intuitivo quando você olha para a imagem, podemos observar que a idade é a que mais vai interferir no preço do seguro.
Basicamente essa matriz funciona assim:

Os valores variam entre -1 e 1:
1: Correlação perfeita positiva - Quando uma variável aumenta, a outra também aumenta proporcionalmente.
0: Nenhuma correlação - Não há relação linear entre as variáveis.
-1: Correlação perfeita negativa - Quando uma variável aumenta, a outra diminui proporcionalmente.

Lembra da transformada que fizemos de object para category nos dados, agora vem a outra melhoria comentada, com os dados que viraram category faremos mais uma transformada, dessa vez a ideia é transformar essa variáveis categóricas em representações numéricas, isso nos permitirá ter um ganho incrível com o desempenho do modelo já que ele entende muito melhor essas variáveis numéricas.

Conseguimos fazer isso facilmente com a lib do pandas, o que ele faz é criar nova colunas binarias para valores distintos, o pandas é uma biblioteca voltada principalmente para analise de dados e estrutura de dados, então ela já possui diversas funcionalidades que nos auxiliam nos processo de tratamento do dataset.


data = pd.get_dummies(data)

Dataset na construção de um modelo de Machine Learning

Pronto agora temos nossas novas colunas para as categorias.

Decidindo o Algoritmo

Para a construção do melhor modelo, devemos saber qual o algoritmo ideal para o propósito da ML, na tabela seguinte vou deixar um resumo simplificado de como analisar seu problema e fazer a melhor escolha.

Dataset na construção de um modelo de Machine Learning

Olhando a tabela podemos ver que o problema que temos que resolver é o de regressão. Aqui vai mais uma dica, sempre comesse simples e vá incrementando seu e fazendo os ajustes necessários até os valores de previsibilidade do modelo ser satisfatório.

Para o nosso exemplo vamos montar um modelo de Regressão Linear, já que temos uma linearidade entre os nossos inputs e temos como target uma variável numérica.

Sabemos que a nossa variável target é a coluna PrecoSeguro , as outras são nossos inputs. Os inputs em estatísticas são chamadas de variável independente e o target de variável dependente, pelos nomes fica claro que a ideia é que o nosso target é uma variável que depende dos nosso inputs, se os inputs variam nosso target tem que vai variar também.

Vamos definir nosso y com o target


y = data["PrecoSeguro"]
E para x vamos remover a coluna target e inserir todas as outras
X = data.drop("PrecoSeguro", axis = 1)

Antes de montarmos o modelo, nosso dataset precisa ser dividido uma parte para teste e outra para o treino, para fazer isso vamos usar do scikit-learn o método train_test_split.


from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(
    X,y, 
    train_size = 0.80, 
    random_state = 1)

Aqui dividimos o nosso dataset em 80% para treino e 20% para testes. Agora podemos montar o nosso modelo.


from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(X_train,y_train)

Modelo montado agora podemos avaliar seu desempenho


lr.score(X_test, y_test).
lr.score(X_train, y_train)

Aqui podemos analisar a o coeficiente de determinação do nosso modelo para testes e para o treinamento.

Podemos usar um outro método para poder descobrir o desvio padrão do nosso modelo, e entender a estabilidade e a confiabilidade do desempenho do modelo para a amostra


<p>from sklearn.metrics import mean_squared_error<br>
import math</p>

<p>y_pred = lr.predict(X_test)<br>
math.sqrt(mean_squared_error(y_test, y_pred))</p>

Considerações

O valor perfeito do coeficiente de determinação é 1, quanto mais próximo desse valor, teoricamente melhor seria o nosso modelo, mas um ponto de atenção é basicamente impossível você conseguir um modelo perfeito, até mesmo algo acima de 0.95 é de se desconfiar.
Se você tiver trabalhando com dados reais e conseguir um valor desse é bom analisar o seu modelo, testar outras abordagens e até mesmo revisar seu dataset, pois seu modelo pode estar sofrendo um overfitting e por isso apresenta esse resultado quase que perfeitos.
Aqui como montamos um dataset com valores irreais e sem nenhum embasamento é normal termos esses valores quase que perfeitos.

Deixarei aqui um link para o github do código e dataset usados nesse post

GitHub

The above is the detailed content of Dataset when building a Machine Learning model. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Python: compiler or Interpreter?May 13, 2025 am 12:10 AM

Python is an interpreted language, but it also includes the compilation process. 1) Python code is first compiled into bytecode. 2) Bytecode is interpreted and executed by Python virtual machine. 3) This hybrid mechanism makes Python both flexible and efficient, but not as fast as a fully compiled language.

Python For Loop vs While Loop: When to Use Which?May 13, 2025 am 12:07 AM

Useaforloopwheniteratingoverasequenceorforaspecificnumberoftimes;useawhileloopwhencontinuinguntilaconditionismet.Forloopsareidealforknownsequences,whilewhileloopssuitsituationswithundeterminediterations.

Python loops: The most common errorsMay 13, 2025 am 12:07 AM

Pythonloopscanleadtoerrorslikeinfiniteloops,modifyinglistsduringiteration,off-by-oneerrors,zero-indexingissues,andnestedloopinefficiencies.Toavoidthese:1)Use'i

For loop and while loop in Python: What are the advantages of each?May 13, 2025 am 12:01 AM

Forloopsareadvantageousforknowniterationsandsequences,offeringsimplicityandreadability;whileloopsareidealfordynamicconditionsandunknowniterations,providingcontrolovertermination.1)Forloopsareperfectforiteratingoverlists,tuples,orstrings,directlyacces

Python: A Deep Dive into Compilation and InterpretationMay 12, 2025 am 12:14 AM

Pythonusesahybridmodelofcompilationandinterpretation:1)ThePythoninterpretercompilessourcecodeintoplatform-independentbytecode.2)ThePythonVirtualMachine(PVM)thenexecutesthisbytecode,balancingeaseofusewithperformance.

Is Python an interpreted or a compiled language, and why does it matter?May 12, 2025 am 12:09 AM

Pythonisbothinterpretedandcompiled.1)It'scompiledtobytecodeforportabilityacrossplatforms.2)Thebytecodeistheninterpreted,allowingfordynamictypingandrapiddevelopment,thoughitmaybeslowerthanfullycompiledlanguages.

For Loop vs While Loop in Python: Key Differences ExplainedMay 12, 2025 am 12:08 AM

Forloopsareidealwhenyouknowthenumberofiterationsinadvance,whilewhileloopsarebetterforsituationswhereyouneedtoloopuntilaconditionismet.Forloopsaremoreefficientandreadable,suitableforiteratingoversequences,whereaswhileloopsoffermorecontrolandareusefulf

For and While loops: a practical guideMay 12, 2025 am 12:07 AM

Forloopsareusedwhenthenumberofiterationsisknowninadvance,whilewhileloopsareusedwhentheiterationsdependonacondition.1)Forloopsareidealforiteratingoversequenceslikelistsorarrays.2)Whileloopsaresuitableforscenarioswheretheloopcontinuesuntilaspecificcond

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Roblox: Grow A Garden - Complete Mutation Guide

3 weeks agoByDDD

Roblox: Bubble Gum Simulator Infinity - How To Get And Use Royal Keys

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

How to fix KB5055612 fails to install in Windows 10?

3 weeks agoByDDD

Nordhold: Fusion System, Explained

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Mandragora: Whispers Of The Witch Tree - How To Unlock The Grappling Hook

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Dreamweaver CS6

Visual web development tools

WebStorm Mac version

Useful JavaScript development tools

PhpStorm Mac version

The latest (2018.2.1) professional PHP integrated development tool

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

Hot Topics

1666

1425

1328

1273

1253