Learn to Split a Dataset into Training and Testing Data Using Python

DDD

Oct 30, 2024, 10:57 AM

Summary

This article teaches you how to divide a dataset into training and testing data and save this division in a .pkl file, essential for training and evaluating Machine Learning models in an organized way. The process uses the sklearn and pickle libraries, allowing you to reuse the processed data in future projects. This article is the next step in a series of tutorials on data preprocessing.

Main Topics Covered:

Notebook preparation on Google Colab
Division of the dataset into training and testing data
Detailed explanation of Python code for division
Saving the split to a .pkl file using pickle
Advantages of saving processed data for future use

Important: To follow this article, first read the articles below in the suggested sequence. Each article provides the foundation you need to understand the next, ensuring you understand the entire workflow up to this point.

Article 1: Applying Machine Learning: A Guide to Getting Started as Models in Classification

Article 2: Exploring Classification in Machine Learning: Types of Variables

Article 3: Exploring Google Colab: Your Ally for Coding Machine Learning Models

Article 4: Exploring Data with Python on Google Colab: A Practical Guide Using the adult.csv Dataset

Article 5: Demystifying Predictor and Class Division and Categorical Attribute Handling with LabelEncoder and OneHotEncoder

Article 6: Data Scaling: The Foundation for Efficient Models

Introduction

In this article, you will learn how to divide a dataset into training and testing data and how to save this division in a .pkl file. This process ensures a clean separation between the data used to train the model and the data used to evaluate its performance.

Starting the process in Google Colab

First, access this notebook link and select File > Save a copy to Drive. A copy of the notebook will be saved on Google Drive, within the Colab Notebooks folder, keeping the process organized and continuous. Remember that the dataset (adult.csv) needs to be loaded again with each new post (more information in Article 4 above): each tutorial uses a new notebook that already contains all the code generated so far, and you add only the code presented in the current article.

Why split the dataset into training and testing?

Dividing the dataset is a fundamental step in any Machine Learning project, as it allows the model to "learn" from one part of the data (training) and then be evaluated on data it has never seen before (testing). This practice is essential for measuring how well the model generalizes. To keep the code easy to follow, we will use the following variables (the names are in Portuguese: treinamento means training and teste means test):

X_adult_treinamento: training predictor variables
X_adult_teste: test predictor variables
y_adult_treinamento: training target variable
y_adult_teste: test target variable

Python code to split the dataset

Below is the Python code to perform the split between training and testing data:

from sklearn.model_selection import train_test_split

# X_adult and y_adult are the predictors and the target prepared in the previous articles
X_adult_treinamento, X_adult_teste, y_adult_treinamento, y_adult_teste = train_test_split(X_adult, y_adult, test_size=0.2, random_state=0)

# Training data
X_adult_treinamento.shape, y_adult_treinamento.shape

# Test data
X_adult_teste.shape, y_adult_teste.shape

The figure below shows the previous code with its outputs after execution.

[Figure: notebook cell with the split code and its output]

Explanation of the Code:

train_test_split: Function from the sklearn library that splits the dataset.
test_size=0.2: Indicates that 20% of the data will be reserved for testing, and the remaining 80% for training.
random_state=0: Fixes the seed of the random shuffle, so the division is the same on every run and the results stay consistent.
shape: Attribute that returns the dimensions of each array after splitting, used here to confirm that the split occurred correctly.
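
To see what test_size, random_state and shape do in isolation, here is a minimal sketch on a tiny made-up array. X_demo and y_demo are hypothetical stand-ins for X_adult and y_adult, which come from the previous articles, and the shorter English variable names are for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 predictor columns, binary target
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 80% of the rows go to training, 20% to testing
print(X_train.shape, y_train.shape)  # (8, 2) (8,)
print(X_test.shape, y_test.shape)    # (2, 2) (2,)

# Repeating the call with the same random_state yields the same partition
X_train_again, X_test_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)  # True
```

With the real dataset, the same check applies: the row counts of the training and test shapes should add up to the number of rows in X_adult.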

Saving the split to a .pkl file

To make work easier and ensure consistency between different runs, we will save the training and testing variables in a .pkl file. This makes it possible to reuse the data whenever necessary, without having to do the division again.

Code to save variables using pickle:

import pickle

# Save the four variables, in this order, to adult.pkl (opened in binary write mode)
with open('adult.pkl', mode='wb') as fl:
  pickle.dump([X_adult_treinamento, y_adult_treinamento, X_adult_teste, y_adult_teste], fl)

To view the adult.pkl file in the notebook, click the folder icon on the left side, as shown in the figure below.

[Figure: the adult.pkl file listed in the Colab file panel]

Explanation of the Code:

pickle: Python library used to serialize objects, allowing you to save complex variables in files.
dump: Saves the variables in a file called adult.pkl. This file will be read in the future to load the dataset divided into training and testing, optimizing the workflow.
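
Reading the file back is the mirror image of saving it. The sketch below shows one possible round trip, assuming small stand-in lists instead of the real arrays and a hypothetical file name (demo_split.pkl) in a temporary folder, so the real adult.pkl is not touched.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the four variables produced by the split
X_train, y_train = [[1, 2], [3, 4], [5, 6]], [0, 1, 0]
X_test, y_test = [[7, 8]], [1]

# Write to a temporary folder (illustrative path, not the article's adult.pkl)
path = os.path.join(tempfile.mkdtemp(), "demo_split.pkl")

with open(path, mode="wb") as f:
    pickle.dump([X_train, y_train, X_test, y_test], f)

# Open in binary read mode and unpack in the same order used when saving
with open(path, mode="rb") as f:
    X_train_loaded, y_train_loaded, X_test_loaded, y_test_loaded = pickle.load(f)

round_trip_ok = (
    X_train_loaded == X_train
    and y_train_loaded == y_train
    and X_test_loaded == X_test
    and y_test_loaded == y_test
)
print(round_trip_ok)  # True
```

The order of the unpacked names must match the order of the list passed to dump. Because unpickling can execute arbitrary code, only load .pkl files that you created yourself or that come from a source you trust.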

Conclusion

In this article, you learned how to split a dataset into training and testing data and save it in a .pkl file. This process is fundamental in Machine Learning projects, ensuring an organized and efficient structure. In the next article, we will cover the creation of models, starting with the Naive Bayes algorithm, using the adult.pkl file to continue development.

Books I recommend

1. Practical Statistics for Data Scientists
2. Introduction to Computing Using Python
3. 2041: How Artificial Intelligence Will Change Your Life in the Next Decades
4. Intensive Python Course
5. Understanding Algorithms. An Illustrated Guide for Programmers and Others Who Are Curious
6. Artificial Intelligence - Kai-Fu Lee
7. Introduction to Artificial Intelligence - A Non-Technical Approach - Tom Taulli
