Four steps of data preprocessing (Mar 05, 2021)

The four steps of data preprocessing are data cleaning, data integration, data transformation, and data reduction. Data preprocessing refers to the review, screening, sorting, and other necessary processing applied to collected data before it is classified or grouped. It serves two purposes: improving the quality of the data, and adapting the data to the software or methods used for analysis.


The operating environment of this article: Windows 7 system, Dell G3 computer.

Generally speaking, the steps of data preprocessing are data cleaning, data integration, data transformation, and data reduction, and each major step contains several smaller sub-steps. Of course, not all four steps are necessarily performed every time data is preprocessed.

1. Data Cleaning

Data cleaning, as the name suggests, turns "dirty" data into "clean" data; data can be dirty in either form or content.

Dirty in form: for example, missing values and special symbols;

Dirty in content: for example, outliers.

1. Missing values

Handling missing values involves two tasks: identifying them and processing them.

In R, the function is.na identifies missing values, and the function complete.cases identifies whether a row of sample data is complete.
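A minimal sketch of these two functions, using a hypothetical toy data frame:

```r
# A small toy data frame with missing entries (hypothetical data)
df <- data.frame(age = c(21, NA, 35), height = c(170, 165, NA))

is.na(df$age)       # TRUE for the missing entry: FALSE TRUE FALSE
complete.cases(df)  # TRUE only for rows with no NA: TRUE FALSE FALSE
sum(is.na(df))      # total number of missing values in the data frame: 2
```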

Commonly used methods for dealing with missing values are deletion, replacement, and interpolation (imputation).

  • Deletion method: depending on what is deleted, this divides into deleting observation samples and deleting variables. Deleting observation samples (listwise deletion) can be done with the na.omit function in R, which drops rows containing missing values.

    This reduces the sample size in exchange for complete information. When a variable has many missing values and little impact on the research objective, you can consider deleting the variable itself with the statement mydata[, -p], where mydata is the data set, p is the column number of the variable to delete, and the minus sign indicates deletion.

  • Replacement method: as the name suggests, this replaces missing values, with different rules for different variable types. When the variable is numeric, missing values are replaced with the mean of the other observed values of that variable; when the variable is non-numeric, the median or mode of the other observed values is used instead.

  • Interpolation method: interpolation divides into regression imputation and multiple imputation.

    Regression imputation treats the variable to be imputed as the dependent variable y and the other variables as independent variables, fits a regression model (the lm function in R), and uses the fitted model to predict the missing values;

    Multiple imputation generates complete data sets from a data set containing missing values: the imputation is performed multiple times, producing random plausible values for each missing entry. The mice package in R can perform multiple imputation.
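The deletion and replacement methods above can be sketched in base R with a hypothetical toy data frame (the mice package is omitted here, since it is an external dependency):

```r
df <- data.frame(age = c(21, NA, 35, 28), score = c(88, 90, NA, 75))

# Deletion: na.omit drops every row that contains a missing value
clean <- na.omit(df)
nrow(clean)  # 2 complete rows remain

# Replacement: fill NA in a numeric column with the mean of the observed values
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)
df$age  # 21, 28, 35, 28  (the NA was replaced by the mean, 28)
```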

2. Outliers

Like missing values, outliers involve both identification and processing.

  • Outliers are usually identified with a univariate scatter plot or a box plot: in R, the dotchart function draws a univariate scatter plot and the boxplot function draws a box plot. In either graph, points far from the normal range are regarded as outliers.

  • Options for processing outliers include: deleting the observations that contain them (direct deletion; when the sample is small, this can leave too few samples and change the distribution of the variables), treating them as missing values (so the existing missing-value methods can fill them in), mean correction (replacing the outlier with the average of the two neighboring observations), or leaving them unprocessed. Before handling an outlier, always review the possible reasons for its occurrence, and then decide whether it should be discarded.
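A minimal sketch of box-plot identification followed by treat-as-missing processing, on a hypothetical vector (boxplot.stats applies the usual 1.5 x IQR fence that the boxplot function draws):

```r
x <- c(10, 12, 11, 13, 12, 98)  # 98 is an obvious outlier (hypothetical data)

# Points outside the 1.5 * IQR whiskers are reported as outliers
out <- boxplot.stats(x)$out
out  # 98

# Treat the outlier as a missing value, then fill it with the median
x[x %in% out] <- NA
x[is.na(x)] <- median(x, na.rm = TRUE)
x  # 10 12 11 13 12 12
```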

2. Data Integration

Data integration merges multiple data sources into one data store. Of course, if the data being analyzed already lives in a single data store, there is no need for integration.

In R, data integration is implemented by combining two data frames on a key with the merge function: merge(dataframe1, dataframe2, by = "keyword"). The result is sorted in ascending order of the key by default.
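A minimal sketch with two hypothetical data frames keyed on an id column:

```r
dataframe1 <- data.frame(id = c(1, 2, 3), age = c(25, 30, 35))
dataframe2 <- data.frame(id = c(2, 3, 4), score = c(80, 90, 70))

# Inner join on the key column "id"; only ids present in both frames survive,
# and the result is sorted by the key in ascending order
merged <- merge(dataframe1, dataframe2, by = "id")
merged
#   id age score
# 1  2  30    80
# 2  3  35    90
```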

The following problems may occur when performing data integration:

  1. Same name, different meaning: an attribute in data source A and an attribute in data source B have the same name but represent different entities, so the attribute cannot be used as the key;

  2. Different name, same meaning: an attribute has different names in the two data sources but represents the same entity, so it can be used as the key;

  3. Data integration often results in data redundancy: the same attribute may appear multiple times, or duplication may be caused by inconsistent attribute names. Analyze and detect duplicate attributes first, and delete them if found.

3. Data Transformation

Data transformation converts the data into a form appropriate for the software or the analysis theory being used.

1. Simple function transformation

Simple function transformations are used to turn data without a normal distribution into data with one; common transformations include the square, square root, logarithm, and difference. For example, in time series analysis, the logarithm or difference is often taken to convert a non-stationary series into a stationary one.

2. Standardization

Standardization removes the influence of the variables' dimensions (units). For example, height and weight cannot be compared directly, because their units and value ranges differ.

  • Min-max normalization: also called dispersion standardization; linearly transforms the data so that its range becomes [0,1];

  • Zero-mean normalization: also called standard deviation standardization; the processed data has mean 0 and standard deviation 1;

  • Decimal scaling normalization: moves the decimal point of the attribute values so that they are mapped into [-1,1].
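The three standardizations above can be sketched on a hypothetical vector as follows:

```r
x <- c(10, 20, 30, 40, 50)

# Min-max normalization: maps the values into [0, 1]
mm <- (x - min(x)) / (max(x) - min(x))

# Zero-mean (z-score) normalization: mean 0, standard deviation 1;
# the built-in scale() function does both centering and scaling
z <- as.numeric(scale(x))

# Decimal scaling: divide by 10^k, with k chosen so all values fall in [-1, 1]
k  <- ceiling(log10(max(abs(x))))
ds <- x / 10^k   # 0.1 0.2 0.3 0.4 0.5
```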

3. Continuous attribute discretization

Converting continuous attribute variables into categorical attributes is called discretization of continuous attributes. Some classification algorithms, such as ID3, require the data to be categorical.

Commonly used discretization methods include the following:

  1. Equal-width method: divides the value range of the attribute into intervals of the same width, similar to building a frequency distribution table;

  2. Equal-frequency method: puts the same number of records into each interval;

  3. One-dimensional clustering: two steps: first cluster the values of the continuous attribute with a clustering algorithm, then merge the values in each cluster into one category and assign them the same label.
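The equal-width and equal-frequency methods can be sketched with R's cut function on a hypothetical vector (equal width splits the range evenly; equal frequency places the breaks at quantiles):

```r
x <- c(1, 3, 5, 7, 9, 11, 13, 15)

# Equal-width: split the value range into 4 intervals of the same width
eq_width <- cut(x, breaks = 4)
table(eq_width)

# Equal-frequency: breaks at quantiles, so each interval holds the same count
eq_freq <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)
table(eq_freq)  # 2 observations per interval
```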

4. Data Reduction

Data reduction means, on the basis of understanding the mining task and the content of the data itself, finding the useful features that the discovery target depends on, in order to shrink the data: the amount of data is minimized while the original character of the data is preserved as much as possible.

Data reduction can lessen the impact of invalid and erroneous data on modeling, shorten computation time, and reduce the space needed to store the data.

1. Attribute reduction

Attribute reduction looks for the smallest attribute subset whose probability distribution is as close as possible to the probability distribution of the original data.

  1. Merge attributes: combine several old attributes into one new attribute;

  2. Stepwise forward selection: starting from an empty attribute set, select the current best attribute from the original attribute set and add it to the subset, repeating until the best attribute can no longer be selected or a constraint is satisfied;

  3. Stepwise backward elimination: starting from the full attribute set, select the current worst attribute and eliminate it from the subset, repeating until the worst attribute can no longer be selected or a constraint is satisfied;

  4. Decision tree induction: build a decision tree; attributes that do not appear in the tree are deleted from the initial set, yielding a better attribute subset;

  5. Principal component analysis: uses fewer variables to explain most of the variance in the original data (converts highly correlated variables into independent or uncorrelated components).
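Principal component analysis can be sketched with R's built-in prcomp function on the built-in iris data set, whose four numeric attributes are highly correlated:

```r
# iris without the species column: four correlated numeric attributes
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 3)
# The first two components already explain roughly 96% of the variance,
# so the four attributes can be reduced to two with little information loss.
```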

2. Numerosity reduction

This reduces the amount of data using parametric or non-parametric methods: parametric methods include linear regression and multiple regression; non-parametric methods include histograms and sampling.
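Two non-parametric methods, sampling and histograms, can be sketched on hypothetical simulated data:

```r
set.seed(42)
big <- rnorm(10000)  # hypothetical column with 10,000 values

# Simple random sampling without replacement: keep 5% of the records
small <- sample(big, size = 0.05 * length(big))
length(small)  # 500

# A histogram summarizes the full column with a small number of bin counts
h <- hist(big, breaks = 20, plot = FALSE)
sum(h$counts)  # 10000: the bins account for every original value
```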

