


This article covers the handling of outliers in Python data analysis. Outlier detection methods generally fall into statistical methods, clustering-based methods, and methods designed specifically for detecting outliers. Each is introduced below.
1 What is an outlier?
In machine learning, anomaly detection and handling is a relatively small branch, or a by-product of the field. In a typical prediction problem, the model is an expression of the structure of the overall sample, and this expression usually captures the sample's general properties. Points that are inconsistent with the overall sample in these properties are called anomalies, or outliers. Outliers are usually unwelcome in prediction problems, because such problems focus on the properties of the overall sample, while outliers are generated by a mechanism that differs from the rest of the sample. If the algorithm is sensitive to outliers, the resulting model will not represent the overall sample well and its predictions will be inaccurate.
On the other hand, outliers are of great interest to analysts in certain scenarios, such as disease prediction. The physical indicators of healthy people are usually similar in some dimensions. If a person's indicators are abnormal, then their physical condition has changed in some respect. The change is not necessarily caused by disease (such points are often called noise points), but the occurrence and detection of anomalies is an important starting point for disease prediction. The same applies to credit fraud, cyber attacks, and similar scenarios.
2 Detection methods for outliers
Outlier detection methods generally include statistical methods, clustering-based methods, and methods designed specifically for detecting outliers. They are introduced in turn below.
1. Simple Statistics
With pandas, describe() gives a statistical description of the data. It offers only a rough view of some statistics, and by default it summarizes only the numeric (continuous) columns:
df.describe()
Alternatively, a simple scatter plot often makes outliers clearly visible.
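As a minimal sketch of this first look, the snippet below builds a small hypothetical DataFrame (the column name length is borrowed from the box plot example later in this article, and the values are made up) and compares the maximum with the quartiles reported by describe(). The gap_ratio helper is only an illustration, not a standard statistic.

```python
import pandas as pd

# Hypothetical data: nine ordinary values and one suspicious one.
df = pd.DataFrame({"length": [10, 11, 12, 11, 10, 12, 11, 10, 12, 95]})

summary = df["length"].describe()
print(summary)

# A maximum far above the 75% quantile hints at an outlier on the high side.
iqr = summary["75%"] - summary["25%"]
gap_ratio = (summary["max"] - summary["75%"]) / iqr
print(f"max lies {gap_ratio:.1f} IQRs above the upper quartile")
```

A large gap between max (or min) and the neighboring quartile is only a hint; the methods in the following sections turn that hint into an explicit rule.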
2. 3σ principle
This principle has a precondition: the data must follow a normal distribution. Under the 3σ principle, a value that lies more than 3 standard deviations from the mean can be regarded as an outlier. The probability of falling within ±3σ of the mean is 99.7%, so the probability of a value lying more than 3σ from the mean is P(|x − μ| > 3σ) ≤ 0.003, which makes such a value an extremely unlikely event.
[Figure: the red arrow points to the outlier.]
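The three-standard-deviation rule can be sketched in a few lines of numpy. The function name three_sigma_outliers and the synthetic data are assumptions made for illustration; the sketch uses numpy's default population standard deviation and assumes the data are roughly normal.

```python
import numpy as np

def three_sigma_outliers(values):
    """Return a boolean mask marking values more than 3 standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()          # population standard deviation (ddof=0)
    return np.abs(values - mu) > 3 * sigma

# Synthetic example: 200 roughly normal values plus one extreme value at the end.
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50.0, scale=2.0, size=200), 80.0)

mask = three_sigma_outliers(data)
print(data[mask])
```

Note that the extreme value itself inflates the mean and standard deviation, so on small samples this rule can miss outliers that a quantile-based rule would catch.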
3. Box plot
This method uses the interquartile range (IQR) of the box plot to detect outliers. It is also called Tukey's test, and its cutoffs are often called Tukey's fences. The box plot is defined as follows:
The interquartile range (IQR) is the difference between the upper quartile (Q3) and the lower quartile (Q1). Using 1.5 times the IQR as the standard, points above Q3 + 1.5 × IQR or below Q1 − 1.5 × IQR are treated as outliers. The following Python implementation mainly uses numpy's percentile method.
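import numpy as np
# Minimum, lower quartile, median, upper quartile, and maximum of 'length'
Percentile = np.percentile(df['length'], [0, 25, 50, 75, 100])
IQR = Percentile[3] - Percentile[1]
# Values above UpLimit or below DownLimit are treated as outliers
UpLimit = Percentile[3] + IQR * 1.5
DownLimit = Percentile[1] - IQR * 1.5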
You can also visualize this with seaborn's boxplot method:
import matplotlib.pyplot as plt
import seaborn as sns
f, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(y='length', data=df, ax=ax)
plt.show()
[Figure: box plot of length; the red arrow points to an outlier.]
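The percentile code above only computes the two limits. A minimal sketch of the remaining step, selecting the rows that fall outside them, might look like this. The function name iqr_outliers, the parameter k, and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of `series` that fall outside Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical 'length' column with one high and one low extreme value.
df = pd.DataFrame({"length": [10, 11, 12, 11, 10, 12, 11, 10, 12, 95, -40]})
outliers = iqr_outliers(df["length"])
print(outliers)
```

Unlike the three-standard-deviation rule, this rule is built on quartiles, so the extreme values barely move the cutoffs.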
The methods above are simple and commonly used ways to identify outliers. Some more complex outlier detection algorithms are introduced next. Because they involve a lot of material, only the core ideas are covered; interested readers can study them in depth on their own.
4. Model-based detection
This method generally builds a probability distribution model, computes the probability that each object conforms to the model, and treats objects with low probability as outliers. If the model is a set of clusters, anomalies are objects that do not clearly belong to any cluster; if the model is a regression, anomalies are objects that are relatively far from the predicted value.
Probabilistic definition of an outlier: an outlier is an object that has low probability under the probability distribution model of the data. This presupposes knowing which distribution the data set follows. If the distribution is estimated wrongly, for example when the data actually follow a heavy-tailed distribution, ordinary objects may be misjudged as outliers.
For example, the RobustScaler method in feature engineering scales feature values using quantiles of the feature instead of the mean and standard deviation: it centers the data on the median and, by default, scales by the range between the 25% quantile and the 75% quantile. This reduces the impact of abnormal data.
Advantages and disadvantages: (1) These tests rest on a solid theoretical foundation in statistics and can be very effective when there is sufficient data and knowledge of the type of test to use; (2) for multivariate data fewer options are available, and for high-dimensional data these methods perform poorly.
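The article gives no code for the model-based approach. As one hedged illustration of the regression case described in this section, the sketch below fits a straight line with numpy and flags points whose residual is unusually large. Scoring residuals with the median absolute deviation and the 3.5 cutoff (the modified z-score) is a choice made for this sketch, not something the article prescribes, and the function name regression_outliers is hypothetical.

```python
import numpy as np

def regression_outliers(x, y, threshold=3.5):
    """Flag points whose residual from a fitted line is unusually large."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)      # least-squares line
    residuals = y - (slope * x + intercept)
    center = np.median(residuals)
    mad = np.median(np.abs(residuals - center))     # robust spread of the residuals
    robust_z = 0.6745 * (residuals - center) / mad  # modified z-score
    return np.abs(robust_z) > threshold

# Synthetic data: a noisy line with one point pushed far off the line.
rng = np.random.default_rng(1)
x = np.arange(30, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=30)
y[12] += 15.0

mask = regression_outliers(x, y)
print(np.flatnonzero(mask))
```

The same pattern applies to other models: fit the model, measure how poorly each object conforms to it, and treat the worst-fitting objects as candidates.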
5. Outlier detection based on proximity
Statistical methods use the distribution of the data to identify outliers, and some even require specific distributional conditions. In practice, real data rarely satisfy such assumptions, which limits the use of these methods.
It is easier to determine a meaningful measure of proximity for a data set than to determine its statistical distribution. This method is more general and easier to use than statistical methods because an object's outlier score is given by the distance to its k-nearest neighbors (KNN).
Note that the outlier score is highly sensitive to the value of k. If k is too small, a few outliers lying close to one another can receive low outlier scores; if k is too large, all objects in clusters with fewer than k points may become outliers. To make the scheme more robust to the choice of k, the average distance to the k nearest neighbors can be used.
Advantages and disadvantages: (1) The method is simple; (2) it requires O(m²) time and is not suitable for large data sets; (3) it is sensitive to the choice of parameters; (4) it cannot handle data sets with regions of different densities, because it uses a global threshold that cannot account for such variation in density.
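The k-nearest-neighbor score described above can be sketched with plain numpy. The function name knn_outlier_scores and the synthetic data are assumptions for illustration; the pairwise distance matrix makes the O(m²) cost visible.

```python
import numpy as np

def knn_outlier_scores(X, k=3, average=True):
    """Score each row of X by its distance to its k nearest neighbors.

    average=True uses the mean distance to the k neighbors (more robust to k);
    average=False uses the distance to the k-th neighbor only.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances: O(m^2) time and memory.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbor
    nearest = np.sort(dists, axis=1)[:, :k]  # k smallest distances per row
    return nearest.mean(axis=1) if average else nearest[:, -1]

# Synthetic data: 50 points around the origin plus one distant point (index 50).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])

scores = knn_outlier_scores(X, k=3)
print(scores.argmax())
```

The scores still have to be compared with a threshold, and a single global threshold is exactly what fails when regions have different densities.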
5. Density-based outlier detection
From a density-based perspective, outliers are objects in low-density regions. Density-based outlier detection is closely related to proximity-based outlier detection, since density is often defined in terms of proximity. A common definition of density is the reciprocal of the average distance to the k nearest neighbors: if this distance is small, the density is high, and vice versa. Another definition is the one used by the DBSCAN clustering algorithm, in which the density around an object equals the number of objects within a specified distance d of that object.
Advantages and disadvantages: (1) It gives a quantitative measure of the degree to which an object is an outlier, and it works well even when the data contain regions of different density; (2) like distance-based methods, these methods have a time complexity of O(m²), although for low-dimensional data specific data structures can achieve O(m log m); (3) parameter selection is difficult. The LOF algorithm addresses this by examining a range of k values and taking the maximum outlier score, but upper and lower bounds for k still have to be chosen.
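The two density definitions in this section can be sketched as follows. The function names are hypothetical, and counting only the other objects within distance d (excluding the object itself) is a choice made for this sketch. This is not an implementation of LOF, which additionally compares each object's density with that of its neighbors.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

def knn_density(X, k=3):
    """Density as the reciprocal of the average distance to the k nearest neighbors.

    Assumes no point has k exact duplicates, which would give a zero average distance.
    """
    dists = pairwise_distances(X)
    np.fill_diagonal(dists, np.inf)           # ignore the distance to itself
    avg_knn = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    return 1.0 / avg_knn

def radius_density(X, d=1.0):
    """Density as the number of other points within distance d."""
    dists = pairwise_distances(X)
    return (dists <= d).sum(axis=1) - 1       # subtract the point itself

# Synthetic data: 50 points around the origin plus one distant point (index 50).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])

print(knn_density(X).argmin())                # index of the lowest-density point
print(radius_density(X, d=1.0)[50])           # neighbors of the distant point
```

Both functions compute all pairwise distances, which is the O(m²) cost mentioned above.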
6. Clustering-based method for outlier detection
Clustering-based outliers: an object is a clustering-based outlier if it does not strongly belong to any cluster.
The impact of outliers on the initial clustering: if outliers are detected through clustering, then because the outliers themselves affect the clustering, there is a question of whether the resulting structure is valid. This is also a weakness of the k-means algorithm, which is sensitive to outliers. One way to deal with this is to cluster the objects, delete the outliers, and cluster the objects again (this does not guarantee optimal results).
Advantages and disadvantages: (1) Clustering techniques with linear or near-linear complexity (such as k-means) can be highly efficient at finding outliers; (2) clusters are usually defined as the complement of outliers, so clusters and outliers can be discovered at the same time; (3) the resulting set of outliers and their scores can depend heavily on the number of clusters used and on the presence of outliers in the data; (4) the quality of the clusters produced by the clustering algorithm strongly affects the quality of the outliers it produces.
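The cluster, delete, and re-cluster procedure can be sketched with a small hand-written k-means. Everything here is illustrative: the function names, the fixed initial centroids, and the use of distance to the assigned centroid as the outlier score are assumptions of this sketch, and a library implementation would normally replace the kmeans helper.

```python
import numpy as np

def kmeans(X, init_centroids, n_iter=20):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):                  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return centroids, dists.argmin(axis=1)

def cluster_outlier_scores(X, centroids, labels):
    """Distance from each point to the centroid of its assigned cluster."""
    return np.linalg.norm(np.asarray(X, dtype=float) - centroids[labels], axis=1)

# Synthetic data: two tight clusters plus one distant point (index 40).
rng = np.random.default_rng(3)
a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(20, 2))
b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(20, 2))
X = np.vstack([a, b, [[5.0, 30.0]]])
init = X[[0, 20]]                             # one starting centroid per cluster

# First pass: cluster, then score each point.
centroids, labels = kmeans(X, init)
scores = cluster_outlier_scores(X, centroids, labels)
outlier_idx = scores.argmax()

# Second pass: delete the highest-scoring point and cluster again.
X_clean = np.delete(X, outlier_idx, axis=0)
centroids_clean, _ = kmeans(X_clean, init)
print(outlier_idx, centroids_clean)
```

In the first pass the distant point pulls one centroid toward itself, which is the sensitivity described above; the second pass recomputes the centroids without it.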
7. Specialized outlier detection
In fact, the clustering methods mentioned above were originally designed for unsupervised classification, not for finding outliers; their ability to detect outliers is a derived function.
Besides the methods above, two commonly used methods are designed specifically for detecting anomalies: One Class SVM and Isolation Forest. Their details are not examined in depth here.
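For readers who want a starting point, here is a minimal sketch of Isolation Forest using scikit-learn's IsolationForest class. It assumes scikit-learn is installed; the synthetic data and the contamination value of 0.05 are arbitrary choices for illustration. One Class SVM is available as sklearn.svm.OneClassSVM and is not shown here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 100 points around the origin plus one distant point (last row).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])

# contamination is the assumed share of outliers; it sets the decision threshold.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = model.fit_predict(X)      # -1 marks predicted outliers, 1 marks inliers
scores = model.score_samples(X)    # lower scores are more anomalous

print(np.flatnonzero(labels == -1))
```

Because contamination fixes the fraction of points that get flagged, some ordinary points near the edge of the cloud are labeled -1 as well; the scores give a ranking that can be inspected instead.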
3 How to handle outliers
Once outliers have been detected, they need to be handled in some way. The general approaches fall roughly into the following categories:
- Delete records containing outliers: remove the affected records directly;
- Treat as missing values: treat outliers as missing values and handle them with a missing-value method;
- Mean value correction: replace the outlier with the average of the observations immediately before and after it;
- No processing: carry out data mining directly on the data set containing the outliers.
Whether outliers should be deleted depends on the actual situation. Some models are not very sensitive to outliers, so their performance is unaffected even when outliers are present. Others, such as logistic regression (LR), are very sensitive to outliers, and leaving the outliers unprocessed can lead to very poor results such as overfitting.
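The first three handling strategies in the list above can be sketched with pandas. The series values and the is_outlier mask are hypothetical; in practice the mask would come from one of the detection methods described earlier, and filling with the median is just one possible missing-value method.

```python
import pandas as pd

s = pd.Series([10.0, 11.0, 12.0, 95.0, 11.0, 10.0, 12.0])
# Hypothetical detection result: only the fourth observation is flagged.
is_outlier = pd.Series([False, False, False, True, False, False, False])

# 1) Delete the records flagged as outliers.
deleted = s[~is_outlier]

# 2) Treat outliers as missing values, then apply a missing-value method.
as_missing = s.mask(is_outlier)                      # outliers become NaN
filled_median = as_missing.fillna(as_missing.median())

# 3) Mean value correction: average of the observations just before and after.
#    An outlier in the first or last position has no such pair and becomes NaN.
neighbor_mean = (s.shift(1) + s.shift(-1)) / 2
corrected = s.where(~is_outlier, neighbor_mean)

print(deleted.tolist())
print(filled_median.tolist())
print(corrected.tolist())
```

The fourth strategy, no processing, simply uses s as it is.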
4 Summary of outliers
The above is a summary of outlier detection and processing methods.
We can find outliers through some detection methods, but the results obtained are not absolutely correct. The specific situation needs to be judged based on the understanding of the business. Similarly, how to deal with outliers, whether they should be deleted, corrected, or not processed, also needs to be considered based on the actual situation and is not fixed.