
Detailed introduction to KNN algorithm

PHP中文网 · Original · 2017-06-20 16:59:01

The full name of the KNN algorithm is k-Nearest Neighbors, i.e. the k nearest neighbors.

Algorithm Description

KNN is a classification algorithm. Its basic idea is to classify a sample by measuring the distances between its feature values and those of labeled samples.

The algorithm process is as follows:

1. Prepare the sample data set (every sample is already classified and carries a class label);
2. Use the sample data as the training set;
3. Input the test data A;
4. Calculate the distance between A and every sample in the data set;
5. Sort the samples in order of increasing distance;
6. Select the k samples closest to A;
7. Count how often each category appears among those k samples;
8. Return the most frequent category as the predicted classification of A.

Main factors

Training set (or sample data)

If the training set is too small, misclassification is likely; if it is too large, the system overhead of classifying each test sample becomes very high.

Distance (or similarity measure)

What makes a suitable distance measure? A smaller distance should mean that the two points are more likely to belong to the same category.

Distance measurements include:

1. Euclidean distance

The Euclidean metric (also called Euclidean distance) is a commonly used definition of distance: the true distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin). In two- and three-dimensional space, the Euclidean distance is the actual straight-line distance between two points.

Suitable for spatial problems.
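
As a quick illustration, here is a minimal Python sketch of the Euclidean distance between two points (the coordinates are made up for the example):

import math

def euclidean_distance(a, b):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0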

2. Manhattan Distance

Taxicab geometry, or Manhattan distance, is a term coined by Hermann Minkowski in the 19th century. It is a geometric term used in metric geometry, denoting the sum of the absolute differences of two points' coordinates in a standard coordinate system. The Manhattan distance is the sum of the projections, onto the coordinate axes, of the line segment joining the two points.

In the usual illustration, the red line represents the Manhattan distance, the green line represents the Euclidean (straight-line) distance, and the blue and yellow lines represent equivalent Manhattan distances. The Manhattan distance is the north-south distance between two points plus the east-west distance, i.e. d(i, j) = |xi − xj| + |yi − yj|.

Suitable for path problems.
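
A minimal sketch of the Manhattan distance (the points are made up for the example):

def manhattan_distance(a, b):
    # sum of the absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan_distance([1, 1], [4, 5]))  # 3 + 4 = 7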

3. Chebyshev distance

In mathematics, the Chebyshev distance is a metric on a vector space in which the distance between two points is defined as the maximum of the absolute differences of their coordinates.

Chebyshev distance is used to calculate the distance between two points on a grid, for example on a chessboard or in warehousing and logistics applications.

On a grid, the points at Chebyshev distance 1 from a point form its Moore neighborhood.

Used for problems calculating distances in a grid.
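
A minimal sketch of the Chebyshev distance (the grid coordinates are made up for the example):

def chebyshev_distance(a, b):
    # maximum absolute coordinate difference
    return max(abs(x - y) for x, y in zip(a, b))

print(chebyshev_distance([1, 1], [4, 5]))  # max(3, 4) = 4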

4. Minkowski Distance

The Minkowski distance is not a single distance, but a definition of a family of distances.

Depending on the value of its parameter, the Minkowski distance reduces to different specific distances.

The formula contains a parameter p: d(x, y) = ( Σ |xi − yi|^p )^(1/p)
When p = 1, it is the Manhattan distance;
When p = 2, it is the Euclidean distance;
When p → ∞, it is the Chebyshev distance.
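
A minimal sketch showing how the parameter p recovers the three distances above (the points are made up for the example):

def minkowski_distance(a, b, p):
    # p = 1 -> Manhattan, p = 2 -> Euclidean, large p -> approaches Chebyshev
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = [1, 1], [4, 5]
print(minkowski_distance(a, b, 1))    # 7.0  (Manhattan)
print(minkowski_distance(a, b, 2))    # 5.0  (Euclidean)
print(minkowski_distance(a, b, 100))  # ~4.0 (close to Chebyshev)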

5. Standardized Euclidean distance

Standardized Euclidean distance is an improvement scheme made to address the shortcomings of simple Euclidean distance. It can be regarded as a weighted Euclidean distance.

The idea behind standardized Euclidean distance: since the components of the data are distributed differently in each dimension, first "standardize" each component so that every dimension has the same mean and variance (typically mean 0 and variance 1).
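
A minimal sketch of a standardized Euclidean distance; the per-dimension standard deviations (stds) are illustrative values that would normally be estimated from the sample data:

import math

def standardized_euclidean(a, b, stds):
    # each coordinate difference is divided by that dimension's standard deviation
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, stds)))

print(standardized_euclidean([1.0, 100.0], [2.0, 300.0], stds=[0.5, 100.0]))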

6. Mahalanobis Distance

The Mahalanobis distance represents the covariance distance of the data.

It is an effective method to calculate the similarity of two unknown sample sets.

It is independent of the measurement scale and eliminates the interference of correlations between variables.
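
A minimal NumPy sketch of the Mahalanobis distance; the small data array is made up, and in practice the covariance matrix would be estimated from the training set:

import numpy as np

def mahalanobis_distance(u, v, data):
    # covariance matrix of the data set (rows = samples, columns = variables)
    cov = np.cov(data, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    diff = np.asarray(u) - np.asarray(v)
    return float(np.sqrt(diff.dot(inv_cov).dot(diff)))

data = np.array([
    [2.0, 2.0],
    [2.0, 5.0],
    [6.0, 5.0],
    [7.0, 3.0],
    [4.0, 7.0],
])
print(mahalanobis_distance(data[0], data[2], data))  # about 2.30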

7. Bhattacharyya Distance

In statistics, the Bhattacharyya distance is used to measure the similarity of two discrete probability distributions. It is often used in classification to measure the separability between classes.

8. Hamming distance

The Hamming distance between two equal-length strings s1 and s2 is defined as the minimum number of substitutions required to change one string into the other.

For example, the Hamming distance between the strings "1111" and "1001" is 2.

Application:
Information coding (in order to enhance fault tolerance, the minimum Hamming distance between codes should be made as large as possible).
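
A minimal sketch, using the strings from the example above:

def hamming_distance(s1, s2):
    # number of positions at which the corresponding symbols differ
    assert len(s1) == len(s2), "strings must have equal length"
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("1111", "1001"))  # 2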

9. Cosine of the included angle (Cosine)

In geometry, the cosine of the included angle can be used to measure the difference in the direction of two vectors, and in data mining, it can be used to measure the difference between sample vectors.
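
A minimal sketch of cosine similarity between two vectors (the vectors are made up for the example):

import math

def cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)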

10. Jaccard similarity coefficient

The Jaccard distance measures how dissimilar two sets are by the proportion, out of all elements, of elements that appear in only one of the two sets.
The Jaccard similarity coefficient can be used to measure the similarity of samples.
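
A minimal sketch, treating each sample as a set of elements (the sets are made up for the example):

def jaccard_similarity(a, b):
    # size of the intersection divided by size of the union
    a, b = set(a), set(b)
    return len(a & b) / float(len(a | b))

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 2 / 4 = 0.5
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))    # 0.5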

11. Pearson Correlation Coefficient

The Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient, is a linear correlation coefficient: a statistic that reflects the degree of linear correlation between two variables.
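
A minimal sketch of the Pearson correlation between two variables (the values are made up for the example):

import math

def pearson_correlation(x, y):
    # covariance of x and y divided by the product of their standard deviations
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

print(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # ~1.0  (perfect positive correlation)
print(pearson_correlation([1, 2, 3, 4], [8, 6, 4, 2]))  # ~-1.0 (perfect negative correlation)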

The impact of high dimensionality on distance measurement:
When there are many variables, the discriminating power of the Euclidean distance deteriorates.

The impact of variable value range on distance:
Variables with larger value ranges often play a dominant role in distance calculations, so the variables should be standardized first.

The size of k

If k is too small, the classification result is easily affected by noise points and the error increases;
if k is too large, the nearest neighbors may include too many points from other categories (weighting by distance can reduce the impact of the choice of k);
k = N (the number of samples) is completely undesirable, because then every input is simply predicted to belong to the class with the most training instances; the model is too simple and ignores a lot of useful information in the training instances.

In practical applications, k generally takes a relatively small value; for example, cross-validation (in simple terms, using some samples as the training set and the rest as the test set) is used to select the optimal value of k.

Rule of thumb: k is generally lower than the square root of the number of training samples.
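
As a rough illustration of selecting k by cross-validation, here is a sketch using scikit-learn; the iris data set and the candidate range 1 to 11 are only examples:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# try several candidate values of k and keep the one with the best
# cross-validated accuracy
best_k, best_score = None, 0.0
for k in range(1, 12):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()

print(best_k, best_score)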

Advantages and Disadvantages

1. Advantages
Simple, easy to understand, easy to implement, high accuracy, and not sensitive to outliers.

2. Disadvantages

KNN is a lazy algorithm. Constructing the model is very simple, but the system overhead of classifying test data is large (a large amount of computation and large memory overhead) because it must scan all training samples and compute the distance to each one.

Applicable scope

Numeric and nominal attributes (a nominal attribute takes a finite number of distinct values with no order between them).
Examples: customer churn prediction, fraud detection, etc.

Algorithm Implementation

Here we take Python as an example to describe an implementation of the KNN algorithm based on Euclidean distance.

Euclidean distance formula: d(x, y) = sqrt( (x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2 )

Sample code taking Euclidean distance as an example:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
# E-Mail : Mike_Zhang@live.com
import math

class KNN:
    def __init__(self, trainData, trainLabel, k):
        self.trainData = trainData
        self.trainLabel = trainLabel
        self.k = k

    def predict(self, inputPoint):
        retLable = "None"
        arr = []
        # Euclidean distance from inputPoint to every training sample
        for vector, lable in zip(self.trainData, self.trainLabel):
            s = 0
            for i, n in enumerate(vector):
                s += (n - inputPoint[i]) ** 2
            arr.append([math.sqrt(s), lable])
        # keep the k nearest samples
        arr = sorted(arr, key=lambda x: x[0])[:self.k]
        # count how often each label appears among the k nearest samples
        dtmp = {}
        for k, v in arr:
            if v not in dtmp:
                dtmp[v] = 0
            dtmp[v] += 1
        # return the most frequent label
        retLable, _ = sorted(dtmp.items(), key=lambda x: x[1], reverse=True)[0]
        return retLable

data = [
    [1.0, 1.1],
    [1.0, 1.0],
    [0.0, 0.0],
    [0.0, 0.1],
    [1.3, 1.1],
]

labels = ['A','A','B','B','A']
knn = KNN(data, labels, 3)
print(knn.predict([1.2, 1.1]))
print(knn.predict([0.2, 0.1]))

The implementation above is relatively simple. In development you can use a ready-made library, such as scikit-learn:
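
For example, a minimal sketch using scikit-learn's KNeighborsClassifier on the same small data set as above (Euclidean distance is the classifier's default metric):

from sklearn.neighbors import KNeighborsClassifier

data = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1], [1.3, 1.1]]
labels = ['A', 'A', 'B', 'B', 'A']

clf = KNeighborsClassifier(n_neighbors=3)  # k = 3
clf.fit(data, labels)

print(clf.predict([[1.2, 1.1], [0.2, 0.1]]))  # expected: ['A' 'B']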


Algorithm application

  • Recognizing handwritten digits

