


This article gives a detailed introduction to the KNN (k-nearest neighbor) algorithm in Python, with examples. It is intended as a reference and will hopefully be helpful to readers who need it.
The KNN algorithm is a data classification algorithm. It assigns a sample the category that is most common among the k training samples nearest to it, which is why it is also called the k-nearest neighbor algorithm. KNN is one of the simplest methods in data mining. It can be roughly divided into the steps described below, which work with two kinds of data:
Training data: the samples in the original data set, covering all data categories, whose categories are already known.
Test data: the data sample we want to classify.
Processing data
The test data we get usually has fewer dimensions than the training data: it is a single sample, while the training data has one row per sample. We therefore need to expand the test data so that its shape matches that of the training data. Python's numpy provides a tile() function that does this by repeating the test data.
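As a minimal sketch of this step, the snippet below repeats one test sample so that it has the same shape as the training data. The arrays and the names train_data, test_data and tiled are made-up illustrations, not data from the article.

```python
import numpy as np

# Hypothetical toy data: four training samples with two features each.
train_data = np.array([[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [6.0, 5.0]])
test_data = np.array([2.0, 1.0])  # a single test sample

# Repeat the test sample once per training row and once along the columns.
tiled = np.tile(test_data, (train_data.shape[0], 1))

print(test_data.shape)  # (2,)
print(tiled.shape)      # (4, 2)
```

Every row of tiled is a copy of the test sample, so it can be compared row by row with the training data.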
Vectorize the data
After the test data has been expanded, we need to vectorize the data in order to calculate the distance from the sample point to each training point. Vectorization here is very simple: subtract the training data from the expanded test data, which now has the same shape, element by element. Each row of the result is one difference vector.
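The subtraction can be sketched as follows, assuming a hypothetical test sample [2.0, 1.0] and three made-up training rows. The names tiled_test and diff are illustrative only.

```python
import numpy as np

# Hypothetical training data: three samples with two features each.
train_data = np.array([[1.0, 1.0], [1.0, 2.0], [5.0, 5.0]])

# The test sample, already repeated once per training row.
tiled_test = np.tile(np.array([2.0, 1.0]), (3, 1))

# Element-wise subtraction gives one difference vector per training sample.
diff = tiled_test - train_data
print(diff)
```

Here the three difference vectors are [1, 0], [1, -1] and [-3, -4].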
Calculate the Euclidean distance
The Euclidean distance can be calculated using the Pythagorean theorem. For each difference vector obtained by subtracting the training data from the test data, we square its components, sum the squares and take the square root. The result is a vector of distances, one for each training sample.
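A small sketch of the distance calculation, assuming the difference vectors have already been computed. The values in diff are made up so that the results are easy to check by hand, and the comparison with np.linalg.norm is an added cross-check rather than part of the article's method.

```python
import numpy as np

# Hypothetical difference vectors (test sample minus each training sample).
diff = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])

# Square each component, sum within each row, then take the square root.
distances = (diff ** 2).sum(axis=1) ** 0.5
print(distances)  # distances of 5, 10 and 1

# Cross-check: np.linalg.norm with axis=1 computes the same row-wise length.
print(np.allclose(distances, np.linalg.norm(diff, axis=1)))
```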
Classification based on distance
Select the k training data points with the smallest distance from the sample point and count which data category occurs most frequently among them. That category is taken as the data category of the sample point.
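The voting step might look like the sketch below. It uses collections.Counter from the standard library for brevity, which is a swapped-in convenience and not necessarily how the article counts votes. The distances and labels are made up.

```python
from collections import Counter

import numpy as np

# Hypothetical distances from the sample point to four training points,
# and the category of each training point.
distances = np.array([5.0, 10.0, 1.0, 2.0])
labels = ["B", "B", "A", "A"]
k = 3

# argsort returns the indices that would sort the distances in ascending order.
nearest = distances.argsort()[:k]

# Count the categories of the k nearest points and pick the most frequent one.
votes = Counter(labels[i] for i in nearest)
predicted = votes.most_common(1)[0][0]
print(predicted)  # A
```

With k = 3 the nearest points have categories A, A and B, so the majority vote is A.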
Algorithm implementation:
1. First we need to import numpy and operator: enter from numpy import * and import operator.
2. Next we need to define a knn function. It takes four parameters: k, the test data, the training data and the data categories (labels).
3. Next, we first need to expand the test data, using the tile(a, (b, c)) function from numpy. a is the data to be expanded, that is, the test data; b is the number of times the test data is repeated along the rows; and c is the number of times it is repeated along the columns.
4. For the previous operation, we need to know the number of rows of the training data. For this we use the shape attribute, which is a tuple containing the number of rows and the number of columns of the training data. To get one of them, we only need to index the tuple: shape[0] is the number of rows and shape[1] is the number of columns.
5. Once the two sets of data have the same shape, we subtract them to get the difference vectors, and then take the square root of the sum of the squares of each vector's values. This is the distance from the test data to each training data point. We then call the argsort() function to sort the distances in ascending order. Note that this function returns the indices of the array elements rather than the sorted values.
6. Next, to count the number of occurrences of the different data categories, we set up an empty dictionary and store the counts in it. Once the dictionary is filled, we sort its items in descending order of the number of occurrences and return the category of the first item, which is the data category of the test data.
7. The algorithm code is as follows:
from numpy import *
import operator

def knn(k, test_data, train_data, labels):
    train_size = train_data.shape[0]  # number of rows in the training data
    test_size = tile(test_data, (train_size, 1))  # repeat the test data once per training row
    minus = test_size - train_data  # difference vectors
    sq_minus = minus ** 2
    sum_sq_minus = sq_minus.sum(axis=1)  # sum of the squared elements in each row
    distc = sum_sq_minus ** 0.5
    sort_distc = distc.argsort()  # indices that sort the distances in ascending order
    static = {}
    for i in range(0, k):
        vote = labels[sort_distc[i]]  # category of the i-th nearest sample
        static[vote] = static.get(vote, 0) + 1  # count the occurrences of each category
    sort_static = sorted(static.items(), key=operator.itemgetter(1), reverse=True)  # sort the items by count, descending
    return sort_static[0][0]  # return the most frequent category
8. The dictionary needs to be sorted in the algorithm, so we use the sorted() function. Here it is called with three arguments: static.items(), the key function operator.itemgetter(), and reverse. The default sort order is ascending; to sort in descending order, we need to set reverse to True. Since we are sorting by the values of the dictionary, we enter sorted(static.items(), key=operator.itemgetter(1), reverse=True). When the argument of operator.itemgetter() is 1, the items are sorted by the values of the dictionary; when it is 0, they are sorted by the keys of the dictionary.
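9. After sorting, the elements are accessed in the same way as the elements of a two-dimensional array: sorted() returns a list of (category, count) pairs, so sort_static[0][0] is the most frequent category.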
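To try the steps above end to end, here is a hedged sketch that restates the knn function with an explicit np prefix so that it runs on its own, and calls it on a made-up toy data set. The training points, the labels and the intermediate variable names are assumptions for illustration. The parameter order follows the function in step 7.

```python
import operator

import numpy as np


def knn(k, test_data, train_data, labels):
    # Repeat the test sample once per training row so the shapes match.
    tiled = np.tile(test_data, (train_data.shape[0], 1))
    # Euclidean distance from the test sample to every training row.
    distances = ((tiled - train_data) ** 2).sum(axis=1) ** 0.5
    nearest = distances.argsort()[:k]  # indices of the k closest rows
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    # Sort the (category, count) pairs by count, largest first.
    ranked = sorted(votes.items(), key=operator.itemgetter(1), reverse=True)
    return ranked[0][0]


# Hypothetical toy data: two clusters labelled "A" and "B".
train = np.array([[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [6.0, 5.0]])
labels = ["A", "A", "B", "B"]

print(knn(3, np.array([2.0, 1.0]), train, labels))  # A
print(knn(3, np.array([5.0, 6.0]), train, labels))  # B
```

With k = 3, the point [2.0, 1.0] has two "A" neighbors and one "B" neighbor, while [5.0, 6.0] has two "B" neighbors and one "A" neighbor, so the majority vote gives "A" and "B" respectively.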



