聚類是一種無監督的學習,將相似的物件放到同一簇中,有點像是全自動分類,簇內的物件越相似,簇間的物件差異越大,則聚類效果越好。本文主要為大家詳細介紹了python實作kMeans演算法,具有一定的參考價值,有興趣的夥伴們可以參考一下,希望能幫助大家。
1、k均值聚類演算法
k均值聚類將資料分為k個簇,每個簇透過其質心,即簇中所有點的中心來描述。首先隨機決定k個初始點作為質心,然後將資料集分配到距離最近的簇。然後將每個簇的質心更新為所有資料集的平均值。然後再進行第二次劃分資料集,直到聚類結果不再變化為止。
偽代碼為
隨機建立k個簇質心
當任一點的簇分配改變時:
對資料集中的每個資料點:
對每個質心:
計算資料集與質心的距離
並將平均值當作質心
python實作
import numpy as np import matplotlib.pyplot as plt def loadDataSet(fileName): dataMat = [] with open(fileName) as f: for line in f.readlines(): line = line.strip().split('\t') dataMat.append(line) dataMat = np.array(dataMat).astype(np.float32) return dataMat def distEclud(vecA,vecB): return np.sqrt(np.sum(np.power((vecA-vecB),2))) def randCent(dataSet,k): m = np.shape(dataSet)[1] center = np.mat(np.ones((k,m))) for i in range(m): centmin = min(dataSet[:,i]) centmax = max(dataSet[:,i]) center[:,i] = centmin + (centmax - centmin) * np.random.rand(k,1) return center def kMeans(dataSet,k,distMeans = distEclud,createCent = randCent): m = np.shape(dataSet)[0] clusterAssment = np.mat(np.zeros((m,2))) centroids = createCent(dataSet,k) clusterChanged = True while clusterChanged: clusterChanged = False for i in range(m): minDist = np.inf minIndex = -1 for j in range(k): distJI = distMeans(dataSet[i,:],centroids[j,:]) if distJI < minDist: minDist = distJI minIndex = j if clusterAssment[i,0] != minIndex: clusterChanged = True clusterAssment[i,:] = minIndex,minDist**2 for cent in range(k): ptsInClust = dataSet[np.nonzero(clusterAssment[:,0].A == cent)[0]] centroids[cent,:] = np.mean(ptsInClust,axis = 0) return centroids,clusterAssment data = loadDataSet('testSet.txt') muCentroids, clusterAssing = kMeans(data,4) fig = plt.figure(0) ax = fig.add_subplot(111) ax.scatter(data[:,0],data[:,1],c = clusterAssing[:,0].A) plt.show() print(clusterAssing)
2、二分k平均值演算法
先將所有點作為一個簇,然後將該簇一分為二,然後選擇所有簇中對其劃分能夠最大程度減低SSE的值的簇,直到滿足指定簇數為止。
將所有點看成一個簇計算SSE
while 當簇數目小於k:
for 每一個簇:
計算總誤差
在給定的簇上進行k均值聚類(k=2)
進行分割操作
#
import numpy as np import matplotlib.pyplot as plt def loadDataSet(fileName): dataMat = [] with open(fileName) as f: for line in f.readlines(): line = line.strip().split('\t') dataMat.append(line) dataMat = np.array(dataMat).astype(np.float32) return dataMat def distEclud(vecA,vecB): return np.sqrt(np.sum(np.power((vecA-vecB),2))) def randCent(dataSet,k): m = np.shape(dataSet)[1] center = np.mat(np.ones((k,m))) for i in range(m): centmin = min(dataSet[:,i]) centmax = max(dataSet[:,i]) center[:,i] = centmin + (centmax - centmin) * np.random.rand(k,1) return center def kMeans(dataSet,k,distMeans = distEclud,createCent = randCent): m = np.shape(dataSet)[0] clusterAssment = np.mat(np.zeros((m,2))) centroids = createCent(dataSet,k) clusterChanged = True while clusterChanged: clusterChanged = False for i in range(m): minDist = np.inf minIndex = -1 for j in range(k): distJI = distMeans(dataSet[i,:],centroids[j,:]) if distJI < minDist: minDist = distJI minIndex = j if clusterAssment[i,0] != minIndex: clusterChanged = True clusterAssment[i,:] = minIndex,minDist**2 for cent in range(k): ptsInClust = dataSet[np.nonzero(clusterAssment[:,0].A == cent)[0]] centroids[cent,:] = np.mean(ptsInClust,axis = 0) return centroids,clusterAssment def biKmeans(dataSet,k,distMeans = distEclud): m = np.shape(dataSet)[0] clusterAssment = np.mat(np.zeros((m,2))) centroid0 = np.mean(dataSet,axis=0).tolist() centList = [centroid0] for j in range(m): clusterAssment[j,1] = distMeans(dataSet[j,:],np.mat(centroid0))**2 while (len(centList)<k): lowestSSE = np.inf for i in range(len(centList)): ptsInCurrCluster = dataSet[np.nonzero(clusterAssment[:,0].A == i)[0],:] centroidMat,splitClustAss = kMeans(ptsInCurrCluster,2,distMeans) sseSplit = np.sum(splitClustAss[:,1]) sseNotSplit = np.sum(clusterAssment[np.nonzero(clusterAssment[:,0].A != i)[0],1]) if (sseSplit + sseNotSplit) < lowestSSE: bestCentToSplit = i bestNewCents = centroidMat.copy() bestClustAss = splitClustAss.copy() lowestSSE = sseSplit + sseNotSplit print('the best cent to split is ',bestCentToSplit) # print('the len of the bestClust') bestClustAss[np.nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) bestClustAss[np.nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit clusterAssment[np.nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:] = bestClustAss.copy() centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0] centList.append(bestNewCents[1,:].tolist()[0]) return np.mat(centList),clusterAssment data = loadDataSet('testSet2.txt') muCentroids, clusterAssing = biKmeans(data,3) fig = plt.figure(0) ax = fig.add_subplot(111) ax.scatter(data[:,0],data[:,1],c = clusterAssing[:,0].A,cmap=plt.cm.Paired) ax.scatter(muCentroids[:,0],muCentroids[:,1]) plt.show() print(clusterAssing) print(muCentroids)
程式碼及資料集下載:K-means
相關推薦:
以上是python實作kMeans演算法的詳解的詳細內容。更多資訊請關注PHP中文網其他相關文章!

toAppendElementStoApythonList,usetheappend()方法forsingleements,Extend()formultiplelements,andinsert()forspecificpositions.1)useeAppend()foraddingoneOnelementAttheend.2)useextendTheEnd.2)useextendexendExendEnd(

TocreateaPythonlist,usesquarebrackets[]andseparateitemswithcommas.1)Listsaredynamicandcanholdmixeddatatypes.2)Useappend(),remove(),andslicingformanipulation.3)Listcomprehensionsareefficientforcreatinglists.4)Becautiouswithlistreferences;usecopy()orsl

金融、科研、医疗和AI等领域中,高效存储和处理数值数据至关重要。1)在金融中,使用内存映射文件和NumPy库可显著提升数据处理速度。2)科研领域,HDF5文件优化数据存储和检索。3)医疗中,数据库优化技术如索引和分区提高数据查询性能。4)AI中,数据分片和分布式训练加速模型训练。通过选择适当的工具和技术,并权衡存储与处理速度之间的trade-off,可以显著提升系统性能和可扩展性。

pythonarraysarecreatedusiseThearrayModule,notbuilt-Inlikelists.1)importThearrayModule.2)指定tefifythetypecode,例如,'i'forineizewithvalues.arreaysofferbettermemoremorefferbettermemoryfforhomogeNogeNogeNogeNogeNogeNogeNATATABUTESFELLESSFRESSIFERSTEMIFICETISTHANANLISTS。

除了shebang線,還有多種方法可以指定Python解釋器:1.直接使用命令行中的python命令;2.使用批處理文件或shell腳本;3.使用構建工具如Make或CMake;4.使用任務運行器如Invoke。每個方法都有其優缺點,選擇適合項目需求的方法很重要。

ForhandlinglargedatasetsinPython,useNumPyarraysforbetterperformance.1)NumPyarraysarememory-efficientandfasterfornumericaloperations.2)Avoidunnecessarytypeconversions.3)Leveragevectorizationforreducedtimecomplexity.4)Managememoryusagewithefficientdata

Inpython,ListSusedynamicMemoryAllocationWithOver-Asalose,而alenumpyArraySallaySallocateFixedMemory.1)listssallocatemoremoremoremorythanneededinentientary上,respizeTized.2)numpyarsallaysallaysallocateAllocateAllocateAlcocateExactMemoryForements,OfferingPrediCtableSageButlessemageButlesseflextlessibility。

Inpython,YouCansspecthedatatAtatatPeyFelemereModeRernSpant.1)Usenpynernrump.1)Usenpynyp.dloatp.dloatp.ploatm64,formor professisconsiscontrolatatypes。


熱AI工具

Undresser.AI Undress
人工智慧驅動的應用程序,用於創建逼真的裸體照片

AI Clothes Remover
用於從照片中去除衣服的線上人工智慧工具。

Undress AI Tool
免費脫衣圖片

Clothoff.io
AI脫衣器

Video Face Swap
使用我們完全免費的人工智慧換臉工具,輕鬆在任何影片中換臉!

熱門文章

熱工具

WebStorm Mac版
好用的JavaScript開發工具

SublimeText3漢化版
中文版,非常好用

Dreamweaver CS6
視覺化網頁開發工具

SAP NetWeaver Server Adapter for Eclipse
將Eclipse與SAP NetWeaver應用伺服器整合。

MantisBT
Mantis是一個易於部署的基於Web的缺陷追蹤工具,用於幫助產品缺陷追蹤。它需要PHP、MySQL和一個Web伺服器。請查看我們的演示和託管服務。