
Detailed explanation of hierarchical clustering algorithm in Python

WBOY, Original: 2023-06-10 09:28:37

Hierarchical clustering is an unsupervised learning algorithm that groups data points. It repeatedly merges the most similar points or clusters, based on the similarity or distance between them, and finally produces a tree structure (also called a clustering tree or dendrogram) from which all points can be divided into several clusters.
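The merging process described above can be sketched in a few lines. This is only an illustrative, unoptimized version that uses the minimum distance between members as the similarity measure; the function name naive_single_linkage and the toy points are assumptions made for this sketch, not part of any library.

```python
import numpy as np

def naive_single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (illustrative only)."""
    points = np.asarray(points, dtype=float)
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Distance between the closest pair of members of the two clusters.
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Record which clusters were merged and at what distance.
        merges.append((sorted(clusters[a]), sorted(clusters[b]), float(d)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Hypothetical toy data: two tight pairs and one isolated point.
toy_points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]]
final_clusters, merge_log = naive_single_linkage(toy_points, n_clusters=3)
print(final_clusters)
```

The list of merges is what a library implementation stores as the tree; stopping at a different n_clusters corresponds to cutting that tree at a different level.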

Python is one of the most widely used programming languages. It has many powerful data processing and visualization tools, and there are many implementations of hierarchical clustering algorithms. In this article, we will discuss methods and some best practices for implementing hierarchical clustering algorithms in Python.

Data preparation

Before starting hierarchical clustering, you need to prepare the data set for clustering. Generally speaking, these data sets should meet the following conditions:

The data set should be numerical. Non-numeric data may cause errors in the algorithm.
The data set should be preprocessed, that is, it has undergone standardization, feature selection, or other preprocessing operations to eliminate data bias and noise.

In Python, we can use the pandas library to load, prepare and preprocess data. pandas provides the DataFrame data structure, which can easily handle tabular data. The following is a simple example:

import pandas as pd

# Read the CSV file
data = pd.read_csv('data.csv')

# Preprocess the data (for example, standardize it)
data = (data - data.mean()) / data.std()

Here we first call the pandas read_csv function to read a CSV file, and then standardize the data (subtract each column's mean and divide by its standard deviation) so that it can be passed to the algorithm.
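Real files often contain text columns, missing values, or constant columns, all of which interfere with the standardization step. The following is a minimal sketch of one way to guard against them, assuming a small made-up table in place of data.csv; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy table standing in for data.csv; the column names are made up.
raw = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 65.0, 80.0],
    "label": ["a", "b", "c", "d"],       # non-numeric column
    "constant": [1.0, 1.0, 1.0, 1.0],    # zero-variance column
})

# Keep numeric columns only, so that mean() and std() are well defined.
numeric = raw.select_dtypes(include="number")

# Drop rows with missing values; distance computations cannot handle NaN.
numeric = numeric.dropna()

# Drop zero-variance columns, which would lead to division by zero.
numeric = numeric.loc[:, numeric.std() > 0]

# z-score standardization: zero mean and unit sample standard deviation per column.
standardized = (numeric - numeric.mean()) / numeric.std()
print(standardized)
```

Whether dropping rows and columns is acceptable depends on the data set; imputing missing values is a common alternative.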

Selection of clustering algorithms

Python offers many clustering algorithms, and hierarchical clustering is only one of them. Choose an algorithm based on the characteristics of the data and the requirements of the task.

In classic hierarchical clustering, there are two main linkage methods: minimum distance and maximum distance. The minimum distance method (single linkage) measures the distance between two clusters by their most similar pair of points, while the maximum distance method (complete linkage) uses their least similar pair of points. In addition, there is the average linkage method (also called the UPGMA algorithm), which uses the average distance between the points of the two clusters. In Python, we can use the linkage function in the scipy library to perform hierarchical clustering. Here is a simple example:

from scipy.cluster.hierarchy import linkage

# Perform hierarchical clustering
Z = linkage(data, method='single')

In this example, we use the linkage function for single-linkage clustering. The first parameter of this function is the data and the second, method, is the linkage method to use. Here we use 'single', which is the minimum distance method.
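To see how the linkage method changes the result, the sketch below runs three methods on the same data and compares the distance of the final merge. The four one-dimensional points are an assumption chosen so that the arithmetic is easy to check by hand.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 1-D data (an assumption for illustration): two groups, {0, 1} and {4, 6}.
X = np.array([[0.0], [1.0], [4.0], [6.0]])

# The last row of each linkage matrix is the final merge; column 2 is its distance.
final_merge_distance = {
    method: float(linkage(X, method=method)[-1, 2])
    for method in ("single", "complete", "average")
}
print(final_merge_distance)
```

The final merge joins {0, 1} and {4, 6}. Single linkage reports the closest pair (1 and 4, distance 3), complete linkage the farthest pair (0 and 6, distance 6), and average linkage the mean of the four cross-pair distances (4.5).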

Visualization of tree structure

The tree structure is the core of the hierarchical clustering algorithm, and it can be visualized as a dendrogram. Python has many visualization libraries; two of the most popular are matplotlib and seaborn.

The following is a simple example of using the matplotlib library to draw a dendrogram:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

fig, ax = plt.subplots(figsize=(15, 10))

# Draw the dendrogram
dendrogram(Z, ax=ax, leaf_font_size=8)
plt.show()

In this example, we first create a figure with an axes object ax and then call the dendrogram function to draw the dendrogram. The first parameter of this function is the Z matrix, and the ax parameter is the axes object to draw on. The leaf_font_size parameter adjusts the font size of the leaf labels.
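It can help to look at what the Z matrix contains before plotting it. The sketch below, which assumes the same kind of toy data as a stand-in for a real data set, prints one line per merge and then asks dendrogram for its layout without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy 1-D data (an assumption for illustration).
X = np.array([[0.0], [1.0], [4.0], [6.0]])
Z = linkage(X, method="single")

# Each row of Z describes one merge: [cluster_a, cluster_b, distance, size].
# Indices 0..n-1 are the original points; index n + i is the cluster formed in row i.
n = X.shape[0]
for step, (a, b, dist, size) in enumerate(Z):
    print(f"merge {step}: {int(a)} + {int(b)} at distance {dist:.1f} "
          f"-> cluster {n + step} ({int(size)} points)")

# no_plot=True returns the computed layout as a dictionary instead of drawing it.
layout = dendrogram(Z, no_plot=True)
leaf_order = layout["leaves"]
print(leaf_order)
```

The distance column of Z is what the dendrogram shows as the height of each join, and leaf_order is the left-to-right order of the original points along the bottom of the plot.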

With the seaborn library we can draw a styled plot of the merge distances stored in Z. Note that scatterplot does not draw the tree itself; it shows how the merge distance grows from one merge step to the next. The following is an example:

import seaborn as sns

sns.set(style='white')

# Convert the merge distances into a DataFrame (Z has one row per merge)
df = pd.DataFrame({'x': range(1, len(Z) + 1), 'y': Z[:, 2]})

# Draw the merge distances as a scatter plot
sns.scatterplot(x='x', y='y', data=df, s=50, legend=False)
plt.show()

In this example, we first convert the merge distances into a data frame, and then use the scatterplot function in the seaborn library to plot them. The s parameter adjusts the size of the points.

Selection of clusters

In hierarchical clustering, there are two ways to select clusters: by distance (that is, by height in the dendrogram) and by number. The distance-based method sets a distance threshold and cuts the dendrogram at that height, so that every subtree below the cut becomes a cluster. The number-based method fixes the desired number of clusters and cuts the dendrogram at whatever height yields that many. Both methods have advantages and disadvantages, and the choice depends on the case.

The following is a simple example of converting hierarchical clustering results into a cluster list:

from scipy.cluster.hierarchy import fcluster

# Convert the hierarchical clustering result into flat cluster labels
clusters = fcluster(Z, t=2.0, criterion='distance')

In this example, we use the fcluster function to convert the hierarchical clustering result into flat clusters. It returns an array holding one cluster label per data point. The first parameter of this function is the Z matrix, the second, t, is the threshold, and the third, criterion, determines how the threshold is interpreted.
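The distance-based and number-based selections can be compared side by side. The sketch below assumes toy data and threshold values chosen only for illustration; criterion='maxclust' is the fcluster option that limits the number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy 1-D data (an assumption for illustration).
X = np.array([[0.0], [1.0], [4.0], [6.0]])
Z = linkage(X, method="single")  # merge distances are 1, 2 and 3

# Distance-based: points stay together only if they were merged at distance <= t.
by_distance = fcluster(Z, t=1.5, criterion="distance")

# Number-based: ask for at most 2 flat clusters.
by_count = fcluster(Z, t=2, criterion="maxclust")

print(by_distance)
print(by_count)
```

With t=1.5 only the merge at distance 1 is kept, so the points 4 and 6 remain separate and three clusters result. With maxclust the same tree is cut higher, giving the two groups {0, 1} and {4, 6}.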

Summary

In this article, we discussed methods and some best practices for implementing hierarchical clustering algorithms in Python. We first looked at data preparation and then discussed algorithm selection and visualization of the tree structure. Finally, we discussed cluster selection methods. These methods can help us better understand the hierarchical clustering algorithm, so that we can apply it to our own data and draw useful conclusions.
