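Hierarchical Clustering Algorithms in Python Explained
Updated: 2023-07-14
What Is Hierarchical Clustering
Hierarchical clustering is an unsupervised learning method based on a distance metric. It measures the similarity or distance between pairs of samples and uses those values to define the distance between clusters. In its most common (agglomerative) form, it then iteratively merges the two most similar clusters (or samples) into a new cluster until all samples belong to a single cluster or a preset number of clusters is reached.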
import numpy as np
from scipy.spatial.distance import pdist, squareform

def euclidean_distance(X):
    """
    Compute the pairwise Euclidean distance matrix.
    """
    # pdist returns the condensed distances; squareform expands them to n x n.
    return squareform(pdist(X, metric='euclidean'))

X = np.array([
    [1, 2],
    [2, 1],
    [3, 4],
    [4, 3]
])
distance_matrix = euclidean_distance(X)
print(distance_matrix)
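As a cross-check on the description above, the merge process can also be run with SciPy's `scipy.cluster.hierarchy` module. This is a minimal sketch on the same four toy points; the choice of single linkage and of cutting the tree at two clusters are assumptions made for illustration, not something the article prescribes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same four toy points as in the distance-matrix example.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)

# Each row of Z records one merge: the two cluster ids that were joined,
# the distance between them, and the number of samples in the new cluster.
Z = linkage(X, method='single', metric='euclidean')

# Cut the merge tree so that at most two flat clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')

print(Z)
print(labels)
```

With n samples, `Z` always has n - 1 rows, one per merge, so the full hierarchy is available even when only a single cut is needed.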
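Types of Hierarchical Clustering
Hierarchical clustering comes in two variants: agglomerative and divisive.
Agglomerative hierarchical clustering: a bottom-up approach. The algorithm first treats every data point as its own cluster, then repeatedly merges the two most similar clusters into one until a stopping criterion is met.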
def agglomerative_clustering(X, k):
    """
    Naive agglomerative clustering using single linkage
    (minimum pairwise distance between two clusters).
    Returns an array of n cluster labels in the range 0..k-1.
    """
    n = X.shape[0]
    dist = euclidean_distance(X)
    np.fill_diagonal(dist, np.inf)   # a cluster is never merged with itself
    labels = np.arange(n)            # every sample starts as its own cluster
    active = list(range(n))          # row/column indices of the live clusters
    while len(active) > k:
        # Find the closest pair among the live clusters.
        sub = dist[np.ix_(active, active)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        row, col = active[i], active[j]
        # Merge cluster `col` into cluster `row`.
        labels[labels == col] = row
        # Single linkage: keep the smaller of the two distances to each cluster.
        dist[row, :] = np.minimum(dist[row, :], dist[col, :])
        dist[:, row] = dist[row, :]
        dist[row, row] = np.inf
        active.remove(col)
    # Renumber the surviving cluster ids as 0..k-1.
    return np.unique(labels, return_inverse=True)[1]
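How "most similar" is decided depends on how the distance between two clusters is derived from the distances between their samples. The article does not name a linkage criterion, so the sketch below simply compares three common ones (single, complete, average). The helper name `cluster_distance` and the two toy clusters are hypothetical, introduced only for this illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_distance(A, B, linkage='single'):
    """Hypothetical helper: distance between clusters A and B under a linkage rule."""
    D = cdist(A, B, metric='euclidean')   # all cross-cluster sample distances
    if linkage == 'single':
        return D.min()                    # closest pair of samples
    if linkage == 'complete':
        return D.max()                    # farthest pair of samples
    if linkage == 'average':
        return D.mean()                   # mean over all cross-cluster pairs
    raise ValueError(f"unknown linkage: {linkage}")

# Two toy clusters built from the four sample points used earlier.
A = np.array([[1, 2], [2, 1]], dtype=float)
B = np.array([[3, 4], [4, 3]], dtype=float)

for name in ('single', 'complete', 'average'):
    print(name, cluster_distance(A, B, name))
```

Single linkage tends to produce elongated, chained clusters, while complete linkage favors compact ones, so the choice can change which pair is merged at each step.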
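Divisive hierarchical clustering: a top-down approach. The algorithm first treats the entire dataset as a single cluster, then repeatedly splits one cluster into two until the preset number of clusters is reached. Each split is usually chosen greedily, for example by running K-Means with two clusters on the cluster being split.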
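The top-down procedure can be sketched with scikit-learn's `KMeans` doing each two-way split. The function name `divisive_clustering` is hypothetical, and the rule for picking which cluster to split (the one with the largest within-cluster sum of squares) is an assumption; the article does not specify one.

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, k, random_state=0):
    """Hypothetical helper: top-down clustering by repeated 2-means splits."""
    labels = np.zeros(X.shape[0], dtype=int)   # all samples start in cluster 0
    for new_label in range(1, k):
        # Assumption: split the cluster with the largest within-cluster
        # sum of squared distances to its centroid.
        sse = [((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in range(new_label)]
        target = int(np.argmax(sse))
        idx = np.flatnonzero(labels == target)
        if idx.size < 2:
            break                              # nothing left to split
        halves = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[idx])
        labels[idx[halves == 1]] = new_label   # one half gets a new label
    return labels

# Two well-separated toy groups of three points each.
X_demo = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)
demo_labels = divisive_clustering(X_demo, k=2)
print(demo_labels)
```

Because every split is made greedily and never revisited, an early poor split cannot be undone later, which is the main trade-off of this approach.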
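Preprocessing for Hierarchical Clustering
Hierarchical clustering places a few preprocessing requirements on the data:
- Feature values must be numeric
- Feature standardization (optional)
- Outlier detection and correction (optional)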
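import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the iris data; the file is read without a header row.
iris = pd.read_csv('data/iris.csv', header=None)
X = iris.iloc[:, :4].values   # the four numeric features
y = iris.iloc[:, 4].values    # species labels (not used for clustering)

# Rescale every feature to zero mean and unit variance.
sc = StandardScaler()
X_std = sc.fit_transform(X)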
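The optional outlier step from the list above can be sketched with a simple z-score rule. Both the threshold of 3 standard deviations and the choice to clip rather than drop outliers are assumptions made for illustration; the helper name `flag_and_clip_outliers` and the toy data are hypothetical.

```python
import numpy as np

def flag_and_clip_outliers(X, z_max=3.0):
    """Hypothetical helper: z-score based outlier detection and clipping."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                     # avoid division by zero
    Z = (X - mean) / std                    # standardized features
    outlier_rows = (np.abs(Z) > z_max).any(axis=1)
    Z_clipped = np.clip(Z, -z_max, z_max)   # cap extreme values at the threshold
    return Z_clipped, outlier_rows

# Toy data: 50 well-behaved rows plus one extreme row appended at the end.
base = np.linspace(-1.0, 1.0, 50)
X_demo = np.column_stack([base, base[::-1]])
X_demo = np.vstack([X_demo, [50.0, 0.0]])

Z_clipped, outlier_rows = flag_and_clip_outliers(X_demo)
print(np.flatnonzero(outlier_rows))         # indices of the flagged rows
```

Distance-based methods are sensitive to extreme values, since a single far-away point can dominate the distances used when merging clusters, which is why this step is worth considering before clustering.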
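Evaluating Hierarchical Clustering
Hierarchical clustering has no explicit loss function, so the quality of a clustering result is usually judged with internal metrics. Two common ones are the silhouette coefficient (range -1 to 1) and the Calinski-Harabasz index; for both, higher values indicate better-separated clusters.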
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Cluster the standardized data into 3 groups, then score the result.
labels = agglomerative_clustering(X_std, 3)
score_silhouette = silhouette_score(X_std, labels)
score_CH = calinski_harabasz_score(X_std, labels)
print("Silhouette coefficient:", score_silhouette)
print("Calinski-Harabasz index:", score_CH)
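To make the silhouette coefficient less of a black box, here is a minimal from-scratch sketch. For each sample i it computes s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the mean distance to the nearest other cluster, then averages over all samples. The function name and the toy labels are hypothetical, and the sketch assumes at least two clusters and no duplicate points.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def silhouette_from_scratch(X, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    labels = np.asarray(labels)
    D = squareform(pdist(X, metric='euclidean'))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():
            continue                         # singleton cluster: s(i) = 0
        a = D[i, same].mean()                # mean intra-cluster distance
        b = min(D[i, labels == c].mean()     # mean distance to nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Four toy points split into two obvious clusters.
X_toy = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)
labels_toy = np.array([0, 0, 1, 1])
print(silhouette_from_scratch(X_toy, labels_toy))
```

A value near 1 means samples sit much closer to their own cluster than to any other, a value near 0 means clusters overlap, and negative values suggest samples assigned to the wrong cluster.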