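I've run into a problem: k-means clustering with scikit-learn gives inconsistent results between scikit-learn versions 1.2.2 and 1.3.1.
With the number of clusters (k) set to 3, the cluster assignments produced by the two versions do not match.
Here is the code (using a toy dataset):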

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.metrics import silhouette_samples, silhouette_score
    from sklearn.preprocessing import StandardScaler

    # Load the digits toy dataset and standardize the features
    digits = load_digits()
    X = digits.data
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Fit k-means with a fixed seed
    n_clusters = 3
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(X_scaled)

    # Per-sample silhouette values and the overall mean score
    sample_silhouette_values = silhouette_samples(X_scaled, cluster_labels)
    sil_score = silhouette_score(X_scaled, cluster_labels)

    fig, ax = plt.subplots(1, 1, figsize=(8, 6))
    y_lower = 10
    for i in range(n_clusters):
        # Sorted silhouette values of the samples assigned to cluster i
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, alpha=0.7)
        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10  # leave a gap before the next cluster

    ax.set_xlabel("The silhouette coefficient values")
    ax.axvline(x=sil_score, color="red", linestyle="--")  # mean silhouette score
    ax.set_title("Silhouette plot for the various clusters")
    fig.suptitle(f"Silhouette analysis for KMeans clustering with n_clusters = {n_clusters}")
    plt.show()

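Discrepancies Observed:
We noticed that the results of the fit and fit_predict methods differ between the two versions. Looking through the scikit-learn changelog, I can see that there were indeed changes between these two releases. However, I'm not sure whether those changes are what is causing the difference in our clustering results.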
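
To check whether two runs really produce different partitions, rather than the same clusters with different label IDs, one option might be to compare the label arrays with `adjusted_rand_score`, which does not depend on how the clusters are numbered. Below is a minimal sketch, not a diagnosis: `fit_labels` is a helper name made up for this example, and the second run with a different seed is only a stand-in for labels that would actually come from the other scikit-learn version. `init` and `n_init` are passed explicitly on the assumption that this keeps the comparison from depending on version-specific defaults.

```python
import sklearn
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Record which version produced the labels
print("scikit-learn version:", sklearn.__version__)

X_scaled = StandardScaler().fit_transform(load_digits().data)


def fit_labels(random_state):
    """Hypothetical helper: fit k-means with explicit settings, return labels and inertia."""
    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=random_state)
    labels = km.fit_predict(X_scaled)
    return labels, km.inertia_


labels_a, inertia_a = fit_labels(42)
# Stand-in for the labels obtained under the other version (assumption for this sketch)
labels_b, inertia_b = fit_labels(0)

# Renumbering the clusters leaves the partition unchanged, so the ARI stays at 1.0
relabelled = (labels_a + 1) % 3
ari_same_partition = adjusted_rand_score(labels_a, relabelled)
print("ARI, same partition with renumbered labels:", ari_same_partition)

# An ARI well below 1.0 here would indicate a genuinely different partition
ari_between_runs = adjusted_rand_score(labels_a, labels_b)
print("ARI between the two runs:", ari_between_runs)

# Lower inertia means a tighter fit to the k-means objective
print("Inertia:", inertia_a, inertia_b)
```

In practice the labels from each environment would be saved (for example with `np.save`) and loaded into one script for the comparison. If the ARI is close to 1.0, the two versions found essentially the same clusters and only the numbering differs.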
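Queries:
- Can the version differences between scikit-learn 1.2.2 and 1.3.1 lead to different k-means clustering results? If so, what causes the difference?
- Which of these clustering results should be considered correct?

Thanks for your help!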