K-Means聚类算法原理

发表于 2019-04-25 更新于 2021-05-27 分类于数据科学

本文字数： 7k 阅读时长 ≈ 6 分钟

Introduction To K-Means Clustering

K-Means 聚类算法是一类无监督学习算法，适用于unlabeled data (data without defined categories or groups)该算法的目标是在数据中找到划分的界限，使数据被分为K个类别。算法不断的迭代，使得每个数据点根据特征被归入K个Categories中的其中一个。

算法的最终输出有：

    1. K簇的质心，可用于标记新数据
    2. 训练数据的标签（每个数据点分配给一个集群）

K簇的质心就是一组特征值，根据这k个质心即确定数据所代表的类

算法

Κ -means聚类算法使用迭代优化，以产生最终结果。算法输入是簇 Κ 和数据集的数量。数据集是每个数据点的向量集合。算法从Κ 质心的初始估计开始，可以随机生成或从数据集中随机选择。然后算法在两个步骤之间迭代：

数据分配

每个质心定义一个簇。在此步骤中，每个数据点基于平方欧几里德距离分配到距其最近的质心。更正式地说，如果c i 是集合C中的质心集合，则每个数据点x按下面公式被分配给集群。

\[ \underset{c_i\in C}{\arg\min}\;DIST(C_i，x)^2 \]
1. 质心更新
在此步骤中，质心被重新计算，通过计算分配给该质心的所有数据的平均值来更新质心

\[c_i=\frac{1}{|S_i|}\sum_{x_i \in S_i x_i}\]

该算法在步骤1和步骤2之间迭代，直到满足停止标准（即，没有数据点改变簇，距离的总和最小化，或者达到一些最大迭代次数）。

由于初始质心的选择是随机确定，这意味着最终结果可能是局部最优。

K 的选择

上面描述的K-Means算法找到类并为数据点标记都基于提前给定的K值。为了发现数据集的簇的数量，用户需要给定一系列不同的k值并运行算法，然后比较结果。实际上，也没有具体的算法能直接给出合适的k值，但存在一些措施来辅助确定较为准确的K值

对于不同k值的度量方式：数据点与其质心的平均距离。

由于增加簇的数量将总是减少到数据点的距离，即增加 K 将总是减小该度量值。而当K 与数据点的数量相同时，该度量值就会达到零的极值。

故该指标不能用作唯一目标。，到质心的平均距离可以看做是一个关于K的函数，下图对该函数进行绘制，其中下降率急剧变化的“驻点”，可用于粗略地确定ķ。

也存在许多其他用于验证K的技术，包括交叉验证，信息标准，信息理论跳跃方法，轮廓方法和G均值算法。此外，监视跨组的数据点分布可以深入了解算法如何分割每个K的数据。

算法伪代码描述

function K-Means(数据data, 中心点k) 
	获取输入数据的维度Dim 和 个数 N
	随机生成 K 个 Dim 维的点作为中心点
	while(未收敛)		
			对 N 个点：计算每个点属于的类别
			对 K 个中心点 ：
				1、找出所有与该点同类的所有数据点
				2、把自己的坐标修改为这些数据点的中心
	end	
	output
end

Python 实现

python code(use PCA to reduct dimensions)

选用数据集：

sklearn 库中的鸢尾花，降维后绘制图象。
自设数据集，给定不同的协方差矩阵和均值生成符合分布的随机数据

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from PCA import PCA


class K_Means(object):
    def __init__(self, n_clusters, max_iter=200, n_init=5):
        self.n_cluster = n_clusters
        self.max_iter = max_iter
        self.n_init = n_init

        self.data = None
        self.dim = None
        self.n = None
        self.center_point = None
        self.data_label = None
        self.init_point = None


    def rand_points(self):  # 初始化中心点
        counter = 0
        index = []

        center_point = np.zeros((self.n_cluster, self.dim), dtype=np.float32)
        self.init_point = np.zeros((self.n_cluster, self.dim), dtype=np.float32)
        while counter < self.n_cluster:
            i = np.random.randint(0, self.n)
            if i not in index:
                index.append(i)
                center_point[counter] = self.data[i]
                counter += 1

        self.init_point[:] = center_point[:]

    def _k_means(self):
        self.rand_points()
        sign = np.zeros(self.n_cluster, dtype=np.bool)
        data_label = np.zeros((self.n, 1), dtype=np.float32)
        counter = 0
        new_center_point = np.zeros_like(self.init_point)
        new_center_point[:] = self.init_point[:]

        while True and counter < self.max_iter:
            for i in range(self.n):
                d = 1e+10
                label = 0
                for j in range(self.n_cluster):
                    distance = np.linalg.norm(self.data[i] - new_center_point[j])
                    if distance < d:
                        d = distance
                        label = j
                data_label[i] = label

            for i in range(self.n_cluster):
                cluster = np.argwhere(data_label == i)[:, 0]
                coord = self.data[cluster].T
                ave = []
                for j in coord:
                    ave.append(np.mean(j))
                ave = np.array(ave)

                if np.linalg.norm(new_center_point[i] - ave) > 1e-06:

                    sign[i] = False
                    new_center_point[i] = ave
                else:
                    sign[i] = True

            if sign.all() == True:
                break

        counter += 1
        distance = 0
        for j in range(self.n_cluster):
            data_labeld = self.data[np.argwhere(data_label == j)]
            for i in data_labeld:
                distance += np.linalg.norm(new_center_point[j] - i)

        return new_center_point, data_label, distance / self.n

    def fit(self, data):  # K-Means算法实现主体 多次运行取最优值
        self.data = data
        self.dim = data.shape[1]
        self.n = data.shape[0]
        distance = 1e+10
        init_point = np.zeros((self.n_cluster, self.dim), dtype=self.data.dtype)
        for _ in range(self.n_init):
            new_center_point, data_label, d = self._k_means()

            if d < distance:
                distance = d
                self.center_point = new_center_point
                self.data_label = data_label
                init_point[:] = self.init_point[:]

        self.init_point = init_point
        print('\n', 20 * '_', "K-Means 数据聚类", 20 * '_','\n')
        print('\n', 20 * '_', " 初始中心点 ", 20 * '_','\n', init_point)
        print('\n', 20 * '_', " 最终中心点 ", 20 * '_','\n', self.center_point)

        print('\n', 20 * '_', ' 运行次数：{}'.format(self.n_init), 20 * '_')

    def label_(self):  # 返回标签
        return self.data_label.reshape(self.n)

    def cluster_centers(self):  # 最终分类中心点
        return self.center_point

    def predict(self, data):  # 对输入数据进行分类预测
        data = np.array(data)
        try:
            n, d = data.shape
        except ValueError:
            print("wrong data type")
            return False

        if d != self.dim:
            print("wrong data type")
            return False

        result = []
        for i in range(len(data)):
            d = 1e+10
            label = 0
            for j in range(self.n_cluster):
                distance = np.linalg.norm(data[i] - self.center_point[j])
                if distance < d:
                    d = distance
                    label = j
            result.append(label)
        return np.array(result)


class DataProduce:
    def __init__(self):
        pass

    def test_data(self):
        data1 = np.random.multivariate_normal([10, 0], [[5, 0], [0, 5]], size=100)
        data2 = np.random.multivariate_normal([0, -10], [[5, 0], [0, 5]], size=100)
        data3 = np.random.multivariate_normal([0, 15], [[5, 0], [0, 5]], size=100)
        data4 = np.random.multivariate_normal([-10, 5],[[5, 0], [0, 5]], size=100)

        target = [0 for _ in data1]
        target.extend([1 for _ in data2])
        target.extend([2 for _ in data3])
        target.extend([3 for _ in data4])

        data = np.concatenate((data1,data2,data3,data4))
        data_dict = {'data': data, 'target': target}
        return data_dict

    def iris(self):
        return load_iris()  # 鸾尾花数据集


class Draw:
    def __init__(self):
        pass

    def _2d_draw(self,d):
        data = d['data']
        d_label = d['target']
        cluster_num = len(set(d_label))
        # --------------------------绘制网格-----------------------------
        h = .1
        x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
        y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
        kmeans = K_Means(cluster_num)
        kmeans.fit(data)
        centroids = kmeans.cluster_centers()

        label_pred = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

        z = label_pred.reshape(xx.shape)
        fig = plt.figure(figsize=(10, 5))
        ax1 = fig.add_subplot(1, 2, 1)
        ax2 = fig.add_subplot(1, 2, 2)
        ax1.set_title("initial data-set", size=20)
        ax2.set_title('K-means clustering ', size=20)
        ax2.imshow(z, interpolation='nearest',
                   extent=(xx.min(), xx.max(), yy.min(), yy.max()),
                   cmap=plt.cm.Paired,
                   aspect='auto', origin='lower')
        ax2.plot(data[:, 0], data[:, 1], 'k.', markersize=2)
        ax1.scatter(*zip(*data), c=d_label, s=2)

        ax2.scatter(*zip(*data), c='black', s=2)
        ax2.scatter(*zip(*centroids), c='w', marker='x', s=1000)
        ax2.scatter(*zip(*kmeans.init_point), c='r', marker='1', s=1000)
        plt.show()


if __name__ == '__main__':
    data = DataProduce()

    test_data = data.test_data()
    iris = data.iris()

    kmeans = K_Means(4)

    # --------------4维数据聚类------------
    kmeans.fit(iris['data'])
    # --------------PCA降维---------------
    pca = PCA(iris['data'])
    reduced_data = {'data': pca.pca(2), 'target': iris['target']}

    # ----------------绘图-----------------
    draw = Draw()
    draw._2d_draw(reduced_data)
    draw._2d_draw(test_data)

Running Result

鸢尾花数据集，k=3

红色y型记号代表初始随机质心，白色记号代表最终数据中心

自设数据集，k=4，分类结果正确

红色y型记号代表初始随机质心，白色记号代表最终数据中心