Machine Learning - k-means

Sep 20, 2018

Formula

Sample set $D = \{\boldsymbol{x}_1, \boldsymbol{x}_2, ..., \boldsymbol{x}_m\}$ . Cluster algorithms want to divide the data into $k$ clusters $\{C_l \mid l = 1, 2, ..., k \}$ , where $C_{l'} \bigcap_{l' \neq l} C_l = \emptyset$ and $D = \bigcup_{l=1}^k C_l$ .

We use $\lambda_{j} \in \{ 1, 2, ..., k\}$ represent the cluster label, so $\boldsymbol{x}_j \in C_{\lambda_j}$ . So the cluster result can be represented as a cluster label vector $\{\lambda_1, \lambda_2, ..., \lambda_k\}$ .

For k-means algorithm, our target is minimize the square error:

$E = \sum_i^k \sum_{\boldsymbol{x} \in C_i} \| \boldsymbol{x} - \mu_i \|_2^2$

where $\mu_i$ is the mean vector of cluster $C_i$ :

$\mu_i = \frac{1}{|C_i|} \sum_{\boldsymbol{x} \in C_i}\boldsymbol{x}$

Find the best cluster is an NP-hard problem, but there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum.

Algorithm

Initialize the cluster $C_j = \emptyset, j \in \{1, 2, ..., k\}$ , randomly choose k sample as the initial centroids: $\{ \mu_1, \mu_2, ..., \mu_k \}$
Calculate the distance between sample $\boldsymbol{x}_i$ and $\mu_j$ as $d_{ij} = \| \boldsymbol{x}_i - u_j \|_2^2$
Label sample $\boldsymbol{x}_i$ as $\lambda_{\mathop{\arg\min}_{j} d_{ij}}$ , so $C_{\lambda_i} = C_{\lambda_i} \cap \boldsymbol{x}_i$
Update the centroids $\mu_i = \frac{1}{\mid C_i \mid} \sum_{\boldsymbol{x} \in C_i}\boldsymbol{x}$
Repeat step 2 ~ 4 until centroids don’t change.