Note

Machine Learning - k-means

Formula

Sample set . Cluster algorithms want to divide the data into clusters , where and .

We use represent the cluster label, so . So the cluster result can be represented as a cluster label vector .

For k-means algorithm, our target is minimize the square error:

where is the mean vector of cluster :

Find the best cluster is an NP-hard problem, but there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum.

Algorithm

  1. Initialize the cluster , randomly choose k sample as the initial centroids:
  2. Calculate the distance between sample and as
  3. Label sample as , so
  4. Update the centroids
  5. Repeat step 2 ~ 4 until centroids don’t change.
machine-learning