Machine Learning - k-means
Formula
Sample set . Cluster algorithms want to divide the data into clusters , where and .
We use represent the cluster label, so . So the cluster result can be represented as a cluster label vector .
For k-means algorithm, our target is minimize the square error:
where is the mean vector of cluster :
Find the best cluster is an NP-hard problem, but there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum.
Algorithm
- Initialize the cluster , randomly choose k sample as the initial centroids:
- Calculate the distance between sample and as
- Label sample as , so
- Update the centroids
- Repeat step 2 ~ 4 until centroids don’t change.