K-Means¶
Input: \(k =\) number of clusters
Output: Set of \(k\) clusters
Steps¶
-
Normalize \(X\)
-
Randomly choose \(k\) objects from the dataset as the initial cluster centroids (centroid = center of a cluster)
-
For each of the objects
-
Compute distance between current objects and \(k\) cluster centroids
-
Assign current object to that cluster to which it is closest
If distance of a point between 2 clusters is same, then we assign the point to first centroid.
-
Compute âcluster centersâ \(m\) of each cluster. These become the new cluster centroids
\[ \begin{aligned} m_k &= \Big(\text{mean}(X), \text{mean}(Y) \Big) \\ X &= \text{List of } x \text{ coordinates} \\ Y &= \text{List of } y \text{ coordinates} \end{aligned} \]
-
Repeat steps 2-3 until convergence criterion is satisfied
-
Stop
Convergence Criterion¶
- Particular number of iterations We can derive this by testing and plotting graph of accuracy vs iterations
or
- when clusters donât change over subsequent iterations
Limitations¶
- Clustering stuck in local minima: convergence dependent on initialization
- Measuring clustering quality is hard & relies on heuristics
- Non-probabilistic: Cluster assignment is binary