- use prototypes
- restart using existing points
- click and drag to add examples
The above is a visualization of the k-means clustering algorithm in action. K-means is an unsupervised clustering algorithm used to partition a numerical data set into groups of similar vectors; "unsupervised" means that we don't know the labels of the clusters beforehand. More formally, k-means takes n examples and partitions them into k clusters, where each example is assigned to the cluster with the nearest center. Here I use the example of two-dimensional k-means, although the algorithm works with vectors of any dimension.
While it may sound complex, the algorithm is actually very simple. It consists of two steps that are iterated until the clustering is stable. Each cluster is described by a prototype vector: the average of all of the points assigned to that cluster. Think of it as the "center" of the cluster.
- Initialization: since the algorithm is just beginning, we randomly assign each example to one of the k clusters we are hoping to find. This is known as random partitioning.
- Update: each prototype vector is set to the average of all of the examples currently assigned to its cluster.
- Assignment: each example is assigned the label of its nearest prototype vector. If no examples have changed clusters, the algorithm is finished; otherwise, return to the update step.
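The steps above can be sketched in plain Python. This is a minimal illustration for 2-D points, not the code behind the visualization; the function and parameter names are my own:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Lloyd's algorithm with random-partition initialization.
    points: list of (x, y) tuples. Returns (labels, prototypes)."""
    rng = random.Random(seed)
    # Initialization: randomly assign each example to one of the k clusters.
    # Cycling through 0..k-1 before shuffling guarantees no cluster starts empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    prototypes = []
    for _ in range(max_iters):
        # Update step: move each prototype to the mean of its assigned points.
        prototypes = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                prototypes.append((sum(x for x, _ in members) / len(members),
                                   sum(y for _, y in members) / len(members)))
            else:
                # A cluster can lose all its points; re-seed it on a random example.
                prototypes.append(rng.choice(points))
        # Assignment step: relabel each example with its nearest prototype.
        new_labels = [min(range(k), key=lambda c: math.dist(p, prototypes[c]))
                      for p in points]
        if new_labels == labels:  # no example changed clusters: stable, stop
            break
        labels = new_labels
    return labels, prototypes
```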
Complexity and running time
k-means has a superpolynomial worst-case running time in the number of input points. More generally, finding the clustering that minimizes within-cluster variance is NP-hard, so heuristics such as the algorithm above are used, which are not guaranteed to find the globally optimal solution.
Because of the assumption that all points in a cluster will be close to one another geometrically, k-means has a difficult time properly identifying non-spherical clusters of vectors. Try it out above -- draw two stripes next to each other -- k-means will likely split each of the stripes into multiple clusters, even though we would intuitively understand each stripe as its own cluster.
K-means has an inherent element of randomness: depending on the random initialization of the prototype vectors, the algorithm will unfold differently. As a result, it often makes sense to restart the algorithm with new initial clusterings and keep the run whose clusters have particularly low variance, giving a better clustering overall.
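One way to compare restarts is by total within-cluster variance (the sum of squared distances from each point to its cluster's prototype). A sketch, assuming some `kmeans(points, k, seed=...)` function that returns `(labels, prototypes)` -- the helper names here are my own:

```python
import math

def inertia(points, labels, prototypes):
    # Total within-cluster variance: squared distance from each point
    # to the prototype of the cluster it is assigned to.
    return sum(math.dist(p, prototypes[lab]) ** 2
               for p, lab in zip(points, labels))

def best_of_restarts(points, k, kmeans, n_restarts=10):
    # Run k-means from several random initializations (seeds) and keep
    # the clustering with the lowest total within-cluster variance.
    runs = [kmeans(points, k, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: inertia(points, run[0], run[1]))
```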
There are plenty of limitations to k-means clustering. You may be wondering: how do I know beforehand what to set as the value of k, the number of clusters to look for? There are a few heuristics you can use. If you know the number of categories or clusters that are in the data, then you should try that.
If the number of expected clusters is unknown, it's possible to try k-means with various numbers of prototype vectors, looking for the k that minimizes the variance of the points assigned to each cluster. Occam's Razor comes into play here -- we are looking for the simplest set of clusters that explains the examples, so we should prefer small values of k that minimize variance. Obviously, if we choose a very large k (say, k close to n), then each cluster contains only a point or two and has near-zero variance, but we have learned almost nothing about the data set.
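That trade-off between fit and simplicity can be made concrete by penalizing each additional cluster. This is only an illustrative heuristic; the penalty weight is a free parameter I've introduced, not part of the standard algorithm:

```python
def choose_k(inertias, penalty):
    # inertias: dict mapping k -> total within-cluster variance for that k.
    # Score each candidate by fit plus a complexity cost per cluster, and
    # pick the k with the lowest score (Occam's Razor, made numeric).
    return min(inertias, key=lambda k: inertias[k] + penalty * k)
```

In practice, people often eyeball the "elbow" in a plot of variance versus k instead of fixing a penalty weight.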
There are other clustering algorithms that address these problems, such as Hierarchical Agglomerative Clustering (HAC), which I will address in future visual algorithms.