K means is an algorithm used for clustering points in an n-dimensional space, and it's one of the oldest clustering algorithms around. Clustering is a useful tool for seeing subgroups and structure within a large body of data -- in short, it's a tool to help you see patterns.
It's a very simple algorithm:
1. Specify the number of clusters you are looking for; this is the "k" in k means.
2. Pick k random points from the data you're given to cluster. These k become the initial centroids of the clusters.
3. For each point in the data, work out the centroid nearest to it.
4. For each centroid, take the mean of the points assigned to it in step 3, and use that mean as the new centroid.
5. Check whether any point has changed the centroid it is closest to. If so, go back to step 3 and repeat; otherwise stop and return the current centroids.
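The steps above can be sketched in Python with NumPy. This is a minimal sketch, not a production implementation; the function and variable names are my own choices:

```python
import numpy as np

def k_means(points, k, seed=0):
    """Cluster `points` (shape (n, d)) into k clusters; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 2: pick k distinct random points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    while True:
        # Step 3: find the centroid nearest to each point.
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        nearest = dists.argmin(axis=1)
        # Step 5: if no point changed cluster, stop and return.
        if np.array_equal(nearest, labels):
            return centroids, labels
        labels = nearest
        # Step 4: move each centroid to the mean of the points assigned to it.
        for i in range(k):
            members = points[labels == i]
            if len(members):  # guard against a cluster that ended up empty
                centroids[i] = members.mean(axis=0)
```

Note the guard for empty clusters: if no point is nearest to a given centroid, the sketch simply leaves that centroid where it is.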
K means is pretty cool overall, but it does have some problems. Firstly, you have to know k in advance, which can be tricky. Secondly, it is quite sensitive to the initial choice of centroids. For this reason, k means is sometimes restarted from new random centroids, and some metric, e.g. the total distance of the points from their centroids, is used to choose the best of the various runs.
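The random-restart idea might look like the following sketch, which repeats a compact k-means loop from fresh random centroids and keeps the run with the smallest total point-to-centroid distance (the names `one_run` and `k_means_restarts` are my own, and the metric shown is just one reasonable choice):

```python
import numpy as np

def one_run(points, k, rng):
    """A single k-means run from random initial centroids."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    while True:
        d = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        nearest = d.argmin(axis=1)
        if np.array_equal(nearest, labels):
            break
        labels = nearest
        for i in range(k):
            members = points[labels == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
    # The restart metric: total distance of points from their centroids.
    score = np.linalg.norm(points - centroids[labels], axis=1).sum()
    return score, centroids, labels

def k_means_restarts(points, k, restarts=10, seed=0):
    """Run k-means several times; return the (centroids, labels) with the best score."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    best = min((one_run(points, k, rng) for _ in range(restarts)),
               key=lambda run: run[0])
    return best[1], best[2]
```

A bad initial choice tends to produce a visibly worse total distance, so picking the minimum-score run usually filters out unlucky starts.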
K means is used a lot in data mining and sometimes in visualization. It has a big brother, the E-M (expectation-maximization) algorithm, which is a little more general, but also quite a bit more complicated.