# Clustering Part 2: K-means clustering

Clustering data is the act of partitioning observations into groups, or clusters, such that each data point shares similar characteristics with the other members of its group. Cluster analysis is commonly used in fields that utilize data mining, pattern recognition and machine learning. While MATLAB has several clustering tools included in its arsenal, we’ll take a look at the function kmeans in this tutorial. Following classification of n observations into k clusters, we can use binary classification to investigate the sensitivity and specificity of our clustering.

Let’s once again use a similar dataset of males and females, with their corresponding heights and weights. As you will see, there is some overlap between the genders, but we will nonetheless attempt to segregate the two groups. A plot of all of our individuals would look as follows:

In order to use K-means clustering, we need to specify a priori how many clusters (k) we would like to use. In this example, we know we are looking for k = 2 since we have males and females. For non-trivial clustering problems, this determination of k becomes more important. The overall goal is to keep k small while also reducing the variance within each cluster. The trivial solution where k = n would of course give us zero variance, but with as many clusters as observations; at the other extreme, if k = 1, we have a single cluster with large variance. While trial and error will allow you to determine the optimum number of clusters in a data set, you can also check the average distance of each observation to the centroid of its corresponding cluster.
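As a quick sketch of that last idea, you could sweep several candidate values of k and plot the average point-to-centroid distance, looking for the point where adding more clusters stops helping. (This snippet is illustrative and assumes the same height and weight vectors used later in this tutorial.)

```matlab
% Evaluate candidate values of k by the mean distance
% of observations to their assigned cluster centroid
X = [height weight];                       % n-by-2 data matrix
max_k = 6;
avg_dist = zeros(1, max_k);
for k = 1:max_k
    [~, ~, sumd] = kmeans(X, k, 'Replicates', 5);
    avg_dist(k) = sum(sumd) / size(X, 1);  % mean within-cluster distance
end
plot(1:max_k, avg_dist, '-o')
xlabel('Number of clusters k')
ylabel('Average within-cluster distance')
% Look for the "elbow" where the curve flattens out
```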

One additional consideration is the number of classification variables that will be used. In this example we utilize height and weight, in order to have a 2-dimensional classification (to allow for ease of analysis and plotting). It is conceivable that many more dimensions or classification variables could be used, though the computation time and the improvement in the clustering will need to be assessed.

K-means clustering is an iterative unsupervised learning process that attempts to determine the best separation of observations by minimizing a distance function (in this case the squared Euclidean distance) from each observation to its cluster centroid. In this heuristic method, the first step of k-means clustering is to randomly choose k (in this case 2) arbitrary means. Second, each observation is assigned to one of the two clusters based on its distance to each mean. Third, the mean of each cluster is updated based on its associated observations. Steps 2 and 3 are repeated iteratively until a stopping criterion is met (i.e., no observations change membership, the sum of the distances is minimized, or the maximum number of iterations is reached). As this algorithm is sensitive to the choice of initial means, an optimum solution is not guaranteed, though you can repeat the process several times to see whether a robust solution can be found (in our case, 5 replicates).
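The three steps above can be written out by hand rather than calling kmeans; the sketch below is purely illustrative (variable names are assumptions) and shows one plain-Euclidean iteration loop with a no-movement stopping criterion:

```matlab
X = [height weight];                 % n-by-2 observations
k = 2;
ctrs = X(randperm(size(X,1), k), :); % step 1: pick k random initial means
for iter = 1:100                     % cap iterations as a stopping criterion
    % step 2: assign each observation to its nearest mean
    d = pdist2(X, ctrs);             % n-by-k Euclidean distances
    [~, membership] = min(d, [], 2);
    % step 3: update each mean from its assigned observations
    new_ctrs = ctrs;
    for j = 1:k
        new_ctrs(j, :) = mean(X(membership == j, :), 1);
    end
    if isequal(new_ctrs, ctrs)       % stop when no centroid moves
        break
    end
    ctrs = new_ctrs;
end
```

Because the initial means in step 1 are random, two runs can converge to different solutions, which is exactly why the 'Replicates' option exists in the built-in function.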

So let’s get to the actual k-means clustering.

```matlab
% Initialize your input variables,
% where height and weight are in this case 100x1 vectors
input = [height weight];
num_clusters = 2;
% Specify how many times you want the kmeans function to repeat
num_replicates = 5;
% Run the kmeans function
[membership, ctrs, sumd, pointd] = kmeans(input,...
    num_clusters,'Replicates',num_replicates,...
    'Distance','sqEuclidean');
%{
We also specified the squared Euclidean distance (not necessary,
as it is the default) that we ask the function to minimize,
though you could also use:
'cityblock', 'cosine', 'correlation' or 'Hamming'.

For the output variables:
membership gives us a 100x1 vector with each subject
classified as 1 or 2 depending on the cluster it belongs to
(since in our case k = 2).

ctrs gives the centroid locations,
in this case a 2x2 matrix with one [height, weight] row
per cluster.

sumd returns the within-cluster sums of the distances from
all observations to their corresponding centroid,
in this case a 2x1 vector.

pointd returns each individual observation's distance to
each centroid,
in this case a 100x2 array.
%}
```

If we were to plot the K-means classification of the members, we would obtain the following results:
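One way to produce such a plot (a sketch, reusing the variables from the kmeans call above) is to color the observations by cluster membership and mark the centroids:

```matlab
% Scatter the observations, colored by their assigned cluster
gscatter(input(:,1), input(:,2), membership, 'br', 'ox')
hold on
% Overlay the two cluster centroids
plot(ctrs(:,1), ctrs(:,2), 'k*', 'MarkerSize', 12)
xlabel('Height')
ylabel('Weight')
legend('Cluster 1', 'Cluster 2', 'Centroids')
hold off
```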

As you can see, the observations closest to each cluster centroid are associated with that group, though several misclassifications do occur (you can see the accuracy rate in Part 1). While K-means clustering provides a binary yes/no answer as to whether an observation belongs in a certain group, further analysis through the use of algorithms such as Gaussian Mixture Models will give us the likelihood that an observation belongs to each of the k clusters. We’ll discuss this functionality in the next part!
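To quantify the misclassifications as sensitivity and specificity (as mentioned in the introduction), you can compare the cluster memberships against the true labels. The sketch below assumes a hypothetical 100x1 vector gender with 1 = male and 2 = female; note that kmeans numbers its clusters arbitrarily, so you may first need to flip the cluster labels so that cluster 1 corresponds to males.

```matlab
% Build the confusion counts, treating "male" as the positive class
TP = sum(membership == 1 & gender == 1);  % males correctly clustered
FN = sum(membership == 2 & gender == 1);  % males placed in cluster 2
TN = sum(membership == 2 & gender == 2);  % females correctly clustered
FP = sum(membership == 1 & gender == 2);  % females placed in cluster 1

sensitivity = TP / (TP + FN);
specificity = TN / (TN + FP);
```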

## 14 thoughts on “Clustering Part 2: K-means clustering”

1. In the end, how do you plot K-means classification of the members ? Can you write codes , please ?

2. how to find cluster through distance formula in matlab?