K-Means Clustering

Sometimes you have the input features but no output labels. What do you do in that case? How do you train a model? That is where clustering comes in. In this article, we are going to learn about k-means clustering, an algorithm for unsupervised learning.

K-Means Clustering

In clustering we have unlabeled data, and the algorithm tries to divide it into groups of points that share similar patterns. Let's say we have a dataset as shown in the figure and we want to divide it into two clusters.

We choose two random points, denoted by a red cross and a blue cross, as the initial centroids.

In the first step we mark each data point red if it is closer to the red centroid and blue if it is closer to the blue centroid, as shown.

Now we find new centroids by taking the mean of the red points and the mean of the blue points, and then repeat the assignment step.

After a few iterations we get two well-separated clusters, as shown in the figure, and any new data point can then be assigned to whichever cluster it is closest to.

These are the basic steps of the k-means clustering algorithm. Here it is in pseudocode:

Randomly initialize K cluster centroids \mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n
Repeat {
    for i = 1 to m
        c^{(i)} := index (from 1 to K) of the cluster centroid closest to x^{(i)}
    for k = 1 to K
        \mu_k := average (mean) of points assigned to cluster k
}

Closeness to the centroids is measured by squared Euclidean distance. The quantity k-means minimizes is the distortion cost J = \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - \mu_{c^{(i)}}\rVert^2, the mean squared distance between each point x^{(i)} and the centroid \mu_{c^{(i)}} of the cluster it is assigned to. The assignment step picks, for each point, the centroid minimizing this distance; the update step then moves each centroid to the mean of its points. A small helper to evaluate J is sketched below.
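The following helper is not one of the original functions; it is a minimal sketch of the cost computation, assuming idx holds the index of the closest centroid for each row of X (exactly what findClosestCentroids below produces):

function J = computeCost(X, idx, centroids)
    % Mean squared distance from each point to its assigned centroid.
    % centroids(idx, :) picks out the assigned centroid for every row of X.
    m = size(X, 1);
    J = sum(sum((X - centroids(idx, :)).^2)) / m;
end

Here is MATLAB code for the same algorithm.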

function idx = findClosestCentroids(X, centroids)
    % For every example in X, return the index of the nearest centroid.
    K = size(centroids, 1);
    m = size(X, 1);            % number of examples (rows of X)
    idx = zeros(m, 1);

    for i = 1:m
        minDist = Inf;
        for j = 1:K
            dist = norm(X(i,:) - centroids(j,:))^2;   % squared distance
            if dist < minDist
                minDist = dist;
                idx(i) = j;
            end
        end
    end

end
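As a side note, the double loop above can be vectorized. This variant is a sketch, not from the original post; it relies on MATLAB's implicit expansion (R2016b or newer) and on the identity \lVert x - \mu\rVert^2 = \lVert x\rVert^2 - 2 x \mu^T + \lVert\mu\rVert^2:

function idx = findClosestCentroidsVec(X, centroids)
    % Squared Euclidean distance from every point to every centroid,
    % computed as an m-by-K matrix in one shot.
    D = sum(X.^2, 2) - 2 * X * centroids' + sum(centroids.^2, 2)';
    [~, idx] = min(D, [], 2);   % nearest centroid per row
end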

function centroids = computeCentroids(X, idx, K)
    % Move each centroid to the mean of the points assigned to it.
    [~, n] = size(X);
    centroids = zeros(K, n);
    for k = 1:K
        % Logical indexing selects the rows assigned to cluster k;
        % note that an empty cluster would yield NaN here.
        centroids(k,:) = mean(X(idx == k, :), 1);
    end

end

These two functions are called from the main loop:

function [centroids, idx] = runkMeans(X, initial_centroids, ...
                                      max_iters, plot_progress)
    % Alternate the assignment and update steps for max_iters iterations.
    % (plot_progress is kept in the signature but unused in this version.)
    centroids = initial_centroids;
    K = size(initial_centroids, 1);

    for i = 1:max_iters
        idx = findClosestCentroids(X, centroids);
        centroids = computeCentroids(X, idx, K);
    end

end
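Here is a minimal driver, assuming the functions above are saved on the MATLAB path; the toy data and the choice of 10 iterations are illustrative, not from the original post:

% Two well-separated 2-D blobs.
rng(1);                                   % reproducible randomness
X = [randn(50, 2); randn(50, 2) + 4];
K = 2;
initial_centroids = X(randperm(size(X, 1), K), :);   % K random data points
[centroids, idx] = runkMeans(X, initial_centroids, 10, false);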

Some points to remember:

  • Choose the number of clusters according to the distribution of the dataset. You can try several values of K, compare the resulting cost J, and pick the value beyond which adding clusters stops helping much.
  • To initialize the centroids, a common choice is to pick K distinct data points at random. Random initialization can converge to a poor local optimum, so it helps to run k-means several times and keep the run with the lowest cost J, as in the sketch after this list.
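Here is a sketch of that restart strategy, reusing the computeCost helper from earlier and assuming X and K are defined as in the driver above; the restart count of 10 is arbitrary:

% Keep the run with the lowest distortion J.
bestJ = Inf;
for t = 1:10
    init = X(randperm(size(X, 1), K), :);   % K distinct random data points
    [c, id] = runkMeans(X, init, 10, false);
    J = computeCost(X, id, c);
    if J < bestJ
        bestJ = J;
        centroids = c;
        idx = id;
    end
end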

K-means is a simple but very useful machine learning technique, and its cluster assignments are often used as features in classification tasks. It can also be combined with other techniques, for example Naive Bayes classification, to get better results. We'll discuss the Naive Bayes technique in further posts. For now, we are done with the basics of the k-means clustering algorithm. We'll see its implementation and effectiveness on the Network Intrusion Detection Challenge. Till then, Happy Learning 🙂
