K-means clustering, the methodology

Srijan Bhushan
Sep 5, 2022

What is K-means clustering?

Unsupervised machine learning problems are situations where there is no target, i.e. no defined outcome that we are trying to predict. Clustering is one example. Suppose you have a corpus of data that is not labeled, just records with attributes, and no target such as predicted sales, predicted probability of purchase, or a ranking. In such a situation, what you may want is to cluster the data, i.e. group similar data points together.

K-means clustering is one such methodology. It puts data points into clusters using the following steps.

1. Pick K, the number of clusters. What is the right value for K and how do you pick it? That is covered further down in this article.

2. Now, pick K random points in the data. These serve as the initial centers of the K clusters. You can pick any K random points.

3. Calculate the distance from each remaining point in the data set to each of those K points. Assign each point to the cluster whose starting point is closest to it. Distance can be calculated as Euclidean distance, which is the straight-line distance between the coordinates.

4. Then, in the next iteration, compute the mean of each of the K clusters, calculate the distance of each data point from these means, and assign each point to the cluster whose mean is closest.

5. Repeat step #4 until the clusters stop changing, i.e. no data point is reassigned to a different cluster.

6. You have your clusters ready with each data point assigned.
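The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the names `kmeans`, `squared_distance` and `mean_point`, the `max_iters` and `seed` parameters, and the toy data are all assumptions made for the example. It compares squared distances, which gives the same nearest center as Euclidean distance.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick K distinct random data points as the initial centers.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Steps 3 and 4: assign every point to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 5: stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean_point(members)
    return assignments, centers

# Toy data: two visibly separate groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(data, k=2)
print(labels)
print(centers)
```

Because the starting points are random, which cluster gets label 0 or 1 depends on the seed. On data with less obvious groups, different starting points can also lead to different final clusters, so it is common to run the algorithm several times.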

This is how K-means clustering works. You can use this algorithm to cluster your data together. You can choose K depending on your use case.

How do we know which K to choose? Is there a cost function we can look to minimize?

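You can choose the value of K by testing (grid search) different values and computing the “Within SSE (sum of squared error)” for each, i.e. the sum of squared distances between every data point and the mean of its cluster. Note that this SSE generally keeps falling as K grows, and reaches zero when every point is its own cluster, so simply taking the K with the least SSE would always favor the largest K. Instead, look for the K beyond which the SSE stops dropping sharply (often called the “elbow”).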
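To make the within-cluster SSE concrete, here is a small sketch of how it can be computed for a given assignment of points to clusters. The function name `within_sse` and the toy data are assumptions for illustration, and the cluster labels are written by hand to stand in for what K-means would return at each K.

```python
def within_sse(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for members in clusters.values():
        n = len(members)
        # Cluster mean, computed coordinate by coordinate.
        center = tuple(sum(c) / n for c in zip(*members))
        total += sum(
            sum((a - b) ** 2 for a, b in zip(p, center)) for p in members
        )
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Hand-written assignments standing in for K-means output at each K.
sse_by_k = {
    1: within_sse(points, [0, 0, 0, 0]),
    2: within_sse(points, [0, 0, 1, 1]),
    3: within_sse(points, [0, 0, 1, 2]),
    4: within_sse(points, [0, 1, 2, 3]),
}
print(sse_by_k)  # {1: 104.0, 2: 4.0, 3: 2.0, 4: 0.0}
```

In this toy case the SSE drops steeply from K=1 to K=2 and only slightly after that, which is the kind of bend to look for when comparing values of K.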
