K-Means Clustering: A Primer
Clustering in Machine Learning refers to grouping data points into clusters, or groups, such that the objects in each group share similar characteristics. Given a set of data points, a clustering algorithm can be used to assign each point to a particular group. Ideally, data points in the same group exhibit very similar properties and/or characteristics, while points in different groups differ.
Clustering belongs to the family of unsupervised learning techniques, because there are no ground-truth labels against which to measure the clustering algorithm's accuracy.
Clustering has a myriad of uses in a variety of industries. Some common applications for clustering include the following:
- Image Segmentation
- Recommendation Engine
- Market Segmentation
- Social Network Analysis, etc.
In this post we concentrate on one of these clustering techniques: the K-Means algorithm. The article layout is as follows:
- Introduction
- Controlling factors
- Determining the value of K
- Applications
- Technical interview gotchas for k-means clustering
Introduction
The K-Means algorithm belongs to the centroid-based clustering family, which organizes data into non-hierarchical clusters. For an exhaustive list of clustering algorithms, have a look at A Comprehensive Survey of Clustering Algorithms. Centroid-based algorithms are efficient, but sensitive to initial conditions and outliers.
The K-Means algorithm finds k clusters by choosing k data points at random as initial cluster centers. Each data point is then assigned to the cluster whose center is closest to it. Each cluster center is then replaced by the mean of all the data points assigned to that cluster. This process is iterated until no data point is reassigned to a different cluster.
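To make this loop concrete, here is a minimal NumPy sketch of the algorithm just described (the function name, parameters, and random Forgy-style initialization are our own illustrative choices):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """A minimal K-Means sketch: X is an (n, m) array of n points in m dimensions."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Forgy initialization: pick k distinct data points at random as centers
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point was reassigned: converged
        labels = new_labels
        # Update step: move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

The assignment and update steps alternate until no label changes, which is one form of the convergence criterion discussed below.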
In this algorithm, there are two factors we can control in order to obtain a good clustering:
- Selection of the initial centroids
- The convergence criteria
Controlling Factors
Selection of the initial centroids
- Forgy Initialization: choose any K points from the data at random as the initial centroids.
- Random Partition Method: first, randomly assign each point in the data to a random cluster ID; then, group the points by cluster ID and take the per-cluster average to yield the initial centroids.
- K-Means++: first, choose a random point from the data; then, choose each subsequent center by sampling a point from a probability distribution proportional to its squared distance from the nearest already-chosen center, so that new centers tend to lie far from existing ones; repeat until K centers are chosen (a sketch follows this list).
- Naive Sharding: first, compute a composite value for each data point by summing its attribute values; then, sort the data points by this value and split them horizontally into K equal pieces, or shards; finally, the data points of each shard are independently averaged, and this set of K means is used as the initial centroids.
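As an illustration of the K-Means++ scheme above, here is a rough sketch (function and variable names are ours; in practice, scikit-learn's KMeans applies this internally when init='k-means++'):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ initialization sketch: returns k initial centers."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # First center: a uniformly random data point
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of each point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center with probability proportional to d^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```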
The Convergence Criteria
Determines when the iteration stops. One common criterion is a threshold on centroid movement between iterations, expressed as a proportion of the minimum distance between the initial cluster centers; such a threshold must be greater than 0 but not greater than 1. Other common criteria are stopping when no data point changes cluster, or after a fixed maximum number of iterations.
Determining the value of K
1. Elbow Curve Method
One way to determine the right number of clusters K is the Elbow Curve method. The idea is to run the K-Means algorithm on the data for a range of values of k (say, k from 1 to 10) and, for each value of k, calculate the sum of squared errors (SSE) from every data point to its nearest centroid. This sum is then plotted against k.
The plot shows a sudden fall that then levels off, resembling a human elbow; the value of k at the bend is a good choice, since larger k brings diminishing returns.
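A sketch of the Elbow Curve procedure using scikit-learn (the random stand-in data and parameter choices are ours):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(500, 2)  # stand-in data; use your own points here

sse = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # sum of squared distances to nearest centroid

plt.plot(ks, sse, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Sum of squared errors")
plt.show()
```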
2. Silhouette analysis
In Silhouette analysis, we run the K-Means algorithm on the data for a range of values of k (say, k from 1 to 10) and, for each value of k, compute the mean Silhouette coefficient over all the data points. The Silhouette coefficient is calculated using:
- Intra-cluster distance of point i, a(i): the average distance between i and all the other data points in the cluster to which i belongs
- Inter-cluster distance of point i, b(i): the smallest average distance between i and all the data points of any cluster to which i does not belong (i.e., the average distance to its nearest neighboring cluster)
The Silhouette coefficient for a data point i is:

S(i) = (b(i) - a(i)) / max(a(i), b(i))
The mean Silhouette coefficient is then plotted against k, and the k with the highest mean value indicates the most appropriate clustering.
S(i) lies in the range [-1, 1]: values near 1 mean i is well matched to its own cluster, while values near -1 mean i is more similar to a neighboring cluster than to its own.
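A corresponding sketch using scikit-learn's silhouette_score, which computes the mean S(i) over all points (the stand-in data and parameters are ours):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(500, 2)  # stand-in data; use your own points here

# silhouette_score needs at least 2 clusters, so start k at 2
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # mean S(i) over all points
```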
Applications
Image Compression
The objective of image compression is to reduce irrelevance and redundancy of the image data to be able to store or transmit data in an efficient form.
An image consists of a rectangular array of dots called pixels. Typically, each pixel has 3 channels (Red, Green, Blue), and each channel takes intensity values in the range 0–255, i.e., 8 bits (1 byte). One pixel therefore occupies 3 bytes, and an RGB image of shape W×H occupies 3×W×H bytes. This can be drastically reduced without much loss of visual quality using K-Means: treat each pixel as a point in 3-dimensional color space, cluster the pixels into K clusters, and replace each pixel's color with its cluster's centroid, so the image needs only K distinct colors.
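A sketch of this color-quantization approach (the file names and the choice of K = 16 are illustrative assumptions):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("input.jpg"))  # shape (H, W, 3)
pixels = img.reshape(-1, 3).astype(float)  # one RGB row per pixel

k = 16  # palette size; fewer colors => smaller file
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)

# Replace every pixel by its cluster's centroid color; with only k
# distinct colors left, formats like PNG compress far better
quantized = km.cluster_centers_[km.labels_].reshape(img.shape)
Image.fromarray(quantized.astype(np.uint8)).save("compressed.png")
```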
The Original Image File Size: 219203 bytes
The Compressed Image File Size: 111200 bytes
Total Compression: 108003 bytes
Image Segmentation
Image segmentation is the process of partitioning a digital image into multiple segments. The goal is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.
For a given image, if we can determine the number of objects present, we can cluster the image's pixels based on their values, with K set to that number.
Consider the following example: the image below can be segmented into 4 parts
So let’s check if we can use K means clustering algorithm with K=4 to obtain the segmented image.
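A sketch of this experiment (the input file name is an assumption; here we cluster on pixel color alone):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("scene.jpg"))  # hypothetical input image
pixels = img.reshape(-1, 3).astype(float)  # one RGB row per pixel

# Cluster pixel colors into K=4 groups, one per expected region
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)
segments = km.labels_.reshape(img.shape[:2])  # a cluster ID per pixel

# Visualize: paint every pixel with its cluster's centroid color
seg_img = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)
Image.fromarray(seg_img).save("segmented.png")

# Pixels sharing a label form one segment, e.g. a mask for segment 0:
mask0 = (segments == 0)
```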
As we can see above, the resultant segmented image looks decent enough (without any fancy state-of-the-art CNNs!).
Technical interview gotchas for k-means clustering
Attention, aspiring Data Scientists! Here are some interview questions for you:
1. Does the selection of the initial centroids affect the final resulting clusters?
Hint: In the below example, assuming k=2, what will be the resultant cluster if you pick {B, E} as centroids? Will it be the same as picking {D, F}?
2. What is the time complexity of the randomly initialized K-Means clustering algorithm?
Please attempt it yourself first, before moving on to the solution.
Solution
- Computing distance between two data points is O(m) where m is the dimensionality of the data point.
- Reassigning clusters requires computing the distance from each of the n data points to each of the K centroids, so it is O(Knm).
- Computing new centroids- each data point gets added once to some centroid: O(nm).
- For 1 iteration: O(Knm)+O(nm) = O(Knm)
- Assume these two steps are each done once per iteration, for I iterations: O(IKnm).
Final Notes
In this article, we tried to unhype the core idea of K-Means clustering and its applications. Feel free to comment with your ideas and doubts. You can reach us at unhypedai@gmail.com.