(Edureka) (Code Lab) (ML Spec)

k-means clustering is the quintessential clustering algorithm: it automatically groups unlabeled data into k clusters

Usage

(Google) Typical applications include market segmentation, social network analysis, grouping search results, image segmentation, and anomaly detection

Algorithm

(Intuition) (Algorithm)

  1. Choose k, the number of clusters wanted
  2. Randomly initialize k cluster centroids (each is the center point of a cluster)
  3. Repeat these two steps, which together minimize the cost, until there are few to no changes in EITHER the assignment of points to centroids OR the locations of the cluster centroids
    1. Assign each point to the closest centroid, measured by Euclidean (L2) distance
    2. Recompute the centroid of each newly formed cluster as the average coordinates of all points in that cluster

Choose k

(ML Spec) Choose k based on how the clusters will be used downstream; a common heuristic is the elbow method: plot the distortion J against k and look for a bend where the curve flattens (often there is no clear elbow, so the downstream purpose should drive the choice)
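
A minimal sketch of the elbow method, assuming a dataset X and the run_kmeans / compute_distortion helpers sketched later in these notes:

import matplotlib.pyplot as plt

# run k-means for a range of k values and plot the final distortion for each
ks = range(1, 11)
costs = []
for k in ks:
    centroids, idx = run_kmeans(X, k)                   # sketched in the Code section
    costs.append(compute_distortion(X, idx, centroids))

plt.plot(list(ks), costs, marker="o")  # look for the "elbow" where the curve flattens
plt.xlabel("k (number of clusters)")
plt.ylabel("distortion J")
plt.show()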

Initialize

(ML Spec) Repeat this random initialization multiple times (50 to 1000 runs is typical) and pick the run with the lowest cost/distortion, as shown in the sketch after this list:

  1. Randomly pick K training examples
  2. Set the centroids μ_1, …, μ_K equal to these K examples
  3. Run k-means to convergence and record the final value of the cost function
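
A minimal sketch of this restart loop, assuming a dataset X, a chosen K, and the run_kmeans / compute_distortion helpers sketched later in these notes:

import numpy as np

best_cost = np.inf
best_centroids, best_idx = None, None

for _ in range(100):                                  # 50 to 1000 restarts is typical
    centroids, idx = run_kmeans(X, K)                 # one full k-means run
    cost = compute_distortion(X, idx, centroids)      # distortion J of this run
    if cost < best_cost:                              # keep the best clustering seen so far
        best_cost = cost
        best_centroids, best_idx = centroids, idx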

Minimize cost function

The cost function to be minimized for k-means is called “distortion”:

J(c^(1), …, c^(m), μ_1, …, μ_K) = (1/m) Σ_{i=1..m} ‖x^(i) − μ_{c^(i)}‖²

The cost will either decrease or stay the same after each iteration

Notation

  • c^(i) = index of the cluster (1, 2, …, K) to which example x^(i) is currently assigned
  • μ_k = cluster centroid k
  • μ_{c^(i)} = cluster centroid of the cluster to which example x^(i) has been assigned
  • m = number of training examples
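
A minimal sketch of the distortion computation (not part of the lab code below), assuming X, idx, and centroids shaped as in the Code section:

import numpy as np

def compute_distortion(X, idx, centroids):
    """Distortion J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2"""
    # centroids[idx] pairs each example with its assigned centroid, shape (m, n)
    return np.mean(np.sum((X - centroids[idx]) ** 2, axis=1))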

Assumptions

(Raykov et al., 2016)

  • Clusters are spherical, with data points being modeled as lying within a sphere around their cluster centroid.
  • Data has a clustering tendency, and clusters are well-separated.
  • The number k of groupings is fixed and assumed known in advance
  • Because k-means clusters data points purely by their (Euclidean) geometric closeness to the cluster centroid, it implicitly assumes
    • All clusters have the same radius
    • There are few or no outliers; points unusually far from their cluster centroid compared to the rest of that cluster will dramatically impair the results of k-means
    • Each cluster contains roughly the same number of data points, although clusters might have different densities

Code

(Code Lab)

import numpy as np
 
def init_centroids(X, K):
    """
    This function initializes K centroids that are to be 
    used in K-Means on the dataset X
    
    Args:
        X (ndarray): Data points 
        K (int):     number of centroids/clusters
    
    Returns:
        centroids (ndarray): Initialized centroids
    """
    
    # Randomly reorder the indices of examples
    randidx = np.random.permutation(X.shape[0])
    
    # Take the first K examples as centroids
    centroids = X[randidx[:K]]
    
    return centroids
    
def find_closest_centroids(X, centroids):
    """
    Computes the centroid memberships for every example
    
    Args:
        X (ndarray): (m, n) Input values      
        centroids (ndarray): k centroids
    
    Returns:
        idx (ndarray): (m,) index of the closest centroid for each example
    
    """
    K = centroids.shape[0]
    m = X.shape[0]
    idx = np.zeros(m, dtype=int)
    
    for i in range(m):
        distances = []  # distances from point i to each of the K centroids
        
        for j in range(K):
            
            # Euclidean (L2) distance; np.linalg.norm defaults to the 2-norm for vectors
            norm = np.linalg.norm(X[i] - centroids[j])
            distances.append(norm)
        
        # get the closest centroid for the current data point
        idx[i] = np.argmin(distances) 
    return idx
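
The double loop above is readable but slow for large m; as a sketch (not part of the lab), the same assignment can be computed with NumPy broadcasting:

def find_closest_centroids_vec(X, centroids):
    # (m, 1, n) - (1, K, n) broadcasts to (m, K, n); taking the norm over the
    # feature axis yields an (m, K) matrix of point-to-centroid distances
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)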
 
def compute_centroids(X, idx, K):
    """
    Returns the new centroids by computing the means of the 
    data points assigned to each centroid.
    
    Args:
        X (ndarray):   (m, n) Data points
        idx (ndarray): (m,) Array containing index of closest centroid for each 
                       example in X. Concretely, idx[i] contains the index of 
                       the centroid closest to example i
        K (int):       number of centroids
    
    Returns:
        centroids (ndarray): (K, n) New centroids computed
    """
    
    m, n = X.shape
    centroids = np.zeros((K, n))
  
    for k in range(K):
        
        # get all points assigned to the centroid k
        points = X[idx == k]
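        # note: assumes cluster k keeps at least one point; np.mean of an
        # empty slice returns NaN (one common fix is to re-initialize that centroid)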
        
        # recompute the centroid as the mean coordinates of all data points in that cluster
        centroids[k] = np.mean(points, axis=0)  # column-wise mean: average coordinate per feature
    
    return centroids
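
The lab ties these helpers together in a driver loop; a minimal sketch (the name run_kmeans and the fixed max_iters stopping rule are assumptions, since the full lab code is not reproduced here):

def run_kmeans(X, K, max_iters=10):
    """Run K-Means for max_iters iterations and return the final clustering"""
    centroids = init_centroids(X, K)
    idx = np.zeros(X.shape[0], dtype=int)

    for _ in range(max_iters):
        idx = find_closest_centroids(X, centroids)    # assignment step
        centroids = compute_centroids(X, idx, K)      # update step

    return centroids, idx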