# Basic concepts of K-means Clustering

## k-Means Clustering: A Centroid-Based Technique

**K-means clustering** is one of the simplest and most popular unsupervised machine learning algorithms. The goal of **K-means** is simple: group similar data points together and recognize the underlying patterns. To accomplish this goal, **K-means** looks for a fixed number (*k*) of clusters in a dataset. A cluster refers to a collection of data points aggregated together because they exhibit certain similarities.

The target number *k* is a *hyperparameter* (i.e., you have to set its value yourself before the learning process begins); it is the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of a cluster. The centroid can be defined in various ways, such as the mean or the medoid of the objects (or points) assigned to the cluster (we will deal with K-medoids in upcoming articles). Every data point is assigned to one of the clusters by reducing the in-cluster sum of squares.

*Figure: K-means clustering.*

The **K-means** algorithm identifies *k* centroids, and then assigns every data point to the nearest cluster, while keeping the clusters as small as possible. The *'means'* in **K-means** refers to the average of the data in a cluster; that is, the centroid.
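To make the centroid-as-mean idea concrete, here is a minimal Python sketch that computes the centroid of a cluster of 2-D points as their component-wise average (the sample cluster is made up for illustration):

```python
def centroid(points):
    """Return the component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n)

# Hypothetical cluster of three points.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(centroid(cluster))  # -> (3.0, 4.0)
```

The centroid need not coincide with any actual data point; it is simply the average location of the cluster.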

Suppose a data set, *D*, contains n data points in Euclidean space. Clustering (partitioning) methods distribute the data points in D into k clusters, *C1, …, Ck*; that is, each *Ci* belongs to *D*. An objective function is used to assess the partitioning quality so that data points within a cluster are similar to one another.

The K-means clustering technique uses the mean (i.e., the centroid value) of a cluster, *Ci*, to represent that cluster. The difference between a data point *p* belonging to *Ci* and *ci*, the mean of the cluster, is measured by **dist(p, ci)**, where dist(x, y) gives the Euclidean distance between the two points x and y. The quality of cluster *Ci* can be measured by the within-cluster variation, which is the sum of squared error between all data points in *Ci* and the centroid *ci*, defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2$$

where *E* is the sum of the squared error for all data points in the data set; *p* is the point in space representing a given data point; and *ci* is the mean of cluster *Ci* (both *p* and *ci* are multidimensional).

## How does the k-means algorithm work?

## Algorithm of k-means Clustering:

The k-means algorithm is used for partitioning (clustering), where each cluster’s center is represented by the mean value of the data points in the cluster (group).

We have,

**Input:**

- *k*: the number of clusters,
- *D*: a data set containing n data points.

**Output:**

- A set of *k* clusters.

**Steps:**

(1) randomly select k data points from D as the initial cluster centers;

(2) **start loop**;

(3) (re)assign each data point to the cluster it is most similar to (i.e., the nearest cluster, based on the Euclidean distance to the cluster mean);

(4) update the cluster means, that is, calculate the new mean value of the data points in each cluster;

(5) **repeat the loop** until there is no change in the cluster means.
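The five steps above can be sketched as a minimal from-scratch implementation. This is an illustrative sketch, not production code: the 2-D sample data, the fixed random seed, and the helper names are assumptions made for the example.

```python
import math
import random

def kmeans(D, k, seed=0):
    random.seed(seed)
    centers = random.sample(D, k)                 # (1) random initial centers
    while True:                                   # (2) start loop
        clusters = [[] for _ in range(k)]
        for p in D:                               # (3) assign p to nearest center
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        new_centers = [                           # (4) recompute each cluster mean
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centers[i]
            for i, pts in enumerate(clusters)
        ]
        if new_centers == centers:                # (5) stop when means stop changing
            return centers, clusters
        centers = new_centers

# Hypothetical 2-D data with two well-separated groups.
D = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers, clusters = kmeans(D, k=2)
print(sorted(centers))  # -> [(1.25, 1.5), (8.5, 8.75)]
```

Note that the result of K-means can depend on the random initial centers; running the algorithm several times with different seeds and keeping the partition with the lowest E is a common practice.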