# Machine learning: K-means algorithm principles & Spark implementation

Time：2021-1-27

Data developers who don't understand algorithms are not good algorithm engineers. I still remember that when I was a graduate student, my advisor lectured on some data mining algorithms. I was quite interested in them, but after starting work I had little chance to use them. The disdain chain of data engineering goes model > real-time > offline data warehouse > ETL engineer > BI engineer (no offense intended). I now mainly work on offline data warehouses, and in the early days I did some ETL work. For long-term career development and to broaden my technical boundaries, I need to gradually go deeper into real-time processing and modeling. So this article is a flag planted: study the real-time and model parts in depth.

To change yourself, start by improving what you are not good at.

# 1. K-means algorithm introduction

K-means is an unsupervised clustering algorithm. It is easy to implement and clusters well, so it is widely used.

• K-means, also written k-means, is generally the first clustering algorithm to learn.
• Here K is a constant that must be set in advance. Roughly speaking, the algorithm iteratively aggregates m unlabeled samples into K clusters.
• When clustering samples, the distance between samples is usually the index used to divide them.

At its core, K-means is an iterative cluster analysis algorithm. Its steps are: randomly select K objects as the initial cluster centers; compute the distance between each object and each cluster center; and assign each object to the nearest center. A center together with the objects assigned to it represents one cluster. After each round of assignment, every cluster center is recalculated from the objects currently in that cluster. This process repeats until a termination condition is met: no (or a minimal number of) objects are reassigned to different clusters, no (or a minimal number of) cluster centers change, or the sum of squared errors reaches a local minimum.

# 2. K-means algorithm flow

#### 2.1 choose K, the number of clusters

#### 2.2 randomly select K samples as the initial cluster centers

#### 2.3 compute the distance from each sample to every center and assign the sample to the nearest center

#### 2.4 recalculate each center as the mean of the samples assigned to it

#### 2.5 use the new center points to start the next cycle (go back to step 2.3)

Conditions for exiting the cycle:

1. A specified number of iterations has been reached.

2. All the center points have almost stopped moving (that is, the total distance moved by the center points is less than a given constant, such as 0.00001).
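The flow above (random initialization, assignment, update, repeat until the centers stop moving) can be sketched as a minimal single-machine version in plain Scala. This is an illustrative sketch, not the article's Spark implementation; names such as `KMeansSketch` are invented here:

```scala
import scala.util.Random

object KMeansSketch {
  type Point = Seq[Double]

  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // One full k-means run: random init, then assign/update until the centers stabilize
  def kmeans(points: Seq[Point], k: Int, maxIter: Int = 100, tol: Double = 1e-6): Seq[Point] = {
    var centers: Seq[Point] = Random.shuffle(points).take(k) // step 2.2
    var moved = Double.MaxValue
    var iter = 0
    while (iter < maxIter && moved > tol) {   // exit conditions 1 and 2
      // step 2.3: each point goes to its nearest center
      val clusters = points.groupBy(p => centers.minBy(c => distance(c, p)))
      // step 2.4: each center becomes the mean of its assigned points
      val newCenters = centers.map { c =>
        val members = clusters.getOrElse(c, Seq(c)) // a center with no points stays put
        members.transpose.map(dim => dim.sum / dim.size)
      }
      moved = centers.zip(newCenters).map { case (a, b) => distance(a, b) }.sum
      centers = newCenters
      iter += 1
    }
    centers
  }
}
```

With two well-separated groups of points and K = 2, the returned centers end up near the two group means after a few iterations.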

**Selection of the K value** The value of K is very important to the final result, but it must be given in advance. Choosing an appropriate K requires prior knowledge; it is hard to estimate out of thin air, and a poor choice may lead to poor results.

**Sensitivity to outliers** In each iteration, K-means uses the mean of all points in a cluster as the new centroid (center point). If the cluster contains outliers, the mean deviates badly. For example, if a cluster contains the five values 2, 4, 6, 8 and 100, the new centroid is 24, which is obviously far from most of the points. In such cases, using the median 6 may work better than using the mean; clustering with the median instead of the mean is known as k-medians clustering (the related k-medoids method uses an actual sample point as the center).
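The 2, 4, 6, 8, 100 example can be checked directly:

```scala
// the cluster with an outlier, as in the text
val cluster = Seq(2.0, 4.0, 6.0, 8.0, 100.0)

// mean: dragged toward the outlier 100
val mean = cluster.sum / cluster.size

// median: stays with the bulk of the points
val median = cluster.sorted.apply(cluster.size / 2)

println(s"mean = $mean, median = $median") // mean = 24.0, median = 6.0
```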

**Sensitivity to initial values** K-means is sensitive to the initial centers; different initial values may lead to different clustering results. To avoid anomalies caused by this sensitivity, we can initialize several sets of initial centers, build a clustering from each, and then keep the best one. A number of derivative algorithms address this problem: bisecting k-means, k-means++, k-means||, the canopy algorithm, and so on.
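As an illustration of the initialization idea behind k-means++, here is a minimal seeding sketch in plain Scala (the object name `KMeansPlusPlus` is invented here): the first center is drawn uniformly at random, and each following center is drawn with probability proportional to the squared distance to the nearest center already chosen, which spreads the initial centers out.

```scala
import scala.util.Random

object KMeansPlusPlus {
  type Point = Seq[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // k-means++ style seeding: the first center uniformly at random, each
  // following one with probability proportional to D(x)^2, the squared
  // distance to the nearest center chosen so far.
  def seed(points: Seq[Point], k: Int, rnd: Random = new Random()): Seq[Point] = {
    val centers = scala.collection.mutable.ArrayBuffer(points(rnd.nextInt(points.size)))
    while (centers.size < k) {
      val d2 = points.map(p => centers.map(c => sqDist(c, p)).min)
      // sample an index with probability proportional to d2
      var r = rnd.nextDouble() * d2.sum
      var idx = 0
      while (idx < d2.size - 1 && r > d2(idx)) { r -= d2(idx); idx += 1 }
      centers += points(idx)
    }
    centers.toSeq
  }
}
```

Note that a point already chosen as a center has squared distance 0 and so is (almost surely) never drawn again.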

K-means is one of the most commonly used clustering algorithms because it is simple to implement, portable, and scalable.

# 4. Implementation of the K-means algorithm in Spark


The iris data set contains 150 records in three classes; each class contains 50 records, and each record has four features: sepal length, sepal width, petal length, and petal width.

Using these four features, we cluster the flowers, assume K is 3, and compare the result with the actual labels.
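For reference, a record in the common comma-separated iris format looks like `5.1,3.5,1.4,0.2,Iris-setosa`; the numeric filter in the implementation keeps the four features and drops the class label. A quick sketch of that parsing step:

```scala
val line = "5.1,3.5,1.4,0.2,Iris-setosa"

// keep only the fields that parse as doubles; the class label is dropped
val features: Seq[Double] =
  line.split(",").toSeq.flatMap(s => scala.util.Try(s.toDouble).toOption)

println(features.mkString(",")) // 5.1,3.5,1.4,0.2
```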

## 4.2 implementation

Instead of using the MLlib library, a native Scala implementation is used:

```scala
package com.hoult.work

import org.apache.commons.lang3.math.NumberUtils
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

import scala.collection.mutable.ListBuffer
import scala.math.{pow, sqrt}
import scala.util.Random

object KmeansDemo {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName(this.getClass.getCanonicalName)
      .getOrCreate()

    // The original text omits the input path; "data/iris.csv" is a placeholder.
    // Keep only the fields that parse as numbers (the class label is dropped).
    val dataset: RDD[Seq[Double]] = spark.read.textFile("data/iris.csv")
      .rdd.map(_.split(",").filter(NumberUtils.isNumber _).map(_.toDouble))
      .filter(_.nonEmpty).map(_.toSeq)

    val res: RDD[(Seq[Double], Int)] = train(dataset, 3)

    res.sample(false, 0.1, 1234L)
      .map(tp => (tp._1.mkString(","), tp._2))
      .foreach(println)
  }

  //Train k-means. The parameters are the data set, K, the maximum number of
  //iterations, and the change threshold of the cost function.
  //maxIter and tol have default values and can be changed as needed.
  def train(data: RDD[Seq[Double]], k: Int, maxIter: Int = 40, tol: Double = 1e-4): RDD[(Seq[Double], Int)] = {

    val sc: SparkContext = data.sparkContext

    var i = 0               // iteration counter
    var cost = 0d           // cost function of the previous iteration
    var convergence = false // convergence flag: the change of the cost function is less than tol

    //Step 1: randomly select k initial cluster centers
    var initk: Array[(Seq[Double], Int)] = data.takeSample(false, k, Random.nextLong()).zip(Range(0, k))

    var res: RDD[(Seq[Double], Int)] = null

    while (i < maxIter && !convergence) {

      //Broadcast the current centers to the executors
      val bcCenters = sc.broadcast(initk)
      val centers: Array[(Seq[Double], Int)] = bcCenters.value

      //Step 2: assign every sample to its nearest cluster center
      val clustered: RDD[(Int, (Double, Seq[Double], Int))] = data.mapPartitions(points => {

        val listBuffer = new ListBuffer[(Int, (Double, Seq[Double], Int))]()

        //Calculate the distance from each sample point to each cluster center
        points.foreach { point =>

          //The cluster ID -> (distance to the center, the sample point, count 1)
          val cost: (Int, (Double, Seq[Double], Int)) = centers.map(ct => {
            ct._2 -> (getDistance(ct._1.toArray, point.toArray), point, 1)
          }).minBy(_._2._1) // assign the sample to the nearest cluster center
          listBuffer.append(cost)
        }

        listBuffer.toIterator
      })

      //Step 3: per cluster, accumulate costs, coordinates and counts, then average
      val mpartition: Array[(Int, (Double, Seq[Double]))] = clustered
        .reduceByKey((a, b) => {
          val cost = a._1 + b._1                                   // accumulated cost function
          val count = a._3 + b._3                                  // accumulated sample count of the cluster
          val newCenters = a._2.zip(b._2).map(tp => tp._1 + tp._2) // coordinate-wise sum
          (cost, newCenters, count)
        })
        .map {
          case (clusterId, (costs, point, count)) =>
            clusterId -> (costs, point.map(_ / count)) // new cluster center = mean
        }
        .collect()

      val newCost = mpartition.map(_._2._1).sum     // cost function of this iteration
      convergence = math.abs(newCost - cost) <= tol // converged if the change of the cost function is below tol
      //Carry the new cost function to the next iteration
      cost = newCost
      //The new centers become the initial centers of the next iteration
      initk = mpartition.map(tp => (tp._2._2, tp._1))
      //The clustering result: each sample point and the ID of its cluster
      res = clustered.map(tp => (tp._2._2, tp._1))
      i += 1
    }
    //Return the clustering result
    res
  }

  //Euclidean distance between two points
  def getDistance(x: Array[Double], y: Array[Double]): Double = {
    sqrt(x.zip(y).map(z => pow(z._1 - z._2, 2)).sum)
  }
}
```

Result: the job prints a 10% sample of the clustered points together with their assigned cluster IDs.