# Analysis of k-means in Spark

Time: 2021-04-23

This post analyzes the k-means example code that ships with Spark, which is a little more involved than the simpler examples.

```python
import sys

import numpy as np
from pyspark import SparkContext

# Convert one line of the input text into a numpy array of floats
def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])

# Find which cluster the point p belongs to and return the index of
# the closest center
def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex

if __name__ == "__main__":

    if len(sys.argv) != 4:
        print("Usage: kmeans <file> <k> <convergeDist>", file=sys.stderr)
        exit(-1)

    print("""WARN: This is a naive implementation of KMeans Clustering and is given
as an example! Please refer to examples/src/main/python/mllib/kmeans.py for an example on
how to use MLlib's KMeans implementation.""", file=sys.stderr)

    sc = SparkContext(appName="PythonKMeans")
    lines = sc.textFile(sys.argv[1])
    # Call the RDD map function to convert every line into a float vector
    data = lines.map(parseVector).cache()
    # K is the number of cluster centers
    K = int(sys.argv[2])
    # Stop iterating once the centers move less than this threshold
    # between two iterations
    convergeDist = float(sys.argv[3])
    # Pick K initial centers by sampling from the point set
    kPoints = data.takeSample(False, K, 1)
    # Total distance the centers moved after the last adjustment
    tempDist = 1.0

    # Keep iterating while the distance moved exceeds the threshold
    while tempDist > convergeDist:
        # Map every point to (centerIndex, (point, 1))
        closest = data.map(
            lambda p: (closestPoint(p, kPoints), (p, 1)))
        # Reduce by key to prepare the new centers: for each center index,
        # sum the points and the counts, yielding an RDD of
        # (centerIndex, (pointSum, count))
        pointStats = closest.reduceByKey(
            lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
        # Generate the new centers: each is pointSum / count, i.e. the
        # mean of its cluster
        newPoints = pointStats.map(
            lambda st: (st[0], st[1][0] / st[1][1])).collect()
        # Total squared distance between the old and new centers
        tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)

        # Install the new centers
        for (iK, p) in newPoints:
            kPoints[iK] = p

    print("Final centers: " + str(kPoints))

    sc.stop()
```
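To make the `map` → `reduceByKey` → `map` pipeline concrete, here is a plain-Python sketch of one iteration on made-up sample data (the points and centers below are assumptions for illustration, not from the Spark example), reproducing the same (sum, count) aggregation without a cluster:

```python
import numpy as np

# Made-up sample data: two obvious clusters and one center near each
points = [np.array([0.0, 0.0]), np.array([1.0, 1.0]),
          np.array([9.0, 9.0]), np.array([10.0, 10.0])]
kPoints = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]

def closestPoint(p, centers):
    # Same logic as the Spark example: index of the nearest center
    return min(range(len(centers)),
               key=lambda i: np.sum((p - centers[i]) ** 2))

# map step: every point becomes (centerIndex, (point, 1))
mapped = [(closestPoint(p, kPoints), (p, 1)) for p in points]

# reduceByKey step: per center index, sum the points and the counts
stats = {}
for idx, (p, c) in mapped:
    s, n = stats.get(idx, (np.zeros_like(p), 0))
    stats[idx] = (s + p, n + c)

# final map step: new center = pointSum / count (the cluster mean)
newPoints = [(idx, s / n) for idx, (s, n) in stats.items()]
print(newPoints)  # each new center is the mean of its cluster
```

Running this moves each center to the mean of its two assigned points, which is exactly what one pass of the `while` loop does at scale.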

That is the whole process. Using numpy makes the vector arithmetic much easier.
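As an aside, numpy broadcasting can even replace the per-point loop inside `closestPoint` entirely. This is a hypothetical vectorized sketch on made-up data, not part of the Spark example:

```python
import numpy as np

# Made-up data: 4 points and 2 centers, stacked as matrices
data = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# Broadcasting: (n_points, 1, dim) - (1, n_centers, dim)
# -> squared distances of shape (n_points, n_centers)
dists = np.sum((data[:, None, :] - centers[None, :, :]) ** 2, axis=2)

# Nearest-center index for every point at once, no Python loop
assignments = np.argmin(dists, axis=1)
print(assignments)  # -> [0 0 1 1]
```

Within a single `mapPartitions` call this style can cut per-point Python overhead, at the cost of materializing the full point-by-center distance matrix.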
