Time: 2021-10-12

# Origin

I recently read "My First Algorithm Book" ([Japan] Ishida Baohui; Miyazaki Xiuyi).
This series of notes works through its algorithms as Go exercises.

# K-means clustering algorithm

``````
Clustering is the operation of taking multiple input data items
and grouping "similar" items together.

K-means is one such clustering algorithm.
First, k points are selected at random as the cluster centers.
Then two steps are repeated,
"assign each data item to the cluster of its nearest center" and
"move each center to the center of gravity of its cluster",
until the centers no longer change.

In the k-means algorithm, as these operations are repeated,
the positions of the centers are guaranteed to converge somewhere.
This has been proved mathematically.

From "My First Algorithm Book", [Japan] Ishida Baohui; Miyazaki Xiuyi
``````
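For comparison with the description quoted above, here is a minimal sketch of the textbook procedure in Go, in which each center moves to the arithmetic mean of its cluster. The `point` type, the function names and the fixed initial centers (passed in instead of chosen at random, so the run is repeatable) are assumptions made for this illustration, not code from the book.

```go
package main

import "fmt"

// point is a hypothetical 2D sample used only in this sketch.
type point struct{ x, y float64 }

// sqDist returns the squared Euclidean distance between a and b.
func sqDist(a, b point) float64 {
	dx, dy := a.x-b.x, a.y-b.y
	return dx*dx + dy*dy
}

// nearest returns the index of the center closest to p.
func nearest(p point, centers []point) int {
	best := 0
	for i := range centers {
		if sqDist(p, centers[i]) < sqDist(p, centers[best]) {
			best = i
		}
	}
	return best
}

// kMeans repeats the assign and update steps until no center moves.
// The initial centers are an input here (an assumption of this sketch).
func kMeans(samples, initial []point) []point {
	if len(initial) == 0 {
		return nil
	}
	centers := append([]point(nil), initial...)
	for {
		sums := make([]point, len(centers))
		counts := make([]int, len(centers))
		for _, p := range samples {
			i := nearest(p, centers)
			sums[i].x += p.x
			sums[i].y += p.y
			counts[i]++
		}
		moved := false
		for i := range centers {
			if counts[i] == 0 {
				continue // an empty cluster keeps its old center
			}
			n := float64(counts[i])
			c := point{sums[i].x / n, sums[i].y / n}
			if c != centers[i] {
				centers[i], moved = c, true
			}
		}
		if !moved {
			return centers
		}
	}
}

func main() {
	samples := []point{{1, 1}, {1, 3}, {9, 9}, {9, 11}}
	fmt.Println(kMeans(samples, []point{{1, 1}, {9, 9}})) // prints [{1 2} {9 10}]
}
```

The implementation later in this note differs in one respect: it picks an existing sample as the center instead of computing a mean.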

# Scenario

• COVID-19 has suddenly broken out somewhere, and the possible source of infection urgently needs to be located from the distribution of cases.
• First, the coordinates of the cases are entered into the system.
• Then the k-means algorithm clusters them for each K from 1 to 3.
• The cluster centers may be the sources of infection.

# Process

1. Given a set of samples and a sample distance calculator, find K center points.
2. First, randomly select k points from the samples as the initial center points.
3. Loop over every sample:
   1. Calculate the distance from the sample to each of the k center points.
   2. Find the center point with the smallest distance to the sample.
   3. Assign the sample to the cluster of that center point.
4. Calculate the center of each cluster as its new center point:
   1. Loop over every sample in the cluster.
   2. Calculate the sum of the distances from this sample to the other samples in the cluster.
   3. The sample with the smallest distance sum is the new center point.
5. Repeat steps 3-4 until the center points no longer change; the calculation is then complete.
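Step 4 above can be sketched in isolation. Note that a center chosen this way is always one of the samples (a medoid), not the arithmetic mean of the cluster. The `pt` type and the `medoid` function are hypothetical names used only here, and the squared distance is assumed as the distance measure.

```go
package main

import "fmt"

// pt is a hypothetical 2D sample with integer coordinates.
type pt struct{ x, y int }

// sqDist returns the squared straight-line distance between a and b.
func sqDist(a, b pt) int {
	dx, dy := a.x-b.x, a.y-b.y
	return dx*dx + dy*dy
}

// medoid returns the member of cluster whose summed distance to all
// members is smallest. The bool is false when the cluster is empty.
// On a tie the earliest member wins.
func medoid(cluster []pt) (pt, bool) {
	if len(cluster) == 0 {
		return pt{}, false
	}
	best, bestSum := 0, -1
	for i, a := range cluster {
		sum := 0
		for _, b := range cluster {
			sum += sqDist(a, b)
		}
		if bestSum < 0 || sum < bestSum {
			best, bestSum = i, sum
		}
	}
	return cluster[best], true
}

func main() {
	cluster := []pt{{1, 1}, {2, 2}, {3, 3}, {10, 10}}
	c, _ := medoid(cluster)
	fmt.Println(c) // prints {3 3}; the mean (4,4) is not a sample
}
```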

# Design

• IPoint: the sample point interface; it only requires a String() method (fmt.Stringer)
• IDistanceCalculator: the distance calculator interface
• IClassifier: the classifier interface, which clusters the samples into k groups and returns the k center points
• tPerson: a case sample point; implements the IPoint interface and holds X and Y coordinates
• tPersonDistanceCalculator: the case distance calculator, which computes the squared straight-line distance between two points from their X and Y coordinates
• tKMeansClassifier: the k-means clusterer, which implements the IClassifier interface

# Unit test

k_means_test.go

``````
package others

import (
	km "learning/gooop/others/k_means"
	"strings"
	"testing"
)

func Test_KMeans(t *testing.T) {
	// Create sample points
	samples := []km.IPoint{
		km.NewPerson(2, 11),
		km.NewPerson(2, 8),
		km.NewPerson(2, 6),

		km.NewPerson(3, 12),
		km.NewPerson(3, 10),

		km.NewPerson(4, 7),
		km.NewPerson(4, 3),

		km.NewPerson(5, 11),
		km.NewPerson(5, 9),
		km.NewPerson(5, 2),

		km.NewPerson(7, 9),
		km.NewPerson(7, 6),
		km.NewPerson(7, 3),

		km.NewPerson(8, 12),

		km.NewPerson(9, 3),
		km.NewPerson(9, 5),
		km.NewPerson(9, 10),

		km.NewPerson(10, 3),
		km.NewPerson(10, 6),
		km.NewPerson(10, 12),

		km.NewPerson(11, 9),
	}

	fnPoints2String := func(points []km.IPoint) string {
		items := make([]string, len(points))
		for i, it := range points {
			items[i] = it.String()
		}
		return strings.Join(items, " ")
	}

	for k := 1; k <= 3; k++ {
		centers := km.KMeansClassifier.Classify(samples, km.PersonDistanceCalculator, k)
		t.Log(fnPoints2String(centers))
	}
}
``````

# Test output

``````
$ go test -v k_means_test.go
=== RUN   Test_KMeans
    k_means_test.go:53: p(7,6)
    k_means_test.go:53: p(5,9) p(7,3)
    k_means_test.go:53: p(9,10) p(3,10) p(7,3)
--- PASS: Test_KMeans (0.00s)
PASS
ok      command-line-arguments  0.002s
``````

# IPoint.go

The sample point interface. It embeds fmt.Stringer, so a sample only needs a String() method.

``````
package km

import "fmt"

// IPoint is a sample point; it only has to be printable
type IPoint interface {
	fmt.Stringer
}
``````

# IDistanceCalculator.go

Distance calculator interface

``````
package km

type IDistanceCalculator interface {
	// Calc returns the distance between two sample points
	Calc(a, b IPoint) int
}
``````

# IClassifier.go

The classifier interface, which clusters the samples into k groups and returns the k center points

``````
package km

type IClassifier interface {
	// Classify clusters the samples into k groups and returns the k center points
	Classify(samples []IPoint, calc IDistanceCalculator, k int) []IPoint
}
``````

# tPerson.go

A case sample point, which implements the IPoint interface and holds X and Y coordinates

``````
package km

import "fmt"

// tPerson is one case, located by its x and y coordinates
type tPerson struct {
	x int
	y int
}

func NewPerson(x, y int) IPoint {
	return &tPerson{x, y}
}

func (me *tPerson) String() string {
	return fmt.Sprintf("p(%v,%v)", me.x, me.y)
}
``````

# tPersonDistanceCalculator.go

The case distance calculator computes the squared straight-line distance between two points from their X and Y coordinates

``````
package km

type tPersonDistanceCalculator struct {
}

// gMaxInt is the largest int on the current platform (32-bit or 64-bit)
const gMaxInt = int(^uint(0) >> 1)

func newPersonDistanceCalculator() IDistanceCalculator {
	return &tPersonDistanceCalculator{}
}

// Calc returns the squared distance between two persons,
// or gMaxInt if either point is not a *tPerson
func (me *tPersonDistanceCalculator) Calc(a, b IPoint) int {
	if a == b {
		return 0
	}

	p1, ok := a.(*tPerson)
	if !ok {
		return gMaxInt
	}

	p2, ok := b.(*tPerson)
	if !ok {
		return gMaxInt
	}

	dx := p1.x - p2.x
	dy := p1.y - p2.y

	// A negative result can only come from integer overflow
	d := dx*dx + dy*dy
	if d < 0 {
		panic(d)
	}
	return d
}

var PersonDistanceCalculator = newPersonDistanceCalculator()
``````

# tKMeansClassifier.go

The k-means clusterer, which implements the IClassifier interface

``````
package km

import (
	"math/rand"
	"time"
)

type tKMeansClassifier struct {
}

// tPointEntry pairs a point with a distance and its index in the source slice
type tPointEntry struct {
	point    IPoint
	distance int
	index    int
}

func newPointEntry(p IPoint, d int, i int) *tPointEntry {
	return &tPointEntry{
		p, d, i,
	}
}

func newKMeansClassifier() IClassifier {
	return &tKMeansClassifier{}
}

// Classify clusters the samples into k groups and returns the k center points
func (me *tKMeansClassifier) Classify(samples []IPoint, calc IDistanceCalculator, k int) []IPoint {
	// Guard: with k <= 0 there are no centers to assign samples to
	if k <= 0 {
		return nil
	}

	sampleCount := len(samples)
	if sampleCount <= k {
		return samples
	}

	// Initialization: randomly select k distinct samples as the center points
	rnd := rand.New(rand.NewSource(time.Now().UnixNano()))
	centers := make([]IPoint, k)
	for selected, i := make(map[int]bool, k), 0; i < k; {
		n := rnd.Intn(sampleCount)
		if !selected[n] {
			selected[n] = true
			centers[i] = samples[n]
			i++
		}
	}

	// Repeat "assign samples" and "recompute centers" until the centers stop changing
	for {
		groups := me.split(samples, centers, calc)

		newCenters := make([]IPoint, k)
		for i, g := range groups {
			if len(g) == 0 {
				// An empty cluster (possible when samples coincide) keeps its old center
				newCenters[i] = centers[i]
				continue
			}
			newCenters[i] = me.centerOf(g, calc)
		}

		if me.groupEquals(centers, newCenters) {
			return centers
		}
		centers = newCenters
	}
}

// split assigns each sample to the cluster of its nearest center point
func (me *tKMeansClassifier) split(samples []IPoint, centers []IPoint, calc IDistanceCalculator) [][]IPoint {
	k := len(centers)
	result := make([][]IPoint, k)
	for i := 0; i < k; i++ {
		result[i] = make([]IPoint, 0)
	}

	entries := make([]*tPointEntry, k)
	for i, c := range centers {
		entries[i] = newPointEntry(c, 0, i)
	}

	for _, p := range samples {
		for _, e := range entries {
			e.distance = calc.Calc(p, e.point)
		}

		center := me.min(entries)
		result[center.index] = append(result[center.index], p)
	}

	return result
}

// centerOf returns the center of gravity of a cluster, defined here as
// the sample with the smallest sum of distances to all samples in the cluster
func (me *tKMeansClassifier) centerOf(samples []IPoint, calc IDistanceCalculator) IPoint {
	entries := make([]*tPointEntry, len(samples))
	for i, src := range samples {
		distance := 0
		for _, it := range samples {
			distance += calc.Calc(src, it)
		}
		entries[i] = newPointEntry(src, distance, i)
	}

	return me.min(entries).point
}

// groupEquals reports whether the two slices hold the same points in the same order
func (me *tKMeansClassifier) groupEquals(g1, g2 []IPoint) bool {
	if len(g1) != len(g2) {
		return false
	}

	for i, v := range g1 {
		if g2[i] != v {
			return false
		}
	}

	return true
}

// min returns the entry with the smallest distance (the first one on a tie)
func (me *tKMeansClassifier) min(entries []*tPointEntry) *tPointEntry {
	minI := 0
	minD := gMaxInt
	for i, it := range entries {
		if it.distance < minD {
			minI = i
			minD = it.distance
		}
	}

	return entries[minI]
}

var KMeansClassifier = newKMeansClassifier()
``````
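Because the classifier seeds its random generator from the current time, the initial centers, and therefore possibly the final centers, can differ from run to run. One way to make a run repeatable is to pick the initial indices from a generator with a fixed seed. The sketch below is an assumption for illustration only: `pickIndices` and its `seed` parameter are hypothetical and not part of the code above.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickIndices returns k distinct indices in [0, n), chosen with a
// generator seeded by seed, so equal arguments give equal results.
// If k exceeds n, all n indices are returned in shuffled order.
func pickIndices(n, k int, seed int64) []int {
	if k <= 0 || n <= 0 {
		return nil
	}
	if k > n {
		k = n
	}
	rnd := rand.New(rand.NewSource(seed))
	// Perm returns a random permutation of 0..n-1; its first k
	// entries are k distinct indices.
	return rnd.Perm(n)[:k]
}

func main() {
	a := pickIndices(21, 3, 42)
	b := pickIndices(21, 3, 42)
	fmt.Println(a, b) // both slices are identical because the seed is the same
}
```

Compared with drawing indices one at a time and retrying on duplicates, taking a prefix of a permutation needs no retry loop, at the cost of shuffling all n indices.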

(end)
