Basic data structures and algorithms: the k-means clustering algorithm

Time:2021-10-12

Background

I recently read "My First Algorithm Book" ([Japan] Ishida Baohui; Miyazaki Xiuyi).
This series of notes works through its topics as Golang exercises.

K-means clustering algorithm

Clustering is the operation of grouping "similar" items together
when multiple data items are given as input.

The k-means algorithm is one such clustering algorithm.
First, k points are chosen at random as the center points of the clusters.
Then the two steps "assign each data item to the cluster of its nearest center point" and
"move each center point to the center of gravity of its cluster" are repeated
until the center points no longer change.

In the k-means algorithm, as these steps are repeated,
the positions of the center points are guaranteed to converge somewhere;
this has been proved mathematically.

From "My First Algorithm Book" ([Japan] Ishida Baohui; Miyazaki Xiuyi)
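The loop described in the quotation can be sketched in Go as follows. This is only a minimal illustration of the textbook procedure, in which each center moves to the arithmetic mean of its cluster; it is not the implementation developed later in this note, which picks an existing sample as the center. The names point, sqDist, nearest and kMeans are hypothetical, and the initial centers are passed in rather than chosen at random so that the run is reproducible.

```go
package main

import "fmt"

// point is a hypothetical 2D sample used only for this sketch.
type point struct{ x, y float64 }

// sqDist returns the squared Euclidean distance between a and b.
func sqDist(a, b point) float64 {
	dx, dy := a.x-b.x, a.y-b.y
	return dx*dx + dy*dy
}

// nearest returns the index of the center closest to p.
func nearest(p point, centers []point) int {
	best := 0
	for i := 1; i < len(centers); i++ {
		if sqDist(p, centers[i]) < sqDist(p, centers[best]) {
			best = i
		}
	}
	return best
}

// kMeans repeats "assign" and "move" until no center changes.
func kMeans(samples, centers []point) []point {
	for {
		sums := make([]point, len(centers))
		counts := make([]int, len(centers))
		for _, p := range samples {
			i := nearest(p, centers)
			sums[i].x += p.x
			sums[i].y += p.y
			counts[i]++
		}
		next := make([]point, len(centers))
		changed := false
		for i := range centers {
			next[i] = centers[i] // an empty cluster keeps its old center
			if counts[i] > 0 {
				n := float64(counts[i])
				next[i] = point{sums[i].x / n, sums[i].y / n}
			}
			if next[i] != centers[i] {
				changed = true
			}
		}
		if !changed {
			return centers
		}
		centers = next
	}
}

func main() {
	samples := []point{{1, 1}, {1, 3}, {9, 1}, {9, 3}}
	fmt.Println(kMeans(samples, []point{{1, 1}, {9, 1}})) // [{1 2} {9 2}]
}
```

With these four samples the centers move from (1,1) and (9,1) to (1,2) and (9,2) in the first pass, and the second pass changes nothing, so the loop stops.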

Scenario

  • COVID-19 suddenly breaks out somewhere, and the possible source of infection must be located quickly from the distribution of cases.
  • First, enter the coordinates of the cases into the system.
  • Then cluster the cases with the k-means algorithm, for k from 1 to 3.
  • The center points of the clusters are candidate locations for the source of infection.

Procedure

  1. Given a set of samples and a sample distance calculator, find k center points
  2. First, pick k samples at random as the initial center points
  3. Loop over every sample

    1. Compute the distance from the sample to each of the k center points
    2. Find the center point nearest to the sample
    3. Assign the sample to that center point's cluster
  4. Compute a new center point for each cluster

    1. Loop over every sample in the cluster
    2. Compute the sum of the distances from this sample to the other samples in the cluster
    3. The sample with the smallest distance sum becomes the new center point
  5. Repeat steps 3-4 until the center points no longer change; the calculation is then complete
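Note that step 4 chooses an existing sample as the new center, not the arithmetic mean of the cluster. The sketch below illustrates that selection on a three-point cluster. The names pt, sqDist and medoid are hypothetical, and squared distances are assumed, matching the distance calculator shown later in this note.

```go
package main

import "fmt"

// pt is a hypothetical integer sample used only for this sketch.
type pt struct{ x, y int }

// sqDist returns the squared straight-line distance between a and b.
func sqDist(a, b pt) int {
	dx, dy := a.x-b.x, a.y-b.y
	return dx*dx + dy*dy
}

// medoid returns the sample whose summed distance to the rest of the
// cluster is smallest (step 4). The cluster must not be empty.
func medoid(cluster []pt) pt {
	best, bestSum := cluster[0], -1
	for _, c := range cluster {
		sum := 0
		for _, o := range cluster {
			sum += sqDist(c, o)
		}
		if bestSum < 0 || sum < bestSum {
			best, bestSum = c, sum
		}
	}
	return best
}

func main() {
	cluster := []pt{{0, 0}, {1, 0}, {5, 0}}
	// Distance sums: (0,0) -> 26, (1,0) -> 17, (5,0) -> 41.
	fmt.Println(medoid(cluster)) // {1 0}
}
```

Here the chosen center is (1,0), whereas the arithmetic mean of the three points would be (2,0), which is not one of the samples.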

Design

  • IPoint: sample point interface; it declares no methods of its own and only embeds fmt.Stringer
  • IDistanceCalculator: distance calculator interface
  • IClassifier: classifier interface; clusters the samples into k groups and returns the k center points
  • tPerson: case sample point; implements the IPoint interface and holds X and Y coordinates
  • tPersonDistanceCalculator: case distance calculator; computes the squared straight-line distance between two points from their X and Y coordinates
  • tKMeansClassifier: k-means clusterer; implements the IClassifier interface

Unit test

k_means_test.go

package others

import (
    km "learning/gooop/others/k_means"
    "strings"
    "testing"
)

func Test_KMeans(t *testing.T) {
    //Create sample points
    samples := []km.IPoint {
        km.NewPerson(2, 11),
        km.NewPerson(2, 8),
        km.NewPerson(2, 6),

        km.NewPerson(3, 12),
        km.NewPerson(3, 10),

        km.NewPerson(4, 7),
        km.NewPerson(4, 3),

        km.NewPerson(5, 11),
        km.NewPerson(5, 9),
        km.NewPerson(5, 2),

        km.NewPerson(7, 9),
        km.NewPerson(7, 6),
        km.NewPerson(7, 3),

        km.NewPerson(8, 12),

        km.NewPerson(9, 3),
        km.NewPerson(9, 5),
        km.NewPerson(9, 10),

        km.NewPerson(10, 3),
        km.NewPerson(10, 6),
        km.NewPerson(10, 12),

        km.NewPerson(11, 9),
    }

    fnPoints2String := func(points []km.IPoint) string {
        items := make([]string, len(points))
        for i,it := range points {
            items[i] = it.String()
        }
        return strings.Join(items, " ")
    }

    for k:=1;k<=3;k++ {
        centers := km.KMeansClassifier.Classify(samples, km.PersonDistanceCalculator, k)
        t.Log(fnPoints2String(centers))
    }
}

Test output

$ go test -v k_means_test.go 
=== RUN   Test_KMeans
    k_means_test.go:53: p(7,6)
    k_means_test.go:53: p(5,9) p(7,3)
    k_means_test.go:53: p(9,10) p(3,10) p(7,3)
--- PASS: Test_KMeans (0.00s)
PASS
ok      command-line-arguments  0.002s

IPoint.go

Sample point interface. It declares no methods of its own and only embeds fmt.Stringer.

package km

import "fmt"

type IPoint interface {
    fmt.Stringer
}

IDistanceCalculator.go

Distance calculator interface

package km

type IDistanceCalculator interface {
    Calc(a, b IPoint) int
}
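Because a classifier only sees this interface, a different metric can be swapped in without touching the clustering code. The sketch below assumes a hypothetical gridPoint type and a hypothetical manhattanCalculator, neither of which is part of the original package. The two interfaces are redeclared locally only so that the snippet compiles on its own.

```go
package main

import "fmt"

// Local copies of the two interfaces so the sketch is self-contained.
type IPoint interface{ fmt.Stringer }

type IDistanceCalculator interface {
	Calc(a, b IPoint) int
}

// maxInt is the largest int value on the current platform.
const maxInt = int(^uint(0) >> 1)

// gridPoint is a hypothetical sample type, not part of the original package.
type gridPoint struct{ x, y int }

func (p gridPoint) String() string { return fmt.Sprintf("g(%d,%d)", p.x, p.y) }

// manhattanCalculator is a hypothetical alternative metric: |dx| + |dy|.
type manhattanCalculator struct{}

func abs(n int) int {
	if n < 0 {
		return -n
	}
	return n
}

// Calc returns the Manhattan distance between a and b. Points of any other
// type are treated as infinitely far away.
func (manhattanCalculator) Calc(a, b IPoint) int {
	p, ok1 := a.(gridPoint)
	q, ok2 := b.(gridPoint)
	if !ok1 || !ok2 {
		return maxInt
	}
	return abs(p.x-q.x) + abs(p.y-q.y)
}

func main() {
	var calc IDistanceCalculator = manhattanCalculator{}
	fmt.Println(calc.Calc(gridPoint{2, 11}, gridPoint{5, 9})) // 5
}
```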

IClassifier.go

Classifier interface: clusters the samples into k groups and returns the k center points

package km

type IClassifier interface {
    //Cluster the samples into k and return k center points
    Classify(samples []IPoint, calc IDistanceCalculator, k int) []IPoint
}

tPerson.go

Case sample point; implements the IPoint interface and holds X and Y coordinates

package km

import "fmt"

type tPerson struct {
    x int
    y int
}

func NewPerson(x, y int) IPoint {
    return &tPerson{x, y}
}

func (me *tPerson) String() string {
    return fmt.Sprintf("p(%v,%v)", me.x, me.y)
}

tPersonDistanceCalculator.go

Case distance calculator: computes the squared straight-line distance between two points from their X and Y coordinates

package km


type tPersonDistanceCalculator struct {
}

//Largest int value on the current platform (also compiles where int is 32 bits)
var gMaxInt = int(^uint(0) >> 1)

func newPersonDistanceCalculator() IDistanceCalculator {
    return &tPersonDistanceCalculator{}
}

func (me *tPersonDistanceCalculator) Calc(a, b IPoint) int {
    if a == b {
        return 0
    }

    p1, ok := a.(*tPerson)
    if !ok {
        return gMaxInt
    }

    p2, ok := b.(*tPerson)
    if !ok {
        return gMaxInt
    }

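    dx := p1.x - p2.x
    dy := p1.y - p2.y

    //Squared distance; the square root is not needed just to compare distances
    d := dx*dx + dy*dy
    if d < 0 {
        //A negative result means the integer arithmetic overflowed
        panic(d)
    }
    return d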
}

var PersonDistanceCalculator = newPersonDistanceCalculator()
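Calc returns dx*dx + dy*dy, that is, the squared distance rather than the distance itself. Because the square root is monotonic, comparing squared values picks the same nearest center. A small check of this, with a hypothetical helper sqDist that mirrors the formula:

```go
package main

import (
	"fmt"
	"math"
)

// sqDist mirrors the dx*dx + dy*dy formula used by Calc.
func sqDist(x1, y1, x2, y2 int) int {
	dx, dy := x1-x2, y1-y2
	return dx*dx + dy*dy
}

func main() {
	// Sample (5,9) against two candidate centers, (3,10) and (7,3).
	a := sqDist(5, 9, 3, 10) // 4 + 1 = 5
	b := sqDist(5, 9, 7, 3)  // 4 + 36 = 40
	fmt.Println(a < b) // true
	// Taking square roots leaves the ordering unchanged.
	fmt.Println(math.Sqrt(float64(a)) < math.Sqrt(float64(b))) // true
}
```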

tKMeansClassifier.go

k-means clusterer; implements the IClassifier interface

package km

import (
    "math/rand"
    "time"
)

type tKMeansClassifier struct {
}

type tPointEntry struct {
    point IPoint
    distance int
    index int
}

func newPointEntry(p IPoint, d int, i int) *tPointEntry {
    return &tPointEntry{
        p, d, i,
    }
}

func newKMeansClassifier() IClassifier {
    return &tKMeansClassifier{}
}

//Cluster the samples into k and return k center points
func (me *tKMeansClassifier) Classify(samples []IPoint, calc IDistanceCalculator, k int) []IPoint {
    sampleCount := len(samples)
    if sampleCount <= k {
        return samples
    }

    //Initialization, randomly select k center points
    rnd := rand.New(rand.NewSource(time.Now().UnixNano()))
    centers := make([]IPoint, k)
    for selected, i:= make(map[int]bool, 0), 0;i < k; {
        n := rnd.Intn(sampleCount)
        _,ok := selected[n]

        if !ok {
            selected[n] = true
            centers[i] = samples[n]
            i++
        }
    }


    //Divide samples according to the distance to the center point
    for {
        groups := me.split(samples, centers, calc)

        newCenters := make([]IPoint, k)
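        for i,g := range groups {
            if len(g) == 0 {
                //No sample was assigned (possible when samples share coordinates): keep the old center
                newCenters[i] = centers[i]
                continue
            }
            newCenters[i] = me.centerOf(g, calc)
        }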

        if me.groupEquals(centers, newCenters) {
            return centers
        }
        centers = newCenters
    }
}

//Assign each sample to the cluster of its nearest center point
func (me *tKMeansClassifier) split(samples []IPoint, centers []IPoint, calc IDistanceCalculator) [][]IPoint {
    k := len(centers)
    result := make([][]IPoint, k)
    for i := 0;i<k;i++ {
        result[i] = make([]IPoint, 0)
    }

    entries := make([]*tPointEntry, k)
    for i,c := range centers {
        entries[i] = newPointEntry(c, 0, i)
    }

    for _,p := range samples {
        for _,e := range entries {
            e.distance = calc.Calc(p, e.point)
        }

        center := me.min(entries)
        result[center.index] = append(result[center.index], p)
    }

    return result
}

//Find the center of a cluster: the sample whose summed distance to all samples in the cluster is smallest
func (me *tKMeansClassifier) centerOf(samples []IPoint, calc IDistanceCalculator) IPoint {
    entries := make([]*tPointEntry, len(samples))
    for i,src := range samples {
        distance := 0
        for _,it := range samples {
            distance += calc.Calc(src, it)
        }
        entries[i] = newPointEntry(src, distance, i)
    }

    return me.min(entries).point
}

//Judge whether the two groups of points are the same
func (me *tKMeansClassifier) groupEquals(g1, g2 []IPoint) bool {
    if len(g1) != len(g2) {
        return false
    }

    for i,v := range g1 {
        if g2[i] != v {
            return false
        }
    }

    return true
}

//Find the point with the smallest distance
func (me *tKMeansClassifier) min(entries []*tPointEntry) *tPointEntry {
    minI := 0
    minD := gMaxInt
    for i,it := range entries {
        if it.distance < minD {
            minI = i
            minD = it.distance
        }
    }

    return entries[minI]
}


var KMeansClassifier = newKMeansClassifier()
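Classify seeds its random generator from the clock, so the centers printed by the test can differ from run to run. For reproducible experiments, one option is to pass a fixed seed and take the first k entries of a random permutation, which also avoids the retry loop. This is only a sketch: pickInitial is a hypothetical helper, not part of the original package, and it assumes k <= n.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickInitial returns k distinct indices in [0, n), chosen with the given seed.
// It takes the first k entries of a random permutation of 0..n-1.
func pickInitial(n, k int, seed int64) []int {
	rnd := rand.New(rand.NewSource(seed))
	return rnd.Perm(n)[:k]
}

func main() {
	// The same seed yields the same indices on both calls.
	fmt.Println(pickInitial(21, 3, 1))
	fmt.Println(pickInitial(21, 3, 1))
}
```

The returned indices could then be used to select the initial centers from the sample slice.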

(end)
