pyclustering.cluster.kmeans usage explanation

Time: 2021-12-17

Pyclustering is a Python library for cluster analysis. This article explains its kmeans module.

Recently I have been doing some research with the k-means algorithm. One idea was to replace the distance function of k-means, but sklearn does not expose an interface for this, and my hand-rolled implementation did not perform well. I finally found the pyclustering library, so I am recording my experience here.

The k-means training process is as follows.

Packages used (distance_metric and type_metric are needed by the later examples):
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import distance_metric, type_metric
import numpy as np
1. Initialize centroids:
initial_centers = kmeans_plusplus_initializer(x, cluster_num).initialize()

where x is the data and cluster_num is the number of clusters.
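For intuition, k-means++ seeding can be sketched in plain numpy. This is a simplified illustration of the idea, not pyclustering's actual implementation; the function name kmeans_pp_init is my own:

```python
import numpy as np

def kmeans_pp_init(x, k, rng=None):
    # Simplified k-means++ seeding: the first center is chosen uniformly
    # at random; each subsequent center is drawn with probability
    # proportional to the squared distance to its nearest chosen center.
    rng = rng or np.random.default_rng(0)
    centers = [x[rng.integers(len(x))]]
    for _ in range(k - 1):
        d2 = ((x[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    return np.array(centers)
```

Because distant points get higher sampling probability, the initial centers tend to be spread out, which is why k-means++ usually converges faster than uniform random seeding.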

2. Instantiate kmeans class:
kmeans_instance = kmeans(x, initial_centers, metric=metric)

metric is the distance measure; the default is Euclidean distance. Metrics are described in detail below.

3. Training:
kmeans_instance.process()
4. Get clusters:
clusters = kmeans_instance.get_clusters()

This returns the cluster assignment of the training data x as a list of index lists. For example, if points a, b and c (at indices 0, 1 and 2) belong to clusters 1, 1 and 0 respectively, get_clusters() returns [[2], [0, 1]].

5. Get centroids:
cs = kmeans_instance.get_centers()
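For reference, each returned centroid is (up to convergence details) the mean of the points in the corresponding index list. A quick check with toy data, using the same clusters format that get_clusters() returns:

```python
import numpy as np

# Toy data and clusters as index lists, matching get_clusters() output.
x = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]])
clusters = [[0, 1], [2]]

# Each centroid is the mean of the points assigned to that cluster.
centers = np.array([x[idx].mean(axis=0) for idx in clusters])
# centers[0] is [0.1, 0.0], centers[1] is [5.0, 5.0]
```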
6. Predict:

For prediction, several methods are given below to suit different scenarios.

  • The first is to use the instance's predict method directly:
label = kmeans_instance.predict(x)
  • The second builds labels from the previously obtained clusters:
label = np.array([0]*len(x))
for i,sub in enumerate(clusters):
    label[sub] = i
  • The third uses the obtained centroids, wrapped in a function; metric is a distance function:
def clu_predict(x, cs, metric=distance_metric(type_metric.EUCLIDEAN)):
    distances = np.zeros((len(x), len(cs)))
    for index_point in range(len(x)):
        distances[index_point] = [metric(x[index_point], c) for c in cs]
    return np.argmin(distances, axis=1)

Note that this point-by-point loop is not very efficient; it is better to implement the distance computation as a matrix operation yourself.
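Following that advice, here is a vectorized sketch using numpy broadcasting. It assumes squared Euclidean distance; for an arbitrary metric callable, the loop version above is still needed:

```python
import numpy as np

def predict_vectorized(x, cs):
    # Assign each row of x to its nearest centroid in cs using a single
    # broadcasted squared-Euclidean distance computation (no Python loop).
    x = np.asarray(x, dtype=float)
    cs = np.asarray(cs, dtype=float)
    # (n, 1, d) - (1, k, d) -> (n, k, d); summing over d gives (n, k).
    d2 = ((x[:, None, :] - cs[None, :, :]) ** 2).sum(axis=-1)
    return np.argmin(d2, axis=1)
```

Squared distance gives the same argmin as true Euclidean distance, so the square root can be skipped.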

7. Metrics:
  • Take Manhattan distance as an example:
manhattan_metric = distance_metric(type_metric.MANHATTAN)
kmeans_instance = kmeans(x, initial_centers, metric=manhattan_metric)

Just replace the type_metric.* value to switch distances. The distances provided by the library are:

class type_metric(IntEnum):
    """!
    @brief Enumeration of supported metrics in the module for distance calculation between two points.
    """
    ## Euclidean distance, for more information see function 'euclidean_distance'.
    EUCLIDEAN = 0
    ## Square Euclidean distance, for more information see function 'euclidean_distance_square'.
    EUCLIDEAN_SQUARE = 1
    ## Manhattan distance, for more information see function 'manhattan_distance'.
    MANHATTAN = 2
    ## Chebyshev distance, for more information see function 'chebyshev_distance'.
    CHEBYSHEV = 3
    ## Minkowski distance, for more information see function 'minkowski_distance'.
    MINKOWSKI = 4
    ## Canberra distance, for more information see function 'canberra_distance'.
    CANBERRA = 5
    ## Chi square distance, for more information see function 'chi_square_distance'.
    CHI_SQUARE = 6
    ## Gower distance, for more information see function 'gower_distance'.
    GOWER = 7
    ## User defined function for distance calculation between two points.
    USER_DEFINED = 1000
  • Use a custom distance, taking cosine distance as an example:
def cosine_distance(a, b):
    a_norm = np.linalg.norm(a)
    b_norm = np.linalg.norm(b)
    similarity = np.dot(a, b.T) / (a_norm * b_norm)
    dist = 1. - similarity
    return dist
metric = distance_metric(type_metric.USER_DEFINED, func=cosine_distance)

The custom function only needs to compute the distance between two points; pyclustering handles the rest.
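As a quick sanity check that the custom metric behaves as expected (the values below follow from the definition of cosine distance):

```python
import numpy as np

def cosine_distance(a, b):
    # 1 minus cosine similarity: 0 for parallel vectors,
    # 1 for orthogonal vectors, 2 for opposite directions.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_distance([1, 0], [2, 0]))   # 0.0 (same direction)
print(cosine_distance([1, 0], [0, 3]))   # 1.0 (orthogonal)
print(cosine_distance([1, 0], [-1, 0]))  # 2.0 (opposite)
```

Because cosine distance ignores vector magnitude, the resulting clusters group points by direction rather than by position, which is often what is wanted for text or embedding data.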