Hello, ladies and gentlemen.（*￣▽￣*) I’m a vegetable. This is my sklearn class.

My development environment is **Jupyter Lab**. The libraries and versions used are listed below for your reference.

**Python** 3.7.1 (your version should be 3.4 or later)

**Scikit-learn** 0.20.0 (your version should be 0.20 or later)

**Graphviz** 0.8.4 (without it, decision trees cannot be drawn; install it with `conda install python-graphviz`)

**Numpy** 1.15.3, **Pandas** 0.23.4, **Matplotlib** 3.0.1, **SciPy** 1.1.0
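To compare your own environment against the versions listed above, a small check like the following may help. It is only a sketch: the dictionary name `versions` is made up for illustration, and a library that is not installed is reported rather than treated as an error.

```python
import importlib
import sys

# Collect the version of each library mentioned above.
# "versions" is an illustrative name, not part of any library.
versions = {"python": sys.version.split()[0]}
for name in ("sklearn", "numpy", "pandas", "matplotlib", "scipy", "graphviz"):
    try:
        module = importlib.import_module(name)
        versions[name] = str(getattr(module, "__version__", "unknown"))
    except ImportError:
        versions[name] = "not installed"

for name, version in versions.items():
    print("{:<12}{}".format(name, version))
```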

Here, we use sklearn to construct three data sets with different distributions and then test how a decision tree performs on each of them, so that we can better understand the decision tree. The following figure shows the results on the three data sets. The implementation is described in detail below.

### 1. Import the required libraries

```
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.tree import DecisionTreeClassifier
```

### 2. Generating three data sets

We first generate three types of data sets with sklearn's data set generators: 1) moon-shaped data, 2) ring data and 3) binary classification data.

```
# make_classification generates random binary classification data
X, y = make_classification(n_samples=100,           # generate 100 samples
                           n_features=2,            # two features, i.e. two-dimensional data
                           n_redundant=0,           # no redundant features
                           n_informative=2,         # both features carry information
                           random_state=1,          # random seed
                           n_clusters_per_class=1   # each class consists of one cluster
                           )
```

Here you can look at X and y: X holds 100 rows of data with 2 features, and y holds the binary labels.
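A quick way to confirm this is to print the shapes and the distinct label values. The snippet regenerates the data with the same parameters so that it can be run on its own.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, random_state=1,
                           n_clusters_per_class=1)

print(X.shape)       # (100, 2): 100 rows, 2 features
print(y.shape)       # (100,): one label per row
print(np.unique(y))  # [0 1]: the two classes
print(X[:5])         # first five samples
```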

You can also draw scatter plots to observe the distribution of features in X.

`plt.scatter(X[:,0],X[:,1]); `
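The call above draws every point in the same color. If you want to see which cluster belongs to which class, one option is to pass the labels to the `c` argument. This is a small sketch; the colormap `"bwr"` and the axis labels are choices made here for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, random_state=1,
                           n_clusters_per_class=1)

fig, ax = plt.subplots(figsize=(4, 4))
# c=y colors each point by its class label
points = ax.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", edgecolors="k")
ax.set_xlabel("feature 0")
ax.set_ylabel("feature 1")
```

In Jupyter the figure is rendered at the end of the cell; in a plain script you would call `plt.show()` afterwards.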

As the plot shows, the two clusters of the generated binary data lie far apart, which makes the task too easy to tell us much about the classifier. So we use NumPy to generate a random array and add it to the generated data points (the code below draws values from a uniform distribution on [0, 1) and scales them by 2), which spreads the points out.

[Note] Run this step only once. `X += ...` modifies X in place, so every additional run spreads X out further; the two clusters end up mixed together and the classifier's performance keeps declining.

```
rng = np.random.RandomState(2)          # create a seeded random number generator
X += 2 * rng.uniform(size=X.shape)      # add uniform random numbers from [0, 1), scaled by 2
linearly_separable = (X, y)
```
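If re-running the cell is a concern, one possible alternative is to leave X untouched and write the noisy copy to a new variable. The sketch below assumes a helper named `add_uniform_noise` (a hypothetical name, not an sklearn or NumPy function); with the same seed it produces the same values as a single run of the in-place version.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, random_state=1,
                           n_clusters_per_class=1)
X_original = X.copy()


def add_uniform_noise(X, scale=2.0, seed=2):
    """Return a noisy copy of X; the input array is not modified."""
    rng = np.random.RandomState(seed)
    return X + scale * rng.uniform(size=X.shape)


X_noisy = add_uniform_noise(X)
X_noisy_again = add_uniform_noise(X)   # safe to repeat: same input, same output
linearly_separable = (X_noisy, y)

shift = X_noisy - X
print(shift.min(), shift.max())        # every shift lies between 0 and 2
```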

A new X is generated, and scatter plots can still be drawn to observe the distribution of features.

`plt.scatter(X[:,0],X[:,1]);`

```
# Create moon data with make_moons and ring data with make_circles,
# then pack all three data sets into the list datasets
datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),
            linearly_separable]
```
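Before plotting, it can be useful to check what the two generators return. The following sketch prints the shape of each feature matrix and the number of samples per class; both functions are called with their default `n_samples`, as in the list above.

```python
import numpy as np
from sklearn.datasets import make_moons, make_circles

moons = make_moons(noise=0.3, random_state=0)
circles = make_circles(noise=0.2, factor=0.5, random_state=1)

# Each generator returns a (features, labels) tuple
for name, (X_ds, y_ds) in [("moons", moons), ("circles", circles)]:
    print(name, X_ds.shape, np.bincount(y_ds))
```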

### 3. Plot the three data sets and the decision tree's classification result on each

```
# Create a canvas 6 inches wide and 9 inches tall
figure = plt.figure(figsize=(6, 9))
# The counter i sets the position of each subplot on the canvas
i = 1
# Iterate over the data sets in datasets
for ds_index, ds in enumerate(datasets):
    # Standardize X, then split the data into a training set and a test set
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)
    # Find the minimum and maximum of each feature and pad them by 0.5,
    # so that the plotted range is slightly larger than the range of the data
    x1_min, x1_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    x2_min, x2_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    # Generate grid data from the two feature ranges: a dense set of points covering the plane
    # np.arange returns evenly spaced values between two given numbers; 0.2 is the step size
    # np.meshgrid builds two two-dimensional matrices from two one-dimensional arrays
    # If the first array has length n and the second has length m, both outputs have shape (m, n):
    # the first repeats the first array as m rows, the second repeats the second array as n columns
    # The grid is used to draw the decision boundary, because contourf takes two-dimensional coordinate arrays
    array1, array2 = np.meshgrid(np.arange(x1_min, x1_max, 0.2),
                                 np.arange(x2_min, x2_max, 0.2))
    # Next, define the colors
    # cm colors the decision surface; ListedColormap colors the points: FF0000 is red, 0000FF is blue
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    # Add a subplot to the canvas: len(datasets) rows, two columns, at position i
    ax = plt.subplot(len(datasets), 2, i)
    # An empty coordinate system now exists at position i; next we add its title
    # Only the first row needs a title, so we set the condition ds_index == 0
    if ds_index == 0:
        ax.set_title("Input data")
    # Put the data set into the coordinate system
    # Plot the training set first
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
               cmap=cm_bright, edgecolors='k')
    # Plot the test set (semi-transparent)
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
               cmap=cm_bright, alpha=0.6, edgecolors='k')
    # Set the axis limits and hide the ticks
    ax.set_xlim(array1.min(), array1.max())
    ax.set_ylim(array2.min(), array2.max())
    ax.set_xticks(())
    ax.set_yticks(())
    # Increase i after each plot so that the next plot lands in a different position
    i += 1
    # The plot of the data set itself is now complete. Running the code up to here shows the three processed data sets.
    ############################# The decision tree model starts here ##########################
    # Use subplot(rows, columns, index) again; the index i defines the position of the plot
    # Here len(datasets) is 3 and 2 is the number of columns
    # i starts at 1 and was increased by 1 after the data plot, so in this half of the loop i is 2, 4, 6
    ax = plt.subplot(len(datasets), 2, i)
    # Decision tree modeling: instantiate, fit on the training set, score on the test set to get the accuracy
    clf = DecisionTreeClassifier(max_depth=5)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    # To draw the decision boundary, assign a color to each point of the grid [x1_min, x1_max] x [x2_min, x2_max]
    # predict_proba returns the class probabilities for each input data point
    # The class probability is the number of training samples of that class in the leaf
    # where the point lands, divided by the total number of training samples in that leaf
    # X_train has two features, so the array passed to predict_proba must also have two columns
    # ravel() flattens a multidimensional array into a one-dimensional array
    # np.c_ stacks one-dimensional arrays as the columns of a two-dimensional array
    # We flatten the two grid matrices, join them into a two-column array and pass it to the tree
    # The result has one row per grid point and one column per class; [:, 1] keeps the probability of class 1
    Z = clf.predict_proba(np.c_[array1.ravel(), array2.ravel()])[:, 1]
    # Example: np.c_[np.array([1,2,3]), np.array([4,5,6])]
    # Reshape the class probabilities to the grid shape and draw filled contours with contourf
    Z = Z.reshape(array1.shape)
    ax.contourf(array1, array2, Z, cmap=cm, alpha=.8)
    # Put the data set into the coordinate system
    # Plot the training set
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
               edgecolors='k')
    # Plot the test set (semi-transparent)
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
               edgecolors='k', alpha=0.6)
    # Set the axis limits
    ax.set_xlim(array1.min(), array1.max())
    ax.set_ylim(array2.min(), array2.max())
    # Hide the ticks and tick labels
    ax.set_xticks(())
    ax.set_yticks(())
    # Only the first row needs a title, so we set the condition ds_index == 0
    if ds_index == 0:
        ax.set_title("Decision Tree")
    # Write the test accuracy in the lower right corner
    ax.text(array1.max() - .3, array2.min() + .3, ('{:.1f}%'.format(score*100)),
            size=15, horizontalalignment='right')
    # Increase i again for the next data set
    i += 1
plt.tight_layout()
plt.show()
```
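The grid step in the plotting loop is easier to follow on a tiny example. The sketch below uses two short arrays instead of the real feature ranges and fits a tree on the unscaled moon data; the names `grid_a`, `grid_b` and `points` are chosen here for illustration, and `random_state=0` is added only to make the tree reproducible.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

a = np.arange(0, 3)                  # length n = 3
b = np.arange(0, 2)                  # length m = 2
grid_a, grid_b = np.meshgrid(a, b)
print(grid_a.shape, grid_b.shape)    # (2, 3) (2, 3): both are m x n

# Flatten both grids and stack them as two columns: one row per grid point
points = np.c_[grid_a.ravel(), grid_b.ravel()]
print(points.shape)                  # (6, 2)

X, y = make_moons(noise=0.3, random_state=0)
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

proba = clf.predict_proba(points)    # one row per point, one column per class
Z = proba[:, 1].reshape(grid_a.shape)
print(Z.shape)                       # (2, 3): same shape as the grid, ready for contourf
```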

Running the plotting code in section 3 produces the following figure:

In the figure, each line is a decision boundary drawn by the decision tree on the two-dimensional plane. Every time the tree splits, a new line appears, and because each split tests a single feature, the lines run parallel to the axes. When the data has more dimensions, the decision boundary changes from a line to a plane, or even to a hyperplane in a space we cannot visualize.
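One way to see where these lines come from is to read the splits out of a fitted tree. The sketch below fits a tree on the standardized moon data (the full data set rather than the training split, and with `random_state=0` added for reproducibility, both choices made here for illustration) and prints the feature and threshold of every split through the `tree_` attribute.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(noise=0.3, random_state=0)
X = StandardScaler().fit_transform(X)
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

tree = clf.tree_
# Leaves have identical (empty) left and right children; split nodes do not
is_split = tree.children_left != tree.children_right
split_features = tree.feature[is_split]
split_thresholds = tree.threshold[is_split]

for feature, threshold in zip(split_features, split_thresholds):
    # A split on feature 0 (x axis) is a vertical line; on feature 1, a horizontal one
    direction = "vertical" if feature == 0 else "horizontal"
    print("{} line at feature {} = {:.3f}".format(direction, feature, threshold))
```

Each printed row corresponds to one boundary segment in the plot, although a split deep in the tree only applies inside the region defined by its ancestors.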

At the same time, it is easy to see that classification trees are not naturally suited to ring data. Each model has an upper limit on what it can learn from a given data set, so it is possible that no amount of tuning will improve performance. When a model cannot be tuned any further, we can switch to another model instead of clinging to a single tree. For reference, nearest neighbors, RBF support vector machines and Gaussian processes work best on the moon data; nearest neighbors and Gaussian processes work best on the ring data; and naive Bayes, neural networks and random forests work best on the evenly split binary data.
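To try this out yourself, you could score a few of the models mentioned above on the ring data with the same preprocessing and split as in section 3. This is only a sketch: the hyperparameters (`n_neighbors=3`, `gamma=2`, `C=1`) are illustrative assumptions rather than tuned values, and the ranking can change with the random seed and the noise level.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_circles(noise=0.2, factor=0.5, random_state=1)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)

# Hyperparameters below are illustrative, not tuned
models = {
    "Decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "RBF SVM": SVC(kernel="rbf", gamma=2, C=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # accuracy on the test set
    print("{:<18}{:.1f}%".format(name, scores[name] * 100))
```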