# Decision tree and random forest for Python machine learning

Time: 2021-12-3
##### Contents
• What is a decision tree
• Composition of a decision tree
• How nodes are determined
• Basic process of a decision tree
• Common parameters of a decision tree
• Implementing the classification tree in code
• Applying grid search to the classification tree
• Performance of the classification tree on synthetic data
• What is a random forest
• Principle of random forest
• Common parameters of random forest
• Decision tree vs. random forest
• Tuning a random forest on the breast cancer data

## What is a decision tree

The decision tree is one of the ten classical data mining algorithms. It is a tree-shaped structure similar to a flow chart, whose rules follow the idea of "If... Then...", and it can be used both to predict numerical dependent variables and to classify discrete dependent variables. The algorithm is simple, intuitive, and easy to understand: it does not require researchers to master any domain knowledge or complex mathematical logic, and its output is highly interpretable. As a classifier, a decision tree usually achieves good prediction accuracy, and more and more industries now use this algorithm to solve practical problems.

## Composition of a decision tree

A typical decision tree presents a process of top-down growth: the rules hidden in the data can be displayed intuitively through the tree structure. In such a diagram, the dark ellipse represents the root node of the tree, the light ellipses represent the intermediate nodes, and the boxes represent the leaf nodes. All non-leaf nodes are used for condition judgment, while leaf nodes store the final classification results.

• Root node: has no incoming edges, only outgoing edges. It contains the initial question about a feature.
• Intermediate node: has both incoming and outgoing edges, exactly one incoming edge and several outgoing edges. These are also questions about features.
• Leaf node: has an incoming edge but no outgoing edges. Each leaf node stores a category label.
• Child node and parent node: of two connected nodes, the one closer to the root is the parent node and the other is the child node.
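As a small illustration (a sketch of my own, using sklearn's iris data rather than anything from the article), the text export of a fitted tree makes the root, intermediate, and leaf nodes visible directly:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The outermost condition is the root node's question about a feature;
# each "|---" indentation level is one edge down to a child node,
# and lines ending in "class: ..." are leaf nodes holding category labels.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The printed rules read exactly like the "If... Then..." idea described above.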

### How nodes are determined

The choice of splitting field at each non-leaf node directly determines how good the classification result is. The question of how to split these non-leaf nodes so that the result is better and more efficient is what we call purity. To measure purity, there are three indicators: information gain, information gain ratio, and the Gini coefficient.

• Information gain: during splitting, we compute the information gain produced by each candidate splitting condition, then choose the condition with the maximum information gain by comparison. In other words, the condition with the maximum information gain is the best split we are looking for. (How the information gain value is computed is not explained here, because it relies on a fair amount of mathematics and probability.) Pass `criterion="entropy"` to use information entropy.
• When computing information gain, a field with many distinct values can produce an inflated gain that does not really reflect the classification effect on the data set. The information gain ratio is therefore introduced to penalize the information gain value to a certain extent; a simple way to understand it is dividing the information gain by the intrinsic information of the field itself.
• The information gain and information gain ratio indicators select the root and intermediate nodes, but they can only handle discrete variables and can do nothing with continuous dependent variables. To let the decision tree predict continuous dependent variables, the Gini coefficient indicator is introduced for field selection. Pass `criterion="gini"` to use Gini impurity.

Compared with the Gini coefficient, information entropy is more sensitive to impurity and penalizes it most strongly. In practice, however, the two indicators give basically the same results. Computing information entropy is slower than computing the Gini coefficient, because the Gini calculation involves no logarithm. In addition, because information entropy is more sensitive to impurity, a tree grown with entropy as the indicator tends to be more "fine-grained"; for high-dimensional data or very noisy data, entropy therefore overfits easily, and the Gini coefficient often performs better.
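To make the two indicators concrete, here is a small sketch (my own illustration, not from the article) that computes information entropy and the Gini coefficient for a vector of class labels:

```python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions p
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G = 1 - sum(p^2): the probability that two random draws differ in class
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

pure = [1, 1, 1, 1]    # a perfectly pure node
mixed = [0, 0, 1, 1]   # a maximally impure binary node
print(entropy(pure), gini(pure))    # 0.0 for both: no impurity
print(entropy(mixed), gini(mixed))  # 1.0 and 0.5: entropy penalizes impurity harder
```

The 50/50 node scores 1.0 under entropy but only 0.5 under Gini, which is the "more sensitive to impurity" behavior described above.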

### Basic process of a decision tree

The algorithm repeats the splitting process until no more features are available or the overall impurity indicator is optimal, at which point the decision tree stops growing. However, because this training is so precise, the model may achieve high accuracy on the training data while performing poorly on the test set. To solve this overfitting problem, the decision tree is usually pruned.
There are three classical pruning operations for decision trees: reduced-error pruning, pessimistic pruning, and cost-complexity pruning. Here, however, we only use the parameters provided by sklearn to restrict the growth of the decision tree.
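sklearn does expose cost-complexity pruning directly through the `ccp_alpha` parameter of the tree estimators. A brief sketch (using the iris data as a stand-in, since the article does not demonstrate pruning itself) shows how a larger alpha yields a smaller tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# ccp_alpha=0 grows the full tree; a positive alpha prunes away subtrees whose
# impurity improvement does not justify their added complexity.
full = DecisionTreeClassifier(random_state=0, ccp_alpha=0.0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(full.tree_.node_count, pruned.tree_.node_count)  # the pruned tree has fewer nodes
```

`DecisionTreeClassifier.cost_complexity_pruning_path` can be used to enumerate the candidate alpha values if you want to tune this systematically.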

### Common parameters of a decision tree

```python
DecisionTreeClassifier(criterion="gini"
# criterion specifies the evaluation indicator for choosing a node's split field.
# For a classification tree the default "gini" uses the Gini coefficient; "entropy" is also
# available, but it is more prone to overfitting.
# For a regression tree the default is "mse", which selects the best split by mean squared error.
,random_state=None
# Seed of the random number generator. The default None uses the default generator.
,splitter="best"
# How the split point is chosen within a node. The default "best" picks the best split point
# among all candidates; "random" picks the split point randomly.

# The following parameters help prevent overfitting
,max_depth=None
# Maximum depth of the decision tree. The default None places no restriction on depth during growth.
,min_samples_leaf=1
# Minimum sample size of a leaf node. The default is 1.
,min_samples_split=2
# Minimum sample size for a root or intermediate node to continue splitting. The default is 2.
,max_features=None
# Maximum number of features considered for a split. The default None uses all fields;
# an integer uses that many features, and a float between 0 and 1 uses that fraction of the fields.
,min_impurity_decrease=0
# Minimum decrease in impurity required for a node to split further. The default is 0.
,class_weight=None
# Weights of the categories of the dependent variable. The default None gives every category
# the same weight.
)
```

### Implementing the classification tree in code

```python
import pandas as pd
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import graphviz

# Load the wine data set (a dictionary-like Bunch object)
wine = load_wine()
datatarget = pd.concat([pd.DataFrame(wine.data), pd.DataFrame(wine.target)], axis=1)

# Split the data and labels into training and test sets
xtrain, xtest, ytrain, ytest = train_test_split(wine.data     # data set
                                                , wine.target   # data set labels
                                                , test_size=0.3 # 70% training set, 30% test set
                                                )

# Create a tree model
clf = tree.DecisionTreeClassifier(criterion="gini")  # only the Gini criterion is set here; other parameters keep their defaults
# Fit the model on the training data; clf is the trained model, which can then make decisions for the test set
clf = clf.fit(xtrain, ytrain)

# How do we show the trained model is a good one? The model's score function grades it
clf.score(xtest, ytest)  # test-set score
# Cross validation can also be used to evaluate the model, which is more stable
cross_val_score(clf            # the model to score
                , wine.data      # the data set for scoring
                , wine.target    # the labels for scoring
                , cv=10          # number of cross-validation folds
                # , scoring="neg_mean_squared_error"
                # this parameter is only for regression and reports the negative mean squared error
                ).mean()

# Display the decision tree as a tree diagram
dot = tree.export_graphviz(clf
                           # , feature_names=wine.feature_names  # feature names
                           # , class_names=wine.target_names     # class names
                           , filled=True    # automatic color filling
                           , rounded=True   # rounded borders
                           )
graph = graphviz.Source(dot)

# Feature importance indicator of the model
clf.feature_importances_
# [*zip(wine.feature_names, clf.feature_importances_)]  # pair feature names with importances

# apply returns the index of the leaf node where each test sample lands
clf.apply(xtest)
# predict returns the classification/regression result of each test sample
clf.predict(xtest)
```

Learning curve over different values of max_depth

```python
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

wine = load_wine()
xtrain, xtest, ytrain, ytest = train_test_split(wine.data, wine.target, test_size=0.3)
trainscore = []
testscore = []
for i in range(10):
    clf = tree.DecisionTreeClassifier(criterion="gini"
                                      , random_state=0
                                      , max_depth=i + 1
                                      , min_samples_split=5
                                      , max_features=5
                                      ).fit(xtrain, ytrain)

    # The next two lines measure essentially the same thing; the second is more stable but slower
    once1 = clf.score(xtrain, ytrain)
    once2 = cross_val_score(clf, wine.data, wine.target, cv=10).mean()  # or clf.score(xtest, ytest)

    trainscore.append(once1)
    testscore.append(once2)

# Draw the image
plt.plot(range(1, 11), trainscore, color="red", label="train")
plt.plot(range(1, 11), testscore, color="blue", label="test")
# Display range
plt.xticks(range(1, 11))
plt.legend()
plt.show()
```

Result: in general, the larger max_depth is, the higher both the training and test scores become. Because the wine data set is small, the effect is not particularly obvious, but we can still see that the test score peaks at a maximum depth of 4.

### Applying grid search to the classification tree

This model has many parameters. To obtain a model with a relatively high score, we would need to loop over every parameter continuously before finding the best combination, which is hard to do by hand. Grid search is a tool that automatically determines the best-performing combination of parameters (at the cost of computation time).

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV  # grid search import

wine = load_wine()
xtrain, xtest, ytrain, ytest = train_test_split(wine.data, wine.target, test_size=0.3)
clf = DecisionTreeClassifier(random_state=0)  # fixing random_state makes repeated runs give the same result

# param_grid is the main argument of grid search. It is a dictionary whose keys are
# parameter names of the model and whose values are lists of candidate values
parameters = {"criterion": ["gini", "entropy"]
              , "splitter": ["best", "random"]
              , "max_depth": [*range(1, 10)]
              # , "min_samples_leaf": [*range(5, 10)]
              }
# Grid search traverses the grid and returns a fitted estimator.
# It includes cross validation; cv is the number of folds
GS = GridSearchCV(clf, cv=10, param_grid=parameters)
# Train on the data
gs = GS.fit(xtrain, ytrain)

# The best_params_ attribute shows the best combination found
# (not necessarily the global best: parameters left out of the grid might do better)
best_choice = gs.best_params_
print(best_choice)

# The best_score_ attribute shows the score of the best combination
best_score = gs.best_score_
print(best_score)
```

Fitting sine-function data with regression trees of different max_depth

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Instantiate a random number generator
rng = np.random.RandomState(0)
# rng.rand(80, 1) generates 80 numbers between 0 and 1; sorting along axis=0 gives the independent variable
X = np.sort(5 * rng.rand(80, 1), axis=0)
# Generate the dependent variable
y = np.sin(X).ravel()
# Add noise to every 5th point (16 points in total)
y[::5] += 3 * (0.5 - rng.rand(16))

# Build tree models of different depths. Apart from criterion, the parameters of the
# regression tree are the same as those of the classification tree
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
# Train on the data
regr_1.fit(X, y)
regr_2.fit(X, y)

# Generate test data
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
# Predict the behavior of X_test on the different trees
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

# Drawing
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black",
            c="red", label="data")
plt.plot(X_test, y_1, color="blue",
         label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="green", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
```

Result: the image shows that different depths have both advantages and disadvantages. With max_depth=5, the curve is generally close to the original data, but it also gets very close to some of the noise, so individual points differ greatly from the true signal; this is overfitting (good judgment on the training set, poor results on the test set). With max_depth=2, the curve cannot follow the bulk of the data as closely, but it often avoids the noise much better.

### Performance of the classification tree on synthetic data

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.tree import DecisionTreeClassifier

###### Generate three data sets ######
# Binary classification data set
X, y = make_classification(n_samples=100            # generate 100 samples
                           , n_features=2             # with two features
                           , n_redundant=0            # add 0 redundant features
                           , n_informative=2          # 2 informative features
                           , random_state=1           # random state 1
                           , n_clusters_per_class=1   # each class contains 1 cluster
                           )
rng = np.random.RandomState(2)       # create a random state
X += 2 * rng.uniform(size=X.shape)   # add random numbers between 0 and 2
linearly_separable = (X, y)  # the new X; a scatter plot can still show the feature distribution
# plt.scatter(X[:, 0], X[:, 1])

# Use make_moons to create moon-shaped data and make_circles to create ring-shaped data,
# then package the three data sets in the list datasets
moons = make_moons(noise=0.3, random_state=0)
circles = make_circles(noise=0.2, factor=0.5, random_state=1)
datasets = [moons, circles, linearly_separable]

figure = plt.figure(figsize=(6, 9))
i = 1
# The global variable i = 1 tracks where the next image is displayed
# Iterate over the data sets with a for loop
for ds_index, ds in enumerate(datasets):
    X, y = ds
    # Standardize the data
    X = StandardScaler().fit_transform(X)
    # Partition the data set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                        random_state=42)

    # Determine the data range, used later for the drawing background
    # Note that X[:, 0] is the abscissa and X[:, 1] the ordinate
    x1_min, x1_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    x2_min, x2_max = X[:, 1].min() - .5, X[:, 1].max() + .5

    # Form a grid of coordinates over the drawing board (step 0.2);
    # array1 holds the abscissas and array2 the ordinates
    array1, array2 = np.meshgrid(np.arange(x1_min, x1_max, 0.2), np.arange(x2_min, x2_max, 0.2))
    # Color list
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])

    ########## Display of the raw data ##########

    # Use the i-th of the six (3 x 2) picture positions on the board
    ax = plt.subplot(len(datasets), 2, i)

    if ds_index == 0:
        ax.set_title("Input data")

    # Draw the scatter plot of the training data
    ax.scatter(X_train[:, 0]      # abscissa
               , X_train[:, 1]      # ordinate
               , c=y_train          # pick each point's color from cm_bright by its y_train label,
                                    # so equal labels share a color
               , cmap=cm_bright     # color list
               , edgecolors='k'     # edge color of each point in the scatter plot
               )
    # Draw the scatter plot of the test data; alpha=0.4 distinguishes it from the training set
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.4, edgecolors='k')
    # Displayed coordinate range
    ax.set_xlim(array1.min(), array1.max())
    ax.set_ylim(array2.min(), array2.max())
    # Do not display coordinate values
    ax.set_xticks(())
    ax.set_yticks(())

    # i += 1 moves the next drawing to the next sub-plot
    i += 1

    ####### Display of the data after the decision #######
    ax = plt.subplot(len(datasets), 2, i)
    # Instantiate and train the model
    clf = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    # np.c_ combines two arrays column-wise
    # ravel() flattens a multidimensional array into one dimension
    # The predicted value of every grid point determines the classification regions,
    # which are then displayed with different colors
    Z = clf.predict(np.c_[array1.ravel(), array2.ravel()])
    Z = Z.reshape(array1.shape)
    cm = plt.cm.RdBu  # colormap chosen for the background
    ax.contourf(array1     # abscissas of the board
                , array2     # ordinates of the board
                , Z          # predicted value at each grid point, which determines the background color
                , cmap=cm    # colormap; since Z takes only two values, an explicit two-color list would also work
                , alpha=0.8
                )
    # As with the raw data, draw the scatter plots of the training and test sets
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors='k')
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, edgecolors='k', alpha=0.4)
    ax.set_xlim(array1.min(), array1.max())
    ax.set_ylim(array2.min(), array2.max())
    ax.set_xticks(())
    ax.set_yticks(())

    if ds_index == 0:
        ax.set_title("Decision Tree")

    # Add the model's score in the lower-right corner
    ax.text(array1.max() - .3, array2.min() + .3, ('{:.1f}%'.format(score * 100)),
            size=15, horizontalalignment='right')
    i += 1

# Automatically adjust the spacing between sub-plots
plt.tight_layout()
plt.show()
```

It can be seen that the decision tree classifies the moon data and the binary classification data well, but its classification of the ring data is not so ideal.

## What is a random forest

Random forest is an ensemble algorithm. Literally, the forest is a collection of multiple decision trees, and these subtrees are fully grown CART trees. Random means that the trees are generated randomly, with the generation process using bootstrap sampling. The algorithm has two advantages, fast running speed and high prediction accuracy, and is regarded as one of the best algorithms.
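The "random" part can be sketched with plain numpy (my own illustration, not the article's code): bootstrap sampling draws n observations with replacement, so each tree sees only about 63% of the unique rows, and the remainder become the out-of-bag samples used later by `oob_score`:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# One bootstrap sample: n draws with replacement from range(n)
sample = rng.integers(0, n, size=n)
unique_frac = np.unique(sample).size / n
# Expected fraction of unique rows is 1 - (1 - 1/n)^n, roughly 1 - 1/e or about 0.632
print(round(unique_frac, 3))
```

The roughly 37% of rows a given tree never sees act as a built-in validation set for that tree.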

## Principle of random forest

The core idea of the algorithm is to use the voting mechanism of multiple decision trees to solve classification or prediction problems. For classification problems, the judgments of the individual trees are put to a vote, and the sample's class is decided by majority rule. For prediction problems, the regression results of the trees are averaged to determine the final prediction for the sample.

The modeling process of a random forest can be described as the following workflow:

• Using the bootstrap sampling method, k data sets are generated from the original data set, each containing N observations and P independent variables.
• A CART decision tree is constructed for each data set. In the process of building each tree, p of the P fields (features) are randomly selected as candidates for each node split, instead of using all independent variables.
• Every decision tree is allowed to grow as fully as possible, so that every node is as pure as possible; that is, no subtree in the random forest needs pruning.
• For a random forest of k CART trees: for classification problems, the class with the most votes becomes the final judgment; for regression problems, the mean of the trees' outputs is the final prediction for the sample.
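The voting step above can be checked directly against the individual trees (a sketch of my own on the wine data, which the later examples also use): collecting each tree's prediction and taking the majority reproduces the forest's own `predict` almost exactly:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)
rfc = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Stack every tree's prediction: shape (n_trees, n_samples)
votes = np.stack([tree.predict(X) for tree in rfc.estimators_]).astype(int)
# Majority vote per sample (ties broken by the lowest class index here)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

agreement = (majority == rfc.predict(X)).mean()
print(agreement)  # typically very close to 1.0
```

One caveat: sklearn's `RandomForestClassifier` actually averages the trees' predicted probabilities rather than hard-voting, so the agreement is not guaranteed to be exactly 1.0, but the majority-rule picture is the right intuition.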

### Common parameters of random forest

Using the classification random forest as an example:

```python
RandomForestClassifier(n_estimators=10
# Number of decision trees contained in the random forest
, criterion="gini"
# Indicator for choosing each tree's split field at each node; same meaning as for the decision tree
, max_depth=None
# Maximum depth of each decision tree; by default the growth depth is not limited
, min_samples_split=2
# Minimum sample size for a root or intermediate node of each tree to continue splitting; the default is 2
, min_samples_leaf=1
# Minimum sample size of each leaf node of each tree; the default is 1
, max_features="auto"
# Maximum number of split fields (features) considered by each tree;
# the default "auto" uses the square root of the feature count for classification
, bootstrap=True
# Whether to enable bootstrap sampling (sampling with replacement, which generates out-of-bag data);
# if disabled, you need to split train and test sets yourself; the default is True
, oob_score=False
# Whether to use the out-of-bag samples to estimate the generalization error; the default is False.
# Out-of-bag samples are those not selected during bootstrap sampling
, random_state=None
# Seed of the random number generator; the default None uses the default generator
, class_weight=None
# Weights of the categories of the dependent variable; by default every category has the same weight
)
```

Note: in general, the larger n_estimators is, the better the model performs. But every model has a decision boundary: once n_estimators reaches a certain level, the accuracy of the random forest stops rising or begins to fluctuate, and the larger n_estimators is, the more computation and memory are required and the longer the training takes.

Random forest application example

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
xtrain, xtest, ytrain, ytest = train_test_split(wine.data, wine.target, test_size=0.3)
rfc = RandomForestClassifier(random_state=0
                             , n_estimators=10   # number of trees to grow
                             , bootstrap=True
                             , oob_score=True    # enable out-of-bag evaluation
                             ).fit(xtrain, ytrain)
rfc.score(xtest, ytest)
rfc.oob_score_                 # score on the out-of-bag data used as a validation set
rfc.predict_proba(wine.data)   # predicted probability of each class for every sample
rfc.estimators_                # all the trees
rfc.estimators_[0].random_state  # the random_state value of one tree
```

Effect of the parameter n_estimators on the random forest

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

wine = load_wine()
score = []
for i in range(100):
    rfc = RandomForestClassifier(random_state=0
                                 , n_estimators=i + 1
                                 , bootstrap=True
                                 , oob_score=False
                                 )
    once = cross_val_score(rfc, wine.data, wine.target, cv=10).mean()
    score.append(once)

plt.plot(range(1, 101), score)
plt.xlabel("n_estimators")
plt.show()
print("best score = ", max(score), "\nbest n_estimators = ", score.index(max(score)) + 1)
```

Output:
best score = 0.9833333333333334
best n_estimators = 12

### Decision tree vs. random forest

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

wine = load_wine()
score1 = []
score2 = []
for i in range(10):
    rfc = RandomForestClassifier(n_estimators=12)
    once1 = cross_val_score(rfc, wine.data, wine.target, cv=10).mean()
    score1.append(once1)
    clf = DecisionTreeClassifier()
    once2 = cross_val_score(clf, wine.data, wine.target, cv=10).mean()
    score2.append(once2)
plt.plot(range(1, 11), score1, label="forest")
plt.plot(range(1, 11), score2, label="single tree")
plt.legend()
plt.show()
```

The image shows intuitively that the random forest ensemble of multiple decision trees clearly outperforms a single decision tree.

### Tuning a random forest on the breast cancer data

n_estimators

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
scores = []
for i in range(1, 201, 10):
    rfc = RandomForestClassifier(n_estimators=i, random_state=0)
    score = cross_val_score(rfc, cancer.data, cancer.target, cv=10).mean()
    scores.append(score)
print(max(scores), (scores.index(max(scores)) * 10) + 1)
plt.figure(figsize=[20, 5])
plt.plot(range(1, 201, 10), scores)
plt.show()
```

Output: 0.9649122807017545 111

max_depth

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
scores = []
for i in range(1, 20, 2):
    rfc = RandomForestClassifier(n_estimators=111, max_depth=i, random_state=0)
    score = cross_val_score(rfc, cancer.data, cancer.target, cv=10).mean()
    scores.append(score)
print(max(scores), (scores.index(max(scores)) * 2) + 1)
plt.figure(figsize=[20, 5])
plt.plot(range(1, 20, 2), scores)
plt.show()
```

Output: 0.9649122807017545 11

Changing Gini to entropy

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
scores = []
for i in range(1, 20, 2):
    rfc = RandomForestClassifier(n_estimators=111, criterion="entropy", max_depth=i, random_state=0)
    score = cross_val_score(rfc, cancer.data, cancer.target, cv=10).mean()
    scores.append(score)
print(max(scores), (scores.index(max(scores)) * 2) + 1)
plt.figure(figsize=[20, 5])
plt.plot(range(1, 20, 2), scores)
plt.show()
```

Output: 0.9666666666666666 7

(Result pictures comparing Gini and entropy omitted.)

This is the end of this article on decision trees and random forests for Python machine learning.