## Cart tree

CART (Classification and Regression Tree) is a decision-tree algorithm that can be used to build both classification trees and regression trees.

Classification tree: taking C4.5 as an example, each split exhaustively enumerates the **features** and divides the samples by a threshold on $feature_i$, i.e., by a "feature + threshold" rule.

Regression tree: the overall process is similar. The difference is that each node of a regression tree yields a predicted value, generally the mean of the samples falling into that node. When exhausting feature values to find the optimal split point, the criterion is no longer entropy-based, but the mean squared error.

Differences:

- Both classification and regression trees split each node by judging a feature against a threshold.
- The final output of a classification tree is a category; that of a regression tree is a fitted value.
- Classification trees use criteria such as C4.5's information gain ratio; regression trees use least squares.
- For node output, a regression tree additionally carries a fitted value at each node.

The key questions are: how does a regression tree select features and thresholds, and how does it determine the output value of a node?

**1. Select feature + threshold:**

Suppose the input space is divided into $M$ regions $R_1, R_2, \ldots, R_M$, and the output value of each region is $c_m$. The regression tree model can then be expressed as follows:

$$

f(x) = \sum\limits_{m = 1}^M {{c_m}I(x \in {R_m})}

$$

The $j$-th feature $x^{(j)}$ and its value $s$ are selected as the splitting basis:

$$

{R_1}(j,s) = \{ x|{x^{(j)}} \le s\} ,{R_2}(j,s) = \{ x|{x^{(j)}} > s\}

$$

Least squares is used to find the optimal split point:

$$

\mathop {\min }\limits_{j,s} [\mathop {\min }\limits_{{c_1}} \sum\limits_{{x_i} \in {R_1}} {{{({y_i} - {c_1})}^2} + } \mathop {\min }\limits_{{c_2}} \sum\limits_{{x_i} \in {R_2}} {{{({y_i} - {c_2})}^2}} ]

$$

The selected feature is the $j$-th one, and its split threshold is $s$.

**2. Output value of node:**

After selecting the optimal splitting variable $j$ and the optimal threshold $s$, the output values of the resulting regions are as follows:

$$

{c_1} = mean({y_i}|{x_i} \in {R_1}(j,s)),{c_2} = mean({y_i}|{x_i} \in {R_2}(j,s))

$$

By traversing all pairs $(j, s)$ and recursively partitioning the space, a regression tree is generated.
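As a minimal sketch (not from the original source), the exhaustive search over $(j, s)$ described above can be written as follows; `best_split` and the toy data are illustrative names, and the region means play the role of $c_1$ and $c_2$:

```python
import numpy as np

def best_split(X, y):
    """Exhaustively search features j and thresholds s, minimizing the
    total squared error of the two regions R1 (x^(j) <= s) and R2 (least squares)."""
    best = (None, None, np.inf)  # (j, s, loss)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue  # skip splits that leave a region empty
            # c1, c2 are the region means; loss is the summed squared error
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best

# toy data: two clearly separated groups along the single feature
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
j, s, loss = best_split(X, y)  # splits between 3.0 and 10.0
```

A full regression tree would apply this search recursively to each resulting region until a stopping condition (depth, minimum samples) is met.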

### Three criteria of classification trees

Entropy calculation:

$$

H(D) = - \sum\limits_{k = 1}^K {\frac{{\left| {{C_k}} \right|}}{{\left| D \right|}}\log } \frac{{\left| {{C_k}} \right|}}{{\left| D \right|}} = - \sum {{p_i}\log {p_i}}

$$

**1. Information gain (ID3)**

$$

g(D,A) = H(D) - H(D|A)

$$

ID3 is biased toward features with many distinct values: the more values a feature has, the more subsets the split produces, and the more likely each subset is to have high purity.
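As an illustrative sketch (names and data are made up for this example), entropy and information gain can be computed directly from the two formulas above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(D) = -sum p_k log2 p_k over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, feature_values):
    """g(D, A) = H(D) - H(D|A), where H(D|A) is the weighted entropy
    of the subsets induced by each value of feature A."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels = ['yes', 'yes', 'no', 'no']
feature = ['a', 'a', 'b', 'b']  # perfectly separates the classes
g = info_gain(labels, feature)  # gain equals H(D) = 1.0 bit here
```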

**2. Information gain ratio (C4.5)**

$$

{g_R}(D,A) = \frac{{g(D,A)}}{{{H_A}(D)}},\quad {H_A}(D) = - \sum\limits_{k = 1}^n {\frac{{\left| {{D_k}} \right|}}{{\left| D \right|}}\log } \frac{{\left| {{D_k}} \right|}}{{\left| D \right|}}

$H_A(D)$: feature $A$ has $n$ distinct values, and $D$ is partitioned into $n$ subsets accordingly. The larger $n$ is, the larger $H_A(D)$ is, which suppresses multi-valued features.
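A hedged sketch of the correction (helper names are illustrative, not from any library): the same entropy function applied to the feature's own value distribution gives $H_A(D)$, and dividing by it penalizes many-valued features.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Entropy of a discrete distribution given as a list of outcomes."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def gain_ratio(labels, feature_values):
    """g_R(D, A) = g(D, A) / H_A(D). H_A(D) is the entropy of feature A's
    own value distribution; many-valued features get a larger H_A(D),
    which shrinks the ratio and counters ID3's bias."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        cond += len(subset) / n * entropy(subset)
    gain = entropy(labels) - cond
    h_a = entropy(feature_values)
    return gain / h_a if h_a > 0 else 0.0

labels = ['yes', 'yes', 'no', 'no']
# an "ID"-like feature with a unique value per sample: gain is maximal
# (1 bit), but H_A(D) = 2 bits, so the ratio drops to 0.5
ratio = gain_ratio(labels, ['1', '2', '3', '4'])
```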

**3. Gini index**

The Gini index represents the probability that a sample drawn at random from the set is misclassified.

$$

Gini(p) = \sum\limits_{k = 1}^K {{p_k}(1 - {p_k})}

$$

$$

Gini(D) = 1 - \sum\limits_{k = 1}^K {{{\left( {\frac{{\left| {{C_k}} \right|}}{{\left| D \right|}}} \right)}^2}}

$$

$$

Gini(D,A) = \frac{{\left| {{D_1}} \right|}}{{\left| D \right|}}Gini({D_1}) + \frac{{\left| {{D_2}} \right|}}{{\left| D \right|}}Gini({D_2})

$$

CART uses the Gini index as its splitting criterion for classification.
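The three Gini formulas above can be sketched as follows (a minimal illustration; `gini_split` takes a boolean mask standing in for the binary split induced by feature $A$):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum over classes of (|C_k| / |D|)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels, mask):
    """Gini(D, A): weighted Gini of the two subsets D1 (mask True) and D2."""
    d1 = [l for l, m in zip(labels, mask) if m]
    d2 = [l for l, m in zip(labels, mask) if not m]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

labels = ['yes', 'yes', 'no', 'no']
g_all = gini(labels)                                      # 0.5 for a 50/50 set
g_pure = gini_split(labels, [True, True, False, False])   # 0.0, both subsets pure
```

CART would evaluate `gini_split` for every candidate binary split and pick the one with the smallest value.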

Reference: https://zhuanlan.zhihu.com/p/…