Python ensemble learning: writing and building an AdaBoost classification model, visualizing decision boundaries, and comparing with the sklearn package

Time:2022-1-2

Original link:http://tecdat.cn/?p=24421 

What is AdaBoost?

Boosting refers to a family of machine learning meta-algorithms that combine the outputs of many "weak" classifiers into a powerful "ensemble", where each weak classifier alone may have an error rate only slightly better than random guessing.

The name AdaBoost stands for adaptive boosting. It refers to a particular boosting algorithm in which we fit a sequence of "stumps" (decision trees with one node and two leaves) and weight their final votes according to their prediction accuracy. After each iteration we reweight the dataset, paying more attention to the data points misclassified by the previous weak learner, so that these points receive "special attention" during iteration t + 1.
 

How does it compare to random forests?

| Characteristic | Random forest            | AdaBoost                             |
| -------------- | ------------------------ | ------------------------------------ |
| Depth          | Unlimited (a full tree)  | Stump (single node with two leaves)  |
| Tree growth    | Independent              | Sequential                           |
| Voting         | Identical                | Weighted                             |

AdaBoost algorithm

A) Initialize the sample weights uniformly: $w_i^{(1)} = \frac{1}{n}$.

B) For each iteration $t$:

  1. Find the weak learner $h_t(x)$ that minimizes the weighted error $\epsilon_t = \sum_{i=1}^{n} w_i^{(t)} \, \mathbb{1}\!\left[h_t(x_i) \neq y_i\right]$.
  2. Set the weak learner's weight based on its accuracy: $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
  3. Increase the weights of misclassified observations: $w_i^{(t+1)} = w_i^{(t)} \, e^{-\alpha_t \, y_i \, h_t(x_i)}$.
  4. Renormalize the weights so that $\sum_i w_i^{(t+1)} = 1$.

C) Take the final prediction as the weighted majority vote of the weak learners' predictions: $H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t \, h_t(x)\right)$.
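To make the update rule concrete, here is a small numeric sketch of a single iteration; the labels and stump predictions below are made up purely for illustration and are not from the original post.

import numpy as np

# Hypothetical iteration with n = 4 points, all weights initially 1/4
w = np.array([0.25, 0.25, 0.25, 0.25])
y = np.array([1, 1, -1, -1])           # true labels
h = np.array([1, 1, -1, 1])            # stump predictions (last point misclassified)

eps = w[h != y].sum()                   # weighted error: 0.25
alpha = 0.5 * np.log((1 - eps) / eps)   # stump weight: ~0.55

w_new = w * np.exp(-alpha * y * h)      # misclassified point grows, others shrink
w_new /= w_new.sum()                    # renormalize so the weights sum to 1
print(eps, alpha.round(2), w_new.round(3))

After renormalization the misclassified point carries about half of the total weight, which is exactly the "special attention" it receives at the next iteration.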

Plotting

We will use the following function to visualize our data points and optionally overlay the decision boundary of a fitted AdaBoost model.

import numpy as np
import matplotlib.pyplot as plt


def plot_adaboost(X: np.ndarray,
                  y: np.ndarray,
                  clf=None,
                  sample_weights=None,
                  ax=None) -> None:
    """ Plot ± samples in 2D, optionally with the decision boundary """

    if not ax:
        fig, ax = plt.subplots(figsize=(5, 5), dpi=100)

    pad = 1
    x_min, x_max = X[:, 0].min() - pad, X[:, 0].max() + pad
    y_min, y_max = X[:, 1].min() - pad, X[:, 1].max() + pad

    # Marker sizes reflect the relative sample weights
    if sample_weights is not None:
        sizes = np.array(sample_weights) * X.shape[0] * 100
    else:
        sizes = np.ones(shape=X.shape[0]) * 100

    # Scatter the positive (+) and negative (.) classes
    X_pos, sizes_pos = X[y == 1], sizes[y == 1]
    ax.scatter(*X_pos.T, s=sizes_pos, marker='+', color='red')
    X_neg, sizes_neg = X[y == -1], sizes[y == -1]
    ax.scatter(*X_neg.T, s=sizes_neg, marker='.', color='blue')

    if clf:
        # Evaluate the classifier on a fine mesh to draw its decision regions
        plot_step = 0.01
        xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                             np.arange(y_min, y_max, plot_step))

        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # If all predictions are positive, adjust the color map accordingly
        if list(np.unique(Z)) == [1]:
            fill_colors = ['r']
        else:
            fill_colors = ['b', 'r']

        ax.contourf(xx, yy, Z, colors=fill_colors, alpha=0.2)

    ax.set_xlim(x_min + 0.5, x_max - 0.5)
    ax.set_ylim(y_min + 0.5, y_max - 0.5)

Dataset

We will generate a toy dataset using a similar approach, but with fewer data points. The key here is that we want two classes that are not linearly separable, since that is the ideal use case for AdaBoost.

from sklearn.datasets import make_gaussian_quantiles


def make_toy_dataset(n: int = 100, random_seed: int = None):
    """ Generate a toy dataset for evaluating AdaBoost classifiers """

    if random_seed:
        np.random.seed(random_seed)

    # Two concentric Gaussian-quantile classes: not linearly separable
    X, y = make_gaussian_quantiles(n_samples=n, n_features=2, n_classes=2)

    # Convert labels from {0, 1} to {-1, +1}
    return X, y * 2 - 1


X, y = make_toy_dataset(n=10, random_seed=10)
plot_adaboost(X, y)

[Figure: scatter plot of the generated toy dataset]

Benchmark using scikit-learn

Let's establish a benchmark by importing AdaBoostClassifier from scikit-learn and fitting it to our dataset, to see what the output of our model should look like.

from sklearn.ensemble import AdaBoostClassifier

benchmark = AdaBoostClassifier(n_estimators=10, algorithm='SAMME').fit(X, y)
plot_adaboost(X, y, benchmark)

train_err = (benchmark.predict(X) != y).mean()
print(f'Train error: {train_err:.1%}')

[Figure: decision boundary of the scikit-learn AdaBoost benchmark on the toy dataset]

The classifier fully fits the training dataset within 10 iterations, and the data points in our dataset are reasonably separated.

Write your own AdaBoost classifier

Below is the skeleton code for our AdaBoost classifier. After fitting the model we will save all the key attributes on the class, including the sample weights at each iteration, so that we can inspect them later to understand what our algorithm does at each step.

The following table shows the mapping between the variable names we will use and the mathematical symbols used earlier in the algorithm description.

| Variable         | Math          |
| ---------------- | ------------- |
| `sample_weights` | $w_i^{(t)}$   |
| `stumps`         | $h_t(x)$      |
| `stump_weights`  | $\alpha_t$    |
| `errors`         | $\epsilon_t$  |
| `predict(X)`     | $H_t(x)$      |

class AdaBoost:
    """ AdaBoost ensemble classifier built from scratch """

    def __init__(self):
        self.stumps = None
        self.stump_weights = None
        self.errors = None
        self.sample_weights = None

    def _check_X_y(self, X, y):
        """ Validate assumptions about the format of the input data """
        assert set(y) == {-1, 1}, 'Response variable must be ±1'
        return X, y

Fitting the model

Recall the steps of our algorithm for fitting the model:

  1. Find the weak learner $h_t(x)$ that minimizes the weighted error $\epsilon_t = \sum_{i=1}^{n} w_i^{(t)} \, \mathbb{1}\!\left[h_t(x_i) \neq y_i\right]$.
  2. Set the weak learner's weight based on its accuracy: $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
  3. Increase the weights of misclassified observations: $w_i^{(t+1)} = w_i^{(t)} \, e^{-\alpha_t \, y_i \, h_t(x_i)}$. Note that $y_i \, h_t(x_i)$ evaluates to +1 when the hypothesis agrees with the label and to -1 when it disagrees.
  4. Renormalize the weights so that $\sum_i w_i^{(t+1)} = 1$.

The following code is essentially a one-to-one implementation of the above, but there are a few points to note:

  • Since the focus here is understanding the ensemble aspects of AdaBoost, we call DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2) to implement the logic of selecting each $h_t(x)$.
  • We set the initial uniform sample weights outside the for loop, and inside each iteration $t$ we set the weights for $t+1$, unless it is the last iteration. Here we deliberately save the sample weights of every iteration on the fitted model, so that we can visualize them later.
from sklearn.tree import DecisionTreeClassifier


def fit(self, X: np.ndarray, y: np.ndarray, iters: int):
    """ Fit the model using training data """

    X, y = self._check_X_y(X, y)
    n = X.shape[0]

    # Initialize numpy arrays
    self.sample_weights = np.zeros(shape=(iters, n))
    self.stumps = np.zeros(shape=iters, dtype=object)
    self.stump_weights = np.zeros(shape=iters)
    self.errors = np.zeros(shape=iters)

    # Initialize weights uniformly
    self.sample_weights[0] = np.ones(shape=n) / n

    for t in range(iters):
        # Fit the weak learner using the current sample weights
        curr_sample_weights = self.sample_weights[t]
        stump = DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2)
        stump = stump.fit(X, y, sample_weight=curr_sample_weights)

        # Calculate the error and stump weight from the weak learner's predictions
        stump_pred = stump.predict(X)
        err = curr_sample_weights[(stump_pred != y)].sum()
        stump_weight = np.log((1 - err) / err) / 2

        # Update the sample weights
        new_sample_weights = (
            curr_sample_weights * np.exp(-stump_weight * y * stump_pred)
        )
        new_sample_weights /= new_sample_weights.sum()

        # If this is not the final iteration, store the sample weights for t+1
        if t + 1 < iters:
            self.sample_weights[t + 1] = new_sample_weights

        # Save the results of this iteration
        self.stumps[t] = stump
        self.stump_weights[t] = stump_weight
        self.errors[t] = err

    return self

Make predictions

We make the final prediction by "weighted majority vote": we compute the sign (±) of the linear combination of each stump's prediction and its corresponding stump weight.

$$H_t(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t \, h_t(x)\right)$$

def predict(self, X):
    """ Make predictions using the already fitted model """
    stump_preds = np.array([stump.predict(X) for stump in self.stumps])
    return np.sign(np.dot(self.stump_weights, stump_preds))

Performance

Now let’s put everything together and fit the model with the same parameters as our benchmark.

# Assign our individually defined functions as methods of the classifier
AdaBoost.fit = fit
AdaBoost.predict = predict

clf = AdaBoost().fit(X, y, iters=10)
plot_adaboost(X, y, clf)

train_err = (clf.predict(X) != y).mean()
print(f'Train error: {train_err:.1%}')

[Figure: decision boundary of our from-scratch AdaBoost classifier on the same dataset]

Not bad! We achieved exactly the same result as the sklearn benchmark. I chose this dataset to show the strengths of AdaBoost, but you can run the notebook yourself to see whether the outputs match regardless of the starting conditions.
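As a quick check of that claim (a sketch, not part of the original notebook), something like the following could compare the two implementations across a few random seeds; it assumes the make_toy_dataset, AdaBoost, fit, and predict definitions from above are already in scope.

# Hypothetical sanity check: compare the train error of both implementations
# across several random seeds (assumes the definitions above are in scope).
from sklearn.ensemble import AdaBoostClassifier

for seed in (1, 5, 10, 42):
    X_s, y_s = make_toy_dataset(n=100, random_seed=seed)

    ours = AdaBoost().fit(X_s, y_s, iters=10)
    ours_err = (ours.predict(X_s) != y_s).mean()

    bench = AdaBoostClassifier(n_estimators=10, algorithm='SAMME').fit(X_s, y_s)
    bench_err = (bench.predict(X_s) != y_s).mean()

    print(f'seed={seed}: ours={ours_err:.1%}, sklearn={bench_err:.1%}')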

Visualization

Since we save all the intermediate variables as arrays on the fitted model, we can use the following function to visualize how our ensemble learner evolves at each iteration $t$.

  • The left column shows the selected weak learner "stump" for that iteration, which corresponds to $h_t(x)$.
  • The right column shows the cumulative strong learner so far, $H_t(x)$.
  • The size of the data point markers reflects their relative weight. Data points misclassified in the previous iteration are weighted more heavily and therefore appear larger in the next iteration.
from copy import deepcopy


def truncate_adaboost(clf, t: int):
    """ Truncate a fitted AdaBoost up to (and including) a particular iteration """
    assert t > 0, 't must be a positive integer'
    new_clf = deepcopy(clf)
    new_clf.stumps = clf.stumps[:t]
    new_clf.stump_weights = clf.stump_weights[:t]
    return new_clf


def plot_staged_adaboost(X, y, clf, iters=10):
    """ Plot the weak learner and the cumulative strong learner at each iteration """

    # Larger grid: one row per iteration, weak learner left, strong learner right
    fig, axes = plt.subplots(figsize=(8, iters * 3),
                             nrows=iters,
                             ncols=2,
                             sharex=True,
                             dpi=100)

    for i in range(iters):
        ax1, ax2 = axes[i]

        # Plot the weak learner of this iteration
        ax1.set_title(f'Weak learner at t={i + 1}')
        plot_adaboost(X, y, clf.stumps[i],
                      sample_weights=clf.sample_weights[i], ax=ax1)

        # Plot the cumulative strong learner up to this iteration
        trunc_clf = truncate_adaboost(clf, t=i + 1)
        ax2.set_title(f'Strong learner at t={i + 1}')
        plot_adaboost(X, y, trunc_clf,
                      sample_weights=clf.sample_weights[i], ax=ax2)

    plt.tight_layout()


plot_staged_adaboost(X, y, clf)

[Figure: weak learner (left) and cumulative strong learner (right) at each iteration t = 1…10]

Why do some iterations have no decision boundary?

You may notice that our weak learner classifies all points as positive at iterations t = 2, 5, 7, 10. This happens because, given the current sample weights, the lowest error is achieved by simply predicting every data point as positive. Note that in each of these iterations the negative samples are surrounded by positive samples with proportionally higher weights.

There is no way to draw a linear decision boundary that correctly classifies any number of negative data points without misclassifying a higher cumulative weight of positive samples. However, this does not prevent our algorithm from converging. All the negative points are misclassified, so their sample weights increase. This weight update enables the weak learner in the next iteration to find a meaningful decision boundary.
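One way to confirm this behaviour (a small sketch, assuming the fitted clf and the training data X from above) is to check which stumps predict only a single class on the training set:

# Sketch: list the iterations whose stump predicts a single class on X
# (assumes clf is the AdaBoost instance fitted above).
for t, stump in enumerate(clf.stumps, start=1):
    classes = np.unique(stump.predict(X))
    if len(classes) == 1:
        print(f'iteration {t}: stump predicts only class {classes[0]:+.0f}')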

Why do we use that particular formula for $\alpha_t$?

Why do we use this particular value of $\alpha_t$? We can show that choosing $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$ minimizes the exponential loss $L_{\exp} = \frac{1}{n}\sum_{i=1}^{n} e^{-y_i H(x_i)}$ over the training set.

Ignoring the sign function, our strong learner $H$ at iteration $t$ is a weighted combination of weak learners $h(x)$. At any given iteration $t$ we can define $H_t(x)$ recursively as its value at iteration $t-1$ plus the weighted weak learner of the current iteration.

$$H_t(x) = H_{t-1}(x) + \alpha_t \, h_t(x)$$

The loss function we apply to $H$ is the average loss over all $n$ data points. Substituting the recursive definition of $H_t(x)$ and splitting the exponential term with the identity $e^{a+b} = e^a e^b$ gives:

$$L = \frac{1}{n}\sum_{i=1}^{n} e^{-y_i H_t(x_i)} = \sum_{i=1}^{n} D_t(i)\, e^{-y_i \alpha_t h_t(x_i)}, \qquad \text{where } D_t(i) = \frac{1}{n}\, e^{-y_i H_{t-1}(x_i)}$$

Now we take the derivative of the loss function with respect to $\alpha_t$ and set it to zero to find the value that minimizes the loss. The sum can be split in two: the cases where $h_t(x_i) = y_i$ and the cases where $h_t(x_i) \neq y_i$.

$$\frac{\partial L}{\partial \alpha_t} = -e^{-\alpha_t}\sum_{i:\,h_t(x_i) = y_i} D_t(i) + e^{\alpha_t}\sum_{i:\,h_t(x_i) \neq y_i} D_t(i) = 0$$

Finally, we recognize that the sum of the weights over the misclassified points is equivalent to the error calculation discussed earlier: $\sum_{i:\,h_t(x_i) \neq y_i} D_t(i) = \epsilon_t$. Rearranging and applying some algebra allows us to isolate $\alpha_t$.
 

$$\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$$
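As a quick numerical sanity check (a sketch, not from the original post), we can evaluate the exponential loss over a grid of $\alpha$ values for the hypothetical iteration used earlier and confirm that the minimum lands at $\frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$:

# Sketch: numerically verify that alpha = 0.5 * ln((1 - eps) / eps)
# minimizes the exponential loss for fixed weights D_t(i) (made-up values).
import numpy as np

D = np.array([0.25, 0.25, 0.25, 0.25])  # hypothetical weights D_t(i)
y = np.array([1, 1, -1, -1])
h = np.array([1, 1, -1, 1])             # one misclassified point -> eps = 0.25

alphas = np.linspace(0.01, 2, 1000)
losses = [(D * np.exp(-a * y * h)).sum() for a in alphas]

eps = D[h != y].sum()
analytic = 0.5 * np.log((1 - eps) / eps)
print(alphas[np.argmin(losses)], analytic)  # both approximately 0.55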

