AdaBoost, Gradient Boosting, and XGBoost Ensemble Models

Pradeep Dhote
7 min read · Aug 5, 2020

AdaBoost is a boosting ensemble model that works especially well with decision trees. Boosting models learn from previous mistakes (e.g. misclassified data points) by increasing the weights of those misclassified points.

Algorithm intuition:

Example: to understand this algorithm, we'll use the following simple dataset for heart patient prediction.

  1. Initialize a weight for each of the N observations.

There are a total of 8 observations in the dataset, so each sample weight is w = 1/N = 1/8 at the beginning. Initially, all samples are equally important.

2. We consider each individual column (feature) to create a weak decision-maker (called a stump) and figure out which predictions are correct and incorrect based on that column.

# select the feature with the lowest Gini index / highest information gain

Now calculate the Gini index of each individual stump using the formula

G.I. = Σ (weight of the leaf) × (1 − (p² + (1 − p)²))

G.I. for the chest pain stump = 0.47
G.I. for the blocked arteries stump = 0.5
G.I. for the body-weight stump = 0.2

We select the stump with the lowest Gini index; this becomes the first decision-maker for our model. It makes its predictions, and we then calculate the error based on the stump's results.
As this stump classified only two of the 8 data points incorrectly, the
total error = 2/8 = 0.25.
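As a rough sketch of how the weighted Gini index of one candidate stump can be computed (the labels, split, and weights below are hypothetical placeholders, not the article's actual table):

```python
import numpy as np

def stump_gini(y, split_mask, weights):
    """Weighted Gini index of a single-feature stump.

    y          : binary labels (0/1)
    split_mask : boolean array, True for samples falling in the left leaf
    weights    : current AdaBoost sample weights
    """
    gini = 0.0
    for mask in (split_mask, ~split_mask):              # the two leaves of the stump
        w = weights[mask]
        if w.sum() == 0:
            continue
        p = np.average(y[mask], weights=w)              # weighted fraction of positives in the leaf
        leaf_gini = 1 - (p**2 + (1 - p)**2)             # Gini impurity of this leaf
        gini += (w.sum() / weights.sum()) * leaf_gini   # weight of the leaf x its impurity
    return gini

# Hypothetical 8-sample illustration (not the article's heart-patient values)
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
chest_pain = np.array([True, True, False, False, True, False, True, False])
w = np.full(8, 1 / 8)
print(stump_gini(y, chest_pain, w))
```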

3. Now calculate the contribution of this tree (stump), also called the performance of the stump.

performance of stump = ½ · log((1 − total error) / total error)
= ½ · log(0.75 / 0.25)
= ½ × log(3)
= ½ × 1.0986
= 0.549

4. Calculate the new weight for each sample: increase it for misclassified points and decrease it for correctly classified points.

New weight = old weight × e^(± performance)
where
[+ for misclassified points]
[− for correctly classified points]
Increase the sample weight for incorrectly classified data points: new weight = old weight × e^(+performance) = 1/8 × e^0.549 ≈ 0.21
Decrease the sample weight for correctly classified data points: new weight = old weight × e^(−performance) = 1/8 × e^(−0.549) ≈ 0.072

5. Normalize the new weights so that their sum is one.

Adding all the updated weights gives 0.85. Hence, for normalization, we divide every updated weight by 0.85 and obtain the normalized sample weights shown below:
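The arithmetic of steps 3 to 5 can be reproduced in a few lines; this is a minimal sketch assuming the first two samples are the misclassified ones (the article's table is an image, so the positions here are placeholders):

```python
import numpy as np

weights = np.full(8, 1 / 8)                             # step 1: equal initial weights
misclassified = np.array([True, True] + [False] * 6)    # hypothetical: which 2 samples were wrong

total_error = weights[misclassified].sum()                      # 2/8 = 0.25
performance = 0.5 * np.log((1 - total_error) / total_error)     # ~0.549

# step 4: raise weights of mistakes, lower weights of correct predictions
new_weights = np.where(misclassified,
                       weights * np.exp(+performance),  # ~0.21
                       weights * np.exp(-performance))  # ~0.072

# step 5: normalize so the weights sum to one
new_weights /= new_weights.sum()
print(performance, new_weights)
```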

6. Then we create the next stump on a dataset prepared using the new updated weights.

7. Now repeat from step 2 until the desired accuracy is achieved and the error is sufficiently reduced.

A few key points for AdaBoost

  1. Sequential
  2. Computational scalability
  3. Exploits the dependency between models
  4. SAMME — Stagewise Additive Modeling using a Multi-class Exponential loss function
  5. Uses decision stumps
  6. Can handle missing values and outliers
  7. Handles mixed predictors as well (quantitative and qualitative)
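In practice these steps are rarely coded by hand. A minimal usage sketch with scikit-learn's AdaBoostClassifier, whose default base learner is exactly the depth-1 decision stump described above (the dataset here is synthetic, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-patient table (the real data is not shown in the text)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# AdaBoost with its default base learner: a decision stump (depth-1 tree)
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```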

Gradient Boosting Machine (GBM)

Gradient Boosting is an ensemble model that uses an average (base) model and decision trees as estimators. It can work with different loss functions and evaluates their gradients.

GBM is derived from the idea of gradient descent, where weights are optimized to minimize the cost function. The result of gradient descent is the same model function as at the beginning, just with better parameters.

Gradient Boosting algorithm

Let's see how the maths works out for the Gradient Boosting algorithm.

We will use a simple example to understand the GBM algorithm: predicting home prices.

Step 1: Create the base model (average model). Calculate the average of the target label (home price); this average value is the prediction of the base model.

Step 2: Calculate the residuals from the average prediction and the actual values.

[residual = actual value − predicted value]. After computing the residuals, we get the following table.

Step 3: Now create another model (RM1): construct a decision tree (full depth) that takes the residuals as its target.

Now we have the predicted residual values, so we calculate the new predicted target value.

new predicted target value = previous prediction (here, the base-model average) + LR × predicted residual

LR → learning rate (alpha) = 0.1

Step 4: Now calculate the new residuals (Residuals RM1).

[residual = actual value − new predicted value]. After computing the residuals, we get the following table.

Step 5: Again create another model (RM2): construct a decision tree (full depth) that takes the new Residuals RM1 as its target.

Now we have RM2's predicted residual values, so we again calculate the new predicted target value.

new predicted target value = previous prediction + LR × predicted residual

Step 6: Fit another model on the residuals that are still left, i.e. repeat steps 2 to 5 until the model starts overfitting, the sum of residuals becomes constant, the maximum number of estimators is reached, or the residuals become zero.

Overfitting can be controlled by consistently checking accuracy on validation data.
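Putting steps 1 to 6 together, a minimal from-scratch sketch of this loop could look like the following; the home prices and the single feature are made-up placeholders, since the article's table is an image, and this is not a production implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Placeholder home prices and one toy feature
y = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
X = np.array([[1], [2], [3], [4], [5]])

lr = 0.1                                          # learning rate (alpha)
prediction = np.full_like(y, y.mean())            # step 1: base model = average

trees = []
for _ in range(100):                              # steps 2-6: boosting rounds
    residuals = y - prediction                    # step 2/4: residuals
    tree = DecisionTreeRegressor()                # step 3/5: full-depth tree fit on residuals
    tree.fit(X, residuals)
    prediction = prediction + lr * tree.predict(X)   # update the previous prediction
    trees.append(tree)

print("final training predictions:", prediction.round(1))
```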

The whole concept is shown in the block diagram below.

XGBoost(Extreme Gradient Boosting)

XGBoost improves the gradient boosting method even further.

XGBoost (extreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm.

It has high predictive power and is reported to be up to roughly 10 times faster than other gradient boosting implementations. It also includes a variety of regularization techniques that reduce overfitting and improve overall performance, which is why it is also known as a 'regularized boosting' technique.

  • Implements regularization, which helps reduce overfitting (standard GBM does not);
  • Implements parallel processing, making it much faster than GBM;
  • Allows users to define custom optimization objectives and evaluation criteria, adding a whole new dimension to the model;
  • Has an in-built routine to handle missing values;
  • Makes splits up to the max_depth specified and then prunes the tree backwards, removing splits beyond which there is no positive gain (pruning);
  • Allows the user to run cross-validation at each iteration of the boosting process, making it easy to get the exact optimum number of boosting iterations in a single run.
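As a rough illustration of these features (parameter names are from the xgboost scikit-learn wrapper; the data and values are arbitrary):

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the heart-patient table
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,     # maximum number of boosting rounds
    max_depth=4,          # splits grown to this depth, then pruned by gain
    learning_rate=0.1,    # eta, shrinks each tree's contribution
    reg_lambda=1.0,       # L2 regularization term (the lambda in the similarity score)
    n_jobs=-1,            # parallel processing
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```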

Mathematics involved: XGBoost doesn't use entropy or Gini indices. Instead, it utilises the gradient (the error term) and the Hessian for creating the trees. For a regression problem with squared-error loss the Hessian of every sample is 1, so the sum of Hessians is simply the number of residuals. Mathematically, the Hessian is the second-order derivative of the loss at the current estimate, h = ∂²L(y, ŷ)/∂ŷ²; for binary log loss it works out to p(1 − p).
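A small sketch of those two quantities for the binary log-loss case, following the standard gradient/Hessian convention rather than any code from the article:

```python
import numpy as np

def logloss_grad_hess(y_true, raw_score):
    """Gradient and Hessian of binary log loss w.r.t. the raw (log-odds) score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))  # predicted probability
    grad = p - y_true                     # first derivative: prediction - label, i.e. minus the residual
    hess = p * (1.0 - p)                  # second derivative: p(1 - p)
    return grad, hess

# Example: an initial probability of 0.5 corresponds to a raw score of 0
y = np.array([1, 0, 1, 1, 0])
g, h = logloss_grad_hess(y, np.zeros(5))
print(g, h)   # grad = ±0.5, hess = 0.25 for every sample
```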

Example: to understand this algorithm, we'll again use the simple heart-patient dataset.

Step 1: Select the starting probability of the target.

Assume

  • probability = 0.5 for all samples (records)
  • Is heart patient "yes" = 1
  • Is heart patient "no" = 0

Step 2: Calculate the residuals from the assumed probability and the actual values.

[residual = actual value − assumed probability]. After computing the residuals, we get the following table.

Step 3: Now construct the tree using Gain (the same idea as in a decision tree).

Now, for splitting the data into a tree form, calculate

Gain = left similarity + right similarity − similarity of the root

and compute the similarity using the formula

Similarity = (Gradient)² / (Σ previous(p) × (1 − previous(p)) + λ)

where the gradient of a node is the sum of its residuals and λ is the regularization term. While calculating similarity here, λ = 0 (used just for explanation).

The remaining steps are similar to a decision tree: choose the root node, child nodes, and leaf nodes. After getting all the nodes, we compute their similarity scores, which we use to calculate the Gain (root similarity, left similarity, right similarity).
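A rough sketch of that similarity and Gain calculation, with λ = 0 as above; the residuals and the split below are hypothetical placeholders, since the article's table is an image:

```python
import numpy as np

def similarity(residuals, prev_prob, lam=0.0):
    """XGBoost similarity score for a node of a classification tree."""
    return residuals.sum() ** 2 / (np.sum(prev_prob * (1 - prev_prob)) + lam)

# Hypothetical residuals for 8 samples with a starting probability of 0.5
residuals = np.array([0.5, 0.5, 0.5, -0.5, -0.5, -0.5, 0.5, -0.5])
prev_prob = np.full(8, 0.5)

left = np.array([True, True, True, False, False, True, False, False])  # a candidate split
root_sim = similarity(residuals, prev_prob)
left_sim = similarity(residuals[left], prev_prob[left])
right_sim = similarity(residuals[~left], prev_prob[~left])

gain = left_sim + right_sim - root_sim
print(root_sim, left_sim, right_sim, gain)
```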

Step 4: Next, fit the dataset into the tree; based on the tree's output we get the new probability score for each sample.

Learning is done using the equation

new probability score = old value + η × prediction

where

prediction = the output of the leaf node reached by traversing the tree for each sample (record)

η is the learning rate
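Note that in the library's actual formulation this step is taken in log-odds space and then converted back to a probability; a minimal sketch of that update:

```python
import numpy as np

def updated_probability(old_prob, leaf_output, eta=0.3):
    """One XGBoost-style update: step in log-odds space, then map back to a probability."""
    old_log_odds = np.log(old_prob / (1 - old_prob))   # convert probability to log-odds
    new_log_odds = old_log_odds + eta * leaf_output    # old value + eta * prediction
    return 1.0 / (1.0 + np.exp(-new_log_odds))         # sigmoid back to a probability

# Example: starting probability 0.5 and a hypothetical leaf output of 1.0
print(updated_probability(0.5, 1.0, eta=0.3))   # ~0.574
```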

Step 5: Repeat steps 2 to 4 until the model starts overfitting, the sum of residuals becomes constant, the maximum number of estimators is reached, or the residuals become zero.

final output

Below is a link comparing these parameters according to their function and their counterparts across different models.

CatBoost vs. Light GBM vs. XGBoost
