Decision Tree Explained: Classification and Regression Algorithm
The decision tree algorithm is one of the most versatile algorithms in machine learning; it can perform both classification and regression analysis. It is very powerful and works well with complex datasets. Apart from that, it is very easy to understand, which makes it popular to use.
As the name suggests, this algorithm works by dividing the whole dataset into a tree-like structure based on some rules and conditions, and then makes predictions based on those conditions.
Let’s understand the approach to decision trees with a basic scenario. Suppose it’s Friday night, there is a live concert in town, and you are not able to decide whether you should go out or stay at home. Let the decision tree decide it for you.
Let’s understand how a decision tree makes a decision. So how did it work?
- It selects a root node based on a given condition, e.g. our root node was chosen as time > 10 pm.
- Then the root node was split into child nodes based on the given condition.
- The right child node in the above figure fulfilled the condition, so it did not need to be split further.
- The left child node didn’t fulfil the condition, so it was split again based on a new condition.
This process continues until all the conditions are met or until the tree reaches a predefined depth; e.g. the depth of our tree is 3, and it reached that depth when all the conditions were exhausted.
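To see what such a tree amounts to in code, here is a rough sketch of the concert decision as nested if/else checks. Only the root condition (time > 10 pm) comes from the example above; the other conditions and the outcomes are made-up placeholders, purely to show the shape of a depth-3 tree.

```python
# A hand-written sketch of how the concert decision tree evaluates an input.
# Only the root condition (time > 10 pm) comes from the example above;
# the remaining conditions, feature names, and outcomes are illustrative assumptions.
def should_go_out(time_hour, friends_free, feeling_tired):
    if time_hour > 22:            # root node: time > 10 pm
        return "stay home"        # right child: condition fulfilled, leaf node
    # left child: condition not fulfilled, so split again on a new condition
    if not friends_free:          # assumed second-level condition
        return "stay home"
    # depth 3: one more split before reaching a leaf
    return "stay home" if feeling_tired else "go to the concert"

print(should_go_out(time_hour=21, friends_free=True, feeling_tired=False))
```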
How the parent nodes and splitting conditions are chosen is explained below.
Decision Tree for Regression
Before getting into the theory, we need some basic terminology. Trees are drawn upside down. The final regions are termed leaves. The points inside the tree where a split occurs are internal nodes. Finally, the segments that connect nodes are branches.
To create a regression tree:
- Divide the predictor space into J distinct and non-overlapping regions.
- For every observation that falls in a region, predict the mean of the response values in that region.
- Each split is made so as to minimize the RSS (residual sum of squares). To do so, we take a top-down greedy approach, also called recursive binary splitting.
Top-down? Because all observations are in a single region before the first split.
Greedy approach? Because the best split is chosen at each particular step, rather than looking ahead and making a split that would result in a better prediction in a future step.
Recursive binary splitting (greedy approach)
- As mentioned above, we try to divide the X values into J regions, but it is very expensive in terms of computational time to try every possible partition of the X values into J regions.
- Decision trees therefore use a greedy approach in which a node is divided into two regions based on the given condition, i.e. not every node is split further; only those where the condition makes a split worthwhile are divided into two branches. It is called greedy because it makes the best split available at the given step, at that point in time.
- It decides a threshold value (say s) to divide the observations into different regions (see the sketch after this list).
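To make the split criterion concrete, here is a minimal sketch (not the article's own code) of how one greedy split could be chosen for a single predictor: scan candidate thresholds s and keep the one that minimizes the total RSS of the two resulting regions.

```python
import numpy as np

def best_split(x, y):
    """Scan candidate thresholds s for one predictor x and return the
    threshold that minimizes the total RSS of the two resulting regions."""
    best_s, best_rss = None, np.inf
    for s in np.unique(x):
        left, right = y[x <= s], y[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        # RSS of a region = sum of squared deviations from the region's mean response
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.1, 0.9, 1.0, 4.8, 5.2, 5.1])
print(best_split(x, y))  # splits at s = 3, separating the low and high responses
```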
This process has a high chance of over-fitting the training data, as the resulting tree can become very complex.
Pruning
Pruning enables us to avoid over-fitting the regression or classification model.
- The decision tree in our original example is quite simple, but it is not so when the dataset is huge and there are more variables to take into consideration. This is where pruning is required. Pruning refers to the removal of those branches in our decision tree which we feel do not contribute significantly to our decision process.
- Let’s assume that our example data has a variable called Vehicle, which relates to or is derived from the condition Money when it has the value Rich. Now if a Vehicle is available, we go shopping by car, but if it is not available, we go shopping by some other means of transport; either way, in the end we go shopping.
- This implies that the Vehicle variable is not of much significance and can be ruled out while constructing the decision tree.
- The concept of pruning enables us to avoid over-fitting the regression or classification model, so that for a small sample of data, errors in measurement are not baked into the model (see the sketch below).
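In practice, one common way to prune is cost-complexity pruning. The sketch below assumes scikit-learn's DecisionTreeRegressor and its ccp_alpha parameter; the dataset and the alpha value are arbitrary choices made only for illustration.

```python
# Minimal cost-complexity pruning sketch using scikit-learn's ccp_alpha.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree (no pruning) tends to over-fit the training data.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# A larger ccp_alpha removes branches whose complexity is not justified by the fit;
# the value 50.0 here is an arbitrary illustration, normally chosen by validation.
pruned_tree = DecisionTreeRegressor(ccp_alpha=50.0, random_state=0).fit(X_train, y_train)

print("full tree leaves:", full_tree.get_n_leaves(),
      "test R^2:", round(full_tree.score(X_test, y_test), 3))
print("pruned tree leaves:", pruned_tree.get_n_leaves(),
      "test R^2:", round(pruned_tree.score(X_test, y_test), 3))
```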
Decision Tree for Classification
A classification tree is very similar to a regression tree. However, we cannot use the mean value of the response, so we now predict the most commonly occurring class in a region. Of course, RSS cannot be used as a criterion. Instead, each split is done to minimize the classification error rate.
- The decision tree algorithm builds a tree in which nodes represent features (attributes), branches represent decisions (rules), and leaf nodes represent outcomes (discrete or continuous).
- The features are split according to a set of rules defined by an algorithm such as CART, ID3, or C4.5, which is used to build the decision tree.
Different Algorithms for Decision Tree
ID3 (Iterative Dichotomiser 3): It is one of the algorithms used to construct decision trees for classification. It uses information gain as the criterion for finding the root nodes and splitting them. It only accepts categorical attributes.
C4.5: It is an extension of the ID3 algorithm and an improvement on ID3, as it deals with both continuous and discrete values. It is also used for classification purposes.
Classification and Regression Tree (CART): It is the most popular algorithm used for constructing decision trees. It uses Gini impurity as the default criterion for selecting root nodes; however, one can use “entropy” as the criterion as well. This algorithm works on both regression and classification problems. We will use this algorithm in our Python implementation.
Let’s see a decision tree with this simple example.
It is the normal “AND” operation problem, where ‘A’ and ‘B’ are features and “A AND B” is the corresponding label.
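A minimal sketch of this AND example, assuming scikit-learn's DecisionTreeClassifier (which implements an optimized version of CART), could look like this:

```python
from sklearn.tree import DecisionTreeClassifier

# Truth table for the AND operation: features A, B and label A AND B.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]

# criterion can be "gini" (the default) or "entropy".
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

print(clf.predict([[1, 1], [1, 0]]))  # -> [1 0]
```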
Before going deeper into the concept, let’s understand impurity and the different measures of impurity; this will help us understand the algorithm better.
Impurity
Let’s understand Impurity with the following toy example.
In the image above, a ball is randomly drawn from each bowl. How much information do you need to accurately tell the colour of the ball? The left bowl needs the least information, as all of its balls are red; the central bowl needs more information than the left bowl to tell the colour accurately; and the right bowl needs the most information, as it contains the same number of balls of each colour.
Since the amount of information needed reflects the impurity, we can say that the left bowl is a pure node, the middle bowl is somewhat impure, and the right bowl is the most impure.
There are a couple of impurity measures:
Entropy
Entropy is a measure of impurity. Just as disorder or uncertainty in a collection of information reflects randomness, entropy is a measure of that randomness.
In other words, entropy is the amount of information needed to accurately describe a sample. If a sample is homogeneous, meaning all its elements are similar, the entropy is 0; if the sample is equally divided between the classes, the entropy reaches its maximum of 1. So the left bowl has the lowest entropy, the middle bowl has more entropy, and the right bowl has the highest entropy.
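As a quick sketch, the entropy of each bowl can be computed from its class counts; the actual counts below are assumptions, since the figure only conveys the relative mix of colours.

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy: -sum(p * log2(p)) over the class probabilities."""
    counts = np.array(class_counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                       # ignore empty classes (log2(0) is undefined)
    return float(-(p * np.log2(p)).sum())

# Three bowls from the toy example (the exact ball counts are assumptions):
print(entropy([8, 0]))   # left bowl: all one colour      -> 0.0 (pure)
print(entropy([6, 2]))   # middle bowl: mostly one colour -> ~0.81
print(entropy([4, 4]))   # right bowl: evenly split       -> 1.0 (maximum impurity)
```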
Information Gain
Information gain calculates the reduction in entropy after splitting a node. It is the difference between the entropies before and after the split: the more the information gain, the more entropy is removed.
Gain(T, X) = Entropy(T) − Entropy(T, X)
where T is the parent node before the split, X is the attribute used to split T, and Entropy(T, X) is the weighted entropy of the child nodes produced by that split.
- It is commonly used in the construction of decision trees from a training dataset, by evaluating the information gain for each variable and selecting the variable that maximizes the information gain.
- Information gain is calculated by comparing the entropy of the dataset before and after a transformation, as in the sketch below.
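A minimal sketch of that calculation, with made-up counts for the parent and child nodes:

```python
import numpy as np

def entropy(class_counts):
    p = np.array(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def information_gain(parent_counts, children_counts):
    """Gain(T, X) = Entropy(T) - weighted average entropy of the children after split X."""
    total = sum(sum(c) for c in children_counts)
    weighted = sum((sum(c) / total) * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

# A 10-sample parent node (5 of each class) split into a pure child and a mixed child;
# the counts are made-up numbers, purely for illustration.
print(information_gain([5, 5], [[4, 0], [1, 5]]))  # ~0.61: the split removes ~0.61 bits of entropy
```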
A tree split on the basis of entropy and information gain values looks like:
Gini index / Gini impurity
The Gini index is a measure of how often a randomly chosen element from the set would be incorrectly labelled.
The Gini index measures the inequality in a sample and takes values between 0 and 1. A Gini index of 0 means the sample is perfectly homogeneous and all elements are similar, whereas higher values mean greater inequality among the elements (for a two-class problem the maximum is 0.5). It is computed from the squared probabilities of each class.
The Gini index is calculated by subtracting the sum of the squared probabilities of each class from one: Gini = 1 − Σ pᵢ², as in the sketch below.
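A small sketch of that calculation, reusing the toy bowls from above (the counts are still assumptions):

```python
def gini(class_counts):
    """Gini impurity: 1 - sum(p_i^2) over the class probabilities."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# Same toy bowls as before (the ball counts are illustrative assumptions):
print(gini([8, 0]))  # 0.0   -> perfectly homogeneous
print(gini([6, 2]))  # 0.375
print(gini([4, 4]))  # 0.5   -> maximum impurity for two classes
```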
A tree split on the basis of Gini impurity values looks like:
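As a stand-in sketch, the following trains a small Gini-based tree and prints its splits; the iris dataset and scikit-learn's export_text are assumptions made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Print the learned structure: each internal node shows the feature and threshold
# chosen (here by minimizing Gini impurity), down to the leaves.
print(export_text(clf, feature_names=list(iris.feature_names)))
```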