Random Forest Algorithm

Pradeep Dhote
2 min read · Jan 7, 2021

Decision trees are one such model with low bias but high variance: as we have seen, they tend to overfit the data. To overcome this problem we use bagging.

The bagging technique is a very good solution for decreasing the variance of a decision tree. Instead of using a plain bagging model with a decision tree as the underlying model, we can also use a random forest, which is more convenient and better optimized for decision trees.

The main issue with bagging is that there is not much independence among the bootstrapped datasets: the trees trained on them are correlated, which limits how much the variance can be reduced.

The advantage of random forests over plain bagging models is that random forests tweak the bagging algorithm to decrease the correlation between trees. The idea is to introduce more randomness while growing the trees, which helps reduce that correlation.
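As a rough sketch of this difference, here is a comparison using scikit-learn's real BaggingClassifier and RandomForestClassifier classes; the synthetic dataset and hyperparameter values are illustrative choices of mine, not from the original article. Bagged trees may split on any feature, while the random forest restricts each split to a random feature subset via max_features, which injects the extra randomness described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset (20 features, a few informative)
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# Plain bagging: every tree may split on any of the 20 features,
# so trees built on overlapping bootstrap samples stay correlated.
bagging = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=100, random_state=0)

# Random forest: each split considers only a random subset of
# features (sqrt(20) ~ 4 here), which decorrelates the trees.
forest = RandomForestClassifier(n_estimators=100,
                                max_features="sqrt", random_state=0)

print("bagging accuracy:", cross_val_score(bagging, X, y).mean())
print("forest accuracy: ", cross_val_score(forest, X, y).mean())
```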

A random forest is a kind of ensemble algorithm that uses decision trees in a randomized way. The resulting forest is not biased, since there are multiple trees and each tree is trained on a different subset of the data.

Let’s understand how the algorithm works:

  1. Pick N random samples (records) from the dataset using bootstrapping (sampling with replacement) to create a new dataset, called the bootstrap sample.
  2. Select a random subset of features (variables) from the bootstrap sample and build a decision tree of high depth on it.
  3. Choose the number of decision trees you want in your model and repeat steps 1 and 2 for each tree.
  4. For a regression problem, each tree in the forest predicts a value for Y (the output) for a new record, and the final value is the average of the values predicted by all the trees in the forest. For a classification problem, each tree predicts the category to which the new record belongs, and the record is assigned to the category that wins the majority vote. (A from-scratch sketch of these steps follows the list.)
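Here is a minimal from-scratch sketch of the four steps above. The function names and NumPy-based structure are my own illustration, not a library API, and it follows the steps literally (one random feature subset per tree); production implementations such as scikit-learn instead re-draw the feature subset at every split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees=100, seed=0):
    """Steps 1-3: bootstrap the rows, randomize the features, grow deep trees."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    k = max(1, int(np.sqrt(n_features)))  # size of each random feature subset
    forest = []
    for _ in range(n_trees):
        # Step 1: draw N records with replacement -> the bootstrap sample
        rows = rng.integers(0, n_samples, size=n_samples)
        # Step 2: pick a random subset of features for this tree
        cols = rng.choice(n_features, size=k, replace=False)
        # Grow a deep (unpruned) tree on the bootstrap sample
        tree = DecisionTreeClassifier(max_depth=None, random_state=0)
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def predict(forest, X):
    """Step 4 (classification): each tree votes and the majority wins.
    For regression, you would average the per-tree predictions instead."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])
    # Majority vote per record; assumes integer class labels
    majority = lambda v: np.bincount(v.astype(int)).argmax()
    return np.apply_along_axis(majority, 0, votes)
```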

Advantages and Disadvantages of Random Forest:

1) It can be used for both regression and classification problems (a usage sketch follows this list).

2) Since the base model is a tree, handling missing values is easy.

3) It gives very accurate results with very low variance.

4) The results of a random forest are much harder to interpret than those of a single decision tree.

5) It has a higher computational cost than comparable models.
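Since the same API covers both tasks (point 1 above), here is a quick usage sketch with scikit-learn; the classes and methods are real scikit-learn APIs, while the datasets and hyperparameter values are just illustrative.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: the forest takes a majority vote across trees
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: the forest averages the per-tree predictions
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("regression R^2:", reg.score(X_te, y_te))
```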

Random forest should be used where accuracy is the utmost priority and interpretability is not very important, and where the extra computational time is a price worth paying for the desired outcome.
