Implementation and Explanation of the Random Forest in Python

Decision Tree — The Building Block
The decision tree is one of the most popular and powerful tools for classification and prediction. A decision tree is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
Decision Tree Representation: Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. An instance is classified by starting at the root node of the tree, testing the attribute specified by that node, and then moving down the branch corresponding to the value of the attribute. This process is repeated for the subtree rooted at the new node.
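To make this concrete, here is a tiny sketch (not from the article) that fits a decision tree on the Iris toy dataset and prints its flowchart-like structure as text:

```python
# A minimal sketch: fit a small decision tree on the Iris toy dataset and
# print its structure. The dataset and max_depth=2 are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Each internal node is a test on a feature; each leaf holds a class label.
print(export_text(tree, feature_names=list(iris.feature_names)))
```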
What is Random Forest?
Random Forest is an ensemble of decision trees, generally trained via the bagging method, typically with max_samples set to the size of the training set. Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimized for decision trees.
Like RandomForestClassifier, there is a RandomForestRegressor class for regression tasks.
RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown) and all the hyperparameters of a BaggingClassifier (to control the ensemble itself).
The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which trades a slightly higher bias for a lower variance, generally yielding an overall better model.
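As a rough sketch of this equivalence (the hyperparameter values here are illustrative, not from the article), a bagging ensemble of decision trees looks roughly like what RandomForestClassifier gives you out of the box:

```python
# A sketch of the equivalence described above; values like n_estimators=500
# are illustrative assumptions.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),  # random feature subset at each split
    n_estimators=500,
    max_samples=1.0,   # each tree trains on a bootstrap sample the size of the training set
    bootstrap=True,
    n_jobs=-1,
)

# The convenient, optimized equivalent:
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
```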
Implementation of Random Forest
Now that we have basic knowledge about random forests, we will work with a dataset and create a random forest model for it. I will break down the steps and explain each one so that we get a good grasp of the topic.
We will use the Titanic dataset for this tutorial. We will not focus on getting the best possible result; our focus will be on fully understanding Random Forest.
Making The Data Ready
I have downloaded the Titanic dataset from Kaggle and imported it using pandas.
Link to the Kaggle Titanic dataset:
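A minimal sketch of loading the data, assuming the Kaggle training file was saved locally as train.csv (the file name and path are my assumption):

```python
# Load the Titanic training data with pandas; "train.csv" is an assumed local path.
import pandas as pd

titanic = pd.read_csv("train.csv")
print(titanic.shape)
print(titanic.head())
```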
Data Preparation
Dropping the Useless Columns
As we are here just to learn how Random Forest works, we will not do extensive feature engineering. We will simply remove the columns that I think are not useful, although in practice you could extract useful features from some of them.
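A sketch of dropping those columns follows; the exact list (PassengerId, Name, Ticket, Cabin) is my assumption of which columns to treat as not useful here:

```python
# Drop columns we will not feed to the model; this list is an assumption.
useless_cols = ["PassengerId", "Name", "Ticket", "Cabin"]
titanic = titanic.drop(columns=useless_cols)
```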
Removing The Null Values
In the remove_null_values() function below, we replace the null values with the mean value of each column.
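Here is a sketch of what remove_null_values() could look like. Filling numeric columns with their mean matches the description above; filling categorical columns (such as Embarked) with their mode is my own assumption, since a mean is not defined for them:

```python
import pandas as pd

def remove_null_values(df):
    # Fill numeric nulls (e.g. Age) with the column mean, as described above;
    # filling categorical nulls (e.g. Embarked) with the mode is an assumption.
    for col in df.columns:
        if df[col].isnull().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].mean())
            else:
                df[col] = df[col].fillna(df[col].mode()[0])
    return df

titanic = remove_null_values(titanic)
```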
Splitting The Dataset
We divided our data into training and test sets, with 80% of the data used for training and 20% for testing.
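A sketch of the 80/20 split, assuming Survived is the target column:

```python
# Split into 80% train / 20% test; random_state is fixed only for reproducibility.
from sklearn.model_selection import train_test_split

X = titanic.drop(columns=["Survived"])
y = titanic["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```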
Model Pipeline Creation
Here we build a pipeline with a OneHotEncoder and a StandardScaler.
With the OneHotEncoder we encode the sex and embarked columns, which gives us a total of 5 columns (2 categories for sex plus 3 for embarked).
The StandardScaler then takes all 10 columns: of the 7 feature columns in the dataset, 2 go to the OneHotEncoder, leaving 5 numeric columns (7 − 2 = 5); the OneHotEncoder itself produces 5 columns, so in total we have 5 + 5 = 10 columns.
We also used our RandomForestClassifier model with default settings.
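Below is one way to put that pipeline together; the column names ("Sex", "Embarked" and the remaining numeric columns) are assumed from the Kaggle training file, and the step names are my own:

```python
# A sketch of the preprocessing + model pipeline described above.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ["Sex", "Embarked"]          # 2 + 3 categories -> 5 encoded columns
numeric_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare"]  # 5 columns pass through

preprocess = Pipeline([
    # One-hot encode the categorical columns, pass the numeric columns through.
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(), categorical_cols)],
        remainder="passthrough",
        sparse_threshold=0,  # keep output dense so StandardScaler can center it
    )),
    # Scale all 10 resulting columns, as described above.
    ("scale", StandardScaler()),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier()),  # default settings
])

model.fit(X_train, y_train)  # X_train, y_train from the split above
```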
Model Prediction
We made predictions with our model and computed the accuracy_score, which comes out to around 79%. So, without any hyperparameter tuning and with only basic feature engineering, we achieved 79% accuracy.
So how did we get such accuracy without tuning the Random Forest at all?
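A sketch of the evaluation step, using the pipeline fitted above:

```python
# Predict on the held-out 20% and compute the accuracy.
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))  # roughly 0.79 in the article's run
```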
Extra Trees
When you are growing a tree in a random forest, at each node only a random subset of the features is considered for splitting. It is possible to make the trees even more random by also using a random threshold for each feature, rather than searching for the best possible threshold.
A forest of such extremely random trees is called Extremely Randomized Trees (Extra-Trees). This technique trades even more bias for a lower variance.
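Scikit-Learn provides this as ExtraTreesClassifier, which has the same API as RandomForestClassifier; a minimal sketch (hyperparameter values are illustrative):

```python
# Extra-Trees: random feature subsets AND random split thresholds per feature.
from sklearn.ensemble import ExtraTreesClassifier

extra_clf = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, random_state=42)
# extra_clf can be dropped into the pipeline above in place of the random forest.
```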
Feature Importance
Yet another great quality of random forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average.
Scikit-Learn computes this score automatically for each feature after training, then scales the results so that the sum of all importances is equal to 1.
You can access the result using the feature_importances_ attribute.
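A sketch of reading feature_importances_ from the forest inside the pipeline above; the step names ("preprocess", "encode", "forest") match that sketch, and get_feature_names_out requires a reasonably recent scikit-learn:

```python
# Pair each transformed feature name with its importance and print them sorted.
forest = model.named_steps["forest"]
encoder = model.named_steps["preprocess"].named_steps["encode"]
feature_names = encoder.get_feature_names_out()

for name, score in sorted(zip(feature_names, forest.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")
```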

Wrap-Up
Machine learning may seem intimidating at first, but the entire field is just many simple ideas combined to yield extremely accurate models that can ‘learn’ from past data. The random forest is no exception. There are two fundamental ideas behind a random forest, both of which are well known to us from daily life:
- Constructing a flowchart of questions and answers leading to a decision
- The wisdom of the (random and diverse) crowd
It is the combination of these basic ideas that leads to the power of the random forest model.