# Introduction

Random Forest is a supervised classification algorithm, it can classify data according to various given features.

Assuming that we want to determine whether a person is male or female according to his/her weight, height and 100m-race time. Training data is as follows.

Person Weight(kg) Height(meter) 100m-race time(second) Gender
A 50 1.62 18 Female
B 70 1.81 16 Male
C 60 1.72 15 Female
D 70 1.71 19 Male
E 52 1.69 17 Female

We can load these data and train them with the random forest classification algorithm. The model obtained from training could be used for prediction. E.g., We will be able to predict this person’s gender using the trained model.

Weight(kg) Height(meter) 100m-race time(second)
60 1.62 16

Notice that we will mainly focus on how to use random forest and how to write the algorithm from scratch. We won’t dive into the esoteric mathematical principles behind it. After finishing this post, you will be able to understand various parameters seen in third-party random forest implementations.

All the code mentioned in the post is available for download. So please refer to the code if there’s anything unclear in the post.

# Execution

Let’s first run the code that we will write, so we could know what it’s like.

1. Install Python3

4. Execution

So you can see that, we get the highest accuracy with 3 trees, which is about 85%.

# How It Works

Random Forest is rather complex, so let’s use an example.

Person Weight(kg) Height(meter) 100m-race time(second) Gender
A 50 1.62 18 Female
B 70 1.81 16 Male
C 60 1.72 15 Female
D 70 1.71 19 Male
E 52 1.69 17 Female

We mentioned before that we could use these data to train our random forest model, in order to predict new items. So how to train? In fact, training is equivalent to building a tree here. Steps are as follows.

1. Based on D’s height, anyone whose height is less or equal to 1.71m belong to one group, and anyone whose height is greater than 1.71m belong to another group, then we get two groups(Don’t think too much about why to split in this way, this is just an example, we will talk about the reason in detail later).

2. For group A, D, E, based on A’s 100m-race time, anyone whose time is less or equal to 18s belong to one group, and anyone whose time is greater than 18s belong to another group. The same goes to group B, C. Based on C’s height, anyone whose height is less than or equal to 1.72m belong to one group, and anyone whose height is greater than 1.72m belong to another group. After splitting, we get a tree like this.

3. Now only group A, E could be further split. So let’s base on A’s weight, anyone whose weight is less than or equal to 50kg belong to one group, and anyone whose weight is greater than 50kg belong to another group. After that, we mark each leaf node with their genders.

That’s it, a tree in the random forest! Now we can use this tree to predict new data. Assuming we want to predict this person’s gender:

Weight(kg) Height(meter) 100m-race time(second)
60 1.62 16

Just like training, this person’s height is 1.62m, which is less than or equal to 1.71, so he/she belongs to group B, C in the second layer. Again, compare based on his/her height, which is less than or equal to 1.72m, so he/she belongs to leaf node C, which means the prediction result is Female. This is the whole process of prediction.

# The Principle To Split A Tree Into Two Groups

In the above example, we first split the whole data into two groups according to D’s height, then continue to split them according to D’s height, A’s weight, etc. What’s going on here? It seemed that we were casually splitting the data with no principle. OK, I concede that it’s true. I just want to show you guys how to build a random forest tree. In fact, the genuine tree-building-process would split the data according to gini index. E.g., assuming we split the data according to A’s weight, we will get two groups of data: A and B, C, D, E. Let’s call them group1 and group2 respectively, then we can calculate gini index according to the following equation.

$gini$ = [1 - (the number of males in group1 / the number of people in group1)$^2$ - (the number of females in group1 / the number of people in group1)$^2$] $\times$ (the number of people in group1 / the total number of people in both groups) + [1 - (the number of males in group2 / the number of people in group2)$^2$ - (the number of females in group2 / the number of people in group2)$^2$] $\times$ (the number of people in group2 / the total number of people in both groups)

So the gini index should be calculated as follows if we split the data based on A’s weight.
$$gini = 0 + (1 - 0.25 - 0.25) \times 0.8 = 0.4$$

We can also split the data based on A’s height, to get another gini index.
$$gini = 0 + (1 - 0.25 - 0.25) \times 0.8 = 0.4$$
We can also split based on A’s 100m-race time, B’s weight, B’s height, …, E’s 100m-race time, 3 x 5 = 15 ways in total. We calculate the gini index for each of the 15 ways, and choose the one with the smallest gini index. So we should split based on D’s weight if we got the smallest gini index based on D’s weight. Why choose the smallest one? Because the smaller gini index is, the purer each group will be. We are not going to dive into the reason in detail here because it’s more about the math rather than the implementation.

The code to calculate gini index is as follows.

We use the above piece of code in this way:

# Use multiple trees to boost the accuracy

You may wonder why it’s called the random forest when we only used one tree? Good question! In fact, we shouldn’t only use one tree. The correct process is as follows.

1. Choose 90% of the data randomly for training.
2. Train those data, i.e. the process of building a tree shown above.
3. Use this tree to predict, get the prediction x.
4. Repeat the above three steps, build another tree, get another prediction y.
5. Repeat the first three steps again, get another prediction z.
6. Choose the one that appears the most in x, y, z, which should be our final prediction, return it.

So you should know why it’s call random forest, right? We built 3 trees in total, and got the final result based on 3 predictions obtained from 3 trees. The number 3 can be changed, too. You can also build 5 trees, 10 trees, etc., whatever works out for you. Moreover, the sampling ratio 90% can be changed, too. 80%, 70%, whatever you like.

The purpose of building multiple trees is to avoid overfitting. From Wikipedia:

In statistics, overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”.

# Code

Now that we know how it works, it’s time for us to dive into the code. Notice that some parameters in the code are not mentioned before, so let’s review them together.

1. min_size: when the number of data in some node is less than min_size, further splitting is not allowed. I.e., the current group is taken as a leaf node, the value of the leaf node is determined by the category that appears the most in the group.
2. max_depth: The maximum depth of a tree, further splitting is not allowed when max_depth is exceeded, the value of the node is determined by the category that appears the most in the group.
3. n_features: The number of features chosen to build the current tree. In case if you don’t know what a feature is, weight, height, 100m-race time are both called features in the previous example. We choose n_features features for training each time we build a tree. In this way, features used in each tree is different, which means the final trees we build will be different, so overfitting could be avoid.

Code to implement random forest is as follows.

So how to use this piece of code? Let’s take Sonar, which is real-life data as an example(You can have a glimpse of its contents in here). The last column in Sonar represents category, which are two of them in total, R and M. R means rock and M means metal. The first 60 columns represents data obtained by bouncing sonar signals off a surface(R or M) at various angles and under various conditions. Let’s load these data and split them into two groups, one for training and one for testing. Training data is used to build models, and test data is used to check the accuracy of the model.

The code is as follows.

The result is as follows.

As you can see, we get the highest accuracy with 3 trees(around 85%), we have reason to believe that we could get a better result if further tunning is conducted.

Share