What is Random Forest Regression?
Random Forest (or Random Decision Forests) is an ensemble learning method for classification, regression, and other tasks. It operates by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Let’s understand Random Forest Regression using the Position_Salaries data set, which is available on Kaggle. This data set consists of a list of positions in a company along with their band levels and associated salaries. It includes a Position column with values ranging from Business Analyst and Junior Consultant to CEO, a Level column ranging from 1 to 10, and a Salary column ranging from $45,000 to $1,000,000.
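For reference, the shape of the data set (showing only the values described above; the intermediate rows are elided) looks like this:
Position            Level    Salary
Business Analyst    1        45000
...                 ...      ...
CEO                 10       1000000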
Required R packages
First, you need to install the randomForest and ggplot2 packages and load their libraries; after that, you can perform the following operations. So let’s start implementing the Random Forest Regression model.
Import libraries
install.packages('randomForest')
install.packages('ggplot2')
library(randomForest)
library(ggplot2)
Note: If you use RStudio, packages need to be installed only once.
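Alternatively, if you want a script that installs a package only when it is missing, a common idiom (shown here as an optional sketch, not part of the original code) is:
if (!requireNamespace('randomForest', quietly = TRUE)) install.packages('randomForest')
if (!requireNamespace('ggplot2', quietly = TRUE)) install.packages('ggplot2')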
Importing the dataset
dataset <- read.csv('../input/position-salaries/Position_Salaries.csv')
dataset <- dataset[2:3]   # drop the Position column; keep only Level and Salary
dim(dataset)              # check the dimensions of the data set
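As a quick sanity check (an optional step, not in the original post), you can inspect the imported data:
head(dataset)  # first rows should show only the Level and Salary columns
str(dataset)   # confirm the column types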
Please refer to Regression Algorithm Part 4 for more information. Here, we will use Random Forest Regression to predict an employee’s salary accurately.
Apply Random Forest Regression to the data set
set.seed(1234)
regressor <- randomForest(x = dataset[1], y = dataset$Salary, ntree = 500)
The randomForest() function creates a Random Forest Regression model. The first argument, x, takes the independent variable(s) and the second, y, takes the dependent variable. The ntree argument sets the number of trees in the forest. You can try ntree with 10, 50, 100, etc., compare the results, and make the model more robust by adding more trees to the forest.
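As a quick sketch of that experiment (the loop below is illustrative and not from the original post), you can refit the model with different values of ntree and compare the prediction for level 6.5:
for (n in c(10, 50, 100, 500)) {
  set.seed(1234)  # reset the seed so only ntree differs between fits
  model <- randomForest(x = dataset[1], y = dataset$Salary, ntree = n)
  cat("ntree =", n, "-> prediction:", predict(model, data.frame(Level = 6.5)), "\n")
}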
Predicting a new result with Random Forest Regression
y_pred <- predict(regressor, data.frame(Level = 6.5))
This code predicts the salary associated with level 6.5 according to the Random Forest Regression model. It returns 160907.7, which is very close to the real value of 160k, so it is a pretty good prediction. On this data set, Random Forest Regression has given the best prediction results after Polynomial Regression.
Visualize the Random Forest Regression results
x_grid <- seq(min(dataset$Level), max(dataset$Level), 0.01)  # fine grid for a smooth curve
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, data.frame(Level = x_grid))), colour = 'blue') +
  ggtitle('Random Forest Regression') +
  xlab('Level') +
  ylab('Salary')
The Random Forest Regression plot looks like that of a Decision Tree Regression model, but we get more steps in the stairs because the prediction averages several decision trees instead of relying on a single tree. Averaging over many trees is what gives the more accurate prediction result.
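To see this averaging directly, predict() for a randomForest model can return each tree’s individual prediction via the predict.all argument; the sketch below reuses the regressor fitted above:
all_preds <- predict(regressor, data.frame(Level = 6.5), predict.all = TRUE)
all_preds$aggregate         # the forest's prediction: the mean over all trees
mean(all_preds$individual)  # averaging the 500 individual trees gives the same value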
The code is available on my GitHub account.
The previous parts of the series (Part 1, Part 2, Part 3, Part 4 and Part 5) covered Linear Regression, Multiple Linear Regression, Polynomial Regression, Support Vector Regression and Decision Tree Regression.
If you liked the blog or found it helpful, please leave a clap!
Thank you.