What is Multiple Linear Regression?
Multiple linear regression extends simple linear regression to predict a value from more than one independent variable. In general, it models the relationship between multiple independent variables and one dependent variable.
Let's understand multiple linear regression using the 50-startups data set, which is available on Kaggle. The data set contains data on 50 business startups. Its variables are R&D Spend, Administration, Marketing Spend, State, and Profit. Our goal is to build a model that predicts Profit from the appropriate independent variables.
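Written as an equation, the relationship described above usually takes the following form. The symbols are conventional notation chosen here for illustration; they are not names from the data set.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

Here y is the dependent variable, x_1 to x_p are the p independent variables, \beta_0 is the intercept, \beta_1 to \beta_p are the coefficients the model estimates, and \varepsilon is the error term.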
Required R package
First, install the caTools package and load it with library(). You can then perform the following operations.
Import libraries
install.packages('caTools')
library(caTools)
Note: If you use RStudio, a package needs to be installed only once, but library(caTools) still has to be run in every new R session.
Import the data set
dataset <- read.csv('../input/50-startups/50_Startups.csv')
dim(dataset)
The read.csv() function reads the CSV file, and the dim() function reports how many rows and columns it contains. This data set has 50 rows and 5 columns.
Encoding categorical data
# Encoding categorical data
dataset$State <- factor(dataset$State,
                        levels = c('New York', 'California', 'Florida'),
                        labels = c(1, 2, 3))
All columns in the data set are numeric except the State column, so you first need to encode it. factor() turns State into a categorical variable with three levels, and lm() later expands a factor into dummy variables automatically.
Split the data set into the Training set and Test set
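To see what the encoded column looks like to a regression model, here is a minimal sketch on a toy data frame. The values are made up for illustration and are not taken from 50_Startups.csv. model.matrix() is a base R function that shows the design matrix lm() builds from a formula.

```r
# Toy data frame with made-up values (not taken from 50_Startups.csv)
toy <- data.frame(
  Profit = c(10, 12, 9, 15, 11, 14),
  State  = c('New York', 'California', 'Florida',
             'New York', 'California', 'Florida')
)

# Same encoding as in the article
toy$State <- factor(toy$State,
                    levels = c('New York', 'California', 'Florida'),
                    labels = c(1, 2, 3))

# Design matrix that lm() would build for Profit ~ State
design <- model.matrix(Profit ~ State, data = toy)
print(colnames(design))
print(design)
```

With R's default contrasts, the first level (1, i.e. New York) becomes the baseline, and the other two levels each get their own 0/1 dummy column, State2 and State3.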
set.seed(123)
split <- sample.split(dataset$Profit, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
Setting the seed makes the random split reproducible: the same rows are selected on every run, whereas without it the selection changes with every invocation. Here the data set is split in an 80–20 (train-test) ratio, but you could also use a 70–30 or 60–40 ratio. After splitting, training_set contains 80% of the data and test_set contains 20%.
Train the Multiple Linear Regression model on the Training set
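The effect of the seed can be checked with a small base R sketch. It uses sample() rather than sample.split() from caTools, so it illustrates the idea of a reproducible 80-20 split and is not a reimplementation of what caTools does. The variable names are hypothetical.

```r
n <- 50  # number of rows, as in the 50-startups data set

# Same seed, same random draw
set.seed(123)
first_draw <- sample(seq_len(n), size = round(0.8 * n))
set.seed(123)
second_draw <- sample(seq_len(n), size = round(0.8 * n))
print(identical(first_draw, second_draw))  # TRUE

# Use the drawn row indices to form an 80-20 split
train_idx <- first_draw
test_idx  <- setdiff(seq_len(n), train_idx)
print(length(train_idx))  # 40
print(length(test_idx))   # 10
```

Changing the seed value, or removing the set.seed() calls, would generally give a different set of row indices on each run.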
regressor <- lm(formula = Profit ~ ., data = training_set)
The lm() function creates a regression model. The data set has one dependent variable and multiple independent variables, so the notation formula = Profit ~ . means that Profit is modelled on all the other columns: the dot stands for all the independent variables. The second argument, data, is the data set on which you want to train the model. After running the code above, your regression model is ready.
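The meaning of the dot can be verified on synthetic data. In this sketch the column names x1, x2 and y and all values are made up for illustration. It fits the same model twice, once with the dot and once with the predictors written out.

```r
# Synthetic data with made-up column names and values (illustration only)
set.seed(1)
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
toy$y <- 2 + 3 * toy$x1 - 1.5 * toy$x2 + rnorm(30, sd = 0.1)

# The dot expands to every column of `toy` other than y
fit_dot      <- lm(formula = y ~ ., data = toy)
fit_explicit <- lm(formula = y ~ x1 + x2, data = toy)

print(coef(fit_dot))
print(all.equal(coef(fit_dot), coef(fit_explicit)))  # TRUE
```

Both calls estimate the same coefficients, so the dot is simply shorthand for listing every remaining column.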
Interpreting the model
summary(regressor)
Please see the detailed explanation of interpreting the model in Regression Algorithm Part 1.
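As a brief reminder, the object returned by summary() can also be inspected piece by piece. The sketch below uses synthetic data with made-up names (x1, x2, y), not the startup data, and pulls out the coefficient table and the R-squared values.

```r
# Synthetic data (illustration only); x2 has no real effect on y
set.seed(42)
toy <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
toy$y <- 1 + 2 * toy$x1 + rnorm(40)

fit <- lm(y ~ ., data = toy)
s <- summary(fit)

print(s$coefficients)   # columns: Estimate, Std. Error, t value, Pr(>|t|)
print(s$r.squared)      # share of variance explained
print(s$adj.r.squared)  # R-squared adjusted for the number of predictors
```

A large value in the Pr(>|t|) column for a predictor suggests that its coefficient is not clearly different from zero.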
Predict the Test set results
y_pred <- predict(regressor, test_set)
The predict() function predicts the dependent variable for new observations, here the rows of test_set.
Difference between actual value and predicted value
original <- test_set$Profit  # select the Profit column by name rather than by position
print(original)
predicted <- y_pred
print(predicted)
d <- original - predicted
print(d)
With the code above, you can see the difference between the actual and predicted profit values.
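The individual differences can also be condensed into a single error figure. The sketch below uses hypothetical actual and predicted values, not real model output, to compute the mean absolute error (MAE) and the root mean squared error (RMSE), two common summaries that are not part of the original code.

```r
# Hypothetical actual and predicted profits (not real model output)
actual    <- c(100, 150, 200, 250)
predicted <- c(110, 140, 195, 260)

d <- actual - predicted   # -10, 10, 5, -10

mae  <- mean(abs(d))      # (10 + 10 + 5 + 10) / 4 = 8.75
rmse <- sqrt(mean(d^2))   # sqrt((100 + 100 + 25 + 100) / 4) = sqrt(81.25)

print(mae)
print(rmse)
```

In the article's own workflow, the same two lines could be applied to the d computed from the test set, as long as d is a numeric vector.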
The code is available on my GitHub account.
Thank you.