What is Linear Regression?
Linear Regression is a supervised machine learning algorithm that predicts values within a continuous range. A Linear Regression model predicts the target (dependent) value from one or more independent variables.
Let’s understand Linear Regression using the salary data set, which is available on Kaggle. This data set contains the salaries and years of experience of 35 jobholders. Our goal is to build a model that can predict the salary when the years of experience are provided. Using the training data, we obtain the regression line that minimizes the sum of squared errors. This linear equation is then applied to new data. That is, if we give a jobholder’s years of experience as input, our model should predict their salary with minimum error. With the help of Linear Regression, we will now find the relationship between salary and years of experience using the R language.
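To make the idea of a "line with minimum error" concrete, here is a minimal sketch in base R that computes a least-squares line by hand. The numbers are invented for illustration and are not taken from the Kaggle data set.

```r
# Toy data (hypothetical values, not from the Kaggle file)
years  <- c(1, 2, 3, 4, 5)
salary <- c(40000, 45000, 50000, 55000, 60000)

# Closed-form least-squares estimates for salary = b0 + b1 * years
b1 <- cov(years, salary) / var(years)   # slope
b0 <- mean(salary) - b1 * mean(years)   # intercept

# Use the fitted line to predict the salary for 6 years of experience
predicted <- b0 + b1 * 6
cat("Intercept:", b0, "Slope:", b1, "Prediction:", predicted, "\n")
```

The lm() function used later in this article does the same kind of fit for you, so you normally never compute the slope and intercept by hand.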
Required R packages
First, you need to install the caTools and ggplot2 packages and load them with library(). After that, you can perform the following operations.
Import libraries
install.packages('caTools')
install.packages('ggplot2')
library(caTools)
library(ggplot2)
Note: Packages need to be installed only once on your machine, whether or not you use RStudio. The library() calls, however, must be run again in every new R session.
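If you rerun the script often, you may prefer to check what is already installed instead of calling install.packages() every time. The following is a small sketch of that idea; missing_packages is a hypothetical helper name introduced here, not part of any package.

```r
# Return the names in `pkgs` that are not installed yet.
# requireNamespace() loads a package's namespace if it is available
# and returns FALSE (instead of stopping) when it is not.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("caTools", "ggplot2"))
if (length(to_install) > 0) {
  # Only report here; run install.packages(to_install) yourself once.
  message("Not installed yet: ", toString(to_install))
} else {
  message("All required packages are already installed.")
}
```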
Import the data set
dataset <- read.csv('../input/salary/Salary.csv')
dim(dataset)
The read.csv() function reads the CSV file, and the dim() function returns the number of rows and columns it contains.
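Beyond dim(), a few other base R functions help you get a first look at the data. The sketch below writes a tiny stand-in file to a temporary path so that it runs anywhere; the values are invented, and only the column names YearsExperience and Salary follow the data set used in this article.

```r
# Hypothetical stand-in for Salary.csv (values invented for illustration)
toy <- data.frame(
  YearsExperience = c(1.1, 2.0, 3.2, 4.5),
  Salary          = c(39000, 43500, 54000, 61000)
)
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

dataset <- read.csv(path)
print(dim(dataset))              # number of rows, then number of columns
print(head(dataset, 3))          # first three rows
str(dataset)                     # column names and types
print(colSums(is.na(dataset)))   # count of missing values per column
```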
Split the data set into the Training set and Test set
set.seed(123)
split <- sample.split(dataset$Salary, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
cat(' Dimension of training data:', dim(training_set), "\n",'Dimension of testing data:', dim(test_set))
If you set the seed value, the same random numbers are generated on every run; otherwise, different numbers are generated with every invocation. Here, the data set is split in an 80–20 (train-test) ratio, but you can also use a 70–30 or 60–40 ratio. After splitting, training_set contains about 80% of the data and test_set contains the remaining 20%.
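The effect of the seed is easy to check for yourself. The sketch below uses base R's sample() rather than caTools::sample.split(), so it is an alternative illustration and not the split used in this article; the row count of 35 matches the data set described above.

```r
# Same seed, same random draw
set.seed(123); first  <- sample(1:10, 3)
set.seed(123); second <- sample(1:10, 3)
print(identical(first, second))

# An 80-20 split by row index for a data set with 35 rows
n <- 35
set.seed(123)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)
cat("Training rows:", length(train_idx), "Test rows:", length(test_idx), "\n")
```

With a data frame named dataset, the two parts would then be dataset[train_idx, ] and dataset[test_idx, ].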
Train the Simple Linear Regression model on the Training set
regressor <- lm(formula = Salary ~ YearsExperience, data = training_set)
The lm() function is used to create a Linear Regression model. If you look at the data set, we have one dependent variable, salary, and one independent variable, years of experience. Therefore, the notation formula = Salary ~ YearsExperience means that salary is modeled as a linear function of years of experience. The second argument, data, takes the data set on which you want to train your regression model. After running the code above, your regression model is ready.
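Once the model is fitted, you can read the intercept and slope of the regression line with coef(). The sketch below uses a small invented training set (same column names, made-up values) and shows that a prediction is simply intercept + slope × years.

```r
# Hypothetical training data (values invented for illustration)
training_set <- data.frame(
  YearsExperience = c(1.1, 2.0, 3.2, 4.5, 5.9, 7.1),
  Salary          = c(39000, 43500, 54000, 61000, 81000, 98000)
)
regressor <- lm(formula = Salary ~ YearsExperience, data = training_set)

# Named vector with "(Intercept)" and "YearsExperience" (the slope)
b <- coef(regressor)
print(b)

# Prediction for 6 years of experience, computed by hand and via predict()
manual      <- unname(b["(Intercept)"] + b["YearsExperience"] * 6)
via_predict <- unname(predict(regressor, newdata = data.frame(YearsExperience = 6)))
print(all.equal(manual, via_predict))
```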
Interpreting the model
summary(regressor)
You can interpret your model using the summary() function. After running this code, you will see output that starts with the call, which shows the formula: salary modeled as a linear function of the years of experience, with the model built on the training set.
Then you have some information about the residuals and coefficients. The coefficients table tells you the statistical significance of each coefficient through its stars: a row can have no star, one star, two stars, or three stars. No star means that there is no statistical significance at the conventional levels, and three stars mean that there is high statistical significance. If you observe three stars, the independent variable is highly statistically significant. You can also see other information such as the R-squared and the P-value. The P-value is another indicator of statistical significance: the lower the P-value, the more significant your independent variable is. Here, the P-value is reported as 2.2e-16 (that is, 2.2 × 10^-16), which is a very small P-value. That means this independent variable is highly significant.
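The numbers printed by summary() can also be pulled out programmatically, which is handy when you want to use them in later code. This is a sketch on simulated data, so the values will differ from the ones you see for the salary data set; toy and fit are illustrative names.

```r
# Simulated data: salary grows with experience, plus random noise
set.seed(1)
toy <- data.frame(YearsExperience = 1:20)
toy$Salary <- 30000 + 9000 * toy$YearsExperience + rnorm(20, sd = 5000)

fit <- lm(Salary ~ YearsExperience, data = toy)
s   <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- s$coefficients
print(coef_table)

# P-value of the slope and the R-squared of the model
p_slope <- coef_table["YearsExperience", "Pr(>|t|)"]
cat("Slope p-value:", p_slope, "\n")
cat("R-squared:", s$r.squared, "\n")

# R's significance codes: '***' p < 0.001, '**' p < 0.01, '*' p < 0.05, '.' p < 0.1
```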
Predict the Test set results
y_pred <- predict(regressor, test_set)
The predict() function is used to predict values for new observations.
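predict() is not limited to the test set: any data frame with a YearsExperience column works as new data. The sketch below fits a model on invented values and predicts for two hypothetical jobholders; the interval = "prediction" argument of predict() for lm models additionally returns a lower and upper bound for each prediction.

```r
# Hypothetical training data (values invented for illustration)
toy <- data.frame(
  YearsExperience = c(1.1, 2.0, 3.2, 4.5, 5.9, 7.1, 8.2, 9.5),
  Salary          = c(39000, 43500, 54000, 61000, 81000, 98000, 113000, 116000)
)
fit <- lm(Salary ~ YearsExperience, data = toy)

# New observations must use the same column name as the model formula
new_people <- data.frame(YearsExperience = c(2.5, 7))

point  <- predict(fit, newdata = new_people)
ranges <- predict(fit, newdata = new_people, interval = "prediction", level = 0.95)
print(point)
print(ranges)   # columns: fit, lwr, upr
```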
Visualize the Training set results
ggplot() +
geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary), colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, training_set)), colour = 'blue') +
ggtitle('Salary vs Experience (Training set)') +
xlab('Years of experience') +
ylab('Salary')
The ggplot2 library is used to visualize the results. You can use geom_point() and geom_line() to plot the scatter points and the regression line, and set the title using the ggtitle() function. Pass the training_set data to plot the training set results.
Visualize the Test set results
ggplot() +
geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary), colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, training_set)), colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
Here, you need to change the data only in geom_point(), because the model has already been trained on training_set, so the regression line stays the same.
Evaluation metrics to check the performance of the model
original <- test_set$Salary
predicted <- y_pred
d <- original - predicted
MAE <- mean(abs(d))
MSE <- mean((d)^2)
RMSE <- sqrt(MSE)
R2 <- 1 - (sum((d)^2) / sum((original - mean(original))^2))
cat(" Mean Absolute Error:", MAE, "\n", "Mean Square Error:", MSE, "\n",
"Root Mean Square Error:", RMSE, "\n", "R-squared:", R2)
Using these evaluation metrics, you can check the performance of the model.
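If you evaluate several models, it can be convenient to wrap the same formulas in a function. The sketch below does that; regression_metrics is a hypothetical helper name, and the three actual and predicted values are made up so the arithmetic is easy to follow.

```r
# Compute MAE, MSE, RMSE and R-squared from actual and predicted values
regression_metrics <- function(actual, predicted) {
  d <- actual - predicted
  c(
    MAE  = mean(abs(d)),
    MSE  = mean(d^2),
    RMSE = sqrt(mean(d^2)),
    R2   = 1 - sum(d^2) / sum((actual - mean(actual))^2)
  )
}

# Errors are -2, 2 and -3, so MAE = 7/3 and MSE = 17/3
m <- regression_metrics(actual = c(10, 20, 30), predicted = c(12, 18, 33))
print(round(m, 3))
```

For the model in this article, you would call regression_metrics(test_set$Salary, y_pred).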
The code is available on my GitHub account.
Thank you.