Regression Analysis
Regression analysis is a form of mathematical analysis that uses quantified models, representations, and synopses for a given set of experimental data or real-life studies. It is a supervised learning technique that helps in finding the relationship between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, and exploring cause-and-effect relationships between variables.
So now the question is: why do we use regression analysis?
As we know, regression analysis helps in the prediction of a continuous variable. It is a statistical approach that is used in machine learning and data science. There are many real-world scenarios where we need future predictions, such as weather conditions, sales, and marketing trends; for such cases we need a technique that can make these predictions accurately. Below are some other reasons for using regression analysis:
Regression estimates the relationship between the target and the independent variable.
It is used to find the trends in data.
It helps to predict real/continuous values.
By performing the regression, we can estimate which factor is the most important, which is the least important, and how each factor affects the target.
Let us understand how regression analysis can be used to predict next year's sales for an advertising company.
Example: Suppose there is a marketing company A that runs various advertisements every year and earns sales from them. The list below shows the advertising spend of the company in the last 5 years and the corresponding sales:
Now the company wants to spend $300 on advertising in the year 2022 and wants a prediction of its sales for that year. To solve this type of prediction problem in machine learning, we need regression analysis. We plot the variables on a graph and fit the line or curve that best matches the given data points; using this fit, the machine learning model can make predictions about new data. In simple words, "Regression shows a line or curve that passes as close as possible to the data points on a target-predictor graph, in such a way that the vertical distance between the data points and the regression line is minimum." The distance between the data points and the line tells whether a model has captured a strong relationship or not.
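The idea of fitting a line and reading a prediction off it can be sketched in a few lines of Python. This is only a minimal illustration, assuming the common least-squares criterion (minimizing the sum of squared vertical distances). The spend and sales figures are hypothetical placeholders, not the company's data, and fit_line is a helper name made up for this sketch.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical figures for illustration only (NOT the article's data)
advertisement = [90, 120, 150, 100, 130]    # yearly advertising spend in $
sales = [1000, 1300, 1800, 1200, 1380]      # corresponding sales in $

slope, intercept = fit_line(advertisement, sales)

# Predict sales for a planned spend of $300
predicted_sales = intercept + slope * 300
print(f"sales = {intercept:.1f} + {slope:.2f} * advertisement")
print(f"predicted sales for $300: {predicted_sales:.0f}")
```

Note that $300 lies well outside the range of these sample values, so a prediction like this is an extrapolation and should be treated with extra caution.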
Some terminology related to regression analysis:
Dependent Variable: The main factor in Regression analysis that we want to predict or understand is called the dependent variable. It is also called the target variable.
Independent Variable: The factors that affect the dependent variable, or that are used to predict its values, are called independent variables, also called predictors.
Outliers: An outlier is an observation that lies far outside the main body of the data it belongs to, like a cow standing far from the rest of the herd or a distant island belonging to a cluster of islands. Outliers can distort the result of a regression, so they should be examined.
Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting. If our algorithm does not perform well even with the training dataset, the problem is called underfitting.
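The difference between the two problems shows up when we compare the error on the training data with the error on the test data. The sketch below uses hypothetical data and two deliberately extreme models chosen for illustration: a "memorizer" that returns the target of the nearest training point (a stand-in for an overfitted model) and a "constant" model that always returns the training mean (a stand-in for an underfitted one). Neither model comes from the article.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function x -> prediction) on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data that roughly follows y = 2x with small noise
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.3, 3.8, 6.1, 7.7, 10.2, 11.9]
test_x = [1.5, 2.5, 3.5, 4.5, 5.5]
test_y = [2.9, 5.2, 6.8, 9.1, 11.0]


def memorizer(x):
    """Overfitting stand-in: copy the target of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


mean_y = sum(train_y) / len(train_y)


def constant(x):
    """Underfitting stand-in: ignore x and always predict the training mean."""
    return mean_y


memorizer_train = mse(memorizer, train_x, train_y)
memorizer_test = mse(memorizer, test_x, test_y)
constant_train = mse(constant, train_x, train_y)
constant_test = mse(constant, test_x, test_y)

print("memorizer: train", memorizer_train, "test", memorizer_test)
print("constant:  train", constant_train, "test", constant_test)
```

The memorizer has zero error on the training data but a clearly nonzero error on the test data, which is the overfitting pattern. The constant model has a large error on both, which is the underfitting pattern.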
Different Types of Regression
Linear Regression
Logistic Regression (despite its name, mainly used for classification)
Decision Tree Regression
Random Forest Regression and many more…
Today we will look at Linear Regression and Decision Tree Regression.
1) Linear Regression: Linear regression is a statistical regression method used for predictive analysis. It is one of the simplest regression algorithms and shows the relationship between continuous variables. It models a linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence the name linear regression. If there is only one input variable (x), it is called simple linear regression. If there is more than one input variable, it is called multiple linear regression.
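Written as equations, the two cases look as follows. The notation is assumed here for illustration and does not come from the article: b_0 is the intercept, b_1 ... b_n are the coefficients, and ε is the error term that the line does not explain.

```latex
\begin{align*}
  % Simple linear regression: one input variable x
  y &= b_0 + b_1 x + \varepsilon \\
  % Multiple linear regression: n input variables x_1, ..., x_n
  y &= b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon
\end{align*}
```

Fitting the model means choosing the coefficients so that the errors are as small as possible; a common choice is to minimize the sum of the squared errors.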
Example of Linear Regression
The relationship between variables in the linear regression model can be explained using the image below. Here we are predicting the salary of an employee on the basis of years of experience.
Graphical representation of Linear Regression
Some popular applications of linear regression are:
Analyzing trends and sales estimates
Salary forecasting
Real estate prediction
Estimating arrival times (ETAs) in traffic
2) Decision Tree Regression: Decision Tree is a supervised learning algorithm that can be used for solving both classification and regression problems. It can handle both categorical and numerical data. Decision Tree regression builds a tree-like structure in which each internal node represents a "test" on an attribute, each branch represents the result of the test, and each leaf node represents the final decision or result. A decision tree is constructed starting from the root node/parent node (the dataset), which splits into left and right child nodes (subsets of the dataset). These child nodes are further divided into their own child nodes and thus become the parent nodes of those nodes.
Why use Decision Trees?
There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Below are two reasons for using a decision tree:
Decision trees usually mimic the way humans think while making a decision, so they are easy to understand.
The logic behind the decision tree can be easily understood because it shows a tree-like structure.
How does the Decision Tree algorithm work?
In a decision tree, to predict the class (or value) for a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record and, based on the comparison, follows the matching branch and jumps to the next node. At the next node, the algorithm again compares the attribute value and moves further down the tree. It continues this process until it reaches a leaf node.
Example of a Decision Tree: Suppose there is a candidate who has a job offer and wants to decide whether to accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits into the next decision node (distance from the office) and one leaf node, based on the corresponding labels. The next decision node splits further into one decision node (Cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer). Because the outcome here is a category, this particular tree is a classification tree; a regression tree has the same structure, but its leaves hold numeric predictions.
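The job-offer tree and the way a prediction walks from the root to a leaf can be sketched as nested dictionaries. This is only an illustration: the attribute names (salary_ok, office_is_near, cab_facility), the True/False branch values, and which branch leads to which leaf are assumptions made for this sketch, since the article's figure is not reproduced here.

```python
# Each decision node is a dict with the attribute to test and one entry per
# branch; each leaf is just a string holding the final decision.
# Attribute names and branch outcomes are hypothetical.
job_offer_tree = {
    "attribute": "salary_ok",
    "branches": {
        False: "Declined offer",
        True: {
            "attribute": "office_is_near",
            "branches": {
                False: "Declined offer",
                True: {
                    "attribute": "cab_facility",
                    "branches": {
                        True: "Accepted offer",
                        False: "Declined offer",
                    },
                },
            },
        },
    },
}


def predict(node, record):
    """Follow the branches that match the record until a leaf is reached."""
    while isinstance(node, dict):
        value = record[node["attribute"]]   # test the node's attribute
        node = node["branches"][value]      # jump to the matching child
    return node


candidate = {"salary_ok": True, "office_is_near": True, "cab_facility": False}
print(predict(job_offer_tree, candidate))
```

A record only needs the attributes that lie on its own path: if salary_ok is False, the walk stops at the first leaf without looking at the other attributes.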
Let’s see how we can do this stepwise:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be split any further; these final nodes are called leaf nodes.
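The five steps map quite directly onto a short recursive function. The sketch below is one possible reading of them, assuming categorical attributes, information gain as the attribute selection measure, and a majority vote when no attributes are left. The tiny dataset and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s)
                    for s in subsets.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, attributes):
    # Stop when the node is pure or no attributes remain: return a leaf
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step-2: pick the best attribute with the selection measure
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    # Step-4: create the decision node for that attribute
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # Step-3: one subset per value of the best attribute
    for value in sorted({row[best] for row in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        # Step-5: recurse on each subset
        node["branches"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node


# Step-1: the root starts with the complete (hypothetical) dataset
rows = [
    {"salary": "high", "distance": "near"},
    {"salary": "high", "distance": "near"},
    {"salary": "high", "distance": "far"},
    {"salary": "low", "distance": "near"},
    {"salary": "low", "distance": "near"},
    {"salary": "low", "distance": "far"},
]
labels = ["accept", "accept", "decline", "decline", "decline", "decline"]

tree = build_tree(rows, labels, ["salary", "distance"])
print(tree)
```

On this data the salary attribute gives the larger information gain, so it becomes the root; the "low" branch is already pure and turns into a leaf, while the "high" branch is split once more on distance.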
When implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve this, there is a technique called the Attribute Selection Measure, or ASM. Two popular ASM techniques, with their formulas, are:
Information Gain: Information gain is the measurement of the change in entropy after a dataset is segmented on an attribute. It calculates how much information a feature provides about a class. According to the value of information gain, we split the node and build the decision tree. A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first.
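Formula: $\text{Information Gain}(S, A) = \text{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$, where $S_v$ is the subset of $S$ in which attribute $A$ takes the value $v$ (the sum is the weighted average of the entropy of each subset).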
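A small worked example may make the formula more concrete. The sketch below assumes the usual definition of entropy, Entropy(S) = -∑ p_i log2(p_i) over the class proportions p_i, which the article does not spell out. The label lists are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits: -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())


def information_gain(parent, subsets):
    """Entropy of the parent minus the weighted average entropy of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted


# Hypothetical node with 4 "yes" and 4 "no" samples, split by some attribute
parent = ["yes"] * 4 + ["no"] * 4
left = ["yes"] * 3 + ["no"] * 1     # subset for the first attribute value
right = ["yes"] * 1 + ["no"] * 3    # subset for the second attribute value

gain = information_gain(parent, [left, right])
print(f"entropy(parent) = {entropy(parent):.3f}")
print(f"information gain = {gain:.3f}")
```

Here the parent is perfectly mixed, so its entropy is 1 bit. Each subset is purer than the parent, so the weighted entropy drops to about 0.811 and the gain is about 0.189.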
Gini Index: The Gini index is a measure of impurity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm. An attribute with a low Gini index should be preferred over one with a high Gini index. The CART algorithm creates only binary splits and uses the Gini index to choose them.
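Formula: $\text{Gini Index} = 1 - \sum_{j} P_j^{2}$, where $P_j$ is the proportion of samples in the node that belong to class $j$.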
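The Gini formula is easy to check by hand or in code. The sketch below computes it for a few hypothetical label lists and also shows one way to score a binary split by the size-weighted Gini index of its two sides; weighted_gini is a helper name made up for this illustration.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted Gini index of a binary split (lower is better)."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


pure = ["yes", "yes", "yes", "yes"]     # only one class
mixed = ["yes", "yes", "no", "no"]      # two classes, evenly mixed
skewed = ["yes", "yes", "yes", "no"]    # mostly one class

print(gini(pure), gini(mixed), gini(skewed))
print(weighted_gini(["yes", "yes"], ["no", "no"]))
```

A pure node has a Gini index of 0, an evenly mixed two-class node has 0.5, and the skewed node falls in between at 0.375. A split that separates the classes perfectly has a weighted Gini index of 0.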
Conclusion: The purpose of this article was to learn how regression analysis works in machine learning and how Linear Regression and Decision Tree Regression work in real-world scenarios. I have chosen these two topics to explain, but there are many more which I will cover in the near future.
I hope this article was helpful to you. Please leave your questions, if any, below.