In the process of building a predictive machine learning model, we always come across some type of error, either on the training data or on the testing data. Here, let's understand this error problem in simple words, because it is sometimes difficult to grasp the concept through mathematical and technical terms alone.
What is Bias?
Bias is the error between the model's predicted value and the correct value.
In simple words, bias refers to the error on the training data.
It can lead to underfitting.
Let's understand the underfitting problem using regression. In the above image, I have used polynomial regression with a polynomial of degree one. It therefore acts as simple linear regression and produces a straight best-fit line. Because the points follow a curved (polynomial) pattern, the line fits them poorly: the error on the training data is high (the R-squared is low), so the training accuracy is low, and the test accuracy goes down as well. This scenario is called the underfitting problem. In an underfitting problem, both the bias and the variance are high, because both the training and the test errors are high.
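To make this concrete, here is a minimal sketch of underfitting, assuming scikit-learn and some synthetic quadratic data (both the library and the data are my illustrative choices, not taken from the figure): a degree-one polynomial is just a straight line, so the scores on both splits stay poor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data with a little noise (illustrative assumption).
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 200)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Degree one reduces polynomial regression to a straight line, which
# cannot follow the curve: both R-squared values come out low.
model = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())
model.fit(X_train, y_train)

print("train R^2:", r2_score(y_train, model.predict(X_train)))
print("test  R^2:", r2_score(y_test, model.predict(X_test)))
```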
We can also understand this problem using classification. For example, suppose the model gives a training error of 25% and a test error of 26%. In this scenario, both the training and test errors are high, which means both the bias and the variance are high. That case is a disaster: the model performs well on neither the training data nor the test data.
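The same high-bias pattern is easy to reproduce in code. As a hedged sketch, scikit-learn's make_circles (my choice of dataset, not the author's) gives two classes that a straight line cannot separate, so a linear classifier's training and test errors are both high and nearly equal:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Concentric circles are not linearly separable (illustrative dataset).
X, y = make_circles(n_samples=1000, noise=0.1, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear classifier underfits: both error rates hover near 50%.
clf = LogisticRegression().fit(X_train, y_train)
print("train error:", 1 - clf.score(X_train, y_train))
print("test  error:", 1 - clf.score(X_test, y_test))
```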
What is Variance?
Variance is the error in the results predicted by the model: the model learns noise from the training data set and then performs poorly on the test data set.
In simple words, variance refers to the error on the test data.
It can lead to high sensitivity and overfitting.
The above image shows the same regression setup for the overfitting problem. Here, the error on the training data is low (the R-squared is high) because the curve passes through almost every point, so the training accuracy is high. But the model does not perform well on the test data, so the test accuracy goes down. This scenario is called the overfitting problem. In an overfitting problem, the bias is low and the variance is high, because the training error is low and the test error is high.
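For comparison, here is a minimal sketch of overfitting under the same assumptions as the earlier snippet (scikit-learn, synthetic quadratic data): a degree-15 polynomial has enough flexibility to chase the noise, so the training score looks excellent while the test score collapses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# A small, noisy sample makes the overfitting easy to see.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 30)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sixteen polynomial features against ~22 training points: the curve
# threads through the training data but generalizes badly.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

print("train R^2:", r2_score(y_train, model.predict(X_train)))
print("test  R^2:", r2_score(y_test, model.predict(X_test)))
```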
For the classification example, suppose the model gives a training error of 1% and a test error of 20%. In this scenario, the training error is low and the test error is high, which means the model has low bias and high variance: it performs well on the training data but not on the test data.
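Here is a quick illustrative sketch of the same low-bias, high-variance pattern in classification, again on synthetic data of my own choosing: an unconstrained decision tree memorizes a noisy training set, so the training error lands near 0% while the test error is much higher.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 corrupts 20% of the labels, so some "patterns" are pure noise.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With no depth limit, the tree memorizes the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train error:", 1 - tree.score(X_train, y_train))  # ~0%
print("test  error:", 1 - tree.score(X_test, y_test))    # much higher
```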
What is Bias-Variance Trade-off?
The bias-variance trade-off means finding the right balance between bias and variance, without underfitting or overfitting the data.
In the above image, a polynomial of degree two bends the regression line into a gentle curve that fits most of the points. The model performs well on both the training and the test data, and the error is low (the R-squared is high).
For example, suppose the model gives a training error of 7% and a test error of 6%. In this scenario, both the training and the test errors are low, which means both the bias and the variance are low, and the model performs well on both the training and the test data.
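Putting the three regimes side by side, this small sketch (same illustrative setup as the earlier snippets) sweeps the polynomial degree: degree 1 underfits, degree 15 overfits, and degree 2 strikes the balance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# score() returns R-squared for regressors: watch the train/test gap
# shrink at degree 2 and reopen at degree 15.
for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:2d}: "
          f"train R^2 = {model.score(X_train, y_train):.2f}, "
          f"test R^2 = {model.score(X_test, y_test):.2f}")
```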
Reducing Bias error:
Hyperparameter tuning: Any machine learning model has hyperparameters, such as constraints, weights, optimizers, activation functions, or learning rates, that control how it generalizes to different data patterns. Tuning these hyperparameters is necessary so that the model can solve the problem optimally (see the tuning sketch after this list).
Trying a more appropriate algorithm: Before relying on any model, we need to ensure that its assumptions suit our data. If they do not, changing the model can reduce the bias.
Enough data / representative data: Ensure that the data is sufficient, diverse, and represents all possible groups or outcomes.
Maintain separate training, validation, and test data: Split the dataset into training (50%), validation (25%), and test (25%) sets. The training set is used to build the model, the validation set is used to evaluate the model's hyperparameters, and the test set is used to check the final accuracy of the model (see the splitting sketch after this list).
Reduce the dimensionality of the data: Remove irrelevant or noisy features so that the model can focus on the informative ones.
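As a concrete example of the hyperparameter tuning mentioned above, this purely illustrative sketch uses scikit-learn's GridSearchCV to search over a decision tree's max_depth; the dataset and the grid are my own assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small depths underfit (high bias); unlimited depth overfits (high
# variance). Cross-validation picks a depth in between.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [1, 2, 3, 5, 10, None]},
                      cv=5)
search.fit(X_train, y_train)
print("best max_depth:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```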
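And here is a minimal sketch of the 50/25/25 split described above, done with two calls to train_test_split on a synthetic dataset (the sizes are illustrative; adjust the ratios to your data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off 50% for training...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, random_state=0)
# ...then split the remaining half into 25% validation and 25% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 500 250 250
```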
Reducing Variance error:
Reducing variance means preventing the overfitting problem. One of my friends, Mahitha, has already covered how to prevent overfitting in her blog; you can refer to it from this link.
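As one common variance-reduction technique (regularization, which is my illustrative pick and not necessarily what that blog covers), this sketch adds a ridge penalty to the overfit degree-15 model from earlier and compares the test scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 30)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=30)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The ridge penalty shrinks the wild high-degree coefficients, trading a
# little training accuracy for a better test score.
for name, reg in [("no regularization", LinearRegression()),
                  ("ridge (alpha=1.0)", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(15), StandardScaler(), reg)
    model.fit(X_train, y_train)
    print(f"{name}: test R^2 = {model.score(X_test, y_test):.2f}")
```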
Hopefully, you now understand that bias and variance are very important in machine learning, and that an optimal balance of bias and variance leads to more consistent and reliable models.
Thank you.