Bagging and boosting are two of the most prominent ensemble techniques. The general principle of an ensemble method in machine learning is to combine the predictions of several models: an ensemble is a group of models working together to solve a common problem.
We can use ensemble methods to combine models in two ways: either using a single base learning algorithm that remains the same across all models (a homogeneous ensemble), or using base learning algorithms that differ from model to model (a heterogeneous ensemble). Bootstrap aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method.
The application of either bagging or boosting requires selecting a base learner algorithm first. For example, if one chooses a classification tree, then boosting and bagging would each produce a pool of trees with a size equal to the user’s preference.
Advantages:
Many weak learners aggregated together typically outperform a single learner over the entire set, and overfit less.
Reduces variance on high-variance, low-bias data sets.
Can be performed in parallel, as each bootstrap sample can be processed on its own before combination.
Disadvantages:
On a data set with high bias, bagging will carry that high bias into its aggregate.
Reduces the interpretability of the model.
It can be computationally expensive, which discourages its use in certain instances.
How it works:
Let’s consider a dataset containing the letters (A, B, C, D, E, F, G, H, I). First, the algorithm draws sample 1, trains the chosen classifier on it, and records its accuracy. Before drawing sample 2, it places the previously drawn items back into the dataset, so the same item may be picked again; this is known as sampling with replacement. The process repeats as many times as we specify, and we get an accuracy for each of those samples.
For classification, a process called voting is used to determine the final result: the class predicted most frequently across the models is the result given for the sample, much like an election in which the person with the majority of votes wins. For regression, the sample is assigned the average of the values predicted by the individual models.
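To make the sampling and combining steps concrete, here is a minimal sketch in plain Python. The letters mirror the toy dataset above; the prediction lists are made-up stand-ins for the outputs of individual models:

import random
from collections import Counter

data = list("ABCDEFGHI")   # the toy dataset from above
random.seed(1)

# Draw 3 bootstrap samples: each is the size of the original dataset,
# and items go back in after every draw, so repeats can occur.
samples = [random.choices(data, k=len(data)) for _ in range(3)]
for i, s in enumerate(samples, 1):
    print("sample", i, "->", s)

# Voting (classification): the most frequent prediction wins.
predictions = ["cat", "dog", "cat"]                # one made-up prediction per model
print(Counter(predictions).most_common(1)[0][0])   # prints "cat"

# Averaging (regression): the mean of the models' predictions.
values = [3.1, 2.9, 3.4]           # one made-up value per model
print(sum(values) / len(values))   # prints 3.1333...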
Sample code for classification:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bag decision trees (10 by default) trained on bootstrap samples
model = BaggingClassifier(DecisionTreeClassifier(random_state=1))
model.fit(x_train, y_train)     # x_train and y_train must already be defined
model.score(x_test, y_test)     # mean accuracy on the test set
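For a self-contained run, here is a minimal sketch using scikit-learn’s built-in iris dataset; the train/test split is only illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

model = BaggingClassifier(DecisionTreeClassifier(random_state=1))
model.fit(x_train, y_train)
print(model.score(x_test, y_test))   # mean accuracy on held-out data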
Sample code for a regression problem:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Bag decision tree regressors (10 by default) trained on bootstrap samples
model = BaggingRegressor(DecisionTreeRegressor(random_state=1))
model.fit(x_train, y_train)
model.score(x_test, y_test)     # R^2 score on the test set
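Likewise, a self-contained regression sketch using the built-in diabetes dataset (illustrative, not tuned):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

x, y = load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

model = BaggingRegressor(DecisionTreeRegressor(random_state=1))
model.fit(x_train, y_train)
print(model.score(x_test, y_test))   # R^2 on held-out data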
Parameters used in the algorithm:
base_estimator:
It defines the base estimator to fit on random subsets of the dataset.
When nothing is specified, the base estimator is a Decision Tree.
n_estimators:
It is the number of base estimators to be created. Default = 10.
The number of estimators should be carefully tuned as a large number would take a very long time to run, while a very small number might not provide the best results.
max_samples:
This parameter controls the size of the subsets.
It is the maximum number of samples used to train each base estimator (drawn with replacement).
max_features:
Controls the number of features to draw from the whole dataset.
It defines the maximum number of features used to train each base estimator (drawn without replacement).
n_jobs:
The number of jobs to run in parallel.
None means 1.
If -1, it uses all the processors.
random_state:
It controls the random number generation used for the random sampling. When the random_state value is the same for two models, the random selection is the same for both models.
This parameter is useful when you want to compare different models.
bootstrap:
Whether samples are drawn with replacement. If False, sampling without replacement is performed.
Default = True.
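To tie the parameters above together, here is a minimal sketch with the options set explicitly; the values are illustrative, not tuned:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

model = BaggingClassifier(
    DecisionTreeClassifier(),   # base estimator; a decision tree is also the default
    n_estimators=50,            # number of base estimators to build
    max_samples=0.8,            # each estimator sees 80% of the rows
    max_features=0.9,           # each estimator sees 90% of the columns
    bootstrap=True,             # draw rows with replacement
    n_jobs=-1,                  # use all processors to fit estimators in parallel
    random_state=1,             # make the random sampling reproducible
)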
Happy reading!