You can't build a great building on a weak foundation. You must have a solid foundation if you're going to have a strong superstructure.
- Gordon B. Hinckley
Feature Engineering is a crucial stage in the life cycle of a Data Science project.
Suppose you are taking on a gardening project. You need to note the necessary steps: what is the purpose of the project, what requirements should be considered, and what must be done to make the process fast, easy, and effective? You might also need to preprocess certain materials to serve the purpose. Collecting all the needed materials and making them usable for the project is called engineering, and when this is done for the features of a Data Science project it is called Feature Engineering.
Raw data can be haphazard. It is good practice to be able to answer all of the above questions once the features have been engineered.
The major objective of Feature Engineering is to identify, preprocess, or modify the features that contribute to an effective model. There are multiple steps to achieve this, depending on the data.
1. EDA:
At the start of any Machine Learning project the data is messy and not suitable for modelling. To prepare it for modelling, we have to explore the data and find relevant insights. Exploratory Data Analysis means examining the dataset for data types (categorical or numerical), missing values, and outliers, and cleaning the data by checking formats, headers, etc.
D-Tale, pandas-profiling, Sweetviz, AutoViz, and Dataprep are libraries that can automate much of this EDA, plots included.
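Even without those libraries, a few lines of pandas cover the basic checks. This is a minimal sketch, assuming a hypothetical file named data.csv:

```python
import pandas as pd

# Load the dataset (data.csv is a placeholder file name)
df = pd.read_csv("data.csv")

# Data types of each column (categorical vs. numerical)
print(df.dtypes)

# Count of missing values per column
print(df.isnull().sum())

# Summary statistics; a quick way to spot suspicious ranges and outliers
print(df.describe())

# Peek at the headers and value formats
print(df.head())
```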
2. Handling missing values:
Missing values can be imputed with simple statistics such as the mean, median, or mode, or with imputation algorithms like:
KNN (k-nearest neighbours)
LLSimpute (local least squares, which uses regression)
MICE (Multiple Imputation by Chained Equations)
CART (Classification and Regression Trees)
GMCimpute (Gaussian mixture clustering imputation)
CMVE (Collateral missing value imputation)
AMVI (Ameliorative missing value imputation)
ABBA (Adaptive bicluster-based approach imputation)
These are a few of the algorithms used for imputing missing values; a hybrid of any of them can also be used. The imputed dataset can be evaluated with either internal or external validation, so the reliability and accuracy of the imputation technique can be measured.
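As a rough illustration, scikit-learn ships both a simple statistical imputer and a KNN imputer. The tiny array below is an invented stand-in for a real dataset:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data with missing entries (np.nan)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Mean imputation: each NaN is replaced by its column mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: each NaN is filled in from the 2 most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_mean)
print(X_knn)
```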
3. Handling an imbalanced dataset:
Imbalanced data leads to biased models, which is not ideal, so we need a way to handle it. This can be approached in the following ways:
Resampling (a sketch follows this list)
Collecting more training data
Using K-fold cross-validation
Clustering the abundant class
Changing the performance metric
We can also devise a custom way to handle imbalanced datasets.
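Here is a minimal sketch of the resampling approach using scikit-learn's resample utility; the toy DataFrame and its label column are invented for illustration. Libraries such as imbalanced-learn offer more sophisticated resamplers like SMOTE:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset: many 0 labels, few 1 labels
df = pd.DataFrame({"feature": range(10),
                   "label":   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Upsample the minority class with replacement to match the majority size
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())  # now 8 of each class
```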
4. Outliers:
Removing outliers often gives a noticeable increase in the accuracy of the model, which makes this an important step. Anomaly detection can be achieved in the following ways (an IQR-based sketch follows the list):
Scatter plots
Box plots
Z-score
IQR (interquartile range)
Isolation Forest
Cluster analysis
Ensemble techniques
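As a quick illustration of the IQR rule, here is a sketch on a made-up series; the 1.5 multiplier is the conventional cutoff:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])  # 95 is an obvious outlier

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)  # flags the 95

# Keep only the inliers
cleaned = s[(s >= lower) & (s <= upper)]
```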
5. Feature scaling:
Raw data can have values in any range, and it is hard to analyze features whose ranges differ drastically. If the data is in different units of measurement, it is important to bring it onto the same scale. Transforming the data into a limited range or common units, in turn, helps us build accurate models. Feature transformation can be done in the following ways (a scaling sketch follows the list):
Standard scaler
MinMax scaler(0-1)
Gaussian transformations such as:
logarithmic transformation
reciprocal transformation
square root transformation
exponential transformation
Box-Cox transformation
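A minimal sketch of standard scaling, MinMax scaling, and a logarithmic transformation with scikit-learn and NumPy; the toy matrix is invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standard scaler: zero mean, unit variance per column
X_std = StandardScaler().fit_transform(X)

# MinMax scaler: squeeze each column into the 0-1 range
X_minmax = MinMaxScaler().fit_transform(X)

# Logarithmic transformation (one of the Gaussian transformations);
# log1p computes log(1 + x), so it also handles zeros safely
X_log = np.log1p(X)
```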
6. Handling categorical data:
Categorical data carries useful information, but a machine cannot compute with it as is; these variables should be converted to numbers so that calculations can be performed and more insights derived from the data. Categorical variables are classified into ordinal and nominal, where ordinal variables can be ranked. Some of the ways to encode categorical data are as follows (an encoding sketch follows the list):
One hot encoding
Label encoding
Binary encoding
Replacing (manually mapping categories to numbers)
Backward difference encoding
Count or frequency encoding
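A short sketch of a few of these encodings in pandas; the toy size and color columns are invented, and the ordinal mapping for size is an assumption:

```python
import pandas as pd

df = pd.DataFrame({"size":  ["small", "large", "medium", "small"],
                   "color": ["red", "blue", "green", "blue"]})

# One hot encoding: one binary column per category (good for nominal data)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal/label encoding: map each ranked category to an integer
size_order = {"small": 0, "medium": 1, "large": 2}  # assumed ordering
df["size_encoded"] = df["size"].map(size_order)

# Count/frequency encoding: replace each category by how often it occurs
df["color_freq"] = df["color"].map(df["color"].value_counts())

print(pd.concat([df, one_hot], axis=1))
```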
Conclusion:
Feature Engineering is the crucial step where we build a close understanding of the data and extract all the information it is trying to tell us. This task takes up much of a Data Scientist's time. Just as a building is only as strong as its foundation, a model is only as effective as its Feature Engineering.