Feature Engineering

By Mahitha Kumar

You can't build a great building on a weak foundation. You must have a solid foundation if you're going to have a strong superstructure.

- Gordon B. Hinckley



Feature Engineering is a crucial stage in the life cycle of a data science project.

Suppose you are planning a gardening project. You need to note the necessary steps: the purpose of the project, the requirements to consider, and the things to do to make the process fast, easy, and effective. You might also need to preprocess certain materials to serve the purpose. This collecting of all the needed materials, and making them usable for the project, is called engineering; when it is done for the features of a data science project, it is called Feature Engineering.




Raw data can be haphazard. Well-engineered features bring it into a form whose types, quality, and relevance can be clearly understood.


The major objective of Feature Engineering is to identify, preprocess, and modify the features that contribute to an effective model. Depending on the data, this involves several steps.


1. EDA:

Before starting any machine learning project, the data is messy and not suitable for modelling. To prepare it for modelling, we have to explore the data and find relevant insights. Exploratory Data Analysis (EDA) means examining the dataset for data types (categorical or numerical), missing values, and outliers, and cleaning the data by checking formats, headers, etc.


D-Tale, pandas-profiling, Sweetviz, AutoViz, and DataPrep are libraries that can automate much of this EDA, complete with plots.
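
Even without those libraries, a first manual pass with plain pandas covers the basics. A minimal sketch, assuming a hypothetical input file data.csv:

```python
import pandas as pd

# "data.csv" is a hypothetical input file for this sketch
df = pd.read_csv("data.csv")

print(df.dtypes)          # data types: categorical vs. numerical columns
print(df.isnull().sum())  # missing values per column
print(df.describe())      # summary statistics that hint at outliers
print(df.head())          # eyeball formats and headers
```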


2. Handling missing values:

This can be done in many ways, such as imputing with the mean, median, or mode, or using imputation algorithms like:

  • KNN

  • LLSimpute (Local Least Squares imputation), which uses regression

  • MICE (Multiple Imputation by Chained Equations)

  • CART (Classification and Regression Trees)

  • GMCimpute (Gaussian Mixture Clustering imputation)

  • CMVE (Collateral Missing Value imputation)

  • AMVI (Ameliorative Missing Value Imputation)

  • ABBA (Adaptive Bicluster-Based Approach imputation)

These are a few of the algorithms used for imputing missing values; a hybrid of any of them can also be used. The imputed dataset can then be evaluated with internal or external validation, which measures the reliability and accuracy of the imputation technique.
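
To make the first option concrete, scikit-learn ships a KNN imputer. A minimal sketch on a made-up toy matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with missing entries marked as np.nan
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing value is filled with the mean of that feature
# across the 2 nearest rows (distance computed on observed features)
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```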


3. Handling an imbalanced dataset:

Imbalanced data leads to biased models, which is not ideal, so we need a way to handle it. This can be approached in the following ways:

  • Resampling

  • Collecting more training data

  • Using K-fold cross-validation

  • Clustering the abundant class

  • Changing the performance metric

We can also customize a way to handle an imbalanced dataset.
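
As an illustration of the resampling option, here is a minimal sketch using scikit-learn's resample utility on a made-up dataframe:

```python
import pandas as pd
from sklearn.utils import resample

# Made-up dataframe: class 0 heavily outnumbers class 1
df = pd.DataFrame({"feature": range(10),
                   "label": [0] * 8 + [1] * 2})

majority = df[df.label == 0]
minority = df[df.label == 1]

# Oversample the minority class with replacement until the classes match
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced.label.value_counts())  # 8 of each class
```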


4. Outliers:

Removing outliers often leads to a significant increase in model accuracy, so this is an important step. Anomaly detection can be achieved in the following ways (a small IQR sketch follows the list):

  • Using scatter plots

  • Box Plots

  • Z-score

  • Using the IQR (interquartile range)

  • Isolation forest

  • Cluster analysis

  • Ensemble techniques
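
A minimal sketch of the IQR rule, on made-up numbers:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 10, 14, 12])

# The classic rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # the lone extreme value, 95
```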

5. Feature scaling:

Raw data can have values in any range, and it is difficult to analyze features whose ranges differ drastically. If the data is in different units of measurement, it is important to bring the features to the same scale; transforming the data into limited ranges or common units in turn helps produce accurate models. Feature transformation can be done in the following ways (a short sketch follows the list):

  • Standard scaler

  • MinMax scaler (0-1)

  • Gaussian transformations such as:

    • logarithmic transformation

    • reciprocal transformation

    • square root transformation

    • exponential transformation

    • Box-Cox transformation
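
A minimal sketch of the first two scalers, on made-up age and salary columns:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up features on very different scales: age vs. salary
X = np.array([[25.0, 40000.0],
              [32.0, 65000.0],
              [47.0, 120000.0]])

print(StandardScaler().fit_transform(X))  # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))    # each column squashed into [0, 1]
```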

6. Handling categorical data:

Categorical data carries useful information, but a machine cannot compute with it as-is, so these variables should be converted to numerical form; calculations can then be performed and more insights derived from the data. Categorical variables are classified into ordinal and nominal, where ordinal variables can be ranked. Some of the ways to encode categorical data are as follows (see the sketch after this list):

  • One hot encoding

  • Label encoding

  • Binary encoding

  • Replacing

  • Backward difference encoding

  • Count or frequency encoding
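
A minimal sketch of one-hot encoding for a nominal column and an explicit label mapping for an ordinal one, on a made-up dataframe:

```python
import pandas as pd

# Made-up dataframe with one ordinal and one nominal column
df = pd.DataFrame({"size": ["small", "large", "medium"],
                   "colour": ["red", "blue", "red"]})

# One-hot encoding: one binary column per category (nominal data)
print(pd.get_dummies(df, columns=["colour"]))

# Label encoding via an explicit mapping (ordinal data, so the order matters)
df["size"] = df["size"].map({"small": 0, "medium": 1, "large": 2})
print(df)
```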

Conclusion:

Feature Engineering is the crucial step where we build an understanding of the data and extract all the information it is trying to tell us. This task takes up much of a data scientist's time. If the base is strong, the building will be strong; likewise, if Feature Engineering is done properly, we can achieve an effective model.









