Sep 27, 2022 · 3 min read

Automated EDA (Exploratory Data Analysis) and Data Cleaning in Python using DataPrep

As a Data Analyst, telling the story of the data we are analyzing in the best way possible is one of the most important skills to have. In order to tell a better story, we need to have a better understanding of the data.

Simply put, to understand the data better and gain insights, we perform EDA in Python. It helps us identify patterns in the dataset and analyze the correlations and associations between variables with the help of visualizations.

The core of any EDA consists of:

Data Sourcing
Data Cleaning
Descriptive Statistics
Handling Missing Values
Data Visualization

Performing EDA and drawing conclusions from the analysis takes some time and effort. But what if I told you there is a way to generate a report with all of that analysis automatically, with just a single line of code?

I am going to use the waste_hauler dataset for the analysis here, and I will be working in a Jupyter Notebook.

Let's start by installing the DataPrep library

pip install dataprep

Next, let's load the waste_hauler data we want to use.

from dataprep.datasets import load_dataset
df = load_dataset('waste_hauler')
df

As we can see, the data has 1000 rows and 5 columns. Let's perform the automated EDA to get an overview of the data.

REPORT of DATA

The create_report function computes the following statistics of the data:

Essentials: type, unique values, and missing values.
Quantile statistics: minimum, median, maximum, range, and interquartile range.
Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, and skewness.
Most frequent values.
Histogram.
Correlations: Spearman and Pearson matrices, which show correlated variables.
Missing values: an overview of the missing values and the conclusions drawn from them.
Customize Your Plot: Correlation matrices are useful for understanding the relationships between attributes.

from dataprep.eda import create_report
report = create_report(df)
report
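To make a few of the report's numbers concrete, here is a minimal sketch that computes some of the quantile and descriptive statistics listed above using only Python's standard library. The sample values are made up, and the "inclusive" quantile method is my choice for the illustration; DataPrep may compute its quartiles differently.

```python
import statistics as st

# Made-up numeric column, for illustration only
values = [2, 4, 4, 5, 7, 9, 11]

# Quartiles Q1, median, Q3 (the inclusive method treats the data as the full population)
q1, med, q3 = st.quantiles(values, n=4, method="inclusive")

summary = {
    "min": min(values),
    "median": med,
    "max": max(values),
    "range": max(values) - min(values),
    "iqr": q3 - q1,            # interquartile range
    "mean": st.mean(values),
    "mode": st.mode(values),   # most frequent value
    "std": st.stdev(values),   # sample standard deviation
    "sum": sum(values),
}
# Coefficient of variation: standard deviation relative to the mean
summary["cv"] = summary["std"] / summary["mean"]

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

The automated report gives you these numbers for every column at once, which is where the time saving comes from.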

The EDA is not yet complete, however. To better understand the data, we sometimes need to finish the EDA manually.

We can also easily customize the analysis we want using DataPrep.

ANALYZE A DATA FRAME

Let's use the plot function on the whole data frame, which gives us a statistical overview of the dataset, insights, and the distribution of each column.

ANALYZE A COLUMN

We can customize further by passing a column name to the plot function, which gives us a statistical analysis of that column along with more visualizations such as a bar chart, pie chart, word cloud, word frequency, word length, and value table.

If we pass a numeric column to plot, it gives us stats, a histogram, a KDE plot, a normal Q-Q plot, a box plot, and a value table.

ANALYZE 2 COLUMNS

We can also compare 2 columns, which gives us plots such as a nested bar chart, heat map, and stacked bar chart that show the relationship between the columns.
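The three ways of calling plot described above can be sketched in one small helper. The name overview_plots is hypothetical, and it simply picks the first two columns of whatever DataFrame it receives, since the right columns depend on your data. It assumes DataPrep is installed.

```python
def overview_plots(df):
    """Run plot() on the whole DataFrame, on one column, and on a pair of columns."""
    # Imported inside the function so the helper can be defined
    # even where DataPrep is not installed yet.
    from dataprep.eda import plot

    # Illustrative choice: use the first two columns of the DataFrame
    first, second = df.columns[:2]
    return {
        "dataframe": plot(df),                   # overview of every column
        "one_column": plot(df, first),           # detailed analysis of one column
        "two_columns": plot(df, first, second),  # relationship between two columns
    }
```

In a notebook, calling reports = overview_plots(df) and then evaluating reports["one_column"] as the last line of a cell should display that analysis. In everyday use you would more likely call plot directly with the column names you care about.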

Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

Dirty data can lead to misleading or incorrect insights, which can cause stakeholders to make incorrect decisions based on their data.

Every dataset is unique in how it will need to be cleaned, but here are some ways you can clean up your data.

1. Remove Duplicated data

2. Drop null values

3. Populate/impute missing values

4. Filter out outliers

5. Normalize data typos/differences

6. Convert data types properly
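As a rough illustration of the steps listed above, here is a minimal sketch in plain Python, so it runs without extra packages. The records, the column names (city, weight), the median fill, and the 1.5 * IQR outlier rule are all assumptions made for the example. The steps run in a slightly different order than the list because the types have to be converted before outliers can be measured.

```python
from statistics import median, quantiles

# Made-up records for illustration; real data would come from your own source
rows = [
    {"city": "Austin ", "weight": "12.5"},
    {"city": "Austin ", "weight": "12.5"},  # exact duplicate
    {"city": "austin", "weight": "11.0"},   # inconsistent spelling
    {"city": "AUSTIN", "weight": None},     # missing weight
    {"city": None, "weight": "13.0"},       # missing city
    {"city": "Dallas", "weight": "12.0"},
    {"city": "Dallas", "weight": "13.5"},
    {"city": "Dallas", "weight": "950"},    # suspiciously large value
]

# 1. Remove duplicated rows, keeping the first occurrence
seen, unique = set(), []
for row in rows:
    key = (row["city"], row["weight"])
    if key not in seen:
        seen.add(key)
        unique.append(row)

# 2. Drop rows whose city is null
kept = [row for row in unique if row["city"] is not None]

# 5 and 6. Normalize text differences and convert weight from text to float
cleaned = [
    {
        "city": row["city"].strip().title(),
        "weight": float(row["weight"]) if row["weight"] is not None else None,
    }
    for row in kept
]

# 3. Fill missing weights with the median of the known weights
fill = median(r["weight"] for r in cleaned if r["weight"] is not None)
for r in cleaned:
    if r["weight"] is None:
        r["weight"] = fill

# 4. Filter out outliers with the 1.5 * IQR rule
q1, _, q3 = quantiles([r["weight"] for r in cleaned], n=4, method="inclusive")
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
result = [r for r in cleaned if low <= r["weight"] <= high]

for r in result:
    print(r)
```

With pandas, the same steps map onto methods such as drop_duplicates, dropna, fillna, and astype.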

Cleaning data can be a lengthy process, but it's well worth it to make sure your data can be used appropriately. Let's see how to clean data automatically using dataprep.clean.

Clean Headers

from dataprep.clean import clean_headers
clean_headers(df, case="const")

To clean the headers we use the clean_headers function. I have used the const case style here, but there are various styles we could use:

snake: “column_name”
kebab: “column-name”
camel: “columnName”
pascal: “ColumnName”
const: “COLUMN_NAME”
sentence: “Column name”
title: “Column Name”
lower: “column name”
upper: “COLUMN NAME”
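To show what these styles mean in practice, here is a small plain-Python sketch that converts a header into each style. It only illustrates the naming conventions and is not DataPrep's implementation; the split_words and restyle helpers are hypothetical names.

```python
import re

def split_words(header):
    """Break a header into lowercase words."""
    # Put a space at lower-to-upper boundaries so "columnName" splits in two
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", header)
    # Split on whitespace, underscores, and hyphens
    return [w.lower() for w in re.split(r"[\s_-]+", spaced) if w]

# One formatter per case style from the list above
STYLES = {
    "snake": lambda w: "_".join(w),
    "kebab": lambda w: "-".join(w),
    "camel": lambda w: w[0] + "".join(x.capitalize() for x in w[1:]),
    "pascal": lambda w: "".join(x.capitalize() for x in w),
    "const": lambda w: "_".join(w).upper(),
    "sentence": lambda w: " ".join(w).capitalize(),
    "title": lambda w: " ".join(x.capitalize() for x in w),
    "lower": lambda w: " ".join(w),
    "upper": lambda w: " ".join(w).upper(),
}

def restyle(header, case="snake"):
    """Return the header rewritten in the requested case style."""
    words = split_words(header)
    return STYLES[case](words) if words else ""

for style in STYLES:
    print(f"{style:>8}: {restyle('Column Name', style)}")
```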

Clean Phone numbers

from dataprep.clean import clean_phone
clean_phone(df, "PHONE")

clean_phone(df, "PHONE", output_format="national",split=True, fix_missing="auto")

Here the output_format parameter sets the style in which the phone number is displayed, and the split parameter adds individual columns containing the parts of the cleaned phone number to the given DataFrame. By default, the fix_missing parameter is set to “empty” (leave a missing country code as is). If set to “auto”, a missing country code is set to “1”.
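The effect of these parameters can be pictured with a simplified plain-Python sketch for 10-digit US-style numbers. This is not how DataPrep works internally: the function name, the dictionary keys, and the assumed "(NPA) XXX-XXXX" layout for the national style are illustrative assumptions.

```python
import re

def sketch_clean_phone(raw, fix_missing="empty"):
    """Split a US-style phone number into parts and format it in a national style."""
    digits = re.sub(r"\D", "", raw or "")  # keep digits only
    country = None
    if len(digits) == 11 and digits.startswith("1"):
        country, digits = "1", digits[1:]  # a leading country code was present
    if len(digits) != 10:
        return None  # not a number this sketch can handle
    if country is None and fix_missing == "auto":
        country = "1"  # fill in the missing country code
    area, office, station = digits[:3], digits[3:6], digits[6:]
    return {
        "country_code": country,
        "area_code": area,
        "office_code": office,
        "station_code": station,
        "national": f"({area}) {office}-{station}",
    }

print(sketch_clean_phone("555.234.5678"))
print(sketch_clean_phone("555 234 5678", fix_missing="auto"))
print(sketch_clean_phone("+1 (555) 234-5678"))
```

The separate keys in the returned dictionary play the role of the extra columns that split=True adds, and the fix_missing argument only matters when the input has no country code.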

Replace

clean_headers(df, replace={"PHONE": "CONTACT NUMBER"})

These are a few of the ways to use dataprep.clean. It has many more parameters and many more ways to clean the data.

Link to the Jupyter notebook that I used here.

Advantages of Using Dataprep:

Plotting the data using Matplotlib or Seaborn might seem difficult, especially if you are a beginner.
There are other automated EDA libraries like Pandas Profiling and AutoViz, but DataPrep is comparatively faster, and it lets us customize the report, which is not possible with the others.
DataPrep.EDA supports big data stored in a Dask cluster by accepting a Dask data frame as input.

Resources

EDA using DATA PREP

Data Cleaning using DATA PREP

Thank you for taking the time to read the blog and I hope it was useful to you.