What is Exploratory Data Analysis?
Exploratory Data Analysis, or EDA, is the process of exploring a data set to understand its various aspects. It involves a sequence of techniques that we follow, and its main objective is to understand the data.
Before exploring, we need to make sure the data is clean: free of redundancy, missing values and nulls. We should also be aware of all the important variables in the data set and remove any unnecessary noise that may hinder the accuracy of our conclusions when we move on to model building.
Through EDA we must also understand the relationships between the variables and, last but not least, be able to derive conclusions and gather insights from the data so we can move on to more complex stages of the data processing life cycle.
What is the Objective of EDA in data exploration?
The first objective of EDA is to make sure that the data is clean.
That means the data set has no redundancy. EDA helps us identify the faulty points in the data, and once identified, we can easily remove them.
The second objective of EDA is to understand the relationships between the variables, which gives us a wider perspective on the data and lets us build on it.
Steps involved in Exploratory Data Analysis:
The basic steps involved in EDA are:
Understand the data:
After loading the data into our program, the first step is to understand it: the different variables that exist in the data, the relationships between them, the data types, the number of rows and columns, and what the data set looks like.
Clean the data:
Cleaning the data means removing redundancies. A redundancy can be an irregularity in the data, or variables and columns that are not necessary for our conclusions or interpretations, so we can simply remove them.
Analysis of relationship between variables:
This will help us to get an idea of the data and to build on our interpretations.
Now let's jump into Visual Studio Code with a sample data set.
The first thing we have to do in Visual Studio Code is import pandas with the alias pd, along with a few more libraries we need, as shown below.
Here NumPy and seaborn are libraries for numerical calculations and visualizations.
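A minimal sketch of the imports and the data load follows; the file name StudentsPerformance.csv is an assumption here, so replace it with the path to your own CSV.

```python
# Core libraries: pandas for data handling, NumPy for numerical work,
# seaborn and matplotlib for visualization.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the sample data set (file name is assumed; use your own CSV path).
df = pd.read_csv("StudentsPerformance.csv")
```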
Understanding the data:
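To take a first look at the data set, we can print the first few rows:

```python
# head() returns the first 5 rows by default.
df.head()
```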
Here we can see the first 5 rows of the data, with columns such as gender, race/ethnicity and so on.
Using these columns, we can draw our conclusions and interpretations. Let's check the tail of the data set.
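tail() does this for us:

```python
# tail() returns the last 5 rows by default.
df.tail()
```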
From this we can see that the row index starts at 0 and ends at 999, which means our data set has 1000 rows.
Now we can check the shape of the data as below.
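shape gives us the dimensions directly:

```python
# shape returns a (rows, columns) tuple.
df.shape
```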
That means the data set has 1000 rows and 8 columns.
describe() summarizes only the numeric columns, with values like count, mean, std, min and max.
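For example:

```python
# Summary statistics for the numeric columns:
# count, mean, std, min, quartiles and max.
df.describe()
```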
We can also check the row and column details separately. For columns we use df.columns, which lists all the columns in the data set.
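```python
# List all column names in the data set.
df.columns
```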
info() gives the details of each column, its data type and the count of non-null values.
rename() is used to rename columns as we want.
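```python
# info() prints each column's name, non-null count and data type.
df.info()
```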
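For example, a hypothetical mapping that shortens two of the column names; the result is not assigned back to df here, so the original names are kept for the rest of the walkthrough.

```python
# rename() takes a dict of {old name: new name}; this mapping is only an
# illustration, and we do not assign the result back to df.
df.rename(columns={"race/ethnicity": "race",
                   "parental level of education": "parent_education"})
```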
Cleaning the data:
To check the total number of nulls in each column:
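A one-liner chaining isnull() with sum() does this:

```python
# Count missing values per column.
df.isnull().sum()
```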
Here, we don't have any null values, so we don't need to remove or replace anything. The next part is dropping the redundant data that does not add anything to our analysis.
Now we are going to remove the unnecessary columns race/ethnicity and parental level of education, which are not important to our data evaluation.
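A sketch with drop():

```python
# Drop the columns we don't need for this analysis.
df = df.drop(columns=["race/ethnicity", "parental level of education"])
```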
Relationship Analysis:
The next and final step is relationship analysis. The first thing I would like to do is a correlation matrix, because it gives us a wider perspective on what exactly we are dealing with. A correlation matrix is a table showing correlation coefficients between variables; each cell in the table shows the correlation between two variables. A correlation matrix can also be useful for diagnostic analysis.
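A sketch of the correlation matrix drawn as a heatmap; the "Blues" colormap is my choice here so that darker cells correspond to stronger positive correlation.

```python
# Correlation matrix over the numeric columns only.
corr = df.select_dtypes(include="number").corr()

# Heatmap of the matrix; annot=True prints the coefficient in each cell.
sns.heatmap(corr, annot=True, cmap="Blues")
plt.show()
```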
Here the darker shades show positive correlation and the lighter shades represent negative correlation. It is good practice to remove highly correlated variables during feature selection.
Another way of analyzing numerical data is using plots such as the pair plot. A pair plot can be used to visualize the relationship between pairs of variables, which may be continuous, categorical or Boolean.
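A minimal pair plot over the data set, using gender as the hue (seaborn plots the numeric columns pairwise):

```python
# Pairwise plots of the numeric columns, colored by gender.
sns.pairplot(df, hue="gender")
plt.show()
```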
Here each pair of variables is shown in a separate cell of the grid, as shown above.
Another approach is to use a scatter plot, which shows the relationship between two numerical variables.
Here we take math score on the x-axis, reading score on the y-axis and gender as the hue for the student data.
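A sketch of that scatter plot:

```python
# Scatter plot of math score vs. reading score, colored by gender.
sns.scatterplot(data=df, x="math score", y="reading score", hue="gender")
plt.show()
```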
We move on to the next plot, the histogram. A histogram is a graphical display of data using bars of different heights, where each bar groups numbers into ranges; a taller bar means that more of the data falls in that range. A histogram basically displays the shape and spread of continuous sample data. Now we can see some histograms as part of our analysis.
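A simple histogram of the math score column; histplot is one seaborn option for this.

```python
# Histogram of math score with seaborn's default binning.
sns.histplot(df["math score"])
plt.show()
```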
Here we can see that most of the values are between 60 and 80, so we can guess that most people are scoring in that range. We can check the other columns as well, such as reading score and writing score.
We can also specify the number of bins.
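For example, with matplotlib we can pass bins=5 to group the scores into five ranges:

```python
# Histogram of math score grouped into 5 bins.
plt.hist(df["math score"], bins=5)
plt.xlabel("math score")
plt.ylabel("count")
plt.show()
```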
We can see the 5 bars here. This is how we can use histograms for data visualization; a histogram shows the frequency distribution of a variable.
Next, we can work on categorical plots.
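As one example of a categorical plot (a count plot of gender is my own choice here, not something prescribed by the data set), seaborn's countplot counts the rows in each category:

```python
# Count plot: number of students in each gender category.
sns.countplot(data=df, x="gender")
plt.show()
```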
To summarize, in the EDA process we loaded the data, understood it, cleaned it and analyzed the relationships between the variables with different plots.
I hope this gave you some useful knowledge about the EDA process.
Thank you for reading my blog.