What is Exploratory Data Analysis (EDA)?
EDA is the process of analyzing a dataset and summarizing its main features, often with data visualization methods. It involves preliminary investigation of the data to uncover patterns, detect anomalies, test hypotheses, and verify assumptions. Data scientists use it primarily to understand the various aspects of a dataset and to see what the data can reveal beyond formal modeling or hypothesis testing.
Why is EDA important to data scientists?
The main purpose of EDA is to look at the data before making any assumptions. It can help data scientists identify obvious errors, better understand patterns within the data, detect outliers and anomalies, and find relationships between different variables. Although EDA can be carried out at various stages of the data analytics process, it is usually conducted before a hypothesis or end goal is defined.
In general, EDA focuses on understanding the characteristics of a dataset before deciding what to do with it. Exploratory data analysis often relies on visual techniques like graphs, plots, and other visualizations, because our natural pattern-detecting abilities make it much easier to spot trends and anomalies when they are represented visually. EDA can provide insights that an algorithm alone cannot.
In short, EDA is used to:
· Generate questions about your data.
· Search for answers by visualizing, transforming, and modeling your data.
· Use what you learn to refine your questions or generate new ones.
What are the benefits of EDA?
1. Spotting missing or incorrect data
2. Understanding the structure of data
3. Testing our hypothesis and assumptions
4. Identifying the most important variables
5. Determining error margins
6. Identifying the most appropriate statistical tools for the analysis
Exploratory Data Analysis:
Importing required libraries:
First, let us import the libraries that we need: Pandas, NumPy, Seaborn, and Matplotlib.
1. Pandas – for reading the dataset files
2. Seaborn – for graphical visualization of data
3. NumPy – for numerical calculations
4. Matplotlib – for graphical visualization
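Assuming the libraries are already installed (for example, via pip), a minimal import block might look like this:

```python
import pandas as pd                # reading and manipulating the dataset
import numpy as np                 # numerical calculations
import seaborn as sns              # statistical visualizations
import matplotlib.pyplot as plt    # general plotting
```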
Data Discovery Analysis:
For this walkthrough, I've uploaded a dataset that contains the percentage of protein intake from different types of food in countries around the world. The last few columns also include obesity, undernourishment, and COVID-19 case figures as percentages of the total population.
In the snippet below, we read the dataset using the pandas library and display the first five and last five rows using head() and tail(), respectively.
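A minimal sketch of the loading step; the filename protein_intake.csv is a placeholder, not the dataset's actual name:

```python
# Hypothetical filename: point this at your own copy of the dataset.
df = pd.read_csv("protein_intake.csv")

print(df.head())  # first five rows
print(df.tail())  # last five rows
```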
Now, to get a random sample of the data, we can use sample(). Passing frac=0.1, as in the sketch below, returns a random 10% of the rows.
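```python
# frac=0.1 draws a random 10% of the rows (without replacement by default).
print(df.sample(frac=0.1))
```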
Basic Information about data:
The df.info() function gives us basic information about the dataset. For any data, it is good to start by looking at this summary.
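For example:

```python
# Prints the column names, non-null counts, dtypes, and memory usage.
df.info()
```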
The describe() method shows descriptive statistics of the dataset (see the sketch after the list below):
Count: the number of non-null values in each column
Mean: the average value
Std: the standard deviation
Min: the minimum value
25%: the first quartile
50%: the median (second quartile)
75%: the third quartile
Max: the maximum value
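A quick sketch of the call:

```python
# Summary statistics for the numeric columns.
print(df.describe())
```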
Real-world data often comes with a lot of missing values (and, as we will see below, ours does too). These missing values can be dealt with in one of the following ways (sketched after the list):
1. Replacing the null values with the mean values
2. Replacing the null values with the median value
3. Replacing the null values with the mode value
4. Dropping the rows that contain null values, if the dataset is large enough
If the dataset contains outliers, the first approach in particular is not recommended, since outliers distort the mean; the median and mode are more robust choices.
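As a rough sketch of the four options, using a hypothetical numeric column named col:

```python
# "col" is a placeholder column name; each line shows one strategy.
df["col"] = df["col"].fillna(df["col"].mean())     # 1. mean
df["col"] = df["col"].fillna(df["col"].median())   # 2. median
df["col"] = df["col"].fillna(df["col"].mode()[0])  # 3. mode
df = df.dropna()                                   # 4. drop rows with nulls
```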
Now, to look at the datatypes of the columns, we can use the dtypes attribute.
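For example:

```python
# Data type of each column.
print(df.dtypes)
```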
We can use the duplicated().sum() method to count duplicate rows; it returns the number of duplicated rows present in the data.
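```python
# duplicated() flags repeated rows; summing the booleans counts them.
print(df.duplicated().sum())
```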
Our result shows that there are no duplicates in the dataset. Now, to check for null values, we can use the isnull() function.
The code below shows that the columns Obesity, Undernourished, Confirmed, Deaths, Recovered, and Active have null values in them.
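In code:

```python
# Number of null values in each column.
print(df.isnull().sum())
```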
We can replace the null values with 0, mean, median or mode value, depending on the dataset.
Assuming we are replacing the null values in the column Undernourished with the median value, the snippet below shows one way to do it.
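A minimal sketch, assuming the column is numeric (if it is stored as text, convert it with pd.to_numeric first):

```python
# Fill nulls in 'Undernourished' with the column's median.
median_value = df["Undernourished"].median()
df["Undernourished"] = df["Undernourished"].fillna(median_value)
```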
To understand the relationship, or correlation, between the variables, we can use the corr() function. It shows the correlation between each pair of numeric variables on a scale of -1 to 1, where 1 denotes total positive correlation, 0 denotes no correlation, and -1 denotes total negative correlation.
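For example (numeric_only=True restricts the calculation to numeric columns in recent pandas versions):

```python
# Pairwise correlations between numeric columns, ranging from -1 to 1.
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix)
```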
We can even use seaborn to visualize the correlation matrix.
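One common approach is sns.heatmap; the colormap choice here is just a preference:

```python
# Heatmap of the correlation matrix computed above.
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, cmap="coolwarm")
plt.show()
```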
EDA is one of the most important parts of any analysis because it summarizes the features and characteristics of the dataset. Now that we know what exploratory data analysis is and why it's important, how exactly does it work? In short, exploratory data analysis considers what to look for, how to look for it, and, finally, how to interpret what we discover. Exploring data with an open mind tends to reveal its underlying nature far more readily than making assumptions about the rules we think (or want) it to adhere to. In data analytics terms, we can generally say that exploratory data analysis is a qualitative investigation rather than a quantitative one. We have looked at some of the basic descriptive analyses of the data using Python libraries, and we will explore the data further with more visualizations in the next blog.