Structured Query Language or SQL
One of the most popular and easy-to-learn tools for data analysis is SQL. It lets us retrieve, filter, sort, join, and aggregate data. Using simple SQL queries, we can uncover patterns, trends, and outliers in our datasets.
Exploratory data analysis is the process of familiarizing ourselves with the dataset we are working with, and it is a fundamental step in the analytical process.
Exploratory Data Analysis (EDA) with SQL
It is a process of using SQL (Structured Query Language) queries to explore and analyze tables in a database management system (DBMS).
In this blog, we will go through the steps involved in EDA with SQL.
I’ve used PostgreSQL as the DBMS to demonstrate. If you’re using a different SQL database, refer to its documentation for the correct syntax. Most commands and functions discussed here should be applicable across other SQL systems.
The dataset used is the Hospital Dataset from Kaggle: https://www.kaggle.com/datasets/namratakapoor1/hcdataset. To simulate a real-life scenario, I created a database named Hospital and a schema named hospitaldata, as shown in the figure below, and loaded this dataset into the PostgreSQL DBMS. The steps for creating tables and loading the CSV files are not included here, as in real-world projects you would most likely be working with pre-loaded databases for exploration.
ERD of HospitalData.
On gaining access to a database, the first things you'll want to explore are the tables, columns, and their relationships. As shown in the figure above, there are 6 tables in the hospitaldata schema. Patients and Providers are the base tables holding patient and healthcare-provider information respectively. The rest of the tables hold transactional data.
An Entity Relationship Diagram (ERD) illustrates the tables (entities), their columns (attributes), and the relationships between those tables. It helps us understand the structure and connections in a database.
Describing the Structure of a Table.
In PostgreSQL, the information_schema.columns view contains metadata about all the columns in the database. Querying it helps us understand the structure of a table: its schema, column names, data types, whether columns are nullable, and so on. Here's an example query to describe the patients table in the hospitaldata schema.
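A minimal sketch, using the standard columns of the information_schema.columns view:

-- Describe the structure of the patients table
SELECT table_schema, table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'hospitaldata'
  AND table_name = 'patients'
ORDER BY ordinal_position;  -- list columns in table order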
Likewise, we can look at the other tables in this database. Let's look at the edvisits table.
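Only the table name in the filter changes:

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'hospitaldata'
  AND table_name = 'edvisits'
ORDER BY ordinal_position;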
Retrieving data from the database; The SELECT statement.
Once we're familiar with the tables and columns, the next step is to understand the data stored in those tables. To do this, we typically examine a few rows of data. The SELECT statement lets us retrieve data from the tables.
SELECT * FROM table_name; lets us look at all the rows, and the query tool also reports the total number of rows in the queried table. For instance, here the total row count is 945:
SELECT * FROM hospitaldata.patients;
Note: In hospitaldata.patients, hospitaldata is the schema name, followed by the table name, patients.
Filtering Data; The WHERE clause.
The WHERE clause lets us filter data on the basis of some condition. Let's filter rows from the providers table where providerspeciality is 'Surgery'.
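A sketch of the filter, using the providerspeciality column named above:

-- Providers whose speciality is Surgery
SELECT *
FROM hospitaldata.providers
WHERE providerspeciality = 'Surgery';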
Sorting and Ordering data.
Sorting data lets us organize data in a sequence and improves readability.
We can sort data in SQL using the ORDER BY clause.
Let's sort the readmissionregistry table by admissionid.
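For example:

-- Sort ascending by admissionid (add DESC for descending order)
SELECT *
FROM hospitaldata.readmissionregistry
ORDER BY admissionid;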
The LIMIT clause.
In real life, as data analysts, we often work with massive datasets containing millions of rows. We can optimize the performance of a SELECT statement with a LIMIT clause, which caps the number of rows returned.
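A sketch, here capping the patients table at 10 rows (the cap of 10 is arbitrary):

-- Return only the first 10 rows
SELECT *
FROM hospitaldata.patients
LIMIT 10;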
COUNT Function & Missing values.
We can use the COUNT function to get the count of rows in a table.
If we pass a column name to the COUNT function, it gives us the count of non-null values in that column. Below is the query for getting the count of values in the race column of the patients table.
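A sketch of both forms (the counts in the comments match the results discussed below):

SELECT COUNT(*) FROM hospitaldata.patients;     -- total rows: 945
SELECT COUNT(race) FROM hospitaldata.patients;  -- non-null race values: 939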
The value returned is 939, which is less than the total row count of 945 in the patients table. This hints that there are missing values in the race column.
We can validate this using the IS NULL condition in the WHERE clause, as shown below. The query returned all 6 rows where the value for the race column is missing, i.e. NULL.
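A sketch of the validation query:

-- Rows where race is missing
SELECT *
FROM hospitaldata.patients
WHERE race IS NULL;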
Checking for duplicates with the GROUP BY clause.
Finding duplicates can be tricky at times. Five of our tables have primary keys ensuring uniqueness at the primary-key level, but we still need to check for duplicates across the other columns manually.
Here is an example query that checks for duplicate records in the patients table based on a combination of columns:
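A sketch using the column combination discussed below:

-- Groups of patients sharing the same combination of values
SELECT dateofbirth, gender, race, ptlanguage, COUNT(*)
FROM hospitaldata.patients
GROUP BY dateofbirth, gender, race, ptlanguage
HAVING COUNT(*) > 1;  -- keep only groups that occur more than once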
This query groups records by dateofbirth, gender, race, and ptlanguage, and counts the number of occurrences in each group. The HAVING clause with COUNT(*) > 1 returns only the groups with duplicates. We get 6 rows with a count of 2; these records need further investigation to determine what should be done with them.
The readmissionregistry table has no primary key. We can use the same logic to check it for duplicates, as in the sketch below.
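A minimal sketch, grouping on admissionid, the one readmissionregistry column referenced earlier in this post; extend the GROUP BY with the table's remaining columns for a stricter check:

SELECT admissionid, COUNT(*)
FROM hospitaldata.readmissionregistry
GROUP BY admissionid
HAVING COUNT(*) > 1;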
There are none in this table as the query returned 0 rows.
Joins
We can explore how the tables relate to each other using joins. So far we have queried single tables; with SQL joins we can query multiple related tables. The join used in the query below is an inner join (the default). It returns only the rows where there is a match in both tables, patients and admissionsdischarges.
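A sketch of such a join, assuming patientid is the shared key between the two tables (an assumption based on the ERD naming, not spelled out in the text):

-- Inner join: only patients that have matching admission records
SELECT p.*, a.*
FROM hospitaldata.patients AS p
INNER JOIN hospitaldata.admissionsdischarges AS a
  ON p.patientid = a.patientid;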
Summarizing data
We can get a basic statistical summary with SQL using functions like AVG, MIN, MAX, STDDEV, and VARIANCE.
MIN and MAX functions let us find the range for a column. We can also find out if there are any extremely odd values or outliers using these functions.
AVG function gives the arithmetic mean. It lets us look at the central tendency of the column values.
Standard deviation and variance measure the amount of dispersion in a set of values: they tell us how far the values are spread from their mean. PostgreSQL provides the STDDEV and VARIANCE functions for these.
Since variance and standard deviation can be computed for either a sample of data or the whole population, PostgreSQL also provides corresponding sample and population functions: var_samp, stddev_samp, var_pop, and stddev_pop (STDDEV and VARIANCE are aliases for the sample versions). These functions are good to have in our toolkit and let us explore data in a meaningful, statistical way.
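A combined sketch, assuming the acuity column used in the next section lives in the edvisits table and is numeric:

SELECT MIN(acuity)      AS min_acuity,
       MAX(acuity)      AS max_acuity,
       AVG(acuity)      AS avg_acuity,
       STDDEV(acuity)   AS stddev_acuity,  -- alias for stddev_samp(acuity)
       VARIANCE(acuity) AS var_acuity      -- alias for var_samp(acuity)
FROM hospitaldata.edvisits;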
Summarizing and aggregating data
We can also calculate summary statistics for a grouped result. For instance, if we want the average acuity (severity) per reason for visit, we group by the reasonforvisit column and then take the average of acuity over the grouped data.
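A sketch, again assuming acuity and reasonforvisit are columns of the edvisits table:

SELECT reasonforvisit, AVG(acuity) AS avg_acuity
FROM hospitaldata.edvisits
GROUP BY reasonforvisit
ORDER BY avg_acuity DESC;  -- most severe reasons first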
Exploratory Data Analysis with SQL is a powerful way to uncover insights from your data, but this is just the beginning. Don't restrict yourself to what's covered here. I sincerely hope these steps help you get started with EDA in SQL. Keep exploring, and let your curiosity guide you to even deeper insights!