Data Visualization Using Seaborn - 10 Essential Charts

Data Visualization

Data by itself can be difficult to understand and interpret. Data visualization translates complex data into visual formats such as charts, graphs and maps that are easier for the human brain to comprehend. It is one of the core data science processes, as it helps to make data more understandable and actionable for a wide range of users, from business professionals to data scientists.

Python is widely used by data scientists for its ease of use, scalability, and large ecosystem of libraries and frameworks. One Python library that stands out for data visualization is Seaborn. Built on top of Matplotlib, Seaborn offers an easy-to-use interface for creating visually appealing and informative plots. In this blog, we will explore 10 essential plots that you can build with Seaborn to get a better understanding of your data.

Seaborn - The Basics

Seaborn may not be installed by default in Jupyter or similar tools you use for Python. However, it is easy to install.

# Install the seaborn library
pip install seaborn

Once seaborn is installed, we can import it along with other essential libraries. In this blog we will use numpy and pandas for data manipulation, and matplotlib along with seaborn for data visualization. Under the hood, seaborn uses matplotlib to draw its plots.

# Python libraries for data manipulation
import numpy as np
import pandas as pd

# Python libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

The 10 Essential Seaborn Plots

We will explore 10 simple but essential plots available in Seaborn. For data, we will use one of the sample datasets, 'mpg' available in Seaborn. The 'mpg' dataset consists of data on cars and their features.

# Load sample dataset from sns
df = sns.load_dataset('mpg')
df.info()

The dataset has 9 columns and 398 records on different car models. Note that horsepower has only 392 non-null values, so 6 records are missing that field.

#   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object

1. Count Plot

A count plot uses bars to show the number of observations in each group of a categorical variable. It can be thought of as a histogram across a categorical, instead of quantitative, variable.

# Count plot of cars per country of origin
plt.figure(figsize=(5,4))
plt.title('Cars per Origin')
plt.xlabel('Country of Origin');
plt.ylabel('Number of Cars');
sns.countplot(data=df, x='origin', color ='teal', width=0.5)
plt.show()

The plot above shows the count of cars for each origin (group). Plot defaults such as color and bar width can be overridden. Because Seaborn is built on top of matplotlib, we can also use pyplot to customize features like the title and the x and y axis labels.
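To make concrete what the bar heights in a count plot represent, here is a minimal sketch in plain Python that tallies observations per category. The `origins` list is a small made-up sample, not the actual 'mpg' data.

```python
from collections import Counter

# Hypothetical sample of 'origin' values (illustrative only, not the real dataset)
origins = ['usa', 'japan', 'usa', 'europe', 'usa', 'japan']

# A count plot draws one bar per category;
# the bar height is the number of rows in that category
counts = Counter(origins)

for category, n in counts.most_common():
    print(f'{category}: {n}')
```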

2. Bar Plot

Bar plots look similar to count plots, but instead of the count of observations in each category, they show the relationship between a categorical variable and a numerical variable using an aggregate measure (like mean) of the numerical variable among observations in each category.

# Jupyter magic command to render the plots inline in the notebook
%matplotlib inline

# Bar plot of mean miles per gallon (mpg) per model year
plt.figure(figsize=(10,5))
plt.title('Bar plot with Mean MPG per Model Year')
plt.xlabel('Model year');
plt.ylabel('Average mpg');
sns.barplot(data=df, x='model_year', y ='mpg', estimator = 'mean', width=0.5)

A bar plot aggregates the numerical variable within each category using a statistical function, typically the mean. The height of each bar represents an estimate of central tendency for the numerical variable, and the error bar at the top of each bar represents the uncertainty around that estimate.
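The sketch below illustrates the two quantities a bar plot shows, using only the standard library. It assumes the error bar is a 95% confidence interval obtained by bootstrapping, which is how recent seaborn versions behave by default as far as we know. The `mpg_values` list and the `bootstrap_ci` helper are hypothetical and only meant to convey the idea, not to reproduce seaborn's exact implementation.

```python
import random
from statistics import mean

# Hypothetical mpg values for one category (illustrative only)
mpg_values = [18.0, 15.0, 16.0, 17.0, 24.0, 22.0, 21.0, 27.0, 26.0, 25.0]

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Return a (low, high) percentile interval for the mean via bootstrap resampling."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample
    boot_means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    # Keep the central `level` percent of the bootstrap means
    tail = (100 - level) / 2 / 100
    low = boot_means[int(tail * n_boot)]
    high = boot_means[int((1 - tail) * n_boot) - 1]
    return low, high

bar_height = mean(mpg_values)               # what the bar shows
ci_low, ci_high = bootstrap_ci(mpg_values)  # what the error bar spans
print(f'mean={bar_height:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})')
```

Categories with few or widely scattered observations produce longer error bars, which is worth keeping in mind when comparing bar heights.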

# Bar plot of median mpg per model year, split by country of origin
plt.figure(figsize=(10,5))
plt.title('Bar plot with Median MPG per Model Year')
plt.xlabel('Model year');
plt.ylabel('Median mpg');
sns.barplot(data=df, x='model_year', y ='mpg', hue = 'origin', palette= 'mako', estimator = 'median');

The bar plot above is customized using the following parameters:

hue - used to visualize the data of different categories in one plot.
palette - Colors to use for the different levels of the hue variable.
estimator - statistical function to estimate within each categorical bin.

3. Box Plot

A box plot, or box-and-whisker plot, shows the distribution and skewness of numerical data by displaying its quartiles. It is also called a five-number summary plot, where the five-number summary consists of the minimum, first quartile, median, third quartile, and maximum. A box plot summarizes the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable.

# Box plot of weight
plt.figure(figsize=(5,3))
plt.title('Box Plot - Distribution of Weight')
sns.boxplot(data=df, x='weight',color ='teal');

The median sits slightly to the left of the center of the box and the right whisker is longer, indicating that the data is skewed to the right.
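The numbers behind a box plot can be sketched in plain Python as follows. The `weights` list is hypothetical, and the 1.5 x IQR rule for whiskers and outliers is assumed here because it is the common convention (and, to our knowledge, seaborn's default); exact quartile values can differ slightly between interpolation methods.

```python
from statistics import quantiles

# Hypothetical weights (illustrative only), including one unusually heavy car
weights = [1800, 2000, 2100, 2200, 2300, 2500, 2800, 3000, 3500, 5200]

# Quartiles using linear interpolation between data points
q1, median, q3 = quantiles(weights, n=4, method='inclusive')
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box; points outside are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [w for w in weights if w < lower_fence or w > upper_fence]

# Whiskers end at the most extreme data points still inside the fences
inside = [w for w in weights if lower_fence <= w <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Under this convention the whiskers do not always reach the minimum and maximum: values beyond the fences are plotted as individual outlier points instead.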

We can compare groups with boxplots.

# Box plot of horsepower across cylinders 
plt.figure(figsize=(10,5))
plt.title('Box Plot - Distribution of Horsepower Against Total Cylinders')
sns.boxplot(data=df, x='cylinders', y='horsepower', hue='cylinders', legend=False, palette='YlOrBr');

The colors of the boxes can be varied by setting hue to the x-axis variable and setting legend=False. The chart above has a couple of outliers for cars with 6 and 8 cylinders.

4. Violin Plot

A violin plot is similar to a box plot. It shows the distribution of data points after grouping by one (or more) variables. However, unlike a box plot, each violin is drawn using a kernel density estimate of the underlying distribution.

# Violin plot of weight
plt.figure(figsize=(5,3))
plt.title('Violin Plot - Distribution of Weight')
sns.violinplot(data=df, x='weight',color ='teal');

The white marker at the center of the violin represents the median, and the thick dark bar represents the interquartile range. Wider sections of the violin correspond to values where the data is more densely concentrated, and narrower sections to values that occur less often.
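To give a feel for how the width of a violin is derived, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. The `sample` values, the `gaussian_kde` helper, and the bandwidth of 0.5 are all assumptions chosen for illustration; seaborn selects a bandwidth automatically and its implementation differs in detail.

```python
import math

# Hypothetical sample (illustrative only): a cluster near 1.3 and a smaller one near 4
sample = [1.0, 1.2, 1.4, 1.5, 3.8, 4.0]

def gaussian_kde(x, data, bandwidth):
    """Density estimate at x: the average of Gaussian bumps centered on each data point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

# The violin is widest where the estimated density is highest
for x in (1.3, 2.6, 4.0):
    print(f'density at {x}: {gaussian_kde(x, sample, bandwidth=0.5):.3f}')
```

A larger bandwidth gives a smoother, broader violin, while a smaller one follows the individual data points more closely.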

Like the box plots, we can have multiple violin plots represented in the same chart.

# Violin plot of horsepower across cylinders 
plt.figure(figsize=(10,5))
plt.title('Violin Plot - Distribution of Horsepower Against Total Cylinders')
sns.violinplot(data=df, x='cylinders', y='horsepower', hue='cylinders', legend=False, palette='colorblind');

5. Strip Plot

A strip plot is essentially a scatter plot in which one of the variables is categorical, so every observation is drawn as an individual point within its category. Strip plots are a good alternative to a box plot or a violin plot for comparing distributions when there are relatively few data points, and they can also be overlaid on those plots to show the individual observations behind the summary.

# Strip plot of car weight with model year
plt.figure(figsize=(12,7))
plt.title('Strip Plot - Car Weight Across Model Years')
sns.stripplot(data=df, x='model_year', y='weight', hue='origin', jitter=True);

Strip plots accept the 'hue' parameter to distinguish observations from subgroups. The 'jitter' parameter, when set, adds a small random offset to each point along the categorical axis so that points with the same value do not overlap.
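The idea behind jitter can be sketched in a few lines of plain Python. The `points` list, the `add_jitter` helper, and the offset width of 0.1 are hypothetical; seaborn chooses its own jitter amount.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical observations: (category position on the x-axis, y value)
points = [(0, 3000), (0, 3000), (0, 3000), (1, 2500), (1, 2500)]

def add_jitter(points, width=0.1, rng=rng):
    """Shift each point horizontally by a random offset within +/- width."""
    return [(x + rng.uniform(-width, width), y) for x, y in points]

jittered = add_jitter(points)
for (x, y), (jx, jy) in zip(points, jittered):
    print(f'({x}, {y}) -> ({jx:.3f}, {jy})')
```

Only the position along the categorical axis changes; the plotted values themselves are untouched, so jitter does not distort the data.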

6. Line Plot

A line plot is a relational data visualization showing how one continuous variable changes when another does. This type of data often shows up when we have data that evolves over time, for example, when we have monthly data over several years. It is one of the most common graphs widely used in finance, sales, marketing, healthcare, natural sciences, and more.

# Line plot of Car mpg against model years
plt.figure(figsize=(10,5))
plt.title('Line Plot - Miles per Gallon Across Model Years')
sns.lineplot(data=df, x='model_year', y='mpg', hue='origin', style='origin', markers=True, errorbar=None);

When there are multiple observations for the same x value, the line plot draws their mean and, by default, a confidence interval band around it. The band can be removed by setting errorbar=None. Setting 'style' gives each group its own line style, and markers=True adds a distinct marker for each group.

7. Scatter Plot

Sometimes we want to know whether two variables are related, for instance whether a small change in one is accompanied by a change in the other. Plotting the data points as a scatter plot, or scatter diagram, helps us check whether there is a potential relationship between them. A scatter plot can also encode additional semantic groupings, for example through color, which makes the graph easier to interpret.

# Scatter plot of weight against mpg
plt.figure(figsize=(10,5))
plt.title('Scatter Plot - Car Weight Correlation with MPG')
sns.scatterplot(data=df, x='mpg', y='weight', hue='cylinders');

The scatter plot shows a negative correlation between the weight of the cars and their miles per gallon: the lighter the car, the higher its mpg is likely to be. With the hue parameter set to cylinders, we can also infer that heavier cars are likely to have more cylinders.

8. Histogram

A histogram is a univariate plot that helps us understand the distribution of a continuous numerical variable. It breaks the range of the variable into intervals of equal length and counts the number of observations in each interval.

# Histogram of Acceleration
plt.figure(figsize=(8,5))
plt.title('Histogram - Acceleration')
sns.histplot(data=df, x='acceleration',kde=True,bins=12);

The histogram shows a roughly symmetric distribution of acceleration. Setting kde=True overlays a kernel density estimate, a smooth curve that approximates the distribution of the data over a continuous interval. The bins parameter sets the number of intervals; alternatively, the binwidth parameter specifies the width of each interval.
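The binning step itself can be sketched in plain Python as below. The `values` list and the `histogram` helper are hypothetical and only illustrate equal-width binning; they are not seaborn's implementation.

```python
# Hypothetical acceleration values (illustrative only)
values = [8.0, 11.5, 12.0, 13.5, 14.0, 15.0, 15.5, 16.0, 17.5, 19.0, 21.0, 24.0]

def histogram(values, bins):
    """Split [min, max] into `bins` equal-width intervals and count the values in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value goes into the last bin rather than opening a new one
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges

counts, edges = histogram(values, bins=4)
print(counts)  # number of observations per interval
print(edges)   # interval boundaries
```

Choosing the number of bins is a trade-off: too few bins hide the shape of the distribution, while too many make it noisy.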

9. Heatmap

A heatmap is a graphical representation of data as a color-encoded matrix. It is a great way of representing the correlation for each pair of columns in the data.

# Heatmap of continuous variables
plt.title('Heatmap')
sns.heatmap(data=df[['mpg', 'displacement', 'horsepower', 'weight', 'acceleration']].corr(), annot=True, cbar=False);

A value close to 1 indicates a strong positive correlation, while a value close to -1 indicates a strong negative correlation. Except for acceleration, all the variables are strongly correlated with each other, either positively or negatively.
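Each cell of the heatmap is a correlation coefficient between two columns. The sketch below computes the Pearson coefficient, which is the default method of pandas' corr(), in plain Python. The `columns` data and the `pearson` helper are hypothetical and use made-up values rather than the real dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns (illustrative only): heavier cars with lower mpg
columns = {
    'mpg': [30.0, 26.0, 22.0, 18.0, 14.0],
    'weight': [2000, 2400, 2900, 3500, 4300],
}

# Correlation matrix as a nested dict: one cell per pair of columns
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

for a, row in corr.items():
    print(a, {b: round(r, 2) for b, r in row.items()})
```

The diagonal is always 1 because every column is perfectly correlated with itself, and the matrix is symmetric, so the heatmap shows each pair twice.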

10. Pair Plot

A pair plot shows the relationship between two numeric variables for each pair of columns in the dataset. It creates a grid of axes such that each variable in data will be shared in the y-axis across a single row and in the x-axis across a single column.

# Pair plot of numerical variables
sns.pairplot(data=df, vars=['mpg', 'displacement', 'horsepower', 'weight', 'acceleration'],hue='origin');

Conclusion

Seaborn is a powerful and easy-to-use tool for creating a wide variety of data visualizations. The 10 plots explored in this blog cover most needs in exploratory data analysis and in preparing insights for stakeholders. With just a few lines of code, we can transform raw data into insightful visualizations that communicate findings effectively. Explore the range of plots offered by Seaborn and start experimenting with your datasets today!
