As I contemplated the next topic for my blog, this one struck me as both a curious question and a useful Python exercise. Let me tell you why I chose this title. While searching for datasets to work on, I stumbled upon an interesting one that lists the heights of all the US presidents. I was instantly curious to explore the data and find out how many US presidents were taller than 6 feet. I hope the same curiosity will invite many of you to read my blog as well.
Let me start with a small introduction to the Python libraries in question: NumPy, Pandas, Matplotlib, and Seaborn, which will be used in finding the answer to my question.
In the scientific computing and data science communities, three Python packages are used extensively:
NumPy
Pandas
Matplotlib
NumPy stands for Numerical Python. It was created in 2005 by Travis Oliphant. It is an open-source Python library that can be used freely. It provides a high-performance multidimensional array object and tools for working with these arrays. NumPy handles basic mathematical functions such as the sum and the mean. It is used extensively to create and manipulate multidimensional arrays, which are called tensors in neural networks and machine learning.
To use NumPy, we need to import the package as shown below:
import numpy as np
Here’s a simple NumPy program that builds a 2x3 array (2 rows, 3 columns) and then performs various array operations on it.
arr = np.array([[1, 2, 3], [4, 5, 6]])
print('Array:')
print(arr)
print("Sum of all elements:", np.sum(arr)) #sum of all elements
print("Sum of each column:", np.sum(arr, axis=0)) # columnwise sum
print("Sum of each row:", np.sum(arr, axis=1)) #rowwise sum
Output:
Array:
[[1 2 3]
 [4 5 6]]
Sum of all elements: 21
Sum of each column: [5 7 9]
Sum of each row: [ 6 15]
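Since NumPy also covers basic statistics such as the mean, here is a small sketch on the same array. It only assumes the array from the example above; the variable names are my own.

```python
import numpy as np

# Same 2x3 array as in the example above
arr = np.array([[1, 2, 3], [4, 5, 6]])

print("Shape:", arr.shape)            # (rows, columns)

overall_mean = np.mean(arr)           # mean of all six elements
column_means = np.mean(arr, axis=0)   # one mean per column
row_means = np.mean(arr, axis=1)      # one mean per row

print("Mean of all elements:", overall_mean)
print("Mean of each column:", column_means)
print("Mean of each row:", row_means)
```

As with np.sum, axis=0 works down the columns and axis=1 works across the rows.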
Pandas is a Python data analysis library built on top of NumPy. It is a highly useful and important package in data science. It provides high-performance, easy-to-use data structures and data analysis tools. It presents data in a form suitable for analysis through its Series and DataFrame data structures. Pandas also has a variety of methods for convenient data filtering and utilities for reading and writing data seamlessly.
To use Pandas, we need to import the package as shown below:
import pandas as pd
Here’s a simple Pandas program that builds a data frame from two lists.
# students column
stud = ["Ram", "Diya", "Raj"]
# mark of student column
mark = [80,85,90]
# Put everything into a dataframe
df = pd.DataFrame()
df["Student"] = stud
df["Mark"] = mark
print('DataFrame')
print(df)
Output:
DataFrame
  Student  Mark
0     Ram    80
1    Diya    85
2     Raj    90
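To give a feel for the data filtering mentioned earlier, here is a minimal sketch that keeps only the students with a mark above 80. It rebuilds the same small data frame so that it runs on its own; the name above_80 and the cutoff of 80 are just my choices for illustration.

```python
import pandas as pd

# Same students and marks as in the example above
df = pd.DataFrame({"Student": ["Ram", "Diya", "Raj"], "Mark": [80, 85, 90]})

# df["Mark"] > 80 gives a True/False value per row;
# indexing the data frame with it keeps only the True rows
above_80 = df[df["Mark"] > 80]
print(above_80)
```

The comparison produces a boolean Series, and Pandas uses it as a row filter. The original row labels (1 and 2 here) are kept in the result.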
The Matplotlib and Seaborn libraries are used for data visualization.
Matplotlib is a library that adds data visualization functions to Python. Typical Matplotlib visualizations include bar charts, pie charts, line plots, scatter plots, and so on. It is well integrated with NumPy and Pandas.
Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. It integrates closely with Pandas data frames.
Now let’s cut to the chase, and find out the answer to my title question.
To find the average height of US presidents, we will use president_heights_new.csv from Kaggle datasets. Let’s load the data into a data frame as shown below and explore it.
# Load the CSV data into a data frame
data = pd.read_csv('../input/us-presidents-heights-how-low-can-u-go/president_heights_new.csv')
# Print the number of rows
print("No. of records =", len(data))
# Print the first four records just to check the data
print(data.head(4))
Output:
No. of records = 43
   order               name  height(cm)
0      1  George Washington         189
1      2         John Adams         170
2      3   Thomas Jefferson         189
3      4      James Madison         163
Let’s take the height column into a NumPy array to perform numerical operations. The same can be done directly on the data frame, but here I am using NumPy arrays to demonstrate how these libraries work together to achieve the required outcome.
heights = np.array(data['height(cm)'])
print("Heights of US Presidents")
print(heights)
Output:
Heights of US Presidents
[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173 174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183 177 185 188 188 182 185 188]
Now on this array, we can perform all the statistical operations as shown below. Let’s find out the mean, minimum and maximum heights.
print("Mean Height: ", heights.mean())
print("Minimum Height: ", heights.min())
print("Maximum Height: ", heights.max())
Output:
Mean Height: 179.93023255813952
Minimum Height: 163
Maximum Height: 193
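For comparison, the same three statistics can be computed in Pandas without converting to NumPy. This is only a sketch: to keep it runnable on its own, I build a Series from the heights printed above instead of reading the CSV. With the real data frame, the equivalent would be calling these methods on data['height(cm)'].

```python
import pandas as pd

# Heights copied from the array printed above (43 values, in cm)
heights_cm = [189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173, 173,
              175, 178, 183, 193, 178, 173, 174, 183, 183, 168, 170, 178,
              182, 180, 183, 178, 182, 188, 175, 179, 183, 193, 182, 183,
              177, 185, 188, 188, 182, 185, 188]

# A Series stands in for the 'height(cm)' column of the data frame
height_series = pd.Series(heights_cm, name="height(cm)")

print("Mean Height:", height_series.mean())
print("Minimum Height:", height_series.min())
print("Maximum Height:", height_series.max())

# describe() reports count, mean, std, min, quartiles and max in one call
print(height_series.describe())
```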
We can see that the average height of the US presidents is 179.93 cm, or roughly 180 cm. That is about 5 feet 11 inches, just under 6 feet (182.88 cm).
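The question from the introduction, how many presidents were taller than 6 feet, can be answered with a boolean comparison on the array. Here is a minimal sketch, taking 6 feet as 6 x 12 x 2.54 = 182.88 cm and using the heights printed above. The names SIX_FEET_CM and num_taller are my own.

```python
import numpy as np

# Heights copied from the array printed above (43 values, in cm)
heights = np.array([189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173,
                    173, 175, 178, 183, 193, 178, 173, 174, 183, 183, 168,
                    170, 178, 182, 180, 183, 178, 182, 188, 175, 179, 183,
                    193, 182, 183, 177, 185, 188, 188, 182, 185, 188])

# 6 feet = 72 inches, and 1 inch = 2.54 cm
SIX_FEET_CM = 6 * 12 * 2.54

# The comparison yields True/False per president; summing counts the Trues
num_taller = int(np.sum(heights > SIX_FEET_CM))

print("Presidents taller than 6 feet:", num_taller, "out of", len(heights))
```

With this data, 19 of the 43 presidents come out taller than 6 feet. Since the heights are whole centimetres, "taller than 182.88 cm" here means 183 cm or more.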
Sometimes it is more useful to see the output in pictorial form. We can achieve this using Matplotlib and Seaborn. Since this dataset is a very simple one with only three columns, we will just use Matplotlib to plot a histogram. Let’s import Matplotlib as shown below:
import matplotlib.pyplot as plt
As you can see, I have used the pyplot module. pyplot is a module in the matplotlib package that provides a collection of command-style functions. Each pyplot function makes some change to a figure: for example, it creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, or decorates the plot with labels.
Let’s now plot a histogram with the heights of the US presidents on the X-axis and the number of presidents on the Y-axis as shown below:
plt.hist(heights)
plt.title('Height distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number')
plt.show()  # display the figure when running as a script
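Output: a histogram titled 'Height distribution of US Presidents', with height (cm) on the X-axis and number on the Y-axis.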
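To see the numbers behind the plot, the bin counts can also be computed without drawing anything. This sketch assumes the default of 10 equal-width bins, which is what plt.hist uses when no bins argument is given, and calls np.histogram on the same heights.

```python
import numpy as np

# Heights copied from the array printed earlier (43 values, in cm)
heights = np.array([189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173,
                    173, 175, 178, 183, 193, 178, 173, 174, 183, 183, 168,
                    170, 178, 182, 180, 183, 178, 182, 188, 175, 179, 183,
                    193, 182, 183, 177, 185, 188, 188, 182, 185, 188])

# counts[i] is the number of heights that fall between edges[i] and edges[i + 1]
counts, edges = np.histogram(heights, bins=10)

for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:.0f} to {right:.0f} cm: {count}")
```

Each bin is 3 cm wide here because the heights span 163 cm to 193 cm. Every bin includes its left edge, and only the last bin also includes its right edge.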
Using three very popular Python libraries, we found the average height of the US presidents and also saw how their heights are distributed.
To conclude, these three Python packages are used extensively by data scientists to format, process, and query their data and to perform data manipulation and data visualization.
Happy coding!