Data Visualization using Pandas :
Data visualization is an essential part of exploratory data analysis. Pandas is an open-source library and arguably the most popular one for data analysis and manipulation; together, its functions make up a powerful and versatile data analysis tool. The name is commonly read as “Python Data Analysis Library”. What’s cool about Pandas is that it takes data (such as a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns, called a DataFrame, that looks very similar to a table in statistical software (Excel or SPSS, for example).
Why Pandas? :
Pandas is one of the most famous data science tools, and it’s definitely a game-changer for cleaning, manipulating, and analyzing data. It is a powerful, flexible, and easy-to-use tool that can be imported with import pandas as pd. Powerful machine learning and glamorous visualization tools may get all the attention, but pandas is the backbone of most data projects.
Importing Pandas :
To start working with pandas, import the package. The community-agreed alias is pd, so loading pandas as pd is assumed standard practice throughout the pandas documentation.
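As a minimal sketch, the import looks like this. Printing the version is only a quick way to confirm that the package loaded.

```python
# Import pandas under its conventional alias.
import pandas as pd

# Quick sanity check that the import worked.
print(pd.__version__)
```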
Data Frame :
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. It can also be created from lists, from a dictionary, from a list of dictionaries, and so on.
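Here is a small sketch of two of these in-memory routes. The column names match the ones used later in this post, but the values are made up purely for illustration.

```python
import pandas as pd

# From a dictionary of lists: each key becomes a column name.
from_dict = pd.DataFrame({"Pregnancies": [1, 8, 13], "Age": [31, 50, 44]})

# From a list of dictionaries: each dictionary becomes one row.
from_records = pd.DataFrame([
    {"Pregnancies": 1, "Age": 31},
    {"Pregnancies": 8, "Age": 50},
    {"Pregnancies": 13, "Age": 44},
])

print(from_dict)
print(from_records)
```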
Creating DataFrame :
I am adding the Gestational Diabetes data into a DataFrame.
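The original file is not included in this post, so the sketch below reads a few made-up rows from an in-memory string instead. With a real file on disk, the same read_csv call would take the file path; the name "your_file.csv" in the comment is a hypothetical placeholder.

```python
import io
import pandas as pd

# Placeholder CSV text standing in for the real dataset; the values are made up.
csv_text = """Pregnancies,Age
6,50
1,31
13,44
14,38
15,43
"""

# With a real file this would be: df = pd.read_csv("your_file.csv")
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```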
Displaying the DataFrame :
Now that we have a DataFrame, we can take a look at the data. First we can view the first few rows of data with .head() :
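A quick sketch of .head() on a small stand-in DataFrame (the values are made up for illustration). Called without an argument it returns the first five rows; an integer argument changes that number.

```python
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

first_rows = df.head()    # first 5 rows by default
first_three = df.head(3)  # first 3 rows
print(first_rows)
```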
Statistical Functions :
Different statistics are available and can be applied to columns with numerical data. When a statistic is applied to several columns of a DataFrame at once, it is calculated separately for each numeric column.
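For example, describe() gives a summary table and mean() gives one value per numeric column. This is a sketch on made-up stand-in data, not the real dataset.

```python
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

summary = df.describe()             # count, mean, std, min, quartiles, max per column
means = df.mean(numeric_only=True)  # one mean per numeric column
oldest = df["Age"].max()            # a statistic on a single column

print(summary)
print(means)
```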
Creating Plots :
Let us create a few basic plots for a better understanding of the concept.
Different Types of Plots :
Line Plot
Area Plot
Bar Plot
Barh Plot
Histogram Plot
Scatter Plot
Pie Plot
Box Plot
KDE Plot
Hexbin Plot
Line Plot :
The line plot function plots lines using the DataFrame’s values as coordinates. Let us create a line plot for the Pregnancies column, restricted to rows where the number of pregnancies is greater than 12.
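A sketch of that filter-then-plot step, using made-up stand-in values rather than the real dataset. The Agg backend line is an assumption so the snippet runs without a display; it can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

# Keep only the rows where Pregnancies is greater than 12, then plot that column.
high = df[df["Pregnancies"] > 12]
ax = high["Pregnancies"].plot.line()
ax.set_ylabel("Pregnancies")
```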
Area Plot :
An area plot displays quantitative data as lines with the region beneath them filled in. The following graph displays the area plot of the whole DataFrame.
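A sketch of an area plot over every column, on made-up stand-in data. By default pandas stacks the columns on top of each other; passing stacked=False overlays them instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

# One filled region per column, stacked by default.
ax = df.plot.area()
```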
Bar Plot :
A bar plot presents categorical data with rectangular bars whose lengths are proportional to the values they represent. The bar() function is used to plot a vertical bar graph.
Let us create a bar graph for the Age column :
In this plot we used the value_counts() method, which returns a Series containing the counts of unique values. In other words, for any column in a DataFrame, this method returns how many times each distinct entry occurs in that column. The counts are sorted in descending order by default, so the head(10) method keeps only the 10 most frequent values.
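Put together, the bar plot of Age counts can be sketched like this. The data is a made-up stand-in, so it has fewer than 10 distinct ages.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

# Count how often each age occurs and keep at most the 10 most frequent ages.
counts = df["Age"].value_counts().head(10)

# One vertical bar per age; bar height is the count.
ax = counts.plot.bar()
ax.set_xlabel("Age")
ax.set_ylabel("Count")
```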
Barh Plot :
A horizontal bar plot presents quantitative data with rectangular bars whose lengths are proportional to the values they represent; the bars run left to right instead of bottom to top. Let us create a barh plot for the Age column :
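The horizontal version is the same idea with barh() in place of bar(). The sketch below assumes the Age counts are what gets plotted, as in the bar plot, and uses made-up stand-in data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

counts = df["Age"].value_counts().head(10)

# One horizontal bar per age; bar width is the count.
ax = counts.plot.barh()
ax.set_xlabel("Count")
ax.set_ylabel("Age")
```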
Hist Plot :
A histogram is a representation of the distribution of data. The hist() function groups the values of each given Series in the DataFrame into bins and draws one bar per bin. Let us create a histogram plot for the Age column :
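A sketch of the histogram on made-up stand-in data. The choice of 5 bins is an assumption for this small example; pandas uses 10 bins when none is given.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

# Group the Age values into 5 equal-width bins and draw one bar per bin.
ax = df["Age"].plot.hist(bins=5)
ax.set_xlabel("Age")
```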
Scatter Plot :
A scatter plot is useful for seeing how two variables relate to each other, including relationships more complex than a straight line. Let us create a scatter plot between Age and Pregnancies :
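A sketch of the scatter plot on made-up stand-in data. Putting Age on the x-axis is an assumption; the two columns can be swapped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

# One point per row: Age on the x-axis, Pregnancies on the y-axis.
ax = df.plot.scatter(x="Age", y="Pregnancies")
```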
Pie Plot :
A pie plot is a proportional representation of the numerical data in a column. Let us create a pie plot for the Pregnancies and Age columns, restricted to rows where the number of pregnancies is greater than 12 and Age is less than 66.
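One way to sketch this, on made-up stand-in data: filter with both conditions, then ask for one pie per column with subplots=True. The figsize and legend settings are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

# Keep rows that satisfy both conditions; & combines the two boolean masks.
subset = df[(df["Pregnancies"] > 12) & (df["Age"] < 66)]

# One pie per column; each wedge is one row, labeled by its index.
axes = subset[["Pregnancies", "Age"]].plot.pie(
    subplots=True, figsize=(8, 4), legend=False
)
```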
Box Plot :
A box plot is a method for graphically depicting groups of numerical data through their quartiles. Let us create a box plot for the Pregnancies and Age columns.
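A sketch of the box plot on made-up stand-in data. Each column gets its own box, with the median drawn as a line inside it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

# One box per selected column.
ax = df[["Pregnancies", "Age"]].plot.box()
```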
KDE Plot :
In statistics, Kernel Density Estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. The kde() function uses Gaussian kernels and includes automatic bandwidth determination.
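A sketch of a KDE plot on made-up stand-in data. Using the Age column is an assumption, since the text does not say which column is plotted. Note that pandas relies on SciPy for this plot, so SciPy needs to be installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

# Smooth density estimate of Age (requires SciPy to be installed).
ax = df["Age"].plot.kde()
ax.set_xlabel("Age")
```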
Hexbin Plot :
A hexagonal binning plot is a two-dimensional histogram: the plane is divided into hexagonal bins, and each bin is colored by the number of observations that fall into it. Let us create a hexbin plot using the columns Pregnancies and Age.
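A sketch of the hexbin plot on made-up stand-in data. The gridsize value, which sets the number of hexagons along the x-axis, is an illustrative choice; with so few rows most bins stay empty.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Made-up stand-in data with the two columns used in this post.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 13, 14, 15, 3, 13, 17],
    "Age": [50, 31, 44, 38, 43, 31, 44, 47],
})

# Each hexagon is colored by how many rows fall inside it.
ax = df.plot.hexbin(x="Pregnancies", y="Age", gridsize=10)
```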
Conclusion
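We have seen how Pandas can be used as a data visualization tool. Its plotting methods are built on top of Matplotlib, so they do not replace dedicated visualization libraries such as Seaborn and Matplotlib, but they are often all you need for quick exploratory charts. Now, we can plot all of the chart types covered above with the help of Pandas visualization.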