Any real-life data-problems will cause the issue of missing data and it is really important that such data points are taken care of in the right way.
Pandas is undoubtedly the most widely-used open-source library for data science and analysis, mostly preferred for ad-hoc data manipulation operations. The credit goes to its extremely flexible data representation using DataFrames and the functions to manipulate data present in these DataFrames.
When we use csv files with null values or missing data to populate a DataFrame, the null/missing values are replaced with NaN(not a number) in DataFrames. To handle missing data, Pandas uses the following functions:
Dropna() - removes missing values (rows/columns)
Fillna() - Replaces the missing values with user specified values.
Now, let's try to understand both these pandas functions with the help of examples in detail.
Pandas DataFrame: dropna() function
Pandas dropna() method allows the user to analyze and drop Rows/Columns with Null/missing values.
Syntax:
DataFrame.dropna(self, axis=0, how='any',thresh=None, subset=None, inplace=False)
The definition of the parameters in the syntax are as follows:
axis: It refers to the orientation (row or column) in which data is dropped. Possible values are 0 or 1. 0/’index’ represents dropping rows and 1/’columns’ represent dropping columns.
how: Specifies the scenario in which the column/row containing null value has to be dropped. The values are ‘any’ or ‘all’. ‘all’ drop the row/column only if all the values in the row/column are null. ‘any’ drops the row/column when at-least one value in row/column is null.
thresh: Specifies the minimum number of non-NA values in row/column in order for it to be considered in the final result. Any row/column with the number of non-NA values < thresh value is removed irrespective of other parameters passed. When thresh=none, this filter is ignored.
subset: subset takes a list of columns/rows respectively (opposite to the axis) which are to be searched for null/NA values instead of an entire list of columns/rows.
inplace: It is a boolean which makes the changes in the DataFrame itself if True. If false, the original DataFrame is not modified, but a separate copy with the changes (i.e. dropped rows/columns) is returned.
Now that we understood the syntax and parameters of the dropna() function, let’s see how to use it in code and work with missing data.
Example use-cases:
Let’s first create a sample DataFrame to test dropna() function with various parameters. To create a DataFrame, we should import pandas library and to use NaN values, let’s import numpy library.
import pandas as pd
import numpy as np
1. Create a DataFrame
This DataFrame will be used to demonstrate how dropna() function can be used with various available parameters to manipulate data and remove missing values.
2. Dropping Rows/Columns
To drop rows, the axis parameter in the dropna() function should be set to 0. All the rows that have NaN values are removed as shown below.
Note: This is the default behavior when the axis is not explicitly specified, which means if dataframe.dropna() is used, the rows with NaN values will be removed.
To drop columns, the axis parameter should be set to 1. All the columns that have NaN values are removed as shown below:
3. Using ”how” parameter
By default, dropna() drops the complete row/column even if only 1 value is missing. We can use “how” parameter in dropna() to drop row/column only when all the values in a row/column are null. This is achieved by setting how=’all’ instead of how=’any’ (the default behavior). If the axis is not specified, then it applies on rows. We need to explicitly specify axis=1 if we need to use this parameter on column data.
4. Using “thresh” parameter
The “thresh” parameter is a powerful tool when combined with other parameters.
For instance, let's see how to thresh to keep only the rows or columns with at least 2 non-NA values.
5. Using “subset” parameter
We can use the “subset” parameter to define which columns to be used to look for null values.
In the below example, the code looks for null values in columns B and D and if there are any null values, then those rows are dropped.
6. Using “inplace” parameter
If “inplace” parameter is set to true, the dataframe is modified or else a copy is created with modified values.
Let's see what happens when we set "inplace" parameter to false. As you can see below, the original DataFrame (df) is not changed but when we try to assign the output to a new dataframe, we can see that the new dataframe has the changed dataset.
When "inplace" parameter is set to true, the original dataframe is changed as shown below.
Pandas DataFrame: fillna() function
When the dataset is very large, handling Nan or Null values becomes very critical. Usually larger datasets hold more number of Nan values in different forms, so standardizing these Nan’s to a single value or to a value which is needed is a critical process while handling larger datasets. The fillna() function comes handy to perform such function. It verifies all the Nan values and replaces them with the assigned replacement value.
Syntax:
DataFrame.fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
The definition of the parameters in the syntax are as follows:
Value: The value which needs to be used for filling all the Nan, must be assigned to this value parameter. A list cannot be assigned to this object.
Method: The method parameter represents the technique that needs to be used for filling the Nan’s in the dataframe.
Axis: This argument represents the column or the axis upon which the fillna() function needs to be applied. If the value 0 is assigned to axis, the Nan value is filled for each and every row in the dataframe, whereas mentioning the value of 1 in the dataframe fills the Nan value for all the columns in the dataframe.
Data: Defines the data of the row.
Inplace: If “inplace” parameter is set to true, the dataframe is modified or else a copy is created with modified values.
Limit: This mentions the overall number of Nan values that are allowed to be filled backward and forward.
Downcast: How to downcast a given value from its currently specified datatype if it is possible to be performed.
Let’s see how to use fillna() in code and work with missing data.
Example use-cases:
Let’s first create a sample DataFrame to test fillna() function with various parameters.
1. Create a DataFrame
This DataFrame will be used to demonstrate how dropna() function can be used with various available parameters to manipulate data and remove missing values.
2. Using "value" parameter
We can use fillna() to replace all the NaN values in a dataframe to 0s or any value that user specifies.
Replace all NaN elements with 0s.
Replace all NaN elements with values
3. Using "method" parameter
We can propagate non-null values forward or backward using "method" parameter as shown below:
4. Using "limit" parameter
We can use fillna() to replace only the first NaN element or in any position using "limit" parameter as shown below.
5. Using "axis" parameter
We can use axis to specify whether we want to replace NaN elements in columns or rows. By default, if nothing is specified, it replaces NaN values in rows, and if axis =1 then the NaN values in columns are replaced. In the below example, the NaN values in columns are replaced with 0s.
To summarize:
Handling null and missing data are always a challenge in any project. Given the fact that data science depends solely on data, dealing with missing data and null values are very critical for data scientists. Hope this explanation of mine about dropna() and fillna() functions will help everyone who is working with missing data like I do.
Happy coding!
References: