Introduction:
Python is a powerful programming language that is easy to learn and offers a wide range of libraries and frameworks for data analysis and machine learning.
Why Data Manipulation?
Data manipulation lets a business make full use of its raw data: it improves the quality of the data and turns it into insights that support decision-making.
Common tools used for Data Manipulation:
Microsoft Excel, SQL, Python, R, Tableau, Power BI and SAS.
Here is why Python is the most powerful of these tools:
1. Readability and simple syntax: Python’s syntax is easy to learn and understand, so even beginners without a programming background can write code and develop quickly.
2. Extensive Libraries: Python has an extensive collection of libraries, such as Pandas, NumPy, and Matplotlib, designed specifically for data manipulation and visualization, which extend its capabilities.
3. Versatility: Python is a one-stop shop that can be used for various applications, such as building web applications, scripting, automation, and data analysis, making it useful across different industries.
4. Cross Platform: Python code can run on different operating systems, which provides flexibility.
5. Large Community: Python has a large community that provides readily available documentation and support for troubleshooting.
As mentioned, Python has an extensive collection of libraries. Among them, the Pandas library (an open-source tool) simplifies data manipulation by providing tools to import files, clean, restructure, and analyze data, and export the result to another file.
To start working with Pandas, you need to install and import the library.
Install Pandas using: pip install pandas
Import Pandas using: import pandas as pd
1. Reading the data: Data must be read into memory before it can be manipulated. Pandas provides different functions for different file types. Here are examples that read the most frequently used formats, CSV files and Excel sheets.
# Import data from the CSV file
df = pd.read_csv('file.csv')
# Import data from the Excel file (.xlsx files need an Excel engine such as openpyxl installed)
df = pd.read_excel('file.xlsx')
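As a minimal sketch of reading and exporting, the snippet below parses CSV text from an in-memory buffer instead of a real file, so it runs without `file.csv` on disk. The sample rows and the names `csv_text` and `exported` are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "Name,Salary\nBob,1000\nJames,2000\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# to_csv returns the CSV text when no path is given;
# index=False leaves the row labels out of the export
exported = df.to_csv(index=False)
print(exported)
```

Passing a path such as `df.to_csv('output.csv', index=False)` would write the same content to a file instead.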
2. Pandas Data Structures: Pandas has two kinds of data structures that are essential for data manipulation: 1) Series 2) DataFrame
Series: A Series is a one-dimensional labeled array. Here is an example that creates a Series of employee salaries labeled by employee name.
import pandas as pd
# Create a Series with employees' names as labels and salaries as data
employee_salaries = pd.Series([1000, 2000, 3000, 4000], index=["Bob", "James", "Kay", "Danial"])
# Print the Series
print(employee_salaries)
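Because a Series carries labels, values can be looked up by name rather than by position. The sketch below reuses the salary Series from above; the 500 increase and the 2000 cutoff are arbitrary values chosen for illustration.

```python
import pandas as pd

employee_salaries = pd.Series(
    [1000, 2000, 3000, 4000],
    index=["Bob", "James", "Kay", "Danial"],
)

# Look up a single value by its label
print(employee_salaries["Kay"])

# Arithmetic applies to every element and keeps the labels
raised = employee_salaries + 500
print(raised)

# A boolean condition keeps only the matching entries
above_cutoff = employee_salaries[employee_salaries > 2000]
print(above_cutoff)
```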
DataFrame: A DataFrame is a two-dimensional data structure, similar to a spreadsheet, with labeled rows and columns. Here’s an example that creates a DataFrame of employee information.
# Create a dictionary with employees' data (every list must have the same length)
employee_data = {
    "Name": ["Bob", "James", "Kay", "Danial", "Jones"],
    "Age": [30, 41, 35, 22, 23],
    "Dept": ["Accounts", "HR", "Marketing", "Sales", "Transport"],
    "Salary": [1000, 2000, 3000, 4000, 5000],
    "Place": ["NY", "LA", "TX", "NJ", "OH"],
    "Joining Year": [2018, 2023, 2021, 2016, 2022]
}
# Create a DataFrame from the dictionary above
df = pd.DataFrame(employee_data)
# Print the data
print(df)
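Each list in the dictionary becomes one column, so every list needs the same number of entries. The sketch below, using made-up data with a deliberately mismatched `Age` list, shows that Pandas raises a `ValueError` rather than guessing how to align the values.

```python
import pandas as pd

# Hypothetical data: "Age" has one more entry than "Name"
mismatched = {
    "Name": ["Bob", "James"],
    "Age": [30, 41, 35],
}

try:
    pd.DataFrame(mismatched)
    error_message = None
except ValueError as exc:
    # Pandas refuses to build a DataFrame from lists of unequal length
    error_message = str(exc)

print(error_message)
```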
1. Data Exploration: It is essential to understand the data before manipulating it. Pandas provides several tools for this purpose. Let’s explore the employee data with head(), tail(), describe(), info(), and shape.
head(): This function returns the first few records of the DataFrame, 5 by default. Similarly, tail() returns the last few records, also 5 by default.
# Display the first few records of the DataFrame
print(df.head())
describe(): This function provides a statistical summary of the numerical columns in the DataFrame.
#Display a summary of the data frame
print(df.describe())
info(): This function prints a summary of a DataFrame, including data types, non-null counts, and memory usage.
# Display information about the DataFrame (info() prints directly, so no print() is needed)
df.info()
shape: This attribute holds the number of rows and columns of the DataFrame as a tuple.
# Display DataFrame dimensions (shape is an attribute, so it takes no parentheses)
print(df.shape)
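The exploration tools can be tried together on a small table. The following is a sketch on a trimmed-down version of the employee data; the variable names `first_two`, `last_one`, and `salary_mean` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "James", "Kay", "Danial", "Jones"],
    "Dept": ["Accounts", "HR", "Marketing", "Sales", "Transport"],
    "Salary": [1000, 2000, 3000, 4000, 5000],
})

# head and tail accept the number of rows to return
first_two = df.head(2)
last_one = df.tail(1)
print(first_two)
print(last_one)

# shape is a (rows, columns) tuple
rows, cols = df.shape
print(rows, cols)

# describe returns a DataFrame, so single statistics can be read with loc
salary_mean = df.describe().loc["mean", "Salary"]
print(salary_mean)
```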
1. Handling missing values: Data often contains missing values. Managing them helps avoid errors and incorrect conclusions in the analysis. Whether you drop or fill them, handling them deliberately keeps the findings reliable.
Identifying missing values:
# Count the missing values in each column of the DataFrame
print(df.isnull().sum())
Dropping and filling in missing values:
# Drop rows with missing values (returns a new DataFrame; assign it to keep the result)
df.dropna()
# Fill missing values with a specific value
df.fillna(0)
# Fill missing values in numeric columns with each column's mean
df.fillna(df.mean(numeric_only=True))
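To see what each option does, here is a sketch on a tiny made-up table with one missing salary. Note that `dropna` and `fillna` return new DataFrames and leave the original untouched unless the result is assigned back.

```python
import pandas as pd

# Hypothetical data with one missing salary
df = pd.DataFrame({
    "Name": ["Bob", "James", "Kay"],
    "Salary": [1000.0, float("nan"), 3000.0],
})

# Count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# Each call returns a new DataFrame
dropped = df.dropna()
filled_zero = df.fillna(0)

# numeric_only=True skips text columns such as "Name" when computing means
filled_mean = df.fillna(df.mean(numeric_only=True))
print(filled_mean)
```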
2. Selecting and Filtering Data: Selecting and filtering can be done based on labels, integer positions, or boolean conditions.
# Select a single column
df['Name']
# Select multiple columns (note the double brackets: a list of column names)
df[['Name', 'Age', 'Salary']]
# Filter rows based on a condition
df[df['Salary'] == 1000]
# Filter rows based on multiple conditions
df[(df['Salary'] == 1000) & (df['Age'] == 23)]
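Filtering works because a comparison on a column produces a boolean Series with one True or False per row. The sketch below makes that intermediate step visible; the thresholds and the names `mask`, `high_paid`, and `young_or_top` are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "James", "Kay", "Danial", "Jones"],
    "Age": [30, 41, 35, 22, 23],
    "Salary": [1000, 2000, 3000, 4000, 5000],
})

# A comparison on a column yields one boolean per row
mask = df["Salary"] >= 3000
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True
high_paid = df[mask]
print(high_paid)

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
young_or_top = df[(df["Age"] < 25) | (df["Salary"] == 3000)]
print(young_or_top["Name"].tolist())
```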
Selecting rows and columns using loc and iloc:
o loc: This is label-based indexing. Data can be selected by row and column labels.
# Select rows and columns using label-based indexing
data = df.loc[df['Place'] == 'NY', ['Name', 'Salary']]
print(data)
o iloc: This is position-based indexing. Data can be selected by row and column numbers, counting from 0.
# Select rows and columns using position-based indexing
data = df.iloc[[0, 2, 4], [0, 1, 2]]
print(data)
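The difference between the two indexers is easiest to see when the row labels are not plain integers. In the sketch below, the row labels `e1` to `e3` are hypothetical and exist only to make that contrast visible.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Bob", "James", "Kay"],
        "Salary": [1000, 2000, 3000],
        "Place": ["NY", "LA", "NY"],
    },
    index=["e1", "e2", "e3"],  # hypothetical row labels
)

# loc uses labels; a label slice includes both endpoints
by_label = df.loc["e1":"e2", ["Name", "Salary"]]

# iloc uses integer positions; a position slice excludes the stop value
by_position = df.iloc[0:2, [0, 1]]

# loc also accepts a boolean condition to choose the rows
ny_names = df.loc[df["Place"] == "NY", "Name"]

print(by_label)
print(by_position)
print(ny_names)
```

Both slices return the same two rows here, even though one is written with labels and the other with positions.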
3. Data Transformation: In Pandas, data transformation techniques such as applying functions, mathematical operations, and string operations reshape the data to fit the needs of a specific analysis. These methods enable you to create new columns, modify existing columns, and perform complex calculations.
o Applying functions:
# Apply a function to each element of a column
# (every salary in the sample data is above 90, so raise the threshold to see both labels)
df['Salary Division'] = df['Salary'].apply(lambda x: 'High Salary' if x >= 90 else 'Low Salary')
o Mathematical functions:
# Perform mathematical operations on columns
df['Salary'] + 500
o String Operations:
# Perform string operations on columns
df['Name'].str.upper()
o Renaming Columns:
# Rename the column "Name" to "Employee_name"
df.rename(columns={"Name": "Employee_name"}, inplace=True)
o Sorting:
# Sort a DataFrame by a column, highest value first
df.sort_values(by='Salary', ascending=False)
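The transformations above can be chained into one small pipeline. This is only a sketch on made-up data: the 2000 cutoff, the 500 increase, and the name `result` are assumptions for illustration. Operations such as `+`, `str.upper()`, `rename`, and `sort_values` return new objects, so the results are assigned back here to keep them.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "James", "Kay"],
    "Salary": [1000, 2000, 3000],
})

# New column from a function applied to each salary (hypothetical cutoff of 2000)
df["Salary Division"] = df["Salary"].apply(
    lambda x: "High Salary" if x >= 2000 else "Low Salary"
)

# Assign the results back so the changes persist
df["Salary"] = df["Salary"] + 500
df["Name"] = df["Name"].str.upper()

# Rename, sort from highest to lowest salary, and renumber the rows
result = (
    df.rename(columns={"Name": "Employee_name"})
    .sort_values(by="Salary", ascending=False)
    .reset_index(drop=True)
)
print(result)
```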
4. Grouping and Aggregation: Aggregation combines multiple values into a single summary value. The common aggregations are SUM, AVG, MIN, MAX, and COUNT, which in Pandas are sum(), mean(), min(), max(), and count(). Grouping categorizes the data by the values of one or more columns so that each group can be analyzed separately. After grouping, the aggregation functions can be applied to each group.
# Group the data by Age and calculate the mean salary
data_grouped = df.groupby('Age')['Salary'].mean()
print(data_grouped)
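Grouping is most informative when several rows share a group value. The sketch below uses made-up data with two departments and computes several aggregations in one call with `agg`; the data and the name `summary` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data in which departments repeat
df = pd.DataFrame({
    "Dept": ["HR", "HR", "Sales", "Sales", "Sales"],
    "Salary": [1000, 2000, 3000, 4000, 5000],
})

# One row per department, one column per aggregation
summary = df.groupby("Dept")["Salary"].agg(["sum", "mean", "min", "max", "count"])
print(summary)
```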
Conclusion: Python, through Pandas, provides many useful tools to clean, manipulate, and join data. It helps you handle missing data, transform values, and calculate new columns.
This blog provides a basic overview of these capabilities, which can be extended to more complex transformations. Enjoy learning more about Pandas!