
Python: The 'Go-To' Language for Data Analysis

By Sirisha Kotha

Introduction:

Python is a powerful programming language that is easy to learn and offers a wide range of libraries and frameworks for data analysis and machine learning.


Why Data Manipulation?

Data manipulation enables a business to fully leverage its raw data, improve its quality, and turn it into valuable insights for decision-making.


Common tools used for Data Manipulation:

Microsoft Excel, SQL, Python, R, Tableau, Power BI and SAS.


Here is why Python stands out among these tools:

1.      Readability and simple syntax: Python's clean syntax makes it easy to learn and understand, even for beginners without a programming background, which speeds up development.

2.      Extensive Libraries: Python has an extensive collection of libraries, such as Pandas, NumPy, and Matplotlib, specially designed for data manipulation and visualization, which will enhance its capabilities.

3.      Versatility: Python is a one-stop shop that can be used for a variety of applications, such as building web applications, scripting, automation, and data analysis, which makes it valuable across different industries.

4.      Cross-platform: Python code runs on all major operating systems, providing flexibility.

5.      Large community: Python has a large, active community, with readily available documentation and support for troubleshooting.


As mentioned, Python has an extensive collection of libraries. Among them, the Pandas library (an open-source tool) simplifies data manipulation by providing tools to import files, then clean, restructure, analyze, and export the data to another file.


To start working with Pandas, install and import the library:

Installing Pandas --- pip install pandas

Importing Pandas --- import pandas as pd


1.      Reading the data: Reading data is the first step in any data manipulation work, and Pandas has several methods for it. Here are examples reading the most frequently used formats, CSV files and Excel sheets.

#Importing data from a CSV file

df = pd.read_csv('file.csv')

#Importing data from an Excel file

df = pd.read_excel('file.xlsx')
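Pandas can also export a DataFrame back to a file once the work is done; to_csv() and to_excel() are the counterparts of the readers above. A minimal round-trip sketch (the file name out.csv and the sample data are just illustrations):

```python
import pandas as pd

# Build a small DataFrame and write it to a CSV file
df = pd.DataFrame({"Name": ["Bob", "James"], "Salary": [1000, 2000]})
df.to_csv("out.csv", index=False)  # index=False omits the row labels

# Read it back to confirm the round trip
df2 = pd.read_csv("out.csv")
print(df2)
```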


2.      Pandas Data Structures: Pandas has two data structures that are essential for data manipulation: 1) Series 2) DataFrame


Series: A Series is a one-dimensional labeled array. Here is an example of creating a Series with employees' names and their salaries.

import pandas as pd

#Create a Series with employees' names as labels and salaries as data

employee_salaries = pd.Series([1000, 2000, 3000, 4000], index=["Bob", "James", "Kay", "Danial"])

#Print the Series

print(employee_salaries)

 

DataFrame: A DataFrame is a two-dimensional data structure, similar to a spreadsheet. Here's an example creating a DataFrame with employee information.

# Create a dictionary with employee data

employee_data = {
    "Name": ["Bob", "James", "Kay", "Danial", "Jones"],
    "Age": [30, 41, 35, 22, 23],
    "Dept": ["Accounts", "HR", "Marketing", "Sales", "Transport"],
    "Salary": [1000, 2000, 3000, 4000, 5000],
    "Place": ["NY", "LA", "TX", "NJ", "OH"],
    "Joining Year": [2018, 2023, 2021, 2016, 2022]
}

# Create a DataFrame from the above dictionary

df = pd.DataFrame(employee_data)

# Print the DataFrame

print(df)

 

3.      Data Exploration: It is essential to understand the data before performing any manipulation. Pandas provides several functions for this purpose: head(), tail(), describe(), info(), and shape. Let's explore the employee data.

 

head(): This function returns the first few records of the DataFrame; by default, it returns 5. tail() works the same way, returning the last few records, with the same default of 5.

 

#Display the first few records of the DataFrame

print(df.head())
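Both functions also accept a count controlling how many records come back; a small sketch on sample data (the values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Bob", "James", "Kay", "Danial", "Jones"],
                   "Salary": [1000, 2000, 3000, 4000, 5000]})

# Display the last 2 records of the DataFrame
print(df.tail(2))

# head() takes the same argument, e.g. the first 3 records
print(df.head(3))
```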

 

describe(): This function provides a statistical summary of the numerical columns in the DataFrame.

 

#Display a summary of the data frame

print(df.describe())

 

info(): This function displays a summary of a DataFrame, including data types, non-null values, and memory usage.

 

#Display information about the DataFrame

df.info()

 

shape: This attribute gives the number of rows and columns of the DataFrame.

 

#Display DataFrame dimensions

print(df.shape)

 

4.      Handling missing values: Data might contain missing values; managing them helps avoid errors and incorrect conclusions in the analysis. By filling in the values, we can make sure our findings stay accurate.

 

Identifying missing values:

#Check for missing values in the DataFrame

print(df.isnull().sum())

 

Dropping and filling in missing values:


  • #Dropping rows with missing values

    df.dropna()


  • #Filling missing values with a specific value

    df.fillna(0)


  • #Filling missing values with a function (numeric columns only)

    df.fillna(df.mean(numeric_only=True))

 

5.      Selecting and filtering data: Rows and columns can be selected and filtered based on labels, integer positions, or boolean conditions.

 

  • #Selecting a single column

    df['Name']


  • #Selecting multiple columns

    df[['Name', 'Age', 'Salary']]


  • #Filtering rows based on a condition

    df[df['Salary'] == 1000]


  • #Filtering rows based on multiple conditions

    df[(df['Salary'] == 1000) & (df['Age'] == 23)]

 

  • Selecting rows and columns using loc and iloc:

o   loc: This is label-based indexing. Data can be selected by row and column labels.

#Selecting rows and columns using label-based indexing

data = df.loc[df['Place'] == 'NY', ['Name', 'Salary']]

print(data)

o   iloc: This is position-based indexing. Data can be selected by row and column positions.

#Selecting rows and columns using position-based indexing

data = df.iloc[[0, 2, 4], [0, 1, 2]]

print(data)
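Boolean filtering also combines naturally with isin() for membership tests; a short sketch using sample employee data (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Bob", "James", "Kay", "Danial", "Jones"],
                   "Place": ["NY", "LA", "TX", "NJ", "OH"],
                   "Salary": [1000, 2000, 3000, 4000, 5000]})

# Keep only the employees based in NY or TX
subset = df[df["Place"].isin(["NY", "TX"])]
print(subset)
```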

 

6.      Data Transformation: In Pandas, data transformation techniques such as applying functions, mathematical operations, and string operations are essential for shaping data to meet specific analyses. These methods let you create new columns, modify existing columns, and perform complex calculations.

o   Applying functions:

# Apply a function to each element of a column

df['Salary Division'] = df['Salary'].apply(lambda x: 'High Salary' if x >= 3000 else 'Low Salary')

o   Mathematical operations:

#Perform mathematical operations on columns

df['Salary'] + 500

o   String operations:

#Perform string operations on columns

df['Name'].str.upper()

o   Renaming Columns:

#Renaming the column "Name" to "Employee_Name"

df.rename(columns={"Name": "Employee_Name"}, inplace=True)

o   Sorting:

#Sort a DataFrame by column

df.sort_values(by=['Salary'], ascending=[False])
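The transformations above can also be chained together; a sketch on a small sample (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["bob", "james", "kay"],
                   "Salary": [1000, 3000, 2000]})

# String operation: normalize the names to upper case
df["Name"] = df["Name"].str.upper()

# Mathematical operation: give everyone a 500 raise in a new column
df["New Salary"] = df["Salary"] + 500

# Sort by the new column, highest first
df = df.sort_values(by="New Salary", ascending=False)
print(df)
```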

 

7.      Grouping and Aggregation: Aggregation combines multiple values into a single summary value; the common aggregations are SUM, AVG, MIN, MAX, and COUNT. Grouping categorizes data based on given conditions so that each group can be analyzed separately. After grouping, aggregation functions can be applied to each group.

 

#Group the data by Age and calculate the mean salary

data_grouped = df.groupby('Age')['Salary'].mean()

print(data_grouped)
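Several aggregations can also be applied to each group at once with agg(); a sketch grouping illustrative sample data by department:

```python
import pandas as pd

df = pd.DataFrame({"Dept": ["HR", "HR", "Sales", "Sales"],
                   "Salary": [1000, 2000, 3000, 5000]})

# Apply several aggregations to each group in one call
summary = df.groupby("Dept")["Salary"].agg(["sum", "min", "max", "count"])
print(summary)
```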

            

Conclusion: Python provides many useful tools to clean, manipulate, and join data: it helps fix missing values, transform existing columns, and calculate new ones.

This blog provides a basic overview of Python's capabilities, which can be extended to complex transformations. Enjoy learning more about Pandas!
