The internet is perhaps the greatest source of information available. Imagine you want to collect a large amount of that data as quickly as possible. This is where web scraping comes into the picture.
What is Web Scraping?
Web scraping is the automated extraction of large amounts of data from websites. Much of the data on websites is unstructured; web scraping helps extract it and save it in the required format. There are different ways to extract data from websites, such as using APIs or writing your own code. Let's see how this can be done using Python.
Why Python for web scraping?
Python is easy to understand: Reading Python code is similar to reading a statement in English.
Easy to use: Statements do not need semicolons “;”, and code blocks are marked by indentation rather than curly braces {}.
Impressive collection of libraries: Libraries such as Selenium for browser automation, and Pandas and NumPy for handling the extracted data, make web scraping much easier and faster.
How to scrape data using Python?
Enter the URL you want to scrape
Inspect the respective elements in the webpage
Find the data
Write and run the code
Store the data in the required format
Let's take the example of scraping job listings from Dice.
Libraries used:
Selenium: Selenium is used to automate web browsers.
Pandas: Pandas is used to organize the extracted data and store it in the required format.
Pre-requisites: Python 2.x or 3.x with the selenium and pandas libraries installed, Google Chrome, and macOS or Windows.
Step 1: We want to extract the job title, company name, and location from the Dice website. The URL is https://www.dice.com/jobs?q=selenium&countryCode=US&radius=30&radiusUnit=mi&page=5&pageSize=100&filters.postedDate=ONE&filters.employmentType=FULLTIME&language=en&eid=S2Q_,6Q_0
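The search settings are carried in the query string of this URL. As a small sketch using only the standard library, the code below splits the URL into its parameters and rebuilds it with a different page value. The helper name with_page is made up for this example, and the idea that "page" selects the results page is an assumption based on the parameter name, not something confirmed by Dice.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# The search URL from Step 1, split across lines only for readability.
SEARCH_URL = (
    "https://www.dice.com/jobs?q=selenium&countryCode=US&radius=30&radiusUnit=mi"
    "&page=5&pageSize=100&filters.postedDate=ONE&filters.employmentType=FULLTIME"
    "&language=en&eid=S2Q_,6Q_0"
)


def with_page(url, page):
    """Return a copy of url whose 'page' query parameter is set to page.

    Hypothetical helper: assumes 'page' is the results-page number.
    """
    parts = urlsplit(url)
    params = parse_qs(parts.query)   # each value is a list of strings
    params["page"] = [str(page)]
    new_query = urlencode(params, doseq=True)
    return urlunsplit(parts._replace(query=new_query))


# Inspect the parameters of the original URL.
params = parse_qs(urlsplit(SEARCH_URL).query)
print(params["q"], params["page"], params["pageSize"])

# Build the URL for another page of results.
next_url = with_page(SEARCH_URL, 6)
print(next_url)
```

Rebuilding the query this way percent-encodes special characters such as the comma in the eid value, which decodes back to the same text.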
Step 2: Inspect the page. We have to inspect the elements to extract the data. To do this, right-click on the page and click “Inspect”. The browser's inspector panel will open.
Step 3: Select an element to inspect. Let's extract the job title, company name, and location, each of which is displayed in a “div” tag. We will write an XPath expression for each of these details.
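As a rough illustration of how such XPath expressions select elements, the sketch below runs them against a small, made-up fragment using Python's built-in xml.etree.ElementTree instead of Selenium. The markup and job values are invented for the example; only the class and id names come from the expressions used in this article. ElementTree supports just a subset of XPath and requires a relative path, so the expressions start with .// rather than //.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for two search-result cards.
SAMPLE = """
<div>
  <div class="card">
    <div class="card-title-link bold">Automation Engineer</div>
    <div class="card-company">Example Corp</div>
    <div id="searchResultLocation">Austin, TX</div>
  </div>
  <div class="card">
    <div class="card-title-link bold">QA Analyst</div>
    <div class="card-company">Sample LLC</div>
    <div id="searchResultLocation">Denver, CO</div>
  </div>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# "*" matches any tag; the predicate filters on an exact attribute value.
titles = [e.text for e in root.findall(".//*[@class='card-title-link bold']")]
companies = [e.text for e in root.findall(".//*[@class='card-company']")]
locations = [e.text for e in root.findall(".//*[@id='searchResultLocation']")]

print(titles)
print(companies)
print(locations)
```

Each expression returns one match per card, in document order, which is why the three lists line up item by item.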
Step 4: Write the code
For this, let's create a Python file in a new project using PyCharm (a Python IDE).
Now let's write our code in this file:
from selenium import webdriver
import pandas as pd
import numpy as np
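To use the Chrome browser, we have to set up the path to chromedriver:
driver = webdriver.Chrome("/*******/chromedriver/chromedriver")
To open the URL, use:
driver.get("https://www.dice.com/jobs?q=selenium&countryCode=US&radius=30&radiusUnit=mi&page=5&pageSize=100&filters.postedDate=ONE&filters.employmentType=FULLTIME&language=en&eid=S2Q_,6Q_0")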
As we saw earlier, each piece of data we want to extract is placed in its respective “div” tag. I will now write an XPath expression for each of these elements (job title, company name, location) and read the “text” attribute of each matched element, which gives the string value of that element. Please refer to the code below:
# XPath for job title: collect the text of every matching element
d1 = [e.text for e in driver.find_elements_by_xpath('//*[@class="card-title-link bold"]')]
# XPath for company name
d2 = [e.text for e in driver.find_elements_by_xpath('//*[@class="card-company"]')]
# XPath for location
d3 = [e.text for e in driver.find_elements_by_xpath('//*[@id="searchResultLocation"]')]
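Note: We use find_elements (plural) here because there are hundreds of jobs on Dice. It returns a list of matching elements, so we loop over that list to read the text of every job. The find_elements_by_xpath helper belongs to the Selenium 3 API and was removed in Selenium 4.3; on newer releases the equivalent call is driver.find_elements(By.XPATH, ...), with By imported from selenium.webdriver.common.by.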
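The looping step can be sketched without a browser. The helper below, element_texts, is a hypothetical name introduced for this example; it accepts any list of objects that expose a text attribute, as Selenium's matched elements do. The stand-in objects are made-up values used only so the snippet runs on its own.

```python
from types import SimpleNamespace


def element_texts(elements):
    """Return the stripped text of every element in a list of matches."""
    return [element.text.strip() for element in elements]


# Stand-ins for the list a find_elements call would return (invented values).
fake_matches = [
    SimpleNamespace(text=" Automation Engineer "),
    SimpleNamespace(text="QA Analyst"),
]

titles = element_texts(fake_matches)
print(titles)

# An XPath that matches nothing yields an empty list rather than an error.
print(element_texts([]))
```

With a real driver, the same helper would be called on the list returned for each XPath expression.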
After extracting the data, we have to save it in a structured format. For this, we use Python dictionaries.
Dictionaries: A dictionary is a collection of data stored as key:value pairs. Dictionaries preserve insertion order (guaranteed since Python 3.7), are changeable, and do not allow duplicate keys. They are written with curly brackets.
We will now build dictionaries with the column names as keys and the values from the variables above (d1, d2, d3) as their respective values.
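A short sketch of these dictionary properties, using invented job values:

```python
# A dictionary written with curly brackets, as key: value pairs.
job = {"Job Title": "QA Analyst", "Company name": "Sample LLC"}

# Changeable: new keys can be added after creation.
job["Location"] = "Denver, CO"

# No duplicate keys: assigning to an existing key overwrites its value.
job["Company name"] = "Example Corp"

# Ordered: keys come back in the order they were first inserted.
keys = list(job)
print(keys)
print(job["Company name"])
```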
thisdict = {}
for t, (title, company, location) in enumerate(zip(d1, d2, d3)):
    thisdict[t] = {"Job Title": title, "Company name": company, "Location": location}
Now let's load this dictionary into a Pandas DataFrame.
Pandas is a popular Python package whose data structures make data manipulation and analysis easy. The DataFrame is one of these structures.
A DataFrame is a two-dimensional labeled data structure with rows and columns, i.e. data is stored in tabular form. We can specify the column names of a DataFrame.
df = pd.DataFrame.from_dict(thisdict, orient="index", columns=["Job Title", "Company name", "Location"])
Then we can write this DataFrame to a CSV file:
df.to_csv("Output.csv", index=False)
This creates an Output.csv file containing all the job listings.
To summarize, scraping with Python is fast and efficient. Python's syntax makes it easy for a coder to learn the language and extract the required data quickly. The extracted data can be saved not only as a CSV file but also in other formats, e.g. Excel or XML.
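The dictionary-to-CSV step can be tried on its own with a couple of invented rows. This sketch assumes pandas is installed and writes to an in-memory buffer instead of Output.csv so that it leaves no file behind. The row values are made up, and the keys mirror the ones used in the dictionary above.

```python
import io

import pandas as pd

# Invented rows keyed by row number, shaped like the dictionary built above.
rows = {
    0: {"Job Title": "Automation Engineer", "Company name": "Example Corp", "Location": "Austin, TX"},
    1: {"Job Title": "QA Analyst", "Company name": "Sample LLC", "Location": "Denver, CO"},
}

# orient="index" treats each outer key as a row label and each inner key as a column.
df = pd.DataFrame.from_dict(rows, orient="index")

# Write CSV text to a buffer; index=False leaves out the row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Values that contain a comma, such as the locations here, are quoted automatically in the CSV output.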