In this blog, we will cover Scrapy fundamentals and the Scrapy architecture, then create a project with a spider to crawl a website and scrape data. We will also see how to export the data scraped by a spider.
Web Scraping Using Scrapy: Web scraping is the extraction of data from websites. Almost anything you can think of can be extracted from the web: text, images, videos, e-mail addresses. Scrapy is the framework we use in this blog for web scraping.
Other Tools: In contrast to Scrapy, there are other Python tools used for scraping. For example, we can use the Requests library together with an HTML parser like Beautiful Soup to extract data from HTML web pages. These tools work well for simple tasks such as scraping a single HTML page, but they do not scale to complex projects. When we talk about hundreds and hundreds of web pages, that is really where Scrapy shines.
On a local machine, Scrapy can scrape on average about 960 pages per minute on a machine with 8 GB of RAM. This varies from one PC to another, depending on your CPU performance, how much RAM you have and your Internet speed.
Scrapy Architecture: There are five main components: Spiders, Pipelines, Middlewares, Engine, Scheduler.
Spiders: Here we define what we want to extract from a web page. The different kinds of spiders available in Scrapy are defined as Python classes. The five spiders in Scrapy are: scrapy.Spider, CrawlSpider, XMLFeedSpider (scrapes XML files), CSVFeedSpider (scrapes .csv files) and SitemapSpider (scrapes sitemaps).
Pipelines: These deal with the data we extract. Tasks like cleaning the data, removing duplicates and storing the data in an external database are done using pipelines.
Middlewares: Middlewares have everything to do with the requests sent to a website and the responses we get back from it. If we want to process a request by injecting custom headers, or want to use proxies, we can do that through middlewares.
Engine: The engine is responsible for coordinating all the other components and ensures consistency across all the operations that happen.
Scheduler: The scheduler is responsible for preserving the order of operations. Technically, the scheduler is a simple data structure, a queue, which follows the FIFO (first in, first out) methodology.
Example: Let's consider an imaginary website, www.quoteseveryday.com, and try to extract all the available quotes from it. Assume we have already built the spider that will extract that data and we are now at the stage of launching it.
1. When we execute the spider, the spider we've built starts by sending a request to the engine, and the engine transmits that request to the scheduler.
2. The scheduler follows the FIFO mechanism to process requests; for simplicity we assume there is only one request. Once that request is due to be served, it is sent back to the engine. Now you can see why the engine is said to be responsible for coordinating all the other components.
3. The request is then sent to the middleware component, more specifically to the Downloader middleware, which is responsible for getting the response from the target website. So we are no longer dealing with a request; instead, we have a response ready to be passed on.
4. The generated response is sent to the engine and from there it is transmitted to the spider through another type of middleware called the Spider middleware. The spider then extracts the data, which in our case are the quotes.
5. After that, the scraped items are sent to the engine and then to the item pipeline, which is responsible for processing the items, i.e. the quotes yielded by the spider.
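To make the pipeline component more concrete, here is a minimal sketch of an item pipeline that removes duplicate quotes; the class and the field name 'quote' are illustrative assumptions, not part of the imaginary website above:

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    # Minimal sketch: drop items whose quote text has already been seen
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        text = item.get('quote')  # 'quote' is an assumed field name for this example
        if text in self.seen:
            raise DropItem("Duplicate quote dropped")
        self.seen.add(text)
        return item

A pipeline like this only runs if it is enabled under ITEM_PIPELINES in the project's settings.py.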
Robots.txt: Most websites include in their root directory a file called robots.txt, which they use to tell spiders that want to access the website whether they are allowed to or not. This file contains three important instructions.
1. The first one is User-Agent, which represents the identity of the spider.
2. The second instruction is the Allow instruction, which specifies the Web pages that the spider is allowed to scrape.
3. And in contrast, we have the Disallow instruction, which specifies the web pages that are forbidden.
Now, let's take a real-world example. If you head over to Facebook.com/robots.txt, you will see a page like the one below.
The first user agent is set to Applebot, then BaiduSpider, Bingbot, Googlebot and so on. So, for example, Googlebot is not allowed to access the folders and files listed, such as /ajax/ and /photo.php. In contrast, if we scroll down, Googlebot is allowed to scrape the safetycheck web page, and the URL it may scrape is given.
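If you want to check programmatically whether a given spider may fetch a URL, Python's standard library includes a robots.txt parser; this is just an illustrative sketch, independent of Scrapy (Scrapy itself honors robots.txt when ROBOTSTXT_OBEY is enabled in settings.py):

from urllib.robotparser import RobotFileParser

# Download and parse Facebook's robots.txt
rp = RobotFileParser()
rp.set_url("https://www.facebook.com/robots.txt")
rp.read()

# Ask whether Googlebot may fetch a disallowed path and the allowed safetycheck page
print(rp.can_fetch("Googlebot", "https://www.facebook.com/ajax/"))         # expected: False
print(rp.can_fetch("Googlebot", "https://www.facebook.com/safetycheck/"))  # expected: True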
Prerequisites: Install Anaconda, which helps install Python and Scrapy. Allow the Anaconda installer to set the Path variable for Python by checking the corresponding box during installation. Use Anaconda Navigator to work with the examples. Default apps are installed in the base (root) environment; we will create a new environment, say scrapy_env, and choose the latest Python version. To install Scrapy, go to https://scrapy.org/ --> click Anaconda / Conda --> copy the command from the page, which we extend to conda install -c conda-forge scrapy==1.6 pylint autopep8 -y --> click the play button beside the scrapy_env virtual environment, choose Open Terminal and paste the command. A package installed this way is available only in that specific virtual environment. The command installs Scrapy version 1.6 (version 2 will be covered later), along with pylint and autopep8. Typing scrapy and pressing Enter shows details confirming that Scrapy is installed correctly; it also shows the list of available Scrapy commands and how to use them.
Also install Visual Studio Code. Relaunch Anaconda Navigator, choose the scrapy_env virtual environment created earlier and launch VSCode from there. Mac users need to add VSCode to the Path.
We will see some commands:
scrapy bench: This command benchmarks how our machine's configuration will perform when crawling websites. We get logs with extensions, middlewares, pipelines etc. The "INFO: Spider opened" message is when the benchmark starts. It crawls locally generated pages and provides statistics on the number of requests sent and responses received.
scrapy fetch: Fetches a URL's HTML markup using the Scrapy downloader. For example, scrapy fetch http://google.com prints the root HTML markup of google.com.
scrapy genspider: Generates a new spider using pre-defined templates. A spider is the component used to scrape content from websites; each website typically gets its own spider.
scrapy runspider: Runs a spider without creating a project, handy for quick or temporary work. In general, we create a project, create spiders inside it and use them.
scrapy settings: returns the default settings.
scrapy shell: a very important command, used to experiment with the websites we want to scrape before writing the actual spider.
scrapy startproject: starts a new project by creating all the necessary files.
scrapy version: prints the Scrapy version.
scrapy view: opens a website of your choice in the browser and shows you how the spider actually sees it, for example whether you are getting that website or something else in return. A spider does not necessarily see a website the way a normal user does; however, this command is not fully reliable, so later on we will use an alternative way (the Scrapy shell together with disabling JavaScript in the browser) to see how the spider views a website before scraping it.
Creating a Scrapy Project:
1. Create a folder, say ScrapyProjects, and move to this folder at the command prompt.
2. scrapy startproject worldometers --> this command starts a new project, worldometers, from a predefined template. The project folder contains the following files and folders.
scrapy.cfg --> important for executing the spiders created, and also used to deploy spiders to the Scrapy daemon (Scrapyd), to Heroku or to the ScrapingHub cloud.
spiders folder -> with empty __init__.py file
items.py --> where we define the fields of the items we scrape, with the syntax name = scrapy.Field(); a minimal sketch follows after this list.
middlewares.py --> everything to do with request/response objects can be found here. We can find the WorldometerSpiderMiddleware and WorldometerDownloaderMiddleware classes here. Writing a middleware can involve, for example, selecting a user agent each time a request is sent.
pipelines.py --> used to process the items we scrape and, for example, store them in a database.
settings.py --> used to tweak or add extra configuration to the project.
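As referenced for items.py above, a minimal sketch of field definitions for this project might look like the following; the class name and fields are illustrative assumptions (the spider built later in this blog yields plain dictionaries instead of Item objects):

import scrapy

class WorldometersItem(scrapy.Item):
    # One scrapy.Field() per value we plan to scrape (illustrative field names)
    country = scrapy.Field()
    year = scrapy.Field()
    population = scrapy.Field()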
3. Start the first spider: we will be scraping the website https://www.worldometers.info/world-population/population-by-country/. Here we scrape all the country names, follow each country link, and scrape each year and the corresponding population for that year. Execute the following command to generate the spider (note that scrapy genspider uses the http protocol by default in the generated start_urls). On execution of this command, countries.py is created using the 'basic' template.
cd worldometers, scrapy genspider countries www.worldometers.info/world-population/population-by-country/
The countries.py file has the class CountriesSpider, which extends scrapy.Spider. Each spider has a unique "name"; here "countries" is the name of the spider. One project can have multiple spiders with different names.
"allowed_domains" (optional, good practice to have) contains domain URLs that are allowed, if the website has any other links - not of this domain, those will not be scraped - limit the scope of spider that it visits to scrape.
Modify the generated line so that it contains only the domain: allowed_domains = ['www.worldometers.info'] (allowed_domains accepts domains, not URLs, so leave out the path).
"start_urls" : URLs will have the same protocol used by the website, here https.
"parse" method: to parse response we get back from spider.
A request is sent to the link in start_urls, and we capture the response in the parse method. "name", "allowed_domains", "start_urls" and the "parse" method must keep these exact variable/method names; this default behavior can, however, be overridden.
import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        pass
Scrapy Shell:
This is a tool used before building the spider, to do basic element selection and to debug XPath expressions or CSS selectors.
Install the IPython package, using conda install ipython, in the scrapy_env virtual environment we created.
"scrapy shell" command, without URL, gives same details as "scrapy bench" command. We also get a list of available Scrapy objects, with "scrapy shell" command.
Robots.txt is not available on this website, so it allows access to the URL without restrictions.
Another way to send the request is by passing a Request object to the fetch method.
Execute:
r = scrapy.Request(url="https://www.worldometers.info/world-population/population-by-country/")
fetch(r)
Execute: response.body --> gives the same HTML markup you see when viewing the page source in the browser (Ctrl+U).
Execute: view(response) --> opens a locally stored copy of the response in the browser, showing the same website. Note, however, that the spider does not execute the JavaScript in the webpage.
Ctrl+Shift+I --> opens Developer Tools in the browser.
Ctrl+Shift+P --> opens the Command Palette. Type javascript and choose Disable JavaScript; this is the version the spider sees, without the hover effect or links on the country names. Disabling JavaScript makes sure we see exactly what Scrapy gets back.
Scrape using XPath expression:
Even if we use a CSS selector, Scrapy converts it to XPath internally, so there is a slight impact on the spider's performance with CSS selectors because of this conversion; hence we prefer XPath here. Extracting elements with XPath vs CSS selectors is shown below:
response.xpath("//td/a/text()").getall() --> returns list of all countries.
countries = response.css("td a::text").getall() --> typing countries, gives list of countries as output.
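For a single element such as the page title (the h1 heading used later in the spider), the equivalent expressions are shown below; .get() returns the first match instead of a list:

response.xpath("//h1/text()").get()
response.css("h1::text").get()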
In the Visual Studio Code editor, make sure the Python extension is installed.
Open the worldometers folder created earlier in VSCode.
Modify the parse method in the countries.py file as below:
def parse(self, response):
    title = response.xpath("//h1/text()").get()
    countries = response.xpath("//td/a/text()").getall()
    yield {
        'title': title,
        'countries': countries
    }
In the terminal, make sure you are in the directory where the project's scrapy.cfg file is present.
Execute: scrapy crawl countries
We get the scraped title and list of countries.
To execute in VSCode, use the integrated terminal (View --> Terminal). We see the same output as above.
XPath and CSS Selectors:
CSS selectors cannot navigate up the document tree (only down), but they look cleaner than XPath. Files to try are attached at the end of the blog; copy the contents of the files into the playground editors linked below for CSS selectors and XPath.
CSS Selectors: selecting tags by name, attributes, IDs, classes and by position in the HTML web page.
Playground for CSS: https://try.jsoup.org/
id: two elements cannot have the same id, in most cases. Select with #id_value.
class: multiple elements can share the same class. Select with .class_name, or .class1.class2 for an element that has both classes.
tag_name#id_value, tag_name.class_name
tag1 tag2 // tag2 is inside tag1
attributes: tag_name[attribute=value] or just [attribute=value]
conditions: tag_name[attribute^='value'] // attribute value starts with 'value'
^ --> starts with, $ --> ends with, * --> contains anywhere, ~ --> contains 'value' as a whole word in a whitespace-separated list
multiple selectors: e.g. div.intro p, span#location // the comma selects multiple element groups at once
div.intro > p // direct child elements
div.intro + p // p elements placed immediately after div.intro
+, > and ~ are called CSS combinators.
li:nth-child(1), li:nth-child(3) --> selects the 1st and 3rd list items; li:nth-child(odd) selects all odd-numbered items and li:nth-child(even) the even-numbered ones.
div ~ p --> div followed by some elements and then p => gives all p sibling elements that come after the div.
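You can also try these CSS selectors locally with parsel, the selector library that Scrapy is built on; the HTML snippet here is made up purely for illustration:

from parsel import Selector

html = """
<div class="intro">
  <p id="outside">Hello</p>
  <p>World</p>
</div>
<span id="location">Paris</span>
<ul id="items">
  <li>One</li><li>Two</li><li>Three</li><li>Four</li>
</ul>
"""

sel = Selector(text=html)
print(sel.css("div.intro > p::text").getall())       # direct child p elements: ['Hello', 'World']
print(sel.css("span#location::text").get())          # tag plus id: 'Paris'
print(sel.css("li:nth-child(odd)::text").getall())   # odd-numbered list items: ['One', 'Three']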
XPath: Playground for XPath - https://scrapinghub.github.io/xpath-playground/
//a/@href --> returns the href attribute of all links (both http and https URLs)
//a[starts-with(@href, 'https')] --> only links whose href value starts with https
ends-with() exists only in XPath 2.0, so it throws an error in XPath 1.0, which is what most browsers and Scrapy (via lxml) use.
//a[contains(@href, 'google')]
//a[contains(text(), 'France')] // case-sensitive
//ul[@id='items']/li[position()=1 or position()=4] --> selects the first and fourth list items (writing li[4 or 1] would not behave as intended, since the predicate is evaluated as a boolean)
//li[position()=1 and contains(text(), 'hello')] // first li whose text contains 'hello'
//ul[@id='items']/li[position() = 1 or position()=last()]
//ul[@id='items']/li[position() > 1 or position()=last()] // the comparison operators > and < can also be used
Navigating up the hierarchy, towards the top elements:
//p[@id='unique']/parent::div
//p[@id='unique']/parent::node() // node gets parent element
//p[@id='unique']/ancestor::node() // returns all ancestors: parent, grandparent and so on
//p[@id='unique']/ancestor-or-self::node() // returns the element itself plus all its ancestors
//p[@id='unique']/preceding::node() // all preceding elements are returned, excluding ancestors
//p[@id='unique']/preceding::h1 // h1 element preceding are returned
//p[@id='unique']/preceding::body // ancestor body not returned
//p[@id='outside']/preceding-sibling::node() // preceding sibling returned
Navigate to the bottom of hierarchy:
//div[@class='intro']/child::p
//div[@class='intro']/child::node() // all child elements
//div[@class='intro']/following::node()
//div[@class='intro']/following-sibling::node()
//div[@class='intro']/descendant::node() // returns all child, grand child
Terminology:
//elementName[predicate]
axisName::elementName
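A similar local experiment works for the XPath axes above, again using parsel with a made-up HTML snippet:

from parsel import Selector

html = """
<html><body>
  <h1>Title</h1>
  <div class="intro">
    <p id="unique">Unique paragraph</p>
    <p>Second paragraph</p>
  </div>
</body></html>
"""

sel = Selector(text=html)
# parent of the unique paragraph (the div element)
print(sel.xpath("//p[@id='unique']/parent::node()").get())
# ancestors of the unique paragraph (parent, grandparent, ...)
print(sel.xpath("//p[@id='unique']/ancestor::node()").getall())
# h1 element preceding the unique paragraph
print(sel.xpath("//p[@id='unique']/preceding::h1/text()").get())
# all children of the intro div (both p elements plus whitespace text nodes)
print(sel.xpath("//div[@class='intro']/child::node()").getall())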
PROJECT:
We will see how to scrape each country's population by year, developing the code step by step and learning how to scrape the various elements.
1. To get the country name and link, replace the parse method as below:
def parse(self, response):
    # returns Selector with href in data of each Selector
    countries = response.xpath("//td/a")
    for country in countries:
        name = country.xpath(".//text()").get()
        link = country.xpath(".//@href").get()
        yield {
            'title': name,
            'country_link': link
        }
2. To follow each country link, we can build the absolute URL manually, use the response.urljoin() function with the relative URL, or simply pass the relative URL to response.follow():
import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        # returns Selector with href in data of each Selector
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()
            # absolute_url = f"https://www.worldometers.info{link}"
            # absolute_url = response.urljoin(link)
            # yield scrapy.Request(url=absolute_url)
            yield response.follow(url=link)
3. Sending request to each country link and scraping year and population:
import scrapy
import logging


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        # returns Selector with href in data of each Selector
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()
            # absolute_url = f"https://www.worldometers.info{link}"
            # absolute_url = response.urljoin(link)
            # yield scrapy.Request(url=absolute_url)
            yield response.follow(url=link, callback=self.parse_country)

    def parse_country(self, response):
        # logging.info(response.url)
        rows = response.xpath("//table[@class='table table-striped table-bordered table-hover table-condensed table-list'][1]/tbody/tr")
        for row in rows:
            year = row.xpath(".//td[1]/text()").get()
            population = row.xpath(".//td[2]/strong/text()").get()
            yield {
                'year': year,
                'population': population
            }
4. To include the country name in the scraped data, using a shared variable on the class will not work, because requests are processed asynchronously, so the same country name would end up attached to every item. The workaround is to pass the name through the meta parameter of the request and read it back in the callback to attach the country name to each item.
import scrapy
import logging


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        # returns Selector with href in data of each Selector
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()
            # absolute_url = f"https://www.worldometers.info{link}"
            # absolute_url = response.urljoin(link)
            # yield scrapy.Request(url=absolute_url)
            yield response.follow(url=link, callback=self.parse_country, meta={'country_name': name})

    def parse_country(self, response):
        # logging.info(response.url)
        name = response.request.meta['country_name']
        rows = response.xpath("//table[@class='table table-striped table-bordered table-hover table-condensed table-list'][1]/tbody/tr")
        for row in rows:
            year = row.xpath(".//td[1]/text()").get()
            population = row.xpath(".//td[2]/strong/text()").get()
            yield {
                'country_name': name,
                'year': year,
                'population': population
            }

Try: Scrape the national debt to GDP for each country listed on this website: 'http://worldpopulationreview.com/countries/countries-by-national-debt/'.
Export Scraped Data:
In the VSCode terminal, or at the command prompt, type the below command to save the data in JSON format:
scrapy crawl countries -o population_dataset.json
Ctrl+J --> toggles the terminal panel in VSCode, switching between the terminal and the editor.
Open the JSON file created. Alt+Shift+F --> formats the JSON file.
Ctrl+J --> back to terminal.
Create a CSV file similarly by replacing the file extension in the above command (scrapy crawl countries -o population_dataset.csv), and an XML file the same way with the .xml extension (scrapy crawl countries -o population_dataset.xml).
Conclusion:
We have seen how to get started with Scrapy fundamentals and the Scrapy architecture, and how to create a project with a spider to crawl a website and scrape data. We also saw how to export the data scraped by a spider.
Hope you enjoyed Scraping with Spiders in Scrapy and Python.
References: Udemy