What it is
Scrapy is a mature Python framework for building web crawlers and extracting structured data. It lets you describe which pages to visit, how to parse responses, which links to follow, and where to store the structured output.
It addresses a practical problem: a one-off script becomes fragile once it has to crawl many pages, respect delays, handle errors, retry requests, and store data.
How a crawler works
The main unit is a spider. It receives start URLs, parses responses, and yields data items or new requests. The framework manages the queue, downloading, and pipelines.
This separates extraction logic from networking mechanics. Developers describe rules while Scrapy handles concurrency, retries, settings, and export.
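To make that division of labour concrete, here is a toy model in plain Python. It is not Scrapy's implementation: `FAKE_SITE`, `parse`, and `crawl` are hypothetical names, and the "download" is a dictionary lookup so the sketch runs offline. It only illustrates the idea that a callback yields items or follow-up requests while an engine owns the queue.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page title, outgoing links).
# It stands in for real HTTP downloads so the sketch runs offline.
FAKE_SITE = {
    "/": ("Home", ["/a", "/b"]),
    "/a": ("Page A", ["/b"]),
    "/b": ("Page B", ["/"]),
}


def parse(url, page):
    """Spider-style callback: yield one data item, then follow-up URLs."""
    title, links = page
    yield {"url": url, "title": title}  # a data item (dict)
    for link in links:
        yield link  # a new "request" (a plain URL string in this toy)


def crawl(start_urls):
    """Toy engine: owns the queue, de-duplication, and item collection."""
    queue = deque(start_urls)
    seen = set(start_urls)
    items = []
    while queue:
        url = queue.popleft()
        page = FAKE_SITE[url]  # the "download" step
        for result in parse(url, page):
            if isinstance(result, dict):
                items.append(result)  # a real framework would hand this to pipelines
            elif result not in seen:
                seen.add(result)
                queue.append(result)
    return items


items = crawl(["/"])
print([item["title"] for item in items])
```

The callback never touches the queue or the set of visited URLs. That is the separation described above, reduced to a few lines.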
Simple spider
This example shows the minimal shape: a spider class, start URLs, and extraction of each quote's text and author from a page.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
What is inside
The repository contains the framework core, downloader, scheduler, spiders, pipelines, exports, middleware, documentation, and tests.
Scrapy fits recurring data collection where repeatability, logging, speed controls, and structure matter.
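Speed controls, retries, logging, and export are usually expressed as settings rather than code. The dictionary below is a sketch of what that might look like. The setting names are Scrapy settings, but `CRAWL_SETTINGS`, the output file name, and every value are illustrative assumptions, not recommendations.

```python
# Hypothetical settings for a recurring crawl; all values are illustrative.
CRAWL_SETTINGS = {
    # Seconds to wait between consecutive requests to the same site.
    "DOWNLOAD_DELAY": 1.0,
    # Upper bound on parallel requests to a single domain.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
    # Extra attempts after the first failed download.
    "RETRY_TIMES": 2,
    # Verbosity of the crawl log.
    "LOG_LEVEL": "INFO",
    # Export scraped items as JSON Lines; the file name is an assumption.
    "FEEDS": {"quotes.jsonl": {"format": "jsonlines"}},
}
```

One way to apply such a dictionary is to assign it to a spider's `custom_settings` class attribute. Project-wide values would normally live in the project's settings module instead.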
Strengths
The main strength is maturity. Scrapy has long been part of the Python ecosystem, with documentation, extensions, and known patterns.
Its architecture also keeps larger crawlers from degrading into a pile of unrelated functions, because spiders, middleware, and pipelines each have a defined role.
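As one illustration of that structure, cleanup logic can live in an item pipeline instead of inside the spider. The sketch below follows the `process_item(item, spider)` shape used by Scrapy item pipelines and assumes items are plain dictionaries. `CleanTextPipeline` and the module path `myproject.pipelines` are hypothetical names.

```python
class CleanTextPipeline:
    """Hypothetical item pipeline: tidies the "text" field of each item."""

    def process_item(self, item, spider):
        text = item.get("text")
        if text is not None:
            # Drop surrounding curly quotes, then collapse runs of whitespace.
            unquoted = text.strip().strip("\u201c\u201d")
            item["text"] = " ".join(unquoted.split())
        return item  # returning the item passes it on to the next pipeline


# In a project, the pipeline would be enabled through the ITEM_PIPELINES
# setting. "myproject.pipelines" is an assumed module path.
ITEM_PIPELINES = {"myproject.pipelines.CleanTextPipeline": 300}

# Direct call for illustration; during a crawl Scrapy invokes process_item itself.
cleaned = CleanTextPipeline().process_item(
    {"text": "\u201c  An   example quote. \u201d", "author": "Example Author"},
    spider=None,
)
print(cleaned["text"])
```

Keeping this step out of `parse` means the spider stays focused on selectors, and the same cleanup can be reused by other spiders in the project.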
Limits
Scrapy does not remove legal and ethical limits. Site rules, request rate, terms, and privacy still matter.
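Some of these constraints, namely robots.txt rules and request rate, can at least be reflected in configuration. The following is a sketch only: the setting names are Scrapy settings, while `POLITE_SETTINGS`, the user agent string, and the contact URL are placeholders. Terms of service and privacy obligations cannot be handled by settings and still need a human decision.

```python
# Hypothetical "polite crawling" settings; values and contact URL are placeholders.
POLITE_SETTINGS = {
    # Check robots.txt and skip requests it disallows.
    "ROBOTSTXT_OBEY": True,
    # Identify the crawler and give site owners a way to reach its operator.
    "USER_AGENT": "example-crawler (+https://example.com/contact)",
    # Adjust delays automatically based on observed server latency.
    "AUTOTHROTTLE_ENABLED": True,
    # Aim for roughly one request in flight per remote server.
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
}
```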
Sites whose content appears only after client-side JavaScript has run may need a browser layer or another extraction approach.