Scrapy — open source GitHub project

Scrapy is a Python framework for building web crawlers and extracting structured data from websites.

What it is

Scrapy is a mature framework rather than a single-purpose library. With it, a developer describes which pages to visit, how to parse responses, which links to follow, and where to store the structured output.

It addresses a practical problem: a one-off script becomes fragile once it has to crawl many pages, respect delays, handle errors, retry failed requests, and store data reliably.

How a crawler works

The main unit is a spider. It defines the start URLs, parses each response, and yields either data items or new requests. The framework manages the request queue, downloading, and item pipelines.
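The division of labor described above can be sketched with a toy model in plain Python. This is not Scrapy's implementation: `FAKE_SITE`, `parse`, and `crawl` are hypothetical names, the "site" is an in-memory dictionary, and there is no real downloading or concurrency. It only illustrates that the callback yields items or new URLs while the engine owns the queue and the duplicate filter.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (texts on the page, links on the page).
FAKE_SITE = {
    "/page/1": (["first quote"], ["/page/2"]),
    "/page/2": (["second quote"], ["/page/1", "/page/3"]),
    "/page/3": (["third quote"], []),
}


def parse(url, page):
    """Spider-style callback: yield data items (dicts) or new URLs (strings)."""
    texts, links = page
    for text in texts:
        yield {"url": url, "text": text}
    for link in links:
        yield link


def crawl(start_urls):
    """Toy engine: owns the queue and the duplicate filter, and calls parse()."""
    queue = deque(start_urls)
    seen = set(start_urls)
    items = []
    while queue:
        url = queue.popleft()
        page = FAKE_SITE.get(url)  # stands in for the downloader
        if page is None:
            continue
        for result in parse(url, page):
            if isinstance(result, dict):
                items.append(result)  # a real framework would pass this to pipelines
            elif result not in seen:
                seen.add(result)
                queue.append(result)
    return items


items = crawl(["/page/1"])
```

In this sketch the callback never touches the queue directly, which is the same separation a spider relies on.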

This separates extraction logic from networking mechanics. Developers describe rules while Scrapy handles concurrency, retries, settings, and export.
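As a rough illustration of how those mechanics are configured rather than coded, the dictionary below uses Scrapy setting names for concurrency, delays, retries, and export. The values and the output file name `quotes.jsonl` are assumptions chosen for the example, not recommendations.

```python
# Illustrative values only; tune them for the target site.
CRAWL_SETTINGS = {
    "CONCURRENT_REQUESTS": 8,      # upper bound on parallel requests
    "DOWNLOAD_DELAY": 1.0,         # seconds to wait between requests to a site
    "RETRY_ENABLED": True,         # retry failed requests
    "RETRY_TIMES": 3,              # retries per request after the first attempt
    "AUTOTHROTTLE_ENABLED": True,  # adapt the delay to observed latency
    # Feed export: write yielded items as JSON Lines (file name is hypothetical).
    "FEEDS": {"quotes.jsonl": {"format": "jsonlines"}},
}
```

A dictionary like this can be assigned to a spider's `custom_settings` class attribute, or the same keys can be set as module-level names in the project's `settings.py`.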

Simple spider

This example shows the minimal shape: a spider class, start URLs, and extraction of each quote's text and author from a page.

Language: Python

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

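    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote"> block.
        for quote in response.css("div.quote"):
            yield {
                # ::text selects the element's text; .get() returns the first match or None.
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }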

What is inside

The repository contains the framework core, downloader, scheduler, spiders, pipelines, exports, middleware, documentation, and tests.
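To make one of these components concrete: an item pipeline is an ordinary Python class with a `process_item` method that receives each yielded item. The sketch below is a hypothetical pipeline, not code from the repository. The class name `CleanQuotePipeline` and the sample item are invented, and it assumes items are dictionaries whose text is wrapped in curly quotes. It follows the documented `process_item(item, spider)` form, which may differ between Scrapy versions.

```python
class CleanQuotePipeline:
    """Hypothetical pipeline: tidy the "text" field of each scraped item."""

    def process_item(self, item, spider):
        text = item.get("text") or ""
        # Remove surrounding whitespace, then any wrapping curly quotes.
        item["text"] = text.strip().strip("\u201c\u201d")
        return item  # returning the item passes it on to the next pipeline


# Stand-alone check with an invented item; no running crawler is needed.
cleaned = CleanQuotePipeline().process_item(
    {"text": " \u201cAn example quote.\u201d ", "author": "Example Author"},
    spider=None,
)
```

In a project, a pipeline is switched on through the `ITEM_PIPELINES` setting, which maps the class path to an order number.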

Scrapy fits recurring data collection where repeatability, logging, speed controls, and structure matter.

Strengths

The main strength is maturity. Scrapy has long been part of the Python ecosystem, with documentation, extensions, and known patterns.

Its architecture also keeps larger crawlers from degrading into a loose collection of ad hoc functions: spiders, pipelines, and middleware each have a defined place.

Limits

Scrapy does not remove legal and ethical limits. A site's robots.txt rules, its terms of service, a reasonable request rate, and the privacy of collected data still matter.
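One small, checkable part of this is a site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` rather than Scrapy itself, and parses a hypothetical robots.txt from a string so it needs no network access. The user agent name `example-bot` and the URLs are assumptions. Within Scrapy, the `ROBOTSTXT_OBEY` setting controls whether robots.txt rules are respected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "example-bot"  # hypothetical user agent name

# Ask whether specific URLs may be fetched under the rules above.
allowed_public = parser.can_fetch(AGENT, "https://example.com/quotes/")
allowed_private = parser.can_fetch(AGENT, "https://example.com/private/data")

# Requested pause between requests, or None if the file does not state one.
delay = parser.crawl_delay(AGENT)
```

Passing such a check does not settle questions about terms of service or privacy, which still need a separate review.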

Sites that render their content only through client-side JavaScript may need a headless-browser layer or another extraction approach.