Web scraping is the process of automatically extracting data from websites. Instead of manually copying information from web pages, you write a program that does it for you — faster, more accurately, and at scale.
Companies use web scraping for price monitoring, lead generation, market research, news aggregation, and academic research. It's one of the most practical skills a developer or data analyst can learn because almost every industry has data locked inside web pages that isn't available through an API.
In this guide, you'll learn how web scraping works from the ground up. We'll cover the core concepts, build scrapers with Python, handle real-world challenges like pagination and dynamic content, store the results, and discuss the legal and ethical boundaries you need to respect.
How Web Scraping Works
At its core, web scraping is a three-step process. Your program sends an HTTP request to a website (just like a browser does), receives the HTML response, and then parses that HTML to extract the specific data you need.
Think of it like reading a newspaper. Your eyes (the HTTP request) look at the page (the server response), your brain (the parser) identifies the article headlines, prices, or dates, and then you write them down in a notebook (storage).
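The request-parse-store loop can be sketched with nothing but Python's standard library. This is a toy version: the HTML is hard-coded in place of a real HTTP response, and `HeadlineParser` is an illustrative name, not a standard class:

```python
from html.parser import HTMLParser

# Step 2 (parse): a tiny parser that collects the text inside <h2> tags.
class HeadlineParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2 and data.strip():
            self.headlines.append(data.strip())

# Step 1 (request) would normally fetch this HTML over HTTP;
# here it is hard-coded to keep the sketch self-contained.
html = "<article><h2>First Post</h2></article><article><h2>Second Post</h2></article>"

parser = HeadlineParser()
parser.feed(html)

# Step 3 (store): keep the extracted data in a plain list.
print(parser.headlines)  # ['First Post', 'Second Post']
```

In practice you'll use dedicated libraries for steps 1 and 2, which is what the rest of this guide covers.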
Web Scraping vs. APIs: When to Use Which
Before scraping a website, always check if they offer an API. APIs are the preferred method because they're designed for programmatic access, return structured data, and are more reliable. Only use web scraping when no API is available.
| Factor | Web Scraping | API |
|---|---|---|
| Data format | Unstructured HTML you must parse | Structured JSON/XML |
| Reliability | Breaks when website layout changes | Stable, versioned endpoints |
| Speed | Slower (downloads full page HTML, CSS, images) | Faster (only data, no UI) |
| Rate limits | Implicit (you can get blocked) | Explicit (documented limits) |
| Legal clarity | Gray area — depends on terms of service | Clear — usage terms in API docs |
| Best for | Sites with no API, public data extraction | Any site that offers an API |
Setting Up Your Python Environment
You'll need Python 3.8+ and three libraries. Here's how to set up a clean project:
```shell
mkdir web-scraper && cd web-scraper
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install requests beautifulsoup4 lxml
```
Here's what each library does:
- requests — Sends HTTP requests and handles responses, cookies, and headers
- beautifulsoup4 — Parses HTML and lets you search for elements using CSS selectors or tag names
- lxml — A fast HTML/XML parser that BeautifulSoup can use as its backend engine
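As a quick sanity check that all three libraries are installed correctly, you can parse an inline HTML snippet (the markup and class names here are invented for illustration):

```python
from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

# Parse an inline snippet using the lxml backend.
html = '<div class="product"><h3>Widget</h3><span class="price">$9.99</span></div>'
soup = BeautifulSoup(html, "lxml")

print(soup.select_one(".product h3").get_text())      # Widget
print(soup.select_one(".product .price").get_text())  # $9.99
```

If this runs without errors, your environment is ready.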
Your First Web Scraper
Let's build a simple scraper that extracts article headlines from a blog page. This covers the fundamental pattern you'll use in every scraping project:
```python
import requests
from bs4 import BeautifulSoup

def scrape_headlines(url):
    """Scrape article headlines from a blog page."""
    # Step 1: Send the request
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    # Step 2: Parse the HTML
    soup = BeautifulSoup(response.text, "lxml")

    # Step 3: Extract the data
    articles = soup.select("article h2 a")  # CSS selector
    headlines = []
    for article in articles:
        headlines.append({
            "title": article.get_text(strip=True),
            "link": article.get("href", "")
        })
    return headlines

# Usage
results = scrape_headlines("https://example-blog.com/articles")
for item in results:
    print(f"{item['title']}")
    print(f"  {item['link']}\n")
```
Let's break down what's happening:
- We set a `User-Agent` header to identify our scraper. Many websites block requests that don't have a valid User-Agent.
- We use `response.raise_for_status()` to throw an exception if the server returns a 4xx or 5xx error.
- We use a CSS selector (`article h2 a`) to find all links inside `<h2>` tags that are inside `<article>` elements.
- For each match, we extract the text content and the `href` attribute.
Understanding CSS Selectors for Scraping
CSS selectors are the most important skill in web scraping. They let you pinpoint exactly which elements to extract from a page. Here are the selectors you'll use most often:
| Selector | What It Matches | Example |
|---|---|---|
| `tag` | All elements of that tag type | `h2` matches all `<h2>` tags |
| `.class` | Elements with that CSS class | `.price` matches `class="price"` |
| `#id` | Element with that specific ID | `#main-content` |
| `parent child` | Nested elements (descendant) | `div.product h3` |
| `parent > child` | Direct children only | `ul > li` |
| `[attr=value]` | Elements with a specific attribute | `a[href*="product"]` |
| `tag:nth-child(n)` | The nth child element | `tr:nth-child(2)` (second row) |
Pro tip: Open your browser's DevTools (F12), right-click any element, and choose "Copy > Copy selector" to get the exact CSS selector for that element.
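To get a feel for these, here are a few of the selectors from the table applied to a small hand-written snippet (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<ul id="catalog">
  <li class="item"><a href="/product/1">Alpha</a></li>
  <li class="item sale"><a href="/product/2">Beta</a></li>
  <li class="item"><a href="/about">About</a></li>
</ul>
"""
soup = BeautifulSoup(html, "lxml")

print(len(soup.select("li.item")))                # 3  (tag + class selector)
print(soup.select_one("#catalog > li a").text)    # Alpha  (ID, direct child, descendant)
print(len(soup.select('a[href*="product"]')))     # 2  (attribute selector)
print(soup.select_one("li:nth-child(2) a").text)  # Beta  (positional selector)
```

Note the difference between `select()` (returns all matches as a list) and `select_one()` (returns the first match or `None`).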
Scraping Multiple Pages (Pagination)
Most websites spread their data across multiple pages. To scrape everything, you need to follow pagination links automatically. Here's how to handle the most common pagination pattern:
```python
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"}

def scrape_all_pages(base_url, max_pages=50):
    """Scrape data from all pages of a paginated listing."""
    all_results = []
    page = 1
    while page <= max_pages:
        url = f"{base_url}?page={page}"
        print(f"Scraping page {page}...")
        response = requests.get(url, headers=HEADERS, timeout=10)
        if response.status_code != 200:
            print(f"Stopped at page {page}: HTTP {response.status_code}")
            break
        soup = BeautifulSoup(response.text, "lxml")
        items = soup.select(".product-card")
        if not items:
            print(f"No more items found. Stopping at page {page}.")
            break
        for item in items:
            all_results.append({
                "name": item.select_one("h3").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
                "url": item.select_one("a")["href"]
            })
        page += 1
        time.sleep(1.5)  # Be polite: wait between requests
    print(f"\nScraped {len(all_results)} items from {page - 1} pages.")
    return all_results
```
Key points about pagination:

- Always add delays between requests — `time.sleep(1)` to `time.sleep(3)` prevents overwhelming the server
- Stop when there's no more data — check whether the page has items before continuing
- Set a maximum page limit — prevents infinite loops if the site keeps returning pages
- Handle errors gracefully — a 404 or 500 error means you should stop, not crash
Handling Dynamic Content (JavaScript-Rendered Pages)
Some websites load their content using JavaScript after the initial page loads. When you use requests.get(), you only get the initial HTML — the JavaScript hasn't run yet, so the data isn't there.
For these sites, you need a headless browser that executes JavaScript. Playwright and Selenium are the two most popular options:
```shell
pip install playwright
playwright install chromium
```
```python
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_dynamic_page(url):
    """Scrape a page that requires JavaScript to render content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate and wait for content to load
        page.goto(url, wait_until="networkidle")

        # Optional: wait for a specific element to appear
        page.wait_for_selector(".product-list", timeout=10000)

        # Get the fully rendered HTML
        html = page.content()
        browser.close()

    # Now parse with BeautifulSoup as usual
    soup = BeautifulSoup(html, "lxml")
    products = soup.select(".product-list .item")
    return [
        {
            "name": product.select_one(".name").get_text(strip=True),
            "price": product.select_one(".price").get_text(strip=True)
        }
        for product in products
    ]
```
Rule of thumb: Start with requests + BeautifulSoup. If the data you need isn't in the HTML response, switch to Playwright. Don't use a headless browser unless you have to — it's significantly slower and uses more resources.
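One quick way to decide is to fetch the page with `requests` and check whether a value you can see in the browser appears in the raw HTML; if it doesn't, the content is almost certainly rendered client-side. A minimal sketch (the function names are my own, not a standard API):

```python
import requests

def rendered_server_side(html: str, marker: str) -> bool:
    """True if `marker` (a value you can see in the browser, such as a
    price or headline) already appears in the raw HTML."""
    return marker in html

def needs_browser(url: str, marker: str) -> bool:
    """Fetch the raw HTML and check for the marker. True suggests the
    content is rendered by JavaScript and requests alone won't see it."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return not rendered_server_side(resp.text, marker)
```

For example, if `needs_browser("https://example-shop.com/products", "$9.99")` returns True for a price visible on the page, reach for Playwright.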
Storing Scraped Data
Once you've extracted data, you need to store it. The format depends on your use case:
Save to CSV (for spreadsheets and simple analysis)
```python
import csv

def save_to_csv(data, filename="scraped_data.csv"):
    """Save a list of dictionaries to a CSV file."""
    if not data:
        print("No data to save.")
        return
    fieldnames = data[0].keys()
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} rows to {filename}")
```
Save to JSON (for APIs and nested data)
```python
import json

def save_to_json(data, filename="scraped_data.json"):
    """Save data to a JSON file."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(data)} records to {filename}")
```
Save to SQLite (for large datasets and querying)
```python
import sqlite3

def save_to_database(data, db_name="scraped.db"):
    """Save scraped data to a SQLite database."""
    conn = sqlite3.connect(db_name)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT,
            price TEXT,
            url TEXT UNIQUE,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    for item in data:
        try:
            conn.execute(
                "INSERT OR IGNORE INTO products (name, price, url) VALUES (?, ?, ?)",
                (item["name"], item["price"], item["url"])
            )
        except sqlite3.Error as e:
            print(f"DB error: {e}")
    conn.commit()
    print(f"Saved {len(data)} records to {db_name}")
    conn.close()
```
Scaling Up with Scrapy
For larger scraping projects (thousands of pages, multiple websites, scheduled runs), requests + BeautifulSoup can become limiting. That's where Scrapy comes in. Scrapy is a full-featured web scraping framework that handles concurrency, retries, rate limiting, and data pipelines out of the box.
```shell
pip install scrapy
scrapy startproject product_scraper
cd product_scraper
```
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example-shop.com/products"]

    custom_settings = {
        "DOWNLOAD_DELAY": 1.5,         # Wait between requests
        "CONCURRENT_REQUESTS": 4,      # Max parallel requests
        "ROBOTSTXT_OBEY": True,        # Respect robots.txt
        "AUTOTHROTTLE_ENABLED": True,  # Auto-adjust speed
    }

    def parse(self, response):
        # Extract product data from each card
        for card in response.css(".product-card"):
            yield {
                "name": card.css("h3::text").get("").strip(),
                "price": card.css(".price::text").get("").strip(),
                "rating": card.css(".rating::attr(data-score)").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }

        # Follow pagination link
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Avoiding Common Pitfalls
Web scraping comes with real-world challenges. Here's how to handle the most common ones:
Getting blocked (HTTP 403 / 429 errors)
Websites detect and block scrapers that make too many requests too quickly. To avoid this:
- Add `time.sleep()` delays of 1-3 seconds between requests
- Set a realistic `User-Agent` header
- Rotate User-Agent strings if scraping many pages
- Respect `robots.txt` — it tells you which pages the site allows bots to access
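Python's standard library can parse robots.txt for you via `urllib.robotparser`. A self-contained sketch (the rules are fed in as literal lines here instead of being fetched from a real site):

```python
from urllib.robotparser import RobotFileParser

# Normally you'd call rp.set_url("https://example.com/robots.txt") and
# rp.read(); feeding lines directly keeps this sketch self-contained.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /search",
])

print(rp.can_fetch("MyScraper/1.0", "https://example.com/products"))   # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/x"))  # False
```

Calling `can_fetch()` before each request is a cheap way to stay within the site's stated rules.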
Handling missing data
Not every page will have every field. Always use safe access patterns:
```python
# BAD: Crashes if the element doesn't exist
title = soup.select_one("h2").text

# GOOD: Returns None or a default value
element = soup.select_one("h2")
title = element.get_text(strip=True) if element else "N/A"

# ALSO GOOD: Use a helper function
def safe_text(soup, selector, default="N/A"):
    """Safely extract text from a CSS selector."""
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else default
```
Dealing with encoding issues
Some pages use non-standard character encodings. Always specify encoding when saving files:
```python
# Force UTF-8 encoding on the response
response.encoding = "utf-8"

# Always open files with encoding specified
with open("data.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writerows(data)
```
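To see why the encoding matters, decode the same bytes with two different codecs (a tiny illustrative snippet):

```python
# The same bytes decoded with the wrong codec produce mojibake.
raw = "Café München".encode("utf-8")

print(raw.decode("latin-1"))  # garbled: CafÃ© MÃ¼nchen
print(raw.decode("utf-8"))    # correct: Café München
```

If you see `Ã©`-style artifacts in your scraped data, the response was decoded with the wrong codec somewhere in the pipeline.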
Legal and Ethical Guidelines
Web scraping is a powerful tool, but it comes with responsibilities. Follow these guidelines to stay on the right side of the law and be a good internet citizen:
| Do | Don't |
|---|---|
| Check and respect `robots.txt` | Ignore a site's scraping restrictions |
| Add delays between requests (1-3 seconds) | Flood a server with hundreds of requests per second |
| Scrape only public, non-copyrighted data | Scrape personal data, login-protected content, or copyrighted material |
| Identify your bot with a User-Agent string | Disguise your scraper as a regular browser to bypass blocks |
| Cache results to avoid re-scraping the same pages | Repeatedly scrape the same pages unnecessarily |
| Read the website's Terms of Service | Assume all public data is free to scrape commercially |
| Use an API if one is available | Scrape a site that offers an API for the same data |
Important: The legality of web scraping varies by country and context. In the US, the Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act. However, violating a website's Terms of Service, scraping personal data (regulated by the GDPR in Europe), or using scraped data for harmful purposes can still create legal liability. When in doubt, consult a legal professional.
A Complete Real-World Example
Let's put everything together into a production-ready scraper that extracts job listings from a job board, handles pagination, saves to CSV, and includes proper error handling:
```python
import csv
import logging
import time

import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger(__name__)

class JobScraper:
    def __init__(self, base_url, delay=2.0):
        self.base_url = base_url
        self.delay = delay
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (compatible; JobScraper/1.0)"
        })

    def scrape_page(self, page_num):
        """Scrape a single page of job listings."""
        url = f"{self.base_url}?page={page_num}"
        logger.info(f"Fetching page {page_num}: {url}")
        try:
            resp = self.session.get(url, timeout=15)
            resp.raise_for_status()
        except requests.RequestException as e:
            logger.error(f"Request failed: {e}")
            return []

        soup = BeautifulSoup(resp.text, "lxml")
        jobs = []
        for card in soup.select(".job-listing"):
            title_el = card.select_one(".job-title a")
            company_el = card.select_one(".company-name")
            location_el = card.select_one(".location")
            salary_el = card.select_one(".salary")
            jobs.append({
                "title": title_el.get_text(strip=True) if title_el else "N/A",
                "company": company_el.get_text(strip=True) if company_el else "N/A",
                "location": location_el.get_text(strip=True) if location_el else "N/A",
                "salary": salary_el.get_text(strip=True) if salary_el else "Not listed",
                "url": title_el["href"] if title_el and title_el.get("href") else ""
            })
        return jobs

    def scrape_all(self, max_pages=20):
        """Scrape all pages and return combined results."""
        all_jobs = []
        for page in range(1, max_pages + 1):
            jobs = self.scrape_page(page)
            if not jobs:
                logger.info("No more jobs found. Done.")
                break
            all_jobs.extend(jobs)
            logger.info(f"Collected {len(jobs)} jobs from page {page}")
            time.sleep(self.delay)
        logger.info(f"Total: {len(all_jobs)} jobs scraped")
        return all_jobs

    @staticmethod
    def save_csv(data, filename="jobs.csv"):
        """Save job listings to CSV."""
        if not data:
            return
        with open(filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=data[0].keys())
            writer.writeheader()
            writer.writerows(data)
        logger.info(f"Saved to {filename}")

# Run the scraper
if __name__ == "__main__":
    scraper = JobScraper("https://example-jobs.com/listings", delay=2.0)
    jobs = scraper.scrape_all(max_pages=10)
    scraper.save_csv(jobs)
```
What to Do Next
You now have a solid foundation in web scraping with Python. Here's how to continue building your skills:
- Practice on real sites — Try scraping public data from sites like Books to Scrape (books.toscrape.com), Quotes to Scrape, or Wikipedia tables
- Learn Scrapy for larger projects — When you need to scrape thousands of pages, Scrapy's built-in concurrency and pipeline system will save you hours of code
- Explore Playwright for JavaScript-heavy sites — Many modern web apps render content client-side and require a headless browser
- Build a data pipeline — Combine scraping with data cleaning (pandas), storage (SQLite or PostgreSQL), and visualization (Chart.js or Matplotlib)
- Schedule your scrapers — Use cron jobs on Linux or Task Scheduler on Windows to run scrapers automatically on a daily or weekly basis
- Always check for APIs first — If a website offers an API for the data you need, use it. APIs are faster, more reliable, and more maintainable than scrapers
Web scraping is one of those skills that becomes more valuable the more you practice. Start with a simple project, handle the edge cases as they come up, and gradually take on more complex targets. The core pattern — request, parse, store — stays the same no matter how large the project gets.