
Web Scraping with Python: The Complete Beginner's Guide (2026)

Learn how to scrape data from websites using Python, BeautifulSoup, and Scrapy. This step-by-step guide covers setup, parsing HTML, handling pagination, storing data, and avoiding common pitfalls.

Web scraping is the process of automatically extracting data from websites. Instead of manually copying information from web pages, you write a program that does it for you — faster, more accurately, and at scale.

Companies use web scraping for price monitoring, lead generation, market research, news aggregation, and academic research. It's one of the most practical skills a developer or data analyst can learn because almost every industry has data locked inside web pages that isn't available through an API.

In this guide, you'll learn how web scraping works from the ground up. We'll cover the core concepts, build scrapers with Python, handle real-world challenges like pagination and dynamic content, store the results, and discuss the legal and ethical boundaries you need to respect.

How Web Scraping Works

At its core, web scraping is a three-step process. Your program sends an HTTP request to a website (just like a browser does), receives the HTML response, and then parses that HTML to extract the specific data you need.

How Web Scraping Works: The 3-Step Process

  1. Request: send an HTTP GET to the target URL (requests library)
  2. Parse: extract data from the HTML using selectors (BeautifulSoup)
  3. Store: save the results to CSV, JSON, or a database

Think of it like reading a newspaper. Your eyes (the HTTP request) look at the page (the server response), your brain (the parser) identifies the article headlines, prices, or dates, and then you write them down in a notebook (storage).

Web Scraping vs. APIs: When to Use Which

Before scraping a website, always check whether it offers an API. APIs are the preferred method because they're designed for programmatic access, return structured data, and are more reliable. Only resort to scraping when no API is available.

Factor | Web Scraping | API
Data format | Unstructured HTML you must parse | Structured JSON/XML
Reliability | Breaks when the website layout changes | Stable, versioned endpoints
Speed | Slower (downloads full page HTML, CSS, images) | Faster (only data, no UI)
Rate limits | Implicit (you can get blocked) | Explicit (documented limits)
Legal clarity | Gray area (depends on terms of service) | Clear (usage terms in API docs)
Best for | Sites with no API, public data extraction | Any site that offers an API
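
To make the contrast concrete, here is a sketch of working with an API instead. The endpoint in fetch_products is hypothetical; the point is that the payload arrives as structured JSON rather than HTML you have to pick apart:

```python
import json

import requests


def fetch_products(api_url, page=1):
    """Fetch one page of products from a (hypothetical) JSON API."""
    resp = requests.get(api_url, params={"page": page}, timeout=10)
    resp.raise_for_status()
    return resp.json()  # structured data, no HTML parsing needed

# A sample payload like the one such an API might return:
sample_response = '[{"name": "Widget", "price": 9.99}, {"name": "Gadget", "price": 24.5}]'
products = json.loads(sample_response)
print(min(products, key=lambda p: p["price"])["name"])  # Widget
```

With a real endpoint you would call fetch_products("https://api.example.com/v1/products") and get the same kind of list back directly, with no selectors involved.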

Setting Up Your Python Environment

You'll need Python 3.8+ and three libraries. Here's how to set up a clean project:

Terminal
mkdir web-scraper && cd web-scraper
python3 -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

pip install requests beautifulsoup4 lxml

Here's what each library does:

  • requests — Sends HTTP requests and handles responses, cookies, and headers
  • beautifulsoup4 — Parses HTML and lets you search for elements using CSS selectors or tag names
  • lxml — A fast HTML/XML parser that BeautifulSoup can use as its backend engine

Your First Web Scraper

Let's build a simple scraper that extracts article headlines from a blog page. This covers the fundamental pattern you'll use in every scraping project:

Python — scraper.py
import requests
from bs4 import BeautifulSoup

def scrape_headlines(url):
    """Scrape article headlines from a blog page."""

    # Step 1: Send the request
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    # Step 2: Parse the HTML
    soup = BeautifulSoup(response.text, "lxml")

    # Step 3: Extract the data
    articles = soup.select("article h2 a")  # CSS selector

    headlines = []
    for article in articles:
        headlines.append({
            "title": article.get_text(strip=True),
            "link": article.get("href", "")
        })

    return headlines


# Usage
results = scrape_headlines("https://example-blog.com/articles")
for item in results:
    print(f"{item['title']}")
    print(f"  {item['link']}\n")

Let's break down what's happening:

  1. We set a User-Agent header to identify our scraper. Many websites block requests that don't have a valid User-Agent.
  2. We use response.raise_for_status() to raise an exception if the server returns a 4xx or 5xx error.
  3. We use a CSS selector (article h2 a) to find all links inside <h2> tags that are inside <article> elements.
  4. For each match, we extract the text content and the href attribute.

Understanding CSS Selectors for Scraping

CSS selectors are the most important skill in web scraping. They let you pinpoint exactly which elements to extract from a page. Here are the selectors you'll use most often:

Selector | What It Matches | Example
tag | All elements of that tag type | h2 matches all <h2> tags
.class | Elements with that CSS class | .price matches class="price"
#id | The element with that specific ID | #main-content
parent child | Nested elements (any descendant) | div.product h3
parent > child | Direct children only | ul > li
[attr=value] | Elements with a specific attribute | a[href*="product"]
tag:nth-child(n) | The nth child element | tr:nth-child(2) (second row)

Pro tip: Open your browser's DevTools (F12), right-click any element, and choose "Copy > Copy selector" to get the exact CSS selector for that element.
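
Here are several of these selectors in action against a small made-up HTML snippet:

```python
from bs4 import BeautifulSoup

html = """
<div class="product" id="main-content">
  <h3><a href="/products/1">Blue Widget</a></h3>
  <span class="price">$19.99</span>
</div>
<div class="product">
  <h3><a href="/products/2">Red Gadget</a></h3>
  <span class="price">$4.50</span>
</div>
"""

# "lxml" also works here and is faster; html.parser needs no extra install
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select(".product")))                      # class selector -> 2
print(soup.select_one("#main-content h3 a").get_text())  # ID + descendant -> Blue Widget
print(soup.select_one('a[href*="2"]')["href"])           # attribute selector -> /products/2
print([p.get_text() for p in soup.select("div.product > span.price")])  # ['$19.99', '$4.50']
```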

Finding the Right CSS Selector (Browser DevTools Flow)

  1. Open DevTools: right-click the element and choose "Inspect"
  2. Identify the pattern: check class names, IDs, and nesting
  3. Write the selector: pass it to soup.select()
  4. Test and refine: verify it matches the right elements

Scraping Multiple Pages (Pagination)

Most websites spread their data across multiple pages. To scrape everything, you need to follow pagination links automatically. Here's how to handle the most common pagination pattern:

Python — Handling pagination
import time
import requests

# Reuse the same headers as in scraper.py
headers = {"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"}

def scrape_all_pages(base_url, max_pages=50):
    """Scrape data from all pages of a paginated listing."""
    all_results = []
    page = 1

    while page <= max_pages:
        url = f"{base_url}?page={page}"
        print(f"Scraping page {page}...")

        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 200:
            print(f"Stopped at page {page}: HTTP {response.status_code}")
            break

        soup = BeautifulSoup(response.text, "lxml")
        items = soup.select(".product-card")

        if not items:
            print(f"No more items found. Stopping at page {page}.")
            break

        for item in items:
            all_results.append({
                "name": item.select_one("h3").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
                "url": item.select_one("a")["href"]
            })

        page += 1
        time.sleep(1.5)  # Be polite: wait between requests

    print(f"\nScraped {len(all_results)} items from {page - 1} pages.")
    return all_results

Key points about pagination:

  • Always add delays between requests — time.sleep(1) to time.sleep(3) prevents overwhelming the server
  • Stop when there's no more data — Check if the page has items before continuing
  • Set a maximum page limit — Prevents infinite loops if the site keeps returning pages
  • Handle errors gracefully — A 404 or 500 error means you should stop, not crash
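
Not every site uses predictable ?page= URLs; some expose a "Next" link instead. One way to handle that pattern (the a.next-page class is an assumption; inspect the real site's markup) is to resolve the link against the current URL and follow it until it disappears:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_page(current_url, html):
    """Return the absolute URL of the next page, or None if there isn't one."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("a.next-page")  # assumed class name for the "Next" link
    if link and link.get("href"):
        # Relative hrefs like "?page=3" are resolved against the current URL
        return urljoin(current_url, link["href"])
    return None

# Example: a relative "next" link resolved against the current page URL
html = '<a class="next-page" href="?page=3">Next</a>'
print(find_next_page("https://example.com/products?page=2", html))
# https://example.com/products?page=3
```

In a loop, you would scrape each page, then call find_next_page and stop when it returns None.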

Handling Dynamic Content (JavaScript-Rendered Pages)

Some websites load their content using JavaScript after the initial page loads. When you use requests.get(), you only get the initial HTML — the JavaScript hasn't run yet, so the data isn't there.

For these sites, you need a headless browser that executes JavaScript. Playwright and Selenium are the two most popular options:

Terminal
pip install playwright
playwright install chromium
Python — Scraping JavaScript-rendered pages
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_dynamic_page(url):
    """Scrape a page that requires JavaScript to render content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate and wait for content to load
        page.goto(url, wait_until="networkidle")

        # Optional: wait for a specific element to appear
        page.wait_for_selector(".product-list", timeout=10000)

        # Get the fully rendered HTML
        html = page.content()
        browser.close()

    # Now parse with BeautifulSoup as usual
    soup = BeautifulSoup(html, "lxml")
    products = soup.select(".product-list .item")

    return [
        {
            "name": p.select_one(".name").get_text(strip=True),
            "price": p.select_one(".price").get_text(strip=True)
        }
        for p in products
    ]

Static HTML vs. JavaScript-Rendered Pages

  • Static page (requests + BS4): the server returns the full HTML, so the data is already in the response. Fast, lightweight, simple. Examples: Wikipedia, news sites, government data, blogs, docs (roughly 70% of websites).
  • Dynamic page (Playwright): the server returns skeleton HTML and JavaScript loads the data after the page loads. Needs a headless browser and runs slower. Examples: React/Vue SPAs, social media, dashboards (roughly 30% of websites).

Rule of thumb: Start with requests + BeautifulSoup. If the data you need isn't in the HTML response, switch to Playwright. Don't use a headless browser unless you have to — it's significantly slower and uses more resources.
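
A quick way to apply this rule of thumb: fetch the raw HTML and check whether the text you see in the browser is in it at all. A minimal heuristic sketch (the HTML strings below are made up; with a real site you would get html from requests.get(url).text):

```python
def needs_headless_browser(html, expected_text):
    """Heuristic: if text visible in the browser is missing from the raw
    HTML, the page is probably rendered client-side by JavaScript."""
    return expected_text not in html

# A static page: the data is right there in the markup
static_html = "<ul><li>Blue Widget</li></ul>"
# A typical SPA skeleton: an empty root div plus a script bundle
skeleton_html = '<div id="root"></div><script src="app.js"></script>'

print(needs_headless_browser(static_html, "Blue Widget"))    # False -> requests is enough
print(needs_headless_browser(skeleton_html, "Blue Widget"))  # True  -> reach for Playwright
```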

Storing Scraped Data

Once you've extracted data, you need to store it. The format depends on your use case:

Save to CSV (for spreadsheets and simple analysis)

Python — Save to CSV
import csv

def save_to_csv(data, filename="scraped_data.csv"):
    """Save a list of dictionaries to a CSV file."""
    if not data:
        print("No data to save.")
        return

    fieldnames = data[0].keys()

    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

    print(f"Saved {len(data)} rows to {filename}")

Save to JSON (for APIs and nested data)

Python — Save to JSON
import json

def save_to_json(data, filename="scraped_data.json"):
    """Save data to a JSON file."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"Saved {len(data)} records to {filename}")

Save to SQLite (for large datasets and querying)

Python — Save to SQLite
import sqlite3

def save_to_database(data, db_name="scraped.db"):
    """Save scraped data to a SQLite database."""
    conn = sqlite3.connect(db_name)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT,
            price TEXT,
            url TEXT UNIQUE,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    for item in data:
        try:
            conn.execute(
                "INSERT OR IGNORE INTO products (name, price, url) VALUES (?, ?, ?)",
                (item["name"], item["price"], item["url"])
            )
        except sqlite3.Error as e:
            print(f"DB error: {e}")

    conn.commit()
    print(f"Saved {len(data)} records to {db_name}")
    conn.close()
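
Once the data is in SQLite, you can query it with plain SQL, and the UNIQUE url column combined with INSERT OR IGNORE quietly deduplicates re-runs of the scraper. A self-contained sketch using an in-memory database and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB just for this demo
conn.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT, price TEXT, url TEXT UNIQUE
    )
""")

# The third row duplicates a URL, as a re-scrape of the same page would
rows = [
    ("Widget", "$9.99", "/p/1"),
    ("Gadget", "$24.50", "/p/2"),
    ("Widget", "$9.99", "/p/1"),
]
conn.executemany(
    "INSERT OR IGNORE INTO products (name, price, url) VALUES (?, ?, ?)", rows
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)  # 2 -- the duplicate URL was ignored
```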

Scaling Up with Scrapy

For larger scraping projects (thousands of pages, multiple websites, scheduled runs), requests + BeautifulSoup can become limiting. That's where Scrapy comes in. Scrapy is a full-featured web scraping framework that handles concurrency, retries, rate limiting, and data pipelines out of the box.

Terminal
pip install scrapy
scrapy startproject product_scraper
cd product_scraper
Python — Scrapy spider example
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example-shop.com/products"]

    custom_settings = {
        "DOWNLOAD_DELAY": 1.5,        # Wait between requests
        "CONCURRENT_REQUESTS": 4,     # Max parallel requests
        "ROBOTSTXT_OBEY": True,       # Respect robots.txt
        "AUTOTHROTTLE_ENABLED": True,  # Auto-adjust speed
    }

    def parse(self, response):
        # Extract product data from each card
        for card in response.css(".product-card"):
            yield {
                "name": card.css("h3::text").get("").strip(),
                "price": card.css(".price::text").get("").strip(),
                "rating": card.css(".rating::attr(data-score)").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }

        # Follow pagination link
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Choosing the Right Scraping Tool

  • requests + BS4: simple and beginner-friendly; best for 1-100 pages; static HTML only; no built-in concurrency. Best for learning and quick scripts.
  • Scrapy: a full framework, fast; best for 100-100K pages; built-in concurrency, pipelines, and middleware. Best for production scrapers at scale.
  • Playwright: a headless browser; best for JS-heavy sites; handles SPAs and logins; slower and uses more resources. Best for dynamic JS-rendered pages.

Avoiding Common Pitfalls

Web scraping comes with real-world challenges. Here's how to handle the most common ones:

Getting blocked (HTTP 403 / 429 errors)

Websites detect and block scrapers that make too many requests too quickly. To avoid this:

  • Add time.sleep() delays of 1-3 seconds between requests
  • Set a realistic User-Agent header
  • Rotate User-Agent strings if scraping many pages
  • Respect robots.txt — it tells you which pages the site allows bots to access

Handling missing data

Not every page will have every field. Always use safe access patterns:

Python — Safe data extraction
# BAD: Crashes if element doesn't exist
title = soup.select_one("h2").text

# GOOD: Returns None or default value
element = soup.select_one("h2")
title = element.get_text(strip=True) if element else "N/A"

# ALSO GOOD: Use a helper function
def safe_text(soup, selector, default="N/A"):
    """Safely extract text from a CSS selector."""
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else default

Dealing with encoding issues

Some pages use non-standard character encodings. Always specify encoding when saving files:

Python
# Force UTF-8 encoding on the response
response.encoding = "utf-8"

# Always open files with encoding specified
with open("data.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writerows(data)

Legal and Ethical Guidelines

Web scraping is a powerful tool, but it comes with responsibilities. Follow these guidelines to stay on the right side of the law and be a good internet citizen:

Do | Don't
Check and respect robots.txt | Ignore a site's scraping restrictions
Add delays between requests (1-3 seconds) | Flood a server with hundreds of requests per second
Scrape only public, non-copyrighted data | Scrape personal data, login-protected content, or copyrighted material
Identify your bot with a User-Agent string | Disguise your scraper as a regular browser to bypass blocks
Cache results to avoid re-scraping the same pages | Repeatedly scrape the same pages unnecessarily
Read the website's Terms of Service | Assume all public data is free to scrape commercially
Use an API if one is available | Scrape a site that offers an API for the same data
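
Checking robots.txt can be automated with Python's built-in urllib.robotparser. This sketch parses an example robots.txt inline so it runs without network access; against a real site you would use set_url() and read() instead:

```python
from urllib.robotparser import RobotFileParser

# Real usage: rp.set_url("https://example.com/robots.txt"); rp.read()
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
])

print(rp.can_fetch("MyScraper/1.0", "https://example.com/products"))   # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/x"))  # False
```

Calling can_fetch before each request lets your scraper skip disallowed paths automatically.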

Important: The legality of web scraping varies by country and context. In the US, the Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act, though hiQ was later found to have breached LinkedIn's terms and the case settled. Violating a website's Terms of Service, scraping personal data (GDPR in Europe), or using scraped data for harmful purposes can still create legal liability. When in doubt, consult a legal professional.

A Complete Real-World Example

Let's put everything together into a production-ready scraper that extracts job listings from a job board, handles pagination, saves to CSV, and includes proper error handling:

Python — complete_scraper.py
import requests
from bs4 import BeautifulSoup
import csv
import time
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger(__name__)

class JobScraper:
    def __init__(self, base_url, delay=2.0):
        self.base_url = base_url
        self.delay = delay
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (compatible; JobScraper/1.0)"
        })

    def scrape_page(self, page_num):
        """Scrape a single page of job listings."""
        url = f"{self.base_url}?page={page_num}"
        logger.info(f"Fetching page {page_num}: {url}")

        try:
            resp = self.session.get(url, timeout=15)
            resp.raise_for_status()
        except requests.RequestException as e:
            logger.error(f"Request failed: {e}")
            return []

        soup = BeautifulSoup(resp.text, "lxml")
        jobs = []

        for card in soup.select(".job-listing"):
            title_el = card.select_one(".job-title a")
            company_el = card.select_one(".company-name")
            location_el = card.select_one(".location")
            salary_el = card.select_one(".salary")

            jobs.append({
                "title": title_el.get_text(strip=True) if title_el else "N/A",
                "company": company_el.get_text(strip=True) if company_el else "N/A",
                "location": location_el.get_text(strip=True) if location_el else "N/A",
                "salary": salary_el.get_text(strip=True) if salary_el else "Not listed",
                "url": title_el["href"] if title_el and title_el.get("href") else ""
            })

        return jobs

    def scrape_all(self, max_pages=20):
        """Scrape all pages and return combined results."""
        all_jobs = []

        for page in range(1, max_pages + 1):
            jobs = self.scrape_page(page)
            if not jobs:
                logger.info("No more jobs found. Done.")
                break

            all_jobs.extend(jobs)
            logger.info(f"Collected {len(jobs)} jobs from page {page}")
            time.sleep(self.delay)

        logger.info(f"Total: {len(all_jobs)} jobs scraped")
        return all_jobs

    @staticmethod
    def save_csv(data, filename="jobs.csv"):
        """Save job listings to CSV."""
        if not data:
            return

        with open(filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=data[0].keys())
            writer.writeheader()
            writer.writerows(data)

        logger.info(f"Saved to {filename}")


# Run the scraper
if __name__ == "__main__":
    scraper = JobScraper("https://example-jobs.com/listings", delay=2.0)
    jobs = scraper.scrape_all(max_pages=10)
    scraper.save_csv(jobs)

What to Do Next

You now have a solid foundation in web scraping with Python. Here's how to continue building your skills:

  • Practice on real sites — Try scraping public data from sites like Books to Scrape (books.toscrape.com), Quotes to Scrape, or Wikipedia tables
  • Learn Scrapy for larger projects — When you need to scrape thousands of pages, Scrapy's built-in concurrency and pipeline system will save you hours of code
  • Explore Playwright for JavaScript-heavy sites — Many modern web apps render content client-side and require a headless browser
  • Build a data pipeline — Combine scraping with data cleaning (pandas), storage (SQLite or PostgreSQL), and visualization (Chart.js or Matplotlib)
  • Schedule your scrapers — Use cron jobs on Linux or Task Scheduler on Windows to run scrapers automatically on a daily or weekly basis
  • Always check for APIs first — If a website offers an API for the data you need, use it. APIs are faster, more reliable, and more maintainable than scrapers

Web scraping is one of those skills that becomes more valuable the more you practice. Start with a simple project, handle the edge cases as they come up, and gradually take on more complex targets. The core pattern — request, parse, store — stays the same no matter how large the project gets.
