Web scraping is the process of automatically extracting data from websites. Instead of manually copying information from web pages, you write a program that does it for you — faster, more accurately, and at scale.
Companies use web scraping for price monitoring, lead generation, market research, news aggregation, and academic research. It's one of the most practical skills a developer or data analyst can learn because almost every industry has data locked inside web pages that isn't available through an API.
In this guide, you'll learn how web scraping works from the ground up. We'll cover the core concepts, build scrapers with Python, handle real-world challenges like pagination and dynamic content, store the results, and discuss the legal and ethical boundaries you need to respect.
How Web Scraping Works
At its core, web scraping is a three-step process. Your program sends an HTTP request to a website (just like a browser does), receives the HTML response, and then parses that HTML to extract the specific data you need.
Think of it like reading a newspaper. Your eyes (the HTTP request) look at the page (the server response), your brain (the parser) identifies the article headlines, prices, or dates, and then you write them down in a notebook (storage).
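The request-parse-store loop can be sketched with nothing but Python's standard library. This is a toy version: the HTML is hard-coded in place of a real HTTP response, and `HeadlineParser` is an illustrative name, not a standard class:

```python
from html.parser import HTMLParser

# Step 2 (parse): a tiny parser that collects the text inside <h2> tags.
class HeadlineParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2 and data.strip():
            self.headlines.append(data.strip())

# Step 1 (request) would normally fetch this HTML over HTTP;
# here it is hard-coded to keep the sketch self-contained.
html = "<article><h2>First Post</h2></article><article><h2>Second Post</h2></article>"

parser = HeadlineParser()
parser.feed(html)

# Step 3 (store): keep the extracted data in a plain list.
print(parser.headlines)  # ['First Post', 'Second Post']
```

In practice you'll use dedicated libraries for steps 1 and 2, which is what the rest of this guide covers.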
Web Scraping vs. APIs: When to Use Which
Before scraping a website, always check if they offer an API. APIs are the preferred method because they're designed for programmatic access, return structured data, and are more reliable. Only use web scraping when no API is available.
| Factor | Web Scraping | API |
|---|---|---|
| Data format | Unstructured HTML you must parse | Structured JSON/XML |
| Reliability | Breaks when website layout changes | Stable, versioned endpoints |
| Speed | Slower (downloads full page HTML, CSS, images) | Faster (only data, no UI) |
| Rate limits | Implicit (you can get blocked) | Explicit (documented limits) |
| Legal clarity | Gray area — depends on terms of service | Clear — usage terms in API docs |
| Best for | Sites with no API, public data extraction | Any site that offers an API |
Setting Up Your Python Environment
You'll need Python 3.8+ and three libraries. Here's how to set up a clean project:
```shell
mkdir web-scraper && cd web-scraper
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install requests beautifulsoup4 lxml
```
Here's what each library does:
- requests — Sends HTTP requests and handles responses, cookies, and headers
- beautifulsoup4 — Parses HTML and lets you search for elements using CSS selectors or tag names
- lxml — A fast HTML/XML parser that BeautifulSoup can use as its backend engine
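As a quick sanity check that all three libraries are installed correctly, you can parse an inline HTML snippet (the markup and class names here are invented for illustration):

```python
from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

# Parse an inline snippet using the lxml backend.
html = '<div class="product"><h3>Widget</h3><span class="price">$9.99</span></div>'
soup = BeautifulSoup(html, "lxml")

print(soup.select_one(".product h3").get_text())      # Widget
print(soup.select_one(".product .price").get_text())  # $9.99
```

If this runs without errors, your environment is ready.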
Your First Web Scraper
Let's build a simple scraper that extracts article headlines from a blog page. This covers the fundamental pattern you'll use in every scraping project:
```python
import requests
from bs4 import BeautifulSoup

def scrape_headlines(url):
    """Scrape article headlines from a blog page."""
    # Step 1: Send the request
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    # Step 2: Parse the HTML
    soup = BeautifulSoup(response.text, "lxml")

    # Step 3: Extract the data
    articles = soup.select("article h2 a")  # CSS selector
    headlines = []
    for article in articles:
        headlines.append({
            "title": article.get_text(strip=True),
            "link": article.get("href", "")
        })
    return headlines

# Usage
results = scrape_headlines("https://example-blog.com/articles")
for item in results:
    print(f"{item['title']}")
    print(f"  {item['link']}\n")
```
Let's break down what's happening:
- We set a `User-Agent` header to identify our scraper. Many websites block requests that don't have a valid User-Agent.
- We use `response.raise_for_status()` to throw an exception if the server returns a 4xx or 5xx error.
- We use a CSS selector (`article h2 a`) to find all links inside `<h2>` tags that are inside `<article>` elements.
- For each match, we extract the text content and the `href` attribute.
Understanding CSS Selectors for Scraping
CSS selectors are the most important skill in web scraping. They let you pinpoint exactly which elements to extract from a page. Here are the selectors you'll use most often:
| Selector | What It Matches | Example |
|---|---|---|
| `tag` | All elements of that tag type | `h2` matches all `<h2>` tags |
| `.class` | Elements with that CSS class | `.price` matches `class="price"` |
| `#id` | Element with that specific ID | `#main-content` |
| `parent child` | Nested elements (descendant) | `div.product h3` |
| `parent > child` | Direct children only | `ul > li` |
| `[attr=value]` | Elements with a specific attribute | `a[href*="product"]` |
| `tag:nth-child(n)` | The nth child element | `tr:nth-child(2)` (second row) |
Pro tip: Open your browser's DevTools (F12), right-click any element, and choose "Copy > Copy selector" to get the exact CSS selector for that element.
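To get a feel for these, here are a few of the selectors from the table applied to a small hand-written snippet (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<ul id="catalog">
  <li class="item"><a href="/product/1">Alpha</a></li>
  <li class="item sale"><a href="/product/2">Beta</a></li>
  <li class="item"><a href="/about">About</a></li>
</ul>
"""
soup = BeautifulSoup(html, "lxml")

print(len(soup.select("li.item")))                # 3  (tag + class selector)
print(soup.select_one("#catalog > li a").text)    # Alpha  (ID, direct child, descendant)
print(len(soup.select('a[href*="product"]')))     # 2  (attribute selector)
print(soup.select_one("li:nth-child(2) a").text)  # Beta  (positional selector)
```

Note the difference between `select()` (returns all matches as a list) and `select_one()` (returns the first match or `None`).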
Scraping Multiple Pages (Pagination)
Most websites spread their data across multiple pages. To scrape everything, you need to follow pagination links automatically. Here's how to handle the most common pagination pattern:
```python
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"}

def scrape_all_pages(base_url, max_pages=50):
    """Scrape data from all pages of a paginated listing."""
    all_results = []
    page = 1
    while page <= max_pages:
        url = f"{base_url}?page={page}"
        print(f"Scraping page {page}...")
        response = requests.get(url, headers=HEADERS, timeout=10)
        if response.status_code != 200:
            print(f"Stopped at page {page}: HTTP {response.status_code}")
            break
        soup = BeautifulSoup(response.text, "lxml")
        items = soup.select(".product-card")
        if not items:
            print(f"No more items found. Stopping at page {page}.")
            break
        for item in items:
            all_results.append({
                "name": item.select_one("h3").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
                "url": item.select_one("a")["href"]
            })
        page += 1
        time.sleep(1.5)  # Be polite: wait between requests
    print(f"\nScraped {len(all_results)} items from {page - 1} pages.")
    return all_results
```
Key points about pagination:

- Always add delays between requests — `time.sleep(1)` to `time.sleep(3)` prevents overwhelming the server
- Stop when there's no more data — check whether the page has items before continuing
- Set a maximum page limit — prevents infinite loops if the site keeps returning pages
- Handle errors gracefully — a 404 or 500 error means you should stop, not crash
Handling Dynamic Content (JavaScript-Rendered Pages)
Some websites load their content using JavaScript after the initial page loads. When you use requests.get(), you only get the initial HTML — the JavaScript hasn't run yet, so the data isn't there.
For these sites, you need a headless browser that executes JavaScript. Playwright and Selenium are the two most popular options:
```shell
pip install playwright
playwright install chromium
```
```python
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_dynamic_page(url):
    """Scrape a page that requires JavaScript to render content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate and wait for content to load
        page.goto(url, wait_until="networkidle")

        # Optional: wait for a specific element to appear
        page.wait_for_selector(".product-list", timeout=10000)

        # Get the fully rendered HTML
        html = page.content()
        browser.close()

    # Now parse with BeautifulSoup as usual
    soup = BeautifulSoup(html, "lxml")
    products = soup.select(".product-list .item")
    return [
        {
            "name": product.select_one(".name").get_text(strip=True),
            "price": product.select_one(".price").get_text(strip=True)
        }
        for product in products
    ]
```
Rule of thumb: Start with requests + BeautifulSoup. If the data you need isn't in the HTML response, switch to Playwright. Don't use a headless browser unless you have to — it's significantly slower and uses more resources.
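One quick way to decide is to fetch the page with `requests` and check whether a value you can see in the browser appears in the raw HTML; if it doesn't, the content is almost certainly rendered client-side. A minimal sketch (the function names are my own, not a standard API):

```python
import requests

def rendered_server_side(html: str, marker: str) -> bool:
    """True if `marker` (a value you can see in the browser, such as a
    price or headline) already appears in the raw HTML."""
    return marker in html

def needs_browser(url: str, marker: str) -> bool:
    """Fetch the raw HTML and check for the marker. True suggests the
    content is rendered by JavaScript and requests alone won't see it."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return not rendered_server_side(resp.text, marker)
```

For example, if `needs_browser("https://example-shop.com/products", "$9.99")` returns True for a price visible on the page, reach for Playwright.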
Storing Scraped Data
Once you've extracted data, you need to store it. The format depends on your use case:
Save to CSV (for spreadsheets and simple analysis)
```python
import csv

def save_to_csv(data, filename="scraped_data.csv"):
    """Save a list of dictionaries to a CSV file."""
    if not data:
        print("No data to save.")
        return
    fieldnames = data[0].keys()
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} rows to {filename}")
```
Save to JSON (for APIs and nested data)
```python
import json

def save_to_json(data, filename="scraped_data.json"):
    """Save data to a JSON file."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(data)} records to {filename}")
```
Save to SQLite (for large datasets and querying)
```python
import sqlite3

def save_to_database(data, db_name="scraped.db"):
    """Save scraped data to a SQLite database."""
    conn = sqlite3.connect(db_name)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT,
            price TEXT,
            url TEXT UNIQUE,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    for item in data:
        try:
            conn.execute(
                "INSERT OR IGNORE INTO products (name, price, url) VALUES (?, ?, ?)",
                (item["name"], item["price"], item["url"])
            )
        except sqlite3.Error as e:
            print(f"DB error: {e}")
    conn.commit()
    print(f"Saved {len(data)} records to {db_name}")
    conn.close()
```
Scaling Up with Scrapy
For larger scraping projects (thousands of pages, multiple websites, scheduled runs), requests + BeautifulSoup can become limiting. That's where Scrapy comes in. Scrapy is a full-featured web scraping framework that handles concurrency, retries, rate limiting, and data pipelines out of the box.
```shell
pip install scrapy
scrapy startproject product_scraper
cd product_scraper
```
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example-shop.com/products"]

    custom_settings = {
        "DOWNLOAD_DELAY": 1.5,         # Wait between requests
        "CONCURRENT_REQUESTS": 4,      # Max parallel requests
        "ROBOTSTXT_OBEY": True,        # Respect robots.txt
        "AUTOTHROTTLE_ENABLED": True,  # Auto-adjust speed
    }

    def parse(self, response):
        # Extract product data from each card
        for card in response.css(".product-card"):
            yield {
                "name": card.css("h3::text").get("").strip(),
                "price": card.css(".price::text").get("").strip(),
                "rating": card.css(".rating::attr(data-score)").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }

        # Follow pagination link
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Avoiding Common Pitfalls
Web scraping comes with real-world challenges. Here's how to handle the most common ones:
Getting blocked (HTTP 403 / 429 errors)
Websites detect and block scrapers that make too many requests too quickly. To avoid this:
- Add `time.sleep()` delays of 1-3 seconds between requests
- Set a realistic `User-Agent` header
- Rotate User-Agent strings if scraping many pages
- Respect `robots.txt` — it tells you which pages the site allows bots to access
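Python's standard library can parse robots.txt for you via `urllib.robotparser`. A self-contained sketch (the rules are fed in as literal lines here instead of being fetched from a real site):

```python
from urllib.robotparser import RobotFileParser

# Normally you'd call rp.set_url("https://example.com/robots.txt") and
# rp.read(); feeding lines directly keeps this sketch self-contained.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /search",
])

print(rp.can_fetch("MyScraper/1.0", "https://example.com/products"))   # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/x"))  # False
```

Calling `can_fetch()` before each request is a cheap way to stay within the site's stated rules.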
Handling missing data
Not every page will have every field. Always use safe access patterns:
```python
# BAD: Crashes if the element doesn't exist
title = soup.select_one("h2").text

# GOOD: Returns None or a default value
element = soup.select_one("h2")
title = element.get_text(strip=True) if element else "N/A"

# ALSO GOOD: Use a helper function
def safe_text(soup, selector, default="N/A"):
    """Safely extract text from a CSS selector."""
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else default
```
Dealing with encoding issues
Some pages use non-standard character encodings. Always specify encoding when saving files:
```python
# Force UTF-8 encoding on the response
response.encoding = "utf-8"

# Always open files with encoding specified
with open("data.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writerows(data)
```
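To see why the encoding matters, decode the same bytes with two different codecs (a tiny illustrative snippet):

```python
# The same bytes decoded with the wrong codec produce mojibake.
raw = "Café München".encode("utf-8")

print(raw.decode("latin-1"))  # garbled: CafÃ© MÃ¼nchen
print(raw.decode("utf-8"))    # correct: Café München
```

If you see `Ã©`-style artifacts in your scraped data, the response was decoded with the wrong codec somewhere in the pipeline.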
Legal and Ethical Guidelines
Web scraping is a powerful tool, but it comes with responsibilities. Follow these guidelines to stay on the right side of the law and be a good internet citizen:
| Do | Don't |
|---|---|
| Check and respect `robots.txt` | Ignore a site's scraping restrictions |
| Add delays between requests (1-3 seconds) | Flood a server with hundreds of requests per second |
| Scrape only public, non-copyrighted data | Scrape personal data, login-protected content, or copyrighted material |
| Identify your bot with a User-Agent string | Disguise your scraper as a regular browser to bypass blocks |
| Cache results to avoid re-scraping the same pages | Repeatedly scrape the same pages unnecessarily |
| Read the website's Terms of Service | Assume all public data is free to scrape commercially |
| Use an API if one is available | Scrape a site that offers an API for the same data |
Important: The legality of web scraping varies by country and context. In the US, the Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act. However, violating a website's Terms of Service, scraping personal data (regulated by the GDPR in Europe), or using scraped data for harmful purposes can still create legal liability. When in doubt, consult a legal professional.
A Complete Real-World Example
Let's put everything together into a production-ready scraper that extracts job listings from a job board, handles pagination, saves to CSV, and includes proper error handling:
```python
import csv
import logging
import time

import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger(__name__)

class JobScraper:
    def __init__(self, base_url, delay=2.0):
        self.base_url = base_url
        self.delay = delay
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (compatible; JobScraper/1.0)"
        })

    def scrape_page(self, page_num):
        """Scrape a single page of job listings."""
        url = f"{self.base_url}?page={page_num}"
        logger.info(f"Fetching page {page_num}: {url}")
        try:
            resp = self.session.get(url, timeout=15)
            resp.raise_for_status()
        except requests.RequestException as e:
            logger.error(f"Request failed: {e}")
            return []

        soup = BeautifulSoup(resp.text, "lxml")
        jobs = []
        for card in soup.select(".job-listing"):
            title_el = card.select_one(".job-title a")
            company_el = card.select_one(".company-name")
            location_el = card.select_one(".location")
            salary_el = card.select_one(".salary")
            jobs.append({
                "title": title_el.get_text(strip=True) if title_el else "N/A",
                "company": company_el.get_text(strip=True) if company_el else "N/A",
                "location": location_el.get_text(strip=True) if location_el else "N/A",
                "salary": salary_el.get_text(strip=True) if salary_el else "Not listed",
                "url": title_el["href"] if title_el and title_el.get("href") else ""
            })
        return jobs

    def scrape_all(self, max_pages=20):
        """Scrape all pages and return combined results."""
        all_jobs = []
        for page in range(1, max_pages + 1):
            jobs = self.scrape_page(page)
            if not jobs:
                logger.info("No more jobs found. Done.")
                break
            all_jobs.extend(jobs)
            logger.info(f"Collected {len(jobs)} jobs from page {page}")
            time.sleep(self.delay)
        logger.info(f"Total: {len(all_jobs)} jobs scraped")
        return all_jobs

    @staticmethod
    def save_csv(data, filename="jobs.csv"):
        """Save job listings to CSV."""
        if not data:
            return
        with open(filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=data[0].keys())
            writer.writeheader()
            writer.writerows(data)
        logger.info(f"Saved to {filename}")

# Run the scraper
if __name__ == "__main__":
    scraper = JobScraper("https://example-jobs.com/listings", delay=2.0)
    jobs = scraper.scrape_all(max_pages=10)
    scraper.save_csv(jobs)
```
What to Do Next
You now have a solid foundation in web scraping with Python. Here's how to continue building your skills:
- Practice on real sites — Try scraping public data from sites like Books to Scrape (books.toscrape.com), Quotes to Scrape, or Wikipedia tables
- Learn Scrapy for larger projects — When you need to scrape thousands of pages, Scrapy's built-in concurrency and pipeline system will save you hours of code
- Explore Playwright for JavaScript-heavy sites — Many modern web apps render content client-side and require a headless browser
- Build a data pipeline — Combine scraping with data cleaning (pandas), storage (SQLite or PostgreSQL), and visualization (Chart.js or Matplotlib)
- Schedule your scrapers — Use cron jobs on Linux or Task Scheduler on Windows to run scrapers automatically on a daily or weekly basis
- Always check for APIs first — If a website offers an API for the data you need, use it. APIs are faster, more reliable, and more maintainable than scrapers
Web scraping is one of those skills that becomes more valuable the more you practice. Start with a simple project, handle the edge cases as they come up, and gradually take on more complex targets. The core pattern — request, parse, store — stays the same no matter how large the project gets.