You've built a web scraper that works perfectly for the first 20 requests — and then suddenly every response comes back as a 403 Forbidden or a CAPTCHA page. Sound familiar?
Websites use increasingly sophisticated anti-bot systems to detect and block automated traffic. Cloudflare, Akamai, PerimeterX, and DataDome can identify scrapers based on dozens of signals — from your request headers to your mouse movements.
In this guide, you'll learn exactly why scrapers get blocked and the 10 most effective techniques to avoid detection. Every method includes working Python code so you can implement it immediately. These are the same strategies used by professional data extraction teams working at scale.
Why Do Websites Block Scrapers?
Before diving into solutions, it helps to understand what you're up against. Websites block scrapers for several legitimate reasons:
- Server protection — Unthrottled scrapers can send thousands of requests per second, effectively becoming a DDoS attack
- Data protection — Businesses don't want competitors automatically harvesting their pricing, inventory, or content
- Bandwidth costs — Every request costs money. Bots can account for 40%+ of a site's traffic
- User experience — Heavy bot traffic slows down the site for real users
- Terms of service — Many sites explicitly prohibit automated access in their ToS
How Anti-Bot Systems Detect Scrapers
Modern anti-bot systems analyze multiple signals simultaneously: request headers, IP reputation and request rate, JavaScript execution, and behavioral patterns such as navigation order and timing. Getting blocked usually means you've tripped one or more of these checks.
The key insight is that no single technique will make your scraper undetectable. You need to combine multiple methods to look like a real browser operated by a real person.
Technique 1: Set Realistic HTTP Headers
The most common reason scrapers get blocked is sending requests with default or missing headers. The Python requests library sends a User-Agent like python-requests/2.31.0 — an obvious bot signature.
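You can see this default signature for yourself — `requests.utils.default_headers()` returns exactly what the library sends when you set nothing:

```python
import requests

# The headers requests sends when you don't set any — note the User-Agent,
# which is precisely the bot signature anti-bot systems look for.
defaults = requests.utils.default_headers()
print(defaults["User-Agent"])  # e.g. python-requests/2.31.0
```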
A real browser sends 10-15 headers with every request. Your scraper should do the same:
```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}

session = requests.Session()
session.headers.update(headers)

response = session.get("https://example.com/products")
print(response.status_code)
```
Important: Keep your User-Agent up to date. Sending a Chrome 90 User-Agent in 2026 is a red flag. Check your own browser's User-Agent at whatismybrowser.com and use a recent version.
Technique 2: Rotate User-Agents
Sending the same User-Agent for thousands of requests from the same IP is suspicious. Real traffic comes from a mix of browsers, operating systems, and devices. Rotate through a list of realistic User-Agent strings:
```python
import random

USER_AGENTS = [
    # Chrome on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    # Chrome on Mac
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    # Firefox on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0",
    # Safari on Mac
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
    # Edge on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0",
    # Chrome on Linux
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

def get_random_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": random.choice([
            "en-US,en;q=0.9",
            "en-GB,en;q=0.9",
            "en-US,en;q=0.5",
        ]),
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
    }
```
Technique 3: Add Random Delays Between Requests
Real users don't click links at perfectly regular intervals. A human might wait 3 seconds on one page and 15 seconds on another. Fixed delays like time.sleep(1) are still detectable because they create an unnatural, machine-like pattern.
Use randomized delays that mimic human browsing behavior:
```python
import random
import time

def human_delay(min_seconds=1.5, max_seconds=4.5):
    """Wait a random amount of time to mimic human browsing."""
    delay = random.uniform(min_seconds, max_seconds)
    # Occasionally add a longer pause (like reading a page)
    if random.random() < 0.1:  # 10% chance
        delay += random.uniform(3, 8)
    time.sleep(delay)

# Usage in your scraper loop
for url in urls_to_scrape:
    response = session.get(url, headers=get_random_headers())
    process(response)
    human_delay()  # Random wait before next request
```
Technique 4: Use Proxy Rotation
Sending hundreds of requests from a single IP address is the fastest way to get blocked. Proxy rotation distributes your requests across many IP addresses so no single IP makes too many requests.
There are three types of proxies, each with different trade-offs:
| Proxy Type | Detection Risk | Speed | Cost | Best For |
|---|---|---|---|---|
| Datacenter | High — easily flagged | Very fast | $1-5/mo | Non-protected sites |
| Residential | Low — real ISP IPs | Medium | $5-15/GB | Protected sites |
| Mobile | Very low | Slower | $15-30/GB | Heavily protected sites |
```python
import itertools

import requests

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
    "http://user:pass@proxy4.example.com:8080",
]

proxy_pool = itertools.cycle(PROXIES)

def make_request(url, retries=3):
    """Try the request through successive proxies from the pool."""
    for _ in range(retries):
        proxy = next(proxy_pool)
        try:
            return requests.get(
                url,
                headers=get_random_headers(),
                proxies={"http": proxy, "https": proxy},
                timeout=15,
            )
        except requests.RequestException:
            continue  # Rotate to the next proxy and try again
    return None
```
Technique 5: Respect robots.txt
The robots.txt file tells bots which pages they're allowed to access. While it's not legally binding in all jurisdictions, respecting it is considered best practice and ignoring it can escalate blocking measures.
```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url, user_agent="*"):
    """Check if a URL is allowed by the site's robots.txt."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# Usage
url = "https://example.com/products/page/1"
if can_scrape(url):
    response = session.get(url, headers=get_random_headers())
else:
    print(f"Blocked by robots.txt: {url}")
```
Also check for the Crawl-delay directive in robots.txt. If a site specifies Crawl-delay: 10, it's asking bots to wait at least 10 seconds between requests. Respect this value or use an even longer delay.
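Python's RobotFileParser can read that directive for you. A minimal sketch, parsing an example robots.txt in memory for illustration (a real scraper would call rp.set_url() and rp.read() against the live file, as in the snippet above):

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt contents, parsed in memory for illustration
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.crawl_delay("*"))                                   # 10
print(rp.can_fetch("*", "https://example.com/products"))     # True
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
```

If `crawl_delay()` returns a value, feed it into your delay function as the minimum wait.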
Technique 6: Use a Headless Browser for JavaScript-Heavy Sites
Sites protected by Cloudflare, DataDome, or PerimeterX often require JavaScript execution to prove you're a real browser. The requests library can't execute JavaScript, so these challenges will block you immediately.
Playwright with stealth settings can bypass many of these checks:
```bash
pip install playwright playwright-stealth
playwright install chromium
```

```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def scrape_with_stealth(url):
    """Scrape a JavaScript-protected page using a stealth browser."""
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            locale="en-US",
        )
        page = context.new_page()
        stealth_sync(page)  # Apply stealth patches
        page.goto(url, wait_until="networkidle")
        # Wait for real content (not a loading screen)
        page.wait_for_selector(".product-list", timeout=15000)
        html = page.content()
        browser.close()
        return html
```
Technique 7: Handle Cookies and Sessions
Many websites set tracking cookies on your first visit and expect them on subsequent requests. A scraper that doesn't send cookies back looks like a new visitor every time — which is unnatural and triggers bot detection.
```python
session = requests.Session()
session.headers.update(get_random_headers())

# First request: visit the homepage to get cookies
session.get("https://example.com/")

# Now subsequent requests include cookies automatically
response = session.get("https://example.com/products?page=1")
print(session.cookies.get_dict())  # Shows stored cookies

# The session maintains cookies across all requests
for page in range(1, 20):
    resp = session.get(f"https://example.com/products?page={page}")
    human_delay()
```
Using a requests.Session() object instead of individual requests.get() calls also reuses the underlying TCP connection, which is faster and more realistic.
Technique 8: Mimic Real Navigation Patterns
Real users don't jump directly to page 47 of search results. They start at the homepage, navigate to a category, apply filters, and then browse pages sequentially. Anti-bot systems track this navigation pattern.
```python
def scrape_with_navigation(session):
    """Mimic natural browsing flow."""
    # Step 1: Visit the homepage first
    session.get("https://example.com/", headers=get_random_headers())
    human_delay(2, 5)

    # Step 2: Navigate to the category page
    session.get("https://example.com/products", headers=get_random_headers())
    human_delay(1, 3)

    # Step 3: Now scrape listing pages sequentially
    all_data = []
    for page in range(1, 50):
        # Include a Referer header to show where you "came from"
        referer = (f"https://example.com/products?page={page - 1}"
                   if page > 1 else "https://example.com/products")
        resp = session.get(
            f"https://example.com/products?page={page}",
            headers={**get_random_headers(), "Referer": referer},
        )
        if resp.status_code != 200:
            break
        all_data.extend(parse_page(resp.text))
        human_delay()
    return all_data
```
Technique 9: Handle CAPTCHAs Gracefully
CAPTCHAs are the last line of defense. When you encounter one, you have several options — listed from most to least desirable:
| Strategy | Approach | Effectiveness |
|---|---|---|
| Avoid triggering them | Use all techniques above to stay under the radar | Best — prevention over cure |
| Slow down and retry | Wait 30-60 seconds, switch proxy, retry | Works for soft rate limits |
| Switch to a new session | New proxy + new cookies + new User-Agent | Works for IP-based CAPTCHAs |
| CAPTCHA solving services | Third-party APIs (2Captcha, Anti-Captcha) | Works but adds cost and latency |
| Skip and log | Log the blocked URL and retry later | Good for large-scale jobs |
```python
import random
import time

def smart_request(session, url, max_retries=3):
    """Make a request with automatic block detection and retry."""
    for attempt in range(max_retries):
        response = session.get(url, headers=get_random_headers())

        # Check for common block indicators
        if response.status_code == 200:
            content = response.text.lower()
            if "captcha" in content or "verify you are human" in content:
                print(f"CAPTCHA detected on attempt {attempt + 1}")
                time.sleep(30 + random.uniform(0, 30))
                continue
            return response  # Success
        elif response.status_code == 403:
            print(f"Blocked (403) on attempt {attempt + 1}")
            time.sleep(60)
            continue
        elif response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", 60))
            print(f"Rate limited. Waiting {retry_after}s...")
            time.sleep(retry_after)
            continue
        else:
            return response
    print(f"Failed after {max_retries} retries: {url}")
    return None
```
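The fixed 30- and 60-second waits above are a reasonable start; for repeated failures, exponential backoff with jitter is the more robust pattern. A sketch — the base and cap values here are illustrative assumptions, not magic numbers:

```python
import random

def backoff_delay(attempt, base=5.0, cap=300.0):
    """Wait time in seconds before retry `attempt` (0-based), capped at `cap`.

    "Full jitter": pick a random point between 0 and the exponential ceiling,
    so many scrapers retrying at once don't hammer the site in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

for attempt in range(4):
    print(f"retry {attempt}: wait up to {min(300.0, 5.0 * 2 ** attempt):.0f}s")
```

Swap `time.sleep(backoff_delay(attempt))` into the retry branches of `smart_request` to get steadily longer, randomized waits.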
Technique 10: Use the Right Tool for the Job
Different websites require different approaches. Using a headless browser on a simple static site is overkill and slow; using basic requests on a Cloudflare-protected SPA won't work at all. Match your tool to the target:

| Target site | Recommended approach |
|---|---|
| Static HTML, no protection | requests + BeautifulSoup |
| JavaScript-rendered content | Playwright headless browser |
| Cloudflare/DataDome/PerimeterX protection | Playwright + stealth patches + residential proxies |
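One practical way to apply this is an escalation check: try plain requests first, and fall back to a headless browser only when the response looks blocked. A sketch — the marker strings and status codes are heuristics I've chosen for illustration, not an exhaustive list:

```python
BLOCK_MARKERS = (
    "captcha",
    "verify you are human",
    "checking your browser",
    "access denied",
)

def needs_browser(status_code, body):
    """Heuristic: should we escalate from plain requests to Playwright?"""
    if status_code in (403, 429, 503):  # common block/challenge statuses
        return True
    return any(marker in body.lower() for marker in BLOCK_MARKERS)

# Usage sketch:
# resp = session.get(url)
# if needs_browser(resp.status_code, resp.text):
#     html = scrape_with_stealth(url)  # Playwright fallback (Technique 6)
```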
Putting It All Together: Production Scraper Template
Here's a complete, production-ready scraper class that combines all 10 techniques. Use this as a starting template for your projects:
```python
import logging
import random
import time

import requests

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger(__name__)

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
]

class StealthScraper:
    def __init__(self, base_url, delay=(1.5, 4.0), proxies=None):
        self.base_url = base_url
        self.delay = delay
        self.proxies = proxies or []
        self.proxy_index = 0
        self.session = requests.Session()
        self.blocked_count = 0

    def _get_headers(self, referer=None):
        headers = {
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
        }
        if referer:
            headers["Referer"] = referer
        return headers

    def _get_proxy(self):
        if not self.proxies:
            return None
        proxy = self.proxies[self.proxy_index % len(self.proxies)]
        self.proxy_index += 1
        return {"http": proxy, "https": proxy}

    def _wait(self):
        delay = random.uniform(*self.delay)
        if random.random() < 0.1:
            delay += random.uniform(3, 8)  # Occasional longer pause
        time.sleep(delay)

    def fetch(self, url, referer=None, retries=3):
        for attempt in range(retries):
            try:
                resp = self.session.get(
                    url,
                    headers=self._get_headers(referer),
                    proxies=self._get_proxy(),
                    timeout=15,
                )
                if resp.status_code == 200:
                    text = resp.text.lower()
                    if "captcha" not in text and "blocked" not in text:
                        self.blocked_count = 0
                        return resp
                    log.warning(f"Soft block detected: {url}")
                elif resp.status_code == 429:
                    wait = int(resp.headers.get("Retry-After", 60))
                    log.warning(f"Rate limited. Waiting {wait}s...")
                    time.sleep(wait)
                    continue
                elif resp.status_code == 403:
                    self.blocked_count += 1
                    log.warning(f"Blocked (403). Count: {self.blocked_count}")
                    time.sleep(30 + random.uniform(0, 30))
                    continue
            except requests.RequestException as e:
                log.error(f"Request error: {e}")
                time.sleep(5)
        return None

    def scrape(self, paths, parse_fn):
        # Warm up: visit the homepage first
        self.fetch(self.base_url)
        self._wait()

        results = []
        prev_url = self.base_url
        for path in paths:
            url = f"{self.base_url}{path}"
            resp = self.fetch(url, referer=prev_url)
            if resp:
                data = parse_fn(resp.text)
                results.extend(data)
                log.info(f"Scraped {len(data)} items from {path}")
            else:
                log.error(f"Failed: {path}")
            prev_url = url
            self._wait()
        return results
```
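As a usage sketch — the site, paths, and CSS selectors below are placeholders — a parse function for the scrape() method might look like this:

```python
from bs4 import BeautifulSoup

def parse_products(html):
    """Extract (name, price) pairs; '.product', '.name', '.price' are assumed selectors."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (card.select_one(".name").get_text(strip=True),
         card.select_one(".price").get_text(strip=True))
        for card in soup.select(".product")
    ]

# scraper = StealthScraper("https://example.com")
# items = scraper.scrape([f"/products?page={n}" for n in range(1, 6)], parse_products)
```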
Quick Reference: Anti-Detection Checklist
| # | Technique | Impact | Difficulty |
|---|---|---|---|
| 1 | Realistic HTTP headers | High | Easy |
| 2 | User-Agent rotation | High | Easy |
| 3 | Random delays | High | Easy |
| 4 | Proxy rotation | Very high | Medium |
| 5 | Respect robots.txt | Medium | Easy |
| 6 | Headless browser + stealth | Very high | Medium |
| 7 | Cookie & session handling | High | Easy |
| 8 | Realistic navigation | Medium | Medium |
| 9 | CAPTCHA handling | High | Hard |
| 10 | Right tool for the target | High | Easy |
What to Do Next
Avoiding blocks isn't about any single trick — it's about combining multiple techniques to make your scraper indistinguishable from a real browser. Here's your action plan:
- Start with the basics — Realistic headers, random delays, and sessions will get you past 80% of websites
- Add proxies when you need scale — Once you're scraping hundreds of pages, IP rotation becomes essential
- Use headless browsers only when needed — They're slower and heavier. Only reach for Playwright when `requests` gets blocked
- Monitor your success rate — Track the ratio of successful responses to blocks. If it drops below 95%, adjust your approach
- Always be respectful — Respect robots.txt, add reasonable delays, and don't overload servers. The goal is to collect data, not to disrupt services
- Check for APIs first — Before building a complex scraper, check if the site offers an API. An official API is almost always faster, more reliable, and more clearly permitted than scraping
Remember: the best scraper is one that the website never notices. Invest time in making your requests look natural, and you'll spend far less time dealing with blocks and CAPTCHAs.