You've built a web scraper that works perfectly for the first 20 requests — and then suddenly every response comes back as a 403 Forbidden or a CAPTCHA page. Sound familiar?
Websites use increasingly sophisticated anti-bot systems to detect and block automated traffic. Cloudflare, Akamai, PerimeterX, and DataDome can identify scrapers based on dozens of signals — from your request headers to your mouse movements.
In this guide, you'll learn exactly why scrapers get blocked and the 10 most effective techniques to avoid detection. Every method includes working Python code so you can implement it immediately. These are the same strategies used by professional data extraction teams working at scale.
Why Do Websites Block Scrapers?
Before diving into solutions, it helps to understand what you're up against. Websites block scrapers for several legitimate reasons:
- Server protection — Unthrottled scrapers can send thousands of requests per second, effectively becoming a DDoS attack
- Data protection — Businesses don't want competitors automatically harvesting their pricing, inventory, or content
- Bandwidth costs — Every request costs money. Bots can account for 40%+ of a site's traffic
- User experience — Heavy bot traffic slows down the site for real users
- Terms of service — Many sites explicitly prohibit automated access in their ToS
How Anti-Bot Systems Detect Scrapers
Modern anti-bot systems analyze multiple signals simultaneously: request headers, IP reputation and request rate, JavaScript execution, and behavioral patterns such as navigation order and timing. Getting blocked usually means you've tripped one or more of these checks.
The key insight is that no single technique will make your scraper undetectable. You need to combine multiple methods to look like a real browser operated by a real person.
Technique 1: Set Realistic HTTP Headers
The most common reason scrapers get blocked is sending requests with default or missing headers. The Python requests library sends a User-Agent like python-requests/2.31.0 — an obvious bot signature.
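You can see this default signature for yourself — `requests.utils.default_headers()` returns exactly what the library sends when you set nothing:

```python
import requests

# The headers requests sends when you don't set any — note the User-Agent,
# which is precisely the bot signature anti-bot systems look for.
defaults = requests.utils.default_headers()
print(defaults["User-Agent"])  # e.g. python-requests/2.31.0
```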
A real browser sends 10-15 headers with every request. Your scraper should do the same:
```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}

session = requests.Session()
session.headers.update(headers)

response = session.get("https://example.com/products")
print(response.status_code)
```
Important: Keep your User-Agent up to date. Sending a Chrome 90 User-Agent in 2026 is a red flag. Check your own browser's User-Agent at whatismybrowser.com and use a recent version.
Technique 2: Rotate User-Agents
Sending the same User-Agent for thousands of requests from the same IP is suspicious. Real traffic comes from a mix of browsers, operating systems, and devices. Rotate through a list of realistic User-Agent strings:
```python
import random

USER_AGENTS = [
    # Chrome on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    # Chrome on Mac
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    # Firefox on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0",
    # Safari on Mac
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
    # Edge on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0",
    # Chrome on Linux
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

def get_random_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": random.choice([
            "en-US,en;q=0.9",
            "en-GB,en;q=0.9",
            "en-US,en;q=0.5",
        ]),
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
    }
```
Technique 3: Add Random Delays Between Requests
Real users don't click links at perfectly regular intervals. A human might wait 3 seconds on one page and 15 seconds on another. Fixed delays like time.sleep(1) are still detectable because they create an unnatural, machine-like pattern.
Use randomized delays that mimic human browsing behavior:
```python
import random
import time

def human_delay(min_seconds=1.5, max_seconds=4.5):
    """Wait a random amount of time to mimic human browsing."""
    delay = random.uniform(min_seconds, max_seconds)
    # Occasionally add a longer pause (like reading a page)
    if random.random() < 0.1:  # 10% chance
        delay += random.uniform(3, 8)
    time.sleep(delay)

# Usage in your scraper loop
for url in urls_to_scrape:
    response = session.get(url, headers=get_random_headers())
    process(response)
    human_delay()  # Random wait before next request
```
Technique 4: Use Proxy Rotation
Sending hundreds of requests from a single IP address is the fastest way to get blocked. Proxy rotation distributes your requests across many IP addresses so no single IP makes too many requests.
There are three types of proxies, each with different trade-offs:
| Proxy Type | Detection Risk | Speed | Cost | Best For |
|---|---|---|---|---|
| Datacenter | High — easily flagged | Very fast | $1-5/mo | Non-protected sites |
| Residential | Low — real ISP IPs | Medium | $5-15/GB | Protected sites |
| Mobile | Very low | Slower | $15-30/GB | Heavily protected sites |
```python
import itertools

import requests

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
    "http://user:pass@proxy4.example.com:8080",
]

proxy_pool = itertools.cycle(PROXIES)

def make_request(url, retries=3):
    """Try the request through successive proxies from the pool."""
    for _ in range(retries):
        proxy = next(proxy_pool)
        try:
            return requests.get(
                url,
                headers=get_random_headers(),
                proxies={"http": proxy, "https": proxy},
                timeout=15,
            )
        except requests.RequestException:
            continue  # Rotate to the next proxy and try again
    return None
```
Technique 5: Respect robots.txt
The robots.txt file tells bots which pages they're allowed to access. While it's not legally binding in all jurisdictions, respecting it is considered best practice and ignoring it can escalate blocking measures.
```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url, user_agent="*"):
    """Check if a URL is allowed by the site's robots.txt."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# Usage
url = "https://example.com/products/page/1"
if can_scrape(url):
    response = session.get(url, headers=get_random_headers())
else:
    print(f"Blocked by robots.txt: {url}")
```
Also check for the Crawl-delay directive in robots.txt. If a site specifies Crawl-delay: 10, it's asking bots to wait at least 10 seconds between requests. Respect this value or use an even longer delay.
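Python's RobotFileParser can read that directive for you. A minimal sketch, parsing an example robots.txt in memory for illustration (a real scraper would call rp.set_url() and rp.read() against the live file, as in the snippet above):

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt contents, parsed in memory for illustration
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.crawl_delay("*"))                                   # 10
print(rp.can_fetch("*", "https://example.com/products"))     # True
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
```

If `crawl_delay()` returns a value, feed it into your delay function as the minimum wait.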
Technique 6: Use a Headless Browser for JavaScript-Heavy Sites
Sites protected by Cloudflare, DataDome, or PerimeterX often require JavaScript execution to prove you're a real browser. The requests library can't execute JavaScript, so these challenges will block you immediately.
Playwright with stealth settings can bypass many of these checks:
```bash
pip install playwright playwright-stealth
playwright install chromium
```

```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def scrape_with_stealth(url):
    """Scrape a JavaScript-protected page using a stealth browser."""
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            locale="en-US",
        )
        page = context.new_page()
        stealth_sync(page)  # Apply stealth patches
        page.goto(url, wait_until="networkidle")
        # Wait for real content (not a loading screen)
        page.wait_for_selector(".product-list", timeout=15000)
        html = page.content()
        browser.close()
        return html
```
Technique 7: Handle Cookies and Sessions
Many websites set tracking cookies on your first visit and expect them on subsequent requests. A scraper that doesn't send cookies back looks like a new visitor every time — which is unnatural and triggers bot detection.
```python
session = requests.Session()
session.headers.update(get_random_headers())

# First request: visit the homepage to get cookies
session.get("https://example.com/")

# Now subsequent requests include cookies automatically
response = session.get("https://example.com/products?page=1")
print(session.cookies.get_dict())  # Shows stored cookies

# The session maintains cookies across all requests
for page in range(1, 20):
    resp = session.get(f"https://example.com/products?page={page}")
    human_delay()
```
Using a requests.Session() object instead of individual requests.get() calls also reuses the underlying TCP connection, which is faster and more realistic.
Technique 8: Mimic Real Navigation Patterns
Real users don't jump directly to page 47 of search results. They start at the homepage, navigate to a category, apply filters, and then browse pages sequentially. Anti-bot systems track this navigation pattern.
```python
def scrape_with_navigation(session):
    """Mimic natural browsing flow."""
    # Step 1: Visit the homepage first
    session.get("https://example.com/", headers=get_random_headers())
    human_delay(2, 5)

    # Step 2: Navigate to the category page
    session.get("https://example.com/products", headers=get_random_headers())
    human_delay(1, 3)

    # Step 3: Now scrape listing pages sequentially
    all_data = []
    for page in range(1, 50):
        # Include a Referer header to show where you "came from"
        referer = (f"https://example.com/products?page={page - 1}"
                   if page > 1 else "https://example.com/products")
        resp = session.get(
            f"https://example.com/products?page={page}",
            headers={**get_random_headers(), "Referer": referer},
        )
        if resp.status_code != 200:
            break
        all_data.extend(parse_page(resp.text))
        human_delay()
    return all_data
```
Technique 9: Handle CAPTCHAs Gracefully
CAPTCHAs are the last line of defense. When you encounter one, you have several options — listed from most to least desirable:
| Strategy | Approach | Effectiveness |
|---|---|---|
| Avoid triggering them | Use all techniques above to stay under the radar | Best — prevention over cure |
| Slow down and retry | Wait 30-60 seconds, switch proxy, retry | Works for soft rate limits |
| Switch to a new session | New proxy + new cookies + new User-Agent | Works for IP-based CAPTCHAs |
| CAPTCHA solving services | Third-party APIs (2Captcha, Anti-Captcha) | Works but adds cost and latency |
| Skip and log | Log the blocked URL and retry later | Good for large-scale jobs |
```python
import random
import time

def smart_request(session, url, max_retries=3):
    """Make a request with automatic block detection and retry."""
    for attempt in range(max_retries):
        response = session.get(url, headers=get_random_headers())

        # Check for common block indicators
        if response.status_code == 200:
            content = response.text.lower()
            if "captcha" in content or "verify you are human" in content:
                print(f"CAPTCHA detected on attempt {attempt + 1}")
                time.sleep(30 + random.uniform(0, 30))
                continue
            return response  # Success
        elif response.status_code == 403:
            print(f"Blocked (403) on attempt {attempt + 1}")
            time.sleep(60)
            continue
        elif response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", 60))
            print(f"Rate limited. Waiting {retry_after}s...")
            time.sleep(retry_after)
            continue
        else:
            return response
    print(f"Failed after {max_retries} retries: {url}")
    return None
```
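The fixed 30- and 60-second waits above are a reasonable start; for repeated failures, exponential backoff with jitter is the more robust pattern. A sketch — the base and cap values here are illustrative assumptions, not magic numbers:

```python
import random

def backoff_delay(attempt, base=5.0, cap=300.0):
    """Wait time in seconds before retry `attempt` (0-based), capped at `cap`.

    "Full jitter": pick a random point between 0 and the exponential ceiling,
    so many scrapers retrying at once don't hammer the site in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

for attempt in range(4):
    print(f"retry {attempt}: wait up to {min(300.0, 5.0 * 2 ** attempt):.0f}s")
```

Swap `time.sleep(backoff_delay(attempt))` into the retry branches of `smart_request` to get steadily longer, randomized waits.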
Technique 10: Use the Right Tool for the Job
Different websites require different approaches. Using a headless browser on a simple static site is overkill and slow; using basic requests on a Cloudflare-protected SPA won't work at all. Match your tool to the target:

| Target site | Recommended approach |
|---|---|
| Static HTML, no protection | requests + BeautifulSoup |
| JavaScript-rendered content | Playwright headless browser |
| Cloudflare/DataDome/PerimeterX protection | Playwright + stealth patches + residential proxies |
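One practical way to apply this is an escalation check: try plain requests first, and fall back to a headless browser only when the response looks blocked. A sketch — the marker strings and status codes are heuristics I've chosen for illustration, not an exhaustive list:

```python
BLOCK_MARKERS = (
    "captcha",
    "verify you are human",
    "checking your browser",
    "access denied",
)

def needs_browser(status_code, body):
    """Heuristic: should we escalate from plain requests to Playwright?"""
    if status_code in (403, 429, 503):  # common block/challenge statuses
        return True
    return any(marker in body.lower() for marker in BLOCK_MARKERS)

# Usage sketch:
# resp = session.get(url)
# if needs_browser(resp.status_code, resp.text):
#     html = scrape_with_stealth(url)  # Playwright fallback (Technique 6)
```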
Putting It All Together: Production Scraper Template
Here's a complete, production-ready scraper class that combines all 10 techniques. Use this as a starting template for your projects:
```python
import logging
import random
import time

import requests

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger(__name__)

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
]

class StealthScraper:
    def __init__(self, base_url, delay=(1.5, 4.0), proxies=None):
        self.base_url = base_url
        self.delay = delay
        self.proxies = proxies or []
        self.proxy_index = 0
        self.session = requests.Session()
        self.blocked_count = 0

    def _get_headers(self, referer=None):
        headers = {
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
        }
        if referer:
            headers["Referer"] = referer
        return headers

    def _get_proxy(self):
        if not self.proxies:
            return None
        proxy = self.proxies[self.proxy_index % len(self.proxies)]
        self.proxy_index += 1
        return {"http": proxy, "https": proxy}

    def _wait(self):
        delay = random.uniform(*self.delay)
        if random.random() < 0.1:
            delay += random.uniform(3, 8)  # Occasional longer pause
        time.sleep(delay)

    def fetch(self, url, referer=None, retries=3):
        for attempt in range(retries):
            try:
                resp = self.session.get(
                    url,
                    headers=self._get_headers(referer),
                    proxies=self._get_proxy(),
                    timeout=15,
                )
                if resp.status_code == 200:
                    text = resp.text.lower()
                    if "captcha" not in text and "blocked" not in text:
                        self.blocked_count = 0
                        return resp
                    log.warning(f"Soft block detected: {url}")
                elif resp.status_code == 429:
                    wait = int(resp.headers.get("Retry-After", 60))
                    log.warning(f"Rate limited. Waiting {wait}s...")
                    time.sleep(wait)
                    continue
                elif resp.status_code == 403:
                    self.blocked_count += 1
                    log.warning(f"Blocked (403). Count: {self.blocked_count}")
                    time.sleep(30 + random.uniform(0, 30))
                    continue
            except requests.RequestException as e:
                log.error(f"Request error: {e}")
                time.sleep(5)
        return None

    def scrape(self, paths, parse_fn):
        # Warm up: visit the homepage first
        self.fetch(self.base_url)
        self._wait()

        results = []
        prev_url = self.base_url
        for path in paths:
            url = f"{self.base_url}{path}"
            resp = self.fetch(url, referer=prev_url)
            if resp:
                data = parse_fn(resp.text)
                results.extend(data)
                log.info(f"Scraped {len(data)} items from {path}")
            else:
                log.error(f"Failed: {path}")
            prev_url = url
            self._wait()
        return results
```
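As a usage sketch — the site, paths, and CSS selectors below are placeholders — a parse function for the scrape() method might look like this:

```python
from bs4 import BeautifulSoup

def parse_products(html):
    """Extract (name, price) pairs; '.product', '.name', '.price' are assumed selectors."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (card.select_one(".name").get_text(strip=True),
         card.select_one(".price").get_text(strip=True))
        for card in soup.select(".product")
    ]

# scraper = StealthScraper("https://example.com")
# items = scraper.scrape([f"/products?page={n}" for n in range(1, 6)], parse_products)
```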
Quick Reference: Anti-Detection Checklist
| # | Technique | Impact | Difficulty |
|---|---|---|---|
| 1 | Realistic HTTP headers | High | Easy |
| 2 | User-Agent rotation | High | Easy |
| 3 | Random delays | High | Easy |
| 4 | Proxy rotation | Very high | Medium |
| 5 | Respect robots.txt | Medium | Easy |
| 6 | Headless browser + stealth | Very high | Medium |
| 7 | Cookie & session handling | High | Easy |
| 8 | Realistic navigation | Medium | Medium |
| 9 | CAPTCHA handling | High | Hard |
| 10 | Right tool for the target | High | Easy |
What to Do Next
Avoiding blocks isn't about any single trick — it's about combining multiple techniques to make your scraper indistinguishable from a real browser. Here's your action plan:
- Start with the basics — Realistic headers, random delays, and sessions will get you past 80% of websites
- Add proxies when you need scale — Once you're scraping hundreds of pages, IP rotation becomes essential
- Use headless browsers only when needed — They're slower and heavier. Only reach for Playwright when `requests` gets blocked
- Monitor your success rate — Track the ratio of successful responses to blocks. If it drops below 95%, adjust your approach
- Always be respectful — Respect robots.txt, add reasonable delays, and don't overload servers. The goal is to collect data, not to disrupt services
- Check for APIs first — Before building a complex scraper, check if the site offers an API. An official API is almost always faster, more reliable, and more clearly permitted than scraping
Remember: the best scraper is one that the website never notices. Invest time in making your requests look natural, and you'll spend far less time dealing with blocks and CAPTCHAs.