How to Scrape Amazon Product Data with Python (2026 Guide)

Pangolinfo
06/11, 2026

How to Scrape Amazon Product Data with Python in 2026: Beating TLS Fingerprinting

Amazon’s anti-bot infrastructure has evolved far beyond simple IP blocking. In 2026 it fingerprints your TLS handshake before serving a single byte of HTML, so the standard requests library can no longer scrape Amazon product data reliably. This guide shows how to scrape Amazon product data with Python in this new landscape. It breaks down Amazon’s five-layer bot detection architecture (TLS fingerprinting, HTTP/2 frame analysis, behavioral scoring, silent CAPTCHA, and honeypot traps), provides a complete curl_cffi-based bypass with production-grade Python code including retry logic, rate limiting, and tiered exception handling, and uses real 2026 cost data to show when a DIY scraper stops making engineering sense and a managed Amazon Scraper API is the rational choice.

You can write a Python scraper for Amazon in 20 lines of code. In 2026, Amazon won’t let you reach line 20. The standard requests library fails at the TLS handshake, before any HTML is served, because Amazon’s Bot Manager fingerprints the Python TLS stack and blocks it on the spot. The approach that works at scale is curl_cffi: it drives libcurl to impersonate Chrome’s complete TLS fingerprint, so your Python process looks like a real browser at the network layer. This guide covers both sides: the code that runs, and the cost math that tells you when to switch to a managed Amazon Scraper API instead.

What Data Can You Actually Scrape from an Amazon Product Page?

Before writing a single line of Python, it helps to know exactly what you’re targeting. A standard ASIN page (amazon.com/dp/B0XXXXXX) exposes the following data in its public HTML:

  • Product title — the full listing name including brand, model, and variant descriptors, typically 80–200 characters
  • Price — current price, list price, deal price, Subscribe & Save discount, and sometimes coupon amounts, scattered across 5–7 separate DOM nodes
  • Star rating and review count — aggregate rating (e.g., 4.3 stars) and total review quantity, as two independent fields
  • ASIN — Amazon’s unique product identifier, buried inside an <input name="ASIN"> element or the page URL
  • Images — high-resolution main image and gallery URLs, embedded as JSON inside a <script> tag rather than in <img> src attributes
  • Bullet point features — the seller’s 5 core selling points, the primary persuasion copy on any listing
  • Product description and A+ content — longer HTML content blocks; brand-registered sellers may have rich A+ modules
  • Best Sellers Rank (BSR) — real-time rank in the parent category and 1–2 sub-categories, the core metric for product research
  • Dimensions and weight — required for FBA fee estimation
  • Availability and shipping — in-stock status, Prime badge, estimated delivery date

All of this data is public HTML — accessing it isn’t the hard part. The hard part is how Amazon identifies you as a bot and blocks you before you’ve finished your second request.

How Does Amazon’s Anti-Bot System Actually Work in 2026?

Most tutorials describe “getting blocked by Amazon” as an outcome without explaining the mechanism. Understanding the technical layers of the detection system is what separates code that works from code that runs out of retries and gives up.

Layer 1: TLS Fingerprinting — The Hardest Gate to Pass

A TLS handshake happens before any HTTP data is transferred. When your Python script opens an HTTPS connection to amazon.com, the server receives a ClientHello message containing your client’s supported cipher suite list, TLS extension list, elliptic curve parameters, and more. Hash these fields together and you get a JA3 fingerprint (or the newer JA4). The problem: Python’s requests library uses urllib3 + OpenSSL, producing a ClientHello with a fixed cipher suite order and extension list that matches no real browser — it’s a distinctly Python-shaped fingerprint. Amazon’s Bot Manager (HUMAN Security, formerly PerimeterX) maintains a database of known malicious fingerprints, and Python’s TLS signature has been in that list for years. It doesn’t matter how clean your IP is or how realistic your request headers look — the handshake already gave you away.

Figure: Python requests vs. curl_cffi TLS ClientHello fingerprints. requests carries a fixed Python TLS fingerprint flagged in Amazon’s Bot Manager; curl_cffi mimics Chrome’s complete handshake and passes the gate.
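
To make the hashing step concrete, here is a minimal sketch of how a JA3 fingerprint is derived from ClientHello fields: the decimal values are joined into one string and MD5-hashed. The field values below are illustrative, not captured from a real client, and the function names are assumptions for this example only.

```python
import hashlib

# GREASE values (RFC 8701) are random placeholders that browsers insert;
# JA3 ignores them so the fingerprint stays stable across connections.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: five comma-separated fields, values joined by '-'."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join(
        [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """JA3 fingerprint: MD5 hex digest of the JA3 string."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Illustrative values only: same cipher suites, different order.
client_a = (771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
client_b = (771, [4866, 4865, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])

print(ja3_string(*client_a))
print(ja3_hash(*client_a) == ja3_hash(*client_b))  # False: order alone changes the hash
```

Because the hash covers ordering as well as content, a client that supports the same cipher suites as Chrome but lists them in a different order still produces a different fingerprint.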

Layer 2: HTTP/2 SETTINGS Frame Fingerprinting

Modern browsers communicate over HTTP/2, and the SETTINGS frame sent at connection startup carries its own fingerprint: specific initial window sizes, header table sizes, and concurrent stream settings. Chrome’s SETTINGS values are distinct and well-documented. requests speaks only HTTP/1.1, and HTTP/2-capable Python clients such as httpx send SETTINGS parameters that don’t match any real browser. curl_cffi’s impersonate parameter sets both the TLS fingerprint and the HTTP/2 SETTINGS, so a single configuration parameter covers both layers.

Layer 3: Behavioral Analysis — Session-Level Detection

Even after clearing the TLS and HTTP/2 gates, Amazon analyzes your behavioral pattern across the session:

  • Request timing — real users spend 30–120 seconds reading a product page; scrapers typically fire requests every 1–3 seconds
  • Cookie state — real browsers accumulate session cookies, preference cookies, and browsing history cookies; scrapers tend to have inconsistent or missing cookie state across requests
  • Referer chain — a real user arrives at a product detail page from a search results page, so the Referer header is amazon.com/s?k=...; scrapers hitting /dp/ URLs directly have no organic Referer
  • Accept-Language vs. IP geolocation consistency — Accept-Language: en-US with an IP geolocating to Dhaka, Bangladesh triggers an immediate flag
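
The Referer and Accept-Language points above can be sketched as a small header builder for a product request that follows a search-results page. This is a minimal illustration, not a complete behavioral disguise; the function names are hypothetical, and the caller is assumed to pick an Accept-Language value that matches the proxy’s geolocation.

```python
from urllib.parse import quote_plus


def search_url(marketplace: str, keyword: str) -> str:
    """URL of the search-results page a real user would plausibly come from."""
    return f"https://{marketplace}/s?k={quote_plus(keyword)}"


def product_headers(marketplace: str, keyword: str, accept_language: str) -> dict:
    """Headers for a product-page request that follows a search on the same site."""
    return {
        "Referer": search_url(marketplace, keyword),
        # Navigating from /s to /dp on the same host is a same-origin navigation.
        "Sec-Fetch-Site": "same-origin",
        "Accept-Language": accept_language,
    }


headers = product_headers("www.amazon.com", "water bottle", "en-US,en;q=0.9")
print(headers["Referer"])  # https://www.amazon.com/s?k=water+bottle
```

These values would be merged over a base header template, ideally on a persistent session so that cookies set by the search page are sent with the product request.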

Layer 4: Silent CAPTCHA — The Hardest Failure Mode to Debug

Amazon’s craftiest anti-bot technique isn’t returning a 403 or 503. It returns HTTP 200, but the HTML body is a CAPTCHA verification page or a “Robot Check” page rather than product data. Your scraper sees response.status_code == 200 and assumes success, but soup.select_one("#productTitle") returns None. Without explicitly inspecting the raw HTML, this failure propagates silently through your entire dataset, corrupting the output without a single error log entry.

Figure: Amazon’s “silent CAPTCHA”: HTTP 200 with a robot verification form in the body, the most common and least visible scraper failure in the wild.

Layer 5: Honeypot Links

Amazon pages occasionally embed links that are invisible to human users — hidden with CSS, or colored to match the background. Real users never click them; scrapers that parse every <a href> and follow links indiscriminately will trigger these traps, instantly flagging the IP as a bot. Your scraping logic needs to filter for visible, interactable elements only — not blindly consume everything in the DOM.
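
A minimal sketch of that filtering step, using only the standard library. It is an approximation under a stated assumption: it only catches links hidden through the element’s own attributes (inline style, hidden, aria-hidden). Links hidden by a stylesheet rule or by a hidden ancestor would need a rendering engine to detect. The class and function names are illustrative.

```python
import re
from html.parser import HTMLParser

# Inline-style patterns that make an element invisible.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)


class VisibleLinkParser(HTMLParser):
    """Collect href values from <a> tags that are not hidden by their own attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)  # boolean attributes such as `hidden` map to None
        href = attr.get("href")
        if not href:
            return
        hidden = (
            "hidden" in attr
            or attr.get("aria-hidden") == "true"
            or HIDDEN_STYLE.search(attr.get("style") or "")
        )
        if not hidden:
            self.links.append(href)


def visible_links(html: str) -> list[str]:
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links


sample = (
    '<a href="/dp/B01">Product</a>'
    '<a href="/trap-1" style="display: none">x</a>'
    '<a href="/trap-2" hidden>y</a>'
)
print(visible_links(sample))  # ['/dp/B01']
```

For a product-data scraper, the simpler defense is often to follow no discovered links at all and request only /dp/ URLs built from a known ASIN list.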

Navigating all five layers simultaneously is the core engineering challenge of scraping Amazon at scale. If bypassing each layer independently sounds like more infrastructure than your team wants to own, Pangolinfo’s Amazon Scraper API handles all of them transparently — the next sections cover the DIY code path first, then the cost comparison.

curl_cffi: The Right Way to Scrape Amazon with Python in 2026

curl_cffi is a Python binding for curl-impersonate (a patched libcurl build), implemented with cffi rather than Cython. Its key parameter is impersonate: it lets you specify Chrome, Firefox, or Safari and get the full TLS fingerprint plus HTTP/2 SETTINGS for that browser, out of the box. The API closely mirrors requests, so migration costs almost nothing.

Installation

# curl_cffi ships with pre-compiled libcurl-impersonate binaries — no build dependencies needed
pip install curl-cffi beautifulsoup4 lxml tenacity

Single Product Scraper: Complete Working Code

"""
Amazon Product Data Scraper (curl_cffi edition)
Tested 2026 — bypasses TLS/JA4 fingerprint detection
Requirements: pip install curl-cffi beautifulsoup4 lxml tenacity
"""

import json
import random
import time
import logging
from dataclasses import dataclass, field
from typing import Optional

from curl_cffi import requests as cffi_requests
from bs4 import BeautifulSoup
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

# ── Logging ───────────────────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger(__name__)


# ── Data Model ────────────────────────────────────────────────────────────────
@dataclass
class AmazonProduct:
    asin: str
    title: Optional[str] = None
    brand: Optional[str] = None
    price: Optional[str] = None
    price_whole: Optional[str] = None
    list_price: Optional[str] = None
    rating: Optional[str] = None
    review_count: Optional[str] = None
    bullet_points: list = field(default_factory=list)
    bsr: list = field(default_factory=list)
    main_image: Optional[str] = None
    marketplace: str = "www.amazon.com"
    url: Optional[str] = None
    is_captcha: bool = False
    scraped_at: Optional[str] = None


# ── Request Header Templates ──────────────────────────────────────────────────
# curl_cffi's impersonate parameter handles TLS automatically.
# We still control HTTP-layer headers to match real browser behavior.
HEADERS_TEMPLATES = [
    {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Cache-Control": "max-age=0",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
    },
    {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-GB,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "cross-site",
    },
]

# curl_cffi supported browser impersonation targets
IMPERSONATE_TARGETS = [
    "chrome120",
    "chrome124",
    "chrome131",
]


def _get_random_headers() -> dict:
    return random.choice(HEADERS_TEMPLATES)


def _get_random_impersonate() -> str:
    return random.choice(IMPERSONATE_TARGETS)


def _is_captcha_page(soup: BeautifulSoup) -> bool:
    """
    Detect whether Amazon returned a CAPTCHA / robot-check page.
    Amazon's silent CAPTCHA returns HTTP 200 but the body is a
    verification form, not product data. Always check before parsing.
    """
    captcha_indicators = [
        soup.select_one("form[action='/errors/validateCaptcha']"),
        soup.select_one("#captchacharacters"),
        soup.find("input", {"id": "captchacharacters"}),
    ]
    if any(captcha_indicators):
        return True

    # Check page title for bot-check keywords
    title_tag = soup.find("title")
    if title_tag and title_tag.string:
        title_text = title_tag.string.lower()
        if any(kw in title_text for kw in ["robot check", "sorry!", "automated access", "api gateway"]):
            return True

    # If the page is unusually short and has no #productTitle, assume interception
    page_text = soup.get_text(strip=True)
    if len(page_text) < 2000 and not soup.select_one("#productTitle"):
        return True

    return False


def _extract_price(soup: BeautifulSoup) -> tuple[Optional[str], Optional[str]]:
    """
    Extract current price and list/strike-through price.
    Amazon distributes price data across multiple DOM nodes —
    a multi-level fallback chain is required for reliable extraction.

    Returns: (current price string, list price string)
    """
    price = None
    list_price = None

    # Method 1: New price container (dominant since 2023)
    price_block = soup.select_one(".a-price[data-a-color='price'] .a-offscreen")
    if price_block:
        price = price_block.get_text(strip=True)

    # Method 2: Legacy whole + fraction split format
    if not price:
        whole = soup.select_one("span.a-price-whole")
        fraction = soup.select_one("span.a-price-fraction")
        if whole:
            price = f"${whole.get_text(strip=True)}"
            if fraction:
                price = f"${whole.get_text(strip=True)}{fraction.get_text(strip=True)}"

    # Method 3: Deal / sale price block
    if not price:
        deal_tag = soup.select_one("#priceblock_dealprice, #priceblock_saleprice")
        if deal_tag:
            price = deal_tag.get_text(strip=True)

    # Method 4: Core price display module
    if not price:
        core_tag = soup.select_one("#corePriceDisplay_desktop_feature_div .a-offscreen")
        if core_tag:
            price = core_tag.get_text(strip=True)

    # Extract list/strike-through price
    list_price_tag = soup.select_one(".a-price[data-a-color='secondary'] .a-offscreen")
    if not list_price_tag:
        list_price_tag = soup.select_one("#listPrice, .a-text-price .a-offscreen")
    if list_price_tag:
        list_price = list_price_tag.get_text(strip=True)

    return price, list_price


def _extract_bsr(soup: BeautifulSoup) -> list:
    """
    Extract Best Sellers Rank entries.
    A product may rank simultaneously in a parent category and 1–2 sub-categories.
    """
    bsr_list = []

    bsr_section = soup.select_one("#detailBulletsWrapper_feature_div, #productDetails_feature_div")
    if bsr_section:
        for li in bsr_section.select("li, tr"):
            text = li.get_text(" ", strip=True)
            if "Best Sellers Rank" in text or "Amazon Best Sellers Rank" in text:
                bsr_list.append(text)
                break

    if not bsr_list:
        sales_rank = soup.select_one("#SalesRank")
        if sales_rank:
            bsr_list.append(sales_rank.get_text(" ", strip=True))

    return bsr_list


def _extract_images(soup: BeautifulSoup) -> Optional[str]:
    """
    Amazon's main image URLs are embedded in a JSON object inside a
    <script> tag (the colorImages variable) — far more reliable than
    parsing <img> src attributes, which are often low-resolution thumbnails.
    """
    import re
    scripts = soup.find_all("script", type="text/javascript")
    for script in scripts:
        if script.string and "'colorImages'" in script.string:
            match = re.search(r'"hiRes":"(https://[^"]+)"', script.string)
            if match:
                return match.group(1)
            match = re.search(r'"large":"(https://[^"]+)"', script.string)
            if match:
                return match.group(1)
    # Fallback: grab data-old-hires from #landingImage
    img_tag = soup.select_one("#landingImage[data-old-hires]")
    if img_tag:
        return img_tag.get("data-old-hires")
    return None


def _parse_product_page(html: str, asin: str, marketplace: str, url: str) -> AmazonProduct:
    """
    Parse product HTML into an AmazonProduct data object.
    """
    from datetime import datetime, timezone

    soup = BeautifulSoup(html, "lxml")
    product = AmazonProduct(asin=asin, marketplace=marketplace, url=url)
    product.scraped_at = datetime.now(timezone.utc).isoformat()

    # CAPTCHA check — always run first
    if _is_captcha_page(soup):
        product.is_captcha = True
        logger.warning(f"CAPTCHA detected for ASIN {asin}")
        return product

    # Title
    title_tag = soup.select_one("#productTitle")
    product.title = title_tag.get_text(strip=True) if title_tag else None

    # Brand
    brand_tag = soup.select_one("#bylineInfo, #brand")
    if brand_tag:
        brand_text = brand_tag.get_text(strip=True)
        product.brand = (
            brand_text
            .replace("Brand: ", "")
            .replace("Visit the ", "")
            .replace(" Store", "")
            .strip()
        )

    # Price
    product.price, product.list_price = _extract_price(soup)

    # Rating
    rating_tag = soup.select_one("span[data-hook='rating-out-of-text'], #acrPopover")
    if rating_tag:
        product.rating = (rating_tag.get("title") or rating_tag.get_text()).strip()

    # Review count
    review_tag = soup.select_one("#acrCustomerReviewText")
    product.review_count = review_tag.get_text(strip=True) if review_tag else None

    # Bullet points
    bullets = []
    for li in soup.select("#feature-bullets ul li:not(.aok-hidden) span.a-list-item"):
        text = li.get_text(strip=True)
        if text and len(text) > 5:
            bullets.append(text)
    product.bullet_points = bullets

    # BSR
    product.bsr = _extract_bsr(soup)

    # Main image
    product.main_image = _extract_images(soup)

    return product


# ── Core Scrape Function (with retry) ─────────────────────────────────────────
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=2, min=4, max=30),
    retry=retry_if_exception_type((Exception,)),
    reraise=True,
)
def scrape_amazon_product(
    asin: str,
    marketplace: str = "www.amazon.com",
    proxy: Optional[str] = None,
) -> AmazonProduct:
    """
    Scrape a single Amazon product page using curl_cffi.

    Args:
        asin: Amazon product identifier (e.g. 'B0CXXX')
        marketplace: Amazon domain (default US storefront)
        proxy: Proxy URL in format 'http://user:pass@host:port' (optional)

    Returns:
        AmazonProduct data object

    Raises:
        ValueError (unexpected HTTP status) or the underlying network error,
        re-raised after 3 failed attempts because the decorator sets reraise=True
    """
    url = f"https://{marketplace}/dp/{asin}"
    headers = _get_random_headers()
    impersonate = _get_random_impersonate()

    # Randomized delay: 2.5–6 seconds, to avoid a fixed machine-like request interval
    delay = random.uniform(2.5, 6.0)
    logger.info(f"Fetching ASIN {asin} | impersonate={impersonate} | delay={delay:.1f}s")
    time.sleep(delay)

    proxies = {"https": proxy, "http": proxy} if proxy else None

    response = cffi_requests.get(
        url,
        headers=headers,
        impersonate=impersonate,
        proxies=proxies,
        timeout=20,
        allow_redirects=True,
    )

    if response.status_code not in (200, 301, 302):
        raise ValueError(f"Unexpected status {response.status_code} for ASIN {asin}")

    return _parse_product_page(response.text, asin, marketplace, url)


# ── Entry Point ───────────────────────────────────────────────────────────────
if __name__ == "__main__":
    TEST_ASIN = "B0CXZXZXZX"  # Replace with a real ASIN
    PROXY = None  # e.g. "http://user:pass@proxy.example.com:8000" (placeholder host)

    product = scrape_amazon_product(TEST_ASIN, proxy=PROXY)

    if product.is_captcha:
        print("⚠️  CAPTCHA hit — rotate proxy or increase request delay")
    else:
        print(json.dumps(product.__dict__, indent=2, ensure_ascii=False))

What Does This Actually Return?

With a clean proxy and reasonable request pacing, the output looks like this:

{
  "asin": "B0CXZXZXZX",
  "title": "Example Product — Premium Edition, 500ml, 6-Pack (SGS Certified)",
  "brand": "ExampleBrand",
  "price": "$24.99",
  "list_price": "$29.99",
  "rating": "4.3 out of 5 stars",
  "review_count": "2,841 ratings",
  "bullet_points": [
    "SGS-certified food-grade material, meets US FDA standards",
    "Dishwasher safe, BPA-free, lead-free, cadmium-free",
    "Fits standard car cup holders (3.5-inch diameter), keeps drinks hot/cold 12 hours",
    "18-month manufacturer warranty, no-questions-asked replacement",
    "Ships from US warehouse, Prime next-day delivery"
  ],
  "bsr": ["#1,243 in Kitchen & Dining (#18 in Water Bottles)"],
  "main_image": "https://m.media-amazon.com/images/I/61xxxx.jpg",
  "marketplace": "www.amazon.com",
  "url": "https://www.amazon.com/dp/B0CXZXZXZX",
  "is_captcha": false,
  "scraped_at": "2026-06-10T01:00:00+00:00"
}
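
Every field in this output is a display string. A minimal sketch of normalizing the strings into typed values follows, assuming US-format text like the sample above (comma as thousands separator, dot as decimal point). The function names are illustrative and not part of the scraper.

```python
import re
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[float]:
    """'$24.99' -> 24.99, '$1,299.00' -> 1299.0 (US number format assumed)."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    return float(match.group().replace(",", "")) if match else None


def parse_rating(text: Optional[str]) -> Optional[float]:
    """'4.3 out of 5 stars' -> 4.3 (takes the first number in the string)."""
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    return float(match.group()) if match else None


def parse_count(text: Optional[str]) -> Optional[int]:
    """'2,841 ratings' -> 2841"""
    match = re.search(r"\d[\d,]*", text or "")
    return int(match.group().replace(",", "")) if match else None


def parse_bsr(text: str) -> list[dict]:
    """Split a BSR string into one {'rank', 'category'} dict per '#N in Category' pair."""
    pairs = re.findall(r"#([\d,]+)\s+in\s+([^(#]+)", text)
    return [
        {"rank": int(rank.replace(",", "")), "category": category.strip(" )")}
        for rank, category in pairs
    ]


print(parse_price("$24.99"), parse_rating("4.3 out of 5 stars"), parse_count("2,841 ratings"))
print(parse_bsr("#1,243 in Kitchen & Dining (#18 in Water Bottles)"))
```

Returning None for missing or unparseable input keeps a failed selector visible in the dataset instead of turning it into a misleading zero.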

If you want to skip straight to structured JSON without writing any parsing logic, the Pangolinfo Amazon Scraper API returns an identical field set as typed data — prices as floats, BSR as a structured array, images separated by role. Full field reference and response schema are documented in the Amazon Product endpoint docs.

Batch Scraping: Async + Proxy Rotation for Production

Single-product scraping works. Now the real challenge: scale. curl_cffi supports async via AsyncSession, and combined with proxy rotation, you can push throughput significantly higher without triggering Amazon’s rate limits, provided you keep request pacing reasonable.

Figure: Production Amazon Python scraper architecture: ASIN task queue, asyncio rate limiter, curl_cffi AsyncSession with a rotating proxy pool, and tenacity auto-retry, four layers working together.

"""
Amazon Batch Async Scraper (Production Grade)
curl_cffi AsyncSession + asyncio Semaphore rate limiting + proxy rotation + CSV output
Requirements: pip install curl-cffi beautifulsoup4 lxml tenacity
"""

import asyncio
import csv
import json
import random
import logging
from dataclasses import asdict
from typing import Optional

from curl_cffi.requests import AsyncSession

# Reuse AmazonProduct, _parse_product_page, HEADERS_TEMPLATES from above

logger = logging.getLogger(__name__)


class ProxyRotator:
    """
    Simple proxy rotator with random selection and failure marking.
    In production, prefer a rotating endpoint from Bright Data or Smartproxy
    over maintaining a static IP list manually.
    """

    def __init__(self, proxy_list: list[str]):
        self._proxies = proxy_list
        self._failed: set = set()

    def get(self) -> Optional[str]:
        available = [p for p in self._proxies if p not in self._failed]
        if not available:
            logger.warning("All proxies exhausted, falling back to direct connection (high risk)")
            return None
        return random.choice(available)

    def mark_failed(self, proxy: str):
        self._failed.add(proxy)
        logger.warning(f"Proxy marked failed: {proxy[:30]}...")


async def fetch_one(
    session: AsyncSession,
    asin: str,
    proxy_rotator: ProxyRotator,
    semaphore: asyncio.Semaphore,
    marketplace: str = "www.amazon.com",
) -> dict:
    """
    Fetch a single ASIN under Semaphore concurrency control.
    """
    url = f"https://{marketplace}/dp/{asin}"
    proxy = proxy_rotator.get()
    impersonate = random.choice(["chrome120", "chrome124", "chrome131"])
    headers = random.choice(HEADERS_TEMPLATES)

    async with semaphore:
        # Human-like delay: 3–8 seconds between requests
        await asyncio.sleep(random.uniform(3.0, 8.0))

        try:
            proxies = {"https": proxy, "http": proxy} if proxy else None
            response = await session.get(
                url,
                headers=headers,
                impersonate=impersonate,
                proxies=proxies,
                timeout=25,
            )

            if response.status_code == 503:
                if proxy:
                    proxy_rotator.mark_failed(proxy)
                raise ValueError(f"503 for {asin} — proxy rotated")

            product = _parse_product_page(response.text, asin, marketplace, url)
            status = "captcha" if product.is_captcha else "success"
            logger.info(f"[{status.upper()}] ASIN {asin}")
            return asdict(product)

        except Exception as e:
            logger.error(f"[FAILED] ASIN {asin}: {e}")
            return {"asin": asin, "error": str(e)}


async def scrape_batch(
    asins: list[str],
    proxy_list: list[str],
    output_file: str = "amazon_products.json",
    max_concurrency: int = 3,
    marketplace: str = "www.amazon.com",
):
    """
    Batch async scrape entry point.

    max_concurrency=3 is a conservative setting for residential proxies.
    With a high-quality rotating proxy service, you can push to 5–8,
    but exceeding that tends to increase CAPTCHA rates significantly.
    """
    proxy_rotator = ProxyRotator(proxy_list)
    semaphore = asyncio.Semaphore(max_concurrency)

    async with AsyncSession() as session:
        tasks = [
            fetch_one(session, asin, proxy_rotator, semaphore, marketplace)
            for asin in asins
        ]
        results = await asyncio.gather(*tasks, return_exceptions=False)

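    # Create the output directory if it does not exist yet (e.g. "output/")
    import os
    out_dir = os.path.dirname(output_file)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    # Write JSON output
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    # Write CSV output (for Excel / database import)
    # Rows differ in shape: failed fetches only carry "asin" and "error",
    # so build the header from the union of keys across all rows.
    csv_file = os.path.splitext(output_file)[0] + ".csv"
    if results:
        fieldnames = []
        for row in results:
            for key in row:
                if key != "bullet_points" and key not in fieldnames:
                    fieldnames.append(key)
        with open(csv_file, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames + ["bullet_points_str"], restval="")
            writer.writeheader()
            for row in results:
                row_copy = {k: v for k, v in row.items() if k != "bullet_points"}
                row_copy["bullet_points_str"] = " | ".join(row.get("bullet_points", []))
                writer.writerow(row_copy)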

    success = sum(1 for r in results if not r.get("error") and not r.get("is_captcha"))
    captcha = sum(1 for r in results if r.get("is_captcha"))
    failed = sum(1 for r in results if r.get("error"))
    logger.info(f"Done: {success} success, {captcha} CAPTCHA, {failed} failed | Output: {output_file}")
    return results


# ── Usage Example ─────────────────────────────────────────────────────────────
if __name__ == "__main__":
    ASINS_TO_SCRAPE = [
        "B0CXZ1", "B0CXZ2", "B0CXZ3", "B0CXZ4", "B0CXZ5",
    ]

    # Placeholder credentials and hosts: replace with your provider's endpoints
    PROXY_LIST = [
        "http://user1:pass1@proxy1.example.com:8001",
        "http://user2:pass2@proxy2.example.com:8002",
        "http://user3:pass3@proxy3.example.com:8003",
    ]

    asyncio.run(
        scrape_batch(
            asins=ASINS_TO_SCRAPE,
            proxy_list=PROXY_LIST,
            output_file="output/amazon_products.json",
            max_concurrency=3,
        )
    )
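
Note that fetch_one above returns an error dict on failure rather than retrying. One way to add retries without changing it is a small result-based wrapper with exponential backoff and jitter. This is a sketch under assumptions: the wrapper name, the retry condition (an "error" key or a truthy "is_captcha"), and the delay values are all illustrative choices, not part of the original scraper.

```python
import asyncio
import random


def needs_retry(result: dict) -> bool:
    """A result is retryable if the fetch failed or hit a CAPTCHA page."""
    return bool(result.get("error") or result.get("is_captcha"))


async def fetch_with_retries(fetch, attempts=3, base_delay=2.0, max_delay=30.0,
                             sleep=asyncio.sleep):
    """
    Call the zero-argument coroutine function `fetch` until its result is usable.
    Waits base_delay * 2^(attempt-1) seconds (capped, with jitter) between attempts
    and returns the last result if every attempt fails.
    """
    result = {}
    for attempt in range(1, attempts + 1):
        result = await fetch()
        if not needs_retry(result) or attempt == attempts:
            return result
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        await sleep(delay * random.uniform(0.5, 1.5))
    return result
```

In scrape_batch, each task would become fetch_with_retries(lambda: fetch_one(session, asin, proxy_rotator, semaphore, marketplace)), with asin bound per task. Because fetch_one picks a proxy on every call, each retry also goes out through a freshly chosen proxy.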

Realistic Success Rate Expectations

Actual numbers vary significantly by proxy quality. Using a tier-1 residential proxy service (Bright Data, Smartproxy) with curl_cffi Chrome fingerprinting, small-batch scraping (1,000–3,000 pages/day) can sustain 80–90% success rates. Push that to 10,000+ pages/day and success typically drops to 60–75%, with CAPTCHA hit rates climbing noticeably and proxy costs climbing in parallel. This isn’t a code quality problem — it’s Amazon’s cross-IP behavioral correlation kicking in at scale.

Figure: Scraper success rate vs. scale, curl_cffi vs. requests. Even curl_cffi + residential proxies see success rates degrade at scale; beyond ~10,000 pages/day is a rational inflection point for switching to a managed API.

At the scale where DIY success rates start degrading, Pangolinfo’s Amazon Scraper API maintains 99%+ delivery across all volume tiers — the managed infrastructure absorbs the fingerprinting and CAPTCHA load invisibly. The next section breaks down exactly where the cost crossover happens.

The Real Cost of Scaling a DIY Amazon Scraper

The technical hurdles are only part of the picture. Once you understand the mechanics, the next question is whether running your own scraper at scale actually makes financial sense.

Selector Maintenance: The Invisible Weekly Tax

Amazon continuously A/B tests its product pages — sometimes for conversion rate optimization, sometimes for internal engineering refactors. A selector that extracts price cleanly in January may fail silently by April because a new price container structure rolled out to 30% of traffic. The code above uses a 4-level fallback chain for price extraction — that’s not over-engineering, it’s the minimum viable approach for a production scraper. In reality, handling all price variants (standard price, Deal price, Subscribe & Save, B2B pricing, out-of-stock state) typically requires 10–15 selectors, and at least one of them will break every 1–3 months. Engineering teams maintaining mid-scale Amazon scrapers consistently report 2–4 hours per week consumed by selector maintenance alone — not building new features, just keeping the existing scraper functional.
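
One way to catch silent selector breakage early is to track, per batch, how often each field actually comes back populated. A minimal sketch, assuming records are dicts shaped like the scraper’s asdict(product) output; the function names and the 0.9 threshold are assumptions chosen for illustration.

```python
def field_fill_rates(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Share of successfully fetched records in which each field is non-empty."""
    # Failed fetches and CAPTCHA pages say nothing about selector health, so skip them.
    usable = [r for r in records if not r.get("error") and not r.get("is_captcha")]
    if not usable:
        return {name: 0.0 for name in fields}
    return {
        name: sum(1 for r in usable if r.get(name)) / len(usable)
        for name in fields
    }


def broken_fields(records: list[dict], fields: list[str], min_rate: float = 0.9) -> list[str]:
    """Fields whose fill rate fell below min_rate, which suggests a stale selector."""
    rates = field_fill_rates(records, fields)
    return sorted(name for name, rate in rates.items() if rate < min_rate)


batch = [
    {"asin": "A1", "title": "Bottle", "price": "$24.99"},
    {"asin": "A2", "title": "Mug", "price": None},
    {"asin": "A3", "title": "Cup", "price": None},
    {"asin": "A4", "error": "503"},
]
print(broken_fields(batch, ["title", "price"]))  # ['price']
```

Running a check like this after every batch turns a selector that quietly started returning None into an alert the same day, rather than a gap discovered weeks later in the dataset.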

Proxy Costs: The Linear Overhead That Scales With You

Data center proxies ($0.5–2/GB) are effectively worthless against Amazon: a proxy does not change your TLS fingerprint, and data center IP ranges are flagged on IP reputation alone. Only residential proxies ($5–15/GB) reliably pass the IP reputation check. At roughly 150–200KB per compressed product page:

  • 1,000 pages/day: ~150–200MB → proxy cost ~$0.75–3/day
  • 10,000 pages/day: ~1.5–2GB → proxy cost ~$7.50–30/day ($225–900/month)
  • 100,000 pages/day: ~15–20GB → proxy cost ~$75–300/day ($2,250–9,000/month)

This excludes CAPTCHA solving fees (third-party services typically run $1–3 per 1,000 CAPTCHAs), compute costs, and the fully-loaded engineering time for incident response when Amazon pushes an update that breaks your selectors overnight.
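
The proxy arithmetic above can be reproduced, and extended to account for failed requests, with a few lines of Python. This is a rough sketch: the function name is illustrative, and it assumes decimal gigabytes and that a failed request transfers as much data as a successful one, which makes the retry-adjusted figure an upper-bound estimate.

```python
def monthly_proxy_cost(pages_per_day, kb_per_page, usd_per_gb, success_rate=1.0, days=30):
    """Estimated monthly residential-proxy spend in USD (1 GB = 1,000,000 KB)."""
    # Failed requests still consume bandwidth, so divide by the success rate.
    requests_per_day = pages_per_day / success_rate
    gb_per_day = requests_per_day * kb_per_page / 1_000_000
    return gb_per_day * usd_per_gb * days


low = monthly_proxy_cost(10_000, kb_per_page=150, usd_per_gb=5)
high = monthly_proxy_cost(10_000, kb_per_page=200, usd_per_gb=15)
print(low, high)  # 225.0 900.0

# Same high-end scenario at a 60% success rate
print(round(monthly_proxy_cost(10_000, 200, 15, success_rate=0.6), 2))
```

At a 60% success rate the high-end estimate for 10,000 pages/day rises from $900 to roughly $1,500 per month, which is why falling success rates and rising proxy bills tend to arrive together.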

Multi-Marketplace Coverage: Not a Translation, a Rewrite

If you need data from amazon.co.uk, amazon.de, amazon.co.jp, and amazon.ca alongside the US storefront, the work grows faster than the marketplace count suggests. Japan’s price field class names differ from the US structure. Germany’s product detail modules rely on JavaScript-rendered components that vary from the US pattern. The UK’s BSR taxonomy is entirely different from the US one. Each marketplace requires an independent selector set with its own maintenance cycle. A scraper that covers five marketplaces is really five scrapers with shared infrastructure.
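
Selectors are only one part of the per-marketplace work; request headers and number formats differ too. The sketch below shows one possible shape for a per-marketplace profile with locale-aware price parsing. The profile structure and names are assumptions for illustration, and the decimal separators reflect common local display conventions rather than verified Amazon markup. Per-marketplace selectors are deliberately left out because they have to be discovered and maintained for each site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketplaceProfile:
    domain: str
    accept_language: str
    currency: str
    decimal_sep: str  # character between whole and fractional units in displayed prices


PROFILES = {
    "US": MarketplaceProfile("www.amazon.com", "en-US,en;q=0.9", "USD", "."),
    "UK": MarketplaceProfile("www.amazon.co.uk", "en-GB,en;q=0.9", "GBP", "."),
    "DE": MarketplaceProfile("www.amazon.de", "de-DE,de;q=0.9", "EUR", ","),
    "JP": MarketplaceProfile("www.amazon.co.jp", "ja-JP,ja;q=0.9", "JPY", "."),
    "CA": MarketplaceProfile("www.amazon.ca", "en-CA,en;q=0.9", "CAD", "."),
}


def parse_local_price(text: str, profile: MarketplaceProfile) -> float:
    """Keep ASCII digits and the local decimal separator, then convert to float.

    Raises ValueError if the text contains no digits.
    """
    kept = "".join(
        ch for ch in text
        if (ch.isascii() and ch.isdigit()) or ch == profile.decimal_sep
    )
    return float(kept.replace(profile.decimal_sep, "."))


print(parse_local_price("$1,299.00", PROFILES["US"]))   # 1299.0
print(parse_local_price("1.299,00 €", PROFILES["DE"]))  # 1299.0
```

Feeding a German price string through US parsing rules would silently turn "1.299,00" into 1.299, which is the kind of error a shared, single-marketplace parser introduces.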

By contrast, the Pangolinfo Amazon Scraper API covers 10+ marketplaces with a single marketplace parameter — no separate selector sets, no independent maintenance cycles. Supported marketplace codes and field availability per region are listed in the marketplace coverage docs.

When DIY Stops Making Sense: Pangolinfo Amazon Scraper API

A managed Amazon Scraper API isn’t about skipping code — it’s about moving the anti-bot infrastructure maintenance burden entirely off your engineering team so your code can focus on what to do with the data rather than how to get it past Amazon’s defenses. The same product data, with Pangolinfo’s API:

"""
Pangolinfo Amazon Scraper API — Python Example
No proxy management. No CAPTCHA handling. No selector maintenance.
"""

import requests
import json
from typing import Optional


API_KEY = "your_pangolinfo_api_key"
BASE_URL = "https://api.pangolinfo.com/v1"


def get_amazon_product(
    asin: str,
    marketplace: str = "US",
    include_fields: Optional[str] = None,
) -> dict:
    """
    Retrieve Amazon product data via Pangolinfo API.

    Args:
        asin: Amazon product identifier
        marketplace: Storefront code (US, UK, DE, JP, CA, FR, IT, ES, etc.)
        include_fields: Comma-separated field names to return (omit for all fields)

    Returns:
        Fully structured product data dict
    """
    params = {
        "asin": asin,
        "marketplace": marketplace,
    }
    if include_fields:
        params["include_fields"] = include_fields

    response = requests.get(
        f"{BASE_URL}/amazon/product",
        params=params,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


# ── Async Batch Version ───────────────────────────────────────────────────────
import asyncio
import aiohttp


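async def get_amazon_product_async(
    session: aiohttp.ClientSession,
    asin: str,
    marketplace: str = "US",
) -> dict:
    """Async counterpart of get_amazon_product; reuses the caller's session."""
    url = f"{BASE_URL}/amazon/product"
    params = {"asin": asin, "marketplace": marketplace}
    headers = {"Authorization": f"Bearer {API_KEY}"}
    # Same 30-second budget as the synchronous version
    timeout = aiohttp.ClientTimeout(total=30)

    async with session.get(url, params=params, headers=headers, timeout=timeout) as resp:
        # Surface 4xx/5xx as exceptions instead of parsing an error body as data
        resp.raise_for_status()
        return await resp.json()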


async def batch_fetch(asins: list[str], marketplace: str = "US") -> list[dict]:
    async with aiohttp.ClientSession() as session:
        tasks = [get_amazon_product_async(session, asin, marketplace) for asin in asins]
        return await asyncio.gather(*tasks)


# ── Usage ─────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Single product
    result = get_amazon_product("B0CXZXZXZX", marketplace="US")
    print(json.dumps(result, indent=2, ensure_ascii=False))

    # Batch (5 ASINs, concurrent)
    # Placeholder IDs for illustration; real ASINs are 10 characters long
    asins = ["B0001", "B0002", "B0003", "B0004", "B0005"]
    results = asyncio.run(batch_fetch(asins))
    print(f"Fetched {len(results)} products")
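
Because the storefront is just a parameter, covering several marketplaces can be a loop over codes rather than a set of separate scrapers. The sketch below is illustrative only: `fetch_across_marketplaces` and its `fetch` argument are hypothetical names, and `fetch` is assumed to be any callable with the same signature as `get_amazon_product` above.

```python
from typing import Callable, Dict, Iterable


def fetch_across_marketplaces(
    asin: str,
    marketplaces: Iterable[str],
    fetch: Callable[..., dict],
) -> Dict[str, dict]:
    """Call fetch(asin, marketplace=code) once per storefront code.

    A failure in one marketplace is recorded in the result instead of
    aborting the remaining lookups.
    """
    results: Dict[str, dict] = {}
    for code in marketplaces:
        try:
            results[code] = fetch(asin, marketplace=code)
        except Exception as exc:  # e.g. an HTTP error raised by raise_for_status()
            results[code] = {"success": False, "error": str(exc)}
    return results
```

A call such as `fetch_across_marketplaces("B0CXZXZXZX", ["US", "UK", "DE"], get_amazon_product)` would return one entry per storefront code, so a product that is not listed in one region does not hide the data from the others.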

API Response Structure

The response is fully typed structured JSON — prices parsed to numeric types, BSR split into structured arrays, images separated by role — no data cleaning logic required on your side:

{
  "success": true,
  "data": {
    "asin": "B0CXZXZXZX",
    "title": "Example Product — Premium Edition, 500ml, 6-Pack",
    "brand": "ExampleBrand",
    "price": {
      "current": 24.99,
      "currency": "USD",
      "list_price": 29.99,
      "deal_price": null,
      "coupon": "Extra 5% off with coupon"
    },
    "rating": 4.3,
    "review_count": 2841,
    "bullet_points": [
      "SGS-certified food-grade material",
      "Dishwasher safe, BPA-free"
    ],
    "bsr": [
      {"rank": 1243, "category": "Kitchen & Dining"},
      {"rank": 18, "category": "Water Bottles"}
    ],
    "images": {
      "main": "https://m.media-amazon.com/images/I/xxx.jpg",
      "gallery": [
        "https://m.media-amazon.com/images/I/yyy.jpg",
        "https://m.media-amazon.com/images/I/zzz.jpg"
      ]
    },
    "dimensions": {
      "length_inches": 9.8,
      "width_inches": 3.5,
      "height_inches": 3.5,
      "weight_pounds": 0.88
    },
    "availability": "In Stock",
    "is_prime": true,
    "marketplace": "US",
    "scraped_at": "2026-06-10T01:00:00Z"
  },
  "credits_used": 1,
  "credits_remaining": 9999
}
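
Since prices arrive as numbers and BSR as a list of objects, downstream code can compute on the payload directly. The following is a minimal sketch that assumes the response shape shown above; `summarize_product` is a hypothetical helper, not part of any Pangolinfo SDK.

```python
from typing import Optional


def summarize_product(response: dict) -> dict:
    """Derive a few analysis-ready values from a product response."""
    if not response.get("success"):
        raise ValueError("API response did not report success")

    data = response["data"]
    price = data.get("price", {})
    current: Optional[float] = price.get("current")
    list_price: Optional[float] = price.get("list_price")

    # Discount relative to list price, as a percentage with one decimal place
    discount_pct: Optional[float] = None
    if current is not None and list_price:
        discount_pct = round((list_price - current) / list_price * 100, 1)

    # The entry with the lowest rank number is the product's strongest category
    best_bsr = min(data.get("bsr", []), key=lambda entry: entry["rank"], default=None)

    return {
        "asin": data["asin"],
        "current_price": current,
        "discount_pct": discount_pct,
        "best_rank": best_bsr["rank"] if best_bsr else None,
        "best_rank_category": best_bsr["category"] if best_bsr else None,
        "image_count": 1 + len(data.get("images", {}).get("gallery", [])),
    }
```

Using `.get()` with defaults keeps the helper tolerant of fields that may be absent for some products or regions.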

DIY Python Scraper vs. Pangolinfo Amazon Scraper API: Full Comparison

| Dimension | DIY Python (curl_cffi + Residential Proxy) | Pangolinfo Amazon Scraper API |
| --- | --- | --- |
| Time to first data | 2–6 hours (proxy setup + debugging) | 15 minutes (get API key, make first call) |
| Small-scale success rate (<1K/day) | 80–90% (curl_cffi + residential proxy) | 99%+ |
| Scale-up success rate (10K+/day) | 60–75% (costs rise sharply) | 99%+ (managed infrastructure) |
| CAPTCHA handling | Manual detection + third-party solver ($1–3/1K) | Built-in, transparent to caller |
| Proxy cost (10K pages/day) | $7.50–30/day (residential) | Included in API pricing |
| Selector maintenance | 2–4 hours/week of engineering time | Zero (API adapts to page changes automatically) |
| Multi-marketplace | Separate selector set per marketplace | Single parameter switches between 10+ marketplaces |
| Data structure | Manual cleaning, inconsistent types | Fully typed JSON, uniform schema |
| All-in cost (10K pages/day) | $40–80+ (proxies + CAPTCHA + eng. time) | ~$15.40 ($1.54 per 1,000 requests) |
| Scale ceiling | Effectively tens of thousands/day before instability | Hundreds of millions/day |
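
The cost rows reduce to simple arithmetic, which makes them easy to rerun for your own volume. This sketch uses only the figures quoted in the comparison ($1.54 per 1,000 requests, $40–80 per day for DIY at 10,000 pages/day) and assumes flat per-request pricing with no volume tiers; the function names are illustrative.

```python
def api_cost_per_day(pages_per_day: int, price_per_1k: float = 1.54) -> float:
    """Daily API spend, assuming a flat price per 1,000 requests."""
    return round(pages_per_day / 1000 * price_per_1k, 2)


def daily_savings_range(
    pages_per_day: int,
    diy_low: float,
    diy_high: float,
    price_per_1k: float = 1.54,
) -> tuple:
    """(min, max) daily savings of the API versus a DIY all-in cost range."""
    api = api_cost_per_day(pages_per_day, price_per_1k)
    return (round(diy_low - api, 2), round(diy_high - api, 2))


if __name__ == "__main__":
    # 10,000 pages/day with the DIY range taken from the comparison above
    print(api_cost_per_day(10_000))
    print(daily_savings_range(10_000, diy_low=40.0, diy_high=80.0))
```

The DIY range is passed in rather than hard-coded because it is only quoted for 10,000 pages/day; at other volumes you would substitute your own measured proxy, CAPTCHA, and engineering costs.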

Pangolinfo’s Amazon Scraper API covers six endpoint types: product detail, search results, reviews, Best Sellers rankings, category pages, and seller profiles — one integration covers the entire Amazon data pipeline. For teams that need to monitor competitor reviews specifically, the Amazon Reviews Scraper API returns structured review data with filters for star rating, date range, and geography, eliminating the need to maintain a pagination-aware review scraper separately.

Frequently Asked Questions

Does Python scraping still work on Amazon in 2026?

Yes, but the difficulty has increased substantially compared to 2–3 years ago. Amazon rolled out TLS fingerprint detection across 2024–2025, meaning the standard requests library gets identified at the handshake stage — before any product HTML is served. The approach that actually works in 2026 is curl_cffi: it uses libcurl to replicate Chrome’s JA3/JA4 TLS fingerprint, making your Python requests indistinguishable from real browser traffic at the network layer. Rotating residential proxies and proper request pacing remain essential even with fingerprint spoofing.

Why does requests return a 503 or empty page on Amazon?

Amazon’s Bot Manager (HUMAN Security) performs its first filtering pass at the TLS handshake, before any HTTP content is exchanged. Python requests builds its connection with urllib3 + OpenSSL, producing a ClientHello whose cipher suite ordering and extension list match no real browser. The request is flagged before any product HTML is served, so what comes back is either a 503 or a 200 whose body is a CAPTCHA page rather than the listing. That is why even a clean residential IP fails: the problem is at the TLS layer, not the IP layer.
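
Because a block can arrive as a 200, checking the status code alone is not enough. A minimal sketch of a response classifier follows; the marker strings are assumptions based on text commonly seen on Amazon's CAPTCHA interstitial and should be verified against the pages you actually receive, and `classify_response` is a hypothetical helper name.

```python
# Assumed markers of a CAPTCHA interstitial; confirm against real responses
CAPTCHA_MARKERS = (
    "/errors/validateCaptcha",
    "Enter the characters you see below",
)


def classify_response(status_code: int, body: str) -> str:
    """Label a response as 'ok', 'captcha', 'empty', 'blocked', or 'error'."""
    if status_code == 503:
        return "blocked"
    if status_code != 200:
        return "error"
    if any(marker in body for marker in CAPTCHA_MARKERS):
        return "captcha"  # a 200 whose body is a challenge page, not the listing
    if not body.strip():
        return "empty"
    return "ok"
```

Routing on these labels lets a scraper retry or rotate on "blocked" and "captcha" while treating other errors separately, instead of silently parsing a challenge page as product data.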

What is the core difference between curl_cffi and requests?

requests builds TLS connections through urllib3 + OpenSSL, producing a fixed Python-shaped ClientHello fingerprint that modern anti-bot systems recognize immediately. curl_cffi binds to libcurl (via the curl-impersonate fork) and lets you specify Chrome, Firefox, or Safari impersonation, replicating the full TLS fingerprint (JA3, JA4) and HTTP/2 SETTINGS frames. The two libraries have nearly identical APIs, so migration costs almost nothing, while the difference in Amazon success rates is orders of magnitude.
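
To illustrate how small the migration is, here is a hedged sketch of a fetch helper built on curl_cffi's requests-compatible interface. The `impersonate` argument is curl_cffi's own; `fetch_html` and its `get` parameter are hypothetical, with `get` added only so the function can be exercised without network access or the library installed.

```python
from typing import Callable, Optional


def fetch_html(
    url: str,
    impersonate: str = "chrome",
    timeout: float = 30,
    get: Optional[Callable] = None,
) -> str:
    """Fetch a page while presenting a browser-like TLS fingerprint."""
    if get is None:
        # Imported lazily so this module loads even where curl_cffi is absent
        from curl_cffi import requests as cffi_requests

        get = cffi_requests.get

    # Apart from `impersonate`, this is the same call shape as requests.get()
    response = get(url, impersonate=impersonate, timeout=timeout)
    response.raise_for_status()
    return response.text
```

In existing requests-based code, the equivalent change is typically swapping the import and adding the `impersonate` keyword; proxies and pacing still need to be handled separately.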

What data fields can I scrape from an Amazon product page?

A standard ASIN page exposes: product title, brand, price (current, list, deal, subscribe & save), star rating, review count, ASIN, main and gallery image URLs, bullet point features, product description and A+ content, Best Sellers Rank (parent and sub-category), dimensions and weight, availability and shipping status (Prime badge, estimated delivery). Using Pangolinfo’s Amazon Scraper API returns all of these as typed, structured JSON with no parsing logic required.

When should I switch from a DIY scraper to an Amazon Scraper API?

Several clear signals: daily volume exceeds 5,000 product pages and proxy costs start compounding; selector maintenance consumes more than 2 hours of engineering time per week; CAPTCHA failure rates create unacceptable data gaps; or you need coverage across multiple Amazon marketplaces. Pangolinfo’s Amazon Scraper API starts at $1.54 per 1,000 requests — 10,000 pages/day costs ~$15.40, while a self-managed setup at the same scale runs $40–80/day fully loaded. First 100 requests are free, no credit card required.
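
These signals can be turned into a checklist you run against your own scraper metrics. The sketch below encodes the thresholds stated above (5,000 pages/day, 2 hours/week, more than one marketplace); the CAPTCHA failure threshold of 5% is purely an assumed placeholder for "unacceptable", and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScraperStats:
    pages_per_day: int
    selector_maintenance_hours_per_week: float
    captcha_failure_rate: float  # fraction of requests lost to CAPTCHAs, 0.0 to 1.0
    marketplaces: int


def switch_signals(stats: ScraperStats, max_captcha_failure_rate: float = 0.05) -> List[str]:
    """Return the names of the switch signals that currently apply."""
    signals: List[str] = []
    if stats.pages_per_day > 5000:
        signals.append("daily volume above 5,000 pages")
    if stats.selector_maintenance_hours_per_week > 2:
        signals.append("selector maintenance above 2 hours/week")
    if stats.captcha_failure_rate > max_captcha_failure_rate:
        signals.append("CAPTCHA failure rate above threshold")  # threshold is an assumption
    if stats.marketplaces > 1:
        signals.append("multiple marketplaces required")
    return signals
```

An empty list suggests the DIY setup is still within its comfortable range; any non-empty result is a prompt to rerun the cost comparison with your real numbers.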

Next Steps: Pick the Right Tool and Stop Pouring Engineering Into the Wrong Layer

The right starting point for scraping Amazon product data with Python in 2026 is curl_cffi, not requests. TLS fingerprint detection has made the latter almost completely ineffective against Amazon — this isn’t a configuration problem, it’s a fundamental technical layer mismatch. The code in this guide — curl_cffi impersonation, multi-selector fallback chains, silent CAPTCHA detection, async batch scraping, proxy rotation — is a battle-tested starting point that works at moderate scale out of the box.

At the same time, this guide doesn’t paper over the other reality: once you’re past 5,000–10,000 pages/day, proxy costs, selector maintenance overhead, and CAPTCHA failure rates turn the economics ugly fast. Pangolinfo’s Amazon Scraper API isn’t just “scraping as a service” — it moves the entire anti-bot infrastructure maintenance burden off your team. Your engineering capacity should go toward data analysis and business logic, not chasing Amazon’s weekly page structure changes.

Ready to get Amazon product data without maintaining a scraper? Explore the Pangolinfo Amazon Scraper API →

Or check the API Documentation for complete Python, Node.js, and cURL code examples with full endpoint schemas.
