No fluff. This article walks through how to build a stable, deep and extensible data pipeline across platforms—canonical schema, API design, anti-bot & observability, ETL & analytics, common pitfalls and code you can actually run.
Contents
- Why a single multi-platform API
- Supported platforms and key differences
- Canonical schema: Product/Listing/Offer/Seller/Review/AdSlot
- API design: endpoints, params, paging, idempotency and caching
- Anti-bot, observability and data quality
- ETL and analytics: normalization, variant aggregation, FX conversion, brand/category mapping
- Code examples
- Metrics and dashboards
- Case study: same air fryer across three platforms
- FAQ and compliance
1. Why a single multi-platform API
Cross-platform operations are noisy: different structures, different rhythms. Scripts break; manual work doesn’t scale. We want one pipeline that’s uniform, controllable, extensible and rollback-friendly.
- Uniform: one API handles search/listing/detail/reviews/ad slots across platforms.
- Controllable: rate limiting, retries, caching, degradation and canary in config, not hardcoded.
- Extensible: adding a platform means adding mappings and an adapter, not rewriting the downstream.
- Rollback-friendly: versioned adapters and quick rollback when signals drift.
2. Supported platforms & differences
- Amazon: ASIN, complex variants, BuyBox & Sponsored signals, deep categories.
- Walmart: itemId/SKU, mixed 1P/3P, list vs detail field differences, ads vary.
- Shopee: composite item_id + shop_id, price ranges & promo, shop-level signals matter.
- Lazada: SPU/SKU, heavy promos/coupons, category trees differ by site.
- Tokopedia: shop-rich signals, region/logistics affect exposure, review pagination quirks.
- eBay: Item ID, auction vs buy-now, clear Listing vs Offer boundary.
- AliExpress: frequent platform promos, translation impacts matching, review geo distribution useful.
- Mercado Libre: LATAM, multiple currencies & taxonomies, seller reputation model differs.
- Etsy: handmade/custom, attributes irregular, images/copy more critical.
- Rakuten/Flipkart: strong localization, search/sort differs from US/EU platforms.
- Temu: fast-changing page structures; heavy promo-driven low-price strategy; adapters need cautious canary rollouts.
- Jumia: Africa regions; logistics/stock volatility strongly impact conversion; category mapping needs local nuance.
- Noon: Middle East; Arabic/English bilingual; currencies and taxes must be cleanly modeled.
- Daraz: South Asia; promo-heavy with shop-weight; review logic similar to SEA but details differ.
- Kaufland/Otto/Allegro: DE/PL local platforms; EAN/GTIN matching matters; descriptions are structured but badges vary.
- Bol.com: NL local; brand aggregation pages common; mind merged listings across sellers (identify and split correctly).
Strategy: ship Amazon/Walmart/Shopee first; then expand regionally—SEA (Lazada/Tokopedia), US/EU (eBay/AliExpress), LATAM (Mercado Libre).
3. Canonical schema
- Product: cross-platform product dimension (title/brand/images/attributes/gtin).
- Listing: platform exposure & ranking (platform, category_path, rank_position, badges, is_sponsored).
- Offer: price/stock snapshots (price, currency, original_price, availability, stock, timestamp).
- Seller: shop/seller info (seller_id, seller_name, rating, store_metrics).
- Review: rating & review counts, recency, breakdown.
- AdSlot: ad slots and signals (slot_type, confidence, evidence text/icon/aria/css/context).
Keep fast-changing signals (variants, prices, ad slots) out of Product; store them as separate entities for history and comparability.
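To make the split concrete, here is a minimal Python sketch of three of the entities. The field names follow the lists above; the types, the defaults and the sample values (including the brand "Acme") are assumptions for illustration, not a fixed contract.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Product:
    """Slow-changing, cross-platform product dimension."""
    global_id: str
    title: str
    brand: Optional[str] = None
    gtin: Optional[str] = None
    images: list[str] = field(default_factory=list)
    attributes: dict[str, object] = field(default_factory=dict)


@dataclass
class Listing:
    """Where and how the product appeared on a platform at one point in time."""
    global_id: str
    platform: str
    category_path: list[str]
    rank_position: int
    is_sponsored: bool = False
    badges: list[str] = field(default_factory=list)


@dataclass
class Offer:
    """One price/stock snapshot; append a new row instead of updating in place."""
    global_id: str
    price: float
    currency: str
    availability: str
    timestamp: datetime
    original_price: Optional[float] = None
    stock: Optional[int] = None


# Sample data (made up): one product, two price snapshots a week apart.
product = Product(global_id="walmart:itemId:123456", title="Air Fryer Pro", brand="Acme")
history = [
    Offer(product.global_id, 89.99, "USD", "in_stock",
          datetime(2025, 1, 1, tzinfo=timezone.utc)),
    Offer(product.global_id, 79.99, "USD", "in_stock",
          datetime(2025, 1, 8, tzinfo=timezone.utc), original_price=89.99),
]

# The product row never changes; the price history lives entirely in Offer rows.
latest = max(history, key=lambda offer: offer.timestamp)
print(latest.price, latest.currency)
```

Because every Offer carries its own timestamp, price history and cross-platform comparison become plain queries over snapshots rather than diffs against a mutable Product record.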
Deep category examples (Appliances, Women’s Apparel, Mother & Baby, Small Appliances, Power Tools)
Categories behave differently. Use the canonical skeleton and map category-specific attributes carefully.
Appliances (fridge/washer/microwave)
- Key attributes: brand, capacity (L), energy rating, dimensions (L×W×H), inverter, warranty.
- Variants: color/capacity; Amazon/Walmart aggregate variants better; SEA platforms sometimes split variants into separate listings.
- Mapping: Product.attributes.capacity_l, Product.attributes.energy_rating; normalize dimensions to cm and capacity to litres.
- Analysis: price index by capacity bands; stock volatility and replenishment speed as quality proxies; extract review keywords: noise/energy/after-sales.
Women’s Apparel (dress)
- Key attributes: size (US/EU/ASIA), color, material, fit, season tags.
- Variants: size × color; watch for platforms that merge multiple designs under one SKU.
- Mapping: attributes.size_standard, attributes.material; normalize colors to a standard palette.
- Analysis: returns-related reviews are sensitive (size/quality); ad slots impact rankings more directly in apparel; strong seasonality—use time windows.
Mother & Baby (strollers/formula)
- Key attributes: age/weight range, standards (EN/ASTM), formula type, safety certifications.
- Variants: pack size/color; mind regional regulations for formula claims.
- Mapping: attributes.age_range, attributes.safety_cert; preserve certification codes in cleaned text.
- Analysis: review stability and negative review control matter; stock gaps strongly affect rank recovery.
Small Appliances (air fryer/blender)
- Key attributes: capacity (L), power (W), coating type, preset modes.
- Variants: capacity and color; promos are strong—analyze price elasticity with promo badges.
- Mapping: attributes.capacity_l, attributes.power_w; normalize feature keywords (airfry/bake/grill).
- Analysis: review keywords “coating/odor/cleanability”; split SoV into organic vs ads.
Power Tools (drills/grinders)
- Key attributes: power/voltage, chuck size, RPM range, brushless, accessories list.
- Variants: battery capacity/bundle size; same SPU may split into multiple listings across platforms.
- Mapping: attributes.voltage_v, attributes.rpm, attributes.brushless.
- Analysis: spec-driven segment; price correlates with accessory bundles; extract “durability/battery/heat” from reviews.
Engineering tip: namespace category-specific attributes under Product.attributes (e.g., appliance.*, fashion.*, baby.*, tools.*) to keep generic fields clean.
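One way to apply this tip is sketched below in Python. The helper names (to_cm, namespaced), the conversion table and the sample values are hypothetical; only the namespace prefix and the capacity_l / power_w attribute names come from the text above.

```python
# Hypothetical conversion table: multiply by the factor to get centimetres.
CM_PER_UNIT = {"cm": 1.0, "mm": 0.1, "m": 100.0, "in": 2.54}


def to_cm(value: float, unit: str) -> float:
    """Convert a length to centimetres, rounded to two decimals."""
    return round(value * CM_PER_UNIT[unit.lower()], 2)


def namespaced(namespace: str, raw: dict) -> dict:
    """Prefix every key with the category namespace, e.g. 'appliance.capacity_l'."""
    return {f"{namespace}.{key}": value for key, value in raw.items()}


# Generic fields stay at the top level; category-specific ones get a prefix.
attributes = {"color": "black"}
attributes.update(
    namespaced(
        "appliance",
        {"capacity_l": 4.0, "power_w": 1500, "width_cm": to_cm(12, "in")},
    )
)
print(attributes)
```

Flat dotted keys are only one option; nested dictionaries per namespace would work equally well if the warehouse handles nested columns comfortably.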
4. API design
{
  "endpoint": "/v1/scrape",
  "method": "POST",
  "request": {
    "platform": "amazon|walmart|shopee|lazada|tokopedia|ebay|aliexpress|ml|etsy|rakuten|flipkart",
    "intent": "search|listing|product_detail|offers|reviews|ads",
    "query": "air fryer",
    "region": "US|SG|ID|MY|PH|TH|VN|BR|MX|JP|EU",
    "language": "en|zh|id|ms|es|pt|jp",
    "page": 1,
    "page_size": 24,
    "options": { "wait_render": true, "include_ads": true, "include_seller": true, "timeout_ms": 30000 }
  },
  "response": {
    "items": [ { /* Canonical fields */ } ],
    "meta": { "page": 1, "page_size": 24, "total": 500, "trace_id": "..." },
    "cache": { "hit": false, "ttl": 60 }
  }
}
- Paging: unify page/page_size. Convert cursor-based paging to this format.
- Idempotency: within the TTL, the same platform+query+region+intent returns consistent results. Support an idempotency key for client retries.
- Caching: short TTL for hot queries & details; avoid long caching for reviews/ads.
- Canary: adapter versions roll out gradually; quick rollback on anomalies.
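The idempotency and caching rules above can be sketched as a deterministic cache key plus a per-intent TTL. Everything here is an assumption for illustration: the TTL values, the choice of key fields and the function names. The sketch also ignores options entirely; a real key would probably need to include any option that changes the result, such as include_ads.

```python
import hashlib
import json

# Hypothetical TTLs in seconds: short for search/detail, shorter for reviews/ads.
TTL_BY_INTENT = {"search": 60, "product_detail": 60, "reviews": 15, "ads": 15}

# Fields assumed to define "the same request"; options such as timeout_ms are left out.
KEY_FIELDS = (
    "platform", "intent", "query", "item_id", "shop_id",
    "region", "language", "page", "page_size",
)


def cache_key(request: dict) -> str:
    """Hash the identifying fields in a stable order so equal requests share a key."""
    canonical = {name: request.get(name) for name in KEY_FIELDS}
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def ttl_for(request: dict, default: int = 30) -> int:
    """Look up the cache TTL for this request's intent."""
    return TTL_BY_INTENT.get(request.get("intent"), default)


first = {"platform": "walmart", "intent": "search", "query": "air fryer",
         "region": "US", "language": "en", "page": 1, "page_size": 24}
# Same request with a different field order and a different timeout.
second = {"page_size": 24, "page": 1, "language": "en", "region": "US",
          "query": "air fryer", "intent": "search", "platform": "walmart",
          "options": {"timeout_ms": 10000}}

print(cache_key(first) == cache_key(second), ttl_for(first))
```

Two requests that differ only in field order or timeout map to the same key, so within the TTL they can be served from the same cached response.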
5. Anti-bot, observability & quality
- Fingerprint pool: diversify IP/UA/screen/timezone; human-like pacing.
- Render waits: use wait_for_selector and fallbacks for delayed signals (Sponsored, badges).
- Retry & degrade: fast retries; degrade to low-precision parsing or cached snapshots on persistent failures.
- Coverage & accuracy: target ≥95% overall, ≥98% for core queries; multi-signal ad detection with confidence & evidence.
- Metrics: latency, error rate, timeouts, retry counts, cache hit rate, freshness—on dashboards and alerts.
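The "retry & degrade" item can be sketched as a small wrapper: retry with exponential backoff and jitter, then fall back to a cached snapshot and flag the result as degraded. The function name, the attempt count, the delays and the shape of the returned dict are all hypothetical.

```python
import random
import time
from typing import Any, Callable, Optional


def fetch_with_degrade(
    fetch: Callable[[], Any],
    cached_snapshot: Optional[Any],
    max_attempts: int = 3,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Try fetch() up to max_attempts times; on persistent failure use the snapshot."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return {"data": fetch(), "degraded": False, "attempts": attempt + 1}
        except Exception as exc:  # a real adapter would catch narrower error types
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff plus jitter so retries do not arrive in lockstep.
                sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    if cached_snapshot is not None:
        return {"data": cached_snapshot, "degraded": True, "attempts": max_attempts}
    raise RuntimeError("all attempts failed and no snapshot is available") from last_error


def always_down() -> dict:
    raise TimeoutError("upstream timed out")


# Demo with a no-op sleep so the example finishes instantly.
result = fetch_with_degrade(always_down, {"items": []}, sleep=lambda seconds: None)
print(result)
```

Returning the degraded flag and the attempt count alongside the data makes it easy to feed retry counts and freshness into the dashboards listed above.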
6. ETL & analytics
- Extract: concurrent queues per platform/intent; rate-limited; retry with backoff.
- Transform: mapping, category/brand normalization, FX conversion, variant aggregation, text cleaning.
- Load: warehouse partitions by platform/date; Bronze/Silver/Gold layers for raw/cleaned/curated.
Variant tip: don’t average prices across color/size. Keep main SKU and price deltas by variant—much more useful for elasticity analysis.
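A minimal sketch of that tip, assuming all variants are already in one currency and that the caller knows which SKU is the main one. The function name, the SKU labels and the prices are made up for illustration.

```python
def variant_deltas(variants: list[dict], main_sku: str) -> dict:
    """Keep the main SKU's price and express every other variant as a delta from it."""
    prices = {variant["sku"]: variant["price"] for variant in variants}
    base = prices[main_sku]
    return {
        "main_sku": main_sku,
        "main_price": base,
        "deltas": {
            sku: round(price - base, 2)
            for sku, price in prices.items()
            if sku != main_sku
        },
    }


variants = [
    {"sku": "AF-4L-BLACK", "price": 79.99},
    {"sku": "AF-4L-WHITE", "price": 84.99},
    {"sku": "AF-6L-BLACK", "price": 109.99},
]
summary = variant_deltas(variants, main_sku="AF-4L-BLACK")
print(summary)
```

Averaging these three prices would give a figure that matches no purchasable variant, whereas the deltas show directly what a colour or capacity change costs.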
7. Code examples
Python: async concurrency
import asyncio

import aiohttp

API_BASE = "https://api.example.com"
API_KEY = "YOUR_API_KEY"


async def scrape(session, payload):
    # POST one request to /v1/scrape and return the parsed JSON body.
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    async with session.post(f"{API_BASE}/v1/scrape", json=payload, headers=headers) as resp:
        if resp.status != 200:
            raise RuntimeError(f"HTTP {resp.status}")
        return await resp.json()


async def main():
    queries = [
        {"platform": "walmart", "intent": "search", "query": "air fryer", "region": "US", "language": "en"},
        {"platform": "shopee", "intent": "product_detail", "item_id": 123456789, "shop_id": 987654321, "region": "SG", "language": "en"},
        {"platform": "lazada", "intent": "search", "query": "air fryer", "region": "SG", "language": "en"},
    ]
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=30)) as session:
        tasks = [scrape(session, q) for q in queries]
        # return_exceptions=True returns failures as values instead of raising the first one.
        results = await asyncio.gather(*tasks, return_exceptions=True)
    for r in results:
        print(type(r), str(r)[:200])


asyncio.run(main())
Node.js: streaming & dedupe
// Run as an ES module (top-level await is used below).
import fetch from "node-fetch";

const API_BASE = "https://api.example.com";
const API_KEY = process.env.API_KEY;

async function scrape(body) {
  const resp = await fetch(`${API_BASE}/v1/scrape`, {
    method: "POST",
    headers: { "Authorization": `Bearer ${API_KEY}`, "Content-Type": "application/json" },
    body: JSON.stringify(body)
  });
  if (!resp.ok) throw new Error(`HTTP ${resp.status}`);
  return await resp.json();
}

// Keep the first occurrence of each platform + global_id pair.
function dedupe(items) {
  const seen = new Set();
  return items.filter(x => {
    const key = `${x.platform}:${x.global_id}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const walmart = await scrape({ platform: "walmart", intent: "search", query: "air fryer", region: "US", language: "en", options: { include_ads: true } });
const shopee = await scrape({ platform: "shopee", intent: "product_detail", item_id: 123456789, shop_id: 987654321, region: "SG", language: "en" });
// Merge the items arrays of both responses, not the response objects themselves.
const merged = dedupe([...(walmart.items || []), ...(shopee.items || [])]);
console.log(merged.slice(0, 3));
Walmart ad-slot snippet
{
  "items": [
    {
      "global_id": "walmart:itemId:123456",
      "platform": "walmart",
      "title": "Air Fryer Pro",
      "is_sponsored": true,
      "adslot": {
        "slot_type": "search_top",
        "confidence": 0.93,
        "evidence": { "text": "Sponsored", "aria": "sponsored", "css": ".ad-badge" }
      }
    }
  ]
}
8. Metrics & dashboards
- Price Index: median, quantiles, promo share by platform/category.
- Availability Volatility: stock-outs and replenishment speed by day/week.
- Share of Voice: brand exposure & ad-share across search listings.
- Listing Velocity: new listings and rank lift; spot rising stars and fading SKUs.
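Two of these metrics can be sketched with the standard library alone. The sketch assumes prices are already in one currency, treats an offer as promoted when original_price exceeds price, and counts Share of Voice as a simple unweighted share of listings; rank weighting is left out. Function names and sample rows are hypothetical.

```python
import statistics


def price_index(offers: list[dict]) -> dict:
    """Median, quartiles and promo share for one platform/category slice."""
    prices = [offer["price"] for offer in offers]
    q1, _, q3 = statistics.quantiles(prices, n=4)  # needs at least two prices
    promoted = sum(
        1 for offer in offers
        if offer.get("original_price") and offer["original_price"] > offer["price"]
    )
    return {
        "median": statistics.median(prices),
        "q1": q1,
        "q3": q3,
        "promo_share": promoted / len(offers),
    }


def share_of_voice(listings: list[dict], brand: str) -> dict:
    """Brand share of search results, split into organic and sponsored slots."""
    def share(rows: list[dict]) -> float:
        return sum(row["brand"] == brand for row in rows) / len(rows) if rows else 0.0

    organic = [row for row in listings if not row["is_sponsored"]]
    ads = [row for row in listings if row["is_sponsored"]]
    return {"organic": share(organic), "ads": share(ads)}


offers = [
    {"price": 60.0, "original_price": 75.0},
    {"price": 70.0},
    {"price": 80.0, "original_price": 99.0},
    {"price": 90.0},
    {"price": 100.0},
]
listings = [
    {"brand": "A", "is_sponsored": True},
    {"brand": "B", "is_sponsored": True},
    {"brand": "A", "is_sponsored": False},
    {"brand": "A", "is_sponsored": False},
]
index = price_index(offers)
sov = share_of_voice(listings, "A")
print(index, sov)
```

Splitting organic from sponsored exposure shows whether a brand's visibility is earned through ranking or bought through ad slots.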
9. Case study: the same air fryer across three platforms
Pick a comparable SKU (e.g., 4L air fryer). Pull top-N search and details from Amazon/Walmart/Shopee. Then:
- Price & promos: Amazon promo-heavy; Walmart steadier prices; Shopee wide promo-driven ranges.
- Ads & ranking: Amazon Sponsored dense at the top; Walmart ads more discrete; Shopee shop-weighted rankings.
- Reviews & conversion: review volume and recent growth beat single-point ratings as conversion proxies.
With the canonical schema, stitch all three and make a simple call: main push on Amazon; price strategy follows Walmart median; Shopee wins via promo pricing and shop exposure. Decisions backed by data—not gut feel.
10. FAQ & compliance
- Don’t over-scrape: rate-limit, cache and respect platform ToS; use public product data only.
- Variants: don’t mix variant prices; avoid treating child SKUs as separate products when comparing.
- Multilingual: regex + dictionaries; don’t rely on English-only “Sponsored”.
- Rollback: versioned adapters, canary rollout, automatic degrade on anomalies, on-call playbooks.
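The multilingual point can be sketched as a small multi-signal detector that returns a confidence and its evidence. The term dictionary is illustrative only and would need checking against each platform's actual labels; the weights, the 0.5 threshold and the CSS pattern are assumptions. The evidence keys (text, aria, css) mirror the Walmart ad-slot snippet.

```python
import re

# Illustrative dictionary: verify the real label wording per platform and site.
SPONSORED_TERMS = {
    "en": ["sponsored"],
    "es": ["patrocinado"],
    "pt": ["patrocinado"],
    "de": ["gesponsert"],
    "ja": ["スポンサー"],
}
# Hypothetical weights: visible text counts most, CSS class names least.
SIGNAL_WEIGHTS = {"text": 0.5, "aria": 0.3, "css": 0.2}
# Matches "ad", "ads" or "sponsored" as a whole token in a selector, not inside "badge".
AD_CSS = re.compile(r"(^|[\W_])(ads?|sponsored)([\W_]|$)", re.IGNORECASE)


def detect_sponsored(evidence: dict, language: str) -> dict:
    """Score ad signals; English terms are always checked as a fallback."""
    terms = SPONSORED_TERMS.get(language, []) + SPONSORED_TERMS["en"]
    pattern = re.compile("|".join(re.escape(term) for term in terms), re.IGNORECASE)
    hits = {}
    for signal in ("text", "aria"):
        value = evidence.get(signal) or ""
        if pattern.search(value):
            hits[signal] = value
    css = evidence.get("css") or ""
    if AD_CSS.search(css):
        hits["css"] = css
    confidence = round(sum(SIGNAL_WEIGHTS[signal] for signal in hits), 2)
    return {"is_sponsored": confidence >= 0.5, "confidence": confidence, "evidence": hits}


english = detect_sponsored(
    {"text": "Sponsored", "aria": "sponsored", "css": ".ad-badge"}, "en")
spanish = detect_sponsored({"text": "Patrocinado"}, "es")
organic = detect_sponsored({"text": "Air Fryer Pro", "css": ".badge-new"}, "en")
print(english["confidence"], spanish["confidence"], organic["confidence"])
```

Keeping the matched evidence next to the score makes false positives easy to audit when a platform changes its markup.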
For enterprise-grade onboarding (platform list, quotas, regional configs), contact us. We ship region- and category-specific solutions.
© 2025 Data Intelligence. For technical research and compliant business use only.
