Methodology
How we build the most comprehensive e-commerce database in the world.
ShopRank tracks 9.1 million verified, active e-commerce stores across 340+ platforms in 195+ countries. Every number we publish goes through the same pipeline described below. No sampling, no surveys - a full census of the observable web.
This page explains how we collect, process, and verify our data. We believe transparency in methodology is what separates actionable intelligence from marketing claims.
1. Domain Discovery
We start by building the most complete list of domains on the internet. Our proprietary discovery pipeline combines multiple public and commercial data sources to monitor over 1 billion domains across all major TLDs.
We discover millions of new domains every month and determine when each website first appeared online by cross-referencing multiple date sources. Domains that redirect to other domains are deduplicated and tracked as a single entity.
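As a rough illustration of the redirect deduplication described above, the sketch below follows each redirect chain to its final target and groups domains under that target. It is a minimal sketch, not the production pipeline: the function names, the `redirects` mapping, and the rule for handling redirect loops are all assumptions made for this example.

```python
def canonical_domain(domain: str, redirects: dict[str, str]) -> str:
    """Follow redirects until reaching a domain that does not redirect further."""
    path = [domain]
    while path[-1] in redirects:
        nxt = redirects[path[-1]]
        if nxt in path:
            # Redirect loop: pick the alphabetically first member of the loop
            # so every domain in the loop maps to the same entity (assumed rule).
            return min(path[path.index(nxt):])
        path.append(nxt)
    return path[-1]


def group_by_canonical(domains: list[str], redirects: dict[str, str]) -> dict[str, list[str]]:
    """Group domains so that each redirect chain is tracked as a single entity."""
    groups: dict[str, list[str]] = {}
    for d in domains:
        groups.setdefault(canonical_domain(d, redirects), []).append(d)
    return groups


# Hypothetical data: two domains redirect (directly or indirectly) to example.com.
redirects = {"shop.example": "example.com", "old.example": "shop.example"}
groups = group_by_canonical(
    ["example.com", "shop.example", "old.example", "other.org"], redirects
)
```

Here `groups` contains two entities: `example.com` (with three member domains) and `other.org`.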
2. Web Scanning
For every active domain, we perform two types of scans:
Technology Detection
Our technology scanner analyzes each page to identify the underlying technology stack. We detect 500+ technologies, including e-commerce platforms, analytics tools, payment integrations, and marketing software.
This is how we identify which platform a store runs on: Shopify, WooCommerce, PrestaShop, Magento, or any other of the 340+ platforms we track.
Content Extraction
A separate parser extracts structured metadata from each page: text content, link structure, social media profiles, and more. This data feeds into our classification pipeline.
Both scans run on the same page snapshot to ensure consistency between technology detection and content extraction.
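The idea of running both scans on one snapshot can be sketched as follows. This is an illustrative example only: the `Snapshot` type, the `scan` function, and the three signature patterns are assumptions chosen for the sketch (they are common public markers of each platform, not our actual detection rules), and a real scanner would use far more signals than a single regular expression per platform.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class Snapshot:
    """One fetched copy of a page; both scans read the same object."""
    url: str
    html: str


# Illustrative signatures only (assumed for this sketch).
SIGNATURES = {
    "Shopify": re.compile(r"cdn\.shopify\.com", re.I),
    "WooCommerce": re.compile(r"wp-content/plugins/woocommerce", re.I),
    "PrestaShop": re.compile(r'content="PrestaShop"', re.I),
}


class _LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def scan(snapshot: Snapshot) -> dict:
    """Run technology detection and content extraction on the same snapshot."""
    technologies = [name for name, rx in SIGNATURES.items() if rx.search(snapshot.html)]
    parser = _LinkParser()
    parser.feed(snapshot.html)
    return {"url": snapshot.url, "technologies": technologies, "links": parser.links}


page = Snapshot(
    url="https://shop.example/",
    html='<script src="https://cdn.shopify.com/x.js"></script>'
         '<a href="/cart">Cart</a><a href="https://instagram.com/shop">IG</a>',
)
result = scan(page)
```

Because both results come from the same `Snapshot`, a page that changes between two fetches cannot produce a technology list and a link list that describe different versions of the site.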
3. E-commerce Classification
Detecting a technology is not the same as verifying a store. A domain running WooCommerce might be a blog, a corporate website, or an abandoned project with a dormant plugin. In our data, 59% of WooCommerce domains are not real stores.
To separate real stores from everything else, we built a proprietary classification system that analyzes each domain across multiple dimensions:
- Text content - commercial language patterns across dozens of languages
- Page structure - navigation patterns typical of online stores
- Technology signals - detected platform and third-party integrations
- Semantic analysis - multilingual understanding of page purpose
The classifier produces a probability score for each domain. Only domains above our verification threshold are counted as e-commerce stores. The same classifier and the same threshold are applied uniformly to all platforms - Shopify, WooCommerce, PrestaShop, and every other platform get identical treatment.
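To make the "one score, one threshold" idea concrete, here is a minimal sketch using a logistic score over four signal groups. Everything specific in it is hypothetical: the signal names, the weights, the bias, and the 0.5 threshold are placeholders for illustration and do not describe the real classifier, which is proprietary.

```python
import math

# Hypothetical weights for the four signal groups (each signal is in [0, 1]).
WEIGHTS = {"text": 2.0, "structure": 1.5, "technology": 1.0, "semantic": 2.5}
BIAS = -3.5
THRESHOLD = 0.5  # placeholder; the same value is used for every platform


def store_probability(signals: dict[str, float]) -> float:
    """Combine the signals into a probability with a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def is_verified_store(signals: dict[str, float], threshold: float = THRESHOLD) -> bool:
    """Apply one threshold to the score; the platform name is never consulted."""
    return store_probability(signals) >= threshold


# A site with strong signals everywhere vs. a blog with a dormant shop plugin.
real_store = {"text": 1.0, "structure": 1.0, "technology": 1.0, "semantic": 1.0}
dormant_plugin = {"technology": 1.0}
```

In this sketch `is_verified_store(real_store)` is true, while `is_verified_store(dormant_plugin)` is false: a detected platform alone does not clear the threshold.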
Two-Tier Classification
We maintain two classification levels:
- Broad detection - domains with any e-commerce signal. This casts a wide net.
- Verified stores - the subset where the classifier confirms the domain is an active e-commerce store. This is the number we publish: 9.1 million stores.
4. Courier & Payment Provider Detection
Beyond platform detection, we scan shipping and payment pages to identify which logistics and payment providers each store uses.
Our crawler identifies pages likely to contain shipping or payment information and scans them for mentions of specific providers. We track hundreds of courier companies and payment processors across all major markets.
Detection rates vary by market. Stores where we don't detect a specific provider are not excluded - they simply don't have this data point. When we report adoption rates (e.g., "43% of stores use provider X"), we always calculate relative to stores where detection was successful, not all stores.
5. Country Detection
We determine each store's primary target market using multiple signals:
- Domain TLD - .de, .fr, .pl directly indicate the country
- Language detection - the primary language of the page content
- Phone numbers - country codes extracted from contact information
- Courier and payment providers - local services indicate local presence (e.g., InPost suggests Poland, Colissimo suggests France)
- Additional proprietary signals - several other data sources that help resolve ambiguous cases
These signals are combined using a priority hierarchy, so a stronger signal takes precedence over a weaker one when they disagree. Country detection works best for stores on country-code domains (like .fr or .de) and less reliably for .com domains. We document this asymmetry in our research - for example, in country-level analyses we often provide robustness checks using only local-TLD domains where country attribution is unambiguous.
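A priority hierarchy of this kind can be sketched as "take the first signal that has an answer". The ordering below, the signal names, and the small TLD table are assumptions for illustration; the actual hierarchy and the additional proprietary signals are not shown.

```python
from typing import Optional

# Tiny illustrative table; a real one would cover every country-code TLD.
CCTLD_TO_COUNTRY = {"de": "DE", "fr": "FR", "pl": "PL"}

# Assumed priority order, strongest signal first.
PRIORITY = ["tld", "provider", "phone", "language"]


def tld_signal(domain: str) -> Optional[str]:
    """Return a country for country-code TLDs, or None for generic ones."""
    return CCTLD_TO_COUNTRY.get(domain.rsplit(".", 1)[-1].lower())


def detect_country(domain: str, other_signals: dict[str, Optional[str]]) -> Optional[str]:
    """Return the country from the highest-priority signal that has a value."""
    signals = {"tld": tld_signal(domain), **other_signals}
    for name in PRIORITY:
        country = signals.get(name)
        if country:
            return country
    return None  # no signal available, e.g. a sparse .com site
```

For example, `detect_country("sklep.pl", {"language": "DE"})` yields "PL" because the TLD outranks the language signal, while `detect_country("shop.com", {})` yields no country at all, which is the generic-TLD gap described above.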
6. Category Classification
Each verified store is assigned to a product category (such as Clothing & Accessories, Health & Beauty, Electronics, Home & Garden, and others) using multilingual text analysis.
Category assignment is automated, so individual stores can be misclassified. We consider it accurate enough for aggregate analysis, but we flag it as a limitation in per-category breakdowns.
7. Data Quality Filters
Not every detected store is an active, functioning business. We apply two key filters:
- Availability - the domain must be live and responding. Domains that have gone offline are excluded.
- Activity - stores whose pages appear to be under construction, password-protected, or abandoned are detected using a combination of content and traffic signals. These stores exist but are not currently operating, so we exclude them from our active count.
Our published figure of 9.1 million stores reflects only domains that are verified as e-commerce and currently active.
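In code terms, the published count is the intersection of three conditions. The sketch below is a minimal illustration; the `Store` record and its field names are hypothetical, and each boolean stands in for the result of an upstream pipeline stage.

```python
from dataclasses import dataclass


@dataclass
class Store:
    domain: str
    is_ecommerce: bool  # passed the classification threshold
    is_live: bool       # availability filter: domain responds
    is_operating: bool  # activity filter: not under construction, locked, or abandoned


def active_store_count(stores: list[Store]) -> int:
    """Count only domains that pass classification and both quality filters."""
    return sum(s.is_ecommerce and s.is_live and s.is_operating for s in stores)


# Hypothetical sample: only the first domain satisfies all three conditions.
sample = [
    Store("a.example", True, True, True),
    Store("b.example", True, False, True),   # offline
    Store("c.example", True, True, False),   # password-protected
    Store("d.example", False, True, True),   # not a store
]
count = active_store_count(sample)
```

Only `a.example` is counted here; each of the other three fails exactly one condition.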
8. Store Dating
When we say a store was "launched in 2025," we mean the earliest evidence of its web presence comes from 2025. We combine multiple proprietary and public date sources and take the earliest available date.
This means "new in 2025" stores have no web presence record before January 1, 2025 in any of our data sources.
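The earliest-date rule can be written in a few lines. This is a minimal sketch: the function names are hypothetical, and the input is simply a list of dates, one per source, with `None` where a source has no record.

```python
from datetime import date
from typing import Optional


def first_seen(dates: list[Optional[date]]) -> Optional[date]:
    """Earliest date across all sources, ignoring sources with no record."""
    known = [d for d in dates if d is not None]
    return min(known) if known else None


def launched_in(dates: list[Optional[date]], year: int) -> bool:
    """True only if the earliest evidence of the store falls in `year`."""
    earliest = first_seen(dates)
    return earliest is not None and earliest.year == year


# One source places this store in 2025, another in late 2024.
mixed = [date(2025, 3, 1), None, date(2024, 11, 20)]
```

For `mixed`, the earliest date is 20 November 2024, so `launched_in(mixed, 2025)` is false: a single pre-2025 record in any source is enough to keep a store out of the "new in 2025" group.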
9. What We Don't Do
Transparency also means being clear about our limitations:
- No survey data - every data point comes from direct observation of the domain, not self-reported information.
- No purchase verification - we verify that a site is an e-commerce store, not that it processed a transaction today.
- Country detection has limits - stores on generic TLDs (.com, .shop) may not have a detected country. We document this in per-country analyses and provide robustness checks.
10. Data Freshness
Our data is not a static snapshot. Each component of the pipeline runs on its own schedule.
All major pipeline components - DNS, technology scans, content extraction, courier/payment detection, and classification - run continuously and are refreshed multiple times per month. New domain registrations are ingested in near real-time. The full database is rebuilt regularly to incorporate all new signals.