74% of AI Training Data Comes From WordPress. What Does That Mean for AI Quality?

Key finding

WordPress share of crawlable web: 74.3% of detected sites (Source: WebPulse Common Crawl scan, 10M+ detections. Common Crawl is the primary training corpus for most LLMs.)

The Training Data Pipeline

Common Crawl corpus size: 250B+ pages archived (Source: commoncrawl.org. The largest web archive, used as training data by GPT, Claude, Llama, and most major LLMs.)

Common Crawl is the primary training corpus for most large language models. GPT, Claude, Llama, and their competitors all train on web crawl data. Our scan of Common Crawl, covering more than 10 million detections, identifies 74.3% of detected sites as WordPress. The implication, if site share is a fair proxy for text share: roughly three-quarters of the web text that trained today's AI models was generated through WordPress, with its template structures, plugin artifacts, and content patterns.

What WordPress Content Looks Like at Scale

WordPress content has structural patterns that repeat across millions of sites: Yoast SEO-optimized headings, WooCommerce product descriptions, Elementor-generated layout markup, contact form boilerplate, cookie consent banners, and sidebar widget text. These patterns appear in training data millions of times. They become statistical weight in the model's understanding of 'how web text works.'
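To make the idea of repeating structural patterns concrete, here is a minimal sketch of how such markers could be spotted in a page's HTML. It is not the method behind the WebPulse scan, which this article does not describe. The `MARKERS` table, the `detect_markers` helper, and the sample page are assumptions made up for illustration, and the patterns are rough heuristics that would produce false positives on real pages.

```python
import re

# Illustrative marker patterns (assumptions for this sketch, not WebPulse's rules).
MARKERS = {
    "wordpress_core": re.compile(r"/wp-(?:content|includes)/", re.I),
    "yoast_seo": re.compile(r"Yoast SEO", re.I),
    "woocommerce": re.compile(r"woocommerce", re.I),
    "elementor": re.compile(r"\belementor-", re.I),
}


def detect_markers(html: str) -> set[str]:
    """Return the names of all marker patterns found in an HTML string."""
    return {name for name, pattern in MARKERS.items() if pattern.search(html)}


# Hypothetical page fragment used only to exercise the helper.
sample = """
<html><head>
<!-- This site is optimized with the Yoast SEO plugin -->
<link rel="stylesheet" href="/wp-content/themes/example/style.css">
</head>
<body class="home"><div class="elementor-section">Hello</div></body></html>
"""

found = detect_markers(sample)
print(sorted(found))  # ['elementor', 'wordpress_core', 'yoast_seo']
```

Run over millions of pages, even a crude check like this would match the same handful of strings again and again, which is the repetition the paragraph above describes.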

This isn't about quality in the editorial sense — WordPress hosts excellent journalism and scholarship. It's about structural repetition: when 74% of training text comes through one CMS, the model learns that CMS's patterns as the default. The template is overrepresented in the training signal.

The Diversity Problem

Modern framework share of crawlable web: ~5% of detected sites (Source: WebPulse Common Crawl scan. Next.js, Astro, SvelteKit, etc. combined.)

If 74% of training data is WordPress-generated and about 5% is modern-framework-generated, the model has roughly 15x more exposure to WordPress content patterns than to modern-framework content patterns. Structured data (JSON-LD), semantic HTML, and clean markup are underrepresented in training corpora because the sites that produce them are underrepresented in the crawlable web. AI models may systematically undervalue structured content because they were trained on a web that mostly doesn't produce it.
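The exposure ratio is simple arithmetic on the two shares. The sketch below reproduces it, assuming that share of detected sites stands in for share of training text and that "~5%" means exactly 5.0. The `exposure_ratio` helper is a hypothetical name used only here.

```python
# Shares quoted in this article, in percent of detected sites.
wordpress_share = 74.3
modern_framework_share = 5.0  # assumption: the article gives this only as "~5%"


def exposure_ratio(dominant: float, minority: float) -> float:
    """Return how many times larger the dominant share is than the minority share."""
    if minority <= 0:
        raise ValueError("minority share must be positive")
    return dominant / minority


ratio = exposure_ratio(wordpress_share, modern_framework_share)
print(f"WordPress-to-modern-framework ratio: {ratio:.1f}x")  # 14.9x
```

With these inputs the ratio is just under 15, so the headline multiple is a rounding of 14.9 and would move if the modern-framework share were pinned down more precisely.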

The Recursive Loop

AI models trained on WordPress-dominated corpora generate text that reflects WordPress patterns. That AI-generated text gets published — often on WordPress sites. The next training crawl picks it up. The loop compounds: WordPress-shaped training produces WordPress-shaped output, which becomes WordPress-shaped training data. The web that shaped AI was shaped by WordPress. And the AI that WordPress shaped is now reshaping the web. The question nobody is asking: what would AI output look like if trained on a web that was 74% clean semantic HTML instead of 74% WordPress?
