The AI-First Web

74% of AI Training Data Comes From WordPress. What Does That Mean for AI Quality?

AI models are trained largely on web crawls, and 74% of the sites detected in our Common Crawl scan run WordPress. That means AI training corpora are shaped by template repetition, plugin artifacts, and SEO-optimized filler. The web that shaped AI was shaped by WordPress.

5 min read

The Training Data Pipeline

74.3% of detected sites
WordPress share of crawlable web
Source: WebPulse Common Crawl scan, 10M+ detections. Common Crawl is the primary training corpus for most LLMs.
250B+ pages archived
Common Crawl corpus size
Source: commoncrawl.org. One of the largest open web crawl archives, widely used as training data for major LLMs.

Common Crawl is a core training source for most large language models. GPT, Claude, Llama, and their competitors all train on web crawl data. Our Common Crawl scan, covering more than 10 million detections, shows that 74.3% of detected sites run WordPress. The implication: if text volume roughly tracks site count, approximately three-quarters of the web text that trained today's AI models was published through WordPress, with its template structures, plugin artifacts, and content patterns.
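The step from "share of sites" to "share of training text" depends on how much text each kind of site contributes. The sketch below makes that assumption explicit. The function name `text_share` and the `relative_volume` parameter are hypothetical, introduced only for illustration, and no per-site text volumes were measured in the scan. Only the 74.3% figure comes from the scan.

```python
def text_share(site_share: float, relative_volume: float = 1.0) -> float:
    """Estimate a group's share of total text from its share of sites.

    site_share: fraction of sites in the group (0..1).
    relative_volume: HYPOTHETICAL ratio of average text per site inside
        the group to average text per site outside it. 1.0 means every
        site contributes the same amount of text.
    """
    if not 0.0 <= site_share <= 1.0:
        raise ValueError("site_share must be between 0 and 1")
    if relative_volume < 0:
        raise ValueError("relative_volume must be non-negative")
    group = site_share * relative_volume   # text from the group
    rest = 1.0 - site_share                # text from everyone else
    total = group + rest
    return group / total if total else 0.0


WORDPRESS_SITE_SHARE = 0.743  # from the scan: 74.3% of detected sites

# Illustrative volumes only; these ratios are assumptions, not measurements.
for volume in (0.5, 1.0, 2.0):
    share = text_share(WORDPRESS_SITE_SHARE, volume)
    print(f"relative_volume={volume}: text share = {share:.1%}")
```

With `relative_volume` at 1.0, the text share equals the site share, which is the assumption behind the "three-quarters" estimate. Any other value moves the estimate up or down.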

What WordPress Content Looks Like at Scale

WordPress content has structural patterns that repeat across millions of sites: Yoast SEO-optimized headings, WooCommerce product descriptions, Elementor-generated layout markup, contact form boilerplate, cookie consent banners, and sidebar widget text. These patterns appear in training data millions of times. They become statistical weight in the model's understanding of 'how web text works.'
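To make these patterns concrete, here is a minimal sketch of how the fingerprints can be spotted in a page's HTML. The marker list is an assumption for illustration, built from commonly seen WordPress output (the generator meta tag, `/wp-content/` paths, the Yoast HTML comment, Elementor class prefixes). It is not the rule set used in the WebPulse scan, and the name `wordpress_signals` is hypothetical.

```python
import re

# Heuristic markers. ASSUMPTION: chosen for illustration only,
# not the detection rules used by the scan described in this article.
WORDPRESS_MARKERS = {
    "generator_meta": re.compile(
        r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']WordPress', re.I
    ),
    "wp_content_path": re.compile(r"/wp-content/", re.I),
    "wp_includes_path": re.compile(r"/wp-includes/", re.I),
    "yoast_comment": re.compile(r"Yoast SEO plugin", re.I),
    "elementor_class": re.compile(r'class=["\'][^"\']*\belementor-', re.I),
}


def wordpress_signals(html: str) -> list[str]:
    """Return the names of the markers found in the given HTML string."""
    return [name for name, pattern in WORDPRESS_MARKERS.items() if pattern.search(html)]


sample = (
    '<html><head><meta name="generator" content="WordPress 6.4">'
    '<link href="/wp-content/themes/x/style.css"></head>'
    '<body><div class="elementor-section"></div></body></html>'
)
print(wordpress_signals(sample))
```

A page with none of these markers returns an empty list. A real detector would need more signals and would still miss sites that strip or rewrite them.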

This isn't about quality in the editorial sense — WordPress hosts excellent journalism and scholarship. It's about structural repetition: when 74% of training text comes through one CMS, the model learns that CMS's patterns as the default. The template is overrepresented in the training signal.

The Diversity Problem

~5% of detected sites
Modern framework share of crawlable web
Source: WebPulse Common Crawl scan. Next.js, Astro, SvelteKit, etc. combined.

If 74% of training data is WordPress-generated and 5% is modern-framework-generated, the model has roughly 15 times more exposure (74 / 5 ≈ 14.8) to WordPress content patterns than to modern-framework content patterns. Structured data (JSON-LD), semantic HTML, and clean markup are underrepresented in training corpora because the sites that produce them are underrepresented in the crawlable web. AI models may systematically undervalue structured content because they were trained on a web that mostly doesn't produce it.
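The exposure ratio is simple arithmetic on the two scan figures. A minimal sketch, assuming site share stands in for training exposure (the function name `exposure_ratio` is introduced here for illustration):

```python
def exposure_ratio(share_a: float, share_b: float) -> float:
    """Return how many times larger share_a is than share_b."""
    if share_b <= 0:
        raise ValueError("share_b must be positive")
    return share_a / share_b


wordpress_share = 74.3  # percent of detected sites, from the scan
modern_share = 5.0      # approximate percent of detected sites, from the scan

ratio = exposure_ratio(wordpress_share, modern_share)
print(f"{ratio:.1f}x")  # 14.9x
```

Because the modern-framework figure is an approximation (~5%), the result is best read as "about 15x" rather than as a precise multiple.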

The Recursive Loop

AI models trained on WordPress-dominated corpora generate text that reflects WordPress patterns. That AI-generated text gets published — often on WordPress sites. The next training crawl picks it up. The loop compounds: WordPress-shaped training produces WordPress-shaped output, which becomes WordPress-shaped training data. The web that shaped AI was shaped by WordPress. And the AI that WordPress shaped is now reshaping the web. The question nobody is asking: what would AI output look like if trained on a web that was 74% clean semantic HTML instead of 74% WordPress?
