The Training Data Pipeline
Common Crawl is a primary source of training text for many large language models. GPT, Claude, Llama, and their competitors all train on web crawl data. Our scan of 10 million sites from Common Crawl shows 74.3% are WordPress. The implication, assuming WordPress sites contribute text roughly in proportion to their share of sites: approximately three-quarters of the web text that trained today's AI models was generated through WordPress, with its template structures, plugin artifacts, and content patterns.
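To make the idea of "scanning for WordPress" concrete, here is a minimal sketch of how a page might be flagged. It is not the method behind the 74.3% figure. It assumes two widely seen fingerprints: the generator meta tag that WordPress emits by default, and asset paths under /wp-content/ or /wp-includes/. The function names are hypothetical, and a real scan would need to handle sites that strip or rewrite these markers.

```python
import re

# Heuristic fingerprints (assumptions for illustration only):
# 1. the <meta name="generator" content="WordPress ..."> tag
# 2. asset URLs under /wp-content/ or /wp-includes/
GENERATOR_RE = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']WordPress',
    re.IGNORECASE,
)
PATH_RE = re.compile(r'/wp-(?:content|includes)/', re.IGNORECASE)


def looks_like_wordpress(html: str) -> bool:
    """Return True if the HTML carries one of the fingerprints above."""
    return bool(GENERATOR_RE.search(html) or PATH_RE.search(html))


def wordpress_share(pages: list[str]) -> float:
    """Fraction of pages flagged as WordPress (0.0 for an empty list)."""
    if not pages:
        return 0.0
    return sum(looks_like_wordpress(p) for p in pages) / len(pages)
```

A heuristic like this undercounts sites that hide their generator tag and proxy their asset paths, so any share it reports is better read as a lower bound on what it can see than as an exact measurement.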
What WordPress Content Looks Like at Scale
WordPress content has structural patterns that repeat across millions of sites: Yoast SEO-optimized headings, WooCommerce product descriptions, Elementor-generated layout markup, contact form boilerplate, cookie consent banners, and sidebar widget text. These patterns appear in training data millions of times. They become statistical weight in the model's understanding of 'how web text works.'
This isn't about quality in the editorial sense — WordPress hosts excellent journalism and scholarship. It's about structural repetition: when 74% of training text comes through one CMS, the model learns that CMS's patterns as the default. The template is overrepresented in the training signal.
The Diversity Problem
If 74% of training data is WordPress-generated and 5% is modern-framework-generated, the model has roughly 15 times (74 / 5 ≈ 14.8) the exposure to WordPress content patterns that it has to modern-framework content patterns. Structured data (JSON-LD), semantic HTML, and clean markup are underrepresented in training corpora because the sites that produce them are underrepresented in the crawlable web. AI models may systematically undervalue structured content because they were trained on a web that mostly doesn't produce it.
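The exposure figure is plain division, and it is worth seeing how sensitive it is to the smaller number. A short sketch, where the 5% share is the hypothetical from the paragraph above rather than a measured value, and exposure_ratio is an illustrative helper:

```python
def exposure_ratio(share_a: float, share_b: float) -> float:
    """How many times more of the corpus pattern A covers than pattern B."""
    if share_b <= 0:
        raise ValueError("share_b must be positive")
    return share_a / share_b


# 74% WordPress vs. a hypothetical 5% modern-framework share.
ratio = exposure_ratio(0.74, 0.05)
print(f"{ratio:.1f}x")  # prints 14.8x
```

If the modern-framework share were 10% instead of 5%, the same calculation would give 7.4x, so the headline ratio depends heavily on how the smaller category is defined.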
The Recursive Loop
AI models trained on WordPress-dominated corpora generate text that reflects WordPress patterns. That AI-generated text gets published — often on WordPress sites. The next training crawl picks it up. The loop compounds: WordPress-shaped training produces WordPress-shaped output, which becomes WordPress-shaped training data. The web that shaped AI was shaped by WordPress. And the AI that WordPress shaped is now reshaping the web. The question nobody is asking: what would AI output look like if trained on a web that was 74% clean semantic HTML instead of 74% WordPress?
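The loop described above can be sketched as a toy model. Everything in it is an assumption made for illustration: each crawl, a fixed fraction of the corpus (ai_fraction) is replaced by AI-generated text, and that text over-represents the majority pattern by a sharpening exponent (sharpen). With sharpen equal to 1 the model simply mirrors its training data and the share never moves, so whether the loop compounds at all depends on whether models amplify dominant patterns. That is a question for measurement, and this sketch does not settle it.

```python
def next_share(share: float, ai_fraction: float, sharpen: float) -> float:
    """One crawl cycle of the toy feedback loop.

    share:       current fraction of corpus text carrying the dominant pattern
    ai_fraction: assumed fraction of the next corpus that is AI-generated
    sharpen:     assumed amplification exponent (1.0 means no amplification)
    """
    # Assumption: AI output over-represents whichever pattern is in the majority.
    amplified = share ** sharpen
    ai_share = amplified / (amplified + (1 - share) ** sharpen)
    # The non-AI part of the corpus is assumed to keep the previous share.
    return (1 - ai_fraction) * share + ai_fraction * ai_share


def simulate(share: float, rounds: int,
             ai_fraction: float = 0.2, sharpen: float = 1.5) -> list[float]:
    """Return the share after each round, starting with the initial share."""
    history = [share]
    for _ in range(rounds):
        share = next_share(share, ai_fraction, sharpen)
        history.append(share)
    return history


for step, value in enumerate(simulate(0.74, rounds=5)):
    print(f"crawl {step}: {value:.3f}")
```

Under these assumed parameters the dominant share rises each round and approaches, but never reaches, 100%. Starting the same simulation below 50% would push the share down instead, which is one way to read the closing question about a web dominated by a different kind of markup.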