The Pragmatic CTO

No Vibes Allowed: Context Engineering for Real Codebases

Context engineering as discipline

Allan MacGregor πŸ‡¨πŸ‡¦
Feb 20, 2026

A randomized controlled trial of 16 experienced open-source developers working on 246 real-world tasks found that developers using AI coding tools took 19% longer to complete their work. But they believed they were 20% faster. That is a nearly 40-percentage-point perception gap: the developers weren't just wrong about the magnitude of the improvement; they had the direction backwards.

These weren't beginners fumbling with a new tool. They averaged five years of experience on the specific codebases where they were tested. They used Cursor Pro and Claude 3.5/3.7 Sonnet --- mainstream tools, not fringe experiments. The methodology was rigorous: randomized, controlled, pre-registered. And the result was unambiguous.

If you're a CTO and your teams report that AI tools are "helpful" while your delivery metrics stay flat, you're not imagining things. The data confirms the disconnect.

Stanford's three-year study across 600+ companies and 100,000+ developers fills in the rest of the picture. AI coding tools increase productivity 15--20% on average --- but that average obscures massive variation. Simple tasks on new projects see 30--40% gains. Simple tasks in existing codebases see 15--20%. Hard tasks in mature codebases? Zero to 10% gains, sometimes negative. As Stanford's researchers noted, "a significant portion of that gain is lost fixing the bugs and mess the AI made."

The degradation scales with complexity. As codebase size increases from 10K to 10M lines of code, AI's productivity contribution drops sharply. Context window performance degrades from roughly 90% accuracy at 1K tokens to around 50% at 32K tokens. Signal-to-noise ratio collapses; dependencies and domain-specific logic grow more intricate than the model can reason about unaided.

The pattern is clear: AI coding tools work well on small, isolated problems. They struggle --- and sometimes actively hurt --- on the large, interconnected codebases where your hardest engineering problems live. The question is whether that gap is permanent or whether something can be done about it.

The Slop Factory

The speed problem is bad enough. The quality problem is worse.

GitClear analyzed 211 million lines of code across 2020--2024 and found that AI-assisted development is fundamentally changing what gets committed. Copy-pasted code rose from 8.3% to 12.3%. Duplicated code blocks of five or more lines increased eightfold in 2024. Refactoring collapsed --- from 25% of all changes in 2021 to less than 10% in 2024, a 60% decline. Code churn nearly doubled: the share of new code revised within two weeks grew from 3.1% to 5.7%. For the first time in GitClear's measurement history, copy-pasted lines exceeded moved or refactored lines.
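
To make the churn figure concrete: GitClear counts a newly added line as "churned" if it is revised or deleted within roughly two weeks of being committed. Here is a minimal sketch of that metric as a toy model (the function name and data shape are invented for illustration, not GitClear's actual methodology):

```python
from datetime import datetime, timedelta

def churn_rate(line_events, window_days=14):
    """Fraction of newly added lines that were revised or deleted
    within `window_days` of being authored.

    `line_events` is a list of (added_at, revised_at) pairs, where
    revised_at is None if the line was never touched again.
    """
    if not line_events:
        return 0.0
    window = timedelta(days=window_days)
    churned = sum(
        1 for added_at, revised_at in line_events
        if revised_at is not None and revised_at - added_at <= window
    )
    return churned / len(line_events)

# Toy history: five lines added; one rewritten after 3 days,
# one after 20 days (outside the two-week window).
t0 = datetime(2024, 1, 1)
events = [
    (t0, t0 + timedelta(days=3)),   # churned
    (t0, t0 + timedelta(days=20)),  # revised, but too late to count
    (t0, None),
    (t0, None),
    (t0, None),
]
print(churn_rate(events))  # 0.2
```

At GitClear's scale the same ratio is computed over millions of lines of git history; a rise from 3.1% to 5.7% means roughly one in eighteen new lines now gets reworked within two weeks of shipping.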

LLMs prioritize local functional correctness over global architectural coherence. The code compiles. The tests pass. But the system accumulates entropy --- duplicated logic, ignored abstractions, brittle coupling --- that compounds with every AI-assisted commit.

CodeRabbit's analysis of 470 real-world pull requests quantified the damage. AI-coauthored PRs averaged 10.83 issues versus 6.45 for human-only PRs. 1.7x more major issues; 1.4x more critical issues. Logic errors up 75%. Security vulnerabilities up 1.5--2x. Readability issues up 3x. Performance bugs up 8x.

Developers know this. The Stack Overflow 2025 survey found that trust in AI accuracy fell from 40% to 29% year over year. Sixty-six percent say they spend more time fixing "almost-right" AI code than they save. More developers actively distrust AI (46%) than trust it (33%).

Dex Horthy, founder of HumanLayer, named the dynamic concisely: "A lot of the extra code shipped by AI tools ends up just reworking the slop that was shipped last week."

The slop factory. Ship fast on Monday; fix what you shipped on Friday. Net velocity gain: debatable. Net quality impact: measurable and negative.

This is not an anti-AI argument. The productivity gains on simple tasks are real; the Stanford data confirms that. But when AI coding tools are deployed without discipline on complex codebases, the quality evidence is damning. And quality problems compound in ways that speed gains do not.

Three Camps

The industry has sorted itself into three responses to this data.
