If you believe AI coding tools are speeding up your teams but delivery metrics don’t show it, you’re not imagining things. A rigorous trial with experienced open-source developers found that AI assistance actually slowed them down by 19%, even though they felt 20% faster—a 40-point perception gap. This wasn’t novice error; these were skilled devs on familiar codebases using mainstream AI tools. The disconnect between perception and reality is real, and it’s backed by solid data.
Stanford’s extensive study confirms AI coding tools boost productivity on simple, new projects by up to 40%, but that gain halves or disappears as task complexity and codebase size grow. For hard tasks in mature systems, AI helps little or even hurts, mainly because fixing AI-introduced bugs eats into any speed gains. The bigger the codebase, the worse the AI performs. Context window limits and intricate dependencies overwhelm current models, turning AI from helper to liability on your toughest problems.
And it gets worse. AI-assisted commits are changing code quality in troubling ways. GitClear’s analysis reveals copy-pasted code is on the rise, refactoring is tanking, and code churn is doubling. AI models optimize for local correctness (the code compiles, the tests pass) while global architectural coherence degrades. CodeRabbit’s study of pull requests shows AI-coauthored code has nearly twice as many major issues, up to double the security vulnerabilities, and triple the readability problems compared to human-only work. Developers know this firsthand: trust in AI accuracy dropped from 40% to 29%, and most say they spend more time fixing AI’s “almost right” code than they save. The “slop factory” churns on: ship fast, fix later, repeat, with questionable net velocity and clear quality decline.
The industry divides into three camps. Camp 1 says AI is fundamentally incapable of handling complex systems; the evidence supports this. Camp 2 hopes smarter future models will fix these problems, so companies wait passively for advances. Camp 3, however, argues the bottleneck isn’t the AI model itself but how we feed it information—context engineering. With the right workflow, today’s models can handle large codebases effectively. This is where new breakthroughs are happening.
Dex Horthy from HumanLayer nails the core constraint: context window physics. AI models have a cliff effect—once you fill beyond about 40% of the context window, accuracy plummets. Just dumping more code into the prompt makes things worse, not better. His solution is “frequent intentional compaction”—deliberately compressing, validating, and reloading context throughout the development process to keep the AI’s input clean and focused. The damage hierarchy is critical: incorrect context poisons everything downstream, missing info leads to guesswork, and noise wastes tokens but is least harmful. The formula is simple: prioritize correctness first, completeness second, compactness third, and minimize noise.
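The compaction loop is easy to sketch. Here is a minimal illustration in Python; the names are invented, a rough 4-characters-per-token heuristic stands in for a real tokenizer, and the 200K window and 40% ceiling are the figures from Horthy’s framing, not constants from any API:

```python
# Sketch of "frequent intentional compaction": keep context utilization
# under a ~40% ceiling by replacing verbose chunks with summaries.
# All names and heuristics here are illustrative assumptions.

CONTEXT_WINDOW_TOKENS = 200_000   # e.g. a large modern model
UTILIZATION_CEILING = 0.40        # the accuracy "cliff" described above

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose and code.
    return len(text) // 4

def needs_compaction(chunks: list[str]) -> bool:
    used = sum(estimate_tokens(c) for c in chunks)
    return used > CONTEXT_WINDOW_TOKENS * UTILIZATION_CEILING

def compact(chunks: list[str], summarize) -> list[str]:
    # Summarize chunks oldest-first until utilization is back under
    # the ceiling. In practice each summary would be human-validated,
    # since incorrect context poisons everything downstream.
    chunks = list(chunks)
    i = 0
    while needs_compaction(chunks) and i < len(chunks):
        chunks[i] = summarize(chunks[i])
        i += 1
    return chunks
```

The ordering matters: summaries must be checked for correctness before being reloaded, because a wrong summary sits at the top of the damage hierarchy, while a slightly verbose one merely wastes tokens.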
Applying this means three phases: Research—map the architecture and relevant files with fresh context windows and human review; Plan—craft a precise implementation strategy with clear file edits and tests, keeping context load moderate and reviewed by domain experts; Implement—execute the plan with minimal overhead, verifying continuously and compressing status back into context. The insight is counterintuitive: most time should go into research and planning, not code writing. Research yields tenfold return, planning fivefold, implementation just onefold. Humans add the most value by reviewing research and plans, not raw code. Flawed assumptions early on multiply downstream mistakes. As Horthy says, “Do not outsource the thinking.”
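The three phases with their review gates can be sketched as a simple gated pipeline. All function names here are hypothetical; the point is that a human rejection at the research or plan gate halts the run before flawed assumptions multiply:

```python
# Illustrative research -> plan -> implement loop with human review gates.
# The callables are placeholders for AI-driven steps; "human_review" is
# where the highest-leverage human time goes.

def run_feature(research, plan, implement, human_review):
    # Phase 1: map the architecture with a fresh context window.
    findings = research()
    if not human_review("research", findings):
        raise ValueError("incorrect research poisons everything downstream")
    # Phase 2: craft the implementation plan from reviewed findings only.
    steps = plan(findings)
    if not human_review("plan", steps):
        raise ValueError("a flawed plan multiplies implementation mistakes")
    # Phase 3: execute the approved plan with minimal extra context.
    return implement(steps)
```

Note that review happens on research and plans, never on raw generated code, which mirrors the "do not outsource the thinking" principle.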
This approach delivers results. Horthy, an amateur Rust dev new to a 300K-line codebase, produced a one-shot PR approved by the project CTO. Another time, he and a collaborator implemented 35,000 lines of WebAssembly support in seven hours, a task that had been estimated at days of work per engineer. But it’s not magic. They failed to remove Hadoop dependencies from Parquet Java because that required deep architectural understanding that can’t be compressed into context windows. Context engineering works spectacularly for decomposable problems, but not for holistic architectural redesigns. Knowing that boundary is crucial.
Context engineering is gaining traction as a discipline. Martin Fowler defines it as curating what the model sees to improve outcomes—not just prompt phrasing but workflow engineering. Spotify and others have published enterprise-scale approaches. The CLAUDE.md ecosystem exemplifies this: persistent markdown files encoding build commands, coding conventions, architecture decisions, and lazy-loaded skills guide AI tasks. But as Fowler cautions, certainty is impossible with LLMs; you must think probabilistically. Horthy warns against buzzword dilution—if your vendor can’t explain the damage hierarchy, they’re not truly doing context engineering.
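For illustration, a minimal CLAUDE.md might look like the following; the project details are invented, but the shape is typical of the ecosystem:

```markdown
# CLAUDE.md

## Build & test
- Build: `make build`
- Tests: `make test` (always run before proposing a commit)

## Conventions
- TypeScript strict mode; no `any` without a justifying comment
- Errors are returned, never thrown, in the `core/` package

## Architecture decisions
- All persistence goes through `core/store`; never query the DB directly
- Full decision records live in `docs/adr/`
```

Because the file persists across sessions, it acts as pre-validated context: correct, compact, and loaded before every task.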
Here’s the 90/10 rule for CTOs: For roughly 90% of AI coding—simple tasks, greenfield work, small fixes—AI tools yield real 15–40% gains with minimal workflow change. But for the critical 10%—complex tasks in large codebases that determine stability, security, and maintainability—AI without context engineering is neutral or worse. The mistake is expecting the same AI workflow to handle both. Discipline in context engineering bridges that gap.
Open questions remain. Can mid-level engineers learn this discipline? Does it scale from solo experts to teams? What if you lack a domain expert? Cultural leadership is key; tool adoption alone won’t cut it. Meanwhile, senior engineers see the tradeoffs clearly, while juniors produce AI-assisted code that increases technical debt. Context engineering might be the bridge, but it’s unproven at scale.
I’m running experiments applying context engineering to measure where AI helps and where it creates rework, by task and codebase area. The data matches the 90/10 pattern. Routine work sees gains; complex integration demands the full research-plan-implement rigor to avoid net negative outcomes. This is a bet on discipline over tooling. The developers who master context engineering won’t just be faster; they’ll do the work AI can’t do alone. Maybe future models will make this irrelevant, but waiting risks falling behind. The skills—research rigor, structured planning, domain expertise—are valuable no matter what.
So ask yourself: When your team uses AI on complex work, are they investing in research and planning or just generating code faster? Do you measure AI-induced rework? Who on your team is developing context engineering skills—or are you waiting for smarter models? Context engineering makes explicit the bottleneck that’s always been there: understanding the problem well enough to write the right code. Without it, you’re just generating slop faster.
You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.