Lines of Code Are Back (And It's Worse Than Before)
The metric we killed is back, and AI made it worse
The software industry doesn't agree on much. Tabs versus spaces, monoliths versus microservices, whether stand-ups are useful or performance art—pick a topic and you'll find engineers ready to die on either hill. But for about forty years, we had one consensus: lines of code is a terrible metric.
Dijkstra called it "a very costly measuring unit because it encourages the writing of insipid code"; his point was that lines are spent, not produced. Bill Gates compared measuring programming progress by lines of code to measuring aircraft building progress by weight. Ken Thompson said one of his most productive days was throwing away a thousand lines.
In 2009, Tom DeMarco—the man who wrote "you can't control what you can't measure"—formally retracted the statement. Software projects, he concluded, are fundamentally experimental; the important goal is transformation, not control. By 2023, Kent Beck was calling LOC "an input metric"—the worst category. "Only use it if you have nothing else to measure success with."
That was the consensus. Settled. Done.
Then AI showed up, and we brought it back.
The Resurrection
Every major tech CEO is now competing on what percentage of their code is written by AI. Watch the progression.
Sundar Pichai told investors in October 2024 that 25% of Google's new code was AI-generated. By mid-2025, that number climbed past 30%. Satya Nadella said "maybe 20%, 30%" of Microsoft's code is now written by software. Mark Zuckerberg predicted AI would handle half of Meta's development within a year. And Dario Amodei predicted 90% of code would be AI-written within six months; when the deadline passed, he revised the claim to "70, 80, 90% of the code written at Anthropic is written by Claude."
Twenty-five percent. Thirty percent. Fifty percent. Ninety percent. The numbers only go up, and they're presented as achievements—on earnings calls, in press releases, at conferences. Nobody is reporting "percentage of bugs introduced by AI-generated code" or "percentage of AI code that survived review unchanged." Nobody is mentioning how much of that generated code was thrown away, reworked, or never deployed. The headline metric is volume. LOC by another name.
The tooling reinforces it. GitHub Copilot's dashboard shows "Total Lines Suggested" and "Total Lines Accepted" as primary metrics. Cursor tracks lines added per user, reporting a 28.6% increase following adoption. The industry generated 256 billion lines of AI-written code in 2024 alone. That number is treated as progress.
And it's not just executives. The LOC obsession has filtered into social media culture. A viral tweet last week—1.6 million views—celebrated Anthropic's AI agents building a C compiler: "100k lines, compiles the Linux kernel, $20k, 2 weeks." A Community Note corrected the framing: GCC took about two years to build from conception, not the thirty-seven years the tweet's comparison implied. But the correction didn't go viral. The line count did.
One developer compared his AI agents' output—3.2 million lines of code in three months—to his lifetime achievement of 700,000 lines across sixty years. Then he used Grok to generate an argument for why LOC is a valid metric. Using AI to justify the metric that AI makes meaningless. You can't make this up.
I asked a simple question on X last week: "Why are people in the AI space so obsessed with lines of code?" The question got 10,000 views. This article is my answer.
Goodhart's Law With Infinite Leverage
You know Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. LOC was a textbook case of this before AI entered the picture. Developers rewarded for adding lines wrote verbose code; teams measured by output shipped bloat instead of solutions. The industry recognized the problem, and for the most part, we stopped using LOC as a productivity metric.
AI didn't just repeat the mistake. It broke the mistake open.
Think about it in three layers.
Layer one: LOC failed as a human metric because it was gameable. Developers rewarded for adding lines could write verbose code to hit targets. Managers knew this. The industry spent decades documenting the problem. We moved on.
Layer two: AI makes the metric infinitely gameable. When a human developer games LOC, there's friction. Writing unnecessary code takes effort; the gaming has a natural ceiling because a person can only type so fast and only tolerate so much tedium. Remove those limits. An AI can produce ten thousand lines in the time a developer writes fifty. The cost of generating a line of code is now functionally zero. If LOC was misleading when it cost effort to produce, it is meaningless when it costs nothing.
Layer three: we are applying Goodhart's Law to Goodhart's Law. The metric that was already broken is now the target for a system with infinite capacity to game it. The constraint that kept a bad metric merely bad has been eliminated; what's left is a metric that measures nothing at all. We're not repeating a forty-year-old mistake. We're running it with the guardrails removed.
Andrej Karpathy coined the term "vibe coding" in February 2025—"forget that the code even exists." When code generation requires zero comprehension, measuring code volume measures zero comprehension. Greptile's data shows lines per developer grew 76%, from 4,450 to 7,839. More output. Not more understanding.
The question every CTO should be asking: if the cost of generating code is zero, what does the volume of generated code tell you? The answer is nothing. It tells you nothing.
What More Code Costs
The data on what happens when you optimize for volume is already in. The numbers are not encouraging.
GitClear analyzed 211 million lines of code across private repos and 25 major open-source projects from 2020 to 2024. Copy-pasted code rose from 8.3% to 12.3%. Code blocks with five or more duplicated lines increased eightfold during 2024. Refactoring collapsed—the percentage of moved, restructured lines dropped from 24.1% in 2020 to 9.5% in 2024. A 60% decline. And code churn doubled: new code revised within two weeks of commit grew from 3.1% to 5.7%.
Read that again: 2024 was the first year in GitClear's dataset where copy-pasted lines exceeded moved lines. The industry crossed a threshold. We are now generating more duplicate code than we are refactoring existing code. That is the cost of optimizing for volume.
The productivity numbers are worse than the quality numbers. METR ran a randomized controlled trial—sixteen experienced open-source developers, 246 tasks on well-known repositories. Developers using AI tools took 19% longer to complete their work. But they believed they were 20% faster. A nearly 40-point perception gap between what developers think AI does for them and what it measurably does.
The Stack Overflow 2025 Developer Survey reinforces this. Trust in AI accuracy fell from 40% to 29% year over year. More developers actively distrust AI tools (46%) than trust them (33%). And 66% say they spend more time fixing "almost-right" AI-generated code than they save in the initial writing phase.
On the security side, 45% of AI-generated code contains security flaws according to Veracode's 2025 report. Vibe-coded applications are already failing in production; one high-profile exercise saw AI ignore a code freeze, fabricate data, and delete a production database. A Swedish vibe-coding platform shipped 170 apps with exploitable vulnerabilities out of 1,645 tested.
More code. Worse code. Less understood code. And we're measuring the "more" as if it were a feature.
Acceptance Rate Is Not Better
The industry recognized that raw LOC was indefensible, so it found a replacement: acceptance rate. The percentage of AI-suggested code that developers accept. This is the metric on most engineering leaders' dashboards today.
It suffers from every flaw LOC had, plus new ones.
Accepting code doesn't mean it's good code. A developer might accept a suggestion because it's close enough, because they're tired of rejecting and rewriting, because the context-switching cost of evaluating each suggestion exceeds the cost of just taking it. Acceptance rate conflates "not rejected" with "valuable"—and those are not the same thing.
As CodeRabbit put it: "Most tooling gives you vanity metrics like lines of code generated and number of AI completions accepted, which tell you nothing about what happens after the AI writes code." The metric ends at the moment of acceptance. It says nothing about whether the code worked, whether it introduced bugs, whether someone understood it, whether it survived the next refactor.
The pattern keeps repeating. Lines of code, function points, story points, velocity, acceptance rate—each generation of metric gets critiqued by its own advocates, discarded, and replaced with something that measures the same wrong thing in a new wrapper. We keep looking for a number that captures developer productivity in a single figure, and we keep finding that no such number exists. Sixty percent of engineering leaders cite a lack of clear metrics as their biggest AI challenge. They know the current metrics are broken. They just don't know what to replace them with.
Where This Argument Breaks Down
LOC is not always meaningless. Stating otherwise would be dishonest.
As a rough sizing metric—not a productivity metric—lines of code can help estimate project scope. Tracking codebase growth over time can signal maintainability concerns before they become crises. At the aggregate level, LOC trends reveal how work is changing across the industry; Greptile's reports use LOC data to show real patterns in how developers interact with AI tools. And as an adoption metric—how much AI-generated code is entering your codebase—LOC indicates tool usage levels, even if it says nothing about value delivered.
AI coding tools are also not the problem. The problem is how we measure them. Salvatore Sanfilippo—antirez, the creator of Redis—makes a compelling case that AI genuinely enables building things faster when you know what to build. He created a pure C library for BERT-like embedding models in five minutes: 700 lines of code with output comparable to PyTorch. The value was in his decades of knowing what to build; the AI handled the typing. That's a legitimate productivity gain.
MIT Technology Review named generative coding one of ten Breakthrough Technologies for 2026. The recognition is deserved. These tools are useful for boilerplate, for exploring unfamiliar APIs, for rubber-ducking problems, for rapid prototyping. I use them. Most CTOs I know use them.
The argument is not that AI coding tools are bad. The argument is that measuring their value by counting the code they produce is like measuring a surgeon's skill by how many incisions they make. More incisions is not better surgery. More code is not better software. The metric rewards the wrong thing.
What to Measure Instead
If LOC and acceptance rate are broken, the obvious question is: what should replace them?
The answer requires a fundamental shift in what you're looking at. Stop measuring inputs—lines generated, suggestions accepted, percentage of code from AI. Start measuring outcomes—what happened to the software and the team after the code was written. This is harder. It requires more instrumentation, more judgment, more patience. It also measures something worth knowing.
Four metrics survive Goodhart's Law because they're hard to game and they measure what matters.
Time-to-value. Not "how fast did we write code" but "how long from identified need to working feature in production?" AI should compress this timeline. If it doesn't, the code volume is noise. This is the metric your board cares about even if they don't know the name for it; it maps directly to customer impact and revenue. When a CEO asks "what is AI doing for us," the answer should be a time-to-value number, not a line count.
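To make that concrete: here is a minimal sketch of the calculation, assuming you can export two timestamps per shipped feature, one for when the need was identified (the ticket was opened) and one for when it reached production. The records and field names below are hypothetical; the point is that the unit is days, not lines.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical export: one record per shipped feature, with the timestamp the
# need was identified (ticket opened) and the timestamp it reached production.
features = [
    {"opened": "2025-06-02T09:14:00", "deployed": "2025-06-11T16:40:00"},
    {"opened": "2025-06-05T13:02:00", "deployed": "2025-06-20T10:05:00"},
    {"opened": "2025-06-09T08:30:00", "deployed": "2025-06-12T17:55:00"},
]

def time_to_value_days(record: dict) -> float:
    """Days from identified need to working feature in production."""
    opened = datetime.fromisoformat(record["opened"])
    deployed = datetime.fromisoformat(record["deployed"])
    return (deployed - opened) / timedelta(days=1)

durations = [time_to_value_days(f) for f in features]
print(f"Median time-to-value: {median(durations):.1f} days")
```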
Code half-life. How long does new code survive before it needs revision? GitClear's churn data shows AI code gets revised faster—new code rewritten within two weeks nearly doubled from 2020 to 2024. Healthy code has a long half-life. Code that gets rewritten in fourteen days was never finished. Track this by origin; if AI-generated code has a shorter half-life than human-written code, that tells you something LOC never will.
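You can approximate this from your own history with nothing but git. The sketch below is a coarse, file-level proxy, under the assumption that "how soon is a file touched again" is an acceptable stand-in for line survival; a line-level version would need git blame, and none of this is GitClear's actual methodology.

```python
import subprocess
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median

# Coarse proxy for code half-life: for every commit that touches a file,
# how many days pass before the same file is touched again?
log = subprocess.run(
    ["git", "log", "--reverse", "--name-only", "--pretty=format:>>%cI"],
    capture_output=True, text=True, check=True,
).stdout

touches = defaultdict(list)  # file path -> timestamps of commits touching it
current = None
for line in log.splitlines():
    if line.startswith(">>"):
        current = datetime.fromisoformat(line[2:])  # committer date, ISO 8601
    elif line.strip() and current is not None:
        touches[line.strip()].append(current)

# Days between consecutive touches of the same file.
gaps = []
for stamps in touches.values():
    gaps += [(b - a) / timedelta(days=1) for a, b in zip(stamps, stamps[1:])]

if gaps:
    revised_fast = sum(g <= 14 for g in gaps) / len(gaps)
    print(f"Median days before a change is revised: {median(gaps):.1f}")
    print(f"Share of changes revised within two weeks: {revised_fast:.1%}")
```

Run the same calculation separately for AI-assisted and human-written commits, however you label them, and the comparison this metric asks for falls out directly.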
Defect origin rate. What percentage of production defects trace back to AI-generated code versus human-written code? Not as a blame metric—as a calibration metric. If AI-generated code introduces defects at a higher rate, you need more review, not less AI. Track the ratio; adjust your process accordingly.
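Here is a sketch of that calibration, with one loud assumption: that your team labels how each commit was written, for example with a trailer on AI-assisted commits, and traces each production defect back to the commit that introduced it. The incident data below is invented; the output that matters is the ratio of defect share to commit share.

```python
from collections import Counter

# Hypothetical incident log: each production defect traced (e.g. via git blame
# on the fix) to the commit that introduced it, tagged with how that commit
# was written. The "origin" label assumes your team marks AI-assisted commits.
defects = [
    {"id": "INC-101", "origin": "ai"},
    {"id": "INC-102", "origin": "human"},
    {"id": "INC-103", "origin": "ai"},
    {"id": "INC-104", "origin": "human"},
    {"id": "INC-105", "origin": "ai"},
]

# Share of all commits from each origin over the same period (also assumed
# known, e.g. counted from the same commit labels).
commit_share = {"ai": 0.40, "human": 0.60}

defect_counts = Counter(d["origin"] for d in defects)
total = sum(defect_counts.values())

for origin, share in commit_share.items():
    ratio = (defect_counts[origin] / total) / share  # defects relative to volume
    print(f"{origin}: {defect_counts[origin]}/{total} defects, "
          f"{ratio:.2f}x its share of commits")
```

In this invented example, AI-origin changes cause defects at 1.5 times their share of the work; that ratio, not the line count, is what tells you where to spend review effort.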
Comprehension coverage. Can someone on the team explain how every critical path in the system works? This is the metric nobody tracks and everybody should. If the answer is "the AI wrote that and nobody reviewed the logic," you have a time bomb. Vibe coding makes this worse by design; Karpathy's own framing was to "forget that the code even exists." Code that nobody understands is code that nobody can debug, extend, or secure.
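There is no off-the-shelf tool for this, but you don't need one. A plain ledger of critical paths and who can currently explain them is enough to put a number on it; everything in the sketch below, paths and names included, is hypothetical.

```python
# Hypothetical comprehension ledger: every critical path in the system, and
# whether someone on the team can currently walk through how it works.
critical_paths = [
    {"path": "checkout/payment capture", "explainer": "Maya"},
    {"path": "auth/session refresh", "explainer": "Jonas"},
    {"path": "billing/proration engine", "explainer": None},  # AI-written, never reviewed
    {"path": "search/index rebuild", "explainer": "Maya"},
]

covered = sum(1 for p in critical_paths if p["explainer"])
print(f"Comprehension coverage: {covered}/{len(critical_paths)} "
      f"({covered / len(critical_paths):.0%})")
```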
The meta-principle: good metrics measure what happened after the code was written. Bad metrics measure what happened during writing. LOC, acceptance rate, lines suggested—all measure the act of creation. Time-to-value, code half-life, defect origin, comprehension coverage—all measure the result. The act of writing code has never been the bottleneck; understanding, design, and judgment are the bottleneck. Measure accordingly.
What I'm Doing
I'm doing what I have always done: at Demac, at Humi, and now at LiORA, we track time-to-value, customer impact, and customer trust. We do not measure the volume of code we generate, because it is not a meaningful signal.
Building the right things, at the right pace, with the right quality, is the key to success for any startup, and for any business.
These metrics are harder to measure than counting lines; that's the point. If a metric is easy to collect, it probably measures inputs. The useful metrics require you to follow the code past the point of creation and into production.
We use AI tools throughout the engineering org, mostly to assist with reviewing code rather than writing it, which helps shorten our time-to-value.
I might be wrong about some of this. Maybe the industry will figure out how to make volume a meaningful signal. But I'd rather measure the hard things poorly than measure the easy things precisely; at least the hard things point in the right direction. And I'd rather explain to my board why our metrics are nuanced than explain why we shipped code nobody understands.
When your board asks what percentage of code is AI-generated, what are they asking? And is the answer you're giving them what they need to hear?
If your AI tools disappeared tomorrow, would your team ship slower—or just write less code?
What percentage of your codebase can someone on your team explain from memory? Is that number going up or down?
The bottleneck in software was never typing speed. It was understanding, design, and judgment. LOC measured the wrong thing when humans wrote code. It measures even less now that machines do. The question for every CTO is not "how much code are we generating?" It is "how much of that code should exist at all?"


