The Pragmatic CTO Podcast

When Tokens Become OKRs

Boards are increasingly fixated on tracking tokens consumed per engineer as a KPI for AI adoption, but the real question never gets asked: what did those tokens actually build? This obsession with measuring token usage is emblematic of a broader problem—metrics are being treated as objectives rather than indicators, turning engineering productivity into a game of consumption rather than value creation.

The corporate cycle is painfully predictable. A CEO or CTO mandates AI adoption, engineering leadership scrambles for a number to report, vendors supply whatever metric is easiest to track—lines of code, tokens consumed, acceptance rates—and that metric becomes a target. Once that happens, work bends to inflate the metric, not to improve the product. This is classic Goodhart’s Law in action: when a measure becomes a target, it ceases to be a good measure.

Lines of code made a comeback with AI-first pushes, and the predictable result followed: an eightfold increase in duplicated code blocks. The cheapest way to look productive is to generate more code, regardless of necessity or quality. Tokens consumed suffer the same fate, only worse: they’re a vendor’s billing metric, not a measure of business value or engineering impact. Tracking tokens is like equating AWS spend with productivity; consumption is not output.
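
If duplication is the failure mode, it is also cheap to watch for. Here is a minimal sketch of one way to track it: hash sliding windows of normalized lines across a source tree and count blocks that appear more than once. The six-line window and the src/*.py glob are illustrative assumptions on my part, not the methodology behind the eightfold figure above.

```python
# Minimal sketch (illustrative assumptions): detect duplicated code blocks
# by hashing sliding windows of normalized lines across a source tree.
import hashlib
from collections import Counter
from pathlib import Path

WINDOW = 6  # flag any 6-line block that appears more than once (assumption)

def block_hashes(path: Path):
    """Yield a hash for every WINDOW-line block of non-blank code in a file."""
    lines = [l.strip() for l in path.read_text(errors="ignore").splitlines()]
    lines = [l for l in lines if l]  # drop blank lines before windowing
    for i in range(len(lines) - WINDOW + 1):
        chunk = "\n".join(lines[i : i + WINDOW])
        yield hashlib.sha1(chunk.encode()).hexdigest()

counts: Counter[str] = Counter()
for f in Path("src").rglob("*.py"):  # hypothetical path: point at your own tree
    counts.update(block_hashes(f))

duplicated = sum(1 for c in counts.values() if c > 1)
print(f"duplicated {WINDOW}-line blocks: {duplicated} of {len(counts)} distinct")
```

A count like this, plotted against LOC growth over time, tells you whether “more code” is actually more duplication.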

Why are so many executives buying into these flawed metrics? Vendors have a clear incentive to promote consumption-based measurement. Take Anthropic’s 2026 Agentic Coding Trends report: it pitches engineers as orchestrators of AI agents, a role that by design consumes more tokens. The narrative conveniently aligns with the vendor’s business model, encouraging higher token usage and thus revenue. Microsoft’s mid-2025 internal memo declaring AI use fundamental entrenched the same approach from the buyer’s side.

Yet these mandates have backfired. Klarna claimed AI had replaced hundreds of customer service agents, only to admit that the cost-cutting compromised quality and to start rehiring humans. The promise of AI-driven productivity often ignores the hidden costs and quality trade-offs. Vendor reports themselves reveal the cracks: developers use AI in 60% of their work, yet can fully delegate only 0 to 20% of tasks. And about a quarter of AI-assisted work is scope expansion, building tools and dashboards that wouldn’t otherwise have existed, which is new work rather than a straight productivity gain.

Independent studies confirm this. METR’s 2025 trial showed a 19% slowdown for developers using AI tools, even as those developers estimated they were 20% faster. Faros AI telemetry revealed that AI-heavy teams saw PR review times nearly double, PR sizes grow by 150%, and bug rates rise. The core bottleneck is review capacity, not code generation: AI makes code cheap to produce, but human evaluation remains the limiting factor.

So what should we do instead? First, recognize that the real constraint is review capacity. Push for smaller PRs, incremental changes, and human-readable context. Track regressions and defect rates to ensure quality isn’t sacrificed in the rush to consume tokens. Second, replace mandates with structured exploration time. The best AI adoption comes from teams given permission to experiment and fail, not from forced consumption targets. Third, measure ownership, not consumption. Ask who on the team can confidently defend the AI-generated code in production at 3 a.m. That’s a far better proxy for real impact than tokens consumed.
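
Two of those signals are measurable today with data most teams already have. Below is a minimal sketch, assuming a GitHub-hosted repo and a GITHUB_TOKEN environment variable with read access; the your-org/your-repo name is a placeholder. It pulls median review latency and median PR size for recently merged PRs from the GitHub REST API, numbers worth trending alongside defect rates instead of token counts.

```python
# Minimal sketch (assumptions: GitHub-hosted repo, GITHUB_TOKEN env var,
# "your-org/your-repo" is a placeholder). Computes median review latency
# and median PR size for recently merged PRs via the GitHub REST API.
import os
import statistics
from datetime import datetime

import requests  # third-party: pip install requests

API = "https://api.github.com"
REPO = "your-org/your-repo"  # placeholder, not a real repo
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def merged_prs(repo, pages=2):
    """Yield recently closed PRs that were actually merged."""
    for page in range(1, pages + 1):
        resp = requests.get(
            f"{API}/repos/{repo}/pulls",
            params={"state": "closed", "per_page": 50, "page": page},
            headers=HEADERS,
            timeout=30,
        )
        resp.raise_for_status()
        for pr in resp.json():
            if pr.get("merged_at"):
                yield pr

def hours_open(pr):
    """Hours between PR creation and merge."""
    created = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    merged = datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00"))
    return (merged - created).total_seconds() / 3600

def lines_changed(pr):
    """Additions/deletions live on the single-PR endpoint, not the list."""
    detail = requests.get(pr["url"], headers=HEADERS, timeout=30).json()
    return detail["additions"] + detail["deletions"]

latencies, sizes = [], []
for pr in merged_prs(REPO):
    latencies.append(hours_open(pr))
    sizes.append(lines_changed(pr))

if latencies:
    print(f"median hours open before merge: {statistics.median(latencies):.1f}")
    print(f"median PR size (lines changed): {statistics.median(sizes):.0f}")
```

None of these numbers is a target in itself; the point is to watch whether review latency and PR size drift upward as token consumption climbs.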

Agentic coding makes code “cheaper,” but that cheapness is borrowed from vendors whose pricing, model quality, and output you don’t control. Models can change overnight, degrading quality and putting your pipeline at risk. Code quality and accountability remain our responsibility as CTOs, so human oversight of design, architecture, and final review must stay central; otherwise we take on the risk and the technical debt ourselves.

I’m actively experimenting with this balance through side projects like Structpr.dev and Shiplog.ca, testing workflows to find where human input adds value beyond AI generation. Some days the AI saves hours, other days it creates more work. The pattern of where to draw the line between human and agent is still emerging, and I’ll share those learnings when they solidify.

When you’re next in the boardroom and asked for AI adoption metrics, consider what that number actually measures once you strip away vendor spin. If your token budget tripled tomorrow, would your output improve or just get bigger? Who on your team can explain critical paths in the codebase from memory—and is that number rising or falling? Remember, the narrative that engineers are becoming orchestrators primarily serves vendor business models, not your product’s quality or your team’s effectiveness. Mandates and dashboards may look good on paper, but they don’t guarantee real progress.

You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.

