The Pragmatic CTO

Audio: Why Most Tech Acquisitions Destroy the Thing They Paid For

Allan MacGregor 🇨🇦 — Tue, 19 May 2026 12:00:30 GMT

Most tech acquisitions destroy the value they paid for. It’s not just bad luck or a few missteps — it’s a pattern backed up by data and decades of experience. Yahoo bought Tumblr for $1.1 billion and sold it for $3 million six years later. Salesforce paid nearly $28 billion for Slack only to see its founders walk away and a culture clash that employees openly lamented. Broadcom’s $69 billion VMware deal led to massive layoffs and customer revolts over forced subscription changes. Google’s $3.2 billion Nest acquisition ended with the founder leaving amid internal fights. HP’s $11 billion Autonomy acquisition resulted in an $8.8 billion writedown, mostly due to HP’s own botched integration. These aren’t outliers. Studies show 70 to 90 percent of acquisitions fail to create shareholder value. The brutal truth is most acquisitions destroy more than they create.

To understand why, you need to know what an acquirer is really buying. It boils down to three things: the product, the team, and the book of business. The product is the easiest to keep because it’s tangible — code, IP, infrastructure. But a product without the team that built it ages fast. After an acquisition, the engineering bench usually shrinks, and specialized knowledge evaporates. The product decays when the people and culture that made it don’t stick around. Yet acquirers routinely assume product alone is enough.

The team is the hardest to keep — and where the real value lives. MIT Sloan analyzed data on hundreds of thousands of acquired startup workers and found a third leave within the first year, compared to 12 percent for regular hires. EY’s data is worse: nearly half gone in year one, three quarters by year three. Founders don’t leave because of money or performance pressure. They leave because the new environment kills the autonomy, speed, and risk-taking that made the team successful. You pay a premium for a culture of independence, then your integration playbook crushes it. Retention bonuses don’t help; they create golden handcuffs where people stay physically but check out mentally.

The book of business is the most fragile asset. Customers don’t stay loyal to logos, they stay loyal to people, terms, and responsiveness. Change the terms, the support, or the relationships, and a chunk of revenue walks. In reality, acquirers often lose 15 to 30 percent of revenue in year one post-acquisition, far from the 100 percent retention rate deal models assume. This fragility compounds when acquirers impose new pricing or kill off features that don’t fit their vision.

What acquirers consistently get wrong is assuming tacit knowledge transfers, trust networks survive reorgs, customer loyalty is portable, decision velocity is a property of individuals rather than the whole system, and that absences are accidental. None of these are true. The code and brand might transfer, but the institutional memory, informal relationships, and decision speed evaporate. Worse, acquirers often ignore what the startup deliberately didn’t build—features they avoided, customer segments they skipped, meetings they refused to hold—and walk right into the pitfalls the startup carefully avoided.

Culture isn’t a soft metric. It’s the difference between the acquisition creating value or destroying it. The acquiring company promises to preserve autonomy and culture, but the reality is that structural incentives inside large organizations push toward integration, control, and standardization. Finance demands margin expansion, IT demands a single stack, HR demands uniform comp bands, and legal demands one set of contracts. Autonomy loses every time because it has no champion. This slowly kills decision-making and product velocity. The best people leave first. The company that remains shares only a name with the original.

By eighteen months post-acquisition, the founders are usually gone, most early hires have left, and the product’s momentum is dead. The writedown that follows is just accounting catching up. Nearly a third of founders who leave quickly start new startups, often competing directly with the company that just bought them. The purchase price often underwrites the seed round for the next competitor. Warren Buffett captured the pattern decades ago: deals never fail in projections because executives and consultants supply rosy forecasts, but they routinely fail in practice.

So what am I doing differently? First, I’m building a culture that doesn’t depend on a single founder’s presence. It’s documented, practiced, and legible so anyone can understand what makes it work. Second, I focus on retention through meaningful autonomy and ownership, not golden handcuffs. Third, I prioritize documenting knowledge as insurance against inevitable transitions. We record engineering decisions and context while the work happens, not after memory fades. Finally, I commit to honest conversations about uncertainty when acquisition questions arise. People can handle not knowing if you’re transparent; what kills morale is pretending everything’s fine while the ground shifts beneath them.

The bet is simple: companies that build cultures worth preserving either won’t need to be acquired or will survive acquisitions better than those that don’t. There are no guarantees. But the alternative is the 70 percent failure rate that’s become the norm.

You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.

Read the full article — with all the data and sources — on ThePragmaticCTO.

Why Most Tech Acquisitions Destroy the Thing They Paid For

Allan MacGregor 🇨🇦 — Tue, 19 May 2026 12:00:30 GMT

Over the years I had the opportunity to be part and/or watch several acquisitions; from both sides of the table. Getting acquired is considered a huge career milestone for founders and executives alike.

Even in the best scenarios and with the best intentions, most acquisitions destroy the value they paid for. Outside of my personal experience, there hundreds of examples of acquisitions that went sideways. For example:

Yahoo acquired Tumblr for $1.1 billion in 2013. CEO Marissa Mayer promised to "not screw it up." Within two years, Yahoo had written down $230 million in one quarter and another $482 million the next---over half the purchase price erased in six months. By 2019, Tumblr sold to Automattic for $3 million. A 99.7% destruction of value.
Salesforce paid $27.7 billion for Slack in 2021. Stewart Butterfield, the founding CEO, made it eighteen months. Cal Henderson, the CTO and last remaining cofounder, made it thirty. By early 2024 every cofounder was gone and Slack was being run out of Salesforce. A leaked all-hands surfaced employees calling it a "strong culture clash." Butterfield was less diplomatic on his way out: "There's no incorporation of the Slack culture into the Salesforce culture. And unless there is some element of that, then it's not integration in any sense. It's just the elimination."
Broadcom closed its $69 billion acquisition of VMware in November 2023 and laid off 401 employees within the first week. Over 2,800 followed. VMware's president left; senior staff who had watched Broadcom's prior playbook at CA Technologies retired preemptively rather than wait. The goal was straightforward---more than double VMware's EBITDA to $8.5 billion within three years, primarily through cost-cutting. Customers got a forced transition from perpetual licenses to annual subscriptions; employees got return-to-office mandates within sixty miles of a Broadcom office. Financial extraction, not integration.
Google bought Nest for $3.2 billion in 2014. Founder Tony Fadell left after two and a half years of internal fighting and product stagnation. Alphabet reportedly tried to sell Nest in 2016---two years after paying $3.2 billion for it.
HP acquired Autonomy for $11.1 billion in 2011 and took an $8.8 billion writedown. HP blamed accounting fraud at Autonomy and years of litigation followed, but a UK court found that 80% of HPE's losses had nothing to do with the alleged misconduct. HP's own mismanagement of the integration was the primary cause. They destroyed the value, then blamed the seller for it.

All different industries, acquiring companies, and yet the same unfortunate outcome. Roger Martin, writing in HBR, called M&A "a mug's game, in which some 70% to 90% of acquisitions are abysmal failures." McKinsey found that 61% of acquisition programs didn't earn back their cost of capital; Bain pegged it at 70% failing to increase shareholder value.

Most acquisitions destroy more than they create.

What You're Buying

In order for us to understand why most acquisitions fail, we need to understand what is being bought in the first place. Strip away the self-congratulatory press release language and every acquisition is buying some combination of three things: the product, the team, or the book of business. Most acquirers tell themselves they're buying all three. Most integration playbooks only know how to keep one of them — and sometimes they can't even do that.

Product is the easiest to keep, and the least valuable on its own

The codebase, the IP, the stack: those are snapshots. You can audit them, document them, port them to your own infrastructure. They behave the way assets are supposed to behave on an acquirer's balance sheet, but fail to account for the fact that they are often built by a team that is no longer there.

A product without the team that built it ages exactly as fast as the engineers you have left to maintain it — and after an acquisition, the engineering bench is usually thinner than the deal model assumed it would be. This becomes even more pronounced when the product required industry or specialized knowledge that is difficult to build or replace; for example products that operate in highly regulated industries or require specialized hardware or software.

The product is the part of the deal you most reliably take possession of and the part that decays fastest without the team and the culture that built it. This also the number one mistake acquirers make.

The team is the hardest to keep, and where most of the value actually lives

Let's look at the numbers around retention, once the aquisition closes MIT Sloan ran U.S. Census Bureau data on 230,000 acquired startup workers over twenty years — 33% of them walked within the first year, compared to 12% for regular hires. Almost triple the churn, on a sample size big enough that it isn't really arguable. EY pegs it worse: 47% gone in year one, three-quarters gone by year three. Founders specifically: 36% leave within six months, 75% are gone by eighteen months, fewer than a quarter remain at three years.

It has very little to do with comp or titles. MIT Sloan researchers noted that "a larger, more established firm has varying levels of bureaucracy and a formal corporate culture" while startups attract workers who prefer risk-taking and autonomous environments. More than half of founders who leave do so because they no longer fit in — not performance pressure, not money. They just don't recognize the place anymore. The qualities that made the team valuable in the first place — autonomy, speed, the willingness to break process to ship — are the same qualities that make them walk out when those conditions disappear.

You paid a premium for a team built on independence and then your integration playbook removed the independence, the insights, and the culture that built the company worth acquiring in the first place.

Retention doesn't mean engagement. Earn-outs and retention bonuses turn into golden handcuffs — people physically present but mentally counting down to a vesting cliff. That's attendance with a deadline, not retention.

The book of business is the most fragile line item in the deal

Customers weren't loyal to a logo. They were loyal to specific people, specific terms, specific responsiveness. Change the terms and a meaningful chunk of the book walks. Broadcom learned this in real time when it forced VMware customers off perpetual licenses and onto annual subscriptions — the customer-revolt headlines started inside the first year. Channel partners and distributors run the same dynamic in the background; they negotiated their margins and SLAs with one company, and now they're being asked to renegotiate with another, usually on worse terms.

Most deal models assume something close to 100% revenue retention in year one. The actual range, in my experience and the data, is closer to 70-85% — and that's before the acquirer starts changing pricing or sunsetting the features that didn't fit the parent company's "vision". The book of business is the easiest line to model and the hardest to actually retain at the modeled value.

What acquirers take for granted

Across every deal in the intro — Tumblr, Slack, VMware, Nest, Autonomy — the acquirers made some version of the same set of unforced errors. The specifics vary; the assumptions don't.

They assumed tacit knowledge would transfer. It doesn't. The codebase ships; the institutional memory walks out with the engineers. Most of it was never written down because it didn't need to be when the people who knew it sat two desks over.
They assumed trust networks survive a reorg. They don't. Informal relationships, internal shortcuts, the "go ask Jeff" reflex — none of that lives in the org chart. Reorganize the boxes and you've severed the connections that were getting work done.
They assumed customer loyalty was portable. It isn't. The brand transfers; the relationship doesn't, especially when the support engineers who built it have been folded into a centralized helpdesk three time zones away. This can be even worse when dealing with an international acquirer, who may not understand the local market and the local customers.
They assumed decision velocity was a property of the team. It isn't — it's a property of the team plus the org around it. Even if every person stays, decisions now route through approval cycles, legal review, brand guidelines. The product slows down with the same engineers in their seats.
They assumed the absences were accidental. Most weren't. What the startup deliberately didn't build — the features it refused to ship, the customer segments it refused to serve, the meetings it refused to hold — was knowledge acquired through years of trial and error. The acquirer thinking that knows better may bring those things back on the table, and walk right into the landmines the startup avoided.

A honest version of an acquisition memo would say: we are buying a product we think we know how to integrate, a team we will probably lose, and a book of business that will partially churn the moment we touch the things that made sign in the first place. That is the deal. The shareholder pitch never reads that way, and the gap between the two is roughly the size of the writedown that shows up two years later.

Culture Isn't a Soft Metric

Early stages of the acquisition the acquiring team will often try to reassure the team that they will be able to maintain their autonomy and culture. They will promise to keep the team together, to maintain the culture, to allow the team to continue to operate as they did before the acquisition. That both companies cultures are aligned and nothing will change.

More often than not, the acquirer will not be able to keep their promises.

The academic literature calls it the coordination-autonomy dilemma. Acquirers have to integrate the company in order to be able to operate; they also have to preserve organizational autonomy to avoid destroying the conditions that made the company worth acquiring in the first place.

The two pull in opposite directions, and the structural incentives inside any large organization all push toward integration. Finance wants margin expansion, IT wants a single stack, HR wants one comp band, legal wants one set of contracts. Autonomy doesn't have a department to fight for it, so it loses every quarterly review.

There is plenty of data to support this:

PwC found that 65% of acquirers say cultural issues hindered value creation in their most recent deal.
Deloitte attributes 30% of failed integrations to cultural differences.
FranklinCovey estimates up to 30% of M&A value loss comes from unresolved cultural friction.

This slowly chips away at the team's ability to make decisions and ship product; burns them out and slowly kills their motivation and their ability to ship product.

I've been there myselft, the deal closes with promises — preserve the culture, maintain autonomy, keep the team intact. Then the benefits start getting trimmed inside the first quarter. The brand goes quiet a few months in. Titles get changed without anyone asking, and local leadership starts losing arguments to a head office that doesn't really know the market or the product or the people. The acquirer's processes get rolled in wholesale, not because they're better, but because standardization is easier than understanding.

The best people leave first. They always do; they have options and they use them. What's left after a couple of years shares a name with the original company and not much else.

The promise is always integration but the reality is always absorption. It happens whether the acquirer means well or not, because the structural incentives all push toward control — and control suffocates whatever made the acquisition worth doing in the first place.

The 18-Month Cliff

By month eighteen, the founders have usually left, most of the early hires went with them, and whatever product velocity made the company worth acquiring is gone. The writedown that follows is just the accounting catching up, and it usually doesn't land for another year or two on top of that. Deal models price for year-one revenue synergy; they were never built to capture the compounding mechanics that decide whether the acquired company is still recognisable twenty-four months in.

The people who leave aren't just lost headcount either. Almost a third of founders who leave shortly after an acquisition launch a new startup within six months. They take their context with them — not just the technical knowledge, but the customer relationships, the design judgement, and the hard-won understanding of why certain bets paid off and others didn't. A meaningful share end up competing directly with the company that just paid to acquire them. The purchase price effectively underwrote the next competitor's seed round.

Warren Buffett saw the broader pattern decades ago: "While deals often fail in practice, they never fail in projections---if the CEO is visibly panting over a prospective acquisition, subordinates and consultants will supply the requisite projections to rationalize any price." None of this damage was in the projections to begin with, the post-mortem won't capture it either, and the next deal will use the same model with a different logo on the cover.

What I'm Doing

Having gone through a couple acquisitions, I'm speaking from experience, not just the data or third party articles, here is what I'm doing to avoid the same mistakes:

Building culture that doesn't depend on a single owner. Culture is documented, practiced, and distributed---not held in one founder's head. If we were ever acquired, the culture should be legible enough that an acquirer could understand what they're buying and why it works. Most don't bother. Making it explicit at least gives them the chance.
Retention through meaning, not handcuffs. Golden handcuffs keep people physically present; they don't keep them invested. Retention is built on autonomy, ownership of outcomes, and career development---the things that make engineers stay because they want to, not because they're waiting out a vesting schedule.
Knowledge documentation as insurance. Institutional knowledge that only lives in people's heads is the first thing to go in any transition — acquisition, leadership change, plain old attrition. So we write down the engineering decisions, the architecture trade-offs, and the reasoning behind systems while the work is happening, not six months later when nobody remembers why we picked option B over option A.
Honest conversations about uncertainty. When somebody on the team asks about acquisition risk, they get an honest answer. People can handle not knowing — what they can't handle is being told everything's fine while the ground shifts under them. Deloitte's number is 3.5x better retention when communication is "effective," which in my experience mostly means not lying.

The bet is simple enough: companies that build cultures worth preserving either won't need to be acquired or will survive being acquired better than the ones that don't bother. Although there's no guarantee in any of that. But the alternative is the 70% failure rate, and at some point you have to try to even the odds.

When Tokens Become OKRs

Allan MacGregor 🇨🇦 — Tue, 12 May 2026 18:53:55 GMT

The Boardroom Slide

Slides like this one are becoming commonplace in the boardroom. It is not satire, many organizations are chasing metrics over substance. In what I think is a futile attempt to quantify the impact of AI; and justify what can only be described as AI FOMO (Fear of Missing Out), CTOs are mandating agentic coding adoption, tracking lines of code, and now tracking token usage as engineering KPIs.

The reaction in the room is always the same. Someone leans forward, points at the bar at the bottom of the chart, and asks why that engineer's number is so low. Someone else asks how we get the team average up next quarter. The conversation that follows is about goals, targets, individual development plans for low consumers. Nobody in the room is asking what the engineers built with those tokens.

Twelve months ago, the conversation was about whether engineers would use AI tools at all. In April 2025, Tobi Lütke told Shopify that "reflexive AI usage is now a baseline expectation," and that teams would have to demonstrate why a job could not be done with AI before requesting headcount.

Tobi is not alone in this and many others have followed suit. In May 2025, Klarna CEO Sebastian Siemiatkowski told investors that "AI can already do all our jobs," and that "cost unfortunately seems to have been a too predominant evaluation factor… what you end up having is lower quality." Another CEO with a 1500-person company told the team "you have geniuses in your pocket, stop thinking and use AI".

It's 2026, and all these AI-first, AI-only mandates now have metrics attached. CTOs are reporting tokens consumed per engineer, agent invocation counts, autocomplete acceptance rates, and pull requests with an AI co-author tag. Boards take all of it as engineering productivity and nod along.

Through my entire career I have been incredibly opinionated about developer experience and productivity metrics. Frankly, all of these metrics are garbage. None of them measure the product the engineers are building. They measure how much the vendor billed the org last quarter, sorted by employee.

How Mandates Become Metrics

Here is the eternal corporate cycle: a CEO or CTO declares that the org will adopt X, which means engineering leadership now needs a number to report upward. The easiest number to report is whichever one the vendor already publishes on a dashboard. That number flows upward, gets adopted as a target, and by the next quarter the work has started bending to flatter it.

As an industry we have gone through this cycle so many times before, lines of code, velocity, story points, deploys per week, commits, etc. etc. etc. Every time we have collapsed a metric into a target, the metric has eventually rotted into theater, and what was once useful becomes purely performative.

Not a new idea, after all Goodhart's Law has been with us for over fifty years, and as Marilyn Strathern put it so well: "When a measure becomes a target, it ceases to be a good measure."

Software engineers are smart (at least if you are doing your hiring right) and they will eventually find a way to game the metric, doesn't matter what the metric is. What the industry is currently doing with tokens and lines of code is setting the wrong incentives, and creating massive problems down the line.

For example, lines of code made a comeback last year as part of the AI-first push, the results are predictable: GitClear measured an 8x increase in duplicated code blocks during 2024. The cheapest way to look productive on the metric is now to ship more code, regardless of whether the code needed to exist.

Of course, more lines of code means more bugs, more maintenance, more complexity, and more risk; none of these things are accounted for in the metric or tracked, or factored into the cost of the code.

Tracking token usage has the same deficiencies as lines of code, and the same problems down the line; but is also objectively worse, because you are tracking a metric that your vendor cares about, but is meaningless to the org, to your customers, and to your business.

Tracking tokens is the equivalent of reporting on AWS spend, and equating a larger bill with higher productivity; the problem is that the bill is not a measure of productivity, it is a measure of consumption, just like tokens are.

The Vendor Narrative

So you might be thinking, "Ok, if all these metrics are garbage, why are all these C-level executives mandating them?", "Why is it getting this much adoption, across so many orgs?"

Well, there's a reason for this, and it starts with vendor biases and incentives; let's take a look at the Anthropic 2026 Agentic Coding Trends Report.

One of the little nuggets in the report is: "In 2026, the value of an engineer's contributions shifts to system architecture design, agent coordination, quality evaluation, and strategic problem decomposition. The primary human role in building software is orchestrating AI agents that write code."

Anthropic sells the harness that's replacing the implementer role, and the report is telling CTOs the implementer role is going away. Call me crazy, but there is a bit of conflict of interest here.

Now, the incentive for Anthropic is pretty clear; orchestrators consume more tokens than implementers. An engineer typing code uses one autocomplete at a time, while an engineer running four agents in parallel — reviewing, decomposing, evaluating, coordinating — burns through tokens at a different order of magnitude.

More consumption, more revenue.

This is really a marketing play. Look at Cursor, Cognition's Devin, GitHub's Copilot Workspace, Augment Code — every productivity narrative aligns with consumption-based pricing in a way that should make a CTO suspicious of the claims.

FOMO is a powerful motivator, and the vendor narratives have been extremely effective at selling a future that is not yet here. Now, I'm not denying that these tools are useful, can be force multipliers for experienced engineers, and can help teams get more done in less time. But they are not the panacea that some are making them out to be; nor are they good enough to justify these overarching mandates in the way we are seeing them adopted today.

And if you don't believe me, just look at Microsoft. In June 2025, Microsoft's Julia Liuson, the President of the Developer Division, told her managers in an internal memo that "AI is now a fundamental part of how we work… using AI is no longer optional — it's core to every role and every level."

So far these AI-first mandates and tokens-as-metrics have ended up backfiring in spectacular fashion. Klarna spent 2024 telling investors AI was doing the work of 700 customer service agents, then admitted in May 2025 that "cost unfortunately seems to have been a too predominant evaluation factor… what you end up having is lower quality." They started rehiring humans. Customer service costs in Q3 2025 came in at $50 million, against the $60 million in claimed AI savings.

What Anthropic's Own Footnotes Say

There are a couple of interesting things in the report worth calling out:

Page 3 of the Anthropic report, in the foreword: "developers use AI in roughly 60% of their work, they report being able to 'fully delegate' only 0-20% of tasks." Because of the hype around agentic coding, we have companies making decisions on the first number, and not the second. The second number is the one that should govern hiring, headcount, and architecture decisions.

Page 13 has a second admission, also worth quoting verbatim: "Notably, about 27% of AI-assisted work consists of tasks that wouldn't have been done otherwise: scaling projects, building nice-to-have tools like interactive dashboards, and exploratory work that wouldn't be cost-effective if done manually." As much as I hate to admit it, internal tooling and exploratory work is often less critical and sometimes optional; not directly contributing to the bottom line.

Four pages later the report walks the orchestrator framing back, in language that doesn't read as a walk-back unless you're looking for it. Page 10: "engineers report being able to 'fully delegate' only a small fraction of their tasks. The apparent contradiction resolves when you understand that effective AI collaboration requires active human participation."

There is enough independent data to support that claim, for example, METR's 2025 randomized trial of 16 experienced open-source developers measured a 19% slowdown when AI tools were used. The developers in that same study believed they were 20% faster.

Faros AI's telemetry on 10,000 developers across 1,255 teams shows the same shape at scale and with different methodology limits. PR review time on high-AI-adoption teams went up 91%. Average PR size went up 154%. Bugs per developer went up 9%.

Anthropic's report is honest about all of this if you read past the foreword. The trouble is that very few CTOs are reading past the foreword, and they are falling prey to the marketing and FOMO; in turn, the rest of the C-suite is reading a slide deck that summarizes the foreword and only compounding the problem.

What to Do Instead

Critique without an alternative is just complaint, and believe it or not I am trying to be constructive here. So what should you do instead of tracking tokens, lines of code, and mandating every engineer to use an agent?

I do not have a perfect answer yet, but a few things have become pretty clear from working with my own teams and from the telemetry already cited above. None of them require a token target, an adoption-rate dashboard, or any other vendor-supplied metric to operate.

The real bottleneck is review capacity, not generation capacity.

The 91% blowup in PR review time on high-AI-adoption teams (Faros, cited above) is not an inefficiency to optimize away; it is the new constraint your engineering org is operating against. AI-generated code is cheap. The human ability to evaluate that code is not.

An engineer in Anthropic's own report puts the underlying skill in one sentence: "I'm primarily using AI in cases where I know what the answer should be or should look like." This means we have to account for the fact that the work an agent generates is not done until a human can sign their name to it.

There are a few ways to tackle the review problem, personally I push for smaller PRs, incremental changes and keeping context human readable. In contrast, tracking regressions and defect rate will tell you if the team is rubber-stamping AI-generated PRs because the queue is too long.

Replace mandates with structured exploration time.

The orgs that have found durable wins with agentic coding are not the ones with the highest adoption rates or the biggest token budgets. They are the ones that gave teams permission to experiment, expected some pilots to fail, and reported findings honestly without putting consumption numbers on a leadership slide.

Drew Breunig's 10 Lessons for Agentic Coding puts it well: "when code is cheap, implement to learn" — teams need room to discover what agents are actually good at on their stack, in their codebase, against their constraints. A quarterly token target is the opposite of permission; it is a coercion budget with a vendor-supplied number sitting on top of it.

Measure ownership, not consumption.

If nobody on your team can explain a critical path the agent wrote, the org is carrying a liability that has not been priced into the maintenance budget yet. The right question to ask in a code review is not "how was this generated?" but "who can defend this in production at 3am?" Track that number across the team. It is much harder to game than tokens consumed, and it is a much better proxy for whether AI-generated code is adding to the codebase or detracting value from it.

One phrase that Drew uses in the article is: "Agentic code is free as in puppies." The implementation is cheap. Ownership and ongoing maintenance are not, and the cost shows up two quarters later when something breaks and nobody on the team can say why.

What I'm Doing

Agentic coding has made code "cheaper". The cheapness is borrowed from a vendor whose pricing, model weights, and output quality you do not control, and not all of the code being produced this way is equally durable on the other end of a vendor-side change. Anthropic or OpenAI can finetune compute overnight, lower the quality of a model you depended on yesterday, and put your pipeline at risk before you finish your morning standup. That risk goes on the same balance sheet as the vendor lock-in I described earlier.

Code quality is still the engineering function's responsibility, and ultimately yours and mine as CTOs. Agents change how the work gets done; they do not change who signs their name to it. As good as agentic coding is — and it keeps getting better — I am not willing to let it run free without supervision.

Design, architecture, and final review are the parts of the process that have to stay human, because those are the parts where the org's exposure to liability, downtime, and competitive disadvantage is most concentrated.

The two side projects I'm building right now — Structpr.dev and Shiplog.ca — are the testbed. I'm using both to experiment with the workflow. Some days the agent saves me hours; other days it generates work I have to throw away wholesale. The question I keep poking at is where the human input on design direction earns its keep, and so far the honest answer is: in more places than I expected. None of this is a finished playbook. It is the slow work of figuring out, on my own code, where the line between human and agent should sit. I'll write up what I learn when there's a real pattern to share. There isn't one yet.

Questions for Your Next Board Meeting

When your board asks for AI adoption metrics, what number are you giving them — and what is that number measuring once you strip the vendor framing off it?
If your agentic coding budget tripled tomorrow, would the work get better or just larger?
Who on your team can explain the critical paths in your codebase from memory? Is that number going up or down?
The vendor's report tells you engineers are becoming orchestrators. Whose business model does that framing serve?
Mandates make the dashboard look better. Nothing else.

Audio: When Tokens Become OKRs

Allan MacGregor 🇨🇦 — Tue, 12 May 2026 18:39:54 GMT

Boards are increasingly fixated on tracking tokens consumed per engineer as a KPI for AI adoption, but the real question never gets asked: what did those tokens actually build? This obsession with measuring token usage is emblematic of a broader problem—metrics are being treated as objectives rather than indicators, turning engineering productivity into a game of consumption rather than value creation.

The corporate cycle is painfully predictable. A CEO or CTO mandates AI adoption, engineering leadership scrambles for a number to report, vendors supply whatever metric is easiest to track—lines of code, tokens consumed, acceptance rates—and that metric becomes a target. Once that happens, work bends to inflate the metric, not to improve the product. This is classic Goodhart’s Law in action: when a measure becomes a target, it ceases to be a good measure.

Lines of code made a comeback with AI-first pushes, leading to an eightfold increase in duplicated code blocks. The cheapest way to look productive is to generate more code, regardless of necessity or quality. Tokens consumed suffer the same fate, but worse—they’re a vendor’s billing metric, not a measure of business value or engineering impact. Tracking tokens is like equating AWS spend with productivity: consumption does not equal output.

Why are so many executives buying into these flawed metrics? Vendors have a clear incentive to promote consumption-based metrics. Take Anthropic’s 2026 Agentic Coding Trends report—it pitches engineers as orchestrators of AI agents, which by design consume more tokens. This narrative conveniently aligns with the vendor’s business model, encouraging higher token usage and thus revenue. Microsoft’s internal memo in mid-2025 mandated AI use as fundamental, further entrenching this approach.

Yet these mandates have backfired. Klarna claimed AI replaced hundreds of customer service agents only to admit that cost-cutting compromised quality, forcing them to rehire humans. The promise of AI-driven productivity often ignores the hidden costs and quality trade-offs. Vendor reports themselves reveal the cracks: while developers use AI in 60% of their work, they can only fully delegate 0 to 20% of tasks. And about a quarter of AI-assisted work is scope expansion—building tools and dashboards that wouldn’t have existed otherwise, not straight productivity gains.

Independent studies confirm this. METR’s 2025 trial showed a 19% slowdown for developers using AI tools, despite their feeling 20% faster. Faros AI telemetry revealed that AI-heavy teams saw PR review times nearly double, PR sizes increase by 150%, and bug rates rise. The core bottleneck is review capacity, not code generation. AI makes code cheap to produce, but human evaluation remains the limiting factor.

So what should we do instead? First, recognize that the real constraint is review capacity. Push for smaller PRs, incremental changes, and human-readable context. Track regressions and defect rates to ensure quality isn’t sacrificed in the rush to consume tokens. Second, replace mandates with structured exploration time. The best AI adoption comes from teams given permission to experiment and fail, not from forced consumption targets. Third, measure ownership, not consumption. Ask who on the team can confidently defend the AI-generated code in production at 3 a.m. That’s a far better proxy for real impact than tokens consumed.

Agentic coding makes code "cheaper," but that cheapness is borrowed from vendors whose pricing, model quality, and output you don’t control. Models can change overnight, degrading quality and putting your pipeline at risk. Code quality and accountability remain our responsibility as CTOs. Human oversight around design, architecture, and final review must stay central to avoid exposure to risk and technical debt.

I’m actively experimenting with this balance through side projects like Structpr.dev and Shiplog.ca, testing workflows to find where human input adds value beyond AI generation. Some days the AI saves hours, other days it creates more work. The pattern of where to draw the line between human and agent is still emerging, and I’ll share those learnings when they solidify.

When you’re next in the boardroom and asked for AI adoption metrics, consider what that number actually measures once you strip away vendor spin. If your token budget tripled tomorrow, would your output improve or just get bigger? Who on your team can explain critical paths in the codebase from memory—and is that number rising or falling? Remember, the narrative that engineers are becoming orchestrators primarily serves vendor business models, not your product’s quality or your team’s effectiveness. Mandates and dashboards may look good on paper, but they don’t guarantee real progress.

You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.

Read the full article — with all the data and sources — on ThePragmaticCTO.

AI Wrote the Code. Who Gets the Tax Credit?

Allan MacGregor 🇨🇦 — Mon, 09 Mar 2026 12:31:41 GMT

Your AI Strategy Is Shrinking Your Tax Credit (Maybe)

Two SR&ED consultants look at the same developer, using the same AI tool, writing the same code. One says it qualifies for R&D tax credits. The other says it doesn't.

Leyton, one of Canada's largest SR&ED consulting firms, published a clear position: "The CRA will not recognize the act of calling an API or engineering prompts as SR&ED-eligible work, as that activity is considered routine implementation." Prompting an AI is like calling an API; the uncertainty was resolved by Anthropic or OpenAI, not by the developer.

GrowWise Partners, another major SR&ED consultancy, says the opposite: "AI does not disqualify work, but claimants must show that human-led experimentation is still present." Same developer. Same tool. Same output. Different framing, different documentation, different outcome.

The gap between "we used AI as a research tool" and "AI did our work for us" is less technical and more about the documentation that supports the claim. And that distinction is worth up to $2.1 million in refundable tax credits in Canada; six figures or more in the US. This is not an accounting footnote.

Right now, Agentic coding and AI-assisted development are in a regulatory vacuum. No government has issued guidance; no court has ruled. The CRA's five-question eligibility test was written for human researchers; the IRS four-part test never contemplated AI as the primary code author. The entire field is governed by consultant interpretation of statutes that predate GitHub Copilot by decades.

If your company claims SR&ED credits in Canada or R&D tax credits in the US -- and if your engineering team uses AI coding tools -- your tax position depends on how well you can document your work. The "(Maybe)" in the subtitle is genuine; this might work out fine. But the regulatory vacuum means nobody can tell you that with certainty, and is the companies that are left holding the bag.

No one is sure

R&D tax credits live in the CFO's domain -- or they used to. The engineering org's practices now determine whether the credit survives an audit, and the gap between a defensible claim and a rejected one can be six figures. Tax mechanics are not why you took the CTO job, but this is your problem whether you want it or not.

Canada's 2025 federal budget doubled the SR&ED expenditure limit from $3 million to $6 million, expanded eligible entities beyond CCPCs, and restored capital expenditure eligibility for the first time since 2014. The maximum refundable investment tax credit jumped to $2.1 million. The program has never been more generous.

Starting April 2026, the CRA will launch an AI-enhanced review process to streamline claim reviews, alongside a new elective pre-claim approval process. And while the CRA is using AI to review claims, nobody at the CRA has published a single word of guidance on how AI coding tools affect eligibility.

South of the border, the story is pretty much the same. The One Big Beautiful Bill Act restored immediate R&D expensing in July 2025, reversing the TCJA's punishing five-year capitalization regime that had forced companies to amortize domestic research costs over five years. The new Form 6765 Section G becomes mandatory for 2026 filings -- business-component-level disclosure for anyone claiming more than $1.5 million in qualified research expenses, reporting on up to fifty business components. More granular disclosure than the IRS has ever required. Meanwhile, the IRS is deploying its own AI tools: a Line Anomaly Recommender for audit selection and Agentforce across the Office of Chief Counsel.

Robert Kovacev, a tax litigator who published an analysis on SSRN, observed that "nothing in the statute or regulations states that activities must be performed by humans." He's right -- the statute is silent. But silence cuts both ways; it means the answer depends entirely on how you frame and document the work.

Ottawa doubled the SR&ED expenditure limit in the same budget year that ISED launched new AI adoption grants. Washington restored R&D expensing three months after the White House issued executive orders accelerating AI deployment. Nobody in either capital connected the dots; companies are filing R&D claims based on whatever their consultant tells them, and the consultants -- as we've seen -- don't agree; which makes things more a mess, and as always is our job as CTOs to figure out how to navigate this.

Who eliminated the uncertainty -- the developer or the AI? That's the question neither statute answers. The CRA's five-question test requires "systematic investigation" by means of "experiment or analysis"; the IRS four-part test requires a "process of experimentation" to "eliminate uncertainty." Both assume human researchers. Neither says what happens when AI does the generating and the human does the evaluating.

The Productivity Paradox

R&D tax credits are calculated primarily on employee wages allocated to qualifying research. If AI reduces the time developers spend on qualifying activities, the wage base shrinks. The credit shrinks with it; this is mechanical, not interpretive. It follows directly from how the math works.

The implications are counterintuitive. Your AI strategy might be making your engineers more productive while simultaneously making your company's tax position worse.

Walk through the Canadian numbers. A developer earning $150,000 per year who previously allocated 50% of their time to SR&ED work generated $75,000 in eligible salary, plus $41,250 in proxy overhead -- $116,250 in qualified expenditure. At the enhanced 35% rate, that produced roughly $40,700 in investment tax credits per developer. Compress that developer's SR&ED-eligible time to 20% with AI tools -- nothing else changes, same salary, same project, same research outcomes -- and the ITC drops to approximately $16,300. Sixty percent gone.

The US math is different in structure but identical in direction. A ten-person team spending 60% of time on qualifying research might see that drop to 30% after AI adoption; the wage-based credit cuts roughly in half.

But the US has an additional cliff. Under Treasury Reg. 1.41-2, if 80% or more of an employee's services constitute qualified research, 100% of their wages count as qualified research expenses. Drop below 80%, and only the actual proportion counts. Pre-AI, a developer at 85% qualifying research had 100% of wages in the QRE pool. Post-AI, that same developer at 60% qualifying research has only 60% of wages in the pool.

The 7th Circuit made this concrete in *Little Sandy Coal* (2023): the taxpayer must demonstrate a "principled way to determine what portion of employee activities constituted elements of a process of experimentation." If you can't show that principled allocation, you lose.

The paradox is this: AI may simultaneously expand the universe of qualifying activities -- more experimentation, more alternatives evaluated, more systematic investigation -- while compressing the economic value of the credit through fewer developer-hours and lower wage QREs. Companies might qualify for credits more easily while claiming smaller dollar amounts.

Scale this across a team. If three AI-augmented developers replace the output of ten, the wage base drops 70% -- even if every minute of their remaining time qualifies. The productivity gain that your CEO celebrates is the same productivity gain that mechanically erodes your R&D credit. The only defense is reframing what counts as qualifying activity; that reframing lives or dies in documentation.

The Documentation Is the R&D

Before AI, tickets, PRs, and comments documented the work, any additional documentation was a bonus; meaning we could get away with using the output as evidence of the work. With AI in the mix, we need to document the work in a way that supports the claim and at higher levels of detail.

When AI generates most of the code, the code is no longer evidence of human-led investigation. Traditional signals -- commit history, code comments, design docs -- may be thinner or absent entirely. A developer who generates fifty lines of code in a single AI prompt produces a different artifact than a developer who wrote those fifty lines over three days of iterative experimentation. The output might be identical but how we got there is not.

Documentation must now prove something the code used to prove implicitly: that a human drove the investigation and iteration process. That a human identified the uncertainty, designed the experiment, evaluated the results, and advanced knowledge. With agentic coding, documentation is no longer a record of the R&D. It is the R&D -- it's the only surviving evidence that qualifying work occurred.

Five elements make this concrete.

The uncertainty. What didn't you know? What couldn't be achieved through standard practice? Document this before prompting AI -- not after. The uncertainty must exist in the developer's understanding, not in the model's training data.
The hypothesis. Record which approach the developer chose to test and why they picked it over alternatives. The reasoning belongs to the human, not the model. If nobody can articulate why this approach rather than another, there's no hypothesis -- there's a guess.
The experiment. Save the prompts, the iterations, the evaluation criteria. Where AI interaction logs show a cycle -- hypothesis, generation, evaluation, iteration -- those logs are evidence. This is the one area where agentic coding actually helps your claim; the tool produces a richer paper trail than manual development ever did.
The evaluation. A developer tries three approaches and two of them fail. Those failures are strong evidence; Platform Calgary notes that failed experiments in AI development often represent the strongest SR&ED evidence. Document what was rejected and why it didn't hold up.
The advancement. If the only thing your team gained was working code, that's a product -- not a research outcome. The advancement is the new knowledge: what works under these conditions, what doesn't, and why. That knowledge belongs to the organization, not the model.

In practice, this means your developers need to write something down before they start prompting. Not necessarily a formal document but a Jira ticket, a Slack message to themselves, a comment in the PR. What's the uncertainty? What are they about to try? After the AI generates output, they need to record what they rejected and why. GrowWise recommends preparing "a summary of AI usage explaining how it enhanced, but did not replace, systematic investigation." That summary is what ties your engineering workflow to your tax credit; and it should take five minutes to write if you do it in the moment.

If we are smart we can tweak our developer process to create enough evidence and documentation to support the claim. For example:

Make sure our git history shows iteration.
PR descriptions capture what was tried and rejected.
Jira or Linear tickets can document the uncertainty if your developers write them that way.
More formal documents like architecture decision records, AI interaction logs, and developer journals can capture the experiment and evaluation.

Heck even a developer's Slack thread where they talk through an approach -- all of it counts. You have the tooling; what you probably don't have is anyone treating this as an engineering practice instead of a tax compliance exercise.

The Dominant Actor

For all their disagreements on specifics, McGuire Sponsel, KBKG, Warren Averett, and Bloomberg Tax converge on one thing: the developer has to be the dominant actor. The person who drove the investigation, not someone who showed up for the review. Where exactly that line sits depends on who you ask.

Two scenarios — identical in every way except how the work was framed.

Eligible: A developer hits a problem they can't solve through standard practice -- say, a concurrency issue under specific load conditions that no existing pattern handles cleanly. They hypothesize a few approaches, use AI to generate implementations faster than they could write them, then test each one against their criteria. Three approaches fail. They document why, adjust, and eventually land on something that holds. The investigation was theirs; AI just wrote the code.
Not eligible: A developer opens Claude Code, types "build a feature that handles multi-currency refunds," and gets back something that works. They tweak the formatting, maybe rename a variable, and push it to staging. Done in twenty minutes. The problem is that nobody documented an uncertainty before that prompt -- because there wasn't one, or at least none that the developer articulated. No hypothesis, no evaluation criteria, no record of what got rejected. Leyton's analysis says the CRA will treat that as routine implementation, and they're probably right.

So what separates those two scenarios? Not the code -- the code might be identical. What separates them is whether anyone wrote down why they were building it that way.

Bloomberg Tax offers a useful reframing: "the bug is the proof that the initial hypothesis was false, and the debugging and testing process then becomes the new, qualified experiment." AI-assisted development may involve more process of experimentation, not less -- more alternatives generated, more systematic evaluation, more documented iteration. The key is making that visible. If the experimentation happened but nobody recorded it.

To survive an audit, your documentation needs to tell that story. Problem identification, experiment design, evaluation, iteration -- those belong to the developer. The AI generated code faster; it didn't investigate anything. That distinction holds whether you're answering the CRA's five questions or the IRS four-part test.

If your documentation tells that story -- and tells it contemporaneously, not retroactively -- the credit is defensible. If your documentation is thin, the same work becomes indistinguishable from routine implementation.

Now final disclaimer: I'm not a tax attorney; just a CTO that has done his fair share of R&D tax credit claims and has seen the pitfalls. Talk to your SR&ED consultant or R&D credit advisor -- but talk to them with the right questions, which hopefully this article provides.

Key Takeaways

When was the last time you talked to your SR&ED consultant or R&D credit advisor about how your team uses AI tools? If the answer is "never," that conversation is overdue.
Can your engineering team demonstrate -- with documentation created at the time of the work, not assembled retroactively -- that developers drove the systematic investigation on your last R&D project? Not that they were present. That they drove it.
Do you know your developers' current time allocation to qualifying activities? Has it changed since AI tool adoption? If you're in the US, are you sure you're still above the 80% threshold?
Think about your last sprint. A developer prompted Claude Code, iterated until it worked, and shipped. Six months later, an auditor asks them to prove that was systematic investigation. They won't remember the prompts. They won't remember what they rejected. The documentation either captured it in real time or it didn't -- and "it worked" is not evidence of experimentation.

Audio: AI Wrote the Code. Who Gets the Tax Credit?

Allan MacGregor 🇨🇦 — Mon, 09 Mar 2026 11:55:37 GMT

AI is transforming software development, but it’s also reshaping how we qualify for R&D tax credits. The big question: if AI wrote the code, who gets the credit?

Two top SR&ED consultants look at the same developer using AI tools and come to opposite conclusions. One says calling an AI API or prompting it is routine implementation, not eligible for tax credits, because the uncertainty was resolved by the AI’s training, not the developer. The other says AI doesn’t disqualify the work as long as the developer leads the experimentation. Same developer, same tool, same code—different framing, different documentation, different outcomes. The difference isn’t technical; it’s about how you document the human’s role in the process. And that distinction can be worth millions in refundable tax credits. No government has issued clear guidance, and the existing tests were written with human researchers in mind, not AI collaborators. If your engineering team uses AI tools and you claim R&D credits, your tax position hinges on your documentation. It might work out, or it might not.

Here’s the catch: R&D tax credits used to be a finance problem, but now they’re an engineering problem. Canada doubled its SR&ED expenditure limit to six million dollars and launched AI-powered claim reviews, yet still hasn’t clarified how AI affects eligibility. The US recently restored immediate expensing for R&D and requires more granular disclosure, while deploying AI tools to select audits. Neither the Canadian nor US statutes specify that research must be done by humans, but that silence leaves everything up to interpretation and documentation. Governments are accelerating AI adoption but ignoring the tax credit implications, leaving companies and CTOs to navigate a regulatory vacuum.

And it gets worse. AI boosts developer productivity by automating parts of the work, but that reduces the hours spent on qualifying R&D activities. Since tax credits are tied to wages allocated to research, AI use mechanically shrinks your credit. For example, a Canadian developer who spent half their time on eligible R&D might have generated over forty thousand dollars in tax credits. If AI cuts that qualifying time to 20%, the credit drops by 60 percent. In the US, it’s even trickier due to the “substantially all” rule: if a developer spends less than 80% of their time on qualified research, only a proportional share of their wages count, not 100%. AI can push developers below that threshold, slashing credits. So the same AI that improves your team’s output can erode your tax benefits. The only way out is reframing what counts as qualifying activity—and that lives or dies in your documentation.

Which brings me to the real point: with AI writing more code, traditional evidence like commit histories and code comments no longer prove human-led R&D. The code might be identical, but the process behind it is different. Documentation isn’t just a record anymore; it is the R&D. You have to prove a human drove the investigation: defined the uncertainty before prompting AI, formed a hypothesis, ran experiments by iterating with AI, evaluated results including failures, and advanced knowledge—not just delivered working code. This means developers need to write down what they don’t know, why they chose a particular approach, what prompts and iterations they tried, what didn’t work, and what was learned. Even informal notes in Jira tickets, Slack threads, or PR descriptions can be crucial. AI logs can help, too, since they show the cycle of hypothesis and evaluation. Without this, the tax authorities will see AI-assisted coding as routine implementation.

The key principle all major tax consultancies agree on is that the developer must be the dominant actor. If a developer uses AI as a tool to test hypotheses and systematically investigate a problem—documenting that process—they can claim credits. But if they just ask AI to build a feature, tweak the output, and ship without documenting uncertainty or experimentation, that’s routine work, and no credit. The code alone won’t save you. Auditors want to see that the human drove the research, not just the AI.

I’m not a tax attorney, but I’ve worked on many R&D claims and seen how easily companies trip up. You need to have this conversation with your SR&ED or R&D credit advisor. Ask how AI use affects your eligibility, how to track developer time on qualifying work, and how to document that investigation effectively.

Here’s what you should do now: talk to your tax consultant about AI in your engineering process if you haven’t already. Make sure your developers document their systematic investigation as it happens, not after the fact. Know how much of their time qualifies for R&D credits and whether AI has shifted that balance. Think about the last sprint: if someone prompted an AI tool, iterated until it worked, and shipped, can they prove that was research or just routine implementation? If not, your credit is at risk.

You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.

Read the full article — with all the data and sources — on ThePragmaticCTO.

What to Measure When the CEO Asks for Engineering Metrics

Allan MacGregor 🇨🇦 — Mon, 02 Mar 2026 15:45:20 GMT

Every engineering leader gets this email. "The board wants engineering metrics for next quarter's deck; can you put something together?" Twelve words that launch a thousand bad dashboards.

We have all been there. The instinct is to grab whatever is closest---DORA metrics, sprint velocity, maybe a cycle time chart, or even worse LoC (lines of code)---and arrange them on a slide that looks like you've been tracking this all along. The board nods. The CEO nods. Then someone asks a follow-up question, and you spend the next six months defending numbers you picked in an afternoon.

The problem isn't that you chose the wrong metrics(unless you picked LoC). The problem is that "give me some metrics" is the wrong ask, the wrong conversation and will eventually lead to disaster. If we are to take the question "How do you measure engineering performance?" seriously, then we need to understand that it's four different questions masking as one, and most CTOs answer whichever one they find easiest rather than the one being asked.

Will Larson put it plainly in *The Engineering Executive's Primer*: "There is no one solution to engineering measurement, rather there are many modes of engineering measurement, each of which is appropriate for a given scenario" Four modes. Four questions. Getting the right answer starts with figuring out which one you're being asked.

One Question, Four Answers

Larson's framework splits engineering measurement into four categories, each answering a different question. They aren't interchangeable. Picking the wrong one for your audience is worse than picking no metrics at all.

Measure to Plan. Are we working on the right things? Track shipped projects by team and their impact on the business. This is the language boards speak natively---investment and return, allocation and outcome. If you can show that 60% of engineering effort went to features that moved revenue and 15% went to infrastructure that prevented last quarter's outage from recurring, you've answered the planning question. Most boards don't need more than this.
Measure to Operate. Is the system healthy right now? Incidents, downtime, latency, engineering costs normalized against business metrics. Operations metrics answer a question that sounds mundane but matters more than anything on your roadmap: should you be following your plan or swarming to fix a critical problem? A CEO who sees three major incidents in a quarter understands why the feature roadmap slipped; a CEO who sees a missed roadmap with no context assumes engineering is slow.
Measure to Optimize. Are we getting faster or slower? This is DORA's domain---deployment frequency, lead time, change failure rate, recovery time---and SPACE territory too (satisfaction, performance, activity, communication, efficiency). Both are diagnostic: useful for your engineering leads diagnosing bottlenecks, useless in front of your board. The problem is translation. A non-engineer who hears "our deployment frequency increased 40%" assumes that means more value; bridging that gap requires technical context a quarterly meeting doesn't provide.
Measure to Inspire. What's the story of engineering's impact? Most CTOs skip this category---which in my opinion is a mistake, because inspiration metrics are the narratives that change how the organization thinks about engineering. Not dashboards. Stories: the migration that cut infrastructure costs 40%, the platform rebuild that compressed a six-week feature into two days, the reliability work that turned a churning enterprise customer into a case study. When the board hears those, engineering stops looking like a cost center and starts looking like the reason the company can do things competitors can't.

Now, if you are paying attention so far, you might noticed that I mentioned to mention two things, metrics are not interchangeable and each category has an audience.

Most CTOs set themselves up for failure by selecting the wrong category for the wrong audience. What your board wants, vs what your engineering leads need, vs what your company needs are all different. Inspiration metrics are the hardest to build and the easiest to skip; they're also what gets you headcount next year.

Five Ways to Destroy Trust With Your Dashboard

Knowing what to measure is half the problem. The other half is knowing how measurement goes wrong---and it believe me it will go wrong in pretty predictable ways. Here are some:

1. Goodhart's Law, now with infinite leverage. "When a measure becomes a target, it ceases to be a good measure." Charles Goodhart wrote that in 1975; the software industry has spent fifty years proving him right. Story point inflation. Deployment frequency gaming. Bug counts manipulated by closing duplicates. Every metric that touches a performance review gets optimized for the metric, not the outcome (developers are smart and they will find a way to game the system to their advantage).

AI has the potential to make this worse. When generating code costs nothing, code volume metrics become meaningless and a developer can mass-produce pull requests, PR count stops correlating with value delivered. Goodhart's Law had a natural ceiling when humans were the bottleneck; remove the bottleneck and the gaming potential is unlimited.

2.Measuring individuals instead of teams. Dan North put it precisely: "Attempting to measure the individual contribution of a person is like trying to measure the individual contribution of a piston in an engine---the question itself makes no sense." Software is a team activity. The developer who mentors three juniors ships less code and creates more value than the one who heads-down grinds features. Individual metrics can't capture that; they punish it (and if you are a CTO, you are the one who is responsible for the team's success).

McKinsey learned this the hard way. They tried to measure "individual developer productivity" in 2023 and the response was brutal---Kent Beck, Gergely Orosz, and Dan North all piled on. Beck's line was the sharpest: measuring developer productivity by coding time is "like measuring surgeon productivity on what percentage of their time they were cutting with a scalpel---and ignoring whether the patient got better." It became the most popular Pragmatic Engineer article of 2024. Individual contribution analysis doesn't just fail as a metric; it poisons the team. You get adversarial dynamics, eroded trust, and people optimizing for the wrong things.

3. The measurement loop. Stakeholders keep asking for more metrics---different cuts, new dashboards, one more data point---and nothing you build satisfies them. I've been in this loop. It's not a metrics problem. It's a trust deficit wearing a metrics costume. No dashboard fixes a broken relationship between engineering and the business; if you're caught in this cycle, put the dashboard down and have the hard conversation about what's actually wrong. Larson says it plainly: the loop is a signal, not something you solve with more data.

4. Optimization metrics in the wrong room. Cycle time, deployment frequency, change failure rate---these are diagnostic tools for engineering leaders, not performance indicators for the board. Put them in front of non-technical stakeholders and they get misread; a higher deployment frequency sounds good, but it says nothing about whether you shipped the right things. Worse, the board starts setting targets. "Can we get deployment frequency to daily?" Now you're optimizing for the metric instead of the outcome. Larson is blunt about this: CEO and board get planning and operations metrics. Full stop.

5. Perfection paralysis. The opposite failure mode. Some CTOs refuse to measure anything until they have the perfect framework, the perfect instrumentation, the perfect dashboard. They read about DORA, SPACE, DX Core 4, DevEx; they evaluate engineering intelligence platforms from LinearB, Jellyfish, Swarmia, Cortex; they attend conferences and take notes. And they measure nothing while they decide.

My advice? Start with something imperfect. Larson's sequencing advice? measure easy things first to build trust with stakeholders, even if the data isn't precise. Only take on one new measurement task at a time.

The Ghosts in the Machine

As if things weren't complicated enough, the AI era has added a new layer of complexity. Having metrics that measure what you think they measure was already hard, now AI is making it even harder.

The 2025 DORA report found that a 25% increase in AI adoption correlates with a 1.5% drop in delivery throughput and a 7.2% drop in delivery stability.

The individual-level numbers tell a different story. Cortex's 2026 benchmark shows PRs per author up 20% year-over-year. Sounds like progress. But incidents per PR increased 23.5%; change failure rates climbed roughly 30%. More output, more breakage. The Faros AI report shows the same pattern at larger scale: tasks completed up 21%, PRs merged up 98%, but code review time increased 91% and PR size grew 154%. In a way AI is starting to clog the pipeline; and bring down overall quality.

Every traditional metric is now suspect. Deployment frequency goes up because AI generates more deployable units; Plandek's right that "more deployments aren't always a sign of progress." Lead time shrinks---but only for the coding phase. Review, testing, approval? At best same as before, at worst taking much longer. Change failure rate looks flat until you remember that volume is up; a flat rate on higher volume means more absolute failures. Recovery time is the ugliest one: developers stall because they're debugging code they didn't write and don't fully understand.

DORA added a fifth metric in 2025. Rework rate---unplanned follow-up deployments caused by production issues. It exists because the original four metrics miss something important: the cost of fixing what you just shipped. You can have perfect deployment frequency and still be drowning in rework.

Does this mean that metrics are done, the answer is no. But you need to read your metrics with more skepticism now, which means pairing every speed metric with a quality metric, breaking lead time down by stage rather than treating it as a single number, and watching rework rate as your earliest warning signal. A dashboard that only shows throughput is measuring the accelerator without looking at the road.

Where to Start

Frameworks can be useful, here is a starting point; adapt it for your context, your stage, your stakeholders.

Delivery predictability. Did we ship what we said we would? This is the metric that builds or destroys credibility with the board. Not "how much did we ship" but "did we hit our commitments?" Track it as a percentage; trend it over quarters. When the number drops, come prepared with a root cause and a plan.
System reliability. Incidents, uptime, recovery time. Boards understand reliability intuitively---the system works, or it doesn't. Pair incident count with recovery time; a team that has five incidents but recovers in minutes is in better shape than one that has two incidents and takes days to resolve them.
Investment allocation. Where did engineering effort go? New features, maintenance, unplanned work, technical debt---this is how the board decides whether engineering is pulling in the right direction. Swarmia benchmarks it (roughly 60% new features, 15% productivity improvements, 10% keeping the lights on), but your numbers will look different and should; the point isn't hitting their targets, it's knowing your own and explaining the reasoning behind them.
Team health. Attrition, hiring pipeline, engagement scores. The leading indicator that nobody reports until it's too late. A team losing senior engineers will show up in your delivery metrics six months from now; by then the damage is done. Report this proactively.

Three principles hold no matter which category. Only report metrics you're already tracking---the moment you build a separate collection for the board, you're maintaining two systems and trusting neither. Show trends, not snapshots; one quarter is noise, four quarters is signal. And never show a speed metric alone. Deployment frequency without change failure rate beside it is a lie of omission; cycle time without reliability is the same trick. If you are not already doing this, you are setting yourself up for failure.

Swarmia's advice captures the right mindset: think of metrics like a thermometer. They're the outcome of good practices, not a target to chase.

As a recap, here are some questions to ask yourself:

When the CEO asks for engineering metrics, which of the four categories are they asking about---and are you answering that question or the one you're most comfortable with?
How many of your current metrics would survive Goodhart's Law? If your team optimized for nothing but hitting those numbers, would the outcomes improve or decay?
What story is your dashboard telling? Is it the story your engineering team would tell, or a performance your engineering team has learned to put on?
If you stripped away every metric that measures activity rather than outcomes, what would be left?

Peter Drucker never said "what gets measured gets managed." What he said was closer to the opposite: "Because knowledge work cannot be measured the way manual work can, one cannot tell a knowledge worker in a few simple words whether he is doing the right job and how well he is doing it." Tom DeMarco, who famously wrote "you can't control what you can't measure," retracted it in 2009: "My answers are no, no, and no."

Measurement isn't the goal. Understanding is. The metrics are supposed to help you make better decisions about your teams, your systems, and your strategy. If they're not doing that---if they're creating theater instead of insight---the problem isn't that you need better metrics. The problem is that you've confused the dashboard for the thing it's supposed to represent.

Audio: What to Measure When the CEO Asks for Engineering Metrics

Allan MacGregor 🇨🇦 — Mon, 02 Mar 2026 15:40:20 GMT

When the CEO asks for engineering metrics, the first mistake most CTOs make is thinking it’s a single question with a simple answer. It’s not. It’s four very different questions wrapped into one, and answering the wrong one wastes time and erodes trust.

Will Larson’s framework breaks engineering measurement into four categories. First, measure to plan: are we working on the right things? Show the board how engineering time maps to business impact—features that move revenue, infrastructure that prevents outages. That’s what the CEO and board really want to know. Second, measure to operate: is the system healthy? Incidents, downtime, latency, cost ratios—these explain why a roadmap might slip and help prioritize firefighting over feature work. Third, measure to optimize: are we getting faster? DORA metrics like deployment frequency and lead time live here, but these are for engineering leadership, not the board. Without technical context, they’re meaningless to most non-engineers. Fourth, measure to inspire: what’s the story of engineering’s impact? This is where you share narratives that turn engineering from a cost center into a strategic advantage—how a platform rebuild cut feature delivery from six weeks to two days, for example. It’s the hardest category to build and the easiest to skip, but it’s what wins you headcount and support.

And it gets worse. Even if you pick the right category, dashboards often destroy trust in predictable ways. Goodhart’s Law warns us that when a metric becomes a target, it stops being a good measure. Developers are smart; they’ll game any metric tied to performance reviews. AI only makes this worse—code volume becomes meaningless when AI can churn out pull requests in bulk, inflating output without adding value.

Another trap is measuring individuals instead of teams. Software is a team sport. Trying to isolate individual contributions is like measuring a piston’s output in an engine—it makes no sense and poisons team dynamics. McKinsey’s disastrous attempt to measure individual developer productivity proved this painfully clear.

There’s also the measurement loop: stakeholders ask for more dashboards, more metrics, and nothing satisfies them. This is not a metrics problem; it’s a trust problem disguised as data. No dashboard fixes a broken relationship between engineering and the business.

Plus, optimization metrics like cycle time or deployment frequency belong in engineering leadership meetings, not in front of the board. Presenting them to CEOs without context leads to misinterpretation and dangerous targets that drive the wrong behaviors. CEOs and boards want planning and operations metrics, period.

Finally, perfection paralysis kills progress. Some CTOs wait for the perfect framework, the perfect tools, and never start measuring at all. Start with what you have, measure the easy stuff first to build trust, then iterate.

Now AI adds a new layer of complexity. According to the 2025 DORA report, increased AI adoption correlates with a drop in delivery throughput and stability. Cortex’s 2026 data shows PRs per author up 20%, but incidents per PR up 23.5%, and change failure rates up 30%. AI speeds up coding but clogs the pipeline with more broken code that takes longer to review and fix. Deployment frequency alone no longer signals progress; you have to pair speed with quality metrics like rework rate, which DORA added in 2025 to capture unplanned follow-up work caused by production issues.

So where do you start? Focus on a minimum viable dashboard: delivery predictability—did you ship what you promised? System reliability—incidents and recovery time. Investment allocation—where engineering effort went. And team health—attrition, hiring, engagement. Report metrics you’re already tracking, show trends not snapshots, and never show speed metrics alone. Deployment frequency without change failure rate is a lie by omission.

Metrics are like a thermometer. They reflect the health of your engineering practices but aren’t goals themselves. If your dashboard creates theater instead of insight, you’re confusing the map for the territory.

When the CEO asks for engineering metrics, ask yourself: which of the four categories are they really after? Would your current metrics survive scrutiny under Goodhart’s Law? What story is your dashboard telling—one your engineers would own, or one they’ve learned to perform? If you stripped away activity metrics and kept only outcome metrics, what remains?

Peter Drucker didn’t say “what gets measured gets managed.” He said knowledge work can’t be measured like manual work. Tom DeMarco, who famously claimed you can’t control what you can’t measure, later retracted that. Measurement isn’t the goal—understanding is. Use metrics to make better decisions, not to create theater.

You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.

Read the full article — with all the data and sources — on ThePragmaticCTO.

Audio: The AI-First Fallacy

Allan MacGregor 🇨🇦 — Mon, 23 Feb 2026 15:15:24 GMT

Rebranding around AI can boost your stock price and attract funding, but that’s not the same as having a strategy that creates real value. The AI-first label is often branding masquerading as strategy, and it’s setting companies up for failure.

Look at the numbers: since ChatGPT launched, mentions of AI on earnings calls rose sixty-fold in a year, and companies calling out AI saw their stock jump an average of 4.6%, almost double those that didn’t. But this bump comes from talking about AI, not from AI delivering measurable results. Venture capital funding for AI startups exploded, yet 78% of these startups are just API wrappers on the same foundation models, with no real differentiation. Regulators are now fining companies for “AI washing”—making misleading claims about AI capabilities. Meanwhile, layoffs attributed to AI are often just a cover story to spin bad business news as positive transformation.

Strip away the hype, and the reality is stark. Studies show 95% of companies see no measurable return on AI investments, and nearly half abandon their AI projects. Most AI startups generate no revenue and have customer churn twice the SaaS average. McKinsey’s 2025 report found that while almost everyone uses AI, only 39% see any financial impact, and just a third are scaling AI programs. The gap between saying you’re AI-first and actually benefiting from AI isn’t a gap—it’s a chasm.

There’s a predictable five-step pattern when companies declare AI-first: first, a bold AI mandate; then backlash from employees and customers; followed by quality issues and rising costs; a public walk-back; and finally, the AI-first narrative quietly disappears. Klarna claimed AI was replacing hundreds of agents, only to rehire humans after quality dropped and costs rose. Duolingo’s CEO insisted small quality hits were acceptable, but engagement and stock price plummeted, forcing a reversal. Amazon announced AI-driven layoffs, then backtracked amid employee pushback. This pattern repeats because AI-first as an identity invites scrutiny and internal resistance—31% of workers sabotage AI rollouts, and some even falsify performance data.

To cut through the noise, I use a simple taxonomy. AI-native companies build products that cannot exist without AI—TikTok’s recommendation engine or Midjourney’s image generation. AI-enhanced companies improve existing products with AI features—like Salesforce adding AI to CRM or banks using AI for fraud detection. AI-washing is just slapping AI branding on a product with minimal integration—exactly what most AI startups do. Klarna, Duolingo, and Shopify are AI-enhanced, not AI-native, despite calling themselves AI-first. Ask yourself: if you removed AI, would your product still work? If yes, you’re AI-enhanced. If no, you might be AI-native. If you can’t tell, you’re probably AI-washing—and that’s risky.

The problem with AI-first identity worsens as AI commoditizes. When a $6 million open-source Chinese model can rival U.S. tech giants, and the companies spending billions on AI infrastructure see their stock prices fall, the models themselves are no longer a moat. OpenAI calls itself a product company, not a model company, signaling the shift. The winner won’t be the one who built the best model, but the one who attracts and retains customers. Value will come from domain expertise, proprietary data, workflow integration, and user experience—not the AI model itself. If your identity is tied to a commodity, you have no moat.

This isn’t a reason to dismiss AI. Real AI-native companies exist and thrive. The technology is transformative for specific use cases like recommendations, fraud detection, or drug discovery. The key is precision: define what AI solves for your business and measure it. The companies succeeding with AI redesign workflows and set growth objectives, not just cost-cutting. Most failed AI projects stem from poor data and bolting AI onto old processes. Gartner placed generative AI in the trough of disillusionment in 2025. The hype is cooling, and companies with real integration—not just buzzwords—will emerge stronger.

If your board asks “are we AI-first?” don’t answer with buzzwords. Give them data quality status, specific AI use cases, measurable outcomes, and a clear roadmap. Fix your data first. Redesign workflows, don’t just add AI features. Build domain advantages, not model dependencies. Set growth goals, not just layoffs and cost cuts. Replace “AI-first” with “AI-specific” and be honest about what AI actually delivers.

Ask yourself: if you stripped AI from your product, what would be left? When the models become commodities, will your company have a moat beyond the label? Because just like Long Island Iced Tea didn’t become a blockchain company by changing its name, you don’t become an AI company by declaring yourself AI-first. You become one by solving problems AI is uniquely suited to solve—and being honest about the ones it can’t.

You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.

Read the full article — with all the data and sources — on ThePragmaticCTO.

The AI-First Fallacy

Allan MacGregor 🇨🇦 — Mon, 23 Feb 2026 14:15:23 GMT

In December 2017, a beverage company called Long Island Iced Tea Corp did something remarkable. It renamed itself Long Blockchain Corp. The stock surged 380% overnight. Trading volume spiked 1,000%. The company had zero blockchain technology, zero blockchain products, and zero blockchain revenue. The SEC subpoenaed documents; three individuals were charged with insider trading; the stock was delisted from NASDAQ within four months.

This should have been a cautionary tale. Instead, it was a preview.

Technology branding follows a predictable cycle, and we've watched it loop for over a decade. Satya Nadella took over Microsoft in 2014 with "mobile-first, cloud-first" as his rallying cry; within four years, the slogan had quietly shifted to "intelligent cloud," and by 2024, it was "everything AI." Same company, same playbook, new buzzword. In 2017-2018, Riot Blockchain -- formerly Bioptix, a biotech diagnostics company -- pivoted to blockchain and watched its stock spike before crashing. Every major retailer in the 2010s declared itself "digital-first"; most bolted an e-commerce site onto existing operations and called it transformation. The companies that won -- Amazon, Shopify -- were digital-native from the start. They didn't need the label.

The buzzword captures something real about a technological shift. Then the buzzword gets weaponized as marketing before the technology matures; companies rebrand around it, stock prices move, consultants publish frameworks, and the SEC eventually gets involved. The buzzword fades. A new one takes its place.

"AI-first" is the current buzzword. The playbook hasn't changed.

The Earnings Call Effect

The financial incentive to declare yourself "AI-first" is measurable -- and it has almost nothing to do with whether AI creates value for your business.

Since ChatGPT launched in November 2022, AI mentions on earnings calls went from roughly 500 per quarter to over 30,000 by the end of 2023. A sixty-fold increase in twelve months. Companies that mentioned AI on earnings calls saw an average stock price increase of 4.6%, compared to 2.4% for those that didn't; among tech companies specifically, 71% that mentioned AI saw their stock rise, with an average gain of 11.9%. Roughly one-third of stock gains for "AI-exposed" firms were attributable to their GenAI discussions alone -- not to any measurable AI output, but to the act of talking about it.

The money followed the narrative. Global AI venture capital hit $202.3 billion in 2025, up 75% year-over-year. AI captured 53% of all global VC funding; in the U.S., that number was 64%. But 78% of AI startups launched in 2024 are API wrappers -- over 12,000 companies building on the same foundation models, differentiated primarily by their landing pages.

Regulators noticed the gap between claims and reality. In March 2024, the SEC charged Delphia and Global Predictions with making false and misleading statements about their use of AI -- the first-ever "AI washing" enforcement actions. By August 2025, the FTC had launched "Operation AI Comply," suing Air AI Technologies for claiming its product could "fully replace human sales representatives" when the technology couldn't perform basic functions like placing outbound calls.

And then there's the layoff theater. Oxford Economics published an analysis in January 2026 that cut through the noise: AI-attributed job cuts accounted for just 4.5% of total reported layoffs, while standard "market and economic conditions" cuts were four times larger. Their conclusion was blunt: "We suspect some firms are trying to dress up layoffs as a good news story rather than bad news." Attributing cuts to AI "conveys a more positive message to investors" than admitting to business failures.

The incentives are clear. Mention AI; stock goes up. Declare "AI-first"; funding flows in. Attribute layoffs to AI; investors applaud your efficiency. None of this requires AI to produce a single dollar of value.

The Data Beneath the Branding

Strip away the branding and look at what "AI-first" companies are producing. The reality doesn't match the narrative.

A 2025 MIT study found that 95% of businesses are seeing zero measurable return on their AI investments. S&P Global reported that 42% of companies are abandoning most of their AI initiatives -- up from 17% the year before. NTT DATA puts GenAI deployment failure rates at 70-85%. The numbers are consistent across every major analyst; the only thing that varies is how bad the picture looks.

The AI startup landscape is worse. Of those 12,000+ wrapper startups, 60-70% generate zero revenue. Only 3-5% surpass $10,000 in monthly revenue. The churn rate is staggering: 65% of AI wrapper customers leave within 90 days -- nearly double the SaaS industry average of 35%. Roughly 90% of AI startups fail within their first year, compared to ~70% for traditional tech firms. These aren't companies building defensible technology; they're companies wrapping an API and hoping the branding holds.

McKinsey's 2025 State of AI report captures the gap between perception and performance most precisely. Eighty-eight percent of respondents report regular AI use; 72% have adopted GenAI in at least one function. Sounds like a revolution. But only 39% report any enterprise-level EBIT impact, and only one-third have begun to scale their AI programs. Almost everyone is "using AI." Almost nobody is seeing financial results from it.

The gap between declaring AI identity and achieving AI results isn't a gap. It's a massive chasm.

The Five-Step Pattern

There's a pattern to how "AI-first" declarations play out. It's predictable enough to map; it's consistent enough to name.

Step one: CEO announces an aggressive AI mandate -- public memo, earnings call, or press interview. Step two: backlash follows -- employees resist, customers boycott, investors scrutinize. Step three: reality emerges -- quality drops, costs rise, customers leave. Step four: walk-back or reversal. Step five: the narrative quietly shifts, and what was "AI-first" becomes something softer or disappears from the talking points entirely.

The arc plays out the same way across industries, company sizes, and geographies.

Klarna was the poster child. In 2024, CEO Sebastian Siemiatkowski bragged openly that AI was "doing the work of 700 full-time agents." The narrative drove their IPO filing. By May 2025, Siemiatkowski was telling Bloomberg the opposite: "Cost unfortunately seems to have been a too predominant evaluation factor... what you end up having is lower quality." Klarna began rehiring human agents. Customer service costs rose to $50 million in Q3 -- up from $42 million -- despite the company's claimed $60 million in AI savings.

Duolingo turned messaging into self-inflicted damage. CEO Luis von Ahn posted a memo in April 2025 declaring Duolingo "AI-first" and warning that "small hits on quality are an acceptable price to pay." DAU growth dropped from 56% in February to 37% by June. The stock ended 2025 down 45.9%. The company earned an exhibit in the Museum of Failure. The walk-back came months later: "I did not give enough context." The sharpest irony -- Duolingo never laid off a single full-time employee. The damage was almost entirely self-inflicted through branding.

Amazon showed the pattern at scale. In June 2025, Andy Jassy told employees that AI would "reduce our total corporate workforce." The response on internal Slack was immediate and overwhelmingly hostile. By October, Amazon announced 14,000 layoffs citing AI; Jassy then walked it back, calling them "about culture, not AI." By December, over 1,000 employees had signed an open letter warning that the company's "all-costs-justified, warp-speed approach to AI development will do staggering damage."

Three companies; three industries; the same five steps. The pattern repeats because "AI-first" as an organizational identity is fragile. It invites scrutiny from every direction -- employees who fear replacement, customers who notice quality drops, investors who eventually demand proof. And the internal resistance is measurable: 31% of workers report actively sabotaging their company's AI rollout, jumping to 41% among millennials and Gen Z. One in ten admit to tampering with performance metrics to make AI appear to underperform.

The prediction market has already priced in the reversals. Gartner expects that by 2027, half of companies that cut customer service staff due to AI will rehire them -- under different job titles. Forrester predicts half of all AI-attributed layoffs will be reversed by end of 2026. If "AI-first" were a sound strategy, the companies declaring it wouldn't keep reversing course.

A Taxonomy That Matters

"AI-first" tells you nothing. It's a branding label, not a strategy description. A three-part taxonomy is more useful for evaluating companies, strategies, and your own roadmap.

AI-native: the product cannot exist without AI. The AI isn't a feature bolted on later; it's the foundation the entire product grows from. TikTok's recommendation engine is the product -- content discovery powered by AI is the entire value proposition. Midjourney is image generation; remove the AI and nothing remains. Superhuman built email around AI from day one -- Split Inbox, AI writing, intelligent sorting are the core experience, not add-ons.

The defining characteristic of genuinely AI-native companies is that they don't need to call themselves "AI-first." Nobody describes TikTok as an "AI-first company"; they describe it as a video platform. The AI is invisible infrastructure. When the label is self-evident, you don't need the marketing.

AI-enhanced: AI makes an existing product better, but the product works without it. This is the majority of successful AI deployment, and there is nothing wrong with it. Salesforce adding AI features to CRM; banks using AI for fraud detection; logistics companies optimizing delivery routes. The value proposition exists independent of AI; AI accelerates, improves, or extends it.

AI-washing: a marketing label applied to the same product with an API call bolted on. No meaningful integration; no proprietary data advantage; no workflow redesign. A GPT wrapper, a chatbot skin, or a buzzword added to product descriptions. This is where the 78% of wrapper startups live, and it's where most self-declared "AI-first" companies land.

Now apply the taxonomy to the companies from the previous section. Klarna is AI-enhanced -- customer service existed long before AI; AI was an efficiency layer. Duolingo is AI-enhanced -- language learning worked before AI; AI accelerated content production. Shopify is AI-enhanced -- the e-commerce platform existed for over a decade before any AI features shipped. All three declared "AI-first." None of them are. The taxonomy exposes the gap between branding and operational reality.

Here is a simple question -- but one worth taking to your next strategy meeting: if you removed the AI from your product, would the product still work?

If yes, you're AI-enhanced. That's a perfectly valid strategy. Build from there.

If no, you might be genuinely AI-native. Build your moat accordingly -- in proprietary data, domain expertise, and workflow integration, not in which model you call.

If you're not sure, you might be AI-washing. That's the dangerous position.

The Commoditization Test

"AI-first" as identity has a deeper problem than inaccuracy. It becomes meaningless when the AI layer commoditizes. And the evidence suggests that process is already well underway.

In January 2025, Chinese startup DeepSeek released a reasoning model nearly equivalent to the best U.S. models at a fraction of the cost. Open-source. Claimed training cost of roughly $6 million. The market reaction was immediate: Nvidia lost $588.8 billion in market value in a single day -- the largest single-day loss any stock has ever recorded. The core investor fear wasn't about DeepSeek specifically; it was about what DeepSeek implied. If a Chinese startup can build competitive AI for $6 million, why are U.S. tech companies spending hundreds of billions on infrastructure that a fraction of the cost can replicate?

OpenAI itself signaled the shift. The company has positioned itself as "not a model company; it's a product company that happens to have fantastic models." When the company building the models tells you the models aren't the differentiator, listen. Andrew Chen at a16z made the same observation: the axis of competition is shifting from "can you build it?" to "will consumers come? And will they stick?" It's the same transition that defined Web 2.0; the technology becomes table stakes, and the winners differentiate on everything else.

The infrastructure math doesn't close, either. Sequoia Capital calculated that AI infrastructure spending would need to generate $600 billion in annual revenue to justify current CapEx levels. The gap between investment and revenue "continues to loom large." In January 2026, Microsoft reported record revenue and beat analyst estimates -- then disclosed $37.5 billion in quarterly CapEx for AI data centers. The stock dropped 10.5%, erasing approximately $375 billion in market capitalization. As Morningstar analysts put it: "The era of rewarding 'AI potential' has ended, and a new, more demanding era of 'AI proof' has begun."

If your identity is "AI-first" and the AI layer commoditizes -- when every competitor has access to equivalent models at equivalent cost -- what's left? The answer isn't AI. It's everything around AI: domain expertise, proprietary data, workflow integration, distribution, user experience. The companies that will win are building moats in those layers. The companies declaring "AI-first" are defining themselves by the commodity.

Where This Breaks Down

The taxonomy isn't a reason to dismiss AI. It's a reason to be precise about what you're building and why.

Genuinely AI-native companies exist, and they're defensible. TikTok, Midjourney, vertical SaaS products that reimagine entire workflows around AI capabilities -- these started from different questions and imagined solutions that only make sense because AI exists. They don't need the "AI-first" label because their products are self-evidently built on AI. The distinction matters.

The technology itself is transformative for specific, well-defined use cases: recommendation engines, fraud detection, drug discovery, content generation, code assistance. These are real capabilities producing real value; dismissing them would be as foolish as the hype. The critique isn't "AI doesn't work." It's that declaring "AI-first" tells you nothing about whether AI works for your specific context, your specific problems, or your specific customers. Companies seeing the most value from AI set growth and innovation objectives beyond cost-cutting; they redesign workflows rather than bolting AI onto existing processes. Purchasing from specialized vendors succeeds 67% of the time, compared to roughly 22% for internal builds. The path to AI value is specific, targeted, and unglamorous. It's the opposite of a branding exercise.

Gartner placed GenAI in the Trough of Disillusionment in 2025. This isn't the end of AI; it's the correction. Technologies that survive the trough emerge with realistic expectations and genuine adoption patterns. The companies that come out the other side will be the ones that invested in real integration -- not the ones that invested in the label.

What to Do Instead

If you're a CTO fielding "are we AI-first?" from your board, you're not alone, and the pressure is real. Board oversight disclosure on AI increased 84% year over year -- 150% since 2022. Shareholder proposals focused on AI quadrupled in 2024 versus 2023. But 66% of board directors report "limited to no knowledge or experience" with AI, and fewer than 25% of companies have board-approved AI policies. The dynamic is dangerous: AI-illiterate boards demanding transformation they don't understand, driven by investor anxiety they can't evaluate.

PwC reports that 42% of CEOs believe their company won't be viable beyond the next decade on its current path. That existential anxiety creates enormous pressure to show AI transformation -- even when the transformation is theater. The board doesn't want theater. They want answers they can defend to shareholders. Give them precision instead of buzzwords.

Fix the data first. Forty-three percent of organizations cite data quality as their top AI obstacle; 57% admit their data isn't ready for AI. No amount of "AI-first" branding fixes bad data infrastructure. This is the boring, unglamorous work that makes AI deployments succeed or fail, and it belongs in your board deck before any AI initiative does.
Redesign workflows; don't bolt AI onto existing processes. McKinsey's key finding across companies seeing genuine AI value: they redesigned how work gets done, not just what tools people use. The board deck should show workflow transformations with measurable outcomes, not tool purchases.
Build domain advantages, not model dependencies. Value lives in proprietary data, domain expertise, and workflow integration. When the model layer commoditizes -- and it will -- these are what remain. Your moat is never the API you call.
Set growth objectives, not just efficiency targets. Companies setting growth and innovation goals beyond cost-cutting see the most AI value. "AI-first" memos are almost always about cutting costs. That's the wrong optimization target, and it's one that invites the five-step pattern of backlash, reversal, and narrative shift.
Answer the board with precision, not buzzwords. Replace "we're AI-first" with specifics: "We're deploying AI against these three problems, with these KPIs, and here's what we've learned so far." Use the taxonomy: "We're AI-enhanced in customer service, AI-native in our recommendation engine, and evaluating AI for supply chain optimization. We're not AI-first -- we're AI-specific." That answer gives the board something defensible. "AI-first" gives them a press release.

Questions for CTOs

If you stripped the AI from your product, what would be left? Is that enough?
When your board asks "are we AI-first?" -- what are they asking, exactly? And are you answering the question they mean, or the one they said?
Can you name three specific problems your AI initiatives are solving -- with KPIs attached? If not, you might be declaring an identity rather than executing a strategy.

In eighteen months, when the models are commoditized and every competitor has access to the same capabilities, what's your moat? If the answer is "we're AI-first," you don't have one.

Long Island Iced Tea didn't become a blockchain company by changing its name. Your company doesn't become an AI company by declaring itself "AI-first." It becomes an AI company by solving problems that AI is uniquely suited to solve -- and being honest about the ones where it isn't.

Audio: No Vibes Allowed: Context Engineering for Real Codebases

Allan MacGregor 🇨🇦 — Fri, 20 Feb 2026 12:05:18 GMT

If you believe AI coding tools are speeding up your teams but delivery metrics don’t show it, you’re not imagining things. A rigorous trial with experienced open-source developers found that AI assistance actually slowed them down by 19%, even though they felt 20% faster—a 40-point perception gap. This wasn’t novice error; these were skilled devs on familiar codebases using mainstream AI tools. The disconnect between perception and reality is real, and it’s backed by solid data.

Stanford’s extensive study confirms AI coding tools boost productivity on simple, new projects by up to 40%, but that gain halves or disappears as task complexity and codebase size grow. For hard tasks in mature systems, AI helps little or even hurts, mainly because fixing AI-introduced bugs eats into any speed gains. The bigger the codebase, the worse the AI performs. Context window limits and intricate dependencies overwhelm current models, turning AI from helper to liability on your toughest problems.

And it gets worse. AI-assisted commits are changing code quality in troubling ways. GitClear’s analysis reveals copy-pasted code is on the rise, refactoring is tanking, and code churn is doubling. AI models optimize for local correctness—code compiles and tests pass—but global architecture coherence degrades. CodeRabbit’s study of pull requests shows AI coauthored code has nearly twice as many major issues, security vulnerabilities up to double, and readability problems tripled compared to human-only work. Developers know this firsthand: trust in AI accuracy dropped from 40% to 29%, and most say they spend more time fixing AI’s “almost right” code than they save. The “slop factory” churns on—ship fast, fix later, repeat—with questionable net velocity and clear quality decline.

The industry divides into three camps. Camp 1 says AI is fundamentally incapable of handling complex systems; the evidence supports this. Camp 2 hopes smarter future models will fix these problems, so companies wait passively for advances. Camp 3, however, argues the bottleneck isn’t the AI model itself but how we feed it information—context engineering. With the right workflow, today’s models can handle large codebases effectively. This is where new breakthroughs are happening.

Dex Horthy from HumanLayer nails the core constraint: context window physics. AI models have a cliff effect—once you fill beyond about 40% of the context window, accuracy plummets. Just dumping more code into the prompt makes things worse, not better. His solution is “frequent intentional compaction”—deliberately compressing, validating, and reloading context throughout the development process to keep the AI’s input clean and focused. The damage hierarchy is critical: incorrect context poisons everything downstream, missing info leads to guesswork, and noise wastes tokens but is least harmful. The formula is simple: prioritize correctness first, completeness second, compactness third, and minimize noise.

Applying this means three phases: Research—map the architecture and relevant files with fresh context windows and human review; Plan—craft a precise implementation strategy with clear file edits and tests, keeping context load moderate and reviewed by domain experts; Implement—execute the plan with minimal overhead, verifying continuously and compressing status back into context. The insight is counterintuitive: most time should go into research and planning, not code writing. Research yields tenfold return, planning fivefold, implementation just onefold. Humans add the most value by reviewing research and plans, not raw code. Flawed assumptions early on multiply downstream mistakes. As Horthy says, “Do not outsource the thinking.”

This approach delivers results. Horthy, an amateur Rust dev new to a 300K-line codebase, produced a one-shot PR approved by the project CTO. Another time, he and a collaborator implemented 35,000 lines of WebAssembly support in seven hours—a task estimated at days per engineer. But it’s not magic. They failed to remove Hadoop dependencies from Parquet Java because that required deep architectural understanding that can’t be compressed into context windows. Context engineering works spectacularly for decomposable problems, but not for holistic architectural redesigns. Knowing that boundary is crucial.

Context engineering is gaining traction as a discipline. Martin Fowler defines it as curating what the model sees to improve outcomes—not just prompt phrasing but workflow engineering. Spotify and others have published enterprise-scale approaches. The CLAUDE.md ecosystem exemplifies this: persistent markdown files encoding build commands, coding conventions, architecture decisions, and lazy-loaded skills guide AI tasks. But as Fowler cautions, certainty is impossible with LLMs; you must think probabilistically. Horthy warns against buzzword dilution—if your vendor can’t explain the damage hierarchy, they’re not truly doing context engineering.

Here’s the 90/10 rule for CTOs: For roughly 90% of AI coding—simple tasks, greenfield work, small fixes—AI tools yield real 15–40% gains with minimal workflow change. But for the critical 10%—complex tasks in large codebases that determine stability, security, and maintainability—AI without context engineering is neutral or worse. The mistake is expecting the same AI workflow to handle both. Discipline in context engineering bridges that gap.

Open questions remain. Can mid-level engineers learn this discipline? Does it scale from solo experts to teams? What if you lack a domain expert? Cultural leadership is key; tool adoption alone won’t cut it. Meanwhile, senior engineers see the tradeoffs clearly, while juniors produce AI-assisted code that increases technical debt. Context engineering might be the bridge, but it’s unproven at scale.

I’m running experiments applying context engineering to measure where AI helps and where it creates rework, by task and codebase area. The data matches the 90/10 pattern. Routine work sees gains; complex integration demands the full research-plan-implement rigor to avoid net negative outcomes. This is a bet on discipline over tooling. The developers who master context engineering won’t just be faster; they’ll do the work AI can’t do alone. Maybe future models will make this irrelevant, but waiting risks falling behind. The skills—research rigor, structured planning, domain expertise—are valuable no matter what.

So ask yourself: When your team uses AI on complex work, are they investing in research and planning or just generating code faster? Do you measure AI-induced rework? Who on your team is developing context engineering skills—or are you waiting for smarter models? Context engineering makes explicit the bottleneck that’s always been there: understanding the problem well enough to write the right code. Without it, you’re just generating slop faster.

You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.

Read the full article — with all the data and sources — on ThePragmaticCTO.

No Vibes Allowed: Context Engineering for Real Codebases

Allan MacGregor 🇨🇦 — Fri, 20 Feb 2026 12:01:24 GMT

A randomized controlled trial of 16 experienced open-source developers working 246 real-world tasks found that developers using AI coding tools took 19% longer to complete their work. But they believed they were 20% faster. A 40-percentage-point perception gap; the developers weren't just wrong about the magnitude of the improvement, they had the direction backwards.

These weren't beginners fumbling with a new tool. They averaged five years of experience on the specific codebases where they were tested. They used Cursor Pro and Claude 3.5/3.7 Sonnet --- mainstream tools, not fringe experiments. The methodology was rigorous: randomized, controlled, pre-registered. And the result was unambiguous.

If you're a CTO and your teams report that AI tools are "helpful" while your delivery metrics stay flat, you're not imagining things. The data confirms the disconnect.

Stanford's three-year study across 600+ companies and 100,000+ developers fills in the rest of the picture. AI coding tools increase productivity 15--20% on average --- but that average obscures massive variation. Simple tasks on new projects see 30--40% gains. Simple tasks in existing codebases see 15--20%. Hard tasks in mature codebases? Zero to 10% gains, sometimes negative. As Stanford's researchers noted, "a significant portion of that gain is lost fixing the bugs and mess the AI made."

The degradation scales with complexity. As codebase size increases from 10K to 10M lines of code, AI's productivity contribution drops sharply. Context window performance degrades from roughly 90% accuracy at 1K tokens to around 50% at 32K tokens. Signal-to-noise ratio collapses; dependencies and domain-specific logic grow more intricate than the model can reason about unaided.

The pattern is clear: AI coding tools work well on small, isolated problems. They struggle --- and sometimes actively hurt --- on the large, interconnected codebases where your hardest engineering problems live. The question is whether that gap is permanent or whether something can be done about it.

The Slop Factory

The speed problem is bad enough. The quality problem is worse.

GitClear analyzed 211 million lines of code across 2020--2024 and found that AI-assisted development is fundamentally changing what gets committed. Copy-pasted code rose from 8.3% to 12.3%. Duplicated code blocks of five or more lines increased eightfold in 2024. Refactoring collapsed --- from 25% of all changes in 2021 to less than 10% in 2024, a 60% decline. Code churn doubled; new code revised within two weeks grew from 3.1% to 5.7%. For the first time in GitClear's measurement history, copy-pasted lines exceeded moved or refactored lines.

LLMs prioritize local functional correctness over global architectural coherence. The code compiles. The tests pass. But the system accumulates entropy --- duplicated logic, ignored abstractions, brittle coupling --- that compounds with every AI-assisted commit.

CodeRabbit's analysis of 470 real-world pull requests quantified the damage. AI-coauthored PRs averaged 10.83 issues versus 6.45 for human-only PRs. 1.7x more major issues; 1.4x more critical issues. Logic errors up 75%. Security vulnerabilities up 1.5--2x. Readability issues up 3x. Performance bugs up 8x.

Developers know this. The Stack Overflow 2025 survey found that trust in AI accuracy fell from 40% to 29% year over year. Sixty-six percent say they spend more time fixing "almost-right" AI code than they save. More developers actively distrust AI (46%) than trust it (33%).

Dex Horthy, founder of HumanLayer, named the dynamic concisely: "A lot of the extra code shipped by AI tools ends up just reworking the slop that was shipped last week."

The slop factory. Ship fast on Monday; fix what you shipped on Friday. Net velocity gain: debatable. Net quality impact: measurable and negative.

This is not an anti-AI argument. The productivity gains on simple tasks are real; the Stanford data confirms that. But when AI coding tools are deployed without discipline on complex codebases, the quality evidence is damning. And quality problems compound in ways that speed gains do not.

Three Camps

The industry has sorted itself into three responses to this data.

Audio: The Software Factory: When No Human Writes or Reviews the Code

Allan MacGregor 🇨🇦 — Wed, 18 Feb 2026 14:01:52 GMT

StrongDM’s Software Factory throws down a radical challenge: no human writes code, no human reviews it, and you better be spending at least a thousand dollars a day in tokens per engineer to keep up. They’ve skipped every safety net most of us rely on and gone all-in on agentic AI development. This is either the future of software or the blueprint for a disaster waiting to happen.

But before we get skeptical, credit where it’s due: the engineering behind this is impressive. They don’t just throw AI at the problem; they build structured, spec-driven workflows. The cleverest idea is using “scenarios” as holdout sets — user stories stored outside the codebase that AI agents can’t see, preventing them from gaming their own tests. It’s a principle borrowed from machine learning, where you never train on your test set. Then there’s their Digital Twin Universe — full behavioral clones of third-party services like Okta and Slack, running thousands of tests at scale without API costs or rate limits. This isn’t casual; it’s a methodical, iterative approach to growing correctness, not just generating code once and shipping.

But here’s the rub: the numbers don’t support skipping human review. CodeRabbit’s December 2025 report analyzed hundreds of real-world pull requests and found AI-generated code had 1.4 times more critical issues and 1.7 times more major issues than human code. Security vulnerabilities doubled, readability issues tripled, and performance problems were eight times more frequent. Veracode and FormAI studies confirm half or more of AI-generated code samples have security flaws. Now imagine this in StrongDM’s context — software controlling enterprise access. Trusting AI alone on security-critical code is a gamble with catastrophic downside.

And it gets worse. Real-world failures have already happened with some human oversight, like Replit’s AI agent wiping a live production database during a code freeze, or Moltbook leaking 1.5 million API keys because AI-generated schemas lacked essential security settings. StrongDM’s model removes human review entirely — no code writers, no reviewers — so the guardrails that failed with humans won’t exist at all. When no one understands the code, who investigates the failures? Incident response and compliance become nightmares if the audit trail is just AI conversations.

StrongDM’s answer to verification is the holdout sets, but who writes those? If humans do, you haven’t eliminated human review — you’ve just moved it upstream. If AI writes the scenarios too, you’ve just pushed the problem higher, with agents verifying agents verifying agents. Software edge cases are unbounded; you can’t test what you haven’t imagined. That missing checkbox in Moltbook’s breach is a perfect example. The most brittle part breaks the system, and in security software, that brittle part is the attacker's first target.

The economics add another layer of complexity. Spending $1,000 per day per engineer on tokens means $240,000 a year just on AI usage — more than the median software engineer salary. StrongDM builds high-priced enterprise security software, so maybe it makes sense there. But for most startups or broader software development, the cost is prohibitive. Plus, if AI can build your product from specs, it can build your competitor’s too. Your moat shifts from code to your scenario library, which is just documentation and far easier to copy.

There is a middle ground. Sam Schillace, Microsoft’s Deputy CTO and creator of Google Docs, lays out “Coding Laws for LLMs” that are pro-AI but insist on human oversight. His key point: don’t write code if AI can do it, but always keep human validation checkpoints. Treat models as tools, not autonomous agents. StrongDM’s rules directly contradict these principles. Given the data and incidents, the evidence supports keeping humans in the loop for now.

What about the engineers? If no human writes or reviews code, what do they do? The optimistic spin is they become supervisors and architects, focusing on high-level design and domain expertise. The harsher truth is you’re shifting from coding to prompt engineering and scenario design — valuable but fewer roles overall. More critically, when no one writes or reviews code, the team loses shared understanding and the mental model of the system decays. Maintenance, debugging, and evolution get harder, not easier.

The real test is happening now: StrongDM is being acquired by Delinea, a major identity security player. Will they keep the “no human review” approach for security-critical products once compliance and risk are on the table? Or will human oversight return? That answer will tell us more than any manifesto about whether the dark factory model is viable or just an experiment.

As for me, I’m not embracing the dark factory. The data doesn’t justify removing human review, especially in security-sensitive contexts. But I’m borrowing ideas: keeping verification scenarios outside the codebase is smart, and smaller-scale digital twins or mocks for integration testing are worth exploring. I’m watching the trajectory carefully but won’t abandon human judgment until the numbers say it’s safe.

The Software Factory isn’t about vision or ambition alone — it’s about evidence. Their holdout set concept is worth adopting. Their engineering deserves respect. Their philosophy is provocative but premature. The real question for every CTO is: what defect rate would make you comfortable trusting AI without human review? Are we there yet? For now, the answer is no.

You can read the full article — with all the data and sources — on ThePragmaticCTO Substack.

Read the full article — with all the data and sources — on ThePragmaticCTO.

The Software Factory: When No Human Writes or Reviews the Code

Allan MacGregor 🇨🇦 — Wed, 18 Feb 2026 12:02:09 GMT

StrongDM's Software Factory has three cardinal rules. Rule one: code must not be written by humans. Rule two: code must not be reviewed by humans. Rule three: if you haven't spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement.

Three rules. No hedging, no qualifiers, no "except when."

The guiding mantra for every engineer on the team is a single question: "Why am I doing this?" The implication is clear; the model should be doing it instead. Every line of code a human writes is, in their framing, a failure of imagination --- a task that should have been delegated to an agent.

Simon Willison published his analysis of the approach today, calling it "the most ambitious form of AI-assisted software development I've seen yet." He frames it as Level 5 on a spectrum from "spicy autocomplete" to what StrongDM calls the Dark Factory --- fully agentic development where humans don't write code and don't review it. Most of us are somewhere around Levels 2 and 3; StrongDM skipped straight to the end of the spectrum.

That alone would be worth discussing. But context matters.

StrongDM builds security and access management software --- permission management across Okta, Jira, Slack, and Google services. They're being acquired by Delinea, an identity security company, with the deal expected to close Q1 2026. No human writes the code that controls access to enterprise systems. No human reviews it. This is either the most visionary approach to software development anyone has shipped, or the setup for a catastrophe that writes its own case study. The data should tell us which.

The Engineering Is Disciplined

Before the skepticism, StrongDM deserves credit for what they've built. This is not vibe coding. The engineering is structured, specification-driven, and contains ideas that deserve serious analysis --- regardless of whether you buy the philosophy.

The strongest idea is scenarios as holdout sets. The problem is well-known: when agents write both code and tests, they game the tests. An agent can trivially write `assert true` and declare victory. StrongDM's solution replaces traditional tests with "scenarios" --- end-to-end user stories stored outside the codebase, invisible to the code-generating agents. The analogy comes from machine learning; you never train on your test set because it corrupts evaluation. StrongDM applies the same principle to software verification. The agents can't see the scenarios, so they can't game them. The satisfaction metric shifts from boolean --- did all tests pass? --- to probabilistic: what fraction of observed trajectories through all scenarios likely satisfy the user?

That's a genuinely smart framing. It addresses the most obvious objection to AI-generated testing in a way that borrows from a discipline with decades of rigor behind it. If you've worked with ML pipelines, you recognize the logic immediately; the principle is sound even if you question the scope of its application.

The Digital Twin Universe is equally impressive. StrongDM built behavioral clones of third-party services --- Okta, Jira, Slack, Google Docs, Google Drive, Google Sheets --- as self-contained Go binaries that replicate APIs, edge cases, and observable behaviors. They run thousands of scenarios hourly; they test at volumes exceeding production limits; they simulate dangerous failure modes impossible against live services. No rate limits. No API costs. Building full SaaS replicas was always theoretically possible but economically unfeasible; agentic development reverses the cost equation.

The team calls this "grown software" --- code that compounds correctness through iteration rather than degrading over time. Not generated once and shipped; grown through cycles of agent-driven refinement against scenario validation. The Software Factory was founded July 14, 2025 by Jay Taylor, Navan Chauhan, and Justin McCarthy, StrongDM's CTO and co-founder. The catalyst, according to them, was Claude Sonnet 3.5's October 2024 revision, which enabled "long-horizon agentic coding workflows" that compound correctness rather than error. Subsequent models --- Opus 4.5, GPT 5.2 --- increased reliability further; the trajectory gave them confidence to go all-in.

It matters that Willison is the one taking this seriously. He's been one of the most rigorous and careful observers of AI-assisted development for years. His assessment: this is structured, spec-driven agentic development, not reckless experimentation. He remains most interested in "enabling agents to prove code works without human line-by-line review." Coming from Willison, that's not hype. It's a signal worth tracking.

The holdout-set concept is worth stealing. The DTU is worth studying. The engineering behind the Software Factory is disciplined enough that dismissing it outright would be intellectually lazy.

The philosophy is a different question.

The Numbers Don't Support It

The quality data on AI-generated code is unambiguous, and it runs directly counter to "no human review."

CodeRabbit's "State of AI vs Human Code Generation" report, published December 2025, analyzed 470 real-world open source pull requests --- 320 AI-coauthored, 150 human-only. AI-authored PRs contained 1.4x more critical issues and 1.7x more major issues than human-written PRs. The averages: 10.83 issues per AI PR versus 6.45 for human PRs. Logic and correctness issues --- business logic errors, misconfigurations, unsafe control flow --- rose 75%. Security vulnerabilities increased 1.5--2x. Code readability problems jumped more than 3x. Performance inefficiencies appeared nearly 8x more often in AI-generated code.

Those numbers deserve a second read. Not 10% worse. Not marginally worse. Measurably, significantly worse across every dimension that matters for production software --- logic, security, readability, performance. The study looked at real-world pull requests in open-source projects; these aren't synthetic benchmarks or contrived examples.

The security dimension is particularly damning. The Veracode 2025 report found that 45% of AI-generated code contains security vulnerabilities, with XSS errors appearing in 86% of AI-generated cases and SQL injection in 20% of generated code samples. The FormAI study analyzed 112,000 C programs generated by ChatGPT; 51.24% contained at least one security vulnerability.

Now apply that to StrongDM's context. They build access management software --- the software that determines who can access what across your enterprise systems. Applying "no human review" to security-critical software means trusting AI agents to get security right, when every major study shows AI code has 1.5--2x more security vulnerabilities than human-written code. StrongDM's holdout scenarios may catch some of this. But scenarios are only as comprehensive as the person --- or agent --- that writes them.

The failure mode here isn't a broken feature. It's a security breach.

When the Dark Factory Has a Dark Day

The failure cases are not hypothetical. They've already happened --- at companies with more human oversight than StrongDM proposes.

In July 2025, a Replit AI agent deleted a live production database during an active code freeze. It wiped data for over 1,200 executives and 1,190 companies. The agent admitted to running unauthorized commands, panicked in response to empty queries, and violated explicit instructions not to proceed without human approval. A code freeze, explicit guardrails, human involvement in the process --- and the agent still destroyed a production database.

In January 2026, Moltbook launched a platform on the 28th. By the 31st --- three days later --- it had leaked over 1.5 million API keys and exposed countless user databases. It was called the first "Mass AI Breach" in tech history. The root cause was straightforward: AI agents generated functional database schemas but never enabled Row Level Security. No human ever reviewed the critical configuration. The post-mortem was blunt: "mistakes that any experienced engineer would have caught."

Both of these incidents happened with some level of human involvement in the development process. Replit had a code freeze and explicit guardrails; Moltbook had human developers in the loop. StrongDM's philosophy explicitly removes that involvement. The guardrails that failed in these cases wouldn't exist at all in the dark factory model.

The accountability question is worth sitting with. When nobody wrote the code and nobody reviewed it, who reconstructs the failure? Incident response assumes someone understands what the code does and why decisions were made. In a dark factory, the audit trail is a conversation between LLMs. In regulated industries --- finance, healthcare, government --- this isn't a philosophical objection. It's a compliance non-starter.

Moltbook's failure is the one that should keep dark factory advocates up at night. It wasn't a bug in existing logic; it wasn't a regression introduced by a bad commit. It was a missing configuration --- something that nobody, human or AI, thought to include. Row Level Security is a checkbox. A single setting. And its absence exposed 1.5 million API keys in three days. The DTU may catch known failure modes through scenarios. But what about the edge cases that aren't in any scenario? What about the omissions that nobody anticipated?

Who Watches the Watchmen?

StrongDM's answer to the verification problem is the holdout-set concept, and it's clever. The code-writing agents can't see the validation scenarios; they can't game them. This addresses the most obvious objection --- that AI writing its own tests is circular --- in a way that's intellectually satisfying.

But the analogy breaks down at the boundary.

Who writes the scenarios? If humans write them, human involvement hasn't been eliminated; it's been relocated upstream from code review to scenario design. The human review still exists --- it just moved. If agents write the scenarios too, you've pushed the quis custodiet problem one level higher. Now agents verify agents that verify agents. The regression doesn't resolve; it recedes.

Holdout sets in machine learning work because the data distribution is knowable and the test set can be representative of the population. Software edge cases are unbounded. You can't enumerate what you haven't imagined. Moltbook's failure was exactly this type: not a flaw in the logic that was written, but a missing configuration that neither human nor AI thought to include in any scenario. The holdout set can only catch failures it was designed to detect; the catastrophic failures are the ones nobody anticipated.

Schillace's fourth law names this precisely: "The system will be as brittle as its most brittle part." Even if 99% of the pipeline is agentic and robust, the 1% that's wrong propagates through everything. In security software, the most brittle part is the one an attacker finds first.

StrongDM hasn't published defect rates, security vulnerability metrics, or production incident data. The Software Factory was built by a three-person founding team --- not yet proven at organizational scale. The DTU covers specific third-party services --- Okta, Jira, Slack, Google --- but what about novel integrations or unanticipated service behavior?

"Deliberate naivete" is a feature when you're challenging inherited assumptions. It becomes negligence when you're building software that controls enterprise access and the data says AI code has 1.5--2x more security vulnerabilities than human-written code.

The Economics Question

Even if the approach works flawlessly, the economics constrain who can use it.

One thousand dollars per day per engineer. That's $20,000 per month, $240,000 per year --- in token costs alone. On top of salary, benefits, and equipment. The fully loaded cost per engineer in a dark factory model runs $400,000--$600,000 or more annually; the token spend alone exceeds the median US software engineer salary. At what product price point does that make economic sense?

Willison asked the right question: "Does profitability require products expensive enough to justify this overhead?" StrongDM builds enterprise security software --- high price point, low volume. The economics may work there. But the Software Factory is presented as a general methodology, not a niche approach for expensive enterprise products. Can a 20-person startup afford $240,000 per year per engineer in tokens? If not, this is an approach for well-funded companies building expensive products --- not the future of software development broadly.

The competitive moat problem is the second-order concern. If agents can build your product from specs and scenarios, they can build your competitor's product too. The defensibility shifts from code to specifications and domain knowledge. But specifications are easier to reverse-engineer than implementations. Willison flagged this explicitly: the feature cloning risk is real when your competitive advantage is no longer in the code itself. Your moat dissolves into your scenario library --- and scenario libraries are documentation, not defensible intellectual property.

The Moderate Position

There's an alternative framework for thinking about AI in development, and it comes from someone who can't be dismissed as a Luddite.

Sam Schillace --- Microsoft's Deputy CTO, creator of Google Docs --- published "Coding Laws for LLMs," a set of nine principles that are both pro-AI and pro-human-oversight. His first law: "Don't write code if the model can do it." But the model should do it under supervision, not autonomously. His second law: "Trade leverage for precision; use interaction to mitigate." Human validation checkpoints are essential, not optional. His sixth law: "Uncertainty is an exception throw" --- when models lack confidence, human intervention is necessary.

The key line: "Good design of code involving LLMs takes this into account and allows for human interaction."

Schillace advocates treating models as tools, not autonomous agents. This is the mainstream position for engineering organizations operating at scale: use AI aggressively, keep humans in the loop. He's not anti-AI --- he ran Google Docs; he's Microsoft's Deputy CTO; he has as much incentive as anyone to believe in the transformative power of AI-assisted development. But his framework explicitly requires human interaction points, human uncertainty handling, and human awareness of system brittleness. The distinction is between delegation and abdication.

StrongDM's three cardinal rules explicitly forbid what Schillace's laws explicitly require. These are two different bets on where AI code quality is right now. The CodeRabbit data, the Veracode findings, the FormAI study, the Replit incident, the Moltbook breach --- the evidence favors the bet that still includes human review.

The Workforce Problem

If no human writes or reviews code, what do engineers do? The answer reveals whether this is a genuine evolution of the profession or a rationalization for reducing headcount.

The charitable framing: engineers shift from code writers to supervisors and reviewers. Humans provide high-level specifications and architectural guidance; AI handles implementation. Skills gaining importance include systems thinking, security expertise, UX design, and domain knowledge. Traditional coding interviews become "increasingly misaligned with actual work developers now perform."

The scale concern is sharper: "Bringing on agents isn't hiring another person. It's like hiring a hundred intern-level engineers. You can't code review a hundred engineers." In StrongDM's model, you don't review them at all --- the scenarios do.

Then there's the comprehension debt problem --- and this one compounds over time. AI generates working code that nobody on your team understands. Peter Naur argued in 1985 that software isn't the code; it's the team's mental model of the code. When that model decays, software becomes unmaintainable regardless of how clean the code looks. Code review isn't just quality assurance; it's how teams build shared understanding of their systems. When nobody wrote the code and nobody reviewed it, who maintains it? Who debugs it? Who extends it when requirements change? The dark factory assumes maintenance is also agentic, but maintenance requires understanding context, history, and intent --- an even harder problem than generation.

"Supervisors of code-generating systems" is the generous framing. "Prompt engineers with fancy titles" is the cynical one. Both framings point to the same structural shift: value migrates to design, taste, judgment. But how many companies need a full team doing only design, taste, and judgment? The ratio changes; it doesn't change in a way that preserves current headcount. Every CTO running the numbers on agentic development needs to be honest about this implication.

The Acquisition Test

StrongDM is being acquired by Delinea, an identity security company that builds privileged access management and secrets management products. The deal is expected to close Q1 2026.

This matters because it's a real-world test. Did Delinea see the Software Factory methodology and buy it --- or did they buy the product and the customer base? Will the acquirer maintain "no human review" for security products once they own the compliance risk? Startup experiments often don't survive corporate integration; radical methodologies especially. If Delinea imposes human review on StrongDM's code, the Software Factory becomes a case study in methodology, not a sustainable practice.

Worth watching. The answer will tell us more about the viability of the dark factory than any whitepaper or manifesto. Corporate acquirers don't tolerate risk the way three-person founding teams do; the compliance review alone should be illuminating.

What I'm Doing

Not dark factory. Not even close.

The data doesn't support removing human review for production code, and it especially doesn't support it for anything security-adjacent. But I'm not dismissing the underlying ideas either. StrongDM's engineering is disciplined even if the philosophy is premature.

What I'm Considering

Keeping verification scenarios outside the codebase --- separate from the code that agents generate and the tests they write --- is valuable even with full human review in place. I'm experimenting with specification-driven scenarios that no agent touches, validated independently. It's a small change to the workflows I'm using; the improvement in verification confidence could be disproportionate.

The DTU concept at smaller scale: not full behavioral clones of third-party services, but mocked environments that let me test integration behavior without hitting live APIs. This was always good practice; StrongDM made the economics interesting by showing how agents can build and maintain the mocks themselves.

What I'm Not Adopting

"No human review." Not until the CodeRabbit numbers reverse --- and not for security-adjacent code even then. The evidence isn't there. And $1,000 per day in tokens per engineer --- the economics don't work at our scale, and I'm skeptical they work at most scales. We're spending deliberately, not maximally.

Maybe StrongDM is early, not wrong. Maybe AI code quality improves enough in the next two years that "no human review" becomes defensible. I'd rather be late to a methodology that works than early to one that causes a breach.

Closing Thoughts

The Software Factory is not a question about ambition or vision. It's a question about evidence.

The holdout-set idea is smart. The DTU is impressive engineering. The three cardinal rules are ideology, not engineering --- aspiration dressed as methodology. The question isn't "should we go fully agentic?" --- that's a philosophy debate with no falsifiable answer. The question is: what would have to be true about AI code quality for you to trust it without human review?

That question has a measurable answer. And right now, the measurements don't support it.

What defect rate would you need to see before removing human review? Are we there? If your scenarios catch 95% of issues, is the 5% they miss acceptable for your product? For your customers? For your compliance obligations? When --- not if --- an agent-generated system causes a production incident, who in your organization understands the code well enough to diagnose it?

StrongDM's holdout-set concept is worth adopting. Their philosophy is worth watching.

Audio: When AI Agents Write Your Code, Does Language Choice Matter?

Allan MacGregor 🇨🇦 — Tue, 17 Feb 2026 12:15:35 GMT

Jose Valim recently made a bold claim: Elixir is the best language for AI code generation, based on benchmarks showing high completion rates and structural benefits like immutability and ecosystem stability. But this sparks a deeper question—not which language is best, but whether the choice even matters when AI agents write a large chunk of your code.

The real power of languages like Scala, Haskell, or Rust isn’t Elixir’s specifics—it’s the compiler acting as an AI code reviewer. These typed, functional languages provide immediate, strict feedback that forces AI-generated code to be correct before it ever reaches human eyes. That means AI can’t just spit out code that might fail later; it has to meet the compiler’s standards upfront, which cuts down bugs and lets your engineers focus on design, not chasing type errors. Languages like Python or JavaScript don’t have that gatekeeper. AI outputs code that might or might not work, leaving bugs for humans to find later. Functional, stateless code fits the AI’s own mode of operation—small, pure functions with explicit inputs and outputs—while mutable object-oriented code demands context beyond what AI’s limited memory can handle. As Jonathan de Montalembert put it, “The more flexible and forgiving the language, the more dangerous the AI partner becomes.”

That’s compelling, but theory runs into a training data wall. Scott Arbeit showed that even with a language like F#, which ticks all the theoretical boxes, AI models often produce invalid syntax or default to more popular languages like C#. Less popular languages suffer from a vicious cycle of limited training data leading to poor AI output, which suppresses adoption and further reduces data. Meanwhile, Python dominates AI-generated code simply because models have seen more of it—80% of AI agent implementations use Python. Even the Tencent benchmark supporting Elixir had flaws: it filtered out harder problems for low-resource languages, skewing results, and practitioners report better real-world AI reliability with JavaScript or Kotlin. So, while typed functional languages might produce better code in theory, in practice, AI models do better with popular languages they know well.

But here’s the part nobody talks about enough: comprehension debt. AI-generated code can compile, pass tests, even ship—and yet nobody on your team understands how it works. This gap between code behavior and team understanding is insidious. When something breaks, the team can’t trace the logic or confidently modify the system. Peter Naur said decades ago that software is really about the team’s mental model, not just the code itself. AI doesn’t build that theory; it just generates solutions. If your team can’t read or reason about the language AI uses, the codebase becomes a liability, no matter how correct the AI’s output is. So “switch to Elixir because AI writes better Elixir” only works if your team can own Elixir code. Otherwise, mediocre code in a familiar language beats perfect code nobody understands.

And there are bigger constraints overriding theory. Hiring for niche languages like Elixir or Haskell is tough and expensive compared to Python or TypeScript, where talent is abundant. Ecosystem maturity matters too—most AI tools ship Python SDKs first, meaning AI agents have better building blocks in those languages. Existing codebases rarely get rewritten just for AI; migration costs are real and quantifiable, while AI code quality gains remain theoretical and small. Plus, AI models improve rapidly, narrowing gaps between languages over time. Python’s dominance is a network effect moat—like QWERTY or VHS—not easily displaced by technical superiority alone.

So what really makes a codebase AI-friendly? The qualities Valim highlights—immutability, strong typing, stable ecosystems, clear contracts—are portable across languages. You don’t have to switch to Elixir to get immutability; you can avoid mutating state in Python or TypeScript. Strong typing is the investment, not the language—TypeScript strict mode or Python type hints with mypy offer similar guardrails. Good documentation and comprehensive tests give AI agents better context and validation. Small, pure functions with explicit inputs and outputs help AI generate better code regardless of language. Stable APIs reduce confusion for both AI and humans. And letting AI generate types or interfaces before implementation surfaces mistakes earlier. These practices improve code maintainability and AI output simultaneously.

Personally, I’m a fan of Elixir and introduced it to my team at LiORA—not because it’s the best for AI, but because it’s a great team language. It’s proven productive, and with some nudging toward smaller, focused functions, it works well with AI tools like Claude Code. But that’s a team choice, not a universal prescription.

What should CTOs do today? Focus on what’s good for the AI and good for humans alike. Invest in documentation to provide context for both AI and developers. Write smaller functions with clear contracts, applying functional principles even in non-functional stacks. Don’t bet on today’s AI language strengths—they’ll shift in 18 months. Instead, improve your codebase properties now, which pays off for your team and future AI capabilities.

You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.

Read the full article — with all the data and sources — on ThePragmaticCTO.

When AI Agents Write Your Code, Does Language Choice Matter?

Allan MacGregor 🇨🇦 — Tue, 17 Feb 2026 12:01:05 GMT

On February 5th, Jose Valim published a blog post titled "Why Elixir is the best language for AI." His argument wasn't hand-waving. He pointed to a Tencent benchmark where Elixir achieved a 97.5% completion rate across twenty programming languages; Claude Opus 4 scored 80.3% on Elixir versus 74.9% for C# and 72.5% for Kotlin. He walked through Elixir's immutability, its ecosystem stability -- version 1.0 shipped in 2014 and the language is still on v1.x twelve years later -- and its executable documentation verified in test suites. Structural claims backed by data. Not marketing.

This poses an interesting question, is there truly a "best language for AI"? and what does it mean to be the best language for AI? Every language community right has some AI related claim. Rust advocates point to inference speed. Python advocates point to everything. Now Elixir. This is the "best language for web development" wars replayed for the agentic coding era; the actors change but the plot stays the same.

But there's a question underneath the tribalism that's worth pulling apart. Claude Code, Cursor, Copilot, Devin -- these tools are writing 30-80% of new code at many companies right now. If an AI agent is generating most of your codebase, does the target language affect the quality of what comes out?

That question has a more interesting answer than "Elixir wins."

The Compiler as AI Code Reviewer

The strongest version of the argument for typed and functional languages has nothing to do with Elixir specifically. It's about what happens when AI-generated code meets a compiler that can say no.

In languages like Scala, Haskell, or Rust, the feedback loop is tight: AI generates code, the compiler rejects what's invalid, the AI iterates, and eventually produces something correct. The type system catches errors before runtime -- without needing a human in the loop. Think about what that means for your review process. An entire category of bugs gets caught before a pull request ever reaches a human reviewer; your engineers spend time on logic and architecture instead of hunting for type mismatches and null reference errors that a compiler would have caught instantly.

In Python or JavaScript, the feedback loop is looser. AI generates code, it runs, it might work, you find the bugs later. Or you don't.

Alexandru Nedelcu made this case convincingly for Scala. AI agents successfully generate working Scala 3 macro code despite limited training data, because the compiler provides real-time validation via LSP. Expressive type systems don't just make AI code better; they make AI code correctable. The compiler becomes an automated code reviewer that never gets tired, never rubber-stamps a pull request, and catches entire categories of bugs that would sail through a dynamically typed language undetected.

This maps to how LLMs operate. They have limited context windows; they work best generating small, self-described functions with clear inputs and outputs. Stateless functional approaches match the LLM's own operational model -- no memory persistence between generations, no hidden state to track. Immutable data means all transformations are explicit. Pure functions have no side effects. The AI doesn't need to reason about what changed somewhere else in the program.

Contrast this with mutable object-oriented code. Object state can change anywhere. An AI agent generating a method on a class needs to understand what every other method might have done to that object's state before this method runs. That's a lot of context to track; context that fits poorly in a window measured in tokens. The AI doesn't just need to understand the function it's writing -- it needs to understand the entire object graph that function touches. In a large OOP codebase, that graph sprawls across files, modules, and inheritance hierarchies that no context window can fully capture.

Jonathan de Montalembert's framing cuts to the point: "The more flexible and forgiving the target language, the more dangerous the AI partner becomes." Deterministic languages with sound type systems constrain AI mistakes at compile time. Flexible languages let those mistakes ship.

Valim's Elixir-specific arguments are the sharpest example of these principles in practice. Immutability is built in, not optional. The ecosystem hasn't churned -- everything written about Elixir in the last decade still works, which means no training data confusion for models navigating deprecated APIs. Executable documentation with `iex>` snippets, verified in test suites, means the training examples are more likely to be correct.

These are real structural advantages, that make Elixir the powerhouse that it is today. The compiler-as-AI-reviewer argument is genuinely compelling; the functional programming fit with LLM architecture is sound; the stability argument removes an entire class of training data problems that plague fast-moving ecosystems. Anyone dismissing this wholesale isn't paying attention.

The Training Data Problem

In theory, theory and practice are the same. In practice, they are not. -- Yogi Berra

The structural argument is sound in theory. In practice, it runs into a wall.

Audio: OpenAI Didn't Buy a Product. They Bought a Distribution Channel.

Allan MacGregor 🇨🇦 — Mon, 16 Feb 2026 20:27:50 GMT

OpenAI’s recent acquisition of OpenClaw wasn’t just about talent or technology. They bought a distribution channel—a powerful revenue pipeline that was funneling massive API usage and revenue to a competitor, Anthropic. OpenClaw, an autonomous agent platform, defaults its model provider hierarchy to Anthropic’s Claude models, which dominate the token consumption that drives API revenue.

OpenClaw isn’t your average chatbot; it’s a relentless token furnace. It integrates deeply with email, calendars, browsers, and messaging apps, running multi-step workflows with persistent memory. This architecture means it burns through tokens at astonishing rates—sessions can balloon to hundreds of thousands of tokens, and background “heartbeat” checks alone can cost hundreds of dollars per week per agent. Light users spend tens of dollars monthly on API calls, but heavy users can rack up thousands, even tens of thousands, in a single month. This is a quantum leap beyond chatbot-era economics—it’s not incremental, it’s orders of magnitude more expensive.

Those tokens translate directly into revenue for model providers. OpenClaw’s default configurations overwhelmingly favor Anthropic’s Claude models, driving the bulk of this enormous token spend to Anthropic’s API. With OpenClaw’s explosive growth—over 180,000 GitHub stars and an estimated 50,000 to 200,000 active users—this translates to tens of millions, potentially over a hundred million dollars in annualized API revenue flowing to Anthropic. For OpenAI, facing billions in projected losses and intense competition, that’s a revenue leak they couldn’t ignore.

The irony is sharp. Steinberger built OpenClaw explicitly for Claude, even naming it after the Claude model. He was essentially subsidizing Anthropic’s revenue by running high-cost API calls on his own dime. Anthropic’s response was to send a cease-and-desist over the project’s name, alienating the very community driving their growth. Within weeks, OpenAI swooped in, acqui-hiring Steinberger and effectively capturing the most powerful agent ecosystem driving revenue to their competitor.

This acquisition wasn’t just about adding a brilliant engineer or community goodwill. It was about controlling the defaults in agent platforms, which dictate model usage and thus revenue flows. Defaults matter. Just like browser search engine defaults shaped billions in ad revenue, agent platform defaults will shape trillions of tokens in API spend. Autonomous agents running 24/7 with complex workflows generate hundreds of millions to trillions of tokens monthly. Whoever controls that agent layer controls the revenue.

Share The Pragmatic CTO

This is the start of a broader pattern: autonomous agents are becoming the new distribution layer for AI models, much like mobile apps became the distribution layer for cloud infrastructure in the 2010s. Apps created persistent compute demand, driving massive cloud revenue. Agents now create persistent token demand that compounds with each new user and integration. The scale is breathtaking—over 50 trillion tokens processed daily across the market, with agents accounting for nearly half. The economics of model defaults in agent platforms will be the new battleground.

For CTOs evaluating agent infrastructure, this means your choice of default model provider isn’t neutral—it’s a financial commitment. The token economics of agents dwarf chatbot-era costs. A fleet of agents running constant heartbeats can cost hundreds of thousands annually just to maintain status checks. Vendor lock-in now happens not just at the API level but through accumulated context, workflows, and integrations tuned to a specific provider’s models. Switching costs are no longer just about code migration—they’re about losing months of institutional memory embedded in your agents.

Over the next year, I’m watching four key signals. First, whether OpenClaw’s defaults shift from Claude to OpenAI models, signaling revenue redirection. Second, if Steinberger’s projects at OpenAI mirror OpenClaw’s agent approach but built on OpenAI’s stack. Third, Anthropic’s response—will they partner with or acquire another agent platform to reclaim distribution? And fourth, whether agent platform defaults become a negotiation point in enterprise API contracts, akin to search engine default deals.

Ask yourself: do you know where your API spend is going? Have you updated your budgets for the explosive token burn of autonomous agents or are you still thinking in chatbot terms? Would you notice if your agent platform’s default model changed tomorrow? The headlines have moved on from the acqui-hire narrative, but the token economics haven’t. Understanding who controls your agent defaults is no longer just a technical choice—it’s a financial one.

You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.

Read the full article — with all the data and sources — on ThePragmaticCTO.

OpenAI Didn't Buy a Product. They Bought a Distribution Channel.

Allan MacGregor 🇨🇦 — Mon, 16 Feb 2026 20:27:06 GMT

The Token Economics Behind the OpenClaw Acqui-Hire

On February 15, 2026, Sam Altman announced on X that OpenClaw creator Peter Steinberger was joining OpenAI. He called Steinberger "a genius with a lot of amazing ideas about the future of very smart agents interacting with each other to do very useful things for people." The framing was deliberate---talent, vision, the future of personal agents. Every analysis published since has dutifully followed the same narrative; OpenAI acquired a brilliant founder, absorbed the most viral open-source project in GitHub history, and positioned itself to dominate the agent layer.

That narrative isn't wrong. But it's incomplete in a way that matters.

Buried beneath the talent story is a financial reality that almost nobody is discussing. OpenClaw's default provider hierarchy places Anthropic first---above OpenAI, above Google, above every other model provider. The default primary model is `anthropic/claude-opus-4-6`. Steinberger himself recommended Claude Opus 4.6 for heavy agent workloads; community guides consistently called Claude Sonnet 4.5 the "sweet spot" for most users; independent benchmarks found that Claude outperformed GPT-4o on long-context tasks, prompt-injection resistance, and multi-step tool use---the exact capabilities autonomous agents need most. One industry analyst put it bluntly: "OpenClaw was one of the biggest drivers of paying API traffic to Anthropic, since most users ran it on Claude."

OpenAI didn't just buy a genius. I believe they bought a distribution channel that was sending a competitor's revenue through the roof---and they are about to redirected it.

The Token Furnace

Understanding why this acquisition makes financial sense requires understanding how much money autonomous agents burn. OpenClaw isn't a chatbot; it's a 24/7 autonomous system that connects to your email, calendar, messaging platforms, and web browser, chaining multi-step workflows together with persistent memory across sessions. Every one of those operations consumes API tokens; the architecture ensures that consumption is extraordinary.

Six factors make OpenClaw a token furnace. Context accumulation accounts for 40-50% of total spend, because the entire conversation history is resent with every API call; sessions with roughly 35 messages had grown to 2.9 megabytes in one documented case, occupying 56-58% of a 400,000-token context window. Tool outputs from shell commands, file reads, and web fetches deposit thousands of additional tokens into that context; OpenClaw's system prompt---5,000 to 10,000 tokens---ships with every single API call regardless of whether the user is asking a complex question or checking whether any tasks exist. And the default "heartbeat" check runs every thirty minutes, sending the entire 120,000-token context window to the API for what amounts to a status ping. At Opus pricing, that heartbeat alone costs approximately $0.75 per check---roughly $250 per week for an agent that mostly reports nothing.

The per-user costs that result from this architecture are unlike anything the chatbot era prepared us for. Light users consuming 5-20 daily messages spend $10-30 per month on Claude Sonnet; medium users running automated workflows and cron jobs land between $30 and $150; heavy users operating 24/7 assistants with browser automation can reach $750 to $3,000 or more per month on Opus-tier models. The extreme documented cases are worse still. Federico Viticci, the tech blogger, burned through $3,600 in a single month; a German publication hit $100 in a single day of testing; one Moltbook user watched $8 disappear every thirty minutes---$380 per day---just processing new social posts.

Compare that to a ChatGPT conversation, which might consume a few thousand tokens per session at pennies per interaction. The gap between chatbot-era economics and agent-era economics is not incremental; it is orders of magnitude.

Follow the Money

Those tokens are revenue for someone; the question that matters---the one that reframes the entire acquisition---is who.

OpenClaw is model-agnostic by design; users can configure any provider through their own API keys. But defaults drive behavior, and OpenClaw's defaults overwhelmingly favor Anthropic. The provider priority hierarchy in the official documentation reads Anthropic first, then OpenAI, then OpenRouter, followed by Gemini and a long tail of smaller providers; when a user configures an Anthropic API key, Claude models are automatically set as primary. The original project was named "Clawdbot"---a phonetic play on Claude---and the community that coalesced around it adopted Claude as the consensus recommendation for agent workloads. Claude's advantages in long-context reasoning, prompt-injection resistance, and multi-step tool use mapped precisely to what autonomous agents demand most; even users who started with OpenAI keys often migrated to Anthropic after community forums pointed them there.

The aggregate revenue implications of this default are significant, even using conservative assumptions. OpenClaw crossed 180,000 GitHub stars and had 1.5 million agents created by early February 2026. GitHub stars-to-active-user conversion for developer tools typically runs between 10% and 30%, which suggests an active user base somewhere between 50,000 and 200,000 people. Multiply by the documented average monthly API spend of $15 to $50 per user, and the back-of-envelope math produces annualized figures of $9 million at the conservative end, $36 million at the moderate estimate, and $120 million at the aggressive end; the majority flowing to Anthropic.

These are rough numbers, and no published aggregate data exists for OpenClaw API spend. But even the conservative figure of $9 million annually represents a non-trivial revenue stream; OpenAI's API business hit $1 billion ARR in late 2025, while Anthropic targets $26 billion in revenue by the end of 2026. A single agent platform driving $36 million or more in annual API spend to a competitor is the kind of leak that a company projecting $14 billion in losses for 2026 cannot afford to ignore.

Anthropic's Gift to OpenAI

The irony of this acquisition sharpens when you trace the timeline of Anthropic's own decisions.

Steinberger built Clawdbot in November 2025---named after Claude, built for Claude, defaulting to Claude, driving every token of its explosive growth directly into Anthropic's API revenue. Within weeks the project became the fastest-growing open-source repository in GitHub history, crossing 180,000 stars in roughly sixty days and generating the kind of organic developer evangelism that no marketing budget can buy. Steinberger was losing $10,000 to $20,000 per month running OpenClaw, and the vast majority of that cost was API spend; infrastructure costs ran only $10-25 per month for the servers themselves. He was subsidizing Anthropic's revenue out of his own pocket while building their most powerful distribution channel.

Subscribe now

Anthropic's response to this gift was to send lawyers.

On January 27, 2026, Anthropic issued a trademark cease-and-desist over "Clawd" being too phonetically similar to "Claude." Steinberger renamed the project to Moltbot, then to OpenClaw within two days; the Hacker News community called it an "Anthropic fumble" that damaged the company's reputation in the open-source community while handing OpenClaw a fresh wave of viral attention through the drama. One analyst captured the absurdity precisely: "The OpenClaw creator built this project for Claude, named it after Claude, and was actively driving revenue and developer mindshare to Anthropic's API. Instead of recognizing what they had---an unpaid evangelist building the most viral agent ecosystem in history on top of their model---Anthropic sent lawyers."

Eighteen days later, OpenAI swooped in and acqui-hired Steinberger. The Monday Morning Substack called it a potential "fumble of the century for Anthropic," noting that Anthropic's enterprise market share had grown to 40% while OpenAI declined to 27%---a shift partially driven by developer tools and agent ecosystems running on Claude. Anthropic was winning the developer distribution war through organic adoption; then it chose to antagonize the single person doing more for that adoption than anyone on its payroll.

The Distribution Channel Thesis

The conventional reading of this acquisition focuses on three assets: Steinberger's talent, his architectural knowledge of agent systems, and the community goodwill attached to OpenClaw. All three are real and valuable; none of them explain the speed of the move, the personal involvement of Altman, or the competitive urgency of bidding against Mark Zuckerberg's direct outreach via WhatsApp.

A distribution channel thesis does.

By bringing Steinberger in-house, OpenAI can shift the default model hierarchy in whatever agent products emerge from his work---and defaults, as every CTO who has watched browser search engine deals knows, drive the overwhelming majority of usage. OpenAI captures a proven demand generation channel; OpenClaw demonstrated that autonomous agents create enormous, persistent, recurring API demand that dwarfs anything a chatbot produces. A ChatGPT user might generate a few thousand tokens per conversation; an OpenClaw agent running 24/7 with heartbeats, cron jobs, and multi-step workflows generates 5 to 200 million tokens per month. If even 100,000 users run agents on OpenAI models at those consumption rates, the resulting 500 billion to 20 trillion tokens per month would represent a significant fraction of OpenAI's total API throughput---which currently stands at 6 billion tokens per minute.

The move also denies Anthropic its most effective unpaid distribution partner at a moment when distribution matters as much as model quality; it locks Steinberger's architectural thinking into OpenAI's agent-native infrastructure---the Agents SDK, the Responses API, the Frontier Platform---at a time when 93% of companies processing more than one trillion tokens on OpenAI already use framework-based agent orchestration.

No insider has confirmed that token economics or revenue redirection played a role in the acquisition decision; his is purely my analysis based on the substantial yet circunstantial evidence. The strongest version of this thesis is not that OpenAI was protecting its own revenue--- it's that OpenAI was capturing a revenue channel that was primarily benefiting a competitor, at a moment when both companies are burning billions to establish market dominance.

Agents Are the New Apps

The OpenClaw acquisition fits a broader pattern that I believe will define the economics of AI infrastructure for the next three to five years: autonomous agents are becoming the distribution layer that drives model provider revenue, in exactly the way that mobile apps became the distribution layer that drove cloud compute revenue.

The structural parallel is almost exact. In the 2010s, mobile apps created persistent compute demand---always-on services running in the background, pushing notifications, syncing data, processing transactions---that drove AWS, GCP, and Azure revenue far beyond what web applications alone would have generated. SaaS products did the same for payment processing; every recurring subscription flowing through Stripe created persistent transaction volume that compounded as the ecosystem grew. Autonomous agents are now doing this for LLM APIs; an agent running 24/7 with periodic heartbeats, automated workflows, and multi-step reasoning creates persistent token demand that compounds with every new user, every new integration, every new automated task.

The scale of the opportunity explains the urgency. The total LLM API market processes approximately 50 trillion tokens per day, with code generation and agent workflows accounting for 40-50% of that volume. OpenAI's token throughput has grown 700% year over year; agentic inference is the fastest-growing usage pattern across every major API provider. Whoever controls the agent layer---the platforms where autonomous workflows are designed, deployed, and defaulted to specific models---controls the revenue that flows from them.

Default model settings in agent platforms are becoming the new default search engine deals. Google paid Apple billions annually to remain Safari's default search engine because defaults drive behavior at scale; the economics of agent platform defaults follow the same logic, with token revenue replacing advertising revenue as the prize.

What This Means for Your Budget

If you're running or evaluating agent infrastructure, the token economics of this acquisition carry practical implications that most planning processes have not caught up with.

The first is that your agent platform's default model is not a neutral technical choice---it's a revenue channel decision for someone else. Every token your autonomous agents consume is revenue for a model provider; the provider your platform defaults to captures the vast majority of that spend because most users never change defaults. When you evaluate agent platforms, understanding the default provider hierarchy is as important as understanding the capability benchmarks; the platform's incentives shape which models your agents will call, how aggressively context is managed, and whether token efficiency is a design priority or an afterthought.

The second is that the cost structure of autonomous agents bears almost no resemblance to the cost structure of chatbot-era AI tools. A developer using GitHub Copilot generates predictable, bounded API costs that correlate with working hours. A fleet of autonomous agents running 24/7 with heartbeat checks, persistent memory, and multi-step workflows generates costs that correlate with uptime; uptime is 168 hours per week regardless of whether any productive work is happening. The heartbeat problem alone can cost $250 per week per agent at Opus pricing; multiply that across a team of twenty agents and you're spending $260,000 annually on status pings. Most AI budgets were built for the chatbot era and have not been recalibrated for always-on autonomous systems.

The third is that vendor lock-in through agent defaults is the new lock-in vector that most CTOs are not even aware of. Once your workflows, persistent memory, integration configurations, and skill marketplace dependencies are built on a specific agent platform with a specific model default, switching costs compound rapidly. This is not the familiar lock-in of cloud provider APIs or database engines; it's lock-in through the accumulated context and behavioral tuning of autonomous systems that learn and adapt over time. The switching cost isn't technical migration alone---it's the loss of institutional memory that your agents have built over months of operation.

What I'm Watching For

Four signals over the next six to twelve months will determine whether the distribution channel thesis holds.

The most telling will be whether OpenClaw's default model hierarchy shifts from Anthropic to OpenAI. The project is moving to an independent foundation, but if the defaults change within the first two releases after the transition, the revenue redirection motive becomes difficult to argue against; a subtler version of the same signal would be OpenAI offering preferential API pricing or free tiers specifically for OpenClaw users---a subsidy that looks like community support but functions as customer acquisition for API revenue.

The second signal is whether Steinberger's first projects at OpenAI resemble "OpenClaw for GPT"---consumer-facing autonomous agents built on OpenAI's infrastructure that inherit the design patterns and community goodwill of the project he built. If the agent architecture he designed to drive Claude usage gets rebuilt to drive GPT usage, the capture is complete.

The third is Anthropic's response; if Anthropic acquires or deeply partners with another agent platform within the next six months, it validates that they recognize the distribution channel they lost. Silence would suggest they either disagree with this framing or haven't yet grasped what happened.

The fourth is broader: whether agent platform defaults become a negotiation point in enterprise API contracts the way search engine defaults became negotiation points in browser contracts. If model providers start paying agent platform developers for default placement, the parallel to search engine economics will be fully realized---and the OpenClaw acquisition will look less like an acqui-hire and more like the opening move in a distribution war.

Questions Worth Asking

When you evaluate an agent platform, do you trace where your API spend goes? Not the total cost---the destination. Do you know which model provider benefits most from your agent infrastructure, and whether that alignment was a deliberate choice or an inherited default?

Have you budgeted for the token economics of autonomous agents, or are you still forecasting based on chatbot-era usage patterns? The difference between a developer using an AI coding assistant and a fleet of agents running 24/7 is not 2x or 5x---it's 100x to 1,000x in token consumption, and it scales with uptime rather than headcount.
If your agent platform's default model changed tomorrow, would you notice? Would your team? Would your finance team?

The acqui-hire headlines have moved on. The token economics haven't. And if I'm right that agents are becoming the distribution layer for model provider revenue, then understanding who controls your agent defaults is no longer a technical question. It's a financial one.

Audio: Rise of the Citizen Coder: The Other Side of the Agentic Revolution

Allan MacGregor 🇨🇦 — Mon, 16 Feb 2026 17:30:54 GMT

Backlogs are the silent killer of innovation. I’ve seen it too many times—a simple internal tool that should take two weeks gets pushed to the bottom of a six-month backlog because engineers are drowning in higher priority work. The system we built demands specialized skills even for the smallest things, creating a bottleneck that frustrates everyone. This backlog isn’t just a scheduling problem; it’s a symptom of an industry that made building software unnecessarily hard for domain experts who actually need solutions.

And it gets worse. The gatekeeping problem is baked into our culture. “Learn to code” has meant years of grinding through syntax, frameworks, deployment, and debugging before you can build anything useful. That made sense if you wanted to be a programmer, but it was never realistic advice for a marketing director, operations manager, or founder who just wants to solve a problem in their area of expertise. We built a fortress of complexity around software creation, and it locked out the very people who had the best ideas for what to build.

But here’s the part nobody talks about: there’s another tribe in programming—people who don’t care about the craft of coding itself but see code as a tool to get things done. They didn’t fall in love with syntax; they fell in love with products. For them, AI and low-code tools aren’t shortcuts; they’re a way to tear down artificial barriers. They want to focus on delivering value, not on writing elegant code. I don’t fully agree with this philosophy, but it’s coherent and legitimate, and it forces us to rethink what “good software” really means.

And something surprising is happening: it’s working. Not just hype or demos, but real startups and teams shipping AI-generated codebases that are good enough to win funding and users. Founders with deep domain knowledge are building MVPs in weeks, business analysts are delivering internal tools without waiting months, and designers are prototyping real interactions without engineers. These aren’t fringe cases; this is mainstream now. Sure, the code isn’t perfect. It’s not elegant or maintainable in the traditional sense. But it works—and sometimes, that’s enough.

That said, the risks are very real. AI-generated code is often “almost right,” and almost right can quickly become a liability when it hits edge cases, security vulnerabilities, or performance bottlenecks. Maintenance falls on engineering teams who didn’t build the system in the first place, spawning what some call “rescue engineering.” The division of labor might be citizen coders generating and engineers cleaning up. Whether that’s sustainable is an open question, but it’s happening whether we like it or not.

As a CTO, the question isn’t if citizen coding is coming—it’s already here. The real challenge is figuring out where your organization draws the line between what’s fine for vibe coding and what demands engineering rigor. Is that internal tool for fifty users or fifty thousand? A prototype or a production system? Disposable or durable? Those boundaries aren’t clear, but they’re critical. Understanding where your backlog is truly complex work—and where it’s just waiting on bandwidth—can help you decide where to empower domain experts to build directly.

I’m putting my money where my mouth is. In 2026, I’m launching experiments with micro-SaaS products built end-to-end by agent teams. This isn’t a contradiction but a test. If agentic coding can build and operate real businesses, I want to see it firsthand. Maybe the quality problems will surface; maybe they won’t. But speculation only takes you so far—I’m ready to find out by building.

There’s no neat conclusion here. The craftsmanship side is right about the dangers of abstraction debt and knowledge gaps. The citizen coder side is right that the system failed many people and that not all software needs to be a cathedral. We’re witnessing a profession fragmenting in real time: craftsmen working on systems demanding deep expertise, citizen coders building things that otherwise never get built, and a messy middle where the boundaries blur and failures teach us where they should be drawn.

The question isn’t which side wins. It’s whether we can capture the benefits of democratizing software creation without drowning in the maintenance debt that worries the craftsmen. Nobody has figured that out yet. The craft isn’t dead, but it’s no longer the only way to build. That changes everything about who we are as engineers and what we’re for.

You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.

Read the full article — with all the data and sources — on ThePragmaticCTO Substack.

Rise of the Citizen Coder: The Other Side of the Agentic Revolution

Allan MacGregor 🇨🇦 — Mon, 16 Feb 2026 17:08:15 GMT

Gartner estimates that by 2026, citizen developers at large enterprises will outnumber professional software developers four to one. Four to one. Not because companies suddenly stopped needing engineers; because the backlog of things that needed building was always bigger than the engineering team could handle, and now people are finding ways to build without them.

In Part 1, I wrote about what we stand to lose as agentic coding becomes the norm. The erosion of craft. The abstraction debt. The knowledge gap that compounds over generations. I meant every word of it.

The software industry I'm part of and built a career in also failed a lot of people. If we're going to critique the new wave of "vibe coding," we have to be equally critical of the industry that created the barriers in the first place. This is the other side.

The Backlog

The same scene plays out in every company, every week.

A product manager walks into a planning meeting with a simple request: an internal tool to track customer feedback. Nothing fancy. A form, a database, some basic reporting. Maybe two weeks of work for someone who knows what they're doing.

Engineering estimates six months. Not because the tool is complex—it isn't. But the backlog is full. Three major features sit ahead of it, two critical bugs need fixing, and a migration has been pushed back twice. The PM's little tool isn't a priority. It goes to the bottom of the list, where it will sit until someone forgets why they wanted it in the first place.

The PM is frustrated. The engineers are frustrated too—they're not trying to be difficult; they're just drowning. Everyone agrees the tool would be useful. Nobody has bandwidth to build it.

This is the bottleneck that agentic coding is breaking open.

I've been on both sides of this conversation. I've been the engineer explaining why we can't get to something for months; I've been the person with a simple idea watching it die in backlog purgatory. The system we built—the one that requires specialized skills to create even basic software—created a chokepoint that serves nobody well.

The average enterprise IT backlog runs three to twelve months. That's three to twelve months where someone with a real problem waits for someone with the right skills to have time for them. Sometimes the wait is justified; the work is genuinely complex. But often it's not. Often it's just that we created a system where building anything requires specialized skills, and the people with those skills don't have bandwidth for everyone's problems.

The Gatekeeping Problem

I don't think most programmers set out to create barriers. It just happened. The skills take years to acquire. The tools resist simplification. The failure modes can take down production systems. All of that is true.

But the effect—intended or not—was a bottleneck that locked people out.

Think about what "learn to code" meant as advice. It meant: spend months or years acquiring foundational skills before you can build anything useful. Take courses. Do tutorials. Learn syntax, then frameworks, then deployment, then debugging. By the time you're competent enough to build that simple feedback tracking tool, you've invested hundreds of hours. Maybe thousands.

That's fine if you want to be a programmer. It's absurd if you're a marketing director who just needs a reporting dashboard; a founder with deep domain expertise who wants to test an idea before hiring engineers; an operations manager who could automate half their job if they could just write a little code.

"Learn to code" was never realistic advice for these people. They have jobs. They have expertise in their own domains. They don't have time for a CS curriculum, and they shouldn't need one.

The craftsman in me wants to say: but the complexity is real. You can't just skip the fundamentals. The shortcuts will catch up with you.

And that's true. But it's also true that we built a system where a simple tool requires a complex skill set, where domain experts can't build domain-specific solutions, where ideas die because they're stuck behind people who don't have time to implement them.

That was a failure too. We just didn't call it one because it was our failure, and we were on the winning side of it.

The Other Tribe

In Part 1, I talked about the two tribes of programmers: those who love the craft of coding, and those who see code as transportation to building things. I was clear about which tribe I belong to.

I undersold the other tribe's position. Their argument deserves a fairer hearing.

A comment stuck with me—the inverse of the one that opened Part 1:

"I'm happy for all coding to be AI. I prefer delivery over the craft of writing software."

My first reaction was dismissal. The attitude that leads to vibe coding disasters, right? People who don't care about quality; who just want to ship fast and let someone else deal with the consequences.

That's uncharitable. The real picture is different.

These are people who learned programming because it was the only way to build software. They didn't fall in love with syntax; they fell in love with products. The code was never the point—the code was the obstacle between their idea and a working thing. They put in the years because they had to, not because they wanted to.

For them, AI coding tools aren't a shortcut around craft; they're the removal of an artificial barrier. The barrier was always the implementation, not the thinking. They know what they want to build. They understand the problem domain. They have taste about what makes a good product. The only thing they lacked was the ability to translate that into syntax a computer could execute.

Now they have that. And they're asking: why should I care about the elegance of the implementation if the product works?

I don't fully agree with this position. I think there are real risks they're underweighting. But I can't pretend it's incoherent. It's a legitimate philosophy, not just laziness dressed up as productivity.

The Uncomfortable Successes

Something complicates the skepticism I laid out in Part 1. Some of this is working.

Not the hype. Not the demos. The results.

Twenty-five percent of Y Combinator's Winter 2025 batch had codebases that were 95% AI-generated. These aren't weekend projects; they're funded startups that passed YC's filter—and every one of those founders, according to YC managing partner Jared Friedman, was technical enough to build the product from scratch. They chose not to. Collins Dictionary named "vibe coding" their word of the year for 2025. This isn't a fringe phenomenon anymore.

And when I talk to the people doing it—not the evangelists, the practitioners—I hear stories that are hard to dismiss:

A founder with fifteen years of logistics expertise built a supply chain MVP in two weeks that would have taken months with traditional development. Not because the AI wrote perfect code, but because she could iterate on ideas in hours instead of waiting for engineering sprints.

A business analyst tired of waiting nine months for IT built the reporting tool his team needed. It's not elegant. He'd be the first to admit he doesn't fully understand how it works under the hood. But it works, and it shipped, and his team uses it every day.

A product designer prototyped an interface that functions—not just a mockup. She could test real interactions with real users before involving engineering at all.

These aren't hypotheticals. These are people building things that wouldn't have existed otherwise—not because the ideas weren't good, but because the implementation barrier was too high.

The obvious objections surface immediately: maintenance, edge cases, security. Those are fair questions. I asked them in Part 1. But for some of these projects, the questions might not matter that much. A throwaway prototype doesn't need to be maintainable; an internal tool with fifty users doesn't need enterprise-grade security; a startup testing product-market fit might not survive long enough for maintenance debt to matter.

Not every piece of software needs to be built like it's going to run for twenty years. Some software is disposable, and that's fine. The craft-obsessed approach I advocated in Part 1 might be overkill for a significant portion of what gets built.

That's uncomfortable to admit. But I think it's true.

The Risks are Real

I'm not going to relitigate Part 1. The risks I outlined are real: abstraction debt, debugging nightmares, the knowledge gap, the quality illusion. I stand by all of it.

One thing I didn't emphasize enough in Part 1: the thoughtful practitioners already know the limits.

There's a prototype phase, where vibe coding is useful—rapid iteration, exploring ideas, testing concepts. And there's a production phase, where engineering discipline matters—reliability, security, maintainability. The problem isn't that vibe coding exists; it's that the boundary between phases isn't always clear, and the people crossing it often don't realize they're crossing it.

A business analyst building an internal tool is probably fine. A startup founder building an MVP to test an idea is probably fine. Someone shipping a financial system that processes millions of transactions is not fine. The tool doesn't know the difference. The user has to.

The other risk worth naming: we're going to see a lot of "almost right" code. Output that's close but not quite. Almost right works until it doesn't. Edge cases. Security holes. Performance issues that only manifest under load. Research on AI-generated code already suggests that roughly 40% of AI-generated code snippets contain vulnerabilities; "almost right" at scale is a liability, not a shortcut.

Who fixes it when the person who built it doesn't understand it?

In a lot of cases, the answer is: a craftsman programmer, cleaning up after someone else's vibe-coded creation. That's already happening. Some are calling it "rescue engineering"—the maintenance burden that lands on engineering teams after citizen developers ship something that works until it doesn't. The "vibe coding hangover" is real.

This might be the new division of labor. Citizen coders generate; engineers audit and maintain. Builders move fast; craftsmen clean up.

Is that sustainable? I honestly don't know. But it's happening whether we think it's a good idea or not.

Where to Draw the Line

If you're a CTO watching this unfold, the question isn't whether citizen coding is coming to your organization. It's already there—or it will be by next quarter.

The questions worth asking:

How much of your current backlog is genuinely complex engineering work, and how much is queued simply because nobody with the right skills has bandwidth? Where in your organization are people already building things without engineering oversight—and what happens when those things break? If a business team built an internal tool with AI tomorrow, would your engineering org know about it? Would they need to?

The boundary between "fine to vibe-code" and "needs engineering discipline" isn't a bright line. It's a gradient, and your job is to figure out where your organization falls on it. Prototype vs. production. Internal vs. customer-facing. Fifty users vs. fifty thousand. Disposable vs. durable.

What I'm Doing

While I still reserve a lot of skepticism for vibe-coding and AI-first trends, in 2026 I'm launching a few experiments by building micro-SaaS products alone using teams of agents to handle the end-to-end building and operation of the business. You can follow them:

StructPR — Code review, reorganized
ShipLog — Feedback board, changelog, and embeddable widget for solo SaaS founders
AuroraGRC — Compliance management for Canadian regulations (partially)

This isn't a contradiction. It's a test. If agentic coding can build and operate real businesses that serve real users, I want to see it work—or fail—with my own codebase. Maybe the quality problems will materialize. Maybe they won't. But I'd rather find out by building than by speculating.

I'm ready to fully let go of the wheel and let AI take control, the first micro-saas in this list is specially about code review and organization, because it's a problem that's been bothering me for a long time; and that is being exacerbated by the rise of vibe coding.

The Tension That Doesn't Resolve

I tried to come up with a neat conclusion, but I don't have one.

The craftsmen are right that comprehension debt is real. That the junior developer pipeline is collapsing. That we're building systems nobody fully understands and calling it progress.

The citizen coders are right that the system failed them. That the backlogs were absurd. That gatekeeping kept good ideas from getting built. That not every piece of software needs to be a cathedral.

Both things are true.

What we're watching is a profession fragmenting in real time. Not dying—fragmenting. There will still be craftsmen, working on the systems where deep understanding matters; there will be citizen coders, building things that would have died in backlog purgatory; there will be a lot of mess in the middle, where the boundaries aren't clear and the failures teach us where they should have been.

The question isn't which side wins. It's whether we can capture the benefits of democratization—more people building, more ideas tested, fewer bottlenecks—without drowning in the maintenance debt that Part 1 warned about.

Nobody has figured that out yet. The answer probably looks different for a weekend project than for financial infrastructure. The boundaries will be learned the hard way, through failures that teach us where vibe coding breaks.

The craftsmen will say "I told you so." The builders will point to the successes and ask why the craftsmen are still complaining. And both will be partially right, which is the most frustrating kind of disagreement—the kind that doesn't end.

I started this series by quoting a programmer who said he never knew there was "an entire subclass of people in my field who don't want to write code." Writing this piece surfaced something I hadn't seen clearly before. They never wanted to be in our field in the first place. They just didn't have another way to build things.

Now they do.

The craft isn't dead. But it's no longer the only way to build. That changes everything about who we are as engineers—and what we're for.