The Pragmatic CTO Podcast

Your AI Pentester Found 1,000 Bugs. None of Them Were the One That Mattered.

AI pentesting tools are now finding thousands of bugs faster and cheaper than humans ever could. But none of those bugs are the ones that really matter. The headline-grabbing achievements mask a deep gap between automated scanning and the nuanced, creative work of real penetration testing.

Take XBOW, the autonomous AI pentester that topped HackerOne’s leaderboard in 2025 with over a thousand vulnerability submissions in three months, including a zero-day in a major VPN. The market poured $75 million into it because it seemed like a paradigm shift. But the reality is more complicated. XBOW excels at volume and speed but scores low on severity. It mostly finds surface-level bugs like data leaks and XSS, not the chained exploits or complex attack paths that keep CISOs awake at night. And it’s backed by a team and expensive compute infrastructure, not a lone hacker—so comparing it directly to individuals is misleading. Plus, almost half its findings were still unresolved because they overwhelmed triage teams.

And it gets worse. XBOW’s economics don’t add up on their own. The cost of running the AI exceeds its bug bounty payouts, meaning it needs venture-scale funding to survive. Its autonomy is qualified: humans review many findings before submission, and false positives and duplicates shrink the pool of actionable bugs. This isn’t failure; it’s a powerful tool with clear limits. AI tools work best on vulnerabilities with well-known, fixed patterns. The real risk comes from the vulnerabilities that require deep system knowledge and human intuition.
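To make that economics argument concrete, here’s a back-of-envelope sketch. Aside from the thousand-submission volume from the episode, every number in it is a hypothetical placeholder, not a real XBOW figure; the point is the shape of the calculation, not the values.

```python
# Back-of-envelope unit economics for an autonomous AI pentester.
# Except for the submission count, every number is a hypothetical
# placeholder, NOT a real XBOW figure.

submissions = 1000              # raw submissions over a quarter (from the episode)
duplicate_rate = 0.30           # hypothetical: rejected as duplicates or false positives
payout_rate = 0.55              # hypothetical: share of valid reports that actually get paid
avg_bounty = 500.0              # hypothetical: average payout (USD) for low-severity bugs
compute_per_submission = 800.0  # hypothetical: LLM + infrastructure cost per submission (USD)
review_per_submission = 150.0   # hypothetical: human review cost per submission (USD)

valid = submissions * (1 - duplicate_rate)
revenue = valid * payout_rate * avg_bounty
cost = submissions * (compute_per_submission + review_per_submission)

print(f"Bounty revenue:  ${revenue:,.0f}")
print(f"Operating cost:  ${cost:,.0f}")
print(f"Net:             ${revenue - cost:,.0f}")
```

With these placeholder numbers, bounties cover roughly a fifth of the cost, which is exactly why the model depends on venture funding rather than payout revenue.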

If you want a benchmark, look at ARTEMIS, a multi-agent AI pentesting framework tested on a real university network. ARTEMIS found nine valid vulnerabilities at 82% accuracy and outperformed most human testers at an estimated $18 an hour versus $60 for humans. Impressive headlines, but the top human still found more issues, thirteen versus nine, and crucially, the human’s edge came from creatively chaining findings and validating their impact. ARTEMIS excelled at pattern matching but struggled with GUI-based tasks and novel exploits, just like XBOW. The $18-an-hour figure also ignores the engineering, communication, and compliance work that humans provide. The takeaway? ARTEMIS is a multiplier, not a substitute. AI handles reconnaissance; humans handle validation and impact analysis. Overstating AI’s ability means ignoring real security gaps.
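The hourly-rate comparison is easy to misread, so here’s a small cost-per-valid-finding calculation. The rates and finding counts are the episode’s; the engagement length is a hypothetical assumption, and the comparison deliberately leaves out the validation and reporting work only the human side delivers.

```python
# Cost per valid finding, using the episode's numbers plus one assumption.

HOURS = 40            # hypothetical: length of the engagement in hours
AI_RATE = 18.0        # USD/hour for ARTEMIS (episode estimate)
HUMAN_RATE = 60.0     # USD/hour for a human tester (episode figure)
AI_FINDINGS = 9       # valid findings from ARTEMIS (episode figure)
HUMAN_FINDINGS = 13   # valid findings from the top human tester (episode figure)

ai_cost = AI_RATE * HOURS / AI_FINDINGS
human_cost = HUMAN_RATE * HOURS / HUMAN_FINDINGS

print(f"ARTEMIS: ${ai_cost:,.0f} per valid finding")
print(f"Human:   ${human_cost:,.0f} per valid finding")
# Cheaper per finding is not equivalent coverage: the findings the AI
# missed were the chained, validated, high-impact ones.
```

Under these assumptions the AI wins on cost per finding by roughly two to one, but that ratio says nothing about which findings would have stopped a breach.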

The ecosystem is evolving fast. Open-source tools like Shannon and PentestGPT orchestrate AI agents on top of existing scanners and frameworks. They can find dozens of vulnerabilities in intentionally vulnerable apps and lower the barrier for junior engineers to run comprehensive scans. But the same tools can be weaponized by attackers, and some risk damaging the very systems they test. The democratization of AI pentesting is a double-edged sword: it boosts defenders but empowers attackers equally. Deploying defensive AI without real depth in your security program is a net risk.
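To show what “orchestrating AI agents with existing scanners” means in practice, here is a minimal pattern sketch: run a conventional scanner, hand its output to a language model for triage, and queue everything for human review. The nmap invocation is real; review_with_llm is a hypothetical stub standing in for whatever model API you use. This is the general pattern, not Shannon’s or PentestGPT’s actual code.

```python
import json
import subprocess

def run_scanner(target: str) -> str:
    """Run a conventional scanner (nmap here) and return its raw output."""
    result = subprocess.run(
        ["nmap", "-sV", "--top-ports", "100", target],
        capture_output=True, text=True, timeout=600,
    )
    return result.stdout

def review_with_llm(scan_output: str) -> list[dict]:
    """Hypothetical stub: ask a language model to triage scanner output.

    A real orchestrator would call a model API here and parse a
    structured response; stubbed out so the sketch runs end to end.
    """
    prompt = (
        "You are a pentest triage assistant. From this scan output, list "
        "likely vulnerabilities as JSON objects with service, issue, severity:\n\n"
        + scan_output
    )
    # findings = json.loads(model_client.complete(prompt))  # hypothetical call
    return []  # placeholder result

def orchestrate(target: str) -> None:
    """One pass of the scan -> AI triage -> human review pipeline."""
    raw = run_scanner(target)
    for finding in review_with_llm(raw):
        # The episode's key caveat applies: a human validates every
        # candidate before anything is reported.
        print(f"[needs human review] {json.dumps(finding)}")

if __name__ == "__main__":
    # Only ever point this at systems you are explicitly authorized to test.
    orchestrate("scanme.nmap.org")
```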

This isn’t the first time security automation has promised to replace humans. Nessus arrived in 1998 promising to replace manual assessments, but it ended up increasing demand for human pentesters to contextualize and validate findings. SAST and DAST tools in the 2010s became useful baselines but never replaced code review. Breach and attack simulation platforms serve as validation aids, not substitutes. The pattern is consistent: automation handles the baseline, humans handle the depth, and total security spending rises. AI is more capable than any prior generation of tooling, but whether it can cross the threshold from augmentation to replacement remains to be seen. The timeline may surprise us, but history advises caution.

Here’s the 82% problem. According to Verizon’s Data Breach Investigations Report, 82% of exploited vulnerabilities involve human reasoning, exploit chaining, and contextual analysis: exactly the work AI can’t do. AI pentesting covers the other 18%, the vulnerabilities that are known, patterned, and automatable through reconnaissance, CVE matching, standard exploits, and report generation. That 18% is valuable and hard for humans to match at scale. But the 82%, the business logic flaws, multi-step attack chains, social engineering, zero-days, and adversarial creativity, remains human territory. Mistaking 18% coverage for total security leaves you exposed.
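A quick expected-value calculation shows why that mistake is dangerous. The 18/82 split is the episode’s figure; the per-bucket catch rates are hypothetical assumptions for illustration only.

```python
# What share of real-world exploited vulnerabilities would AI-only testing
# catch? The 18/82 split is from the episode; catch rates are hypothetical.

automatable_share = 0.18   # known, patterned vulns: recon, CVE matching, standard exploits
contextual_share = 0.82    # logic flaws, chained exploits, social engineering, zero-days

ai_catch_automatable = 0.90  # hypothetical: AI is strong inside its lane
ai_catch_contextual = 0.05   # hypothetical: AI rarely finds context-dependent flaws

coverage = (automatable_share * ai_catch_automatable
            + contextual_share * ai_catch_contextual)

print(f"Exploited vulns an AI-only program would catch: {coverage:.0%}")
# Roughly 20% under these assumptions: a high floor on patterned bugs,
# near zero on the class that actually drives breaches.
```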

And it’s not just a technical gap; it’s a legal and governance one. If your AI pentest misses a critical bug that leads to a breach, who’s liable? The vendor? Your CISO? Regulations like the EU Product Liability Directive and compliance standards like PCI-DSS and SOC 2 don’t yet clarify whether AI-only testing suffices. Autonomous AI can even stray outside authorized testing scopes, risking legal violations. False negatives and system crashes caused by aggressive AI testing add to the liability. Right now, no clear governance framework exists. If you can’t answer who’s accountable when AI misses the critical bug, you have a liability gap no tool will fix.

Meanwhile, the market moves forward. Platforms like HackerOne, Synack, and Cobalt blend AI with human experts in continuous testing. The penetration testing market is shifting from individual researchers to platform operators, mirroring ride-sharing economics where the platform captures value and individuals lose pricing power. Most organizations expect AI to eventually take over pentesting, but adoption stats often reflect usage, not effectiveness. For CTOs, the question isn’t whether AI pentesting is good or bad—it’s how this market shift changes how you buy and manage security services.

As CTOs, we have to ask hard questions. If your AI pentester says you’re clean, how confident are you? How much of your budget protects against script kiddies versus sophisticated attackers? How many findings required deep business logic understanding that AI can’t replicate? Are you buying real security or just the appearance of it? And when something inevitably gets missed, who’s accountable?

Every generation of security automation has promised to replace humans and instead ended up augmenting them. AI pentesting tools are faster, cheaper, and broader than anything before. They compress baseline coverage from weeks to hours and catch bugs your team might miss at odd hours. That is the floor of your security posture. The ceiling is understanding your unique architecture and threat model well enough to find the unknown unknowns—the context-dependent, creative exploits AI misses.

You can read the full article—with all the data and sources—on ThePragmaticCTO Substack.


