Root Cause


Shrinking Margins

March 7th, 2026
Chris Rohlf

In my AI Vulnerability Lifecycle piece for CSET last year I wrote about how agentic AI has begun automating portions of the software vulnerability lifecycle. An important point I want to expand on is how reasoning models change the nature of the relationship between human researchers and automation in vulnerability discovery. For decades, that relationship had a clear and stable boundary. Fuzzers and other automated tools explore program states primarily through two methods: brute force across billions of inputs, and approximation through techniques such as taint analysis and symbolic execution. They are effective at what they can reach, but the combinatorial explosion of possible states means there is always either a wide margin of unexplored state space or a need to under- or over-approximate program behavior. That margin is where human vulnerability researchers were able to discover issues automation could not. They used intuition, experience, and deep understanding of program semantics to find bugs that no amount of random input generation could surface: business logic issues, complex interactions between components and systems, and vulnerabilities that only manifest under specific sequences of state transitions.

Research over the years has consistently demonstrated this gap. Industry security research teams repeatedly find critical issues in heavily fuzzed codebases like Chrome, Android, and the Linux kernel, targets with millions of CPU hours of continuous fuzzing behind them. Yet talented humans still found exploitable security vulnerabilities. The state space was simply too large for brute force to close the gap, and the under- and over-approximations could not faithfully model the complex program behaviors that led to exploitable conditions.
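A toy example makes the state-space problem concrete. The parser below is entirely hypothetical, written for illustration only: it misbehaves after one specific three-command sequence, which random input generation rarely stumbles onto, while a reader, or a model reasoning over the code, can spot the path immediately.

```python
import random

# Hypothetical parser: the "bug" is only reachable via one specific
# sequence of state transitions (INIT -> AUTH -> RESET -> WRITE).
def parse(commands):
    state = "INIT"
    for cmd in commands:
        if state == "INIT" and cmd == "AUTH":
            state = "AUTH"
        elif state == "AUTH" and cmd == "RESET":
            state = "RESET"
        elif state == "RESET" and cmd == "WRITE":
            return "BUG"  # stand-in for an exploitable condition
        else:
            state = "INIT"  # any other command resets the machine
    return "OK"

# Brute force: random 4-command sequences from a 16-command alphabet.
# Only about one in a few thousand random sequences reaches the bug,
# and real command alphabets and state machines are vastly larger --
# this is the combinatorial explosion in miniature.
alphabet = [f"CMD{i}" for i in range(13)] + ["AUTH", "RESET", "WRITE"]
random.seed(0)
hits = sum(
    parse(random.choices(alphabet, k=4)) == "BUG"
    for _ in range(100_000)
)
print(f"random hits: {hits} / 100000")
```

A coverage-guided fuzzer does better than pure random generation here, but the same effect reappears whenever reaching a bug requires chaining many low-probability transitions.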

Frontier reasoning models, such as Opus 4.6, change this dynamic in a fundamental way. They do not find vulnerabilities by attempting to explore every possible state. They reason through code the way a human does: they follow data flows, understand constraints, recognize patterns from historical vulnerabilities, and make educated guesses about where bugs are likely to exist. This is not brute force; it is the same heuristic, experience-driven approach that human researchers have relied on to find the issues automation could not. The difference is that reasoning models can apply this approach at a scale and speed no human team can match. They can reason through vast numbers of code paths across parallel instances, revisit their assumptions, and refine their strategies in tight feedback loops with traditional tooling. The margin that once belonged exclusively to humans is now contested. Evidence is already emerging that frontier models can discover vulnerabilities in code that both humans and traditional automation have hardened over the years.

This does not mean human researchers become irrelevant. But the basics of vulnerability research, finding new instances and variants of known bug patterns through careful manual analysis, are exactly the kind of work reasoning models are increasingly capable of. The margin humans operate in is rapidly shrinking. DARPA's AIxCC demonstrated what LLM-driven cyber reasoning systems could do even with 2024 and 2025 era models. Finalist systems identified 86% of synthetic vulnerabilities and patched 68% of those across 54 million lines of code, while also discovering 18 real-world vulnerabilities that were not part of the competition. Frontier reasoning models have improved substantially since then. As models improve and compute costs come down, the margin for humans will continue to shrink. Researchers who want to stay relevant will need to move further up the abstraction stack, into the kinds of problems that remain genuinely hard for models, such as novel vulnerability classes that are underrepresented in the models' training data.

Helen Toner and others have written about the concept of AI's "jagged frontier": the observation that AI capabilities advance unevenly across tasks, regardless of how simple or complex those tasks appear to humans. In many domains that jaggedness creates real limits on what AI can automate. But in vulnerability discovery the frontier is not jagged. The core tasks, pattern recognition across code, root cause analysis, variant identification, and reasoning about program state, all fall within the strengths of current reasoning models. This shift is likely to rapidly change the economics of vulnerability discovery and exploitation. The cost of vulnerability discovery and remediation is increasingly correlated with the per-token inference costs of the latest frontier models.
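The economics follow from the fact that a model-driven audit's cost scales roughly linearly with the tokens it processes. The back-of-the-envelope sketch below uses hypothetical placeholder numbers chosen purely for illustration, not measured figures from any real engagement or price sheet.

```python
# All parameters are hypothetical assumptions for illustration only.
def audit_cost_usd(lines_of_code,
                   tokens_per_line=15,          # assumed code token density
                   passes=3,                    # assumed re-reads per region
                   usd_per_million_tokens=5.0): # assumed inference price
    """Rough linear cost model: tokens processed times price per token."""
    tokens = lines_of_code * tokens_per_line * passes
    return tokens / 1_000_000 * usd_per_million_tokens

# Under these assumptions, a 1M-line codebase costs on the order of
# hundreds of dollars in inference to sweep, and the figure falls
# directly with the per-token price.
print(f"${audit_cost_usd(1_000_000):,.2f}")  # prints "$225.00"
```

The point is not the specific number, which depends entirely on the assumed parameters, but that the dominant cost term is per-token inference pricing, which has been falling steadily.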