The State of AI Powered Hacking in Early 2026: What Is Real, What Is Hype, What Is Missing

Jorge Monteiro

CEO

Ethiack

February 11, 2026

When it comes to AI-powered hackbots, things are changing fast. And the rate of change is accelerating. The contents of this blog might very well be out of date by the time I've finished writing it.

It feels as though the entire cybersecurity community is holding its breath, waiting to see what happens in the hackbot space. There are more questions than answers, and more hype than truth. At Ethiack, one of our core values is transparency - so I'm going to try to give it to you straight in this blog.

Many experts are currently preaching that fully autonomous hackbots will never exist: that advancements in AI will simply provide more intelligent tooling, which will always require human intervention for guidance and key decision-making. At Ethiack, we've never really held on to this belief. And recently we've been able to show why.

You see, the deeper we've gone into producing our hackbot - the Hackian - the more we've been able to strip away the layers of human intervention, allowing us to create a system that is increasingly intelligent and autonomous. The models are getting smarter, for sure, but the real advancements have been in figuring out the best way to "wrap" the LLMs with the most efficient processes and architecture.

At the moment, AI powered hacking tools are split into two categories.

One category is practical, and already profitable for many companies. These are tools that help humans move faster: narrowly scoped automation that reliably finds bugs in specific contexts. The most popular example right now is probably Burp AI. PortSwigger added a set of AI-enhanced features to its flagship intercept proxy, Burp Suite, that make otherwise manual processes more efficient. The actions these tools take are heavily guided by a human.

The other category is more ambitious: fully autonomous AI pentesters that can roam a complex environment, chain findings into impact, and produce high confidence reports with almost no human help.

Both categories are moving quickly, probably more quickly than you think, but not without hiccups. The gap is not so much in model intelligence as in everything else surrounding it: tooling, verification, environment context, and operational constraints. Despite all the marketing claims on X, everyone in this space is still in the early days of innovation. Major breakthroughs are still happening regularly, and the way these cutting-edge tools look now will feel rudimentary in 12 months.

This post is a pragmatic snapshot of where things stand right now: what classes of vulnerabilities are being found without human intervention, what it costs to run, which missing components matter most, and whether bigger models alone will solve it.

An overview of AI pentesting tooling types

People say “AI pentesting” like it is one thing. It is not. You can break the space into four buckets, each with very different capabilities.

Copilots

This is the most mature bucket. The model sits beside a human analyst and helps with tasks like:

  • Interpreting tool output
  • Generating payload variations
  • Writing scripts
  • Summarizing attack paths and writing reports
  • Explaining why a finding matters

PentestGPT is a canonical example: it uses an LLM to guide the penetration test process through multiple stages and modules, explicitly aiming to reduce context loss and keep an overall plan coherent. Copilots are useful because they do not need full autonomy. They just need a human to provide environment context, judgment, and safety.

Agents

Here the model can run commands, call tools, parse results, and iterate. This includes:

  • Web interaction agents (browsers, API calls)
  • Tool calling agents that run security tools and interpret results
  • Small team style agent setups (planner, executor, verifier)

This is where most “agentic pentest” demos live. It is also where things break in unpredictable ways when the environment is messy or the task has a long horizon.
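
To make this bucket concrete, here is a minimal sketch of the loop most of these demos share: the model proposes an action, a harness executes it against a small allow-list of tools, and the output feeds back into the next decision. The call_llm callable and the allow-list are assumptions for illustration, not any particular product's implementation.

# Minimal tool-calling agent loop (illustrative sketch, not a product).
# call_llm is any function that maps a message history to a JSON decision,
# e.g. {"action": "run", "command": ["nmap", "-sV", "target"]} or
# {"action": "finish", "summary": "..."}.
import json
import subprocess
from typing import Callable

ALLOWED_TOOLS = {"nmap", "curl", "ffuf"}  # keep execution tightly scoped

def run_tool(command: list[str]) -> str:
    """Run an allow-listed tool and return truncated output for the model."""
    if not command or command[0] not in ALLOWED_TOOLS:
        return f"refused: {command} is not on the allow-list"
    result = subprocess.run(command, capture_output=True, text=True, timeout=300)
    return (result.stdout + result.stderr)[:8000]  # cap what flows back into context

def agent_loop(objective: str, call_llm: Callable[[list[dict]], str],
               max_steps: int = 10) -> list[dict]:
    """Plan -> execute -> observe loop; stops when the model says it is done."""
    history = [{"role": "user", "content": objective}]
    for _ in range(max_steps):
        decision = json.loads(call_llm(history))
        if decision.get("action") == "finish":
            history.append({"role": "assistant", "content": decision.get("summary", "")})
            break
        history.append({"role": "tool", "content": run_tool(decision["command"])})
    return history

Most of the unpredictable breakage happens inside that loop: tool output overflows the context, the model loses the plan, or a failed step goes unnoticed.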

Cyber Reasoning Systems (CRS)

This is the DARPA AI Cyber Challenge world. These systems are designed to ingest a codebase, discover vulnerabilities, and often patch them. In the AIxCC final, DARPA reported large improvements versus the semifinal, with teams identifying a high percentage of synthetic vulnerabilities and patching a meaningful portion, and also finding some real vulnerabilities in the challenge code.

Trail of Bits’ Buttercup is one of the open sourced systems emerging from that program, explicitly built to find and patch vulnerabilities using fuzzing plus AI driven analysis and patching. CRS style systems are important because they show autonomy can work when the target is tightly scoped: code in, bugs out, patches out, inside a controlled evaluation harness.

Autonomous pentesting products

This is the most operationally difficult bucket. It is not a CTF. It is not a synthetic benchmark. It involves finding real vulnerabilities in extremely dynamic, messy, real-world environments, with difficulties like:

  • Asset discovery and continuous change monitoring
  • Permissions, scoping, safe exploitation constraints
  • Verification, deduplication, reporting and ticketing
  • Dealing with brittle target behavior and rate limits
  • Avoiding business disruption

Ethiack is here.

What's being found without humans right now?

This depends on which bucket you mean.

In CRS style systems

DARPA AIxCC results strongly suggest that automated systems can find and patch large numbers of injected bugs in controlled settings, and that capability improved fast from 2024 to 2025.

Buttercup’s architecture description is revealing: it starts with fuzzing campaigns and then uses AI analysis and patching steps. That implies the “autonomy” is heavily scaffolded by mature bug finding methods like fuzzing, with the model providing reasoning and repair capability around those signals.
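
As a rough illustration of that pattern (a sketch under assumed file layouts, not Buttercup's actual pipeline), the fuzzer does the discovery and the model only reasons about artifacts the fuzzer has already produced:

# Illustrative fuzz-then-triage pipeline (not Buttercup's real code).
# Assumes an AFL/libFuzzer-style directory of crashing inputs and a
# hypothetical ask_model callable that returns a triage summary string.
from pathlib import Path
from typing import Callable

def triage_crashes(crash_dir: str, target_source: str,
                   ask_model: Callable[[str], str]) -> list[dict]:
    """Pair each crashing input with source context and ask the model to
    classify the bug class and point at where a patch might belong."""
    source = Path(target_source).read_text(errors="replace")
    reports = []
    for crash in sorted(Path(crash_dir).glob("crash-*")):
        prompt = (
            "A fuzzer produced a crashing input for the code below.\n"
            f"Crashing input (hex): {crash.read_bytes()[:256].hex()}\n"
            f"Source under test:\n{source[:6000]}\n"
            "Classify the likely bug class and suggest where a patch belongs."
        )
        reports.append({"input": crash.name, "triage": ask_model(prompt)})
    return reports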

In this category, the vulnerabilities found look like what fuzzing and program analysis have historically been good at:

  • Crashes and memory corruption
  • Parsing and edge case failures
  • Input handling flaws
  • Logic that yields unsafe state transitions

When you consider how difficult and manual this process was only 2 years ago, this feels like magic.

In real world codebases

Google Project Zero and DeepMind have publicly discussed evolving vulnerability discovery frameworks from “Project Naptime” into “Big Sleep,” positioned as an AI agent approach to finding bugs that are hard for fuzzing alone.

Separate reporting has described cases where AI assisted approaches led to real world vulnerability findings in major open source components. The important caveat is that these are typically not fully autonomous “run this against anything” outcomes. They are well resourced teams building targeted systems with strong environment knowledge.

In production pentesting style products

Most “AI pentesting” products claim they reduce false positives by verifying exploitability. Ethiack uses a verifier component to reduce noise.

That style of system tends to find:

  • Straightforward web vulnerabilities with clear exploit primitives (auth issues, access control gaps, injection when reachable and confirmable, unsafe file handling)
  • Misconfigurations that can be validated by safe reads (exposed admin panels, open buckets, permissive IAM policies, overly broad tokens)
  • Default creds and weak auth flows where validation is non destructive
  • Basic SSRF patterns where safe proof is possible
  • Multi step issues when the graph is small enough to traverse reliably

Ethiack's autonomous pentesting offering, Hackian, has uncovered many critical, high-profile vulnerabilities completely autonomously in this context. One example was at the DEFCON CTF, where Hackian discovered a remote code execution vulnerability in a genetics research platform called "GeneQuest" and then walked the audience through the bug in a show and tell. A more recent example is an RCE against Clawdbot.

Keep in mind that these were discovered completely autonomously.

What does it cost to run AI powered pentesting?

This is where reality bites. Costs are not just LLM tokens. They are everything around the model.

Model inference

If you are using a hosted frontier model, cost is largely per token, but the real driver is how many loops your agent runs.

Agentic systems are expensive because they are chatty:

  • Planning prompts
  • Tool output ingestion (often large)
  • Reflection loops
  • Verification steps
  • Multi agent coordination

Even if a single response is cheap, the total conversation across a full test can become large. This is why many systems compress, summarize, and externalize memory rather than keeping everything in one long context window. More loops and more reasoning often give better results, but this needs to be weighed against budget and time constraints.
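
A back-of-the-envelope model makes the point; the token counts and per-token prices below are illustrative assumptions, not any provider's actual pricing.

# Rough cost model for an agentic run: cost scales with loop count, not with
# any single response. All numbers are illustrative assumptions.
def run_cost(loops: int,
             prompt_tokens_per_loop: int = 12_000,   # plans + tool output + memory
             output_tokens_per_loop: int = 1_500,
             usd_per_1k_prompt: float = 0.003,
             usd_per_1k_output: float = 0.015) -> float:
    per_loop = (prompt_tokens_per_loop / 1000) * usd_per_1k_prompt \
             + (output_tokens_per_loop / 1000) * usd_per_1k_output
    return loops * per_loop

print(f"5-loop probe:            ${run_cost(5):.2f}")
print(f"200-loop investigation:  ${run_cost(200):.2f}")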

Tool execution and environment

Real pentesting requires running tools, launching browsers, creating sandboxes, sometimes building target replicas, and maintaining infrastructure. That often costs more than the LLM itself.

CRS systems in AIxCC have an advantage: standardized harnesses. For AI pentesting in the wild, a lot of tokens are used just to figure out context, before any actual payloads are even generated.

Verification and triage

If your system produces noise, humans have to clean it up, and that dominates cost. This is why “verifier” style components matter so much. Ethiack has made this a core part of our offering. The best AI pentesting systems are not the ones that find the most "issues". Scanners have been spamming low-signal alerts for decades, burning out security teams. The best AI pentesting systems are the ones that maximize confirmed, actionable findings per unit time.
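
As a toy illustration of the verification idea (a sketch, not how Ethiack's Verifier actually works), a verification step re-runs a candidate finding and only promotes it when a deterministic, safe proof is observed:

# Toy verification gate (illustration only).
# A finding is only reported as "confirmed" if a safe, repeatable check
# observes a unique marker; everything else stays "suspected".
import secrets
from typing import Callable

def verify_finding(candidate: dict,
                   safe_check: Callable[[dict, str], str],
                   attempts: int = 2) -> dict:
    """safe_check sends a non-destructive probe containing the marker and
    returns the observed response body."""
    marker = secrets.token_hex(8)  # unique canary to avoid false matches
    confirmations = sum(
        1 for _ in range(attempts) if marker in safe_check(candidate, marker)
    )
    candidate["status"] = "confirmed" if confirmations == attempts else "suspected"
    candidate["evidence_marker"] = marker if confirmations else None
    return candidate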

Ongoing continuous operation

Continuous testing changes the economics (in a positive way):

  • You can reuse state, prior knowledge, and baselines
  • You can focus on diffs rather than retesting everything
  • You can amortize setup and asset mapping over time

But it can also add costs:

  • Continuous asset discovery and change detection
  • Re-investigating systems that have changed, even when the change is not meaningful from a security perspective
  • More integrations to deal with (CI, ticketing, cloud APIs)
  • More opportunities for false alarms

This is why companies in this space often converge on a subscription model tied to asset counts, scope, or usage.
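
As a sketch of the diff-focused idea, assuming you keep a snapshot of each asset's observable surface (ports, endpoints, headers) between runs:

# Diff-focused retesting sketch: only queue assets whose observable surface
# changed since the last snapshot. What goes into a snapshot is an assumption
# here; the hard part is deciding which changes are meaningful.
import hashlib
import json

def surface_fingerprint(snapshot: dict) -> str:
    """Stable hash of an asset's observable surface."""
    return hashlib.sha256(
        json.dumps(snapshot, sort_keys=True).encode()
    ).hexdigest()

def assets_to_retest(previous: dict[str, dict], current: dict[str, dict]) -> list[str]:
    queue = []
    for asset, snapshot in current.items():
        old = previous.get(asset)
        if old is None or surface_fingerprint(old) != surface_fingerprint(snapshot):
            queue.append(asset)  # new asset, or a change worth a closer look
    return queue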

Bigger models alone do not magically solve autonomous pentesting

More capable models released over the last year have helped considerably, but they have not removed the hard parts. There are three core bottlenecks that show up repeatedly across research and products.

Context is king, and the model does not have it

In one of his blogs, Daniel Miessler argued that attackers benefit first because defenders lack complete context and because offense can operate without perfect knowledge. This insight is more profound than most realize.

Autonomous pentesting systems also lack context. Not as much as old-school, pattern-based scanners, but the gap is still there. They do not necessarily know things like:

  • What is normal in your environment
  • Which assets are definitely owned and in scope
  • Which odd behavior is exploitable vs expected
  • Which credentials are legitimate and what they mean

You can partially solve this with connectors and environment ingestion, but that is not just a model scaling issue. It is also an integration and data quality issue.

Long horizon reliability is a killer

PentestGPT's work explicitly highlights that LLMs can do well on sub-tasks but struggle to maintain an integrated understanding of the overall scenario, which is another way of saying long horizon coherence is hard. Humans do this remarkably well: midway through a multi-day hacking binge, a hacker will often figure out how to chain multiple vulnerabilities just as they're about to go to sleep, or in the shower. In real environments, long horizon tasks are everywhere:

  • Pivoting from a weak issue into meaningful impact
  • Chaining multiple conditions
  • Tracking assumptions and what is proven vs guessed
  • Handling failures and backtracking cleanly

Bigger models help here, but without good state management and external memory, you still get drift. This is one of the core difficulties with creating a fully autonomous AI pentester.
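
One common mitigation is to keep working state outside the context window and re-inject only a compact digest each loop. A minimal sketch, with the exact fields and summarization policy as assumptions:

# External state store for long-horizon runs (illustrative sketch).
# Facts are tracked as proven vs. assumed, and only a short digest is
# re-injected into each planning prompt instead of the full transcript.
from dataclasses import dataclass, field

@dataclass
class MissionState:
    objective: str
    proven: list[str] = field(default_factory=list)     # backed by evidence
    assumed: list[str] = field(default_factory=list)    # still unverified
    dead_ends: list[str] = field(default_factory=list)  # avoid re-exploring

    def record(self, fact: str, proof: str | None = None) -> None:
        (self.proven if proof else self.assumed).append(fact)

    def digest(self, max_items: int = 5) -> str:
        """Compact summary injected into each planning prompt."""
        return (
            f"Objective: {self.objective}\n"
            f"Proven: {self.proven[-max_items:]}\n"
            f"Assumed (unverified): {self.assumed[-max_items:]}\n"
            f"Dead ends: {self.dead_ends[-max_items:]}"
        )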

Verification is harder than generation

Models are good at generating plausible explanations but they are not inherently good at proving that a finding is real. Providing actual proof with a PoC is very difficult to achieve but also very important. This is why systems that emphasize exploit validation and structured verification are winning in practice.

The missing components that matter most

If you want a checklist of what still needs to be invented or made robust, here are the biggest items.

Safe exploit primitives and proof standards

The industry needs clearer proof standards for AI driven findings:

  • What constitutes safe proof for SSRF, auth bypass, injection, deserialization, RCE
  • How to avoid causing harm while still proving impact
  • How to represent uncertainty and partial proof

This is why bug bounty platforms and AI security researchers spend so much time debating what is a vulnerability versus what is a technique or delivery mechanism, especially in AI systems. Establishing standard practices in these areas (and applying them to our tools) will also build trust in AI-powered autonomous systems.
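
One way to make that concrete is in the finding record itself. The field names and proof levels below are illustrative assumptions, not an agreed industry standard:

# Hypothetical finding schema illustrating explicit proof levels.
from dataclasses import dataclass
from enum import Enum

class ProofLevel(Enum):
    CONFIRMED = "confirmed"   # safe exploitation observed and reproducible
    PARTIAL = "partial"       # strong indicators, impact not fully demonstrated
    SUSPECTED = "suspected"   # plausible, unverified

@dataclass
class Finding:
    title: str
    vuln_class: str           # e.g. "ssrf", "auth-bypass", "sqli"
    proof_level: ProofLevel
    evidence: list[str]       # request/response pairs, markers, screenshots
    safety_notes: str         # what was deliberately NOT attempted, and why
    reproduction: list[str]   # ordered, non-destructive steps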

Noise control and deduplication

False positives kill adoption. Any autonomous system must solve:

  • Deduplication across assets and time
  • Change detection so old issues do not keep resurfacing
  • Clear evidence and reproduction steps
  • Mapping to real remediation actions

This is not glamorous, but it is the bulk of production value. This is actually something we've mostly solved at Ethiack through the Verifier.
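
As a minimal sketch of the deduplication piece (the normalization rules here are simplified assumptions; real systems normalize hosts, IDs, and parameter values far more aggressively):

# Fingerprint-based deduplication sketch.
import hashlib
import re

def fingerprint(finding: dict) -> str:
    path = re.sub(r"/\d+", "/{id}", finding["path"])  # collapse numeric IDs
    key = f'{finding["vuln_class"]}|{finding["host"]}|{path}|{finding.get("param", "")}'
    return hashlib.sha256(key.lower().encode()).hexdigest()

def merge(existing: dict[str, dict], new_findings: list[dict]) -> dict[str, dict]:
    """Keep one record per fingerprint so re-detections update instead of duplicating."""
    for finding in new_findings:
        fp = fingerprint(finding)
        if fp in existing:
            existing[fp]["last_seen"] = finding["seen_at"]
        else:
            existing[fp] = {**finding, "first_seen": finding["seen_at"],
                            "last_seen": finding["seen_at"]}
    return existing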

Where the field is going next

Based on the public signals and the bottlenecks above, here are the most likely near term shifts.

Hybrid stacks win: classic automation plus AI reasoning

The most successful systems in the short term will look less like “LLM does everything” and more like:

  • Fuzzers, scanners, and analyzers to generate signals
  • AI to prioritize, interpret and/or patch
  • Verifiers to assist in confirming exploitability

For now we'll continue to rely on humans to handle edge cases and high-impact decisions. Over time, we believe the balance will lean more and more towards full autonomy.

Agent stacks > model improvements

Agent frameworks, tooling and automation are becoming the differentiator more than raw model capability. It's all about the processes you create, how you chain the components together, etc.

Benchmarks will matter, but only if they resemble reality

The industry is already building benchmarks for web app pentesting agents and reporting scores, but you should remain skeptical. Benchmarks can be gamed. Real environments are inconsistent, permissioned, and full of edge cases.

The benchmarks that will matter most are those that reward:

  • Performance in dynamic, real-world environments
  • Verified exploitation
  • Low noise
  • Safe behavior
  • Reproducible evidence

How to think about AI powered pentesting pragmatically

If you are a security leader or practitioner evaluating this space, here is the grounded way to do it.

Ask for evidence, not promises

You want to see the edge cases. Ask questions like:

  • What is found automatically?
  • How is it verified?
  • What is the false positive rate in real deployments?
  • Does it provide repro steps that do not rely on storytelling?
  • How does the system behave when blocked or rate limited?

Treat autonomy as a spectrum

Full autonomy is not necessarily the goal. Reliable outcomes are the goal. A tool that autonomously confirms 10 real issues with clean evidence is often far more valuable than a tool that produces 200 “potential findings” that require human cleanup.

Expect the next two years to be about scaffolding

Model capability will keep improving, but the big practical leaps will come from:

  • Better tool orchestration
  • Better environment ingestion and memory
  • Better verification and deduplication

DARPA AIxCC results show how fast performance can improve when the harness is standardized and the engineering is intense. The open question is how much of that will translate cleanly to messy enterprise reality.

Summing up

AI powered pentesting is not well defined. It is already here in some ways, but not in all ways, and it is often misunderstood. Progress is not being driven by a single breakthrough model, but by systems engineering. The teams making real headway are the ones solving context, orchestration, verification, and noise reduction, not just prompt quality.

Fully autonomous hackbots are not a binary on or off switch. Autonomy is increasing incrementally, layer by layer, as more human decision points are replaced with reliable processes. In controlled environments, it's happening quickly. In real-world environments, it is slower and harder, but huge progress is still being made.

Over the next couple of years, the winners in this space will not be the loudest marketers or the ones claiming human level hacking. They will be the teams that quietly deliver consistent, verified, low noise findings in production environments, week after week. Tools that respect operational constraints, prove impact safely, and integrate cleanly into how security teams already work.

If you strip away the hype, the trajectory is obvious. AI will not replace security teams overnight. But it is already changing what is feasible to test, how often it can be tested, and how much human effort is required to get real signal. The question is no longer whether autonomous systems will meaningfully contribute to security. The question is how fast the scaffolding around them can mature enough to earn trust in the real world.

