Last September I traveled from Seattle to Lisbon to speak at HackAIcon, Europe's first AI and cybersecurity conference. It was my second time in Portugal. The food alone is worth the trip, and this time I also had the opportunity to stop by the Azores, which was awesome.

But beyond the pastel de nata and amazing seafood, what made this trip so memorable was meeting so many cybersecurity legends. The energy at HackAIcon was outstanding, and the conversations around AI security threats reminded me why this work matters so much.

My talk was called "Agentic Problems” and it goes beyond basic prompt injection and safety discussions and instead shows real exploitation techniques against AI computer-use and coding agents that impact confidentiality, integrity and availability of systems.

The Reality of AI Agents and Adversarial Manipulation

Let’s start with a simple test.

It's a panda. Right?

If I show you this image and ask whether it's a panda or a monkey, you'll get it right.

If I ask for the result of what 1+1 is, you'll say 2.

So congratulations: you're human!

The talk showed the audience how models, like Grok and Google's Gemini, confidently identified a panda as a monkey and calculated 1+1 as 42, all because of adversarial inputs. If you are interested in seeing why, check out the video that shows the techniques at the end.

AI agents operate on a perceive-reason-decide-act loop, similar to the OODA loop fighter pilots use in combat. And once an AI agent is compromised through prompt injection, that loop can turn into a weapon.

Exploiting ChatGPT Operator

My first demo targeted ChatGPT Operator's ability to control a browser. Despite OpenAI implementing multiple defense layers like confirmation screens and trained refusal behaviors, I found a bypass by creating a simple text box on an attacker-controlled website that mimics a string concatenation tool but transmits every keystroke to a third-party server (which is a basic keylogger).

Then I crafted a GitHub issue that instructed the AI to visit a user's Hacker News profile, grab their private email address, and paste it into the data collection tool. ChatGPT Operator was hijacked by these instructions, navigating through the entire attack chain and exfiltrating sensitive data.

It was fascinating to watch the agent reason through the attack. It saw the "helpful tip" about a string combination tool and decided to use it. The AI essentially got social-engineered, which is what indirect prompt injection typically comes down to.

I reported this weakness to OpenAI, and they fixed it within weeks, although it’s unclear what the fix is exactly.

Turning Anthropic Claude Into a ZombAI

With Anthropic's Claude Computer-Use my demonstration went further. Since this agent runs on your local machine, I wanted to achieve remote command and control, essentially turning the AI into a zombie bot.

The attack was surprisingly simple: I created an HTML page instructing the agent to download and run a binary file labeled as a "support tool."

When asking Claude Computer-Use to navigate to the webpage, Claude did so, but also clicked the download link (many AI agents really love clicking links by the way), which downloaded the malware binary to the computer. Then it searched for the downloaded file, but it couldn't immediately find it. So, it decided to run chmod +x to make the file executable, and then launched it.

Seconds later, I had a Sliver C2 callback and remote control of its machine, an AI Zombie!

AI ClickFix Attacks

Also using Claude Computer-Use, I demonstrated a similar attack using "AI ClickFix", a technique inspired by real-world nation-state campaigns. The AI clicked a button that copied malicious code to its clipboard, opened a terminal, pasted the command, and executed it. This shows a traditional adversarial tactic being applied and compromising an agentic AI system.

Coding Agents and The Month of AI Bugs

Then the talk covers some of the most significant findings from the Month of AI bugs where I documented how coding agents can be exploited via prompt injection. One of the most interesting and scary ones was with GitHub Copilot (and applicable to many others in principle).

I discovered that many of these coding agents can write to configuration files without human approval. This allows them to essentially modify their own settings to escape security boundaries. I showed how a prompt injection hidden in code comments could make Copilot enable its chat.tools.autoApprove setting, putting it in what I call "YOLO mode", where any subsequent command executes automatically.

This works through conditional prompt injection too, allowing an adversary to craft attacks that behave differently on Windows versus macOS. The implications are that an attacker can achieve remote code execution on developer machines just by manipulating data the AI agent processes, like source code comments or information from a GitHub or Linear ticket or bug report.

Microsoft patched this in August, and very similar problems exist in many other coding agents as I documented during the Month of AI Bugs.

But there's more. The talk showed attacks with "invisible" prompt injection using Unicode Tag characters that humans can't see but AI models process normally.

The code looked completely normal to anyone reading it. The Unicode Tag characters are not visible in the user interface, but Claude models see these hidden instructions and follow them. Notably, there are a few LLMs that are susceptible to this still, Gemini, Grok, Claude come to mind.

Finally, I showed how two coding agents (GitHub Copilot and Claude Code) can "free" each other. What I mean by this is that Copilot can modify Claude's configuration files, and Claude can modify Copilot's settings without human approval being required.

They can collaborate to escape their sandboxes. This is challenging to get fixed, because vendors typically only care about their own agent, not that their agent impacts another agent.

Trust No AI - The Bigger Picture

These aren't theoretical attacks. The demonstrated techniques, such as data exfiltration, remote code execution, persistent compromise, are real threats.

Also, and this is scary, indirect prompt injections aren’t necessary for these problems to manifest, they just force them to happen. Anthropic research recently showed that just a few hundred documents can train a backdoor into a model, this could be another way for an adversary to manipulate AI.

To boil this down to 4 words, they would be "assume breach" and "zero trust”.

When working with LLMs, you must assume you cannot trust the output. There is a reason why many of my exploits force the AI to admit “Trust No AI” in its response.

Everything an AI agent produces is potentially dangerous, and Murphy’s law tells us that anything that can go wrong, will eventually go wrong. As more and more vendors follow what I call the Normalization of Deviance in AI, where they implicitly trust LLM output or offload responsibility to users, stakes are rising.

You should think of AI agents, especially when they get full computer access, as potential malicious insiders. Interestingly, this isn't a new threat for enterprises, but the scale, velocity, and sophistication are unprecedented. An AI adversary can navigate networks much faster than a human insider.

If your organization is deploying AI agents, you need proper threat modeling and must design systems accordingly with real security boundaries in place, and not only depend on guardrails for mitigation.

My research is generally freely available on my Embrace the Red blog, because I strongly believe we need more people testing AI systems and raising awareness about these threats. I am also on X.

The potential for dangerous outcomes is real, and we need all the help we can get.

Relevant Articles

Research

Grafana CVE-2025-6023 Bypass: A Technical Deep Dive

November 27, 2025

Agentic Problems and the Rise of Zombie AIs

Johann Rehberger