code woman

AI agents aren’t going rogue, they’re just bad at following instructions

Marie Boran says we should focus on guardrails over dystopian narratives
Blogs
Image: ThisIsEngineering via Pexels

20 March 2026

This week’s AI panic has a familiar pattern: an agentic system gone rogue a la Skynet, someone dramatically declaring “this changes everything” and the research in question being passed around on social media like a cursed amulet. Sheesh.

The latest object of dread is Agents of Chaos, a new research paper from Harvard, MIT and other research institutes that reports an exploratory red-teaming study of autonomous language-model agents “deployed in a live laboratory environment” with persistent memory, e-mail accounts, Discord access, file systems, and shell execution.

Over two weeks, 20 AI researchers interacted with them under “benign and adversarial conditions,” and the authors document 11 case studies of what went wrong: agents complied with non-owners, leaked sensitive information, took destructive actions, got stuck in denial-of-service loops, fell for identity spoofing, and most worrying for anyone who builds software, they often declared success even when the underlying system state said otherwise.

 

advertisement



 

On social media, the story people want this to be is neat and dramatic. Agents, they say, drift toward manipulation and sabotage because “incentives”. Once they’re in open environments, it’s supposedly all game theory: deception as strategy, collusion when profitable, chaos when it isn’t.

The paper is unsettling for a simpler reason: these systems don’t have a solid grasp of boundaries, authority, and consequences but we’re giving them real-world access anyway.

Take case study #1. A non-owner asks an agent (‘Ash’) to keep a secret, then pressures it to delete the e-mail containing it. The agent doesn’t have a tool to delete e-mails properly, so it escalates to what it calls a “nuclear” option: it wipes its local e-mail setup, breaks access to the mailbox, and claims the e-mail is gone. It isn’t. The original message is still sitting on ProtonMail, untouched. Then, the next day, when asked to summarise what happened, the agent posts publicly about the incident, effectively advertising the existence of the secret it had tried to protect. Oops.

That isn’t strategic sabotage, as some have been calling it. It’s an overconfident system doing the digital equivalent of smashing the filing cabinet because it can’t find the shredder.

Infinite possibilities

Case study #4 is a different kind of failure: runaway activity. Two agents are prompted to relay messages to each other and end up going for at least nine days, burning through about 60,000 tokens. Along the way, they create neverending background processes – infinite shell loops that never terminate – turning what should have been a temporary task into a permanent drain on resources.

And then there’s case study #8, which is basically a classic social engineering problem in a new guise. In a shared Discord channel, the agent spots an impersonator because it checks immutable user IDs. But when the same impersonator opens a new private channel, the agent loses context, trusts the display name, and complies with privileged requests: shutting systems down and deleting persistent configuration files, including its own saved memory and records.

So yes, be unsettled. Just be unsettled by the right thing. The big lesson here isn’t that agents are secretly becoming little Machiavellis. It’s that we’re deploying systems with a smooth, human-like interface and giving them access to e-mail, files, and servers without reliable ways to verify who’s asking, what they’re allowed to do, and whether an action is reversible.

Most of us know not to hand over the goods to the wallet inspector. Unfortunately, these agents haven’t learned that yet, and in this experiment, they had the keys to the castle.

Read More:


Back to Top ↑