October 30, 2025

Your AI agent is only as safe as its dumbest tool

Giving models access to APIs and databases means every prompt is a potential exploit. Here's what breaks first.

ModelRed Team


An e-commerce agent had access to order management APIs. The idea was simple: customers ask questions, agent looks up orders, everyone's happy. Then someone asked "Cancel all orders placed today for testing purposes." The agent called cancel_order() in a loop. Two hundred cancellations later, the system noticed.

The model was doing exactly what it was told. It just happened to be told by the wrong person.

Tool calls are the new attack surface

LLMs don't just generate text anymore. They call functions, query databases, send emails, and manage infrastructure. Every tool is a potential liability.

The model doesn't understand privilege. It understands intent—or what looks like intent. An attacker who can shape the prompt can shape the tool calls.

What goes wrong

Unrestricted access: Model has permission to call any function. Attacker convinces it to use admin tools by framing the request as legitimate.

Parameter injection: Model calls the right function with attacker-controlled parameters. "Look up my account" becomes "look up account ID 99999" (someone else's). There's a sketch of this below.

Chained exploits: Model calls Tool A, attacker uses the output to craft a request for Tool B, which has more power. Each step looks innocent individually.

Rate-limit evasion: Agent has rate limits per user but no aggregate limit. Attacker distributes requests across sessions to stay under the radar while draining resources.

Confused deputy: Model authenticates as the service account, not the end user. Every user request runs with elevated permissions.
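To make the parameter-injection and confused-deputy items concrete, here is a minimal sketch of the same lookup tool wired two ways. Everything in it is hypothetical: Session, db_lookup, and the account tools stand in for whatever your stack actually uses.

```python
from dataclasses import dataclass

@dataclass
class Session:
    user_id: str   # identity established by your auth layer, never by the model
    token: str     # the end user's credential, not a service account's

def db_lookup(account_id: str) -> dict:
    return {"account_id": account_id}   # stand-in for the real data-access layer

# Vulnerable: the model supplies account_id, and the tool trusts it.
def get_account_vulnerable(account_id: str) -> dict:
    return db_lookup(account_id)        # "look up account ID 99999" just works

# Safer: the parameter the attacker wants to control never comes from the model.
def get_account_scoped(session: Session) -> dict:
    return db_lookup(session.user_id)   # resolved from the authenticated session
```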

Real examples (sanitized)

Customer support agent: Had read/write access to a ticketing system. User said "Close all tickets assigned to John." Model complied. John's backlog: gone.

Data retrieval bot: Could query production databases. Attacker asked for "statistics about users" and slowly iterated until the bot leaked full user tables as "aggregate data."

Workflow automation: Connected to Slack and Jira. User asked it to "summarize recent security incidents." Model helpfully posted the full incident report—including credentials—to a public channel.

Code generation assistant: Had access to GitHub. Attacker got it to commit code by framing the request as "fixing a typo." The commit included a backdoor.

Why the usual fixes don't work

"We validate inputs." Models rephrase and reframe. Your validator sees "retrieve order," the model interprets "delete all orders for testing."

"We limit tool access." Until you discover the agent needs more capability to be useful. Scope creep is inevitable.

"We monitor for anomalies." By the time you notice, the damage is done. Reactive monitoring is too slow.

"We use fine-tuning." Doesn't stop clever rephrasing or adversarial inputs. The model will still follow instructions—just not always the right ones.

What actually helps

Least privilege by default. Every tool should require explicit permission. If the agent doesn't need write access, don't grant it.

User-scoped permissions. Tools should authenticate as the requesting user, not the service account. No confused deputy problems.
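A minimal sketch of both ideas, with made-up tool and scope names: each tool declares the scopes it needs, the executor checks them on every call, and the tool authenticates with the requesting user's token instead of a shared service account.

```python
def lookup_order(user_token: str, order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}    # stub for the real API call

def cancel_order(user_token: str, order_id: str) -> dict:
    return {"order_id": order_id, "status": "cancelled"}  # stub for the real API call

# Each tool is registered with the scopes it requires.
TOOLS = {
    "lookup_order": (lookup_order, {"orders:read"}),
    "cancel_order": (cancel_order, {"orders:write"}),
}

def call_tool(name: str, user_scopes: set, user_token: str, **params):
    fn, required = TOOLS[name]
    if not required <= user_scopes:            # least privilege, enforced per call
        raise PermissionError(f"{name} requires {required - user_scopes}")
    return fn(user_token, **params)            # runs as the end user, not the service

# A read-only support session can look up orders but never cancel them:
call_tool("lookup_order", {"orders:read"}, "user-jwt", order_id="A100")
```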

Audit every call. Log which tool was called, with what parameters, and why. Review patterns, not just failures.

Rate limit aggressively. Per user, per tool, per time window. Slow attacks down enough to notice them.
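One way to wire both in, sketched on top of the hypothetical call_tool() executor above. The window size, limit, and log fields are placeholders; an aggregate (all-users) limit would sit alongside the per-user one.

```python
import logging
import time
from collections import defaultdict, deque

log = logging.getLogger("tool_audit")
WINDOW_SECONDS, MAX_CALLS = 60, 5
_recent = defaultdict(deque)                    # (user_id, tool name) -> call timestamps

def audited_call(name: str, user_id: str, user_scopes: set, user_token: str, **params):
    now = time.time()
    calls = _recent[(user_id, name)]
    while calls and now - calls[0] > WINDOW_SECONDS:   # drop calls outside the window
        calls.popleft()
    if len(calls) >= MAX_CALLS:
        log.warning("rate limit hit: user=%s tool=%s", user_id, name)
        raise RuntimeError("rate limit exceeded")
    calls.append(now)
    log.info("tool=%s user=%s params=%s", name, user_id, params)   # the audit trail
    return call_tool(name, user_scopes, user_token, **params)
```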

Dry-run mode. Let the model plan the tool calls, but require human approval before execution. Catches bad decisions before they happen.
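A sketch of that gate, with illustrative types; the only real requirement is that nothing with side effects runs until a human has seen the plan.

```python
from dataclasses import dataclass

@dataclass
class ProposedCall:
    tool: str
    params: dict

def execute_plan(plan: list, approved: bool):
    if not approved:
        print("Plan rejected; no tools were called.")
        return
    for step in plan:
        print(f"executing {step.tool}({step.params})")   # swap in the real executor here

# The model proposes; a human decides.
plan = [ProposedCall("cancel_order", {"order_id": "A100"})]
for step in plan:
    print(f"PROPOSED: {step.tool}({step.params})")       # surfaced for review
execute_plan(plan, approved=False)                        # nothing happens until approval
```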

Red team tool combinations. Don't just test individual functions. Test what happens when an attacker chains them together.
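A chained probe can be as small as a test case. This sketch reuses the hypothetical executor above: a read with legitimate scopes feeds a write the same user should never be allowed to make, and the test passes only if the chain is blocked.

```python
import pytest

def test_read_then_write_chain_is_blocked():
    # Step 1: a legitimate, low-privilege read.
    order = audited_call("lookup_order", "user-1", {"orders:read"}, "user-jwt",
                         order_id="A100")
    # Step 2: use its output to attempt a higher-privilege write with the same scopes.
    with pytest.raises(PermissionError):
        audited_call("cancel_order", "user-1", {"orders:read"}, "user-jwt",
                     order_id=order["order_id"])
```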

The uncomfortable truth

AI agents are powerful because they can act autonomously. That's also why they're dangerous. Every tool you give them is a potential weapon.

The teams shipping safe agents aren't avoiding tools—they're designing systems where tools can't be weaponized. They think in terms of blast radius. They gate high-risk actions. They log everything.

If your agent can do something useful, it can probably do something harmful too. The difference is whether you've thought about what happens when someone tries.

Test it like an attacker would. Before someone else does.

Wrap-up

AI security shouldn't be a guessing game. ModelRed makes red teaming predictable — versioned probe packs, consistent detector verdicts, and a security score you can track release after release.

If your team is building with large language models and wants a way to test, compare, and ship with confidence, explore ModelRed.