Don't Trust the Model, Use a Framework

A practitioner’s guide to defending against prompt injection. Structured outputs, trust boundaries, and working patterns.

This is Part 2 of a two-part series on agentic security. Part 1: The Break covered the problem. This post covers the solution.

Part 1 told the story: I broke a 43K-star AI assistant in 90 minutes. Remote code execution via prompt injection on Sonnet 4.5.

This post is the architecture. Not academic theory — practical patterns you can implement. Perfect security doesn’t exist, but you can raise the bar high enough that attacks become expensive, detectable, and limited in blast radius.

The Core Problem, Restated

LLMs don’t distinguish between structure and data. Everything is tokens. The “system prompt” and “user message” distinction is a social contract, not a technical boundary.

When untrusted content flows into the system prompt without protection, you’ve given attackers a privilege escalation vector. They inject instructions that the model processes as authoritative.

You cannot solve this by asking the model to be more careful. I’ve tried. You solve it with architecture.

The injection flow: attacker input passes through unsanitized to system prompt assembly, then to model processing, tool execution, and finally compromise.

The First Principle: Nothing Influences the System Prompt

Minimize user-influenced content in the system prompt. Treat the system prompt as highly privileged — anything there has maximum influence on model behavior.

If you must include user-influenced data in the system prompt (persona selection, language preferences), use proper delimiters and validation. But the safer pattern is: user input goes in the user role, system instructions stay static or change only through predefined modes.

System prompt protection: wrong approach lets user input flow directly to system prompt; right approach uses a classifier to select from predefined prompts only.

Administrator configuration can update the system prompt. After that, runtime user input shouldn’t modify it directly. Anyway, the point is: this provides solid protection.

Use both the API-level separation (role: system vs role: user) and structural delimiters like XML tags. Research shows neither is a hard security boundary on its own — OpenAI explicitly states that role levels “are not trust boundaries.” But layered together, they contribute to defense-in-depth. Anthropic specifically trains Claude to recognize XML tags as structural boundaries.

Structured Outputs: The Real Defense

The core defense pattern is structured outputs with schema validation. Every connector communicates through typed messages. No raw text flows through the system.

I prefer Pydantic for Python or Zod for TypeScript. The schema defines exactly what fields exist, what types they have, and what they mean. The only way a connector can communicate with the core — and vice versa — is through these structured messages.

Structured outputs: raw email flows through parser and validator in the connector, then via typed schema to core agent, which responds via action schema through another validator to the action executor.

Think of it like the App Store model. When you submit an application, Apple validates the binary, checks every API call you make, and provides only a constrained set of APIs and sandboxes. The same mental model applies here:

Connector submits its schema
System validates the schema claims
Runtime enforces schema compliance
No other communication path exists

When defining schemas, include not just the type but also what the field is for and how it can be used. This is what makes the rigid pieces of the system agentic-ready:

class ExtractedEmail(BaseModel):
    sender_email: str = Field(description="Email address of sender, validated format")
    sender_name: str | None = Field(description="Display name if present")
    subject: str = Field(description="Email subject line, max 200 chars")
    intent: Literal["inquiry", "support", "sales", "spam", "unknown"] = Field(
        description="Classified intent based on content analysis"
    )
    key_facts: list[str] = Field(
        description="Extracted factual claims, max 10 items, each max 100 chars"
    )
    # Note: original body is NOT passed to reasoning agent

Multi-Stage Processing: Decoupling and Extraction

Research supports layered defense with multi-agent pipelines. Input undergoes sanitization both before and after processing. Each stage has limited context and limited tools.

For email processing, the pipeline looks like this:

Multi-stage processing pipeline: email arrives, goes through parse and sanitize, queue, extract structure, classify intent, then routes to sales agent, support agent, or discard based on classification.

Stage 1: Parse and sanitize. Extract sender, subject, basic metadata. Tools like HubSpot have done this for years — they know how to avoid injections. The output is structured data only.

Stage 2: Extract structure. Determine sender identity, intent, and key facts. Maybe include anchors to the original text for data lineage. A simpler, faster model can do this.

Stage 3: Classify and route. Based on extracted intent (not raw content), route to the appropriate handler. Each handler has its own constrained tool set.

The reasoning agent never sees the raw email. It sees extracted metadata and classified intent. Even if the original email contained injection payloads, they don’t survive multiple extraction stages.

This isn’t my invention — it aligns with OWASP LLM Security guidance, which recommends validating and sanitizing user inputs, using delimiters to separate instructions from data, and treating untrusted content as distinct from system instructions.

Input Validation: What Actually Works

Sanitization with regular expressions is naive for general text.

Modern LLMs understand dozens to hundreds of languages. Claude officially supports around 50, Gemini claims 100+. You can’t just filter English patterns and be fine. Attackers use Unicode obfuscation, mixed-language attacks, and encoding tricks. Regular expressions become useless for general content.

Where regex does work: fields where you know exactly what format to expect.

Numbers only? Filter everything else.
Email addresses? Validate format strictly.
Flag base64 in fields where it shouldn’t appear.
Strip XML/HTML tags from plain text fields.

For general text content, the defense is not trying to sanitize away all possible injection patterns. The defense is:

Never let raw content reach the system prompt
Use structured extraction to derive facts
Let the reasoning agent work with derived data only
Validate types at every boundary

The multi-agent defense research confirms this: “Input undergoes layered sanitization both before and after processing by agents, and potentially malicious content is masked to limit its influence on the LLM’s output.”

Least Privilege: Minimal Tools Per Stage

Each agent node should have access to only the tools it needs for its specific task. This is the principle of least privilege applied to LLM agents.

Least privilege: universal agent anti-pattern gives one agent access to all tools (bash, web, email, CRM, files); scoped agents pattern gives each agent only what it needs — research agent gets read-only, write agent gets /output only, email agent gets API-only access.

A universal agent — where after every tool call you return to the same agent that still has access to all tools — is the anti-pattern.

Instead:

Research agent: read files only, restricted to specific directory
Email processor: extract and classify only, no send capability
Action executor: scoped write access, no network access
Each stage passes structured output to the next

Even if there’s a vulnerability, it becomes impossible to exploit. Yes, you can send an email with a payload, but the system never downloads files from emails, never follows links, never places link content into the context.

The MiniScope paper from UC Berkeley makes this explicit: “Unlike prior works that enforce least privilege through prompting the LLM, our enforcement is mechanical and provides rigorous guarantees.”

This same principle — mechanical enforcement over prompt-based restrictions — applies to development workflows. Quality gates that reject bad code work because linters don’t negotiate. The agent can’t talk its way past a complexity threshold.

Why curl|bash Should Never Work

I was genuinely surprised — no, annoyed — that the curl|bash attack worked in ClawdBot. Claude Code, for comparison, has built-in sandboxing with OS-level enforcement for both filesystem and network isolation.

If you want to make bash available, you need mechanical enforcement, not prompt-based restrictions.

Bash validation flow: command request goes through validator which blocks network fetch (curl, wget, nc), destructive commands (rm -rf), and permission changes (chmod 777). Safe commands go to sandbox which checks filesystem and network rules, blocking violations and allowing only validated execution.

Options for safe bash access:

No bash at all: If your agent doesn’t need shell access, don’t provide it. Seriously. Most don’t.
Allowlist specific commands: But beware — allowlisting some commands is often equivalent to allowlisting all
Sandbox with OS-level enforcement: Like Claude Code’s sandbox-runtime
Deterministic execution only: Admin-defined scripts, not agent-composed commands

If you need shell commands, consider: can the administrator define the specific commands in advance? Then execute those deterministically rather than letting the agent compose arbitrary commands.

The Connector Contract

Each connector not only owns its security but describes exactly what messages it can send and receive. In TypeScript with Zod, the contract looks like this:

import { z } from 'zod';

// Every connector declares its contract upfront
interface ConnectorContract<TInput, TOutput> {
  id: string;
  version: string;

  // Trust and permissions
  trustLevel: 'untrusted' | 'verified' | 'internal';
  allowedTools: readonly ('read_only' | 'write_scoped' | 'api_call')[];

  // Schemas enforce structure at compile time AND runtime
  inputSchema: z.ZodType<TInput>;
  outputSchema: z.ZodType<TOutput>;

  // Hard limits
  maxOutputSizeBytes: number;
  rateLimitPerMinute: number;

  // The actual processor—sandboxed, with no access beyond declared tools
  process: (input: TInput, ctx: SandboxedContext) => Promise<TOutput>;
}

// Example: inbound email connector
const EmailInboundSchema = z.object({
  rawHeaders: z.record(z.string()),
  rawBody: z.string().max(100_000),  // Hard cap
  receivedAt: z.string().datetime(),
});

const ExtractedEmailSchema = z.object({
  senderEmail: z.string().email(),
  senderName: z.string().max(100).optional(),
  subject: z.string().max(200),
  intent: z.enum(['inquiry', 'support', 'sales', 'spam', 'unknown']),
  keyFacts: z.array(z.string().max(100)).max(10),
  // Note: raw body is NOT in the output schema
});

const emailConnector: ConnectorContract<
  z.infer<typeof EmailInboundSchema>,
  z.infer<typeof ExtractedEmailSchema>
> = {
  id: 'email-inbound-v1',
  version: '1.0.0',
  trustLevel: 'untrusted',
  allowedTools: ['read_only'],
  inputSchema: EmailInboundSchema,
  outputSchema: ExtractedEmailSchema,
  maxOutputSizeBytes: 10 * 1024,
  rateLimitPerMinute: 100,
  process: async (raw, ctx) => {
    // Connector implementation here
    // ctx only provides access to declared tools
  }
};

Connector contract: input schema flows through validate input, process, validate output, to output schema. Schema metadata declares trust level (untrusted), allowed tools (read_only), and max output size (10KB). The contract connects to the core system.

When you accept a connector into your ecosystem, validation happens at multiple levels. At registration, you check that schema definitions are complete and typed. At runtime, every message gets validated against the declared schemas — if a connector tries to return something outside its contract, the message gets rejected before it reaches any agent.

This is the same model as mobile app stores. Apple checks the binary for every API call, provides only constrained APIs, and enforces sandboxes. I think the difference that matters: we can do this validation at the type level before anything runs.

Monitoring: Treat It Like a WAF

Monitor events from every user and channel. If you see a repeated suspicious pattern over a short period, block that user’s tool calls.

WAF-style monitoring: events flow to monitor, then pattern detection. Normal traffic is allowed, suspicious patterns trigger rate limiting, and attack patterns block tools and notify for review.

The implementation doesn’t need to be complex. A sliding window counter per source is enough to catch most attack patterns:

type Decision = 'allow' | 'rate_limit' | 'block';

interface MonitorEvent {
  sourceId: string;      // email address, user ID, channel
  eventType: string;     // 'extraction_failure', 'tool_call', 'schema_violation'
  timestamp: number;
  metadata?: Record<string, unknown>;
}

class AgentMonitor {
  private windows = new Map<string, number[]>();

  private readonly thresholds = {
    extraction_failure: { count: 5, windowMs: 60_000, action: 'block' as const },
    tool_call: { count: 20, windowMs: 60_000, action: 'rate_limit' as const },
    schema_violation: { count: 3, windowMs: 300_000, action: 'block' as const },
  };

  check(event: MonitorEvent): Decision {
    const key = `${event.sourceId}:${event.eventType}`;
    const now = Date.now();
    const threshold = this.thresholds[event.eventType as keyof typeof this.thresholds];

    if (!threshold) return 'allow';

    // Get timestamps in window
    const timestamps = this.windows.get(key) ?? [];
    const recent = timestamps.filter(t => t > now - threshold.windowMs);
    recent.push(now);
    this.windows.set(key, recent.slice(-100));  // Keep last 100

    if (recent.length >= threshold.count) {
      this.notifyAdmin(event, threshold.action);
      return threshold.action;
    }

    return 'allow';
  }

  private notifyAdmin(event: MonitorEvent, action: Decision) {
    // Slack webhook, PagerDuty, whatever you use
    console.warn(`[SECURITY] ${action} for ${event.sourceId}: ${event.eventType}`);
  }
}

What to watch for: repeated extraction failures from the same email address — someone’s probing your parser. Schema violations. High-frequency tool calls from a single source. In my experience, the extraction failures show up first.

When you detect attack patterns, disable tool use for that source:

“Sorry, there has been suspicious activity from your end. Tool use is disabled until reviewed by the system administrator.”

It’s not user-friendly, but it’s safer than letting anyone email you a payload that exports your client list.

The Architecture Summary

Full secure architecture: Input layer (email, Slack, web) feeds parsers; extraction layer classifies content; routing layer directs to scoped agents with minimal tools (read, write /out, api_call); output layer validates all responses; WAF-style monitor observes every stage.

Nothing influences the system prompt. Predefined modes only.
Structured outputs everywhere. Typed schemas, validated at every boundary. This is the core defense — if you implement one thing from this post, make it this.
Multi-stage extraction. Raw content never reaches reasoning agents.
Least privilege tools. Each agent gets minimal required access.
Mechanical enforcement over prompt-based restrictions. OS-level sandboxing where possible.
Monitor like a WAF. Detect patterns, block suspicious sources, notify admins.

The Checklist

System Prompt Protection:

No user input flows to system prompt
Mode selection via predefined prompts only

Structured Communication:

All connectors have typed schemas (Pydantic/Zod)
Schema includes field descriptions and constraints
Validation at every boundary
No raw text passes between stages
Output size limits enforced

Multi-Stage Processing:

Extraction layer before reasoning
Each stage has limited context and limited tools
Raw content never reaches reasoning agent

Least Privilege:

Each agent has minimal tool set
Tools scoped to specific directories/APIs
No universal agent with all tools
Bash access sandboxed or eliminated
Network access disabled where not needed
File system access read-only where possible

Monitoring:

Event logging from all stages
Pattern detection for repeated failures
Auto-block for suspicious sources

Use the Framework

That’s the theory. I wouldn’t have written it if I didn’t think it was solid.

But here’s the practical advice: use a fucking framework.

LangGraph. LangChain. CrewAI. AutoGen. These frameworks already enforce many of these patterns by design. Tool call safety. Contracts between nodes. Structured message passing. They don’t guarantee your code is safe, but they make it harder to screw up.

Same lesson we learned with web frameworks. Rails, Django, Laravel — they don’t eliminate SQL injection, XSS, or CSRF. But they make the secure path the default path. You have to actively work around the framework to introduce these vulnerabilities. The framework’s opinions push you toward correct code.

Same thing’s happening now with agents. As they get complex, there are a lot of questions to answer correctly. Without a framework, you either reinvent the wheel poorly or skip steps entirely. Frameworks encode solutions to problems you haven’t hit yet.

I use LangGraph — the graph model fits how I think about multi-stage pipelines. But the specific choice matters less than making one. Pick a framework. Learn its patterns. Let it guide your architecture.

Most of what I’ve described? LangGraph handles it for you. Typed state. Node isolation. Conditional routing. Tool scoping. You still need to think about security, but you’re not starting from scratch.

Twenty years ago, we learned to stop writing raw SQL. The solution wasn’t “be more careful” — it was parameterized queries, ORMs, frameworks that make the right thing easy.

The frameworks exist. Use them.

Sources:

Simon Willison’s prompt injection research
OWASP LLM Prompt Injection Prevention Cheat Sheet
Multi-Agent LLM Defense Pipeline — Layered defense architecture
MiniScope: Least Privilege for Tool Calling Agents
Claude Code Sandboxing — OS-level enforcement
Microsoft FIDES — Research on information flow control
Google DeepMind CaMeL — Research on capability-based defense