
Improve Model-Based Guardrails in LLM Prompt Injection Prevention #2249

Open
crony-io wants to merge 2 commits into OWASP:master from crony-io:master

@crony-io


This PR revises the claim in the Model-Based Guardrails section that the dual-LLM pattern described by Simon Willison is the strongest architectural form for building AI assistants that can resist prompt injection. As Willison himself notes in his 11 April 2025 update, his Dual LLM proposal has a potential flaw, and CaMeL (released by Google DeepMind in this paper and this GitHub repo) proposes a new direction for mitigating prompt injection attacks.

🚩 If your PR is related to grammar/typo mistakes, please double-check the file for other mistakes in order to fix all the issues in the current cheat sheet.

Please make sure that for your contribution:

  • In case of a new Cheat Sheet, you have used the Cheat Sheet template.
  • All the markdown files do not raise any validation policy violation, see the policy.
  • All the markdown files follow these format rules.
  • All your assets are stored in the assets folder.
  • All the images used are in the PNG format.
  • Any references to websites have been formatted as [TEXT](URL)
  • You verified/tested the effectiveness of your contribution (e.g., the defensive code proposed is really an effective remediation? Please verify it works!).
  • The CI build of your PR passes, see the build status here.

Scope and sourcing (required)

  • This PR is focused: it modifies a single cheat sheet, or a small coordinated set, and the scope is described in the PR body.
  • Every technical claim, recommendation, or threat assertion added in this PR is supported by a primary source (RFC, NIST, OWASP standard, vendor documentation, peer-reviewed research) linked inline as [text](URL).
  • I have read each source I cite and confirm it actually supports the claim. I have not relied on summaries, hearsay, or model-generated citations.

AI Tool Usage Disclosure (required for all PRs)

Please select exactly one of the following options. PRs that leave this section blank will be closed.

  • I have NOT used any AI tool to generate the contents of this PR.
  • I have used AI tools to generate the contents of this PR. I have verified
    the contents and I affirm the results. The LLM used is [llm name and version]
    and the prompt used is [your prompt here]. I have independently verified every citation and technical claim against the cited source. [Feel free to add more details if needed]

@jmanico

jmanico commented Jun 23, 2026

Member

This is promising but purely theoretical.

Very few production AI systems implement this today.

It requires:

  • Data classification
  • Policy engine
  • Metadata propagation
  • Enforcement layer

Most companies aren’t doing this yet, or even close to it.

@crony-io

Author

Hi @jmanico, I completely agree with you. CaMeL is indeed bleeding edge, and the architectural lift required is far beyond what most companies are currently implementing.
My primary motivation for this PR is that the current cheat sheet explicitly names Simon Willison's Dual LLM pattern as the "strongest architectural form." However, Willison has since updated that same article to point out a flaw in his design (it remains vulnerable to data-flow manipulation), and he cites CaMeL as the necessary architectural evolution to close that gap.

@randomstuff

randomstuff commented Jun 26, 2026

Collaborator

> CaMeL is indeed bleeding edge, and the architectural lift required is far beyond what most companies are currently implementing.

In that case, maybe just adding a reference/further reading link would be better?

Further reading:

Disclaimer: I have not looked at the paper at all 😄

@crony-io

Author

Sure, maybe I can remove the whole description of CaMeL, keep the original text, and just add something like this after it?

The strongest architectural form of this idea is the dual-LLM pattern, described by Simon Willison. A privileged LLM holds the tools but never reads untrusted content directly. A quarantined LLM reads untrusted content but cannot take action. The privileged model receives only structured summaries or labels from the quarantined one, which breaks the path that injected instructions need to reach the actor.
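
To make the paragraph above concrete, here is a minimal Python sketch of the dual-LLM split. It is an illustration under assumptions, not Willison's implementation: `quarantined_llm` and `privileged_llm` are hypothetical stubs standing in for real model calls, and the label set, the tool names, and `PRIVILEGED_INPUT_LOG` are invented for this example.

```python
from typing import Callable, Dict, List

# Closed set of labels the quarantined side is allowed to emit (assumed for this sketch).
ALLOWED_LABELS = {"invoice", "newsletter", "other"}

# Records everything the privileged side was shown, so the isolation can be checked.
PRIVILEGED_INPUT_LOG: List[str] = []


def quarantined_llm(untrusted_text: str) -> str:
    """Hypothetical stub for a model that reads untrusted content and has no tools."""
    text = untrusted_text.lower()
    if "invoice" in text:
        return "invoice"
    if "unsubscribe" in text:
        return "newsletter"
    return "other"


def classify_untrusted(
    untrusted_text: str, model: Callable[[str], str] = quarantined_llm
) -> str:
    """Controller code: coerce the quarantined output into the closed label set."""
    label = model(untrusted_text).strip().lower()
    return label if label in ALLOWED_LABELS else "other"


def privileged_llm(user_request: str, label: str) -> Dict[str, object]:
    """Hypothetical stub for the tool-holding model. It sees a label, never raw content."""
    PRIVILEGED_INPUT_LOG.append(f"{user_request}|{label}")
    if label == "invoice":
        return {"tool": "file_document", "args": {"folder": "finance"}}
    return {"tool": "noop", "args": {}}


def handle(user_request: str, untrusted_text: str) -> Dict[str, object]:
    """Route untrusted text through the quarantined side only."""
    label = classify_untrusted(untrusted_text)
    return privileged_llm(user_request, label)
```

In this sketch the injected instructions in an email body can at most influence which of the three labels is chosen. Free-form text from the quarantined side is dropped to `"other"` by the controller before the privileged side sees anything.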

A newer architectural form of this idea is CaMeL (CApabilities for MachinE Learning). It improves on the original dual-LLM pattern by preventing injected data from manipulating tool arguments. As Willison writes on his blog, CaMeL offers a promising new direction for mitigating prompt injection attacks.
Further reading: Defeating Prompt Injections by Design from Google DeepMind and their GitHub repo for examples.
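
As a rough illustration of what "preventing injected data from manipulating tool arguments" could look like, the sketch below tags every value with its sources and checks a per-tool policy before the call runs. This is not CaMeL's interpreter or API. `Tagged`, `call_tool`, `send_email_policy`, and the source names are hypothetical, chosen only to show the four pieces mentioned earlier in this thread: data classification, metadata propagation, a policy engine, and an enforcement layer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

TRUSTED = "user"          # data typed by the user (assumed trusted here)
UNTRUSTED = "untrusted"   # data read from emails, web pages, tool output


@dataclass(frozen=True)
class Tagged:
    """A value plus the set of sources it was derived from (data classification)."""
    value: str
    sources: FrozenSet[str]


def from_user(value: str) -> Tagged:
    return Tagged(value, frozenset({TRUSTED}))


def from_untrusted(value: str) -> Tagged:
    return Tagged(value, frozenset({UNTRUSTED}))


def concat(a: Tagged, b: Tagged) -> Tagged:
    """Metadata propagation: a derived value inherits the sources of its inputs."""
    return Tagged(a.value + b.value, a.sources | b.sources)


class PolicyViolation(Exception):
    """Raised when a tool call would use an argument its policy forbids."""


def send_email_policy(args: Dict[str, Tagged]) -> None:
    # Policy: the recipient must be derived from trusted data only.
    if args["to"].sources != frozenset({TRUSTED}):
        raise PolicyViolation("recipient derived from untrusted data")


POLICIES: Dict[str, Callable[[Dict[str, Tagged]], None]] = {
    "send_email": send_email_policy,
}


def call_tool(name: str, args: Dict[str, Tagged]) -> str:
    """Enforcement layer: every tool call passes through its policy first."""
    policy = POLICIES.get(name)
    if policy is None:
        raise PolicyViolation(f"no policy registered for tool {name!r}")
    policy(args)
    rendered = ", ".join(f"{k}={v.value!r}" for k, v in sorted(args.items()))
    return f"{name}({rendered})"  # stand-in for actually running the tool
```

With these assumptions, an email body read from an untrusted source may still be sent as content, but a recipient address that originated in (or was concatenated with) untrusted data is rejected before the tool runs.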

