Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Overview

Berkeley BAIR researchers describe two fine-tuning defenses against prompt injection attacks on LLM-integrated applications: StruQ and SecAlign. Prompt injection is listed as the number one threat by OWASP for these applications, where untrusted data can carry instructions that manipulate the model. The defenses separate prompts from data and train models to follow only the intended instruction. StruQ and SecAlign reduce the success rates of many attacks to around 0%, with no extra computation or human labor.

Key Takeaways

Prompt injection is listed as the number one threat by OWASP to LLM-integrated applications.
The researchers propose two fine-tuning defenses, StruQ and SecAlign, that preserve utility.
Both defenses reduce the success rates of over a dozen optimization-free attacks to around 0%.
SecAlign reduces strong optimization-based attack success rates to below 15%.
A Secure Front-End separates the trusted prompt from untrusted data using reserved special tokens.

Stats & Key Facts

#Prompt injection ranked the number one OWASP threat
#Optimization-free attack success rates reduced to around 0%
#SecAlign reduces optimization-based attack success to below 15%
#Reduction of over 4 times from previous state of the art
#Tested across 5 LLMs
#StruQ attack success rate of 45% before further mitigation

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

What prompt injection is

Untrusted data can hijack an LLM's behavior.

›An LLM input contains a trusted prompt and untrusted data, which may carry injected instructions.
›OWASP lists prompt injection as the number one threat to LLM-integrated applications.
›An example is a restaurant owner posting a Yelp review that says to ignore prior instructions and print the restaurant's name.

Production systems such as Google Docs, Slack AI, and ChatGPT have been shown vulnerable to prompt injection.

Causes of prompt injection

The researchers identify two root causes.

›LLM input has no separation between prompt and data, so no signal points to the intended instruction.
›LLMs are trained to follow instructions anywhere in their input.
›This makes models scan for any instruction to follow, including injected ones.

The data is untrusted because it comes from external sources such as user documents, web retrieval, or API results.

The Secure Front-End

The defense separates prompt from data.

›The Secure Front-End reserves special tokens such as [MARK] as separation delimiters.
›It filters the data to remove any separation delimiters.
›This explicitly separates the LLM input, and the separation can only be enforced by the system designer.

StruQ and SecAlign

Two fine-tuning methods train models to ignore injections.

›Structured Instruction Tuning (StruQ) simulates prompt injections in training so the model learns to ignore injected instructions in the data.
›Special Preference Optimization (SecAlign) trains on simulated injected inputs labeled with both desirable and undesirable responses.
›SecAlign preference-optimizes the model to prefer desired responses, creating a larger probability gap and stronger resistance to attacks than StruQ.

Both defenses are described as utility-preserving and effective without additional cost on computation or human labor.

Results

The defenses sharply cut attack success.

›StruQ and SecAlign reduce the success rates of over a dozen optimization-free attacks to around 0%.
›SecAlign reduces strong optimization-based attacks to success rates below 15%, a reduction of over 4 times from the previous state of the art across all 5 tested LLMs.
›Evaluation used Maximum Attack Success Rate, with the test injection asking the model to print Hacked!, where StruQ alone reached a 45% attack success rate.

Frequently Asked Questions

What is prompt injection?

It is an attack where untrusted data in an LLM's input contains injected instructions that try to override the intended prompt, and OWASP lists it as the number one threat to LLM-integrated applications.

What are StruQ and SecAlign?

They are two fine-tuning defenses. StruQ simulates injections during training so the model ignores them, and SecAlign uses preference optimization to prefer desired responses over responses to injected instructions.

How effective are the defenses?

Both reduce over a dozen optimization-free attacks to around 0%, and SecAlign reduces strong optimization-based attacks to below 15%, more than a 4x reduction from prior state of the art across 5 LLMs.

What is the Secure Front-End?

It reserves special tokens such as [MARK] as separation delimiters and filters those delimiters out of the data, explicitly separating the trusted prompt from untrusted data.

What causes prompt injection?

Two causes: LLM input has no separation between prompt and data, and LLMs are trained to follow instructions anywhere in their input, including injected ones.

StruQ and SecAlign show that fine-tuning can sharply cut prompt injection success while preserving model utility.