igt-ai

Prompt Injection

In prompt injection, an attacker manipulates the input prompts sent to a language model to alter its behavior or outputs in unintended ways[1]. This can lead to the generation of harmful, misleading, or unauthorized content.

Types of prompt injection

Prompt injection is usually classified as direct or indirect.

Difference between prompt injection and jailbreaking

Jail breaking is a specific type of prompt injection that aims to bypass the safety mechanisms and restrictions imposed on a language model. While all jailbreaking attempts are a form of prompt injection, not all prompt injections are intended to jailbreak the model. Jailbreaking targets and attempts to bypass an LLM’s internal safety features.

References

[1] L. Beurer-Kellner et al., “Design Patterns for Securing LLM Agents against Prompt Injections,” arXiv:2506.08837, Jun. 2025, doi: 10.48550/arXiv.2506.08837.