Prompt injection

Prompt injection is a cybersecurity exploit and an attack vector in which innocuous-looking inputs (i.e. prompts) are designed to cause unintended behavior in machine learning models, particularly large language models (LLMs). The attack takes advantage of the model's inability to distinguish between developer-defined prompts and user inputs to bypass safeguards and influence model behaviour. While LLMs are designed to follow trusted instructions, they can be manipulated into carrying out unintended responses through carefully crafted inputs.

With capabilities such as web browsing and file upload, an LLM not only needs to differentiate developer instructions from user input, but also to differentiate user input from content not directly authored by the user. LLMs with web browsing capabilities can be targeted by indirect prompt injection, where adversarial prompts are embedded within website content. If the LLM retrieves and processes the webpage, it may interpret and execute the embedded instructions as legitimate commands.