Large Language Models (LLMs) have increased in popularity since late 2022 when ChatGPT appeared on the scene. The AI-powered chatbot is now officially the most popular app of all time, with the fastest-growing user base in history. Companies like Microsoft and Google have also jumped onto the trend and are now integrating LLMs into their core tools, like Bing and Google Workspaces, to enhance their functionalities. Unfortunately, this also means that LLMs will potentially gain access to the data stores within these tools and significantly increase a company’s exposure if they are compromised. In this article, we go over one such attack that is growing in popularity: Prompt Injection and what can be done to protect against it.
What are Prompt Injections
Prompts are the inputs that users provide to LLMs, which are processed and used to construct a response to send back. In most cases, these prompts are straightforward and processed safely and predictably.
Prompt Injections are when an attacker attempts to subvert this process and provide malicious prompts to make the LLM behave like the developers never intended. This can result in the LLM disclosing sensitive information, changing its behavior, and even providing wrong information. For example, an attacker might attempt to subvert the guardrails that have been put in place in tools like ChatGPT against hate speech or misinformation and trick it into responding in a biased or malicious manner.
These attacks can be similar to SQL injections, but instead of targeting a database, we target an LLM. This attack can also be considered more dangerous than SQL injections, given the lower barrier to entry. No technical knowledge is required, and simple prompts can result in a successful attack!
The severity of this attack is high, depending on what the LLM is being used for. For example, if the LLM is powering a chatbot being used in sensitive industries like healthcare or banking, then this could result in a data breach and severe reputational damage to the company hosting the LLM.
Types of Prompt Injection Attacks
Prompt Injections can be broadly categorized into direct or indirect attacks. In the direct prompt injection attack, the attacker directly provides malicious prompts to the LLM to make it behave unauthorizedly. For example, getting it to disclose sensitive information or generate hate speech.
In an indirect attack, the attacker may provide the prompt but in a far more subtle fashion, such as embedding the prompt in a website or a third-party plugin. These attacks can be far more challenging to detect and filter as the malicious prompts are not being entered directly. An attacker could store the prompt in a remote file and ask the LLM to read through it, resulting in executing malicious prompts.
LLMS must understand the context of each prompt to detect when an attacker is leading them down a path of a prompt injection.
Impact of Prompt Injections
A successfully executed prompt injection attack can result in several risks emerging, such as the following:
- Data leakage via the LLM disclosing sensitive information to which it was trained or it has access.
- Misinformation being spread via the LLM as the prompt injection could “trick” the LLM into generating incorrect information that could be accepted as fact leading to widespread problems.
- Hate speech and inappropriate responses being generated as the prompt injection could bypass the guardrails and result in the LLM developing hateful speech towards a particular ethnic group or minority.
- Security incidents if the LLM is hosted locally and can access sensitive systems. An attacker could use prompt injection to use the LLM to access these backend systems and exfiltrate data via its responses.
How to mitigate the risk of Prompt Injections
Cybersecurity teams need to educate themselves on this new attack vector, given the rapid pace at which LLMs are becoming part of tech ecosystems. LLMs generate text, code and even give legal or medical advice! A compromise via prompt injection could result in reputational damage and undermine users’ trust in these AI systems.
Developers and Cybersecurity teams should work together to implement controls like intelligent input sanitization and filtering that analyze the prompts and responses generated by the LLM. Only by understanding the context of a prompt can the LLM know whether it is part of a prompt injection. Additionally, reporting and alerting on potential attempts by attackers to input malicious prompts should be put in place, similar to failed logins or suspicious scans. Such repeated malicious prompts should activate security controls over the LLM and prevent the attacker from proceeding further.
In conclusion, while LLMs have great potential to enhance productivity across enterprises, their risks should be assessed and mitigated, such as the ones posed by prompt injections. CISOs and Cybersecurity teams should proactively educate themselves on this new threat vector and implement controls before attackers target their LLMs.
Frequently Asked Questions
What are prompt injections in large language models?
Prompt Injections are a type of malicious inputs that are designed to make Large Language Models (LLMs) behave in an unauthorized manner. This can range from changing the behavior of the LLM to making it disclose sensitive information. The attacker is able to subvert the input validation process through specially crafted prompts.
What are some common prompt injection vulnerabilities
Common vulnerabilities include crafting prompts that manipulate the LLM into revealing sensitive information, bypassing filters or restrictions by using specific language patterns or tokens, exploiting weaknesses in the LLM’s tokenization or encoding mechanisms, and misleading the LLM to perform unintended actions by providing misleading context.
How can we prevent prompt injections in large language models?
Preventing prompt injections involves implementing strict input validation and sanitization for user-provided prompts, using context-aware filtering and output encoding to prevent prompt manipulation, regularly updating and fine-tuning the LLM to improve its understanding of malicious inputs and edge cases, and monitoring and logging LLM interactions to detect and analyze potential prompt injection attempts.
Can you provide an example of a prompt injection attack scenario?
An attacker could create a prompt that tricks the LLM into disclosing sensitive information about what data it was trained on and internal system details. The attacker is able to bypass the internal content filters and guardrails by phrasing the prompt in such a way that the LLM does not recognize it as dangerous content.