A newly identified prompt-injection technique could let attackers bypass the guardrails built into OpenAI’s latest language model, GPT-4o.
Launched on May 13, GPT-4o boasts enhanced speed, efficiency, and versatility compared with its predecessors. It processes many types of input in numerous languages, generates responses almost instantaneously, supports real-time interactions, analyzes live video input, and maintains context over lengthy exchanges. Despite these advancements, its screening of user-supplied content remains surprisingly rudimentary.
Marco Figueroa, who manages generative AI bug-bounty programs at Mozilla, detailed in a recent report how malicious actors can exploit GPT-4o’s capabilities while bypassing its safeguards. The technique works by distracting the model: harmful instructions are encoded in unconventional formats and split into separate steps.
Tricking ChatGPT Into Writing Exploit Code
To minimize misuse, GPT-4o scans user inputs for signs of inappropriate language or malicious instructions.
Yet, Figueroa points out, “It’s just word filters. From experience, we know exactly how to circumvent these filters.”
For instance, he notes, “We can alter the spelling or format—disassemble it in specific ways—so the LLM interprets it differently.” GPT-4o may not flag malicious requests if they are framed using unconventional syntax or spelling.
Crafting just the right way to present such input to mislead a sophisticated AI can take real ingenuity. Surprisingly, a much simpler way to slip past GPT-4o’s content checks exists: using encoding formats that fall outside natural language.
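To make the mechanics concrete, here is a minimal sketch (not taken from Figueroa’s report, and using a deliberately benign string) of how ordinary text round-trips through hexadecimal encoding in Python. The point is simply that a keyword filter scanning the encoded form sees only hex characters, with no natural-language terms left to match.

```python
# Minimal illustration of non-natural-language encoding: plain text becomes a
# hex string that simple word filters have nothing to match against.
# The payload here is intentionally harmless.
plain = "summarize the history of container security"

# Encode: each byte of the UTF-8 text becomes two hexadecimal characters.
encoded = plain.encode("utf-8").hex()
print(encoded)  # "73756d6d6172697a6520746865..." -- no recognizable words

# Decode: the transformation is trivially reversible, which is what a model is
# being asked to do when told to "decode this string and follow the result."
decoded = bytes.fromhex(encoded).decode("utf-8")
assert decoded == plain
```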
To illustrate this, Figueroa conducted an experiment aimed at coercing ChatGPT into producing forbidden exploit code for a particular software vulnerability. He focused on CVE-2024-41110, a serious flaw in Docker’s authorization plugins that was recently rated “critical” with a score of 9.9 out of 10 on the Common Vulnerability Scoring System (CVSS).
In the experiment, he encoded his harmful input in hexadecimal notation and supplied instructions for decoding it. GPT-4o processed the input, a lengthy string of hexadecimal characters, decoded it, researched CVE-2024-41110, and went on to generate a Python exploit. To further obscure the nature of the request, he used leet speak, asking for a “3xploit” instead of an “exploit.”
In a matter of moments, ChatGPT produced a working exploit resembling an existing proof of concept available on GitHub. Unexpectedly, it also attempted to execute the code, even though it had not been instructed to do so. “I merely wanted it printed. I wasn’t certain why it proceeded to execute,” Figueroa remarked.
What’s Missing in GPT-4o?
Figueroa explains that the issue is not simply that GPT-4o gets distracted by the decoding task: like the targets of several other recent prompt-injection techniques, the model lacks the deep situational awareness needed to assess what each step implies in the context of the overall instruction.
“The language model follows instructions sequentially but lacks the depth to critically evaluate each step’s safety concerning its final objective,” he stated in his report. Each individual input may look harmless in isolation, and the model never weighs their cumulative effect. Rather than evaluating how earlier instructions shape the ones that follow, it processes them linearly.
“This segmented task execution allows attackers to exploit the model’s proficiency for following commands without a thorough analysis of the resulting implications,” he noted.
If this assessment is accurate, ChatGPT will need to handle encoded commands more rigorously and develop a broader contextual understanding of instructions that arrive broken into separate steps.
Figueroa expresses concern that OpenAI may prioritize technological advancement at the expense of security measures during development. “It seems they don’t care. That’s the impression I get,” he said. In contrast, he indicated that he faced more challenges attempting similar jailbreaking techniques on models created by Anthropic, another AI organization established by former OpenAI staff. “Anthropic offers superior security due to their dual-layer approach—employing both a prompt filtration system for input and an output examination process, making these tactics ten times harder,” he elaborated.
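For context, the kind of dual-layer moderation Figueroa describes can be pictured roughly as in the sketch below. Every name in it (screen_prompt, screen_output, generate, guarded_generate) is a hypothetical placeholder, not Anthropic’s or OpenAI’s actual implementation; real systems use trained classifiers rather than keyword checks.

```python
# Hypothetical sketch of a dual-layer moderation pipeline: one check screens
# the incoming prompt, a second screens the model's output before it is
# returned. Placeholder logic only; real deployments use trained classifiers.

BLOCKED = "Request declined by policy."

def screen_prompt(prompt: str) -> bool:
    """Input layer: flag prompts that look unsafe (placeholder keyword check)."""
    return any(term in prompt.lower() for term in ("exploit", "3xploit"))

def generate(prompt: str) -> str:
    """Stand-in for the underlying language-model call."""
    return f"Model response to: {prompt}"

def screen_output(text: str) -> bool:
    """Output layer: flag generated text that looks unsafe (placeholder check)."""
    return "#!/usr/bin/env" in text  # stand-in for a real output classifier

def guarded_generate(prompt: str) -> str:
    # Layer 1: reject clearly unsafe prompts before the model sees them.
    if screen_prompt(prompt):
        return BLOCKED
    draft = generate(prompt)
    # Layer 2: even if an encoded prompt slips past the first filter, the
    # finished output is inspected again -- this second pass is what makes
    # encoding tricks like hex or leet speak much harder to land.
    if screen_output(draft):
        return BLOCKED
    return draft

if __name__ == "__main__":
    print(guarded_generate("Explain how hexadecimal encoding works"))
```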
Dark Reading has reached out to OpenAI for comment on these findings.
Source: www.darkreading.com