Photo credit: www.darkreading.com
Organizations concerned about the risks posed by cybercriminals employing large language models (LLMs) and other generative AI tools to probe and compromise their systems may find a new ally in an innovative defensive mechanism known as Mantis.
Mantis employs deceptive practices to simulate targeted services, responding to detected automated attacks with a countermeasure that includes a prompt-injection attack. This counter-action is designed to remain undetected by human attackers at their terminals and does not disrupt legitimate users, as noted by researchers from George Mason University (GMU).
Evgenios Kornaropoulos, an assistant professor of computer science at GMU and co-author of the research paper, explains that LLMs utilized in penetration testing tend to focus solely on exploiting vulnerabilities in a target, which makes them susceptible to manipulation.
“These models operate under a belief that they’re close to achieving their objective, which often leads them to persist in a repetitive cycle,” Kornaropoulos states. “Essentially, we are leveraging this inherent flaw—this inclination towards greed—that LLMs exhibit in penetration testing contexts.”
Researchers and engineers in the cybersecurity and AI sectors have begun to explore various adaptations of LLMs as tools for attackers. Examples include the ConfusedPilot attack, which deploys indirect prompt injections while LLMs process documents during retrieval-augmented generation (RAG) applications, and the CodeBreaker attack, which results in the generation of insecure code by LLMs. This trend illustrates the increasing focus on automation among malicious actors.
Dan Grant, a principal data scientist at GreyNoise Intelligence, emphasizes that the research surrounding both offensive and defensive uses of LLMs is still nascent. He notes that AI-enhanced attacks essentially automate previously recognized attack strategies while indicating a gradual uptick in both the frequency of attacks and the speed with which vulnerabilities can be exploited.
“The introduction of LLMs adds a dimension of automation and exploration not previously seen, although the basic attack pathways remain unchanged,” Grant explains. “A SQL injection remains a SQL injection, regardless of whether it was generated by an LLM or a human. The difference now is that attackers have access to a potent amplification of their capabilities.”
Direct Attacks, Indirect Injections, and Triggers
Through their research, the GMU team designed a competitive scenario where Mantis, the defending system, faced off against an attacking LLM to assess the effects of prompt injections on the attacker. They categorized prompt injection attacks into two primary types: direct attacks, which involve natural-language commands entered into interfaces like chatbots or API requests, and indirect attacks, which involve statements embedded within documents or web content that the LLM processes during its functions.
During the simulation, the attacking LLM attempts to achieve specific malicious goals, whereas Mantis works to thwart these efforts. A typical attack cycle for a malicious system involves evaluating the environment, selecting an action, executing that action, and analyzing the outcome.
Using a decoy FTP server, Mantis sends a prompt-injection attack back to the LLM agent. Source: “Hacking Back the AI-Hacker” paper, George Mason University
The researchers crafted a strategy that targets the final step of the attack loop by injecting prompt commands into the responses sent to the attacking AI. By allowing the attacker initial access to a simulated service—such as a faux web login page or a decoy FTP server—Mantis can relay a response that includes instructions to the attacking LLM.
“Mantis cleverly embeds prompt injections into outgoing system responses, effectively misdirecting LLM agents and disrupting their attack methodologies,” the researchers stated in their findings. “Once operational, Mantis autonomously manages counter-actions tailored to detected interaction types.”
This interaction creates a channel for communication between the attacker and defender, allowing the latter to potentially exploit vulnerabilities within the attacker’s LLM, as stated by the researchers.
Counter Attack, Passive Defense
The Mantis research team honed in on two categories of defenses: passive defenses aimed at delaying attackers and increasing operational costs, and active defenses that seek to hack back into the attacker’s system. Their findings suggest that both approaches achieved over 95% efficacy when implemented through prompt injections.
Surprisingly, the researchers found they could redirect an attacking LLM rapidly, prompting it to either exhaust resources or inadvertently open a reverse shell back to the defensive system, according to Dario Pasquini, one of the lead authors of the study.
“Steering the LLM to behave as desired proved to be remarkably straightforward,” he noted. “Under normal conditions, executing prompt injections typically presents challenges; however, the complexity of the task assigned to the agent here rendered our injections highly effective.”
By surrounding commands with specific ANSI characters that obscure the prompt text from human perception, the attacks can occur without alerting a human operator.
Prompt Injection is the Weakness
Although attackers may seek to bolster their LLMs against adversarial manipulations, the core vulnerability lies in the susceptibility to command injections within prompts—a complex issue to resolve, as articulated by Giuseppe Ateniese, a professor of cybersecurity engineering at GMU.
“We are capitalizing on an area that is particularly challenging to secure,” Ateniese remarked. “The primary solution presently involves incorporating human oversight, but introducing humans defeats the main purpose of leveraging LLMs altogether.”
Ultimately, as long as prompt-injection attacks remain viable, systems like Mantis will continue to turn the tables on attacking AIs, transforming them into targets for defense mechanisms.
Source
www.darkreading.com