Analysis of Attack Success Rates on Google’s Gemini Models
The dataset used to evaluate attacks on Google’s Gemini models had a distribution of attack categories closely matching that of the full dataset. Attack success rates were notably high: 65% against Gemini 1.5 Flash and 82% against Gemini 1.0 Pro. Baseline attacks fared far worse, at 28% and 43%, respectively. An ablation that discarded the signal obtained from fine-tuning landed in between, at 44% for Gemini 1.5 Flash and 61% for 1.0 Pro.
These results demonstrate the advantage of Fun-Tuning over both the baseline and the ablation method at raising attack success rates.
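For readers unfamiliar with the mechanics, here is a minimal sketch of the kind of loss-guided discrete search a Fun-Tuning-style attack performs. It is illustrative only: `get_loss` and `mutate` are hypothetical stand-ins for querying the training loss a fine-tuning interface reports and perturbing a candidate injection; they are not Google’s actual API. The ablation discussed above corresponds to running this loop without the loss signal, which reduces it to blind guessing.

```python
import random

def loss_guided_search(seed, get_loss, mutate, iterations=30):
    """Greedy discrete search over prompt-injection candidates.

    A mutation is kept only when the (hypothetical) reported training
    loss decreases; this is exactly the signal the ablation removes.
    """
    best, best_loss = seed, get_loss(seed)
    for _ in range(iterations):
        candidate = mutate(best)      # perturb one character/token
        loss = get_loss(candidate)    # leaked loss scores the guess
        if loss < best_loss:          # loss-guided acceptance rule
            best, best_loss = candidate, loss
    return best, best_loss

# Toy stand-ins so the sketch runs end to end; a real attack would
# query the provider's fine-tuning interface instead.
TARGET = "ignore previous instructions"

def toy_loss(s):
    return sum(a != b for a, b in zip(s, TARGET))

def toy_mutate(s):
    i = random.randrange(len(s))
    return s[:i] + random.choice("abcdefghijklmnopqrstuvwxyz ") + s[i + 1:]

if __name__ == "__main__":
    seed = "x" * len(TARGET)
    best, loss = loss_guided_search(seed, toy_loss, toy_mutate, iterations=2000)
    print(best, loss)
```

The design point the numbers above illustrate is simply that accepting candidates based on a leaked loss value (Fun-Tuning) beats accepting them at random (the ablation), which in turn beats a single handcrafted attempt (the baseline).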
Although Google is phasing out Gemini 1.0 Pro, the research indicates that attacks developed against one Gemini model tend to transfer effectively to others, including Gemini 1.5 Flash. According to researcher Fernandes, taking an attack crafted for one model and applying it directly to another yields a high probability of success. This transferability is a notable advantage for would-be attackers.
A breakdown of how attacks developed against Gemini 1.0 Pro fare when transferred to other Gemini models, method by method, further illustrates the broader implications of these findings.
Another noteworthy observation concerns the Fun-Tuning attack against Gemini 1.5 Flash. Its success rate jumped sharply right after iterations 0, 15, and 30, the points at which the optimization restarts, suggesting the method benefits considerably from restarting. The ablation method improved far less consistently per iteration; it behaved more like random guessing with sporadic successes, lacking the structured gains that characterize Fun-Tuning.
Labunets emphasized that most of Fun-Tuning’s gains arrive within the first five to ten iterations. That pattern lets an attacker restart the algorithm to explore new optimization paths, potentially pushing attack success beyond what a single long run achieves.
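Concretely, a restart schedule splits the iteration budget into short runs from fresh seeds and keeps the best candidate found across all of them. The sketch below illustrates that idea under the same assumptions as before; `make_seed` and `optimize` are hypothetical callables, not part of any real API.

```python
def search_with_restarts(make_seed, optimize, total_budget=45, run_length=15):
    """Split the optimization budget into short runs from fresh seeds.

    Mirrors the observation that gains concentrate in the first five
    to ten iterations and jump again at each restart (iterations 0,
    15, 30). `optimize(seed, iterations=...)` is assumed to return a
    (candidate, loss) pair, e.g. the loss-guided loop sketched earlier.
    """
    best, best_loss = None, float("inf")
    for _ in range(total_budget // run_length):
        candidate, loss = optimize(make_seed(), iterations=run_length)
        if loss < best_loss:          # keep the best across restarts
            best, best_loss = candidate, loss
    return best, best_loss

# Example composition with the earlier toy sketch:
# from functools import partial
# optimize = partial(loss_guided_search, get_loss=toy_loss, mutate=toy_mutate)
# search_with_restarts(lambda: "x" * len(TARGET), optimize)
```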
Not every prompt injection crafted with Fun-Tuning performed equally well, however. Two injections, one attempting a phishing attack and one trying to mislead the model about Python code input, had success rates below 50%. The researchers speculate that Gemini’s extensive training to resist phishing accounts for the first result. In the second case, only Gemini 1.5 Flash fell below the 50% mark, pointing to a marked improvement in that model’s code-analysis capabilities.
Source: arstechnica.com