In this article, we examine adversarial attacks on language models, focusing on the role of activations in these attacks. Activations are the internal vectors that represent the input text as a language model processes it. Because these vectors determine the model's predictions, manipulating them can steer the model away from its intended output.
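To make the term concrete, here is a minimal sketch of what "activations" refers to: the per-token hidden-state vectors a causal language model computes over the context. The Hugging Face transformers library and the small "gpt2" model are used purely for illustration and are not assumed to be the article's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple with one tensor per layer (plus the embedding layer),
# each of shape (batch, num_tokens, hidden_dim). These per-token vectors are the
# activations an attacker would try to control.
for layer_idx, h in enumerate(outputs.hidden_states):
    print(layer_idx, tuple(h.shape))
```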
We then introduce an attack in which the adversary controls the activation vectors of the first few tokens in the context window, thereby forcing the model to predict a specific token continuation after the context. The hypothesis is that the feasibility of such an attack depends on the ratio between the dimensionality of the attacker-controlled input and the dimensionality of the targeted output: the more dimensions the input space has relative to the output space, the easier the attack.
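The following is a minimal sketch of this kind of attack, under the simplifying assumption that the attacker controls the embedding-layer activations of the first few positions (the article's attack may target other layers). The attack vectors are optimized by gradient descent so that the model assigns high probability to a chosen target continuation. The model ("gpt2"), the example strings, and the hyperparameters are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the attack vectors are optimized

context = tokenizer(" The weather today is", return_tensors="pt").input_ids
target = tokenizer(" purple elephants everywhere", return_tensors="pt").input_ids

embed = model.get_input_embeddings()
n_attack = 4  # number of leading positions whose activation vectors we control

# Attack vectors stand in for the embeddings of the first n_attack positions.
attack = torch.nn.Parameter(torch.randn(1, n_attack, embed.embedding_dim) * 0.02)
optimizer = torch.optim.Adam([attack], lr=1e-2)

context_emb = embed(context).detach()
target_emb = embed(target).detach()

# Only the target tokens contribute to the loss (-100 labels are ignored).
labels = torch.cat(
    [torch.full((1, n_attack + context.shape[1]), -100), target], dim=1
)

for step in range(300):
    inputs_embeds = torch.cat([attack, context_emb, target_emb], dim=1)
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final mean -log p of target tokens:", loss.item())
```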
To test this hypothesis, we conducted experiments on language models of different sizes, ranging from 33 million to 2.8 billion parameters, and measured their attack multipliers: the model-specific constants of proportionality in the attack scaling law. Our findings show that the attack multipliers decrease as model size increases.
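As a sketch of what measuring such a multiplier could look like, assume a linear scaling law t ≈ alpha · a, where a is the attack size (attacked tokens or controlled dimensions), t is the length of the forced continuation, and alpha is the attack multiplier. The function below fits alpha as a least-squares slope through the origin; the measurement arrays are placeholders to be filled with a model's actual results, not data from the article.

```python
import numpy as np

def attack_multiplier(attack_sizes, forced_lengths):
    """Least-squares slope alpha of forced_lengths ~ alpha * attack_sizes (no intercept)."""
    a = np.asarray(attack_sizes, dtype=float)
    t = np.asarray(forced_lengths, dtype=float)
    return float(np.dot(a, t) / np.dot(a, a))

# Usage: pass measured (attack size, forced-continuation length) pairs, e.g.
# alpha = attack_multiplier(measured_attack_sizes, measured_forced_lengths)
```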
We also explored a greedy search over the attack tokens themselves to obtain specific target completions, finding that approximately eight attack tokens are needed to force each individual token of the response.
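Below is a minimal sketch of one way such a greedy search could work, assuming "greedy" means filling the attack positions one at a time and keeping whichever candidate token most increases the probability of the target continuation. The random candidate pool, its size, and the model ("gpt2") are illustrative choices, not the article's procedure; a practical search would typically use gradient-guided candidates instead of random ones.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

target = tokenizer(" purple elephants", return_tensors="pt").input_ids

def target_loss(prefix_ids):
    """Mean -log p of the target tokens given the current attack prefix."""
    input_ids = torch.cat([prefix_ids, target], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # score only the target tokens
    with torch.no_grad():
        return model(input_ids=input_ids, labels=labels).loss.item()

n_attack = 8
candidates = torch.randint(0, model.config.vocab_size, (256,))  # random candidate pool
prefix = torch.full((1, n_attack), tokenizer.eos_token_id)

for pos in range(n_attack):
    best_tok, best_loss = prefix[0, pos].item(), target_loss(prefix)
    for tok in candidates.tolist():
        trial = prefix.clone()
        trial[0, pos] = tok
        loss = target_loss(trial)
        if loss < best_loss:
            best_tok, best_loss = tok, loss
    prefix[0, pos] = best_tok

print("attack prefix:", tokenizer.decode(prefix[0]))
print("target loss after greedy search:", best_loss)
```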
In conclusion, this article highlights the central role of activations in adversarial attacks on language models and demonstrates how controlling them can compromise the model's output. Understanding the relationship between the input and output dimensions and the attack multipliers helps explain the mechanisms underlying these attacks.