In this article, the authors explore binarizing transformer models to improve their efficiency during inference. They use two large-scale pre-training datasets, BooksCorpus (Zhu et al., 2015) and English Wikipedia (Devlin et al., 2018), as training data for their experiments. The authors follow the same preprocessing pipeline as full-precision transformers, including tokenization, padding, and encoding. However, instead of using real-valued weights, they represent the model's weights as binary vectors.
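To make this concrete, here is a minimal sketch of a standard sign-and-scale binarizer; the function name and the per-row scaling choice are illustrative assumptions, not necessarily the authors' exact scheme:

```python
import torch

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    """Binarize a weight matrix to {-alpha, +alpha} per output row.

    Illustrative sign-and-scale scheme (as in XNOR-style binarization);
    the paper's exact binarizer may differ.
    """
    # Per-row scaling factor: the mean absolute value minimizes the L1
    # gap between the binarized and full-precision weights.
    alpha = w.abs().mean(dim=1, keepdim=True)
    b = torch.sign(w)
    b[b == 0] = 1.0  # map exact zeros to +1 for determinism
    return alpha * b

w_fp = torch.randn(768, 768)        # a full-precision linear layer
w_bin = binarize_weights(w_fp)
print((w_fp - w_bin).abs().mean())  # average binarization error
```

The scaling factor keeps the binarized weights close to the full-precision ones while still permitting 1-bit storage and cheap inference-time arithmetic.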
The authors experiment with different strategies for handling the binarization error introduced by the quantization process. They compare the pre-binarization and post-binarization representations and define the residual polynomial terms that previous work ignored. In addition, they use low-rank estimators to model these residuals.
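One way to read the residual-modeling step, offered purely as an illustrative sketch (a truncated SVD stands in for whatever low-rank estimator the authors actually use, and all names here are assumptions):

```python
import torch

def low_rank_residual(w_fp: torch.Tensor, w_bin: torch.Tensor, rank: int = 16):
    """Approximate the binarization residual w_fp - w_bin with a rank-k factor pair."""
    residual = w_fp - w_bin
    u, s, vh = torch.linalg.svd(residual, full_matrices=False)
    # Keep only the top-`rank` singular directions of the residual.
    u_k = u[:, :rank] * s[:rank]
    v_k = vh[:rank, :]
    return u_k, v_k

w_fp = torch.randn(768, 768)
w_bin = torch.sign(w_fp) * w_fp.abs().mean()      # crude binarization
u_k, v_k = low_rank_residual(w_fp, w_bin, rank=16)
w_approx = w_bin + u_k @ v_k                      # binary weights + low-rank correction
print((w_fp - w_approx).norm() / w_fp.norm())     # relative reconstruction error
```

The appeal of such a decomposition is that the low-rank correction adds only a small number of full-precision parameters on top of the 1-bit weights.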
The authors demonstrate that by decomposing the binarization error into its constituent parts, they can identify the attention scores between keys and queries that are most affected by quantization. They show that by adjusting the weight decay and learning rate schedule, they can reduce this effect and improve the model’s accuracy.
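As an assumption-laden sketch of how one might measure which query-key attention scores drift most under quantization (shapes, names, and the inline binarizer are illustrative, not the authors' code):

```python
import torch

def binarize(w: torch.Tensor) -> torch.Tensor:
    # Simple sign-and-scale binarizer used only for this illustration.
    return torch.sign(w) * w.abs().mean()

def attention_score_shift(x, w_q, w_k):
    """Absolute change in query-key attention logits caused by binarizing W_Q and W_K."""
    d = w_q.shape[1]
    q_fp, k_fp = x @ w_q, x @ w_k
    q_b, k_b = x @ binarize(w_q), x @ binarize(w_k)
    scores_fp = q_fp @ k_fp.transpose(-1, -2) / d ** 0.5
    scores_b = q_b @ k_b.transpose(-1, -2) / d ** 0.5
    return (scores_fp - scores_b).abs()

x = torch.randn(1, 128, 768)                       # a batch of token embeddings
w_q, w_k = torch.randn(768, 64), torch.randn(768, 64)
shift = attention_score_shift(x, w_q, w_k)
print(shift.mean().item(), shift.max().item())     # which query-key pairs drift most
```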
The authors also compare their approach to existing work on binary transformers and show that it outperforms prior methods in both accuracy and efficiency. They attribute this to a novel method they call "smooth quantization," which allows more accurate post-training quantization without sacrificing computational efficiency.
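The article does not spell out the mechanics of "smooth quantization." The following is only a generic sketch of one common way to smooth a hard sign quantizer, a temperature-controlled tanh annealed toward hard binarization, and should not be taken as the authors' method:

```python
import torch

def smooth_sign(w: torch.Tensor, temperature: float) -> torch.Tensor:
    """Soft, differentiable relaxation of sign(w).

    As temperature -> 0 this approaches hard binarization; larger
    temperatures keep gradients informative during calibration.
    """
    return torch.tanh(w / temperature)

w = torch.randn(4, 4)
for t in (1.0, 0.1, 0.01):
    gap = (smooth_sign(w, t) - torch.sign(w)).abs().max().item()
    print(f"temperature={t}: max gap to hard sign = {gap:.4f}")
```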
In summary, the authors of this article aim to improve the efficiency of transformer models during inference by binarizing their weights. They propose several strategies to handle the binarization error and demonstrate the effectiveness of their approach through experiments on two large datasets. Their work has important implications for improving the scalability of transformer-based models in natural language processing tasks.
Computer Science, Machine Learning