Bridging the gap between complex scientific research and the curious minds eager to explore it.

Computation and Language, Computer Science

NeuTRENO: Tackling Over-smoothing in Transformer Models

In recent years, transformer models have become increasingly popular across many fields thanks to their impressive performance on a wide range of tasks. However, a common issue with these models is over-smoothing: as representations pass through successive layers, tokens become increasingly similar to one another, so the model leans on the input’s global context and loses fine-grained local detail. In this article, we propose NeuTRENO, an approach that effectively mitigates over-smoothing in transformer models without sacrificing their performance.

Background

Self-attention mechanisms are a crucial component of transformer models, allowing them to capture diverse syntactic and semantic relationships. However, because each token’s output is a weighted average over all tokens, stacking many attention layers can cause over-smoothing: representations are averaged toward one another, and the model becomes more reliant on global context and less attentive to local detail. This issue is particularly pronounced in tasks that require a fine-grained understanding of local context, such as language modeling.
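
To make this concrete, here is a minimal sketch of single-head softmax self-attention in PyTorch, together with a simple way to observe over-smoothing: the average pairwise cosine similarity of token representations climbs toward 1 as attention is applied repeatedly. The helper names and the random-projection setup are illustrative, not taken from the paper.

```python
# Minimal single-head softmax self-attention, plus a quick over-smoothing check:
# repeatedly applying attention drives the average pairwise similarity of
# token representations toward 1 (tokens become nearly indistinguishable).
import torch
import torch.nn.functional as F

def softmax_self_attention(x, w_q, w_k, w_v):
    """x: (num_tokens, d_model); w_q/w_k/w_v: (d_model, d_model) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (q.shape[-1] ** 0.5)   # scaled dot-product scores
    attn = F.softmax(scores, dim=-1)          # each row is a weighted average
    return attn @ v                           # mix token values by attention weights

def mean_pairwise_cosine(x):
    """Average cosine similarity between all token pairs (1.0 = identical)."""
    x_norm = F.normalize(x, dim=-1)
    sim = x_norm @ x_norm.T
    n = x.shape[0]
    return (sim.sum() - n) / (n * (n - 1))    # exclude the diagonal

torch.manual_seed(0)
d, n = 64, 16
x = torch.randn(n, d)
w_q, w_k, w_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))

for layer in range(6):
    print(f"layer {layer}: mean token similarity = {mean_pairwise_cosine(x).item():.3f}")
    x = softmax_self_attention(x, w_q, w_k, w_v)
```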

Proposed Approach

NeuTRENO addresses over-smoothing by incorporating a new scaling factor into the self-attention mechanism. This factor, called "Neu," is learned during training and balances attention between global and local context. With this additional parameter, NeuTRENO can adaptively adjust the strength of self-attention to match the complexity of the input.
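
The article describes "Neu" only at a high level, so the sketch below is one plausible reading rather than the paper's exact formulation: a single learned factor that interpolates between the globally attended output and each token's own value vector, so local information is not washed out by repeated averaging. The class and parameter names (NeuSelfAttention, neu_logit) are illustrative assumptions.

```python
# A hedged sketch of a learned scaling factor balancing global attention
# against each token's own (local) value. This interpolation form is an
# assumption for illustration, not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuSelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        # Learned scalar; a sigmoid keeps the mixing weight in (0, 1).
        self.neu_logit = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (batch, num_tokens, d_model)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        global_out = F.softmax(scores, dim=-1) @ v   # standard attention (global mixing)
        neu = torch.sigmoid(self.neu_logit)          # learned balance factor
        # Blend the globally mixed output with each token's own value vector.
        return (1 - neu) * global_out + neu * v

x = torch.randn(2, 16, 64)
layer = NeuSelfAttention(64)
print(layer(x).shape)  # torch.Size([2, 16, 64])
```

Because the factor is learned end to end, layers that benefit from heavy global mixing can keep it, while layers that need to preserve local detail can dial it back.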

Experiments

We evaluate NeuTRENO across various tasks, including language modeling, image segmentation, and ImageNet classification. Our results show that NeuTRENO significantly outperforms transformer baselines with softmax attention, and its advantages are particularly pronounced in tasks that require a detailed understanding of local context. We also demonstrate the benefits of combining NeuTRENO with other approaches, such as FeatScale, which addresses over-smoothing by adding a feature-level regularization term.

Conclusion

In summary, NeuTRENO offers a simple yet effective way to address over-smoothing in transformer models. By introducing a new scaling factor into the self-attention mechanism, NeuTRENO adaptively adjusts the strength of attention based on input complexity. Our experiments show that NeuTRENO significantly outperforms baselines across a variety of tasks and offers a promising approach for improving transformer performance in natural language processing and computer vision.