Normalizer
Normalization as a Token Mixing Process in the Transformer Architecture
Abdullah Nazhat Abdullah, Tarkan Aydin
Abstract
The advent of the transformer architecture heralded many advances in natural language processing (NLP), the pinnacle of which is represented by large language models (LLMs). The transformer architecture has also been adopted by computer vision (CV) researchers and practitioners, advancing image processing tasks and leading to the recent introduction of multi-task and multi-modal deep learning architectures. However, the scaled dot product attention mechanism with the softmax activation function used in the typical transformer design presents a drawback: it is computationally intensive, imposing high compute requirements during both training and inference. This paper investigates substituting the attention mechanism of the typical transformer architecture with a nonparametric token mixing process, with the aim of reducing computational cost. The equalized experimental assessments conducted in this work show that the proposed modification, with its targeted reductions in computational complexity, offers competitive performance compared to the selected baseline architectures. The results strongly support the aims of this work, whose focus was to induce token interactions through a simple, parameter-free operation, establishing a more efficient yet capable alternative to the traditional attention mechanism in the design of transformer architectures.
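To make the idea concrete, the sketch below shows one plausible reading of a parameter-free, normalization-based token mixer standing in for the attention sublayer of a transformer block. The module names (`NormalizerMixer`, `NormalizerBlock`), the choice of standardizing each channel over the sequence dimension, and the block layout are illustrative assumptions, not the paper's verified implementation.

```python
# Illustrative sketch only: a normalization-based, parameter-free token mixer
# replacing self-attention. Names and the exact operation are assumptions.
import torch
import torch.nn as nn


class NormalizerMixer(nn.Module):
    """Parameter-free token mixing: standardize each feature channel across the
    token (sequence) axis, so every output position depends on statistics
    computed over all tokens."""

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps  # numerical stability constant (assumed value)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        mean = x.mean(dim=1, keepdim=True)                # (batch, 1, d_model)
        var = x.var(dim=1, keepdim=True, unbiased=False)  # (batch, 1, d_model)
        return (x - mean) / torch.sqrt(var + self.eps)


class NormalizerBlock(nn.Module):
    """Transformer-style block with the attention sublayer replaced by the
    parameter-free mixer; the usual MLP sublayer is kept unchanged."""

    def __init__(self, d_model: int = 256, mlp_ratio: int = 4):
        super().__init__()
        self.mixer = NormalizerMixer()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))  # token mixing, no learned parameters
        x = x + self.mlp(self.norm2(x))    # per-token channel mixing
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 128, 256)       # (batch, seq_len, d_model)
    print(NormalizerBlock()(tokens).shape)  # torch.Size([2, 128, 256])
```

Under these assumptions, the mixing step costs O(N·d) per sequence of N tokens and d channels and adds no learnable parameters, in contrast to the O(N²·d) cost of scaled dot product attention, which is the kind of reduction in computational complexity the abstract targets.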