■書誌情報
Zhenxuan Yu, Takeshi Kojima, Yutaka Matsuo and Yusuke Iwasawa. “Slender-Mamba: Fully Quantized Mamba From Head to Toe.” Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025).
■概要
Large language models (LLMs) have achieved dramatic performance improvements in the natural language processing (NLP) domain. However, these models often consume huge amounts of computational resources during training and inference. Recently, Mamba, a language model architecture based on State-Space Models (SSMs), has been shown to achieve performance comparable to Transformer models while significantly reducing inference costs by compressing the context window. We focus on the potential to further lighten the Mamba architecture by applying the BitNet quantization method to it. In addition, while prior BitNet methods generally quantize only the linear layers in the main body, we also quantize the embedding and projection layers, considering their significant share of the model parameters. In our experiments, we apply ternary quantization to the Mamba-2 (130M) architecture and pre-train the model on 150B tokens from scratch. Our method reduces the bits used by all parameters by nearly 90.1%, a significant improvement over the 54.7% reduction achieved by the conventional BitNet quantization method. Moreover, our method incurs minimal performance degradation in both pre-training perplexity and downstream tasks. These findings suggest the potential of deploying more lightweight language models on edge devices, where demand is expected to grow.
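To make the ternary quantization concrete, the following is a minimal PyTorch sketch of BitNet b1.58-style weight ternarization using the absmean scaling rule (each weight is mapped to {-1, 0, +1} with a single per-tensor scale, so it can be stored in about log2(3) ≈ 1.58 bits). This is an illustrative assumption, not the authors' implementation; the function names `ternary_quantize` and `ternary_linear` are hypothetical.

```python
# Sketch of BitNet b1.58-style ternary weight quantization (absmean rule).
# Hypothetical illustration only; not the paper's actual code.
import torch
import torch.nn.functional as F


def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight tensor to {-1, 0, +1} with a per-tensor scale.

    The scale gamma is the mean absolute value of the weights (absmean);
    the dequantized weight is approximated as gamma * w_ternary.
    """
    gamma = w.abs().mean().clamp(min=eps)          # per-tensor scale
    w_ternary = (w / gamma).round().clamp(-1, 1)   # values in {-1, 0, +1}
    return w_ternary, gamma


def ternary_linear(x: torch.Tensor, w: torch.Tensor, b=None):
    """Linear layer with weights ternarized on the fly.

    A real training setup would use a straight-through estimator so that
    gradients flow to the full-precision latent weights; omitted for brevity.
    """
    w_q, gamma = ternary_quantize(w)
    return F.linear(x, w_q * gamma, b)


# Usage example
w = torch.randn(256, 512)   # full-precision latent weights
x = torch.randn(4, 512)     # batch of activations
y = ternary_linear(x, w)
print(y.shape)              # torch.Size([4, 256])
```

The same ternarization can in principle be applied to embedding and projection matrices as well, which is where the paper's additional bit savings over quantizing only the main-body linear layers come from.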