Bibliographic Information
Zhenxuan Yu, Takeshi Kojima, Yutaka Matsuo, and Yusuke Iwasawa. “Slender-Mamba: Fully Quantized Mamba From Head to Toe.” Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025).
Overview
Large language models (LLMs) have achieved dramatic performance improvements in the natural language processing (NLP) domain. Recently, Mamba, a language model architecture based on State-Space Models (SSMs), has been shown to achieve performance comparable to Transformer models while significantly reducing costs by compressing context windows. We focus on the potential to further lighten the Mamba architecture by applying the BitNet quantization method to it. In addition, while prior BitNet methods generally quantize only the linear layers in the main body, we extensively quantize the embedding layers as well. In our experiments, we apply ternary quantization to the Mamba-2 architecture. Our method achieves a nearly 90.1% reduction in the bits used by all parameters, a significant improvement over the 54.5% reduction achieved when only the linear layers are quantized. In addition, our method experiences minimal performance degradation in both pre-training perplexity and downstream tasks. These findings point to the potential of deploying lighter-weight language models on edge devices, where demand is expected to grow.
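To make the ternary quantization concrete, below is a minimal sketch of absmean 1.58-bit weight quantization in the style of BitNet b1.58, which this work builds on. The function names are illustrative, and the actual Slender-Mamba training pipeline (e.g., straight-through estimation during pre-training, and exactly which Mamba-2 layers are quantized) is not shown here.

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight tensor to ternary values {-1, 0, +1} with a single
    per-tensor scale (absmean scheme, as in BitNet b1.58).
    Returns the ternary codes and the scale needed to dequantize."""
    scale = w.abs().mean().clamp(min=eps)          # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)   # codes restricted to {-1, 0, +1}
    return w_ternary, scale

def dequantize(w_ternary: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original weights for use in matmuls."""
    return w_ternary * scale

if __name__ == "__main__":
    w = torch.randn(4, 8)              # stand-in for a linear or embedding weight matrix
    w_q, s = absmean_ternary_quantize(w)
    w_hat = dequantize(w_q, s)
    print(w_q.unique())                # tensor([-1., 0., 1.])
    print((w - w_hat).abs().mean())    # average quantization error
```

As a rough sanity check on the headline number: a ternary weight carries log2(3) ≈ 1.58 bits versus 16 bits for an FP16/BF16 parameter, so quantizing essentially every parameter gives roughly 1 − 1.58/16 ≈ 90% fewer bits, consistent with the 90.1% reduction reported above.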