Matsuo Laboratory, The University of Tokyo
Release of Weblab-10B: A 10 Billion-Parameter Bilingual Language Model
Supporting Japanese and English
※ The following is a partial English translation of the press release issued on August 22, 2023. Please refer to the Japanese version for the original.
Tokyo, August 22, 2023 — Matsuo Laboratory, part of the Department of Technology Management for Innovation at The University of Tokyo Graduate School of Engineering, has announced the release of its large-scale language model, Weblab-10B. Led by Professor Yutaka Matsuo, the laboratory has developed a Large Language Model (LLM) with 10 billion parameters that supports both Japanese and English.
Objectives and Future Applications:
The lab focuses primarily on advancing artificial intelligence (AI) research and on bringing the technology into industrial application. The newly developed Weblab-10B is intended to accelerate not only text-based AI but also multimodal applications such as image processing and behavior-control algorithms for software and robotic platforms. The lab also intends to apply the expertise gained from this project to educational activities, including course development at the university level.
To address the imbalance in textual data availability between English and other languages such as Japanese, Matsuo Lab diversified its training data. Weblab-10B was pre-trained on both English and Japanese datasets, namely The Pile and Japanese-mC4. The post-training (fine-tuning) phase used five distinct datasets: Alpaca (English), Alpaca (Japanese translation), Flan 2021 (English), Flan CoT (English), and Flan Dialog (English).
Benchmarking and Performance Metrics:
Notably, despite the small proportion of Japanese data in the fine-tuning stage, the model's score on the JGLUE benchmark for Japanese rose significantly, from 66% to 78%. These results affirm the model's efficacy in transferring knowledge between languages. Weblab-10B's performance stands as a domestic milestone, competitive with other internationally available open models.
For additional information and model comparison metrics, please refer to the appended open model comparison table.
The pre-trained and post-trained (fine-tuned) models of Weblab-10B developed by us will be released as open models and may not be used for commercial purposes. (See the Hugging Face page below.)
・Pre-trained model
・Post-trained (fine-tuned) model
Weblab-10B ranks among the most capable openly released Japanese language models to date.
The fine-tuning phase included a relatively small portion of Japanese data, underscoring the importance of the model's capability for cross-lingual knowledge transfer.