Two of our papers have been accepted for publication in Transactions on Machine Learning Research.

■Bibliographic Information
Hiroki Furuta, Gouki Minegishi, Yusuke Iwasawa, Yutaka Matsuo “Towards Empirical Interpretation of Internal Circuits and Properties in Grokked Transformers on Modular Polynomials”. Transactions on Machine Learning Research (TMLR).
■summary
Grokking has been actively explored to reveal the mystery of delayed generalization and identifying interpretable representations and algorithms inside the grokked models is a suggestive hint to understanding its mechanism. Grokking on modular addition has been known to implement Fourier representation and its calculation circuits with trigonometric identities in Transformers. Considering the periodicity in modular arithmetic, the natural question is to what extent these explanations and interpretations hold for the grokking on other modular operations beyond addition. For a closer look, we first hypothesize that (1) any modular operations can be characterized with distinctive Fourier representation or internal circuits, (2) grokked models obtain common features transferable among similar operations, and (3) mixing datasets with similar operations promotes grokking. Then, we extensively examine them by learning Transformers on complex modular arithmetic tasks, including polynomials. Our Fourier analysis and novel progress measure for modular arithmetic, Fourier Frequency Density and Fourier Coefficient Ratio, characterize distinctive internal representations of grokked models per modular operation; for instance, polynomials often result in the superposition of the Fourier components seen in elementary arithmetic, but clear patterns do not emerge in challenging non-factorizable polynomials. In contrast, our ablation study on the pre-grokked models reveals that the transferability among the models grokked with each operation can be only limited to specific combinations, such as from elementary arithmetic to linear expressions. Moreover, some multi-task mixtures may lead to co-grokking — where grokking simultaneously happens for all the tasks — and accelerate generalization, while others may not find optimal solutions. We empirically provide significant steps towards the interpretability of internal circuits learned through modular operations, where analytical solutions are not attainable.

■Bibliographic Information
Hiroki Furuta, Yutaka Matsuo, Aleksandra Faust, Izzeddin Gur. “Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web”. Transactions on Machine Learning Research (TMLR).
■summary
Language model agents (LMA) recently emerged as a promising paradigm on muti-step decision making tasks, often outperforming humans and other reinforcement learning agents. Despite the promise, their performance on real-world applications that often involve combinations of tasks is still underexplored. In this work, we introduce a new benchmark, called CompWoB — 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve 94.0% average success rate on base tasks, their performance degrades to 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show less generalization gap, dropping from 85.4% to 54.8%. By balancing data distribution across tasks, we train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB, and achieves the best zero-shot performance on CompWoB (61.5%). While these highlight the promise of small-scale finetuned and transferred models for task compositionality, their performance further degrades under different instruction compositions changing combinational order. In contrast to the recent remarkable success of LMA, our benchmark and detailed analysis emphasize the necessity of building LMAs that are robust and generalizable to task compositionality for real-world deployment.

Related Post