Our paper has been accepted for publication in NeurIPS2022 (Spotlight).
bibliographic information
Hiroki Furuta, Yutaka Matsuo, Shixiang Shane Gu. “Generalized Decision Transformer for Offline Hindsight Information Matching “, International Conference on Learning Representations (ICLR2022).
Outline of the︎
How to extract as much learning signal from each trajectory data has been a key problem in reinforcement learning (RL), where sample inefficiency has Recent works have shown that using expressive policy function approximators and conditioning on future trajectory information – such as future states in hindsight experience replay (HER) or returnsto-go in Decision Transformer (DT) – enables efficient learning of multi-task policies, where at times online RL is fully replaced by offline behavioral cloning (BC), e.g. sequence modeling. We demonstrate that all these approaches are doing hindsight information matching (HIM) – training policies that can We present Generalized Decision Transformer (GDT) for solving any HIM problem, and show how to solve the problem. any HIM problem, and show how different choices for the feature function and the anti-causal aggregator not only recover DT as a special case, but also lead to novel Categorical DT (CDT). For evaluating CDT and BDT, we define offline multi-task state-marginal matching (SMM) and imitation learning (IL) as two generic HIM problems, propose a Wasserstein distance loss as a metric for Categorical DT, which simply replaces anti-causal summation with anti causal binning in DT, enables arguably the first effective offline multi-task SMM algorithm that generalizes well to unseen (and even synthetic) multi- Bi-directional DT, which uses an anti-causal second transformer as the aggregator, can learn to model any Bi-directional DT, which uses an anti-causal second transformer as the aggregator, can learn to model any statistics of the future and outperforms DT variants in offline multi-task IL, i.e. one-shot IL. Our generalized formulations from HIM and GDT greatly expand the role of powerful sequence modeling architectures in modern RL.