■書誌情報
Akihiro Nakano, Masahiro Suzuki, Yutaka Matsuo. “Efficient Object-Centric Representation Learning using Masked Generative Modeling”. Transactions on Machine Learning Research (TMLR). https://openreview.net/forum?id=t9KvOYPeL3
■概要
Learning object-centric representations from visual inputs in an unsupervised manner has drawn focus to solve more complex tasks, such as reasoning and reinforcement learning. However, current state-of-the-art methods, relying on autoregressive transformers or diffusion models to generate scenes from object-centric representations, suffer from computational inefficiency due to their sequential or iterative nature. This computational bottleneck limits their practical application and hinders scaling to more complex downstream tasks. To overcome this, we propose MOGENT, an efficient object-centric learning framework based on masked generative modeling. MOGENT conditions a masked bidirectional transformer on learned object slots and employs a parallel iterative decoding scheme to generate scenes, enabling efficient compositional generation. Experiments show that MOGENT significantly improves computational efficiency, accelerating the generation process by up to 67x and 17x compared to autoregressive models and diffusion-based models, respectively. Importantly, the efficiency is attained while maintaining strong or competitive performance on object segmentation and compositional generation tasks.
