Authors: Yuya Kobayashi, Masahiro Suzuki, Yutaka Matsuo
The ability to understand the surrounding environment in terms of its components, namely objects, is one of the most important cognitive abilities for intelligent agents. Human beings can decompose sensory input, e.g. visual stimuli, into components based on their meaning or their relationships to other objects, and can recognize those components as “objects”. This kind of compositional recognition ability is often said to be essential for resolving the so-called binding problem, and it is therefore important for many tasks such as planning, decision making, and reasoning. Recently, research on obtaining object-level representations with deep generative models, known as “scene interpretation models”, has been gaining much attention. Scene interpretation models can decompose input scenes into symbolic entities such as objects and represent them in a compositional way. The objective of this research is to point out a weakness of existing scene interpretation methods and to propose a new way to improve them. Scene interpretation models are trained in a fully unsupervised manner, in contrast to recent computer vision methods that rely on massive labeled data such as ground-truth segmentation masks or bounding boxes. Because of this fully unsupervised setting, scene interpretation models lack the inductive biases needed to recognize objects, and their application has therefore been restricted to relatively simple toy datasets. It is widely known that introducing inductive biases into machine learning models can be very effective, as exemplified by convolutional neural networks, but how to introduce them through training is usually not obvious. In this research, we propose incorporating self-supervised learning into scene interpretation models to introduce an additional inductive bias, and we also propose a Transformer-based model architecture that is more suitable for scene interpretation than conventional CNNs.
We show that the proposed method outperforms previous methods and can adapt to the Multi-MNIST dataset, which previous methods could not handle well.