Bibliographic Information
Title: A Proposal of Scene Interpretation Method Using Transformer and Self-Supervised Learning
Author(s): Yuya Kobayashi, Masahiro Suzuki, Yutaka Matsuo
Publication number: Volume 37 Number 2 J-STAGE
Summary
The ability to understand the surrounding environment in terms of its components, namely objects, is one of the most important cognitive abilities for intelligent agents. Human beings are able to decompose sensory input, i.e. visual stimulation, into components based on their meaning or their relationships with other objects, and are able to recognize those components as “objects”. It is often said that this kind of compositional recognition ability is essential for resolving the so-called Binding Problem, and thus important for many [...] Recently, research on obtaining object-level representations using deep generative models, which are called “Scene Interpretation Models”, has been gaining much attention. The objective of this research is to point out the weaknesses of existing scene interpretation methods and to propose a new way to improve them. Scene interpretation models are trained in a fully unsupervised manner, in contrast to the latest methods in computer vision, which are based on massive labeled data such as ground-truth segmentation masks or bounding boxes. Due to this fully unsupervised setting, scene interpretation models lack inductive biases for recognizing objects. Therefore, the application of these models is restricted to relatively simple toy datasets. It is widely known that introducing inductive biases to machine learning [...] In this research, we propose to incorporate self-supervised learning into scene interpretation models in order to introduce additional inductive bias to the models, and we also [...] We show that the proposed methods outperform previous methods and are able to adapt to the Multi-MNIST dataset, which previous methods could not deal with well.