Demo Developers: Aoi Horo, Hikaru Wada, Koki Fukuda, Yoshihiro Noumi
We used several foundation model technologies: a large language model (GPT-4), a speech recognition model (Whisper), an object detection model (Detic), and a multimodal foundation model (CLIP). By integrating these foundation models in a robot, the system can perceive the real world comprehensively and, in response to spoken commands, generate actions appropriate to its own capabilities.
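One way such an integration could be structured is as a perception-to-action pipeline: speech recognition turns audio into a command, object detection grounds the command in the scene, and the language model plans actions. The sketch below is illustrative only; the `RobotPipeline` class and the stub functions are assumptions, not the demo's actual code, and real model calls (Whisper, Detic, GPT-4) would replace the stubs.

```python
# Hypothetical sketch of chaining foundation models on a robot.
# The callables below stand in for real models (Whisper, Detic, GPT-4).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RobotPipeline:
    transcribe: Callable[[bytes], str]             # speech model, e.g. Whisper
    detect: Callable[[bytes], List[str]]           # detector, e.g. Detic
    plan: Callable[[str, List[str]], List[str]]    # LLM planner, e.g. GPT-4

    def handle(self, audio: bytes, image: bytes) -> List[str]:
        """Recognize a spoken command, ground it in the scene, plan actions."""
        command = self.transcribe(audio)
        objects = self.detect(image)
        return self.plan(command, objects)


# Stub implementations standing in for the real models:
pipeline = RobotPipeline(
    transcribe=lambda audio: "bring me the cup",
    detect=lambda image: ["cup", "table"],
    plan=lambda cmd, objs: [f"locate {o}" for o in objs if o in cmd]
    + ["grasp", "deliver"],
)

actions = pipeline.handle(b"<audio bytes>", b"<camera frame>")
print(actions)  # ['locate cup', 'grasp', 'deliver']
```

Keeping each model behind a plain callable interface like this makes it straightforward to swap one foundation model for another without changing the robot-side control logic.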