At the Matsuo-Iwasawa Lab, under the mission of "creating intelligence," we work across a wide range of research areas: developing foundational technologies such as world models in deep learning and beyond, robotics, large language models, and the real-world deployment of algorithms.
To further expand these activities, we held a research internship program, in which 15 people participated.
▼Research Internship Overview
https://weblab.t.u-tokyo.ac.jp/news/20240417/
▼Introduction of Internship Themes and Mentors
https://weblab.t.u-tokyo.ac.jp/news/20240426/
In this article, we share an experience report from one of the research internship participants.
Self-Introduction
My name is Lechang Zhang, and I am a second-year Ph.D. student at Ochanomizu University. I was a fan of Japanese animation and TV series during my college years, so I taught myself some basic Japanese and started my journey in Tokyo.
My research interests mainly lie in how to apply large language models (LLMs) to help people in the real world. I previously interned at Microsoft, where I applied LLMs to improve the efficiency of the Bing shopping team.
About Research
In the internship program, I chose “Research on tackling real-world/social issues by utilizing LLM” as my theme. During the internship, my mentor and I developed a web application that assists Blind and Visually Impaired (BVI) individuals in the kitchen.
We conducted interviews and observation experiments with BVI people at the Tokyo Independent Living Support Center. In the interviews, a participant described the following issues she was facing:
(1) “It is difficult for me to measure the liquids in the kitchen, because I don’t know whether the liquids have been poured or not.”
(2) “I always cook the meals that I am familiar with and barely try new recipes.”
In the observation experiment, we watched a cooking lesson given to a 70-year-old grandmother and recorded the following difficulties:
(1) The cutting board slipped into the sink, but the participant did not notice it.
(2) The cling film did not fully cover the bowl, but the participant failed to notice this and put the bowl into the microwave.
(3) The teacher taught our participant to keep the knife in a certain spot so it would not fall, but the participant sometimes forgot to follow the rule, and the knife ended up near the edge of the table.
(4) The participant could not tell whether liquids such as soy sauce had been poured or not.
We also noticed two other things during the experiment:
(1) The teacher of the cooking class had to help our participant with many actions, such as adding seasonings.
(2) Using knives and cutting was dangerous or difficult even for people who had cooking experience before losing their vision.
Digging into the root cause of these observations, we concluded that our participant could not complete certain cooking actions, such as checking the position of the knife and the cutting board, or checking whether the cling film covered the bowl, mainly because she could not see and had to rely on memorization, and that memory workload was too heavy for her.
Here, we propose our tool, KitchenAssist, to relieve some of the memory burden on BVI people and assist them in the kitchen. KitchenAssist consists of two layers of LLMs. The first layer is a general food LLM that takes video and audio input from the user. The second layer is a critique LLM that adjusts the output of the first LLM and generates responses specifically suited to our BVI users. The two layers are connected through TextGrad, a framework that builds automatic differentiation via text: TextGrad backpropagates textual feedback on the first LLM's response and adjusts its output for our specific users.
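To make the connection concrete, below is a minimal sketch of how a first-layer answer can be refined through TextGrad. It assumes the textgrad package with an OpenAI backend; the example instruction, role description, and critique prompt are illustrative placeholders, not the actual KitchenAssist prompts.

```python
# Minimal sketch: refining a first-layer answer with TextGrad.
# Assumes `pip install textgrad` and an OpenAI API key in the environment.
import textgrad as tg

# Engine used to compute textual "gradients" (natural-language feedback).
tg.set_backward_engine("gpt-4o", override=True)

# Output of the first-layer (general food) LLM, wrapped as an
# optimizable variable. The text itself is a hypothetical example.
answer = tg.Variable(
    "Pour the soy sauce until the cup looks about half full.",
    role_description="cooking instruction for a BVI user",
    requires_grad=True,
)

# The critique layer, expressed as a text loss: it judges whether the
# instruction is usable without vision.
critique = tg.TextLoss(
    "Critique this cooking instruction for a blind or visually impaired "
    "user. It must not rely on visual checks; suggest non-visual cues "
    "(sound, touch, counting) instead."
)

# Textual Gradient Descent: backpropagate the critique and rewrite the answer.
optimizer = tg.TGD(parameters=[answer])
loss = critique(answer)
loss.backward()
optimizer.step()

print(answer.value)  # refined, non-visual instruction
```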
At the implementation level, we built a web application deployed on Hugging Face. The application captures video frames at set intervals; at the same time, it captures audio and uses a speech recognition model to transcribe the audio input into text. It then uses GPT-4o as the general food LLM, the first layer in our architecture, and passes the response to the critique LLM. We collected the observations from the interviews and the observation experiment and used them as in-context learning examples for the critique LLM, so that its answers suit BVI users. Finally, the text answer is converted to speech, and our users can listen to the response.
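For readers curious about the pipeline, here is a condensed sketch of one processing interval, assuming the openai Python client. The capture helpers, model choices (whisper-1, gpt-4o, tts-1), and the few-shot examples are illustrative assumptions, not the exact production code.

```python
# Condensed sketch of one processing interval of the assistant.
# frame_jpeg/audio_path stand in for the web app's media capture.
import base64
from openai import OpenAI

client = OpenAI()

# In-context examples distilled from our interviews and observations
# (illustrative placeholder).
FEW_SHOT_EXAMPLES = (
    "User: Is the cling film fully covering the bowl?\n"
    "Good answer: No. The film is loose on the side nearest to you; "
    "run a finger along the rim to find the gap and press it down.\n"
)

def process_interval(frame_jpeg: bytes, audio_path: str) -> bytes:
    # 1) Speech recognition: spoken question -> text.
    with open(audio_path, "rb") as f:
        question = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # 2) First layer: general food LLM (GPT-4o) over the frame + question.
    image_url = "data:image/jpeg;base64," + base64.b64encode(frame_jpeg).decode()
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    ).choices[0].message.content

    # 3) Second layer: critique LLM with in-context examples, making the
    # answer suitable for BVI users (no visual references).
    refined = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Rewrite the draft answer for a blind or visually "
                        "impaired cook. Follow the style of these examples:\n"
                        + FEW_SHOT_EXAMPLES},
            {"role": "user",
             "content": f"Question: {question}\nDraft: {draft}"},
        ],
    ).choices[0].message.content

    # 4) Text-to-speech so the user can listen to the response.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=refined
    )
    return speech.read()  # audio bytes to play back to the user
```

Running the first and second layers as separate calls keeps the general model's food knowledge intact while letting the critique step enforce the non-visual answering style learned from our field observations.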
Closing
I strongly recommend the Matsuo-Iwasawa Lab internship program to anyone who wants to experience the atmosphere of another lab, make new friends, and grow as a researcher. During this internship, I met people from all over the world, from Spain, Switzerland, and America, and of course from Japan 🙂 I have become friends with some of them, and we exchanged many thoughts on research and personal life.
I was also invited to a BBQ event and to traditional Japanese events, such as a sumo event and a traditional art museum. I had so much fun during the internship.
From a research perspective, my mentor is highly skilled in deep learning and LLMs, and I learned a lot from him. We had many discussions in the lab and worked together to bring our ideas to life. He also patiently taught me practical matters, such as how to use the lab's HPC cluster.
It was a wonderful journey to meet so many different researchers. I am grateful for how much I grew during the internship. I will carry what I learned here with me and continue to grow as a researcher. Thank you all!
We hope you enjoyed this report.
The Matsuo Lab is actively recruiting researchers. If you are interested, please see the link below!
https://weblab.t.u-tokyo.ac.jp/joinus/career/