■Bibliographic Information
Izzeddin Gur*, Hiroki Furuta*, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust (*Equal Contribution). “A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis”. International Conference on Learning Representations (ICLR 2024, Oral)
■Overview
Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, a new pre-trained LLM for long HTML documents that uses local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks, achieving an 18.7% higher success rate than the prior method on the MiniWoB web automation benchmark and SoTA performance on Mind2Web, an offline task planning evaluation.
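To make the modular recipe above more concrete, the sketch below chains the three stages (planning with HTML-T5, HTML summarization, and grounded program synthesis with Flan-U-PaLM) for a single step; the `html_t5` and `flan_u_palm` callables and their prompt formats are hypothetical placeholders, not the paper's actual interface.

```python
# Hypothetical sketch of WebAgent's modular loop: HTML-T5 plans the next
# sub-instruction and summarizes the page; Flan-U-PaLM writes a Python
# program grounded in the selected snippets. Model calls are stubbed out.
from typing import Callable, List

def web_agent_step(
    instruction: str,
    raw_html: str,
    history: List[str],
    html_t5: Callable[[str], str],      # placeholder planner/summarizer LLM
    flan_u_palm: Callable[[str], str],  # placeholder code-generation LLM
) -> str:
    """Return an executable Python snippet for the next navigation step."""
    # 1) Planning: decompose the instruction into the next canonical sub-instruction.
    sub_instruction = html_t5(
        f"instruction: {instruction}\nhistory: {history}\npredict next sub-instruction:"
    )
    # 2) Summarization: extract task-relevant snippets from the long HTML document.
    snippets = html_t5(
        f"sub-instruction: {sub_instruction}\nhtml: {raw_html}\nextract relevant snippets:"
    )
    # 3) Program synthesis: generate grounded Python code conditioned on the
    #    sub-instruction and the selected snippets.
    program = flan_u_palm(
        f"# Sub-instruction: {sub_instruction}\n# Relevant HTML:\n{snippets}\n# Python:"
    )
    history.append(sub_instruction)
    return program
```

In the paper's pipeline, the generated program is executed on the live website and the resulting page feeds back into the next planning step; the sketch only covers one such step.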
■Bibliographic Information
Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur. “Multimodal Web Navigation with Instruction-Finetuned Foundation Models”. International Conference on Learning Representations (ICLR 2024)
■Overview
The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and on domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate that this recipe improves the agent’s ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.
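As a small illustration of the click/type action interface described above, the sketch below parses decoded text actions into structured commands; the exact string format (`click id=...`, `type id=... text=...`) is an assumption made for illustration, not WebGUM's actual output schema.

```python
# Illustrative parser for WebGUM-style text actions ("click"/"type") into
# structured commands; the action string format is assumed, not specified
# by the paper.
from dataclasses import dataclass
from typing import Optional

@dataclass
class WebAction:
    kind: str                    # "click" or "type"
    element_id: str              # target element reference
    text: Optional[str] = None   # text to enter for "type" actions

def parse_action(decoded: str) -> WebAction:
    """Parse a decoded action string such as 'click id=search-button'
    or 'type id=query text=running shoes'."""
    tokens = decoded.strip().split()
    kind = tokens[0]
    fields = dict(t.split("=", 1) for t in tokens[1:] if "=" in t)
    if kind == "type":
        # Re-join everything after 'text=' so multi-word inputs survive splitting.
        text_index = decoded.find("text=")
        return WebAction("type", fields["id"], decoded[text_index + len("text="):])
    return WebAction("click", fields["id"])

print(parse_action("click id=search-button"))
print(parse_action("type id=query text=running shoes"))
```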