【NEWS】 Two papers from our laboratory have been accepted to ICCV 2025.

    ■Bibliographic Information
    Jungdae Lee*, Taiki Miyanishi*, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Yutaka Matsuo, Nakamasa Inoue. “CityNav: A Large-Scale Dataset for Real-World Aerial Navigation”. International Conference on Computer Vision (ICCV 2025).
    (* denotes equal contribution)
    ■Overview
    Vision-and-language navigation (VLN) aims to develop agents capable of navigating in realistic environments. While recent cross-modal training approaches have significantly improved navigation performance in both indoor and outdoor scenarios, aerial navigation over real-world cities remains largely underexplored. To fill this gap, we introduce CityNav, a large-scale dataset for real-world aerial navigation. Our dataset consists of 32,637 human demonstration trajectories, each paired with a natural language description, covering 4.65 km² across two real cities: Cambridge and Birmingham. In contrast to existing datasets composed of synthetic scenes, such as AerialVLN, our dataset presents a unique challenge because agents must interpret spatial relationships between real-world geographic objects referenced in the descriptions. Furthermore, as an initial step toward addressing this challenge, we provide a methodology for creating geographic semantic maps that can be used as an auxiliary modality input during navigation. In our experiments, we compare the performance of three representative aerial VLN agents (Seq2seq, CMA, and AerialVLN models) and demonstrate that the semantic map representation significantly improves their navigation performance.
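
    To make the role of the semantic maps concrete, here is a minimal sketch (not the released CityNav code) of how a rasterized geographic semantic map could be stacked with an aerial RGB observation as an auxiliary input channel for a navigation agent. All class names, field names, and shapes below are illustrative assumptions, not the dataset's actual format.

```python
# Hypothetical sketch: using a geographic semantic map as an auxiliary
# input modality alongside the aerial RGB observation.
from dataclasses import dataclass

import numpy as np


@dataclass
class CityNavSample:
    """One human demonstration trajectory paired with its description (illustrative)."""
    description: str        # natural language navigation instruction
    trajectory: np.ndarray  # (T, 3) sequence of (x, y, altitude) waypoints
    rgb_obs: np.ndarray     # (T, H, W, 3) aerial RGB observations
    semantic_map: np.ndarray  # (H, W, C) rasterized geographic semantic map


def build_agent_input(sample: CityNavSample, t: int) -> np.ndarray:
    """Concatenate the RGB frame at step t with the semantic map channels,
    giving the agent an (H, W, 3 + C) observation."""
    rgb = sample.rgb_obs[t].astype(np.float32) / 255.0
    sem = sample.semantic_map.astype(np.float32)
    return np.concatenate([rgb, sem], axis=-1)


# Example usage with dummy data.
sample = CityNavSample(
    description="Fly toward the church north of the river and hover above its roof.",
    trajectory=np.zeros((10, 3), dtype=np.float32),
    rgb_obs=np.zeros((10, 224, 224, 3), dtype=np.uint8),
    semantic_map=np.zeros((224, 224, 4), dtype=np.float32),
)
obs = build_agent_input(sample, t=0)
print(obs.shape)  # (224, 224, 7)
```

    The intent of the sketch is only to show why the semantic map can act as an extra modality: it is aligned with the observation and can simply be concatenated as additional channels before being fed to an agent such as Seq2seq or CMA.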

    ■Bibliographic Information
    Shunsuke Yasuki, Taiki Miyanishi, Nakamasa Inoue, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Masato Taki, Yutaka Matsuo. “GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields”. International Conference on Computer Vision (ICCV 2025).
    ■Overview
    The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficient filtering, and (ii) Geographical Vision APIs (GV-APIs), a set of specialized geographic vision tools. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To evaluate compositional reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks, including grounding. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language.
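
    As an illustration of the visual programming idea (not the released GeoProg3D implementation), the sketch below shows an LLM used as a reasoning engine that composes calls to geographic vision tools into a small program, which is then executed against a stand-in for GCLF. Every name here (query_llm, filter_region, measure_height, the GCLF class) is a hypothetical placeholder introduced for illustration.

```python
# Hypothetical sketch of LLM-driven visual programming over a city-scale
# 3D language field. Tool names and classes are placeholders, not GV-APIs.
from typing import Callable, Dict


class GCLF:
    """Stand-in for a Geography-aware City-scale 3D Language Field."""

    def localize(self, phrase: str, region: str) -> dict:
        # A real system would query the 3D language field, restricted to the
        # geographic region for efficiency; here we return a dummy detection.
        return {"phrase": phrase, "region": region, "bbox": (0, 0, 10, 10)}


# Hypothetical tool registry: each tool is a plain Python callable.
def filter_region(place_name: str) -> str:
    return f"region:{place_name}"


def measure_height(detection: dict) -> float:
    return 42.0  # dummy value


TOOLS: Dict[str, Callable] = {
    "filter_region": filter_region,
    "measure_height": measure_height,
}


def answer(question: str, field: GCLF, query_llm: Callable[[str], str]) -> object:
    # 1) Ask the LLM to write a tiny program over the available tools.
    program = query_llm(
        "Write Python using filter_region, field.localize, and measure_height "
        f"to answer: {question}"
    )
    # 2) Execute the generated program with the tools and the field in scope.
    scope = {"field": field, **TOOLS}
    exec(program, scope)  # the program is expected to assign `result`
    return scope.get("result")


def fake_llm(prompt: str) -> str:
    # Canned response standing in for a real LLM call.
    return (
        "region = filter_region('Cambridge')\n"
        "det = field.localize('the tallest church tower', region)\n"
        "result = measure_height(det)\n"
    )


print(answer("How tall is the tallest church tower in Cambridge?", GCLF(), fake_llm))
```

    The sketch is only meant to convey the control flow described in the abstract: the LLM plans a short program, the tools provide geographic filtering and measurement, and the language field resolves the natural-language phrase within the selected region.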