December 11, 2024

Paper-club sessions: NavGPT - Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

NavGPT uses LLM reasoning over visual input to navigate unseen environments from language instructions, without any task-specific training.

Matheus Inoue

CW's Data Scientist Coordinator


Large Language Models (LLMs) like GPT-4 have shown remarkable abilities in reasoning and knowledge integration across various domains. While these models excel at text, applications that go beyond it, such as using sensory data like images to complete tasks like navigation, remain a frontier. In the paper "NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models" [1], recently discussed in our Paper Club session, researchers from the University of Adelaide and the Australian National University introduce NavGPT, a system that leverages LLMs' reasoning capabilities to navigate environments based on natural language instructions without any task-specific training.

Current approaches to Vision-and-Language Navigation (VLN) typically rely on supervised learning and specialized architectures trained on navigation datasets. One of the most prominent VLN benchmarks is the Room-to-Room (R2R) [2] challenge, where an agent must navigate through indoor environments following natural language instructions like "Walk out of the bedroom, turn right at the end of the hall, and stop at the third door on your left." While current methods are effective, they often struggle with generalization to unseen environments and complex instructions. Meanwhile, LLMs, despite their broad knowledge and reasoning capabilities, face challenges in directly processing visual information and maintaining spatial awareness. Previous attempts to integrate LLMs into navigation have mostly limited their role to parsing instructions or providing commonsense knowledge, rather than directly controlling navigation decisions.

Figure 1: Room-to-Room (R2R) Challenge. An agent must navigate from a starting point to a destination following natural language instructions. The path shows the trajectory the agent should follow, with multiple decision points along the way [2].
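To make the R2R setup concrete, here is a minimal sketch of what a single episode could look like in code. The field names and identifiers are illustrative, not the dataset's exact schema.

```python
# Hypothetical, simplified view of one Room-to-Room episode: the agent starts
# at a panoramic viewpoint in a navigation graph and must reach the goal node
# by following the natural language instruction.
from dataclasses import dataclass, field

@dataclass
class R2REpisode:
    instruction: str                  # natural language instruction
    scan: str                         # which building/scene the episode uses
    start_viewpoint: str              # id of the starting panoramic node
    goal_viewpoint: str               # id of the target node
    reference_path: list[str] = field(default_factory=list)  # ground-truth viewpoint sequence

episode = R2REpisode(
    instruction="Walk out of the bedroom, turn right at the end of the hall, "
                "and stop at the third door on your left.",
    scan="example_scan",
    start_viewpoint="vp_000",
    goal_viewpoint="vp_007",
    reference_path=["vp_000", "vp_003", "vp_005", "vp_007"],
)
```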

NavGPT addresses these challenges through a novel architecture that bridges visual perception and language understanding. As shown in Figure 2, the system consists of four main components: Visual Foundation Models that convert scenes into textual descriptions using BLIP-2 and object detection models, a History Buffer that maintains navigation context, a Prompt Manager that coordinates all information streams, and the Navigation System Principle that serves as the decision-maker, formulating the behavior of the LLM as a VLN agent. The Visual Foundation Models process the environment from eight different directions and three elevation levels to create comprehensive scene descriptions. These descriptions, along with navigation history and system rules, are organized by the Prompt Manager into structured inputs for the LLM. At each step, GPT-4 explicitly reasons about the environment and chooses actions, while the History Buffer maintains a record of observations, actions, and reasoning for context. What makes NavGPT unique is its ability to break down complex instructions into sub-goals, integrate commonsense knowledge, and adjust plans based on unexpected observations - all while making its reasoning process explicit and interpretable.

Figure 2: The architecture of NavGPT. NavGPT synergizes reasoning and actions in LLMs to perform zero-shot Vision-and-Language Navigation following navigation system principles. It interacts with different visual foundation models to adapt multi-modality inputs, handle the length of history with a history buffer and a GPT-3.5 summarizer, and aggregate various sources of information through a prompt manager. NavGPT parses the generated results from LLMs (LLM Thoughts and LLM Action) to move to the next viewpoint. [1]
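To illustrate how these components interact, below is a minimal sketch of a NavGPT-style decision loop, based on our reading of Figure 2. The helper names (`describe_views`, `parse_thought_and_action`, `summarize`), the placeholder system principle, and the prompt layout are our own simplifications, not the authors' code.

```python
# Sketch of a NavGPT-style decision loop (our simplification of Figure 2).
# Visual foundation models turn the panoramic views into text, the prompt
# manager assembles principle + instruction + history + observation, and the
# LLM returns an explicit thought plus an action to execute.

NAV_SYSTEM_PRINCIPLE = (
    "You are a navigation agent. At each step, reason about the instruction "
    "and the observation, then choose one reachable viewpoint or STOP."
)  # placeholder; the paper's actual principle prompt is more detailed

def parse_thought_and_action(llm_output: str) -> tuple[str, str]:
    """Split the LLM output into its reasoning and the chosen action."""
    thought, _, action = llm_output.partition("Action:")
    return thought.strip(), action.strip()

def navigate(instruction, env, llm, describe_views, summarize, max_steps=20):
    history = []  # buffer of (observation, thought, action) triples
    viewpoint = env.start_viewpoint()
    for _ in range(max_steps):
        # Perception: caption the 8 directions x 3 elevations around the agent.
        observation = describe_views(env.panorama(viewpoint))

        # Prompt manager: aggregate all information streams into one prompt.
        prompt = "\n\n".join([
            NAV_SYSTEM_PRINCIPLE,
            f"Instruction: {instruction}",
            f"History: {summarize(history)}",   # e.g. a GPT-3.5 summary to bound length
            f"Current observation: {observation}",
            "Think step by step, then end with 'Action: <viewpoint id>' or 'Action: STOP'.",
        ])

        # Reasoning and action selection by the LLM.
        thought, action = parse_thought_and_action(llm(prompt))
        history.append((observation, thought, action))

        if action == "STOP":
            break
        viewpoint = env.move_to(action)  # execute the chosen step
    return history
```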

The experimental results reveal both the promise and limitations of this approach. While NavGPT demonstrates impressive reasoning capabilities, achieving a 34% success rate in zero-shot navigation tasks, it still falls short of supervised methods that achieve over 60% success rates. The system shows particular strength in complex reasoning tasks, such as breaking down multi-step instructions and handling unexpected situations. However, it struggles with information loss during visual-to-text conversion and maintaining detailed environmental awareness over long trajectories.
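For reference, the headline numbers are success rates in the usual R2R sense: an episode counts as a success if the agent stops within 3 m of the goal. A minimal illustration of the metric:

```python
def success_rate(final_distances_to_goal, threshold_m=3.0):
    """Fraction of episodes where the agent stopped within `threshold_m` of the goal.

    `final_distances_to_goal` holds one final distance (in metres) per episode.
    """
    successes = sum(1 for d in final_distances_to_goal if d <= threshold_m)
    return successes / len(final_distances_to_goal)

# Example with made-up distances: 3 of these 5 episodes end within 3 m,
# so the success rate is 0.6.
print(success_rate([1.2, 5.4, 2.8, 7.1, 0.5]))
```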

This work opens exciting directions for future research in embodied AI. While current performance may not yet match specialized models, NavGPT demonstrates that LLMs can perform sophisticated navigation reasoning without task-specific training. The authors suggest that future work could focus on developing multimodal LLMs specifically for navigation or creating hybrid systems that combine LLMs' high-level planning with specialized models' precise control. 

Final remarks

The paper proposes a novel approach to applying LLMs to navigation tasks, which can be seen as a first step toward realizing the potential of LLMs for embodied tasks. The work highlights how important it is for LLMs to have good context about the relevant task and environment, which more naive approaches may lack. Regarding the use of LLMs for embodied tasks, the proposal of creating and using a navigation graph demonstrates an interesting abstraction of complex environments into a simple structure, which has the potential to work well for such tasks. Perhaps the same applies to agents in completely unknown places (for example, by updating a graph as exploration proceeds), but such abstractions also limit how the agent can interact with the environment.
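As a toy illustration of that graph-updating idea (our own sketch, not something from the paper), the snippet below grows an adjacency map as the agent visits viewpoints and keeps track of the frontier of seen-but-unvisited nodes.

```python
# Toy sketch of "update the graph as exploration proceeds": the agent starts
# with an empty map and records edges between viewpoints as it discovers which
# neighbours are reachable from each visited node.
from collections import defaultdict

class ExplorationGraph:
    def __init__(self):
        self.edges = defaultdict(set)   # viewpoint id -> reachable neighbour ids
        self.visited = set()

    def observe(self, viewpoint, reachable_neighbours):
        """Register a visited viewpoint and the neighbours seen from it."""
        self.visited.add(viewpoint)
        for n in reachable_neighbours:
            self.edges[viewpoint].add(n)
            self.edges[n].add(viewpoint)   # assume edges are traversable both ways

    def frontier(self):
        """Viewpoints that were seen but never visited: candidates to explore next."""
        return {n for nbrs in self.edges.values() for n in nbrs} - self.visited

# Usage: as the agent moves, the graph and its frontier grow.
g = ExplorationGraph()
g.observe("vp_000", ["vp_001", "vp_002"])
g.observe("vp_001", ["vp_000", "vp_003"])
print(sorted(g.frontier()))   # ['vp_002', 'vp_003']
```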

There are some open questions which we think are worth thinking about:

  • What about the vision capabilities of LLMs like GPT-4o? Could they replace BLIP-2 as the image descriptor? I think the current capabilities of available vision-integrated LLMs should be enough to generate good descriptions, but it would be interesting to check whether this is true (see the sketch after this list).
  • Are current VLN metrics sufficient for evaluating reasoning-based approaches? How can we evaluate the quality of the generated reasoning traces? Current VLN evaluation metrics are objective-centric, and it might be interesting (not only for VLN, but for LLM tasks in general) to have metrics for the quality of reasoning (or other aspects, like creativity).
  • How could approaches similar to the one in the paper be extended to handle dynamic environments? The current experiments show capabilities in discrete time and space, and given the size of current LLMs and the time needed to process a single prompt, it remains a challenge to even consider using similar systems in environments that change over time, or where the agent must interact with other agents in the same space, like an LLM-based robot in a factory sharing space with other robots and humans.
  • Would an “Inverse Computer Graphics”-based approach be a way to improve image understanding for embodied tasks? This one is a bit of an oddball, but I think that if we could build a vision model that tries to replicate how real objects exist within an image, better reasoning could be derived, especially regarding how objects interact and potential occlusions or interferences. The idea is similar to the Inverse Computer Graphics that Sabour and Hinton proposed for Capsule Networks [3]. While Capsule Networks did not achieve real applications or strong results, I still think a similar approach could lead to better real-world understanding for AI models.
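On the first question above, swapping BLIP-2 for a vision-capable LLM would essentially mean replacing the captioning call. A rough sketch with the OpenAI Python client is shown below; the model name, prompt, and image-encoding choices are illustrative assumptions, not something tested in the paper.

```python
# Rough sketch of the first open question: use a vision-capable LLM
# (e.g. GPT-4o) instead of BLIP-2 to describe each navigation view.
# Assumes the OpenAI Python client and an API key in the environment.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_view(image_path: str) -> str:
    """Return a navigation-oriented text description of one view image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this indoor view for a navigation agent: "
                         "visible rooms, doors, furniture, and walkable directions."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```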

References

[1] Zhou, Gengze, Yicong Hong, and Qi Wu. "NavGPT: Explicit reasoning in vision-and-language navigation with large language models." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 7. 2024. Available at: https://arxiv.org/pdf/2305.16986

[2] Anderson, Peter, et al. "Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. Available at: https://bringmeaspoon.org/

[3] Sabour, Sara, Nicholas Frosst, and Geoffrey E. Hinton. "Dynamic routing between capsules." Advances in Neural Information Processing Systems 30 (2017). Available at: https://arxiv.org/abs/1710.09829