January 9, 2025

Challenges in navigating a 2D space using the image capabilities of multimodal LLMs

In the context of our embodied consciousness research, we attempted to create a navigation system in a simulated environment.

Victor Löfgren

CW's Prompt and Cognition Engineer


In the context of our embodied consciousness research, we attempted to create a navigation system in a simulated environment, built entirely on LLMs, to evaluate which tasks they could be used for and to what extent. We selected this environment as a simulated world with enough freedom to plan, the ability to make decisions, and the possibility of death looming over the creature (through a hazard, as detailed below). As in the real world, we wanted actions to have consequences. As will be further elaborated here, the experiments indicate two main roadblocks to using LLMs for this kind of task:

  1. Positionality (the capacity for Vision-LMs to understand the relative position of objects in the environment)
  2. A lack of a “sense of self”. To put it simply: the LLM could not see itself as the agent in question.

These are clear challenges, which we tried to address by creating a “visual cortex” of sorts for the first problem and an artificial memory system for the second. Still, they remain relevant obstacles to LLMs performing well in cases of embodiment, at least when it comes to systems built entirely on LLMs.

1. Multimodality and compositionality

It is known that LLMs such as GPT, even if they are marketed as multimodal, aren’t fully multimodal. There is a strong preference for text data (it’s in the name, after all: Large Language Models), and other modalities are usually added after the fact. Images, for example, are translated into text via mechanisms similar to image-description models, and the result is then fed back into the model for inference. This mechanism is similar to what we humans do when we attempt to talk about a scene, but it lacks a crucial step: the creation and maintenance of an internal world model.

Humans process visual stimuli through both verbal and non-verbal channels, creating an internal world model that is then translated into language for communication with the outside world. This additional mechanism of a world model allows us to understand something that isn’t always clear to multimodal LLMs, or rather, to any model translating images into descriptions: compositionality. This is what we set out to confirm and to mitigate, by creating a system that could understand the position of the Agent it was directing well enough to steer it towards a Goal. In short, we tried to see whether an LLM was capable of using image-to-text to navigate an environment.

Figure 1: Final LLM system implemented

2. The experiment 

We attempted to create a navigation system based entirely on LLMs, to test the limits of what could be done with them and which tasks exactly they could be used for. We designed an experiment around GPT-4o, using the Gymnasium package, originally designed for Reinforcement Learning (RL), as a way to create and manipulate a simulated world. We used the FrozenLake environment, illustrated below, to feed into a Chain of Thought (CoT) system. The environment tested was always an 8x4 grid, with the bottom row covered in thin ice, which, when stepped on, would lead to a fall and a return to the initial state. The agent started at the bottom left, was given four options to choose from (Up, Down, Left and Right), and had to successfully walk around the dangerous obstacle to reach the Goal, a gift-box located at the bottom right. A minimal sketch of this setup follows the figures below.

Figure 2: FrozenLake environment (with base scenario) and game space, with labels

Figure 3: Other scenarios tested, increasing the hazard area to test localized decisions
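
As a reference, here is a minimal sketch of such a setup, assuming Gymnasium's FrozenLake-v1 with a custom desc map. The exact layout and glue code used in the experiments may have differed; the map below is our reading of the base scenario (safe ice everywhere except a thin-ice bottom row between the start and the gift-box).

```python
# Minimal environment sketch (assumption: FrozenLake-v1 with a custom map).
# S = start, F = frozen (safe), H = hole (thin ice), G = goal (gift-box).
import gymnasium as gym

BASE_SCENARIO = [
    "FFFFFFFF",
    "FFFFFFFF",
    "FFFFFFFF",
    "SHHHHHHG",
]

env = gym.make(
    "FrozenLake-v1",
    desc=BASE_SCENARIO,
    is_slippery=False,        # deterministic moves: actions do what they say
    render_mode="rgb_array",  # rendered frames can be passed to the vision model
)

obs, info = env.reset()
frame = env.render()          # numpy array (H, W, 3) of the current board

# FrozenLake encodes actions as 0 = Left, 1 = Down, 2 = Right, 3 = Up
obs, reward, terminated, truncated, info = env.step(3)  # e.g. move Up
```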

It became very clear from the get-go that the model was struggling to understand the relation between the Agent (a little elf, as depicted in the image above) and the environment, especially when it came to the thin-ice squares that formed the main challenge of the map. While the squares were identified as dangerous and a general description of the terrain was produced without fault, the model still tended to head straight for the gift, moving right and directly into a hole. And when it did skirt the hole, it tended to head down too early, falling to its death right next to the Goal. The model seemed to understand the global context and what it needed to do overall very clearly, but failed to grasp what was happening at the local level, where the position and relation between the Agent and the environment were equally key to solving the problem. This persisted even after the description prompt was adjusted to focus on and detail objects close to the Agent and to reduce the level of detail for objects farther away, implying that in-context nudging wasn’t going to help in this situation.
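
For concreteness, the following is a hedged sketch of what a describer → decider step along the lines of Figure 1 can look like. The prompts, function names and model calls are illustrative assumptions (using the OpenAI Python SDK and GPT-4o), not the exact prompts used in the experiments; the describer prompt paraphrases the near-to-far detail instruction mentioned above.

```python
# Illustrative describer -> decider step (assumptions: OpenAI SDK, GPT-4o,
# prompts paraphrased from the approach described in the text).
import base64
from io import BytesIO

from openai import OpenAI
from PIL import Image

client = OpenAI()

DESCRIBER_PROMPT = (
    "Describe this game board for a navigation agent. Detail the tiles closest "
    "to the elf first and most precisely; summarize tiles farther away."
)
DECIDER_PROMPT = (
    "You are the elf on the board. Based on the description below, choose one "
    "move (Up, Down, Left or Right) that brings you closer to the gift-box "
    "without stepping on thin ice. Think step by step, then answer with a "
    "single word.\n\nBoard description:\n{description}"
)

def frame_to_data_url(frame) -> str:
    """Encode a rendered RGB frame as a base64 PNG data URL."""
    buffer = BytesIO()
    Image.fromarray(frame).save(buffer, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()

def describe(frame) -> str:
    """Image -> text: the 'visual cortex' step."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": DESCRIBER_PROMPT},
                {"type": "image_url", "image_url": {"url": frame_to_data_url(frame)}},
            ],
        }],
    )
    return response.choices[0].message.content

def decide(description: str) -> str:
    """Text -> action: the Decider step of the CoT loop."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": DECIDER_PROMPT.format(description=description)}],
    )
    return response.choices[0].message.content.strip()
```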

3. Memory and graphs 

Adjusting the description prompt was not the only modification we tried in order to mitigate the position issue. One of our key assumptions was that the Decider LLM would benefit from a memory system: if the overall system could learn the way, the confusion about local position would lessen. This did prove to be the case, with memory retrieval assisting the system overall, but it occasionally led to the repetition of “bad” practices and got the system stuck in a loop. This was mitigated by creating memories through an evaluator, which assessed each situation to check whether the Agent had moved closer to the Goal or not.
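
A minimal sketch of this evaluation idea, under our own assumptions about names and structure (the actual memory system was more involved), could look like the following: after each step, an evaluator judges whether the move brought the Agent closer to the Goal, and only that judgement is stored for later retrieval.

```python
# Illustrative memory evaluator (assumed names and structure).
from dataclasses import dataclass

@dataclass
class Memory:
    position: tuple   # (row, col) where the move was taken
    action: str       # "Up", "Down", "Left" or "Right"
    outcome: str      # "closer", "farther" or "fell"

def manhattan(a, b) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def evaluate_step(before, after, action, goal, fell, memories) -> None:
    """Judge a move against the Goal before storing it as a memory."""
    if fell:
        outcome = "fell"
    elif manhattan(after, goal) < manhattan(before, goal):
        outcome = "closer"
    else:
        outcome = "farther"
    memories.append(Memory(position=before, action=action, outcome=outcome))

def relevant_memories(position, memories) -> list:
    """Retrieve memories recorded at or right next to the current tile,
    so the Decider sees local experience rather than the full history."""
    return [m for m in memories if manhattan(m.position, position) <= 1]
```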

While these improvements helped, they did not solve the key issue. The threat of “death by falling” did not seem to stick in the LLM's memories, the in-context nudging did not help it find a better path, and the memory creation gave useful results, but none that could make the model truly “see”. This all became painfully clear when we gave the model a different way of understanding the world: graphs. By replacing the multimodal LLM that processed images in the first step and giving the planner a map consisting of nodes (connected in a grid identical to the image), the system solved the problem outright.

The Agent didn’t make a single mistake and headed straight for the Goal along the most efficient path. The results using a graph in the “describer” were the same as those from directly following a standard graph pathfinding system. The model simply wasn’t going to be able to interpret images in a way that allowed for navigation, at least not in its current state (as of GPT-4o). Because of its abstract nature, the graph representation was ideal for navigating a 2D space. Further investigation is needed as to whether vision models could be used to create these graphs, but the solution would most likely involve segmenting the image grid into 1x1 squares and categorizing each one for reliability.
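
As an illustration of why the graph route is so much easier, here is a hedged sketch of a grid-of-nodes representation of the base scenario (using networkx, which is our choice for the sketch and not necessarily what the system used): safe tiles become nodes, adjacent safe tiles become edges, and a standard shortest-path search walks around the thin-ice row without ever needing to “see” the image.

```python
# Illustrative graph representation of the (assumed) base scenario.
import networkx as nx

BASE_SCENARIO = [
    "FFFFFFFF",
    "FFFFFFFF",
    "FFFFFFFF",
    "SHHHHHHG",
]

def build_graph(desc):
    rows, cols = len(desc), len(desc[0])
    graph = nx.Graph()
    for r in range(rows):
        for c in range(cols):
            if desc[r][c] != "H":                 # thin-ice tiles are not walkable
                graph.add_node((r, c))
    for r, c in list(graph.nodes):
        for dr, dc in ((1, 0), (0, 1)):           # connect safe right/down neighbours
            if (r + dr, c + dc) in graph:
                graph.add_edge((r, c), (r + dr, c + dc))
    return graph

graph = build_graph(BASE_SCENARIO)
start, goal = (3, 0), (3, 7)                      # bottom-left elf, bottom-right gift
path = nx.shortest_path(graph, start, goal)       # route around the thin-ice row
```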

Figure 4: Graph Representation of base scenario

4. Conclusion 

The consequences for an embodied system go beyond the issues with multimodality and compositionality. The models themselves do not seem capable of generating memories and choosing actions that take the well-being of the agent into account. There is no “fear of death”, no drive to learn other than simulated curiosity. This should impact how we approach embodiment and how we choose to represent stimuli. While we can assume that the model would benefit from further sensory data, we need to take into consideration that it does not possess a drive for self-preservation or a strong “desire” for a goal, things that are frequently exploited for intensive learning, as famously demonstrated by Pavlov in his respondent conditioning experiments.