October 2, 2024
Consciousness, Reasoning and LLMs playing tic-tac-toe: chain-of-thought experiment
Reasoning in LLMs tested via tic-tac-toe, linked to consciousness.
While tic-tac-toe is a solved game and relatively simple for humans, it poses significant challenges for LLMs. The direct predictions from their world models are not enough; they often struggle even to suggest valid moves. This makes it an ideal opportunity to test the augmentation of LLMs with reasoning, which, in turn, is a cognitive process at the heart of consciousness itself.
Reasoning and consciousness
One of the most profound observations in psychology is the duality between the conscious and unconscious mind. This divide not only shapes our subjective experience but also reflects the distinct mechanisms that govern our cognition and their specific functions. Familiar, well-learned situations are typically handled unconsciously, relying on the predictive power of robust mental models. In contrast, unfamiliar situations capture our conscious attention, requiring a different type of processing - roughly what we call reasoning. As the renowned psychologist and philosopher William James aptly stated, reasoning is "best suited to help us out of unprecedented situations - situations for which all our common associative wisdom... leaves us without resource" [1].
The assumption here is that reasoning is a more tangible object of study than consciousness, while still being central to it. In AI, reasoning has been extensively studied within the symbolic approach, where it is broadly understood as the iterative manipulation of symbols in a goal-directed manner. However, in humans, this is only part of the picture - reasoning is most powerful when coupled with a robust world model. This is what makes large language models (LLMs) particularly special for our study. They possess world models, and there are already techniques that try to instill reasoning in them, the main one being the chain-of-thought technique, where LLMs are instructed to use their context window to reason about a problem before arriving at a final answer. It is quite simplistic, but it is already a step in the right direction.
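To make the pattern concrete, here is a minimal sketch of a chain-of-thought prompt template in Python. The wording is ours and purely illustrative; it is not the prompt used in the experiment below.

```python
# A minimal chain-of-thought prompt template (illustrative wording; not the
# exact prompt used in the experiment described below).
COT_TEMPLATE = """{problem}

Before giving your final answer, reason about the problem step by step
in plain text. Then, on a new line, write "Final answer:" followed by
your answer."""

def build_cot_prompt(problem: str) -> str:
    """Wrap a problem statement so the model reasons in its context window first."""
    return COT_TEMPLATE.format(problem=problem)
```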
Tic-tac-toe experiment
To test this, we ran nine different experimental setups in which LLMs played through a suite of 100 critical tic-tac-toe moves [2]. In the baseline setup (setup 1), they struggled even to reliably suggest valid moves. These are precisely the kind of unfamiliar situations that are fitting for testing reasoning, where the direct predictions from the world model are not enough.
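As a rough sketch of what a baseline setup like setup 1 could look like (the model name, prompt wording, and board encoding here are our own illustrative assumptions, not the exact ones used in the experiment), the model is shown the board, asked directly for a move, and its answer is checked for validity:

```python
from openai import OpenAI

client = OpenAI()

def render_board(board: list[str]) -> str:
    """Render a 9-cell board (cells 'X', 'O', or ' ') as a 3x3 grid."""
    rows = [" | ".join(board[i:i + 3]) for i in range(0, 9, 3)]
    return "\n---------\n".join(rows)

def baseline_move(board: list[str], player: str = "X") -> int | None:
    """Setup 1 (sketch): ask directly for a move, with no reasoning requested."""
    prompt = (
        f"You are playing tic-tac-toe as {player}. The board is:\n"
        f"{render_board(board)}\n"
        "Cells are numbered 0-8, left to right, top to bottom.\n"
        "Reply with the number of the empty cell you play, and nothing else."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the post compares pre- and post-August GPT-4o versions
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content.strip()
    # A move is valid only if it names an empty cell; this is exactly where
    # the baseline often fails.
    if answer.isdigit() and 0 <= int(answer) <= 8 and board[int(answer)] == " ":
        return int(answer)
    return None
```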
The simplest form of chain-of-thought is just adding the instruction "Let's think step-by-step," as seen in setup 2. The specific reasoning steps followed are left to the LLM to decide. While this approach already leads to a substantial improvement, it lacks systematic structure and doesn't leverage domain-specific concepts.
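In code, moving from the baseline to something like setup 2 is just a change of prompt; the reasoning itself is left entirely to the model. A sketch, reusing render_board from the previous snippet (the wording is again illustrative):

```python
def unstructured_cot_prompt(board: list[str], player: str = "X") -> str:
    """Setup 2 (sketch): the baseline request plus an open-ended instruction
    to reason before answering. Reuses render_board from the sketch above."""
    return (
        f"You are playing tic-tac-toe as {player}. The board is:\n"
        f"{render_board(board)}\n"
        "Cells are numbered 0-8, left to right, top to bottom.\n"
        "Let's think step-by-step about the position, then give your move "
        "on the last line as 'Move: <cell number>'."
    )
```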
To address this, in setups 3 to 6, we guided the LLM using Newell and Simon's set of heuristics for optimal tic-tac-toe gameplay [3]. We also varied the output formats, which impose different constraints on how the model generates responses. Setups 3, 4, and 5 use JSON mode. Setup 4 performs worse than setup 3 because it imposes excessive constraints by specifying individual steps, which interferes with the LLM's natural language generation. In setup 5, each reasoning step is handled by a dedicated LLM call, allowing more focused reasoning on each step. This led to the best performance with the version of GPT-4o available before August.
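Below is a sketch of how a single-call JSON-mode setup (like setup 3) and the per-step variant (like setup 5) might be wired up. The heuristic list is a paraphrase of Newell and Simon's ordering [3], and the prompts and model name are illustrative assumptions rather than the exact ones used:

```python
import json
from openai import OpenAI

client = OpenAI()

# Paraphrase of Newell and Simon's ordering for perfect play [3]; the exact
# wording used in the experiment may differ.
HEURISTICS = [
    "1. Win: complete three in a row if you can.",
    "2. Block: if the opponent can win next move, block it.",
    "3. Fork: create a position with two ways to win.",
    "4. Block the opponent's fork.",
    "5. Play the center.",
    "6. Play the corner opposite the opponent's corner.",
    "7. Play an empty corner.",
    "8. Play an empty side.",
]

def guided_json_move(board_text: str) -> dict:
    """Setup 3-style sketch: one call, heuristics in the prompt, JSON mode."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # JSON mode
        messages=[{
            "role": "user",
            "content": (
                f"Board:\n{board_text}\n"
                "Apply these rules in order and pick a move:\n"
                + "\n".join(HEURISTICS)
                + '\nReply as JSON: {"reasoning": "...", "move": <cell 0-8>}'
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)

def guided_stepwise_move(board_text: str) -> dict:
    """Setup 5-style sketch: one focused call per heuristic step."""
    notes = []
    for rule in HEURISTICS:
        step = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Board:\n{board_text}\nDoes this rule apply, and where? {rule}",
            }],
        )
        notes.append(step.choices[0].message.content)
    # A final call combines the per-step notes into a move (JSON mode again).
    final = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                f"Board:\n{board_text}\nStep-by-step notes:\n" + "\n".join(notes)
                + '\nReply as JSON: {"move": <cell 0-8>}'
            ),
        }],
    )
    return json.loads(final.choices[0].message.content)
```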
In August, OpenAI introduced structured outputs [4], training the new GPT-4o to better manage output constraints while maintaining natural language fluency. As a result, in setup 6, it achieved the same high performance as setup 5 but with just a single LLM call.
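A minimal sketch of a setup 6-style single call using structured outputs via the openai Python SDK's parse helper; the schema fields here are our own illustrative choice, not the exact schema used in the experiment:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class TicTacToeAnalysis(BaseModel):
    """Illustrative schema: one free-text field per heuristic check, then the move."""
    winning_move_check: str
    blocking_move_check: str
    fork_check: str
    chosen_move: int  # cell index 0-8

def structured_move(board_text: str) -> TicTacToeAnalysis:
    """Setup 6-style sketch: a single call constrained by a structured-output schema."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # the GPT-4o version announced with structured outputs [4]
        messages=[{
            "role": "user",
            "content": f"Board:\n{board_text}\nAnalyze the position and choose a move.",
        }],
        response_format=TicTacToeAnalysis,
    )
    return completion.choices[0].message.parsed
```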
It all changed again when OpenAI released o1 [5], the latest model available at the time of this writing. This model was specifically trained to build chain-of-thought reasoning more effectively. In setups 7 and 8, we see that, with minimal prompting, it already outperforms the best results from previous setups.
Finally, in setup 9, we achieve the highest score of all. By instructing o1 with the same perfect game heuristics, we observe a substantial improvement of 5 points. Our conclusion - potentially extendable to other tasks - is that while o1 autonomously generates its own chain-of-thought, it still benefits from hints about optimal reasoning strategies, such as key concepts in tic-tac-toe like threats, forks, and the overall decision-making order.
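A sketch of setup 9 under the same assumptions as above: o1 builds its chain of thought internally, so the prompt only passes the heuristics along as hints (omit them to approximate setups 7 and 8). The model name and wording are illustrative:

```python
def o1_guided_move(board_text: str) -> str:
    """Setups 7-9 sketch: o1 reasons internally; setup 9 adds the heuristics as hints."""
    response = client.chat.completions.create(
        model="o1-preview",  # illustrative name for the o1 model available at the time [5]
        messages=[{
            "role": "user",
            "content": (
                f"You are playing tic-tac-toe. Board:\n{board_text}\n"
                "Hints on how to reason about the position:\n"
                + "\n".join(HEURISTICS)  # from the earlier sketch; drop for setups 7-8
                + "\nState your move as 'Move: <cell 0-8>'."
            ),
        }],
    )
    return response.choices[0].message.content
```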
References
- [1] James, W. (1890). The Principles of Psychology. Page 957.
- [2] ADDENDUM: From all reachable board states in tic-tac-toe, we selected a sample of the 100 most "critical" states. These states were determined to have the highest potential for a final score swing, as calculated by the minimax algorithm. In each board state, available moves can be assigned scores of 1 (win), 0 (draw), or -1 (loss), representing the outcome they would lead to under optimal play (as in minimax). A sketch of this scoring is given after this list. For more details, check here.
- [3] Wikipedia: Tic-tac-toe, Strategy. https://en.wikipedia.org/wiki/Tic-tac-toe#Strategy
- [4] OpenAI: Introducing Structured Outputs in the API. https://openai.com/index/introducing-structured-outputs-in-the-api/
- [5] OpenAI: Learning to Reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/
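As a sketch of the move scoring described in the addendum to [2], here is a plain minimax (negamax) implementation. It is our own illustration, not necessarily the code used to build the suite, and the score_swing helper is just one plausible reading of the "score swing" criterion:

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board: list[str]) -> str | None:
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def negamax(board: list[str], player: str) -> int:
    """Value of the position for `player` (the side to move): 1 win, 0 draw, -1 loss."""
    opponent = "O" if player == "X" else "X"
    if winner(board) == opponent:
        return -1  # the opponent's last move already won the game
    empties = [i for i, cell in enumerate(board) if cell == " "]
    if not empties:
        return 0  # board full with no winner: draw
    best = -1
    for i in empties:
        board[i] = player
        best = max(best, -negamax(board, opponent))
        board[i] = " "
    return best

def move_scores(board: list[str], player: str) -> dict[int, int]:
    """Score each legal move with the outcome it leads to under optimal play:
    1 (win), 0 (draw), -1 (loss), as described in the addendum."""
    opponent = "O" if player == "X" else "X"
    scores = {}
    for i, cell in enumerate(board):
        if cell == " ":
            board[i] = player
            scores[i] = -negamax(board, opponent)
            board[i] = " "
    return scores

def score_swing(board: list[str], player: str) -> int:
    """One plausible reading of 'score swing': best minus worst available outcome."""
    scores = move_scores(board, player)
    return max(scores.values()) - min(scores.values())
```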