Recently, when posed a seemingly straightforward question, “Is 0.9 larger than 0.11?”, nearly all large language models incorrectly concluded that 0.11 is greater, likely because they generalized from the pattern that 11 is greater than 9. Large language models perform many tasks extremely well, but they glaringly lack a basic human capability: the ability to reason and plan through complex tasks. How can we humans, and the new technologies we build, solve this problem for them?
At the heart of human intelligence lies rational thought and reasoning – often described as a set of “conscious processes that occur independently of sensory stimulation”. It is thinking and reasoning that enable us to connect ideas, interpret complex information, and navigate the intricate concepts of cause and effect, truth and lies, and right and wrong. It’s the engine behind our decision-making, allowing us to consciously adjust our goals, beliefs, and behaviors in response to the world around us.
In recent years, large language models (LLMs) have taken center stage in the world of artificial intelligence (AI), showcasing their ability to generate human-like text, answer questions, and assist in a wide range of tasks. In some sense, they are not just enhancing but even challenging traditional search engines, redefining how we access and interact with information.
While LLMs have made remarkable strides in mimicking human-like language, they fall short when it comes to the deeper cognitive processes that define human intelligence. These models generate content based on patterns learned from vast datasets, but they don’t truly understand the content or context in the same way humans do. This distinction is crucial as we explore the limitations of LLMs in reasoning and planning.
To explore the challenges LLMs face in reasoning and planning, we’re pleased to welcome back Dr. Mehdi Fatemi, Senior Researcher and Team Lead at Wand Research, Wand AI’s fundamental research group. Dr. Fatemi recently sat down with Sophia on our marketing team to discuss issues related to LLMs, including their tendency to hallucinate. In this session, he explains why LLMs struggle with these cognitive tasks and how enterprises can overcome these challenges to successfully deploy generative AI solutions in their operations.
Sophia: Welcome back, Dr. Fatemi. Thank you for joining us again to discuss another important challenge that enterprises face when integrating generative AI into their operations, specifically the limitations of LLMs in reasoning and planning.
Dr. Fatemi: Thank you, Sophia. Yes, that’s true. It’s important not to be misled by LLMs’ impressive capabilities without recognizing their critical limitations. Another significant drawback, besides hallucinations, is LLMs’ inability to plan effectively using dynamic reasoning and subtasking. This is particularly evident, and much worse, in smaller models.
This deficiency is not just a minor inconvenience—it significantly limits their utility in complex, real-world applications that demand dynamic decision-making and deep reasoning, tasks that humans handle intuitively.
Gap Between LLMs and Human Reasoning
Sophia: How does human reasoning differ from the way LLMs process information?
Dr. Fatemi: Human reasoning is a dynamic process that involves breaking down complex tasks into manageable subtasks, and drawing logical connections from premises to conclusions. This also requires recognition of intermediate steps and their logical connections. Most importantly, our brain has the cognitive ability to evaluate not only the situation but also thoughts at various stages of thinking—a capability that we often refer to as the internal critic.
Take the example of light bulbs. You are in a room with three light switches, each connected to one of three ordinary light bulbs in another room. You can’t see the light bulbs and you are allowed to go to the room only once. How can you figure out which switch controls which light bulb?
There is more than one correct solution here. For example, turn on switch 1 for a while, then turn it off. Next, turn on switch 2 and go into the room. The hot bulb that is off is controlled by switch 1. The bulb that is on is controlled by switch 2. The bulb that is off and cold is controlled by switch 3. Alternatively, you can start by turning two switches on for a while, then turn off one of them and go to the room, and so on.
Now think about how our brain figures out a solution. The core step here is to realize that you can’t solely rely on vision. This crucial step is the job of our internal critic. The next step is recognizing temperature as a distinguishing factor. Then we try various combinations to find a solution. Further, we may continue to see that there is more than one solution.
In contrast, LLMs operate very differently. They are designed to process vast amounts of data and generate responses based on the patterns they learned from that data. All existing LLMs function fundamentally the same way: they process text at the token level. A token is essentially a fragment of a word, serving as the basic unit of an LLM, much like letters of the alphabet do for humans. LLMs typically use between 50,000 and 200,000 tokens (known as the vocabulary size), predicting the most probable next token based on the input text and any other provided content. This new token is then appended to the current text, and the process repeats until an end-of-text token is produced or a predefined maximum length is exceeded. The generation of each token is based on patterns learned from trillions of tokens during the LLM’s training.
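To make that loop concrete, here is a minimal, schematic sketch of token-by-token generation. The function `predict_next_token_probs`, the constants, and the token ids are hypothetical stand-ins rather than the API of any real model; only the shape of the loop reflects how LLMs generate text.

```python
# Schematic sketch of autoregressive, token-level generation.
# `predict_next_token_probs` stands in for a trained LLM's forward pass
# and is purely hypothetical; the structure of the loop is what matters.
import random

VOCAB_SIZE = 50_000          # typical vocabularies range from roughly 50k to 200k tokens
EOS_ID = 0                   # assumed end-of-text token id
MAX_NEW_TOKENS = 128         # predefined maximum generation length


def predict_next_token_probs(token_ids):
    """Placeholder for the model: returns a probability for every token in the vocabulary."""
    weights = [random.random() for _ in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(prompt_ids):
    token_ids = list(prompt_ids)
    for _ in range(MAX_NEW_TOKENS):
        probs = predict_next_token_probs(token_ids)
        next_id = max(range(VOCAB_SIZE), key=probs.__getitem__)  # greedy: most probable next token
        token_ids.append(next_id)                                # append and feed the text back in
        if next_id == EOS_ID:                                    # stop at end-of-text
            break
    return token_ids
```

Each pass through this loop performs the same fixed computation; nothing in the structure lets the model step back, re-evaluate, or plan, which is exactly the “static” behavior discussed below.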
Returning to the light bulbs example: of course, there have been many such examples in the enormous training data of modern LLMs, and more often than not, they can generalize from that. It is akin to recalling a familiar problem’s solution rather than deriving it through reasoning. That’s why, if you ask a question like this of most current LLM-based agents, they will probably answer correctly, even if they have no real reasoning capability. However, if you use a smaller model for reasons such as cost, or if you have a more complex problem, it is very likely that you will get a wrong answer.
0.9 < 0.11? The Problem of Static Responses
Sophia: I’ve heard that LLMs are often described as static. Can you explain what that means and how it affects their ability to reason?
Dr. Fatemi: When we say LLMs are static, we’re highlighting a fundamental limitation in how they process information. As Yann LeCun, Turing Award winner and Meta’s chief AI scientist, points out, LLMs operate with a constant number of computational steps between input and output. This approach restricts their ability to represent or reason through complex situations effectively. Unlike humans, who can dynamically adjust their thought processes based on context, LLMs generate responses strictly based on patterns from their training data, without the flexibility to adapt or reason beyond that.
Sophia: It’s been noted that LLMs can sometimes struggle with simple numerical reasoning. Could you provide an example and explain why this happens?
Dr. Fatemi: Matthew Berman, a YouTuber who frequently tests various LLMs and their capabilities, posed this seemingly straightforward question: Is 0.9 larger than 0.11? Surprisingly, nearly all LLMs at the time incorrectly concluded that 0.11 is greater, likely because they generalized from the pattern that 11 is greater than 9. This type of error highlights a critical limitation of LLMs—they often struggle to reason about numerical values, relying on pattern matching rather than understanding the underlying mathematical logic.
Similarly, if an LLM is asked to determine whether a number is prime, it might mistakenly conclude that a number is not prime because it can be divided by another number, even if that division does not result in an integer.
These errors are akin to the kind of mistakes a small child might make when first learning about numbers—endearing in a child (and a teachable moment), but problematic in AI technology that is expected to draw accurate conclusions.
As LLMs become more advanced with better training data, such problems are likely to be alleviated to a limited extent. However, a true cognitive system needs to reliably recognize such errors on its own and arrive at correct solutions through reasoning.
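For contrast, the underlying checks that trip these models up are trivial once expressed as explicit logic rather than pattern matching. A minimal sketch of both examples above (the decimal comparison and the primality test), independent of any model:

```python
# Explicit numeric reasoning for the two examples discussed above.

def is_prime(n: int) -> bool:
    """A number n >= 2 is prime only if no integer in [2, sqrt(n)] divides it evenly."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:       # the division must leave no remainder to disqualify n
            return False
        d += 1
    return True


print(0.9 > 0.11)            # True: 0.90 > 0.11, despite "11" looking bigger than "9"
print(is_prime(29))          # True: 29 / 7 is not an integer, so 7 does not disqualify it
```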
How to Enable Sequential Reasoning?
Sophia: What approaches can be used to help LLMs improve their reasoning and decision-making abilities?
Dr. Fatemi: Similar to addressing other fundamental limitations of LLMs, such as their tendency to hallucinate, existing methods for achieving more robust reasoning-like capabilities primarily rely on fine-tuning or prompt engineering. However, these approaches are inherently flawed because they still depend on the LLM’s generative capabilities, which can lead to suboptimal results.
One promising approach to enhancing LLMs with reasoning capabilities is Reinforcement Learning (RL), the formal paradigm of learning-based sequential reasoning and decision-making. RL enables LLMs to learn from feedback, adjusting their responses based on rewards or penalties. This feedback loop allows LLMs to refine their understanding of logical relationships and causal connections, leading to improved outputs.
RL can be applied to LLMs in various ways, from enhancing their training from human feedback to augmenting and guiding their behavior. It provides critics, logic, persona, and the means for planning.
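To give a flavor of the feedback loop described above, the toy sketch below treats answer selection as a simple bandit problem: a reward signal gradually shifts probability toward the answer a critic judges correct. The candidate answers and the reward function are invented for illustration; real RL training of LLMs operates over token sequences with far more machinery, but the reward-driven adjustment is the same basic idea.

```python
# Toy reward-driven update: the "policy" is just a preference score per candidate answer.
import math
import random

candidates = ["0.11 is larger", "0.9 is larger"]
scores = [0.0, 0.0]                      # learned preferences (logits) over the candidates
LEARNING_RATE = 0.5


def reward(answer: str) -> float:
    """Hypothetical critic: +1 for the correct comparison, -1 otherwise."""
    return 1.0 if answer == "0.9 is larger" else -1.0


def softmax(xs):
    exps = [math.exp(x) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


for step in range(200):
    probs = softmax(scores)
    i = random.choices(range(len(candidates)), weights=probs)[0]   # sample an answer
    r = reward(candidates[i])
    # REINFORCE-style update: raise the log-probability of rewarded choices, lower penalized ones.
    for j in range(len(scores)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        scores[j] += LEARNING_RATE * r * grad

print(softmax(scores))   # probability mass shifts toward "0.9 is larger"
```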
Reinforcement Learning from Human Feedback (RLHF): A First Step
Sophia: Can you elaborate on how Reinforcement Learning (RL) can be applied to LLMs?
Dr. Fatemi: One notable approach to integrating RL with LLMs is Reinforcement Learning from Human Feedback (RLHF). This method involves training LLMs on human-annotated data, where human evaluators provide feedback on the model’s responses. RLHF enables LLMs to learn from human preferences, aligning their outputs with human values.
However, while RLHF is a valuable approach to integrating RL with LLMs, it serves as more of a first step rather than a direct path to reasoning capabilities. This method mainly focuses on aligning LLM outputs with human values and preferences, rather than explicitly providing reasoning skills.
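Concretely, RLHF typically begins by training a reward model on pairs of responses that human evaluators have ranked, using a pairwise (Bradley-Terry) loss that pushes the preferred response to score higher than the rejected one. A minimal sketch of that loss with stand-in scores (not the output of any real model):

```python
# Pairwise preference loss commonly used to train an RLHF reward model:
#   loss = -log(sigmoid(r_chosen - r_rejected))
import math


def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Small loss when the chosen response scores well above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))


# Stand-in reward scores for two hypothetical annotated pairs.
print(preference_loss(r_chosen=2.1, r_rejected=-0.3))  # ~0.087: model agrees with the human ranking
print(preference_loss(r_chosen=-0.5, r_rejected=1.2))  # ~1.87: model disagrees, so the loss is large
```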
Sophia: Recently, OpenAI introduced their new models. How do these new models compare to previous versions in terms of reasoning capabilities?
Dr. Fatemi: With the introduction of OpenAI’s new o1 family of models, we are seeing some advancement in AI reasoning capabilities beyond token completion. These models are designed to spend more time thinking through problems before they respond, much like a person would. Although the exact machinery is unknown, they likely employ multi-step answer generation with feedback loops and additional models to evaluate answers and intermediate steps. They represent an intriguing advance; however, reports still highlight limitations in genuine reasoning. Notably, over-thinking persists: for simple queries such as calculating the product of two numbers, the model can generate lengthy, nonsensical justifications instead of suggesting a straightforward calculator solution.
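Because the exact machinery is not public, the sketch below is only a guess at the general shape Dr. Fatemi describes: draft several candidate answers, score each with a separate evaluator, and keep the best. Every function in it (`draft_answer`, `evaluate_answer`) is hypothetical and stands in for a model call.

```python
# Speculative sketch of multi-step generation with an evaluation loop.
import random


def draft_answer(question: str, attempt: int) -> str:
    """Hypothetical generator: stands in for sampling a candidate answer from an LLM."""
    return f"candidate answer #{attempt} to: {question}"


def evaluate_answer(question: str, answer: str) -> float:
    """Hypothetical critic model: returns a score in [0, 1] for the candidate."""
    return random.random()


def answer_with_reflection(question: str, num_attempts: int = 8) -> str:
    best_answer, best_score = None, float("-inf")
    for attempt in range(num_attempts):
        candidate = draft_answer(question, attempt)
        score = evaluate_answer(question, candidate)   # feedback loop: score each candidate step
        if score > best_score:
            best_answer, best_score = candidate, score
    return best_answer


print(answer_with_reflection("Is 0.9 larger than 0.11?"))
```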
Path Forward for AI
Sophia: Final question, Dr. Fatemi. How does your work at Wand Research help address the challenges LLMs face with reasoning and planning?
Dr. Fatemi: The current wave of AI, driven by large language models, has achieved remarkable success in many areas. However, the limitations of LLMs in reasoning and planning are becoming increasingly evident, especially as we aim to apply AI to more complex and dynamic tasks. To move forward, we must develop new technologies that go beyond pattern-matching and incorporate more complex cognitive processes, especially when using smaller models with limited computation.
This requires a fundamental shift in how we design and build AI technologies. Rather than scaling up existing models, we need to create AI that can reason, plan, and collaborate—AI that can think not just at the level of words and patterns, but at the level of concepts and actions, with a forward-looking perspective. By doing so, we can unlock the full potential of AI and create technologies that truly enhance our ability to solve the complex challenges of the future.
As I explained in the last fireside chat, human cognition does not arise from a single monolithic input-output entity. Instead, it emerges from the interplay of diverse brain regions, each contributing unique capabilities. At Wand Research, our core focus is on developing modular components, such as critics, and integrating them into an AI agent’s architecture in a way that mirrors the human brain. Both steps present open research challenges and require novel inventions and innovations. By incorporating these components, we aim to unlock cognitive abilities that go beyond what current LLMs can achieve, which rely heavily on prompting and offer no guarantee of success.
Summary
It’s clear that while LLMs have brought us significant advancements, their limitations in reasoning and planning highlight the need for a new direction in AI development. As we look ahead, the path forward will require not just bigger models, but smarter ones—AI that can think, plan, and collaborate on a deeper level.
At Wand AI, we’re committed to pioneering these new technologies. By integrating advanced techniques like reinforcement learning and focusing on creating AI that mirrors human-like reasoning, we’re pushing the boundaries of what technology can achieve. Our goal is to develop new AI technologies that not only enhance decision-making and problem-solving but also empower businesses to tackle the complex challenges of the future.
Our next blog article will dive into another critical issue—the Compounding Error Effect of LLMs. We’ll explore how small inaccuracies in AI-generated responses can escalate, leading to significant errors over time, and what strategies can be employed to mitigate this effect. Stay tuned for an in-depth look at how we can make LLMs more reliable and robust.
The journey to achieving true cognitive AI is complex, but with innovation and dedication, we’re on the brink of unlocking even greater possibilities for enterprises.