From Prolixity to Concise Reasoning: The Paradox of Reasoning Length in LLMs

TL;DR: As language models improve at reasoning, they also tend to become more verbose. But longer responses don’t always mean better ones—in fact, there is a strong correlation between accuracy and conciseness. Moreover, long responses increase both latency and cost.
We revisit this issue and trace the root cause to the reinforcement learning process used during training. A simple two-phase adjustment to RL post-training helps models stay sharp, delivering accurate, concise answers instead of unnecessarily long ones.
Authors: Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, Kartik Talamadupula
The rise of powerful reasoning-focused large language models (LLMs) has introduced an unexpected trend: their responses are getting longer and longer. At first glance, this might seem like progress. After all, more explanation must mean better reasoning, right? But an inconvenient truth lies beneath this verbosity: many of the correct responses are actually short.
This post explores why RL-trained models tend to become increasingly wordy, and why that's not always a good thing. We argue that the problem isn't the model's architecture or reasoning ability, but something more fundamental: how reinforcement learning (RL) works under the hood. Crucially, our findings reveal a strong and consistent correlation between conciseness and accuracy, a relationship that challenges prevailing assumptions in model training.
The Infamous Aha Moment
During RL post-training, many models hit an "aha moment" where they start to self-correct; they identify mistakes, restart solutions, and elaborate more. This behavior is often praised as a sign of advanced reasoning. But it may also mark the beginning of exponential growth in response length.
At the same time, our empirical findings show something surprising: as accuracy improves, responses often become shorter. This strong correlation between conciseness and correctness appears not only in our experiments but across various models and benchmarks. It suggests that verbosity is not a prerequisite for reasoning. In fact, it may be a symptom of uncertainty or error.
Why, then, do models grow verbose after the "aha" moment? To answer that, we need to look at what happens during RL training.
What Reinforcement Learning Actually Optimizes
The post-training of LLMs using RL (typically PPO or GRPO) is not magic. It simply adjusts the model's behavior to maximize expected return, which in this case is nothing but the terminal reward. Unlike supervised learning, RL doesn’t rely on mimicking examples. Instead, it interacts with the environment (in this case, the space of possible responses) to find responses that receive higher scores.
But here’s the catch: RL doesn’t care about clarity or brevity. It simply optimizes its loss function, which is designed to drive the model toward maximizing terminal reward. However, that loss landscape can be inadvertently influenced by factors like response length.
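For reference, the surrogate loss in question is the standard PPO clipped objective (shown here without the optional KL or entropy terms). One common convention, adopted below for simplicity, averages it over the tokens of each sampled response; the exact normalization varies across implementations, but it is through terms like this that response length can enter the loss:

$$
\mathcal{L}(\theta) \;=\; -\,\frac{1}{T}\sum_{t=1}^{T}\min\!\Big(\rho_t(\theta)\,\hat{A}_t,\;\operatorname{clip}\big(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big),
\qquad
\rho_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)},
$$

where $T$ is the response length, $\hat{A}_t$ is the advantage estimate at token $t$, and $\epsilon$ is the clipping threshold.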
When we plot response length against the loss, we find a surprising alignment: as responses get longer, the loss consistently decreases. This is particularly striking because the entire run uses a fixed negative reward: we deliberately selected extremely hard problems, none of which is ever solved during training. Yet the loss continues to decline as the responses steadily grow longer.
This confirms the link between response length and loss, but it also demonstrates a more important point. In the absence of consistent positive rewards, RL has an inherent incentive to produce longer responses. As we will show analytically, this arises directly from how the loss is structured: when the final reward is negative, increasing the number of tokens reduces the average penalty. In effect, the model learns to extend its responses not to increase correctness, but merely to dilute the impact of negative rewards. This behavior is not the result of higher-level reasoning or deliberate strategy. It is simply the consequence of RL minimizing its loss.
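To make the dilution effect concrete, here is a toy numeric sketch (not the paper's training code). It computes the per-token average of the GAE advantages for a single response whose only reward is a terminal −1, assuming a zero value baseline, γ = 1, and λ = 0.95:

```python
# Toy sketch: average per-token GAE advantage for a trajectory whose ONLY
# reward is a terminal r = -1, assuming a zero value baseline, gamma = 1,
# and lambda < 1.

def mean_advantage(length, reward=-1.0, gamma=1.0, lam=0.95):
    # delta_t = 0 everywhere except the last token, where delta = reward,
    # so GAE gives A_t = (gamma * lam) ** (T - 1 - t) * reward.
    advantages = [(gamma * lam) ** (length - 1 - t) * reward for t in range(length)]
    return sum(advantages) / length  # per-token average, as in length-normalized losses

for T in [100, 500, 1000, 4000]:
    print(f"T={T:5d}  mean advantage = {mean_advantage(T):+.5f}")

# The average penalty per token shrinks as the response grows, so a longer
# response lowers the (length-normalized) loss even though the terminal
# reward is still -1.
```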
We should note that this phenomenon is fundamentally different from reward hacking. Reward hacking occurs when an agent finds unintended shortcuts to maximize reward, exploiting flaws or loopholes in the environment rather than genuinely solving the task. In contrast, what we described above isn't the result of manipulating poorly designed or hidden rewards. Instead, it emerges directly from how the loss is computed from the reward.
Formalizing the Response Length Bias in PPO
Our paper offers a mathematical analysis of how PPO (with Generalized Advantage Estimation) inherently favors longer responses when terminal rewards are negative.
The reason? In PPO with λ < 1, we show that the terminal reward is effectively weighted inversely with the response length: it enters the loss through the advantage, which appears there with a negative sign and is scaled by other terms. Consequently, negative terminal rewards encourage longer responses, while positive terminal rewards push for shorter ones, all in the service of minimizing the loss. In other words, PPO with λ < 1 inadvertently incentivizes verbosity in failure cases and conciseness in success.
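A simplified version of that weighting can be sketched as follows (assuming a single terminal reward $r$, a zero value baseline, and $\gamma = 1$; the paper's full analysis does not rely on these simplifications). For a response of length $T$, the GAE advantage at token $t$ and its per-token average are

$$
\hat{A}_t=\sum_{l=0}^{T-1-t}(\gamma\lambda)^{l}\,\delta_{t+l}=\lambda^{\,T-1-t}\,r,
\qquad
\frac{1}{T}\sum_{t=0}^{T-1}\hat{A}_t=\frac{r\,(1-\lambda^{T})}{T\,(1-\lambda)}\;\approx\;\frac{r}{T\,(1-\lambda)} .
$$

With $\lambda<1$, the terminal reward's average contribution decays roughly as $1/T$: a negative $r$ is diluted by longer responses, while a positive $r$ is best exploited by shorter ones. Under this simplification, setting $\lambda = 1$ removes this particular decay.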
In practice, RL training data is large and often contains many hard problems, even if not all are unsolvable. As long as some portion of the data consistently yields negative rewards, the response-lengthening bias can dominate, either early in training or once the easier problems are consistently solved. Combined with LLMs’ core strength in generating coherent and logically structured text, this push toward verbosity may contribute to the "aha moment", a shift where responses become more elaborate, though not necessarily more accurate.
It's also important to note that longer responses do not necessarily keep earning negative rewards. In fact, the model may occasionally arrive at a correct answer simply by continuing its chain of thought. When this happens, RL reinforces the correct outcome, regardless of how many tokens were used to reach it. This explains why accuracy can still improve even as responses grow longer. The key point is the direction of causality: verbosity is a consequence of RL minimizing its loss in the face of negative rewards; that verbosity may occasionally lead to improved accuracy, which is then reinforced, not the other way around.
Each Problem Is Its Own MDP
Another key insight is that every single prompt defines its own Markov decision process (MDP). This reframing unlocks a powerful implication: you don’t need massive datasets to improve reasoning through RL. Even training on a handful of examples can yield meaningful generalization if done correctly.
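Concretely, for a fixed prompt $q$, one standard way to write this down (details vary across setups) treats states as the prompt plus the tokens generated so far, actions as vocabulary tokens, transitions as deterministic concatenation, and the reward as terminal:

$$
s_0 = q,\qquad a_t \in \mathcal{V},\qquad s_{t+1} = (s_t, a_t),\qquad
r_t = \begin{cases} R(\text{final answer}) & \text{if } a_t = \texttt{<eos>} \\ 0 & \text{otherwise.} \end{cases}
$$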
Here, we consider four sets of training examples, each containing 4 problems from the AIME 2024 dataset. To estimate the difficulty of the problems, we evaluated the base model using 64 samples per problem at temperature 0.6. The average solve probabilities for the four sets, from easiest to hardest, were 0.375, 0.25, 0.125, and 0.0625.
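As a rough sketch of how such solve rates can be estimated (the sampler and checker below are stand-ins, not the actual evaluation harness):

```python
import random

def estimate_solve_rate(sample_response, check_answer, n_samples=64):
    """Empirical solve rate: the fraction of sampled responses judged correct.

    sample_response: () -> str, draws one response (e.g. at temperature 0.6)
    check_answer:    str -> bool, verifies the final answer
    """
    solved = sum(check_answer(sample_response()) for _ in range(n_samples))
    return solved / n_samples

# Toy usage with a stand-in sampler that "solves" the problem 1/8 of the time:
rate = estimate_solve_rate(
    sample_response=lambda: "42" if random.random() < 0.125 else "wrong",
    check_answer=lambda answer: answer.strip() == "42",
)
print(f"estimated solve rate: {rate:.3f}")
```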
Across all problem sets, improvements in accuracy coincided with reductions in response length. Moreover, response length decreased more quickly for the easier problem sets. Finally, for the hardest set, response length increased, since those problems were rarely solved.
A Simple Fix: Two-Phase RL
To address the verbosity issue, we introduce a two-phase training strategy:
- In the first phase, the model is trained on challenging problems. This phase aims to enhance the model's problem-solving capacity, with an expected increase in response length as PPO mostly encounters negative rewards and drives the model toward longer responses. Notably, this first phase corresponds to the RL training that off-the-shelf reasoning models have already undergone.
- In the second phase, training continues on occasionally solvable problems. This phase enforces conciseness while preserving or even enhancing accuracy. Notably, as we will see, it also substantially improves the model's robustness to lowering the temperature, ensuring remarkable performance even with limited sampling.
This second phase is essential. Without it, the model may remain stuck in the verbosity trap, and learning conciseness through large-scale training alone can be extremely slow, inefficient, and costly. Our approach provides a shortcut: once the model has learned how to solve problems, explicitly reinforcing correct and concise behavior quickly teaches it that brevity and correctness can go hand in hand.
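Schematically, the schedule looks like the following sketch (a stub, not the paper's training code; `ppo_update` is a placeholder for one PPO/GRPO step, and the step counts are left as parameters):

```python
# Schematic two-phase schedule. `ppo_update` is a placeholder for one RL step
# (sample responses, score them with the terminal reward, apply PPO/GRPO).

def ppo_update(model, problems):
    """Placeholder: one PPO/GRPO update on a batch drawn from `problems`."""
    return model  # stub

def two_phase_training(model, hard_problems, solvable_problems,
                       phase1_steps, phase2_steps):
    # Phase 1: hard problems build problem-solving capacity; with mostly
    # negative rewards, response length is expected to grow.
    for _ in range(phase1_steps):
        model = ppo_update(model, hard_problems)

    # Phase 2: occasionally solvable problems yield positive rewards often
    # enough that the length bias reverses, enforcing conciseness while
    # preserving (or improving) accuracy. As reported below, this phase
    # needs fewer than 60 steps in our runs.
    for _ in range(phase2_steps):
        model = ppo_update(model, solvable_problems)
    return model
```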
Does It Actually Work? Yes, and Surprisingly Well
We tested our approach on multiple benchmarks, including math-heavy datasets (AIME, AMC, MATH500) and broader STEM domains (MMLU-STEM). Here, we train the base model on only 8 problems selected from the level-5 subset of the MATH training set.
Results: we observe reductions of more than 50% in response length in the best case, with accuracy preserved and even improved.
We also see significant robustness to low-temperature decoding.
The MDP view is valid in phase 1 as well: we tested phase 1 on very small data, only 4 problems from the level-5 subset of the MATH dataset, and, surprisingly, the results show a substantial improvement in accuracy.
Final Thoughts: Rethinking What "Good Reasoning" Looks Like
The presented results aren’t minor tweaks. They fundamentally reshape the cost and interpretability profile of reasoning models.
To put this into perspective, consider the cost efficiency of our method. Running an Nvidia H100 GPU costs roughly $2 per hour. The second-stage post-training of R1 models requires fewer than 60 training steps and takes approximately 19 GPU hours for the 1.5B model and 41 GPU hours for the 7B model. Meanwhile, generating a single 10,000-token response takes about 115 seconds on one H100 GPU. Given our observed 40% and 54% reductions in response length for the 7B and 1.5B models, respectively, each query on the new models saves roughly 3 cents in compute cost.
Aside from the significant reduction in inference cost, this also means the entire second-stage training cost can be offset after just 1,100 queries for the 1.5B model and 3,200 queries for the 7B model, making this strategy not just effective but highly economical.
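Spelling that arithmetic out (a back-of-the-envelope check using only the figures quoted above, and assuming inference cost scales with the number of generated tokens):

```python
# Break-even check using the numbers above: $2/hour per H100, ~115 s per
# 10,000-token response, 54% / 40% length reductions, 19 / 41 training GPU-hours.
H100_DOLLARS_PER_HOUR = 2.00
SECONDS_PER_RESPONSE = 115  # ~10,000-token response on one H100

baseline_query_cost = H100_DOLLARS_PER_HOUR * SECONDS_PER_RESPONSE / 3600  # ~$0.064

for name, length_reduction, train_gpu_hours in [("1.5B", 0.54, 19), ("7B", 0.40, 41)]:
    saving_per_query = baseline_query_cost * length_reduction
    training_cost = train_gpu_hours * H100_DOLLARS_PER_HOUR
    breakeven_queries = training_cost / saving_per_query
    print(f"{name}: saves ${saving_per_query:.3f}/query, "
          f"training costs ${training_cost:.0f}, "
          f"break-even after ~{breakeven_queries:,.0f} queries")
```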
Beyond metrics and cost, it is worth remembering that there is an elegance in brevity. In human reasoning, concise explanations are often the most compelling. Our work shows that this principle can also apply to LLMs, if we train them the right way. To get the most out of RL, we need to understand its hidden incentives. In reasoning tasks, we must align its optimization objective with our true goal: favoring precision over verbosity. When properly structured, RL can help models become not only more accurate but also more concise, ultimately reasoning better by saying less.
Here, we summarize five lessons for practitioners and researchers:
- Longer isn’t always better: In RL training, verbosity may reflect failure, not sophistication.
- λ < 1 matters: Setting λ properly in PPO is essential to prevent instability and encourage conciseness.
- Favor solvable tasks: If your training data has too many impossible problems, models learn to ramble.
- Train in phases: Separate reasoning acquisition from conciseness enforcement.
- Small data can be powerful: You don’t need millions of examples. Sometimes even 4 is enough.