Why We Think? - Freezing

This article is a recent blog post by Lilian Weng, with many points I agree with and have been inspired by.
Original link: https://lilianweng.github.io/posts/2025-05-01-thinking/

Table of Contents#

Motivating Thought
- Analogy with Psychology
- Computation as Resources
- Latent Variable Modeling
Token-Based Thinking
- Branching and Editing
  - Parallel Sampling
  - Sequential Revision
- Reinforcement Learning to Improve Reasoning
- Use of External Tools
- Faithful Thinking
  - Does the model faithfully express its thoughts?
  - The impact of optimization pressure on CoT: good or bad?
Thinking in Continuous Space
- Recurrent Architecture
- Thinking Tokens
Thinking as Latent Variables
- Expectation Maximization
- Iterative Learning
The Law of Expanding Thinking Time
Future Outlook
Citations
References

Motivating Thought#

We can motivate models to think longer in several different ways.

Analogy with Psychology#

The core idea of model thinking is closely related to human thinking. We humans cannot immediately provide the answer to "What's 12345 times 56789?". Instead, it is natural to take time to think and analyze before arriving at a result, especially for complex problems. In "Thinking, Fast and Slow" (Kahneman, 2013), Daniel Kahneman divides human thinking into two modes through the lens of dual-process theory:

Fast thinking (System 1) operates quickly and automatically, driven by intuition and emotion, requiring almost no effort.
Slow thinking (System 2) requires deliberate logical reasoning and significant cognitive effort. This mode of thinking consumes more mental energy and requires conscious engagement.

Because System 1 thinking is both fast and simple, it often ultimately becomes the primary driver of decision-making, sacrificing accuracy and logic. It relies on mental shortcuts (heuristics) in our brains and may lead to errors and biases. By consciously slowing down and taking more time to reflect, refine, and analyze, we can engage in System 2 thinking, challenge our intuitions, and make more rational choices.

Computation as Resources#

One perspective in deep learning is that neural networks can be characterized by the amount of computation (such as matrix multiplications and activations) and storage (such as model weights, biases, and intermediate activations) they can access during a forward pass. If we optimize them to solve problems using gradient descent, the optimization process will figure out how to use these resources: the networks learn to organize them into circuits for computation and information storage. From this perspective, if we design an architecture or system that can perform more computation at test time and train it to effectively utilize these resources, it will perform better.

In Transformer models, the amount of computation (flops) the model performs for each generated token is about twice the number of parameters, since each parameter takes part in roughly one multiply and one add in the forward pass. For sparse models like mixtures of experts (MoE), only a fraction of the parameters is used in each forward pass, so computation ≈ 2 * parameters * sparsity, where sparsity is the proportion of active experts.
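As a back-of-the-envelope sketch of this estimate, the snippet below computes approximate forward-pass flops per token and for a whole generation. The function names, the `active_fraction` parameter, and the model sizes are illustrative assumptions, and the MoE case assumes compute scales with the fraction of parameters that is active per token.

```python
def flops_per_token(total_params: float, active_fraction: float = 1.0) -> float:
    """Rough forward-pass FLOPs per generated token (about 2 per active parameter)."""
    if not 0.0 < active_fraction <= 1.0:
        raise ValueError("active_fraction must be in (0, 1]")
    return 2.0 * total_params * active_fraction


def flops_for_generation(total_params: float, num_tokens: int,
                         active_fraction: float = 1.0) -> float:
    """Total FLOPs when the model emits num_tokens tokens (CoT plus answer)."""
    return num_tokens * flops_per_token(total_params, active_fraction)


dense = flops_per_token(7e9)                        # hypothetical 7B dense model
moe = flops_per_token(40e9, active_fraction=0.25)   # hypothetical MoE, 25% active
short_cot = flops_for_generation(7e9, num_tokens=10)
long_cot = flops_for_generation(7e9, num_tokens=1000)
```

A longer chain of thought spends proportionally more compute on the same final answer, which is the sense in which CoT lets the model scale computation with problem difficulty.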

On the other hand, CoT allows the model to perform more flops of computation for each token of the answer it is trying to compute. In fact, CoT has a nice property that allows the model to adjust the amount of computation based on the difficulty of the question.

Latent Variable Modeling#

A classic idea in machine learning is to define a probabilistic model with latent (hidden) variables $z$ and visible variables $y$, where $y$ is given to our learning algorithm. Marginalizing (summing) over the possible values of the latent variables allows us to express rich distributions over the visible variables:
$P(y) = \sum_{z \sim P(z)} P(y | z)$
For example, we can model the distribution over math problems and solutions by letting $x$ represent the problem statement, $y$ the ground truth answer or proof, and $z$ the free-form thought process leading to the proof. The marginal probability distribution to optimize is:
$P(y | x) = \sum_{z \sim P(z|x)} P(y | x, z)$
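To make the marginalization concrete, here is a minimal sketch with made-up numbers: three candidate thought processes $z$ for one problem $x$, each leading to an answer $y$ with some probability. All probabilities and names are hypothetical.

```python
# Hypothetical P(z | x): how likely each thought process is for this problem.
p_z_given_x = {"z1": 0.5, "z2": 0.3, "z3": 0.2}

# Hypothetical P(y | x, z): answer distribution after each thought process.
p_y_given_xz = {
    "z1": {"42": 0.9, "41": 0.1},
    "z2": {"42": 0.6, "41": 0.4},
    "z3": {"42": 0.1, "41": 0.9},
}


def marginal(y: str) -> float:
    """P(y | x): weight each thought's answer probability by P(z | x) and sum."""
    return sum(p_z * p_y_given_xz[z][y] for z, p_z in p_z_given_x.items())


p_42 = marginal("42")
p_41 = marginal("41")
```

An answer can be likely overall even when no single thought process guarantees it, because probability mass accumulates across thoughts.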

Token-Based Thinking#

Ling et al. (2017) explored strategies for generating intermediate steps before generating short answers, especially for mathematical problems. They introduced the AQUA-RAT dataset, which was later expanded on by Cobbe et al. (2021), who introduced the Grade School Math (GSM) dataset. Cobbe et al. trained a generator with supervised learning on human-written solutions, along with verifiers that predict the correctness of a candidate solution; they could then search over these solutions. Nye et al. (2021) experimented with intermediate thinking tokens as "scratchpads," and Wei et al. (2022) coined the now-standard term "chain of thought" (CoT).

Early work on improving CoT reasoning involved supervised learning on human-written reasoning traces or on model-written traces filtered for answer correctness, where the latter can be seen as a rudimentary form of reinforcement learning (RL). Other work found that the math performance of instruction-tuned models can be significantly improved through appropriate prompting, either with "think step by step" (Kojima et al., 2022) or with more complex prompts that encourage the model to first reflect on relevant knowledge (Yasunaga et al., 2023).

Later work found that CoT reasoning capabilities can be significantly enhanced by reinforcement learning on datasets of problems with automatically checkable solutions, such as STEM problems with short answers or coding tasks that can be checked with unit tests (Zelikman et al., 2022; Wang et al., 2023; Liu et al., 2023). With the announcements of o1-preview and o3 and the release of the R1 technical report (DeepSeek-AI, 2025), this approach has gained increasing attention, showing that policy gradient algorithms can yield strong performance.

Branching and Editing#

The fundamental goal of test-time compute is to adaptively modify the model's output distribution at test time. There are various ways to use test-time resources during decoding to select better samples, thereby shifting the model's predictions toward a more desirable distribution. The two main approaches to improving the decoding process are parallel sampling and sequential revision.

Parallel Sampling generates multiple outputs simultaneously, with guidance provided either at each step through process reward signals or at the end by verifiers that judge quality. It is the most widely adopted family of decoding methods for improving test-time performance, such as best-of-N or beam search. When ground truth is unavailable, self-consistency (Wang et al., 2023) is often used to select the answer by majority vote across multiple CoT outputs.
Sequential Revision iteratively adjusts the model's responses based on the outputs from the previous step, requiring the model to intentionally reflect on its existing responses and correct errors. The revision process may need to rely on fine-tuned models, as naively depending on the model's inherent self-correcting ability without external feedback may not yield improvements (Kamoi et al., 2024; Huang et al., 2024).

Parallel sampling is simple, intuitive, and easier to implement, but is limited by the model's capability to produce the correct solution in one go. Sequential revision explicitly requires the model to reflect on mistakes, but it is slower and requires extra caution during implementation, as there is indeed a risk of correct predictions being modified to incorrect ones or introducing other types of hallucinations. Both methods can be used together. Snell et al. (2024) showed that simple problems benefit from purely sequential test-time computation, while harder problems often perform best at an optimal ratio of sequential to parallel computation.
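Self-consistency, mentioned above as the fallback when no verifier or ground truth is available, reduces to a majority vote over the final answers extracted from several sampled CoTs. A minimal sketch, assuming the answers have already been extracted as strings (the sample list is made up):

```python
from collections import Counter


def majority_vote(final_answers: list[str]) -> str:
    """Return the most frequent final answer among the sampled CoT outputs."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    answer, _count = Counter(final_answers).most_common(1)[0]
    return answer


# Hypothetical final answers extracted from five independently sampled CoTs.
sampled_answers = ["18", "20", "18", "17", "18"]
voted = majority_vote(sampled_answers)
```

Ties are broken by whichever answer was seen first, which is an arbitrary choice of this sketch.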

Parallel Sampling#

Given a generative model and a scoring function that can score full or partial samples, we can use various search algorithms to find high-scoring samples. Best-of-N is the simplest such algorithm: collect N independent samples and select the highest-ranked one according to the scoring function. Beam search is a more complex search algorithm that makes the search process more adaptive, spending more sampling computation on the more promising parts of the solution space.
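Best-of-N can be sketched in a few lines. Here `sample_fn` and `score_fn` are hypothetical stand-ins for the generative model and the scoring function (for example a verifier or reward model); the toy task of guessing a number is only for illustration.

```python
import random


def best_of_n(sample_fn, score_fn, n: int):
    """Draw n independent samples and return the highest-scoring one."""
    candidates = [sample_fn() for _ in range(n)]
    best = max(candidates, key=score_fn)
    return best, candidates


# Toy stand-ins: the "model" guesses integers, the "scorer" prefers guesses near 42.
rng = random.Random(0)
target = 42
sample_fn = lambda: rng.randint(0, 100)
score_fn = lambda guess: -abs(guess - target)

best, candidates = best_of_n(sample_fn, score_fn, n=16)
```

The result can never be better than the best sample the model produces, which is the limitation of parallel sampling noted earlier.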

Beam search maintains a set of promising partial sequences and alternates between expanding them and pruning the less promising ones. As a selection mechanism, we can use a process reward model (PRM; Lightman et al. 2023) to guide the selection of beam search candidates. Xie et al. (2023) used LLMs to assess the likelihood that their own generated reasoning steps are correct, formatting the check as a multiple-choice question, and found that self-evaluation at each step reduced accumulated errors in multi-step reasoning during beam search decoding. Additionally, annealing the temperature during sampling helps mitigate aggregated randomness. These experiments by Xie et al. achieved a 5-6% improvement on the few-shot GSM8k, AQuA, and StrategyQA benchmarks with the Codex model. Reward-balanced search (abbreviated as "REBASE"; Wu et al. 2025) trained a process reward model (PRM) to determine how much each node at each depth should be expanded during beam search, based on softmax-normalized reward scores. Jiang et al. (2024) trained their PRM, named "RATIONALYST," for beam search guidance on synthetic rationales conditioned on a large amount of unlabeled data. Good rationales were filtered based on whether they helped reduce the negative log probability of the true answer tokens, comparing contexts that include the rationale with contexts that do not. At inference time, RATIONALYST provides process supervision for CoT generators by helping estimate the log probability of the next reasoning step ("implicit") or directly generating the next reasoning step as part of the prompt ("explicit").
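The idea of expanding nodes in proportion to softmax-normalized reward scores can be illustrated as follows. This is only a sketch in the spirit of the description above: the function name, the temperature parameter, and the rounding rule are assumptions, not details taken from the REBASE paper.

```python
import math


def expansion_budget(rewards: list[float], total_budget: int,
                     temperature: float = 1.0) -> list[int]:
    """Split a sampling budget across nodes using softmax-normalized rewards."""
    top = max(rewards)
    exps = [math.exp((r - top) / temperature) for r in rewards]  # stable softmax
    norm = sum(exps)
    # Rounding is a simplification; the counts may not sum exactly to total_budget.
    return [round(total_budget * e / norm) for e in exps]


# Hypothetical PRM scores for three partial solutions at the same depth.
budgets = expansion_budget([2.0, 1.0, 0.0], total_budget=10)
```

Compared with a hard top-k cut, this keeps lower-scored nodes alive with a small share of the budget.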

Interestingly, emergent chain-of-thought reasoning paths can be triggered without explicit zero-shot or few-shot prompts. Wang & Zhou (2024) found that if we branch at the first sampling token by retaining the top $k$ tokens with the highest confidence (measured by the difference between the top-1 and top-2 candidates during sampling), and then continue each of these $k$ trials with greedy decoding, many of the resulting sequences natively contain CoT. In particular, when CoT does appear in the context, it leads to more confident decoding of the final answer. To calculate the confidence of the final answer, the answer span has to be identified either with task-specific heuristics (e.g., the last numeric value for a math problem) or by further prompting the model with "So the answer is". The design choice to branch only at the first token is based on the observation that early branching significantly enhances the diversity of potential paths, while later tokens are heavily influenced by previous sequences.
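The confidence measure can be sketched as the gap between the two most likely candidates at each answer token. Averaging the gap over the answer span and the example probabilities are assumptions made for illustration.

```python
def top2_margin(probs: list[float]) -> float:
    """Gap between the most likely and second most likely candidate token."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def answer_confidence(answer_token_probs: list[list[float]]) -> float:
    """Average top-1/top-2 margin over the tokens of the answer span."""
    margins = [top2_margin(p) for p in answer_token_probs]
    return sum(margins) / len(margins)


# Hypothetical next-token distributions at the answer tokens of two decoded paths.
paths = {
    "direct guess": [[0.40, 0.35, 0.25]],
    "path with CoT": [[0.90, 0.06, 0.04], [0.80, 0.15, 0.05]],
}
best_path = max(paths, key=lambda name: answer_confidence(paths[name]))
```

Ranking the $k$ branches by this score and keeping the most confident one is how a CoT path can be selected without any prompt engineering.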

Sequential Revision#

If the model can reflect on and correct errors in past answers, we would expect it to produce a good iterative revision sequence with continually improving quality. However, this self-correcting ability turns out not to be intrinsic to LLMs and does not work easily out of the box, due to various failure modes such as: (1) hallucinations, including modifying correct answers to incorrect ones; (2) behavior collapse to non-correcting behavior, for example making minor or no modifications to the first incorrect answer; or (3) failing to generalize to distribution shift at test time. Experiments by Huang et al. (2024) show that naively applying self-correction leads to worse performance, and models need external feedback to self-improve, which can be based on matching ground truth, heuristics, task-specific metrics, unit test results for coding problems (Shinn et al., 2023), stronger models (Zhang et al., 2024), and human feedback (Liu et al., 2023).

Self-correcting learning (Welleck et al., 2023) aims to train a corrector model $P_\theta(y | y_0, x)$ given a fixed generator model $P_0(y_0 | x)$. While the generator model remains general, the corrector model can be task-specific and only generates conditioned on the initial model response and optional additional feedback (e.g., sentences, compiler constraints, unit test results):

Self-correcting learning first generates multiple outputs for each prompt in the data pool.
It then creates value-improving pairs by pairing two outputs of the same prompt whenever one has a higher value than the other: (prompt $x$, hypothesis $y$, correction $y'$).
These pairs are selected in proportion to the improvement in value, $v(y') - v(y)$, and the similarity between the two outputs, $\text{Similarity}(y, y')$, to train the corrector model.
To encourage exploration, the corrector's new generations are also added to the data pool. At inference time, the corrector can be applied iteratively to create a sequence of corrections.
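The pair-construction step can be sketched as below. The value function, the use of `difflib.SequenceMatcher` as the similarity measure, and the product `gain * similarity` as the selection weight are all assumptions for illustration; the text above only says selection is proportional to both quantities.

```python
from difflib import SequenceMatcher
from itertools import permutations


def value_improving_pairs(prompt: str, outputs: list[str], value_fn) -> list[dict]:
    """Pair outputs of one prompt so that the correction has the higher value."""
    pairs = []
    for hypothesis, correction in permutations(outputs, 2):
        gain = value_fn(correction) - value_fn(hypothesis)
        if gain > 0:
            similarity = SequenceMatcher(None, hypothesis, correction).ratio()
            pairs.append({"prompt": prompt, "hypothesis": hypothesis,
                          "correction": correction, "weight": gain * similarity})
    total = sum(p["weight"] for p in pairs)
    if total > 0:  # normalize into sampling probabilities
        for p in pairs:
            p["weight"] /= total
    return pairs


# Hypothetical pool for one prompt; the value is 1 if the arithmetic is right.
outputs = ["2 + 2 = 5", "2 + 2 = 4", "five"]
pairs = value_improving_pairs("What is 2 + 2?", outputs,
                              value_fn=lambda y: 1.0 if y.endswith("4") else 0.0)
```

Weighting by similarity favors corrections that are small edits of the hypothesis, which is the behavior a corrector is meant to learn.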

Recursive checking (Qu et al. 2024) also aims to train a better corrector model but uses a single model for generation and self-correction simultaneously.

SCoRe (Self-Correction through Reinforcement Learning; Kumar et al. 2024) is a multi-round RL method that encourages the model to self-correct by producing better answers on the second attempt than those created on the first attempt. It consists of two training phases: Phase 1 maximizes the accuracy of the second attempt while enforcing KL penalties only on the first attempt to avoid excessive deviation of the first-round response from the base model behavior; Phase 2 optimizes the accuracy of the answers generated in both the first and second attempts. Ideally, we do want to see better performance for both the first and second attempts, but adding Phase 1 can prevent the model from collapsing into behaviors of making minor edits or no edits to the first response, while Phase 2 further improves the results.

Reinforcement Learning to Improve Reasoning#

Recently, significant success has been achieved in enhancing the reasoning capabilities of language models by using a set of problems with ground truth answers (often STEM problems and puzzles with easily verifiable answers) and rewarding the model for obtaining correct answers. The strong performance of OpenAI's o-series models and subsequent models and technical reports released by DeepSeek have driven recent activity in this area.

DeepSeek-R1 (DeepSeek-AI, 2025) is an open-source LLM designed to excel at tasks requiring advanced reasoning skills, such as mathematics, coding, and logic problem-solving. They conducted two rounds of SFT-RL training, enabling R1 to excel at both reasoning and non-reasoning tasks.

1. Cold-start SFT fine-tunes the DeepSeek-V3-Base base model on a collection of thousands of cold-start data samples. Without this step, the model has issues with poor readability and language mixing.
2. Reasoning-focused RL trains the reasoning model on reasoning prompts using two types of rule-based rewards:
  - Format reward: The model should wrap CoT with ... tokens.
  - Accuracy reward: Whether the final answer is correct. The answer to a math problem needs to be presented in a specific format (e.g., in a box) so that it can be verified reliably. For coding problems, a compiler is used to evaluate whether test cases pass.
3. Rejection sampling + non-reasoning SFT uses new SFT data created by rejection sampling from the RL checkpoint of Step 2, combined with non-reasoning supervised data from DeepSeek-V3 in domains like writing, factual QA, and self-cognition, to retrain DeepSeek-V3-Base.
  - Filter out CoTs with mixed languages, long paragraphs, and code blocks.
  - Include non-reasoning tasks using the DeepSeek-V3 (DeepSeek-AI, 2024) pipeline.
  - For certain non-reasoning tasks, call DeepSeek-V3 to generate potential CoTs before answering the question by prompting. For simpler queries like "hello," CoT is not needed.
  - Then fine-tune DeepSeek-V3-Base on a total of 800k samples for 2 epochs.
4. The final RL phase trains the Step 3 checkpoint on reasoning and non-reasoning prompts to improve helpfulness, harmlessness, and reasoning ability.
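The two rule-based rewards from the reasoning-focused RL step can be sketched with regular expressions. The `<think>` tag name and the `\boxed{}` convention are assumptions here (the text above elides the actual tokens and only says "in a box"), and exact string matching is a simplification of real answer verification.

```python
import re

# Assumed delimiters: the tag name and box macro are illustrative choices.
THINK_RE = re.compile(r"^\s*<think>.+?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(response: str) -> float:
    """1.0 if the response starts with a non-empty CoT wrapped in the tags."""
    return 1.0 if THINK_RE.match(response) else 0.0


def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the last boxed answer matches the reference exactly."""
    boxed = BOXED_RE.findall(response)
    if not boxed:
        return 0.0
    return 1.0 if boxed[-1].strip() == reference.strip() else 0.0


good = r"<think>6 * 7 = 42</think> The answer is \boxed{42}."
bad = r"The answer is \boxed{41}."
good_total = format_reward(good) + accuracy_reward(good, "42")
bad_total = format_reward(bad) + accuracy_reward(bad, "42")
```

Because both rewards are plain rules rather than learned models, there is no reward model for the policy to exploit, although the rules themselves can still be gamed.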

Interestingly, the DeepSeek team showed that with pure RL and no SFT phase, the model can still learn advanced reasoning abilities such as reflection and backtracking. The model naturally learned to spend more thinking tokens to solve reasoning tasks during RL training. An "aha moment" can emerge, in which the model reflects on previous mistakes and then tries alternative approaches to correct them. Subsequently, various open-source efforts have emerged to replicate the R1 results, such as Open-R1, SimpleRL-reason, and TinyZero, all based on Qwen models. These efforts also confirmed that pure RL leads to strong performance on math problems, as well as the emergence of "aha moments."

The DeepSeek team also shared some of their unsuccessful attempts. They did not use a process reward model (PRM) because it was challenging to define scoring metrics for each step or determine whether intermediate steps were correct, while making training more susceptible to reward hacking. Efforts with MCTS (Monte Carlo Tree Search) also failed because the search space of language model tokens is vast compared to chess; training fine-grained value models to guide the search is also very challenging. Failed attempts often provide unique insights, and we want to encourage the research community to share more about things that did not succeed.

Use of External Tools#

During reasoning, certain intermediate steps can be solved reliably and accurately by executing code or performing mathematical calculations. Offloading this part of the reasoning to an external code interpreter, as in PAL (Program-Aided Language Model; Gao et al. 2022) or Chain of Code (Li et al. 2023), can extend the capabilities of LLMs without requiring them to learn to execute code or function as calculators themselves. The code emulator in Chain of Code can be augmented by an LLM, so that if a standard code interpreter fails, we have the option of using the LLM to execute that line of code instead. Using code to enhance reasoning steps is particularly beneficial for math problems, symbolic reasoning, and algorithmic tasks. For coding problems, unit tests may not be provided as part of the problem, in which case we can instruct the model to generate unit tests on its own to validate solutions (Shinn et al., 2023).
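A minimal sketch of the program-aided pattern: the model writes a small program instead of computing the answer in text, and the interpreter produces the final value. The `generated_program` string is a hypothetical model output, and the `solution` entry-point name is an assumption of this sketch.

```python
# Hypothetical model output for: "There are 23 apples. 20 are used for lunch
# and 6 more are bought. How many apples are there now?"
generated_program = """
def solution():
    apples_start = 23
    apples_used = 20
    apples_bought = 6
    return apples_start - apples_used + apples_bought
"""


def run_program(source: str):
    """Execute model-written code and return the value of solution()."""
    namespace: dict = {}
    # exec runs arbitrary code: only use it on trusted input or inside a sandbox.
    exec(source, namespace)
    return namespace["solution"]()


answer = run_program(generated_program)
```

The arithmetic is done by the interpreter, so the model only has to get the program right, not the calculation.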

ReAct (Reason+Act; Yao et al. 2023) combines searching the Wikipedia API with the generation of reasoning trajectories, allowing reasoning paths to incorporate external knowledge.

Recently released by OpenAI, o3 and o4-mini are two more excellent examples where the reasoning process involves the use of tools like web searches, code execution, and image processing. The team observed that large-scale reinforcement learning exhibited the same trend as the GPT paradigm, namely, "more computation = better performance."

Faithful Thinking#

Deep learning models are often viewed as black boxes, and various interpretability methods have been proposed. Interpretability is useful for several reasons: first, it provides an additional test to determine whether the model is misaligned with its creators' intentions or whether it is misbehaving in a way that we cannot detect by monitoring its actions. Second, it can help us determine whether the model is using a sound process to compute its answers. Chain of thought provides a particularly convenient form of interpretability, as it makes the model's internal processes visible in natural language. However, this interpretability relies on the assumption that the model truthfully describes its internal thought processes.

Recent research has shown that monitoring the CoT of reasoning models can effectively detect model misbehavior, such as reward hacking, and can even enable weaker models to monitor stronger models (Baker et al., 2025). Increasing test-time computation can also improve adversarial robustness (Zaremba et al., 2025). This is intuitively reasonable: when the model encounters unusual inputs (e.g., adversarial examples or jailbreak attempts), thinking time should be particularly useful, since it can use the extra time to make sense of the strange situation it faces.

Does the model faithfully express its thoughts?#

Intuitively, due to the lack of explicit training objectives aimed at encouraging faithful reasoning, the model's CoT may be biased. Alternatively, when we fine-tune the model based on human-written explanations, these human-written samples may contain errors. Therefore, we cannot assume that CoT is always faithful by default.

Lanham et al. (2023) studied several modes of CoT faithfulness failure by deliberately introducing errors into CoT and measuring their impact on accuracy across a set of multiple-choice tasks (e.g., AQuA, MMLU, ARC Challenge, TruthfulQA, HellaSwag):

Error 1 (premature answers): The model may form its conclusion before generating the CoT. This was tested by truncating the CoT early or inserting errors into it. Different tasks showed varying degrees of dependence on the CoT; some were sensitive to a truncated CoT, while others were not. Wang et al. (2023) conducted similar experiments but with more subtle errors related to bridging objects or language templates in the formation of CoT.
Error 2 (non-informative tokens): The hypothesis here is that uninformative CoT tokens alone improve performance. It was tested by replacing the CoT with filler text (e.g., all periods); this setup showed no accuracy improvement over having no CoT, and performance on some tasks even declined slightly.
Error 3 (human-unreadable encoding): Relevant information may be encoded in a way that is hard for humans to understand. Paraphrasing the CoT in a non-standard way did not degrade performance across datasets, indicating that the accuracy gains do not rely on human-readable reasoning.

Interestingly, Lanham et al. found that for multiple-choice questions, smaller models may not leverage CoT well, while larger models may already be able to solve the tasks without CoT. This dependency on CoT reasoning, measured by the percentage of obtaining the same answer with and without CoT, does not always increase with model size for multiple-choice questions but does increase with model size for addition tasks, suggesting that thinking time is more critical for complex reasoning tasks.

Alternative methods for testing CoT faithfulness perturb the prompt rather than directly modifying the CoT path (Turpin et al., 2023; Chua & Evans, 2025; Chen et al., 2025).

One method always marks the correct answer as "(A)" in the few-shot examples, regardless of the actual label, to introduce bias.

Another prompting technique inserts misleading hints into the prompt, such as "I think the answer is <random_label> but curious to hear what you think." or "A Stanford Professor thinks the answer is <random_label>." By comparing the model's predictions for the same question with and without the misleading hint, we can measure whether the model faithfully describes the influence of the hint on its answer. Specifically, in cases where the model produces different answers with and without the hint, we measure whether the model acknowledges the hint when solving the problem with the hint. If the model is faithful, it should explicitly acknowledge the influence and recognize that its answer changed because of the hint.

Multiple studies have found that reasoning models describe the influence of hints more reliably than all tested non-reasoning models. For example, we can measure the proportion of samples where the model acknowledges the hint as a determining factor ("faithful CoT"). Reasoning models (Claude 3.7 Sonnet, DeepSeek R1) generally perform better than non-reasoning models (Claude 3.6, DeepSeek V3).
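The measurement can be sketched as a simple rate over evaluation records. The record fields and the example data are hypothetical; in practice the `acknowledges_hint` flag would come from a human or model judge reading the CoT.

```python
def faithful_cot_rate(records: list[dict]):
    """Among questions whose answer changed once the hint was added,
    return the fraction where the CoT acknowledges the hint."""
    changed = [r for r in records
               if r["answer_without_hint"] != r["answer_with_hint"]]
    if not changed:
        return None  # the hint never changed an answer; the rate is undefined
    return sum(r["acknowledges_hint"] for r in changed) / len(changed)


# Hypothetical evaluation records for four multiple-choice questions.
records = [
    {"answer_without_hint": "B", "answer_with_hint": "C", "acknowledges_hint": True},
    {"answer_without_hint": "A", "answer_with_hint": "D", "acknowledges_hint": False},
    {"answer_without_hint": "C", "answer_with_hint": "C", "acknowledges_hint": False},
    {"answer_without_hint": "D", "answer_with_hint": "A", "acknowledges_hint": False},
]
rate = faithful_cot_rate(records)
```

Questions where the answer did not change are excluded because there is no evidence that the hint influenced them.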

Some evidence suggests that using reward models can reduce the faithfulness of model outputs. The reward model in classic RLHF is not trained to incentivize faithful behavior in this test, since the preference comparison dataset was collected by humans selecting the response that appeared better or more pleasing. In reasoning models, however, the CoT path is optimized to produce the correct final answer rather than to match human preferences defined in the RM, so it is intuitively expected to be more faithful.

The impact of optimization pressure on CoT: good or bad?#

Monitoring the CoT of reasoning models is effective for detecting reward hacking behaviors. One simple way to use such a monitor is to run it as a filter for rejection sampling at test time and keep only solutions without reward hacking. However, this only works in the early stages of RL; as the model is further optimized, it becomes rare to find a sample that passes the CoT monitor within budget. This challenge naturally suggests that we might consider incorporating CoT monitoring into RL rewards (a form of process-based supervision) to prevent reward hacking.

However, incorporating CoT monitoring rewards into RL does not work well, as it leads to obfuscated reward hacking, where the model tries to hide its true intentions in CoT (Baker et al., 2025). Under the direct optimization pressure provided by the CoT monitor during RL training, the model can still find ways to hack, albeit at a lower frequency, by hiding its intentions in CoT to avoid being caught by the monitor.

Yeo et al. (2025) applied RL signals to CoT length and confirmed a message similar to that of Baker et al.: novel RL reward shaping may lead to new types of reward hacking. They designed the reward function such that correct CoTs receive higher rewards than incorrect ones, short correct CoTs receive higher rewards than long correct CoTs, and short incorrect CoTs receive higher penalties than long incorrect CoTs. With this additional reward, the model learned to repeat text in its CoT on challenging questions rather than attempting to solve them. Yeo et al. further applied an n-gram repetition penalty to prevent this hacking behavior.
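The three ordering constraints and the repetition check can be illustrated with a deliberately simple sketch. The linear shaping and its constants are hypothetical and are not the reward function used by Yeo et al.; only the orderings and the idea of penalizing repeated n-grams come from the text.

```python
def length_shaped_reward(is_correct: bool, length: int, max_length: int) -> float:
    """Hypothetical linear shaping that satisfies the three orderings."""
    frac = min(length, max_length) / max_length
    bonus = 0.5 * (1.0 - frac)  # largest for the shortest CoT
    return 1.0 + bonus if is_correct else -1.0 - bonus


def repeated_ngram_fraction(tokens: list[str], n: int = 3) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


short_ok = length_shaped_reward(True, 100, 1000)
long_ok = length_shaped_reward(True, 900, 1000)
long_bad = length_shaped_reward(False, 900, 1000)
short_bad = length_shaped_reward(False, 100, 1000)
looping = repeated_ngram_fraction("a b c a b c a b c".split())
```

Under this shaping an incorrect answer is penalized less when the CoT is longer, so padding the CoT with repeated text is a cheap way to reduce the penalty; subtracting a term proportional to `repeated_ngram_fraction` is one way to close that loophole.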

Chen et al. (2025) conducted experiments in a flawed RL environment, specifically using a grader with incorrect answers filled in for multiple-choice questions. The model learned to exploit the reward hack on >99% of prompts, but almost never (<2%) verbalized the reward hack in its CoT in more than half of the environments. In this case, the additional RL optimization pressure did not incentivize the model to verbalize the hack.

RL training is inherently sensitive to reward hacking. Relying solely on heuristic investigation of reward hacks and manual fixes may lead to a "whack-a-mole" situation. We advise being very cautious when applying optimization directly to CoT during RL training, or avoiding it altogether.

Thinking in Continuous Space#

Adaptive computation time, introduced by Alex Graves in 2016, predates large language models but pioneered the same direction: it allows a model to dynamically decide how many computational steps to take at inference time, which can be seen as enabling models to "think more" in continuous space at test time. Adaptive thinking time in continuous space can be enabled vertically through recurrent architectures or horizontally through more sequential sampling steps.

Recurrent Architecture#

Many architectural variants have been proposed to make the Transformer architecture recurrent, enabling adaptive test-time computation (Dehghani et al., 2019; Hutchins et al., 2022; Bulatov et al., 2022). A deep dive into the literature on this topic would make this article too lengthy, so we will only review a few.

The Universal Transformer (Dehghani et al., 2019) combines self-attention in Transformers with the recurrent mechanism in RNNs, dynamically adjusting the number of steps using adaptive computation time (Graves, 2016). At a high level, it can be viewed as a recurrent function for learning the hidden state representation for each token; if the number of steps is fixed, the Universal Transformer is equivalent to a multi-layer Transformer with shared parameters across layers.

The recently proposed recurrent architecture design by Geiping et al. (2025) adds a recurrent block $R$ on top of the standard Transformer. Each iteration of this recurrent block takes the embedding $\mathbf{e}$ and a random state $\mathbf{s}_i$. Conceptually, this recurrent-depth architecture is somewhat similar to conditional diffusion models, where the original input $\mathbf{e}$ is provided at each recurrent step, while a state $\mathbf{s}_i$, initialized from random Gaussian noise, is iteratively updated throughout the process. (Interestingly, some of their design experiments that more closely resembled diffusion models turned out to perform poorly.)

$\mathbf{e} = P(\mathbf{x}) \quad \text{embedding}$

$\mathbf{s}_0 \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_{n \cdot h})$

$\mathbf{s}_i = R(\mathbf{e}, \mathbf{s}_{i-1}) \quad \text{for } i \in \{1, \ldots, r\} \quad \text{recurrent block; resembles a Transformer block}$

$\mathbf{p} = C(\mathbf{s}_r) \quad \text{unembedding}$
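The control flow of these four equations can be sketched in plain Python. The functions below are toy stand-ins, not the actual architecture: `embed` is a fixed sinusoidal map, `recurrent_block` is an elementwise update rather than a Transformer block, and all names and default values are assumptions.

```python
import math
import random


def embed(token_ids: list[int], h: int) -> list[float]:
    """P: map n tokens to a flat vector of n*h numbers (toy embedding)."""
    return [math.sin(t * (j + 1)) for t in token_ids for j in range(h)]


def recurrent_block(e: list[float], s: list[float]) -> list[float]:
    """R: toy stand-in for a Transformer block; re-injects the input e each step."""
    return [math.tanh(0.5 * e_j + 0.5 * s_j) for e_j, s_j in zip(e, s)]


def unembed(s: list[float]) -> list[float]:
    """C: softmax over the final state (toy output distribution)."""
    top = max(s)
    exps = [math.exp(v - top) for v in s]
    norm = sum(exps)
    return [v / norm for v in exps]


def forward(token_ids: list[int], r: int, h: int = 4,
            sigma: float = 1.0, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    e = embed(token_ids, h)
    s = [rng.gauss(0.0, sigma) for _ in e]  # s_0 ~ N(0, sigma^2 I_{n*h})
    for _ in range(r):                      # r is chosen at test time
        s = recurrent_block(e, s)
    return unembed(s)


p = forward([3, 1, 4], r=8)
```

Because `r` is just a loop count, the same weights can be run for more iterations at test time. In this toy version the update is a contraction, so the output becomes independent of the random initial state as `r` grows.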

Thinking Tokens#

Thinking tokens refer to a set of implicit tokens introduced during training or inference that do not carry direct linguistic meaning. Instead, their role is to provide additional thinking time and computational capacity, improving model performance.

Herel & Mikolov (2023) proposed the idea of inserting special thinking tokens after each word in a sentence and training the model on such datasets. Each thinking token earns the model extra time to process and make better predictions. Training with thinking tokens in toy model settings resulted in lower perplexity compared to baseline models trained without thinking tokens. The benefits of thinking tokens are more pronounced for non-trivial reasoning tasks or sentences involving numbers.

Similarly, the pause tokens proposed by Goyal et al. (2024) delay the model's output by appending dummy tokens (e.g., characters like . or #) at the end of the input sequence, giving the model extra computation at inference time. It is important to inject such pause tokens during both training and inference; fine-tuning only on pause tokens yields limited gains. During training, multiple copies of the pause token are inserted at uniformly random positions, and the loss on pause tokens is ignored.
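The data preparation side of this can be sketched as follows. The `<pause>` token string and the function names are hypothetical, and the sketch assumes the original text never contains the pause token itself.

```python
import random

PAUSE = "<pause>"  # hypothetical token string, assumed absent from real text


def insert_pause_tokens(tokens: list[str], num_pauses: int, rng: random.Random):
    """Training: insert pause tokens at random positions and build a loss mask."""
    out = list(tokens)
    for _ in range(num_pauses):
        out.insert(rng.randint(0, len(out)), PAUSE)
    loss_mask = [0 if t == PAUSE else 1 for t in out]  # 0 = ignore in the loss
    return out, loss_mask


def append_pause_tokens(prompt_tokens: list[str], num_pauses: int) -> list[str]:
    """Inference: delay the output by appending pause tokens to the input."""
    return list(prompt_tokens) + [PAUSE] * num_pauses


tokens = "the cat sat on the mat".split()
padded, mask = insert_pause_tokens(tokens, num_pauses=3, rng=random.Random(0))
delayed = append_pause_tokens(["2", "+", "2", "="], num_pauses=2)
```

The mask ensures the model is never trained to predict a pause token; the pauses only add positions at which computation can happen.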

Interestingly, the thinking tokens or pause tokens in the aforementioned experiments do not carry any extra information or add many new parameters. But why are they still helpful? On one hand, they help extend computation by introducing more inference loops, effectively increasing computational capacity. On the other hand, they can be seen as a special implicit form of CoT. One downside is that the model needs to be pretrained with thinking tokens. Nevertheless, this strategy remains an interesting way to further improve the utilization of test-time compute on top of inference-time CoT.

Quiet-STaR (Zelikman et al., 2025) introduces token-level reasoning by training the model to generate rationales after every token to explain future text. It mixes the future-text predictions made with and without rationales, uses learning to generate better rationales, and uses REINFORCE to optimize the quality of rationale generation.