DeepMind GenRM: Improving LLM Accuracy with Self-Verification
Large Language Models (LLMs) have transformed natural language processing, but they often struggle with complex reasoning tasks, producing confident yet incorrect outputs. To address this challenge, researchers at Google DeepMind have introduced Generative Reward Models (GenRM), a state-of-the-art approach to verifying LLM outputs. GenRM uses the inherent text generation capabilities of LLMs to build better verifiers. By recasting verification as a next-token prediction task, GenRM taps into the strengths of pretrained language models, enabling features like chain-of-thought reasoning and efficient use of additional inference-time computation. This article explores the mechanics of GenRM, its performance across various tasks, and how it compares to traditional verification methods and state-of-the-art models like GPT-4 and Gemini 1.5 Pro.
How GenRM Works
Traditional verifiers, typically implemented as discriminative reward models (RMs), have been the standard approach for evaluating the correctness of solutions in reasoning domains. These verifiers assign a numerical score that estimates the probability of a solution being correct for a given problem. Their key limitation is that they do not use the text generation capabilities of LLMs, so they struggle to capture complex reasoning patterns or to provide detailed explanations for their decisions.
\[ \mathcal{L}_{\text{Discriminative-RM}}(\theta, \mathcal{D}_{RM}) = - \mathbb{E}_{(\mathbf{x}, \mathbf{y}^+) \sim \mathcal{D}_{correct}}\left[\log r_\theta (\mathbf{x}, \mathbf{y}^+)\right] - \mathbb{E}_{(\mathbf{x}, \mathbf{y}^-)\sim \mathcal{D}_{incorrect}}\left[ \log \left(1 - r_\theta (\mathbf{x}, \mathbf{y}^-)\right) \right] \]

where \( r_\theta (\mathbf{x}, \mathbf{y}) = \mathrm{sigmoid}(z_{cls}) \) and \( z_{cls} = \mathrm{logit}_{\theta}(\text{cls} \mid \mathbf{y}, \mathbf{x}) \).

GenRM addresses these limitations by recasting verification as a next-token prediction task, aligning it with the core strengths of LLMs. Here’s how it works:
- Representation: Instead of outputting a numerical score, GenRM represents the verification decision as a token probability. For example, it uses prompts like “Is the answer correct?” and represents the score as the probability of the “Yes” or “No” tokens.
- Training Objective: GenRM is trained using the standard next-token prediction (SFT) loss:

\[ \mathcal{L}_{SFT}(\theta, \mathcal{D}) = - \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}}\left[\log p_\theta(\mathbf{y} \mid \mathbf{x})\right] \]

Where θ are the model parameters, D is the dataset, x is the input context, and y is the target response.
- Inference: For a given problem-solution pair (x, y), GenRM computes the verification score as:

\[ r_{Direct}(\mathbf{x}, \mathbf{y}) = p_\theta(\text{Yes} \mid \mathbf{x}, \mathbf{y}, I) \]

Where I is the instruction prompt “Is the answer correct (Yes/No)?”.
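To make the inference step concrete, here is a minimal sketch of how r_Direct could be read off a Hugging Face causal LM as the probability of the “Yes” token; the model name and prompt wording are placeholders rather than the paper’s exact setup.

```python
# Minimal sketch of Direct-GenRM scoring with a Hugging Face causal LM.
# Model name and prompt wording are illustrative placeholders, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-9b-it"  # placeholder; any instruction-tuned causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def direct_genrm_score(problem: str, solution: str) -> float:
    """Return r_Direct(x, y) = p('Yes' | x, y, I), the probability the solution is judged correct."""
    prompt = f"{problem}\n{solution}\nIs the answer correct (Yes/No)? "
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits for the token right after the prompt
    probs = torch.softmax(next_token_logits, dim=-1)
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    return probs[yes_id].item()
```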
GenRM can be trained jointly on verification and solution generation tasks:
\[ \mathcal{L}_{GenRM}(\theta, \mathcal{D}_{verify}) = \mathcal{L}_{SFT}(\theta, \mathcal{D}_{verify}) + \lambda \mathcal{L}_{SFT}(\theta, \mathcal{D}_{correct}) \]

Where λ controls the mixture of verification and generation data.
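In code, the joint objective is simply two standard next-token (SFT) losses added together. A minimal sketch, assuming Hugging Face-style batches where the labels mirror the input_ids (in practice, prompt tokens would be masked out of the loss):

```python
# Sketch of the joint GenRM objective: L_SFT on verification data + lambda * L_SFT on correct solutions.
# Assumes Hugging Face-style causal-LM batches; prompt masking is omitted for brevity.
def genrm_joint_loss(model, verify_batch, correct_batch, lam=0.5):
    loss_verify = model(input_ids=verify_batch["input_ids"],
                        attention_mask=verify_batch["attention_mask"],
                        labels=verify_batch["input_ids"]).loss
    loss_generate = model(input_ids=correct_batch["input_ids"],
                          attention_mask=correct_batch["attention_mask"],
                          labels=correct_batch["input_ids"]).loss
    return loss_verify + lam * loss_generate
```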
This approach allows GenRM to leverage the full power of pretrained LLMs, enabling features like chain-of-thought reasoning and efficient use of additional inference-time computation for improved verification accuracy.
GenRM in Action: Performance Across Tasks
GenRM’s performance was evaluated across various reasoning tasks, demonstrating its effectiveness and versatility. Here’s a detailed look at its performance across different domains:
Last Letter Concatenation and Word Sorting
- Last Letter Concatenation:
- Task: Given a list of words, concatenate the last letter of each word.
- Training: GenRM was trained on lists of 2-4 words.
- Evaluation: Out-of-distribution (OOD) setting with 6-word lists.
- Results: GenRM significantly outperformed traditional verifiers, nearly matching the performance of an oracle verifier.
- Word Sorting:
- Task: Sort a given list of words alphabetically.
- Training: Lists of 2-4 words.
- Evaluation: OOD setting with 5-word lists.
- Performance: GenRM showed superior generalization, surpassing discriminative RMs and other baselines.
In both tasks, GenRM demonstrated strong generalization capabilities, effectively handling longer sequences than those seen during training.
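For concreteness, both tasks have simple programmatic ground truths; the word list below is made up and, at six words, would fall in the OOD evaluation regime:

```python
# Ground truth for the two synthetic tasks (illustrative 6-word list, i.e. the OOD length).
words = ["reward", "model", "verifier", "token", "yes", "no"]

last_letter_concat = "".join(w[-1] for w in words)  # -> "dlrnso"
word_sorting = sorted(words)                        # -> ['model', 'no', 'reward', 'token', 'verifier', 'yes']
```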
Grade School Math (GSM8K) Benchmark
The GSM8K benchmark is a challenging dataset for evaluating grade-school math reasoning capabilities.
- Dataset: 7.2K training problems, 1.3K test problems.
- Training: Up to 16 correct and 16 incorrect solutions per problem.
- Evaluation: Best-of-16 performance on test set.
Results:
- GenRM-CoT (Gemma-9B model): 92.8% of problems solved
- This is an improvement of roughly 20 percentage points over the baseline (73% → 92.8%)
The performance boost is attributed to GenRM’s ability to generate and leverage chain-of-thought rationales, enabling it to catch subtle reasoning errors that traditional verifiers miss.
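The Best-of-N protocol behind these numbers is easy to state in code: sample N candidate solutions, score each with the verifier, and keep the highest-scoring one. A minimal sketch, reusing the hypothetical direct_genrm_score from earlier (any scoring function works):

```python
def best_of_n(problem: str, candidate_solutions: list[str], score_fn=direct_genrm_score) -> str:
    """Best-of-N reranking: return the candidate solution the verifier scores highest."""
    scores = [score_fn(problem, sol) for sol in candidate_solutions]
    return candidate_solutions[max(range(len(scores)), key=scores.__getitem__)]
```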
Comparison with GPT-4 and Gemini 1.5 Pro
GenRM’s performance on GSM8K is particularly impressive when compared to state-of-the-art models:
- GPT-4:
- Widely regarded as one of the most capable language models.
- GenRM outperformed GPT-4 on the GSM8K benchmark.
- Gemini 1.5 Pro:
- Google’s latest advanced language model.
- GenRM surpassed Gemini 1.5 Pro’s performance on GSM8K.
It’s important to note that GenRM achieved these results using a Gemma-9B model, which has significantly fewer parameters than GPT-4 or Gemini 1.5 Pro. This highlights the efficiency and effectiveness of the GenRM approach.
The Power of Chain-of-Thought Reasoning
Chain-of-Thought (CoT) reasoning is a key feature that sets GenRM apart from traditional verifiers, significantly enhancing its verification accuracy and flexibility.
GenRM-CoT: Enhancing Verification Accuracy
GenRM-CoT extends the basic GenRM by incorporating intermediate reasoning steps before making a final decision. This process allows the model to:
- Generate a verification rationale (v_CoT)
- Use this rationale to inform the final correctness decision
The verification score for GenRM-CoT is computed as:
\[ r_{CoT}(\mathbf{x}, \mathbf{y}) = p_\theta(\text{Yes} \mid \mathbf{x}, \mathbf{y}, I_{CoT}, \mathbf{v}_{CoT}, I) \]

Where I_CoT is the instruction to generate a verification rationale, and v_CoT is the generated rationale.
This approach enables GenRM-CoT to catch subtle reasoning errors that might be missed by direct verifiers, leading to improved accuracy across various tasks.
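A sketch of the two-stage CoT scoring, building on the hypothetical model, tokenizer, and direct_genrm_score from the earlier snippet: first sample a verification rationale, then condition the final Yes/No decision on it. The instruction wording is an assumption, not the paper’s exact prompt.

```python
def genrm_cot_score(problem: str, solution: str, max_new_tokens: int = 256) -> float:
    """r_CoT(x, y): sample a verification rationale v_CoT, then score p('Yes' | x, y, v_CoT)."""
    cot_instruction = "Let's verify the solution step by step."  # stands in for I_CoT (illustrative wording)
    prompt = f"{problem}\n{solution}\n{cot_instruction}\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    rationale = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)       # v_CoT
    # Ask the final correctness question with the sampled rationale in context.
    return direct_genrm_score(problem, f"{solution}\n{cot_instruction}\n{rationale}")
```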
Majority Voting for Improved Results
GenRM-CoT leverages majority voting to further enhance its performance:
- Generate K verification CoT rationales
- Average the CoT-verifier scores across these rationales
The majority voting score is calculated as:
\[ r_{MajVote@K}(\mathbf{x}, \mathbf{y}) = \frac{1}{K} \sum_{i=1}^{K} p_\theta(\text{Yes} \mid \mathbf{x}, \mathbf{y}, I_{CoT}, \mathbf{v}_{CoT}^{(i)}, I) \]

Where v_CoT^(i) are independently sampled rationales.
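Because each call to the hypothetical genrm_cot_score above samples a fresh rationale, majority voting reduces to averaging K such scores:

```python
def maj_vote_score(problem: str, solution: str, k: int = 32) -> float:
    """Average p('Yes') over K independently sampled verification rationales."""
    return sum(genrm_cot_score(problem, solution) for _ in range(k)) / k
```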
This technique allows GenRM-CoT to:
- Mitigate the impact of individual reasoning errors
- Utilize additional inference-time compute to improve accuracy
- Scale performance with increased computational resources
Synthetic Rationale Generation
To address the challenge of obtaining high-quality verification rationales, especially as LLMs surpass human reasoning abilities, the researchers explored using synthetically-generated rationales:
- Reference-guided grading: Provide a reference solution alongside the problem and solution to verify, making it easier for an LLM to identify reasoning errors.
- Prompt design: Use carefully crafted prompts to guide the LLM in generating high-quality rationales (see Table A.2 in the paper for an example).
- Filtering: Generate multiple rationales and filter them based on correctness verification.
Results showed that using reference-guided grading significantly improved the quality of synthetic rationales, boosting GenRM-CoT performance (91.7% with guidance vs. 87.8% without for Gemma-7B verifiers on GSM8K).
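As an illustration, a reference-guided grading prompt might be assembled along these lines; the wording is an assumption, not the paper’s template (see Table A.2 in the paper for that):

```python
def reference_guided_prompt(problem: str, solution: str, reference_solution: str) -> str:
    """Build a grading prompt that shows the rationale-generating LLM a known-correct reference solution."""
    return (
        f"Problem:\n{problem}\n\n"
        f"Reference solution (known to be correct):\n{reference_solution}\n\n"
        f"Solution to grade:\n{solution}\n\n"
        "Compare the solution against the reference, explain any reasoning errors step by step, "
        "and end with: Is the answer correct (Yes/No)?"
    )
```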
Key advantages of synthetic rationale generation:
- Scalability: Enables training on large datasets without human-generated rationales
- Adaptability: Can be applied to new domains or tasks quickly
- Consistency: Provides a standardized approach to rationale generation
By combining CoT reasoning, majority voting, and synthetic rationale generation, GenRM-CoT achieves state-of-the-art performance in verifying complex reasoning tasks, outperforming traditional verifiers and even surpassing larger language models in specific benchmarks.
Conclusion
GenRM represents a significant advancement in the field of LLM verification, offering a powerful alternative to traditional discriminative reward models. By leveraging next-token prediction and incorporating chain-of-thought reasoning, GenRM demonstrates superior performance across various reasoning tasks, even outperforming larger models like GPT-4 and Gemini 1.5 Pro on specific benchmarks. The ability to unify generation and verification, utilize synthetic rationales, and scale with additional inference-time compute makes GenRM a versatile and efficient tool for enhancing LLM accuracy. As AI continues to evolve, approaches like GenRM will be crucial in ensuring the reliability and trustworthiness of AI-generated content across diverse applications.