Prior work in RLHF focused on collecting human preferences regarding the overall quality of language model (LM) outputs. However, this type of holistic feedback offers limited information. In our paper, we introduce fine-grained human feedback (e.g., which sub-sentence is irrelevant, which sentence is not truthful, which sentence is toxic) as an explicit training signal.
Our rewards are fine-grained in two aspects:
(a) Density: We provide a reward after each segment (e.g., a sentence) is generated, similar to OpenAI's "step-by-step process reward". We found that this approach is more informative than holistic feedback and, thus, more effective for RL.
(b) Multiple reward models associated with different feedback types: We employ multiple reward models to capture different types of feedback (e.g., factual inaccuracy, irrelevance, and information incompleteness). Interestingly, we observed that these reward models both complement and compete with each other. By adjusting the weights of the reward models, we can control the balance between the different types of feedback and tailor the LM for different tasks according to specific needs. For instance, some users may prefer short and concise outputs, while others may seek longer and more detailed responses.
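To make the two aspects concrete, here is a minimal sketch (under our own naming assumptions, not the released FineGrainedRLHF code) of how per-segment scores from several reward models could be combined into one dense reward with user-chosen weights:

```python
# Minimal sketch, not the released FineGrainedRLHF code: one reward per generated
# segment (density, aspect (a)), combined across multiple reward models with
# adjustable weights (aspect (b)). `split_into_segments` and the scorers in
# `reward_models` are hypothetical stand-ins.
from typing import Callable, Dict, List

def combined_segment_rewards(
    prompt: str,
    response: str,
    reward_models: Dict[str, Callable[[str, str], float]],  # name -> scorer(prompt, segment)
    weights: Dict[str, float],                               # name -> weight (tuned per use case)
    split_into_segments: Callable[[str], List[str]],         # e.g., a sentence splitter
) -> List[float]:
    """Return one scalar reward for each generated segment."""
    rewards = []
    for segment in split_into_segments(response):
        r = sum(weights[name] * scorer(prompt, segment)
                for name, scorer in reward_models.items())
        rewards.append(r)
    return rewards
```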
2023/07/04 Our codebase, FineGrainedRLHF, is released!
2023/06/05 QA-Feedback, our long-form QA dataset with human preferences + fine-grained feedback, is released!
2023/06/05 Paper is released!
The detoxification task aims to reduce toxicity in model generations. We use the Perspective API to measure toxicity; it returns a toxicity score between 0 (not toxic) and 1 (toxic).
We compare two kinds of rewards:
(a) Holistic Rewards for (non-)Toxicity: We use 1 - Perspective(y) as the reward for the full generation.
(b) Sentence-level (Fine-Grained) Rewards for (non-)Toxicity: We query the API after the model generates each sentence instead of waiting for the full sequence. For each generated sentence, we use -Δ(Perspective(y)) as its reward, i.e., how much the toxicity changes as a result of generating the current sentence.
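Below is a minimal sketch of the sentence-level reward in (b), assuming a hypothetical perspective_toxicity(text) wrapper around the Perspective API that returns a score in [0, 1], and treating the empty prefix as having toxicity 0:

```python
# Minimal sketch of the sentence-level (fine-grained) toxicity reward described in (b).
# `perspective_toxicity` is a hypothetical wrapper around the Perspective API.
from typing import Callable, List

def sentence_toxicity_rewards(
    sentences: List[str],
    perspective_toxicity: Callable[[str], float],  # returns toxicity in [0, 1]
) -> List[float]:
    """Reward each sentence by the (negative) change in toxicity it introduces."""
    rewards: List[float] = []
    prefix, prev_tox = "", 0.0  # assume the empty prefix has toxicity 0
    for sent in sentences:
        prefix = (prefix + " " + sent).strip()
        tox = perspective_toxicity(prefix)
        rewards.append(-(tox - prev_tox))  # -Δ(Perspective(y)): adding less toxicity -> higher reward
        prev_tox = tox
    return rewards
```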
Table 1 shows that our Fine-Grained RLHF with sentence-level fine-grained rewards attains the lowest toxicity and perplexity among all methods, while maintaining a similar level of diversity. Figure 2 shows that learning from the denser fine-grained reward is more sample efficient than learning from the holistic reward. One explanation is that the fine-grained reward locates where the toxic content is, which is a stronger training signal than a single scalar reward for the whole text.
We collect QA-Feedback, a long-form question answering dataset with human preferences and fine-grained feedback. QA-Feedback is based on ASQA, a dataset that focuses on answering ambiguous factoid questions.
There are three types of fine-grained human feedback, and we train a fine-grained reward model for each of them:
C1: irrelevance, repetition, and incoherence (rel.). The reward model operates at the sub-sentence level, i.e., it returns a score for each sub-sentence. If the sub-sentence is irrelevant, repetitive, or incoherent, the reward is -1; otherwise, the reward is +1.
C2: incorrect or unverifiable facts (fact.). The reward model operates at the sentence level, i.e., it returns a score for each sentence. If the sentence contains any factual error, the reward is -1; otherwise, the reward is +1.
C3: incomplete information (comp.). The reward model checks whether the response is complete and covers all the information in the reference passages that is relevant to the question. It gives a single reward for the whole response.
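As a rough illustration (not the paper's exact implementation), the sketch below scores a response with the three reward models at their respective granularities and applies per-type weights; the splitters, reward models, and weight values are all assumptions for illustration.

```python
# Illustrative sketch only: apply the three fine-grained reward models at their
# respective density levels. Splitters, reward models, and weights are hypothetical.
from typing import Callable, Dict, List, Union

def qa_feedback_rewards(
    response: str,
    rel_rm: Callable[[str], float],    # C1: +1 / -1 per sub-sentence
    fact_rm: Callable[[str], float],   # C2: +1 / -1 per sentence
    comp_rm: Callable[[str], float],   # C3: one score for the whole response
    split_subsentences: Callable[[str], List[str]],
    split_sentences: Callable[[str], List[str]],
    w_rel: float = 0.3,
    w_fact: float = 0.5,
    w_comp: float = 0.3,  # example weights for illustration, not the paper's values
) -> Dict[str, Union[List[float], float]]:
    """Return weighted rewards at each granularity (sub-sentence, sentence, response)."""
    return {
        "rel": [w_rel * rel_rm(s) for s in split_subsentences(response)],
        "fact": [w_fact * fact_rm(s) for s in split_sentences(response)],
        "comp": w_comp * comp_rm(response),
    }
```

Adjusting the weights is how the trade-off between feedback types can be controlled; for example, raising the completeness weight relative to the relevance weight would push the policy toward longer, more detailed answers.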
We compare our Fine-Grained RLHF with the following baselines:
SFT: The supervised finetuning model (trained on 1K training examples) that is used as the initial policy for our RLHF experiments.
Pref. RLHF: The baseline RLHF model that uses holistic reward.
SFT-Full: We fine-tune the LM on human-written responses (provided by ASQA) for all training examples and denote this model SFT-Full. Note that each gold response takes 15 minutes to annotate (according to ASQA), much longer than our feedback annotation (6 minutes).
Human evaluation shows that our Fine-Grained RLHF outperforms SFT and Preference RLHF on all error types. In addition, RLHF (both preference-based and fine-grained) is particularly effective at reducing factual errors.
@article{wu2023fine,
title={Fine-Grained Human Feedback Gives Better Rewards for Language Model Training},
author={Wu, Zeqiu and Hu, Yushi and Shi, Weijia and Dziri, Nouha and Suhr, Alane and Ammanabrolu, Prithviraj and Smith,
Noah A and Ostendorf, Mari and Hajishirzi, Hannaneh},
journal={arXiv preprint arXiv:2306.01693},
year={2023}
}