Prior work in RLHF focused on collecting human preferences regarding the overall quality of language model (LM) outputs. However, this type of holistic feedback offers limited information. In our paper, we introduce fine-grained human feedback (e.g., which sub-sentence is irrelevant, which sentence is not truthful, which sentence is toxic) as an explicit training signal.
Our rewards are fine-grained in two aspects:
(a) Density: We provide a reward after each segment (e.g., a sentence) is generated, similar to OpenAI's "step-by-step process reward". We found that this approach is more informative than holistic feedback and, thus, more effective for RL.
(b) Multiple reward models associated with different feedback types: We employ multiple reward models to capture different types of feedback (e.g., factual inaccuracy, irrelevance, and information incompleteness). Interestingly, we observed that these reward models both complement and compete with each other. By adjusting the weights of the reward models, we can control the balance between the different types of feedback and tailor the LM for different tasks according to specific needs. For instance, some users may prefer short and concise outputs, while others may seek longer and more detailed responses.
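To make the two aspects concrete, here is a minimal sketch (under our own naming assumptions, not the released FineGrainedRLHF code) of how per-segment scores from several reward models could be combined into one dense reward with user-chosen weights:

```python
# Minimal sketch, not the released FineGrainedRLHF code: one reward per generated
# segment (density, aspect (a)), combined across multiple reward models with
# adjustable weights (aspect (b)). `split_into_segments` and the scorers in
# `reward_models` are hypothetical stand-ins.
from typing import Callable, Dict, List

def combined_segment_rewards(
    prompt: str,
    response: str,
    reward_models: Dict[str, Callable[[str, str], float]],  # name -> scorer(prompt, segment)
    weights: Dict[str, float],                               # name -> weight (tuned per use case)
    split_into_segments: Callable[[str], List[str]],         # e.g., a sentence splitter
) -> List[float]:
    """Return one scalar reward for each generated segment."""
    rewards = []
    for segment in split_into_segments(response):
        r = sum(weights[name] * scorer(prompt, segment)
                for name, scorer in reward_models.items())
        rewards.append(r)
    return rewards
```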
2023/07/04 Our codebase, FineGrainedRLHF, is released!
2023/06/05 QA-Feedback, our long-form QA dataset with human preferences + fine-grained feedback, is released!
2023/06/05 Paper is released!
The detoxification task aims to reduce toxicity in model generations. We use the Perspective API to measure toxicity; it returns a toxicity score between 0 (not toxic) and 1 (toxic).
We compare two kinds of rewards:
(a) Holistic Rewards for (non-)Toxicity: We use 1 - Perspective(y) as the reward for the full generation.
(b) Sentence-level (Fine-Grained) Rewards for (non-)Toxicity: We query the API after the model generates each sentence instead of waiting for the full sequence. For each generated sentence, we use -Δ(Perspective(y)) as its reward, i.e., how much the toxicity changes as a result of generating the current sentence.
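Below is a minimal sketch of the sentence-level reward in (b), assuming a hypothetical perspective_toxicity(text) wrapper around the Perspective API that returns a score in [0, 1], and treating the empty prefix as having toxicity 0:

```python
# Minimal sketch of the sentence-level (fine-grained) toxicity reward described in (b).
# `perspective_toxicity` is a hypothetical wrapper around the Perspective API.
from typing import Callable, List

def sentence_toxicity_rewards(
    sentences: List[str],
    perspective_toxicity: Callable[[str], float],  # returns toxicity in [0, 1]
) -> List[float]:
    """Reward each sentence by the (negative) change in toxicity it introduces."""
    rewards: List[float] = []
    prefix, prev_tox = "", 0.0  # assume the empty prefix has toxicity 0
    for sent in sentences:
        prefix = (prefix + " " + sent).strip()
        tox = perspective_toxicity(prefix)
        rewards.append(-(tox - prev_tox))  # -Δ(Perspective(y)): adding less toxicity -> higher reward
        prev_tox = tox
    return rewards
```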
Table 1 shows that our Fine-Grained RLHF with sentence-level fine-grained rewards attains the lowest toxicity and perplexity among all methods, while maintaining a similar level of diversity. Figure 2 shows that learning from the denser fine-grained reward is more sample efficient than learning from the holistic reward. One explanation is that the fine-grained reward locates where the toxic content is, which is a stronger training signal than a single scalar reward for the whole text.
We collect QA-Feedback, a long-form question answering dataset with human preferences and fine-grained feedback. QA-Feedback is based on ASQA, a dataset that focuses on answering ambiguous factoid questions.
There are three types of fine-grained human feedback, and we train a fine-grained reward model for each of them:
C1: irrelevance, repetition, and incoherence (rel.). The reward model operates at the sub-sentence level, i.e., it returns a score for each sub-sentence. If the sub-sentence is irrelevant, repetitive, or incoherent, the reward is -1; otherwise, the reward is +1.
C2: incorrect or unverifiable facts (fact.). The reward model operates at the sentence level, i.e., it returns a score for each sentence. If the sentence contains any factual error, the reward is -1; otherwise, the reward is +1.
C3: incomplete information (comp.). The reward model checks whether the response is complete and covers all the information in the reference passages that is relevant to the question. It gives a single reward for the whole response.
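As a rough illustration (not the paper's exact implementation), the sketch below scores a response with the three reward models at their respective granularities and applies per-type weights; the splitters, reward models, and weight values are all assumptions for illustration.

```python
# Illustrative sketch only: apply the three fine-grained reward models at their
# respective density levels. Splitters, reward models, and weights are hypothetical.
from typing import Callable, Dict, List, Union

def qa_feedback_rewards(
    response: str,
    rel_rm: Callable[[str], float],    # C1: +1 / -1 per sub-sentence
    fact_rm: Callable[[str], float],   # C2: +1 / -1 per sentence
    comp_rm: Callable[[str], float],   # C3: one score for the whole response
    split_subsentences: Callable[[str], List[str]],
    split_sentences: Callable[[str], List[str]],
    w_rel: float = 0.3,
    w_fact: float = 0.5,
    w_comp: float = 0.3,  # example weights for illustration, not the paper's values
) -> Dict[str, Union[List[float], float]]:
    """Return weighted rewards at each granularity (sub-sentence, sentence, response)."""
    return {
        "rel": [w_rel * rel_rm(s) for s in split_subsentences(response)],
        "fact": [w_fact * fact_rm(s) for s in split_sentences(response)],
        "comp": w_comp * comp_rm(response),
    }
```

Adjusting the weights is how the trade-off between feedback types can be controlled; for example, raising the completeness weight relative to the relevance weight would push the policy toward longer, more detailed answers.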
We compare our Fine-Grained RLHF with the following baselines:
SFT: The supervised finetuning model (trained on 1K training examples) that is used as the initial policy for our RLHF experiments.
Pref. RLHF: The baseline RLHF model that uses holistic reward.
SFT-Full: We fine-tune the LM on human-written responses (provided by ASQA) for all training examples and denote this model SFT-Full. Note that each gold response takes 15 minutes to annotate (according to ASQA), much longer than our feedback annotation (6 minutes).
Human evaluation shows that our Fine-Grained RLHF outperforms SFT and Preference RLHF on all error types. In addition, RLHF (both preference-based and fine-grained) is particularly effective at reducing factual errors.
@article{wu2023fine,
title={Fine-Grained Human Feedback Gives Better Rewards for Language Model Training},
author={Wu, Zeqiu and Hu, Yushi and Shi, Weijia and Dziri, Nouha and Suhr, Alane and Ammanabrolu, Prithviraj and Smith,
Noah A and Ostendorf, Mari and Hajishirzi, Hannaneh},
journal={arXiv preprint arXiv:2306.01693},
year={2023}
}