Evaluating Language Models with Language Models (3.4) Shepherd

CreateMoMo
7 min read · Mar 21, 2024


To request the editable Slides (free) of this or other articles, send a private message with “Slides”.

Review

In the first post about Shepherd, we described how Shepherd works: send a message or question to a language model, receive a response from that language model, and then let Shepherd evaluate the quality of that response in different aspects.

The second and third posts described how to construct Shepherd’s training dataset.

In this final post, let’s talk about how the authors show that Shepherd really outperforms other models on the task of evaluating the response quality of a language model.

Shepherd (4)

In the paper, the models compared with Shepherd are Alpaca-7B, SelFee-7B, and ChatGPT-3.5.

It doesn’t matter if you’re unfamiliar with these models, as it won’t affect your understanding of this article.

Comparison Method 1

One way of comparing them is easy to think of: ask each model to provide its own evaluation, and then ask a more powerful model, GPT-4, to score each evaluation individually.

Step1: We send a Question/Message to a language model and receive an Answer/Response from that language model.

Step2: The Question/Message and Answer/Response are then given to each of the four models (Shepherd, Alpaca-7B, SelFee-7B, and ChatGPT-3.5) for evaluation, and a Critique/Feedback is obtained from each model.

For ease of description, in the following, we refer to Question/Message as Question, Answer/Response as Answer, and Critique/Feedback as Critique.
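
To make Steps 1–2 a bit more concrete, here is a minimal Python sketch of how one might collect a Critique from each candidate critic model. The prompt wording and the generate() helper are illustrative assumptions, not the exact prompt or inference code used in the paper.

```python
# Illustrative sketch of Step 2: asking each critic model for a Critique.
# generate(model, prompt) is a placeholder for whatever inference API you
# use (e.g. a local LLaMA checkpoint or a hosted endpoint); the prompt wording
# is an assumption, not the paper's exact prompt.

def build_critique_prompt(question: str, answer: str) -> str:
    return (
        "You are given a question and a candidate answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Point out any factual errors or missing information in the answer "
        "and explain how it could be improved."
    )

def collect_critiques(question: str, answer: str, critics: dict, generate) -> dict:
    """Return {model_name: critique} for every critic model."""
    prompt = build_critique_prompt(question, answer)
    return {name: generate(model, prompt) for name, model in critics.items()}
```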

Step3: Let GPT-4 score the Critique provided by each model, based on the Question and Answer (the full prompt given to GPT-4 can be found in the paper). After scoring, we can see which model provided the best Critique (the highest score).
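
As a rough illustration of Step 3, the sketch below asks GPT-4 to score a single Critique on a numeric scale. The prompt wording and the 1–7 scale are assumptions for illustration only (the paper contains the exact prompt the authors used), and it assumes the official openai Python client with an OPENAI_API_KEY set in the environment.

```python
# Illustrative sketch of Step 3 (Method 1): GPT-4 scores one Critique at a time.
# The prompt and the 1-7 scale are assumptions; see the paper for the real prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_critique(question: str, answer: str, critique: str) -> int:
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Critique of the answer: {critique}\n\n"
        "On a scale from 1 (useless) to 7 (excellent), how accurate and helpful "
        "is this critique? Reply with a single integer."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```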

This approach of having GPT-4 independently score each evaluation result seems reasonable, but the authors found that it does not work. Why not?

Reason 1: GPT-4 is “too afraid of making humans unhappy”. Even after being told the correct answer, GPT-4 will still give an incorrect Answer a high score.

Reason 2: Probably for the same reason, GPT-4 is so polite to humans that it rarely gives a Critique a low score.

Reason 3: GPT-4 is easily swayed by the “appearance” of an Answer. Regardless of whether the actual content is good or bad, as long as the format of the Answer is to GPT-4’s liking, GPT-4 will tend to give it a high score. In other words, as long as a low-quality Answer is “nicely packaged”, GPT-4 will still give it a high score.

In short, while the above method matches how humans are used to doing things, it is not practically feasible, because we cannot yet treat GPT-4 as a reliable judge (at least for the tasks involved in this paper).

Comparison Method 2

Since the above method does not work, what should be done? The authors eventually settled on a reliable comparison method: pairwise comparison.

Step1&2: As in Method 1, we send a Question to a language model → get that language model’s Answer → give the Question and Answer to the four models (Shepherd, Alpaca-7B, SelFee-7B, ChatGPT-3.5) for evaluation → get each model’s Critique.

Step3: Pairwise comparisons. Let GPT-4 judge Shepherd’s Critique against each other model’s Critique, i.e. Shepherd vs. Alpaca-7B, Shepherd vs. SelFee-7B, and Shepherd vs. ChatGPT-3.5.
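
Here is a minimal sketch of what such a pairwise comparison could look like in code. The prompt, the judge callable, and the order-swapping trick (a common way to reduce position bias) are illustrative assumptions; the paper’s actual prompt and tie handling may differ.

```python
# Illustrative sketch of Step 3 (Method 2): pairwise comparison of two Critiques.
# judge(prompt) is assumed to wrap GPT-4 and return "A", "B", or "tie".

def compare_prompt(question, answer, critique_a, critique_b) -> str:
    return (
        f"Question: {question}\nAnswer: {answer}\n\n"
        f"Critique A: {critique_a}\n\nCritique B: {critique_b}\n\n"
        "Which critique is more accurate and helpful? Reply with A, B, or tie."
    )

def pairwise_winner(question, answer, shepherd_critique, other_critique, judge) -> str:
    # First pass: Shepherd's Critique is shown as A.
    first = judge(compare_prompt(question, answer, shepherd_critique, other_critique))
    # Second pass with the order swapped, to reduce position bias.
    second = judge(compare_prompt(question, answer, other_critique, shepherd_critique))
    second = {"A": "B", "B": "A"}.get(second, "tie")  # map back to the original order
    # Only count a win/loss if both orderings agree; otherwise call it a tie.
    return first if first == second else "tie"  # "A" = Shepherd wins, "B" = the other model wins
```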

At this point, you may have a new concern: how can the authors be sure that Method 2 works?

In Steps 1–3 above, the authors collected GPT-4’s verdicts on Shepherd versus the other models. A second set of results now needs to be collected from human annotators. If the human annotators also generally find that Shepherd provides better Critiques than the other models, i.e. if their verdicts agree with GPT-4’s, then Method 2 works.

Collecting results from the human annotators: since human judgement is more stable than GPT-4’s, there is no need for pairwise comparison here. We simply give the annotators all the Critiques from Shepherd, Alpaca-7B, SelFee-7B, and ChatGPT-3.5, let them score each Critique, and then sort the Critiques by score.
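
As a small sketch of how these human scores can be made comparable with GPT-4’s pairwise verdicts, the per-Critique scores can be converted into win rates, for example as below. Counting a tie as half a win is an assumption for illustration, not necessarily the convention used in the paper.

```python
# Illustrative sketch: turning per-Critique human scores into a win rate.
# scores maps a model name to a list of scores, one per evaluated example,
# aligned across models (scores[m][i] all refer to the same Question/Answer).

def win_rate(scores: dict, model: str, baseline: str) -> float:
    wins = ties = 0
    for s_model, s_base in zip(scores[model], scores[baseline]):
        if s_model > s_base:
            wins += 1
        elif s_model == s_base:
            ties += 1
    # Counting a tie as half a win is one common convention (an assumption here).
    return (wins + 0.5 * ties) / len(scores[model])

# Example: win_rate(human_scores, "Shepherd", "ChatGPT-3.5")
```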

The paper presents two tables: one shows the comparison results when GPT-4 is the judge, and the other shows the results when the human annotators are the judges. The numbers in the tables are Shepherd’s win rates against each of the other models in different types of task scenarios.

Don’t worry if you’re not familiar with the details of tasks such as AlpacaFarm or FairEval. You can simply think of Shepherd and the other models as “competitors” taking tests of different “skills”, with the “judges” (GPT-4 or the human annotators) ranking the competitors based on their performance.

The tables show that, in general, the comparison results obtained using Method 2 (pairwise comparisons judged by GPT-4) are similar to those obtained from the human annotators. This suggests that the conclusion the authors draw with Method 2, namely that Shepherd is better than the other models at evaluating the quality of Answers, is reasonable.

Additional details

There are some other details in the paper that are worth noting.

1. Shepherd, Alpaca-7B, and SelFee-7B are all obtained by fine-tuning LLaMA-7B. Shepherd performs well on the task of evaluating Answer quality after fine-tuning on only about 8K training examples (a rough sketch of how such examples might be formatted follows after this list).

2. Different models have different “characters”.

  • Alpaca-7B often produces Critiques that contain errors, and it tends to give positive Critiques to all Answers, regardless of their quality.
  • SelFee-7B tends to give vague, broad Critiques: it does not explicitly point out mistakes, it often ignores the content of the Answer, and it frequently answers the Question itself instead of judging whether the Answer is good or bad.
  • Shepherd provides more actionable Critiques, explicitly indicating how an Answer could be improved.
  • On the task of evaluating Answer quality and providing Critiques, Shepherd and ChatGPT-3.5 perform about the same.
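
As a rough illustration of point 1, the sketch below shows how (Question, Answer, Critique) triples might be formatted into prompt/target pairs for instruction-style fine-tuning of a LLaMA-7B critic. The field names and the prompt template are assumptions; the paper’s actual preprocessing may differ.

```python
# Illustrative sketch: formatting (question, answer, critique) triples into
# prompt/target pairs for instruction-style fine-tuning of a critic model.
# The field names and the template are assumptions, not the paper's exact format.

def format_example(example: dict) -> dict:
    prompt = (
        "Below is a question and a candidate answer. Write a critique of the "
        "answer, pointing out errors and suggesting improvements.\n\n"
        f"Question: {example['question']}\n"
        f"Answer: {example['answer']}\n"
        "Critique:"
    )
    return {"prompt": prompt, "target": example["critique"]}

# With ~8K such pairs, the critic can then be trained with any standard
# causal-LM fine-tuning loop (e.g. Hugging Face Trainer, optionally with LoRA).
```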

Summary

In this series we have presented two evaluator models: PandaLM and Shepherd. PandaLM works by comparing two Answers against each other, whereas Shepherd evaluates a single Answer directly.

In addition, both models provide Critiques. However, there are significant style differences between the Critiques they provide.

Finally, we would like to emphasise once again: whether it is PandaLM or Shepherd, a model that evaluates answer quality must have the character of “daring to judge and not fearing humans”! This is important~

If you would like the slides for this series or any other articles, please follow my WeChat public account and leave me the message “Slides”. I understand you may not have a WeChat account; leaving a message via GitHub also works. To see the complete list of all published articles (in English), please visit https://createmomo.github.io/
