Evaluating Language Models with Language Models (3.1) Shepherd

CreateMoMo
5 min read · Mar 18, 2024


To request the free editable slides for this or other articles, send a private message with “Slides”.

In the introduction, we mentioned that there are two main approaches to building automated evaluation models:

  • (1) Constructing test datasets with reference answers to evaluate models;
  • (2) Using language models to evaluate language models.

For approach (2), we introduced PandaLM, which trains a language model specifically to judge which of 2 responses is better.

Shepherd, the model introduced here, is different from PandaLM: it evaluates the quality of a single response.

Shepherd

Shepherd evaluates a response along 3 main directions:

  • Giving an overall judgement or some general feedback (this is similar to the behaviour of other evaluation models).
  • Suggesting improvements to the response; these are specific and actionable rather than generic suggestions, and they incorporate domain knowledge where required.
  • Identifying specific issues of varying degrees of severity in the response.

In general, most evaluation language models can easily achieve point 1, whereas Shepherd achieves not only point 1 but also points 2 and 3.

Shepherd Examples

The figure below shows 2 examples of Shepherd’s behaviour, in which Shepherd evaluates a language model’s response to a question. As we can see, Shepherd’s feedback largely covers the 3 directions of evaluation mentioned above.

Training Data

How do we obtain an evaluation model like Shepherd? Reading the whole paper, you will likely come away with the same impression: in a nutshell, it is all about “data engineering”. Shepherd’s behaviour comes from well-prepared data, and a large part of the paper is devoted to constructing the training dataset.

There are 2 different sources of training data:

  • Community Critique Data
  • Human-annotated data

Each item of Shepherd’s training data consists of 3 parts:

  • Question/Message
  • Answer/Response
  • Critique/Feedback

The / symbol here separates different names for the same thing: for example, some people refer to a language model’s output as a Response, while others prefer the word Answer.
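To make this structure concrete, here is a minimal sketch of how such a triple could be represented in Python. The class and field names are my own illustration, not the paper’s actual schema.

```python
from dataclasses import dataclass

# A minimal sketch of one Shepherd-style training example: the three text
# fields described above. Names are illustrative, not the paper's schema.
@dataclass
class CritiqueExample:
    question: str   # Question / Message
    answer: str     # Answer / Response
    critique: str   # Critique / Feedback

# Example (hypothetical content):
example = CritiqueExample(
    question="What causes the seasons on Earth?",
    answer="The seasons are caused by the Earth's distance from the Sun.",
    critique="Incorrect: seasons are caused by the tilt of the Earth's axis, "
             "not by changes in its distance from the Sun.",
)
```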

Community Critique Data

  • Question/Message: the title or sub-title of a community post is used as the question.
  • Answer/Response: a direct reply to the post is taken as the answer.
  • Critique/Feedback: replies to an answer are taken as critiques/feedback on that answer.

The above is only a very brief overview of how a training dataset is built from community data (more details on exactly how this works will follow later); a small sketch of this mapping is shown below.
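As a rough illustration of how a thread could be mapped to (question, answer, critique) triples, here is a sketch in Python. The Post and Comment structures are assumptions made for this example, and CritiqueExample comes from the earlier sketch; the paper’s actual pipeline differs in its details.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Comment:
    text: str
    replies: List["Comment"] = field(default_factory=list)

@dataclass
class Post:
    title: str                                   # used as the question
    comments: List[Comment] = field(default_factory=list)

def thread_to_examples(post: Post) -> List[CritiqueExample]:
    """Turn one community thread into (question, answer, critique) triples."""
    examples = []
    for answer in post.comments:         # direct reply to the post -> answer
        for critique in answer.replies:  # reply to that answer -> critique
            examples.append(CritiqueExample(
                question=post.title,
                answer=answer.text,
                critique=critique.text,
            ))
    return examples
```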

Data Quality

We know that data quality is extremely important to how well the model works, and the quality of community posts, of the answers, and of the feedback on those answers certainly varies.

To ensure the quality of the training dataset, the community data has to be cleaned substantially, and manual cleaning alone is not realistic. To clean the data automatically, Shepherd defines several scores (a small scoring sketch follows the list):

  • Question Score: people can vote on a post’s question (similar to Zhihu’s upvotes and downvotes). From the votes you can compute a value (the number of upvotes minus the number of downvotes), which indicates how likely the question is to be a good one.
  • Answer Score: computed in the same way, from votes on the answer.
  • Critique Score: likewise, computed from votes on the critique.
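Here is a minimal sketch of this “upvotes minus downvotes” scoring and a simple threshold-based filter, assuming the vote counts are available for each part of a triple. The threshold value and function names are my own assumptions; the paper applies its own cleaning criteria.

```python
def vote_score(ups: int, downs: int) -> int:
    """Rough quality proxy: number of upvotes minus number of downvotes."""
    return ups - downs

def keep_example(q_votes, a_votes, c_votes, min_score: int = 3) -> bool:
    """Keep a (question, answer, critique) triple only if every part clears
    the score threshold. Each *_votes argument is an (ups, downs) pair."""
    return all(
        vote_score(ups, downs) >= min_score
        for ups, downs in (q_votes, a_votes, c_votes)
    )

# e.g. keep_example((12, 1), (8, 0), (5, 1)) -> True
```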

Another potential problem with community data is that the topics covered by posts are highly diverse, and not every post on every topic necessarily deserves to be treated as part of the training data.

This is especially true of Pushshift Reddit, where the posts in many sections are mostly for entertainment, relaxation, or information sharing, and are not suitable for training Shepherd.

That’s why the paper restricts the data from Pushshift Reddit, choosing only posts from 16 carefully selected sections (i.e., subreddits) as training data.
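In code, such a restriction could look like a simple allowlist filter. The subreddit names below are placeholders rather than the paper’s actual list of 16, and the dictionary-based post format is an assumption for illustration.

```python
# Placeholder allowlist -- not the paper's actual 16 subreddits.
ALLOWED_SUBREDDITS = {
    "askscience",
    "AskHistorians",
    "explainlikeimfive",
    # ... remaining curated subreddits
}

def in_scope(post: dict) -> bool:
    """Keep only posts from the curated subreddits."""
    return post.get("subreddit") in ALLOWED_SUBREDDITS

# curated_posts = [p for p in raw_posts if in_scope(p)]
```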

Summary

Next, we will describe in detail how this work cleans the community data.

(To be continued)

If you like the slides for this series or any other articles, please follow my WeChat public account and leave me the message “Slides”. I understand you may not have a WeChat account; leaving a message via GitHub also works. To see the complete list of all published articles (in English), please visit https://createmomo.github.io/
