G-Eval
Motivation
The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity.
G-EVAL
In this paper, we propose G-EVAL, a framework for using LLMs with chain-of-thoughts (CoT) (Wei et al., 2022) to evaluate the quality of generated texts in a form-filling paradigm.
Process
By only feeding the Task Introduction and the Evaluation Criteria as a prompt, we ask LLMs to generate a CoT of detailed Evaluation Steps. Then we use the prompt along with the generated CoT to evaluate the NLG outputs.
G-EVAL is a prompt-based evaluator with three main components:
1) a prompt that contains the definition of the evaluation task and the desired evaluation criteria
2) a chain-of-thoughts (CoT), a set of intermediate instructions generated by the LLM that describe the detailed evaluation steps
3) a scoring function that calls the LLM and calculates the score based on the probabilities of the returned tokens
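The two-stage prompting described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the prompt layout, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions made for the example.

```python
from typing import Callable


def generate_evaluation_steps(task_intro: str, criteria: str,
                              llm: Callable[[str], str]) -> str:
    """Stage 1: ask the LLM to write its own CoT of evaluation steps.

    Only the task introduction and the criteria are given; the prompt
    ends with an open "Evaluation Steps:" slot for the model to fill.
    The exact wording of the prompt is a hypothetical template.
    """
    prompt = (
        f"{task_intro}\n\n"
        f"Evaluation Criteria:\n{criteria}\n\n"
        "Evaluation Steps:\n"
    )
    return llm(prompt).strip()


def build_scoring_prompt(task_intro: str, criteria: str, steps: str,
                         source: str, output: str) -> str:
    """Stage 2: combine the original prompt, the generated CoT, and the
    text to be judged into a form-filling prompt that ends at the score."""
    return (
        f"{task_intro}\n\n"
        f"Evaluation Criteria:\n{criteria}\n\n"
        f"Evaluation Steps:\n{steps}\n\n"
        f"Source Text:\n{source}\n\n"
        f"Generated Output:\n{output}\n\n"
        "Evaluation Form (scores ONLY):\n"
    )
```

The evaluation steps only need to be generated once per task and criterion; the same steps can then be reused in the scoring prompt for every output being evaluated.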
Problems with directly asking for a 1-5 score
1. For some evaluation tasks, one digit usually dominates the distribution of the scores, such as 3 for a 1-5 scale. This may lead to low variance in the scores and low correlation with human judgments.
2. LLMs usually only output integer scores, even when the prompt explicitly requests decimal values. This leads to many ties in evaluation scores which do not capture the subtle difference between generated texts.
Solution
To address these issues, we propose using the probabilities of output tokens from LLMs to normalize the scores and take their weighted summation as the final result. Formally, given a set of scores (such as 1 to 5) predefined in the prompt, $S = \{s_1, s_2, \ldots, s_n\}$, the probability of each score $p(s_i)$ is calculated by the LLM, and the final score is:

$$\text{score} = \sum_{i=1}^{n} p(s_i) \times s_i$$
- For GPT-3.5, we set decoding temperature to 0 to increase the model’s determinism
- For GPT-4, as it does not support the output of token probabilities, we set n = 20, temperature = 1, and top_p = 1 to sample 20 times and estimate the token probabilities from the samples
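The probability-weighted score, and the sampling-based estimate used when token probabilities are unavailable, can be sketched as below. This is a hedged illustration: the function names are hypothetical, and both the renormalization over the candidate scores and the discarding of unparseable samples are assumptions of this sketch rather than details confirmed by the text.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Sequence


def weighted_score(probs: Mapping[int, float]) -> float:
    """Return sum_i p(s_i) * s_i, renormalized over the candidate scores.

    Renormalizing guards against probability mass that the model put on
    tokens outside the score set (an assumption of this sketch).
    """
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("no probability mass on any candidate score")
    return sum(s * p for s, p in probs.items()) / total


def estimate_probs(samples: Iterable[str],
                   scale: Sequence[int] = (1, 2, 3, 4, 5)) -> Dict[int, float]:
    """Estimate p(s_i) as the relative frequency of each score among
    sampled completions (e.g. 20 samples at temperature 1).

    Samples that are not exactly one of the scores are ignored.
    """
    valid = {str(s): s for s in scale}
    counts = Counter(valid[t.strip()] for t in samples if t.strip() in valid)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no sample could be parsed as a score")
    return {s: counts[s] / n for s in scale}
```

For example, if 10 of 20 samples are "4" and the other 10 are "5", the estimated score is 4.5, a value that direct integer scoring could not produce and that helps break ties between outputs.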
Limitation of LLMs evaluator
An LLM-based evaluator may prefer outputs generated by the LLM itself over high-quality human-written texts.