
G-Eval

by 채채씨 2023. 8. 16.

Motivation

The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity.
 

G-EVAL

In this paper, we propose G-EVAL, a framework of using LLMs with chain-of-thoughts (CoT) (Wei et al., 2022) to evaluate the quality of generated texts in a form-filling paradigm.
 

Process

By only feeding the Task Introduction and the Evaluation Criteria as a prompt, we ask LLMs to generate a CoT of detailed Evaluation Steps. Then we use the prompt along with the generated CoT to evaluate the NLG outputs.
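Concretely, the first stage can be written as a single LLM call that turns the task definition into evaluation steps. The sketch below is a minimal illustration assuming the OpenAI Python client; the task introduction and criteria text are paraphrased placeholders, not the paper's exact prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative task definition, loosely modeled on the summarization
# coherence setting; not the verbatim prompt from the paper.
TASK_INTRODUCTION = (
    "You will be given one summary written for a news article. "
    "Your task is to rate the summary on one metric."
)
EVALUATION_CRITERIA = (
    "Coherence (1-5): the collective quality of all sentences. "
    "The summary should be well-structured and well-organized."
)

def generate_evaluation_steps() -> str:
    """Stage 1: expand Task Introduction + Evaluation Criteria into
    detailed Evaluation Steps (the auto chain-of-thought)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"{TASK_INTRODUCTION}\n\n"
                f"Evaluation Criteria:\n{EVALUATION_CRITERIA}\n\n"
                "Evaluation Steps:"
            ),
        }],
        temperature=0,
    )
    return response.choices[0].message.content
```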
 

 
 

G-EVAL is a prompt-based evaluator with three main components (a sketch of how these pieces combine into a single evaluation prompt appears after the lists below):

1) a prompt that contains the definition of the evaluation task and the desired evaluation criteria
2) a chain-of-thoughts (CoT) that is a set of intermediate instructions generated by the LLM describing the detailed evaluation steps
3) a scoring function that calls the LLM and calculates the score based on the probabilities of the returned tokens.
 

Evaluation Task
Evaluation Criteria
Auto Chain-of-Thoughts
Scoring Function
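Putting these components together, the final evaluation call simply fills a form-like prompt with the generated CoT, the source text, and the NLG output. The helper below is a hedged sketch of that form-filling step; the field labels and function name are illustrative rather than the paper's exact template.

```python
def build_evaluation_prompt(task_introduction: str,
                            evaluation_criteria: str,
                            evaluation_steps: str,
                            source_text: str,
                            generated_text: str) -> str:
    """Assemble the form-filling evaluation prompt from the G-EVAL components."""
    return (
        f"{task_introduction}\n\n"
        f"Evaluation Criteria:\n{evaluation_criteria}\n\n"
        f"Evaluation Steps:\n{evaluation_steps}\n\n"
        f"Source Text:\n{source_text}\n\n"
        f"Generated Text:\n{generated_text}\n\n"
        "Evaluation Form (scores ONLY):\n"
        "- Score:"
    )
```

The LLM is then asked to fill in only the score field, which is what the scoring function discussed below operates on.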

 

Problems with directly asking for a 1-5 score

1. For some evaluation tasks, one digit usually dominates the distribution of the scores, such as 3 on a 1-5 scale. This can lead to low variance in the scores and low correlation with human judgments.
2. LLMs usually only output integer scores, even when the prompt explicitly requests decimal values. This leads to many ties in evaluation scores, which fail to capture subtle differences between generated texts.
 

Solution

To address these issues, we propose using the probabilities of output tokens from LLMs to normalize the scores and take their weighted summation as the final results. Formally, given a set of scores (like from 1 to 5) predefined in the prompt S = {s1, s2, ..., sn}, the probability of each score p(si) is calculated by the LLM, and the final score is the weighted sum

score = p(s1) × s1 + p(s2) × s2 + ... + p(sn) × sn,

i.e., the expected score under the model's output distribution (a sketch of this scoring function follows the settings below).

  • For GPT-3.5, we set decoding temperature to 0 to increase the model’s determinism
  • For GPT-4, as it does not support the output of token probabilities, we set n = 20, temperature = 1, top_p = 1 to sample 20 times to estimate the token probabilities
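As a rough sketch of the weighted scoring above, assuming the OpenAI chat completions API and the GPT-4 settings just listed (n = 20, temperature = 1, top_p = 1), p(si) can be estimated from the empirical frequency of each score across the sampled outputs; the function name and prompt handling are illustrative assumptions, not the authors' released code.

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()

def g_eval_score(prompt: str, scores=(1, 2, 3, 4, 5), n: int = 20) -> float:
    """Estimate p(s_i) by sampling the score n times, then return the
    probability-weighted sum: score = sum_i p(s_i) * s_i."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1,
        top_p=1,
        n=n,           # 20 samples to approximate the score-token distribution
        max_tokens=2,
    )
    counts = Counter(choice.message.content.strip() for choice in response.choices)
    total = sum(counts.get(str(s), 0) for s in scores)
    if total == 0:
        raise ValueError("No valid scores were returned by the model.")
    return sum(s * counts.get(str(s), 0) for s in scores) / total
```

For a model that exposes token probabilities directly (the GPT-3.5 setting in the paper), the same weighted sum can be computed from those probabilities instead of sampling.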

 

Limitation of LLM-based evaluators

An LLM-based evaluator may prefer outputs generated by the LLM itself over high-quality human-written texts.

 
