AIDB Daily Papers
Beyond Precision: Importance-Aware Recall for Evaluating Factuality in Long-Form LLM Generation
Note: The Japanese title and key points were generated automatically by AI. Please consult the original paper for accurate details.
Key Points
- Proposes a new framework for evaluating the factuality of long-form text generated by LLMs.
- Whereas existing methods focus mainly on precision, this work also emphasizes coverage (recall) and introduces importance-based weighting.
- The analysis finds that while LLMs achieve high precision, recall remains a weakness, with particular room for improvement in covering the full set of relevant facts.
Abstract
Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this overlooks an equally important dimension of factuality: recall, whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that current LLMs perform substantially better on precision than on recall, suggesting that factual incompleteness remains a major limitation of long-form generation and that models are generally better at covering highly important facts than the full set of relevant facts.
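The abstract describes scoring a response along two axes: precision (the fraction of its atomic claims that are supported) and an importance-weighted recall (how much of the weighted reference-fact set the response covers). As a minimal sketch of that idea, not the paper's actual implementation, the helper below assumes claim verification and fact matching have already been done, and that each reference fact carries an importance weight derived from relevance and salience (the function name and data layout here are hypothetical):

```python
def factuality_scores(generated_claims, reference_facts):
    """Compute precision and importance-weighted recall.

    generated_claims: list of (claim_text, is_supported) pairs, where
        is_supported indicates the claim was verified against a
        knowledge source.
    reference_facts: list of (fact_text, weight, is_covered) triples,
        where weight is an importance score and is_covered indicates
        the fact appears in the generated response.
    """
    # Precision: share of generated claims that are supported.
    precision = sum(1 for _, ok in generated_claims if ok) / len(generated_claims)
    # Weighted recall: covered importance mass over total importance mass.
    total_weight = sum(w for _, w, _ in reference_facts)
    recall = sum(w for _, w, covered in reference_facts if covered) / total_weight
    # Harmonic mean combines the two into a single score.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 2 of 3 claims supported; covered facts carry 1.5 of 2.0 weight.
claims = [("claim A", True), ("claim B", True), ("claim C", False)]
facts = [("fact 1", 1.0, True), ("fact 2", 0.5, False), ("fact 3", 0.5, True)]
p, r, f1 = factuality_scores(claims, facts)
```

With uniform weights this reduces to ordinary recall; raising the weights of salient facts makes missing an important fact cost more than missing a peripheral one, which matches the paper's observation that models cover highly important facts better than the full relevant set.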