AIDB Daily Papers
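The Impact of Retrieved-Content Representation in RAG Pipelines
Note: The title and key points were generated automatically by AI. Please consult the original paper for the exact content.
Key points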
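- This study uses a controlled comparison to examine how retrieved content should be represented in a Retrieval-Augmented Generation (RAG) pipeline when its consumer is an LLM rather than a human reader.
- It measures how the representation of retrieved content affects LLM question-answering accuracy by comparing the original representation against thirteen transformations.
- It finds that answer retention, that is, whether a representation preserves the answer-bearing information of the original document, is the primary determinant of generation accuracy.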
Abstract
Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieval components designed for human readers. How retrieved content should be represented when the consumer is a large language model (LLM) rather than a human is less well understood. Recent work has proposed transformations of retrieved content and identified properties that affect generation, but each examines a single transformation or property in isolation, leaving open which features of a document's representation matter most. We address this with a controlled comparison: holding retrieval fixed, we vary only the representation of retrieved documents, comparing an original baseline against thirteen transformations spanning selection, summarisation, and reformulation, in query-dependent and query-independent variants. Across these fourteen representations we measure question-answering accuracy for four generators, and for each representation we also measure answer retention: whether a known answer-bearing document still supports its answer after transformation. We find that answer retention is the primary determinant of generator accuracy; notably, when retention is high, a representation's wording, structure, length, and query-dependence have limited effect. This suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content, an attribution that cannot be settled without controlling for retention.
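To make the notion of answer retention concrete, the following is a minimal sketch and not the paper's implementation. It assumes retention can be approximated by checking whether the gold answer string still appears in the transformed document; the abstract does not say how "supports its answer" is judged, so the substring test is a crude stand-in. The names `answer_retention`, `identity`, and `first_sentence`, and the two toy examples, are all hypothetical.

```python
import re
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def answer_retention(
    examples: Iterable[Tuple[str, str, str]],
    transform: Callable[[str, str], str],
) -> float:
    """Fraction of answer-bearing documents whose transformed text still
    contains the gold answer.

    ASSUMPTION: normalized substring matching is used as a proxy for
    "the document still supports its answer"; the paper may judge this
    differently.
    """
    kept = 0
    total = 0
    for query, document, answer in examples:
        transformed = transform(query, document)
        total += 1
        if normalize(answer) in normalize(transformed):
            kept += 1
    return kept / total if total else 0.0


# Two hypothetical representations: the original baseline and a naive
# query-independent selection that keeps only the first sentence.
def identity(query: str, document: str) -> str:
    return document


def first_sentence(query: str, document: str) -> str:
    return document.split(". ")[0]


# Toy (query, answer-bearing document, gold answer) triples for illustration.
examples = [
    (
        "Who wrote Hamlet?",
        "Hamlet is a tragedy. It was written by William Shakespeare.",
        "William Shakespeare",
    ),
    (
        "What is the capital of France?",
        "Paris is the capital of France. It lies on the Seine.",
        "Paris",
    ),
]

baseline_retention = answer_retention(examples, identity)
selection_retention = answer_retention(examples, first_sentence)
```

In this toy setup the baseline retains both answers, while the first-sentence selection drops the answer to the first question. Under the abstract's finding, a drop of this kind, rather than the transformation's wording or length, would be expected to drive any loss in generator accuracy.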