AIDB Daily Papers
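Understanding LLMs in Paper Screening: From Disagreements to Recommendations
Note: The Japanese title and key points were generated automatically by AI. Please check the original paper for the exact content.
Key points
- The study qualitatively analyzed the causes of disagreement in paper screening for systematic reviews using large language models (LLMs).
- Disagreements between LLMs and researchers were found to stem from identifiable causes, such as ambiguity in terms and overemphasis on keywords.
- The study proposes practical recommendations, such as validating the LLM's semantic understanding and using multiple LLMs.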
Abstract
Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed. In this study, we go beyond quantitative LLM-human agreement metrics and qualitatively investigate how and why LLMs fail. We also propose actionable recommendations. We analyzed disagreements between LLMs and researchers across six software engineering SRs and over 1,000 primary study papers. For each SR, papers were screened independently by human experts and LLMs in zero-shot mode, resulting in Kappa values ranging from 0.52 to 0.77. Qualitative analysis suggests that human-LLM disagreement results from recurring, identifiable causes, such as boundary ambiguity in key terms, keyword overemphasization, and incorrect topic inference. Based on these findings, we propose recommendations such as validating semantic understanding before deployment, running multiple LLMs, and focusing validation efforts on borderline cases. Future studies are needed to validate the impact of our recommendations, and community efforts are needed to develop normative guidelines on LLM usage in SRs.
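As a rough illustration of the agreement metric mentioned in the abstract, the sketch below computes a kappa value between human and LLM screening decisions and flags papers on which several LLMs disagree. It assumes the reported Kappa is Cohen's kappa for two raters, which the abstract does not state. The function names `cohen_kappa` and `borderline_indices`, the `include`/`exclude` labels, and the toy data are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long lists of categorical labels.

    Assumption: the abstract only says "Kappa"; Cohen's kappa is used here
    as one common choice for two raters.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)
    # Observed agreement: share of papers where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: derived from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def borderline_indices(decisions_per_model):
    """Indices of papers on which the models do not all agree.

    Hypothetical helper: treats any split vote among several LLMs as a
    borderline case worth manual validation.
    """
    return [
        i
        for i, votes in enumerate(zip(*decisions_per_model))
        if len(set(votes)) > 1
    ]


# Toy data (invented for illustration, not taken from the paper).
human = ["include", "include", "exclude", "exclude", "include",
         "exclude", "exclude", "exclude", "include", "exclude"]
llm_a = ["include", "exclude", "exclude", "exclude", "include",
         "exclude", "include", "exclude", "include", "exclude"]
llm_b = ["include", "include", "exclude", "exclude", "include",
         "exclude", "exclude", "exclude", "include", "exclude"]

kappa = cohen_kappa(human, llm_a)
to_review = borderline_indices([llm_a, llm_b])
print(f"kappa(human, llm_a) = {kappa:.3f}")
print(f"papers to review manually: {to_review}")
```

In this toy setup, the papers returned by `borderline_indices` are the ones a reviewer would check first, which loosely mirrors the recommendation to run multiple LLMs and focus validation effort on borderline cases.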