AIDB Daily Papers
Building an AI Infrastructure for Tracking Data Use in Research Papers
Note: The title and key points on this page were generated automatically by AI. Refer to the original paper for the exact content.
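Key Points
- The authors built an infrastructure for tracking how datasets are used in research papers.
- Understanding the transparency, reproducibility, and impact of dataset use is important, but conventional NLP approaches struggled with the task, so the authors developed a new LLM-based method.
- A multitask GLiNER framework combined with synthetic data generation improved the accuracy and coverage of dataset mention extraction, relation identification, and usage-context classification.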
Abstract
While platforms like Google Scholar and Semantic Scholar track citations for academic papers, no comparable infrastructure exists for monitoring dataset usage in research literature, leaving the landscape of data use largely opaque. Addressing this gap is critical for transparency, reproducibility, and monitoring of impact, yet progress is hindered by inconsistent citation practices, scarce labeled data, and ambiguous references to datasets in the wild. Traditional NLP approaches struggle with these challenges, motivating the shift toward more adaptive, semantically rich models. Building on prior work using LLMs for data mention detection and synthetic data for bootstrapping training, this paper presents an updated methodology for scalable dataset monitoring. We introduce a multitask GLiNER-based framework that jointly performs dataset mention extraction, relation identification, and usage-context classification. To address label scarcity, the pipeline leverages synthetic data generation to produce training examples and LLM-based revalidation to filter incorrect mentions and enforce labeling consistency, together improving reliability, coverage, and output consistency across the training pipeline. This work advances the development of open-source tools for monitoring data use in research literature, contributing to the broader goal of generalizable, unconstrained dataset citation tracking.
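The revalidation step described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the record fields, the relation and usage-context labels, and the `toy_validator` rule are all hypothetical, and the validator is a simple rule-based stand-in for what the paper describes as an LLM-based check. The sketch only shows the shape of the idea: each candidate carries the three jointly predicted outputs, and a validator filters out incorrect mentions before they enter the training data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class DatasetMention:
    """One candidate record holding the three joint outputs.

    All field names and label values are hypothetical illustrations.
    """
    sentence: str       # source sentence the mention was extracted from
    mention: str        # surface form of the dataset mention
    relation: str       # hypothetical relation label
    usage_context: str  # hypothetical usage-context label


# Hypothetical list of terms too generic to count as a dataset name.
GENERIC_TERMS = {"data", "dataset", "the data", "survey"}


def toy_validator(record: DatasetMention) -> bool:
    """Rule-based stand-in for an LLM check (an assumption, not the paper's method).

    Accepts a record only if the mention is non-empty, literally appears in
    its sentence, and is not a generic term.
    """
    mention = record.mention.strip()
    return (
        bool(mention)
        and mention in record.sentence
        and mention.lower() not in GENERIC_TERMS
    )


def revalidate(
    records: List[DatasetMention],
    validator: Callable[[DatasetMention], bool],
) -> List[DatasetMention]:
    """Keep only the records the validator accepts, preserving order."""
    return [r for r in records if validator(r)]


sentence = "We analyze data from the Demographic and Health Survey."
candidates = [
    DatasetMention(sentence, "Demographic and Health Survey", "used_by_study", "primary"),
    DatasetMention(sentence, "data", "used_by_study", "primary"),               # generic term
    DatasetMention(sentence, "Living Standards Survey", "cited", "background"),  # not in sentence
]

kept = revalidate(candidates, toy_validator)
print([r.mention for r in kept])
```

In a real pipeline the validator would be replaced by a model call. Passing it in as a function keeps the filtering logic independent of whichever checker is used.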