AIDB Daily Papers
WRAP++: Web Discovery Amplified Pretraining
Note: The title and key points are automatically generated by AI. Please refer to the original paper for accurate details.
Key Points
- Proposes WRAP++, a new data synthesis method for large language model (LLM) pretraining that exploits relationships across multiple web documents.
- Whereas existing methods are confined to knowledge within a single document, WRAP++ discovers cross-document relationships from hyperlinks and generates QA data from them, amplifying knowledge.
- In experiments on Wikipedia, models trained with WRAP++ substantially outperformed single-document-based methods, and sustained performance gains with scale were also confirmed.
Abstract
Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.
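To make the discovery step concrete, here is a minimal Python sketch, not taken from the paper, of the two relational motifs the abstract names: dual-links (two pages that link to each other) and co-mentions (read here as two pages linked from a common third page; the paper's exact definition may differ). The toy graph, function names, and the co-mention reading are all illustrative assumptions.

```python
from itertools import combinations

# Toy hyperlink graph: page title -> set of pages it links to.
# Only the structure (not this data) mirrors what WRAP++ mines from Wikipedia.
links = {
    "Marie Curie": {"Pierre Curie", "Radioactivity", "Sorbonne"},
    "Pierre Curie": {"Marie Curie", "Radioactivity"},
    "Radioactivity": {"Henri Becquerel"},
    "Sorbonne": set(),
    "Henri Becquerel": {"Radioactivity"},
}

def dual_links(graph):
    """Pairs (a, b) where a links to b AND b links to a."""
    pairs = set()
    for a, targets in graph.items():
        for b in targets:
            if a in graph.get(b, set()):
                pairs.add(tuple(sorted((a, b))))
    return pairs

def co_mentions(graph):
    """Pairs (a, b) that are both linked from some third page
    (one plausible reading of 'co-mention'; an assumption here)."""
    pairs = set()
    for _page, targets in graph.items():
        for a, b in combinations(sorted(targets), 2):
            pairs.add((a, b))
    return pairs

if __name__ == "__main__":
    print("dual-links :", dual_links(links))
    print("co-mentions:", co_mentions(links))
    # In WRAP++, each discovered pair would then be handed, together with
    # both documents' text, to an LLM prompt that writes QA requiring
    # reasoning over both sources.
```

Because co-mentions pair up every two out-links of a page, the number of candidate pairs grows roughly quadratically in link density, which is the combinatorial amplification the abstract describes.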