AIDB Daily Papers
WRAP++: Web Discovery Amplified Pretraining
Note: The title and key points are automatically generated by AI. Please refer to the original paper for accurate details.
Key Points
- Proposes WRAP++, a new data synthesis method for large language model (LLM) pretraining that exploits relationships across multiple web documents.
- Whereas existing methods are confined to knowledge within a single document, WRAP++ discovers cross-document relationships from hyperlinks and generates QA data from them, amplifying knowledge.
- In experiments on Wikipedia, models trained with WRAP++ substantially outperformed single-document-based methods, and sustained performance gains with scale were also confirmed.
Abstract
Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.
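To make the discovery step concrete, here is a minimal Python sketch, not taken from the paper, of the two relational motifs the abstract names: dual-links (two pages that link to each other) and co-mentions (read here as two pages linked from a common third page; the paper's exact definition may differ). The toy graph, function names, and the co-mention reading are all illustrative assumptions.

```python
from itertools import combinations

# Toy hyperlink graph: page title -> set of pages it links to.
# Only the structure (not this data) mirrors what WRAP++ mines from Wikipedia.
links = {
    "Marie Curie": {"Pierre Curie", "Radioactivity", "Sorbonne"},
    "Pierre Curie": {"Marie Curie", "Radioactivity"},
    "Radioactivity": {"Henri Becquerel"},
    "Sorbonne": set(),
    "Henri Becquerel": {"Radioactivity"},
}

def dual_links(graph):
    """Pairs (a, b) where a links to b AND b links to a."""
    pairs = set()
    for a, targets in graph.items():
        for b in targets:
            if a in graph.get(b, set()):
                pairs.add(tuple(sorted((a, b))))
    return pairs

def co_mentions(graph):
    """Pairs (a, b) that are both linked from some third page
    (one plausible reading of 'co-mention'; an assumption here)."""
    pairs = set()
    for _page, targets in graph.items():
        for a, b in combinations(sorted(targets), 2):
            pairs.add((a, b))
    return pairs

if __name__ == "__main__":
    print("dual-links :", dual_links(links))
    print("co-mentions:", co_mentions(links))
    # In WRAP++, each discovered pair would then be handed, together with
    # both documents' text, to an LLM prompt that writes QA requiring
    # reasoning over both sources.
```

Because co-mentions pair up every two out-links of a page, the number of candidate pairs grows roughly quadratically in link density, which is the combinatorial amplification the abstract describes.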