AIDB Daily Papers

Privasis: The Largest "Public" Private Dataset, Built from Scratch

Original title: Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch

Authors: Hyunwoo Kim, Niloofar Mireshghallah, Michael Duan, Rui Xin, Shuyue Stella Li, Jaehun Jung, David Acuna, Qi Pang, Hanshen Xiao, G. Edward Suh, Sewoong Oh, Yulia Tsvetkov, Pang Wei Koh, Yejin Choi

Published: 2026-02-03 | Fields: LLM, safety, datasets, AI, information, language, privacy, text

Note: The translated title and key points were generated automatically by AI. Refer to the original paper for the exact content.

Key Points

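Privasis is a dataset built by synthesizing privacy-sensitive data at scale, addressing the data scarcity that constrains AI agent research.
It surpasses existing datasets in both scale and diversity, and notably covers a wide range of document types, including medical, legal, and financial records.
Compact sanitization models trained on Privasis outperformed large language models such as GPT-5.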

Abstract

Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents--such as OpenClaw and Gemini Agent--are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale fully synthetic dataset entirely built from scratch--an expansive reservoir of texts with rich and diverse private information--designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale with quality, and far greater diversity across various document types, including medical history, legal documents, financial records, calendars, and text messages with a total of 55.1 million annotated attributes such as ethnicity, date of birth, workplace, etc. We leverage Privasis to construct a parallel corpus for text sanitization with our pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (<=4B) trained on this dataset outperform state-of-the-art large language models, such as GPT-5 and Qwen-3 235B. We plan to release data, models, and code to accelerate future research on privacy-sensitive domains and agents.

Metadata

arXiv ID: 2602.03183
Categories: cs.CL, cs.AI
