AIDB Daily Papers

The Patrologia Graeca Corpus: OCR, Annotation, and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions

Authors: Chahan Vidal-Gorène, Bastien Kindt

Published: 2026-03-10 | Fields: NLP, computer vision, datasets, machine learning, open source, annotation, language, text, OCR

Note: The Japanese title and key points were generated automatically by AI. Please refer to the original paper for the exact content.

Key Points

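Builds the Patrologia Graeca Corpus, a large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek.
Establishes a new benchmark for OCR on complex bilingual (Greek-Latin) layouts and degraded polytonic Greek typography.
Combines YOLO-based layout detection with CRNN-based text recognition, achieving a character error rate of 1.05% and a word error rate of 4.69%.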

Abstract

We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Through a dedicated pipeline combining YOLO-based layout detection and CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, largely outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.
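
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between the reference and the OCR output, divided by the reference length in characters or words. The function names and the NFC normalization step are assumptions made for illustration; the abstract does not state how the authors normalize text or tokenize words, so their evaluation protocol may differ.

```python
import unicodedata


def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    # Assumption: compare NFC forms so that precomposed and combining
    # encodings of the same polytonic letter count as equal.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    # Assumption: words are whitespace-separated tokens.
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    truth = "ἐν ἀρχῇ ἦν ὁ λόγος"
    ocr = "ἐν ἀρχη ἦν ὁ λόγος"  # one diacritic-bearing letter misread
    print(f"CER = {cer(truth, ocr):.4f}")
    print(f"WER = {wer(truth, ocr):.4f}")
```

In the usage example a single misread letter corrupts one of the five words, which illustrates why WER is typically several times higher than CER on the same output, as in the 1.05% and 4.69% figures reported above.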

Metadata

arXiv ID: 2603.09470
Category: cs.CV
