DocAtlas: Multilingual Document Understanding Across 80+ Languages
Abstract
The DocAtlas framework creates high-fidelity OCR datasets across 82 languages using differential rendering and synthetic generation, and demonstrates improved multilingual model adaptation through Direct Preference Optimization.
Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without relying on learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves by 1.7% over the strongest baseline.
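To make the DPO step concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. The abstract only states that rendering-derived ground truth serves as the positive signal; treating a model-generated transcription as the rejected sample, the value of beta, and the function names below are assumptions for illustration, not details confirmed by the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the abstract does not report it
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each full output under the
    trainable policy and under a frozen reference model.
    Assumption: "chosen" is the rendering-derived ground truth and
    "rejected" is some lower-quality output for the same page image.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z)
    return softplus(-logits)


if __name__ == "__main__":
    # Hypothetical log-probabilities, chosen only to exercise the function.
    print(dpo_loss(-10.0, -30.0, -12.0, -25.0))
```

The loss falls as the policy raises the likelihood of the chosen output relative to the reference model and lowers that of the rejected one. Because the objective is anchored to a frozen reference, it penalizes drift from the base model, which is one plausible reading of why the abstract reports no measurable base-language degradation.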
Community
DocAtlas is a framework for constructing high-fidelity multilingual OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks, using differential rendering to produce model-free structural annotations from native documents. Evaluating 16 models reveals persistent gaps in low-resource scripts; DPO with rendering-derived ground truth achieves stable cross-lingual transfer (+1.9% in-domain, +1.8% out-of-domain) without base-language degradation, where supervised fine-tuning collapses by up to 21%.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios (2026)
- TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction (2026)
- BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation (2026)
- Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training (2026)
- GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts (2026)
- MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing (2026)
- Multilingual Training and Evaluation Resources for Vision-Language Models (2026)
Get this paper in your agent:
hf papers read 2605.12623
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash