Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning
Abstract
A large-scale, high-quality, and fully open dataset for text-to-image fine-tuning is presented, featuring over 6 million text-image pairs with rigorous filtering for alignment and quality across multiple task combinations and visual styles.
High-quality and open datasets remain a major bottleneck for text-to-image (T2I) fine-tuning. Despite rapid progress in model architectures and training pipelines, most publicly available fine-tuning datasets suffer from low resolution, poor text-image alignment, or limited diversity, resulting in a clear performance gap between open research models and enterprise-grade models. In this work, we present Fine-T2I, a large-scale, high-quality, and fully open dataset for T2I fine-tuning. Fine-T2I spans 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates, and combines synthetic images generated by strong modern models with carefully curated real images from professional photographers. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality, with over 95% of initial candidates removed. The final dataset contains over 6 million text-image pairs, around 2 TB on disk, approaching the scale of pretraining datasets while maintaining fine-tuning-level quality. Across a diverse set of pretrained diffusion and autoregressive models, fine-tuning on Fine-T2I consistently improves both generation quality and instruction adherence, as validated by human evaluation, visual comparison, and automatic metrics. We release Fine-T2I under an open license to help the open community close the data gap in T2I fine-tuning.
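The abstract does not specify how the text-image alignment filtering is implemented. As a rough illustration only, the sketch below shows one common way such a filter is built: scoring each pair with an off-the-shelf CLIP model and dropping pairs below a similarity cutoff. The checkpoint name, the 0.28 threshold, and the `keep_pair` helper are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of CLIP-based text-image alignment filtering.
# Assumptions (not from the paper): the CLIP checkpoint and the
# 0.28 similarity threshold are placeholders for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

@torch.no_grad()
def clip_alignment_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=[image],
                       return_tensors="pt", padding=True, truncation=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()

def keep_pair(image: Image.Image, prompt: str, threshold: float = 0.28) -> bool:
    # Pairs below the alignment cutoff are discarded; stacked with
    # fidelity and prompt-quality filters, such a cascade can remove
    # the vast majority of raw candidates, consistent with the >95%
    # rejection rate the abstract reports.
    return clip_alignment_score(image, prompt) >= threshold
```

In a full pipeline this scoring pass would typically run batched on GPU over the candidate pool, with the threshold tuned against a small human-labeled validation set.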
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- VIBE: Visual Instruction Based Editor (2026)
- Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models (2026)
- ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images (2026)
- DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset (2026)
- MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods (2026)
- Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing (2026)
- OpenView: Empowering MLLMs with Out-of-view VQA (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`