---
license: apache-2.0
tags:
- diffusion
- image-to-image
- depth-estimation
- optical-flow
- amodal-segmentation
---

# Scaling Properties of Diffusion Models for Perceptual Tasks

### CVPR 2025

**Rahul Ravishankar\*, Zeeshan Patel\*, Jathushan Rajasegaran, Jitendra Malik**

[[Paper](https://arxiv.org/abs/2411.08034)] · [[Project Page](https://scaling-diffusion-perception.github.io/)]

In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm not only for generation but also for visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve performance competitive with state-of-the-art methods while using significantly less data and compute.

## Getting started

You can download our DiT-MoE Generalist model [here](https://huggingface.co/zeeshanp/scaling_diffusion_perception/blob/main/dit_moe_generalist.pt). Full instructions for using the model are in the [GitHub README](https://github.com/scaling-diffusion-perception/scaling-diffusion-perception).
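
As a minimal sketch, the checkpoint can be fetched programmatically with `huggingface_hub` (using the repo id `zeeshanp/scaling_diffusion_perception` inferred from the download URL). Loading it with `torch.load` assumes a standard PyTorch checkpoint file; the model class and the exact loading code come from the GitHub repo.

```python
# Minimal sketch: download the DiT-MoE Generalist checkpoint and inspect it.
# Assumes the file is a standard PyTorch checkpoint; see the GitHub README
# for the model definition and the supported way to load and run it.
import torch
from huggingface_hub import hf_hub_download

# Repo id inferred from the model card's download link.
ckpt_path = hf_hub_download(
    repo_id="zeeshanp/scaling_diffusion_perception",
    filename="dit_moe_generalist.pt",
)

# Load on CPU first; move weights to GPU after building the model.
checkpoint = torch.load(ckpt_path, map_location="cpu")
print(type(checkpoint))
```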