
Add comprehensive model card for EARL

#2 by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +194 -0
README.md ADDED
@@ -0,0 +1,194 @@

---
pipeline_tag: image-to-image
library_name: transformers
license: apache-2.0
---

# EARL: The Promise of RL for Autoregressive Image Editing

Official model for the paper [The Promise of RL for Autoregressive Image Editing](https://huggingface.co/papers/2508.01119).

[![arXiv](https://img.shields.io/badge/arXiv-2508.01119-b31b1b?style=flat-square)](https://arxiv.org/abs/2508.01119)
[![Code](https://img.shields.io/badge/GitHub-Code-keygen.svg?logo=github&style=flat-square)](https://github.com/saba96/EARL)
[![Models](https://img.shields.io/badge/%F0%9F%A4%97Hugging_Face-Model-ffd200?style=flat-square)](https://huggingface.co/Image-editing/imged_rl_grpo_sft.s_rl.sc/tree/ckpt_001999)

![EARL Teaser](https://github.com/saba96/EARL/raw/main/assets/teaser.png)

## Abstract
We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/saba96/EARL.

## Overview
EARL (Editing with Autoregression and RL) introduces a novel approach to image editing using an autoregressive multimodal model. It processes textual and visual tokens in a unified manner and leverages reinforcement learning combined with a large multi-modal LLM verifier to achieve strong performance across various image editing tasks. The model is designed for efficiency, using significantly less training data than comparable baselines, and pushes the frontier of autoregressive multimodal models on image editing.

## Usage
You can quickly try the model using vLLM for inference.

First, clone the official repository and install the prerequisites:
```bash
git clone https://github.com/saba96/EARL.git
cd EARL
python -m venv /path/to/envs/EARL
. /path/to/envs/EARL/bin/activate
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.8.4
pip install flash-attn==2.7.4.post1 --no-build-isolation
pip install -r requirements.txt
export PYTHONPATH=$(pwd)
```

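Optionally, run a quick sanity check that the pinned packages import correctly before moving on (a minimal sketch; the exact printed version strings are not important):

```python
# Optional environment sanity check for the installation above.
import flash_attn
import torch
import vllm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("vllm:", vllm.__version__, "| flash-attn:", flash_attn.__version__)
```
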
**Patch vLLM to support Emu3**:
This is a critical step. You need to edit the `registry.py` file in your vLLM installation.
```bash
vim /path/to/venv/lib/python3.10/site-packages/vllm/model_executor/models/registry.py
```
Add the following line to the `_MULTIMODAL_MODELS` dictionary around line 166:
```python
_MULTIMODAL_MODELS = {
    # add this line
    "Emu3ForCausalLM": ("llama", "LlamaForCausalLM"),
    # end of adding
    "AriaForConditionalGeneration": ("aria", "AriaForConditionalGeneration"),  # already exists
    # ... other models
}
```

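If you prefer not to edit the installed package, vLLM also lets you register the architecture at runtime through `ModelRegistry`. A minimal sketch, assuming the repository's `Emu3ForCausalLM` wrapper lives at `emu3.model.modeling_emu3_vllm` (the import path mentioned in the comments of the inference snippet below):

```python
from vllm import ModelRegistry

# Import path taken from the comments in the inference snippet below;
# adjust it if the repository layout differs.
from emu3.model.modeling_emu3_vllm import Emu3ForCausalLM

# Register the architecture before constructing the LLM engine.
ModelRegistry.register_model("Emu3ForCausalLM", Emu3ForCausalLM)
```
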
Then, run inference using the following Python code snippet. Ensure you have an image file ready (e.g., `./examples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png` from the original repository).

```python
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoTokenizer
from vllm import LLM, ModelRegistry, SamplingParams

# Ensure Emu3ForCausalLM is available or registered.
# If you cloned the repo, it should be importable from emu3.model.modeling_emu3_vllm
# For demonstration, we'll assume it's correctly handled by trust_remote_code or local setup.
# If you face issues, ensure the model's specific class is registered with vLLM's ModelRegistry.
# Example: from emu3.model.modeling_emu3_vllm import Emu3ForCausalLM
# ModelRegistry.register_model("Emu3ForCausalLM", Emu3ForCausalLM)


# --- Helper functions from original repo for image preprocessing ---
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
# -------------------------------------------------------------------

# Load the model with vLLM
path = 'Image-editing/imged_rl_grpo_sft.s_rl.sc'  # Model ID from Hugging Face Hub
llm = LLM(
    model=path,
    trust_remote_code=True,
    dtype="auto",  # or torch.bfloat16 if supported by your hardware
    gpu_memory_utilization=0.9,
    # Additional vLLM specific arguments if needed
)
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Prepare inputs
image_path = './examples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png'  # Replace with a path to your image
# The `load_image` function prepares the pixel values as expected by the model.
pixel_values = load_image(image_path, max_num=6).to(torch.bfloat16).cuda()  # Ensure image is loaded and moved to GPU

# Format the prompt
question = "Edit the image: change the color of the car to red."
prompt = (
    "A chat between a curious user and an AI assistant.\n"
    "USER: <image>\n"
    f"{question} ASSISTANT:"
)

sampling_params = SamplingParams(max_tokens=512, temperature=0.7)  # Adjust as needed

# In vLLM, for multimodal models, the image input might be handled internally
# or require specific passing depending on the model's vLLM integration.
# The `llm.generate` method typically handles a list of string prompts.
# For full multimodal interaction with vLLM, refer to the original EARL GitHub:
# https://github.com/saba96/EARL/blob/main/emu3/train_image_editing/vllm_inference.py

# This example illustrates the textual part of inference with vLLM,
# assuming the model's vLLM integration handles the image input when loading the model.
# A full end-to-end vLLM multimodal inference might look slightly different.
outputs = llm.generate([prompt], sampling_params)  # Pass prompt as a list for vLLM

response = outputs[0].outputs[0].text
print(f"User: {question}\nAssistant: {response}")
```
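
Since vLLM's `generate` accepts a list of prompts, several edit instructions can also be batched in a single call. A small sketch reusing `llm` and `sampling_params` from the snippet above (not the repository's full multimodal pipeline):

```python
# Batch several edit instructions in one vLLM call (sketch only; reuses
# `llm` and `sampling_params` defined in the snippet above).
questions = [
    "Edit the image: change the color of the car to red.",
    "Edit the image: make the sky look like a sunset.",
]
prompts = [
    f"A chat between a curious user and an AI assistant.\nUSER: <image>\n{q} ASSISTANT:"
    for q in questions
]
outputs = llm.generate(prompts, sampling_params)
for q, out in zip(questions, outputs):
    print(f"User: {q}\nAssistant: {out.outputs[0].text}")
```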

## Citation
If you find our work helpful or inspiring, please feel free to cite it.
```bibtex
@article{saba2025earl,
  title={The Promise of RL for Autoregressive Image Editing},
  author={Saba, Daniel and Tang, Sifei and Huang, Yifan and Liu, Meng and Ma, Jinxin and Liu, Zhian and Fu, Ruifeng and Zhu, Lei and Han, Jun and Zhang, Shang-Wen and Liu, Jing},
  journal={arXiv preprint arXiv:2508.01119},
  year={2025}
}
```