arxiv:2311.07622

Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval

Published on Nov 13, 2023

Authors:

Abstract

A novel pre-trained masked tuning approach reduces the gap between vision-language models and composed image retrieval by reformulating contrastive learning as a masked image-text triplet task.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Zero-shot composed image retrieval (ZS-CIR), which takes a textual modification and a reference image as a query to retrieve a target image without triplet labeling, has gained more and more attention in data mining. Current ZS-CIR research mainly relies on the generalization ability of pre-trained vision-language models, e.g., CLIP. However, the pre-trained vision-language models and CIR tasks have substantial discrepancies, where the vision-language models focus on learning the similarities but CIR aims to learn the modifications of the image guided by text. In this paper, we introduce a novel unlabeled and pre-trained masked tuning approach, which reduces the gap between the pre-trained vision-language model and the downstream CIR task. First, to reduce the gap, we reformulate the contrastive learning of the vision-language model as the CIR task, where we randomly mask input image patches to generate langlemasked image, text, imagerangle triplet from an image-text pair. Then, we propose a simple but novel pre-trained masked tuning method, which uses the text and the masked image to learn the modifications of the original image. With such a simple design, the proposed masked tuning can learn to better capture fine-grained text-guided modifications. Extensive experimental results demonstrate the significant superiority of our approach over the baseline models on four ZS-CIR datasets, including FashionIQ, CIRR, CIRCO, and GeneCIS. Our codes are available at https://github.com/Chen-Junyang-cn/PLI

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2311.07622

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2311.07622 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2311.07622 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.