ROCO-Radiology-CLIP (ViT-B/32)

A specialized vision-language model for radiology, fine-tuned on the ROCO dataset.

This model aligns medical images (X-rays, CTs, MRIs) with their textual descriptions, enabling zero-shot classification and semantic search for radiology concepts.

Performance (Test Set)

Batch-wise Recall@1: 70.83% (State-of-the-art for T4 fine-tuning)
Batch-wise Recall@5: 96.99%
Global Retrieval Recall@1: ~6% (500x better than random chance)
Global Retrieval Recall@5: ~16%

Global retrieval recall is still quite low and needs substantial further work. These figures will be updated with a newer version of the model.
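The gap between the batch-wise and global numbers is easier to see with a small sketch. It assumes that "batch-wise" means each query is ranked only against the candidates in its own batch, while "global" ranks it against the whole test pool; the card does not state the batch size or pool size, so the matrix and batch size below are toy values, and the function names are illustrative rather than taken from the training code.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose true match (same index) ranks in the top k.

    similarity[i][j] is the score of query i against candidate j; the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the true match = number of candidates scoring strictly higher
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


def batchwise_recall_at_k(
    similarity: Sequence[Sequence[float]], k: int, batch_size: int
) -> float:
    """Recall@k where each query only competes with candidates in its own batch."""
    n = len(similarity)
    hits = 0.0
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Diagonal block: queries and candidates from the same batch
        block = [row[start:stop] for row in similarity[start:stop]]
        hits += recall_at_k(block, k) * (stop - start)
    return hits / n


def random_recall_at_k(num_candidates: int, k: int) -> float:
    """Expected Recall@k of a uniformly random ranking."""
    return min(k, num_candidates) / num_candidates


# Toy 4x4 image-to-text similarity matrix (hypothetical values)
sim = [
    [0.90, 0.10, 0.95, 0.20],  # true match beaten by candidate 2 (other batch)
    [0.20, 0.80, 0.10, 0.30],
    [0.10, 0.20, 0.70, 0.30],
    [0.60, 0.10, 0.20, 0.50],  # true match beaten by candidate 0 (other batch)
]

print("global R@1:    ", recall_at_k(sim, k=1))
print("batch-wise R@1:", batchwise_recall_at_k(sim, k=1, batch_size=2))
print("random R@1:    ", random_recall_at_k(len(sim), k=1))
```

In the toy matrix, the two hard distractors sit outside the query's batch, so they lower the global score but never affect the batch-wise one. The same effect, at a much larger scale, is one plausible reason the batch-wise figures above are so much higher than the global ones. Taking the quoted 500x factor at face value, a Recall@1 of about 6% would correspond to a pool of roughly 500 / 0.06 ≈ 8,300 candidates.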

Usage

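import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Fine-tuned weights from this repo; the processor comes from the base checkpoint
model = CLIPModel.from_pretrained("spicy03/CLIP-ROCO-v1")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot classification: score one image against candidate text labels
image = Image.open("chest_xray.jpg").convert("RGB")  # CLIP expects 3-channel input
labels = ["Pneumonia", "Normal", "Edema"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the label axis gives one probability per label
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)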
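For semantic search, the ranking step can be sketched independently of the model. The example below assumes the embeddings have already been computed, for instance with `CLIPModel.get_text_features` for the query and `CLIPModel.get_image_features` for the image gallery, and converted to plain lists of floats. The `search` helper and the toy vectors are illustrative and not part of this repository.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def search(
    query_embedding: Sequence[float],
    gallery_embeddings: Sequence[Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (gallery index, cosine similarity) pairs, best match first."""
    q = l2_normalize(query_embedding)
    scored = []
    for idx, emb in enumerate(gallery_embeddings):
        g = l2_normalize(emb)
        # Dot product of unit vectors = cosine similarity
        scored.append((idx, sum(a * b for a, b in zip(q, g))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-D embeddings standing in for real CLIP features (hypothetical values)
query = [1.0, 0.0]
gallery = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
results = search(query, gallery, top_k=2)
print(results)
```

Normalizing both sides first makes the ranking depend only on direction, which matches how CLIP compares image and text embeddings. For a large gallery, the image embeddings would normally be computed once and cached so that each query only requires a single text encoding.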

Model size: 0.2B params
Tensor type: F32
Format: Safetensors

Base model: openai/clip-vit-base-patch32
