
Libraries

How to use Vchitect/ShotVL-7B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Vchitect/ShotVL-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Vchitect/ShotVL-7B")
model = AutoModelForImageTextToText.from_pretrained("Vchitect/ShotVL-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
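
The slice in the last line is there because `generate` returns the prompt tokens followed by the newly generated ones, so decoding the full sequence would echo the prompt. A minimal sketch of that trimming step with plain Python lists; the token IDs and the `strip_prompt` helper are made up for illustration and are not part of Transformers:

```python
# Illustrative token IDs only; real IDs come from the processor and the model.
prompt_ids = [101, 7, 8, 9]            # stands in for inputs["input_ids"][0]
output_ids = prompt_ids + [42, 43, 2]  # stands in for outputs[0]: prompt + new tokens


def strip_prompt(output_ids, prompt_len):
    """Return only the tokens that were generated after the prompt."""
    return output_ids[prompt_len:]


new_tokens = strip_prompt(output_ids, len(prompt_ids))
print(new_tokens)  # [42, 43, 2]
```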

Local Apps

vLLM

How to use Vchitect/ShotVL-7B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Vchitect/ShotVL-7B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Vchitect/ShotVL-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
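
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000`; `build_chat_payload` and `chat` are hypothetical helper names, and the cinematography prompt is just an example.

```python
import json
import urllib.request


def build_chat_payload(model, prompt, image_url):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url, payload, timeout=120):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "Vchitect/ShotVL-7B",
    "What's the shot size of this shot?",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# With the server running, send the request like this:
# print(chat("http://localhost:8000", payload))
```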

Use Docker

docker model run hf.co/Vchitect/ShotVL-7B

SGLang

How to use Vchitect/ShotVL-7B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Vchitect/ShotVL-7B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Vchitect/ShotVL-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
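
To query the server about a local frame instead of a public URL, one option is to inline the image as a base64 data URL in the `image_url` field. This is a sketch under the assumption that the server accepts data URLs there, as OpenAI-compatible servers commonly do; `to_data_url` is a hypothetical helper, and the three bytes below are only the JPEG magic number, not a real image.

```python
import base64
import mimetypes


def to_data_url(image_bytes, filename):
    """Encode raw image bytes as a data URL; the MIME type is guessed from the filename."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Typical use with a real file:
# with open("/path/to/image.jpg", "rb") as f:
#     url = to_data_url(f.read(), "image.jpg")

demo = to_data_url(b"\xff\xd8\xff", "frame.jpg")
print(demo)  # data:image/jpeg;base64,/9j/
```

The resulting string goes where the curl example has the `https://` link, inside `"image_url": {"url": ...}`.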

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Vchitect/ShotVL-7B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Vchitect/ShotVL-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Vchitect/ShotVL-7B with Docker Model Runner:
```
docker model run hf.co/Vchitect/ShotVL-7B
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Model description

This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct, introduced in the paper ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models. It is trained on the largest high-quality dataset for cinematic language understanding to date, and it currently achieves state-of-the-art performance on ShotBench, a comprehensive benchmark for evaluating cinematography understanding in vision-language models.

Project Page: https://vchitect.github.io/ShotBench-project/

Code: https://github.com/Vchitect/ShotBench

Demo

Image

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

device = "cuda"
device_map = "balanced"
dtype = torch.bfloat16
image_path = "/path/to/image.jpg"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
  "Vchitect/ShotVL-7B",
  device_map=device_map,
  attn_implementation="flash_attention_2",
  torch_dtype=dtype,
).eval()
processor = AutoProcessor.from_pretrained(
  "Vchitect/ShotVL-7B", revision="refs/pr/24", use_fast=True, torch_dtype=dtype
)

msgs = [
  {"role": "system", "content": "You are a helpful assistant."},
  {
    "role": "user",
    "content": [
      {"type": "image", "image": image_path},
      {"type": "text", "text": "What's the shot size of this shot?"},
    ],
  },
]

text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(msgs)
inputs = processor(
  text=[text],
  images=image_inputs,
  videos=video_inputs,
  padding=True,
  return_tensors="pt",
).to(device)

with torch.inference_mode():
  out_ids = model.generate(**inputs, max_new_tokens=640)
  
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
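
The demo requests `attn_implementation="flash_attention_2"`, which needs the separate `flash_attn` package. If that package may be missing, a small guard like the one below can pick a fallback. This is a sketch, not part of the original demo: `pick_attn_implementation` is a hypothetical helper, and `"sdpa"` is the Transformers option for PyTorch's built-in scaled dot-product attention.

```python
import importlib.util


def pick_attn_implementation():
    """Use FlashAttention-2 only when the flash_attn package is importable."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # fall back to PyTorch scaled dot-product attention


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as `attn_implementation=...` in the `from_pretrained` call above.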

Video

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

device = "cuda"
device_map = "balanced"
dtype = torch.bfloat16
video_path = "/path/to/video.mp4"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
  "Vchitect/ShotVL-7B",
  device_map=device_map,
  attn_implementation="flash_attention_2",
  torch_dtype=dtype,
).eval()
processor = AutoProcessor.from_pretrained(
  "Vchitect/ShotVL-7B", revision="refs/pr/24", use_fast=True, torch_dtype=dtype
)

question = (
    "What's the camera movement in this movie shot?\n"
    "Options:\nA. Boom down\nB. Boom up\nC. Push in\nD. Pull out\n"
    "Please select the most likely answer from the options above.\n"
)
msgs = [
  {"role": "system", "content": "You are a helpful assistant."},
  {
    "role": "user",
    "content": [
      {"type": "video", "video": video_path, "max_pixels": 360*640, "fps": 12.0},
      {"type": "text", "text": question},
    ],
  },
]

text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(msgs)
inputs = processor(
  text=[text],
  images=image_inputs,
  videos=video_inputs,
  padding=True,
  return_tensors="pt",
).to(device)

with torch.inference_mode():
  out_ids = model.generate(**inputs, max_new_tokens=640)
  
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
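
For multiple-choice prompts like the one above, it can help to build the question from a list of options and to map the model's reply back to an option letter. The helpers below are a hypothetical sketch, not part of the ShotBench code; the parsing is a simple heuristic that takes the first standalone capital letter in range, so a reply that begins with the article "A" would be misread as option A.

```python
import re

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def build_question(stem, options):
    """Format a question in the same layout as the demo prompt."""
    lines = [stem, "Options:"]
    lines += [f"{LETTERS[i]}. {option}" for i, option in enumerate(options)]
    lines.append("Please select the most likely answer from the options above.")
    return "\n".join(lines) + "\n"


def parse_choice(reply, num_options):
    """Return the first standalone option letter found in the reply, or None."""
    valid = LETTERS[:num_options]
    match = re.search(rf"\b([{valid}])\b", reply.strip())
    return match.group(1) if match else None


question = build_question(
    "What's the camera movement in this movie shot?",
    ["Boom down", "Boom up", "Push in", "Pull out"],
)
print(question)
print(parse_choice("C. Push in", 4))  # C
```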

Evaluation Results

Abbreviations: SS = *Shot Size*, SF = *Shot Framing*, CA = *Camera Angle*, LS = *Lens Size*, LT = *Lighting Type*, LC = *Lighting Conditions*, SC = *Shot Composition*, CM = *Camera Movement*. Underline marks previous best in each group.
**Our *ShotVL* models establish new SOTA.**

| Models | SS | SF | CA | LS | LT | LC | SC | CM | Avg |
|---|---|---|---|---|---|---|---|---|---|
| **Open-Sourced VLMs** | | | | | | | | | |
| Qwen2.5-VL-3B-Instruct | 54.6 | 56.6 | 43.1 | 36.6 | 59.3 | 45.1 | 41.5 | 31.9 | 46.1 |
| Qwen2.5-VL-7B-Instruct | 69.1 | 73.5 | 53.2 | 47.0 | 60.5 | 47.4 | 49.9 | 30.2 | 53.8 |
| LLaVA-NeXT-Video-7B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 31.3 | 34.4 |
| LLaVA-Video-7B-Qwen2 | 56.9 | 65.4 | 45.1 | 36.0 | 63.5 | 45.4 | 37.4 | 35.3 | 48.1 |
| LLaVA-Onevision-Qwen2-7B-Ov-Chat | 58.4 | 71.0 | 52.3 | 38.7 | 59.5 | 44.9 | 50.9 | 39.7 | 51.9 |
| InternVL2.5-8B | 56.3 | 70.3 | 50.8 | 41.1 | 60.2 | 45.1 | 50.1 | 33.6 | 50.9 |
| InternVL3-2B | 56.3 | 56.0 | 44.4 | 34.6 | 56.8 | 44.6 | 43.0 | 38.1 | 46.7 |
| InternVL3-8B | 62.1 | 65.8 | 46.8 | 42.9 | 58.0 | 44.3 | 46.8 | 44.2 | 51.4 |
| InternVL3-14B | 59.6 | 82.2 | 55.4 | 40.7 | 61.7 | 44.6 | 51.1 | 38.2 | 54.2 |
| Internlm-xcomposer2d5-7B | 51.1 | 71.0 | 39.8 | 32.7 | 59.3 | 35.7 | 35.7 | 38.8 | 45.5 |
| Ovis2-8B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 35.3 | 34.9 |
| VILA1.5-3B | 33.4 | 44.9 | 32.1 | 28.6 | 50.6 | 35.7 | 28.4 | 21.5 | 34.4 |
| VILA1.5-8B | 40.6 | 44.5 | 39.1 | 29.7 | 48.9 | 32.9 | 34.4 | 36.9 | 38.4 |
| VILA1.5-13B | 36.7 | 54.6 | 40.7 | 34.8 | 52.8 | 35.4 | 34.2 | 31.3 | 40.1 |
| Instructblip-vicuna-7B | 27.0 | 27.9 | 34.5 | 29.4 | 44.4 | 29.7 | 27.1 | 25.0 | 30.6 |
| Instructblip-vicuna-13B | 26.8 | 29.2 | 27.9 | 28.0 | 39.0 | 24.0 | 27.1 | 22.0 | 28.0 |
| InternVL2.5-38B | 67.8 | 85.4 | 55.4 | 41.7 | 61.7 | 48.9 | 52.4 | 44.0 | 57.2 |
| InternVL3-38B | 68.0 | 84.0 | 51.9 | 43.6 | 64.4 | 46.9 | 54.7 | 44.6 | 57.3 |
| Qwen2.5-VL-32B-Instruct | 62.3 | 76.6 | 51.0 | 48.3 | 61.7 | 44.0 | 52.2 | 43.8 | 55.0 |
| Qwen2.5-VL-72B-Instruct | 75.1 | 82.9 | 56.7 | 46.8 | 59.0 | 49.4 | 54.1 | 48.9 | 59.1 |
| InternVL3-78B | 69.7 | 80.0 | 54.5 | 44.0 | 65.5 | 47.4 | 51.8 | 44.4 | 57.2 |
| **Proprietary VLMs** | | | | | | | | | |
| Gemini-2.0-flash | 48.9 | 75.5 | 44.6 | 31.9 | 62.2 | 48.9 | 52.4 | 47.4 | 51.5 |
| Gemini-2.5-flash-preview-04-17 | 57.7 | 82.9 | 51.4 | 43.8 | 65.2 | 45.7 | 45.9 | 43.5 | 54.5 |
| GPT-4o | 69.3 | 83.1 | 58.2 | 48.9 | 63.2 | 48.0 | 55.2 | 48.3 | 59.3 |
| **Ours** | | | | | | | | | |
| ShotVL-3B | 77.9 | 85.6 | 68.8 | 59.3 | 65.7 | 53.1 | 57.4 | 51.7 | 65.1 |
| ShotVL-7B | 81.2 | 90.1 | 78.0 | 68.5 | 70.1 | 64.3 | 45.7 | 62.9 | 70.1 |
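
To see where fine-tuning changes the scores most, the ShotVL-7B row can be compared with its base model, Qwen2.5-VL-7B-Instruct. The snippet below is a small illustrative calculation that uses the two rows exactly as listed in the table above; it makes no claim about how the Avg column itself is computed.

```python
DIMS = ["SS", "SF", "CA", "LS", "LT", "LC", "SC", "CM", "Avg"]

# Rows copied from the table above.
qwen25_vl_7b = [69.1, 73.5, 53.2, 47.0, 60.5, 47.4, 49.9, 30.2, 53.8]
shotvl_7b = [81.2, 90.1, 78.0, 68.5, 70.1, 64.3, 45.7, 62.9, 70.1]

# Difference in points per dimension, rounded to one decimal like the table.
gains = {
    dim: round(ours - base, 1)
    for dim, ours, base in zip(DIMS, shotvl_7b, qwen25_vl_7b)
}

for dim, gain in gains.items():
    print(f"{dim}: {gain:+.1f}")
```

With the listed values, the largest difference is in CM (+32.7), and SC is the only dimension where the listed ShotVL-7B score is lower than the base model's (-4.2).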

BibTeX

@misc{
      liu2025shotbench,
      title={ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models}, 
      author={Hongbo Liu and Jingwen He and Yi Jin and Dian Zheng and Yuhao Dong and Fan Zhang and Ziqi Huang and Yinan He and Yangguang Li and Weichao Chen and Yu Qiao and Wanli Ouyang and Shengjie Zhao and Ziwei Liu},
      year={2025},
      eprint={2506.21356},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.21356}, 
    }