---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- multimodal
- video-understanding
- video-audio understanding
- video-captioning
- video-reasoning
- short video understanding
---
# ARC-Qwen-Video-7B
[Paper (arXiv)](https://arxiv.org/abs/2507.20939)
[Demo](https://arc.tencent.com/en/ai-demos/multimodal)
[Code](https://github.com/TencentARC/ARC-Hunyuan-Video-7B/tree/arc-qwen-video)
[ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B)
[ARC-Qwen-Video-7B](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B)
[ARC-Qwen-Video-7B-Narrator](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B-Narrator)
[Blog](https://tencentarc.github.io/posts/arc-video-announcement/)
[ShortVid-Bench](https://huggingface.co/datasets/TencentARC/ShortVid-Bench)
In this version, we switch the base model from the Hunyuan VLM used in [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) to [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and introduce [ARC-Qwen-Video-7B](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B) for understanding real-world short videos. We use the same training data and training stages; for a detailed introduction, please refer to [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B). The main distinctions are listed below:
| Feature | `ARC-Hunyuan-Video-7B` | `ARC-Qwen-Video-7B` |
| --- | --- | --- |
| **Base VLM** | Hunyuan-VL-7B-Pretrain | Qwen2.5-VL-7B-Instruct |
| **Frame Resolution**<br>*Each model uses a fixed frame resolution to maintain audio-video synchronization.* | Fixed at `640 x 640` | Fixed at `392 x 292` |
| **Frame Sampling** | • < 150s: 1 FPS<br>• > 150s: Uniformly sample 150 frames. | • < 300s: 1 FPS<br>• > 300s: Uniformly sample 300 frames. |
| **Audio-Video Synchronization** | • < 150s: Sum tokens from 1s audio + 1s video frame.<br>• 150-300s: Sum tokens from corresponding audio segment + video frame.<br>• > 300s: Split audio into 300 segments, use first 2s of each. | • < 300s: Sum tokens from 1s audio + 1s video.<br>• > 300s: Split audio into 300 segments, use middle 1s of each. |
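For intuition, here is a minimal sketch (not the repository's actual preprocessing code) of the ARC-Qwen-Video-7B rule in the table: 1 FPS for videos up to 300 s, 300 uniformly sampled frames otherwise, with each frame paired to a 1 s audio window.

```python
# Rough sketch of the ARC-Qwen-Video-7B frame/audio sampling rule described above;
# not the actual preprocessing code from the repository.
import numpy as np

MAX_FRAMES = 300

def sample_frame_times(duration_s: float) -> np.ndarray:
    """Timestamps (seconds) of the frames to sample."""
    if duration_s <= MAX_FRAMES:
        # <= 300 s: 1 FPS, one frame per second of video.
        return np.arange(0, int(duration_s), dtype=float)
    # > 300 s: uniformly sample 300 frames across the whole video.
    return np.linspace(0.0, duration_s, MAX_FRAMES, endpoint=False)

def audio_window(frame_time: float, duration_s: float) -> tuple[float, float]:
    """(start, end) of the 1 s audio window whose tokens are summed with the frame."""
    if duration_s <= MAX_FRAMES:
        # <= 300 s: the second of audio aligned with that second of video.
        start = frame_time
    else:
        # > 300 s: split the audio into 300 equal segments and take the middle 1 s
        # of the segment corresponding to this frame.
        seg_len = duration_s / MAX_FRAMES
        seg_idx = min(int(frame_time / seg_len), MAX_FRAMES - 1)
        start = seg_idx * seg_len + (seg_len - 1.0) / 2
    return start, start + 1.0
```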
We are also introducing a new model, [ARC-Qwen-Video-7B-Narrator](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B-Narrator). It can output **timestamped video descriptions, speaker identities, and the specific ASR (Automatic Speech Recognition) content**. By processing its output with an external LLM, you can obtain more comprehensive structured information as follows (Click to watch the video):
[Watch the example video on YouTube](https://www.youtube.com/watch?v=Bz1T4wCuWc8)
### Video Overview
This is a comedy short about a husband whose private stash of money, hidden in a padded coat, is accidentally discovered by his wife, who mistakes it for a "surprise" gift he prepared for her. Through a single phone call between the couple, the video vividly portrays the husband going from relaxed contentment, to stunned disbelief, to helpless collapse, with plenty of dramatic reversals and humor.
### Plot Breakdown
The plot unfolds around one phone call. The detailed timeline, scenes, speakers, and dialogue are as follows:
| Timestamp | Scene | Speaker | Dialogue (ASR) |
| --- | --- | --- | --- |
| 0:00 - 0:05 | The husband, wearing a shower cap and wrapped in a bath towel, leisurely takes selfies beside an indoor pool. | None | (no dialogue) |
| 0:05 - 0:10 | Cut to the wife in a clothing store, calling her husband with a blissful expression. | Wife | "Hey, honey, honey, I love you, love you, love you to bits, mwah mwah mwah." |
| 0:10 - 0:18 | The husband answers, puzzled by his wife's enthusiasm; she excitedly reveals the "surprise." | Husband | "Hey, what's going on? Why so happy?" |
| | | Wife | "Today I found the surprise you left me in my padded-coat pocket, ten thousand yuan!" |
| 0:18 - 0:27 | On hearing "ten thousand yuan," the husband's expression freezes, shifting from confusion to shock and regret, though he forces himself to stay calm. | Husband | "Huh? G-great, as long as you're happy." |
| 0:27 - 0:34 | The wife happily tells him what she used the money for; the husband's face goes completely rigid as his shock deepens. | Wife | "Of course I'm happy. I used it to buy a new outfit, I'll wear it for you when I get home tonight." |
| 0:34 - 0:46 | The husband confirms the money has already been spent and falls apart. The wife believes he authorized it; he can't help swearing. | Husband | "You've already spent it on clothes?" |
| | | Wife | "Of course! Isn't that what you said, to buy something I like? Honey, you're the best." |
| | | Husband | "You really are a spendthrift." |
| 0:46 - 0:59 | The wife notices something off in his tone; the husband immediately backpedals to cover it up and urges her to come home early. | Wife | "What? Honey, what did you say?" |
| | | Husband | "Huh? I said great, if you look pretty, I'm happy." |
| | | Wife | "You said it, honey. You have to come home early today, I'll be waiting." |
| | | Husband | "Fine, fine, fine." |
### Characters and Core Conflict
#### 1. Character Analysis
**Husband:**
* Behavior: hides private money; after it is discovered, tries hard to mask his true feelings (heartache, regret).
* Emotional arc: relaxed -> puzzled -> shocked -> broken -> resigned acceptance.
* Traits: cares about saving face; feels both affection and helplessness toward his wife; a classic "henpecked husband."

**Wife:**
* Behavior: on finding the money, takes it as an expression of her husband's love and spends it without hesitation.
* Emotional arc: immersed throughout in the joy of discovering the "surprise."
* Traits: naive, decisive with money, full of trust and affection for her husband.
#### 2. Core Conflict
The core conflict is the dramatic misunderstanding created by a severe information asymmetry:
* Husband's perspective: the 10,000 yuan he painstakingly stashed away is accidentally discovered and spent, a "shock."
* Wife's perspective: the 10,000 yuan is a romantic fund her husband carefully prepared, a wonderful "surprise."

This misunderstanding drives the whole story: the husband swallowing his pain while the wife basks in her taken-for-granted happiness creates a strong comedic contrast and a steady stream of laughs.
### Summary
Through the familiar household scenario of hidden "private money," the video cleverly constructs a story full of reversals and humor. Using dramatic irony (the audience and the husband know the truth while the wife is kept in the dark), it precisely captures the husband's complex state of mind under sudden pressure. Beyond the comedy, it subtly touches on communication, trust, and attitudes toward money between spouses, making it easy for viewers to relate and discuss.
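As a rough illustration of that post-processing step (not code shipped with this repository), the Narrator's timestamped output can be handed to any chat LLM with a prompt along these lines; `call_llm` and the prompt wording are placeholders of our own:

```python
# Hypothetical sketch: turning ARC-Qwen-Video-7B-Narrator output into a structured
# report with an external LLM. `call_llm` stands in for whichever chat-completion
# client you use; it is not provided by this repository.
def build_report_prompt(narrator_output: str) -> str:
    return (
        "Below is a timestamped narration of a short video, with scene descriptions, "
        "speaker identities, and ASR transcripts.\n\n"
        f"{narrator_output}\n\n"
        "Produce: (1) a video overview, (2) a timestamped plot-breakdown table, "
        "(3) a character and core-conflict analysis, and (4) a short summary."
    )

# structured_report = call_llm(build_report_prompt(narrator_output))
```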
## Usage
### Dependencies
The installation has been tested and verified on the following environments:
* NVIDIA H20 with CUDA 12.4
* NVIDIA A100 with CUDA 12.1
### Installation
Clone the repo and install the dependent packages:
```bash
git clone -b arc-qwen-video https://github.com/TencentARC/ARC-Hunyuan-Video-7B.git
cd ARC-Hunyuan-Video-7B
# Install torch 2.6.0 based on your CUDA version
# CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# CUDA 12.6
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install librosa decord av accelerate
pip uninstall transformers
pip install git+https://github.com/geyuying/transformers.git@arc-qwen-video
pip install flash_attn==2.7.1.post4
# Install FFmpeg according to your system, and ensure that the following command produces a normal version output:
ffmpeg -version
# (Optional) For vllm, please follow the instructions below,
pip uninstall vllm
pip install git+https://github.com/geyuying/vllm.git@arc-qwen-video
```
#### An 'Ugly' Workaround for vLLM Installation
If you are unable to install our provided vllm package, we offer an alternative "ugly" method:
1. Install vLLM with Qwen2.5-VL support.
2. Modify `config.json`. In your model weights directory, open `config.json` and change the `architectures` field to `"Qwen2_5_VLForConditionalGeneration"`.
3. Patch the vLLM source code. Locate the file `vllm/model_executor/models/qwen2_5_vl.py` in your vLLM installation path and add the following code inside the `__init__` method of the `Qwen2_5_VLForConditionalGeneration` class:
```python
# Add inside Qwen2_5_VLForConditionalGeneration.__init__.
# Note: this assumes `WhisperModel` (from transformers) and `nn` (torch.nn) are
# importable in this file; add those imports at the top if they are missing.
whisper_path = 'openai/whisper-large-v3'
speech_encoder = WhisperModel.from_pretrained(whisper_path).encoder
self.speech_encoder = speech_encoder

# Project Whisper encoder features into the LLM hidden size.
speech_dim = speech_encoder.config.d_model
llm_hidden_size = config.vision_config.out_hidden_size
self.mlp_speech = nn.Sequential(
    nn.LayerNorm(speech_dim),
    nn.Linear(speech_dim, llm_hidden_size),
    nn.GELU(),
    nn.Linear(llm_hidden_size, llm_hidden_size),
)
```
**Why this works**: Our model is based on the Qwen2.5-VL architecture, with the addition of an audio encoder and a corresponding MLP. During vLLM inference, the multi-modal encoder processes inputs sequentially, while the LLM performs batch inference. Since we only need to pass the final multi-modal embeddings to the LLM, we can reuse the existing Qwen2.5-VL code.
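For step 2 above, the `config.json` edit is just swapping the `architectures` entry; a small helper like the following (our own convenience snippet, with an assumed local path) applies it:

```python
# Convenience sketch for step 2: point the checkpoint at vLLM's existing
# Qwen2.5-VL implementation by rewriting `architectures` in config.json.
import json

config_path = "path/to/ARC-Qwen-Video-7B/config.json"  # your local weights directory

with open(config_path) as f:
    config = json.load(f)

config["architectures"] = ["Qwen2_5_VLForConditionalGeneration"]

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```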
### Inference
Our model currently excels at processing short videos of up to 5 minutes. If your video is longer, we recommend following the approach used in our demo and API: split the video into segments for inference, and then use an LLM to integrate the results.
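If you need to pre-split a long video, one possible way (our own helper, not part of the repository) is FFmpeg's segment muxer; with stream copy the cuts land on keyframes, so segment lengths are approximate:

```python
# One way to split a long video into roughly 5-minute chunks with FFmpeg before
# running inference on each chunk; output files follow the given naming pattern.
import subprocess

def split_video(video_path: str, out_pattern: str = "segment_%03d.mp4", segment_s: int = 300) -> None:
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-c", "copy",                  # stream copy: fast, no re-encoding
            "-f", "segment",               # fixed-length segmentation
            "-segment_time", str(segment_s),
            "-reset_timestamps", "1",
            out_pattern,
        ],
        check=True,
    )
```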
To quickly verify that your environment is set up correctly and that video and audio information are being processed as expected, you can run the following test case with ARC-Qwen-Video-7B.
```python
video_path = "examples/猪排.mp4"
task = "QA"
question = "What did the man say at the beginning of the video after measuring the thickness of the fried pork cutlet?"
```
Expected Result: If the model's output contains the phrase "So thin", it indicates that your installation is successful.
#### Inference without vllm
```bash
cd ARC-Hunyuan-Video-7B
# For ARC-Qwen-Video-7B
python3 inference_arc_qwen_video.py
# For ARC-Qwen-Video-7B-Narrator
python3 inference_arc_qwen_video_narrator.py
```
#### Inference with vllm
```bash
cd ARC-Hunyuan-Video-7B
# For ARC-Qwen-Video-7B
python3 vllm_arc_qwen_vl_video_batch.py --batch_inference
# For ARC-Qwen-Video-7B-Narrator
python3 vllm_arc_qwen_vl_video_batch_narrator.py --batch_inference
```
## Benchmark Performance
| | Video-MMMU | MMVU | Temp-Compass | Video-Holmes | Video-MME | VCR-Bench | MV-Bench | ShortVid-Bench | Charades-STA |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ARC-Hunyuan-Video-7B | 31.1 | 49.1 | 66.0 | 40.9 | 58.7 | 50.5 | **62.6** | **73.0** | **54.8** |
| ARC-Qwen-Video-7B | **41.3** | **55.5** | **68.7** | **51.1** | **61.0** | **52.3** | 60.8 | 72.6 | 52.8 |
Quantitative evaluation is performed on different benchmarks using accuracy as the evaluation metric, except for the grounding task on Charades-STA, which uses mIoU. For all benchmarks other than Video-MMMU and Charades-STA, we only evaluated the multiple-choice questions.
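For reference, the Charades-STA grounding metric is the IoU between predicted and ground-truth time spans, averaged over samples; a minimal sketch (variable names are ours):

```python
# Minimal sketch of temporal mIoU, the grounding metric used for Charades-STA:
# interval IoU between predicted and ground-truth (start, end) spans, averaged.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts) -> float:
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```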
## Citation
If you find the work helpful, please consider citing:
```bibtex
@article{ge2025arc,
title={ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts},
author={Ge, Yuying and Ge, Yixiao and Li, Chen and Wang, Teng and Pu, Junfu and Li, Yizhuo and Qiu, Lu and Ma, Jin and Duan, Lisheng and Zuo, Xinyu and others},
journal={arXiv preprint arXiv:2507.20939},
year={2025}
}
```