tttoaster committed · Commit 1b14d12 · verified · 1 Parent(s): 3d0d1c5

Update README.md

Files changed (1): README.md (+24 −0)
README.md CHANGED
@@ -196,6 +196,30 @@ pip uninstall vllm
  pip install git+https://github.com/geyuying/vllm.git@arc-qwen-video
  ```
 
+ #### An 'Ugly' Workaround for vLLM Installation
+ If you are unable to install our provided vllm package, we offer an alternative "ugly" method:
+
+ 1. Install a vllm version with Qwen2.5-VL support.
+
+ 2. Modify `config.json`. In your model weights directory, open `config.json` and change the `architectures` field to `"Qwen2_5_VLForConditionalGeneration"` (a sketch of this edit follows the code block in step 3).
+
+ 3. Patch the vllm source code. Locate the file `vllm/model_executor/models/qwen2_5_vl.py` in your vllm installation path and add the following code inside the `__init__` method of the `Qwen2_5_VLForConditionalGeneration` class:
+
+ ```python
+ # Requires `from transformers import WhisperModel` at the top of the file;
+ # torch.nn should already be available as `nn` in qwen2_5_vl.py.
+ whisper_path = 'openai/whisper-large-v3'
+ speech_encoder = WhisperModel.from_pretrained(whisper_path).encoder
+ self.speech_encoder = speech_encoder
+ speech_dim = speech_encoder.config.d_model
+ llm_hidden_size = config.vision_config.out_hidden_size
+ self.mlp_speech = nn.Sequential(
+     nn.LayerNorm(speech_dim),
+     nn.Linear(speech_dim, llm_hidden_size),
+     nn.GELU(),
+     nn.Linear(llm_hidden_size, llm_hidden_size)
+ )
+ ```
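For step 2, here is a minimal sketch of the `config.json` edit. The path is a placeholder for your own weights directory, and `architectures` is assumed to be a list, as is typical for Hugging Face configs:

```python
import json

# Placeholder path: point this at the directory holding the downloaded model weights.
config_path = "path/to/model_weights/config.json"

with open(config_path) as f:
    cfg = json.load(f)

# Route vllm to its stock Qwen2.5-VL implementation.
cfg["architectures"] = ["Qwen2_5_VLForConditionalGeneration"]

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)
```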
+ **Why this works**: Our model is based on the Qwen2.5-VL architecture, with the addition of an audio encoder and a corresponding MLP. During vllm inference, the multi-modal encoder processes inputs sequentially, while the LLM performs batched inference. Since we only need to pass the final multi-modal embeddings to the LLM, we can reuse the existing Qwen2.5-VL code.
+
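To make the added modules concrete, the following is a standalone sketch of the speech path they implement: Whisper features go through the encoder and the MLP, producing embeddings in the LLM's hidden size. The 16 kHz dummy audio, the use of `WhisperFeatureExtractor`, and the hidden-size value are illustrative assumptions; inside vLLM the resulting embeddings are fused with the visual/text stream by the model's own forward pass, which this snippet does not reproduce.

```python
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

whisper_path = "openai/whisper-large-v3"
extractor = WhisperFeatureExtractor.from_pretrained(whisper_path)
speech_encoder = WhisperModel.from_pretrained(whisper_path).encoder

speech_dim = speech_encoder.config.d_model  # 1280 for whisper-large-v3
llm_hidden_size = 3584                      # placeholder; the real value comes from config.vision_config.out_hidden_size

mlp_speech = nn.Sequential(
    nn.LayerNorm(speech_dim),
    nn.Linear(speech_dim, llm_hidden_size),
    nn.GELU(),
    nn.Linear(llm_hidden_size, llm_hidden_size),
)

# Dummy 5-second mono clip at 16 kHz; replace with real audio samples.
audio = torch.zeros(16000 * 5).numpy()
features = extractor(audio, sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    hidden = speech_encoder(features).last_hidden_state  # (1, 1500, speech_dim)
    speech_embeds = mlp_speech(hidden)                    # (1, 1500, llm_hidden_size)

print(speech_embeds.shape)
```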
  ### Inference
 
  ```bash