VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following
Paper: https://arxiv.org/abs/2311.17647
Model type: v-MLLM is an open-source MLLM trained on the Visual-Modality Instruction (VIM) corpus; it can robustly follow both text-modality and visual-modality instructions, i.e., instructions rendered directly into the image pixels (a minimal illustration of this input format follows the model details below).
Model date: v-MLLM-13B was trained in January 2024.
GitHub for more information: https://github.com/VIM-Bench/VIM_TOOL
v-MLLM is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Primary intended uses: The primary use of v-MLLM is for research on multimodal large language models.
Primary intended users: The primary intended users of the model are researchers in computer vision, natural language processing, machine learning, and artificial intelligence.
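For readers unfamiliar with the visual-modality setting, the sketch below shows one way to embed a text instruction into the image itself, which is the input format v-MLLM is evaluated on. This is an illustrative example only, not code from the VIM toolkit; the helper name embed_instruction, the file names, and the layout are assumptions. It uses only standard PIL calls.

```python
from PIL import Image, ImageDraw, ImageFont

def embed_instruction(image_path: str, instruction: str, margin: int = 20) -> Image.Image:
    """Append a white band below the image and print the instruction into it."""
    base = Image.open(image_path).convert("RGB")
    font = ImageFont.load_default()

    # Reserve a band tall enough for one line of rendered instruction text.
    band_height = 40 + margin
    canvas = Image.new("RGB", (base.width, base.height + band_height), "white")
    canvas.paste(base, (0, 0))

    # Draw the instruction inside the band so it becomes part of the pixels.
    draw = ImageDraw.Draw(canvas)
    draw.text((margin // 2, base.height + margin // 2), instruction, fill="black", font=font)
    return canvas

# Example usage: the model receives only this image, with no separate text prompt.
# vim_image = embed_instruction("example.jpg", "What color is the car in the image?")
# vim_image.save("example_vim.jpg")
```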
Please cite our paper if you find our resources useful:
@misc{li2024text,
      title={Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?},
      author={Xiujun Li and Yujie Lu and Zhe Gan and Jianfeng Gao and William Yang Wang and Yejin Choi},
      year={2024},
      eprint={2311.17647},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{lu2023vim,
      title={VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following},
      author={Yujie Lu and Xiujun Li and William Yang Wang and Yejin Choi},
      year={2023},
      eprint={2311.17647},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}