Vintern-3B-beta 🇻🇳 ❄️ - The LLaVA 🌋 Challenger

What's new in Vintern-3B-beta!

Model Details

| Model Name | Vision Part | Language Part |
| --- | --- | --- |
| Vintern-3B-beta | InternViT-300M-448px | Qwen2.5-3B-Instruct |

Bytedance/MTVQA Benchmark

We surpass GPT-4o and are approaching Gemini 1.5 Pro on the Vietnamese (VI) subset of the MTVQA dataset. The table below lists the MTVQA results from open_vlm_leaderboard.

| Rank | Method | Param (B) | Language Model | Vision Model | VI |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini-1.5-Pro | - | - | - | 41.3 |
| 2 | Vintern-3B-beta | 3 | Qwen2.5-3B-Instruct | InternViT-300M | 41.289 |
| 3 | GPT-4o (0513, detail-h...) | - | - | - | 39.6 |
| 4 | GPT-4o (0806, detail-h...) | - | - | - | 38.9 |
| 5 | Gemini-1.5-Flash | - | - | - | 38.9 |
| 6 | Qwen-VL-Max-0809 | 72 | Qwen2-72B | ViT-600M | 36.9 |
| 7 | GPT-4o (0513, detail-lo...) | - | - | - | 26.1 |
| 8 | Qwen-VL-Plus-0809 | - | - | - | 27.8 |
| 9 | GLM-4v-9B | 9 | GLM-4-9B | EVA-02-5B | 26.6 |
| 10 | InternVL2-Llama3-76B | 76 | Llama-3-70B-Instruct | InternViT-6B | 26.7 |
| 11 | Step-1.5V | - | Step-1.5 | stepencoder | 18.4 |
| 12 | InternVL2-40B | 40 | Nous-Hermes-2-Yi-34B | InternViT-6B | 21.2 |
| 13 | Pixtral-12B | 13 | Nemo-12B | ViT-400M | 19.7 |
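
As a quick check on the claim above, the margins can be computed directly from the VI column. This is a minimal sketch that uses only scores listed in the table; the dictionary and variable names are illustrative.

```python
# VI (Vietnamese) scores copied from the top rows of the MTVQA table.
vi_scores = {
    "Gemini-1.5-Pro": 41.3,
    "Vintern-3B-beta": 41.289,
    "GPT-4o (0513, detail-h...)": 39.6,
    "GPT-4o (0806, detail-h...)": 38.9,
    "Gemini-1.5-Flash": 38.9,
    "Qwen-VL-Max-0809": 36.9,
}

ours = vi_scores["Vintern-3B-beta"]

# Positive margin: Vintern-3B-beta is ahead; negative margin: it is behind.
margins = {
    name: round(ours - score, 3)
    for name, score in vi_scores.items()
    if name != "Vintern-3B-beta"
}

# Print from the smallest margin to the largest.
for name, margin in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{name:30s} {margin:+.3f}")
```

On these numbers the gap to Gemini-1.5-Pro is about 0.011 points, while the lead over the best GPT-4o entry is about 1.7 points.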

Zalo VMLU Benchmark

Vintern-3B-beta achieved a score of 54.81 on the Zalo VMLU Benchmark.

generation_config = dict(max_new_tokens=64, do_sample=False, num_beams=1, repetition_penalty=1.5)
# Vietnamese prompt: answer the multiple-choice question with only the letter of the correct option.
# Sample question: "Which of the following bodies are judicial bodies?" (A. People's Procuracy, B. People's Court, C. Government, D. Both A and B)
question = "Bạn là trợ lý AI giải trắc nghiệm rất chính xác. Bạn biết chắc chắn đáp án đúng nhất. Chỉ đưa ra chữ cái đứng trước câu trả lời đúng của câu hỏi trắc nghiệm sau: Các cơ quan nào sau đây là cơ quan tư pháp? Lựa Chọn:\nA. Viện kiểm sát nhân dân\nB. Tòa án nhân dân\nC. Chính phủ\nD. Cả A và B\nCâu trả lời đúng nhất là:"
model.chat(tokenizer, None, question, generation_config)
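
The prompt above follows a fixed pattern: an instruction, the question, the lettered options, and an answer cue. The sketch below shows one way such prompts might be assembled and how the predicted letter might be pulled out of a response. `build_mcq_prompt` and `extract_choice` are hypothetical helpers, not part of the model's API, and the parsing rule (first standalone option letter) is an assumption about the response format.

```python
import re

# Instruction and answer cue copied from the example prompt above.
INSTRUCTION = (
    "Bạn là trợ lý AI giải trắc nghiệm rất chính xác. "
    "Bạn biết chắc chắn đáp án đúng nhất. "
    "Chỉ đưa ra chữ cái đứng trước câu trả lời đúng của câu hỏi trắc nghiệm sau: "
)
ANSWER_CUE = "Câu trả lời đúng nhất là:"


def build_mcq_prompt(question, options):
    """Hypothetical helper: format a question and its (letter, text) options."""
    lines = [f"{letter}. {text}" for letter, text in options]
    return f"{INSTRUCTION}{question} Lựa Chọn:\n" + "\n".join(lines) + f"\n{ANSWER_CUE}"


def extract_choice(response, letters="ABCD"):
    """Hypothetical helper: return the first standalone option letter, or None."""
    match = re.search(rf"\b([{letters}])\b", response.strip())
    return match.group(1) if match else None


prompt = build_mcq_prompt(
    "Các cơ quan nào sau đây là cơ quan tư pháp?",
    [
        ("A", "Viện kiểm sát nhân dân"),
        ("B", "Tòa án nhân dân"),
        ("C", "Chính phủ"),
        ("D", "Cả A và B"),
    ],
)
```

With the model loaded as in the Quickstart, `extract_choice(model.chat(tokenizer, None, prompt, generation_config))` would give the predicted letter, assuming the response begins with it.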

OpenCompass Benchmark

Vintern-3B-beta is now listed on open_vlm_leaderboard, where more detailed evaluations are available.

The current results are reasonably strong, and we are expanding the training set with English and other languages to close the gap with models in a comparable parameter range.

The table is adapted from the Qwen/Qwen2-VL-2B-Instruct repo.

| Benchmark | InternVL2-2B | MiniCPM-V 2.0 | Qwen2-VL-2B | Vintern-3B-beta |
| --- | --- | --- | --- | --- |
| MMMU (val) | 36.3 | 38.2 | 41.1 | 43.55 |
| DocVQA (test) | 86.9 | - | 90.1 | 80.47 |
| InfoVQA (test) | 58.9 | - | 65.5 | 48.28 |
| ChartQA (test) | 76.2 | - | 73.5 | 68.32 |
| TextVQA (val) | 73.4 | - | 79.7 | 67.09 |
| OCRBench | 781 | 605 | 794 | 619 |
| MTVQA | 10.9 | 8.8 | 20.0 | 23.58 |
| Vi-MTVQA | 9.3 | 8.4 | - | 41.29 |
| RealWorldQA | 57.3 | 55.8 | 62.9 | 57.9 |
| MME (sum) | 1876.8 | 1808.6 | 1872.0 | 1772.9 |
| MMBench-EN (test) | 73.2 | 69.1 | 74.9 | 70.62 |
| MMStar | 49.8 | 39.1 | 48.0 | 47.6 |
| HallBench (avg) | 38.0 | 36.1 | 41.7 | 43.22 |
| MathVista (testmini) | 46.0 | 39.8 | 43.0 | 43.9 |
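
To see where Vintern-3B-beta leads and where it trails, the rows can be compared programmatically. The sketch below copies a subset of rows from the table; it assumes higher is better for each of them, and `None` marks scores the table leaves blank. The helper name `best_model` is illustrative.

```python
# Column order matches the table header.
MODELS = ["InternVL2-2B", "MiniCPM-V 2.0", "Qwen2-VL-2B", "Vintern-3B-beta"]

# A subset of rows copied from the table; None means no score is reported.
scores = {
    "MMMU (val)": [36.3, 38.2, 41.1, 43.55],
    "DocVQA (test)": [86.9, None, 90.1, 80.47],
    "OCRBench": [781, 605, 794, 619],
    "MTVQA": [10.9, 8.8, 20.0, 23.58],
    "Vi-MTVQA": [9.3, 8.4, None, 41.29],
    "MMStar": [49.8, 39.1, 48.0, 47.6],
    "HallBench (avg)": [38.0, 36.1, 41.7, 43.22],
}


def best_model(row):
    """Return the model with the highest reported score in a row."""
    reported = [(score, model) for score, model in zip(row, MODELS) if score is not None]
    return max(reported)[1]


leaders = {bench: best_model(row) for bench, row in scores.items()}
vintern_leads = [bench for bench, model in leaders.items() if model == "Vintern-3B-beta"]
print(vintern_leads)
```

On this subset, Vintern-3B-beta leads on MMMU, MTVQA, Vi-MTVQA, and HallBench, while the document-oriented benchmarks (DocVQA, OCRBench) are led by Qwen2-VL-2B.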

SEED-Bench-2-Plus CVPR2024

We achieved Rank 1 among models under 4 billion parameters on the SEED-Bench-2-Plus benchmark, published by Tencent at CVPR 2024.

Examples

User: <image>
Mô tả hình ảnh chi tiết. (Describe the image in detail.)
Assistant (translated from Vietnamese): The image shows the interface of an online food-ordering app. The interface is divided into the following sections:
- The first section shows the notice "Sẽ xong lúc 10:50 SA" ("Ready at 10:50 AM") and the text "Bếp đang thực hiện món." ("The kitchen is preparing your order."). Next to it is an icon of a working kitchen with a green progress bar.
- Next is the text "Vui lòng đến lấy món trước 12:50" ("Please pick up your order before 12:50").
- The next section gives information about the coffee shop "Wego Coffee - Mapletree Business Centre", along with the text "#GF-29".
- The text "Đừng quên kiểm tra lại để chắc rằng bạn nhận đúng món nhé!" ("Don't forget to double-check that you received the right order!") is displayed in a beige rectangular box.
- The final section contains the order information, including the coffee shop logo, the text "Wego Coffee - Mapletree Business Centre", the text "Tổng cộng" ("Total"), the order value (70.000đ), the payment value (120.000đ), and the text "Merchants has not started a chat".
- Finally, there is a "Xem tóm tắt đơn hàng" ("View order summary") button and a green "Đã lấy món" ("Order picked up") button.
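User: <image>
Liệt kê toàn bộ bài thơ có trong ảnh. (List the entire poem in the image.)
Assistant: Đi khắp thế gian không ai tốt bằng mẹ Gánh nặng cuộc đời không ai khổ bằng cha
(English: "Travel the whole world and no one is as good as mother; of all life's burdens, no one suffers as much as father.")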
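User: <image>
Liệt kê toàn bộ bài thơ có trong ảnh. (List the entire poem in the image.)
Assistant: Bài thơ có nội dung: (The poem reads:)
- Mẹ như một ngọn hải đăng (Mother is like a lighthouse)
- Như ông mặt trời, như ông mặt trăng (Like the sun, like the moon)
- Ngày ngày vất vả, tảo tần (Day after day toiling, hard-working)
- Chăm lo con cái, làm việc siêng năng. (Caring for her children, working diligently.)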
User: <image>
Mô tả hình ảnh chi tiết. (Describe the image in detail.)
Assistant (translated from Vietnamese): The photo shows a blackboard with mathematical formulas written in white chalk. In the top-left corner, the text "20/19/SGK" is written in pale yellow. Next is part a) with two equations:
3x + y = 3
2x - y = 7
Below these two equations is an arrow pointing right and two new equations:
5x = 10
3x + y = 3
3.2 + y = 3
Then comes the text "x = 2"
y = -3
Finally, there is the text "Vậy hpt đã cho có no! (x; y) = (2; -3)" ("So the given system of equations has the solution (x; y) = (2; -3)")
User: <image>
Mô tả hình ảnh chi tiết. (Describe the image in detail.)
Assistant (translated from Vietnamese): The image illustrates image processing with an artificial neural network (CNN). The image is divided into 3 main parts:

1. **First part:** Shows an image of a white car with the text "input image" below it.
2. **Second part:** Shows 6 square blocks, each representing a layer in the CNN. The layers are numbered in order from left to right:
    * Convolutional layer
    * Pooling layer
    * Convolutional layer
    * Pooling layer
    * Fully-connected layer
    * Predicted class
3. **Third part:** Shows an image with the text "car" below it.

A black arrow connects the blocks, showing how data flows between the layers.

Quickstart

The code snippet below shows how to load the tokenizer and model and how to generate content. To run inference with the model, follow the steps outlined in our Colab inference notebook.

import numpy as np
import torch
import torchvision.transforms as T
# from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    # aspect ratio of the original image
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (columns, rows) with min_num <= columns * rows <= max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # pick the candidate grid whose aspect ratio is closest to the image's
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

model = AutoModel.from_pretrained(
    "5CD-AI/Vintern-3B-beta",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vintern-3B-beta", trust_remote_code=True, use_fast=False)

test_image = 'test-image.jpg'

pixel_values = load_image(test_image, max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=512, do_sample=False, num_beams=3, repetition_penalty=3.5)

# Vietnamese prompt: "Describe the image in detail."
question = '<image>\nMô tả hình ảnh một cách chi tiết.'

response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

#question = "Câu hỏi khác ......"  # "Another question ......"
#response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
#print(f'User: {question}\nAssistant: {response}')
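
To get a feel for how `dynamic_preprocess` tiles an image before loading any model, the grid selection can be reproduced in plain Python. The sketch below mirrors the logic of `find_closest_aspect_ratio` above without PIL or torch; `choose_grid` and `num_tiles` are illustrative names, not part of the model's code.

```python
def choose_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Return the (columns, rows) tile grid picked for an image of this size."""
    aspect_ratio = width / height
    # Candidate grids with min_num <= columns * rows <= max_num, smallest first.
    grids = sorted(
        {
            (i, j)
            for i in range(1, max_num + 1)
            for j in range(1, max_num + 1)
            if min_num <= i * j <= max_num
        },
        key=lambda g: g[0] * g[1],
    )
    best_diff, best = float("inf"), (1, 1)
    for cols, rows in grids:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff and width * height > 0.5 * image_size**2 * cols * rows:
            # On a tie, switch to the later (larger) grid if the image has enough pixels.
            best = (cols, rows)
    return best


def num_tiles(width, height, max_num=12, use_thumbnail=True):
    """Number of crops produced, including the optional thumbnail."""
    cols, rows = choose_grid(width, height, max_num=max_num)
    tiles = cols * rows
    # As in dynamic_preprocess, the thumbnail is only added when there is more than one tile.
    return tiles + 1 if use_thumbnail and tiles != 1 else tiles


print(choose_grid(896, 448, max_num=6), num_tiles(896, 448, max_num=6))
```

Under these assumptions an 896x448 image with `max_num=6` maps to a 2x1 grid, so `load_image` would return three crops (two tiles plus the thumbnail), that is, a tensor of shape (3, 3, 448, 448). Lowering `max_num` reduces the number of crops and therefore memory use, at the cost of less detail on large images.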

Bias, Risks, and Limitations

The model might have biases because it learned from data that could be biased. Users should be cautious of these possible biases when using the model.

Citation

@misc{doan2024vintern1befficientmultimodallarge,
      title={Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese}, 
      author={Khang T. Doan and Bao G. Huynh and Dung T. Hoang and Thuc D. Pham and Nhat H. Pham and Quan T. M. Nguyen and Bang Q. Vo and Suong N. Hoang},
      year={2024},
      eprint={2408.12480},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.12480}, 
}