Instructions to use google/gemma-3-4b-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use google/gemma-3-4b-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
model = AutoModelForImageTextToText.from_pretrained("google/gemma-3-4b-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use google/gemma-3-4b-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "google/gemma-3-4b-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-4b-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/google/gemma-3-4b-it

SGLang

How to use google/gemma-3-4b-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-3-4b-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-4b-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "google/gemma-3-4b-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-4b-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use google/gemma-3-4b-it with Docker Model Runner:
```
docker model run hf.co/google/gemma-3-4b-it
```

Does not work with bitsandbytes 4-bit and 8-bit

#27

by zokica - opened Mar 25, 2025

Discussion

zokica

Mar 25, 2025

While the 1B model works fine in 4-bit, the 4B model does not. Why is that?

full precision answer:
#####################################
outputs ["user\nYou are a helpful assistant.\n\nWrite a poem on Hugging Face, the company\nmodel\nOkay, here's a poem about Hugging Face, aiming to capture its spirit and impact:\n\nThe Open Embrace\n\nIn realms of code, a vibrant hue,\nHugging Face emerges, fresh and new.\nNot just a name, a welcoming plea,\nFor AI’s future, wild"]
###################

8bit answer:
#############################
outputs ['user\nYou are a helpful assistant.\n\nWrite a poem on Hugging Face, the company\nmodel\nThisrvice gaក៏ forতি지만 Senhor noname συ᱕ Brain freeze not기는 объявitic fregataamataء rou अॅ⿻ffassoääntsmannetworks 둘이ई क्रिप्टोकर策划 deメリット\u200cی� भाgal gydant recovering the গঙ্গన్న आएगी भी Olméis सबके/ヶ月𝔦 मलया ഒ საერთ出しिष्ट সকলে齊,']
##################

import torch
from transformers import AutoTokenizer, BitsAndBytesConfig, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"

# Load the weights in 8-bit with bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()

tokenizer = AutoTokenizer.from_pretrained(model_id)

# A batch containing a single conversation
messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write a poem on Hugging Face, the company"}],
        },
    ],
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate without tracking gradients
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=64)

outputs = tokenizer.batch_decode(outputs)

print("outputs", outputs)

fgoricha

Aug 18, 2025

•

edited Aug 18, 2025

I am playing with the 4B today and found the same thing. At 4-bit it gives what you got, but full precision is no problem. Have you figured anything out?

8-bit started working for me after I updated bitsandbytes.
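Since updating bitsandbytes changed the 8-bit behaviour here, it may help to include the installed package versions when reporting results. A minimal sketch using only the standard library follows. The helper name `installed_versions` and the list of packages are illustrative assumptions, not something from this thread.

```python
from importlib import metadata


def installed_versions(packages=("bitsandbytes", "transformers", "torch", "accelerate")):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is missing from the current environment
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output next to the garbled or correct generation makes it easier to tell whether a difference comes from the library version rather than from the model.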

BalakrishnaCh

Google org Aug 27, 2025

•

edited Aug 27, 2025

Hi @zokica,

Apologies for the late reply, and thanks for reaching out. I was able to run 4-bit quantization with bitsandbytes for the google/gemma-3-4b-it model. I made a few changes to the quantization parameters, and it now works without issues and produces output for the given prompt. Please see the following gist for your reference, and let us know if you need any further assistance.

Thanks.
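The gist itself is not reproduced in this thread, so the exact parameters used are unknown. As a rough sketch of what a 4-bit setup with explicit quantization parameters can look like, the example below uses options that `BitsAndBytesConfig` accepts (NF4, a bfloat16 compute dtype, double quantization). These values are assumptions chosen for illustration, not the ones from the gist, and the function names are hypothetical.

```python
def four_bit_quantization_kwargs():
    """Illustrative 4-bit settings (assumed, not taken from the gist)."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",            # alternative: "fp4"
        "bnb_4bit_compute_dtype": "bfloat16",    # dtype name; resolved to a torch dtype by the config
        "bnb_4bit_use_double_quant": True,
    }


def load_quantized_model(model_id="google/gemma-3-4b-it"):
    """Load the model in 4-bit; needs a GPU plus bitsandbytes and accelerate installed."""
    # Imported lazily so the settings above can be inspected without transformers
    from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

    quantization_config = BitsAndBytesConfig(**four_bit_quantization_kwargs())
    return Gemma3ForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        device_map="auto",
    ).eval()


if __name__ == "__main__":
    print(four_bit_quantization_kwargs())
```

Whether these particular settings avoid the garbled output reported above has not been verified here. Comparing the result against the full-precision answer for the same prompt is the simplest check.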

Siddhinita

Dec 8, 2025

Hi, which accelerator did you use? I am not able to get this working with an L4 GPU.
