Does not work with bitsandbytes 4-bit and 8-bit
The 1B model works fine in 4-bit, but the 4B model does not. Why is that?
full precision answer:
#####################################
outputs ["user\nYou are a helpful assistant.\n\nWrite a poem on Hugging Face, the company\nmodel\nOkay, here's a poem about Hugging Face, aiming to capture its spirit and impact:\n\nThe Open Embrace\n\nIn realms of code, a vibrant hue,\nHugging Face emerges, fresh and new.\nNot just a name, a welcoming plea,\nFor AI’s future, wild"]
###################
8bit answer:
#############################
outputs ['user\nYou are a helpful assistant.\n\nWrite a poem on Hugging Face, the company\nmodel\nThisrvice gaក៏ forতি지만 Senhor noname συ᱕ Brain freeze not기는 объявitic fregataamataء rou अॅ⿻ffassoääntsmannetworks 둘이ई क्रिप्टोकर策划 deメリット\u200cی� भाgal gydant recovering the গঙ্গన్న आएगी भी Olméis सबके/ヶ月𝔦 मलया ഒ საერთ出しिष्ट সকলে齊,']
##################
# Reproduction script: load Gemma 3 4B with bitsandbytes 8-bit quantization.
from transformers import AutoTokenizer, BitsAndBytesConfig, Gemma3ForConditionalGeneration
import torch

model_id = "google/gemma-3-4b-it"

# Quantize the weights to 8-bit at load time.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quantization_config
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A batch containing a single conversation (system turn + user turn).
messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write a poem on Hugging Face, the company"}],
        },
    ],
]

# Apply the chat template and move the tokenized inputs to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=64)

# Decode the full sequences (prompt + completion).
outputs = tokenizer.batch_decode(outputs)
print("outputs", outputs)
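When comparing several precision settings, it can be handy to flag degenerate generations automatically instead of reading each one. The following is a rough heuristic sketch, not part of the original script: it counts how many distinct Unicode scripts appear in a decoded string, on the assumption that an English poem stays in one script while the broken 8-bit output above mixes many. The names `scripts_used` and `looks_garbled` and the threshold of 3 are hypothetical choices.

```python
import unicodedata


def scripts_used(text):
    """Return the set of script names (e.g. LATIN, HANGUL) of the letters in text."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # The first word of a character's Unicode name is usually its script.
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def looks_garbled(text, max_scripts=3):
    """Heuristic: treat text that mixes more than max_scripts scripts as garbled."""
    return len(scripts_used(text)) > max_scripts


good = "In realms of code, a vibrant hue,\nHugging Face emerges, fresh and new."
bad = "Thisrvice gaក៏ forতি지만 Senhor noname συ объяв अ"
print(looks_garbled(good), looks_garbled(bad))
```

This only catches the multi-script failure mode shown in this thread; a quantized model that produces fluent but wrong text would pass the check.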
I am playing with the 4B model today and found the same thing. At 4-bit it gives the same garbled output you got, but full precision works without problems. Have you figured anything out?
8-bit started working for me after I updated bitsandbytes.
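Since a bitsandbytes update is reported to fix the 8-bit case, it may help to confirm which versions are actually installed in the active environment before retrying. Below is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the thread does not state which bitsandbytes version contains the fix.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages involved in the reproduction script.
for name in ("bitsandbytes", "transformers", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Upgrading is then typically a matter of `pip install -U bitsandbytes` followed by restarting the Python process.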
Hi @zokica ,
Apologies for the late reply, and thanks for reaching out to us. I was able to run 4-bit quantization with bitsandbytes on the google/gemma-3-4b-it model. I made a few changes to the quantization parameters, and it now works without any issues and produces output for the given prompt. Please find the following gist file for your reference. Please let us know if you require any further assistance.
Thanks.
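The gist itself is not reproduced in this thread, so the exact parameters used in the reply are unknown. As an illustration only, the sketch below shows a commonly used bitsandbytes 4-bit setup (NF4 quantization, double quantization, bfloat16 compute dtype); these values are an assumption and not necessarily what the gist contains. The function names `nf4_config_kwargs` and `load_gemma3_4bit` are hypothetical, and the heavy imports are deferred so the settings can be inspected without a GPU.

```python
def nf4_config_kwargs():
    """Assumed 4-bit settings (not taken from the gist): NF4 with bfloat16 compute."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
        "bnb_4bit_compute_dtype": "bfloat16",
    }


def load_gemma3_4bit(model_id="google/gemma-3-4b-it"):
    """Load the model with the assumed 4-bit config; needs torch, transformers, bitsandbytes."""
    import torch
    from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

    kwargs = nf4_config_kwargs()
    # Convert the dtype name into the torch dtype object.
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    quantization_config = BitsAndBytesConfig(**kwargs)
    model = Gemma3ForConditionalGeneration.from_pretrained(
        model_id, quantization_config=quantization_config
    )
    return model.eval()


print(nf4_config_kwargs())
```

Calling `load_gemma3_4bit()` in place of the 8-bit loading step of the reproduction script would test whether these settings change the output on a given GPU.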
Hi, which accelerator did you use? I am not able to get this working on an L4 GPU.