Llama-3.2-3B-2bit-mlx


Description

This is meta-llama/Llama-3.2-3B converted to MLX format with 2-bit quantization, optimized for Apple Silicon (M1/M2/M3/M4) Macs.

Usage

Generate text with mlx-lm

from mlx_lm import load, generate

# Download the model from the Hugging Face Hub and load weights + tokenizer
model, tokenizer = load("QuantLLM/Llama-3.2-3B-2bit-mlx")

prompt = "Write a story about Einstein"

# Apply the chat template only if the tokenizer ships one
# (this is a base model, so it may not have a chat template)
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

text = generate(model, tokenizer, prompt=prompt, verbose=True)
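
To control sampling, recent mlx-lm releases let you pass a sampler built with mlx_lm.sample_utils.make_sampler. The sketch below assumes a reasonably current mlx-lm; keyword names such as temp and top_p may differ in older versions.

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("QuantLLM/Llama-3.2-3B-2bit-mlx")

# Temperature plus nucleus (top-p) sampling instead of greedy decoding
sampler = make_sampler(temp=0.7, top_p=0.9)

text = generate(
    model,
    tokenizer,
    prompt="Write a story about Einstein",
    max_tokens=256,
    sampler=sampler,
    verbose=True,
)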

With streaming

from mlx_lm import load, stream_generate

model, tokenizer = load("QuantLLM/Llama-3.2-3B-2bit-mlx")

prompt = "Explain quantum computing"

# Apply the chat template only if the tokenizer ships one
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

# Recent mlx-lm versions yield GenerationResponse objects; print each text chunk
for response in stream_generate(model, tokenizer, prompt=prompt, max_tokens=500):
    print(response.text, end="", flush=True)
print()
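
If you also want the full response text and basic throughput numbers, recent mlx-lm versions expose fields such as text, generation_tokens, and generation_tps on each GenerationResponse; treat these field names as version-dependent.

from mlx_lm import load, stream_generate

model, tokenizer = load("QuantLLM/Llama-3.2-3B-2bit-mlx")

full_text = ""
last = None
for response in stream_generate(model, tokenizer, prompt="Explain quantum computing", max_tokens=500):
    full_text += response.text
    last = response

# Summary stats from the final chunk (field names may vary by mlx-lm version)
if last is not None:
    print(f"\n{last.generation_tokens} tokens at {last.generation_tps:.1f} tok/s")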

Command Line

# Install mlx-lm
pip install mlx-lm

# Generate text
python -m mlx_lm.generate --model QuantLLM/Llama-3.2-3B-2bit-mlx --prompt "Hello!"

# Chat mode
python -m mlx_lm.chat --model QuantLLM/Llama-3.2-3B-2bit-mlx

Requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • macOS 13.0 or later
  • Python 3.10+
  • mlx-lm: pip install mlx-lm
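
A quick sanity check that MLX is installed and sees the Metal GPU (a minimal sketch; mx.metal.is_available() is present in current MLX releases):

import mlx.core as mx

# On Apple Silicon this typically prints Device(gpu, 0)
print(mx.default_device())
print("Metal available:", mx.metal.is_available())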

Model Details

  • Base Model: meta-llama/Llama-3.2-3B
  • Format: MLX
  • Quantization: Q2_K (2-bit)
  • License: apache-2.0
  • Created: 2025-12-20

About QuantLLM

This model was converted using QuantLLM, an ultra-fast LLM quantization and export library.

from quantllm import turbo

# Load and quantize any model
model = turbo("meta-llama/Llama-3.2-3B")

# Export to any format
model.export("mlx", quantization="Q2_K")
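
After export, the resulting MLX directory can be loaded locally with mlx-lm to sanity-check the quantized weights. The path below is hypothetical; point it at wherever QuantLLM writes its MLX output.

from mlx_lm import load, generate

# Hypothetical local export path; adjust to QuantLLM's actual output directory
model, tokenizer = load("./Llama-3.2-3B-2bit-mlx")

print(generate(model, tokenizer, prompt="Hello!", max_tokens=64))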

⭐ Star us on GitHub!

