The sections below show how to use geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 with common libraries and local apps.

Libraries

How to use geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32",
	filename="Qwen3-Coder-30B-A3B-Instruct-f32:Q2_K.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
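
The call above returns an OpenAI-style dictionary rather than a plain string. As a minimal sketch of reading the reply text out of it, the helper below (`extract_reply` is a hypothetical name, not part of llama-cpp-python) is exercised on a hand-written sample response so that it runs without downloading the model:

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Non-streaming responses carry the text under choices[i]["message"]["content"].
    return choices[0]["message"]["content"]


# Hand-written stand-in (an assumption for illustration) for the dict that
# llm.create_chat_completion(...) returns when streaming is off.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}

print(extract_reply(sample_response))
```

With a loaded model, the same helper would be applied to the return value of `llm.create_chat_completion(...)`.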

Local Apps

llama.cpp

How to use geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M

Use Docker

docker model run hf.co/geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M
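
A model of this size can take a while to download and load, so a script that talks to `llama-server` may want to wait until the server is ready. The sketch below polls the server's `/health` endpoint. It assumes the default address `http://localhost:8080` and that `/health` answers with HTTP 200 once the model is loaded; `server_is_healthy` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_healthy(base_url: str = "http://localhost:8080") -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status while the model loads.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
) -> bool:
    """Call check() repeatedly until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Typical use, once llama-server has been started in another terminal:
#     if wait_until_ready(server_is_healthy):
#         print("llama-server is ready")
```

Passing the check as a callable keeps the polling loop independent of the network, so it can be tried out with a stub before pointing it at a real server.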

vLLM

How to use geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
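
For callers who prefer Python to curl, the same request can be sketched with the standard library alone. This assumes the vLLM server from the step above is listening on `http://localhost:8000`; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires a running server:
#     print(chat("http://localhost:8000",
#                "geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32",
#                "What is the capital of France?"))
```

Splitting request construction from sending makes it possible to inspect the URL and JSON body without a server running.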

Use Docker

docker model run hf.co/geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M

Ollama
How to use geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 with Ollama:
```
ollama run hf.co/geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M
```

Unsloth Studio

How to use geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 to start chatting

Use Hugging Face Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 to start chatting

Pi

How to use geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 with Docker Model Runner:
```
docker model run hf.co/geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M
```

Lemonade

How to use geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3-Coder-30B-A3B-Instruct-f32-Q4_K_M

List all available models

lemonade list

Qwen3-Coder-30B-A3B-Instruct-f32-GGUF

This is a GGUF-quantized version of the Qwen/Qwen3-Coder-30B-A3B-Instruct language model.

It has been converted for use with llama.cpp, LM Studio, OpenWebUI, GPT4All, and other GGUF-compatible tools.

Why f32?

The quantizations in this repository were produced from an FP32 (32-bit floating point) base rather than the more common FP16. This is unusual for GGUF models because:

FP32 doubles memory usage compared with FP16.
Modern LLMs (including Qwen3) are trained in mixed precision and do not benefit from FP32 at inference time.
FP32 is only useful for debugging, research, or cases that need extreme numerical robustness.

An F16 base is probably a better choice, but you can use this repository to check whether the outputs differ at all.
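
To put the memory claim in numbers, here is a rough back-of-the-envelope sketch. It assumes the parameter count of about 31B reported for this model on its Hugging Face page, counts weights only (no KV cache or runtime overhead), and uses decimal gigabytes; `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 31e9  # assumption: the rounded "31B params" figure from the model page

fp32_gb = weight_memory_gb(N_PARAMS, 4)  # FP32 stores each weight in 4 bytes
fp16_gb = weight_memory_gb(N_PARAMS, 2)  # FP16 stores each weight in 2 bytes

print(f"FP32 weights: ~{fp32_gb:.0f} GB")
print(f"FP16 weights: ~{fp16_gb:.0f} GB")
print(f"Ratio: {fp32_gb / fp16_gb:.1f}x")
```

Under these assumptions the unquantized weights come to roughly 124 GB in FP32 versus 62 GB in FP16, which is why the quantized files are what most users actually run.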

💡 Key Features of Qwen3-Coder-30B-A3B-Instruct:

Available Quantizations (from f32)

Level	Quality	Speed	Size	Recommendation
Q2_K	Minimal	⚡ Fast	11.3 GB	Only on severely memory-constrained systems.
Q3_K_S	Low-Medium	⚡ Fast	13.3 GB	Minimal viability; avoid unless space-limited.
Q3_K_M	Low-Medium	⚡ Fast	14.7 GB	Acceptable for basic interaction.
Q4_K_S	Practical	⚡ Fast	17.5 GB	Good balance for mobile/embedded platforms.
Q4_K_M	Practical	⚡ Fast	18.6 GB	Best overall choice for most users.
Q5_K_S	Max Reasoning	🐢 Medium	21.1 GB	Slight quality gain; good for testing.
Q5_K_M	Max Reasoning	🐢 Medium	21.7 GB	Best quality available. Recommended.
Q6_K	Near-FP16	🐌 Slow	25.1 GB	Diminishing returns. Only if RAM allows.
Q8_0	Lossless*	🐌 Slow	32.5 GB	Maximum fidelity. Ideal for archival.
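
One way to read the Size column is as an average number of bits stored per weight. The sketch below derives that figure from the file sizes in the table, assuming roughly 31B parameters (the count reported on the model page) and decimal gigabytes; both are approximations, so the results are indicative only.

```python
# File sizes in GB, copied from the table above.
SIZES_GB = {
    "Q2_K": 11.3,
    "Q3_K_S": 13.3,
    "Q3_K_M": 14.7,
    "Q4_K_S": 17.5,
    "Q4_K_M": 18.6,
    "Q5_K_S": 21.1,
    "Q5_K_M": 21.7,
    "Q6_K": 25.1,
    "Q8_0": 32.5,
}

N_PARAMS = 31e9  # assumption: rounded parameter count from the model page


def bits_per_weight(size_gb: float, n_params: float = N_PARAMS) -> float:
    """Average bits per parameter implied by a file of size_gb gigabytes."""
    return size_gb * 1e9 * 8 / n_params


for level, size in SIZES_GB.items():
    print(f"{level:7s} {size:5.1f} GB  ~{bits_per_weight(size):.1f} bits/weight")
```

The averages land somewhat above the nominal bit width in each level's name, which is expected because the files also carry per-block scale data and metadata.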

💡 Recommendations by Use Case

💻 Standard Laptop (i5/M1 Mac): Q5_K_M (optimal quality)
🧠 Reasoning, Coding, Math: Q5_K_M or Q6_K
🔍 RAG, Retrieval, Precision Tasks: Q6_K or Q8_0
🤖 Agent & Tool Integration: Q5_K_M
🛠️ Development & Testing: Test from Q4_K_M up to Q8_0
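
The recommendations above are organised by use case. If the limiting factor is memory instead, a simple size-based filter can narrow the choice. The following is only a sketch: `largest_fitting` is a hypothetical helper, the sizes are copied from the quantization table, and the 2 GB headroom for context and runtime overhead is an arbitrary assumption rather than a measured value.

```python
from typing import Optional

# File sizes in GB, copied from the quantization table.
SIZES_GB = {
    "Q2_K": 11.3,
    "Q3_K_S": 13.3,
    "Q3_K_M": 14.7,
    "Q4_K_S": 17.5,
    "Q4_K_M": 18.6,
    "Q5_K_S": 21.1,
    "Q5_K_M": 21.7,
    "Q6_K": 25.1,
    "Q8_0": 32.5,
}


def largest_fitting(budget_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest level whose file plus headroom fits in budget_gb."""
    fitting = [
        (size, level)
        for level, size in SIZES_GB.items()
        if size + headroom_gb <= budget_gb
    ]
    # Tuples sort by size first, so max() picks the biggest file that fits.
    return max(fitting)[1] if fitting else None


for budget in (16, 24, 32, 64):
    print(f"{budget} GB available -> {largest_fitting(budget)}")
```

This picks by size alone and ignores speed, so it complements the use-case list rather than replacing it.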

Usage

Load this model using:

OpenWebUI – self-hosted AI interface with RAG & tools
LM Studio – desktop app with GPU support
GPT4All – private, offline AI chatbot
Or directly via llama.cpp

Each quantized model includes its own README.md and shares a common MODELFILE.

Author

👤 Geoff Munn (@geoffmunn)
🔗 Hugging Face Profile

Disclaimer

This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.

GGUF

Model size: 31B params
Architecture: qwen3moe
Quantization bit widths: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit

Model tree for geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32

Base model: Qwen/Qwen3-Coder-30B-A3B-Instruct
Quantized (143): this model