Should UD-Q6_K_XL be identical to Q6_K.gguf?

#1
by BVEsun - opened

Both files report exactly the same SHA256, file size, and Xet hash:

Nemotron-3-Nano-30B-A3B-Q6_K.gguf
SHA256: f19626879b433941214bdfe10d7a709ce3488bb7eeee03827f864fa39e6cd166
Size of remote file: 33.5 GB
Xet hash: 567ab9542d8af15b1a78e5827bcb75c1f8f859bf23def5c08b888e18892415fc

Nemotron-3-Nano-30B-A3B-UD-Q6_K_XL.gguf
SHA256: f19626879b433941214bdfe10d7a709ce3488bb7eeee03827f864fa39e6cd166
Size of remote file: 33.5 GB
Xet hash: 567ab9542d8af15b1a78e5827bcb75c1f8f859bf23def5c08b888e18892415fc
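For anyone who wants to confirm this locally after downloading, a minimal sketch using Python's hashlib (file names taken from the listings above; it assumes both files sit in the current directory):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so a 33.5 GB GGUF never has to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

files = [
    Path("Nemotron-3-Nano-30B-A3B-Q6_K.gguf"),
    Path("Nemotron-3-Nano-30B-A3B-UD-Q6_K_XL.gguf"),
]
digests = {p.name: sha256_of(p) for p in files}
for name, digest in digests.items():
    print(f"{name}: {digest}")
print("byte-identical:", len(set(digests.values())) == 1)
```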

Unsloth AI org

This is because the model has an architecture like gpt-oss where some tensor dimensions aren't divisible by 128, so those tensors can't be quantized to lower bits and the file ends up bigger.

That's also why we deleted some 1-bit and 2-bit sizes: they came out exactly the same size.

I would recommend using the XL one.
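To make the divisibility constraint concrete: llama.cpp's block-quantized formats pack each weight row into fixed-size blocks (the k-quants use 256-element super-blocks; the 128 above is the same kind of constraint for other formats), so a row whose length isn't a multiple of the block size can't use those formats and falls back to a wider type. A minimal sketch of the check, with made-up tensor names and sizes for illustration:

```python
QK_K = 256  # k-quant super-block size in llama.cpp; other formats use smaller blocks

def fits_block_quant(row_size: int, block_size: int = QK_K) -> bool:
    """A weight row can only be packed into fixed-size blocks if its length
    is an exact multiple of the block size."""
    return row_size % block_size == 0

# Hypothetical tensor row sizes, purely for illustration (not from the actual model).
example_rows = {
    "blk.0.ffn_down.weight": 4096,   # 4096 % 256 == 0 -> low-bit k-quant possible
    "blk.0.attn_odd.weight": 2880,   # 2880 % 256 != 0 -> must stay in a wider format
}
for name, row_size in example_rows.items():
    print(f"{name}: row={row_size}, block-quantizable={fits_block_quant(row_size)}")
```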

Thank you very much! I'll download the XL one then.


What is the solution?
Add padding support to llama.cpp?
Add 64- and 128-block variants of the low-bit quant types?
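On the padding idea, a rough NumPy sketch of what it could look like conceptually: zero-pad each row up to the next multiple of the block size and keep the original length so the pad can be dropped at load time. This is purely illustrative; llama.cpp would have to do this inside its own quantization and inference code:

```python
import numpy as np

QK_K = 256  # k-quant super-block size

def pad_row_to_block(row: np.ndarray, block_size: int = QK_K) -> tuple[np.ndarray, int]:
    """Zero-pad a weight row up to the next multiple of the block size.

    Returns the padded row plus the original length, which would have to be
    stored somewhere so inference can ignore the padded tail."""
    orig_len = row.shape[0]
    pad = (-orig_len) % block_size
    if pad:
        row = np.concatenate([row, np.zeros(pad, dtype=row.dtype)])
    return row, orig_len

# Hypothetical row length that is not a multiple of 256.
row = np.random.randn(2880).astype(np.float32)
padded, orig_len = pad_row_to_block(row)
print(orig_len, "->", padded.shape[0])  # 2880 -> 3072
```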

Sorry for the duplicate issue, I didn't read this one.
The problem is not so much Q6_K; the problem is that nothing below 4-bit is available, so the model is blocked from cards with less than 24 GB.

When I implemented the newly released Falcon architecture for llama.cpp I ran into that issue; it's been a while.
The quantization issue (back then CUDA inference was also partly blocked) was that the tensor dimensions had to be divisible by 256, I believe.
It was no big problem to change that division factor - it's mostly just constants - though the smaller block size likely introduces a bit more quantization error; I can't recall.
Another option is probably tensor padding, like Todeber said.
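On the block-size trade-off mentioned above, here is a back-of-the-envelope sketch of the storage side only, using a generic 4-bit absmax scheme with one fp16 scale per block (not the exact k-quant layout, which also carries sub-block scales): shrinking the block size raises the effective bits per weight because more scales have to be stored.

```python
def bits_per_weight(weight_bits: int = 4, scale_bits: int = 16, block_size: int = 256) -> float:
    """Effective storage cost: the quantized weights plus one shared scale per block."""
    return weight_bits + scale_bits / block_size

for bs in (256, 128, 64, 32):
    print(f"block={bs:3d}: {bits_per_weight(block_size=bs):.3f} bits/weight")
```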

Changing the constant to a smaller one was very simple; it was my solution for all but one tensor back then. But GGUF had it hardcoded, so it broke compatibility.

I'm "working" on a PR to allow arbitrary dimensions with 256 block size.
I'm testing out a vibe coding tool for that right now but if it fails I do it myself.
https://github.com/708-145/llama.cpp/pull/33/files
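The core idea there, sketched below with a plain absmax quantizer rather than the real k-quant code, would be to keep the 256-element block layout but allow the last block of a row to be partial:

```python
import numpy as np

QK_K = 256

def quantize_row_absmax4(row: np.ndarray, block_size: int = QK_K):
    """Toy 4-bit absmax quantization that tolerates a partial final block.

    Real k-quants pack nibbles and carry sub-block scales; this only shows the
    blocking logic for rows whose length is not a multiple of 256."""
    scales, quants = [], []
    for start in range(0, row.shape[0], block_size):
        block = row[start:start + block_size]             # the last block may be short
        scale = float(np.abs(block).max()) / 7.0 or 1.0   # map the largest magnitude to 7
        q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
        scales.append(scale)
        quants.append(q)
    return scales, quants

row = np.random.randn(2880).astype(np.float32)  # 11 full blocks plus one 64-element block
scales, quants = quantize_row_absmax4(row)
print(len(scales), "blocks; last block length:", quants[-1].shape[0])
```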
