Should UD-Q6_K_XL be identical to Q6_K.gguf?

#1
by BVEsun - opened

Both files report exactly the same SHA256, file size, and Xet hash:

Nemotron-3-Nano-30B-A3B-Q6_K.gguf
SHA256: f19626879b433941214bdfe10d7a709ce3488bb7eeee03827f864fa39e6cd166
Size of remote file: 33.5 GB
Xet hash: 567ab9542d8af15b1a78e5827bcb75c1f8f859bf23def5c08b888e18892415fc

Nemotron-3-Nano-30B-A3B-UD-Q6_K_XL.gguf
SHA256: f19626879b433941214bdfe10d7a709ce3488bb7eeee03827f864fa39e6cd166
Size of remote file: 33.5 GB
Xet hash: 567ab9542d8af15b1a78e5827bcb75c1f8f859bf23def5c08b888e18892415fc
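For anyone who wants to confirm this locally after downloading, a minimal sketch using Python's hashlib (file names taken from the listings above; it assumes both files sit in the current directory):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so a 33.5 GB GGUF never has to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

files = [
    Path("Nemotron-3-Nano-30B-A3B-Q6_K.gguf"),
    Path("Nemotron-3-Nano-30B-A3B-UD-Q6_K_XL.gguf"),
]
digests = {p.name: sha256_of(p) for p in files}
for name, digest in digests.items():
    print(f"{name}: {digest}")
print("byte-identical:", len(set(digests.values())) == 1)
```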

Unsloth AI org

This is because the model has an architecture like gpt-oss where some tensor dimensions aren't divisible by 128, so those tensors can't be quantized to lower bits and the file ends up bigger.

That's also why we deleted some 1-bit and 2-bit sizes: they came out exactly the same size.

I would recommend using the XL one.
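To make the divisibility constraint concrete: llama.cpp's block-quantized formats pack each weight row into fixed-size blocks (the k-quants use 256-element super-blocks; the 128 above is the same kind of constraint for other formats), so a row whose length isn't a multiple of the block size can't use those formats and falls back to a wider type. A minimal sketch of the check, with made-up tensor names and sizes for illustration:

```python
QK_K = 256  # k-quant super-block size in llama.cpp; other formats use smaller blocks

def fits_block_quant(row_size: int, block_size: int = QK_K) -> bool:
    """A weight row can only be packed into fixed-size blocks if its length
    is an exact multiple of the block size."""
    return row_size % block_size == 0

# Hypothetical tensor row sizes, purely for illustration (not from the actual model).
example_rows = {
    "blk.0.ffn_down.weight": 4096,   # 4096 % 256 == 0 -> low-bit k-quant possible
    "blk.0.attn_odd.weight": 2880,   # 2880 % 256 != 0 -> must stay in a wider format
}
for name, row_size in example_rows.items():
    print(f"{name}: row={row_size}, block-quantizable={fits_block_quant(row_size)}")
```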

Thank you very much! I'll download the XL one then.


What is the solution?
Add padding support to llama.cpp?
Add 64- and 128-block variants of the low-bit quant types?
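On the padding idea, a rough NumPy sketch of what it could look like conceptually: zero-pad each row up to the next multiple of the block size and keep the original length so the pad can be dropped at load time. This is purely illustrative; llama.cpp would have to do this inside its own quantization and inference code:

```python
import numpy as np

QK_K = 256  # k-quant super-block size

def pad_row_to_block(row: np.ndarray, block_size: int = QK_K) -> tuple[np.ndarray, int]:
    """Zero-pad a weight row up to the next multiple of the block size.

    Returns the padded row plus the original length, which would have to be
    stored somewhere so inference can ignore the padded tail."""
    orig_len = row.shape[0]
    pad = (-orig_len) % block_size
    if pad:
        row = np.concatenate([row, np.zeros(pad, dtype=row.dtype)])
    return row, orig_len

# Hypothetical row length that is not a multiple of 256.
row = np.random.randn(2880).astype(np.float32)
padded, orig_len = pad_row_to_block(row)
print(orig_len, "->", padded.shape[0])  # 2880 -> 3072
```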

Sorry for the duplicate issue, I didn't read this one.
The problem is not so much Q6_K; the problem is that nothing below 4-bit is available, so the model is blocked from cards with less than 24 GB.

When I implemented the newly released Falcon architecture for llama.cpp I ran into that issue; it's been a while.
The quantization issue (back then CUDA inference was also partly blocked) was that the tensor dimensions had to be divisible by 256, I believe.
It was no big problem to change that division factor - it's mostly just constants - though the smaller block size likely introduces a bit more quantization error; I can't recall.
Another option is probably tensor padding, like Todeber said.
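On the block-size trade-off mentioned above, here is a back-of-the-envelope sketch of the storage side only, using a generic 4-bit absmax scheme with one fp16 scale per block (not the exact k-quant layout, which also carries sub-block scales): shrinking the block size raises the effective bits per weight because more scales have to be stored.

```python
def bits_per_weight(weight_bits: int = 4, scale_bits: int = 16, block_size: int = 256) -> float:
    """Effective storage cost: the quantized weights plus one shared scale per block."""
    return weight_bits + scale_bits / block_size

for bs in (256, 128, 64, 32):
    print(f"block={bs:3d}: {bits_per_weight(block_size=bs):.3f} bits/weight")
```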

Changing the constant to a smaller one was very simple; it was my solution for all but one tensor back then. But GGUF had it hardcoded, so it broke compatibility.

I'm "working" on a PR to allow arbitrary dimensions with 256 block size.
I'm testing out a vibe coding tool for that right now but if it fails I do it myself.
https://github.com/708-145/llama.cpp/pull/33/files
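The core idea there, sketched below with a plain absmax quantizer rather than the real k-quant code, would be to keep the 256-element block layout but allow the last block of a row to be partial:

```python
import numpy as np

QK_K = 256

def quantize_row_absmax4(row: np.ndarray, block_size: int = QK_K):
    """Toy 4-bit absmax quantization that tolerates a partial final block.

    Real k-quants pack nibbles and carry sub-block scales; this only shows the
    blocking logic for rows whose length is not a multiple of 256."""
    scales, quants = [], []
    for start in range(0, row.shape[0], block_size):
        block = row[start:start + block_size]             # the last block may be short
        scale = float(np.abs(block).max()) / 7.0 or 1.0   # map the largest magnitude to 7
        q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
        scales.append(scale)
        quants.append(q)
    return scales, quants

row = np.random.randn(2880).astype(np.float32)  # 11 full blocks plus one 64-element block
scales, quants = quantize_row_absmax4(row)
print(len(scales), "blocks; last block length:", quants[-1].shape[0])
```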
