Implementation Methodology
Hi, thank you for your great idea and work!
I have tried to implement your algorithm here: Orion-zhen/abliteration. However, when I test it on Qwen3-4B-Instruct-2507, the results are not good: the abliterated model still refuses.
I believe I implemented your algorithm correctly, but I'm not sure whether I missed something, and I don't really know which harmful and harmless datasets to use.
Could you please tell me what's wrong and what I should do, or share your code if possible? Many thanks!
With my approach, I find I have to intervene on multiple layers. It seems it's time for me to check out some Qwen 3 models.
I tinkered with Qwen 3 14B, and (judging from charting) the refusal direction becomes indistinct from layer 20 onward. It's possible that for Qwen 3 they implemented a defense against abliteration, at least when measured after the first token has been generated. I will investigate further.
Could you please provide some details? In my code, I have implemented:
in utils/compute.py:
- compute the mean activation at the last token position over all harmful and all harmless prompts
- score each layer based on SNR and dissimilarity
- select the top 3 layers to compute the global refusal direction
- sparsify the global refusal direction with a quantile threshold to get the final version
in utils/apply.py:
- iterate through each layer to apply abliteration:
- for each layer, orthogonalize the global refusal direction with respect to the layer's mean activation to get a local refusal direction
- decompose the weight into its norm and direction
- project the local refusal direction out of the weight direction, and re-normalize
- recombine the new weight direction with the original weight norm
Also, my code only works for text-only models. I believe I have made a mistake somewhere; could you please help me out? A minimal sketch of the pipeline is below.
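For concreteness, here is a minimal PyTorch sketch of the pipeline described above. Everything in it is illustrative: the exact scoring formula, tensor layout, and function names are assumptions, not the actual code in my repo.

```python
import torch
import torch.nn.functional as F


def refusal_direction(harmful_means, harmless_means, top_k=3, quantile=0.0):
    # harmful_means / harmless_means: [n_layers, hidden] tensors holding the
    # mean last-token activation per layer over all harmful / harmless prompts.
    diff = harmful_means - harmless_means  # per-layer candidate directions

    # Score each layer: an SNR-style ratio times the cosine dissimilarity
    # between the two means (illustrative formula).
    cos = F.cosine_similarity(harmful_means, harmless_means, dim=-1)
    snr = diff.norm(dim=-1) / (harmless_means.norm(dim=-1) + 1e-8)
    score = snr * (1.0 - cos)

    # Average the candidate directions from the top-k scoring layers.
    global_dir = diff[score.topk(top_k).indices].mean(dim=0)

    # Quantile sparsification: zero out components whose magnitude falls
    # below the given quantile.
    if quantile > 0.0:
        cut = global_dir.abs().quantile(quantile)
        global_dir = torch.where(global_dir.abs() >= cut,
                                 global_dir, torch.zeros_like(global_dir))
    return global_dir / global_dir.norm()


def abliterate_weight(W, global_dir, layer_mean):
    # W: [rows, hidden] weight oriented so each row lives in the residual
    # stream; transpose first if your matrix is stored the other way around.
    # layer_mean: [hidden] mean activation of this layer over all prompts.

    # Localize: orthogonalize the global direction against the layer mean.
    m = layer_mean / layer_mean.norm()
    local = global_dir - (global_dir @ m) * m
    local = local / local.norm()

    # Decompose each row into its norm and unit direction.
    norms = W.norm(dim=-1, keepdim=True)
    dirs = W / norms

    # Project the local refusal direction out of each row direction,
    # re-normalize, and recombine with the original norms.
    dirs = dirs - (dirs @ local).unsqueeze(-1) * local
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    return dirs * norms
```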
> I tinkered with Qwen 3 14B, and (judging from charting) the refusal direction becomes indistinct from layer 20 onward. It's possible that for Qwen 3 they implemented a defense against abliteration, at least when measured after the first token has been generated. I will investigate further.
Yes, I have noticed that too. In my results, the score peaks at about layer #27 and then drops steeply, although the last layer also reaches a very high score.
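For reference, this is the kind of per-layer chart I mean; a minimal matplotlib sketch, assuming `scores` holds the per-layer values from the scoring step:

```python
import matplotlib.pyplot as plt


def plot_layer_scores(scores):
    # scores: one refusal-direction score per layer (hypothetical variable,
    # e.g. logged during the compute step sketched earlier).
    plt.plot(range(len(scores)), scores, marker="o")
    plt.xlabel("layer")
    plt.ylabel("refusal-direction score")
    plt.show()
```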
I have also tried llama3.2-3b-instruct, and my code failed to remove the refusal behavior there as well. As far as I know, the Llama series is very easy to abliterate; even my earlier implementation of the original abliteration method removed the refusal behavior successfully.
Respecting the geometry does result in a reduction in refusal removal. I'll have to look at what the charts look like for that Llama model.
Perhaps Qwen 3's training implemented countermeasures to abliteration along the lines proposed in this paper:
https://arxiv.org/abs/2409.20089
I can get partial refusal removal with the following configuration; at this point I would conclude that some models appear resistant to this less disruptive approach to ablation. (Measured against full weights.)
```yaml
model: meta-llama/Llama-3.2-3B-Instruct
measurements: l3x.refuse
output: l3x
ablate:
  - layer: 8
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 9
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 10
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 11
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 12
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 13
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 14
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 15
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 16
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 17
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 18
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 19
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 20
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 21
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 22
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 23
    measurement: 23
    scale: 1.0
    sparsity: 0.0
```
Fascinating. Dropping norm-preservation seems to be the key to getting Llama 3.2 3B Instruct to comply. I should probably add an option to toggle norm-preservation off, then.
This suggests that different models distribute safety behavior differently between weight magnitudes (norms) and weight directions.
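Concretely, the only difference between the two variants is whether each row's original magnitude is restored after the projection. A sketch, with the same assumptions and hypothetical names as the earlier snippets:

```python
import torch


def orthogonalize(W, local, preserve_norms=True):
    # W: [rows, hidden]; local: unit-norm refusal direction [hidden].
    norms = W.norm(dim=-1, keepdim=True)

    # Standard abliteration update: project every row off the direction.
    W = W - (W @ local).unsqueeze(-1) * local

    if preserve_norms:
        # Geometry-respecting variant: restore each row's original
        # magnitude, so only its direction changes.
        W = W / W.norm(dim=-1, keepdim=True) * norms
    return W
```

With `preserve_norms=False` this reduces to the usual directional ablation, which is what appears to get Llama 3.2 3B Instruct to comply.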
Thank you for your insightful discovery. Looking forward to your future work!
Thanks for the feedback! I just pushed an update to my repo that leaves all 3 of my current modifications off by default, so one can toggle them on selectively. This will enable mix-and-match exploration in the future. It seems I got lucky with Gemma 3 12B, finding a model where all 3 options worked well together.