Implementation Methodology

#2, opened by Orion-zhen

Hi, thank you for your great idea and work!

I have tried to implement your algorithm here: Orion-zhen/abliteration. However, when I tested my code on Qwen3-4B-Instruct-2507, the result was not good: the abliterated model still refuses.

I believe I implemented your algorithm correctly, but I'm not sure whether I missed something. I also don't really know which harmful and harmless datasets to use.

Could you please tell me what's wrong and what I should do? Or share your code, if possible? Many thanks!

With my approach, I find I have to intervene on multiple layers. It seems it's time for me to check out some Qwen 3 models.

I tinkered with Qwen 3 14B, and (judging from charting) the refusal direction becomes indistinct from layer 20 onward. It's possible that for Qwen 3 they implemented a defense against abliteration, at least when measured after the first token has been generated. I will investigate further.

Could you please provide some details? In my code, I have implemented:

in utils/compute.py:

  1. compute the mean activation value at the last token over all harmful and harmless prompts
  2. calculate a score per layer based on SNR and dissimilarity
  3. select the top 3 layers to calculate the global refusal direction
  4. use a quantile to sparsify the global refusal direction, yielding the final global refusal direction (a simplified sketch follows this list)
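
Roughly, that pipeline looks like the following (a simplified sketch: the variable names, shapes, and the exact SNR/dissimilarity formula here are illustrative, not verbatim from my repo):

```python
import torch

def global_refusal_direction(harmful, harmless, top_k=3, q=0.9):
    """harmful/harmless: {layer_idx: (n_prompts, hidden)} last-token activations."""
    layers = sorted(harmful)
    mu_harm = {l: harmful[l].mean(dim=0) for l in layers}
    mu_safe = {l: harmless[l].mean(dim=0) for l in layers}
    diff = {l: mu_harm[l] - mu_safe[l] for l in layers}

    # Step 2: score each layer by SNR (signal = mean difference, noise =
    # within-class spread), weighted by how dissimilar the two class means are.
    score = {}
    for l in layers:
        noise = harmful[l].std(dim=0).norm() + harmless[l].std(dim=0).norm()
        snr = diff[l].norm() / (noise + 1e-8)
        dissim = 1.0 - torch.cosine_similarity(mu_harm[l], mu_safe[l], dim=0)
        score[l] = (snr * dissim).item()

    # Step 3: average the difference-of-means over the top-k scoring layers.
    top = sorted(layers, key=lambda l: score[l], reverse=True)[:top_k]
    direction = torch.stack([diff[l] for l in top]).mean(dim=0)

    # Step 4: quantile sparsification -- zero out components below the q-th
    # magnitude quantile, then normalize.
    cutoff = direction.abs().quantile(q)
    direction = torch.where(direction.abs() >= cutoff, direction,
                            torch.zeros_like(direction))
    return direction / direction.norm()
```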

in utils/apply.py:

  1. iterate through each layer to apply abliteration
  2. for each layer, orthogonalize the global refusal direction with respect to the mean activation value of the layer
  3. decompose the weight into norm and direction
  4. apply the local refusal direction to the weight direction, and re-normalize
  5. recombine the new weight direction with the weight norm (a simplified sketch follows this list)
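
Concretely, the per-layer edit looks roughly like this (again a simplified sketch; treating the weight as a (d_model, d_in) projection into the residual stream, and taking the norm/direction split per column, are illustrative choices):

```python
import torch

def ablate_layer(W, global_dir, layer_mean, scale=1.0):
    """W: (d_model, d_in) projection weight writing into the residual stream."""
    # Step 2: orthogonalize the global refusal direction against this
    # layer's mean activation to obtain a local refusal direction.
    m = layer_mean / layer_mean.norm()
    local = global_dir - (global_dir @ m) * m
    local = local / local.norm()

    # Step 3: decompose each weight column into a norm and a unit direction.
    norms = W.norm(dim=0, keepdim=True)   # (1, d_in)
    dirs = W / norms                      # unit-norm columns

    # Step 4: remove the refusal component from the directions, re-normalize.
    dirs = dirs - scale * torch.outer(local, local @ dirs)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)

    # Step 5: recombine the new directions with the original norms.
    return dirs * norms
```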

Also, my code only works for text-only models. I believe I have made a mistake somewhere; could you please help me out?

> I tinkered with Qwen 3 14B, and (judging from charting) the refusal direction becomes indistinct from layer 20 onward. It's possible that for Qwen 3 they implemented a defense against abliteration, at least when measured after the first token has been generated. I will investigate further.

Yes, I have noticed that, too. In my results, the score peaks at about layer 27, then drops steeply; the last layer, however, reaches a very high score.

I have also tried Llama-3.2-3B-Instruct, and my code failed to remove the refusal behavior there, too. As far as I know, the Llama series is very easy to abliterate; even my earlier implementation of the original abliteration method could remove the refusal behavior successfully.

Respecting the geometry does result in a reduction in refusal removal. I'll have to look at what the charts look like for that Llama model.

Perhaps Qwen 3 training involved implementing countermeasures to abliteration along the lines proposed in this paper.
https://arxiv.org/abs/2409.20089

I can get partial refusal removal with the following configuration; at this point I would conclude that some models appear resistant to this less disruptive approach to ablation. (Measured against full weights.)

```yaml
model: meta-llama/Llama-3.2-3B-Instruct
measurements: l3x.refuse
output: l3x
ablate:
  - layer: 8
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 9
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 10
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 11
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 12
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 13
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 14
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 15
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 16
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 17
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 18
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 19
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 20
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 21
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 22
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 23
    measurement: 23
    scale: 1.0
    sparsity: 0.0
```

Fascinating. Dropping norm-preservation seems to be key to getting Llama 3.2 3B Instruct to comply. I should probably add an option to toggle norm-preservation off, then.

This implies that different models encode safety differently between weight magnitudes/norms and directions.
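
For reference, turning norm-preservation off would reduce the edit to the original abliteration projection applied to the raw weight. A minimal sketch (not the exact code in my repo), where `local` is the unit local refusal direction for a layer and W is a (d_model, d_in) projection weight:

```python
import torch

def ablate_no_norm_preservation(W, local, scale=1.0):
    """Original-style edit: project the unit refusal direction `local`
    straight out of the raw weight, so column norms are allowed to shrink."""
    return W - scale * torch.outer(local, local @ W)
```

The only difference from the norm-preserving variant is that the column magnitudes can change, which appears to be exactly the degree of freedom that matters for Llama 3.2 3B.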

Thank you for your insightful discovery. Looking forward to your future work πŸ€—

Thanks for the feedback! I just pushed an update to my repo that leaves all three of my current modifications off by default, so one can toggle them on selectively. This will enable mix-and-match exploration in the future. It seems I got lucky with Gemma 3 12B, finding a model where all three options worked well together.
