Implementation Methodology

#2, opened by Orion-zhen

Hi, thank you for your great idea and work!

I have tried to implement your algorithm here: Orion-zhen/abliteration. However, when I tested my code on Qwen3-4B-Instruct-2507, the result was not good: the abliterated model still refuses.

I believe I implemented your algorithm correctly, but I'm not sure whether I missed something. I also don't really know which harmful and harmless datasets to use.

Could you please tell me what's wrong and what I should do? Or share your code, if possible? Many thanks!

With my approach, I find I have to intervene on multiple layers. It seems it's time for me to check out some Qwen 3 models.

I tinkered with Qwen 3 14B, and (judging from charting) the refusal direction becomes indistinct from layer 20 onward. It's possible that for Qwen 3 they implemented a defense against abliteration, at least when measured after the first token has been generated. I will investigate further.

Could you please provide some details? In my code, I have implemented:

in utils/compute.py:

  1. compute the mean activation value at the last token over all harmful and harmless prompts
  2. calculate a score per layer based on SNR and dissimilarity
  3. select the top 3 layers to calculate the global refusal direction
  4. use a quantile to sparsify the global refusal direction, yielding the final global refusal direction (a simplified sketch follows this list)
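
Roughly, that pipeline looks like the following (a simplified sketch: the variable names, shapes, and the exact SNR/dissimilarity formula here are illustrative, not verbatim from my repo):

```python
import torch

def global_refusal_direction(harmful, harmless, top_k=3, q=0.9):
    """harmful/harmless: {layer_idx: (n_prompts, hidden)} last-token activations."""
    layers = sorted(harmful)
    mu_harm = {l: harmful[l].mean(dim=0) for l in layers}
    mu_safe = {l: harmless[l].mean(dim=0) for l in layers}
    diff = {l: mu_harm[l] - mu_safe[l] for l in layers}

    # Step 2: score each layer by SNR (signal = mean difference, noise =
    # within-class spread), weighted by how dissimilar the two class means are.
    score = {}
    for l in layers:
        noise = harmful[l].std(dim=0).norm() + harmless[l].std(dim=0).norm()
        snr = diff[l].norm() / (noise + 1e-8)
        dissim = 1.0 - torch.cosine_similarity(mu_harm[l], mu_safe[l], dim=0)
        score[l] = (snr * dissim).item()

    # Step 3: average the difference-of-means over the top-k scoring layers.
    top = sorted(layers, key=lambda l: score[l], reverse=True)[:top_k]
    direction = torch.stack([diff[l] for l in top]).mean(dim=0)

    # Step 4: quantile sparsification -- zero out components below the q-th
    # magnitude quantile, then normalize.
    cutoff = direction.abs().quantile(q)
    direction = torch.where(direction.abs() >= cutoff, direction,
                            torch.zeros_like(direction))
    return direction / direction.norm()
```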

in utils/apply.py:

  1. iterate through each layer to apply abliteration
  2. for each layer, orthogonalize the global refusal direction with respect to the mean activation value of the layer
  3. decompose the weight into norm and direction
  4. apply the local refusal direction to the weight direction, and re-normalize
  5. recombine the new weight direction with the weight norm (a simplified sketch follows this list)
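
Concretely, the per-layer edit looks roughly like this (again a simplified sketch; treating the weight as a (d_model, d_in) projection into the residual stream, and taking the norm/direction split per column, are illustrative choices):

```python
import torch

def ablate_layer(W, global_dir, layer_mean, scale=1.0):
    """W: (d_model, d_in) projection weight writing into the residual stream."""
    # Step 2: orthogonalize the global refusal direction against this
    # layer's mean activation to obtain a local refusal direction.
    m = layer_mean / layer_mean.norm()
    local = global_dir - (global_dir @ m) * m
    local = local / local.norm()

    # Step 3: decompose each weight column into a norm and a unit direction.
    norms = W.norm(dim=0, keepdim=True)   # (1, d_in)
    dirs = W / norms                      # unit-norm columns

    # Step 4: remove the refusal component from the directions, re-normalize.
    dirs = dirs - scale * torch.outer(local, local @ dirs)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)

    # Step 5: recombine the new directions with the original norms.
    return dirs * norms
```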

Also, my code only works for text-only models. I believe I have made a mistake somewhere; could you please help me out?

> I tinkered with Qwen 3 14B, and (judging from charting) the refusal direction becomes indistinct from layer 20 onward. It's possible that for Qwen 3 they implemented a defense against abliteration, at least when measured after the first token has been generated. I will investigate further.

Yes, I have noticed that, too. In my results, the score peaks at about layer 27, then drops steeply; the last layer, however, reaches a very high score.

I have also tried Llama-3.2-3B-Instruct, and my code failed to remove the refusal behavior there, too. As far as I know, the Llama series is very easy to abliterate; even my earlier implementation of the original abliteration method could remove the refusal behavior successfully.

Respecting the geometry does result in a reduction in refusal removal. I'll have to look at what the charts look like for that Llama model.

Perhaps Qwen 3 training involved implementing countermeasures to abliteration along the lines proposed in this paper.
https://arxiv.org/abs/2409.20089

I can get partial refusal removal with the following configuration; at this point I would conclude that some models appear resistant to this less disruptive approach to ablation. (Measured against full weights.)

```yaml
model: meta-llama/Llama-3.2-3B-Instruct
measurements: l3x.refuse
output: l3x
ablate:
  - layer: 8
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 9
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 10
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 11
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 12
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 13
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 14
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 15
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 16
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 17
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 18
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 19
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 20
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 21
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 22
    measurement: 23
    scale: 1.0
    sparsity: 0.0
  - layer: 23
    measurement: 23
    scale: 1.0
    sparsity: 0.0
```

Fascinating. Dropping norm-preservation seems to be key to getting Llama 3.2 3B Instruct to comply. I should probably add an option to toggle norm-preservation off, then.

This implies that different models encode safety differently between weight magnitudes/norms and directions.
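
For reference, turning norm-preservation off would reduce the edit to the original abliteration projection applied to the raw weight. A minimal sketch (not the exact code in my repo), where `local` is the unit local refusal direction for a layer and W is a (d_model, d_in) projection weight:

```python
import torch

def ablate_no_norm_preservation(W, local, scale=1.0):
    """Original-style edit: project the unit refusal direction `local`
    straight out of the raw weight, so column norms are allowed to shrink."""
    return W - scale * torch.outer(local, local @ W)
```

The only difference from the norm-preserving variant is that the column magnitudes can change, which appears to be exactly the degree of freedom that matters for Llama 3.2 3B.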

Thank you for your insightful discovery. Looking forward to your future work πŸ€—

Thanks for the feedback! I just pushed an update to my repo that leaves all three of my current modifications off by default, so one can toggle them on selectively. This will enable mix-and-match exploration in the future. It seems I got lucky with Gemma 3 12B, finding a model where all three options worked well together.
