# qwen3-moe-aclnn
Pure C++ inference of Qwen3-235B-A22B-Instruct BF16 on Ascend 910 × 16 NPUs, built directly on the aclnn EAGER API (no graph compilation, no PyTorch, no ggml).
Chinese version: README_zh.md
## Performance

Measured on Ascend 910 initial-gen × 16 NPUs (TP=16) with Qwen3-235B-A22B-Instruct-2507 BF16 weights.

All numbers are quality-preserving TG (token-generation throughput; output was manually verified); greedy decoding, temperature=0.
| Configuration | TG | Applicable prompts |
|---|---|---|
| Untuned baseline | 12 t/s | All |
| Default recommended (no PLD) | ~27 t/s | All prompts, stable output |
| PLD with degeneration guard | 29-45 t/s | Structured text (essays, long-form answers) |
| PLD on creative prompts | 25-40 t/s | Stories / varied generation |
| PLD on factual / code prompts | unstable (21-95 t/s, high variance) | Not recommended |
Reference: the cann-recipes-infer GE graph baseline reports ~54 t/s on the same hardware. This project does not exceed that baseline; it trades some peak speed for (a) no graph compilation, (b) no PyTorch dependency, and (c) full control over operator scheduling.
## Key optimizations that contributed (in order of magnitude)

| Rank | Optimization | Gain | Where |
|---|---|---|---|
| 🥇 | HCCL env tuning (AIV + FFTS + TASK_QUEUE=2) | +89% (12→23 t/s) | scripts/tp_launch.sh |
| 🥈 | Fused RoPE via aclnnApplyRotaryPosEmbV2 | +17% (23→27 t/s) | include/rope.h |
| 🥉 | Prompt Lookup Decoding (PLD) w/ degeneration guard | +10-60% on applicable prompts | src/main_cli.cpp |
| – | Device-side topk-w normalize, MoE argsort, cos/sin cache | ~+15% cumulative | include/engine.h |
| – | WorkspacePool (thread-local + retain-old) | reduces alloc overhead | include/workspace_pool.h |
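To illustrate the "thread-local + retain-old" idea behind the WorkspacePool entry, here is a minimal host-side sketch. It is illustrative only (class layout and policy are assumptions, not the contents of `include/workspace_pool.h`): each thread keeps one growing scratch buffer, and when a larger workspace is needed, the old allocation is retained rather than freed, so an in-flight async kernel still reading it is never invalidated.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical "retain-old" workspace pool sketch (names are illustrative).
class WorkspacePool {
public:
    // Returns a buffer of at least `bytes`; grows geometrically.
    void* get(std::size_t bytes) {
        if (bytes > capacity_) {
            std::size_t new_cap = capacity_ ? capacity_ : 1024;
            while (new_cap < bytes) new_cap *= 2;
            // retain-old: keep the previous allocation alive instead of
            // freeing it, in case an async kernel still references it.
            if (current_) retained_.push_back(std::move(current_));
            current_ = std::make_unique<char[]>(new_cap);
            capacity_ = new_cap;
        }
        return current_.get();
    }
    std::size_t capacity() const { return capacity_; }
    std::size_t retained_count() const { return retained_.size(); }
    // Safe to call only after a stream sync, when no kernel can still
    // touch the old buffers.
    void trim() { retained_.clear(); }

private:
    std::unique_ptr<char[]> current_;
    std::vector<std::unique_ptr<char[]>> retained_;
    std::size_t capacity_ = 0;
};

// One pool per thread, matching the "thread-local" design.
inline WorkspacePool& thread_pool() {
    thread_local WorkspacePool pool;
    return pool;
}
```

The point of the design is that `get()` never frees memory eagerly; reclamation happens only at an explicit, stream-synchronized `trim()`.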
## Architecture

- Model: Qwen3-235B-A22B, 94 layers, 128 experts (top-k=8), GQA (64 Q heads, 4 KV heads), BF16.
- Parallelism: TP=16 via HCCL ring AllReduce. KV heads are sharded one per rank (since 4 KV heads < 16 ranks, the four local Q heads 0-3 on each rank share that rank's single local KV head 0).
- Execution: aclnn EAGER mode: every op goes through the aclnn* single-op API with a workspace pool; no graph capture, no GE IR. Async stream execution with TASK_QUEUE_ENABLE=2 for kernel-submission overlap.
- Tokenizer: uses HuggingFace transformers via a Python subprocess for encoding; vocab decode is pure C++ from an exported vocab.bin.
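The GQA sharding arithmetic can be made concrete with a small host-side sketch. This assumes the natural contiguous-block mapping (each rank takes a contiguous slice of Q heads, and ranks 4r..4r+3 replicate global KV head r); the function and struct names are hypothetical, not the project's API.

```cpp
#include <cassert>

// Illustrative GQA head-sharding sketch for TP=16, 64 Q heads, 4 KV heads.
struct Shard {
    int q_head_begin;   // first global Q head on this rank
    int q_heads;        // number of Q heads on this rank
    int kv_head_global; // the single global KV head this rank holds
};

inline Shard shard_for_rank(int rank, int tp, int n_q_heads, int n_kv_heads) {
    int q_per_rank  = n_q_heads / tp;   // 64 / 16 = 4 Q heads per rank
    int ranks_per_kv = tp / n_kv_heads; // 16 / 4  = 4 ranks share one KV head
    // All q_per_rank local Q heads attend to the rank's single KV head,
    // which is locally indexed 0 on every rank.
    return Shard{rank * q_per_rank, q_per_rank, rank / ranks_per_kv};
}
```

Under this mapping, rank 0 holds Q heads 0-3 with KV head 0, and rank 15 holds Q heads 60-63 with KV head 3.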
### Per-layer forward flow

```
x_in [S, D=4096]
  │
  ├── Attention branch (TP: Q_DIM=512=4h×128, KV_DIM=128=1h×128) ───
  │     RmsNorm(input_layernorm)
  │     linear_hf q_proj / k_proj / v_proj → q, k, v
  │     Per-head RmsNorm q_norm, k_norm
  │     Fused RoPE: aclnnApplyRotaryPosEmbV2 (layout=1, "half")
  │     Append K, V to per-layer KV cache
  │     Mask selection:
  │       prefill:      2048×2048 causal + sparse_mode=3
  │       decode S=1:   nullptr + sparse_mode=0
  │       batch decode: [1,1,S,past+S] custom bool mask + sparse_mode=0
  │     FIAS (aclnnFusedInferAttentionScore)
  │     o_proj linear_hf → partial per-rank
  │     HCCL AllReduce (ring + AIV + FFTS) → full
  ├───────────
  │  residual add
  ├── MoE branch ───
  │     RmsNorm(post_attention_layernorm)
  │     router linear_hf → logits [S, 128]
  │     moe_gating_topk_softmax → topk_w[S,8], topk_idx[S,8]
  │     Device-side normalize (reduce_sum + adds + cast + div)
  │     moe_init_routing_v3 → expanded_x, expanded_ri, tokens_per_expert
  │     grouped_matmul_v4 gate/up/down (SwiGLU activation)
  │     Device-side argsort × 2 → fwd permutation (avoids host sync)
  │     IndexSelect → packed
  │     Broadcast-mul by topk_w + ReduceSum axis=1
  │     HCCL AllReduce → full
  ├───────────
  │  residual add
x_out
```
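The MoE router post-processing (device-side normalize of the top-k gate weights, then broadcast-mul and ReduceSum over the k experts) can be illustrated with a plain host-side C++ reference. This is illustrative only; on device these steps are aclnn ops, and the function name here is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference sketch of MoE top-k combination:
//   topk_w:     [S][k]    raw gate weights per token
//   expert_out: [S][k][D] output of each selected expert per token
// Normalizes each token's k weights to sum to 1 (reduce_sum + div),
// then combines the k expert outputs as a weighted sum
// (broadcast-mul + ReduceSum over the expert axis).
std::vector<std::vector<float>> combine_topk(
    const std::vector<std::vector<float>>& topk_w,
    const std::vector<std::vector<std::vector<float>>>& expert_out) {
    std::size_t S = topk_w.size();
    std::vector<std::vector<float>> y(S);
    for (std::size_t s = 0; s < S; ++s) {
        float sum = 0.f;                       // reduce_sum over k
        for (float w : topk_w[s]) sum += w;
        std::size_t D = expert_out[s][0].size();
        y[s].assign(D, 0.f);
        for (std::size_t e = 0; e < topk_w[s].size(); ++e) {
            float w = topk_w[s][e] / sum;      // normalized weight
            for (std::size_t d = 0; d < D; ++d)
                y[s][d] += w * expert_out[s][e][d];  // weighted sum
        }
    }
    return y;
}
```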
## Model weights

This project targets Qwen3-235B-A22B-Instruct-2507 (BF16), about 470 GB of safetensors shards.

Download sources:
- HuggingFace: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
- ModelScope: https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B-Instruct-2507

Download via huggingface-cli or the modelscope CLI:

```shell
# HuggingFace
huggingface-cli download Qwen/Qwen3-235B-A22B-Instruct-2507 --local-dir /path/to/Qwen3-235B-A22B-Instruct-2507-BF16

# ModelScope
modelscope download --model Qwen/Qwen3-235B-A22B-Instruct-2507 --local_dir /path/to/Qwen3-235B-A22B-Instruct-2507-BF16
```

Weights format: the binary reads HuggingFace .safetensors shards (multi-shard mmap), config.json, and tokenizer.json directly from the model directory. No conversion step is needed; point --model-dir at the downloaded directory.

Expected directory contents:

```
Qwen3-235B-A22B-Instruct-2507-BF16/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── model-00001-of-000XX.safetensors
├── ...
└── model.safetensors.index.json
```
## Build

```shell
source /usr/local/Ascend/ascend-toolkit/set_env.sh
cmake -B build
cmake --build build -j8 --target qwen3-moe-aclnn
```

Requires:
- CANN 8.5.1 or compatible
- Python 3 + transformers + torch_npu (for the tokenizer subprocess and reference-data generation only)
- C++17 compiler
- Ascend 910 × 16 NPU
- nlohmann/json (bundled as external/json.hpp)

Python environment setup: the tokenizer calls a Python subprocess. Override the activation command via QWEN3_PYENV_INIT if your conda / venv layout differs from the default:

```shell
export QWEN3_PYENV_INIT="source /opt/my_conda/etc/profile.d/conda.sh && conda activate my_env && "
```

If unset, the default tries ${HOME}/miniconda3 with env qwen3 and auto-sources the Ascend toolkit.
## Quick-start inference

```shell
# 1. Export tokenizer vocab to binary (one-time setup)
python3 scripts/export_vocab.py /path/to/Qwen3-235B-A22B-Instruct-2507-BF16

# 2. Run inference (TP=16)
./scripts/tp_launch.sh 16 ./build/qwen3-moe-aclnn \
  --model-dir /path/to/Qwen3-235B-A22B-Instruct-2507-BF16 \
  --prompt "The capital of France is" \
  --n-predict 100 \
  --temperature 0 \
  --vocab tokenizer_data/vocab.bin
```

Expected: ~27 t/s, coherent output.
## Recommended flags by use case

Universal default (stable, any prompt), no PLD:

```shell
./scripts/tp_launch.sh 16 ./build/qwen3-moe-aclnn --model-dir ... --temperature 0 --no-stream
```

Structured / long-form (essays, explanations): PLD with guard gives +60-90%:

```shell
./scripts/tp_launch.sh 16 ./build/qwen3-moe-aclnn --model-dir ... --pld --temperature 0 --no-stream
```

Interactive REPL (multi-turn chat):

```shell
./scripts/tp_launch.sh 16 ./build/qwen3-moe-aclnn --model-dir ... \
  --interactive --chat --temperature 0.7 --top-p 0.8
```
## PLD degeneration guard
Prompt Lookup Decoding speeds up generation by having the model verify a batch of "draft" tokens in a single forward pass. The drafts are copied from the generation history via n-gram match.
Known failure mode: on prompts the model tends to repeat on (factual Q&A, code generation), the n-gram match feeds the model's own repetition back as drafts, creating a positive feedback loop that accelerates degenerate output. Early versions of this project reported misleading peak TG numbers driven by this loop.
This project's guard blocks suspect drafts with two heuristics:
- low-distinct: the draft's distinct-token count < threshold → reject
- tail-echo: all of the last N history tokens equal draft[0] → reject

Rejected drafts fall back to single-token decode. A [warn] line is emitted once if the generated tail shows 8 consecutive identical tokens.
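A simplified sketch of the n-gram lookup and the two guard heuristics (illustrative C++, not the actual src/main_cli.cpp implementation; the default thresholds mirror the flag defaults in this section):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Find the most recent earlier occurrence of the trailing `ngram` tokens
// of `hist`, and propose up to `k` of the tokens that followed it.
std::vector<int> pld_draft(const std::vector<int>& hist, int ngram, int k) {
    int n = (int)hist.size();
    if (n < ngram + 1) return {};
    for (int start = n - ngram - 1; start >= 0; --start) {
        bool match = true;
        for (int j = 0; j < ngram; ++j)
            if (hist[start + j] != hist[n - ngram + j]) { match = false; break; }
        if (match) {
            std::vector<int> draft;
            for (int j = start + ngram; j < n && (int)draft.size() < k; ++j)
                draft.push_back(hist[j]);
            return draft;
        }
    }
    return {};
}

// Guard: reject drafts that look like self-feeding repetition.
bool guard_accepts(const std::vector<int>& hist, const std::vector<int>& draft,
                   int min_distinct = 3, int tail_window = 6) {
    if (draft.empty()) return false;
    // low-distinct: too few distinct tokens in the draft
    std::set<int> distinct(draft.begin(), draft.end());
    if ((int)distinct.size() < min_distinct) return false;
    // tail-echo: the last `tail_window` history tokens all equal draft[0]
    if ((int)hist.size() >= tail_window) {
        bool echo = true;
        for (int i = 0; i < tail_window; ++i)
            if (hist[hist.size() - 1 - i] != draft[0]) { echo = false; break; }
        if (echo) return false;
    }
    return true;
}
```

A rejected draft simply means the loop falls back to ordinary one-token-per-step decode, so the guard can only cost speed, never correctness.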
Flags:

```
--pld                   enable PLD (opt-in)
--pld-k N               draft window size (default: 10)
--pld-ngram N           n-gram match size (default: 1, with multi-level fallback)
--pld-min-hist N        skip PLD until history >= N tokens (default: 20)
--pld-no-guard          disable the degeneration guard (dangerous: can produce infinite repetition loops)
--pld-guard-distinct N  minimum distinct tokens in draft (default: 3)
--pld-guard-tail N      tail-echo window (default: 6)
--pld-loop-warn N       emit warning on N consecutive identical tokens (default: 8)
```
Honest benchmarking: use scripts/bench_pld_safe.sh, which classifies each run's output as OK / LOOP_N / LOW_DIVERSITY and separates TG statistics for OK-only vs degraded runs.
## Correctness verification

15+ unit / integration tests checked against a Python (HuggingFace Transformers) reference:

```shell
./build/test_attention_layer     # rel=4.9e-4 vs Python prefill
./build/test_attention_decode    # rel=0 (bit-exact)
./build/test_moe_layer           # rel=3.6e-3
./build/test_layer_forward       # full single layer
./build/test_runner              # multi-layer runner
./build/test_rope_fused          # aclnnApplyRotaryPosEmbV2 vs manual HF rotate_half
./build/test_batch_decode        # S=1..8 timing
./build/test_batch_correctness   # argmax consistency
./build/test_op_support          # 910-specific op availability

# Integration smoke:
./tests/test_chat_flow.sh        # 7/7 PASS
```

Tests expect reference data under tests/<name>_data/ generated by scripts/gen_*_reference.py; see each script's docstring.
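For reference, one common definition of such a "rel" metric is the maximum absolute elementwise difference normalized by the reference's maximum magnitude. The test binaries' exact formula may differ; this is just an illustrative sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative relative-error metric: max |out_i - ref_i| / max |ref_i|.
double rel_error(const std::vector<double>& out, const std::vector<double>& ref) {
    double max_diff = 0.0, max_ref = 0.0;
    for (std::size_t i = 0; i < out.size(); ++i) {
        max_diff = std::max(max_diff, std::fabs(out[i] - ref[i]));
        max_ref  = std::max(max_ref, std::fabs(ref[i]));
    }
    return max_ref > 0.0 ? max_diff / max_ref : max_diff;
}
```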
## Environment tuning (auto-applied by tp_launch.sh)

```shell
HCCL_WHITELIST_DISABLE=1
HCCL_ALGO=level0:ring            # ring, not fullmesh (fullmesh causes garbled output)
HCCL_BUFFSIZE=200                # sweet spot; 100 and 400 are both slower
HCCL_OP_EXPANSION_MODE=AIV       # key: AI Vector cores participate in reduce scheduling
HCCL_OP_BASE_FFTS_MODE_ENABLE=1  # key: Fast Frequently-used Transfer Scheduling
TASK_QUEUE_ENABLE=2              # key: aggressive async task submission
```

Removing any of the three "key" env vars drops TG by 20-40%.
## Directory layout

```
include/
├── acl_common.h           RAII wrappers, DeviceBuffer, make_contig_tensor
├── aclnn_ops.h            single-op wrappers + WorkspacePool integration
├── acl_runtime.h          AclRuntime (device + stream management)
├── device_weights.h       safetensors → device loading + TP sharding
├── engine.h               attention_forward + moe_forward + RopeCache
├── hccl_comm.h            HCCL init + allreduce + broadcast
├── model_config.h         Qwen3 hyperparameters + compute_derived
├── rope.h                 apply_rope_fused (aclnnApplyRotaryPosEmbV2 wrapper)
├── runner.h               Runner class (prefill/decode/decode_batch/rewind/profile)
├── safetensors_loader.h   multi-shard safetensors mmap parser
├── tokenizer.h            vocab decode + Python subprocess encode
└── workspace_pool.h       thread-local aclnn workspace pool (retain-old)
src/
├── device_weights.cpp     load_attention (GQA fix), load_moe (permute sync fix)
├── main_cli.cpp           CLI entry + PLD main loop + degeneration guard + multi-turn
├── model_config.cpp       compute_derived (GQA KV sharding)
├── runner.cpp             Runner (build_batch_decode_mask_ etc.)
├── safetensors_loader.cpp
└── tokenizer.cpp
scripts/
├── tp_launch.sh           production launcher (auto-applies HCCL env)
├── bench_tg.sh            stable N-run TG measurement
├── bench_pld_safe.sh      PLD benchmark with output-correctness classifier
├── bench_hccl[_adv].sh    HCCL parameter sweep
├── bench_pld[_k].sh       PLD K × ngram sweep (legacy, prefer bench_pld_safe.sh)
├── export_vocab.py        vocab.bin exporter from HF tokenizer
└── gen_*_reference.py     per-op Python reference data generators
tests/
├── test_attention_*       attention correctness (prefill / decode)
├── test_moe_layer         MoE correctness
├── test_layer_forward     full single layer
├── test_runner            multi-layer Runner
├── test_rope_fused        fused RoPE vs manual HF
├── test_batch_*           batch decode timing + correctness
├── test_op_support        910-specific op availability probe
└── test_chat_flow.sh      end-to-end integration smoke
```
## CLI reference

```
--model-dir <path>   (required) HF safetensors directory
--prompt "<text>"    prompt text
--prompt-file FILE   read prompt from file (avoids shell-escape issues)
--n-predict N        maximum tokens to generate
--tp-size N          tensor parallelism (or set TP_SIZE env)
--max-seq N          KV cache + context cap (default: 512)
--temperature F      0 = greedy; typical 0.7
--top-k N            0 = disabled
--top-p F            1.0 = disabled
--seed N             0 = time-based
--chat               apply Qwen3 chat template
--system "<text>"    system role text (with --chat)
--interactive, -i    REPL mode (multi-turn memory with --chat)
--reset              force stateless REPL (reset KV between turns)
--no-stream          batch-print final text instead of per-token streaming
--vocab <path>       vocab.bin path (default: tokenizer_data/vocab.bin)
--pld*               see "PLD degeneration guard" section
```
## Known limitations

- Not yet reaching the cann-recipes GE graph 54 t/s baseline (currently ~27 t/s stable / up to ~45 t/s with PLD). Closing the gap requires one of: (a) real graph compilation, (b) fused collectives (MatmulAllReduce, GroupedMatmulAllReduce), which are absent on 910 initial-gen, or (c) migration to 910B/A2/A3.
- Only tp_size ∈ {1, 2, 4, 8, 16} is supported. Values that don't evenly divide the 64 Q heads will error.
- PLD on factual/code prompts is unreliable: it either produces baseline TG (the guard rejects most drafts) or enters partial degeneration that the classifier may not catch at low severity. Use bench_pld_safe.sh to evaluate honestly.
- Tokenizer requires a Python subprocess, adding ~1 s of startup for the first encode. Override via the QWEN3_PYENV_INIT env var if the default conda path doesn't match.
- NPU performance has high run-to-run variance (up to 4× in some configurations) due to BF16 + MoE intrinsic non-determinism and shared hardware resources. Report medians over ≥5 runs.
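The tp_size constraint above can be expressed as a one-line validity check. This is an illustrative sketch, not the project's code; the function name is hypothetical, and it encodes the two stated constraints (evenly divides the 64 Q heads, fits the 16-NPU node), which together yield exactly {1, 2, 4, 8, 16}.

```cpp
#include <cassert>

// Illustrative tp_size validity check: tp must evenly divide the Q heads
// and not exceed the number of NPUs in the node.
inline bool tp_size_valid(int tp, int n_q_heads = 64, int n_npus = 16) {
    return tp >= 1 && tp <= n_npus && n_q_heads % tp == 0;
}
```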
## Future directions (prioritized)

- Draft Model Speculative Decoding with Qwen3-0.6B: more stable accept rate than n-gram PLD; expected +60-100% TG across prompt types (1-2 week implementation).
- HCCL AllReduce / compute overlap: ~+10-15% in theory, limited by serial dependencies in the EAGER path.
- KV cache INT8 quantization: reduces memory-bandwidth pressure, ~+15-25% on long contexts (pending 910 initial-gen op-support verification).
- W8 weight quantization: ~+10-20% if aclnn quantization kernels exist on 910 initial-gen.

Not recommended:
- aclmdlRI stream-capture-style graph recording (a POC proved a 1.13× ceiling, not worth the engineering cost).
- Custom AscendC fused ops (high maintenance cost unless a dedicated kernel engineer is available).
- torchair / torch.compile migration (breaks the pure-C++ design).
## Documentation

- docs/optimization-summary-zh.md: phase-by-phase optimization summary (Chinese): key optimizations and their rationale, PLD correctness boundaries, project-level lessons learned.
- docs/next-steps-draft-model-speculative.md: Draft Model Speculative Decoding (Qwen3-0.6B) execution spec: M1-M4 milestones, correctness test protocol, risk mitigation.
## License

Apache License 2.0; see LICENSE.