VoxCPM-0.5B-RKNN2
VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.
Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.
- Inference speed (RKNN2): real-time factor (RTF) of about 4.5 on RK3588, i.e. roughly 45 s of inference to generate 10 s of audio
- Approximate memory usage (RKNN2): about 3.3 GB
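The RTF figure above is simply inference time divided by the duration of the generated audio. A minimal sketch of that arithmetic, using the 45 s / 10 s numbers quoted above (the function name `real_time_factor` is illustrative and not part of this project):

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Return inference time divided by generated audio duration.

    An RTF above 1.0 means generation is slower than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Numbers quoted in the list above: 45 s of inference for 10 s of audio.
rtf = real_time_factor(45.0, 10.0)
print(f"RTF = {rtf:.1f}")
```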
Usage
- Clone the project locally
- Install dependencies
pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
- Run
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, this model actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
Optional parameters:
- --text: Text to generate
- --prompt-audio: Reference audio path (for voice cloning)
- --prompt-text: Text corresponding to the reference audio (required when using reference audio)
- --cfg-value: CFG guidance strength, default 2.0
- --inference-timesteps: Number of diffusion steps, default 10
- --seed: Random seed
- --output: Output audio path
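When the script is driven from another program, it can help to assemble the argument list in code. The sketch below uses only the flags documented above; the helper name `build_command` and the `model_dir` parameter are hypothetical, and whether the script accepts a run without `--prompt-audio` is an assumption that is not confirmed here.

```python
import shlex
from typing import List, Optional


def build_command(
    text: str,
    output: str = "rknn_output.wav",
    prompt_audio: Optional[str] = None,
    prompt_text: Optional[str] = None,
    cfg_value: float = 2.0,
    inference_timesteps: int = 10,
    seed: Optional[int] = None,
    model_dir: str = ".",  # hypothetical: one directory holding all model files
) -> List[str]:
    """Build the argv list for onnx_infer-rknn2.py from the documented flags."""
    # The README states that --prompt-text is required with reference audio.
    if prompt_audio is not None and prompt_text is None:
        raise ValueError("--prompt-text is required when --prompt-audio is given")
    cmd = [
        "python", "onnx_infer-rknn2.py",
        "--onnx-dir", model_dir,
        "--tokenizer-dir", model_dir,
        "--base-hf-dir", model_dir,
        "--residual-hf-dir", model_dir,
        "--text", text,
        "--output", output,
        "--cfg-value", str(cfg_value),
        "--inference-timesteps", str(inference_timesteps),
    ]
    if prompt_audio is not None:
        cmd += ["--prompt-audio", prompt_audio, "--prompt-text", prompt_text]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "Wow, this model actually runs perfectly on the RK3588 SoC!",
        prompt_audio="basic_ref_zh.wav",
        prompt_text="对，这就是我，万人敬仰的太乙真人。",
        seed=1234,
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with non-ASCII prompt text.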
Performance
> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ, ่ฟไธชๆจกๅๅฑ
็ถๅจRK3588่ฟไธช่พฃ้ธกSoCไธไน่ฝๅฎ็พ่ฟ่ก!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผ่ฟๅฐฑๆฏๆ๏ผไธไบบๆฌไปฐ็ๅคชไน็ไบบใ" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] locenc_0: 820.25 ms
[time] locenc_64: 814.78 ms
[time] locenc_128: 815.60 ms
[time] base_lm initial: 549.21 ms
[time] fsq_init_0: 5.34 ms
[time] fsq_init_64: 3.95 ms
[time] fsq_init_128: 4.17 ms
[time] residual_lm initial: 131.22 ms
gen_loop: 0%| | 0/2000 [00:00<?, ?it/s]
[time] lm_to_dit: 1.26 ms
[time] res_to_dit: 1.01 ms
100%|██████████| 10/10 [00:00<00:00, 60.13it/s]
[time] locenc_step: 16.43 ms
gen_loop: 0%| | 1/2000 [00:00<09:43, 3.42it/s]
[time] lm_to_dit: 0.75 ms
[time] res_to_dit: 0.55 ms
100%|██████████| 10/10 [00:00<00:00, 63.99it/s]
[time] locenc_step: 15.93 ms
gen_loop: 0%| | 2/2000 [00:00<09:25, 3.53it/s]
[time] lm_to_dit: 0.74 ms
...
[time] res_to_dit: 0.59 ms
100%|██████████| 10/10 [00:00<00:00, 64.19it/s]
[time] locenc_step: 15.73 ms
gen_loop: 6%|█████ | 123/2000 [00:34<08:47, 3.56it/s]
[time] lm_to_dit: 0.76 ms
[time] res_to_dit: 0.56 ms
100%|██████████| 10/10 [00:00<00:00, 64.08it/s]
[time] locenc_step: 15.82 ms
gen_loop: 6%|█████ | 123/2000 [00:34<08:47, 3.56it/s]
[time] vae_decode_0: 1153.02 ms
[time] vae_decode_60: 1102.36 ms
[time] vae_decode_120: 1105.00 ms
[time] vae_decode_180: 1105.60 ms
[time] vae_decode_240: 1082.36 ms
Saved: rknn_output.wav
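To see where the time goes in a run like the one above, the `[time] <name>: <value> ms` lines can be summed per stage. The following is a rough sketch that assumes only the line format visible in this log; the function name `sum_stage_times` and the rule of grouping entries by stripping a trailing `_<number>` suffix are illustrative choices, not part of the script.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches entries such as "[time] vae_encode_0: 1502.91 ms", including ones
# that share a line with a tqdm progress bar.
TIME_ENTRY = re.compile(r"\[time\]\s+([^:\n]+):\s+([0-9]+(?:\.[0-9]+)?)\s*ms")


def sum_stage_times(log_text: str) -> Dict[str, float]:
    """Sum the reported milliseconds per stage.

    Chunked stages such as vae_encode_0 and vae_encode_38400 are grouped
    under one key by removing the trailing "_<number>" suffix.
    """
    totals: Dict[str, float] = defaultdict(float)
    for match in TIME_ENTRY.finditer(log_text):
        name = re.sub(r"_\d+$", "", match.group(1).strip())
        totals[name] += float(match.group(2))
    return dict(totals)


if __name__ == "__main__":
    import sys

    # Usage: python sum_times.py < run.log
    stage_totals = sum_stage_times(sys.stdin.read())
    for stage, ms in sorted(stage_totals.items(), key=lambda kv: -kv[1]):
        print(f"{stage:20s} {ms:10.2f} ms")
```

Because the log shown here is truncated with `...`, totals computed from it would cover only the lines that are present.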
Model Conversion
TODO: Documentation to be added
Known Issues
- In some cases, speech generation may fall into an infinite loop. The original project appears to have a mechanism for detecting this, but it is not implemented here.
- Due to internal issues with the RKNN toolchain, a single locenc model cannot be configured with two sets of input shapes for the two input lengths, so two separate models are converted instead.
- Due to internal issues with the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one-quarter of the correct result. Multiplying them by 4 manually yields the correct result.
- (Solved) The RKNN toolchain does not currently support data-parallel inference across multiple NPU cores for multi-batch models with non-4D inputs, so CFG in the script was run as two separate passes, which was relatively slow.
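The RKLLM output-scaling workaround listed above amounts to one multiplication on the returned hidden states. A minimal sketch, assuming the LLM outputs arrive as NumPy arrays; the names `RKLLM_OUTPUT_SCALE` and `rescale_llm_output` are hypothetical and do not come from the script:

```python
import numpy as np

# Factor taken from the known issue above: the RKLLM outputs are reported
# to be one-quarter of the correct values.
RKLLM_OUTPUT_SCALE = 4.0


def rescale_llm_output(hidden: np.ndarray) -> np.ndarray:
    """Return a float32 copy of the LLM output multiplied by the correction factor."""
    return hidden.astype(np.float32) * RKLLM_OUTPUT_SCALE


# Example with made-up values standing in for a quarter-scaled output tensor.
raw = np.array([[0.25, -0.5], [1.0, 0.0]], dtype=np.float16)
print(rescale_llm_output(raw))
```

If a later RKLLM release fixes the scaling, the constant would need to be set back to 1.0, so keeping it in one named place makes that change easy to find.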
