VoxCPM-0.5B-RKNN2

(English README below)

VoxCPM ๆ˜ฏไธ€็งๅˆ›ๆ–ฐ็š„ๆ— ๅˆ†่ฏๅ™จๆ–‡ๆœฌ่ฝฌ่ฏญ้Ÿณ๏ผˆTTS๏ผ‰็ณป็ปŸ๏ผŒ้‡ๆ–ฐๅฎšไน‰ไบ†่ฏญ้Ÿณๅˆๆˆ็š„็œŸๅฎžๆ„Ÿใ€‚้€š่ฟ‡ๅœจ่ฟž็ปญ็ฉบ้—ดไธญๅปบๆจก่ฏญ้Ÿณ๏ผŒๅฎƒๅ…‹ๆœไบ†็ฆปๆ•ฃๆ ‡่ฎฐๅŒ–็š„ๅฑ€้™๏ผŒๅนถๅฎž็Žฐไบ†ไธค้กนๆ ธๅฟƒ่ƒฝๅŠ›๏ผšไธŠไธ‹ๆ–‡ๆ„Ÿ็Ÿฅ็š„่ฏญ้Ÿณ็”Ÿๆˆๅ’Œ้€ผ็œŸ็š„้›ถๆ ทๆœฌ่ฏญ้Ÿณๅ…‹้š†ใ€‚

ไธๅŒไบŽๅฐ†่ฏญ้Ÿณ่ฝฌๆขไธบ็ฆปๆ•ฃๆ ‡่ฎฐ็š„ไธปๆตๆ–นๆณ•๏ผŒVoxCPM ้‡‡็”จ็ซฏๅˆฐ็ซฏ็š„ๆ‰ฉๆ•ฃ่‡ชๅ›žๅฝ’ๆžถๆž„๏ผŒ็›ดๆŽฅไปŽๆ–‡ๆœฌ็”Ÿๆˆ่ฟž็ปญ็š„่ฏญ้Ÿณ่กจ็คบใ€‚ๅฎƒๅŸบไบŽ MiniCPM-4 ไธปๅนฒๆž„ๅปบ๏ผŒ้€š่ฟ‡ๅˆ†ๅฑ‚่ฏญ่จ€ๅปบๆจกๅ’Œ FSQ ็บฆๆŸๅฎž็Žฐไบ†้šๅผ็š„่ฏญไน‰-ๅฃฐๅญฆ่งฃ่€ฆ๏ผŒๆžๅคงๅœฐๆๅ‡ไบ†่กจ็ŽฐๅŠ›ๅ’Œ็”Ÿๆˆ็จณๅฎšๆ€งใ€‚

ๆ€ง่ƒฝๆŒ‡ๆ ‡

  • ๆŽจ็†้€Ÿๅบฆ(RKNN2)๏ผšRK3588ไธŠRTF็บฆ4.5๏ผˆ็”Ÿๆˆ10s้Ÿณ้ข‘้œ€่ฆๆŽจ็†45s๏ผ‰
  • ๅคง่‡ดๅ†…ๅญ˜ๅ ็”จ(RKNN2)๏ผš็บฆ3.3GB
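
ไธŠ้ข็š„ RTF๏ผˆๅฎžๆ—ถ็އ๏ผ‰ๅฏไปฅ็”จๆŽจ็†ๅข™้’Ÿๆ—ถ้—ด้™คไปฅ็”Ÿๆˆ้Ÿณ้ข‘ๆ—ถ้•ฟๅพ—ๅ‡บ๏ผ›ไธ‹้ขๆ˜ฏไธ€ไธช็คบๆ„ๆ€ง็š„่ฎก็ฎ—ไพ‹ๅญ๏ผˆๅ‡ฝๆ•ฐๅไธบไธพไพ‹๏ผŒๅนถ้ž่„šๆœฌไธญ็š„ๅฎž้™…ๅฎž็Žฐ๏ผ‰๏ผš

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """RTF = ๆŽจ็†ๆ‰€็”จๅข™้’Ÿๆ—ถ้—ด / ็”Ÿๆˆ้Ÿณ้ข‘ๆ—ถ้•ฟ๏ผ›ๅฐไบŽ 1 ่กจ็คบๅฟซไบŽๅฎžๆ—ถใ€‚"""
    return inference_seconds / audio_seconds

# RK3588 ไธŠ็”Ÿๆˆ 10s ้Ÿณ้ข‘็บฆ้œ€ 45s ๆŽจ็†
print(real_time_factor(45.0, 10.0))  # 4.5
```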

ไฝฟ็”จๆ–นๆณ•

  1. ๅ…‹้š†้กน็›ฎๅˆฐๆœฌๅœฐ

  2. ๅฎ‰่ฃ…ไพ่ต–

pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
  1. ่ฟ่กŒ
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ“‡, ่ฟ™ไธชๆจกๅž‹ๅฑ…็„ถๅœจRK3588่ฟ™ไธช่พฃ้ธกSoCไธŠไนŸ่ƒฝๅฎŒ็พŽ่ฟ่กŒ!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผŒ่ฟ™ๅฐฑๆ˜ฏๆˆ‘๏ผŒไธ‡ไบบๆ•ฌไปฐ็š„ๅคชไน™็œŸไบบใ€‚" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

ๅฏ้€‰ๅ‚ๆ•ฐ๏ผš

  • --text: ่ฆ็”Ÿๆˆ็š„ๆ–‡ๆœฌ
  • --prompt-audio: ๅ‚่€ƒ้Ÿณ้ข‘่ทฏๅพ„๏ผˆ็”จไบŽ่ฏญ้Ÿณๅ…‹้š†๏ผ‰
  • --prompt-text: ๅ‚่€ƒ้Ÿณ้ข‘ๅฏนๅบ”็š„ๆ–‡ๆœฌ๏ผˆไฝฟ็”จๅ‚่€ƒ้Ÿณ้ข‘ๆ—ถๅฟ…ๅกซ๏ผ‰
  • --cfg-value: CFGๅผ•ๅฏผๅผบๅบฆ๏ผŒ้ป˜่ฎค2.0
  • --inference-timesteps: ๆ‰ฉๆ•ฃๆญฅๆ•ฐ๏ผŒ้ป˜่ฎค10
  • --seed: ้šๆœบ็งๅญ
  • --output: ่พ“ๅ‡บ้Ÿณ้ข‘่ทฏๅพ„
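
ๅ…ถไธญ --cfg-value ๆŽงๅˆถ classifier-free guidance๏ผˆCFG๏ผ‰็š„ๅผ•ๅฏผๅผบๅบฆใ€‚ไธ‹้ขๆ˜ฏ CFG ๅธธ่ง็ป„ๅˆๆ–นๅผ็š„็คบๆ„ๅฎž็Žฐ๏ผˆไป…ไธบ่ฏดๆ˜ŽๅŽŸ็†๏ผŒๅ‡ฝๆ•ฐๅไธบไธพไพ‹๏ผŒๅนถ้žๆœฌ่„šๆœฌ็š„ๅฎž้™…ไปฃ็ ๏ผ‰๏ผš

```python
import numpy as np

def apply_cfg(cond: np.ndarray, uncond: np.ndarray, cfg_value: float) -> np.ndarray:
    """ๅธธ่ง็š„ CFG ็ป„ๅˆ๏ผšcfg_value=1.0 ็ญ‰ไปทไบŽ็บฏๆกไปถ่พ“ๅ‡บ๏ผ›ๅ–ๅ€ผ่ถŠๅคง่ถŠ่ดดๅˆๆ–‡ๆœฌ๏ผŒไฝ†ๅฏ่ƒฝๆŸๅคฑ่‡ช็„ถๅบฆใ€‚"""
    return uncond + cfg_value * (cond - uncond)
```

ๆณจๆ„ cond ไธŽ uncond ้œ€่ฆไธคๆฌกๅ‰ๅ‘ๆŽจ็†ๆ‰่ƒฝๅพ—ๅˆฐ๏ผŒ่ฟ™ไนŸๆ˜ฏใ€Œๅทฒ็Ÿฅ้—ฎ้ข˜ใ€ไธญๆๅˆฐ็š„ CFG ๅผ€้”€็š„ๆฅๆบใ€‚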

่ฟ่กŒๆ•ˆๆžœ

> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ“‡, ่ฟ™ไธชๆจกๅž‹ๅฑ…็„ถๅœจRK3588่ฟ™ไธช่พฃ้ธกSoCไธŠไนŸ่ƒฝๅฎŒ็พŽ่ฟ่กŒ!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผŒ่ฟ™ๅฐฑๆ˜ฏๆˆ‘๏ผŒไธ‡ไบบๆ•ฌไปฐ็š„ๅคชไน™็œŸไบบใ€‚" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] locenc_0: 820.25 ms
[time] locenc_64: 814.78 ms
[time] locenc_128: 815.60 ms
[time] base_lm initial: 549.21 ms
[time] fsq_init_0: 5.34 ms
[time] fsq_init_64: 3.95 ms
[time] fsq_init_128: 4.17 ms
[time] residual_lm initial: 131.22 ms
gen_loop:   0%|                                                                                 | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
[time] res_to_dit: 1.01 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 60.13it/s]
[time] locenc_step: 16.43 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–                         | 7/10 [00:00<00:00, 61.20it/s]
gen_loop:   0%|                                                                         | 1/2000 [00:00<09:43,  3.42it/s][time] lm_to_dit: 0.75 ms
[time] res_to_dit: 0.55 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 63.99it/s]
[time] locenc_step: 15.93 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–                         | 7/10 [00:00<00:00, 67.27it/s]
gen_loop:   0%|                                                                         | 2/2000 [00:00<09:25,  3.53it/s][time] lm_to_dit: 0.74 ms

...

[time] res_to_dit: 0.59 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 64.19it/s]
[time] locenc_step: 15.73 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–                         | 7/10 [00:00<00:00, 67.34it/s]
gen_loop:   6%|โ–ˆโ–ˆโ–ˆโ–ˆโ–Ž                                                                  | 123/2000 [00:34<08:47,  3.56it/s][time] lm_to_dit: 0.76 ms
[time] res_to_dit: 0.56 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 64.08it/s]
[time] locenc_step: 15.82 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–                         | 7/10 [00:00<00:00, 67.43it/s]
gen_loop:   6%|โ–ˆโ–ˆโ–ˆโ–ˆโ–Ž                                                                  | 123/2000 [00:34<08:47,  3.56it/s]
[time] vae_decode_0: 1153.02 ms
[time] vae_decode_60: 1102.36 ms
[time] vae_decode_120: 1105.00 ms
[time] vae_decode_180: 1105.60 ms
[time] vae_decode_240: 1082.36 ms
Saved: rknn_output.wav

ๆจกๅž‹่ฝฌๆข

ๆ‡’ๅพ—ๅ†™ไบ†๏ผŒๅพ…่กฅๅ……

ๅทฒ็Ÿฅ้—ฎ้ข˜

  • ๆŸไบ›ๆƒ…ๅ†ตไธ‹่ฏญ้Ÿณ็”Ÿๆˆๅฏ่ƒฝ้™ทๅ…ฅๆญปๅพช็Žฏ๏ผŒๅŽŸ้กน็›ฎไผผไนŽๆœ‰ๆฃ€ๆต‹ๆญปๅพช็Žฏ็š„ๆœบๅˆถ๏ผŒไฝ†ๆˆ‘่ฟ™้‡Œๆฒกๆœ‰ๅฎž็Žฐใ€‚
  • ็”ฑไบŽRKNNๅทฅๅ…ท้“พ็š„ๅ†…้ƒจ้—ฎ้ข˜๏ผŒlocencๆจกๅž‹ๆฒกๆœ‰ๅŠžๆณ•ๅœจไธ€ไธชๆจกๅž‹้‡Œ้…็ฝฎไธค็ง่พ“ๅ…ฅ้•ฟๅบฆ็š„ไธค็ป„shape๏ผŒๅ› ๆญคๅช่ƒฝๅ•็‹ฌ่ฝฌๆขไธคไธชๆจกๅž‹ใ€‚
  • ็”ฑไบŽRKLLMๅทฅๅ…ท้“พ/่ฟ่กŒๆ—ถ็š„ๅ†…้ƒจ้—ฎ้ข˜๏ผŒไธคไธชLLM็š„่พ“ๅ‡บๅผ ้‡็š„ๆ•ฐๅ€ผ้ƒฝๅชๆœ‰ๆญฃ็กฎ็ป“ๆžœ็š„ๅ››ๅˆ†ไน‹ไธ€๏ผŒๆ‰‹ๅŠจไน˜4ไน‹ๅŽๅฏไปฅๅพ—ๅˆฐๆญฃ็กฎ็ป“ๆžœใ€‚
  • ็”ฑไบŽRKNNๅทฅๅ…ท้“พ็›ฎๅ‰ไธๆ”ฏๆŒ้ž4็ปด่พ“ๅ…ฅๆจกๅž‹ๅคšbatchไฝฟ็”จๅคšNPUๆ ธ็š„ๆ•ฐๆฎๅนถ่กŒๆŽจ็†๏ผŒ่„šๆœฌไธญCFGๆ›พๅˆ†ไธคๆฌกๅ•็‹ฌ่ฟ›่กŒ๏ผŒ้€Ÿๅบฆ่พƒๆ…ขใ€‚๏ผˆๅทฒไฟฎๅค๏ผ‰
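
ไธŠ่ฟฐใ€Œไน˜ 4ใ€ไฟฎๆญฃๅคง่‡ดๅฆ‚ไธ‹๏ผˆๅ‡ฝๆ•ฐไธŽๅธธ้‡ๅไธบ็คบไพ‹๏ผŒๅนถ้ž่„šๆœฌๅŽŸๆ–‡๏ผ‰๏ผš

```python
import numpy as np

RKLLM_OUTPUT_SCALE = 4.0  # ๅฎžๆต‹ RKLLM ่พ“ๅ‡บๅผ ้‡ไป…ไธบๆญฃ็กฎๅ€ผ็š„ 1/4

def fix_rkllm_output(hidden: np.ndarray) -> np.ndarray:
    """ๅฏน RKLLM ่ฟ”ๅ›ž็š„่พ“ๅ‡บๅผ ้‡ๆ‰‹ๅŠจไน˜ 4๏ผŒ่กฅๅฟๅทฅๅ…ท้“พ/่ฟ่กŒๆ—ถ็š„็ผฉๆ”พ้—ฎ้ข˜ใ€‚"""
    return hidden * RKLLM_OUTPUT_SCALE
```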

ๅ‚่€ƒ

English README

VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.

Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.

  • Inference speed (RKNN2): RTF approximately 4.5 on RK3588 (45s inference time to generate 10s audio)
  • Approximate memory usage (RKNN2): ~3.3GB
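
The RTF (real-time factor) quoted above is simply wall-clock inference time divided by the duration of the generated audio; a minimal sketch (the function name is illustrative, not part of the script):

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock inference time / generated audio duration; < 1 means faster than real time."""
    return inference_seconds / audio_seconds

# ~45 s of inference to produce 10 s of audio on RK3588
print(real_time_factor(45.0, 10.0))  # 4.5
```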

Usage

  1. Clone the project locally

  2. Install dependencies

pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
  3. Run
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, this model actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผŒ่ฟ™ๅฐฑๆ˜ฏๆˆ‘๏ผŒไธ‡ไบบๆ•ฌไปฐ็š„ๅคชไน™็œŸไบบใ€‚" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

Parameters:

  • --text: Text to generate
  • --prompt-audio: Reference audio path (for voice cloning)
  • --prompt-text: Text corresponding to the reference audio (required when using reference audio)
  • --cfg-value: CFG guidance strength, default 2.0
  • --inference-timesteps: Number of diffusion steps, default 10
  • --seed: Random seed
  • --output: Output audio path
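
--cfg-value sets the classifier-free guidance (CFG) strength. The script's internals are not reproduced here, but a common CFG combination looks like this (illustrative only; the function name is an assumption):

```python
import numpy as np

def apply_cfg(cond: np.ndarray, uncond: np.ndarray, cfg_value: float) -> np.ndarray:
    """Standard CFG blend: cfg_value=1.0 reduces to the conditional output;
    larger values follow the text more closely at some cost in naturalness."""
    return uncond + cfg_value * (cond - uncond)
```

Obtaining cond and uncond requires two forward passes, which is the source of the CFG overhead mentioned under Known Issues.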

Performance

> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ“‡, ่ฟ™ไธชๆจกๅž‹ๅฑ…็„ถๅœจRK3588่ฟ™ไธช่พฃ้ธกSoCไธŠไนŸ่ƒฝๅฎŒ็พŽ่ฟ่กŒ!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผŒ่ฟ™ๅฐฑๆ˜ฏๆˆ‘๏ผŒไธ‡ไบบๆ•ฌไปฐ็š„ๅคชไน™็œŸไบบใ€‚" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] locenc_0: 820.25 ms
[time] locenc_64: 814.78 ms
[time] locenc_128: 815.60 ms
[time] base_lm initial: 549.21 ms
[time] fsq_init_0: 5.34 ms
[time] fsq_init_64: 3.95 ms
[time] fsq_init_128: 4.17 ms
[time] residual_lm initial: 131.22 ms
gen_loop:   0%|                                                                                 | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
[time] res_to_dit: 1.01 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 60.13it/s]
[time] locenc_step: 16.43 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–                         | 7/10 [00:00<00:00, 61.20it/s]
gen_loop:   0%|                                                                         | 1/2000 [00:00<09:43,  3.42it/s][time] lm_to_dit: 0.75 ms
[time] res_to_dit: 0.55 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 63.99it/s]
[time] locenc_step: 15.93 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–                         | 7/10 [00:00<00:00, 67.27it/s]
gen_loop:   0%|                                                                         | 2/2000 [00:00<09:25,  3.53it/s][time] lm_to_dit: 0.74 ms

...

[time] res_to_dit: 0.59 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 64.19it/s]
[time] locenc_step: 15.73 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–                         | 7/10 [00:00<00:00, 67.34it/s]
gen_loop:   6%|โ–ˆโ–ˆโ–ˆโ–ˆโ–Ž                                                                  | 123/2000 [00:34<08:47,  3.56it/s][time] lm_to_dit: 0.76 ms
[time] res_to_dit: 0.56 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 64.08it/s]
[time] locenc_step: 15.82 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–                         | 7/10 [00:00<00:00, 67.43it/s]
gen_loop:   6%|โ–ˆโ–ˆโ–ˆโ–ˆโ–Ž                                                                  | 123/2000 [00:34<08:47,  3.56it/s]
[time] vae_decode_0: 1153.02 ms
[time] vae_decode_60: 1102.36 ms
[time] vae_decode_120: 1105.00 ms
[time] vae_decode_180: 1105.60 ms
[time] vae_decode_240: 1082.36 ms
Saved: rknn_output.wav

Model Conversion

TODO: Documentation to be added

Known Issues

  • In some cases, speech generation may fall into an infinite loop. The original project seems to have a mechanism to detect infinite loops, but it is not implemented here.
  • Due to internal issues with the RKNN toolchain, the locenc model cannot configure two sets of shapes for two different input lengths in a single model, so two separate models must be converted.
  • Due to internal issues with the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one-quarter of the correct result. Multiplying by 4 manually yields the correct result.
  • Since the RKNN toolchain currently does not support data-parallel inference across multiple NPU cores for multi-batch models with non-4D inputs, CFG in the script used to run as two separate passes, which was relatively slow. (Fixed.)
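
The ×4 correction described above amounts to the following (function and constant names are illustrative, not taken from the script):

```python
import numpy as np

RKLLM_OUTPUT_SCALE = 4.0  # empirically, RKLLM output tensors are 1/4 of the correct values

def fix_rkllm_output(hidden: np.ndarray) -> np.ndarray:
    """Compensate for the RKLLM toolchain/runtime scaling bug by multiplying by 4."""
    return hidden * RKLLM_OUTPUT_SCALE
```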

