Developer & User Discourse
cacard • Apr 3, 2026
Generating 14 seconds of audio takes 1.12 seconds on average, so RTF = 0.08. Not bad. (on a 5090 laptop with 24 GB VRAM)
rennyka-107 • Apr 3, 2026
@cacard what's your config? I only get RTF = 0.3 on a 3090 and even on a 5090 (with the same num_step=16).
cacard • Apr 3, 2026
> [@cacard](https://github.com/cacard) what's your config? I only get RTF = 0.3 on a 3090 and even on a 5090 (with the same num_step=16).
Let me run the test again and see.
cacard • Apr 3, 2026
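344 seconds of audio generated in 51 seconds: RTF = 0.15
Test method:
1) Run a custom HTTP server that loads the model only once; all subsequent HTTP requests reuse the model already in VRAM.
2) Send 50 random voice-cloning requests, serially.
3) Record the total duration of generated audio and the total elapsed time.
Result:
344 seconds of audio generated in total, 51 seconds elapsed, so RTF = 51 / 344 ≈ 0.15.
Machine: 5090 laptop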
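For anyone who wants to reproduce the measurement, here is a minimal sketch of the aggregate RTF calculation described above. It is not the actual test script: `synthesize` is a hypothetical callable standing in for one blocking request to the HTTP server, assumed to return the duration in seconds of the audio it generated. The timer is wall-clock, so it assumes each call only returns once the audio is fully generated.

```python
import time
from typing import Callable, Iterable, Tuple, TypeVar

Request = TypeVar("Request")


def rtf(elapsed_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return elapsed_seconds / audio_seconds


def aggregate_rtf(
    requests: Iterable[Request],
    synthesize: Callable[[Request], float],  # hypothetical: returns audio seconds
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, float]:
    """Run requests serially and return (total_audio, total_elapsed, rtf)."""
    total_audio = 0.0
    total_elapsed = 0.0
    for request in requests:
        start = clock()
        audio_seconds = synthesize(request)  # blocks until audio is generated
        total_elapsed += clock() - start
        total_audio += audio_seconds
    return total_audio, total_elapsed, rtf(total_elapsed, total_audio)


# The two figures reported in this thread:
print(round(rtf(1.12, 14), 2))  # single 14-second clip
print(round(rtf(51, 344), 2))   # 50 serial requests, 344 seconds of audio
```

Because the aggregate version divides total elapsed time by total audio duration, short and long clips are weighted by their length. That may partly explain why it differs from a single-clip figure.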
zhu-han • Apr 4, 2026
RTF depends on the GPU, the number of inference steps, the batch size, and especially the lengths of the audio prompt and the generated audio. Without aligning the evaluation setup, even identical GPUs can yield highly divergent RTF results.
Anyone interested can refer to our evaluation setup in https://github.com/k2-fsa/OmniVoice/issues/7#issuecomment-4181480657
SaaS Metrics