Gemini Executive Synthesis

How OmniVoice's TTS pronounces numbers in English and Turkish. The reported issue is a failure to pronounce Arabic numerals of two or more digits.

Technical Positioning

High-quality voice cloning TTS for 600+ languages, implying comprehensive linguistic coverage. The goal is accurate and natural pronunciation across all supported languages, including numerical expressions.

SaaS Insight & Market Implications

This issue reports that OmniVoice cannot reliably pronounce Arabic numerals of two or more digits in English and Turkish. According to the maintainer, the model ships without a text normalization package, so digits are read correctly only where the training data covered them (simple Chinese and English numbers) and poorly in languages with limited training data, such as Turkish. Unless callers normalize their text first, this limits the model's usefulness for any application involving numerical data, from financial reports to automated customer service, and it weakens the 'high-quality' claim in professional contexts. For B2B SaaS, this highlights the importance of robust text normalization, including numerical expressions, for foundational functionality and market credibility.

Raw Developer Origin & Technical Request

GitHub Issue Apr 5, 2026

Repo: k2-fsa/OmniVoice

Can't pronounce 2+ digit numbers

I tried it with English and Turkish and it couldn't pronounce 2+ digit Arabic numerals, e.g. 16, 123, 1234...

Developer Debate & Comments

zhu-han • Apr 5, 2026

Hi, although the model indeed cannot handle very complex digits well, this is because we did not incorporate any text normalization package to process them, as we have discussed in the limitation section of our paper. I think the model can handle the provided numerical examples properly. I directly input your question (`I tried it with English and Turkish and it couldn't pronounce 2+ digit Arabic numerals, e.g. 16, 123, 1234...`) into the Hugging Face Space demo, and it generated these numbers correctly. So could I have more detailed information on what examples you have tried? Please provide examples that I can reproduce.

stacsk • Apr 6, 2026

@zhu-han Hi. I tried it with this Turkish audio, https://www.youtube.com/watch?v=c24BMCRJFdg, from seconds 17 to 25. This is the text I asked it to read in Turkish: _1920'lerden beri İtalya, 1935 yılını çok önemli bir tarih olarak tanımlamıştı. Bu 6 rakamı, bu 16, bu 29, bu 193 ve bu da 1923._ This is the English version of the same text: _Since the 1920s Italy had identified the year 1935 as a crucial date. This is the number 6, this is 16, this is 29, this is 193, and this is 1923._ In Turkish, it can only pronounce the number 6; the others are glitchy. I tried both auto detect and choosing Turkish. With the English text, auto detect and Turkish failed, but when choosing English, it could more or less pronounce the first 4 numbers, though they sounded weird. I did not provide reference text.

zhu-han • Apr 8, 2026

I recommend using a text normalization tool to convert digits into words. The model can handle simple Chinese and English digits, as it has seen some of these patterns during training. However, Turkish training data is very limited, so digits are hard to process correctly. You mentioned the English pronunciation sounds weird. I guess this is because the Turkish reference audio gives the generated English speech a Turkish accent. The model will handle your English text properly if you use voice design mode or an English reference audio instead. For more robust digit handling, text normalization is standard practice for TTS models. For Chinese and English, you can use [WeTextProcessing](https://github.com/wenet-e2e/WeTextProcessing); for other languages, you’ll need to find a suitable tool yourself.
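The normalization step recommended above can be sketched as follows. This is a minimal, English-only illustration, not WeTextProcessing's API and not part of OmniVoice; the helper names `number_to_words` and `normalize_digits` are hypothetical. It assumes plain cardinal readings for integers from 0 to 999,999 and leaves everything else untouched (for example "1920s", decimals, or year-style readings such as "nineteen twenty-three").

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999_999 as English cardinal words."""
    if not 0 <= n <= 999_999:
        raise ValueError("this sketch only supports 0..999999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail


def normalize_digits(text: str) -> str:
    """Replace each standalone run of digits with its spelled-out form."""
    def spell(match: re.Match) -> str:
        value = int(match.group())
        # Numbers outside the supported range are left as digits.
        return number_to_words(value) if value <= 999_999 else match.group()

    # \b on both sides skips digit runs glued to letters, such as "1920s".
    return re.sub(r"\b\d+\b", spell, text)


print(normalize_digits("This is 16, this is 29, this is 193, and this is 1923."))
```

The normalized string, rather than the raw text, would then be passed to the TTS model. Turkish needs its own word tables and suffix handling, which this sketch does not attempt.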

Adjacent Repository Pain Points

Other highly discussed features and pain points extracted from k2-fsa/OmniVoice.

How to get a stable voice

Extracted Positioning

OmniVoice's voice consistency across multiple TTS generations, particularly when chunking large texts. The issue is voice instability (timbre, speed variations) between chunks.

High-quality voice cloning TTS for 600+ languages, implying consistent and professional output. The goal is to enable stable, continuous voice generation for long-form content like audiobooks.

Top Replies

dignome • Apr 5, 2026

Generate a custom voice you like and then feed that back in using reference audio prompt method.

gecko984 • Apr 5, 2026

@dignome thanks, but it seems like overkill and will cause a huge time and compute overhead

dignome • Apr 5, 2026

I find that if you include an accent description as well, it's more stable. As for the extra overhead, with CUDA I can't even tell if it's slower; it just works very fast.

If the reference audio is Japanese and the text to synthesize is Chinese, the output Chinese carries a Japanese accent

Extracted Positioning

OmniVoice's cross-language voice cloning, specifically the issue of retaining the 'reference audio's accent' (e.g., Japanese accent) when synthesizing text in a different language (e.g., Chinese).

High-quality voice cloning TTS for 600+ languages, implying flexible and controllable voice synthesis. The goal is to offer granular control over accent retention during cross-language cloning.

Top Replies

zhu-han • Apr 4, 2026

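Carrying over the reference audio's accent during cross-lingual cloning is fairly normal for models like OmniVoice that are trained with in-context learning. There is currently no good solution.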

sdqq1234 • Apr 4, 2026

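> Carrying over the reference audio's accent during cross-lingual cloning is fairly normal for models like OmniVoice that are trained with in-context learning. There is currently no good solution. Okay. Actually, I wanted to try making Chinese dubs of English and Japanese content. So is this model...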

zhu-han • Apr 4, 2026

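Purely from the model's perspective, it will clone the accent. If your use case needs to keep only the timbre and not the accent, the model currently offers no control at that granularity.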

CUDA OOM during voice cloning (≤8 GB VRAM) + suggested temporary workaround

Extracted Positioning

OmniVoice's VRAM consumption, specifically 'CUDA OOM' errors on GPUs with ≤8 GB VRAM during omnivoice-demo execution. The issue is excessive memory usage by the web UI.

High-quality voice cloning TTS, implying accessibility on common hardware configurations. The goal is to optimize memory footprint for broader compatibility and efficient inference.

Top Replies

gitchat1 • Apr 5, 2026

Where exactly do you have to make that change in order for it to launch like that automatically?

utof • Apr 5, 2026

@gitchat1 just when you run omnivoice-demo inside the terminal, do this (bash) `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True uv run omnivoice-demo`
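For launching with that setting automatically, one option is a small wrapper script that sets the variable before starting the demo. The sketch below assumes the same `uv run omnivoice-demo` command quoted in the thread; the function names `build_launch` and `launch_demo` are hypothetical and not part of OmniVoice.

```python
import os
import subprocess


def build_launch(base_env=None):
    """Return (command, env) mirroring the shell one-liner from the thread."""
    env = dict(os.environ if base_env is None else base_env)
    # PyTorch allocator setting quoted in the thread as a temporary workaround.
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    command = ["uv", "run", "omnivoice-demo"]
    return command, env


def launch_demo():
    """Start the demo with the allocator setting applied to the child process."""
    command, env = build_launch()
    return subprocess.run(command, env=env, check=True)
```

Calling `launch_demo()` from a script only affects the child process, so the variable does not leak into the rest of the shell session.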

utof • Apr 5, 2026

Interestingly, it works fine when I run omnivoice-infer. The problem is somewhere in the web UI.

RTF statistics on consumer-grade GPUs (e.g., 5090/4090)

Extracted Positioning

OmniVoice's Real-Time Factor (RTF) performance on consumer-grade GPUs (e.g., 5090/4090). The user is inquiring about typical RTF statistics.

High-quality voice cloning TTS, implying efficient performance on accessible hardware. The goal is to understand and optimize real-time synthesis capabilities for a broad user base.

Top Replies

cacard • Apr 3, 2026

Generating 14 seconds of audio takes 1.12 seconds on average, so RTF = 0.08, which is pretty good. (on 24G VRAM 5090 laptop)
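The figure quoted above follows from the usual definition of real-time factor: synthesis time divided by the duration of the generated audio. A minimal sketch, using the numbers from this reply (the function name `real_time_factor` is illustrative):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


rtf = real_time_factor(1.12, 14.0)
print(f"RTF = {rtf:.2f}")  # prints "RTF = 0.08"
```

Values below 1 mean synthesis runs faster than playback; when comparing numbers across machines, settings such as `num_step` need to match.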

rennyka-107 • Apr 3, 2026

@cacard what's your config? I only got RTF = 0.3 on 3090 and even 5090. (with same num_step=16)

cacard • Apr 3, 2026

> [@cacard](https://github.com/cacard) what's your config? I only got RTF = 0.3 on 3090 and even 5090. (with same num_step=16) I'll test again and see.

How to save the cloned voice model for the next use

Extracted Positioning

OmniVoice, a high-quality voice cloning TTS model. The specific feature request is the ability to save cloned voice models for reuse, avoiding re-uploading reference audio and text.

Delivering a market-leading, high-speed, multi-language TTS with realistic voices. The goal is to enhance user experience and efficiency by enabling persistence of cloned voice profiles.

Top Replies

mesouravcodes • Apr 6, 2026

There should be a dropdown menu to select a saved cloned voice. Please add it if possible.

MNeMoNiCuZ • Apr 6, 2026

Saving a used sample into a /samples folder, with a config, and a dropdown would be a good idea for the demo project. If you are running this yourself outside of the UI, you would set up these conf...

gecko984 • Apr 7, 2026

As far as I understand, the nature of the model is such that there exists no well defined internal artifact representing a voice. So all you can really do is use the same reference audio file over ...
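The replies in this thread point toward persisting the reference audio and its transcript rather than any model artifact. A minimal sketch of that idea follows; the `samples` folder comes from the suggestion above, while the `VoiceProfile` fields and function names are hypothetical and not taken from OmniVoice.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class VoiceProfile:
    name: str
    ref_audio: str  # path to the reference audio clip
    ref_text: str   # transcript of that clip


def save_profile(profile: VoiceProfile, folder: str = "samples") -> Path:
    """Write the profile as <folder>/<name>.json and return the file path."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"{profile.name}.json"
    payload = json.dumps(asdict(profile), ensure_ascii=False, indent=2)
    path.write_text(payload, encoding="utf-8")
    return path


def list_profiles(folder: str = "samples") -> list:
    """Names of saved profiles, e.g. to populate a dropdown."""
    return sorted(p.stem for p in Path(folder).glob("*.json"))


def load_profile(name: str, folder: str = "samples") -> VoiceProfile:
    """Read a saved profile back from disk."""
    raw = (Path(folder) / f"{name}.json").read_text(encoding="utf-8")
    return VoiceProfile(**json.loads(raw))
```

A loaded profile's `ref_audio` and `ref_text` would then be supplied to the cloning call each time, which reuses the same reference without re-uploading it.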

Frequently Asked Questions

Market intelligence on OmniVoice's TTS pronunciation of numbers in English and Turkish, specifically the failure to pronounce Arabic numerals of two or more digits.

What is the technical positioning of OmniVoice's TTS pronunciation of numbers in English and Turkish?

Based on our AI analysis of the original developer request, its primary technical positioning is: High-quality voice cloning TTS for 600+ languages, implying comprehensive linguistic coverage. The goal is accurate and natural pronunciation across all supported languages, including numerical expressions.

Are engineers actively discussing OmniVoice's TTS pronunciation of numbers in English and Turkish?

Yes, we have tracked 3 direct responses on this specific topic, all originating from the GitHub issue.

What are the foundational technologies related to OmniVoice's TTS pronunciation of numbers in English and Turkish?

Our proprietary extraction maps this issue to adjacent concepts including pronunciation of numbers with two or more digits, Arabic numerals, English, and Turkish.

Engagement Signals

Issue Status: open
