GitHub Issue

CUDA OOM during voice cloning (≤8 GB VRAM) + suggested temporary workaround

Discovered on Apr 5, 2026

Status: open

The DAC acoustic encoder fails to allocate 20 MiB during `create_voice_clone_prompt()` because the model already occupies ~6.6 GiB of a 7.6 GiB card, leaving too little headroom for inference activations. As a temporary workaround, launch with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. This lets PyTorch's caching allocator grow existing memory segments instead of reserving new fixed-size ones, which reduces fragmentation so small allocations can be served from memory that is already reserved but unused. Longer term, the model loading strategy should be reviewed for cards with ≤8 GB VRAM.
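To confirm that a card falls into this ≤8 GB case, and to see how much headroom is left once the demo has loaded, a small helper can summarize `nvidia-smi` output. This is only a sketch, assuming an NVIDIA driver with `nvidia-smi` on PATH; `vram_headroom` is a hypothetical name, not part of OmniVoice. Keep in mind that `nvidia-smi` counts memory held by PyTorch's caching allocator (reserved, not necessarily in use) and by every other process on the GPU as used.

```bash
# Hypothetical helper: print total, used and free VRAM (MiB) for each GPU.
# Expects nvidia-smi CSV lines such as "8192, 6758" (memory.total, memory.used) on stdin.
vram_headroom() {
  while IFS=', ' read -r total used; do
    [ -n "$total" ] || continue   # skip blank lines
    printf 'total=%sMiB used=%sMiB free=%sMiB\n' "$total" "$used" "$((total - used))"
  done
}

# Usage (needs an NVIDIA driver):
# nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits | vram_headroom
```

Running it once before launching and once while the web UI is idle shows how much memory the loaded models take on your specific card.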


Developer & User Discourse

gitchat1 • Apr 5, 2026

Where exactly do I make that change so that it launches like that automatically?

utof • Apr 5, 2026

@gitchat1 just put the variable in front of the command when you launch omnivoice-demo from the terminal (bash): `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True uv run omnivoice-demo`
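A minimal sketch for making this automatic, assuming a bash (or POSIX-compatible) shell and that the demo is started from the project directory with `uv run` as above. The function names `with_expandable_segments` and `omnivoice_demo` are hypothetical, not part of OmniVoice. The helper appends `expandable_segments:True` to any existing `PYTORCH_CUDA_ALLOC_CONF` value (a comma-separated list of `key:value` options) instead of overwriting it, replacing any earlier `expandable_segments` entry.

```bash
# Hypothetical helper: merge expandable_segments:True into an allocator config string.
with_expandable_segments() {
  out=""
  old_ifs=$IFS
  IFS=','                      # split the existing value on commas
  for item in ${1:-}; do
    case $item in
      ''|expandable_segments:*) continue ;;  # drop empty entries and any old setting
    esac
    out="${out:+$out,}$item"
  done
  IFS=$old_ifs
  printf '%s\n' "${out:+$out,}expandable_segments:True"
}

# Hypothetical wrapper: set the variable only for this one command, then start the demo.
omnivoice_demo() {
  PYTORCH_CUDA_ALLOC_CONF="$(with_expandable_segments "${PYTORCH_CUDA_ALLOC_CONF:-}")" \
    uv run omnivoice-demo "$@"
}
```

Paste both functions into `~/.bashrc`, open a new terminal, and run `omnivoice_demo` instead of the full command. Scoping the variable to that one command keeps other PyTorch programs unaffected; a plain `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` in `~/.bashrc` also works but applies to every PyTorch process started from that shell.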

utof • Apr 5, 2026

Interestingly, it works fine when I run omnivoice-infer. The problem is somewhere in the web UI.

Yasand123 • Apr 6, 2026

> Interestingly, it works fine when I run omnivoice-infer. The problem is somewhere in the web UI.

Oh wow. I had to verify this myself, and you're absolutely right: `omnivoice-demo` uses far more VRAM for some reason, while with `omnivoice-infer` I never get OOM errors. This is so weird.

zhu-han • Apr 6, 2026

Hi, omnivoice-demo loads the Whisper ASR model by default. This model transcribes the reference audio, so users don't have to provide the reference transcription themselves. I will add an option to disable loading Whisper.