Hanzo Dev committed
Commit 287780c · 1 Parent(s): fbc526e

Fix voice names (cherry/noah) and add custom voice docs

README.md CHANGED
@@ -44,11 +44,44 @@ Zen-Dub-Live leverages Zen Omni's unified Thinker-Talker architecture for true e
 **Key**: The entire pipeline is native - audio understanding, translation, AND speech synthesis happen end-to-end. No separate ASR or TTS models needed.
 
 - **First-packet latency**: 234ms (audio) / 547ms (video)
-- **Built-in voices**: `
-- **Languages**: 119 text, 19 speech input,
+- **Built-in voices**: `cherry` (female), `noah` (male)
+- **Languages**: 119 text, 19 speech input, 2 speech output voices
 
 See: [Zen Omni Technical Report](https://arxiv.org/abs/2509.17765)
 
+### Adding Custom Voices
+
+Zen-Dub-Live supports voice cloning for anchor-specific voices:
+
+```python
+from zen_dub_live import AnchorVoice
+
+# Clone a voice from reference audio (10-30 seconds recommended)
+custom_voice = AnchorVoice.from_audio(
+    "anchor_audio_sample.wav",
+    name="anchor_01"
+)
+
+# Register for use in pipeline
+pipeline.register_voice(custom_voice)
+
+# Use in session
+session = await pipeline.create_session(
+    anchor_voice="anchor_01",
+    ...
+)
+```
+
+Voice profiles are stored as embeddings and can be saved/loaded:
+
+```python
+# Save voice profile
+custom_voice.save("voices/anchor_01.pt")
+
+# Load voice profile
+anchor_voice = AnchorVoice.load("voices/anchor_01.pt")
+```
+
 ## Overview
 
 Zen-Dub-Live is a real-time AI dubbing platform for broadcast-grade speech-to-speech translation with synchronized video lip-sync. The system ingests live video and audio, translates speech, synthesizes anchor-specific voices, and re-renders mouth regions so that lip movements match the translated speech—all under live broadcast latency constraints.
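
The custom-voice flow this commit documents (clone a voice, register it under a name, save and reload the profile) can be illustrated without the real library. The sketch below is not the `zen_dub_live` API: `VoiceProfile` and `VoiceRegistry` are hypothetical stand-ins, the embedding values are made up, and JSON is used only to keep the example standard-library-only, whereas the README stores profiles in `.pt` files. It shows the one idea the README states, that a voice is a named embedding that can be persisted and later looked up by name.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Dict, List


@dataclass
class VoiceProfile:
    """Hypothetical stand-in for AnchorVoice: a name plus a speaker embedding."""

    name: str
    embedding: List[float]

    def save(self, path: Path) -> None:
        # Create the parent directory (e.g. voices/) if it does not exist yet.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(asdict(self)))

    @classmethod
    def load(cls, path: Path) -> "VoiceProfile":
        data = json.loads(path.read_text())
        return cls(name=data["name"], embedding=data["embedding"])


class VoiceRegistry:
    """Hypothetical registry: resolves a voice name to its stored profile."""

    def __init__(self) -> None:
        # Built-in names come from the README; the zero vectors are placeholders.
        self._voices: Dict[str, VoiceProfile] = {
            name: VoiceProfile(name, [0.0, 0.0, 0.0, 0.0])
            for name in ("cherry", "noah")
        }

    def register(self, profile: VoiceProfile) -> None:
        if profile.name in self._voices:
            raise ValueError(f"voice {profile.name!r} is already registered")
        self._voices[profile.name] = profile

    def get(self, name: str) -> VoiceProfile:
        if name not in self._voices:
            raise KeyError(f"unknown voice {name!r}; known: {self.names()}")
        return self._voices[name]

    def names(self) -> List[str]:
        return sorted(self._voices)


# Round-trip a made-up embedding through disk, then register it by name.
profile = VoiceProfile("anchor_01", [0.12, -0.5, 0.33, 0.9])
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "voices" / "anchor_01.json"
    profile.save(path)
    restored = VoiceProfile.load(path)

registry = VoiceRegistry()
registry.register(restored)
print(registry.names())
```

Rejecting a duplicate name in `register` is a design choice of this sketch, not documented behavior. It keeps a custom voice from silently replacing a built-in such as `cherry` or `noah`.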