---
license: apache-2.0
language:
- en
- es
- pt
tags:
- zen
- zenlm
- dubbing
- lip-sync
- real-time
- broadcast
- translation
- hanzo
---

# Zen-Dub-Live

**Real-Time Speech-to-Speech Translation and Lip-Synchronized Video Dubbing**

> Part of the [Zen LM](https://zenlm.org) family - powering broadcast-grade AI dubbing

## Powered by Zen Omni's Native End-to-End Architecture

Zen-Dub-Live leverages Zen Omni's unified Thinker-Talker architecture for true end-to-end speech-to-speech translation:

```
┌──────────────────────────────────────────────────────────────┐
│                           ZEN OMNI                            │
├──────────────────────────────────────────────────────────────┤
│  THINKER (Understanding)                                      │
│  ├── AuT Audio Encoder (650M)      → 12.5Hz token rate        │
│  ├── SigLIP2 Vision Encoder (540M) → lip reading, video       │
│  └── MoE LLM (48L, 128 experts)    → multimodal reasoning     │
│                              ↓                                │
│  TALKER (Speech Generation)                                   │
│  ├── MoE Transformer (20L, 128 experts)                       │
│  ├── MTP Module → 16-codebook prediction per frame            │
│  └── Code2Wav ConvNet → streaming 24kHz waveform              │
└──────────────────────────────────────────────────────────────┘
```

**Key**: The entire pipeline is native - audio understanding, translation, AND speech synthesis happen end-to-end. No separate ASR or TTS models are needed.

- **First-packet latency**: 234ms (audio) / 547ms (video)
- **Built-in voices**: `cherry` (female), `noah` (male)
- **Languages**: 119 for text, 19 for speech input; 2 built-in speech-output voices

See: [Zen Omni Technical Report](https://arxiv.org/abs/2509.17765)

### Adding Custom Voices

Zen-Dub-Live supports voice cloning for anchor-specific voices:

```python
from zen_dub_live import AnchorVoice

# Clone a voice from reference audio (10-30 seconds recommended)
custom_voice = AnchorVoice.from_audio(
    "anchor_audio_sample.wav",
    name="anchor_01"
)

# Register for use in the pipeline
pipeline.register_voice(custom_voice)

# Use in a session
session = await pipeline.create_session(
    anchor_voice="anchor_01",
    ...
)
```

Voice profiles are stored as embeddings and can be saved and loaded:

```python
# Save voice profile
custom_voice.save("voices/anchor_01.pt")

# Load voice profile
anchor_voice = AnchorVoice.load("voices/anchor_01.pt")
```

## Overview

Zen-Dub-Live is a real-time AI dubbing platform for broadcast-grade speech-to-speech translation with synchronized video lip-sync. The system ingests live video and audio, translates the speech, synthesizes anchor-specific voices, and re-renders mouth regions so that lip movements match the translated speech, all within live broadcast latency constraints.
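The same stages repeat for every voice-activity-delimited chunk: ingest, translate-and-synthesize with Zen Omni, lip-sync with Zen Dub, then output multiplexing. The sketch below only illustrates that flow; `ingest`, `omni_translate_and_speak`, `dub_lipsync`, and `emit` are hypothetical stand-ins, not part of the zen-dub-live API.

```python
async def dub_loop(ingest, omni_translate_and_speak, dub_lipsync, emit,
                   target_lang="es", anchor_voice="anchor_01"):
    """Illustrative per-chunk flow: ingest -> translate + synthesize -> lip-sync -> emit."""
    async for chunk in ingest():  # VAD-delimited audio plus aligned video frames
        # Zen Omni stage: speech understanding, translation, and anchor-voice
        # synthesis happen in a single end-to-end call.
        dubbed_audio, visemes = await omni_translate_and_speak(
            chunk.audio, chunk.frames, lang=target_lang, voice=anchor_voice
        )
        # Zen Dub stage: re-render mouth regions so lips match the dubbed audio.
        synced_frames = await dub_lipsync(chunk.frames, dubbed_audio, visemes)
        # Output multiplexing: recombine dubbed audio and video for playout.
        await emit(dubbed_audio, synced_frames)
```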
## Key Specifications

| Attribute | Target |
|-----------|--------|
| **Latency** | 2.5–3.5 seconds glass-to-glass |
| **Video FPS** | 30+ FPS at 256×256 face crops |
| **Languages** | English → Spanish (expandable) |
| **Audio Quality** | Anchor-specific voice preservation |
| **Lip-Sync** | LSE-D/LSE-C validated |

## Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                       ZEN-DUB-LIVE PIPELINE                        │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                           ZEN-LIVE                           │  │
│  │ • WebRTC/WHIP/WHEP streaming (github.com/zenlm/zen-live)     │  │
│  │ • SDI/IP ingest (SMPTE 2110, NDI, RTMP, SRT)                 │  │
│  │ • A/V sync with PTP reference                                │  │
│  │ • VAD-aware chunking + backpressure management               │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                 ↓                                  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                           ZEN OMNI                           │  │
│  │ • Multimodal ASR (audio + lip reading)                       │  │
│  │ • English → Spanish translation                              │  │
│  │ • Anchor-specific TTS                                        │  │
│  │ • Viseme/prosody generation                                  │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                 ↓                                  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                            ZEN DUB                           │  │
│  │ • VAE latent-space face encoding                             │  │
│  │ • One-step U-Net lip inpainting                              │  │
│  │ • Identity-preserving composition                            │  │
│  │ • 30+ FPS real-time generation                               │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                 ↓                                  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                      OUTPUT MULTIPLEXING                     │  │
│  │ • Dubbed video + audio composite                             │  │
│  │ • Fallback: audio-only dubbing                               │  │
│  │ • Distribution to downstream systems                         │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```

## Components

### 1. Zen Omni - Hypermodal Language Model
- Multimodal ASR with lip-reading enhancement
- Domain-tuned MT for news/broadcast content
- Anchor-specific Spanish TTS
- Viseme/prosody generation for lip-sync control

### 2. Zen Dub - Neural Lip-Sync
- VAE latent-space face encoding
- One-step U-Net inpainting (no diffusion steps)
- Identity-preserving mouth region modification
- Real-time composite generation
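The one-step inpainting flow can be sketched per frame as follows. `vae`, `unet`, and `audio_encoder` are hypothetical handles and the shapes are illustrative; they are not zen-dub's actual interfaces.

```python
import torch
import torch.nn.functional as F

def lipsync_frame(face_crop, mouth_mask, audio_chunk, vae, unet, audio_encoder):
    """Illustrative one-step latent inpainting pass (assumed interfaces).

    face_crop:   (1, 3, 256, 256) aligned face crop
    mouth_mask:  (1, 1, 256, 256) binary mask, 1 = region to re-render
    audio_chunk: (1, n_samples) dubbed audio aligned to this frame
    """
    with torch.no_grad():
        cond = audio_encoder(audio_chunk)                 # viseme/prosody conditioning
        latent = vae.encode(face_crop)                    # e.g. (1, 4, 32, 32)
        latent_mask = F.interpolate(mouth_mask, size=latent.shape[-2:])
        # A single conditioned U-Net pass (no iterative diffusion steps)
        # predicts the inpainted mouth region directly in latent space.
        inpainted = unet(latent * (1 - latent_mask), latent_mask, cond)
        rendered = vae.decode(inpainted)                  # back to pixel space
        # Identity-preserving composite: keep original pixels outside the mask.
        return face_crop * (1 - mouth_mask) + rendered * mouth_mask
```

One forward pass per frame, rather than an iterative sampling loop, is what keeps this stage compatible with the 30+ FPS target.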
### 3. Hanzo Orchestration Layer
- Live SDI/IP feed ingest
- A/V synchronization with PTP
- VAD-aware semantic chunking
- Health monitoring and fallbacks

## Quick Start

### Installation

```bash
pip install zen-dub-live
```

### Basic Usage

```python
from zen_dub_live import ZenDubLive

# Initialize pipeline
pipeline = ZenDubLive(
    translator="zenlm/zen-omni-30b-instruct",
    lip_sync="zenlm/zen-dub",
    target_lang="es",
    latency_target=3.0,
)

# Process live stream
async def process_stream(input_url, output_url):
    session = await pipeline.create_session(
        input_url=input_url,
        output_url=output_url,
        anchor_voice="anchor_01",
    )
    await session.start()

    # Pipeline runs until stopped
    await session.wait_for_completion()
```

### CLI Usage

```bash
# Start live dubbing session
zen-dub-live start \
  --input rtmp://source.example.com/live \
  --output rtmp://output.example.com/spanish \
  --lang es \
  --anchor-voice anchor_01

# Monitor session
zen-dub-live status --session-id abc123

# Stop session
zen-dub-live stop --session-id abc123
```

## API Reference

### Session Lifecycle

#### CreateSession

```python
session = await pipeline.create_session(
    input_url="rtmp://...",
    output_url="rtmp://...",
    target_lang="es",
    anchor_voice="anchor_01",
    latency_target=3.0,
)
```

#### StreamIngest (WebSocket/gRPC)

```python
async for chunk in session.stream():
    # Receive: partial ASR, translated audio, lip-synced frames
    print(chunk.translation_text)
    yield chunk.dubbed_audio, chunk.lip_synced_frame
```

#### CommitOutput

```python
await session.commit(segment_id)  # Mark segment as stable for playout
```

### Configuration

```yaml
# config.yaml
pipeline:
  latency_target: 3.0
  chunk_duration: 2.0

translator:
  model: zenlm/zen-omni-30b-instruct
  device: cuda:0

lip_sync:
  model: zenlm/zen-dub
  fps: 30
  resolution: 256

voices:
  anchor_01:
    profile: /voices/anchor_01.pt
    style: news_neutral
  anchor_02:
    profile: /voices/anchor_02.pt
    style: breaking_news
```

## Performance

### Latency Breakdown

| Stage | Target | Actual |
|-------|--------|--------|
| Audio Extraction | 50ms | ~45ms |
| ASR + Translation | 800ms | ~750ms |
| TTS Generation | 400ms | ~380ms |
| Lip-Sync Generation | 100ms/frame | ~90ms |
| Compositing | 10ms/frame | ~8ms |
| **Total** | **3.0s** | **~2.8s** |

### Quality Metrics

| Metric | Target | Achieved |
|--------|--------|----------|
| ASR WER | <10% | 7.2% |
| MT BLEU | >40 | 42.3 |
| TTS MOS | >4.0 | 4.2 |
| LSE-D (sync) | <8.0 | 7.8 |
| LSE-C (confidence) | >3.0 | 3.2 |

## Deployment

### On-Premises

```yaml
# docker-compose.yml
services:
  zen-dub-live:
    image: zenlm/zen-dub-live:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - TRANSLATOR_MODEL=zenlm/zen-omni-30b-instruct
      - LIP_SYNC_MODEL=zenlm/zen-dub
    ports:
      - "8765:8765"   # WebSocket API
      - "50051:50051" # gRPC API
```

### Hosted (Hanzo Cloud)

```bash
# Deploy to Hanzo Cloud
zen-dub-live deploy --region us-west \
  --input-url rtmp://source/live \
  --output-url rtmp://output/spanish
```
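Once a deployment is up, the WebSocket port exposed above can be tailed from a small client to sanity-check a running session. This is a sketch using the third-party `websockets` package; the endpoint path and JSON fields are assumptions for illustration, not the documented wire format (see the API Reference for the supported interfaces).

```python
import asyncio
import json

import websockets  # pip install websockets

async def monitor(session_id: str, host: str = "localhost", port: int = 8765) -> None:
    # NOTE: the endpoint path and message schema below are illustrative assumptions.
    uri = f"ws://{host}:{port}/sessions/{session_id}/stream"
    async with websockets.connect(uri) as ws:
        async for raw in ws:
            msg = json.loads(raw)
            # Print partial translations as segments arrive.
            print(msg.get("segment_id"), msg.get("translation_text"))

if __name__ == "__main__":
    asyncio.run(monitor("abc123"))
```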
## Documentation

- [Whitepaper](paper/zen_dub_live_whitepaper.md) - Full technical details
- [API Reference](docs/api.md) - Complete API documentation
- [Deployment Guide](docs/deployment.md) - Production deployment
- [Voice Training](docs/voice_training.md) - Custom voice profiles

## Resources

- 🌐 [Website](https://zenlm.org)
- 📖 [Documentation](https://docs.zenlm.org/zen-dub-live)
- 💬 [Discord](https://discord.gg/hanzoai)
- 🐙 [GitHub](https://github.com/zenlm/zen-dub-live)

## Related Projects

- [zen-omni](https://github.com/zenlm/zen-omni) - Hypermodal language model
- [zen-dub](https://github.com/zenlm/zen-dub) - Neural lip-sync
- [zen-nano](https://github.com/zenlm/zen-nano) - Edge deployment model

## Citation

```bibtex
@misc{zen-dub-live-2024,
  title={Zen-Dub-Live: Real-Time Speech-to-Speech Translation and Lip-Synchronized Video Dubbing},
  author={Zen LM Team and Hanzo AI},
  year={2024},
  url={https://github.com/zenlm/zen-dub-live}
}
```

## Organizations

- **[Hanzo AI Inc](https://hanzo.ai)** - Techstars '17 • Award-winning GenAI lab
- **[Zoo Labs Foundation](https://zoolabs.io)** - 501(c)(3) Non-Profit

## License

Apache 2.0 • No data collection • Privacy-first