Duplicate from MBZUAI/LLMVoX
Co-authored-by: shikhar <[email protected]>
- .gitattributes +35 -0
- README.md +136 -0
- assets/arch_diagram.svg +0 -0
- assets/ui.png +0 -0
- ckpt_english_tiny.pt +3 -0
- config.json +0 -0
- wavtokenizer_large_speech_320_24k.ckpt +3 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,136 @@
---
license: cc-by-nc-sa-4.0
pipeline_tag: text-to-speech
---

This repository contains the model described in [LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM](https://hf.co/papers/2503.04724).

For more information, see the project page at https://mbzuai-oryx.github.io/LLMVoX/ and the code at https://github.com/mbzuai-oryx/LLMVoX.

# LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

<div>
<a href="https://mbzuai-oryx.github.io/LLMVoX/"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a>
<a href="https://arxiv.org/abs/2503.04724"><img src="https://img.shields.io/badge/arXiv-2503.04724-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/mbzuai-oryx/LLMVoX/"><img src="https://img.shields.io/badge/GitHub-LLMVoX-black?logo=github" alt="GitHub Repository"></a>
<a href="https://github.com/mbzuai-oryx/LLMVoX/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</div>

**Authors:**
**[Sambal Shikhar](https://github.com/mbzuai-oryx/LLMVoX?tab=readme-ov-file)**, **[Mohammed Irfan K](https://scholar.google.com/citations?user=GJp0keYAAAAJ&hl=en)**, **[Sahal Shaji Mullappilly](https://scholar.google.com/citations?user=LJWxVpUAAAAJ&hl=en)**, **[Fahad Khan](https://sites.google.com/view/fahadkhans/home)**, **[Jean Lahoud](https://scholar.google.com/citations?user=LsivLPoAAAAJ&hl=en)**, **[Rao Muhammad Anwer](https://scholar.google.com/citations?hl=en&authuser=1&user=_KlvMVoAAAAJ)**, **[Salman Khan](https://salman-h-khan.github.io/)**, **[Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)**

**Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE**

<p align="center">
<img src="assets/arch_diagram.svg" alt="LLMVoX Architecture" width="800px">
</p>

<video src="https://github.com/user-attachments/assets/6d305563-3c62-4f14-a8aa-acedf2143f76" width="500" controls></video>

## Overview

LLMVoX is a lightweight, LLM-agnostic, autoregressive streaming text-to-speech (TTS) system with only 30M parameters. It converts the text output of any Large Language Model into high-fidelity streaming speech with low latency.

Key features:
- **Lightweight & Fast**: Only 30M parameters with end-to-end latency as low as 300 ms
- **LLM-Agnostic**: Works with any LLM or Vision-Language Model without fine-tuning
- **Multi-Queue Streaming**: Enables continuous, low-latency speech generation
- **Multilingual Support**: Can be adapted to new languages by training on a new dataset

## Quick Start

### Installation

```bash
# Requirements: CUDA 11.7+ and a GPU compatible with Flash Attention 2.0+

git clone https://github.com/mbzuai-oryx/LLMVoX.git
cd LLMVoX

conda create -n llmvox python=3.9
conda activate llmvox

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn --no-build-isolation
pip install -r requirements.txt

# Download the checkpoints from Hugging Face:
# https://huggingface.co/MBZUAI/LLMVoX/tree/main
mkdir -p CHECKPOINTS
# Place wavtokenizer_large_speech_320_24k.ckpt and ckpt_english_tiny.pt in CHECKPOINTS/
```
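One way to script the checkpoint download is the Hugging Face CLI; this is a sketch assuming `huggingface_hub` with its CLI extra is installed (`pip install -U "huggingface_hub[cli]"`), and `wget` or `git lfs` work just as well:

```bash
# Sketch: fetch both checkpoints from this repository into CHECKPOINTS/.
# Assumes the huggingface_hub CLI is installed (see lead-in above).
huggingface-cli download MBZUAI/LLMVoX \
    ckpt_english_tiny.pt wavtokenizer_large_speech_320_24k.ckpt \
    --local-dir CHECKPOINTS
```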
### Voice Chat

```bash
# Basic usage
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"

# With multiple GPUs
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
    --llm_device "cuda:0" --tts_device_1 1 --tts_device_2 2

# Balance latency and quality: smaller initial dump sizes give faster first audio,
# a larger maximum gives smoother speech on long answers
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
    --initial_dump_size_1 10 --initial_dump_size_2 160 --max_dump_size 1280
```
### Text Chat & Visual Speech

```bash
# Text-to-Speech
python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"

# Visual Speech (Speech + Image → Speech)
python streaming_server.py --chat_type visual_speech --llm_checkpoint "Qwen/Qwen2.5-VL-7B-Instruct" \
    --eos_token "<|im_end|>"

# Multimodal (supports models like Phi-4)
python streaming_server.py --chat_type multimodal --llm_checkpoint "microsoft/Phi-4-multimodal-instruct" \
    --eos_token "<|end|>"
```
## API Reference

| Endpoint | Purpose | Required Parameters |
|----------|---------|---------------------|
| `/tts` | Text-to-speech | `text`: string to convert |
| `/voicechat` | Voice conversations | `audio_base64`, `source_language`, `target_language` |
| `/multimodalchat` | Voice + multiple images | `audio_base64`, `image_list` |
| `/vlmschat` | Voice + single image | `audio_base64`, `image_base64`, `source_language`, `target_language` |

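The exact request and response schema is defined by `streaming_server.py` and is not documented here; the sketch below assumes JSON POST bodies using the parameter names from the table, a server started with `--api_port 5000`, and WAV audio in the response. Treat it as a template to verify against the server code, not a tested client.

```bash
# Hypothetical client calls; check the schema against streaming_server.py.

# /tts — convert a text string to speech
curl -X POST http://localhost:5000/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from LLMVoX"}' \
  --output tts_reply.wav

# /voicechat — send a spoken question (base64-encoded WAV), get a spoken answer
AUDIO_B64=$(base64 -w 0 question.wav)  # -w 0 disables line wrapping (GNU base64)
curl -X POST http://localhost:5000/voicechat \
  -H "Content-Type: application/json" \
  -d "{\"audio_base64\": \"$AUDIO_B64\", \"source_language\": \"English\", \"target_language\": \"English\"}" \
  --output voice_reply.wav
```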
## Local UI Demo

<p align="center">
<img src="assets/ui.png" alt="Demo UI" width="800px">
</p>

```bash
# Start the streaming server
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --api_port PORT

# Launch the UI, pointing it at the server
python run_ui.py --ip STREAMING_SERVER_IP --port PORT
```
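For example, with server and UI on the same machine (port 5000 is an arbitrary choice):

```bash
# Everything local: server and UI share the port chosen via --api_port
python streaming_server.py --chat_type voice \
    --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --api_port 5000
python run_ui.py --ip 127.0.0.1 --port 5000
```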
## Citation

```bibtex
@article{shikhar2025llmvox,
  title={LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM},
  author={Shikhar, Sambal and Kurpath, Mohammed Irfan and Mullappilly, Sahal Shaji and Lahoud, Jean and Khan, Fahad and Anwer, Rao Muhammad and Khan, Salman and Cholakkal, Hisham},
  journal={arXiv preprint arXiv:2503.04724},
  year={2025}
}
```
## Acknowledgments

- [Andrej Karpathy's NanoGPT](https://github.com/karpathy/nanoGPT)
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
- [Whisper](https://github.com/openai/whisper)
- [Neural G2P](https://github.com/lingjzhu/CharsiuG2P)

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
assets/arch_diagram.svg
ADDED

assets/ui.png
ADDED
ckpt_english_tiny.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:568346240c3d240f7b70ab32a7ad4d7247a8f0fb704d1b5cbc66a77370b0939a
size 453105258
config.json
ADDED
File without changes
wavtokenizer_large_speech_320_24k.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7450020c154f6aba033cb8651466cb79cb1b1cdd10ea64eaba68e7871cabcc5a
size 1754880958