Duplicate from MBZUAI/LLMVoX
Co-authored-by: shikhar <[email protected]>
- .gitattributes +35 -0
- README.md +136 -0
- assets/arch_diagram.svg +0 -0
- assets/ui.png +0 -0
- ckpt_english_tiny.pt +3 -0
- config.json +0 -0
- wavtokenizer_large_speech_320_24k.ckpt +3 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,136 @@
---
license: cc-by-nc-sa-4.0
pipeline_tag: text-to-speech
---

This repository contains the model described in [LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM](https://hf.co/papers/2503.04724).

For more information, see the project page at https://mbzuai-oryx.github.io/LLMVoX/ and the code at https://github.com/mbzuai-oryx/LLMVoX.

# LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

<div>
<a href="https://mbzuai-oryx.github.io/LLMVoX/"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a>
<a href="https://arxiv.org/abs/2503.04724"><img src="https://img.shields.io/badge/arXiv-2503.04724-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/mbzuai-oryx/LLMVoX/"><img src="https://img.shields.io/badge/GitHub-LLMVoX-black?logo=github" alt="GitHub Repository"></a>
<a href="https://github.com/mbzuai-oryx/LLMVoX/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</div>

**Authors:**
**[Sambal Shikhar](https://github.com/mbzuai-oryx/LLMVoX?tab=readme-ov-file)**, **[Mohammed Irfan K](https://scholar.google.com/citations?user=GJp0keYAAAAJ&hl=en)**, **[Sahal Shaji Mullappilly](https://scholar.google.com/citations?user=LJWxVpUAAAAJ&hl=en)**, **[Fahad Khan](https://sites.google.com/view/fahadkhans/home)**, **[Jean Lahoud](https://scholar.google.com/citations?user=LsivLPoAAAAJ&hl=en)**, **[Rao Muhammad Anwer](https://scholar.google.com/citations?hl=en&authuser=1&user=_KlvMVoAAAAJ)**, **[Salman Khan](https://salman-h-khan.github.io/)**, **[Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)**

**Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE**

<p align="center">
<img src="assets/arch_diagram.svg" alt="LLMVoX Architecture" width="800px">
</p>

<video src="https://github.com/user-attachments/assets/6d305563-3c62-4f14-a8aa-acedf2143f76" width="500" controls></video>

## Overview

LLMVoX is a lightweight, LLM-agnostic, autoregressive streaming text-to-speech (TTS) system with only 30M parameters. It converts the text output of any Large Language Model into high-fidelity streaming speech with low latency.

Key features:
- **Lightweight & Fast**: Only 30M parameters with end-to-end latency as low as 300 ms
- **LLM-Agnostic**: Works with any LLM or Vision-Language Model without fine-tuning
- **Multi-Queue Streaming**: Enables continuous, low-latency speech generation
- **Multilingual Support**: Can be adapted to new languages by training on a new dataset

## Quick Start

### Installation

```bash
# Requirements: CUDA 11.7+ and a GPU compatible with Flash Attention 2.0+

git clone https://github.com/mbzuai-oryx/LLMVoX.git
cd LLMVoX

conda create -n llmvox python=3.9
conda activate llmvox

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn --no-build-isolation
pip install -r requirements.txt

# Download the checkpoints from Hugging Face:
# https://huggingface.co/MBZUAI/LLMVoX/tree/main
mkdir -p CHECKPOINTS
# Place wavtokenizer_large_speech_320_24k.ckpt and ckpt_english_tiny.pt in CHECKPOINTS/
```
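One way to script the checkpoint download is the Hugging Face CLI; this is a sketch assuming `huggingface_hub` with its CLI extra is installed (`pip install -U "huggingface_hub[cli]"`), and `wget` or `git lfs` work just as well:

```bash
# Sketch: fetch both checkpoints from this repository into CHECKPOINTS/.
# Assumes the huggingface_hub CLI is installed (see lead-in above).
huggingface-cli download MBZUAI/LLMVoX \
    ckpt_english_tiny.pt wavtokenizer_large_speech_320_24k.ckpt \
    --local-dir CHECKPOINTS
```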
### Voice Chat

```bash
# Basic usage
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"

# With multiple GPUs
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
    --llm_device "cuda:0" --tts_device_1 1 --tts_device_2 2

# Balance latency and quality: smaller initial dump sizes give faster first audio,
# a larger maximum gives smoother speech on long answers
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
    --initial_dump_size_1 10 --initial_dump_size_2 160 --max_dump_size 1280
```
### Text Chat & Visual Speech

```bash
# Text-to-Speech
python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"

# Visual Speech (Speech + Image → Speech)
python streaming_server.py --chat_type visual_speech --llm_checkpoint "Qwen/Qwen2.5-VL-7B-Instruct" \
    --eos_token "<|im_end|>"

# Multimodal (supports models like Phi-4)
python streaming_server.py --chat_type multimodal --llm_checkpoint "microsoft/Phi-4-multimodal-instruct" \
    --eos_token "<|end|>"
```
## API Reference

| Endpoint | Purpose | Required Parameters |
|----------|---------|---------------------|
| `/tts` | Text-to-speech | `text`: string to convert |
| `/voicechat` | Voice conversations | `audio_base64`, `source_language`, `target_language` |
| `/multimodalchat` | Voice + multiple images | `audio_base64`, `image_list` |
| `/vlmschat` | Voice + single image | `audio_base64`, `image_base64`, `source_language`, `target_language` |

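The exact request and response schema is defined by `streaming_server.py` and is not documented here; the sketch below assumes JSON POST bodies using the parameter names from the table, a server started with `--api_port 5000`, and WAV audio in the response. Treat it as a template to verify against the server code, not a tested client.

```bash
# Hypothetical client calls; check the schema against streaming_server.py.

# /tts — convert a text string to speech
curl -X POST http://localhost:5000/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from LLMVoX"}' \
  --output tts_reply.wav

# /voicechat — send a spoken question (base64-encoded WAV), get a spoken answer
AUDIO_B64=$(base64 -w 0 question.wav)  # -w 0 disables line wrapping (GNU base64)
curl -X POST http://localhost:5000/voicechat \
  -H "Content-Type: application/json" \
  -d "{\"audio_base64\": \"$AUDIO_B64\", \"source_language\": \"English\", \"target_language\": \"English\"}" \
  --output voice_reply.wav
```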
## Local UI Demo

<p align="center">
<img src="assets/ui.png" alt="Demo UI" width="800px">
</p>

```bash
# Start the streaming server
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --api_port PORT

# Launch the UI, pointing it at the server
python run_ui.py --ip STREAMING_SERVER_IP --port PORT
```
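For example, with server and UI on the same machine (port 5000 is an arbitrary choice):

```bash
# Everything local: server and UI share the port chosen via --api_port
python streaming_server.py --chat_type voice \
    --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --api_port 5000
python run_ui.py --ip 127.0.0.1 --port 5000
```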
## Citation

```bibtex
@article{shikhar2025llmvox,
  title={LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM},
  author={Shikhar, Sambal and Kurpath, Mohammed Irfan and Mullappilly, Sahal Shaji and Lahoud, Jean and Khan, Fahad and Anwer, Rao Muhammad and Khan, Salman and Cholakkal, Hisham},
  journal={arXiv preprint arXiv:2503.04724},
  year={2025}
}
```
## Acknowledgments

- [Andrej Karpathy's NanoGPT](https://github.com/karpathy/nanoGPT)
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
- [Whisper](https://github.com/openai/whisper)
- [Neural G2P](https://github.com/lingjzhu/CharsiuG2P)

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
assets/arch_diagram.svg
ADDED

assets/ui.png
ADDED
ckpt_english_tiny.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:568346240c3d240f7b70ab32a7ad4d7247a8f0fb704d1b5cbc66a77370b0939a
size 453105258
config.json
ADDED
File without changes
wavtokenizer_large_speech_320_24k.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7450020c154f6aba033cb8651466cb79cb1b1cdd10ea64eaba68e7871cabcc5a
size 1754880958