---
license: mit
language:
- en
- zh
pipeline_tag: audio-classification
tags:
- keyword-spotting
- speech-commands
- kws
- tinyml
- edge-ai
- synthetic-speech
- mobile
- low-latency
- audio
datasets:
- lugan/SynTTS-Commands-Media-Dataset
---
# SynTTS Commands Media Benchmarks
<!-- Badges Row -->
[![arXiv](https://img.shields.io/badge/arXiv-2511.07821-b31b1b.svg)](https://arxiv.org/abs/2511.07821)
[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-ffd21e)](https://huggingface.co/datasets/lugan/SynTTS-Commands-Media-Dataset)
[![Code](https://img.shields.io/badge/GitHub-Repository-181717)](https://github.com/lugan113/SynTTS-Commands-Official)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
---
## 🚀 Project Navigation
Welcome to the official model repository for the paper **"SynTTS-Commands"**. This repository hosts the pre-trained checkpoints for the keyword-spotting (KWS) benchmarks reported below.
- **📄 Paper**: Read the detailed technical report on [arXiv](https://arxiv.org/abs/2511.07821).
- **💾 Dataset**: Download the training data at [SynTTS-Commands-Media-Dataset](https://huggingface.co/datasets/lugan/SynTTS-Commands-Media-Dataset); a loading sketch follows this list.
- **💻 Code**: Access training scripts and inference code on [GitHub](https://github.com/lugan113/SynTTS-Commands-Official).
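To fetch the training data locally, the standard 🤗 `datasets` API should work. This is a minimal sketch; the split and column names are assumptions rather than confirmed parts of the release, so inspect the loaded object before relying on them:

```python
from datasets import load_dataset

# Load the dataset from the Hub; audio columns decode lazily to
# {"array", "sampling_rate"} dicts on access.
ds = load_dataset("lugan/SynTTS-Commands-Media-Dataset")
print(ds)                     # check which splits actually exist
sample = ds["train"][0]       # "train" split name is an assumption
print(sample.keys())          # inspect the actual column names
```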
---
## 📈 Benchmark Results and Analysis
We present a comprehensive benchmark of **six representative acoustic models** on the SynTTS-Commands-Media Dataset across both English (EN) and Chinese (ZH) subsets. All models are evaluated in terms of **classification accuracy**, **cross-entropy loss**, and **parameter count**, providing insights into the trade-offs between performance and model complexity in multilingual voice command recognition.
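For reproducibility, the two quality metrics can be computed with a short PyTorch loop. This is a minimal sketch, not the exact benchmark harness; the model/loader interfaces and feature shapes are assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model: torch.nn.Module, loader, device: str = "cpu"):
    """Return (mean cross-entropy loss, top-1 accuracy) over a data loader."""
    model.eval().to(device)
    total_loss, correct, seen = 0.0, 0, 0
    for features, labels in loader:  # e.g. batches of log-mel features
        features, labels = features.to(device), labels.to(device)
        logits = model(features)
        total_loss += F.cross_entropy(logits, labels, reduction="sum").item()
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.numel()
    return total_loss / seen, correct / seen
```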
### Performance Summary
| Model | EN Loss | EN Accuracy | EN Params | ZH Loss | ZH Accuracy | ZH Params |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **MicroCNN** | 0.2304 | 93.22% | 4,189 | 0.5579 | 80.14% | 4,255 |
| **DS-CNN** | 0.0166 | 99.46% | 30,103 | 0.0677 | 97.18% | 30,361 |
| **TC-ResNet** | 0.0347 | 98.87% | 68,431 | 0.0884 | 96.56% | 68,561 |
| **CRNN** | **0.0163** | **99.50%** | 1.08M | 0.0636 | 97.42% | 1.08M |
| **MobileNet-V1** | 0.0167 | **99.50%** | 2.65M | **0.0552** | 97.92% | 2.65M |
| **EfficientNet** | 0.0182 | 99.41% | 4.72M | 0.0701 | **97.93%** | 4.72M |
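The parameter counts above can be reproduced from a loaded checkpoint with a one-line helper, assuming the checkpoints are standard PyTorch `nn.Module`s:

```python
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Count trainable parameters, matching the 'Params' columns above."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```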
### 🔍 Key Findings
Our results demonstrate that the **SynTTS-Commands** dataset supports high-accuracy command recognition in both languages. Notably, the top-performing models achieve over **99.4% accuracy on English** and nearly **98% on Chinese**, confirming the dataset’s quality and suitability for real-world deployment.
- **Top Performers**: Among all models, **CRNN** attains the best English accuracy (**99.50%**) and the lowest English loss (0.0163). **MobileNet-V1** yields the lowest loss on Chinese (0.0552) and matches CRNN's 99.50% English accuracy. Interestingly, **EfficientNet** edges out MobileNet-V1 on Chinese accuracy (97.93% vs. 97.92%) despite a higher loss (0.0701 vs. 0.0552), so its extra correct predictions come with less confident, less well-calibrated output probabilities.
- **Accuracy-Complexity Trade-off**: Lightweight models exhibit a clear trade-off. **MicroCNN**, with only ~4.2K parameters, achieves 93.22% accuracy on English but drops to 80.14% on Chinese, highlighting how hard it is for ultra-compact architectures to model the tonal and phonetic richness of Mandarin. DS-CNN and TC-ResNet, each under 70K parameters, close most of that gap (>96.5% in both languages), underscoring their efficiency for resource-constrained applications.
Overall, the benchmark establishes strong baselines across a wide spectrum of model scales—from ultra-light MicroCNN to modern EfficientNet—demonstrating that moderate-complexity models can deliver near-SOTA performance suitable for edge deployment.
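For a quick end-to-end check of one of these checkpoints, a single-utterance inference pass might look like the sketch below. The checkpoint filename, the 16 kHz sample rate, and the log-mel settings are all assumptions; consult the GitHub repository for the actual artifact names and the exact preprocessing used in training:

```python
import torch
import torchaudio

# "ds_cnn_en.pt" and the front end below are illustrative assumptions,
# not the released artifact names or the exact training feature pipeline.
model = torch.load("ds_cnn_en.pt", map_location="cpu")  # assumes a pickled nn.Module
model.eval()

waveform, sr = torchaudio.load("play_music.wav")
if sr != 16000:  # assume models were trained on 16 kHz audio
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

# Assumed front end: 40-bin log-mel spectrogram of the (mono) waveform.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=40)(waveform)
with torch.no_grad():
    logits = model(mel.log1p().unsqueeze(0))  # add a batch dimension
print("predicted command index:", logits.argmax(dim=-1).item())
```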
## 📜 Citation
If you use these **pre-trained models** or the **SynTTS-Commands dataset** in your research, please cite our paper:
**[SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech](https://arxiv.org/abs/2511.07821)**
```bibtex
@misc{gan2025synttscommands,
      title={SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech},
      author={Lu Gan and Xi Li},
      year={2025},
      eprint={2511.07821},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2511.07821},
      doi={10.48550/arXiv.2511.07821}
}
```