---
license: other
license_name: nvidia-open-model-license
license_link: >-
https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
library_name: nemo
datasets:
- AMI
- NOTSOFAR1
- Fisher
- MMLPC
- librispeech_train_clean_100
- librispeech_train_clean_360
- librispeech_train_other_500
- WSJ
- SWBD
- europarl_dataset
- NSC1
- NSC6
- VCTK
- VoxPopuli
- Multilingual_LibriSpeech_2000hrs
- Common_Voice
- People_Speech_12k_hrs
- SPGI
- MOSEL
- YTC
thumbnail: null
tags:
- speaker-diarization
- speech-recognition
- multitalker-ASR
- multispeaker-ASR
- speech
- audio
- FastConformer
- RNNT
- Conformer
- NEST
- pytorch
- NeMo
widget:
- example_title: Librispeech sample 1
src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: multitalker-parakeet-streaming-0.6b-v1
results:
- task:
name: Speaker Diarization
type: speaker-diarization-with-post-processing
dataset:
name: DIHARD III Eval (1-4 spk)
type: dihard3-eval-1to4spks
config: with_overlap_collar_0.0s
input_buffer_length: 1.04s
split: eval-1to4spks
metrics:
- name: Test DER
type: der
value: 13.24
- task:
name: Speaker Diarization
type: speaker-diarization-with-post-processing
dataset:
name: DIHARD III Eval (5-9 spk)
type: dihard3-eval-5to9spks
config: with_overlap_collar_0.0s
input_buffer_length: 1.04s
split: eval-5to9spks
metrics:
- name: Test DER
type: der
value: 42.56
- task:
name: Speaker Diarization
type: speaker-diarization-with-post-processing
dataset:
name: DIHARD III Eval (full)
type: dihard3-eval
config: with_overlap_collar_0.0s
input_buffer_length: 1.04s
split: eval
metrics:
- name: Test DER
type: der
value: 18.91
- task:
name: Speaker Diarization
type: speaker-diarization-with-post-processing
dataset:
name: CALLHOME (NIST-SRE-2000 Disc8) part2 (2 spk)
type: CALLHOME-part2-2spk
config: with_overlap_collar_0.25s
input_buffer_length: 1.04s
split: part2-2spk
metrics:
- name: Test DER
type: der
value: 6.57
- task:
name: Speaker Diarization
type: speaker-diarization-with-post-processing
dataset:
name: CALLHOME (NIST-SRE-2000 Disc8) part2 (3 spk)
type: CALLHOME-part2-3spk
config: with_overlap_collar_0.25s
input_buffer_length: 1.04s
split: part2-3spk
metrics:
- name: Test DER
type: der
value: 10.05
- task:
name: Speaker Diarization
type: speaker-diarization-with-post-processing
dataset:
name: CALLHOME (NIST-SRE-2000 Disc8) part2 (4 spk)
type: CALLHOME-part2-4spk
config: with_overlap_collar_0.25s
input_buffer_length: 1.04s
split: part2-4spk
metrics:
- name: Test DER
type: der
value: 12.44
- task:
name: Speaker Diarization
type: speaker-diarization-with-post-processing
dataset:
name: CALLHOME (NIST-SRE-2000 Disc8) part2 (5 spk)
type: CALLHOME-part2-5spk
config: with_overlap_collar_0.25s
input_buffer_length: 1.04s
split: part2-5spk
metrics:
- name: Test DER
type: der
value: 21.68
- task:
name: Speaker Diarization
type: speaker-diarization-with-post-processing
dataset:
name: CALLHOME (NIST-SRE-2000 Disc8) part2 (6 spk)
type: CALLHOME-part2-6spk
config: with_overlap_collar_0.25s
input_buffer_length: 1.04s
split: part2-6spk
metrics:
- name: Test DER
type: der
value: 28.74
- task:
name: Speaker Diarization
type: speaker-diarization-with-post-processing
dataset:
name: CALLHOME (NIST-SRE-2000 Disc8) part2 (full)
type: CALLHOME-part2
config: with_overlap_collar_0.25s
input_buffer_length: 1.04s
split: part2
metrics:
- name: Test DER
type: der
value: 10.7
- task:
name: Speaker Diarization
type: speaker-diarization-with-post-processing
dataset:
name: call_home_american_english_speech
type: CHAES_2spk_109sessions
config: with_overlap_collar_0.25s
input_buffer_length: 1.04s
split: ch109
metrics:
- name: Test DER
type: der
value: 4.88
metrics:
- der
pipeline_tag: audio-classification
---
# Multitalker Parakeet Streaming 0.6B v1
<style>
img {
display: inline;
}
</style>
This model is a streaming multitalker ASR model based on the Parakeet architecture. The model only takes the speaker diarization outputs as external information and eliminates the need for explicit speaker queries or enrollment audio [[Wang et al., 2025]](https://arxiv.org/abs/2506.22646). Unlike conventional target-speaker ASR approaches that require speaker embeddings, this model dynamically adapts to individual speakers through speaker-wise speech activity prediction.
The key innovation involves injecting learnable **speaker kernels** into the pre-encode layer of the Fast-Conformer encoder. These speaker kernels are generated via speaker supervision activations, enabling instantaneous adaptation to target speakers. This approach leverages the inherent tendency of streaming ASR systems to prioritize specific speakers, repurposing this mechanism to achieve robust speaker-focused recognition.
The model architecture requires deploying **one model instance per speaker**, meaning the number of model instances matches the number of speakers in the conversation. While this necessitates additional computational resources, it achieves state-of-the-art performance in handling fully overlapped speech in both offline and streaming scenarios.
## Key Advantages
This self-speaker adaptation approach offers several advantages over traditional multitalker ASR methods:
1. **No Speaker Enrollment**: Unlike target-speaker ASR systems that require pre-enrollment audio or speaker embeddings, this model only needs speaker activity information from diarization
2. **Handles Severe Overlap**: Each instance focuses on a single speaker, enabling accurate transcription even during fully overlapped speech
3. **Streaming Capable**: Designed for real-time streaming scenarios with configurable latency-accuracy tradeoffs
4. **Leverages Single-Speaker Models**: Can be fine-tuned from strong pre-trained single-speaker ASR models while preserving single-speaker ASR performance
## Discover more from NVIDIA:
For documentation, deployment guides, enterprise-ready APIs, and the latest open models—including Nemotron and other cutting-edge speech, translation, and generative AI—visit the NVIDIA Developer Portal at [developer.nvidia.com](https://developer.nvidia.com/).
Join the community to access tools, support, and resources to accelerate your development with NVIDIA’s NeMo, Riva, NIM, and foundation models.<br>
### Explore more from NVIDIA: <br>
What is [Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/)?<br>
NVIDIA Developer [Nemotron](https://developer.nvidia.com/nemotron)<br>
[NVIDIA Riva Speech](https://developer.nvidia.com/riva?sortBy=developer_learning_library%2Fsort%2Ffeatured_in.riva%3Adesc%2Ctitle%3Aasc#demos)<br>
[NeMo Documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html)<br>
## Model Architecture
### Speaker Kernel Injection
The streaming multitalker Parakeet model employs a **speaker kernel injection** mechanism at selected layers of the Fast-Conformer encoder. As shown in the figure below, learnable speaker kernels are injected into these layers, enabling the model to dynamically adapt to specific speakers.
<div align="center">
<img src="figures/speaker_injection.png" width="750" />
</div>
The speaker kernels are generated through speaker supervision activations that detect speech activity for each target speaker. This enables the encoder states to become more responsive to the targeted speaker's speech characteristics, even during periods of fully overlapped speech.
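To make the mechanism concrete, here is a minimal, self-contained PyTorch sketch of the idea. It is not the NeMo implementation; the module name `SpeakerKernelInjection` and its internals are illustrative assumptions, showing only how per-frame speaker activity can gate a learnable kernel that is added back to the pre-encode features.
```python
# Illustrative sketch of speaker-kernel injection (not the NeMo implementation).
# A learnable kernel, gated by per-frame target-speaker activity from the diarizer,
# is added to the pre-encode features so the encoder focuses on the target speaker.
import torch
import torch.nn as nn


class SpeakerKernelInjection(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, d_model: int):
        super().__init__()
        # Learnable projection that turns activity-gated features into a speaker kernel
        self.kernel_proj = nn.Linear(d_model, d_model)

    def forward(self, feats: torch.Tensor, spk_activity: torch.Tensor) -> torch.Tensor:
        """
        feats:        [B, T, D]  pre-encode acoustic features
        spk_activity: [B, T]     per-frame activity of the target speaker (0..1)
        """
        gate = spk_activity.unsqueeze(-1)        # [B, T, 1]
        kernel = self.kernel_proj(feats * gate)  # speaker-conditioned kernel
        return feats + kernel                    # inject into the encoder stream


# Toy usage: one utterance, 200 frames, 512-dim features
feats = torch.randn(1, 200, 512)
activity = (torch.rand(1, 200) > 0.5).float()
out = SpeakerKernelInjection(512)(feats, activity)
print(out.shape)  # torch.Size([1, 200, 512])
```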
### Multi-Instance Architecture
The model is based on the Parakeet architecture and consists of a [NeMo Encoder for Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[4] encoder, which is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[5] architecture. The key architectural innovation is the **multi-instance approach**, where one model instance is deployed per speaker as illustrated below:
<div align="center">
<img src="figures/multi_instance.png" width="1400" />
</div>
Each model instance:
- Receives the same mixed audio input
- Injects speaker-specific kernels at the pre-encode layer
- Produces transcription output specific to its target speaker
- Operates independently and can run in parallel with other instances
This architecture enables the model to handle severe speech overlap by having each instance focus exclusively on one speaker, eliminating the permutation problem that affects other multitalker ASR approaches.
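The control flow can be summarized with a short, hypothetical sketch (illustrative only; the actual streaming pipeline is driven by the NeMo helper classes shown in the sections below): every speaker found by the diarizer gets its own adapted ASR instance, and all instances consume the same mixed audio.
```python
# Conceptual sketch of the multi-instance approach (illustrative only; the real
# pipeline is handled by NeMo's SpeakerTaggedASR helper shown further below).
def transcribe_multitalker(mixed_audio, speaker_activities, make_adapted_instance):
    """
    mixed_audio:           the shared single-channel input signal
    speaker_activities:    {speaker_id: per-frame activity}, from diarization
    make_adapted_instance: factory returning an ASR callable adapted to one speaker
    """
    transcripts = {}
    for spk_id, activity in speaker_activities.items():
        asr = make_adapted_instance(activity)    # one instance per speaker
        transcripts[spk_id] = asr(mixed_audio)   # instances run independently, in parallel
    return transcripts
```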
## NVIDIA NeMo
To train, fine-tune, or perform multitalker ASR with this model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[7]. We recommend installing it after you have installed Cython and the latest version of PyTorch.
```bash
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
```
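After installation, an optional sanity check confirms that NeMo and its ASR collection import cleanly (the version string printed will depend on the branch you installed):
```python
# Optional sanity check: NeMo and its ASR collection should import without errors.
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)
```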
## How to Use this Model
The model is available for use in the NeMo Framework[7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
**Important**: This model uses a multi-instance architecture where you need to deploy one model instance per speaker. Each instance receives the same audio input along with speaker-specific diarization information to perform self-speaker adaptation.
### Method 1. Code snippet
Load one of the NeMo speaker diarization models:
[Streaming Sortformer Diarizer v2](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2) or
[Streaming Sortformer Diarizer v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1).
```python
from nemo.collections.asr.models import SortformerEncLabelModel, ASRModel
import torch
# A speaker diarization model is needed for tracking the speech activity of each speaker.
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2.1").eval().to(torch.device("cuda"))
asr_model = ASRModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1").eval().to(torch.device("cuda"))
# Use the pre-defined dataclass template `MultitalkerTranscriptionConfig` from `multitalker_transcript_config.py`.
# Configure the diarization model using streaming parameters:
from multitalker_transcript_config import MultitalkerTranscriptionConfig
from omegaconf import OmegaConf
cfg = OmegaConf.structured(MultitalkerTranscriptionConfig())
cfg.audio_file = "/path/to/your/audio.wav"
cfg.output_path = "/path/to/output_transcription.json"
diar_model = MultitalkerTranscriptionConfig.init_diar_model(cfg, diar_model)
# Load your audio file into a streaming audio buffer to simulate a real-time audio session.
from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer
samples = [{'audio_filepath': cfg.audio_file}]
streaming_buffer = CacheAwareStreamingAudioBuffer(
    model=asr_model,
    online_normalization=cfg.online_normalization,
    pad_and_drop_preencoded=cfg.pad_and_drop_preencoded,
)
streaming_buffer.append_audio_file(audio_filepath=cfg.audio_file, stream_id=-1)
streaming_buffer_iter = iter(streaming_buffer)
# Use the helper class `SpeakerTaggedASR`, which handles all ASR and diarization cache data for streaming.
from nemo.collections.asr.parts.utils.multispk_transcribe_utils import SpeakerTaggedASR
multispk_asr_streamer = SpeakerTaggedASR(cfg, asr_model, diar_model)
for step_num, (chunk_audio, chunk_lengths) in enumerate(streaming_buffer_iter):
    drop_extra_pre_encoded = (
        0
        if step_num == 0 and not cfg.pad_and_drop_preencoded
        else asr_model.encoder.streaming_cfg.drop_extra_pre_encoded
    )
    with torch.inference_mode():
        with torch.amp.autocast(diar_model.device.type, enabled=True):
            with torch.no_grad():
                multispk_asr_streamer.perform_parallel_streaming_stt_spk(
                    step_num=step_num,
                    chunk_audio=chunk_audio,
                    chunk_lengths=chunk_lengths,
                    is_buffer_empty=streaming_buffer.is_buffer_empty(),
                    drop_extra_pre_encoded=drop_extra_pre_encoded,
                )
    # Inspect the running per-speaker segments after each streamed chunk.
    print(multispk_asr_streamer.instance_manager.batch_asr_states[0].seglsts)

# Generate the speaker-tagged transcript and print it.
multispk_asr_streamer.generate_seglst_dicts_from_parallel_streaming(samples=samples)
print(multispk_asr_streamer.instance_manager.seglst_dict_list)
```
### Method 2. Use NeMo example file in NVIDIA/NeMo
Use [the multitalker streaming ASR example script](https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py) in the [NVIDIA NeMo Framework](https://github.com/NVIDIA-NeMo/NeMo) to launch inference. With this method, download the `.nemo` model files and pass their paths to the script:
```bash
python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
asr_model="/path/to/your/multitalker-parakeet-streaming-0.6b-v1.nemo" \
diar_model="/path/to/your/nvidia/diar_streaming_sortformer_4spk-v2.nemo" \
att_context_size="[70,13]" \
generate_realtime_scripts=False \
audio_file="/path/to/example.wav" \
output_path="/path/to/example_output.json"
```
Alternatively, the `audio_file` argument can be replaced with `manifest_file` to handle multiple files in batch mode:
```bash
python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \
... \
manifest_file="example.json" \
... \
```
In the `example.json` file, each line is a dictionary containing the following fields:
```python
{
"audio_filepath": "/path/to/multispeaker_audio1.wav", # path to the input audio file
"offset": 0, # offset (start) time of the input audio
"duration": 600, # duration of the audio, can be set to `null` if using NeMo main branch
}
{
"audio_filepath": "/path/to/multispeaker_audio2.wav",
"offset": 900,
"duration": 580,
}
```
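If you prefer to build the manifest programmatically, a minimal sketch using the same fields as above (paths and durations are placeholders) is:
```python
# Sketch: write a NeMo-style JSON-lines manifest for batch inference.
# Paths, offsets, and durations are placeholders; adjust them to your recordings.
import json

entries = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0, "duration": 600},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 900, "duration": 580},
]

with open("example.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```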
### Setting up Streaming Configuration
Latency is defined by `att_context_size`, with chunk sizes measured in **80 ms frames** (see the latency sketch after this list):
* [70, 0]: Chunk size = 1 (1 * 80ms = 0.08s)
* [70, 1]: Chunk size = 2 (2 * 80ms = 0.16s)
* [70, 6]: Chunk size = 7 (7 * 80ms = 0.56s)
* [70, 13]: Chunk size = 14 (14 * 80ms = 1.12s)
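The mapping from `att_context_size` to chunk latency follows directly from the 80 ms frame rate (chunk size = right context + 1 frames); a small sketch:
```python
# Sketch: chunk latency implied by att_context_size, assuming 80 ms encoder frames.
def chunk_latency_seconds(att_context_size, frame_ms=80):
    _left, right = att_context_size
    chunk_frames = right + 1  # lookahead frames plus the current frame
    return chunk_frames * frame_ms / 1000.0

for ctx in ([70, 0], [70, 1], [70, 6], [70, 13]):
    print(ctx, chunk_latency_seconds(ctx), "s")  # 0.08, 0.16, 0.56, 1.12
```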
### Input
This model accepts single-channel (mono) audio sampled at 16,000 Hz.
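If your recordings are not already single-channel 16 kHz, a minimal conversion sketch (using `librosa` and `soundfile`; file names are placeholders) is:
```python
# Sketch: convert an arbitrary recording to the expected mono 16 kHz format.
import librosa
import soundfile as sf

audio, sr = librosa.load("input_audio.wav", sr=16000, mono=True)  # resample and downmix
sf.write("audio_16k_mono.wav", audio, sr)
```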
### Output
The results are written to `output_path` in the SegLST format. For more information, please refer to the [SegLST](https://github.com/fgnt/meeteval?tab=readme-ov-file#segment-wise-long-form-speech-transcription-annotation-seglst) documentation.
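As a minimal sketch of consuming the result, assuming the output file is a JSON list of SegLST segments with `speaker`, `start_time`, `end_time`, and `words` fields per the meeteval convention (check the field names in your own output):
```python
# Sketch: load the SegLST output written to `output_path` and print it per segment.
import json

with open("/path/to/example_output.json") as f:
    segments = json.load(f)

for seg in segments:
    print(f'[{seg["start_time"]:.2f}-{seg["end_time"]:.2f}] {seg["speaker"]}: {seg["words"]}')
```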
## Datasets
This multitalker ASR model was trained on a large combination of real conversations and simulated audio mixtures.
The training data includes both single-speaker and multi-speaker recordings with corresponding transcriptions and speaker labels in [SegLST](https://github.com/fgnt/meeteval?tab=readme-ov-file#segment-wise-long-form-speech-transcription-annotation-seglst) format.
Data collection methods vary across individual datasets. The training datasets include phone calls, interviews, web videos, meeting recordings, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or individual dataset webpages for detailed data collection methods.
### Training Datasets (Real conversations)
- Granary (single speaker)
- Fisher English (LDC)
- LibriSpeech
- AMI Corpus
- NOTSOFAR
- ICSI
### Training Datasets (Used to simulate audio mixtures)
- LibriSpeech
## Performance
### Evaluation data specifications
| **Dataset** | **Number of speakers** | **Number of Sessions** |
|-------------|------------------------|------------------------|
| **AMI IHM** | 3-4 | 219 |
| **AMI SDM** | 3-4 | 40 |
| **CH109** | 2 | 259 |
| **Mixer 6** | 2 | 148 |
### Concatenated minimum-permutation Word Error Rate (cpWER)
* All evaluations include overlapping speech.
* Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
* Post-Processing (PP) can be optimized on different held-out dataset splits to improve diarization performance.
* Latency is 1.12s with 13+1 lookahead frames.
| **Diarization Model** | **AMI IHM** | **AMI SDM** | **CH109** | **Mixer 6** |
|-----------------------|-------------|-------------|-----------|-------------|
| [Streaming Sortformer v2](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2) | 21.26 | 37.44 | 15.81 | 23.81 |
## References
[1] [Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR](https://arxiv.org/abs/2506.22646)
[2] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
[3] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/2507.18446)
[4] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
[5] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
[6] [Attention is all you need](https://arxiv.org/abs/1706.03762)
[7] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
[8] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371) |