Kabardian-Russian Translation Model (kbd-ru-opus)
Fine-tuned MarianMT model for Kabardian (East Circassian) to Russian translation.
- Developed by: Eduard Emkuzhev
- Model type: Neural Machine Translation (Marian Transformer)
- Language pair: Kabardian (kbd) → Russian (ru)
- License: CC BY-NC 4.0
- Base model: Helsinki-NLP/opus-mt-en-ru (Apache 2.0)
- Training data: adiga-ai/circassian-parallel-corpus (CC BY 4.0)
Model Description
This model translates from Kabardian to Russian. Kabardian is an endangered Northwest Caucasian language with approximately 500,000 speakers. It features complex polysynthetic morphology, 50+ consonants, and ergative-absolutive alignment.
Intended Use
Primary uses:
- Language documentation and digitization
- Educational content translation
- Cultural heritage preservation
- Low-resource NMT research
- Supporting Kabardian speakers in accessing Russian content
Limitations:
- Non-commercial use only (CC BY-NC 4.0)
- Best performance on everyday language
- May struggle with modern technical terms not in training data
- Requires proper handling of the Kabardian-specific palochka character (Ӏ)
Training Data
Dataset: adiga-ai/circassian-parallel-corpus
- Subset: kbd_ru (Kabardian → Russian)
- Total training examples: ~120K parallel sentence pairs
- Dataset license: CC BY 4.0
- Dataset author: Anzor Qunash (adiga.ai)
- Content: Dictionary entries, folklore texts, proverbs, everyday expressions
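For reference, the corpus can be pulled directly from the Hugging Face Hub. The snippet below is a quick inspection sketch; the split name is an assumption about the dataset layout.

from datasets import load_dataset

# Load the kbd_ru subset of the parallel corpus
corpus = load_dataset("adiga-ai/circassian-parallel-corpus", "kbd_ru")

# Print the available splits and one parallel pair
print(corpus)
print(corpus["train"][0])  # assumes a "train" split; adjust to the actual layout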
Training Procedure
Base Model
- Architecture: Marian Transformer (transformer-align)
- Base: Helsinki-NLP/opus-mt-en-ru (English-Russian translation)
- Transfer learning: Adapted from English-Russian to Kabardian-Russian
Hyperparameters
base_model: Helsinki-NLP/opus-mt-en-ru
training_examples: 120,000
epochs: 3
batch_size: 16
learning_rate: 3e-5
optimizer: AdamW
max_sequence_length: 128
warmup_steps: 500
weight_decay: 0.01
framework: transformers 4.36.0
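These settings map onto a standard Seq2SeqTrainer run. The sketch below is illustrative only, not the exact training script: the dataset column names ("kbd", "ru") are assumptions about the corpus schema, and the Ӏ → I mapping described under Special Preprocessing below is applied to the source side. AdamW is the Trainer default optimizer.

from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_model = "Helsinki-NLP/opus-mt-en-ru"
tokenizer = MarianTokenizer.from_pretrained(base_model)
model = MarianMTModel.from_pretrained(base_model)

def preprocess(batch):
    # Column names "kbd" and "ru" are assumptions about the dataset schema
    src = [s.replace("Ӏ", "I").replace("ӏ", "I") for s in batch["kbd"]]
    model_inputs = tokenizer(src, max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["ru"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

raw = load_dataset("adiga-ai/circassian-parallel-corpus", "kbd_ru")
train_dataset = raw["train"].map(
    preprocess, batched=True, remove_columns=raw["train"].column_names
)

# Mirror the hyperparameters listed above
args = Seq2SeqTrainingArguments(
    output_dir="kbd-ru-opus",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    warmup_steps=500,
    weight_decay=0.01,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("kbd-ru-opus")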
Special Preprocessing
The model uses a special character mapping for training:
- Kabardian Ӏ (palochka) → I (Latin I) during training
- I → Ӏ restored during inference
This ensures better tokenization compatibility with the MarianMT tokenizer.
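A minimal sketch of this mapping (the relevant code points are U+04C0 for Ӏ and U+04CF for ӏ). For the kbd → ru direction only the source-side Ӏ → I step is needed, since the Russian output contains no palochka; the reverse I → Ӏ step matters when Kabardian text is generated.

def map_palochka(text: str) -> str:
    # Replace both upper- and lower-case palochka with Latin "I" before tokenization
    return text.replace("Ӏ", "I").replace("ӏ", "I")

def restore_palochka(text: str) -> str:
    # Naive inverse used after generation; note it would also touch genuine Latin "I"
    return text.replace("I", "Ӏ")

print(map_palochka("фӀыуэ"))      # фIыуэ
print(restore_palochka("фIыуэ"))  # фӀыуэ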
Performance
Benchmark Results
Tested on 1,000 examples from adiga-ai/circassian-parallel-corpus:
| Metric | Score |
|---|---|
| BLEU | 12.73 |
| chrF | 34.42 |
| TER | 80.80 |
| Exact Match | 6.9% |
| Speed | 1.6 examples/sec |
| Avg Time | 626ms/example |
Test Configuration:
- Test size: 1,000 examples
- Sampling: Every 50th sentence from corpus
- Generation: beam_search (num_beams=4)
- Device: Apple M-series (MPS)
- Seed: 42 (reproducible)
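The scores above can be approximated with the sacrebleu package, which implements BLEU, chrF, and TER. The snippet below is a sketch under that assumption, shown on a toy pair of lists rather than the full 1,000-example test set.

import sacrebleu

# Hypothetical model outputs and their references (stand-ins for the real test set)
hypotheses = ["Ты не бежал", "перевалиться"]
references = [["Ты не бежал", "перевалиться"]]  # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
exact = sum(h == r for h, r in zip(hypotheses, references[0])) / len(hypotheses)

print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}  TER: {ter.score:.2f}  Exact: {exact:.1%}")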
Translation Examples
| Kabardian (Input) | Russian (Output) |
|---|---|
| ублӏэркӏын | перевалиться |
| Уэ ужакъым | Ты не бежал |
| Шухьэр Кесарие къалэм щынэсым, ӏэтащхьэм тхылъыр ир | Когда враги добрались до города Кесария, в главе... |
| зэкъуэстын | стоять друг с другом |
| Ерагъпсӏарагъщ зэрызрагъэкӏружар. | Это единственность, которую они вернули. |
Note: These examples illustrate typical output quality. Short dictionary entries and simple sentences are handled well, while longer or idiomatic inputs may lose accuracy (see the benchmark scores above and the Limitations section below).
How to Use
Installation
pip install transformers torch sentencepiece
Basic Usage
from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "kubataba/kbd-ru-opus"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translation function
def translate_kbd_to_ru(text):
    # Preprocess: Ӏ → I for tokenization
    processed_text = text.replace('Ӏ', 'I').replace('ӏ', 'I')
    inputs = tokenizer(processed_text, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translation

# Example
kabardian_text = "Сэлам!"
russian_text = translate_kbd_to_ru(kabardian_text)
print(f"KBD: {kabardian_text}")
print(f"RU: {russian_text}")
Batch Translation
texts = [
    "Уи пщэдджыжь фӀыуэ!",
    "Сыт укъэпсэлъар?",
    "Ди лъэпкъым и бзэр"
]

# Preprocess all texts
processed_texts = [t.replace('Ӏ', 'I').replace('ӏ', 'I') for t in texts]

inputs = tokenizer(processed_texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
translations = [tokenizer.decode(out, skip_special_tokens=True) for out in outputs]

for src, tgt in zip(texts, translations):
    print(f"KBD: {src} → RU: {tgt}")
Limitations and Bias
- Complex morphology: Kabardian's polysynthetic structure may result in one-to-many mappings to Russian
- Character handling: Requires preprocessing of palochka (Ӏ → I) before tokenization
- Domain adaptation: Best performance on text types similar to training data
- Technical vocabulary: May struggle with modern technical/specialized terms
- Endangered language: Limited real-world validation data due to language endangerment
- Dialectal variation: Trained on literary Kabardian; dialectal forms may produce unexpected results
Ethical Considerations
This model contributes to digital language preservation for Kabardian, an endangered language.
Important considerations:
- Machine translation should complement, not replace, human translators
- Cultural sensitivity is essential when working with indigenous languages
- The model may not capture all nuances of Kabardian language and culture
- Translations should be reviewed by native speakers for critical applications
- Supporting Kabardian language education and preservation is crucial
About Kabardian Language
Kabardian (Adyghe-Kabardian, East Circassian) is a Northwest Caucasian language spoken by approximately 500,000 people in:
- Kabardino-Balkaria (Russia)
- Karachay-Cherkessia (Russia)
- Turkey (diaspora communities)
- Middle East (diaspora communities)
Linguistic features:
- Phonology: 50+ consonant phonemes (one of the world's largest inventories)
- Morphology: Polysynthetic - complex word formation
- Syntax: Ergative-absolutive alignment
- Writing: Cyrillic script plus the palochka (Ӏ), which marks the glottal stop and ejective consonants
- Status: Endangered (UNESCO classification)
License and Attribution
This Model
- License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
- Author: Eduard Emkuzhev
- Year: 2025
Base Model
- Model: Helsinki-NLP/opus-mt-en-ru
- License: Apache 2.0
- Authors: Language Technology Research Group at the University of Helsinki
- Link: https://huggingface.co/Helsinki-NLP/opus-mt-en-ru
Training Dataset
- Dataset: Circassian-Russian Parallel Corpus v1.0
- License: CC BY 4.0
- Author: Anzor Qunash (adiga.ai)
- Link: https://huggingface.co/datasets/adiga-ai/circassian-parallel-corpus
Citation
If you use this model in your research, please cite:
@misc{emkuzhev2025kbdru,
  author       = {Eduard Emkuzhev},
  title        = {Kabardian-Russian Neural Machine Translation Model},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/kubataba/kbd-ru-opus}}
}
Please also cite the base model and dataset:
@misc{helsinki-nlp-opus-en-ru,
  author       = {Language Technology Research Group at the University of Helsinki},
  title        = {OPUS-MT English-Russian Translation Model},
  year         = {2020},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Helsinki-NLP/opus-mt-en-ru}}
}
@dataset{qunash2025circassian,
  author    = {Anzor Qunash},
  title     = {Circassian-Russian Parallel Text Corpus v1.0},
  year      = {2025},
  publisher = {adiga.ai},
  url       = {https://huggingface.co/datasets/adiga-ai/circassian-parallel-corpus}
}
Acknowledgments
- Helsinki-NLP team for the excellent OPUS-MT base models
- Anzor Qunash (adiga.ai) for creating and publishing the Circassian-Russian Parallel Corpus
- Kabardian language community for preserving and promoting their language
- All contributors to Circassian language digitization efforts
Related Models
- Reverse direction: kubataba/ru-kbd-opus - Russian to Kabardian
- Base model: Helsinki-NLP/opus-mt-en-ru
- Dataset: adiga-ai/circassian-parallel-corpus
Technical Details
- Framework: PyTorch + Transformers
- Model size: ~300MB
- Vocabulary size: ~62.5K tokens
- Parameters: ~74M
- Inference: CPU and GPU compatible
- Optimal device: GPU or Apple Silicon (MPS)
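As a quick check of the figures above, the model can be loaded, moved to the best available device, and its parameters counted. A short sketch assuming standard PyTorch device APIs:

import torch
from transformers import MarianMTModel

model = MarianMTModel.from_pretrained("kubataba/kbd-ru-opus")

# Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
model.to(device)

num_params = sum(p.numel() for p in model.parameters())
print(f"Device: {device}, parameters: {num_params / 1e6:.0f}M")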
Contact
- Author: Eduard Emkuzhev
- Email: [email protected]
- GitHub: https://github.com/kubataba
- Issues: please report via the Hugging Face model repository
For commercial licensing inquiries, please contact via email.
Model Card Authors: Eduard Emkuzhev
Last Updated: December 2025
Version: 1.0.0