Kabardian-Russian Translation Model (kbd-ru-opus)

Fine-tuned MarianMT model for Kabardian (East Circassian) to Russian translation.

Model Description

This model translates from Kabardian to Russian. Kabardian is an endangered Northwest Caucasian language with approximately 500,000 speakers. It features complex polysynthetic morphology, 50+ consonants, and ergative-absolutive alignment.

Intended Use

Primary uses:

  • Language documentation and digitization
  • Educational content translation
  • Cultural heritage preservation
  • Low-resource NMT research
  • Supporting Kabardian speakers in accessing Russian content

Limitations:

  • Non-commercial use only (CC BY-NC 4.0)
  • Best performance on everyday language
  • May struggle with modern technical terms not in training data
  • Requires proper handling of Kabardian-specific character Ӏ (palochka)

Training Data

Dataset: adiga-ai/circassian-parallel-corpus

  • Subset: kbd_ru (Kabardian → Russian)
  • Total training examples: ~120K parallel sentence pairs
  • Dataset license: CC BY 4.0
  • Dataset author: Anzor Qunash (adiga.ai)
  • Content: Dictionary entries, folklore texts, proverbs, everyday expressions

Training Procedure

Base Model

  • Architecture: Marian Transformer (transformer-align)
  • Base: Helsinki-NLP/opus-mt-en-ru (English-Russian translation)
  • Transfer learning: Adapted from English-Russian to Kabardian-Russian

Hyperparameters

base_model: Helsinki-NLP/opus-mt-en-ru
training_examples: 120,000
epochs: 3
batch_size: 16
learning_rate: 3e-5
optimizer: AdamW
max_sequence_length: 128
warmup_steps: 500
weight_decay: 0.01
framework: transformers 4.36.0
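Under transformers 4.36, these settings would map roughly onto the Trainer API as sketched below. This is illustrative only: `output_dir` and the generation flag are assumptions, not taken from the model card.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative mapping of the hyperparameters above onto Seq2SeqTrainingArguments;
# output_dir and predict_with_generate are assumptions, not from the model card.
training_args = Seq2SeqTrainingArguments(
    output_dir="kbd-ru-opus",          # assumed
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    warmup_steps=500,
    weight_decay=0.01,
    optim="adamw_torch",               # AdamW optimizer
    predict_with_generate=True,        # assumed, for seq2seq evaluation
)
```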

Special Preprocessing

The model uses a special character mapping for training:

  • Kabardian Ӏ (palochka) → I (Latin I) during training
  • I → Ӏ restored during inference

This ensures better tokenization compatibility with the MarianMT tokenizer.
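The mapping can be wrapped in a pair of small helpers. The function names are illustrative, not part of the published model code; note that restoring palochka only matters when the text being post-processed is Kabardian (the Russian output of this model contains no Latin "I"):

```python
# Palochka (U+04C0 / U+04CF) <-> Latin I mapping used around tokenization.
# Helper names are illustrative, not part of the published model code.
def to_latin_i(text: str) -> str:
    """Replace upper- and lowercase palochka with Latin 'I' before tokenizing."""
    return text.replace("Ӏ", "I").replace("ӏ", "I")

def restore_palochka(text: str) -> str:
    """Restore palochka after generation (only needed for Kabardian-side text)."""
    return text.replace("I", "Ӏ")

print(to_latin_i("Ӏуащхьэ"))  # Iуащхьэ
```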

Performance

Benchmark Results

Tested on 1,000 examples from adiga-ai/circassian-parallel-corpus:

Metric        Score
BLEU          12.73
chrF          34.42
TER           80.80 (lower is better)
Exact match   6.9%
Speed         1.6 examples/sec
Avg. time     626 ms/example

Test Configuration:

  • Test size: 1,000 examples
  • Sampling: Every 50th sentence from corpus
  • Generation: beam_search (num_beams=4)
  • Device: Apple M-series (MPS)
  • Seed: 42 (reproducible)
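The sampling scheme above can be sketched as follows. The helper name and the cap-then-slice order are assumptions; with a ~120K-sentence corpus, every-50th sampling yields well over 1,000 candidates, so the cap applies:

```python
# Take every 50th sentence from the corpus, capped at 1,000 examples,
# as described above. Helper name and exact slicing order are assumptions.
def sample_test_set(sentences, step=50, limit=1000):
    return sentences[::step][:limit]

corpus = [f"sentence {i}" for i in range(120_000)]
test_set = sample_test_set(corpus)
print(len(test_set))  # 1000
```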

Translation Examples

Kabardian (input) → Russian (output), with English glosses of the Russian output:

  • ублӏэркӏын → перевалиться ("to tumble over")
  • Уэ ужакъым → Ты не бежал ("You did not run")
  • Шухьэр Кесарие къалэм щынэсым, ӏэтащхьэм тхылъыр ир → Когда враги добрались до города Кесария, в главе... ("When the enemies reached the city of Caesarea, in the head...")
  • зэкъуэстын → стоять друг с другом ("to stand with one another")
  • Ерагъпсӏарагъщ зэрызрагъэкӏружар. → Это единственность, которую они вернули. ("This is the singularity that they returned.")

Note: Translation quality varies with input length and domain. Short dictionary-style entries are often handled well, while longer or out-of-domain sentences may lose or distort meaning, as the benchmark metrics above suggest.

How to Use

Installation

pip install transformers torch sentencepiece

Basic Usage

from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "kubataba/kbd-ru-opus"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translation function
def translate_kbd_to_ru(text):
    # Preprocess: Ӏ → I for tokenization
    processed_text = text.replace('Ӏ', 'I').replace('ӏ', 'I')
    
    inputs = tokenizer(processed_text, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translation

# Example
kabardian_text = "Сэлам!"
russian_text = translate_kbd_to_ru(kabardian_text)
print(f"KBD: {kabardian_text}")
print(f"RU: {russian_text}")

Batch Translation

texts = [
    "Уи пщэдджыжь фӀыуэ!",
    "Сыт укъэпсэлъар?",
    "Ди лъэпкъым и бзэр"
]

# Preprocess all texts
processed_texts = [t.replace('Ӏ', 'I').replace('ӏ', 'I') for t in texts]

inputs = tokenizer(processed_texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
translations = [tokenizer.decode(out, skip_special_tokens=True) for out in outputs]

for src, tgt in zip(texts, translations):
    print(f"KBD: {src} → RU: {tgt}")

Limitations and Bias

  • Complex morphology: Kabardian's polysynthetic structure may result in one-to-many mappings to Russian
  • Character handling: Requires preprocessing of palochka (Ӏ → I) before tokenization
  • Domain adaptation: Best performance on text types similar to training data
  • Technical vocabulary: May struggle with modern technical/specialized terms
  • Endangered language: Limited real-world validation data due to language endangerment
  • Dialectal variation: Trained on literary Kabardian; dialectal forms may produce unexpected results

Ethical Considerations

This model contributes to digital language preservation for Kabardian, an endangered language.

Important considerations:

  • Machine translation should complement, not replace, human translators
  • Cultural sensitivity is essential when working with indigenous languages
  • The model may not capture all nuances of Kabardian language and culture
  • Translations should be reviewed by native speakers for critical applications
  • Supporting Kabardian language education and preservation is crucial

About Kabardian Language

Kabardian (Adyghe-Kabardian, East Circassian) is a Northwest Caucasian language spoken by approximately 500,000 people in:

  • Kabardino-Balkaria (Russia)
  • Karachay-Cherkessia (Russia)
  • Turkey (diaspora communities)
  • Middle East (diaspora communities)

Linguistic features:

  • Phonology: 50+ consonant phonemes (one of the world's largest inventories)
  • Morphology: Polysynthetic - complex word formation
  • Syntax: Ergative-absolutive alignment
  • Writing: Cyrillic script + palochka (Ӏ) for glottal stop
  • Status: Endangered (UNESCO classification)
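The palochka is a distinct Cyrillic letter, not the Latin "I" it resembles, which is why the character mapping described in the preprocessing section is needed. A quick check of the code points:

```python
# Palochka has its own Cyrillic code points, distinct from Latin I.
print(hex(ord("Ӏ")))  # 0x4c0 (CYRILLIC LETTER PALOCHKA)
print(hex(ord("ӏ")))  # 0x4cf (CYRILLIC SMALL LETTER PALOCHKA)
print(hex(ord("I")))  # 0x49  (LATIN CAPITAL LETTER I)
```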

License and Attribution

This Model

  • License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
  • Author: Eduard Emkuzhev
  • Year: 2025

Base Model

  • Helsinki-NLP/opus-mt-en-ru (OPUS-MT, Language Technology Research Group at the University of Helsinki)

Training Dataset

  • adiga-ai/circassian-parallel-corpus (CC BY 4.0) by Anzor Qunash (adiga.ai)

Citation

If you use this model in your research, please cite:

@misc{emkuzhev2025kbdru,
  author = {Eduard Emkuzhev},
  title = {Kabardian-Russian Neural Machine Translation Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/kubataba/kbd-ru-opus}}
}

Please also cite the base model and dataset:

@misc{helsinki-nlp-opus-en-ru,
  author = {Language Technology Research Group at the University of Helsinki},
  title = {OPUS-MT English-Russian Translation Model},
  year = {2020},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Helsinki-NLP/opus-mt-en-ru}}
}

@dataset{qunash2025circassian,
  author = {Anzor Qunash},
  title = {Circassian-Russian Parallel Text Corpus v1.0},
  year = {2025},
  publisher = {adiga.ai},
  url = {https://huggingface.co/datasets/adiga-ai/circassian-parallel-corpus}
}

Acknowledgments

  • Helsinki-NLP team for the excellent OPUS-MT base models
  • Anzor Qunash (adiga.ai) for creating and publishing the Circassian-Russian Parallel Corpus
  • Kabardian language community for preserving and promoting their language
  • All contributors to Circassian language digitization efforts


Technical Details

  • Framework: PyTorch + Transformers
  • Model size: ~300MB
  • Vocabulary size: ~62.5K tokens
  • Parameters: ~76M
  • Inference: CPU and GPU compatible
  • Optimal device: GPU or Apple Silicon (MPS)
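Device selection for inference might look like the sketch below, falling back to CPU when no accelerator is available:

```python
import torch

# Pick the best available device: CUDA GPU, Apple Silicon (MPS), else CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# model.to(device) and inputs.to(device) would then move computation there.
print(device)
```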

Contact

For commercial licensing inquiries, please contact via email.


Model Card Authors: Eduard Emkuzhev

Last Updated: December 2025

Version: 1.0.0
