---
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
language:
- hi
- as
- mr
- gu
- pa
- en
- or
- te
- ta
- ml
- kn
- bn
- sd
- ur
- ne
- ks
- sa
- gom
- mai
- mni
- brx
- doi
- sat
---

# Tokenizer Card for Ansh-256k!

The tokenizer **`Ansh-256k`** is trained on a dataset covering the **22 official Indic languages** and English. We propose the name *Ansh* because this tokenizer is designed to meticulously identify every essential token (*Ansh* in *Sanskrit*) of our diverse Indic languages. It is an advanced version of **[Ansh-160k](https://huggingface.co/LingoIITGN/Ansh-160k)**, which was trained on 18 Indic languages and English.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/667b8f8ba271fc5a8e6929de/jG3tZnGPvH6vcGrvxO-YC.png)

### Model Description

India is a vast, multilingual country with 22 official languages and more than 1,700 languages and dialects. Many of these languages share words with one another, sometimes even across language families. To capitalize on this observation, we trained our tokenizer with a vocabulary size of **256,000 (256k)** on Wikipedia articles and the Sangraha dataset in 22 Indic languages and English, using the Byte-Pair Encoding (BPE) algorithm. When compared on fertility scores against popular open-source tokenizers trained on multilingual Indic data, our model outperforms them in **20** Indic languages.

- **Developed by:** [Lingo Research Group at IIT Gandhinagar](https://lingo.iitgn.ac.in/)
- **Language(s) (NLP):** Multilingual (22 Indic languages and English)
- **License:** Apache 2.0

## How to Get Started with the Model 👨🏻‍💻

Use the code below to get started with the model.
```python
from transformers import AutoTokenizer

try:
    tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/Ansh-256k")
    print("Tokenizer loaded successfully!")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    print("Please ensure you have the correct model name and are connected to the internet.")
    exit()

input_text = "Hello, world! This is an example of how to use the tokenizer."
# input_text = 'मुझे यह presentation कल morning तक submit करना है। '
# input_text = 'What is the capital city of India?'

encoded_input = tokenizer.encode(input_text)
print("\nOriginal Text:", input_text)
print("Encoded (Token IDs):", encoded_input)

decoded_output = tokenizer.decode(encoded_input)
print("Decoded Text:", decoded_output)
```

## Evaluation

[More Information Needed]

### Results 🏆
Comparison of fertility scores across the 22 Indic languages and English, among popular open-source tokenizers trained on multilingual Indic languages and the Ansh-256k tokenizer.

| Language | Ansh-256k | [Sarvam-1](https://huggingface.co/sarvamai/sarvam-1) | [Gemma-3](https://huggingface.co/google/gemma-3-270m) | [Llama-3.1](https://huggingface.co/meta-llama/Llama-3.1-8B) | [IndicBERTv2](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-only) | [MuRIL](https://huggingface.co/google/muril-base-cased) | [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) | [XLMRoBERTa](https://huggingface.co/FacebookAI/xlm-roberta-base) | [Ansh-128k](https://huggingface.co/LingoIITGN/Ansh-128k) | [Ansh-160k](https://huggingface.co/LingoIITGN/Ansh-160k) |
|:--------|:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ***Tamil*** |**1.732**| 2.590 | 2.524 | 11.941 | 1.790 | 1.844 | 2.742 | 2.486 | 1.915 | 1.899 |
| ***Kannada*** |**1.684**| 2.654 | 3.349 | 14.239 | 1.815 | 1.953 | 2.846 | 2.507 | 1.909 | 1.862 |
| ***Malayalam*** |**1.957**| 3.363 | 3.612 | 16.064 | 2.177 | 2.337 | 3.406 | 2.968 | 2.210 | 2.236 |
| ***Maithili*** |**1.398**| 2.503 | 2.152 | 3.246 | 1.695 | 1.832 | 1.955 | 2.133 | 1.474 | 1.561 |
| ***Konkani*** |**1.770**| 2.992 | 2.727 | 4.037 | 2.221 | 2.491 | 2.617 | 2.581 | 1.941 | 2.072 |
| ***Telugu*** |**1.747**| 2.693 | 3.143 | 13.240 | 1.873 | 2.069 | 2.859 | 2.552 | 1.940 | 2.010 |
| ***Odia*** |**1.401**| 2.494 | 4.523 | 15.535 | 1.539 | 1.714 | 2.149 | 2.196 | 1.546 | 1.587 |
| ***Bengali*** |**1.408**| 2.045 | 1.767 | 8.200 | 1.461 | 1.442 | 2.205 | 2.140 | 1.542 | 1.509 |
| ***Nepali*** |**1.272**| 2.358 | 2.027 | 3.611 | 1.411 | 1.413 | 1.898 | 1.643 | 1.376 | 1.428 |
| ***Punjabi*** |**1.310**| 1.726 | 2.789 | 7.855 | 1.341 | 1.420 | 1.843 | 1.798 | 1.415 | 1.434 |
| ***Urdu*** |**1.230**| 8.417 | 1.687 | 3.003 | 1.393 | 1.314 | 1.589 | 1.430 | 1.285 | 1.270 |
| ***Hindi*** |**1.195**| 1.480 | 1.442 | 2.757 | 1.272 | 1.276 | 1.546 | 1.525 | 1.245 | 1.246 |
| ***Gujarati*** |**1.423**| 2.093 | 2.358 | 9.651 | 1.459 | 1.587 | 2.145 | 2.062 | 1.537 | 1.495 |
| ***Kashmiri*** |**1.406**| 9.248 | 3.053 | 4.026 | 2.646 | 2.131 | 2.849 | 2.985 | 1.540 | 1.619 |
| ***Marathi*** |**1.463**| 1.979 | 2.012 | 4.010 | 1.521 | 1.579 | 2.207 | 2.011 | 1.585 | 1.573 |
| ***Sindhi*** |**1.226**| 8.165 | 2.101 | 2.938 | 1.630 | 1.354 | 1.621 | 1.532 | 1.300 | 1.333 |
| ***Assamese*** |**1.528**| 4.334 | 2.728 | 8.051 | 1.686 | 1.770 | 2.191 | 2.875 | 1.662 | 1.724 |
| ***Sanskrit*** |**2.254**| 3.949 | 3.562 | 5.034 | 2.732 | 2.855 | 3.453 | 3.344 | 2.444 | 2.470 |
| ***Bodo*** |**1.375**| 3.136 | 3.057 | 3.855 | 1.886 | 2.761 | 3.008 | 3.068 | 1.486 | 2.499 |
| ***Santhali*** | 1.333 | 14.402 | 5.634 | 13.456 | 1.966 |**1.144**| 2.994 | 2.095 | 1.414 | 4.538 |
| ***Dogri*** |**1.438**| 1.789 | 1.658 | 2.810 | 1.457 | 1.512 | 1.721 | 1.717 | 1.539 | 1.525 |
| ***Manipuri*** | 4.395 | 13.496 | 9.272 | 13.184 | 2.497 |**1.436**| 2.237 | 2.326 | 4.416 | 4.407 |
| ***English*** | 1.415 | 1.743 | 1.415 | 1.384 | 1.373 |**1.368**| 1.480 | 1.470 | 1.545 | 1.449 |
| **Overall** |**1.526**| 5.963 | 3.123 | 6.024 | 1.893 | 1.899 | 2.498 | 2.439 | 1.641 | 2.348 |
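Fertility is the average number of tokens a tokenizer produces per word: a score of 1.0 means every word maps to a single token, and lower is better. The sketch below shows one way to compute it; this card does not specify the exact evaluation corpus or word-splitting rule, so the whitespace split and the `add_special_tokens=False` setting here are assumptions for illustration.

```python
def fertility(encode, sentences):
    """Fertility = total tokens emitted / total whitespace-separated words.

    `encode` is any callable mapping a sentence to its list of tokens
    (or token IDs). Lower fertility means words are split into fewer pieces.
    """
    total_tokens = sum(len(encode(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# Assumed usage with a Hugging Face tokenizer (requires network access):
#   tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/Ansh-256k")
#   score = fertility(lambda s: tokenizer.encode(s, add_special_tokens=False), corpus)

# Toy illustration: a "tokenizer" that splits every word into characters
char_encode = lambda s: [c for w in s.split() for c in w]
print(fertility(char_encode, ["abc de"]))  # 5 tokens / 2 words = 2.5
```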
## Model Card Contact ✉️ [Lingo Research Group at IIT Gandhinagar, India](https://lingo.iitgn.ac.in/)
Mail at: [lingo@iitgn.ac.in](mailto:lingo@iitgn.ac.in)