Commit 533c6d1 by shaantastic24 (verified) · 1 Parent(s): bb127bc

Update README.md

  1. README.md +26 -26
README.md CHANGED
@@ -86,32 +86,32 @@ print("Decoded Text:", decoded_output)
  Indic languages and Ansh-256k tokenizers across the 22 Indic languages and English.
  <summary>Tokenizers Results</summary>
 
- |Language |**Ansh-256k**|MuRIL|IndicBERTv2|Ansh-160k|Llama-3.1|NLLB|XLMRoBERTa|Gemma|Sarvam-1|
- | :-------:| :----------: |:---:|:--------: |:------: |:------:|:--:|:-------:|:----:|:------:|
- | Tamil |**1.732** | 1.844 | 1.790 | 1.899 | 11.941 | 2.742 | 2.486 | 2.524 | 2.590 |
- | Kannada |**1.684** | 1.953 | 1.815 | 1.862 | 14.239 | 2.846 | 2.507 | 3.349 | 2.654 |
- | Malayalam|**1.957** | 2.337 | 2.177 | 2.236 | 16.064 | 3.406 | 2.968 | 3.612 | 3.363 |
- | Maithili |**1.398** | 1.832 | 1.695 | 1.561 | 3.246 | 1.955 | 2.133 | 2.152 | 2.503 |
- | Konkani |**1.770** | 2.491 | 2.221 | 2.072 | 4.037 | 2.617 | 2.581 | 2.727 | 2.992 |
- | Telugu |**1.747** | 2.069 | 1.873 | 2.010 | 13.240 | 2.859 | 2.552 | 3.143 | 2.693 |
- | Odia |**1.401** | 1.714 | 1.539 | 1.587 | 15.535 | 2.149 | 2.196 | 4.523 | 2.494 |
- | Bengali |**1.408** | 1.442 | 1.461 | 1.509 | 8.200 | 2.205 | 2.140 | 1.767 | 2.045 |
- | Nepali |**1.272** | 1.413 | 1.411 | 1.428 | 3.611 | 1.898 | 1.643 | 2.027 | 2.358 |
- | Punjabi |**1.310** | 1.420 | 1.341 | 1.434 | 7.855 | 1.843 | 1.798 | 2.789 | 1.726 |
- | Urdu |**1.230** | 1.314 | 1.393 | 1.270 | 3.003 | 1.589 | 1.430 | 1.687 | 8.417 |
- | Hindi |**1.195** |1.276 | 1.272 | 1.246 | 2.757 | 1.546 | 1.525 | 1.442 | 1.480 |
- | Gujarati |**1.423** | 1.587 | 1.459 | 1.495 | 9.651 | 2.145 | 2.062 | 2.358 | 2.093 |
- | Kashmiri |**1.406** | 2.131 | 2.646 | 1.619 | 4.026 | 2.849 | 2.985 | 3.053 | 9.248 |
- | Marathi |**1.463** | 1.579 | 1.521 | 1.573 | 4.010 | 2.207 | 2.011 | 2.012 | 1.979 |
- | Sindhi |**1.226** | 1.354 | 1.630 | 1.333 | 2.938 | 1.621 | 1.532 | 2.101 | 8.165 |
- | Assamese |**1.528** | 1.770 | 1.686 | 1.724 | 8.051 | 2.191 | 2.875 | 2.728 | 4.334 |
- | Sanskrit |**2.254** | 2.855 | 2.732 | 2.470 | 5.034 | 3.453 | 3.344 | 3.562 | 3.949 |
- | Bodo |**1.375** | 2.761 | 1.886 | 2.499 | 3.855 | 3.008 | 3.068 | 3.057 | 3.136 |
- | Santhali |1.333 | **1.144** | 1.966 | 4.538 | 13.456 | 2.994 | 2.095 | 5.634 | 14.402 |
- | Dogri |**1.438** | 1.512 | 1.457 | 1.525 | 2.810 | 1.721 | 1.717 | 1.658 | 1.789 |
- | Manipuri | 4.395 | **1.436** | 2.497 | 4.407 | 13.184 | 2.237 | 2.326 | 9.272 | 13.496 |
- | English |1.415 | **1.368** | 1.373 | 1.449 | 1.384 | 1.480 | 1.470 | 1.415 | 1.743 |
- | **Overall** |**1.526** | 1.899 | 1.893 | 2.348 | 6.024 | 2.498 | 2.439 | 3.123 | 5.963 |
+ | Language | Ansh-256k | [Sarvam-1](https://huggingface.co/sarvamai/sarvam-1) | [Gemma-3](https://huggingface.co/google/gemma-3-270m) | [Llama-3.1](https://huggingface.co/meta-llama/Llama-3.1-8B) | [IndicBERTv2](ai4bharat/IndicBERTv2-MLM-only) | [MuRIL](https://huggingface.co/google/muril-base-cased) | [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) | [XLMRoBERTa](https://huggingface.co/FacebookAI/xlm-roberta-base) | [Ansh-128k](https://huggingface.co/LingoIITGN/Ansh-128k) | [Ansh-160k](https://huggingface.co/LingoIITGN/Ansh-160k) |
+ |:--------|:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | ***Tamil*** |**1.732**| 2.590 | 2.524 | 11.941 | 1.790 | 1.844 | 2.742 | 2.486 | 1.915 | 1.899 |
+ | ***Kannada*** |**1.684**| 2.654 | 3.349 | 14.239 | 1.815 | 1.953 | 2.846 | 2.507 | 1.909 | 1.862 |
+ | ***Malayalam*** |**1.957**| 3.363 | 3.612 | 16.064 | 2.177 | 2.337 | 3.406 | 2.968 | 2.210 | 2.236 |
+ | ***Maithili*** |**1.398**| 2.503 | 2.152 | 3.246 | 1.695 | 1.832 | 1.955 | 2.133 | 1.474 | 1.561 |
+ | ***Konkani*** |**1.770**| 2.992 | 2.727 | 4.037 | 2.221 | 2.491 | 2.617 | 2.581 | 1.941 | 2.072 |
+ | ***Telugu*** |**1.747**| 2.693 | 3.143 | 13.240 | 1.873 | 2.069 | 2.859 | 2.552 | 1.940 | 2.010 |
+ | ***Odia*** |**1.401**| 2.494 | 4.523 | 15.535 | 1.539 | 1.714 | 2.149 | 2.196 | 1.546 | 1.587 |
+ | ***Bengali*** |**1.408**| 2.045 | 1.767 | 8.200 | 1.461 | 1.442 | 2.205 | 2.140 | 1.542 | 1.509 |
+ | ***Nepali*** |**1.272**| 2.358 | 2.027 | 3.611 | 1.411 | 1.413 | 1.898 | 1.643 | 1.376 | 1.428 |
+ | ***Punjabi*** |**1.310**| 1.726 | 2.789 | 7.855 | 1.341 | 1.420 | 1.843 | 1.798 | 1.415 | 1.434 |
+ | ***Urdu*** |**1.230**| 8.417 | 1.687 | 3.003 | 1.393 | 1.314 | 1.589 | 1.430 | 1.285 | 1.270 |
+ | ***Hindi*** |**1.195**| 1.480 | 1.442 | 2.757 | 1.272 | 1.276 | 1.546 | 1.525 | 1.245 | 1.246 |
+ | ***Gujarati*** |**1.423**| 2.093 | 2.358 | 9.651 | 1.459 | 1.587 | 2.145 | 2.062 | 1.537 | 1.495 |
+ | ***Kashmiri*** |**1.406**| 9.248 | 3.053 | 4.026 | 2.646 | 2.131 | 2.849 | 2.985 | 1.540 | 1.619 |
+ | ***Marathi*** |**1.463**| 1.979 | 2.012 | 4.010 | 1.521 | 1.579 | 2.207 | 2.011 | 1.585 | 1.573 |
+ | ***Sindhi*** |**1.226**| 8.165 | 2.101 | 2.938 | 1.630 | 1.354 | 1.621 | 1.532 | 1.300 | 1.333 |
+ | ***Assamese*** |**1.528**| 4.334 | 2.728 | 8.051 | 1.686 | 1.770 | 2.191 | 2.875 | 1.662 | 1.724 |
+ | ***Sanskrit*** |**2.254**| 3.949 | 3.562 | 5.034 | 2.732 | 2.855 | 3.453 | 3.344 | 2.444 | 2.470 |
+ | ***Bodo*** |**1.375**| 3.136 | 3.057 | 3.855 | 1.886 | 2.761 | 3.008 | 3.068 | 1.486 | 2.499 |
+ | ***Santhali*** | 1.333 | 14.402 | 5.634 | 13.456 | 1.966 |**1.144**| 2.994 | 2.095 | 1.414 | 4.538 |
+ | ***Dogri*** |**1.438**| 1.789 | 1.658 | 2.810 | 1.457 | 1.512 | 1.721 | 1.717 | 1.539 | 1.525 |
+ | ***Manipuri*** | 4.395 | 13.496 | 9.272 | 13.184 | 2.497 |**1.436**| 2.237 | 2.326 | 4.416 | 4.407 |
+ | ***English*** |1.415 | 1.743 | 1.415 | 1.384 | 1.373 | **1.368** | 1.480 | 1.470 | 1.545 | 1.449 |
+ | **Overall** |**1.526** | 5.963 | 3.123 | 6.024 | 1.893 | 1.899 | 2.498 | 2.439 | 1.641 | 2.348 |
  </details>
 
  ## Model Card Contact ✉️