impresso-project/language-identifier

@@ -22,8 +22,49 @@ tags:
 This model has been adapted for short, OCR-noisy and fragmentary inputs typical of historical digitized texts.
 ## Model Details
+This model is a supervised [floret model](https://github.com/explosion/floret), trained with the following parameters:
+```
+{'bucket': 200000,
+ 'dimension': 40,
+ 'hash_function': 'N/A',
+ 'loss': 'softmax',
+ 'maxn': 4,
+ 'minn': 1,
+ 'model_type': 'supervised',
+ 'vocab_size': 3}
+```
+On the [impresso language identification challenge test set](https://github.com/impresso/dataset-challenge-lid), it achieves the following performance (in the confusion matrix, rows are true labels and columns are predicted labels):
+```
+      de   en    fr   it  la   lb  nl
+de  2854    0    79    3   0   38   0
+en     0  156     1    0   0    0   0
+fr    14   11  1515    1   7    9   0
+it     0    0     0  136   0    0   0
+la     0    0     0    0   0    0   0
+lb     6    1    20    0   0  775   1
+nl     0    0     0    0   0    0   0
+Detailed Classification Report:
+              precision    recall  f1-score   support
+          de       0.99      0.96      0.98      2974
+          en       0.93      0.99      0.96       157
+          fr       0.94      0.97      0.96      1557
+          it       0.97      1.00      0.99       136
+          la       0.00      0.00      0.00         0
+          lb       0.94      0.97      0.95       803
+          nl       0.00      0.00      0.00         0
+    accuracy                           0.97      5627
+   macro avg       0.68      0.70      0.69      5627
+weighted avg       0.97      0.97      0.97      5627
+```
 ### Model Description
 - **Developed by:** University of Zurich (UZH), as part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
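
The classification report in the diff above can be cross-checked against the confusion matrix it accompanies. The following is a minimal sketch in plain Python, not the evaluation script used by the authors. It assumes that rows are true labels and columns are predicted labels (the row sums match the reported support), and the helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix copied from the model card diff.
# Assumption: rows are true labels, columns are predicted labels.
labels = ["de", "en", "fr", "it", "la", "lb", "nl"]
matrix = [
    [2854, 0, 79, 3, 0, 38, 0],
    [0, 156, 1, 0, 0, 0, 0],
    [14, 11, 1515, 1, 7, 9, 0],
    [0, 0, 0, 136, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [6, 1, 20, 0, 0, 775, 1],
    [0, 0, 0, 0, 0, 0, 0],
]


def per_class_metrics(matrix, labels):
    """Hypothetical helper: precision, recall, F1 and support per label."""
    report = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        support = sum(matrix[i])                  # items whose true label is `label`
        predicted = sum(row[i] for row in matrix)  # items predicted as `label`
        # Undefined ratios are reported as 0.0, matching the zeros in the report.
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return report


report = per_class_metrics(matrix, labels)
total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

# Macro average: unweighted mean over all seven labels, including la and nl.
macro_f1 = sum(m["f1"] for m in report.values()) / len(labels)
# Weighted average: mean weighted by each label's support.
weighted_f1 = sum(m["f1"] * m["support"] for m in report.values()) / total

for label in labels:
    m = report[label]
    print(f"{label}  {m['precision']:.2f}  {m['recall']:.2f}  {m['f1']:.2f}  {m['support']}")
print(f"accuracy {accuracy:.2f}  macro f1 {macro_f1:.2f}  weighted f1 {weighted_f1:.2f}")
```

Under these assumptions, the macro averages come out much lower than the weighted averages because `la` and `nl` have zero support in the test set: they contribute zeros to the unweighted mean but carry no weight in the support-weighted one.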