Add model specification and test set information
README.md CHANGED
@@ -22,8 +22,49 @@ tags:
 
 This model has been adapted for short, OCR-noisy and fragmentary inputs typical of historical digitized texts.
 
+
 ## Model Details
 
+This model is a supervised [floret model](https://github.com/explosion/floret), trained with the following parameters:
+```
+{'bucket': 200000,
+ 'dimension': 40,
+ 'hash_function': 'N/A',
+ 'loss': 'softmax',
+ 'maxn': 4,
+ 'minn': 1,
+ 'model_type': 'supervised',
+ 'vocab_size': 3}
+```
+
+On the [impresso language identification challenge test set](https://github.com/impresso/dataset-challenge-lid) it achieves the following performance:
+
+```
+      de   en    fr   it  la   lb  nl
+de  2854    0    79    3   0   38   0
+en     0  156     1    0   0    0   0
+fr    14   11  1515    1   7    9   0
+it     0    0     0  136   0    0   0
+la     0    0     0    0   0    0   0
+lb     6    1    20    0   0  775   1
+nl     0    0     0    0   0    0   0
+
+Detailed Classification Report:
+
+              precision    recall  f1-score   support
+
+          de       0.99      0.96      0.98      2974
+          en       0.93      0.99      0.96       157
+          fr       0.94      0.97      0.96      1557
+          it       0.97      1.00      0.99       136
+          la       0.00      0.00      0.00         0
+          lb       0.94      0.97      0.95       803
+          nl       0.00      0.00      0.00         0
+
+    accuracy                           0.97      5627
+   macro avg       0.68      0.70      0.69      5627
+weighted avg       0.97      0.97      0.97      5627
+```
 ### Model Description
 
 - **Developed by:** University of Zurich (UZH) from the [Impresso team](https://impresso-project.ch). The project is an interdisciplinary project focused on historical media analysis across languages, time, and modalities. Funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
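The classification report added in this change can be cross-checked against the confusion matrix. The sketch below is not part of the commit. It assumes that rows are true labels and columns are predicted labels, which the README does not state but which is consistent with the row sums matching the `support` column. The names `LABELS`, `MATRIX`, `per_class_metrics`, `accuracy` and `macro_average` are illustrative only.

```python
# Confusion matrix copied from the README change.
# Assumption: rows = true labels, columns = predicted labels
# (each row sum equals the "support" value in the report).
LABELS = ["de", "en", "fr", "it", "la", "lb", "nl"]
MATRIX = [
    [2854, 0, 79, 3, 0, 38, 0],
    [0, 156, 1, 0, 0, 0, 0],
    [14, 11, 1515, 1, 7, 9, 0],
    [0, 0, 0, 136, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [6, 1, 20, 0, 0, 775, 1],
    [0, 0, 0, 0, 0, 0, 0],
]


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def per_class_metrics(labels, matrix):
    """Compute precision, recall, F1 and support for every label."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        support = sum(matrix[i])                    # items whose true label is `label`
        predicted = sum(row[i] for row in matrix)   # items predicted as `label`
        precision = safe_div(true_positives, predicted)
        recall = safe_div(true_positives, support)
        f1 = safe_div(2 * precision * recall, precision + recall)
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return metrics


def accuracy(matrix):
    """Share of items on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return safe_div(correct, total)


def macro_average(metrics, key):
    """Unweighted mean of one metric over all labels."""
    return sum(m[key] for m in metrics.values()) / len(metrics)


METRICS = per_class_metrics(LABELS, MATRIX)

if __name__ == "__main__":
    for label, m in METRICS.items():
        print(f"{label}  {m['precision']:.2f}  {m['recall']:.2f}  "
              f"{m['f1']:.2f}  {m['support']}")
    print(f"accuracy   {accuracy(MATRIX):.2f}")
    print(f"macro f1   {macro_average(METRICS, 'f1'):.2f}")
```

Under that assumption the recomputed values agree with the report. `la` and `nl` have no items in this test set, so their scores are zero, which explains why the macro average is far below the weighted average.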