Update README.md
README.md
CHANGED
|
@@ -88,11 +88,18 @@ predict_educational_value(["Hi"])
|
|
| 88 |
# Output: [3.0000010156072676e-05]
|
| 89 |
|
| 90 |
```
|
| 91 |
-
|
| 92 |
To make sure this classifier makes sense, it is applied to various datasets.
|
| 93 |
|
| 94 |
Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
|
| 95 |
|
| 96 |
|
| 97 |
|Dataset | Sampling | Average Educational Value | Type |
|
| 98 |
|--------------------------------------|---|-------------------|-------|
|
|
@@ -115,7 +122,11 @@ Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
|
|
| 115 |
|[HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)| First 10,000 | 1.058|Real|
|
| 116 |
|[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med)| First 10,000 | 1.017|Real|
|
| 117 |
|[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)| First 10,000 | 0.994|Real|
|
| 118 |
|[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| First 10,000 | 0.853|Real|
|
| 119 |
\* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26) that prevented me from processing the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
|
| 120 |
|
| 121 |
|
|
|
|
| 88 |
# Output: [3.0000010156072676e-05]
|
| 89 |
|
| 90 |
```
|
| 91 |
+
# Benchmark
|
| 92 |
To make sure this classifier makes sense, it is applied to various datasets.
|
| 93 |
|
| 94 |
Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
|
| 95 |
|
| 96 |
+
The score can be interpreted as:
|
| 97 |
+
|Educational Value| Category |
|
| 98 |
+
|--------|----------|
|
| 99 |
+
|2 | High|
|
| 100 |
+
|1 | Mid|
|
| 101 |
+
|0 | Low|
|
| 102 |
+
|
| 103 |
|
| 104 |
|Dataset | Sampling | Average Educational Value | Type |
|
| 105 |
|--------------------------------------|---|-------------------|-------|
|
|
|
|
| 122 |
|[HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)| First 10,000 | 1.058|Real|
|
| 123 |
|[BEE-spoke-data/fineweb-100k_en-med](https://huggingface.co/datasets/BEE-spoke-data/fineweb-100k_en-med)| First 10,000 | 1.017|Real|
|
| 124 |
|[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)| First 10,000 | 0.994|Real|
|
| 125 |
+
|[togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)| First 10,000 | 0.979|Real|
|
| 126 |
|[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| First 10,000 | 0.853|Real|
|
| 127 |
+
|[tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)| First 10,000 | 0.798|Real|
|
| 128 |
+
|
| 129 |
+
|
| 130 |
\* I encountered an [issue](https://huggingface.co/datasets/allenai/dolma/discussions/26) that prevented me from processing the original [allenai/dolma](https://huggingface.co/datasets/allenai/dolma).
|
| 131 |
|
| 132 |
|
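The scoring formula quoted in this diff, `Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)`, can be read as an expected score over three classes. Below is a minimal sketch of that arithmetic. The function names `educational_value` and `average_educational_value` are hypothetical and are not part of the README, and the sketch assumes the three probabilities come from a classifier and sum to 1. It does not show how `predict_educational_value` is actually implemented.

```python
from typing import Iterable, Tuple


def educational_value(p_high: float, p_mid: float, p_low: float) -> float:
    """Expected score: 2 * P(High) + 1 * P(Mid) + 0 * P(Low).

    Hypothetical helper; assumes the three inputs are class
    probabilities that sum to 1.
    """
    total = p_high + p_mid + p_low
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"probabilities must sum to 1, got {total}")
    return 2.0 * p_high + 1.0 * p_mid + 0.0 * p_low


def average_educational_value(
    rows: Iterable[Tuple[float, float, float]]
) -> float:
    """Mean score over a sample of (P(High), P(Mid), P(Low)) triples.

    Hypothetical helper mirroring the "Average Educational Value"
    column in the benchmark table.
    """
    scores = [educational_value(*row) for row in rows]
    if not scores:
        raise ValueError("need at least one row")
    return sum(scores) / len(scores)
```

Under these assumptions, a document classified as purely High scores 2, one classified as purely Low scores 0, and a document with probabilities (0.2, 0.5, 0.3) scores about 0.9, close to the Mid row of the interpretation table.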