synapti committed · Commit fb589f4 · verified · 1 Parent(s): 4caa504

Update model card with complete documentation

Files changed (1)
  1. README.md +130 -55
README.md CHANGED
@@ -1,77 +1,152 @@
  ---
- library_name: transformers
  license: apache-2.0
  base_model: answerdotai/ModernBERT-base
  tags:
- - generated_from_trainer
- metrics:
- - accuracy
- - f1
- - precision
- - recall
- model-index:
- - name: nci-binary-detector
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # nci-binary-detector

- This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0031
- - Accuracy: 0.9954
- - F1: 0.9959
- - Precision: 0.9919
- - Recall: 1.0
- - Roc Auc: 0.9986

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 16
- - eval_batch_size: 32
- - seed: 42
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 32
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 5
- - mixed_precision_training: Native AMP

- ### Training results

- | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall | Roc Auc |
- |:-------------:|:------:|:----:|:---------------:|:--------:|:------:|:---------:|:------:|:-------:|
- | 0.0093 | 0.1634 | 100 | 0.0043 | 0.9844 | 0.9865 | 0.9763 | 0.9970 | 0.9990 |
- | 0.0021 | 0.3268 | 200 | 0.0036 | 0.9954 | 0.9960 | 0.9930 | 0.9990 | 0.9978 |
- | 0.0001 | 0.4902 | 300 | 0.0011 | 0.9988 | 0.9990 | 0.9980 | 1.0 | 0.9999 |
- | 0.0043 | 0.6536 | 400 | 0.0009 | 0.9959 | 0.9965 | 0.9930 | 1.0 | 1.0000 |
- | 0.0001 | 0.8170 | 500 | 0.0006 | 0.9988 | 0.9990 | 0.9980 | 1.0 | 1.0000 |
- | 0.0006 | 0.9804 | 600 | 0.0010 | 0.9977 | 0.9980 | 0.9980 | 0.9980 | 0.9999 |

- ### Framework versions

- - Transformers 4.57.3
- - Pytorch 2.9.1+cu128
- - Datasets 4.4.1
- - Tokenizers 0.22.1
 
  ---
  license: apache-2.0
+ datasets:
+ - synapti/nci-propaganda-production
  base_model: answerdotai/ModernBERT-base
  tags:
+ - transformers
+ - modernbert
+ - text-classification
+ - propaganda-detection
+ - binary-classification
+ - nci-protocol
+ library_name: transformers
+ pipeline_tag: text-classification
  ---

+ # NCI Binary Detector
+
+ A fast binary classifier that detects whether text contains propaganda techniques.
+
+ ## Model Description
+
+ This model is **Stage 1** of the NCI (Narrative Credibility Index) two-stage propaganda detection pipeline:
+
+ - **Stage 1 (this model)**: Fast binary detection - "Does this text contain propaganda?"
+ - **Stage 2**: Multi-label technique classification - "Which specific techniques are used?"
+
+ The binary detector serves as a fast, high-recall filter, passing flagged content on to the more detailed technique classifier (see the two-stage example under Usage).
+
+ ## Labels
+
+ | Label | Description |
+ |-------|-------------|
+ | `no_propaganda` | Text does not contain propaganda techniques |
+ | `has_propaganda` | Text contains one or more propaganda techniques |
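+
+ If you work with raw logits rather than the pipeline, the index-to-label mapping can be read from the model config (a quick sanity check; the mapping in the comment shows the expected shape, not a guaranteed ordering):
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("synapti/nci-binary-detector")
+ print(config.id2label)   # e.g. {0: 'no_propaganda', 1: 'has_propaganda'}
+ ```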
+
+ ## Performance
+
+ **Test Set Results:**
+
+ | Metric | Score |
+ |--------|-------|
+ | Accuracy | 99.5% |
+ | F1 Score | 99.6% |
+ | Precision | 99.2% |
+ | Recall | 100.0% |
+ | ROC AUC | 99.9% |
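+
+ To reproduce this style of evaluation on your own labeled split, something like the following works (a sketch: `texts` and `labels` are placeholders, with 1 = has_propaganda):
+
+ ```python
+ from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
+ from transformers import pipeline
+
+ detector = pipeline("text-classification", model="synapti/nci-binary-detector", top_k=None)
+
+ texts = ["example text with propaganda...", "example neutral text..."]  # your test texts
+ labels = [1, 0]                                                          # gold labels
+
+ results = detector(texts)  # with top_k=None: per-text list of {label, score} dicts
+ pos_scores = [next(d["score"] for d in r if d["label"] == "has_propaganda") for r in results]
+ preds = [int(s > 0.5) for s in pos_scores]
+
+ print("accuracy :", accuracy_score(labels, preds))
+ print("f1       :", f1_score(labels, preds))
+ print("precision:", precision_score(labels, preds))
+ print("recall   :", recall_score(labels, preds))
+ print("roc auc  :", roc_auc_score(labels, pos_scores))
+ ```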
+
+ ## Usage
+
+ ### Basic Usage
+
+ ```python
+ from transformers import pipeline
+
+ detector = pipeline(
+     "text-classification",
+     model="synapti/nci-binary-detector",
+ )
+
+ text = "The radical left is DESTROYING our country!"
+ result = detector(text)[0]
+
+ print(f"Label: {result['label']}")        # 'has_propaganda' or 'no_propaganda'
+ print(f"Confidence: {result['score']:.2%}")
+ ```
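+
+ The pipeline also accepts a list of texts; for larger workloads, pass a batch size (the value below is illustrative, not tuned):
+
+ ```python
+ texts = ["First headline to screen...", "Second headline to screen..."]
+ results = detector(texts, batch_size=32)   # reuses the `detector` defined above
+ for text, res in zip(texts, results):
+     print(f"{res['label']} ({res['score']:.2%}): {text[:60]}")
+ ```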
+
+ ### Two-Stage Pipeline
+
+ For best results, use this detector together with the technique classifier:
+
+ ```python
+ from transformers import pipeline
+
+ # Stage 1: binary detection
+ detector = pipeline("text-classification", model="synapti/nci-binary-detector")
+
+ # Stage 2: technique classification (run only when propaganda is detected)
+ classifier = pipeline("text-classification", model="synapti/nci-technique-classifier", top_k=None)
+
+ text = "Your text to analyze..."
+
+ # Quick check first
+ detection = detector(text)[0]
+ if detection["label"] == "has_propaganda" and detection["score"] > 0.5:
+     # Detailed technique analysis
+     techniques = classifier(text)[0]
+     detected = [t for t in techniques if t["score"] > 0.3]
+     for t in detected:
+         print(f"{t['label']}: {t['score']:.2%}")
+ else:
+     print("No propaganda detected")
+ ```
+
+ ## Training Data
+
+ Trained on [synapti/nci-propaganda-production](https://huggingface.co/datasets/synapti/nci-propaganda-production):
+
+ - **23,000+ examples** from multiple sources
+ - **Positive examples**: text with one or more propaganda techniques (from SemEval-2020 plus augmented data)
+ - **Hard negatives**: factual content from the LIAR2 and QBias datasets
+ - **Class-weighted Focal Loss** to handle class imbalance (gamma=2.0); see the sketch under Training Details
+
+ ## Model Architecture
+
+ - **Base Model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
+ - **Parameters**: 149.6M
+ - **Max Sequence Length**: 512 tokens
+ - **Output**: 2 labels (binary classification)
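+
+ Inputs beyond the 512-token maximum must be truncated before inference. A minimal sketch of manual (non-pipeline) usage with explicit truncation, assuming the stock `AutoModelForSequenceClassification` head:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ model_id = "synapti/nci-binary-detector"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()
+
+ # Tokenize and truncate anything longer than the model's 512-token limit
+ inputs = tokenizer("Some long article...", truncation=True, max_length=512, return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**inputs).logits   # shape: (1, 2)
+ probs = logits.softmax(dim=-1)[0]
+ pred = int(probs.argmax())
+ print(model.config.id2label[pred], f"{probs[pred]:.2%}")
+ ```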
+
+ ## Training Details
+
+ - **Loss Function**: Focal Loss (gamma=2.0, alpha=0.25)
+ - **Optimizer**: AdamW
+ - **Learning Rate**: 2e-5
+ - **Batch Size**: 16 (effective 32 with gradient accumulation)
+ - **Epochs**: 5, with early stopping (patience=3)
+ - **Hardware**: NVIDIA A10G GPU
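+
+ For reference, a minimal sketch of a binary focal loss with these hyperparameters (the standard formulation; the exact class weighting used in training is not spelled out in this card, so treat the alpha handling as illustrative):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
+     """Binary focal loss over 2-class logits: down-weights easy, well-classified examples."""
+     log_probs = F.log_softmax(logits, dim=-1)                      # (N, 2)
+     log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of the true class
+     pt = log_pt.exp()
+     # alpha weights the positive class, (1 - alpha) the negative class
+     alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
+     return (-alpha_t * (1.0 - pt) ** gamma * log_pt).mean()
+ ```
+
+ In a Hugging Face training setup, a loss like this is typically wired in by overriding `Trainer.compute_loss`.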
+
+ ## Limitations
+
+ - Trained primarily on English text
+ - Works best on content similar to the training distribution (news articles, social media posts)
+ - May miss subtle or novel propaganda techniques not represented in the training data
+ - Should be paired with human review in high-stakes applications
+
+ ## Related Models
+
+ - [synapti/nci-technique-classifier](https://huggingface.co/synapti/nci-technique-classifier) - Stage 2 multi-label technique classifier
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{da-san-martino-etal-2020-semeval,
+   title = "{S}em{E}val-2020 Task 11: Detection of Propaganda Techniques in News Articles",
+   author = "Da San Martino, Giovanni and others",
+   booktitle = "Proceedings of SemEval-2020",
+   year = "2020",
+ }
+
+ @misc{nci-binary-detector,
+   author = {NCI Protocol Team},
+   title = {NCI Binary Detector},
+   year = {2024},
+   publisher = {HuggingFace},
+   url = {https://huggingface.co/synapti/nci-binary-detector}
+ }
+ ```
+
+ ## License
+
+ Apache 2.0