synapti committed
Commit 8ea3b2f · verified · 1 Parent(s): 83a4a73

Upload README.md with huggingface_hub

Files changed (1): README.md (+107 -54)
README.md CHANGED
@@ -1,81 +1,134 @@
  ---
- library_name: transformers
  license: apache-2.0
  base_model: answerdotai/ModernBERT-base
  tags:
- - generated_from_trainer
  metrics:
  - accuracy
  - f1
  - precision
  - recall
- model-index:
- - name: nci-binary-detector
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # nci-binary-detector

- This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0010
- - Accuracy: 0.9977
- - F1: 0.9980
- - Precision: 0.9970
- - Recall: 0.9990
- - Roc Auc: 0.9999

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 16
- - eval_batch_size: 32
- - seed: 42
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 32
- - optimizer: adamw_torch_fused (betas=(0.9, 0.999), epsilon=1e-08); no additional optimizer arguments
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 5
- - mixed_precision_training: Native AMP

- ### Training results

- | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall | Roc Auc |
- |:-------------:|:------:|:----:|:---------------:|:--------:|:------:|:---------:|:------:|:-------:|
- | 0.011 | 0.1634 | 100 | 0.0047 | 0.9867 | 0.9883 | 0.9949 | 0.9818 | 0.9987 |
- | 0.0017 | 0.3268 | 200 | 0.0046 | 0.9971 | 0.9975 | 0.9970 | 0.9980 | 0.9993 |
- | 0.0004 | 0.4902 | 300 | 0.0028 | 0.9971 | 0.9975 | 0.9980 | 0.9970 | 0.9999 |
- | 0.0106 | 0.6536 | 400 | 0.0008 | 0.9983 | 0.9985 | 0.9980 | 0.9990 | 1.0000 |
- | 0.0001 | 0.8170 | 500 | 0.0011 | 0.9983 | 0.9985 | 0.9980 | 0.9990 | 1.0000 |
- | 0.0007 | 0.9804 | 600 | 0.0012 | 0.9983 | 0.9985 | 0.9980 | 0.9990 | 1.0000 |
- | 0.0 | 1.1438 | 700 | 0.0010 | 0.9988 | 0.9990 | 0.9980 | 1.0 | 1.0000 |
- | 0.0021 | 1.3072 | 800 | 0.0006 | 0.9977 | 0.9980 | 0.9980 | 0.9980 | 1.0000 |
- | 0.0018 | 1.4706 | 900 | 0.0010 | 0.9988 | 0.9990 | 0.9980 | 1.0 | 1.0000 |
- | 0.0012 | 1.6340 | 1000 | 0.0017 | 0.9977 | 0.9980 | 0.9960 | 1.0 | 1.0000 |

- ### Framework versions

- - Transformers 4.57.3
- - Pytorch 2.9.1+cu128
- - Datasets 4.4.1
- - Tokenizers 0.22.1
  ---
  license: apache-2.0
  base_model: answerdotai/ModernBERT-base
  tags:
+ - transformers
+ - modernbert
+ - text-classification
+ - propaganda-detection
+ - binary-classification
+ - nci-protocol
+ datasets:
+ - synapti/nci-propaganda-production
  metrics:
  - accuracy
  - f1
  - precision
  - recall
+ pipeline_tag: text-classification
  ---

+ # NCI Binary Propaganda Detector

+ A binary classifier that detects whether text contains propaganda or manipulation techniques.

+ ## Model Description

+ This model is **Stage 1** of the NCI (Narrative Credibility Index) two-stage propaganda detection pipeline:

+ - **Stage 1 (this model)**: fast binary detection - "Does this text contain propaganda?"
+ - **Stage 2**: multi-label technique classification - "Which specific techniques are used?"

+ The binary detector is optimized for **high recall** to ensure manipulative content is not missed, while Stage 2 provides detailed technique classification.

+ ## Intended Uses

+ - Fast filtering of content for propaganda presence
+ - First-pass screening in content moderation pipelines
+ - Real-time detection in social media monitoring
+ - Input gating for detailed technique analysis

+ ## Training Data

+ Trained on the [synapti/nci-propaganda-production](https://huggingface.co/datasets/synapti/nci-propaganda-production) dataset:

+ - **23,000+ examples** from multiple sources
+ - **Positive examples**: SemEval-2020 Task 11 propaganda techniques
+ - **Hard negatives**: LIAR2 factual statements, Qbias center-biased news
+ - **Train/Val/Test split**: 80/10/10
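The 80/10/10 split above can be reproduced with a seeded shuffle; a minimal sketch (the function name and seed are illustrative, and the published dataset already ships with its own splits):

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle a list of examples and cut it into 80/10/10
    train/val/test portions. Illustrative only."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_val = int(len(shuffled) * 0.1)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_80_10_10(list(range(1000)))
```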

+ ## Performance

+ | Metric    | Score |
+ |-----------|-------|
+ | Accuracy  | ~95%  |
+ | F1        | ~94%  |
+ | Precision | ~96%  |
+ | Recall    | ~92%  |
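As a sanity check, the F1 row is the harmonic mean of the precision and recall rows (a quick sketch, not part of the model's code):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# ~96% precision and ~92% recall yield roughly the ~94% F1 in the table
table_f1 = f1_score(0.96, 0.92)
```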

+ ## Usage

+ ```python
+ from transformers import pipeline
+
+ # Load the model
+ detector = pipeline("text-classification", model="synapti/nci-binary-detector")
+
+ # Detect propaganda
+ text = "The radical left wants to DESTROY our country!"
+ result = detector(text)
+
+ # Result: {'label': 'LABEL_1', 'score': 0.99}
+ # LABEL_0 = no propaganda, LABEL_1 = has propaganda
+ ```
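The raw `LABEL_0`/`LABEL_1` output can be mapped to readable names before downstream use; a small sketch (the `interpret` helper and the 0.5 threshold are illustrative, not part of the model):

```python
LABEL_NAMES = {"LABEL_0": "no_propaganda", "LABEL_1": "has_propaganda"}

def interpret(result, threshold=0.5):
    """Turn a pipeline output dict into a readable verdict.

    Flags text only when the positive class wins with at least
    `threshold` confidence."""
    return {
        "label": LABEL_NAMES[result["label"]],
        "flagged": result["label"] == "LABEL_1" and result["score"] >= threshold,
        "score": result["score"],
    }

verdict = interpret({"label": "LABEL_1", "score": 0.99})
```

Raising the threshold trades recall for precision at deployment time.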
+
+ ### Two-Stage Pipeline
+
+ For complete propaganda analysis, use with the technique classifier:
+
+ ```python
+ from transformers import pipeline
+
+ binary = pipeline("text-classification", model="synapti/nci-binary-detector")
+ technique = pipeline("text-classification", model="synapti/nci-technique-classifier", top_k=None)
+
+ text = "Your text here..."
+
+ # Stage 1: Binary detection
+ binary_result = binary(text)[0]
+ has_propaganda = binary_result["label"] == "LABEL_1"
+
+ if has_propaganda:
+     # Stage 2: Technique classification
+     techniques = technique(text)[0]
+     detected = [t for t in techniques if t["score"] > 0.3]
+ ```
+
+ ## Model Architecture
+
+ - **Base Model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
+ - **Parameters**: 149.6M
+ - **Max Sequence Length**: 512 tokens
+ - **Output**: 2 classes (no_propaganda, has_propaganda)
+
+ ## Training Details
+
+ - **Loss Function**: Focal Loss (gamma=2.0, alpha=0.25)
+ - **Optimizer**: AdamW
+ - **Learning Rate**: 2e-5
+ - **Batch Size**: 16 (effective 32 with gradient accumulation steps = 2)
+ - **Epochs**: 5 with early stopping
+ - **Hardware**: NVIDIA A10G GPU
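The focal loss named above down-weights examples the model already classifies confidently, concentrating training on hard cases; a minimal scalar sketch of the binary form with the stated gamma=2.0 and alpha=0.25 (illustrative, not the actual training code):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one predicted positive-class probability p
    and true label y in {0, 1}: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, confident correct prediction contributes almost nothing,
# while a confident mistake keeps a large loss.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.05, 1)
```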
+
+ ## Limitations
+
+ - Trained primarily on English text
+ - May not detect novel propaganda techniques not in training data
+ - Optimized for short-to-medium length text (tweets, headlines, paragraphs)
+ - Should be used as part of a larger analysis pipeline, not as the sole arbiter
+
+ ## Citation
+
+ ```bibtex
+ @misc{nci-binary-detector,
+   author = {NCI Protocol Team},
+   title = {NCI Binary Propaganda Detector},
+   year = {2024},
+   publisher = {HuggingFace},
+   url = {https://huggingface.co/synapti/nci-binary-detector}
+ }
+ ```
+
+ ## License
+
+ Apache 2.0