---
license: mit
language:
- tl
tags:
- tagalog
- dependency-parsing
- contrastive-learning
- bert
- syntax
- low-resource
base_model: paulbontempo/bert-tagalog-mlm-stage1
library_name: transformers
---

# Tagalog BERT with Dependency-Aware Contrastive Learning

This is a BERT model for Tagalog with token embeddings fine-tuned using contrastive learning on dependency parse tree structures.

## Model Description

- **Base Model:** [paulbontempo/bert-tagalog-mlm-stage1](https://huggingface.co/paulbontempo/bert-tagalog-mlm-stage1), itself fine-tuned from a base BERT checkpoint on the FakeNewsFilipino dataset
- **Language:** Tagalog (Filipino)
- **Training Approach:** Two-stage fine-tuning for low-resource language processing
  1. **Stage 1:** Masked Language Modeling (MLM) on a Tagalog corpus (FakeNewsFilipino)
  2. **Stage 2:** Contrastive learning with an InfoNCE loss on a corpus of dependency parse triples (UD-Ugnayan)

## Our Contributions

We use a novel approach that encodes syntactic structure directly into token embeddings:
- Dependency triples (head, relation, dependent) are extracted from 94 UD-annotated Tagalog sentences (a minimal extraction sketch follows this list)
- Contrastive learning with an InfoNCE loss trains tokens to cluster by their syntactic roles
- Tokens that appear as heads of the same dependency relation become closer in embedding space
- The goal is to improve downstream NLP performance for low-resource Tagalog
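
The extraction code itself is not included in this card. The sketch below shows, for illustration only, how (head, relation, dependent) triples can be read off a UD-style parse; the toy parse of "Magandang umaga sa lahat" is hand-written and only approximates the UD-Ugnayan annotations.

```python
# A minimal sketch (not the actual training pipeline) of turning a
# UD-style dependency parse into (head, relation, dependent) triples.
# Each token: (id, form, head_id, deprel); head_id 0 marks the root.
parsed_sentence = [
    (1, "Magandang", 2, "amod"),
    (2, "umaga", 0, "root"),
    (3, "sa", 4, "case"),
    (4, "lahat", 2, "nmod"),
]

# Map token ids to surface forms so heads can be looked up by id.
forms = {tok_id: form for tok_id, form, _, _ in parsed_sentence}

triples = [
    (forms[head_id], deprel, form)
    for _, form, head_id, deprel in parsed_sentence
    if head_id != 0  # the root has no lexical head
]

print(triples)
# [('umaga', 'amod', 'Magandang'), ('lahat', 'case', 'sa'), ('umaga', 'nmod', 'lahat')]
```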

## Architecture

Standard BERT architecture with fine-tuned token embeddings:
- **Hidden size:** 768
- **Attention heads:** 12
- **Layers:** 12
- **Vocabulary size:** ~50,000 tokens (WordPiece)

The contrastive learning stage used the following settings (a minimal loss sketch appears after this list):
- **Loss:** InfoNCE (temperature=0.07)
- **Projection dimension:** 256
- **Training epochs:** 50
- **Final loss:** 0.076
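
The training script is not part of this card. The following is a minimal PyTorch sketch of an InfoNCE objective with a 768-to-256 projection head and temperature 0.07, matching the settings above; the class and function names, and the batch layout (one positive per anchor, in-batch negatives), are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ProjectionHead(nn.Module):
    """Projects 768-d BERT token embeddings into a 256-d contrastive space."""
    def __init__(self, hidden_size: int = 768, proj_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(hidden_size, proj_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)  # unit-length projections

def info_nce_loss(anchors, positives, temperature: float = 0.07):
    """InfoNCE over a batch: anchors[i] should match positives[i], and every
    other positive in the batch serves as an in-batch negative."""
    logits = anchors @ positives.T / temperature           # (B, B) similarities
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for BERT token embeddings.
head = ProjectionHead()
anchor_emb = head(torch.randn(8, 768))    # e.g. heads of a given relation
positive_emb = head(torch.randn(8, 768))  # e.g. dependents sharing that relation
loss = info_nce_loss(anchor_emb, positive_emb)
print(float(loss))
```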

## Usage

This is a standard Hugging Face BERT model and can be used like any other BERT model:

```python
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer
model = AutoModel.from_pretrained("paulbontempo/bert-tagalog-dependency-cl")
tokenizer = AutoTokenizer.from_pretrained("paulbontempo/bert-tagalog-dependency-cl")

# Encode a sentence to get contextual embeddings
text = "Magandang umaga sa lahat"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get token embeddings
token_embeddings = outputs.last_hidden_state
```
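
If a single sentence-level vector is needed (for retrieval or clustering, for example), one common generic recipe, not something this card prescribes, is masked mean pooling over the token embeddings:

```python
# Continuing from the snippet above: average the token embeddings,
# ignoring padding positions via the attention mask. This is a generic
# pooling recipe, not a behavior this model was specifically trained for.
mask = inputs["attention_mask"].unsqueeze(-1).float()            # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```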

### For Downstream Tasks

Fine-tune on your Tagalog NLP task:

```python
from transformers import AutoModelForSequenceClassification

# For classification tasks
model = AutoModelForSequenceClassification.from_pretrained(
    "paulbontempo/bert-tagalog-dependency-cl",
    num_labels=3
)

# Train on your task
# ...
```
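
The training loop is left open above. As one possibility, the sketch below fine-tunes the model with the Hugging Face `Trainer`; the three-example dataset and the label values are placeholders for illustration, not data shipped with this model.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "paulbontempo/bert-tagalog-dependency-cl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Placeholder labeled data; replace with your own Tagalog examples.
raw = Dataset.from_dict({
    "text": ["Magandang balita ito.", "Hindi ako sigurado.", "Masamang balita ito."],
    "label": [0, 1, 2],
})
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=64)
)

args = TrainingArguments(output_dir="tl-classifier", num_train_epochs=1,
                         per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=tokenized)
trainer.train()
```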

## Training Details

### Stage 2: Contrastive Learning
- **Dataset:** 94 Tagalog sentences with dependency annotations
- **Positive samples:** ~600 true dependency triples
- **Negative samples:** ~10,000 artificially generated incorrect triples
- **Batch strategy:** Relation-aware batching for efficient positive-pair sampling (a sketch follows this list)
- **Optimizer:** AdamW (lr=3e-5, weight_decay=0.01)
- **Warmup steps:** 300
- **Training time:** ~30 minutes on an H100 GPU

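This card does not include the batching code. The function below is a rough sketch, under the assumption that triples sharing a dependency relation are grouped so each batch contains many positive pairs; all names are illustrative.

```python
import random
from collections import defaultdict

def relation_aware_batches(triples, batch_size=32, seed=0):
    """Group (head, relation, dependent) triples by relation so that each
    batch is dominated by one relation, making positive pairs cheap to form.
    Illustrative sketch only; the actual training code may differ."""
    rng = random.Random(seed)
    by_relation = defaultdict(list)
    for triple in triples:
        by_relation[triple[1]].append(triple)

    batches = []
    for group in by_relation.values():
        rng.shuffle(group)
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    rng.shuffle(batches)
    return batches

# Example with the triples from the earlier extraction sketch.
example = [("umaga", "amod", "Magandang"), ("lahat", "case", "sa"),
           ("umaga", "nmod", "lahat")]
print(relation_aware_batches(example, batch_size=2))
```
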
### Contrastive Learning Strategy
- **Positive pairs:** Triples with the same dependency relation, drawn from true parses
- **Negative pairs:** Artificially created, grammatically incorrect triples or triples with a different relation (see the sketch below)
- **Goal:** Cluster tokens by syntactic role to improve representation quality

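The corruption procedure is not specified in this card. The sketch below shows one straightforward way, assumed purely for illustration, to build negatives by swapping in a wrong dependent or a wrong relation label.

```python
import random

UD_RELATIONS = ["amod", "case", "nmod", "nsubj", "obj", "obl"]  # illustrative subset

def corrupt_triples(true_triples, num_negatives, seed=0):
    """Generate negative triples by replacing either the dependent or the
    relation of a true triple. Illustrative only; the actual negative
    sampling used for this model may differ."""
    rng = random.Random(seed)
    true_set = set(true_triples)
    dependents = [dep for _, _, dep in true_triples]
    negatives = []
    while len(negatives) < num_negatives:
        head, relation, dependent = rng.choice(true_triples)
        if rng.random() < 0.5:
            candidate = (head, relation, rng.choice(dependents))      # wrong dependent
        else:
            candidate = (head, rng.choice(UD_RELATIONS), dependent)   # wrong relation
        if candidate not in true_set:
            negatives.append(candidate)
    return negatives

true_triples = [("umaga", "amod", "Magandang"), ("lahat", "case", "sa"),
                ("umaga", "nmod", "lahat")]
print(corrupt_triples(true_triples, num_negatives=5))
```
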
## Evaluation

This model is designed as a pre-trained base for downstream Tagalog NLP tasks. The quality of embeddings can be evaluated through:
- Dependency parsing accuracy
- Named entity recognition
- Sentiment analysis
- Other token-level classification tasks

## Limitations

- Trained on only 94 sentences with dependency annotations (very small dataset)
- May not generalize to all Tagalog language varieties
- Best used as a starting point for further task-specific fine-tuning

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{bert-tagalog-dependency-cl,
  author       = {Paul Bontempo},
  title        = {Tagalog BERT with Dependency-Aware Contrastive Learning},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/paulbontempo/bert-tagalog-dependency-cl}}
}
```

## Acknowledgments

- Built on top of Stage 1 MLM training: [paulbontempo/bert-tagalog-mlm-stage1](https://huggingface.co/paulbontempo/bert-tagalog-mlm-stage1)
- Developed at the University of Colorado Boulder
- Part of neural-symbolic (NeSy) research for low-resource language processing