---
library_name: transformers
license: apache-2.0
base_model: cross-encoder/ms-marco-MiniLM-L6-v2
tags:
- generated_from_trainer
metrics:
- accuracy
- f1
model-index:
- name: ms-marco-MiniLM-L6-v2-finetuned-scidocs
results: []
language:
- en
pipeline_tag: text-ranking
---
# ms-marco-MiniLM-L6-v2-finetuned-scidocs

This model is a fine-tuned version of [cross-encoder/ms-marco-MiniLM-L6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2) on the GreenNode/scidocs-reranking-vn dataset. It achieves the following results on the evaluation set:
- Loss: 0.9168
- Accuracy: 0.9175
- F1: 0.7525
## Model description

This model is a Cross-Encoder for Text Ranking based on the cross-encoder/ms-marco-MiniLM-L6-v2 architecture. It has been fine-tuned to assess the semantic relevance between a query and a document, outputting a score that indicates how likely the document is to contain the answer or information relevant to the query.
Unlike Bi-Encoders (which map text to vector space), this Cross-Encoder processes the query and document simultaneously, allowing for deep semantic interaction and superior ranking performance, particularly for scientific and technical domains.
- Task: Semantic Search / Re-Ranking
- Base Model: cross-encoder/ms-marco-MiniLM-L6-v2
- Language: English (Scientific/Academic domain focus)
- Input: A pair of texts (Query, Document)
- Output: A single logit/score (high score = relevant)
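Below is a minimal inference sketch using the Transformers API. The repository id is a placeholder and the example query and documents are invented for illustration.

```python
# Minimal inference sketch. The repo id below is a placeholder; replace it
# with the actual model id once published.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "your-username/ms-marco-MiniLM-L6-v2-finetuned-scidocs"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

query = "graph neural networks for molecular property prediction"
documents = [
    "A message-passing neural network for predicting molecular properties.",
    "Optimistic concurrency control in distributed database systems.",
]

# The cross-encoder scores each (query, document) pair jointly.
features = tokenizer(
    [query] * len(documents), documents,
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    scores = model(**features).logits.squeeze(-1)  # one relevance logit per pair

for score, doc in sorted(zip(scores.tolist(), documents), reverse=True):
    print(f"{score:+.3f}  {doc}")
```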
## Intended uses & limitations

### Intended Uses

- Information Retrieval (RAG): Re-ranking the top-100 documents returned by a first-stage retriever (e.g., BM25 or a FAISS vector index) to improve the precision of the top-10 results (see the sketch after this list).
- Scientific Literature Search: Specifically optimized for matching research queries with relevant paper titles/abstracts.
- Question Answering: Filtering irrelevant context passages before feeding them to a generative LLM.
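As referenced above, the snippet below is a minimal retrieve-then-rerank sketch using the sentence-transformers `CrossEncoder` wrapper. The model id is a placeholder and the first-stage retriever is stubbed out.

```python
# Sketch of a two-stage pipeline: a first-stage retriever (stubbed here)
# returns candidate documents, and this cross-encoder re-ranks them.
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "your-username/ms-marco-MiniLM-L6-v2-finetuned-scidocs",  # placeholder id
    max_length=512,
)

def rerank(query, candidates, top_k=10):
    """Score every (query, candidate) pair and return the top_k candidates."""
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# `retrieve_top_100` stands in for BM25 / FAISS / any first-stage retriever.
# candidates = retrieve_top_100(query)
# for doc, score in rerank(query, candidates):
#     print(score, doc)
```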
### Limitations
- Domain Specificity: The model was trained primarily on scientific and academic text (SciDocs). It may exhibit "keyword bias" in general domains (e.g., associating technical terms like "Optimistic" solely with computer science rather than psychology).
- Latency: As a Cross-Encoder, it requires a full forward pass for every query-document pair. It is computationally expensive and not suitable for searching millions of documents directly. It should be used as the second stage in a Retrieve-then-Rerank pipeline.
- Max Sequence Length: Limited to 512 tokens. Documents longer than this will be truncated, potentially losing relevant information at the end.
## Training and evaluation data
The model was fine-tuned on the SciDocs / GreenNode Scidocs Reranking dataset.
- Data Structure: The dataset consists of scientific queries paired with "Positive" (cited/relevant) papers and "Negative" (irrelevant) papers.
- Class Imbalance: The training data was heavily imbalanced, containing approximately 19% Positive samples and 81% Negative samples.
- Preprocessing: The dataset was flattened from a list-based structure (1 query -> N documents) into individual training pairs (1 query -> 1 document).
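A minimal sketch of this flattening step, assuming the dataset exposes `query`, `positive`, and `negative` columns and a `train` split (the actual schema and split names may differ):

```python
# Sketch of flattening (1 query -> N documents) into (query, document, label)
# pairs. Column and split names are assumptions and may differ from the
# actual GreenNode/scidocs-reranking-vn schema.
from datasets import load_dataset

raw = load_dataset("GreenNode/scidocs-reranking-vn", split="train")

def flatten(batch):
    pairs = {"query": [], "document": [], "label": []}
    for query, positives, negatives in zip(
        batch["query"], batch["positive"], batch["negative"]
    ):
        for doc in positives:
            pairs["query"].append(query)
            pairs["document"].append(doc)
            pairs["label"].append(1.0)   # relevant
        for doc in negatives:
            pairs["query"].append(query)
            pairs["document"].append(doc)
            pairs["label"].append(0.0)   # irrelevant
    return pairs

# batched=True allows each original example to expand into many rows.
flat = raw.map(flatten, batched=True, remove_columns=raw.column_names)
```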
## Training procedure
The model was trained using the Hugging Face Trainer API with a custom weighted loss function to address class imbalance.
### Hyperparameters

- Learning Rate: 2e-5
- Batch Size: 16
- Epochs: 3
- Optimizer: AdamW
- Precision: FP16 (Mixed Precision)
- Max Length: 512 tokens
### Optimization Strategy
To prevent the model from collapsing into predicting "Not Relevant" (due to the 81% negative rate), we implemented a Weighted Binary Cross Entropy Loss.
- Loss Function: `BCEWithLogitsLoss` with `pos_weight=5.0`.
- This weighting (calculated based on dataset statistics) penalizes the model 5x more for missing a relevant document than for misclassifying an irrelevant one, encouraging high recall.
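A minimal sketch of such a weighted-loss setup, overriding `compute_loss` in a Trainer subclass; this illustrates the idea described above rather than reproducing the exact training script.

```python
# Sketch of a Trainer subclass with weighted BCE loss (pos_weight=5.0),
# consistent with the description above; not the exact training script.
import torch
from torch import nn
from transformers import Trainer

class WeightedBCETrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels").float()
        outputs = model(**inputs)
        logits = outputs.logits.squeeze(-1)  # single relevance logit per pair
        # Positive (relevant) pairs are penalized 5x more when missed.
        loss_fct = nn.BCEWithLogitsLoss(
            pos_weight=torch.tensor(5.0, device=logits.device)
        )
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss
```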
### Performance

- Accuracy: ~92%
- F1 Score: ~0.75
- Validation Loss: ~0.916

The model achieved peak performance at epoch 1 before overfitting.
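One way the accuracy and F1 values above can be computed for a single-logit cross-encoder is to threshold the logit at 0 (i.e., sigmoid > 0.5); the `compute_metrics` sketch below is an assumption about the evaluation code, not the exact implementation.

```python
# Sketch of a compute_metrics function for a single-logit cross-encoder.
# Thresholding the logit at 0 (sigmoid > 0.5) is an assumption about how
# accuracy/F1 were derived; the exact evaluation code may differ.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = (np.asarray(logits).squeeze(-1) > 0).astype(int)
    labels = np.asarray(labels).astype(int)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
    }
```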
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch_fused with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP
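For reference, a `TrainingArguments` sketch mirroring the values listed above (the output directory and evaluation/save cadence are placeholders/assumptions):

```python
# TrainingArguments mirroring the hyperparameters listed above.
# Output directory and eval/save cadence are placeholders/assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ms-marco-MiniLM-L6-v2-finetuned-scidocs",  # placeholder
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    optim="adamw_torch_fused",
    fp16=True,                  # Native AMP mixed precision
    seed=42,
    eval_strategy="epoch",      # assumption: per-epoch evaluation, matching the table below
    save_strategy="epoch",
)
```

These arguments would be passed to a Trainer (e.g., the weighted-loss subclass sketched earlier) together with the tokenized pairs and a `compute_metrics` function.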
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|---|
| 0.753 | 1.0 | 1286 | 0.9168 | 0.9175 | 0.7525 |
| 0.6341 | 2.0 | 2572 | 1.3727 | 0.9153 | 0.7441 |
| 0.4523 | 3.0 | 3858 | 1.2822 | 0.9131 | 0.7468 |
| 0.3555 | 4.0 | 5144 | 1.8419 | 0.9125 | 0.7449 |
| 0.2136 | 5.0 | 6430 | 2.2042 | 0.9119 | 0.7414 |
| 0.1645 | 6.0 | 7716 | 2.7126 | 0.9117 | 0.7384 |
| 0.0733 | 7.0 | 9002 | 3.0462 | 0.9120 | 0.7300 |
| 0.0754 | 8.0 | 10288 | 3.2909 | 0.9120 | 0.7322 |
| 0.0642 | 9.0 | 11574 | 3.1751 | 0.9128 | 0.7439 |
| 0.0425 | 10.0 | 12860 | 3.2852 | 0.9128 | 0.7422 |
### Framework versions

- Transformers 4.53.3
- PyTorch 2.6.0+cu124
- Datasets 4.4.1
- Tokenizers 0.21.2