---
library_name: transformers
license: apache-2.0
base_model: cross-encoder/ms-marco-MiniLM-L6-v2
tags:
- generated_from_trainer
metrics:
- accuracy
- f1
model-index:
- name: ms-marco-MiniLM-L6-v2-finetuned-scidocs
results: []
language:
- en
pipeline_tag: text-ranking
---
# ms-marco-MiniLM-L6-v2-finetuned-scidocs

This model is a fine-tuned version of [cross-encoder/ms-marco-MiniLM-L6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2) on the GreenNode/scidocs-reranking-vn dataset. It achieves the following results on the evaluation set:
- Loss: 0.9168
- Accuracy: 0.9175
- F1: 0.7525
## Model description

This model is a Cross-Encoder for Text Ranking based on the cross-encoder/ms-marco-MiniLM-L6-v2 architecture. It has been fine-tuned to assess the semantic relevance between a query and a document, outputting a score that indicates how likely the document is to contain the answer or information relevant to the query.
Unlike Bi-Encoders (which map text to vector space), this Cross-Encoder processes the query and document simultaneously, allowing for deep semantic interaction and superior ranking performance, particularly for scientific and technical domains.
- Task: Semantic Search / Re-Ranking
- Base Model: cross-encoder/ms-marco-MiniLM-L6-v2
- Language: English (Scientific/Academic domain focus)
- Input: A pair of texts (Query, Document)
- Output: A single logit/score (high score = relevant)
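Below is a minimal inference sketch using the Transformers API. The repository id is a placeholder and the example query and documents are invented for illustration.

```python
# Minimal inference sketch. The repo id below is a placeholder; replace it
# with the actual model id once published.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "your-username/ms-marco-MiniLM-L6-v2-finetuned-scidocs"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

query = "graph neural networks for molecular property prediction"
documents = [
    "A message-passing neural network for predicting molecular properties.",
    "Optimistic concurrency control in distributed database systems.",
]

# The cross-encoder scores each (query, document) pair jointly.
features = tokenizer(
    [query] * len(documents), documents,
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    scores = model(**features).logits.squeeze(-1)  # one relevance logit per pair

for score, doc in sorted(zip(scores.tolist(), documents), reverse=True):
    print(f"{score:+.3f}  {doc}")
```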
## Intended uses & limitations

### Intended Uses

- Information Retrieval (RAG): Re-ranking the top-100 documents returned by a first-stage retriever (e.g., BM25 or a FAISS vector index) to improve the precision of the top-10 results (see the sketch after this list).
- Scientific Literature Search: Specifically optimized for matching research queries with relevant paper titles/abstracts.
- Question Answering: Filtering irrelevant context passages before feeding them to a generative LLM.
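As referenced above, the snippet below is a minimal retrieve-then-rerank sketch using the sentence-transformers `CrossEncoder` wrapper. The model id is a placeholder and the first-stage retriever is stubbed out.

```python
# Sketch of a two-stage pipeline: a first-stage retriever (stubbed here)
# returns candidate documents, and this cross-encoder re-ranks them.
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "your-username/ms-marco-MiniLM-L6-v2-finetuned-scidocs",  # placeholder id
    max_length=512,
)

def rerank(query, candidates, top_k=10):
    """Score every (query, candidate) pair and return the top_k candidates."""
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# `retrieve_top_100` stands in for BM25 / FAISS / any first-stage retriever.
# candidates = retrieve_top_100(query)
# for doc, score in rerank(query, candidates):
#     print(score, doc)
```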
### Limitations
- Domain Specificity: The model was trained primarily on scientific and academic text (SciDocs). It may exhibit "keyword bias" in general domains (e.g., associating technical terms like "Optimistic" solely with computer science rather than psychology).
- Latency: As a Cross-Encoder, it requires a full forward pass for every query-document pair. It is computationally expensive and not suitable for searching millions of documents directly. It should be used as the second stage in a Retrieve-then-Rerank pipeline.
- Max Sequence Length: Limited to 512 tokens. Documents longer than this will be truncated, potentially losing relevant information at the end.
## Training and evaluation data
The model was fine-tuned on the SciDocs / GreenNode Scidocs Reranking dataset.
- Data Structure: The dataset consists of scientific queries paired with "Positive" (cited/relevant) papers and "Negative" (irrelevant) papers.
- Class Imbalance: The training data was heavily imbalanced, containing approximately 19% Positive samples and 81% Negative samples.
- Preprocessing: The dataset was flattened from a list-based structure (1 query -> N documents) into individual training pairs (1 query -> 1 document).
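A minimal sketch of this flattening step, assuming the dataset exposes `query`, `positive`, and `negative` columns and a `train` split (the actual schema and split names may differ):

```python
# Sketch of flattening (1 query -> N documents) into (query, document, label)
# pairs. Column and split names are assumptions and may differ from the
# actual GreenNode/scidocs-reranking-vn schema.
from datasets import load_dataset

raw = load_dataset("GreenNode/scidocs-reranking-vn", split="train")

def flatten(batch):
    pairs = {"query": [], "document": [], "label": []}
    for query, positives, negatives in zip(
        batch["query"], batch["positive"], batch["negative"]
    ):
        for doc in positives:
            pairs["query"].append(query)
            pairs["document"].append(doc)
            pairs["label"].append(1.0)   # relevant
        for doc in negatives:
            pairs["query"].append(query)
            pairs["document"].append(doc)
            pairs["label"].append(0.0)   # irrelevant
    return pairs

# batched=True allows each original example to expand into many rows.
flat = raw.map(flatten, batched=True, remove_columns=raw.column_names)
```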
## Training procedure
The model was trained using the Hugging Face Trainer API with a custom weighted loss function to address class imbalance.
### Hyperparameters

- Learning Rate: 2e-5
- Batch Size: 16
- Epochs: 3
- Optimizer: AdamW
- Precision: FP16 (Mixed Precision)
- Max Length: 512 tokens
### Optimization Strategy
To prevent the model from collapsing into predicting "Not Relevant" (due to the 81% negative rate), we implemented a Weighted Binary Cross Entropy Loss.
- Loss Function: `BCEWithLogitsLoss` with `pos_weight=5.0`.
- This weighting (calculated based on dataset statistics) penalizes the model 5x more for missing a relevant document than for misclassifying an irrelevant one, encouraging high recall.
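A minimal sketch of such a weighted-loss setup, overriding `compute_loss` in a Trainer subclass; this illustrates the idea described above rather than reproducing the exact training script.

```python
# Sketch of a Trainer subclass with weighted BCE loss (pos_weight=5.0),
# consistent with the description above; not the exact training script.
import torch
from torch import nn
from transformers import Trainer

class WeightedBCETrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels").float()
        outputs = model(**inputs)
        logits = outputs.logits.squeeze(-1)  # single relevance logit per pair
        # Positive (relevant) pairs are penalized 5x more when missed.
        loss_fct = nn.BCEWithLogitsLoss(
            pos_weight=torch.tensor(5.0, device=logits.device)
        )
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss
```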
### Performance

- Accuracy: ~92%
- F1 Score: ~0.75
- Validation Loss: ~0.916

The model achieved peak performance at epoch 1 before overfitting.
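One way the accuracy and F1 values above can be computed for a single-logit cross-encoder is to threshold the logit at 0 (i.e., sigmoid > 0.5); the `compute_metrics` sketch below is an assumption about the evaluation code, not the exact implementation.

```python
# Sketch of a compute_metrics function for a single-logit cross-encoder.
# Thresholding the logit at 0 (sigmoid > 0.5) is an assumption about how
# accuracy/F1 were derived; the exact evaluation code may differ.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = (np.asarray(logits).squeeze(-1) > 0).astype(int)
    labels = np.asarray(labels).astype(int)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
    }
```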
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch_fused with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP
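For reference, a `TrainingArguments` sketch mirroring the values listed above (the output directory and evaluation/save cadence are placeholders/assumptions):

```python
# TrainingArguments mirroring the hyperparameters listed above.
# Output directory and eval/save cadence are placeholders/assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ms-marco-MiniLM-L6-v2-finetuned-scidocs",  # placeholder
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    optim="adamw_torch_fused",
    fp16=True,                  # Native AMP mixed precision
    seed=42,
    eval_strategy="epoch",      # assumption: per-epoch evaluation, matching the table below
    save_strategy="epoch",
)
```

These arguments would be passed to a Trainer (e.g., the weighted-loss subclass sketched earlier) together with the tokenized pairs and a `compute_metrics` function.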
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|---|
| 0.753 | 1.0 | 1286 | 0.9168 | 0.9175 | 0.7525 |
| 0.6341 | 2.0 | 2572 | 1.3727 | 0.9153 | 0.7441 |
| 0.4523 | 3.0 | 3858 | 1.2822 | 0.9131 | 0.7468 |
| 0.3555 | 4.0 | 5144 | 1.8419 | 0.9125 | 0.7449 |
| 0.2136 | 5.0 | 6430 | 2.2042 | 0.9119 | 0.7414 |
| 0.1645 | 6.0 | 7716 | 2.7126 | 0.9117 | 0.7384 |
| 0.0733 | 7.0 | 9002 | 3.0462 | 0.9120 | 0.7300 |
| 0.0754 | 8.0 | 10288 | 3.2909 | 0.9120 | 0.7322 |
| 0.0642 | 9.0 | 11574 | 3.1751 | 0.9128 | 0.7439 |
| 0.0425 | 10.0 | 12860 | 3.2852 | 0.9128 | 0.7422 |
### Framework versions

- Transformers 4.53.3
- PyTorch 2.6.0+cu124
- Datasets 4.4.1
- Tokenizers 0.21.2