	---
	library_name: transformers
	license: apache-2.0
	base_model: cross-encoder/ms-marco-MiniLM-L6-v2
	tags:
	- generated_from_trainer
	metrics:
	- accuracy
	- f1
	model-index:
	- name: ms-marco-MiniLM-L6-v2-finetuned-scidocs
	  results: []
	language:
	- en
	pipeline_tag: text-ranking
	---

	# ms-marco-MiniLM-L6-v2-finetuned-scidocs

	This model is a fine-tuned version of [cross-encoder/ms-marco-MiniLM-L6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2), trained on the [GreenNode/scidocs-reranking-vn](https://huggingface.co/datasets/GreenNode/scidocs-reranking-vn) dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.9168
	- Accuracy: 0.9175
	- F1: 0.7525

	## Model description

	This model is a Cross-Encoder for text ranking built on the [cross-encoder/ms-marco-MiniLM-L6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2) architecture. It has been fine-tuned to assess the semantic relevance between a query and a document, and it outputs a score indicating how likely it is that the document answers the query or contains relevant information.

	Unlike Bi-Encoders, which embed the query and the document independently into a vector space, this Cross-Encoder processes the query and the document together in a single forward pass. Every query token can attend to every document token, which typically yields better ranking quality than comparing embeddings, particularly in scientific and technical domains.

	* Task: Semantic Search / Re-Ranking
	* Base Model: `cross-encoder/ms-marco-MiniLM-L6-v2`
	* Language: English (Scientific/Academic domain focus)
	* Input: A pair of texts `(Query, Document)`
	* Output: A single logit (higher score = more relevant)
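The input and output described above can be sketched with the `transformers` API. This is a minimal sketch, not the author's inference code: it assumes the checkpoint loads as a single-logit sequence classifier (as the base model does), and `MODEL_ID` is a stand-in because this card does not state the fine-tuned repository path. `logit_to_probability` and `score_pairs` are illustrative names.

```python
import math

# Stand-in: the card does not give this checkpoint's full repository path,
# so the base model's ID is used here. Replace it with the fine-tuned repo.
MODEL_ID = "cross-encoder/ms-marco-MiniLM-L6-v2"


def logit_to_probability(logit: float) -> float:
    """Map a raw relevance logit to (0, 1) with a numerically stable sigmoid."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def score_pairs(query: str, documents: list[str], model_id: str = MODEL_ID) -> list[float]:
    """Return one raw logit per (query, document) pair."""
    # Imported lazily so the helper above works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # The tokenizer joins each query and document into a single input sequence.
    batch = tokenizer(
        [query] * len(documents),
        documents,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**batch).logits  # expected shape: (num_documents, 1)
    return logits.squeeze(-1).tolist()
```

For ranking, the raw logits are enough. The sigmoid is only useful when a bounded score is wanted, and because training used a positive-class weight, the resulting value is best treated as a relevance score rather than a calibrated probability.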

	## Intended uses & limitations

	### Intended Uses
	* Information Retrieval (RAG): Re-ranking the top-100 documents returned by a first-stage retriever (e.g., lexical search with BM25 or vector search with FAISS) to improve the precision of the top-10 results.
	* Scientific Literature Search: Specifically optimized for matching research queries with relevant paper titles/abstracts.
	* Question Answering: Filtering irrelevant context passages before feeding them to a generative LLM.
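The re-ranking use case above can be sketched as a small function that takes first-stage candidates and a scoring callable. This is an illustrative sketch: `rerank` and `overlap_score` are hypothetical names, and the toy word-overlap scorer only stands in for the Cross-Encoder so the example runs without downloading a model.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, Sequence[str]], Sequence[float]],
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Score first-stage candidates and return the top_k by descending score."""
    scores = score_fn(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def overlap_score(query: str, documents: Sequence[str]) -> list[float]:
    """Toy scorer (hypothetical): counts query words found in each document."""
    query_words = set(query.lower().split())
    return [float(len(query_words & set(doc.lower().split()))) for doc in documents]


# Pretend these came back from BM25 or a vector index.
candidates = [
    "graph neural networks for citation recommendation",
    "a survey of protein folding methods",
    "citation recommendation with neural ranking models",
]
top = rerank("neural citation recommendation", candidates, overlap_score, top_k=2)
```

In a real pipeline, `score_fn` would wrap a batched forward pass of this model, and `candidates` would hold the roughly 100 documents returned by the first stage.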

	### Limitations
	* Domain Specificity: The model was trained primarily on scientific and academic text (SciDocs). It may exhibit "keyword bias" in general domains (e.g., associating technical terms like "Optimistic" solely with computer science rather than psychology).
	* Latency: As a Cross-Encoder, it requires a full forward pass for every query-document pair. It is computationally expensive and not suitable for searching millions of documents directly. It should be used as the second stage in a Retrieve-then-Rerank pipeline.
	* Max Sequence Length: Limited to 512 tokens. Documents longer than this will be truncated, potentially losing relevant information at the end.
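One common workaround for the truncation limit is to split a long document into overlapping windows, score each window, and keep the maximum. The sketch below is an assumption-laden illustration, not part of this model's pipeline: it counts whitespace-separated words as a rough proxy for tokens, the default window and stride values are arbitrary, and `split_into_windows`, `score_long_document`, and `needle_scorer` are hypothetical names.

```python
from typing import Callable, Sequence


def split_into_windows(words: list[str], window: int = 300, stride: int = 150) -> list[str]:
    """Split a word list into overlapping windows that together cover every word."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    if len(words) <= window:
        return [" ".join(words)]
    chunks = []
    start = 0
    while True:
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
        start += stride
    return chunks


def score_long_document(
    query: str,
    document: str,
    score_fn: Callable[[str, Sequence[str]], Sequence[float]],
    window: int = 300,
    stride: int = 150,
) -> float:
    """Score each window against the query and keep the best window's score."""
    chunks = split_into_windows(document.split(), window, stride)
    return max(score_fn(query, chunks))


def needle_scorer(query: str, chunks: Sequence[str]) -> list[float]:
    """Toy scorer (hypothetical): 1.0 if the query word occurs in the chunk."""
    return [1.0 if query in chunk.split() else 0.0 for chunk in chunks]


# The relevant word sits at the very end, where plain truncation would drop it.
document = " ".join(["filler"] * 699 + ["needle"])
full_score = score_long_document("needle", document, needle_scorer)
truncated_score = needle_scorer("needle", [" ".join(document.split()[:300])])[0]
```

Max-pooling treats a document as relevant if any window is relevant. The cost is one extra forward pass per window, which adds to the latency limitation listed above.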

	## Training and evaluation data

	The model was fine-tuned on the GreenNode SciDocs reranking dataset ([GreenNode/scidocs-reranking-vn](https://huggingface.co/datasets/GreenNode/scidocs-reranking-vn)).

	* Data Structure: The dataset consists of scientific queries paired with "Positive" (cited/relevant) papers and "Negative" (irrelevant) papers.
	* Class Imbalance: The training data was heavily imbalanced, containing approximately 19% Positive samples and 81% Negative samples.
	* Preprocessing: The dataset was flattened from a list-based structure (1 query -> N documents) into individual training pairs (1 query -> 1 document).
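The flattening step described above can be sketched as follows. The field names `query`, `positive`, and `negative` are assumptions made for illustration, since the card does not list the dataset's actual column names, and `flatten_reranking_rows` and `positive_rate` are hypothetical helpers.

```python
def flatten_reranking_rows(rows: list[dict]) -> list[dict]:
    """Expand each (1 query -> N documents) row into one labeled pair per document."""
    pairs = []
    for row in rows:
        for doc in row["positive"]:  # assumed field name
            pairs.append({"query": row["query"], "document": doc, "label": 1.0})
        for doc in row["negative"]:  # assumed field name
            pairs.append({"query": row["query"], "document": doc, "label": 0.0})
    return pairs


def positive_rate(pairs: list[dict]) -> float:
    """Fraction of pairs labeled as relevant."""
    return sum(pair["label"] for pair in pairs) / len(pairs)


# Made-up example row, not taken from the dataset.
rows = [
    {
        "query": "neural citation recommendation",
        "positive": ["paper A"],
        "negative": ["paper B", "paper C", "paper D", "paper E"],
    }
]
pairs = flatten_reranking_rows(rows)
rate = positive_rate(pairs)
```

Computing the positive rate on the flattened pairs is what surfaces the imbalance noted above (about 19% positive in the actual training data).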

	## Training procedure

	The model was trained using the Hugging Face `Trainer` API with a custom weighted loss function to address class imbalance.

	### Hyperparameters
	* Learning Rate: `3e-5` (the value logged by the Trainer)
	* Batch Size: 16
	* Epochs: 10 (best validation results at epoch 1)
	* Optimizer: AdamW
	* Precision: FP16 (Mixed Precision)
	* Max Length: 512 tokens

	### Optimization Strategy
	To keep the model from collapsing into always predicting "Not Relevant" (81% of the training pairs are negative), we used a weighted binary cross-entropy loss.
	* Loss Function: `BCEWithLogitsLoss` with `pos_weight=5.0`.
	* This weighting, derived from the class ratio in the training data, penalizes a missed relevant document 5x more than a false positive on an irrelevant one, which biases the model toward higher recall.
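The effect of `pos_weight` can be checked with a dependency-free re-implementation of the per-example loss. This is a sketch of the formula that PyTorch documents for `BCEWithLogitsLoss`, not the training code used for this model; `softplus`, `weighted_bce_with_logits`, and `balanced_pos_weight` are illustrative names.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit: float, label: float, pos_weight: float = 5.0) -> float:
    """Per-example loss: -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].

    Uses -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x).
    """
    return pos_weight * label * softplus(-logit) + (1.0 - label) * softplus(logit)


def balanced_pos_weight(positive_rate: float) -> float:
    """Weight that equalizes the total loss mass of positives and negatives."""
    return (1.0 - positive_rate) / positive_rate


# At an uninformative logit of 0, a positive example costs 5x a negative one.
loss_positive = weighted_bce_with_logits(0.0, 1.0)
loss_negative = weighted_bce_with_logits(0.0, 0.0)

# With about 19% positives, the exactly balancing weight is roughly 4.26.
ratio_weight = balanced_pos_weight(0.19)
```

With PyTorch, the equivalent loss would be constructed as `torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([5.0]))`. The exactly balancing weight for a 19%/81% split is about 4.26, so 5.0 sits slightly above it and leans a little further toward recall.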

	### Performance
	* Accuracy: ~92%
	* F1 Score: ~0.75
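	* Validation Loss: ~0.917

	The model achieved its best validation results at epoch 1. In later epochs the validation loss rose while the training loss kept falling, which indicates overfitting.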
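The gap between accuracy and F1 follows from the class imbalance, and a short calculation makes it concrete. The sketch below assumes F1 is computed on the positive ("relevant") class, which the card does not state explicitly. The confusion counts are made up to match a 19% positive rate and are not the model's actual evaluation counts.

```python
def accuracy_and_f1(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Accuracy and positive-class F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return accuracy, 0.0
    return accuracy, 2.0 * precision * recall / (precision + recall)


# A degenerate classifier that always answers "not relevant" on 1,000 pairs
# with 190 positives: high accuracy, but zero F1.
baseline_accuracy, baseline_f1 = accuracy_and_f1(tp=0, fp=0, fn=190, tn=810)
```

An always-negative baseline already reaches 81% accuracy on data with this class split, so F1 is the more informative of the two reported metrics.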

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 3e-05
	- train_batch_size: 16
	- eval_batch_size: 16
	- seed: 42
	- optimizer: AdamW (`OptimizerNames.ADAMW_TORCH_FUSED`) with betas=(0.9, 0.999) and epsilon=1e-08, no additional optimizer arguments
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 10
	- mixed_precision_training: Native AMP
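The linear scheduler with a 0.1 warmup ratio can be sketched as a plain function of the step number. This mirrors the usual linear-warmup, linear-decay rule. The total of 12,860 steps is taken from the final row of the results table, and rounding the warmup length to a whole number of steps is an assumption of this sketch. `linear_warmup_decay_lr` is an illustrative name.

```python
def linear_warmup_decay_lr(
    step: int,
    total_steps: int = 12860,
    peak_lr: float = 3e-5,
    warmup_ratio: float = 0.1,
) -> float:
    """Learning rate at a given optimizer step under linear warmup then linear decay."""
    warmup_steps = round(total_steps * warmup_ratio)  # 1286 with the defaults
    if step < warmup_steps:
        # Ramp linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


lr_start = linear_warmup_decay_lr(0)
lr_mid_warmup = linear_warmup_decay_lr(643)
lr_peak = linear_warmup_decay_lr(1286)
lr_end = linear_warmup_decay_lr(12860)
```

Under these assumptions the warmup spans 1,286 steps, which the results table shows is one full epoch, so the learning rate reaches its peak just as the best checkpoint is recorded.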

	### Training results

	| Training Loss | Epoch | Step  | Validation Loss | Accuracy | F1     |
	|:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|
	| 0.753         | 1.0   | 1286  | 0.9168          | 0.9175   | 0.7525 |
	| 0.6341        | 2.0   | 2572  | 1.3727          | 0.9153   | 0.7441 |
	| 0.4523        | 3.0   | 3858  | 1.2822          | 0.9131   | 0.7468 |
	| 0.3555        | 4.0   | 5144  | 1.8419          | 0.9125   | 0.7449 |
	| 0.2136        | 5.0   | 6430  | 2.2042          | 0.9119   | 0.7414 |
	| 0.1645        | 6.0   | 7716  | 2.7126          | 0.9117   | 0.7384 |
	| 0.0733        | 7.0   | 9002  | 3.0462          | 0.9120   | 0.7300 |
	| 0.0754        | 8.0   | 10288 | 3.2909          | 0.9120   | 0.7322 |
	| 0.0642        | 9.0   | 11574 | 3.1751          | 0.9128   | 0.7439 |
	| 0.0425        | 10.0  | 12860 | 3.2852          | 0.9128   | 0.7422 |
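The table can be used to pick a checkpoint and to see where early stopping would have ended the run. The sketch below copies the validation columns from the table. The patience value is an arbitrary choice for illustration, and the card does not say whether early stopping or best-checkpoint loading was actually used. `EpochResult`, `best_epoch_by_f1`, and `early_stop_epoch` are hypothetical names.

```python
from typing import NamedTuple


class EpochResult(NamedTuple):
    epoch: int
    val_loss: float
    accuracy: float
    f1: float


# Validation columns copied from the training results table.
RESULTS = [
    EpochResult(1, 0.9168, 0.9175, 0.7525),
    EpochResult(2, 1.3727, 0.9153, 0.7441),
    EpochResult(3, 1.2822, 0.9131, 0.7468),
    EpochResult(4, 1.8419, 0.9125, 0.7449),
    EpochResult(5, 2.2042, 0.9119, 0.7414),
    EpochResult(6, 2.7126, 0.9117, 0.7384),
    EpochResult(7, 3.0462, 0.9120, 0.7300),
    EpochResult(8, 3.2909, 0.9120, 0.7322),
    EpochResult(9, 3.1751, 0.9128, 0.7439),
    EpochResult(10, 3.2852, 0.9128, 0.7422),
]


def best_epoch_by_f1(results: list[EpochResult]) -> int:
    """Epoch whose checkpoint has the highest validation F1."""
    return max(results, key=lambda r: r.f1).epoch


def early_stop_epoch(results: list[EpochResult], patience: int = 2) -> int:
    """Epoch at which training would stop after `patience` epochs without F1 gain."""
    best_f1 = float("-inf")
    epochs_without_gain = 0
    for result in results:
        if result.f1 > best_f1:
            best_f1 = result.f1
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return result.epoch
    return results[-1].epoch


best = best_epoch_by_f1(RESULTS)
stop = early_stop_epoch(RESULTS, patience=2)
```

With the `Trainer` API, comparable behavior is available through `load_best_model_at_end=True` with `metric_for_best_model="f1"` in `TrainingArguments`, optionally combined with `EarlyStoppingCallback`.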


	### Framework versions

	- Transformers 4.53.3
	- PyTorch 2.6.0+cu124
	- Datasets 4.4.1
	- Tokenizers 0.21.2