SPLADE-PT-BR
A SPLADE (SParse Lexical AnD Expansion) model fine-tuned for Portuguese text retrieval, based on BERTimbau and trained on the Portuguese mMARCO passage-ranking dataset.
GitHub Repository: https://github.com/AxelPCG/SPLADE-PT-BR
Model Description
SPLADE is a neural retrieval model that learns to expand queries and documents with contextually relevant terms while maintaining sparsity. Unlike dense retrievers, SPLADE produces sparse vectors (typically ~99% sparse) that are:
- Interpretable: Each dimension corresponds to a vocabulary token
- Efficient: Can use inverted indexes for fast retrieval
- Effective: Combines lexical matching with semantic expansion
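To make the expansion mechanism concrete, here is a minimal sketch (not this repository's implementation) of how SPLADE-style vectors are typically derived from a masked-language-model head: the MLM logits are passed through log(1 + ReLU(·)) and max-pooled over the sequence, giving one non-negative weight per vocabulary token. The base BERTimbau checkpoint is used purely for illustration; the fine-tuned weights live in AxelPCG/splade-pt-br.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustration only: the base BERTimbau MLM head shows the pooling recipe,
# it is not the fine-tuned SPLADE-PT-BR model.
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
mlm = AutoModelForMaskedLM.from_pretrained("neuralmind/bert-base-portuguese-cased")
mlm.eval()

text = "Qual é a capital do Brasil?"
batch = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)

with torch.no_grad():
    logits = mlm(**batch).logits                       # (1, seq_len, vocab_size)
    # log-saturation keeps weights positive and compresses large activations
    weights = torch.log1p(torch.relu(logits))
    # zero out padding positions before pooling
    mask = batch["attention_mask"].unsqueeze(-1)
    # max-pool over the sequence -> one weight per vocabulary token
    rep = torch.max(weights * mask, dim=1).values.squeeze(0)   # (vocab_size,)

print("Active dimensions:", int((rep > 0).sum().item()), "of", rep.shape[0])
```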
Key Features
- Base Model: `neuralmind/bert-base-portuguese-cased` (BERTimbau)
- Vocabulary Size: 29,794 tokens (Portuguese-optimized)
- Training Iterations: 150,000
- Final Training Loss: 0.000047
- Sparsity: ~99.5% (100-150 active dimensions per vector)
- Max Sequence Length: 256 tokens
Training Details
Training Data
- Training Dataset: mMARCO Portuguese (`unicamp-dl/mmarco`)
- Validation Dataset: mRobust (`unicamp-dl/mrobust`)
- Format: triplets (query, positive document, negative document); see the loss sketch below
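A hedged sketch of how such triplets commonly drive training (the actual training script may differ): the query vector is scored against the positive and the negative document by dot product, and a two-way cross-entropy ranking loss pushes the positive score above the negative one.

```python
import torch
import torch.nn.functional as F

def ranking_loss(q_rep, pos_rep, neg_rep):
    """Contrastive ranking loss over (query, positive, negative) triplets.

    q_rep, pos_rep, neg_rep: (batch, vocab_size) sparse SPLADE vectors.
    """
    pos_scores = (q_rep * pos_rep).sum(dim=-1)               # (batch,)
    neg_scores = (q_rep * neg_rep).sum(dim=-1)                # (batch,)
    logits = torch.stack([pos_scores, neg_scores], dim=1)     # (batch, 2)
    targets = torch.zeros(q_rep.size(0), dtype=torch.long)    # positive is class 0
    return F.cross_entropy(logits, targets)
```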
Training Configuration
- Learning Rate: 2e-5
- Batch Size: 8 (effective 32 with gradient accumulation)
- Gradient Accumulation Steps: 4
- Weight Decay: 0.01
- Warmup Steps: 6,000
- Mixed Precision: FP16
- Optimizer: AdamW
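The configuration above could be wired up roughly as follows. This is a sketch under the stated settings, not the project's actual training loop; `model`, `train_loader`, and `loss_fn` are placeholders.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def train(model, train_loader, loss_fn, total_steps=150_000):
    # AdamW at 2e-5 with 0.01 weight decay, as listed above
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    # linear warmup over the first 6,000 optimizer steps
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=6_000, num_training_steps=total_steps
    )
    scaler = torch.cuda.amp.GradScaler()   # FP16 mixed precision
    accum_steps = 4                        # batch 8 per step -> effective batch of 32

    for step, batch in enumerate(train_loader):
        with torch.cuda.amp.autocast():
            loss = loss_fn(model, batch) / accum_steps
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()
```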
Regularization
FLOPS regularization is applied to enforce sparsity (see the sketch below):
- Lambda Query: 0.0003 (queries are kept sparser)
- Lambda Document: 0.0001 (documents are kept less sparse for better recall)
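For reference, the FLOPS regularizer from the SPLADE paper penalizes the squared mean activation of every vocabulary dimension across the batch; a minimal sketch (the training code may differ in detail):

```python
import torch

def flops_loss(reps):
    """FLOPS regularizer: sum over vocab dimensions of the squared mean activation.

    reps: (batch, vocab_size) non-negative SPLADE representations.
    """
    return torch.sum(torch.mean(torch.abs(reps), dim=0) ** 2)

# applied with separate weights for queries and documents, e.g.
# total_reg = 0.0003 * flops_loss(q_reps) + 0.0001 * flops_loss(d_reps)
```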
Performance
Dataset: mRobust (528k docs, 250 queries)
| Metric | Score |
|---|---|
| MRR@10 | 0.453 |
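MRR@10 is the mean, over all queries, of the reciprocal rank of the first relevant document within the top 10 results (0 when none appears). A small illustrative helper:

```python
def mrr_at_10(ranked_lists, relevant_sets):
    """ranked_lists: one ranked list of doc ids per query.
    relevant_sets: one set of relevant doc ids per query."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranking[:10], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```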
Usage
Installation
```bash
pip install torch transformers
```
Basic Usage
Option 1: Using HuggingFace Hub (Recommended)
```python
import torch
from transformers import AutoTokenizer
from modeling_splade import Splade

# Load model and tokenizer
model = Splade.from_pretrained("AxelPCG/splade-pt-br")
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
model.eval()

# Encode a query
query = "Qual é a capital do Brasil?"
with torch.no_grad():
    query_tokens = tokenizer(query, return_tensors="pt", max_length=256, truncation=True)
    query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze()

# Encode a document
document = "Brasília é a capital federal do Brasil desde 1960."
with torch.no_grad():
    doc_tokens = tokenizer(document, return_tensors="pt", max_length=256, truncation=True)
    doc_vec = model(d_kwargs=doc_tokens)["d_rep"].squeeze()

# Calculate similarity (dot product)
similarity = torch.dot(query_vec, doc_vec).item()
print(f"Similarity: {similarity:.4f}")

# Get the sparse representation
indices = torch.nonzero(query_vec).squeeze().tolist()
values = query_vec[indices].tolist()
print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}")
```
Option 2: Using SPLADE Library
```python
from splade.models.transformer_rep import Splade
from transformers import AutoTokenizer

# Load the model by pointing the SPLADE library to the HuggingFace repo
model = Splade(model_type_or_dir="AxelPCG/splade-pt-br", agg="max", fp16=True)
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
```
Limitations and Bias
- Model trained on machine-translated Portuguese data (mMARCO)
- May not capture all socio-cultural aspects of native Brazilian Portuguese
- Performance may vary on domain-specific tasks
- Inherits biases from BERTimbau base model and training data
Citation
```bibtex
@misc{splade-pt-br-2025,
  author = {Axel Chepanski},
  title = {SPLADE-PT-BR: Sparse Retrieval for Portuguese},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/AxelPCG/splade-pt-br}
}
```
Acknowledgments
- SPLADE by NAVER Labs and the leobavila/splade fork
- BERTimbau by Neuralmind
- mMARCO & mRobust Portuguese by UNICAMP-DL
- Quati dataset research, an inspiration for native Portuguese IR
License
Apache 2.0