SPLADE-PT-BR

SPLADE (SParse Lexical AnD Expansion) model fine-tuned for Portuguese text retrieval. Based on BERTimbau and trained on Portuguese passage-ranking data (mMARCO).

GitHub Repository: https://github.com/AxelPCG/SPLADE-PT-BR

Model Description

SPLADE is a neural retrieval model that learns to expand queries and documents with contextually relevant terms while maintaining sparsity. Unlike dense retrievers, SPLADE produces sparse vectors (typically ~99% sparse) that are:

  • Interpretable: Each dimension corresponds to a vocabulary token
  • Efficient: Can use inverted indexes for fast retrieval
  • Effective: Combines lexical matching with semantic expansion
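As a rough illustration of how such a vector is produced (a hedged sketch, not this repository's exact code): the masked-language-model head scores every vocabulary token at each input position, the scores pass through log(1 + ReLU(·)), and a max-pool over positions yields one vocabulary-sized, mostly-zero vector. The base BERTimbau checkpoint is used here only to show the mechanism; the fine-tuned weights are what make the expansion useful for retrieval.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Sketch of SPLADE-style pooling on top of the BERTimbau MLM head
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
mlm = AutoModelForMaskedLM.from_pretrained("neuralmind/bert-base-portuguese-cased")

tokens = tokenizer("Qual é a capital do Brasil?", return_tensors="pt")
with torch.no_grad():
    logits = mlm(**tokens).logits                       # (1, seq_len, vocab_size)
weights = torch.log1p(torch.relu(logits))               # non-negative, encourages sparsity
mask = tokens["attention_mask"].unsqueeze(-1)           # zero out padding positions
sparse_vec = (weights * mask).max(dim=1).values.squeeze(0)  # max-pool over positions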

Key Features

  • Base Model: neuralmind/bert-base-portuguese-cased (BERTimbau)
  • Vocabulary Size: 29,794 tokens (Portuguese-optimized)
  • Training Iterations: 150,000
  • Final Training Loss: 0.000047
  • Sparsity: ~99.5% (100-150 active dimensions per vector)
  • Max Sequence Length: 256 tokens

Training Details

Training Data

  • Training Dataset: mMARCO Portuguese (unicamp-dl/mmarco)
  • Validation Dataset: mRobust (unicamp-dl/mrobust)
  • Format: Triplets (query, positive document, negative document)
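For illustration only (these strings are made up, not actual mMARCO entries), a training triplet pairs a query with one relevant and one non-relevant passage:

triplet = (
    "Qual é a capital do Brasil?",                          # query
    "Brasília é a capital federal do Brasil desde 1960.",   # positive (relevant) document
    "O Rio de Janeiro é famoso por suas praias.",           # negative (non-relevant) document
)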

Training Configuration

  • Learning Rate: 2e-5
  • Batch Size: 8 (effective batch size of 32 with gradient accumulation)
  • Gradient Accumulation Steps: 4
  • Weight Decay: 0.01
  • Warmup Steps: 6,000
  • Mixed Precision: FP16
  • Optimizer: AdamW

Regularization

FLOPS regularization is applied to enforce sparsity:

  • Lambda Query: 0.0003 (stronger penalty, so query vectors are sparser)
  • Lambda Document: 0.0001 (weaker penalty, so document vectors keep more terms for better recall)
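A minimal sketch of the FLOPS regularizer as described in the SPLADE papers (an illustration, not this repository's training code): for each vocabulary dimension, take the mean activation over the batch, square it, and sum; the two lambdas weight the query and document terms.

import torch

def flops_loss(batch_rep: torch.Tensor) -> torch.Tensor:
    # batch_rep: (batch_size, vocab_size) non-negative SPLADE activations
    return torch.sum(torch.mean(batch_rep, dim=0) ** 2)

# Hypothetical combination with the ranking loss during training:
# total_loss = ranking_loss + 0.0003 * flops_loss(query_reps) + 0.0001 * flops_loss(doc_reps)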

Performance

Dataset: mRobust (528k docs, 250 queries)

  • MRR@10: 0.453
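For reference, MRR@10 is the mean over queries of the reciprocal rank of the first relevant document within the top 10 results (0 when none appears). A minimal sketch with illustrative names:

def mrr_at_10(ranked_ids_per_query, relevant_ids_per_query):
    # ranked_ids_per_query: list of ranked document-id lists, one per query
    # relevant_ids_per_query: list of sets of relevant document ids, one per query
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked[:10], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids_per_query)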

Usage

Installation

pip install torch transformers

Basic Usage

Option 1: Using HuggingFace Hub (Recommended)

import torch
from transformers import AutoTokenizer
from modeling_splade import Splade

# Load model and tokenizer
model = Splade.from_pretrained("AxelPCG/splade-pt-br")
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
model.eval()

# Encode a query
query = "Qual é a capital do Brasil?"
with torch.no_grad():
    query_tokens = tokenizer(query, return_tensors="pt", max_length=256, truncation=True)
    query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze()

# Encode a document
document = "Brasília é a capital federal do Brasil desde 1960."
with torch.no_grad():
    doc_tokens = tokenizer(document, return_tensors="pt", max_length=256, truncation=True)
    doc_vec = model(d_kwargs=doc_tokens)["d_rep"].squeeze()

# Calculate similarity (dot product)
similarity = torch.dot(query_vec, doc_vec).item()
print(f"Similarity: {similarity:.4f}")

# Get sparse representation
indices = torch.nonzero(query_vec).squeeze().tolist()
values = query_vec[indices].tolist()
print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}")

Option 2: Using SPLADE Library

from splade.models.transformer_rep import Splade
from transformers import AutoTokenizer

# Load model by pointing to HuggingFace repo
model = Splade(model_type_or_dir="AxelPCG/splade-pt-br", agg="max", fp16=True)
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
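Encoding then follows the same forward interface as in Option 1 (a hedged sketch, assuming the naver SPLADE library's q_kwargs/d_kwargs keyword arguments and its "q_rep"/"d_rep" outputs):

import torch

model.eval()
with torch.no_grad():
    tokens = tokenizer("Qual é a capital do Brasil?", return_tensors="pt",
                       max_length=256, truncation=True)
    query_vec = model(q_kwargs=tokens)["q_rep"].squeeze()  # sparse, vocabulary-sized vector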

Limitations and Bias

  • Model trained on machine-translated Portuguese data (mMARCO)
  • May not capture all socio-cultural aspects of native Brazilian Portuguese
  • Performance may vary on domain-specific tasks
  • Inherits biases from BERTimbau base model and training data

Citation

@misc{splade-pt-br-2025,
  author = {Axel Chepanski},
  title = {SPLADE-PT-BR: Sparse Retrieval for Portuguese},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/AxelPCG/splade-pt-br}
}

Acknowledgments

  • SPLADE by NAVER Labs and the leobavila/splade fork
  • BERTimbau by Neuralmind
  • mMARCO & mRobust Portuguese by UNICAMP-DL
  • Quati dataset research, an inspiration for native Portuguese IR

License

Apache 2.0
