---
license: apache-2.0
datasets:
- TIGER-Lab/MMEB-train
language:
- en
metrics:
- precision
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: sentence-similarity
library_name: transformers
tags:
- Qwen2-VL
- qwen2-vl
- MMEB
---

SaHa Logo

# SaHa-Qwen2-VL-7B-Instruct

## Model Summary

**SaHa-Qwen2-VL-7B-Instruct** is a state-of-the-art universal multimodal embedding model based on the **Qwen2-VL-7B-Instruct** architecture. The model has been fine-tuned using our Self-aware Hard Negative Sampling (SaHa) strategy, which is designed to efficiently adapt generative Multimodal Large Language Models (MLLMs) to discriminative embedding tasks.

Our approach leverages a hierarchical embedding prompt to unlock the strong zero-shot capabilities of MLLMs and then fine-tunes the model with SaHa to achieve superior performance on universal multimodal retrieval benchmarks. This approach significantly reduces the computational costs associated with traditional contrastive pre-training while delivering state-of-the-art results.

For more details, please refer to our paper, [From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model](https://arxiv.org/abs/2508.00955), and our [GitHub repository](https://github.com/yeongjoonJu/Gen2Embed).

## How to Use

You can easily use this model with the `transformers` library for sentence and image similarity tasks. Make sure you have recent versions of `transformers`, `torch`, and `Pillow` installed. The snippets below also request the `flash_attention_2` attention implementation, which requires the `flash-attn` package.

```bash
pip install "transformers>=4.46.1" torch pillow
```

### Get Embeddings from Text or Image

Here's how to get embeddings for text or image inputs. The model uses a specific prompt structure to generate high-quality embeddings.

**Load Model**

```python
import torch
from transformers import AutoProcessor, AutoConfig, Qwen2VLForConditionalGeneration

# Load the model and processor
model_id = "Y-J-Ju/SaHa-Qwen2-VL-7B-Instruct"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config._attn_implementation = "flash_attention_2"
config.vision_config._attn_implementation = "flash_attention_2"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    config=config,
    device_map="cuda:0"
)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28
)
```

**Data Preparation and Prompting**

```python
texts = [
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
    "Korea University",
]
images = [
    'https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg',
    'https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Korea_University.jpg/960px-Korea_University.jpg',
]

task_instruction = 'Find an image that matches the given text.'
system_prompt = "Given an image, summarize the provided image in one word. Given only text, describe the text in one word."
represent_prompt = "Represent the given text in one word."
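
# Note: the templates below follow Qwen2-VL's ChatML chat format and leave the
# assistant turn open, so the last hidden state of the final prompt token can be
# used as the embedding (see "Get Embeddings" below). Queries carry the task
# instruction plus the "represent ... in one word" cue; candidates contain only
# their content (the `<|image_pad|>` placeholder is expanded into image tokens
# by the processor).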
query_form = '<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{task_instruction}\n{query}\n{represent_prompt}<|im_end|>\n<|im_start|>assistant\n'
candidate_form = '<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{cand}<|im_end|>\n<|im_start|>assistant\n'

queries = [
    query_form.format(
        system_prompt=system_prompt,
        task_instruction=task_instruction,
        query=text,
        represent_prompt=represent_prompt,
    )
    for text in texts
]
candidates = [
    candidate_form.format(system_prompt=system_prompt, cand='<|image_pad|>')
    for _ in images
]
```

**Get Embeddings**

```python
from PIL import Image
import io
from urllib import request
import torch.nn.functional as F

## Query (Text)
inputs = processor(text=queries, images=None, return_tensors="pt", padding=True)
model_input = {k: v if isinstance(v, list) else v.to(model.device) for k, v in inputs.items()}
outputs = model(**model_input, return_dict=True, output_hidden_states=True)
hidden_states = outputs.hidden_states[-1]
query_embed = hidden_states[:, -1]

## Candidate (Image)
pil_images = [Image.open(io.BytesIO(request.urlopen(url).read())) for url in images]
inputs = processor(text=candidates, images=pil_images, return_tensors="pt", padding=True)
model_input = {k: v if isinstance(v, list) else v.to(model.device) for k, v in inputs.items()}
outputs = model(**model_input, return_dict=True, output_hidden_states=True)
cand_embed = outputs.hidden_states[-1][:, -1]

query_embed = F.normalize(query_embed, p=2, dim=-1)
cand_embed = F.normalize(cand_embed, p=2, dim=-1)

print(query_embed @ cand_embed.T)
```

**Outputs (Similarity)**

```python
tensor([[ 0.3848, -0.0197],
        [-0.0221,  0.2949]], device='cuda:0', dtype=torch.bfloat16)
```

## Training and Evaluation

### Training Data

The model was fine-tuned on the **Massive Multimodal Embedding Benchmark (MMEB)** training set, which consists of approximately 829,000 pairs from 20 in-domain datasets.

* **Training Data:** [TIGER-Lab/MMEB-train](https://huggingface.co/datasets/TIGER-Lab/MMEB-train)

### Evaluation Data

The model's performance was evaluated on the MMEB evaluation set, which includes 36 datasets covering four meta-tasks: Classification, Visual Question Answering (VQA), Retrieval, and Visual Grounding.

* **Evaluation Data:** [TIGER-Lab/MMEB-eval](https://huggingface.co/datasets/TIGER-Lab/MMEB-eval)

### Performance

SaHa-Qwen2-VL-7B-Instruct achieves state-of-the-art performance in its parameter class on the MMEB benchmark, outperforming methods that rely on large-scale contrastive pre-training.

| Model | Params | Classification | Retrieval | VQA | Grounding | **IND** | **OOD** | **Overall Avg.** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ours (SaHa-Qwen2-VL-7B) | 8.3B | 69.1 | 74.1 | 67.3 | 88.1 | **76.4** | **67.4** | **72.4** |

## Citation

If you find this model useful in your research, please cite our paper:

```bibtex
@misc{ju2025generatorembedder,
      title={From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model},
      author={Yeong-Joon Ju and Seong-Whan Lee},
      year={2025},
      eprint={2508.00955},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.00955},
}
```
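
## Batched Retrieval (Sketch)

To rank more than a handful of candidates, queries and candidates can be encoded in mini-batches and compared by cosine similarity. The snippet below is a minimal, unofficial sketch built on the usage code in "How to Use": it assumes `model`, `processor`, `queries`, `candidates`, and `pil_images` from that section are already defined, and the helper name `embed_prompts` and the batch size are illustrative choices rather than part of the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed_prompts(prompts, pil_imgs=None, batch_size=8):
    """Encode formatted prompts (optionally paired with PIL images) into
    L2-normalized embeddings taken from the final token's last hidden state."""
    embeds = []
    for i in range(0, len(prompts), batch_size):
        batch_imgs = pil_imgs[i:i + batch_size] if pil_imgs is not None else None
        inputs = processor(text=prompts[i:i + batch_size], images=batch_imgs,
                           return_tensors="pt", padding=True)
        model_input = {k: v if isinstance(v, list) else v.to(model.device)
                       for k, v in inputs.items()}
        hidden = model(**model_input, return_dict=True,
                       output_hidden_states=True).hidden_states[-1]
        embeds.append(F.normalize(hidden[:, -1], p=2, dim=-1))
    return torch.cat(embeds, dim=0)

# Encode queries and image candidates, then rank candidates per query.
query_embed = embed_prompts(queries)
cand_embed = embed_prompts(candidates, pil_images)
scores = query_embed @ cand_embed.T   # cosine similarities (embeddings are normalized)
print(scores.argmax(dim=1))           # index of the best-matching candidate per query
```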