---
license: apache-2.0
datasets:
- TIGER-Lab/MMEB-train
language:
- en
metrics:
- precision
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: sentence-similarity
library_name: transformers
tags:
- Qwen2-VL
- qwen2-vl
- MMEB
---
# SaHa-Qwen2-VL-7B-Instruct
## Model Summary
**SaHa-Qwen2-VL-7B-Instruct** is a state-of-the-art universal multimodal embedding model based on the **Qwen2-VL-7B-Instruct** architecture. This model has been fine-tuned using our innovative Self-aware Hard Negative Sampling (SaHa) strategy, which is designed to efficiently adapt generative Multimodal Large Language Models (MLLMs) for discriminative embedding tasks.
Our approach leverages a hierarchical embedding prompt to unlock the powerful zero-shot capabilities of MLLMs and then fine-tunes the model with SaHa to achieve superior performance on universal multimodal retrieval benchmarks. This model significantly reduces the computational costs associated with traditional contrastive pre-training while delivering state-of-the-art results.
For more details, please refer to our paper: [From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model](https://arxiv.org/abs/2508.00955) and our [GitHub repository](https://github.com/yeongjoonJu/Gen2Embed).
## How to Use
You can use this model with the `transformers` library for sentence and image similarity tasks. Make sure you have `transformers` (>= 4.46.1), `torch`, and `Pillow` installed. The loading example below requests FlashAttention 2, which additionally requires the `flash-attn` package; a fallback without it is shown after the snippet.
```bash
pip install "transformers>=4.46.1" torch pillow
```
### Get Embeddings from Text or Image
Here's how to get embeddings for text or image inputs. The model uses a specific prompt structure to generate high-quality embeddings.
**Load Model**
```python
import torch
from transformers import AutoProcessor, AutoConfig, Qwen2VLForConditionalGeneration
# Load the model and processor
model_id = "Y-J-Ju/SaHa-Qwen2-VL-7B-Instruct"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
# FlashAttention 2 requires the `flash-attn` package (see the fallback snippet below if it is unavailable).
config._attn_implementation = "flash_attention_2"
config.vision_config._attn_implementation = "flash_attention_2"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, config=config, device_map="cuda:0"
)
# min_pixels / max_pixels bound the image resolution, and hence the number of visual tokens per image.
processor = AutoProcessor.from_pretrained(
    model_id, trust_remote_code=True,
    min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28
)
```
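If `flash-attn` is not installed, a minimal fallback (not the configuration used by the authors) is to load the model without overriding the attention implementation, letting `transformers` pick its default (SDPA or eager) attention:
```python
# Fallback without flash-attn: keep the default attention implementation.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0"
)
```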
**Data Preparation and Prompting**
```python
texts = [
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
    "Korea University",
]
images = [
    'https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg',
    'https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Korea_University.jpg/960px-Korea_University.jpg',
]

task_instruction = 'Find an image that matches the given text.'
system_prompt = "Given an image, summarize the provided image in one word. Given only text, describe the text in one word."
represent_prompt = "Represent the given text in one word."

# ChatML-style prompts; the assistant turn is left open so the final token's hidden state can serve as the embedding.
query_form = '<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{task_instruction}\n{query}\n{represent_prompt}<|im_end|>\n<|im_start|>assistant\n'
candidate_form = '<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{cand}<|im_end|>\n<|im_start|>assistant\n'

queries = [
    query_form.format(system_prompt=system_prompt, task_instruction=task_instruction,
                      query=text, represent_prompt=represent_prompt)
    for text in texts
]
# '<|image_pad|>' is Qwen2-VL's image placeholder; the processor expands it into the visual tokens for each image.
candidates = [
    candidate_form.format(system_prompt=system_prompt, cand='<|image_pad|>')
    for _ in images
]
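The same templates can be reused for the reverse direction (image as the query, text as the candidate). The sketch below is only illustrative: the task instruction and the image-side represent prompt are hypothetical wordings, not necessarily those used for MMEB training or evaluation.
```python
# Hypothetical image-to-text setup, reusing the templates above.
# Both instruction strings here are illustrative placeholders.
i2t_instruction = 'Find a caption that matches the given image.'
i2t_represent = 'Represent the given image in one word.'
image_queries = [
    query_form.format(system_prompt=system_prompt, task_instruction=i2t_instruction,
                      query='<|image_pad|>', represent_prompt=i2t_represent)
    for _ in images
]
text_candidates = [
    candidate_form.format(system_prompt=system_prompt, cand=text) for text in texts
]
```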
```
**Get Embeddings**
```python
from PIL import Image
import io
from urllib import request
import torch.nn.functional as F
## Query (Text)
with torch.no_grad():
    inputs = processor(text=queries, images=None, return_tensors="pt", padding=True)
    model_input = {k: v if isinstance(v, list) else v.to(model.device) for k, v in inputs.items()}
    outputs = model(**model_input, return_dict=True, output_hidden_states=True)
    hidden_states = outputs.hidden_states[-1]
    # The embedding is the last-layer hidden state at the final token position.
    query_embed = hidden_states[:, -1]

## Candidate (Image)
pil_images = [Image.open(io.BytesIO(request.urlopen(url).read())) for url in images]
with torch.no_grad():
    inputs = processor(text=candidates, images=pil_images, return_tensors="pt", padding=True)
    model_input = {k: v if isinstance(v, list) else v.to(model.device) for k, v in inputs.items()}
    outputs = model(**model_input, return_dict=True, output_hidden_states=True)
    cand_embed = outputs.hidden_states[-1][:, -1]

# L2-normalize, then the dot product gives cosine similarity.
query_embed = F.normalize(query_embed, p=2, dim=-1)
cand_embed = F.normalize(cand_embed, p=2, dim=-1)
print(query_embed @ cand_embed.T)
```
**Outputs (Similarity)**
```python
tensor([[ 0.3848, -0.0197],
        [-0.0221,  0.2949]], device='cuda:0', dtype=torch.bfloat16)
```
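The diagonal dominates each row, so text-to-image retrieval amounts to taking the argmax over each row of the similarity matrix. A small follow-up sketch using the variables from the snippets above:
```python
# Rank the image candidates for each text query (higher cosine similarity is better).
scores = (query_embed @ cand_embed.T).float()   # shape: [num_queries, num_candidates]
best = scores.argmax(dim=-1)                    # index of the top candidate per query
for text, idx in zip(texts, best.tolist()):
    print(f"{text[:40]!r} -> {images[idx]}")
```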
## Training and Evaluation
### Training Data
The model was fine-tuned on the **Massive Multimodal Embedding Benchmark (MMEB)** training set, which consists of approximately 829,000 pairs from 20 in-domain datasets.
* **Training Data:** [TIGER-Lab/MMEB-train](https://huggingface.co/datasets/TIGER-Lab/MMEB-train)
### Evaluation Data
The model's performance was evaluated on the MMEB evaluation set, which includes 36 datasets covering four meta-tasks: Classification, Visual Question Answering (VQA), Retrieval, and Visual Grounding.
* **Evaluation Data:** [TIGER-Lab/MMEB-eval](https://huggingface.co/datasets/TIGER-Lab/MMEB-eval)
### Performance
The SaHa-Qwen2-VL-7B-Instruct model achieves state-of-the-art performance in its parameter class on the MMEB benchmark, outperforming methods that rely on large-scale contrastive pre-training. IND and OOD denote the average scores over the 20 in-distribution and 16 out-of-distribution evaluation datasets, respectively.
| Model | Params | Classification | Retrieval | VQA | Grounding | **IND** | **OOD** | **Overall Avg.** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ours (SaHa-Qwen2-VL-7B) | 8.3B | 69.1 | 74.1 | 67.3 | 88.1 | **76.4** | **67.4** | **72.4** |
## Citation
If you find this model useful in your research, please cite our paper:
```bibtex
@misc{ju2025generatorembedder,
title={From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model},
author={Yeong-Joon Ju and Seong-Whan Lee},
year={2025},
eprint={2508.00955},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.00955},
}
```