---
license: apache-2.0
datasets:
- TIGER-Lab/MMEB-train
language:
- en
metrics:
- precision
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: sentence-similarity
library_name: transformers
tags:
- Qwen2-VL
- qwen2-vl
- MMEB
---

SaHa Logo

# SaHa-Qwen2-VL-7B-Instruct

## Model Summary

**SaHa-Qwen2-VL-7B-Instruct** is a state-of-the-art universal multimodal embedding model based on the **Qwen2-VL-7B-Instruct** architecture. The model has been fine-tuned using our Self-aware Hard Negative Sampling (SaHa) strategy, which is designed to efficiently adapt generative Multimodal Large Language Models (MLLMs) to discriminative embedding tasks.

Our approach leverages a hierarchical embedding prompt to unlock the strong zero-shot capabilities of MLLMs and then fine-tunes the model with SaHa to achieve superior performance on universal multimodal retrieval benchmarks. This approach significantly reduces the computational costs associated with traditional contrastive pre-training while delivering state-of-the-art results.

For more details, please refer to our paper, [From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model](https://arxiv.org/abs/2508.00955), and our [GitHub repository](https://github.com/yeongjoonJu/Gen2Embed).

## How to Use

You can easily use this model with the `transformers` library for sentence and image similarity tasks. Make sure you have recent versions of `transformers`, `torch`, and `Pillow` installed. The snippets below also request the `flash_attention_2` attention implementation, which requires the `flash-attn` package.

```bash
pip install "transformers>=4.46.1" torch pillow
```

### Get Embeddings from Text or Image

Here's how to get embeddings for text or image inputs. The model uses a specific prompt structure to generate high-quality embeddings.

**Load Model**

```python
import torch
from transformers import AutoProcessor, AutoConfig, Qwen2VLForConditionalGeneration

# Load the model and processor
model_id = "Y-J-Ju/SaHa-Qwen2-VL-7B-Instruct"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config._attn_implementation = "flash_attention_2"
config.vision_config._attn_implementation = "flash_attention_2"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    config=config,
    device_map="cuda:0"
)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28
)
```

**Data Preparation and Prompting**

```python
texts = [
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
    "Korea University",
]
images = [
    'https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg',
    'https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Korea_University.jpg/960px-Korea_University.jpg',
]

task_instruction = 'Find an image that matches the given text.'
system_prompt = "Given an image, summarize the provided image in one word. Given only text, describe the text in one word."
represent_prompt = "Represent the given text in one word."
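
# Note: the templates below follow Qwen2-VL's ChatML chat format and leave the
# assistant turn open, so the last hidden state of the final prompt token can be
# used as the embedding (see "Get Embeddings" below). Queries carry the task
# instruction plus the "represent ... in one word" cue; candidates contain only
# their content (the `<|image_pad|>` placeholder is expanded into image tokens
# by the processor).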
query_form = '<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{task_instruction}\n{query}\n{represent_prompt}<|im_end|>\n<|im_start|>assistant\n'
candidate_form = '<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{cand}<|im_end|>\n<|im_start|>assistant\n'

queries = [
    query_form.format(
        system_prompt=system_prompt,
        task_instruction=task_instruction,
        query=text,
        represent_prompt=represent_prompt,
    )
    for text in texts
]
candidates = [
    candidate_form.format(system_prompt=system_prompt, cand='<|image_pad|>')
    for _ in images
]
```

**Get Embeddings**

```python
from PIL import Image
import io
from urllib import request
import torch.nn.functional as F

## Query (Text)
inputs = processor(text=queries, images=None, return_tensors="pt", padding=True)
model_input = {k: v if isinstance(v, list) else v.to(model.device) for k, v in inputs.items()}
outputs = model(**model_input, return_dict=True, output_hidden_states=True)
hidden_states = outputs.hidden_states[-1]
query_embed = hidden_states[:, -1]

## Candidate (Image)
pil_images = [Image.open(io.BytesIO(request.urlopen(url).read())) for url in images]
inputs = processor(text=candidates, images=pil_images, return_tensors="pt", padding=True)
model_input = {k: v if isinstance(v, list) else v.to(model.device) for k, v in inputs.items()}
outputs = model(**model_input, return_dict=True, output_hidden_states=True)
cand_embed = outputs.hidden_states[-1][:, -1]

query_embed = F.normalize(query_embed, p=2, dim=-1)
cand_embed = F.normalize(cand_embed, p=2, dim=-1)

print(query_embed @ cand_embed.T)
```

**Outputs (Similarity)**

```python
tensor([[ 0.3848, -0.0197],
        [-0.0221,  0.2949]], device='cuda:0', dtype=torch.bfloat16)
```

## Training and Evaluation

### Training Data

The model was fine-tuned on the **Massive Multimodal Embedding Benchmark (MMEB)** training set, which consists of approximately 829,000 pairs from 20 in-domain datasets.

* **Training Data:** [TIGER-Lab/MMEB-train](https://huggingface.co/datasets/TIGER-Lab/MMEB-train)

### Evaluation Data

The model's performance was evaluated on the MMEB evaluation set, which includes 36 datasets covering four meta-tasks: Classification, Visual Question Answering (VQA), Retrieval, and Visual Grounding.

* **Evaluation Data:** [TIGER-Lab/MMEB-eval](https://huggingface.co/datasets/TIGER-Lab/MMEB-eval)

### Performance

SaHa-Qwen2-VL-7B-Instruct achieves state-of-the-art performance in its parameter class on the MMEB benchmark, outperforming methods that rely on large-scale contrastive pre-training.

| Model | Params | Classification | Retrieval | VQA | Grounding | **IND** | **OOD** | **Overall Avg.** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ours (SaHa-Qwen2-VL-7B) | 8.3B | 69.1 | 74.1 | 67.3 | 88.1 | **76.4** | **67.4** | **72.4** |

## Citation

If you find this model useful in your research, please cite our paper:

```bibtex
@misc{ju2025generatorembedder,
      title={From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model},
      author={Yeong-Joon Ju and Seong-Whan Lee},
      year={2025},
      eprint={2508.00955},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.00955},
}
```
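
## Batched Retrieval (Sketch)

To rank more than a handful of candidates, queries and candidates can be encoded in mini-batches and compared by cosine similarity. The snippet below is a minimal, unofficial sketch built on the usage code in "How to Use": it assumes `model`, `processor`, `queries`, `candidates`, and `pil_images` from that section are already defined, and the helper name `embed_prompts` and the batch size are illustrative choices rather than part of the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed_prompts(prompts, pil_imgs=None, batch_size=8):
    """Encode formatted prompts (optionally paired with PIL images) into
    L2-normalized embeddings taken from the final token's last hidden state."""
    embeds = []
    for i in range(0, len(prompts), batch_size):
        batch_imgs = pil_imgs[i:i + batch_size] if pil_imgs is not None else None
        inputs = processor(text=prompts[i:i + batch_size], images=batch_imgs,
                           return_tensors="pt", padding=True)
        model_input = {k: v if isinstance(v, list) else v.to(model.device)
                       for k, v in inputs.items()}
        hidden = model(**model_input, return_dict=True,
                       output_hidden_states=True).hidden_states[-1]
        embeds.append(F.normalize(hidden[:, -1], p=2, dim=-1))
    return torch.cat(embeds, dim=0)

# Encode queries and image candidates, then rank candidates per query.
query_embed = embed_prompts(queries)
cand_embed = embed_prompts(candidates, pil_images)
scores = query_embed @ cand_embed.T   # cosine similarities (embeddings are normalized)
print(scores.argmax(dim=1))           # index of the best-matching candidate per query
```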