Rethinking Multimodality from an Industry Perspective: Captioning Is Far More Important Than You Think
ArXiv: https://arxiv.org/abs/2511.21025
GitHub: https://github.com/bronyayang/CaptionQA
HuggingFace: https://huggingface.co/datasets/Borise/CaptionQA
Introduction
This post serves as a more flexible, narrative extension of our paper, something closer to a technical blog. The first three sections focus on why we created CaptionQA. If you'd prefer to jump straight to the benchmark details, feel free to skip ahead to Section 4.
1. Why Did We Build This Benchmark?
After working on multimodal systems across several companies and product lines, I've noticed a striking pattern:
Industry relies heavily on image/document captioning, yet our understanding of captioning remains surprisingly shallow.
I've worked on multimodal problems in various contexts: search & retrieval, content understanding, intelligent assistants, and agent systems. Across all these teams, one recurring product request kept showing up:
Can we first generate a good caption for this image/document/product?
However, problems arise the moment teams attempt to evaluate whether their captions are "good enough":
- There is no truly plug-and-play captioning benchmark available today.
- Most existing evaluations focus on early MSCOCO-style short captions, far from real product use cases.
- Even tools inspired by our prior work (e.g., concept extraction and concept matching in CCEval-style setups in HallE-Control) often fail to transfer directly to real-world production scenarios.
The more we discussed captioning with people from both academia and industry, the clearer the gap became:
In industry
Captioning is infrastructure: It powers search, ranking, content understanding, recommendation systems, agent state representation, and more.
In academia
Caption evaluation receives very little attention, especially evaluations that are:
- easy to understand
- generalizable
- cost-efficient
- practically useful for real-world systems
This gap is the reason CaptionQA exists.
CaptionQA aims to provide academia with an industrial perspective, and to provide industry with a more scientific, grounded evaluation tool: a high-density, cross-domain caption benchmark that more closely reflects real-world needs.
Our Design Principles
When designing CaptionQA, we followed five core principles:
1. Return to the definition of captioning: our evaluation must align with what we believe captions should be, not just what legacy datasets measure.
2. Prefer native images and avoid data leakage: most popular multimodal datasets have been overly exposed and are likely present in model pretraining data.
3. Keep evaluation scores simple and interpretable: overly complicated scoring schemes make benchmarks harder to trust and harder to adopt.
4. Make the evaluation comprehensive yet cost-efficient: teams need fast results and effective error analysis, without enormous compute costs.
5. Ensure the method is general and highly transferable: industrial use cases vary wildly; a benchmark that cannot be adapted is a benchmark that won't be used.
2. What Is a Caption? And Why Is It So Critical in Industry?
The task of captioning actually has a long history.
In the early days, Google's motivation for generating captions was simple: teach models to mimic how humans summarize what they see. As multimodal LLMs evolved, the community shifted toward detailed captions, heavily influenced by traditions in object detection and scene graphs. Under this academic framing, a "good" caption is often viewed as an object-centric description that:
- Enumerates all objects
- Describes each object's attributes
- Sometimes includes relationships between objects
This definition has persisted for years in academia.
But industry needs something very different
In real-world applications, captioning needs are diverse, concrete, and deeply task-dependent, far beyond the academic interpretation. Anyone who has worked on multimodal products knows that captioning is used in ways that are much more complex than the traditional "describe the image" paradigm.
After working across multiple companies and product lines, and after talking with many multimodal teams, I gradually realized an important truth:
Almost every multimodal system relies on captioning, but the purpose is not to "describe images"; the goal is to convert visual information into useful, consumable text.
In production systems, captions function as textual interfaces to vision, supporting downstream components such as retrieval, ranking, summarization, recommendation, and agent reasoning. This reality is precisely why current caption benchmarks, which assume that captioning is merely object listing and attribute description, fail to reflect industry needs.
1) Search & Recommendation: Converting Images → Captions → Text-Based Systems
In e-commerce, short-video platforms, and social networks, a very common pipeline is:
- Product image → caption
- Video frame → caption
- User post → caption
Why? Because:
- User queries are inherently text
- Most companies' retrieval/ranking systems are fundamentally text-based
- And the vast majority of companies simply do not have multimodal search infrastructure
This point is crucial:
We often assume that "everyone has multimodal search," but in reality only a few tech giants do.
Most companies' search and recommendation stacks remain purely text-driven.
Therefore, converting images into captions becomes the only practical way for many companies to unlock multimodal capabilities.
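To make this pipeline concrete, here is a minimal sketch of captions feeding a purely text-based retrieval stack. The `generate_caption` helper and the bag-of-words scoring are stand-ins for whatever VLM and ranker a team already runs; nothing here is specific to CaptionQA.

```python
# Minimal sketch: captions as the bridge between images and a text-only
# retrieval stack. Only text ever enters the index.
from collections import Counter

def generate_caption(image_path: str) -> str:
    # Stand-in for a real VLM captioning call.
    raise NotImplementedError

def tokenize(text: str) -> Counter:
    return Counter(text.lower().split())

def build_caption_index(image_paths: list[str]) -> dict[str, Counter]:
    # The index stores only caption text; the images never touch the search stack.
    return {path: tokenize(generate_caption(path)) for path in image_paths}

def search(query: str, index: dict[str, Counter], top_k: int = 5) -> list[str]:
    q = tokenize(query)
    scored = sorted(
        ((sum((q & doc).values()), path) for path, doc in index.items()),
        reverse=True,
    )
    return [path for score, path in scored[:top_k] if score > 0]
```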
2) ToB / Document Tasks: Databases Cannot Store Images, So They Store Captions
In enterprise (ToB) scenarios, document-related tasks represent a massive category:
- Reports
- Financial statements
- News articles
- Contracts
- Manuals, etc.
Databases cannot directly apply query / join / rule logic to images. Therefore, companies typically need:
- OCR
- Document understanding
- Information extraction
- And ultimately, converting the page's content into a caption-like text representation
In ToB applications, caption-like text has effectively become infrastructure.
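As a minimal illustration (with made-up table and column names), here is how caption-like text lets ordinary SQL operate on what used to be an image:

```python
# A relational store cannot query pixels, but it can query caption-like text
# extracted from document pages. Schema and values below are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE document_pages (
           doc_id   TEXT,
           page_no  INTEGER,
           caption  TEXT   -- OCR + layout + extraction, flattened to text
       )"""
)
conn.execute(
    "INSERT INTO document_pages VALUES (?, ?, ?)",
    ("report-2024-q3", 12, "Bar chart of quarterly revenue; total revenue 4.2M USD."),
)

# Ordinary SQL now works over what used to be an image.
rows = conn.execute(
    "SELECT doc_id, page_no FROM document_pages WHERE caption LIKE ?",
    ("%revenue%",),
).fetchall()
print(rows)  # [('report-2024-q3', 12)]
```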
3) Privacy & Compliance: Many Companies "Cannot Store Images, Only Captions"
For privacy and compliance reasons, some companies are strictly prohibited from:
- Storing user images
- Storing user videos
- Accessing any multimodal data that has not gone through security review
As a result, the only thing they are allowed to retain is a privacy-sanitized caption (auditable, controllable, and indexable).
This leads to an interesting phenomenon:
In some large enterprises, the lifecycle of image data is extremely short.
The only representation that persists is the caption, not the image.
4) A Key Component in Agent Systems: A "New" Use Case
In emerging multimodal agent systems and embodied AI, captions are becoming a core element of the workflow.
Captions increasingly act as:
- the textual carrier of visual signals inside CoT reasoning
- the serialized representation of agent state
- the bridge between perception and decision-making
This trend has grown rapidly in the past two years, and I believe it will become one of the most important emerging directionsâone that we cannot afford to ignore.
Different Companies Have Completely Different Caption Requirements
Across industry, "captioning" is not one task; it is dozens of different tasks.
E-commerce
- Brand, price, size/specifications, material
- Attribute-heavy descriptions
- Strong emphasis on product properties
Social Platforms
- Natural images
- Event-centric (what happened)
- Object-centric (what is present)
ToB / Document Workflows
- OCR
- Table structure extraction
- Layout understanding
- Business field extraction
Short-Video Platforms
- Scene transitions
- Actions
- Object-event sequences
Album / Smartphone Manufacturers
- Portraits, beautification, geolocation
- Multi-scene blending
This highlights a key point:
In industry, captioning is not a single task; it is dozens of tasks.
Academic captioning research covers only the simplest one.
Caption Is Fundamentally an "Information Carrier," Not a Description
Most of the time, we don't use captions to describe an image. We use captions because:
We need a textual carrier to extract and transport the information inside images into downstream tasks. A caption happens to be the most convenient, safe, compact, and controllable representation.
In other words: whatever the downstream task cares about, that is what the caption should express. It does not need to be infinitely detailed, nor cover every piece of information. The goal is not "the longer the better," but rather:
The more accurately the caption captures task-relevant information, the better.
The idea of a "detailed caption" is fundamentally vague; there is no upper bound. But industry requirements are extremely clear: short and effective.
3. The Current Gap Between Academia and Industry
After moving back and forth between academia and industry over the years, one thing has become increasingly clear to me:
What academia calls "captioning technology" is almost entirely different from the "captioning capability" that industry truly needs.
1) Academia Treats Captioning as a "Description Task," While Industry Treats Captioning as an "Information Interface"
In academia, captioning means:
- generating a descriptive sentence
- optimizing BLEU/CIDEr
- climbing the leaderboard
But in industry, captions function as:
- input to search systems
- input to recommendation systems
- input to document-structuring pipelines
- a normalized, storable data representation
- part of the agent's state and intermediate reasoning
Industry does not care about "how good the description sounds." Industry cares about:
Whether the caption can reliably power downstream tasks.
2) Academia Optimizes for "More Details," While Industry Optimizes for "More Effectiveness"
In academia, a "detailed caption" essentially has no upper bound: longer sentences, more objects, more attributes. But industry wants captions that are:
- short and precise
- free from hallucination
- focused on key information
- low latency
- effective for the target task
In other words:
Industry does not need to "cover all information," but to cover only the information required by the task.
These are fundamentally different optimization goals.
3) Academia Evaluates "Language Quality," While Industry Evaluates "Task Outcome"
Academia uses BLEU / ROUGE / CIDEr. But industry evaluates:
- whether search becomes more accurate
- whether attribute extraction becomes more stable
- whether document fields become more complete
- whether agents can plan the next step more reliably
Whether a caption "sounds human-like" is irrelevant. The critical question is:
Does the caption improve task performance?
4) Agent Scenarios Make This Gap Even More Obvious
In multimodal agents, the model must integrate visual information into the reasoning loop. But LLM reasoning is inherently language-based, so the pipeline becomes:
image → structured language (caption) → chain-of-thought reasoning (a minimal sketch follows the list below)
Here, captions are not "descriptions." Captions are:
- state summaries
- tool inputs
- intermediate reasoning steps
- foundational signals for action planning
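Here is a minimal sketch of that image → caption → reasoning flow inside an agent loop. The `vlm_caption` and `llm_reason` helpers are hypothetical placeholders for whatever models an agent stack uses; the point is only the data flow, where the caption is the serialized state.

```python
# Sketch of the image -> caption -> reasoning pipeline in an agent loop.

def vlm_caption(observation_image: bytes) -> str:
    raise NotImplementedError  # e.g., a captioning call to a VLM

def llm_reason(state_text: str, goal: str) -> str:
    raise NotImplementedError  # e.g., a chain-of-thought call to a text LLM

def agent_step(observation_image: bytes, goal: str, history: list[str]) -> str:
    # 1) Perception is serialized into text: the caption *is* the agent state.
    caption = vlm_caption(observation_image)
    history.append(caption)
    # 2) The text LLM reasons over captions, never over raw pixels.
    state_text = "\n".join(history[-5:])  # keep a short window of recent states
    return llm_reason(state_text, goal)
```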
Yet academia has almost no caption benchmarks designed for agent settings, which means academic caption research drifts further and further away from real industrial needs.
4. How Does CaptionQA Evaluate Captioning?
After laying all the groundwork, we can finally reach the core question:
How do we turn the messy, diverse, structure-varying task of captioning into a measurable, interpretable, scalable evaluation framework?
The CaptionQA evaluation pipeline is extremely simple, with only three steps:
1) Use Any Model to Generate Captions
(prompts interchangeable, models interchangeable)
You may choose to use:
- our provided short / simple / long / taxonomy prompts, or
- your own custom-designed prompts
We do not restrict caption style, format, or length; the model is free to express itself.
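A minimal sketch of this step is below. Only the Simple prompt text is quoted in this post ("Describe this image in detail."); the other three strings are placeholders, and `model.generate` stands in for your own VLM interface; the provided prompts in the repo are the reference.

```python
# Step 1 sketch: interchangeable prompts, interchangeable models.
# Only the "simple" prompt text is quoted in this post; the other entries
# are placeholders for the provided prompts.
CAPTION_PROMPTS = {
    "short":    "<provided short-caption prompt>",
    "simple":   "Describe this image in detail.",
    "long":     "<provided long-caption prompt>",
    "taxonomy": "<provided taxonomy/schema prompt>",
}

def caption_image(model, image, style: str = "simple") -> str:
    # `model.generate` is a stand-in for whatever VLM interface you use.
    return model.generate(image=image, prompt=CAPTION_PROMPTS[style])
```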
2) A Fixed Evaluation Model (Qwen-2.5-72B) Answers Our Carefully Designed QA
Key point:
The evaluator only sees the caption, not the image.
This is the core principle of CaptionQA:
A caption is the textual substitute for an image. If a caption truly captures the image, it must support image-level QA.
Our QA covers object attributes, relationships, layout, states, OCR information, actions, and many more concept categories. Based on the evaluator's answer, we record one of three outcomes (sketched in code after the list):
- "cannot answer" → coverage is insufficient
- wrong answer → hallucination
- correct answer → faithfulness & accuracy
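Below is a minimal sketch of this caption-only QA step. The prompt wording and the `qwen_chat` helper are illustrative assumptions, not the released evaluation code; the key constraint is simply that the evaluator never receives the image.

```python
# Step 2 sketch: the evaluator answers each question from the caption alone.

def qwen_chat(prompt: str) -> str:
    raise NotImplementedError  # text-only call to the fixed evaluator (Qwen-2.5-72B)

def answer_from_caption(caption: str, question: str, options: list[str]) -> str:
    prompt = (
        "Answer using ONLY the caption below. "
        "If the caption does not contain the answer, reply 'cannot answer'.\n"
        f"Caption: {caption}\n"
        f"Question: {question}\n"
        f"Options: {', '.join(options)}"
    )
    return qwen_chat(prompt).strip()
```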
3) Final Score Is Extremely Simple: Pure Accuracy (0–100)
We intentionally avoid complex metrics like BLEU / ROUGE / CIDEr.
Because:
- accuracy is interpretable
- accuracy is easy to debug
- accuracy allows fair cross-model comparison
- accuracy is friendly to product teams, managers, and researchers alike
In short:
A good evaluation metric should be understood by everyone instantly.
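As a minimal sketch, scoring then reduces to counting the three outcomes above; the record field names here are hypothetical.

```python
# Step 3 sketch: aggregate per-question outcomes into the 0-100 accuracy score,
# plus the coverage / hallucination breakdown implied by the three outcomes.
# Each record is assumed to look like {"predicted": "...", "gold": "..."}.

def score(records: list[dict]) -> dict:
    correct = wrong = cannot = 0
    for r in records:
        if r["predicted"].strip().lower() == "cannot answer":
            cannot += 1   # coverage miss: the caption omitted the information
        elif r["predicted"] == r["gold"]:
            correct += 1  # faithful and accurate
        else:
            wrong += 1    # likely hallucination or distortion
    n = len(records)
    return {
        "accuracy": 100.0 * correct / n,
        "coverage_miss_rate": 100.0 * cannot / n,
        "error_rate": 100.0 * wrong / n,
    }
```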
5. Why Choose QA (Instead of Concept Extraction)? Why This Design?
We experimented with many approaches (extraction, matching, NLP tools) and eventually realized:
1) QA Has Extremely High Information Density and Unlimited Expandability
If you want to evaluate something, you can directly ask about it, without designing complex rules, tokenizers, POS taggers, or concept parsers.
2) QA Is Very Friendly for Human Annotation and for LLM Auto-Generation
Teaching annotators to perform "concept extraction" is extremely difficult. But labeling QA is simple, direct, and cost-efficient.
3) QA Has a Short Evaluation Chain and High Stability
Concept extraction usually involves extraction → matching → scoring, while QA is simply caption → answer → accuracy. The shorter the chain, the more stable the evaluation.
4) QA Is a Unified Task Format (LLMs Excel at This)
LLMs are inherently good at QA and can naturally scale QA to large volumes. You don't need complicated prompt engineering to make the model "guess" concepts.
High-Density (Dense QA) Is Our Core Idea
For every domain, we construct a concept schema, and for each image we create at least 50 questions on average. For natural images, e-commerce, documents, and embodied AI, we design schemas that represent what practitioners actually care about. Captions should cover these concepts, so we generate large numbers of questions accordingly; a sketch of this schema-driven generation follows after the list below. (We only release 25% of the QA set publicly. For full-density evaluation, please submit your captions so our team can run them.)
This high-density QA design allows us to:
- use significantly less data
- cover many more conceptual dimensions
- converge much faster
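As a rough sketch of the idea (not the released pipeline), schema-driven generation looks like this: each concept category drives its own batch of questions. The schema entries echo the concept categories mentioned earlier; `vlm_write_questions` is a hypothetical helper.

```python
# Schema-driven dense QA generation, sketched. The released pipeline on GitHub
# is the reference implementation; this only shows the shape of the idea.

NATURAL_IMAGE_SCHEMA = [
    "object attributes", "spatial relationships", "layout",
    "object states", "OCR / visible text", "actions",
]

def vlm_write_questions(image, prompt: str) -> list[str]:
    raise NotImplementedError  # a VLM call that drafts QA items for one image

def generate_dense_qa(image, schema: list[str], per_concept: int = 10) -> list[str]:
    questions = []
    for concept in schema:
        prompt = (
            f"Write {per_concept} multiple-choice questions about the "
            f"'{concept}' aspects of this image, answerable from the image alone."
        )
        questions.extend(vlm_write_questions(image, prompt))
    return questions  # downstream: filtering, deduplication, human verification
```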
Low-Cost: Stable Evaluation With Very Little Data
With only 100 images, a modelâs scores across different domains already become stable. This means:
- no need for massive evaluation datasets
- evaluation can run entirely on local machines
- any company, no matter its size, can afford to use it
This property is extremely important for real industrial use.
Easy Transfer: You Can Replicate CaptionQA to Any Domain
We open-source:
- the QA generation pipeline
- QA filtering and cleaning code
- annotation guidelines
- domain schema templates
Researchers can simply swap in a new schema to extend CaptionQA to any project or domain they care about. CaptionQA makes it possible for every field to have its own domain-specific caption benchmark.
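For example, extending to a hypothetical new domain only requires writing a new schema and reusing the same generation step; this fashion-retail schema is purely illustrative and not part of the released benchmark.

```python
# A made-up schema for a new domain; plug it into the same QA generation
# sketch from the previous section.
FASHION_RETAIL_SCHEMA = [
    "garment type", "brand and logo placement", "color and pattern",
    "material and texture", "fit and size cues", "visible price or tags",
    "styling context (worn, flat-lay, or on a mannequin)",
]

# qa = generate_dense_qa(image, FASHION_RETAIL_SCHEMA, per_concept=10)
```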
Self-Collected Data and Multi-Domain Coverage
We manually collected and filtered 658 images across four domains, and invited experts from each domain to help refine the schemas. Based on their input, we settled on the four domains that best reflect real industry needs:
- natural
- e-commerce
- document
- embodied AI
These domains together cover a very wide spectrum of caption use cases.
Agent-Ready Evaluation
Another major reason we chose a QA-based framework is that captions are essentially mandatory in multimodal agent systems. In an agent workflow, the caption becomes a representation of the multimodal state: a compact snapshot of what the model "knows" so far.
Captions are stored as intermediate information, and most downstream tasks in agent systems take the form of QA. The agent either answers questions or uses the caption as part of its reasoning steps. In many cases, QA is simply the next step in the chain of thought.
We believe future MM agents will rely even more heavily on systems like CaptionQA to diagnose and quantify the quality of these intermediate caption states.
6. CaptionQA Evaluation Results
1. Model Comparison
When reading any benchmark, the first question is of course: "Which models are stronger?"
We evaluated mainstream open-source and closed-source vision-language models using CaptionQA, and the results reveal several interesting patterns.
Open-Source Models: Qwen3-VL and GLM-4.1V Form a Consistent Top Tier
Across prompt types (Long / Simple / Taxonomy), Qwen3-VL and GLM-4.1V reliably occupy the top two spots among open-source models.
- Qwen3-VL ranks #1 overall across all settings except short prompt.
- GLM-4.1V delivers the best Document-domain performance among all open-source models.
Because the Document domain requires non-trivial OCR (tables, charts, layout understanding, structured text), GLM's strong showing here aligns with expectations.
Closed-Source Models: GPT-5 and Gemini-2.5-Pro Remain in the Lead
If we temporarily ignore short-prompt results (closed-source models are not always optimized for it), the overall trend is:
- GPT-5 and Gemini-2.5-Pro form the top tier.
- In the Document domain, GPT-5 clearly outperforms Gemini-2.5-Pro, suggesting GPT-5 is more mature in handling complex document understanding (OCR, diagrams, layout).
Open-Source vs Closed-Source: The Gap Is Closing Quickly
A particularly notable observation:
Qwen3-VL's overall performance is now very close to GPT-5 / Gemini-2.5-Pro, especially in domains such as Natural / E-commerce / Embodied AI.
In other words:
On the "caption-as-information-interface" foundation, open-source models have already entered first-tier competitiveness.
This is a very encouraging signal for the entire industry.
2. Differences Across Prompts
In CaptionQA, we evaluated four common caption prompts (Short / Simple / Long / Taxonomy), covering the spectrum from traditional captioning to the instruction-following style widely used in modern MLLMs.
The four prompt types produce very different average output lengths. Although longer prompts lead to longer captions, model performance does not simply improve with longer outputs. We observed several trends that are worth sharing.
Short Prompt: Traditional short captions can no longer meet modern multimodal needs
Short prompts resemble early-era one-sentence captioning (e.g., classic CLIP-style captions). Our results show:
- The generated captions are too short
- Information coverage is extremely limited
- CaptionQA scores are consistently biased downward
- Little to no utility for downstream tasks
This matches our expectations: short captions are largely unusable in modern multimodal applications, especially when tasks require fine-grained semantic detail.
Simple Prompt: "Describe this image in detail" is the most balanced and stable setting
The Simple prompt corresponds to the most widely adopted detailed caption format:
"Describe this image in detail."
Its characteristics:
- Clearly longer captions
- Higher information density
- More complete concept coverage
- Stronger model performance across domains
Many modern MLLMs are actually trained using this kind of prompt, so it naturally reflects their "true" captioning capabilities. We recommend Simple as the default prompt and treat it as the standard setting for CaptionQA.
Long Prompt: Captions get longer, but information density does not improve
The Long prompt is designed to push models to "write as long as possible." Indeed, the average length grows significantly, from roughly 356 to 510 words, but performance barely improves. The reason is simple:
Models have an upper bound on information density. Longer captions mostly repeat or expand wording, without adding new information.
This implies:
- A long caption ≠ a good caption
- Visual understanding has a ceiling that cannot be exceeded by verbose writing
- Chasing longer captions yields diminishing returns
This also explains why simply âwriting moreâ when building caption datasets does not lead to substantial quality improvements.
Taxonomy Prompt: Giving models the "test coverage" actually makes them fail
This was the most surprising part of our study. Our intuition was: "If we give models the QA concept schema, they could 'fill in the blanks' and cover more information." But results show the opposite:
- Model scores drop significantly across all domains
- Many models show unstable instruction-following
- Severe task drift appears in generated captions
- Models focus on "format following" rather than image understanding
This reveals a very practical issue:
Even modern MLLMs still struggle with complex structured instructions, especially when the instruction resembles a schema.
Although multimodal pretraining and post-training often include structured prompts, few works have examined this failure mode systematically. CaptionQA makes this weakness clearly visible.
The failure of the Taxonomy prompt reveals a major future challenge for multimodal Agents
In many Agent systems, instructions tend to be:
- Automatically generated
- Structurally complex
- Very long
- Multi-stage and nested
This leads to instruction scaling: instructions become increasingly long, complex, and difficult for models to follow. This raises several important questions:
- How can models remain reliable when facing future ultra-long, auto-generated instructions?
- How can we avoid task drift?
- How can we achieve both good captioning and robust instruction-following at the same time?
This was an unexpected finding during our CaptionQA study, and we believe it is an extremely valuable direction for future research.
3. Comparison with VQA
VQA vs CaptionQA: Why are models strong at VQA but still weak at captioning?
We divide the model's behavior into two separate capabilities:
- QA-on-image: Answering questions by directly looking at the image (similar to traditional VQA)
- QA-on-caption: Answering questions only from the caption, without seeing the image (CaptionQA)
Across models, QA-on-image scores are consistently higher than QA-on-caption scores, and for weaker models the gap grows much larger, meaning:
The model's visual understanding ability (VQA) is much stronger than its ability to express that understanding through captions.
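The gap discussed below is simply the difference between the two accuracies, measured on the same question set; a trivial sketch:

```python
def vqa_caption_gap(acc_on_image: float, acc_on_caption: float) -> float:
    """Both accuracies on the same QA set, 0-100 scale; a positive gap means
    the model sees more than it manages to say in its caption."""
    return acc_on_image - acc_on_caption

# e.g., an illustrative top-tier model: vqa_caption_gap(96.0, 86.0) -> 10.0
```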
Stronger models show a smaller VQA–caption gap, but the gap is still significant
For top-tier models such as GPT-5, Gemini-2.5-Pro, and Qwen3-VL, we observe:
- QA-on-image: 95–98%
- QA-on-caption: 85–90%
A gap of 9–11%.
This means:
Stronger models are indeed better at extracting and stating visual information clearly.
Mid-tier and many open-source models show much larger gaps (20–30%+)
Some open-source models (especially mid-tier ones) exhibit:
- QA-on-image: around 90%
- QA-on-caption: 60–75%
Gaps can exceed 30%.
This indicates:
Models "understand," but they cannot "say it clearly."
In other words:
- Their visual perception is not bad
- But their caption generation is extremely unstable
- Information is missing, disorganized, or drifting (task drift is very common)
- As long as you rely on the caption rather than the raw image, all these issues surface immediately
This explains a common industry phenomenon:
Many models perform extremely well on VQA leaderboards, but produce unusable captions in real-world applications.
This reveals an overlooked truth: Captioning is the weakest, most neglected link in the capability chain
Although captioning is one of the most fundamental multimodal tasks, the ecosystem has evolved such that:
- The research community focuses far more on VQA than captioning
- Companies benchmark visual understanding ("can it recognize?")
- Few evaluate visual expression ("can it articulate clearly?")
- Multimodal pretraining treats captioning as a side effect
- There are very few dedicated caption objectives
- RL and instruction tuning rarely target captioning
This directly leads to:
Models can understand, but cannot express.
Yet in real applications (search, documents, recommendation, agent systems), the caption is often the only usable information carrier. This is exactly one of the core questions CaptionQA aims to answer:
"Is the model lacking visual understanding, or lacking expression?" CaptionQA cleanly separates the two.
4. To improve captioning, we must treat captioning as an independent task, not a byproduct
From all our findings, the conclusion is clear:
Captioning must be re-prioritized, re-defined, and re-trained.
Future directions include:
- Clearer caption task definitions across domains
- Paired data mixing complex instructions × captions
- More domains, more diversity, and more dense caption supervision
- Stronger caption RL to reduce task drift
- Specialized modeling for compressed & structured information
- Teaching models how to generate "high-density, structured" captions
This is especially crucial for Agent systems:
Captioning is not "describing an image"; it is the state representation for Agents.
If the caption is wrong, the entire Agent policy will be wrong.
Captioning is becoming a foundational requirement for future multimodal systems.
We include many more experiments and details in our original paper's appendix. If you spot issues or want to extend our dataset, we welcome feedback and community contributions; we genuinely hope CaptionQA can help the open-source community. Feel free to leave comments or open an issue. We look forward to discussions!
At the end, here is the BibTeX citation:
@misc{yang2025captionqacaptionusefulimage,
title={CaptionQA: Is Your Caption as Useful as the Image Itself?},
author={Shijia Yang and Yunong Liu and Bohan Zhai and Ximeng Sun and Zicheng Liu and Emad Barsoum and Manling Li},
year={2025},
eprint={2511.21025},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.21025},
}








