Collections
Collections including paper arxiv:2408.03326

- EVLM: An Efficient Vision-Language Model for Visual Understanding
  Paper • 2407.14177 • Published • 45
- ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
  Paper • 2407.04172 • Published • 26
- facebook/chameleon-7b
  Image-Text-to-Text • 7B • Updated • 60.8k • 195
- vidore/colpali
  Visual Document Retrieval • Updated • 5.65k • 464

- VoCo-LLaMA: Towards Vision Compression with Large Language Models
  Paper • 2406.12275 • Published • 31
- TroL: Traversal of Layers for Large Language and Vision Models
  Paper • 2406.12246 • Published • 35
- Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
  Paper • 2406.15334 • Published • 9
- Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
  Paper • 2406.12742 • Published • 15

- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
  Paper • 2405.08748 • Published • 24
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
  Paper • 2405.10300 • Published • 30
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
  Paper • 2405.09818 • Published • 132
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
  Paper • 2405.11143 • Published • 41

- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
  Paper • 2404.12253 • Published • 55
- Time Machine GPT
  Paper • 2404.18543 • Published • 2
- Diffusion for World Modeling: Visual Details Matter in Atari
  Paper • 2405.12399 • Published • 30
- MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
  Paper • 2405.12130 • Published • 50

- Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
  Paper • 2406.17294 • Published • 11
- TokenPacker: Efficient Visual Projector for Multimodal LLM
  Paper • 2407.02392 • Published • 24
- Understanding Alignment in Multimodal LLMs: A Comprehensive Study
  Paper • 2407.02477 • Published • 24
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
  Paper • 2407.03320 • Published • 95

- RLHF Workflow: From Reward Modeling to Online RLHF
  Paper • 2405.07863 • Published • 71
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
  Paper • 2405.09818 • Published • 132
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 55
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 90

- Interactive3D: Create What You Want by Interactive 3D Generation
  Paper • 2404.16510 • Published • 21
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
  Paper • 2404.16790 • Published • 10
- A Thorough Examination of Decoding Methods in the Era of LLMs
  Paper • 2402.06925 • Published • 1
- LLaVA-OneVision: Easy Visual Task Transfer
  Paper • 2408.03326 • Published • 61

- Event Camera Demosaicing via Swin Transformer and Pixel-focus Loss
  Paper • 2404.02731 • Published • 1
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
  Paper • 2309.12284 • Published • 18
- RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
  Paper • 2404.03204 • Published • 10
- Adapting LLaMA Decoder to Vision Transformer
  Paper • 2404.06773 • Published • 18