D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI Paper • 2510.05684 • Published Oct 7, 2025 • 141
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation Paper • 2510.08673 • Published Oct 9, 2025 • 125
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models Paper • 2510.05034 • Published Oct 6, 2025 • 48
Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation Paper • 2509.21989 • Published Sep 26, 2025 • 22
Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale Paper • 2509.14008 • Published Sep 17, 2025 • 88
UItron: Foundational GUI Agent with Advanced Perception and Planning Paper • 2508.21767 • Published Aug 29, 2025 • 12
Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection Paper • 2508.20766 • Published Aug 28, 2025 • 14
Train Long, Think Short: Curriculum Learning for Efficient Reasoning Paper • 2508.08940 • Published Aug 12, 2025 • 27
FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control Paper • 2505.22642 • Published May 28, 2025 • 3
OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions Paper • 2505.21724 • Published May 27, 2025 • 5
Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think Paper • 2504.20708 • Published Apr 29, 2025 • 23
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation Paper • 2504.12626 • Published Apr 17, 2025 • 51
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding Paper • 2503.17827 • Published Mar 22, 2025 • 8
Vivid-ZOO: Multi-View Video Generation with Diffusion Model Paper • 2406.08659 • Published Jun 12, 2024 • 8
SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation Paper • 2105.04447 • Published May 10, 2021 • 1