Abstract
AutoNeural, an NPU-native VLM architecture, improves efficiency and performance on edge devices by combining integer-only inference, a MobileNetV5-style vision backbone, and a hybrid design of SSM and Transformer layers, reducing quantization error and latency.
While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone built on depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory-I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing the quantization error of the vision encoder by up to 7× and end-to-end latency by 14× compared to conventional baselines. AutoNeural also achieves 3× faster decoding and a 4× longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
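To make the two architectural ideas concrete, the sketch below is a minimal, hypothetical PyTorch illustration (not the released AutoNeural implementation): a depthwise-separable convolution block with a bounded activation (ReLU6), which keeps activation ranges narrow and therefore quantization-friendly, and a gated causal 1D-convolution token mixer whose receptive field is fixed, so decoding needs only a short rolling window of activations instead of a KV cache that grows with sequence length. All module names and hyperparameters are assumptions for illustration.

```python
# Minimal sketch (PyTorch) of the NPU-friendly ingredients named in the abstract.
# Module names, sizes, and details are illustrative assumptions, not AutoNeural code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthwiseSeparableBlock(nn.Module):
    """MobileNet-style block: depthwise conv followed by pointwise conv.

    ReLU6 bounds activations to [0, 6], keeping per-tensor INT4/8/16
    quantization ranges tight compared with the unbounded activations of
    ViT attention/GELU stacks.
    """

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.bn1(self.dw(x)))
        return self.act(self.bn2(self.pw(x)))


class GatedConvMixer(nn.Module):
    """Gated short-convolution token mixer (SSM-flavoured).

    The causal depthwise 1D convolution has a fixed receptive field, so
    autoregressive decoding only needs the last `kernel` activations per
    layer rather than a Key-Value cache that grows with the sequence.
    """

    def __init__(self, dim: int, kernel: int = 4):
        super().__init__()
        self.kernel = kernel
        self.in_proj = nn.Linear(dim, 2 * dim)            # value and gate paths
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, dim)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        v = F.pad(v.transpose(1, 2), (self.kernel - 1, 0))  # left-pad: causal conv
        v = self.conv(v).transpose(1, 2)
        return self.out_proj(v * torch.sigmoid(g))         # gate modulates the conv path


if __name__ == "__main__":
    img = torch.randn(1, 32, 64, 64)
    tok = torch.randn(1, 128, 256)
    print(DepthwiseSeparableBlock(32, 64)(img).shape)      # torch.Size([1, 64, 64, 64])
    print(GatedConvMixer(256)(tok).shape)                  # torch.Size([1, 128, 256])
```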
Community
AutoNeural-VL-1.5B — the world's first real-time multimodal model built for in-car AI. It runs fully locally on the Qualcomm SA8295P NPU with a software–hardware co-designed architecture, setting a new bar for speed and quality.
AutoNeural redefines what AI can do in the car. Imagine how helpful your car can be when it truly understands you and the world around it in real-time. We co-developed the model with Geely for next-generation production smart cockpit experiences.
Compared to the current solution, it delivers:
- 14× lower latency (100 ms)
- 3× higher visual detail (768²)
- Up to 7× higher accuracy
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices (2025)
- Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation (2025)
- Extreme Model Compression for Edge Vision-Language Models: Sparse Temporal Token Fusion and Adaptive Neural Compression (2025)
- INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models (2025)
- SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference (2025)
- Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference (2025)
- Visual Generation Tuning (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend