Abstract
AutoNeural, an NPU-native VLM architecture, improves efficiency and performance on edge devices by combining integer-only inference, a MobileNetV5-style vision backbone, and a hybrid design of SSM and Transformer layers, reducing quantization error and latency.
While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone built on depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory-I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing the quantization error of the vision encoder by up to 7× and end-to-end latency by 14× compared to conventional baselines. AutoNeural also achieves 3× faster decoding and a 4× longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
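To make the two architectural ideas concrete, the sketch below is a minimal, hypothetical PyTorch illustration (not the released AutoNeural implementation): a depthwise-separable convolution block with a bounded activation (ReLU6), which keeps activation ranges narrow and therefore quantization-friendly, and a gated causal 1D-convolution token mixer whose receptive field is fixed, so decoding needs only a short rolling window of activations instead of a KV cache that grows with sequence length. All module names and hyperparameters are assumptions for illustration.

```python
# Minimal sketch (PyTorch) of the NPU-friendly ingredients named in the abstract.
# Module names, sizes, and details are illustrative assumptions, not AutoNeural code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthwiseSeparableBlock(nn.Module):
    """MobileNet-style block: depthwise conv followed by pointwise conv.

    ReLU6 bounds activations to [0, 6], keeping per-tensor INT4/8/16
    quantization ranges tight compared with the unbounded activations of
    ViT attention/GELU stacks.
    """

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.bn1(self.dw(x)))
        return self.act(self.bn2(self.pw(x)))


class GatedConvMixer(nn.Module):
    """Gated short-convolution token mixer (SSM-flavoured).

    The causal depthwise 1D convolution has a fixed receptive field, so
    autoregressive decoding only needs the last `kernel` activations per
    layer rather than a Key-Value cache that grows with the sequence.
    """

    def __init__(self, dim: int, kernel: int = 4):
        super().__init__()
        self.kernel = kernel
        self.in_proj = nn.Linear(dim, 2 * dim)            # value and gate paths
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, dim)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        v = F.pad(v.transpose(1, 2), (self.kernel - 1, 0))  # left-pad: causal conv
        v = self.conv(v).transpose(1, 2)
        return self.out_proj(v * torch.sigmoid(g))         # gate modulates the conv path


if __name__ == "__main__":
    img = torch.randn(1, 32, 64, 64)
    tok = torch.randn(1, 128, 256)
    print(DepthwiseSeparableBlock(32, 64)(img).shape)      # torch.Size([1, 64, 64, 64])
    print(GatedConvMixer(256)(tok).shape)                  # torch.Size([1, 128, 256])
```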
Community
AutoNeural-VL-1.5B — the world's first real-time multimodal model built for in-car AI. It runs fully locally on the Qualcomm SA8295P NPU with a software–hardware co-designed architecture, setting a new bar for speed and quality.
AutoNeural redefines what AI can do in the car. Imagine how helpful your car can be when it truly understands you and the world around it in real-time. We co-developed the model with Geely for next-generation production smart cockpit experiences.
Compared to the current solution, it delivers:
- 14× lower latency (100 ms)
- 3× higher visual detail (768²)
- Up to 7× higher accuracy
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices (2025)
- Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation (2025)
- Extreme Model Compression for Edge Vision-Language Models: Sparse Temporal Token Fusion and Adaptive Neural Compression (2025)
- INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models (2025)
- SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference (2025)
- Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference (2025)
- Visual Generation Tuning (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend