is this W8A16 or W8A8?

#3
by ehartford - opened

W8A16 is compatible with Ampere via the Marlin kernel.

W8A8 is only compatible with Hopper.

Which is this?
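
For context, a quick way to tell which of these paths a given card supports is its CUDA compute capability: Ampere reports 8.0/8.6, Ada 8.9, Hopper 9.0, and native FP8 tensor-core kernels need 8.9 or newer. A minimal PyTorch sketch:

import torch

# Compute capability of the first visible GPU, e.g. (8, 0) for A100, (9, 0) for H100.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

if (major, minor) >= (8, 9):
    print("Native FP8 (W8A8) kernels are available.")
else:
    print("No native FP8; a weight-only fallback such as the Marlin kernel (W8A16) is needed.")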

Qwen org

The quantization scheme is compatible with finegrained_fp8 in Transformers, so you should be able to run it as W8A8.
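
For reference, a minimal sketch of loading an FP8 checkpoint like this with Transformers (the fine-grained FP8 quantization config is read from the checkpoint's config.json, so from_pretrained applies it automatically; the model id below is just an example, and a GPU with native FP8 support is assumed):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Example FP8 repo id; substitute the checkpoint this discussion is attached to.
model_id = "Qwen/Qwen3-30B-A3B-Thinking-2507-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The embedded quantization config (finegrained_fp8) is picked up automatically.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))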


OK, thank you.
Ampere (A100) can't run it, so I'll work on a W8A16 quant.

ehartford changed discussion status to closed

I can run FP8 models on 4x RTX 3090 cards.

uv run vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 --reasoning-parser deepseek_r1 --tensor-parallel-size 4 --enable-expert-parallel --async-scheduling
uv run vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --reasoning-parser deepseek_r1 --tensor-parallel-size 4 --enable-expert-parallel --async-scheduling
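
Once the server is up, a quick smoke test against vLLM's OpenAI-compatible endpoint (default host and port assumed; adjust if you pass --host/--port):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API at http://localhost:8000/v1 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Thinking-2507-FP8",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)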
