is this W8A16 or W8A8?
#3
by ehartford - opened
W8A16 is compatible with Ampere via the Marlin kernel.
W8A8 is only compatible with Hopper.
Which is this?
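For anyone checking which path their GPU falls under, a minimal sketch using PyTorch's reported compute capability (Hopper is sm_90, Ampere is sm_80/sm_86; the thresholds follow the claim above):

```python
import torch

# Compute capability per the compatibility claim above:
#   (9, 0)+  Hopper -> native FP8 compute, W8A8
#   (8, x)   Ampere -> weight-only W8A16 via the Marlin kernel
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (9, 0):
    print("Hopper or newer: W8A8 FP8 should run natively")
else:
    print("Pre-Hopper: expect the weight-only W8A16 path instead")
```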
The quantization scheme is compatible with finegrained_fp8 in Transformers.
You should be able to run it with W8A8.
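For reference, a minimal loading sketch with Transformers; the finegrained_fp8 quantization_config stored in the checkpoint should be applied automatically, and the model id here is a placeholder for this repo:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id -- substitute this repo's actual model id.
model_id = "org/model-fp8"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# The finegrained_fp8 quantization_config saved in the checkpoint is
# picked up automatically; no extra quantization arguments are needed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",     # dispatch across available GPUs
    torch_dtype="auto",    # keep the stored FP8 weights
)
```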
Ok. Thank you
Ampere (A100) can't run it, so I'll work on a W8A16 version.
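A possible starting point for that conversion, assuming a recent llm-compressor; this is a data-free round-to-nearest sketch (the more common recipe uses GPTQ with calibration data), and the model paths are placeholders:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# W8A16: INT8 weights, 16-bit activations -- servable on Ampere
# through vLLM's Marlin kernel.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="W8A16",
    ignore=["lm_head"],   # keep the output head in higher precision
)

oneshot(
    model="org/model",          # placeholder source checkpoint
    recipe=recipe,
    output_dir="model-W8A16",
)
```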
ehartford changed discussion status to closed
I can run FP8 models on 4x RTX 3090 cards:

```
uv run vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 --reasoning-parser deepseek_r1 --tensor-parallel-size 4 --enable-expert-parallel --async-scheduling
uv run vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --reasoning-parser deepseek_r1 --tensor-parallel-size 4 --enable-expert-parallel --async-scheduling
```
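This presumably works because vLLM falls back to its FP8 Marlin kernel on pre-Hopper GPUs: the FP8 weights are dequantized on the fly and activations stay in 16-bit, so the model effectively runs as W8A16 rather than native W8A8.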