Support FP4 gemm and FP4 checkpoints #3899

Open
wants to merge 4 commits into main

Conversation

trevor-m
Contributor

Motivation

This PR adds support for ModelOpt FP4 quantized models.
It was tested with an FP4-quantized Llama 3.1 model.

This work was adapted from the following vLLM PRs; thanks @pavanimajety, @kaixih, and @kushanam!
vllm-project/vllm#12784
vllm-project/vllm#13571
vllm-project/vllm#12520

Modifications

Adds two operations to sgl-kernel:

  • scaled_fp4_quant - quantizes a bf16 or fp16 input to fp4 and returns the input scales in block-interleaved format
  • cutlass_scaled_fp4_mm - performs the fp4 GEMM via CUTLASS (see the sketch after this list)
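
A minimal sketch of how the two ops fit together in an FP4 linear forward pass. Argument names and exact signatures here are assumptions based on the descriptions above and the referenced vLLM PRs, not the final sgl-kernel API:

```python
# Minimal sketch of an FP4 matmul using the two new ops.
# Argument names/signatures are assumptions, not the final sgl-kernel API.
import torch
from sgl_kernel import scaled_fp4_quant, cutlass_scaled_fp4_mm

def fp4_matmul(x, w_fp4, w_blockscale, x_global_scale, alpha,
               out_dtype=torch.bfloat16):
    # Quantize bf16/fp16 activations to fp4 on the fly; this also returns the
    # per-block activation scales in the block-interleaved layout the GEMM expects.
    x_fp4, x_blockscale = scaled_fp4_quant(x, x_global_scale)
    # FP4 x FP4 GEMM with per-block scales; alpha folds the global scales back in.
    return cutlass_scaled_fp4_mm(x_fp4, w_fp4, x_blockscale, w_blockscale,
                                 alpha, out_dtype)
```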

Adds the modelopt_fp4 quantization method: ModelOptFp4Config and ModelOptFp4LinearMethod use the new fp4 kernels for linear layers.
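
A rough end-to-end serving sketch under these assumptions: the model path is a placeholder, and passing the quantization method explicitly is assumed here (it may instead be picked up automatically from the checkpoint's quantization config):

```python
# Rough serving sketch; model_path is a placeholder and the explicit
# quantization argument is an assumption (it may be auto-detected).
import sglang as sgl

llm = sgl.Engine(
    model_path="/path/to/llama-3.1-fp4-checkpoint",
    quantization="modelopt_fp4",
)
out = llm.generate("The capital of France is", {"max_new_tokens": 8})
print(out["text"])
llm.shutdown()
```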

Checklist

  • Fix NaN issue by using getCurrentCUDAStream(); also apply the rounding patch from TensorRT-LLM (not needed for the NaN fix).
  • Add FP4 unit tests.