Sparsity in Deep Learning Models
At the HPC-AI Research Team, we often explore ways to make deep learning models more efficient. One fundamental insight is that deep learning models are inherently sparse: many weights can be safely zeroed out without significant accuracy loss. This idea, known as model pruning, dates back to the late 1980s and the pioneering work Optimal Brain Damage by Yann LeCun and colleagues. Since then, both software frameworks and hardware accelerators have evolved to take advantage of this sparsity, enabling more efficient inference and reduced memory consumption.
Most pruning methods produce unstructured sparsity, where any individual weight can be zeroed out. While this maximizes flexibility, it poses challenges for hardware acceleration. A more hardware-friendly alternative is 2:4 semi-structured sparsity, where out of every four consecutive weights, exactly two are zero. This pattern strikes a balance between model flexibility and computational efficiency, making it ideal for modern GPU architectures.
An illustration of 2:4 semi-structured sparsity (from Meta's ICLR 2025 workshop paper)
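To make the pattern concrete, here is a minimal PyTorch sketch (a toy magnitude-based criterion of our own, not any particular published algorithm) that keeps the two largest-magnitude weights in every group of four and zeroes the rest:

import torch

def apply_2_4_mask(weight: torch.Tensor) -> torch.Tensor:
    # Toy illustration: in every group of four consecutive weights along the
    # last dimension, keep the two largest-magnitude values and zero the rest.
    # Assumes the total number of elements is divisible by 4.
    groups = weight.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=-1).indices          # 2 survivors per group
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(2, 8)
print(apply_2_4_mask(w))   # exactly two zeros in every group of four

Real pruning methods such as SparseGPT choose which two weights survive far more carefully, as discussed below.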
NVIDIA Hardware Support for 2:4 Sparsity
NVIDIA introduced native support for 2:4 sparsity in its Ampere architecture (A100 GPU) via sparse tensor cores, which offer up to 2× throughput compared to dense tensor cores. Support is further enhanced in the Hopper architecture, which delivers up to 3,957.8 FP8 TFLOPS with sparsity.
NVIDIA Hopper Tensor Core GPU performance specs (from the NVIDIA Hopper architecture whitepaper)
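On the software side, PyTorch exposes these sparse tensor cores through its (still prototype) semi-structured sparsity support, so the speedup can be sampled without writing any CUDA. A rough sketch, assuming an Ampere-or-newer GPU, fp16 inputs, and a recent PyTorch release; exact shape and dtype constraints are documented by PyTorch:

import torch
from torch.sparse import to_sparse_semi_structured

# A weight matrix that already satisfies the 2:4 pattern: [0, 0, 1, 1] tiled,
# in fp16 on the GPU, with dimensions kept as multiples of 64.
w = torch.tensor([0, 0, 1, 1], dtype=torch.float16, device="cuda").tile((256, 64))
x = torch.rand(256, 128, dtype=torch.float16, device="cuda")

dense_out = torch.mm(w, x)

w_sparse = to_sparse_semi_structured(w)   # compress into the 2:4 storage format
sparse_out = torch.mm(w_sparse, x)        # dispatched to the sparse tensor cores

print((dense_out - sparse_out).abs().max())   # ~0, up to fp16 rounding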
Pruning Large Language Models for Sparsity
To fully exploit this hardware capability, we need to convert dense LLMs into sparse ones, typically through post-training pruning or sparse fine-tuning.
In this blog, we focus on post-training pruning, which is more resource-efficient and better supported by open-source tools. Note that pruning may degrade model performance if no fine-tuning is applied afterward.
Pruning models, especially LLMs, is difficult due to their size and internal complexity. Fortunately, the community has made tremendous progress in this area:
- NVIDIA CUTLASS provides highly optimized open-source 2:4 sparse GEMM implementations.
- Algorithms like SparseGPT, WANDA, and MaskLLM are specifically designed for pruning LLMs while preserving accuracy.
- Frameworks like llm-compressor and vLLM offer pruning, quantization, and efficient inference runtimes out of the box.
SparseGPT
SparseGPT is a pruning method for large language models (LLMs), proposed by Elias Frantar and Dan Alistarh in 2023. It can prune dense models with up to 176 billion parameters in a single shot. The algorithm runs the model on samples from a calibration dataset and minimizes the output error between the original and pruned weights by computing the inverse Hessian of a layer-wise reconstruction problem; given a fixed pruning mask, the resulting weight update is optimal for that problem. For more details, please refer to the original paper listed in the References.
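Concretely, the paper casts pruning as a per-layer reconstruction problem: for layer weights $W$ and calibration inputs $X$ (notation lightly simplified from the original paper),

$$\min_{\widehat{W}} \; \bigl\| W X - \widehat{W} X \bigr\|_2^2 , \qquad H = X X^\top .$$

Removing a single weight $w_m$ costs $\varepsilon_m = w_m^2 / [H^{-1}]_{mm}$, and the surviving weights receive the compensating update

$$\delta = -\,\frac{w_m}{[H^{-1}]_{mm}} \,(H^{-1})_{:,m} \, ;$$

for 2:4 sparsity, the two lowest-cost weights in each group of four are pruned.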
Example
In this section, we demonstrate how to sparsify the meta-llama/Meta-Llama-3-8B-Instruct model and serve it using vLLM.
Environment setup
We use llm-compressor because it is compatible with vLLM and provides a variety of model pruning and quantization techniques.
# create and activate a new environment
$ conda create -n sparse python=3.10
$ conda activate sparse
# Install llm-compressor from source
# Alternatively, you can install from PyPI using pip install llmcompressor==0.6.0
$ git clone https://github.com/vllm-project/llm-compressor.git &&\
cd llm-compressor &&\
git checkout 0.6.0 &&\
pip install -e .
# install transformers==4.51.0; higher versions seem to cause problems
$ pip install transformers==4.51.0
Model pruning and compression
The repository provides a ready-to-use example that prunes the meta-llama/Meta-Llama-3-8B-Instruct model using ultrachat_200k as a calibration dataset.
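For reference, the core of that script boils down to something like the sketch below. The modifier names follow llm-compressor's public API, but import paths and arguments shift between releases, so treat the example file in the repository as the authoritative version:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
SAVE_DIR = "Meta-Llama-3-8B-Instruct2of4-W8A8-FP8-Dynamic-Per-Token"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 2:4 pruning with SparseGPT, then FP8 dynamic quantization of the Linear
# layers (the example enables the quantization step only when --fp8 is
# passed; lm_head is left dense).
recipe = [
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4"),
    QuantizationModifier(targets=["Linear"], scheme="FP8_DYNAMIC", ignore=["lm_head"]),
]

# Calibrate on a small slice of ultrachat_200k (the real script preprocesses
# the dataset with the chat template; that step is omitted here).
oneshot(
    model=model,
    dataset="ultrachat_200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)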
By running the script, we obtain a pruned and quantized version of Llama-3-8B, and we can compare the model size before and after pruning to observe the reduction in weight storage.
# download meta-llama/Meta-Llama-3-8B-Instruct
$ huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir meta-llama/Meta-Llama-3-8B-Instruct
# run the example, which takes roughly 10 min under the example setting
$ python /root/llm-compressor/examples/sparse_2of4_quantization_fp8/llama3_8b_2of4.py --fp8
# check compressed model
$ du -sh *
# 6.1G Meta-Llama-3-8B-Instruct2of4-W8A8-FP8-Dynamic-Per-Token
# 30G meta-llama
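As a quick sanity check, the applied sparsity and quantization settings are recorded in the checkpoint's config.json; the key name below is what the compressed-tensors format typically uses, so adjust it if your version differs:

import json

# Inspect the compression metadata written by llm-compressor.
with open("Meta-Llama-3-8B-Instruct2of4-W8A8-FP8-Dynamic-Per-Token/config.json") as f:
    cfg = json.load(f)

# Quantization (and, nested inside it, sparsity) settings usually live here.
print(json.dumps(cfg.get("quantization_config", {}), indent=2))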
Inference with the compressed model
vLLM leverages the sparse GEMM implementation from CUTLASS (see the code here). You can easily benchmark this implementation on your machine using the command below.
python /root/vllm/benchmarks/cutlass_benchmarks/sparse_benchmarks.py --dtype fp8 model_bench --models meta-llama/Llama-3-8b
From the benchmark results, the FP8 sparse GEMM kernel achieves an average speedup of 1.22× compared to cuBLAS (PyTorch) FP8 GEMM, and 2.01× compared to cuBLAS BF16 GEMM. The performance gains become more significant with larger batch sizes.
Environment setup
We benchmarked model serving using vLLM. While SGLang also adopts this sparse GEMM implementation by importing the corresponding modules, it may fail on certain versions due to compatibility issues. For more details, refer to this issue and this PR.
Simply follow the vLLM installation guide:
# Alternatively, you can install from pypi using pip install vllm==0.9.2
$ git clone https://github.com/vllm-project/vllm.git &&\
cd vllm &&\
git checkout v0.9.2 &&\
pip install -e .
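Once vLLM is installed, the compressed checkpoint can be loaded like any other model directory; vLLM should pick up the compressed-tensors quantization config automatically. A minimal offline-inference sketch, reusing the output directory produced above:

from vllm import LLM, SamplingParams

# Load the pruned + FP8-quantized checkpoint produced by llm-compressor.
llm = LLM(model="Meta-Llama-3-8B-Instruct2of4-W8A8-FP8-Dynamic-Per-Token")

prompts = ["Explain 2:4 semi-structured sparsity in one sentence."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)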
Throughput test
We benchmarked the model using 1,000 prompts, each with a 1024-token input and a 1024-token output, focusing on the total serving duration. The sparsified model achieved a 1.27× end-to-end speedup (97.64 s vs. 123.66 s) over the dense version.
| | Benchmark duration | Request throughput | Output token throughput | Total token throughput |
| --- | --- | --- | --- | --- |
| Baseline (dense) | 123.66 s | 8.02 req/s | 7841.47 tok/s | 16038.45 tok/s |
| 2:4 sparse + FP8 | 97.64 s | 10.16 req/s | 9693.14 tok/s | 20074.92 tok/s |
MMLU benchmark
We also evaluated the 5-shot MMLU score of the sparsified model using this script.
Note that the SparseGPT example above does not include any fine-tuning to recover potential accuracy loss, as such processes typically require hardware resources comparable to pre-training.
Instead, we used RedHatAI/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-FP8-dynamic to demonstrate the expected performance after fine-tuning.
| | Dense Model | SparseGPT | SparseGPT + fine-tune |
| --- | --- | --- | --- |
| Average Accuracy (5-shot) | 0.667 | 0.404 | 0.604 |
Conclusion
We demonstrated how 2:4 structured sparsity, enabled through post-training pruning with SparseGPT and served via vLLM, can significantly accelerate LLM inference, achieving up to a 1.27× end-to-end speedup with minimal engineering effort. While accuracy can drop without fine-tuning, sparse models that are fine-tuned after pruning recover most of the lost accuracy.
With open-source tools maturing, running efficient, sparsified LLMs has become much more practical. If you’re exploring ways to scale your experiments or accelerate model deployment, the team at HPC-AI.com offers H200 GPU clusters at $2.10/GPU·hour, providing researchers and developers with the compute they need to focus on building models rather than managing infrastructure.
References
Papers
Frantar, E., Alistarh, D. (2023). SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv preprint arXiv:2301.00774.
Sun, M., Liu, Z., Bair, A., Kolter, J.Z. (2023). A Simple and Effective Pruning Approach for Large Language Models. arXiv preprint arXiv:2306.11695.
Fang, G., Yin, H., Muralidharan, S., Heinrich, G., Pool, J., Kautz, J., Molchanov, P., Wang, X. (2024). MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models. arXiv preprint arXiv:2409.17481.
Haziza, D., Chou, T., Choudhary, D., Wehrstedt, L., Massa, F., Yu, J., Jeong, G., Rao, S., Labatut, P., Cai, J. (2025). Accelerating Transformer Inference and Training with 2:4 Activation Sparsity. arXiv preprint arXiv:2503.16672.
Blogs/Documentation/Repositories