Sparsity in Deep Learning Models
At the HPC-AI Research Team, we often explore ways to make deep learning models more efficient. One fundamental insight is that deep learning models are inherently sparse: many weights can be safely zeroed out without significant accuracy loss. This idea, known as model pruning, dates back to the late 1980s and the pioneering work Optimal Brain Damage by Yann LeCun and colleagues. Since then, both software frameworks and hardware accelerators have evolved to take advantage of this sparsity, enabling more efficient inference and reduced memory consumption.
Most pruning methods produce unstructured sparsity, where any individual weight can be zeroed out. While this maximizes flexibility, it poses challenges for hardware acceleration. A more hardware-friendly alternative is 2:4 semi-structured sparsity, where out of every four consecutive weights, exactly two are zero. This pattern strikes a balance between model flexibility and computational efficiency, making it ideal for modern GPU architectures.
An illustration of 2:4 semi-structured sparsity (from Meta's ICLR 2025 workshop paper)
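To make the pattern concrete, here is a minimal PyTorch sketch (a toy magnitude-based criterion of our own, not any particular published algorithm) that keeps the two largest-magnitude weights in every group of four and zeroes the rest:

import torch

def apply_2_4_mask(weight: torch.Tensor) -> torch.Tensor:
    # Toy illustration: in every group of four consecutive weights along the
    # last dimension, keep the two largest-magnitude values and zero the rest.
    # Assumes the total number of elements is divisible by 4.
    groups = weight.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=-1).indices          # 2 survivors per group
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(2, 8)
print(apply_2_4_mask(w))   # exactly two zeros in every group of four

Real pruning methods such as SparseGPT choose which two weights survive far more carefully, as discussed below.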
NVIDIA Hardware Support for 2:4 Sparsity
NVIDIA introduced native support for 2:4 sparsity in its Ampere architecture (A100 GPU) via sparse tensor cores, which offer up to 2× throughput compared to dense tensor cores. Support is further enhanced in the Hopper architecture, which delivers up to 3,957.8 FP8 TFLOPS with sparsity.
NVIDIA Hopper Tensor Core GPU performance specs (from the NVIDIA Hopper architecture whitepaper)
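On the software side, PyTorch exposes these sparse tensor cores through its (still prototype) semi-structured sparsity support, so the speedup can be sampled without writing any CUDA. A rough sketch, assuming an Ampere-or-newer GPU, fp16 inputs, and a recent PyTorch release; exact shape and dtype constraints are documented by PyTorch:

import torch
from torch.sparse import to_sparse_semi_structured

# A weight matrix that already satisfies the 2:4 pattern: [0, 0, 1, 1] tiled,
# in fp16 on the GPU, with dimensions kept as multiples of 64.
w = torch.tensor([0, 0, 1, 1], dtype=torch.float16, device="cuda").tile((256, 64))
x = torch.rand(256, 128, dtype=torch.float16, device="cuda")

dense_out = torch.mm(w, x)

w_sparse = to_sparse_semi_structured(w)   # compress into the 2:4 storage format
sparse_out = torch.mm(w_sparse, x)        # dispatched to the sparse tensor cores

print((dense_out - sparse_out).abs().max())   # ~0, up to fp16 rounding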
Pruning Large Language Models for Sparsity
To fully exploit this hardware capability, we need to convert dense LLMs into sparse ones, typically through post-training pruning or sparse fine-tuning.
In this blog, we focus on post-training pruning, which is more resource-efficient and better supported by open-source tools. Note that pruning may degrade model performance if no fine-tuning is applied afterward.
Pruning models, especially LLMs, is difficult due to their size and internal complexity. Fortunately, the community has made tremendous progress in this area:
- NVIDIA CUTLASS provides highly optimized open-source 2:4 sparse GEMM implementations.
- Algorithms like SparseGPT, WANDA, and MaskLLM are specifically designed for pruning LLMs while preserving accuracy.
- Frameworks like llm-compressor and vLLM offer pruning, quantization, and efficient inference runtimes out of the box.
SparseGPT
SparseGPT is a pruning method for large language models (LLMs), proposed by Elias Frantar and Dan Alistarh in 2023. It can prune dense models with up to 176 billion parameters in a single shot. The algorithm runs the model on samples from a calibration dataset and minimizes the output error between the original and pruned weights by computing the inverse Hessian of a layer-wise reconstruction problem; given a fixed pruning mask, the resulting weight update is optimal for that problem. For more details, please refer to the original paper listed in the References.
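Concretely, the paper casts pruning as a per-layer reconstruction problem: for layer weights $W$ and calibration inputs $X$ (notation lightly simplified from the original paper),

$$\min_{\widehat{W}} \; \bigl\| W X - \widehat{W} X \bigr\|_2^2 , \qquad H = X X^\top .$$

Removing a single weight $w_m$ costs $\varepsilon_m = w_m^2 / [H^{-1}]_{mm}$, and the surviving weights receive the compensating update

$$\delta = -\,\frac{w_m}{[H^{-1}]_{mm}} \,(H^{-1})_{:,m} \, ;$$

for 2:4 sparsity, the two lowest-cost weights in each group of four are pruned.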
Example
In this section, we demonstrate how to sparsify the meta-llama/Meta-Llama-3-8B-Instruct model and serve it using vLLM.
Environment setup
We use llm-compressor because it is compatible with vLLM and provides a variety of model pruning and quantization techniques.
# create and activate a new environment
$ conda create -n sparse python=3.10
$ conda activate sparse
# Install llm-compressor from source
# Alternatively, you can install from PyPI using pip install llmcompressor==0.6.0
$ git clone https://github.com/vllm-project/llm-compressor.git &&\
cd llm-compressor &&\
git checkout 0.6.0 &&\
pip install -e .
# install transformers==4.51.0; higher versions seem to cause problems
$ pip install transformers==4.51.0
Model pruning and compression
The repository provides a ready-to-use example that prunes the meta-llama/Meta-Llama-3-8B-Instruct model using ultrachat_200k as a calibration dataset.
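For reference, the core of that script boils down to something like the sketch below. The modifier names follow llm-compressor's public API, but import paths and arguments shift between releases, so treat the example file in the repository as the authoritative version:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
SAVE_DIR = "Meta-Llama-3-8B-Instruct2of4-W8A8-FP8-Dynamic-Per-Token"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 2:4 pruning with SparseGPT, then FP8 dynamic quantization of the Linear
# layers (the example enables the quantization step only when --fp8 is
# passed; lm_head is left dense).
recipe = [
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4"),
    QuantizationModifier(targets=["Linear"], scheme="FP8_DYNAMIC", ignore=["lm_head"]),
]

# Calibrate on a small slice of ultrachat_200k (the real script preprocesses
# the dataset with the chat template; that step is omitted here).
oneshot(
    model=model,
    dataset="ultrachat_200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)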
By running the script, we obtain a pruned and quantized version of Llama-3-8B, and we can compare the model size before and after pruning to observe the reduction in weight storage.
# download meta-llama/Meta-Llama-3-8B-Instruct
$ huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir meta-llama/Meta-Llama-3-8B-Instruct
# run the example, which takes roughly 10 min under the example setting
$ python /root/llm-compressor/examples/sparse_2of4_quantization_fp8/llama3_8b_2of4.py --fp8
# check compressed model
$ du -sh *
# 6.1G Meta-Llama-3-8B-Instruct2of4-W8A8-FP8-Dynamic-Per-Token
# 30G meta-llama
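As a quick sanity check, the applied sparsity and quantization settings are recorded in the checkpoint's config.json; the key name below is what the compressed-tensors format typically uses, so adjust it if your version differs:

import json

# Inspect the compression metadata written by llm-compressor.
with open("Meta-Llama-3-8B-Instruct2of4-W8A8-FP8-Dynamic-Per-Token/config.json") as f:
    cfg = json.load(f)

# Quantization (and, nested inside it, sparsity) settings usually live here.
print(json.dumps(cfg.get("quantization_config", {}), indent=2))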
Inference with the compressed model
vLLM leverages the sparse GEMM implementation from CUTLASS (see the code here). You can easily benchmark this implementation on your machine using the command below.
python /root/vllm/benchmarks/cutlass_benchmarks/sparse_benchmarks.py --dtype fp8 model_bench --models meta-llama/Llama-3-8b
From the benchmark results, the FP8 sparse GEMM kernel achieves an average speedup of 1.22× compared to cuBLAS (PyTorch) FP8 GEMM, and 2.01× compared to cuBLAS BF16 GEMM. The performance gains become more significant with larger batch sizes.
Environment setup
We benchmarked model serving using vLLM. While SGLang also adopts this sparse GEMM implementation by importing the corresponding modules, it may fail on certain versions due to compatibility issues. For more details, refer to this issue and this PR.
Simply follow the vLLM installation guide:
# Alternatively, you can install from pypi using pip install vllm==0.9.2
$ git clone https://github.com/vllm-project/vllm.git &&\
cd vllm &&\
git checkout v0.9.2 &&\
pip install -e .
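Once vLLM is installed, the compressed checkpoint can be loaded like any other model directory; vLLM should pick up the compressed-tensors quantization config automatically. A minimal offline-inference sketch, reusing the output directory produced above:

from vllm import LLM, SamplingParams

# Load the pruned + FP8-quantized checkpoint produced by llm-compressor.
llm = LLM(model="Meta-Llama-3-8B-Instruct2of4-W8A8-FP8-Dynamic-Per-Token")

prompts = ["Explain 2:4 semi-structured sparsity in one sentence."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)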
Throughput test
We benchmarked the model using 1,000 prompts, each with a 1024-token input and a 1024-token output, focusing on the total serving duration. The sparsified model achieved a 1.27× end-to-end speedup (97.64 s vs. 123.66 s) over the dense version.
| | Benchmark duration | Request throughput | Output token throughput | Total token throughput |
| --- | --- | --- | --- | --- |
| Baseline (dense) | 123.66 s | 8.02 req/s | 7841.47 tok/s | 16038.45 tok/s |
| 2:4 sparse + FP8 | 97.64 s | 10.16 req/s | 9693.14 tok/s | 20074.92 tok/s |
MMLU benchmark
We also evaluated the 5-shot MMLU score of the sparsified model using this script.
Note that the SparseGPT example above does not include any fine-tuning to recover potential accuracy loss, as such processes typically require hardware resources comparable to pre-training.
Instead, we used RedHatAI/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-FP8-dynamic to demonstrate the expected performance after fine-tuning.
| | Dense Model | SparseGPT | SparseGPT + fine-tune |
| --- | --- | --- | --- |
| Average Accuracy (5-shot) | 0.667 | 0.404 | 0.604 |
Conclusion
We demonstrated how 2:4 structured sparsity, enabled through post-training pruning with SparseGPT and served via vLLM, can significantly accelerate LLM inference, achieving up to a 1.27× end-to-end speedup with minimal engineering effort. While accuracy can drop without fine-tuning, sparse models that are fine-tuned after pruning recover most of the lost accuracy.
With open-source tools maturing, running efficient, sparsified LLMs has become much more practical. If you’re exploring ways to scale your experiments or accelerate model deployment, the team at HPC-AI.com offers H200 GPU clusters at $2.10/GPU·hour, providing researchers and developers with the compute they need to focus on building models rather than managing infrastructure.
References
Papers
Frantar, E., Alistarh, D. (2023). SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv preprint arXiv:2301.00774.
Sun, M., Liu, Z., Bair, A., Kolter, J.Z. (2023). A Simple and Effective Pruning Approach for Large Language Models. arXiv preprint arXiv:2306.11695.
Fang, G., Yin, H., Muralidharan, S., Heinrich, G., Pool, J., Kautz, J., Molchanov, P., Wang, X. (2024). MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models. arXiv preprint arXiv:2409.17481.
Haziza, D., Chou, T., Choudhary, D., Wehrstedt, L., Massa, F., Yu, J., Jeong, G., Rao, S., Labatut, P., Cai, J. (2025). Accelerating Transformer Inference and Training with 2:4 Activation Sparsity. arXiv preprint arXiv:2503.16672.
Blogs/Documentation/Repositories