Research Papers about High-Performance Computing

Elixir: Train a Large Language Model on a Small GPU Cluster

December 10, 2022

Haichen Huang, Jiarui Fang, Hongxin Liu, Shenggui Li, Yang You

In recent years, the number of parameters in a single deep learning (DL) model has grown much faster than GPU memory capacity. People without access to a large number of GPUs resort to heterogeneous training systems that store model parameters in CPU memory. Existing heterogeneous systems build their parallelization plans at the scope of the whole model: they apply one consistent parallel training method to every operator in the computation. As a result, engineers must invest considerable effort to incorporate a new type of model parallelism and to make it compatible with the other parallelisms. For example, Mixture-of-Experts (MoE) is still incompatible with ZeRO-3 in DeepSpeed. Current systems also suffer from efficiency problems at small scale, since they are designed and tuned for large-scale training. In this paper, we propose Elixir, a new parallel heterogeneous training system designed for efficiency and flexibility. Elixir utilizes the memory and computing resources of both GPU and CPU. For flexibility, Elixir generates parallelization plans at the granularity of operators, so any new type of model parallelism can be incorporated by assigning a parallel pattern to the operator. For efficiency, Elixir implements a hierarchical distributed memory management scheme to accelerate inter-GPU communication and CPU-GPU data transmission. As a result, Elixir can train a 30B OPT model on an A100 with 40GB of CUDA memory while reaching 84% of the efficiency of PyTorch GPU training. With its super-linear scalability, its training efficiency matches that of PyTorch GPU training on multiple GPUs. In addition, large MoE models can be trained 5.3x faster than dense models of the same size. Elixir is now integrated into Colossal-AI and is available on its main branch.
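To make the memory gap in the abstract concrete, the following back-of-the-envelope sketch estimates the model-state footprint of a 30B-parameter model against a 40GB GPU. It is not taken from the paper. It assumes the common mixed-precision Adam accounting of 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights, momentum, and variance), ignores activations and temporary buffers, and treats "40GB" as 40e9 bytes. The function names are illustrative only.

```python
# Rough model-state memory estimate (an illustrative sketch, not Elixir's code).
# Assumption: mixed-precision Adam, 16 bytes per parameter in total:
#   2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 weights, momentum, variance).
# Activations and temporary buffers are ignored.

BYTES_FP16_WEIGHTS = 2
BYTES_MIXED_PRECISION_ADAM = 16


def model_state_bytes(num_params: int, bytes_per_param: int) -> int:
    """Return the bytes needed to hold the model states for num_params parameters."""
    return num_params * bytes_per_param


def fits_on_gpu(num_params: int, gpu_mem_bytes: int, bytes_per_param: int) -> bool:
    """Return True if the model states alone fit in the given GPU memory."""
    return model_state_bytes(num_params, bytes_per_param) <= gpu_mem_bytes


num_params = 30 * 10**9   # 30B parameters, as in the abstract
gpu_mem = 40 * 10**9      # one A100 with 40GB, taken here as 40e9 bytes

weights_only = model_state_bytes(num_params, BYTES_FP16_WEIGHTS)
full_states = model_state_bytes(num_params, BYTES_MIXED_PRECISION_ADAM)

print(f"fp16 weights only: {weights_only / 1e9:.0f} GB")
print(f"full training states: {full_states / 1e9:.0f} GB")
print("weights fit on GPU:", fits_on_gpu(num_params, gpu_mem, BYTES_FP16_WEIGHTS))
```

Under these assumptions, even the fp16 weights alone exceed the GPU's capacity, which is why a system of this kind has to place part of the model state in CPU memory.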

Research Papers

Elixir: Train a Large Language Model on a Small GPU Cluster

Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management

EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models

A Frequency-aware Software Cache for Large Recommendation System Embeddings

Go Wider Instead of Deeper

Sky Computing: Accelerating Geo-distributed Computing in Federated Learning

Online evolutionary batch size orchestration for scheduling deep learning workloads in GPU clusters

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

Cross-token Modeling with Conditional Computation

PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management

Maximizing Parallelism in Distributed Training for Huge Neural Networks

Tesseract: Parallelize the Tensor Parallelism Efficiently

Sequence Parallelism: Long Sequence Training from System Perspective

An Efficient 2D Method for Training Super-Large Deep Learning Models

TurboTransformers: an efficient GPU serving system for transformer models

Semantic Segmentation-Based Building Footprint Extraction Using Very High-Resolution Satellite Images and Multi-Source GIS Data

A Robotic Communication Middleware Combining High Performance and High Reliability

Message Passing Optimization in Robot Operating System

RedSync: Reducing synchronization bandwidth for distributed deep learning training system

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

swCaffe: A Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

ImageNet Training in Minutes