
How to Build a Low-Cost Sora-Like App: Solutions for You

Recently, the free video generation platform Video Ocean went live, attracting widespread attention and praise. It supports generating videos featuring any character, in any style, from text, images, or character references.
 
 
Join the Adventure at video-ocean.com.
 
How did Video Ocean achieve rapid updates at low cost? What cutting-edge technologies are behind it? How can you fine-tune open-source models to build personalized video generation models and deploy inference services? This blog will give you an in-depth look.
 

Colossal-AI

Behind Video Ocean is the foundational support of Colossal-AI, an AI large-model training and inference system that ranks first globally among open-source AI training and inference systems on GitHub, with nearly 40,000 stars. Built on PyTorch, it reduces the cost of developing and applying AI model training, fine-tuning, and inference by up to 10x through efficient multi-dimensional parallelism, heterogeneous memory management, and other optimizations, while increasing the model capacity a single GPU can hold by up to hundreds of times and improving model task performance. Colossal-AI has collaborated with several Global Fortune 500 companies to develop and optimize large models with 10 to 100 billion parameters and to build specialized domain models.
Colossal-AI open-source address: https://github.com/hpcaitech/ColossalAI

 

ZeRO Communication Optimization


Common ZeRO communication
 


Optimized ZeRO communication

 

Building on the standard ZeRO communication pattern, Colossal-AI further overlaps the parameter all-gather with the forward computation of the next training step to achieve higher training efficiency, delivering up to ~30% acceleration in large-scale multi-node training.
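The idea can be illustrated with a small, self-contained sketch. This is not the Colossal-AI implementation: a background thread stands in for the communication stream, and the hypothetical fetch_params function stands in for the ZeRO all-gather that reassembles a layer's parameter shards. The key point is that the gather for layer i+1 is launched before the compute for layer i begins.

```python
# Conceptual sketch only: overlap the "all-gather" of the next layer's
# parameters with the current layer's forward compute.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) for _ in range(4)]  # full parameters

def fetch_params(i):
    """Stand-in for the ZeRO all-gather of layer i's parameter shards."""
    return layers[i]

def forward(x, w):
    return np.tanh(x @ w)

def run_overlapped(x):
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(fetch_params, 0)       # prefetch the first layer
        for i in range(len(layers)):
            w = pending.result()                     # wait for the gather to finish
            if i + 1 < len(layers):
                pending = comm.submit(fetch_params, i + 1)  # overlap the next gather
            x = forward(x, w)                        # compute while the gather runs
    return x

def run_sequential(x):
    for w in layers:
        x = forward(x, w)
    return x

x0 = rng.standard_normal((2, 8))
assert np.allclose(run_overlapped(x0), run_sequential(x0))
```

On real hardware the communication runs on a separate CUDA stream rather than a thread, but the scheduling pattern is the same: the result is identical to the sequential version, only the gather latency is hidden behind compute.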
 

Sequence Parallel Optimization

Colossal-AI supports multiple sequence parallelism paradigms for the Video Ocean model: tensor sequence parallelism, ring attention (context parallelism), and Ulysses-style sequence parallelism. These paradigms can be used independently or in combination. In addition, because video data produces particularly large activations, Colossal-AI further optimizes ring attention communication with an ND-ring scheme to handle complex hardware configurations. As video models scale to hundreds of billions of parameters and are trained on longer, higher-definition videos, large-scale multi-node training and hybrid parallelism become the standard. In this context, Colossal-AI's sequence parallelism optimizations cover a wide range of scenarios and achieve significant acceleration, especially when a single large video's sequence must be distributed across machines.
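To make the data layout concrete, here is a hedged sketch of the Ulysses-style exchange (not Colossal-AI's implementation): each "rank" starts with a slice of the sequence for all attention heads, and an all-to-all leaves it holding the full sequence for a subset of heads, so attention can then run locally per head. The all-to-all is simulated with plain array slicing.

```python
# Illustrative sketch of a Ulysses-style sequence-parallel exchange.
import numpy as np

world_size, seq_len, n_heads, head_dim = 4, 16, 8, 2
rng = np.random.default_rng(0)
full = rng.standard_normal((seq_len, n_heads, head_dim))

# Before the exchange: rank r holds sequence chunk r, for all heads.
seq_shards = np.split(full, world_size, axis=0)

# Simulated all-to-all: afterwards, rank r holds the whole sequence
# but only its slice of the heads.
heads_per_rank = n_heads // world_size
after = []
for r in range(world_size):
    pieces = [shard[:, r * heads_per_rank:(r + 1) * heads_per_rank, :]
              for shard in seq_shards]
    after.append(np.concatenate(pieces, axis=0))  # full seq, subset of heads

# Sanity check: concatenating along the head axis recovers the original.
assert np.allclose(np.concatenate(after, axis=1), full)
```

Ring attention takes a different route (passing key/value blocks around a ring while each rank keeps its sequence chunk), which is why the two paradigms have different communication profiles and can be combined.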
 

Convolution Layer Tensor Parallel Optimization

Colossal-AI has made targeted optimizations for VAE models applied to high-definition and long videos. For such data, cuDNN's 3D convolution creates very large activation tensors. To address this, Colossal-AI implemented block convolution and tensor parallelism. Unlike the tensor parallelism used in Transformers, Colossal-AI developed a new tensor parallelism method for the VAE to accommodate its large activation tensors, achieving both acceleration and memory savings without any loss of accuracy.
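The principle behind block convolution can be shown in one dimension with NumPy (a simplified sketch, not the cuDNN/3D implementation): split the input into chunks, extend each chunk by a halo of kernel_size - 1 samples from its neighbor, convolve each chunk independently, and concatenate. The result matches convolving the whole input at once, but the peak activation size per chunk is much smaller.

```python
# Minimal 1D sketch of block (chunked) convolution with halo exchange.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
k = rng.standard_normal(5)              # kernel_size = 5

full = np.convolve(x, k, mode="valid")  # reference: whole-input convolution

out_len = len(x) - len(k) + 1
n_blocks = 4
bounds = np.linspace(0, out_len, n_blocks + 1, dtype=int)

pieces = []
for s, e in zip(bounds[:-1], bounds[1:]):
    # Output index i needs input x[i : i + len(k)], so the block covering
    # outputs [s, e) needs inputs x[s : e + len(k) - 1] (chunk + halo).
    pieces.append(np.convolve(x[s:e + len(k) - 1], k, mode="valid"))
blocked = np.concatenate(pieces)

assert np.allclose(blocked, full)       # bit-for-bit equivalent result
```

In 3D the same halo logic applies along the spatial and temporal axes, which is what makes the approach lossless: each block sees exactly the receptive field it needs.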
 

FP8 Mixed Precision Training

Colossal-AI supports the new generation of mixed precision training schemes, combining BF16 (O2) and FP8 (O1). With just one line of code, mainstream large models can achieve an average 30% acceleration while ensuring training convergence, thus reducing the development cost of large models.


The loss curve for mixed precision training of the LLaMA2-7B model on a single H100 GPU
 
When using it, you only need to enable FP8 during plugin initialization:
 
from colossalai.booster.plugin import GeminiPlugin, HybridParallelPlugin, LowLevelZeroPlugin
...
# Pick the plugin you already use and pass use_fp8=True:
plugin = LowLevelZeroPlugin(..., use_fp8=True)
plugin = GeminiPlugin(..., use_fp8=True)
plugin = HybridParallelPlugin(..., use_fp8=True)

 

In addition, this scheme requires no extra handwritten CUDA operators, avoiding long AOT compilation times and complex build environment configuration.
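To build intuition for the precision side of the trade-off, here is an illustrative NumPy sketch (not the hardware FP8 path Colossal-AI uses): it simulates the coarser significand of an FP8-like E4M3 format by rounding each value to 3 fractional bits of its mantissa, and bounds the resulting relative error.

```python
# Illustrative only: simulate an FP8-like 3-bit mantissa by rounding.
import numpy as np

def round_mantissa(x, bits=3):
    """Round x to `bits` fractional bits of its frexp mantissa."""
    m, e = np.frexp(x)                  # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return np.ldexp(np.round(m * scale) / scale, e)

x = np.linspace(0.1, 4.0, 1000)
err = np.abs(round_mantissa(x) - x) / np.abs(x)

# With 3 mantissa bits the worst-case relative error stays below 2**-3.
assert err.max() <= 2.0 ** -3
```

The per-value error is large compared with BF16, which is why FP8 training schemes keep master weights and sensitive operations in higher precision; the convergence curve above shows the combined scheme tracking the baseline.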
 

Ease of Use

Colossal-AI provides a user-friendly API, optimized for both single-model training (e.g., Transformers) and multi-model training (e.g., GAN/VAE): https://github.com/hpcaitech/Open-Sora/blob/main/scripts/train_vae.py
 

Experience the Open-Source Model

Colossal-AI also supports the development of the open-source video model Open-Sora, which ranks first globally among open-source video generation models on GitHub, with over 20,000 stars. It supports generating 720p high-definition videos from text, seamlessly producing high-quality short films in any style. It also provides model weights, training and inference code, and technical reports for community exchange and secondary development.
 
 
Open-Sora open-source address: https://github.com/hpcaitech/Open-Sora
 

Cloud-based Personalized Fine-tuning and Inference Deployment

By combining cloud computing resources with the technical capabilities of Colossal-AI and Open-Sora, personalized fine-tuning and inference deployment needs can be met with cost-effective, high-quality computing services, maximizing the development and deployment efficiency of large AI models. A limited-time promotion is now available!
 
 

 
