
Open-Sora: Sora Replication Solution with 46% Cost Reduction and Sequence Expansion to Nearly a Million Patches

Written by Team | Mar 4, 2024 9:18:00 AM
OpenAI's Sora model pushes video generation to a new level and marks a new era of AI video generation. It can be widely used in film, animation, games, advertising, and other fields, providing content creators with more convenient and efficient creation tools.
 
The Colossal-AI team has rapidly open-sourced Open-Sora, the first complete Sora replication architecture solution, which also reduces training cost by 46% and extends the model's training input sequence length to 819K patches.
 
 

Sora Algorithm Replication Scheme

According to Sora's technical report, Sora uses a video compression network to compress videos of various sizes into a sequence of spacetime patches in latent space, then applies a Diffusion Transformer for denoising, and finally decodes the result to generate video.
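As a rough illustration of that flow, here is a minimal, hedged sketch of the generation side; every name in it (`dit`, `decoder`, `denoise_step`) is an assumption made for illustration, not Open-Sora's actual API.

```python
# Hedged sketch of the described pipeline: start from noisy latent
# spacetime patches, denoise iteratively with a Diffusion Transformer,
# then decode back to video. All names are illustrative assumptions.
import torch

@torch.no_grad()
def generate_video(dit, decoder, text_cond, latent_shape, num_steps=50):
    x = torch.randn(latent_shape)              # noise over latent spacetime patches
    for t in reversed(range(num_steps)):
        x = dit.denoise_step(x, t, text_cond)  # DiT removes noise step by step
    return decoder(x)                          # decode latents back into frames
```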
 
 
Open-Sora summarizes the training pipeline that Sora might use in the following diagram.

 

Currently, Open-Sora covers:
  • A complete Sora reproduction architecture solution, spanning the whole process from data processing to training and inference.
  • Dynamic resolution: training can consume videos at any resolution directly, with no rescaling required.
  • Multiple model structures: since Sora's actual model structure is unknown, we implement three common multimodal conditioning structures, adaLN-zero, cross attention, and in-context conditioning (token concat); a sketch of all three appears after this list.
  • Multiple video compression methods: users can train on the original video, or compress it first with VQVAE (a video-native model) or SD-VAE (an image-native model).
  • Multiple parallel training optimizations: including Colossal-AI's large-model system optimization capabilities, as well as hybrid sequence parallelism with Ulysses and FastSeq.
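To make the three conditioning structures concrete, here is a minimal PyTorch sketch of each; these are simplified illustrations under our own assumptions, not Open-Sora's actual modules (in a real DiT block, for instance, an attention or MLP sub-block sits between the adaLN modulation and the gated residual).

```python
# Hedged sketches of adaLN-zero, cross attention, and in-context
# conditioning (token concat). Simplified for illustration.
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """adaLN-zero: the condition regresses per-sample scale/shift/gate,
    with the projection zero-initialized so each block starts as identity."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x, cond):                  # x: [B, N, D], cond: [B, D]
        shift, scale, gate = self.proj(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift   # modulated activations
        return x + gate * h                      # gated residual (starts at 0)

class CrossAttentionCond(nn.Module):
    """Cross attention: queries from video tokens, keys/values from condition tokens."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, cond_tokens):           # cond_tokens: [B, M, D]
        out, _ = self.attn(x, cond_tokens, cond_tokens)
        return x + out

def in_context_cond(x, cond_tokens):
    """In-context conditioning: concatenate condition tokens to the sequence
    (token concat) and let ordinary self-attention mix them."""
    return torch.cat([cond_tokens, x], dim=1)
```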

Performance Optimization

Unlike LLMs, which have large models and activations, Sora-like training tasks are characterized by a small model (e.g., under 10B parameters) but an exceptionally long sequence, owing to the complexity of video. In this regime, plain PyTorch data parallelism can no longer run, while traditional model parallelism and zero-redundancy data parallelism bring limited benefit. Open-Sora therefore starts from the usual optimization strategies, such as AMP (FP16/BF16), Flash Attention, gradient checkpointing, and ZeRO-DP; a minimal sketch of how these combine is shown below.
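The training step below is our own simplified assumption (it presumes `model(batch)` returns the loss), not Open-Sora's actual loop; ZeRO-DP itself is provided by the distributed framework rather than shown here.

```python
# Hedged sketch: BF16 autocast (AMP), gradient checkpointing, and
# scaled_dot_product_attention, which dispatches to FlashAttention
# kernels on supported hardware. Not Open-Sora's actual code.
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def attention(q, k, v):
    # fused attention kernel (FlashAttention when available)
    return F.scaled_dot_product_attention(q, k, v)

def train_step(model, batch, optimizer):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # AMP
        # recompute activations in the backward pass instead of storing them
        loss = checkpoint(model, batch, use_reentrant=False)
    loss.backward()
    optimizer.step()
    return loss.detach()
```

On top of these, Open-Sora further introduces two different implementations of sequence parallelism, which can be used together with ZeRO to achieve hybrid parallelism: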

 

  1. Ulysses, the more general of the two, which may perform better for small models or long sequences.

  2. FastSeq, which overlaps the QKV projection computation with all-gather communication, further improving training efficiency at the cost of only slightly more memory.
Both of these sequence-parallel schemes can easily be combined with ZeRO2 to achieve hybrid parallelism; a minimal sketch of the Ulysses all-to-all pattern appears below.
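The sketch below illustrates the Ulysses pattern from item 1 under our own simplifying assumptions (a single process group, with sequence length and head count divisible by the world size); it is not Open-Sora's actual implementation, and FastSeq's extra trick of overlapping the QKV projection with communication is omitted.

```python
# Hedged sketch of Ulysses-style sequence parallelism: each rank holds a
# shard of the sequence; an all-to-all turns that into full-sequence shards
# over attention heads, and a second all-to-all restores it after attention.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def seq_shard_to_head_shard(x, group=None):
    # x: [seq_local, heads, dim] -> [seq_full, heads_local, dim]
    world = dist.get_world_size(group)
    s, h, d = x.shape
    x = x.reshape(s, world, h // world, d).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)  # exchange along dim 0
    return out.reshape(world * s, h // world, d)

def head_shard_to_seq_shard(x, group=None):
    # x: [seq_full, heads_local, dim] -> [seq_local, heads, dim]
    world = dist.get_world_size(group)
    s, h, d = x.shape
    x = x.reshape(world, s // world, h, d).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    return out.permute(1, 0, 2, 3).reshape(s // world, world * h, d)

def ulysses_attention(q, k, v, group=None):
    # gather the full sequence for a local subset of heads, attend, re-shard
    q, k, v = (seq_shard_to_head_shard(t, group) for t in (q, k, v))
    o = F.scaled_dot_product_attention(
        q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1)
    ).transpose(0, 1)
    return head_shard_to_seq_shard(o.contiguous(), group)
```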
The following performance benchmark uses the DiT-XL/2 model on a server with 8× H800 SXM 80GB GPUs.

 

At a sequence length of 600K, Open-Sora's solution delivers more than a 40% performance improvement and cost reduction over the baseline solution.

 

Open-Sora can also train sequences that are 30% longer, up to 819K+ patches, while maintaining faster training speeds.
 

Call for Collaboration

Open-Sora provides a performance-optimized, low-cost development solution for Sora-like video generation models, and it will continue to be updated and optimized.
We welcome the open-source community and others to join us in building on, reproducing, and surpassing Sora, providing convenient, easy-to-use, low-cost, and reliable open-source solutions for the video generation field, and effectively promoting the adoption of AI technology in fields such as movies, games, and advertising.
 

 

References

Liu, Yixin, et al. "Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models." arXiv preprint arXiv:2402.17177 (2024).
 
Peebles, William, and Saining Xie. "Scalable Diffusion Models with Transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
 
Li, Shenggui, et al. "Colossal-AI: A Unified Deep Learning System for Large-Scale Parallel Training." Proceedings of the 52nd International Conference on Parallel Processing. 2023.
 
Jacobs, Sam Ade, et al. "DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models." arXiv preprint arXiv:2309.14509 (2023).