In the world of Large Language Models (LLMs), handling multi-round conversations has always been a challenge. StreamingLLM, recently proposed by Xiao et al. from MIT, can handle streaming text inputs of over 4 million tokens in multi-round conversations without sacrificing inference speed or generation quality, achieving a 22.2x speedup over the sliding window attention with recomputation method.
However, StreamingLLM is implemented in native PyTorch, and further optimization is needed in order to meet the low cost, low latency, and high throughput requirements for LLM multi-round conversation applications.
The Colossal-AI team open-sourced SwiftInfer, a TensorRT-based implementation of the StreamingLLM algorithm, which further improves the inference performance of large language models by 46%, providing an efficient and reliable solution for multi-round conversations.
Introduction to StreamingLLM
The ability of Large Language Models to understand and remember context directly affects the quality of interaction with users in chatbot applications such as ChatGPT. It is a challenging task to maintain generation quality during multi-round conversations due to the limited input length and GPU memory.
LLMs are generally trained with a fixed input length and cannot generalize well to texts longer than the training sequence length without extra fine-tuning; they may even collapse outright if the input exceeds the attention window. At the same time, the input length is constrained by GPU memory. The well-known KV Cache mechanism saves model computation time, but in long-text multi-round conversations the cached keys and values grow with the sequence length, and the cache capacity cannot be expanded indefinitely within limited GPU memory.
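As a back-of-the-envelope illustration (not from the original work; the model configuration below is a Llama-2-7B-like assumption), a few lines of Python show how quickly the KV cache grows with sequence length:

```python
# Back-of-the-envelope KV cache size for a Llama-2-7B-like model (assumed config).
num_layers = 32        # transformer layers (assumption)
num_kv_heads = 32      # key/value heads (assumption, no GQA)
head_dim = 128         # per-head dimension (assumption)
bytes_per_elem = 2     # FP16

def kv_cache_bytes(seq_len: int, batch_size: int = 1) -> int:
    # 2x for keys and values, cached at every layer for every token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

for seq_len in (4_096, 32_768, 4_000_000):
    print(f"{seq_len:>9} tokens -> {kv_cache_bytes(seq_len) / 2**30:.1f} GiB")
# About 2 GiB at 4K tokens, 16 GiB at 32K, and roughly 2 TiB at 4M tokens --
# which is why the cache cannot simply grow without bound.
```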
StreamingLLM tackled this problem by examining the output of the softmax operation in the attention module and observing the phenomenon of attention sinks. The attention mechanism assigns an attention score to each token, and interestingly, the first few tokens in a text always receive a disproportionately large share of attention, even when they carry little useful information. When a sliding-window attention module is used, the model collapses and outputs high-perplexity text once these initial tokens are evicted from the attention window. In contrast, the model can stably generate high-quality text as long as these tokens are kept in the window at all times.
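To make the idea concrete, the "attention sinks plus recent window" cache policy can be sketched in a few lines of Python (a minimal illustration, not the actual StreamingLLM code; the sink count and window size below are arbitrary assumptions):

```python
def streaming_keep_indices(cache_len: int, num_sink: int = 4, window: int = 10) -> list[int]:
    """Indices of cached tokens to keep: the first `num_sink` 'attention sink'
    tokens plus the most recent tokens, up to `window` entries in total."""
    if cache_len <= window:
        return list(range(cache_len))
    num_recent = window - num_sink
    return list(range(num_sink)) + list(range(cache_len - num_recent, cache_len))

print(streaming_keep_indices(8))   # cache still fits the window: [0, 1, ..., 7]
print(streaming_keep_indices(15))  # keep sinks 0-3 plus the 6 most recent: [0, 1, 2, 3, 9, ..., 14]
```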
Before StreamingLLM, there were several commonly used attention methods for text generation, namely dense attention, window attention, and sliding window attention with recomputation. StreamingLLM outperforms these methods in terms of both computational complexity and generation quality. One significant advantage of StreamingLLM is that it requires no further fine-tuning and allows mainstream language models to accept long text inputs in a streaming manner.
SwiftInfer: TensorRT Implementation of StreamingLLM
In order to better apply the technology of StreamingLLM to production, the Colossal-AI team combined the StreamingLLM method with TensorRT inference optimization in the SwiftInfer project, which not only inherits all the advantages of the original StreamingLLM but also delivers higher inference efficiency. Using TensorRT-LLM's API, we are able to construct the model in a way similar to writing a PyTorch model.
Based on TensorRT-LLM, we re-implemented the KV Cache mechanism and the attention module with position shift. As shown in the figure below, assuming a window size of 10 tokens, as the number of generated tokens increases (indicated by the yellow squares), we evict the intermediate tokens from the KV cache and always keep a few tokens at the beginning of the text (indicated by the blue squares). Since the positions of the yellow squares change, we also need to re-inject the position embeddings when computing attention.
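The position shift can be illustrated with a small PyTorch sketch (a simplified, single-head illustration under assumed shapes, not the TensorRT-LLM implementation): after eviction, the cached keys are rotated with their positions inside the cache rather than their original positions in the stream, so relative distances stay within the range the model was trained on.

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq, head_dim) at the given positions.
    Simplified single-head version for illustration only."""
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Suppose the stream has produced 15 tokens but the cache keeps only 10 of them
# (4 sink tokens + 6 most recent), as in the eviction sketch above.
kept = [0, 1, 2, 3, 9, 10, 11, 12, 13, 14]     # original stream indices of cached tokens
keys = torch.randn(len(kept), 128)             # unrotated cached keys (assumed head_dim = 128)

# Position shift: rotate with positions *within the cache* (0..9),
# not the original stream indices, before computing attention.
cache_positions = torch.arange(len(kept))
rotated_keys = rope_rotate(keys, cache_positions)
```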
It is important to note that StreamingLLM does not directly increase the context length the model can access; rather, it guarantees stable generation quality while supporting much longer streaming dialogue inputs.
46% Faster Streaming Inference for Large Models
The original version of StreamingLLM could reliably process ultra-long inputs of more than 4 million tokens in a streaming manner, achieving a 22.2x speedup over the sliding window attention mechanism with recomputation.
SwiftInfer further improves inference performance, delivering up to an additional 46% gain in inference throughput and showcasing a feasible low-cost, low-latency, high-throughput practice for large-model inference. The TensorRT-LLM team also added similar support for StreamingLLM during the same period.
Colossal-AI Community News
Colossal-AI, a PyTorch-based AI system, leverages efficient multi-dimensional parallelism, heterogeneous memory management, and other techniques to reduce the development and deployment costs of AI large model training, fine-tuning, and inference. It enhances model task performance, cuts down hardware costs, and more. In just over a year, it has gained more than 35,000 GitHub stars in the open-source community.
As an active open-source AI system community, Colossal-AI has come up with several updates in different areas.
Colossal-LLaMA-2-13B is open-sourced now!
The Colossal-AI team has just released the Colossal-LLaMA-2-13B model. This model is obtained by fine-tuning the Llama-2 model with only 25B tokens of data and $5,000 USD of computational cost. Despite its low cost, Colossal-LLaMA-2 outperforms other LLaMA-based Chinese models by a wide margin. Even compared to models open-sourced by Chinese tech giants and pre-trained from scratch at a cost of potentially millions of dollars, Colossal-LLaMA-2's performance is still outstanding. This 13B version has a more refined data composition and delivers strong performance in factual knowledge, text understanding, safety, and human-aligned values.
Colossal-AI Cloud Platform
Aiming to integrate Colossal-AI system optimization with low-cost computing resources, the Colossal-AI cloud platform has recently released AI cloud servers, making it convenient for users to develop and debug large AI models on bare metal. It also provides tools such as Jupyter Notebook, SSH, port forwarding, and Grafana monitoring for a smooth development experience. At the same time, Docker images containing the Colossal-AI code repository and runtime environment have been prepared, allowing users to run code samples from the Colossal-AI code repository with a single click, without environment setup or resource configuration.
Reference
Xiao, Guangxuan, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. "Efficient streaming language models with attention sinks." arXiv preprint arXiv:2309.17453 (2023).