Using only 8.5 billion tokens of data, 15 hours of training, and a cost of a few hundred USD, the Colossal-AI team built a high-performing Chinese LLaMA-2 7B model that demonstrated outstanding performance across multiple benchmark evaluations.
Building on that initial framework, the Colossal-AI team began the next iteration of the model. They constructed a more refined and comprehensive data architecture, used 25 billion tokens of data, and ultimately produced a 13B model, later open-sourcing the model code and the corresponding weights.
Note: Scores are based on ColossalEval; scores in parentheses are taken from the official leaderboards of the corresponding models, and C-Eval scores are from the official leaderboard.
In the English MMLU ranking, Colossal-LLaMA-2-13B-base shows a steady improvement in English performance, attributable to low-cost incremental pre-training. Notably, on the GSM8k evaluation there is a significant jump in English mathematical and reasoning capability (31.31 -> 58.83), outperforming all other 13B models.
For Chinese rankings, we primarily compared CMMLU, AGIEVAL, GAOKAO, and C-Eval. Colossal-LLaMA-2 surpasses other Chinese models based on LLaMA-2 by a wide margin, and its performance remains outstanding even against models pre-trained from scratch by well-known Chinese companies at a potential cost of millions of dollars. Particularly noteworthy is the leap in Chinese proficiency over the original LLaMA-2 (CMMLU: 38.14 -> 61.8).
The loss curve recorded throughout the training process shows that, while benefiting from the cost reduction and efficiency gains of the Colossal-AI system, the model converges steadily. Remarkably, these results were achieved with only about 25 billion tokens and roughly $5,000 USD in compute. This is in contrast to prevalent large-scale models on the market, which are trained on several trillion tokens at substantial computational expense to reach comparable effectiveness.
High-quality data is critical to significantly reducing training costs, especially for incremental pre-training, which places strict requirements on both the quality and the distribution of data. While training the 7B version, the Colossal-AI team built a data cleansing system and toolkit to filter high-quality data for incremental pre-training, in the spirit of the sketch below.
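A pipeline of this kind typically combines deduplication with heuristic quality filters. The following Python sketch illustrates the general idea only; the thresholds, the character-ratio rule, and the MD5-based deduplication are illustrative assumptions, not the team's actual toolkit.

```python
import hashlib
import re


def is_high_quality(text: str, min_chars: int = 200, max_symbol_ratio: float = 0.3) -> bool:
    """Heuristic quality check: drop documents that are too short or
    dominated by non-linguistic symbols (thresholds are illustrative)."""
    if len(text) < min_chars:
        return False
    symbols = re.findall(r"[^0-9A-Za-z\u4e00-\u9fff\s]", text)
    return len(symbols) / max(len(text), 1) <= max_symbol_ratio


def deduplicate(corpus):
    """Exact deduplication by content hash; real pipelines usually add
    fuzzy deduplication (e.g. MinHash) on top of this."""
    seen, unique = set(), []
    for doc in corpus:
        digest = hashlib.md5(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


raw_corpus = ["example crawled document one ...", "example crawled document two ..."]
cleaned = [doc for doc in deduplicate(raw_corpus) if is_high_quality(doc)]
```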
In contrast to the 7B version, training of the 13B model uses a more refined data architecture that categorizes data into knowledge-based, functional, and memory replay data. Knowledge-based data is subdivided into more than a dozen major categories, including finance, law, and education, with each major category further divided into subcategories to enable precise control over different data. The scale of data from the various verticals was also increased so that the model gains a strong grasp of data from diverse domains.
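A mixture organized along these lines could be described by a configuration such as the one below. The top-level categories match those named above, but the subcategories and sampling weights are purely illustrative assumptions, not the released recipe.

```python
# Hypothetical layout of a categorized pre-training mixture.
# Category names follow the description above; weights and
# subcategories are illustrative assumptions only.
DATA_MIXTURE = {
    "knowledge": {
        "finance":   {"weight": 0.08, "subcategories": ["banking", "markets"]},
        "law":       {"weight": 0.06, "subcategories": ["statutes", "case_law"]},
        "education": {"weight": 0.07, "subcategories": ["k12", "higher_education"]},
        # ... further vertical domains
    },
    "functional": {
        "summarization":          {"weight": 0.05},
        "information_extraction": {"weight": 0.04},
        "reasoning_chains":       {"weight": 0.04},
    },
    "memory_replay": {
        "general_corpus_replay": {"weight": 0.10},
    },
}
```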
To address the community's demand for functional capabilities in large models, targeted enhancements were made for different natural language processing tasks. This ensures that, already during pre-training, the model reaches a certain level of understanding and proficiency in common NLP tasks such as text summarization, information extraction, and comprehension of complex reasoning chains.
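One common way to inject such functional data into a pre-training corpus is to render supervised task pairs as plain text. The templates below are hypothetical examples of that technique, not the team's actual formats.

```python
def format_summarization(document: str, summary: str) -> str:
    # Render a summarization pair as plain pre-training text (illustrative template).
    return f"Article: {document}\nSummary: {summary}"


def format_extraction(passage: str, question: str, answer: str) -> str:
    # Render an information-extraction triple the same way (illustrative template).
    return f"Passage: {passage}\nQuestion: {question}\nAnswer: {answer}"
```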
Furthermore, memory replay data is a crucial component for consolidating knowledge the model has already acquired, effectively enhancing its overall performance and generalization ability.
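A minimal sketch of what replay mixing can look like, assuming a fixed replay ratio, is shown below; the 10% figure and the uniform sampling are assumptions for illustration only.

```python
import random


def build_training_mix(new_domain_docs, replay_docs, replay_ratio=0.1, seed=0):
    """Interleave a fraction of previously seen (replay) documents into the
    incremental pre-training stream to mitigate catastrophic forgetting.
    The replay ratio is an illustrative assumption."""
    rng = random.Random(seed)
    n_replay = min(int(len(new_domain_docs) * replay_ratio), len(replay_docs))
    mix = list(new_domain_docs) + rng.sample(list(replay_docs), n_replay)
    rng.shuffle(mix)
    return mix
```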
To address growing concerns about safety, the Colossal-AI team implemented multidimensional enhancements covering political sensitivity, religious sensitivity, abusive language, hatred, bias, illegal activities, physical harm, mental health, property and privacy, morality and ethics, and more, to ensure that the foundational model is robustly safe and aligned with correct values.
With this multidimensional data constructed and the foundational model's natural language capabilities strengthened, the Colossal-AI team developed a more powerful 13B model. Using it as a base, community users need only a small amount of high-quality fine-tuning data to build a personalized model at low cost, as illustrated below.
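As a concrete starting point, a minimal supervised fine-tuning script with Hugging Face transformers might look like the following. The model identifier, hyperparameters, and toy dataset are assumptions for illustration; consult the official repository for the recommended recipe.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "hpcai-tech/Colossal-LLaMA-2-13b-base"  # assumed Hugging Face identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy stand-in for a small, high-quality instruction-tuning set.
train_data = Dataset.from_dict({
    "text": ["Instruction: Summarize the passage below.\nResponse: A short summary."]
})


def tokenize(example):
    # Causal-LM fine-tuning: labels are a copy of the input ids.
    out = tokenizer(example["text"], truncation=True, max_length=2048)
    out["labels"] = out["input_ids"].copy()
    return out


train_data = train_data.map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="colossal-llama2-13b-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_data,
)
trainer.train()
```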