🚀 New: RUNRL JOB Is Live on HPC-AI.COM

Reinforcement fine-tuning (RFT) is powerful — but let’s face it: it used to be a pain to run. Dual networks, huge memory needs, tons of config files...
That’s why we built RUNRL JOB — the easiest way to run RFT workloads like GRPO directly on HPC-AI.COM. No complicated setup. Just pick your model, launch your job, and go.
Watch the demo video
🧠 Why Everyone’s Talking About RFT (and GRPO)
Reinforcement Fine-Tuning (RFT) has become the go-to method for aligning language models — but many popular approaches like PPO are resource-intensive. Enter GRPO: a lightweight, critic-free alternative that’s stable, fast, and efficient. No separate value network, no double backward pass, and a much smaller memory footprint.
GRPO maintains the trust-region stability of PPO while cutting memory usage by over 40%, making it ideal for LLM reasoning, code generation, and complex math tasks. It’s a smarter way to fine-tune — perfect if you want to reduce costs and speed up your workflow without sacrificing quality.
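The core trick behind GRPO's smaller footprint is easy to sketch: instead of training a separate value network to estimate a baseline (as PPO does), GRPO samples a group of responses for each prompt and normalizes each response's reward against the group's own mean and standard deviation. A minimal illustration of that advantage computation (not the full algorithm, which also includes the clipped policy-ratio objective and a KL penalty):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for the G responses sampled for one prompt.

    GRPO replaces PPO's learned critic with the group itself as the
    baseline: advantage = (reward - group mean) / group std. Responses
    that beat their siblings get positive advantage; the rest, negative.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled responses with rewards 10, 1, 1, 0 —
# only the first stands out, so only it gets a positive advantage.
advs = grpo_advantages([10.0, 1.0, 1.0, 0.0])
```

Because the baseline comes for free from the sampled group, there is no value network to store or update — which is where the memory savings come from.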
💡 GRPO on Qwen2.5-3B: Our Results
We’ve put the GRPO algorithm from DeepSeek to the test using the Qwen2.5-3B-Base model, and the results are exciting.
Reward Function Design:
- Reward = 0 if the format is incorrect.
- Reward = 1 if the format is correct but the result is incorrect.
- Reward = 10 if both the format and result are correct.
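The tiered scheme above is straightforward to express in code. Here is a hedged sketch, assuming the conversation template asks the model to wrap its final answer in `<answer>...</answer>` tags (the actual tag format is defined in the linked template, so treat this as illustrative):

```python
import re

def tiered_reward(response: str, expected_answer: str) -> float:
    """Three-tier reward: 0 for bad format, 1 for correct format with a
    wrong result, 10 when both format and result are correct.

    Assumes (hypothetically) that the answer is wrapped in
    <answer>...</answer> tags per the conversation template.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0   # format incorrect
    if match.group(1).strip() != expected_answer:
        return 1.0   # format correct, result incorrect
    return 10.0      # format and result both correct
```

The large gap between the format tier (1) and the correctness tier (10) pushes the model to actually solve the task rather than merely learn the output format.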
We provide a conversation template and settings to verify GRPO, demonstrated on the Qwen2.5-3B-Base model. You can find the template here: 🔗 Qwen2.5-3B Conversation Template
Ready to start training? Just run this bash script: 🔗 GRPO Training Script
In the GRPO section, we also share insights from the verification process and detailed explanations of the key parameters for your reference.
The code features a flexible reward function template, so you can customize the scoring system to fit your specific needs.
Even on a 3B model, we observe steady improvements in reward scores and response length over training — clear evidence that GRPO boosts reasoning and output quality.
⚙️ Run GRPO in One Click on HPC-AI.COM
Want to try it yourself? Now it’s easier than ever.
We’ve made RFT completely plug-and-play on HPC-AI.COM. With the RUNRL JOB feature, you can launch full reinforcement fine-tuning workflows — like GRPO — with zero setup and maximum flexibility.
No friction. No boilerplate. Just launch, fine-tune, and go.