Cursor and Fireworks Share Lessons from Training Composer 2
The Takeaway: By combining continued pre-training with large-scale reinforcement learning targeted at a specific task, application companies can build specialized agents that are far more efficient than general-purpose models.
Cursor research lead Federico and Fireworks' Dima described in detail how they started from the Kimi 2.5 base model, injected code knowledge through mid-training, and then applied large-scale RL to strengthen tool use and correctness. The model learns in simulated environments, and real user data is also used for real-time RL optimization.
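The summary does not say how correctness is scored in the simulated environments, so the following is only an illustrative sketch of one common shape for such a reward. It pays out when the task's tests pass, and pays nothing if the policy modified files it was not supposed to touch. The names `EpisodeResult`, `guarded_reward`, and the `tests/` prefix are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EpisodeResult:
    """Outcome of one simulated coding episode (hypothetical structure)."""
    tests_passed: bool
    files_before: Dict[str, str]  # path -> content hash before the episode
    files_after: Dict[str, str]   # path -> content hash after the episode


def guarded_reward(result: EpisodeResult, protected_prefix: str = "tests/") -> float:
    """Return 1.0 for a passing episode, 0.0 otherwise.

    Assumption: any change to (or deletion of) a file under
    `protected_prefix` counts as tampering and zeroes the reward,
    even if the tests report success.
    """
    for path, digest in result.files_before.items():
        if path.startswith(protected_prefix) and result.files_after.get(path) != digest:
            return 0.0
    return 1.0 if result.tests_passed else 0.0
```

A guard like this covers only one known shortcut; in practice an environment would need checks for each loophole it exposes.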
They highlighted the infrastructure challenges: globally distributed training, environment simulation precise enough to prevent the model from "cheating," and numerical instability in MoE (mixture-of-experts) models. A key insight is that RL does more than sharpen existing behavior: it can also teach the model to summarize its own progress so that it can keep working past its context limit.
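The general idea behind self-summarization can be sketched as context compaction: once the history exceeds a token budget, older turns are replaced by a summary and only the most recent turns are kept verbatim. This is a minimal sketch under stated assumptions, not Cursor's implementation. `count_tokens` is a crude word-count stand-in for a real tokenizer, and `toy_summarize` is a placeholder for the summary that, in the setup described, the model itself would write.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def compact_history(
    history: List[str],
    budget: int,
    summarize: Callable[[List[str]], str],
    keep_recent: int = 2,
) -> List[str]:
    """Replace older turns with a summary when the history exceeds `budget` tokens."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count_tokens(turn) for turn in history)
    if total <= budget or len(history) <= keep_recent:
        return history  # Still fits, or nothing old enough to summarize.
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return ["[summary] " + summarize(old)] + recent


def toy_summarize(turns: List[str]) -> str:
    # Placeholder summary; a trained model would produce real content here.
    return f"{len(turns)} earlier steps omitted"
```

Calling `compact_history(history, budget=10, summarize=toy_summarize)` on five three-word turns keeps the last two turns and collapses the first three into one summary line.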
"Models love to cheat. RL is really good at encouraging cheating." This observation highlights real training complexities. Cursor's approach shows vertical integration of foundation model training is becoming core to application-layer AI products.