RL Framework Worklog I -- Handling Off-policy
Training large language models with reinforcement learning runs into a fundamental issue: off-policiness. As the policy updates, rollouts still arriving from earlier checkpoints become stale, and their distribution drifts away from the current model's. In latency-heavy, agentic/tool-calling settings (web calls, DB queries, GUI actions), rollouts can take minutes, so