Hanif Leoputera

RL Framework Worklog I -- Handling Off-policy

Training large language models with reinforcement learning runs into a fundamental issue: off-policiness. As the policy updates, the rollouts still arriving from earlier checkpoints become stale—their distribution drifts away from the current model. In latency-heavy, agentic/tool-calling settings (web calls, DB queries, GUI actions), rollouts can take minutes, so

Anatomy of RL Frameworks

Part 1: A Deep Dive into OpenRLHF, VERL, Slime, Verifiers, and AReaL

Ever since DeepSeek-V3, the RLHF ecosystem has exploded with new frameworks for running RLVR training. Unlike conventional supervised learning, RL training poses unique MLsys challenges: we have to run inference and training in tandem. As the VERL paper

Plural Minds: Exploration-First Inference for LLMs

An experiment on improving LLM output diversity

Most people have seen this duck-rabbit illusion. Do you see a duck? A rabbit? Try to see both simultaneously: you can't, but you know both are there. Your brain maintains both interpretations, flipping between them based on subtle attentional shifts. I
