Motivation
Why we built rl
Verl-SWE-RL (the rl block) trains a SWE coding agent with online
reinforcement learning — the policy improves by repeatedly attempting real
tasks and being rewarded for the patches that make the tests pass.
It is the final stage of the SWE-Lego-Live
pipeline, after task curation (swegen), trajectory generation (trajgen), and
supervised fine-tuning (sft). Where sft imitates known-good trajectories, rl
optimizes the policy directly against the environment's own reward signal, in a
closed self-improvement loop.
swegen ─▶ trajgen ─▶ sft ─▶ rl
rl provides:
- End-to-end online RL with verl — PPO, GRPO, and GSPO over full agent rollouts, not single-turn completions
- Self-served policy via vLLM (verl-managed, DP × TP), fronted by a per-host LiteLLM proxy so the agent speaks an ordinary Anthropic/OpenAI API
- Containerized rollouts through Harbor — K8s pods in production, local/remote Docker for a minimal setup
- Verifier-grounded reward — each task's own test script decides reward, so the model is trained on "did it actually fix the bug"
- A live training dashboard over the run logs, previewable locally or published to Cloudflare Pages
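To make the verifier-grounded reward idea concrete, here is a minimal sketch of how a test script's outcome could be turned into a scalar reward. It is illustrative only and is not rl's actual implementation: the function name `verifier_reward`, the binary 1.0/0.0 scheme, the timeout default, and running the command locally (instead of inside a Harbor-managed container) are all assumptions.

```python
import subprocess
import sys
from typing import Optional, Sequence


def verifier_reward(
    test_cmd: Sequence[str],
    cwd: Optional[str] = None,
    timeout_s: float = 600.0,
) -> float:
    """Hypothetical helper: run a task's test command and map it to a reward.

    Assumption: the reward is binary, 1.0 when the command exits with
    status 0 and 0.0 otherwise. A timeout is treated as a failure.
    """
    try:
        result = subprocess.run(
            list(test_cmd),
            cwd=cwd,
            capture_output=True,  # keep test output out of the trainer's logs
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```

Under these assumptions, a patch only earns reward when the task's own tests pass after it is applied, so the signal tracks whether the bug was fixed, not how closely the patch resembles a reference solution.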
Where to go next
- Getting Started — prerequisites and your first training run
- Core Concepts — the inference chain, trials and rewards, the verl loop, backends
- Run Training — preflight, launch, monitor, and where outputs land
- Dashboard — watch reward, KL, MFU, and response length live
- Reference — the config-driven input/output contract