Motivation

Verl-SWE-RL (the rl block) trains a SWE coding agent with online reinforcement learning — the policy improves by repeatedly attempting real tasks and being rewarded for the patches that make the tests pass.

It is the final stage of the SWE-Lego-Live pipeline, after task curation (swegen), trajectory generation (trajgen), and supervised fine-tuning (sft). Where sft imitates known-good trajectories, rl optimizes the policy directly against the environment's own reward signal, in a closed self-improvement loop.

swegen ─▶ trajgen ─▶ sft ─▶ rl

rl provides:

End-to-end online RL with verl — PPO, GRPO, and GSPO over full agent rollouts, not single-turn completions
Self-served policy via vLLM (verl-managed, DP × TP), fronted by a per-host LiteLLM proxy so the agent speaks an ordinary Anthropic/OpenAI API
Containerized rollouts through Harbor — K8s pods in production, local/remote Docker for a minimal setup
Verifier-grounded reward — each task's own test script decides reward, so the model is trained on "did it actually fix the bug"
A live training dashboard over the run logs, previewable locally or published to Cloudflare Pages

Where to go next

Getting Started — prerequisites and your first training run
Core Concepts — the inference chain, trials and rewards, the verl loop, backends
Run Training — preflight, launch, monitor, and where outputs land
Dashboard — watch reward, KL, MFU, and response length live
Reference — the config-driven input/output contract
