Verl-SWE-RL

Motivation

Why we built rl

Verl-SWE-RL (the rl block) trains a SWE coding agent with online reinforcement learning — the policy improves by repeatedly attempting real tasks and being rewarded for the patches that make the tests pass.

It is the final stage of the SWE-Lego-Live pipeline, after task curation (swegen), trajectory generation (trajgen), and supervised fine-tuning (sft). Where sft imitates known-good trajectories, rl optimizes the policy directly against the environment's own reward signal, in a closed self-improvement loop.

swegen ─▶ trajgen ─▶ sft ─▶ rl

rl provides:

  • End-to-end online RL with verl — PPO, GRPO, and GSPO over full agent rollouts, not single-turn completions
  • Self-served policy via vLLM (verl-managed, DP × TP), fronted by a per-host LiteLLM proxy so the agent speaks an ordinary Anthropic/OpenAI API
  • Containerized rollouts through Harbor — K8s pods in production, local/remote Docker for a minimal setup
  • Verifier-grounded reward — each task's own test script decides reward, so the model is trained on "did it actually fix the bug"
  • A live training dashboard over the run logs, previewable locally or published to Cloudflare Pages
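
To make the verifier-grounded reward item above concrete, here is a minimal sketch of how a test script's outcome could be turned into a scalar reward. It is an illustration under stated assumptions, not the rl block's actual implementation: the function name verifier_reward, the binary 0/1 scoring, and the timeout value are all hypothetical, and in a real run the test script executes inside the task's container rather than as a local subprocess.

```python
import subprocess
import sys


def verifier_reward(test_cmd, timeout_s=600):
    """Hypothetical helper: map a test command's exit status to a reward.

    Assumption (not from the rl docs): reward is binary, 1.0 when the
    command exits with status 0 and 0.0 otherwise, including on timeout.
    """
    try:
        result = subprocess.run(test_cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A hung test run is treated as a failed attempt.
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


# Stand-ins for a task's test script: one that passes, one that fails.
passing_cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]

reward_pass = verifier_reward(passing_cmd)
reward_fail = verifier_reward(failing_cmd)
print(reward_pass, reward_fail)
```

The point of the design is that the reward depends only on what the task's tests report, so a rollout is rewarded for fixing the bug rather than for resembling a reference patch.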

Where to go next

  • Getting Started — prerequisites and your first training run
  • Core Concepts — the inference chain, trials and rewards, the verl loop, backends
  • Run Training — preflight, launch, monitor, and where outputs land
  • Dashboard — watch reward, KL, MFU, and response length live
  • Reference — the config-driven input/output contract
