Verl-SWE-RL

Run Training

Preflight, launch, and monitor a training run

An rl run boots vLLM to serve the policy, starts a LiteLLM proxy in front of it, and lets verl drive a synchronous PPO/GRPO/GSPO loop while a coding agent rolls out across tasks inside Harbor sandboxes. This section covers the full path: validating the config, launching, and finding the outputs.

The flow is driven from a single config.yaml and executed by scripts/start.sh:

bash scripts/dryrun.sh   # validate first (no side effects)
bash scripts/start.sh    # bootstrap venv if missing, then run sync_1node_cc training

start.sh auto-runs setup_env.sh if .venv is missing, then execs scripts/train_1node_cc.sh — a thin wrapper that reads config.yaml, exports the env vars, and execs the upstream repos/harbor-verl-train/scripts/sync_1node_cc.sh.

Launch in the background

Training runs for hours. /rl:run launches it with nohup setsid and writes logs/launch_<ts>.log. If you launch by hand, wrap the command the same way or use tmux so the run survives shell disconnects.
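A manual launch along the lines described above might look like the following sketch. The helper name launch_detached and the timestamp format are assumptions made for illustration; only nohup setsid and the logs/launch_<ts>.log naming come from the text, and /rl:run may differ in detail.

```shell
# Hypothetical helper: run a command detached from the terminal and
# send its output to logs/launch_<ts>.log. Prints the log path.
launch_detached() {
  mkdir -p logs
  local ts log
  ts="$(date +%Y%m%d_%H%M%S)"   # assumed timestamp format
  log="logs/launch_${ts}.log"
  # setsid starts a new session; nohup ignores SIGHUP on disconnect.
  nohup setsid "$@" > "$log" 2>&1 < /dev/null &
  echo "$log"
}

# Example (validate first, then launch in the background):
#   bash scripts/dryrun.sh && launch_detached bash scripts/start.sh
#   tail -f logs/launch_*.log
```

Because the process runs in its own session with stdin closed, closing the SSH connection should not stop it. Follow progress with tail -f on the printed log path.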

The boot sequence

  1. venv — bootstrapped on first run (or reused via environment.venv_path).
  2. vLLM — verl launches DP × TP replicas; CUDA-graph capture on a 30B MoE (TP=4) is ~10–20 min before the first replica registers. The launch script waits up to 30 min.
  3. LiteLLM proxy — started on :8002 once it discovers the vLLM Ray actors; serves the policy to the agent.
  4. verl loop — generate rollouts → run trials in Harbor → reward → update → checkpoint every save_freq steps.
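Since steps 2 and 3 can take many minutes, it can help to poll for readiness instead of watching the log. The sketch below is a generic retry loop, not the launch script's own wait logic. The helper name wait_until is hypothetical, and the health URL in the usage comment assumes the proxy exposes LiteLLM's liveliness endpoint on the port mentioned above.

```shell
# wait_until TIMEOUT_SECS INTERVAL_SECS CMD...
# Retry CMD until it succeeds (returns 0) or the timeout elapses (returns 1).
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@" > /dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Example (assumed endpoint; 30 min budget to match the launch script's wait):
#   wait_until 1800 15 curl -fsS http://localhost:8002/health/liveliness \
#     && echo "proxy is up" || echo "proxy did not come up in time"
```

Any command that exits 0 on success can be used as the probe, so the same loop works for checking that a checkpoint directory has appeared.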

In this section

  • Preflight — what dryrun.sh / /rl:check validate, including KV-head divisibility
  • Inference Stack — the claude-code → LiteLLM → vLLM chain and its health checks
  • Backends — Kubernetes vs Docker sandboxes
  • Results & Artifacts — logs, checkpoints, trajectories, archives
  • Scaling Up — advanced: multi-node, fully-async, veomni/MoE, OH-SDK (higher cost)

Start single-node

The single-node path above is the minimal-cost way to run rl — one 8×GPU host, one launch script. Only move to multiple machines when you outgrow it; see Scaling Up.

Stop cleanly between runs

After a run, bash scripts/clean.sh removes transient Ray/LiteLLM state. Checkpoints and wandb runs are preserved. Add --logs / --trials / --pods to also clear those. Never broad-kill GPU processes on a shared host without RL_CLEAN_FORCE_GPU_KILL=1.
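The opt-in idea behind RL_CLEAN_FORCE_GPU_KILL can be illustrated with a small guard. This is a sketch of the general pattern only, not the actual logic of clean.sh; the function name gpu_kill_allowed is hypothetical.

```shell
# Hypothetical guard: succeed only when the operator has explicitly opted in.
gpu_kill_allowed() {
  [ "${RL_CLEAN_FORCE_GPU_KILL:-0}" = "1" ]
}

if gpu_kill_allowed; then
  echo "RL_CLEAN_FORCE_GPU_KILL=1 set: broad GPU cleanup permitted"
else
  echo "skipping broad GPU cleanup (set RL_CLEAN_FORCE_GPU_KILL=1 to allow it)"
fi

# Documented invocations from the text above:
#   bash scripts/clean.sh                    # transient Ray/LiteLLM state only
#   bash scripts/clean.sh --logs --trials    # also clear logs and trials
```

Defaulting to "skip" keeps other users' jobs on a shared host safe unless someone deliberately sets the variable.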
