Run Training
Preflight, launch, and monitor a training run
An rl run boots vLLM to serve the policy, starts a LiteLLM proxy in front of it, and lets verl drive a synchronous PPO/GRPO/GSPO loop while a coding agent rolls out across tasks inside Harbor sandboxes. This section covers the full path: validating the config, launching, and finding the outputs.
The flow is driven from a single config.yaml and executed
by scripts/start.sh:
bash scripts/dryrun.sh # validate first (no side effects)
bash scripts/start.sh # bootstrap venv if missing, then run sync_1node_cc trainingstart.sh auto-runs setup_env.sh if .venv is missing, then execs
scripts/train_1node_cc.sh — a thin wrapper that reads config.yaml, exports
the env vars, and execs the upstream repos/harbor-verl-train/scripts/sync_1node_cc.sh.
Launch in the background
Training runs for hours. /rl:run launches it with nohup setsid and writes
logs/launch_<ts>.log; doing it by hand, wrap it the same way or use tmux so
it survives shell disconnects.
The boot sequence
- venv — bootstrapped on first run (or reused via
environment.venv_path). - vLLM — verl launches DP × TP replicas; CUDA-graph capture on a 30B MoE (TP=4) is ~10–20 min before the first replica registers. The launch script waits up to 30 min.
- LiteLLM proxy — started on
:8002once it discovers the vLLM Ray actors; serves the policy to the agent. - verl loop — generate rollouts → run trials in Harbor → reward → update →
checkpoint every
save_freqsteps.
In this section
- Preflight — what
dryrun.sh//rl:checkvalidate, including KV-head divisibility - Inference Stack — the claude-code → LiteLLM → vLLM chain and its health checks
- Backends — Kubernetes vs Docker sandboxes
- Results & Artifacts — logs, checkpoints, trajectories, archives
- Scaling Up — advanced: multi-node, fully-async, veomni/MoE, OH-SDK (higher cost)
Start single-node
The single-node path above is the minimal-cost way to run rl — one 8×GPU host, one launch script. Only move to multiple machines when you outgrow it; see Scaling Up.
Stop cleanly between runs
After a run, bash scripts/clean.sh removes transient Ray/LiteLLM state.
Checkpoints and wandb runs are preserved. Add --logs / --trials / --pods
to also clear those; never broad-kill GPU processes on a shared host without
RL_CLEAN_FORCE_GPU_KILL=1.