Scaling Up

When and how to move from the single-node default to a multi-node, fully-async run

The recommended way to run RL is the single-node path documented throughout this section: one 8×GPU host where verl runs the trainer and vLLM serves the policy on the same GPUs, driven by sync_1node_cc.sh with the claude-code scaffold. It is the lowest-cost way to get a real RL run going: one machine, one launch script, no cross-node plumbing.

This page is the advanced path: what changes when one host is no longer enough. You almost certainly do not need it to start.

This costs more — reach for it only when you must

The multi-node / fully-async stack needs multiple machines, more setup, and more operational care (cross-node networking, a separate rollout pool, staleness tuning). Stay on the single-node path until you actually hit its ceiling: a large MoE policy that won't fit one host's training + serving budget, very long contexts (≈131k), or rollout throughput that bottlenecks a synchronous loop.

When to scale

Signal on the single-node run | Scaling lever
Policy is a large MoE (e.g. 30B-A3B) and trainer + vLLM won't co-fit | veomni engine + a dedicated rollout node
Trainer GPUs idle most of the step waiting on rollouts (idle_ratio high) | Fully-async trainer/rollout decoupling
Context window must reach ≈131k (40k prompt + 91k response) | Multi-node sequence parallel (SP)
You need a different agent harness than claude-code | OH-SDK scaffold

Fully-async, multi-node

The synchronous loop runs generate → execute → reward → update in lockstep on shared GPUs. The fully-async path instead splits the cluster into a rollout pool and a trainer pool that run concurrently: rollouts (which are dominated by sandbox/env execution, not GPU compute) stream into a queue while the trainer consumes them, so neither side blocks the other.
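
The decoupling described above can be pictured as a bounded producer/consumer queue. The following is a minimal sketch using Python's asyncio, not the verl implementation: rollout_worker, trainer, run, and the sleep durations are hypothetical stand-ins for sandbox execution and the gradient update.

```python
import asyncio
import random


async def rollout_worker(worker_id: int, queue: asyncio.Queue, n_trials: int) -> None:
    """Hypothetical rollout worker: each trial is dominated by env latency."""
    for trial in range(n_trials):
        await asyncio.sleep(random.uniform(0.01, 0.03))  # stand-in for sandbox execution
        await queue.put((worker_id, trial))  # blocks only if the queue is full


async def trainer(queue: asyncio.Queue, batch_size: int, n_steps: int) -> list:
    """Hypothetical trainer: pulls finished rollouts as they arrive, one batch per step."""
    batches = []
    for _ in range(n_steps):
        batch = [await queue.get() for _ in range(batch_size)]
        await asyncio.sleep(0.005)  # stand-in for the gradient update
        batches.append(batch)
    return batches


async def run(n_workers: int = 4, trials_per_worker: int = 4, batch_size: int = 4) -> list:
    # A bounded queue applies backpressure so rollouts cannot run arbitrarily far ahead.
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)
    n_steps = n_workers * trials_per_worker // batch_size
    workers = [
        asyncio.create_task(rollout_worker(i, queue, trials_per_worker))
        for i in range(n_workers)
    ]
    batches = await trainer(queue, batch_size, n_steps)
    await asyncio.gather(*workers)
    return batches


if __name__ == "__main__":
    for step, batch in enumerate(asyncio.run(run())):
        print(step, batch)
```

Because workers finish at different times, a batch usually mixes trials from several workers; neither side waits for the other to complete a full lockstep phase.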

The reference launch script is:

repos/harbor-verl-train/scripts/fully_async_3nodes_qwen35_ohsdk_veomni.sh

Its default layout is 3 nodes — 2 trainer nodes + 1 rollout node — with SP=4 to fit the ≈131k window. A few knobs govern the async behaviour:

Knob | Role
staleness_threshold | How many param-versions stale a rollout may be before the trainer waits (0 = strict on-policy; 1 tolerates a throughput dip)
train_bsz × n_resp_per_prompt | Effective batch = prompts × rollouts/prompt. Official 64×8; smoke 8×4
max_concurrent_samples | How many trials run in parallel; the main lever for hiding env-execution latency
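
As a worked example of the first two knobs, the sketch below computes the effective batch for the official and smoke settings and shows one plausible reading of the staleness gate. The helper names and the exact comparison in is_too_stale are assumptions for illustration, not verl's code.

```python
def effective_batch(train_bsz: int, n_resp_per_prompt: int) -> int:
    """Trajectories per update = prompts x rollouts per prompt."""
    return train_bsz * n_resp_per_prompt


def is_too_stale(rollout_version: int, trainer_version: int, staleness_threshold: int) -> bool:
    """Assumed semantics: a rollout is usable while it lags the trainer
    by at most `staleness_threshold` parameter versions."""
    return trainer_version - rollout_version > staleness_threshold


print(effective_batch(64, 8))  # official setting: 512
print(effective_batch(8, 4))   # smoke setting: 32

# With threshold 0 (strict on-policy) a rollout one version behind is rejected;
# with threshold 1 the same rollout is accepted.
print(is_too_stale(rollout_version=4, trainer_version=5, staleness_threshold=0))  # True
print(is_too_stale(rollout_version=4, trainer_version=5, staleness_threshold=1))  # False
```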

Smoke before the official run

Always validate a new scaffold/cluster with the smoke batch (TRAIN_BSZ=8 N_RESP=4) before the official 64×8. The small batch is ~8× faster per step but gives noisy gradients and degenerate all-same-reward groups — it is for plumbing validation, not for reading learning quality.
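
To see why the smoke batch produces more degenerate groups, here is a small calculation under loudly stated assumptions: rewards are binary pass/fail, rollouts for a prompt are independent, and the 20% pass rate is purely illustrative. Under those assumptions a group is all-same-reward when every rollout passes or every rollout fails.

```python
def p_all_same_reward(pass_rate: float, n_resp: int) -> float:
    """Probability that all n_resp independent binary rewards agree
    (all pass or all fail). Assumes i.i.d. rollouts per prompt."""
    return pass_rate ** n_resp + (1.0 - pass_rate) ** n_resp


for n_resp in (4, 8):
    print(n_resp, round(p_all_same_reward(0.2, n_resp), 4))
# 4 0.4112
# 8 0.1678
```

In this illustration roughly 41% of groups are degenerate at 4 rollouts per prompt versus about 17% at 8, which is one reason smoke-run reward curves should not be read as learning quality.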

veomni engine (MoE)

For MoE policies, set USE_NEW_VERL=1 so import verl resolves to the verl-swe_agent_opd_dev checkout (on PYTHONPATH). Only that tree has the veomni engine_workers router-replay wiring (actor.veomni.router_replay.mode) and the async-rollouter routed_experts concat fixes; the old installed verl's veomni engine is unvalidated for this path. Dense models can stay on FSDP.
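
Before launching, it can help to confirm which tree import verl actually resolves to. The following is a minimal sketch using only the standard library; module_origin and resolves_under are hypothetical helper names, and matching on the checkout directory name is an assumption about how the tree is laid out on disk.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def module_origin(name: str) -> Optional[Path]:
    """Return the file a top-level module would be loaded from, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


def resolves_under(name: str, checkout_dir_name: str) -> bool:
    """True if the module's path contains a directory with the given name."""
    origin = module_origin(name)
    return origin is not None and checkout_dir_name in origin.parts


if __name__ == "__main__":
    # Assumed directory name, taken from the checkout named in the text above.
    print(module_origin("verl"))
    print(resolves_under("verl", "verl-swe_agent_opd_dev"))
```

If the second line prints False, PYTHONPATH is probably not placing the checkout ahead of the installed package.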

Setup write-up: repos/harbor-verl-train/docs/training_env/veomni-engine-setup-and-run-20260610.md · routing coverage fix: …/r3-routing-coverage-rootcause-fix-20260610.md

OH-SDK scaffold

claude-code is the single-node default. For the multi-node path the validated alternative is the OpenHands-SDK scaffold, which uses an image-mounted runtime (the SDK pre-baked into the agent image) instead of an in-pod venv install — the latter fails on no-egress task pods. Select it with HARBOR_AGENT_NAME=null so harbor loads the mounted-runtime-aware import_path class rather than the registry default.

Scaffold pitfall (name vs import_path): repos/harbor-verl-train/docs/training_env/ohsdk-agent-name-vs-import-path-20260613.md

What to watch on a fully-async run

Beyond the single-node metrics, the async path adds a few signals worth a glance each step:

Metric | Read
fully_async/trainer/idle_ratio | Fraction of the step the trainer waits on rollouts; high = rollout-bound (raise concurrency or rollout capacity)
rollout_corr/kl | Rollout-vs-training logprob fidelity; should be ≈3e-4; a spike means scaffold/routing corruption
fully_async/partial/partial_ratio | Share of partial (staleness-bounded) rollouts
trajectory_filter/invalid_ratio | Share of trajectories dropped before the update
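
A per-step check over these metrics might look like the sketch below. The metric keys come from the table above; the function name and both thresholds (idle_ratio_max, kl_max) are hypothetical placeholders to tune per run, not recommended values.

```python
def check_async_metrics(metrics, idle_ratio_max=0.5, kl_max=3e-3):
    """Return human-readable warnings for one step's metrics.

    Thresholds are illustrative assumptions: kl_max is set to 10x the
    ~3e-4 level the rollout-vs-training KL is expected to sit at.
    """
    warnings = []
    idle = metrics.get("fully_async/trainer/idle_ratio")
    if idle is not None and idle > idle_ratio_max:
        warnings.append(
            f"rollout-bound: idle_ratio={idle:.2f}; raise concurrency or rollout capacity"
        )
    kl = metrics.get("rollout_corr/kl")
    if kl is not None and kl > kl_max:
        warnings.append(
            f"logprob mismatch: rollout_corr/kl={kl:.1e}; check scaffold/routing"
        )
    return warnings


step_metrics = {
    "fully_async/trainer/idle_ratio": 0.72,
    "rollout_corr/kl": 2.8e-4,
}
for message in check_async_metrics(step_metrics):
    print(message)
```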

Advanced-path failure modes

The async/MoE path has its own silent killers — a reward drop traced to trajectory filtering / sequence-distribution drift, routing-coverage loss under multi-turn replay, and importance-sampling ESS collapse on very long responses (set rollout_is=null and use token-level IS, not sequence-level). Start from the v6 analysis: repos/harbor-verl-train/docs/training_env/reward_drop_analysis_v6_20260612.md.
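
The ESS collapse on long responses can be illustrated numerically. This is a toy model, not the trainer's estimator: it assumes per-token log importance ratios are i.i.d. Gaussian with a small, illustrative standard deviation, so a sequence-level log-weight (the sum over tokens) has a standard deviation that grows with the square root of the response length. The n_tokens=1 row stands in for a token-level weight.

```python
import math
import random


def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum w^2, computed stably by shifting by the max log-weight.
    ESS is unchanged by adding a constant to every log-weight."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(x * x for x in w)


def sequence_log_weights(n_seqs, n_tokens, per_token_std, rng):
    """Sample sequence-level log-weights directly: the sum of n_tokens
    i.i.d. N(0, std^2) terms is N(0, n_tokens * std^2)."""
    seq_std = per_token_std * math.sqrt(n_tokens)
    return [rng.gauss(0.0, seq_std) for _ in range(n_seqs)]


rng = random.Random(0)
for n_tokens in (1, 1_000, 91_000):
    log_w = sequence_log_weights(512, n_tokens, per_token_std=0.02, rng=rng)
    print(n_tokens, round(effective_sample_size(log_w), 1))
```

In this toy setting the single-token weights keep ESS close to the sample count, while at 91k tokens a handful of sequences dominate the weighted sum.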
