Scaling Up
When and how to move from the single-node default to a multi-node, fully-async run
The recommended way to run rl is the single-node path documented throughout
this section: one 8×GPU host where verl runs the trainer and vLLM serves the
policy on the same GPUs, driven by sync_1node_cc.sh with the claude-code
scaffold. It is the minimal-cost way to get a real RL run going — one
machine, one launch script, no cross-node plumbing.
This page is the advanced path: what changes when one host is no longer enough. You almost certainly do not need it to start.
This costs more — reach for it only when you must
The multi-node / fully-async stack needs multiple machines, more setup, and more operational care (cross-node networking, a separate rollout pool, staleness tuning). Stay on the single-node path until you actually hit its ceiling: a large MoE policy that won't fit one host's training + serving budget, very long contexts (≈131k), or rollout throughput that bottlenecks a synchronous loop.
When to scale
| Signal on the single-node run | Scaling lever |
|---|---|
| Policy is a large MoE (e.g. 30B-A3B) and trainer + vLLM won't co-fit | veomni engine + a dedicated rollout node |
| Trainer GPUs idle most of the step waiting on rollouts (idle_ratio high) | Fully-async trainer/rollout decoupling |
| Context window must reach ≈131k (40k prompt + 91k response) | Multi-node sequence parallel (SP) |
| You need a different agent harness than claude-code | OH-SDK scaffold |
Fully-async, multi-node
The synchronous loop runs generate → execute → reward → update in lockstep on shared GPUs. The fully-async path instead splits the cluster into a rollout pool and a trainer pool that run concurrently: rollouts (which are dominated by sandbox/env execution, not GPU compute) stream into a queue while the trainer consumes them, so neither side blocks the other.
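The decoupling described above can be pictured as a producer/consumer queue. The following is a minimal sketch only, not the verl implementation: the function names, sleep timings, and batch sizes are all hypothetical, and `asyncio.sleep` stands in for sandbox/env execution.

```python
import asyncio
import random


async def rollout_worker(worker_id, queue, n_trials, seed):
    """Hypothetical rollout worker: mostly waits on the env, then enqueues a trajectory."""
    rng = random.Random(seed)
    for trial in range(n_trials):
        # Stand-in for sandbox/env execution, which dominates rollout time.
        await asyncio.sleep(rng.uniform(0.001, 0.005))
        await queue.put({"worker": worker_id, "trial": trial})


async def trainer(queue, batch_size, n_steps):
    """Hypothetical trainer: consumes trajectories as they arrive, one update per full batch."""
    batch_sizes = []
    for _ in range(n_steps):
        batch = [await queue.get() for _ in range(batch_size)]
        batch_sizes.append(len(batch))  # a real trainer would run the policy update here
    return batch_sizes


async def main(n_workers=4, trials_per_worker=4, batch_size=8):
    queue = asyncio.Queue()
    n_steps = n_workers * trials_per_worker // batch_size
    # Rollout workers and the trainer run concurrently; only the queue couples them.
    workers = [
        asyncio.create_task(rollout_worker(w, queue, trials_per_worker, seed=w))
        for w in range(n_workers)
    ]
    batch_sizes = await trainer(queue, batch_size, n_steps)
    await asyncio.gather(*workers)
    return batch_sizes


print(asyncio.run(main()))
```

In this toy version the trainer only blocks when the queue is empty, which is the situation the idle-ratio signal is meant to expose.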
The reference launch script is:
repos/harbor-verl-train/scripts/fully_async_3nodes_qwen35_ohsdk_veomni.sh
Its default layout is 3 nodes (2 trainer nodes + 1 rollout node) with SP=4 to fit the ≈131k window. A few knobs govern the async behaviour:
| Knob | Role |
|---|---|
| staleness_threshold | How many param-versions stale a rollout may be before the trainer waits (0 = strict on-policy; 1 tolerates a throughput dip) |
| train_bsz × n_resp_per_prompt | Effective batch = prompts × rollouts/prompt. Official 64×8; smoke 8×4 |
| max_concurrent_samples | How many trials run in parallel; the main lever for hiding env-execution latency |
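To make the arithmetic behind these knobs concrete, here is a small sketch. The `AsyncKnobs` class and `rollout_is_usable` helper are hypothetical illustrations, not part of the launch script; the staleness check encodes one plausible reading of the threshold (trainer version minus rollout version must not exceed it), and the `max_concurrent_samples` values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsyncKnobs:
    """Hypothetical container for the knobs in the table above."""
    train_bsz: int
    n_resp_per_prompt: int
    staleness_threshold: int
    max_concurrent_samples: int

    @property
    def effective_batch(self) -> int:
        # Effective batch = prompts x rollouts per prompt.
        return self.train_bsz * self.n_resp_per_prompt


def rollout_is_usable(rollout_version, trainer_version, staleness_threshold):
    """Assumed semantics: a rollout may lag the trainer by at most `staleness_threshold` versions."""
    return trainer_version - rollout_version <= staleness_threshold


# max_concurrent_samples values below are placeholders, not recommended settings.
OFFICIAL = AsyncKnobs(train_bsz=64, n_resp_per_prompt=8, staleness_threshold=1, max_concurrent_samples=64)
SMOKE = AsyncKnobs(train_bsz=8, n_resp_per_prompt=4, staleness_threshold=1, max_concurrent_samples=8)

print(OFFICIAL.effective_batch, SMOKE.effective_batch)
```

Under these assumptions the official configuration trains on 512 trajectories per step and the smoke configuration on 32.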
Smoke before the official run
Always validate a new scaffold/cluster with the smoke batch
(TRAIN_BSZ=8 N_RESP=4) before the official 64×8. The small batch is ~8×
faster per step but gives noisy gradients and degenerate all-same-reward groups —
it is for plumbing validation, not for reading learning quality.
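Why all-same-reward groups are degenerate can be seen in a short sketch. It assumes a group-relative advantage (reward minus the group mean), which this page does not state explicitly; both helper names are hypothetical.

```python
def group_advantages(rewards):
    """Assumed group-relative advantage: each reward minus the mean of its prompt group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def degenerate_group_fraction(rewards_by_prompt):
    """Fraction of prompt groups whose rollouts all received the same reward."""
    if not rewards_by_prompt:
        return 0.0
    degenerate = sum(1 for rewards in rewards_by_prompt if len(set(rewards)) <= 1)
    return degenerate / len(rewards_by_prompt)


# Three prompts with N_RESP=4 rollouts each (made-up rewards for illustration).
groups = [[1, 1, 1, 1], [0, 1, 0, 0], [0, 0, 0, 0]]
print(group_advantages(groups[0]))        # every advantage is zero, so no learning signal
print(degenerate_group_fraction(groups))  # share of groups contributing nothing
```

With only 4 rollouts per prompt, a group is more likely to land entirely on one reward value than with 8, which is one reason the smoke batch says little about learning quality.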
veomni engine (MoE)
For MoE policies, set USE_NEW_VERL=1 so import verl resolves to the
verl-swe_agent_opd_dev checkout (on PYTHONPATH). Only that tree has the
veomni engine_workers router-replay wiring
(actor.veomni.router_replay.mode) and the async-rollouter routed_experts
concat fixes; the old installed verl's veomni engine is unvalidated for this
path. Dense models can stay on FSDP.
Setup write-up:
repos/harbor-verl-train/docs/training_env/veomni-engine-setup-and-run-20260610.md · routing coverage fix: …/r3-routing-coverage-rootcause-fix-20260610.md
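Before launching an MoE run it can help to confirm which tree `import verl` actually resolves to. The sketch below uses only the standard library; the helper names are hypothetical, and matching on the directory name `verl-swe_agent_opd_dev` assumes the checkout path contains that string.

```python
import importlib.util


def resolved_module_origin(module_name):
    """Return the file a top-level module would be imported from, or None if not importable."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


def check_verl_checkout(expected_dir_fragment="verl-swe_agent_opd_dev"):
    """Report whether `import verl` would resolve inside the expected checkout (assumed path match)."""
    origin = resolved_module_origin("verl")
    if origin is None:
        return False, "verl is not importable (or has no file origin) on this PYTHONPATH"
    return expected_dir_fragment in origin, origin


ok, detail = check_verl_checkout()
print("new verl checkout active:", ok, "|", detail)
```

Run it in the same environment (same PYTHONPATH) as the trainer process; a result pointing at site-packages would suggest the old installed verl is still being picked up.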
OH-SDK scaffold
claude-code is the single-node default. For the multi-node path the validated
alternative is the OpenHands-SDK scaffold, which uses an image-mounted
runtime (the SDK pre-baked into the agent image) instead of an in-pod venv
install — the latter fails on no-egress task pods. Select it with
HARBOR_AGENT_NAME=null so harbor loads the mounted-runtime-aware
import_path class rather than the registry default.
Scaffold pitfall (name vs import_path):
repos/harbor-verl-train/docs/training_env/ohsdk-agent-name-vs-import-path-20260613.md
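The name-versus-import_path distinction can be illustrated with a generic resolver. This is not harbor's code: the `module:Class` string format, the treatment of the literal string `null`, and every function name here are assumptions made for illustration.

```python
import importlib


def parse_agent_name(raw):
    """Assumed convention: unset, empty, or the literal 'null' means 'no registry name'."""
    if raw is None or raw.strip().lower() in {"", "null"}:
        return None
    return raw


def load_by_import_path(import_path):
    """Load a class from a hypothetical 'package.module:ClassName' string."""
    module_name, _, attr = import_path.partition(":")
    if not module_name or not attr:
        raise ValueError(f"expected 'module:Class', got {import_path!r}")
    return getattr(importlib.import_module(module_name), attr)


def resolve_agent(agent_name, import_path, registry):
    """A set name wins and selects the registry default; only a null name falls through to import_path."""
    if agent_name is not None:
        return registry[agent_name]
    if import_path is None:
        raise ValueError("no agent name and no import_path given")
    return load_by_import_path(import_path)


# Stand-in classes: `dict` plays the registry default, OrderedDict the import_path class.
registry = {"claude-code": dict}
print(resolve_agent(parse_agent_name("null"), "collections:OrderedDict", registry))
print(resolve_agent(parse_agent_name("claude-code"), "collections:OrderedDict", registry))
```

The second call shows the pitfall in miniature: with a name set, the import_path is silently ignored.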
What to watch on a fully-async run
Beyond the single-node metrics, the async path adds a few signals worth a glance each step:
| Metric | Read |
|---|---|
| fully_async/trainer/idle_ratio | Fraction of the step the trainer waits on rollouts; high = rollout-bound (raise concurrency or rollout capacity) |
| rollout_corr/kl | Rollout-vs-training logprob fidelity; should be ≈3e-4, and a spike means scaffold/routing corruption |
| fully_async/partial/partial_ratio | Share of partial (staleness-bounded) rollouts |
| trajectory_filter/invalid_ratio | Share of trajectories dropped before the update |
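A per-step check over these metrics might look like the sketch below. The metric keys come from the table; the function name and both thresholds (idle ratio above 0.5, KL more than 10× the ≈3e-4 baseline) are assumptions to tune for your own run, not documented limits.

```python
def read_async_metrics(metrics, *, idle_high=0.5, kl_baseline=3e-4, kl_spike_factor=10.0):
    """Return human-readable warnings for one step's metrics dict (thresholds are assumed)."""
    warnings = []

    idle = metrics.get("fully_async/trainer/idle_ratio")
    if idle is not None and idle > idle_high:
        warnings.append(
            f"rollout-bound: idle_ratio={idle:.2f}; raise concurrency or rollout capacity"
        )

    kl = metrics.get("rollout_corr/kl")
    if kl is not None and kl > kl_baseline * kl_spike_factor:
        warnings.append(
            f"rollout_corr/kl={kl:.1e} is far above ~{kl_baseline:.0e}; check scaffold/routing"
        )

    return warnings


step_metrics = {"fully_async/trainer/idle_ratio": 0.8, "rollout_corr/kl": 5e-2}
for line in read_async_metrics(step_metrics):
    print(line)
```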
Advanced-path failure modes
The async/MoE path has its own silent killers — a reward drop traced to
trajectory filtering / sequence-distribution drift, routing-coverage loss under
multi-turn replay, and importance-sampling ESS collapse on very long
responses (set rollout_is=null and use token-level IS, not sequence-level).
Start from the v6 analysis:
repos/harbor-verl-train/docs/training_env/reward_drop_analysis_v6_20260612.md.
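The ESS collapse on long responses can be reproduced with a toy calculation. The sketch assumes the standard effective-sample-size estimate, (Σw)² / Σw², and uses made-up per-token log-ratios of a few thousandths; it does not reproduce verl's importance-sampling code.

```python
import math


def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum(w^2), computed from log-weights with a max-shift for stability."""
    shift = max(log_weights)
    weights = [math.exp(lw - shift) for lw in log_weights]
    return sum(weights) ** 2 / sum(w * w for w in weights)


N_SEQ, N_TOK = 8, 10_000  # 8 responses of 10k tokens each (illustrative sizes)

# Made-up mismatch: every token of sequence i has the same tiny log-ratio.
per_token_log_ratio = [(i - (N_SEQ - 1) / 2) * 1e-3 for i in range(N_SEQ)]

# Sequence-level IS: one weight per sequence, the sum of its token log-ratios.
seq_level = [N_TOK * d for d in per_token_log_ratio]
# Token-level IS: one weight per token.
tok_level = [d for d in per_token_log_ratio for _ in range(N_TOK)]

ess_seq = effective_sample_size(seq_level) / N_SEQ
ess_tok = effective_sample_size(tok_level) / len(tok_level)
print(f"normalised ESS, sequence-level: {ess_seq:.3f}")
print(f"normalised ESS, token-level:    {ess_tok:.6f}")
```

Per-token ratios that are individually negligible compound over thousands of tokens, so the sequence-level weights concentrate on a single response while the token-level weights stay close to uniform.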