Verl-SWE-RL

Run Training

Preflight

What dryrun.sh and /rl:check validate before a run

scripts/dryrun.sh (and the /rl:check skill that wraps it) validates the whole run profile without side effects, then prints a Run Configuration Summary. Always run it before scripts/start.sh, and fix every failure before launching.

bash scripts/dryrun.sh

What it checks

  • Config schema — required runtime_info.input sections are present and well-formed.
  • Repos — harbor-verl-train, harbor, and verl are present at their pinned commits.
  • GPU — expected device count is visible.
  • venv — editable installs of harbor / verl / harbor-verl-train resolve to the repos/ trees.
  • Backend — Docker CLI (local), TCP reachability (remote Docker), or kubectl reachability (K8s).
  • Ports — LiteLLM/Ray ports (:8002, :6379, :8265) are free.
  • Disk — the checkpoints volume has room for FSDP shards.
  • wandb — a key is available via credentials.wandb_api_key or $WANDB_API_KEY (unless wandb_mode: disabled).
  • Dashboard — a non-blocking check that dashboard/webui/server.py + prebuilt dist/ are vendored.
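To make two of the simpler checks above concrete, here is a minimal sketch of a port check and a disk check. It is not the code in dryrun.sh: the function names, the loopback host, and the byte threshold are assumptions chosen for illustration, and only the three port numbers come from the list above.

```python
import shutil
import socket

# Ports named in the checklist above (LiteLLM / Ray).
PORTS = (8002, 6379, 8265)


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 only when a listener accepted the connection.
        return s.connect_ex((host, port)) != 0


def busy_ports(ports=PORTS, host: str = "127.0.0.1") -> list:
    """Return the subset of ports that already have a listener."""
    return [p for p in ports if not port_is_free(p, host)]


def has_room(path: str, needed_bytes: int) -> bool:
    """Return True if the filesystem holding path has needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes
```

Calling busy_ports() with no arguments reports which of :8002, :6379, and :8265 are already taken on the local machine. How much free space FSDP shards need depends on the model and checkpoint settings, so needed_bytes is left to the caller.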

KV-head divisibility (the common crash)

The check that most often saves a run: vllm.gen_tp must divide the model's num_key_value_heads. If it doesn't, training crashes with a CUDA illegal memory access at the first forward pass. dryrun.sh reads the model's config.json and fails the preflight if the divisibility doesn't hold:

WARN: vllm.gen_tp=4 does NOT divide num_key_value_heads=6
      This will crash with CUDA illegal memory access at first forward pass.

Fix by choosing a gen_tp that divides the model's KV-head count (or a model whose KV heads are divisible by your tensor-parallel degree).
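The divisibility rule is easy to check by hand before editing the profile. The sketch below reads num_key_value_heads from a model's config.json and lists the tensor-parallel degrees that divide it. The helper names and the max_tp parameter are hypothetical, and the sketch assumes the key sits at the top level of config.json; it illustrates the rule and is not the check dryrun.sh runs.

```python
import json
from pathlib import Path


def load_kv_heads(config_path: str) -> int:
    """Read num_key_value_heads from a model config.json.

    Assumes the key is present at the top level of the file.
    """
    cfg = json.loads(Path(config_path).read_text())
    return int(cfg["num_key_value_heads"])


def gen_tp_is_valid(gen_tp: int, num_kv_heads: int) -> bool:
    """True when gen_tp divides the KV-head count evenly."""
    return num_kv_heads % gen_tp == 0


def valid_gen_tp(num_kv_heads: int, max_tp: int) -> list:
    """All tensor-parallel degrees in 1..max_tp that divide num_kv_heads."""
    return [tp for tp in range(1, max_tp + 1) if num_kv_heads % tp == 0]


# The case from the warning above: 6 KV heads with gen_tp=4.
print(gen_tp_is_valid(4, 6))   # False
print(valid_gen_tp(6, 8))      # [1, 2, 3, 6]
```

For a model with 6 KV heads, a gen_tp of 1, 2, 3, or 6 satisfies the rule, while 4 and 8 do not.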

Confirm, then run

The mandatory workflow is check → confirm → run: present the dry-run summary to the user and wait for explicit confirmation before launching. Heavy operations (multi-hour GPU training, multi-container rollouts) are expensive and hard to reverse, so /rl:run never auto-launches past a failed preflight and never passes --force.
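The check → confirm → run gate can be sketched as a small function that refuses to launch unless the preflight passed and the user said yes. This is an illustration rather than the implementation of /rl:run: the function names and status strings are hypothetical, and it assumes scripts/dryrun.sh signals failure with a non-zero exit code, which the text above does not state.

```python
import subprocess


def check_confirm_run(run_check, confirm, launch) -> str:
    """Gate a launch behind a passing preflight and an explicit yes.

    run_check and confirm return booleans; launch is called only
    when both are True. There is no force or bypass path.
    """
    if not run_check():
        return "preflight-failed"
    if not confirm():
        return "declined"
    launch()
    return "launched"


def run_script(path: str) -> bool:
    """Run a shell script; True on exit code 0 (assumed success signal)."""
    return subprocess.run(["bash", path]).returncode == 0


def main() -> None:
    status = check_confirm_run(
        run_check=lambda: run_script("scripts/dryrun.sh"),
        confirm=lambda: input("Launch training? [y/N] ").strip().lower() == "y",
        launch=lambda: run_script("scripts/start.sh"),
    )
    print(status)
```

Calling main() from the repository root walks through the three steps in order. Passing the steps in as callables keeps the gate testable without GPUs or containers.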
