Preflight
What dryrun.sh and /rl:check validate before a run
scripts/dryrun.sh (and the /rl:check skill that wraps it) validate the
whole run profile without side effects, then print a Run Configuration
Summary. Always run it before scripts/start.sh; fix every failure first.
bash scripts/dryrun.shWhat it checks
- Config schema — required
runtime_info.inputsections are present and well-formed. - Repos —
harbor-verl-train,harbor,verlpresent at their pinned commits. - GPU — expected device count is visible.
- venv — editable installs of harbor / verl / harbor-verl-train resolve to the
repos/trees. - Backend — Docker CLI (local), TCP reachability (remote Docker), or kubectl reachability (K8s).
- Ports — LiteLLM/Ray ports (
:8002,:6379,:8265) are free. - Disk — the checkpoints volume has room for FSDP shards.
- wandb — a key is available via
credentials.wandb_api_keyor$WANDB_API_KEY(unlesswandb_mode: disabled). - Dashboard — a non-blocking check that
dashboard/webui/server.py+ prebuiltdist/are vendored.
KV-head divisibility (the common crash)
The check that most often saves a run: vllm.gen_tp must divide the model's
num_key_value_heads. If it doesn't, training crashes with a CUDA illegal
memory access at the first forward pass. dryrun.sh reads the model's
config.json and fails the preflight if the divisibility doesn't hold:
WARN: vllm.gen_tp=4 does NOT divide num_key_value_heads=6
This will crash with CUDA illegal memory access at first forward pass.Fix by choosing a gen_tp that divides the model's KV-head count (or a model
whose KV heads are divisible by your tensor-parallel degree).
Confirm, then run
The mandatory workflow is check → confirm → run: present the dry-run summary
to the user and wait for explicit confirmation before launching. Heavy
operations (multi-hour GPU training, multi-container rollouts) are expensive and
hard to reverse, so /rl:run never auto-launches past a failed preflight and
never passes --force.