Verl-SWE-RL

Troubleshooting

Troubleshooting

Symptom-indexed fixes for the issues that actually block RL training

These are the failure modes that recur when operating the rl block — distilled to symptom → root cause → fix. They are curated from the development write-ups under repos/harbor-verl-train/docs/; each page links back to the full original for the deep dive.

Quick index

Find your symptom, jump to the fix.

SymptomPage
Step 0 crashes: response_mask must contain at least one valid tokenTool-call parser mismatch
Half the GPUs sit at 0% util / 100% memory; LiteLLM config lists only one serverLiteLLM replica discovery race
import torch / import vllm cold-start takes ~9s every launchvenv on network FS
Every reward is 0.0; critic/score flat at zero though the grader runsReward always zero
Pod startup time creeps up over hours (30s → 100s+); throughput tankscgroup / memcg leak
Remote Docker mode won't start (FlashInfer / Ray env / docker SDK / compose)Remote Docker gotchas
nvme at 100% util with no pods being created; load 30+Kyverno IO storm
Pods stuck ContainerCreating; uv install storms the disk; tasks 5× slowerPod-startup disk bottleneck

By area

First, run the preflight

Most config-level problems are caught by bash scripts/dryrun.sh / /rl:check before they cost you a multi-hour run — see Preflight.

On this page