Troubleshooting
Troubleshooting
Symptom-indexed fixes for the issues that actually block RL training
These are the failure modes that recur when operating the rl block — distilled to
symptom → root cause → fix. They are curated from the development write-ups
under repos/harbor-verl-train/docs/; each page links back to the full original
for the deep dive.
Quick index
Find your symptom, jump to the fix.
| Symptom | Page |
|---|---|
Step 0 crashes: response_mask must contain at least one valid token | Tool-call parser mismatch |
| Half the GPUs sit at 0% util / 100% memory; LiteLLM config lists only one server | LiteLLM replica discovery race |
import torch / import vllm cold-start takes ~9s every launch | venv on network FS |
Every reward is 0.0; critic/score flat at zero though the grader runs | Reward always zero |
| Pod startup time creeps up over hours (30s → 100s+); throughput tanks | cgroup / memcg leak |
| Remote Docker mode won't start (FlashInfer / Ray env / docker SDK / compose) | Remote Docker gotchas |
nvme at 100% util with no pods being created; load 30+ | Kyverno IO storm |
Pods stuck ContainerCreating; uv install storms the disk; tasks 5× slower | Pod-startup disk bottleneck |
By area
- Training startup — the model/serving config issues you hit the moment a run starts
- Rewards & throughput — the reward signal is zero, or the host degrades over time
- Sandbox backends — Docker and Kubernetes runtime issues
First, run the preflight
Most config-level problems are caught by bash scripts/dryrun.sh /
/rl:check before they cost you a multi-hour run — see Preflight.