Troubleshooting

These are the failure modes that recur when operating the rl block — distilled to symptom → root cause → fix. They are curated from the development write-ups under repos/harbor-verl-train/docs/; each page links back to the full original for the deep dive.

Quick index

Find your symptom, jump to the fix.

Symptom	Page
Step 0 crashes: `response_mask must contain at least one valid token`	Tool-call parser mismatch
Half the GPUs sit at 0% util / 100% memory; LiteLLM config lists only one server	LiteLLM replica discovery race
`import torch` / `import vllm` cold-start takes ~9s every launch	venv on network FS
Every reward is `0.0`; `critic/score` flat at zero though the grader runs	Reward always zero
Pod startup time creeps up over hours (30s → 100s+); throughput tanks	cgroup / memcg leak
Remote Docker mode won't start (FlashInfer / Ray env / docker SDK / compose)	Remote Docker gotchas
nvme at 100% util with no pods being created; `load` 30+	Kyverno IO storm
Pods stuck `ContainerCreating`; uv install storms the disk; tasks 5× slower	Pod-startup disk bottleneck

By area

Training startup — the model/serving config issues you hit the moment a run starts
Rewards & throughput — the reward signal is zero, or the host degrades over time
Sandbox backends — Docker and Kubernetes runtime issues

First, run the preflight

Most config-level problems are caught by bash scripts/dryrun.sh / /rl:check before they cost you a multi-hour run — see Preflight.

Troubleshooting

Quick index

By area

On this page