Sandbox Backends

Issues in the layer that actually runs each trial. See Backends for choosing and configuring a backend; this page covers the failure modes.

Remote Docker: four startup gotchas

When bringing up the remote-Docker backend, four issues commonly block the first run. Work through them in order:

FlashInfer workspace init fails. A stale ninja cache at /root/.cache/flashinfer/ references deleted venv paths → empty workspace → an allreduce assertion fails at startup. Fix: rm -rf /root/.cache/flashinfer/.
VLLM_ATTENTION_BACKEND not seen by Ray workers. A plain shell export doesn't propagate: the Ray workers in this setup do not pick up variables exported in the launching shell. Fix: pass it through Ray's runtime env in sync_1node_cc.sh: +ray_kwargs.ray_init.runtime_env.env_vars.VLLM_ATTENTION_BACKEND=FLASH_ATTN.
Docker SDK shadowed by verl/docker/. A docker/ directory on sys.path masks the real PyPI docker module. Fix: uv pip install --python $VENV/bin/python docker (site-packages wins); verify with python -c "from docker import DockerClient; print('ok')".
Compose V2 plugin missing. A hand-installed Docker lacks the compose plugin. Fix: install it into /usr/local/lib/docker/cli-plugins/docker-compose (curl the release binary, chmod +x).

After these, re-run bash scripts/dryrun.sh to confirm docker CLI + compose, remote connectivity, the FlashInfer cache, the Docker SDK import, and the venv paths all pass.
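The Docker SDK item is easy to misdiagnose because `import docker` can succeed while resolving to the wrong directory. The following is a minimal sketch of a check, assuming `$VENV` points at the training venv as in the fix above. The helper names `docker_module_origin` and `is_site_packages` are hypothetical and are not part of scripts/dryrun.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for spotting a shadowed `docker` module.

# Print where Python resolves `docker` from. A directory without an
# __init__.py imports as a namespace package, whose __file__ is None.
docker_module_origin() {
  local python_bin="${1:-python}"
  "$python_bin" -c 'import docker; print(getattr(docker, "__file__", None))'
}

# Succeed only if the given path lies under a site-packages directory.
is_site_packages() {
  case "$1" in
    */site-packages/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   origin="$(docker_module_origin "$VENV/bin/python")"
#   is_site_packages "$origin" || echo "docker resolves to: $origin" >&2
```

If the origin is not under site-packages, the import is still picking up the local docker/ directory rather than the PyPI package.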

Full write-ups: repos/harbor-verl-train/docs/docker/remote-docker-troubleshooting-20260527.md and …/remote-docker-backend-setup.md

Kubernetes health baseline

When the K8s backend misbehaves, check these indicators first — most rollout stalls trace to one of them:

| Indicator | Where | Warn / Crit |
| --- | --- | --- |
| `num_cgroups` | `awk 'NR==2{print $3}' /proc/cgroups` | >50k / >100k (cgroup leak) |
| `percpu` memory | `/sys/fs/cgroup/memory.stat` | >20 / >50 GB (memcg leak) |
| nvme `%util` | `iostat -x` on the etcd disk | >80% sustained (I/O storm) |
| nvme `r_await` | `iostat -x` | >50 ms (etcd reads slow) |
| `RunPodSandbox` p50 | kubelet metrics | >30s (sandbox-creation bottleneck) |

Diagnostic flow: cgroup leak (reboot if needed) → I/O (find the reader) → control plane (apiserver/etcd CPU) → containerd.
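The two-threshold rows in the table reduce to a simple comparison. The sketch below shows it for the first check in the flow, the cgroup count, using the awk one-liner and the 50k / 100k thresholds from the table. The function names `classify` and `num_cgroups` are illustrative assumptions, not an existing tool.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the cgroup-leak check.

# classify VALUE WARN CRIT -> prints ok, warn, or crit (integers only).
classify() {
  local value="$1" warn="$2" crit="$3"
  if [ "$value" -gt "$crit" ]; then
    echo crit
  elif [ "$value" -gt "$warn" ]; then
    echo warn
  else
    echo ok
  fi
}

# Read num_cgroups from the first controller row, as in the table.
# The file argument exists only so the parser can be tried on a sample.
num_cgroups() {
  awk 'NR==2{print $3}' "${1:-/proc/cgroups}"
}

# Usage:
#   classify "$(num_cgroups)" 50000 100000
```

A `crit` result here corresponds to the "reboot if needed" branch of the diagnostic flow.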

Full write-up: repos/harbor-verl-train/docs/k8s/k8s-health-check-indicators.md

Kyverno reports-controller I/O storm

Symptom. nvme at 100% util, r_await ~89 ms, load 30+, even though no pods are being created. Scaling the controller to zero drops the I/O to near zero immediately.

Root cause. The kyverno-reports-controller lists 1000+ EphemeralReports, which takes 32–38s and exceeds the 15s leader-election lease. It loses leadership, crashes, restarts, and lists them all again: a crash loop that hammers etcd with small random reads on every cycle.
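One way to confirm this root cause on a live cluster is to time a full list and compare it against the lease. This is a rough sketch, assuming a kubectl-side list takes roughly as long as the controller's own; the helpers `elapsed_seconds` and `exceeds_lease` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: compare how long a full list takes with the lease.

# elapsed_seconds CMD [ARGS...] -> prints the whole seconds the command took.
elapsed_seconds() {
  local start end
  start="$(date +%s)"
  "$@" >/dev/null 2>&1
  end="$(date +%s)"
  echo $((end - start))
}

# exceeds_lease LIST_SECONDS LEASE_SECONDS -> succeeds if the list outlasts the lease.
exceeds_lease() {
  [ "$1" -gt "$2" ]
}

# Usage (needs cluster access):
#   took="$(elapsed_seconds kubectl get ephemeralreports -A --no-headers)"
#   exceeds_lease "$took" 15 && echo "list took ${took}s, lease is 15s" >&2
```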

Fix.

kubectl scale -n kyverno deployment/kyverno-reports-controller --replicas=0  # reports are optional; admission still works
kubectl delete ephemeralreports -A --all                                     # clear the backlog

Prevention. Run a cron job that prunes EphemeralReports every ~30 min and alerts when the count exceeds ~500.
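The prune job could look roughly like the sketch below. It reuses the delete command from the fix and the ~500 alert threshold. The function names, the `ALERT_THRESHOLD` variable, and the alert mechanism (a line on stderr) are assumptions; substitute whatever alerting the cluster already uses.

```shell
#!/usr/bin/env bash
# Hypothetical prune job, intended to run from cron every ~30 minutes.
ALERT_THRESHOLD="${ALERT_THRESHOLD:-500}"

# Count EphemeralReports across all namespaces.
count_reports() {
  kubectl get ephemeralreports -A --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# over_threshold COUNT LIMIT -> succeeds if COUNT is strictly above LIMIT.
over_threshold() {
  [ "$1" -gt "$2" ]
}

# Alert if the backlog is already large, then clear it.
prune_reports() {
  local count
  count="$(count_reports)"
  if over_threshold "$count" "$ALERT_THRESHOLD"; then
    echo "ALERT: $count EphemeralReports (threshold $ALERT_THRESHOLD)" >&2
  fi
  kubectl delete ephemeralreports -A --all
}
```

A script that sources these functions and calls `prune_reports` can then be scheduled with a crontab entry on a `*/30 * * * *` schedule.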

Full write-up: repos/harbor-verl-train/docs/k8s/kyverno-reports-controller-issue.md

Pod-startup disk bottleneck (uv cache)

Symptom. Pods get stuck in ContainerCreating. With ~100 concurrent pods each running uv installs, the etcd nvme sees 1.7k–2.3k w/s and aqu-sz ~35–37, and tasks take ~150s against a ~30s baseline: a small-file write storm.

Root cause. Each pod writes its uv cache (/root/.cache/uv) to the overlayfs on disk; 100 pods × many small wheel writes overwhelm the nvme with journal amplification.

Fix. Put the uv cache on tmpfs via a Kyverno mutating policy that injects, on each sandbox pod:

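emptyDir { medium: Memory, sizeLimit: 1Gi } mounted at /root/.cache/uv
UV_LINK_MODE=copy (files cannot be hard-linked from the tmpfs cache into an environment on another filesystem)
a pod-level resources.limits.memory headroom (e.g. 8Gi), since tmpfs usage counts against the pod's memory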
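A policy of this shape might look like the sketch below, written as a shell function that prints the manifest so it can be reviewed before being applied. The policy name, rule name, volume name, and the `role: sandbox` label selector are all assumptions; the real selector depends on how sandbox pods are labelled. The memory headroom is left out because the source does not say where it is set.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print a Kyverno ClusterPolicy that puts the uv cache on tmpfs.
# Names and the label selector are placeholders, not taken from the real policy.
render_uv_cache_policy() {
  cat <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: uv-cache-tmpfs
spec:
  rules:
    - name: inject-uv-cache-tmpfs
      match:
        any:
          - resources:
              kinds:
                - Pod
              selector:
                matchLabels:
                  role: sandbox
      mutate:
        patchStrategicMerge:
          spec:
            volumes:
              - name: uv-cache
                emptyDir:
                  medium: Memory
                  sizeLimit: 1Gi
            containers:
              - (name): "*"
                env:
                  - name: UV_LINK_MODE
                    value: copy
                volumeMounts:
                  - name: uv-cache
                    mountPath: /root/.cache/uv
EOF
}

# Usage (needs cluster access):
#   render_uv_cache_policy | kubectl apply -f -
```

The `(name): "*"` anchor applies the env var and mount to every container in the pod; narrow it if only one container runs uv.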

Result: aqu-sz 37 → 6 (~6× drop), w/s roughly halved, nvme %util 100% → ~25%, and the ContainerCreating queue clears.

Full write-up: repos/harbor-verl-train/docs/k8s/k8s-sandbox-disk-fix.md
