Run Training
Inference Stack
The claude-code → LiteLLM → vLLM chain and its health checks
During training the policy serves itself. The agent inside each sandbox does not talk to vLLM directly — it goes through a LiteLLM proxy on the training host:
claude-code (K8s pod / Docker container)
  └─▶ LiteLLM proxy :8002 (Anthropic/OpenAI API; trajectory_logger writes per-trial JSONL)
        └─▶ vLLM (verl-managed DP replicas, OpenAI API)
Why a proxy
- Protocol: the proxy presents an ordinary Anthropic/OpenAI surface, so any standard agent scaffold works unchanged.
- Aliases: it serves the model under claude-*, hosted_vllm/<served>, and <served>, so a model-name mismatch does not end in a 404.
- Logging: the trajectory_logger callback writes one litellm-trajectory.jsonl per trial under harbor_trials/.
- Discovery: LiteLLM starts only after verl boots, because it finds the vLLM addresses from Ray named actors (vllm_server_{i}_0).
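The alias rule can be sketched in a few lines of Python. This is an illustration of the three patterns listed above, not the proxy's actual routing code; the function names are hypothetical, and `vllm_model` is borrowed from the ping example in this section.

```python
from fnmatch import fnmatchcase


def alias_patterns(served: str) -> list[str]:
    """Patterns the proxy is described as accepting for one served model name."""
    return ["claude-*", f"hosted_vllm/{served}", served]


def is_routable(requested: str, served: str) -> bool:
    """Return True if `requested` matches one of the alias patterns."""
    wildcard, prefixed, bare = alias_patterns(served)
    # Only the first pattern is a glob; the other two are exact names.
    return fnmatchcase(requested, wildcard) or requested in (prefixed, bare)
```

Under these assumptions, any `claude-`-prefixed name, `hosted_vllm/vllm_model`, and `vllm_model` are all routable, while an unrelated name is not, which is the case that surfaces as a 404.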
Health checks during a run
The chain is claude-code → LiteLLM :8002 → vLLM. On a 30B MoE (TP=4), vLLM
CUDA-graph capture takes roughly 10–20 min, and the first replica registers only after it finishes.
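Given that warm-up, a script that gates on the proxy needs a generous deadline. The following is a minimal sketch of such a wait loop, separate from the manual checks that follow. The function names, the 15-second poll interval, and the 30-minute default deadline are assumptions chosen for illustration; only the liveness URL comes from this section.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Liveness endpoint used by the manual checks in this section.
LIVENESS_URL = "http://127.0.0.1:8002/health/liveliness"


def probe(url: str = LIVENESS_URL, timeout: float = 5.0) -> bool:
    """Return True if the proxy answers the liveness endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all count as "not up yet".
        return False


def wait_for_proxy(
    check: Callable[[], bool] = probe,
    deadline_s: float = 30 * 60,   # assumed deadline, not a documented setting
    interval_s: float = 15.0,      # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it succeeds or `deadline_s` seconds have elapsed."""
    start = clock()
    while True:
        if check():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

Calling `wait_for_proxy()` with the defaults blocks until the proxy responds or the deadline passes. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.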
# 1. LiteLLM-reported endpoint health
curl -sS http://127.0.0.1:8002/health/liveliness
# 2. End-to-end inference ping (use served_model_name from config.yaml)
curl -sS --max-time 30 http://127.0.0.1:8002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"vllm_model","messages":[{"role":"user","content":"ping"}],"max_tokens":3}'
# 3. GPU utilization during rollout
watch -n 2 nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader
# 4. Ray-registered vLLM actors
python3 -c "import ray; ray.init(address='auto', ignore_reinit_error=True); \
print([a for a in ray.util.list_named_actors(all_namespaces=True) if 'vllm_server' in a['name']])"
Common symptoms
| Symptom | Likely cause |
|---|---|
| connection refused on :8002 for >30 min | vLLM not registering; check logs/<exp>.log for vLLM init errors |
| GPU memory high, util 0% sustained during rollout | sandbox side stuck (no traffic from agent pods); check kubectl get pods -l harbor-run=<prefix> |
| CUDA error: an illegal memory access at first forward pass | vllm.gen_tp does not divide num_key_value_heads; re-run dryrun.sh |
| LiteLLM up but claude-code returns 404 | model-name mismatch; the proxy serves the claude-*, hosted_vllm/<served>, and <served> aliases |
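The tensor-parallel row comes down to a divisibility condition. The sketch below states only that condition in Python; what dryrun.sh actually checks is not shown in this section, and the function name and the sample values are hypothetical.

```python
def gen_tp_divides_kv_heads(gen_tp: int, num_key_value_heads: int) -> bool:
    """Return True if gen_tp evenly divides num_key_value_heads."""
    if gen_tp <= 0 or num_key_value_heads <= 0:
        raise ValueError("gen_tp and num_key_value_heads must be positive")
    return num_key_value_heads % gen_tp == 0


# Illustrative values only: 8 KV heads split cleanly across TP=4, 2 do not.
ok = gen_tp_divides_kv_heads(gen_tp=4, num_key_value_heads=8)
bad = gen_tp_divides_kv_heads(gen_tp=4, num_key_value_heads=2)
```

Running a check like this before launch is cheaper than discovering the mismatch at the first forward pass.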