Inference Stack

During training the policy serves itself. The agent inside each sandbox does not talk to vLLM directly — it goes through a LiteLLM proxy on the training host:

claude-code (K8s pod / Docker container)
  └─▶ LiteLLM proxy :8002   (Anthropic/OpenAI API; trajectory_logger writes per-trial JSONL)
        └─▶ vLLM (verl-managed DP replicas, OpenAI API)

Why a proxy

Protocol — the proxy presents an ordinary Anthropic/OpenAI surface, so any standard agent scaffold works unchanged.
Aliases — it serves the model under claude-*, hosted_vllm/<served>, and <served> so a 404 from a name mismatch is avoided.
Logging — the trajectory_logger callback writes one litellm-trajectory.jsonl per trial under harbor_trials/.
Discovery — LiteLLM starts only after verl boots, because it finds the vLLM addresses from Ray named actors (vllm_server_{i}_0).
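
To make the discovery step concrete, here is a minimal sketch of picking the vLLM replicas out of Ray's named actors. It assumes only the naming pattern quoted above (vllm_server_{i}_0) and the record shape returned by ray.util.list_named_actors(all_namespaces=True); the helper name vllm_replica_indices is hypothetical and this is not the proxy's actual startup code.

```python
import re

# Name pattern taken from the text above: vllm_server_{i}_0.
_VLLM_ACTOR = re.compile(r"^vllm_server_(\d+)_0$")


def vllm_replica_indices(named_actors):
    """Return the sorted replica indices found in Ray named-actor records.

    `named_actors` is a list of dicts with a "name" key, which is the shape
    ray.util.list_named_actors(all_namespaces=True) returns.
    """
    indices = []
    for actor in named_actors:
        match = _VLLM_ACTOR.match(actor.get("name") or "")
        if match:
            indices.append(int(match.group(1)))
    return sorted(indices)
```

An empty result means no replica has registered yet, which is why the proxy cannot start before verl has booted.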

Health checks during a run

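The chain is claude-code → LiteLLM :8002 → vLLM. On a 30B MoE (TP=4), vLLM CUDA-graph capture takes roughly 10–20 min before the first replica registers, and LiteLLM only starts after that, so a refused connection on :8002 during this window is expected.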

# 1. LiteLLM proxy liveness (confirms the proxy process is up; does not probe vLLM)
curl -sS http://127.0.0.1:8002/health/liveliness

# 2. End-to-end inference ping (use served_model_name from config.yaml)
curl -sS --max-time 30 http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"vllm_model","messages":[{"role":"user","content":"ping"}],"max_tokens":3}'

# 3. GPU utilization during rollout
watch -n 2 nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader

# 4. Ray-registered vLLM actors
python3 -c "import ray; ray.init(address='auto', ignore_reinit_error=True); \
  print([a for a in ray.util.list_named_actors(all_namespaces=True) if 'vllm_server' in a['name']])"
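
Because startup can take tens of minutes, it may be convenient to poll check 1 instead of re-running it by hand. The sketch below is one way to do that with the standard library only; the helper names proxy_is_alive and wait_until, the 15-second interval, and the 5-second request timeout are illustrative assumptions, not part of the stack.

```python
import http.client
import time
import urllib.error
import urllib.request


def proxy_is_alive(url="http://127.0.0.1:8002/health/liveliness", timeout_s=5.0):
    """Return True if the proxy answers the liveness URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not up".
        return False


def wait_until(probe, deadline_s, interval_s=15.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call probe() every interval_s until it returns True or deadline_s elapses.

    Returns the elapsed seconds on success, or None on timeout.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)
```

For example, wait_until(proxy_is_alive, deadline_s=30 * 60) waits up to 30 minutes. A None result then corresponds to the first row of the symptoms table.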

Common symptoms

Symptom	Likely cause
`connection refused` on `:8002` for >30 min	vLLM not registering — check `logs/<exp>.log` for vLLM init errors
GPU memory high, util 0% sustained during rollout	sandbox side stuck (no traffic from agent pods) — `kubectl get pods -l harbor-run=<prefix>`
`CUDA error: an illegal memory access` at first forward pass	`vllm.gen_tp` does not divide `num_key_value_heads` — re-run `dryrun.sh`
LiteLLM up but `claude-code` returns 404	model-name mismatch — the proxy serves `claude-*`, `hosted_vllm/<served>`, `<served>` aliases
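
The tensor-parallel row can be checked before launch with simple arithmetic. This is a minimal sketch of the divisibility rule exactly as the table states it; it is not the logic of dryrun.sh, and the function names are hypothetical. num_key_value_heads is the field of that name in the model's config, and gen_tp stands for the value of vllm.gen_tp.

```python
def gen_tp_is_valid(num_key_value_heads, gen_tp):
    """True when gen_tp evenly divides num_key_value_heads."""
    if gen_tp <= 0:
        raise ValueError("gen_tp must be a positive integer")
    return num_key_value_heads % gen_tp == 0


def valid_tp_sizes(num_key_value_heads, max_tp):
    """List every tensor-parallel size up to max_tp that passes the rule."""
    return [tp for tp in range(1, max_tp + 1)
            if gen_tp_is_valid(num_key_value_heads, tp)]
```

With 4 key/value heads, for instance, gen_tp=4 passes the rule while gen_tp=3 does not.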
