Inference Stack

The claude-code → LiteLLM → vLLM chain and its health checks

During training the policy serves itself. The agent inside each sandbox does not talk to vLLM directly — it goes through a LiteLLM proxy on the training host:

claude-code (K8s pod / Docker container)
  └─▶ LiteLLM proxy :8002   (Anthropic/OpenAI API; trajectory_logger writes per-trial JSONL)
        └─▶ vLLM (verl-managed DP replicas, OpenAI API)

Why a proxy

  • Protocol — the proxy presents an ordinary Anthropic/OpenAI surface, so any standard agent scaffold works unchanged.
  • Aliases — it serves the model under claude-*, hosted_vllm/<served>, and <served>, so a request under any of these names resolves instead of returning a 404.
  • Logging — the trajectory_logger callback writes one litellm-trajectory.jsonl per trial under harbor_trials/.
  • Discovery — LiteLLM starts only after verl boots, because it finds the vLLM addresses from Ray named actors (vllm_server_{i}_0).
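
The alias point above can be sketched as a small helper that expands one served model name into proxy entries for each vLLM replica. This is only an illustration: the entry shape follows LiteLLM's model_list config convention, but build_model_list, the example addresses, and the assumption that this project's proxy builds its config this way are all hypothetical. Whether the claude-* pattern is accepted verbatim by the proxy is also not confirmed here.

```python
from typing import Dict, List


def build_model_list(served: str, api_bases: List[str]) -> List[Dict]:
    """Expand one served model into alias entries, one set per vLLM replica.

    Hypothetical helper: `served` is the served_model_name from config.yaml and
    `api_bases` are replica URLs (in the real stack these come from Ray actors).
    """
    # The three aliases named in the docs; all route to the same backend model.
    aliases = ["claude-*", f"hosted_vllm/{served}", served]
    entries = []
    for base in api_bases:
        for alias in aliases:
            entries.append({
                "model_name": alias,                  # name the agent asks for
                "litellm_params": {
                    "model": f"hosted_vllm/{served}",  # name the backend serves
                    "api_base": base,
                },
            })
    return entries


# Example with placeholder addresses (assumed, not taken from a real run):
# build_model_list("vllm_model", ["http://10.0.0.1:8000/v1"])
```

Listing every alias on every replica means a scaffold that hard-codes a claude-style model name and a script that uses the raw served name both reach the same policy weights.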

Health checks during a run

The chain is claude-code → LiteLLM :8002 → vLLM. On a 30B MoE (TP=4), vLLM CUDA-graph capture takes roughly 10–20 min, so expect a delay of that order before the first replica registers.

# 1. LiteLLM proxy liveness (confirms the proxy process is up)
curl -sS http://127.0.0.1:8002/health/liveliness

# 2. End-to-end inference ping (use served_model_name from config.yaml)
curl -sS --max-time 30 http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"vllm_model","messages":[{"role":"user","content":"ping"}],"max_tokens":3}'

# 3. GPU utilization during rollout
watch -n 2 nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader

# 4. Ray-registered vLLM actors
python3 -c "import ray; ray.init(address='auto', ignore_reinit_error=True); \
  print([a for a in ray.util.list_named_actors(all_namespaces=True) if 'vllm_server' in a['name']])"
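
Because the proxy can take tens of minutes to come up, the first check above is often wrapped in a polling loop. The following is a minimal sketch, assuming the liveness URL from check 1; http_probe and wait_until_up are hypothetical helper names, and the 30-minute deadline simply mirrors the threshold in the symptom table.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def wait_until_up(probe: Callable[[], bool], deadline_s: float, interval_s: float,
                  sleep: Callable[[float], None] = time.sleep,
                  clock: Callable[[], float] = time.monotonic) -> bool:
    """Call probe() every interval_s seconds until it succeeds or deadline_s passes."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)


# Usage (not executed here; the URL matches check 1 above):
# url = "http://127.0.0.1:8002/health/liveliness"
# up = wait_until_up(lambda: http_probe(url), deadline_s=30 * 60, interval_s=15)
```

The sleep and clock parameters are injectable only so the loop can be exercised without real waiting. A successful liveness probe says the proxy is running; the inference ping in check 2 is still needed to confirm that vLLM answers behind it.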

Common symptoms

Symptom | Likely cause
connection refused on :8002 for >30 min | vLLM not registering; check logs/<exp>.log for vLLM init errors
GPU memory high, util 0% sustained during rollout | sandbox side stuck (no traffic from agent pods); inspect with kubectl get pods -l harbor-run=<prefix>
CUDA error: an illegal memory access at first forward pass | vllm.gen_tp does not divide num_key_value_heads; re-run dryrun.sh
LiteLLM up but claude-code returns 404 | model-name mismatch; the proxy serves the claude-*, hosted_vllm/<served>, and <served> aliases
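
Two of the symptoms above can be checked mechanically. The sketch below is an assumption-laden illustration, not part of the project's tooling: tp_divides_kv_heads tests the divisibility condition named in the table, and idle_gpus parses the CSV printed by the nvidia-smi command from check 3 to flag GPUs that hold memory but show 0% utilization. Both function names and the 1024 MiB threshold are hypothetical.

```python
from typing import List


def tp_divides_kv_heads(num_key_value_heads: int, gen_tp: int) -> bool:
    """True when the generation TP size evenly divides the model's KV heads."""
    return gen_tp > 0 and num_key_value_heads % gen_tp == 0


def idle_gpus(csv_text: str, min_mem_mib: int = 1024) -> List[int]:
    """Return GPU indices with memory in use but 0% utilization.

    Expects lines like "0, 70123 MiB, 0 %", i.e. the output of
    nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader
    """
    idle = []
    for line in csv_text.strip().splitlines():
        index, mem, util = [field.strip() for field in line.split(",")]
        mem_mib = int(mem.split()[0])    # "70123 MiB" -> 70123
        util_pct = int(util.split()[0])  # "0 %" -> 0
        if mem_mib >= min_mem_mib and util_pct == 0:
            idle.append(int(index))
    return idle
```

A single sample is not enough to call a GPU stuck, since utilization dips briefly between rollout batches; the table's symptom is 0% sustained, so this check is only meaningful when it keeps firing across repeated samples.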
