Core Concepts
Core concepts and terminology in rl
rl has the following core concepts:
Policy, rollout, and trial
The policy is the model being trained. A rollout (Harbor calls it a trial) is one full attempt by the agent to solve a task: the agent reasons, edits files, and runs commands inside a sandbox until it submits or hits the turn/timeout limit. One training step samples a batch of rollouts across tasks.
Reward
Each task ships its own verifier (a test script). When a rollout finishes,
the verifier runs inside the sandbox and produces a reward — typically 1.0
if the tests pass and 0.0 otherwise. This grounds the optimization in real
task success rather than a learned reward model. On the dashboard this surfaces
as critic/score/mean.
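As a minimal sketch of the reward rule above, the snippet below maps a verifier result to a binary reward and averages it over a batch. It assumes the verifier reports pass/fail through its process exit code, which the text does not state (it only says the verifier is a test script). The function names are hypothetical and not part of rl, Harbor, or verl.

```python
from statistics import fmean


def reward_from_exit_code(exit_code: int) -> float:
    """Binary reward: 1.0 if the verifier passed, 0.0 otherwise.

    Assumption: exit code 0 means "tests passed".
    """
    return 1.0 if exit_code == 0 else 0.0


def mean_score(exit_codes: list[int]) -> float:
    """Mean reward over a batch of rollouts (the quantity shown as critic/score/mean)."""
    if not exit_codes:
        raise ValueError("cannot average an empty batch")
    return fmean(reward_from_exit_code(code) for code in exit_codes)
```

For example, a batch of four rollouts in which two verifiers pass yields a mean score of 0.5.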
The inference chain
During training the policy serves itself: verl manages vLLM replicas, and the agent reaches them through a LiteLLM proxy.
claude-code (in a K8s pod or Docker container)
  └─▶ LiteLLM proxy on the training host :8002 (Anthropic/OpenAI API surface;
      trajectory_logger writes per-trial JSONL)
        └─▶ vLLM (verl-managed, DP × TP, OpenAI API surface)

The proxy normalizes the endpoint so the agent speaks an ordinary
Anthropic/OpenAI protocol, attaches the trajectory logger, and serves the model
under aliases (claude-*, hosted_vllm/<served>, <served>). LiteLLM starts
after verl boots, because it discovers the vLLM addresses from Ray named
actors (vllm_server_{i}_0).
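The naming conventions in this chain can be sketched as plain string helpers. This is illustrative only: the helper names are hypothetical, the `http` scheme is an assumption, and so is the reading of `{i}` as a zero-based replica index. Only the port, the alias patterns, and the actor name template come from the text above.

```python
def proxy_base_url(training_host: str, port: int = 8002) -> str:
    """Base URL the agent would point at (scheme is an assumption)."""
    return f"http://{training_host}:{port}"


def model_aliases(served: str) -> list[str]:
    """Aliases under which the proxy exposes the served model name."""
    return ["claude-*", f"hosted_vllm/{served}", served]


def vllm_actor_names(num_replicas: int) -> list[str]:
    """Ray named actors that would be looked up to find vLLM addresses.

    Assumption: {i} enumerates replicas starting from 0.
    """
    return [f"vllm_server_{i}_0" for i in range(num_replicas)]
```

Because the actor names only exist once verl has started its replicas, any lookup based on them has to run after verl boots, which matches the startup order described above.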
The verl training loop
verl drives the synchronous RL loop:
generate rollouts with vLLM → execute each trial in a Harbor sandbox → collect
rewards → compute advantages → update the actor (and critic, for PPO). The
trainer and the rollout engine share the same GPUs on one node
(sync_1node_cc.sh). This single-node, synchronous setup is the recommended,
minimal-cost default; for the multi-node, fully-async variant (decoupled
rollout/trainer pools, veomni/MoE, OH-SDK) see Scaling Up.
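The order of operations in one synchronous step can be sketched as follows. This is a schematic under stated assumptions, not verl's trainer: the three callables are hypothetical stand-ins for "run one trial in a sandbox and return its reward", "compute advantages", and "update the actor".

```python
from typing import Callable, Sequence


def training_step(
    tasks: Sequence[str],
    run_trial: Callable[[str], float],
    compute_advantages: Callable[[list[float]], list[float]],
    update_actor: Callable[[list[float]], None],
) -> float:
    """Run one synchronous step and return the mean reward of the batch."""
    # Generate and execute one rollout per task, collecting its reward.
    rewards = [run_trial(task) for task in tasks]
    # Convert raw rewards into advantages (the estimator depends on the algorithm).
    advantages = compute_advantages(rewards)
    # Only after every rollout has finished is the policy updated.
    update_actor(advantages)
    return sum(rewards) / len(rewards)
```

The point of the sketch is the ordering: the update waits for the whole batch, which is what makes the loop synchronous and lets the trainer and rollout engine take turns on the same GPUs.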
Algorithms
The algorithm config selects the estimator and loss:
- PPO — actor + critic, clipped policy-gradient objective.
- GRPO — group-relative advantages, no critic.
- GSPO — sequence-level policy optimization.
These live in the upstream sync_1node_cc.sh and are mirrored into config.yaml
for documentation; see Config Variants.
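To make two of these terms concrete, here is a small sketch of group-relative advantages (the idea behind GRPO) and the clipped surrogate term (as used in PPO). It shows the textbook formulas and is not verl's implementation. The function names, the population standard deviation, and the default `eps` and `clip_eps` values are assumptions chosen for illustration.

```python
from statistics import fmean, pstdev


def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other rollouts of the same task."""
    mean = fmean(group_rewards)
    std = pstdev(group_rewards)  # population std; implementations may differ
    return [(r - mean) / (std + eps) for r in group_rewards]


def clipped_objective(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for a single sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s).
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With binary rewards, a group in which every rollout passes (or every rollout fails) has zero advantage everywhere, so such groups contribute no gradient signal under this estimator.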
Agent scaffold
The scaffold is the agent harness that drives the model through a task —
claude-code by default. It determines the rollout's shape and which runtime
image Harbor launches. It is set in runtime_info.input.harbor_agent.
Sandbox backend
Every trial runs in a fresh sandbox. rl supports two backends:
- Kubernetes — each trial is a pod; production default. Configured via k8s.kubeconfig and the harbor_agent import path.
- Docker — each trial is a local or remote container; the minimal setup, no cluster required.
See Backends for the trade-offs and config.
Checkpoints
verl writes the actor as FSDP shards every save_freq steps to
repos/harbor-verl-train/checkpoints/<project>/<exp>/global_step_N/. These are
the block's primary output (runtime_info.output.checkpoint_path) and are
preserved across clean.sh runs.
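Given the directory layout above, a helper along these lines could locate the most recent checkpoint. It is a sketch: `latest_checkpoint` is a hypothetical name, and it relies only on the `global_step_N` naming described in the text, comparing steps numerically so that step 10 sorts after step 2.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(ckpt_root: Path, project: str, exp: str) -> Optional[Path]:
    """Return the global_step_N directory with the largest N, or None."""
    exp_dir = ckpt_root / project / exp
    if not exp_dir.is_dir():
        return None
    best_step, best_dir = -1, None
    for child in exp_dir.iterdir():
        match = re.fullmatch(r"global_step_(\d+)", child.name)
        # Ignore anything that is not a global_step_N directory.
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_dir = int(match.group(1)), child
    return best_dir
```

A call such as `latest_checkpoint(Path("repos/harbor-verl-train/checkpoints"), project, exp)` would then return the newest step directory for that experiment, if one exists.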
The training venv
All training code runs from a single venv (default
repos/harbor-verl-train/.venv) with editable installs of harbor, verl,
and harbor-verl-train. The editable paths must point at the repos/ trees you
intend to run — otherwise the process silently executes different code. Verify with:
$VENV/bin/python -c "import harbor, verl, verl_patch; print(harbor.__file__)"
Metrics
verl logs one step:N - key:value - ... line per step (to the launch log and
wandb). The ones to watch:
| Metric | Meaning |
|---|---|
| critic/score/mean | Mean reward (task success signal) |
| actor/entropy | Policy entropy — exploration / collapse |
| actor/ppo_kl | KL between old and new policy |
| actor/pg_clipfrac | Fraction of clipped policy-gradient updates |
| perf/mfu/actor | Model FLOPs utilization (throughput) |
| response_length/mean | Mean rollout length in tokens |
| rollout_corr/kl | Rollout-vs-training distribution shift |
The dashboard renders all of these as live charts.
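For ad hoc inspection of the launch log, a per-step line can be split into a dictionary. The parser below is a sketch that assumes the fields are separated by " - " and that each field is `key:value` with a numeric value, as the `step:N - key:value - ...` pattern suggests. The real log may contain prefixes or non-numeric values, which this version simply skips. `parse_step_line` is a hypothetical helper.

```python
def parse_step_line(line: str) -> dict[str, float]:
    """Parse 'step:N - key:value - ...' into {key: float(value)}."""
    metrics: dict[str, float] = {}
    for field in line.strip().split(" - "):
        key, sep, value = field.partition(":")
        if not sep:
            continue  # not a key:value field
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            continue  # skip non-numeric values
    return metrics
```

Applied to every line of the log, this gives a simple time series per metric, for example to plot critic/score/mean against step without opening the dashboard.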