Core Concepts

rl has the following core concepts:

Policy, rollout, and trial

The policy is the model being trained. A rollout (Harbor calls it a trial) is one full attempt by the agent to solve a task: the agent reasons, edits files, and runs commands inside a sandbox until it submits or hits the turn/timeout limit. One training step samples a batch of rollouts across tasks.

Reward

Each task ships its own verifier (a test script). When a rollout finishes, the verifier runs inside the sandbox and produces a reward — typically 1.0 if the tests pass and 0.0 otherwise. This grounds the optimization in real task success rather than a learned reward model. On the dashboard this surfaces as critic/score/mean.
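The reward rule above can be sketched in a few lines. This is an illustration only: it assumes the verifier reports success through a zero exit code, which this page does not state, and the function names `reward_from_exit_code` and `mean_reward` are hypothetical.

```python
def reward_from_exit_code(exit_code: int) -> float:
    """Map a verifier exit code to a binary reward.

    Assumed convention (not confirmed by the docs): 0 means the tests passed.
    """
    return 1.0 if exit_code == 0 else 0.0


def mean_reward(rewards: list[float]) -> float:
    """Average reward over a batch of rollouts, i.e. the batch success rate."""
    if not rewards:
        raise ValueError("cannot average an empty batch")
    return sum(rewards) / len(rewards)
```

With binary rewards, the mean over a batch is simply the fraction of rollouts whose tests passed, which is why the mean score can be read as a success rate.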

The inference chain

During training the policy serves itself: verl manages vLLM replicas, and the agent reaches them through a LiteLLM proxy.

claude-code (in a K8s pod or Docker container)
  └─▶ LiteLLM proxy on the training host :8002   (Anthropic/OpenAI API surface;
                                                   trajectory_logger writes per-trial JSONL)
        └─▶ vLLM  (verl-managed, DP × TP, OpenAI API surface)

The proxy normalizes the endpoint so the agent speaks an ordinary Anthropic/OpenAI protocol, attaches the trajectory logger, and serves the model under aliases (claude-*, hosted_vllm/<served>, <served>). LiteLLM starts after verl boots, because it discovers the vLLM addresses from Ray named actors (vllm_server_{i}_0).
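To make the naming patterns above concrete, here is a small sketch that expands them. The helper names are hypothetical, and starting the replica index at 0 is an assumption; only the patterns themselves come from the description above.

```python
def vllm_actor_names(num_replicas: int) -> list[str]:
    """Ray actor names following the vllm_server_{i}_0 pattern.

    Assumes replica indices start at 0 (not confirmed by the docs).
    """
    return [f"vllm_server_{i}_0" for i in range(num_replicas)]


def model_aliases(served: str) -> list[str]:
    """Alias patterns under which the proxy exposes the served model."""
    return ["claude-*", f"hosted_vllm/{served}", served]
```

For example, `model_aliases("my-model")` yields the wildcard `claude-*`, the prefixed `hosted_vllm/my-model`, and the bare `my-model`, so an agent can address the policy by any of those names.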

The verl training loop

verl drives the synchronous RL loop: generate rollouts with vLLM → execute each trial in a Harbor sandbox → collect rewards → compute advantages → update the actor (and critic, for PPO). The trainer and the rollout engine share the same GPUs on one node (sync_1node_cc.sh). This single-node, synchronous setup is the recommended, minimal-cost default; for the multi-node, fully-async variant (decoupled rollout/trainer pools, veomni/MoE, OH-SDK) see Scaling Up.
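The order of the stages in one synchronous step can be sketched as follows. This is not verl's code: `train_step` and its callable parameters are hypothetical stand-ins that only show how the stages depend on each other.

```python
from typing import Callable, Sequence


def train_step(
    tasks: Sequence[str],
    generate_and_execute: Callable[[str], float],
    compute_advantages: Callable[[list[float]], list[float]],
    update_policy: Callable[[list[float]], None],
) -> float:
    """Run one synchronous step and return the mean reward of the batch."""
    if not tasks:
        raise ValueError("a step needs at least one task")
    # Generate a rollout per task, execute it in a sandbox, collect its reward.
    rewards = [generate_and_execute(task) for task in tasks]
    # Turn raw rewards into advantages (the estimator depends on the algorithm).
    advantages = compute_advantages(rewards)
    # Update the policy only after every rollout in the batch has finished.
    update_policy(advantages)
    return sum(rewards) / len(rewards)
```

Because the update waits for the whole batch, the slowest rollout sets the step time. That is the trade-off the fully-async variant is meant to address.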

Algorithms

The algorithm config selects the estimator and loss:

PPO — actor + critic, clipped policy-gradient objective.
GRPO — group-relative advantages, no critic.
GSPO — sequence-level policy optimization.

The algorithm settings live in the upstream sync_1node_cc.sh and are mirrored into config.yaml for documentation; see Config Variants.
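As an illustration of what "group-relative advantages, no critic" means for GRPO, the sketch below normalizes each reward against the other rollouts of the same task. It is a simplified version and not verl's implementation: the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own group.

    A group is the set of rollouts sampled for one task. Uses the population
    standard deviation; eps avoids division by zero (both are assumptions).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary rewards, a group in which every rollout passes (or every rollout fails) produces all-zero advantages, so only tasks with mixed outcomes contribute a learning signal.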

Agent scaffold

The scaffold is the agent harness that drives the model through a task — claude-code by default. It determines the rollout's shape and which runtime image Harbor launches. It is set in runtime_info.input.harbor_agent.

Sandbox backend

Every trial runs in a fresh sandbox. rl supports two backends:

Kubernetes: each trial is a pod; production default. Configured via k8s.kubeconfig and the harbor_agent import path.
Docker: each trial is a local or remote container; the minimal setup, no cluster required.

See Backends for the trade-offs and config.

Checkpoints

verl writes the actor as FSDP shards every save_freq steps to repos/harbor-verl-train/checkpoints/<project>/<exp>/global_step_N/. These are the block's primary output (runtime_info.output.checkpoint_path) and are preserved across clean.sh runs.
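A common need is locating the most recent checkpoint in that layout. The helper below is a minimal sketch, assuming only the global_step_N directory naming described above; `latest_checkpoint` is a hypothetical name, not part of verl or rl.

```python
import re
from pathlib import Path
from typing import Optional

_STEP_DIR = re.compile(r"^global_step_(\d+)$")


def latest_checkpoint(exp_dir: Path) -> Optional[Path]:
    """Return the global_step_N subdirectory with the largest N, or None."""
    if not exp_dir.is_dir():
        return None
    best_step, best_path = -1, None
    for child in exp_dir.iterdir():
        match = _STEP_DIR.match(child.name)
        # Compare steps numerically so that step 10 sorts after step 2.
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

Comparing the step numbers as integers matters here: a plain lexical sort would rank global_step_2 above global_step_10.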

The training venv

All training code runs from a single venv (default repos/harbor-verl-train/.venv) with editable installs of harbor, verl, and harbor-verl-train. The editable paths must point at the repos/ trees you intend to run — otherwise the process silently executes different code. Verify with:

$VENV/bin/python -c "import harbor, verl, verl_patch; print(harbor.__file__)"
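The printed path still has to be checked by eye. As a hedged sketch, the comparison could be automated with a small helper like the one below; `is_inside` is a hypothetical name, and the paths in the usage note are placeholders.

```python
from pathlib import Path


def is_inside(module_file: str, repos_root: str) -> bool:
    """Return True if module_file resolves to a location under repos_root."""
    module_path = Path(module_file).resolve()
    root = Path(repos_root).resolve()
    return root == module_path or root in module_path.parents
```

In practice you would pass `harbor.__file__` (and likewise for verl) together with the absolute path of the repos/ tree you intend to run, and treat a False result as a sign that the venv points at a different checkout.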

Metrics

verl logs one line per step, in the form step:N - key:value - ..., to both the launch log and wandb. The metrics to watch:

| Metric | Meaning |
| --- | --- |
| `critic/score/mean` | Mean reward (task success signal) |
| `actor/entropy` | Policy entropy (exploration / collapse) |
| `actor/ppo_kl` | KL between old and new policy |
| `actor/pg_clipfrac` | Fraction of clipped policy-gradient updates |
| `perf/mfu/actor` | Model FLOPs utilization (throughput) |
| `response_length/mean` | Mean rollout length in tokens |
| `rollout_corr/kl` | Rollout-vs-training distribution shift |

The dashboard renders all of these as live charts.
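If you want these values outside the dashboard, the per-step log line is easy to parse. The sketch below assumes the line begins with the step field and that fields are separated by " - ", as in the format shown above; any log prefix would have to be stripped first. `parse_step_line` is a hypothetical helper.

```python
def parse_step_line(line: str) -> tuple[int, dict[str, float]]:
    """Parse a 'step:N - key:value - ...' line into (step, metrics)."""
    step = None
    metrics: dict[str, float] = {}
    for field in line.strip().split(" - "):
        key, sep, value = field.partition(":")
        if not sep:
            continue  # not a key:value field
        key = key.strip()
        if key == "step":
            step = int(value)
            continue
        try:
            metrics[key] = float(value)
        except ValueError:
            continue  # skip non-numeric values
    if step is None:
        raise ValueError("line has no step field")
    return step, metrics
```

Splitting on " - " with surrounding spaces, rather than on a bare hyphen, keeps negative values such as `-0.5` intact.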
