Getting Started

This page walks through preparing the rl block and launching a training run end to end. rl wraps a pinned harbor-verl-train checkout (a submodule under repos/, which in turn pins harbor and a patched verl), so most setup is about getting that runtime, the policy model, and a sandbox backend in place.

Prerequisites

8× GPU (A100/H100-class) on the training node — vLLM serves the policy and verl runs the trainer on the same host.
A sandbox backend — either a reachable Kubernetes cluster (production) or a Docker daemon (local or remote) for Harbor to run each trial in.
uv, used by setup_env.sh to build the training venv (pinned vllm and flash_attn, plus editable installs of verl, harbor, and harbor-verl-train).
A policy checkpoint — e.g. an sft-trained Qwen3 model on local disk.
Train/val task indexes — Harbor task parquet files (*.parquet) that the trainer samples rollouts from.
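
A quick way to sanity-check the tooling side of these prerequisites is sketched below. It is not part of the rl block: the helper name have is hypothetical, and it assumes nvidia-smi, uv, kubectl, and docker are the relevant binaries on the training node. Finding a client binary does not prove the backend is reachable; the preflight in step 4 covers that.

```shell
# Hypothetical helper: succeeds if a command is on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Count GPUs visible to the driver (nvidia-smi -L prints one line per GPU).
if have nvidia-smi; then
  gpu_count=$(nvidia-smi -L 2>/dev/null | wc -l)
else
  gpu_count=0
fi
echo "GPUs visible: $gpu_count (this page assumes 8)"

# uv is needed by setup_env.sh to build the training venv.
have uv || echo "uv not found on PATH"

# At least one sandbox backend client should be available.
if ! have kubectl && ! have docker; then
  echo "neither kubectl nor docker found; one sandbox backend is required"
fi
```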

Where to run

rl runs on the node declared in config.yaml (meta_info.resources.ip; null means the current host). Training is hours-long — launch it in the background (the default of /rl:run) or inside a tmux session so it survives shell disconnects.

1. Fetch the managed repos

rl depends on three submodules under repos/ — harbor-verl-train, harbor, and verl — pinned in config.yaml. Initialize them:

git submodule update --init repos/harbor-verl-train repos/harbor repos/verl

harbor and verl are pinned to the commits harbor-verl-train/scripts/setup_env.sh expects; a patch (patches/verl_*.patch) is applied to verl during setup.
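
To confirm the submodules were actually checked out, git submodule status can be read mechanically. The sketch below relies on git's documented status prefixes (- for not initialized, + for a checked-out commit that differs from the one recorded in the superproject, U for merge conflicts). The function name submodule_state is hypothetical, and whether the recorded commits match the pins in config.yaml is an assumption this check does not verify.

```shell
# Hypothetical helper: classify one line of `git submodule status` output.
submodule_state() {
  case "$1" in
    -*) echo "uninitialized" ;;  # submodule not initialized
    +*) echo "drifted" ;;        # checked-out commit differs from the recorded one
    U*) echo "conflict" ;;       # submodule has merge conflicts
    *)  echo "in-sync" ;;        # leading space: matches the recorded commit
  esac
}

# Only query git when run inside a work tree.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  git submodule status repos/harbor-verl-train repos/harbor repos/verl |
    while IFS= read -r line; do
      echo "$(submodule_state "$line"): $line"
    done
fi
```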

2. Build the environment

The first run of scripts/start.sh bootstraps a fresh venv at repos/harbor-verl-train/.venv via uv (editable installs of harbor / verl / harbor-verl-train plus pinned vllm / flash_attn / transformers). To do it ahead of time:

bash repos/harbor-verl-train/scripts/setup_env.sh

To reuse an existing venv, set runtime_info.input.environment.venv_path in config.yaml — but make sure its editable installs point at these repos/ trees (see Core Concepts).
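
One way to see where a venv's packages resolve from is to ask its interpreter directly. This is a sketch under assumptions: the import names verl and harbor are guessed from the repo names, the default venv path is the one from this step, and path_is_under is a hypothetical helper. The dry run in step 4 performs its own editable-install check, which may work differently.

```shell
# Hypothetical helper: succeeds if path $1 lies inside directory $2.
path_is_under() {
  case "$1" in
    "$2"/*) return 0 ;;
    *)      return 1 ;;
  esac
}

VENV="${VENV:-repos/harbor-verl-train/.venv}"   # or your venv_path value
REPOS="$(pwd)/repos"

if [ -x "$VENV/bin/python" ]; then
  for mod in verl harbor; do   # assumed import names
    # Directory the module is imported from; empty if the import fails.
    src=$("$VENV/bin/python" -c "import $mod, os; print(os.path.dirname(os.path.abspath($mod.__file__)))" 2>/dev/null) || true
    if path_is_under "$src" "$REPOS"; then
      echo "$mod: resolves under repos/ ($src)"
    else
      echo "$mod: NOT under repos/ (${src:-import failed})"
    fi
  done
fi
```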

3. Point at your model, data, and backend

Edit config.yaml → runtime_info.input:

model.model_path — the policy checkpoint to train.
data.train_index / data.val_index — the Harbor task parquet indexes.
Sandbox backend — either k8s.kubeconfig (K8s) or harbor_agent.docker_host (Docker). See Backends.
Also export WANDB_API_KEY=... in your shell; never hardcode the key in config.yaml.
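
Before running the preflight, it can help to confirm that the paths you just entered exist. The sketch below does not read config.yaml; MODEL_PATH, TRAIN_INDEX, and VAL_INDEX are hypothetical shell variables you fill in by hand with the same values, the /path/to/... defaults are placeholders, and check_path is a hypothetical helper.

```shell
# Hypothetical helper: report whether a configured path exists.
# $1 = config key (label only), $2 = path to test.
check_path() {
  if [ -e "$2" ]; then
    echo "ok: $1 -> $2"
  else
    echo "missing: $1 -> $2"
    return 1
  fi
}

# Placeholder values; copy the real ones from config.yaml by hand.
MODEL_PATH="${MODEL_PATH:-/path/to/checkpoint}"
TRAIN_INDEX="${TRAIN_INDEX:-/path/to/train.parquet}"
VAL_INDEX="${VAL_INDEX:-/path/to/val.parquet}"

status=0
check_path model.model_path "$MODEL_PATH" || status=1
check_path data.train_index "$TRAIN_INDEX" || status=1
check_path data.val_index "$VAL_INDEX" || status=1

# The wandb key must come from the environment, not from config.yaml.
if [ -z "${WANDB_API_KEY:-}" ]; then
  echo "missing: WANDB_API_KEY is not exported"
  status=1
fi
echo "input check status: $status"
```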

4. Validate the config

Run the dry-run preflight. It has no side effects and checks the pinned repos, GPU count, venv editable installs, K8s/Docker reachability, ports, disk, wandb credentials, and, crucially, that vllm.gen_tp evenly divides the model's num_key_value_heads:

bash scripts/dryrun.sh

Fix anything it reports before launching. See Preflight.
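
The KV-head condition is plain modular arithmetic and can be reproduced by hand. The sketch below assumes a Hugging Face style config.json with a top-level num_key_value_heads field in the checkpoint directory. The names tp_divides_kv_heads, MODEL_PATH, and GEN_TP are hypothetical, and the real preflight may implement the check differently.

```shell
# Hypothetical helper: succeeds if gen_tp ($1) evenly divides num_key_value_heads ($2).
tp_divides_kv_heads() {
  [ "$1" -gt 0 ] && [ $(( $2 % $1 )) -eq 0 ]
}

MODEL_PATH="${MODEL_PATH:-/path/to/checkpoint}"   # placeholder
GEN_TP="${GEN_TP:-4}"                             # illustrative value of vllm.gen_tp

if [ -f "$MODEL_PATH/config.json" ]; then
  # Assumption: the field sits at the top level of config.json.
  kv_heads=$(python3 -c 'import json, sys; print(json.load(open(sys.argv[1]))["num_key_value_heads"])' "$MODEL_PATH/config.json")
  if tp_divides_kv_heads "$GEN_TP" "$kv_heads"; then
    echo "ok: gen_tp=$GEN_TP divides num_key_value_heads=$kv_heads"
  else
    echo "bad: gen_tp=$GEN_TP does not divide num_key_value_heads=$kv_heads"
  fi
fi
```

For example, with 8 KV heads the values 1, 2, 4, and 8 pass this check, while 3 fails.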

5. Launch a run

Only after the dry run passes, launch training. start.sh bootstraps the venv if missing, then execs the upstream sync_1node_cc.sh (vLLM boot → LiteLLM proxy → verl PPO/GRPO/GSPO loop):

bash scripts/start.sh

vLLM CUDA-graph capture on a 30B MoE (TP=4) takes ~10–20 min before the first replica registers — this is expected. See Run Training.
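
Chaining the dry run and a detached tmux launch might look like the sketch below. It assumes scripts/dryrun.sh exits non-zero when a check fails, which this page does not confirm, and it uses a hypothetical session name, rl-train. The helper launch_after_preflight is also hypothetical.

```shell
SESSION="rl-train"   # hypothetical tmux session name

# Hypothetical helper: run the preflight, then start training in a detached tmux session.
launch_after_preflight() {
  # Assumption: dryrun.sh returns a non-zero exit status on any failed check.
  bash scripts/dryrun.sh || { echo "preflight failed; not launching"; return 1; }
  tmux new-session -d -s "$SESSION" 'bash scripts/start.sh'
  echo "started; attach with: tmux attach -t $SESSION"
}

# Run only from the block's root, where both scripts exist.
if [ -f scripts/dryrun.sh ] && [ -f scripts/start.sh ]; then
  launch_after_preflight
else
  echo "run this from the rl block root"
fi
```

Running inside tmux keeps the job alive across shell disconnects, as recommended under Where to run.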

6. Watch it train

Tail the launch log and bring up the dashboard:

tail -F logs/launch_<ts>.log
bash dashboard/serve.sh start          # http://<host>:8090

Checkpoints land under repos/harbor-verl-train/checkpoints/<project>/<exp>/global_step_N/. See Results & Artifacts.
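
Because the log name carries a timestamp and checkpoints are numbered by step, two small helpers can locate the newest of each. This is a sketch: newest_launch_log and latest_step are hypothetical helpers, and the my-project/my-exp path segment is a placeholder for your own project and experiment names.

```shell
# Hypothetical helper: print the most recently modified launch log in directory $1.
newest_launch_log() {
  ls -t "$1"/launch_*.log 2>/dev/null | head -n 1
}

# Hypothetical helper: print the highest N among global_step_N directories under $1.
latest_step() {
  best=""
  for d in "$1"/global_step_*; do
    [ -d "$d" ] || continue
    n=${d##*global_step_}
    case "$n" in ''|*[!0-9]*) continue ;; esac   # skip non-numeric suffixes
    if [ -z "$best" ] || [ "$n" -gt "$best" ]; then best=$n; fi
  done
  if [ -n "$best" ]; then echo "$best"; fi
}

log=$(newest_launch_log logs)
if [ -n "$log" ]; then
  echo "latest log: $log"   # follow it with: tail -F "$log"
fi

# Placeholder project/experiment names.
CKPT_DIR="${CKPT_DIR:-repos/harbor-verl-train/checkpoints/my-project/my-exp}"
step=$(latest_step "$CKPT_DIR")
if [ -n "$step" ]; then
  echo "latest checkpoint: $CKPT_DIR/global_step_$step"
fi
```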

Operating with the agent plugin

If you operate the block through its Claude plugin, the lifecycle maps to slash commands:

/rl:setup          # bootstrap a fresh clone: submodules at pins, venv, config inputs
/rl:check          # preflight: config, repos, GPU, KV-head divisibility, k8s/docker, ports, venv
/rl:run            # launch scripts/start.sh in the background (archived on exit)
/rl:dashboard      # textual training summary; `start` serves the live web dashboard
