Getting Started
Set up the block and launch your first training run
This page walks through preparing the rl block and launching a training run end
to end. rl wraps a pinned harbor-verl-train checkout (a submodule under
repos/, which in turn pins harbor and a patched verl), so most setup is
about getting that runtime, the policy model, and a sandbox backend in place.
Prerequisites
- 8× GPU (A100/H100-class) on the training node: vLLM serves the policy and verl runs the trainer on the same host.
- A sandbox backend: either a reachable Kubernetes cluster (production) or a Docker daemon (local or remote) for Harbor to run each trial in.
- uv: used by setup_env.sh to build the training venv (vllm, flash_attn, and editable installs of verl, harbor, and harbor-verl-train).
- A policy checkpoint: e.g. an sft-trained Qwen3 model on local disk.
- Train/val task indexes: Harbor task parquet files (*.parquet) that the trainer samples rollouts from.
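The first and third prerequisites can be checked by hand before any setup. The following is a minimal sketch, not part of the block itself: the function names have_cmd, gpu_count, and check_prereqs are hypothetical, and it assumes nvidia-smi is the way GPUs are enumerated on the training node.

```shell
# Hypothetical helpers for a manual prerequisite check (not shipped with rl).

# Succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Prints the number of GPUs nvidia-smi can see (0 if nvidia-smi is missing).
# `nvidia-smi -L` prints one "GPU <n>: ..." line per visible device.
gpu_count() {
  if have_cmd nvidia-smi; then
    nvidia-smi -L 2>/dev/null | grep -c '^GPU '
  else
    echo 0
  fi
}

# Prints one line per missing prerequisite; returns nonzero if any is missing.
check_prereqs() {
  required_gpus="${1:-8}"
  rc=0
  n="$(gpu_count)"
  if [ "$n" -lt "$required_gpus" ]; then
    echo "need $required_gpus GPUs, found $n"
    rc=1
  fi
  if ! have_cmd uv; then
    echo "uv not found on PATH"
    rc=1
  fi
  return "$rc"
}
```

Calling check_prereqs with no argument checks for 8 GPUs; pass a different number to override it.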
Where to run
rl runs on the node declared in config.yaml (meta_info.resources.ip; null
means the current host). Training is hours-long — launch it in the background
(the default of /rl:run) or inside a tmux session so it survives shell
disconnects.
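If you go the tmux route, a detached session keeps the run alive after the shell closes. This is a sketch under assumptions: the helper name launch_in_tmux, the DRY_RUN switch, and the session name rl-train are all illustrative, not something the block defines.

```shell
# Hypothetical helper: start a command in a detached tmux session.
# With DRY_RUN=1 it only prints the tmux invocation instead of running it.
launch_in_tmux() {
  session="$1"
  shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'tmux new-session -d -s %s %s\n' "$session" "$*"
    return 0
  fi
  # -d: do not attach; -s: session name; the last argument is run by the shell.
  tmux new-session -d -s "$session" "$*"
}

# Example (not executed here):
#   launch_in_tmux rl-train bash scripts/start.sh
#   tmux attach -t rl-train    # reattach later to watch the output
```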
1. Fetch the managed repos
rl depends on three submodules under repos/ — harbor-verl-train, harbor,
and verl — pinned in config.yaml. Initialize them:
git submodule update --init repos/harbor-verl-train repos/harbor repos/verl
harbor and verl are pinned to the commits harbor-verl-train/scripts/setup_env.sh
expects; a patch (patches/verl_*.patch) is applied to verl during setup.
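To confirm the submodules landed on their pins, git submodule status can be read line by line: a leading "-" marks an uninitialized submodule, "+" a checkout that differs from the recorded commit, and "U" a merge conflict. A small sketch follows; the function names are hypothetical, and the verl tree may legitimately show local modifications after the setup patch is applied.

```shell
# Hypothetical helper: classify one line of `git submodule status` output
# by its first character.
submodule_state() {
  case "$1" in
    -*) echo "uninitialized" ;;
    +*) echo "off-pin" ;;
    U*) echo "conflict" ;;
    *)  echo "ok" ;;
  esac
}

# Prints "<state><TAB><status line>" for each submodule path given.
check_submodules() {
  git submodule status "$@" | while IFS= read -r line; do
    printf '%s\t%s\n' "$(submodule_state "$line")" "$line"
  done
}

# Example (not executed here):
#   check_submodules repos/harbor-verl-train repos/harbor repos/verl
```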
2. Build the environment
The first run of scripts/start.sh bootstraps a fresh venv at
repos/harbor-verl-train/.venv via uv (editable installs of harbor / verl /
harbor-verl-train plus pinned vllm / flash_attn / transformers). To do it ahead
of time:
bash repos/harbor-verl-train/scripts/setup_env.sh
To reuse an existing venv, set runtime_info.input.environment.venv_path in
config.yaml, but make sure its editable installs point at these repos/
trees (see Core Concepts).
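One way to check where a reused venv's editable installs resolve is to ask its Python interpreter for each package's origin without importing it. The sketch below is an illustration only: the helper names are hypothetical, and the import names verl and harbor are assumed to match the package names.

```shell
# Hypothetical helper: succeeds if path $1 lies under directory $2.
path_is_under() {
  case "$1" in
    "$2"/*) return 0 ;;
    *)      return 1 ;;
  esac
}

# check_editable VENV REPOS_ROOT MODULE...
# REPOS_ROOT must be absolute, because module origins are absolute paths.
check_editable() {
  venv="$1"
  repos_root="$2"
  shift 2
  rc=0
  for mod in "$@"; do
    # find_spec locates a top-level module without executing it.
    origin="$("$venv/bin/python" -c 'import importlib.util, sys
spec = importlib.util.find_spec(sys.argv[1])
print(spec.origin if spec and spec.origin else "")' "$mod")"
    if path_is_under "$origin" "$repos_root"; then
      echo "ok       $mod -> $origin"
    else
      echo "MISMATCH $mod -> ${origin:-not found}"
      rc=1
    fi
  done
  return "$rc"
}

# Example (not executed here; module names are assumptions):
#   check_editable /path/to/venv "$PWD/repos" verl harbor
```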
3. Point at your model, data, and backend
Edit config.yaml → runtime_info.input:
- model.model_path: the policy checkpoint to train.
- data.train_index / data.val_index: the Harbor task parquet indexes.
- Sandbox backend: either k8s.kubeconfig (K8s) or harbor_agent.docker_host (Docker). See Backends.
- export WANDB_API_KEY=... in your shell (never hardcode it in config.yaml).
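The last item can be checked mechanically: the key should be present in the environment and absent from the config file. A minimal sketch, assuming nothing about the block's own preflight; wandb_key_ok is a hypothetical name.

```shell
# Hypothetical helper: succeeds only if WANDB_API_KEY is set in the
# environment and its value does not appear literally in the config file.
wandb_key_ok() {
  config="${1:-config.yaml}"
  if [ -z "${WANDB_API_KEY:-}" ]; then
    echo "WANDB_API_KEY is not set in this shell" >&2
    return 1
  fi
  # -F: match the key as a fixed string, not a regular expression.
  if [ -f "$config" ] && grep -qF -- "$WANDB_API_KEY" "$config"; then
    echo "WANDB_API_KEY value appears in $config; remove it" >&2
    return 1
  fi
  return 0
}
```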
4. Validate the config
Run the dry-run preflight. It checks the pinned repos, GPU count, venv editable
installs, K8s/Docker reachability, ports, disk, wandb credentials, and — crucially
— that vllm.gen_tp divides the model's num_key_value_heads, all without side
effects:
bash scripts/dryrun.sh
Fix anything it reports before launching. See Preflight.
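The KV-head rule is plain arithmetic: num_key_value_heads modulo vllm.gen_tp must be zero. The sketch below illustrates the check and is not the code dryrun.sh runs; the function names are hypothetical, and read_kv_heads assumes the checkpoint directory holds a Hugging Face style config.json with a top-level num_key_value_heads field.

```shell
# Hypothetical helper: succeeds if the tensor-parallel size divides the
# number of KV heads evenly.
tp_divides_kv_heads() {
  kv_heads="$1"
  tp="$2"
  [ "$tp" -gt 0 ] && [ $((kv_heads % tp)) -eq 0 ]
}

# Hypothetical helper: print num_key_value_heads from <model_path>/config.json.
read_kv_heads() {
  python3 -c 'import json, sys
print(json.load(open(sys.argv[1]))["num_key_value_heads"])' "$1/config.json"
}

# Example (not executed here):
#   tp_divides_kv_heads "$(read_kv_heads /path/to/checkpoint)" 4 \
#     || echo "gen_tp does not divide num_key_value_heads"
```

For a model with 4 KV heads, for instance, a gen_tp of 1, 2, or 4 passes and 8 fails.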
5. Launch a run
Only after the dry run passes, launch training. start.sh bootstraps the venv
if missing, then execs the upstream sync_1node_cc.sh (vLLM boot → LiteLLM proxy
→ verl PPO/GRPO/GSPO loop):
bash scripts/start.sh
vLLM CUDA-graph capture on a 30B MoE (TP=4) takes ~10–20 min before the first replica registers; this is expected. See Run Training.
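To make the "dry run first" rule hard to skip, the two scripts can be chained so that the launch only happens on a clean preflight. This sketch assumes scripts/dryrun.sh exits nonzero when a check fails, which the source does not state explicitly; gated_launch is a hypothetical name.

```shell
# Hypothetical helper: run the launch script only if the preflight script
# exits with status 0; otherwise propagate the preflight's exit status.
gated_launch() {
  preflight="$1"
  launch="$2"
  if bash "$preflight"; then
    bash "$launch"
  else
    status=$?
    echo "preflight failed (exit $status); not launching" >&2
    return "$status"
  fi
}

# Example (not executed here):
#   gated_launch scripts/dryrun.sh scripts/start.sh
```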
6. Watch it train
Tail the launch log and bring up the dashboard:
tail -F logs/launch_<ts>.log
bash dashboard/serve.sh start # http://<host>:8090
Checkpoints land under repos/harbor-verl-train/checkpoints/<project>/<exp>/global_step_N/.
See Results & Artifacts.
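Because the step number in global_step_N is not zero-padded, a plain alphabetical listing puts global_step_100 before global_step_20. A small sketch for picking the newest checkpoint by numeric step; latest_checkpoint is a hypothetical helper, and the experiment directory is whatever <project>/<exp> resolves to for your run.

```shell
# Hypothetical helper: print the global_step_N directory with the largest N
# under the given experiment directory; returns nonzero if none exists.
latest_checkpoint() {
  exp_dir="$1"
  best=-1
  best_path=""
  for d in "$exp_dir"/global_step_*/; do
    [ -d "$d" ] || continue
    step="${d%/}"
    step="${step##*global_step_}"
    # Skip anything whose suffix is not a plain non-negative integer.
    case "$step" in
      ''|*[!0-9]*) continue ;;
    esac
    if [ "$step" -gt "$best" ]; then
      best="$step"
      best_path="${d%/}"
    fi
  done
  [ -n "$best_path" ] && printf '%s\n' "$best_path"
}
```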
Operating with the agent plugin
If you operate the block through its Claude plugin, the lifecycle maps to slash commands:
/rl:setup # bootstrap a fresh clone: submodules at pins, venv, config inputs
/rl:check # preflight: config, repos, GPU, KV-head divisibility, k8s/docker, ports, venv
/rl:run # launch scripts/start.sh in the background (archived on exit)
/rl:dashboard # textual training summary; `start` serves the live web dashboard