Inputs & Outputs
The config-driven input/output contract
rl is configured entirely through config.yaml.
It uses config.yaml rather than a separate inputs.yaml because the file
describes a full training run profile, not just upstream inputs. Treat
config.yaml as the source of truth and review it before launching a run.
Two config tiers
| Tier | Sections | Plumbed via |
|---|---|---|
| Env-driven (live) | model, data, infrastructure, environment, k8s, harbor_agent, harbor_runtime, experiment, credentials | train_1node_cc.sh exports them as env vars consumed by sync_1node_cc.sh |
| Upstream-fixed (docs only) | vllm, training, algorithm | hardcoded in sync_1node_cc.sh; mirrored here so the file documents the live state |
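The two tiers can be told apart mechanically. The sketch below is illustrative only: the section names come from the table above, while `split_tiers` and the idea of running it on a parsed config.yaml are assumptions, not part of rl.

```python
# Section names are copied from the tier table; the helper is hypothetical.
ENV_DRIVEN = {
    "model", "data", "infrastructure", "environment", "k8s",
    "harbor_agent", "harbor_runtime", "experiment", "credentials",
}
UPSTREAM_FIXED = {"vllm", "training", "algorithm"}
KNOWN = ENV_DRIVEN | UPSTREAM_FIXED


def split_tiers(config: dict) -> tuple[dict, dict, list[str]]:
    """Split a parsed config into (live, docs_only, other section names)."""
    live = {k: v for k, v in config.items() if k in ENV_DRIVEN}
    docs_only = {k: v for k, v in config.items() if k in UPSTREAM_FIXED}
    # Sections in neither tier (for example runtime_info) are only reported.
    other = sorted(k for k in config if k not in KNOWN)
    return live, docs_only, other


if __name__ == "__main__":
    cfg = {"model": {"served_model_name": "vllm_model"}, "vllm": {"gen_tp": 4}}
    live, docs_only, other = split_tiers(cfg)
    print("live:", sorted(live))
    print("docs only (edits here do not change the run):", sorted(docs_only))
```

A check like this is mainly useful as a reminder that editing a docs-only section documents the run but does not change it.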
Inputs
| Name | Description | Source |
|---|---|---|
| model | Policy checkpoint path + served model name | external |
| data | Train / val Harbor task parquet indexes | external |
| infrastructure | nodes, GPUs per node, Ray / LiteLLM ports | external |
| environment | bootstrap script, venv_path, requirements | external |
| k8s / harbor_agent | sandbox backend (kubeconfig or docker_host) + agent scaffold | external |
| harbor_runtime | rollout worker count, tail-killer | external |
| experiment | project / exp name, trajectory logger source | external |
| credentials | wandb_api_key (prefer $WANDB_API_KEY), wandb_mode | external |
The values you normally fill in are model.model_path (the policy checkpoint),
the data.train_index and data.val_index paths, the sandbox backend
(kubeconfig or docker_host), and $WANDB_API_KEY in your shell.
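As a rough illustration of checking those values before a launch, here is a minimal sketch. It assumes config.yaml has already been parsed into a dict; `missing_required` is a hypothetical helper, and the sandbox backend is left out because the exact key layout under k8s / harbor_agent is not shown here.

```python
import os
from typing import Mapping, Optional


def missing_required(config: dict, env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names of commonly required values that are still empty."""
    env = os.environ if env is None else env
    missing = []

    if not config.get("model", {}).get("model_path"):
        missing.append("model.model_path")

    data = config.get("data", {})
    for key in ("train_index", "val_index"):
        if not data.get(key):
            missing.append(f"data.{key}")

    # The inputs table prefers $WANDB_API_KEY over credentials.wandb_api_key,
    # so either one is accepted here (an assumption of this sketch).
    creds = config.get("credentials", {})
    if not (env.get("WANDB_API_KEY") or creds.get("wandb_api_key")):
        missing.append("WANDB_API_KEY")

    return missing
```

An empty list means the usual fields are populated; it does not confirm that the paths exist or that the key is valid.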
Active runtime values
Excerpted from config.yaml:

    model:
      model_path: /path/to/models/Qwen3-30B-A3B-Instruct-2507
      served_model_name: vllm_model
    data:
      train_index: /path/to/harbor_indexes/train.parquet
      val_index: /path/to/harbor_indexes/val.parquet
    infrastructure:
      nnodes: 1
      ngpus_per_node: 8
      litellm_port: 8002
    vllm:
      gen_tp: 4
      max_model_length: 128000
      gpu_memory_utilization: 0.85
    training:
      train_batch_size: 64
      n_resp_per_prompt: 8  # 64 × 8 = 512 trials/step
      max_prompt_length: 40000
      max_response_length: 68000
      total_epochs: 3
      save_freq: 5
      test_freq: 5
    algorithm:
      adv_estimator: grpo
      policy_loss_mode: gspo
      learning_rate: 1.0e-06
    experiment:
      project_name: swe-lego-live-rl
    credentials:
      wandb_mode: online

KV-head divisibility
vllm.gen_tp must divide the model's num_key_value_heads, or training crashes
at the first forward pass. dryrun.sh enforces this — see Preflight.
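The divisibility rule is simple enough to check by hand. The sketch below is not the dryrun.sh implementation; it assumes the checkpoint directory contains a Hugging Face style config.json with a `num_key_value_heads` field, and both function names are hypothetical.

```python
import json
from pathlib import Path


def kv_heads_divisible(num_key_value_heads: int, gen_tp: int) -> bool:
    """True when gen_tp evenly divides the model's KV-head count."""
    if gen_tp <= 0:
        raise ValueError("gen_tp must be a positive integer")
    return num_key_value_heads % gen_tp == 0


def read_num_kv_heads(model_path: str) -> int:
    """Read num_key_value_heads from <model_path>/config.json (assumed layout)."""
    with open(Path(model_path) / "config.json", encoding="utf-8") as fh:
        return int(json.load(fh)["num_key_value_heads"])


if __name__ == "__main__":
    # Illustrative numbers only: 8 KV heads split across 4 tensor-parallel ranks.
    print(kv_heads_divisible(8, 4))
    # A gen_tp larger than the KV-head count cannot divide it.
    print(kv_heads_divisible(4, 8))
```

Running the check against the real checkpoint before launch is cheaper than finding the mismatch at the first forward pass.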
Outputs
rl declares its handoff contract in config.yaml → runtime_info.output:
| Output | Path | Notes |
|---|---|---|
| checkpoint_path | repos/harbor-verl-train/checkpoints/ | actor FSDP shards (per save_freq) |
| training_log | repos/harbor-verl-train/logs/<exp>.log | per-run training log |
| vllm_log | repos/harbor-verl-train/logs/<exp>_vllm.log | throughput-only log |
| trajectory_log | repos/harbor-verl-train/harbor_trials/<project>/<exp>/ | per-trial agent traffic |
| val_resolve_rate / training_curves | wandb + dashboard | validation resolve rate and metric curves |
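To make the `<project>` and `<exp>` placeholders concrete, the following sketch expands the file-based rows of the table for a given run. The layout is copied from the table; `output_paths` is a hypothetical helper, and the paths are treated as relative to wherever the repos/ directory lives.

```python
from pathlib import PurePosixPath

# Root directory as written in the outputs table.
ROOT = PurePosixPath("repos/harbor-verl-train")


def output_paths(project: str, exp: str) -> dict[str, PurePosixPath]:
    """Expand the <project>/<exp> placeholders from the outputs table."""
    return {
        "checkpoint_path": ROOT / "checkpoints",
        "training_log": ROOT / "logs" / f"{exp}.log",
        "vllm_log": ROOT / "logs" / f"{exp}_vllm.log",
        "trajectory_log": ROOT / "harbor_trials" / project / exp,
    }


if __name__ == "__main__":
    # "demo" is a made-up experiment name used only for illustration.
    for name, path in output_paths("swe-lego-live-rl", "demo").items():
        print(f"{name}: {path}")
```

The wandb and dashboard outputs are not files on disk, so they are not included.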
See Results & Artifacts for the full layout, and Config Variants for changing the run profile.