Inputs & Outputs

rl is configured through a single file, config.yaml. It uses config.yaml rather than a separate inputs.yaml because the file describes a full training run profile, not just upstream inputs. Treat config.yaml as the source of truth and review it before launching a run.

Two config tiers

Tier	Sections	Plumbed via
Env-driven (live)	`model`, `data`, `infrastructure`, `environment`, `k8s`, `harbor_agent`, `harbor_runtime`, `experiment`, `credentials`	`train_1node_cc.sh` exports them as env vars consumed by `sync_1node_cc.sh`
Upstream-fixed (docs only)	`vllm`, `training`, `algorithm`	hardcoded in `sync_1node_cc.sh`; mirrored here so the file documents the live state
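
The split between the two tiers can be pictured as a filter over the top-level sections: only env-driven sections become environment variables, and upstream-fixed sections are skipped. The sketch below is illustrative only. The `SECTION_KEY` upper-case naming scheme and the `env_exports` helper are assumptions; the actual variable names are whatever `train_1node_cc.sh` exports.

```python
# Sections listed in the tier table above.
ENV_DRIVEN = {
    "model", "data", "infrastructure", "environment", "k8s",
    "harbor_agent", "harbor_runtime", "experiment", "credentials",
}
UPSTREAM_FIXED = {"vllm", "training", "algorithm"}


def env_exports(config: dict) -> dict:
    """Flatten env-driven sections into NAME -> value pairs.

    The SECTION_KEY naming is hypothetical, not the script's real scheme.
    """
    exports = {}
    for section, values in config.items():
        # Upstream-fixed (or unknown) sections are documentation only.
        if section not in ENV_DRIVEN or not isinstance(values, dict):
            continue
        for key, value in values.items():
            exports[f"{section}_{key}".upper()] = str(value)
    return exports


config = {
    "model": {"served_model_name": "vllm_model"},
    "infrastructure": {"nnodes": 1, "ngpus_per_node": 8},
    "vllm": {"gen_tp": 4},  # upstream-fixed: never exported
}
exports = env_exports(config)
```

Under this sketch, editing `vllm.gen_tp` in config.yaml changes nothing at runtime, which is the practical meaning of "docs only".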

Inputs

Name	Description	Source
`model`	Policy checkpoint path + served model name	external
`data`	Train / val Harbor task parquet indexes	external
`infrastructure`	nodes, GPUs per node, Ray / LiteLLM ports	external
`environment`	bootstrap script, `venv_path`, requirements	external
`k8s` / `harbor_agent`	sandbox backend (kubeconfig or docker_host) + agent scaffold	external
`harbor_runtime`	rollout worker count, tail-killer	external
`experiment`	project / exp name, trajectory logger source	external
`credentials`	`wandb_api_key` (prefer `$WANDB_API_KEY`), `wandb_mode`	external

The values you normally fill in are model.model_path (the policy checkpoint), the train and val data indexes, the sandbox backend, and $WANDB_API_KEY, which is best exported in your shell rather than written into the file.
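
A quick way to catch values that were never filled in is to look for the `/path/to/` placeholders and a missing `$WANDB_API_KEY`. The helper below is a minimal sketch, not part of rl: the function name `unfilled_values` is hypothetical, and the sandbox backend is left out because its exact key names are not shown on this page.

```python
def unfilled_values(config: dict, env: dict) -> list:
    """Return the dotted names of values that still look unset."""
    problems = []
    for dotted in ("model.model_path", "data.train_index", "data.val_index"):
        section, key = dotted.split(".")
        value = str(config.get(section, {}).get(key, ""))
        # Treat empty values and untouched template paths as unfilled.
        if not value or value.startswith("/path/to/"):
            problems.append(dotted)
    if not env.get("WANDB_API_KEY"):
        problems.append("$WANDB_API_KEY")
    return problems


example_config = {
    "model": {"model_path": "/path/to/models/Qwen3-30B-A3B-Instruct-2507"},
    "data": {
        "train_index": "/data/harbor_indexes/train.parquet",
        "val_index": "/data/harbor_indexes/val.parquet",
    },
}
problems = unfilled_values(example_config, env={})
```

In real use, pass `os.environ` as `env` and the parsed config.yaml as `config`.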

Active runtime values

Excerpted from config.yaml:

model:
  model_path: /path/to/models/Qwen3-30B-A3B-Instruct-2507
  served_model_name: vllm_model
data:
  train_index: /path/to/harbor_indexes/train.parquet
  val_index:   /path/to/harbor_indexes/val.parquet
infrastructure:
  nnodes: 1
  ngpus_per_node: 8
  litellm_port: 8002
vllm:
  gen_tp: 4
  max_model_length: 128000
  gpu_memory_utilization: 0.85
training:
  train_batch_size: 64
  n_resp_per_prompt: 8          # 64 × 8 = 512 trials/step
  max_prompt_length: 40000
  max_response_length: 68000
  total_epochs: 3
  save_freq: 5
  test_freq: 5
algorithm:
  adv_estimator: grpo
  policy_loss_mode: gspo
  learning_rate: 1.0e-06
experiment:
  project_name: swe-lego-live-rl
credentials:
  wandb_mode: online
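
The numbers in the excerpt can be sanity-checked with a little arithmetic. The sketch below recomputes the trials per step from the inline comment and compares the longest possible sequence (prompt plus response) against `vllm.max_model_length`. That the sum should stay within the context window is a general property of vLLM's maximum model length; whether rl checks it anywhere is not stated here.

```python
# Values copied from the config.yaml excerpt above.
training = {
    "train_batch_size": 64,
    "n_resp_per_prompt": 8,
    "max_prompt_length": 40000,
    "max_response_length": 68000,
}
vllm = {"max_model_length": 128000}

# One trial per (prompt, sampled response) pair.
trials_per_step = training["train_batch_size"] * training["n_resp_per_prompt"]

# Worst case: a maximal prompt followed by a maximal response.
longest_sequence = training["max_prompt_length"] + training["max_response_length"]
headroom = vllm["max_model_length"] - longest_sequence

print(f"trials/step: {trials_per_step}")
print(f"longest sequence: {longest_sequence} tokens, headroom: {headroom}")
```

With the values shown, this gives 512 trials per step and a worst-case sequence of 108000 tokens, leaving 20000 tokens of headroom under the 128000-token limit.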

KV-head divisibility

vllm.gen_tp must divide the model's num_key_value_heads evenly, or training crashes at the first forward pass. With gen_tp: 4, for example, the KV head count must be a multiple of 4. dryrun.sh enforces this; see Preflight.
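
The rule can be checked by hand before a run. The sketch below is not how dryrun.sh implements the check; it assumes the checkpoint directory is in Hugging Face format with a `config.json` that contains `num_key_value_heads`, and both helper names are hypothetical.

```python
import json
from pathlib import Path


def read_num_kv_heads(model_path: str) -> int:
    """Read num_key_value_heads from <model_path>/config.json (assumed HF layout)."""
    with open(Path(model_path) / "config.json") as f:
        return int(json.load(f)["num_key_value_heads"])


def gen_tp_divides_kv_heads(gen_tp: int, num_kv_heads: int) -> bool:
    """True when the KV heads split evenly across gen_tp tensor-parallel ranks."""
    return gen_tp > 0 and num_kv_heads % gen_tp == 0


# Example: 8 KV heads shard cleanly over 4 ranks, 4 KV heads do not over 8.
ok = gen_tp_divides_kv_heads(gen_tp=4, num_kv_heads=8)
bad = gen_tp_divides_kv_heads(gen_tp=8, num_kv_heads=4)
```

A typical call would be `gen_tp_divides_kv_heads(4, read_num_kv_heads(model_path))` with `model_path` taken from config.yaml.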

Outputs

rl declares its handoff contract in config.yaml → runtime_info.output:

Output	Path	Notes
`checkpoint_path`	`repos/harbor-verl-train/checkpoints/`	actor FSDP shards (per `save_freq`)
`training_log`	`repos/harbor-verl-train/logs/<exp>.log`	per-run training log
`vllm_log`	`repos/harbor-verl-train/logs/<exp>_vllm.log`	throughput-only log
`trajectory_log`	`repos/harbor-verl-train/harbor_trials/<project>/<exp>/`	per-trial agent traffic
`val_resolve_rate` / `training_curves`	wandb + dashboard	validation resolve rate and metric curves
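
To locate a run's artifacts, substitute the project and experiment names into the path templates from the table. The sketch below does only that string substitution; `expected_outputs` is a hypothetical helper, and `demo-exp` is a made-up experiment name used purely for illustration.

```python
from pathlib import PurePosixPath

# Repository root taken from the Path column of the table above.
ROOT = PurePosixPath("repos/harbor-verl-train")


def expected_outputs(project: str, exp: str) -> dict:
    """Map each file-based output name to its path for one run."""
    return {
        "checkpoint_path": str(ROOT / "checkpoints"),
        "training_log": str(ROOT / "logs" / f"{exp}.log"),
        "vllm_log": str(ROOT / "logs" / f"{exp}_vllm.log"),
        "trajectory_log": str(ROOT / "harbor_trials" / project / exp),
    }


outputs = expected_outputs(project="swe-lego-live-rl", exp="demo-exp")
for name, path in outputs.items():
    print(f"{name}: {path}")
```

The wandb-backed outputs (`val_resolve_rate`, `training_curves`) are omitted because they have no filesystem path.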

See Results & Artifacts for the full layout, and Config Variants for changing the run profile.
