Verl-SWE-RL

Reference

Inputs & Outputs

The config-driven input/output contract

rl is configured entirely through config.yaml. It uses config.yaml rather than a separate inputs.yaml because the file describes a full training run profile, not just upstream inputs. Treat config.yaml as the source of truth, and review it before launching a run.

Two config tiers

| Tier | Sections | Plumbed via |
| --- | --- | --- |
| Env-driven (live) | model, data, infrastructure, environment, k8s, harbor_agent, harbor_runtime, experiment, credentials | train_1node_cc.sh exports them as env vars consumed by sync_1node_cc.sh |
| Upstream-fixed (docs only) | vllm, training, algorithm | hardcoded in sync_1node_cc.sh; mirrored here so the file documents the live state |
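The split between the two tiers can be illustrated with a minimal Python sketch. The section names come from the table above; the `SECTION_KEY` naming scheme for environment variables is purely hypothetical, since the actual names exported by train_1node_cc.sh are not listed on this page.

```python
# Section names are taken from the tier table; everything else is illustrative.
ENV_DRIVEN = (
    "model", "data", "infrastructure", "environment", "k8s",
    "harbor_agent", "harbor_runtime", "experiment", "credentials",
)
UPSTREAM_FIXED = ("vllm", "training", "algorithm")


def split_tiers(config):
    """Separate a parsed config dict into (env-driven, upstream-fixed) sections."""
    live = {k: v for k, v in config.items() if k in ENV_DRIVEN}
    fixed = {k: v for k, v in config.items() if k in UPSTREAM_FIXED}
    return live, fixed


def to_env(live):
    """Flatten env-driven sections into env-var style names.

    The SECTION_KEY naming is an assumption made for this sketch,
    not the scheme used by train_1node_cc.sh.
    """
    env = {}
    for section, values in live.items():
        for key, value in values.items():
            env[f"{section}_{key}".upper()] = str(value)
    return env
```

In this sketch, only the env-driven sections are turned into variables; the upstream-fixed sections are returned separately because, per the table, they are documentation of values hardcoded elsewhere.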

Inputs

| Name | Description | Source |
| --- | --- | --- |
| model | Policy checkpoint path + served model name | external |
| data | Train / val Harbor task parquet indexes | external |
| infrastructure | nodes, GPUs per node, Ray / LiteLLM ports | external |
| environment | bootstrap script, venv_path, requirements | external |
| k8s / harbor_agent | sandbox backend (kubeconfig or docker_host) + agent scaffold | external |
| harbor_runtime | rollout worker count, tail-killer | external |
| experiment | project / exp name, trajectory logger source | external |
| credentials | wandb_api_key (prefer $WANDB_API_KEY), wandb_mode | external |

The values you normally fill in are model.model_path (the policy checkpoint), the data indexes, the sandbox backend, and $WANDB_API_KEY in your shell.
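A quick way to confirm those values are filled in is sketched below. It assumes the config has already been parsed into a dict, that the backend keys kubeconfig and docker_host live under the k8s section, and that any value still starting with `/path/to/` is an unfilled placeholder; all three are assumptions of this sketch rather than documented behavior.

```python
import os


def _filled(value):
    """Treat empty values and '/path/to/' placeholders as not filled in."""
    return bool(value) and not str(value).startswith("/path/to/")


def missing_inputs(config, environ=None):
    """Return the names of the commonly required inputs that are still unset."""
    environ = os.environ if environ is None else environ
    missing = []

    if not _filled(config.get("model", {}).get("model_path")):
        missing.append("model.model_path")

    for key in ("train_index", "val_index"):
        if not _filled(config.get("data", {}).get(key)):
            missing.append(f"data.{key}")

    # Assumed location of the backend keys; either one is enough.
    k8s = config.get("k8s", {})
    if not (_filled(k8s.get("kubeconfig")) or _filled(k8s.get("docker_host"))):
        missing.append("k8s.kubeconfig or k8s.docker_host")

    # The shell variable is preferred; credentials.wandb_api_key is the fallback.
    creds = config.get("credentials", {})
    if not (environ.get("WANDB_API_KEY") or creds.get("wandb_api_key")):
        missing.append("WANDB_API_KEY")

    return missing
```

An empty return list means every commonly filled value is present; anything else names what is still missing.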

Active runtime values

Excerpted from config.yaml:

model:
  model_path: /path/to/models/Qwen3-30B-A3B-Instruct-2507
  served_model_name: vllm_model
data:
  train_index: /path/to/harbor_indexes/train.parquet
  val_index:   /path/to/harbor_indexes/val.parquet
infrastructure:
  nnodes: 1
  ngpus_per_node: 8
  litellm_port: 8002
vllm:
  gen_tp: 4
  max_model_length: 128000
  gpu_memory_utilization: 0.85
training:
  train_batch_size: 64
  n_resp_per_prompt: 8          # 64 × 8 = 512 trials/step
  max_prompt_length: 40000
  max_response_length: 68000
  total_epochs: 3
  save_freq: 5
  test_freq: 5
algorithm:
  adv_estimator: grpo
  policy_loss_mode: gspo
  learning_rate: 1.0e-06
experiment:
  project_name: swe-lego-live-rl
credentials:
  wandb_mode: online
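
The arithmetic behind the excerpt can be checked with a short Python sketch. The trials-per-step product is stated in the excerpt itself; the check that prompt plus response length fits within max_model_length is an assumption about how these limits interact, not something this page states.

```python
def derived_values(training, vllm):
    """Compute quantities implied by the training and vllm sections."""
    trials_per_step = training["train_batch_size"] * training["n_resp_per_prompt"]
    # Assumed relation: a full prompt plus a full response should fit
    # inside the serving context window.
    tokens_needed = training["max_prompt_length"] + training["max_response_length"]
    return {
        "trials_per_step": trials_per_step,
        "tokens_needed": tokens_needed,
        "context_headroom": vllm["max_model_length"] - tokens_needed,
    }


# Values copied from the excerpt above.
training = {
    "train_batch_size": 64,
    "n_resp_per_prompt": 8,
    "max_prompt_length": 40000,
    "max_response_length": 68000,
}
vllm = {"max_model_length": 128000}

summary = derived_values(training, vllm)
print(summary)
```

With the excerpted values this gives 512 trials per step and 108000 tokens needed, leaving 20000 tokens of headroom under the 128000 limit.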

KV-head divisibility

vllm.gen_tp must evenly divide the model's num_key_value_heads (that is, num_key_value_heads % gen_tp == 0), or training crashes at the first forward pass. dryrun.sh enforces this; see Preflight.
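
The divisibility rule can be sketched in Python as follows. This is not the implementation in dryrun.sh; it assumes a Hugging Face style checkpoint directory whose config.json contains a num_key_value_heads field, and the function names are hypothetical.

```python
import json
from pathlib import Path


def kv_heads_divisible(num_key_value_heads, gen_tp):
    """True when gen_tp evenly divides the number of KV heads."""
    return gen_tp > 0 and num_key_value_heads % gen_tp == 0


def check_model_dir(model_path, gen_tp):
    """Read num_key_value_heads from <model_path>/config.json and validate gen_tp.

    Assumes a Hugging Face style config.json; raises ValueError on a mismatch.
    """
    config = json.loads((Path(model_path) / "config.json").read_text())
    heads = config["num_key_value_heads"]
    if not kv_heads_divisible(heads, gen_tp):
        raise ValueError(
            f"gen_tp={gen_tp} does not divide num_key_value_heads={heads}"
        )
    return heads
```

Running a check like this before launch surfaces a bad gen_tp in seconds rather than at the first forward pass.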

Outputs

rl declares its handoff contract in config.yaml under runtime_info.output:

| Output | Path | Notes |
| --- | --- | --- |
| checkpoint_path | repos/harbor-verl-train/checkpoints/ | actor FSDP shards (per save_freq) |
| training_log | repos/harbor-verl-train/logs/<exp>.log | per-run training log |
| vllm_log | repos/harbor-verl-train/logs/<exp>_vllm.log | throughput-only log |
| trajectory_log | repos/harbor-verl-train/harbor_trials/<project>/<exp>/ | per-trial agent traffic |
| val_resolve_rate / training_curves | wandb + dashboard | validation resolve rate and metric curves |
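
For downstream tooling, the file-based outputs can be resolved from the project and experiment names. The sketch below only fills the `<project>` and `<exp>` placeholders in the path templates listed above; the function name and the example experiment name are hypothetical.

```python
ROOT = "repos/harbor-verl-train"


def output_paths(project, exp):
    """Fill the <project> and <exp> placeholders in the documented output paths."""
    return {
        "checkpoint_path": f"{ROOT}/checkpoints/",
        "training_log": f"{ROOT}/logs/{exp}.log",
        "vllm_log": f"{ROOT}/logs/{exp}_vllm.log",
        "trajectory_log": f"{ROOT}/harbor_trials/{project}/{exp}/",
    }


# "my-exp" is a made-up experiment name; the project name is from the excerpt.
paths = output_paths("swe-lego-live-rl", "my-exp")
for name, path in paths.items():
    print(f"{name}: {path}")
```

val_resolve_rate and training_curves are omitted because they are reported to wandb and the dashboard rather than written to a path.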

See Results & Artifacts for the full layout, and Config Variants for changing the run profile.
