Wargames Training: Training Guide

This guide covers everything you need to run, configure, and monitor training experiments for the wargames_training project.

Prerequisites

Complete the Getting Started steps in the README first:

git clone https://github.com/B9android/wargames_training.git
cd wargames_training
python3.11 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
wandb login          # optional; see the offline mode command under Quickstart
pytest tests/ -q     # all tests must pass before training

Quickstart

Run the default training configuration (1 M steps, 8 parallel envs, W&B logging):

python training/train.py

Run without internet access to W&B (offline mode). W&B writes a local run directory under ./wandb/, eval logs go to logs/, and the run can be synced to the cloud later:

WANDB_MODE=offline python training/train.py

Override individual hyperparameters at the command line using Hydra syntax:

python training/train.py training.learning_rate=1e-4 training.total_timesteps=500000
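
As a rough mental model (not Hydra's actual implementation), each `key.path=value` override replaces one leaf of the nested config. The sketch below uses plain Python; `parse_value`, `apply_overrides`, and the sample `cfg` dictionary are hypothetical names introduced only for illustration.

```python
import copy


def parse_value(text):
    """Interpret an override value as int, float, or plain string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(cfg, overrides):
    """Return a copy of cfg with dotted `key.path=value` overrides applied."""
    result = copy.deepcopy(cfg)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


# Hypothetical slice of configs/default.yaml, using defaults from this guide.
cfg = {"training": {"learning_rate": 3e-4, "total_timesteps": 1_000_000}}
merged = apply_overrides(
    cfg, ["training.learning_rate=1e-4", "training.total_timesteps=500000"]
)
print(merged)
```

Keys that are not overridden keep their values from the config file, which is why overrides are convenient for one-off changes.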

Use an experiment config file (e.g. configs/experiment_1.yaml):

python training/train.py --config-name experiment_1

Configuration

All settings live in configs/default.yaml. The sections below cover every config key and its effect.

W&B experiment tracking

Key	Default	Description
`wandb.project`	`"wargames_training"`	W&B project name
`wandb.entity`	`null`	W&B team/org — `null` uses your personal account
`wandb.tags`	`["v1", "ppo"]`	Default tags for each run (overridable via Hydra)
`wandb.log_freq`	`1000`	Log rollout metrics every N environment steps

Environment

Key	Default	Description
`env.map_width`	`1000.0`	Map width in metres
`env.map_height`	`1000.0`	Map height in metres
`env.max_steps`	`500`	Maximum steps per episode
`env.num_envs`	`8`	Parallel training environments (increase for faster wall-clock training)
`env.randomize_terrain`	`true`	Generate new terrain each episode for generalization
`env.hill_speed_factor`	`0.5`	Speed multiplier on max-elevation hills (range `(0, 1]`)
`env.curriculum_level`	`5`	Red opponent difficulty: 1 = stationary … 5 = full combat

Reward shaping

Key	Default	Description
`reward.delta_enemy_strength`	`5.0`	Reward per unit of enemy strength destroyed
`reward.delta_own_strength`	`5.0`	Penalty per unit of own strength lost
`reward.survival_bonus`	`0.0`	Per-step bonus scaled by Blue's remaining strength
`reward.win_bonus`	`10.0`	Terminal bonus when Blue wins
`reward.loss_penalty`	`-10.0`	Terminal penalty when Blue loses
`reward.time_penalty`	`-0.01`	Per-step time penalty (discourages stalling)
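
The exact reward formula lives in the environment code and is not reproduced in this guide. The sketch below shows one plausible way the components in the table could combine on a single step. The function `step_reward`, the `RewardWeights` class, the additive form, and the assumption that strengths are fractions in [0, 1] are all illustrative guesses, not the project's implementation.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Defaults copied from the reward shaping table.
    delta_enemy_strength: float = 5.0
    delta_own_strength: float = 5.0
    survival_bonus: float = 0.0
    win_bonus: float = 10.0
    loss_penalty: float = -10.0
    time_penalty: float = -0.01


def step_reward(w, enemy_loss, own_loss, blue_strength, outcome=None):
    """Assumed additive combination; `outcome` is 'win', 'loss', or None."""
    reward = w.delta_enemy_strength * enemy_loss   # reward for damage dealt
    reward -= w.delta_own_strength * own_loss      # penalty for damage taken
    reward += w.survival_bonus * blue_strength     # per-step survival term
    reward += w.time_penalty                       # already negative
    if outcome == "win":
        reward += w.win_bonus
    elif outcome == "loss":
        reward += w.loss_penalty                   # already negative
    return reward


w = RewardWeights()
mid_episode = step_reward(w, enemy_loss=0.10, own_loss=0.02, blue_strength=0.9)
final_step = step_reward(w, enemy_loss=0.05, own_loss=0.0, blue_strength=0.9,
                         outcome="win")
print(round(mid_episode, 4), round(final_step, 4))
```

Under these assumptions the terminal bonus dominates any single step, so per-step coefficients mainly shape how the agent gets to a win rather than whether it wants one.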

PPO training

Key	Default	Description
`training.algorithm`	`"PPO"`	RL algorithm (only PPO is currently supported)
`training.total_timesteps`	`1000000`	Total environment steps to train for
`training.learning_rate`	`3e-4`	Adam learning rate
`training.n_steps`	`2048`	Rollout steps per environment before a PPO update
`training.batch_size`	`64`	Minibatch size
`training.n_epochs`	`10`	Number of epochs per PPO update
`training.gamma`	`0.99`	Discount factor
`training.gae_lambda`	`0.95`	GAE λ for advantage estimation
`training.clip_range`	`0.2`	PPO clip ratio ε
`training.ent_coef`	`0.01`	Entropy regularization coefficient
`training.vf_coef`	`0.5`	Value function loss coefficient
`training.max_grad_norm`	`0.5`	Gradient clip norm
`training.seed`	`42`	Random seed for reproducibility
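
To see how these defaults interact, the arithmetic below works out the rollout size and the number of gradient steps per update. It assumes the usual Stable-Baselines3 PPO semantics, where `n_steps` is counted per environment and rollouts are collected whole until `total_timesteps` is reached; treat the update count as an estimate.

```python
import math

num_envs = 8                  # env.num_envs
n_steps = 2048                # training.n_steps (per environment)
batch_size = 64               # training.batch_size
n_epochs = 10                 # training.n_epochs
total_timesteps = 1_000_000   # training.total_timesteps

rollout_size = num_envs * n_steps                     # transitions per update
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_update = minibatches_per_epoch * n_epochs
num_updates = math.ceil(total_timesteps / rollout_size)

print(rollout_size, minibatches_per_epoch, gradient_steps_per_update, num_updates)
```

With the defaults, each update consumes 16,384 transitions and performs 2,560 minibatch gradient steps. Halving `training.n_steps` halves the rollout buffer, which is why it appears among the out-of-memory remedies.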

Evaluation

Key	Default	Description
`eval.n_eval_episodes`	`20`	Episodes per evaluation interval
`eval.eval_freq`	`50000`	Evaluate every N timesteps
`eval.checkpoint_freq`	`100000`	Save checkpoint every N timesteps
`eval.checkpoint_dir`	`"checkpoints/"`	Directory for checkpoint `.zip` files
`eval.elo_registry`	`"checkpoints/elo_registry.json"`	Path to Elo registry JSON
`eval.elo_opponents`	`[]`	List of scripted opponents for Elo tracking
`eval.elo_eval_freq`	`50000`	Elo evaluation interval (timesteps)
`eval.elo_n_eval_episodes`	`20`	Episodes per Elo evaluation

Artifact naming and manifest

Key	Default	Description
`artifacts.enable_naming_v2`	`true`	Include seed and curriculum level in checkpoint filenames
`artifacts.keep_legacy_aliases`	`true`	Also write `ppo_battalion_final.zip` alongside the v2-named file
`artifacts.write_manifest`	`true`	Append every checkpoint event to a local JSONL index
`artifacts.manifest_path`	`"checkpoints/manifest.jsonl"`	Path to the checkpoint manifest

When enable_naming_v2 is true, checkpoint filenames include the run seed and curriculum level, for example ppo_battalion_s42_c5_200000_steps.zip. This eliminates cross-run overwrite collisions when multiple runs share the same checkpoints/ directory.
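
The v2 naming pattern can be sketched as a pair of helpers that build and parse periodic checkpoint filenames. `checkpoint_name` and `parse_checkpoint_name` are hypothetical names; only the filename layout itself comes from the example above.

```python
import re


def checkpoint_name(seed, level, steps, prefix="ppo_battalion"):
    """Build a v2 periodic checkpoint filename."""
    return f"{prefix}_s{seed}_c{level}_{steps}_steps.zip"


_NAME_PATTERN = re.compile(
    r"^(?P<prefix>.+)_s(?P<seed>\d+)_c(?P<level>\d+)_(?P<steps>\d+)_steps\.zip$"
)


def parse_checkpoint_name(name):
    """Return seed, level, and steps from a v2 filename, or None if it does not match."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        return None
    return {
        "prefix": match["prefix"],
        "seed": int(match["seed"]),
        "level": int(match["level"]),
        "steps": int(match["steps"]),
    }


name = checkpoint_name(seed=42, level=5, steps=200_000)
parsed = parse_checkpoint_name(name)
print(name, parsed)
```

Two runs with different seeds or curriculum levels produce different filenames at the same step count, so they can share a directory without overwriting each other.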

The manifest (checkpoints/manifest.jsonl) is an append-only JSONL file — one JSON object per line — that records every periodic, final, and best checkpoint with its step count, seed, curriculum level, W&B run ID, and a stable config hash. It is used by the resume resolution logic (see Resuming a Run below) and by the UI artifact browser.

Periodic checkpoints are registered in the manifest when the checkpoint callback writes them, and best-model entries are appended when evaluation produces a new best checkpoint. The final checkpoint and any legacy aliases are registered during end-of-run finalization.
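
An append-only JSONL file is simple to write and read with the standard library. The sketch below is a minimal illustration, not the project's manifest code: the helpers `append_entry` and `read_entries` and the field names (`type`, `path`, `step`, `seed`, `curriculum_level`) are assumptions, and the real schema may differ.

```python
import json
import tempfile
from pathlib import Path


def append_entry(manifest_path, entry):
    """Append one checkpoint event as a single JSON line."""
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    with manifest_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, sort_keys=True) + "\n")


def read_entries(manifest_path):
    """Return all manifest entries in file order, skipping blank lines."""
    if not manifest_path.exists():
        return []
    with manifest_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "checkpoints" / "manifest.jsonl"
    for step in (100_000, 200_000):
        append_entry(manifest, {
            "type": "periodic",  # hypothetical field names throughout
            "path": f"checkpoints/ppo_battalion_s42_c5_{step}_steps.zip",
            "step": step,
            "seed": 42,
            "curriculum_level": 5,
        })
    entries = read_entries(manifest)

periodic_steps = [e["step"] for e in entries if e["type"] == "periodic"]
print(periodic_steps)
```

Because entries are only ever appended, a reader can recover the full checkpoint history of a directory by scanning the file once from top to bottom.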

Resume from checkpoint

Key	Default	Description
`resume.auto`	`false`	Automatically resume from the latest checkpoint in the manifest
`resume.checkpoint`	`null`	Explicit `.zip` path to resume from (overrides `auto`)

Self-play (disabled by default)

Key	Default	Description
`self_play.enabled`	`false`	Enable self-play loop
`self_play.pool_dir`	`"checkpoints/pool"`	Directory for snapshot `.zip` files
`self_play.pool_max_size`	`10`	Keep the N most recent policy snapshots
`self_play.snapshot_freq`	`50000`	Save snapshot every N environment steps
`self_play.eval_freq`	`50000`	Evaluate win-rate vs pool every N environment steps
`self_play.n_eval_episodes`	`20`	Episodes per win-rate evaluation
`self_play.use_latest_for_eval`	`false`	Evaluation opponent selection: `false` samples uniformly from the pool; `true` always uses the latest snapshot
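
The interaction between `pool_max_size` and `use_latest_for_eval` can be pictured with a bounded queue. This is a sketch under assumptions: `SnapshotPool` and the snapshot filenames are hypothetical, and the real pool presumably also deletes evicted `.zip` files from disk, which is omitted here.

```python
import random
from collections import deque


class SnapshotPool:
    """Keeps the N most recent snapshot paths (hypothetical helper)."""

    def __init__(self, max_size=10):
        self._snapshots = deque(maxlen=max_size)  # oldest entries drop off

    def add(self, path):
        self._snapshots.append(path)

    def pick_opponent(self, use_latest, rng=random):
        """Return the latest snapshot, or a uniformly sampled one."""
        if not self._snapshots:
            raise ValueError("snapshot pool is empty")
        if use_latest:
            return self._snapshots[-1]
        return rng.choice(list(self._snapshots))

    def __len__(self):
        return len(self._snapshots)


pool = SnapshotPool(max_size=10)
for step in range(50_000, 650_000, 50_000):  # 12 snapshots at snapshot_freq
    pool.add(f"checkpoints/pool/snapshot_{step}.zip")  # made-up filenames

latest = pool.pick_opponent(use_latest=True)
sampled = pool.pick_opponent(use_latest=False, rng=random.Random(0))
print(len(pool), latest, sampled)
```

Uniform sampling evaluates against a spread of past policies, while always using the latest snapshot measures progress against the most recent one only.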

Logging

Key	Default	Description
`logging.level`	`"INFO"`	Python logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`)
`logging.log_dir`	`"logs/"`	Directory for eval logs written by `EvalCallback`

Curriculum Training

Training against progressively harder scripted opponents accelerates early learning. A typical progression:

Start at env.curriculum_level=1 (stationary Red) until win rate > 90 %.
Advance to level 2–3. Adjust reward.time_penalty if the agent stalls.
Advance to level 4–5 for full combat capability.

To run at level 3:

python training/train.py env.curriculum_level=3
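
The progression can be expressed as a small promotion rule that decides which level to pass to the next run. This is only a sketch: the guide states the 90 % threshold for level 1, and applying the same threshold to every level, advancing one level at a time, and the function name `next_curriculum_level` are assumptions.

```python
def next_curriculum_level(level, win_rate, threshold=0.9, max_level=5):
    """Advance one level once the win rate exceeds the threshold."""
    if win_rate > threshold and level < max_level:
        return level + 1
    return level


level = 1
for win_rate in (0.55, 0.93, 0.71, 0.95):  # made-up evaluation results
    level = next_curriculum_level(level, win_rate)
    print(f"next run: python training/train.py env.curriculum_level={level}")
```

Each printed command is the invocation for the following training run, so the level only rises after an evaluation clears the threshold.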

Self-Play

Once the agent reliably beats the scripted level-5 opponent, switch to self-play using the pre-made config:

python training/train.py --config-name self_play

The self-play config runs for 2 M steps, periodically snapshots the current policy into checkpoints/pool/, and evaluates win-rate against the pool.

Elo Tracking

Enable Elo tracking by listing opponents in the config:

python training/train.py \
    "eval.elo_opponents=[scripted_l1,scripted_l3,scripted_l5]"

Elo ratings are persisted to checkpoints/elo_registry.json and logged to W&B under the elo/ key prefix.

Baseline ratings:

Opponent	Baseline Elo
`random`	500
`scripted_l1`	600
`scripted_l2`	700
`scripted_l3`	800
`scripted_l4`	900
`scripted_l5`	1000
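
For intuition about how a rating moves against these baselines, the sketch below applies the standard Elo expected-score and update formulas. The K-factor of 32, the agent's starting rating of 1000, and updating once per evaluation from the observed win rate are assumptions; the project's actual update rule is not described here.

```python
BASELINES = {
    "random": 500, "scripted_l1": 600, "scripted_l2": 700,
    "scripted_l3": 800, "scripted_l4": 900, "scripted_l5": 1000,
}


def expected_score(rating, opponent_rating):
    """Standard Elo expected score for the player with `rating`."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def updated_rating(rating, opponent_rating, win_rate, k_factor=32.0):
    """Move the rating toward the observed win rate (assumed K-factor)."""
    return rating + k_factor * (win_rate - expected_score(rating, opponent_rating))


agent = 1000.0  # assumed starting rating
vs_l5 = updated_rating(agent, BASELINES["scripted_l5"], win_rate=0.65)
vs_l1 = expected_score(agent, BASELINES["scripted_l1"])
print(round(vs_l5, 2), round(vs_l1, 3))
```

Against an equally rated opponent the expected score is 0.5, so any win rate above 50 % raises the rating. Against a much weaker baseline the expected score is already high, so even a strong win rate changes the rating very little.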

W&B Integration

Every training run calls wandb.init() automatically. The following metrics are logged:

W&B key	Description
`rollout/ep_rew_mean`	Mean episodic reward (rolling buffer)
`rollout/ep_len_mean`	Mean episode length
`train/policy_gradient_loss`	PPO policy gradient loss
`train/value_loss`	Value function loss
`train/entropy_loss`	Entropy regularization loss
`reward_breakdown/*`	Per-component mean reward (per episode)
`elo/rating_vs_<opponent>`	Elo rating after evaluating vs the named opponent
`elo/win_rate_vs_<opponent>`	Win rate vs the named opponent at each Elo checkpoint
`elo/delta_vs_<opponent>`	Elo rating change from the last evaluation vs the named opponent
`self_play/win_rate_vs_pool`	Win rate vs self-play pool (when self-play is enabled)

Tips:

Set wandb.entity to your team name in configs/default.yaml to share runs with collaborators.
Use WANDB_MODE=offline to train without internet access; sync later with wandb sync.
Post your W&B run URL in the tracking issue when opening a PR for any experiment.

Checkpoints

Checkpoints are saved as Stable-Baselines3 .zip files. When artifacts.enable_naming_v2 is true (the default), filenames embed the run seed and curriculum level:

Path	Contents
`checkpoints/ppo_battalion_s<seed>_c<level>_<N>_steps.zip`	Periodic checkpoint every `eval.checkpoint_freq` steps
`checkpoints/best/ppo_battalion_s<seed>_c<level>_best.zip`	Best model by mean eval reward
`checkpoints/ppo_battalion_s<seed>_c<level>_final.zip`	Final model at end of training
`checkpoints/ppo_battalion_final.zip`	Legacy alias (written when `keep_legacy_aliases: true`)
`checkpoints/best/best_model.zip`	Legacy alias for best (written by SB3 `EvalCallback`)
`checkpoints/manifest.jsonl`	Append-only JSONL index of all checkpoint events

Load a checkpoint:

from stable_baselines3 import PPO
model = PPO.load("checkpoints/best/ppo_battalion_s42_c5_best.zip")
# or using the legacy alias:
model = PPO.load("checkpoints/best/best_model.zip")

Resuming a Run

To automatically resume from the latest checkpoint found in the manifest:

python training/train.py resume.auto=true

To resume from a specific checkpoint:

python training/train.py resume.checkpoint=checkpoints/ppo_battalion_s42_c5_200000_steps.zip

Resume resolution order:

1. Explicit resume.checkpoint path (error if file not found).
2. Manifest latest_periodic() scan (fastest, uses the JSONL index).
3. Filesystem glob scan for {prefix}_*_steps.zip (fallback when no manifest).
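
The resolution order can be sketched as a single function. This is an illustration under assumptions, not the project's code: `resolve_resume_path` is a hypothetical name, `manifest_latest` stands in for whatever latest_periodic() returns, and the prefix is assumed to include the seed and curriculum level.

```python
import re
import tempfile
from pathlib import Path


def resolve_resume_path(explicit, auto, manifest_latest, checkpoint_dir, prefix):
    """Pick a checkpoint to resume from, or return None to start fresh."""
    # 1. An explicit path always wins; a missing file is an error.
    if explicit is not None:
        path = Path(explicit)
        if not path.is_file():
            raise FileNotFoundError(f"resume.checkpoint not found: {path}")
        return path
    if not auto:
        return None
    # 2. Prefer the manifest's latest periodic entry when one is available.
    if manifest_latest is not None:
        return Path(manifest_latest)
    # 3. Fall back to globbing and pick the highest step count numerically.
    step_pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)_steps\.zip$")
    candidates = []
    for path in Path(checkpoint_dir).glob(f"{prefix}_*_steps.zip"):
        match = step_pattern.match(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


with tempfile.TemporaryDirectory() as tmp:
    prefix = "ppo_battalion_s42_c5"
    for steps in (100_000, 900_000, 1_000_000):
        (Path(tmp) / f"{prefix}_{steps}_steps.zip").touch()
    picked = resolve_resume_path(None, True, None, tmp, prefix).name
    try:
        resolve_resume_path(str(Path(tmp) / "missing.zip"), True, None, tmp, prefix)
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(picked, missing_raises)
```

Comparing step counts as integers matters in the fallback: a plain string sort would rank `900000` above `1000000`.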

A warning is logged if the recorded config hash for the checkpoint differs from the current run's config hash — meaning hyperparameters may have changed since the checkpoint was saved. Warnings are also emitted when the manifest metadata shows a different training seed or curriculum level than the current run configuration.
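
One common way to obtain a stable config hash is to serialize the config with sorted keys and hash the result. The sketch below shows that approach; `config_hash`, the SHA-256 choice, and the 12-character truncation are assumptions, and the project may compute its hash differently.

```python
import hashlib
import json


def config_hash(cfg, length=12):
    """Hash a config so that key order does not affect the result."""
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


saved = {"training": {"learning_rate": 3e-4, "seed": 42},
         "env": {"curriculum_level": 5}}
same = {"env": {"curriculum_level": 5},
        "training": {"seed": 42, "learning_rate": 3e-4}}
changed = {"training": {"learning_rate": 1e-4, "seed": 42},
           "env": {"curriculum_level": 5}}

for name, cfg in (("same", same), ("changed", changed)):
    if config_hash(cfg) != config_hash(saved):
        print(f"warning: config hash differs for '{name}' run")
```

Only the run with the modified learning rate triggers the warning; reordering keys alone does not change the hash.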

Evaluation

After training, evaluate a checkpoint against scripted opponents:

python training/evaluate.py \
    --checkpoint checkpoints/best/best_model.zip \
    --opponent scripted_l5 \
    --n-episodes 100

See python training/evaluate.py --help for all options, and docs/ENVIRONMENT_SPEC.md for opponent identifiers.

Hyperparameter Tips

Slow learning / no improvement: increase training.learning_rate to 1e-3 or raise training.n_epochs to 15.
Unstable training (loss spikes): lower training.clip_range to 0.1 and reduce training.learning_rate.
Agent stalls without fighting: increase the magnitude of reward.time_penalty (for example, from -0.01 to -0.05) or add a small reward.survival_bonus.
Out of memory: reduce env.num_envs to 4 or lower training.n_steps to 1024.
Slow wall-clock speed: increase env.num_envs to 16–32 if your CPU has enough cores.

Creating a Custom Experiment Config

Copy configs/default.yaml to configs/experiment_myname.yaml, change only the keys you want to override, and run:

python training/train.py --config-name experiment_myname

Open an [EXP] GitHub issue before starting any significant experiment; see CONTRIBUTING.md.