# Changelog

All notable changes to `wargames_training` will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

## [Unreleased]

## [4.0.0] — 2026-03-21

### Added
- League infrastructure (`training/league/agent_pool.py`, `training/league/match_database.py`) — `AgentPool` (JSON manifest, atomic writes) and `MatchDatabase` (JSONL append-log) as the persistence backbone for the v4 league. `AgentType` enum: `MAIN_AGENT`, `MAIN_EXPLOITER`, `LEAGUE_EXPLOITER`.
- League matchmaker (`training/league/matchmaker.py`) — `LeagueMatchmaker` implementing Prioritized Fictitious Self-Play (PFSP) with a hard-first weight function `f(w) = 1 − w`; `set_weight_function()` for custom weighting; `set_nash_weights()` to switch to Nash-distribution sampling; matchup rules: main agents face all roles, main exploiters face main agents only, league exploiters face all roles.
- Main agent training loop (`training/league/train_main_agent.py`) — `MainAgentTrainer` combining PFSP matchmaking, MAPPO policy updates, Elo rating, and periodic pool snapshots; `make_pfsp_weight_fn(T)` temperature factory; config `configs/league/main_agent.yaml`.
- Main exploiter (`training/league/train_exploiter.py`) — `MainExploiterTrainer` targeting the latest main agent snapshot; rolling win-rate reset via `_orthogonal_reinit`; `MAIN_EXPLOITER` pool snapshots; W&B `exploiter/*` metrics; config `configs/league/main_exploiter.yaml`.
- League exploiter (`training/league/train_league_exploiter.py`) — `LeagueExploiterTrainer` using PFSP against the full historical pool; Nash exploitability computation via `compute_league_exploitability()`; `LEAGUE_EXPLOITER` pool snapshots; W&B `league_exploiter/*` metrics; config `configs/league/league_exploiter.yaml`.
- Nash distribution sampling (`training/league/nash.py`) — `build_payoff_matrix` (win-rate callable → NumPy array); `compute_nash_distribution` (LP + regret matching → NumPy probability array; zip with `agent_ids` to form `{agent_id: prob}`); `nash_entropy` (Shannon entropy in nats); W&B key `league/nash_entropy`.
- Strategy diversity metrics (`training/league/diversity.py`) — `TrajectoryBatch`; `embed_trajectory` (action histogram + position heatmap + movement stats, L2-normalised); `pairwise_cosine_distances`; `diversity_score` (mean/min/median); `DiversityTracker`; W&B key `league/diversity_score`.
- Distributed training (`training/league/distributed_runner.py`, `envs/remote_multi_battalion_env.py`) — `RemoteMultiBattalionEnv` (`@ray.remote` actor); `make_remote_envs(n)`; `DistributedRolloutRunner`; `RolloutResult`; `benchmark()` throughput utility; Ray cluster config `configs/distributed/ray_cluster.yaml`; smoke-test CI workflow `.github/workflows/ray_smoke_test.yml`.
- Documentation — `docs/league_training_guide.md` (agent types, matchmaking, Nash sampling, diversity, distributed execution), `docs/v4_architecture.md` (ASCII component and data-flow diagrams).
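The PFSP weighting above can be illustrated with a minimal sketch. The real `LeagueMatchmaker` API is not reproduced here; the function names below are hypothetical, and only the hard-first rule `f(w) = 1 − w` comes from the entry above:

```python
import random

def pfsp_probabilities(win_rates, weight_fn=lambda w: 1.0 - w):
    """Turn per-opponent win rates into PFSP sampling probabilities.

    With the hard-first weighting f(w) = 1 - w, opponents the learner
    rarely beats receive proportionally more matches.
    """
    weights = [weight_fn(w) for w in win_rates.values()]
    total = sum(weights) or 1.0  # guard against an all-beaten pool
    return {aid: wt / total for aid, wt in zip(win_rates, weights)}

def sample_opponent(win_rates, weight_fn=lambda w: 1.0 - w):
    """Draw one opponent id according to the PFSP distribution."""
    probs = pfsp_probabilities(win_rates, weight_fn)
    ids, ps = zip(*probs.items())
    return random.choices(ids, weights=ps, k=1)[0]

# The learner beats "v1" 90% of the time but "v3" only 20%, so "v3"
# is sampled eight times as often as "v1" (weights 0.8 vs 0.1).
probs = pfsp_probabilities({"v1": 0.9, "v2": 0.5, "v3": 0.2})
```

Swapping the lambda for a temperature-shaped function is what a factory like `make_pfsp_weight_fn(T)` would return.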
## [3.0.0] — 2026-03-20

### Added
- SMDP / Options framework (`envs/smdp_wrapper.py`, `envs/options.py`) — Semi-Markov Decision Process wrapper enabling temporal abstraction; option primitives (advance, hold, retreat, flank-left, flank-right) with configurable `max_steps`; `make_default_options()` factory.
- Brigade Commander (`envs/brigade_env.py`, `training/train_brigade.py`) — `BrigadeEnv` wrapping `MultiBattalionEnv`; obs_dim = 3 + 7 × n_blue + 1 (sector control, battalion strength/morale, threat vectors, step); PPO-based brigade training with frozen MAPPO battalion sub-policies; config `configs/experiment_brigade.yaml`.
- Division Commander (`envs/division_env.py`, `training/train_division.py`) — `DivisionEnv` wrapping `BrigadeEnv`; obs_dim = 5 + 8 × n_brigades + 1 (theatre sectors, brigade status, threat, step); `_forced_red_options` for injecting Red brigade commands; config `configs/experiment_division.yaml`.
- Hierarchical curriculum (`training/hrl_curriculum.py`) — `HRLCurriculumScheduler` with `HRLPhase` enum (`PHASE_1_BATTALION` → `PHASE_2_BRIGADE` → `PHASE_3_DIVISION`); dual promotion criteria: rolling win rate ≥ threshold and cached Elo ≥ `elo_threshold`.
- Adaptive temporal abstraction (`training/adaptive_temporal.py`) — `AdaptiveTemporalScheduler` varying the temporal ratio from `base_ratio` to `min_ratio` across episode progress; `SWEEP_RATIOS` constant for grid search.
- Policy registry (`training/policy_registry.py`) — `PolicyRegistry` backed by a JSON manifest; `Echelon` enum (battalion/brigade/division); versioned register/get/remove/list/load/save; CLI: `python -m training.policy_registry`.
- Freeze utilities (`training/utils/freeze_policy.py`) — `freeze_mappo_policy()`, `freeze_sb3_policy()`, `assert_frozen()`, `load_and_freeze_mappo()`, `load_and_freeze_sb3()` for the bottom-up curriculum.
- HRL evaluation harness (`training/evaluate_hrl.py`) — end-to-end HRL vs. flat MARL tournament (`run_tournament()`); bootstrapped 95% CIs (`bootstrap_ci()`); JSON output; CLI entry point.
- HRL analysis notebook (`notebooks/v3_hrl_analysis.ipynb`) — tournament result loading, win-rate bar charts, CI error bars, echelon latency plots.
- HRL configs (`configs/hrl/phase1_battalion.yaml`, `phase2_brigade.yaml`, `phase3_division.yaml`) — per-phase training configs wiring curriculum promotions, temporal ratios, and checkpoint paths.
- Documentation — `docs/hrl_architecture.md` (three-echelon command hierarchy, observation/action spaces, reward flow), `docs/hrl_training_protocol.md` (bottom-up training protocol, phase descriptions, promotion criteria, evaluation methodology).
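The dual promotion criterion named above (rolling win rate AND cached Elo both above threshold) can be sketched as follows. This is an illustrative stand-in, not the actual `HRLCurriculumScheduler`; class and parameter names are assumptions:

```python
from collections import deque

class PromotionGate:
    """Dual promotion criterion: a phase advances only when the rolling
    win rate AND the cached Elo both clear their thresholds."""

    def __init__(self, win_rate_threshold=0.7, elo_threshold=1200.0, window=100):
        self.win_rate_threshold = win_rate_threshold
        self.elo_threshold = elo_threshold
        self.results = deque(maxlen=window)  # 1.0 = win, 0.0 = loss
        self.cached_elo = 0.0

    def record(self, won: bool, elo: float) -> None:
        self.results.append(1.0 if won else 0.0)
        self.cached_elo = elo

    def should_promote(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # require a full window before judging
        win_rate = sum(self.results) / len(self.results)
        return (win_rate >= self.win_rate_threshold
                and self.cached_elo >= self.elo_threshold)
```

Requiring both signals guards against promoting on a lucky win streak against weak opponents (high win rate, low Elo) or on stale ratings (high Elo, recent losses).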
### Changed

- `envs/__init__.py` updated to export `BrigadeEnv` and `DivisionEnv`.
- `training/self_play.py` — `TeamOpponentPool` now supports brigade-level MAPPO snapshots alongside battalion-level policies.
## [2.0.0] — 2026-03-20

### Added
- `MultiBattalionEnv` (`envs/multi_battalion_env.py`) — PettingZoo `ParallelEnv` supporting NvN battalion combat. Per-agent local observations with fog of war via `visibility_radius`; global state tensor exposed for the centralized critic. Passes `pettingzoo.test.parallel_api_test`.
- MAPPO (`models/mappo_policy.py`) — Multi-Agent Proximal Policy Optimization with centralized training and decentralized execution (CTDE). `MAPPOActor` (local obs → Gaussian), `MAPPOCritic` (global state → value), `MAPPOPolicy` wrapping both with optional `share_parameters`.
- MAPPO training pipeline (`training/train_mappo.py`) — `MAPPORolloutBuffer` with per-agent GAE, `MAPPOTrainer`, Hydra config entry point, W&B logging of per-agent and aggregate rewards.
- 3-stage curriculum (`training/curriculum_scheduler.py`) — `CurriculumScheduler` with rolling win-rate promotion across `STAGE_1V1 → STAGE_2V1 → STAGE_2V2`; `load_v1_weights_into_mappo` for warm-starting from v1 PPO checkpoints.
- Coordination metrics (`envs/metrics/coordination.py`) — `flanking_ratio`, `fire_concentration`, `mutual_support_score` logged per episode to W&B.
- NvN scaling — `MultiBattalionEnv` parameterized by `n_blue`/`n_red`; scenario configs added for 2v2, 3v3, 4v4, and 6v6 (`configs/scenarios/`); scaling notes documented in `docs/scaling_notes.md`.
- Multi-agent self-play (`training/self_play.py`) — `TeamOpponentPool` saving MAPPO policy snapshots; `TeamEloRegistry` extending `EloRegistry` with team Elo baselines; `nash_exploitability_proxy` estimator.
- Experiment config (`configs/experiment_mappo_2v2.yaml`) — reference MAPPO 2v2 training config (200 k timesteps, shared actor, 128→64 MLP).
- Curriculum config (`configs/curriculum_2v2.yaml`) — three-stage curriculum schedule with win-rate thresholds.
- Documentation — `docs/multi_agent_guide.md`, `docs/v2_architecture.md`, `docs/scaling_notes.md`.
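The per-agent GAE mentioned for `MAPPORolloutBuffer` follows the standard Generalized Advantage Estimation recursion, applied once per agent. A minimal sketch (the buffer's real layout is not shown here; in the CTDE setup the values come from the centralized critic over the global state, but the recursion itself is unchanged):

```python
def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one agent's trajectory.

    `values` carries one extra bootstrap entry: values[t + 1] is the
    critic's estimate for the state reached after rewards[t].
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of future residuals
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```

Per-agent simply means running this once per agent id over that agent's own rewards and (shared-critic) values before flattening into the PPO minibatch.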
### Changed

- `training/self_play.py` extended with `TeamOpponentPool` and `evaluate_team_vs_pool` alongside the existing v1 `OpponentPool`.
- `training/elo.py` extended with `TeamEloRegistry` and team-specific `TEAM_BASELINE_RATINGS`.
- `envs/__init__.py` updated to export `MultiBattalionEnv`.
## [1.0.0] — 2026-03-19

### Added
- Simulation engine (`envs/sim/`) — battalion, combat (damage accumulation, morale mechanics, routing threshold), and procedural terrain with elevation and cover.
- `BattalionEnv` — Gymnasium 1v1 environment with a 12-dim observation space, 3-dim continuous action space, scripted Red opponent (curriculum levels 1–5), randomized terrain, and configurable reward shaping.
- `BattalionMlpPolicy` — Stable-Baselines3 `ActorCriticPolicy` subclass with a two-hidden-layer MLP (obs(12) → 128 → Tanh → 128 → Tanh) shared by the actor and critic heads; registered as `PPO.policy_aliases["BattalionMlpPolicy"]`.
- PPO training pipeline (`training/train.py`) — Hydra config loading, W&B experiment tracking, `CheckpointCallback`, `EvalCallback`, `WandbCallback`, `RewardBreakdownCallback`.
- Elo tracking (`training/elo.py`) — `EloRegistry` with JSON persistence; `EloEvalCallback` for per-opponent Elo logging during training.
- Evaluation script (`training/evaluate.py`) — CLI evaluation against scripted opponents, a random baseline, or any `.zip` checkpoint.
- Self-play (`training/self_play.py`) — `OpponentPool`, `SelfPlayCallback`, `WinRateVsPoolCallback`; wired into `train.py` via `self_play.enabled`.
- Configuration system — Hydra-based YAML configs (`default.yaml`, `self_play.yaml`, `experiment_1.yaml`, `orchestration.yaml`).
- GitHub automation — triage agent, label/milestone bootstrap, project board sync, orchestration workflow, governance policy.
- Documentation — `README.md`, `CONTRIBUTING.md`, `docs/TRAINING_GUIDE.md`, `docs/ENVIRONMENT_SPEC.md`, `docs/ROADMAP.md`, `docs/development_playbook.md`, `docs/ORCHESTRATION_RUNBOOK.md`.
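The Elo tracking above uses the standard logistic Elo model; `EloRegistry`'s internals are not shown here, so the following is a generic sketch of the update rule it would implement:

```python
def elo_expected(r_a, r_b):
    """Expected score of A against B under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b); score_a is 1.0 (win), 0.5, or 0.0."""
    e_a = elo_expected(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated players: a win moves the winner up by k/2.
new_a, new_b = elo_update(1000.0, 1000.0, 1.0)  # -> (1016.0, 984.0)
```

Logging this per scripted opponent (as `EloEvalCallback` does) gives a difficulty-aware progress signal that raw win rate lacks.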
### Changed

- `MORALE_CASUALTY_WEIGHT` raised from `0.4` to `1.5` to make routing reachable at the default `MORALE_ROUT_THRESHOLD` of `0.25`.
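For intuition on why the old weight made routing unreachable, here is a deliberately simplified linear morale model. This is an assumption for illustration only; the real mechanics live in `envs/sim/` and may differ:

```python
# Hypothetical linear casualty-driven morale, for illustration only.
def morale(casualty_fraction, weight):
    return max(0.0, 1.0 - weight * casualty_fraction)

MORALE_ROUT_THRESHOLD = 0.25

# Old weight 0.4: even 100% casualties leaves morale at ~0.6, so the
# 0.25 rout threshold can never be crossed.
assert morale(1.0, 0.4) > MORALE_ROUT_THRESHOLD

# New weight 1.5: a battalion reaches the rout threshold at 50% casualties.
assert morale(0.5, 1.5) <= MORALE_ROUT_THRESHOLD
```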