v4 Architecture — League Training System¶

Version: v4.0.0
Theme: AlphaStar-style League Training

High-Level Overview¶

┌─────────────────────────────────────────────────────────────────────┐
│                        LEAGUE TRAINING SYSTEM                       │
│                                                                     │
│  ┌──────────────┐    ┌──────────────────┐    ┌───────────────────┐ │
│  │  MAIN AGENT  │    │  MAIN EXPLOITER  │    │ LEAGUE EXPLOITER  │ │
│  │              │    │                  │    │                   │ │
│  │ PFSP vs all  │    │ Targets latest   │    │ PFSP vs all       │ │
│  │ league       │    │ main agent only  │    │ league            │ │
│  │ members      │    │ Resets on high   │    │ members           │ │
│  │              │    │ win rate         │    │                   │ │
│  └──────┬───────┘    └────────┬─────────┘    └─────────┬─────────┘ │
│         │  snapshots          │  snapshots              │ snapshots │
│         ▼                     ▼                         ▼           │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                      AGENT POOL                             │   │
│  │  AgentRecord × N  (JSON manifest, persisted to disk)        │   │
│  └────────────────────────────┬────────────────────────────────┘   │
│                               │                                     │
│         ┌─────────────────────┴──────────────────────┐             │
│         ▼                                             ▼             │
│  ┌─────────────────┐                   ┌──────────────────────┐    │
│  │  MATCH DATABASE │                   │  NASH SOLVER         │    │
│  │  (JSONL log of  │──── win_rates ───▶│  build_payoff_matrix │    │
│  │  match results) │                   │  compute_nash_distribution() │    │
│  └─────────────────┘                   └──────────┬───────────┘    │
│                                                    │ nash_weights   │
│                                                    ▼               │
│                                         ┌──────────────────────┐   │
│                                         │  LEAGUE MATCHMAKER   │   │
│                                         │  (PFSP or Nash)      │   │
│                                         │  select_opponent()   │   │
│                                         └──────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Component Diagram¶

training/league/
│
├── agent_pool.py          AgentPool ─── AgentRecord ─── AgentType enum
│                          (JSON manifest, atomic writes via .tmp)
│
├── match_database.py      MatchDatabase ─── JSONL append-log
│                          win_rate(), win_rates_for() (single-pass)
│
├── matchmaker.py          LeagueMatchmaker
│                          ├── select_opponent()   ← PFSP or Nash
│                          ├── set_weight_function()
│                          ├── set_nash_weights()
│                          └── opponent_probabilities()
│
├── nash.py                build_payoff_matrix()
│                          compute_nash_distribution()  ← LP + regret matching
│                          nash_entropy()               ← Shannon entropy (nats)
│
├── diversity.py           TrajectoryBatch
│                          embed_trajectory()    ← histogram + heatmap + stats
│                          pairwise_cosine_distances()
│                          diversity_score()     ← mean / min / median
│                          DiversityTracker
│
├── distributed_runner.py  RemoteMultiBattalionEnv  (@ray.remote)
│                          make_remote_envs(n)
│                          DistributedRolloutRunner
│                          RolloutResult
│                          benchmark()
│
├── train_main_agent.py    MainAgentTrainer
│                          ├── PFSP matchmaking
│                          ├── MAPPO training loop
│                          ├── Elo rating updates
│                          └── Pool snapshot saving
│
├── train_exploiter.py     MainExploiterTrainer
│                          ├── Targets latest MAIN_AGENT snapshot
│                          ├── Rolling win-rate tracking
│                          ├── Orthogonal re-initialisation on reset
│                          └── MAIN_EXPLOITER pool snapshots
│
└── train_league_exploiter.py  LeagueExploiterTrainer
                               ├── PFSP vs full pool
                               ├── Nash exploitability computation
                               └── LEAGUE_EXPLOITER pool snapshots

Data Flow¶

                     ┌──────────────┐
                     │  Rollout     │
      policy ───────▶│  Collection  │◀─── opponent policy (from pool)
                     │  (env step)  │
                     └──────┬───────┘
                            │  trajectory
                            ▼
               ┌─────────────────────────┐
               │   Policy Update (MAPPO) │
               └─────────────┬───────────┘
                             │
             ┌───────────────┼───────────────┐
             ▼               ▼               ▼
       MatchDatabase    AgentPool       DiversityTracker
       (record result)  (snapshot)      (embed trajectory)
             │               │               │
             └───────────────▼───────────────┘
                      W&B Logging
                  league/nash_entropy
                  league/diversity_score
                  elo/main_agent
                  exploiter/rolling_win_rate_vs_main

Training Roles & Interaction¶

  MAIN EXPLOITER                    LEAGUE EXPLOITER
       │                                   │
       │  forces main agent to             │  ensures no league
       │  defend against targeted          │  strategy is safe
       │  weaknesses                       │  from exploitation
       │                                   │
       └──────────────┬────────────────────┘
                      │
                      ▼
              MAIN AGENT  ◀──── learns to be robust
                      │         against diverse attacks
                      │
                      ▼
              AGENT POOL snapshots
              (historical strategies)
                      │
                      ▼
              PFSP / Nash sampling
              (prioritises hard opponents)

Distributed Execution (Ray)¶

  ┌─────────────────────────────────────────────────────────┐
  │                  Ray Cluster                            │
  │                                                         │
  │   Driver Process (DistributedRolloutRunner)             │
  │        │                                                │
  │        ├─▶  RemoteMultiBattalionEnv actor 0             │
  │        ├─▶  RemoteMultiBattalionEnv actor 1             │
  │        ├─▶  RemoteMultiBattalionEnv actor 2             │
  │        │           …                                    │
  │        └─▶  RemoteMultiBattalionEnv actor N-1           │
  │                                                         │
  │   Each actor runs an independent MultiBattalionEnv      │
  │   episode.  Results are gathered as RolloutResult       │
  │   objects and aggregated by the driver.                 │
  └─────────────────────────────────────────────────────────┘

Relationship to v1–v3 Architecture¶

v1  BattalionEnv (1v1, PPO, scripted opponent)
        │
v2  MultiBattalionEnv (NvN, MAPPO, shared policy)
        │
v3  BrigadeEnv / DivisionEnv (HRL, frozen sub-policies)
        │
v4  League (AgentPool + MatchDB + Matchmaker + Nash)
        │ wraps the MAPPO policies from v2/v3
        ▼
    Nash-robust main agent policy

Key Files¶

File	Description
`training/league/agent_pool.py`	Agent registry with JSON persistence
`training/league/match_database.py`	JSONL match outcome log
`training/league/matchmaker.py`	PFSP / Nash opponent selector
`training/league/nash.py`	Nash distribution solver and entropy
`training/league/diversity.py`	Trajectory embedding and diversity metrics
`training/league/distributed_runner.py`	Ray-based parallel rollout collection
`training/league/train_main_agent.py`	Main agent PFSP+MAPPO training loop
`training/league/train_exploiter.py`	Main exploiter trainer
`training/league/train_league_exploiter.py`	League exploiter trainer
`configs/league/main_agent.yaml`	Main agent hyperparameters
`configs/league/main_exploiter.yaml`	Main exploiter hyperparameters
`configs/league/league_exploiter.yaml`	League exploiter hyperparameters
`configs/distributed/ray_cluster.yaml`	Ray cluster configuration