HRL Architecture — Brigade & Division Commander Layers (E3.2 / E3.3)¶
This document describes the Hierarchical RL (HRL) architecture for the Brigade Commander (v3, E3.2) and Division Commander (v3, E3.3), which form a three-echelon command hierarchy above the battalion-level policies trained in v2.
Overview¶
The full HRL stack has three echelons:
┌─────────────────────────────────────────────────────────────────────────┐
│ Division Commander (PPO) │
│ Observes: theatre sectors, brigade status summaries, objective ctrl │
│ Acts: operational command per brigade (MultiDiscrete) │
│ Reward: brigade reward (passed through from BrigadeEnv) │
├─────────────────────────────────────────────────────────────────────────┤
│ Command Translator (envs/division_env.py) │
│ Expands per-brigade operational command → per-battalion option index │
├─────────────────────────────────────────────────────────────────────────┤
│ Brigade Commander (PPO / option dispatcher) │
│ Observes: sector control, battalion states, enemy threats │
│ Acts: option selection per battalion (MultiDiscrete) │
│ Reward: mean of battalion rewards over the macro-step │
├─────────────────────────────────────────────────────────────────────────┤
│ Option Dispatcher (envs/brigade_env.py) │
│ Translates brigade macro-command → Option object │
│ Executes option for K primitive steps until termination │
├─────────────────────────────────────────────────────────────────────────┤
│ Battalion Options (envs/options.py) │
│ 6 hardcoded macro-actions: advance, defend, flank L/R, │
│ withdraw, concentrate fire │
├─────────────────────────────────────────────────────────────────────────┤
│ MultiBattalionEnv (envs/multi_battalion_env.py) │
│ Primitive continuous actions: [move, rotate, fire] │
│ PettingZoo ParallelEnv, fog-of-war, full combat physics │
└─────────────────────────────────────────────────────────────────────────┘
Division Commander Layer (E3.3)¶
Division Observation Space¶
Box(shape=(obs_dim,), dtype=float32) where
obs_dim = N_THEATRE_SECTORS + 8 * n_brigades + 1
For the default 2-brigade scenario (n_brigades=2): obs_dim = 22
| Slice | Feature | Range |
|---|---|---|
[0 : N_THEATRE_SECTORS] |
Theatre sector control (5 vertical strips) | [0, 1] |
sector_control[s] = blue strength / |
||
| (blue + red) in sector s. | ||
| 0.5 when no units occupy the sector. | ||
[5 : 5+3*n_brigades] |
Per-brigade status (3 per brigade) | [0, 1] |
[avg_strength, avg_morale, alive_ratio] |
||
| Zeros for fully destroyed brigades. | ||
[5+3*nb : 5+8*nb] |
Per-brigade threat vector (5 per brigade) | mixed |
[dist/diag, cos_bearing, sin_bearing, |
||
enemy_avg_strength, enemy_avg_morale] |
||
| — nearest alive Red brigade centroid. | ||
Sentinel [1, 0, 0, 0, 0] if no enemy alive. |
||
[-1] |
Step progress: step / max_steps |
[0, 1] |
Division Action Space¶
MultiDiscrete([n_div_options] * n_brigades) — one operational command per
Blue brigade. For n_brigades=2, n_div_options=6: shape (2,) with each
element in [0, 5].
Operational command vocabulary¶
| Index | Division command | Translated brigade option |
|---|---|---|
| 0 | advance_theatre |
advance_sector (0) |
| 1 | hold_position |
defend_position (1) |
| 2 | envelop_left |
flank_left (2) |
| 3 | envelop_right |
flank_right (3) |
| 4 | strategic_withdrawal |
withdraw (4) |
| 5 | mass_fires |
concentrate_fire (5) |
Brigade Grouping¶
Blue battalions are assigned to brigades in consecutive index order:
brigade i = battalions [i * n_blue_per_brigade … (i+1) * n_blue_per_brigade).
The division command for brigade i is broadcast to all battalions in that brigade:
Frozen Brigade Policy for Red¶
An SB3 PPO brigade checkpoint can be passed to DivisionEnv as
brigade_policy. At each division step:
DivisionEnv._get_red_brigade_obs()builds a symmetric brigade-style observation for the Red side.- The frozen policy's
predict()returns Red brigade actions (option indices per Red brigade). - These are fanned out to Red battalions and injected via
BrigadeEnv._forced_red_options, which overridesBrigadeEnv._get_red_action()to execute the option's primitive policy.
Brigade Commander Layer (E3.2)¶
Brigade Observation Space¶
Box(shape=(obs_dim,), dtype=float32) where obs_dim = 3 + 7 * n_blue + 1
For the default 2v2 scenario (n_blue=2): obs_dim = 18
Layout¶
| Slice | Feature | Range |
|---|---|---|
[0:3] |
Sector control (3 vertical strips) | [0, 1] |
sector_control[s] = blue strength in |
||
| sector s / (blue + red strength in s). | ||
| 0.5 when no units occupy the sector. | ||
[3 : 3+2*n_blue] |
Per-blue battalion [strength, morale] |
[0, 1] |
| Zeros for dead battalions. | ||
[3+2*nb : 3+7*nb] |
Per-blue enemy threat vector (5 per bat.) | mixed |
[dist/diag, cos_bearing, sin_bearing, |
||
enemy_strength, enemy_morale] |
||
| — nearest alive red battalion. | ||
Sentinel [1, 0, 0, 0, 0] if no enemy. |
||
[-1] |
Step progress: step / max_steps |
[0, 1] |
Threat vector per battalion¶
Each blue battalion i contributes a 5-float block to the threat section:
| Index | Feature | Range |
|---|---|---|
| 0 | Distance to nearest red / diagonal | [0, 1] |
| 1 | cos(bearing_to_nearest_red) |
[-1, 1] |
| 2 | sin(bearing_to_nearest_red) |
[-1, 1] |
| 3 | Nearest red battalion strength | [0, 1] |
| 4 | Nearest red battalion morale | [0, 1] |
Brigade Action Space¶
MultiDiscrete([n_options] * n_blue) — one option index per Blue battalion.
For n_blue=2, n_options=6: shape (2,) with each element in [0, 5].
Option vocabulary¶
| Index | Name | Behaviour |
|---|---|---|
| 0 | advance_sector |
Forward at 0.8 speed + suppression fire (0.2) |
| 1 | defend_position |
Stationary, full sustained fire (1.0) |
| 2 | flank_left |
Move at 0.6 speed + full CCW rotation |
| 3 | flank_right |
Move at 0.6 speed + full CW rotation |
| 4 | withdraw |
Retreat at full speed (-1.0), no fire |
| 5 | concentrate_fire |
Stationary + tracking rotation + full fire |
Options execute for up to 30 primitive steps (default max_steps).
Flanking options cap at 15 steps.
Option Dispatcher (Brigade Layer)¶
Option selection and dispatch are implemented inside BrigadeEnv.step().
At each brigade-level macro-step, step(...):
- Maps the brigade action for each alive Blue battalion (an option index) to an
Optionobject from the vocabulary. - Runs the inner
MultiBattalionEnvstep-by-step until all selected options have terminated (condition fired, hard cap, or env episode ends). - Aggregates rewards across primitive steps.
- Returns the mean reward over Blue battalions as the brigade scalar reward.
Frozen Battalion Policy (Brigade Layer)¶
During brigade training, a v2 MAPPO checkpoint can be loaded to drive Red
agents as a challenging opponent. The checkpoint is loaded via
training.train_brigade.load_frozen_battalion_policy():
from training.train_brigade import load_frozen_battalion_policy
policy = load_frozen_battalion_policy(
checkpoint_path=Path("checkpoints/mappo_2v2/mappo_policy_final.pt"),
obs_dim=22, # MultiBattalionEnv obs_dim for 2v2
action_dim=3,
state_dim=25, # MultiBattalionEnv state_dim for 2v2
n_agents=2,
)
# All parameters are frozen:
assert all(not p.requires_grad for p in policy.parameters())
Training Scripts¶
Brigade training (E3.2)¶
training/train_brigade.py uses Stable-Baselines3 PPO:
# Default config (500k macro-steps, stationary Red)
python training/train_brigade.py
# With frozen v2 Red policy
python training/train_brigade.py \
env.battalion_checkpoint=checkpoints/mappo_2v2/mappo_policy_final.pt
Division training (E3.3)¶
training/train_division.py uses Stable-Baselines3 PPO:
# Default config (1M division macro-steps, stationary Red)
python training/train_division.py
# With frozen brigade Red policy
python training/train_division.py \
env.brigade_checkpoint=checkpoints/brigade/ppo_brigade_final.zip
# Override timesteps
python training/train_division.py training.total_timesteps=2000000
Key hyperparameters (configs/experiment_division.yaml):
| Parameter | Default | Description |
|---|---|---|
n_brigades |
2 | Blue brigades |
n_blue_per_brigade |
2 | Battalions per brigade |
total_timesteps |
1 000 000 | Division macro-steps |
n_steps |
512 | PPO rollout buffer size |
eval_freq |
20 000 | Win-rate evaluation frequency |
checkpoint_freq |
100 000 | Checkpoint save frequency |
Acceptance Criteria Mapping¶
E3.2 — Brigade Commander¶
| Criterion | Implementation |
|---|---|
| Brigade PPO converges within 500k steps | train_brigade.py + BrigadeWinRateCallback for W&B tracking |
| Battalion policies are frozen (no gradient updates) | load_frozen_battalion_policy() + requires_grad_(False) on all params |
| Brigade obs / action shapes documented | This document (see tables above) |
tests/test_brigade_env.py passes |
Full test suite in tests/test_brigade_env.py |
E3.3 — Division Commander¶
| Criterion | Implementation |
|---|---|
| Three-echelon hierarchy end-to-end | DivisionEnv → BrigadeEnv → MultiBattalionEnv → simulation |
| Division policy converges within 1M steps | train_division.py + DivisionWinRateCallback for W&B tracking |
| Brigade policies frozen during division training | load_frozen_brigade_policy() + requires_grad_(False) on policy params |
| Division obs / action shapes documented | This document (see Division Commander section above) |
tests/test_division_env.py passes |
Full test suite in tests/test_division_env.py |
File Index¶
| File | Description |
|---|---|
envs/division_env.py |
Division Gymnasium env + command translator |
envs/brigade_env.py |
Brigade Gymnasium env + option dispatcher |
envs/options.py |
Option vocabulary (6 macro-actions) |
envs/multi_battalion_env.py |
Underlying PettingZoo env (primitive actions) |
training/train_division.py |
SB3 PPO division training loop + frozen brigade loader |
training/train_brigade.py |
SB3 PPO brigade training loop + frozen policy loader |
configs/experiment_division.yaml |
Default hyperparameters for division training |
configs/experiment_brigade.yaml |
Default hyperparameters for brigade training |
tests/test_division_env.py |
Unit and integration tests for DivisionEnv |
tests/test_brigade_env.py |
Unit and integration tests for BrigadeEnv |
models/mappo_policy.py |
MAPPOPolicy (v2 checkpoint format) |
This document describes the Hierarchical RL (HRL) architecture for the Brigade Commander (v3), which sits above the battalion-level policies trained in v2.
Overview¶
The HRL stack has two layers:
┌─────────────────────────────────────────────────────────────────┐
│ Brigade Commander (PPO) │
│ Observes: sector control, battalion states, enemy threats │
│ Acts: option selection per battalion (MultiDiscrete) │
│ Reward: mean of battalion rewards over the macro-step │
├─────────────────────────────────────────────────────────────────┤
│ Option Dispatcher (envs/brigade_env.py) │
│ Translates brigade macro-command → Option object │
│ Executes option for K primitive steps until termination │
├─────────────────────────────────────────────────────────────────┤
│ Battalion Options (envs/options.py) │
│ 6 hardcoded macro-actions: advance, defend, flank L/R, │
│ withdraw, concentrate fire │
├─────────────────────────────────────────────────────────────────┤
│ MultiBattalionEnv (envs/multi_battalion_env.py) │
│ Primitive continuous actions: [move, rotate, fire] │
│ PettingZoo ParallelEnv, fog-of-war, full combat physics │
└─────────────────────────────────────────────────────────────────┘
Brigade Observation Space¶
Box(shape=(obs_dim,), dtype=float32) where obs_dim = 3 + 7 * n_blue + 1
For the default 2v2 scenario (n_blue=2): obs_dim = 18
Layout¶
| Slice | Feature | Range |
|---|---|---|
[0:3] |
Sector control (3 vertical strips) | [0, 1] |
sector_control[s] = blue strength in |
||
| sector s / (blue + red strength in s). | ||
| 0.5 when no units occupy the sector. | ||
[3 : 3+2*n_blue] |
Per-blue battalion [strength, morale] |
[0, 1] |
| Zeros for dead battalions. | ||
[3+2*nb : 3+7*nb] |
Per-blue enemy threat vector (5 per bat.) | mixed |
[dist/diag, cos_bearing, sin_bearing, |
||
enemy_strength, enemy_morale] |
||
| — nearest alive red battalion. | ||
Sentinel [1, 0, 0, 0, 0] if no enemy. |
||
[-1] |
Step progress: step / max_steps |
[0, 1] |
Threat vector per battalion¶
Each blue battalion i contributes a 5-float block to the threat section:
| Index | Feature | Range |
|---|---|---|
| 0 | Distance to nearest red / diagonal | [0, 1] |
| 1 | cos(bearing_to_nearest_red) |
[-1, 1] |
| 2 | sin(bearing_to_nearest_red) |
[-1, 1] |
| 3 | Nearest red battalion strength | [0, 1] |
| 4 | Nearest red battalion morale | [0, 1] |
Brigade Action Space¶
MultiDiscrete([n_options] * n_blue) — one option index per Blue battalion.
For n_blue=2, n_options=6: shape (2,) with each element in [0, 5].
Option vocabulary¶
| Index | Name | Behaviour |
|---|---|---|
| 0 | advance_sector |
Forward at 0.8 speed + suppression fire (0.2) |
| 1 | defend_position |
Stationary, full sustained fire (1.0) |
| 2 | flank_left |
Move at 0.6 speed + full CCW rotation |
| 3 | flank_right |
Move at 0.6 speed + full CW rotation |
| 4 | withdraw |
Retreat at full speed (-1.0), no fire |
| 5 | concentrate_fire |
Stationary + tracking rotation + full fire |
Options execute for up to 30 primitive steps (default max_steps).
Flanking options cap at 15 steps.
Option Dispatcher¶
Option selection and dispatch are implemented inline inside BrigadeEnv.step().
At each brigade-level macro-step, step(...):
- Maps the brigade action for each alive Blue battalion (an option index) to an
Optionobject from the vocabulary. - Runs the inner
MultiBattalionEnvstep-by-step until all selected options have terminated (condition fired, hard cap, or env episode ends). - Aggregates rewards across primitive steps.
- Returns the mean reward over Blue battalions as the brigade scalar reward.
Frozen Battalion Policy¶
During brigade training, a v2 MAPPO checkpoint can be loaded to drive Red
agents as a challenging opponent. The checkpoint is loaded via
training.train_brigade.load_frozen_battalion_policy():
from training.train_brigade import load_frozen_battalion_policy
policy = load_frozen_battalion_policy(
checkpoint_path=Path("checkpoints/mappo_2v2/mappo_policy_final.pt"),
obs_dim=22, # MultiBattalionEnv obs_dim for 2v2
action_dim=3,
state_dim=25, # MultiBattalionEnv state_dim for 2v2
n_agents=2,
)
# All parameters are frozen:
assert all(not p.requires_grad for p in policy.parameters())
The BrigadeEnv also accepts a battalion_policy constructor argument:
BrigadeEnv.set_battalion_policy(policy) freezes the policy in-place by
calling param.requires_grad_(False) on every parameter and setting the
module to eval() mode.
Training Script¶
training/train_brigade.py uses Stable-Baselines3 PPO:
# Default config (500k macro-steps, stationary Red)
python training/train_brigade.py
# With frozen v2 Red policy
python training/train_brigade.py \
env.battalion_checkpoint=checkpoints/mappo_2v2/mappo_policy_final.pt
# Override timesteps
python training/train_brigade.py training.total_timesteps=1000000
Key hyperparameters (configs/experiment_brigade.yaml):
| Parameter | Default | Description |
|---|---|---|
total_timesteps |
500 000 | Macro-steps for training |
n_steps |
512 | PPO rollout buffer size |
n_epochs |
10 | PPO update epochs per rollout |
batch_size |
64 | Minibatch size |
lr |
3e-4 | Adam learning rate |
eval_freq |
10 000 | Win-rate evaluation frequency |
checkpoint_freq |
50 000 | Checkpoint save frequency |
Acceptance Criteria Mapping¶
| Criterion | Implementation |
|---|---|
| Brigade PPO converges within 500k steps | train_brigade.py + BrigadeWinRateCallback for W&B tracking |
| Battalion policies are frozen (no gradient updates) | load_frozen_battalion_policy() + requires_grad_(False) on all params |
| Brigade obs / action shapes documented | This document (see tables above) |
tests/test_brigade_env.py passes |
Full test suite in tests/test_brigade_env.py |
File Index¶
| File | Description |
|---|---|
envs/brigade_env.py |
Brigade Gymnasium env + option dispatcher |
envs/options.py |
Option vocabulary (6 macro-actions) |
envs/multi_battalion_env.py |
Underlying PettingZoo env (primitive actions) |
training/train_brigade.py |
SB3 PPO training loop + frozen policy loader |
configs/experiment_brigade.yaml |
Default hyperparameters for brigade training |
tests/test_brigade_env.py |
Unit and integration tests for BrigadeEnv |
models/mappo_policy.py |
MAPPOPolicy (v2 checkpoint format) |