Rebuilding the adversarial KataGo testbench (part 1)

ai go

I started reading recent papers on adversarial attacks on KataGo, found the code repository 1 cited in the paper 2, and got curious enough to see how much of it I could piece back together against the latest KataGo.

After opening the repo, I realized that it already had most of the setup needed for adversarial evaluation.

  • Adversarial model: a model trained using Adversarial MCTS (AMCTS), where the cyclic attack emerges from training against a frozen victim KataGo checkpoint.
  • Old victim model: kata1-b40c256, a 40-block, 256-channel KataGo checkpoint with about 47M parameters.

I already had the latest KataGo checked out locally, so I had Claude take a stab at porting over the AMCTS changes. The copy in KataGo-custom turned out to be enough: the actual port came to only around 300 lines against the latest branch. The adversarial search modes were mostly there, but for this setup I only cared about the base AMCTS-S. Since I was already moving to a newer engine base, I also wanted to test against a much stronger checkpoint: kata1-b28c512nbt, a 28-block, 512-channel model with about 73M parameters, roughly 1.5x the size of the older victim and the strongest checkpoint I could find.

For the very first batch, I ran only two games, with the victim and adversary swapping colors, and left the rest of the configuration alone. The result was disappointing: the games dragged on to no result, and the adversary's moves looked like a complete beginner's, with no sense of shape.

Adversary seems to be playing connect-5 in the first 50 moves :(

After checking through the logs, I started narrowing in on the configuration that actually mattered for this setup. I increased maxMovesPerGame to 1200 so the cyclic adversary had enough room to push the game toward the kind of non-resolving positions it relies on. I also disabled allowResignation to avoid early termination that could hide late-game behavior, tightened the rules with koRule = POSITIONAL and scoringRule = AREA, and enabled deterministic search controls like rootSymmetryPruning, antiMirror, and fillDameBeforePass.
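The key names above are standard KataGo config keys; a sketch of what that match-config fragment might look like follows, though the exact values and surrounding file layout are my assumptions, not the original config:

```ini
# Give the cyclic adversary room to push games into non-resolving positions
maxMovesPerGame = 1200

# Don't let early resignation hide late-game behavior
allowResignation = false

# Tighten the rules
koRule = POSITIONAL
scoringRule = AREA

# More deterministic search behavior
rootSymmetryPruning = true
antiMirror = true
fillDameBeforePass = true
```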

I ran multiple 10-game batches with the adversary using a visit budget of 600, while crippling the victim down to zero visits so it was mostly at the mercy of its own policy network. Eventually I was able to get the adversary above a 50% win rate, taking 7 out of 10 games. I also got to witness the kind of failure mode I was looking for: the victim would grow strangely confident in its position while leaving a huge eyeless group unprotected.
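With batches this small, it's worth asking how much 7/10 actually says about a true win rate above 50%. A quick one-sided binomial check (my own sanity-check sketch, not part of the original setup) makes the point:

```python
import math

def binom_p_value(wins: int, games: int, p: float = 0.5) -> float:
    """One-sided p-value: probability of seeing >= `wins` wins in
    `games` games if the true win rate were only `p`."""
    return sum(
        math.comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(wins, games + 1)
    )

print(round(binom_p_value(7, 10), 3))  # → 0.172
```

A p-value around 0.17 means 7/10 is suggestive but far from conclusive on its own, which is one reason to keep running repeated 10-game batches rather than trusting any single one.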

With the adversarial setup seemingly working correctly against the old victim model, I had Claude put together a clean benchmark harness with three modes: validate, smoke, and sweep. More test runs for another day.
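The three mode names come from the harness described above, but its actual interface isn't shown in the post; a hypothetical CLI skeleton for it might look like this:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical harness CLI: the mode names (validate, smoke, sweep)
    # are from the post; all flags and defaults below are assumptions.
    parser = argparse.ArgumentParser(
        description="adversarial KataGo benchmark harness"
    )
    sub = parser.add_subparsers(dest="mode", required=True)
    sub.add_parser("validate", help="sanity-check configs and model paths")
    smoke = sub.add_parser("smoke", help="short batch to confirm the setup runs")
    smoke.add_argument("--games", type=int, default=2)
    sweep = sub.add_parser("sweep", help="grid over adversary visit budgets")
    sweep.add_argument("--adv-visits", type=int, nargs="+", default=[600])
    return parser

args = build_parser().parse_args(["smoke", "--games", "10"])
print(args.mode, args.games)  # → smoke 10
```

Splitting cheap validation from short smoke runs and longer sweeps keeps misconfigured batches (like that first two-game run) from burning GPU time.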

Adversary (B) wins by resignation, victim's bottom-right captured
Adversary (W) wins by resignation, another massive group captured

References

  1. AlignmentResearch, go_attack — code repository for adversarial Go research.
  2. Wang et al., Adversarial Policies Beat Superhuman Go AIs, 2023.