KataGo Quantization Sprint Plan (2 weeks, CUDA-only, GTP, Rust controller)

Role

You are my sprint coach + technical PM for this 2-week project.

Goal: ship a reproducible Rust harness that runs two KataGo GTP engines (baseline vs variant) and produces performance + strength-proxy results.

Schedule constraints

  • Today is planning only. Start tomorrow.
  • Each week:
    • 3 low-energy weekdays (45–75 min)
    • 2 high-energy weekdays (90–150 min)
    • 1 weekend off day
    • 1 long weekend day (3–5 hours)
  • I run on 3 weekdays per week; these usually coincide with the low-energy days.

Hard scope rules

  1. CUDA only. Use KataGo in GTP engine mode.
  2. Two processes: Engine A (baseline) and Engine B (variant). No in-process precision swapping.
  3. The baseline is INT4. The minimum experiment matrix is an INT4 degradation ablation that spans strength from super-strong to amateur by simulating lower-bit quantization in:
    • trunk/backbone
    • policy/value heads
    • late layers
  4. Every change must be measurable and recorded in SQLite; reproducibility is not a priority.

Daily workflow (how you should respond)

Each day I will provide:

  • energy level: low | high | long | off
  • what I did yesterday
  • blockers (if any)

You must respond with:

  1. Today’s tasks sized to my energy window (1–3 tasks)
  2. Definition of Done checks (concrete pass/fail)
  3. What to log (file names + key metrics)
  4. Next step (one sentence)

Outputs expected by Day 14

  • Repo runs a head-to-head match suite for any two variants and records the games played.
  • Repo runs an additional batch analysis step using the strongest model as a judge and records the judgments.
  • Reporting is done via Jupyter notebooks that read from SQLite.
  • A short report (3–6 pages or equivalent Markdown) describing protocol, results, and limitations.
  • README.md documents the setup, benchmark, and reporting steps and lists the configs used.

Day-by-day plan (Day 1–Day 14)

Day 1 (done): Two-engine match runner + SQLite storage

Tasks

  • Implement a Rust CLI that spawns two KataGo processes in GTP engine mode and plays head-to-head matches (a sketch follows this list)
  • Rotate colors and save each game as SGF
  • Store match metadata + SGF content in SQLite
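
A minimal sketch of this runner, using only the standard library. The `katago gtp -model <file> -config <file>` invocation is KataGo's real GTP mode; the file names, struct names, and the single half-move shown are illustrative, not a finished match loop:

```rust
use std::io::{BufRead, BufReader, Write};
use std::process::{Child, ChildStdin, ChildStdout, Command, Stdio};

struct GtpEngine {
    child: Child,
    stdin: ChildStdin,
    stdout: BufReader<ChildStdout>,
}

impl GtpEngine {
    fn spawn(model: &str, config: &str) -> std::io::Result<Self> {
        let mut child = Command::new("katago")
            .args(["gtp", "-model", model, "-config", config])
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .stderr(Stdio::piped()) // retain for per-engine logs
            .spawn()?;
        let stdin = child.stdin.take().unwrap();
        let stdout = BufReader::new(child.stdout.take().unwrap());
        Ok(Self { child, stdin, stdout })
    }

    /// Send one GTP command; a response ends with a blank line.
    fn send(&mut self, cmd: &str) -> std::io::Result<String> {
        writeln!(self.stdin, "{cmd}")?;
        self.stdin.flush()?;
        let mut response = String::new();
        loop {
            let mut line = String::new();
            if self.stdout.read_line(&mut line)? == 0 {
                break; // EOF: engine died mid-response
            }
            if line.trim().is_empty() && !response.is_empty() {
                break;
            }
            response.push_str(&line);
        }
        Ok(response)
    }
}

fn main() -> std::io::Result<()> {
    let mut a = GtpEngine::spawn("baseline-int4.bin.gz", "gtp_a.cfg")?;
    let mut b = GtpEngine::spawn("variant.bin.gz", "gtp_b.cfg")?;
    // One half-move of the match loop: A generates a move, B is informed.
    let reply = a.send("genmove b")?; // e.g. "= Q16"
    let coord = reply.trim().trim_start_matches('=').trim();
    b.send(&format!("play b {coord}"))?;
    for engine in [&mut a, &mut b] {
        engine.send("quit")?;
        engine.child.wait()?;
    }
    Ok(())
}
```

GTP framing keeps this simple: every response starts with `=` (success) or `?` (failure) and ends with a blank line, so the reader only accumulates lines until it sees one.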

Definition of Done

  • A match run produces games and stores rows in SQLite, and the SGFs can be extracted.

Log

  • SQLite DB (runs/matches/sgf)
  • Per-engine stdout/stderr logs

Day 2 (med): Harden match execution + begin self-eval capture

Tasks

  • Add strict per-command timeouts + clean process restart/exit behavior (a timeout sketch follows this list)
  • Add desync checks (illegal move, unexpected response shape, premature EOF)
  • Standardize match settings (rules/komi/board size/move cap/resign policy)
  • Add SQLite schema + minimal parser to capture engine self-evals (value + top-1/top-K policy) at move time
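
One way to get hard timeouts without an async runtime: do the blocking reads on a dedicated thread per engine and pull lines through an `mpsc` channel, which provides `recv_timeout`. A sketch under that design (the restart policy and full desync taxonomy are omitted):

```rust
use std::io::{BufRead, BufReader};
use std::process::ChildStdout;
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Move blocking reads onto a dedicated thread; the controller then
/// consumes lines through a channel and gains `recv_timeout`.
fn spawn_reader(stdout: ChildStdout) -> Receiver<String> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        for line in BufReader::new(stdout).lines() {
            match line {
                Ok(l) => {
                    if tx.send(l).is_err() {
                        break; // receiver dropped: engine being torn down
                    }
                }
                Err(_) => break, // read error / broken pipe => treat as desync
            }
        }
        // Thread exit drops `tx`, which the controller observes as EOF.
    });
    rx
}

/// Read one GTP response (terminated by a blank line) or fail loudly.
fn read_response(rx: &Receiver<String>, timeout: Duration) -> Result<String, String> {
    let mut response = String::new();
    loop {
        match rx.recv_timeout(timeout) {
            Ok(line) if line.trim().is_empty() && !response.is_empty() => {
                return Ok(response); // blank line closes the response
            }
            Ok(line) => {
                response.push_str(&line);
                response.push('\n');
            }
            Err(RecvTimeoutError::Timeout) => return Err("per-command timeout".into()),
            Err(RecvTimeoutError::Disconnected) => return Err("premature EOF".into()),
        }
    }
}
```

With this in place, desync checks reduce to cheap assertions on the drained lines: the first line of a response must start with `=` or `?`, and a `genmove` coordinate must be legal on the controller's own board state.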

Definition of Done

  • 100-game run completes without hanging; failures are recorded; SQLite contains non-empty self-eval rows linked to (game, ply, engine).

Log

  • SQLite: per-game termination reason; self-evals table populated
  • Engine logs retained for parse/debug

Day 3 (med): INT4 baseline + degradation variant model

Tasks

  • Set INT4 as baseline convention for Engine A
  • Add degradation knobs (trunk / policy+value heads / late layers) as first-class variant parameters for Engine B (modeled in the sketch after this list)
  • Ensure each game row stores the exact degradation parameters used
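
A sketch of how the knobs could be modeled so that exactly what ran is recoverable from SQLite alone. It assumes the `serde`/`serde_json` crates; every name and field here is a placeholder, not an existing schema:

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
enum ModuleTarget {
    Trunk,
    PolicyValueHeads,
    LateLayers,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
struct Variant {
    name: String,         // human-readable id, e.g. "trunk-3bit"
    target: ModuleTarget, // which part of the net is degraded
    simulated_bits: u8,   // effective bit width being simulated
    severity: f64,        // extra knob if degradation is continuous
}

/// Serialize once and store the JSON string on every game row, so the
/// exact parameters stay recoverable without consulting config files.
fn variant_json(v: &Variant) -> serde_json::Result<String> {
    serde_json::to_string(v)
}
```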

Definition of Done

  • Two runs differing only in degradation knobs produce distinct, correctly recorded variant parameters in SQLite.

Log

  • SQLite: variants (module targets + severity) stored per game or per run
  • Config snapshots (or resolved parameters) stored in SQLite

Day 4 (med): “Run” grouping + metadata completeness

Tasks

  • Introduce a runs table (or equivalent) to group games under one invocation; a schema sketch follows this list
  • Store run-level metadata: timestamp, CLI args, engine names, model id/hash, key match settings, variant ids
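
A possible shape for this schema, assuming the `rusqlite` crate; the table and column names are suggestions, not an existing layout:

```rust
use rusqlite::{Connection, Result};

fn init_schema(conn: &Connection) -> Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS runs (
             run_id      INTEGER PRIMARY KEY,
             started_at  TEXT NOT NULL,      -- ISO-8601 timestamp
             cli_args    TEXT NOT NULL,      -- full argv, space-joined
             engine_a    TEXT NOT NULL,      -- name + model id/hash
             engine_b    TEXT NOT NULL,
             settings    TEXT NOT NULL,      -- rules/komi/board/move cap, JSON
             variant     TEXT NOT NULL       -- serialized variant parameters
         );
         CREATE TABLE IF NOT EXISTS matches (
             match_id    INTEGER PRIMARY KEY,
             run_id      INTEGER NOT NULL REFERENCES runs(run_id),
             black       TEXT NOT NULL,      -- 'A' or 'B' (color swap)
             result      TEXT,               -- e.g. 'B+R', 'W+2.5'
             termination TEXT,               -- normal/resign/timeout/desync
             sgf         TEXT NOT NULL
         );",
    )
}
```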

Definition of Done

  • A single CLI run creates 1 run record + N linked match records; all queryable.

Log

  • SQLite: runs, matches, linkage via run_id

Day 5 (high): Ablation sweep runner (degradation ladder)

Tasks

  • Add CLI support to run a sweep over degradation levels (a named list of variants); a loop sketch follows this list
  • Ensure consistent pairing and color-swap across the ladder
  • Keep self-eval capture enabled across sweep runs
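
A sketch of the sweep loop with paired, color-swapped games. `Variant`, `create_run`, `play_game`, and `record_game` are hypothetical stand-ins for the pieces built on Days 1–4:

```rust
// Hypothetical stand-ins for the Day 1-4 pieces; bodies are stubbed.
struct Variant {
    name: String, // variant id, stored on the runs row
}
struct GameRecord {
    sgf: String,
    result: String,
}

fn create_run(_variant: &Variant) -> i64 {
    0 // would insert a runs row and return its run_id
}
fn play_game(_variant: &Variant, _a_is_black: bool) -> GameRecord {
    GameRecord { sgf: String::new(), result: String::new() } // Day 1 runner
}
fn record_game(_run_id: i64, _pair: usize, _a_is_black: bool, _game: &GameRecord) {}

/// One run per ladder point; each "pair" is the same condition played
/// twice with colors swapped, with self-eval capture left untouched.
fn run_sweep(ladder: &[Variant], games_per_point: usize) {
    for variant in ladder {
        let run_id = create_run(variant);
        for pair in 0..games_per_point / 2 {
            for &a_is_black in &[true, false] {
                let game = play_game(variant, a_is_black);
                record_game(run_id, pair, a_is_black, &game);
            }
        }
    }
}
```

Playing each condition twice with colors swapped removes any residual color asymmetry from the per-point winrate.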

Definition of Done

  • One sweep command runs multiple degradation points and records all games + self-evals in SQLite with correct variant labeling.

Log

  • SQLite: sweep definition stored (serialized) + variant ids per game
  • Engine logs for each run/sweep

Day 6 (high): Judge pipeline skeleton (strongest model as judge)

Tasks

  • Add a batch analysis step that replays SGFs and runs judge analysis using the strongest model (a sketch follows this list)
  • Store judge outputs in SQLite (per game, optionally per move)
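
For the judge, KataGo's JSON analysis engine (`katago analysis -config <cfg> -model <model>`) is a better fit than GTP: it takes one JSON query per line on stdin and emits one JSON result per analyzed turn. A sketch of a single judge query; the fields follow KataGo's documented query format, but verify against the analysis-engine docs, and the file names are placeholders:

```rust
use serde_json::json;
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // The strongest model judges via one long-lived analysis process.
    let mut judge = Command::new("katago")
        .args(["analysis", "-config", "analysis.cfg", "-model", "strongest.bin.gz"])
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    let mut stdin = judge.stdin.take().unwrap();
    let stdout = BufReader::new(judge.stdout.take().unwrap());

    // One query per game, with the move list parsed out of the stored SGF.
    let query = json!({
        "id": "match-42",                     // echo key; use the match_id
        "moves": [["B", "Q16"], ["W", "D4"]], // from the SGF
        "rules": "tromp-taylor",
        "komi": 7.5,
        "boardXSize": 19,
        "boardYSize": 19,
        "analyzeTurns": [0, 1, 2]             // judge every ply
    });
    writeln!(stdin, "{query}")?;
    drop(stdin); // EOF lets katago drain the queue and exit

    // One JSON line per analyzed turn; pull winrate/scoreLead out of
    // rootInfo and insert into the judge_* tables keyed by match_id.
    for line in stdout.lines() {
        let resp: serde_json::Value = serde_json::from_str(&line?).unwrap();
        println!("turn {} winrate {}", resp["turnNumber"], resp["rootInfo"]["winrate"]);
    }
    judge.wait()?;
    Ok(())
}
```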

Definition of Done

  • Judge step processes a small subset of games and writes judge rows linked to match ids.

Log

  • SQLite: judge_* tables linked to match_id
  • Judge engine logs

Day 7 (med): Judge + self-eval metrics (summary features)

Tasks

  • Define 2–3 judge-derived summary metrics per game (e.g., average per-move winrate loss, blunder count, final eval)
  • Define 2–3 self-eval-derived metrics per game (e.g., avg self winrate, entropy of policy top-K)
  • Add “comparison” metrics (e.g., judge winrate − self winrate, correlation vs degradation); a query sketch follows this list
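
A sketch of the one-row-per-game query, assuming `rusqlite` plus hypothetical `judge_moves` and `self_evals` tables keyed by `(match_id, ply)`; the column names and the blunder threshold are placeholders to adapt to the real schema:

```rust
use rusqlite::{Connection, Result};

fn game_summaries(conn: &Connection) -> Result<()> {
    let mut stmt = conn.prepare(
        "SELECT m.match_id,
                AVG(j.winrate)                  AS judge_avg_winrate,
                SUM(j.winrate_drop > 0.15)      AS blunder_count,  -- placeholder threshold
                AVG(s.winrate)                  AS self_avg_winrate,
                AVG(j.winrate) - AVG(s.winrate) AS judge_minus_self
         FROM matches m
         JOIN judge_moves j ON j.match_id = m.match_id
         JOIN self_evals  s ON s.match_id = m.match_id AND s.ply = j.ply
         GROUP BY m.match_id",
    )?;
    let mut rows = stmt.query([])?;
    while let Some(row) = rows.next()? {
        let match_id: i64 = row.get(0)?;
        let judge_wr: f64 = row.get(1)?;
        let blunders: i64 = row.get(2)?;
        println!("match {match_id}: judge_wr={judge_wr:.3} blunders={blunders}");
    }
    Ok(())
}
```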

Definition of Done

  • A notebook or SQL query returns one row per game with judge metrics + self-eval metrics + comparison metrics.

Log

  • SQLite: derived/summary tables or views (documented)

Day 8 (high): Primary data generation (coarse ladder)

Tasks

  • Run a coarse degradation ladder covering “very strong → amateur-ish”
  • Ensure sufficient games per ladder point for visible trends (see the sizing note after this list)
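
A back-of-envelope way to size “sufficient”: the measured winrate at each ladder point is a binomial proportion over n games, so its standard error is

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}
```

At the worst case (p = 0.5) with n = 100 games, SE = 0.05, i.e. a 95% interval of roughly ±10 percentage points; adjacent ladder points need expected winrate gaps larger than that to separate visibly.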

Definition of Done

  • SQLite contains a complete coarse ladder dataset: runs + matches + SGFs + self-evals.

Log

  • SQLite: run ids for coarse ladder
  • Engine logs retained

Day 9 (high): Refine ladder near the strength cliff

Tasks

  • Identify degradation range with rapid strength drop
  • Run denser ladder in that region

Definition of Done

  • Refined ladder dataset exists in SQLite and is clearly labeled by run ids.

Log

  • SQLite: run ids for refined ladder

Day 10 (med): Judge the full dataset (batch)

Tasks

  • Run judge analysis for all ladder games (or a clearly defined subset)
  • Record coverage and failures in SQLite (a coverage query sketch follows this list)
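
A sketch of the coverage query, assuming `rusqlite` and a hypothetical `judge_games` table keyed by `match_id`:

```rust
use rusqlite::{Connection, Result};

/// Fraction of matches that have at least one judge row.
/// (Returns an error if the matches table is empty, since the ratio is NULL.)
fn judge_coverage(conn: &Connection) -> Result<f64> {
    conn.query_row(
        "SELECT CAST(COUNT(DISTINCT j.match_id) AS REAL)
                / COUNT(DISTINCT m.match_id)
         FROM matches m
         LEFT JOIN judge_games j ON j.match_id = m.match_id",
        [],
        |row| row.get(0),
    )
}
```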

Definition of Done

  • Judge tables populated for the dataset used in reporting; coverage is queryable.

Log

  • SQLite: judge coverage (% games judged) query
  • Judge logs

Day 11 (med): Jupyter notebooks (queries + plots)

Tasks

  • Create notebooks that connect to SQLite and generate:
    • winrate vs degradation (by module target and severity)
    • judge metrics vs degradation
    • self-eval metrics vs degradation
    • judge vs self-eval deltas vs degradation

Definition of Done

  • Running the notebooks produces the main plots/tables used in the report.

Log

  • notebooks/*.ipynb committed
  • Notebook reads from SQLite only

Day 12 (low): Workflow documentation (navigation steps)

Tasks

  • Document the “how to run + how to analyze” steps (no reproducibility guarantee)
  • Include: where DB lives, which commands to run, which notebook(s) to open, which cells to execute

Definition of Done

  • README (or docs/workflow.md) guides setup → run → judge → notebook → report.

Log

  • README.md updates (or docs/workflow.md)

Day 13 (med): Short report draft (protocol + results + limitations)

Tasks

  • Write report (3–6 pages equivalent Markdown) summarizing:
    • degradation design (trunk/heads/late layers)
    • match protocol (paired games, color swap)
    • results (winrate + judge + self-eval comparisons)
    • limitations (variance, judge assumptions, non-reproducibility)

Definition of Done

  • Report includes final plots/tables and references the relevant run ids.

Log

  • report/report.md (or equivalent)

Day 14 (low): Final cleanup (usability + completeness)

Tasks

  • Ensure all variants used are listed clearly (names + parameters)
  • Add troubleshooting notes for common failures
  • Final pass on CLI help text and docs

Definition of Done

  • Clear end-to-end workflow exists: define variants → run matches → run judge → open notebook → read report.

Log

  • Updated README.md, docs/, notebooks/, report/

Daily update template (what I will tell you)

  • Day X
  • Energy: low/high/long/off
  • Yesterday: ...
  • Today constraints: ...
  • Blockers: ...
  • What I need: today’s tasks + DoD + what to log + next step