KataGo Quantization Sprint Plan
KataGo Quantization Sprint (2 weeks, CUDA-only, GTP, Rust controller)
Role
You are my sprint coach + technical PM for this 2-week project.
Goal: ship a Rust harness that runs two KataGo GTP engines (baseline vs variant) and produces performance + strength-proxy results.
Schedule constraints
- Today is planning only. Start tomorrow.
- Each week:
- 3 low-energy weekdays (45–75 min)
- 2 high-energy weekdays (90–150 min)
- 1 weekend off day
- 1 long weekend day (3–5 hours)
- I run on 3 weekdays per week, usually the low-energy days.
Hard scope rules
- CUDA only. Use KataGo in GTP engine mode.
- Two processes: Engine A (baseline) and Engine B (variant). No in-process precision swapping.
- Baseline is INT4. The minimum experiment matrix is an INT4 degradation ablation that spans strength from super-strong to amateur by simulating lower-bit quantization in:
- trunk/backbone
- policy/value heads
- late layers
- Every change must be measurable and recorded in SQLite; reproducibility is not a priority
Daily workflow (how you should respond)
Each day I will provide:
- energy level: low | high | long | off
- what I did yesterday
- blockers (if any)
You must respond with:
- Today’s tasks sized to my energy window (1–3 tasks)
- Definition of Done checks (concrete pass/fail)
- What to log (file names + key metrics)
- Next step (one sentence)
Outputs expected by Day 14
- Repo runs a head-to-head match suite for any two variants and records the games played.
- Repo runs an additional batch analysis step using the strongest model as a judge and records the judgments.
- Reporting is done via Jupyter notebooks that read from SQLite.
- A short report (3–6 pages or equivalent Markdown) describing protocol, results, and limitations.
- `README.md` documents the steps to run setup, benchmark, and reporting, and lists the configs used.
Day-by-day plan (Day 1–Day 14)
Day 1 (done): Two-engine match runner + SQLite storage
Tasks
- Implement a Rust CLI that spawns two KataGo processes in GTP engine mode and plays head-to-head matches (see the GTP sketch after this day's log)
- Rotate colors and save each game as SGF
- Store match metadata + SGF content in SQLite
Definition of Done
- A match run produces games, stores rows in SQLite, and SGFs can be extracted.
Log
- SQLite DB (runs/matches/sgf)
- Per-engine stdout/stderr logs
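For reference, a minimal sketch of the two-process core. It assumes KataGo is on `PATH` and is started as `katago gtp -model <file> -config <file>` (standard KataGo CLI flags); the model and config paths below are placeholders. The `send` loop relies on GTP framing: responses begin with `=` or `?` and end at a blank line.

```rust
use std::io::{BufRead, BufReader, Write};
use std::process::{Child, ChildStdin, ChildStdout, Command, Stdio};

/// One GTP engine: a spawned KataGo process plus buffered pipes.
struct GtpEngine {
    child: Child,
    stdin: ChildStdin,
    stdout: BufReader<ChildStdout>,
}

impl GtpEngine {
    /// Spawn `katago gtp`; keep stderr piped so per-engine logs can be saved.
    fn spawn(model: &str, config: &str) -> std::io::Result<Self> {
        let mut child = Command::new("katago")
            .args(["gtp", "-model", model, "-config", config])
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .stderr(Stdio::piped())
            .spawn()?;
        let stdin = child.stdin.take().expect("piped stdin");
        let stdout = BufReader::new(child.stdout.take().expect("piped stdout"));
        Ok(Self { child, stdin, stdout })
    }

    /// Send one GTP command and collect the response up to the blank line.
    fn send(&mut self, cmd: &str) -> std::io::Result<String> {
        writeln!(self.stdin, "{cmd}")?;
        self.stdin.flush()?;
        let mut response = String::new();
        loop {
            let mut line = String::new();
            if self.stdout.read_line(&mut line)? == 0 {
                break; // EOF: engine died; treat as desync upstream
            }
            if line.trim().is_empty() && !response.is_empty() {
                break; // blank line terminates a GTP response
            }
            response.push_str(&line);
        }
        Ok(response)
    }
}

fn main() -> std::io::Result<()> {
    let mut a = GtpEngine::spawn("baseline.bin.gz", "gtp_a.cfg")?;
    let mut b = GtpEngine::spawn("variant.bin.gz", "gtp_b.cfg")?;
    for e in [&mut a, &mut b] {
        e.send("boardsize 19")?;
        e.send("komi 7.5")?;
    }
    // One half-move of the relay loop: A generates, B mirrors.
    let mv = a.send("genmove b")?;
    let coord = mv.trim_start_matches('=').trim().to_string();
    b.send(&format!("play b {coord}"))?;
    let _ = (a.child.kill(), b.child.kill());
    Ok(())
}
```

The full runner alternates `genmove`/`play` until two passes or a resignation, then writes the SGF and inserts the match row.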
Day 2 (med): Harden match execution + begin self-eval capture
Tasks
- Add strict per-command timeouts + clean process restart/exit behavior
- Add desync checks (illegal move, unexpected response shape, premature EOF); a timeout/desync sketch follows this day's log
- Standardize match settings (rules/komi/board size/move cap/resign policy)
- Add SQLite schema + minimal parser to capture engine self-evals (value + top-1/top-K policy) at move time
Definition of Done
- 100-game run completes without hanging; failures are recorded; SQLite contains non-empty self-eval rows linked to (game, ply, engine).
Log
- SQLite: per-game termination reason; self-evals table populated
- Engine logs retained for parse/debug
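One non-async way to get strict per-command timeouts: a dedicated reader thread per engine feeds a channel, and the controller waits with `recv_timeout`. This sketch would plug into the Day 1 `GtpEngine` by moving its stdout half onto the thread; the error taxonomy mirrors the desync checks above. Note the timeout is applied per line, so a full version would also budget total elapsed time.

```rust
use std::io::{BufRead, BufReader, Read};
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Outcomes of one GTP exchange; anything but Ok should end the game
/// with a recorded termination reason and a clean engine restart.
#[derive(Debug)]
enum GtpError {
    Timeout,
    Eof,               // premature EOF: the engine process died
    Malformed(String), // response did not start with '=' or '?'
}

/// Drain engine stdout on a dedicated thread so the controller can
/// enforce timeouts with `recv_timeout` instead of blocking on a read.
fn spawn_reader<R: Read + Send + 'static>(stdout: R) -> Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for line in BufReader::new(stdout).lines() {
            match line {
                Ok(l) => {
                    if tx.send(l).is_err() {
                        break; // controller dropped the receiver
                    }
                }
                Err(_) => break, // pipe closed
            }
        }
    });
    rx
}

/// Collect one GTP response (terminated by a blank line) within `per_line`.
fn read_response(rx: &Receiver<String>, per_line: Duration) -> Result<String, GtpError> {
    let mut out = String::new();
    loop {
        match rx.recv_timeout(per_line) {
            Ok(line) if line.trim().is_empty() && !out.is_empty() => break,
            Ok(line) if line.trim().is_empty() => continue, // skip stray leading blanks
            Ok(line) => {
                out.push_str(&line);
                out.push('\n');
            }
            Err(RecvTimeoutError::Timeout) => return Err(GtpError::Timeout),
            Err(RecvTimeoutError::Disconnected) => return Err(GtpError::Eof),
        }
    }
    if out.starts_with('=') || out.starts_with('?') {
        Ok(out)
    } else {
        Err(GtpError::Malformed(out))
    }
}

fn main() {
    // Demo with an in-memory "engine" emitting one well-formed response.
    let fake = std::io::Cursor::new(b"= D4\n\n".to_vec());
    let rx = spawn_reader(fake);
    println!("{:?}", read_response(&rx, Duration::from_millis(100)));
}
```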
Day 3 (med): INT4 baseline + degradation variant model
Tasks
- Set INT4 as baseline convention for Engine A
- Add degradation knobs (trunk / policy+value heads / late layers) as first-class variant parameters for Engine B
- Ensure each game row stores the exact degradation parameters used (a serialization sketch follows this day's log)
Definition of Done
- Two runs differing only in degradation knobs produce distinct, correctly recorded variant parameters in SQLite.
Log
- SQLite: variants (module targets + severity) stored per game or per run
- Config snapshots (or resolved parameters) stored in SQLite
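A sketch of the degradation knobs as one serde-serializable struct whose JSON is stored verbatim with every game row. All names here (`Target`, `effective_bits`, `model_path`) are illustrative choices, not KataGo terminology; the assumption is that each ladder point is materialized as a pre-quantized model file that Engine B loads.

```rust
use serde::{Deserialize, Serialize};

/// Which part of the net the simulated lower-bit quantization targets.
/// Names are illustrative, not KataGo terminology.
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
enum Target {
    Trunk,
    PolicyValueHeads,
    LateLayers,
}

/// One point on the degradation ladder. Serialized to JSON and stored
/// verbatim with every game row so results stay attributable.
#[derive(Debug, Clone, Serialize, Deserialize)]
struct Variant {
    name: String,       // e.g. "trunk-int3"
    target: Target,
    effective_bits: u8, // simulated bit width: 4 = baseline, lower = degraded
    model_path: String, // pre-quantized model file loaded by Engine B
}

fn main() -> serde_json::Result<()> {
    let v = Variant {
        name: "trunk-int3".into(),
        target: Target::Trunk,
        effective_bits: 3,
        model_path: "models/trunk_int3.bin.gz".into(),
    };
    // This JSON string is what goes into the game row's variant column.
    println!("{}", serde_json::to_string(&v)?);
    Ok(())
}
```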
Day 4 (med): “Run” grouping + metadata completeness
Tasks
- Introduce a `runs` table (or equivalent) to group games under one invocation (schema sketch after this day's log)
- Store run-level metadata: timestamp, CLI args, engine names, model id/hash, key match settings, variant ids
Definition of Done
- A single CLI run creates 1 run record + N linked match records; all queryable.
Log
- SQLite: `runs`, `matches`, linkage via `run_id`
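A minimal `rusqlite` sketch of the run/match linkage; every column is a placeholder, to be extended with the Day 2 self-eval tables and the Day 6 judge tables.

```rust
use rusqlite::Connection;

/// Run/match linkage sketch; extend with self-eval and judge tables.
fn init_schema(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS runs (
            run_id     INTEGER PRIMARY KEY,
            started_at TEXT NOT NULL, -- ISO-8601 timestamp
            cli_args   TEXT NOT NULL, -- full argv, space-joined
            engine_a   TEXT NOT NULL, -- name + model id/hash
            engine_b   TEXT NOT NULL,
            settings   TEXT NOT NULL  -- rules/komi/board/caps as JSON
         );
         CREATE TABLE IF NOT EXISTS matches (
            match_id       INTEGER PRIMARY KEY,
            run_id         INTEGER NOT NULL REFERENCES runs(run_id),
            variant_params TEXT NOT NULL, -- JSON from the Variant struct
            winner         TEXT,          -- 'A', 'B', or NULL if unfinished
            termination    TEXT,          -- resign/move-cap/timeout/desync
            sgf            TEXT NOT NULL
         );",
    )
}

fn main() -> rusqlite::Result<()> {
    init_schema(&Connection::open("sprint.sqlite")?)
}
```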
Day 5 (high): Ablation sweep runner (degradation ladder)
Tasks
- Add CLI support to run a sweep over degradation levels (a named list of variants); a loop sketch follows this day's log
- Ensure consistent pairing and color-swap across the ladder
- Keep self-eval capture enabled across sweep runs
Definition of Done
- One sweep command runs multiple degradation points and records all games + self-evals in SQLite with correct variant labeling.
Log
- SQLite: sweep definition stored (serialized) + variant ids per game
- Engine logs for each run/sweep
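The sweep itself can stay a thin loop over the ladder. In the sketch below, `Variant`, `play_game`, and `record_game` are stubs standing in for the Day 1 to Day 4 pieces (real signatures will differ); the point is only the pairing and color-swap discipline.

```rust
use rusqlite::Connection;

// Stubs for pieces built on Days 1-4; real signatures will differ.
struct Variant { name: String }
struct GameRecord { sgf: String }
fn play_game(_v: &Variant, _baseline_is_black: bool) -> anyhow::Result<GameRecord> {
    unimplemented!("Day 1-2 match runner")
}
fn record_game(_c: &Connection, _run_id: i64, _v: &Variant, _g: &GameRecord) -> anyhow::Result<()> {
    unimplemented!("Day 3-4 storage")
}

/// One sweep = one run_id covering every ladder point, with paired
/// color-swapped games so each variant sees both colors equally.
fn run_sweep(
    conn: &Connection,
    run_id: i64,
    ladder: &[Variant],
    games_per_point: usize,
) -> anyhow::Result<()> {
    for variant in ladder {
        for i in 0..games_per_point {
            let baseline_is_black = i % 2 == 0; // swap colors on odd games
            let game = play_game(variant, baseline_is_black)?; // self-eval capture stays on
            record_game(conn, run_id, variant, &game)?;
        }
    }
    Ok(())
}

fn main() {} // the real CLI wires run_sweep into the Day 4 run bookkeeping
```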
Day 6 (high): Judge pipeline skeleton (strongest model as judge)
Tasks
- Add a batch analysis step that replays SGFs and runs judge analysis using the strongest model (see the analysis-engine sketch after this day's log)
- Store judge outputs in SQLite (per game, optionally per move)
Definition of Done
- Judge step processes a small subset of games and writes judge rows linked to match ids.
Log
- SQLite: `judge_*` tables linked to `match_id`
- Judge engine logs
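For the judge, KataGo's JSON analysis engine (`katago analysis`) is the natural fit: one JSON query per game on stdin, one JSON response per analyzed turn on stdout. The sketch below uses the field names from KataGo's analysis-engine documentation (`moves`, `analyzeTurns`, `rootInfo.winrate`); verify them against the KataGo version actually deployed.

```rust
use serde_json::json;
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};

fn main() -> anyhow::Result<()> {
    let mut judge = Command::new("katago")
        .args(["analysis", "-config", "analysis.cfg", "-model", "judge.bin.gz"])
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    let mut stdin = judge.stdin.take().unwrap();
    let stdout = BufReader::new(judge.stdout.take().unwrap());

    // One game replayed from SGF-derived moves; analyze every turn.
    let query = json!({
        "id": "match42", // echo the match_id back for linkage
        "rules": "tromp-taylor",
        "komi": 7.5,
        "boardXSize": 19,
        "boardYSize": 19,
        "moves": [["B", "D4"], ["W", "Q16"]],
        "analyzeTurns": [0, 1, 2]
    });
    writeln!(stdin, "{query}")?;
    drop(stdin); // closing stdin lets the engine exit once it has answered

    // One JSON line per analyzed turn, in completion (not turn) order.
    for line in stdout.lines() {
        let resp: serde_json::Value = serde_json::from_str(&line?)?;
        // rootInfo.winrate is the judge's eval at that turn; store it in
        // the judge_* tables keyed by (match_id, turn).
        println!(
            "{} turn {} winrate {}",
            resp["id"], resp["turnNumber"], resp["rootInfo"]["winrate"]
        );
    }
    judge.wait()?;
    Ok(())
}
```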
Day 7 (med): Judge + self-eval metrics (summary features)
Tasks
- Define 2–3 judge-derived summary metrics per game (e.g., avg loss, blunder count, final eval)
- Define 2–3 self-eval-derived metrics per game (e.g., avg self winrate, entropy of policy top-K)
- Add “comparison” metrics (e.g., judge winrate − self winrate, correlation vs degradation); a summary-view sketch follows this day's log
Definition of Done
- A notebook or SQL query returns one row per game with judge metrics + self-eval metrics + comparison metrics.
Log
- SQLite: derived/summary tables or views (documented)
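One way to meet the DoD is a SQL view that collapses judge and self-eval rows to one row per game. Every table and column name below is a placeholder for whatever the Day 2 and Day 6 schemas actually contain, and the blunder threshold is an arbitrary knob.

```rust
use rusqlite::Connection;

/// One-row-per-game summary; notebooks then read `game_summary` directly.
fn create_summary_view(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch(
        "CREATE VIEW IF NOT EXISTS game_summary AS
         SELECT m.match_id,
                m.variant_params,
                AVG(j.point_loss)               AS judge_avg_loss,
                SUM(j.point_loss > 3.0)         AS judge_blunders,  -- threshold is a knob
                AVG(s.winrate)                  AS self_avg_winrate,
                AVG(j.winrate) - AVG(s.winrate) AS judge_self_delta
         FROM matches m
         JOIN judge_moves j ON j.match_id = m.match_id
         JOIN self_evals  s ON s.match_id = m.match_id AND s.ply = j.ply
         GROUP BY m.match_id;",
    )
}

fn main() -> rusqlite::Result<()> {
    create_summary_view(&Connection::open("sprint.sqlite")?)
}
```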
Day 8 (high): Primary data generation (coarse ladder)
Tasks
- Run a coarse degradation ladder covering “very strong → amateur-ish”
- Ensure sufficient games per ladder point for visible trends
Definition of Done
- SQLite contains a complete coarse ladder dataset: runs + matches + SGFs + self-evals.
Log
- SQLite: run ids for coarse ladder
- Engine logs retained
Day 9 (high): Refine ladder near the strength cliff
Tasks
- Identify degradation range with rapid strength drop
- Run denser ladder in that region
Definition of Done
- Refined ladder dataset exists in SQLite and is clearly labeled by run ids.
Log
- SQLite: run ids for refined ladder
Day 10 (med): Judge the full dataset (batch)
Tasks
- Run judge analysis for all ladder games (or a clearly defined subset)
- Record coverage and failures in SQLite (a coverage query is sketched after this day's log)
Definition of Done
- Judge tables populated for the dataset used in reporting; coverage is queryable.
Log
- SQLite: judge coverage (% games judged) query
- Judge logs
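The coverage number can come from a single `LEFT JOIN`; table names follow the earlier placeholder schema.

```rust
use rusqlite::Connection;

/// Share of matches that have at least one judge row (placeholder names).
fn judge_coverage(conn: &Connection) -> rusqlite::Result<f64> {
    conn.query_row(
        "SELECT COALESCE(
            100.0 * COUNT(DISTINCT j.match_id) / COUNT(DISTINCT m.match_id),
            0.0)
         FROM matches m
         LEFT JOIN judge_moves j ON j.match_id = m.match_id",
        [],
        |row| row.get(0),
    )
}

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("sprint.sqlite")?;
    println!("judged: {:.1}%", judge_coverage(&conn)?);
    Ok(())
}
```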
Day 11 (med): Jupyter notebooks (queries + plots)
Tasks
- Create notebooks that connect to SQLite and generate:
- winrate vs degradation (by module target and severity)
- judge metrics vs degradation
- self-eval metrics vs degradation
- judge vs self-eval deltas vs degradation
Definition of Done
- Running the notebooks produces the main plots/tables used in the report.
Log
- `notebooks/*.ipynb` committed
- Notebooks read from SQLite only
Day 12 (low): Workflow documentation (navigation steps)
Tasks
- Document the “how to run + how to analyze” steps (no reproducibility guarantee)
- Include: where DB lives, which commands to run, which notebook(s) to open, which cells to execute
Definition of Done
- README (or `docs/workflow.md`) guides setup → run → judge → notebook → report.
Log
- `README.md` updates (or `docs/workflow.md`)
Day 13 (med): Short report draft (protocol + results + limitations)
Tasks
- Write the report (3–6 pages or equivalent Markdown) summarizing:
- degradation design (trunk/heads/late layers)
- match protocol (paired games, color swap)
- results (winrate + judge + self-eval comparisons)
- limitations (variance, judge assumptions, non-reproducibility)
Definition of Done
- Report includes final plots/tables and references the relevant run ids.
Log
- `report/report.md` (or equivalent)
Day 14 (low): Final cleanup (usability + completeness)
Tasks
- Ensure all variants used are listed clearly (names + parameters)
- Add troubleshooting notes for common failures
- Final pass on CLI help text and docs
Definition of Done
- Clear end-to-end workflow exists: define variants → run matches → run judge → open notebook → read report.
Log
- Updated `README.md`, `docs/`, `notebooks/`, `report/`
Daily update template (what I will tell you)
- Day X
- Energy: low/high/long/off
- Yesterday: ...
- Today constraints: ...
- Blockers: ...
- What I need: today’s tasks + DoD + what to log + next step