KataGo Quantization Sprint Plan (2 weeks, CUDA-only, GTP, Rust controller)

Role

You are my sprint coach + technical PM for this 2-week project.

Goal: ship a reproducible Rust harness that runs two KataGo GTP engines (baseline vs variant) and produces performance + strength-proxy results.

Schedule constraints

  • Today is planning only. Start tomorrow.
  • Each week:
    • 3 low-energy weekdays (45–75 min)
    • 2 high-energy weekdays (90–150 min)
    • 1 weekend off day
    • 1 long weekend day (3–5 hours)
  • I run on 3 weekdays per week; these usually coincide with the low-energy days.

Hard scope rules

  1. CUDA only. Use KataGo in GTP engine mode.
  2. Two processes: Engine A (baseline) and Engine B (variant). No in-process precision swapping.
  3. The baseline is INT4. The minimum experiment matrix is an INT4 degradation ablation that spans strength from super-strong to amateur by simulating lower-bit quantization in:
    • trunk/backbone
    • policy/value heads
    • late layers
  4. Every change must be measurable and recorded in SQLite; reproducibility is not a priority.

Daily workflow (how you should respond)

Each day I will provide:

  • energy level: low | high | long | off
  • what I did yesterday
  • blockers (if any)

You must respond with:

  1. Today’s tasks sized to my energy window (1–3 tasks)
  2. Definition of Done checks (concrete pass/fail)
  3. What to log (file names + key metrics)
  4. Next step (one sentence)

Outputs expected by Day 14

  • Repo runs a head-to-head match suite for any two variants and records the games played.
  • Repo runs an additional batch analysis step using the strongest model as a judge and records the judgments.
  • Reporting is done via Jupyter notebooks that read from SQLite.
  • A short report (3–6 pages or equivalent Markdown) describing protocol, results, and limitations.
  • README.md documents the setup, benchmark, and reporting steps and lists the configs used.

Day-by-day plan (Day 1–Day 14)

Day 1 (done): Two-engine match runner + SQLite storage

Tasks

  • Implement a Rust CLI that spawns two KataGo processes in GTP engine mode and plays head-to-head matches (a sketch follows this list)
  • Rotate colors and save each game as SGF
  • Store match metadata + SGF content in SQLite
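
A minimal sketch of this runner, using only the standard library. The `katago gtp -model <file> -config <file>` invocation is KataGo's real GTP mode; the file names, struct names, and the single half-move shown are illustrative, not a finished match loop:

```rust
use std::io::{BufRead, BufReader, Write};
use std::process::{Child, ChildStdin, ChildStdout, Command, Stdio};

struct GtpEngine {
    child: Child,
    stdin: ChildStdin,
    stdout: BufReader<ChildStdout>,
}

impl GtpEngine {
    fn spawn(model: &str, config: &str) -> std::io::Result<Self> {
        let mut child = Command::new("katago")
            .args(["gtp", "-model", model, "-config", config])
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .stderr(Stdio::piped()) // retain for per-engine logs
            .spawn()?;
        let stdin = child.stdin.take().unwrap();
        let stdout = BufReader::new(child.stdout.take().unwrap());
        Ok(Self { child, stdin, stdout })
    }

    /// Send one GTP command; a response ends with a blank line.
    fn send(&mut self, cmd: &str) -> std::io::Result<String> {
        writeln!(self.stdin, "{cmd}")?;
        self.stdin.flush()?;
        let mut response = String::new();
        loop {
            let mut line = String::new();
            if self.stdout.read_line(&mut line)? == 0 {
                break; // EOF: engine died mid-response
            }
            if line.trim().is_empty() && !response.is_empty() {
                break;
            }
            response.push_str(&line);
        }
        Ok(response)
    }
}

fn main() -> std::io::Result<()> {
    let mut a = GtpEngine::spawn("baseline-int4.bin.gz", "gtp_a.cfg")?;
    let mut b = GtpEngine::spawn("variant.bin.gz", "gtp_b.cfg")?;
    // One half-move of the match loop: A generates a move, B is informed.
    let reply = a.send("genmove b")?; // e.g. "= Q16"
    let coord = reply.trim().trim_start_matches('=').trim();
    b.send(&format!("play b {coord}"))?;
    for engine in [&mut a, &mut b] {
        engine.send("quit")?;
        engine.child.wait()?;
    }
    Ok(())
}
```

GTP framing keeps this simple: every response starts with `=` (success) or `?` (failure) and ends with a blank line, so the reader only accumulates lines until it sees one.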

Definition of Done

  • A match run produces games and stores rows in SQLite, and the SGFs can be extracted.

Log

  • SQLite DB (runs/matches/sgf)
  • Per-engine stdout/stderr logs

Day 2 (med): Harden match execution + begin self-eval capture

Tasks

  • Add strict per-command timeouts + clean process restart/exit behavior (a timeout sketch follows this list)
  • Add desync checks (illegal move, unexpected response shape, premature EOF)
  • Standardize match settings (rules/komi/board size/move cap/resign policy)
  • Add SQLite schema + minimal parser to capture engine self-evals (value + top-1/top-K policy) at move time
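
One way to get hard timeouts without an async runtime: do the blocking reads on a dedicated thread per engine and pull lines through an `mpsc` channel, which provides `recv_timeout`. A sketch under that design (the restart policy and full desync taxonomy are omitted):

```rust
use std::io::{BufRead, BufReader};
use std::process::ChildStdout;
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Move blocking reads onto a dedicated thread; the controller then
/// consumes lines through a channel and gains `recv_timeout`.
fn spawn_reader(stdout: ChildStdout) -> Receiver<String> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        for line in BufReader::new(stdout).lines() {
            match line {
                Ok(l) => {
                    if tx.send(l).is_err() {
                        break; // receiver dropped: engine being torn down
                    }
                }
                Err(_) => break, // read error / broken pipe => treat as desync
            }
        }
        // Thread exit drops `tx`, which the controller observes as EOF.
    });
    rx
}

/// Read one GTP response (terminated by a blank line) or fail loudly.
fn read_response(rx: &Receiver<String>, timeout: Duration) -> Result<String, String> {
    let mut response = String::new();
    loop {
        match rx.recv_timeout(timeout) {
            Ok(line) if line.trim().is_empty() && !response.is_empty() => {
                return Ok(response); // blank line closes the response
            }
            Ok(line) => {
                response.push_str(&line);
                response.push('\n');
            }
            Err(RecvTimeoutError::Timeout) => return Err("per-command timeout".into()),
            Err(RecvTimeoutError::Disconnected) => return Err("premature EOF".into()),
        }
    }
}
```

With this in place, desync checks reduce to cheap assertions on the drained lines: the first line of a response must start with `=` or `?`, and a `genmove` coordinate must be legal on the controller's own board state.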

Definition of Done

  • 100-game run completes without hanging; failures are recorded; SQLite contains non-empty self-eval rows linked to (game, ply, engine).

Log

  • SQLite: per-game termination reason; self-evals table populated
  • Engine logs retained for parse/debug

Day 3 (med): INT4 baseline + degradation variant model

Tasks

  • Set INT4 as baseline convention for Engine A
  • Add degradation knobs (trunk / policy+value heads / late layers) as first-class variant parameters for Engine B (modeled in the sketch after this list)
  • Ensure each game row stores the exact degradation parameters used
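
A sketch of how the knobs could be modeled so that exactly what ran is recoverable from SQLite alone. It assumes the `serde`/`serde_json` crates; every name and field here is a placeholder, not an existing schema:

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
enum ModuleTarget {
    Trunk,
    PolicyValueHeads,
    LateLayers,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
struct Variant {
    name: String,         // human-readable id, e.g. "trunk-3bit"
    target: ModuleTarget, // which part of the net is degraded
    simulated_bits: u8,   // effective bit width being simulated
    severity: f64,        // extra knob if degradation is continuous
}

/// Serialize once and store the JSON string on every game row, so the
/// exact parameters stay recoverable without consulting config files.
fn variant_json(v: &Variant) -> serde_json::Result<String> {
    serde_json::to_string(v)
}
```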

Definition of Done

  • Two runs differing only in degradation knobs produce distinct, correctly recorded variant parameters in SQLite.

Log

  • SQLite: variants (module targets + severity) stored per game or per run
  • Config snapshots (or resolved parameters) stored in SQLite

Day 4 (med): “Run” grouping + metadata completeness

Tasks

  • Introduce a runs table (or equivalent) to group games under one invocation; a schema sketch follows this list
  • Store run-level metadata: timestamp, CLI args, engine names, model id/hash, key match settings, variant ids
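
A possible shape for this schema, assuming the `rusqlite` crate; the table and column names are suggestions, not an existing layout:

```rust
use rusqlite::{Connection, Result};

fn init_schema(conn: &Connection) -> Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS runs (
             run_id      INTEGER PRIMARY KEY,
             started_at  TEXT NOT NULL,      -- ISO-8601 timestamp
             cli_args    TEXT NOT NULL,      -- full argv, space-joined
             engine_a    TEXT NOT NULL,      -- name + model id/hash
             engine_b    TEXT NOT NULL,
             settings    TEXT NOT NULL,      -- rules/komi/board/move cap, JSON
             variant     TEXT NOT NULL       -- serialized variant parameters
         );
         CREATE TABLE IF NOT EXISTS matches (
             match_id    INTEGER PRIMARY KEY,
             run_id      INTEGER NOT NULL REFERENCES runs(run_id),
             black       TEXT NOT NULL,      -- 'A' or 'B' (color swap)
             result      TEXT,               -- e.g. 'B+R', 'W+2.5'
             termination TEXT,               -- normal/resign/timeout/desync
             sgf         TEXT NOT NULL
         );",
    )
}
```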

Definition of Done

  • A single CLI run creates 1 run record + N linked match records; all queryable.

Log

  • SQLite: runs, matches, linkage via run_id

Day 5 (high): Ablation sweep runner (degradation ladder)

Tasks

  • Add CLI support to run a sweep over degradation levels (a named list of variants); a loop sketch follows this list
  • Ensure consistent pairing and color-swap across the ladder
  • Keep self-eval capture enabled across sweep runs
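
A sketch of the sweep loop with paired, color-swapped games. `Variant`, `create_run`, `play_game`, and `record_game` are hypothetical stand-ins for the pieces built on Days 1–4:

```rust
// Hypothetical stand-ins for the Day 1-4 pieces; bodies are stubbed.
struct Variant {
    name: String, // variant id, stored on the runs row
}
struct GameRecord {
    sgf: String,
    result: String,
}

fn create_run(_variant: &Variant) -> i64 {
    0 // would insert a runs row and return its run_id
}
fn play_game(_variant: &Variant, _a_is_black: bool) -> GameRecord {
    GameRecord { sgf: String::new(), result: String::new() } // Day 1 runner
}
fn record_game(_run_id: i64, _pair: usize, _a_is_black: bool, _game: &GameRecord) {}

/// One run per ladder point; each "pair" is the same condition played
/// twice with colors swapped, with self-eval capture left untouched.
fn run_sweep(ladder: &[Variant], games_per_point: usize) {
    for variant in ladder {
        let run_id = create_run(variant);
        for pair in 0..games_per_point / 2 {
            for &a_is_black in &[true, false] {
                let game = play_game(variant, a_is_black);
                record_game(run_id, pair, a_is_black, &game);
            }
        }
    }
}
```

Playing each condition twice with colors swapped removes any residual color asymmetry from the per-point winrate.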

Definition of Done

  • One sweep command runs multiple degradation points and records all games + self-evals in SQLite with correct variant labeling.

Log

  • SQLite: sweep definition stored (serialized) + variant ids per game
  • Engine logs for each run/sweep

Day 6 (high): Judge pipeline skeleton (strongest model as judge)

Tasks

  • Add a batch analysis step that replays SGFs and runs judge analysis using the strongest model (a sketch follows this list)
  • Store judge outputs in SQLite (per game, optionally per move)
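
For the judge, KataGo's JSON analysis engine (`katago analysis -config <cfg> -model <model>`) is a better fit than GTP: it takes one JSON query per line on stdin and emits one JSON result per analyzed turn. A sketch of a single judge query; the fields follow KataGo's documented query format, but verify against the analysis-engine docs, and the file names are placeholders:

```rust
use serde_json::json;
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // The strongest model judges via one long-lived analysis process.
    let mut judge = Command::new("katago")
        .args(["analysis", "-config", "analysis.cfg", "-model", "strongest.bin.gz"])
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    let mut stdin = judge.stdin.take().unwrap();
    let stdout = BufReader::new(judge.stdout.take().unwrap());

    // One query per game, with the move list parsed out of the stored SGF.
    let query = json!({
        "id": "match-42",                     // echo key; use the match_id
        "moves": [["B", "Q16"], ["W", "D4"]], // from the SGF
        "rules": "tromp-taylor",
        "komi": 7.5,
        "boardXSize": 19,
        "boardYSize": 19,
        "analyzeTurns": [0, 1, 2]             // judge every ply
    });
    writeln!(stdin, "{query}")?;
    drop(stdin); // EOF lets katago drain the queue and exit

    // One JSON line per analyzed turn; pull winrate/scoreLead out of
    // rootInfo and insert into the judge_* tables keyed by match_id.
    for line in stdout.lines() {
        let resp: serde_json::Value = serde_json::from_str(&line?).unwrap();
        println!("turn {} winrate {}", resp["turnNumber"], resp["rootInfo"]["winrate"]);
    }
    judge.wait()?;
    Ok(())
}
```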

Definition of Done

  • Judge step processes a small subset of games and writes judge rows linked to match ids.

Log

  • SQLite: judge_* tables linked to match_id
  • Judge engine logs

Day 7 (med): Judge + self-eval metrics (summary features)

Tasks

  • Define 2–3 judge-derived summary metrics per game (e.g., average per-move winrate loss, blunder count, final eval)
  • Define 2–3 self-eval-derived metrics per game (e.g., avg self winrate, entropy of policy top-K)
  • Add “comparison” metrics (e.g., judge winrate − self winrate, correlation vs degradation); a query sketch follows this list
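
A sketch of the one-row-per-game query, assuming `rusqlite` plus hypothetical `judge_moves` and `self_evals` tables keyed by `(match_id, ply)`; the column names and the blunder threshold are placeholders to adapt to the real schema:

```rust
use rusqlite::{Connection, Result};

fn game_summaries(conn: &Connection) -> Result<()> {
    let mut stmt = conn.prepare(
        "SELECT m.match_id,
                AVG(j.winrate)                  AS judge_avg_winrate,
                SUM(j.winrate_drop > 0.15)      AS blunder_count,  -- placeholder threshold
                AVG(s.winrate)                  AS self_avg_winrate,
                AVG(j.winrate) - AVG(s.winrate) AS judge_minus_self
         FROM matches m
         JOIN judge_moves j ON j.match_id = m.match_id
         JOIN self_evals  s ON s.match_id = m.match_id AND s.ply = j.ply
         GROUP BY m.match_id",
    )?;
    let mut rows = stmt.query([])?;
    while let Some(row) = rows.next()? {
        let match_id: i64 = row.get(0)?;
        let judge_wr: f64 = row.get(1)?;
        let blunders: i64 = row.get(2)?;
        println!("match {match_id}: judge_wr={judge_wr:.3} blunders={blunders}");
    }
    Ok(())
}
```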

Definition of Done

  • A notebook or SQL query returns one row per game with judge metrics + self-eval metrics + comparison metrics.

Log

  • SQLite: derived/summary tables or views (documented)

Day 8 (high): Primary data generation (coarse ladder)

Tasks

  • Run a coarse degradation ladder covering “very strong → amateur-ish”
  • Ensure sufficient games per ladder point for visible trends (see the sizing note after this list)
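
A back-of-envelope way to size “sufficient”: the measured winrate at each ladder point is a binomial proportion over n games, so its standard error is

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}
```

At the worst case (p = 0.5) with n = 100 games, SE = 0.05, i.e. a 95% interval of roughly ±10 percentage points; adjacent ladder points need expected winrate gaps larger than that to separate visibly.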

Definition of Done

  • SQLite contains a complete coarse ladder dataset: runs + matches + SGFs + self-evals.

Log

  • SQLite: run ids for coarse ladder
  • Engine logs retained

Day 9 (high): Refine ladder near the strength cliff

Tasks

  • Identify degradation range with rapid strength drop
  • Run denser ladder in that region

Definition of Done

  • Refined ladder dataset exists in SQLite and is clearly labeled by run ids.

Log

  • SQLite: run ids for refined ladder

Day 10 (med): Judge the full dataset (batch)

Tasks

  • Run judge analysis for all ladder games (or a clearly defined subset)
  • Record coverage and failures in SQLite (a coverage query sketch follows this list)
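
A sketch of the coverage query, assuming `rusqlite` and a hypothetical `judge_games` table keyed by `match_id`:

```rust
use rusqlite::{Connection, Result};

/// Fraction of matches that have at least one judge row.
/// (Returns an error if the matches table is empty, since the ratio is NULL.)
fn judge_coverage(conn: &Connection) -> Result<f64> {
    conn.query_row(
        "SELECT CAST(COUNT(DISTINCT j.match_id) AS REAL)
                / COUNT(DISTINCT m.match_id)
         FROM matches m
         LEFT JOIN judge_games j ON j.match_id = m.match_id",
        [],
        |row| row.get(0),
    )
}
```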

Definition of Done

  • Judge tables populated for the dataset used in reporting; coverage is queryable.

Log

  • SQLite: judge coverage (% games judged) query
  • Judge logs

Day 11 (med): Jupyter notebooks (queries + plots)

Tasks

  • Create notebooks that connect to SQLite and generate:
    • winrate vs degradation (by module target and severity)
    • judge metrics vs degradation
    • self-eval metrics vs degradation
    • judge vs self-eval deltas vs degradation

Definition of Done

  • Running the notebooks produces the main plots/tables used in the report.

Log

  • notebooks/*.ipynb committed
  • Notebook reads from SQLite only

Day 12 (low): Workflow documentation (navigation steps)

Tasks

  • Document the “how to run + how to analyze” steps (no reproducibility guarantee)
  • Include: where DB lives, which commands to run, which notebook(s) to open, which cells to execute

Definition of Done

  • README (or docs/workflow.md) guides setup → run → judge → notebook → report.

Log

  • README.md updates (or docs/workflow.md)

Day 13 (med): Short report draft (protocol + results + limitations)

Tasks

  • Write report (3–6 pages equivalent Markdown) summarizing:
    • degradation design (trunk/heads/late layers)
    • match protocol (paired games, color swap)
    • results (winrate + judge + self-eval comparisons)
    • limitations (variance, judge assumptions, non-reproducibility)

Definition of Done

  • Report includes final plots/tables and references the relevant run ids.

Log

  • report/report.md (or equivalent)

Day 14 (low): Final cleanup (usability + completeness)

Tasks

  • Ensure all variants used are listed clearly (names + parameters)
  • Add troubleshooting notes for common failures
  • Final pass on CLI help text and docs

Definition of Done

  • Clear end-to-end workflow exists: define variants → run matches → run judge → open notebook → read report.

Log

  • Updated README.md, docs/, notebooks/, report/

Daily update template (what I will tell you)

  • Day X
  • Energy: low/high/long/off
  • Yesterday: ...
  • Today constraints: ...
  • Blockers: ...
  • What I need: today’s tasks + DoD + what to log + next step