# Hypothesis Backtesting System: Design Spec

## Goal

Enable systematic, branch-per-hypothesis experimentation for scanner improvements. Each hypothesis runs its modified code daily in isolation, accumulates picks, and auto-concludes with a statistical comparison once enough data exists. Up to 5 experiments run in parallel, prioritized by expected impact, with full UI visibility.

---

## Architecture

```
docs/iterations/hypotheses/
  active.json              ← source of truth for all experiments
  concluded/
    YYYY-MM-DD-.md         ← one file per concluded hypothesis
.claude/commands/
  backtest-hypothesis.md   ← /backtest-hypothesis command
.github/workflows/
  hypothesis-runner.yml    ← daily 08:00 UTC, runs all active experiments
tradingagents/ui/pages/
  hypotheses.py            ← new Streamlit dashboard tab
```

The `active.json` file lives on `main`. Each hypothesis branch (`hypothesis/-`) contains the code change being tested. The daily runner checks out each branch, runs discovery, commits picks back to that branch, and, once `min_days` have elapsed, concludes the hypothesis and cleans up.

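
A minimal sketch of how the runner might read `active.json` and decide which branches to visit, assuming the field names from the schema in the next section (`status`, `branch`, `days_elapsed`, `min_days`). The helper names `load_registry`, `running_branches`, and `is_due` are hypothetical and not part of this spec; the checkout, discovery, and commit steps are deliberately left out.

```python
import json
from pathlib import Path

# Path taken from the architecture tree above.
ACTIVE_PATH = Path("docs/iterations/hypotheses/active.json")


def load_registry(path=ACTIVE_PATH):
    """Read active.json (the source of truth on main) into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def running_branches(registry):
    """Return (id, branch) pairs for entries the daily runner should visit."""
    return [
        (h["id"], h["branch"])
        for h in registry.get("hypotheses", [])
        if h.get("status") == "running"
    ]


def is_due(hypothesis):
    """True once enough days have elapsed for the conclusion analysis.

    Hypothetical helper: mirrors the `days_elapsed >= min_days` check.
    """
    return hypothesis["days_elapsed"] >= hypothesis["min_days"]
```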

---

## `active.json` Schema

```json
{
  "max_active": 5,
  "hypotheses": [
    {
      "id": "options_flow-scan-3-expirations",
      "scanner": "options_flow",
      "title": "Scan 3 expirations instead of 1",
      "description": "Hypothesis: scanning up to 3 expirations captures institutional positioning in 30+ DTE contracts, improving signal quality over nearest-expiry-only.",
      "branch": "hypothesis/options_flow-scan-3-expirations",
      "pr_number": 14,
      "status": "running",
      "priority": 8,
      "expected_impact": "high",
      "hypothesis_type": "implementation",
      "created_at": "2026-04-09",
      "min_days": 14,
      "days_elapsed": 3,
      "picks_log": ["2026-04-09", "2026-04-10", "2026-04-11"],
      "baseline_scanner": "options_flow",
      "conclusion": null
    }
  ]
}
```

**Field reference:**

| Field | Description |
|---|---|
| `id` | `-`: unique, used for branch and file names |
| `status` | `running` / `pending` / `concluded` |
| `priority` | 1–9 (higher = more important); determines queue order for `pending` hypotheses |
| `hypothesis_type` | `statistical` (answer from existing data) or `implementation` (requires branch + forward testing) |
| `min_days` | Minimum number of days of picks required before the conclusion analysis runs |
| `picks_log` | Dates when the runner collected picks on this branch |
| `conclusion` | `null` while running; `"accepted"` or `"rejected"` once concluded |

---

## `/backtest-hypothesis` Command

**Trigger:** `claude /backtest-hypothesis ""`

**Flow:**

1. **Classify** the hypothesis as `statistical` or `implementation`.
   - Statistical: answerable from existing `performance_database.json` data; no code change needed.
   - Implementation: requires a code change and a forward-testing period.
2. **Statistical path:** Run the analysis immediately against existing performance data. Write the conclusion to the relevant scanner domain file (`docs/iterations/scanners/.md`). Done; no branch is created.
3. **Implementation path:**
   a. Read `active.json`. If `running` count < 5, start immediately.
      If all 5 slots are occupied by running experiments, add the new hypothesis as `status: "pending"`. Running experiments are never interrupted, because pausing mid-experiment breaks the picks streak and invalidates the statistical comparison.
   b. Create branch `hypothesis/-` from `main`.
   c. Implement the minimal code change on the branch.
   d. Open a draft PR: title `hypothesis(): `, body describes the hypothesis, expected impact, and `min_days`.
   e. Write the new entry to `active.json` on `main` with `status: "running"` (or `"pending"` if at capacity).
   f. Print a summary: branch name, PR number, expected start date (if pending), expected conclusion date (if running).

**Pending → running promotion:** At the end of each daily runner cycle, after any experiments conclude, the runner checks for `pending` entries and promotes the highest-priority one to `running` if a slot has opened up.

**Priority scoring** (set at creation time):

| Factor | Score contribution |
|---|---|
| Scanner has poor 30d win rate (<40%) | +3 |
| Change is low-complexity (1 file, 1 parameter) | +2 |
| Hypothesis directly addresses a known weak spot in LEARNINGS.md | +2 |
| High daily pick volume from scanner (more data faster) | +1 |
| Evidence from external research (arXiv, Alpha Architect, etc.) | +1 |
| Conflicting evidence or uncertain direction | -2 |

The maximum score is 9 (all five positive factors, no penalty). Claude assigns this score and writes it to `active.json`.

---

## Daily Hypothesis Runner (`hypothesis-runner.yml`)

Runs at **08:00 UTC daily** (after iterate at 06:00 UTC).

**Per-hypothesis loop** (for each entry with `status: "running"`):

```
1. git checkout hypothesis/<id>
2. Run daily discovery pipeline (same as daily-discovery.yml)
3. Append today's date to picks_log
4. Commit picks update back to hypothesis branch
5. If days_elapsed >= min_days:
   a. Run statistical comparison vs baseline scanner (same scanner, main branch picks)
   b. Compute: win rate delta, avg return delta, pick volume delta, p-value if N >= 20
   c. Decision rule:
      - accepted if win rate delta > +5pp OR avg return delta > +1% (with p < 0.1 if N >= 20)
      - rejected otherwise
   d. Write concluded doc to docs/iterations/hypotheses/concluded/YYYY-MM-DD-<id>.md
   e. Update scanner domain file with finding
   f. Set status = "concluded", conclusion = "accepted"/"rejected" in active.json
   g. If accepted: merge PR into main
      If rejected: close PR without merging, delete hypothesis branch
   h. Push active.json update to main
```

**Capacity:** 5 experiments × ~2 min each = ~10 min max runtime. Workflow timeout: 60 minutes.

---

## Conclusion Document Format

`docs/iterations/hypotheses/concluded/YYYY-MM-DD-<id>.md`:

```markdown
# Hypothesis: <title>

**Scanner:** options_flow
**Branch:** hypothesis/options_flow-scan-3-expirations
**Period:** 2026-04-09 → 2026-04-23 (14 days)
**Outcome:** accepted ✅ / rejected ❌

## Hypothesis

<original description>

## Results

| Metric | Baseline | Experiment | Delta |
|---|---|---|---|
| 7d win rate | 42% | 53% | +11pp |
| 30d avg return | -2.9% | +0.8% | +3.7% |
| Picks/day | 1.2 | 1.8 | +0.6 |

## Decision

<1-2 sentences on why accepted/rejected>

## Action

<what was merged or discarded>
```

---

## Dashboard Tab (`tradingagents/ui/pages/hypotheses.py`)

New "Hypotheses" tab in the Streamlit dashboard.

**Active experiments table:**

| Hypothesis | Scanner | Status | Days | Picks | Expected Ready | Priority |
|---|---|---|---|---|---|---|
| Scan 3 expirations | options_flow | running | 3/14 | 4 | 2026-04-23 | 8 |
| ITM-only filter | options_flow | pending | 0/14 | 0 | waiting for slot | 5 |

**Concluded experiments table:**

| Hypothesis | Scanner | Outcome | Concluded | Win Rate Delta |
|---|---|---|---|---|
| Premium filter >$25K | options_flow | ✅ merged | 2026-04-01 | +9pp |
| Reddit DD confidence gate | reddit_dd | ❌ rejected | 2026-03-20 | -3pp |

Both tables read directly from `active.json` and the `concluded/` directory. There is no separate database.

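
As an illustration of how the active-experiments table could be derived from `active.json` alone, here is a hedged Python sketch. The function name `active_rows`, the `picks_by_id` argument (this spec does not say where the Picks count comes from), the row ordering, and the rule that Expected Ready equals `created_at` plus `min_days` are all assumptions; the last one merely agrees with the sample row above. The returned list of dicts is plain data that the Streamlit page could then render as a table.

```python
from datetime import date, timedelta


def active_rows(registry, picks_by_id=None):
    """Build one display row per running or pending hypothesis."""
    picks_by_id = picks_by_id or {}  # hypothetical: id -> number of picks so far
    rows = []
    for h in registry.get("hypotheses", []):
        if h["status"] not in ("running", "pending"):
            continue  # concluded entries belong to the other table
        if h["status"] == "running":
            # Assumption: ready date = created_at + min_days.
            start = date.fromisoformat(h["created_at"])
            ready = (start + timedelta(days=h["min_days"])).isoformat()
        else:
            ready = "waiting for slot"
        rows.append({
            "Hypothesis": h["title"],
            "Scanner": h["scanner"],
            "Status": h["status"],
            "Days": f"{h['days_elapsed']}/{h['min_days']}",
            "Picks": picks_by_id.get(h["id"], 0),
            "Expected Ready": ready,
            "Priority": h["priority"],
        })
    # Assumed ordering: running first, then highest priority first.
    rows.sort(key=lambda r: (r["Status"] != "running", -r["Priority"]))
    return rows
```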

---

## What Is Not In Scope

- Hypothesis branches do not interact with each other (no cross-branch comparison).
- No A/B testing within a single discovery run (too complex, not needed).
- No email/Slack notifications (rolling PRs in GitHub are the notification mechanism).
- No dedicated mechanism for overriding priority scoring (the score is set at creation; to change it, edit `active.json` directly).