TradingAgents/docs/superpowers/specs/2026-04-09-hypothesis-backt...

# Hypothesis Backtesting System — Design Spec

## Goal

Enable systematic, branch-per-hypothesis experimentation for scanner improvements. Each hypothesis runs its modified code daily in isolation, accumulates picks, and auto-concludes with a statistical comparison once enough data exists. Up to 5 experiments run in parallel, prioritized by expected impact, with full UI visibility.

---

## Architecture

```
docs/iterations/hypotheses/
  active.json                              ← source of truth for all experiments
  concluded/
    YYYY-MM-DD-<id>.md                     ← one file per concluded hypothesis

.claude/commands/
  backtest-hypothesis.md                   ← /backtest-hypothesis command

.github/workflows/
  hypothesis-runner.yml                    ← daily 08:00 UTC, runs all active experiments

tradingagents/ui/pages/
  hypotheses.py                            ← new Streamlit dashboard tab
```

The `active.json` file lives on `main`. Each hypothesis branch (`hypothesis/<scanner>-<slug>`) contains the code change being tested. The daily runner checks out each branch, runs discovery, commits picks back to that branch, and — once `min_days` have elapsed — concludes the hypothesis and cleans up.

---

## `active.json` Schema

```json
{
  "max_active": 5,
  "hypotheses": [
    {
      "id": "options_flow-scan-3-expirations",
      "scanner": "options_flow",
      "title": "Scan 3 expirations instead of 1",
      "description": "Hypothesis: scanning up to 3 expirations captures institutional positioning in 30+ DTE contracts, improving signal quality over nearest-expiry-only.",
      "branch": "hypothesis/options_flow-scan-3-expirations",
      "pr_number": 14,
      "status": "running",
      "priority": 8,
      "expected_impact": "high",
      "hypothesis_type": "implementation",
      "created_at": "2026-04-09",
      "min_days": 14,
      "days_elapsed": 3,
      "picks_log": ["2026-04-09", "2026-04-10", "2026-04-11"],
      "baseline_scanner": "options_flow",
      "conclusion": null
    }
  ]
}
```

**Field reference:**

| Field | Description |
|---|---|
| `id` | `<scanner>-<slug>` — unique, used for branch and file names |
| `status` | `running` / `pending` / `concluded` |
| `priority` | 1–9 (higher = more important); determines queue order for `pending` hypotheses |
| `hypothesis_type` | `statistical` (answer from existing data) or `implementation` (requires branch + forward testing) |
| `min_days` | Minimum picks days before conclusion analysis runs |
| `picks_log` | Dates when the runner collected picks on this branch |
| `conclusion` | `null` while running; `"accepted"` or `"rejected"` once concluded |

---

## `/backtest-hypothesis` Command

**Trigger:** `claude /backtest-hypothesis "<description>"`

**Flow:**

1. **Classify** the hypothesis as `statistical` or `implementation`.
   - Statistical: answerable from existing `performance_database.json` data — no code change needed.
   - Implementation: requires a code change and forward-testing period.

2. **Statistical path:** Run the analysis immediately against existing performance data. Write conclusion to the relevant scanner domain file (`docs/iterations/scanners/<scanner>.md`). Done — no branch created.

3. **Implementation path:**
   a. Read `active.json`. If `running` count < 5, start immediately. If all 5 slots are occupied by running experiments, add the new hypothesis as `status: "pending"` — running experiments are never interrupted (pausing mid-experiment breaks the picks streak and invalidates the statistical comparison).
   b. Create branch `hypothesis/<scanner>-<slug>` from `main`.
   c. Implement the minimal code change on the branch.
   d. Open a draft PR: title `hypothesis(<scanner>): <title>`, body describes the hypothesis, expected impact, and `min_days`.
   e. Write new entry to `active.json` on `main` with `status: "running"` (or `"pending"` if at capacity).
   f. Print summary: branch name, PR number, expected start date (if pending), expected conclusion date (if running).

**Pending → running promotion:** At the end of each daily runner cycle, after any experiments conclude, the runner checks for `pending` entries and promotes the highest-priority one to `running` if a slot opened up.

**Priority scoring** (set at creation time):

| Factor | Score contribution |
|---|---|
| Scanner has poor 30d win rate (<40%) | +3 |
| Change is low-complexity (1 file, 1 parameter) | +2 |
| Hypothesis directly addresses a known weak spot in LEARNINGS.md | +2 |
| High daily pick volume from scanner (more data faster) | +1 |
| Evidence from external research (arXiv, Alpha Architect, etc.) | +1 |
| Conflicting evidence or uncertain direction | -2 |

Max score 9. Claude assigns this score and writes it to `active.json`.

---

## Daily Hypothesis Runner (`hypothesis-runner.yml`)

Runs at **08:00 UTC daily** (after iterate at 06:00 UTC).

**Per-hypothesis loop** (for each entry with `status: "running"`):

```
1. git checkout hypothesis/<id>
2. Run daily discovery pipeline (same as daily-discovery.yml)
3. Append today's date to picks_log
4. Commit picks update back to hypothesis branch
5. If days_elapsed >= min_days:
   a. Run statistical comparison vs baseline scanner (same scanner, main branch picks)
   b. Compute: win rate delta, avg return delta, pick volume delta, p-value if N >= 20
   c. Decision rule:
      - accepted if win rate delta > +5pp OR avg return delta > +1% (with p < 0.1 if N >= 20)
      - rejected otherwise
   d. Write concluded doc to docs/iterations/hypotheses/concluded/YYYY-MM-DD-<id>.md
   e. Update scanner domain file with finding
   f. Set status = "concluded", conclusion = "accepted"/"rejected" in active.json
   g. If accepted: merge PR into main
      If rejected: close PR without merging, delete hypothesis branch
   h. Push active.json update to main
```

**Capacity:** 5 experiments × ~2 min each = ~10 min max runtime. Workflow timeout: 60 minutes.

---

## Conclusion Document Format

`docs/iterations/hypotheses/concluded/YYYY-MM-DD-<id>.md`:

```markdown
# Hypothesis: <title>

**Scanner:** options_flow
**Branch:** hypothesis/options_flow-scan-3-expirations
**Period:** 2026-04-09 → 2026-04-23 (14 days)
**Outcome:** accepted ✅ / rejected ❌

## Hypothesis
<original description>

## Results

| Metric | Baseline | Experiment | Delta |
|---|---|---|---|
| 7d win rate | 42% | 53% | +11pp |
| 30d avg return | -2.9% | +0.8% | +3.7% |
| Picks/day | 1.2 | 1.8 | +0.6 |

## Decision
<1-2 sentences on why accepted/rejected>

## Action
<what was merged or discarded>
```

---

## Dashboard Tab (`tradingagents/ui/pages/hypotheses.py`)

New "Hypotheses" tab in the Streamlit dashboard.

**Active experiments table:**

| Hypothesis | Scanner | Status | Days | Picks | Expected Ready | Priority |
|---|---|---|---|---|---|---|
| Scan 3 expirations | options_flow | running | 3/14 | 4 | 2026-04-23 | 8 |
| ITM-only filter | options_flow | pending | 0/14 | 0 | waiting for slot | 5 |

**Concluded experiments table:**

| Hypothesis | Scanner | Outcome | Concluded | Win Rate Delta |
|---|---|---|---|---|
| Premium filter >$25K | options_flow | ✅ merged | 2026-04-01 | +9pp |
| Reddit DD confidence gate | reddit_dd | ❌ rejected | 2026-03-20 | -3pp |

Both tables read directly from `active.json` and the `concluded/` directory. No separate database.

---

## What Is Not In Scope

- Hypothesis branches do not interact with each other (no cross-branch comparison)
- No A/B testing within a single discovery run (too complex, not needed)
- No email/Slack notifications (rolling PRs in GitHub are the notification mechanism)
- No manual override of priority scoring (set at creation, editable directly in `active.json`)