# Hypothesis Backtesting System — Design Spec
## Goal
Enable systematic, branch-per-hypothesis experimentation for scanner improvements. Each hypothesis runs its modified code daily in isolation, accumulates picks, and auto-concludes with a statistical comparison once enough data exists. Up to 5 experiments run in parallel, prioritized by expected impact, with full UI visibility.
---
## Architecture
```
docs/iterations/hypotheses/
  active.json                ← source of truth for all experiments
  concluded/
    YYYY-MM-DD-<id>.md       ← one file per concluded hypothesis
.claude/commands/
  backtest-hypothesis.md     ← /backtest-hypothesis command
.github/workflows/
  hypothesis-runner.yml      ← daily 08:00 UTC, runs all active experiments
tradingagents/ui/pages/
  hypotheses.py              ← new Streamlit dashboard tab
```
The `active.json` file lives on `main`. Each hypothesis branch (`hypothesis/<scanner>-<slug>`) contains the code change being tested. The daily runner checks out each branch, runs discovery, commits picks back to that branch, and — once `min_days` have elapsed — concludes the hypothesis and cleans up.
---
## `active.json` Schema
```json
{
  "max_active": 5,
  "hypotheses": [
    {
      "id": "options_flow-scan-3-expirations",
      "scanner": "options_flow",
      "title": "Scan 3 expirations instead of 1",
      "description": "Hypothesis: scanning up to 3 expirations captures institutional positioning in 30+ DTE contracts, improving signal quality over nearest-expiry-only.",
      "branch": "hypothesis/options_flow-scan-3-expirations",
      "pr_number": 14,
      "status": "running",
      "priority": 8,
      "expected_impact": "high",
      "hypothesis_type": "implementation",
      "created_at": "2026-04-09",
      "min_days": 14,
      "days_elapsed": 3,
      "picks_log": ["2026-04-09", "2026-04-10", "2026-04-11"],
      "baseline_scanner": "options_flow",
      "conclusion": null
    }
  ]
}
```
**Field reference:**
| Field | Description |
|---|---|
| `id` | `<scanner>-<slug>` — unique, used for branch and file names |
| `status` | `running` / `pending` / `concluded` |
| `priority` | 1 to 9 (higher = more important); determines queue order for `pending` hypotheses |
| `hypothesis_type` | `statistical` (answer from existing data) or `implementation` (requires branch + forward testing) |
| `min_days` | Minimum number of days of collected picks before the conclusion analysis runs |
| `picks_log` | Dates when the runner collected picks on this branch |
| `conclusion` | `null` while running; `"accepted"` or `"rejected"` once concluded |
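The constraints in the field reference can be checked mechanically before the runner trusts the file. The following is a minimal sketch, not part of the spec: the function name `validate_active` and the choice of which fields count as required are assumptions made for illustration.

```python
# Hypothetical validator for active.json; names and required-field set are assumptions.
REQUIRED_FIELDS = {
    "id", "scanner", "status", "priority",
    "hypothesis_type", "min_days", "picks_log", "conclusion",
}
VALID_STATUSES = {"running", "pending", "concluded"}
VALID_CONCLUSIONS = {"accepted", "rejected"}


def validate_active(doc: dict) -> list[str]:
    """Return a list of problems found; an empty list means no problems were detected."""
    problems: list[str] = []
    hypotheses = doc.get("hypotheses", [])
    max_active = doc.get("max_active", 5)

    # The number of running experiments must not exceed max_active.
    running = sum(1 for h in hypotheses if h.get("status") == "running")
    if running > max_active:
        problems.append(f"{running} running hypotheses exceed max_active={max_active}")

    seen_ids: set[str] = set()
    for h in hypotheses:
        hid = h.get("id", "<missing id>")
        missing = REQUIRED_FIELDS - h.keys()
        if missing:
            problems.append(f"{hid}: missing fields {sorted(missing)}")
        if hid in seen_ids:
            problems.append(f"{hid}: duplicate id")
        seen_ids.add(hid)

        status = h.get("status")
        if status not in VALID_STATUSES:
            problems.append(f"{hid}: invalid status {status!r}")

        # conclusion is null until the hypothesis is concluded.
        conclusion = h.get("conclusion")
        if status == "concluded" and conclusion not in VALID_CONCLUSIONS:
            problems.append(f"{hid}: concluded without accepted/rejected")
        if status != "concluded" and conclusion is not None:
            problems.append(f"{hid}: conclusion set before status is concluded")
    return problems
```

A caller would load the file with `json.load` and refuse to proceed if the returned list is non-empty.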
---
## `/backtest-hypothesis` Command
**Trigger:** `claude /backtest-hypothesis "<description>"`
**Flow:**
1. **Classify** the hypothesis as `statistical` or `implementation`.
   - Statistical: answerable from existing `performance_database.json` data; no code change needed.
   - Implementation: requires a code change and forward-testing period.
2. **Statistical path:** Run the analysis immediately against existing performance data. Write conclusion to the relevant scanner domain file (`docs/iterations/scanners/<scanner>.md`). Done — no branch created.
3. **Implementation path:**
   a. Read `active.json`. If fewer than 5 hypotheses are `running`, start immediately. If all 5 slots are occupied by running experiments, add the new hypothesis with `status: "pending"`. Running experiments are never interrupted, because pausing mid-experiment breaks the picks streak and invalidates the statistical comparison.
   b. Create branch `hypothesis/<scanner>-<slug>` from `main`.
   c. Implement the minimal code change on the branch.
   d. Open a draft PR: title `hypothesis(<scanner>): <title>`, body describes the hypothesis, expected impact, and `min_days`.
   e. Write the new entry to `active.json` on `main` with `status: "running"` (or `"pending"` if at capacity).
   f. Print summary: branch name, PR number, expected start date (if pending), expected conclusion date (if running).
**Pending → running promotion:** At the end of each daily runner cycle, after any experiments conclude, the runner checks for `pending` entries and promotes the highest-priority one to `running` if a slot opened up.
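The slot check at creation time and the end-of-cycle promotion can be sketched as two small functions over the parsed `active.json`. This is an illustrative sketch only: the function names are hypothetical, and breaking priority ties by the earlier `created_at` is an assumption, since the spec does not say how ties are resolved.

```python
from typing import Optional


def initial_status(doc: dict) -> str:
    """Status for a newly created hypothesis: running if a slot is free, else pending."""
    running = sum(1 for h in doc["hypotheses"] if h["status"] == "running")
    return "running" if running < doc.get("max_active", 5) else "pending"


def promote_pending(doc: dict) -> Optional[str]:
    """Promote the highest-priority pending hypothesis if a slot is free.

    Mutates doc in place and returns the promoted id, or None if nothing changed.
    """
    if initial_status(doc) != "running":
        return None  # all slots are still occupied
    pending = [h for h in doc["hypotheses"] if h["status"] == "pending"]
    if not pending:
        return None
    # Highest priority first; ties go to the older entry (assumption, not in the spec).
    best = min(pending, key=lambda h: (-h["priority"], h.get("created_at", "")))
    best["status"] = "running"
    return best["id"]
```

As written, one call promotes at most one hypothesis, matching the description above. If several experiments conclude on the same day, the runner would need to call it once per freed slot.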
**Priority scoring** (set at creation time):
| Factor | Score contribution |
|---|---|
| Scanner has poor 30d win rate (<40%) | +3 |
| Change is low-complexity (1 file, 1 parameter) | +2 |
| Hypothesis directly addresses a known weak spot in LEARNINGS.md | +2 |
| High daily pick volume from scanner (more data faster) | +1 |
| Evidence from external research (arXiv, Alpha Architect, etc.) | +1 |
| Conflicting evidence or uncertain direction | -2 |
Max score 9. Claude assigns this score and writes it to `active.json`.
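The scoring table is a weighted sum of yes/no factors, which the sketch below makes explicit. The `PriorityFactors` dataclass and its field names are hypothetical; only the weights come from the table.

```python
from dataclasses import dataclass

# Weights copied from the priority scoring table; keys match PriorityFactors fields.
WEIGHTS = {
    "poor_30d_win_rate": 3,          # scanner 30d win rate below 40%
    "low_complexity": 2,             # 1 file, 1 parameter
    "addresses_known_weak_spot": 2,  # weak spot listed in LEARNINGS.md
    "high_pick_volume": 1,           # more data faster
    "external_evidence": 1,          # external research supports the change
    "conflicting_evidence": -2,      # conflicting evidence or uncertain direction
}


@dataclass
class PriorityFactors:
    poor_30d_win_rate: bool = False
    low_complexity: bool = False
    addresses_known_weak_spot: bool = False
    high_pick_volume: bool = False
    external_evidence: bool = False
    conflicting_evidence: bool = False


def priority_score(factors: PriorityFactors) -> int:
    """Sum the weights of every factor that applies."""
    return sum(w for name, w in WEIGHTS.items() if getattr(factors, name))
```

With every positive factor set and no conflicting evidence, the sum is 3 + 2 + 2 + 1 + 1 = 9, which matches the stated maximum. The raw sum can be 0 or negative; the spec does not say how such sums are handled, so this sketch returns them unchanged.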
---
## Daily Hypothesis Runner (`hypothesis-runner.yml`)
Runs at **08:00 UTC daily** (after iterate at 06:00 UTC).
**Per-hypothesis loop** (for each entry with `status: "running"`):
```
1. git checkout hypothesis/<id>
2. Run daily discovery pipeline (same as daily-discovery.yml)
3. Append today's date to picks_log
4. Commit picks update back to hypothesis branch
5. If days_elapsed >= min_days:
   a. Run statistical comparison vs baseline scanner (same scanner, main branch picks)
   b. Compute: win rate delta, avg return delta, pick volume delta, p-value if N >= 20
   c. Decision rule:
      - accepted if win rate delta > +5pp OR avg return delta > +1% (with p < 0.1 if N >= 20)
      - rejected otherwise
   d. Write concluded doc to docs/iterations/hypotheses/concluded/YYYY-MM-DD-<id>.md
   e. Update scanner domain file with finding
   f. Set status = "concluded", conclusion = "accepted"/"rejected" in active.json
   g. If accepted: merge PR into main
      If rejected: close PR without merging, delete hypothesis branch
   h. Push active.json update to main
```
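The decision rule in step 5c can be written as a pure function, which makes it easy to test separately from the git and pipeline steps. This sketch rests on one reading of the rule, stated here as an assumption: the `p < 0.1` requirement applies to the whole "win rate OR avg return" condition, and only when `N >= 20`. The function name and parameter names are hypothetical.

```python
from typing import Optional


def decide(
    win_rate_delta_pp: float,
    avg_return_delta_pct: float,
    n_picks: int,
    p_value: Optional[float] = None,
) -> str:
    """Return "accepted" or "rejected" following the runner's decision rule."""
    # Either improvement threshold is enough on its own.
    improved = win_rate_delta_pp > 5.0 or avg_return_delta_pct > 1.0
    if not improved:
        return "rejected"
    # With enough picks, the improvement must also be significant (assumed reading).
    if n_picks >= 20 and (p_value is None or p_value >= 0.1):
        return "rejected"
    return "accepted"
```

For example, `decide(11, 3.7, n_picks=12)` accepts without a p-value because fewer than 20 picks exist, while the same deltas with `n_picks=25` and `p_value=0.3` are rejected. Which statistical test produces the p-value is not specified in this document.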
**Capacity:** 5 experiments × ~2 min each = ~10 min max runtime. Workflow timeout: 60 minutes.
---
## Conclusion Document Format
`docs/iterations/hypotheses/concluded/YYYY-MM-DD-<id>.md`:
```markdown
# Hypothesis: <title>
**Scanner:** options_flow
**Branch:** hypothesis/options_flow-scan-3-expirations
**Period:** 2026-04-09 → 2026-04-23 (14 days)
**Outcome:** accepted ✅ / rejected ❌
## Hypothesis
<original description>
## Results
| Metric | Baseline | Experiment | Delta |
|---|---|---|---|
| 7d win rate | 42% | 53% | +11pp |
| 30d avg return | -2.9% | +0.8% | +3.7% |
| Picks/day | 1.2 | 1.8 | +0.6 |
## Decision
<1-2 sentences on why accepted/rejected>
## Action
<what was merged or discarded>
```
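The file name and the results table follow fixed patterns, so the runner could generate them with small helpers. The sketch below is illustrative: `conclusion_path` and `results_row` are hypothetical names, and the number formatting is an assumption rather than something the template prescribes.

```python
from datetime import date
from pathlib import PurePosixPath

CONCLUDED_DIR = PurePosixPath("docs/iterations/hypotheses/concluded")


def conclusion_path(concluded_on: date, hypothesis_id: str) -> PurePosixPath:
    """Build concluded/YYYY-MM-DD-<id>.md for a hypothesis."""
    return CONCLUDED_DIR / f"{concluded_on.isoformat()}-{hypothesis_id}.md"


def results_row(metric: str, baseline: float, experiment: float,
                unit: str = "", delta_unit: str = "") -> str:
    """Format one Markdown row of the Results table, with a signed delta."""
    delta = experiment - baseline
    return (f"| {metric} | {baseline:g}{unit} | {experiment:g}{unit} "
            f"| {delta:+g}{delta_unit} |")
```

For the example above, `results_row("7d win rate", 42, 53, "%", "pp")` yields the win-rate row of the template.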
---
## Dashboard Tab (`tradingagents/ui/pages/hypotheses.py`)
New "Hypotheses" tab in the Streamlit dashboard.
**Active experiments table:**
| Hypothesis | Scanner | Status | Days | Picks | Expected Ready | Priority |
|---|---|---|---|---|---|---|
| Scan 3 expirations | options_flow | running | 3/14 | 4 | 2026-04-23 | 8 |
| ITM-only filter | options_flow | pending | 0/14 | 0 | waiting for slot | 5 |
**Concluded experiments table:**
| Hypothesis | Scanner | Outcome | Concluded | Win Rate Delta |
|---|---|---|---|---|
| Premium filter >$25K | options_flow | ✅ merged | 2026-04-01 | +9pp |
| Reddit DD confidence gate | reddit_dd | ❌ rejected | 2026-03-20 | -3pp |
Both tables read directly from `active.json` and the `concluded/` directory. No separate database.
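Building the active experiments table from `active.json` needs no Streamlit-specific logic. A minimal sketch follows, under stated assumptions: the function name `active_rows` is hypothetical, "Expected Ready" is taken as `created_at + min_days` (which reproduces 2026-04-23 for the example entry, but ignores days on which the runner failed), and the Picks column is omitted because the schema shown here does not store pick counts.

```python
from datetime import date, timedelta


def active_rows(doc: dict) -> list[dict]:
    """Rows for the active experiments table: running first, then by priority."""
    rows = []
    for h in doc["hypotheses"]:
        if h["status"] not in ("running", "pending"):
            continue  # concluded hypotheses belong to the other table
        if h["status"] == "running":
            start = date.fromisoformat(h["created_at"])
            ready = (start + timedelta(days=h["min_days"])).isoformat()
        else:
            ready = "waiting for slot"
        rows.append({
            "Hypothesis": h["title"],
            "Scanner": h["scanner"],
            "Status": h["status"],
            "Days": f'{h.get("days_elapsed", 0)}/{h["min_days"]}',
            "Expected Ready": ready,
            "Priority": h["priority"],
        })
    rows.sort(key=lambda r: (r["Status"] != "running", -r["Priority"]))
    return rows
```

The resulting list of dicts can be passed to a table widget such as Streamlit's `st.dataframe`.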
---
## What Is Not In Scope
- Hypothesis branches do not interact with each other (no cross-branch comparison)
- No A/B testing within a single discovery run (too complex, not needed)
- No email/Slack notifications (rolling PRs in GitHub are the notification mechanism)
- No dedicated mechanism for overriding priority scoring (the score is set at creation; to change it, edit `active.json` directly)