Hypothesis Backtesting System — Design Spec

Goal

Enable systematic, branch-per-hypothesis experimentation for scanner improvements. Each hypothesis runs its modified code daily in isolation, accumulates picks, and auto-concludes with a statistical comparison once enough data exists. Up to 5 experiments run in parallel, prioritized by expected impact, with full UI visibility.


Architecture

docs/iterations/hypotheses/
  active.json                              ← source of truth for all experiments
  concluded/
    YYYY-MM-DD-<id>.md                     ← one file per concluded hypothesis

.claude/commands/
  backtest-hypothesis.md                   ← /backtest-hypothesis command

.github/workflows/
  hypothesis-runner.yml                    ← daily 08:00 UTC, runs all active experiments

tradingagents/ui/pages/
  hypotheses.py                            ← new Streamlit dashboard tab

The active.json file lives on main. Each hypothesis branch (hypothesis/<scanner>-<slug>) contains the code change being tested. The daily runner checks out each branch, runs discovery, commits picks back to that branch, and — once min_days have elapsed — concludes the hypothesis and cleans up.


active.json Schema

{
  "max_active": 5,
  "hypotheses": [
    {
      "id": "options_flow-scan-3-expirations",
      "scanner": "options_flow",
      "title": "Scan 3 expirations instead of 1",
      "description": "Hypothesis: scanning up to 3 expirations captures institutional positioning in 30+ DTE contracts, improving signal quality over nearest-expiry-only.",
      "branch": "hypothesis/options_flow-scan-3-expirations",
      "pr_number": 14,
      "status": "running",
      "priority": 8,
      "expected_impact": "high",
      "hypothesis_type": "implementation",
      "created_at": "2026-04-09",
      "min_days": 14,
      "days_elapsed": 3,
      "picks_log": ["2026-04-09", "2026-04-10", "2026-04-11"],
      "baseline_scanner": "options_flow",
      "conclusion": null
    }
  ]
}

Field reference:

| Field | Description |
|---|---|
| id | <scanner>-<slug>; unique, used for branch and file names |
| status | running / pending / concluded |
| priority | 1-9 (higher = more important); determines queue order for pending hypotheses |
| hypothesis_type | statistical (answerable from existing data) or implementation (requires a branch and forward testing) |
| min_days | Minimum number of days of collected picks before the conclusion analysis runs |
| picks_log | Dates on which the runner collected picks on this branch |
| conclusion | null while running; "accepted" or "rejected" once concluded |
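
The schema and field reference above imply a few invariants that a consistency check could enforce before the runner touches anything. The following is a minimal sketch under stated assumptions: `validate_registry` and the `ALLOWED_*` constants are hypothetical names that do not appear in the spec, and the checks are only those that follow directly from the fields described here.

```python
import json
from collections import Counter

# Allowed values taken from the field reference; the constant names are hypothetical.
ALLOWED_STATUS = {"running", "pending", "concluded"}
ALLOWED_TYPES = {"statistical", "implementation"}
ALLOWED_CONCLUSIONS = {None, "accepted", "rejected"}


def validate_registry(registry: dict) -> list:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    hypotheses = registry["hypotheses"]

    # ids must be unique because they name branches and concluded files.
    counts = Counter(h["id"] for h in hypotheses)
    problems += [f"{hid}: duplicate id" for hid, n in counts.items() if n > 1]

    for h in hypotheses:
        hid = h["id"]
        if not hid.startswith(h["scanner"] + "-"):
            problems.append(f"{hid}: id does not start with the scanner name")
        if h["hypothesis_type"] not in ALLOWED_TYPES:
            problems.append(f"{hid}: unknown hypothesis_type {h['hypothesis_type']!r}")
        elif h["hypothesis_type"] == "implementation" and h["branch"] != f"hypothesis/{hid}":
            problems.append(f"{hid}: branch does not match hypothesis/<id>")
        if h["status"] not in ALLOWED_STATUS:
            problems.append(f"{hid}: unknown status {h['status']!r}")
        if h["conclusion"] not in ALLOWED_CONCLUSIONS:
            problems.append(f"{hid}: unknown conclusion {h['conclusion']!r}")
        # conclusion is null until the hypothesis is concluded, and set afterwards.
        if (h["status"] == "concluded") != (h["conclusion"] is not None):
            problems.append(f"{hid}: status and conclusion disagree")

    running = sum(1 for h in hypotheses if h["status"] == "running")
    if running > registry["max_active"]:
        problems.append(f"{running} running experiments exceed max_active")
    return problems


# Trimmed copy of the example entry from the schema above.
EXAMPLE = json.loads("""
{
  "max_active": 5,
  "hypotheses": [
    {
      "id": "options_flow-scan-3-expirations",
      "scanner": "options_flow",
      "branch": "hypothesis/options_flow-scan-3-expirations",
      "status": "running",
      "priority": 8,
      "hypothesis_type": "implementation",
      "created_at": "2026-04-09",
      "min_days": 14,
      "days_elapsed": 3,
      "conclusion": null
    }
  ]
}
""")

print(validate_registry(EXAMPLE))
```

Returning a list of problems rather than raising on the first one lets a caller report every inconsistency in a single pass.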

/backtest-hypothesis Command

Trigger: claude /backtest-hypothesis "<description>"

Flow:

  1. Classify the hypothesis as statistical or implementation.

    • Statistical: answerable from existing performance_database.json data — no code change needed.
    • Implementation: requires a code change and forward-testing period.
  2. Statistical path: Run the analysis immediately against existing performance data and write the conclusion to the relevant scanner domain file (docs/iterations/scanners/<scanner>.md). No branch is created.

  3. Implementation path:

    a. Read active.json. If running count < 5, start immediately. If all 5 slots are occupied by running experiments, add the new hypothesis as status: "pending". Running experiments are never interrupted (pausing mid-experiment breaks the picks streak and invalidates the statistical comparison).
    b. Create branch hypothesis/<scanner>-<slug> from main.
    c. Implement the minimal code change on the branch.
    d. Open a draft PR: title hypothesis(<scanner>): <title>, body describes the hypothesis, expected impact, and min_days.
    e. Write new entry to active.json on main with status: "running" (or "pending" if at capacity).
    f. Print summary: branch name, PR number, expected start date (if pending), expected conclusion date (if running).

Pending → running promotion: At the end of each daily runner cycle, after any experiments conclude, the runner checks for pending entries and promotes the highest-priority one to running if a slot opened up.
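
The slot check at creation time and the end-of-cycle promotion can be sketched as two small functions over the active.json structure. This is an illustration only: `initial_status` and `promote_next_pending` are hypothetical names, and breaking priority ties by the earlier created_at is an assumption the spec does not state.

```python
from typing import Optional


def _running_count(registry: dict) -> int:
    return sum(1 for h in registry["hypotheses"] if h["status"] == "running")


def initial_status(registry: dict) -> str:
    """Status a new implementation hypothesis would get at creation time."""
    return "running" if _running_count(registry) < registry["max_active"] else "pending"


def promote_next_pending(registry: dict) -> Optional[str]:
    """Promote the highest-priority pending entry if a slot is free.

    Mutates the registry in place and returns the promoted id, or None.
    Ties are broken by the earlier created_at (an assumption).
    """
    if _running_count(registry) >= registry["max_active"]:
        return None
    pending = [h for h in registry["hypotheses"] if h["status"] == "pending"]
    if not pending:
        return None
    # ISO dates sort correctly as plain strings.
    best = min(pending, key=lambda h: (-h["priority"], h["created_at"]))
    best["status"] = "running"
    return best["id"]


# Hypothetical registry with two slots, one of them in use.
registry = {
    "max_active": 2,
    "hypotheses": [
        {"id": "a", "status": "running", "priority": 8, "created_at": "2026-04-09"},
        {"id": "b", "status": "pending", "priority": 5, "created_at": "2026-04-10"},
        {"id": "c", "status": "pending", "priority": 7, "created_at": "2026-04-11"},
    ],
}
print(promote_next_pending(registry))
```

Because the function promotes at most one entry per call, a runner that wants to fill several freed slots in the same cycle would call it repeatedly until it returns None.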

Priority scoring (set at creation time):

| Factor | Score contribution |
|---|---|
| Scanner has poor 30d win rate (<40%) | +3 |
| Change is low-complexity (1 file, 1 parameter) | +2 |
| Hypothesis directly addresses a known weak spot in LEARNINGS.md | +2 |
| High daily pick volume from scanner (more data faster) | +1 |
| Evidence from external research (arXiv, Alpha Architect, etc.) | +1 |
| Conflicting evidence or uncertain direction | -2 |

Max score 9. Claude assigns this score and writes it to active.json.
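
For reference, the scoring table reduces to a weighted sum of boolean factors. The sketch below is a hypothetical rendering of it: `PriorityFactors`, `WEIGHTS`, and `priority_score` are illustrative names, and since the spec has Claude assign the score, this is a way to check the arithmetic rather than a description of how the score is produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorityFactors:
    poor_30d_win_rate: bool = False          # scanner 30d win rate below 40%
    low_complexity: bool = False             # 1 file, 1 parameter
    addresses_known_weak_spot: bool = False  # weak spot listed in LEARNINGS.md
    high_pick_volume: bool = False           # more data faster
    external_evidence: bool = False          # arXiv, Alpha Architect, etc.
    conflicting_evidence: bool = False       # conflicting or uncertain direction


# Weights copied from the table above.
WEIGHTS = {
    "poor_30d_win_rate": 3,
    "low_complexity": 2,
    "addresses_known_weak_spot": 2,
    "high_pick_volume": 1,
    "external_evidence": 1,
    "conflicting_evidence": -2,
}


def priority_score(factors: PriorityFactors) -> int:
    """Sum the weights of every factor that applies."""
    return sum(w for name, w in WEIGHTS.items() if getattr(factors, name))


best_case = PriorityFactors(
    poor_30d_win_rate=True,
    low_complexity=True,
    addresses_known_weak_spot=True,
    high_pick_volume=True,
    external_evidence=True,
)
print(priority_score(best_case))  # 9, the stated maximum
```

The sum can be zero or negative when only the conflicting-evidence factor applies; the spec does not say whether such scores are clamped, so the sketch leaves them as they are.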


Daily Hypothesis Runner (hypothesis-runner.yml)

Runs daily at 08:00 UTC, two hours after iterate (06:00 UTC).

Per-hypothesis loop (for each entry with status: "running"):

1. git checkout hypothesis/<id>
2. Run daily discovery pipeline (same as daily-discovery.yml)
3. Append today's date to picks_log
4. Commit picks update back to hypothesis branch
5. If days_elapsed >= min_days:
   a. Run statistical comparison vs baseline scanner (same scanner, main branch picks)
   b. Compute: win rate delta, avg return delta, pick volume delta, p-value if N >= 20
   c. Decision rule:
      - accepted if win rate delta > +5pp OR avg return delta > +1% (with p < 0.1 if N >= 20)
      - rejected otherwise
   d. Write concluded doc to docs/iterations/hypotheses/concluded/YYYY-MM-DD-<id>.md
   e. Update scanner domain file with finding
   f. Set status = "concluded", conclusion = "accepted"/"rejected" in active.json
   g. If accepted: merge PR into main
      If rejected: close PR without merging, delete hypothesis branch
   h. Push active.json update to main
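
Steps 5 and 5c can be sketched as two pure functions. The parenthetical in the decision rule is open to more than one reading; this sketch assumes the significance gate (p < 0.1) applies to the whole rule whenever N >= 20, and that no gate applies below that sample size. The names `ready_to_conclude` and `decide` are hypothetical, and the p-value is taken as an input because the spec does not name a statistical test.

```python
from typing import Optional


def ready_to_conclude(hypothesis: dict) -> bool:
    """Step 5: the conclusion analysis runs once days_elapsed reaches min_days."""
    return hypothesis["days_elapsed"] >= hypothesis["min_days"]


def decide(
    win_rate_delta_pp: float,
    avg_return_delta_pct: float,
    n_picks: int,
    p_value: Optional[float] = None,
) -> str:
    """Step 5c under one reading of the rule; returns "accepted" or "rejected"."""
    improved = win_rate_delta_pp > 5.0 or avg_return_delta_pct > 1.0
    if not improved:
        return "rejected"
    if n_picks >= 20:
        # Assumption: with enough picks, the improvement must also be significant.
        if p_value is None:
            raise ValueError("p_value is required when n_picks >= 20")
        return "accepted" if p_value < 0.1 else "rejected"
    return "accepted"


print(decide(win_rate_delta_pp=11.0, avg_return_delta_pct=3.7, n_picks=12))
```

Keeping the rule free of I/O makes it easy to test against hand-picked deltas before wiring it into the workflow.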

Capacity: 5 experiments × ~2 min each = ~10 min max runtime. Workflow timeout: 60 minutes.


Conclusion Document Format

docs/iterations/hypotheses/concluded/YYYY-MM-DD-<id>.md:

# Hypothesis: <title>

**Scanner:** options_flow
**Branch:** hypothesis/options_flow-scan-3-expirations
**Period:** 2026-04-09 → 2026-04-23 (14 days)
**Outcome:** accepted ✅ / rejected ❌

## Hypothesis
<original description>

## Results

| Metric | Baseline | Experiment | Delta |
|---|---|---|---|
| 7d win rate | 42% | 53% | +11pp |
| 30d avg return | -2.9% | +0.8% | +3.7% |
| Picks/day | 1.2 | 1.8 | +0.6 |

## Decision
<1-2 sentences on why accepted/rejected>

## Action
<what was merged or discarded>
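
One way the runner could produce this file is to fill the template from the active.json entry plus the computed metrics. The sketch below is hypothetical: `concluded_path` and `render_conclusion` are illustrative names, and the metric rows are passed in as preformatted strings so that no formatting rules are invented beyond what the template shows.

```python
from datetime import date

CONCLUDED_DIR = "docs/iterations/hypotheses/concluded"
OUTCOME_MARK = {"accepted": "✅", "rejected": "❌"}


def concluded_path(concluded_on: date, hypothesis_id: str) -> str:
    """Build the YYYY-MM-DD-<id>.md path under the concluded/ directory."""
    return f"{CONCLUDED_DIR}/{concluded_on.isoformat()}-{hypothesis_id}.md"


def render_conclusion(hyp, start, end, outcome, rows, decision, action) -> str:
    """Fill the conclusion template; rows are (metric, baseline, experiment, delta)."""
    days = (end - start).days
    lines = [
        f"# Hypothesis: {hyp['title']}",
        "",
        f"**Scanner:** {hyp['scanner']}",
        f"**Branch:** {hyp['branch']}",
        f"**Period:** {start.isoformat()} → {end.isoformat()} ({days} days)",
        f"**Outcome:** {outcome} {OUTCOME_MARK[outcome]}",
        "",
        "## Hypothesis",
        hyp["description"],
        "",
        "## Results",
        "",
        "| Metric | Baseline | Experiment | Delta |",
        "|---|---|---|---|",
    ]
    lines += [f"| {m} | {b} | {e} | {d} |" for m, b, e, d in rows]
    lines += ["", "## Decision", decision, "", "## Action", action, ""]
    return "\n".join(lines)


hyp = {
    "id": "options_flow-scan-3-expirations",
    "title": "Scan 3 expirations instead of 1",
    "scanner": "options_flow",
    "branch": "hypothesis/options_flow-scan-3-expirations",
    "description": "<original description>",
}
text = render_conclusion(
    hyp, date(2026, 4, 9), date(2026, 4, 23), "accepted",
    [("7d win rate", "42%", "53%", "+11pp")],
    "<1-2 sentences on why accepted/rejected>", "<what was merged or discarded>",
)
print(concluded_path(date(2026, 4, 23), hyp["id"]))
```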

Dashboard Tab (tradingagents/ui/pages/hypotheses.py)

New "Hypotheses" tab in the Streamlit dashboard.

Active experiments table:

| Hypothesis | Scanner | Status | Days | Picks | Expected Ready | Priority |
|---|---|---|---|---|---|---|
| Scan 3 expirations | options_flow | running | 3/14 | 4 | 2026-04-23 | 8 |
| ITM-only filter | options_flow | pending | 0/14 | 0 | waiting for slot | 5 |

Concluded experiments table:

| Hypothesis | Scanner | Outcome | Concluded | Win Rate Delta |
|---|---|---|---|---|
| Premium filter >$25K | options_flow | merged | 2026-04-01 | +9pp |
| Reddit DD confidence gate | reddit_dd | rejected | 2026-03-20 | -3pp |

Both tables read directly from active.json and the concluded/ directory. No separate database.
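
As an illustration, the rows of the active experiments table could be derived from active.json roughly as follows. This is a sketch with assumptions: `active_rows` is a hypothetical helper, "Expected Ready" is computed as created_at plus min_days (which matches the example row but is not stated in the spec), the second entry's id and created_at are made up for the example, and the Picks column is omitted because the schema has no pick-count field.

```python
from datetime import date, timedelta


def active_rows(registry: dict) -> list:
    """Build display rows for running and pending hypotheses, highest priority first."""
    rows = []
    for h in sorted(registry["hypotheses"], key=lambda h: -h["priority"]):
        if h["status"] == "concluded":
            continue
        if h["status"] == "running":
            # Assumption: ready date = creation date + min_days.
            ready = date.fromisoformat(h["created_at"]) + timedelta(days=h["min_days"])
            expected = ready.isoformat()
        else:
            expected = "waiting for slot"
        rows.append({
            "Hypothesis": h["title"],
            "Scanner": h["scanner"],
            "Status": h["status"],
            "Days": f"{h['days_elapsed']}/{h['min_days']}",
            "Expected Ready": expected,
            "Priority": h["priority"],
        })
    return rows


registry = {
    "max_active": 5,
    "hypotheses": [
        {"id": "options_flow-itm-only-filter", "title": "ITM-only filter",
         "scanner": "options_flow", "status": "pending", "priority": 5,
         "created_at": "2026-04-10", "min_days": 14, "days_elapsed": 0},
        {"id": "options_flow-scan-3-expirations", "title": "Scan 3 expirations",
         "scanner": "options_flow", "status": "running", "priority": 8,
         "created_at": "2026-04-09", "min_days": 14, "days_elapsed": 3},
    ],
}
for row in active_rows(registry):
    print(row)
```

A list of dictionaries like this can be handed to a table widget such as Streamlit's `st.dataframe`; keeping the row construction separate from the UI call keeps it testable without a running dashboard.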


What Is Not In Scope

  • Hypothesis branches do not interact with each other (no cross-branch comparison)
  • No A/B testing within a single discovery run (too complex, not needed)
  • No email/Slack notifications (rolling PRs in GitHub are the notification mechanism)
  • No dedicated override mechanism for priority scoring (the score is set at creation; to change it, edit active.json directly)