# Hypothesis Backtesting System — Design Spec
## Goal
Enable systematic, branch-per-hypothesis experimentation for scanner improvements. Each hypothesis runs its modified code daily in isolation, accumulates picks, and auto-concludes with a statistical comparison once enough data exists. Up to 5 experiments run in parallel, prioritized by expected impact, with full UI visibility.
## Architecture
```
docs/iterations/hypotheses/
  active.json              ← source of truth for all experiments
  concluded/
    YYYY-MM-DD-<id>.md     ← one file per concluded hypothesis
.claude/commands/
  backtest-hypothesis.md   ← /backtest-hypothesis command
.github/workflows/
  hypothesis-runner.yml    ← daily 08:00 UTC, runs all active experiments
tradingagents/ui/pages/
  hypotheses.py            ← new Streamlit dashboard tab
```
The `active.json` file lives on `main`. Each hypothesis branch (`hypothesis/<scanner>-<slug>`) contains the code change being tested. The daily runner checks out each branch, runs discovery, commits picks back to that branch, and — once `min_days` have elapsed — concludes the hypothesis and cleans up.
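For illustration, the branch-naming convention could be derived like this (a sketch; `slugify` and `branch_name` are hypothetical helpers, not existing project functions):

```python
import re

def slugify(text: str) -> str:
    """Lowercase, replace runs of punctuation/whitespace with hyphens."""
    text = re.sub(r"[^a-z0-9]+", "-", text.lower())
    return text.strip("-")

def branch_name(scanner: str, title: str) -> str:
    # hypothesis/<scanner>-<slug>, matching the convention above
    return f"hypothesis/{scanner}-{slugify(title)}"
```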
## `active.json` Schema
```json
{
  "max_active": 5,
  "hypotheses": [
    {
      "id": "options_flow-scan-3-expirations",
      "scanner": "options_flow",
      "title": "Scan 3 expirations instead of 1",
      "description": "Hypothesis: scanning up to 3 expirations captures institutional positioning in 30+ DTE contracts, improving signal quality over nearest-expiry-only.",
      "branch": "hypothesis/options_flow-scan-3-expirations",
      "pr_number": 14,
      "status": "running",
      "priority": 8,
      "expected_impact": "high",
      "hypothesis_type": "implementation",
      "created_at": "2026-04-09",
      "min_days": 14,
      "days_elapsed": 3,
      "picks_log": ["2026-04-09", "2026-04-10", "2026-04-11"],
      "baseline_scanner": "options_flow",
      "conclusion": null
    }
  ]
}
```
Field reference:

| Field | Description |
|---|---|
| `id` | `<scanner>-<slug>` — unique, used for branch and file names |
| `status` | `running` / `pending` / `concluded` |
| `priority` | 1–9 (higher = more important); determines queue order for pending hypotheses |
| `hypothesis_type` | `statistical` (answerable from existing data) or `implementation` (requires branch + forward testing) |
| `min_days` | Minimum number of pick days before the conclusion analysis runs |
| `picks_log` | Dates on which the runner collected picks on this branch |
| `conclusion` | `null` while running; `"accepted"` or `"rejected"` once concluded |
## `/backtest-hypothesis` Command

Trigger: `claude /backtest-hypothesis "<description>"`
Flow:

1. Classify the hypothesis as `statistical` or `implementation`.
   - Statistical: answerable from existing `performance_database.json` data — no code change needed.
   - Implementation: requires a code change and a forward-testing period.
2. Statistical path: Run the analysis immediately against existing performance data. Write the conclusion to the relevant scanner domain file (`docs/iterations/scanners/<scanner>.md`). Done — no branch created.
3. Implementation path:
   a. Read `active.json`. If the `running` count is < 5, start immediately. If all 5 slots are occupied by running experiments, add the new hypothesis as `status: "pending"` — running experiments are never interrupted (pausing mid-experiment breaks the picks streak and invalidates the statistical comparison).
   b. Create branch `hypothesis/<scanner>-<slug>` from `main`.
   c. Implement the minimal code change on the branch.
   d. Open a draft PR: title `hypothesis(<scanner>): <title>`; body describes the hypothesis, expected impact, and `min_days`.
   e. Write a new entry to `active.json` on `main` with `status: "running"` (or `"pending"` if at capacity).
   f. Print a summary: branch name, PR number, expected start date (if pending), expected conclusion date (if running).
Pending → running promotion: At the end of each daily runner cycle, after any experiments conclude, the runner checks for pending entries and promotes the highest-priority one to running if a slot opened up.
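The promotion step could look like this (a sketch over the `active.json` structure above; `promote_pending` is an illustrative name):

```python
def promote_pending(state: dict) -> list[str]:
    """End-of-cycle promotion: fill any open slots with the
    highest-priority pending hypotheses. Mutates `state` in place."""
    running = sum(1 for h in state["hypotheses"] if h["status"] == "running")
    pending = sorted(
        (h for h in state["hypotheses"] if h["status"] == "pending"),
        key=lambda h: h["priority"], reverse=True)
    promoted = []
    for h in pending[: max(0, state["max_active"] - running)]:
        h["status"] = "running"
        promoted.append(h["id"])
    return promoted
```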
Priority scoring (set at creation time):
| Factor | Score contribution |
|---|---|
| Scanner has poor 30d win rate (<40%) | +3 |
| Change is low-complexity (1 file, 1 parameter) | +2 |
| Hypothesis directly addresses a known weak spot in LEARNINGS.md | +2 |
| High daily pick volume from scanner (more data faster) | +1 |
| Evidence from external research (arXiv, Alpha Architect, etc.) | +1 |
| Conflicting evidence or uncertain direction | -2 |
Max score 9. Claude assigns this score and writes it to active.json.
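The factor table sums to a maximum of 9; a sketch of the scoring, assuming boolean signals gathered at creation time (the flag names and the clamp to the 1–9 range are illustrative):

```python
def priority_score(signals: dict) -> int:
    """Sum the factor table; `signals` holds hypothetical boolean flags."""
    score = 0
    score += 3 * signals.get("poor_win_rate", False)       # 30d win rate < 40%
    score += 2 * signals.get("low_complexity", False)      # 1 file, 1 parameter
    score += 2 * signals.get("known_weak_spot", False)     # named in LEARNINGS.md
    score += 1 * signals.get("high_pick_volume", False)    # more data faster
    score += 1 * signals.get("external_evidence", False)   # arXiv, Alpha Architect, ...
    score -= 2 * signals.get("conflicting_evidence", False)
    return max(1, min(9, score))                           # clamp to 1-9
```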
## Daily Hypothesis Runner (`hypothesis-runner.yml`)

Runs daily at 08:00 UTC (after `iterate` at 06:00 UTC).
Per-hypothesis loop (for each entry with `status: "running"`):

1. `git checkout hypothesis/<id>`
2. Run the daily discovery pipeline (same as `daily-discovery.yml`)
3. Append today's date to `picks_log`
4. Commit the picks update back to the hypothesis branch
5. If `days_elapsed >= min_days`:
   a. Run the statistical comparison vs the baseline scanner (same scanner, `main` branch picks)
   b. Compute: win rate delta, avg return delta, pick volume delta, and a p-value if N >= 20
   c. Decision rule:
      - accepted if win rate delta > +5pp OR avg return delta > +1% (with p < 0.1 if N >= 20)
      - rejected otherwise
   d. Write the concluded doc to `docs/iterations/hypotheses/concluded/YYYY-MM-DD-<id>.md`
   e. Update the scanner domain file with the finding
   f. Set `status = "concluded"` and `conclusion = "accepted"`/`"rejected"` in `active.json`
   g. If accepted: merge the PR into `main`. If rejected: close the PR without merging and delete the hypothesis branch
   h. Push the `active.json` update to `main`
Capacity: 5 experiments × ~2 min each = ~10 min max runtime. Workflow timeout: 60 minutes.
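The decision rule in step 5c can be sketched as follows. This is one possible reading: a two-proportion z-test (normal approximation) stands in for whatever test the runner actually uses, the thresholds come from the spec, and applying the significance gate to both criteria is an assumption:

```python
import math

def conclude(base_wins, base_n, exp_wins, exp_n,
             base_avg_return, exp_avg_return):
    """Return 'accepted' or 'rejected' per the spec's decision rule.
    Returns are in percent; win rates are win counts over pick counts."""
    win_delta = exp_wins / exp_n - base_wins / base_n   # fraction, +0.05 = +5pp
    return_delta = exp_avg_return - base_avg_return     # percentage points
    p_value = None
    if base_n >= 20 and exp_n >= 20:
        # Two-sided two-proportion z-test on the win-rate difference.
        p_pool = (base_wins + exp_wins) / (base_n + exp_n)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / base_n + 1 / exp_n))
        z = win_delta / se if se else 0.0
        p_value = math.erfc(abs(z) / math.sqrt(2))
    improved = win_delta > 0.05 or return_delta > 1.0
    significant = p_value is None or p_value < 0.1      # gate only when N >= 20
    return "accepted" if improved and significant else "rejected"
```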
## Conclusion Document Format

`docs/iterations/hypotheses/concluded/YYYY-MM-DD-<id>.md`:

```markdown
# Hypothesis: <title>

**Scanner:** options_flow
**Branch:** hypothesis/options_flow-scan-3-expirations
**Period:** 2026-04-09 → 2026-04-23 (14 days)
**Outcome:** accepted ✅ / rejected ❌

## Hypothesis
<original description>

## Results
| Metric | Baseline | Experiment | Delta |
|---|---|---|---|
| 7d win rate | 42% | 53% | +11pp |
| 30d avg return | -2.9% | +0.8% | +3.7pp |
| Picks/day | 1.2 | 1.8 | +0.6 |

## Decision
<1-2 sentences on why accepted/rejected>

## Action
<what was merged or discarded>
```
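A hypothetical renderer for the header of that template (`TEMPLATE`, `render_conclusion`, and using the last `picks_log` date as the filename date are all illustrative choices, not spec requirements):

```python
from datetime import date

TEMPLATE = """\
# Hypothesis: {title}

**Scanner:** {scanner}
**Branch:** {branch}
**Period:** {start} → {end} ({days} days)
**Outcome:** {outcome}
"""

def render_conclusion(hyp: dict, outcome: str) -> tuple[str, str]:
    """Build (path, markdown header) for a concluded hypothesis."""
    start, end = hyp["picks_log"][0], hyp["picks_log"][-1]
    days = (date.fromisoformat(end) - date.fromisoformat(start)).days
    body = TEMPLATE.format(title=hyp["title"], scanner=hyp["scanner"],
                           branch=hyp["branch"], start=start, end=end,
                           days=days, outcome=outcome)
    path = f"docs/iterations/hypotheses/concluded/{end}-{hyp['id']}.md"
    return path, body
```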
## Dashboard Tab (`tradingagents/ui/pages/hypotheses.py`)

New "Hypotheses" tab in the Streamlit dashboard.
Active experiments table:
| Hypothesis | Scanner | Status | Days | Picks | Expected Ready | Priority |
|---|---|---|---|---|---|---|
| Scan 3 expirations | options_flow | running | 3/14 | 4 | 2026-04-23 | 8 |
| ITM-only filter | options_flow | pending | 0/14 | 0 | waiting for slot | 5 |
Concluded experiments table:
| Hypothesis | Scanner | Outcome | Concluded | Win Rate Delta |
|---|---|---|---|---|
| Premium filter >$25K | options_flow | ✅ merged | 2026-04-01 | +9pp |
| Reddit DD confidence gate | reddit_dd | ❌ rejected | 2026-03-20 | -3pp |
Both tables read directly from `active.json` and the `concluded/` directory. No separate database.
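Building the active-experiments rows from `active.json` might look like this (a sketch; `active_rows` is an illustrative helper, and the real page would presumably hand the rows to `st.dataframe`):

```python
from datetime import date, timedelta

def active_rows(state: dict) -> list[dict]:
    """One row per non-concluded hypothesis, matching the table columns."""
    rows = []
    for h in state["hypotheses"]:
        if h["status"] == "concluded":
            continue
        if h["status"] == "running" and h["picks_log"]:
            start = date.fromisoformat(h["picks_log"][0])
            ready = str(start + timedelta(days=h["min_days"]))
        else:
            ready = "waiting for slot"
        rows.append({
            "Hypothesis": h["title"], "Scanner": h["scanner"],
            "Status": h["status"],
            "Days": f"{h['days_elapsed']}/{h['min_days']}",
            "Picks": len(h["picks_log"]),
            "Expected Ready": ready, "Priority": h["priority"],
        })
    return rows
```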
## What Is Not In Scope
- Hypothesis branches do not interact with each other (no cross-branch comparison)
- No A/B testing within a single discovery run (too complex, not needed)
- No email/Slack notifications (rolling PRs in GitHub are the notification mechanism)
- No manual override of priority scoring (set at creation, editable directly in `active.json`)