Hypothesis Backtesting System — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Build a branch-per-hypothesis experimentation system that runs scanner code changes daily in isolation, accumulates picks, auto-concludes with a statistical comparison, and surfaces everything in the dashboard.

Architecture: active.json is the registry and lives on main. Each hypothesis gets a hypothesis/<scanner>-<slug> branch that carries the code change. A daily workflow (08:00 UTC) uses git worktrees to run discovery on each branch, stores picks in docs/iterations/hypotheses/<id>/picks.json on the hypothesis branch, and concludes the experiment once min_days daily runs have accumulated. The /backtest-hypothesis command classifies hypotheses, creates branches, and manages the registry.

Tech Stack: Python 3.10, yfinance (download_history), GitHub Actions, Streamlit, gh CLI, git worktree
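
To make the lifecycle in the architecture paragraph concrete, here is a minimal sketch of how one registry entry might move from running to ready-to-conclude. The `RegistryEntry` class and its `record_run` and `is_due` methods are hypothetical names used only for illustration. The field names mirror the `active.json` entries defined later in this plan, and counting days as the number of logged run dates follows the runner script.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RegistryEntry:
    """Hypothetical in-memory view of one active.json entry (illustration only)."""
    id: str
    branch: str
    min_days: int
    status: str = "running"
    picks_log: List[str] = field(default_factory=list)

    @property
    def days_elapsed(self) -> int:
        # One logged date per daily run, so the log length is the day count.
        return len(self.picks_log)

    def record_run(self, run_date: str) -> None:
        # Re-running on the same date must not count as an extra day.
        if run_date not in self.picks_log:
            self.picks_log.append(run_date)

    def is_due(self) -> bool:
        # An experiment concludes once min_days daily runs have accumulated.
        return self.status == "running" and self.days_elapsed >= self.min_days


entry = RegistryEntry(
    id="options_flow-scan-3-expirations",
    branch="hypothesis/options_flow-scan-3-expirations",
    min_days=2,
)
entry.record_run("2026-03-01")
entry.record_run("2026-03-01")  # duplicate date, ignored
print(entry.days_elapsed, entry.is_due())
entry.record_run("2026-03-02")
print(entry.days_elapsed, entry.is_due())
```

Because the day count comes from the log rather than from calendar arithmetic, a manual re-run through workflow_dispatch on the same date does not shorten the experiment.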

File Map

| Path | Action | Purpose |
|---|---|---|
| `docs/iterations/hypotheses/active.json` | Create | Registry of all experiments |
| `docs/iterations/hypotheses/concluded/.gitkeep` | Create | Directory placeholder |
| `scripts/compare_hypothesis.py` | Create | Fetch returns + statistical comparison |
| `tests/test_compare_hypothesis.py` | Create | Unit tests for the comparison script |
| `.claude/commands/backtest-hypothesis.md` | Create | `/backtest-hypothesis` Claude command |
| `.github/workflows/hypothesis-runner.yml` | Create | Daily 08:00 UTC runner |
| `scripts/run_hypothesis_runner.py` | Create | Orchestration script invoked by the workflow |
| `tradingagents/ui/pages/hypotheses.py` | Create | Dashboard "Hypotheses" tab |
| `tradingagents/ui/pages/__init__.py` | Modify | Register new page |
| `tradingagents/ui/dashboard.py` | Modify | Add "Hypotheses" to nav |

Task 1: Hypothesis Registry Structure

Files:

Create: docs/iterations/hypotheses/active.json
Create: docs/iterations/hypotheses/concluded/.gitkeep

Step 1: Create the directory and initial active.json

mkdir -p docs/iterations/hypotheses/concluded

Write docs/iterations/hypotheses/active.json:

{
  "max_active": 5,
  "hypotheses": []
}

Step 2: Create the concluded directory placeholder

touch docs/iterations/hypotheses/concluded/.gitkeep

Step 3: Verify JSON is valid

python3 -c "import json; json.load(open('docs/iterations/hypotheses/active.json')); print('valid')"

Expected: valid
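
Valid JSON is not quite the same as a well-formed registry. As an optional extra check, a sketch like the following could confirm the two top-level keys and the status of each entry. The `validate_registry` helper is hypothetical and not part of the plan's scripts. The status values it allows (`running`, `pending`, `concluded`) are the ones used elsewhere in this plan.

```python
import json
from typing import List


def validate_registry(registry: dict) -> List[str]:
    """Return a list of problems found in a registry dict (empty list = OK).

    Hypothetical helper for illustration; the keys checked here are the ones
    the plan's active.json uses.
    """
    problems = []
    max_active = registry.get("max_active")
    # bool is a subclass of int, so exclude it explicitly.
    if not isinstance(max_active, int) or isinstance(max_active, bool) or max_active < 1:
        problems.append("max_active must be a positive integer")

    hypotheses = registry.get("hypotheses")
    if not isinstance(hypotheses, list):
        problems.append("hypotheses must be a list")
        return problems

    allowed = {"running", "pending", "concluded"}
    for i, hyp in enumerate(hypotheses):
        if not isinstance(hyp, dict) or "id" not in hyp:
            problems.append(f"hypotheses[{i}] is missing an id")
        elif hyp.get("status") not in allowed:
            problems.append(f"{hyp['id']}: unexpected status {hyp.get('status')!r}")
    return problems


# The initial registry written in Step 1 should produce no problems.
initial = json.loads('{"max_active": 5, "hypotheses": []}')
print(validate_registry(initial))
```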

Step 4: Commit

git add docs/iterations/hypotheses/
git commit -m "feat(hypotheses): initialize hypothesis registry"

Task 2: Comparison Script

Files:

Create: scripts/compare_hypothesis.py
Create: tests/test_compare_hypothesis.py

Insight: The runner reads picks from the hypothesis branch via git show <branch>:<path> and passes them to the comparison script as a JSON string. This avoids checking out the branch just to read a file, so the working tree stays on main throughout.

Step 1: Write the failing tests

Create tests/test_compare_hypothesis.py:

"""Tests for the hypothesis comparison script."""
import json
import sys
from pathlib import Path
from unittest.mock import MagicMock, patch

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent))

from scripts.compare_hypothesis import (
    compute_metrics,
    compute_7d_return,
    load_baseline_metrics,
    make_decision,
)


# ── compute_metrics ──────────────────────────────────────────────────────────

def test_compute_metrics_empty():
    result = compute_metrics([])
    assert result == {"count": 0, "evaluated": 0, "win_rate": None, "avg_return": None}


def test_compute_metrics_all_wins():
    picks = [
        {"return_7d": 5.0, "win_7d": True},
        {"return_7d": 3.0, "win_7d": True},
    ]
    result = compute_metrics(picks)
    assert result["win_rate"] == 100.0
    assert result["avg_return"] == 4.0
    assert result["evaluated"] == 2


def test_compute_metrics_mixed():
    picks = [
        {"return_7d": 10.0, "win_7d": True},
        {"return_7d": -5.0, "win_7d": False},
        {"return_7d": None, "win_7d": None},   # pending — excluded
    ]
    result = compute_metrics(picks)
    assert result["win_rate"] == 50.0
    assert result["avg_return"] == 2.5
    assert result["evaluated"] == 2
    assert result["count"] == 3


# ── compute_7d_return ────────────────────────────────────────────────────────

def test_compute_7d_return_positive():
    mock_df = MagicMock()
    mock_df.empty = False
    # Simulate DataFrame with Close column: entry=100, exit=110
    mock_df.__len__ = lambda self: 2
    mock_df["Close"].iloc.__getitem__ = MagicMock(side_effect=lambda i: 100.0 if i == 0 else 110.0)

    with patch("scripts.compare_hypothesis.download_history", return_value=mock_df):
        ret, win = compute_7d_return("AAPL", "2026-03-01")

    assert ret == pytest.approx(10.0, rel=0.01)
    assert win is True


def test_compute_7d_return_empty_data():
    mock_df = MagicMock()
    mock_df.empty = True

    with patch("scripts.compare_hypothesis.download_history", return_value=mock_df):
        ret, win = compute_7d_return("AAPL", "2026-03-01")

    assert ret is None
    assert win is None


# ── load_baseline_metrics ────────────────────────────────────────────────────

def test_load_baseline_metrics(tmp_path):
    db = {
        "recommendations_by_date": {
            "2026-03-01": [
                {"strategy_match": "options_flow", "return_7d": 5.0, "win_7d": True},
                {"strategy_match": "options_flow", "return_7d": -2.0, "win_7d": False},
                {"strategy_match": "reddit_dd", "return_7d": 3.0, "win_7d": True},
            ]
        }
    }
    db_file = tmp_path / "performance_database.json"
    db_file.write_text(json.dumps(db))

    result = load_baseline_metrics("options_flow", str(db_file))

    assert result["win_rate"] == 50.0
    assert result["avg_return"] == 1.5
    assert result["count"] == 2


def test_load_baseline_metrics_missing_file(tmp_path):
    result = load_baseline_metrics("options_flow", str(tmp_path / "missing.json"))
    assert result == {"count": 0, "win_rate": None, "avg_return": None}


# ── make_decision ─────────────────────────────────────────────────────────────

def test_make_decision_accepted_by_win_rate():
    hyp = {"win_rate": 60.0, "avg_return": 0.5, "evaluated": 10}
    baseline = {"win_rate": 50.0, "avg_return": 0.5}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "accepted"
    assert "win rate" in reason.lower()


def test_make_decision_accepted_by_return():
    hyp = {"win_rate": 52.0, "avg_return": 3.0, "evaluated": 10}
    baseline = {"win_rate": 50.0, "avg_return": 1.5}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "accepted"
    assert "return" in reason.lower()


def test_make_decision_rejected():
    hyp = {"win_rate": 48.0, "avg_return": 0.2, "evaluated": 10}
    baseline = {"win_rate": 50.0, "avg_return": 1.0}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "rejected"


def test_make_decision_insufficient_data():
    hyp = {"win_rate": 80.0, "avg_return": 5.0, "evaluated": 2}
    baseline = {"win_rate": 50.0, "avg_return": 1.0}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "rejected"
    assert "insufficient" in reason.lower()

Step 2: Run tests to confirm they fail

python -m pytest tests/test_compare_hypothesis.py -v 2>&1 | head -30

Expected: ModuleNotFoundError: No module named 'scripts.compare_hypothesis' or similar import error — confirms tests are wired correctly.

Step 3: Write scripts/compare_hypothesis.py

#!/usr/bin/env python3
"""
Hypothesis comparison — computes 7d returns for hypothesis picks and
compares them against the baseline scanner in performance_database.json.

Usage (called by scripts/run_hypothesis_runner.py once min_days have elapsed):
    python scripts/compare_hypothesis.py \\
        --hypothesis-id options_flow-scan-3-expirations \\
        --picks-json '[{"ticker": "AAPL", "date": "2026-03-01"}]' \\
        --scanner options_flow \\
        --db-path data/recommendations/performance_database.json

--picks-json takes the bare list of pick dicts (the "picks" array from
picks.json), not the wrapping object.

Prints a JSON conclusion to stdout:
    {
      "hypothesis_id": "options_flow-scan-3-expirations",
      "decision": "accepted",
      "reason": "...",
      "hypothesis": {"win_rate": 58.0, "avg_return": 1.8, "count": 14, "evaluated": 10},
      "baseline":   {"win_rate": 42.0, "avg_return": -0.3, "count": 87, "evaluated": 87},
      "enriched_picks": [...]
    }
"""

import argparse
import json
import sys
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional, Tuple

ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(ROOT))

from tradingagents.dataflows.y_finance import download_history


# Minimum evaluated picks required to make a decision
_MIN_EVALUATED = 5
# Thresholds from spec
_WIN_RATE_DELTA_THRESHOLD = 5.0   # percentage points
_AVG_RETURN_DELTA_THRESHOLD = 1.0  # percent


def compute_7d_return(ticker: str, pick_date: str) -> Tuple[Optional[float], Optional[bool]]:
    """
    Fetch 7-day return for a pick using yfinance.

    Args:
        ticker: Stock symbol, e.g. "AAPL"
        pick_date: Date the pick was made, "YYYY-MM-DD"

    Returns:
        (return_pct, is_win) or (None, None) if data unavailable
    """
    try:
        entry_dt = datetime.strptime(pick_date, "%Y-%m-%d")
        exit_dt = entry_dt + timedelta(days=10)  # +3 buffer for weekends/holidays
        df = download_history(
            ticker,
            start=entry_dt.strftime("%Y-%m-%d"),
            end=exit_dt.strftime("%Y-%m-%d"),
        )
        if df.empty or len(df) < 2:
            return None, None

        # Use the first available close as entry and the close 5 trading
        # days later (roughly 7 calendar days) as exit. If fewer rows are
        # available, fall back to the last one.
        close = df["Close"]
        entry_price = float(close.iloc[0])
        exit_idx = min(5, len(close) - 1)
        exit_price = float(close.iloc[exit_idx])

        if entry_price <= 0:
            return None, None

        ret = (exit_price - entry_price) / entry_price * 100
        return round(ret, 4), ret > 0

    except Exception:
        return None, None


def enrich_picks_with_returns(picks: list) -> list:
    """
    Compute 7d return for each pick that is old enough (>= 7 days) and
    doesn't already have return_7d populated.

    Args:
        picks: List of pick dicts with at least 'ticker' and 'date' fields

    Returns:
        Same list with return_7d and win_7d populated where possible
    """
    cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d")
    for pick in picks:
        if pick.get("return_7d") is not None:
            continue  # already computed
        if pick.get("date", "9999-99-99") > cutoff:
            continue  # too recent
        ret, win = compute_7d_return(pick["ticker"], pick["date"])
        pick["return_7d"] = ret
        pick["win_7d"] = win
    return picks


def compute_metrics(picks: list) -> dict:
    """
    Compute win rate and avg return for a list of picks.

    Only picks with non-None return_7d contribute to win_rate and avg_return.

    Returns:
        {"count": int, "evaluated": int, "win_rate": float|None, "avg_return": float|None}
    """
    evaluated = [p for p in picks if p.get("return_7d") is not None]
    if not evaluated:
        return {"count": len(picks), "evaluated": 0, "win_rate": None, "avg_return": None}

    wins = sum(1 for p in evaluated if p.get("win_7d"))
    avg_ret = sum(p["return_7d"] for p in evaluated) / len(evaluated)
    return {
        "count": len(picks),
        "evaluated": len(evaluated),
        "win_rate": round(wins / len(evaluated) * 100, 1),
        "avg_return": round(avg_ret, 2),
    }


def load_baseline_metrics(scanner: str, db_path: str) -> dict:
    """
    Load baseline metrics for a scanner from performance_database.json.

    Args:
        scanner: Scanner name, e.g. "options_flow"
        db_path: Path to performance_database.json

    Returns:
        {"count": int, "win_rate": float|None, "avg_return": float|None}
    """
    path = Path(db_path)
    if not path.exists():
        return {"count": 0, "win_rate": None, "avg_return": None}

    try:
        with open(path) as f:
            db = json.load(f)
    except Exception:
        return {"count": 0, "win_rate": None, "avg_return": None}

    picks = []
    for recs in db.get("recommendations_by_date", {}).values():
        for rec in (recs if isinstance(recs, list) else []):
            if rec.get("strategy_match") == scanner and rec.get("return_7d") is not None:
                picks.append(rec)

    return compute_metrics(picks)


def make_decision(hypothesis: dict, baseline: dict) -> Tuple[str, str]:
    """
    Decide accepted or rejected based on metrics delta.

    Rules:
    - Minimum _MIN_EVALUATED evaluated picks required
    - accepted if win_rate_delta > _WIN_RATE_DELTA_THRESHOLD (5pp)
      OR avg_return_delta > _AVG_RETURN_DELTA_THRESHOLD (1%)
    - rejected otherwise

    Returns:
        (decision, reason) where decision is "accepted" or "rejected"
    """
    evaluated = hypothesis.get("evaluated", 0)
    if evaluated < _MIN_EVALUATED:
        return "rejected", f"Insufficient data: only {evaluated} evaluated picks (need {_MIN_EVALUATED})"

    hyp_wr = hypothesis.get("win_rate")
    hyp_ret = hypothesis.get("avg_return")
    base_wr = baseline.get("win_rate")
    base_ret = baseline.get("avg_return")

    reasons = []

    if hyp_wr is not None and base_wr is not None:
        delta_wr = hyp_wr - base_wr
        if delta_wr > _WIN_RATE_DELTA_THRESHOLD:
            reasons.append(f"win rate improved by {delta_wr:+.1f}pp ({base_wr:.1f}% → {hyp_wr:.1f}%)")

    if hyp_ret is not None and base_ret is not None:
        delta_ret = hyp_ret - base_ret
        if delta_ret > _AVG_RETURN_DELTA_THRESHOLD:
            reasons.append(f"avg return improved by {delta_ret:+.2f}% ({base_ret:+.2f}% → {hyp_ret:+.2f}%)")

    if reasons:
        return "accepted", "; ".join(reasons)

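    # Guard both sides: the baseline can be missing even when the hypothesis
    # has data, and formatting None with a float spec raises TypeError.
    if hyp_wr is not None and base_wr is not None:
        wr_str = f"{hyp_wr:.1f}% vs baseline {base_wr:.1f}%"
    else:
        wr_str = "no win rate data"
    if hyp_ret is not None and base_ret is not None:
        ret_str = f"{hyp_ret:+.2f}% vs baseline {base_ret:+.2f}%"
    else:
        ret_str = "no return data"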
    return "rejected", f"No significant improvement — win rate: {wr_str}; avg return: {ret_str}"


def main():
    parser = argparse.ArgumentParser(description="Compare hypothesis picks against baseline")
    parser.add_argument("--hypothesis-id", required=True)
    parser.add_argument("--picks-json", required=True, help="JSON string of picks list")
    parser.add_argument("--scanner", required=True, help="Baseline scanner name")
    parser.add_argument(
        "--db-path",
        default="data/recommendations/performance_database.json",
        help="Path to performance_database.json",
    )
    args = parser.parse_args()

    picks = json.loads(args.picks_json)
    picks = enrich_picks_with_returns(picks)

    hyp_metrics = compute_metrics(picks)
    base_metrics = load_baseline_metrics(args.scanner, args.db_path)

    decision, reason = make_decision(hyp_metrics, base_metrics)

    result = {
        "hypothesis_id": args.hypothesis_id,
        "decision": decision,
        "reason": reason,
        "hypothesis": hyp_metrics,
        "baseline": base_metrics,
        "enriched_picks": picks,
    }
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()
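
The exit-price selection and the acceptance thresholds are easier to check on plain numbers. The sketch below re-implements both rules on a list of closes and on the sample metrics from the module docstring, without calling yfinance. `pct_return` and `deltas` are hypothetical helper names for this illustration only. The index rule `min(5, len - 1)` and the thresholds (5 percentage points, 1 percent) are copied from the script above.

```python
from typing import List, Optional, Tuple

WIN_RATE_DELTA_THRESHOLD = 5.0    # percentage points, as in the script
AVG_RETURN_DELTA_THRESHOLD = 1.0  # percent, as in the script


def pct_return(closes: List[float]) -> Optional[float]:
    """Percent return from the first close to the close 5 rows later.

    Falls back to the last available row when fewer than 6 closes exist,
    mirroring exit_idx = min(5, len(close) - 1).
    """
    if len(closes) < 2 or closes[0] <= 0:
        return None
    exit_idx = min(5, len(closes) - 1)
    return round((closes[exit_idx] - closes[0]) / closes[0] * 100, 4)


def deltas(hyp: dict, base: dict) -> Tuple[float, float]:
    """Hypothesis-minus-baseline deltas for win rate and average return."""
    return hyp["win_rate"] - base["win_rate"], hyp["avg_return"] - base["avg_return"]


# Seven daily closes: entry is row 0 (100.0), exit is row 5 (104.0).
full_window = [100.0, 101.0, 99.0, 102.0, 103.0, 104.0, 108.0]
# Only three closes available: exit falls back to the last row (97.0).
short_window = [100.0, 98.0, 97.0]
print(pct_return(full_window), pct_return(short_window))

# Sample metrics from the module docstring.
hyp = {"win_rate": 58.0, "avg_return": 1.8}
base = {"win_rate": 42.0, "avg_return": -0.3}
d_wr, d_ret = deltas(hyp, base)
accepted = d_wr > WIN_RATE_DELTA_THRESHOLD or d_ret > AVG_RETURN_DELTA_THRESHOLD
print(round(d_wr, 1), round(d_ret, 1), accepted)
```

Both comparisons are strict, so a hypothesis that is exactly 5.0 points ahead on win rate is not accepted on that metric alone.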

Step 4: Run tests to confirm they pass

python -m pytest tests/test_compare_hypothesis.py -v

Expected: all 11 tests pass.

Step 5: Commit

git add scripts/compare_hypothesis.py tests/test_compare_hypothesis.py
git commit -m "feat(hypotheses): add comparison + conclusion script"

Task 3: `/backtest-hypothesis` Command

Files:

Create: .claude/commands/backtest-hypothesis.md

Step 1: Write the command file

Create .claude/commands/backtest-hypothesis.md:

# /backtest-hypothesis

Test a hypothesis about a scanner improvement using branch-per-hypothesis isolation.

**Usage:** `/backtest-hypothesis "<description of the hypothesis>"`

**Example:** `/backtest-hypothesis "options_flow: scan 3 expirations instead of 1 to capture institutional 30+ DTE positioning"`

---

## Step 1: Read Current Registry

Read `docs/iterations/hypotheses/active.json`. Note:
- How many hypotheses currently have `status: "running"`
- The `max_active` limit (default 5)
- Any existing `pending` entries

Also read `docs/iterations/LEARNINGS.md` and the relevant scanner domain file in
`docs/iterations/scanners/` to understand the current baseline.

## Step 2: Classify the Hypothesis

Determine whether this is:

**Statistical** — answerable from existing data in `data/recommendations/performance_database.json`
without any code change. Examples:
- "Does high confidence (≥8) predict better 30d returns?"
- "Are options_flow picks that are ITM outperforming OTM ones?"

**Implementation** — requires a code change and forward-testing period. Examples:
- "Scan 3 expirations instead of 1"
- "Apply a premium filter of $50K instead of $25K"

## Step 3a: Statistical Path

If statistical: run the analysis now against `data/recommendations/performance_database.json`.
Write the finding to the relevant scanner domain file under **Evidence Log**. Print a summary.
Done — no branch needed.

## Step 3b: Implementation Path

### 3b-i: Capacity check

Count running hypotheses from `active.json`. If fewer than `max_active` running, proceed.
If at capacity: add the new hypothesis as `status: "pending"` — running experiments are NEVER
paused mid-streak. Inform the user which slot it queued behind and when it will likely start.

### 3b-ii: Score the hypothesis

Assign a `priority` score (1–9) using these factors:

| Factor | Score |
|---|---|
| Scanner 30d win rate < 40% | +3 |
| Change touches 1 file, 1 parameter | +2 |
| Directly addresses a weak spot in LEARNINGS.md | +2 |
| Scanner generates ≥2 picks/day (data accrues fast) | +1 |
| Supported by external research (arXiv, Alpha Architect, etc.) | +1 |
| Contradictory evidence or unclear direction | −2 |

### 3b-iii: Determine min_days

Set `min_days` based on the scanner's typical picks-per-day rate:
- ≥2 picks/day → 14 days
- 1 pick/day → 21 days
- <1 pick/day → 30 days

### 3b-iv: Create the branch and implement the code change

```bash
BRANCH="hypothesis/<scanner>-<slug>"
git checkout -b "$BRANCH"
```

Make the minimal code change that implements the hypothesis. Read the scanner file first.
Only change what the hypothesis requires — do not refactor surrounding code.

```bash
git add tradingagents/
git commit -m "hypothesis(<scanner>): <title>"
```

### 3b-v: Create picks tracking file on the branch

Create `docs/iterations/hypotheses/<id>/picks.json` on the hypothesis branch:

```json
{
  "hypothesis_id": "<id>",
  "scanner": "<scanner>",
  "picks": []
}
```

```bash
mkdir -p docs/iterations/hypotheses/<id>
# write the file
git add docs/iterations/hypotheses/<id>/picks.json
git commit -m "hypothesis(<scanner>): add picks tracker"
git push -u origin "$BRANCH"
```

### 3b-vi: Open a draft PR

```bash
gh pr create \
  --title "hypothesis(<scanner>): <title>" \
  --body "**Hypothesis:** <description>

**Expected impact:** <high/medium/low>
**Min days:** <N>
**Priority:** <score>/9

*This is an automated hypothesis experiment. It will be auto-concluded after <N> days of data.*" \
  --draft \
  --base main
```

Note the PR number from the output.

### 3b-vii: Update active.json on main

Check out `main`, then update `docs/iterations/hypotheses/active.json` to add the new entry:

```json
{
  "id": "<scanner>-<slug>",
  "scanner": "<scanner>",
  "title": "<title>",
  "description": "<description>",
  "branch": "hypothesis/<scanner>-<slug>",
  "pr_number": <N>,
  "status": "running",
  "priority": <score>,
  "expected_impact": "<high|medium|low>",
  "hypothesis_type": "implementation",
  "created_at": "<YYYY-MM-DD>",
  "min_days": <N>,
  "days_elapsed": 0,
  "picks_log": [],
  "baseline_scanner": "<scanner>",
  "conclusion": null
}
```

```bash
git checkout main
git add docs/iterations/hypotheses/active.json
git commit -m "feat(hypotheses): register hypothesis <id>"
git push origin main
```

## Step 4: Print Summary

Print a confirmation:
- Hypothesis ID and branch name
- Status: running or pending
- Expected conclusion date (created_at + min_days)
- PR link (if running)
- Priority score and why

Step 2: Verify the file exists and is non-empty

wc -l .claude/commands/backtest-hypothesis.md

Expected: at least 80 lines.

Step 3: Commit

git add .claude/commands/backtest-hypothesis.md
git commit -m "feat(hypotheses): add /backtest-hypothesis command"
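
The scoring table, the min_days tiers, and the expected conclusion date from the command file can be sketched in a few lines. The function names and boolean parameters here are hypothetical. Clamping the score to the stated 1 to 9 range is an assumption of this sketch, as is placing rates between 1 and 2 picks per day in the 21-day tier.

```python
from datetime import date, timedelta


def priority_score(
    win_rate_30d_below_40: bool,
    single_file_single_param: bool,
    addresses_known_weak_spot: bool,
    at_least_two_picks_per_day: bool,
    has_external_research: bool,
    contradictory_evidence: bool,
) -> int:
    """Sum the factor weights from the scoring table, clamped to 1..9 (assumed)."""
    score = (
        3 * win_rate_30d_below_40
        + 2 * single_file_single_param
        + 2 * addresses_known_weak_spot
        + 1 * at_least_two_picks_per_day
        + 1 * has_external_research
        - 2 * contradictory_evidence
    )
    return max(1, min(9, score))


def min_days_for(picks_per_day: float) -> int:
    """Map a scanner's typical pick rate to an experiment length in days."""
    if picks_per_day >= 2:
        return 14
    if picks_per_day >= 1:  # assumption: 1 up to (but not including) 2
        return 21
    return 30


def expected_conclusion(created_at: str, min_days: int) -> str:
    """created_at + min_days, using the YYYY-MM-DD format of active.json."""
    return (date.fromisoformat(created_at) + timedelta(days=min_days)).isoformat()


# A weak scanner, a one-parameter change, and a fast pick rate.
score = priority_score(True, True, False, True, False, False)
days = min_days_for(2.5)
print(score, days, expected_conclusion("2026-03-01", days))
```

The conclusion date is only an estimate, since the runner counts logged run dates rather than calendar days.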

Task 4: Hypothesis Runner Workflow

Files:

Create: .github/workflows/hypothesis-runner.yml

Step 1: Write the workflow

Create .github/workflows/hypothesis-runner.yml:

name: Hypothesis Runner

on:
  schedule:
    # 08:00 UTC daily: runs after iterate (06:00) and before daily-discovery (12:30)
    - cron: "0 8 * * *"
  workflow_dispatch:
    inputs:
      hypothesis_id:
        description: "Run a specific hypothesis ID only (blank = all running)"
        required: false
        default: ""

env:
  PYTHON_VERSION: "3.10"

jobs:
  run-hypotheses:
    runs-on: ubuntu-latest
    environment: TradingAgent
    timeout-minutes: 60
    permissions:
      contents: write
      pull-requests: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          token: ${{ secrets.GH_TOKEN }}

      - name: Set up git identity
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"          

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip

      - name: Install dependencies
        run: pip install --upgrade pip && pip install -e .

      - name: Run hypothesis experiments
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          FINNHUB_API_KEY: ${{ secrets.FINNHUB_API_KEY }}
          ALPHA_VANTAGE_API_KEY: ${{ secrets.ALPHA_VANTAGE_API_KEY }}
          FMP_API_KEY: ${{ secrets.FMP_API_KEY }}
          REDDIT_CLIENT_ID: ${{ secrets.REDDIT_CLIENT_ID }}
          REDDIT_CLIENT_SECRET: ${{ secrets.REDDIT_CLIENT_SECRET }}
          TRADIER_API_KEY: ${{ secrets.TRADIER_API_KEY }}
          FILTER_ID: ${{ inputs.hypothesis_id }}
        run: |
          python scripts/run_hypothesis_runner.py          

      - name: Commit active.json updates
        run: |
          git add docs/iterations/hypotheses/active.json || true
          if git diff --cached --quiet; then
            echo "No registry changes"
          else
            git commit -m "chore(hypotheses): update registry $(date -u +%Y-%m-%d)"
            git pull --rebase origin main
            git push origin main
          fi

Step 2: Write scripts/run_hypothesis_runner.py

Create scripts/run_hypothesis_runner.py:

#!/usr/bin/env python3
"""
Hypothesis Runner — orchestrates daily experiment cycles.

For each running hypothesis in active.json:
  1. Creates a git worktree for the hypothesis branch
  2. Runs the daily discovery pipeline in that worktree
  3. Extracts picks from the discovery result, appends to picks.json
  4. Commits and pushes picks to hypothesis branch
  5. Removes worktree
  6. Updates active.json (days_elapsed, picks_log)
  7. If days_elapsed >= min_days: concludes the hypothesis

After all hypotheses: promotes highest-priority pending → running if a slot opened.

Environment variables read:
  FILTER_ID — if set, only run the hypothesis with this ID
"""

import json
import os
import subprocess
import sys
from datetime import datetime, timedelta
from pathlib import Path

ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(ROOT))

ACTIVE_JSON = ROOT / "docs/iterations/hypotheses/active.json"
CONCLUDED_DIR = ROOT / "docs/iterations/hypotheses/concluded"
DB_PATH = ROOT / "data/recommendations/performance_database.json"
TODAY = datetime.utcnow().strftime("%Y-%m-%d")


def load_registry() -> dict:
    with open(ACTIVE_JSON) as f:
        return json.load(f)


def save_registry(registry: dict) -> None:
    with open(ACTIVE_JSON, "w") as f:
        json.dump(registry, f, indent=2)


def run(cmd: list, cwd: str = None, check: bool = True) -> subprocess.CompletedProcess:
    print(f"  $ {' '.join(cmd)}", flush=True)
    return subprocess.run(cmd, cwd=cwd or str(ROOT), check=check, capture_output=False)


def run_capture(cmd: list, cwd: str = None) -> str:
    result = subprocess.run(cmd, cwd=cwd or str(ROOT), capture_output=True, text=True)
    return result.stdout.strip()


def extract_picks(worktree: str, scanner: str) -> list:
    """
    Extract picks for the given scanner from the most recent discovery result
    in the worktree's results/discovery/<TODAY>/ directory.
    """
    results_dir = Path(worktree) / "results" / "discovery" / TODAY
    if not results_dir.exists():
        print(f"    No discovery results for {TODAY} in worktree", flush=True)
        return []

    picks = []
    for run_dir in sorted(results_dir.iterdir()):
        result_file = run_dir / "discovery_result.json"
        if not result_file.exists():
            continue
        try:
            with open(result_file) as f:
                data = json.load(f)
            for item in data.get("final_ranking", []):
                if item.get("strategy_match") == scanner:
                    picks.append({
                        "date": TODAY,
                        "ticker": item["ticker"],
                        "score": item.get("final_score"),
                        "confidence": item.get("confidence"),
                        "scanner": scanner,
                        "return_7d": None,
                        "win_7d": None,
                    })
        except Exception as e:
            print(f"    Warning: could not read {result_file}: {e}", flush=True)

    return picks


def load_picks_from_branch(hypothesis_id: str, branch: str) -> list:
    """Load picks.json from the hypothesis branch using git show."""
    picks_path = f"docs/iterations/hypotheses/{hypothesis_id}/picks.json"
    result = subprocess.run(
        ["git", "show", f"{branch}:{picks_path}"],
        cwd=str(ROOT),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    try:
        return json.loads(result.stdout).get("picks", [])
    except Exception:
        return []


def save_picks_to_worktree(worktree: str, hypothesis_id: str, scanner: str, picks: list) -> None:
    """Write updated picks.json into the worktree and commit."""
    picks_dir = Path(worktree) / "docs" / "iterations" / "hypotheses" / hypothesis_id
    picks_dir.mkdir(parents=True, exist_ok=True)
    picks_file = picks_dir / "picks.json"
    payload = {"hypothesis_id": hypothesis_id, "scanner": scanner, "picks": picks}
    picks_file.write_text(json.dumps(payload, indent=2))

    run(["git", "add", str(picks_file)], cwd=worktree)
    result = subprocess.run(
        ["git", "diff", "--cached", "--quiet"], cwd=worktree
    )
    if result.returncode != 0:
        run(
            ["git", "commit", "-m", f"chore(hypotheses): picks {TODAY} for {hypothesis_id}"],
            cwd=worktree,
        )


def run_hypothesis(hyp: dict) -> bool:
    """
    Run one hypothesis experiment cycle. Returns True if the experiment concluded.
    """
    hid = hyp["id"]
    branch = hyp["branch"]
    scanner = hyp["scanner"]
    worktree = f"/tmp/hyp-{hid}"

    print(f"\n── Hypothesis: {hid} ──", flush=True)

    # 1. Create worktree
    run(["git", "fetch", "origin", branch], check=False)
    run(["git", "worktree", "add", worktree, branch])

    try:
        # 2. Run discovery in worktree
        result = subprocess.run(
            [sys.executable, "scripts/run_daily_discovery.py", "--date", TODAY, "--no-update-positions"],
            cwd=worktree,
            check=False,
        )
        if result.returncode != 0:
            print(f"    Discovery failed for {hid}, skipping picks update", flush=True)
        else:
            # 3. Extract picks + merge with existing
            new_picks = extract_picks(worktree, scanner)
            existing_picks = load_picks_from_branch(hid, branch)
            # Deduplicate by (date, ticker)
            seen = {(p["date"], p["ticker"]) for p in existing_picks}
            merged = existing_picks + [p for p in new_picks if (p["date"], p["ticker"]) not in seen]

            # 4. Save picks + commit in worktree
            save_picks_to_worktree(worktree, hid, scanner, merged)

            # 5. Push hypothesis branch
            run(["git", "push", "origin", f"HEAD:{branch}"], cwd=worktree)

        # 6. Update registry fields
        if TODAY not in hyp.get("picks_log", []):
            hyp.setdefault("picks_log", []).append(TODAY)
        hyp["days_elapsed"] = len(hyp["picks_log"])

        # 7. Check conclusion
        if hyp["days_elapsed"] >= hyp["min_days"]:
            return conclude_hypothesis(hyp)

    finally:
        run(["git", "worktree", "remove", "--force", worktree], check=False)

    return False


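def conclude_hypothesis(hyp: dict) -> bool:
    """Run comparison, write conclusion doc, close/merge PR.

    Returns True once the hypothesis is concluded, or False if the
    comparison script failed and the hypothesis should stay running.
    """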
    hid = hyp["id"]
    scanner = hyp["scanner"]
    branch = hyp["branch"]

    print(f"\n  Concluding {hid}...", flush=True)

    # Load picks from branch
    picks = load_picks_from_branch(hid, branch)
    if not picks:
        print(f"    No picks found for {hid}, marking rejected", flush=True)
        conclusion = {
            "decision": "rejected",
            "reason": "No picks were collected during the experiment period",
            "hypothesis": {"count": 0, "evaluated": 0, "win_rate": None, "avg_return": None},
            "baseline": {"count": 0, "win_rate": None, "avg_return": None},
        }
    else:
        # Run comparison script
        result = subprocess.run(
            [
                sys.executable, "scripts/compare_hypothesis.py",
                "--hypothesis-id", hid,
                "--picks-json", json.dumps(picks),
                "--scanner", scanner,
                "--db-path", str(DB_PATH),
            ],
            cwd=str(ROOT),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print(f"    compare_hypothesis.py failed: {result.stderr}", flush=True)
            return False
        conclusion = json.loads(result.stdout)

    decision = conclusion["decision"]
    hyp_metrics = conclusion["hypothesis"]
    base_metrics = conclusion["baseline"]

    # Write concluded doc
    def _fmt(value) -> str:
        # Show a dash only for missing values; 0.0 is a real result.
        return "—" if value is None else f"{value}%"

    period_start = hyp.get("created_at", TODAY)
    concluded_doc = CONCLUDED_DIR / f"{TODAY}-{hid}.md"
    concluded_doc.write_text(
        f"# Hypothesis: {hyp['title']}\n\n"
        f"**Scanner:** {scanner}\n"
        f"**Branch:** {branch}\n"
        f"**Period:** {period_start} → {TODAY} ({hyp['days_elapsed']} days)\n"
        f"**Outcome:** {'accepted ✅' if decision == 'accepted' else 'rejected ❌'}\n\n"
        f"## Hypothesis\n{hyp.get('description', hyp['title'])}\n\n"
        f"## Results\n\n"
        f"| Metric | Baseline | Experiment | Delta |\n"
        f"|---|---|---|---|\n"
        f"| 7d win rate | {_fmt(base_metrics.get('win_rate'))} | "
        f"{_fmt(hyp_metrics.get('win_rate'))} | "
        f"{_delta_str(hyp_metrics.get('win_rate'), base_metrics.get('win_rate'), 'pp')} |\n"
        f"| Avg return | {_fmt(base_metrics.get('avg_return'))} | "
        f"{_fmt(hyp_metrics.get('avg_return'))} | "
        f"{_delta_str(hyp_metrics.get('avg_return'), base_metrics.get('avg_return'), '%')} |\n"
        f"| Picks | {base_metrics.get('count', '—')} | {hyp_metrics.get('count', '—')} | — |\n\n"
        f"## Decision\n{conclusion['reason']}\n\n"
        f"## Action\n"
        f"{'Branch merged into main.' if decision == 'accepted' else 'Branch closed without merging.'}\n"
    )

    run(["git", "add", str(concluded_doc)], check=False)

    # Close or merge PR
    pr = hyp.get("pr_number")
    if pr:
        if decision == "accepted":
            subprocess.run(
                ["gh", "pr", "merge", str(pr), "--squash", "--delete-branch"],
                cwd=str(ROOT), check=False,
            )
        else:
            subprocess.run(
                ["gh", "pr", "close", str(pr), "--delete-branch"],
                cwd=str(ROOT), check=False,
            )

    # Update registry entry
    hyp["status"] = "concluded"
    hyp["conclusion"] = decision

    print(f"  {hid}: {decision} — {conclusion['reason']}", flush=True)
    return True


def _delta_str(hyp_val, base_val, unit: str) -> str:
    if hyp_val is None or base_val is None:
        return "—"
    delta = hyp_val - base_val
    sign = "+" if delta >= 0 else ""
    return f"{sign}{delta:.1f}{unit}"


def promote_pending(registry: dict) -> None:
    """Promote the highest-priority pending hypothesis to running if a slot is open."""
    running_count = sum(1 for h in registry["hypotheses"] if h["status"] == "running")
    max_active = registry.get("max_active", 5)
    if running_count >= max_active:
        return

    pending = [h for h in registry["hypotheses"] if h["status"] == "pending"]
    if not pending:
        return

    # Promote highest priority
    to_promote = max(pending, key=lambda h: h.get("priority", 0))
    to_promote["status"] = "running"
    print(f"\n  Promoted pending hypothesis to running: {to_promote['id']}", flush=True)


def main():
    registry = load_registry()
    filter_id = os.environ.get("FILTER_ID", "").strip()

    hypotheses = registry.get("hypotheses", [])
    running = [
        h for h in hypotheses
        if h["status"] == "running" and (not filter_id or h["id"] == filter_id)
    ]

    if not running:
        print("No running hypotheses to process.", flush=True)
    else:
        for hyp in running:
            run_hypothesis(hyp)

    promote_pending(registry)
    save_registry(registry)
    print("\nRegistry updated.", flush=True)


if __name__ == "__main__":
    main()
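
The slot rule in promote_pending() is easy to check in isolation. The sketch below copies its logic into a standalone function (the name promote_one and the sample registry are hypothetical, used only for illustration) and shows that each call promotes at most one pending entry, choosing the one with the highest priority:

```python
from typing import Any, Dict, List


def promote_one(registry: Dict[str, Any]) -> None:
    """Standalone copy of the promote_pending() rule, for illustration only."""
    hypotheses: List[Dict[str, Any]] = registry["hypotheses"]
    running_count = sum(1 for h in hypotheses if h["status"] == "running")
    if running_count >= registry.get("max_active", 5):
        return  # no free slot, nothing is promoted
    pending = [h for h in hypotheses if h["status"] == "pending"]
    if pending:
        # Highest priority wins; a missing priority counts as 0.
        max(pending, key=lambda h: h.get("priority", 0))["status"] = "running"


# Hypothetical registry: two slots in total, one already taken.
registry = {
    "max_active": 2,
    "hypotheses": [
        {"id": "a", "status": "running", "priority": 5},
        {"id": "b", "status": "pending", "priority": 3},
        {"id": "c", "status": "pending", "priority": 8},
    ],
}

promote_one(registry)
statuses = {h["id"]: h["status"] for h in registry["hypotheses"]}
print(statuses)

# A second call finds both slots taken and leaves "b" pending.
promote_one(registry)
statuses_after_second_call = {h["id"]: h["status"] for h in registry["hypotheses"]}
print(statuses_after_second_call)
```

Because main() calls promote_pending() once per run, a run that frees several slots would, under this reading, refill them one per day rather than all at once.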

Step 3: Verify the workflow YAML is valid

if python3 -c "import yaml" 2>/dev/null; then
  # PyYAML available: a parse error fails this step loudly
  python3 -c "import yaml; yaml.safe_load(open('.github/workflows/hypothesis-runner.yml')); print('workflow YAML parses')"
else
  # PyYAML missing: fall back to checking that the cron line is present
  python3 -c "
with open('.github/workflows/hypothesis-runner.yml') as f:
    content = f.read()
assert '0 8 * * *' in content, 'missing cron'
print('workflow file looks good')
"
fi

Step 4: Commit

git add .github/workflows/hypothesis-runner.yml scripts/run_hypothesis_runner.py
git commit -m "feat(hypotheses): add daily hypothesis runner workflow"

Task 5: Dashboard Hypotheses Tab

Files:

Create: tradingagents/ui/pages/hypotheses.py
Modify: tradingagents/ui/pages/__init__.py
Modify: tradingagents/ui/dashboard.py
Step 1: Write the failing test

Create tests/test_hypotheses_page.py:

"""Tests for the hypotheses dashboard page data loading."""
import json
import sys
from pathlib import Path

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent))


from tradingagents.ui.pages.hypotheses import (
    load_active_hypotheses,
    load_concluded_hypotheses,
    days_until_ready,
)


# ── load_active_hypotheses ────────────────────────────────────────────────────

def test_load_active_hypotheses(tmp_path):
    active = {
        "max_active": 5,
        "hypotheses": [
            {
                "id": "options_flow-test",
                "title": "Test hypothesis",
                "scanner": "options_flow",
                "status": "running",
                "priority": 7,
                "days_elapsed": 5,
                "min_days": 14,
                "created_at": "2026-04-01",
                "picks_log": ["2026-04-01"] * 5,
                "conclusion": None,
            }
        ],
    }
    f = tmp_path / "active.json"
    f.write_text(json.dumps(active))

    result = load_active_hypotheses(str(f))
    assert len(result) == 1
    assert result[0]["id"] == "options_flow-test"


def test_load_active_hypotheses_missing_file(tmp_path):
    result = load_active_hypotheses(str(tmp_path / "missing.json"))
    assert result == []


# ── load_concluded_hypotheses ─────────────────────────────────────────────────

def test_load_concluded_hypotheses(tmp_path):
    doc = tmp_path / "2026-04-10-options_flow-test.md"
    doc.write_text(
        "# Hypothesis: Test\n\n"
        "**Scanner:** options_flow\n"
        "**Period:** 2026-03-27 → 2026-04-10 (14 days)\n"
        "**Outcome:** accepted ✅\n"
    )

    results = load_concluded_hypotheses(str(tmp_path))
    assert len(results) == 1
    assert results[0]["filename"] == doc.name
    assert results[0]["outcome"] == "accepted ✅"


def test_load_concluded_hypotheses_empty_dir(tmp_path):
    results = load_concluded_hypotheses(str(tmp_path))
    assert results == []


# ── days_until_ready ──────────────────────────────────────────────────────────

def test_days_until_ready_has_days_left():
    hyp = {"days_elapsed": 5, "min_days": 14}
    assert days_until_ready(hyp) == 9


def test_days_until_ready_past_due():
    hyp = {"days_elapsed": 15, "min_days": 14}
    assert days_until_ready(hyp) == 0

Step 2: Run tests to confirm they fail

python -m pytest tests/test_hypotheses_page.py -v 2>&1 | head -20

Expected: ModuleNotFoundError for tradingagents.ui.pages.hypotheses.

Step 3: Write tradingagents/ui/pages/hypotheses.py

"""
Hypotheses dashboard page — tracks active and concluded experiments.

Reads docs/iterations/hypotheses/active.json and the concluded/ directory.
No external API calls; all data is file-based.
"""

import json
import re
from pathlib import Path
from typing import Any, Dict, List

import streamlit as st

from tradingagents.ui.theme import COLORS, page_header

_REPO_ROOT = Path(__file__).parent.parent.parent.parent
_ACTIVE_JSON = _REPO_ROOT / "docs/iterations/hypotheses/active.json"
_CONCLUDED_DIR = _REPO_ROOT / "docs/iterations/hypotheses/concluded"


# ── Data loaders ─────────────────────────────────────────────────────────────


def load_active_hypotheses(active_path: str = str(_ACTIVE_JSON)) -> List[Dict[str, Any]]:
    """Load all hypotheses from active.json. Returns [] if file missing."""
    path = Path(active_path)
    if not path.exists():
        return []
    try:
        with open(path) as f:
            data = json.load(f)
        return data.get("hypotheses", [])
    except Exception:
        return []


def load_concluded_hypotheses(concluded_dir: str = str(_CONCLUDED_DIR)) -> List[Dict[str, Any]]:
    """
    Load concluded hypothesis metadata by parsing the markdown files in concluded/.

    Extracts: filename, title, scanner, period, outcome from each .md file.
    """
    dir_path = Path(concluded_dir)
    if not dir_path.exists():
        return []

    results = []
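    for md_file in sorted(dir_path.glob("*.md"), reverse=True):
        # glob("*.md") never matches .gitkeep, so no extra filter is needed.
        try:
            # Concluded docs contain non-ASCII characters (→, ✅, ❌).
            text = md_file.read_text(encoding="utf-8")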
            title = _extract_md_field(text, r"^# Hypothesis: (.+)$")
            scanner = _extract_md_field(text, r"^\*\*Scanner:\*\* (.+)$")
            period = _extract_md_field(text, r"^\*\*Period:\*\* (.+)$")
            outcome = _extract_md_field(text, r"^\*\*Outcome:\*\* (.+)$")
            results.append({
                "filename": md_file.name,
                "title": title or md_file.stem,
                "scanner": scanner or "—",
                "period": period or "—",
                "outcome": outcome or "—",
            })
        except Exception:
            continue

    return results


def _extract_md_field(text: str, pattern: str) -> str:
    """Extract a field value from a markdown line using regex."""
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else ""


def days_until_ready(hyp: Dict[str, Any]) -> int:
    """Return number of days remaining before hypothesis can conclude (min 0)."""
    return max(0, hyp.get("min_days", 14) - hyp.get("days_elapsed", 0))


# ── Rendering ─────────────────────────────────────────────────────────────────


def render() -> None:
    """Render the hypotheses tracking page."""
    st.markdown(
        page_header("Hypotheses", "Active experiments & concluded findings"),
        unsafe_allow_html=True,
    )

    hypotheses = load_active_hypotheses()
    concluded = load_concluded_hypotheses()

    if not hypotheses and not concluded:
        st.info(
            "No hypotheses yet. Run `/backtest-hypothesis \"<description>\"` to start an experiment."
        )
        return

    # ── Active experiments ────────────────────────────────────────────────────
    running = [h for h in hypotheses if h["status"] == "running"]
    pending = [h for h in hypotheses if h["status"] == "pending"]

    st.markdown(
        f'<div class="section-title">Active Experiments '
        f'<span class="accent">// {len(running)} running, {len(pending)} pending</span></div>',
        unsafe_allow_html=True,
    )

    if running or pending:
        active_rows = []
        for h in sorted(running + pending, key=lambda x: -x.get("priority", 0)):
            days_left = days_until_ready(h)
            ready_str = "concluding soon" if days_left == 0 else f"{days_left}d left"
            status_color = COLORS["green"] if h["status"] == "running" else COLORS["amber"]
            active_rows.append({
                "ID": h["id"],
                "Title": h.get("title", "—"),
                "Scanner": h.get("scanner", "—"),
                "Status": h["status"],
                "Progress": f"{h.get('days_elapsed', 0)}/{h.get('min_days', 14)}d",
                "Picks": len(h.get("picks_log", [])),
                "Ready": ready_str,
                # Keep the column numeric: a missing priority is None, not a string.
                "Priority": h.get("priority"),
            })

        import pandas as pd
        df = pd.DataFrame(active_rows)
        st.dataframe(
            df,
            width="stretch",
            hide_index=True,
            column_config={
                "ID": st.column_config.TextColumn(width="medium"),
                "Title": st.column_config.TextColumn(width="large"),
                "Scanner": st.column_config.TextColumn(width="medium"),
                "Status": st.column_config.TextColumn(width="small"),
                "Progress": st.column_config.TextColumn(width="small"),
                "Picks": st.column_config.NumberColumn(format="%d", width="small"),
                "Ready": st.column_config.TextColumn(width="medium"),
                "Priority": st.column_config.NumberColumn(format="%d/9", width="small"),
            },
        )
    else:
        st.info("No active experiments.")

    st.markdown("<div style='height:1.5rem;'></div>", unsafe_allow_html=True)

    # ── Concluded experiments ─────────────────────────────────────────────────
    st.markdown(
        f'<div class="section-title">Concluded Experiments '
        f'<span class="accent">// {len(concluded)} total</span></div>',
        unsafe_allow_html=True,
    )

    if concluded:
        import pandas as pd
        concluded_rows = []
        for c in concluded:
            outcome = c["outcome"]
            emoji = "✅" if "accepted" in outcome else "❌"
            concluded_rows.append({
                "Date": c["filename"][:10],
                "Title": c["title"],
                "Scanner": c["scanner"],
                "Period": c["period"],
                "Outcome": emoji,
            })
        cdf = pd.DataFrame(concluded_rows)
        st.dataframe(
            cdf,
            width="stretch",
            hide_index=True,
            column_config={
                "Date": st.column_config.TextColumn(width="small"),
                "Title": st.column_config.TextColumn(width="large"),
                "Scanner": st.column_config.TextColumn(width="medium"),
                "Period": st.column_config.TextColumn(width="medium"),
                "Outcome": st.column_config.TextColumn(width="small"),
            },
        )
    else:
        st.info("No concluded experiments yet.")
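
The concluded table depends on load_concluded_hypotheses() parsing the exact line shapes that conclude_hypothesis() writes. As a minimal sketch of that coupling, the snippet below re-implements the field extraction on a sample document (the standalone function name extract_md_field and the sample text are illustrative); a field whose line is absent comes back as an empty string, which the loader then replaces with a dash:

```python
import re


def extract_md_field(text: str, pattern: str) -> str:
    """Return the first capture group of pattern, or "" when no line matches."""
    # re.MULTILINE makes ^ and $ anchor at each line, not just the whole text.
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else ""


# Sample document in the shape the runner writes (no Branch line on purpose).
doc = (
    "# Hypothesis: Test\n\n"
    "**Scanner:** options_flow\n"
    "**Period:** 2026-03-27 → 2026-04-10 (14 days)\n"
    "**Outcome:** accepted ✅\n"
)

fields = {
    "title": extract_md_field(doc, r"^# Hypothesis: (.+)$"),
    "scanner": extract_md_field(doc, r"^\*\*Scanner:\*\* (.+)$"),
    "outcome": extract_md_field(doc, r"^\*\*Outcome:\*\* (.+)$"),
    "branch": extract_md_field(doc, r"^\*\*Branch:\*\* (.+)$"),
}
print(fields)
```

If the runner's header lines ever change (for example, a renamed label), the patterns here need to change with them, otherwise the dashboard silently shows dashes.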

Step 4: Run tests to confirm they pass

python -m pytest tests/test_hypotheses_page.py -v

Expected: all 6 tests pass.

Step 5: Register the page in tradingagents/ui/pages/__init__.py

Add after the settings import block (around line 38):

try:
    from tradingagents.ui.pages import hypotheses
except Exception as _e:
    _logger.error("Failed to import hypotheses page: %s", _e, exc_info=True)
    hypotheses = None

And add "hypotheses" to __all__:

__all__ = [
    "home",
    "todays_picks",
    "portfolio",
    "performance",
    "settings",
    "hypotheses",
]

Step 6: Add "Hypotheses" to dashboard navigation in tradingagents/ui/dashboard.py

In render_sidebar, change the options list:

page = st.radio(
    "Navigation",
    options=["Overview", "Signals", "Portfolio", "Performance", "Hypotheses", "Config"],
    label_visibility="collapsed",
)

In route_page, add to page_map:

page_map = {
    "Overview": pages.home,
    "Signals": pages.todays_picks,
    "Portfolio": pages.portfolio,
    "Performance": pages.performance,
    "Hypotheses": pages.hypotheses,
    "Config": pages.settings,
}
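
Step 5 sets hypotheses = None when the import fails, so the page_map lookup can yield None. How route_page dispatches is not shown in this plan; assuming it calls .render() on the looked-up module, a guard along these lines would avoid an AttributeError. The names dispatch and warn are hypothetical, and SimpleNamespace stands in for the real page modules:

```python
from types import SimpleNamespace
from typing import Callable, Dict, List, Optional


def dispatch(
    page: str,
    page_map: Dict[str, Optional[SimpleNamespace]],
    warn: Callable[[str], None],
) -> bool:
    """Render the selected page, or report it as unavailable (illustrative only)."""
    module = page_map.get(page)
    if module is None:
        # Covers both an unknown page name and a page whose import failed.
        warn(f"Page '{page}' is unavailable.")
        return False
    module.render()
    return True


rendered: List[str] = []
messages: List[str] = []

page_map = {
    "Overview": SimpleNamespace(render=lambda: rendered.append("Overview")),
    "Hypotheses": None,  # simulates the failed-import fallback from Step 5
}

ok_overview = dispatch("Overview", page_map, messages.append)
ok_hypotheses = dispatch("Hypotheses", page_map, messages.append)
print(rendered, messages)
```

In the dashboard itself, warn could plausibly be st.warning, but that wiring is an assumption and not part of the plan.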

Step 7: Run both hypothesis test files

python -m pytest tests/test_compare_hypothesis.py tests/test_hypotheses_page.py -v

Expected: all 16 tests pass.

Step 8: Commit everything

git add \
  tradingagents/ui/pages/hypotheses.py \
  tradingagents/ui/pages/__init__.py \
  tradingagents/ui/dashboard.py \
  tests/test_hypotheses_page.py
git commit -m "feat(hypotheses): add Hypotheses dashboard tab"

Self-Review

Spec coverage check:

✅ active.json schema with status: running/pending/concluded — Task 1
✅ /backtest-hypothesis command: classify, priority scoring, pending queue, branch creation — Task 3
✅ Running experiments never paused — enforced in run_hypothesis_runner.py (only running entries processed; new ones queue as pending)
✅ Daily runner: worktree per hypothesis, run discovery, commit picks, conclude — Task 4
✅ Statistical comparison with 5pp / 1% thresholds, minimum 5 evaluated picks — Task 2
✅ Auto-promote pending → running when slot opens — promote_pending() in runner
✅ Concluded doc written with metrics table — conclude_hypothesis() in runner
✅ PR merged (accepted) or closed (rejected) automatically — conclude_hypothesis()
✅ Dashboard tab with active + concluded tables — Task 5

Type/name consistency:

hypothesis_id / hid / id field: the dict key is always "id", the local var is hid, the argument is --hypothesis-id — consistent throughout
picks.json structure: {"hypothesis_id": ..., "scanner": ..., "picks": [...]} — used in save_picks_to_worktree and load_picks_from_branch consistently
strategy_match field used to filter picks in extract_picks — matches discovery_result.json structure confirmed by inspection
