# Hypothesis Backtesting System — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build a branch-per-hypothesis experimentation system that runs scanner code changes daily in isolation, accumulates picks, auto-concludes with a statistical comparison, and surfaces everything in the dashboard.

**Architecture:** `active.json` is the registry (lives on `main`). Each hypothesis gets a `hypothesis/<scanner>-<slug>` branch with the code change. A daily workflow (08:00 UTC) uses git worktrees to run discovery on each branch, stores picks in `docs/iterations/hypotheses/<id>/picks.json` on the hypothesis branch, and concludes the experiment once `min_days` days have elapsed. The `/backtest-hypothesis` command classifies the hypothesis, creates the branch, and manages the registry.

**Tech Stack:** Python 3.10, yfinance (`download_history`), GitHub Actions, Streamlit, `gh` CLI, `git worktree`

---

## File Map

| Path | Action | Purpose |
|---|---|---|
| `docs/iterations/hypotheses/active.json` | Create | Registry of all experiments |
| `docs/iterations/hypotheses/concluded/.gitkeep` | Create | Directory placeholder |
| `scripts/compare_hypothesis.py` | Create | Fetch returns + statistical comparison |
| `.claude/commands/backtest-hypothesis.md` | Create | `/backtest-hypothesis` Claude command |
| `.github/workflows/hypothesis-runner.yml` | Create | Daily 08:00 UTC runner |
| `tradingagents/ui/pages/hypotheses.py` | Create | Dashboard "Hypotheses" tab |
| `tradingagents/ui/pages/__init__.py` | Modify | Register new page |
| `tradingagents/ui/dashboard.py` | Modify | Add "Hypotheses" to nav |

---

## Task 1: Hypothesis Registry Structure

**Files:**
- Create: `docs/iterations/hypotheses/active.json`
- Create: `docs/iterations/hypotheses/concluded/.gitkeep`

- [ ] **Step 1: Create the directory and initial `active.json`**

```bash
mkdir -p docs/iterations/hypotheses/concluded
```

Write `docs/iterations/hypotheses/active.json`:

```json
{
  "max_active": 5,
  "hypotheses": []
}
```

- [ ] **Step 2: Create the concluded directory placeholder**

```bash
touch docs/iterations/hypotheses/concluded/.gitkeep
```

- [ ] **Step 3: Verify JSON is valid**

```bash
python3 -c "import json; json.load(open('docs/iterations/hypotheses/active.json')); print('valid')"
```

Expected: `valid`
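
`json.load` only confirms that the file parses. As a hedged sketch, the registry's shape could also be checked with a small helper like the one below. `validate_registry` is a hypothetical name that this plan does not define, and the status set is an assumption drawn from the values used later in this plan (`running`, `pending`, `concluded`).

```python
# Hypothetical helper, not part of the plan's scripts.
# Status values are assumed from their use elsewhere in this plan.
KNOWN_STATUSES = {"running", "pending", "concluded"}


def validate_registry(registry: dict) -> list:
    """Return a list of human-readable problems; an empty list means the shape looks fine."""
    problems = []

    max_active = registry.get("max_active")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(max_active, bool) or not isinstance(max_active, int) or max_active < 1:
        problems.append("max_active must be a positive integer")

    hypotheses = registry.get("hypotheses")
    if not isinstance(hypotheses, list):
        problems.append("hypotheses must be a list")
        return problems

    seen_ids = set()
    for i, hyp in enumerate(hypotheses):
        if not isinstance(hyp, dict):
            problems.append(f"hypotheses[{i}] must be an object")
            continue
        hid = hyp.get("id")
        if not hid:
            problems.append(f"hypotheses[{i}] is missing an id")
        elif hid in seen_ids:
            problems.append(f"duplicate id: {hid}")
        else:
            seen_ids.add(hid)
        if hyp.get("status") not in KNOWN_STATUSES:
            problems.append(f"hypotheses[{i}] has unknown status {hyp.get('status')!r}")

    return problems


# The initial registry from Step 1 has no problems.
print(validate_registry({"max_active": 5, "hypotheses": []}))
```

For the file on disk, the same check would be `validate_registry(json.load(open("docs/iterations/hypotheses/active.json")))`.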

- [ ] **Step 4: Commit**

```bash
git add docs/iterations/hypotheses/
git commit -m "feat(hypotheses): initialize hypothesis registry"
```

---

## Task 2: Comparison Script

**Files:**
- Create: `scripts/compare_hypothesis.py`
- Create: `tests/test_compare_hypothesis.py`

`★ Insight ─────────────────────────────────────`
The runner reads picks from the hypothesis branch via `git show <branch>:<path>` and passes them to the comparison script as `--picks-json`. This avoids checking out the branch just to read a file, so the working tree stays on `main` throughout.
`─────────────────────────────────────────────────`

- [ ] **Step 1: Write the failing tests**

Create `tests/test_compare_hypothesis.py`:

```python
"""Tests for the hypothesis comparison script."""
import json
import subprocess
import sys
from datetime import date, timedelta
from pathlib import Path
from unittest.mock import MagicMock, patch

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent))

from scripts.compare_hypothesis import (
    compute_metrics,
    compute_7d_return,
    load_baseline_metrics,
    make_decision,
)


# ── compute_metrics ──────────────────────────────────────────────────────────

def test_compute_metrics_empty():
    result = compute_metrics([])
    assert result == {"count": 0, "evaluated": 0, "win_rate": None, "avg_return": None}


def test_compute_metrics_all_wins():
    picks = [
        {"return_7d": 5.0, "win_7d": True},
        {"return_7d": 3.0, "win_7d": True},
    ]
    result = compute_metrics(picks)
    assert result["win_rate"] == 100.0
    assert result["avg_return"] == 4.0
    assert result["evaluated"] == 2


def test_compute_metrics_mixed():
    picks = [
        {"return_7d": 10.0, "win_7d": True},
        {"return_7d": -5.0, "win_7d": False},
        {"return_7d": None, "win_7d": None},   # pending — excluded
    ]
    result = compute_metrics(picks)
    assert result["win_rate"] == 50.0
    assert result["avg_return"] == 2.5
    assert result["evaluated"] == 2
    assert result["count"] == 3


# ── compute_7d_return ────────────────────────────────────────────────────────

def test_compute_7d_return_positive():
    mock_df = MagicMock()
    mock_df.empty = False
    # Simulate DataFrame with Close column: entry=100, exit=110
    mock_df.__len__ = lambda self: 2
    mock_df["Close"].iloc.__getitem__ = MagicMock(side_effect=lambda i: 100.0 if i == 0 else 110.0)

    with patch("scripts.compare_hypothesis.download_history", return_value=mock_df):
        ret, win = compute_7d_return("AAPL", "2026-03-01")

    assert ret == pytest.approx(10.0, rel=0.01)
    assert win is True


def test_compute_7d_return_empty_data():
    mock_df = MagicMock()
    mock_df.empty = True

    with patch("scripts.compare_hypothesis.download_history", return_value=mock_df):
        ret, win = compute_7d_return("AAPL", "2026-03-01")

    assert ret is None
    assert win is None


# ── load_baseline_metrics ────────────────────────────────────────────────────

def test_load_baseline_metrics(tmp_path):
    db = {
        "recommendations_by_date": {
            "2026-03-01": [
                {"strategy_match": "options_flow", "return_7d": 5.0, "win_7d": True},
                {"strategy_match": "options_flow", "return_7d": -2.0, "win_7d": False},
                {"strategy_match": "reddit_dd", "return_7d": 3.0, "win_7d": True},
            ]
        }
    }
    db_file = tmp_path / "performance_database.json"
    db_file.write_text(json.dumps(db))

    result = load_baseline_metrics("options_flow", str(db_file))

    assert result["win_rate"] == 50.0
    assert result["avg_return"] == 1.5
    assert result["count"] == 2


def test_load_baseline_metrics_missing_file(tmp_path):
    result = load_baseline_metrics("options_flow", str(tmp_path / "missing.json"))
    assert result == {"count": 0, "win_rate": None, "avg_return": None}


# ── make_decision ─────────────────────────────────────────────────────────────

def test_make_decision_accepted_by_win_rate():
    hyp = {"win_rate": 60.0, "avg_return": 0.5, "evaluated": 10}
    baseline = {"win_rate": 50.0, "avg_return": 0.5}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "accepted"
    assert "win rate" in reason.lower()


def test_make_decision_accepted_by_return():
    hyp = {"win_rate": 52.0, "avg_return": 3.0, "evaluated": 10}
    baseline = {"win_rate": 50.0, "avg_return": 1.5}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "accepted"
    assert "return" in reason.lower()


def test_make_decision_rejected():
    hyp = {"win_rate": 48.0, "avg_return": 0.2, "evaluated": 10}
    baseline = {"win_rate": 50.0, "avg_return": 1.0}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "rejected"


def test_make_decision_insufficient_data():
    hyp = {"win_rate": 80.0, "avg_return": 5.0, "evaluated": 2}
    baseline = {"win_rate": 50.0, "avg_return": 1.0}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "rejected"
    assert "insufficient" in reason.lower()
```

- [ ] **Step 2: Run tests to confirm they fail**

```bash
python -m pytest tests/test_compare_hypothesis.py -v 2>&1 | head -30
```

Expected: `ModuleNotFoundError: No module named 'scripts.compare_hypothesis'` or similar import error — confirms tests are wired correctly.

- [ ] **Step 3: Write `scripts/compare_hypothesis.py`**

```python
#!/usr/bin/env python3
"""
Hypothesis comparison — computes 7d returns for hypothesis picks and
compares them against the baseline scanner in performance_database.json.

Usage (called by hypothesis-runner.yml after min_days elapsed):
    python scripts/compare_hypothesis.py \\
        --hypothesis-id options_flow-scan-3-expirations \\
        --picks-json '[...]' \\
        --scanner options_flow \\
        --db-path data/recommendations/performance_database.json

Prints a JSON conclusion to stdout:
    {
      "decision": "accepted",
      "reason": "...",
      "hypothesis": {"win_rate": 58.0, "avg_return": 1.8, "count": 14, "evaluated": 10},
      "baseline":   {"win_rate": 42.0, "avg_return": -0.3, "count": 87}
    }
"""

import argparse
import json
import sys
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional, Tuple

ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(ROOT))

from tradingagents.dataflows.y_finance import download_history


# Minimum evaluated picks required to make a decision
_MIN_EVALUATED = 5
# Thresholds from spec
_WIN_RATE_DELTA_THRESHOLD = 5.0   # percentage points
_AVG_RETURN_DELTA_THRESHOLD = 1.0  # percent


def compute_7d_return(ticker: str, pick_date: str) -> Tuple[Optional[float], Optional[bool]]:
    """
    Fetch 7-day return for a pick using yfinance.

    Args:
        ticker: Stock symbol, e.g. "AAPL"
        pick_date: Date the pick was made, "YYYY-MM-DD"

    Returns:
        (return_pct, is_win) or (None, None) if data unavailable
    """
    try:
        entry_dt = datetime.strptime(pick_date, "%Y-%m-%d")
        exit_dt = entry_dt + timedelta(days=10)  # +3 buffer for weekends/holidays
        df = download_history(
            ticker,
            start=entry_dt.strftime("%Y-%m-%d"),
            end=exit_dt.strftime("%Y-%m-%d"),
        )
        if df.empty or len(df) < 2:
            return None, None

        # Use the first available close as entry and the close about 5 trading
        # days later (~7 calendar days) as exit
        close = df["Close"]
        entry_price = float(close.iloc[0])
        exit_idx = min(5, len(close) - 1)  # clamp to the last row if fewer bars came back
        exit_price = float(close.iloc[exit_idx])

        if entry_price <= 0:
            return None, None

        ret = (exit_price - entry_price) / entry_price * 100
        return round(ret, 4), ret > 0

    except Exception:
        return None, None


def enrich_picks_with_returns(picks: list) -> list:
    """
    Compute 7d return for each pick that is old enough (>= 7 days) and
    doesn't already have return_7d populated.

    Args:
        picks: List of pick dicts with at least 'ticker' and 'date' fields

    Returns:
        Same list with return_7d and win_7d populated where possible
    """
    cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d")
    for pick in picks:
        if pick.get("return_7d") is not None:
            continue  # already computed
        if pick.get("date", "9999-99-99") > cutoff:
            continue  # too recent
        ret, win = compute_7d_return(pick["ticker"], pick["date"])
        pick["return_7d"] = ret
        pick["win_7d"] = win
    return picks


def compute_metrics(picks: list) -> dict:
    """
    Compute win rate and avg return for a list of picks.

    Only picks with non-None return_7d contribute to win_rate and avg_return.

    Returns:
        {"count": int, "evaluated": int, "win_rate": float|None, "avg_return": float|None}
    """
    evaluated = [p for p in picks if p.get("return_7d") is not None]
    if not evaluated:
        return {"count": len(picks), "evaluated": 0, "win_rate": None, "avg_return": None}

    wins = sum(1 for p in evaluated if p.get("win_7d"))
    avg_ret = sum(p["return_7d"] for p in evaluated) / len(evaluated)
    return {
        "count": len(picks),
        "evaluated": len(evaluated),
        "win_rate": round(wins / len(evaluated) * 100, 1),
        "avg_return": round(avg_ret, 2),
    }


def load_baseline_metrics(scanner: str, db_path: str) -> dict:
    """
    Load baseline metrics for a scanner from performance_database.json.

    Args:
        scanner: Scanner name, e.g. "options_flow"
        db_path: Path to performance_database.json

    Returns:
        {"count": int, "win_rate": float|None, "avg_return": float|None}
    """
    path = Path(db_path)
    if not path.exists():
        return {"count": 0, "win_rate": None, "avg_return": None}

    try:
        with open(path) as f:
            db = json.load(f)
    except Exception:
        return {"count": 0, "win_rate": None, "avg_return": None}

    picks = []
    for recs in db.get("recommendations_by_date", {}).values():
        for rec in (recs if isinstance(recs, list) else []):
            if rec.get("strategy_match") == scanner and rec.get("return_7d") is not None:
                picks.append(rec)

    return compute_metrics(picks)


def make_decision(hypothesis: dict, baseline: dict) -> Tuple[str, str]:
    """
    Decide accepted or rejected based on metrics delta.

    Rules:
    - Minimum _MIN_EVALUATED evaluated picks required
    - accepted if win_rate_delta > _WIN_RATE_DELTA_THRESHOLD (5pp)
      OR avg_return_delta > _AVG_RETURN_DELTA_THRESHOLD (1%)
    - rejected otherwise

    Returns:
        (decision, reason) where decision is "accepted" or "rejected"
    """
    evaluated = hypothesis.get("evaluated", 0)
    if evaluated < _MIN_EVALUATED:
        return "rejected", f"Insufficient data: only {evaluated} evaluated picks (need {_MIN_EVALUATED})"

    hyp_wr = hypothesis.get("win_rate")
    hyp_ret = hypothesis.get("avg_return")
    base_wr = baseline.get("win_rate")
    base_ret = baseline.get("avg_return")

    reasons = []

    if hyp_wr is not None and base_wr is not None:
        delta_wr = hyp_wr - base_wr
        if delta_wr > _WIN_RATE_DELTA_THRESHOLD:
            reasons.append(f"win rate improved by {delta_wr:+.1f}pp ({base_wr:.1f}% → {hyp_wr:.1f}%)")

    if hyp_ret is not None and base_ret is not None:
        delta_ret = hyp_ret - base_ret
        if delta_ret > _AVG_RETURN_DELTA_THRESHOLD:
            reasons.append(f"avg return improved by {delta_ret:+.2f}% ({base_ret:+.2f}% → {hyp_ret:+.2f}%)")

    if reasons:
        return "accepted", "; ".join(reasons)

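    have_wr = hyp_wr is not None and base_wr is not None    # both sides needed to format
    have_ret = hyp_ret is not None and base_ret is not None  # a None baseline would raise TypeError
    wr_str = f"{hyp_wr:.1f}% vs baseline {base_wr:.1f}%" if have_wr else "no win rate data"
    ret_str = f"{hyp_ret:+.2f}% vs baseline {base_ret:+.2f}%" if have_ret else "no return data"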
    return "rejected", f"No significant improvement — win rate: {wr_str}; avg return: {ret_str}"


def main():
    parser = argparse.ArgumentParser(description="Compare hypothesis picks against baseline")
    parser.add_argument("--hypothesis-id", required=True)
    parser.add_argument("--picks-json", required=True, help="JSON string of picks list")
    parser.add_argument("--scanner", required=True, help="Baseline scanner name")
    parser.add_argument(
        "--db-path",
        default="data/recommendations/performance_database.json",
        help="Path to performance_database.json",
    )
    args = parser.parse_args()

    picks = json.loads(args.picks_json)
    picks = enrich_picks_with_returns(picks)

    hyp_metrics = compute_metrics(picks)
    base_metrics = load_baseline_metrics(args.scanner, args.db_path)

    decision, reason = make_decision(hyp_metrics, base_metrics)

    result = {
        "hypothesis_id": args.hypothesis_id,
        "decision": decision,
        "reason": reason,
        "hypothesis": hyp_metrics,
        "baseline": base_metrics,
        "enriched_picks": picks,
    }
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
python -m pytest tests/test_compare_hypothesis.py -v
```

Expected: all 11 tests pass.
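
The acceptance rule uses strict inequalities, which matters at the boundary. The sketch below is a standalone illustration and not part of `compare_hypothesis.py`. It copies the two thresholds from the script and recomputes the deltas for the sample figures in the script's docstring and for a boundary case. The helper names `metric_deltas` and `clears_threshold` are hypothetical, and the minimum-evaluated-picks check is deliberately left out.

```python
# Thresholds copied from scripts/compare_hypothesis.py; everything else here
# is a hypothetical illustration, not part of the plan's script.
WIN_RATE_DELTA_THRESHOLD = 5.0    # percentage points
AVG_RETURN_DELTA_THRESHOLD = 1.0  # percent


def metric_deltas(hypothesis: dict, baseline: dict) -> dict:
    """Hypothesis minus baseline for both metrics, rounded for display."""
    return {
        "win_rate_pp": round(hypothesis["win_rate"] - baseline["win_rate"], 2),
        "avg_return_pct": round(hypothesis["avg_return"] - baseline["avg_return"], 2),
    }


def clears_threshold(deltas: dict) -> bool:
    """True when either delta is strictly greater than its threshold."""
    return (
        deltas["win_rate_pp"] > WIN_RATE_DELTA_THRESHOLD
        or deltas["avg_return_pct"] > AVG_RETURN_DELTA_THRESHOLD
    )


# Figures from the docstring example: 58.0% vs 42.0% win rate, +1.8% vs -0.3% return.
sample = metric_deltas(
    {"win_rate": 58.0, "avg_return": 1.8},
    {"win_rate": 42.0, "avg_return": -0.3},
)
print(sample, clears_threshold(sample))

# Boundary case: deltas of exactly 5.0pp and 1.0% do not clear the strict thresholds.
boundary = metric_deltas(
    {"win_rate": 55.0, "avg_return": 2.0},
    {"win_rate": 50.0, "avg_return": 1.0},
)
print(boundary, clears_threshold(boundary))
```

The docstring figures give deltas of +16.0pp and +2.1%, so either criterion alone would accept. A hypothesis that improves win rate by exactly 5pp and average return by exactly 1% is rejected under the rule as written.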

- [ ] **Step 5: Commit**

```bash
git add scripts/compare_hypothesis.py tests/test_compare_hypothesis.py
git commit -m "feat(hypotheses): add comparison + conclusion script"
```

---

## Task 3: `/backtest-hypothesis` Command

**Files:**
- Create: `.claude/commands/backtest-hypothesis.md`

- [ ] **Step 1: Write the command file**

Create `.claude/commands/backtest-hypothesis.md`:

````markdown
# /backtest-hypothesis

Test a hypothesis about a scanner improvement using branch-per-hypothesis isolation.

**Usage:** `/backtest-hypothesis "<description of the hypothesis>"`

**Example:** `/backtest-hypothesis "options_flow: scan 3 expirations instead of 1 to capture institutional 30+ DTE positioning"`

---

## Step 1: Read Current Registry

Read `docs/iterations/hypotheses/active.json`. Note:
- How many hypotheses currently have `status: "running"`
- The `max_active` limit (default 5)
- Any existing `pending` entries

Also read `docs/iterations/LEARNINGS.md` and the relevant scanner domain file in
`docs/iterations/scanners/` to understand the current baseline.

## Step 2: Classify the Hypothesis

Determine whether this is:

**Statistical** — answerable from existing data in `data/recommendations/performance_database.json`
without any code change. Examples:
- "Does high confidence (≥8) predict better 30d returns?"
- "Are options_flow picks that are ITM outperforming OTM ones?"

**Implementation** — requires a code change and forward-testing period. Examples:
- "Scan 3 expirations instead of 1"
- "Apply a premium filter of $50K instead of $25K"

## Step 3a: Statistical Path

If statistical: run the analysis now against `data/recommendations/performance_database.json`.
Write the finding to the relevant scanner domain file under **Evidence Log**. Print a summary.
Done — no branch needed.

## Step 3b: Implementation Path

### 3b-i: Capacity check

Count running hypotheses from `active.json`. If fewer than `max_active` running, proceed.
If at capacity: add the new hypothesis as `status: "pending"` — running experiments are NEVER
paused mid-streak. Inform the user which slot it queued behind and when it will likely start.

### 3b-ii: Score the hypothesis

Assign a `priority` score (1–9) using these factors:

| Factor | Score |
|---|---|
| Scanner 30d win rate < 40% | +3 |
| Change touches 1 file, 1 parameter | +2 |
| Directly addresses a weak spot in LEARNINGS.md | +2 |
| Scanner generates ≥2 picks/day (data accrues fast) | +1 |
| Supported by external research (arXiv, Alpha Architect, etc.) | +1 |
| Contradictory evidence or unclear direction | −2 |

### 3b-iii: Determine min_days

Set `min_days` based on the scanner's typical picks-per-day rate:
- ≥2 picks/day → 14 days
- 1 pick/day → 21 days
- <1 pick/day → 30 days

### 3b-iv: Create the branch and implement the code change

```bash
BRANCH="hypothesis/<scanner>-<slug>"
git checkout -b "$BRANCH"
```

Make the minimal code change that implements the hypothesis. Read the scanner file first.
Only change what the hypothesis requires — do not refactor surrounding code.

```bash
git add tradingagents/
git commit -m "hypothesis(<scanner>): <title>"
```

### 3b-v: Create picks tracking file on the branch

Create `docs/iterations/hypotheses/<id>/picks.json` on the hypothesis branch:

```json
{
  "hypothesis_id": "<id>",
  "scanner": "<scanner>",
  "picks": []
}
```

```bash
mkdir -p docs/iterations/hypotheses/<id>
# write the file
git add docs/iterations/hypotheses/<id>/picks.json
git commit -m "hypothesis(<scanner>): add picks tracker"
git push -u origin "$BRANCH"
```

### 3b-vi: Open a draft PR

```bash
gh pr create \
  --title "hypothesis(<scanner>): <title>" \
  --body "**Hypothesis:** <description>

**Expected impact:** <high/medium/low>
**Min days:** <N>
**Priority:** <score>/9

*This is an automated hypothesis experiment. It will be auto-concluded after <N> days of data.*" \
  --draft \
  --base main
```

Note the PR number from the output.

### 3b-vii: Update active.json on main

Check out `main`, then update `docs/iterations/hypotheses/active.json` to add the new entry:

```json
{
  "id": "<scanner>-<slug>",
  "scanner": "<scanner>",
  "title": "<title>",
  "description": "<description>",
  "branch": "hypothesis/<scanner>-<slug>",
  "pr_number": <N>,
  "status": "running",
  "priority": <score>,
  "expected_impact": "<high|medium|low>",
  "hypothesis_type": "implementation",
  "created_at": "<YYYY-MM-DD>",
  "min_days": <N>,
  "days_elapsed": 0,
  "picks_log": [],
  "baseline_scanner": "<scanner>",
  "conclusion": null
}
```

```bash
git checkout main
git add docs/iterations/hypotheses/active.json
git commit -m "feat(hypotheses): register hypothesis <id>"
git push origin main
```

## Step 4: Print Summary

Print a confirmation:
- Hypothesis ID and branch name
- Status: running or pending
- Expected conclusion date (created_at + min_days)
- PR link (if running)
- Priority score and why
````
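
The scoring table, the `min_days` rule, and the expected conclusion date in the command file are simple arithmetic, sketched below for reference. This is a hedged illustration under stated assumptions, not code the command runs: all function and parameter names are hypothetical, clamping the total into the stated 1–9 range is an assumption (the table itself can sum to as low as −2), and treating rates between 1 and 2 picks/day as the 21-day tier is also an assumption.

```python
from datetime import date, timedelta


def priority_score(
    *,
    win_rate_30d: float,
    single_file_single_param: bool,
    addresses_weak_spot: bool,
    picks_per_day: float,
    has_external_research: bool,
    contradictory_evidence: bool,
) -> int:
    """Sum the factors from the scoring table (hypothetical helper)."""
    score = 0
    if win_rate_30d < 40.0:
        score += 3
    if single_file_single_param:
        score += 2
    if addresses_weak_spot:
        score += 2
    if picks_per_day >= 2:
        score += 1
    if has_external_research:
        score += 1
    if contradictory_evidence:
        score -= 2
    # Assumption: clamp into the 1-9 range the command file states.
    return max(1, min(9, score))


def min_days_for(picks_per_day: float) -> int:
    """Map the scanner's pick rate to an experiment length in days."""
    if picks_per_day >= 2:
        return 14
    if picks_per_day >= 1:  # assumption: rates in [1, 2) use the 21-day tier
        return 21
    return 30


def expected_conclusion(created_at: str, min_days: int) -> str:
    """created_at + min_days, both as used in the registry entry."""
    return (date.fromisoformat(created_at) + timedelta(days=min_days)).isoformat()


print(min_days_for(2), expected_conclusion("2026-04-10", min_days_for(2)))
```

The date is best read as the earliest possible conclusion, since the runner derives `days_elapsed` from the length of `picks_log` rather than from the calendar.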

- [ ] **Step 2: Verify the file exists and is non-empty**

```bash
wc -l .claude/commands/backtest-hypothesis.md
```

Expected: at least 80 lines.

- [ ] **Step 3: Commit**

```bash
git add .claude/commands/backtest-hypothesis.md
git commit -m "feat(hypotheses): add /backtest-hypothesis command"
```

---

## Task 4: Hypothesis Runner Workflow

**Files:**
- Create: `.github/workflows/hypothesis-runner.yml`

- [ ] **Step 1: Write the workflow**

Create `.github/workflows/hypothesis-runner.yml`:

```yaml
name: Hypothesis Runner

on:
  schedule:
    # 08:00 UTC daily: runs after iterate (06:00) and before daily-discovery (12:30)
    - cron: "0 8 * * *"
  workflow_dispatch:
    inputs:
      hypothesis_id:
        description: "Run a specific hypothesis ID only (blank = all running)"
        required: false
        default: ""

env:
  PYTHON_VERSION: "3.10"

jobs:
  run-hypotheses:
    runs-on: ubuntu-latest
    environment: TradingAgent
    timeout-minutes: 60
    permissions:
      contents: write
      pull-requests: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          token: ${{ secrets.GH_TOKEN }}

      - name: Set up git identity
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip

      - name: Install dependencies
        run: pip install --upgrade pip && pip install -e .

      - name: Run hypothesis experiments
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          FINNHUB_API_KEY: ${{ secrets.FINNHUB_API_KEY }}
          ALPHA_VANTAGE_API_KEY: ${{ secrets.ALPHA_VANTAGE_API_KEY }}
          FMP_API_KEY: ${{ secrets.FMP_API_KEY }}
          REDDIT_CLIENT_ID: ${{ secrets.REDDIT_CLIENT_ID }}
          REDDIT_CLIENT_SECRET: ${{ secrets.REDDIT_CLIENT_SECRET }}
          TRADIER_API_KEY: ${{ secrets.TRADIER_API_KEY }}
          FILTER_ID: ${{ inputs.hypothesis_id }}
        run: |
          python scripts/run_hypothesis_runner.py

      - name: Commit active.json updates
        run: |
          git add docs/iterations/hypotheses/active.json || true
          if git diff --cached --quiet; then
            echo "No registry changes"
          else
            git commit -m "chore(hypotheses): update registry $(date -u +%Y-%m-%d)"
            git pull --rebase origin main
            git push origin main
          fi
```

- [ ] **Step 2: Write `scripts/run_hypothesis_runner.py`**

Create `scripts/run_hypothesis_runner.py`:

```python
#!/usr/bin/env python3
"""
Hypothesis Runner — orchestrates daily experiment cycles.

For each running hypothesis in active.json:
  1. Creates a git worktree for the hypothesis branch
  2. Runs the daily discovery pipeline in that worktree
  3. Extracts picks from the discovery result, appends to picks.json
  4. Commits and pushes picks to hypothesis branch
  5. Removes worktree
  6. Updates active.json (days_elapsed, picks_log)
  7. If days_elapsed >= min_days: concludes the hypothesis

After all hypotheses: promotes highest-priority pending → running if a slot opened.

Environment variables read:
  FILTER_ID — if set, only run the hypothesis with this ID
"""

import json
import os
import subprocess
import sys
from datetime import datetime, timedelta
from pathlib import Path

ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(ROOT))

ACTIVE_JSON = ROOT / "docs/iterations/hypotheses/active.json"
CONCLUDED_DIR = ROOT / "docs/iterations/hypotheses/concluded"
DB_PATH = ROOT / "data/recommendations/performance_database.json"
TODAY = datetime.utcnow().strftime("%Y-%m-%d")


def load_registry() -> dict:
    with open(ACTIVE_JSON) as f:
        return json.load(f)


def save_registry(registry: dict) -> None:
    with open(ACTIVE_JSON, "w") as f:
        json.dump(registry, f, indent=2)


def run(cmd: list, cwd: str = None, check: bool = True) -> subprocess.CompletedProcess:
    print(f"  $ {' '.join(cmd)}", flush=True)
    return subprocess.run(cmd, cwd=cwd or str(ROOT), check=check, capture_output=False)


def run_capture(cmd: list, cwd: str = None) -> str:
    result = subprocess.run(cmd, cwd=cwd or str(ROOT), capture_output=True, text=True)
    return result.stdout.strip()


def extract_picks(worktree: str, scanner: str) -> list:
    """
    Extract picks for the given scanner from the most recent discovery result
    in the worktree's results/discovery/<TODAY>/ directory.
    """
    results_dir = Path(worktree) / "results" / "discovery" / TODAY
    if not results_dir.exists():
        print(f"    No discovery results for {TODAY} in worktree", flush=True)
        return []

    picks = []
    for run_dir in sorted(results_dir.iterdir()):
        result_file = run_dir / "discovery_result.json"
        if not result_file.exists():
            continue
        try:
            with open(result_file) as f:
                data = json.load(f)
            for item in data.get("final_ranking", []):
                if item.get("strategy_match") == scanner:
                    picks.append({
                        "date": TODAY,
                        "ticker": item["ticker"],
                        "score": item.get("final_score"),
                        "confidence": item.get("confidence"),
                        "scanner": scanner,
                        "return_7d": None,
                        "win_7d": None,
                    })
        except Exception as e:
            print(f"    Warning: could not read {result_file}: {e}", flush=True)

    return picks


def load_picks_from_branch(hypothesis_id: str, branch: str) -> list:
    """Load picks.json from the hypothesis branch using git show."""
    picks_path = f"docs/iterations/hypotheses/{hypothesis_id}/picks.json"
    result = subprocess.run(
        ["git", "show", f"{branch}:{picks_path}"],
        cwd=str(ROOT),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    try:
        return json.loads(result.stdout).get("picks", [])
    except Exception:
        return []


def save_picks_to_worktree(worktree: str, hypothesis_id: str, scanner: str, picks: list) -> None:
    """Write updated picks.json into the worktree and commit."""
    picks_dir = Path(worktree) / "docs" / "iterations" / "hypotheses" / hypothesis_id
    picks_dir.mkdir(parents=True, exist_ok=True)
    picks_file = picks_dir / "picks.json"
    payload = {"hypothesis_id": hypothesis_id, "scanner": scanner, "picks": picks}
    picks_file.write_text(json.dumps(payload, indent=2))

    run(["git", "add", str(picks_file)], cwd=worktree)
    result = subprocess.run(
        ["git", "diff", "--cached", "--quiet"], cwd=worktree
    )
    if result.returncode != 0:
        run(
            ["git", "commit", "-m", f"chore(hypotheses): picks {TODAY} for {hypothesis_id}"],
            cwd=worktree,
        )


def run_hypothesis(hyp: dict) -> bool:
    """
    Run one hypothesis experiment cycle. Returns True if the experiment concluded.
    """
    hid = hyp["id"]
    branch = hyp["branch"]
    scanner = hyp["scanner"]
    worktree = f"/tmp/hyp-{hid}"

    print(f"\n── Hypothesis: {hid} ──", flush=True)

    # 1. Create worktree
    run(["git", "fetch", "origin", branch], check=False)
    run(["git", "worktree", "add", worktree, branch])

    try:
        # 2. Run discovery in worktree
        result = subprocess.run(
            [sys.executable, "scripts/run_daily_discovery.py", "--date", TODAY, "--no-update-positions"],
            cwd=worktree,
            check=False,
        )
        if result.returncode != 0:
            print(f"    Discovery failed for {hid}, skipping picks update", flush=True)
        else:
            # 3. Extract picks + merge with existing
            new_picks = extract_picks(worktree, scanner)
            existing_picks = load_picks_from_branch(hid, branch)
            # Deduplicate by (date, ticker)
            seen = {(p["date"], p["ticker"]) for p in existing_picks}
            merged = existing_picks + [p for p in new_picks if (p["date"], p["ticker"]) not in seen]

            # 4. Save picks + commit in worktree
            save_picks_to_worktree(worktree, hid, scanner, merged)

            # 5. Push hypothesis branch
            run(["git", "push", "origin", f"HEAD:{branch}"], cwd=worktree)

        # 6. Update registry fields
        if TODAY not in hyp.get("picks_log", []):
            hyp.setdefault("picks_log", []).append(TODAY)
        hyp["days_elapsed"] = len(hyp["picks_log"])

        # 7. Check conclusion
        if hyp["days_elapsed"] >= hyp["min_days"]:
            return conclude_hypothesis(hyp)

    finally:
        run(["git", "worktree", "remove", "--force", worktree], check=False)

    return False


def conclude_hypothesis(hyp: dict) -> bool:
    """Run comparison, write conclusion doc, close/merge PR. Returns True."""
    hid = hyp["id"]
    scanner = hyp["scanner"]
    branch = hyp["branch"]

    print(f"\n  Concluding {hid}...", flush=True)

    # Load picks from branch
    picks = load_picks_from_branch(hid, branch)
    if not picks:
        print(f"    No picks found for {hid}, marking rejected", flush=True)
        conclusion = {
            "decision": "rejected",
            "reason": "No picks were collected during the experiment period",
            "hypothesis": {"count": 0, "evaluated": 0, "win_rate": None, "avg_return": None},
            "baseline": {"count": 0, "win_rate": None, "avg_return": None},
        }
    else:
        # Run comparison script
        result = subprocess.run(
            [
                sys.executable, "scripts/compare_hypothesis.py",
                "--hypothesis-id", hid,
                "--picks-json", json.dumps(picks),
                "--scanner", scanner,
                "--db-path", str(DB_PATH),
            ],
            cwd=str(ROOT),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print(f"    compare_hypothesis.py failed: {result.stderr}", flush=True)
            return False
        conclusion = json.loads(result.stdout)

    decision = conclusion["decision"]
    hyp_metrics = conclusion["hypothesis"]
    base_metrics = conclusion["baseline"]

    # Write concluded doc
    def _pct(val) -> str:
        # Only None means "missing"; 0 / 0.0 is a real value and must be shown.
        return "—" if val is None else f"{val}%"

    period_start = hyp.get("created_at", TODAY)
    concluded_doc = CONCLUDED_DIR / f"{TODAY}-{hid}.md"
    concluded_doc.write_text(
        f"# Hypothesis: {hyp['title']}\n\n"
        f"**Scanner:** {scanner}\n"
        f"**Branch:** {branch}\n"
        f"**Period:** {period_start} → {TODAY} ({hyp['days_elapsed']} days)\n"
        f"**Outcome:** {'accepted ✅' if decision == 'accepted' else 'rejected ❌'}\n\n"
        f"## Hypothesis\n{hyp.get('description', hyp['title'])}\n\n"
        f"## Results\n\n"
        f"| Metric | Baseline | Experiment | Delta |\n"
        f"|---|---|---|---|\n"
        f"| 7d win rate | {_pct(base_metrics.get('win_rate'))} | "
        f"{_pct(hyp_metrics.get('win_rate'))} | "
        f"{_delta_str(hyp_metrics.get('win_rate'), base_metrics.get('win_rate'), 'pp')} |\n"
        f"| Avg return | {_pct(base_metrics.get('avg_return'))} | "
        f"{_pct(hyp_metrics.get('avg_return'))} | "
        f"{_delta_str(hyp_metrics.get('avg_return'), base_metrics.get('avg_return'), '%')} |\n"
        f"| Picks | {base_metrics.get('count', '—')} | {hyp_metrics.get('count', '—')} | — |\n\n"
        f"## Decision\n{conclusion['reason']}\n\n"
        f"## Action\n"
        f"{'Branch merged into main.' if decision == 'accepted' else 'Branch closed without merging.'}\n",
        encoding="utf-8",  # the doc contains non-ASCII characters (→, ✅, ❌)
    )

    run(["git", "add", str(concluded_doc)], check=False)

    # Close or merge PR
    pr = hyp.get("pr_number")
    if pr:
        if decision == "accepted":
            subprocess.run(
                ["gh", "pr", "merge", str(pr), "--squash", "--delete-branch"],
                cwd=str(ROOT), check=False,
            )
        else:
            subprocess.run(
                ["gh", "pr", "close", str(pr), "--delete-branch"],
                cwd=str(ROOT), check=False,
            )

    # Update registry entry
    hyp["status"] = "concluded"
    hyp["conclusion"] = decision

    print(f"  {hid}: {decision} — {conclusion['reason']}", flush=True)
    return True


def _delta_str(hyp_val, base_val, unit: str) -> str:
    if hyp_val is None or base_val is None:
        return "—"
    delta = hyp_val - base_val
    sign = "+" if delta >= 0 else ""
    return f"{sign}{delta:.1f}{unit}"


def promote_pending(registry: dict) -> None:
    """Promote the highest-priority pending hypothesis to running if a slot is open."""
    running_count = sum(1 for h in registry["hypotheses"] if h["status"] == "running")
    max_active = registry.get("max_active", 5)
    if running_count >= max_active:
        return

    pending = [h for h in registry["hypotheses"] if h["status"] == "pending"]
    if not pending:
        return

    # Promote highest priority
    to_promote = max(pending, key=lambda h: h.get("priority", 0))
    to_promote["status"] = "running"
    print(f"\n  Promoted pending hypothesis to running: {to_promote['id']}", flush=True)


def main():
    registry = load_registry()
    filter_id = os.environ.get("FILTER_ID", "").strip()

    hypotheses = registry.get("hypotheses", [])
    running = [
        h for h in hypotheses
        if h["status"] == "running" and (not filter_id or h["id"] == filter_id)
    ]

    if not running:
        print("No running hypotheses to process.", flush=True)
    else:
        for hyp in running:
            run_hypothesis(hyp)

    promote_pending(registry)
    save_registry(registry)
    print("\nRegistry updated.", flush=True)


if __name__ == "__main__":
    main()
```
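
To see the slot-promotion rule in isolation, the sketch below copies `promote_pending` from the runner (minus the log line) and applies it to a made-up registry. The hypothesis ids and priorities are illustrative assumptions, not real experiments.

```python
def promote_pending(registry: dict) -> None:
    """Promote the highest-priority pending hypothesis if a slot is open."""
    running_count = sum(1 for h in registry["hypotheses"] if h["status"] == "running")
    if running_count >= registry.get("max_active", 5):
        return
    pending = [h for h in registry["hypotheses"] if h["status"] == "pending"]
    if not pending:
        return
    max(pending, key=lambda h: h.get("priority", 0))["status"] = "running"


# Hypothetical registry: one slot is free because max_active is 2.
registry = {
    "max_active": 2,
    "hypotheses": [
        {"id": "scanner_a-one", "status": "running", "priority": 5},
        {"id": "scanner_b-two", "status": "pending", "priority": 3},
        {"id": "scanner_c-three", "status": "pending", "priority": 8},
    ],
}

promote_pending(registry)
statuses = {h["id"]: h["status"] for h in registry["hypotheses"]}
print(statuses)
```

Here `scanner_c-three` (priority 8) takes the free slot and `scanner_b-two` stays pending. Note that each call promotes at most one hypothesis, so filling several free slots takes several runner invocations.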

- [ ] **Step 3: Verify the workflow YAML is valid**

```bash
python3 - <<'EOF'
from pathlib import Path

content = Path(".github/workflows/hypothesis-runner.yml").read_text()
try:
    import yaml  # PyYAML is optional; skip the parse check if it is not installed
except ImportError:
    yaml = None
if yaml is not None:
    yaml.safe_load(content)  # raises on invalid YAML instead of falling through
assert "0 8 * * *" in content, "missing cron"
print("workflow file looks good")
EOF
```

- [ ] **Step 4: Commit**

```bash
git add .github/workflows/hypothesis-runner.yml scripts/run_hypothesis_runner.py
git commit -m "feat(hypotheses): add daily hypothesis runner workflow"
```

---

## Task 5: Dashboard Hypotheses Tab

**Files:**
- Create: `tradingagents/ui/pages/hypotheses.py`
- Modify: `tradingagents/ui/pages/__init__.py`
- Modify: `tradingagents/ui/dashboard.py`

- [ ] **Step 1: Write the failing test**

Create `tests/test_hypotheses_page.py`:

```python
"""Tests for the hypotheses dashboard page data loading."""
import json
import sys
from pathlib import Path

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent))


from tradingagents.ui.pages.hypotheses import (
    load_active_hypotheses,
    load_concluded_hypotheses,
    days_until_ready,
)


# ── load_active_hypotheses ────────────────────────────────────────────────────

def test_load_active_hypotheses(tmp_path):
    active = {
        "max_active": 5,
        "hypotheses": [
            {
                "id": "options_flow-test",
                "title": "Test hypothesis",
                "scanner": "options_flow",
                "status": "running",
                "priority": 7,
                "days_elapsed": 5,
                "min_days": 14,
                "created_at": "2026-04-01",
                "picks_log": ["2026-04-01"] * 5,
                "conclusion": None,
            }
        ],
    }
    f = tmp_path / "active.json"
    f.write_text(json.dumps(active))

    result = load_active_hypotheses(str(f))
    assert len(result) == 1
    assert result[0]["id"] == "options_flow-test"


def test_load_active_hypotheses_missing_file(tmp_path):
    result = load_active_hypotheses(str(tmp_path / "missing.json"))
    assert result == []


# ── load_concluded_hypotheses ─────────────────────────────────────────────────

def test_load_concluded_hypotheses(tmp_path):
    doc = tmp_path / "2026-04-10-options_flow-test.md"
    doc.write_text(
        "# Hypothesis: Test\n\n"
        "**Scanner:** options_flow\n"
        "**Period:** 2026-03-27 → 2026-04-10 (14 days)\n"
        "**Outcome:** accepted ✅\n"
    )

    results = load_concluded_hypotheses(str(tmp_path))
    assert len(results) == 1
    assert results[0]["filename"] == doc.name
    assert results[0]["outcome"] == "accepted ✅"


def test_load_concluded_hypotheses_empty_dir(tmp_path):
    results = load_concluded_hypotheses(str(tmp_path))
    assert results == []


# ── days_until_ready ──────────────────────────────────────────────────────────

def test_days_until_ready_has_days_left():
    hyp = {"days_elapsed": 5, "min_days": 14}
    assert days_until_ready(hyp) == 9


def test_days_until_ready_past_due():
    hyp = {"days_elapsed": 15, "min_days": 14}
    assert days_until_ready(hyp) == 0
```

- [ ] **Step 2: Run tests to confirm they fail**

```bash
python -m pytest tests/test_hypotheses_page.py -v 2>&1 | head -20
```

Expected: `ModuleNotFoundError` for `tradingagents.ui.pages.hypotheses`.

- [ ] **Step 3: Write `tradingagents/ui/pages/hypotheses.py`**

```python
"""
Hypotheses dashboard page — tracks active and concluded experiments.

Reads docs/iterations/hypotheses/active.json and the concluded/ directory.
No external API calls; all data is file-based.
"""

import json
import re
from pathlib import Path
from typing import Any, Dict, List

import streamlit as st

from tradingagents.ui.theme import COLORS, page_header

_REPO_ROOT = Path(__file__).parent.parent.parent.parent
_ACTIVE_JSON = _REPO_ROOT / "docs/iterations/hypotheses/active.json"
_CONCLUDED_DIR = _REPO_ROOT / "docs/iterations/hypotheses/concluded"


# ── Data loaders ─────────────────────────────────────────────────────────────


def load_active_hypotheses(active_path: str = str(_ACTIVE_JSON)) -> List[Dict[str, Any]]:
    """Load all hypotheses from active.json. Returns [] if file missing."""
    path = Path(active_path)
    if not path.exists():
        return []
    try:
        with open(path) as f:
            data = json.load(f)
        return data.get("hypotheses", [])
    except Exception:
        return []


def load_concluded_hypotheses(concluded_dir: str = str(_CONCLUDED_DIR)) -> List[Dict[str, Any]]:
    """
    Load concluded hypothesis metadata by parsing the markdown files in concluded/.

    Extracts: filename, title, scanner, period, outcome from each .md file.
    """
    dir_path = Path(concluded_dir)
    if not dir_path.exists():
        return []

    results = []
    for md_file in sorted(dir_path.glob("*.md"), reverse=True):
        if md_file.name == ".gitkeep":
            continue
        try:
            text = md_file.read_text(encoding="utf-8")
            title = _extract_md_field(text, r"^# Hypothesis: (.+)$")
            scanner = _extract_md_field(text, r"^\*\*Scanner:\*\* (.+)$")
            period = _extract_md_field(text, r"^\*\*Period:\*\* (.+)$")
            outcome = _extract_md_field(text, r"^\*\*Outcome:\*\* (.+)$")
            results.append({
                "filename": md_file.name,
                "title": title or md_file.stem,
                "scanner": scanner or "—",
                "period": period or "—",
                "outcome": outcome or "—",
            })
        except Exception:
            continue

    return results


def _extract_md_field(text: str, pattern: str) -> str:
    """Extract a field value from a markdown line using regex."""
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else ""


def days_until_ready(hyp: Dict[str, Any]) -> int:
    """Return number of days remaining before hypothesis can conclude (min 0)."""
    return max(0, hyp.get("min_days", 14) - hyp.get("days_elapsed", 0))


# ── Rendering ─────────────────────────────────────────────────────────────────


def render() -> None:
    """Render the hypotheses tracking page."""
    st.markdown(
        page_header("Hypotheses", "Active experiments & concluded findings"),
        unsafe_allow_html=True,
    )

    hypotheses = load_active_hypotheses()
    concluded = load_concluded_hypotheses()

    if not hypotheses and not concluded:
        st.info(
            "No hypotheses yet. Run `/backtest-hypothesis \"<description>\"` to start an experiment."
        )
        return

    # ── Active experiments ────────────────────────────────────────────────────
    running = [h for h in hypotheses if h["status"] == "running"]
    pending = [h for h in hypotheses if h["status"] == "pending"]

    st.markdown(
        f'<div class="section-title">Active Experiments '
        f'<span class="accent">// {len(running)} running, {len(pending)} pending</span></div>',
        unsafe_allow_html=True,
    )

    if running or pending:
        active_rows = []
        for h in sorted(running + pending, key=lambda x: -x.get("priority", 0)):
            days_left = days_until_ready(h)
            ready_str = "concluding soon" if days_left == 0 else f"{days_left}d left"
            status_color = COLORS["green"] if h["status"] == "running" else COLORS["amber"]
            active_rows.append({
                "ID": h["id"],
                "Title": h.get("title", "—"),
                "Scanner": h.get("scanner", "—"),
                "Status": h["status"],
                "Progress": f"{h.get('days_elapsed', 0)}/{h.get('min_days', 14)}d",
                "Picks": len(h.get("picks_log", [])),
                "Ready": ready_str,
                "Priority": h.get("priority"),  # None renders as an empty cell in the NumberColumn
            })

        import pandas as pd
        df = pd.DataFrame(active_rows)
        st.dataframe(
            df,
            width="stretch",
            hide_index=True,
            column_config={
                "ID": st.column_config.TextColumn(width="medium"),
                "Title": st.column_config.TextColumn(width="large"),
                "Scanner": st.column_config.TextColumn(width="medium"),
                "Status": st.column_config.TextColumn(width="small"),
                "Progress": st.column_config.TextColumn(width="small"),
                "Picks": st.column_config.NumberColumn(format="%d", width="small"),
                "Ready": st.column_config.TextColumn(width="medium"),
                "Priority": st.column_config.NumberColumn(format="%d/9", width="small"),
            },
        )
    else:
        st.info("No active experiments.")

    st.markdown("<div style='height:1.5rem;'></div>", unsafe_allow_html=True)

    # ── Concluded experiments ─────────────────────────────────────────────────
    st.markdown(
        f'<div class="section-title">Concluded Experiments '
        f'<span class="accent">// {len(concluded)} total</span></div>',
        unsafe_allow_html=True,
    )

    if concluded:
        import pandas as pd
        concluded_rows = []
        for c in concluded:
            outcome = c["outcome"]
            emoji = "✅" if "accepted" in outcome else "❌"
            concluded_rows.append({
                "Date": c["filename"][:10],
                "Title": c["title"],
                "Scanner": c["scanner"],
                "Period": c["period"],
                "Outcome": emoji,
            })
        cdf = pd.DataFrame(concluded_rows)
        st.dataframe(
            cdf,
            width="stretch",
            hide_index=True,
            column_config={
                "Date": st.column_config.TextColumn(width="small"),
                "Title": st.column_config.TextColumn(width="large"),
                "Scanner": st.column_config.TextColumn(width="medium"),
                "Period": st.column_config.TextColumn(width="medium"),
                "Outcome": st.column_config.TextColumn(width="small"),
            },
        )
    else:
        st.info("No concluded experiments yet.")
```
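
The concluded-doc parser depends on the header lines that `conclude_hypothesis()` writes. As a quick sanity check, the sketch below repeats `_extract_md_field` and `days_until_ready` from the module above and runs them on sample input. The sample document text and the hypothesis dict are made up for illustration.

```python
import re
from typing import Any, Dict


def _extract_md_field(text: str, pattern: str) -> str:
    """Return the first capture group of `pattern`, or "" if no line matches."""
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else ""


def days_until_ready(hyp: Dict[str, Any]) -> int:
    """Days remaining before the hypothesis can conclude, never negative."""
    return max(0, hyp.get("min_days", 14) - hyp.get("days_elapsed", 0))


# Made-up concluded doc with no Period line, to exercise the fallback path.
sample = (
    "# Hypothesis: Test\n\n"
    "**Scanner:** options_flow\n"
    "**Outcome:** rejected\n"
)

title = _extract_md_field(sample, r"^# Hypothesis: (.+)$")
scanner = _extract_md_field(sample, r"^\*\*Scanner:\*\* (.+)$")
period = _extract_md_field(sample, r"^\*\*Period:\*\* (.+)$")  # line is absent
remaining = days_until_ready({"days_elapsed": 5})  # min_days defaults to 14

print(title, scanner, repr(period), remaining)
# → Test options_flow '' 9
```

A missing field comes back as an empty string, which is why `load_concluded_hypotheses` substitutes a placeholder with `or` when it builds each row.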

- [ ] **Step 4: Run tests to confirm they pass**

```bash
python -m pytest tests/test_hypotheses_page.py -v
```

Expected: all 6 tests pass.

- [ ] **Step 5: Register the page in `tradingagents/ui/pages/__init__.py`**

Add after the `settings` import block (around line 38):

```python
try:
    from tradingagents.ui.pages import hypotheses
except Exception as _e:
    _logger.error("Failed to import hypotheses page: %s", _e, exc_info=True)
    hypotheses = None
```

And add `"hypotheses"` to `__all__`:

```python
__all__ = [
    "home",
    "todays_picks",
    "portfolio",
    "performance",
    "settings",
    "hypotheses",
]
```

- [ ] **Step 6: Add "Hypotheses" to dashboard navigation in `tradingagents/ui/dashboard.py`**

In `render_sidebar`, change the `options` list:

```python
page = st.radio(
    "Navigation",
    options=["Overview", "Signals", "Portfolio", "Performance", "Hypotheses", "Config"],
    label_visibility="collapsed",
)
```

In `route_page`, add to `page_map`:

```python
page_map = {
    "Overview": pages.home,
    "Signals": pages.todays_picks,
    "Portfolio": pages.portfolio,
    "Performance": pages.performance,
    "Hypotheses": pages.hypotheses,
    "Config": pages.settings,
}
```
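
Because the `__init__.py` import guard sets `hypotheses = None` when the module fails to load, `page_map["Hypotheses"]` can be `None` at routing time. How `route_page` handles that case is not shown in this plan. The sketch below is one possible guard, using a hypothetical `resolve_page` helper and stand-in page objects rather than the real dashboard code.

```python
from types import SimpleNamespace
from typing import Callable, Dict, Optional


def resolve_page(page_map: Dict[str, Optional[object]], name: str) -> Optional[Callable[[], None]]:
    """Return the page's render callable, or None if the page is unavailable.

    Hypothetical helper: assumes every page module exposes a `render()`
    function, as hypotheses.py does.
    """
    module = page_map.get(name)
    return getattr(module, "render", None)


# Stand-ins for page modules; the real map holds tradingagents.ui.pages.* modules.
rendered = []
page_map = {
    "Overview": SimpleNamespace(render=lambda: rendered.append("Overview")),
    "Hypotheses": None,  # simulates a failed import
}

for name in ("Overview", "Hypotheses"):
    render = resolve_page(page_map, name)
    if render is None:
        print(f"{name}: page unavailable")
    else:
        render()
```

In the dashboard itself, the unavailable branch could show a message with `st.error(...)` instead of printing.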

- [ ] **Step 7: Run both hypothesis test suites**

```bash
python -m pytest tests/test_compare_hypothesis.py tests/test_hypotheses_page.py -v
```

Expected: all 16 tests pass.

- [ ] **Step 8: Commit everything**

```bash
git add \
  tradingagents/ui/pages/hypotheses.py \
  tradingagents/ui/pages/__init__.py \
  tradingagents/ui/dashboard.py \
  tests/test_hypotheses_page.py
git commit -m "feat(hypotheses): add Hypotheses dashboard tab"
```

---

## Self-Review

**Spec coverage check:**
- ✅ `active.json` schema with `status: running/pending/concluded` — Task 1
- ✅ `/backtest-hypothesis` command: classify, priority scoring, pending queue, branch creation — Task 3
- ✅ Running experiments never paused — enforced in `run_hypothesis_runner.py` (only `running` entries processed; new ones queue as `pending`)
- ✅ Daily runner: worktree per hypothesis, run discovery, commit picks, conclude — Task 4
- ✅ Statistical comparison with 5pp / 1% thresholds, minimum 5 evaluated picks — Task 2
- ✅ Auto-promote pending → running when slot opens — `promote_pending()` in runner
- ✅ Concluded doc written with metrics table — `conclude_hypothesis()` in runner
- ✅ PR merged (accepted) or closed (rejected) automatically — `conclude_hypothesis()`
- ✅ Dashboard tab with active + concluded tables — Task 5

**Type/name consistency:**
- `hypothesis_id` / `hid` / `id` field: the dict key is always `"id"`, the local var is `hid`, the argument is `--hypothesis-id` — consistent throughout
- `picks.json` structure: `{"hypothesis_id": ..., "scanner": ..., "picks": [...]}` — used in `save_picks_to_worktree` and `load_picks_from_branch` consistently
- `strategy_match` field used to filter picks in `extract_picks` — matches `discovery_result.json` structure confirmed by inspection
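
The `picks.json` envelope named above can be checked with a few lines of Python. This is a sketch only: `validate_picks_doc` is a hypothetical helper, and the fields inside each pick (`ticker`, `date`) are illustrative assumptions, since this section fixes only the three top-level keys.

```python
import json

REQUIRED_KEYS = {"hypothesis_id", "scanner", "picks"}


def validate_picks_doc(doc: dict) -> list:
    """Return a list of problems found in a picks.json document (empty if none)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(doc))]
    if "picks" in doc and not isinstance(doc["picks"], list):
        problems.append("picks must be a list")
    return problems


# Well-formed envelope; the per-pick fields are assumed for illustration.
good = json.loads(
    '{"hypothesis_id": "options_flow-test", "scanner": "options_flow",'
    ' "picks": [{"ticker": "XYZ", "date": "2026-04-01"}]}'
)
# Malformed envelope: no scanner, and picks is a dict instead of a list.
bad = {"hypothesis_id": "options_flow-test", "picks": {}}

print(validate_picks_doc(good))
print(validate_picks_doc(bad))
```

A check like this could run in the test suite against a fixture file, so that a drift between `save_picks_to_worktree` and `load_picks_from_branch` shows up before the daily runner hits it.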