# Hypothesis Backtesting System — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build a branch-per-hypothesis experimentation system that runs scanner code changes daily in isolation, accumulates picks, auto-concludes with a statistical comparison, and surfaces everything in the dashboard.

**Architecture:** `active.json` is the registry (lives on `main`). Each hypothesis gets a `hypothesis/<scanner>-<slug>` branch with the code change. A daily workflow (08:00 UTC) uses git worktrees to run discovery on each branch, stores picks in `docs/iterations/hypotheses/<id>/picks.json` on the hypothesis branch, and concludes once `min_days` have elapsed. The `/backtest-hypothesis` command classifies hypotheses, creates branches, and manages the registry.

**Tech Stack:** Python 3.10, yfinance (`download_history`), GitHub Actions, Streamlit, `gh` CLI, `git worktree`

---

## File Map

| Path | Action | Purpose |
|---|---|---|
| `docs/iterations/hypotheses/active.json` | Create | Registry of all experiments |
| `docs/iterations/hypotheses/concluded/.gitkeep` | Create | Directory placeholder |
| `scripts/compare_hypothesis.py` | Create | Fetch returns + statistical comparison |
| `tests/test_compare_hypothesis.py` | Create | Tests for the comparison script |
| `.claude/commands/backtest-hypothesis.md` | Create | `/backtest-hypothesis` Claude command |
| `.github/workflows/hypothesis-runner.yml` | Create | Daily 08:00 UTC runner |
| `scripts/run_hypothesis_runner.py` | Create | Orchestration script called by the runner workflow |
| `tradingagents/ui/pages/hypotheses.py` | Create | Dashboard "Hypotheses" tab |
| `tests/test_hypotheses_page.py` | Create | Tests for the dashboard data loaders |
| `tradingagents/ui/pages/__init__.py` | Modify | Register new page |
| `tradingagents/ui/dashboard.py` | Modify | Add "Hypotheses" to nav |

---

## Task 1: Hypothesis Registry Structure

**Files:**
- Create: `docs/iterations/hypotheses/active.json`
- Create: `docs/iterations/hypotheses/concluded/.gitkeep`

- [ ] **Step 1: Create the directory and initial `active.json`**

```bash
mkdir -p docs/iterations/hypotheses/concluded
```

Write `docs/iterations/hypotheses/active.json`:

```json
{
  "max_active": 5,
  "hypotheses": []
}
```

- [ ] **Step 2: Create the concluded directory placeholder**

```bash
touch docs/iterations/hypotheses/concluded/.gitkeep
```

- [ ] **Step 3: Verify JSON is valid**

```bash
python3 -c "import json; json.load(open('docs/iterations/hypotheses/active.json')); print('valid')"
```

Expected: `valid`

- [ ] **Step 4: Commit**

```bash
git add docs/iterations/hypotheses/
git commit -m "feat(hypotheses): initialize hypothesis registry"
```

---

## Task 2: Comparison Script

**Files:**
- Create: `scripts/compare_hypothesis.py`
- Create: `tests/test_compare_hypothesis.py`

`★ Insight ─────────────────────────────────────`
The comparison reads picks from the hypothesis branch via `git show <branch>:<path>` — this avoids checking out the branch just to read a file, keeping the working tree on `main` throughout.
`─────────────────────────────────────────────────`

- [ ] **Step 1: Write the failing tests**

Create `tests/test_compare_hypothesis.py`:

```python
"""Tests for the hypothesis comparison script."""
import json
import subprocess
import sys
from datetime import date, timedelta
from pathlib import Path
from unittest.mock import MagicMock, patch

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent))

from scripts.compare_hypothesis import (
    compute_metrics,
    compute_7d_return,
    load_baseline_metrics,
    make_decision,
)


# ── compute_metrics ──────────────────────────────────────────────────────────

def test_compute_metrics_empty():
    result = compute_metrics([])
    assert result == {"count": 0, "evaluated": 0, "win_rate": None, "avg_return": None}


def test_compute_metrics_all_wins():
    picks = [
        {"return_7d": 5.0, "win_7d": True},
        {"return_7d": 3.0, "win_7d": True},
    ]
    result = compute_metrics(picks)
    assert result["win_rate"] == 100.0
    assert result["avg_return"] == 4.0
    assert result["evaluated"] == 2


def test_compute_metrics_mixed():
    picks = [
        {"return_7d": 10.0, "win_7d": True},
        {"return_7d": -5.0, "win_7d": False},
        {"return_7d": None, "win_7d": None},  # pending — excluded
    ]
    result = compute_metrics(picks)
    assert result["win_rate"] == 50.0
    assert result["avg_return"] == 2.5
    assert result["evaluated"] == 2
    assert result["count"] == 3


# ── compute_7d_return ────────────────────────────────────────────────────────

def test_compute_7d_return_positive():
    mock_df = MagicMock()
    mock_df.empty = False
    # Simulate DataFrame with Close column: entry=100, exit=110
    mock_df.__len__ = lambda self: 2
    mock_df["Close"].iloc.__getitem__ = MagicMock(side_effect=lambda i: 100.0 if i == 0 else 110.0)

    with patch("scripts.compare_hypothesis.download_history", return_value=mock_df):
        ret, win = compute_7d_return("AAPL", "2026-03-01")

    assert ret == pytest.approx(10.0, rel=0.01)
    assert win is True


def test_compute_7d_return_empty_data():
    mock_df = MagicMock()
    mock_df.empty = True

    with patch("scripts.compare_hypothesis.download_history", return_value=mock_df):
        ret, win = compute_7d_return("AAPL", "2026-03-01")

    assert ret is None
    assert win is None


# ── load_baseline_metrics ────────────────────────────────────────────────────

def test_load_baseline_metrics(tmp_path):
    db = {
        "recommendations_by_date": {
            "2026-03-01": [
                {"strategy_match": "options_flow", "return_7d": 5.0, "win_7d": True},
                {"strategy_match": "options_flow", "return_7d": -2.0, "win_7d": False},
                {"strategy_match": "reddit_dd", "return_7d": 3.0, "win_7d": True},
            ]
        }
    }
    db_file = tmp_path / "performance_database.json"
    db_file.write_text(json.dumps(db))

    result = load_baseline_metrics("options_flow", str(db_file))

    assert result["win_rate"] == 50.0
    assert result["avg_return"] == 1.5
    assert result["count"] == 2


def test_load_baseline_metrics_missing_file(tmp_path):
    result = load_baseline_metrics("options_flow", str(tmp_path / "missing.json"))
    assert result == {"count": 0, "win_rate": None, "avg_return": None}


# ── make_decision ─────────────────────────────────────────────────────────────

def test_make_decision_accepted_by_win_rate():
    hyp = {"win_rate": 60.0, "avg_return": 0.5, "evaluated": 10}
    baseline = {"win_rate": 50.0, "avg_return": 0.5}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "accepted"
    assert "win rate" in reason.lower()


def test_make_decision_accepted_by_return():
    hyp = {"win_rate": 52.0, "avg_return": 3.0, "evaluated": 10}
    baseline = {"win_rate": 50.0, "avg_return": 1.5}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "accepted"
    assert "return" in reason.lower()


def test_make_decision_rejected():
    hyp = {"win_rate": 48.0, "avg_return": 0.2, "evaluated": 10}
    baseline = {"win_rate": 50.0, "avg_return": 1.0}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "rejected"


def test_make_decision_insufficient_data():
    hyp = {"win_rate": 80.0, "avg_return": 5.0, "evaluated": 2}
    baseline = {"win_rate": 50.0, "avg_return": 1.0}
    decision, reason = make_decision(hyp, baseline)
    assert decision == "rejected"
    assert "insufficient" in reason.lower()
```

- [ ] **Step 2: Run tests to confirm they fail**

```bash
python -m pytest tests/test_compare_hypothesis.py -v 2>&1 | head -30
```

Expected: `ModuleNotFoundError: No module named 'scripts.compare_hypothesis'` or similar import error — confirms tests are wired correctly.

- [ ] **Step 3: Write `scripts/compare_hypothesis.py`**

```python
#!/usr/bin/env python3
"""
Hypothesis comparison — computes 7d returns for hypothesis picks and compares
them against the baseline scanner in performance_database.json.

Usage (called by hypothesis-runner.yml after min_days elapsed):
    python scripts/compare_hypothesis.py \\
        --hypothesis-id options_flow-scan-3-expirations \\
        --picks-json '[...]' \\
        --scanner options_flow \\
        --db-path data/recommendations/performance_database.json

--picks-json takes a JSON list of pick dicts (each with "ticker" and "date").

Prints a JSON conclusion to stdout:
    {
      "decision": "accepted",
      "reason": "...",
      "hypothesis": {"win_rate": 58.0, "avg_return": 1.8, "count": 14, "evaluated": 10},
      "baseline": {"win_rate": 42.0, "avg_return": -0.3, "count": 87}
    }
"""

import argparse
import json
import sys
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional, Tuple

ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(ROOT))

from tradingagents.dataflows.y_finance import download_history

# Minimum evaluated picks required to make a decision
_MIN_EVALUATED = 5
# Thresholds from spec
_WIN_RATE_DELTA_THRESHOLD = 5.0  # percentage points
_AVG_RETURN_DELTA_THRESHOLD = 1.0  # percent


def compute_7d_return(ticker: str, pick_date: str) -> Tuple[Optional[float], Optional[bool]]:
    """
    Fetch 7-day return for a pick using yfinance.

    Args:
        ticker: Stock symbol, e.g. "AAPL"
        pick_date: Date the pick was made, "YYYY-MM-DD"

    Returns:
        (return_pct, is_win) or (None, None) if data unavailable
    """
    try:
        entry_dt = datetime.strptime(pick_date, "%Y-%m-%d")
        exit_dt = entry_dt + timedelta(days=10)  # +3 buffer for weekends/holidays
        df = download_history(
            ticker,
            start=entry_dt.strftime("%Y-%m-%d"),
            end=exit_dt.strftime("%Y-%m-%d"),
        )
        if df.empty or len(df) < 2:
            return None, None

        # Use the first available close as entry and the close about
        # 5 trading days later (or the last available row) as exit
        close = df["Close"]
        entry_price = float(close.iloc[0])
        exit_idx = min(5, len(close) - 1)  # ~7 calendar days = ~5 trading days
        exit_price = float(close.iloc[exit_idx])

        if entry_price <= 0:
            return None, None

        ret = (exit_price - entry_price) / entry_price * 100
        return round(ret, 4), ret > 0
    except Exception:
        return None, None


def enrich_picks_with_returns(picks: list) -> list:
    """
    Compute 7d return for each pick that is old enough (>= 7 days) and
    doesn't already have return_7d populated.

    Args:
        picks: List of pick dicts with at least 'ticker' and 'date' fields

    Returns:
        Same list with return_7d and win_7d populated where possible
    """
    cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d")
    for pick in picks:
        if pick.get("return_7d") is not None:
            continue  # already computed
        if pick.get("date", "9999-99-99") > cutoff:
            continue  # too recent
        ret, win = compute_7d_return(pick["ticker"], pick["date"])
        pick["return_7d"] = ret
        pick["win_7d"] = win
    return picks


def compute_metrics(picks: list) -> dict:
    """
    Compute win rate and avg return for a list of picks.

    Only picks with non-None return_7d contribute to win_rate and avg_return.

    Returns:
        {"count": int, "evaluated": int, "win_rate": float|None, "avg_return": float|None}
    """
    evaluated = [p for p in picks if p.get("return_7d") is not None]
    if not evaluated:
        return {"count": len(picks), "evaluated": 0, "win_rate": None, "avg_return": None}

    wins = sum(1 for p in evaluated if p.get("win_7d"))
    avg_ret = sum(p["return_7d"] for p in evaluated) / len(evaluated)
    return {
        "count": len(picks),
        "evaluated": len(evaluated),
        "win_rate": round(wins / len(evaluated) * 100, 1),
        "avg_return": round(avg_ret, 2),
    }


def load_baseline_metrics(scanner: str, db_path: str) -> dict:
    """
    Load baseline metrics for a scanner from performance_database.json.

    Args:
        scanner: Scanner name, e.g. "options_flow"
        db_path: Path to performance_database.json

    Returns:
        {"count": int, "win_rate": float|None, "avg_return": float|None}
    """
    path = Path(db_path)
    if not path.exists():
        return {"count": 0, "win_rate": None, "avg_return": None}

    try:
        with open(path) as f:
            db = json.load(f)
    except Exception:
        return {"count": 0, "win_rate": None, "avg_return": None}

    picks = []
    for recs in db.get("recommendations_by_date", {}).values():
        for rec in (recs if isinstance(recs, list) else []):
            if rec.get("strategy_match") == scanner and rec.get("return_7d") is not None:
                picks.append(rec)

    return compute_metrics(picks)


def make_decision(hypothesis: dict, baseline: dict) -> Tuple[str, str]:
    """
    Decide accepted or rejected based on metrics delta.

    Rules:
    - Minimum _MIN_EVALUATED evaluated picks required
    - accepted if win_rate_delta > _WIN_RATE_DELTA_THRESHOLD (5pp)
      OR avg_return_delta > _AVG_RETURN_DELTA_THRESHOLD (1%)
    - rejected otherwise

    Returns:
        (decision, reason) where decision is "accepted" or "rejected"
    """
    evaluated = hypothesis.get("evaluated", 0)
    if evaluated < _MIN_EVALUATED:
        return "rejected", f"Insufficient data: only {evaluated} evaluated picks (need {_MIN_EVALUATED})"

    hyp_wr = hypothesis.get("win_rate")
    hyp_ret = hypothesis.get("avg_return")
    base_wr = baseline.get("win_rate")
    base_ret = baseline.get("avg_return")

    reasons = []

    if hyp_wr is not None and base_wr is not None:
        delta_wr = hyp_wr - base_wr
        if delta_wr > _WIN_RATE_DELTA_THRESHOLD:
            reasons.append(f"win rate improved by {delta_wr:+.1f}pp ({base_wr:.1f}% → {hyp_wr:.1f}%)")

    if hyp_ret is not None and base_ret is not None:
        delta_ret = hyp_ret - base_ret
        if delta_ret > _AVG_RETURN_DELTA_THRESHOLD:
            reasons.append(f"avg return improved by {delta_ret:+.2f}% ({base_ret:+.2f}% → {hyp_ret:+.2f}%)")

    if reasons:
        return "accepted", "; ".join(reasons)

    # Guard both sides: a missing baseline would otherwise make the
    # float format specs raise TypeError on None.
    wr_str = (
        f"{hyp_wr:.1f}% vs baseline {base_wr:.1f}%"
        if hyp_wr is not None and base_wr is not None
        else "no win rate data"
    )
    ret_str = (
        f"{hyp_ret:+.2f}% vs baseline {base_ret:+.2f}%"
        if hyp_ret is not None and base_ret is not None
        else "no return data"
    )
    return "rejected", f"No significant improvement — win rate: {wr_str}; avg return: {ret_str}"


def main():
    parser = argparse.ArgumentParser(description="Compare hypothesis picks against baseline")
    parser.add_argument("--hypothesis-id", required=True)
    parser.add_argument("--picks-json", required=True, help="JSON string of picks list")
    parser.add_argument("--scanner", required=True, help="Baseline scanner name")
    parser.add_argument(
        "--db-path",
        default="data/recommendations/performance_database.json",
        help="Path to performance_database.json",
    )
    args = parser.parse_args()

    picks = json.loads(args.picks_json)
    picks = enrich_picks_with_returns(picks)

    hyp_metrics = compute_metrics(picks)
    base_metrics = load_baseline_metrics(args.scanner, args.db_path)

    decision, reason = make_decision(hyp_metrics, base_metrics)

    result = {
        "hypothesis_id": args.hypothesis_id,
        "decision": decision,
        "reason": reason,
        "hypothesis": hyp_metrics,
        "baseline": base_metrics,
        "enriched_picks": picks,
    }
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
python -m pytest tests/test_compare_hypothesis.py -v
```

Expected: all 11 tests pass.

- [ ] **Step 5: Commit**

```bash
git add scripts/compare_hypothesis.py tests/test_compare_hypothesis.py
git commit -m "feat(hypotheses): add comparison + conclusion script"
```

---

## Task 3: `/backtest-hypothesis` Command

**Files:**
- Create: `.claude/commands/backtest-hypothesis.md`

- [ ] **Step 1: Write the command file**

Create `.claude/commands/backtest-hypothesis.md`:

````markdown
# /backtest-hypothesis

Test a hypothesis about a scanner improvement using branch-per-hypothesis isolation.

**Usage:** `/backtest-hypothesis "<description>"`

**Example:** `/backtest-hypothesis "options_flow: scan 3 expirations instead of 1 to capture institutional 30+ DTE positioning"`

---

## Step 1: Read Current Registry

Read `docs/iterations/hypotheses/active.json`. Note:
- How many hypotheses currently have `status: "running"`
- The `max_active` limit (default 5)
- Any existing `pending` entries

Also read `docs/iterations/LEARNINGS.md` and the relevant scanner domain file in `docs/iterations/scanners/` to understand the current baseline.

## Step 2: Classify the Hypothesis

Determine whether this is:

**Statistical** — answerable from existing data in `data/recommendations/performance_database.json` without any code change. Examples:
- "Does high confidence (≥8) predict better 30d returns?"
- "Are options_flow picks that are ITM outperforming OTM ones?"

**Implementation** — requires a code change and forward-testing period.
Examples:
- "Scan 3 expirations instead of 1"
- "Apply a premium filter of $50K instead of $25K"

## Step 3a: Statistical Path

If statistical: run the analysis now against `data/recommendations/performance_database.json`. Write the finding to the relevant scanner domain file under **Evidence Log**. Print a summary. Done — no branch needed.

## Step 3b: Implementation Path

### 3b-i: Capacity check

Count running hypotheses from `active.json`. If fewer than `max_active` are running, proceed. If at capacity: add the new hypothesis as `status: "pending"` — running experiments are NEVER paused mid-streak. Inform the user which slot it queued behind and when it will likely start.

### 3b-ii: Score the hypothesis

Assign a `priority` score (1–9) using these factors:

| Factor | Score |
|---|---|
| Scanner 30d win rate < 40% | +3 |
| Change touches 1 file, 1 parameter | +2 |
| Directly addresses a weak spot in LEARNINGS.md | +2 |
| Scanner generates ≥2 picks/day (data accrues fast) | +1 |
| Supported by external research (arXiv, Alpha Architect, etc.) | +1 |
| Contradictory evidence or unclear direction | −2 |

### 3b-iii: Determine min_days

Set `min_days` based on the scanner's typical picks-per-day rate:
- ≥2 picks/day → 14 days
- 1 pick/day → 21 days
- <1 pick/day → 30 days

### 3b-iv: Create the branch and implement the code change

```bash
BRANCH="hypothesis/<scanner>-<slug>"
git checkout -b "$BRANCH"
```

Make the minimal code change that implements the hypothesis. Read the scanner file first. Only change what the hypothesis requires — do not refactor surrounding code.

```bash
git add tradingagents/
git commit -m "hypothesis(<scanner>): <title>"
```

### 3b-v: Create picks tracking file on the branch

Create `docs/iterations/hypotheses/<id>/picks.json` on the hypothesis branch:

```json
{
  "hypothesis_id": "<id>",
  "scanner": "<scanner>",
  "picks": []
}
```

```bash
mkdir -p docs/iterations/hypotheses/<id>
# write the file
git add docs/iterations/hypotheses/<id>/picks.json
git commit -m "hypothesis(<scanner>): add picks tracker"
git push -u origin "$BRANCH"
```

### 3b-vi: Open a draft PR

```bash
gh pr create \
  --title "hypothesis(<scanner>): <title>" \
  --body "**Hypothesis:** <description>

**Expected impact:** <high/medium/low>
**Min days:** <N>
**Priority:** <score>/9

*This is an automated hypothesis experiment. It will be auto-concluded after <N> days of data.*" \
  --draft \
  --base main
```

Note the PR number from the output.

### 3b-vii: Update active.json on main

Check out `main`, then update `docs/iterations/hypotheses/active.json` to add the new entry:

```json
{
  "id": "<scanner>-<slug>",
  "scanner": "<scanner>",
  "title": "<title>",
  "description": "<description>",
  "branch": "hypothesis/<scanner>-<slug>",
  "pr_number": <N>,
  "status": "running",
  "priority": <score>,
  "expected_impact": "<high|medium|low>",
  "hypothesis_type": "implementation",
  "created_at": "<YYYY-MM-DD>",
  "min_days": <N>,
  "days_elapsed": 0,
  "picks_log": [],
  "baseline_scanner": "<scanner>",
  "conclusion": null
}
```

```bash
git checkout main
git add docs/iterations/hypotheses/active.json
git commit -m "feat(hypotheses): register hypothesis <id>"
git push origin main
```

## Step 4: Print Summary

Print a confirmation:
- Hypothesis ID and branch name
- Status: running or pending
- Expected conclusion date (created_at + min_days)
- PR link (if running)
- Priority score and why
````

- [ ] **Step 2: Verify the file exists and is non-empty**

```bash
wc -l .claude/commands/backtest-hypothesis.md
```

Expected: at least 80 lines.
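
The scoring table (3b-ii) and the `min_days` rule (3b-iii) in the command file are applied by the agent at prompt time, not by code in this plan. As a rough illustration of the arithmetic only, they could be sketched in Python as below. The function and parameter names are hypothetical, clamping the sum into the 1–9 range is an assumption (the plan does not say what happens when the raw sum is below 1), and treating rates between 1 and 2 picks/day as the 21-day tier is also an assumption.

```python
from datetime import date, timedelta


def score_priority(
    win_rate_30d: float,
    single_param_change: bool,
    addresses_weak_spot: bool,
    picks_per_day: float,
    has_external_support: bool,
    contradictory_evidence: bool,
) -> int:
    """Sum the factor scores from the 3b-ii table (hypothetical helper)."""
    score = 0
    if win_rate_30d < 40.0:
        score += 3
    if single_param_change:
        score += 2
    if addresses_weak_spot:
        score += 2
    if picks_per_day >= 2:
        score += 1
    if has_external_support:
        score += 1
    if contradictory_evidence:
        score -= 2
    # Assumption: clamp into the documented 1-9 range.
    return max(1, min(9, score))


def min_days_for(picks_per_day: float) -> int:
    """Map a scanner's pick rate to min_days per 3b-iii (hypothetical helper)."""
    if picks_per_day >= 2:
        return 14
    if picks_per_day >= 1:  # assumption: 1 <= rate < 2 uses the 21-day tier
        return 21
    return 30


def expected_conclusion(created_at: date, min_days: int) -> date:
    """Expected conclusion date as stated in Step 4: created_at + min_days."""
    return created_at + timedelta(days=min_days)


if __name__ == "__main__":
    rate = 2.5
    print(score_priority(35.0, True, True, rate, True, False))
    print(expected_conclusion(date(2026, 4, 1), min_days_for(rate)))
```

Because the runner counts `days_elapsed` from `picks_log` entries rather than calendar days, the date from `expected_conclusion` is only an estimate that holds when the workflow runs every day.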

- [ ] **Step 3: Commit**

```bash
git add .claude/commands/backtest-hypothesis.md
git commit -m "feat(hypotheses): add /backtest-hypothesis command"
```

---

## Task 4: Hypothesis Runner Workflow

**Files:**
- Create: `.github/workflows/hypothesis-runner.yml`
- Create: `scripts/run_hypothesis_runner.py`

- [ ] **Step 1: Write the workflow**

Create `.github/workflows/hypothesis-runner.yml`:

```yaml
name: Hypothesis Runner

on:
  schedule:
    # 8:00 AM UTC daily — runs after iterate (06:00) and before daily-discovery (12:30)
    - cron: "0 8 * * *"
  workflow_dispatch:
    inputs:
      hypothesis_id:
        description: "Run a specific hypothesis ID only (blank = all running)"
        required: false
        default: ""

env:
  PYTHON_VERSION: "3.10"

jobs:
  run-hypotheses:
    runs-on: ubuntu-latest
    environment: TradingAgent
    timeout-minutes: 60
    permissions:
      contents: write
      pull-requests: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          token: ${{ secrets.GH_TOKEN }}

      - name: Set up git identity
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip

      - name: Install dependencies
        run: pip install --upgrade pip && pip install -e .

      - name: Run hypothesis experiments
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          FINNHUB_API_KEY: ${{ secrets.FINNHUB_API_KEY }}
          ALPHA_VANTAGE_API_KEY: ${{ secrets.ALPHA_VANTAGE_API_KEY }}
          FMP_API_KEY: ${{ secrets.FMP_API_KEY }}
          REDDIT_CLIENT_ID: ${{ secrets.REDDIT_CLIENT_ID }}
          REDDIT_CLIENT_SECRET: ${{ secrets.REDDIT_CLIENT_SECRET }}
          TRADIER_API_KEY: ${{ secrets.TRADIER_API_KEY }}
          FILTER_ID: ${{ inputs.hypothesis_id }}
        run: |
          python scripts/run_hypothesis_runner.py

      - name: Commit active.json updates
        run: |
          git add docs/iterations/hypotheses/active.json || true
          if git diff --cached --quiet; then
            echo "No registry changes"
          else
            git commit -m "chore(hypotheses): update registry $(date -u +%Y-%m-%d)"
            git pull --rebase origin main
            git push origin main
          fi
```

- [ ] **Step 2: Write `scripts/run_hypothesis_runner.py`**

Create `scripts/run_hypothesis_runner.py`:

```python
#!/usr/bin/env python3
"""
Hypothesis Runner — orchestrates daily experiment cycles.

For each running hypothesis in active.json:
  1. Creates a git worktree for the hypothesis branch
  2. Runs the daily discovery pipeline in that worktree
  3. Extracts picks from the discovery result, appends to picks.json
  4. Commits and pushes picks to hypothesis branch
  5. Removes worktree
  6. Updates active.json (days_elapsed, picks_log)
  7. If days_elapsed >= min_days: concludes the hypothesis

After all hypotheses: promotes highest-priority pending → running if a slot opened.

Environment variables read:
    FILTER_ID — if set, only run the hypothesis with this ID
"""

import json
import os
import subprocess
import sys
from datetime import datetime, timedelta
from pathlib import Path

ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(ROOT))

ACTIVE_JSON = ROOT / "docs/iterations/hypotheses/active.json"
CONCLUDED_DIR = ROOT / "docs/iterations/hypotheses/concluded"
DB_PATH = ROOT / "data/recommendations/performance_database.json"
TODAY = datetime.utcnow().strftime("%Y-%m-%d")


def load_registry() -> dict:
    with open(ACTIVE_JSON) as f:
        return json.load(f)


def save_registry(registry: dict) -> None:
    with open(ACTIVE_JSON, "w") as f:
        json.dump(registry, f, indent=2)


def run(cmd: list, cwd: str = None, check: bool = True) -> subprocess.CompletedProcess:
    print(f"  $ {' '.join(cmd)}", flush=True)
    return subprocess.run(cmd, cwd=cwd or str(ROOT), check=check, capture_output=False)


def run_capture(cmd: list, cwd: str = None) -> str:
    result = subprocess.run(cmd, cwd=cwd or str(ROOT), capture_output=True, text=True)
    return result.stdout.strip()


def extract_picks(worktree: str, scanner: str) -> list:
    """
    Extract picks for the given scanner from the most recent discovery result
    in the worktree's results/discovery/<TODAY>/ directory.
    """
    results_dir = Path(worktree) / "results" / "discovery" / TODAY
    if not results_dir.exists():
        print(f"  No discovery results for {TODAY} in worktree", flush=True)
        return []

    picks = []
    for run_dir in sorted(results_dir.iterdir()):
        result_file = run_dir / "discovery_result.json"
        if not result_file.exists():
            continue
        try:
            with open(result_file) as f:
                data = json.load(f)
            for item in data.get("final_ranking", []):
                if item.get("strategy_match") == scanner:
                    picks.append({
                        "date": TODAY,
                        "ticker": item["ticker"],
                        "score": item.get("final_score"),
                        "confidence": item.get("confidence"),
                        "scanner": scanner,
                        "return_7d": None,
                        "win_7d": None,
                    })
        except Exception as e:
            print(f"  Warning: could not read {result_file}: {e}", flush=True)

    return picks


def load_picks_from_branch(hypothesis_id: str, branch: str) -> list:
    """Load picks.json from the hypothesis branch using git show."""
    picks_path = f"docs/iterations/hypotheses/{hypothesis_id}/picks.json"
    result = subprocess.run(
        ["git", "show", f"{branch}:{picks_path}"],
        cwd=str(ROOT),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    try:
        return json.loads(result.stdout).get("picks", [])
    except Exception:
        return []


def save_picks_to_worktree(worktree: str, hypothesis_id: str, scanner: str, picks: list) -> None:
    """Write updated picks.json into the worktree and commit."""
    picks_dir = Path(worktree) / "docs" / "iterations" / "hypotheses" / hypothesis_id
    picks_dir.mkdir(parents=True, exist_ok=True)
    picks_file = picks_dir / "picks.json"
    payload = {"hypothesis_id": hypothesis_id, "scanner": scanner, "picks": picks}
    picks_file.write_text(json.dumps(payload, indent=2))

    run(["git", "add", str(picks_file)], cwd=worktree)
    result = subprocess.run(
        ["git", "diff", "--cached", "--quiet"], cwd=worktree
    )
    if result.returncode != 0:
        run(
            ["git", "commit", "-m", f"chore(hypotheses): picks {TODAY} for {hypothesis_id}"],
            cwd=worktree,
        )


def run_hypothesis(hyp: dict) -> bool:
    """
    Run one hypothesis experiment cycle.
    Returns True if the experiment concluded.
    """
    hid = hyp["id"]
    branch = hyp["branch"]
    scanner = hyp["scanner"]
    worktree = f"/tmp/hyp-{hid}"

    print(f"\n── Hypothesis: {hid} ──", flush=True)

    # 1. Create worktree
    run(["git", "fetch", "origin", branch], check=False)
    run(["git", "worktree", "add", worktree, branch])

    try:
        # 2. Run discovery in worktree
        result = subprocess.run(
            [sys.executable, "scripts/run_daily_discovery.py", "--date", TODAY, "--no-update-positions"],
            cwd=worktree,
            check=False,
        )
        if result.returncode != 0:
            print(f"  Discovery failed for {hid}, skipping picks update", flush=True)
        else:
            # 3. Extract picks + merge with existing
            new_picks = extract_picks(worktree, scanner)
            existing_picks = load_picks_from_branch(hid, branch)

            # Deduplicate by (date, ticker)
            seen = {(p["date"], p["ticker"]) for p in existing_picks}
            merged = existing_picks + [p for p in new_picks if (p["date"], p["ticker"]) not in seen]

            # 4. Save picks + commit in worktree
            save_picks_to_worktree(worktree, hid, scanner, merged)

            # 5. Push hypothesis branch
            run(["git", "push", "origin", f"HEAD:{branch}"], cwd=worktree)

        # 6. Update registry fields
        if TODAY not in hyp.get("picks_log", []):
            hyp.setdefault("picks_log", []).append(TODAY)
        hyp["days_elapsed"] = len(hyp["picks_log"])

        # 7. Check conclusion
        if hyp["days_elapsed"] >= hyp["min_days"]:
            return conclude_hypothesis(hyp)

    finally:
        run(["git", "worktree", "remove", "--force", worktree], check=False)

    return False


def conclude_hypothesis(hyp: dict) -> bool:
    """Run comparison, write conclusion doc, close/merge PR.
    Returns True."""
    hid = hyp["id"]
    scanner = hyp["scanner"]
    branch = hyp["branch"]

    print(f"\n  Concluding {hid}...", flush=True)

    # Load picks from branch
    picks = load_picks_from_branch(hid, branch)
    if not picks:
        print(f"  No picks found for {hid}, marking rejected", flush=True)
        conclusion = {
            "decision": "rejected",
            "reason": "No picks were collected during the experiment period",
            "hypothesis": {"count": 0, "evaluated": 0, "win_rate": None, "avg_return": None},
            "baseline": {"count": 0, "win_rate": None, "avg_return": None},
        }
    else:
        # Run comparison script
        result = subprocess.run(
            [
                sys.executable, "scripts/compare_hypothesis.py",
                "--hypothesis-id", hid,
                "--picks-json", json.dumps(picks),
                "--scanner", scanner,
                "--db-path", str(DB_PATH),
            ],
            cwd=str(ROOT),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print(f"  compare_hypothesis.py failed: {result.stderr}", flush=True)
            return False
        conclusion = json.loads(result.stdout)

    decision = conclusion["decision"]
    hyp_metrics = conclusion["hypothesis"]
    base_metrics = conclusion["baseline"]

    # Write concluded doc
    period_start = hyp.get("created_at", TODAY)
    concluded_doc = CONCLUDED_DIR / f"{TODAY}-{hid}.md"
    concluded_doc.write_text(
        f"# Hypothesis: {hyp['title']}\n\n"
        f"**Scanner:** {scanner}\n"
        f"**Branch:** {branch}\n"
        f"**Period:** {period_start} → {TODAY} ({hyp['days_elapsed']} days)\n"
        f"**Outcome:** {'accepted ✅' if decision == 'accepted' else 'rejected ❌'}\n\n"
        f"## Hypothesis\n{hyp.get('description', hyp['title'])}\n\n"
        f"## Results\n\n"
        f"| Metric | Baseline | Experiment | Delta |\n"
        f"|---|---|---|---|\n"
        f"| 7d win rate | {base_metrics.get('win_rate') or '—'}% | "
        f"{hyp_metrics.get('win_rate') or '—'}% | "
        f"{_delta_str(hyp_metrics.get('win_rate'), base_metrics.get('win_rate'), 'pp')} |\n"
        f"| Avg return | {base_metrics.get('avg_return') or '—'}% | "
        f"{hyp_metrics.get('avg_return') or '—'}% | "
        f"{_delta_str(hyp_metrics.get('avg_return'), base_metrics.get('avg_return'), '%')} |\n"
        f"| Picks | "
{base_metrics.get('count', '—')} | {hyp_metrics.get('count', '—')} | — |\n\n" f"## Decision\n{conclusion['reason']}\n\n" f"## Action\n" f"{'Branch merged into main.' if decision == 'accepted' else 'Branch closed without merging.'}\n" ) run(["git", "add", str(concluded_doc)], check=False) # Close or merge PR pr = hyp.get("pr_number") if pr: if decision == "accepted": subprocess.run( ["gh", "pr", "merge", str(pr), "--squash", "--delete-branch"], cwd=str(ROOT), check=False, ) else: subprocess.run( ["gh", "pr", "close", str(pr), "--delete-branch"], cwd=str(ROOT), check=False, ) # Update registry entry hyp["status"] = "concluded" hyp["conclusion"] = decision print(f" {hid}: {decision} — {conclusion['reason']}", flush=True) return True def _delta_str(hyp_val, base_val, unit: str) -> str: if hyp_val is None or base_val is None: return "—" delta = hyp_val - base_val sign = "+" if delta >= 0 else "" return f"{sign}{delta:.1f}{unit}" def promote_pending(registry: dict) -> None: """Promote the highest-priority pending hypothesis to running if a slot is open.""" running_count = sum(1 for h in registry["hypotheses"] if h["status"] == "running") max_active = registry.get("max_active", 5) if running_count >= max_active: return pending = [h for h in registry["hypotheses"] if h["status"] == "pending"] if not pending: return # Promote highest priority to_promote = max(pending, key=lambda h: h.get("priority", 0)) to_promote["status"] = "running" print(f"\n Promoted pending hypothesis to running: {to_promote['id']}", flush=True) def main(): registry = load_registry() filter_id = os.environ.get("FILTER_ID", "").strip() hypotheses = registry.get("hypotheses", []) running = [ h for h in hypotheses if h["status"] == "running" and (not filter_id or h["id"] == filter_id) ] if not running: print("No running hypotheses to process.", flush=True) else: for hyp in running: run_hypothesis(hyp) promote_pending(registry) save_registry(registry) print("\nRegistry updated.", flush=True) if __name__ 
== "__main__": main() ``` - [ ] **Step 3: Verify the workflow YAML is valid** ```bash python3 -c "import yaml; yaml.safe_load(open('.github/workflows/hypothesis-runner.yml'))" 2>/dev/null \ || python3 -c " import re, sys with open('.github/workflows/hypothesis-runner.yml') as f: content = f.read() # Just check the file exists and has the cron line assert '0 8 * * *' in content, 'missing cron' print('workflow file looks good') " ``` - [ ] **Step 4: Commit** ```bash git add .github/workflows/hypothesis-runner.yml scripts/run_hypothesis_runner.py git commit -m "feat(hypotheses): add daily hypothesis runner workflow" ``` --- ## Task 5: Dashboard Hypotheses Tab **Files:** - Create: `tradingagents/ui/pages/hypotheses.py` - Modify: `tradingagents/ui/pages/__init__.py` - Modify: `tradingagents/ui/dashboard.py` - [ ] **Step 1: Write the failing test** Create `tests/test_hypotheses_page.py`: ```python """Tests for the hypotheses dashboard page data loading.""" import json import sys from pathlib import Path import pytest sys.path.insert(0, str(Path(__file__).parent.parent)) from tradingagents.ui.pages.hypotheses import ( load_active_hypotheses, load_concluded_hypotheses, days_until_ready, ) # ── load_active_hypotheses ──────────────────────────────────────────────────── def test_load_active_hypotheses(tmp_path): active = { "max_active": 5, "hypotheses": [ { "id": "options_flow-test", "title": "Test hypothesis", "scanner": "options_flow", "status": "running", "priority": 7, "days_elapsed": 5, "min_days": 14, "created_at": "2026-04-01", "picks_log": ["2026-04-01"] * 5, "conclusion": None, } ], } f = tmp_path / "active.json" f.write_text(json.dumps(active)) result = load_active_hypotheses(str(f)) assert len(result) == 1 assert result[0]["id"] == "options_flow-test" def test_load_active_hypotheses_missing_file(tmp_path): result = load_active_hypotheses(str(tmp_path / "missing.json")) assert result == [] # ── load_concluded_hypotheses 
# ─────────────────────────────────────────────────

def test_load_concluded_hypotheses(tmp_path):
    doc = tmp_path / "2026-04-10-options_flow-test.md"
    doc.write_text(
        "# Hypothesis: Test\n\n"
        "**Scanner:** options_flow\n"
        "**Period:** 2026-03-27 → 2026-04-10 (14 days)\n"
        "**Outcome:** accepted ✅\n"
    )
    results = load_concluded_hypotheses(str(tmp_path))
    assert len(results) == 1
    assert results[0]["filename"] == doc.name
    assert results[0]["outcome"] == "accepted ✅"


def test_load_concluded_hypotheses_empty_dir(tmp_path):
    results = load_concluded_hypotheses(str(tmp_path))
    assert results == []


# ── days_until_ready ──────────────────────────────────────────────────────────

def test_days_until_ready_has_days_left():
    hyp = {"days_elapsed": 5, "min_days": 14}
    assert days_until_ready(hyp) == 9


def test_days_until_ready_past_due():
    hyp = {"days_elapsed": 15, "min_days": 14}
    assert days_until_ready(hyp) == 0
```

- [ ] **Step 2: Run tests to confirm they fail**

```bash
python -m pytest tests/test_hypotheses_page.py -v 2>&1 | head -20
```

Expected: `ModuleNotFoundError` for `tradingagents.ui.pages.hypotheses`.

- [ ] **Step 3: Write `tradingagents/ui/pages/hypotheses.py`**

```python
"""
Hypotheses dashboard page — tracks active and concluded experiments.

Reads docs/iterations/hypotheses/active.json and the concluded/ directory.
No external API calls; all data is file-based.
"""

import json
import re
from pathlib import Path
from typing import Any, Dict, List

import streamlit as st

from tradingagents.ui.theme import COLORS, page_header

_REPO_ROOT = Path(__file__).parent.parent.parent.parent
_ACTIVE_JSON = _REPO_ROOT / "docs/iterations/hypotheses/active.json"
_CONCLUDED_DIR = _REPO_ROOT / "docs/iterations/hypotheses/concluded"


# ── Data loaders ─────────────────────────────────────────────────────────────

def load_active_hypotheses(active_path: str = str(_ACTIVE_JSON)) -> List[Dict[str, Any]]:
    """Load all hypotheses from active.json.
    Returns [] if file missing."""
    path = Path(active_path)
    if not path.exists():
        return []
    try:
        with open(path) as f:
            data = json.load(f)
        return data.get("hypotheses", [])
    except Exception:
        return []


def load_concluded_hypotheses(concluded_dir: str = str(_CONCLUDED_DIR)) -> List[Dict[str, Any]]:
    """
    Load concluded hypothesis metadata by parsing the markdown files in concluded/.

    Extracts: filename, title, scanner, period, outcome from each .md file.
    """
    dir_path = Path(concluded_dir)
    if not dir_path.exists():
        return []
    results = []
    for md_file in sorted(dir_path.glob("*.md"), reverse=True):
        if md_file.name == ".gitkeep":
            continue
        try:
            text = md_file.read_text()
            title = _extract_md_field(text, r"^# Hypothesis: (.+)$")
            scanner = _extract_md_field(text, r"^\*\*Scanner:\*\* (.+)$")
            period = _extract_md_field(text, r"^\*\*Period:\*\* (.+)$")
            outcome = _extract_md_field(text, r"^\*\*Outcome:\*\* (.+)$")
            results.append({
                "filename": md_file.name,
                "title": title or md_file.stem,
                "scanner": scanner or "—",
                "period": period or "—",
                "outcome": outcome or "—",
            })
        except Exception:
            continue
    return results


def _extract_md_field(text: str, pattern: str) -> str:
    """Extract a field value from a markdown line using regex."""
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else ""


def days_until_ready(hyp: Dict[str, Any]) -> int:
    """Return number of days remaining before hypothesis can conclude (min 0)."""
    return max(0, hyp.get("min_days", 14) - hyp.get("days_elapsed", 0))


# ── Rendering ─────────────────────────────────────────────────────────────────

def render() -> None:
    """Render the hypotheses tracking page."""
    st.markdown(
        page_header("Hypotheses", "Active experiments & concluded findings"),
        unsafe_allow_html=True,
    )

    hypotheses = load_active_hypotheses()
    concluded = load_concluded_hypotheses()

    if not hypotheses and not concluded:
        st.info(
            "No hypotheses yet. Run `/backtest-hypothesis \"<description>\"` to start an experiment."
        )
        return

    # ── Active experiments ────────────────────────────────────────────────────
    running = [h for h in hypotheses if h["status"] == "running"]
    pending = [h for h in hypotheses if h["status"] == "pending"]

    st.markdown(
        f'<div class="section-title">Active Experiments '
        f'<span class="accent">// {len(running)} running, {len(pending)} pending</span></div>',
        unsafe_allow_html=True,
    )

    if running or pending:
        active_rows = []
        for h in sorted(running + pending, key=lambda x: -x.get("priority", 0)):
            days_left = days_until_ready(h)
            ready_str = "concluding soon" if days_left == 0 else f"{days_left}d left"
            status_color = COLORS["green"] if h["status"] == "running" else COLORS["amber"]
            active_rows.append({
                "ID": h["id"],
                "Title": h.get("title", "—"),
                "Scanner": h.get("scanner", "—"),
                "Status": h["status"],
                "Progress": f"{h.get('days_elapsed', 0)}/{h.get('min_days', 14)}d",
                "Picks": len(h.get("picks_log", [])),
                "Ready": ready_str,
                # None (not a string placeholder) keeps this column numeric for NumberColumn
                "Priority": h.get("priority"),
            })

        import pandas as pd

        df = pd.DataFrame(active_rows)
        st.dataframe(
            df,
            width="stretch",
            hide_index=True,
            column_config={
                "ID": st.column_config.TextColumn(width="medium"),
                "Title": st.column_config.TextColumn(width="large"),
                "Scanner": st.column_config.TextColumn(width="medium"),
                "Status": st.column_config.TextColumn(width="small"),
                "Progress": st.column_config.TextColumn(width="small"),
                "Picks": st.column_config.NumberColumn(format="%d", width="small"),
                "Ready": st.column_config.TextColumn(width="medium"),
                "Priority": st.column_config.NumberColumn(format="%d/9", width="small"),
            },
        )
    else:
        st.info("No active experiments.")

    st.markdown("<div style='height:1.5rem;'></div>", unsafe_allow_html=True)

    # ── Concluded experiments ─────────────────────────────────────────────────
    st.markdown(
        f'<div class="section-title">Concluded Experiments '
        f'<span class="accent">// {len(concluded)} total</span></div>',
        unsafe_allow_html=True,
    )

    if concluded:
        import pandas as pd

        concluded_rows = []
        for c in concluded:
            outcome =
c["outcome"]
            emoji = "✅" if "accepted" in outcome else "❌"
            concluded_rows.append({
                "Date": c["filename"][:10],
                "Title": c["title"],
                "Scanner": c["scanner"],
                "Period": c["period"],
                "Outcome": emoji,
            })
        cdf = pd.DataFrame(concluded_rows)
        st.dataframe(
            cdf,
            width="stretch",
            hide_index=True,
            column_config={
                "Date": st.column_config.TextColumn(width="small"),
                "Title": st.column_config.TextColumn(width="large"),
                "Scanner": st.column_config.TextColumn(width="medium"),
                "Period": st.column_config.TextColumn(width="medium"),
                "Outcome": st.column_config.TextColumn(width="small"),
            },
        )
    else:
        st.info("No concluded experiments yet.")
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
python -m pytest tests/test_hypotheses_page.py -v
```

Expected: all 6 tests pass.

- [ ] **Step 5: Register the page in `tradingagents/ui/pages/__init__.py`**

Add after the `settings` import block (around line 38):

```python
try:
    from tradingagents.ui.pages import hypotheses
except Exception as _e:
    _logger.error("Failed to import hypotheses page: %s", _e, exc_info=True)
    hypotheses = None
```

And add `"hypotheses"` to `__all__`:

```python
__all__ = [
    "home",
    "todays_picks",
    "portfolio",
    "performance",
    "settings",
    "hypotheses",
]
```

- [ ] **Step 6: Add "Hypotheses" to dashboard navigation in `tradingagents/ui/dashboard.py`**

In `render_sidebar`, change the `options` list:

```python
    page = st.radio(
        "Navigation",
        options=["Overview", "Signals", "Portfolio", "Performance", "Hypotheses", "Config"],
        label_visibility="collapsed",
    )
```

In `route_page`, add to `page_map`:

```python
    page_map = {
        "Overview": pages.home,
        "Signals": pages.todays_picks,
        "Portfolio": pages.portfolio,
        "Performance": pages.performance,
        "Hypotheses": pages.hypotheses,
        "Config": pages.settings,
    }
```

- [ ] **Step 7: Run both hypothesis test files**

```bash
python -m pytest tests/test_compare_hypothesis.py tests/test_hypotheses_page.py -v
```

Expected: all 16 tests pass.
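
Step 5 sets `hypotheses = None` when the import fails, so the `page_map` entry added in Step 6 can hold `None` at runtime. How `route_page` handles that case is not shown in this plan. The following is a minimal sketch of one possible guard, assuming each page module exposes a `render()` function as `hypotheses.py` does; the helper name `resolve_renderer` and the `demo_map` stand-ins are hypothetical.

```python
from types import SimpleNamespace
from typing import Callable, Mapping, Optional


def resolve_renderer(page_map: Mapping[str, object], name: str) -> Optional[Callable[[], None]]:
    """Return the render() callable for `name`, or None if the page is unavailable.

    Hypothetical helper: covers an unknown page name, a module that was set
    to None because its import failed, and a module without render().
    """
    module = page_map.get(name)
    if module is None:
        return None
    render = getattr(module, "render", None)
    return render if callable(render) else None


# Stand-ins for real page modules, for illustration only.
demo_map = {
    "Overview": SimpleNamespace(render=lambda: None),
    "Hypotheses": None,  # simulates a failed import
}

for page_name in ("Overview", "Hypotheses", "Unknown"):
    renderer = resolve_renderer(demo_map, page_name)
    print(page_name, "available" if renderer else "unavailable")
```

With a guard like this, `route_page` could show a visible message for an unavailable page instead of raising `AttributeError` on `None.render()`.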
- [ ] **Step 8: Commit everything**

```bash
git add \
  tradingagents/ui/pages/hypotheses.py \
  tradingagents/ui/pages/__init__.py \
  tradingagents/ui/dashboard.py \
  tests/test_hypotheses_page.py
git commit -m "feat(hypotheses): add Hypotheses dashboard tab"
```

---

## Self-Review

**Spec coverage check:**
- ✅ `active.json` schema with `status: running/pending/concluded` (Task 1)
- ✅ `/backtest-hypothesis` command: classify, priority scoring, pending queue, branch creation (Task 3)
- ✅ Running experiments are never paused: enforced in `run_hypothesis_runner.py` (only `running` entries are processed; new ones queue as `pending`)
- ✅ Daily runner: worktree per hypothesis, run discovery, commit picks, conclude (Task 4)
- ✅ Statistical comparison with 5pp / 1% thresholds, minimum 5 evaluated picks (Task 2)
- ✅ Auto-promote pending → running when a slot opens: `promote_pending()` in runner
- ✅ Concluded doc written with metrics table: `conclude_hypothesis()` in runner
- ✅ PR merged (accepted) or closed (rejected) automatically: `conclude_hypothesis()`
- ✅ Dashboard tab with active + concluded tables (Task 5)

**Type/name consistency:**
- `hypothesis_id` / `hid` / `id` field: the dict key is always `"id"`, the local variable is `hid`, and the argument is `--hypothesis-id`; consistent throughout
- `picks.json` structure: `{"hypothesis_id": ..., "scanner": ..., "picks": [...]}`, used consistently in `save_picks_to_worktree` and `load_picks_from_branch`
- `strategy_match` field used to filter picks in `extract_picks`; matches the `discovery_result.json` structure confirmed by inspection
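
The `picks.json` shape listed under type/name consistency can be checked mechanically. Below is a minimal sketch, assuming the three top-level keys named above are required and that `picks` must be a list. The function name `validate_picks_payload` and the sample payloads are hypothetical and not part of the plan's scripts.

```python
from typing import Any, Dict, List

# Top-level keys named in the consistency check above.
REQUIRED_KEYS = ("hypothesis_id", "scanner", "picks")


def validate_picks_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a picks.json payload (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in payload]
    if "picks" in payload and not isinstance(payload["picks"], list):
        problems.append("picks must be a list")
    return problems


# Illustrative payloads only; real picks carry more fields per entry.
good = {"hypothesis_id": "options_flow-test", "scanner": "options_flow", "picks": []}
bad = {"scanner": "options_flow", "picks": {}}

print(validate_picks_payload(good))
print(validate_picks_payload(bad))
```

A check like this could run in both `save_picks_to_worktree` and `load_picks_from_branch` so that a drift in the file shape fails early rather than at comparison time.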