Mistakes & Lessons Learned

Documenting bugs and wrong assumptions to avoid repeating them.

Mistake 1: Scanner agents had no tool execution

What happened: All 4 scanner agents (geopolitical, market movers, sector, industry) used llm.bind_tools(tools) but only checked if len(result.tool_calls) == 0: report = result.content. When the LLM chose to call tools (which, in practice, it did whenever tools were bound), nothing executed them, so report was never assigned. Reports were always empty strings.

Root cause: Copied the pattern from existing analysts (news_analyst.py) without realizing that the trading graph has separate ToolNode graph nodes that handle tool execution in a routing loop. The scanner graph has no such nodes.

Fix: Created tool_runner.py with run_tool_loop() that executes tools inline within the agent node.

Lesson: When an LLM has bind_tools, there MUST be a tool execution mechanism — either graph-level ToolNode routing or inline execution. Always verify the tool execution path exists.
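
A minimal sketch of inline tool execution, assuming the LLM object exposes invoke(messages) and returns a message with .content and .tool_calls (a list of dicts with name, args, and id). The function name mirrors run_tool_loop() from the fix, but the signature, the Message class, and FakeLLM are illustrative stand-ins, not the contents of the project's tool_runner.py.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Message:
    """Stand-in for a chat message (hypothetical, for illustration only)."""
    role: str
    content: str = ""
    tool_calls: List[dict] = field(default_factory=list)


def run_tool_loop(llm, messages, tools: Dict[str, Callable[..., Any]], max_rounds: int = 5) -> str:
    """Invoke the LLM, run any requested tools inline, and repeat until it answers in text."""
    history = list(messages)
    for _ in range(max_rounds):
        result = llm.invoke(history)
        history.append(result)
        if not result.tool_calls:
            # No tool requests left: the content is the final report.
            return result.content
        for call in result.tool_calls:
            output = tools[call["name"]](**call["args"])
            history.append(Message(role="tool", content=str(output)))
    raise RuntimeError("LLM kept requesting tools; no final report produced")


class FakeLLM:
    """Requests one tool call, then writes a report from the tool output."""

    def invoke(self, history):
        tool_msgs = [m for m in history if m.role == "tool"]
        if not tool_msgs:
            call = {"name": "get_price", "args": {"ticker": "XLK"}, "id": "1"}
            return Message("assistant", tool_calls=[call])
        return Message("assistant", content=f"Report: {tool_msgs[-1].content}")


report = run_tool_loop(
    FakeLLM(),
    [Message("user", "Scan tech")],
    {"get_price": lambda ticker: f"{ticker}=200"},
)
```

The max_rounds cap is a design choice in this sketch: it turns a model that never stops calling tools into a visible error instead of an infinite loop.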

Mistake 2: Assumed yfinance `Sector.overview` has performance data

What happened: Wrote get_sector_performance_yfinance using yf.Sector("technology").overview["oneDay"] etc. This field doesn't exist — overview only returns metadata (companies_count, market_cap, industries_count).

Root cause: Assumed the yfinance Sector API mirrors the Yahoo Finance website which shows performance data. It doesn't.

Fix: Switched to SPDR ETF proxy approach — download ETF prices and compute percentage changes.

Lesson: Always test data source APIs interactively before writing agent code. Run python -c "import yfinance as yf; print(yf.Sector('technology').overview)" to see actual data shape.
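
The ETF-proxy calculation is plain arithmetic on closing prices. A minimal sketch, assuming the closes are ordered oldest to newest; the function name and the sample prices are made up for illustration, and downloading the prices is left out.

```python
from typing import List


def pct_change(closes: List[float], periods: int) -> float:
    """Percent change between the latest close and the close `periods` rows earlier."""
    if periods < 1 or len(closes) <= periods:
        raise ValueError("not enough price history for the requested lookback")
    latest, earlier = closes[-1], closes[-1 - periods]
    return (latest / earlier - 1.0) * 100.0


# Hypothetical closes for one sector ETF, oldest first.
etf_closes = [200.0, 202.0, 198.0, 204.0, 210.0]
one_day = pct_change(etf_closes, 1)   # compares 210 against 204
four_day = pct_change(etf_closes, 4)  # compares 210 against 200
```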

Mistake 3: yfinance `top_companies` — ticker is the index, not a column

What happened: Used row.get('symbol') to get ticker from top_companies DataFrame. Always returned N/A.

Root cause: The DataFrame has index.name = 'symbol' — tickers are the index, not a column. The actual columns are ['name', 'rating', 'market weight'].

Fix: Changed to for symbol, row in top_companies.iterrows().

Lesson: Always inspect DataFrame structure with .head(), .columns, and .index before writing access code.

Mistake 4: Hardcoded Ollama localhost URL

What happened: openai_client.py had base_url = "http://localhost:11434/v1" hardcoded for Ollama provider, ignoring the self.base_url config. User's Ollama runs on 192.168.50.76:11434.

Fix: Changed to host = self.base_url or "http://localhost:11434" with /v1 suffix appended.

Lesson: Never hardcode URLs. Always use the configured value with a sensible default.
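
A minimal sketch of the configured-value-with-default pattern. The function name is hypothetical, and stripping a trailing slash or an already-present /v1 is an extra assumption in this sketch, not something the fix is known to do.

```python
from typing import Optional

DEFAULT_OLLAMA_HOST = "http://localhost:11434"


def ollama_base_url(configured: Optional[str]) -> str:
    """Use the configured host when present, else the local default, and end with /v1."""
    host = (configured or DEFAULT_OLLAMA_HOST).rstrip("/")
    # Avoid producing ".../v1/v1" if the config already includes the suffix.
    return host if host.endswith("/v1") else f"{host}/v1"


local = ollama_base_url(None)
remote = ollama_base_url("http://192.168.50.76:11434")
```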

Mistake 5: Only caught `RateLimitError` in vendor fallback

What happened: route_to_vendor() only caught RateLimitError. The Alpha Vantage demo key produces "Information" responses (not rate limit errors), and other failures raise other AlphaVantageError subtypes, so the fallback to yfinance never triggered.

Fix: Broadened catch to AlphaVantageError (base class).

Lesson: Fallback mechanisms should catch the broadest reasonable error class, not just specific subtypes.
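
A minimal sketch of a fallback that catches the base class. The exception classes here are stand-ins for the project's own, InformationResponseError is a hypothetical subtype name, and route_with_fallback is a simplified illustration rather than the real route_to_vendor().

```python
class AlphaVantageError(Exception):
    """Stand-in for the vendor's base error class."""


class RateLimitError(AlphaVantageError):
    """Stand-in for the rate limit subtype."""


class InformationResponseError(AlphaVantageError):
    """Hypothetical subtype for "Information" responses from the demo key."""


def route_with_fallback(primary, fallback, *args):
    """Try the primary vendor; on any vendor error, use the fallback vendor."""
    try:
        return primary(*args)
    except AlphaVantageError:
        # Catching the base class covers RateLimitError and every other subtype.
        return fallback(*args)


def av_with_demo_key(symbol):
    raise InformationResponseError("demo key is limited")


def yfinance_stub(symbol):
    return f"yfinance:{symbol}"


result = route_with_fallback(av_with_demo_key, yfinance_stub, "XLK")
```

Errors outside the vendor hierarchy (for example a TypeError from a programming mistake) still propagate, which keeps real bugs visible.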

Mistake 6: AV scanner functions silently caught errors

What happened: get_sector_performance_alpha_vantage and get_industry_performance_alpha_vantage caught exceptions internally and embedded error strings in the output (e.g., "Error: ..." in the result dict). route_to_vendor never saw an exception, so it never fell back to yfinance.

Fix: Made both functions raise AlphaVantageError when ALL queries fail, while still tolerating partial failures.

Lesson: Functions used inside route_to_vendor MUST raise exceptions on total failure — embedding errors in return values defeats the fallback mechanism.
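
A minimal sketch of "tolerate partial failures, raise on total failure". The function and parameter names are hypothetical, and AlphaVantageError is a stand-in class so the example runs on its own.

```python
class AlphaVantageError(Exception):
    """Stand-in for the vendor's base error class."""


def collect_sector_performance(sectors, fetch_one):
    """Query each sector; skip individual failures, but raise if every query fails."""
    results, errors = {}, {}
    for sector in sectors:
        try:
            results[sector] = fetch_one(sector)
        except AlphaVantageError as exc:
            errors[sector] = str(exc)
    if sectors and not results:
        # Total failure must surface as an exception so the caller can fall back.
        raise AlphaVantageError(f"all {len(errors)} sector queries failed")
    return results


def flaky_fetch(sector):
    if sector == "energy":
        raise AlphaVantageError("Information: demo key")
    return 1.2


partial = collect_sector_performance(["technology", "energy"], flaky_fetch)
```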

Mistake 7: LangGraph concurrent write without reducer

What happened: Phase 1 runs 3 scanners in parallel. All write to sender (and other shared fields). LangGraph raised INVALID_CONCURRENT_GRAPH_UPDATE because ScannerState had no reducer for concurrent writes.

Fix: Added _last_value reducer via Annotated[str, _last_value] to all ScannerState fields.

Lesson: Any LangGraph state field written by parallel nodes MUST have a reducer. Use Annotated[type, reducer_fn].

Mistake 8: .env file had placeholder values in worktree

What happened: Created .env in worktree with template values (your_openrouter_key_here). User's real keys were only in main repo's .env. load_dotenv() loaded the worktree placeholder, so OpenRouter returned 401.

Root cause: Created the .env template during setup without copying the real keys. load_dotenv() with override=False (the default) does not overwrite variables that are already set, so whichever value is loaded first wins.

Fix: Updated worktree .env with real keys. Also added fallback load_dotenv() call for project root.

Lesson: When creating .env files, always verify they have real values, not placeholders. When debugging auth errors, first check os.environ.get('KEY') to see what value is actually loaded.
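
A minimal sketch of a startup check for placeholder values. The helper name, the marker strings, and the variable names (including OPENROUTER_API_KEY) are assumptions for illustration; adjust them to whatever the project's .env template uses.

```python
import os

# Substrings that suggest a template value rather than a real key (assumed conventions).
PLACEHOLDER_MARKERS = ("your_", "_here", "changeme")


def find_placeholder_keys(names, env=None):
    """Return the names whose values are missing, empty, or look like placeholders."""
    env = os.environ if env is None else env
    suspect = []
    for name in names:
        value = env.get(name, "")
        if not value or any(marker in value.lower() for marker in PLACEHOLDER_MARKERS):
            suspect.append(name)
    return suspect


# A dict stands in for os.environ so the example does not depend on the real environment.
sample_env = {"OPENROUTER_API_KEY": "your_openrouter_key_here", "OTHER_KEY": "sk-abc123"}
suspect = find_placeholder_keys(["OPENROUTER_API_KEY", "OTHER_KEY", "MISSING_KEY"], sample_env)
```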

Mistake 10: Python 3.11 f-string backslash restriction

What happened: ttm_analysis.py used a backslash inside an f-string expression:

f"| Debt / Equity | {f\"{ttm['debt_to_equity']:.2f}x\" if ...} |"

Python 3.11 raises SyntaxError: f-string expression part cannot include a backslash.

Fix: Pre-compute the string outside the f-string or use string concatenation:

f"| Debt / Equity | {(str(round(ttm['debt_to_equity'], 2)) + 'x') if ... else 'N/A'} |"

Lesson: Python 3.11 and earlier do not allow backslashes inside f-string {} expressions. Extract to a variable or use string concatenation instead. (Python 3.12+ relaxes this restriction.)
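
A minimal sketch of the "extract to a variable" option, which also keeps the original :.2f formatting. The condition in the original expression is elided, so the `is not None` check and the function name here are assumptions.

```python
def debt_to_equity_row(ttm: dict) -> str:
    """Build the table row without nesting an escaped f-string inside another f-string."""
    value = ttm.get("debt_to_equity")
    # Pre-compute the cell so the outer f-string expression contains no backslash.
    cell = f"{value:.2f}x" if value is not None else "N/A"
    return f"| Debt / Equity | {cell} |"


row = debt_to_equity_row({"debt_to_equity": 1.5})
missing = debt_to_equity_row({})
```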

Mistake 11: Mock test data precision — threshold boundary failures

What happened: test_risk_on_regime failed because mock risk-on data scored only 2 (needed ≥3). Two signals were inadvertently near-threshold:

Flat VIX series → VIX trend signal = 0 (SMA5 == SMA20)
_trending_series(80, 85, 250) HYG → 21-day change was 0.499%, just under 0.5% threshold → credit spread = 0

Fix: Made mock data obviously far from thresholds: _trending_series(30, 12, n) for VIX (clearly falling), _trending_series(75, 90, n) for HYG (clearly improving).

Lesson: When writing signal/threshold tests, make mock data unmistakably one-sided. Near-threshold values cause brittle tests. The mock should test the regime, not the exact threshold boundary.
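
The near-threshold case can be reproduced with a few lines. This sketch assumes _trending_series(start, end, n) is a linear ramp that includes both endpoints, which is consistent with the 0.499% figure above but is not confirmed by the test code; pct_change_over is a hypothetical helper.

```python
from typing import List


def _trending_series(start: float, end: float, n: int) -> List[float]:
    """Linear ramp from start to end over n points (assumed shape of the test helper)."""
    step = (end - start) / (n - 1)
    return [start + step * i for i in range(n)]


def pct_change_over(series: List[float], days: int) -> float:
    """Percent change between the last value and the value `days` points earlier."""
    return (series[-1] / series[-1 - days] - 1.0) * 100.0


THRESHOLD_PCT = 0.5

# Old mock: the 21-day change lands just under the threshold.
near = pct_change_over(_trending_series(80, 85, 250), 21)
# New mock: the 21-day change is well clear of the threshold.
clear = pct_change_over(_trending_series(75, 90, 250), 21)
```

Printing both values next to the threshold when designing a mock makes it obvious whether the data is testing the regime or accidentally testing the boundary.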

Mistake 12: Remote naming — "origin" is the fork, not the upstream

What happened: Confusion about which remote is the "fork" vs "origin". The user said "you pushed to origin, not the fork" — but there is only one remote configured, and it points to aguzererler/TradingAgents which IS the fork.

Setup:

origin → http://127.0.0.1:46699/git/aguzererler/TradingAgents  ← this IS the fork

There is no separate upstream remote for the original/parent repo.

Lesson: In this project, origin = the user's fork (aguzererler/TradingAgents). Always push development branches to origin. If an upstream remote is ever added, never push feature branches to it — only fetch from it.

Mistake 9: Removed top-level `llm_provider` but code still references it

What happened: Removed llm_provider from default_config.py (since we have per-tier providers). But scanner_graph.py line 78 still does self.config.get(f"{tier}_llm_provider") or self.config["llm_provider"], which would raise KeyError if a per-tier provider is ever missing or None.

Status: Works currently because per-tier providers are always set. But it's a latent bug.

TODO: Add a safe fallback or remove the dead code path.
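
One possible shape for the safe fallback, as a sketch only. The function name and the tier names are hypothetical; the idea is to replace the bare KeyError with an error that names the misconfigured tier.

```python
def resolve_provider(config: dict, tier: str) -> str:
    """Prefer the per-tier provider, then a legacy top-level one, else fail clearly."""
    provider = config.get(f"{tier}_llm_provider") or config.get("llm_provider")
    if not provider:
        raise ValueError(f"no LLM provider configured for tier '{tier}'")
    return provider


# "quick" is a hypothetical tier name used only for this example.
provider = resolve_provider({"quick_llm_provider": "openrouter"}, "quick")
```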
