Mistakes & Lessons Learned
Documenting bugs and wrong assumptions to avoid repeating them.
Mistake 1: Scanner agents had no tool execution
What happened: All 4 scanner agents (geopolitical, market movers, sector, industry) used llm.bind_tools(tools) but only checked if len(result.tool_calls) == 0: report = result.content. When the LLM chose to call tools (which it always does when tools are available), nobody executed them. Reports were always empty strings.
Root cause: Copied the pattern from existing analysts (news_analyst.py) without realizing that the trading graph has separate ToolNode graph nodes that handle tool execution in a routing loop. The scanner graph has no such nodes.
Fix: Created tool_runner.py with run_tool_loop() that executes tools inline within the agent node.
Lesson: When an LLM has bind_tools, there MUST be a tool execution mechanism — either graph-level ToolNode routing or inline execution. Always verify the tool execution path exists.
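A minimal sketch of the inline approach, assuming nothing about the real tool_runner.py beyond the name run_tool_loop(). The signature, the FakeResponse class, and the scripted model below are hypothetical stand-ins so the example runs without LangChain; real LangChain messages also carry a tool call id that this sketch omits.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeResponse:
    """Stand-in for an LLM message: text content plus requested tool calls."""
    content: str = ""
    tool_calls: list = field(default_factory=list)


def run_tool_loop(invoke, tools, messages, max_rounds=5):
    """Call the model, execute any requested tools, repeat until it answers in text.

    Hypothetical signature: `invoke` maps a message list to a response,
    `tools` maps tool names to plain callables.
    """
    for _ in range(max_rounds):
        result = invoke(messages)
        if not result.tool_calls:
            return result.content  # the model produced the final report
        messages.append(result)
        for call in result.tool_calls:
            output: Any = tools[call["name"]](**call["args"])
            messages.append(
                {"role": "tool", "name": call["name"], "content": str(output)}
            )
    raise RuntimeError("model kept requesting tools after max_rounds")


def scripted_llm(messages):
    """Fake model: asks for a tool first, answers once a tool result exists."""
    if any(isinstance(m, dict) and m.get("role") == "tool" for m in messages):
        return FakeResponse(content="Report: " + messages[-1]["content"])
    return FakeResponse(tool_calls=[{"name": "get_news", "args": {"topic": "oil"}}])


report = run_tool_loop(
    scripted_llm,
    {"get_news": lambda topic: f"headlines about {topic}"},
    [],
)
```

The max_rounds cap is a design choice of this sketch: it turns a model that never stops calling tools into a visible error instead of an infinite loop.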
Mistake 2: Assumed yfinance Sector.overview has performance data
What happened: Wrote get_sector_performance_yfinance using yf.Sector("technology").overview["oneDay"] etc. This field doesn't exist — overview only returns metadata (companies_count, market_cap, industries_count).
Root cause: Assumed the yfinance Sector API mirrors the Yahoo Finance website which shows performance data. It doesn't.
Fix: Switched to SPDR ETF proxy approach — download ETF prices and compute percentage changes.
Lesson: Always test data source APIs interactively before writing agent code. Run python -c "import yfinance as yf; print(yf.Sector('technology').overview)" to see actual data shape.
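The percentage-change step of the ETF proxy approach can be sketched as below. The function name and the closing prices are made up for illustration; the real code downloads prices first, which is left out here so the example runs offline.

```python
def percent_change(closes, periods):
    """Percent change between the latest close and the close `periods` rows earlier."""
    if len(closes) <= periods:
        raise ValueError("not enough price history")
    past, latest = closes[-1 - periods], closes[-1]
    return (latest / past - 1.0) * 100.0


# Made-up daily closes for a hypothetical sector ETF, oldest first.
closes = [100.0, 101.0, 99.0, 102.0, 103.0, 105.0]

one_day = percent_change(closes, 1)    # 103.0 -> 105.0
five_day = percent_change(closes, 5)   # 100.0 -> 105.0
```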
Mistake 3: yfinance top_companies — ticker is the index, not a column
What happened: Used row.get('symbol') to get ticker from top_companies DataFrame. Always returned N/A.
Root cause: The DataFrame has index.name = 'symbol' — tickers are the index, not a column. The actual columns are ['name', 'rating', 'market weight'].
Fix: Changed to for symbol, row in top_companies.iterrows().
Lesson: Always inspect DataFrame structure with .head(), .columns, and .index before writing access code.
Mistake 4: Hardcoded Ollama localhost URL
What happened: openai_client.py had base_url = "http://localhost:11434/v1" hardcoded for Ollama provider, ignoring the self.base_url config. User's Ollama runs on 192.168.50.76:11434.
Fix: Changed to host = self.base_url or "http://localhost:11434" with /v1 suffix appended.
Lesson: Never hardcode URLs. Always use the configured value with a sensible default.
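A sketch of the configured-value-with-default pattern. The function name is hypothetical, and the check that avoids doubling an existing /v1 suffix is an extra assumption of this example, not something the fix above is known to do.

```python
DEFAULT_OLLAMA_HOST = "http://localhost:11434"


def ollama_api_url(base_url=None):
    """Return the OpenAI-compatible endpoint for a configured Ollama host."""
    host = (base_url or DEFAULT_OLLAMA_HOST).rstrip("/")
    # Assumption: tolerate configs that already include the /v1 suffix.
    return host if host.endswith("/v1") else host + "/v1"


default_url = ollama_api_url()
remote_url = ollama_api_url("http://192.168.50.76:11434")
```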
Mistake 5: Only caught RateLimitError in vendor fallback
What happened: route_to_vendor() only caught RateLimitError. Alpha Vantage demo key returns "Information" responses (not rate limit errors) and other AlphaVantageError subtypes. Fallback to yfinance never triggered.
Fix: Broadened catch to AlphaVantageError (base class).
Lesson: Fallback mechanisms should catch the broadest reasonable error class, not just specific subtypes.
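To illustrate why catching the base class matters, here is a self-contained sketch. The exception classes are local stand-ins for the project's hierarchy, InformationResponseError is a hypothetical subtype name, and this route_to_vendor signature is simplified for the example.

```python
class AlphaVantageError(Exception):
    """Stand-in for the project's base error class."""


class RateLimitError(AlphaVantageError):
    """Stand-in subtype for rate limiting."""


class InformationResponseError(AlphaVantageError):
    """Hypothetical subtype for the demo key's "Information" responses."""


def route_to_vendor(vendors, *args):
    """Try each (name, callable) pair in order; fall back on any AlphaVantageError."""
    last_error = None
    for name, fn in vendors:
        try:
            return name, fn(*args)
        except AlphaVantageError as exc:  # base class, not just RateLimitError
            last_error = exc
    raise RuntimeError("all vendors failed") from last_error


def alpha_vantage_sectors():
    raise InformationResponseError("demo key returned an Information response")


def yfinance_sectors():
    return {"technology": 1.2}  # made-up value


vendor, data = route_to_vendor(
    [("alpha_vantage", alpha_vantage_sectors), ("yfinance", yfinance_sectors)]
)
```

With `except RateLimitError` in place of the base class, the InformationResponseError would propagate and the yfinance entry would never be tried.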
Mistake 6: AV scanner functions silently caught errors
What happened: get_sector_performance_alpha_vantage and get_industry_performance_alpha_vantage caught exceptions internally and embedded error strings in the output (e.g., "Error: ..." in the result dict). route_to_vendor never saw an exception, so it never fell back to yfinance.
Fix: Made both functions raise AlphaVantageError when ALL queries fail, while still tolerating partial failures.
Lesson: Functions used inside route_to_vendor MUST raise exceptions on total failure — embedding errors in return values defeats the fallback mechanism.
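A sketch of "tolerate partial failures, raise on total failure". The helper name, the fetch callables, and the local AlphaVantageError class are assumptions made so the example is self-contained; the real functions may track failures differently.

```python
class AlphaVantageError(Exception):
    """Stand-in for the project's base error class."""


def collect_sector_performance(sectors, fetch_one):
    """Query each sector; keep partial results, raise only if every query failed."""
    results, errors = {}, {}
    for sector in sectors:
        try:
            results[sector] = fetch_one(sector)
        except Exception as exc:
            errors[sector] = str(exc)  # remembered for the error message only
    if sectors and not results:
        raise AlphaVantageError(f"all {len(errors)} sector queries failed: {errors}")
    return results


def flaky_fetch(sector):
    """Fake vendor call that fails for one sector."""
    if sector == "energy":
        raise ValueError("bad response")
    return 0.5  # made-up value


def broken_fetch(sector):
    """Fake vendor call that always fails."""
    raise ValueError("bad response")


partial = collect_sector_performance(["technology", "energy"], flaky_fetch)
```

Calling collect_sector_performance(["technology", "energy"], broken_fetch) raises AlphaVantageError, which is what lets route_to_vendor see the failure and fall back.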
Mistake 7: LangGraph concurrent write without reducer
What happened: Phase 1 runs 3 scanners in parallel. All write to sender (and other shared fields). LangGraph raised INVALID_CONCURRENT_GRAPH_UPDATE because ScannerState had no reducer for concurrent writes.
Fix: Added _last_value reducer via Annotated[str, _last_value] to all ScannerState fields.
Lesson: Any LangGraph state field written by parallel nodes MUST have a reducer. Use Annotated[type, reducer_fn].
Mistake 8: .env file had placeholder values in worktree
What happened: Created .env in worktree with template values (your_openrouter_key_here). User's real keys were only in main repo's .env. load_dotenv() loaded the worktree placeholder, so OpenRouter returned 401.
Root cause: Created .env template during setup without copying real keys. load_dotenv() with override=False (default) never overwrites a variable that is already set, so the first value loaded wins.
Fix: Updated worktree .env with real keys. Also added fallback load_dotenv() call for project root.
Lesson: When creating .env files, always verify they have real values, not placeholders. When debugging auth errors, first check os.environ.get('KEY') to see what value is actually loaded.
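A small debugging helper along these lines can make the "what value is actually loaded" check quick. The helper name, the placeholder markers, and DEMO_API_KEY are all hypothetical; adjust the markers to whatever the project's template uses.

```python
import os

# Assumed placeholder patterns, based on template values like your_openrouter_key_here.
PLACEHOLDER_MARKERS = ("your_", "_here", "changeme")


def describe_env_key(name):
    """Summarise what is loaded for an env var without printing the whole secret."""
    value = os.environ.get(name)
    if not value:
        return f"{name}: not set"
    if any(marker in value.lower() for marker in PLACEHOLDER_MARKERS):
        return f"{name}: looks like a placeholder ({value!r})"
    return f"{name}: set, ends with ...{value[-4:]}"


# Demo with a made-up variable name and the template value from the incident.
os.environ["DEMO_API_KEY"] = "your_openrouter_key_here"
summary = describe_env_key("DEMO_API_KEY")
```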
Mistake 10: Python 3.11 f-string backslash restriction
What happened: ttm_analysis.py used a backslash inside an f-string expression:
f"| Debt / Equity | {f\"{ttm['debt_to_equity']:.2f}x\" if ...} |"
Python 3.11 raises SyntaxError: f-string expression part cannot include a backslash.
Fix: Pre-compute the string outside the f-string or use string concatenation:
f"| Debt / Equity | {(str(round(ttm['debt_to_equity'], 2)) + 'x') if ... else 'N/A'} |"
Lesson: Python 3.11 and earlier do not allow backslashes inside f-string {} expressions. Extract to a variable or use string concatenation instead. (Python 3.12+ relaxes this restriction.)
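The extract-to-a-variable option can be sketched as follows. The condition is elided in the original line, so the `is not None` check here is an assumption, and the ttm values are made up.

```python
ttm = {"debt_to_equity": 1.2345}  # made-up sample data

value = ttm.get("debt_to_equity")
# Assumed condition: fall back to "N/A" when the metric is missing.
debt_to_equity = f"{value:.2f}x" if value is not None else "N/A"

# No nested quotes or backslashes inside the braces, so this parses on 3.11 and earlier.
row = f"| Debt / Equity | {debt_to_equity} |"

missing_value = {}.get("debt_to_equity")
missing_row = (
    f"| Debt / Equity | {f'{missing_value:.2f}x' if missing_value is not None else 'N/A'} |"
)
```

Pre-computing keeps the `:.2f` format spec, so 1.5 renders as 1.50x. The str(round(..., 2)) variant would render it as 1.5x, which matters if the table is expected to align.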
Mistake 11: Mock test data precision — threshold boundary failures
What happened: test_risk_on_regime failed because mock risk-on data scored only 2 (needed ≥3). Two signals were inadvertently near-threshold:
- Flat VIX series → VIX trend signal = 0 (SMA5 == SMA20)
- _trending_series(80, 85, 250) for HYG → 21-day change was 0.499%, just under the 0.5% threshold → credit spread = 0
Fix: Made mock data obviously far from thresholds: _trending_series(30, 12, n) for VIX (clearly falling), _trending_series(75, 90, n) for HYG (clearly improving).
Lesson: When writing signal/threshold tests, make mock data unmistakably one-sided. Near-threshold values cause brittle tests. The mock should test the regime, not the exact threshold boundary.
Mistake 12: Remote naming — "origin" is the fork, not the upstream
What happened: Confusion about which remote is the "fork" vs "origin". The user said "you pushed to origin, not the fork" — but there is only one remote configured, and it points to aguzererler/TradingAgents which IS the fork.
Setup:
origin → http://127.0.0.1:46699/git/aguzererler/TradingAgents ← this IS the fork
There is no separate upstream remote for the original/parent repo.
Lesson: In this project, origin = the user's fork (aguzererler/TradingAgents). Always push development branches to origin. If an upstream remote is ever added, never push feature branches to it — only fetch from it.
Mistake 9: Removed top-level llm_provider but code still references it
What happened: Removed llm_provider from default_config.py (since we have per-tier providers). But scanner_graph.py line 78 does self.config.get(f"{tier}_llm_provider") or self.config["llm_provider"], which would raise KeyError if a per-tier provider is ever missing or None.
Status: Works currently because per-tier providers are always set. But it's a latent bug.
TODO: Add a safe fallback or remove the dead code path.
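One possible shape for the safe fallback, as a sketch only. The function name and the "quick" tier are hypothetical, and this assumes the config is a plain dict.

```python
def resolve_provider(config, tier):
    """Prefer the per-tier provider, then an optional top-level one, else fail clearly."""
    provider = config.get(f"{tier}_llm_provider") or config.get("llm_provider")
    if not provider:
        raise ValueError(f"no LLM provider configured for tier {tier!r}")
    return provider


# Hypothetical tier name and provider values for illustration.
per_tier = resolve_provider({"quick_llm_provider": "ollama"}, "quick")
top_level = resolve_provider(
    {"quick_llm_provider": None, "llm_provider": "openrouter"}, "quick"
)
```

Using .get() for both lookups replaces the bare KeyError with an error message that names the missing tier.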