TradingAgents/MISTAKES.md

102 lines
5.5 KiB
Markdown

# Mistakes & Lessons Learned
Documenting bugs and wrong assumptions to avoid repeating them.
---
## Mistake 1: Scanner agents had no tool execution
**What happened**: All 4 scanner agents (geopolitical, market movers, sector, industry) used `llm.bind_tools(tools)` but only checked `if len(result.tool_calls) == 0: report = result.content`. When the LLM chose to call tools (which it always does when tools are available), nobody executed them. Reports were always empty strings.
**Root cause**: Copied the pattern from existing analysts (`news_analyst.py`) without realizing that the trading graph has separate `ToolNode` graph nodes that handle tool execution in a routing loop. The scanner graph has no such nodes.
**Fix**: Created `tool_runner.py` with `run_tool_loop()` that executes tools inline within the agent node.
**Lesson**: When an LLM has `bind_tools`, there MUST be a tool execution mechanism — either graph-level `ToolNode` routing or inline execution. Always verify the tool execution path exists.
---
## Mistake 2: Assumed yfinance `Sector.overview` has performance data
**What happened**: Wrote `get_sector_performance_yfinance` using `yf.Sector("technology").overview["oneDay"]` etc. This field doesn't exist — `overview` only returns metadata (companies_count, market_cap, industries_count).
**Root cause**: Assumed the yfinance Sector API mirrors the Yahoo Finance website which shows performance data. It doesn't.
**Fix**: Switched to SPDR ETF proxy approach — download ETF prices and compute percentage changes.
**Lesson**: Always test data source APIs interactively before writing agent code. Run `python -c "import yfinance as yf; print(yf.Sector('technology').overview)"` to see actual data shape.
---
## Mistake 3: yfinance `top_companies` — ticker is the index, not a column
**What happened**: Used `row.get('symbol')` to get ticker from `top_companies` DataFrame. Always returned N/A.
**Root cause**: The DataFrame has `index.name = 'symbol'` — tickers are the index, not a column. The actual columns are `['name', 'rating', 'market weight']`.
**Fix**: Changed to `for symbol, row in top_companies.iterrows()`.
**Lesson**: Always inspect DataFrame structure with `.head()`, `.columns`, and `.index` before writing access code.
---
## Mistake 4: Hardcoded Ollama localhost URL
**What happened**: `openai_client.py` had `base_url = "http://localhost:11434/v1"` hardcoded for Ollama provider, ignoring the `self.base_url` config. User's Ollama runs on `192.168.50.76:11434`.
**Fix**: Changed to `host = self.base_url or "http://localhost:11434"` with `/v1` suffix appended.
**Lesson**: Never hardcode URLs. Always use the configured value with a sensible default.
---
## Mistake 5: Only caught `RateLimitError` in vendor fallback
**What happened**: `route_to_vendor()` only caught `RateLimitError`. Alpha Vantage demo key returns "Information" responses (not rate limit errors) and other `AlphaVantageError` subtypes. Fallback to yfinance never triggered.
**Fix**: Broadened catch to `AlphaVantageError` (base class).
**Lesson**: Fallback mechanisms should catch the broadest reasonable error class, not just specific subtypes.
---
## Mistake 6: AV scanner functions silently caught errors
**What happened**: `get_sector_performance_alpha_vantage` and `get_industry_performance_alpha_vantage` caught exceptions internally and embedded error strings in the output (e.g., `"Error: ..."` in the result dict). `route_to_vendor` never saw an exception, so it never fell back to yfinance.
**Fix**: Made both functions raise `AlphaVantageError` when ALL queries fail, while still tolerating partial failures.
**Lesson**: Functions used inside `route_to_vendor` MUST raise exceptions on total failure — embedding errors in return values defeats the fallback mechanism.
---
## Mistake 7: LangGraph concurrent write without reducer
**What happened**: Phase 1 runs 3 scanners in parallel. All write to `sender` (and other shared fields). LangGraph raised `INVALID_CONCURRENT_GRAPH_UPDATE` because `ScannerState` had no reducer for concurrent writes.
**Fix**: Added `_last_value` reducer via `Annotated[str, _last_value]` to all ScannerState fields.
**Lesson**: Any LangGraph state field written by parallel nodes MUST have a reducer. Use `Annotated[type, reducer_fn]`.
---
## Mistake 8: .env file had placeholder values in worktree
**What happened**: Created `.env` in worktree with template values (`your_openrouter_key_here`). User's real keys were only in main repo's `.env`. `load_dotenv()` loaded the worktree placeholder, so OpenRouter returned 401.
**Root cause**: Created `.env` template during setup without copying real keys. `load_dotenv()` with `override=False` (default) keeps the first value found.
**Fix**: Updated worktree `.env` with real keys. Also added fallback `load_dotenv()` call for project root.
**Lesson**: When creating `.env` files, always verify they have real values, not placeholders. When debugging auth errors, first check `os.environ.get('KEY')` to see what value is actually loaded.
---
## Mistake 9: Removed top-level `llm_provider` but code still references it
**What happened**: Removed `llm_provider` from `default_config.py` (since we have per-tier providers). But `scanner_graph.py` line 78 does `self.config.get(f"{tier}_llm_provider") or self.config["llm_provider"]` — would crash if per-tier provider is ever None.
**Status**: Works currently because per-tier providers are always set. But it's a latent bug.
**TODO**: Add a safe fallback or remove the dead code path.