docs: update PROGRESS, DECISIONS, MISTAKES, CLAUDE with env override implementation

- PROGRESS.md: added env override milestone, updated test count (38 total), marked Mistake #9 as resolved, added all new/modified files from PR #9
- DECISIONS.md: added Decision 008 (env var config overrides), Decision 009 (thread-safe rate limiter), Decision 010 (broader vendor fallback exceptions), updated Decision 007 status to superseded
- MISTAKES.md: updated Mistake #9 status to RESOLVED, added Mistake #10 (rate limiter held lock during sleep)
- CLAUDE.md: added env var override convention docs, updated critical patterns with rate limiter and config fallback key lessons, updated mistake count to 10

Co-authored-by: aguzererler <6199053+aguzererler@users.noreply.github.com>
2026-03-17 14:37:41 +00:00 · 2026-03-17 14:37:41 +00:00 · 373a03d744
parent 15e87c7688
commit 373a03d744
4 changed files with 143 additions and 15 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -75,6 +75,7 @@ OpenAI, Anthropic, Google, xAI, OpenRouter, Ollama
 - LLM tiers configuration
 - Vendor routing
 - Debate rounds settings
+- All values overridable via `TRADINGAGENTS_<KEY>` env vars (see `.env.example`)

 ## Patterns to Follow

@@ -91,16 +92,18 @@ OpenAI, Anthropic, Google, xAI, OpenRouter, Ollama
 - **Tool execution**: Trading graph uses `ToolNode` in graph. Scanner agents use `run_tool_loop()` inline. If `bind_tools()` is used, there MUST be a tool execution path.
 - **yfinance DataFrames**: `top_companies` has ticker as INDEX, not column. Always check `.index` and `.columns`.
 - **yfinance Sector/Industry**: `Sector.overview` has NO performance data. Use ETF proxies for performance.
-- **Vendor fallback**: Functions inside `route_to_vendor` must RAISE on failure, not embed errors in return values. Catch `AlphaVantageError` (base class), not just `RateLimitError`.
+- **Vendor fallback**: Functions inside `route_to_vendor` must RAISE on failure, not embed errors in return values. Catch `(AlphaVantageError, ConnectionError, TimeoutError)`, not just `RateLimitError`.
 - **LangGraph parallel writes**: Any state field written by parallel nodes MUST have a reducer (`Annotated[str, reducer_fn]`).
 - **Ollama remote host**: Never hardcode `localhost:11434`. Use configured `base_url`.
-- **.env loading**: Check actual env var values when debugging auth. Worktree and main repo may have different `.env` files.
+- **.env loading**: `load_dotenv()` runs at module level in `default_config.py` — import-order-independent. Check actual env var values when debugging auth.
+- **Rate limiter locks**: Never hold a lock during `sleep()` or IO. Release, sleep, re-acquire.
+- **Config fallback keys**: `llm_provider` and `backend_url` must always exist at top level — `scanner_graph.py` and `trading_graph.py` use them as fallbacks.

 ## Project Tracking Files

-- `DECISIONS.md` — Architecture decision records (vendor strategy, LLM setup, tool execution)
+- `DECISIONS.md` — Architecture decision records (vendor strategy, LLM setup, tool execution, env overrides)
 - `PROGRESS.md` — Feature progress, what works, TODOs
-- `MISTAKES.md` — Past bugs and lessons learned (9 documented mistakes)
+- `MISTAKES.md` — Past bugs and lessons learned (10 documented mistakes)

 ## LLM Configuration

@@ -110,6 +113,18 @@ Per-tier provider overrides in `tradingagents/default_config.py`:
 - All config values overridable via `TRADINGAGENTS_<KEY>` env vars
 - Keys for LLM providers: `.env` file (e.g., `OPENROUTER_API_KEY`, `ALPHA_VANTAGE_API_KEY`)

+### Env Var Override Convention
+
+```env
+# Pattern: TRADINGAGENTS_<UPPERCASE_KEY>=value
+TRADINGAGENTS_LLM_PROVIDER=openrouter
+TRADINGAGENTS_DEEP_THINK_LLM=deepseek/deepseek-r1-0528
+TRADINGAGENTS_MAX_DEBATE_ROUNDS=3
+TRADINGAGENTS_VENDOR_SCANNER_DATA=alpha_vantage
+```
+
+Empty or unset vars preserve the hardcoded default. `None`-default fields (like `mid_think_llm`) stay `None` when unset, preserving fallback semantics.
+
 ## Running the Scanner

 ```bash
--- a/DECISIONS.md
+++ b/DECISIONS.md
@@ -105,10 +105,75 @@ Download 6 months of history via `yf.download()` and compute 1-day, 1-week, 1-mo
 ## Decision 007: .env Loading Strategy

 **Date**: 2026-03-17
-**Status**: Implemented ✅
+**Status**: Superseded by Decision 008 ⚠️

 **Context**: `load_dotenv()` loads from CWD. When running from a git worktree, the worktree `.env` may have placeholder values while the main repo `.env` has real keys.

 **Decision**: `cli/main.py` calls `load_dotenv()` (CWD) then `load_dotenv(Path(__file__).parent.parent / ".env")` as fallback. The worktree `.env` was also updated with real API keys.

 **Note for future**: If `.env` issues recur, check which `.env` file is being picked up. The worktree and main repo each have their own `.env`.
+
+**Update**: Decision 008 moves `load_dotenv()` into `default_config.py` itself, making it import-order-independent. The CLI-level `load_dotenv()` in `main.py` is now defense-in-depth only.
+
+---
+
+## Decision 008: Environment Variable Config Overrides
+
+**Date**: 2026-03-17
+**Status**: Implemented ✅
+
+**Context**: `DEFAULT_CONFIG` hardcoded all values (LLM providers, models, vendor routing, debate rounds). Users had to edit `default_config.py` to change any setting. The `load_dotenv()` call in `cli/main.py` ran *after* `DEFAULT_CONFIG` was already evaluated at import time, so env vars like `TRADINGAGENTS_LLM_PROVIDER` had no effect. This also created a latent bug (Mistake #9): `llm_provider` and `backend_url` were removed from the config but `scanner_graph.py` still referenced them as fallbacks.
+
+**Decision**:
+1. **Module-level `.env` loading**: `default_config.py` calls `load_dotenv()` at the top of the module, before `DEFAULT_CONFIG` is evaluated. Loads from CWD first, then falls back to project root (`Path(__file__).resolve().parent.parent / ".env"`).
+2. **`_env()` / `_env_int()` helpers**: Read `TRADINGAGENTS_<KEY>` from environment. Return the hardcoded default when the env var is unset or empty (preserving `None` semantics for per-tier fallbacks).
+3. **Restored top-level keys**: `llm_provider` (default: `"openai"`) and `backend_url` (default: `"https://api.openai.com/v1"`) restored as env-overridable keys. Resolves Mistake #9.
+4. **All config keys overridable**: LLM models, providers, backend URLs, debate rounds, data vendor categories — all follow the `TRADINGAGENTS_<KEY>` pattern.
+5. **Explicit dependency**: Added `python-dotenv>=1.0.0` to `pyproject.toml` (was used but undeclared).
+
+**Naming convention**: `TRADINGAGENTS_` prefix + uppercase config key. Examples:
+```
+TRADINGAGENTS_LLM_PROVIDER=openrouter
+TRADINGAGENTS_DEEP_THINK_LLM=deepseek/deepseek-r1-0528
+TRADINGAGENTS_MAX_DEBATE_ROUNDS=3
+TRADINGAGENTS_VENDOR_SCANNER_DATA=alpha_vantage
+```
+
+**Files changed**:
+- `tradingagents/default_config.py` — core implementation
+- `main.py` — moved `load_dotenv()` before imports (defense-in-depth)
+- `pyproject.toml` — added `python-dotenv>=1.0.0`
+- `.env.example` — documented all overrides
+- `tests/test_env_override.py` — 15 tests
+
+**Alternative considered**: YAML/TOML config file. Rejected — env vars are simpler, work with Docker/CI, and don't require a new config file format.
+
+---
+
+## Decision 009: Thread-Safe Rate Limiter for Alpha Vantage
+
+**Date**: 2026-03-17
+**Status**: Implemented ✅
+
+**Context**: The Alpha Vantage rate limiter in `alpha_vantage_common.py` initially slept *inside* the lock when re-checking the rate window. This blocked all other threads from making API requests during the sleep period, effectively serializing all AV calls.
+
+**Decision**: Two-phase rate limiting:
+1. **First check**: Acquire lock, check timestamps, release lock, sleep if needed.
+2. **Re-check loop**: Acquire lock, re-check timestamps. If still over limit, release lock *before* sleeping, then retry. Only append timestamp and break when under the limit.
+
+This ensures the lock is never held during `sleep()` calls.
+
+**File**: `tradingagents/dataflows/alpha_vantage_common.py`
+
+---
+
+## Decision 010: Broader Vendor Fallback Exception Handling
+
+**Date**: 2026-03-17
+**Status**: Implemented ✅
+
+**Context**: `route_to_vendor()` only caught `AlphaVantageError` for fallback. But network issues (`ConnectionError`, `TimeoutError`) from the `requests` library wouldn't trigger fallback — they'd crash the pipeline instead.
+
+**Decision**: Broadened the catch in `route_to_vendor()` to `(AlphaVantageError, ConnectionError, TimeoutError)`. Similarly, `_make_api_request()` now catches `requests.exceptions.RequestException` as a general fallback and wraps `raise_for_status()` in a try/except to convert HTTP errors to `ThirdPartyError`.
+
+**Files**: `tradingagents/dataflows/interface.py`, `tradingagents/dataflows/alpha_vantage_common.py`
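
Reviewer note: the fallback behaviour described in Decision 010 can be illustrated with a minimal sketch. This is not the project's `route_to_vendor()`. The function `route_with_fallback`, the stand-in `AlphaVantageError` class, and both vendor functions are hypothetical names introduced only for illustration.

```python
class AlphaVantageError(Exception):
    """Stand-in for the project's base exception (hypothetical definition)."""


# Exceptions that should trigger a switch to the next vendor, per Decision 010.
FALLBACK_ERRORS = (AlphaVantageError, ConnectionError, TimeoutError)


def route_with_fallback(vendors, *args, **kwargs):
    """Try each (name, fn) pair in order and return the first successful result.

    Only exceptions listed in FALLBACK_ERRORS move on to the next vendor;
    anything else (for example a programming error) propagates unchanged.
    """
    failures = []
    for name, fn in vendors:
        try:
            return name, fn(*args, **kwargs)
        except FALLBACK_ERRORS as exc:
            failures.append(f"{name}: {exc!r}")
    raise RuntimeError("all vendors failed: " + "; ".join(failures))


def flaky_primary(symbol):
    # Hypothetical primary vendor that always times out.
    raise TimeoutError("simulated timeout")


def backup(symbol):
    # Hypothetical secondary vendor that always succeeds.
    return {"symbol": symbol, "source": "backup"}


vendor_used, data = route_with_fallback(
    [("primary", flaky_primary), ("backup", backup)], "AAPL"
)
```

The vendor functions raise instead of returning error strings, which is what lets the `except` clause decide whether to fall through. One caveat worth keeping in mind: `requests.exceptions.ConnectionError` and `requests.exceptions.Timeout` derive from `requests.exceptions.RequestException`, not from the built-in `ConnectionError` and `TimeoutError`, so a conversion step such as the `RequestException` handling mentioned for `_make_api_request()` is presumably what makes `requests` failures reach this fallback.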
--- a/MISTAKES.md
+++ b/MISTAKES.md
@@ -96,6 +96,27 @@ Documenting bugs and wrong assumptions to avoid repeating them.

 **What happened**: Removed `llm_provider` from `default_config.py` (since we have per-tier providers). But `scanner_graph.py` line 78 does `self.config.get(f"{tier}_llm_provider") or self.config["llm_provider"]` — would crash if per-tier provider is ever None.

-**Status**: Works currently because per-tier providers are always set. But it's a latent bug.
+**Status**: ✅ RESOLVED in PR #9. Top-level `llm_provider` (default: `"openai"`) and `backend_url` (default: `"https://api.openai.com/v1"`) restored as env-overridable config keys. Per-tier providers safely fall back to these when `None`.

-**TODO**: Add a safe fallback or remove the dead code path.
+**Lesson**: Always preserve fallback keys that downstream code depends on. When refactoring config, grep for all references before removing keys.
+
+---
+
+## Mistake 10: Rate limiter held lock during sleep
+
+**What happened**: The Alpha Vantage rate limiter's re-check path in `_rate_limited_request()` called `_time.sleep(extra_sleep)` while holding `_rate_lock`. This blocked all other threads from making API requests during the sleep period, effectively serializing all AV calls even though the pipeline runs parallel scanner agents.
+
+**Root cause**: Initial implementation only had one lock section. When the re-check-after-sleep pattern was added to prevent race conditions, the sleep was left inside the `with _rate_lock:` block.
+
+**Fix**: Restructured the re-check as a `while True` loop that releases the lock before sleeping:
+```python
+while True:
+    with _rate_lock:
+        if len(_call_timestamps) < _RATE_LIMIT:
+            _call_timestamps.append(_time.time())
+            break
+        extra_sleep = 60 - (_time.time() - _call_timestamps[0]) + 0.1
+    _time.sleep(extra_sleep)  # ← outside lock
+```
+
+**Lesson**: Never hold a lock during a sleep/IO operation. Always release the lock, perform the blocking operation, then re-acquire.
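
Reviewer note: the lesson from Mistake 10 can be shown with a self-contained sliding-window limiter. This is a sketch under assumptions, not the code in `alpha_vantage_common.py`. The class name `SlidingWindowLimiter` and its parameters are hypothetical, and unlike the excerpt in the diff it also prunes expired timestamps inside the loop so that the example terminates on its own.

```python
import threading
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls acquisitions per window_seconds across threads."""

    def __init__(self, max_calls, window_seconds):
        self._max_calls = max_calls
        self._window = window_seconds
        self._lock = threading.Lock()
        self._timestamps = deque()

    def acquire(self):
        """Block until a call slot is free, never sleeping while holding the lock."""
        while True:
            with self._lock:
                now = time.monotonic()
                # Drop timestamps that have aged out of the window.
                while self._timestamps and now - self._timestamps[0] >= self._window:
                    self._timestamps.popleft()
                if len(self._timestamps) < self._max_calls:
                    self._timestamps.append(now)
                    return
                # Time until the oldest recorded call leaves the window.
                wait = self._window - (now - self._timestamps[0])
            # The lock is released here, so other threads can check in while we sleep.
            time.sleep(max(wait, 0.0))
```

A caller would invoke `limiter.acquire()` immediately before each API request. Because the wait is computed under the lock but the sleep happens after the `with` block exits, one waiting thread does not stall the others, and each thread re-checks the window after waking instead of assuming a slot is free.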
--- a/PROGRESS.md
+++ b/PROGRESS.md
@@ -15,10 +15,11 @@ The 3-phase scanner pipeline runs successfully from `python -m cli.main scan --d
 | Phase 3: Macro Synthesis | ✅ | OpenRouter/DeepSeek R1, pure LLM synthesis (no tools) |
 | Parallel fan-out (Phase 1) | ✅ | LangGraph with `_last_value` reducers |
 | Tool execution loop | ✅ | `run_tool_loop()` in `tool_runner.py` |
-| Data vendor fallback | ✅ | AV → yfinance fallback on `AlphaVantageError` |
+| Data vendor fallback | ✅ | AV → yfinance fallback on `AlphaVantageError`, `ConnectionError`, `TimeoutError` |
 | CLI `--date` flag | ✅ | `python -m cli.main scan --date YYYY-MM-DD` |
-| .env loading | ✅ | Keys loaded from project root `.env` |
-| Tests (23 total) | ✅ | 14 original + 9 scanner fallback tests |
+| .env loading | ✅ | `load_dotenv()` at module level in `default_config.py` — import-order-independent |
+| Env var config overrides | ✅ | All `DEFAULT_CONFIG` keys overridable via `TRADINGAGENTS_<KEY>` env vars |
+| Tests (38 total) | ✅ | 14 original + 9 scanner fallback + 15 env override tests |

 ### Output Quality (Sample Run 2026-03-17)

@@ -41,14 +42,40 @@ The 3-phase scanner pipeline runs successfully from `python -m cli.main scan --d
 - `tradingagents/graph/scanner_setup.py` — LangGraph workflow setup
 - `tradingagents/dataflows/yfinance_scanner.py` — yfinance data for scanner
 - `tradingagents/dataflows/alpha_vantage_scanner.py` — Alpha Vantage data for scanner
+- `tradingagents/pipeline/macro_bridge.py` — scan → filter → per-ticker analysis bridge
 - `tests/test_scanner_fallback.py` — 9 fallback tests
+- `tests/test_env_override.py` — 15 env override tests

 **Modified files:**
-- `tradingagents/default_config.py` — per-tier LLM provider config (hybrid setup)
+- `tradingagents/default_config.py` — env var overrides via `_env()`/`_env_int()` helpers, `load_dotenv()` at module level, restored top-level `llm_provider` and `backend_url` keys
 - `tradingagents/llm_clients/openai_client.py` — Ollama remote host support
-- `tradingagents/dataflows/interface.py` — broadened fallback catch to `AlphaVantageError`
-- `cli/main.py` — `scan` command with `--date` flag, `.env` loading fix
-- `.env` — real API keys
+- `tradingagents/dataflows/interface.py` — broadened fallback catch to `(AlphaVantageError, ConnectionError, TimeoutError)`
+- `tradingagents/dataflows/alpha_vantage_common.py` — thread-safe rate limiter (sleep outside lock), broader `RequestException` catch, wrapped `raise_for_status`
+- `tradingagents/graph/scanner_graph.py` — debug mode fix (stream for debug, invoke for result)
+- `tradingagents/pipeline/macro_bridge.py` — `get_running_loop()` over deprecated `get_event_loop()`
+- `cli/main.py` — `scan` command with `--date` flag, `try/except` in `run_pipeline`, `.env` loading fix
+- `main.py` — `load_dotenv()` before tradingagents imports
+- `pyproject.toml` — `python-dotenv>=1.0.0` dependency declared
+- `.env.example` — documented all `TRADINGAGENTS_*` overrides and `ALPHA_VANTAGE_API_KEY`
+
+---
+
+## Milestone: Env Var Config Overrides ✅ COMPLETE (PR #9)
+
+All `DEFAULT_CONFIG` values are now overridable via `TRADINGAGENTS_<KEY>` environment variables without code changes. This resolves the latent bug from Mistake #9 (missing top-level `llm_provider`).
+
+### What Changed
+
+| Component | Detail |
+|-----------|--------|
+| `default_config.py` | `load_dotenv()` at module level + `_env()`/`_env_int()` helpers |
+| Top-level fallback keys | Restored `llm_provider` and `backend_url` (defaults: `"openai"`, `"https://api.openai.com/v1"`) |
+| Per-tier overrides | All `None` by default — fall back to top-level when not set via env |
+| Integer config keys | `max_debate_rounds`, `max_risk_discuss_rounds`, `max_recur_limit` use `_env_int()` |
+| Data vendor keys | `data_vendors.*` overridable via `TRADINGAGENTS_VENDOR_<CATEGORY>` |
+| `.env.example` | Complete reference of all overridable settings |
+| `python-dotenv` | Added to `pyproject.toml` as explicit dependency |
+| Tests | 15 new tests in `tests/test_env_override.py` |

 ---

@@ -78,4 +105,4 @@ The 3-phase scanner pipeline runs successfully from `python -m cli.main scan --d

 - [ ] **Streaming output**: Scanner currently runs with `Live(Spinner(...))` — no intermediate output. Could stream phase completions to the console.

-- [ ] **Remove top-level `llm_provider` references**: `scanner_graph.py` lines 69, 78 still fall back to `self.config["llm_provider"]` which doesn't exist in current config. Works because per-tier providers are always set, but will crash if they're ever `None`.
+- [x] ~~**Remove top-level `llm_provider` references**~~: Resolved in PR #9 — `llm_provider` and `backend_url` restored as top-level keys with `"openai"` / `"https://api.openai.com/v1"` defaults. Per-tier providers fall back to these when `None`.
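
Reviewer note: the override and fallback semantics recorded in this commit can be sketched as follows. The helper names `_env()` and `_env_int()` come from the docs above, but their signatures and bodies here are assumptions, as are `build_config()`, `provider_for()`, the `deep_think` tier name, the default of `1` for `max_debate_rounds`, and the choice to fall back to the default on a malformed integer.

```python
import os

PREFIX = "TRADINGAGENTS_"


def _env(key, default=None):
    """Return TRADINGAGENTS_<KEY> when set and non-empty, otherwise the default."""
    value = os.environ.get(PREFIX + key.upper())
    # An empty string counts as unset, so None defaults stay None.
    return value if value else default


def _env_int(key, default):
    """Integer variant of _env(); a malformed value keeps the default (assumption)."""
    value = _env(key)
    if value is None:
        return default
    try:
        return int(value)
    except ValueError:
        return default


def build_config():
    """Evaluate a small config dict from the current environment (illustrative keys)."""
    return {
        "llm_provider": _env("llm_provider", "openai"),
        "backend_url": _env("backend_url", "https://api.openai.com/v1"),
        "deep_think_llm_provider": _env("deep_think_llm_provider"),  # None unless set
        "max_debate_rounds": _env_int("max_debate_rounds", 1),  # default is hypothetical
    }


def provider_for(config, tier):
    """Per-tier provider, falling back to the top-level key when the tier value is None."""
    return config.get(f"{tier}_llm_provider") or config["llm_provider"]
```

With `TRADINGAGENTS_LLM_PROVIDER=openrouter` exported and no per-tier variable set, `provider_for(build_config(), "deep_think")` resolves to the top-level provider. This is the lookup that would fail with a `KeyError` if `llm_provider` were missing from the config, which is the latent bug Mistake #9 describes.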