# Orchestrator Configuration Validation

Status: implemented (2026-04-16)
Audience: orchestrator users, backend maintainers
Scope: LLMRunner configuration validation and error classification

## Overview

`orchestrator/llm_runner.py` implements three layers of configuration validation to catch errors before expensive graph initialization or API calls:

1. **Provider × Base URL Matrix Validation** - detects provider/endpoint mismatches
2. **Timeout Configuration Validation** - warns when timeouts may be insufficient
3. **Runtime Error Classification** - categorizes failures into actionable reason codes

## 1. Provider × Base URL Matrix Validation

### Purpose

Prevent wasted initialization time and API calls when the provider and `base_url` are incompatible.

### Implementation

`LLMRunner._detect_provider_mismatch()` validates provider × base_url combinations using a pattern matrix:

```python
_PROVIDER_BASE_URL_PATTERNS = {
    "anthropic": [r"api\.anthropic\.com", r"api\.minimaxi\.com/anthropic"],
    "openai": [r"api\.openai\.com"],
    "google": [r"generativelanguage\.googleapis\.com"],
    "xai": [r"api\.x\.ai"],
    "ollama": [r"localhost:\d+", r"127\.0\.0\.1:\d+", r"ollama"],
    "openrouter": [r"openrouter\.ai"],
}
```

### Validation Logic

1. Extract `llm_provider` and `backend_url` from `trading_agents_config`
2. Look up the expected URL patterns for the provider
3. Check if `backend_url` matches any expected pattern (regex)
4. If no match is found, return mismatch details before graph initialization
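The steps above can be sketched as a standalone function. This is a minimal illustration of the matrix check, not the real `LLMRunner` method, which reads its inputs from `trading_agents_config`; the function name and return shape here are illustrative.

```python
import re

# Abbreviated copy of the pattern matrix shown above.
PROVIDER_BASE_URL_PATTERNS = {
    "anthropic": [r"api\.anthropic\.com", r"api\.minimaxi\.com/anthropic"],
    "openai": [r"api\.openai\.com"],
    "google": [r"generativelanguage\.googleapis\.com"],
    "ollama": [r"localhost:\d+", r"127\.0\.0\.1:\d+", r"ollama"],
}

def detect_provider_mismatch(provider: str, backend_url: str):
    """Return mismatch details, or None if the combination looks valid."""
    patterns = PROVIDER_BASE_URL_PATTERNS.get(provider)
    if patterns is None:
        return None  # unknown provider: nothing to validate against
    if any(re.search(p, backend_url) for p in patterns):
        return None  # backend_url matches an expected pattern
    return {
        "state": "provider_mismatch",
        "provider": provider,
        "backend_url": backend_url,
        "expected_patterns": patterns,
    }

# A valid pairing passes; a mismatched one returns details.
assert detect_provider_mismatch("openai", "https://api.openai.com/v1") is None
mismatch = detect_provider_mismatch("google", "https://api.openai.com/v1")
assert mismatch["state"] == "provider_mismatch"
```

Note that the check uses `re.search`, so a pattern only needs to appear somewhere in the URL; this is what lets `localhost:\d+` match the full `http://localhost:11434` endpoint.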
### Error Response

When a mismatch is detected, `get_signal()` returns:

```python
Signal(
    degraded=True,
    reason_code="provider_mismatch",
    metadata={
        "data_quality": {
            "state": "provider_mismatch",
            "provider": "google",
            "backend_url": "https://api.openai.com/v1",
            "expected_patterns": [r"generativelanguage\.googleapis\.com"],
        }
    }
)
```

### Examples

**Valid configurations:**

- `anthropic` + `https://api.minimaxi.com/anthropic` ✓
- `openai` + `https://api.openai.com/v1` ✓
- `ollama` + `http://localhost:11434` ✓

**Invalid configurations (detected):**

- `google` + `https://api.openai.com/v1` → `provider_mismatch`
- `xai` + `https://api.minimaxi.com/anthropic` → `provider_mismatch`
- `ollama` + `https://api.openai.com/v1` → `provider_mismatch`

### Design Notes

- Uses the **original provider name** (not the canonical one) for validation
  - `ollama`, `openrouter`, and `openai` share the same canonical provider (`openai`) but have different URL patterns
  - Validation must distinguish between them
- Validation runs **before** `TradingAgentsGraph` initialization
  - Saves ~5-10s of initialization time on mismatch
  - Avoids confusing error messages from LangChain/provider SDKs

## 2. Timeout Configuration Validation

### Purpose

Warn users when timeout settings may be insufficient for their analyst profile, preventing unexpected research degradation.

### Implementation

`LLMRunner._validate_timeout_config()` checks timeout sufficiency based on analyst count:

```python
_RECOMMENDED_TIMEOUTS = {
    1: {"analyst": 75.0, "research": 30.0},   # single analyst
    2: {"analyst": 90.0, "research": 45.0},   # two analysts
    3: {"analyst": 105.0, "research": 60.0},  # three analysts
    4: {"analyst": 120.0, "research": 75.0},  # four analysts
}
```

### Validation Logic

1. Extract `selected_analysts` from `trading_agents_config` (default: 4 analysts)
2. Extract `analyst_node_timeout_secs` and `research_node_timeout_secs`
3. Compare against the recommended thresholds for the analyst count
4. Log a `WARNING` if the configured timeout is below the recommended threshold

### Warning Example

```
LLMRunner: analyst_node_timeout_secs=75.0s may be insufficient for 4 analyst(s) (recommended: 120.0s)
```

### Design Notes

- **Non-blocking validation** - logs a warning but does not prevent initialization
  - Different LLM providers have vastly different speeds (MiniMax vs OpenAI)
  - Users may have profiled their specific setup and chosen lower timeouts intentionally
- **Conservative recommendations** - thresholds assume slower providers
  - Based on real profiling data from the MiniMax Anthropic-compatible endpoint
  - Users with faster providers can safely ignore the warnings
- **Runs at `__init__` time** - warns early, before any API calls

### Timeout Calculation Rationale

Multi-analyst execution is **serial** for analysts and **parallel** for research:

```
Total time ≈ (analyst_count × analyst_timeout) + research_timeout + trading + risk + portfolio
```

For 4 analysts with a 75s timeout each:

- Analyst phase: ~300s (serial)
- Research phase: ~30s (parallel bull/bear)
- Trading phase: ~15s
- Risk phase: ~10s
- Portfolio phase: ~10s
- **Total: ~365s** (6+ minutes)

The recommended 120s per analyst assumes:

- Some analysts may time out and degrade
- The degraded path still completes within the timeout
- Total execution stays under reasonable bounds (~8-10 minutes)

## 3. Runtime Error Classification

### Purpose

Categorize runtime failures into actionable reason codes for debugging and monitoring.

### Error Taxonomy

Defined in `orchestrator/contracts/error_taxonomy.py`:

```python
class ReasonCode(str, Enum):
    CONFIG_INVALID = "config_invalid"
    PROVIDER_MISMATCH = "provider_mismatch"
    PROVIDER_AUTH_FAILED = "provider_auth_failed"
    LLM_INIT_FAILED = "llm_init_failed"
    LLM_SIGNAL_FAILED = "llm_signal_failed"
    LLM_UNKNOWN_RATING = "llm_unknown_rating"
    # ... (quant-related codes omitted)
```
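Because `ReasonCode` mixes in `str`, members compare equal to their string values, which is what lets `reason_code` fields and `data_quality.state` carry the same greppable strings. A minimal standalone sketch (a subset of the enum above, redeclared here so the snippet is self-contained):

```python
from enum import Enum

class ReasonCode(str, Enum):
    PROVIDER_MISMATCH = "provider_mismatch"
    PROVIDER_AUTH_FAILED = "provider_auth_failed"
    LLM_SIGNAL_FAILED = "llm_signal_failed"

# String values round-trip through the enum constructor...
assert ReasonCode("provider_mismatch") is ReasonCode.PROVIDER_MISMATCH
# ...and members compare equal to plain strings.
assert ReasonCode.PROVIDER_AUTH_FAILED == "provider_auth_failed"
assert ReasonCode.LLM_SIGNAL_FAILED.value == "llm_signal_failed"
```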
### Classification Logic

`LLMRunner.get_signal()` catches exceptions from `propagate()` and classifies them:

1. **Provider mismatch** (pre-initialization)
   - Detected by `_detect_provider_mismatch()` before graph creation
   - Returns `provider_mismatch` immediately
2. **Provider auth failure** (runtime)
   - Detected by the `_looks_like_provider_auth_failure()` heuristic
   - Markers: `"authentication_error"`, `"login fail"`, `"invalid api key"`, `"unauthorized"`, `"error code: 401"`
   - Returns `provider_auth_failed`
3. **Generic LLM failure** (runtime)
   - Any other exception from `propagate()`
   - Returns `llm_signal_failed`

### Error Response Structure

All error signals include:

```python
Signal(
    degraded=True,
    reason_code="<reason_code>",
    direction=0,
    confidence=0.0,
    metadata={
        "error": "<error message>",
        "data_quality": {
            "state": "<reason_code>",
            # ... additional context
        }
    }
)
```

### Design Notes

- **Fail-fast on config errors** - mismatch is detected before expensive operations
- **Heuristic auth detection** - no API call overhead; relies on error message patterns
- **Structured metadata** - `data_quality.state` mirrors `reason_code` for consistency
## 4. Testing

### Test Coverage

`orchestrator/tests/test_llm_runner.py` includes:

**Provider matrix validation:**

- `test_detect_provider_mismatch_google_with_openai_url`
- `test_detect_provider_mismatch_xai_with_anthropic_url`
- `test_detect_provider_mismatch_ollama_with_openai_url`
- `test_detect_provider_mismatch_valid_anthropic_minimax`
- `test_detect_provider_mismatch_valid_openai`

**Timeout validation:**

- `test_timeout_validation_warns_for_multiple_analysts_low_timeout`
- `test_timeout_validation_no_warn_for_single_analyst`
- `test_timeout_validation_no_warn_for_sufficient_timeout`

**Error classification:**

- `test_get_signal_classifies_provider_auth_failure`
- `test_get_signal_returns_provider_mismatch_before_graph_init`
- `test_get_signal_returns_reason_code_on_propagate_failure`

### Running Tests

```bash
cd /path/to/TradingAgents
python -m pytest orchestrator/tests/test_llm_runner.py -v
```

## 5. Maintenance

### Adding New Providers

When adding a new provider to `tradingagents/llm_clients/factory.py`:

1. Add a URL pattern to `_PROVIDER_BASE_URL_PATTERNS` in `llm_runner.py`
2. Add test cases for valid and invalid configurations
3. Update this documentation

### Adjusting Timeout Recommendations

If profiling shows different timeout requirements:

1. Update `_RECOMMENDED_TIMEOUTS` in `llm_runner.py`
2. Document the rationale in this file
3. Update test expectations if needed

### Extending Error Classification

To add new reason codes:

1. Add the code to the `ReasonCode` enum in `contracts/error_taxonomy.py`
2. Add detection logic in `LLMRunner.get_signal()`
3. Add a test case in `test_llm_runner.py`
4. Update this documentation
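As an illustration of steps 1-2 of the checklist above, a new reason code could reuse the same marker-scan pattern the auth heuristic uses. Everything here is hypothetical: the rate-limit code, the markers, and the function names are not part of the current taxonomy.

```python
# Hypothetical extension sketch: classifying rate-limit errors.
# Marker list and reason code are illustrative only.
RATE_LIMIT_MARKERS = ("rate limit", "error code: 429", "too many requests")

def looks_like_rate_limit(exc: Exception) -> bool:
    """Case-insensitive scan for rate-limit markers in the error message."""
    message = str(exc).lower()
    return any(marker in message for marker in RATE_LIMIT_MARKERS)

def classify_extended(exc: Exception) -> str:
    """Hypothetical get_signal() branch: check the new marker set first."""
    if looks_like_rate_limit(exc):
        return "provider_rate_limited"  # would be added to ReasonCode
    return "llm_signal_failed"

assert classify_extended(RuntimeError("Error code: 429 Too Many Requests")) == "provider_rate_limited"
assert classify_extended(RuntimeError("invalid response shape")) == "llm_signal_failed"
```

A matching test case (step 3) would then assert that `get_signal()` returns the new code when `propagate()` raises such an error.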
## 6. Known Limitations

### API Key Validation

The current implementation does **not** validate API key validity before graph initialization:

- **Limitation**: Expired or invalid keys are only detected during the first `propagate()` call
- **Impact**: ~5-10s wasted on graph initialization before the auth failure surfaces
- **Rationale**: Lightweight key validation would require provider-specific API calls, adding latency and complexity
- **Mitigation**: Auth failures are still classified correctly as `provider_auth_failed`

### Provider Pattern Maintenance

URL patterns must be kept in sync with provider changes manually:

- **Risk**: A provider changes its base URL structure (e.g., API versioning)
- **Mitigation**: Validation is non-blocking; mismatches are logged but don't prevent operation
- **Future**: Consider moving the patterns to `tradingagents/llm_clients/factory.py` as part of `ProviderSpec`

### Timeout Recommendations

Recommendations are based on MiniMax profiling and may not generalize:

- **Risk**: Faster providers (e.g., OpenAI GPT-4) may trigger unnecessary warnings
- **Mitigation**: Warnings are advisory only; users can ignore them if they have profiled their setup
- **Future**: Consider provider-specific timeout recommendations

## 7. Related Documentation

- `docs/contracts/result-contract-v1alpha1.md` - Signal contract structure
- `docs/architecture/research-provenance.md` - Research degradation semantics
- `docs/migration/rollback-notes.md` - Backend migration status
- `orchestrator/contracts/error_taxonomy.py` - Complete reason code list