Orchestrator Configuration Validation
Status: implemented (2026-04-16)
Audience: orchestrator users, backend maintainers
Scope: LLMRunner configuration validation and error classification
Overview
orchestrator/llm_runner.py implements three layers of configuration validation to catch errors before expensive graph initialization or API calls:
- Provider × Base URL Matrix Validation - detects provider/endpoint mismatches
- Timeout Configuration Validation - warns when timeouts may be insufficient
- Runtime Error Classification - categorizes failures into actionable reason codes
1. Provider × Base URL Matrix Validation
Purpose
Prevent wasted initialization time and API calls when provider and base_url are incompatible.
Implementation
LLMRunner._detect_provider_mismatch() validates provider × base_url combinations using a pattern matrix:
_PROVIDER_BASE_URL_PATTERNS = {
    "anthropic": [r"api\.anthropic\.com", r"api\.minimaxi\.com/anthropic"],
    "openai": [r"api\.openai\.com"],
    "google": [r"generativelanguage\.googleapis\.com"],
    "xai": [r"api\.x\.ai"],
    "ollama": [r"localhost:\d+", r"127\.0\.0\.1:\d+", r"ollama"],
    "openrouter": [r"openrouter\.ai"],
}
Validation Logic
- Extract llm_provider and backend_url from trading_agents_config
- Look up the expected URL patterns for the provider
- Check whether backend_url matches any expected pattern (regex)
- If no match is found, return mismatch details before graph initialization
Error Response
When a mismatch is detected, get_signal() returns:
Signal(
    degraded=True,
    reason_code="provider_mismatch",
    metadata={
        "data_quality": {
            "state": "provider_mismatch",
            "provider": "google",
            "backend_url": "https://api.openai.com/v1",
            "expected_patterns": [r"generativelanguage\.googleapis\.com"],
        }
    },
)
Examples
Valid configurations:
- anthropic + https://api.minimaxi.com/anthropic ✓
- openai + https://api.openai.com/v1 ✓
- ollama + http://localhost:11434 ✓
Invalid configurations (detected):
- google + https://api.openai.com/v1 → provider_mismatch
- xai + https://api.minimaxi.com/anthropic → provider_mismatch
- ollama + https://api.openai.com/v1 → provider_mismatch
Design Notes
- Uses the original provider name (not the canonical one) for validation
  - ollama, openrouter, and openai share the same canonical provider (openai) but have different URL patterns
  - Validation must distinguish between them
- Validation runs before TradingAgentsGraph initialization
  - Saves ~5-10s of initialization time on mismatch
  - Avoids confusing error messages from LangChain/provider SDKs
2. Timeout Configuration Validation
Purpose
Warn users when timeout settings may be insufficient for their analyst profile, preventing unexpected research degradation.
Implementation
LLMRunner._validate_timeout_config() checks timeout sufficiency based on analyst count:
_RECOMMENDED_TIMEOUTS = {
    1: {"analyst": 75.0, "research": 30.0},  # single analyst
    2: {"analyst": 90.0, "research": 45.0},  # two analysts
    3: {"analyst": 105.0, "research": 60.0},  # three analysts
    4: {"analyst": 120.0, "research": 75.0},  # four analysts
}
Validation Logic
- Extract selected_analysts from trading_agents_config (default: 4 analysts)
- Extract analyst_node_timeout_secs and research_node_timeout_secs
- Compare against the recommended thresholds for the analyst count
- Log a WARNING if a configured timeout is below the recommended threshold
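The check can be sketched as a pure function that returns advisory warnings instead of logging them (timeout_warnings is a hypothetical name; the real method is LLMRunner._validate_timeout_config):

```python
# Thresholds mirror the _RECOMMENDED_TIMEOUTS table above.
RECOMMENDED_TIMEOUTS = {
    1: {"analyst": 75.0, "research": 30.0},
    2: {"analyst": 90.0, "research": 45.0},
    3: {"analyst": 105.0, "research": 60.0},
    4: {"analyst": 120.0, "research": 75.0},
}

def timeout_warnings(analyst_count: int, analyst_timeout: float,
                     research_timeout: float) -> list[str]:
    """Return advisory warnings; an empty list means the config looks sufficient."""
    rec = RECOMMENDED_TIMEOUTS.get(analyst_count, RECOMMENDED_TIMEOUTS[4])
    warnings = []
    if analyst_timeout < rec["analyst"]:
        warnings.append(
            f"analyst_node_timeout_secs={analyst_timeout}s may be insufficient "
            f"for {analyst_count} analyst(s) (recommended: {rec['analyst']}s)"
        )
    if research_timeout < rec["research"]:
        warnings.append(
            f"research_node_timeout_secs={research_timeout}s may be insufficient "
            f"for {analyst_count} analyst(s) (recommended: {rec['research']}s)"
        )
    return warnings
```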
Warning Example
LLMRunner: analyst_node_timeout_secs=75.0s may be insufficient for 4 analyst(s) (recommended: 120.0s)
Design Notes
- Non-blocking validation - logs a warning but does not prevent initialization
  - Different LLM providers have vastly different speeds (MiniMax vs OpenAI)
  - Users may have profiled their specific setup and chosen lower timeouts intentionally
- Conservative recommendations - thresholds assume slower providers
  - Based on real profiling data from the MiniMax Anthropic-compatible endpoint
  - Users with faster providers can safely ignore the warnings
- Runs at __init__ time - warns early, before any API calls
Timeout Calculation Rationale
Multi-analyst execution is serial for analysts, parallel for research:
Total time ≈ (analyst_count × analyst_timeout) + research_timeout + trading + risk + portfolio
For 4 analysts with 75s timeout each:
- Analyst phase: ~300s (serial)
- Research phase: ~30s (parallel bull/bear)
- Trading phase: ~15s
- Risk phase: ~10s
- Portfolio phase: ~10s
- Total: ~365s (6+ minutes)
Recommended 120s per analyst assumes:
- Some analysts may timeout and degrade
- Degraded path still completes within timeout
- Total execution stays under reasonable bounds (~8-10 minutes)
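The arithmetic above can be captured in a small estimator; the trading/risk/portfolio constants are the rough figures from the breakdown, not measured guarantees:

```python
def estimate_total_seconds(analyst_count: int, analyst_timeout: float,
                           research_timeout: float, trading: float = 15.0,
                           risk: float = 10.0, portfolio: float = 10.0) -> float:
    """Worst-case wall time: analysts run serially, then the downstream phases."""
    return (analyst_count * analyst_timeout + research_timeout
            + trading + risk + portfolio)
```

estimate_total_seconds(4, 75.0, 30.0) reproduces the ~365s total from the example above.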
3. Runtime Error Classification
Purpose
Categorize runtime failures into actionable reason codes for debugging and monitoring.
Error Taxonomy
Defined in orchestrator/contracts/error_taxonomy.py:
class ReasonCode(str, Enum):
    CONFIG_INVALID = "config_invalid"
    PROVIDER_MISMATCH = "provider_mismatch"
    PROVIDER_AUTH_FAILED = "provider_auth_failed"
    LLM_INIT_FAILED = "llm_init_failed"
    LLM_SIGNAL_FAILED = "llm_signal_failed"
    LLM_UNKNOWN_RATING = "llm_unknown_rating"
    # ... (quant-related codes omitted)
Classification Logic
LLMRunner.get_signal() catches exceptions from propagate() and classifies them:
1. Provider mismatch (pre-initialization)
   - Detected by _detect_provider_mismatch() before graph creation
   - Returns provider_mismatch immediately
2. Provider auth failure (runtime)
   - Detected by the _looks_like_provider_auth_failure() heuristic
   - Markers: "authentication_error", "login fail", "invalid api key", "unauthorized", "error code: 401"
   - Returns provider_auth_failed
3. Generic LLM failure (runtime)
   - Any other exception from propagate()
   - Returns llm_signal_failed
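The auth heuristic amounts to a substring scan over the exception message. A sketch built from the markers listed above (the standalone function name is an assumption; the real method is LLMRunner._looks_like_provider_auth_failure):

```python
# Marker strings mirror the documented heuristic.
AUTH_FAILURE_MARKERS = (
    "authentication_error", "login fail", "invalid api key",
    "unauthorized", "error code: 401",
)

def looks_like_provider_auth_failure(exc: Exception) -> bool:
    """Heuristic: does the exception message contain a known auth-failure marker?"""
    message = str(exc).lower()
    return any(marker in message for marker in AUTH_FAILURE_MARKERS)
```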
Error Response Structure
All error signals include:
Signal(
    degraded=True,
    reason_code="<reason_code>",
    direction=0,
    confidence=0.0,
    metadata={
        "error": "<exception message>",
        "data_quality": {
            "state": "<state>",
            # ... additional context
        }
    },
)
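Callers can branch on reason_code without parsing error strings. A hypothetical consumer, with Signal modeled as a plain dict for illustration:

```python
def triage_signal(signal: dict) -> str:
    """Map a degraded signal to a coarse remediation bucket."""
    if not signal.get("degraded"):
        return "ok"
    reason = signal.get("reason_code", "")
    if reason in ("provider_mismatch", "config_invalid"):
        return "fix_config"      # bad provider/base_url pairing or settings
    if reason == "provider_auth_failed":
        return "rotate_api_key"  # credentials rejected by the provider
    return "investigate"         # llm_signal_failed and other runtime errors
```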
Design Notes
- Fail-fast on config errors - mismatch detected before expensive operations
- Heuristic auth detection - no API call overhead, relies on error message patterns
- Structured metadata - data_quality.state mirrors reason_code for consistency
4. Testing
Test Coverage
orchestrator/tests/test_llm_runner.py includes:
Provider matrix validation:
- test_detect_provider_mismatch_google_with_openai_url
- test_detect_provider_mismatch_xai_with_anthropic_url
- test_detect_provider_mismatch_ollama_with_openai_url
- test_detect_provider_mismatch_valid_anthropic_minimax
- test_detect_provider_mismatch_valid_openai
Timeout validation:
- test_timeout_validation_warns_for_multiple_analysts_low_timeout
- test_timeout_validation_no_warn_for_single_analyst
- test_timeout_validation_no_warn_for_sufficient_timeout
Error classification:
- test_get_signal_classifies_provider_auth_failure
- test_get_signal_returns_provider_mismatch_before_graph_init
- test_get_signal_returns_reason_code_on_propagate_failure
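The mismatch tests boil down to regex assertions against the pattern matrix. A self-contained sketch of their shape, using a stand-in table rather than the real LLMRunner:

```python
import re

# Stand-in for two matrix entries; the real tests call
# LLMRunner._detect_provider_mismatch directly.
PATTERNS = {
    "google": [r"generativelanguage\.googleapis\.com"],
    "openai": [r"api\.openai\.com"],
}

def matches(provider: str, url: str) -> bool:
    return any(re.search(p, url) for p in PATTERNS[provider])

def test_detect_provider_mismatch_google_with_openai_url():
    assert not matches("google", "https://api.openai.com/v1")

def test_detect_provider_mismatch_valid_openai():
    assert matches("openai", "https://api.openai.com/v1")
```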
Running Tests
cd /path/to/TradingAgents
python -m pytest orchestrator/tests/test_llm_runner.py -v
5. Maintenance
Adding New Providers
When adding a new provider to tradingagents/llm_clients/factory.py:
- Add the URL pattern to _PROVIDER_BASE_URL_PATTERNS in llm_runner.py
- Add test cases for valid and invalid configurations
- Update this documentation
Adjusting Timeout Recommendations
If profiling shows different timeout requirements:
- Update _RECOMMENDED_TIMEOUTS in llm_runner.py
- Document the rationale in this file
- Update test expectations if needed
Extending Error Classification
To add new reason codes:
- Add the code to the ReasonCode enum in contracts/error_taxonomy.py
- Add detection logic in LLMRunner.get_signal()
- Add a test case in test_llm_runner.py
- Update this documentation
6. Known Limitations
API Key Validation
Current implementation does not validate API key validity before graph initialization:
- Limitation: Expired or invalid keys are only detected during the first propagate() call
- Impact: ~5-10s wasted on graph initialization before the auth failure surfaces
- Rationale: Lightweight key validation would require provider-specific API calls, adding latency and complexity
- Mitigation: Auth failures are still classified correctly as provider_auth_failed
Provider Pattern Maintenance
URL patterns must be manually kept in sync with provider changes:
- Risk: Provider changes base URL structure (e.g., API versioning)
- Mitigation: Validation is non-blocking; mismatches are logged but don't prevent operation
- Future: Consider moving the patterns to tradingagents/llm_clients/factory.py as part of ProviderSpec
Timeout Recommendations
Recommendations are based on MiniMax profiling and may not generalize:
- Risk: Faster providers (OpenAI GPT-4) may trigger unnecessary warnings
- Mitigation: Warnings are advisory only; users can ignore if they've profiled their setup
- Future: Consider provider-specific timeout recommendations
7. Related Documentation
- docs/contracts/result-contract-v1alpha1.md - Signal contract structure
- docs/architecture/research-provenance.md - Research degradation semantics
- docs/migration/rollback-notes.md - Backend migration status
- orchestrator/contracts/error_taxonomy.py - Complete reason code list