Orchestrator Configuration Validation

Status: implemented (2026-04-16)
Audience: orchestrator users, backend maintainers
Scope: LLMRunner configuration validation and error classification

Overview

orchestrator/llm_runner.py implements three layers of validation and error handling. The first two catch configuration problems before expensive graph initialization or API calls; the third classifies failures that still occur at runtime:

Provider × Base URL Matrix Validation - detects provider/endpoint mismatches
Timeout Configuration Validation - warns when timeouts may be insufficient
Runtime Error Classification - categorizes failures into actionable reason codes

1. Provider × Base URL Matrix Validation

Purpose

Prevent wasted initialization time and API calls when provider and base_url are incompatible.

Implementation

LLMRunner._detect_provider_mismatch() validates provider × base_url combinations using a pattern matrix:

_PROVIDER_BASE_URL_PATTERNS = {
    "anthropic": [r"api\.anthropic\.com", r"api\.minimaxi\.com/anthropic"],
    "openai": [r"api\.openai\.com"],
    "google": [r"generativelanguage\.googleapis\.com"],
    "xai": [r"api\.x\.ai"],
    "ollama": [r"localhost:\d+", r"127\.0\.0\.1:\d+", r"ollama"],
    "openrouter": [r"openrouter\.ai"],
}

Validation Logic

Extract llm_provider and backend_url from trading_agents_config
Look up expected URL patterns for the provider
Check if backend_url matches any expected pattern (regex)
If no match found, return mismatch details before graph initialization
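
The steps above can be sketched as a standalone function. This is a minimal illustration, not the actual body of LLMRunner._detect_provider_mismatch(): the function name, the case-insensitive search, and the decision to skip unknown providers or a missing backend_url are all assumptions.

```python
import re
from typing import Optional

# Pattern matrix copied from the section above.
_PROVIDER_BASE_URL_PATTERNS = {
    "anthropic": [r"api\.anthropic\.com", r"api\.minimaxi\.com/anthropic"],
    "openai": [r"api\.openai\.com"],
    "google": [r"generativelanguage\.googleapis\.com"],
    "xai": [r"api\.x\.ai"],
    "ollama": [r"localhost:\d+", r"127\.0\.0\.1:\d+", r"ollama"],
    "openrouter": [r"openrouter\.ai"],
}


def detect_provider_mismatch(config: dict) -> Optional[dict]:
    """Return mismatch details, or None when the pair looks compatible."""
    provider = str(config.get("llm_provider", "")).lower()
    backend_url = config.get("backend_url")
    patterns = _PROVIDER_BASE_URL_PATTERNS.get(provider)

    # Assumption: unknown providers and unset URLs are not flagged.
    if not patterns or not backend_url:
        return None

    # A single matching pattern is enough to accept the combination.
    if any(re.search(p, backend_url, re.IGNORECASE) for p in patterns):
        return None

    return {
        "provider": provider,
        "backend_url": backend_url,
        "expected_patterns": list(patterns),
    }
```

Because the lookup uses the provider name as configured, ollama and openai are checked against different patterns even though they share a canonical provider.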

Error Response

When a mismatch is detected, get_signal() returns:

Signal(
    degraded=True,
    reason_code="provider_mismatch",
    metadata={
        "data_quality": {
            "state": "provider_mismatch",
            "provider": "google",
            "backend_url": "https://api.openai.com/v1",
            "expected_patterns": [r"generativelanguage\.googleapis\.com"],
        }
    }
)

Examples

Valid configurations:

anthropic + https://api.minimaxi.com/anthropic ✓
openai + https://api.openai.com/v1 ✓
ollama + http://localhost:11434 ✓

Invalid configurations (detected):

google + https://api.openai.com/v1 → provider_mismatch
xai + https://api.minimaxi.com/anthropic → provider_mismatch
ollama + https://api.openai.com/v1 → provider_mismatch

Design Notes

- Uses original provider name (not canonical) for validation
  - ollama, openrouter, and openai share the same canonical provider (openai) but have different URL patterns
  - Validation must distinguish between them
- Validation runs before TradingAgentsGraph initialization
  - Saves ~5-10s of initialization time on mismatch
  - Avoids confusing error messages from LangChain/provider SDKs

2. Timeout Configuration Validation

Purpose

Warn users when timeout settings may be insufficient for their analyst profile, preventing unexpected research degradation.

Implementation

LLMRunner._validate_timeout_config() checks timeout sufficiency based on analyst count:

_RECOMMENDED_TIMEOUTS = {
    1: {"analyst": 75.0, "research": 30.0},   # single analyst
    2: {"analyst": 90.0, "research": 45.0},   # two analysts
    3: {"analyst": 105.0, "research": 60.0},  # three analysts
    4: {"analyst": 120.0, "research": 75.0},  # four analysts
}

Validation Logic

Extract selected_analysts from trading_agents_config (default: 4 analysts)
Extract analyst_node_timeout_secs and research_node_timeout_secs
Compare against recommended thresholds for analyst count
Log WARNING if configured timeout < recommended threshold
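
A minimal sketch of this check follows, assuming a plain dict config. The function name, the logger name, the clamping of the analyst count to the 1-4 range of the table, and the choice to skip unset timeouts are assumptions rather than the actual LLMRunner._validate_timeout_config() behavior. The analyst names in the usage note are illustrative.

```python
import logging
from typing import List

logger = logging.getLogger("llm_runner_sketch")  # hypothetical logger name

# Thresholds copied from the table above.
_RECOMMENDED_TIMEOUTS = {
    1: {"analyst": 75.0, "research": 30.0},
    2: {"analyst": 90.0, "research": 45.0},
    3: {"analyst": 105.0, "research": 60.0},
    4: {"analyst": 120.0, "research": 75.0},
}


def validate_timeout_config(config: dict) -> List[str]:
    """Log and return a warning for each timeout below its recommendation."""
    analysts = config.get("selected_analysts")
    # Documented default is four analysts; clamping to 1-4 is an assumption.
    count = len(analysts) if analysts else 4
    count = max(1, min(count, 4))
    recommended = _RECOMMENDED_TIMEOUTS[count]

    warnings = []
    checks = (
        ("analyst_node_timeout_secs", "analyst"),
        ("research_node_timeout_secs", "research"),
    )
    for config_key, table_key in checks:
        configured = config.get(config_key)
        if configured is None:
            continue  # assumption: unset timeouts are not checked
        if float(configured) < recommended[table_key]:
            message = (
                f"LLMRunner: {config_key}={float(configured)}s may be insufficient "
                f"for {count} analyst(s) (recommended: {recommended[table_key]}s)"
            )
            logger.warning(message)  # advisory only, never raises
            warnings.append(message)
    return warnings
```

Calling it with four analysts (for example market, news, social, fundamentals) and analyst_node_timeout_secs=75 produces one warning in the format shown under Warning Example below; initialization would continue either way.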

Warning Example

LLMRunner: analyst_node_timeout_secs=75.0s may be insufficient for 4 analyst(s) (recommended: 120.0s)

Design Notes

- Non-blocking validation - logs warning but does not prevent initialization
  - Different LLM providers have vastly different speeds (MiniMax vs OpenAI)
  - Users may have profiled their specific setup and chosen lower timeouts intentionally
- Conservative recommendations - thresholds assume slower providers
  - Based on real profiling data from MiniMax Anthropic-compatible endpoint
  - Users with faster providers can safely ignore warnings
- Runs at __init__ time - warns early, before any API calls

Timeout Calculation Rationale

Analysts run one after another, while the research phase runs its bull and bear branches in parallel, so analyst time scales with the analyst count:

Total time ≈ (analyst_count × analyst_timeout) + research_timeout + trading + risk + portfolio

For 4 analysts with a 75s timeout each, the worst case (every analyst uses its full timeout) is:

Analyst phase: ~300s (serial)
Research phase: ~30s (parallel bull/bear)
Trading phase: ~15s
Risk phase: ~10s
Portfolio phase: ~10s
Total: ~365s (6+ minutes)

Recommended 120s per analyst assumes:

Some analysts may time out and degrade
Degraded path still completes within timeout
Total execution stays under reasonable bounds (~8-10 minutes)
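
The arithmetic above can be reproduced with a few lines of Python. This is a worked example, not project code: the function name is hypothetical, and the trading, risk, and portfolio durations are assumed to carry over unchanged from the 75s example.

```python
def estimate_total_secs(
    analyst_count: int,
    analyst_timeout: float,
    research_timeout: float,
    trading: float = 15.0,    # assumed, taken from the 75s example
    risk: float = 10.0,       # assumed, taken from the 75s example
    portfolio: float = 10.0,  # assumed, taken from the 75s example
) -> float:
    """Worst-case wall time: serial analysts plus the downstream phases."""
    return (
        analyst_count * analyst_timeout
        + research_timeout
        + trading
        + risk
        + portfolio
    )


# Four analysts at 75s each, research capped at 30s.
worst_case_75 = estimate_total_secs(4, 75.0, 30.0)    # 365.0

# Four analysts at the recommended 120s, research at the recommended 75s.
worst_case_120 = estimate_total_secs(4, 120.0, 75.0)  # 590.0

print(f"{worst_case_75:.0f}s, {worst_case_120:.0f}s")
```

Under these assumptions the recommended settings give an upper bound of 590s, just under 10 minutes, which is reached only if every analyst runs to its full timeout.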

3. Runtime Error Classification

Purpose

Categorize runtime failures into actionable reason codes for debugging and monitoring.

Error Taxonomy

Defined in orchestrator/contracts/error_taxonomy.py:

from enum import Enum


class ReasonCode(str, Enum):
    CONFIG_INVALID = "config_invalid"
    PROVIDER_MISMATCH = "provider_mismatch"
    PROVIDER_AUTH_FAILED = "provider_auth_failed"
    LLM_INIT_FAILED = "llm_init_failed"
    LLM_SIGNAL_FAILED = "llm_signal_failed"
    LLM_UNKNOWN_RATING = "llm_unknown_rating"
    # ... (quant-related codes omitted)

Classification Logic

LLMRunner.get_signal() checks for a provider mismatch before creating the graph, then catches exceptions from propagate() and classifies them:

- Provider mismatch (pre-initialization)
  - Detected by _detect_provider_mismatch() before graph creation
  - Returns provider_mismatch immediately
- Provider auth failure (runtime)
  - Detected by _looks_like_provider_auth_failure() heuristic
  - Markers: "authentication_error", "login fail", "invalid api key", "unauthorized", "error code: 401"
  - Returns provider_auth_failed
- Generic LLM failure (runtime)
  - Any other exception from propagate()
  - Returns llm_signal_failed
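
The runtime half of this logic can be sketched as below. The marker list comes from the text above; the function names and the case-insensitive substring match on str(exc) are assumptions, since the actual _looks_like_provider_auth_failure() implementation is not shown here.

```python
# Markers listed in the classification rules above.
_AUTH_FAILURE_MARKERS = (
    "authentication_error",
    "login fail",
    "invalid api key",
    "unauthorized",
    "error code: 401",
)


def looks_like_provider_auth_failure(exc: Exception) -> bool:
    """Heuristic: does the exception message contain an auth marker?"""
    # Assumption: matching is case-insensitive on the message text.
    text = str(exc).lower()
    return any(marker in text for marker in _AUTH_FAILURE_MARKERS)


def classify_propagate_failure(exc: Exception) -> str:
    """Map an exception raised by propagate() to a reason code string."""
    if looks_like_provider_auth_failure(exc):
        return "provider_auth_failed"
    # Any other runtime failure falls through to the generic code.
    return "llm_signal_failed"
```

For example, RuntimeError("Error code: 401 - Invalid API key") would map to provider_auth_failed, while TimeoutError("request timed out") would map to llm_signal_failed. The heuristic costs no extra API call, but it depends on providers keeping their error wording stable.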

Error Response Structure

All error signals include:

Signal(
    degraded=True,
    reason_code="<reason_code>",
    direction=0,
    confidence=0.0,
    metadata={
        "error": "<exception message>",
        "data_quality": {
            "state": "<state>",
            # ... additional context
        }
    }
)
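
As a rough illustration of how such a signal might be assembled, the sketch below uses a stand-in dataclass. The Signal class here is hypothetical and carries only the fields shown above (the real contract may require more), and build_error_signal is an invented helper name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Signal:
    """Hypothetical stand-in with only the fields shown in this document."""
    degraded: bool
    reason_code: str
    direction: int = 0
    confidence: float = 0.0
    metadata: Dict[str, Any] = field(default_factory=dict)


def build_error_signal(reason_code: str, error: str, **context: Any) -> Signal:
    """Build a degraded signal whose data_quality.state mirrors reason_code."""
    return Signal(
        degraded=True,
        reason_code=reason_code,
        direction=0,      # error signals carry a neutral direction
        confidence=0.0,   # and zero confidence
        metadata={
            "error": error,
            "data_quality": {"state": reason_code, **context},
        },
    )


signal = build_error_signal(
    "provider_auth_failed",
    "Error code: 401",
    provider="anthropic",  # illustrative extra context
)
```

Deriving data_quality.state from the same argument as reason_code keeps the two fields from drifting apart.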

Design Notes

Fail-fast on config errors - mismatch detected before expensive operations
Heuristic auth detection - no API call overhead, relies on error message patterns
Structured metadata - data_quality.state mirrors reason_code for consistency

4. Testing

Test Coverage

orchestrator/tests/test_llm_runner.py includes:

Provider matrix validation:

test_detect_provider_mismatch_google_with_openai_url
test_detect_provider_mismatch_xai_with_anthropic_url
test_detect_provider_mismatch_ollama_with_openai_url
test_detect_provider_mismatch_valid_anthropic_minimax
test_detect_provider_mismatch_valid_openai

Timeout validation:

test_timeout_validation_warns_for_multiple_analysts_low_timeout
test_timeout_validation_no_warn_for_single_analyst
test_timeout_validation_no_warn_for_sufficient_timeout

Error classification:

test_get_signal_classifies_provider_auth_failure
test_get_signal_returns_provider_mismatch_before_graph_init
test_get_signal_returns_reason_code_on_propagate_failure

Running Tests

cd /path/to/TradingAgents
python -m pytest orchestrator/tests/test_llm_runner.py -v

5. Maintenance

Adding New Providers

When adding a new provider to tradingagents/llm_clients/factory.py:

Add URL pattern to _PROVIDER_BASE_URL_PATTERNS in llm_runner.py
Add test cases for valid and invalid configurations
Update this documentation

Adjusting Timeout Recommendations

If profiling shows different timeout requirements:

Update _RECOMMENDED_TIMEOUTS in llm_runner.py
Document rationale in this file
Update test expectations if needed

Extending Error Classification

To add new reason codes:

Add to ReasonCode enum in contracts/error_taxonomy.py
Add detection logic in LLMRunner.get_signal()
Add test case in test_llm_runner.py
Update this documentation

6. Known Limitations

API Key Validation

The current implementation does not check API keys before graph initialization:

Limitation: Expired or invalid keys are detected only during the first propagate() call
Impact: ~5-10s wasted on graph initialization before auth failure
Rationale: Lightweight key validation would require provider-specific API calls, adding latency and complexity
Mitigation: Auth failures are still classified correctly as provider_auth_failed

Provider Pattern Maintenance

URL patterns must be manually kept in sync with provider changes:

Risk: Provider changes base URL structure (e.g., API versioning)
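Mitigation: A mismatch produces a degraded signal (reason_code provider_mismatch) rather than an exception, and its metadata lists expected_patterns, so a stale pattern is straightforward to diagnose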
Future: Consider moving patterns to tradingagents/llm_clients/factory.py as part of ProviderSpec

Timeout Recommendations

Recommendations are based on MiniMax profiling and may not generalize:

Risk: Faster providers (OpenAI GPT-4) may trigger unnecessary warnings
Mitigation: Warnings are advisory only; users can ignore if they've profiled their setup
Future: Consider provider-specific timeout recommendations

7. Related Documentation

docs/contracts/result-contract-v1alpha1.md - Signal contract structure
docs/architecture/research-provenance.md - Research degradation semantics
docs/migration/rollback-notes.md - Backend migration status
orchestrator/contracts/error_taxonomy.py - Complete reason code list
