TradingAgents/docs/architecture/orchestrator-validation.md

Orchestrator Configuration Validation

Status: implemented (2026-04-16)
Audience: orchestrator users, backend maintainers
Scope: LLMRunner configuration validation and error classification

Overview

orchestrator/llm_runner.py implements three layers of checks. The first two catch configuration errors before expensive graph initialization or API calls; the third classifies failures that still occur at runtime:

  1. Provider × Base URL Matrix Validation - detects provider/endpoint mismatches
  2. Timeout Configuration Validation - warns when timeouts may be insufficient
  3. Runtime Error Classification - categorizes failures into actionable reason codes

1. Provider × Base URL Matrix Validation

Purpose

Prevent wasted initialization time and API calls when provider and base_url are incompatible.

Implementation

LLMRunner._detect_provider_mismatch() validates provider × base_url combinations using a pattern matrix:

_PROVIDER_BASE_URL_PATTERNS = {
    "anthropic": [r"api\.anthropic\.com", r"api\.minimaxi\.com/anthropic"],
    "openai": [r"api\.openai\.com"],
    "google": [r"generativelanguage\.googleapis\.com"],
    "xai": [r"api\.x\.ai"],
    "ollama": [r"localhost:\d+", r"127\.0\.0\.1:\d+", r"ollama"],
    "openrouter": [r"openrouter\.ai"],
}

Validation Logic

  1. Extract llm_provider and backend_url from trading_agents_config
  2. Look up expected URL patterns for the provider
  3. Check if backend_url matches any expected pattern (regex)
  4. If no match found, return mismatch details before graph initialization
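
The four steps above could be sketched as follows. This is a minimal illustration, not the code in llm_runner.py: the standalone function name detect_provider_mismatch, the case-insensitive re.search matching, and the choice to skip validation for an unknown provider or a missing backend_url are all assumptions.

```python
import re
from typing import Any, Optional

# Pattern matrix copied from the section above.
_PROVIDER_BASE_URL_PATTERNS = {
    "anthropic": [r"api\.anthropic\.com", r"api\.minimaxi\.com/anthropic"],
    "openai": [r"api\.openai\.com"],
    "google": [r"generativelanguage\.googleapis\.com"],
    "xai": [r"api\.x\.ai"],
    "ollama": [r"localhost:\d+", r"127\.0\.0\.1:\d+", r"ollama"],
    "openrouter": [r"openrouter\.ai"],
}


def detect_provider_mismatch(config: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return mismatch details, or None when the combination looks valid."""
    # Step 1: read provider and URL from the config.
    provider = str(config.get("llm_provider", "")).lower()
    backend_url = config.get("backend_url")
    # Step 2: look up the expected patterns for this provider.
    patterns = _PROVIDER_BASE_URL_PATTERNS.get(provider)
    # Assumption: nothing to validate without a URL or a known provider.
    if not backend_url or patterns is None:
        return None
    # Step 3: any matching pattern means the combination is accepted.
    if any(re.search(p, backend_url, re.IGNORECASE) for p in patterns):
        return None
    # Step 4: report the mismatch so the caller can return early.
    return {
        "provider": provider,
        "backend_url": backend_url,
        "expected_patterns": patterns,
    }
```

A caller would run this before constructing the graph and return a degraded signal whenever the result is not None.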

Error Response

When a mismatch is detected, get_signal() returns:

Signal(
    degraded=True,
    reason_code="provider_mismatch",
    metadata={
        "data_quality": {
            "state": "provider_mismatch",
            "provider": "google",
            "backend_url": "https://api.openai.com/v1",
            "expected_patterns": [r"generativelanguage\.googleapis\.com"],
        }
    }
)

Examples

Valid configurations:

  • anthropic + https://api.minimaxi.com/anthropic
  • openai + https://api.openai.com/v1
  • ollama + http://localhost:11434

Invalid configurations (detected):

  • google + https://api.openai.com/v1 → provider_mismatch
  • xai + https://api.minimaxi.com/anthropic → provider_mismatch
  • ollama + https://api.openai.com/v1 → provider_mismatch

Design Notes

  • Uses original provider name (not canonical) for validation
    • ollama, openrouter, and openai share the same canonical provider (openai) but have different URL patterns
    • Validation must distinguish between them
  • Validation runs before TradingAgentsGraph initialization
    • Saves ~5-10s of initialization time on mismatch
    • Avoids confusing error messages from LangChain/provider SDKs

2. Timeout Configuration Validation

Purpose

Warn users when timeout settings may be insufficient for their analyst profile, preventing unexpected research degradation.

Implementation

LLMRunner._validate_timeout_config() checks timeout sufficiency based on analyst count:

_RECOMMENDED_TIMEOUTS = {
    1: {"analyst": 75.0, "research": 30.0},   # single analyst
    2: {"analyst": 90.0, "research": 45.0},   # two analysts
    3: {"analyst": 105.0, "research": 60.0},  # three analysts
    4: {"analyst": 120.0, "research": 75.0},  # four analysts
}

Validation Logic

  1. Extract selected_analysts from trading_agents_config (default: 4 analysts)
  2. Extract analyst_node_timeout_secs and research_node_timeout_secs
  3. Compare against recommended thresholds for analyst count
  4. Log WARNING if configured timeout < recommended threshold
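
A rough sketch of this check is shown below. It is not the actual _validate_timeout_config() body: the function name, the logger name, clamping the analyst count to the 1-4 range of the table, and skipping timeouts that are not set in the config are assumptions made for illustration.

```python
import logging
from typing import Any

logger = logging.getLogger("llm_runner_sketch")  # hypothetical logger name

# Thresholds copied from the table above.
_RECOMMENDED_TIMEOUTS = {
    1: {"analyst": 75.0, "research": 30.0},
    2: {"analyst": 90.0, "research": 45.0},
    3: {"analyst": 105.0, "research": 60.0},
    4: {"analyst": 120.0, "research": 75.0},
}


def validate_timeout_config(config: dict[str, Any]) -> list[str]:
    """Log and return one warning per timeout below its recommended value."""
    analysts = config.get("selected_analysts")
    count = len(analysts) if analysts else 4  # default profile: 4 analysts
    # Assumption: counts outside 1..4 are clamped to the nearest table row.
    recommended = _RECOMMENDED_TIMEOUTS[min(max(count, 1), 4)]
    warnings = []
    for kind, threshold in recommended.items():
        key = f"{kind}_node_timeout_secs"
        configured = config.get(key)
        # Assumption: a timeout that is not configured is not checked.
        if configured is not None and float(configured) < threshold:
            msg = (
                f"LLMRunner: {key}={float(configured)}s may be insufficient "
                f"for {count} analyst(s) (recommended: {threshold}s)"
            )
            logger.warning(msg)  # advisory only; never raises
            warnings.append(msg)
    return warnings
```

Returning the messages as well as logging them is a convenience for testing in this sketch; the source only states that a WARNING is logged.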

Warning Example

LLMRunner: analyst_node_timeout_secs=75.0s may be insufficient for 4 analyst(s) (recommended: 120.0s)

Design Notes

  • Non-blocking validation - logs warning but does not prevent initialization
    • Different LLM providers have vastly different speeds (MiniMax vs OpenAI)
    • Users may have profiled their specific setup and chosen lower timeouts intentionally
  • Conservative recommendations - thresholds assume slower providers
    • Based on real profiling data from MiniMax Anthropic-compatible endpoint
    • Users with faster providers can safely ignore warnings
  • Runs at __init__ time - warns early, before any API calls

Timeout Calculation Rationale

Analysts run serially, while the bull and bear researchers run in parallel, so total time is approximately:

Total time ≈ (analyst_count × analyst_timeout) + research_timeout + trading + risk + portfolio

For 4 analysts with 75s timeout each:

  • Analyst phase: ~300s (serial)
  • Research phase: ~30s (parallel bull/bear)
  • Trading phase: ~15s
  • Risk phase: ~10s
  • Portfolio phase: ~10s
  • Total: ~365s (6+ minutes)

Recommended 120s per analyst assumes:

  • Some analysts may time out and degrade
  • The degraded path still completes within the timeout
  • Total execution stays under reasonable bounds (~8-10 minutes)
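
The estimate above can be reproduced with a few lines of arithmetic. This sketch reuses the phase durations quoted above (15s trading, 10s risk, 10s portfolio) as defaults. The function name estimate_total_secs is illustrative, and it treats every analyst as running for its full timeout, which is a worst-case assumption.

```python
def estimate_total_secs(
    analyst_count: int,
    analyst_timeout: float,
    research_timeout: float,
    trading: float = 15.0,
    risk: float = 10.0,
    portfolio: float = 10.0,
) -> float:
    """Worst-case wall time: serial analysts, then one parallel research phase."""
    analyst_phase = analyst_count * analyst_timeout  # analysts run one after another
    return analyst_phase + research_timeout + trading + risk + portfolio


print(estimate_total_secs(4, 75.0, 30.0))   # 365.0
print(estimate_total_secs(4, 120.0, 75.0))  # 590.0
```

With the recommended 4-analyst values (120s analyst, 75s research) the worst case is 590s, just under 10 minutes, which is consistent with the ~8-10 minute bound.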

3. Runtime Error Classification

Purpose

Categorize runtime failures into actionable reason codes for debugging and monitoring.

Error Taxonomy

Defined in orchestrator/contracts/error_taxonomy.py:

class ReasonCode(str, Enum):
    CONFIG_INVALID = "config_invalid"
    PROVIDER_MISMATCH = "provider_mismatch"
    PROVIDER_AUTH_FAILED = "provider_auth_failed"
    LLM_INIT_FAILED = "llm_init_failed"
    LLM_SIGNAL_FAILED = "llm_signal_failed"
    LLM_UNKNOWN_RATING = "llm_unknown_rating"
    # ... (quant-related codes omitted)

Classification Logic

LLMRunner.get_signal() classifies failures in three stages. The first is a pre-initialization check; the other two apply to exceptions caught from propagate():

  1. Provider mismatch (pre-initialization)

    • Detected by _detect_provider_mismatch() before graph creation
    • Returns provider_mismatch immediately
  2. Provider auth failure (runtime)

    • Detected by _looks_like_provider_auth_failure() heuristic
    • Markers: "authentication_error", "login fail", "invalid api key", "unauthorized", "error code: 401"
    • Returns provider_auth_failed
  3. Generic LLM failure (runtime)

    • Any other exception from propagate()
    • Returns llm_signal_failed
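
The runtime half of this logic (stages 2 and 3) might look like the sketch below. The marker strings come from the list above; the function names and the case-insensitive substring match on str(exc) are assumptions, since the source describes the heuristic only as message-pattern based.

```python
# Marker strings listed above for the auth-failure heuristic.
_AUTH_FAILURE_MARKERS = (
    "authentication_error",
    "login fail",
    "invalid api key",
    "unauthorized",
    "error code: 401",
)


def looks_like_provider_auth_failure(exc: Exception) -> bool:
    """Heuristic: inspect the exception message, make no API call."""
    text = str(exc).lower()  # assumption: matching ignores case
    return any(marker in text for marker in _AUTH_FAILURE_MARKERS)


def classify_propagate_error(exc: Exception) -> str:
    """Map an exception raised by propagate() to a reason code string."""
    if looks_like_provider_auth_failure(exc):
        return "provider_auth_failed"
    # Anything else falls through to the generic runtime failure.
    return "llm_signal_failed"
```

Because the check is purely textual, a provider that words its 401 response differently would fall through to llm_signal_failed until a new marker is added.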

Error Response Structure

All error signals include:

Signal(
    degraded=True,
    reason_code="<reason_code>",
    direction=0,
    confidence=0.0,
    metadata={
        "error": "<exception message>",
        "data_quality": {
            "state": "<state>",
            # ... additional context
        }
    }
)
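
One way to keep this structure uniform is a small helper that builds every error signal. The Signal dataclass below is a stand-in with only the fields shown above, not the real contract class, and build_error_signal is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Signal:
    """Stand-in for the orchestrator's Signal; only the fields shown above."""
    degraded: bool
    reason_code: str
    direction: int = 0
    confidence: float = 0.0
    metadata: dict[str, Any] = field(default_factory=dict)


def build_error_signal(reason_code: str, exc: Exception, **context: Any) -> Signal:
    """Build a neutral, degraded signal whose state mirrors the reason code."""
    return Signal(
        degraded=True,
        reason_code=reason_code,
        direction=0,       # neutral direction on failure
        confidence=0.0,    # no confidence on failure
        metadata={
            "error": str(exc),
            "data_quality": {"state": reason_code, **context},
        },
    )
```

Routing all failures through one constructor makes it harder for data_quality.state and reason_code to drift apart.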

Design Notes

  • Fail-fast on config errors - mismatch detected before expensive operations
  • Heuristic auth detection - no API call overhead, relies on error message patterns
  • Structured metadata - data_quality.state mirrors reason_code for consistency

4. Testing

Test Coverage

orchestrator/tests/test_llm_runner.py includes:

Provider matrix validation:

  • test_detect_provider_mismatch_google_with_openai_url
  • test_detect_provider_mismatch_xai_with_anthropic_url
  • test_detect_provider_mismatch_ollama_with_openai_url
  • test_detect_provider_mismatch_valid_anthropic_minimax
  • test_detect_provider_mismatch_valid_openai

Timeout validation:

  • test_timeout_validation_warns_for_multiple_analysts_low_timeout
  • test_timeout_validation_no_warn_for_single_analyst
  • test_timeout_validation_no_warn_for_sufficient_timeout

Error classification:

  • test_get_signal_classifies_provider_auth_failure
  • test_get_signal_returns_provider_mismatch_before_graph_init
  • test_get_signal_returns_reason_code_on_propagate_failure

Running Tests

cd /path/to/TradingAgents
python -m pytest orchestrator/tests/test_llm_runner.py -v

5. Maintenance

Adding New Providers

When adding a new provider to tradingagents/llm_clients/factory.py:

  1. Add URL pattern to _PROVIDER_BASE_URL_PATTERNS in llm_runner.py
  2. Add test cases for valid and invalid configurations
  3. Update this documentation
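
As an illustration of steps 1 and 2, the sketch below registers a pattern for a made-up provider and checks one valid and one invalid URL. The provider name exampleai and its api.exampleai.test host are hypothetical, and the matches helper is a simplified stand-in for the real validation method.

```python
import re

# Reduced copy of the pattern matrix, for illustration only.
patterns = {"openai": [r"api\.openai\.com"]}

# Step 1: register the URL pattern for the (hypothetical) new provider.
patterns["exampleai"] = [r"api\.exampleai\.test"]


def matches(provider: str, url: str) -> bool:
    """True when the URL matches any pattern registered for the provider."""
    return any(re.search(p, url) for p in patterns[provider])


# Step 2: cover both a valid and an invalid configuration.
assert matches("exampleai", "https://api.exampleai.test/v1")
assert not matches("exampleai", "https://api.openai.com/v1")
```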

Adjusting Timeout Recommendations

If profiling shows different timeout requirements:

  1. Update _RECOMMENDED_TIMEOUTS in llm_runner.py
  2. Document rationale in this file
  3. Update test expectations if needed

Extending Error Classification

To add new reason codes:

  1. Add to ReasonCode enum in contracts/error_taxonomy.py
  2. Add detection logic in LLMRunner.get_signal()
  3. Add test case in test_llm_runner.py
  4. Update this documentation
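
As a sketch of steps 1 and 2, the example below adds a made-up rate-limit code. PROVIDER_RATE_LIMITED, its marker strings, and the classify function are hypothetical and exist only to show the shape of the change; the enum is a reduced copy of the one above.

```python
from enum import Enum


class ReasonCode(str, Enum):
    PROVIDER_AUTH_FAILED = "provider_auth_failed"
    LLM_SIGNAL_FAILED = "llm_signal_failed"
    # Step 1: the new (hypothetical) code is added to the enum.
    PROVIDER_RATE_LIMITED = "provider_rate_limited"


# Assumed markers; real provider messages would need to be checked.
_RATE_LIMIT_MARKERS = ("rate limit", "error code: 429")


def classify(exc: Exception) -> ReasonCode:
    """Step 2: detect the new condition before the generic fallback."""
    text = str(exc).lower()
    if any(marker in text for marker in _RATE_LIMIT_MARKERS):
        return ReasonCode.PROVIDER_RATE_LIMITED
    return ReasonCode.LLM_SIGNAL_FAILED
```

More specific checks belong ahead of the generic llm_signal_failed fallback so that they are not shadowed by it.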

6. Known Limitations

API Key Validation

The current implementation does not check API key validity before graph initialization:

  • Limitation: Expired/invalid keys are only detected during first propagate() call
  • Impact: ~5-10s wasted on graph initialization before auth failure
  • Rationale: Lightweight key validation would require provider-specific API calls, adding latency and complexity
  • Mitigation: Auth failures are still classified correctly as provider_auth_failed

Provider Pattern Maintenance

URL patterns must be manually kept in sync with provider changes:

  • Risk: Provider changes base URL structure (e.g., API versioning)
  • Mitigation: A mismatch is reported as a degraded Signal with reason_code provider_mismatch and the expected patterns in metadata, so a stale pattern is easy to diagnose and fix
  • Future: Consider moving patterns to tradingagents/llm_clients/factory.py as part of ProviderSpec

Timeout Recommendations

Recommendations are based on MiniMax profiling and may not generalize:

  • Risk: Faster providers (OpenAI GPT-4) may trigger unnecessary warnings
  • Mitigation: Warnings are advisory only; users can ignore if they've profiled their setup
  • Future: Consider provider-specific timeout recommendations

7. Related Documentation

  • docs/contracts/result-contract-v1alpha1.md - Signal contract structure
  • docs/architecture/research-provenance.md - Research degradation semantics
  • docs/migration/rollback-notes.md - Backend migration status
  • orchestrator/contracts/error_taxonomy.py - Complete reason code list