TradingAgents/docs/architecture/orchestrator-validation.md
# Orchestrator Configuration Validation
Status: implemented (2026-04-16)
Audience: orchestrator users, backend maintainers
Scope: LLMRunner configuration validation and error classification
## Change Log
**2026-04-16**: Refactored provider validation to centralize patterns in `factory.py`
- Moved `_PROVIDER_BASE_URL_PATTERNS` from `llm_runner.py` to `ProviderSpec.base_url_patterns` in `factory.py`
- Added `validate_provider_base_url()` function with pattern caching for performance
- Added `ProviderMismatch` TypedDict for type-safe validation results
- Split ollama and openrouter into separate `ProviderSpec` entries (previously shared openai's spec)
- Reduced `llm_runner.py` from 45 lines to 13 lines for validation logic
- All 21 tests pass, including 6 provider mismatch tests
## Overview
`orchestrator/llm_runner.py` implements three layers of configuration validation to catch errors before expensive graph initialization or API calls:
1. **Provider × Base URL Matrix Validation** - detects provider/endpoint mismatches
2. **Timeout Configuration Validation** - warns when timeouts may be insufficient
3. **Runtime Error Classification** - categorizes failures into actionable reason codes
## 1. Provider × Base URL Matrix Validation
### Purpose
Prevent wasted initialization time and API calls when provider and base_url are incompatible.
### Implementation
`LLMRunner._detect_provider_mismatch()` validates provider × base_url combinations against a pattern matrix. As of the 2026-04-16 refactor, the patterns live on `ProviderSpec.base_url_patterns` in `tradingagents/llm_clients/factory.py`; the effective matrix is:
```python
_PROVIDER_BASE_URL_PATTERNS = {
    "anthropic": [r"api\.anthropic\.com", r"api\.minimaxi\.com/anthropic"],
    "openai": [r"api\.openai\.com"],
    "google": [r"generativelanguage\.googleapis\.com"],
    "xai": [r"api\.x\.ai"],
    "ollama": [r"localhost:\d+", r"127\.0\.0\.1:\d+", r"ollama"],
    "openrouter": [r"openrouter\.ai"],
}
```
### Validation Logic
1. Extract `llm_provider` and `backend_url` from `trading_agents_config`
2. Look up expected URL patterns for the provider
3. Check if `backend_url` matches any expected pattern (regex)
4. If no match found, return mismatch details before graph initialization
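These four steps can be sketched as a standalone function. This is a minimal illustration, not the real implementation: the actual method is `LLMRunner._detect_provider_mismatch()` and its exact signature and return type may differ.

```python
import re

# Mirror of the provider -> URL pattern matrix shown above.
_PROVIDER_BASE_URL_PATTERNS = {
    "anthropic": [r"api\.anthropic\.com", r"api\.minimaxi\.com/anthropic"],
    "openai": [r"api\.openai\.com"],
    "google": [r"generativelanguage\.googleapis\.com"],
    "xai": [r"api\.x\.ai"],
    "ollama": [r"localhost:\d+", r"127\.0\.0\.1:\d+", r"ollama"],
    "openrouter": [r"openrouter\.ai"],
}

def detect_provider_mismatch(provider: str, backend_url: str):
    """Return mismatch details if backend_url matches none of the
    provider's expected patterns; None when the pair is compatible
    or the provider is unknown to the matrix."""
    patterns = _PROVIDER_BASE_URL_PATTERNS.get(provider)
    if not patterns:
        return None  # unknown provider: skip validation rather than block
    if any(re.search(p, backend_url) for p in patterns):
        return None
    return {
        "provider": provider,
        "backend_url": backend_url,
        "expected_patterns": patterns,
    }
```

Note that a regex *search* (substring match) is enough here; the patterns anchor on the provider's hostname rather than the full URL, so path suffixes like `/v1` pass through.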
### Error Response
When mismatch detected, `get_signal()` returns:
```python
Signal(
    degraded=True,
    reason_code="provider_mismatch",
    metadata={
        "data_quality": {
            "state": "provider_mismatch",
            "provider": "google",
            "backend_url": "https://api.openai.com/v1",
            "expected_patterns": [r"generativelanguage\.googleapis\.com"],
        }
    }
)
```
### Examples
**Valid configurations:**
- `anthropic` + `https://api.minimaxi.com/anthropic`
- `openai` + `https://api.openai.com/v1`
- `ollama` + `http://localhost:11434`
**Invalid configurations (detected):**
- `google` + `https://api.openai.com/v1` → `provider_mismatch`
- `xai` + `https://api.minimaxi.com/anthropic` → `provider_mismatch`
- `ollama` + `https://api.openai.com/v1` → `provider_mismatch`
### Design Notes
- Uses **original provider name** (not canonical) for validation
- `ollama`, `openrouter`, and `openai` share the same canonical provider (`openai`) but have different URL patterns
- Validation must distinguish between them
- Validation runs **before** `TradingAgentsGraph` initialization
- Saves ~5-10s of initialization time on mismatch
- Avoids confusing error messages from LangChain/provider SDKs
## 2. Timeout Configuration Validation
### Purpose
Warn users when timeout settings may be insufficient for their analyst profile, preventing unexpected research degradation.
### Implementation
`LLMRunner._validate_timeout_config()` checks timeout sufficiency based on analyst count:
```python
_RECOMMENDED_TIMEOUTS = {
    1: {"analyst": 75.0, "research": 30.0},   # single analyst
    2: {"analyst": 90.0, "research": 45.0},   # two analysts
    3: {"analyst": 105.0, "research": 60.0},  # three analysts
    4: {"analyst": 120.0, "research": 75.0},  # four analysts
}
```
### Validation Logic
1. Extract `selected_analysts` from `trading_agents_config` (default: 4 analysts)
2. Extract `analyst_node_timeout_secs` and `research_node_timeout_secs`
3. Compare against recommended thresholds for analyst count
4. Log `WARNING` if configured timeout < recommended threshold
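The validation steps above can be sketched as follows. The function name and return value are illustrative (the real method is `LLMRunner._validate_timeout_config()` and may return nothing); the thresholds are the table from above.

```python
import logging

logger = logging.getLogger(__name__)

# Mirror of the recommended-timeout table shown above.
_RECOMMENDED_TIMEOUTS = {
    1: {"analyst": 75.0, "research": 30.0},
    2: {"analyst": 90.0, "research": 45.0},
    3: {"analyst": 105.0, "research": 60.0},
    4: {"analyst": 120.0, "research": 75.0},
}

def validate_timeout_config(config: dict) -> list:
    """Log (and return) advisory warnings; never blocks initialization."""
    # Default of 4 analysts when selected_analysts is absent, per step 1.
    analyst_count = len(config.get("selected_analysts", [1, 2, 3, 4]))
    recommended = _RECOMMENDED_TIMEOUTS.get(analyst_count, _RECOMMENDED_TIMEOUTS[4])
    warnings = []
    for key, label in (("analyst_node_timeout_secs", "analyst"),
                       ("research_node_timeout_secs", "research")):
        configured = config.get(key)
        if configured is not None and configured < recommended[label]:
            warnings.append(
                f"LLMRunner: {key}={configured}s may be insufficient for "
                f"{analyst_count} analyst(s) (recommended: {recommended[label]}s)"
            )
    for message in warnings:
        logger.warning(message)
    return warnings
```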
### Warning Example
```
LLMRunner: analyst_node_timeout_secs=75.0s may be insufficient for 4 analyst(s) (recommended: 120.0s)
```
### Design Notes
- **Non-blocking validation** - logs warning but does not prevent initialization
- Different LLM providers have vastly different speeds (MiniMax vs OpenAI)
- Users may have profiled their specific setup and chosen lower timeouts intentionally
- **Conservative recommendations** - thresholds assume slower providers
- Based on real profiling data from MiniMax Anthropic-compatible endpoint
- Users with faster providers can safely ignore warnings
- **Runs at `__init__` time** - warns early, before any API calls
### Timeout Calculation Rationale
Multi-analyst execution is **serial** for analysts, **parallel** for research:
```
Total time ≈ (analyst_count × analyst_timeout) + research_timeout + trading + risk + portfolio
```
For 4 analysts with 75s timeout each:
- Analyst phase: ~300s (serial)
- Research phase: ~30s (parallel bull/bear)
- Trading phase: ~15s
- Risk phase: ~10s
- Portfolio phase: ~10s
- **Total: ~365s** (6+ minutes)
Recommended 120s per analyst assumes:
- Some analysts may timeout and degrade
- Degraded path still completes within timeout
- Total execution stays under reasonable bounds (~8-10 minutes)
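The formula and worked example above can be captured in a small estimator. The trading/risk/portfolio durations are the illustrative figures from the example, not measured constants:

```python
def estimate_total_runtime(analyst_count: int,
                           analyst_timeout: float,
                           research_timeout: float,
                           trading: float = 15.0,
                           risk: float = 10.0,
                           portfolio: float = 10.0) -> float:
    """Worst-case wall-clock estimate: analysts run serially,
    bull/bear research runs in parallel (one research_timeout)."""
    return (analyst_count * analyst_timeout
            + research_timeout + trading + risk + portfolio)

# 4 analysts at 75s each: 300 + 30 + 15 + 10 + 10 = 365s
print(estimate_total_runtime(4, 75.0, 30.0))
```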
## 3. Runtime Error Classification
### Purpose
Categorize runtime failures into actionable reason codes for debugging and monitoring.
### Error Taxonomy
Defined in `orchestrator/contracts/error_taxonomy.py`:
```python
class ReasonCode(str, Enum):
    CONFIG_INVALID = "config_invalid"
    PROVIDER_MISMATCH = "provider_mismatch"
    PROVIDER_AUTH_FAILED = "provider_auth_failed"
    LLM_INIT_FAILED = "llm_init_failed"
    LLM_SIGNAL_FAILED = "llm_signal_failed"
    LLM_UNKNOWN_RATING = "llm_unknown_rating"
    # ... (quant-related codes omitted)
```
### Classification Logic
`LLMRunner.get_signal()` catches exceptions from `propagate()` and classifies them:
1. **Provider mismatch** (pre-initialization)
- Detected by `_detect_provider_mismatch()` before graph creation
- Returns `provider_mismatch` immediately
2. **Provider auth failure** (runtime)
- Detected by `_looks_like_provider_auth_failure()` heuristic
- Markers: `"authentication_error"`, `"login fail"`, `"invalid api key"`, `"unauthorized"`, `"error code: 401"`
- Returns `provider_auth_failed`
3. **Generic LLM failure** (runtime)
- Any other exception from `propagate()`
- Returns `llm_signal_failed`
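The runtime branches (cases 2 and 3) reduce to a marker-based classifier. The marker list is taken from above; the function name is illustrative, since the real heuristic lives in `_looks_like_provider_auth_failure()`:

```python
# Markers from the auth-failure heuristic described above.
_AUTH_FAILURE_MARKERS = (
    "authentication_error",
    "login fail",
    "invalid api key",
    "unauthorized",
    "error code: 401",
)

def classify_runtime_error(exc: Exception) -> str:
    """Map an exception raised by propagate() to a reason code using
    a case-insensitive substring match; no API calls involved."""
    message = str(exc).lower()
    if any(marker in message for marker in _AUTH_FAILURE_MARKERS):
        return "provider_auth_failed"
    return "llm_signal_failed"
```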
### Error Response Structure
All error signals include:
```python
Signal(
    degraded=True,
    reason_code="<reason_code>",
    direction=0,
    confidence=0.0,
    metadata={
        "error": "<exception message>",
        "data_quality": {
            "state": "<state>",
            # ... additional context
        }
    }
)
```
### Design Notes
- **Fail-fast on config errors** - mismatch detected before expensive operations
- **Heuristic auth detection** - no API call overhead, relies on error message patterns
- **Structured metadata** - `data_quality.state` mirrors `reason_code` for consistency
## 4. Testing
### Test Coverage
`orchestrator/tests/test_llm_runner.py` includes:
**Provider matrix validation:**
- `test_detect_provider_mismatch_google_with_openai_url`
- `test_detect_provider_mismatch_xai_with_anthropic_url`
- `test_detect_provider_mismatch_ollama_with_openai_url`
- `test_detect_provider_mismatch_valid_anthropic_minimax`
- `test_detect_provider_mismatch_valid_openai`
**Timeout validation:**
- `test_timeout_validation_warns_for_multiple_analysts_low_timeout`
- `test_timeout_validation_no_warn_for_single_analyst`
- `test_timeout_validation_no_warn_for_sufficient_timeout`
**Error classification:**
- `test_get_signal_classifies_provider_auth_failure`
- `test_get_signal_returns_provider_mismatch_before_graph_init`
- `test_get_signal_returns_reason_code_on_propagate_failure`
### Running Tests
```bash
cd /path/to/TradingAgents
python -m pytest orchestrator/tests/test_llm_runner.py -v
```
## 5. Maintenance
### Adding New Providers
When adding a new provider to `tradingagents/llm_clients/factory.py`:
1. Add a new `ProviderSpec` entry to `_PROVIDER_SPECS` tuple with `base_url_patterns`
2. Add test cases for valid and invalid configurations in `orchestrator/tests/test_llm_runner.py`
3. Update this documentation
**Example:**
```python
ProviderSpec(
    canonical_name="newprovider",
    aliases=("newprovider",),
    builder=lambda model, base_url=None, **kwargs: NewProviderClient(model, base_url, **kwargs),
    base_url_patterns=(r"api\.newprovider\.com",),
)
```
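Step 2 might start from pattern-level cases like this sketch. The test names follow the existing suite's naming style, but the helper is a hypothetical stand-in for illustration, not the real factory API:

```python
import re

# Hypothetical patterns for the new provider (matches the spec example above).
NEW_PROVIDER_PATTERNS = (r"api\.newprovider\.com",)

def matches_new_provider(backend_url: str) -> bool:
    """Stand-in check: does the URL match any of the provider's patterns?"""
    return any(re.search(p, backend_url) for p in NEW_PROVIDER_PATTERNS)

def test_detect_provider_mismatch_valid_newprovider():
    assert matches_new_provider("https://api.newprovider.com/v1")

def test_detect_provider_mismatch_newprovider_with_openai_url():
    assert not matches_new_provider("https://api.openai.com/v1")
```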
### Adjusting Timeout Recommendations
If profiling shows different timeout requirements:
1. Update `_RECOMMENDED_TIMEOUTS` in `llm_runner.py`
2. Document rationale in this file
3. Update test expectations if needed
### Extending Error Classification
To add new reason codes:
1. Add to `ReasonCode` enum in `contracts/error_taxonomy.py`
2. Add detection logic in `LLMRunner.get_signal()`
3. Add test case in `test_llm_runner.py`
4. Update this documentation
## 6. Known Limitations
### API Key Validation
Current implementation does **not** validate API key validity before graph initialization:
- **Limitation**: Expired/invalid keys are only detected during first `propagate()` call
- **Impact**: ~5-10s wasted on graph initialization before auth failure
- **Rationale**: Lightweight key validation would require provider-specific API calls, adding latency and complexity
- **Mitigation**: Auth failures are still classified correctly as `provider_auth_failed`
### Provider Pattern Maintenance
~~URL patterns must be manually kept in sync with provider changes:~~
**UPDATE (2026-04-16)**: Provider URL patterns have been moved to `tradingagents/llm_clients/factory.py` as part of `ProviderSpec`. This centralizes validation logic with provider definitions.
**Current implementation:**
- Each `ProviderSpec` includes optional `base_url_patterns` tuple
- `validate_provider_base_url()` function provides validation logic
- `LLMRunner._detect_provider_mismatch()` delegates to factory validation
- Patterns are co-located with provider builders, reducing maintenance burden
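The factory-side flow described above can be sketched as follows. This is a minimal stand-in under stated assumptions: the real `ProviderSpec` carries more fields (such as `builder`), `validate_provider_base_url()`'s actual signature may differ, and the pattern caching mentioned in the change log is omitted.

```python
import re
from typing import NamedTuple, Optional

class ProviderSpec(NamedTuple):
    """Illustrative subset of the factory's ProviderSpec."""
    canonical_name: str
    aliases: tuple
    base_url_patterns: tuple = ()

# Two entries showing why ollama needs its own spec despite sharing
# openai's canonical provider: the URL patterns differ.
_SPECS = (
    ProviderSpec("openai", ("openai",), (r"api\.openai\.com",)),
    ProviderSpec("openai", ("ollama",),
                 (r"localhost:\d+", r"127\.0\.0\.1:\d+", r"ollama")),
)

def validate_provider_base_url(provider: str, backend_url: str,
                               specs=_SPECS) -> Optional[dict]:
    """Return mismatch details when backend_url matches none of the
    provider's patterns; None when compatible or provider unknown."""
    for spec in specs:
        if provider in spec.aliases:
            if not spec.base_url_patterns:
                return None  # spec declares no patterns: nothing to check
            if any(re.search(p, backend_url) for p in spec.base_url_patterns):
                return None
            return {
                "provider": provider,
                "backend_url": backend_url,
                "expected_patterns": list(spec.base_url_patterns),
            }
    return None
```

Looking up by alias (the original provider name) rather than `canonical_name` is what lets validation distinguish `ollama` from `openai`, per the design notes in section 1.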
**Benefits:**
- Single source of truth for provider configuration
- Easier to keep patterns in sync when adding/updating providers
- Factory can be tested independently of orchestrator
- Reduced code duplication
**Remaining considerations:**
- **Risk**: Provider changes base URL structure (e.g., API versioning)
- **Mitigation**: Validation is non-blocking; mismatches are logged but don't prevent operation
### Timeout Recommendations
Recommendations are based on MiniMax profiling and may not generalize:
- **Risk**: Faster providers (OpenAI GPT-4) may trigger unnecessary warnings
- **Mitigation**: Warnings are advisory only; users can ignore if they've profiled their setup
- **Future**: Consider provider-specific timeout recommendations
## 7. Related Documentation
- `docs/contracts/result-contract-v1alpha1.md` - Signal contract structure
- `docs/architecture/research-provenance.md` - Research degradation semantics
- `docs/migration/rollback-notes.md` - Backend migration status
- `orchestrator/contracts/error_taxonomy.py` - Complete reason code list