# TradingAgents research provenance, node guards, and profiling harness

Status: draft
Audience: orchestrator, TradingAgents graph, verification
Scope: document the Phase 1-4 provenance fields, Bull/Bear/Manager guard behavior, trace schema, and the smallest safe A/B workflow for verification

## Current implementation snapshot (2026-04)

Mainline now has four distinct but connected pieces in place:

1. `research provenance` fields are carried in `investment_debate_state`;
2. the same provenance is reused by:
   - `orchestrator/llm_runner.py`
   - `orchestrator/live_mode.py`
   - `tradingagents/graph/trading_graph.py` full-state logs;
3. `orchestrator/profile_stage_chain.py` emits node-level traces for offline analysis;
4. `orchestrator/profile_ab.py` compares two trace cohorts offline without changing the production execution path.

This document describes the **current mainline behavior**, not a future structured-memo design.

## 1. Why this document exists

Phase 1-4 convergence added three closely related behaviors:

1. research-stage provenance is carried inside `investment_debate_state` and surfaced into application-facing metadata;
2. Bull Researcher, Bear Researcher, and Research Manager are guarded so timeouts/exceptions degrade gracefully without changing the default full-debate path;
3. `orchestrator/profile_stage_chain.py` can be used as a minimal A/B harness to compare prompt/profile variants while preserving the production path.

The implementation is intentionally conservative:

- **no structured memo output** is introduced;
- **default behavior remains the full debate path** when no guard trips;
- **existing debate string fields stay authoritative** (`history`, `bull_history`, `bear_history`, `current_response`, `judge_decision`).

## 2. Provenance schema and ownership

### 2.1 Canonical provenance fields

The research provenance fields currently carried in `investment_debate_state` are:

| Field | Meaning | Primary source |
| --- | --- | --- |
| `research_status` | Research health/status. Current in-repo values are `full` and `degraded`; `failed` is tolerated in surfaced diagnostics. | `tradingagents/graph/propagation.py`, `tradingagents/graph/setup.py`, `tradingagents/agents/utils/agent_states.py` |
| `research_mode` | Research execution mode. Normal path is `debate`; degraded path is `degraded_synthesis`. | same |
| `timed_out_nodes` | Ordered list of guarded research nodes that hit timeout. | `tradingagents/graph/setup.py` |
| `degraded_reason` | Machine-readable reason string such as `bull_researcher_timeout`. | `tradingagents/graph/setup.py` |
| `covered_dimensions` | Which debate dimensions completed successfully so far (`bull`, `bear`, `manager`). | `tradingagents/graph/setup.py` |
| `manager_confidence` | Optional confidence marker for the research-manager layer. `1.0` on clean manager success, `0.5` when manager succeeds after prior degradation, `0.0` on manager fallback. | `tradingagents/graph/setup.py` |

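For orientation, the table above can be read as a typed dictionary. The following is a minimal sketch, not the repo's actual definition: the names `ResearchProvenance` and `default_provenance` are hypothetical, and the default values are the ones this document lists for the normal path.

```python
from typing import List, Optional, TypedDict


class ResearchProvenance(TypedDict):
    """Hypothetical typed view of the provenance fields in investment_debate_state."""

    research_status: str                 # "full" or "degraded" ("failed" tolerated in diagnostics)
    research_mode: str                   # "debate" or "degraded_synthesis"
    timed_out_nodes: List[str]           # guarded nodes that timed out, in order
    degraded_reason: Optional[str]       # e.g. "bull_researcher_timeout"
    covered_dimensions: List[str]        # subset of "bull", "bear", "manager"
    manager_confidence: Optional[float]  # 1.0, 0.5, 0.0, or None


def default_provenance() -> ResearchProvenance:
    """Return the default-path values (hypothetical helper)."""
    return ResearchProvenance(
        research_status="full",
        research_mode="debate",
        timed_out_nodes=[],
        degraded_reason=None,
        covered_dimensions=[],
        manager_confidence=None,
    )
```

A typed view like this is only a reading aid; the string fields of the debate state remain the authoritative output.
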
### 2.2 Initialization and propagation

- `tradingagents/graph/propagation.py` initializes the default path with:
  - `research_status = "full"`
  - `research_mode = "debate"`
  - `timed_out_nodes = []`
  - `degraded_reason = None`
  - `covered_dimensions = []`
  - `manager_confidence = None`
- `tradingagents/graph/setup.py::_apply_research_success()` extends `covered_dimensions` and preserves the default debate mode while the research status remains `full`.
- `tradingagents/graph/setup.py::_apply_research_fallback()` marks the state as degraded, records the reason, and updates only the existing debate fields instead of inventing a parallel memo structure.

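The success-path update described above could look roughly like the sketch below. It is an illustration under assumptions, not the code in `setup.py`: the function name `apply_research_success`, the de-duplication of `covered_dimensions`, and the place where `manager_confidence` is assigned are all hypothetical; only the resulting field values follow the schema in section 2.1.

```python
from typing import Any, Dict


def apply_research_success(debate_state: Dict[str, Any], dimension: str) -> Dict[str, Any]:
    """Hypothetical stand-in for the success-path provenance update."""
    updated = dict(debate_state)  # copy so the caller's state is not mutated

    covered = list(updated.get("covered_dimensions") or [])
    if dimension not in covered:  # assumption: dimensions are recorded once
        covered.append(dimension)
    updated["covered_dimensions"] = covered

    still_full = updated.get("research_status", "full") == "full"
    if still_full:
        # The default debate mode is preserved only while nothing has degraded.
        updated["research_mode"] = "debate"

    if dimension == "manager":
        # 1.0 on clean success, 0.5 when an earlier node already degraded the run.
        updated["manager_confidence"] = 1.0 if still_full else 0.5
    return updated
```
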
## 3. Guard behavior by node

`GraphSetup._guard_research_node()` wraps each research node in a single-worker thread pool and enforces `research_node_timeout_secs`.

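The general technique, running a node in a single-worker pool and waiting on its future with a timeout, can be sketched as follows. This is not the body of `_guard_research_node()`: the names `guard_node`, `fallback`, and `node_key` are hypothetical, and the `<node_key>_exception` reason format is an assumption (only the `_timeout` form appears in this document).

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Dict

State = Dict[str, Any]


def guard_node(
    node: Callable[[State], State],
    fallback: Callable[[State, str], State],
    node_key: str,
    timeout_secs: float,
) -> Callable[[State], State]:
    """Wrap `node` so a timeout or exception routes to `fallback` (hypothetical)."""

    def guarded(state: State) -> State:
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(node, state)
        try:
            return future.result(timeout=timeout_secs)
        except FutureTimeout:
            return fallback(state, f"{node_key}_timeout")
        except Exception:
            # Assumed reason format for non-timeout failures.
            return fallback(state, f"{node_key}_exception")
        finally:
            # Do not block on a hung worker; the thread itself cannot be killed.
            pool.shutdown(wait=False)

    return guarded
```

One consequence of this design is that a timed-out worker keeps running in the background until it returns on its own; the guard only stops the graph from waiting for it.
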
### 3.1 Bull / Bear researcher fallback

On timeout or exception for `Bull Researcher` or `Bear Researcher`:

- the corresponding node name is added to `timed_out_nodes` when the reason includes `timeout`;
- `research_status` becomes `degraded`;
- `research_mode` becomes `degraded_synthesis`;
- a plain-text degraded argument is appended to:
  - `history`
  - the node-specific history field (`bull_history` or `bear_history`)
  - `current_response`
- `count` is incremented so the debate routing still advances.

This keeps the **existing debate output shape** intact: downstream consumers continue reading the same string fields they already depend on.

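As a rough illustration of the update listed above, the sketch below applies a Bull/Bear fallback to a plain dict. The function name `apply_debater_fallback` and the wording of the degraded argument are placeholders, and setting `current_response` to the argument (rather than concatenating) is an assumption of this sketch.

```python
from typing import Any, Dict


def apply_debater_fallback(
    debate_state: Dict[str, Any], node_name: str, reason: str
) -> Dict[str, Any]:
    """Hypothetical sketch of the Bull/Bear fallback update."""
    side = "bull" if node_name == "Bull Researcher" else "bear"
    # Placeholder wording; the real degraded argument text is not specified here.
    argument = f"[{node_name} degraded: {reason}]"

    updated = dict(debate_state)
    timed_out = list(updated.get("timed_out_nodes") or [])
    if "timeout" in reason:
        timed_out.append(node_name)
    updated["timed_out_nodes"] = timed_out
    updated["research_status"] = "degraded"
    updated["research_mode"] = "degraded_synthesis"
    updated["degraded_reason"] = reason

    def append(existing: str) -> str:
        # Keep the field a plain string, exactly as downstream readers expect.
        return f"{existing}\n{argument}" if existing else argument

    history_key = f"{side}_history"
    updated["history"] = append(updated.get("history", ""))
    updated[history_key] = append(updated.get(history_key, ""))
    updated["current_response"] = argument
    updated["count"] = updated.get("count", 0) + 1  # routing still advances
    return updated
```
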
### 3.2 Research Manager fallback

On timeout or exception for `Research Manager`:

- provenance is marked degraded using the same schema;
- `manager_confidence` is forced to `0.0`;
- `judge_decision`, `current_response`, and returned `investment_plan` are set to a plain-text HOLD recommendation that explicitly calls out degraded research.

This is intentionally **string-first**, not schema-first, so the downstream plan/report path does not have to learn a new memo envelope.

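A string-first manager fallback might be shaped like the sketch below. The function name `apply_manager_fallback` and the HOLD wording are hypothetical; the point is only that every affected field stays a plain string while the provenance fields carry the structured signal.

```python
from typing import Any, Dict


def apply_manager_fallback(state: Dict[str, Any], reason: str) -> Dict[str, Any]:
    """Hypothetical sketch of the Research Manager fallback update."""
    debate = dict(state.get("investment_debate_state") or {})
    # Placeholder wording for the plain-text HOLD recommendation.
    hold_text = f"HOLD: research was degraded ({reason}); no new position is recommended."

    timed_out = list(debate.get("timed_out_nodes") or [])
    if "timeout" in reason:
        timed_out.append("Research Manager")

    debate.update(
        research_status="degraded",
        research_mode="degraded_synthesis",
        degraded_reason=reason,
        timed_out_nodes=timed_out,
        manager_confidence=0.0,
        judge_decision=hold_text,
        current_response=hold_text,
    )
    # The plan stays a plain string, not a new memo envelope.
    return {"investment_debate_state": debate, "investment_plan": hold_text}
```
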
## 4. Application-facing surfacing

### 4.1 LLM runner metadata

`orchestrator/llm_runner.py` extracts the provenance subset from `investment_debate_state` and stores it under:

- `metadata.research`
- `metadata.data_quality`
- `metadata.sample_quality`

The extraction path is now centralized through:

- `tradingagents/agents/utils/agent_states.py::extract_research_provenance()`

Current conventions:

- normal path: `data_quality.state = "ok"`, `sample_quality = "full_research"`;
- degraded path: `data_quality.state = "research_degraded"`, `sample_quality = "degraded_research"`.

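The conventions above amount to a small projection from the debate state into metadata. The sketch below shows one way to express it; `build_research_metadata` and `PROVENANCE_KEYS` are hypothetical names, this is not the signature of `extract_research_provenance()`, and treating a missing `research_status` as the normal path is an assumption.

```python
from typing import Any, Dict

# The six provenance fields from section 2.1.
PROVENANCE_KEYS = (
    "research_status",
    "research_mode",
    "timed_out_nodes",
    "degraded_reason",
    "covered_dimensions",
    "manager_confidence",
)


def build_research_metadata(debate_state: Dict[str, Any]) -> Dict[str, Any]:
    """Project provenance into runner-style metadata (hypothetical helper)."""
    research = {key: debate_state.get(key) for key in PROVENANCE_KEYS}
    # Assumption: anything other than "full" (or an absent status) counts as degraded.
    degraded = research["research_status"] not in (None, "full")
    return {
        "research": research,
        "data_quality": {"state": "research_degraded" if degraded else "ok"},
        "sample_quality": "degraded_research" if degraded else "full_research",
    }
```
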
### 4.2 Live-mode contract projection

`orchestrator/live_mode.py` forwards provenance under top-level `research` in live-mode payloads for both:

- `completed` / `degraded_success` results; and
- structured failures that carry research diagnostics in `source_diagnostics`.

This means consumers can inspect research degradation without parsing raw debate text.

### 4.3 Full-state log projection

`tradingagents/graph/trading_graph.py::_log_state()` now also persists the same provenance subset into:

- `results/<ticker>/TradingAgentsStrategy_logs/full_states_log_<trade_date>.json`

This keeps the post-run JSON logs aligned with the runner/live metadata instead of silently dropping the structured fields.

## 5. Profiling trace schema

`orchestrator/profile_stage_chain.py` is the current timing/provenance trace generator.
`orchestrator/profile_trace_utils.py` holds the shared summary helper used by the offline A/B comparison path.

### 5.1 Top-level payload

Successful runs currently write a JSON payload with:

- `status`
- `ticker`
- `date`
- `selected_analysts`
- `analysis_prompt_style`
- `node_timings`
- `phase_totals_seconds`
- `dump_path`
- `raw_events` (normally empty unless explicitly requested on failure)

Error payloads add:

- `run_id`
- `error`
- `exception_type`

### 5.2 `node_timings[]` entry schema

Each `node_timings[]` entry currently contains:

| Field | Meaning |
| --- | --- |
| `run_id` | Correlates all rows from one profiling run |
| `nodes` | Node names emitted by the LangGraph update |
| `phases` | Normalized application phase names (`analyst`, `research`, `trading`, `risk`, `portfolio`) |
| `llm_kinds` | Normalized LLM bucket labels (`quick`, `deep`) |
| `start_at` / `end_at` | Relative offsets from run start, in seconds |
| `elapsed_ms` | Duration since the previous event |
| `selected_analysts` | Analyst slice used for the run |
| `analysis_prompt_style` | Prompt profile used for the run |
| `research_status` | Provenance snapshot extracted from `investment_debate_state` |
| `degraded_reason` | Provenance reason snapshot |
| `history_len` | Current debate history length |
| `response_len` | Current response length |

This schema is intentionally **trace-oriented**, not a replacement for the application result contract.

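To show how these rows can be consumed, the sketch below aggregates `elapsed_ms` per phase and finds the first row whose provenance snapshot is degraded. Both helper names are hypothetical, and attributing an event's full duration to every phase it lists is an assumption; the harness's own `phase_totals_seconds` may be computed differently.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional


def phase_totals_seconds(node_timings: List[Dict[str, Any]]) -> Dict[str, float]:
    """Sum elapsed time per phase, in seconds (hypothetical helper)."""
    totals: Dict[str, float] = defaultdict(float)
    for entry in node_timings:
        for phase in entry.get("phases", []):
            # Assumption: a multi-phase event counts fully toward each phase.
            totals[phase] += entry.get("elapsed_ms", 0) / 1000.0
    return dict(totals)


def first_degraded_entry(node_timings: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the first row whose provenance snapshot is degraded, if any."""
    for entry in node_timings:
        if entry.get("research_status") == "degraded":
            return entry
    return None
```
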
## 6. Offline A/B comparison helper

`orchestrator/profile_ab.py` is the current offline comparison helper.

It consumes one or more trace JSON files from cohort `A` and cohort `B`, then reports:

- `median_total_elapsed_ms`
- `median_event_count`
- `median_phase_elapsed_ms`
- `degraded_run_count`
- `error_count`
- `trace_schema_versions`
- `source_files`
- recommendation tie-breaks across elapsed time, degradation count, and error count

This helper is intentionally offline-only: it does **not** re-run live providers or change the production runtime path.

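A few of the reported fields can be derived from already-loaded trace payloads as sketched below. This is not the logic in `profile_ab.py`: `summarize_cohort` is a hypothetical name, and detecting errors by the presence of an `error` field and degradation by any degraded `node_timings[]` row are both assumptions.

```python
from statistics import median
from typing import Any, Dict, List


def summarize_cohort(traces: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize one cohort of trace payloads (hypothetical helper)."""
    totals: List[float] = []
    counts: List[int] = []
    degraded_runs = 0
    errors = 0
    for trace in traces:
        timings = trace.get("node_timings", [])
        totals.append(sum(row.get("elapsed_ms", 0) for row in timings))
        counts.append(len(timings))
        if any(row.get("research_status") == "degraded" for row in timings):
            degraded_runs += 1
        if trace.get("error"):  # assumption: error payloads carry a non-empty `error`
            errors += 1
    return {
        # median() raises on empty input, so report None for an empty cohort.
        "median_total_elapsed_ms": median(totals) if totals else None,
        "median_event_count": median(counts) if counts else None,
        "degraded_run_count": degraded_runs,
        "error_count": errors,
    }
```

Because the inputs are plain dicts loaded from JSON files, a summary like this never touches live providers.
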
## 7. Minimal A/B harness guidance

Use `python -m orchestrator.profile_stage_chain` to generate traces, then `python -m orchestrator.profile_ab` to compare them.

### 7.1 Safe comparison knobs

Run the harness from the repo root as a module (`python -m orchestrator.profile_stage_chain`) so package imports resolve without extra path tweaking.

The smallest useful A/B comparisons are:

- `--analysis-prompt-style` (for example `compact` vs another supported style)
- `--selected-analysts` (for example a narrower analyst slice vs a broader slice)
- provider/model/timeout settings while keeping the graph semantics fixed

### 7.2 Recommended invariants

Keep these fixed when doing an A/B comparison:

- the same `--ticker`
- the same `--date`
- the same provider/model unless the provider/model itself is the experimental variable
- the same `--overall-timeout`
- `max_debate_rounds = 1` and `max_risk_discuss_rounds = 1` as currently baked into the harness

### 7.3 Example commands

```bash
# Cohort A: compact prompt style
python -m orchestrator.profile_stage_chain \
  --ticker AAPL \
  --date 2026-04-11 \
  --selected-analysts market \
  --analysis-prompt-style compact

# Cohort B: detailed prompt style
python -m orchestrator.profile_stage_chain \
  --ticker AAPL \
  --date 2026-04-11 \
  --selected-analysts market \
  --analysis-prompt-style detailed

# Offline comparison of the two trace cohorts
python -m orchestrator.profile_ab \
  --a orchestrator/profile_runs/compact \
  --b orchestrator/profile_runs/detailed \
  --label-a compact \
  --label-b detailed
```

Compare the generated JSON dumps by focusing on:

- `phase_totals_seconds`
- `node_timings[].elapsed_ms`
- provenance changes (`research_status`, `degraded_reason`)
- history/response growth (`history_len`, `response_len`)

## 8. Review guardrails

When modifying this area, keep these invariants intact unless a broader migration explicitly approves otherwise:

1. **Do not change the default path**: normal successful runs should still stay in `research_status = "full"` and `research_mode = "debate"`.
2. **Do not introduce structured memo output** for degraded research unless all downstream consumers are migrated together.
3. **Preserve debate output shape**: downstream readers still expect plain strings in `history`, `bull_history`, `bear_history`, `current_response`, `judge_decision`, and `investment_plan`.
4. **Keep provenance additive**: provenance fields should explain degraded behavior, not replace the existing textual debate artifacts.