TradingAgents research provenance, node guards, and profiling harness

Status: draft
Audience: orchestrator, TradingAgents graph, verification
Scope: document the Phase 1-4 provenance fields, Bull/Bear/Manager guard behavior, trace schema, and the smallest safe A/B workflow for verification

Current implementation snapshot (2026-04)

Mainline now has four distinct but connected pieces in place:

  1. research provenance fields are carried in investment_debate_state;
  2. the same provenance is reused by:
    • orchestrator/llm_runner.py
    • orchestrator/live_mode.py
    • tradingagents/graph/trading_graph.py full-state logs;
  3. orchestrator/profile_stage_chain.py emits node-level traces for offline analysis;
  4. orchestrator/profile_ab.py compares two trace cohorts offline without changing the production execution path.

This document describes the current mainline behavior, not a future structured-memo design.

1. Why this document exists

Phase 1-4 convergence added three closely related behaviors:

  1. research-stage provenance is carried inside investment_debate_state and surfaced into application-facing metadata;
  2. Bull Researcher, Bear Researcher, and Research Manager are guarded so timeouts/exceptions degrade gracefully without changing the default full-debate path;
  3. orchestrator/profile_stage_chain.py can be used as a minimal A/B harness to compare prompt/profile variants while preserving the production path.

The implementation is intentionally conservative:

  • no structured memo output is introduced;
  • default behavior remains the full debate path when no guard trips;
  • existing debate string fields stay authoritative (history, bull_history, bear_history, current_response, judge_decision).

2. Provenance schema and ownership

2.1 Canonical provenance fields

The research provenance fields currently carried in investment_debate_state are:

| Field | Meaning | Primary source |
|---|---|---|
| research_status | Research health/status. Current in-repo values are full and degraded; failed is tolerated in surfaced diagnostics. | tradingagents/graph/propagation.py, tradingagents/graph/setup.py, tradingagents/agents/utils/agent_states.py |
| research_mode | Research execution mode. Normal path is debate; degraded path is degraded_synthesis. | Same as research_status |
| timed_out_nodes | Ordered list of guarded research nodes that hit timeout. | tradingagents/graph/setup.py |
| degraded_reason | Machine-readable reason string such as bull_researcher_timeout. | tradingagents/graph/setup.py |
| covered_dimensions | Which debate dimensions completed successfully so far (bull, bear, manager). | tradingagents/graph/setup.py |
| manager_confidence | Optional confidence marker for the research-manager layer. 1.0 on clean manager success, 0.5 when manager succeeds after prior degradation, 0.0 on manager fallback. | tradingagents/graph/setup.py |
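
The three documented manager_confidence values reduce to a rule over two inputs. The sketch below is illustrative only: the function name and its two boolean parameters are hypothetical and are not taken from tradingagents/graph/setup.py.

```python
def manager_confidence(manager_succeeded: bool, previously_degraded: bool) -> float:
    """Hypothetical helper that mirrors the documented confidence values."""
    if not manager_succeeded:
        # Research Manager fell back (timeout or exception).
        return 0.0
    # Manager succeeded: full confidence only if nothing degraded earlier.
    return 0.5 if previously_degraded else 1.0
```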

2.2 Initialization and propagation

  • tradingagents/graph/propagation.py initializes the default path with:
    • research_status = "full"
    • research_mode = "debate"
    • timed_out_nodes = []
    • degraded_reason = None
    • covered_dimensions = []
    • manager_confidence = None
  • tradingagents/graph/setup.py::_apply_research_success() extends covered_dimensions and preserves the default debate mode while the research status remains full.
  • tradingagents/graph/setup.py::_apply_research_fallback() marks the state as degraded, records the reason, and updates only the existing debate fields instead of inventing a parallel memo structure.
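
The initialization and the two transitions above can be sketched as follows. This is a simplified illustration, not the code in propagation.py or setup.py: the function names, the copy-on-write style, and the de-duplication of list entries are assumptions.

```python
from copy import deepcopy
from typing import Any, Dict

# Default provenance values as listed for the normal debate path.
DEFAULT_PROVENANCE: Dict[str, Any] = {
    "research_status": "full",
    "research_mode": "debate",
    "timed_out_nodes": [],
    "degraded_reason": None,
    "covered_dimensions": [],
    "manager_confidence": None,
}


def new_provenance() -> Dict[str, Any]:
    """Return a fresh copy so runs never share the mutable list fields."""
    return deepcopy(DEFAULT_PROVENANCE)


def apply_success(state: Dict[str, Any], dimension: str) -> Dict[str, Any]:
    """Record a completed dimension; status and mode are left untouched."""
    updated = deepcopy(state)
    if dimension not in updated["covered_dimensions"]:  # de-dup is an assumption
        updated["covered_dimensions"].append(dimension)
    return updated


def apply_fallback(state: Dict[str, Any], node_name: str, reason: str) -> Dict[str, Any]:
    """Mark the state as degraded and record why."""
    updated = deepcopy(state)
    updated["research_status"] = "degraded"
    updated["research_mode"] = "degraded_synthesis"
    updated["degraded_reason"] = reason
    # Only timeout reasons are recorded in timed_out_nodes.
    if "timeout" in reason and node_name not in updated["timed_out_nodes"]:
        updated["timed_out_nodes"].append(node_name)
    return updated
```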

3. Guard behavior by node

GraphSetup._guard_research_node() wraps each research node in a single-worker thread pool and enforces research_node_timeout_secs.
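
A minimal sketch of this guard pattern, using only the standard library, is shown below. The function name guard_node, the on_fallback callback, and the reason strings are hypothetical; the real method's signature may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Dict

State = Dict[str, Any]


def guard_node(
    node_fn: Callable[[State], State],
    state: State,
    timeout_secs: float,
    on_fallback: Callable[[State, str], State],
) -> State:
    """Run node_fn in a single-worker pool and fall back on timeout or error."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(node_fn, state)
    try:
        return future.result(timeout=timeout_secs)
    except FutureTimeout:
        return on_fallback(state, "timeout")
    except Exception as exc:  # any node failure degrades instead of crashing
        return on_fallback(state, f"exception:{type(exc).__name__}")
    finally:
        # Do not block on a worker that is still running after a timeout.
        executor.shutdown(wait=False)
```

One consequence of this pattern is worth keeping in mind: Python cannot forcibly stop a running thread, so a node that times out keeps running in the background until it returns. The guard only stops waiting for it.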

3.1 Bull / Bear researcher fallback

On timeout or exception for Bull Researcher or Bear Researcher:

  • the corresponding node name is added to timed_out_nodes when the reason includes timeout;
  • research_status becomes degraded;
  • research_mode becomes degraded_synthesis;
  • a plain-text degraded argument is appended to:
    • history
    • the node-specific history field (bull_history or bear_history)
    • current_response
  • count is incremented so the debate routing still advances.

This keeps the existing debate output shape intact: downstream consumers continue reading the same string fields they already depend on.
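
The string-field side of this fallback might look like the sketch below. The function name, the wording of the degraded note, and the choice to replace current_response with the note are assumptions made for illustration.

```python
from typing import Any, Dict


def _append(text: str, note: str) -> str:
    """Append note on a new line, or start the field if it is empty."""
    return f"{text}\n{note}" if text else note


def degrade_researcher(debate: Dict[str, Any], side: str, reason: str) -> Dict[str, Any]:
    """Write a plain-text degraded argument for side ('bull' or 'bear')."""
    note = f"{side.title()} Researcher: [degraded] argument unavailable ({reason})."
    side_key = f"{side}_history"
    updated = dict(debate)  # shallow copy; only string/int fields change here
    updated["history"] = _append(debate.get("history", ""), note)
    updated[side_key] = _append(debate.get(side_key, ""), note)
    updated["current_response"] = note  # assumption: latest turn replaces the field
    # Advance the turn counter so debate routing still moves on.
    updated["count"] = debate.get("count", 0) + 1
    return updated
```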

3.2 Research Manager fallback

On timeout or exception for Research Manager:

  • provenance is marked degraded using the same schema;
  • manager_confidence is forced to 0.0;
  • judge_decision, current_response, and returned investment_plan are set to a plain-text HOLD recommendation that explicitly calls out degraded research.

This is intentionally string-first, not schema-first, so the downstream plan/report path does not have to learn a new memo envelope.
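
As a rough illustration of the string-first fallback, the sketch below returns the same plain-text HOLD message in all three places. The function name, the message wording, and the exact return shape are assumptions, not the code in setup.py.

```python
from typing import Any, Dict


def manager_fallback(debate: Dict[str, Any], reason: str) -> Dict[str, Any]:
    """Build a degraded Research Manager result using plain strings only."""
    plan = (
        "HOLD. Research was degraded "
        f"({reason}); no high-conviction recommendation is available."
    )
    updated = dict(debate)
    updated.update(
        judge_decision=plan,
        current_response=plan,
        research_status="degraded",
        research_mode="degraded_synthesis",
        degraded_reason=reason,
        manager_confidence=0.0,  # forced to 0.0 on manager fallback
    )
    # The plan is also returned at the top level for downstream nodes.
    return {"investment_debate_state": updated, "investment_plan": plan}
```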

4. Application-facing surfacing

4.1 LLM runner metadata

orchestrator/llm_runner.py extracts the provenance subset from investment_debate_state and stores it under:

  • metadata.research
  • metadata.data_quality
  • metadata.sample_quality

The extraction path is now centralized through:

  • tradingagents/agents/utils/agent_states.py::extract_research_provenance()

Current conventions:

  • normal path: data_quality.state = "ok", sample_quality = "full_research";
  • degraded path: data_quality.state = "research_degraded", sample_quality = "degraded_research".
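
The conventions above can be sketched as a small mapping. This is not the body of extract_research_provenance(); the function name build_research_metadata is hypothetical, and treating every non-full status as degraded is an assumption.

```python
from typing import Any, Dict

# Provenance keys documented for investment_debate_state.
PROVENANCE_KEYS = (
    "research_status",
    "research_mode",
    "timed_out_nodes",
    "degraded_reason",
    "covered_dimensions",
    "manager_confidence",
)


def build_research_metadata(debate_state: Dict[str, Any]) -> Dict[str, Any]:
    """Project the provenance subset into runner-style metadata."""
    research = {key: debate_state.get(key) for key in PROVENANCE_KEYS}
    # Assumption: anything other than "full" is surfaced as degraded.
    degraded = research["research_status"] != "full"
    return {
        "research": research,
        "data_quality": {"state": "research_degraded" if degraded else "ok"},
        "sample_quality": "degraded_research" if degraded else "full_research",
    }
```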

4.2 Live-mode contract projection

orchestrator/live_mode.py forwards provenance under top-level research in live-mode payloads for both:

  • completed / degraded_success results; and
  • structured failures that carry research diagnostics in source_diagnostics.

This means consumers can inspect research degradation without parsing raw debate text.

4.3 Full-state log projection

tradingagents/graph/trading_graph.py::_log_state() now also persists the same provenance subset into:

  • results/<ticker>/TradingAgentsStrategy_logs/full_states_log_<trade_date>.json

This keeps the post-run JSON logs aligned with the runner/live metadata instead of silently dropping the structured fields.

5. Profiling trace schema

orchestrator/profile_stage_chain.py is the current timing/provenance trace generator. orchestrator/profile_trace_utils.py holds the shared summary helper used by the offline A/B comparison path.

5.1 Top-level payload

Successful runs currently write a JSON payload with:

  • status
  • ticker
  • date
  • selected_analysts
  • analysis_prompt_style
  • node_timings
  • phase_totals_seconds
  • dump_path
  • raw_events (normally empty unless explicitly requested on failure)

Error payloads add:

  • run_id
  • error
  • exception_type

5.2 node_timings[] entry schema

Each node_timings[] entry currently contains:

| Field | Meaning |
|---|---|
| run_id | Correlates all rows from one profiling run |
| nodes | Node names emitted by the LangGraph update |
| phases | Normalized application phase names (analyst, research, trading, risk, portfolio) |
| llm_kinds | Normalized LLM bucket labels (quick, deep) |
| start_at / end_at | Relative offsets from run start, in seconds |
| elapsed_ms | Duration since the previous event |
| selected_analysts | Analyst slice used for the run |
| analysis_prompt_style | Prompt profile used for the run |
| research_status | Provenance snapshot extracted from investment_debate_state |
| degraded_reason | Provenance reason snapshot |
| history_len | Current debate history length |
| response_len | Current response length |

This schema is intentionally trace-oriented, not a replacement for the application result contract.
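
To show how these rows can be consumed, the sketch below aggregates elapsed_ms per phase and finds the first degraded reason in a run. Both function names are hypothetical, and splitting an event's time evenly across several phases is an assumption rather than the harness's own rule.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional


def phase_totals_ms(node_timings: List[Dict[str, Any]]) -> Dict[str, float]:
    """Sum elapsed_ms per phase across all node_timings entries."""
    totals: Dict[str, float] = defaultdict(float)
    for entry in node_timings:
        phases = entry.get("phases") or ["unknown"]
        # Assumption: an event tagged with several phases is split evenly.
        share = entry.get("elapsed_ms", 0) / len(phases)
        for phase in phases:
            totals[phase] += share
    return dict(totals)


def first_degraded_reason(node_timings: List[Dict[str, Any]]) -> Optional[str]:
    """Return the first degraded_reason seen in the trace, if any."""
    for entry in node_timings:
        if entry.get("research_status") == "degraded":
            return entry.get("degraded_reason")
    return None
```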

6. Offline A/B comparison helper

orchestrator/profile_ab.py is the current offline comparison helper.

It consumes one or more trace JSON files from cohort A and cohort B, then reports:

  • median_total_elapsed_ms
  • median_event_count
  • median_phase_elapsed_ms
  • degraded_run_count
  • error_count
  • trace_schema_versions
  • source_files
  • recommendation tie-breaks across elapsed time, degradation count, and error count

This helper is intentionally offline-only: it does not re-run live providers or change the production runtime path.
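
A few of the reported metrics can be approximated from already-loaded trace payloads as follows. This is a sketch under stated assumptions, not the logic of profile_ab.py: summarize_cohort is a hypothetical name, a run counts as an error when its payload carries an error key, and a run counts as degraded when any node_timings row reports research_status = "degraded".

```python
from statistics import median
from typing import Any, Dict, List


def summarize_cohort(traces: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize one cohort of trace payloads without re-running anything."""
    timings = [trace.get("node_timings", []) for trace in traces]
    totals = [sum(row.get("elapsed_ms", 0) for row in rows) for rows in timings]
    counts = [len(rows) for rows in timings]
    return {
        "median_total_elapsed_ms": median(totals) if totals else None,
        "median_event_count": median(counts) if counts else None,
        "degraded_run_count": sum(
            1
            for rows in timings
            if any(row.get("research_status") == "degraded" for row in rows)
        ),
        # Error payloads add an "error" key alongside run_id and exception_type.
        "error_count": sum(1 for trace in traces if "error" in trace),
    }
```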

7. Minimal A/B harness guidance

Use python -m orchestrator.profile_stage_chain to generate traces, then python -m orchestrator.profile_ab to compare them.

7.1 Safe comparison knobs

Run the harness from the repo root so that the -m module invocation resolves package imports without extra path tweaking.

The smallest useful A/B comparisons are:

  • --analysis-prompt-style (for example compact vs another supported style)
  • --selected-analysts (for example a narrower analyst slice vs a broader slice)
  • provider/model/timeout settings while keeping the graph semantics fixed

Keep these fixed when doing an A/B comparison:

  • the same --ticker
  • the same --date
  • the same provider/model unless the provider/model itself is the experimental variable
  • the same --overall-timeout
  • max_debate_rounds = 1 and max_risk_discuss_rounds = 1 as currently baked into the harness

7.2 Example commands

python -m orchestrator.profile_stage_chain \
  --ticker AAPL \
  --date 2026-04-11 \
  --selected-analysts market \
  --analysis-prompt-style compact

python -m orchestrator.profile_stage_chain \
  --ticker AAPL \
  --date 2026-04-11 \
  --selected-analysts market \
  --analysis-prompt-style detailed

python -m orchestrator.profile_ab \
  --a orchestrator/profile_runs/compact \
  --b orchestrator/profile_runs/detailed \
  --label-a compact \
  --label-b detailed

Compare the generated JSON dumps by focusing on:

  • phase_totals_seconds
  • node_timings[].elapsed_ms
  • provenance changes (research_status, degraded_reason)
  • history/response growth (history_len, response_len)
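
For a quick manual comparison of two dumps, a small script along these lines can help. It assumes phase_totals_seconds is a mapping from phase name to seconds, which the payload description does not state explicitly; load_dump and phase_deltas are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_dump(path: str) -> Dict[str, Any]:
    """Load one trace JSON dump from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def phase_deltas(dump_a: Dict[str, Any], dump_b: Dict[str, Any]) -> Dict[str, float]:
    """Return B minus A, in seconds, for every phase present in either dump."""
    totals_a = dump_a.get("phase_totals_seconds", {})
    totals_b = dump_b.get("phase_totals_seconds", {})
    phases = sorted(set(totals_a) | set(totals_b))
    return {
        phase: round(totals_b.get(phase, 0.0) - totals_a.get(phase, 0.0), 3)
        for phase in phases
    }
```

A positive delta means variant B spent longer in that phase than variant A; a phase missing from one dump is treated as zero seconds.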

8. Review guardrails

When modifying this area, keep these invariants intact unless a broader migration explicitly approves otherwise:

  1. Do not change the default path: normal successful runs should still stay in research_status = "full" and research_mode = "debate".
  2. Do not introduce structured memo output for degraded research unless all downstream consumers are migrated together.
  3. Preserve debate output shape: downstream readers still expect plain strings in history, bull_history, bear_history, current_response, judge_decision, and investment_plan.
  4. Keep provenance additive: provenance fields should explain degraded behavior, not replace the existing textual debate artifacts.