TradingAgents research provenance, node guards, and profiling harness

Status: draft
Audience: orchestrator, TradingAgents graph, verification
Scope: document the Phase 1-4 provenance fields, Bull/Bear/Manager guard behavior, trace schema, and the smallest safe A/B workflow for verification

Current implementation snapshot (2026-04)

Mainline now has four distinct but connected pieces in place:

research provenance fields are carried in investment_debate_state;
the same provenance is reused by:
- orchestrator/llm_runner.py
- orchestrator/live_mode.py
- tradingagents/graph/trading_graph.py full-state logs;
orchestrator/profile_stage_chain.py emits node-level traces for offline analysis;
orchestrator/profile_ab.py compares two trace cohorts offline without changing the production execution path.

This document describes the current mainline behavior, not a future structured-memo design.

1. Why this document exists

Phase 1-4 convergence added three closely related behaviors:

research-stage provenance is carried inside investment_debate_state and surfaced into application-facing metadata;
Bull Researcher, Bear Researcher, and Research Manager are guarded so timeouts/exceptions degrade gracefully without changing the default full-debate path;
orchestrator/profile_stage_chain.py can be used as a minimal A/B harness to compare prompt/profile variants while preserving the production path.

The implementation is intentionally conservative:

no structured memo output is introduced;
default behavior remains the full debate path when no guard trips;
existing debate string fields stay authoritative (history, bull_history, bear_history, current_response, judge_decision).

2. Provenance schema and ownership

2.1 Canonical provenance fields

The research provenance fields currently carried in investment_debate_state are:

Field	Meaning	Primary source
`research_status`	Research health/status. Current in-repo values are `full` and `degraded`; `failed` is tolerated in surfaced diagnostics.	`tradingagents/graph/propagation.py`, `tradingagents/graph/setup.py`, `tradingagents/agents/utils/agent_states.py`
`research_mode`	Research execution mode. Normal path is `debate`; degraded path is `degraded_synthesis`.	same files as `research_status`
`timed_out_nodes`	Ordered list of guarded research nodes that hit timeout.	`tradingagents/graph/setup.py`
`degraded_reason`	Machine-readable reason string such as `bull_researcher_timeout`.	`tradingagents/graph/setup.py`
`covered_dimensions`	Which debate dimensions completed successfully so far (`bull`, `bear`, `manager`).	`tradingagents/graph/setup.py`
`manager_confidence`	Optional confidence marker for the research-manager layer. `1.0` on clean manager success, `0.5` when manager succeeds after prior degradation, `0.0` on manager fallback.	`tradingagents/graph/setup.py`

2.2 Initialization and propagation

tradingagents/graph/propagation.py initializes the default path with:
- research_status = "full"
- research_mode = "debate"
- timed_out_nodes = []
- degraded_reason = None
- covered_dimensions = []
- manager_confidence = None
tradingagents/graph/setup.py::_apply_research_success() extends covered_dimensions and preserves the default debate mode while the research status remains full.
tradingagents/graph/setup.py::_apply_research_fallback() marks the state as degraded, records the reason, and updates only the existing debate fields instead of inventing a parallel memo structure.
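
The initialization and update rules above can be sketched as plain dictionary operations. This is a minimal illustration, not the code in `setup.py`: the helper names `apply_success` and `apply_fallback`, their signatures, and the use of display names such as "Bear Researcher" inside `timed_out_nodes` are assumptions made for the example.

```python
from copy import deepcopy
from typing import Any, Dict

# Defaults for the normal debate path, as listed in this section.
DEFAULT_PROVENANCE: Dict[str, Any] = {
    "research_status": "full",
    "research_mode": "debate",
    "timed_out_nodes": [],
    "degraded_reason": None,
    "covered_dimensions": [],
    "manager_confidence": None,
}


def apply_success(state: Dict[str, Any], dimension: str) -> Dict[str, Any]:
    """Hypothetical stand-in for _apply_research_success().

    Records a completed dimension and leaves status/mode untouched,
    so a clean run stays on research_status="full", research_mode="debate".
    """
    updated = deepcopy(state)
    if dimension not in updated["covered_dimensions"]:
        updated["covered_dimensions"].append(dimension)
    return updated


def apply_fallback(state: Dict[str, Any], node_name: str, reason: str) -> Dict[str, Any]:
    """Hypothetical stand-in for the provenance part of _apply_research_fallback()."""
    updated = deepcopy(state)
    updated["research_status"] = "degraded"
    updated["research_mode"] = "degraded_synthesis"
    updated["degraded_reason"] = reason
    # Only timeout-type reasons add the node to timed_out_nodes.
    if "timeout" in reason and node_name not in updated["timed_out_nodes"]:
        updated["timed_out_nodes"].append(node_name)
    return updated
```

Copying the state before mutating it keeps the module-level defaults from being shared between runs, which matters because two of the default values are lists.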

3. Guard behavior by node

GraphSetup._guard_research_node() wraps each research node in a single-worker thread pool and enforces research_node_timeout_secs.
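
A single-worker pool with a bounded wait can be sketched as follows. This is an illustration of the general pattern, assuming a node is a callable that takes and returns a state dict; `guard_node`, the `fallback` callback, and the reason strings are hypothetical names, not the signature of `_guard_research_node()`.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Dict

State = Dict[str, Any]


def guard_node(
    node: Callable[[State], State],
    fallback: Callable[[State, str], State],
    timeout_secs: float,
) -> Callable[[State], State]:
    """Wrap `node` so a timeout or exception routes to `fallback` instead of raising."""

    def guarded(state: State) -> State:
        # One worker per call: the node runs off the calling thread so the
        # caller can stop waiting after timeout_secs.
        executor = ThreadPoolExecutor(max_workers=1)
        future = executor.submit(node, state)
        try:
            return future.result(timeout=timeout_secs)
        except FutureTimeout:
            return fallback(state, "timeout")
        except Exception as exc:  # any error raised inside the node
            return fallback(state, f"exception:{type(exc).__name__}")
        finally:
            # wait=False returns immediately; a timed-out worker thread is not
            # killed and keeps running in the background until it finishes.
            executor.shutdown(wait=False)

    return guarded
```

Using `with ThreadPoolExecutor(...)` here would defeat the timeout, because leaving the `with` block waits for the worker to finish. The explicit `shutdown(wait=False)` avoids that, at the cost of leaving the abandoned call running.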

3.1 Bull / Bear researcher fallback

On timeout or exception for Bull Researcher or Bear Researcher:

the corresponding node name is added to timed_out_nodes when the reason includes timeout;
research_status becomes degraded;
research_mode becomes degraded_synthesis;
a plain-text degraded argument is appended to:
- history
- the node-specific history field (bull_history or bear_history)
- current_response
count is incremented so the debate routing still advances.

This keeps the existing debate output shape intact: downstream consumers continue reading the same string fields they already depend on.
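
The researcher fallback can be pictured as a pure update on the debate state. The sketch below follows the list above literally (the degraded text is appended to all three string fields); the function name `degrade_researcher`, the wording of the degraded argument, and the node display names are assumptions for illustration only.

```python
from typing import Any, Dict


def _append(existing: str, text: str) -> str:
    """Append `text` on a new line, or return it alone if nothing exists yet."""
    return f"{existing}\n{text}" if existing else text


def degrade_researcher(debate: Dict[str, Any], side: str, reason: str) -> Dict[str, Any]:
    """Hypothetical fallback for the 'bull' or 'bear' researcher node."""
    if side not in ("bull", "bear"):
        raise ValueError("side must be 'bull' or 'bear'")
    node_name = f"{side.capitalize()} Researcher"
    # Placeholder wording; the real degraded argument text is not specified here.
    argument = f"{node_name}: [degraded] no argument produced ({reason})."

    updated = dict(debate)
    history_key = f"{side}_history"
    # Same plain-string fields downstream consumers already read.
    for key in ("history", history_key, "current_response"):
        updated[key] = _append(debate.get(key, ""), argument)
    # Advance the counter so debate routing still moves on.
    updated["count"] = debate.get("count", 0) + 1

    updated["research_status"] = "degraded"
    updated["research_mode"] = "degraded_synthesis"
    updated["degraded_reason"] = reason
    timed_out = list(debate.get("timed_out_nodes", []))
    if "timeout" in reason:
        timed_out.append(node_name)
    updated["timed_out_nodes"] = timed_out
    return updated
```

The opposite side's history field is left untouched, so a Bear timeout does not alter what the Bull researcher already wrote.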

3.2 Research Manager fallback

On timeout or exception for Research Manager:

provenance is marked degraded using the same schema;
manager_confidence is forced to 0.0;
judge_decision, current_response, and returned investment_plan are set to a plain-text HOLD recommendation that explicitly calls out degraded research.

This is intentionally string-first, not schema-first, so the downstream plan/report path does not have to learn a new memo envelope.
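
A compact sketch of the manager rules, combining the `manager_confidence` values from section 2.1 with the HOLD fallback described above. The function names, the returned dict layout, and the HOLD wording are assumptions for the example rather than the actual node output.

```python
from typing import Any, Dict


def manager_confidence(manager_ok: bool, previously_degraded: bool) -> float:
    """1.0 on clean success, 0.5 on success after prior degradation, 0.0 on fallback."""
    if not manager_ok:
        return 0.0
    return 0.5 if previously_degraded else 1.0


def degrade_manager(debate: Dict[str, Any], reason: str) -> Dict[str, Any]:
    """Hypothetical Research Manager fallback: a plain-text HOLD, no memo envelope."""
    # Placeholder wording; it only needs to be a string that names the degradation.
    hold_text = f"HOLD. Research was degraded ({reason}); no manager synthesis is available."
    updated = dict(debate)
    updated.update(
        research_status="degraded",
        research_mode="degraded_synthesis",
        degraded_reason=reason,
        manager_confidence=manager_confidence(manager_ok=False, previously_degraded=True),
        judge_decision=hold_text,
        current_response=hold_text,
    )
    # The same string is returned as the plan so the plan/report path sees text as usual.
    return {"investment_debate_state": updated, "investment_plan": hold_text}
```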

4. Application-facing surfacing

4.1 LLM runner metadata

orchestrator/llm_runner.py extracts the provenance subset from investment_debate_state and stores it under:

metadata.research
metadata.data_quality
metadata.sample_quality

The extraction path is now centralized through:

tradingagents/agents/utils/agent_states.py::extract_research_provenance()

Current conventions:

normal path: data_quality.state = "ok", sample_quality = "full_research";
degraded path: data_quality.state = "research_degraded", sample_quality = "degraded_research".
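
The conventions above amount to a small mapping from provenance to metadata. The sketch below is illustrative: `extract_provenance` and `build_metadata` are hypothetical stand-ins (the real helper is `extract_research_provenance()`, whose exact signature is not shown here), and treating `failed` the same as `degraded` is an assumption.

```python
from typing import Any, Dict

PROVENANCE_KEYS = (
    "research_status",
    "research_mode",
    "timed_out_nodes",
    "degraded_reason",
    "covered_dimensions",
    "manager_confidence",
)


def extract_provenance(debate_state: Dict[str, Any]) -> Dict[str, Any]:
    """Pick only the provenance subset out of investment_debate_state."""
    return {key: debate_state.get(key) for key in PROVENANCE_KEYS}


def build_metadata(debate_state: Dict[str, Any]) -> Dict[str, Any]:
    """Project provenance into metadata.research / data_quality / sample_quality."""
    research = extract_provenance(debate_state)
    # Assumption: "failed" is surfaced the same way as "degraded".
    degraded = research["research_status"] in ("degraded", "failed")
    return {
        "research": research,
        "data_quality": {"state": "research_degraded" if degraded else "ok"},
        "sample_quality": "degraded_research" if degraded else "full_research",
    }
```

Debate text fields such as `history` are deliberately not copied into `metadata.research`, which keeps the metadata small and machine-readable.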

4.2 Live-mode contract projection

orchestrator/live_mode.py forwards provenance under top-level research in live-mode payloads for both:

completed / degraded_success results; and
structured failures that carry research diagnostics in source_diagnostics.

This means consumers can inspect research degradation without parsing raw debate text.

4.3 Full-state log projection

tradingagents/graph/trading_graph.py::_log_state() now also persists the same provenance subset into:

results/<ticker>/TradingAgentsStrategy_logs/full_states_log_<trade_date>.json

This keeps the post-run JSON logs aligned with the runner/live metadata instead of silently dropping the structured fields.

5. Profiling trace schema

orchestrator/profile_stage_chain.py is the current timing/provenance trace generator. orchestrator/profile_trace_utils.py holds the shared summary helper used by the offline A/B comparison path.

5.1 Top-level payload

Successful runs currently write a JSON payload with:

status
ticker
date
selected_analysts
analysis_prompt_style
node_timings
phase_totals_seconds
dump_path
raw_events (normally empty unless explicitly requested on failure)

Error payloads add:

run_id
error
exception_type

5.2 `node_timings[]` entry schema

Each node_timings[] entry currently contains:

Field	Meaning
`run_id`	Correlates all rows from one profiling run
`nodes`	Node names emitted by the LangGraph update
`phases`	Normalized application phase names (`analyst`, `research`, `trading`, `risk`, `portfolio`)
`llm_kinds`	Normalized LLM bucket labels (`quick`, `deep`)
`start_at` / `end_at`	Relative offsets from run start, in seconds
`elapsed_ms`	Duration since the previous event, in milliseconds
`selected_analysts`	Analyst slice used for the run
`analysis_prompt_style`	Prompt profile used for the run
`research_status`	Provenance snapshot extracted from `investment_debate_state`
`degraded_reason`	Provenance reason snapshot
`history_len`	Current debate history length
`response_len`	Current response length

This schema is intentionally trace-oriented, not a replacement for the application result contract.
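
As a reading aid, here is one way a consumer might fold `node_timings[]` rows into a per-run summary. It is a sketch, not the helper in `profile_trace_utils.py`: the function name `summarize_trace`, the output keys, and the choice to credit a row's full `elapsed_ms` to every phase it lists are assumptions.

```python
from collections import defaultdict
from typing import Any, Dict


def summarize_trace(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Fold node_timings rows into totals for one profiling run."""
    rows = payload.get("node_timings", [])
    phase_ms: Dict[str, float] = defaultdict(float)
    total_ms = 0.0
    degraded = False
    for row in rows:
        elapsed = float(row.get("elapsed_ms", 0))
        total_ms += elapsed
        # Assumption: a row listing several phases counts fully toward each.
        for phase in row.get("phases", []):
            phase_ms[phase] += elapsed
        # Any row with a degraded snapshot marks the whole run as degraded.
        if row.get("research_status") == "degraded":
            degraded = True
    return {
        "total_elapsed_ms": total_ms,
        "event_count": len(rows),
        "phase_elapsed_ms": dict(phase_ms),
        "degraded": degraded,
    }
```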

6. Offline A/B comparison helper

orchestrator/profile_ab.py is the current offline comparison helper.

It consumes one or more trace JSON files from cohort A and cohort B, then reports:

median_total_elapsed_ms
median_event_count
median_phase_elapsed_ms
degraded_run_count
error_count
trace_schema_versions
source_files
recommendation tie-breaks across elapsed time, degradation count, and error count

This helper is intentionally offline-only: it does not re-run live providers or change the production runtime path.
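
The cohort-level comparison can be sketched with medians over per-run summaries. This is not the logic in `profile_ab.py`: the input shape (one dict per run with `total_elapsed_ms`, `degraded`, and an optional `error` key), the function names, and in particular the tie-break order (errors, then degraded runs, then elapsed time) are assumptions chosen for the example.

```python
from statistics import median
from typing import Any, Dict, List, Optional, Tuple


def summarize_cohort(runs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate per-run summaries; runs carrying an `error` key count as errors."""
    ok_runs = [run for run in runs if not run.get("error")]
    elapsed: Optional[float] = (
        median(run["total_elapsed_ms"] for run in ok_runs) if ok_runs else None
    )
    return {
        "median_total_elapsed_ms": elapsed,
        "degraded_run_count": sum(1 for run in runs if run.get("degraded")),
        "error_count": len(runs) - len(ok_runs),
    }


def _rank(summary: Dict[str, Any]) -> Tuple[int, int, float]:
    """Lower is better. Order of the tuple is the assumed tie-break order."""
    elapsed = summary["median_total_elapsed_ms"]
    return (
        summary["error_count"],
        summary["degraded_run_count"],
        float("inf") if elapsed is None else float(elapsed),
    )


def recommend(a: Dict[str, Any], b: Dict[str, Any], label_a: str = "A", label_b: str = "B") -> str:
    """Return the label of the preferred cohort, or 'tie'."""
    rank_a, rank_b = _rank(a), _rank(b)
    if rank_a == rank_b:
        return "tie"
    return label_a if rank_a < rank_b else label_b
```

Ranking reliability ahead of speed means a faster variant that degrades or errors more often is not recommended on elapsed time alone; a different priority order would only require reordering the tuple in `_rank`.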

7. Minimal A/B harness guidance

Use python -m orchestrator.profile_stage_chain to generate traces, then python -m orchestrator.profile_ab to compare them.

7.1 Safe comparison knobs

Run the harness from the repo root as a module (python -m orchestrator.profile_stage_chain) so package imports resolve without extra path tweaking.

The smallest useful A/B comparisons are:

--analysis-prompt-style (for example compact vs another supported style)
--selected-analysts (for example a narrower analyst slice vs a broader slice)
provider/model/timeout settings while keeping the graph semantics fixed

7.2 Recommended invariants

Keep these fixed when doing an A/B comparison:

the same --ticker
the same --date
the same provider/model unless the provider/model itself is the experimental variable
the same --overall-timeout
max_debate_rounds = 1 and max_risk_discuss_rounds = 1 as currently baked into the harness
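
Before comparing two traces, a quick check that the invariants recorded in the payload actually match can catch mixed-up runs. The sketch below only covers `ticker` and `date`, since those are the invariants that appear as top-level payload fields in section 5.1; the helper name `invariant_mismatches` is hypothetical.

```python
from typing import Any, Dict, List, Sequence


def invariant_mismatches(
    trace_a: Dict[str, Any],
    trace_b: Dict[str, Any],
    keys: Sequence[str] = ("ticker", "date"),
) -> List[str]:
    """Return the payload keys whose values differ between the two traces."""
    return [key for key in keys if trace_a.get(key) != trace_b.get(key)]
```

An empty list means the two traces agree on the checked fields; provider, model, and timeout settings are not part of the documented payload and would still need to be verified from the command lines used.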

7.3 Example commands

python -m orchestrator.profile_stage_chain \
  --ticker AAPL \
  --date 2026-04-11 \
  --selected-analysts market \
  --analysis-prompt-style compact

python -m orchestrator.profile_stage_chain \
  --ticker AAPL \
  --date 2026-04-11 \
  --selected-analysts market \
  --analysis-prompt-style detailed

python -m orchestrator.profile_ab \
  --a orchestrator/profile_runs/compact \
  --b orchestrator/profile_runs/detailed \
  --label-a compact \
  --label-b detailed

Compare the generated JSON dumps by focusing on:

phase_totals_seconds
node_timings[].elapsed_ms
provenance changes (research_status, degraded_reason)
history/response growth (history_len, response_len)
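
For a quick manual look at two dumps, per-phase deltas are often enough. The sketch assumes `phase_totals_seconds` is a mapping from phase name to seconds, which this document does not spell out; `phase_deltas` is a hypothetical helper name.

```python
from typing import Any, Dict


def phase_deltas(dump_a: Dict[str, Any], dump_b: Dict[str, Any]) -> Dict[str, float]:
    """Seconds added (positive) or saved (negative) per phase when going from A to B."""
    totals_a = dump_a.get("phase_totals_seconds", {})
    totals_b = dump_b.get("phase_totals_seconds", {})
    # Phases missing from one dump are treated as taking 0 seconds there.
    return {
        phase: round(totals_b.get(phase, 0.0) - totals_a.get(phase, 0.0), 3)
        for phase in sorted(set(totals_a) | set(totals_b))
    }
```

A phase that appears in only one dump is worth a second look, since it may indicate that one run stopped early rather than that the phase became free.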

8. Review guardrails

When modifying this area, keep these invariants intact unless a broader migration explicitly approves otherwise:

Do not change the default path: normal successful runs should still stay in research_status = "full" and research_mode = "debate".
Do not introduce structured memo output for degraded research unless all downstream consumers are migrated together.
Preserve debate output shape: downstream readers still expect plain strings in history, bull_history, bear_history, current_response, judge_decision, and investment_plan.
Keep provenance additive: provenance fields should explain degraded behavior, not replace the existing textual debate artifacts.
