TradingAgents research provenance, node guards, and profiling harness
Status: draft
Audience: orchestrator, TradingAgents graph, verification
Scope: document the Phase 1-4 provenance fields, Bull/Bear/Manager guard behavior, trace schema, and the smallest safe A/B workflow for verification
Current implementation snapshot (2026-04)
Mainline now has four distinct but connected pieces in place:
- research provenance fields are carried in investment_debate_state;
- the same provenance is reused by:
  - orchestrator/llm_runner.py
  - orchestrator/live_mode.py
  - tradingagents/graph/trading_graph.py full-state logs;
- orchestrator/profile_stage_chain.py emits node-level traces for offline analysis;
- orchestrator/profile_ab.py compares two trace cohorts offline without changing the production execution path.
This document describes the current mainline behavior, not a future structured-memo design.
1. Why this document exists
Phase 1-4 convergence added three closely related behaviors:
- research-stage provenance is carried inside investment_debate_state and surfaced into application-facing metadata;
- Bull Researcher, Bear Researcher, and Research Manager are guarded so timeouts/exceptions degrade gracefully without changing the default full-debate path;
- orchestrator/profile_stage_chain.py can be used as a minimal A/B harness to compare prompt/profile variants while preserving the production path.
The implementation is intentionally conservative:
- no structured memo output is introduced;
- default behavior remains the full debate path when no guard trips;
- existing debate string fields stay authoritative (history, bull_history, bear_history, current_response, judge_decision).
2. Provenance schema and ownership
2.1 Canonical provenance fields
The research provenance fields currently carried in investment_debate_state are:
| Field | Meaning | Primary source |
|---|---|---|
| research_status | Research health/status. Current in-repo values are full and degraded; failed is tolerated in surfaced diagnostics. | tradingagents/graph/propagation.py, tradingagents/graph/setup.py, tradingagents/agents/utils/agent_states.py |
| research_mode | Research execution mode. Normal path is debate; degraded path is degraded_synthesis. | same |
| timed_out_nodes | Ordered list of guarded research nodes that hit timeout. | tradingagents/graph/setup.py |
| degraded_reason | Machine-readable reason string such as bull_researcher_timeout. | tradingagents/graph/setup.py |
| covered_dimensions | Which debate dimensions completed successfully so far (bull, bear, manager). | tradingagents/graph/setup.py |
| manager_confidence | Optional confidence marker for the research-manager layer. 1.0 on clean manager success, 0.5 when manager succeeds after prior degradation, 0.0 on manager fallback. | tradingagents/graph/setup.py |
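Read as a data structure, the table describes a small typed record. The following is a minimal sketch under stated assumptions: the ResearchProvenance class name and the manager_confidence_for helper are hypothetical and are not taken from the repository; only the field names and the three documented confidence values come from the table.

```python
from typing import List, Optional, TypedDict


class ResearchProvenance(TypedDict):
    """Hypothetical typed view of the provenance fields listed in the table."""

    research_status: str  # "full" or "degraded" ("failed" is tolerated in diagnostics)
    research_mode: str  # "debate" or "degraded_synthesis"
    timed_out_nodes: List[str]
    degraded_reason: Optional[str]
    covered_dimensions: List[str]
    manager_confidence: Optional[float]


def manager_confidence_for(manager_ok: bool, prior_degradation: bool) -> float:
    """Illustrative mapping onto the three documented confidence values."""
    if not manager_ok:
        return 0.0  # manager fallback
    return 0.5 if prior_degradation else 1.0
```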
2.2 Initialization and propagation
- tradingagents/graph/propagation.py initializes the default path with:
  - research_status = "full"
  - research_mode = "debate"
  - timed_out_nodes = []
  - degraded_reason = None
  - covered_dimensions = []
  - manager_confidence = None
- tradingagents/graph/setup.py::_apply_research_success() extends covered_dimensions and preserves the default debate mode while the research status remains full.
- tradingagents/graph/setup.py::_apply_research_fallback() marks the state as degraded, records the reason, and updates only the existing debate fields instead of inventing a parallel memo structure.
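The initialization and success steps can be sketched as follows. This is an illustration only: the default values are the ones listed above, but initial_provenance and apply_success are hypothetical names, and skipping duplicate dimensions is an assumption rather than confirmed behavior of _apply_research_success().

```python
from copy import deepcopy

# Defaults as documented for propagation.py; the helper names are hypothetical.
DEFAULT_PROVENANCE = {
    "research_status": "full",
    "research_mode": "debate",
    "timed_out_nodes": [],
    "degraded_reason": None,
    "covered_dimensions": [],
    "manager_confidence": None,
}


def initial_provenance() -> dict:
    """Return a fresh copy so list fields are never shared between runs."""
    return deepcopy(DEFAULT_PROVENANCE)


def apply_success(state: dict, dimension: str) -> dict:
    """Record a completed dimension without touching status or mode."""
    updated = dict(state)
    covered = list(state["covered_dimensions"])
    if dimension not in covered:  # assumption: duplicates are skipped
        covered.append(dimension)
    updated["covered_dimensions"] = covered
    return updated


state = apply_success(initial_provenance(), "bull")
state = apply_success(state, "bear")
```

After the two calls, covered_dimensions holds bull and bear while research_status and research_mode keep their defaults, which is the "default path stays untouched" property this section describes.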
3. Guard behavior by node
GraphSetup._guard_research_node() wraps each research node in a single-worker thread pool and enforces research_node_timeout_secs.
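The guard pattern, a single-worker thread pool with a bounded wait on the result, can be sketched with the standard library. This is not the body of _guard_research_node(): guard_node, fallback_fn, and the _exception reason suffix are assumptions made for illustration, and timeout_secs stands in for research_node_timeout_secs.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def guard_node(node_fn, fallback_fn, timeout_secs, node_name):
    """Wrap node_fn so timeouts and exceptions route to fallback_fn(state, reason)."""

    def guarded(state):
        executor = ThreadPoolExecutor(max_workers=1)
        future = executor.submit(node_fn, state)
        try:
            return future.result(timeout=timeout_secs)
        except FutureTimeout:
            return fallback_fn(state, f"{node_name}_timeout")
        except Exception:
            # The "_exception" suffix is an assumption, not a documented reason string.
            return fallback_fn(state, f"{node_name}_exception")
        finally:
            # Do not block on a worker that is still running after a timeout.
            executor.shutdown(wait=False)

    return guarded


def slow_node(state):
    time.sleep(0.3)  # simulates a node that exceeds its budget
    return {"ok": True}


def fallback(state, reason):
    return {"degraded_reason": reason}


guarded = guard_node(slow_node, fallback, timeout_secs=0.05, node_name="bull_researcher")
result = guarded({})
```

One design consequence worth noting: Python cannot cancel a thread that is already running, so a timed-out node keeps executing in the background until it returns. The guard only stops waiting for it.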
3.1 Bull / Bear researcher fallback
On timeout or exception for Bull Researcher or Bear Researcher:
- the corresponding node name is added to timed_out_nodes when the reason includes timeout;
- research_status becomes degraded;
- research_mode becomes degraded_synthesis;
- a plain-text degraded argument is appended to:
  - history
  - the node-specific history field (bull_history or bear_history)
  - current_response
- count is incremented so the debate routing still advances.
This keeps the existing debate output shape intact: downstream consumers continue reading the same string fields they already depend on.
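The fallback steps above can be sketched as a pure function over the debate state. The function name, the wording of the degraded message, the node label written to timed_out_nodes, and the choice to overwrite current_response with the degraded message are all assumptions for illustration.

```python
def apply_researcher_fallback(debate_state: dict, node: str, reason: str) -> dict:
    """Hypothetical sketch of the Bull/Bear fallback; node is "bull" or "bear"."""
    message = f"[degraded] {node} researcher unavailable ({reason})."
    history_key = f"{node}_history"
    updated = dict(debate_state)

    timed_out = list(debate_state.get("timed_out_nodes", []))
    if "timeout" in reason:
        timed_out.append(f"{node}_researcher")  # illustrative node label
    updated["timed_out_nodes"] = timed_out

    updated["research_status"] = "degraded"
    updated["research_mode"] = "degraded_synthesis"
    updated["degraded_reason"] = reason

    # Append to the existing string fields instead of adding new structures.
    updated["history"] = (debate_state.get("history", "") + "\n" + message).strip()
    updated[history_key] = (debate_state.get(history_key, "") + "\n" + message).strip()
    updated["current_response"] = message
    # Advance the counter so debate routing moves on to the next speaker.
    updated["count"] = debate_state.get("count", 0) + 1
    return updated


before = {
    "history": "Bull: buy",
    "bull_history": "Bull: buy",
    "bear_history": "",
    "current_response": "Bull: buy",
    "count": 1,
    "timed_out_nodes": [],
}
after = apply_researcher_fallback(before, "bear", "bear_researcher_timeout")
```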
3.2 Research Manager fallback
On timeout or exception for Research Manager:
- provenance is marked degraded using the same schema;
- manager_confidence is forced to 0.0;
- judge_decision, current_response, and returned investment_plan are set to a plain-text HOLD recommendation that explicitly calls out degraded research.
This is intentionally string-first, not schema-first, so the downstream plan/report path does not have to learn a new memo envelope.
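A minimal sketch of the string-first manager fallback follows. The function name, the HOLD wording, and the research_manager label are hypothetical; the sketch only shows which fields receive the same plain string.

```python
def apply_manager_fallback(state: dict, reason: str) -> dict:
    """Hypothetical sketch: degrade to a plain-text HOLD when the manager fails."""
    hold_text = (
        "HOLD. Research was degraded "
        f"({reason}); no directional recommendation is made."
    )
    debate = dict(state.get("investment_debate_state", {}))
    timed_out = list(debate.get("timed_out_nodes", []))
    if "timeout" in reason:
        timed_out.append("research_manager")  # illustrative node label
    debate.update(
        research_status="degraded",
        research_mode="degraded_synthesis",
        degraded_reason=reason,
        timed_out_nodes=timed_out,
        manager_confidence=0.0,
        judge_decision=hold_text,
        current_response=hold_text,
    )
    # Only plain strings and the existing debate dict are returned.
    return {"investment_debate_state": debate, "investment_plan": hold_text}


out = apply_manager_fallback(
    {"investment_debate_state": {"history": "Bull: buy\nBear: sell"}},
    "research_manager_timeout",
)
```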
4. Application-facing surfacing
4.1 LLM runner metadata
orchestrator/llm_runner.py extracts the provenance subset from investment_debate_state and stores it under:
- metadata.research
- metadata.data_quality
- metadata.sample_quality
The extraction path is now centralized through:
tradingagents/agents/utils/agent_states.py::extract_research_provenance()
Current conventions:
- normal path: data_quality.state = "ok", sample_quality = "full_research";
- degraded path: data_quality.state = "research_degraded", sample_quality = "degraded_research".
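The mapping from provenance to these metadata conventions might look like the sketch below. build_research_metadata is a hypothetical helper, not the signature of extract_research_provenance(); treating a missing status as the normal path and any non-full status as degraded are both assumptions.

```python
PROVENANCE_KEYS = (
    "research_status",
    "research_mode",
    "timed_out_nodes",
    "degraded_reason",
    "covered_dimensions",
    "manager_confidence",
)


def build_research_metadata(debate_state: dict) -> dict:
    """Hypothetical projection of the provenance subset into runner metadata."""
    research = {key: debate_state.get(key) for key in PROVENANCE_KEYS}
    # Assumption: a missing status means the normal path; anything else non-full is degraded.
    degraded = research["research_status"] not in (None, "full")
    return {
        "research": research,
        "data_quality": {"state": "research_degraded" if degraded else "ok"},
        "sample_quality": "degraded_research" if degraded else "full_research",
    }


normal = build_research_metadata({"research_status": "full", "research_mode": "debate"})
degraded = build_research_metadata(
    {"research_status": "degraded", "degraded_reason": "bull_researcher_timeout"}
)
```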
4.2 Live-mode contract projection
orchestrator/live_mode.py forwards provenance under top-level research in live-mode payloads for both:
- completed / degraded_success results; and
- structured failures that carry research diagnostics in source_diagnostics.
This means consumers can inspect research degradation without parsing raw debate text.
4.3 Full-state log projection
tradingagents/graph/trading_graph.py::_log_state() now also persists the same provenance subset into:
results/<ticker>/TradingAgentsStrategy_logs/full_states_log_<trade_date>.json
This keeps the post-run JSON logs aligned with the runner/live metadata instead of silently dropping the structured fields.
5. Profiling trace schema
orchestrator/profile_stage_chain.py is the current timing/provenance trace generator.
orchestrator/profile_trace_utils.py holds the shared summary helper used by the offline A/B comparison path.
5.1 Top-level payload
Successful runs currently write a JSON payload with:
- status
- ticker
- date
- selected_analysts
- analysis_prompt_style
- node_timings
- phase_totals_seconds
- dump_path
- raw_events (normally empty unless explicitly requested on failure)
Error payloads add:
- run_id
- error
- exception_type
5.2 node_timings[] entry schema
Each node_timings[] entry currently contains:
| Field | Meaning |
|---|---|
| run_id | Correlates all rows from one profiling run |
| nodes | Node names emitted by the LangGraph update |
| phases | Normalized application phase names (analyst, research, trading, risk, portfolio) |
| llm_kinds | Normalized LLM bucket labels (quick, deep) |
| start_at / end_at | Relative offsets from run start, in seconds |
| elapsed_ms | Duration since the previous event |
| selected_analysts | Analyst slice used for the run |
| analysis_prompt_style | Prompt profile used for the run |
| research_status | Provenance snapshot extracted from investment_debate_state |
| degraded_reason | Provenance reason snapshot |
| history_len | Current debate history length |
| response_len | Current response length |
This schema is intentionally trace-oriented, not a replacement for the application result contract.
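As an illustration of how a consumer might read this schema, the sketch below sums elapsed_ms per phase and picks out rows with a non-full provenance snapshot. The helper names are hypothetical and the sample payload is hand-written for the example, not real measurement data.

```python
from collections import defaultdict


def phase_elapsed_ms(payload: dict) -> dict:
    """Sum elapsed_ms per phase; a row with several phases counts toward each."""
    totals = defaultdict(float)
    for row in payload.get("node_timings", []):
        for phase in row.get("phases", []):
            totals[phase] += row.get("elapsed_ms", 0)
    return dict(totals)


def degraded_rows(payload: dict) -> list:
    """Return rows whose provenance snapshot is not the normal full status."""
    return [
        row
        for row in payload.get("node_timings", [])
        if row.get("research_status") not in (None, "full")
    ]


# Hand-written sample rows using the documented field names.
sample = {
    "node_timings": [
        {"phases": ["analyst"], "elapsed_ms": 1200, "research_status": "full"},
        {
            "phases": ["research"],
            "elapsed_ms": 800,
            "research_status": "degraded",
            "degraded_reason": "bull_researcher_timeout",
        },
        {"phases": ["research"], "elapsed_ms": 500, "research_status": "degraded"},
    ]
}
per_phase = phase_elapsed_ms(sample)
flagged = degraded_rows(sample)
```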
6. Offline A/B comparison helper
orchestrator/profile_ab.py is the current offline comparison helper.
It consumes one or more trace JSON files from cohort A and cohort B, then reports:
- median_total_elapsed_ms
- median_event_count
- median_phase_elapsed_ms
- degraded_run_count
- error_count
- trace_schema_versions
- source_files
- recommendation tie-breaks across elapsed time, degradation count, and error count
This helper is intentionally offline-only: it does not re-run live providers or change the production runtime path.
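The kind of cohort summary listed above can be approximated offline with the standard library. This sketch is not the profile_ab implementation: summarize_cohort is a hypothetical name, it covers only four of the reported fields, and it assumes an error run is recognizable by the presence of the error key that error payloads add.

```python
from statistics import median


def summarize_cohort(traces: list) -> dict:
    """Hypothetical cohort summary over already-loaded trace payloads."""
    rows_per_trace = [trace.get("node_timings", []) for trace in traces]
    totals = [sum(row.get("elapsed_ms", 0) for row in rows) for rows in rows_per_trace]
    counts = [len(rows) for rows in rows_per_trace]
    return {
        "median_total_elapsed_ms": median(totals) if totals else None,
        "median_event_count": median(counts) if counts else None,
        # A run counts as degraded if any row carries a degraded snapshot.
        "degraded_run_count": sum(
            any(row.get("research_status") == "degraded" for row in rows)
            for rows in rows_per_trace
        ),
        # Assumption: error payloads are identified by their "error" key.
        "error_count": sum("error" in trace for trace in traces),
    }


# Hand-written sample cohort, not real measurements.
cohort_a = [
    {"node_timings": [
        {"elapsed_ms": 100, "research_status": "full"},
        {"elapsed_ms": 200, "research_status": "full"},
    ]},
    {"node_timings": [{"elapsed_ms": 500, "research_status": "degraded"}]},
    {"error": "boom", "exception_type": "RuntimeError", "node_timings": []},
]
summary_a = summarize_cohort(cohort_a)
```

Running the same function over cohort B and comparing the two dictionaries field by field gives a rough manual version of the comparison, without touching live providers.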
7. Minimal A/B harness guidance
Use python -m orchestrator.profile_stage_chain to generate traces, then python -m orchestrator.profile_ab to compare them.
7.1 Safe comparison knobs
Run the harness from the repo root as a module (python -m orchestrator.profile_stage_chain) so package imports resolve without extra path tweaking.
The smallest useful A/B comparisons are:
- --analysis-prompt-style (for example compact vs another supported style)
- --selected-analysts (for example a narrower analyst slice vs a broader slice)
- provider/model/timeout settings while keeping the graph semantics fixed
7.2 Recommended invariants
Keep these fixed when doing an A/B comparison:
- the same --ticker
- the same --date
- the same provider/model unless the provider/model itself is the experimental variable
- the same --overall-timeout
- max_debate_rounds = 1 and max_risk_discuss_rounds = 1 as currently baked into the harness
7.3 Example commands
python -m orchestrator.profile_stage_chain \
--ticker AAPL \
--date 2026-04-11 \
--selected-analysts market \
--analysis-prompt-style compact
python -m orchestrator.profile_stage_chain \
--ticker AAPL \
--date 2026-04-11 \
--selected-analysts market \
--analysis-prompt-style detailed
python -m orchestrator.profile_ab \
--a orchestrator/profile_runs/compact \
--b orchestrator/profile_runs/detailed \
--label-a compact \
--label-b detailed
Compare the generated JSON dumps by focusing on:
- phase_totals_seconds
- node_timings[].elapsed_ms
- provenance changes (research_status, degraded_reason)
- history/response growth (history_len, response_len)
8. Review guardrails
When modifying this area, keep these invariants intact unless a broader migration explicitly approves otherwise:
- Do not change the default path: normal successful runs should still stay in research_status = "full" and research_mode = "debate".
- Do not introduce structured memo output for degraded research unless all downstream consumers are migrated together.
- Preserve debate output shape: downstream readers still expect plain strings in history, bull_history, bear_history, current_response, judge_decision, and investment_plan.
- Keep provenance additive: provenance fields should explain degraded behavior, not replace the existing textual debate artifacts.