docs: add TradingAgents RAG failure checklist

This commit is contained in:
CadeYu 2026-03-21 13:10:48 +08:00
parent f362a160c3
commit 9b3b8e1a84
2 changed files with 87 additions and 0 deletions


@ -203,6 +203,8 @@ See `tradingagents/default_config.py` for all configuration options.
We welcome contributions from the community! Whether it's fixing a bug, improving documentation, or suggesting a new feature, your input helps make this project better. If you are interested in this line of research, please consider joining our open-source financial AI research community [Tauric Research](https://tauric.ai/).
For debugging complex agent or RAG-style failures, see the [TradingAgents RAG failure checklist](docs/troubleshooting/rag-failure-checklist.md).
## Citation
Please reference our work if you find *TradingAgents* provides you with some help :)


@ -0,0 +1,85 @@
# TradingAgents RAG Failure Checklist
This guide adapts the [WFGY ProblemMap](https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md) to TradingAgents so contributors can debug multi-agent, retrieval-heavy trading runs in a repeatable way.
## When To Use This Checklist
Use this page when a run looks plausible on the surface but still produces a bad explanation, a bad trade, or inconsistent agent outputs.
Common symptoms:
- Analysts disagree in ways that never get resolved cleanly
- News, sentiment, and fundamentals drift toward different companies or markets
- The final BUY/HOLD/SELL decision is confident but weakly supported
- A backtest result is hard to explain after the fact
- A single bad tool output pollutes the whole downstream chain
## Quick Triage
1. Confirm the exact instrument first. Use the full ticker, including exchange suffixes such as `CNC.TO` or `0700.HK`, before blaming the reasoning layer.
2. Capture the raw inputs. Save the analyst reports, debate histories, and final decision from the same run.
3. Check tool outputs before model outputs. Verify that price, news, and fundamentals all refer to the same instrument and date window.
4. Identify the earliest wrong step. If the market analyst is already off-track, later debate stages are usually downstream fallout.
5. Re-run with a narrower scope if needed. Fewer analysts or fewer debate rounds often make the first failure easier to see.
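Steps 2 and 4 above can be automated with a small helper. This is a minimal sketch, not part of the TradingAgents codebase: it assumes you have collected each stage's saved report text into an ordered mapping (analysts first, debate stages after), and it simply finds the earliest stage that never mentions the exchange-qualified ticker.

```python
def first_stage_missing_ticker(reports, expected_ticker):
    """Walk stages in pipeline order and return the name of the first
    stage whose report never mentions the exchange-qualified ticker,
    or None if every stage does.

    `reports` is an ordered dict of stage name -> report text,
    in pipeline order (analysts before debate stages).
    """
    for stage, text in reports.items():
        if expected_ticker not in text:
            return stage
    return None
```

A stage flagged here is only a starting point for inspection, since a report can mention the right ticker and still reason about the wrong company; but in practice the earliest stage that drops the suffix is usually where contamination begins.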
## 16 Failure Patterns Mapped To TradingAgents
| Problem | TradingAgents symptom | What to inspect first |
| --- | --- | --- |
| 1. Query mismatch | News or fundamentals are about the wrong company | Exact ticker, exchange suffix, tool call arguments |
| 2. Retrieval miss | Reports ignore obvious catalyst news or filings | News date range, provider limits, empty tool responses |
| 3. Long-chain drift | Early analyst mistake compounds into a confident final decision | First analyst report that introduced the wrong premise |
| 4. Context fragmentation | Analysts talk past each other instead of building on shared facts | Per-agent prompts, handoff fields, debate history |
| 5. Over-retrieval | Reports become noisy and contradictory | Too many news items, redundant indicators, low-signal evidence |
| 6. Under-retrieval | Decision looks thin and unsupported | Missing fundamentals, missing macro context, no opposing evidence |
| 7. Memory break | Later stages ignore earlier conclusions or lessons | Stored memories, reflection output, state propagation |
| 8. Black-box debugging | You cannot explain why BUY/HOLD/SELL happened | Saved reports, debate transcripts, final rationale |
| 9. Evidence dilution | Strong evidence is buried under generic commentary | Report structure, ranking of key points, summary quality |
| 10. Tool misuse | Agents call the wrong tool or pass malformed parameters | Tool descriptions, exact parameter names, raw tool calls |
| 11. Time-window mismatch | Analyst conclusions use different dates or stale data | `trade_date`, look-back windows, cached data files |
| 12. Market mismatch | Symbol collisions across exchanges contaminate the analysis | Exchange-qualified tickers, vendor-specific symbol rules |
| 13. Multi-agent chaos | Debate rounds add noise instead of sharpening a decision | Debate history, max rounds, repeated unsupported arguments |
| 14. Confidence inflation | Model sounds certain even when evidence is weak | Counterarguments, missing caveats, empty or erroring tools |
| 15. Evaluation leakage | Backtest looks good but explanation quality is poor | Whether the same evidence is being reused incorrectly |
| 16. Missing closure | Final report does not prove why the decision is actionable | Final rationale, risk arguments, explicit trade conditions |
## Investigation Workflow
### 1. Verify Instrument Identity
- Confirm the exact ticker used in the CLI or Python entrypoint.
- Prefer exchange-qualified symbols when a root ticker exists on multiple markets.
- Check that the same symbol appears in every analyst report and tool call.
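One way to enforce the second bullet is to reject ambiguous root tickers at the entrypoint. The ambiguity list and function below are illustrative assumptions, not an existing TradingAgents API:

```python
# Hypothetical ambiguity list for illustration; a real one would come
# from your symbol vendor's reference data.
AMBIGUOUS_ROOTS = {"CNC", "0700", "RY"}

def require_qualified(ticker):
    """Raise if a known-ambiguous root ticker lacks an exchange suffix."""
    root, _, suffix = ticker.partition(".")
    if root in AMBIGUOUS_ROOTS and not suffix:
        raise ValueError(
            f"Ticker {ticker!r} exists on multiple exchanges; "
            f"use an exchange-qualified form such as {root}.TO or {root}.HK"
        )
    return ticker
```

Failing loudly here is cheaper than discovering mid-run that the news analyst pulled headlines for a different exchange's listing.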
### 2. Verify Data Inputs
- Inspect raw stock data, fundamentals, and news outputs separately.
- Confirm the requested date range and cache contents match the run you are debugging.
- Treat empty tool responses as first-class failures, not minor warnings.
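The last bullet can be enforced with a guard wrapped around each tool call. A minimal sketch (the function name and error message are assumptions, not project code):

```python
def check_tool_output(name, payload):
    """Fail fast on empty tool responses instead of letting them
    propagate into analyst prompts as silent blanks."""
    if payload is None or (hasattr(payload, "__len__") and len(payload) == 0):
        raise RuntimeError(f"Tool {name!r} returned an empty response")
    return payload
```

An empty news list or blank fundamentals string that slips through silently is a classic source of pattern 6 (under-retrieval) and pattern 14 (confidence inflation), because the model will happily fill the gap with plausible-sounding prose.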
### 3. Inspect Agent Handoffs
- Compare analyst reports with the research debate prompt.
- Check whether the trader is acting on the latest investment plan or a stale premise.
- Check whether the risk debate is responding to the trader's real plan or a generic summary.
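A crude but useful staleness check for these handoffs is to measure how much of the upstream plan actually appears in the downstream prompt. This is a heuristic sketch under the assumption that faithful handoffs quote the plan largely verbatim; the threshold is arbitrary:

```python
def handoff_is_stale(upstream_plan, downstream_prompt, min_overlap=0.5):
    """Crude staleness check: the fraction of upstream sentences that
    appear verbatim in the downstream prompt must clear a threshold."""
    sentences = [s.strip() for s in upstream_plan.split(".") if s.strip()]
    if not sentences:
        return True  # nothing was handed off at all
    hits = sum(1 for s in sentences if s in downstream_prompt)
    return hits / len(sentences) < min_overlap
```

A paraphrasing pipeline will trip this check even when the handoff is fine, so treat a positive result as a prompt to read the two texts side by side, not as proof of a bug.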
### 4. Inspect Decision Quality
- Look for direct evidence behind BUY/HOLD/SELL rather than polished prose.
- Verify that bullish and bearish arguments were both represented.
- If HOLD is chosen, make sure it is justified by evidence instead of indecision.
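The second bullet can be checked mechanically with a keyword scan over the debate transcript. The keyword sets below are illustrative assumptions and will miss paraphrased arguments, so use this only as a first-pass filter:

```python
def sides_represented(debate_text):
    """Crude keyword check that both bullish and bearish arguments
    made it into the debate transcript. Keywords are illustrative."""
    text = debate_text.lower()
    bull = any(w in text for w in ("bull", "upside", "catalyst"))
    bear = any(w in text for w in ("bear", "downside", "risk"))
    return bull and bear
```

A transcript that fails this check is a strong hint of pattern 13 (multi-agent chaos) or pattern 14 (confidence inflation): one side of the debate never showed up, yet the decision came out confident anyway.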
## Evidence To Include In Bug Reports Or PRs
- Exact ticker and date used
- Provider and model configuration
- Which analyst or stage first went wrong
- Raw tool output or screenshots of the relevant report section
- Whether the problem disappears when using an exchange-qualified ticker or fewer debate rounds
## Suggested Fix Patterns
- Tighten prompt wording around exact tickers and exchange suffixes
- Add validation or CLI guidance for ambiguous symbols
- Preserve the exact model id or provider-specific configuration chosen by the user
- Add smaller smoke tests around helper functions and configuration paths before attempting a full end-to-end run
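The last fix pattern can start as a pre-flight check that runs in milliseconds. The config keys below are assumptions about the run configuration's shape, not the project's actual schema in `tradingagents/default_config.py`:

```python
def smoke_check_config(config):
    """Cheap pre-flight checks before a full end-to-end run.
    Returns a list of human-readable problems; empty means OK.
    The required keys here are illustrative assumptions."""
    problems = []
    for key in ("ticker", "trade_date", "llm_provider"):
        if not config.get(key):
            problems.append(f"missing or empty config key: {key}")
    ticker = config.get("ticker", "")
    if ticker and ticker != ticker.upper():
        problems.append("ticker should be upper-case")
    return problems
```

Running a check like this before every full run catches the cheapest failure modes (patterns 1, 10, and 11) without spending a single model call.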