
ADR 018 — Storage Layout, Event Persistence, WebSocket Streaming & Phase Re-run

Status: accepted
Date: 2026-03-26
Supersedes: ADR 015 (run_id namespacing — replaced by flow_id layout)
Extends: ADR 013 (WebSocket streaming — adds lazy-loading and run history)


Context

This ADR finalises the storage architecture introduced across several PRs (feat/fe-max-tickers-load-run, PR#106, PR#107, PR#108) and documents the decisions made while fixing the Run History loading bug and adding phase-level re-run capability.

Key problems solved:

  1. Run history lost on server restart — in-memory run store (runs dict) is not durable. Users could not replay completed runs after a restart.
  2. Checkpoint-less re-runs — re-running a node started from scratch (full analysts → debate → risk) instead of resuming from the correct phase.
  3. Re-run wiped full graph context — clearing all events on re-run removed scan nodes and other tickers from the graph, leaving only the re-run phase.
  4. Analysts checkpoint never saved — Social Analyst is optional; requiring all four analyst keys caused the checkpoint to be skipped silently.

1. Directory Structure (flow_id Layout)

Layout

reports/
└── daily/
    └── {date}/                        ← e.g. 2026-03-26/
        ├── latest.json                ← pointer to most-recent flow_id (legacy compat)
        ├── daily_digest.md            ← appended by every run on this date
        ├── {flow_id}/                 ← 8-char hex, e.g. 021f29ef/
        │   ├── run_meta.json          ← run metadata (id, status, params, …)
        │   ├── run_events.jsonl       ← newline-delimited JSON events
        │   ├── market/
        │   │   └── report/
        │   │       ├── {ts}_scan_report.json
        │   │       └── {ts}_complete_report.json
        │   ├── {TICKER}/              ← e.g. RIG/, TSDD/
        │   │   └── report/
        │   │       ├── {ts}_complete_report.json
        │   │       ├── {ts}_analysts_checkpoint.json
        │   │       ├── {ts}_trader_checkpoint.json
        │   │       └── complete_report.md
        │   └── portfolio/
        │       └── report/
        │           ├── {ts}_pm_decision.json
        │           └── {ts}_execution_result.json
        └── runs/                      ← legacy run_id layout (backward compat only)
            └── {run_id}/

flow_id vs run_id

Concept   Type                   Purpose
───────────────────────────────────────────────────────────────────────
run_id    UUID                   In-memory identity for a live run; used as the
                                 WebSocket endpoint key
flow_id   8-char hex timestamp   Disk storage key; stable across server restarts

flow_id is generated once per run via generate_flow_id() and threaded through all sub-phases of an auto run so scan + pipeline + portfolio share the same folder. run_id is ephemeral — it exists only in the runs dict and is not persisted.
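The id generation can be pictured with a minimal, hypothetical sketch — the real generate_flow_id() lives in tradingagents/report_paths.py and may derive its value differently; only the shape (8 lowercase hex chars, timestamp-based so folders sort roughly by creation time) is taken from this ADR:

```python
import time

def generate_flow_id() -> str:
    """Return an 8-char lowercase hex id (e.g. '021f29ef').

    Hypothetical sketch: mask a millisecond timestamp to 32 bits so the
    result always formats to exactly 8 hex characters.
    """
    return format(int(time.time() * 1000) & 0xFFFFFFFF, "08x")
```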

Startup Hydration

On server start, hydrate_runs_from_disk() scans reports/daily/*/ for run_meta.json files and rebuilds the runs dict with events: [] (lazy). Events are only loaded when actually needed (WebSocket connect or GET run detail).
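A minimal sketch of that hydration pass, assuming a runs dict keyed by run id and the flow_id layout above (the real implementation is in agent_os/backend/main.py and may differ in details):

```python
import json
from pathlib import Path

def hydrate_runs_from_disk(reports_dir: str, runs: dict) -> None:
    """Rebuild the in-memory runs dict from run_meta.json files.

    Events are deliberately left empty ([]) so they can be lazy-loaded
    from run_events.jsonl only when a client actually asks for them.
    """
    for meta_path in Path(reports_dir).glob("daily/*/*/run_meta.json"):
        meta = json.loads(meta_path.read_text())
        run_id = meta.get("id") or meta_path.parent.name
        runs[run_id] = {**meta, "events": []}  # lazy: events load on demand
```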


2. Event Structure

Every event sent over WebSocket or persisted to run_events.jsonl follows this schema:

{
  // Core identity
  "type":           "thought" | "tool" | "tool_result" | "result" | "log" | "system",
  "node_id":        "Bull Researcher",          // LangGraph node name
  "parent_node_id": "Bull Researcher",          // parent node (for tool events)
  "identifier":     "RIG",                      // ticker, "MARKET", or portfolio_id
  "agent":          "BULL RESEARCHER",          // uppercase display name
  "timestamp":      "10:28:49",                 // HH:MM:SS

  // Content
  "message":        "Thinking... (gls...)",     // truncated display text
  "prompt":         "You are a bull researcher…", // full prompt (thought/result only)
  "response":       "Based on the analysis…",    // full response (result/tool_result)

  // Metrics (result events)
  "metrics": {
    "model":       "glm-4.7-flash:q4_K_M",
    "tokens_in":   1240,
    "tokens_out":  856,
    "latency_ms":  17843
  },

  // Tool events
  "status":  "running" | "success" | "error" | "graceful_skip",
  "service": "yfinance",                        // data vendor used

  // Re-run tracking
  "rerun_seq": 1                                // incremented on each phase re-run; 0 = original
}

Event Types

Type         Emitted by              Content
───────────────────────────────────────────────────────────────────────
thought      LLM streaming chunk     message (truncated), prompt
result       LLM final output        message, prompt, response, metrics
tool         Tool invocation start   node_id, status: "running", service
tool_result  Tool completion         status, response (tool output), service
log          RunLogger               structured log line
system       Engine                  human-readable status update; the special
                                     messages "Run completed." and "Error: …"
                                     control the frontend state machine

Graph Rendering Rules

The frontend renders graph nodes by grouping events on (node_id, identifier). For each unique pair, the node shows the latest event's metrics (last result event wins). Nodes within the same identifier are stacked vertically; each identifier becomes a column.
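The grouping rule can be sketched in a few lines — illustrative only, since the real logic lives in the TypeScript frontend (useAgentStream.ts / Dashboard.tsx):

```python
def group_events(events: list[dict]) -> dict[tuple[str, str], dict]:
    """Group events into graph nodes keyed by (node_id, identifier).

    Each node carries the metrics of the last "result" event seen for
    that pair ("last result event wins").
    """
    nodes: dict[tuple[str, str], dict] = {}
    for ev in events:
        key = (ev["node_id"], ev["identifier"])
        node = nodes.setdefault(key, {"metrics": None})
        if ev.get("type") == "result" and "metrics" in ev:
            node["metrics"] = ev["metrics"]  # later results overwrite earlier ones
    return nodes
```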


3. How Events Are Sent

Normal Run Flow

POST /api/run/{type}              → queues run, returns run_id + flow_id
                                    status: "queued"
WS  /ws/stream/{run_id}           → connects
  if status == "queued"           → WebSocket IS the executor
    engine.run_*()                → streams events live to socket
    run_info["events"].append()   → events cached in memory
    run_info["status"] = completed/failed
  if status in running/completed/failed
    → replay cached events, poll for new ones until terminal state

Background Task Flow (POST → BackgroundTask)

POST /api/run/{type}
  BackgroundTask(_run_and_store)  → drives engine generator
    events cached in runs[run_id]["events"]
    status updated to running → completed/failed
WS /ws/stream/{run_id}
  → enters "streaming from cache" loop
  → polls events[sent:] every 50ms until status is terminal
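The cache-streaming loop common to both flows can be sketched as follows — a minimal sketch in which send() stands in for the WebSocket's send_json(); the real loop lives in agent_os/backend/routes/websocket.py:

```python
import asyncio

_EVENT_POLL_INTERVAL_SECONDS = 0.05  # 50ms poll interval from this ADR

async def stream_from_cache(run_info: dict, send) -> None:
    """Replay cached events, then poll the tail until a terminal status."""
    sent = 0
    while True:
        events = run_info["events"]
        for ev in events[sent:]:          # send only events not yet delivered
            await send(ev)
        sent = len(events)
        if run_info["status"] in ("completed", "failed"):
            break                         # terminal state: stop polling
        await asyncio.sleep(_EVENT_POLL_INTERVAL_SECONDS)
```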

Lazy-Loading (Server Restart / Run History)

Server restart
  hydrate_runs_from_disk()        → runs[run_id] = {..., "events": []}

WS /ws/stream/{run_id}
  run_info.events == []
    → create_report_store(flow_id=flow_id)
    → store.load_run_events(date)
    → run_info["events"] = disk_events
    if status == "running" and disk_events:
      → status = "failed", error = "Run did not complete (server restarted)"
  → replay all events, send "Run completed." or "Error: …"
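The lazy-load branch, including the orphaned-run detection, can be sketched as below; the function name is illustrative and the store's load_run_events() is assumed to match the signature used elsewhere in this ADR:

```python
def lazy_load_events(run_info: dict, store, date: str) -> None:
    """Load persisted events on first access; mark orphaned runs failed."""
    if run_info["events"]:
        return                              # already hydrated in memory
    disk_events = store.load_run_events(date)
    run_info["events"] = disk_events
    if run_info["status"] == "running" and disk_events:
        # Server restarted mid-run: the run can never reach a terminal
        # status on its own, so flag it for the frontend.
        run_info["status"] = "failed"
        run_info["error"] = "Run did not complete (server restarted)"
```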

Key Invariants

  • Events are append-only during a live run. Never modified in place.
  • run_events.jsonl is written on run completion (not streamed to disk in real time). This is acceptable for V1; periodic flush is a future enhancement.
  • WebSocket polling interval is 50ms (_EVENT_POLL_INTERVAL_SECONDS = 0.05).
  • System messages "Run completed." and "Error: <msg>" are terminal — the frontend transitions to completed or error state on receiving them.
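The persistence format is plain JSON Lines, so the completion-time write and the corresponding read are symmetric. A minimal sketch (helper names illustrative; the real methods live on the report store):

```python
import json
from pathlib import Path

def save_run_events(path: Path, events: list[dict]) -> None:
    """Write run_events.jsonl: one JSON object per line, on completion."""
    with path.open("w") as f:
        for ev in events:
            f.write(json.dumps(ev) + "\n")

def load_run_events(path: Path) -> list[dict]:
    """Read run_events.jsonl back into the in-memory event list."""
    return [json.loads(line) for line in path.read_text().splitlines() if line]
```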

4. Checkpoint Structure

Checkpoints are intermediate snapshots that allow phase-level re-runs without re-executing earlier phases.

Analysts Checkpoint

Written by: run_pipeline() after the graph completes
Condition: at least one of market_report, sentiment_report, news_report, fundamentals_report is populated (Social Analyst is optional)
Path: {flow_id}/{TICKER}/report/{ts}_analysts_checkpoint.json

{
  "company_of_interest": "RIG",
  "trade_date": "2026-03-26",
  "market_report": "…",          // from Market Analyst
  "news_report": "…",            // from News Analyst
  "fundamentals_report": "…",   // from Fundamentals Analyst
  "sentiment_report": "",        // from Social Analyst (may be empty — that's OK)
  "macro_regime_report": "…",   // from Macro Synthesis scan
  "messages": [...]             // LangGraph message history (for debate context)
}

Used by: run_pipeline_from_phase() when phase == "debate_and_trader". Overlaid onto initial_state before running debate_graph.
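The save condition above — any() rather than all(), so a disabled Social Analyst no longer blocks the checkpoint (problem 4 in Context) — can be sketched as:

```python
# Sketch of the fixed checkpoint condition. `state` stands in for the
# LangGraph state dict; the key names are taken from this ADR.
ANALYST_REPORT_KEYS = (
    "market_report", "sentiment_report", "news_report", "fundamentals_report",
)

def should_save_analysts_checkpoint(state: dict) -> bool:
    """True if at least one analyst report is populated (non-empty)."""
    return any(state.get(key) for key in ANALYST_REPORT_KEYS)
```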

Trader Checkpoint

Written by: run_pipeline() after the graph completes
Condition: trader_investment_plan is populated
Path: {flow_id}/{TICKER}/report/{ts}_trader_checkpoint.json

{
  "company_of_interest": "RIG",
  "trade_date": "2026-03-26",
  "market_report": "…",
  "news_report": "…",
  "fundamentals_report": "…",
  "sentiment_report": "",
  "macro_regime_report": "…",
  "investment_debate_state": {...},  // full bull/bear debate transcript
  "investment_plan": "…",            // Research Manager output
  "trader_investment_plan": "…",     // Trader output
  "messages": [...]
}

Used by: run_pipeline_from_phase() when phase == "risk".

Phase Re-run Routing

node_id              →  phase              →  checkpoint loaded
──────────────────────────────────────────────────────────────
Market Analyst       →  analysts           →  none (full re-run)
News Analyst         →  analysts           →  none
Fundamentals Analyst →  analysts           →  none
Social Analyst       →  analysts           →  none
Bull Researcher      →  debate_and_trader  →  analysts_checkpoint
Bear Researcher      →  debate_and_trader  →  analysts_checkpoint
Research Manager     →  debate_and_trader  →  analysts_checkpoint
Trader               →  debate_and_trader  →  analysts_checkpoint
Aggressive Analyst   →  risk               →  trader_checkpoint
Conservative Analyst →  risk               →  trader_checkpoint
Neutral Analyst      →  risk               →  trader_checkpoint
Portfolio Manager    →  risk               →  trader_checkpoint

After any phase re-run completes, the engine cascades to run_portfolio() so the PM decision incorporates the updated ticker analysis.
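The routing table above transcribes directly into the NODE_TO_PHASE mapping in langgraph_engine.py; PHASE_TO_CHECKPOINT below is a hypothetical companion mapping added here for illustration:

```python
# NODE_TO_PHASE transcribed from the routing table in this ADR.
NODE_TO_PHASE = {
    "Market Analyst": "analysts",
    "News Analyst": "analysts",
    "Fundamentals Analyst": "analysts",
    "Social Analyst": "analysts",
    "Bull Researcher": "debate_and_trader",
    "Bear Researcher": "debate_and_trader",
    "Research Manager": "debate_and_trader",
    "Trader": "debate_and_trader",
    "Aggressive Analyst": "risk",
    "Conservative Analyst": "risk",
    "Neutral Analyst": "risk",
    "Portfolio Manager": "risk",
}

# Hypothetical helper mapping: which checkpoint each phase resumes from.
PHASE_TO_CHECKPOINT = {
    "analysts": None,                           # full re-run, no checkpoint
    "debate_and_trader": "analysts_checkpoint",
    "risk": "trader_checkpoint",
}
```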

Checkpoint Lookup Rule

CRITICAL: The read store used to load checkpoints must use the same flow_id as the original run. Without the flow_id, _date_root() falls back to the legacy flat layout and will never find checkpoints stored under {flow_id}/{TICKER}/report/.

In trigger_rerun_node, the original flow_id is resolved as:

flow_id = run.get("flow_id") or run.get("short_rid") or run["params"].get("flow_id")

This is then passed through rerun_params["flow_id"] to run_pipeline_from_phase, which passes it to create_report_store(flow_id=flow_id).


5. Selective Event Filtering on Re-run

When a phase re-run is triggered, the run's event list is selectively filtered to remove stale events for the re-run scope while preserving events from:

  • Other tickers (TSDD events preserved when re-running RIG)
  • Earlier phases of the same ticker (analyst events preserved when re-running debate)
  • Scan/market events (always preserved)
# Nodes cleared per phase (plus all tool events with matching parent_node_id)
debate_and_trader  {Bull Researcher, Bear Researcher, Research Manager, Trader,
                     Aggressive Analyst, Conservative Analyst, Neutral Analyst,
                     Portfolio Manager}
risk               {Aggressive Analyst, Conservative Analyst, Neutral Analyst,
                     Portfolio Manager}
analysts           all nodes for the ticker

# Portfolio cascade nodes (always cleared — re-run always cascades to PM)
{review_holdings, make_pm_decision}

The WebSocket replays this filtered set first (rebuilding the full graph), then streams the new re-run events on top. The frontend's clearEvents() + WebSocket reconnect ensures a clean state before replay.
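The filtering rules above can be sketched as follows — a simplified stand-in for _filter_rerun_events in routes/runs.py, keeping only the keep/drop decision:

```python
# Node sets transcribed from the tables above.
RERUN_CLEAR_NODES = {
    "debate_and_trader": {
        "Bull Researcher", "Bear Researcher", "Research Manager", "Trader",
        "Aggressive Analyst", "Conservative Analyst", "Neutral Analyst",
        "Portfolio Manager",
    },
    "risk": {
        "Aggressive Analyst", "Conservative Analyst", "Neutral Analyst",
        "Portfolio Manager",
    },
}
PORTFOLIO_CASCADE_NODES = {"review_holdings", "make_pm_decision"}

def filter_rerun_events(events: list[dict], ticker: str, phase: str) -> list[dict]:
    """Drop stale events for (ticker, phase); keep everything else."""
    cleared = RERUN_CLEAR_NODES.get(phase)  # None => "analysts": clear all of the ticker's nodes
    kept = []
    for ev in events:
        nodes = {ev.get("node_id"), ev.get("parent_node_id")}
        if nodes & PORTFOLIO_CASCADE_NODES:
            continue                        # cascade always re-runs the PM decision
        if ev.get("identifier") != ticker:
            kept.append(ev)                 # other tickers and scan/MARKET events survive
        elif cleared is not None and not (nodes & cleared):
            kept.append(ev)                 # earlier phases of the same ticker survive
    return kept
```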


6. MongoDB vs Local Storage — Decision Guide

Use Local Storage (ReportStore) when:

  • Development or single-machine deployment — no infrastructure required
  • Offline / air-gapped environments — no network dependency
  • Report files are the primary output — reports as .json/.md files that can be read with any tool
  • Simplicity over scalability — one process, one machine

Use MongoDB (MongoReportStore) when:

  • Multi-process or multi-node deployment — local files are not shared
  • Run history across restarts — hydration from MongoDB is more reliable than scanning the filesystem
  • Reflexion memory — ReflexionMemory works best with MongoDB for efficient per-ticker history queries
  • Future: TTL / retention — MongoDB TTL indexes make automatic cleanup easy
  • Production environments — MongoDB provides durability, replication, and backup

Configuration

# Enable MongoDB:
TRADINGAGENTS_MONGO_URI=mongodb://localhost:27017
TRADINGAGENTS_MONGO_DB=tradingagents   # optional, default: "tradingagents"

# Local storage (default when MONGO_URI is unset):
TRADINGAGENTS_REPORTS_DIR=/path/to/reports   # optional, default: ./reports

Factory Behaviour

# Always use the factory — never instantiate stores directly
from tradingagents.portfolio.store_factory import create_report_store

# Writing: always pass flow_id (scopes writes to the correct run folder)
writer = create_report_store(flow_id=flow_id)

# Reading: omit flow_id (resolves via latest.json or MongoDB latest query)
reader = create_report_store()

create_report_store() returns:

  1. DualReportStore(MongoReportStore, ReportStore) — when MONGO_URI is set and pymongo is installed (writes to both; reads from Mongo first, falls back to disk)
  2. ReportStore — when MongoDB is unavailable or not configured

MongoDB failures always fall back to filesystem with a warning log. The application must remain functional without MongoDB.
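The selection logic can be sketched with stub store classes standing in for the real ones — only the branching mirrors create_report_store(); everything else here is illustrative:

```python
import os

class ReportStore:                      # filesystem store (stub)
    def __init__(self, flow_id=None):
        self.flow_id = flow_id

class MongoReportStore(ReportStore):    # MongoDB store (stub)
    pass

class DualReportStore:                  # writes to both, reads Mongo-first (stub)
    def __init__(self, primary, fallback):
        self.primary, self.fallback = primary, fallback

def create_report_store(flow_id=None):
    """Return DualReportStore when Mongo is configured, else ReportStore."""
    if os.environ.get("TRADINGAGENTS_MONGO_URI"):
        try:
            import pymongo  # noqa: F401 -- availability check only
            return DualReportStore(MongoReportStore(flow_id=flow_id),
                                   ReportStore(flow_id=flow_id))
        except ImportError:
            pass  # real code logs a warning, then falls back to disk
    return ReportStore(flow_id=flow_id)
```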

Known V1 Limitations (Future Work)

Issue                                                  Status
───────────────────────────────────────────────────────────────────────
pymongo is synchronous — blocks asyncio event loop     Deferred: migrate to motor
                                                       before production
No TTL index — reports accumulate indefinitely         Deferred: requires retention
                                                       policy decision
MongoClient created per store instance                 Deferred: singleton via
                                                       FastAPI app lifespan
run_events.jsonl written on completion, not streaming  Deferred: periodic flush
                                                       for long runs

Consequences & Constraints

MUST

  • Always use create_report_store(flow_id=…) for writes — never call it without arguments when writing, because the flat fallback path will overwrite reports across runs.
  • Always pass the original flow_id when loading checkpoints for re-run — checkpoint lookup will silently return None otherwise, causing full re-run fallback.
  • Save analysts checkpoint if any() analyst report is populated — Social Analyst is optional; all() silently blocks checkpoints when social is disabled.
  • Selective event filtering on re-run — never clear all events; always use _filter_rerun_events(events, ticker, phase) to preserve other tickers and earlier phases.

MUST NOT

  • Never hard-code ReportStore() in engine methods — always use the factory.
  • Never hold pymongo in the async hot path — wrap in asyncio.to_thread if blocking becomes measurable.

Source Files

tradingagents/portfolio/report_store.py       ← ReportStore (filesystem)
tradingagents/portfolio/mongo_report_store.py ← MongoReportStore
tradingagents/portfolio/dual_report_store.py  ← DualReportStore (both)
tradingagents/portfolio/store_factory.py      ← create_report_store()
tradingagents/report_paths.py                 ← flow_id/run_id helpers, ts_now()
agent_os/backend/main.py                      ← hydrate_runs_from_disk()
agent_os/backend/routes/runs.py               ← _run_and_store, _append_and_store,
                                                 _filter_rerun_events, trigger_rerun_node
agent_os/backend/routes/websocket.py          ← lazy-loading, orphaned run detection
agent_os/backend/services/langgraph_engine.py ← run_pipeline_from_phase, NODE_TO_PHASE,
                                                 checkpoint save/load logic
agent_os/frontend/src/hooks/useAgentStream.ts ← WebSocket client, event accumulation
agent_os/frontend/src/Dashboard.tsx           ← triggerNodeRerun, loadRun, clearEvents