# ADR 018 — Storage Layout, Event Persistence, WebSocket Streaming & Phase Re-run

- Status: accepted
- Date: 2026-03-26
- Supersedes: ADR 015 (run_id namespacing — replaced by flow_id layout)
- Extends: ADR 013 (WebSocket streaming — adds lazy-loading and run history)
## Context
This ADR finalises the storage architecture introduced across several PRs
(feat/fe-max-tickers-load-run, PR#106, PR#107, PR#108) and documents the
decisions made while fixing the Run History loading bug and phase-level re-run
capability.
Key problems solved:

- Run history lost on server restart — the in-memory run store (the `runs` dict) is not durable. Users could not replay completed runs after a restart.
- Checkpoint-less re-runs — re-running a node started from scratch (full analysts → debate → risk) instead of resuming from the correct phase.
- Re-run wiped full graph context — clearing all events on re-run removed scan nodes and other tickers from the graph, leaving only the re-run phase.
- Analysts checkpoint never saved — the Social Analyst is optional; requiring all four analyst keys caused the checkpoint to be skipped silently.
## 1. Directory Structure (flow_id Layout)

### Layout
```
reports/
└── daily/
    └── {date}/                          ← e.g. 2026-03-26/
        ├── latest.json                  ← pointer to most-recent flow_id (legacy compat)
        ├── daily_digest.md              ← appended by every run on this date
        ├── {flow_id}/                   ← 8-char hex, e.g. 021f29ef/
        │   ├── run_meta.json            ← run metadata (id, status, params, …)
        │   ├── run_events.jsonl         ← newline-delimited JSON events
        │   ├── market/
        │   │   └── report/
        │   │       ├── {ts}_scan_report.json
        │   │       └── {ts}_complete_report.json
        │   ├── {TICKER}/                ← e.g. RIG/, TSDD/
        │   │   └── report/
        │   │       ├── {ts}_complete_report.json
        │   │       ├── {ts}_analysts_checkpoint.json
        │   │       ├── {ts}_trader_checkpoint.json
        │   │       └── complete_report.md
        │   └── portfolio/
        │       └── report/
        │           ├── {ts}_pm_decision.json
        │           └── {ts}_execution_result.json
        └── runs/                        ← legacy run_id layout (backward compat only)
            └── {run_id}/
```
### flow_id vs run_id

| Concept | Type | Purpose |
|---|---|---|
| `run_id` | UUID | In-memory identity for a live run; used as the WebSocket endpoint key |
| `flow_id` | 8-char hex timestamp | Disk storage key; stable across server restarts |
`flow_id` is generated once per run via `generate_flow_id()` and threaded through
all sub-phases of an auto run, so scan + pipeline + portfolio share the same folder.
`run_id` is ephemeral — it exists only in the `runs` dict and is not persisted.
### Startup Hydration

On server start, `hydrate_runs_from_disk()` scans `reports/daily/*/` for
`run_meta.json` files and rebuilds the `runs` dict with `events: []` (lazy).
Events are only loaded when actually needed (WebSocket connect or GET run detail).
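The hydration step can be sketched as follows. `hydrate_runs_from_disk()` exists per this ADR, but the body below is an illustrative reconstruction under the `reports/daily/{date}/{flow_id}/run_meta.json` layout, not the actual source:

```python
import json
from pathlib import Path


def hydrate_runs_from_disk(reports_root: str) -> dict:
    """Rebuild the in-memory runs dict from run_meta.json files on disk.

    Events are deliberately left empty so they are lazy-loaded only when a
    WebSocket connects or a run detail is requested.
    """
    runs: dict[str, dict] = {}
    daily = Path(reports_root) / "daily"
    if not daily.is_dir():
        return runs
    # daily/{date}/{flow_id}/run_meta.json
    for meta_path in sorted(daily.glob("*/*/run_meta.json")):
        meta = json.loads(meta_path.read_text())
        runs[meta["id"]] = {
            **meta,
            "flow_id": meta_path.parent.name,  # folder name is the flow_id
            "events": [],                      # lazy: loaded from run_events.jsonl on demand
        }
    return runs
```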
## 2. Event Structure

Every event sent over the WebSocket or persisted to `run_events.jsonl` follows this
schema:
```jsonc
{
  // Core identity
  "type": "thought" | "tool" | "tool_result" | "result" | "log" | "system",
  "node_id": "Bull Researcher",          // LangGraph node name
  "parent_node_id": "Bull Researcher",   // parent node (for tool events)
  "identifier": "RIG",                   // ticker, "MARKET", or portfolio_id
  "agent": "BULL RESEARCHER",            // uppercase display name
  "timestamp": "10:28:49",               // HH:MM:SS

  // Content
  "message": "Thinking... (gls...)",     // truncated display text
  "prompt": "You are a bull researcher…",// full prompt (thought/result only)
  "response": "Based on the analysis…",  // full response (result/tool_result)

  // Metrics (result events)
  "metrics": {
    "model": "glm-4.7-flash:q4_K_M",
    "tokens_in": 1240,
    "tokens_out": 856,
    "latency_ms": 17843
  },

  // Tool events
  "status": "running" | "success" | "error" | "graceful_skip",
  "service": "yfinance",                 // data vendor used

  // Re-run tracking
  "rerun_seq": 1                         // incremented on each phase re-run; 0 = original
}
```
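As a reading aid, the schema can be mirrored as a Python `TypedDict`. This is a sketch for type-checking purposes only; the backend's actual event model is not shown in this ADR:

```python
from typing import Literal, TypedDict


class Metrics(TypedDict):
    model: str
    tokens_in: int
    tokens_out: int
    latency_ms: int


class RunEvent(TypedDict, total=False):
    # total=False: most fields are present only for certain event types
    type: Literal["thought", "tool", "tool_result", "result", "log", "system"]
    node_id: str
    parent_node_id: str      # parent node, for tool events
    identifier: str          # ticker, "MARKET", or portfolio_id
    agent: str               # uppercase display name
    timestamp: str           # HH:MM:SS
    message: str             # truncated display text
    prompt: str              # full prompt (thought/result only)
    response: str            # full response (result/tool_result)
    metrics: Metrics         # result events only
    status: Literal["running", "success", "error", "graceful_skip"]
    service: str             # data vendor used
    rerun_seq: int           # 0 = original run
```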
### Event Types

| Type | Emitted by | Content |
|---|---|---|
| `thought` | LLM streaming chunk | `message` (truncated), `prompt` |
| `result` | LLM final output | `message`, `prompt`, `response`, `metrics` |
| `tool` | Tool invocation start | `node_id`, `status: "running"`, `service` |
| `tool_result` | Tool completion | `status`, `response` (tool output), `service` |
| `log` | `RunLogger` | structured log line |
| `system` | Engine | human-readable status update; the special messages `"Run completed."` and `"Error: …"` control the frontend state machine |
### Graph Rendering Rules

The frontend renders graph nodes by grouping events on (`node_id`, `identifier`).
For each unique pair, the node shows the latest event's metrics (the last `result`
event wins). Nodes within the same identifier are stacked vertically; each
identifier becomes a column.
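A minimal sketch of this grouping rule, assuming events are plain dicts with the fields from Section 2. `group_events_for_graph` is a hypothetical helper added here for illustration (the real logic lives in the TypeScript frontend):

```python
from collections import OrderedDict


def group_events_for_graph(events: list) -> "OrderedDict":
    """Group events into graph nodes keyed by (node_id, identifier).

    The last "result" event for a pair wins, so its metrics are what
    the node displays.
    """
    nodes: "OrderedDict[tuple, dict]" = OrderedDict()
    for ev in events:
        key = (ev.get("node_id", ""), ev.get("identifier", ""))
        node = nodes.setdefault(key, {"events": [], "metrics": None})
        node["events"].append(ev)
        if ev.get("type") == "result" and ev.get("metrics"):
            node["metrics"] = ev["metrics"]  # last result event wins
    return nodes
```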
## 3. How Events Are Sent

### Normal Run Flow

```
POST /api/run/{type}              → queues run, returns run_id + flow_id
                                    status: "queued"
WS /ws/stream/{run_id}            → connects
  if status == "queued"           → WebSocket IS the executor
    engine.run_*()                → streams events live to socket
    run_info["events"].append()   → events cached in memory
    run_info["status"] = completed/failed
  if status in running/completed/failed
                                  → replay cached events, poll for new ones until terminal state
```
### Background Task Flow (POST → BackgroundTask)

```
POST /api/run/{type}
  BackgroundTask(_run_and_store)  → drives engine generator
    events cached in runs[run_id]["events"]
    status updated to running → completed/failed
WS /ws/stream/{run_id}
  → enters "streaming from cache" loop
  → polls events[sent:] every 50ms until status is terminal
```
### Lazy-Loading (Server Restart / Run History)

```
Server restart
  hydrate_runs_from_disk()        → runs[run_id] = {..., "events": []}
WS /ws/stream/{run_id}
  run_info.events == []
    → create_report_store(flow_id=flow_id)
    → store.load_run_events(date)
    → run_info["events"] = disk_events
  if status == "running" and disk_events:
    → status = "failed", error = "Run did not complete (server restarted)"
  → replay all events, send "Run completed." or "Error: …"
```
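The lazy-load branch can be sketched as a small helper. `load_run_events` is named in the flow above; everything else here is an illustrative assumption:

```python
def lazy_load_events(run_info: dict, store, date: str) -> list:
    """Load events from disk on first access and flag orphaned runs.

    A hydrated run still marked "running" cannot actually be running:
    the process that owned it died, so it is marked failed.
    """
    if run_info["events"]:            # already in memory: nothing to do
        return run_info["events"]
    disk_events = store.load_run_events(date)
    run_info["events"] = disk_events
    if run_info.get("status") == "running" and disk_events:
        run_info["status"] = "failed"
        run_info["error"] = "Run did not complete (server restarted)"
    return run_info["events"]
```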
### Key Invariants

- Events are append-only during a live run and are never modified in place.
- `run_events.jsonl` is written on run completion (not streamed to disk in real time). This is acceptable for V1; periodic flush is a future enhancement.
- The WebSocket polling interval is 50 ms (`_EVENT_POLL_INTERVAL_SECONDS = 0.05`).
- The system messages `"Run completed."` and `"Error: <msg>"` are terminal — the frontend transitions to the `completed` or `error` state on receiving them.
## 4. Checkpoint Structure

Checkpoints are intermediate snapshots that allow phase-level re-runs without re-executing earlier phases.

### Analysts Checkpoint

- Written by: `run_pipeline()` after the graph completes
- Condition: at least one of `market_report`, `sentiment_report`, `news_report`, `fundamentals_report` is populated (the Social Analyst is optional)
- Path: `{flow_id}/{TICKER}/report/{ts}_analysts_checkpoint.json`
```jsonc
{
  "company_of_interest": "RIG",
  "trade_date": "2026-03-26",
  "market_report": "…",        // from Market Analyst
  "news_report": "…",          // from News Analyst
  "fundamentals_report": "…",  // from Fundamentals Analyst
  "sentiment_report": "",      // from Social Analyst (may be empty — that's OK)
  "macro_regime_report": "…",  // from Macro Synthesis scan
  "messages": [...]            // LangGraph message history (for debate context)
}
```
Used by: `run_pipeline_from_phase()` when `phase == "debate_and_trader"`.
Overlaid onto `initial_state` before running `debate_graph`.
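The overlay step can be sketched as a shallow merge, assuming the state is a plain dict. `overlay_checkpoint` is a hypothetical helper; the real engine works on a LangGraph state object:

```python
def overlay_checkpoint(initial_state: dict, checkpoint: dict) -> dict:
    """Return initial_state with checkpoint fields layered on top.

    Non-empty checkpoint values win, so the debate phase starts from the
    saved analyst reports instead of empty strings.
    """
    merged = dict(initial_state)
    merged.update({k: v for k, v in checkpoint.items() if v not in (None, "")})
    return merged
```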
### Trader Checkpoint

- Written by: `run_pipeline()` after the graph completes
- Condition: `trader_investment_plan` is populated
- Path: `{flow_id}/{TICKER}/report/{ts}_trader_checkpoint.json`
```jsonc
{
  "company_of_interest": "RIG",
  "trade_date": "2026-03-26",
  "market_report": "…",
  "news_report": "…",
  "fundamentals_report": "…",
  "sentiment_report": "",
  "macro_regime_report": "…",
  "investment_debate_state": {...},  // full bull/bear debate transcript
  "investment_plan": "…",            // Research Manager output
  "trader_investment_plan": "…",     // Trader output
  "messages": [...]
}
```
Used by: `run_pipeline_from_phase()` when `phase == "risk"`.
### Phase Re-run Routing

```
node_id               → phase              → checkpoint loaded
──────────────────────────────────────────────────────────────
Market Analyst        → analysts           → none (full re-run)
News Analyst          → analysts           → none
Fundamentals Analyst  → analysts           → none
Social Analyst        → analysts           → none
Bull Researcher       → debate_and_trader  → analysts_checkpoint
Bear Researcher       → debate_and_trader  → analysts_checkpoint
Research Manager      → debate_and_trader  → analysts_checkpoint
Trader                → debate_and_trader  → analysts_checkpoint
Aggressive Analyst    → risk               → trader_checkpoint
Conservative Analyst  → risk               → trader_checkpoint
Neutral Analyst       → risk               → trader_checkpoint
Portfolio Manager     → risk               → trader_checkpoint
```
After any phase re-run completes, the engine cascades to `run_portfolio()`
so the PM decision incorporates the updated ticker analysis.
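The routing table can be expressed as plain dicts. `NODE_TO_PHASE` is named in this ADR's source-file list; `PHASE_TO_CHECKPOINT` is a hypothetical companion mapping added here for illustration:

```python
NODE_TO_PHASE = {
    "Market Analyst": "analysts",
    "News Analyst": "analysts",
    "Fundamentals Analyst": "analysts",
    "Social Analyst": "analysts",
    "Bull Researcher": "debate_and_trader",
    "Bear Researcher": "debate_and_trader",
    "Research Manager": "debate_and_trader",
    "Trader": "debate_and_trader",
    "Aggressive Analyst": "risk",
    "Conservative Analyst": "risk",
    "Neutral Analyst": "risk",
    "Portfolio Manager": "risk",
}

# Which checkpoint each phase resumes from (None = full re-run).
PHASE_TO_CHECKPOINT = {
    "analysts": None,
    "debate_and_trader": "analysts_checkpoint",
    "risk": "trader_checkpoint",
}
```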
### Checkpoint Lookup Rule

CRITICAL: the read store used to load checkpoints must use the same
`flow_id` as the original run. Without the `flow_id`, `_date_root()` falls
back to the legacy flat layout and will never find checkpoints stored under
`{flow_id}/{TICKER}/report/`.

In `trigger_rerun_node`, the original `flow_id` is resolved as:

```python
flow_id = run.get("flow_id") or run.get("short_rid") or run["params"].get("flow_id")
```

This is then passed through `rerun_params["flow_id"]` to `run_pipeline_from_phase`,
which passes it to `create_report_store(flow_id=flow_id)`.
## 5. Selective Event Filtering on Re-run
When a phase re-run is triggered, the run's event list is selectively filtered to remove stale events for the re-run scope while preserving events from:
- Other tickers (TSDD events preserved when re-running RIG)
- Earlier phases of the same ticker (analyst events preserved when re-running debate)
- Scan/market events (always preserved)
```
# Nodes cleared per phase (plus all tool events with matching parent_node_id)
debate_and_trader → {Bull Researcher, Bear Researcher, Research Manager, Trader,
                     Aggressive Analyst, Conservative Analyst, Neutral Analyst,
                     Portfolio Manager}
risk              → {Aggressive Analyst, Conservative Analyst, Neutral Analyst,
                     Portfolio Manager}
analysts          → all nodes for the ticker

# Portfolio cascade nodes (always cleared — re-run always cascades to PM)
{review_holdings, make_pm_decision}
```
The WebSocket replays this filtered set first (rebuilding the full graph), then
streams the new re-run events on top. The frontend's `clearEvents()` plus WebSocket
reconnect ensures a clean state before replay.
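A sketch of the filtering rule described above, using the per-phase node sets from this section. `filter_rerun_events` is a hypothetical approximation of what the ADR calls `_filter_rerun_events`:

```python
PHASE_NODES = {
    "debate_and_trader": {
        "Bull Researcher", "Bear Researcher", "Research Manager", "Trader",
        "Aggressive Analyst", "Conservative Analyst", "Neutral Analyst",
        "Portfolio Manager",
    },
    "risk": {
        "Aggressive Analyst", "Conservative Analyst", "Neutral Analyst",
        "Portfolio Manager",
    },
}
CASCADE_NODES = {"review_holdings", "make_pm_decision"}  # PM always re-runs


def filter_rerun_events(events: list, ticker: str, phase: str) -> list:
    """Drop only events made stale by the re-run; keep everything else."""
    cleared = PHASE_NODES.get(phase)  # None for "analysts": clear all ticker nodes
    kept = []
    for ev in events:
        node = ev.get("node_id")
        if node in CASCADE_NODES:
            continue                   # portfolio cascade always cleared
        if ev.get("identifier") != ticker:
            kept.append(ev)            # other tickers / MARKET preserved
            continue
        if cleared is None or node in cleared or ev.get("parent_node_id") in cleared:
            continue                   # stale for the re-run scope (incl. tool events)
        kept.append(ev)
    return kept
```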
## 6. MongoDB vs Local Storage — Decision Guide

Use Local Storage (`ReportStore`) when:

- Development or single-machine deployment — no infrastructure required
- Offline / air-gapped environments — no network dependency
- Report files are the primary output — reports as .json/.md files that can be read with any tool
- Simplicity over scalability — one process, one machine
Use MongoDB (`MongoReportStore`) when:

- Multi-process or multi-node deployment — local files are not shared
- Run history across restarts — hydration from MongoDB is more reliable than scanning the filesystem
- Reflexion memory — `ReflexionMemory` works best with MongoDB for efficient per-ticker history queries
- Future: TTL / retention — MongoDB TTL indexes make automatic cleanup easy
- Production environments — MongoDB provides durability, replication, and backup
### Configuration

```shell
# Enable MongoDB:
TRADINGAGENTS_MONGO_URI=mongodb://localhost:27017
TRADINGAGENTS_MONGO_DB=tradingagents        # optional, default: "tradingagents"

# Local storage (default when MONGO_URI is unset):
TRADINGAGENTS_REPORTS_DIR=/path/to/reports  # optional, default: ./reports
```
### Factory Behaviour

```python
# Always use the factory — never instantiate stores directly
from tradingagents.portfolio.store_factory import create_report_store

# Writing: always pass flow_id (scopes writes to the correct run folder)
writer = create_report_store(flow_id=flow_id)

# Reading: omit flow_id (resolves via latest.json or MongoDB latest query)
reader = create_report_store()
```
`create_report_store()` returns:

- `DualReportStore(MongoReportStore, ReportStore)` — when `MONGO_URI` is set and pymongo is installed (writes to both; reads from Mongo first, falls back to disk)
- `ReportStore` — when MongoDB is unavailable or not configured
MongoDB failures always fall back to filesystem with a warning log. The application must remain functional without MongoDB.
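The fallback behaviour can be sketched with injected constructors. This is a hypothetical shape for illustration; the real factory builds `MongoReportStore`/`ReportStore` directly:

```python
import logging
import os

log = logging.getLogger(__name__)


def create_report_store_sketch(flow_id=None, make_mongo=None, make_disk=None):
    """Prefer a dual Mongo+disk store; fall back to disk-only on any failure."""
    disk = make_disk(flow_id)
    if os.environ.get("TRADINGAGENTS_MONGO_URI"):
        try:
            mongo = make_mongo(flow_id)    # may raise: pymongo missing, bad URI, …
            return ("dual", mongo, disk)   # writes go to both; reads prefer Mongo
        except Exception as exc:           # never let Mongo break the app
            log.warning("MongoDB unavailable, falling back to disk: %s", exc)
    return ("disk", disk)
```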
### Known V1 Limitations (Future Work)

| Issue | Status |
|---|---|
| `pymongo` is synchronous — blocks the asyncio event loop | Deferred: migrate to `motor` before production |
| No TTL index — reports accumulate indefinitely | Deferred: requires retention policy decision |
| `MongoClient` created per store instance | Deferred: singleton via FastAPI app lifespan |
| `run_events.jsonl` written on completion, not streaming | Deferred: periodic flush for long runs |
## Consequences & Constraints

### MUST

- Always use `create_report_store(flow_id=…)` for writes — never call it with no arguments when writing, as the flat fallback path will overwrite files across runs.
- Always pass the original `flow_id` when loading checkpoints for re-run — checkpoint lookup will silently return `None` otherwise, causing full re-run fallback.
- Save the analysts checkpoint if `any()` analyst report is populated — the Social Analyst is optional; `all()` silently blocks checkpoints when social is disabled.
- Use selective event filtering on re-run — never clear all events; always use `_filter_rerun_events(events, ticker, phase)` to preserve other tickers and earlier phases.
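The `any()` rule can be captured in a tiny guard. `should_save_analysts_checkpoint` is a hypothetical helper name; the key tuple follows the checkpoint condition in Section 4:

```python
ANALYST_KEYS = ("market_report", "sentiment_report",
                "news_report", "fundamentals_report")


def should_save_analysts_checkpoint(state: dict) -> bool:
    """True if at least one analyst report is populated.

    Uses any(), never all(): the Social Analyst may legitimately be
    disabled, leaving sentiment_report empty.
    """
    return any(state.get(k) for k in ANALYST_KEYS)
```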
### MUST NOT

- Never hard-code `ReportStore()` in engine methods — always use the factory.
- Never hold pymongo in the async hot path — wrap calls in `asyncio.to_thread` if blocking becomes measurable.
## Source Files

```
tradingagents/portfolio/report_store.py        ← ReportStore (filesystem)
tradingagents/portfolio/mongo_report_store.py  ← MongoReportStore
tradingagents/portfolio/dual_report_store.py   ← DualReportStore (both)
tradingagents/portfolio/store_factory.py       ← create_report_store()
tradingagents/report_paths.py                  ← flow_id/run_id helpers, ts_now()
agent_os/backend/main.py                       ← hydrate_runs_from_disk()
agent_os/backend/routes/runs.py                ← _run_and_store, _append_and_store,
                                                 _filter_rerun_events, trigger_rerun_node
agent_os/backend/routes/websocket.py           ← lazy-loading, orphaned run detection
agent_os/backend/services/langgraph_engine.py  ← run_pipeline_from_phase, NODE_TO_PHASE,
                                                 checkpoint save/load logic
agent_os/frontend/src/hooks/useAgentStream.ts  ← WebSocket client, event accumulation
agent_os/frontend/src/Dashboard.tsx            ← triggerNodeRerun, loadRun, clearEvents
```