* feat: introduce flow_id with timestamp-based report versioning
Replace run_id with flow_id as the primary grouping concept (one flow =
one user analysis intent spanning scan + pipeline + portfolio). Reports
are now written as {timestamp}_{name}.json so load methods always return
the latest version by lexicographic sort, eliminating the latest.json
pointer pattern for new flows.
Key changes:
- report_paths.py: add generate_flow_id(), ts_now() (ms precision), and a
flow_id kwarg on all path helpers; keep run_id / pointer helpers for
backward compatibility
- ReportStore: dual-mode save/load — flow_id uses timestamped layout,
run_id uses legacy runs/{id}/ layout with latest.json
- MongoReportStore: add flow_id field and index; run_id stays for compat
- DualReportStore: expose flow_id property
- store_factory: accept flow_id as primary param, run_id as alias
- runs.py / langgraph_engine.py: generate and thread flow_id through all
trigger endpoints and run methods
- Tests: add flow_id coverage for all layers; 905 tests pass
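
A minimal sketch of the timestamped layout described above, assuming a fixed-width UTC timestamp so that lexicographic order equals chronological order. The function names generate_flow_id() and ts_now() come from this commit; their bodies, the save_report()/load_latest() helpers, the `ts` override, and the exact timestamp format are illustrative assumptions, not the code in report_paths.py.

```python
import secrets
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def ts_now() -> str:
    """UTC timestamp with millisecond precision (assumed format, fixed width)."""
    now = datetime.now(timezone.utc)
    return now.strftime("%Y%m%dT%H%M%S") + f"{now.microsecond // 1000:03d}Z"


def generate_flow_id() -> str:
    """Hypothetical flow_id format: 8 random hex characters."""
    return secrets.token_hex(4)


def save_report(flow_dir: Path, name: str, payload: str,
                ts: Optional[str] = None) -> Path:
    """Write {timestamp}_{name}.json; `ts` is an override added for testing."""
    flow_dir.mkdir(parents=True, exist_ok=True)
    path = flow_dir / f"{ts or ts_now()}_{name}.json"
    path.write_text(payload, encoding="utf-8")
    return path


def load_latest(flow_dir: Path, name: str) -> Optional[str]:
    """Return the newest version of a report, or None if none exists."""
    if not flow_dir.is_dir():
        return None
    # Fixed-width timestamps sort lexicographically, so the last entry is newest.
    candidates = sorted(flow_dir.glob(f"*_{name}.json"))
    return candidates[-1].read_text(encoding="utf-8") if candidates else None
```

Note that this sketch would overwrite a file if two saves of the same report landed in the same millisecond, and the glob would also match a report whose name merely ends in `_{name}`. A real implementation may need stricter matching.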
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: load flow_id in FE to resume runs and fix max_tickers cap on continuation
- Add flow_id to RunParams interface and initial state
- loadRun() now restores flow_id + max_auto_tickers from history so the next
run continues in the same flow directory (Phase 1 scan skipped, already-done
tickers skipped via skip-if-exists logic)
- startRun() spreads flow_id into the request body when set, letting the backend
reuse the existing flow directory instead of generating a fresh flow_id
- After each run, params.flow_id is updated from the response so subsequent
runs automatically continue from the same flow
- max_auto_tickers restored from run.params.max_tickers ensures the ticker cap
matches the original run; scan_tickers[:max_t] on the backend then limits
the Phase 2 queue to the user's setting even when the existing scan has more
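
The backend side of this continuation logic might look roughly like the sketch below. The slice scan_tickers[:max_t] is from the commit text. The helper names resolve_flow_id() and build_phase2_queue(), the flow_id format, and the order of "cap first, then skip finished tickers" are assumptions made for illustration.

```python
import secrets
from typing import List, Optional, Set


def resolve_flow_id(requested: Optional[str]) -> str:
    """Reuse the flow_id sent by the frontend, else generate a fresh one."""
    return requested or secrets.token_hex(4)  # hypothetical id format


def build_phase2_queue(scan_tickers: List[str], max_t: int,
                       done: Set[str]) -> List[str]:
    """Cap the scan results to the user's setting, then skip finished tickers."""
    capped = scan_tickers[:max_t]
    return [t for t in capped if t not in done]
```

For example, with a stored scan of four tickers, a cap of 3, and one ticker already analysed, only the two remaining tickers inside the cap are queued.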
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(mongo): fast-fail timeout + lazy ensure_indexes to avoid 30s block on fallback
MongoClient previously used pymongo's 30-second serverSelectionTimeoutMS default,
causing store_factory to hang for 30s before falling back to the filesystem when
Atlas is unreachable. Also, ensure_indexes() was called eagerly in __init__,
making every store construction attempt block on a live network call.
- Set serverSelectionTimeoutMS=5_000 so fallback is triggered in ≤5s
- Move ensure_indexes() call out of __init__ — indexes are now created lazily
on the first _save() call via a guarded self._indexes_ensured flag
- ensure_indexes() is still idempotent and safe to call explicitly in tests
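
A minimal sketch of the lazy-index guard, assuming the store holds a collection object exposing create_index() and insert_one() as pymongo collections do. LazyIndexStore and FakeCollection are hypothetical names. FakeCollection stands in for a real collection so the example runs without a database; with pymongo, the client would be built with `MongoClient(uri, serverSelectionTimeoutMS=5_000)`.

```python
from typing import Any, Dict, List


class LazyIndexStore:
    """Construction does no network I/O; indexes are ensured on first save."""

    def __init__(self, collection: Any) -> None:
        self._col = collection
        self._indexes_ensured = False

    def ensure_indexes(self) -> None:
        # Safe to call repeatedly; also callable explicitly from tests.
        self._col.create_index("flow_id")
        self._indexes_ensured = True

    def _save(self, doc: Dict[str, Any]) -> None:
        if not self._indexes_ensured:
            self.ensure_indexes()
        self._col.insert_one(doc)


class FakeCollection:
    """In-memory stand-in that counts create_index() calls."""

    def __init__(self) -> None:
        self.index_calls = 0
        self.docs: List[Dict[str, Any]] = []

    def create_index(self, keys: str) -> str:
        self.index_calls += 1
        return keys

    def insert_one(self, doc: Dict[str, Any]) -> None:
        self.docs.append(doc)
```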
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(store): wrap all DualReportStore mongo calls in _try_mongo() for graceful degradation
Any MongoDB exception (SSL error, ServerSelectionTimeout, auth failure) was
propagating uncaught through DualReportStore and crashing the run. Reads
would return an error instead of falling back to local, and writes would
abort mid-run without saving anything.
Introduce a single _try_mongo(fn, default) helper that:
- Executes the Mongo callable
- Catches *any* exception, logs it as WARNING with type + message
- Returns the default value so the caller continues with local-only data
Pattern per method:
- writes → try mongo (fire-and-forget); always return local result
- reads → try mongo first; fall back to local on None or exception
- lists → try mongo; fall back to local on empty/None
Runs now complete successfully even when Atlas is unreachable or returns SSL
errors. MongoDB sync resumes automatically once connectivity is restored.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(observability): non-blocking MongoDB inserts + 5s timeout in RunLogger
Every LLM and tool callback called _append(), which synchronously called
insert_one() against MongoDB. When Atlas was unreachable, this blocked the
entire LangGraph run for pymongo's 30-second default timeout per event,
effectively serializing all agent work behind MongoDB retries.
Two fixes:
1. serverSelectionTimeoutMS=5_000 on the RunLogger's MongoClient — consistent
with the same fix applied to MongoReportStore.
2. MongoDB inserts are now fire-and-forget via daemon threads — _append() spawns
a Thread(target=_insert, daemon=True) and returns immediately. LLM callbacks
and tool events are never delayed by MongoDB connectivity issues.
Failures are still reported via WARNING log from the background thread.
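
A minimal sketch of the fire-and-forget insert from fix 2, assuming a collection object with an insert_one() method. append_event() is a hypothetical stand-in for _append(); it returns the thread only so that callers and tests can join it, which the change described here does not necessarily do.

```python
import logging
import threading
from typing import Any, Dict

logger = logging.getLogger(__name__)


def append_event(collection: Any, event: Dict[str, Any]) -> threading.Thread:
    """Start the insert on a daemon thread and return without waiting for it."""

    def _insert() -> None:
        try:
            collection.insert_one(event)
        except Exception as exc:
            # Failures surface only as a WARNING from the background thread.
            logger.warning("Mongo insert failed (%s): %s",
                           type(exc).__name__, exc)

    thread = threading.Thread(target=_insert, daemon=True)
    thread.start()
    return thread
```

Because daemon threads do not keep the interpreter alive, events still in flight at process exit may be lost. That is a trade-off of this pattern.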
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* revert(observability): restore synchronous MongoDB inserts in RunLogger
The root cause was an Atlas IP whitelist issue causing SSL failures, not
insert volume, so the background-thread approach added unnecessary complexity.
The 5s serverSelectionTimeoutMS is retained as a defensive safeguard.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>