TradingAgents backend migration and rollback notes (draft)
Status: draft
Audience: backend/application maintainers
Scope: migrate toward the application-service boundary and result-contract-v1alpha1 with rollback safety
Current progress snapshot (2026-04)
Mainline has moved beyond pure planning, but it has not finished the full boundary migration:
- Phase 0 is effectively done: contract and architecture drafts exist.
- Phases 1-4 are partially landed:
  - backend services now project v1alpha1-style public payloads;
  - result contracts are persisted via result_store.py;
  - /ws/analysis/{task_id} and /ws/orchestrator already wrap payloads with contract_version;
  - recommendation and task-status reads already depend on application-layer shaping more than route-local reconstruction.
- Phase 5 is partially landed via the task lifecycle boundary slice:
  - status/list/cancel now route through backend task services instead of route-local orchestration;
  - web_dashboard/backend/main.py is still too large outside that slice;
  - reports/export and other residual route-local orchestration are still pending;
  - compatibility fields still coexist with the newer contract-first path.
Also note that research provenance / node guard / profiling work is now landed on the orchestrator side. That effort complements the backend migration but should not be confused with “application boundary fully complete.”
Recent improvements (2026-04-16):
- Orchestrator error classification now includes comprehensive provider × base_url matrix validation
- Timeout configuration validation warns when analyst/research timeouts may be insufficient for multi-analyst profiles
- All provider mismatches (anthropic, openai, google, xai, ollama, openrouter) are now detected before graph initialization
1. Migration objective
Move backend delivery code from route-local orchestration to an application-service layer without changing the quant+LLM merge kernel behavior.
Target outcomes:
- stable result contract (v1alpha1)
- thin FastAPI transport
- application-owned task lifecycle and mapping
- rollback-safe migration using dual-read/dual-write where useful
2. Current coupling hotspots
Primary hotspot: web_dashboard/backend/main.py
It currently combines:
- route handlers
- task persistence
- subprocess creation and monitoring
- progress/stage state mutation
- result projection into API fields
- report export concerns
This file is the first migration target.
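To make the target split concrete, here is a minimal sketch of what main.py should converge toward: persistence and payload projection owned by an application service, with the route reduced to a one-line adapter. All names here (TaskRecord, TaskStore, AnalysisService) are illustrative, not actual module contents.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    task_id: str
    status: str
    progress: int


class TaskStore:
    """Application-owned persistence, pulled out of route handlers."""

    def __init__(self):
        self._tasks = {}

    def save(self, record):
        self._tasks[record.task_id] = record

    def get(self, task_id):
        return self._tasks.get(task_id)


class AnalysisService:
    """Owns task lifecycle; routes become thin adapters over this."""

    def __init__(self, store):
        self.store = store

    def status(self, task_id):
        rec = self.store.get(task_id)
        if rec is None:
            return {"error": "not_found"}
        # Projection into the public payload lives here, not in the route.
        return {"task_id": rec.task_id, "status": rec.status, "progress": rec.progress}


# A FastAPI handler would then reduce to: return service.status(task_id)
store = TaskStore()
store.save(TaskRecord("t1", "running", 40))
service = AnalysisService(store)
```

The point of the sketch is the dependency direction: the route knows the service, the service knows the store, and nothing in the store or service imports transport code.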
3. Recommended migration sequence
Phase 0: contract freeze draft
Deliverables:
- agree on docs/contracts/result-contract-v1alpha1.md
- agree on the application boundary in docs/architecture/application-boundary.md
Rollback:
- none needed; documentation only
Phase 1: introduce application service behind existing routes
Actions:
- add backend application modules for analysis status, live signals, and report reads
- keep existing route URLs unchanged
- move mapping logic out of route functions into service/mappers
Compatibility tactic:
- routes still return current payload shape if frontend depends on it
- internal service also emits v1alpha1 DTOs for verification comparison
Rollback:
- route handlers can call old inline functions directly via feature flag or import switch
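The import-switch rollback can be as small as one guarded dispatch function. A hedged sketch, using the TA_APP_SERVICE_ENABLED placeholder from section 4 (the handler names are hypothetical):

```python
import os


def legacy_status_inline(task_id):
    # Old route-local logic, kept importable for rollback.
    return {"task_id": task_id, "shape": "legacy"}


def service_status(task_id):
    # New application-service path emitting the v1alpha1-style shape.
    return {"task_id": task_id, "shape": "v1alpha1"}


def get_status(task_id):
    """Route-level switch: flip one env var to roll back instantly."""
    if os.environ.get("TA_APP_SERVICE_ENABLED", "0") == "1":
        return service_status(task_id)
    return legacy_status_inline(task_id)
```

Because the legacy function stays importable, rollback is a config change rather than a code revert.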
Current status:
- partially complete on mainline via analysis_service.py, job_service.py, and result_store.py
- task lifecycle (status/list/cancel) is now service-routed
- not complete enough yet to claim main.py is only a thin adapter
Phase 2: dual-read for task status
Why:
Task status currently lives in memory plus data/task_status/*.json. During migration, new service storage and old persisted shape may diverge.
Recommended strategy:
- read preference: new application store first
- fallback read: legacy JSON task status
- compare key fields during the shadow period: status, progress, current_stage, decision, error
Rollback:
- switch read preference back to legacy JSON only
- leave new store populated for debugging, but non-authoritative
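The dual-read strategy above can be sketched as a single function: prefer the new store, fall back to legacy JSON state, and record any divergence in the shadow-compare fields. The stores are modeled as plain dicts here; real store/log types are an assumption.

```python
# Fields named in the shadow-comparison list above.
COMPARE_FIELDS = ("status", "progress", "current_stage", "decision", "error")


def dual_read_status(task_id, new_store, legacy_store, compare_log):
    """Read task status with new-store preference and legacy fallback.

    When both stores have the task, log any field-level divergence so the
    shadow period produces evidence before the legacy path is retired.
    """
    new = new_store.get(task_id)
    legacy = legacy_store.get(task_id)
    if new is not None and legacy is not None:
        for field in COMPARE_FIELDS:
            if new.get(field) != legacy.get(field):
                compare_log.append((task_id, field, new.get(field), legacy.get(field)))
    return new if new is not None else legacy
```

Rolling back means returning `legacy` first instead of `new`; the comparison loop can stay in place either way.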
Phase 3: dual-write for task results
Why:
To avoid breaking status pages and historical tooling during rollout.
Recommended strategy:
- authoritative write: new application store
- compatibility write: legacy
app.state.task_results+data/task_status/*.json - emit diff logs when new-vs-legacy projections disagree
Guardrails:
- dual-write only for application-layer payloads
- do not dual-write alternate domain semantics into
orchestrator/
Rollback:
- disable new-store writes
- continue legacy writes only
Phase 4: websocket and live signal migration
Actions:
- make /ws/analysis/{task_id} and /ws/orchestrator render application contracts
- keep websocket wrapper fields stable while migrating the internal body shape
Suggested compatibility step:
- send the legacy event envelope with an embedded contract_version
- update frontend consumers before removing legacy-only fields
Rollback:
- restore websocket serializer to legacy shape
- keep application service intact behind adapter
Current status:
- partially complete on mainline:
  - /ws/orchestrator already emits contract_version, data_quality, degradation, and research
  - /ws/analysis/{task_id} already reads application-shaped task state
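The compatibility step for Phase 4 amounts to a serializer that keeps the legacy envelope's top-level fields intact while tagging the inner body. A sketch, assuming the envelope is a flat JSON object (the field names besides contract_version are illustrative):

```python
import json


def wrap_event(legacy_event, body, contract_version="v1alpha1"):
    """Preserve legacy envelope fields; tag and embed the contract body.

    Frontend clients that only read legacy top-level fields keep working;
    migrated clients read `contract_version` and the embedded `data` body.
    """
    envelope = dict(legacy_event)  # copy so the caller's event is untouched
    envelope["contract_version"] = contract_version
    envelope["data"] = body
    return json.dumps(envelope)
```

Rollback here is dropping the two added keys from the serializer, leaving the application service untouched behind the adapter.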
Phase 5: remove route-local orchestration
Actions:
- delete dead inline task mutation helpers from main.py
- keep routes as a thin adapter layer
- preserve report retrieval behavior
Rollback:
- only safe after shadow metrics show parity
- otherwise revert to Phase 3 dual-write mode, not direct deletion
4. Suggested feature flags
Environment-variable style examples:
TA_APP_SERVICE_ENABLED=1
TA_RESULT_CONTRACT_VERSION=v1alpha1
TA_TASKSTORE_DUAL_READ=1
TA_TASKSTORE_DUAL_WRITE=1
TA_WS_V1ALPHA1_ENABLED=0
These names are placeholders; exact naming can be chosen during implementation.
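Whatever names are finally chosen, it helps to parse all flags in one place with explicit defaults so rollback is a single env change. A sketch using the placeholder names above:

```python
import os

# Placeholder flag names from section 4; defaults chosen for the legacy path.
DEFAULTS = {
    "TA_APP_SERVICE_ENABLED": "0",
    "TA_RESULT_CONTRACT_VERSION": "v1alpha1",
    "TA_TASKSTORE_DUAL_READ": "0",
    "TA_TASKSTORE_DUAL_WRITE": "0",
    "TA_WS_V1ALPHA1_ENABLED": "0",
}


def load_flags(env=None):
    """Resolve migration flags once at startup; booleans except the version string."""
    env = os.environ if env is None else env
    flags = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name, default)
        flags[name] = raw if name == "TA_RESULT_CONTRACT_VERSION" else raw == "1"
    return flags
```

Centralizing the parse avoids scattered `os.environ.get` calls with inconsistent defaults, which is itself a rollback hazard.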
5. Verification checkpoints per phase
For each migration phase, verify:
- the same routes return the same task ids for identical requests
- stage transitions remain monotonic
- completed tasks persist decision, confidence, and degraded-path outcomes
- the failure path still preserves actionable error text
- live websocket payloads preserve ticker/date ordering expectations
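The monotonic-stage checkpoint above is cheap to automate from recorded transitions. A sketch; the stage names are illustrative, not the project's actual stage vocabulary:

```python
# Illustrative stage vocabulary in pipeline order (not the real stage names).
STAGE_ORDER = ["queued", "analysts", "research", "trader", "risk", "done"]


def stages_monotonic(transitions, order=STAGE_ORDER):
    """Return True if observed stage transitions never move backwards."""
    indices = [order.index(stage) for stage in transitions]
    return all(a <= b for a, b in zip(indices, indices[1:]))
```

Running this over shadow-period transition logs for both the legacy and new read paths gives a concrete pass/fail signal for this checkpoint.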
6. Rollback triggers
Rollback immediately if any of these happen:
- task status disappears after backend restart
- WebSocket clients stop receiving progress updates
- completed analysis loses decision or confidence fields
- degraded single-lane signals are reclassified incorrectly
- report export or historical report retrieval cannot find prior artifacts
7. Explicit non-goals during migration
- do not rewrite orchestrator/signals.py merge math as part of the boundary migration
- do not rework provider/model selection semantics in the same change set
- do not force frontend redesign before contract shadowing proves parity
- do not implement a new strategy layer inside the application service
8. Minimal rollback playbook
If production or local verification fails after migration cutover:
- disable application-service read path
- disable dual-write to new store if it corrupts parity checks
- restore legacy route-local serializers
- keep generated comparison logs/artifacts for diff analysis
- re-run backend tests and one end-to-end manual analysis flow
9. Review checklist
A migration plan is acceptable only if it:
- preserves orchestrator ownership of quant+LLM merge semantics
- introduces feature-flagged cutover points
- supports dual-read/dual-write only at application/persistence boundary
- provides a one-step rollback path at each release phase
10. Maintainer note
When updating migration status, keep these three documents aligned:
- docs/architecture/application-boundary.md
- docs/contracts/result-contract-v1alpha1.md
- docs/architecture/research-provenance.md
The first two describe backend/application convergence; the third describes orchestrator-side research degradation and profiling semantics that now feed those contracts.