234 lines
7.9 KiB
Markdown
234 lines
7.9 KiB
Markdown
# TradingAgents backend migration and rollback notes draft
|
||
|
||
Status: draft
|
||
Audience: backend/application maintainers
|
||
Scope: migrate toward application-service boundary and result-contract-v1alpha1 with rollback safety
|
||
|
||
## Current progress snapshot (2026-04)
|
||
|
||
Mainline has moved beyond pure planning, but it has not finished the full boundary migration:
|
||
|
||
- `Phase 0` is effectively done: contract and architecture drafts exist.
|
||
- `Phase 1-4` are **partially landed**:
|
||
- backend services now project `v1alpha1`-style public payloads;
|
||
- result contracts are persisted via `result_store.py`;
|
||
- `/ws/analysis/{task_id}` and `/ws/orchestrator` already wrap payloads with `contract_version`;
|
||
- recommendation and task-status reads already depend on application-layer shaping more than route-local reconstruction.
|
||
- `Phase 5` is **partially landed** via the task lifecycle boundary slice:
|
||
- `status/list/cancel` now route through backend task services instead of route-local orchestration;
|
||
- `web_dashboard/backend/main.py` is still too large outside that slice;
|
||
- reports/export and other residual route-local orchestration are still pending;
|
||
- compatibility fields still coexist with the newer contract-first path.
|
||
|
||
Also note that research provenance / node guard / profiling work is now landed on the orchestrator side. That effort complements the backend migration but should not be confused with “application boundary fully complete.”
|
||
|
||
**Recent improvements (2026-04-16)**:
|
||
- Orchestrator error classification now includes comprehensive provider × base_url matrix validation
|
||
- Timeout configuration validation warns when analyst/research timeouts may be insufficient for multi-analyst profiles
|
||
- All provider mismatches (anthropic, openai, google, xai, ollama, openrouter) are now detected before graph initialization
|
||
|
||
## 1. Migration objective
|
||
|
||
Move backend delivery code from route-local orchestration to an application-service layer without changing the quant+LLM merge kernel behavior.
|
||
|
||
Target outcomes:
|
||
|
||
- stable result contract (`v1alpha1`)
|
||
- thin FastAPI transport
|
||
- application-owned task lifecycle and mapping
|
||
- rollback-safe migration using dual-read/dual-write where useful
|
||
|
||
## 2. Current coupling hotspots
|
||
|
||
Primary hotspot: `web_dashboard/backend/main.py`
|
||
|
||
It currently combines:
|
||
|
||
- route handlers
|
||
- task persistence
|
||
- subprocess creation and monitoring
|
||
- progress/stage state mutation
|
||
- result projection into API fields
|
||
- report export concerns
|
||
|
||
This file is the first migration target.
|
||
|
||
## 3. Recommended migration sequence
|
||
|
||
## Phase 0: contract freeze draft
|
||
|
||
Deliverables:
|
||
|
||
- agree on `docs/contracts/result-contract-v1alpha1.md`
|
||
- agree on application boundary in `docs/architecture/application-boundary.md`
|
||
|
||
Rollback:
|
||
|
||
- none needed; documentation only
|
||
|
||
## Phase 1: introduce application service behind existing routes
|
||
|
||
Actions:
|
||
|
||
- add backend application modules for analysis status, live signals, and report reads
|
||
- keep existing route URLs unchanged
|
||
- move mapping logic out of route functions into service/mappers
|
||
|
||
Compatibility tactic:
|
||
|
||
- routes still return current payload shape if frontend depends on it
|
||
- internal service also emits `v1alpha1` DTOs for verification comparison
|
||
|
||
Rollback:
|
||
|
||
- route handlers can call old inline functions directly via feature flag or import switch
|
||
|
||
Current status:
|
||
|
||
- partially complete on mainline via `analysis_service.py`, `job_service.py`, and `result_store.py`
|
||
- task lifecycle (`status/list/cancel`) is now service-routed
|
||
- not complete enough yet to claim `main.py` is only a thin adapter
|
||
|
||
## Phase 2: dual-read for task status
|
||
|
||
Why:
|
||
|
||
Task status currently lives in memory plus `data/task_status/*.json`. During migration, new service storage and old persisted shape may diverge.
|
||
|
||
Recommended strategy:
|
||
|
||
- read preference: new application store first
|
||
- fallback read: legacy JSON task status
|
||
- compare key fields during shadow period: `status`, `progress`, `current_stage`, `decision`, `error`
|
||
|
||
Rollback:
|
||
|
||
- switch read preference back to legacy JSON only
|
||
- leave new store populated for debugging, but non-authoritative
|
||
|
||
## Phase 3: dual-write for task results
|
||
|
||
Why:
|
||
|
||
To avoid breaking status pages and historical tooling during rollout.
|
||
|
||
Recommended strategy:
|
||
|
||
- authoritative write: new application store
|
||
- compatibility write: legacy `app.state.task_results` + `data/task_status/*.json`
|
||
- emit diff logs when new-vs-legacy projections disagree
|
||
|
||
Guardrails:
|
||
|
||
- dual-write only for application-layer payloads
|
||
- do not dual-write alternate domain semantics into `orchestrator/`
|
||
|
||
Rollback:
|
||
|
||
- disable new-store writes
|
||
- continue legacy writes only
|
||
|
||
## Phase 4: websocket and live signal migration
|
||
|
||
Actions:
|
||
|
||
- make `/ws/analysis/{task_id}` and `/ws/orchestrator` render application contracts
|
||
- keep websocket wrapper fields stable while migrating internal body shape
|
||
|
||
Suggested compatibility step:
|
||
|
||
- send legacy event envelope with embedded `contract_version`
|
||
- update frontend consumers before removing legacy-only fields
|
||
|
||
Rollback:
|
||
|
||
- restore websocket serializer to legacy shape
|
||
- keep application service intact behind adapter
|
||
|
||
Current status:
|
||
|
||
- partially complete on mainline
|
||
- `/ws/orchestrator` already emits `contract_version`, `data_quality`, `degradation`, and `research`
|
||
- `/ws/analysis/{task_id}` already reads application-shaped task state
|
||
|
||
## Phase 5: remove route-local orchestration
|
||
|
||
Actions:
|
||
|
||
- delete dead inline task mutation helpers from `main.py`
|
||
- keep routes as thin adapter layer
|
||
- preserve report retrieval behavior
|
||
|
||
Rollback:
|
||
|
||
- only safe after shadow metrics show parity
|
||
- otherwise revert to Phase 3 dual-write mode, not direct deletion
|
||
|
||
## 4. Suggested feature flags
|
||
|
||
Environment-variable style examples:
|
||
|
||
- `TA_APP_SERVICE_ENABLED=1`
|
||
- `TA_RESULT_CONTRACT_VERSION=v1alpha1`
|
||
- `TA_TASKSTORE_DUAL_READ=1`
|
||
- `TA_TASKSTORE_DUAL_WRITE=1`
|
||
- `TA_WS_V1ALPHA1_ENABLED=0`
|
||
|
||
These names are placeholders; exact naming can be chosen during implementation.
|
||
|
||
## 5. Verification checkpoints per phase
|
||
|
||
For each migration phase, verify:
|
||
|
||
- same task ids are returned for the same route behavior
|
||
- stage transitions remain monotonic
|
||
- completed tasks persist `decision`, `confidence`, and degraded-path outcomes
|
||
- failure path still preserves actionable error text
|
||
- live websocket payloads preserve ticker/date ordering expectations
|
||
|
||
## 6. Rollback triggers
|
||
|
||
Rollback immediately if any of these happen:
|
||
|
||
- task status disappears after backend restart
|
||
- WebSocket clients stop receiving progress updates
|
||
- completed analysis loses `decision` or confidence fields
|
||
- degraded single-lane signals are reclassified incorrectly
|
||
- report export or historical report retrieval cannot find prior artifacts
|
||
|
||
## 7. Explicit non-goals during migration
|
||
|
||
- do not rewrite `orchestrator/signals.py` merge math as part of boundary migration
|
||
- do not rework provider/model selection semantics in the same change set
|
||
- do not force frontend redesign before contract shadowing proves parity
|
||
- do not implement a new strategy layer inside the application service
|
||
|
||
## 8. Minimal rollback playbook
|
||
|
||
If production or local verification fails after migration cutover:
|
||
|
||
1. disable application-service read path
|
||
2. disable dual-write to new store if it corrupts parity checks
|
||
3. restore legacy route-local serializers
|
||
4. keep generated comparison logs/artifacts for diff analysis
|
||
5. re-run backend tests and one end-to-end manual analysis flow
|
||
|
||
## 9. Review checklist
|
||
|
||
A migration plan is acceptable only if it:
|
||
|
||
- preserves orchestrator ownership of quant+LLM merge semantics
|
||
- introduces feature-flagged cutover points
|
||
- supports dual-read/dual-write only at application/persistence boundary
|
||
- provides a one-step rollback path at each release phase
|
||
|
||
## 10. Maintainer note
|
||
|
||
When updating migration status, keep these three documents aligned:
|
||
|
||
- `docs/architecture/application-boundary.md`
|
||
- `docs/contracts/result-contract-v1alpha1.md`
|
||
- `docs/architecture/research-provenance.md`
|
||
|
||
The first two describe backend/application convergence; the third describes orchestrator-side research degradation and profiling semantics that now feed those contracts.
|