# TradingAgents backend migration and rollback notes draft
Status: draft

Audience: backend/application maintainers

Scope: migrate toward application-service boundary and result-contract-v1alpha1 with rollback safety
## Current progress snapshot (2026-04)
Mainline has moved beyond pure planning, but it has not finished the full boundary migration:
- `Phase 0` is effectively done: contract and architecture drafts exist.
- `Phase 1-4` are **partially landed**:
  - backend services now project `v1alpha1`-style public payloads;
  - result contracts are persisted via `result_store.py`;
  - `/ws/analysis/{task_id}` and `/ws/orchestrator` already wrap payloads with `contract_version`;
  - recommendation and task-status reads already depend on application-layer shaping more than route-local reconstruction.
- `Phase 5` is **partially landed** via the task lifecycle boundary slice:
  - `status/list/cancel` now route through backend task services instead of route-local orchestration;
  - `web_dashboard/backend/main.py` is still too large outside that slice;
  - reports/export and other residual route-local orchestration are still pending;
  - compatibility fields still coexist with the newer contract-first path.
Also note that the research provenance / node guard / profiling work has now landed on the orchestrator side. That effort complements the backend migration but should not be read as “application boundary fully complete.”
**Recent improvements (2026-04-16)**:
- Orchestrator error classification now includes comprehensive provider × base_url matrix validation
- Timeout configuration validation warns when analyst/research timeouts may be insufficient for multi-analyst profiles
- All provider mismatches (anthropic, openai, google, xai, ollama, openrouter) are now detected before graph initialization
## 1. Migration objective
Move backend delivery code from route-local orchestration to an application-service layer without changing the quant+LLM merge kernel behavior.
Target outcomes:
- stable result contract (`v1alpha1`)
- thin FastAPI transport
- application-owned task lifecycle and mapping
- rollback-safe migration using dual-read/dual-write where useful
## 2. Current coupling hotspots
Primary hotspot: `web_dashboard/backend/main.py`
It currently combines:
- route handlers
- task persistence
- subprocess creation and monitoring
- progress/stage state mutation
- result projection into API fields
- report export concerns
This file is the first migration target.
## 3. Recommended migration sequence
## Phase 0: contract freeze draft
Deliverables:
- agree on `docs/contracts/result-contract-v1alpha1.md`
- agree on application boundary in `docs/architecture/application-boundary.md`
Rollback:
- none needed; documentation only
## Phase 1: introduce application service behind existing routes
Actions:
- add backend application modules for analysis status, live signals, and report reads
- keep existing route URLs unchanged
- move mapping logic out of route functions into service/mappers
Compatibility tactic:
- routes still return current payload shape if frontend depends on it
- internal service also emits `v1alpha1` DTOs for verification comparison
Rollback:
- route handlers can call old inline functions directly via feature flag or import switch
Current status:
- partially complete on mainline via `analysis_service.py`, `job_service.py`, and `result_store.py`
- task lifecycle (`status/list/cancel`) is now service-routed
- not complete enough yet to claim `main.py` is only a thin adapter
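
The rollback switch described for this phase can be sketched as follows. This is a minimal illustration, not the code in `analysis_service.py` or `main.py`: the DTO fields, the function names, and the use of the placeholder flag `TA_APP_SERVICE_ENABLED` are all assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Any, Dict, Mapping


@dataclass(frozen=True)
class TaskStatusDTO:
    """Hypothetical v1alpha1-style DTO; field names are illustrative."""
    contract_version: str
    task_id: str
    status: str
    progress: int


def legacy_status_payload(task: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the old inline (route-local) mapping logic."""
    return {
        "task_id": task["task_id"],
        "status": task["status"],
        "progress": task.get("progress", 0),
    }


def service_status_dto(task: Dict[str, Any]) -> TaskStatusDTO:
    """Stand-in for the application-service mapper."""
    return TaskStatusDTO(
        contract_version="v1alpha1",
        task_id=task["task_id"],
        status=task["status"],
        progress=int(task.get("progress", 0)),
    )


def to_current_shape(dto: TaskStatusDTO) -> Dict[str, Any]:
    """Project the DTO back to the payload shape the frontend already uses."""
    return {"task_id": dto.task_id, "status": dto.status, "progress": dto.progress}


def get_status(task: Dict[str, Any], env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Route-level entry point: the flag picks the service or the legacy path."""
    if env.get("TA_APP_SERVICE_ENABLED", "0") == "1":
        return to_current_shape(service_status_dto(task))
    return legacy_status_payload(task)
```

Because both branches return the same shape, flipping the flag off is the whole rollback for this phase, and the two outputs can be compared directly during verification.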
## Phase 2: dual-read for task status
Why:
Task status currently lives in memory plus `data/task_status/*.json`. During migration, new service storage and old persisted shape may diverge.
Recommended strategy:
- read preference: new application store first
- fallback read: legacy JSON task status
- compare key fields during shadow period: `status`, `progress`, `current_stage`, `decision`, `error`
Rollback:
- switch read preference back to legacy JSON only
- leave new store populated for debugging, but non-authoritative
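
A minimal sketch of the dual-read strategy above, assuming the new application store can be treated as a dict keyed by task id and that legacy files are named `<task_id>.json`. Both are assumptions for illustration; the real store interface and file naming may differ.

```python
import json
import logging
from pathlib import Path
from typing import Any, Dict, Optional, Tuple

logger = logging.getLogger("taskstore.shadow")

# Fields compared during the shadow period (taken from the list above).
SHADOW_FIELDS = ("status", "progress", "current_stage", "decision", "error")


def read_legacy(task_id: str, legacy_dir: Path) -> Optional[Dict[str, Any]]:
    """Read the legacy JSON task status, or None if the file is absent."""
    path = legacy_dir / f"{task_id}.json"  # assumed file naming
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def diff_fields(new: Dict[str, Any], legacy: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (new_value, legacy_value)} for every disagreeing field."""
    return {
        field: (new.get(field), legacy.get(field))
        for field in SHADOW_FIELDS
        if new.get(field) != legacy.get(field)
    }


def read_task_status(
    task_id: str,
    new_store: Dict[str, Dict[str, Any]],
    legacy_dir: Path,
    prefer_new: bool = True,
) -> Optional[Dict[str, Any]]:
    """Dual-read: new store first, legacy fallback, with a shadow comparison."""
    new = new_store.get(task_id)
    legacy = read_legacy(task_id, legacy_dir)
    if new is not None and legacy is not None:
        mismatches = diff_fields(new, legacy)
        if mismatches:
            logger.warning("task %s shadow mismatch: %s", task_id, mismatches)
    if prefer_new and new is not None:
        return new
    # Rollback mode (prefer_new=False) or a miss in the new store.
    return legacy
```

Setting `prefer_new=False` corresponds to the rollback step: the legacy JSON becomes authoritative again while the new store stays populated for debugging.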
## Phase 3: dual-write for task results
Why:
To avoid breaking status pages and historical tooling during rollout.
Recommended strategy:
- authoritative write: new application store
- compatibility write: legacy `app.state.task_results` + `data/task_status/*.json`
- emit diff logs when new-vs-legacy projections disagree
Guardrails:
- dual-write only for application-layer payloads
- do not dual-write alternate domain semantics into `orchestrator/`
Rollback:
- disable new-store writes
- continue legacy writes only
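
The dual-write path might look like the sketch below. It assumes the new store and `app.state.task_results` can both be modelled as plain dicts, that legacy files are named `<task_id>.json`, and that the caller already holds both projections; the function name and signature are hypothetical.

```python
import json
import logging
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger("taskstore.dualwrite")


def dual_write(
    task_id: str,
    new_payload: Dict[str, Any],
    legacy_payload: Dict[str, Any],
    new_store: Dict[str, Dict[str, Any]],
    legacy_results: Dict[str, Dict[str, Any]],
    legacy_dir: Path,
    new_store_enabled: bool = True,
) -> List[str]:
    """Write both projections and return the shared keys whose values disagree."""
    shared = new_payload.keys() & legacy_payload.keys()
    disagreeing = sorted(k for k in shared if new_payload[k] != legacy_payload[k])

    if new_store_enabled:
        # Authoritative write: new application store.
        new_store[task_id] = dict(new_payload)
        if disagreeing:
            logger.warning("task %s new-vs-legacy diff on %s", task_id, disagreeing)

    # Compatibility write: in-memory results plus the legacy JSON file.
    legacy_results[task_id] = dict(legacy_payload)
    path = legacy_dir / f"{task_id}.json"  # assumed file naming
    path.write_text(json.dumps(legacy_payload), encoding="utf-8")
    return disagreeing
```

Passing `new_store_enabled=False` matches the rollback above: new-store writes stop and the legacy writes continue unchanged.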
## Phase 4: websocket and live signal migration
Actions:
- make `/ws/analysis/{task_id}` and `/ws/orchestrator` render application contracts
- keep websocket wrapper fields stable while migrating internal body shape
Suggested compatibility step:
- send legacy event envelope with embedded `contract_version`
- update frontend consumers before removing legacy-only fields
Rollback:
- restore websocket serializer to legacy shape
- keep application service intact behind adapter
Current status:
- partially complete on mainline
- `/ws/orchestrator` already emits `contract_version`, `data_quality`, `degradation`, and `research`
- `/ws/analysis/{task_id}` already reads application-shaped task state
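
The compatibility step of embedding `contract_version` in a legacy envelope can be illustrated as below. The envelope keys `type`, `task_id`, and `data` are assumptions for the example and are not taken from the actual websocket serializer.

```python
from typing import Any, Dict

CONTRACT_VERSION = "v1alpha1"


def wrap_event(
    event_type: str,
    task_id: str,
    body: Dict[str, Any],
    include_contract: bool = True,
) -> Dict[str, Any]:
    """Build a legacy-style envelope, optionally tagged with the contract version."""
    envelope: Dict[str, Any] = {
        "type": event_type,   # assumed legacy wrapper field
        "task_id": task_id,   # assumed legacy wrapper field
        "data": dict(body),   # application-shaped body
    }
    if include_contract:
        envelope["contract_version"] = CONTRACT_VERSION
    return envelope
```

With `include_contract=False` the serializer emits the legacy shape only, which is the rollback described above; the wrapper fields are identical in both modes, so frontend consumers that ignore unknown keys are unaffected.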
## Phase 5: remove route-local orchestration
Actions:
- delete dead inline task mutation helpers from `main.py`
- keep routes as thin adapter layer
- preserve report retrieval behavior
Rollback:
- deleting the inline helpers is only safe after shadow metrics show parity
- if parity is not established, stay in (or revert to) Phase 3 dual-write mode instead of deleting
## 4. Suggested feature flags
Environment-variable style examples:
- `TA_APP_SERVICE_ENABLED=1`
- `TA_RESULT_CONTRACT_VERSION=v1alpha1`
- `TA_TASKSTORE_DUAL_READ=1`
- `TA_TASKSTORE_DUAL_WRITE=1`
- `TA_WS_V1ALPHA1_ENABLED=0`
These names are placeholders; exact naming can be chosen during implementation.
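
One way to read these flags in a single place is sketched below. The variable names are the placeholders listed above; the dataclass, the accepted truthy strings, and the all-off defaults are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MigrationFlags:
    app_service_enabled: bool
    result_contract_version: str
    taskstore_dual_read: bool
    taskstore_dual_write: bool
    ws_v1alpha1_enabled: bool


def _as_bool(value: str) -> bool:
    """Treat common truthy spellings as True; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_flags(env: Mapping[str, str] = os.environ) -> MigrationFlags:
    """Parse migration flags; unset flags default to the legacy behaviour."""
    return MigrationFlags(
        app_service_enabled=_as_bool(env.get("TA_APP_SERVICE_ENABLED", "0")),
        result_contract_version=env.get("TA_RESULT_CONTRACT_VERSION", "v1alpha1"),
        taskstore_dual_read=_as_bool(env.get("TA_TASKSTORE_DUAL_READ", "0")),
        taskstore_dual_write=_as_bool(env.get("TA_TASKSTORE_DUAL_WRITE", "0")),
        ws_v1alpha1_enabled=_as_bool(env.get("TA_WS_V1ALPHA1_ENABLED", "0")),
    )
```

Defaulting every boolean to off means that an environment with no flags set behaves like the pre-migration backend, which keeps rollback a matter of unsetting variables.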
## 5. Verification checkpoints per phase
For each migration phase, verify:
- routes return the same task IDs for the same requests
- stage transitions remain monotonic
- completed tasks persist `decision`, `confidence`, and degraded-path outcomes
- failure path still preserves actionable error text
- live websocket payloads preserve ticker/date ordering expectations
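
Two of these checks are easy to automate. The sketch below assumes a fixed stage ordering and a flat completed-task payload; the stage names in `STAGE_ORDER` are hypothetical and should be replaced with the real pipeline stages.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical stage ordering, for illustration only.
STAGE_ORDER = ("queued", "running", "completed")


def stages_are_monotonic(stages: Sequence[str]) -> bool:
    """True if the observed stages never move backwards in STAGE_ORDER."""
    try:
        ranks = [STAGE_ORDER.index(stage) for stage in stages]
    except ValueError:
        # An unknown stage name counts as a failed check.
        return False
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


def missing_result_fields(
    payload: Dict[str, Any],
    required: Sequence[str] = ("decision", "confidence"),
) -> List[str]:
    """Return the required fields that are absent or None in a completed task."""
    return [field for field in required if payload.get(field) is None]
```

Running both against the new store and the legacy store for the same task gives a quick parity signal before a phase is promoted.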
## 6. Rollback triggers
Rollback immediately if any of these happen:
- task status disappears after backend restart
- WebSocket clients stop receiving progress updates
- a completed analysis loses its `decision` or `confidence` fields
- degraded single-lane signals are reclassified incorrectly
- report export or historical report retrieval cannot find prior artifacts
## 7. Explicit non-goals during migration
- do not rewrite `orchestrator/signals.py` merge math as part of boundary migration
- do not rework provider/model selection semantics in the same change set
- do not force frontend redesign before contract shadowing proves parity
- do not implement a new strategy layer inside the application service
## 8. Minimal rollback playbook
If production or local verification fails after migration cutover:
1. disable application-service read path
2. disable dual-write to new store if it corrupts parity checks
3. restore legacy route-local serializers
4. keep generated comparison logs/artifacts for diff analysis
5. re-run backend tests and one end-to-end manual analysis flow
## 9. Review checklist
A migration plan is acceptable only if it:
- preserves orchestrator ownership of quant+LLM merge semantics
- introduces feature-flagged cutover points
- supports dual-read/dual-write only at application/persistence boundary
- provides a one-step rollback path at each release phase
## 10. Maintainer note
When updating migration status, keep these three documents aligned:
- `docs/architecture/application-boundary.md`
- `docs/contracts/result-contract-v1alpha1.md`
- `docs/architecture/research-provenance.md`
The first two describe backend/application convergence; the third describes orchestrator-side research degradation and profiling semantics that now feed those contracts.