TradingAgents/docs/specs/MarketData/spec.md

352 lines
13 KiB
Markdown

# MarketData Domain - PostgreSQL Migration Specification
## Feature Overview
**Feature**: MarketData Domain PostgreSQL Migration
**Status**: Migration project (85% complete → PostgreSQL integration)
**Priority**: High (foundational infrastructure for AI agents)
This specification defines the migration of the MarketData domain from CSV-based storage to PostgreSQL + TimescaleDB + pgvectorscale integration, while preserving 100% API compatibility and delivering 10x performance improvements for AI agent operations.
## User Stories
### Primary User Story
> As a Dagster pipeline and AI Agent, I want to collect daily OHLC data from yfinance, insider data from FinnHub, and fundamental data from FinnHub with PostgreSQL + TimescaleDB storage, so that agents have high-performance, RAG-enhanced market data access for comprehensive trading analysis.
### Supporting User Stories
**Agent Performance**
- As an AI Agent, I want market data queries to complete in under 100ms, so that real-time trading analysis is efficient
- As a Technical Analyst Agent, I want vector similarity search for historical patterns, so that pattern-based trading decisions are context-aware
**Data Pipeline Reliability**
- As a Dagster pipeline, I want atomic data ingestion with PostgreSQL ACID transactions, so that data integrity is guaranteed during bulk operations
- As a Risk Management Agent, I want comprehensive audit trails for all market data access, so that trading decisions are fully traceable
## Acceptance Criteria
### Migration Compatibility
- **AC1**: GIVEN the MarketData domain migration WHEN PostgreSQL + TimescaleDB integration is complete THEN all existing MarketDataService APIs remain 100% compatible with 10x performance improvement
### Data Collection Pipeline
- **AC2**: GIVEN daily market data collection WHEN Dagster pipelines execute THEN OHLC data from yfinance and insider/fundamental data from FinnHub are stored in TimescaleDB hypertables
### Performance Requirements
- **AC3**: GIVEN historical market data queries WHEN AI agents request technical analysis THEN responses are delivered within 100ms using TimescaleDB time-series optimization
- **AC4**: GIVEN technical analysis requests WHEN agents query indicators THEN all 20 existing TA-Lib indicators are preserved with PostgreSQL-backed data access
### RAG Integration
- **AC5**: GIVEN RAG-powered analysis WHEN agents search for historical patterns THEN vector similarity search using pgvectorscale returns relevant market conditions within 200ms
### Scalability
- **AC6**: GIVEN concurrent agent operations WHEN multiple agents access market data THEN PostgreSQL async operations support concurrent reads without file system limitations
### Data Quality
- **AC7**: GIVEN data quality requirements WHEN market data is collected THEN comprehensive validation, audit trails, and error handling maintain data integrity with PostgreSQL ACID transactions
## Business Rules
### API Preservation
- **BR1**: Preserve 100% API compatibility with existing MarketDataService for seamless migration
- **BR2**: Maintain all existing method signatures in FundamentalDataService and InsiderDataService
### Data Collection Standards
- **BR3**: Daily automated collection from yfinance (OHLC) and FinnHub (insider + fundamentals) via Dagster pipelines
- **BR4**: FinnHub API rate limiting compliance with proper backoff strategies
- **BR5**: Graceful degradation when external APIs are unavailable
### Database Architecture
- **BR6**: TimescaleDB hypertables for market_data, fundamental_data, and insider_data tables
- **BR7**: Vector embeddings generation for technical analysis patterns using pgvectorscale
### Performance Standards
- **BR8**: Sub-100ms query performance for common market data operations
- **BR9**: Data retention policy: 10 years for OHLC, 5 years for fundamentals, 3 years for insider data
### Audit and Compliance
- **BR10**: Comprehensive audit logging for all data collection and agent queries
## Technical Implementation Details
### Architecture Pattern
**Router → Service → Repository → Entity → Database**
The migration preserves the existing service interfaces while upgrading the underlying data persistence layer.
### Database Schema Design
#### TimescaleDB Hypertables
```sql
-- Market Data (OHLC)
CREATE TABLE market_data (
id SERIAL PRIMARY KEY,
symbol VARCHAR(10) NOT NULL,
date TIMESTAMPTZ NOT NULL,
open DECIMAL(12,4),
high DECIMAL(12,4),
low DECIMAL(12,4),
close DECIMAL(12,4),
adj_close DECIMAL(12,4),
volume BIGINT,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
SELECT create_hypertable('market_data', 'date', chunk_time_interval => INTERVAL '1 month');
-- Fundamental Data
CREATE TABLE fundamental_data (
id SERIAL PRIMARY KEY,
symbol VARCHAR(10) NOT NULL,
report_date TIMESTAMPTZ NOT NULL,
period_type VARCHAR(20), -- annual, quarterly
metric_name VARCHAR(100),
metric_value DECIMAL(20,4),
created_at TIMESTAMPTZ DEFAULT NOW()
);
SELECT create_hypertable('fundamental_data', 'report_date', chunk_time_interval => INTERVAL '3 months');
-- Insider Data
CREATE TABLE insider_data (
id SERIAL PRIMARY KEY,
symbol VARCHAR(10) NOT NULL,
transaction_date TIMESTAMPTZ NOT NULL,
person_name VARCHAR(200),
position VARCHAR(100),
transaction_type VARCHAR(50),
shares BIGINT,
price DECIMAL(12,4),
value DECIMAL(20,4),
created_at TIMESTAMPTZ DEFAULT NOW()
);
SELECT create_hypertable('insider_data', 'transaction_date', chunk_time_interval => INTERVAL '1 month');
```
#### Vector Storage for RAG
```sql
-- Technical Indicators with Vector Embeddings
CREATE TABLE technical_indicators (
id SERIAL PRIMARY KEY,
symbol VARCHAR(10) NOT NULL,
date TIMESTAMPTZ NOT NULL,
indicator_name VARCHAR(50),
indicator_value DECIMAL(12,6),
pattern_embedding vector(384), -- OpenRouter embeddings
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ON technical_indicators USING hnsw (pattern_embedding vector_cosine_ops);
```
### SQLAlchemy Entity Models
```python
# MarketDataEntity
@dataclass
class MarketDataEntity:
symbol: str
date: datetime
open: Optional[Decimal] = None
high: Optional[Decimal] = None
low: Optional[Decimal] = None
close: Optional[Decimal] = None
adj_close: Optional[Decimal] = None
volume: Optional[int] = None
id: Optional[int] = None
created_at: Optional[datetime] = None
updated_at: Optional[datetime] = None
@classmethod
def from_yfinance_data(cls, symbol: str, row: pd.Series) -> "MarketDataEntity":
"""Convert yfinance data to entity"""
def to_database_record(self) -> dict:
"""Convert entity to database record"""
def validate(self) -> None:
"""Validate entity data integrity"""
```
### Repository Migration
```python
class MarketDataRepository:
"""PostgreSQL + TimescaleDB repository with async operations"""
def __init__(self, database_manager: DatabaseManager):
self.db = database_manager
async def get_ohlc_data(
self,
symbol: str,
start_date: datetime,
end_date: datetime
) -> List[MarketDataEntity]:
"""Retrieve OHLC data with TimescaleDB optimization"""
query = """
SELECT * FROM market_data
WHERE symbol = $1 AND date BETWEEN $2 AND $3
ORDER BY date DESC
"""
rows = await self.db.fetch(query, symbol, start_date, end_date)
return [MarketDataEntity.from_database_record(row) for row in rows]
async def bulk_upsert_market_data(
self,
entities: List[MarketDataEntity]
) -> int:
"""Atomic bulk upsert for Dagster pipelines"""
async def find_similar_patterns(
self,
pattern_embedding: List[float],
limit: int = 10
) -> List[Dict]:
"""RAG-powered pattern matching using pgvectorscale"""
query = """
SELECT symbol, date, indicator_name, indicator_value,
pattern_embedding <=> $1 as similarity
FROM technical_indicators
ORDER BY pattern_embedding <=> $1
LIMIT $2
"""
return await self.db.fetch(query, pattern_embedding, limit)
```
### Service Compatibility Layer
```python
class MarketDataService:
"""Preserved API with PostgreSQL backend"""
def __init__(self, repository: MarketDataRepository, yfinance_client: YFinanceClient):
self.repository = repository
self.yfinance_client = yfinance_client
async def get_stock_data(self, symbol: str, period: str = "1y") -> pd.DataFrame:
"""100% compatible with existing API signature"""
# Implementation using PostgreSQL repository
async def calculate_technical_indicators(
self,
symbol: str,
indicators: List[str]
) -> Dict[str, np.ndarray]:
"""Preserve all 20 TA-Lib indicators with PostgreSQL data"""
async def get_trading_style_preset(self, style: str) -> Dict:
"""Preserved trading style presets with enhanced performance"""
```
### Vector RAG Enhancement
```python
class MarketDataRAGService:
"""RAG-powered market analysis enhancement"""
async def find_historical_patterns(
self,
current_indicators: Dict[str, float],
lookback_days: int = 30
) -> List[Dict]:
"""Vector similarity search for historical patterns"""
async def generate_pattern_embedding(
self,
indicator_values: Dict[str, float]
) -> List[float]:
"""Generate embeddings using OpenRouter for pattern matching"""
```
## Migration Components
### Phase 1: Database Schema & Entities
1. **SQLAlchemy Entity Models**
- MarketDataEntity for OHLC data
- FundamentalDataEntity for financial statements
- InsiderDataEntity for SEC transactions
- TechnicalIndicatorEntity for calculated values
2. **TimescaleDB Setup**
- Hypertable creation for time-series optimization
- Proper indexing strategy
- Vector extension configuration
### Phase 2: Repository Migration
1. **Async PostgreSQL Operations**
- Follow news domain patterns for consistency
- Connection pooling and transaction management
- Error handling and retry logic
2. **Data Migration Scripts**
- CSV to PostgreSQL data transfer
- Data validation and integrity checks
- Performance optimization
### Phase 3: Service Preservation
1. **API Compatibility**
- Maintain all existing method signatures
- Preserve return types and data formats
- Performance optimization through PostgreSQL
2. **Vector RAG Integration**
- Pattern embedding generation
- Similarity search capabilities
- Historical context enhancement
### Phase 4: Testing & Integration
1. **Comprehensive Testing**
- Real PostgreSQL database for repository tests
- Preserved pytest-vcr for API clients
- Service compatibility validation
2. **Agent Integration**
- AgentToolkit RAG capabilities
- Performance benchmarking
- Concurrent access testing
## Dependencies
### Ready Dependencies
- **YFinanceClient and FinnhubClient**: Fully implemented and tested
- **PostgreSQL + TimescaleDB + pgvectorscale**: Database infrastructure established
- **News domain PostgreSQL patterns**: Migration templates available
- **DatabaseManager**: Async operations and connection management ready
- **OpenRouter configuration**: Vector embeddings generation available
### Planned Dependencies
- **Dagster orchestration**: Framework for daily data collection pipelines
## Success Criteria
### Performance Metrics
- **10x query performance improvement** over CSV-based storage
- **Sub-100ms market data operations** for common agent queries
- **Sub-200ms RAG queries** for vector similarity search
- **Support for 500+ tickers** with concurrent agent access
### Compatibility Standards
- **100% existing API preservation** without breaking changes
- **Seamless migration** without agent disruption
- **Efficient bulk data ingestion** for Dagster pipelines
### Quality Assurance
- **85%+ test coverage maintained** across all components
- **Comprehensive data validation** and audit trails
- **PostgreSQL ACID transactions** for data integrity
## Architecture Alignment
This migration aligns with the multi-agent trading framework vision by providing:
1. **High-performance market data foundation** for sophisticated agent analysis
2. **RAG-powered historical context** for pattern-based trading decisions
3. **Scalable concurrent access** supporting multiple agents simultaneously
4. **Comprehensive audit trails** for regulatory compliance and risk management
5. **Time-series optimization** for efficient technical analysis operations
The migration follows established news domain patterns to ensure architectural consistency across the entire TradingAgents framework.