# MarketData Domain - PostgreSQL Migration Specification

## Feature Overview

**Feature**: MarketData Domain PostgreSQL Migration
**Status**: Migration project (85% complete; PostgreSQL integration remaining)
**Priority**: High (foundational infrastructure for AI agents)

This specification defines the migration of the MarketData domain from CSV-based storage to PostgreSQL + TimescaleDB + pgvectorscale, preserving 100% API compatibility while delivering a 10x performance improvement for AI agent operations.

## User Stories

### Primary User Story

> As a Dagster pipeline and AI Agent, I want to collect daily OHLC data from yfinance and insider and fundamental data from FinnHub, stored in PostgreSQL + TimescaleDB, so that agents have high-performance, RAG-enhanced market data access for comprehensive trading analysis.

### Supporting User Stories

**Agent Performance**

- As an AI Agent, I want market data queries to complete in under 100ms, so that real-time trading analysis is efficient
- As a Technical Analyst Agent, I want vector similarity search over historical patterns, so that pattern-based trading decisions are context-aware

**Data Pipeline Reliability**

- As a Dagster pipeline, I want atomic data ingestion with PostgreSQL ACID transactions, so that data integrity is guaranteed during bulk operations
- As a Risk Management Agent, I want comprehensive audit trails for all market data access, so that trading decisions are fully traceable

## Acceptance Criteria

### Migration Compatibility

- **AC1**: GIVEN the MarketData domain migration WHEN PostgreSQL + TimescaleDB integration is complete THEN all existing MarketDataService APIs remain 100% compatible, with a 10x performance improvement

### Data Collection Pipeline

- **AC2**: GIVEN daily market data collection WHEN Dagster pipelines execute THEN OHLC data from yfinance and insider/fundamental data from FinnHub are stored in TimescaleDB hypertables

### Performance Requirements

- **AC3**: GIVEN historical market data queries WHEN AI agents request technical analysis THEN responses are delivered within 100ms using TimescaleDB time-series optimization
- **AC4**: GIVEN technical analysis requests WHEN agents query indicators THEN all 20 existing TA-Lib indicators are preserved with PostgreSQL-backed data access

### RAG Integration

- **AC5**: GIVEN RAG-powered analysis WHEN agents search for historical patterns THEN vector similarity search using pgvectorscale returns relevant market conditions within 200ms

### Scalability

- **AC6**: GIVEN concurrent agent operations WHEN multiple agents access market data THEN PostgreSQL async operations support concurrent reads without file-system limitations

### Data Quality

- **AC7**: GIVEN data quality requirements WHEN market data is collected THEN comprehensive validation, audit trails, and error handling maintain data integrity under PostgreSQL ACID transactions

## Business Rules

### API Preservation

- **BR1**: Preserve 100% API compatibility with the existing MarketDataService for seamless migration
- **BR2**: Maintain all existing method signatures in FundamentalDataService and InsiderDataService

### Data Collection Standards

- **BR3**: Daily automated collection from yfinance (OHLC) and FinnHub (insider + fundamentals) via Dagster pipelines
- **BR4**: FinnHub API rate-limit compliance with proper backoff strategies
- **BR5**: Graceful degradation when external APIs are unavailable

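BR4's backoff requirement can be sketched as a capped exponential delay schedule with jitter. This is a minimal illustration under assumed parameters (`base`, `cap`), not FinnHub client code; the actual retry policy lives in the FinnhubClient.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Capped exponential backoff with full jitter for rate-limited retries.

    `attempt` is 0-based; the nominal delay doubles each retry up to `cap`
    seconds, and a random jitter in [0, delay] avoids synchronized retries
    across concurrent pipeline runs. base/cap values are assumptions.
    """
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, delay)
```

A caller would sleep for `backoff_delay(attempt)` after each 429 response and give up after a fixed attempt budget, falling back to the graceful degradation path in BR5.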
### Database Architecture

- **BR6**: TimescaleDB hypertables for the market_data, fundamental_data, and insider_data tables
- **BR7**: Vector embeddings generated for technical analysis patterns using pgvectorscale

### Performance Standards

- **BR8**: Sub-100ms query performance for common market data operations
- **BR9**: Data retention policy: 10 years for OHLC, 5 years for fundamentals, 3 years for insider data

### Audit and Compliance

- **BR10**: Comprehensive audit logging for all data collection and agent queries

## Technical Implementation Details

### Architecture Pattern

**Router → Service → Repository → Entity → Database**

The migration preserves the existing service interfaces while upgrading the underlying data persistence layer.

### Database Schema Design

#### TimescaleDB Hypertables

```sql
-- Market Data (OHLC)
CREATE TABLE market_data (
    id SERIAL,
    symbol VARCHAR(10) NOT NULL,
    date TIMESTAMPTZ NOT NULL,
    open DECIMAL(12,4),
    high DECIMAL(12,4),
    low DECIMAL(12,4),
    close DECIMAL(12,4),
    adj_close DECIMAL(12,4),
    volume BIGINT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    -- TimescaleDB requires the partitioning column in every unique
    -- constraint, so the primary key is composite rather than id alone.
    PRIMARY KEY (id, date)
);

SELECT create_hypertable('market_data', 'date', chunk_time_interval => INTERVAL '1 month');

-- Fundamental Data
CREATE TABLE fundamental_data (
    id SERIAL,
    symbol VARCHAR(10) NOT NULL,
    report_date TIMESTAMPTZ NOT NULL,
    period_type VARCHAR(20), -- annual, quarterly
    metric_name VARCHAR(100),
    metric_value DECIMAL(20,4),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (id, report_date)
);

SELECT create_hypertable('fundamental_data', 'report_date', chunk_time_interval => INTERVAL '3 months');

-- Insider Data
CREATE TABLE insider_data (
    id SERIAL,
    symbol VARCHAR(10) NOT NULL,
    transaction_date TIMESTAMPTZ NOT NULL,
    person_name VARCHAR(200),
    position VARCHAR(100),
    transaction_type VARCHAR(50),
    shares BIGINT,
    price DECIMAL(12,4),
    value DECIMAL(20,4),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (id, transaction_date)
);

SELECT create_hypertable('insider_data', 'transaction_date', chunk_time_interval => INTERVAL '1 month');
```

#### Vector Storage for RAG

```sql
-- Technical Indicators with Vector Embeddings
CREATE TABLE technical_indicators (
    id SERIAL PRIMARY KEY,
    symbol VARCHAR(10) NOT NULL,
    date TIMESTAMPTZ NOT NULL,
    indicator_name VARCHAR(50),
    indicator_value DECIMAL(12,6),
    pattern_embedding vector(384), -- OpenRouter embeddings
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON technical_indicators USING hnsw (pattern_embedding vector_cosine_ops);
```

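The `vector_cosine_ops` index serves pgvector's `<=>` operator, which returns cosine *distance* (0 for identical direction, up to 2 for opposite). A pure-Python sketch of the same metric is useful for unit-testing ranking logic without a database; this is an illustration, not part of the repository code:

```python
import math
from typing import List


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Cosine distance as computed by pgvector's `<=>`: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Ordering candidate patterns by this value ascending reproduces the `ORDER BY pattern_embedding <=> $1` ranking.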
### SQLAlchemy Entity Models

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional

import pandas as pd


# MarketDataEntity
@dataclass
class MarketDataEntity:
    symbol: str
    date: datetime
    open: Optional[Decimal] = None
    high: Optional[Decimal] = None
    low: Optional[Decimal] = None
    close: Optional[Decimal] = None
    adj_close: Optional[Decimal] = None
    volume: Optional[int] = None
    id: Optional[int] = None
    created_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None

    @classmethod
    def from_yfinance_data(cls, symbol: str, row: pd.Series) -> "MarketDataEntity":
        """Convert yfinance data to entity"""

    def to_database_record(self) -> dict:
        """Convert entity to database record"""

    def validate(self) -> None:
        """Validate entity data integrity"""
```

### Repository Migration

```python
class MarketDataRepository:
    """PostgreSQL + TimescaleDB repository with async operations"""

    def __init__(self, database_manager: DatabaseManager):
        self.db = database_manager

    async def get_ohlc_data(
        self,
        symbol: str,
        start_date: datetime,
        end_date: datetime,
    ) -> List[MarketDataEntity]:
        """Retrieve OHLC data with TimescaleDB optimization"""
        query = """
            SELECT * FROM market_data
            WHERE symbol = $1 AND date BETWEEN $2 AND $3
            ORDER BY date DESC
        """
        rows = await self.db.fetch(query, symbol, start_date, end_date)
        return [MarketDataEntity.from_database_record(row) for row in rows]

    async def bulk_upsert_market_data(
        self,
        entities: List[MarketDataEntity],
    ) -> int:
        """Atomic bulk upsert for Dagster pipelines"""

    async def find_similar_patterns(
        self,
        pattern_embedding: List[float],
        limit: int = 10,
    ) -> List[Dict]:
        """RAG-powered pattern matching using pgvectorscale"""
        # Note: <=> returns cosine distance (smaller is more similar),
        # so the nearest patterns sort first.
        query = """
            SELECT symbol, date, indicator_name, indicator_value,
                   pattern_embedding <=> $1 AS distance
            FROM technical_indicators
            ORDER BY pattern_embedding <=> $1
            LIMIT $2
        """
        return await self.db.fetch(query, pattern_embedding, limit)
```

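The `bulk_upsert_market_data` stub implies an `INSERT ... ON CONFLICT DO UPDATE` statement. A hedged sketch of building that statement for asyncpg-style `$n` placeholders is shown below; it assumes a unique index on `(symbol, date)`, which the migration would need to create alongside the composite primary key, and the helper name is hypothetical:

```python
from typing import List


def build_upsert_sql(columns: List[str], conflict_cols: List[str]) -> str:
    """Build an asyncpg-style INSERT ... ON CONFLICT DO UPDATE for market_data.

    Assumes a unique index on `conflict_cols` (e.g. (symbol, date)) exists;
    non-key columns are overwritten with the incoming values on conflict.
    """
    placeholders = ", ".join(f"${i + 1}" for i in range(len(columns)))
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c not in conflict_cols
    )
    return (
        f"INSERT INTO market_data ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(conflict_cols)}) DO UPDATE SET {updates}"
    )
```

The repository would execute this once per entity batch via `executemany` inside a single transaction, which is what makes the Dagster ingestion atomic.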
### Service Compatibility Layer

```python
class MarketDataService:
    """Preserved API with PostgreSQL backend"""

    def __init__(self, repository: MarketDataRepository, yfinance_client: YFinanceClient):
        self.repository = repository
        self.yfinance_client = yfinance_client

    async def get_stock_data(self, symbol: str, period: str = "1y") -> pd.DataFrame:
        """100% compatible with the existing API signature"""
        # Implementation using the PostgreSQL repository

    async def calculate_technical_indicators(
        self,
        symbol: str,
        indicators: List[str],
    ) -> Dict[str, np.ndarray]:
        """Preserve all 20 TA-Lib indicators with PostgreSQL data"""

    async def get_trading_style_preset(self, style: str) -> Dict:
        """Preserved trading style presets with enhanced performance"""
```

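To serve `get_stock_data(symbol, period="1y")` from the repository, the yfinance-style period string must be translated into a date range for `get_ohlc_data`. A hypothetical sketch of that translation (the helper name and the day approximations are assumptions, not the service's actual implementation):

```python
from datetime import datetime, timedelta


def period_to_start(period: str, now: datetime) -> datetime:
    """Translate a yfinance-style period string ("5d", "3mo", "1y") into a
    query start date, approximating months as 30 days and years as 365.

    Hypothetical helper; the real service may resolve periods differently.
    """
    units = {"d": 1, "mo": 30, "y": 365}
    # Check longer suffixes first so "3mo" is not misread as ending in "o".
    for suffix, days in sorted(units.items(), key=lambda u: -len(u[0])):
        if period.endswith(suffix):
            count = int(period[: -len(suffix)])
            return now - timedelta(days=count * days)
    raise ValueError(f"Unsupported period: {period!r}")
```

The service would then call `repository.get_ohlc_data(symbol, period_to_start(period, now), now)` and assemble the DataFrame in the existing column layout.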
### Vector RAG Enhancement

```python
class MarketDataRAGService:
    """RAG-powered market analysis enhancement"""

    async def find_historical_patterns(
        self,
        current_indicators: Dict[str, float],
        lookback_days: int = 30,
    ) -> List[Dict]:
        """Vector similarity search for historical patterns"""

    async def generate_pattern_embedding(
        self,
        indicator_values: Dict[str, float],
    ) -> List[float]:
        """Generate embeddings using OpenRouter for pattern matching"""
```

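For `generate_pattern_embedding` to be useful for similarity search, identical market states must embed identically, so the indicator dict should be serialized deterministically before it is sent to the embedding model. A hypothetical pre-processing sketch (the helper and rounding precision are assumptions; the actual OpenRouter request format is defined elsewhere):

```python
from typing import Dict


def serialize_indicators(indicator_values: Dict[str, float]) -> str:
    """Render an indicator snapshot as a deterministic text line for embedding.

    Keys are sorted and values rounded so that equivalent market states
    always produce the same embedding input. Hypothetical pre-processing;
    the embedding call itself is not shown.
    """
    parts = [f"{name}={round(value, 4)}" for name, value in sorted(indicator_values.items())]
    return " ".join(parts)
```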
## Migration Components

### Phase 1: Database Schema & Entities

1. **SQLAlchemy Entity Models**
   - MarketDataEntity for OHLC data
   - FundamentalDataEntity for financial statements
   - InsiderDataEntity for SEC transactions
   - TechnicalIndicatorEntity for calculated values

2. **TimescaleDB Setup**
   - Hypertable creation for time-series optimization
   - Proper indexing strategy
   - Vector extension configuration

### Phase 2: Repository Migration

1. **Async PostgreSQL Operations**
   - Follow news domain patterns for consistency
   - Connection pooling and transaction management
   - Error handling and retry logic

2. **Data Migration Scripts**
   - CSV to PostgreSQL data transfer
   - Data validation and integrity checks
   - Performance optimization

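The CSV-to-PostgreSQL transfer in Phase 2 can be sketched as a chunked read that feeds the bulk upsert in fixed-size batches, keeping memory flat for multi-year OHLC files. This is an illustrative sketch; the column names and batch size are assumptions about the legacy CSV layout:

```python
import io
from typing import Iterator, List

import pandas as pd


def read_csv_batches(source, batch_size: int = 5000) -> Iterator[List[dict]]:
    """Stream a legacy CSV file as batches of row dicts for bulk upsert.

    Each yielded batch can be converted to entities and handed to
    MarketDataRepository.bulk_upsert_market_data inside one transaction,
    so a failed batch rolls back cleanly.
    """
    for chunk in pd.read_csv(source, chunksize=batch_size, parse_dates=["date"]):
        yield chunk.to_dict(orient="records")
```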
### Phase 3: Service Preservation

1. **API Compatibility**
   - Maintain all existing method signatures
   - Preserve return types and data formats
   - Performance optimization through PostgreSQL

2. **Vector RAG Integration**
   - Pattern embedding generation
   - Similarity search capabilities
   - Historical context enhancement

### Phase 4: Testing & Integration

1. **Comprehensive Testing**
   - Real PostgreSQL database for repository tests
   - Preserved pytest-vcr for API clients
   - Service compatibility validation

2. **Agent Integration**
   - AgentToolkit RAG capabilities
   - Performance benchmarking
   - Concurrent access testing

## Dependencies

### Ready Dependencies

- **YFinanceClient and FinnhubClient**: fully implemented and tested
- **PostgreSQL + TimescaleDB + pgvectorscale**: database infrastructure established
- **News domain PostgreSQL patterns**: migration templates available
- **DatabaseManager**: async operations and connection management ready
- **OpenRouter configuration**: vector embedding generation available

### Planned Dependencies

- **Dagster orchestration**: framework for daily data collection pipelines

## Success Criteria

### Performance Metrics

- **10x query performance improvement** over CSV-based storage
- **Sub-100ms market data operations** for common agent queries
- **Sub-200ms RAG queries** for vector similarity search
- **Support for 500+ tickers** with concurrent agent access

### Compatibility Standards

- **100% existing API preservation** without breaking changes
- **Seamless migration** without agent disruption
- **Efficient bulk data ingestion** for Dagster pipelines

### Quality Assurance

- **85%+ test coverage maintained** across all components
- **Comprehensive data validation** and audit trails
- **PostgreSQL ACID transactions** for data integrity

## Architecture Alignment

This migration aligns with the multi-agent trading framework vision by providing:

1. **High-performance market data foundation** for sophisticated agent analysis
2. **RAG-powered historical context** for pattern-based trading decisions
3. **Scalable concurrent access** supporting multiple agents simultaneously
4. **Comprehensive audit trails** for regulatory compliance and risk management
5. **Time-series optimization** for efficient technical analysis operations

The migration follows established news domain patterns to ensure architectural consistency across the entire TradingAgents framework.