# MarketData Domain: PostgreSQL Migration Technical Design
## Project Overview
**Project Type**: Migration, 85% complete (CSV → PostgreSQL + TimescaleDB + pgvectorscale)
**Business Impact**: 10x performance improvement with sub-100ms query times and RAG capabilities
**API Compatibility**: 100% preservation of existing MarketDataService, FundamentalDataService, InsiderDataService APIs
**Data Volume**: 10 years OHLC, 5 years fundamentals, 3 years insider data for 500+ tickers
## Architecture Overview
### Current State (85% Complete)
```
CSV File Storage (./data/market_data/)
├── OHLC data in CSV files
├── Fundamental data in CSV files
├── Insider transaction data in CSV files
└── Manual file-based operations
```
### Target Architecture
```
PostgreSQL + TimescaleDB + pgvectorscale
├── TimescaleDB hypertables for time-series optimization
├── pgvectorscale for RAG vector embeddings
├── Async PostgreSQL operations for concurrent agent access
├── Dagster automation for daily data collection
└── 100% API-compatible service layer
```
### Component Relationships
```
External APIs (YFinance + FinnHub) → Dagster Pipeline → PostgreSQL Storage → Repository Layer → Service Layer → Agents
                                                             └─ pgvectorscale (RAG vector search)
```
## Domain Model
### MarketDataEntity
**Purpose**: OHLC price data with TimescaleDB optimization and vector embeddings
```python
import pandas as pd
from sqlalchemy import Column, String, DateTime, Numeric, BigInteger, Integer
from sqlalchemy.orm import declarative_base
from pgvector.sqlalchemy import Vector  # Python-side vector(N) column type; pgvectorscale indexes it server-side

Base = declarative_base()


class MarketDataEntity(Base):
    __tablename__ = 'market_data'
    # Interpreted by the TimescaleDB SQLAlchemy helper when creating the hypertable
    __table_args__ = {
        'timescaledb': {
            'time_column_name': 'timestamp',
            'chunk_time_interval': '1 day'
        }
    }

    id = Column(Integer, primary_key=True)
    symbol = Column(String(10), nullable=False, index=True)
    timestamp = Column(DateTime(timezone=True), nullable=False, index=True)
    open_price = Column(Numeric(10, 2), nullable=False)
    high_price = Column(Numeric(10, 2), nullable=False)
    low_price = Column(Numeric(10, 2), nullable=False)
    close_price = Column(Numeric(10, 2), nullable=False)
    volume = Column(BigInteger, nullable=False)  # matches BIGINT in the SQL schema
    adjusted_close = Column(Numeric(10, 2), nullable=False)

    # Vector embeddings for RAG
    technical_pattern_embedding = Column(Vector(384))  # Technical analysis patterns
    price_movement_embedding = Column(Vector(384))     # Price movement patterns

    # Business rules
    @classmethod
    def from_csv_record(cls, csv_data: dict) -> 'MarketDataEntity':
        """Transform a CSV record into an entity"""
        return cls(
            symbol=csv_data['symbol'],
            timestamp=pd.to_datetime(csv_data['timestamp']),
            open_price=csv_data['open'],
            high_price=csv_data['high'],
            low_price=csv_data['low'],
            close_price=csv_data['close'],
            volume=csv_data['volume'],
            adjusted_close=csv_data['adj_close']
        )

    def to_service_response(self) -> dict:
        """Transform entity to the service API format"""
        return {
            'symbol': self.symbol,
            'timestamp': self.timestamp.isoformat(),
            'open': float(self.open_price),
            'high': float(self.high_price),
            'low': float(self.low_price),
            'close': float(self.close_price),
            'volume': self.volume,
            'adj_close': float(self.adjusted_close)
        }

    def validate(self) -> bool:
        """Validate OHLC business rules"""
        return (
            self.high_price >= self.low_price and
            self.high_price >= self.open_price and
            self.high_price >= self.close_price and
            self.low_price <= self.open_price and
            self.low_price <= self.close_price and
            self.volume >= 0
        )
```
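The OHLC invariants in `validate()` can be sanity-checked in isolation. This standalone restatement (a hypothetical helper, not part of the service API) makes the rule explicit:

```python
def ohlc_is_valid(open_, high, low, close, volume) -> bool:
    """Mirror of MarketDataEntity.validate(): the high must bound every other
    price from above, the low from below, and volume cannot be negative."""
    return (
        high >= low and
        high >= open_ and high >= close and
        low <= open_ and low <= close and
        volume >= 0
    )
```

A bar whose recorded high sits below its low (a common symptom of column-swapped CSV rows) is rejected rather than migrated.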
### FundamentalDataEntity
**Purpose**: Financial statement data with PostgreSQL storage
```python
class FundamentalDataEntity(Base):
    __tablename__ = 'fundamental_data'

    id = Column(Integer, primary_key=True)
    symbol = Column(String(10), nullable=False, index=True)
    report_date = Column(DateTime(timezone=True), nullable=False, index=True)
    period_type = Column(String(10), nullable=False)  # 'Q' or 'Y'

    # Balance Sheet
    total_assets = Column(Numeric(15, 2))
    total_liabilities = Column(Numeric(15, 2))
    shareholders_equity = Column(Numeric(15, 2))
    # Income Statement
    total_revenue = Column(Numeric(15, 2))
    net_income = Column(Numeric(15, 2))
    earnings_per_share = Column(Numeric(8, 4))
    # Cash Flow
    operating_cash_flow = Column(Numeric(15, 2))
    capital_expenditures = Column(Numeric(15, 2))
    free_cash_flow = Column(Numeric(15, 2))
    # Ratios (calculated)
    pe_ratio = Column(Numeric(8, 2))
    pb_ratio = Column(Numeric(8, 2))
    roe = Column(Numeric(8, 4))
    roa = Column(Numeric(8, 4))
    debt_to_equity = Column(Numeric(8, 4))
    # Vector embeddings for RAG
    financial_health_embedding = Column(Vector(384))

    @classmethod
    def from_finnhub_response(cls, finnhub_data: dict) -> 'FundamentalDataEntity':
        """Transform a FinnHub API response into an entity"""
        return cls(
            symbol=finnhub_data['symbol'],
            report_date=pd.to_datetime(finnhub_data['reportedDate']),
            period_type=finnhub_data['period'],
            total_assets=finnhub_data.get('totalAssets'),
            total_revenue=finnhub_data.get('totalRevenue'),
            # ... map all fields
        )

    def calculate_ratios(self, current_price: float, shares_outstanding: float | None = None):
        """Calculate financial ratios. Numeric columns hold Decimal, so cast before
        mixing with float; P/B needs shares outstanding to derive book value per share."""
        if self.earnings_per_share and self.earnings_per_share > 0:
            self.pe_ratio = current_price / float(self.earnings_per_share)
        if shares_outstanding and self.shareholders_equity and self.shareholders_equity > 0:
            book_value_per_share = float(self.shareholders_equity) / shares_outstanding
            self.pb_ratio = current_price / book_value_per_share
        if self.shareholders_equity and self.net_income:
            self.roe = float(self.net_income) / float(self.shareholders_equity)
        if self.total_assets and self.net_income:
            self.roa = float(self.net_income) / float(self.total_assets)
```
### InsiderDataEntity
**Purpose**: SEC insider transaction records with sentiment analysis
```python
class InsiderDataEntity(Base):
    __tablename__ = 'insider_data'

    id = Column(Integer, primary_key=True)
    symbol = Column(String(10), nullable=False, index=True)
    transaction_date = Column(DateTime(timezone=True), nullable=False, index=True)

    # Insider information
    insider_name = Column(String(200), nullable=False)
    insider_position = Column(String(100))
    # Transaction details
    transaction_type = Column(String(20), nullable=False)  # 'Buy' or 'Sell'
    shares_traded = Column(Integer, nullable=False)
    transaction_price = Column(Numeric(10, 2))
    shares_owned_after = Column(Integer)
    # Derived fields
    transaction_value = Column(Numeric(15, 2))  # shares_traded * transaction_price
    sentiment_score = Column(Numeric(3, 2))     # -1 to 1
    # Vector embeddings for RAG
    transaction_pattern_embedding = Column(Vector(384))

    @classmethod
    def from_finnhub_response(cls, finnhub_data: dict) -> 'InsiderDataEntity':
        """Transform FinnHub insider data into an entity"""
        entity = cls(
            symbol=finnhub_data['symbol'],
            transaction_date=pd.to_datetime(finnhub_data['transactionDate']),
            insider_name=finnhub_data['personName'],
            insider_position=finnhub_data.get('position'),
            transaction_type='Buy' if finnhub_data['change'] > 0 else 'Sell',
            shares_traded=abs(finnhub_data['change']),
            shares_owned_after=finnhub_data['currentShares']
        )
        entity.calculate_sentiment()
        return entity

    def calculate_sentiment(self):
        """Score sentiment from transaction direction, weighted by insider seniority"""
        base_score = 0.7 if self.transaction_type == 'Buy' else -0.7
        # Adjust based on position
        position = (self.insider_position or '').lower()
        if 'ceo' in position:
            base_score *= 1.2
        elif 'cfo' in position:
            base_score *= 1.1
        self.sentiment_score = max(-1.0, min(1.0, base_score))
```
### TechnicalIndicatorEntity
**Purpose**: Calculated TA-Lib indicator values with vector embeddings
```python
from typing import List


class TechnicalIndicatorEntity(Base):
    __tablename__ = 'technical_indicators'

    id = Column(Integer, primary_key=True)
    symbol = Column(String(10), nullable=False, index=True)
    timestamp = Column(DateTime(timezone=True), nullable=False, index=True)

    # Moving Averages
    sma_20 = Column(Numeric(10, 2))
    sma_50 = Column(Numeric(10, 2))
    ema_12 = Column(Numeric(10, 2))
    ema_26 = Column(Numeric(10, 2))
    # Momentum Indicators
    rsi_14 = Column(Numeric(5, 2))
    macd = Column(Numeric(10, 4))
    macd_signal = Column(Numeric(10, 4))
    macd_histogram = Column(Numeric(10, 4))
    # Volatility Indicators
    bollinger_upper = Column(Numeric(10, 2))
    bollinger_lower = Column(Numeric(10, 2))
    atr_14 = Column(Numeric(10, 4))
    # Volume Indicators
    obv = Column(Numeric(15, 0))
    volume_sma_20 = Column(Numeric(15, 0))
    # Pattern Recognition (TA-Lib candlestick functions return -100/0/+100 scores)
    pattern_doji = Column(Integer)
    pattern_hammer = Column(Integer)
    pattern_engulfing = Column(Integer)
    # Vector embeddings for RAG pattern matching
    indicator_pattern_embedding = Column(Vector(384))

    @classmethod
    def calculate_from_ohlc(cls, symbol: str, ohlc_data: pd.DataFrame) -> List['TechnicalIndicatorEntity']:
        """Calculate technical indicators from OHLC data"""
        import talib

        indicators = []
        sma_20 = talib.SMA(ohlc_data['close'], timeperiod=20)
        rsi_14 = talib.RSI(ohlc_data['close'], timeperiod=14)
        macd, macd_signal, macd_hist = talib.MACD(ohlc_data['close'])
        # ... calculate the remaining indicators

        for i, timestamp in enumerate(ohlc_data.index):
            if pd.notna(sma_20.iloc[i]):  # only create records once the lookback window is filled
                indicators.append(cls(
                    symbol=symbol,
                    timestamp=timestamp,
                    sma_20=sma_20.iloc[i],
                    rsi_14=rsi_14.iloc[i],
                    macd=macd.iloc[i],
                    macd_signal=macd_signal.iloc[i],
                    macd_histogram=macd_hist.iloc[i]
                    # ... set the remaining calculated values
                ))
        return indicators
```
## Database Design
### PostgreSQL + TimescaleDB + pgvectorscale Schema
```sql
-- Enable extensions
-- Enable extensions
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

-- Market Data (TimescaleDB hypertable)
CREATE TABLE market_data (
    id BIGSERIAL,
    symbol VARCHAR(10) NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    open_price DECIMAL(10,2) NOT NULL,
    high_price DECIMAL(10,2) NOT NULL,
    low_price DECIMAL(10,2) NOT NULL,
    close_price DECIMAL(10,2) NOT NULL,
    volume BIGINT NOT NULL,
    adjusted_close DECIMAL(10,2) NOT NULL,
    technical_pattern_embedding vector(384),
    price_movement_embedding vector(384),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    -- TimescaleDB requires the partitioning column in every unique constraint,
    -- so the primary key must include "timestamp"
    PRIMARY KEY (id, timestamp)
);

-- Convert to TimescaleDB hypertable
SELECT create_hypertable('market_data', 'timestamp', chunk_time_interval => INTERVAL '1 day');

-- Indexes for performance
CREATE INDEX idx_market_data_symbol_time ON market_data (symbol, timestamp DESC);
CREATE INDEX idx_market_data_symbol ON market_data (symbol);

-- Vector indexes for RAG (pgvectorscale StreamingDiskANN)
CREATE INDEX idx_market_data_technical_embedding
    ON market_data USING diskann (technical_pattern_embedding);
CREATE INDEX idx_market_data_price_embedding
    ON market_data USING diskann (price_movement_embedding);
-- Fundamental Data
CREATE TABLE fundamental_data (
    id SERIAL PRIMARY KEY,
    symbol VARCHAR(10) NOT NULL,
    report_date TIMESTAMPTZ NOT NULL,
    period_type VARCHAR(10) NOT NULL,
    -- Balance Sheet
    total_assets DECIMAL(15,2),
    total_liabilities DECIMAL(15,2),
    shareholders_equity DECIMAL(15,2),
    -- Income Statement
    total_revenue DECIMAL(15,2),
    net_income DECIMAL(15,2),
    earnings_per_share DECIMAL(8,4),
    -- Cash Flow
    operating_cash_flow DECIMAL(15,2),
    capital_expenditures DECIMAL(15,2),
    free_cash_flow DECIMAL(15,2),
    -- Ratios
    pe_ratio DECIMAL(8,2),
    pb_ratio DECIMAL(8,2),
    roe DECIMAL(8,4),
    roa DECIMAL(8,4),
    debt_to_equity DECIMAL(8,4),
    -- RAG embedding
    financial_health_embedding vector(384),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(symbol, report_date, period_type)
);

CREATE INDEX idx_fundamental_symbol_date ON fundamental_data (symbol, report_date DESC);
CREATE INDEX idx_fundamental_embedding ON fundamental_data USING diskann (financial_health_embedding);
-- Insider Data
CREATE TABLE insider_data (
    id SERIAL PRIMARY KEY,
    symbol VARCHAR(10) NOT NULL,
    transaction_date TIMESTAMPTZ NOT NULL,
    insider_name VARCHAR(200) NOT NULL,
    insider_position VARCHAR(100),
    transaction_type VARCHAR(20) NOT NULL,
    shares_traded INTEGER NOT NULL,
    transaction_price DECIMAL(10,2),
    shares_owned_after INTEGER,
    transaction_value DECIMAL(15,2),
    sentiment_score DECIMAL(3,2),
    transaction_pattern_embedding vector(384),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_insider_symbol_date ON insider_data (symbol, transaction_date DESC);
CREATE INDEX idx_insider_embedding ON insider_data USING diskann (transaction_pattern_embedding);
-- Technical Indicators (TimescaleDB hypertable)
CREATE TABLE technical_indicators (
    id BIGSERIAL,
    symbol VARCHAR(10) NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    -- Moving Averages
    sma_20 DECIMAL(10,2),
    sma_50 DECIMAL(10,2),
    ema_12 DECIMAL(10,2),
    ema_26 DECIMAL(10,2),
    -- Momentum
    rsi_14 DECIMAL(5,2),
    macd DECIMAL(10,4),
    macd_signal DECIMAL(10,4),
    macd_histogram DECIMAL(10,4),
    -- Volatility
    bollinger_upper DECIMAL(10,2),
    bollinger_lower DECIMAL(10,2),
    atr_14 DECIMAL(10,4),
    -- Volume
    obv DECIMAL(15,0),
    volume_sma_20 DECIMAL(15,0),
    -- Patterns
    pattern_doji INTEGER,
    pattern_hammer INTEGER,
    pattern_engulfing INTEGER,
    -- RAG embedding
    indicator_pattern_embedding vector(384),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    -- Composite key: same TimescaleDB constraint as market_data
    PRIMARY KEY (id, timestamp)
);

SELECT create_hypertable('technical_indicators', 'timestamp', chunk_time_interval => INTERVAL '1 day');
CREATE INDEX idx_technical_symbol_time ON technical_indicators (symbol, timestamp DESC);
CREATE INDEX idx_technical_embedding ON technical_indicators USING diskann (indicator_pattern_embedding);
```
### Migration Strategy Scripts
```python
# migrations/001_create_market_data_tables.py
from alembic import op
import sqlalchemy as sa
from pgvector.sqlalchemy import Vector  # maps to the Postgres vector(N) type


def upgrade():
    # Create market_data table
    op.create_table(
        'market_data',
        sa.Column('id', sa.BigInteger(), nullable=False),
        sa.Column('symbol', sa.String(10), nullable=False),
        sa.Column('timestamp', sa.DateTime(timezone=True), nullable=False),
        sa.Column('open_price', sa.Numeric(10, 2), nullable=False),
        sa.Column('high_price', sa.Numeric(10, 2), nullable=False),
        sa.Column('low_price', sa.Numeric(10, 2), nullable=False),
        sa.Column('close_price', sa.Numeric(10, 2), nullable=False),
        sa.Column('volume', sa.BigInteger(), nullable=False),
        sa.Column('adjusted_close', sa.Numeric(10, 2), nullable=False),
        sa.Column('technical_pattern_embedding', Vector(384), nullable=True),
        sa.Column('price_movement_embedding', Vector(384), nullable=True),
        sa.Column('created_at', sa.DateTime(timezone=True), server_default=sa.text('now()'), nullable=True),
        sa.Column('updated_at', sa.DateTime(timezone=True), server_default=sa.text('now()'), nullable=True),
        # Composite key: TimescaleDB needs the time column in the primary key
        sa.PrimaryKeyConstraint('id', 'timestamp')
    )
    # Convert to hypertable
    op.execute("SELECT create_hypertable('market_data', 'timestamp', chunk_time_interval => INTERVAL '1 day');")
    # Create indexes
    op.create_index('idx_market_data_symbol_time', 'market_data', ['symbol', 'timestamp'])
    op.create_index('idx_market_data_symbol', 'market_data', ['symbol'])


def downgrade():
    op.drop_table('market_data')
```
## API Preservation
### 100% Compatible Service Layer
**MarketDataService**: Preserve all existing methods with PostgreSQL backend
```python
from typing import List, Dict, Any, Optional
from datetime import datetime, timedelta

import pandas as pd


class MarketDataService:
    """API-compatible service with PostgreSQL backend"""

    def __init__(self, repository: MarketDataRepository):
        self.repository = repository

    async def get_ohlc_data(self, symbol: str, start_date: str, end_date: str) -> pd.DataFrame:
        """Get OHLC data - 100% API compatible"""
        entities = await self.repository.get_ohlc_data(symbol, start_date, end_date)
        # Transform to the same DataFrame format the CSV version returned
        return pd.DataFrame([
            {
                'timestamp': entity.timestamp,
                'open': float(entity.open_price),
                'high': float(entity.high_price),
                'low': float(entity.low_price),
                'close': float(entity.close_price),
                'volume': entity.volume,
                'adj_close': float(entity.adjusted_close)
            }
            for entity in entities
        ]).set_index('timestamp')

    async def get_technical_indicators(self, symbol: str, start_date: str, end_date: str) -> Dict[str, List[float]]:
        """Get all technical indicators - 100% API compatible"""
        indicators = await self.repository.get_technical_indicators(symbol, start_date, end_date)
        # "is not None" checks: a legitimate zero value must not be mapped to None
        return {
            'sma_20': [float(ind.sma_20) if ind.sma_20 is not None else None for ind in indicators],
            'rsi_14': [float(ind.rsi_14) if ind.rsi_14 is not None else None for ind in indicators],
            'macd': [float(ind.macd) if ind.macd is not None else None for ind in indicators],
            'macd_signal': [float(ind.macd_signal) if ind.macd_signal is not None else None for ind in indicators],
            # ... all indicators
        }

    async def get_trading_style_preset(self, style: str, symbol: str, lookback_days: int = 30) -> Dict[str, Any]:
        """Get trading style analysis - 100% API compatible"""
        end_date = datetime.now()
        start_date = end_date - timedelta(days=lookback_days)
        ohlc_data = await self.get_ohlc_data(symbol, start_date.isoformat(), end_date.isoformat())
        indicators = await self.get_technical_indicators(symbol, start_date.isoformat(), end_date.isoformat())

        if style == 'momentum':
            return await self._analyze_momentum(ohlc_data, indicators)
        elif style == 'mean_reversion':
            return await self._analyze_mean_reversion(ohlc_data, indicators)
        elif style == 'breakout':
            return await self._analyze_breakout(ohlc_data, indicators)
        # ... all trading styles

    async def _analyze_momentum(self, ohlc_data: pd.DataFrame, indicators: Dict) -> Dict[str, Any]:
        """Momentum analysis with RAG enhancement"""
        latest_rsi = indicators['rsi_14'][-1] if indicators['rsi_14'] else 50
        latest_macd = indicators['macd'][-1] if indicators['macd'] else 0

        # RAG: find similar momentum patterns
        similar_patterns = await self.repository.find_similar_momentum_patterns(
            latest_rsi, latest_macd, limit=10
        )
        return {
            'signal': 'BUY' if latest_rsi > 70 and latest_macd > 0 else 'HOLD',
            'confidence': 0.85,
            'indicators': {
                'rsi': latest_rsi,
                'macd': latest_macd
            },
            'similar_patterns': [p.to_dict() for p in similar_patterns],
            'rag_enhanced': True
        }
```
**FundamentalDataService**: Complete API preservation
```python
class FundamentalDataService:
    """API-compatible fundamental analysis with PostgreSQL backend"""

    def __init__(self, repository: FundamentalDataRepository):
        self.repository = repository

    async def get_financial_ratios(self, symbol: str, period_type: str = 'Q') -> Dict[str, float]:
        """Get the latest financial ratios - 100% API compatible"""
        latest_data = await self.repository.get_latest_fundamental_data(symbol, period_type)
        if not latest_data:
            return {}
        return {
            'pe_ratio': float(latest_data.pe_ratio) if latest_data.pe_ratio is not None else None,
            'pb_ratio': float(latest_data.pb_ratio) if latest_data.pb_ratio is not None else None,
            'roe': float(latest_data.roe) if latest_data.roe is not None else None,
            'roa': float(latest_data.roa) if latest_data.roa is not None else None,
            'debt_to_equity': float(latest_data.debt_to_equity) if latest_data.debt_to_equity is not None else None
        }

    async def analyze_financial_health(self, symbol: str) -> Dict[str, Any]:
        """Financial health analysis with RAG - 100% API compatible"""
        latest_data = await self.repository.get_latest_fundamental_data(symbol)
        if not latest_data:
            return {'error': 'No fundamental data available'}
        historical_data = await self.repository.get_fundamental_history(symbol, quarters=8)

        # RAG: find companies with similar financial profiles
        similar_companies = await self.repository.find_similar_financial_profiles(
            latest_data.financial_health_embedding, limit=10
        )
        return {
            'health_score': self._calculate_health_score(latest_data, historical_data),
            'trend_analysis': self._analyze_trends(historical_data),
            'peer_comparison': [comp.to_dict() for comp in similar_companies],
            'rag_enhanced': True
        }
```
**InsiderDataService**: Complete API preservation
```python
class InsiderDataService:
    """API-compatible insider analysis with PostgreSQL backend"""

    def __init__(self, repository: InsiderDataRepository):
        self.repository = repository

    async def get_recent_insider_activity(self, symbol: str, days: int = 90) -> List[Dict[str, Any]]:
        """Get recent insider transactions - 100% API compatible"""
        end_date = datetime.now()
        start_date = end_date - timedelta(days=days)
        transactions = await self.repository.get_insider_transactions(symbol, start_date, end_date)
        return [
            {
                'insider_name': trans.insider_name,
                'position': trans.insider_position,
                'transaction_date': trans.transaction_date.isoformat(),
                'transaction_type': trans.transaction_type,
                'shares_traded': trans.shares_traded,
                'transaction_price': float(trans.transaction_price) if trans.transaction_price is not None else None,
                'transaction_value': float(trans.transaction_value) if trans.transaction_value is not None else None,
                'sentiment_score': float(trans.sentiment_score) if trans.sentiment_score is not None else None
            }
            for trans in transactions
        ]

    async def analyze_insider_sentiment(self, symbol: str, days: int = 180) -> Dict[str, Any]:
        """Insider sentiment analysis with RAG - 100% API compatible"""
        transactions = await self.get_recent_insider_activity(symbol, days)

        # RAG: find similar insider activity patterns
        similar_patterns = await self.repository.find_similar_insider_patterns(
            symbol, days, limit=10
        )

        buy_volume = sum(t['shares_traded'] for t in transactions if t['transaction_type'] == 'Buy')
        sell_volume = sum(t['shares_traded'] for t in transactions if t['transaction_type'] == 'Sell')
        net_sentiment = buy_volume - sell_volume
        # Average only over transactions that actually carry a sentiment score
        scored = [t['sentiment_score'] for t in transactions if t['sentiment_score'] is not None]
        return {
            'net_sentiment': net_sentiment,
            'buy_transactions': len([t for t in transactions if t['transaction_type'] == 'Buy']),
            'sell_transactions': len([t for t in transactions if t['transaction_type'] == 'Sell']),
            'average_sentiment_score': sum(scored) / len(scored) if scored else 0,
            'similar_patterns': [p.to_dict() for p in similar_patterns],
            'rag_enhanced': True
        }
```
## Component Architecture
### Repository Migration Pattern
**AsyncRepository with PostgreSQL Operations**
```python
from datetime import datetime
from typing import List, Optional

from sqlalchemy import select, and_, desc
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker


class MarketDataRepository:
    """Async PostgreSQL repository with RAG capabilities"""

    def __init__(self, session_factory: async_sessionmaker[AsyncSession]):
        self.session_factory = session_factory

    async def get_ohlc_data(self, symbol: str, start_date: str, end_date: str) -> List[MarketDataEntity]:
        """Get OHLC data with sub-100ms performance"""
        async with self.session_factory() as session:
            stmt = select(MarketDataEntity).where(
                and_(
                    MarketDataEntity.symbol == symbol,
                    MarketDataEntity.timestamp >= datetime.fromisoformat(start_date),
                    MarketDataEntity.timestamp <= datetime.fromisoformat(end_date)
                )
            ).order_by(MarketDataEntity.timestamp)
            result = await session.execute(stmt)
            return result.scalars().all()

    async def save_ohlc_batch(self, entities: List[MarketDataEntity]) -> None:
        """Batch insert (for idempotent re-runs, switch to INSERT ... ON CONFLICT upserts)"""
        async with self.session_factory() as session:
            session.add_all(entities)
            await session.commit()

    async def find_similar_momentum_patterns(self, rsi: float, macd: float, limit: int = 10) -> List[TechnicalIndicatorEntity]:
        """RAG: find similar technical patterns using vector similarity"""
        target_embedding = self._encode_momentum_pattern(rsi, macd)
        async with self.session_factory() as session:
            # Cosine distance ordering, served by the pgvectorscale diskann index
            stmt = select(TechnicalIndicatorEntity).order_by(
                TechnicalIndicatorEntity.indicator_pattern_embedding.cosine_distance(target_embedding)
            ).limit(limit)
            result = await session.execute(stmt)
            return result.scalars().all()

    def _encode_momentum_pattern(self, rsi: float, macd: float) -> List[float]:
        """Encode momentum indicators into a vector for similarity search"""
        from sentence_transformers import SentenceTransformer

        # Cache the model: loading it on every query would dominate latency
        if not hasattr(self, '_encoder'):
            self._encoder = SentenceTransformer('all-MiniLM-L6-v2')
        pattern_text = f"RSI: {rsi:.2f}, MACD: {macd:.4f}, momentum pattern"
        return self._encoder.encode(pattern_text).tolist()
```
### Migration Data Processing
**4-Phase Migration Strategy**
```python
import glob
from datetime import datetime
from typing import Dict

import pandas as pd


class MarketDataMigrator:
    """Migrate from CSV files to PostgreSQL with data validation"""

    def __init__(self, csv_data_path: str, repository: MarketDataRepository):
        self.csv_data_path = csv_data_path
        self.repository = repository

    async def migrate_all_data(self) -> Dict[str, int]:
        """Execute the 4-phase migration strategy"""
        results = {}
        # Phase 1: Market Data (OHLC)
        results['market_data'] = await self._migrate_market_data()
        # Phase 2: Fundamental Data
        results['fundamental_data'] = await self._migrate_fundamental_data()
        # Phase 3: Insider Data
        results['insider_data'] = await self._migrate_insider_data()
        # Phase 4: Calculate Technical Indicators
        results['technical_indicators'] = await self._calculate_technical_indicators()
        return results

    async def _migrate_market_data(self) -> int:
        """Migrate OHLC data from CSV files"""
        csv_files = glob.glob(f"{self.csv_data_path}/market_data/*.csv")
        total_records = 0
        for csv_file in csv_files:
            symbol = self._extract_symbol_from_filename(csv_file)
            df = pd.read_csv(csv_file)

            # Transform rows to entities, dropping any that fail validation
            entities = []
            for _, row in df.iterrows():
                entity = MarketDataEntity.from_csv_record({
                    'symbol': symbol,
                    'timestamp': row['Date'],
                    'open': row['Open'],
                    'high': row['High'],
                    'low': row['Low'],
                    'close': row['Close'],
                    'volume': row['Volume'],
                    'adj_close': row['Adj Close']
                })
                if entity.validate():
                    entities.append(entity)

            # Batch insert
            await self.repository.save_ohlc_batch(entities)
            total_records += len(entities)
            print(f"Migrated {len(entities)} records for {symbol}")
        return total_records

    async def _calculate_technical_indicators(self) -> int:
        """Calculate and store technical indicators for all symbols"""
        symbols = await self.repository.get_all_symbols()
        total_indicators = 0
        for symbol in symbols:
            # Get OHLC data
            ohlc_data = await self.repository.get_ohlc_data(
                symbol, "2020-01-01", datetime.now().isoformat()
            )
            # Convert to DataFrame
            df = pd.DataFrame([{
                'timestamp': entity.timestamp,
                'close': float(entity.close_price),
                'high': float(entity.high_price),
                'low': float(entity.low_price),
                'volume': entity.volume
            } for entity in ohlc_data])
            df.set_index('timestamp', inplace=True)

            # Calculate indicators
            indicators = TechnicalIndicatorEntity.calculate_from_ohlc(symbol, df)
            # Generate embeddings
            for indicator in indicators:
                indicator.indicator_pattern_embedding = self._generate_indicator_embedding(indicator)
            # Save indicators
            await self.repository.save_technical_indicators(indicators)
            total_indicators += len(indicators)
            print(f"Calculated {len(indicators)} indicators for {symbol}")
        return total_indicators
```
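`validate_migration` (invoked in Phase 2 below) is not shown in this design; at minimum it should reconcile per-symbol row counts between the CSV source and PostgreSQL. A hypothetical core for that check, kept synchronous and storage-agnostic so it is easy to unit-test:

```python
def compare_row_counts(csv_counts: dict[str, int], db_counts: dict[str, int]) -> dict[str, dict]:
    """Per-symbol reconciliation: flag any symbol whose migrated row count
    differs from its CSV source (missing symbols count as zero rows)."""
    report = {}
    for symbol in sorted(set(csv_counts) | set(db_counts)):
        source, migrated = csv_counts.get(symbol, 0), db_counts.get(symbol, 0)
        report[symbol] = {'csv': source, 'db': migrated, 'match': source == migrated}
    return report
```

The caller would gather `csv_counts` from the files and `db_counts` from `SELECT symbol, count(*) FROM market_data GROUP BY symbol`, then fail the migration if any `match` is false.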
## RAG Integration
### Vector Embeddings for Historical Pattern Matching
**Embedding Generation Strategy**
```python
from sentence_transformers import SentenceTransformer
import numpy as np
class MarketDataEmbeddingService:
    """Generate vector embeddings for RAG pattern matching"""

    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def generate_technical_pattern_embedding(self, indicator_data: TechnicalIndicatorEntity) -> List[float]:
        """Generate an embedding for technical analysis patterns (assumes the indicator fields are populated)"""
        pattern_description = f"""
        Technical Analysis Pattern:
        RSI: {indicator_data.rsi_14:.2f}
        MACD: {indicator_data.macd:.4f}, Signal: {indicator_data.macd_signal:.4f}
        SMA 20: {indicator_data.sma_20:.2f}, SMA 50: {indicator_data.sma_50:.2f}
        Bollinger Bands: Upper {indicator_data.bollinger_upper:.2f}, Lower {indicator_data.bollinger_lower:.2f}
        Volume SMA 20: {indicator_data.volume_sma_20}
        Pattern: {'Bullish' if indicator_data.rsi_14 > 50 and indicator_data.macd > indicator_data.macd_signal else 'Bearish'}
        """
        return self.model.encode(pattern_description.strip()).tolist()

    def generate_price_movement_embedding(self, market_data: MarketDataEntity) -> List[float]:
        """Generate an embedding for price movement patterns"""
        price_change = (market_data.close_price - market_data.open_price) / market_data.open_price * 100
        volatility = (market_data.high_price - market_data.low_price) / market_data.open_price * 100
        movement_description = f"""
        Price Movement Pattern:
        Price Change: {price_change:.2f}%
        Intraday Volatility: {volatility:.2f}%
        Volume Profile: {'High' if market_data.volume > 1_000_000 else 'Normal'}
        Price Level: {'Above' if market_data.close_price > market_data.open_price else 'Below'} opening
        Candle Type: {'Green' if market_data.close_price >= market_data.open_price else 'Red'}
        """
        return self.model.encode(movement_description.strip()).tolist()

    def generate_financial_health_embedding(self, fundamental_data: FundamentalDataEntity) -> List[float]:
        """Generate an embedding for financial health patterns"""
        # Pre-format nullable fields: a conditional inside an f-string format spec is invalid syntax
        pe = f"{fundamental_data.pe_ratio:.2f}" if fundamental_data.pe_ratio else 'N/A'
        roe = f"{fundamental_data.roe * 100:.2f}%" if fundamental_data.roe else 'N/A'
        dte = f"{fundamental_data.debt_to_equity:.2f}" if fundamental_data.debt_to_equity else 'N/A'
        health_description = f"""
        Financial Health Profile:
        PE Ratio: {pe}
        ROE: {roe}
        Debt to Equity: {dte}
        Revenue Growth: {'Positive' if fundamental_data.total_revenue and fundamental_data.total_revenue > 0 else 'Unknown'}
        Profitability: {'Profitable' if fundamental_data.net_income and fundamental_data.net_income > 0 else 'Loss'}
        Financial Strength: {'Strong' if fundamental_data.pe_ratio and fundamental_data.pe_ratio < 20 else 'Average'}
        """
        return self.model.encode(health_description.strip()).tolist()
```
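The `cosine_distance` ordering used by the repository can be reproduced locally with NumPy, which is useful for unit-testing embedding quality without a database (the helper name is ours):

```python
import numpy as np

def cosine_distance(a, b) -> float:
    """1 - cosine similarity, matching pgvector's cosine-distance semantics
    for non-zero vectors: 0 = identical direction, 1 = orthogonal, 2 = opposite."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because cosine distance ignores magnitude, two `SentenceTransformer` outputs describing the same pattern at different scales still rank as near neighbors.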
**RAG Query Service**
```python
import numpy as np


class MarketDataRAGService:
    """RAG-powered market analysis with vector similarity search"""

    def __init__(self, repository: MarketDataRepository):
        self.repository = repository
        self.embedding_service = MarketDataEmbeddingService()

    async def find_similar_market_conditions(
        self,
        symbol: str,
        current_date: datetime,
        similarity_threshold: float = 0.8,
        limit: int = 10
    ) -> Dict[str, Any]:
        """Find historically similar market conditions for decision support"""
        # Get the current market state (OHLC and fundamentals are fetched for context;
        # only the indicator state feeds the query embedding here)
        current_ohlc = await self.repository.get_latest_ohlc(symbol, current_date)
        current_indicators = await self.repository.get_latest_indicators(symbol, current_date)
        current_fundamentals = await self.repository.get_latest_fundamentals(symbol)

        # Generate the query embedding
        current_pattern_embedding = self.embedding_service.generate_technical_pattern_embedding(current_indicators)

        # Find similar patterns
        similar_patterns = await self.repository.find_similar_patterns(
            current_pattern_embedding,
            similarity_threshold,
            limit
        )

        # Analyze the outcomes of similar patterns
        pattern_outcomes = []
        for pattern in similar_patterns:
            # Get the price movement ~7 days after each similar pattern
            future_price = await self.repository.get_price_after_date(
                pattern.symbol,
                pattern.timestamp,
                days=7
            )
            if future_price:
                price_change = (future_price.close_price - pattern.close_price) / pattern.close_price
                pattern_outcomes.append({
                    'symbol': pattern.symbol,
                    'date': pattern.timestamp,
                    'similarity_score': pattern.similarity_score,
                    'future_return': float(price_change),
                    'pattern_type': self._classify_pattern(pattern)
                })

        return {
            'current_symbol': symbol,
            'analysis_date': current_date,
            'similar_patterns_found': len(similar_patterns),
            'historical_outcomes': pattern_outcomes,
            'average_return': np.mean([p['future_return'] for p in pattern_outcomes]) if pattern_outcomes else 0,
            'success_rate': len([p for p in pattern_outcomes if p['future_return'] > 0]) / len(pattern_outcomes) if pattern_outcomes else 0,
            'confidence_score': self._calculate_confidence(pattern_outcomes)
        }

    async def get_peer_analysis(self, symbol: str) -> Dict[str, Any]:
        """Find peer companies with similar financial profiles"""
        fundamentals = await self.repository.get_latest_fundamentals(symbol)
        if not fundamentals:
            return {'error': 'No fundamental data available'}

        # Find companies with similar financial health embeddings
        similar_companies = await self.repository.find_similar_financial_profiles(
            fundamentals.financial_health_embedding,
            exclude_symbol=symbol,
            limit=10
        )

        # Get a recent performance comparison
        peer_performance = []
        for peer in similar_companies:
            recent_performance = await self.repository.get_recent_performance(peer.symbol, days=30)
            peer_performance.append({
                'symbol': peer.symbol,
                'similarity_score': peer.similarity_score,
                'recent_return': recent_performance.get('return', 0),
                'volatility': recent_performance.get('volatility', 0),
                'pe_ratio': float(peer.pe_ratio) if peer.pe_ratio is not None else None
            })

        # Compare the target symbol's own recent return against the peer average
        target_performance = await self.repository.get_recent_performance(symbol, days=30)
        peer_average_return = np.mean([p['recent_return'] for p in peer_performance]) if peer_performance else 0
        return {
            'target_symbol': symbol,
            'peer_companies': peer_performance,
            'peer_average_return': peer_average_return,
            'relative_performance': 'Above average' if target_performance.get('return', 0) > peer_average_return else 'Below average'
        }
```
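The RAG methods above call `self._calculate_confidence` without defining it. A minimal standalone sketch of one plausible heuristic (the weighting scheme and saturation point are illustrative assumptions, not part of the design):

```python
import numpy as np

def calculate_confidence(pattern_outcomes: list) -> float:
    """Hypothetical confidence heuristic: more matches and more
    directionally consistent outcomes push the score toward 1.0."""
    if not pattern_outcomes:
        return 0.0
    returns = np.array([p['future_return'] for p in pattern_outcomes])
    # Sample-size factor: saturates once ~10 similar patterns are found (assumed cutoff)
    sample_factor = min(len(returns) / 10.0, 1.0)
    # Consistency factor: fraction of outcomes agreeing with the mean's sign
    if returns.mean() == 0:
        agreement = 0.5
    else:
        agreement = float(np.mean(np.sign(returns) == np.sign(returns.mean())))
    return round(float(sample_factor * agreement), 3)
```

A method on the RAG service would simply delegate to this function; the exact weighting should be calibrated against backtest data.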
## Migration Strategy
### 4-Phase Migration Approach
**Phase 1: Database Infrastructure Setup**
```bash
# 1. Set up PostgreSQL with extensions
docker-compose up -d postgres
# 2. Run database migrations
alembic upgrade head
# 3. Verify TimescaleDB and pgvectorscale extensions
psql $DATABASE_URL -c "SELECT * FROM pg_extension WHERE extname IN ('timescaledb', 'vectorscale');"
```
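The `docker-compose up -d postgres` step above assumes a compose service named `postgres`. A minimal sketch of that service definition (the image tag and credentials are placeholders; verify that the chosen image actually ships both TimescaleDB and pgvectorscale, or install the extensions in a derived image):

```yaml
# docker-compose.yml (sketch; image tag and credentials are illustrative)
services:
  postgres:
    image: timescale/timescaledb-ha:pg16   # assumed image; confirm pgvectorscale availability
    environment:
      POSTGRES_USER: trading
      POSTGRES_PASSWORD: trading
      POSTGRES_DB: marketdata
    ports:
      - "5432:5432"
```

With this in place, the `psql $DATABASE_URL` verification query from step 3 should list both extensions once `CREATE EXTENSION` has run via the Alembic migrations.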
**Phase 2: Data Migration with Validation**
```python
# Migration script: migrate_market_data.py
import asyncio
import os

from tradingagents.domains.marketdata.migration.migrator import MarketDataMigrator
from tradingagents.domains.marketdata.repository import MarketDataRepository
from config.database import DatabaseConfig  # provides the async session factory

async def main():
    # Initialize components
    db = DatabaseConfig(os.environ["DATABASE_URL"])
    repository = MarketDataRepository(db.session_factory)
    migrator = MarketDataMigrator("./data/market_data", repository)

    # Execute migration
    print("Starting MarketData migration...")
    results = await migrator.migrate_all_data()
    print("Migration completed:")
    print(f"  Market Data: {results['market_data']} records")
    print(f"  Fundamental Data: {results['fundamental_data']} records")
    print(f"  Insider Data: {results['insider_data']} records")
    print(f"  Technical Indicators: {results['technical_indicators']} records")

    # Validate migration
    validation_results = await migrator.validate_migration()
    print(f"Validation: {validation_results}")

if __name__ == "__main__":
    asyncio.run(main())
```
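The migrator streams years of CSV history into PostgreSQL; one useful building block is reading each CSV in fixed-size batches so inserts can be committed incrementally rather than loading a whole file into memory. A hedged sketch (the function name, batch size, and column normalization are illustrative, not the migrator's actual internals):

```python
from typing import Iterator

import pandas as pd

def iter_csv_batches(csv_path: str, batch_size: int = 5000) -> Iterator[pd.DataFrame]:
    """Yield a CSV file in DataFrame chunks so each batch can be
    inserted and committed independently during migration."""
    for chunk in pd.read_csv(csv_path, chunksize=batch_size):
        # Normalize column names to the entities' snake_case convention
        chunk.columns = [c.strip().lower().replace(' ', '_') for c in chunk.columns]
        yield chunk
```

Each yielded batch can then be handed to a repository `bulk_insert` call inside its own transaction, which keeps memory flat and makes a failed migration resumable at batch granularity.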
**Phase 3: Service Layer Migration**
```python
# Switch services to PostgreSQL repositories
# Update dependency injection in service factory
class ServiceFactory:
def __init__(self, session_factory: async_sessionmaker[AsyncSession]):
self.session_factory = session_factory
def create_market_data_service(self) -> MarketDataService:
repository = MarketDataRepository(self.session_factory)
return MarketDataService(repository)
def create_fundamental_data_service(self) -> FundamentalDataService:
repository = FundamentalDataRepository(self.session_factory)
return FundamentalDataService(repository)
def create_insider_data_service(self) -> InsiderDataService:
repository = InsiderDataRepository(self.session_factory)
return InsiderDataService(repository)
```
**Phase 4: RAG Enhancement and Testing**
```python
# Generate embeddings for existing data
async def enhance_with_rag(repository: MarketDataRepository):
    embedding_service = MarketDataEmbeddingService()

    # Generate embeddings for all technical indicators
    indicators = await repository.get_all_technical_indicators()
    for indicator in indicators:
        embedding = embedding_service.generate_technical_pattern_embedding(indicator)
        await repository.update_indicator_embedding(indicator.id, embedding)
    print(f"Generated embeddings for {len(indicators)} technical indicators")

    # Test RAG functionality
    rag_service = MarketDataRAGService(repository)
    test_results = await rag_service.find_similar_market_conditions("AAPL", datetime.now())
    print(f"RAG test completed: {test_results}")
## Testing Strategy
### Migration Testing
**Data Integrity Tests**
```python
class TestMarketDataMigration:
    """Validate that the PostgreSQL migration maintains data integrity."""

    async def test_ohlc_data_accuracy(self):
        """Verify OHLC data matches the CSV files exactly."""
        # Load original CSV data
        original_df = pd.read_csv("./data/market_data/AAPL.csv")

        # Get PostgreSQL data
        postgres_entities = await self.repository.get_ohlc_data("AAPL", "2020-01-01", "2024-01-01")
        postgres_df = pd.DataFrame([entity.to_dict() for entity in postgres_entities])

        # Compare datasets
        assert len(original_df) == len(postgres_df)
        assert np.allclose(original_df['Close'].values, postgres_df['close_price'].values)
        assert np.allclose(original_df['Volume'].values, postgres_df['volume'].values)

    async def test_performance_improvement(self):
        """Verify the 10x performance improvement."""
        import time

        # Time a PostgreSQL query
        start = time.time()
        postgres_data = await self.repository.get_ohlc_data("AAPL", "2023-01-01", "2024-01-01")
        postgres_time = time.time() - start

        # Historical CSV time (from benchmarks)
        csv_baseline_time = 0.500  # 500ms baseline
        assert postgres_time < 0.100  # Sub-100ms requirement
        assert postgres_time < csv_baseline_time / 10  # 10x improvement

    async def test_rag_functionality(self):
        """Verify RAG vector similarity search works."""
        rag_service = MarketDataRAGService(self.repository)

        # Test pattern matching
        results = await rag_service.find_similar_market_conditions("AAPL", datetime(2023, 6, 1))
        assert len(results['historical_outcomes']) > 0
        assert 'confidence_score' in results
        assert results['average_return'] is not None
```
**API Compatibility Tests**
```python
class TestAPICompatibility:
    """Ensure 100% API compatibility after migration."""

    async def test_market_data_service_api(self):
        """Verify the MarketDataService API is unchanged."""
        service = MarketDataService(self.repository)

        # Test all existing methods
        ohlc_data = await service.get_ohlc_data("AAPL", "2023-01-01", "2023-12-31")
        assert isinstance(ohlc_data, pd.DataFrame)
        assert list(ohlc_data.columns) == ['open', 'high', 'low', 'close', 'volume', 'adj_close']

        indicators = await service.get_technical_indicators("AAPL", "2023-01-01", "2023-12-31")
        assert isinstance(indicators, dict)
        assert 'sma_20' in indicators
        assert 'rsi_14' in indicators

        momentum_analysis = await service.get_trading_style_preset("momentum", "AAPL")
        assert 'signal' in momentum_analysis
        assert 'confidence' in momentum_analysis

    async def test_fundamental_service_api(self):
        """Verify the FundamentalDataService API is unchanged."""
        service = FundamentalDataService(self.repository)

        ratios = await service.get_financial_ratios("AAPL")
        assert isinstance(ratios, dict)
        assert 'pe_ratio' in ratios

        health = await service.analyze_financial_health("AAPL")
        assert 'health_score' in health
        assert 'trend_analysis' in health
```
**Performance Validation**
```python
import asyncio
import time

class TestPerformanceRequirements:
    """Validate that performance requirements are met."""

    async def test_sub_100ms_queries(self):
        """Verify sub-100ms query performance."""
        queries = [
            lambda: self.repository.get_ohlc_data("AAPL", "2023-12-01", "2023-12-31"),
            lambda: self.repository.get_technical_indicators("AAPL", "2023-12-01", "2023-12-31"),
            lambda: self.repository.get_latest_fundamentals("AAPL")
        ]
        for query in queries:
            start = time.time()
            await query()
            elapsed = time.time() - start
            assert elapsed < 0.100, f"Query took {elapsed:.3f}s, exceeds 100ms requirement"

    async def test_rag_query_performance(self):
        """Verify sub-200ms RAG query performance."""
        rag_service = MarketDataRAGService(self.repository)
        start = time.time()
        results = await rag_service.find_similar_market_conditions("AAPL", datetime.now())
        elapsed = time.time() - start
        assert elapsed < 0.200, f"RAG query took {elapsed:.3f}s, exceeds 200ms requirement"

    async def test_concurrent_access(self):
        """Verify concurrent agent access performance."""
        async def concurrent_query(symbol: str):
            return await self.repository.get_ohlc_data(symbol, "2023-12-01", "2023-12-31")

        # Simulate 10 concurrent agents
        tasks = [concurrent_query(f"SYMBOL_{i}") for i in range(10)]
        start = time.time()
        results = await asyncio.gather(*tasks)
        elapsed = time.time() - start
        assert elapsed < 1.0, f"Concurrent queries took {elapsed:.3f}s, too slow for agent workload"
        assert len(results) == 10
```
## Implementation Guidance
### Step-by-Step Implementation
**Week 1: Database Setup and Schema Migration**
1. Set up PostgreSQL with TimescaleDB and pgvectorscale extensions
2. Create database schemas with proper indexing
3. Run Alembic migrations to create all tables
4. Test hypertable creation and vector index performance
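Step 4 can be spot-checked directly in psql. A hedged sketch (the table and column names follow the entity models earlier in this design, and the `diskann` access method assumes pgvectorscale's StreamingDiskANN index; adjust names to the actual schema):

```sql
-- Convert market_data into a hypertable (safe to re-run via if_not_exists)
SELECT create_hypertable('market_data', 'timestamp', if_not_exists => TRUE);

-- Vector index for embedding similarity search (pgvectorscale's diskann access method)
CREATE INDEX IF NOT EXISTS idx_technical_indicator_embedding
    ON technical_indicators USING diskann (pattern_embedding);
```

If `create_hypertable` reports the table is already a hypertable and the index builds without error, the Week 1 infrastructure is in place for the repository layer work in Week 2.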
**Week 2: Entity Models and Repository Layer**
1. Implement all SQLAlchemy entity models with business logic
2. Create async repository classes with PostgreSQL operations
3. Implement batch operations for high-performance data loading
4. Add vector similarity search capabilities
**Week 3: Data Migration and Validation**
1. Build CSV-to-PostgreSQL migration scripts
2. Migrate all historical data with integrity validation
3. Generate vector embeddings for all existing data
4. Validate data accuracy and performance benchmarks
**Week 4: Service Layer and API Preservation**
1. Update service layer to use PostgreSQL repositories
2. Ensure 100% API compatibility with existing interfaces
3. Implement RAG-enhanced analysis features
4. Complete integration testing and performance validation
### Code Organization
```
tradingagents/domains/marketdata/
├── entities/
│ ├── __init__.py
│ ├── market_data_entity.py
│ ├── fundamental_data_entity.py
│ ├── insider_data_entity.py
│ └── technical_indicator_entity.py
├── repositories/
│ ├── __init__.py
│ ├── market_data_repository.py
│ ├── fundamental_data_repository.py
│ └── insider_data_repository.py
├── services/
│ ├── __init__.py
│ ├── market_data_service.py
│ ├── fundamental_data_service.py
│ ├── insider_data_service.py
│ └── embedding_service.py
├── migration/
│ ├── __init__.py
│ ├── migrator.py
│ └── validation.py
├── rag/
│ ├── __init__.py
│ ├── rag_service.py
│ └── pattern_matcher.py
└── clients/
├── __init__.py
├── yfinance_client.py # Already implemented
└── finnhub_client.py # Already implemented
```
### Configuration Updates
**Database Configuration**
```python
# config/database.py
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker

class DatabaseConfig:
    def __init__(self, database_url: str):
        # The async engine uses AsyncAdaptedQueuePool by default; passing the
        # sync QueuePool to create_async_engine raises an error in SQLAlchemy 2.0.
        self.engine = create_async_engine(
            database_url,
            pool_size=20,
            max_overflow=30,
            pool_pre_ping=True,
            echo=False  # Set True for SQL debugging
        )
        self.session_factory = async_sessionmaker(
            bind=self.engine,
            expire_on_commit=False,
            autoflush=True
        )

    async def health_check(self) -> bool:
        """Verify database connectivity and required extensions."""
        async with self.session_factory() as session:
            # Raw SQL must be wrapped in text() under SQLAlchemy 2.0
            result = await session.execute(
                text("SELECT extname FROM pg_extension WHERE extname IN ('timescaledb', 'vectorscale')")
            )
            extensions = result.fetchall()
            return len(extensions) == 2
```
### Monitoring and Observability
**Performance Monitoring**
```python
class MarketDataMetrics:
    """Monitor PostgreSQL migration performance."""

    def __init__(self):
        self.query_times = []
        self.rag_query_times = []

    async def track_query_performance(self, operation: str, execution_time: float):
        """Track query performance metrics."""
        self.query_times.append({
            'operation': operation,
            'time': execution_time,
            'timestamp': datetime.now()
        })
        # Alert if performance degrades
        if execution_time > 0.100:
            print(f"WARNING: {operation} took {execution_time:.3f}s, exceeds 100ms SLA")

    def generate_performance_report(self) -> Dict[str, Any]:
        """Generate a performance analytics report for the last hour."""
        recent_queries = [
            q for q in self.query_times
            if q['timestamp'] > datetime.now() - timedelta(hours=1)
        ]
        if not recent_queries:
            return {'total_queries': 0}
        return {
            'total_queries': len(recent_queries),
            'average_query_time': np.mean([q['time'] for q in recent_queries]),
            'p95_query_time': np.percentile([q['time'] for q in recent_queries], 95),
            'sla_violations': len([q for q in recent_queries if q['time'] > 0.100]),
            'performance_trend': 'stable'  # Trending analysis could be added later
        }
```
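Repository methods need a low-friction way to feed `track_query_performance`. One option is a small decorator that times any async call and reports it; a sketch under the assumption that the metrics object exposes the async `track_query_performance` interface shown above (the decorator name is illustrative):

```python
import time
from functools import wraps

def timed(metrics, operation: str):
    """Decorator that reports an async call's wall-clock time to a
    metrics object exposing track_query_performance(operation, seconds)."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                # Record duration even when the query raises
                await metrics.track_query_performance(operation, time.perf_counter() - start)
        return wrapper
    return decorator
```

Applied as `@timed(metrics, "get_ohlc_data")` on a repository method, this keeps timing out of the query logic itself and makes the 100ms SLA check a single shared code path.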
This technical design provides a comprehensive migration strategy from the 85% complete CSV-based system to a high-performance PostgreSQL + TimescaleDB + pgvectorscale architecture while preserving 100% API compatibility and adding powerful RAG capabilities for historical pattern matching.
The design emphasizes:
- **Performance**: Sub-100ms queries with TimescaleDB optimization
- **Compatibility**: Zero API changes for existing services
- **Intelligence**: RAG-enhanced analysis with vector similarity search
- **Scalability**: Async PostgreSQL operations for concurrent agent access
- **Quality**: Comprehensive testing strategy for migration validation
Implementation should follow the 4-phase approach with weekly milestones to ensure smooth migration and immediate performance benefits.