TradingAgents/MarketDataService_PRD.md

503 lines
18 KiB
Markdown

# Product Requirements Document: MarketDataService Completion
## Overview
Complete the `MarketDataService` to provide strongly-typed market data and technical indicators to trading agents using a local-first data strategy with gap detection and intelligent caching.
## Current State Analysis
### Issues to Fix
- **CRITICAL**: Service uses `BaseClient` inheritance but `YFinanceClient` exists and needs refactoring to FinnhubClient standard
- **CRITICAL**: Service calls client methods with string dates instead of date objects
- **CRITICAL**: Need to integrate `stockstats` library for technical analysis calculations instead of legacy utils
- **CRITICAL**: `MarketDataRepository` exists but missing service interface methods
- Missing strongly-typed interface between YFinanceClient and service
- YFinanceClient uses BaseClient inheritance and string dates (needs refactoring)
- No concrete gap detection logic
- Missing technical indicator data sufficiency validation
### What Works
- ✅ Local-first data strategy implementation (`_get_price_data_local_first`)
- ✅ Force refresh logic (`_fetch_and_cache_fresh_data`)
-`MarketDataContext` Pydantic model for agent consumption
- ✅ Error handling and metadata creation patterns
-`YFinanceClient` exists with yfinance SDK integration and comprehensive methods
-`MarketDataRepository` exists with CSV storage and pandas DataFrame operations
- ✅ Service structure ready for `stockstats` integration for technical analysis
## Technical Requirements
### 1. Strongly-Typed Interfaces
#### Client → Service Interface
```python
# YFinanceClient methods (to be refactored)
def get_historical_data(symbol: str, start_date: date, end_date: date) -> dict[str, Any]
def get_price_data(symbol: str, start_date: date, end_date: date) -> dict[str, Any]
# Technical analysis handled in service layer using stockstats
# No get_technical_indicator method needed in client - calculated from OHLCV data
```
#### Service → Repository Interface
```python
# MarketDataRepository methods (to be implemented)
def has_data_for_period(symbol: str, start_date: str, end_date: str) -> bool
def get_data(symbol: str, start_date: str, end_date: str) -> dict[str, Any]
def store_data(symbol: str, cache_data: dict, overwrite: bool) -> bool
def clear_data(symbol: str, start_date: str, end_date: str) -> bool
```
#### Service → Agent Interface
```python
# Service output (already defined)
def get_context(symbol: str, start_date: str, end_date: str, indicators: list[str], force_refresh: bool) -> MarketDataContext
```
### 2. Local-First Data Strategy
#### Flow
1. **Repository Lookup**: Check `MarketDataRepository.has_data_for_period()`
2. **Gap Detection**: Identify missing price data periods using `detect_market_gaps()`
3. **Data Sufficiency Check**: Ensure enough historical data for requested indicators
4. **Selective Fetching**: Fetch only missing data from `YFinanceClient`
5. **Cache Updates**: Store new data via `repository.store_data()`
6. **Context Assembly**: Return validated `MarketDataContext`
#### Gap Detection Implementation
```python
def detect_market_gaps(self, cached_dates: list[str], requested_start: str, requested_end: str) -> list[tuple[str, str]]:
"""
Returns list of (start, end) tuples for missing periods.
Example: If requesting 2024-01-01 to 2024-01-31 and cache has:
- 2024-01-01 to 2024-01-10
- 2024-01-20 to 2024-01-25
Returns: [("2024-01-11", "2024-01-19"), ("2024-01-26", "2024-01-31")]
Accounts for:
- Weekends (Saturday/Sunday)
- Market holidays
- Continuous date ranges to minimize API calls
"""
# Implementation should use pandas business day logic
```
#### Force Refresh Support
- `force_refresh=True` bypasses local data completely
- Clears existing cache before fetching fresh data
- Stores refreshed data with metadata indicating refresh
#### Cache Invalidation Strategy
- **Historical data is immutable**: Data older than yesterday never changes
- **Today's data needs updates**: During market hours, refresh every 15 minutes
- **After market close**: Today's data becomes immutable
```python
def is_data_stale(self, data_date: date, last_updated: datetime) -> bool:
today = date.today()
if data_date < today:
return False # Historical data never stale
# For today's data, check if market is open and last update > 15 min
if is_market_open() and (datetime.now() - last_updated).minutes > 15:
return True
return False
```
### 3. Date Object Conversion
#### Service Boundary Conversion
```python
# Service receives string dates from agents
def get_context(self, symbol: str, start_date: str, end_date: str, ...) -> MarketDataContext:
# Validate date strings
try:
start_dt = date.fromisoformat(start_date)
end_dt = date.fromisoformat(end_date)
except ValueError as e:
raise ValueError(f"Invalid date format: {e}")
# Check date order
if end_dt < start_dt:
raise ValueError(f"End date {end_date} is before start date {start_date}")
# Expand date range for technical indicators
expanded_start = self._calculate_lookback_start(start_dt, indicators)
# Use date objects when calling YFinanceClient
price_data = self.yfinance_client.get_historical_data(symbol, expanded_start, end_dt)
# Calculate technical indicators using stockstats library
technical_indicators = self._calculate_technical_indicators(price_data, indicators)
```
### 4. Technical Analysis with Stockstats
#### Data Sufficiency Validation
```python
# Minimum data points required for each indicator
INDICATOR_REQUIREMENTS = {
"sma_20": 20,
"sma_200": 200,
"ema_12": 24, # 2x for exponential smoothing
"ema_200": 400,
"rsi_14": 28, # 2x period for warm-up
"macd": 34, # 26 + 8 for signal line
"bb_upper": 20, # Based on 20-period SMA
"atr_14": 28, # 2x period for accuracy
"stochrsi_14": 42, # 3x period for double smoothing
}
def _calculate_lookback_start(self, start_date: date, indicators: list[str]) -> date:
"""Calculate how far back we need data to compute indicators accurately."""
max_lookback = 0
for indicator in indicators:
lookback = INDICATOR_REQUIREMENTS.get(indicator, 0)
max_lookback = max(max_lookback, lookback)
# Add buffer for weekends/holidays
business_days_back = max_lookback * 1.5
return start_date - timedelta(days=int(business_days_back))
def _validate_data_sufficiency(self, data_points: int, indicators: list[str]) -> dict[str, bool]:
"""Check if we have enough data for each indicator."""
return {
indicator: data_points >= INDICATOR_REQUIREMENTS.get(indicator, 0)
for indicator in indicators
}
```
#### Stockstats Integration
```python
def _calculate_technical_indicators(self, price_data: list[dict], indicators: list[str]) -> dict[str, list[dict]]:
"""
Calculate technical indicators using stockstats library.
Args:
price_data: OHLCV data from YFinanceClient
indicators: List of requested indicators (e.g., ['rsi_14', 'macd', 'bb_upper', 'sma_20'])
Returns:
Dict mapping indicator names to time series data
"""
import pandas as pd
from stockstats import StockDataFrame
# Convert price data to pandas DataFrame
df = pd.DataFrame(price_data)
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Check data sufficiency
sufficiency = self._validate_data_sufficiency(len(df), indicators)
# Create StockDataFrame for technical analysis
sdf = StockDataFrame.retype(df)
# Calculate requested indicators
indicator_data = {}
for indicator in indicators:
if not sufficiency[indicator]:
logger.warning(f"Insufficient data for {indicator}, need {INDICATOR_REQUIREMENTS[indicator]} points")
indicator_data[indicator] = []
continue
try:
if indicator in sdf.columns:
values = sdf[indicator].dropna()
indicator_data[indicator] = [
{"date": idx.strftime("%Y-%m-%d"), "value": float(val)}
for idx, val in values.items()
]
except Exception as e:
logger.warning(f"Failed to calculate {indicator}: {e}")
indicator_data[indicator] = []
return indicator_data
```
### 5. Error Recovery and Partial Data
```python
def handle_partial_price_data(
self,
requested_start: str,
requested_end: str,
available_data: list[dict]
) -> MarketDataContext:
"""
Handle cases where only partial date range is available.
- If no data available: Raise exception
- If partial data: Return what's available with metadata
- Mark gaps in metadata
"""
if not available_data:
raise ValueError(f"No market data available for {symbol}")
actual_start = min(d['date'] for d in available_data)
actual_end = max(d['date'] for d in available_data)
metadata = {
"requested_period": {"start": requested_start, "end": requested_end},
"actual_period": {"start": actual_start, "end": actual_end},
"partial_data": actual_start > requested_start or actual_end < requested_end,
"data_points": len(available_data)
}
# Return context with available data and metadata
```
### 6. Pydantic Validation
#### Context Structure
```python
@dataclass
class MarketDataContext(BaseModel):
symbol: str
period: dict[str, str] # {"start": "2024-01-01", "end": "2024-01-31"}
price_data: list[dict[str, Any]] # OHLCV records
technical_indicators: dict[str, list[TechnicalIndicatorData]]
metadata: dict[str, Any]
@validator('price_data')
def validate_price_data(cls, v):
# Ensure OHLCV fields present and valid
required_fields = {'date', 'open', 'high', 'low', 'close', 'volume'}
for record in v:
if not all(field in record for field in required_fields):
raise ValueError(f"Missing required OHLCV fields")
return v
```
## Implementation Tasks
### Phase 1: Refactor YFinanceClient
1. **YFinanceClient Refactoring**
- **Refactor existing** `tradingagents/clients/yfinance_client.py`
- Remove BaseClient inheritance
- Update all method signatures to accept `date` objects instead of strings
- Keep all existing functionality intact
- Example changes:
```python
# Current (wrong)
def get_historical_data(self, symbol: str, start_date: str, end_date: str) -> dict[str, Any]:
# Updated (correct)
def get_historical_data(self, symbol: str, start_date: date, end_date: date) -> dict[str, Any]:
```
2. **Comprehensive Testing**
- Update `tradingagents/clients/test_yfinance_client.py`
- Test with date objects
- Use pytest-vcr for HTTP interaction recording
- Test error handling and edge cases
### Phase 2: Update MarketDataRepository
3. **Repository Interface Enhancement**
- Update existing `tradingagents/repositories/market_data_repository.py`
- Add missing service interface methods: `has_data_for_period()`, `get_data()`, `store_data()`, `clear_data()`
- Maintain existing CSV/pandas functionality while adding service compatibility
- Support gap detection and partial data scenarios
### Phase 3: Update MarketDataService
4. **Client Integration Fix**
- Replace `BaseClient` dependency with `YFinanceClient`
- File: `tradingagents/services/market_data_service.py:8, 26`
- Update constructor to accept `yfinance_client: YFinanceClient`
5. **Date Conversion and Validation**
- Add `date.fromisoformat()` conversion in service methods
- Add date validation (format, order)
- Update client calls to use date objects instead of strings
- File: `tradingagents/services/market_data_service.py:151, 227`
6. **Technical Indicator Integration with Stockstats**
- Implement `_calculate_technical_indicators()` method using `stockstats` library
- Add `_calculate_lookback_start()` for data sufficiency
- Add `_validate_data_sufficiency()` to check if enough data
- Replace legacy `StockstatsUtils` integration with direct stockstats usage
- File: `tradingagents/services/market_data_service.py:9, 43, 280-346`
### Phase 4: Type Safety & Validation
7. **Comprehensive Type Checking**
- Run `mise run typecheck` - must pass with 0 errors
- Validate all date object conversions
- Ensure MarketDataContext compliance
8. **Enhanced Testing**
- Update existing service tests for new YFinanceClient interface
- Add gap detection test scenarios
- Test technical indicator data sufficiency
- Test partial data handling
## Testing Scenarios
### Integration Tests
1. **Gap Detection**
- Test with empty cache (should fetch all)
- Test with partial cache (should fetch only missing periods)
- Test weekend/holiday handling
2. **Technical Indicator Sufficiency**
- Test SMA_200 with only 100 days of data (should skip indicator)
- Test RSI_14 with exactly 28 days (should calculate)
- Test mixed indicators with varying data requirements
3. **Partial Data Recovery**
- Test when API returns less data than requested
- Test when some dates are missing (holidays)
- Test metadata accuracy for partial data
4. **Date Handling**
- Test invalid date formats
- Test end_date < start_date
- Test future dates
- Test weekend date handling
5. **Cache Staleness**
- Test historical data (should never refresh)
- Test today's data during market hours (should refresh if > 15 min)
- Test today's data after market close (should not refresh)
## Success Criteria
### Functional Requirements
- ✅ Service successfully calls refactored `YFinanceClient` with `date` objects
- ✅ Gap detection correctly identifies missing trading days
- ✅ Technical indicators validate data sufficiency before calculation
- ✅ Partial data scenarios handled gracefully
- ✅ Local-first strategy works: checks cache → identifies gaps → fetches missing → stores updates
- ✅ Returns properly validated `MarketDataContext` to agents
- ✅ Technical indicators calculated from OHLCV data using stockstats library
- ✅ Force refresh bypasses cache and refreshes data
### Technical Requirements
- ✅ Zero type checking errors: `mise run typecheck`
- ✅ Zero linting errors: `mise run lint`
- ✅ All existing tests pass with updated architecture
- ✅ No runtime errors with date conversions
- ✅ Proper error messages for validation failures
### Quality Requirements
- ✅ Strongly-typed interfaces between all components
- ✅ Official yfinance SDK and stockstats library usage
- ✅ Comprehensive error handling and logging
- ✅ Efficient caching with minimal API calls
- ✅ Clear separation of concerns between service, client, and repository
## Data Architecture
### YFinanceClient Response Format
```python
{
"symbol": "AAPL",
"period": {"start": "2024-01-01", "end": "2024-01-31"},
"data": [
{
"date": "2024-01-02", # Note: Jan 1 was a holiday
"open": 150.0,
"high": 155.0,
"low": 149.0,
"close": 154.0,
"volume": 1000000,
"adj_close": 154.0
},
...
],
"metadata": {
"source": "yfinance",
"retrieved_at": "2024-01-31T10:00:00Z",
"data_quality": "HIGH",
"missing_dates": ["2024-01-01", "2024-01-15"] # Holidays
}
}
```
### Technical Indicator Data Format
```python
# MarketDataContext.technical_indicators structure
{
"rsi_14": [
{"date": "2024-01-29", "value": 65.5}, # First valid after 28 days
{"date": "2024-01-30", "value": 67.2},
...
],
"sma_200": [], # Empty if insufficient data
"macd": [
{"date": "2024-01-31", "value": {"macd": 2.1, "signal": 1.8, "histogram": 0.3}}
],
"_metadata": {
"indicators_calculated": ["rsi_14", "macd"],
"indicators_skipped": {
"sma_200": "Insufficient data: need 200 points, have 31"
}
}
}
```
## Dependencies
### Existing Components (Need Updates)
-`YFinanceClient` exists but needs refactoring (remove BaseClient, use date objects)
-`MarketDataRepository` exists with CSV storage but needs service interface methods
- ✅ Tests exist but need updates for new interfaces
### Required
- Official `yfinance` library for market data fetching
- `stockstats` library for technical analysis calculations
- `pandas` for date/time handling and business day calculations
- Working internet connection for live data fetching
- Writable data directory for repository storage
## Timeline
### Immediate (Phase 1)
- Refactor existing YFinanceClient to use date objects
- Remove BaseClient inheritance
- Update tests for new interface
### Phase 2-3
- Add service interface methods to MarketDataRepository
- Update MarketDataService to use refactored YFinanceClient
- Implement data sufficiency validation
- Integrate stockstats library for technical indicators
### Phase 4
- Comprehensive type checking and validation
- Integration testing with gap detection
- Performance optimization and caching efficiency
## Acceptance Criteria
### Must Have
1. **Type Safety**: Service passes `mise run typecheck` with zero errors
2. **Client Refactoring**: YFinanceClient uses date objects, no BaseClient
3. **Gap Detection**: Correctly identifies missing trading days
4. **Data Sufficiency**: Validates enough data for technical indicators
5. **Partial Data**: Service handles incomplete data gracefully
6. **Local-First**: Service checks repository before API calls
7. **Context Validation**: Returns valid `MarketDataContext` with Pydantic validation
8. **Technical Indicators**: Calculated using stockstats with proper validation
### Should Have
1. **Cache Efficiency**: Minimal redundant API calls to Yahoo Finance
2. **Force Refresh**: Complete cache bypass when requested
3. **Stale Data Handling**: Refresh today's data during market hours
4. **Clear Error Messages**: Informative errors for validation failures
### Nice to Have
1. **Performance Metrics**: Timing and cache hit rate logging
2. **Extended Indicators**: Support for 50+ technical indicators
3. **Real-time Data**: WebSocket integration for live prices
4. **Bulk Symbol Support**: Fetch multiple symbols efficiently
---
This PRD focuses on completing the `MarketDataService` as a strongly-typed, local-first data service that integrates OHLCV price data from a refactored `YFinanceClient` and calculates comprehensive technical indicators using the `stockstats` library, with robust gap detection and data sufficiency validation.