refactor: migrate to newspaper4k and improve news service repository integration

- Upgrade from newspaper3k to newspaper4k for better article scraping
- Add repository integration for cached news data retrieval
- Implement proper date handling and data conversion in news service
- Move PRD files to dedicated prd/ directory
- Add type stubs and improve type checking configuration
- Fix linting issues (unused variables and loop control variables)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
Martin C. Richards 2025-08-10 13:00:40 +02:00
parent 07606f6bf4
commit d773ed4cfa
20 changed files with 2180 additions and 2047 deletions

.claude/settings.json

@@ -0,0 +1,23 @@
{
"hooks": {
"PostToolUse": [
{
"matcher": "Edit|MultiEdit|Write",
"hooks": [
{
"type": "command",
"command": "mise run format"
},
{
"type": "command",
"command": "mise run lint --fix"
},
{
"type": "command",
"command": "mise run typecheck"
}
]
}
]
}
}


@@ -1,289 +0,0 @@
# Product Requirements Document: FundamentalDataService Completion
## Overview
Complete the `FundamentalDataService` to provide strongly-typed fundamental financial data to trading agents using a local-first data strategy with gap detection and intelligent caching.
## Current State Analysis
### Issues to Fix
- **CRITICAL**: Service calls `FinnhubClient` methods with string dates but client expects `date` objects
- **CRITICAL**: References non-existent `self.simfin_client` instead of `self.finnhub_client`
- Missing strongly-typed interfaces between components
- Incomplete local-first strategy implementation
- No concrete gap detection logic
- Missing error recovery for partial data
### What Works
- ✅ `FinnhubClient` fully implemented with strict `date` object interface
- ✅ `FundamentalDataRepository` with dataclass-based storage
- ✅ `FundamentalContext` Pydantic model for agent consumption
- ✅ Basic service structure and error handling
## Technical Requirements
### 1. Strongly-Typed Interfaces
#### Client → Service Interface
```python
# FinnhubClient methods (already implemented)
def get_balance_sheet(symbol: str, frequency: str, report_date: date) -> dict[str, Any]
def get_income_statement(symbol: str, frequency: str, report_date: date) -> dict[str, Any]
def get_cash_flow(symbol: str, frequency: str, report_date: date) -> dict[str, Any]
```
#### Service → Repository Interface
```python
# Repository methods (already implemented)
def has_data_for_period(symbol: str, start_date: str, end_date: str, frequency: str) -> bool
def get_data(symbol: str, start_date: str, end_date: str, frequency: str) -> dict[str, Any]
def store_data(symbol: str, cache_data: dict, frequency: str, overwrite: bool) -> bool
def clear_data(symbol: str, start_date: str, end_date: str, frequency: str) -> bool
```
#### Service → Agent Interface
```python
# Service output (already defined)
def get_context(symbol: str, start_date: str, end_date: str, frequency: str, force_refresh: bool) -> FundamentalContext
```
### 2. Local-First Data Strategy
#### Flow
1. **Repository Lookup**: Check `FundamentalDataRepository.has_data_for_period()`
2. **Gap Detection**: Identify missing data periods using `detect_fundamental_gaps()`
3. **Selective Fetching**: Fetch only missing data from `FinnhubClient`
4. **Cache Updates**: Store new data via `repository.store_data()`
5. **Context Assembly**: Return validated `FundamentalContext`
#### Gap Detection Implementation
```python
def detect_fundamental_gaps(self, symbol: str, start_date: str, end_date: str, frequency: str) -> list[str]:
"""
Returns list of report dates that need fetching.
Example: If requesting quarterly from 2024-01-01 to 2024-12-31
and cache has Q1 and Q3, returns ["2024-06-30", "2024-12-31"]
For quarterly: Check for Q1 (Mar 31), Q2 (Jun 30), Q3 (Sep 30), Q4 (Dec 31)
For annual: Check for fiscal year ends
"""
# Implementation should:
# 1. Get existing report dates from repository
# 2. Calculate expected report dates in requested period
# 3. Return difference between expected and existing
```
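A minimal standalone sketch of that logic, assuming calendar quarter-end report dates and that the caller passes in the set of report dates already cached (fiscal-year variations and the repository lookup itself are out of scope here):
```python
from datetime import date


def expected_report_dates(start: date, end: date, frequency: str) -> list[str]:
    """Report dates (ISO strings) expected within [start, end] for the given frequency."""
    period_ends = [(3, 31), (6, 30), (9, 30), (12, 31)] if frequency == "quarterly" else [(12, 31)]
    expected = []
    for year in range(start.year, end.year + 1):
        for month, day in period_ends:
            report = date(year, month, day)
            if start <= report <= end:
                expected.append(report.isoformat())
    return expected


def detect_fundamental_gaps(cached: set[str], start: date, end: date, frequency: str) -> list[str]:
    """Expected report dates that are not yet in the cache, oldest first."""
    return sorted(d for d in expected_report_dates(start, end, frequency) if d not in cached)


# Cache holds Q1 and Q3 of 2024, so Q2 and Q4 are the gaps.
assert detect_fundamental_gaps(
    {"2024-03-31", "2024-09-30"}, date(2024, 1, 1), date(2024, 12, 31), "quarterly"
) == ["2024-06-30", "2024-12-31"]
```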
#### Force Refresh Support
- `force_refresh=True` bypasses local data completely
- Clears existing cache before fetching fresh data
- Stores refreshed data with metadata indicating refresh
#### Cache Invalidation Strategy
- **Fundamental data is immutable**: Once a report is filed, it doesn't change
- **No staleness checks needed**: Reports are valid indefinitely
- **Only fetch if missing**: Never re-fetch existing reports
### 3. Date Object Conversion
#### Service Boundary Conversion
```python
# Service receives string dates from agents
def get_context(self, symbol: str, start_date: str, end_date: str, ...) -> FundamentalContext:
# Validate date strings
try:
start_dt = date.fromisoformat(start_date)
end_dt = date.fromisoformat(end_date)
except ValueError as e:
raise ValueError(f"Invalid date format: {e}")
# Check date order
if end_dt < start_dt:
raise ValueError(f"End date {end_date} is before start date {start_date}")
# Use date objects when calling FinnhubClient
data = self.finnhub_client.get_balance_sheet(symbol, frequency, end_dt)
```
### 4. Error Recovery and Partial Data
```python
def handle_partial_statements(
self,
balance_sheet: dict | None,
income_statement: dict | None,
cash_flow: dict | None
) -> FundamentalContext:
"""
Create context even if some statements are missing.
- If all statements fail: Raise exception
- If some statements succeed: Return partial context
- Mark missing statements in metadata
"""
metadata = {
"has_balance_sheet": balance_sheet is not None,
"has_income_statement": income_statement is not None,
"has_cash_flow": cash_flow is not None,
"partial_data": any(s is None for s in [balance_sheet, income_statement, cash_flow])
}
# Convert available statements to FinancialStatement objects
# Return FundamentalContext with available data
```
### 5. Pydantic Validation
#### Context Structure
```python
class FundamentalContext(BaseModel):
symbol: str
period: dict[str, str] # {"start": "2024-01-01", "end": "2024-01-31"}
balance_sheet: FinancialStatement | None
income_statement: FinancialStatement | None
cash_flow: FinancialStatement | None
key_ratios: dict[str, float]
metadata: dict[str, Any]
@validator('period')
def validate_period(cls, v):
# Ensure start and end dates are present and valid
return v
```
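The `validate_period` stub above could be filled in along these lines, shown here on a small illustrative model using the same pydantic v1 `@validator` style:
```python
from datetime import date

from pydantic import BaseModel, validator


class PeriodHolder(BaseModel):
    period: dict[str, str]

    @validator("period")
    def validate_period(cls, v: dict[str, str]) -> dict[str, str]:
        missing = {"start", "end"} - v.keys()
        if missing:
            raise ValueError(f"period is missing keys: {sorted(missing)}")
        start, end = date.fromisoformat(v["start"]), date.fromisoformat(v["end"])
        if end < start:
            raise ValueError(f"end date {v['end']} is before start date {v['start']}")
        return v
```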
## Implementation Tasks
### Phase 1: Fix Critical Issues
1. **Date Conversion Fix**
- Add `date.fromisoformat()` conversion in service methods
- Add date validation (format, order)
- Update all `FinnhubClient` method calls to use `date` objects
- File: `tradingagents/services/fundamental_data_service.py:153, 164, 175`
2. **Client Reference Fix**
- Replace `self.simfin_client` with `self.finnhub_client`
- File: `tradingagents/services/fundamental_data_service.py:375`
### Phase 2: Enhanced Local-First Strategy
3. **Gap Detection Logic**
- Implement `detect_fundamental_gaps()` method
- Calculate expected report dates based on frequency
- Compare with cached data to find gaps
- Handle fiscal year variations
4. **Partial Data Handling**
- Implement `handle_partial_statements()` method
- Continue processing if some statements succeed
- Mark missing data in metadata
- Only fail if all statements fail
### Phase 3: Type Safety & Validation
5. **Comprehensive Type Checking**
- Run `mise run typecheck` - must pass with 0 errors
- Validate all `date` object conversions
- Ensure Pydantic model compliance
6. **Enhanced Testing**
- Update existing tests for new date handling
- Add gap detection test scenarios
- Test partial data scenarios
- Test force refresh behavior
- Test date validation edge cases
## Testing Scenarios
### Integration Tests
1. **Gap Detection**
- Test with empty cache (should fetch all)
- Test with partial cache (should fetch only missing)
- Test with complete cache (should fetch none)
2. **Partial Data Recovery**
- Test when balance sheet API fails but others succeed
- Test when only one statement type is available
- Test when all APIs fail (should raise exception)
3. **Date Handling**
- Test invalid date formats
- Test end_date < start_date
- Test boundary conditions (year start/end)
4. **Force Refresh**
- Test that force_refresh=True clears cache
- Test that new data is fetched and stored
## Success Criteria
### Functional Requirements
- ✅ Service successfully calls `FinnhubClient` with `date` objects
- ✅ Gap detection correctly identifies missing reports
- ✅ Partial data scenarios handled gracefully
- ✅ Local-first strategy works: checks cache → identifies gaps → fetches missing → stores updates
- ✅ Returns properly validated `FundamentalContext` to agents
- ✅ Force refresh bypasses cache and refreshes data
### Technical Requirements
- ✅ Zero type checking errors: `mise run typecheck`
- ✅ Zero linting errors: `mise run lint`
- ✅ All existing tests pass
- ✅ No runtime errors with date conversions
- ✅ Proper error messages for validation failures
### Quality Requirements
- ✅ Strongly-typed interfaces between all components
- ✅ Comprehensive error handling and logging
- ✅ Efficient caching with minimal API calls
- ✅ Clear separation of concerns between service, client, and repository
## Dependencies
### Completed
- ✅ `FinnhubClient` with `date` object interface
- ✅ `FundamentalDataRepository` with dataclass storage
- ✅ `FundamentalContext` Pydantic model
### Required
- Working `FinnhubClient` instance with valid API key
- Writable data directory for repository storage
## Timeline
### Immediate (Today)
- Fix critical date conversion and reference issues
- Implement basic gap detection
- Add date validation
### Next Steps
- Implement partial data handling
- Comprehensive testing
- Integration with agent workflows
## Acceptance Criteria
### Must Have
1. **Type Safety**: Service passes `mise run typecheck` with zero errors
2. **Client Integration**: All `FinnhubClient` calls use `date` objects correctly
3. **Gap Detection**: Correctly identifies missing report periods
4. **Partial Data**: Service returns partial context when some statements fail
5. **Local-First**: Service checks repository before API calls
6. **Context Validation**: Returns valid `FundamentalContext` with Pydantic validation
7. **Error Handling**: Graceful handling of API failures and missing data
### Should Have
1. **Cache Efficiency**: Minimal redundant API calls
2. **Force Refresh**: Complete cache bypass when requested
3. **Data Quality**: Metadata indicating data completeness
4. **Clear Error Messages**: Informative errors for date validation failures
### Nice to Have
1. **Performance Metrics**: Timing and cache hit rate logging
2. **Fiscal Year Handling**: Support for non-calendar fiscal years
3. **Bulk Operations**: Fetch multiple symbols efficiently
---
This PRD focuses on completing the `FundamentalDataService` as a strongly-typed, local-first data service that seamlessly integrates with the existing `FinnhubClient` and `FundamentalDataRepository` components while providing robust gap detection and partial data handling.


@@ -1,502 +0,0 @@
# Product Requirements Document: MarketDataService Completion
## Overview
Complete the `MarketDataService` to provide strongly-typed market data and technical indicators to trading agents using a local-first data strategy with gap detection and intelligent caching.
## Current State Analysis
### Issues to Fix
- **CRITICAL**: Service relies on generic `BaseClient` inheritance; the existing `YFinanceClient` needs refactoring to the FinnhubClient standard
- **CRITICAL**: Service calls client methods with string dates instead of date objects
- **CRITICAL**: Need to integrate `stockstats` library for technical analysis calculations instead of legacy utils
- **CRITICAL**: `MarketDataRepository` exists but missing service interface methods
- Missing strongly-typed interface between YFinanceClient and service
- YFinanceClient uses BaseClient inheritance and string dates (needs refactoring)
- No concrete gap detection logic
- Missing technical indicator data sufficiency validation
### What Works
- ✅ Local-first data strategy implementation (`_get_price_data_local_first`)
- ✅ Force refresh logic (`_fetch_and_cache_fresh_data`)
- ✅ `MarketDataContext` Pydantic model for agent consumption
- ✅ Error handling and metadata creation patterns
- ✅ `YFinanceClient` exists with yfinance SDK integration and comprehensive methods
- ✅ `MarketDataRepository` exists with CSV storage and pandas DataFrame operations
- ✅ Service structure ready for `stockstats` integration for technical analysis
## Technical Requirements
### 1. Strongly-Typed Interfaces
#### Client → Service Interface
```python
# YFinanceClient methods (to be refactored)
def get_historical_data(symbol: str, start_date: date, end_date: date) -> dict[str, Any]
def get_price_data(symbol: str, start_date: date, end_date: date) -> dict[str, Any]
# Technical analysis handled in service layer using stockstats
# No get_technical_indicator method needed in client - calculated from OHLCV data
```
#### Service → Repository Interface
```python
# MarketDataRepository methods (to be implemented)
def has_data_for_period(symbol: str, start_date: str, end_date: str) -> bool
def get_data(symbol: str, start_date: str, end_date: str) -> dict[str, Any]
def store_data(symbol: str, cache_data: dict, overwrite: bool) -> bool
def clear_data(symbol: str, start_date: str, end_date: str) -> bool
```
#### Service → Agent Interface
```python
# Service output (already defined)
def get_context(symbol: str, start_date: str, end_date: str, indicators: list[str], force_refresh: bool) -> MarketDataContext
```
### 2. Local-First Data Strategy
#### Flow
1. **Repository Lookup**: Check `MarketDataRepository.has_data_for_period()`
2. **Gap Detection**: Identify missing price data periods using `detect_market_gaps()`
3. **Data Sufficiency Check**: Ensure enough historical data for requested indicators
4. **Selective Fetching**: Fetch only missing data from `YFinanceClient`
5. **Cache Updates**: Store new data via `repository.store_data()`
6. **Context Assembly**: Return validated `MarketDataContext`
#### Gap Detection Implementation
```python
def detect_market_gaps(self, cached_dates: list[str], requested_start: str, requested_end: str) -> list[tuple[str, str]]:
"""
Returns list of (start, end) tuples for missing periods.
Example: If requesting 2024-01-01 to 2024-01-31 and cache has:
- 2024-01-01 to 2024-01-10
- 2024-01-20 to 2024-01-25
Returns: [("2024-01-11", "2024-01-19"), ("2024-01-26", "2024-01-31")]
Accounts for:
- Weekends (Saturday/Sunday)
- Market holidays
- Continuous date ranges to minimize API calls
"""
# Implementation should use pandas business day logic
```
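A rough sketch of that approach using pandas business days; weekends are excluded automatically, but real exchange holidays would still surface as gaps unless an exchange calendar (not used in this sketch) is layered on top:
```python
import pandas as pd


def detect_market_gaps(
    cached_dates: list[str], requested_start: str, requested_end: str
) -> list[tuple[str, str]]:
    """Group missing business days into contiguous (start, end) ranges."""
    expected = pd.bdate_range(requested_start, requested_end)  # weekends excluded
    cached = {pd.Timestamp(d) for d in cached_dates}
    missing = [ts for ts in expected if ts not in cached]

    gaps: list[tuple[str, str]] = []
    run_start = prev = None
    for ts in missing:
        if run_start is None:
            run_start = prev = ts
        elif (ts - prev).days <= 3:  # consecutive business days, allowing a weekend in between
            prev = ts
        else:
            gaps.append((run_start.date().isoformat(), prev.date().isoformat()))
            run_start = prev = ts
    if run_start is not None:
        gaps.append((run_start.date().isoformat(), prev.date().isoformat()))
    return gaps
```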
#### Force Refresh Support
- `force_refresh=True` bypasses local data completely
- Clears existing cache before fetching fresh data
- Stores refreshed data with metadata indicating refresh
#### Cache Invalidation Strategy
- **Historical data is immutable**: Data older than yesterday never changes
- **Today's data needs updates**: During market hours, refresh every 15 minutes
- **After market close**: Today's data becomes immutable
```python
def is_data_stale(self, data_date: date, last_updated: datetime) -> bool:
today = date.today()
if data_date < today:
return False # Historical data never stale
    # For today's data, refresh if the market is open and the last update is older than 15 minutes
    if is_market_open() and (datetime.now() - last_updated).total_seconds() > 15 * 60:
        return True
    return False
```
### 3. Date Object Conversion
#### Service Boundary Conversion
```python
# Service receives string dates from agents
def get_context(self, symbol: str, start_date: str, end_date: str, ...) -> MarketDataContext:
# Validate date strings
try:
start_dt = date.fromisoformat(start_date)
end_dt = date.fromisoformat(end_date)
except ValueError as e:
raise ValueError(f"Invalid date format: {e}")
# Check date order
if end_dt < start_dt:
raise ValueError(f"End date {end_date} is before start date {start_date}")
# Expand date range for technical indicators
expanded_start = self._calculate_lookback_start(start_dt, indicators)
# Use date objects when calling YFinanceClient
price_data = self.yfinance_client.get_historical_data(symbol, expanded_start, end_dt)
# Calculate technical indicators using stockstats library
technical_indicators = self._calculate_technical_indicators(price_data, indicators)
```
### 4. Technical Analysis with Stockstats
#### Data Sufficiency Validation
```python
# Minimum data points required for each indicator
INDICATOR_REQUIREMENTS = {
"sma_20": 20,
"sma_200": 200,
"ema_12": 24, # 2x for exponential smoothing
"ema_200": 400,
"rsi_14": 28, # 2x period for warm-up
"macd": 34, # 26 + 8 for signal line
"bb_upper": 20, # Based on 20-period SMA
"atr_14": 28, # 2x period for accuracy
"stochrsi_14": 42, # 3x period for double smoothing
}
def _calculate_lookback_start(self, start_date: date, indicators: list[str]) -> date:
"""Calculate how far back we need data to compute indicators accurately."""
max_lookback = 0
for indicator in indicators:
lookback = INDICATOR_REQUIREMENTS.get(indicator, 0)
max_lookback = max(max_lookback, lookback)
# Add buffer for weekends/holidays
business_days_back = max_lookback * 1.5
return start_date - timedelta(days=int(business_days_back))
def _validate_data_sufficiency(self, data_points: int, indicators: list[str]) -> dict[str, bool]:
"""Check if we have enough data for each indicator."""
return {
indicator: data_points >= INDICATOR_REQUIREMENTS.get(indicator, 0)
for indicator in indicators
}
```
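As a quick worked example of the lookback expansion, using simplified standalone copies of the helpers above:
```python
from datetime import date, timedelta

INDICATOR_REQUIREMENTS = {"sma_200": 200, "rsi_14": 28}


def calculate_lookback_start(start_date: date, indicators: list[str]) -> date:
    max_lookback = max((INDICATOR_REQUIREMENTS.get(i, 0) for i in indicators), default=0)
    return start_date - timedelta(days=int(max_lookback * 1.5))  # buffer for weekends/holidays


print(calculate_lookback_start(date(2024, 6, 3), ["sma_200", "rsi_14"]))
# 2023-08-08 -- 300 calendar days before the requested start, enough to warm up sma_200
```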
#### Stockstats Integration
```python
def _calculate_technical_indicators(self, price_data: list[dict], indicators: list[str]) -> dict[str, list[dict]]:
"""
Calculate technical indicators using stockstats library.
Args:
price_data: OHLCV data from YFinanceClient
indicators: List of requested indicators (e.g., ['rsi_14', 'macd', 'bb_upper', 'sma_20'])
Returns:
Dict mapping indicator names to time series data
"""
import pandas as pd
from stockstats import StockDataFrame
# Convert price data to pandas DataFrame
df = pd.DataFrame(price_data)
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Check data sufficiency
sufficiency = self._validate_data_sufficiency(len(df), indicators)
# Create StockDataFrame for technical analysis
sdf = StockDataFrame.retype(df)
# Calculate requested indicators
indicator_data = {}
for indicator in indicators:
if not sufficiency[indicator]:
logger.warning(f"Insufficient data for {indicator}, need {INDICATOR_REQUIREMENTS[indicator]} points")
indicator_data[indicator] = []
continue
try:
            # Accessing the column triggers stockstats' lazy calculation of the indicator
            values = sdf[indicator].dropna()
            indicator_data[indicator] = [
                {"date": idx.strftime("%Y-%m-%d"), "value": float(val)}
                for idx, val in values.items()
            ]
except Exception as e:
logger.warning(f"Failed to calculate {indicator}: {e}")
indicator_data[indicator] = []
return indicator_data
```
### 5. Error Recovery and Partial Data
```python
def handle_partial_price_data(
    self,
    symbol: str,
    requested_start: str,
    requested_end: str,
    available_data: list[dict]
) -> MarketDataContext:
"""
Handle cases where only partial date range is available.
- If no data available: Raise exception
- If partial data: Return what's available with metadata
- Mark gaps in metadata
"""
if not available_data:
raise ValueError(f"No market data available for {symbol}")
actual_start = min(d['date'] for d in available_data)
actual_end = max(d['date'] for d in available_data)
metadata = {
"requested_period": {"start": requested_start, "end": requested_end},
"actual_period": {"start": actual_start, "end": actual_end},
"partial_data": actual_start > requested_start or actual_end < requested_end,
"data_points": len(available_data)
}
# Return context with available data and metadata
```
### 6. Pydantic Validation
#### Context Structure
```python
class MarketDataContext(BaseModel):
symbol: str
period: dict[str, str] # {"start": "2024-01-01", "end": "2024-01-31"}
price_data: list[dict[str, Any]] # OHLCV records
technical_indicators: dict[str, list[TechnicalIndicatorData]]
metadata: dict[str, Any]
@validator('price_data')
def validate_price_data(cls, v):
# Ensure OHLCV fields present and valid
required_fields = {'date', 'open', 'high', 'low', 'close', 'volume'}
for record in v:
if not all(field in record for field in required_fields):
raise ValueError(f"Missing required OHLCV fields")
return v
```
## Implementation Tasks
### Phase 1: Refactor YFinanceClient
1. **YFinanceClient Refactoring**
- **Refactor existing** `tradingagents/clients/yfinance_client.py`
- Remove BaseClient inheritance
- Update all method signatures to accept `date` objects instead of strings
- Keep all existing functionality intact
- Example changes:
```python
# Current (wrong)
def get_historical_data(self, symbol: str, start_date: str, end_date: str) -> dict[str, Any]:
# Updated (correct)
def get_historical_data(self, symbol: str, start_date: date, end_date: date) -> dict[str, Any]:
```
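A hedged sketch of what the refactored method might look like on top of the official yfinance SDK; column names and the exclusive `end` behaviour should be verified against the pinned yfinance version:
```python
from datetime import date, timedelta
from typing import Any

import yfinance as yf


def get_historical_data(symbol: str, start_date: date, end_date: date) -> dict[str, Any]:
    """Fetch daily OHLCV bars; `end` is treated as exclusive, hence the +1 day."""
    history = yf.Ticker(symbol).history(start=start_date, end=end_date + timedelta(days=1))
    records = [
        {
            "date": idx.date().isoformat(),
            "open": float(row["Open"]),
            "high": float(row["High"]),
            "low": float(row["Low"]),
            "close": float(row["Close"]),
            "volume": int(row["Volume"]),
        }
        for idx, row in history.iterrows()
    ]
    return {
        "symbol": symbol,
        "period": {"start": start_date.isoformat(), "end": end_date.isoformat()},
        "data": records,
    }
```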
2. **Comprehensive Testing**
- Update `tradingagents/clients/test_yfinance_client.py`
- Test with date objects
- Use pytest-vcr for HTTP interaction recording
- Test error handling and edge cases
### Phase 2: Update MarketDataRepository
3. **Repository Interface Enhancement**
- Update existing `tradingagents/repositories/market_data_repository.py`
- Add missing service interface methods: `has_data_for_period()`, `get_data()`, `store_data()`, `clear_data()`
- Maintain existing CSV/pandas functionality while adding service compatibility
- Support gap detection and partial data scenarios
### Phase 3: Update MarketDataService
4. **Client Integration Fix**
- Replace `BaseClient` dependency with `YFinanceClient`
- File: `tradingagents/services/market_data_service.py:8, 26`
- Update constructor to accept `yfinance_client: YFinanceClient`
5. **Date Conversion and Validation**
- Add `date.fromisoformat()` conversion in service methods
- Add date validation (format, order)
- Update client calls to use date objects instead of strings
- File: `tradingagents/services/market_data_service.py:151, 227`
6. **Technical Indicator Integration with Stockstats**
- Implement `_calculate_technical_indicators()` method using `stockstats` library
- Add `_calculate_lookback_start()` for data sufficiency
- Add `_validate_data_sufficiency()` to check if enough data
- Replace legacy `StockstatsUtils` integration with direct stockstats usage
- File: `tradingagents/services/market_data_service.py:9, 43, 280-346`
### Phase 4: Type Safety & Validation
7. **Comprehensive Type Checking**
- Run `mise run typecheck` - must pass with 0 errors
- Validate all date object conversions
- Ensure MarketDataContext compliance
8. **Enhanced Testing**
- Update existing service tests for new YFinanceClient interface
- Add gap detection test scenarios
- Test technical indicator data sufficiency
- Test partial data handling
## Testing Scenarios
### Integration Tests
1. **Gap Detection**
- Test with empty cache (should fetch all)
- Test with partial cache (should fetch only missing periods)
- Test weekend/holiday handling
2. **Technical Indicator Sufficiency**
- Test SMA_200 with only 100 days of data (should skip indicator)
- Test RSI_14 with exactly 28 days (should calculate)
- Test mixed indicators with varying data requirements
3. **Partial Data Recovery**
- Test when API returns less data than requested
- Test when some dates are missing (holidays)
- Test metadata accuracy for partial data
4. **Date Handling**
- Test invalid date formats
- Test end_date < start_date
- Test future dates
- Test weekend date handling
5. **Cache Staleness**
- Test historical data (should never refresh)
- Test today's data during market hours (should refresh if > 15 min)
- Test today's data after market close (should not refresh)
## Success Criteria
### Functional Requirements
- ✅ Service successfully calls refactored `YFinanceClient` with `date` objects
- ✅ Gap detection correctly identifies missing trading days
- ✅ Technical indicators validate data sufficiency before calculation
- ✅ Partial data scenarios handled gracefully
- ✅ Local-first strategy works: checks cache → identifies gaps → fetches missing → stores updates
- ✅ Returns properly validated `MarketDataContext` to agents
- ✅ Technical indicators calculated from OHLCV data using stockstats library
- ✅ Force refresh bypasses cache and refreshes data
### Technical Requirements
- ✅ Zero type checking errors: `mise run typecheck`
- ✅ Zero linting errors: `mise run lint`
- ✅ All existing tests pass with updated architecture
- ✅ No runtime errors with date conversions
- ✅ Proper error messages for validation failures
### Quality Requirements
- ✅ Strongly-typed interfaces between all components
- ✅ Official yfinance SDK and stockstats library usage
- ✅ Comprehensive error handling and logging
- ✅ Efficient caching with minimal API calls
- ✅ Clear separation of concerns between service, client, and repository
## Data Architecture
### YFinanceClient Response Format
```python
{
"symbol": "AAPL",
"period": {"start": "2024-01-01", "end": "2024-01-31"},
"data": [
{
"date": "2024-01-02", # Note: Jan 1 was a holiday
"open": 150.0,
"high": 155.0,
"low": 149.0,
"close": 154.0,
"volume": 1000000,
"adj_close": 154.0
},
...
],
"metadata": {
"source": "yfinance",
"retrieved_at": "2024-01-31T10:00:00Z",
"data_quality": "HIGH",
"missing_dates": ["2024-01-01", "2024-01-15"] # Holidays
}
}
```
### Technical Indicator Data Format
```python
# MarketDataContext.technical_indicators structure
{
"rsi_14": [
{"date": "2024-01-29", "value": 65.5}, # First valid after 28 days
{"date": "2024-01-30", "value": 67.2},
...
],
"sma_200": [], # Empty if insufficient data
"macd": [
{"date": "2024-01-31", "value": {"macd": 2.1, "signal": 1.8, "histogram": 0.3}}
],
"_metadata": {
"indicators_calculated": ["rsi_14", "macd"],
"indicators_skipped": {
"sma_200": "Insufficient data: need 200 points, have 31"
}
}
}
```
## Dependencies
### Existing Components (Need Updates)
- ✅ `YFinanceClient` exists but needs refactoring (remove BaseClient, use date objects)
- ✅ `MarketDataRepository` exists with CSV storage but needs service interface methods
- ✅ Tests exist but need updates for new interfaces
### Required
- Official `yfinance` library for market data fetching
- `stockstats` library for technical analysis calculations
- `pandas` for date/time handling and business day calculations
- Working internet connection for live data fetching
- Writable data directory for repository storage
## Timeline
### Immediate (Phase 1)
- Refactor existing YFinanceClient to use date objects
- Remove BaseClient inheritance
- Update tests for new interface
### Phase 2-3
- Add service interface methods to MarketDataRepository
- Update MarketDataService to use refactored YFinanceClient
- Implement data sufficiency validation
- Integrate stockstats library for technical indicators
### Phase 4
- Comprehensive type checking and validation
- Integration testing with gap detection
- Performance optimization and caching efficiency
## Acceptance Criteria
### Must Have
1. **Type Safety**: Service passes `mise run typecheck` with zero errors
2. **Client Refactoring**: YFinanceClient uses date objects, no BaseClient
3. **Gap Detection**: Correctly identifies missing trading days
4. **Data Sufficiency**: Validates enough data for technical indicators
5. **Partial Data**: Service handles incomplete data gracefully
6. **Local-First**: Service checks repository before API calls
7. **Context Validation**: Returns valid `MarketDataContext` with Pydantic validation
8. **Technical Indicators**: Calculated using stockstats with proper validation
### Should Have
1. **Cache Efficiency**: Minimal redundant API calls to Yahoo Finance
2. **Force Refresh**: Complete cache bypass when requested
3. **Stale Data Handling**: Refresh today's data during market hours
4. **Clear Error Messages**: Informative errors for validation failures
### Nice to Have
1. **Performance Metrics**: Timing and cache hit rate logging
2. **Extended Indicators**: Support for 50+ technical indicators
3. **Real-time Data**: WebSocket integration for live prices
4. **Bulk Symbol Support**: Fetch multiple symbols efficiently
---
This PRD focuses on completing the `MarketDataService` as a strongly-typed, local-first data service that integrates OHLCV price data from a refactored `YFinanceClient` and calculates comprehensive technical indicators using the `stockstats` library, with robust gap detection and data sufficiency validation.


@@ -1,779 +0,0 @@
# Product Requirements Document: NewsService Completion
## Overview
Complete the `NewsService` to provide strongly-typed news data and sentiment analysis to trading agents using a local-first data strategy with RSS feed integration, article content extraction, and LLM-powered sentiment analysis.
## Current State Analysis
### Issues to Fix
- **CRITICAL**: Service is currently an empty placeholder with only method stubs
- **CRITICAL**: Need to implement GoogleNewsClient to read RSS feeds
- **CRITICAL**: Need RSS article fetching with fallback to Internet Archive
- **CRITICAL**: Need LLM-powered sentiment analysis integration
- **CRITICAL**: Service uses `BaseClient` inheritance instead of typed clients
- **CRITICAL**: `NewsRepository` has different interface than service expectations
- Missing strongly-typed interfaces between components
- No concrete approach for article content extraction
### What Works
- ✅ `NewsContext` and `ArticleData` Pydantic models for agent consumption
- ✅ `SentimentScore` model for structured sentiment data
- ✅ `FinnhubClient` with `get_company_news()` method using date objects
- ✅ `NewsRepository` with dataclass-based storage and deduplication
- ✅ Service structure placeholder ready for implementation
## Technical Requirements
### 1. Strongly-Typed Interfaces
#### Client → Service Interface
```python
# FinnhubClient methods (already implemented)
def get_company_news(symbol: str, start_date: date, end_date: date) -> dict[str, Any]
# GoogleNewsClient methods (to be implemented)
def fetch_rss_feed(query: str, start_date: date, end_date: date) -> dict[str, Any]
def fetch_article_content(url: str, use_archive_fallback: bool = True) -> dict[str, Any]
def get_company_news(symbol: str, start_date: date, end_date: date) -> dict[str, Any]
def get_global_news(start_date: date, end_date: date, categories: list[str]) -> dict[str, Any]
```
#### Service → Repository Interface
```python
# NewsRepository methods (to be implemented/bridged)
def has_data_for_period(query: str, start_date: str, end_date: str, symbol: str | None) -> bool
def get_data(query: str, start_date: str, end_date: str, symbol: str | None) -> dict[str, Any]
def store_data(query: str, cache_data: dict, symbol: str | None, overwrite: bool) -> bool
def clear_data(query: str, start_date: str, end_date: str, symbol: str | None) -> bool
```
#### Service → Agent Interface
```python
# Service output (already defined)
def get_context(query: str, start_date: str, end_date: str, symbol: str | None, sources: list[str], force_refresh: bool) -> NewsContext
```
### 2. Local-First Data Strategy
#### Flow
1. **Repository Lookup**: Check `NewsRepository.has_data_for_period()`
2. **Freshness Check**: Determine if cache needs updating (news is append-only)
3. **RSS Feed Fetching**: Fetch RSS feeds from Google News
4. **Content Extraction**: Extract full article content with Internet Archive fallback
5. **LLM Analysis**: Perform sentiment analysis using LLM
6. **Cache Updates**: Store enriched articles via `repository.store_data()`
7. **Context Assembly**: Return validated `NewsContext`
#### News-Specific Gap Detection
```python
def should_fetch_new_articles(self, last_fetch_time: datetime, current_time: datetime) -> bool:
"""
News doesn't have "gaps" - it's append-only. Check if enough time passed for new articles.
Returns True if:
- Last fetch was more than 6 hours ago
- User requested force_refresh
- No data exists for the query/period
"""
if not last_fetch_time:
return True
hours_since_fetch = (current_time - last_fetch_time).total_seconds() / 3600
return hours_since_fetch >= 6 # Fetch new articles every 6 hours
```
#### Force Refresh Support
- `force_refresh=True` fetches all articles fresh from sources
- Does NOT clear existing cache (news is immutable)
- Deduplicates against existing articles before storing
#### Cache Invalidation Strategy
- **Articles are immutable**: Once published, articles don't change
- **Cache grows append-only**: New articles are added, old ones retained
- **Freshness check**: Re-fetch every 6 hours for new articles
- **No deletion**: Articles are never removed from cache
### 3. RSS Feed Processing & Article Fetching
#### GoogleNewsClient RSS Implementation
```python
import feedparser
from newspaper import Article
import requests
from datetime import date, datetime
from typing import Any, Optional
class GoogleNewsClient:
"""Google News RSS client following FinnhubClient standard."""
def __init__(self):
self.base_rss_url = "https://news.google.com/rss"
self.archive_base_url = "https://archive.org/wayback/available"
def fetch_rss_feed(self, query: str, start_date: date, end_date: date) -> dict[str, Any]:
"""
Fetch RSS feed data for news articles.
Args:
query: Search query or company symbol
start_date: Start date for filtering articles
end_date: End date for filtering articles
Returns:
Dict containing RSS feed articles with metadata
"""
# Construct RSS feed URL
rss_url = f"{self.base_rss_url}/search?q={query}&hl=en-US&gl=US&ceid=US:en"
# Parse RSS feed
feed = feedparser.parse(rss_url)
# Filter and structure articles
articles = []
for entry in feed.entries:
# Parse publication date
pub_date = datetime(*entry.published_parsed[:6]).date()
# Filter by date range
if start_date <= pub_date <= end_date:
articles.append({
"headline": entry.title,
"url": entry.link,
"source": entry.source.get('title', 'Google News'),
"date": pub_date.isoformat(),
"summary": entry.get('summary', ''),
})
return {
"query": query,
"period": {"start": start_date.isoformat(), "end": end_date.isoformat()},
"articles": articles,
"metadata": {
"source": "google_news_rss",
"rss_feed_url": rss_url,
"article_count": len(articles)
}
}
def fetch_article_content(self, url: str, use_archive_fallback: bool = True) -> dict[str, Any]:
"""
Fetch full article content from URL with Internet Archive fallback.
Args:
url: Article URL to fetch
use_archive_fallback: Whether to try Internet Archive if direct fetch fails
Returns:
Dict containing article content, title, publication date
"""
try:
# Try direct fetch
article = Article(url)
article.download()
article.parse()
return {
"content": article.text,
"title": article.title,
"authors": article.authors,
"publish_date": article.publish_date.isoformat() if article.publish_date else None,
"extracted_via": "direct_fetch",
"extraction_success": True
}
except Exception as e:
if use_archive_fallback:
# Try Internet Archive
archive_url = self._get_archive_url(url)
if archive_url:
try:
article = Article(archive_url)
article.download()
article.parse()
return {
"content": article.text,
"title": article.title,
"authors": article.authors,
"publish_date": article.publish_date.isoformat() if article.publish_date else None,
"extracted_via": "internet_archive",
"extraction_success": True
}
except Exception:
pass
# Return failure
return {
"content": "",
"title": "",
"extracted_via": "failed",
"extraction_success": False,
"error": str(e)
}
def _get_archive_url(self, url: str) -> Optional[str]:
"""Get Internet Archive URL for a given URL."""
try:
response = requests.get(f"{self.archive_base_url}?url={url}")
data = response.json()
if data.get("archived_snapshots", {}).get("closest", {}).get("available"):
return data["archived_snapshots"]["closest"]["url"]
except Exception:
pass
return None
```
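A hypothetical usage sketch of the client above; rate limiting, retries, and error handling are deliberately elided:
```python
from datetime import date

client = GoogleNewsClient()
feed = client.fetch_rss_feed("AAPL earnings", date(2024, 1, 1), date(2024, 1, 31))

for item in feed["articles"][:5]:  # enrich only a handful while experimenting
    enriched = client.fetch_article_content(item["url"])
    print(item["date"], item["source"], enriched["extracted_via"])
```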
### 4. LLM-Powered Sentiment Analysis
#### Sentiment Analysis Integration
```python
import json
import time


class LLMSentimentAnalyzer:
"""LLM-based sentiment analyzer for financial news."""
def __init__(self, llm_client):
self.llm_client = llm_client
self.sentiment_prompt = """
Analyze the sentiment of this financial news article for trading purposes.
Article:
Title: {headline}
Content: {content}
Provide your analysis in the following JSON format:
{{
"score": <float between -1.0 (very negative) and 1.0 (very positive)>,
"confidence": <float between 0.0 and 1.0>,
"label": <"positive", "negative", or "neutral">,
"reasoning": <brief explanation>,
"key_themes": <list of key financial themes>,
"financial_entities": <list of mentioned companies/tickers>
}}
Focus on the financial and market implications of the news.
"""
def analyze_sentiment(self, article: ArticleData) -> SentimentScore:
"""
Analyze article sentiment using LLM.
Args:
article: Article data with headline and content
Returns:
SentimentScore with score, confidence, and label
"""
# Prepare prompt
prompt = self.sentiment_prompt.format(
headline=article.headline,
content=article.content[:2000] # Limit content length
)
# Get LLM response
response = self.llm_client.complete(prompt)
# Parse response
try:
result = json.loads(response)
# Convert to SentimentScore
score = result.get("score", 0.0)
return SentimentScore(
positive=max(0, score),
negative=abs(min(0, score)),
neutral=1.0 - abs(score),
metadata={
"confidence": result.get("confidence", 0.5),
"label": result.get("label", "neutral"),
"reasoning": result.get("reasoning", ""),
"key_themes": result.get("key_themes", []),
"financial_entities": result.get("financial_entities", [])
}
)
except Exception as e:
# Return neutral sentiment on error
return SentimentScore(
positive=0.0,
negative=0.0,
neutral=1.0,
metadata={"error": str(e)}
)
def batch_analyze(self, articles: list[ArticleData], batch_size: int = 5) -> list[SentimentScore]:
"""
Batch process sentiment analysis for multiple articles.
Args:
articles: List of articles to analyze
batch_size: Number of articles to process in parallel
Returns:
List of sentiment scores corresponding to input articles
"""
results = []
for i in range(0, len(articles), batch_size):
batch = articles[i:i + batch_size]
# Process batch (could be parallelized)
for article in batch:
sentiment = self.analyze_sentiment(article)
results.append(sentiment)
# Add small delay to respect rate limits
time.sleep(0.1)
return results
```
### 5. Date Object Conversion
#### Service Boundary Conversion
```python
# Service receives string dates from agents
def get_context(self, query: str, start_date: str, end_date: str, ...) -> NewsContext:
# Validate date strings
try:
start_dt = date.fromisoformat(start_date)
end_dt = date.fromisoformat(end_date)
except ValueError as e:
raise ValueError(f"Invalid date format: {e}")
# Check date order
if end_dt < start_dt:
raise ValueError(f"End date {end_date} is before start date {start_date}")
# Fetch from multiple sources
finnhub_data = self.finnhub_client.get_company_news(symbol, start_dt, end_dt) if symbol else None
google_rss = self.google_client.fetch_rss_feed(query, start_dt, end_dt)
# Fetch full article content for RSS articles
for article in google_rss.get('articles', []):
content_data = self.google_client.fetch_article_content(article['url'])
article.update(content_data)
# Combine all articles
all_articles = self._combine_and_deduplicate(finnhub_data, google_rss)
# Perform LLM sentiment analysis
enriched_articles = []
for article in all_articles:
article_data = ArticleData(**article)
article_data.sentiment = self.sentiment_analyzer.analyze_sentiment(article_data)
enriched_articles.append(article_data)
# Create and return context
return self._create_news_context(enriched_articles, start_date, end_date)
```
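The `_combine_and_deduplicate` helper referenced above is not spelled out in this PRD; a minimal URL-based version, assuming each source dict exposes an `articles` list, might look like this:
```python
from typing import Any


def combine_and_deduplicate(*sources: dict[str, Any] | None) -> list[dict[str, Any]]:
    """Merge article lists from several sources, keeping the first occurrence of each URL."""
    seen: set[str] = set()
    combined: list[dict[str, Any]] = []
    for source in sources:
        for article in (source or {}).get("articles", []):
            url = article.get("url", "")
            if url and url not in seen:
                seen.add(url)
                combined.append(article)
    return combined
```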
### 6. Error Recovery and Partial Data
```python
def handle_source_failure(
self,
finnhub_data: dict | None,
google_data: dict | None,
errors: dict[str, Exception]
) -> NewsContext:
"""
Handle cases where one or more news sources fail.
- If all sources fail: Raise exception
- If some sources succeed: Return partial data with metadata
- Track content extraction failures separately
"""
if not finnhub_data and not google_data:
raise ValueError("All news sources failed to return data")
# Track extraction statistics
extraction_stats = {
"total_articles": 0,
"successful_extractions": 0,
"archive_fallbacks": 0,
"failed_extractions": 0
}
# Process available articles
all_articles = []
successful_sources = []
if finnhub_data:
all_articles.extend(finnhub_data.get('articles', []))
successful_sources.append('finnhub')
if google_data:
articles = google_data.get('articles', [])
for article in articles:
extraction_stats["total_articles"] += 1
if article.get("extraction_success"):
extraction_stats["successful_extractions"] += 1
if article.get("extracted_via") == "internet_archive":
extraction_stats["archive_fallbacks"] += 1
else:
extraction_stats["failed_extractions"] += 1
all_articles.extend(articles)
successful_sources.append('google_news')
metadata = {
"sources_requested": ["finnhub", "google_news"],
"sources_successful": successful_sources,
"sources_failed": {source: str(error) for source, error in errors.items()},
"extraction_stats": extraction_stats,
"partial_data": len(successful_sources) < 2
}
# Deduplicate and return context
return self._create_context(all_articles, metadata)
```
### 7. Repository Method Bridging
```python
# Add these bridge methods to NewsRepository
def has_data_for_period(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool:
"""Bridge to existing get_news_data method."""
existing_data = self.get_news_data(
symbol=symbol or query,
start_date=start_date,
end_date=end_date
)
return len(existing_data.get('articles', [])) > 0
def get_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> dict[str, Any]:
"""Bridge to existing get_news_data method."""
return self.get_news_data(
symbol=symbol or query,
start_date=start_date,
end_date=end_date
)
def store_data(self, query: str, cache_data: dict, symbol: str | None = None, overwrite: bool = False) -> bool:
"""Bridge to existing store_news_articles method."""
articles = cache_data.get('articles', [])
if not articles:
return False
# Convert to expected format
news_articles = [
NewsArticle(
symbol=symbol or query,
headline=a['headline'],
summary=a.get('summary', ''),
content=a.get('content', ''),
url=a['url'],
source=a['source'],
date=a['date'],
entities=a.get('entities', []),
sentiment_score=a.get('sentiment', {}).get('score', 0.0),
sentiment_metadata=a.get('sentiment', {})
)
for a in articles
]
return self.store_news_articles(news_articles)
def clear_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool:
"""News is append-only, so this just marks data as stale for re-fetch."""
# Implementation depends on repository design
# Could update metadata to trigger re-fetch
return True
```
### 8. Pydantic Validation
#### Context Structure
```python
class NewsContext(BaseModel):
symbol: str | None
period: dict[str, str] # {"start": "2024-01-01", "end": "2024-01-31"}
articles: list[ArticleData]
sentiment_summary: SentimentScore
article_count: int
sources: list[str]
metadata: dict[str, Any]
@validator('period')
def validate_period(cls, v):
# Ensure start and end dates are present and valid
if 'start' not in v or 'end' not in v:
raise ValueError("Period must have 'start' and 'end' dates")
return v
@validator('articles')
def validate_articles(cls, v):
# Ensure no duplicate URLs
urls = [a.url for a in v]
if len(urls) != len(set(urls)):
raise ValueError("Duplicate articles detected")
return v
```
## Implementation Tasks
### Phase 1: Create GoogleNewsClient
1. **GoogleNewsClient Implementation**
- Create `tradingagents/clients/google_news_client.py` following FinnhubClient standard
- Implement RSS feed parsing using `feedparser` library
- Add `fetch_rss_feed()` method with Google News RSS integration
- Add `fetch_article_content()` method with `newspaper3k` and Internet Archive fallback
- Use `date` objects for all date parameters
- No BaseClient inheritance
2. **Article Content Extraction**
- Implement robust article content extraction using `newspaper3k`
- Add fallback to Internet Archive Wayback Machine for failed fetches
- Handle paywall detection and alternative content sources
- Extract clean text, title, publication date, and metadata
3. **Comprehensive Testing**
- Create test suite for GoogleNewsClient
- Test RSS parsing with various queries
- Test content extraction with real and archived URLs
- Use pytest-vcr for HTTP interaction recording
### Phase 2: Bridge NewsRepository Interface
4. **Repository Interface Standardization**
- Add standard service interface methods to `NewsRepository`
- Bridge existing methods without changing underlying storage
- File: `tradingagents/repositories/news_repository.py`
- Maintain backward compatibility
### Phase 3: Implement NewsService
5. **Service Core Implementation**
- Replace method stubs with full implementation
- Implement `get_context()`, `get_company_news_context()`, `get_global_news_context()`
- Add local-first data strategy with freshness checking
- Replace `BaseClient` dependencies with typed clients
- File: `tradingagents/services/news_service.py`
6. **LLM Sentiment Analysis Integration**
- Implement `LLMSentimentAnalyzer` class
- Create financial news sentiment prompts
- Add batch processing for efficiency
- Handle LLM rate limiting and errors
7. **Date Conversion and Article Processing**
- Add date validation and conversion
- Implement RSS article fetching pipeline
- Add content extraction with fallback
- Combine articles from multiple sources
- Implement deduplication by URL
### Phase 4: Type Safety & Validation
8. **Comprehensive Type Checking**
- Run `mise run typecheck` - must pass with 0 errors
- Validate all date object conversions
- Ensure NewsContext compliance
9. **Enhanced Testing**
- Test RSS feed parsing edge cases
- Test content extraction failures and fallbacks
- Test LLM sentiment analysis with various article types
- Test multi-source aggregation and deduplication
## Testing Scenarios
### Integration Tests
1. **RSS Feed Processing**
- Test with various search queries
- Test date filtering in RSS results
- Test handling of malformed RSS feeds
2. **Content Extraction**
- Test direct fetch success
- Test Internet Archive fallback
- Test paywall detection
- Test extraction failure handling
3. **LLM Sentiment Analysis**
- Test positive news sentiment
- Test negative earnings reports
- Test neutral market updates
- Test batch processing
- Test LLM error handling
4. **Multi-Source Aggregation**
- Test both sources succeed
- Test Finnhub fails, Google succeeds
- Test Google fails, Finnhub succeeds
- Test both sources fail
5. **Date Handling**
- Test invalid date formats
- Test end_date < start_date
- Test date filtering in RSS feeds
## Success Criteria
### Functional Requirements
- ✅ Service successfully implements all placeholder methods
- ✅ GoogleNewsClient reads and parses RSS feeds correctly
- ✅ Article content extraction works with Internet Archive fallback
- ✅ LLM sentiment analysis provides structured financial sentiment
- ✅ Local-first strategy with proper freshness checking
- ✅ Multi-source aggregation with deduplication
- ✅ Returns properly validated `NewsContext` to agents
- ✅ Force refresh fetches fresh articles without clearing cache
### Technical Requirements
- ✅ Zero type checking errors: `mise run typecheck`
- ✅ Zero linting errors: `mise run lint`
- ✅ All tests pass with new implementation
- ✅ No runtime errors with date conversions
- ✅ Proper error messages for validation failures
### Quality Requirements
- ✅ Strongly-typed interfaces between all components
- ✅ RSS feed parsing with robust error handling
- ✅ Article content extraction with fallback strategy
- ✅ LLM integration with proper prompt engineering
- ✅ Efficient caching with minimal external calls
- ✅ Clear separation of concerns
## Data Architecture
### GoogleNewsClient RSS Response Format
```python
{
"query": "Apple stock",
"period": {"start": "2024-01-01", "end": "2024-01-31"},
"articles": [
{
"headline": "Apple Stock Soars on New Product Launch",
"summary": "Brief summary from RSS feed...",
"content": "Full article text extracted from source...",
"url": "https://www.cnbc.com/2024/01/20/apple-stock.html",
"source": "CNBC",
"date": "2024-01-20",
"authors": ["Tech Reporter"],
"publish_date": "2024-01-20T14:30:00Z",
"extracted_via": "direct_fetch", # or "internet_archive"
"extraction_success": true
}
],
"metadata": {
"source": "google_news_rss",
"article_count": 25,
"rss_feed_url": "https://news.google.com/rss/search?q=Apple+stock",
"extraction_stats": {
"successful": 22,
"archive_fallback": 2,
"failed": 3
}
}
}
```
### LLM Sentiment Analysis Response Format
```python
{
"article_url": "https://www.cnbc.com/2024/01/20/apple-stock.html",
"sentiment": {
"positive": 0.7,
"negative": 0.1,
"neutral": 0.2,
"metadata": {
"score": 0.7,
"confidence": 0.85,
"label": "positive",
"reasoning": "Article discusses positive earnings and growth outlook",
"key_themes": ["earnings_beat", "product_launch", "revenue_growth"],
"financial_entities": ["AAPL", "Apple Inc.", "iPhone 15"]
}
}
}
```
### Aggregate Sentiment Summary
```python
{
"sentiment_summary": {
"positive": 0.65, # Average across all articles
"negative": 0.20,
"neutral": 0.15,
"metadata": {
"dominant_sentiment": "positive",
"confidence": 0.82,
"article_count": 25,
"themes": {
"earnings": 8,
"product_launch": 5,
"market_analysis": 12
}
}
}
}
```
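A small sketch of how per-article scores could be averaged into that summary shape (theme counting and confidence weighting omitted):
```python
from statistics import mean


def aggregate_sentiment(scores: list[dict[str, float]]) -> dict[str, float]:
    """Average per-article sentiment into the summary shape shown above."""
    if not scores:
        return {"positive": 0.0, "negative": 0.0, "neutral": 1.0}
    return {
        "positive": round(mean(s["positive"] for s in scores), 2),
        "negative": round(mean(s["negative"] for s in scores), 2),
        "neutral": round(mean(s["neutral"] for s in scores), 2),
    }


print(aggregate_sentiment([
    {"positive": 0.7, "negative": 0.1, "neutral": 0.2},
    {"positive": 0.6, "negative": 0.3, "neutral": 0.1},
]))
# {'positive': 0.65, 'negative': 0.2, 'neutral': 0.15}
```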
## Dependencies
### Components to Create
- ⏳ `GoogleNewsClient` - Full implementation with RSS and content extraction
- ⏳ `LLMSentimentAnalyzer` - LLM integration for sentiment analysis
- ⏳ `NewsService` - Replace stubs with full implementation
### Existing Components
- ✅ `FinnhubClient` with company news using date objects
- ✅ `NewsRepository` with dataclass storage
- ✅ `NewsContext` and related Pydantic models
### Required Libraries
- `feedparser` - RSS feed parsing
- `newspaper3k` - Article content extraction
- `requests` - HTTP requests and Internet Archive API
- `beautifulsoup4` - HTML parsing fallback
- LLM client library (OpenAI, Anthropic, etc.)
## Timeline
### Immediate (Phase 1)
- Create GoogleNewsClient with RSS and content extraction
- Implement feedparser integration
- Add Internet Archive fallback
- Create comprehensive test suite
### Phase 2-3
- Add repository bridge methods
- Implement full NewsService
- Integrate LLM sentiment analysis
- Handle multi-source aggregation
### Phase 4
- Type checking and validation
- Integration testing
- Performance optimization
- Documentation
## Acceptance Criteria
### Must Have
1. **Type Safety**: Service passes `mise run typecheck` with zero errors
2. **RSS Integration**: Successfully parse Google News RSS feeds
3. **Content Extraction**: Extract full articles with fallback
4. **LLM Sentiment**: Financial sentiment analysis for all articles
5. **Service Implementation**: All stubs replaced with working code
6. **Local-First**: Check cache before fetching new data
7. **Multi-Source**: Aggregate Finnhub and Google News
### Should Have
1. **Extraction Stats**: Track success/failure rates
2. **Batch Processing**: Efficient LLM sentiment analysis
3. **Force Refresh**: Fetch new articles on demand
4. **Error Recovery**: Handle partial failures gracefully
### Nice to Have
1. **Additional Sources**: Support more news providers
2. **Real-time Monitoring**: WebSocket for breaking news
3. **Advanced Extraction**: Handle PDFs, videos
4. **Sentiment Trends**: Track sentiment over time
---
This PRD focuses on completing the currently empty `NewsService` with a full implementation including RSS feed integration, article content extraction with Internet Archive fallback, and LLM-powered sentiment analysis for financial news.


@@ -293,6 +293,33 @@ This project uses [mise](https://mise.jdx.dev/) for tool and task management. Al
- **Install tools**: `mise install` - Install Python, uv, ruff, pyright
- **Install dependencies**: `mise run install` - Install project dependencies with uv
### Testing Principles
**Pragmatic outside-in TDD** - Mock I/O boundaries, test real logic, fast feedback.
#### Test Structure (Mirror Source)
```
tests/
├── conftest.py # Shared fixtures
├── domains/
│ ├── __init__.py
│ └── news/
│ ├── __init__.py
│ ├── test_news_service.py # Mock repo + clients
│ ├── test_news_repository.py # Docker test DB
│ └── test_google_news_client.py # pytest-vcr
```
#### Mocking Strategy by Layer
- **Services**: Mock Repository + Clients, test real transformations
- **Repositories**: Real persistence (temp files/Docker), no mocks
- **Clients**: Real HTTP with pytest-vcr cassettes
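For example, a client test in that style might look like the following; the `google_news_client` fixture and cassette configuration are assumed rather than existing project code:
```python
from datetime import date

import pytest


@pytest.mark.vcr()  # real HTTP is recorded once, then replayed from the cassette
def test_fetch_rss_feed_returns_articles(google_news_client):
    result = google_news_client.fetch_rss_feed("AAPL", date(2024, 1, 1), date(2024, 1, 31))
    assert result["metadata"]["source"] == "google_news_rss"
    assert isinstance(result["articles"], list)
```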
#### Quality Standards
- **85% coverage** minimum
- **< 100ms** per unit test
- **Mock boundaries, test behavior**
### Configuration
The TradingAgents framework uses a centralized `TradingAgentsConfig` class for all configuration management.
@@ -428,4 +455,5 @@ ALWAYS prefer editing an existing file to creating a new one.
NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User.
IMPORTANT: this context may or may not be relevant to your tasks. You should not respond to this context unless it is highly relevant to your task.
IMPORTANT: this context may or may not be relevant to your tasks. You should not respond to this context unless it is highly relevant to your task.
- remember what we learnt about testing?


@@ -1,424 +0,0 @@
# Product Requirements Document: SocialMediaService Completion
## Overview
Complete the `SocialMediaService` to provide strongly-typed social media data and sentiment analysis to trading agents using a local-first data strategy with gap detection and intelligent caching.
## Current State Analysis
### Issues to Fix
- **CRITICAL**: Missing `RedditClient` implementation - service calls non-existent client methods
- **CRITICAL**: Service uses `BaseClient` inheritance but needs typed `RedditClient`
- **CRITICAL**: `SocialRepository` has different interface than standard service pattern
- **CRITICAL**: Repository uses `date` objects internally but service expects string date interface
- Missing strongly-typed interfaces between components
- Service calls `reddit_client.search_posts()`, `get_top_posts()`, `filter_posts_by_date()` methods that don't exist
### What Works
- ✅ Local-first data strategy implementation (`_get_social_data_local_first`)
- ✅ Force refresh logic (`_fetch_and_cache_fresh_social_data`)
- ✅ `SocialContext` Pydantic model for agent consumption
- ✅ Comprehensive sentiment analysis with keyword-based scoring
- ✅ Engagement metrics calculation and post ranking
- ✅ Error handling and metadata creation patterns
- ✅ `SocialRepository` with JSON storage and post deduplication
- ✅ `PostData` and `SentimentScore` models for structured data
- ✅ Real-time sentiment analysis with weighted scoring
## Technical Requirements
### 1. Strongly-Typed Interfaces
#### Client → Service Interface
```python
# RedditClient methods (to be implemented)
def search_posts(query: str, subreddit_names: list[str], start_date: date, end_date: date, limit: int, time_filter: str) -> dict[str, Any]
def get_top_posts(subreddit_names: list[str], start_date: date, end_date: date, limit: int, time_filter: str) -> dict[str, Any]
def get_company_posts(symbol: str, subreddit_names: list[str], start_date: date, end_date: date, limit: int) -> dict[str, Any]
```
#### Service → Repository Interface
```python
# SocialRepository methods (to be implemented/bridged)
def has_data_for_period(query: str, start_date: str, end_date: str, symbol: str | None) -> bool
def get_data(query: str, start_date: str, end_date: str, symbol: str | None) -> dict[str, Any]
def store_data(query: str, cache_data: dict, symbol: str | None, overwrite: bool) -> bool
def clear_data(query: str, start_date: str, end_date: str, symbol: str | None) -> bool
```
#### Service → Agent Interface
```python
# Service output (already defined)
def get_context(query: str, start_date: str, end_date: str, symbol: str | None, subreddits: list[str], force_refresh: bool) -> SocialContext
def get_company_social_context(symbol: str, start_date: str, end_date: str, subreddits: list[str]) -> SocialContext
def get_global_trends(start_date: str, end_date: str, subreddits: list[str]) -> SocialContext
```
### 2. Local-First Data Strategy
#### Flow
1. **Repository Lookup**: Check `SocialRepository.has_data_for_period()`
2. **Gap Detection**: Identify missing social media data periods
3. **Selective Fetching**: Fetch only missing data from `RedditClient`
4. **Cache Updates**: Store new data via `repository.store_data()`
5. **Context Assembly**: Return validated `SocialContext`
#### Force Refresh Support
- `force_refresh=True` bypasses local data completely
- Clears existing cache before fetching fresh data
- Stores refreshed data with metadata indicating refresh
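The sketch below shows how the five-step flow and the force-refresh path could fit together inside `_get_social_data_local_first`. It is a simplified, all-or-nothing illustration (no per-period gap detection) built against the repository and client interfaces defined above; any names not listed there are assumptions.
```python
# Illustrative sketch only: simplified local-first flow with force refresh.
# Assumes the repository/client interfaces above; gap detection is omitted.
from datetime import date
from typing import Any

def _get_social_data_local_first(
    self,
    query: str,
    start_date: str,
    end_date: str,
    symbol: str | None = None,
    subreddits: list[str] | None = None,
    force_refresh: bool = False,
) -> dict[str, Any]:
    subreddits = subreddits or ["investing", "stocks"]
    if force_refresh:
        # Force refresh: drop cached data and fall through to a fresh fetch
        self.repository.clear_data(query, start_date, end_date, symbol)
    elif self.repository.has_data_for_period(query, start_date, end_date, symbol):
        # 1. Repository lookup satisfies the request - no API call needed
        return self.repository.get_data(query, start_date, end_date, symbol)
    # 2./3. Fetch the missing period from Reddit, using date objects at the client boundary
    start_dt = date.fromisoformat(start_date)
    end_dt = date.fromisoformat(end_date)
    fresh = self.reddit_client.search_posts(
        query, subreddits, start_dt, end_dt, limit=50, time_filter="week"
    )
    # 4. Cache the fresh payload for later lookups
    self.repository.store_data(query, fresh, symbol, overwrite=force_refresh)
    # 5. Caller assembles the validated SocialContext from this payload
    return fresh
```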
### 3. Date Object Conversion
#### Service Boundary Conversion
```python
# Service receives string dates from agents
def get_context(self, query: str, start_date: str, end_date: str, ...) -> SocialContext:
# Convert to date objects for client calls
start_dt = date.fromisoformat(start_date)
end_dt = date.fromisoformat(end_date)
# Use date objects when calling RedditClient
posts_data = self.reddit_client.search_posts(query, subreddits, start_dt, end_dt, limit, time_filter)
# Repository bridge handles string to date conversion internally
cached_data = self.repository.get_data(query, start_date, end_date, symbol)
```
### 4. Reddit API Integration
#### RedditClient Implementation Strategy
```python
# RedditClient following FinnhubClient standard
class RedditClient:
"""Client for Reddit API access with PRAW library integration."""
def __init__(self, client_id: str, client_secret: str, user_agent: str):
"""Initialize Reddit client with PRAW."""
import praw
self.reddit = praw.Reddit(
client_id=client_id,
client_secret=client_secret,
user_agent=user_agent
)
def search_posts(self, query: str, subreddit_names: list[str],
start_date: date, end_date: date, limit: int = 50,
time_filter: str = "week") -> dict[str, Any]:
"""Search for posts across subreddits within date range."""
def get_top_posts(self, subreddit_names: list[str],
start_date: date, end_date: date, limit: int = 50,
time_filter: str = "week") -> dict[str, Any]:
"""Get top posts from subreddits within date range."""
def get_company_posts(self, symbol: str, subreddit_names: list[str],
start_date: date, end_date: date, limit: int = 50) -> dict[str, Any]:
"""Get company-specific posts from subreddits."""
```
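One possible shape for `search_posts` with PRAW is sketched below. Reddit search has no date-range parameter, so date filtering happens client-side; the field mapping mirrors the response format in the next section but is an assumption, not final code.
```python
# Hedged sketch of search_posts using PRAW; the exact field mapping is an
# assumption and should follow the response format documented below.
from datetime import date, datetime, timezone
from typing import Any

def search_posts(
    self,
    query: str,
    subreddit_names: list[str],
    start_date: date,
    end_date: date,
    limit: int = 50,
    time_filter: str = "week",
) -> dict[str, Any]:
    """Search for posts across subreddits within date range."""
    posts: list[dict[str, Any]] = []
    # PRAW accepts "a+b+c" to search several subreddits in one call
    subreddit = self.reddit.subreddit("+".join(subreddit_names))
    for submission in subreddit.search(query, time_filter=time_filter, limit=limit):
        created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc).date()
        if not (start_date <= created <= end_date):
            continue  # client-side date filtering
        posts.append(
            {
                "title": submission.title,
                "content": submission.selftext,
                "author": str(submission.author) if submission.author else "[deleted]",
                "subreddit": submission.subreddit.display_name,
                "created_utc": int(submission.created_utc),
                "score": submission.score,
                "num_comments": submission.num_comments,
                "upvote_ratio": submission.upvote_ratio,
                "url": f"https://reddit.com{submission.permalink}",
                "id": submission.id,
            }
        )
    return {
        "query": query,
        "period": {"start": start_date.isoformat(), "end": end_date.isoformat()},
        "posts": posts,
        "metadata": {
            "source": "reddit",
            "retrieved_at": datetime.now(timezone.utc).isoformat(),
            "subreddits": subreddit_names,
            "total_posts": len(posts),
        },
    }
```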
#### Reddit Response Format
```python
{
"query": "AAPL",
"period": {"start": "2024-01-01", "end": "2024-01-31"},
"posts": [
{
"title": "Apple earnings discussion",
"content": "What do you think about...",
"author": "redditor123",
"subreddit": "investing",
"created_utc": 1704067200,
"score": 125,
"num_comments": 45,
"upvote_ratio": 0.87,
"url": "https://reddit.com/r/investing/comments/abc123",
"id": "abc123"
}
],
"metadata": {
"source": "reddit",
"retrieved_at": "2024-01-31T10:00:00Z",
"data_quality": "HIGH",
"subreddits": ["investing", "stocks"],
"total_posts": 25
}
}
```
### 5. Sentiment Analysis Enhancement
#### Advanced Sentiment Features
- **Weighted Scoring**: High-engagement posts have more influence on overall sentiment
- **Keyword Analysis**: Comprehensive positive/negative keyword detection
- **Score Adjustment**: Reddit score (upvotes) influences sentiment confidence
- **Confidence Metrics**: Based on post count and engagement levels
- **Multi-level Analysis**: Individual post sentiment + overall summary sentiment
#### Sentiment Calculation Strategy
```python
def _calculate_advanced_sentiment(self, posts: list[PostData]) -> SentimentScore:
"""Enhanced sentiment analysis with multiple factors."""
# Weight by engagement score (upvotes + comments)
# Adjust for subreddit context (WSB vs investing)
# Consider temporal patterns (recent posts weighted higher)
# Apply confidence scoring based on data volume
```
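One way the strategy above could translate into code, assuming per-post sentiment has already been attached and using the `PostData`/`SentimentScore` shapes defined in the next section; the exact weights and thresholds are illustrative only.
```python
# Illustrative weighting scheme, not a spec: engagement-weighted average of
# per-post sentiment, with confidence driven by data volume.
def _calculate_advanced_sentiment(self, posts: list[PostData]) -> SentimentScore:
    if not posts:
        return SentimentScore(score=0.0, confidence=0.0, label="neutral")
    weighted_sum = 0.0
    total_weight = 0.0
    for post in posts:
        if post.sentiment is None:
            continue
        # High-engagement posts (upvotes + comments) count more
        weight = 1.0 + post.engagement_score / 100.0
        weighted_sum += post.sentiment.score * weight
        total_weight += weight
    score = weighted_sum / total_weight if total_weight else 0.0
    confidence = min(1.0, len(posts) / 20.0)  # more posts -> more confidence, capped
    label = "positive" if score > 0.1 else "negative" if score < -0.1 else "neutral"
    return SentimentScore(score=score, confidence=confidence, label=label)
```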
### 6. Pydantic Validation
#### Context Structure
```python
class SocialContext(BaseModel):
symbol: str | None
period: dict[str, str] # {"start": "2024-01-01", "end": "2024-01-31"}
posts: list[PostData]
engagement_metrics: dict[str, float]
sentiment_summary: SentimentScore
post_count: int
platforms: list[str] # ["reddit"]
metadata: dict[str, Any]
```
#### PostData Format
```python
class PostData(BaseModel):
title: str
content: str
author: str
source: str # subreddit name
date: str
url: str
score: int
comments: int
engagement_score: int
subreddit: str | None
sentiment: SentimentScore | None
metadata: dict[str, Any]
```
## Implementation Tasks
### Phase 1: Create RedditClient
1. **RedditClient Implementation**
- Create `tradingagents/clients/reddit_client.py`
- Follow FinnhubClient standard: no BaseClient inheritance, date objects, proper error handling
- Use PRAW (Python Reddit API Wrapper) library for Reddit API access
- Methods: `search_posts()`, `get_top_posts()`, `get_company_posts()`
- Implement date filtering for posts within specified ranges
- Handle Reddit API rate limits and authentication
2. **Comprehensive Testing**
- Create `tradingagents/clients/test_reddit_client.py`
- Use pytest-vcr for Reddit API interaction recording
- Test all client methods with multiple queries and subreddits
- Test error handling and API rate limit scenarios
- Mock Reddit API responses for consistent testing
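A rough shape for the pytest-vcr suite described in item 2 is sketched below; fixture names and assertions are assumptions, and dummy credentials suffice when replaying recorded cassettes.
```python
# Hypothetical sketch for tradingagents/clients/test_reddit_client.py.
# Assumes pytest-vcr is installed; cassettes record PRAW's underlying HTTP calls.
from datetime import date
import pytest
from tradingagents.clients.reddit_client import RedditClient

@pytest.fixture
def reddit_client():
    # Dummy credentials are fine on replay; real ones only when (re)recording
    return RedditClient(
        client_id="test-id", client_secret="test-secret", user_agent="tests/0.1"
    )

@pytest.mark.vcr()
def test_search_posts_returns_expected_shape(reddit_client):
    result = reddit_client.search_posts(
        "AAPL", ["investing"], date(2024, 1, 1), date(2024, 1, 31), limit=5
    )
    assert result["query"] == "AAPL"
    assert isinstance(result["posts"], list)
    assert result["metadata"]["source"] == "reddit"
```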
### Phase 2: Bridge SocialRepository Interface
3. **Repository Interface Standardization**
- Add standard service interface methods to `SocialRepository`
- Bridge existing `get_social_data()` with `get_data()`
- Bridge existing `store_social_posts()` with `store_data()`
- Add missing `has_data_for_period()` and `clear_data()` methods
- File: `tradingagents/repositories/social_repository.py`
- Maintain existing dataclass functionality while adding service compatibility
4. **Repository Method Implementation**
```python
# Add these methods to SocialRepository
def has_data_for_period(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool
def get_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> dict[str, Any]
def store_data(self, query: str, cache_data: dict, symbol: str | None = None, overwrite: bool = False) -> bool
def clear_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool
```
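A minimal sketch of two of these bridge methods follows, assuming the existing `get_social_data()` takes `date` objects and returns serialized post dicts; that internal signature is an assumption, not established fact.
```python
# Hedged bridge sketch: string-dated service interface on top of the existing
# date-object repository internals. Internal signatures are assumptions.
from datetime import date
from typing import Any

def get_data(
    self, query: str, start_date: str, end_date: str, symbol: str | None = None
) -> dict[str, Any]:
    # Convert the service's ISO strings to date objects for internal use
    start_dt = date.fromisoformat(start_date)
    end_dt = date.fromisoformat(end_date)
    posts = self.get_social_data(query, start_dt, end_dt, symbol=symbol)
    return {
        "query": query,
        "symbol": symbol,
        "posts": posts,
        "metadata": {"post_count": len(posts), "sources": ["reddit"]},
    }

def has_data_for_period(
    self, query: str, start_date: str, end_date: str, symbol: str | None = None
) -> bool:
    # Reuse get_data so both methods share the same date conversion
    return bool(self.get_data(query, start_date, end_date, symbol)["posts"])
```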
### Phase 3: Update SocialMediaService
5. **Client Integration Fix**
- Replace `BaseClient` dependency with `RedditClient`
- File: `tradingagents/services/social_media_service.py:27`
- Update constructor: `reddit_client: RedditClient`
6. **Date Conversion Fix**
- Add `date.fromisoformat()` conversion in service methods
- Update all client calls to use date objects instead of strings
- File: `tradingagents/services/social_media_service.py:182-190, 418-429`
7. **Repository Interface Integration**
- Update repository method calls to use new standard interface
- Ensure proper error handling for repository operations
- File: `tradingagents/services/social_media_service.py:302-311, 325-337`
### Phase 4: Type Safety & Validation
8. **Comprehensive Type Checking**
- Run `mise run typecheck` - must pass with 0 errors
- Validate all date object conversions
- Ensure SocialContext compliance
9. **Enhanced Testing**
- Update existing service tests for new RedditClient interface
- Add gap detection test scenarios
- Test sentiment analysis accuracy with known datasets
- Test multi-subreddit aggregation and deduplication
## Success Criteria
### Functional Requirements
- ✅ Service successfully calls `RedditClient` with `date` objects
- ✅ Local-first strategy works: checks cache → identifies gaps → fetches missing → stores updates
- ✅ Returns properly validated `SocialContext` to agents
- ✅ Sentiment analysis provides accurate scores with confidence metrics
- ✅ Multi-subreddit support with post deduplication
- ✅ Force refresh bypasses cache and refreshes data
### Technical Requirements
- ✅ Zero type checking errors: `mise run typecheck`
- ✅ Zero linting errors: `mise run lint`
- ✅ All existing tests pass with updated architecture
- ✅ No runtime errors with date conversions
### Quality Requirements
- ✅ Strongly-typed interfaces between all components
- ✅ PRAW library integration for reliable Reddit API access
- ✅ Comprehensive error handling and logging
- ✅ Efficient caching with minimal API calls
- ✅ Clear separation of concerns between service, client, and repository
- ✅ Accurate sentiment analysis with engagement weighting
## Data Architecture
### RedditClient Response Format
```python
{
"query": "Tesla",
"period": {"start": "2024-01-01", "end": "2024-01-31"},
"posts": [
{
"title": "Tesla Q4 earnings beat expectations",
"content": "Tesla reported strong Q4 results...",
"author": "teslaInvestor",
"subreddit": "TeslaInvestors",
"created_utc": 1704067200,
"score": 245,
"num_comments": 67,
"upvote_ratio": 0.92,
"url": "https://reddit.com/r/TeslaInvestors/comments/xyz789",
"id": "xyz789"
}
],
"metadata": {
"source": "reddit",
"retrieved_at": "2024-01-31T10:00:00Z",
"data_quality": "HIGH",
"subreddits": ["TeslaInvestors", "stocks"],
"post_count": 25,
"api_calls": 3
}
}
```
### SocialRepository Data Bridge Format
```python
# Repository stores data in existing SocialPost format but provides service interface
{
"query": "Tesla",
"symbol": "TSLA",
"posts": [
{
"title": "Tesla Q4 earnings beat expectations",
"content": "Tesla reported strong Q4 results...",
"author": "teslaInvestor",
"source": "TeslaInvestors",
"date": "2024-01-15",
"url": "https://reddit.com/r/TeslaInvestors/comments/xyz789",
"score": 245,
"comments": 67,
"engagement_score": 312,
"subreddit": "TeslaInvestors",
"sentiment": {
"score": 0.7,
"confidence": 0.8,
"label": "positive"
},
"metadata": {
"platform_id": "xyz789",
"upvote_ratio": 0.92
}
}
],
"metadata": {
"cached_at": "2024-01-31T10:00:00Z",
"post_count": 25,
"sources": ["reddit"]
}
}
```
## Dependencies
### Missing Components (Need Creation)
- ⏳ `RedditClient` needs full implementation from scratch
- ⏳ Service interface bridge methods for `SocialRepository`
- ⏳ Comprehensive pytest-vcr test suites for Reddit API
### Existing Components (Ready)
- ✅ `SocialRepository` with JSON storage and deduplication
- ✅ `SocialContext` and `PostData` Pydantic models
- ✅ Sentiment analysis and engagement metrics logic
### Required
- PRAW (Python Reddit API Wrapper) library for Reddit integration
- Valid Reddit API credentials (client_id, client_secret, user_agent)
- Working internet connection for live data fetching
- Writable data directory for repository storage
## Timeline
### Immediate (Phase 1)
- Create RedditClient following FinnhubClient standard with PRAW integration
- Implement comprehensive testing with pytest-vcr for Reddit API
- Validate client functionality with multiple subreddits and queries
### Phase 2-3
- Add standard service interface methods to SocialRepository
- Update SocialMediaService to use RedditClient with date objects
- Bridge repository interfaces while maintaining existing functionality
### Phase 4
- Comprehensive type checking and validation
- Integration testing with sentiment analysis workflows
- Performance optimization and caching efficiency
## Acceptance Criteria
### Must Have
1. **Type Safety**: Service passes `mise run typecheck` with zero errors
2. **Client Integration**: All `RedditClient` calls use `date` objects correctly
3. **Local-First**: Service checks repository before Reddit API calls
4. **Context Validation**: Returns valid `SocialContext` with Pydantic validation
5. **Sentiment Analysis**: Provides accurate sentiment scores with confidence metrics
6. **Multi-Platform**: Aggregates social data from Reddit today, with extensibility for additional platforms
### Should Have
1. **Gap Detection**: Intelligent identification of missing data periods
2. **Cache Efficiency**: Minimal redundant API calls to Reddit
3. **Force Refresh**: Complete cache bypass when requested
4. **Data Quality**: Metadata indicating data source and quality metrics
5. **Deduplication**: Automatic removal of duplicate posts by platform_id
### Nice to Have
1. **Performance Metrics**: Timing and cache hit rate logging
2. **Data Staleness**: Automatic refresh of old cached social data
3. **Enhanced Sentiment**: Integration with advanced NLP libraries (TextBlob, VADER)
4. **Real-time Social**: Support for live social media feeds and alerts
5. **Platform Expansion**: Easy addition of Twitter, Discord, other social platforms
---
This PRD focuses on completing the `SocialMediaService` as a strongly-typed, local-first data service. It integrates Reddit social media data through a new `RedditClient` that follows the established FinnhubClient standard patterns, and it provides comprehensive sentiment analysis and engagement metrics to trading agents.

1013
prd/news_service.md Normal file

File diff suppressed because it is too large

View File

@ -33,7 +33,7 @@ dependencies = [
"typing-extensions>=4.14.0",
"yfinance>=0.2.63",
"TA-Lib>=0.4.28",
"newspaper3k>=0.2.8",
"newspaper4k>=0.9.3",
]
[project.optional-dependencies]

View File

@ -7,5 +7,6 @@
"reportMissingTypeStubs": false,
"useLibraryCodeForTypes": true,
"autoSearchPaths": true,
"extraPaths": []
"extraPaths": [],
"stubPath": "typings"
}

4
test_typecheck.sh Normal file
View File

@ -0,0 +1,4 @@
#!/bin/bash
echo "Running type check..."
cd /Users/martinrichards/code/TradingAgents
mise run typecheck

1
tests/__init__.py Normal file
View File

@ -0,0 +1 @@
"""Test package for TradingAgents following pragmatic outside-in TDD."""

127
tests/conftest.py Normal file
View File

@ -0,0 +1,127 @@
"""
Test configuration and shared fixtures following pragmatic TDD principles.
Provides shared fixtures for mocking I/O boundaries while using real objects
for business logic and data transformations.
"""
import shutil
import tempfile
from datetime import date, datetime
from unittest.mock import Mock
import pytest
from tradingagents.domains.news.article_scraper_client import (
ArticleScraperClient,
ScrapeResult,
)
from tradingagents.domains.news.google_news_client import (
GoogleNewsArticle,
GoogleNewsClient,
)
from tradingagents.domains.news.news_repository import (
NewsArticle,
NewsRepository,
)
@pytest.fixture
def mock_google_client():
"""Mock GoogleNewsClient for testing I/O boundary."""
return Mock(spec=GoogleNewsClient)
@pytest.fixture
def mock_article_scraper():
"""Mock ArticleScraperClient for testing I/O boundary."""
return Mock(spec=ArticleScraperClient)
@pytest.fixture
def mock_repository():
"""Mock NewsRepository for testing I/O boundary."""
return Mock(spec=NewsRepository)
@pytest.fixture
def temp_data_dir():
"""Temporary directory for testing real repository persistence."""
temp_dir = tempfile.mkdtemp()
yield temp_dir
shutil.rmtree(temp_dir)
@pytest.fixture
def real_repository(temp_data_dir):
"""Real NewsRepository instance for testing persistence logic."""
return NewsRepository(temp_data_dir)
@pytest.fixture
def sample_news_articles():
"""Sample NewsArticle objects for testing data transformations."""
return [
NewsArticle(
headline="Apple Stock Rises 5% on Strong Earnings",
url="https://example.com/apple-earnings",
source="CNBC",
published_date=date(2024, 1, 15),
summary="Apple reports strong quarterly earnings beating expectations",
sentiment_score=0.7,
author="John Reporter",
),
NewsArticle(
headline="Apple Faces Supply Chain Challenges",
url="https://example.com/apple-supply-chain",
source="Reuters",
published_date=date(2024, 1, 16),
summary="Apple struggles with component shortages affecting production",
sentiment_score=-0.3,
author="Jane Analyst",
),
]
@pytest.fixture
def sample_google_articles():
"""Sample GoogleNewsArticle objects for testing data transformations."""
return [
GoogleNewsArticle(
title="Apple Stock Soars on Positive Outlook",
link="https://example.com/apple-soars",
published=datetime(2024, 1, 15, 10, 30),
summary="Investors are optimistic about Apple's future",
source="MarketWatch",
guid="article1",
),
GoogleNewsArticle(
title="Apple Announces New Product Line",
link="https://example.com/apple-products",
published=datetime(2024, 1, 16, 14, 20),
summary="Apple unveils exciting new product lineup",
source="TechCrunch",
guid="article2",
),
]
@pytest.fixture
def sample_scrape_results():
"""Sample ScrapeResult objects for testing data transformations."""
return {
"https://example.com/apple-soars": ScrapeResult(
status="SUCCESS",
content="Full article content about Apple's stock performance...",
author="Market Reporter",
title="Apple Stock Soars on Positive Outlook",
publish_date="2024-01-15",
),
"https://example.com/apple-products": ScrapeResult(
status="SUCCESS",
content="Detailed content about Apple's new product announcements...",
author="Tech Writer",
title="Apple Announces New Product Line",
publish_date="2024-01-16",
),
}

View File

@ -0,0 +1 @@
"""Domain tests package."""

View File

@ -0,0 +1 @@
"""News domain tests package."""

View File

@ -0,0 +1,532 @@
"""
Test ArticleScraperClient with pytest-vcr for HTTP recording/replay.
Following pragmatic TDD principles:
- Mock HTTP boundaries with VCR cassettes
- Test real business logic and data transformations
- Fast, deterministic tests
"""
from pathlib import Path
from unittest.mock import Mock, patch
import pytest
from tradingagents.domains.news.article_scraper_client import (
ArticleScraperClient,
ScrapeResult,
)
@pytest.fixture
def cassette_dir():
"""Directory for VCR cassettes."""
return (
Path(__file__).parent.parent.parent
/ "fixtures"
/ "vcr_cassettes"
/ "article_scraper"
)
@pytest.fixture
def scraper():
"""ArticleScraperClient instance for testing."""
return ArticleScraperClient(
user_agent="Test-Agent/1.0",
delay=0.1, # Faster tests
)
@pytest.fixture
def valid_urls():
"""Valid test URLs."""
return [
"https://www.reuters.com/business/finance/",
"https://www.bloomberg.com/markets/stocks",
"https://techcrunch.com/2024/01/15/tech-news/",
]
@pytest.fixture
def invalid_urls():
"""Invalid test URLs."""
return [
"",
"not-a-url",
"http://",
"https://",
"ftp://example.com/file.txt",
"https://non-existent-domain-123456.com/article",
]
class TestArticleScraperClient:
"""Test ArticleScraperClient functionality."""
def test_initialization(self):
"""Test scraper initializes with correct configuration."""
# Test with custom user agent
scraper = ArticleScraperClient("Custom-Agent/1.0", delay=2.0)
assert scraper.user_agent == "Custom-Agent/1.0"
assert scraper.delay == 2.0
# Test with default user agent (None/empty)
scraper_default = ArticleScraperClient(None)
assert "Chrome" in scraper_default.user_agent
assert scraper_default.delay == 1.0
def test_is_valid_url(self, scraper):
"""Test URL validation logic."""
# Valid URLs
assert scraper._is_valid_url("https://example.com/article") is True
assert scraper._is_valid_url("http://example.com/article") is True
assert scraper._is_valid_url("https://sub.domain.com/path?query=value") is True
# Invalid URLs
assert scraper._is_valid_url("") is False
assert scraper._is_valid_url("not-a-url") is False
assert scraper._is_valid_url("ftp://example.com") is False
assert scraper._is_valid_url("http://") is False
assert scraper._is_valid_url("https://") is False
def test_scrape_article_invalid_url(self, scraper, invalid_urls):
"""Test scraping with invalid URLs returns NOT_FOUND."""
for url in invalid_urls:
result = scraper.scrape_article(url)
assert result.status == "NOT_FOUND"
assert result.content == ""
assert result.final_url == url
class TestArticleScrapingSuccess:
"""Test successful article scraping scenarios."""
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
@patch("tradingagents.domains.news.article_scraper_client.Article")
def test_scrape_article_success(self, mock_article_class, mock_sleep, scraper):
"""Test successful article scraping with mocked newspaper4k."""
# Setup mock article
mock_article = Mock()
mock_article.text = "This is a long article content that is definitely over 100 characters in length and should pass the validation check."
mock_article.title = "Test Article Title"
mock_article.authors = ["John Doe", "Jane Smith"]
mock_article.publish_date = "2024-01-15"
mock_article.download.return_value = None
mock_article.parse.return_value = None
mock_article_class.return_value = mock_article
# Test scraping
result = scraper.scrape_article("https://example.com/article")
# Verify results
assert result.status == "SUCCESS"
assert result.content == mock_article.text
assert result.title == "Test Article Title"
assert result.author == "John Doe, Jane Smith"
assert result.publish_date == "2024-01-15"
assert result.final_url == "https://example.com/article"
# Verify newspaper4k was configured correctly
mock_article_class.assert_called_once()
args, kwargs = mock_article_class.call_args
assert args[0] == "https://example.com/article"
config = (
kwargs["config"]
if "config" in kwargs
else args[1]
if len(args) > 1
else None
)
assert config is not None
assert config.browser_user_agent == "Test-Agent/1.0"
assert config.request_timeout == 10
# Verify delay was applied
mock_sleep.assert_called_once_with(0.1)
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
@patch("tradingagents.domains.news.article_scraper_client.Article")
def test_scrape_article_with_datetime_publish_date(
self, mock_article_class, mock_sleep, scraper
):
"""Test successful scraping with datetime publish_date."""
from datetime import datetime
mock_article = Mock()
mock_article.text = "Long article content over 100 characters for testing publish date handling in the newspaper4k client."
mock_article.title = "DateTime Test Article"
mock_article.authors = []
mock_article.publish_date = datetime(2024, 1, 15, 14, 30, 0)
mock_article_class.return_value = mock_article
result = scraper.scrape_article("https://example.com/datetime-article")
assert result.status == "SUCCESS"
assert result.publish_date == "2024-01-15"
assert result.author == "" # Empty authors list
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
@patch("tradingagents.domains.news.article_scraper_client.Article")
def test_scrape_article_short_content_fails(
self, mock_article_class, mock_sleep, scraper
):
"""Test that articles with content under 100 chars are rejected."""
mock_article = Mock()
mock_article.text = "Short content" # Under 100 characters
mock_article.title = "Short Article"
mock_article.authors = []
mock_article.publish_date = None
mock_article_class.return_value = mock_article
result = scraper.scrape_article("https://example.com/short-article")
assert result.status == "SCRAPE_FAILED"
assert result.content == ""
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
@patch("tradingagents.domains.news.article_scraper_client.Article")
def test_scrape_article_empty_content_fails(
self, mock_article_class, mock_sleep, scraper
):
"""Test that articles with empty content are rejected."""
mock_article = Mock()
mock_article.text = "" # Empty content
mock_article.title = ""
mock_article.authors = []
mock_article.publish_date = None
mock_article_class.return_value = mock_article
result = scraper.scrape_article("https://example.com/empty-article")
assert result.status == "SCRAPE_FAILED"
assert result.content == ""
class TestArticleScrapingFailure:
"""Test article scraping failure scenarios."""
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
@patch("tradingagents.domains.news.article_scraper_client.Article")
def test_scrape_article_download_exception(
self, mock_article_class, mock_sleep, scraper
):
"""Test scraping when newspaper4k download fails."""
mock_article = Mock()
mock_article.download.side_effect = Exception("Download failed")
mock_article_class.return_value = mock_article
result = scraper.scrape_article("https://example.com/failing-article")
assert result.status == "SCRAPE_FAILED"
assert result.content == ""
assert result.final_url == "https://example.com/failing-article"
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
@patch("tradingagents.domains.news.article_scraper_client.Article")
def test_scrape_article_parse_exception(
self, mock_article_class, mock_sleep, scraper
):
"""Test scraping when newspaper4k parse fails."""
mock_article = Mock()
mock_article.download.return_value = None
mock_article.parse.side_effect = Exception("Parse failed")
mock_article_class.return_value = mock_article
result = scraper.scrape_article("https://example.com/parse-fail-article")
assert result.status == "SCRAPE_FAILED"
assert result.content == ""
class TestWaybackMachineFallback:
"""Test Internet Archive Wayback Machine fallback functionality."""
@patch("tradingagents.domains.news.article_scraper_client.requests.get")
def test_scrape_from_wayback_no_requests(self, mock_get, scraper):
"""Test Wayback fallback when requests is not available."""
with patch(
"builtins.__import__", side_effect=ImportError("No module named 'requests'")
):
result = scraper._scrape_from_wayback("https://example.com/article")
assert result.status == "NOT_FOUND"
assert result.final_url == "https://example.com/article"
@patch("tradingagents.domains.news.article_scraper_client.requests.get")
def test_scrape_from_wayback_no_snapshots(self, mock_get, scraper):
"""Test Wayback fallback when no archived snapshots exist."""
# Mock CDX API response with only headers (no snapshots)
mock_response = Mock()
mock_response.json.return_value = [["timestamp", "original"]] # Only headers
mock_response.raise_for_status.return_value = None
mock_get.return_value = mock_response
result = scraper._scrape_from_wayback("https://example.com/no-archive")
assert result.status == "NOT_FOUND"
assert result.final_url == "https://example.com/no-archive"
@patch("tradingagents.domains.news.article_scraper_client.requests.get")
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
@patch("tradingagents.domains.news.article_scraper_client.Article")
def test_scrape_from_wayback_success(
self, mock_article_class, mock_sleep, mock_get, scraper
):
"""Test successful Wayback Machine scraping."""
# Mock CDX API response
mock_response = Mock()
mock_response.json.return_value = [
["timestamp", "original"], # Headers
["20240115120000", "https://example.com/article"], # Snapshot data
]
mock_response.raise_for_status.return_value = None
mock_get.return_value = mock_response
# Mock successful article scraping from archive
mock_article = Mock()
mock_article.text = "Archived article content that is long enough to pass validation checks and contains meaningful information."
mock_article.title = "Archived Article"
mock_article.authors = ["Archive Author"]
mock_article.publish_date = "2024-01-15"
mock_article_class.return_value = mock_article
result = scraper._scrape_from_wayback("https://example.com/article")
assert result.status == "ARCHIVE_SUCCESS"
assert result.content == mock_article.text
assert result.title == "Archived Article"
assert (
result.final_url
== "https://web.archive.org/web/20240115120000/https://example.com/article"
)
# Verify CDX API was called correctly
mock_get.assert_called_with(
"http://web.archive.org/cdx/search/cdx",
params={
"url": "https://example.com/article",
"output": "json",
"fl": "timestamp,original",
"filter": "statuscode:200",
"limit": "1",
},
timeout=10,
)
@patch("tradingagents.domains.news.article_scraper_client.requests.get")
def test_scrape_from_wayback_requests_exception(self, mock_get, scraper):
"""Test Wayback fallback when requests fails."""
mock_get.side_effect = Exception("Request timeout")
result = scraper._scrape_from_wayback("https://example.com/timeout")
assert result.status == "NOT_FOUND"
assert result.final_url == "https://example.com/timeout"
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
@patch("tradingagents.domains.news.article_scraper_client.Article")
def test_scrape_article_fallback_to_wayback(
self, mock_article_class, mock_sleep, scraper
):
"""Test full workflow: source fails, fallback to Wayback succeeds."""
# First call (original source) fails
# Second call (Wayback source) succeeds
mock_article_fail = Mock()
mock_article_fail.download.side_effect = Exception("Download failed")
mock_article_success = Mock()
mock_article_success.text = "Successfully scraped content from Wayback Machine with enough length to pass validation tests."
mock_article_success.title = "Wayback Success"
mock_article_success.authors = ["Wayback Author"]
mock_article_success.publish_date = "2024-01-15"
mock_article_success.download.return_value = None
mock_article_success.parse.return_value = None
mock_article_class.side_effect = [mock_article_fail, mock_article_success]
with patch(
"tradingagents.domains.news.article_scraper_client.requests.get"
) as mock_get:
# Mock successful CDX API response
mock_response = Mock()
mock_response.json.return_value = [
["timestamp", "original"],
["20240115120000", "https://example.com/article"],
]
mock_response.raise_for_status.return_value = None
mock_get.return_value = mock_response
result = scraper.scrape_article("https://example.com/article")
assert result.status == "ARCHIVE_SUCCESS"
assert (
result.content
== "Successfully scraped content from Wayback Machine with enough length to pass validation tests."
)
assert "web.archive.org" in result.final_url
class TestMultipleArticles:
"""Test scraping multiple articles functionality."""
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
def test_scrape_multiple_articles_empty_list(self, mock_sleep, scraper):
"""Test scraping empty list returns empty dict."""
results = scraper.scrape_multiple_articles([])
assert results == {}
mock_sleep.assert_not_called()
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
def test_scrape_multiple_articles_single_url(self, mock_sleep, scraper):
"""Test scraping single URL in list."""
urls = ["https://example.com/single"]
with patch.object(scraper, "scrape_article") as mock_scrape:
mock_scrape.return_value = ScrapeResult(
status="SUCCESS", content="Single article content"
)
results = scraper.scrape_multiple_articles(urls)
assert len(results) == 1
assert results["https://example.com/single"].status == "SUCCESS"
mock_scrape.assert_called_once_with("https://example.com/single")
# No delay needed for single article
mock_sleep.assert_not_called()
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
def test_scrape_multiple_articles_with_delays(self, mock_sleep, scraper):
"""Test scraping multiple URLs with delays between requests."""
urls = [
"https://example.com/article1",
"https://example.com/article2",
"https://example.com/article3",
]
with patch.object(scraper, "scrape_article") as mock_scrape:
mock_scrape.side_effect = [
ScrapeResult(status="SUCCESS", content="Article 1"),
ScrapeResult(status="SUCCESS", content="Article 2"),
ScrapeResult(status="SCRAPE_FAILED", content=""),
]
results = scraper.scrape_multiple_articles(urls)
assert len(results) == 3
assert results["https://example.com/article1"].status == "SUCCESS"
assert results["https://example.com/article2"].status == "SUCCESS"
assert results["https://example.com/article3"].status == "SCRAPE_FAILED"
# Verify delay called between requests (n-1 times)
assert mock_sleep.call_count == 2
mock_sleep.assert_called_with(0.1)
class TestDataTransformation:
"""Test data transformation and edge cases."""
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
@patch("tradingagents.domains.news.article_scraper_client.Article")
def test_publish_date_edge_cases(self, mock_article_class, mock_sleep, scraper):
"""Test various publish_date formats are handled correctly."""
from datetime import datetime
test_cases = [
(None, ""),
("", ""),
("2024-01-15", "2024-01-15"),
(datetime(2024, 1, 15), "2024-01-15"),
(12345, "12345"), # Numeric conversion
({"year": 2024}, "{'year': 2024}"), # Dict conversion
]
for pub_date, expected in test_cases:
mock_article = Mock()
mock_article.text = "Long enough content for validation testing with various publish date formats and edge cases."
mock_article.title = "Date Test"
mock_article.authors = []
mock_article.publish_date = pub_date
mock_article_class.return_value = mock_article
result = scraper.scrape_article("https://example.com/date-test")
assert result.status == "SUCCESS"
assert result.publish_date == expected
def test_scrape_result_dataclass_defaults(self):
"""Test ScrapeResult dataclass has correct defaults."""
result = ScrapeResult(status="TEST")
assert result.status == "TEST"
assert result.content == ""
assert result.author == ""
assert result.final_url == ""
assert result.title == ""
assert result.publish_date == ""
def test_scrape_result_all_fields(self):
"""Test ScrapeResult with all fields populated."""
result = ScrapeResult(
status="SUCCESS",
content="Full article content",
author="Test Author",
final_url="https://final.com/url",
title="Test Title",
publish_date="2024-01-15",
)
assert result.status == "SUCCESS"
assert result.content == "Full article content"
assert result.author == "Test Author"
assert result.final_url == "https://final.com/url"
assert result.title == "Test Title"
assert result.publish_date == "2024-01-15"
class TestErrorHandlingAndEdgeCases:
"""Test error handling and edge cases."""
def test_user_agent_fallback(self):
"""Test user agent fallback when None or empty is provided."""
scraper_none = ArticleScraperClient(None)
scraper_empty = ArticleScraperClient("")
# Both should use default Chrome user agent
default_ua = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
assert scraper_none.user_agent == default_ua
assert scraper_empty.user_agent == default_ua
@patch("tradingagents.domains.news.article_scraper_client.time.sleep")
@patch("tradingagents.domains.news.article_scraper_client.Article")
def test_config_applied_correctly(self, mock_article_class, mock_sleep):
"""Test that newspaper4k Config is applied with correct settings."""
scraper = ArticleScraperClient("Custom-Agent/2.0", delay=0.5)
mock_article = Mock()
mock_article.text = "Test content that meets minimum length requirements for successful article scraping validation."
mock_article_class.return_value = mock_article
scraper.scrape_article("https://example.com/config-test")
# Verify Article was created with correct config
mock_article_class.assert_called_once()
args, kwargs = mock_article_class.call_args
assert args[0] == "https://example.com/config-test"
config = kwargs.get("config") or (args[1] if len(args) > 1 else None)
assert config is not None
assert config.browser_user_agent == "Custom-Agent/2.0"
assert config.request_timeout == 10
assert config.keep_article_html is True
assert config.fetch_images is False

View File

@ -0,0 +1,336 @@
"""
Test suite for NewsService following pragmatic outside-in TDD methodology.
This test suite follows the CLAUDE.md testing principles:
- Mock I/O boundaries (Repository calls, HTTP clients, external systems)
- Real objects for logic (Data transformations, validation, business logic)
- Outside-in but practical - Start with service tests, work inward
"""
from datetime import date
from unittest.mock import Mock
import pytest
# Import mock ScrapeResult from conftest to avoid newspaper4k import issues
from conftest import ScrapeResult
from tradingagents.domains.news.news_repository import (
NewsData,
)
from tradingagents.domains.news.news_service import (
ArticleData,
NewsContext,
NewsService,
NewsUpdateResult,
SentimentScore,
)
class TestNewsServiceCollaboratorInteractions:
"""Test NewsService interactions with its collaborators (I/O boundaries)."""
def test_get_company_news_context_calls_repository_with_correct_params(
self, mock_repository, mock_google_client, mock_article_scraper
):
"""Test that get_company_news_context calls repository with correct parameters."""
# Arrange - Mock the I/O boundary
mock_repository.get_news_data.return_value = {}
service = NewsService(mock_google_client, mock_repository, mock_article_scraper)
# Act - Call the service method
result = service.get_company_news_context("AAPL", "2024-01-01", "2024-01-31")
# Assert - Repository should be called with converted date objects
mock_repository.get_news_data.assert_called_once_with(
query="AAPL",
start_date=date(2024, 1, 1),
end_date=date(2024, 1, 31),
sources=["finnhub", "google_news"],
)
# Assert - Result should have correct structure (real object logic)
assert isinstance(result, NewsContext)
assert result.query == "AAPL"
assert result.symbol == "AAPL"
assert result.period == {"start": "2024-01-01", "end": "2024-01-31"}
def test_get_global_news_context_calls_repository_for_each_category(
self, mock_repository, mock_google_client, mock_article_scraper
):
"""Test that get_global_news_context calls repository for each category."""
# Arrange - Mock the I/O boundary
mock_repository.get_news_data.return_value = {}
service = NewsService(mock_google_client, mock_repository, mock_article_scraper)
categories = ["business", "politics", "technology"]
# Act
service.get_global_news_context(
"2024-01-01", "2024-01-31", categories=categories
)
# Assert - Repository should be called once for each category
assert mock_repository.get_news_data.call_count == 3
for call_args in mock_repository.get_news_data.call_args_list:
args, kwargs = call_args
assert args[0] in categories # query should be one of the categories
assert args[1] == date(2024, 1, 1) # start_date
assert args[2] == date(2024, 1, 31) # end_date
assert kwargs["sources"] == ["google_news"]
def test_update_company_news_calls_google_client(
self, mock_repository, mock_google_client, mock_article_scraper
):
"""Test that update_company_news calls GoogleNewsClient correctly."""
# Arrange - Mock the I/O boundary
mock_google_client.get_company_news.return_value = []
service = NewsService(mock_google_client, mock_repository, mock_article_scraper)
# Act
result = service.update_company_news("AAPL")
# Assert - Google client should be called
mock_google_client.get_company_news.assert_called_once_with("AAPL")
assert isinstance(result, NewsUpdateResult)
assert result.symbol == "AAPL"
assert result.articles_found == 0
def test_update_company_news_scrapes_each_article_url(
self,
mock_repository,
mock_google_client,
mock_article_scraper,
sample_google_articles,
):
"""Test that update_company_news calls scraper for each article URL."""
# Arrange - Mock I/O boundaries with real data objects
mock_google_client.get_company_news.return_value = sample_google_articles
mock_article_scraper.scrape_article.return_value = ScrapeResult(
status="SUCCESS",
content="Full article content",
author="Test Author",
title="Test Title",
publish_date="2024-01-15",
)
service = NewsService(mock_google_client, mock_repository, mock_article_scraper)
# Act
result = service.update_company_news("AAPL")
# Assert - Scraper should be called for each article
assert mock_article_scraper.scrape_article.call_count == 2
mock_article_scraper.scrape_article.assert_any_call(
"https://example.com/apple-soars"
)
mock_article_scraper.scrape_article.assert_any_call(
"https://example.com/apple-products"
)
# Assert - Real object logic for result
assert result.articles_found == 2
assert result.articles_scraped == 2
assert result.articles_failed == 0
def test_repository_failure_returns_empty_context_with_error_metadata(
self, mock_repository, mock_google_client, mock_article_scraper
):
"""Test that repository failure is handled gracefully."""
# Arrange - Mock repository failure (I/O boundary)
mock_repository.get_news_data.side_effect = Exception(
"Database connection failed"
)
service = NewsService(mock_google_client, mock_repository, mock_article_scraper)
# Act
result = service.get_company_news_context("AAPL", "2024-01-01", "2024-01-31")
# Assert - Should return empty context with error metadata (real object logic)
assert isinstance(result, NewsContext)
assert result.articles == []
assert result.article_count == 0
assert "error" in result.metadata
assert "Database connection failed" in result.metadata["error"]
class TestNewsServiceDataTransformations:
"""Test data transformations using real objects (no mocking)."""
def test_converts_repository_articles_to_article_data(
self, mock_google_client, mock_article_scraper, sample_news_articles
):
"""Test conversion of NewsRepository.NewsArticle to ArticleData."""
# Arrange - Create real repository with sample data
mock_repo = Mock()
news_data = NewsData(
query="AAPL",
date=date(2024, 1, 15),
source="finnhub",
articles=sample_news_articles,
)
mock_repo.get_news_data.return_value = {date(2024, 1, 15): [news_data]}
service = NewsService(mock_google_client, mock_repo, mock_article_scraper)
# Act - Test real data transformation logic
result = service.get_company_news_context("AAPL", "2024-01-01", "2024-01-31")
# Assert - Real object data transformation
assert len(result.articles) == 2
assert result.articles[0].title == "Apple Stock Rises 5% on Strong Earnings"
assert (
result.articles[0].content
== "Apple reports strong quarterly earnings beating expectations"
)
assert result.articles[0].date == "2024-01-15"
assert result.articles[0].source == "CNBC"
assert result.articles[0].url == "https://example.com/apple-earnings"
def test_calculates_sentiment_summary_from_articles(
self, mock_repository, mock_google_client, mock_article_scraper
):
"""Test sentiment summary calculation from article list."""
# Arrange - Create articles with sentiment-bearing content (real objects)
articles = [
ArticleData(
title="Great News for Apple",
content="Apple stock is performing excellent with strong growth and positive outlook",
author="Analyst",
source="CNBC",
date="2024-01-15",
url="https://example.com/positive",
),
ArticleData(
title="Apple Faces Challenges",
content="Apple stock is declining due to bad earnings and negative market sentiment",
author="Reporter",
source="Reuters",
date="2024-01-16",
url="https://example.com/negative",
),
]
service = NewsService(mock_google_client, mock_repository, mock_article_scraper)
# Act - Test real sentiment calculation logic (private method)
sentiment = service._calculate_sentiment_summary(articles)
# Assert - Real sentiment calculation
assert isinstance(sentiment, SentimentScore)
assert -1.0 <= sentiment.score <= 1.0
assert 0.0 <= sentiment.confidence <= 1.0
assert sentiment.label in ["positive", "negative", "neutral"]
def test_extracts_trending_topics_from_articles(
self, mock_repository, mock_google_client, mock_article_scraper
):
"""Test trending topic extraction."""
# Arrange - Create articles with repeated keywords (real objects)
articles = [
ArticleData(
title="Apple iPhone Sales Surge",
content="Content about iPhone",
author="Reporter",
source="TechNews",
date="2024-01-15",
url="https://example.com/iphone1",
),
ArticleData(
title="iPhone Market Share Growth",
content="More iPhone content",
author="Analyst",
source="MarketWatch",
date="2024-01-16",
url="https://example.com/iphone2",
),
ArticleData(
title="Apple Revenue from Services",
content="Services revenue content",
author="Finance Writer",
source="Bloomberg",
date="2024-01-17",
url="https://example.com/services",
),
]
service = NewsService(mock_google_client, mock_repository, mock_article_scraper)
# Act - Test real trending topic extraction logic
topics = service._extract_trending_topics(articles)
# Assert - Should identify repeated keywords
assert isinstance(topics, list)
assert "iphone" in topics # Should appear twice
assert "apple" in topics # Should appear multiple times
class TestNewsServiceErrorScenarios:
"""Test various error scenarios and edge cases."""
def test_handles_google_client_failure(
self, mock_repository, mock_google_client, mock_article_scraper
):
"""Test handling of GoogleNewsClient failure."""
# Arrange - Mock client failure (I/O boundary)
mock_google_client.get_company_news.side_effect = Exception(
"API rate limit exceeded"
)
service = NewsService(mock_google_client, mock_repository, mock_article_scraper)
# Act & Assert - Should raise the exception
with pytest.raises(Exception, match="API rate limit exceeded"):
service.update_company_news("AAPL")
def test_handles_article_scraper_failure(
self,
mock_repository,
mock_google_client,
mock_article_scraper,
sample_google_articles,
):
"""Test handling of article scraper failure."""
# Arrange - Mock scraper returning failure status
mock_google_client.get_company_news.return_value = sample_google_articles
mock_article_scraper.scrape_article.return_value = ScrapeResult(
status="SCRAPE_FAILED", content="", author="", title="", publish_date=""
)
service = NewsService(mock_google_client, mock_repository, mock_article_scraper)
# Act
result = service.update_company_news("AAPL")
# Assert - Should handle scraper failures gracefully
assert result.articles_found == 2
assert result.articles_scraped == 0
assert result.articles_failed == 2
def test_handles_invalid_date_formats(
self, mock_repository, mock_google_client, mock_article_scraper
):
"""Test validation of date formats."""
service = NewsService(mock_google_client, mock_repository, mock_article_scraper)
# Act & Assert - Should raise ValueError for invalid date format
with pytest.raises(ValueError):
service.get_company_news_context("AAPL", "invalid-date", "2024-01-31")
def test_handles_empty_articles_gracefully(
self, mock_repository, mock_google_client, mock_article_scraper
):
"""Test handling of empty article list."""
service = NewsService(mock_google_client, mock_repository, mock_article_scraper)
# Act - Test sentiment calculation with empty list
sentiment = service._calculate_sentiment_summary([])
# Assert - Should return neutral sentiment
assert sentiment.score == 0.0
assert sentiment.confidence == 0.0
assert sentiment.label == "neutral"

View File

@ -8,7 +8,7 @@ from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlparse
import newspaper
from newspaper import Article, Config
logger = logging.getLogger(__name__)
@ -28,12 +28,12 @@ class ScrapeResult:
class ArticleScraperClient:
"""Client for scraping article content with Internet Archive fallback."""
def __init__(self, user_agent: str, delay: float = 1.0):
def __init__(self, user_agent: str | None = None, delay: float = 1.0):
"""
Initialize article scraper.
Args:
user_agent: User agent string for requests
user_agent: User agent string for requests (None for default)
delay: Delay between requests in seconds
"""
self.user_agent = user_agent or (
@ -65,17 +65,18 @@ class ArticleScraperClient:
return self._scrape_from_wayback(url)
def _scrape_from_source(self, url: str) -> ScrapeResult:
"""Scrape article from original source using newspaper3k."""
"""Scrape article from original source using newspaper4k."""
try:
# Add delay to be respectful
time.sleep(self.delay)
# Configure newspaper article
article = newspaper.Article(url)
article.config.browser_user_agent = self.user_agent
article.config.request_timeout = 10
# Configure newspaper4k with optimizations
config = Config()
config.browser_user_agent = self.user_agent
config.request_timeout = 10
config.fetch_images = False
# Download and parse
article = Article(url, config=config)
article.download()
article.parse()

View File

@ -4,6 +4,7 @@ News service that provides structured news context.
import logging
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Any
@ -134,13 +135,39 @@ class NewsService:
try:
logger.info(f"Getting company news context for {symbol} from repository")
# Get articles from repository
# Get articles from repository (READ PATH - no API calls)
articles = []
if self.repository:
try:
# This would depend on the actual repository interface
# For now, return empty list - repository integration needs to be completed
articles = []
# Convert date strings to date objects
start_date_obj = date.fromisoformat(start_date)
end_date_obj = date.fromisoformat(end_date)
# Get cached news data from repository
news_data_by_date = self.repository.get_news_data(
query=symbol,
start_date=start_date_obj,
end_date=end_date_obj,
sources=["finnhub", "google_news"],
)
# Convert repository data to ArticleData objects
for _date_key, news_data_list in news_data_by_date.items():
for news_data in news_data_list:
for article in news_data.articles:
articles.append(
ArticleData(
title=article.headline,
content=article.summary
or "", # Use summary as fallback for content
author=article.author or "",
source=article.source,
date=article.published_date.isoformat(),
url=article.url,
sentiment=None, # Will be calculated later
)
)
logger.debug(
f"Retrieved {len(articles)} articles from repository for {symbol}"
)
@ -218,13 +245,39 @@ class NewsService:
f"Getting global news context from repository for categories: {categories}"
)
# Get articles from repository
# Get articles from repository (READ PATH - no API calls)
articles = []
if self.repository:
try:
# This would depend on the actual repository interface
# For now, return empty list - repository integration needs to be completed
articles = []
# Convert date strings to date objects
start_date_obj = date.fromisoformat(start_date)
end_date_obj = date.fromisoformat(end_date)
# Get cached news data from repository for each category
for category in categories:
news_data_by_date = self.repository.get_news_data(
query=category,
start_date=start_date_obj,
end_date=end_date_obj,
sources=["google_news"], # Global news mainly from Google
)
# Convert repository data to ArticleData objects
for _date_key, news_data_list in news_data_by_date.items():
for news_data in news_data_list:
for article in news_data.articles:
articles.append(
ArticleData(
title=article.headline,
content=article.summary or "",
author=article.author or "",
source=article.source,
date=article.published_date.isoformat(),
url=article.url,
sentiment=None,
)
)
logger.debug(
f"Retrieved {len(articles)} global articles from repository"
)

31
typings/newspaper.pyi Normal file
View File

@ -0,0 +1,31 @@
"""Type stubs for newspaper (newspaper4k package)."""
from datetime import datetime
class Config:
"""Configuration for newspaper Article."""
browser_user_agent: str
request_timeout: int
fetch_images: bool
def __init__(self) -> None: ...
class Article:
"""Article class for parsing web articles."""
text: str
title: str | None
authors: list[str]
publish_date: datetime | None
top_image: str | None
movies: list[str]
keywords: list[str]
summary: str
def __init__(self, url: str, config: Config | None = None) -> None: ...
def download(self) -> None: ...
def parse(self) -> None: ...
def nlp(self) -> None: ...
def article(url: str) -> Article: ...

41
uv.lock
View File

@ -633,17 +633,6 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/32/b6/7517af5234378518f27ad35a7b24af9591bc500b8c1780929c1295999eb6/fastapi-0.115.9-py3-none-any.whl", hash = "sha256:4a439d7923e4de796bcc88b64e9754340fcd1574673cbd865ba8a99fe0d28c56", size = 94919, upload-time = "2025-02-27T16:43:40.537Z" },
]
[[package]]
name = "feedfinder2"
version = "0.0.4"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "beautifulsoup4" },
{ name = "requests" },
{ name = "six" },
]
sdist = { url = "https://files.pythonhosted.org/packages/35/82/1251fefec3bb4b03fd966c7e7f7a41c9fc2bb00d823a34c13f847fd61406/feedfinder2-0.0.4.tar.gz", hash = "sha256:3701ee01a6c85f8b865a049c30ba0b4608858c803fe8e30d1d289fdbe89d0efe", size = 3297, upload-time = "2016-01-25T15:09:17.492Z" }
[[package]]
name = "feedparser"
version = "6.0.11"
@ -1049,12 +1038,6 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/2c/e1/e6716421ea10d38022b952c159d5161ca1193197fb744506875fbb87ea7b/iniconfig-2.1.0-py3-none-any.whl", hash = "sha256:9deba5723312380e77435581c6bf4935c94cbfab9b1ed33ef8d238ea168eb760", size = 6050, upload-time = "2025-03-19T20:10:01.071Z" },
]
[[package]]
name = "jieba3k"
version = "0.35.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/a9/cb/2c8332bcdc14d33b0bedd18ae0a4981a069c3513e445120da3c3f23a8aaa/jieba3k-0.35.1.zip", hash = "sha256:980a4f2636b778d312518066be90c7697d410dd5a472385f5afced71a2db1c10", size = 7423646, upload-time = "2014-11-15T05:47:47.978Z" }
[[package]]
name = "jinja2"
version = "3.1.6"
@ -1700,27 +1683,25 @@ wheels = [
]
[[package]]
name = "newspaper3k"
version = "0.2.8"
name = "newspaper4k"
version = "0.9.3.1"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "beautifulsoup4" },
{ name = "cssselect" },
{ name = "feedfinder2" },
{ name = "feedparser" },
{ name = "jieba3k" },
{ name = "lxml" },
{ name = "nltk" },
{ name = "numpy" },
{ name = "pandas" },
{ name = "pillow" },
{ name = "python-dateutil" },
{ name = "pyyaml" },
{ name = "requests" },
{ name = "tinysegmenter" },
{ name = "tldextract" },
]
sdist = { url = "https://files.pythonhosted.org/packages/ce/fb/8f8525be0cafa48926e85b0c06a7cb3e2a892d340b8036f8c8b1b572df1c/newspaper3k-0.2.8.tar.gz", hash = "sha256:9f1bd3e1fb48f400c715abf875cc7b0a67b7ddcd87f50c9aeeb8fcbbbd9004fb", size = 205685, upload-time = "2018-09-28T04:58:23.53Z" }
sdist = { url = "https://files.pythonhosted.org/packages/af/a8/80a186f09ffa2a9366ed93391b03fdaf8057d75a67a21c2eafef36b654ba/newspaper4k-0.9.3.1.tar.gz", hash = "sha256:fc237ae6a7b65d5ac4df224f962b2d7368c991fdf63b5176e439a1b74a2992e0", size = 273009, upload-time = "2024-03-18T21:56:46.344Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/d7/b9/51afecb35bb61b188a4b44868001de348a0e8134b4dfa00ffc191567c4b9/newspaper3k-0.2.8-py3-none-any.whl", hash = "sha256:44a864222633d3081113d1030615991c3dbba87239f6bbf59d91240f71a22e3e", size = 211132, upload-time = "2018-09-28T04:58:18.847Z" },
{ url = "https://files.pythonhosted.org/packages/ab/73/cc4e7a57373e6940fc081d4f36988e3faa54c59a51dea4e8f01d5c10ccb6/newspaper4k-0.9.3.1-py3-none-any.whl", hash = "sha256:42a03b7915d92941a9fe4cc8dab47240219560e0cb8ecb5a291dc5a913eb8aa4", size = 296617, upload-time = "2024-03-18T21:56:43.932Z" },
]
[[package]]
@ -3443,12 +3424,6 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/de/a8/8f499c179ec900783ffe133e9aab10044481679bb9aad78436d239eee716/tiktoken-0.9.0-cp313-cp313-win_amd64.whl", hash = "sha256:5ea0edb6f83dc56d794723286215918c1cde03712cbbafa0348b33448faf5b95", size = 894669, upload-time = "2025-02-14T06:02:47.341Z" },
]
[[package]]
name = "tinysegmenter"
version = "0.3"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/17/82/86982e4b6d16e4febc79c2a1d68ee3b707e8a020c5d2bc4af8052d0f136a/tinysegmenter-0.3.tar.gz", hash = "sha256:ed1f6d2e806a4758a73be589754384cbadadc7e1a414c81a166fc9adf2d40c6d", size = 16893, upload-time = "2017-07-23T11:18:29.85Z" }
[[package]]
name = "tldextract"
version = "5.3.0"
@ -3591,7 +3566,7 @@ dependencies = [
{ name = "langchain-google-genai" },
{ name = "langchain-openai" },
{ name = "langgraph" },
{ name = "newspaper3k" },
{ name = "newspaper4k" },
{ name = "pandas" },
{ name = "parsel" },
{ name = "praw" },
@ -3642,7 +3617,7 @@ requires-dist = [
{ name = "langchain-google-genai", specifier = ">=2.1.5" },
{ name = "langchain-openai", specifier = ">=0.3.23" },
{ name = "langgraph", specifier = ">=0.4.8" },
{ name = "newspaper3k", specifier = ">=0.2.8" },
{ name = "newspaper4k", specifier = ">=0.9.3" },
{ name = "pandas", specifier = ">=2.3.0" },
{ name = "parsel", specifier = ">=1.10.0" },
{ name = "praw", specifier = ">=7.8.1" },