# Product Requirements Document: NewsService Completion

## Overview

Complete the `NewsService` to provide strongly-typed news data and sentiment analysis to trading agents, using a local-first data strategy with RSS feed integration, article content extraction, and LLM-powered sentiment analysis.

## Current State Analysis

### Issues to Fix

- **CRITICAL**: Service is currently an empty placeholder with only method stubs
- **CRITICAL**: Need to implement `GoogleNewsClient` to read RSS feeds
- **CRITICAL**: Need RSS article fetching with fallback to the Internet Archive
- **CRITICAL**: Need LLM-powered sentiment analysis integration
- **CRITICAL**: Service uses `BaseClient` inheritance instead of typed clients
- **CRITICAL**: `NewsRepository` has a different interface than the service expects
- Missing strongly-typed interfaces between components
- No concrete approach for article content extraction

### What Works

- ✅ `NewsContext` and `ArticleData` Pydantic models for agent consumption
- ✅ `SentimentScore` model for structured sentiment data
- ✅ `FinnhubClient` with `get_company_news()` method using date objects
- ✅ `NewsRepository` with dataclass-based storage and deduplication
- ✅ Service structure placeholder ready for implementation

## Technical Requirements

### 1. Strongly-Typed Interfaces

#### Client → Service Interface

```python
# FinnhubClient methods (already implemented)
def get_company_news(symbol: str, start_date: date, end_date: date) -> dict[str, Any]: ...

# GoogleNewsClient methods (to be implemented)
def fetch_rss_feed(query: str, start_date: date, end_date: date) -> dict[str, Any]: ...
def fetch_article_content(url: str, use_archive_fallback: bool = True) -> dict[str, Any]: ...
def get_company_news(symbol: str, start_date: date, end_date: date) -> dict[str, Any]: ...
def get_global_news(start_date: date, end_date: date, categories: list[str]) -> dict[str, Any]: ...
```

#### Service → Repository Interface

```python
# NewsRepository methods (to be implemented/bridged)
def has_data_for_period(query: str, start_date: str, end_date: str, symbol: str | None) -> bool: ...
def get_data(query: str, start_date: str, end_date: str, symbol: str | None) -> dict[str, Any]: ...
def store_data(query: str, cache_data: dict, symbol: str | None, overwrite: bool) -> bool: ...
def clear_data(query: str, start_date: str, end_date: str, symbol: str | None) -> bool: ...
```

#### Service → Agent Interface

```python
# Service output (already defined)
def get_context(query: str, start_date: str, end_date: str, symbol: str | None, sources: list[str], force_refresh: bool) -> NewsContext: ...
```

### 2. Local-First Data Strategy

#### Flow

1. **Repository Lookup**: Check `NewsRepository.has_data_for_period()` (the full flow is sketched after this list)
2. **Freshness Check**: Determine whether the cache needs updating (news is append-only)
3. **RSS Feed Fetching**: Fetch RSS feeds from Google News
4. **Content Extraction**: Extract full article content with Internet Archive fallback
5. **LLM Analysis**: Perform sentiment analysis using an LLM
6. **Cache Updates**: Store enriched articles via `repository.store_data()`
7. **Context Assembly**: Return a validated `NewsContext`
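
A minimal sketch of this flow on the service, assuming the client and repository interfaces above; `_get_articles_local_first` and `_last_fetch_time` are hypothetical helper names introduced here for illustration:

```python
from datetime import date, datetime
from typing import Any


def _get_articles_local_first(self, query: str, start_dt: date, end_dt: date,
                              symbol: str | None, force_refresh: bool) -> dict[str, Any]:
    """Serve from the repository when fresh; otherwise fetch, enrich, and store."""
    start_s, end_s = start_dt.isoformat(), end_dt.isoformat()

    # Steps 1-2: repository lookup and freshness check
    if (not force_refresh
            and self.repository.has_data_for_period(query, start_s, end_s, symbol)
            and not self.should_fetch_new_articles(self._last_fetch_time(query), datetime.now())):
        return self.repository.get_data(query, start_s, end_s, symbol)

    # Step 3: fetch RSS feeds from Google News
    feed = self.google_client.fetch_rss_feed(query, start_dt, end_dt)

    # Step 4: extract full article content (with Internet Archive fallback)
    for article in feed["articles"]:
        article.update(self.google_client.fetch_article_content(article["url"]))

    # Step 6: store the enriched articles; step 5 (LLM sentiment) runs after
    # multi-source aggregation in get_context()
    self.repository.store_data(query, feed, symbol, overwrite=False)
    return feed
```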

#### News-Specific Gap Detection

```python
def should_fetch_new_articles(self, last_fetch_time: datetime | None, current_time: datetime) -> bool:
    """
    News doesn't have "gaps" - it's append-only. Check whether enough time has passed for new articles.

    Returns True if:
    - Last fetch was more than 6 hours ago
    - User requested force_refresh
    - No data exists for the query/period
    """
    if not last_fetch_time:
        return True

    hours_since_fetch = (current_time - last_fetch_time).total_seconds() / 3600
    return hours_since_fetch >= 6  # Fetch new articles every 6 hours
```

#### Force Refresh Support

- `force_refresh=True` fetches all articles fresh from sources
- Does NOT clear existing cache (news is immutable)
- Deduplicates against existing articles before storing

#### Cache Invalidation Strategy

- **Articles are immutable**: Once published, articles don't change
- **Cache grows append-only**: New articles are added, old ones retained
- **Freshness check**: Re-fetch every 6 hours for new articles
- **No deletion**: Articles are never removed from cache

### 3. RSS Feed Processing & Article Fetching

#### GoogleNewsClient RSS Implementation

```python
from datetime import date, datetime
from typing import Any, Optional
from urllib.parse import quote_plus

import feedparser
import requests
from newspaper import Article


class GoogleNewsClient:
    """Google News RSS client following the FinnhubClient standard."""

    def __init__(self):
        self.base_rss_url = "https://news.google.com/rss"
        self.archive_base_url = "https://archive.org/wayback/available"

    def fetch_rss_feed(self, query: str, start_date: date, end_date: date) -> dict[str, Any]:
        """
        Fetch RSS feed data for news articles.

        Args:
            query: Search query or company symbol
            start_date: Start date for filtering articles
            end_date: End date for filtering articles

        Returns:
            Dict containing RSS feed articles with metadata
        """
        # Construct the RSS feed URL (URL-encode the query)
        rss_url = f"{self.base_rss_url}/search?q={quote_plus(query)}&hl=en-US&gl=US&ceid=US:en"

        # Parse the RSS feed
        feed = feedparser.parse(rss_url)

        # Filter and structure articles
        articles = []
        for entry in feed.entries:
            # Skip entries without a parseable publication date
            if not getattr(entry, "published_parsed", None):
                continue
            pub_date = datetime(*entry.published_parsed[:6]).date()

            # Filter by date range
            if start_date <= pub_date <= end_date:
                articles.append({
                    "headline": entry.title,
                    "url": entry.link,
                    "source": entry.get("source", {}).get("title", "Google News"),
                    "date": pub_date.isoformat(),
                    "summary": entry.get("summary", ""),
                })

        return {
            "query": query,
            "period": {"start": start_date.isoformat(), "end": end_date.isoformat()},
            "articles": articles,
            "metadata": {
                "source": "google_news_rss",
                "rss_feed_url": rss_url,
                "article_count": len(articles),
            },
        }

    def fetch_article_content(self, url: str, use_archive_fallback: bool = True) -> dict[str, Any]:
        """
        Fetch full article content from a URL with Internet Archive fallback.

        Args:
            url: Article URL to fetch
            use_archive_fallback: Whether to try the Internet Archive if the direct fetch fails

        Returns:
            Dict containing article content, title, and publication date
        """
        try:
            # Try a direct fetch
            article = Article(url)
            article.download()
            article.parse()

            return {
                "content": article.text,
                "title": article.title,
                "authors": article.authors,
                "publish_date": article.publish_date.isoformat() if article.publish_date else None,
                "extracted_via": "direct_fetch",
                "extraction_success": True,
            }

        except Exception as e:
            if use_archive_fallback:
                # Try the Internet Archive
                archive_url = self._get_archive_url(url)
                if archive_url:
                    try:
                        article = Article(archive_url)
                        article.download()
                        article.parse()

                        return {
                            "content": article.text,
                            "title": article.title,
                            "authors": article.authors,
                            "publish_date": article.publish_date.isoformat() if article.publish_date else None,
                            "extracted_via": "internet_archive",
                            "extraction_success": True,
                        }
                    except Exception:
                        pass

            # Return failure
            return {
                "content": "",
                "title": "",
                "extracted_via": "failed",
                "extraction_success": False,
                "error": str(e),
            }

    def _get_archive_url(self, url: str) -> Optional[str]:
        """Get the Internet Archive snapshot URL for a given URL."""
        try:
            response = requests.get(self.archive_base_url, params={"url": url}, timeout=10)
            data = response.json()
            if data.get("archived_snapshots", {}).get("closest", {}).get("available"):
                return data["archived_snapshots"]["closest"]["url"]
        except Exception:
            pass
        return None
```
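
For reference, a minimal usage sketch of the client above (the query and dates are illustrative):

```python
from datetime import date

client = GoogleNewsClient()

# Fetch RSS headlines for a query over a date window
feed = client.fetch_rss_feed("AAPL", date(2024, 1, 1), date(2024, 1, 31))
print(feed["metadata"]["article_count"])

# Enrich each headline with full article content, falling back to the Internet Archive
for article in feed["articles"]:
    article.update(client.fetch_article_content(article["url"]))
```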

### 4. LLM-Powered Sentiment Analysis

#### Sentiment Analysis Integration

```python
import json
import time

# ArticleData and SentimentScore are the project's existing Pydantic models
# (import path assumed):
# from tradingagents.models import ArticleData, SentimentScore


class LLMSentimentAnalyzer:
    """LLM-based sentiment analyzer for financial news."""

    def __init__(self, llm_client):
        self.llm_client = llm_client
        self.sentiment_prompt = """
        Analyze the sentiment of this financial news article for trading purposes.

        Article:
        Title: {headline}
        Content: {content}

        Provide your analysis in the following JSON format:
        {{
            "score": <float between -1.0 (very negative) and 1.0 (very positive)>,
            "confidence": <float between 0.0 and 1.0>,
            "label": <"positive", "negative", or "neutral">,
            "reasoning": <brief explanation>,
            "key_themes": <list of key financial themes>,
            "financial_entities": <list of mentioned companies/tickers>
        }}

        Focus on the financial and market implications of the news.
        """

    def analyze_sentiment(self, article: ArticleData) -> SentimentScore:
        """
        Analyze article sentiment using the LLM.

        Args:
            article: Article data with headline and content

        Returns:
            SentimentScore with score, confidence, and label
        """
        # Prepare the prompt
        prompt = self.sentiment_prompt.format(
            headline=article.headline,
            content=article.content[:2000],  # Limit content length
        )

        # Get the LLM response
        response = self.llm_client.complete(prompt)

        # Parse the response
        try:
            result = json.loads(response)

            # Convert to SentimentScore
            score = result.get("score", 0.0)
            return SentimentScore(
                positive=max(0, score),
                negative=abs(min(0, score)),
                neutral=1.0 - abs(score),
                metadata={
                    "confidence": result.get("confidence", 0.5),
                    "label": result.get("label", "neutral"),
                    "reasoning": result.get("reasoning", ""),
                    "key_themes": result.get("key_themes", []),
                    "financial_entities": result.get("financial_entities", []),
                },
            )
        except Exception as e:
            # Return neutral sentiment on error
            return SentimentScore(
                positive=0.0,
                negative=0.0,
                neutral=1.0,
                metadata={"error": str(e)},
            )

    def batch_analyze(self, articles: list[ArticleData], batch_size: int = 5) -> list[SentimentScore]:
        """
        Batch-process sentiment analysis for multiple articles.

        Args:
            articles: List of articles to analyze
            batch_size: Number of articles to process per batch

        Returns:
            List of sentiment scores corresponding to the input articles
        """
        results = []

        for i in range(0, len(articles), batch_size):
            batch = articles[i:i + batch_size]

            # Process the batch (could be parallelized)
            for article in batch:
                sentiment = self.analyze_sentiment(article)
                results.append(sentiment)

            # Small delay between batches to respect rate limits
            time.sleep(0.1)

        return results
```

### 5. Date Object Conversion

#### Service Boundary Conversion

```python
# The service receives string dates from agents
def get_context(self, query: str, start_date: str, end_date: str, ...) -> NewsContext:
    # Validate date strings
    try:
        start_dt = date.fromisoformat(start_date)
        end_dt = date.fromisoformat(end_date)
    except ValueError as e:
        raise ValueError(f"Invalid date format: {e}") from e

    # Check date order
    if end_dt < start_dt:
        raise ValueError(f"End date {end_date} is before start date {start_date}")

    # Fetch from multiple sources
    finnhub_data = self.finnhub_client.get_company_news(symbol, start_dt, end_dt) if symbol else None
    google_rss = self.google_client.fetch_rss_feed(query, start_dt, end_dt)

    # Fetch full article content for RSS articles
    for article in google_rss.get('articles', []):
        content_data = self.google_client.fetch_article_content(article['url'])
        article.update(content_data)

    # Combine all articles
    all_articles = self._combine_and_deduplicate(finnhub_data, google_rss)

    # Perform LLM sentiment analysis
    enriched_articles = []
    for article in all_articles:
        article_data = ArticleData(**article)
        article_data.sentiment = self.sentiment_analyzer.analyze_sentiment(article_data)
        enriched_articles.append(article_data)

    # Create and return the context
    return self._create_news_context(enriched_articles, start_date, end_date)
```
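
The snippet above calls a `_combine_and_deduplicate` helper that is not yet defined. A minimal sketch, assuming both source payloads use the `articles` list format shown in this document and that the URL is the deduplication key (as required by the implementation tasks below):

```python
def _combine_and_deduplicate(
    self,
    finnhub_data: dict | None,
    google_data: dict | None,
) -> list[dict]:
    """Merge articles from both sources, keeping the first occurrence of each URL."""
    seen_urls: set[str] = set()
    combined: list[dict] = []

    for source_data in (finnhub_data, google_data):
        if not source_data:
            continue
        for article in source_data.get("articles", []):
            url = article.get("url")
            if url and url not in seen_urls:
                seen_urls.add(url)
                combined.append(article)

    return combined
```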

### 6. Error Recovery and Partial Data

```python
def handle_source_failure(
    self,
    finnhub_data: dict | None,
    google_data: dict | None,
    errors: dict[str, Exception]
) -> NewsContext:
    """
    Handle cases where one or more news sources fail.

    - If all sources fail: raise an exception
    - If some sources succeed: return partial data with metadata
    - Track content extraction failures separately
    """
    if not finnhub_data and not google_data:
        raise ValueError("All news sources failed to return data")

    # Track extraction statistics
    extraction_stats = {
        "total_articles": 0,
        "successful_extractions": 0,
        "archive_fallbacks": 0,
        "failed_extractions": 0
    }

    # Process available articles
    all_articles = []
    successful_sources = []

    if finnhub_data:
        all_articles.extend(finnhub_data.get('articles', []))
        successful_sources.append('finnhub')

    if google_data:
        articles = google_data.get('articles', [])
        for article in articles:
            extraction_stats["total_articles"] += 1
            if article.get("extraction_success"):
                extraction_stats["successful_extractions"] += 1
                if article.get("extracted_via") == "internet_archive":
                    extraction_stats["archive_fallbacks"] += 1
            else:
                extraction_stats["failed_extractions"] += 1

        all_articles.extend(articles)
        successful_sources.append('google_news')

    metadata = {
        "sources_requested": ["finnhub", "google_news"],
        "sources_successful": successful_sources,
        "sources_failed": {source: str(error) for source, error in errors.items()},
        "extraction_stats": extraction_stats,
        "partial_data": len(successful_sources) < 2
    }

    # Deduplicate and return the context
    return self._create_context(all_articles, metadata)
```

### 7. Repository Method Bridging

```python
# Add these bridge methods to NewsRepository.
# NewsArticle is the repository's existing storage dataclass.

def has_data_for_period(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool:
    """Bridge to the existing get_news_data method."""
    existing_data = self.get_news_data(
        symbol=symbol or query,
        start_date=start_date,
        end_date=end_date
    )
    return len(existing_data.get('articles', [])) > 0

def get_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> dict[str, Any]:
    """Bridge to the existing get_news_data method."""
    return self.get_news_data(
        symbol=symbol or query,
        start_date=start_date,
        end_date=end_date
    )

def store_data(self, query: str, cache_data: dict, symbol: str | None = None, overwrite: bool = False) -> bool:
    """Bridge to the existing store_news_articles method."""
    articles = cache_data.get('articles', [])
    if not articles:
        return False

    # Convert to the expected format
    news_articles = [
        NewsArticle(
            symbol=symbol or query,
            headline=a['headline'],
            summary=a.get('summary', ''),
            content=a.get('content', ''),
            url=a['url'],
            source=a['source'],
            date=a['date'],
            entities=a.get('entities', []),
            sentiment_score=a.get('sentiment', {}).get('score', 0.0),
            sentiment_metadata=a.get('sentiment', {})
        )
        for a in articles
    ]

    return self.store_news_articles(news_articles)

def clear_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool:
    """News is append-only, so this just marks data as stale for re-fetch."""
    # Implementation depends on the repository design;
    # could update metadata to trigger a re-fetch
    return True
```

### 8. Pydantic Validation

#### Context Structure

```python
from typing import Any

from pydantic import BaseModel, validator

# ArticleData and SentimentScore are defined elsewhere in the models package


class NewsContext(BaseModel):
    symbol: str | None
    period: dict[str, str]  # {"start": "2024-01-01", "end": "2024-01-31"}
    articles: list[ArticleData]
    sentiment_summary: SentimentScore
    article_count: int
    sources: list[str]
    metadata: dict[str, Any]

    @validator('period')
    def validate_period(cls, v):
        # Ensure start and end dates are present
        if 'start' not in v or 'end' not in v:
            raise ValueError("Period must have 'start' and 'end' dates")
        return v

    @validator('articles')
    def validate_articles(cls, v):
        # Ensure no duplicate URLs
        urls = [a.url for a in v]
        if len(urls) != len(set(urls)):
            raise ValueError("Duplicate articles detected")
        return v
```
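
A quick illustration of the validators at work, assuming `SentimentScore` exposes the fields shown earlier (the failing `period` value is illustrative):

```python
from pydantic import ValidationError

try:
    NewsContext(
        symbol="AAPL",
        period={"start": "2024-01-01"},  # Missing "end" - fails validate_period
        articles=[],
        sentiment_summary=SentimentScore(positive=0.0, negative=0.0, neutral=1.0, metadata={}),
        article_count=0,
        sources=[],
        metadata={},
    )
except ValidationError as e:
    print(e)  # Reports: Period must have 'start' and 'end' dates
```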

## Implementation Tasks

### Phase 1: Create GoogleNewsClient

1. **GoogleNewsClient Implementation**
   - Create `tradingagents/clients/google_news_client.py` following the FinnhubClient standard
   - Implement RSS feed parsing using the `feedparser` library
   - Add `fetch_rss_feed()` method with Google News RSS integration
   - Add `fetch_article_content()` method with `newspaper3k` and Internet Archive fallback
   - Use `date` objects for all date parameters
   - No `BaseClient` inheritance

2. **Article Content Extraction**
   - Implement robust article content extraction using `newspaper3k`
   - Add fallback to the Internet Archive Wayback Machine for failed fetches
   - Handle paywall detection and alternative content sources
   - Extract clean text, title, publication date, and metadata

3. **Comprehensive Testing**
   - Create a test suite for GoogleNewsClient
   - Test RSS parsing with various queries
   - Test content extraction with real and archived URLs
   - Use pytest-vcr for HTTP interaction recording (see the sketch below)
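
A minimal pytest-vcr test sketch (the import path follows the file layout above; the query and dates are illustrative):

```python
from datetime import date

import pytest

from tradingagents.clients.google_news_client import GoogleNewsClient


@pytest.mark.vcr()  # Records HTTP interactions to a cassette on first run, replays afterwards
def test_fetch_rss_feed_returns_articles():
    client = GoogleNewsClient()
    result = client.fetch_rss_feed("AAPL", date(2024, 1, 1), date(2024, 1, 31))

    assert result["metadata"]["source"] == "google_news_rss"
    for article in result["articles"]:
        assert "headline" in article and "url" in article
```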

### Phase 2: Bridge NewsRepository Interface

4. **Repository Interface Standardization**
   - Add standard service interface methods to `NewsRepository`
   - Bridge existing methods without changing the underlying storage
   - File: `tradingagents/repositories/news_repository.py`
   - Maintain backward compatibility

### Phase 3: Implement NewsService

5. **Service Core Implementation**
   - Replace method stubs with a full implementation
   - Implement `get_context()`, `get_company_news_context()`, `get_global_news_context()`
   - Add local-first data strategy with freshness checking
   - Replace `BaseClient` dependencies with typed clients
   - File: `tradingagents/services/news_service.py`

6. **LLM Sentiment Analysis Integration**
   - Implement the `LLMSentimentAnalyzer` class
   - Create financial news sentiment prompts
   - Add batch processing for efficiency
   - Handle LLM rate limiting and errors

7. **Date Conversion and Article Processing**
   - Add date validation and conversion
   - Implement the RSS article fetching pipeline
   - Add content extraction with fallback
   - Combine articles from multiple sources
   - Implement deduplication by URL

### Phase 4: Type Safety & Validation

8. **Comprehensive Type Checking**
   - Run `mise run typecheck` - must pass with 0 errors
   - Validate all date object conversions
   - Ensure `NewsContext` compliance

9. **Enhanced Testing**
   - Test RSS feed parsing edge cases
   - Test content extraction failures and fallbacks
   - Test LLM sentiment analysis with various article types
   - Test multi-source aggregation and deduplication

## Testing Scenarios

### Integration Tests

1. **RSS Feed Processing**
   - Test with various search queries
   - Test date filtering in RSS results
   - Test handling of malformed RSS feeds

2. **Content Extraction**
   - Test direct fetch success
   - Test Internet Archive fallback
   - Test paywall detection
   - Test extraction failure handling

3. **LLM Sentiment Analysis**
   - Test positive news sentiment
   - Test negative earnings reports
   - Test neutral market updates
   - Test batch processing
   - Test LLM error handling

4. **Multi-Source Aggregation**
   - Test both sources succeed
   - Test Finnhub fails, Google succeeds
   - Test Google fails, Finnhub succeeds
   - Test both sources fail

5. **Date Handling** (sketched below)
   - Test invalid date formats
   - Test end_date < start_date
   - Test date filtering in RSS feeds
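
For the date-handling cases, a hedged sketch (the `news_service` fixture and the positional call signature are assumptions):

```python
import pytest


@pytest.mark.parametrize("start,end", [
    ("2024-13-01", "2024-01-31"),  # Invalid month
    ("01/01/2024", "2024-01-31"),  # Wrong format
])
def test_invalid_date_formats_raise(news_service, start, end):
    with pytest.raises(ValueError, match="Invalid date format"):
        news_service.get_context("AAPL", start, end)


def test_end_before_start_raises(news_service):
    with pytest.raises(ValueError, match="before start date"):
        news_service.get_context("AAPL", "2024-01-31", "2024-01-01")
```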

## Success Criteria

### Functional Requirements

- ✅ Service successfully implements all placeholder methods
- ✅ GoogleNewsClient reads and parses RSS feeds correctly
- ✅ Article content extraction works with Internet Archive fallback
- ✅ LLM sentiment analysis provides structured financial sentiment
- ✅ Local-first strategy with proper freshness checking
- ✅ Multi-source aggregation with deduplication
- ✅ Returns properly validated `NewsContext` to agents
- ✅ Force refresh fetches fresh articles without clearing cache

### Technical Requirements

- ✅ Zero type checking errors: `mise run typecheck`
- ✅ Zero linting errors: `mise run lint`
- ✅ All tests pass with new implementation
- ✅ No runtime errors with date conversions
- ✅ Proper error messages for validation failures

### Quality Requirements

- ✅ Strongly-typed interfaces between all components
- ✅ RSS feed parsing with robust error handling
- ✅ Article content extraction with fallback strategy
- ✅ LLM integration with proper prompt engineering
- ✅ Efficient caching with minimal external calls
- ✅ Clear separation of concerns

## Data Architecture

### GoogleNewsClient RSS Response Format

```python
{
    "query": "Apple stock",
    "period": {"start": "2024-01-01", "end": "2024-01-31"},
    "articles": [
        {
            "headline": "Apple Stock Soars on New Product Launch",
            "summary": "Brief summary from RSS feed...",
            "content": "Full article text extracted from source...",
            "url": "https://www.cnbc.com/2024/01/20/apple-stock.html",
            "source": "CNBC",
            "date": "2024-01-20",
            "authors": ["Tech Reporter"],
            "publish_date": "2024-01-20T14:30:00Z",
            "extracted_via": "direct_fetch",  # or "internet_archive"
            "extraction_success": True
        }
    ],
    "metadata": {
        "source": "google_news_rss",
        "article_count": 25,
        "rss_feed_url": "https://news.google.com/rss/search?q=Apple+stock",
        "extraction_stats": {
            "successful": 22,
            "archive_fallback": 2,
            "failed": 3
        }
    }
}
```

### LLM Sentiment Analysis Response Format

```python
{
    "article_url": "https://www.cnbc.com/2024/01/20/apple-stock.html",
    "sentiment": {
        "positive": 0.7,
        "negative": 0.1,
        "neutral": 0.2,
        "metadata": {
            "score": 0.7,
            "confidence": 0.85,
            "label": "positive",
            "reasoning": "Article discusses positive earnings and growth outlook",
            "key_themes": ["earnings_beat", "product_launch", "revenue_growth"],
            "financial_entities": ["AAPL", "Apple Inc.", "iPhone 15"]
        }
    }
}
```

### Aggregate Sentiment Summary

```python
{
    "sentiment_summary": {
        "positive": 0.65,  # Average across all articles
        "negative": 0.20,
        "neutral": 0.15,
        "metadata": {
            "dominant_sentiment": "positive",
            "confidence": 0.82,
            "article_count": 25,
            "themes": {
                "earnings": 8,
                "product_launch": 5,
                "market_analysis": 12
            }
        }
    }
}
```
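
A minimal sketch of how this summary could be computed from per-article scores, assuming `SentimentScore` exposes the fields shown above (the theme tally and confidence averaging are illustrative choices):

```python
from collections import Counter


def summarize_sentiment(scores: list[SentimentScore]) -> SentimentScore:
    """Average per-article sentiment into a single aggregate SentimentScore."""
    n = len(scores)
    if n == 0:
        return SentimentScore(positive=0.0, negative=0.0, neutral=1.0, metadata={"article_count": 0})

    positive = sum(s.positive for s in scores) / n
    negative = sum(s.negative for s in scores) / n
    neutral = sum(s.neutral for s in scores) / n

    # Tally themes across articles and pick the dominant sentiment component
    themes = Counter(theme for s in scores for theme in s.metadata.get("key_themes", []))
    dominant = max((("positive", positive), ("negative", negative), ("neutral", neutral)),
                   key=lambda kv: kv[1])[0]

    return SentimentScore(
        positive=positive,
        negative=negative,
        neutral=neutral,
        metadata={
            "dominant_sentiment": dominant,
            "confidence": sum(s.metadata.get("confidence", 0.5) for s in scores) / n,
            "article_count": n,
            "themes": dict(themes),
        },
    )
```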

## Dependencies

### Components to Create

- ⏳ `GoogleNewsClient` - Full implementation with RSS and content extraction
- ⏳ `LLMSentimentAnalyzer` - LLM integration for sentiment analysis
- ⏳ `NewsService` - Replace stubs with full implementation

### Existing Components

- ✅ `FinnhubClient` with company news using date objects
- ✅ `NewsRepository` with dataclass storage
- ✅ `NewsContext` and related Pydantic models

### Required Libraries

- `feedparser` - RSS feed parsing
- `newspaper3k` - Article content extraction
- `requests` - HTTP requests and Internet Archive API
- `beautifulsoup4` - HTML parsing fallback
- LLM client library (OpenAI, Anthropic, etc.)

## Timeline

### Immediate (Phase 1)

- Create GoogleNewsClient with RSS and content extraction
- Implement feedparser integration
- Add Internet Archive fallback
- Create comprehensive test suite

### Phase 2-3

- Add repository bridge methods
- Implement full NewsService
- Integrate LLM sentiment analysis
- Handle multi-source aggregation

### Phase 4

- Type checking and validation
- Integration testing
- Performance optimization
- Documentation

## Acceptance Criteria

### Must Have

1. **Type Safety**: Service passes `mise run typecheck` with zero errors
2. **RSS Integration**: Successfully parse Google News RSS feeds
3. **Content Extraction**: Extract full articles with fallback
4. **LLM Sentiment**: Financial sentiment analysis for all articles
5. **Service Implementation**: All stubs replaced with working code
6. **Local-First**: Check cache before fetching new data
7. **Multi-Source**: Aggregate Finnhub and Google News

### Should Have

1. **Extraction Stats**: Track success/failure rates
2. **Batch Processing**: Efficient LLM sentiment analysis
3. **Force Refresh**: Fetch new articles on demand
4. **Error Recovery**: Handle partial failures gracefully

### Nice to Have

1. **Additional Sources**: Support more news providers
2. **Real-time Monitoring**: WebSocket for breaking news
3. **Advanced Extraction**: Handle PDFs, videos
4. **Sentiment Trends**: Track sentiment over time

---

This PRD focuses on completing the currently empty `NewsService` with a full implementation, including RSS feed integration, article content extraction with Internet Archive fallback, and LLM-powered sentiment analysis for financial news.