# Product Requirements Document: NewsService Completion

## Overview
Complete the NewsService to provide strongly-typed news data and sentiment analysis to trading agents using a local-first data strategy with RSS feed integration, article content extraction, and LLM-powered sentiment analysis.
## Current State Analysis

### Issues to Fix
- CRITICAL: Service is currently empty placeholder with only method stubs
- CRITICAL: Need to implement GoogleNewsClient to read RSS feeds
- CRITICAL: Need RSS article fetching with fallback to Internet Archive
- CRITICAL: Need LLM-powered sentiment analysis integration
- CRITICAL: Service uses `BaseClient` inheritance instead of typed clients
- CRITICAL: `NewsRepository` has a different interface than the service expects
- Missing strongly-typed interfaces between components
- No concrete approach for article content extraction
### What Works

- ✅ `NewsContext` and `ArticleData` Pydantic models for agent consumption
- ✅ `SentimentScore` model for structured sentiment data
- ✅ `FinnhubClient` with `get_company_news()` method using date objects
- ✅ `NewsRepository` with dataclass-based storage and deduplication
- ✅ Service structure placeholder ready for implementation
## Technical Requirements

### 1. Strongly-Typed Interfaces

#### Client → Service Interface

```python
# FinnhubClient methods (already implemented)
def get_company_news(symbol: str, start_date: date, end_date: date) -> dict[str, Any]

# GoogleNewsClient methods (to be implemented)
def fetch_rss_feed(query: str, start_date: date, end_date: date) -> dict[str, Any]
def fetch_article_content(url: str, use_archive_fallback: bool = True) -> dict[str, Any]
def get_company_news(symbol: str, start_date: date, end_date: date) -> dict[str, Any]
def get_global_news(start_date: date, end_date: date, categories: list[str]) -> dict[str, Any]
```
#### Service → Repository Interface

```python
# NewsRepository methods (to be implemented/bridged)
def has_data_for_period(query: str, start_date: str, end_date: str, symbol: str | None) -> bool
def get_data(query: str, start_date: str, end_date: str, symbol: str | None) -> dict[str, Any]
def store_data(query: str, cache_data: dict, symbol: str | None, overwrite: bool) -> bool
def clear_data(query: str, start_date: str, end_date: str, symbol: str | None) -> bool
```
#### Service → Agent Interface

```python
# Service output (already defined)
def get_context(query: str, start_date: str, end_date: str, symbol: str | None, sources: list[str], force_refresh: bool) -> NewsContext
```
### 2. Local-First Data Strategy

#### Flow

1. Repository Lookup: Check `NewsRepository.has_data_for_period()`
2. Freshness Check: Determine whether the cache needs updating (news is append-only)
3. RSS Feed Fetching: Fetch RSS feeds from Google News
4. Content Extraction: Extract full article content, with Internet Archive fallback
5. LLM Analysis: Perform sentiment analysis using the LLM
6. Cache Updates: Store enriched articles via `repository.store_data()`
7. Context Assembly: Return a validated `NewsContext` (sketched below)
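A minimal sketch of how the service might wire these steps together, assuming the client and repository interfaces above; the cached `last_fetch_time` field, the service attribute names, and the helper wiring are assumptions rather than the final design:

```python
from datetime import date, datetime

def get_context(
    self,
    query: str,
    start_date: str,
    end_date: str,
    symbol: str | None = None,
    force_refresh: bool = False,
) -> NewsContext:
    # Steps 1-2: serve from the cache when data exists and is still fresh
    if not force_refresh and self.repository.has_data_for_period(query, start_date, end_date, symbol):
        cached = self.repository.get_data(query, start_date, end_date, symbol)
        last_fetch = cached.get("last_fetch_time")  # assumed to be stored with the cached payload
        if last_fetch and not self.should_fetch_new_articles(last_fetch, datetime.now()):
            return self._create_news_context(cached["articles"], start_date, end_date)

    # Steps 3-4: fetch RSS entries and extract full content (section 3)
    start_dt, end_dt = date.fromisoformat(start_date), date.fromisoformat(end_date)
    rss = self.google_client.fetch_rss_feed(query, start_dt, end_dt)
    for article in rss.get("articles", []):
        article.update(self.google_client.fetch_article_content(article["url"]))

    # Step 5: sentiment analysis (section 4), after URL-keyed deduplication
    articles = self._combine_and_deduplicate(rss)
    for article in articles:
        article["sentiment"] = self.sentiment_analyzer.analyze_sentiment(ArticleData(**article))

    # Step 6: append new articles to the cache (deduplicated, never cleared)
    self.repository.store_data(query, {"articles": articles}, symbol, overwrite=False)

    # Step 7: assemble the validated context
    return self._create_news_context(articles, start_date, end_date)
```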
#### News-Specific Gap Detection

```python
def should_fetch_new_articles(self, last_fetch_time: datetime | None, current_time: datetime) -> bool:
    """
    News doesn't have "gaps" - it's append-only. Check whether enough time
    has passed for new articles to appear.

    Returns True if:
    - The last fetch was more than 6 hours ago
    - No data exists for the query/period (last_fetch_time is None)

    (force_refresh bypasses this check at the service level.)
    """
    if not last_fetch_time:
        return True
    hours_since_fetch = (current_time - last_fetch_time).total_seconds() / 3600
    return hours_since_fetch >= 6  # Fetch new articles every 6 hours
```
#### Force Refresh Support

- `force_refresh=True` fetches all articles fresh from sources
- Does NOT clear the existing cache (news is immutable)
- Deduplicates against existing articles before storing (see the sketch below)
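Deduplication keys on article URL; a minimal sketch of the `_combine_and_deduplicate` helper referenced in sections 5 and 6 (the first-seen article wins; this ordering is an assumption):

```python
from typing import Any

def _combine_and_deduplicate(self, *sources: dict[str, Any] | None) -> list[dict[str, Any]]:
    """Merge article lists from multiple source payloads, keeping the first occurrence per URL."""
    seen_urls: set[str] = set()
    combined: list[dict[str, Any]] = []
    for source in sources:
        if not source:  # a failed source may be None
            continue
        for article in source.get("articles", []):
            url = article.get("url")
            if url and url not in seen_urls:
                seen_urls.add(url)
                combined.append(article)
    return combined
```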
#### Cache Invalidation Strategy

- Articles are immutable: once published, articles don't change
- Cache grows append-only: new articles are added, old ones retained
- Freshness check: re-fetch every 6 hours for new articles
- No deletion: articles are never removed from the cache
### 3. RSS Feed Processing & Article Fetching

#### GoogleNewsClient RSS Implementation

```python
import feedparser
import requests
from newspaper import Article
from datetime import date, datetime
from typing import Any, Optional
from urllib.parse import quote_plus


class GoogleNewsClient:
    """Google News RSS client following the FinnhubClient standard."""

    def __init__(self):
        self.base_rss_url = "https://news.google.com/rss"
        self.archive_base_url = "https://archive.org/wayback/available"

    def fetch_rss_feed(self, query: str, start_date: date, end_date: date) -> dict[str, Any]:
        """
        Fetch RSS feed data for news articles.

        Args:
            query: Search query or company symbol
            start_date: Start date for filtering articles
            end_date: End date for filtering articles

        Returns:
            Dict containing RSS feed articles with metadata
        """
        # Construct RSS feed URL (the query must be URL-encoded)
        rss_url = f"{self.base_rss_url}/search?q={quote_plus(query)}&hl=en-US&gl=US&ceid=US:en"

        # Parse RSS feed
        feed = feedparser.parse(rss_url)

        # Filter and structure articles
        articles = []
        for entry in feed.entries:
            # Skip entries without a parseable publication date
            if not getattr(entry, "published_parsed", None):
                continue
            pub_date = datetime(*entry.published_parsed[:6]).date()

            # Filter by date range
            if start_date <= pub_date <= end_date:
                articles.append({
                    "headline": entry.title,
                    "url": entry.link,
                    "source": entry.get("source", {}).get("title", "Google News"),
                    "date": pub_date.isoformat(),
                    "summary": entry.get("summary", ""),
                })

        return {
            "query": query,
            "period": {"start": start_date.isoformat(), "end": end_date.isoformat()},
            "articles": articles,
            "metadata": {
                "source": "google_news_rss",
                "rss_feed_url": rss_url,
                "article_count": len(articles),
            },
        }

    def fetch_article_content(self, url: str, use_archive_fallback: bool = True) -> dict[str, Any]:
        """
        Fetch full article content from a URL with Internet Archive fallback.

        Args:
            url: Article URL to fetch
            use_archive_fallback: Whether to try the Internet Archive if the direct fetch fails

        Returns:
            Dict containing article content, title, and publication date
        """
        try:
            # Try direct fetch
            article = Article(url)
            article.download()
            article.parse()
            return {
                "content": article.text,
                "title": article.title,
                "authors": article.authors,
                "publish_date": article.publish_date.isoformat() if article.publish_date else None,
                "extracted_via": "direct_fetch",
                "extraction_success": True,
            }
        except Exception as e:
            if use_archive_fallback:
                # Try Internet Archive
                archive_url = self._get_archive_url(url)
                if archive_url:
                    try:
                        article = Article(archive_url)
                        article.download()
                        article.parse()
                        return {
                            "content": article.text,
                            "title": article.title,
                            "authors": article.authors,
                            "publish_date": article.publish_date.isoformat() if article.publish_date else None,
                            "extracted_via": "internet_archive",
                            "extraction_success": True,
                        }
                    except Exception:
                        pass
            # Return failure
            return {
                "content": "",
                "title": "",
                "extracted_via": "failed",
                "extraction_success": False,
                "error": str(e),
            }

    def _get_archive_url(self, url: str) -> Optional[str]:
        """Get the closest Internet Archive snapshot URL for a given URL."""
        try:
            response = requests.get(self.archive_base_url, params={"url": url}, timeout=10)
            data = response.json()
            if data.get("archived_snapshots", {}).get("closest", {}).get("available"):
                return data["archived_snapshots"]["closest"]["url"]
        except Exception:
            pass
        return None
```
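Expected usage of the client (the query and dates are illustrative):

```python
client = GoogleNewsClient()
feed = client.fetch_rss_feed("Apple stock", date(2024, 1, 1), date(2024, 1, 31))
for article in feed["articles"]:
    # Enrich each RSS entry with the full extracted content
    article.update(client.fetch_article_content(article["url"]))
```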
### 4. LLM-Powered Sentiment Analysis

#### Sentiment Analysis Integration

```python
import json
import time

# ArticleData and SentimentScore are the existing Pydantic models noted above


class LLMSentimentAnalyzer:
    """LLM-based sentiment analyzer for financial news."""

    def __init__(self, llm_client):
        self.llm_client = llm_client
        self.sentiment_prompt = """
Analyze the sentiment of this financial news article for trading purposes.

Article:
Title: {headline}
Content: {content}

Provide your analysis in the following JSON format:
{{
    "score": <float between -1.0 (very negative) and 1.0 (very positive)>,
    "confidence": <float between 0.0 and 1.0>,
    "label": <"positive", "negative", or "neutral">,
    "reasoning": <brief explanation>,
    "key_themes": <list of key financial themes>,
    "financial_entities": <list of mentioned companies/tickers>
}}

Focus on the financial and market implications of the news.
"""

    def analyze_sentiment(self, article: ArticleData) -> SentimentScore:
        """
        Analyze article sentiment using the LLM.

        Args:
            article: Article data with headline and content

        Returns:
            SentimentScore with score, confidence, and label
        """
        # Prepare prompt
        prompt = self.sentiment_prompt.format(
            headline=article.headline,
            content=article.content[:2000],  # Limit content length
        )

        # Get LLM response
        response = self.llm_client.complete(prompt)

        # Parse response
        try:
            result = json.loads(response)

            # Convert the signed score into the positive/negative/neutral split
            score = result.get("score", 0.0)
            return SentimentScore(
                positive=max(0, score),
                negative=abs(min(0, score)),
                neutral=1.0 - abs(score),
                metadata={
                    "confidence": result.get("confidence", 0.5),
                    "label": result.get("label", "neutral"),
                    "reasoning": result.get("reasoning", ""),
                    "key_themes": result.get("key_themes", []),
                    "financial_entities": result.get("financial_entities", []),
                },
            )
        except Exception as e:
            # Return neutral sentiment on error
            return SentimentScore(
                positive=0.0,
                negative=0.0,
                neutral=1.0,
                metadata={"error": str(e)},
            )

    def batch_analyze(self, articles: list[ArticleData], batch_size: int = 5) -> list[SentimentScore]:
        """
        Batch process sentiment analysis for multiple articles.

        Args:
            articles: List of articles to analyze
            batch_size: Number of articles to process per batch

        Returns:
            List of sentiment scores corresponding to the input articles
        """
        results = []
        for i in range(0, len(articles), batch_size):
            batch = articles[i:i + batch_size]
            # Process batch (could be parallelized; see the sketch below)
            for article in batch:
                sentiment = self.analyze_sentiment(article)
                results.append(sentiment)
            # Add a small delay to respect rate limits
            time.sleep(0.1)
        return results
```
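The batch loop above is sequential. If the underlying LLM client is thread-safe (an assumption that should be verified), it could be parallelized with `concurrent.futures`; a sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def batch_analyze_parallel(
    self, articles: list[ArticleData], max_workers: int = 5
) -> list[SentimentScore]:
    """Analyze articles concurrently; executor.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(self.analyze_sentiment, articles))
```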
### 5. Date Object Conversion

#### Service Boundary Conversion

```python
# The service receives string dates from agents
def get_context(self, query: str, start_date: str, end_date: str, ...) -> NewsContext:
    # Validate date strings
    try:
        start_dt = date.fromisoformat(start_date)
        end_dt = date.fromisoformat(end_date)
    except ValueError as e:
        raise ValueError(f"Invalid date format: {e}")

    # Check date order
    if end_dt < start_dt:
        raise ValueError(f"End date {end_date} is before start date {start_date}")

    # Fetch from multiple sources
    finnhub_data = self.finnhub_client.get_company_news(symbol, start_dt, end_dt) if symbol else None
    google_rss = self.google_client.fetch_rss_feed(query, start_dt, end_dt)

    # Fetch full article content for RSS articles
    for article in google_rss.get("articles", []):
        content_data = self.google_client.fetch_article_content(article["url"])
        article.update(content_data)

    # Combine all articles
    all_articles = self._combine_and_deduplicate(finnhub_data, google_rss)

    # Perform LLM sentiment analysis
    enriched_articles = []
    for article in all_articles:
        article_data = ArticleData(**article)
        article_data.sentiment = self.sentiment_analyzer.analyze_sentiment(article_data)
        enriched_articles.append(article_data)

    # Create and return context
    return self._create_news_context(enriched_articles, start_date, end_date)
```
### 6. Error Recovery and Partial Data

```python
def handle_source_failure(
    self,
    finnhub_data: dict | None,
    google_data: dict | None,
    errors: dict[str, Exception],
) -> NewsContext:
    """
    Handle cases where one or more news sources fail.

    - If all sources fail: raise an exception
    - If some sources succeed: return partial data with metadata
    - Track content extraction failures separately
    """
    if not finnhub_data and not google_data:
        raise ValueError("All news sources failed to return data")

    # Track extraction statistics
    extraction_stats = {
        "total_articles": 0,
        "successful_extractions": 0,
        "archive_fallbacks": 0,
        "failed_extractions": 0,
    }

    # Process available articles
    all_articles = []
    successful_sources = []

    if finnhub_data:
        all_articles.extend(finnhub_data.get("articles", []))
        successful_sources.append("finnhub")

    if google_data:
        articles = google_data.get("articles", [])
        for article in articles:
            extraction_stats["total_articles"] += 1
            if article.get("extraction_success"):
                extraction_stats["successful_extractions"] += 1
                if article.get("extracted_via") == "internet_archive":
                    extraction_stats["archive_fallbacks"] += 1
            else:
                extraction_stats["failed_extractions"] += 1
        all_articles.extend(articles)
        successful_sources.append("google_news")

    metadata = {
        "sources_requested": ["finnhub", "google_news"],
        "sources_successful": successful_sources,
        "sources_failed": {source: str(error) for source, error in errors.items()},
        "extraction_stats": extraction_stats,
        "partial_data": len(successful_sources) < 2,
    }

    # Deduplicate and return context
    return self._create_context(all_articles, metadata)
```
### 7. Repository Method Bridging

```python
# Add these bridge methods to NewsRepository

def has_data_for_period(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool:
    """Bridge to the existing get_news_data method."""
    existing_data = self.get_news_data(
        symbol=symbol or query,
        start_date=start_date,
        end_date=end_date,
    )
    return len(existing_data.get("articles", [])) > 0

def get_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> dict[str, Any]:
    """Bridge to the existing get_news_data method."""
    return self.get_news_data(
        symbol=symbol or query,
        start_date=start_date,
        end_date=end_date,
    )

def store_data(self, query: str, cache_data: dict, symbol: str | None = None, overwrite: bool = False) -> bool:
    """Bridge to the existing store_news_articles method."""
    articles = cache_data.get("articles", [])
    if not articles:
        return False

    # Convert to the expected format
    news_articles = [
        NewsArticle(
            symbol=symbol or query,
            headline=a["headline"],
            summary=a.get("summary", ""),
            content=a.get("content", ""),
            url=a["url"],
            source=a["source"],
            date=a["date"],
            entities=a.get("entities", []),
            sentiment_score=a.get("sentiment", {}).get("score", 0.0),
            sentiment_metadata=a.get("sentiment", {}),
        )
        for a in articles
    ]
    return self.store_news_articles(news_articles)

def clear_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool:
    """News is append-only, so this just marks data as stale for re-fetch."""
    # Implementation depends on the repository design;
    # could update metadata to trigger a re-fetch
    return True
```
### 8. Pydantic Validation

#### Context Structure

```python
from pydantic import BaseModel, validator


class NewsContext(BaseModel):  # Pydantic model, not a dataclass, so the validators below run
    symbol: str | None
    period: dict[str, str]  # {"start": "2024-01-01", "end": "2024-01-31"}
    articles: list[ArticleData]
    sentiment_summary: SentimentScore
    article_count: int
    sources: list[str]
    metadata: dict[str, Any]

    @validator("period")
    def validate_period(cls, v):
        # Ensure start and end dates are present
        if "start" not in v or "end" not in v:
            raise ValueError("Period must have 'start' and 'end' dates")
        return v

    @validator("articles")
    def validate_articles(cls, v):
        # Ensure no duplicate URLs
        urls = [a.url for a in v]
        if len(urls) != len(set(urls)):
            raise ValueError("Duplicate articles detected")
        return v
```
## Implementation Tasks

### Phase 1: Create GoogleNewsClient

- [ ] GoogleNewsClient Implementation
  - Create `tradingagents/clients/google_news_client.py` following the FinnhubClient standard
  - Implement RSS feed parsing using the `feedparser` library
  - Add `fetch_rss_feed()` method with Google News RSS integration
  - Add `fetch_article_content()` method with `newspaper3k` and Internet Archive fallback
  - Use `date` objects for all date parameters
  - No `BaseClient` inheritance
- [ ] Article Content Extraction
  - Implement robust article content extraction using `newspaper3k`
  - Add fallback to the Internet Archive Wayback Machine for failed fetches
  - Handle paywall detection and alternative content sources
  - Extract clean text, title, publication date, and metadata
- [ ] Comprehensive Testing
  - Create a test suite for GoogleNewsClient
  - Test RSS parsing with various queries
  - Test content extraction with real and archived URLs
  - Use pytest-vcr for HTTP interaction recording
### Phase 2: Bridge NewsRepository Interface

- [ ] Repository Interface Standardization
  - Add standard service interface methods to `NewsRepository`
  - Bridge existing methods without changing the underlying storage
  - File: `tradingagents/repositories/news_repository.py`
  - Maintain backward compatibility
### Phase 3: Implement NewsService

- [ ] Service Core Implementation
  - Replace method stubs with a full implementation
  - Implement `get_context()`, `get_company_news_context()`, `get_global_news_context()`
  - Add the local-first data strategy with freshness checking
  - Replace `BaseClient` dependencies with typed clients
  - File: `tradingagents/services/news_service.py`
- [ ] LLM Sentiment Analysis Integration
  - Implement the `LLMSentimentAnalyzer` class
  - Create financial news sentiment prompts
  - Add batch processing for efficiency
  - Handle LLM rate limiting and errors
- [ ] Date Conversion and Article Processing
  - Add date validation and conversion
  - Implement the RSS article fetching pipeline
  - Add content extraction with fallback
  - Combine articles from multiple sources
  - Implement deduplication by URL
### Phase 4: Type Safety & Validation

- [ ] Comprehensive Type Checking
  - Run `mise run typecheck` - must pass with 0 errors
  - Validate all date object conversions
  - Ensure NewsContext compliance
- [ ] Enhanced Testing
  - Test RSS feed parsing edge cases
  - Test content extraction failures and fallbacks
  - Test LLM sentiment analysis with various article types
  - Test multi-source aggregation and deduplication
## Testing Scenarios

### Integration Tests

- RSS Feed Processing
  - Test with various search queries
  - Test date filtering in RSS results
  - Test handling of malformed RSS feeds
- Content Extraction
  - Test direct fetch success
  - Test Internet Archive fallback
  - Test paywall detection
  - Test extraction failure handling
- LLM Sentiment Analysis
  - Test positive news sentiment
  - Test negative earnings reports
  - Test neutral market updates
  - Test batch processing
  - Test LLM error handling
- Multi-Source Aggregation
  - Test both sources succeed
  - Test Finnhub fails, Google succeeds
  - Test Google fails, Finnhub succeeds
  - Test both sources fail
- Date Handling (see the test sketch after this list)
  - Test invalid date formats
  - Test end_date < start_date
  - Test date filtering in RSS feeds
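As one concrete example, a sketch of the RSS date-filtering test using pytest-vcr (the import path and test name are assumptions):

```python
import pytest
from datetime import date

from tradingagents.clients.google_news_client import GoogleNewsClient


@pytest.mark.vcr()  # pytest-vcr records the HTTP exchange to a cassette on first run
def test_fetch_rss_feed_filters_by_date():
    client = GoogleNewsClient()
    result = client.fetch_rss_feed("Apple stock", date(2024, 1, 1), date(2024, 1, 31))

    assert result["metadata"]["source"] == "google_news_rss"
    # Every returned article must fall inside the requested window
    # (ISO date strings compare correctly as strings)
    for article in result["articles"]:
        assert "2024-01-01" <= article["date"] <= "2024-01-31"
```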
## Success Criteria

### Functional Requirements

- ✅ Service successfully implements all placeholder methods
- ✅ GoogleNewsClient reads and parses RSS feeds correctly
- ✅ Article content extraction works with Internet Archive fallback
- ✅ LLM sentiment analysis provides structured financial sentiment
- ✅ Local-first strategy with proper freshness checking
- ✅ Multi-source aggregation with deduplication
- ✅ Returns a properly validated `NewsContext` to agents
- ✅ Force refresh fetches fresh articles without clearing the cache

### Technical Requirements

- ✅ Zero type checking errors: `mise run typecheck`
- ✅ Zero linting errors: `mise run lint`
- ✅ All tests pass with the new implementation
- ✅ No runtime errors with date conversions
- ✅ Proper error messages for validation failures

### Quality Requirements

- ✅ Strongly-typed interfaces between all components
- ✅ RSS feed parsing with robust error handling
- ✅ Article content extraction with a fallback strategy
- ✅ LLM integration with proper prompt engineering
- ✅ Efficient caching with minimal external calls
- ✅ Clear separation of concerns
## Data Architecture

### GoogleNewsClient RSS Response Format

```jsonc
{
  "query": "Apple stock",
  "period": {"start": "2024-01-01", "end": "2024-01-31"},
  "articles": [
    {
      "headline": "Apple Stock Soars on New Product Launch",
      "summary": "Brief summary from RSS feed...",
      "content": "Full article text extracted from source...",
      "url": "https://www.cnbc.com/2024/01/20/apple-stock.html",
      "source": "CNBC",
      "date": "2024-01-20",
      "authors": ["Tech Reporter"],
      "publish_date": "2024-01-20T14:30:00Z",
      "extracted_via": "direct_fetch", // or "internet_archive"
      "extraction_success": true
    }
  ],
  "metadata": {
    "source": "google_news_rss",
    "article_count": 25,
    "rss_feed_url": "https://news.google.com/rss/search?q=Apple+stock",
    "extraction_stats": {
      "successful": 22,
      "archive_fallback": 2,
      "failed": 3
    }
  }
}
```
### LLM Sentiment Analysis Response Format

```json
{
  "article_url": "https://www.cnbc.com/2024/01/20/apple-stock.html",
  "sentiment": {
    "positive": 0.7,
    "negative": 0.1,
    "neutral": 0.2,
    "metadata": {
      "score": 0.7,
      "confidence": 0.85,
      "label": "positive",
      "reasoning": "Article discusses positive earnings and growth outlook",
      "key_themes": ["earnings_beat", "product_launch", "revenue_growth"],
      "financial_entities": ["AAPL", "Apple Inc.", "iPhone 15"]
    }
  }
}
```
### Aggregate Sentiment Summary

```jsonc
{
  "sentiment_summary": {
    "positive": 0.65, // average across all articles
    "negative": 0.20,
    "neutral": 0.15,
    "metadata": {
      "dominant_sentiment": "positive",
      "confidence": 0.82,
      "article_count": 25,
      "themes": {
        "earnings": 8,
        "product_launch": 5,
        "market_analysis": 12
      }
    }
  }
}
```
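A sketch of how this summary could be computed from the per-article `SentimentScore`s, assuming simple averaging (confidence-weighted averaging is an alternative; the helper name is illustrative):

```python
from collections import Counter

def summarize_sentiment(scores: list[SentimentScore]) -> SentimentScore:
    """Average per-article sentiment into the aggregate summary shown above."""
    n = len(scores) or 1  # avoid division by zero on an empty list
    theme_counts = Counter(
        theme for s in scores for theme in s.metadata.get("key_themes", [])
    )
    avg = {
        "positive": sum(s.positive for s in scores) / n,
        "negative": sum(s.negative for s in scores) / n,
        "neutral": sum(s.neutral for s in scores) / n,
    }
    return SentimentScore(
        **avg,
        metadata={
            "dominant_sentiment": max(avg, key=avg.get),
            "confidence": sum(s.metadata.get("confidence", 0.5) for s in scores) / n,
            "article_count": len(scores),
            "themes": dict(theme_counts),
        },
    )
```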
## Dependencies

### Components to Create

- ⏳ `GoogleNewsClient` - Full implementation with RSS and content extraction
- ⏳ `LLMSentimentAnalyzer` - LLM integration for sentiment analysis
- ⏳ `NewsService` - Replace stubs with full implementation

### Existing Components

- ✅ `FinnhubClient` with company news using date objects
- ✅ `NewsRepository` with dataclass storage
- ✅ `NewsContext` and related Pydantic models

### Required Libraries

- `feedparser` - RSS feed parsing
- `newspaper3k` - Article content extraction
- `requests` - HTTP requests and the Internet Archive API
- `beautifulsoup4` - HTML parsing fallback
- LLM client library (OpenAI, Anthropic, etc.)
## Timeline

### Immediate (Phase 1)
- Create GoogleNewsClient with RSS and content extraction
- Implement feedparser integration
- Add Internet Archive fallback
- Create comprehensive test suite
### Phase 2-3
- Add repository bridge methods
- Implement full NewsService
- Integrate LLM sentiment analysis
- Handle multi-source aggregation
### Phase 4
- Type checking and validation
- Integration testing
- Performance optimization
- Documentation
## Acceptance Criteria

### Must Have

- Type Safety: Service passes `mise run typecheck` with zero errors
- RSS Integration: Successfully parse Google News RSS feeds
- Content Extraction: Extract full articles with fallback
- LLM Sentiment: Financial sentiment analysis for all articles
- Service Implementation: All stubs replaced with working code
- Local-First: Check the cache before fetching new data
- Multi-Source: Aggregate Finnhub and Google News

### Should Have

- Extraction Stats: Track success/failure rates
- Batch Processing: Efficient LLM sentiment analysis
- Force Refresh: Fetch new articles on demand
- Error Recovery: Handle partial failures gracefully

### Nice to Have

- Additional Sources: Support more news providers
- Real-time Monitoring: WebSocket for breaking news
- Advanced Extraction: Handle PDFs, videos
- Sentiment Trends: Track sentiment over time
This PRD focuses on completing the currently empty NewsService with a full implementation including RSS feed integration, article content extraction with Internet Archive fallback, and LLM-powered sentiment analysis for financial news.