Product Requirements Document: NewsService Completion

Overview

Complete the NewsService to provide strongly-typed news data and sentiment analysis to trading agents using a local-first data strategy with RSS feed integration, article content extraction, and LLM-powered sentiment analysis.

Current State Analysis

Issues to Fix

  • CRITICAL: Service is currently an empty placeholder with only method stubs
  • CRITICAL: Need to implement GoogleNewsClient to read RSS feeds
  • CRITICAL: Need RSS article fetching with fallback to Internet Archive
  • CRITICAL: Need LLM-powered sentiment analysis integration
  • CRITICAL: Service uses BaseClient inheritance instead of typed clients
  • CRITICAL: NewsRepository has different interface than service expectations
  • Missing strongly-typed interfaces between components
  • No concrete approach for article content extraction

What Works

  • NewsContext and ArticleData Pydantic models for agent consumption
  • SentimentScore model for structured sentiment data
  • FinnhubClient with get_company_news() method using date objects
  • NewsRepository with dataclass-based storage and deduplication
  • Service structure placeholder ready for implementation

Technical Requirements

1. Strongly-Typed Interfaces

Client → Service Interface

# FinnhubClient methods (already implemented)
def get_company_news(symbol: str, start_date: date, end_date: date) -> dict[str, Any]

# GoogleNewsClient methods (to be implemented)
def fetch_rss_feed(query: str, start_date: date, end_date: date) -> dict[str, Any]
def fetch_article_content(url: str, use_archive_fallback: bool = True) -> dict[str, Any]
def get_company_news(symbol: str, start_date: date, end_date: date) -> dict[str, Any]
def get_global_news(start_date: date, end_date: date, categories: list[str]) -> dict[str, Any]

Service → Repository Interface

# NewsRepository methods (to be implemented/bridged)
def has_data_for_period(query: str, start_date: str, end_date: str, symbol: str | None) -> bool
def get_data(query: str, start_date: str, end_date: str, symbol: str | None) -> dict[str, Any]
def store_data(query: str, cache_data: dict, symbol: str | None, overwrite: bool) -> bool
def clear_data(query: str, start_date: str, end_date: str, symbol: str | None) -> bool

Service → Agent Interface

# Service output (already defined)
def get_context(query: str, start_date: str, end_date: str, symbol: str | None, sources: list[str], force_refresh: bool) -> NewsContext

2. Local-First Data Strategy

Flow

  1. Repository Lookup: Check NewsRepository.has_data_for_period()
  2. Freshness Check: Determine if cache needs updating (news is append-only)
  3. RSS Feed Fetching: Fetch RSS feeds from Google News
  4. Content Extraction: Extract full article content with Internet Archive fallback
  5. LLM Analysis: Perform sentiment analysis using LLM
  6. Cache Updates: Store enriched articles via repository.store_data()
  7. Context Assembly: Return validated NewsContext (flow sketched below)
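
A minimal orchestration sketch of steps 1-7, assuming the client, repository, and analyzer attributes described in the sections that follow; helper names like _context_from_cache and _create_news_context are placeholders, not a final API:

def _get_context_local_first(
    self, query: str, start_dt: date, end_dt: date,
    symbol: str | None, force_refresh: bool,
) -> NewsContext:
    start_s, end_s = start_dt.isoformat(), end_dt.isoformat()

    # Steps 1-2: repository lookup and freshness check
    if not force_refresh and self.repository.has_data_for_period(query, start_s, end_s, symbol):
        cached = self.repository.get_data(query, start_s, end_s, symbol)
        last_fetch = cached.get("metadata", {}).get("last_fetch_time")
        if last_fetch and not self.should_fetch_new_articles(
            datetime.fromisoformat(last_fetch), datetime.now()
        ):
            return self._context_from_cache(cached)  # placeholder helper

    # Steps 3-4: fetch RSS entries, then extract full article content
    rss = self.google_client.fetch_rss_feed(query, start_dt, end_dt)
    for raw in rss["articles"]:
        raw.update(self.google_client.fetch_article_content(raw["url"]))

    # Step 5: LLM sentiment analysis, attached per article
    articles = [ArticleData(**raw) for raw in rss["articles"]]
    for article, sentiment in zip(articles, self.sentiment_analyzer.batch_analyze(articles)):
        article.sentiment = sentiment

    # Steps 6-7: append enriched articles to the cache, assemble the context
    self.repository.store_data(query, rss, symbol, overwrite=False)
    return self._create_news_context(articles, start_s, end_s)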

News-Specific Gap Detection

def should_fetch_new_articles(self, last_fetch_time: datetime | None, current_time: datetime) -> bool:
    """
    News doesn't have "gaps" - it's append-only. Check if enough time passed for new articles.

    Returns True if:
    - Last fetch was more than 6 hours ago
    - User requested force_refresh
    - No data exists for the query/period
    """
    if not last_fetch_time:
        return True

    hours_since_fetch = (current_time - last_fetch_time).total_seconds() / 3600
    return hours_since_fetch >= 6  # Fetch new articles every 6 hours

Force Refresh Support

  • force_refresh=True fetches all articles fresh from sources
  • Does NOT clear existing cache (news is immutable)
  • Deduplicates against existing articles before storing (see the sketch below)
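
A minimal sketch of that dedupe-before-store step, assuming articles are dicts keyed by url:

def _deduplicate_against_cache(self, fresh: list[dict], cached: list[dict]) -> list[dict]:
    """Drop freshly fetched articles whose URL already exists in the cache."""
    seen_urls = {a["url"] for a in cached}
    return [a for a in fresh if a["url"] not in seen_urls]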

Cache Invalidation Strategy

  • Articles are immutable: Once published, articles don't change
  • Cache grows append-only: New articles are added, old ones retained
  • Freshness check: Re-fetch every 6 hours for new articles
  • No deletion: Articles are never removed from cache

3. RSS Feed Processing & Article Fetching

GoogleNewsClient RSS Implementation

import feedparser
import requests
from datetime import date, datetime
from newspaper import Article
from typing import Any
from urllib.parse import quote_plus

class GoogleNewsClient:
    """Google News RSS client following FinnhubClient standard."""

    def __init__(self):
        self.base_rss_url = "https://news.google.com/rss"
        self.archive_base_url = "https://archive.org/wayback/available"

    def fetch_rss_feed(self, query: str, start_date: date, end_date: date) -> dict[str, Any]:
        """
        Fetch RSS feed data for news articles.

        Args:
            query: Search query or company symbol
            start_date: Start date for filtering articles
            end_date: End date for filtering articles

        Returns:
            Dict containing RSS feed articles with metadata
        """
        # Construct RSS feed URL (URL-encode the query so multi-word searches work)
        rss_url = f"{self.base_rss_url}/search?q={quote_plus(query)}&hl=en-US&gl=US&ceid=US:en"

        # Parse RSS feed
        feed = feedparser.parse(rss_url)

        # Filter and structure articles
        articles = []
        for entry in feed.entries:
            # Skip entries without a parseable publication date
            if not getattr(entry, "published_parsed", None):
                continue
            pub_date = datetime(*entry.published_parsed[:6]).date()

            # Filter by date range
            if start_date <= pub_date <= end_date:
                articles.append({
                    "headline": entry.title,
                    "url": entry.link,
                    "source": entry.get("source", {}).get("title", "Google News"),
                    "date": pub_date.isoformat(),
                    "summary": entry.get("summary", ""),
                })

        return {
            "query": query,
            "period": {"start": start_date.isoformat(), "end": end_date.isoformat()},
            "articles": articles,
            "metadata": {
                "source": "google_news_rss",
                "rss_feed_url": rss_url,
                "article_count": len(articles)
            }
        }

    def fetch_article_content(self, url: str, use_archive_fallback: bool = True) -> dict[str, Any]:
        """
        Fetch full article content from URL with Internet Archive fallback.

        Args:
            url: Article URL to fetch
            use_archive_fallback: Whether to try Internet Archive if direct fetch fails

        Returns:
            Dict containing article content, title, publication date
        """
        try:
            # Try direct fetch
            article = Article(url)
            article.download()
            article.parse()

            return {
                "content": article.text,
                "title": article.title,
                "authors": article.authors,
                "publish_date": article.publish_date.isoformat() if article.publish_date else None,
                "extracted_via": "direct_fetch",
                "extraction_success": True
            }

        except Exception as e:
            if use_archive_fallback:
                # Try Internet Archive
                archive_url = self._get_archive_url(url)
                if archive_url:
                    try:
                        article = Article(archive_url)
                        article.download()
                        article.parse()

                        return {
                            "content": article.text,
                            "title": article.title,
                            "authors": article.authors,
                            "publish_date": article.publish_date.isoformat() if article.publish_date else None,
                            "extracted_via": "internet_archive",
                            "extraction_success": True
                        }
                    except Exception:
                        pass

            # Return failure
            return {
                "content": "",
                "title": "",
                "extracted_via": "failed",
                "extraction_success": False,
                "error": str(e)
            }

    def _get_archive_url(self, url: str) -> str | None:
        """Get Internet Archive URL for a given URL."""
        try:
            response = requests.get(f"{self.archive_base_url}?url={url}", timeout=10)
            data = response.json()
            if data.get("archived_snapshots", {}).get("closest", {}).get("available"):
                return data["archived_snapshots"]["closest"]["url"]
        except Exception:
            pass
        return None
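
A short usage sketch (this makes live network calls; the query and dates are illustrative):

client = GoogleNewsClient()
feed = client.fetch_rss_feed("Apple stock", date(2024, 1, 1), date(2024, 1, 31))
for item in feed["articles"][:3]:
    # Enrich each RSS entry with full extracted content
    content = client.fetch_article_content(item["url"])
    print(item["headline"], "->", content["extracted_via"])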

4. LLM-Powered Sentiment Analysis

Sentiment Analysis Integration

import json
import time

class LLMSentimentAnalyzer:
    """LLM-based sentiment analyzer for financial news."""

    def __init__(self, llm_client):
        self.llm_client = llm_client
        self.sentiment_prompt = """
        Analyze the sentiment of this financial news article for trading purposes.

        Article:
        Title: {headline}
        Content: {content}

        Provide your analysis in the following JSON format:
        {{
            "score": <float between -1.0 (very negative) and 1.0 (very positive)>,
            "confidence": <float between 0.0 and 1.0>,
            "label": <"positive", "negative", or "neutral">,
            "reasoning": <brief explanation>,
            "key_themes": <list of key financial themes>,
            "financial_entities": <list of mentioned companies/tickers>
        }}

        Focus on the financial and market implications of the news.
        """

    def analyze_sentiment(self, article: ArticleData) -> SentimentScore:
        """
        Analyze article sentiment using LLM.

        Args:
            article: Article data with headline and content

        Returns:
            SentimentScore with score, confidence, and label
        """
        # Prepare prompt
        prompt = self.sentiment_prompt.format(
            headline=article.headline,
            content=article.content[:2000]  # Limit content length
        )

        # Get LLM response
        response = self.llm_client.complete(prompt)

        # Parse response
        try:
            result = json.loads(response)

            # Convert to SentimentScore
            score = result.get("score", 0.0)
            return SentimentScore(
                positive=max(0, score),
                negative=abs(min(0, score)),
                neutral=1.0 - abs(score),
                metadata={
                    "confidence": result.get("confidence", 0.5),
                    "label": result.get("label", "neutral"),
                    "reasoning": result.get("reasoning", ""),
                    "key_themes": result.get("key_themes", []),
                    "financial_entities": result.get("financial_entities", [])
                }
            )
        except Exception as e:
            # Return neutral sentiment on error
            return SentimentScore(
                positive=0.0,
                negative=0.0,
                neutral=1.0,
                metadata={"error": str(e)}
            )

    def batch_analyze(self, articles: list[ArticleData], batch_size: int = 5) -> list[SentimentScore]:
        """
        Batch process sentiment analysis for multiple articles.

        Args:
            articles: List of articles to analyze
            batch_size: Number of articles to process in parallel

        Returns:
            List of sentiment scores corresponding to input articles
        """
        results = []

        for i in range(0, len(articles), batch_size):
            batch = articles[i:i + batch_size]

            # Process batch (could be parallelized)
            for article in batch:
                sentiment = self.analyze_sentiment(article)
                results.append(sentiment)

                # Add small delay to respect rate limits
                time.sleep(0.1)

        return results
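
batch_analyze above is sequential. If the LLM client is thread-safe (an assumption, not a requirement of this PRD), a hedged parallel variant could use a thread pool; note the per-call delay is dropped, so rate limiting would need to move into the client:

from concurrent.futures import ThreadPoolExecutor

def batch_analyze_parallel(
    self, articles: list[ArticleData], max_workers: int = 5
) -> list[SentimentScore]:
    """Analyze articles concurrently; executor.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(self.analyze_sentiment, articles))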

5. Date Object Conversion

Service Boundary Conversion

# Service receives string dates from agents
def get_context(self, query: str, start_date: str, end_date: str, ...) -> NewsContext:
    # Validate date strings
    try:
        start_dt = date.fromisoformat(start_date)
        end_dt = date.fromisoformat(end_date)
    except ValueError as e:
        raise ValueError(f"Invalid date format: {e}") from e

    # Check date order
    if end_dt < start_dt:
        raise ValueError(f"End date {end_date} is before start date {start_date}")

    # Fetch from multiple sources
    finnhub_data = self.finnhub_client.get_company_news(symbol, start_dt, end_dt) if symbol else None
    google_rss = self.google_client.fetch_rss_feed(query, start_dt, end_dt)

    # Fetch full article content for RSS articles
    for article in google_rss.get('articles', []):
        content_data = self.google_client.fetch_article_content(article['url'])
        article.update(content_data)

    # Combine all articles
    all_articles = self._combine_and_deduplicate(finnhub_data, google_rss)

    # Perform LLM sentiment analysis
    enriched_articles = []
    for article in all_articles:
        article_data = ArticleData(**article)
        article_data.sentiment = self.sentiment_analyzer.analyze_sentiment(article_data)
        enriched_articles.append(article_data)

    # Create and return context
    return self._create_news_context(enriched_articles, start_date, end_date)
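
The _combine_and_deduplicate helper referenced above is not specified elsewhere in this PRD; a minimal URL-keyed sketch:

def _combine_and_deduplicate(self, finnhub_data: dict | None, google_rss: dict) -> list[dict]:
    """Merge articles from both sources, keeping the first occurrence of each URL."""
    combined: dict[str, dict] = {}
    for source_data in (finnhub_data, google_rss):
        if not source_data:
            continue
        for article in source_data.get("articles", []):
            combined.setdefault(article["url"], article)
    return list(combined.values())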

6. Error Recovery and Partial Data

def handle_source_failure(
    self,
    finnhub_data: dict | None,
    google_data: dict | None,
    errors: dict[str, Exception]
) -> NewsContext:
    """
    Handle cases where one or more news sources fail.

    - If all sources fail: Raise exception
    - If some sources succeed: Return partial data with metadata
    - Track content extraction failures separately
    """
    if not finnhub_data and not google_data:
        raise ValueError("All news sources failed to return data")

    # Track extraction statistics
    extraction_stats = {
        "total_articles": 0,
        "successful_extractions": 0,
        "archive_fallbacks": 0,
        "failed_extractions": 0
    }

    # Process available articles
    all_articles = []
    successful_sources = []

    if finnhub_data:
        all_articles.extend(finnhub_data.get('articles', []))
        successful_sources.append('finnhub')

    if google_data:
        articles = google_data.get('articles', [])
        for article in articles:
            extraction_stats["total_articles"] += 1
            if article.get("extraction_success"):
                extraction_stats["successful_extractions"] += 1
                if article.get("extracted_via") == "internet_archive":
                    extraction_stats["archive_fallbacks"] += 1
            else:
                extraction_stats["failed_extractions"] += 1

        all_articles.extend(articles)
        successful_sources.append('google_news')

    metadata = {
        "sources_requested": ["finnhub", "google_news"],
        "sources_successful": successful_sources,
        "sources_failed": {source: str(error) for source, error in errors.items()},
        "extraction_stats": extraction_stats,
        "partial_data": len(successful_sources) < 2
    }

    # Deduplicate and return context
    return self._create_context(all_articles, metadata)

7. Repository Method Bridging

# Add these bridge methods to NewsRepository
def has_data_for_period(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool:
    """Bridge to existing get_news_data method."""
    existing_data = self.get_news_data(
        symbol=symbol or query,
        start_date=start_date,
        end_date=end_date
    )
    return len(existing_data.get('articles', [])) > 0

def get_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> dict[str, Any]:
    """Bridge to existing get_news_data method."""
    return self.get_news_data(
        symbol=symbol or query,
        start_date=start_date,
        end_date=end_date
    )

def store_data(self, query: str, cache_data: dict, symbol: str | None = None, overwrite: bool = False) -> bool:
    """Bridge to existing store_news_articles method."""
    articles = cache_data.get('articles', [])
    if not articles:
        return False

    # Convert to expected format
    news_articles = [
        NewsArticle(
            symbol=symbol or query,
            headline=a['headline'],
            summary=a.get('summary', ''),
            content=a.get('content', ''),
            url=a['url'],
            source=a['source'],
            date=a['date'],
            entities=a.get('entities', []),
            sentiment_score=a.get('sentiment', {}).get('score', 0.0),
            sentiment_metadata=a.get('sentiment', {})
        )
        for a in articles
    ]

    return self.store_news_articles(news_articles)

def clear_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool:
    """News is append-only, so this just marks data as stale for re-fetch."""
    # Implementation depends on repository design
    # Could update metadata to trigger re-fetch
    return True

8. Pydantic Validation

Context Structure

from pydantic import BaseModel, validator

class NewsContext(BaseModel):
    symbol: str | None
    period: dict[str, str]  # {"start": "2024-01-01", "end": "2024-01-31"}
    articles: list[ArticleData]
    sentiment_summary: SentimentScore
    article_count: int
    sources: list[str]
    metadata: dict[str, Any]

    @validator('period')
    def validate_period(cls, v):
        # Ensure start and end dates are present and valid
        if 'start' not in v or 'end' not in v:
            raise ValueError("Period must have 'start' and 'end' dates")
        return v

    @validator('articles')
    def validate_articles(cls, v):
        # Ensure no duplicate URLs
        urls = [a.url for a in v]
        if len(urls) != len(set(urls)):
            raise ValueError("Duplicate articles detected")
        return v
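
A short example of how these validators surface bad input; the field values are illustrative:

from pydantic import ValidationError

try:
    NewsContext(
        symbol="AAPL",
        period={"start": "2024-01-01"},  # missing "end" -> validate_period raises
        articles=[],
        sentiment_summary=SentimentScore(positive=0.0, negative=0.0, neutral=1.0, metadata={}),
        article_count=0,
        sources=["google_news"],
        metadata={},
    )
except ValidationError as exc:
    print(exc)  # includes: Period must have 'start' and 'end' dates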

Implementation Tasks

Phase 1: Create GoogleNewsClient

  1. GoogleNewsClient Implementation

    • Create tradingagents/clients/google_news_client.py following FinnhubClient standard
    • Implement RSS feed parsing using feedparser library
    • Add fetch_rss_feed() method with Google News RSS integration
    • Add fetch_article_content() method with newspaper3k and Internet Archive fallback
    • Use date objects for all date parameters
    • No BaseClient inheritance
  2. Article Content Extraction

    • Implement robust article content extraction using newspaper3k
    • Add fallback to Internet Archive Wayback Machine for failed fetches
    • Handle paywall detection and alternative content sources
    • Extract clean text, title, publication date, and metadata
  3. Comprehensive Testing

    • Create test suite for GoogleNewsClient
    • Test RSS parsing with various queries
    • Test content extraction with real and archived URLs
    • Use pytest-vcr for HTTP interaction recording (see the sketch after this list)
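
A hedged test sketch using pytest-vcr's vcr marker; the import path assumes the module location proposed in task 1:

from datetime import date

import pytest

from tradingagents.clients.google_news_client import GoogleNewsClient

@pytest.mark.vcr
def test_fetch_rss_feed_returns_articles():
    # First run records the HTTP exchange to a cassette; later runs replay it
    client = GoogleNewsClient()
    result = client.fetch_rss_feed("AAPL", date(2024, 1, 1), date(2024, 1, 31))
    assert result["metadata"]["source"] == "google_news_rss"
    assert isinstance(result["articles"], list)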

Phase 2: Bridge NewsRepository Interface

  1. Repository Interface Standardization
    • Add standard service interface methods to NewsRepository
    • Bridge existing methods without changing underlying storage
    • File: tradingagents/repositories/news_repository.py
    • Maintain backward compatibility

Phase 3: Implement NewsService

  1. Service Core Implementation

    • Replace method stubs with full implementation
    • Implement get_context(), get_company_news_context(), get_global_news_context()
    • Add local-first data strategy with freshness checking
    • Replace BaseClient dependencies with typed clients
    • File: tradingagents/services/news_service.py
  2. LLM Sentiment Analysis Integration

    • Implement LLMSentimentAnalyzer class
    • Create financial news sentiment prompts
    • Add batch processing for efficiency
    • Handle LLM rate limiting and errors
  3. Date Conversion and Article Processing

    • Add date validation and conversion
    • Implement RSS article fetching pipeline
    • Add content extraction with fallback
    • Combine articles from multiple sources
    • Implement deduplication by URL

Phase 4: Type Safety & Validation

  1. Comprehensive Type Checking

    • Run mise run typecheck - must pass with 0 errors
    • Validate all date object conversions
    • Ensure NewsContext compliance
  2. Enhanced Testing

    • Test RSS feed parsing edge cases
    • Test content extraction failures and fallbacks
    • Test LLM sentiment analysis with various article types
    • Test multi-source aggregation and deduplication

Testing Scenarios

Integration Tests

  1. RSS Feed Processing

    • Test with various search queries
    • Test date filtering in RSS results
    • Test handling of malformed RSS feeds
  2. Content Extraction

    • Test direct fetch success
    • Test Internet Archive fallback
    • Test paywall detection
    • Test extraction failure handling
  3. LLM Sentiment Analysis

    • Test positive news sentiment
    • Test negative earnings reports
    • Test neutral market updates
    • Test batch processing
    • Test LLM error handling
  4. Multi-Source Aggregation

    • Test both sources succeed
    • Test Finnhub fails, Google succeeds
    • Test Google fails, Finnhub succeeds
    • Test both sources fail
  5. Date Handling

    • Test invalid date formats
    • Test end_date < start_date
    • Test date filtering in RSS feeds (parametrized sketch below)
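
The date-handling cases could be parametrized as below; news_service here is a hypothetical fixture wired with stub clients:

import pytest

@pytest.mark.parametrize("start, end", [
    ("2024-13-01", "2024-01-31"),  # invalid month
    ("not-a-date", "2024-01-31"),  # malformed string
    ("2024-01-31", "2024-01-01"),  # end before start
])
def test_get_context_rejects_bad_dates(news_service, start, end):
    with pytest.raises(ValueError):
        news_service.get_context(
            query="AAPL", start_date=start, end_date=end,
            symbol=None, sources=["google_news"], force_refresh=False,
        )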

Success Criteria

Functional Requirements

  • Service successfully implements all placeholder methods
  • GoogleNewsClient reads and parses RSS feeds correctly
  • Article content extraction works with Internet Archive fallback
  • LLM sentiment analysis provides structured financial sentiment
  • Local-first strategy with proper freshness checking
  • Multi-source aggregation with deduplication
  • Returns properly validated NewsContext to agents
  • Force refresh fetches fresh articles without clearing cache

Technical Requirements

  • Zero type checking errors: mise run typecheck
  • Zero linting errors: mise run lint
  • All tests pass with new implementation
  • No runtime errors with date conversions
  • Proper error messages for validation failures

Quality Requirements

  • Strongly-typed interfaces between all components
  • RSS feed parsing with robust error handling
  • Article content extraction with fallback strategy
  • LLM integration with proper prompt engineering
  • Efficient caching with minimal external calls
  • Clear separation of concerns

Data Architecture

GoogleNewsClient RSS Response Format

{
    "query": "Apple stock",
    "period": {"start": "2024-01-01", "end": "2024-01-31"},
    "articles": [
        {
            "headline": "Apple Stock Soars on New Product Launch",
            "summary": "Brief summary from RSS feed...",
            "content": "Full article text extracted from source...",
            "url": "https://www.cnbc.com/2024/01/20/apple-stock.html",
            "source": "CNBC",
            "date": "2024-01-20",
            "authors": ["Tech Reporter"],
            "publish_date": "2024-01-20T14:30:00Z",
            "extracted_via": "direct_fetch",  # or "internet_archive"
            "extraction_success": true
        }
    ],
    "metadata": {
        "source": "google_news_rss",
        "article_count": 25,
        "rss_feed_url": "https://news.google.com/rss/search?q=Apple+stock",
        "extraction_stats": {
            "successful": 22,
            "archive_fallback": 2,
            "failed": 3
        }
    }
}

LLM Sentiment Analysis Response Format

{
    "article_url": "https://www.cnbc.com/2024/01/20/apple-stock.html",
    "sentiment": {
        "positive": 0.7,
        "negative": 0.1,
        "neutral": 0.2,
        "metadata": {
            "score": 0.7,
            "confidence": 0.85,
            "label": "positive",
            "reasoning": "Article discusses positive earnings and growth outlook",
            "key_themes": ["earnings_beat", "product_launch", "revenue_growth"],
            "financial_entities": ["AAPL", "Apple Inc.", "iPhone 15"]
        }
    }
}

Aggregate Sentiment Summary

{
    "sentiment_summary": {
        "positive": 0.65,  # Average across all articles
        "negative": 0.20,
        "neutral": 0.15,
        "metadata": {
            "dominant_sentiment": "positive",
            "confidence": 0.82,
            "article_count": 25,
            "themes": {
                "earnings": 8,
                "product_launch": 5,
                "market_analysis": 12
            }
        }
    }
}
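
A minimal sketch of how per-article scores could be averaged into this summary; the helper name and metadata fields are assumptions:

def _summarize_sentiment(self, scores: list[SentimentScore]) -> SentimentScore:
    """Average per-article sentiment into a single aggregate score."""
    if not scores:
        return SentimentScore(positive=0.0, negative=0.0, neutral=1.0, metadata={})
    n = len(scores)
    return SentimentScore(
        positive=sum(s.positive for s in scores) / n,
        negative=sum(s.negative for s in scores) / n,
        neutral=sum(s.neutral for s in scores) / n,
        metadata={"article_count": n},
    )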

Dependencies

Components to Create

  • GoogleNewsClient - Full implementation with RSS and content extraction
  • LLMSentimentAnalyzer - LLM integration for sentiment analysis
  • NewsService - Replace stubs with full implementation

Existing Components

  • FinnhubClient with company news using date objects
  • NewsRepository with dataclass storage
  • NewsContext and related Pydantic models

Required Libraries

  • feedparser - RSS feed parsing
  • newspaper3k - Article content extraction
  • requests - HTTP requests and Internet Archive API
  • beautifulsoup4 - HTML parsing fallback
  • LLM client library (OpenAI, Anthropic, etc.)

Timeline

Immediate (Phase 1)

  • Create GoogleNewsClient with RSS and content extraction
  • Implement feedparser integration
  • Add Internet Archive fallback
  • Create comprehensive test suite

Phase 2-3

  • Add repository bridge methods
  • Implement full NewsService
  • Integrate LLM sentiment analysis
  • Handle multi-source aggregation

Phase 4

  • Type checking and validation
  • Integration testing
  • Performance optimization
  • Documentation

Acceptance Criteria

Must Have

  1. Type Safety: Service passes mise run typecheck with zero errors
  2. RSS Integration: Successfully parse Google News RSS feeds
  3. Content Extraction: Extract full articles with fallback
  4. LLM Sentiment: Financial sentiment analysis for all articles
  5. Service Implementation: All stubs replaced with working code
  6. Local-First: Check cache before fetching new data
  7. Multi-Source: Aggregate Finnhub and Google News

Should Have

  1. Extraction Stats: Track success/failure rates
  2. Batch Processing: Efficient LLM sentiment analysis
  3. Force Refresh: Fetch new articles on demand
  4. Error Recovery: Handle partial failures gracefully

Nice to Have

  1. Additional Sources: Support more news providers
  2. Real-time Monitoring: WebSocket for breaking news
  3. Advanced Extraction: Handle PDFs, videos
  4. Sentiment Trends: Track sentiment over time

This PRD focuses on completing the currently empty NewsService with a full implementation including RSS feed integration, article content extraction with Internet Archive fallback, and LLM-powered sentiment analysis for financial news.