News Service PRD
Executive Summary
The News Service feature will provide up-to-date news sentiment analysis for stock market tickers to the TradingAgents framework. This service will enable agents to make more informed trading decisions based on current market news and sentiment.
Requirements
Target Users
- Trading Agents (News Analyst, Researchers, Trader Agent, Risk Management team)
- Cron Job system for daily updates
Problem Statement
Agents need up-to-date news sentiment when analyzing the stock market to make better trading decisions. Currently, they may be missing important news events or experiencing delays in sentiment analysis that could impact trading performance.
Success Metrics
- Impact on trading decision quality
User Stories
- As the Cron Job, I want to update and store the news with sentiment analysis for a ticker each day
- As a Trading Agent, I want to retrieve the news with sentiment analysis for a ticker and a day from the database
Out of Scope (v1)
- Real-time news streaming (vs daily updates)
- Multi-language news support
- Historical news sentiment analysis beyond a certain date range
- News source ranking or weighting
- Advanced filtering options
Timeline
MVP in 1 week
Status
✅ Requirements Complete | ✅ Technical Design Complete | 🔄 Implementation In Progress
Technical Design
Architecture
- The `NewsService` will be the central component, orchestrating the fetching, scraping, analysis, and storage of news articles (a sketch of this flow follows the list).
- It will utilize the existing `GoogleNewsClient` to fetch RSS feeds from Google News.
- The `ArticleScraperClient` will be enhanced to scrape full article content with robust fallback strategies:
  - Direct Fetch: Primary method using the `newspaper3k` library for content extraction
  - Archive Fallback: Internet Archive Wayback Machine fallback for failed fetches
  - Content Extraction: Clean text, title, publication date, and metadata extraction
  - Paywall Detection: Handle paywall-protected content gracefully
- A new `SentimentAnalysisService` will be created to handle the interaction with the configured LLM for structured sentiment analysis.
- The `NewsRepository` will store the news articles along with their sentiment scores in the existing file-based database.
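For concreteness, here is a minimal sketch of the orchestration under the interfaces described in this document; the exact method names on the clients (`get_company_news`, `scrape_article`, `save_articles`) are illustrative assumptions, not final signatures:

# Hedged sketch of the NewsService daily-update flow; component names follow
# this document, but the signatures are assumptions.
class NewsService:
    def __init__(self, news_client, scraper, sentiment_service, repository):
        self.news_client = news_client              # GoogleNewsClient
        self.scraper = scraper                      # ArticleScraperClient
        self.sentiment_service = sentiment_service  # SentimentAnalysisService
        self.repository = repository                # NewsRepository

    def update_company_news(self, symbol: str, date: str) -> None:
        # 1. Fetch RSS entries for the ticker
        entries = self.news_client.get_company_news(symbol)
        articles = []
        for entry in entries:
            # 2. Scrape full content (direct fetch, then archive fallback)
            article = self.scraper.scrape_article(entry.url)
            if article is None:
                continue  # paywall or scrape failure; skip gracefully
            # 3. Structured sentiment via the configured LLM
            article.sentiment = self.sentiment_service.get_sentiment(article.content)
            articles.append(article)
        # 4. Persist with deduplication by URL
        self.repository.save_articles(symbol, date, articles)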
Implementation Components
- Backend:
  - `tradingagents/domains/news/news_service.py`:
    - A new private method `_get_sentiment_for_article` will be added to call the `SentimentAnalysisService`.
    - The `update_company_news` method will be modified to call this new method for each scraped article.
    - `_calculate_sentiment_summary` will be updated to aggregate the new structured sentiment scores.
    - Update to work with the SQLAlchemy-based `NewsRepository` instead of file-based storage.
  - `tradingagents/domains/news/repository.py` (Enhanced with Compatibility Layer):
    - Replace file-based storage with SQLAlchemy ORM operations
    - Backward Compatibility: Maintain the existing interface with an adapter pattern
    - Implement new methods: `save_articles()`, `get_articles_by_symbol()`, `get_articles_by_date_range()` (a sketch of this interface follows the list)
    - Add transaction management and connection pooling
    - Include duplicate detection using URL uniqueness constraints
    - Add batch operations for efficient bulk inserts
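A minimal sketch of that repository interface, assuming the SQLAlchemy models and `ArticleData` compatibility layer defined below (the session-factory wiring and exact signatures are assumptions):

# Hedged sketch of the SQLAlchemy-backed NewsRepository.
from datetime import date
from typing import List

class NewsRepository:
    def __init__(self, session_factory):
        self.session_factory = session_factory  # e.g. sessionmaker(bind=engine)

    def save_articles(self, articles: List["ArticleData"]) -> int:
        """Bulk insert articles, skipping duplicates by URL."""
        saved = 0
        with self.session_factory() as session:
            for item in articles:
                exists = session.query(NewsArticle).filter_by(url=item.url).first()
                if exists:
                    continue  # duplicate detection via the unique URL constraint
                session.add(item.to_db_model(session))
                saved += 1
            session.commit()
        return saved

    def get_articles_by_symbol(self, symbol: str, day: date) -> List["ArticleData"]:
        """Fetch one day's articles for a ticker, converted for existing callers."""
        with self.session_factory() as session:
            rows = (
                session.query(NewsArticle)
                .filter(NewsArticle.symbol == symbol,
                        NewsArticle.published_date == day)
                .all()
            )
            return [ArticleData.from_db_model(row) for row in rows]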
Data Model Compatibility Strategy:
# Enhanced ArticleData to bridge existing and new models
from __future__ import annotations

from dataclasses import dataclass
from datetime import date as date_type
from decimal import Decimal
from typing import List

from sqlalchemy.orm import Session

@dataclass
class ArticleData:
    # Existing fields (maintain compatibility)
    title: str
    content: str
    author: str
    source: str  # Keep as string for existing code
    date: str  # YYYY-MM-DD format
    url: str
    sentiment: SentimentScore | None = None

    # New fields for enhanced functionality
    source_id: int | None = None  # Foreign key when available
    category_id: int | None = None  # Foreign key when available

    # Vector fields (optional for backward compatibility)
    title_embedding: List[float] | None = None
    content_embedding: List[float] | None = None
    sentiment_embedding: List[float] | None = None

    @classmethod
    def from_db_model(cls, article: NewsArticle) -> "ArticleData":
        """Convert database model to existing ArticleData format."""
        return cls(
            title=article.title,
            content=article.content or "",
            author=article.author or "",
            source=article.source.name if article.source else "Unknown",  # Flatten relationship
            date=article.published_date.isoformat(),
            url=article.url,
            sentiment=SentimentScore(
                score=float(article.sentiment_score) if article.sentiment_score else 0.0,
                confidence=float(article.sentiment_confidence) if article.sentiment_confidence else 0.0,
                label=article.sentiment_label or "neutral",
            ) if article.sentiment_score is not None else None,
            source_id=article.source_id,
            category_id=article.category_id,
            title_embedding=article.title_embedding,
            content_embedding=article.content_embedding,
            sentiment_embedding=article.sentiment_embedding,
        )

    def to_db_model(self, session: Session) -> NewsArticle:
        """Convert to database model, handling source lookup."""
        # Get or create source
        source = session.query(NewsSource).filter_by(name=self.source).first()
        if not source:
            source = NewsSource(name=self.source)
            session.add(source)
            session.flush()  # Get ID
        return NewsArticle(
            title=self.title,
            content=self.content,
            author=self.author,
            source_id=source.id,
            url=self.url,
            published_date=date_type.fromisoformat(self.date),  # aliased import avoids confusion with the `date` field
            sentiment_score=Decimal(str(self.sentiment.score)) if self.sentiment else None,
            sentiment_confidence=Decimal(str(self.sentiment.confidence)) if self.sentiment else None,
            sentiment_label=self.sentiment.label if self.sentiment else None,
            title_embedding=self.title_embedding,
            content_embedding=self.content_embedding,
            sentiment_embedding=self.sentiment_embedding,
        )
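A brief usage sketch of the round trip (the `SessionLocal` sessionmaker is an assumed fixture from the database setup; the article values are purely illustrative):

# Hedged usage sketch: round-tripping between ArticleData and the DB model.
with SessionLocal() as session:
    data = ArticleData(
        title="Acme beats earnings estimates",
        content="Acme Corp reported quarterly earnings above expectations...",
        author="Jane Doe",
        source="Example Wire",  # looked up or created in news_sources
        date="2024-05-01",
        url="https://example.com/acme-earnings",
    )
    row = data.to_db_model(session)  # resolves or creates the NewsSource row
    session.add(row)
    session.commit()

    restored = ArticleData.from_db_model(row)  # flattens the source back to a string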
- `tradingagents/domains/news/sentiment_service.py` (New File):
- This new service will encapsulate the logic for calling the LLM and generating embeddings.
- Primary method: `get_sentiment_with_embeddings(article_content: str) -> SentimentScoreWithEmbeddings`.
- It will use the `quick_think_llm` from the `TradingAgentsConfig` for performance.
- It will use a structured prompt to ask the LLM to return a JSON object with `score`, `confidence`, and `label`.
- **Embedding Generation**: Generate multiple embeddings using OpenAI's embedding API:
- `title_embedding`: Vector representation of article title (1536 dims)
- `content_embedding`: Vector representation of full article content (1536 dims)
- `sentiment_embedding`: Smaller specialized sentiment vector using sentence-transformers (384 dims)
- **Vector Similarity**: Enable semantic search for similar articles and sentiment clustering
- Database:
  - PostgreSQL + SQLAlchemy + pgvector Integration:
    - Replace file-based storage with a PostgreSQL database using the SQLAlchemy ORM
    - Create new SQLAlchemy models for news articles with proper relationships
    - Implement database migrations using Alembic
    - Add connection pooling and transaction management
    - Integrate the pgvector extension for high-dimensional sentiment embedding storage
    - Enable semantic similarity search and vector-based sentiment clustering
  - Database Schema Design:
    - `news_articles` table with columns for article data, sentiment scores, embeddings, and metadata
    - `news_sources` table for source information and credibility tracking
    - `news_categories` table for article categorization
    - Embeddings stored directly in `news_articles` as pgvector columns (the separate `sentiment_embeddings` table was dropped as redundant; see the schema below)
    - Proper indexing for symbol, date, and source queries, and for vector similarity searches
    - Foreign key relationships between articles, sources, and categories
API Specification
- No external API changes. All modifications will be internal to the `NewsService` and the cron job that calls it.
Security & Performance
- Security: LLM API keys will continue to be managed through the `TradingAgentsConfig` and environment variables. No new security risks are introduced.
- Performance: The scraping and sentiment analysis process is I/O- and network-bound. It runs as part of the daily cron job, so it will not impact the performance of the trading agents' decision-making process, which reads from the cached data.
Database Schema Design
Core Tables
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- News sources for credibility tracking
CREATE TABLE news_sources (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL UNIQUE,
domain VARCHAR(255),
credibility_score DECIMAL(3,2) DEFAULT 0.5, -- 0.0 to 1.0
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- News categories for article classification
CREATE TABLE news_categories (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Main articles table
CREATE TABLE news_articles (
id SERIAL PRIMARY KEY,
title TEXT NOT NULL,
content TEXT,
author VARCHAR(255),
symbol VARCHAR(10), -- Stock ticker, nullable for global news
source_id INTEGER REFERENCES news_sources(id),
category_id INTEGER REFERENCES news_categories(id),
url TEXT UNIQUE NOT NULL,
published_date DATE NOT NULL,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Sentiment analysis
sentiment_score DECIMAL(3,2), -- -1.0 to 1.0
sentiment_confidence DECIMAL(3,2), -- 0.0 to 1.0
sentiment_label VARCHAR(20), -- positive/negative/neutral
sentiment_analyzed_at TIMESTAMP,
-- Vector embeddings for semantic analysis
title_embedding vector(1536), -- OpenAI ada-002 embedding dimension
content_embedding vector(1536), -- Full article content embedding
sentiment_embedding vector(384), -- Sentence-transformer for sentiment
embedding_model VARCHAR(50) DEFAULT 'text-embedding-ada-002',
embedded_at TIMESTAMP,
-- Metadata
content_length INTEGER,
scrape_status VARCHAR(20) DEFAULT 'SUCCESS', -- SUCCESS, FAILED, ARCHIVE_SUCCESS
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Remove redundant sentiment_embeddings table
-- All embeddings stored directly in news_articles table for simplicity and performance
-- Performance indexes
CREATE INDEX idx_news_articles_symbol_date ON news_articles(symbol, published_date);
CREATE INDEX idx_news_articles_published_date ON news_articles(published_date);
CREATE INDEX idx_news_articles_source ON news_articles(source_id);
CREATE INDEX idx_news_articles_sentiment ON news_articles(sentiment_score, sentiment_confidence);
CREATE INDEX idx_news_articles_url_hash ON news_articles USING HASH(url);
-- Vector similarity indexes using HNSW (Hierarchical Navigable Small World)
-- Note: HNSW indexes consume significant memory (2-4x vector storage)
CREATE INDEX idx_articles_title_embedding ON news_articles USING hnsw (title_embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64); -- Tuned for performance vs memory
CREATE INDEX idx_articles_content_embedding ON news_articles USING hnsw (content_embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
CREATE INDEX idx_articles_sentiment_embedding ON news_articles USING hnsw (sentiment_embedding vector_cosine_ops)
WITH (m = 8, ef_construction = 32); -- Smaller index for sentiment vectors
SQLAlchemy Models
# tradingagents/domains/news/models.py
from datetime import datetime

from sqlalchemy import Column, Integer, String, Text, Date, DateTime, Numeric, ForeignKey
from sqlalchemy.orm import declarative_base, relationship
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class NewsSource(Base):
    __tablename__ = 'news_sources'

    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False, unique=True)
    domain = Column(String(255))
    credibility_score = Column(Numeric(3, 2), default=0.5)
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Relationships
    articles = relationship("NewsArticle", back_populates="source")

class NewsCategory(Base):
    __tablename__ = 'news_categories'

    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False, unique=True)
    description = Column(Text)
    created_at = Column(DateTime, default=datetime.utcnow)

    # Relationships
    articles = relationship("NewsArticle", back_populates="category")

class NewsArticle(Base):
    __tablename__ = 'news_articles'

    id = Column(Integer, primary_key=True)
    title = Column(Text, nullable=False)
    content = Column(Text)
    author = Column(String(255))
    symbol = Column(String(10))  # Nullable for global news
    source_id = Column(Integer, ForeignKey('news_sources.id'))
    category_id = Column(Integer, ForeignKey('news_categories.id'))
    url = Column(Text, unique=True, nullable=False)
    published_date = Column(Date, nullable=False)
    scraped_at = Column(DateTime, default=datetime.utcnow)

    # Sentiment fields (SQLAlchemy's Numeric maps to SQL DECIMAL)
    sentiment_score = Column(Numeric(3, 2))  # -1.0 to 1.0
    sentiment_confidence = Column(Numeric(3, 2))  # 0.0 to 1.0
    sentiment_label = Column(String(20))  # positive/negative/neutral
    sentiment_analyzed_at = Column(DateTime)

    # Vector embeddings using pgvector
    title_embedding = Column(Vector(1536))  # OpenAI ada-002 dimensions
    content_embedding = Column(Vector(1536))  # Full content embedding
    sentiment_embedding = Column(Vector(384))  # Sentence transformer for sentiment
    embedding_model = Column(String(50), default='text-embedding-ada-002')
    embedded_at = Column(DateTime)

    # Metadata
    content_length = Column(Integer)
    scrape_status = Column(String(20), default='SUCCESS')
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Relationships
    source = relationship("NewsSource", back_populates="articles")
    category = relationship("NewsCategory", back_populates="articles")

# Removed redundant SentimentEmbedding table for simplified architecture
Database Migration Strategy
Alembic Configuration:
# alembic/env.py
from alembic import context

from tradingagents.domains.news.models import Base
from tradingagents.config import TradingAgentsConfig

app_config = TradingAgentsConfig.from_env()
target_metadata = Base.metadata

# Point Alembic at the database URL from the application config
# (context.config is Alembic's own Config object, distinct from app_config)
config = context.config
config.set_main_option("sqlalchemy.url", app_config.database_url)
Initial Migration:
# Initialize Alembic in the project
alembic init alembic
# Generate initial migration
alembic revision --autogenerate -m "Create news tables"
# Apply migration
alembic upgrade head
Migration Files:
- `001_enable_pgvector.py` - Enable pgvector extension
- `002_create_news_tables.py` - Initial schema creation with vector fields
- `003_add_vector_indexes.py` - HNSW indexes for vector similarity
- `004_seed_categories_sources.py` - Seed default categories and trusted sources
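As a sketch, the first migration could be as small as the following (revision identifiers are placeholders, not real Alembic-generated IDs):

# Hedged sketch of 001_enable_pgvector.py
from alembic import op

revision = "001_enable_pgvector"
down_revision = None

def upgrade() -> None:
    # pgvector must be enabled before any Vector columns are created
    op.execute("CREATE EXTENSION IF NOT EXISTS vector")

def downgrade() -> None:
    op.execute("DROP EXTENSION IF EXISTS vector")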
TradingAgentsConfig Extension:
import os
from dataclasses import dataclass, field

@dataclass
class TradingAgentsConfig:
    # ... existing fields ...

    # Database configuration
    database_url: str = field(default_factory=lambda: os.getenv("DATABASE_URL", ""))
    database_pool_size: int = field(default_factory=lambda: int(os.getenv("DATABASE_POOL_SIZE", "10")))
    database_max_overflow: int = field(default_factory=lambda: int(os.getenv("DATABASE_MAX_OVERFLOW", "20")))
    database_echo: bool = field(default_factory=lambda: os.getenv("DATABASE_ECHO", "false").lower() == "true")

    # Vector configuration
    enable_vector_search: bool = field(default_factory=lambda: os.getenv("ENABLE_VECTOR_SEARCH", "true").lower() == "true")
    embedding_model: str = field(default_factory=lambda: os.getenv("EMBEDDING_MODEL", "text-embedding-ada-002"))
    embedding_batch_size: int = field(default_factory=lambda: int(os.getenv("EMBEDDING_BATCH_SIZE", "100")))
    enable_sentence_transformers: bool = field(default_factory=lambda: os.getenv("ENABLE_SENTENCE_TRANSFORMERS", "true").lower() == "true")

    @property
    def has_database_config(self) -> bool:
        """Check if database is properly configured."""
        return bool(self.database_url and self.database_url.startswith("postgresql://"))

    @property
    def embedding_provider(self) -> str:
        """Get embedding provider from LLM provider setting."""
        # Map LLM providers to their embedding providers
        llm_provider = getattr(self, 'llm_provider', 'openai')
        embedding_map = {
            'openai': 'openai',
            'google': 'google',  # Use Gemini for embeddings when Google is selected
            'anthropic': 'openai',  # Anthropic doesn't offer embeddings; use OpenAI
            'ollama': 'openai',  # Local models; use OpenAI for embeddings
        }
        return embedding_map.get(llm_provider, 'openai')

def validate_database_config(config: TradingAgentsConfig) -> None:
    """Validate database configuration before startup."""
    if not config.has_database_config:
        raise ValueError("DATABASE_URL must be set for PostgreSQL integration")
    if config.enable_vector_search and not config.has_database_config:
        raise ValueError("Vector search requires PostgreSQL database configuration")
Environment Variables:
# Database configuration (required)
DATABASE_URL=postgresql://username:password@localhost:5432/tradingagents
DATABASE_POOL_SIZE=10 # optional, defaults to 10
DATABASE_MAX_OVERFLOW=20 # optional, defaults to 20
DATABASE_ECHO=false # optional, set to true for SQL debugging
# Vector configuration (optional)
ENABLE_VECTOR_SEARCH=true # optional, defaults to true
EMBEDDING_MODEL=google/gemini-2.5-flash # Use Gemini via OpenRouter for embeddings
EMBEDDING_BATCH_SIZE=100 # optional
ENABLE_SENTENCE_TRANSFORMERS=true # optional
# Example configurations by provider:
# For OpenAI: EMBEDDING_MODEL=text-embedding-ada-002
# For Gemini: EMBEDDING_MODEL=google/gemini-2.5-flash (via OpenRouter)
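A short sketch of how these settings would feed the SQLAlchemy engine at startup (the names mirror the config fields above; the session-factory name is illustrative):

# Hedged sketch: building the SQLAlchemy engine from TradingAgentsConfig
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from tradingagents.config import TradingAgentsConfig

config = TradingAgentsConfig.from_env()
validate_database_config(config)  # fail fast on a missing DATABASE_URL

engine = create_engine(
    config.database_url,
    pool_size=config.database_pool_size,
    max_overflow=config.database_max_overflow,
    echo=config.database_echo,  # SQL logging for debugging
)
SessionLocal = sessionmaker(bind=engine)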
Embedding Generation Service Design
SentimentScore Enhancement:
@dataclass
class SentimentScoreWithEmbeddings:
    """Enhanced sentiment analysis with vector embeddings."""

    score: float  # -1.0 to 1.0
    confidence: float  # 0.0 to 1.0
    label: str  # positive/negative/neutral

    # Vector embeddings
    title_embedding: List[float]  # 1536 dimensions
    content_embedding: List[float]  # 1536 dimensions
    sentiment_embedding: List[float] | None  # 384 dimensions; None when sentence transformers are disabled
    embedding_model: str = "text-embedding-ada-002"
Service Implementation:
import asyncio
import json
import logging
import os
from typing import List

from openai import AsyncOpenAI
from sentence_transformers import SentenceTransformer
from sqlalchemy.orm import Session

from tradingagents.config import TradingAgentsConfig
from tradingagents.domains.news.models import NewsArticle

logger = logging.getLogger(__name__)

class EmbeddingProvider:
    """Abstract base for embedding providers."""

    model: str

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        raise NotImplementedError

class OpenAIEmbeddingProvider(EmbeddingProvider):
    def __init__(self, api_key: str, model: str = "text-embedding-ada-002"):
        self.client = AsyncOpenAI(api_key=api_key)
        self.model = model

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        response = await self.client.embeddings.create(
            input=texts,
            model=self.model
        )
        return [item.embedding for item in response.data]

class GeminiEmbeddingProvider(EmbeddingProvider):
    def __init__(self, api_key: str, base_url: str = "https://openrouter.ai/api/v1"):
        self.client = AsyncOpenAI(api_key=api_key, base_url=base_url)
        self.model = "google/gemini-2.5-flash"

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        # Gemini via OpenRouter - batch embeddings
        response = await self.client.embeddings.create(
            input=texts,
            model=self.model
        )
        return [item.embedding for item in response.data]

class SentimentAnalysisService:
    def __init__(self, config: TradingAgentsConfig):
        self.llm_client = self._get_llm_client(config)  # LLM client factory defined elsewhere in the service
        self.embedding_provider = self._get_embedding_provider(config)
        self.sentence_transformer = (
            SentenceTransformer('all-MiniLM-L6-v2')
            if config.enable_sentence_transformers
            else None
        )

    def _get_embedding_provider(self, config: TradingAgentsConfig) -> EmbeddingProvider:
        """Get appropriate embedding provider based on configuration."""
        provider = config.embedding_provider
        if provider == 'openai':
            return OpenAIEmbeddingProvider(
                api_key=os.getenv('OPENAI_API_KEY'),
                model=config.embedding_model
            )
        elif provider == 'google':
            return GeminiEmbeddingProvider(
                api_key=os.getenv('OPENAI_API_KEY'),  # OpenRouter key
                base_url="https://openrouter.ai/api/v1"
            )
        else:
            # Default to OpenAI
            return OpenAIEmbeddingProvider(
                api_key=os.getenv('OPENAI_API_KEY'),
                model=config.embedding_model
            )

    async def get_sentiment_with_embeddings(
        self,
        title: str,
        content: str
    ) -> SentimentScoreWithEmbeddings:
        """Generate sentiment analysis with vector embeddings - optimized for performance."""
        # 1. Parallel processing: sentiment score + embeddings
        tasks = [
            self._get_sentiment_score(content),  # LLM sentiment analysis
            self.embedding_provider.get_embeddings([title, content])  # Batched embedding API call
        ]
        sentiment, embeddings = await asyncio.gather(*tasks)
        title_embedding, content_embedding = embeddings

        # 2. Generate local sentiment embedding if enabled
        sentiment_embedding = None
        if self.sentence_transformer:
            sentiment_embedding = self.sentence_transformer.encode(content).tolist()

        return SentimentScoreWithEmbeddings(
            score=sentiment.score,
            confidence=sentiment.confidence,
            label=sentiment.label,
            title_embedding=title_embedding,
            content_embedding=content_embedding,
            sentiment_embedding=sentiment_embedding,
            embedding_model=self.embedding_provider.model
        )

    async def _get_sentiment_score(self, content: str) -> SentimentScore:
        """Generate sentiment score using LLM with financial news prompt."""
        prompt = """
        Analyze the sentiment of this financial news article for trading purposes.

        Article Content: {content}

        Provide your analysis in the following JSON format:
        {{
            "score": <float between -1.0 (very negative) and 1.0 (very positive)>,
            "confidence": <float between 0.0 and 1.0>,
            "label": <"positive", "negative", or "neutral">,
            "reasoning": <brief explanation>,
            "key_themes": <list of key financial themes>,
            "financial_entities": <list of mentioned companies/tickers>
        }}

        Focus on the financial and market implications of the news.
        Consider impact on stock prices, market sentiment, and trading decisions.
        """.format(content=content[:2000])  # Limit content length

        response = await self.llm_client.complete(prompt)
        try:
            result = json.loads(response)
            return SentimentScore(
                score=result.get("score", 0.0),
                confidence=result.get("confidence", 0.5),
                label=result.get("label", "neutral"),
                metadata={
                    "reasoning": result.get("reasoning", ""),
                    "key_themes": result.get("key_themes", []),
                    "financial_entities": result.get("financial_entities", [])
                }
            )
        except Exception as e:
            # Return neutral sentiment on error
            return SentimentScore(
                score=0.0,
                confidence=0.0,
                label="neutral",
                metadata={"error": str(e)}
            )

    def find_similar_articles(
        self,
        session: Session,  # session passed in here; could equally come from a repository
        embedding: List[float],
        limit: int = 10,
        similarity_threshold: float = 0.8
    ) -> List[NewsArticle]:
        """Find semantically similar articles using pgvector cosine distance."""
        # pgvector's SQLAlchemy integration exposes cosine_distance() on Vector
        # columns; cosine distance = 1 - similarity.
        max_distance = 1.0 - similarity_threshold
        distance = NewsArticle.content_embedding.cosine_distance(embedding)
        rows = (
            session.query(NewsArticle, distance.label("distance"))
            .filter(NewsArticle.content_embedding.isnot(None))
            .order_by(distance)
            .limit(limit * 2)  # over-fetch candidates, then apply the threshold
            .all()
        )
        return [article for article, dist in rows if dist <= max_distance][:limit]

    async def batch_analyze_sentiment(
        self,
        articles: List[ArticleData],
        batch_size: int = 5
    ) -> List[SentimentScoreWithEmbeddings]:
        """
        Batch process sentiment analysis and embedding generation.

        Args:
            articles: List of articles to analyze
            batch_size: Number of articles to process concurrently

        Returns:
            List of sentiment scores with embeddings
        """
        results = []
        for i in range(0, len(articles), batch_size):
            batch = articles[i:i + batch_size]

            # Process batch concurrently
            batch_tasks = [
                self.get_sentiment_with_embeddings(article.title, article.content)
                for article in batch
            ]
            batch_results = await asyncio.gather(*batch_tasks, return_exceptions=True)

            for result in batch_results:
                if isinstance(result, Exception):
                    # Handle individual failures gracefully
                    logger.error(f"Sentiment analysis failed: {result}")
                    results.append(self._get_neutral_sentiment_with_embeddings())
                else:
                    results.append(result)

            # Rate limiting: add a delay between batches
            if i + batch_size < len(articles):
                await asyncio.sleep(1.0)  # 1 second delay between batches

        return results
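A hedged usage sketch of the batch pipeline from the daily cron job; the attribute wiring onto `ArticleData` mirrors the fields defined earlier, and the function name is illustrative:

# Hedged sketch: enriching freshly scraped articles inside the cron job.
import asyncio

async def analyze_daily_articles(service: SentimentAnalysisService,
                                 articles: list) -> list:
    """Attach LLM sentiment and embeddings to scraped articles."""
    scores = await service.batch_analyze_sentiment(articles, batch_size=5)
    for article, score in zip(articles, scores):
        article.sentiment = SentimentScore(
            score=score.score, confidence=score.confidence, label=score.label
        )
        article.title_embedding = score.title_embedding
        article.content_embedding = score.content_embedding
        article.sentiment_embedding = score.sentiment_embedding
    return articles

# Inside the cron job:
# enriched = asyncio.run(analyze_daily_articles(sentiment_service, scraped_articles))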
Optimized Vector Similarity Queries:
-- Find articles similar to a given title embedding (HNSW optimized)
-- Note: Don't put a similarity predicate in WHERE - it defeats HNSW indexing
SELECT id, title, symbol,
       (title_embedding <=> %s) AS distance,
       (1 - (title_embedding <=> %s)) AS similarity
FROM news_articles
WHERE title_embedding IS NOT NULL -- Only filter on non-null vectors
ORDER BY title_embedding <=> %s
LIMIT 20; -- Fetch extra candidates; apply a distance threshold (e.g. < 0.2) in the application
-- Find articles with similar sentiment patterns (pre-filter by label for efficiency)
SELECT id, title, sentiment_label,
(sentiment_embedding <=> %s) as distance
FROM news_articles
WHERE sentiment_label = %s -- Filter first by indexed column
AND sentiment_embedding IS NOT NULL
ORDER BY sentiment_embedding <=> %s
LIMIT 15;
-- Cluster articles by content similarity for a ticker (optimized approach)
WITH similar_articles AS (
SELECT id, symbol, sentiment_score,
(content_embedding <=> %s) as distance
FROM news_articles
WHERE symbol = %s -- Use indexed column first
AND content_embedding IS NOT NULL
ORDER BY content_embedding <=> %s
LIMIT 50 -- Limit search space
)
SELECT symbol,
AVG(sentiment_score) as avg_sentiment,
COUNT(*) as article_count,
AVG(distance) as avg_content_distance
FROM similar_articles
WHERE distance < 0.3 -- Apply similarity threshold after vector search
GROUP BY symbol;
-- Performance monitoring query
SELECT
schemaname,
tablename,
attname as column_name,
n_distinct,
correlation
FROM pg_stats
WHERE tablename = 'news_articles'
AND attname LIKE '%embedding%';
Memory Usage Estimation:
-- Estimate memory requirements for HNSW indexes
SELECT
pg_size_pretty(pg_total_relation_size('idx_articles_title_embedding')) as title_index_size,
pg_size_pretty(pg_total_relation_size('idx_articles_content_embedding')) as content_index_size,
pg_size_pretty(pg_total_relation_size('idx_articles_sentiment_embedding')) as sentiment_index_size,
pg_size_pretty(pg_total_relation_size('news_articles')) as table_size;
-- Expected memory usage: 500MB-1GB for 10K articles with 3 embedding types
Current Implementation Status
✅ COMPLETED COMPONENTS:
- NewsService Core Structure (90% Complete)
  - ✅ Core service class with dependency injection
  - ✅ Read path implemented: `get_company_news_context()`, `get_global_news_context()`
  - ✅ Write path implemented: `update_company_news()`, `update_global_news()`
  - ✅ Repository integration with file-based storage
  - ✅ ArticleData model conversion from repository NewsArticle
  - ✅ Simple keyword-based sentiment analysis as fallback
  - ✅ Error handling and empty context returns
  - ✅ Trending topics extraction
  - ✅ Date validation and ISO format handling
- NewsRepository (100% Complete)
  - ✅ File-based storage with JSON serialization
  - ✅ Source separation (finnhub, google_news)
  - ✅ Date-based file organization (YYYY-MM-DD.json)
  - ✅ Article deduplication by URL
  - ✅ Batch storage operations
  - ✅ Complete CRUD operations
  - ✅ Proper error handling and logging
- Data Models (100% Complete)
  - ✅ ArticleData dataclass with sentiment field
  - ✅ NewsContext and GlobalNewsContext for agent consumption
  - ✅ SentimentScore model
  - ✅ NewsUpdateResult for operation tracking
  - ✅ DataQuality enum for metadata
✅ COMPLETED COMPONENTS (UPDATED):
- GoogleNewsClient (100% Complete)
  - ✅ RSS feed parsing with feedparser
  - ✅ Company news method implemented (`get_company_news()`)
  - ✅ Global news method implemented (`get_global_news()`)
  - ✅ Proper error handling and logging
  - ✅ Google News RSS URL construction
  - ✅ Article parsing with source extraction
  - ✅ Date parsing with fallback handling
- ArticleScraperClient (100% Complete)
  - ✅ Full newspaper3k content extraction
  - ✅ Internet Archive Wayback Machine fallback
  - ✅ Robust error handling for failed scrapes
  - ✅ Content validation (minimum length checks)
  - ✅ Multiple article batch processing
  - ✅ Rate limiting with configurable delays
  - ✅ Proper URL validation
❌ MISSING COMPONENTS:
- LLM Sentiment Analysis Service (0% Complete)
  - ❌ SentimentAnalysisService class not created
  - ❌ LLM integration not implemented
  - ❌ Financial news prompts not defined
  - ❌ Batch processing not implemented
  - Current: Using simple keyword-based fallback
  - Next: Create dedicated sentiment service
- Database Migration (0% Complete)
  - ❌ SQLAlchemy models not created
  - ❌ PostgreSQL integration not started
  - ❌ pgvector extension not configured
  - ❌ Alembic migrations not set up
  - Current: Using file-based storage
  - Status: Planned for future iteration
- Vector Embeddings (0% Complete)
  - ❌ Embedding providers not implemented
  - ❌ Vector similarity not available
  - ❌ Semantic search not implemented
  - Status: Advanced feature for future enhancement
Revised Implementation Phases
PHASE 1: Complete Core Functionality (Current Priority)
- GoogleNewsClient RSS Implementation (2-3 days)
  - Implement feedparser RSS parsing
  - Add company news and global news methods
  - Handle RSS feed errors and edge cases
  - Create comprehensive tests with VCR cassettes
- ArticleScraperClient Implementation (2-3 days)
  - Implement newspaper3k content extraction
  - Add Internet Archive fallback mechanism
  - Handle paywalls and extraction failures
  - Create scraping tests with mock responses
- LLM Sentiment Analysis Service (3-4 days)
  - Create SentimentAnalysisService class
  - Implement LLM client integration using TradingAgentsConfig
  - Design financial news sentiment prompts
  - Add batch processing with rate limiting
  - Replace keyword-based sentiment in NewsService
PHASE 2: Testing and Refinement (Current Phase)
- Integration Testing (1-2 days)
  - End-to-end testing with real RSS feeds
  - Test article scraping and sentiment analysis pipeline
  - Verify error handling and partial failures
  - Performance testing with multiple tickers
- Type Safety and Quality (1 day)
  - Ensure `mise run typecheck` passes with 0 errors
  - Fix any remaining linting issues
  - Add missing docstrings and type hints
PHASE 3: Future Enhancements (Deferred)
- Database Migration: SQLAlchemy + PostgreSQL + pgvector
- Vector Embeddings: Semantic similarity and clustering
- Performance Optimization: Caching improvements and batch processing
Total Timeline: 1-2 weeks for core completion
- Week 1: Complete GoogleNewsClient, ArticleScraperClient, LLM Sentiment Service
- Week 2: Integration testing, refinement, and quality assurance
- Future: Database migration and vector enhancements as separate project
Testing Plan
Test Strategy
- Unit Testing: Test individual components in isolation with mocked dependencies
- Integration Testing: Test component interactions and data flow
- End-to-End Testing: Test complete workflows from news fetching to storage
Unit Tests
GoogleNewsClient Tests
- Location: `tests/domains/news/test_google_news_client.py`
- Framework: `pytest` with `pytest-vcr` for HTTP recording/replay
- VCR Cassettes: `tests/fixtures/vcr_cassettes/google_news/`
- Test Cases (a sample test sketch follows this list):
  - `@pytest.mark.vcr` `test_get_news_by_symbol_success()` - Valid symbol returns articles
  - `@pytest.mark.vcr` `test_get_news_by_symbol_invalid_symbol()` - Invalid symbol handling
  - `@pytest.mark.vcr` `test_get_global_news_success()` - Global news retrieval
  - `@pytest.mark.vcr` `test_get_global_news_empty_response()` - Empty RSS feed handling
  - `test_rss_feed_parsing_error()` - Malformed RSS handling (mocked)
  - `test_network_timeout()` - Network timeout scenarios (mocked)
  - `test_rate_limiting()` - Rate limit compliance (mocked)
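For example, the first case might look like this (a sketch; the client import path and the return shape are assumptions based on this design):

# Hedged sketch of a VCR-backed client test.
import pytest

from tradingagents.domains.news.google_news_client import GoogleNewsClient

@pytest.mark.vcr
def test_get_news_by_symbol_success():
    client = GoogleNewsClient()
    articles = client.get_company_news("AAPL")  # replayed from the cassette

    assert len(articles) > 0
    first = articles[0]
    assert first.title  # every entry should carry a headline
    assert first.url.startswith("http")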
ArticleScraperClient Tests
- Location: `tests/domains/news/test_article_scraper_client.py`
- Framework: `pytest` with `pytest-vcr` for HTTP recording/replay
- VCR Cassettes: `tests/fixtures/vcr_cassettes/article_scraper/`
- Test Cases:
  - `@pytest.mark.vcr` `test_scrape_article_success()` - Successful article scraping
  - `@pytest.mark.vcr` `test_scrape_article_archive_fallback()` - Internet Archive fallback
  - `test_scrape_article_both_fail()` - Both methods fail gracefully (mocked)
  - `test_invalid_url()` - Invalid URL handling (mocked)
  - `@pytest.mark.vcr` `test_content_extraction()` - Content parsing accuracy
SentimentAnalysisService Tests
- Location: `tests/domains/news/test_sentiment_service.py`
- Test Cases:
  - `test_get_sentiment_positive()` - Positive sentiment detection
  - `test_get_sentiment_negative()` - Negative sentiment detection
  - `test_get_sentiment_neutral()` - Neutral sentiment detection
  - `test_get_sentiment_llm_error()` - LLM API error handling
  - `test_get_sentiment_invalid_response()` - Invalid JSON response handling
  - `test_get_sentiment_empty_content()` - Empty content handling
NewsService Tests
- Location: `tests/domains/news/test_news_service.py`
- Test Cases:
  - `test_update_company_news_success()` - Complete news update workflow
  - `test_update_company_news_no_articles()` - No articles found scenario
  - `test_update_company_news_scraping_failure()` - Partial scraping failures
  - `test_sentiment_analysis_integration()` - Sentiment analysis integration
  - `test_calculate_sentiment_summary()` - Sentiment aggregation logic
  - `test_get_company_news_by_date()` - News retrieval by date
NewsRepository Tests
- Location: `tests/domains/news/test_news_repository.py`
- Test Cases:
  - `test_store_news_articles()` - Article storage
  - `test_get_news_by_symbol_and_date()` - News retrieval
  - `test_duplicate_article_handling()` - Duplicate prevention
  - `test_data_persistence()` - File system persistence
  - `test_invalid_data_handling()` - Invalid data rejection
Integration Tests
News Workflow Integration
- Location: `tests/integration/test_news_workflow.py`
- Test Cases:
  - `test_full_news_update_workflow()` - Complete end-to-end workflow
  - `test_news_service_with_real_clients()` - Real client integration
  - `test_sentiment_service_integration()` - LLM integration testing
  - `test_repository_integration()` - Data persistence integration
End-to-End Tests
Complete System Tests
- Location: `tests/e2e/test_news_system.py`
- Test Cases:
  - `test_daily_news_update_simulation()` - Simulate daily cron job
  - `test_trading_agent_news_consumption()` - Agent news retrieval
  - `test_system_performance_with_multiple_tickers()` - Performance testing
  - `test_error_recovery_scenarios()` - System resilience testing
Test Data Management
Mock Data Strategy
- RSS Feed Samples: Saved sample RSS responses for consistent testing
- Article Content: Pre-scraped article content for sentiment testing
- LLM Responses: Mock sentiment analysis responses for unit tests
Test Configuration
- Environment Variables: Separate test configuration
- Database Isolation: Temporary test databases
- VCR Configuration: Record/replay HTTP interactions for deterministic tests
- Pytest Configuration: `pytest.ini` with VCR settings and test markers (a sample fixture sketch follows)
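A sketch of the shared VCR setup, assuming `pytest-vcr`'s `vcr_config` fixture hook; the cassette directory mirrors the paths listed above:

# tests/conftest.py - hedged sketch of the pytest-vcr configuration
import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        "cassette_library_dir": "tests/fixtures/vcr_cassettes",
        "record_mode": "once",                # replay existing cassettes deterministically
        "filter_headers": ["authorization"],  # keep API keys out of cassettes
    }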
Performance Testing
Load Testing
- Concurrent News Updates: Test multiple ticker updates simultaneously
- Memory Usage: Monitor memory consumption during batch processing
- API Rate Limiting: Verify rate limit compliance under load
Benchmarking
- Scraping Speed: Measure article scraping performance
- Sentiment Analysis: Measure LLM response times
- Storage Performance: Database write/read performance
Test Automation
CI/CD Integration
- Pre-commit Hooks: Run fast unit tests before commits
- Pull Request Checks: Full test suite on PR creation
- Nightly Tests: End-to-end tests with real data
Test Coverage Requirements
- Minimum Coverage: 80% line coverage for all components
- Critical Path Coverage: 100% coverage for core business logic
- Error Handling Coverage: All exception paths tested
Manual Testing Scenarios
Smoke Tests
- Daily Operations: Manual verification of daily news updates
- Data Quality: Spot-check sentiment analysis accuracy
- System Health: Monitor error rates and performance metrics
Acceptance Testing
- Trading Agent Integration: Verify agents can consume news data effectively
- Data Accuracy: Validate news relevance and sentiment accuracy
- Performance Benchmarks: Confirm system meets performance requirements
Current Implementation Status Summary
Overall Progress: 90% Complete 🎉
✅ COMPLETED (100%)
- Requirements analysis and technical design
- NewsService core structure with read/write paths
- NewsRepository with file-based storage and deduplication
- Data models (ArticleData, NewsContext, SentimentScore)
- GoogleNewsClient with full RSS feed parsing
- ArticleScraperClient with newspaper3k + Internet Archive fallback
- Basic sentiment analysis (keyword-based fallback)
- Error handling and validation
- Service integration and dependency injection
❌ MISSING (10%)
- LLM sentiment analysis service (only remaining core component)
⏸️ DEFERRED (Future Iterations)
- Database migration to PostgreSQL + SQLAlchemy
- Vector embeddings and semantic search
- Real-time news streaming capabilities
What's Working Now
The current NewsService implementation provides:
- Read Path: Agents can successfully call `get_company_news_context()` and `get_global_news_context()`
- Repository Integration: Service reads cached news data from the file-based NewsRepository
- Data Transformation: Converts NewsRepository.NewsArticle → ArticleData for agents
- Basic Sentiment: Simple keyword-based sentiment analysis as fallback
- Error Handling: Graceful error handling with empty contexts and metadata
- Type Safety: Proper type hints and dataclass definitions
What's Missing
The service still lacks:
- LLM Sentiment Analysis: No LLM integration for financial news sentiment (using keyword fallback)
- Structured Storage: Still using file-based storage instead of planned PostgreSQL + SQLAlchemy
- Vector Embeddings: No semantic similarity or vector-based features
Critical Gap (Only 1 Remaining!)
- LLM Sentiment Service - No structured sentiment analysis with LLM prompts
- Current: Simple keyword-based sentiment scoring
- Needed: LLM integration using TradingAgentsConfig
- Impact: Agents get basic sentiment but not sophisticated financial analysis
Recently Discovered: Implementation is 90% Complete!
Upon detailed code review, the implementation is much further along than initially documented:
- ✅ GoogleNewsClient - Fully implemented with RSS parsing
- ✅ ArticleScraperClient - Complete with newspaper3k + Internet Archive fallback
- ✅ NewsService - Full read/write paths with proper error handling
- ✅ NewsRepository - Production-ready file-based storage
Next Immediate Steps (Revised)
- ✅ COMPLETE: GoogleNewsClient RSS parsing - Already implemented with feedparser
- ✅ COMPLETE: ArticleScraperClient - Already implemented with newspaper3k + Internet Archive
- ⏳ PRIORITY: Create LLM Sentiment Service - Replace keyword-based analysis (2-3 days)
- ⏳ PRIORITY: Integration testing - End-to-end workflow validation (1-2 days)
Timeline to MVP (Updated)
- 3-5 days for LLM sentiment service + testing
- Current system is production-ready with basic sentiment analysis
- Database migration deferred to future iteration
- Vector features planned as advanced enhancement
Implementation Priority
HIGH PRIORITY (Required for sophisticated sentiment):
- LLM Sentiment Analysis Service with financial news prompts
MEDIUM PRIORITY (System improvements):
- Better error handling and retry logic
- Performance optimization for batch processing
- Comprehensive integration test suite
LOW PRIORITY (Future enhancements):
- PostgreSQL + SQLAlchemy migration
- Vector embeddings and semantic search
- Real-time news streaming