TradingAgents/prd/news_service.md


# News Service PRD
## Executive Summary
The News Service feature will provide up-to-date news sentiment analysis for stock market tickers to the TradingAgents framework. This service will enable agents to make more informed trading decisions based on current market news and sentiment.
## Requirements
### Target Users
- Trading Agents (News Analyst, Researchers, Trader Agent, Risk Management team)
- Cron Job system for daily updates
### Problem Statement
Agents need up-to-date news sentiment when analyzing the stock market to make better trading decisions. Currently, they may be missing important news events or experiencing delays in sentiment analysis that could impact trading performance.
### Success Metrics
- Measurable improvement in trading decision quality for agents consuming the news sentiment data
### User Stories
1. As the Cron Job, I want to update and store news with sentiment analysis for each ticker every day
2. As a Trading Agent, I want to retrieve news with sentiment analysis for a ticker and a date from the database
### Out of Scope (v1)
- Real-time news streaming (vs daily updates)
- Multi-language news support
- Historical news sentiment analysis beyond a certain date range
- News source ranking or weighting
- Advanced filtering options
### Timeline
MVP in 1 week
## Status
✅ Requirements Complete | ✅ Technical Design Complete | 🔄 Implementation In Progress
## Technical Design
### Architecture
- The `NewsService` will be the central component, orchestrating the fetching, scraping, analysis, and storage of news articles.
- It will utilize the existing `GoogleNewsClient` to fetch RSS feeds from Google News.
- The `ArticleScraperClient` will be enhanced to scrape full article content with robust fallback strategies:
  - **Direct Fetch**: Primary method using the `newspaper3k` library for content extraction
  - **Archive Fallback**: Internet Archive Wayback Machine fallback for failed fetches
  - **Content Extraction**: Clean text, title, publication date, and metadata extraction
  - **Paywall Detection**: Handle paywall-protected content gracefully
- A new `SentimentAnalysisService` will be created to handle the interaction with the configured LLM for structured sentiment analysis.
- The `NewsRepository` will store the news articles along with their sentiment scores in the existing file-based database.
### Implementation Components
- **Backend:**
  - `tradingagents/domains/news/news_service.py`:
    - A new private method `_get_sentiment_for_article` will be added to call the `SentimentAnalysisService`.
    - The `update_company_news` method will be modified to call this new method for each scraped article.
    - The `_calculate_sentiment_summary` method will be updated to aggregate the new structured sentiment scores.
    - Update to work with the SQLAlchemy-based `NewsRepository` instead of file-based storage.
  - `tradingagents/domains/news/repository.py` (Enhanced with Compatibility Layer):
    - Replace file-based storage with SQLAlchemy ORM operations
    - **Backward Compatibility**: Maintain the existing interface with an adapter pattern
    - Implement new methods: `save_articles()`, `get_articles_by_symbol()`, `get_articles_by_date_range()`
    - Add transaction management and connection pooling
    - Include duplicate detection using URL uniqueness constraints
    - Add batch operations for efficient bulk inserts
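As one plausible shape for the `_calculate_sentiment_summary` aggregation mentioned above (a sketch with hypothetical names, not the actual implementation), a confidence-weighted mean over per-article results:

```python
# Hypothetical sketch: aggregate structured per-article sentiment into a daily summary.
from dataclasses import dataclass


@dataclass
class SentimentSummary:
    average_score: float   # confidence-weighted mean, -1.0 to 1.0
    article_count: int
    dominant_label: str    # most common label across the day's articles


def summarize(scores: list[tuple[float, float, str]]) -> SentimentSummary:
    """scores: (score, confidence, label) triples for one ticker/day."""
    if not scores:
        return SentimentSummary(0.0, 0, "neutral")
    total_weight = sum(conf for _, conf, _ in scores)
    if total_weight == 0:
        weighted = 0.0  # all articles had zero confidence
    else:
        weighted = sum(score * conf for score, conf, _ in scores) / total_weight
    labels = [label for _, _, label in scores]
    dominant = max(set(labels), key=labels.count)
    return SentimentSummary(weighted, len(scores), dominant)
```

Weighting by confidence keeps a single low-confidence outlier from dominating the daily summary.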
**Data Model Compatibility Strategy:**
```python
# Enhanced ArticleData to bridge existing and new models
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List

from sqlalchemy.orm import Session

# NewsArticle / NewsSource are the SQLAlchemy models defined below;
# SentimentScore is the existing sentiment dataclass in this domain.


@dataclass
class ArticleData:
    # Existing fields (maintain compatibility)
    title: str
    content: str
    author: str
    source: str  # Keep as string for existing code
    date: str  # YYYY-MM-DD format
    url: str
    sentiment: SentimentScore | None = None

    # New fields for enhanced functionality
    source_id: int | None = None  # Foreign key when available
    category_id: int | None = None  # Foreign key when available

    # Vector fields (optional for backward compatibility)
    title_embedding: List[float] | None = None
    content_embedding: List[float] | None = None
    sentiment_embedding: List[float] | None = None

    @classmethod
    def from_db_model(cls, article: NewsArticle) -> ArticleData:
        """Convert database model to existing ArticleData format."""
        return cls(
            title=article.title,
            content=article.content or "",
            author=article.author or "",
            source=article.source.name if article.source else "Unknown",  # Flatten relationship
            date=article.published_date.isoformat(),
            url=article.url,
            sentiment=SentimentScore(
                score=float(article.sentiment_score),
                confidence=float(article.sentiment_confidence or 0.0),
                label=article.sentiment_label or "neutral",
            ) if article.sentiment_score is not None else None,
            source_id=article.source_id,
            category_id=article.category_id,
            title_embedding=article.title_embedding,
            content_embedding=article.content_embedding,
            sentiment_embedding=article.sentiment_embedding,
        )

    def to_db_model(self, session: Session) -> NewsArticle:
        """Convert to database model, handling source lookup."""
        # Get or create source
        source = session.query(NewsSource).filter_by(name=self.source).first()
        if not source:
            source = NewsSource(name=self.source)
            session.add(source)
            session.flush()  # Get ID
        return NewsArticle(
            title=self.title,
            content=self.content,
            author=self.author,
            source_id=source.id,
            url=self.url,
            published_date=date.fromisoformat(self.date),
            sentiment_score=Decimal(str(self.sentiment.score)) if self.sentiment else None,
            sentiment_confidence=Decimal(str(self.sentiment.confidence)) if self.sentiment else None,
            sentiment_label=self.sentiment.label if self.sentiment else None,
            title_embedding=self.title_embedding,
            content_embedding=self.content_embedding,
            sentiment_embedding=self.sentiment_embedding,
        )
```
- `tradingagents/domains/news/sentiment_service.py` (New File):
  - This new service will encapsulate the logic for calling the LLM and generating embeddings.
  - Primary method: `get_sentiment_with_embeddings(article_content: str) -> SentimentScoreWithEmbeddings`.
  - It will use the `quick_think_llm` from the `TradingAgentsConfig` for performance.
  - It will use a structured prompt to ask the LLM to return a JSON object with `score`, `confidence`, and `label`.
  - **Embedding Generation**: Generate multiple embeddings using OpenAI's embedding API:
    - `title_embedding`: Vector representation of the article title (1536 dims)
    - `content_embedding`: Vector representation of the full article content (1536 dims)
    - `sentiment_embedding`: Smaller specialized sentiment vector using sentence-transformers (384 dims)
  - **Vector Similarity**: Enable semantic search for similar articles and sentiment clustering
- **Database:**
  - **PostgreSQL + SQLAlchemy + pgvector Integration:**
    - Replace file-based storage with a PostgreSQL database using the SQLAlchemy ORM
    - Create new SQLAlchemy models for news articles with proper relationships
    - Implement database migrations using Alembic
    - Add connection pooling and transaction management
    - Integrate the pgvector extension for high-dimensional sentiment embedding storage
    - Enable semantic similarity search and vector-based sentiment clustering
  - **Database Schema Design:**
    - `news_articles` table with columns for article data, sentiment scores, embeddings, and metadata
    - `news_sources` table for source information and credibility tracking
    - `news_categories` table for article categorization
    - Embedding columns stored directly on `news_articles` via pgvector (no separate embeddings table)
    - Proper indexing for symbol, date, and source queries, and for vector similarity searches
    - Foreign key relationships between articles, sources, and categories
### API Specification
- No external API changes. All modifications will be internal to the `NewsService` and the cron job that calls it.
### Security & Performance
- **Security:** LLM API keys will continue to be managed through the `TradingAgentsConfig` and environment variables. No new security risks are introduced.
- **Performance:** The scraping and sentiment analysis process is I/O and network-bound. This will run as part of the daily cron job, so it will not impact the performance of the trading agents' decision-making process, which will read from the cached data.
### Database Schema Design
#### Core Tables
```sql
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- News sources for credibility tracking
CREATE TABLE news_sources (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE,
    domain VARCHAR(255),
    credibility_score DECIMAL(3,2) DEFAULT 0.5,  -- 0.0 to 1.0
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- News categories for article classification
CREATE TABLE news_categories (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE,
    description TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Main articles table
CREATE TABLE news_articles (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT,
    author VARCHAR(255),
    symbol VARCHAR(10),  -- Stock ticker, nullable for global news
    source_id INTEGER REFERENCES news_sources(id),
    category_id INTEGER REFERENCES news_categories(id),
    url TEXT UNIQUE NOT NULL,
    published_date DATE NOT NULL,
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    -- Sentiment analysis
    sentiment_score DECIMAL(3,2),       -- -1.0 to 1.0
    sentiment_confidence DECIMAL(3,2),  -- 0.0 to 1.0
    sentiment_label VARCHAR(20),        -- positive/negative/neutral
    sentiment_analyzed_at TIMESTAMP,
    -- Vector embeddings for semantic analysis
    title_embedding vector(1536),       -- OpenAI ada-002 embedding dimension
    content_embedding vector(1536),     -- Full article content embedding
    sentiment_embedding vector(384),    -- Sentence-transformer for sentiment
    embedding_model VARCHAR(50) DEFAULT 'text-embedding-ada-002',
    embedded_at TIMESTAMP,
    -- Metadata
    content_length INTEGER,
    scrape_status VARCHAR(20) DEFAULT 'SUCCESS',  -- SUCCESS, FAILED, ARCHIVE_SUCCESS
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- No separate sentiment_embeddings table:
-- all embeddings are stored directly in news_articles for simplicity and performance

-- Performance indexes
CREATE INDEX idx_news_articles_symbol_date ON news_articles(symbol, published_date);
CREATE INDEX idx_news_articles_published_date ON news_articles(published_date);
CREATE INDEX idx_news_articles_source ON news_articles(source_id);
CREATE INDEX idx_news_articles_sentiment ON news_articles(sentiment_score, sentiment_confidence);
CREATE INDEX idx_news_articles_url_hash ON news_articles USING HASH(url);

-- Vector similarity indexes using HNSW (Hierarchical Navigable Small World)
-- Note: HNSW indexes consume significant memory (2-4x vector storage)
CREATE INDEX idx_articles_title_embedding ON news_articles
    USING hnsw (title_embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);  -- Tuned for performance vs memory
CREATE INDEX idx_articles_content_embedding ON news_articles
    USING hnsw (content_embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
CREATE INDEX idx_articles_sentiment_embedding ON news_articles
    USING hnsw (sentiment_embedding vector_cosine_ops)
    WITH (m = 8, ef_construction = 32);  -- Smaller index for sentiment vectors
```
#### SQLAlchemy Models
```python
# tradingagents/domains/news/models.py
from datetime import datetime

from sqlalchemy import (
    Column, Date, DateTime, ForeignKey, Integer, Numeric, String, Text,
)
from sqlalchemy.orm import declarative_base, relationship
from pgvector.sqlalchemy import Vector

Base = declarative_base()


class NewsSource(Base):
    __tablename__ = 'news_sources'

    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False, unique=True)
    domain = Column(String(255))
    credibility_score = Column(Numeric(3, 2), default=0.5)
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Relationships
    articles = relationship("NewsArticle", back_populates="source")


class NewsCategory(Base):
    __tablename__ = 'news_categories'

    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False, unique=True)
    description = Column(Text)
    created_at = Column(DateTime, default=datetime.utcnow)

    # Relationships
    articles = relationship("NewsArticle", back_populates="category")


class NewsArticle(Base):
    __tablename__ = 'news_articles'

    id = Column(Integer, primary_key=True)
    title = Column(Text, nullable=False)
    content = Column(Text)
    author = Column(String(255))
    symbol = Column(String(10))  # Nullable for global news
    source_id = Column(Integer, ForeignKey('news_sources.id'))
    category_id = Column(Integer, ForeignKey('news_categories.id'))
    url = Column(Text, unique=True, nullable=False)
    published_date = Column(Date, nullable=False)
    scraped_at = Column(DateTime, default=datetime.utcnow)

    # Sentiment fields
    sentiment_score = Column(Numeric(3, 2))       # -1.0 to 1.0
    sentiment_confidence = Column(Numeric(3, 2))  # 0.0 to 1.0
    sentiment_label = Column(String(20))          # positive/negative/neutral
    sentiment_analyzed_at = Column(DateTime)

    # Vector embeddings using pgvector
    title_embedding = Column(Vector(1536))     # OpenAI ada-002 dimensions
    content_embedding = Column(Vector(1536))   # Full content embedding
    sentiment_embedding = Column(Vector(384))  # Sentence transformer for sentiment
    embedding_model = Column(String(50), default='text-embedding-ada-002')
    embedded_at = Column(DateTime)

    # Metadata
    content_length = Column(Integer)
    scrape_status = Column(String(20), default='SUCCESS')
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Relationships
    source = relationship("NewsSource", back_populates="articles")
    category = relationship("NewsCategory", back_populates="articles")

# No separate SentimentEmbedding model: embeddings live on NewsArticle (see schema note above)
```
#### Database Migration Strategy
**Alembic Configuration:**
```python
# alembic/env.py
from alembic import context

from tradingagents.config import TradingAgentsConfig
from tradingagents.domains.news.models import Base

target_metadata = Base.metadata

# Inject the database URL from application config into Alembic's own config object
app_config = TradingAgentsConfig.from_env()
context.config.set_main_option("sqlalchemy.url", app_config.database_url)
```
**Initial Migration:**
```bash
# Initialize Alembic in the project
alembic init alembic
# Generate initial migration
alembic revision --autogenerate -m "Create news tables"
# Apply migration
alembic upgrade head
```
**Migration Files:**
- `001_enable_pgvector.py` - Enable pgvector extension
- `002_create_news_tables.py` - Initial schema creation with vector fields
- `003_add_vector_indexes.py` - HNSW indexes for vector similarity
- `004_seed_categories_sources.py` - Seed default categories and trusted sources
**TradingAgentsConfig Extension:**
```python
@dataclass
class TradingAgentsConfig:
    # ... existing fields ...

    # Database configuration
    database_url: str = field(default_factory=lambda: os.getenv("DATABASE_URL", ""))
    database_pool_size: int = field(default_factory=lambda: int(os.getenv("DATABASE_POOL_SIZE", "10")))
    database_max_overflow: int = field(default_factory=lambda: int(os.getenv("DATABASE_MAX_OVERFLOW", "20")))
    database_echo: bool = field(default_factory=lambda: os.getenv("DATABASE_ECHO", "false").lower() == "true")

    # Vector configuration
    enable_vector_search: bool = field(default_factory=lambda: os.getenv("ENABLE_VECTOR_SEARCH", "true").lower() == "true")
    embedding_model: str = field(default_factory=lambda: os.getenv("EMBEDDING_MODEL", "text-embedding-ada-002"))
    embedding_batch_size: int = field(default_factory=lambda: int(os.getenv("EMBEDDING_BATCH_SIZE", "100")))
    enable_sentence_transformers: bool = field(default_factory=lambda: os.getenv("ENABLE_SENTENCE_TRANSFORMERS", "true").lower() == "true")

    @property
    def has_database_config(self) -> bool:
        """Check if database is properly configured."""
        return bool(self.database_url and self.database_url.startswith("postgresql://"))

    @property
    def embedding_provider(self) -> str:
        """Get embedding provider from LLM provider setting."""
        # Map LLM providers to their embedding providers
        llm_provider = getattr(self, 'llm_provider', 'openai')
        embedding_map = {
            'openai': 'openai',
            'google': 'google',     # Use Gemini for embeddings when Google is selected
            'anthropic': 'openai',  # Anthropic doesn't offer embeddings, use OpenAI
            'ollama': 'openai',     # Local models, use OpenAI for embeddings
        }
        return embedding_map.get(llm_provider, 'openai')


def validate_database_config(config: TradingAgentsConfig) -> None:
    """Validate database configuration before startup."""
    if not config.has_database_config:
        raise ValueError("DATABASE_URL must be set for PostgreSQL integration")
    if config.enable_vector_search and not config.has_database_config:
        raise ValueError("Vector search requires PostgreSQL database configuration")
```
**Environment Variables:**
```bash
# Database configuration (required)
DATABASE_URL=postgresql://username:password@localhost:5432/tradingagents
DATABASE_POOL_SIZE=10 # optional, defaults to 10
DATABASE_MAX_OVERFLOW=20 # optional, defaults to 20
DATABASE_ECHO=false # optional, set to true for SQL debugging
# Vector configuration (optional)
ENABLE_VECTOR_SEARCH=true # optional, defaults to true
EMBEDDING_MODEL=google/gemini-2.5-flash # Use Gemini via OpenRouter for embeddings
EMBEDDING_BATCH_SIZE=100 # optional
ENABLE_SENTENCE_TRANSFORMERS=true # optional
# Example configurations by provider:
# For OpenAI: EMBEDDING_MODEL=text-embedding-ada-002
# For Gemini: EMBEDDING_MODEL=google/gemini-2.5-flash (via OpenRouter)
```
#### Embedding Generation Service Design
**SentimentScore Enhancement:**
```python
@dataclass
class SentimentScoreWithEmbeddings:
    """Enhanced sentiment analysis with vector embeddings."""

    score: float       # -1.0 to 1.0
    confidence: float  # 0.0 to 1.0
    label: str         # positive/negative/neutral

    # Vector embeddings
    title_embedding: List[float]    # 1536 dimensions
    content_embedding: List[float]  # 1536 dimensions
    sentiment_embedding: List[float] | None  # 384 dimensions; None when sentence-transformers is disabled
    embedding_model: str = "text-embedding-ada-002"
```
**Service Implementation:**
```python
import asyncio
import json
import logging
import os
from typing import List

from openai import AsyncOpenAI
from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)


class EmbeddingProvider:
    """Abstract base for embedding providers."""

    model: str = ""

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        raise NotImplementedError


class OpenAIEmbeddingProvider(EmbeddingProvider):
    def __init__(self, api_key: str, model: str = "text-embedding-ada-002"):
        self.client = AsyncOpenAI(api_key=api_key)
        self.model = model

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        response = await self.client.embeddings.create(
            input=texts,
            model=self.model,
        )
        return [item.embedding for item in response.data]


class GeminiEmbeddingProvider(EmbeddingProvider):
    def __init__(self, api_key: str, base_url: str = "https://openrouter.ai/api/v1"):
        self.client = AsyncOpenAI(api_key=api_key, base_url=base_url)
        self.model = "google/gemini-2.5-flash"

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        # Gemini via OpenRouter - batch embeddings
        response = await self.client.embeddings.create(
            input=texts,
            model=self.model,
        )
        return [item.embedding for item in response.data]


class SentimentAnalysisService:
    def __init__(self, config: TradingAgentsConfig):
        self.llm_client = self._get_llm_client(config)
        self.embedding_provider = self._get_embedding_provider(config)
        self.sentence_transformer = (
            SentenceTransformer('all-MiniLM-L6-v2')
            if config.enable_sentence_transformers
            else None
        )

    def _get_embedding_provider(self, config: TradingAgentsConfig) -> EmbeddingProvider:
        """Get appropriate embedding provider based on configuration."""
        provider = config.embedding_provider
        if provider == 'openai':
            return OpenAIEmbeddingProvider(
                api_key=os.getenv('OPENAI_API_KEY'),
                model=config.embedding_model,
            )
        if provider == 'google':
            return GeminiEmbeddingProvider(
                api_key=os.getenv('OPENAI_API_KEY'),  # OpenRouter key
                base_url="https://openrouter.ai/api/v1",
            )
        # Default to OpenAI
        return OpenAIEmbeddingProvider(
            api_key=os.getenv('OPENAI_API_KEY'),
            model=config.embedding_model,
        )

    async def get_sentiment_with_embeddings(
        self,
        title: str,
        content: str,
    ) -> SentimentScoreWithEmbeddings:
        """Generate sentiment analysis with vector embeddings - optimized for performance."""
        # 1. Parallel processing: sentiment score + embeddings
        tasks = [
            self._get_sentiment_score(content),                        # LLM sentiment analysis
            self.embedding_provider.get_embeddings([title, content]),  # Batch embedding API call
        ]
        sentiment, embeddings = await asyncio.gather(*tasks)
        title_embedding, content_embedding = embeddings

        # 2. Generate local sentiment embedding if enabled
        sentiment_embedding = None
        if self.sentence_transformer:
            sentiment_embedding = self.sentence_transformer.encode(content).tolist()

        return SentimentScoreWithEmbeddings(
            score=sentiment.score,
            confidence=sentiment.confidence,
            label=sentiment.label,
            title_embedding=title_embedding,
            content_embedding=content_embedding,
            sentiment_embedding=sentiment_embedding,
            embedding_model=self.embedding_provider.model,
        )

    async def _get_sentiment_score(self, content: str) -> SentimentScore:
        """Generate sentiment score using LLM with financial news prompt."""
        prompt = """
        Analyze the sentiment of this financial news article for trading purposes.

        Article Content: {content}

        Provide your analysis in the following JSON format:
        {{
            "score": <float between -1.0 (very negative) and 1.0 (very positive)>,
            "confidence": <float between 0.0 and 1.0>,
            "label": <"positive", "negative", or "neutral">,
            "reasoning": <brief explanation>,
            "key_themes": <list of key financial themes>,
            "financial_entities": <list of mentioned companies/tickers>
        }}

        Focus on the financial and market implications of the news.
        Consider impact on stock prices, market sentiment, and trading decisions.
        """.format(content=content[:2000])  # Limit content length

        response = await self.llm_client.complete(prompt)
        try:
            result = json.loads(response)
            return SentimentScore(
                score=result.get("score", 0.0),
                confidence=result.get("confidence", 0.5),
                label=result.get("label", "neutral"),
                metadata={
                    "reasoning": result.get("reasoning", ""),
                    "key_themes": result.get("key_themes", []),
                    "financial_entities": result.get("financial_entities", []),
                },
            )
        except Exception as e:
            # Return neutral sentiment on error
            return SentimentScore(
                score=0.0,
                confidence=0.0,
                label="neutral",
                metadata={"error": str(e)},
            )

    def find_similar_articles(
        self,
        embedding: List[float],
        limit: int = 10,
        similarity_threshold: float = 0.8,
    ) -> List[NewsArticle]:
        """Find semantically similar articles using vector similarity."""
        # Use pgvector cosine similarity search
        pass

    async def batch_analyze_sentiment(
        self,
        articles: List[ArticleData],
        batch_size: int = 5,
    ) -> List[SentimentScoreWithEmbeddings]:
        """Batch process sentiment analysis and embedding generation.

        Args:
            articles: List of articles to analyze
            batch_size: Number of articles to process concurrently

        Returns:
            List of sentiment scores with embeddings
        """
        results = []
        for i in range(0, len(articles), batch_size):
            batch = articles[i:i + batch_size]
            # Process batch concurrently
            batch_tasks = [
                self.get_sentiment_with_embeddings(article.title, article.content)
                for article in batch
            ]
            batch_results = await asyncio.gather(*batch_tasks, return_exceptions=True)
            for result in batch_results:
                if isinstance(result, Exception):
                    # Handle individual failures gracefully
                    logger.error(f"Sentiment analysis failed: {result}")
                    results.append(self._get_neutral_sentiment_with_embeddings())
                else:
                    results.append(result)
            # Rate limiting: add delay between batches
            if i + batch_size < len(articles):
                await asyncio.sleep(1.0)  # 1 second delay between batches
        return results
```
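The index arithmetic inside `batch_analyze_sentiment` is just fixed-size chunking; isolated as a plain helper it is easier to reuse and test:

```python
# The i / i+batch_size slicing loop from batch_analyze_sentiment, as a generator.
from typing import Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: List[T], size: int) -> Iterator[List[T]]:
    """Yield successive slices of at most `size` items, preserving order."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With this helper, the batch loop reduces to `for batch in chunked(articles, batch_size): ...` followed by the per-batch sleep.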
**Optimized Vector Similarity Queries:**
```sql
-- Find articles similar to a given title embedding (HNSW optimized)
-- Note: filtering on the distance expression in WHERE defeats HNSW indexing,
-- so fetch nearest candidates first, then apply the threshold on the outer query.
SELECT id, title, symbol, distance, 1 - distance AS similarity
FROM (
    SELECT id, title, symbol,
           title_embedding <=> %s AS distance
    FROM news_articles
    WHERE title_embedding IS NOT NULL  -- only filter on non-null vectors
    ORDER BY title_embedding <=> %s
    LIMIT 20                           -- get more candidates than needed
) candidates
WHERE distance < 0.2;                  -- apply the threshold after the index scan

-- Find articles with similar sentiment patterns (pre-filter by label for efficiency)
SELECT id, title, sentiment_label,
       sentiment_embedding <=> %s AS distance
FROM news_articles
WHERE sentiment_label = %s  -- Filter first by indexed column
  AND sentiment_embedding IS NOT NULL
ORDER BY sentiment_embedding <=> %s
LIMIT 15;

-- Cluster articles by content similarity for a ticker (optimized approach)
WITH similar_articles AS (
    SELECT id, symbol, sentiment_score,
           content_embedding <=> %s AS distance
    FROM news_articles
    WHERE symbol = %s  -- Use indexed column first
      AND content_embedding IS NOT NULL
    ORDER BY content_embedding <=> %s
    LIMIT 50  -- Limit search space
)
SELECT symbol,
       AVG(sentiment_score) AS avg_sentiment,
       COUNT(*) AS article_count,
       AVG(distance) AS avg_content_distance
FROM similar_articles
WHERE distance < 0.3  -- Apply similarity threshold after vector search
GROUP BY symbol;

-- Performance monitoring query
SELECT schemaname,
       tablename,
       attname AS column_name,
       n_distinct,
       correlation
FROM pg_stats
WHERE tablename = 'news_articles'
  AND attname LIKE '%embedding%';
```
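pgvector's `<=>` operator returns cosine distance, i.e. `1 - cosine similarity`, which is why the queries above derive similarity as `1 - distance`. A quick stdlib check of that identity:

```python
# Cosine distance as computed by pgvector's <=> operator: 1 - cosine similarity.
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Returns 0.0 for identical directions, 1.0 for orthogonal, 2.0 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

So the `distance < 0.2` threshold above corresponds to a cosine similarity above 0.8.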
**Memory Usage Estimation:**
```sql
-- Estimate on-disk size of the HNSW indexes and the table itself
SELECT
    pg_size_pretty(pg_total_relation_size('idx_articles_title_embedding')) AS title_index_size,
    pg_size_pretty(pg_total_relation_size('idx_articles_content_embedding')) AS content_index_size,
    pg_size_pretty(pg_total_relation_size('idx_articles_sentiment_embedding')) AS sentiment_index_size,
    pg_size_pretty(pg_total_relation_size('news_articles')) AS table_size;
-- Expected memory usage: 500 MB-1 GB for 10K articles with 3 embedding types
```
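The 500 MB-1 GB figure can be sanity-checked from first principles: pgvector stores embeddings as 4-byte float components, and the schema note above assumes 2-4x HNSW index overhead on top of raw vector storage:

```python
# Back-of-the-envelope storage estimate for the three embedding columns.
BYTES_PER_DIM = 4  # pgvector stores float4 components


def raw_vector_bytes(articles: int, dims: tuple[int, ...]) -> int:
    """Raw embedding storage in bytes, ignoring row and page overhead."""
    return articles * sum(dims) * BYTES_PER_DIM


raw = raw_vector_bytes(10_000, (1536, 1536, 384))
# 10K articles * (1536 + 1536 + 384) dims * 4 bytes = 138,240,000 bytes, i.e. ~138 MB
# of raw vectors; with 2-4x HNSW index overhead on top, the total lands roughly in
# the 0.4-0.7 GB range, consistent with the 500 MB-1 GB estimate above.
```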
### Current Implementation Status
**✅ COMPLETED COMPONENTS:**
1. **NewsService Core Structure (90% Complete)**
   - ✅ Core service class with dependency injection
   - ✅ Read path implemented: `get_company_news_context()`, `get_global_news_context()`
   - ✅ Write path implemented: `update_company_news()`, `update_global_news()`
   - ✅ Repository integration with file-based storage
   - ✅ ArticleData model conversion from repository NewsArticle
   - ✅ Simple keyword-based sentiment analysis as fallback
   - ✅ Error handling and empty context returns
   - ✅ Trending topics extraction
   - ✅ Date validation and ISO format handling
2. **NewsRepository (100% Complete)**
   - ✅ File-based storage with JSON serialization
   - ✅ Source separation (finnhub, google_news)
   - ✅ Date-based file organization (YYYY-MM-DD.json)
   - ✅ Article deduplication by URL
   - ✅ Batch storage operations
   - ✅ Complete CRUD operations
   - ✅ Proper error handling and logging
3. **Data Models (100% Complete)**
   - ✅ ArticleData dataclass with sentiment field
   - ✅ NewsContext and GlobalNewsContext for agent consumption
   - ✅ SentimentScore model
   - ✅ NewsUpdateResult for operation tracking
   - ✅ DataQuality enum for metadata
4. **GoogleNewsClient (100% Complete)**
   - ✅ RSS feed parsing with feedparser
   - ✅ Company news method implemented (`get_company_news()`)
   - ✅ Global news method implemented (`get_global_news()`)
   - ✅ Proper error handling and logging
   - ✅ Google News RSS URL construction
   - ✅ Article parsing with source extraction
   - ✅ Date parsing with fallback handling
5. **ArticleScraperClient (100% Complete)**
   - ✅ Full newspaper3k content extraction
   - ✅ Internet Archive Wayback Machine fallback
   - ✅ Robust error handling for failed scrapes
   - ✅ Content validation (minimum length checks)
   - ✅ Multiple article batch processing
   - ✅ Rate limiting with configurable delays
   - ✅ Proper URL validation
**❌ MISSING COMPONENTS:**
6. **LLM Sentiment Analysis Service (0% Complete)**
   - ❌ SentimentAnalysisService class not created
   - ❌ LLM integration not implemented
   - ❌ Financial news prompts not defined
   - ❌ Batch processing not implemented
   - **Current**: Using simple keyword-based fallback
   - **Next**: Create dedicated sentiment service
7. **Database Migration (0% Complete)**
   - ❌ SQLAlchemy models not created
   - ❌ PostgreSQL integration not started
   - ❌ pgvector extension not configured
   - ❌ Alembic migrations not set up
   - **Current**: Using file-based storage
   - **Status**: Planned for future iteration
8. **Vector Embeddings (0% Complete)**
   - ❌ Embedding providers not implemented
   - ❌ Vector similarity not available
   - ❌ Semantic search not implemented
   - **Status**: Advanced feature for future enhancement
### Revised Implementation Phases
**PHASE 1: Complete Core Functionality (Largely Complete; LLM sentiment service remaining)**
- **GoogleNewsClient RSS Implementation (2-3 days)**
  - Implement feedparser RSS parsing
  - Add company news and global news methods
  - Handle RSS feed errors and edge cases
  - Create comprehensive tests with VCR cassettes
- **ArticleScraperClient Implementation (2-3 days)**
  - Implement newspaper3k content extraction
  - Add Internet Archive fallback mechanism
  - Handle paywalls and extraction failures
  - Create scraping tests with mock responses
- **LLM Sentiment Analysis Service (3-4 days)**
  - Create SentimentAnalysisService class
  - Implement LLM client integration using TradingAgentsConfig
  - Design financial news sentiment prompts
  - Add batch processing with rate limiting
  - Replace keyword-based sentiment in NewsService
**PHASE 2: Testing and Refinement (Current Phase)**
- **Integration Testing (1-2 days)**
  - End-to-end testing with real RSS feeds
  - Test article scraping and sentiment analysis pipeline
  - Verify error handling and partial failures
  - Performance testing with multiple tickers
- **Type Safety and Quality (1 day)**
  - Ensure `mise run typecheck` passes with 0 errors
  - Fix any remaining linting issues
  - Add missing docstrings and type hints
**PHASE 3: Future Enhancements (Deferred)**
- **Database Migration**: SQLAlchemy + PostgreSQL + pgvector
- **Vector Embeddings**: Semantic similarity and clustering
- **Performance Optimization**: Caching improvements and batch processing
## Testing Plan
### Test Strategy
- **Unit Testing:** Test individual components in isolation with mocked dependencies
- **Integration Testing:** Test component interactions and data flow
- **End-to-End Testing:** Test complete workflows from news fetching to storage
### Unit Tests
#### GoogleNewsClient Tests
- **Location:** `tests/domains/news/test_google_news_client.py`
- **Framework:** `pytest` with `pytest-vcr` for HTTP recording/replay
- **VCR Cassettes:** `tests/fixtures/vcr_cassettes/google_news/`
- **Test Cases:**
  - `@pytest.mark.vcr` `test_get_news_by_symbol_success()` - Valid symbol returns articles
  - `@pytest.mark.vcr` `test_get_news_by_symbol_invalid_symbol()` - Invalid symbol handling
  - `@pytest.mark.vcr` `test_get_global_news_success()` - Global news retrieval
  - `@pytest.mark.vcr` `test_get_global_news_empty_response()` - Empty RSS feed handling
  - `test_rss_feed_parsing_error()` - Malformed RSS handling (mocked)
  - `test_network_timeout()` - Network timeout scenarios (mocked)
  - `test_rate_limiting()` - Rate limit compliance (mocked)
#### ArticleScraperClient Tests
- **Location:** `tests/domains/news/test_article_scraper_client.py`
- **Framework:** `pytest` with `pytest-vcr` for HTTP recording/replay
- **VCR Cassettes:** `tests/fixtures/vcr_cassettes/article_scraper/`
- **Test Cases:**
  - `@pytest.mark.vcr` `test_scrape_article_success()` - Successful article scraping
  - `@pytest.mark.vcr` `test_scrape_article_archive_fallback()` - Wayback Machine fallback
  - `test_scrape_article_both_fail()` - Both methods fail gracefully (mocked)
  - `test_invalid_url()` - Invalid URL handling (mocked)
  - `@pytest.mark.vcr` `test_content_extraction()` - Content parsing accuracy
#### SentimentAnalysisService Tests
- **Location:** `tests/domains/news/test_sentiment_service.py`
- **Test Cases:**
- `test_get_sentiment_positive()` - Positive sentiment detection
- `test_get_sentiment_negative()` - Negative sentiment detection
- `test_get_sentiment_neutral()` - Neutral sentiment detection
- `test_get_sentiment_llm_error()` - LLM API error handling
- `test_get_sentiment_invalid_response()` - Invalid JSON response handling
- `test_get_sentiment_empty_content()` - Empty content handling
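The error-handling cases can be exercised without a live LLM by injecting a fake callable. The JSON contract and class shape below are assumptions for illustration; the planned service will use TradingAgentsConfig rather than a raw callable:

```python
import json

class SentimentAnalysisService:
    """Sketch (assumed interface): the LLM is prompted to return JSON like
    {"sentiment": "positive", "score": 0.8}; anything else degrades to neutral."""
    NEUTRAL = {"sentiment": "neutral", "score": 0.0}

    def __init__(self, llm_call):
        self._llm_call = llm_call  # callable: article text -> raw LLM reply

    def get_sentiment(self, content: str) -> dict:
        if not content.strip():
            return dict(self.NEUTRAL)  # empty content: nothing to analyze
        try:
            parsed = json.loads(self._llm_call(content))
            if parsed.get("sentiment") not in {"positive", "negative", "neutral"}:
                raise ValueError("unexpected sentiment label")
            return parsed
        except Exception:
            return dict(self.NEUTRAL)  # API errors and bad JSON never propagate

def test_get_sentiment_invalid_response():
    service = SentimentAnalysisService(lambda _: "not valid json")
    assert service.get_sentiment("Shares rallied today.")["sentiment"] == "neutral"

def test_get_sentiment_empty_content():
    service = SentimentAnalysisService(lambda _: '{"sentiment": "positive", "score": 0.9}')
    assert service.get_sentiment("   ")["sentiment"] == "neutral"
```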
#### NewsService Tests
- **Location:** `tests/domains/news/test_news_service.py`
- **Test Cases:**
- `test_update_company_news_success()` - Complete news update workflow
- `test_update_company_news_no_articles()` - No articles found scenario
- `test_update_company_news_scraping_failure()` - Partial scraping failures
- `test_sentiment_analysis_integration()` - Sentiment analysis integration
- `test_calculate_sentiment_summary()` - Sentiment aggregation logic
- `test_get_company_news_by_date()` - News retrieval by date
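For `test_calculate_sentiment_summary()`, the aggregation under test can be sketched as a majority label plus mean score. This is a hypothetical rule for illustration; the shipped NewsService may weight recency or sources differently:

```python
from collections import Counter

def calculate_sentiment_summary(article_sentiments):
    """Aggregate per-article sentiments into one daily summary (assumed rule)."""
    if not article_sentiments:
        return {"label": "neutral", "avg_score": 0.0, "article_count": 0}
    labels = Counter(s["sentiment"] for s in article_sentiments)
    avg = sum(s["score"] for s in article_sentiments) / len(article_sentiments)
    return {
        "label": labels.most_common(1)[0][0],  # majority sentiment label
        "avg_score": round(avg, 3),
        "article_count": len(article_sentiments),
    }
```

A test then asserts the expected label and average for a small fixed fixture, plus the empty-input edge case.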
#### NewsRepository Tests
- **Location:** `tests/domains/news/test_news_repository.py`
- **Test Cases:**
- `test_store_news_articles()` - Article storage
- `test_get_news_by_symbol_and_date()` - News retrieval
- `test_duplicate_article_handling()` - Duplicate prevention
- `test_data_persistence()` - File system persistence
- `test_invalid_data_handling()` - Invalid data rejection
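The deduplication contract behind `test_duplicate_article_handling()` can be stated with an in-memory sketch. Keying by `(symbol, url)` is an assumption here; the real repository persists to files but is expected to follow the same no-op-on-duplicate rule:

```python
class InMemoryNewsRepository:
    """In-memory sketch of the dedup contract: articles are keyed by
    (symbol, url), so re-storing the same article is a no-op."""
    def __init__(self):
        self._articles = {}

    def store_news_articles(self, symbol: str, articles: list) -> int:
        stored = 0
        for article in articles:
            key = (symbol, article["url"])
            if key not in self._articles:
                self._articles[key] = article
                stored += 1  # count only newly stored articles
        return stored

def test_duplicate_article_handling():
    repo = InMemoryNewsRepository()
    article = {"url": "https://example.com/a", "title": "AAPL gains"}
    assert repo.store_news_articles("AAPL", [article]) == 1
    assert repo.store_news_articles("AAPL", [article]) == 0  # duplicate ignored
```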
### Integration Tests
#### News Workflow Integration
- **Location:** `tests/integration/test_news_workflow.py`
- **Test Cases:**
- `test_full_news_update_workflow()` - Complete end-to-end workflow
- `test_news_service_with_real_clients()` - Real client integration
- `test_sentiment_service_integration()` - LLM integration testing
- `test_repository_integration()` - Data persistence integration
### End-to-End Tests
#### Complete System Tests
- **Location:** `tests/e2e/test_news_system.py`
- **Test Cases:**
- `test_daily_news_update_simulation()` - Simulate daily cron job
- `test_trading_agent_news_consumption()` - Agent news retrieval
- `test_system_performance_with_multiple_tickers()` - Performance testing
- `test_error_recovery_scenarios()` - System resilience testing
### Test Data Management
#### Mock Data Strategy
- **RSS Feed Samples:** Saved sample RSS responses for consistent testing
- **Article Content:** Pre-scraped article content for sentiment testing
- **LLM Responses:** Mock sentiment analysis responses for unit tests
#### Test Configuration
- **Environment Variables:** Separate test configuration
- **Database Isolation:** Temporary test databases
- **VCR Configuration:** Record/replay HTTP interactions for deterministic tests
- **Pytest Configuration:** `pytest.ini` with VCR settings and test markers
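A minimal `pytest.ini` for this setup might look like the following; the paths and options are assumptions to adjust to the repo layout:

```ini
# pytest.ini sketch (assumed paths/options; cassette settings themselves
# typically live in a vcr_config fixture in conftest.py)
[pytest]
testpaths = tests
addopts = -q
markers =
    vcr: record/replay HTTP interactions via pytest-vcr
```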
### Performance Testing
#### Load Testing
- **Concurrent News Updates:** Test multiple ticker updates simultaneously
- **Memory Usage:** Monitor memory consumption during batch processing
- **API Rate Limiting:** Verify rate limit compliance under load
#### Benchmarking
- **Scraping Speed:** Measure article scraping performance
- **Sentiment Analysis:** Measure LLM response times
- **Storage Performance:** Database write/read performance
### Test Automation
#### CI/CD Integration
- **Pre-commit Hooks:** Run fast unit tests before commits
- **Pull Request Checks:** Full test suite on PR creation
- **Nightly Tests:** End-to-end tests with real data
#### Test Coverage Requirements
- **Minimum Coverage:** 80% line coverage for all components
- **Critical Path Coverage:** 100% coverage for core business logic
- **Error Handling Coverage:** All exception paths tested
### Manual Testing Scenarios
#### Smoke Tests
- **Daily Operations:** Manual verification of daily news updates
- **Data Quality:** Spot-check sentiment analysis accuracy
- **System Health:** Monitor error rates and performance metrics
#### Acceptance Testing
- **Trading Agent Integration:** Verify agents can consume news data effectively
- **Data Accuracy:** Validate news relevance and sentiment accuracy
- **Performance Benchmarks:** Confirm system meets performance requirements
## Current Implementation Status Summary
### Overall Progress: 90% Complete 🎉
**✅ COMPLETED (100%)**
- Requirements analysis and technical design
- NewsService core structure with read/write paths
- NewsRepository with file-based storage and deduplication
- Data models (ArticleData, NewsContext, SentimentScore)
- GoogleNewsClient with full RSS feed parsing
- ArticleScraperClient with newspaper3k + Internet Archive fallback
- Basic sentiment analysis (keyword-based fallback)
- Error handling and validation
- Service integration and dependency injection
**❌ MISSING (10%)**
- LLM sentiment analysis service (only remaining core component)
**⏸️ DEFERRED (Future Iterations)**
- Database migration to PostgreSQL + SQLAlchemy
- Vector embeddings and semantic search
- Real-time news streaming capabilities
### What's Working Now
The current NewsService implementation provides:
- **Read Path**: Agents can successfully call `get_company_news_context()` and `get_global_news_context()`
- **Repository Integration**: Service reads cached news data from file-based NewsRepository
- **Data Transformation**: Converts NewsRepository.NewsArticle → ArticleData for agents
- **Basic Sentiment**: Simple keyword-based sentiment analysis as fallback
- **Error Handling**: Graceful error handling with empty contexts and metadata
- **Type Safety**: Proper type hints and dataclass definitions
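The keyword-based fallback mentioned above works roughly like this. The word lists and scoring are illustrative assumptions, not the shipped implementation:

```python
# Hypothetical keyword lists; the real fallback's vocabulary is assumed, not shown.
POSITIVE = {"beat", "surge", "gain", "upgrade", "rally", "record"}
NEGATIVE = {"miss", "drop", "downgrade", "loss", "lawsuit", "recall"}

def keyword_sentiment(text: str) -> float:
    """Score in [-1.0, 1.0]: +1 if only positive keywords hit, -1 if only negative."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    hits = pos + neg
    return 0.0 if hits == 0 else (pos - neg) / hits
```

This is the kind of shallow signal the planned LLM sentiment service is meant to replace: it cannot handle negation, context, or financial nuance.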
### What's Missing
The service is still missing:
- **LLM Sentiment Analysis**: No LLM integration for financial news sentiment (using keyword fallback)
- **Structured Storage**: Still using file-based storage instead of planned PostgreSQL + SQLAlchemy
- **Vector Embeddings**: No semantic similarity or vector-based features
### Critical Gap (Only 1 Remaining!)
1. **LLM Sentiment Service** - No structured sentiment analysis with LLM prompts
- Current: Simple keyword-based sentiment scoring
- Needed: LLM integration using TradingAgentsConfig
- Impact: Agents get basic sentiment but not sophisticated financial analysis
### Recently Discovered: Implementation is 90% Complete!
Upon detailed code review, the implementation is much further along than initially documented:
- ✅ **GoogleNewsClient** - Fully implemented with RSS parsing
- ✅ **ArticleScraperClient** - Complete with newspaper3k + Internet Archive fallback
- ✅ **NewsService** - Full read/write paths with proper error handling
- ✅ **NewsRepository** - Production-ready file-based storage
### Next Immediate Steps (Revised)
1. **✅ COMPLETE: GoogleNewsClient RSS parsing** - Already implemented with feedparser
2. **✅ COMPLETE: ArticleScraperClient** - Already implemented with newspaper3k + Internet Archive
3. **⏳ PRIORITY: Create LLM Sentiment Service** - Replace keyword-based analysis (2-3 days)
4. **⏳ PRIORITY: Integration testing** - End-to-end workflow validation (1-2 days)
### Timeline to MVP (Updated)
- **3-5 days** for LLM sentiment service + testing
- **Current system is production-ready** with basic sentiment analysis
- **Database migration** deferred to future iteration
- **Vector features** planned as advanced enhancement
### Implementation Priority
**HIGH PRIORITY (Required for sophisticated sentiment)**:
- LLM Sentiment Analysis Service with financial news prompts
**MEDIUM PRIORITY (System improvements)**:
- Better error handling and retry logic
- Performance optimization for batch processing
- Comprehensive integration test suite
**LOW PRIORITY (Future enhancements)**:
- PostgreSQL + SQLAlchemy migration
- Vector embeddings and semantic search
- Real-time news streaming