News Service PRD
Executive Summary
The News Service feature will provide up-to-date news sentiment analysis for stock market tickers to the TradingAgents framework. This service will enable agents to make more informed trading decisions based on current market news and sentiment.
Requirements
Target Users
- Trading Agents (News Analyst, Researchers, Trader Agent, Risk Management team)
- Cron Job system for daily updates
Problem Statement
Agents need up-to-date news sentiment when analyzing the stock market to make better trading decisions. Currently, they may be missing important news events or experiencing delays in sentiment analysis that could impact trading performance.
Success Metrics
- Impact on trading decision quality
User Stories
- As the Cron Job, I want to update and store the news with sentiment analysis for a ticker each day
- As a Trading Agent, I want to retrieve the news with sentiment analysis for a ticker and a day from the database
Out of Scope (v1)
- Real-time news streaming (vs daily updates)
- Multi-language news support
- Historical news sentiment analysis beyond a certain date range
- News source ranking or weighting
- Advanced filtering options
Timeline
MVP in 1 week
Status
✅ Requirements Complete | ✅ Technical Design Complete | 🔄 Implementation In Progress
Technical Design
Architecture
- The `NewsService` will be the central component, orchestrating the fetching, scraping, analysis, and storage of news articles (a sketch of this flow follows the list).
- It will utilize the existing `GoogleNewsClient` to fetch RSS feeds from Google News.
- The `ArticleScraperClient` will be enhanced to scrape full article content with robust fallback strategies:
  - Direct Fetch: Primary method using the `newspaper3k` library for content extraction
  - Archive Fallback: Internet Archive Wayback Machine fallback for failed fetches
  - Content Extraction: Clean text, title, publication date, and metadata extraction
  - Paywall Detection: Handle paywall-protected content gracefully
- A new `SentimentAnalysisService` will be created to handle the interaction with the configured LLM for structured sentiment analysis.
- The `NewsRepository` will store the news articles along with their sentiment scores in the existing file-based database.
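For concreteness, here is a minimal sketch of the orchestration under the interfaces described in this document; the exact method names on the clients (`get_company_news`, `scrape_article`, `save_articles`) are illustrative assumptions, not final signatures:

# Hedged sketch of the NewsService daily-update flow; component names follow
# this document, but the signatures are assumptions.
class NewsService:
    def __init__(self, news_client, scraper, sentiment_service, repository):
        self.news_client = news_client              # GoogleNewsClient
        self.scraper = scraper                      # ArticleScraperClient
        self.sentiment_service = sentiment_service  # SentimentAnalysisService
        self.repository = repository                # NewsRepository

    def update_company_news(self, symbol: str, date: str) -> None:
        # 1. Fetch RSS entries for the ticker
        entries = self.news_client.get_company_news(symbol)
        articles = []
        for entry in entries:
            # 2. Scrape full content (direct fetch, then archive fallback)
            article = self.scraper.scrape_article(entry.url)
            if article is None:
                continue  # paywall or scrape failure; skip gracefully
            # 3. Structured sentiment via the configured LLM
            article.sentiment = self.sentiment_service.get_sentiment(article.content)
            articles.append(article)
        # 4. Persist with deduplication by URL
        self.repository.save_articles(symbol, date, articles)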
Implementation Components
- Backend:
  - `tradingagents/domains/news/news_service.py`:
    - A new private method `_get_sentiment_for_article` will be added to call the `SentimentAnalysisService`.
    - The `update_company_news` method will be modified to call this new method for each scraped article.
    - `_calculate_sentiment_summary` will be updated to aggregate the new structured sentiment scores.
    - Update to work with the SQLAlchemy-based `NewsRepository` instead of file-based storage.
  - `tradingagents/domains/news/repository.py` (Enhanced with Compatibility Layer):
    - Replace file-based storage with SQLAlchemy ORM operations
    - Backward Compatibility: Maintain the existing interface with an adapter pattern
    - Implement new methods: `save_articles()`, `get_articles_by_symbol()`, `get_articles_by_date_range()` (a sketch of this interface follows the list)
    - Add transaction management and connection pooling
    - Include duplicate detection using URL uniqueness constraints
    - Add batch operations for efficient bulk inserts
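A minimal sketch of that repository interface, assuming the SQLAlchemy models and `ArticleData` compatibility layer defined below (the session-factory wiring and exact signatures are assumptions):

# Hedged sketch of the SQLAlchemy-backed NewsRepository.
from datetime import date
from typing import List

class NewsRepository:
    def __init__(self, session_factory):
        self.session_factory = session_factory  # e.g. sessionmaker(bind=engine)

    def save_articles(self, articles: List["ArticleData"]) -> int:
        """Bulk insert articles, skipping duplicates by URL."""
        saved = 0
        with self.session_factory() as session:
            for item in articles:
                exists = session.query(NewsArticle).filter_by(url=item.url).first()
                if exists:
                    continue  # duplicate detection via the unique URL constraint
                session.add(item.to_db_model(session))
                saved += 1
            session.commit()
        return saved

    def get_articles_by_symbol(self, symbol: str, day: date) -> List["ArticleData"]:
        """Fetch one day's articles for a ticker, converted for existing callers."""
        with self.session_factory() as session:
            rows = (
                session.query(NewsArticle)
                .filter(NewsArticle.symbol == symbol,
                        NewsArticle.published_date == day)
                .all()
            )
            return [ArticleData.from_db_model(row) for row in rows]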
Data Model Compatibility Strategy:
# Enhanced ArticleData to bridge existing and new models
from __future__ import annotations

from dataclasses import dataclass
from datetime import date as date_type
from decimal import Decimal
from typing import List

from sqlalchemy.orm import Session

@dataclass
class ArticleData:
    # Existing fields (maintain compatibility)
    title: str
    content: str
    author: str
    source: str  # Keep as string for existing code
    date: str  # YYYY-MM-DD format
    url: str
    sentiment: SentimentScore | None = None

    # New fields for enhanced functionality
    source_id: int | None = None  # Foreign key when available
    category_id: int | None = None  # Foreign key when available

    # Vector fields (optional for backward compatibility)
    title_embedding: List[float] | None = None
    content_embedding: List[float] | None = None
    sentiment_embedding: List[float] | None = None

    @classmethod
    def from_db_model(cls, article: NewsArticle) -> "ArticleData":
        """Convert database model to existing ArticleData format."""
        return cls(
            title=article.title,
            content=article.content or "",
            author=article.author or "",
            source=article.source.name if article.source else "Unknown",  # Flatten relationship
            date=article.published_date.isoformat(),
            url=article.url,
            sentiment=SentimentScore(
                score=float(article.sentiment_score) if article.sentiment_score else 0.0,
                confidence=float(article.sentiment_confidence) if article.sentiment_confidence else 0.0,
                label=article.sentiment_label or "neutral",
            ) if article.sentiment_score is not None else None,
            source_id=article.source_id,
            category_id=article.category_id,
            title_embedding=article.title_embedding,
            content_embedding=article.content_embedding,
            sentiment_embedding=article.sentiment_embedding,
        )

    def to_db_model(self, session: Session) -> NewsArticle:
        """Convert to database model, handling source lookup."""
        # Get or create source
        source = session.query(NewsSource).filter_by(name=self.source).first()
        if not source:
            source = NewsSource(name=self.source)
            session.add(source)
            session.flush()  # Get ID
        return NewsArticle(
            title=self.title,
            content=self.content,
            author=self.author,
            source_id=source.id,
            url=self.url,
            published_date=date_type.fromisoformat(self.date),  # aliased import avoids confusion with the `date` field
            sentiment_score=Decimal(str(self.sentiment.score)) if self.sentiment else None,
            sentiment_confidence=Decimal(str(self.sentiment.confidence)) if self.sentiment else None,
            sentiment_label=self.sentiment.label if self.sentiment else None,
            title_embedding=self.title_embedding,
            content_embedding=self.content_embedding,
            sentiment_embedding=self.sentiment_embedding,
        )
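A brief usage sketch of the round trip (the `SessionLocal` sessionmaker is an assumed fixture from the database setup; the article values are purely illustrative):

# Hedged usage sketch: round-tripping between ArticleData and the DB model.
with SessionLocal() as session:
    data = ArticleData(
        title="Acme beats earnings estimates",
        content="Acme Corp reported quarterly earnings above expectations...",
        author="Jane Doe",
        source="Example Wire",  # looked up or created in news_sources
        date="2024-05-01",
        url="https://example.com/acme-earnings",
    )
    row = data.to_db_model(session)  # resolves or creates the NewsSource row
    session.add(row)
    session.commit()

    restored = ArticleData.from_db_model(row)  # flattens the source back to a string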
- `tradingagents/domains/news/sentiment_service.py` (New File):
- This new service will encapsulate the logic for calling the LLM and generating embeddings.
- Primary method: `get_sentiment_with_embeddings(article_content: str) -> SentimentScoreWithEmbeddings`.
- It will use the `quick_think_llm` from the `TradingAgentsConfig` for performance.
- It will use a structured prompt to ask the LLM to return a JSON object with `score`, `confidence`, and `label`.
- **Embedding Generation**: Generate multiple embeddings using OpenAI's embedding API:
- `title_embedding`: Vector representation of article title (1536 dims)
- `content_embedding`: Vector representation of full article content (1536 dims)
- `sentiment_embedding`: Smaller specialized sentiment vector using sentence-transformers (384 dims)
- **Vector Similarity**: Enable semantic search for similar articles and sentiment clustering
- Database:
  - PostgreSQL + SQLAlchemy + pgvector Integration:
    - Replace file-based storage with a PostgreSQL database using the SQLAlchemy ORM
    - Create new SQLAlchemy models for news articles with proper relationships
    - Implement database migrations using Alembic
    - Add connection pooling and transaction management
    - Integrate the pgvector extension for high-dimensional sentiment embedding storage
    - Enable semantic similarity search and vector-based sentiment clustering
  - Database Schema Design:
    - `news_articles` table with columns for article data, sentiment scores, embeddings, and metadata
    - `news_sources` table for source information and credibility tracking
    - `news_categories` table for article categorization
    - Embeddings stored directly in `news_articles` as pgvector columns (the separate `sentiment_embeddings` table was dropped as redundant; see the schema below)
    - Proper indexing for symbol, date, and source queries, and for vector similarity searches
    - Foreign key relationships between articles, sources, and categories
API Specification
- No external API changes. All modifications will be internal to the `NewsService` and the cron job that calls it.
Security & Performance
- Security: LLM API keys will continue to be managed through the `TradingAgentsConfig` and environment variables. No new security risks are introduced.
- Performance: The scraping and sentiment analysis process is I/O- and network-bound. It runs as part of the daily cron job, so it will not impact the performance of the trading agents' decision-making process, which reads from the cached data.
Database Schema Design
Core Tables
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- News sources for credibility tracking
CREATE TABLE news_sources (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL UNIQUE,
domain VARCHAR(255),
credibility_score DECIMAL(3,2) DEFAULT 0.5, -- 0.0 to 1.0
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- News categories for article classification
CREATE TABLE news_categories (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Main articles table
CREATE TABLE news_articles (
id SERIAL PRIMARY KEY,
title TEXT NOT NULL,
content TEXT,
author VARCHAR(255),
symbol VARCHAR(10), -- Stock ticker, nullable for global news
source_id INTEGER REFERENCES news_sources(id),
category_id INTEGER REFERENCES news_categories(id),
url TEXT UNIQUE NOT NULL,
published_date DATE NOT NULL,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Sentiment analysis
sentiment_score DECIMAL(3,2), -- -1.0 to 1.0
sentiment_confidence DECIMAL(3,2), -- 0.0 to 1.0
sentiment_label VARCHAR(20), -- positive/negative/neutral
sentiment_analyzed_at TIMESTAMP,
-- Vector embeddings for semantic analysis
title_embedding vector(1536), -- OpenAI ada-002 embedding dimension
content_embedding vector(1536), -- Full article content embedding
sentiment_embedding vector(384), -- Sentence-transformer for sentiment
embedding_model VARCHAR(50) DEFAULT 'text-embedding-ada-002',
embedded_at TIMESTAMP,
-- Metadata
content_length INTEGER,
scrape_status VARCHAR(20) DEFAULT 'SUCCESS', -- SUCCESS, FAILED, ARCHIVE_SUCCESS
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Remove redundant sentiment_embeddings table
-- All embeddings stored directly in news_articles table for simplicity and performance
-- Performance indexes
CREATE INDEX idx_news_articles_symbol_date ON news_articles(symbol, published_date);
CREATE INDEX idx_news_articles_published_date ON news_articles(published_date);
CREATE INDEX idx_news_articles_source ON news_articles(source_id);
CREATE INDEX idx_news_articles_sentiment ON news_articles(sentiment_score, sentiment_confidence);
CREATE INDEX idx_news_articles_url_hash ON news_articles USING HASH(url);
-- Vector similarity indexes using HNSW (Hierarchical Navigable Small World)
-- Note: HNSW indexes consume significant memory (2-4x vector storage)
CREATE INDEX idx_articles_title_embedding ON news_articles USING hnsw (title_embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64); -- Tuned for performance vs memory
CREATE INDEX idx_articles_content_embedding ON news_articles USING hnsw (content_embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
CREATE INDEX idx_articles_sentiment_embedding ON news_articles USING hnsw (sentiment_embedding vector_cosine_ops)
WITH (m = 8, ef_construction = 32); -- Smaller index for sentiment vectors
SQLAlchemy Models
# tradingagents/domains/news/models.py
from datetime import datetime

from sqlalchemy import Column, Integer, String, Text, Date, DateTime, Numeric, ForeignKey
from sqlalchemy.orm import declarative_base, relationship
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class NewsSource(Base):
    __tablename__ = 'news_sources'

    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False, unique=True)
    domain = Column(String(255))
    credibility_score = Column(Numeric(3, 2), default=0.5)
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Relationships
    articles = relationship("NewsArticle", back_populates="source")

class NewsCategory(Base):
    __tablename__ = 'news_categories'

    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False, unique=True)
    description = Column(Text)
    created_at = Column(DateTime, default=datetime.utcnow)

    # Relationships
    articles = relationship("NewsArticle", back_populates="category")

class NewsArticle(Base):
    __tablename__ = 'news_articles'

    id = Column(Integer, primary_key=True)
    title = Column(Text, nullable=False)
    content = Column(Text)
    author = Column(String(255))
    symbol = Column(String(10))  # Nullable for global news
    source_id = Column(Integer, ForeignKey('news_sources.id'))
    category_id = Column(Integer, ForeignKey('news_categories.id'))
    url = Column(Text, unique=True, nullable=False)
    published_date = Column(Date, nullable=False)
    scraped_at = Column(DateTime, default=datetime.utcnow)

    # Sentiment fields (SQLAlchemy's Numeric maps to SQL DECIMAL)
    sentiment_score = Column(Numeric(3, 2))  # -1.0 to 1.0
    sentiment_confidence = Column(Numeric(3, 2))  # 0.0 to 1.0
    sentiment_label = Column(String(20))  # positive/negative/neutral
    sentiment_analyzed_at = Column(DateTime)

    # Vector embeddings using pgvector
    title_embedding = Column(Vector(1536))  # OpenAI ada-002 dimensions
    content_embedding = Column(Vector(1536))  # Full content embedding
    sentiment_embedding = Column(Vector(384))  # Sentence transformer for sentiment
    embedding_model = Column(String(50), default='text-embedding-ada-002')
    embedded_at = Column(DateTime)

    # Metadata
    content_length = Column(Integer)
    scrape_status = Column(String(20), default='SUCCESS')
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Relationships
    source = relationship("NewsSource", back_populates="articles")
    category = relationship("NewsCategory", back_populates="articles")

# Removed redundant SentimentEmbedding table for simplified architecture
Database Migration Strategy
Alembic Configuration:
# alembic/env.py
from alembic import context

from tradingagents.domains.news.models import Base
from tradingagents.config import TradingAgentsConfig

app_config = TradingAgentsConfig.from_env()
target_metadata = Base.metadata

# Point Alembic at the database URL from the application config
# (context.config is Alembic's own Config object, distinct from app_config)
config = context.config
config.set_main_option("sqlalchemy.url", app_config.database_url)
Initial Migration:
# Initialize Alembic in the project
alembic init alembic
# Generate initial migration
alembic revision --autogenerate -m "Create news tables"
# Apply migration
alembic upgrade head
Migration Files:
- `001_enable_pgvector.py` - Enable pgvector extension
- `002_create_news_tables.py` - Initial schema creation with vector fields
- `003_add_vector_indexes.py` - HNSW indexes for vector similarity
- `004_seed_categories_sources.py` - Seed default categories and trusted sources
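As a sketch, the first migration could be as small as the following (revision identifiers are placeholders, not real Alembic-generated IDs):

# Hedged sketch of 001_enable_pgvector.py
from alembic import op

revision = "001_enable_pgvector"
down_revision = None

def upgrade() -> None:
    # pgvector must be enabled before any Vector columns are created
    op.execute("CREATE EXTENSION IF NOT EXISTS vector")

def downgrade() -> None:
    op.execute("DROP EXTENSION IF EXISTS vector")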
TradingAgentsConfig Extension:
import os
from dataclasses import dataclass, field

@dataclass
class TradingAgentsConfig:
    # ... existing fields ...

    # Database configuration
    database_url: str = field(default_factory=lambda: os.getenv("DATABASE_URL", ""))
    database_pool_size: int = field(default_factory=lambda: int(os.getenv("DATABASE_POOL_SIZE", "10")))
    database_max_overflow: int = field(default_factory=lambda: int(os.getenv("DATABASE_MAX_OVERFLOW", "20")))
    database_echo: bool = field(default_factory=lambda: os.getenv("DATABASE_ECHO", "false").lower() == "true")

    # Vector configuration
    enable_vector_search: bool = field(default_factory=lambda: os.getenv("ENABLE_VECTOR_SEARCH", "true").lower() == "true")
    embedding_model: str = field(default_factory=lambda: os.getenv("EMBEDDING_MODEL", "text-embedding-ada-002"))
    embedding_batch_size: int = field(default_factory=lambda: int(os.getenv("EMBEDDING_BATCH_SIZE", "100")))
    enable_sentence_transformers: bool = field(default_factory=lambda: os.getenv("ENABLE_SENTENCE_TRANSFORMERS", "true").lower() == "true")

    @property
    def has_database_config(self) -> bool:
        """Check if database is properly configured."""
        return bool(self.database_url and self.database_url.startswith("postgresql://"))

    @property
    def embedding_provider(self) -> str:
        """Get embedding provider from LLM provider setting."""
        # Map LLM providers to their embedding providers
        llm_provider = getattr(self, 'llm_provider', 'openai')
        embedding_map = {
            'openai': 'openai',
            'google': 'google',  # Use Gemini for embeddings when Google is selected
            'anthropic': 'openai',  # Anthropic doesn't offer embeddings; use OpenAI
            'ollama': 'openai',  # Local models; use OpenAI for embeddings
        }
        return embedding_map.get(llm_provider, 'openai')

def validate_database_config(config: TradingAgentsConfig) -> None:
    """Validate database configuration before startup."""
    if not config.has_database_config:
        raise ValueError("DATABASE_URL must be set for PostgreSQL integration")
    if config.enable_vector_search and not config.has_database_config:
        raise ValueError("Vector search requires PostgreSQL database configuration")
Environment Variables:
# Database configuration (required)
DATABASE_URL=postgresql://username:password@localhost:5432/tradingagents
DATABASE_POOL_SIZE=10 # optional, defaults to 10
DATABASE_MAX_OVERFLOW=20 # optional, defaults to 20
DATABASE_ECHO=false # optional, set to true for SQL debugging
# Vector configuration (optional)
ENABLE_VECTOR_SEARCH=true # optional, defaults to true
EMBEDDING_MODEL=google/gemini-2.5-flash # Use Gemini via OpenRouter for embeddings
EMBEDDING_BATCH_SIZE=100 # optional
ENABLE_SENTENCE_TRANSFORMERS=true # optional
# Example configurations by provider:
# For OpenAI: EMBEDDING_MODEL=text-embedding-ada-002
# For Gemini: EMBEDDING_MODEL=google/gemini-2.5-flash (via OpenRouter)
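A short sketch of how these settings would feed the SQLAlchemy engine at startup (the names mirror the config fields above; the session-factory name is illustrative):

# Hedged sketch: building the SQLAlchemy engine from TradingAgentsConfig
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from tradingagents.config import TradingAgentsConfig

config = TradingAgentsConfig.from_env()
validate_database_config(config)  # fail fast on a missing DATABASE_URL

engine = create_engine(
    config.database_url,
    pool_size=config.database_pool_size,
    max_overflow=config.database_max_overflow,
    echo=config.database_echo,  # SQL logging for debugging
)
SessionLocal = sessionmaker(bind=engine)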
Embedding Generation Service Design
SentimentScore Enhancement:
@dataclass
class SentimentScoreWithEmbeddings:
    """Enhanced sentiment analysis with vector embeddings."""

    score: float  # -1.0 to 1.0
    confidence: float  # 0.0 to 1.0
    label: str  # positive/negative/neutral

    # Vector embeddings
    title_embedding: List[float]  # 1536 dimensions
    content_embedding: List[float]  # 1536 dimensions
    sentiment_embedding: List[float] | None  # 384 dimensions; None when sentence transformers are disabled
    embedding_model: str = "text-embedding-ada-002"
Service Implementation:
import asyncio
import json
import logging
import os
from typing import List

from openai import AsyncOpenAI
from sentence_transformers import SentenceTransformer
from sqlalchemy.orm import Session

from tradingagents.config import TradingAgentsConfig
from tradingagents.domains.news.models import NewsArticle

logger = logging.getLogger(__name__)

class EmbeddingProvider:
    """Abstract base for embedding providers."""

    model: str

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        raise NotImplementedError

class OpenAIEmbeddingProvider(EmbeddingProvider):
    def __init__(self, api_key: str, model: str = "text-embedding-ada-002"):
        self.client = AsyncOpenAI(api_key=api_key)
        self.model = model

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        response = await self.client.embeddings.create(
            input=texts,
            model=self.model
        )
        return [item.embedding for item in response.data]

class GeminiEmbeddingProvider(EmbeddingProvider):
    def __init__(self, api_key: str, base_url: str = "https://openrouter.ai/api/v1"):
        self.client = AsyncOpenAI(api_key=api_key, base_url=base_url)
        self.model = "google/gemini-2.5-flash"

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        # Gemini via OpenRouter - batch embeddings
        response = await self.client.embeddings.create(
            input=texts,
            model=self.model
        )
        return [item.embedding for item in response.data]

class SentimentAnalysisService:
    def __init__(self, config: TradingAgentsConfig):
        self.llm_client = self._get_llm_client(config)  # LLM client factory defined elsewhere in the service
        self.embedding_provider = self._get_embedding_provider(config)
        self.sentence_transformer = (
            SentenceTransformer('all-MiniLM-L6-v2')
            if config.enable_sentence_transformers
            else None
        )

    def _get_embedding_provider(self, config: TradingAgentsConfig) -> EmbeddingProvider:
        """Get appropriate embedding provider based on configuration."""
        provider = config.embedding_provider
        if provider == 'openai':
            return OpenAIEmbeddingProvider(
                api_key=os.getenv('OPENAI_API_KEY'),
                model=config.embedding_model
            )
        elif provider == 'google':
            return GeminiEmbeddingProvider(
                api_key=os.getenv('OPENAI_API_KEY'),  # OpenRouter key
                base_url="https://openrouter.ai/api/v1"
            )
        else:
            # Default to OpenAI
            return OpenAIEmbeddingProvider(
                api_key=os.getenv('OPENAI_API_KEY'),
                model=config.embedding_model
            )

    async def get_sentiment_with_embeddings(
        self,
        title: str,
        content: str
    ) -> SentimentScoreWithEmbeddings:
        """Generate sentiment analysis with vector embeddings - optimized for performance."""
        # 1. Parallel processing: sentiment score + embeddings
        tasks = [
            self._get_sentiment_score(content),  # LLM sentiment analysis
            self.embedding_provider.get_embeddings([title, content])  # Batched embedding API call
        ]
        sentiment, embeddings = await asyncio.gather(*tasks)
        title_embedding, content_embedding = embeddings

        # 2. Generate local sentiment embedding if enabled
        sentiment_embedding = None
        if self.sentence_transformer:
            sentiment_embedding = self.sentence_transformer.encode(content).tolist()

        return SentimentScoreWithEmbeddings(
            score=sentiment.score,
            confidence=sentiment.confidence,
            label=sentiment.label,
            title_embedding=title_embedding,
            content_embedding=content_embedding,
            sentiment_embedding=sentiment_embedding,
            embedding_model=self.embedding_provider.model
        )

    async def _get_sentiment_score(self, content: str) -> SentimentScore:
        """Generate sentiment score using LLM with financial news prompt."""
        prompt = """
        Analyze the sentiment of this financial news article for trading purposes.

        Article Content: {content}

        Provide your analysis in the following JSON format:
        {{
            "score": <float between -1.0 (very negative) and 1.0 (very positive)>,
            "confidence": <float between 0.0 and 1.0>,
            "label": <"positive", "negative", or "neutral">,
            "reasoning": <brief explanation>,
            "key_themes": <list of key financial themes>,
            "financial_entities": <list of mentioned companies/tickers>
        }}

        Focus on the financial and market implications of the news.
        Consider impact on stock prices, market sentiment, and trading decisions.
        """.format(content=content[:2000])  # Limit content length

        response = await self.llm_client.complete(prompt)
        try:
            result = json.loads(response)
            return SentimentScore(
                score=result.get("score", 0.0),
                confidence=result.get("confidence", 0.5),
                label=result.get("label", "neutral"),
                metadata={
                    "reasoning": result.get("reasoning", ""),
                    "key_themes": result.get("key_themes", []),
                    "financial_entities": result.get("financial_entities", [])
                }
            )
        except Exception as e:
            # Return neutral sentiment on error
            return SentimentScore(
                score=0.0,
                confidence=0.0,
                label="neutral",
                metadata={"error": str(e)}
            )

    def find_similar_articles(
        self,
        session: Session,  # session passed in here; could equally come from a repository
        embedding: List[float],
        limit: int = 10,
        similarity_threshold: float = 0.8
    ) -> List[NewsArticle]:
        """Find semantically similar articles using pgvector cosine distance."""
        # pgvector's SQLAlchemy integration exposes cosine_distance() on Vector
        # columns; cosine distance = 1 - similarity.
        max_distance = 1.0 - similarity_threshold
        distance = NewsArticle.content_embedding.cosine_distance(embedding)
        rows = (
            session.query(NewsArticle, distance.label("distance"))
            .filter(NewsArticle.content_embedding.isnot(None))
            .order_by(distance)
            .limit(limit * 2)  # over-fetch candidates, then apply the threshold
            .all()
        )
        return [article for article, dist in rows if dist <= max_distance][:limit]

    async def batch_analyze_sentiment(
        self,
        articles: List[ArticleData],
        batch_size: int = 5
    ) -> List[SentimentScoreWithEmbeddings]:
        """
        Batch process sentiment analysis and embedding generation.

        Args:
            articles: List of articles to analyze
            batch_size: Number of articles to process concurrently

        Returns:
            List of sentiment scores with embeddings
        """
        results = []
        for i in range(0, len(articles), batch_size):
            batch = articles[i:i + batch_size]

            # Process batch concurrently
            batch_tasks = [
                self.get_sentiment_with_embeddings(article.title, article.content)
                for article in batch
            ]
            batch_results = await asyncio.gather(*batch_tasks, return_exceptions=True)

            for result in batch_results:
                if isinstance(result, Exception):
                    # Handle individual failures gracefully
                    logger.error(f"Sentiment analysis failed: {result}")
                    results.append(self._get_neutral_sentiment_with_embeddings())
                else:
                    results.append(result)

            # Rate limiting: add a delay between batches
            if i + batch_size < len(articles):
                await asyncio.sleep(1.0)  # 1 second delay between batches

        return results
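A hedged usage sketch of the batch pipeline from the daily cron job; the attribute wiring onto `ArticleData` mirrors the fields defined earlier, and the function name is illustrative:

# Hedged sketch: enriching freshly scraped articles inside the cron job.
import asyncio

async def analyze_daily_articles(service: SentimentAnalysisService,
                                 articles: list) -> list:
    """Attach LLM sentiment and embeddings to scraped articles."""
    scores = await service.batch_analyze_sentiment(articles, batch_size=5)
    for article, score in zip(articles, scores):
        article.sentiment = SentimentScore(
            score=score.score, confidence=score.confidence, label=score.label
        )
        article.title_embedding = score.title_embedding
        article.content_embedding = score.content_embedding
        article.sentiment_embedding = score.sentiment_embedding
    return articles

# Inside the cron job:
# enriched = asyncio.run(analyze_daily_articles(sentiment_service, scraped_articles))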
Optimized Vector Similarity Queries:
-- Find articles similar to a given title embedding (HNSW optimized)
-- Note: Don't put a similarity predicate in WHERE - it defeats HNSW indexing
SELECT id, title, symbol,
       (title_embedding <=> %s) AS distance,
       (1 - (title_embedding <=> %s)) AS similarity
FROM news_articles
WHERE title_embedding IS NOT NULL -- Only filter on non-null vectors
ORDER BY title_embedding <=> %s
LIMIT 20; -- Fetch extra candidates; apply a distance threshold (e.g. < 0.2) in the application
-- Find articles with similar sentiment patterns (pre-filter by label for efficiency)
SELECT id, title, sentiment_label,
(sentiment_embedding <=> %s) as distance
FROM news_articles
WHERE sentiment_label = %s -- Filter first by indexed column
AND sentiment_embedding IS NOT NULL
ORDER BY sentiment_embedding <=> %s
LIMIT 15;
-- Cluster articles by content similarity for a ticker (optimized approach)
WITH similar_articles AS (
SELECT id, symbol, sentiment_score,
(content_embedding <=> %s) as distance
FROM news_articles
WHERE symbol = %s -- Use indexed column first
AND content_embedding IS NOT NULL
ORDER BY content_embedding <=> %s
LIMIT 50 -- Limit search space
)
SELECT symbol,
AVG(sentiment_score) as avg_sentiment,
COUNT(*) as article_count,
AVG(distance) as avg_content_distance
FROM similar_articles
WHERE distance < 0.3 -- Apply similarity threshold after vector search
GROUP BY symbol;
-- Performance monitoring query
SELECT
schemaname,
tablename,
attname as column_name,
n_distinct,
correlation
FROM pg_stats
WHERE tablename = 'news_articles'
AND attname LIKE '%embedding%';
Memory Usage Estimation:
-- Estimate memory requirements for HNSW indexes
SELECT
pg_size_pretty(pg_total_relation_size('idx_articles_title_embedding')) as title_index_size,
pg_size_pretty(pg_total_relation_size('idx_articles_content_embedding')) as content_index_size,
pg_size_pretty(pg_total_relation_size('idx_articles_sentiment_embedding')) as sentiment_index_size,
pg_size_pretty(pg_total_relation_size('news_articles')) as table_size;
-- Expected memory usage: 500MB-1GB for 10K articles with 3 embedding types
Current Implementation Status
✅ COMPLETED COMPONENTS:
- NewsService Core Structure (90% Complete)
  - ✅ Core service class with dependency injection
  - ✅ Read path implemented: `get_company_news_context()`, `get_global_news_context()`
  - ✅ Write path implemented: `update_company_news()`, `update_global_news()`
  - ✅ Repository integration with file-based storage
  - ✅ ArticleData model conversion from repository NewsArticle
  - ✅ Simple keyword-based sentiment analysis as fallback
  - ✅ Error handling and empty context returns
  - ✅ Trending topics extraction
  - ✅ Date validation and ISO format handling
- NewsRepository (100% Complete)
  - ✅ File-based storage with JSON serialization
  - ✅ Source separation (finnhub, google_news)
  - ✅ Date-based file organization (YYYY-MM-DD.json)
  - ✅ Article deduplication by URL
  - ✅ Batch storage operations
  - ✅ Complete CRUD operations
  - ✅ Proper error handling and logging
- Data Models (100% Complete)
  - ✅ ArticleData dataclass with sentiment field
  - ✅ NewsContext and GlobalNewsContext for agent consumption
  - ✅ SentimentScore model
  - ✅ NewsUpdateResult for operation tracking
  - ✅ DataQuality enum for metadata
✅ COMPLETED COMPONENTS (UPDATED):
- GoogleNewsClient (100% Complete)
  - ✅ RSS feed parsing with feedparser
  - ✅ Company news method implemented (`get_company_news()`)
  - ✅ Global news method implemented (`get_global_news()`)
  - ✅ Proper error handling and logging
  - ✅ Google News RSS URL construction
  - ✅ Article parsing with source extraction
  - ✅ Date parsing with fallback handling
- ArticleScraperClient (100% Complete)
  - ✅ Full newspaper3k content extraction
  - ✅ Internet Archive Wayback Machine fallback
  - ✅ Robust error handling for failed scrapes
  - ✅ Content validation (minimum length checks)
  - ✅ Multiple article batch processing
  - ✅ Rate limiting with configurable delays
  - ✅ Proper URL validation
❌ MISSING COMPONENTS:
- LLM Sentiment Analysis Service (0% Complete)
  - ❌ SentimentAnalysisService class not created
  - ❌ LLM integration not implemented
  - ❌ Financial news prompts not defined
  - ❌ Batch processing not implemented
  - Current: Using simple keyword-based fallback
  - Next: Create dedicated sentiment service
- Database Migration (0% Complete)
  - ❌ SQLAlchemy models not created
  - ❌ PostgreSQL integration not started
  - ❌ pgvector extension not configured
  - ❌ Alembic migrations not set up
  - Current: Using file-based storage
  - Status: Planned for future iteration
- Vector Embeddings (0% Complete)
  - ❌ Embedding providers not implemented
  - ❌ Vector similarity not available
  - ❌ Semantic search not implemented
  - Status: Advanced feature for future enhancement
Revised Implementation Phases
PHASE 1: Complete Core Functionality (Current Priority)
- GoogleNewsClient RSS Implementation (2-3 days)
  - Implement feedparser RSS parsing
  - Add company news and global news methods
  - Handle RSS feed errors and edge cases
  - Create comprehensive tests with VCR cassettes
- ArticleScraperClient Implementation (2-3 days)
  - Implement newspaper3k content extraction
  - Add Internet Archive fallback mechanism
  - Handle paywalls and extraction failures
  - Create scraping tests with mock responses
- LLM Sentiment Analysis Service (3-4 days)
  - Create SentimentAnalysisService class
  - Implement LLM client integration using TradingAgentsConfig
  - Design financial news sentiment prompts
  - Add batch processing with rate limiting
  - Replace keyword-based sentiment in NewsService
PHASE 2: Testing and Refinement (Current Phase)
- Integration Testing (1-2 days)
  - End-to-end testing with real RSS feeds
  - Test article scraping and sentiment analysis pipeline
  - Verify error handling and partial failures
  - Performance testing with multiple tickers
- Type Safety and Quality (1 day)
  - Ensure `mise run typecheck` passes with 0 errors
  - Fix any remaining linting issues
  - Add missing docstrings and type hints
PHASE 3: Future Enhancements (Deferred)
- Database Migration: SQLAlchemy + PostgreSQL + pgvector
- Vector Embeddings: Semantic similarity and clustering
- Performance Optimization: Caching improvements and batch processing
Total Timeline: 1-2 weeks for core completion
- Week 1: Complete GoogleNewsClient, ArticleScraperClient, LLM Sentiment Service
- Week 2: Integration testing, refinement, and quality assurance
- Future: Database migration and vector enhancements as separate project
Testing Plan
Test Strategy
- Unit Testing: Test individual components in isolation with mocked dependencies
- Integration Testing: Test component interactions and data flow
- End-to-End Testing: Test complete workflows from news fetching to storage
Unit Tests
GoogleNewsClient Tests
- Location: `tests/domains/news/test_google_news_client.py`
- Framework: `pytest` with `pytest-vcr` for HTTP recording/replay
- VCR Cassettes: `tests/fixtures/vcr_cassettes/google_news/`
- Test Cases (a sample test sketch follows this list):
  - `@pytest.mark.vcr` `test_get_news_by_symbol_success()` - Valid symbol returns articles
  - `@pytest.mark.vcr` `test_get_news_by_symbol_invalid_symbol()` - Invalid symbol handling
  - `@pytest.mark.vcr` `test_get_global_news_success()` - Global news retrieval
  - `@pytest.mark.vcr` `test_get_global_news_empty_response()` - Empty RSS feed handling
  - `test_rss_feed_parsing_error()` - Malformed RSS handling (mocked)
  - `test_network_timeout()` - Network timeout scenarios (mocked)
  - `test_rate_limiting()` - Rate limit compliance (mocked)
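For example, the first case might look like this (a sketch; the client import path and the return shape are assumptions based on this design):

# Hedged sketch of a VCR-backed client test.
import pytest

from tradingagents.domains.news.google_news_client import GoogleNewsClient

@pytest.mark.vcr
def test_get_news_by_symbol_success():
    client = GoogleNewsClient()
    articles = client.get_company_news("AAPL")  # replayed from the cassette

    assert len(articles) > 0
    first = articles[0]
    assert first.title  # every entry should carry a headline
    assert first.url.startswith("http")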
ArticleScraperClient Tests
- Location: `tests/domains/news/test_article_scraper_client.py`
- Framework: `pytest` with `pytest-vcr` for HTTP recording/replay
- VCR Cassettes: `tests/fixtures/vcr_cassettes/article_scraper/`
- Test Cases:
  - `@pytest.mark.vcr` `test_scrape_article_success()` - Successful article scraping
  - `@pytest.mark.vcr` `test_scrape_article_archive_fallback()` - Internet Archive fallback
  - `test_scrape_article_both_fail()` - Both methods fail gracefully (mocked)
  - `test_invalid_url()` - Invalid URL handling (mocked)
  - `@pytest.mark.vcr` `test_content_extraction()` - Content parsing accuracy
SentimentAnalysisService Tests
- Location: `tests/domains/news/test_sentiment_service.py`
- Test Cases:
  - `test_get_sentiment_positive()` - Positive sentiment detection
  - `test_get_sentiment_negative()` - Negative sentiment detection
  - `test_get_sentiment_neutral()` - Neutral sentiment detection
  - `test_get_sentiment_llm_error()` - LLM API error handling
  - `test_get_sentiment_invalid_response()` - Invalid JSON response handling
  - `test_get_sentiment_empty_content()` - Empty content handling
NewsService Tests
- Location: `tests/domains/news/test_news_service.py`
- Test Cases:
  - `test_update_company_news_success()` - Complete news update workflow
  - `test_update_company_news_no_articles()` - No articles found scenario
  - `test_update_company_news_scraping_failure()` - Partial scraping failures
  - `test_sentiment_analysis_integration()` - Sentiment analysis integration
  - `test_calculate_sentiment_summary()` - Sentiment aggregation logic
  - `test_get_company_news_by_date()` - News retrieval by date
NewsRepository Tests
- Location: `tests/domains/news/test_news_repository.py`
- Test Cases:
  - `test_store_news_articles()` - Article storage
  - `test_get_news_by_symbol_and_date()` - News retrieval
  - `test_duplicate_article_handling()` - Duplicate prevention
  - `test_data_persistence()` - File system persistence
  - `test_invalid_data_handling()` - Invalid data rejection
Integration Tests
News Workflow Integration
- Location: `tests/integration/test_news_workflow.py`
- Test Cases:
  - `test_full_news_update_workflow()` - Complete end-to-end workflow
  - `test_news_service_with_real_clients()` - Real client integration
  - `test_sentiment_service_integration()` - LLM integration testing
  - `test_repository_integration()` - Data persistence integration
End-to-End Tests
Complete System Tests
- Location: `tests/e2e/test_news_system.py`
- Test Cases:
  - `test_daily_news_update_simulation()` - Simulate daily cron job
  - `test_trading_agent_news_consumption()` - Agent news retrieval
  - `test_system_performance_with_multiple_tickers()` - Performance testing
  - `test_error_recovery_scenarios()` - System resilience testing
Test Data Management
Mock Data Strategy
- RSS Feed Samples: Saved sample RSS responses for consistent testing
- Article Content: Pre-scraped article content for sentiment testing
- LLM Responses: Mock sentiment analysis responses for unit tests
Test Configuration
- Environment Variables: Separate test configuration
- Database Isolation: Temporary test databases
- VCR Configuration: Record/replay HTTP interactions for deterministic tests
- Pytest Configuration: `pytest.ini` with VCR settings and test markers (a sample fixture sketch follows)
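A sketch of the shared VCR setup, assuming `pytest-vcr`'s `vcr_config` fixture hook; the cassette directory mirrors the paths listed above:

# tests/conftest.py - hedged sketch of the pytest-vcr configuration
import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        "cassette_library_dir": "tests/fixtures/vcr_cassettes",
        "record_mode": "once",                # replay existing cassettes deterministically
        "filter_headers": ["authorization"],  # keep API keys out of cassettes
    }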
Performance Testing
Load Testing
- Concurrent News Updates: Test multiple ticker updates simultaneously
- Memory Usage: Monitor memory consumption during batch processing
- API Rate Limiting: Verify rate limit compliance under load
Benchmarking
- Scraping Speed: Measure article scraping performance
- Sentiment Analysis: Measure LLM response times
- Storage Performance: Database write/read performance
Test Automation
CI/CD Integration
- Pre-commit Hooks: Run fast unit tests before commits
- Pull Request Checks: Full test suite on PR creation
- Nightly Tests: End-to-end tests with real data
Test Coverage Requirements
- Minimum Coverage: 80% line coverage for all components
- Critical Path Coverage: 100% coverage for core business logic
- Error Handling Coverage: All exception paths tested
Manual Testing Scenarios
Smoke Tests
- Daily Operations: Manual verification of daily news updates
- Data Quality: Spot-check sentiment analysis accuracy
- System Health: Monitor error rates and performance metrics
Acceptance Testing
- Trading Agent Integration: Verify agents can consume news data effectively
- Data Accuracy: Validate news relevance and sentiment accuracy
- Performance Benchmarks: Confirm system meets performance requirements
Current Implementation Status Summary
Overall Progress: 90% Complete 🎉
✅ COMPLETED (100%)
- Requirements analysis and technical design
- NewsService core structure with read/write paths
- NewsRepository with file-based storage and deduplication
- Data models (ArticleData, NewsContext, SentimentScore)
- GoogleNewsClient with full RSS feed parsing
- ArticleScraperClient with newspaper3k + Internet Archive fallback
- Basic sentiment analysis (keyword-based fallback)
- Error handling and validation
- Service integration and dependency injection
❌ MISSING (10%)
- LLM sentiment analysis service (only remaining core component)
⏸️ DEFERRED (Future Iterations)
- Database migration to PostgreSQL + SQLAlchemy
- Vector embeddings and semantic search
- Real-time news streaming capabilities
What's Working Now
The current NewsService implementation provides:
- Read Path: Agents can successfully call `get_company_news_context()` and `get_global_news_context()`
- Repository Integration: Service reads cached news data from the file-based NewsRepository
- Data Transformation: Converts NewsRepository.NewsArticle → ArticleData for agents
- Basic Sentiment: Simple keyword-based sentiment analysis as fallback
- Error Handling: Graceful error handling with empty contexts and metadata
- Type Safety: Proper type hints and dataclass definitions
What's Missing
The service still lacks:
- LLM Sentiment Analysis: No LLM integration for financial news sentiment (using keyword fallback)
- Structured Storage: Still using file-based storage instead of planned PostgreSQL + SQLAlchemy
- Vector Embeddings: No semantic similarity or vector-based features
Critical Gap (Only 1 Remaining!)
- LLM Sentiment Service - No structured sentiment analysis with LLM prompts
- Current: Simple keyword-based sentiment scoring
- Needed: LLM integration using TradingAgentsConfig
- Impact: Agents get basic sentiment but not sophisticated financial analysis
Recently Discovered: Implementation is 90% Complete!
Upon detailed code review, the implementation is much further along than initially documented:
- ✅ GoogleNewsClient - Fully implemented with RSS parsing
- ✅ ArticleScraperClient - Complete with newspaper3k + Internet Archive fallback
- ✅ NewsService - Full read/write paths with proper error handling
- ✅ NewsRepository - Production-ready file-based storage
Next Immediate Steps (Revised)
- ✅ COMPLETE: GoogleNewsClient RSS parsing - Already implemented with feedparser
- ✅ COMPLETE: ArticleScraperClient - Already implemented with newspaper3k + Internet Archive
- ⏳ PRIORITY: Create LLM Sentiment Service - Replace keyword-based analysis (2-3 days)
- ⏳ PRIORITY: Integration testing - End-to-end workflow validation (1-2 days)
Timeline to MVP (Updated)
- 3-5 days for LLM sentiment service + testing
- Current system is production-ready with basic sentiment analysis
- Database migration deferred to future iteration
- Vector features planned as advanced enhancement
Implementation Priority
HIGH PRIORITY (Required for sophisticated sentiment):
- LLM Sentiment Analysis Service with financial news prompts
MEDIUM PRIORITY (System improvements):
- Better error handling and retry logic
- Performance optimization for batch processing
- Comprehensive integration test suite
LOW PRIORITY (Future enhancements):
- PostgreSQL + SQLAlchemy migration
- Vector embeddings and semantic search
- Real-time news streaming