# News Service PRD

## Executive Summary

The News Service feature will provide up-to-date news sentiment analysis for stock market tickers to the TradingAgents framework. This service will enable agents to make more informed trading decisions based on current market news and sentiment.

## Requirements

### Target Users

- Trading Agents (News Analyst, Researchers, Trader Agent, Risk Management team)
- Cron Job system for daily updates

### Problem Statement

Agents need up-to-date news sentiment when analyzing the stock market to make better trading decisions. Currently, they may miss important news events or receive sentiment analysis too late, either of which can hurt trading performance.

### Success Metrics

- Impact on trading decision quality

### User Stories

1. As the Cron Job, I want to update and store the news with sentiment analysis for a ticker each day.
2. As a Trading Agent, I want to retrieve the news with sentiment analysis for a given ticker and day from a database.

### Out of Scope (v1)

- Real-time news streaming (vs daily updates)
- Multi-language news support
- Historical news sentiment analysis beyond a certain date range
- News source ranking or weighting
- Advanced filtering options

### Timeline

MVP in 1 week

## Status

✅ Requirements Complete | ✅ Technical Design Complete | 🔄 Implementation In Progress

## Technical Design

### Architecture

- The `NewsService` will be the central component, orchestrating the fetching, scraping, analysis, and storage of news articles.
- It will utilize the existing `GoogleNewsClient` to fetch RSS feeds from Google News.
- The `ArticleScraperClient` will be enhanced to scrape full article content with robust fallback strategies:
  - **Direct Fetch**: Primary method using the `newspaper3k` library for content extraction
  - **Archive Fallback**: Internet Archive Wayback Machine fallback for failed fetches
  - **Content Extraction**: Clean text, title, publication date, and metadata extraction
  - **Paywall Detection**: Handle paywall-protected content gracefully
- A new `SentimentAnalysisService` will be created to handle the interaction with the configured LLM for structured sentiment analysis.
- The `NewsRepository` will store the news articles along with their sentiment scores. It currently uses file-based storage; the PostgreSQL design described in this document is intended to replace it.

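
The fallback order described above (direct fetch, then the archive copy, then a recorded failure) can be sketched as follows. This is a minimal illustration, not the `ArticleScraperClient` implementation: `ScrapeResult`, `scrape_with_fallback`, the injected fetcher callables, and the 200-character minimum are all hypothetical. The status strings match the `scrape_status` values used in the schema later in this document.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical type: a fetcher takes a URL and returns article text or None.
Fetcher = Callable[[str], Optional[str]]


@dataclass
class ScrapeResult:
    url: str
    content: str
    status: str  # SUCCESS, ARCHIVE_SUCCESS, or FAILED


def scrape_with_fallback(
    url: str,
    direct_fetch: Fetcher,
    archive_fetch: Fetcher,
    min_length: int = 200,  # assumed threshold for "real" article text
) -> ScrapeResult:
    """Try the direct fetch first, then the archive copy, then give up."""
    attempts = ((direct_fetch, "SUCCESS"), (archive_fetch, "ARCHIVE_SUCCESS"))
    for fetch, status in attempts:
        try:
            text = fetch(url)
        except Exception:
            text = None  # treat any fetch error as a miss and move on
        # Very short bodies are usually paywall stubs or error pages.
        if text and len(text.strip()) >= min_length:
            return ScrapeResult(url=url, content=text.strip(), status=status)
    return ScrapeResult(url=url, content="", status="FAILED")


def paywalled(url: str) -> Optional[str]:
    return "Subscribe to continue reading."


def archived(url: str) -> Optional[str]:
    return "Full article text. " * 20


result = scrape_with_fallback("https://example.com/a", paywalled, archived)
print(result.status)  # ARCHIVE_SUCCESS
```

Passing the fetchers in as callables keeps the ordering logic testable without network access; in the real client they would wrap `newspaper3k` and a Wayback Machine lookup.
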
### Implementation Components

- **Backend:**
  - `tradingagents/domains/news/news_service.py`:
    - A new private method `_get_sentiment_for_article` will be added to call the `SentimentAnalysisService`.
    - The `update_company_news` method will be modified to call this new method for each scraped article.
    - The `_calculate_sentiment_summary` method will be updated to aggregate the new structured sentiment scores.
    - Update to work with the SQLAlchemy-based `NewsRepository` instead of file-based storage.
  - `tradingagents/domains/news/repository.py` (Enhanced with Compatibility Layer):
    - Replace file-based storage with SQLAlchemy ORM operations
    - **Backward Compatibility**: Maintain existing interface with adapter pattern
    - Implement new methods: `save_articles()`, `get_articles_by_symbol()`, `get_articles_by_date_range()`
    - Add transaction management and connection pooling
    - Include duplicate detection using URL uniqueness constraints
    - Add batch operations for efficient bulk inserts

    **Data Model Compatibility Strategy:**

    ```python
    from dataclasses import dataclass
    from datetime import date
    from decimal import Decimal
    from typing import List

    from sqlalchemy.orm import Session

    from tradingagents.domains.news.models import NewsArticle, NewsSource

    # SentimentScore is the existing sentiment model used by ArticleData.


    # Enhanced ArticleData to bridge existing and new models
    @dataclass
    class ArticleData:
        # Existing fields (maintain compatibility)
        title: str
        content: str
        author: str
        source: str  # Keep as string for existing code
        date: str  # YYYY-MM-DD format
        url: str
        sentiment: SentimentScore | None = None

        # New fields for enhanced functionality
        source_id: int | None = None  # Foreign key when available
        category_id: int | None = None  # Foreign key when available

        # Vector fields (optional for backward compatibility)
        title_embedding: List[float] | None = None
        content_embedding: List[float] | None = None
        sentiment_embedding: List[float] | None = None

        @classmethod
        def from_db_model(cls, article: NewsArticle) -> 'ArticleData':
            """Convert database model to existing ArticleData format."""
            sentiment = None
            if article.sentiment_score is not None:
                confidence = article.sentiment_confidence
                sentiment = SentimentScore(
                    score=float(article.sentiment_score),
                    confidence=float(confidence) if confidence is not None else 0.0,
                    label=article.sentiment_label or "neutral",
                )
            return cls(
                title=article.title,
                content=article.content or "",
                author=article.author or "",
                source=article.source.name if article.source else "Unknown",  # Flatten relationship
                date=article.published_date.isoformat(),
                url=article.url,
                sentiment=sentiment,
                source_id=article.source_id,
                category_id=article.category_id,
                title_embedding=article.title_embedding,
                content_embedding=article.content_embedding,
                sentiment_embedding=article.sentiment_embedding,
            )

        def to_db_model(self, session: Session) -> NewsArticle:
            """Convert to database model, handling source lookup."""
            # Get or create source
            source = session.query(NewsSource).filter_by(name=self.source).first()
            if not source:
                source = NewsSource(name=self.source)
                session.add(source)
                session.flush()  # Get ID

            return NewsArticle(
                title=self.title,
                content=self.content,
                author=self.author,
                source_id=source.id,
                category_id=self.category_id,  # Preserve the category on round trips
                url=self.url,
                published_date=date.fromisoformat(self.date),
                sentiment_score=Decimal(str(self.sentiment.score)) if self.sentiment else None,
                sentiment_confidence=Decimal(str(self.sentiment.confidence)) if self.sentiment else None,
                sentiment_label=self.sentiment.label if self.sentiment else None,
                title_embedding=self.title_embedding,
                content_embedding=self.content_embedding,
                sentiment_embedding=self.sentiment_embedding,
            )
    ```

  - `tradingagents/domains/news/sentiment_service.py` (New File):
    - This new service will encapsulate the logic for calling the LLM and generating embeddings.
    - Primary method: `get_sentiment_with_embeddings(title: str, content: str) -> SentimentScoreWithEmbeddings`.
    - It will use the `quick_think_llm` from the `TradingAgentsConfig` for performance.
    - It will use a structured prompt to ask the LLM to return a JSON object with `score`, `confidence`, and `label`.
    - **Embedding Generation**: Generate three embeddings per article. The title and content embeddings come from the configured embedding API (OpenAI by default); the sentiment embedding is computed locally:
      - `title_embedding`: Vector representation of article title (1536 dims)
      - `content_embedding`: Vector representation of full article content (1536 dims)
      - `sentiment_embedding`: Smaller specialized sentiment vector using sentence-transformers (384 dims)
    - **Vector Similarity**: Enable semantic search for similar articles and sentiment clustering
- **Database:**
  - **PostgreSQL + SQLAlchemy + pgvector Integration:**
    - Replace file-based storage with PostgreSQL database using SQLAlchemy ORM
    - Create new SQLAlchemy models for news articles with proper relationships
    - Implement database migrations using Alembic
    - Add connection pooling and transaction management
    - Integrate pgvector extension for high-dimensional sentiment embeddings storage
    - Enable semantic similarity search and vector-based sentiment clustering
  - **Database Schema Design:**
    - `news_articles` table with columns for article data, sentiment scores, embeddings, and metadata
    - `news_sources` table for source information and credibility tracking
    - `news_categories` table for article categorization
    - Embeddings are stored as pgvector columns on `news_articles`; a separate `sentiment_embeddings` table was dropped as redundant
    - Proper indexing for symbol, date, source queries, and vector similarity searches
    - Foreign key relationships between articles, sources, and categories

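
The structured prompt mentioned above asks the LLM for a JSON object with `score`, `confidence`, and `label`. Model replies are not guaranteed to be clean JSON or to stay within range, so a defensive parser is worth sketching. Everything below is illustrative: `ParsedSentiment` stands in for the project's `SentimentScore`, and `parse_sentiment_response` is a hypothetical helper, not part of the planned service.

```python
import json
from dataclasses import dataclass

VALID_LABELS = {"positive", "negative", "neutral"}


@dataclass
class ParsedSentiment:  # hypothetical stand-in for the project's SentimentScore
    score: float
    confidence: float
    label: str


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def parse_sentiment_response(raw: str) -> ParsedSentiment:
    """Parse the LLM's JSON reply, falling back to neutral on bad input."""
    neutral = ParsedSentiment(score=0.0, confidence=0.0, label="neutral")
    # Models often wrap JSON in prose or code fences; keep the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return neutral
    try:
        data = json.loads(raw[start : end + 1])
        score = _clamp(float(data["score"]), -1.0, 1.0)
        confidence = _clamp(float(data["confidence"]), 0.0, 1.0)
    except (ValueError, TypeError, KeyError):
        return neutral
    label = str(data.get("label", "")).lower()
    if label not in VALID_LABELS:
        # Derive a label from the score when the model returns something unexpected.
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return ParsedSentiment(score=score, confidence=confidence, label=label)
```

Clamping keeps stored values inside the ranges the schema documents (-1.0 to 1.0 for the score, 0.0 to 1.0 for confidence), and a zero-confidence neutral result lets downstream aggregation discount failed analyses.
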
### API Specification

- No external API changes. All modifications will be internal to the `NewsService` and the cron job that calls it.

### Security & Performance

- **Security:** LLM API keys will continue to be managed through the `TradingAgentsConfig` and environment variables. No new security risks are introduced.
- **Performance:** The scraping and sentiment analysis process is I/O and network-bound. This will run as part of the daily cron job, so it will not impact the performance of the trading agents' decision-making process, which will read from the cached data.

### Database Schema Design

#### Core Tables

```sql
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- News sources for credibility tracking
CREATE TABLE news_sources (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE,
    domain VARCHAR(255),
    credibility_score DECIMAL(3,2) DEFAULT 0.5, -- 0.0 to 1.0
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- News categories for article classification
CREATE TABLE news_categories (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE,
    description TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Main articles table
CREATE TABLE news_articles (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT,
    author VARCHAR(255),
    symbol VARCHAR(10), -- Stock ticker, nullable for global news
    source_id INTEGER REFERENCES news_sources(id),
    category_id INTEGER REFERENCES news_categories(id),
    url TEXT UNIQUE NOT NULL,
    published_date DATE NOT NULL,
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Sentiment analysis
    sentiment_score DECIMAL(3,2), -- -1.0 to 1.0
    sentiment_confidence DECIMAL(3,2), -- 0.0 to 1.0
    sentiment_label VARCHAR(20), -- positive/negative/neutral
    sentiment_analyzed_at TIMESTAMP,

    -- Vector embeddings for semantic analysis
    title_embedding vector(1536), -- OpenAI ada-002 embedding dimension
    content_embedding vector(1536), -- Full article content embedding
    sentiment_embedding vector(384), -- Sentence-transformer for sentiment
    embedding_model VARCHAR(50) DEFAULT 'text-embedding-ada-002',
    embedded_at TIMESTAMP,

    -- Metadata
    content_length INTEGER,
    scrape_status VARCHAR(20) DEFAULT 'SUCCESS', -- SUCCESS, FAILED, ARCHIVE_SUCCESS
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- No separate sentiment_embeddings table: it was removed as redundant.
-- All embeddings are stored directly in news_articles for simplicity and performance.

-- Performance indexes
CREATE INDEX idx_news_articles_symbol_date ON news_articles(symbol, published_date);
CREATE INDEX idx_news_articles_published_date ON news_articles(published_date);
CREATE INDEX idx_news_articles_source ON news_articles(source_id);
CREATE INDEX idx_news_articles_sentiment ON news_articles(sentiment_score, sentiment_confidence);
CREATE INDEX idx_news_articles_url_hash ON news_articles USING HASH (url);

-- Vector similarity indexes using HNSW (Hierarchical Navigable Small World)
-- Note: HNSW indexes consume significant memory (2-4x vector storage)
CREATE INDEX idx_articles_title_embedding ON news_articles USING hnsw (title_embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64); -- Tuned for performance vs memory
CREATE INDEX idx_articles_content_embedding ON news_articles USING hnsw (content_embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
CREATE INDEX idx_articles_sentiment_embedding ON news_articles USING hnsw (sentiment_embedding vector_cosine_ops)
    WITH (m = 8, ef_construction = 32); -- Smaller index for sentiment vectors
```

#### SQLAlchemy Models

```python
# tradingagents/domains/news/models.py
from datetime import datetime

# SQLAlchemy's fixed-precision column type is Numeric; there is no sqlalchemy.Decimal.
from sqlalchemy import Column, Integer, String, Text, Date, DateTime, Numeric, ForeignKey
from sqlalchemy.orm import declarative_base, relationship
from pgvector.sqlalchemy import Vector

Base = declarative_base()


class NewsSource(Base):
    __tablename__ = 'news_sources'

    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False, unique=True)
    domain = Column(String(255))
    credibility_score = Column(Numeric(3, 2), default=0.5)
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Relationships
    articles = relationship("NewsArticle", back_populates="source")


class NewsCategory(Base):
    __tablename__ = 'news_categories'

    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False, unique=True)
    description = Column(Text)
    created_at = Column(DateTime, default=datetime.utcnow)

    # Relationships
    articles = relationship("NewsArticle", back_populates="category")


class NewsArticle(Base):
    __tablename__ = 'news_articles'

    id = Column(Integer, primary_key=True)
    title = Column(Text, nullable=False)
    content = Column(Text)
    author = Column(String(255))
    symbol = Column(String(10))  # Nullable for global news
    source_id = Column(Integer, ForeignKey('news_sources.id'))
    category_id = Column(Integer, ForeignKey('news_categories.id'))
    url = Column(Text, unique=True, nullable=False)
    published_date = Column(Date, nullable=False)
    scraped_at = Column(DateTime, default=datetime.utcnow)

    # Sentiment fields
    sentiment_score = Column(Numeric(3, 2))  # -1.0 to 1.0
    sentiment_confidence = Column(Numeric(3, 2))  # 0.0 to 1.0
    sentiment_label = Column(String(20))  # positive/negative/neutral
    sentiment_analyzed_at = Column(DateTime)

    # Vector embeddings using pgvector
    title_embedding = Column(Vector(1536))  # OpenAI ada-002 dimensions
    content_embedding = Column(Vector(1536))  # Full content embedding
    sentiment_embedding = Column(Vector(384))  # Sentence transformer for sentiment
    embedding_model = Column(String(50), default='text-embedding-ada-002')
    embedded_at = Column(DateTime)

    # Metadata
    content_length = Column(Integer)
    scrape_status = Column(String(20), default='SUCCESS')
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Relationships
    source = relationship("NewsSource", back_populates="articles")
    category = relationship("NewsCategory", back_populates="articles")

# Removed redundant SentimentEmbedding table for simplified architecture
```

#### Database Migration Strategy

**Alembic Configuration:**

```python
# alembic/env.py
from alembic import context

from tradingagents.domains.news.models import Base
from tradingagents.config import TradingAgentsConfig

# Alembic's own config object; kept separate from the application config
# so the two do not shadow each other.
config = context.config
app_config = TradingAgentsConfig.from_env()
target_metadata = Base.metadata

# Database URL from config
config.set_main_option("sqlalchemy.url", app_config.database_url)
```

**Initial Migration:**

```bash
# Initialize Alembic in the project
alembic init alembic

# Generate initial migration
alembic revision --autogenerate -m "Create news tables"

# Apply migration
alembic upgrade head
```

**Migration Files:**

- `001_enable_pgvector.py` - Enable pgvector extension
- `002_create_news_tables.py` - Initial schema creation with vector fields
- `003_add_vector_indexes.py` - HNSW indexes for vector similarity
- `004_seed_categories_sources.py` - Seed default categories and trusted sources

**TradingAgentsConfig Extension:**

```python
import os
from dataclasses import dataclass, field


@dataclass
class TradingAgentsConfig:
    # ... existing fields ...

    # Database configuration
    database_url: str = field(default_factory=lambda: os.getenv("DATABASE_URL", ""))
    database_pool_size: int = field(default_factory=lambda: int(os.getenv("DATABASE_POOL_SIZE", "10")))
    database_max_overflow: int = field(default_factory=lambda: int(os.getenv("DATABASE_MAX_OVERFLOW", "20")))
    database_echo: bool = field(default_factory=lambda: os.getenv("DATABASE_ECHO", "false").lower() == "true")

    # Vector configuration
    enable_vector_search: bool = field(default_factory=lambda: os.getenv("ENABLE_VECTOR_SEARCH", "true").lower() == "true")
    embedding_model: str = field(default_factory=lambda: os.getenv("EMBEDDING_MODEL", "text-embedding-ada-002"))
    embedding_batch_size: int = field(default_factory=lambda: int(os.getenv("EMBEDDING_BATCH_SIZE", "100")))
    enable_sentence_transformers: bool = field(default_factory=lambda: os.getenv("ENABLE_SENTENCE_TRANSFORMERS", "true").lower() == "true")

    @property
    def has_database_config(self) -> bool:
        """Check if database is properly configured."""
        return bool(self.database_url and self.database_url.startswith("postgresql://"))

    @property
    def embedding_provider(self) -> str:
        """Get embedding provider from LLM provider setting."""
        # Map LLM providers to their embedding providers
        llm_provider = getattr(self, 'llm_provider', 'openai')
        embedding_map = {
            'openai': 'openai',
            'google': 'google',  # Use Gemini for embeddings when Google is selected
            'anthropic': 'openai',  # Anthropic doesn't have embeddings, use OpenAI
            'ollama': 'openai'  # Local models, use OpenAI for embeddings
        }
        return embedding_map.get(llm_provider, 'openai')


def validate_database_config(config: TradingAgentsConfig) -> None:
    """Validate database configuration before startup."""
    if config.has_database_config:
        return

    # Check the vector case first; otherwise this message could never be reached.
    if config.enable_vector_search:
        raise ValueError("Vector search requires PostgreSQL database configuration")

    raise ValueError("DATABASE_URL must be set for PostgreSQL integration")
```

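
One thing to watch in `has_database_config`: the `startswith("postgresql://")` test rejects SQLAlchemy URLs that name a driver, such as `postgresql+psycopg2://...`. A more tolerant check could compare only the dialect part of the URL scheme. The helper below is a hypothetical sketch (`is_postgres_url` is not part of the config class) using only the standard library.

```python
from urllib.parse import urlsplit


def is_postgres_url(database_url: str) -> bool:
    """Return True for postgresql:// URLs, with or without a +driver suffix."""
    if not database_url:
        return False
    scheme = urlsplit(database_url).scheme.lower()
    # SQLAlchemy URLs use "dialect+driver"; only the dialect matters here.
    dialect = scheme.split("+", 1)[0]
    return dialect == "postgresql"


print(is_postgres_url("postgresql://user:pw@localhost:5432/tradingagents"))  # True
print(is_postgres_url("postgresql+psycopg2://user:pw@localhost/tradingagents"))  # True
print(is_postgres_url("sqlite:///news.db"))  # False
```
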
**Environment Variables:**

```bash
# Database configuration (required)
DATABASE_URL=postgresql://username:password@localhost:5432/tradingagents
DATABASE_POOL_SIZE=10                     # optional, defaults to 10
DATABASE_MAX_OVERFLOW=20                  # optional, defaults to 20
DATABASE_ECHO=false                       # optional, set to true for SQL debugging

# Vector configuration (optional)
ENABLE_VECTOR_SEARCH=true                 # optional, defaults to true
EMBEDDING_MODEL=google/gemini-2.5-flash   # Use Gemini via OpenRouter for embeddings
EMBEDDING_BATCH_SIZE=100                  # optional, defaults to 100
ENABLE_SENTENCE_TRANSFORMERS=true         # optional, defaults to true

# The configured model must return 1536-dimensional vectors to fit the
# vector(1536) columns in the schema above.

# Example configurations by provider:
# For OpenAI: EMBEDDING_MODEL=text-embedding-ada-002
# For Gemini: EMBEDDING_MODEL=google/gemini-2.5-flash (via OpenRouter)
```

#### Embedding Generation Service Design

**SentimentScore Enhancement:**

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SentimentScoreWithEmbeddings:
    """Enhanced sentiment analysis with vector embeddings."""

    score: float  # -1.0 to 1.0
    confidence: float  # 0.0 to 1.0
    label: str  # positive/negative/neutral

    # Vector embeddings
    title_embedding: List[float]  # 1536 dimensions
    content_embedding: List[float]  # 1536 dimensions
    sentiment_embedding: Optional[List[float]]  # 384 dimensions; None when sentence-transformers is disabled
    embedding_model: str = "text-embedding-ada-002"
```

**Service Implementation:**

```python
import asyncio
import json
import logging
import os
from typing import List

from openai import AsyncOpenAI
from sentence_transformers import SentenceTransformer

# TradingAgentsConfig, ArticleData, NewsArticle, SentimentScore and
# SentimentScoreWithEmbeddings are the project types defined elsewhere in this document.

logger = logging.getLogger(__name__)


class EmbeddingProvider:
    """Abstract base for embedding providers."""

    model: str  # Embedding model name, recorded alongside each stored vector

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        raise NotImplementedError


class OpenAIEmbeddingProvider(EmbeddingProvider):
    def __init__(self, api_key: str, model: str = "text-embedding-ada-002"):
        self.client = AsyncOpenAI(api_key=api_key)
        self.model = model

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        response = await self.client.embeddings.create(
            input=texts,
            model=self.model
        )
        return [item.embedding for item in response.data]


class GeminiEmbeddingProvider(EmbeddingProvider):
    def __init__(self, api_key: str, base_url: str = "https://openrouter.ai/api/v1"):
        self.client = AsyncOpenAI(api_key=api_key, base_url=base_url)
        # Assumes the endpoint serves embeddings for this model; verify before use.
        self.model = "google/gemini-2.5-flash"

    async def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        # Gemini via OpenRouter - batch embeddings
        response = await self.client.embeddings.create(
            input=texts,
            model=self.model
        )
        return [item.embedding for item in response.data]


class SentimentAnalysisService:
    def __init__(self, config: TradingAgentsConfig):
        # _get_llm_client is not shown here; it builds the quick_think_llm client.
        self.llm_client = self._get_llm_client(config)
        self.embedding_provider = self._get_embedding_provider(config)
        self.sentence_transformer = SentenceTransformer('all-MiniLM-L6-v2') if config.enable_sentence_transformers else None

    def _get_embedding_provider(self, config: TradingAgentsConfig) -> EmbeddingProvider:
        """Get appropriate embedding provider based on configuration."""
        if config.embedding_provider == 'google':
            return GeminiEmbeddingProvider(
                api_key=os.getenv('OPENAI_API_KEY'),  # OpenRouter key
                base_url="https://openrouter.ai/api/v1"
            )

        # 'openai' and any unknown provider default to OpenAI
        return OpenAIEmbeddingProvider(
            api_key=os.getenv('OPENAI_API_KEY'),
            model=config.embedding_model
        )

    async def get_sentiment_with_embeddings(
        self,
        title: str,
        content: str
    ) -> SentimentScoreWithEmbeddings:
        """Generate sentiment analysis with vector embeddings - optimized for performance."""

        # 1. Parallel processing: sentiment score + embeddings
        tasks = [
            self._get_sentiment_score(content),  # LLM sentiment analysis
            self.embedding_provider.get_embeddings([title, content])  # Batch embedding API call
        ]

        sentiment, embeddings = await asyncio.gather(*tasks)
        title_embedding, content_embedding = embeddings

        # 2. Generate local sentiment embedding if enabled
        sentiment_embedding = None
        if self.sentence_transformer:
            # encode() is CPU-bound and blocking, so run it off the event loop
            vector = await asyncio.to_thread(self.sentence_transformer.encode, content)
            sentiment_embedding = vector.tolist()

        return SentimentScoreWithEmbeddings(
            score=sentiment.score,
            confidence=sentiment.confidence,
            label=sentiment.label,
            title_embedding=title_embedding,
            content_embedding=content_embedding,
            sentiment_embedding=sentiment_embedding,
            embedding_model=self.embedding_provider.model
        )

    async def _get_sentiment_score(self, content: str) -> SentimentScore:
        """Generate sentiment score using LLM with financial news prompt."""

        prompt = """
        Analyze the sentiment of this financial news article for trading purposes.

        Article Content: {content}

        Provide your analysis in the following JSON format:
        {{
            "score": <float between -1.0 (very negative) and 1.0 (very positive)>,
            "confidence": <float between 0.0 and 1.0>,
            "label": <"positive", "negative", or "neutral">,
            "reasoning": <brief explanation>,
            "key_themes": <list of key financial themes>,
            "financial_entities": <list of mentioned companies/tickers>
        }}

        Focus on the financial and market implications of the news.
        Consider impact on stock prices, market sentiment, and trading decisions.
        """.format(content=content[:2000])  # Limit content length

        try:
            # The LLM call sits inside the try block so API errors also yield neutral
            response = await self.llm_client.complete(prompt)
            result = json.loads(response)
            return SentimentScore(
                score=result.get("score", 0.0),
                confidence=result.get("confidence", 0.5),
                label=result.get("label", "neutral"),
                metadata={
                    "reasoning": result.get("reasoning", ""),
                    "key_themes": result.get("key_themes", []),
                    "financial_entities": result.get("financial_entities", [])
                }
            )
        except Exception as e:
            # Return neutral sentiment on error
            return SentimentScore(
                score=0.0,
                confidence=0.0,
                label="neutral",
                metadata={"error": str(e)}
            )

    def find_similar_articles(
        self,
        embedding: List[float],
        limit: int = 10,
        similarity_threshold: float = 0.8
    ) -> List[NewsArticle]:
        """Find semantically similar articles using vector similarity."""
        # Placeholder: will use pgvector cosine similarity search
        raise NotImplementedError

    async def batch_analyze_sentiment(
        self,
        articles: List[ArticleData],
        batch_size: int = 5
    ) -> List[SentimentScoreWithEmbeddings]:
        """
        Batch process sentiment analysis and embedding generation.

        Args:
            articles: List of articles to analyze
            batch_size: Number of articles to process concurrently

        Returns:
            List of sentiment scores with embeddings
        """
        results = []

        for i in range(0, len(articles), batch_size):
            batch = articles[i:i + batch_size]

            # Process batch concurrently
            batch_tasks = [
                self.get_sentiment_with_embeddings(article.title, article.content)
                for article in batch
            ]

            batch_results = await asyncio.gather(*batch_tasks, return_exceptions=True)

            for result in batch_results:
                if isinstance(result, Exception):
                    # Handle individual failures gracefully
                    logger.error(f"Sentiment analysis failed: {result}")
                    # Helper not shown here: returns a neutral score without embeddings
                    results.append(self._get_neutral_sentiment_with_embeddings())
                else:
                    results.append(result)

            # Rate limiting: Add delay between batches
            if i + batch_size < len(articles):
                await asyncio.sleep(1.0)  # 1 second delay between batches

        return results
```

**Optimized Vector Similarity Queries:**

```sql
-- Find articles similar to a given title embedding (HNSW optimized)
-- Note: keep ORDER BY ... LIMIT on the raw distance in the inner query so the
-- HNSW index is used, then apply the distance threshold in the outer query.
-- (HAVING cannot follow LIMIT or reference a column alias.)
SELECT id, title, symbol, distance, 1 - distance AS similarity
FROM (
    SELECT id, title, symbol,
           (title_embedding <=> %s) AS distance
    FROM news_articles
    WHERE title_embedding IS NOT NULL  -- Only filter on non-null vectors
    ORDER BY title_embedding <=> %s
    LIMIT 20  -- Get more candidates, then filter
) AS candidates
WHERE distance < 0.2  -- Filter after the index scan
ORDER BY distance;

-- Find articles with similar sentiment patterns (pre-filter by label)
SELECT id, title, sentiment_label,
       (sentiment_embedding <=> %s) AS distance
FROM news_articles
WHERE sentiment_label = %s
  AND sentiment_embedding IS NOT NULL
ORDER BY sentiment_embedding <=> %s
LIMIT 15;

-- Cluster articles by content similarity for a ticker (optimized approach)
WITH similar_articles AS (
    SELECT id, symbol, sentiment_score,
           (content_embedding <=> %s) AS distance
    FROM news_articles
    WHERE symbol = %s  -- Use indexed column first
      AND content_embedding IS NOT NULL
    ORDER BY content_embedding <=> %s
    LIMIT 50  -- Limit search space
)
SELECT symbol,
       AVG(sentiment_score) AS avg_sentiment,
       COUNT(*) AS article_count,
       AVG(distance) AS avg_content_distance
FROM similar_articles
WHERE distance < 0.3  -- Apply similarity threshold after vector search
GROUP BY symbol;

-- Performance monitoring query
SELECT
    schemaname,
    tablename,
    attname AS column_name,
    n_distinct,
    correlation
FROM pg_stats
WHERE tablename = 'news_articles'
  AND attname LIKE '%embedding%';
```

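
The queries above rank by pgvector's `<=>` operator, which is cosine distance (1 minus cosine similarity), so a threshold of `distance < 0.2` corresponds to `similarity > 0.8`. The pure-Python sketch below mirrors the "order, limit, then filter" shape of the first query on an in-memory dictionary. It is only an illustration of the arithmetic: `cosine_distance` and `nearest` are hypothetical helpers, and rejecting zero vectors with an error is a choice made here, not pgvector's behaviour.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 1 - cosine similarity, ranging from 0 to 2."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def nearest(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    limit: int = 20,
    max_distance: float = 0.2,
) -> List[Tuple[str, float]]:
    """Rank all candidates by distance, keep `limit`, then apply the threshold."""
    ranked = sorted((cosine_distance(query, vec), key) for key, vec in candidates.items())
    return [(key, dist) for dist, key in ranked[:limit] if dist < max_distance]


articles = {"earnings-beat": [2.0, 0.1], "unrelated": [0.0, 1.0]}
print([key for key, _ in nearest([1.0, 0.0], articles)])  # ['earnings-beat']
```
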
**Memory Usage Estimation:**

```sql
-- Measure the on-disk size of the HNSW indexes and the table
-- (a proxy for the memory needed to keep them cached)
SELECT
    pg_size_pretty(pg_total_relation_size('idx_articles_title_embedding')) AS title_index_size,
    pg_size_pretty(pg_total_relation_size('idx_articles_content_embedding')) AS content_index_size,
    pg_size_pretty(pg_total_relation_size('idx_articles_sentiment_embedding')) AS sentiment_index_size,
    pg_size_pretty(pg_total_relation_size('news_articles')) AS table_size;

-- Expected memory usage: 500MB-1GB for 10K articles with 3 embedding types
```

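
A back-of-the-envelope check of that figure can be done before any data exists. The sketch below assumes pgvector's documented storage cost of 4 bytes per dimension plus 8 bytes per vector, and reuses the 2-4x HNSW overhead quoted in the schema comments; the function names are hypothetical. Row overhead and article text are not included, so the result is a floor rather than a forecast.

```python
from typing import Sequence, Tuple


def vector_bytes(dimensions: int) -> int:
    """Storage for one pgvector value: 4 bytes per dimension plus 8 bytes."""
    return 4 * dimensions + 8


def raw_embedding_bytes(
    article_count: int, dims: Sequence[int] = (1536, 1536, 384)
) -> int:
    """Raw vector storage for title, content, and sentiment embeddings."""
    return article_count * sum(vector_bytes(d) for d in dims)


def estimate_total_bytes(
    article_count: int, index_overhead: Tuple[float, float] = (2.0, 4.0)
) -> Tuple[int, int]:
    """Raw vectors plus HNSW indexes at the assumed 2-4x overhead."""
    raw = raw_embedding_bytes(article_count)
    low, high = index_overhead
    return int(raw * (1 + low)), int(raw * (1 + high))


low, high = estimate_total_bytes(10_000)
print(raw_embedding_bytes(10_000))  # 138480000
print(f"{low / 2**20:.0f}-{high / 2**20:.0f} MiB")  # 396-660 MiB
```

Under these assumptions 10K articles come to roughly 0.4-0.7 GB for vectors and indexes, which is in the same range as the 500MB-1GB estimate once table rows and article text are added.
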
### Current Implementation Status

**✅ COMPLETED COMPONENTS:**

1. **NewsService Core Structure (90% Complete)**
   - ✅ Core service class with dependency injection
   - ✅ Read path implemented: `get_company_news_context()`, `get_global_news_context()`
   - ✅ Write path implemented: `update_company_news()`, `update_global_news()`
   - ✅ Repository integration with file-based storage
   - ✅ ArticleData model conversion from repository NewsArticle
   - ✅ Simple keyword-based sentiment analysis as fallback
   - ✅ Error handling and empty context returns
   - ✅ Trending topics extraction
   - ✅ Date validation and ISO format handling

2. **NewsRepository (100% Complete)**
   - ✅ File-based storage with JSON serialization
   - ✅ Source separation (finnhub, google_news)
   - ✅ Date-based file organization (YYYY-MM-DD.json)
   - ✅ Article deduplication by URL
   - ✅ Batch storage operations
   - ✅ Complete CRUD operations
   - ✅ Proper error handling and logging

3. **Data Models (100% Complete)**
   - ✅ ArticleData dataclass with sentiment field
   - ✅ NewsContext and GlobalNewsContext for agent consumption
   - ✅ SentimentScore model
   - ✅ NewsUpdateResult for operation tracking
   - ✅ DataQuality enum for metadata

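
To make the file-based repository behaviour above concrete (one JSON file per day, sources kept apart, duplicates dropped by URL), here is a minimal sketch. It is not the actual `NewsRepository` code: the `store_articles` function and the `<base_dir>/<source>/<symbol>/<YYYY-MM-DD>.json` layout are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Dict, List


def store_articles(
    base_dir: Path, source: str, symbol: str, day: str, articles: List[Dict]
) -> int:
    """Merge articles into the day's JSON file, skipping URLs already stored.

    Returns the number of newly added articles.
    """
    # Assumed layout: <base_dir>/<source>/<symbol>/<YYYY-MM-DD>.json
    path = base_dir / source / symbol / f"{day}.json"
    path.parent.mkdir(parents=True, exist_ok=True)

    existing: List[Dict] = []
    if path.exists():
        existing = json.loads(path.read_text(encoding="utf-8"))

    seen = {article["url"] for article in existing}
    added = 0
    for article in articles:
        if article["url"] not in seen:
            existing.append(article)
            seen.add(article["url"])
            added += 1

    path.write_text(json.dumps(existing, indent=2), encoding="utf-8")
    return added
```

Because the URL set is rebuilt from the file on every call, rerunning the daily cron job for the same ticker and date adds nothing new, which is the same guarantee the planned `UNIQUE` constraint on `url` would give in PostgreSQL.
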
**✅ COMPLETED COMPONENTS (UPDATED):**

4. **GoogleNewsClient (100% Complete)**
   - ✅ RSS feed parsing with feedparser
   - ✅ Company news method implemented (`get_company_news()`)
   - ✅ Global news method implemented (`get_global_news()`)
   - ✅ Proper error handling and logging
   - ✅ Google News RSS URL construction
   - ✅ Article parsing with source extraction
   - ✅ Date parsing with fallback handling

5. **ArticleScraperClient (100% Complete)**
   - ✅ Full newspaper3k content extraction
   - ✅ Internet Archive Wayback Machine fallback
   - ✅ Robust error handling for failed scrapes
   - ✅ Content validation (minimum length checks)
   - ✅ Multiple article batch processing
   - ✅ Rate limiting with configurable delays
   - ✅ Proper URL validation

**❌ MISSING COMPONENTS:**

6. **LLM Sentiment Analysis Service (0% Complete)**
   - ❌ SentimentAnalysisService class not created
   - ❌ LLM integration not implemented
   - ❌ Financial news prompts not defined
   - ❌ Batch processing not implemented
   - **Current**: Using simple keyword-based fallback
   - **Next**: Create dedicated sentiment service

7. **Database Migration (0% Complete)**
   - ❌ SQLAlchemy models not created
   - ❌ PostgreSQL integration not started
   - ❌ pgvector extension not configured
   - ❌ Alembic migrations not set up
   - **Current**: Using file-based storage
   - **Status**: Planned for future iteration

8. **Vector Embeddings (0% Complete)**
   - ❌ Embedding providers not implemented
   - ❌ Vector similarity not available
   - ❌ Semantic search not implemented
   - **Status**: Advanced feature for future enhancement

### Revised Implementation Phases

**PHASE 1: Complete Core Functionality (Current Priority)**
- **GoogleNewsClient RSS Implementation (2-3 days)**
  - Implement feedparser RSS parsing
  - Add company news and global news methods
  - Handle RSS feed errors and edge cases
  - Create comprehensive tests with VCR cassettes

- **ArticleScraperClient Implementation (2-3 days)**
  - Implement newspaper3k content extraction
  - Add Internet Archive fallback mechanism
  - Handle paywalls and extraction failures
  - Create scraping tests with mock responses

- **LLM Sentiment Analysis Service (3-4 days)**
  - Create SentimentAnalysisService class
  - Implement LLM client integration using TradingAgentsConfig
  - Design financial news sentiment prompts
  - Add batch processing with rate limiting
  - Replace keyword-based sentiment in NewsService

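One possible shape for the planned sentiment service is sketched below. It is an assumption-laden outline rather than the intended design: the LLM is injected as a plain callable (wiring it from `TradingAgentsConfig` is left out), `SentimentResult` is a stand-in for the project's `SentimentScore` model, and the prompt wording and JSON reply format are hypothetical.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical LLM interface: prompt text in, raw completion text out.
LLMCall = Callable[[str], str]

# Hypothetical prompt; doubled braces survive str.format as literal braces.
PROMPT_TEMPLATE = (
    "Classify the sentiment of this financial news article for an investor. "
    'Reply with JSON only, for example {{"label": "positive", "score": 0.6}}. '
    "Allowed labels: positive, negative, neutral. Score range: -1.0 to 1.0.\n\n"
    "Article:\n{content}"
)


@dataclass(frozen=True)
class SentimentResult:
    label: str
    score: float


NEUTRAL = SentimentResult(label="neutral", score=0.0)


class SentimentAnalysisService:
    def __init__(
        self,
        llm: LLMCall,
        delay_seconds: float = 0.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._llm = llm
        self._delay = delay_seconds
        self._sleep = sleep  # injectable so tests do not actually wait

    def get_sentiment(self, content: str) -> SentimentResult:
        """Score one article; any LLM or parsing failure degrades to neutral."""
        if not content.strip():
            return NEUTRAL
        try:
            raw = self._llm(PROMPT_TEMPLATE.format(content=content))
            data = json.loads(raw)
            label = str(data["label"]).lower()
            score = float(data["score"])
        except Exception:
            return NEUTRAL
        if label not in {"positive", "negative", "neutral"}:
            return NEUTRAL
        # Clamp out-of-range scores instead of rejecting the whole reply.
        return SentimentResult(label, max(-1.0, min(1.0, score)))

    def get_sentiment_batch(self, contents: List[str]) -> List[SentimentResult]:
        """Score articles sequentially, pausing between calls as a crude rate limit."""
        results = []
        for index, content in enumerate(contents):
            if index and self._delay > 0:
                self._sleep(self._delay)
            results.append(self.get_sentiment(content))
        return results
```

Falling back to a neutral result on errors is one choice among several; falling back to the existing keyword-based scoring would be an equally plausible option.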
**PHASE 2: Testing and Refinement (Current Phase)**
- **Integration Testing (1-2 days)**
  - End-to-end testing with real RSS feeds
  - Test article scraping and sentiment analysis pipeline
  - Verify error handling and partial failures
  - Performance testing with multiple tickers

- **Type Safety and Quality (1 day)**
  - Ensure `mise run typecheck` passes with 0 errors
  - Fix any remaining linting issues
  - Add missing docstrings and type hints

**PHASE 3: Future Enhancements (Deferred)**
- **Database Migration**: SQLAlchemy + PostgreSQL + pgvector
- **Vector Embeddings**: Semantic similarity and clustering
- **Performance Optimization**: Caching improvements and batch processing

### Total Timeline: 1-2 weeks for core completion
- **Week 1**: Complete GoogleNewsClient, ArticleScraperClient, LLM Sentiment Service
- **Week 2**: Integration testing, refinement, and quality assurance
- **Future**: Database migration and vector enhancements as a separate project

## Testing Plan

### Test Strategy
- **Unit Testing:** Test individual components in isolation with mocked dependencies
- **Integration Testing:** Test component interactions and data flow
- **End-to-End Testing:** Test complete workflows from news fetching to storage

### Unit Tests

#### GoogleNewsClient Tests
- **Location:** `tests/domains/news/test_google_news_client.py`
- **Framework:** `pytest` with `pytest-vcr` for HTTP recording/replay
- **VCR Cassettes:** `tests/fixtures/vcr_cassettes/google_news/`
- **Test Cases:**
  - `@pytest.mark.vcr` `test_get_news_by_symbol_success()` - Valid symbol returns articles
  - `@pytest.mark.vcr` `test_get_news_by_symbol_invalid_symbol()` - Invalid symbol handling
  - `@pytest.mark.vcr` `test_get_global_news_success()` - Global news retrieval
  - `@pytest.mark.vcr` `test_get_global_news_empty_response()` - Empty RSS feed handling
  - `test_rss_feed_parsing_error()` - Malformed RSS handling (mocked)
  - `test_network_timeout()` - Network timeout scenarios (mocked)
  - `test_rate_limiting()` - Rate limit compliance (mocked)

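The behaviours these tests target (feed URL construction, date parsing with a fallback, and tolerance of malformed RSS) can be illustrated with a small standard-library sketch. Note the substitutions: the real client uses feedparser, whereas this example uses `xml.etree.ElementTree` to stay self-contained. The function names and the locale parameters (`hl`, `gl`, `ceid`) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Dict, List
from urllib.parse import urlencode


def build_search_feed_url(query: str) -> str:
    """Build a Google News RSS search URL (locale parameters are assumed)."""
    params = {"q": query, "hl": "en-US", "gl": "US", "ceid": "US:en"}
    return "https://news.google.com/rss/search?" + urlencode(params)


def parse_pub_date(value: str, fallback: datetime) -> datetime:
    """Parse an RFC 822 style date, returning `fallback` when it is unusable."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return fallback


def parse_feed(xml_text: str, fallback_date: datetime) -> List[Dict[str, object]]:
    """Return one dict per <item>; a malformed feed yields an empty list."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    articles: List[Dict[str, object]] = []
    for item in root.iter("item"):
        articles.append(
            {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "source": item.findtext("source", default=""),
                "published": parse_pub_date(
                    item.findtext("pubDate", default=""), fallback_date
                ),
            }
        )
    return articles
```

A mocked `test_rss_feed_parsing_error()` could then feed a truncated XML string into the parser and assert that the result is empty rather than an exception.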
#### ArticleScraperClient Tests
- **Location:** `tests/domains/news/test_article_scraper_client.py`
- **Framework:** `pytest` with `pytest-vcr` for HTTP recording/replay
- **VCR Cassettes:** `tests/fixtures/vcr_cassettes/article_scraper/`
- **Test Cases:**
  - `@pytest.mark.vcr` `test_scrape_article_success()` - Successful article scraping
  - `@pytest.mark.vcr` `test_scrape_article_archive_fallback()` - Internet Archive (Wayback Machine) fallback
  - `test_scrape_article_both_fail()` - Both methods fail gracefully (mocked)
  - `test_invalid_url()` - Invalid URL handling (mocked)
  - `@pytest.mark.vcr` `test_content_extraction()` - Content parsing accuracy

#### SentimentAnalysisService Tests
- **Location:** `tests/domains/news/test_sentiment_service.py`
- **Test Cases:**
  - `test_get_sentiment_positive()` - Positive sentiment detection
  - `test_get_sentiment_negative()` - Negative sentiment detection
  - `test_get_sentiment_neutral()` - Neutral sentiment detection
  - `test_get_sentiment_llm_error()` - LLM API error handling
  - `test_get_sentiment_invalid_response()` - Invalid JSON response handling
  - `test_get_sentiment_empty_content()` - Empty content handling

#### NewsService Tests
- **Location:** `tests/domains/news/test_news_service.py`
- **Test Cases:**
  - `test_update_company_news_success()` - Complete news update workflow
  - `test_update_company_news_no_articles()` - No articles found scenario
  - `test_update_company_news_scraping_failure()` - Partial scraping failures
  - `test_sentiment_analysis_integration()` - Sentiment analysis integration
  - `test_calculate_sentiment_summary()` - Sentiment aggregation logic
  - `test_get_company_news_by_date()` - News retrieval by date

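To make the "sentiment aggregation logic" test case concrete, here is a hedged sketch of what such an aggregation might look like. The `SentimentSummary` fields, the `(label, score)` input shape, and the `threshold` of 0.1 are all hypothetical; the actual `NewsService` summary may be structured differently.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

LABELS = ("positive", "negative", "neutral")


@dataclass(frozen=True)
class SentimentSummary:
    counts: Dict[str, int]
    average_score: float
    overall: str


def calculate_sentiment_summary(
    scored: Iterable[Tuple[str, float]],
    threshold: float = 0.1,  # hypothetical cut-off for a non-neutral verdict
) -> SentimentSummary:
    """Aggregate per-article (label, score) pairs into one summary."""
    items = list(scored)
    tally = Counter(label for label, _ in items)
    counts = {label: tally.get(label, 0) for label in LABELS}
    # An empty day of news is treated as neutral rather than as an error.
    average = sum(score for _, score in items) / len(items) if items else 0.0
    if average > threshold:
        overall = "positive"
    elif average < -threshold:
        overall = "negative"
    else:
        overall = "neutral"
    return SentimentSummary(counts, average, overall)
```

A unit test would typically cover a mixed set of articles, an all-neutral set, and the empty case.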
#### NewsRepository Tests
- **Location:** `tests/domains/news/test_news_repository.py`
- **Test Cases:**
  - `test_store_news_articles()` - Article storage
  - `test_get_news_by_symbol_and_date()` - News retrieval
  - `test_duplicate_article_handling()` - Duplicate prevention
  - `test_data_persistence()` - File system persistence
  - `test_invalid_data_handling()` - Invalid data rejection

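The storage, deduplication, and persistence cases above could be exercised against something like the following sketch. It is not the real `NewsRepository`: the class name `FileNewsRepository`, the one-JSON-file-per-symbol-and-date layout, and deduplication by article URL are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Dict, List


class FileNewsRepository:
    """Stores one JSON file per symbol and date, deduplicating by URL."""

    def __init__(self, base_dir: Path) -> None:
        self._base_dir = Path(base_dir)

    def _path(self, symbol: str, date: str) -> Path:
        # Hypothetical layout: <base>/<SYMBOL>/<YYYY-MM-DD>.json
        return self._base_dir / symbol.upper() / f"{date}.json"

    def get_news(self, symbol: str, date: str) -> List[Dict[str, str]]:
        path = self._path(symbol, date)
        if not path.exists():
            return []
        return json.loads(path.read_text(encoding="utf-8"))

    def store_news_articles(
        self, symbol: str, date: str, articles: List[Dict[str, str]]
    ) -> int:
        """Append new articles and return how many were actually added."""
        existing = self.get_news(symbol, date)
        seen = {article["url"] for article in existing}
        added = 0
        for article in articles:
            url = article.get("url")
            if not url or url in seen:
                # Skip invalid entries (no URL) and duplicates.
                continue
            existing.append(article)
            seen.add(url)
            added += 1
        path = self._path(symbol, date)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(existing, indent=2), encoding="utf-8")
        return added
```

In tests, pointing the repository at a temporary directory (for example pytest's `tmp_path` fixture) keeps each run isolated from real data.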
### Integration Tests

#### News Workflow Integration
- **Location:** `tests/integration/test_news_workflow.py`
- **Test Cases:**
  - `test_full_news_update_workflow()` - Complete end-to-end workflow
  - `test_news_service_with_real_clients()` - Real client integration
  - `test_sentiment_service_integration()` - LLM integration testing
  - `test_repository_integration()` - Data persistence integration

### End-to-End Tests

#### Complete System Tests
- **Location:** `tests/e2e/test_news_system.py`
- **Test Cases:**
  - `test_daily_news_update_simulation()` - Simulate daily cron job
  - `test_trading_agent_news_consumption()` - Agent news retrieval
  - `test_system_performance_with_multiple_tickers()` - Performance testing
  - `test_error_recovery_scenarios()` - System resilience testing

### Test Data Management

#### Mock Data Strategy
- **RSS Feed Samples:** Saved sample RSS responses for consistent testing
- **Article Content:** Pre-scraped article content for sentiment testing
- **LLM Responses:** Mock sentiment analysis responses for unit tests

#### Test Configuration
- **Environment Variables:** Separate test configuration
- **Database Isolation:** Temporary test databases
- **VCR Configuration:** Record/replay HTTP interactions for deterministic tests
- **Pytest Configuration:** `pytest.ini` with VCR settings and test markers

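As one way to realise the VCR configuration item, `pytest-vcr` reads its settings from a `vcr_config` fixture. The sketch below assumes it lives in `tests/conftest.py`; the filtered header names and the choice of the `none` record mode (replay only, never hit the network) are illustrative assumptions, and `vcr_settings` is a hypothetical helper.

```python
# tests/conftest.py (sketch)
import pytest


def vcr_settings() -> dict:
    """Shared VCR options; kept as a plain function so it is easy to inspect."""
    return {
        # Keep credentials out of recorded cassettes.
        "filter_headers": ["authorization", "x-api-key"],
        # Replay existing cassettes only; a missing cassette fails the test.
        "record_mode": "none",
        # Request attributes that must match for a recorded response to be reused.
        "match_on": ["method", "scheme", "host", "port", "path", "query"],
    }


@pytest.fixture(scope="module")
def vcr_config():
    # pytest-vcr picks this fixture up for tests marked with @pytest.mark.vcr.
    return vcr_settings()
```

When new cassettes need to be recorded, the record mode can be relaxed temporarily instead of being changed in the committed configuration.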
### Performance Testing

#### Load Testing
- **Concurrent News Updates:** Test multiple ticker updates simultaneously
- **Memory Usage:** Monitor memory consumption during batch processing
- **API Rate Limiting:** Verify rate limit compliance under load

#### Benchmarking
- **Scraping Speed:** Measure article scraping performance
- **Sentiment Analysis:** Measure LLM response times
- **Storage Performance:** Database write/read performance

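A concurrent-update load test could be driven by a small harness like the one below. It is a sketch under assumptions: `run_concurrent_updates` is a hypothetical helper, and `update` stands for whatever per-ticker entry point the service exposes (it is passed in as a callable so no real network calls are needed).

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def run_concurrent_updates(
    update: Callable[[str], int],
    tickers: Iterable[str],
    max_workers: int = 4,
) -> Tuple[Dict[str, int], Dict[str, str], float]:
    """Run `update` per ticker in a thread pool.

    Returns (results by ticker, error messages by ticker, elapsed seconds).
    """
    results: Dict[str, int] = {}
    errors: Dict[str, str] = {}
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(update, ticker): ticker for ticker in tickers}
        for future in as_completed(futures):
            ticker = futures[future]
            try:
                results[ticker] = future.result()
            except Exception as exc:
                # One failing ticker should not hide the others' results.
                errors[ticker] = repr(exc)
    return results, errors, time.perf_counter() - started
```

The elapsed time gives a coarse benchmark figure, while the separate error map makes partial failures visible in the same run.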
### Test Automation

#### CI/CD Integration
- **Pre-commit Hooks:** Run fast unit tests before commits
- **Pull Request Checks:** Full test suite on PR creation
- **Nightly Tests:** End-to-end tests with real data

#### Test Coverage Requirements
- **Minimum Coverage:** 80% line coverage for all components
- **Critical Path Coverage:** 100% coverage for core business logic
- **Error Handling Coverage:** All exception paths tested

### Manual Testing Scenarios

#### Smoke Tests
- **Daily Operations:** Manual verification of daily news updates
- **Data Quality:** Spot-check sentiment analysis accuracy
- **System Health:** Monitor error rates and performance metrics

#### Acceptance Testing
- **Trading Agent Integration:** Verify agents can consume news data effectively
- **Data Accuracy:** Validate news relevance and sentiment accuracy
- **Performance Benchmarks:** Confirm system meets performance requirements

## Current Implementation Status Summary

### Overall Progress: 90% Complete 🎉

**✅ COMPLETED (100%)**
- Requirements analysis and technical design
- NewsService core structure with read/write paths
- NewsRepository with file-based storage and deduplication
- Data models (ArticleData, NewsContext, SentimentScore)
- GoogleNewsClient with full RSS feed parsing
- ArticleScraperClient with newspaper3k + Internet Archive fallback
- Basic sentiment analysis (keyword-based fallback)
- Error handling and validation
- Service integration and dependency injection

**❌ MISSING (10%)**
- LLM sentiment analysis service (only remaining core component)

**⏸️ DEFERRED (Future Iterations)**
- Database migration to PostgreSQL + SQLAlchemy
- Vector embeddings and semantic search
- Real-time news streaming capabilities

### What's Working Now
The current NewsService implementation provides:
- **Read Path**: Agents can successfully call `get_company_news_context()` and `get_global_news_context()`
- **Repository Integration**: Service reads cached news data from file-based NewsRepository
- **Data Transformation**: Converts NewsRepository.NewsArticle → ArticleData for agents
- **Basic Sentiment**: Simple keyword-based sentiment analysis as fallback
- **Error Handling**: Graceful error handling with empty contexts and metadata
- **Type Safety**: Proper type hints and dataclass definitions

### What's Missing
The service currently lacks:
- **LLM Sentiment Analysis**: No LLM integration for financial news sentiment (the keyword fallback is used instead)
- **Structured Storage**: Still uses file-based storage instead of the planned PostgreSQL + SQLAlchemy
- **Vector Embeddings**: No semantic similarity or vector-based features

### Critical Gap (Only 1 Remaining!)
1. **LLM Sentiment Service** - No structured sentiment analysis with LLM prompts
   - Current: Simple keyword-based sentiment scoring
   - Needed: LLM integration using TradingAgentsConfig
   - Impact: Agents get basic sentiment but not sophisticated financial analysis

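To show why keyword-based scoring counts only as basic sentiment, here is a minimal sketch of that style of scorer. It is illustrative and not the code currently in `NewsService`: the word lists, the `keyword_sentiment` name, and the score formula are all assumptions.

```python
import re
from typing import Tuple

# Hypothetical word lists; a real fallback would likely use larger ones.
POSITIVE_WORDS = frozenset(
    {"beat", "beats", "gain", "gains", "growth", "profit", "surge", "upgrade"}
)
NEGATIVE_WORDS = frozenset(
    {"decline", "downgrade", "fraud", "lawsuit", "loss", "miss", "misses", "plunge"}
)


def keyword_sentiment(text: str) -> Tuple[str, float]:
    """Return (label, score) with score in [-1, 1] from simple keyword counts."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    if positive == negative:
        # Covers both "no keywords found" and an even split.
        return "neutral", 0.0
    score = (positive - negative) / (positive + negative)
    return ("positive" if score > 0 else "negative"), score
```

A scorer like this ignores negation and context (for example, "no loss expected" still counts "loss" as negative), which is the kind of limitation the LLM-based service is meant to address.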
### Recently Discovered: Implementation is 90% Complete!
Upon detailed code review, the implementation is much further along than initially documented:
- ✅ **GoogleNewsClient** - Fully implemented with RSS parsing
- ✅ **ArticleScraperClient** - Complete with newspaper3k + Internet Archive fallback
- ✅ **NewsService** - Full read/write paths with proper error handling
- ✅ **NewsRepository** - Production-ready file-based storage

### Next Immediate Steps (Revised)
1. **✅ COMPLETE: GoogleNewsClient RSS parsing** - Already implemented with feedparser
2. **✅ COMPLETE: ArticleScraperClient** - Already implemented with newspaper3k + Internet Archive
3. **⏳ PRIORITY: Create LLM Sentiment Service** - Replace keyword-based analysis (2-3 days)
4. **⏳ PRIORITY: Integration testing** - End-to-end workflow validation (1-2 days)

### Timeline to MVP (Updated)
- **3-5 days** for LLM sentiment service + testing
- **Current system is production-ready** with basic sentiment analysis
- **Database migration** deferred to future iteration
- **Vector features** planned as advanced enhancement

### Implementation Priority
**HIGH PRIORITY (Required for sophisticated sentiment)**:
- LLM Sentiment Analysis Service with financial news prompts

**MEDIUM PRIORITY (System improvements)**:
- Better error handling and retry logic
- Performance optimization for batch processing
- Comprehensive integration test suite

**LOW PRIORITY (Future enhancements)**:
- PostgreSQL + SQLAlchemy migration
- Vector embeddings and semantic search
- Real-time news streaming