946 lines
33 KiB
Markdown
946 lines
33 KiB
Markdown
# News Domain Technical Design
|
|
|
|
## Overview
|
|
|
|
This document details the technical design for completing the final 5% of the News domain implementation. The existing infrastructure is 95% complete with Google News collection, article scraping, and basic storage implemented. The remaining work focuses on **scheduled execution**, **LLM-powered sentiment analysis**, and **vector embeddings** using OpenRouter as the unified LLM provider.
|
|
|
|
## Architecture Overview
|
|
|
|
### Component Relationships
|
|
|
|
```mermaid
|
|
graph TD
|
|
A[APScheduler] --> B[ScheduledNewsCollector]
|
|
B --> C[NewsService]
|
|
C --> D[GoogleNewsClient]
|
|
C --> E[ArticleScraperClient]
|
|
C --> F[OpenRouter LLM Client]
|
|
C --> G[OpenRouter Embeddings Client]
|
|
C --> H[NewsRepository]
|
|
H --> I[PostgreSQL + TimescaleDB + pgvectorscale]
|
|
|
|
J[News Analysts] --> K[AgentToolkit]
|
|
K --> C
|
|
K --> H
|
|
```
|
|
|
|
### Data Flow Architecture
|
|
|
|
1. **Scheduled Collection Flow**
|
|
```
|
|
APScheduler → ScheduledNewsCollector → NewsService.update_company_news()
|
|
→ GoogleNewsClient → ArticleScraperClient → OpenRouter (sentiment + embeddings)
|
|
→ NewsRepository.upsert_batch() → PostgreSQL
|
|
```
|
|
|
|
2. **Agent Query Flow**
|
|
```
|
|
News Analyst → AgentToolkit → NewsService.find_relevant_articles()
|
|
→ NewsRepository (semantic search) → pgvectorscale vector similarity
|
|
```
|
|
|
|
### Key Design Principles
|
|
|
|
- **Leverage Existing 95%**: Build on proven GoogleNewsClient and ArticleScraperClient infrastructure
|
|
- **OpenRouter Unified**: Single API for both sentiment analysis and embeddings
|
|
- **Best-Effort Processing**: LLM failures don't block article storage
|
|
- **Vector-Enhanced Search**: Semantic similarity for News Analysts
|
|
- **Fault-Tolerant Scheduling**: Robust error handling and monitoring
|
|
|
|
## Domain Model
|
|
|
|
### Enhanced NewsArticle Entity
|
|
|
|
The existing `NewsArticle` entity requires enhancements for structured sentiment and vector support:
|
|
|
|
```python
|
|
from typing import Optional, Dict, Any, List
|
|
from pydantic import BaseModel, Field, validator
|
|
import datetime
|
|
|
|
class SentimentScore(BaseModel):
|
|
"""Structured sentiment analysis result"""
|
|
sentiment: Literal["positive", "negative", "neutral"]
|
|
confidence: float = Field(ge=0.0, le=1.0)
|
|
reasoning: str
|
|
|
|
@validator('confidence')
|
|
def validate_confidence(cls, v):
|
|
if v < 0.5:
|
|
raise ValueError("Confidence must be >= 0.5 for reliable sentiment")
|
|
return v
|
|
|
|
class NewsArticle(BaseModel):
|
|
"""Enhanced NewsArticle entity with sentiment and vector support"""
|
|
# Existing fields (95% complete)
|
|
headline: str
|
|
url: str = Field(..., regex=r'^https?://')
|
|
source: str
|
|
published_date: datetime.datetime
|
|
summary: Optional[str] = None
|
|
entities: List[str] = Field(default_factory=list)
|
|
author: Optional[str] = None
|
|
category: Optional[str] = None
|
|
|
|
# Enhanced fields (final 5%)
|
|
sentiment_score: Optional[SentimentScore] = None
|
|
title_embedding: Optional[List[float]] = Field(None, min_items=1536, max_items=1536)
|
|
content_embedding: Optional[List[float]] = Field(None, min_items=1536, max_items=1536)
|
|
|
|
# Metadata
|
|
created_at: datetime.datetime = Field(default_factory=datetime.datetime.now)
|
|
updated_at: datetime.datetime = Field(default_factory=datetime.datetime.now)
|
|
|
|
@validator('content_embedding', 'title_embedding')
|
|
def validate_embeddings(cls, v):
|
|
if v and len(v) != 1536:
|
|
raise ValueError("Embeddings must be 1536 dimensions for OpenRouter compatibility")
|
|
return v
|
|
|
|
def has_reliable_sentiment(self) -> bool:
|
|
"""Check if sentiment analysis is reliable (confidence >= 0.5)"""
|
|
return bool(self.sentiment_score and self.sentiment_score.confidence >= 0.5)
|
|
|
|
def to_record(self) -> Dict[str, Any]:
|
|
"""Convert to database record format"""
|
|
record = self.dict()
|
|
# Convert sentiment to JSONB format
|
|
if self.sentiment_score:
|
|
record['sentiment_score'] = self.sentiment_score.dict()
|
|
return record
|
|
|
|
@classmethod
|
|
def from_record(cls, record: Dict[str, Any]) -> 'NewsArticle':
|
|
"""Create entity from database record"""
|
|
if record.get('sentiment_score'):
|
|
record['sentiment_score'] = SentimentScore(**record['sentiment_score'])
|
|
return cls(**record)
|
|
```
|
|
|
|
### New NewsJobConfig Entity
|
|
|
|
Configuration entity for scheduled news collection:
|
|
|
|
```python
|
|
from pydantic import BaseModel, Field, validator
|
|
from typing import List
|
|
|
|
class NewsJobConfig(BaseModel):
|
|
"""Configuration for scheduled news collection jobs"""
|
|
tickers: List[str] = Field(..., min_items=1, max_items=50)
|
|
schedule_hour: int = Field(..., ge=0, le=23)
|
|
sentiment_model: str = Field(default="anthropic/claude-3.5-haiku")
|
|
embedding_model: str = Field(default="text-embedding-3-large")
|
|
max_articles_per_ticker: int = Field(default=20, ge=5, le=100)
|
|
lookback_days: int = Field(default=7, ge=1, le=30)
|
|
|
|
@validator('tickers')
|
|
def validate_tickers(cls, v):
|
|
# Ensure uppercase stock symbols
|
|
return [ticker.upper().strip() for ticker in v]
|
|
|
|
@validator('sentiment_model')
|
|
def validate_sentiment_model(cls, v):
|
|
# Ensure OpenRouter model format
|
|
if '/' not in v:
|
|
raise ValueError("Model must be in OpenRouter format (provider/model)")
|
|
return v
|
|
|
|
def to_cron_expression(self) -> str:
|
|
"""Convert to cron expression for APScheduler"""
|
|
return f"0 {self.schedule_hour} * * *" # Daily at specified hour
|
|
```
|
|
|
|
## Database Design
|
|
|
|
### Schema Enhancements
|
|
|
|
The existing `news_articles` table requires minimal modifications to support the final 5%:
|
|
|
|
```sql
|
|
-- Existing table structure (95% complete)
|
|
CREATE TABLE IF NOT EXISTS news_articles (
|
|
id SERIAL PRIMARY KEY,
|
|
headline TEXT NOT NULL,
|
|
url TEXT UNIQUE NOT NULL,
|
|
source TEXT NOT NULL,
|
|
published_date TIMESTAMPTZ NOT NULL,
|
|
summary TEXT,
|
|
entities TEXT[] DEFAULT '{}',
|
|
sentiment_score JSONB, -- Enhanced for structured format
|
|
author TEXT,
|
|
category TEXT,
|
|
title_embedding vector(1536), -- New: pgvectorscale vector type
|
|
content_embedding vector(1536), -- New: pgvectorscale vector type
|
|
created_at TIMESTAMPTZ DEFAULT NOW(),
|
|
updated_at TIMESTAMPTZ DEFAULT NOW()
|
|
);
|
|
|
|
-- New indexes for final 5% performance
|
|
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_symbol_date
|
|
ON news_articles (((entities)), published_date DESC);
|
|
|
|
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_title_embedding
|
|
ON news_articles USING vectors (title_embedding vector_cosine_ops);
|
|
|
|
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_content_embedding
|
|
ON news_articles USING vectors (content_embedding vector_cosine_ops);
|
|
|
|
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_sentiment
|
|
ON news_articles (((sentiment_score->>'sentiment')))
|
|
WHERE sentiment_score IS NOT NULL;
|
|
```
|
|
|
|
### Query Patterns
|
|
|
|
**Time-based News Queries (News Analysts)**
|
|
```sql
|
|
-- Optimized for Agent queries: recent news for specific ticker
|
|
SELECT headline, summary, sentiment_score, published_date
|
|
FROM news_articles
|
|
WHERE entities @> ARRAY[$1::text]
|
|
AND published_date >= NOW() - INTERVAL '30 days'
|
|
ORDER BY published_date DESC
|
|
LIMIT 20;
|
|
```
|
|
|
|
**Semantic Similarity Queries (Vector Search)**
|
|
```sql
|
|
-- Find similar articles using pgvectorscale
|
|
SELECT headline, url, summary,
|
|
1 - (title_embedding <=> $1::vector) AS similarity_score
|
|
FROM news_articles
|
|
WHERE entities @> ARRAY[$2::text]
|
|
AND title_embedding IS NOT NULL
|
|
ORDER BY title_embedding <=> $1::vector
|
|
LIMIT 10;
|
|
```
|
|
|
|
**Batch Upsert Operations (Daily Collection)**
|
|
```sql
|
|
-- Efficient upsert for daily news collection
|
|
INSERT INTO news_articles (headline, url, source, published_date, summary, entities, sentiment_score, title_embedding, content_embedding)
|
|
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
|
|
ON CONFLICT (url) DO UPDATE SET
|
|
headline = EXCLUDED.headline,
|
|
summary = EXCLUDED.summary,
|
|
entities = EXCLUDED.entities,
|
|
sentiment_score = EXCLUDED.sentiment_score,
|
|
title_embedding = EXCLUDED.title_embedding,
|
|
content_embedding = EXCLUDED.content_embedding,
|
|
updated_at = NOW();
|
|
```
|
|
|
|
## API Integration
|
|
|
|
### OpenRouter Unified Client
|
|
|
|
Single OpenRouter integration for both sentiment analysis and embeddings:
|
|
|
|
```python
|
|
from typing import List, Optional, Dict, Any
|
|
import httpx
|
|
from tradingagents.config import TradingAgentsConfig
|
|
|
|
class OpenRouterClient:
|
|
"""Unified OpenRouter client for sentiment analysis and embeddings"""
|
|
|
|
def __init__(self, config: TradingAgentsConfig):
|
|
self.config = config
|
|
self.base_url = "https://openrouter.ai/api/v1"
|
|
self.headers = {
|
|
"Authorization": f"Bearer {config.openrouter_api_key}",
|
|
"Content-Type": "application/json"
|
|
}
|
|
|
|
async def analyze_sentiment(self, text: str, model: Optional[str] = None) -> SentimentScore:
|
|
"""Generate structured sentiment analysis using LLM"""
|
|
model = model or self.config.quick_think_llm
|
|
|
|
prompt = f"""Analyze the sentiment of this news article text and respond with ONLY a JSON object:
|
|
|
|
Article: {text[:2000]} # Truncate for token limits
|
|
|
|
Required JSON format:
|
|
{{
|
|
"sentiment": "positive|negative|neutral",
|
|
"confidence": 0.0-1.0,
|
|
"reasoning": "brief explanation"
|
|
}}"""
|
|
|
|
payload = {
|
|
"model": model,
|
|
"messages": [{"role": "user", "content": prompt}],
|
|
"temperature": 0.1, # Low temperature for consistent structured output
|
|
"max_tokens": 200
|
|
}
|
|
|
|
async with httpx.AsyncClient() as client:
|
|
try:
|
|
response = await client.post(
|
|
f"{self.base_url}/chat/completions",
|
|
headers=self.headers,
|
|
json=payload,
|
|
timeout=30.0
|
|
)
|
|
response.raise_for_status()
|
|
|
|
result = response.json()
|
|
content = result["choices"][0]["message"]["content"].strip()
|
|
|
|
# Parse JSON response
|
|
import json
|
|
sentiment_data = json.loads(content)
|
|
return SentimentScore(**sentiment_data)
|
|
|
|
except Exception as e:
|
|
# Best-effort: return neutral sentiment on failure
|
|
return SentimentScore(
|
|
sentiment="neutral",
|
|
confidence=0.3, # Below reliability threshold
|
|
reasoning=f"Analysis failed: {str(e)[:100]}"
|
|
)
|
|
|
|
async def generate_embeddings(self, texts: List[str], model: Optional[str] = None) -> List[List[float]]:
|
|
"""Generate embeddings for multiple texts"""
|
|
model = model or "text-embedding-3-large"
|
|
|
|
# Truncate texts to avoid token limits
|
|
truncated_texts = [text[:8000] for text in texts]
|
|
|
|
payload = {
|
|
"model": model,
|
|
"input": truncated_texts
|
|
}
|
|
|
|
async with httpx.AsyncClient() as client:
|
|
try:
|
|
response = await client.post(
|
|
f"{self.base_url}/embeddings",
|
|
headers=self.headers,
|
|
json=payload,
|
|
timeout=60.0
|
|
)
|
|
response.raise_for_status()
|
|
|
|
result = response.json()
|
|
return [item["embedding"] for item in result["data"]]
|
|
|
|
except Exception as e:
|
|
# Return None embeddings on failure (stored as NULL in DB)
|
|
return [None] * len(texts)
|
|
```
|
|
|
|
### Enhanced NewsService Integration
|
|
|
|
Update existing NewsService to integrate LLM capabilities:
|
|
|
|
```python
|
|
class NewsService:
|
|
"""Enhanced NewsService with LLM sentiment and embeddings (final 5%)"""
|
|
|
|
def __init__(self,
|
|
repository: NewsRepository,
|
|
google_client: GoogleNewsClient,
|
|
scraper_client: ArticleScraperClient,
|
|
openrouter_client: OpenRouterClient):
|
|
self.repository = repository
|
|
self.google_client = google_client
|
|
self.scraper_client = scraper_client
|
|
self.openrouter_client = openrouter_client
|
|
|
|
async def update_company_news(self,
|
|
symbol: str,
|
|
lookback_days: int = 7,
|
|
max_articles: int = 20,
|
|
include_sentiment: bool = True,
|
|
include_embeddings: bool = True) -> List[NewsArticle]:
|
|
"""Enhanced method with LLM sentiment analysis and embeddings"""
|
|
|
|
# Step 1: Use existing 95% infrastructure for collection
|
|
cutoff_date = datetime.datetime.now() - datetime.timedelta(days=lookback_days)
|
|
|
|
# Fetch from Google News (existing)
|
|
google_results = await self.google_client.fetch_company_news(symbol, max_articles)
|
|
|
|
articles = []
|
|
for result in google_results:
|
|
if result.published_date < cutoff_date:
|
|
continue
|
|
|
|
# Scrape full content (existing)
|
|
scraped_content = await self.scraper_client.scrape_article(result.url)
|
|
|
|
# Create base article (existing pattern)
|
|
article = NewsArticle(
|
|
headline=result.title,
|
|
url=result.url,
|
|
source=result.source,
|
|
published_date=result.published_date,
|
|
summary=scraped_content.summary if scraped_content else result.description,
|
|
entities=[symbol],
|
|
author=scraped_content.author if scraped_content else None
|
|
)
|
|
|
|
# Step 2: NEW - Add LLM sentiment analysis
|
|
if include_sentiment and scraped_content and scraped_content.content:
|
|
article.sentiment_score = await self.openrouter_client.analyze_sentiment(
|
|
scraped_content.content
|
|
)
|
|
|
|
articles.append(article)
|
|
|
|
# Step 3: NEW - Batch generate embeddings
|
|
if include_embeddings and articles:
|
|
titles = [a.headline for a in articles]
|
|
contents = [a.summary or a.headline for a in articles]
|
|
|
|
title_embeddings = await self.openrouter_client.generate_embeddings(titles)
|
|
content_embeddings = await self.openrouter_client.generate_embeddings(contents)
|
|
|
|
for i, article in enumerate(articles):
|
|
if i < len(title_embeddings) and title_embeddings[i]:
|
|
article.title_embedding = title_embeddings[i]
|
|
if i < len(content_embeddings) and content_embeddings[i]:
|
|
article.content_embedding = content_embeddings[i]
|
|
|
|
# Step 4: Batch persist (existing pattern)
|
|
await self.repository.upsert_batch(articles)
|
|
return articles
|
|
|
|
async def find_similar_articles(self,
|
|
query_text: str,
|
|
symbol: Optional[str] = None,
|
|
limit: int = 10) -> List[NewsArticle]:
|
|
"""NEW: Semantic similarity search for News Analysts"""
|
|
|
|
# Generate query embedding
|
|
query_embeddings = await self.openrouter_client.generate_embeddings([query_text])
|
|
if not query_embeddings[0]:
|
|
# Fallback to text search
|
|
return await self.repository.find_by_text_search(query_text, symbol, limit)
|
|
|
|
return await self.repository.find_similar_articles(
|
|
query_embeddings[0], symbol, limit
|
|
)
|
|
```
|
|
|
|
## Job Scheduling Architecture
|
|
|
|
### APScheduler Integration
|
|
|
|
Robust scheduled execution using APScheduler:
|
|
|
|
```python
|
|
from apscheduler.schedulers.asyncio import AsyncIOScheduler
|
|
from apscheduler.jobstores.redis import RedisJobStore # Optional: persistent job store
|
|
from apscheduler.executors.asyncio import AsyncIOExecutor
|
|
import logging
|
|
|
|
class ScheduledNewsCollector:
|
|
"""Orchestrates scheduled news collection jobs"""
|
|
|
|
def __init__(self,
|
|
news_service: NewsService,
|
|
config: TradingAgentsConfig,
|
|
job_config: NewsJobConfig):
|
|
self.news_service = news_service
|
|
self.config = config
|
|
self.job_config = job_config
|
|
|
|
# Configure APScheduler
|
|
jobstores = {
|
|
'default': {'type': 'memory'} # Use Redis for production
|
|
}
|
|
executors = {
|
|
'default': AsyncIOExecutor(),
|
|
}
|
|
job_defaults = {
|
|
'coalesce': False, # Don't combine missed jobs
|
|
'max_instances': 1, # One job per ticker at a time
|
|
'misfire_grace_time': 300 # 5 minute grace period
|
|
}
|
|
|
|
self.scheduler = AsyncIOScheduler(
|
|
jobstores=jobstores,
|
|
executors=executors,
|
|
job_defaults=job_defaults,
|
|
timezone='UTC'
|
|
)
|
|
|
|
async def start(self):
|
|
"""Start the scheduler and register jobs"""
|
|
|
|
for ticker in self.job_config.tickers:
|
|
# Schedule daily collection for each ticker
|
|
self.scheduler.add_job(
|
|
func=self._collect_ticker_news,
|
|
trigger='cron',
|
|
hour=self.job_config.schedule_hour,
|
|
minute=0,
|
|
args=[ticker],
|
|
id=f"news_collection_{ticker}",
|
|
replace_existing=True,
|
|
max_instances=1
|
|
)
|
|
|
|
self.scheduler.start()
|
|
logging.info(f"Started news collection scheduler for {len(self.job_config.tickers)} tickers")
|
|
|
|
async def stop(self):
|
|
"""Gracefully stop the scheduler"""
|
|
if self.scheduler.running:
|
|
self.scheduler.shutdown(wait=True)
|
|
|
|
async def _collect_ticker_news(self, ticker: str):
|
|
"""Execute news collection for a single ticker"""
|
|
|
|
start_time = datetime.datetime.now()
|
|
|
|
try:
|
|
logging.info(f"Starting news collection for {ticker}")
|
|
|
|
articles = await self.news_service.update_company_news(
|
|
symbol=ticker,
|
|
lookback_days=self.job_config.lookback_days,
|
|
max_articles=self.job_config.max_articles_per_ticker,
|
|
include_sentiment=True,
|
|
include_embeddings=True
|
|
)
|
|
|
|
# Log metrics
|
|
sentiment_count = sum(1 for a in articles if a.has_reliable_sentiment())
|
|
embedding_count = sum(1 for a in articles if a.title_embedding)
|
|
|
|
duration = (datetime.datetime.now() - start_time).total_seconds()
|
|
|
|
logging.info(
|
|
f"Completed news collection for {ticker}: "
|
|
f"{len(articles)} articles, {sentiment_count} with sentiment, "
|
|
f"{embedding_count} with embeddings in {duration:.1f}s"
|
|
)
|
|
|
|
except Exception as e:
|
|
logging.error(f"News collection failed for {ticker}: {str(e)}")
|
|
# Don't raise - let scheduler continue with other tickers
|
|
|
|
def get_job_status(self) -> Dict[str, Any]:
|
|
"""Get status of all scheduled jobs"""
|
|
jobs = self.scheduler.get_jobs()
|
|
return {
|
|
"scheduler_running": self.scheduler.running,
|
|
"job_count": len(jobs),
|
|
"jobs": [
|
|
{
|
|
"id": job.id,
|
|
"next_run": job.next_run_time.isoformat() if job.next_run_time else None,
|
|
"trigger": str(job.trigger)
|
|
}
|
|
for job in jobs
|
|
]
|
|
}
|
|
```
|
|
|
|
### Error Handling and Monitoring
|
|
|
|
Comprehensive error handling for production reliability:
|
|
|
|
```python
|
|
class NewsCollectionMonitor:
|
|
"""Monitor and handle news collection job failures"""
|
|
|
|
def __init__(self, collector: ScheduledNewsCollector):
|
|
self.collector = collector
|
|
self.failure_counts = defaultdict(int)
|
|
self.max_failures = 3
|
|
|
|
async def handle_job_failure(self, ticker: str, error: Exception):
|
|
"""Handle job failure with exponential backoff"""
|
|
|
|
self.failure_counts[ticker] += 1
|
|
|
|
if self.failure_counts[ticker] >= self.max_failures:
|
|
logging.error(f"Max failures reached for {ticker}, disabling job")
|
|
self.collector.scheduler.remove_job(f"news_collection_{ticker}")
|
|
# Could send alert here
|
|
else:
|
|
# Schedule retry with exponential backoff
|
|
delay_minutes = 2 ** self.failure_counts[ticker]
|
|
retry_time = datetime.datetime.now() + datetime.timedelta(minutes=delay_minutes)
|
|
|
|
self.collector.scheduler.add_job(
|
|
func=self.collector._collect_ticker_news,
|
|
trigger='date',
|
|
run_date=retry_time,
|
|
args=[ticker],
|
|
id=f"news_retry_{ticker}_{int(retry_time.timestamp())}",
|
|
max_instances=1
|
|
)
|
|
|
|
def reset_failure_count(self, ticker: str):
|
|
"""Reset failure count on successful job"""
|
|
if ticker in self.failure_counts:
|
|
del self.failure_counts[ticker]
|
|
```
|
|
|
|
## Implementation Strategy
|
|
|
|
### Phase 1: Entity and Database Enhancements (Week 1)
|
|
|
|
**Deliverables:**
|
|
- [ ] Enhanced `NewsArticle` entity with `SentimentScore` and vector support
|
|
- [ ] New `NewsJobConfig` entity with validation
|
|
- [ ] Database migration for vector indexes and sentiment_score JSONB enhancement
|
|
- [ ] Repository method `find_similar_articles()` with pgvectorscale integration
|
|
|
|
**Testing Focus:**
|
|
- Unit tests for entity validation and serialization
|
|
- Repository integration tests with vector similarity queries
|
|
- Database migration verification
|
|
|
|
### Phase 2: OpenRouter Integration (Week 2)
|
|
|
|
**Deliverables:**
|
|
- [ ] `OpenRouterClient` with sentiment analysis and embeddings
|
|
- [ ] Enhanced `NewsService.update_company_news()` with LLM integration
|
|
- [ ] Error handling for LLM failures (best-effort approach)
|
|
- [ ] Integration tests with OpenRouter API (using pytest-vcr)
|
|
|
|
**Testing Focus:**
|
|
- Mock OpenRouter responses for consistent testing
|
|
- Error handling scenarios (API failures, malformed responses)
|
|
- Embedding dimension validation
|
|
|
|
### Phase 3: Job Scheduling System (Week 3)
|
|
|
|
**Deliverables:**
|
|
- [ ] `ScheduledNewsCollector` with APScheduler integration
|
|
- [ ] `NewsCollectionMonitor` for error handling and retries
|
|
- [ ] Configuration management for job scheduling
|
|
- [ ] Graceful startup and shutdown procedures
|
|
|
|
**Testing Focus:**
|
|
- Scheduler lifecycle testing
|
|
- Job execution and failure handling
|
|
- Configuration validation
|
|
|
|
### Phase 4: Testing and Performance Optimization (Week 4)
|
|
|
|
**Deliverables:**
|
|
- [ ] Complete test coverage maintaining >85% threshold
|
|
- [ ] Performance optimization for vector queries
|
|
- [ ] Documentation and deployment guides
|
|
- [ ] Integration with existing News Analyst AgentToolkit
|
|
|
|
**Testing Focus:**
|
|
- End-to-end integration tests
|
|
- Performance benchmarks for vector similarity queries
|
|
- Load testing for scheduled job execution
|
|
|
|
## Testing Strategy
|
|
|
|
### Test Architecture
|
|
|
|
Following the existing pragmatic TDD approach with mock boundaries:
|
|
|
|
```
|
|
tests/domains/news/
|
|
├── __init__.py
|
|
├── test_news_entities.py # Entity validation and serialization
|
|
├── test_news_service.py # Mock repository and OpenRouter client
|
|
├── test_news_repository.py # PostgreSQL test database
|
|
├── test_openrouter_client.py # pytest-vcr for API responses
|
|
├── test_scheduled_collector.py # Mock APScheduler and services
|
|
└── integration/
|
|
├── test_sentiment_pipeline.py # End-to-end sentiment analysis
|
|
├── test_embedding_pipeline.py # End-to-end embedding generation
|
|
└── test_scheduled_execution.py # Full job execution cycle
|
|
```
|
|
|
|
### Key Test Categories
|
|
|
|
**Entity Tests (Fast Unit Tests)**
|
|
```python
|
|
def test_news_article_sentiment_validation():
|
|
"""Test sentiment score validation and reliability checks"""
|
|
|
|
# Valid sentiment
|
|
sentiment = SentimentScore(
|
|
sentiment="positive",
|
|
confidence=0.8,
|
|
reasoning="Strong positive language"
|
|
)
|
|
|
|
article = NewsArticle(
|
|
headline="Test headline",
|
|
url="https://example.com",
|
|
source="Test Source",
|
|
published_date=datetime.datetime.now(),
|
|
sentiment_score=sentiment
|
|
)
|
|
|
|
assert article.has_reliable_sentiment() == True
|
|
|
|
# Low confidence sentiment
|
|
low_confidence = SentimentScore(
|
|
sentiment="neutral",
|
|
confidence=0.3,
|
|
reasoning="Ambiguous language"
|
|
)
|
|
|
|
article.sentiment_score = low_confidence
|
|
assert article.has_reliable_sentiment() == False
|
|
|
|
def test_news_article_vector_validation():
|
|
"""Test vector embedding validation"""
|
|
|
|
# Valid 1536-dimension embedding
|
|
valid_embedding = [0.1] * 1536
|
|
article = NewsArticle(
|
|
headline="Test",
|
|
url="https://example.com",
|
|
source="Test",
|
|
published_date=datetime.datetime.now(),
|
|
title_embedding=valid_embedding
|
|
)
|
|
|
|
assert len(article.title_embedding) == 1536
|
|
|
|
# Invalid dimension should raise ValidationError
|
|
with pytest.raises(ValidationError):
|
|
NewsArticle(
|
|
headline="Test",
|
|
url="https://example.com",
|
|
source="Test",
|
|
published_date=datetime.datetime.now(),
|
|
title_embedding=[0.1] * 512 # Wrong dimension
|
|
)
|
|
```
|
|
|
|
**Service Integration Tests (Mock Boundaries)**
|
|
```python
|
|
@pytest.mark.asyncio
|
|
async def test_news_service_with_sentiment_analysis(mock_openrouter_client, mock_repository):
|
|
"""Test NewsService integration with mocked LLM client"""
|
|
|
|
# Mock successful sentiment analysis
|
|
mock_sentiment = SentimentScore(
|
|
sentiment="positive",
|
|
confidence=0.9,
|
|
reasoning="Optimistic financial outlook"
|
|
)
|
|
mock_openrouter_client.analyze_sentiment.return_value = mock_sentiment
|
|
|
|
# Mock embeddings
|
|
mock_openrouter_client.generate_embeddings.return_value = [
|
|
[0.1] * 1536, # title embedding
|
|
[0.2] * 1536 # content embedding
|
|
]
|
|
|
|
service = NewsService(
|
|
repository=mock_repository,
|
|
google_client=mock_google_client,
|
|
scraper_client=mock_scraper_client,
|
|
openrouter_client=mock_openrouter_client
|
|
)
|
|
|
|
articles = await service.update_company_news("AAPL", include_sentiment=True)
|
|
|
|
# Verify LLM integration
|
|
assert len(articles) > 0
|
|
assert articles[0].sentiment_score == mock_sentiment
|
|
assert articles[0].title_embedding == [0.1] * 1536
|
|
assert mock_openrouter_client.analyze_sentiment.called
|
|
assert mock_openrouter_client.generate_embeddings.called
|
|
```
|
|
|
|
**Repository Integration Tests (Real Database)**
|
|
```python
|
|
@pytest.mark.asyncio
|
|
async def test_repository_vector_similarity_search(test_db):
|
|
"""Test vector similarity search with real pgvectorscale"""
|
|
|
|
repository = NewsRepository(test_db)
|
|
|
|
# Insert articles with embeddings
|
|
article1 = NewsArticle(
|
|
headline="Apple reports strong iPhone sales",
|
|
url="https://example.com/1",
|
|
source="TechNews",
|
|
published_date=datetime.datetime.now(),
|
|
entities=["AAPL"],
|
|
title_embedding=[0.1, 0.2] + [0.0] * 1534 # Similar to query
|
|
)
|
|
|
|
article2 = NewsArticle(
|
|
headline="Microsoft launches new Azure features",
|
|
url="https://example.com/2",
|
|
source="CloudNews",
|
|
published_date=datetime.datetime.now(),
|
|
entities=["MSFT"],
|
|
title_embedding=[0.9, 0.8] + [0.0] * 1534 # Different from query
|
|
)
|
|
|
|
await repository.upsert_batch([article1, article2])
|
|
|
|
# Query with similar embedding
|
|
query_embedding = [0.15, 0.25] + [0.0] * 1534
|
|
similar_articles = await repository.find_similar_articles(
|
|
query_embedding, symbol="AAPL", limit=1
|
|
)
|
|
|
|
assert len(similar_articles) == 1
|
|
assert similar_articles[0].headline == "Apple reports strong iPhone sales"
|
|
```
|
|
|
|
**API Integration Tests (pytest-vcr)**
|
|
```python
|
|
@pytest.mark.vcr
|
|
@pytest.mark.asyncio
|
|
async def test_openrouter_sentiment_analysis():
|
|
"""Test real OpenRouter API calls with VCR cassettes"""
|
|
|
|
config = TradingAgentsConfig.from_env()
|
|
client = OpenRouterClient(config)
|
|
|
|
test_text = "Apple's quarterly earnings exceeded expectations with strong iPhone sales."
|
|
|
|
sentiment = await client.analyze_sentiment(test_text)
|
|
|
|
assert isinstance(sentiment, SentimentScore)
|
|
assert sentiment.sentiment in ["positive", "negative", "neutral"]
|
|
assert 0.0 <= sentiment.confidence <= 1.0
|
|
assert len(sentiment.reasoning) > 0
|
|
|
|
@pytest.mark.vcr
|
|
@pytest.mark.asyncio
|
|
async def test_openrouter_embeddings_generation():
|
|
"""Test real OpenRouter embeddings API with VCR"""
|
|
|
|
config = TradingAgentsConfig.from_env()
|
|
client = OpenRouterClient(config)
|
|
|
|
texts = ["Apple stock rises", "Market volatility increases"]
|
|
|
|
embeddings = await client.generate_embeddings(texts)
|
|
|
|
assert len(embeddings) == 2
|
|
assert all(len(emb) == 1536 for emb in embeddings)
|
|
assert all(isinstance(val, float) for emb in embeddings for val in emb)
|
|
```
|
|
|
|
### Coverage Requirements
|
|
|
|
Maintain existing >85% coverage with new components:
|
|
|
|
- **Entity Layer**: 95% coverage (comprehensive validation testing)
|
|
- **Service Layer**: 90% coverage (mock external dependencies)
|
|
- **Repository Layer**: 85% coverage (real database integration tests)
|
|
- **Client Layer**: 80% coverage (pytest-vcr for API calls)
|
|
- **Integration Tests**: End-to-end scenarios covering complete workflows
|
|
|
|
### Performance Testing
|
|
|
|
```python
|
|
@pytest.mark.performance
|
|
@pytest.mark.asyncio
|
|
async def test_vector_similarity_performance():
|
|
"""Ensure vector similarity queries perform under 100ms"""
|
|
|
|
repository = NewsRepository(test_db)
|
|
|
|
# Insert 1000 articles with embeddings
|
|
articles = [create_test_article_with_embedding() for _ in range(1000)]
|
|
await repository.upsert_batch(articles)
|
|
|
|
query_embedding = [random.random() for _ in range(1536)]
|
|
|
|
start_time = time.time()
|
|
results = await repository.find_similar_articles(query_embedding, limit=10)
|
|
duration = time.time() - start_time
|
|
|
|
assert duration < 0.1 # Under 100ms
|
|
assert len(results) == 10
|
|
```
|
|
|
|
## Integration Points
|
|
|
|
### News Analyst AgentToolkit Integration
|
|
|
|
The completed News domain integrates seamlessly with existing News Analyst agents:
|
|
|
|
```python
|
|
class NewsAnalystToolkit:
|
|
"""Enhanced toolkit with semantic search capabilities"""
|
|
|
|
def __init__(self, news_service: NewsService):
|
|
self.news_service = news_service
|
|
|
|
async def get_relevant_news(self,
|
|
ticker: str,
|
|
query: Optional[str] = None,
|
|
days_back: int = 30) -> List[Dict[str, Any]]:
|
|
"""Get news with optional semantic search"""
|
|
|
|
if query:
|
|
# Use semantic similarity search
|
|
articles = await self.news_service.find_similar_articles(
|
|
query_text=query,
|
|
symbol=ticker,
|
|
limit=20
|
|
)
|
|
else:
|
|
# Use time-based search (existing)
|
|
articles = await self.news_service.find_recent_news(
|
|
symbol=ticker,
|
|
days_back=days_back
|
|
)
|
|
|
|
return [
|
|
{
|
|
"headline": article.headline,
|
|
"summary": article.summary,
|
|
"published_date": article.published_date.isoformat(),
|
|
"sentiment": article.sentiment_score.sentiment if article.sentiment_score else "unknown",
|
|
"confidence": article.sentiment_score.confidence if article.sentiment_score else 0.0,
|
|
"source": article.source,
|
|
"url": article.url
|
|
}
|
|
for article in articles
|
|
]
|
|
```
|
|
|
|
### Configuration Integration
|
|
|
|
Seamless integration with existing `TradingAgentsConfig`:
|
|
|
|
```python
|
|
# Enhanced configuration for news domain completion
|
|
config = TradingAgentsConfig(
|
|
# Existing LLM configuration
|
|
llm_provider="openrouter",
|
|
openrouter_api_key=os.getenv("OPENROUTER_API_KEY"),
|
|
quick_think_llm="anthropic/claude-3.5-haiku", # For sentiment analysis
|
|
|
|
# New news-specific settings
|
|
news_collection_enabled=True,
|
|
news_schedule_hour=6, # UTC
|
|
news_sentiment_enabled=True,
|
|
news_embeddings_enabled=True,
|
|
news_max_articles_per_ticker=20,
|
|
|
|
# Database (existing)
|
|
database_url=os.getenv("DATABASE_URL"),
|
|
)
|
|
|
|
# Job configuration
|
|
news_job_config = NewsJobConfig(
|
|
tickers=["AAPL", "GOOGL", "MSFT", "TSLA", "NVDA"],
|
|
schedule_hour=6, # 6 AM UTC daily collection
|
|
sentiment_model=config.quick_think_llm,
|
|
embedding_model="text-embedding-3-large",
|
|
max_articles_per_ticker=20
|
|
)
|
|
```
|
|
|
|
This design completes the final 5% of the News domain while leveraging the existing 95% infrastructure, maintaining architectural consistency, and providing the robust scheduled execution, LLM-powered sentiment analysis, and vector embeddings needed for advanced News Analyst capabilities. |