# News Domain Technical Design
## Overview
This document details the technical design for completing the final 5% of the News domain implementation. The existing infrastructure is 95% complete with Google News collection, article scraping, and basic storage implemented. The remaining work focuses on **scheduled execution**, **LLM-powered sentiment analysis**, and **vector embeddings** using OpenRouter as the unified LLM provider.
## Architecture Overview
### Component Relationships
```mermaid
graph TD
A[APScheduler] --> B[ScheduledNewsCollector]
B --> C[NewsService]
C --> D[GoogleNewsClient]
C --> E[ArticleScraperClient]
C --> F[OpenRouter LLM Client]
C --> G[OpenRouter Embeddings Client]
C --> H[NewsRepository]
H --> I[PostgreSQL + TimescaleDB + pgvectorscale]
J[News Analysts] --> K[AgentToolkit]
K --> C
K --> H
```
### Data Flow Architecture
1. **Scheduled Collection Flow**
```
APScheduler → ScheduledNewsCollector → NewsService.update_company_news()
→ GoogleNewsClient → ArticleScraperClient → OpenRouter (sentiment + embeddings)
→ NewsRepository.upsert_batch() → PostgreSQL
```
2. **Agent Query Flow**
```
News Analyst → AgentToolkit → NewsService.find_similar_articles()
→ NewsRepository (semantic search) → pgvectorscale vector similarity
```
### Key Design Principles
- **Leverage Existing 95%**: Build on proven GoogleNewsClient and ArticleScraperClient infrastructure
- **OpenRouter Unified**: Single API for both sentiment analysis and embeddings
- **Best-Effort Processing**: LLM failures don't block article storage
- **Vector-Enhanced Search**: Semantic similarity for News Analysts
- **Fault-Tolerant Scheduling**: Robust error handling and monitoring
## Domain Model
### Enhanced NewsArticle Entity
The existing `NewsArticle` entity requires enhancements for structured sentiment and vector support:
```python
from typing import Optional, Dict, Any, List, Literal
from pydantic import BaseModel, Field, validator
import datetime


class SentimentScore(BaseModel):
    """Structured sentiment analysis result"""
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

    # NOTE: low-confidence scores (e.g. the best-effort fallback) are allowed here;
    # reliability is flagged via NewsArticle.has_reliable_sentiment() rather than
    # rejected at validation time.


class NewsArticle(BaseModel):
    """Enhanced NewsArticle entity with sentiment and vector support"""
    # Existing fields (95% complete)
    headline: str
    url: str = Field(..., regex=r'^https?://')
    source: str
    published_date: datetime.datetime
    summary: Optional[str] = None
    entities: List[str] = Field(default_factory=list)
    author: Optional[str] = None
    category: Optional[str] = None

    # Enhanced fields (final 5%)
    sentiment_score: Optional[SentimentScore] = None
    title_embedding: Optional[List[float]] = Field(None, min_items=1536, max_items=1536)
    content_embedding: Optional[List[float]] = Field(None, min_items=1536, max_items=1536)

    # Metadata
    created_at: datetime.datetime = Field(default_factory=datetime.datetime.now)
    updated_at: datetime.datetime = Field(default_factory=datetime.datetime.now)

    @validator('content_embedding', 'title_embedding')
    def validate_embeddings(cls, v):
        if v and len(v) != 1536:
            raise ValueError("Embeddings must be 1536 dimensions for OpenRouter compatibility")
        return v

    def has_reliable_sentiment(self) -> bool:
        """Check if sentiment analysis is reliable (confidence >= 0.5)"""
        return bool(self.sentiment_score and self.sentiment_score.confidence >= 0.5)

    def to_record(self) -> Dict[str, Any]:
        """Convert to database record format"""
        record = self.dict()
        # Convert sentiment to JSONB format
        if self.sentiment_score:
            record['sentiment_score'] = self.sentiment_score.dict()
        return record

    @classmethod
    def from_record(cls, record: Dict[str, Any]) -> 'NewsArticle':
        """Create entity from database record"""
        if record.get('sentiment_score'):
            record['sentiment_score'] = SentimentScore(**record['sentiment_score'])
        return cls(**record)
```
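A small usage sketch (field values are illustrative) showing the JSONB round trip through `to_record()` / `from_record()`:
```python
import datetime

# Illustrative values; NewsArticle and SentimentScore are defined above.
article = NewsArticle(
    headline="Apple beats earnings expectations",
    url="https://example.com/apple-earnings",
    source="ExampleWire",
    published_date=datetime.datetime(2025, 1, 15, 12, 0),
    entities=["AAPL"],
    sentiment_score=SentimentScore(
        sentiment="positive", confidence=0.82, reasoning="Upbeat earnings language"
    ),
)

record = article.to_record()               # sentiment_score is now a plain dict (JSONB-ready)
restored = NewsArticle.from_record(record)
assert restored.has_reliable_sentiment()
```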
### New NewsJobConfig Entity
Configuration entity for scheduled news collection:
```python
from typing import List
from pydantic import BaseModel, Field, validator


class NewsJobConfig(BaseModel):
    """Configuration for scheduled news collection jobs"""
    tickers: List[str] = Field(..., min_items=1, max_items=50)
    schedule_hour: int = Field(..., ge=0, le=23)
    sentiment_model: str = Field(default="anthropic/claude-3.5-haiku")
    embedding_model: str = Field(default="text-embedding-3-large")
    max_articles_per_ticker: int = Field(default=20, ge=5, le=100)
    lookback_days: int = Field(default=7, ge=1, le=30)

    @validator('tickers')
    def validate_tickers(cls, v):
        # Normalize to uppercase stock symbols
        return [ticker.upper().strip() for ticker in v]

    @validator('sentiment_model')
    def validate_sentiment_model(cls, v):
        # Ensure OpenRouter model format
        if '/' not in v:
            raise ValueError("Model must be in OpenRouter format (provider/model)")
        return v

    def to_cron_expression(self) -> str:
        """Convert to cron expression for APScheduler"""
        return f"0 {self.schedule_hour} * * *"  # Daily at the specified hour
```
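A brief usage sketch (ticker list and hour are illustrative) showing how validation normalizes input and what the cron helper produces:
```python
# Illustrative configuration; values are examples, not project defaults.
job_config = NewsJobConfig(
    tickers=["aapl", " msft "],   # normalized to ["AAPL", "MSFT"] by the validator
    schedule_hour=6,
)

print(job_config.tickers)               # ['AAPL', 'MSFT']
print(job_config.to_cron_expression())  # "0 6 * * *" -> daily at 06:00 UTC
```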
## Database Design
### Schema Enhancements
The existing `news_articles` table requires minimal modifications to support the final 5%:
```sql
-- Existing table structure (95% complete)
CREATE TABLE IF NOT EXISTS news_articles (
    id SERIAL PRIMARY KEY,
    headline TEXT NOT NULL,
    url TEXT UNIQUE NOT NULL,
    source TEXT NOT NULL,
    published_date TIMESTAMPTZ NOT NULL,
    summary TEXT,
    entities TEXT[] DEFAULT '{}',
    sentiment_score JSONB,               -- Enhanced for structured format
    author TEXT,
    category TEXT,
    title_embedding vector(1536),        -- New: pgvector type, indexed via pgvectorscale
    content_embedding vector(1536),      -- New: pgvector type, indexed via pgvectorscale
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- New indexes for final 5% performance
-- GIN index supports the entities @> ARRAY[...] containment queries used below
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_entities
    ON news_articles USING GIN (entities);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_published_date
    ON news_articles (published_date DESC);

-- pgvectorscale StreamingDiskANN indexes for cosine similarity search
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_title_embedding
    ON news_articles USING diskann (title_embedding vector_cosine_ops);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_content_embedding
    ON news_articles USING diskann (content_embedding vector_cosine_ops);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_sentiment
    ON news_articles ((sentiment_score->>'sentiment'))
    WHERE sentiment_score IS NOT NULL;
```
### Query Patterns
**Time-based News Queries (News Analysts)**
```sql
-- Optimized for Agent queries: recent news for specific ticker
SELECT headline, summary, sentiment_score, published_date
FROM news_articles
WHERE entities @> ARRAY[$1::text]
  AND published_date >= NOW() - INTERVAL '30 days'
ORDER BY published_date DESC
LIMIT 20;
```
**Semantic Similarity Queries (Vector Search)**
```sql
-- Find similar articles using pgvectorscale
SELECT headline, url, summary,
       1 - (title_embedding <=> $1::vector) AS similarity_score
FROM news_articles
WHERE entities @> ARRAY[$2::text]
  AND title_embedding IS NOT NULL
ORDER BY title_embedding <=> $1::vector
LIMIT 10;
```
**Batch Upsert Operations (Daily Collection)**
```sql
-- Efficient upsert for daily news collection
INSERT INTO news_articles (headline, url, source, published_date, summary, entities, sentiment_score, title_embedding, content_embedding)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
ON CONFLICT (url) DO UPDATE SET
    headline = EXCLUDED.headline,
    summary = EXCLUDED.summary,
    entities = EXCLUDED.entities,
    sentiment_score = EXCLUDED.sentiment_score,
    title_embedding = EXCLUDED.title_embedding,
    content_embedding = EXCLUDED.content_embedding,
    updated_at = NOW();
```
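The design references `NewsRepository.upsert_batch()` throughout but never shows it. A minimal sketch of that one method, assuming the repository wraps an asyncpg pool; the pool attribute, the text-form vector encoding, and the JSONB serialization are assumptions, not the confirmed implementation:
```python
import json
from typing import List

import asyncpg  # assumed driver; the project may use a different PostgreSQL client


class NewsRepository:
    """Sketch of the batch upsert path only (other methods omitted)."""

    UPSERT_SQL = """
        INSERT INTO news_articles
            (headline, url, source, published_date, summary, entities,
             sentiment_score, title_embedding, content_embedding)
        VALUES ($1, $2, $3, $4, $5, $6, $7, $8::vector, $9::vector)
        ON CONFLICT (url) DO UPDATE SET
            headline = EXCLUDED.headline,
            summary = EXCLUDED.summary,
            entities = EXCLUDED.entities,
            sentiment_score = EXCLUDED.sentiment_score,
            title_embedding = EXCLUDED.title_embedding,
            content_embedding = EXCLUDED.content_embedding,
            updated_at = NOW();
    """

    def __init__(self, pool: asyncpg.Pool):
        self.pool = pool

    @staticmethod
    def _vec(values):
        # pgvector accepts the text form '[v1,v2,...]'; None is stored as NULL.
        # Depending on driver setup, registering pgvector's asyncpg codec may be
        # preferable to casting text parameters with ::vector.
        return None if values is None else "[" + ",".join(map(str, values)) + "]"

    async def upsert_batch(self, articles: List["NewsArticle"]) -> None:
        rows = [
            (
                a.headline, a.url, a.source, a.published_date, a.summary, a.entities,
                json.dumps(a.sentiment_score.dict()) if a.sentiment_score else None,
                self._vec(a.title_embedding),
                self._vec(a.content_embedding),
            )
            for a in articles
        ]
        async with self.pool.acquire() as conn:
            await conn.executemany(self.UPSERT_SQL, rows)
```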
## API Integration
### OpenRouter Unified Client
Single OpenRouter integration for both sentiment analysis and embeddings:
```python
import json
from typing import List, Optional

import httpx

from tradingagents.config import TradingAgentsConfig
# SentimentScore is the entity defined in the Domain Model section above


class OpenRouterClient:
    """Unified OpenRouter client for sentiment analysis and embeddings"""

    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        self.base_url = "https://openrouter.ai/api/v1"
        self.headers = {
            "Authorization": f"Bearer {config.openrouter_api_key}",
            "Content-Type": "application/json"
        }

    async def analyze_sentiment(self, text: str, model: Optional[str] = None) -> SentimentScore:
        """Generate structured sentiment analysis using LLM"""
        model = model or self.config.quick_think_llm
        truncated_text = text[:2000]  # Truncate for token limits
        prompt = f"""Analyze the sentiment of this news article text and respond with ONLY a JSON object:

Article: {truncated_text}

Required JSON format:
{{
    "sentiment": "positive|negative|neutral",
    "confidence": 0.0-1.0,
    "reasoning": "brief explanation"
}}"""
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.1,  # Low temperature for consistent structured output
            "max_tokens": 200
        }
        async with httpx.AsyncClient() as client:
            try:
                response = await client.post(
                    f"{self.base_url}/chat/completions",
                    headers=self.headers,
                    json=payload,
                    timeout=30.0
                )
                response.raise_for_status()
                result = response.json()
                content = result["choices"][0]["message"]["content"].strip()
                # Parse the JSON response into the structured entity
                sentiment_data = json.loads(content)
                return SentimentScore(**sentiment_data)
            except Exception as e:
                # Best-effort: return neutral sentiment on failure
                return SentimentScore(
                    sentiment="neutral",
                    confidence=0.3,  # Below reliability threshold
                    reasoning=f"Analysis failed: {str(e)[:100]}"
                )

    async def generate_embeddings(self, texts: List[str],
                                  model: Optional[str] = None) -> List[Optional[List[float]]]:
        """Generate embeddings for multiple texts (None entries on failure)"""
        model = model or "text-embedding-3-large"
        # Truncate texts to avoid token limits
        truncated_texts = [text[:8000] for text in texts]
        payload = {
            "model": model,
            "input": truncated_texts,
            # text-embedding-3-large natively returns 3072 dimensions; request 1536
            # so results match the vector(1536) columns defined in the schema
            "dimensions": 1536
        }
        async with httpx.AsyncClient() as client:
            try:
                response = await client.post(
                    f"{self.base_url}/embeddings",
                    headers=self.headers,
                    json=payload,
                    timeout=60.0
                )
                response.raise_for_status()
                result = response.json()
                return [item["embedding"] for item in result["data"]]
            except Exception:
                # Best-effort: None embeddings on failure (stored as NULL in DB)
                return [None] * len(texts)
```
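A short usage sketch; `TradingAgentsConfig.from_env()` follows the testing section below, the article text is illustrative, and the async entrypoint wrapper is an assumption:
```python
import asyncio


async def main() -> None:
    config = TradingAgentsConfig.from_env()   # assumes OPENROUTER_API_KEY is set
    client = OpenRouterClient(config)

    text = "Apple's quarterly earnings exceeded expectations with strong iPhone sales."
    sentiment = await client.analyze_sentiment(text)
    embeddings = await client.generate_embeddings([text])

    print(sentiment.sentiment, sentiment.confidence)
    print(len(embeddings[0]) if embeddings[0] else "embedding unavailable")


asyncio.run(main())
```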
### Enhanced NewsService Integration
Update existing NewsService to integrate LLM capabilities:
```python
import datetime
from typing import List, Optional


class NewsService:
    """Enhanced NewsService with LLM sentiment and embeddings (final 5%)"""

    def __init__(self,
                 repository: NewsRepository,
                 google_client: GoogleNewsClient,
                 scraper_client: ArticleScraperClient,
                 openrouter_client: OpenRouterClient):
        self.repository = repository
        self.google_client = google_client
        self.scraper_client = scraper_client
        self.openrouter_client = openrouter_client

    async def update_company_news(self,
                                  symbol: str,
                                  lookback_days: int = 7,
                                  max_articles: int = 20,
                                  include_sentiment: bool = True,
                                  include_embeddings: bool = True) -> List[NewsArticle]:
        """Enhanced method with LLM sentiment analysis and embeddings"""
        # Step 1: Use existing 95% infrastructure for collection
        cutoff_date = datetime.datetime.now() - datetime.timedelta(days=lookback_days)

        # Fetch from Google News (existing)
        google_results = await self.google_client.fetch_company_news(symbol, max_articles)

        articles = []
        for result in google_results:
            if result.published_date < cutoff_date:
                continue

            # Scrape full content (existing)
            scraped_content = await self.scraper_client.scrape_article(result.url)

            # Create base article (existing pattern)
            article = NewsArticle(
                headline=result.title,
                url=result.url,
                source=result.source,
                published_date=result.published_date,
                summary=scraped_content.summary if scraped_content else result.description,
                entities=[symbol],
                author=scraped_content.author if scraped_content else None
            )

            # Step 2: NEW - Add LLM sentiment analysis
            if include_sentiment and scraped_content and scraped_content.content:
                article.sentiment_score = await self.openrouter_client.analyze_sentiment(
                    scraped_content.content
                )

            articles.append(article)

        # Step 3: NEW - Batch generate embeddings
        if include_embeddings and articles:
            titles = [a.headline for a in articles]
            contents = [a.summary or a.headline for a in articles]

            title_embeddings = await self.openrouter_client.generate_embeddings(titles)
            content_embeddings = await self.openrouter_client.generate_embeddings(contents)

            for i, article in enumerate(articles):
                if i < len(title_embeddings) and title_embeddings[i]:
                    article.title_embedding = title_embeddings[i]
                if i < len(content_embeddings) and content_embeddings[i]:
                    article.content_embedding = content_embeddings[i]

        # Step 4: Batch persist (existing pattern)
        await self.repository.upsert_batch(articles)
        return articles

    async def find_similar_articles(self,
                                    query_text: str,
                                    symbol: Optional[str] = None,
                                    limit: int = 10) -> List[NewsArticle]:
        """NEW: Semantic similarity search for News Analysts"""
        # Generate query embedding
        query_embeddings = await self.openrouter_client.generate_embeddings([query_text])
        if not query_embeddings[0]:
            # Fallback to text search when embedding generation failed
            return await self.repository.find_by_text_search(query_text, symbol, limit)
        return await self.repository.find_similar_articles(
            query_embeddings[0], symbol, limit
        )
```
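Phase 1 lists a new repository method `find_similar_articles()`. A minimal sketch of that method under the same asyncpg assumption as the upsert sketch above; the SQL mirrors the semantic similarity query in the Database Design section, and the text-form vector encoding plus the row-to-entity mapping are assumptions:
```python
import json
from typing import List, Optional


class NewsRepository:
    # ... existing methods (upsert_batch, find_by_text_search, ...) omitted ...

    async def find_similar_articles(self,
                                    query_embedding: List[float],
                                    symbol: Optional[str] = None,
                                    limit: int = 10) -> List["NewsArticle"]:
        """Cosine-similarity search over title embeddings (method sketch)."""
        vector_text = "[" + ",".join(map(str, query_embedding)) + "]"
        sql = """
            SELECT headline, url, source, published_date, summary, entities,
                   sentiment_score, author, category
            FROM news_articles
            WHERE ($2::text IS NULL OR entities @> ARRAY[$2::text])
              AND title_embedding IS NOT NULL
            ORDER BY title_embedding <=> $1::vector
            LIMIT $3;
        """
        async with self.pool.acquire() as conn:
            rows = await conn.fetch(sql, vector_text, symbol, limit)
        # asyncpg returns JSONB as a JSON string by default; decode before building entities
        return [
            NewsArticle.from_record({
                **dict(row),
                "sentiment_score": json.loads(row["sentiment_score"]) if row["sentiment_score"] else None,
            })
            for row in rows
        ]
```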
## Job Scheduling Architecture
### APScheduler Integration
Robust scheduled execution using APScheduler:
```python
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.jobstores.redis import RedisJobStore # Optional: persistent job store
from apscheduler.executors.asyncio import AsyncIOExecutor
import logging
class ScheduledNewsCollector:
"""Orchestrates scheduled news collection jobs"""
def __init__(self,
news_service: NewsService,
config: TradingAgentsConfig,
job_config: NewsJobConfig):
self.news_service = news_service
self.config = config
self.job_config = job_config
# Configure APScheduler
jobstores = {
'default': {'type': 'memory'} # Use Redis for production
}
executors = {
'default': AsyncIOExecutor(),
}
job_defaults = {
'coalesce': False, # Don't combine missed jobs
'max_instances': 1, # One job per ticker at a time
'misfire_grace_time': 300 # 5 minute grace period
}
self.scheduler = AsyncIOScheduler(
jobstores=jobstores,
executors=executors,
job_defaults=job_defaults,
timezone='UTC'
)
async def start(self):
"""Start the scheduler and register jobs"""
for ticker in self.job_config.tickers:
# Schedule daily collection for each ticker
self.scheduler.add_job(
func=self._collect_ticker_news,
trigger='cron',
hour=self.job_config.schedule_hour,
minute=0,
args=[ticker],
id=f"news_collection_{ticker}",
replace_existing=True,
max_instances=1
)
self.scheduler.start()
logging.info(f"Started news collection scheduler for {len(self.job_config.tickers)} tickers")
async def stop(self):
"""Gracefully stop the scheduler"""
if self.scheduler.running:
self.scheduler.shutdown(wait=True)
async def _collect_ticker_news(self, ticker: str):
"""Execute news collection for a single ticker"""
start_time = datetime.datetime.now()
try:
logging.info(f"Starting news collection for {ticker}")
articles = await self.news_service.update_company_news(
symbol=ticker,
lookback_days=self.job_config.lookback_days,
max_articles=self.job_config.max_articles_per_ticker,
include_sentiment=True,
include_embeddings=True
)
# Log metrics
sentiment_count = sum(1 for a in articles if a.has_reliable_sentiment())
embedding_count = sum(1 for a in articles if a.title_embedding)
duration = (datetime.datetime.now() - start_time).total_seconds()
logging.info(
f"Completed news collection for {ticker}: "
f"{len(articles)} articles, {sentiment_count} with sentiment, "
f"{embedding_count} with embeddings in {duration:.1f}s"
)
except Exception as e:
logging.error(f"News collection failed for {ticker}: {str(e)}")
# Don't raise - let scheduler continue with other tickers
def get_job_status(self) -> Dict[str, Any]:
"""Get status of all scheduled jobs"""
jobs = self.scheduler.get_jobs()
return {
"scheduler_running": self.scheduler.running,
"job_count": len(jobs),
"jobs": [
{
"id": job.id,
"next_run": job.next_run_time.isoformat() if job.next_run_time else None,
"trigger": str(job.trigger)
}
for job in jobs
]
}
```
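A minimal wiring sketch showing how the collector might be started from an asyncio entrypoint. The ticker list and hour are illustrative, and construction of the `NewsService` (repository and clients) is left to the existing 95% infrastructure, so the final call is shown commented out:
```python
import asyncio


async def run_news_collection(news_service: NewsService) -> None:
    """Start the scheduler and keep the event loop alive so cron triggers can fire."""
    config = TradingAgentsConfig.from_env()
    job_config = NewsJobConfig(tickers=["AAPL", "MSFT"], schedule_hour=6)

    collector = ScheduledNewsCollector(news_service, config, job_config)
    await collector.start()
    try:
        await collector.start() if False else await asyncio.Event().wait()  # block until cancelled (e.g. SIGINT)
    finally:
        await collector.stop()


# asyncio.run(run_news_collection(news_service))  # news_service built from the existing infrastructure
```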
### Error Handling and Monitoring
Comprehensive error handling for production reliability:
```python
import datetime
import logging
from collections import defaultdict


class NewsCollectionMonitor:
    """Monitor and handle news collection job failures"""

    def __init__(self, collector: ScheduledNewsCollector):
        self.collector = collector
        self.failure_counts = defaultdict(int)
        self.max_failures = 3

    async def handle_job_failure(self, ticker: str, error: Exception):
        """Handle job failure with exponential backoff"""
        self.failure_counts[ticker] += 1
        if self.failure_counts[ticker] >= self.max_failures:
            logging.error(f"Max failures reached for {ticker}, disabling job")
            self.collector.scheduler.remove_job(f"news_collection_{ticker}")
            # Could send an alert here
        else:
            # Schedule retry with exponential backoff
            delay_minutes = 2 ** self.failure_counts[ticker]
            retry_time = datetime.datetime.now() + datetime.timedelta(minutes=delay_minutes)
            self.collector.scheduler.add_job(
                func=self.collector._collect_ticker_news,
                trigger='date',
                run_date=retry_time,
                args=[ticker],
                id=f"news_retry_{ticker}_{int(retry_time.timestamp())}",
                max_instances=1
            )

    def reset_failure_count(self, ticker: str):
        """Reset failure count on successful job"""
        if ticker in self.failure_counts:
            del self.failure_counts[ticker]
```
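The design does not show how `NewsCollectionMonitor` is invoked. One plausible hookup, sketched here as an assumption, uses APScheduler's listener API to route `EVENT_JOB_ERROR` to `handle_job_failure()` and successful runs to `reset_failure_count()`. Note that `_collect_ticker_news()` above currently swallows exceptions, so this wiring assumes collection errors are re-raised (or that the monitor is called directly from the `except` block); the ticker is recovered from the `news_collection_<ticker>` job-id convention:
```python
import asyncio

from apscheduler.events import EVENT_JOB_ERROR, EVENT_JOB_EXECUTED

JOB_ID_PREFIX = "news_collection_"


def attach_monitor(collector: ScheduledNewsCollector, monitor: NewsCollectionMonitor) -> None:
    """Route APScheduler job events to the failure monitor (wiring assumption)."""

    def _on_job_event(event) -> None:
        job_id = event.job_id
        if not job_id.startswith(JOB_ID_PREFIX):
            return
        ticker = job_id[len(JOB_ID_PREFIX):]
        if event.code == EVENT_JOB_ERROR:
            # With AsyncIOScheduler the listener fires inside the running event loop,
            # so the async handler can be scheduled as a task
            asyncio.create_task(monitor.handle_job_failure(ticker, event.exception))
        else:
            monitor.reset_failure_count(ticker)

    collector.scheduler.add_listener(_on_job_event, EVENT_JOB_ERROR | EVENT_JOB_EXECUTED)
```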
## Implementation Strategy
### Phase 1: Entity and Database Enhancements (Week 1)
**Deliverables:**
- [ ] Enhanced `NewsArticle` entity with `SentimentScore` and vector support
- [ ] New `NewsJobConfig` entity with validation
- [ ] Database migration for vector indexes and sentiment_score JSONB enhancement
- [ ] Repository method `find_similar_articles()` with pgvectorscale integration
**Testing Focus:**
- Unit tests for entity validation and serialization
- Repository integration tests with vector similarity queries
- Database migration verification
### Phase 2: OpenRouter Integration (Week 2)
**Deliverables:**
- [ ] `OpenRouterClient` with sentiment analysis and embeddings
- [ ] Enhanced `NewsService.update_company_news()` with LLM integration
- [ ] Error handling for LLM failures (best-effort approach)
- [ ] Integration tests with OpenRouter API (using pytest-vcr)
**Testing Focus:**
- Mock OpenRouter responses for consistent testing
- Error handling scenarios (API failures, malformed responses)
- Embedding dimension validation
### Phase 3: Job Scheduling System (Week 3)
**Deliverables:**
- [ ] `ScheduledNewsCollector` with APScheduler integration
- [ ] `NewsCollectionMonitor` for error handling and retries
- [ ] Configuration management for job scheduling
- [ ] Graceful startup and shutdown procedures
**Testing Focus:**
- Scheduler lifecycle testing
- Job execution and failure handling
- Configuration validation
### Phase 4: Testing and Performance Optimization (Week 4)
**Deliverables:**
- [ ] Complete test coverage maintaining >85% threshold
- [ ] Performance optimization for vector queries
- [ ] Documentation and deployment guides
- [ ] Integration with existing News Analyst AgentToolkit
**Testing Focus:**
- End-to-end integration tests
- Performance benchmarks for vector similarity queries
- Load testing for scheduled job execution
## Testing Strategy
### Test Architecture
Following the existing pragmatic TDD approach with mock boundaries:
```
tests/domains/news/
├── __init__.py
├── test_news_entities.py # Entity validation and serialization
├── test_news_service.py # Mock repository and OpenRouter client
├── test_news_repository.py # PostgreSQL test database
├── test_openrouter_client.py # pytest-vcr for API responses
├── test_scheduled_collector.py # Mock APScheduler and services
└── integration/
├── test_sentiment_pipeline.py # End-to-end sentiment analysis
├── test_embedding_pipeline.py # End-to-end embedding generation
└── test_scheduled_execution.py # Full job execution cycle
```
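The service tests below rely on mocked boundaries. A possible `conftest.py` sketch; the fixture names match the tests, while the `AsyncMock` wiring and default return values are assumptions:
```python
# tests/domains/news/conftest.py (sketch)
from unittest.mock import AsyncMock

import pytest


@pytest.fixture
def mock_repository():
    repo = AsyncMock()
    repo.upsert_batch.return_value = None
    return repo


@pytest.fixture
def mock_openrouter_client():
    return AsyncMock()


@pytest.fixture
def mock_google_client():
    client = AsyncMock()
    client.fetch_company_news.return_value = []   # tests override as needed
    return client


@pytest.fixture
def mock_scraper_client():
    client = AsyncMock()
    client.scrape_article.return_value = None
    return client
```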
### Key Test Categories
**Entity Tests (Fast Unit Tests)**
```python
import datetime

import pytest
from pydantic import ValidationError


def test_news_article_sentiment_validation():
    """Test sentiment score validation and reliability checks"""
    # Valid sentiment
    sentiment = SentimentScore(
        sentiment="positive",
        confidence=0.8,
        reasoning="Strong positive language"
    )
    article = NewsArticle(
        headline="Test headline",
        url="https://example.com",
        source="Test Source",
        published_date=datetime.datetime.now(),
        sentiment_score=sentiment
    )
    assert article.has_reliable_sentiment() is True

    # Low confidence sentiment
    low_confidence = SentimentScore(
        sentiment="neutral",
        confidence=0.3,
        reasoning="Ambiguous language"
    )
    article.sentiment_score = low_confidence
    assert article.has_reliable_sentiment() is False


def test_news_article_vector_validation():
    """Test vector embedding validation"""
    # Valid 1536-dimension embedding
    valid_embedding = [0.1] * 1536
    article = NewsArticle(
        headline="Test",
        url="https://example.com",
        source="Test",
        published_date=datetime.datetime.now(),
        title_embedding=valid_embedding
    )
    assert len(article.title_embedding) == 1536

    # Invalid dimension should raise ValidationError
    with pytest.raises(ValidationError):
        NewsArticle(
            headline="Test",
            url="https://example.com",
            source="Test",
            published_date=datetime.datetime.now(),
            title_embedding=[0.1] * 512  # Wrong dimension
        )
```
**Service Integration Tests (Mock Boundaries)**
```python
import datetime
from types import SimpleNamespace

import pytest


@pytest.mark.asyncio
async def test_news_service_with_sentiment_analysis(
        mock_openrouter_client, mock_repository, mock_google_client, mock_scraper_client):
    """Test NewsService integration with mocked LLM client"""
    # One fake Google News result so the service has something to process
    mock_google_client.fetch_company_news.return_value = [
        SimpleNamespace(
            title="Apple beats expectations",
            url="https://example.com/apple",
            source="TechNews",
            published_date=datetime.datetime.now(),
            description="Quarterly results"
        )
    ]
    mock_scraper_client.scrape_article.return_value = SimpleNamespace(
        summary="Strong quarter", author="Jane Doe", content="Full article text"
    )

    # Mock successful sentiment analysis
    mock_sentiment = SentimentScore(
        sentiment="positive",
        confidence=0.9,
        reasoning="Optimistic financial outlook"
    )
    mock_openrouter_client.analyze_sentiment.return_value = mock_sentiment

    # Mock embeddings: first call returns title embeddings, second call content embeddings
    mock_openrouter_client.generate_embeddings.side_effect = [
        [[0.1] * 1536],  # title embedding
        [[0.2] * 1536],  # content embedding
    ]

    service = NewsService(
        repository=mock_repository,
        google_client=mock_google_client,
        scraper_client=mock_scraper_client,
        openrouter_client=mock_openrouter_client
    )

    articles = await service.update_company_news("AAPL", include_sentiment=True)

    # Verify LLM integration
    assert len(articles) > 0
    assert articles[0].sentiment_score == mock_sentiment
    assert articles[0].title_embedding == [0.1] * 1536
    assert mock_openrouter_client.analyze_sentiment.called
    assert mock_openrouter_client.generate_embeddings.called
```
**Repository Integration Tests (Real Database)**
```python
import datetime

import pytest


@pytest.mark.asyncio
async def test_repository_vector_similarity_search(test_db):
    """Test vector similarity search with real pgvectorscale"""
    repository = NewsRepository(test_db)

    # Insert articles with embeddings
    article1 = NewsArticle(
        headline="Apple reports strong iPhone sales",
        url="https://example.com/1",
        source="TechNews",
        published_date=datetime.datetime.now(),
        entities=["AAPL"],
        title_embedding=[0.1, 0.2] + [0.0] * 1534  # Similar to query
    )
    article2 = NewsArticle(
        headline="Microsoft launches new Azure features",
        url="https://example.com/2",
        source="CloudNews",
        published_date=datetime.datetime.now(),
        entities=["MSFT"],
        title_embedding=[0.9, 0.8] + [0.0] * 1534  # Different from query
    )
    await repository.upsert_batch([article1, article2])

    # Query with a similar embedding
    query_embedding = [0.15, 0.25] + [0.0] * 1534
    similar_articles = await repository.find_similar_articles(
        query_embedding, symbol="AAPL", limit=1
    )

    assert len(similar_articles) == 1
    assert similar_articles[0].headline == "Apple reports strong iPhone sales"
```
**API Integration Tests (pytest-vcr)**
```python
import pytest


@pytest.mark.vcr
@pytest.mark.asyncio
async def test_openrouter_sentiment_analysis():
    """Test real OpenRouter API calls with VCR cassettes"""
    config = TradingAgentsConfig.from_env()
    client = OpenRouterClient(config)

    test_text = "Apple's quarterly earnings exceeded expectations with strong iPhone sales."
    sentiment = await client.analyze_sentiment(test_text)

    assert isinstance(sentiment, SentimentScore)
    assert sentiment.sentiment in ["positive", "negative", "neutral"]
    assert 0.0 <= sentiment.confidence <= 1.0
    assert len(sentiment.reasoning) > 0


@pytest.mark.vcr
@pytest.mark.asyncio
async def test_openrouter_embeddings_generation():
    """Test real OpenRouter embeddings API with VCR"""
    config = TradingAgentsConfig.from_env()
    client = OpenRouterClient(config)

    texts = ["Apple stock rises", "Market volatility increases"]
    embeddings = await client.generate_embeddings(texts)

    assert len(embeddings) == 2
    assert all(len(emb) == 1536 for emb in embeddings)
    assert all(isinstance(val, float) for emb in embeddings for val in emb)
```
### Coverage Requirements
Maintain existing >85% coverage with new components:
- **Entity Layer**: 95% coverage (comprehensive validation testing)
- **Service Layer**: 90% coverage (mock external dependencies)
- **Repository Layer**: 85% coverage (real database integration tests)
- **Client Layer**: 80% coverage (pytest-vcr for API calls)
- **Integration Tests**: End-to-end scenarios covering complete workflows
### Performance Testing
```python
import random
import time

import pytest


@pytest.mark.performance
@pytest.mark.asyncio
async def test_vector_similarity_performance(test_db):
    """Ensure vector similarity queries perform under 100ms"""
    repository = NewsRepository(test_db)

    # Insert 1000 articles with embeddings (helper defined in test utilities)
    articles = [create_test_article_with_embedding() for _ in range(1000)]
    await repository.upsert_batch(articles)

    query_embedding = [random.random() for _ in range(1536)]

    start_time = time.time()
    results = await repository.find_similar_articles(query_embedding, limit=10)
    duration = time.time() - start_time

    assert duration < 0.1  # Under 100ms
    assert len(results) == 10
```
## Integration Points
### News Analyst AgentToolkit Integration
The completed News domain integrates seamlessly with existing News Analyst agents:
```python
from typing import Any, Dict, List, Optional


class NewsAnalystToolkit:
    """Enhanced toolkit with semantic search capabilities"""

    def __init__(self, news_service: NewsService):
        self.news_service = news_service

    async def get_relevant_news(self,
                                ticker: str,
                                query: Optional[str] = None,
                                days_back: int = 30) -> List[Dict[str, Any]]:
        """Get news with optional semantic search"""
        if query:
            # Use semantic similarity search
            articles = await self.news_service.find_similar_articles(
                query_text=query,
                symbol=ticker,
                limit=20
            )
        else:
            # Use time-based search (existing)
            articles = await self.news_service.find_recent_news(
                symbol=ticker,
                days_back=days_back
            )

        return [
            {
                "headline": article.headline,
                "summary": article.summary,
                "published_date": article.published_date.isoformat(),
                "sentiment": article.sentiment_score.sentiment if article.sentiment_score else "unknown",
                "confidence": article.sentiment_score.confidence if article.sentiment_score else 0.0,
                "source": article.source,
                "url": article.url
            }
            for article in articles
        ]
```
### Configuration Integration
Seamless integration with existing `TradingAgentsConfig`:
```python
import os

# Enhanced configuration for news domain completion
config = TradingAgentsConfig(
    # Existing LLM configuration
    llm_provider="openrouter",
    openrouter_api_key=os.getenv("OPENROUTER_API_KEY"),
    quick_think_llm="anthropic/claude-3.5-haiku",  # For sentiment analysis

    # New news-specific settings
    news_collection_enabled=True,
    news_schedule_hour=6,  # UTC
    news_sentiment_enabled=True,
    news_embeddings_enabled=True,
    news_max_articles_per_ticker=20,

    # Database (existing)
    database_url=os.getenv("DATABASE_URL"),
)

# Job configuration
news_job_config = NewsJobConfig(
    tickers=["AAPL", "GOOGL", "MSFT", "TSLA", "NVDA"],
    schedule_hour=6,  # 6 AM UTC daily collection
    sentiment_model=config.quick_think_llm,
    embedding_model="text-embedding-3-large",
    max_articles_per_ticker=20
)
```
This design completes the final 5% of the News domain while leveraging the existing 95% infrastructure, maintaining architectural consistency, and providing the robust scheduled execution, LLM-powered sentiment analysis, and vector embeddings needed for advanced News Analyst capabilities.