News Domain Technical Design

Overview

This document details the technical design for completing the final 5% of the News domain implementation. The existing infrastructure is 95% complete: Google News collection, article scraping, and basic storage are already implemented. The remaining work covers scheduled execution, LLM-powered sentiment analysis, and vector embeddings, with OpenRouter as the unified LLM provider.

Architecture Overview

Component Relationships

graph TD
    A[APScheduler] --> B[ScheduledNewsCollector]
    B --> C[NewsService]
    C --> D[GoogleNewsClient]
    C --> E[ArticleScraperClient]
    C --> F[OpenRouter LLM Client]
    C --> G[OpenRouter Embeddings Client]
    C --> H[NewsRepository]
    H --> I[PostgreSQL + TimescaleDB + pgvectorscale]
    
    J[News Analysts] --> K[AgentToolkit]
    K --> C
    K --> H

Data Flow Architecture

Scheduled Collection Flow

APScheduler → ScheduledNewsCollector → NewsService.update_company_news()
→ GoogleNewsClient → ArticleScraperClient → OpenRouter (sentiment + embeddings)
→ NewsRepository.upsert_batch() → PostgreSQL

Agent Query Flow

News Analyst → AgentToolkit → NewsService.find_similar_articles()
→ NewsRepository (semantic search) → pgvectorscale vector similarity

Key Design Principles

Leverage Existing 95%: Build on proven GoogleNewsClient and ArticleScraperClient infrastructure
OpenRouter Unified: Single API for both sentiment analysis and embeddings
Best-Effort Processing: LLM failures don't block article storage
Vector-Enhanced Search: Semantic similarity for News Analysts
Fault-Tolerant Scheduling: Robust error handling and monitoring
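
The best-effort principle can be illustrated with a small wrapper that turns a failing enrichment step into a missing value, so the article is still stored. This is a minimal sketch rather than the project's implementation: `best_effort`, `flaky_sentiment`, and `ok_sentiment` are hypothetical names introduced only for illustration.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


async def best_effort(step: Callable[[], Awaitable[T]], label: str) -> Optional[T]:
    """Run an optional enrichment step; return None instead of raising on failure."""
    try:
        return await step()
    except Exception as exc:  # deliberately broad: enrichment must never block storage
        logger.warning("%s failed, continuing without it: %s", label, exc)
        return None


async def flaky_sentiment() -> str:
    # Hypothetical stand-in for an LLM call that fails
    raise RuntimeError("LLM unavailable")


async def ok_sentiment() -> str:
    # Hypothetical stand-in for an LLM call that succeeds
    return "positive"


async def demo() -> tuple:
    failed = await best_effort(flaky_sentiment, "sentiment")
    succeeded = await best_effort(ok_sentiment, "sentiment")
    return failed, succeeded
```

With this shape, a `None` result maps naturally onto a NULL column, and the caller never needs its own try/except around each enrichment call.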

Domain Model

Enhanced NewsArticle Entity

The existing NewsArticle entity requires enhancements for structured sentiment and vector support:

from typing import Optional, Dict, Any, List, Literal
from pydantic import BaseModel, Field, validator
import datetime

class SentimentScore(BaseModel):
    """Structured sentiment analysis result"""
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str
    
    # Low confidence values are allowed here on purpose: failed or ambiguous
    # analyses are stored with confidence < 0.5 and are filtered out by
    # NewsArticle.has_reliable_sentiment() rather than rejected at validation.

class NewsArticle(BaseModel):
    """Enhanced NewsArticle entity with sentiment and vector support"""
    # Existing fields (95% complete)
    headline: str
    url: str = Field(..., regex=r'^https?://')
    source: str
    published_date: datetime.datetime
    summary: Optional[str] = None
    entities: List[str] = Field(default_factory=list)
    author: Optional[str] = None
    category: Optional[str] = None
    
    # Enhanced fields (final 5%)
    sentiment_score: Optional[SentimentScore] = None
    title_embedding: Optional[List[float]] = Field(None, min_items=1536, max_items=1536)
    content_embedding: Optional[List[float]] = Field(None, min_items=1536, max_items=1536)
    
    # Metadata
    created_at: datetime.datetime = Field(default_factory=datetime.datetime.now)
    updated_at: datetime.datetime = Field(default_factory=datetime.datetime.now)
    
    @validator('content_embedding', 'title_embedding')
    def validate_embeddings(cls, v):
        if v and len(v) != 1536:
            raise ValueError("Embeddings must have 1536 dimensions to match the vector(1536) columns")
        return v
        
    def has_reliable_sentiment(self) -> bool:
        """Check if sentiment analysis is reliable (confidence >= 0.5)"""
        return bool(self.sentiment_score and self.sentiment_score.confidence >= 0.5)
        
    def to_record(self) -> Dict[str, Any]:
        """Convert to database record format"""
        record = self.dict()
        # Convert sentiment to JSONB format
        if self.sentiment_score:
            record['sentiment_score'] = self.sentiment_score.dict()
        return record
    
    @classmethod
    def from_record(cls, record: Dict[str, Any]) -> 'NewsArticle':
        """Create entity from database record"""
        if record.get('sentiment_score'):
            record['sentiment_score'] = SentimentScore(**record['sentiment_score'])
        return cls(**record)

New NewsJobConfig Entity

Configuration entity for scheduled news collection:

from pydantic import BaseModel, Field, validator
from typing import List

class NewsJobConfig(BaseModel):
    """Configuration for scheduled news collection jobs"""
    tickers: List[str] = Field(..., min_items=1, max_items=50)
    schedule_hour: int = Field(..., ge=0, le=23)
    sentiment_model: str = Field(default="anthropic/claude-3.5-haiku")
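    # NOTE: text-embedding-3-large returns 3072 dimensions by default; request
    # 1536 dimensions or use a natively 1536-dimensional model to match vector(1536).
    embedding_model: str = Field(default="text-embedding-3-large")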
    max_articles_per_ticker: int = Field(default=20, ge=5, le=100)
    lookback_days: int = Field(default=7, ge=1, le=30)
    
    @validator('tickers')
    def validate_tickers(cls, v):
        # Ensure uppercase stock symbols
        return [ticker.upper().strip() for ticker in v]
    
    @validator('sentiment_model')
    def validate_sentiment_model(cls, v):
        # Ensure OpenRouter model format
        if '/' not in v:
            raise ValueError("Model must be in OpenRouter format (provider/model)")
        return v
    
    def to_cron_expression(self) -> str:
        """Convert to cron expression for APScheduler"""
        return f"0 {self.schedule_hour} * * *"  # Daily at specified hour
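
To make the semantics of the daily cron expression concrete, the sketch below computes the next firing time for a given `schedule_hour`. It assumes the scheduler runs in UTC, as configured in the APScheduler section of this document; `next_daily_run` is a hypothetical helper and not part of APScheduler or the codebase.

```python
import datetime


def next_daily_run(now: datetime.datetime, schedule_hour: int) -> datetime.datetime:
    """Return the next UTC occurrence of schedule_hour:00 strictly after `now`."""
    if not 0 <= schedule_hour <= 23:
        raise ValueError("schedule_hour must be between 0 and 23")
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now_utc = now.astimezone(datetime.timezone.utc)
    candidate = now_utc.replace(hour=schedule_hour, minute=0, second=0, microsecond=0)
    if candidate <= now_utc:
        # Today's slot has already passed (or is exactly now), so use tomorrow's
        candidate += datetime.timedelta(days=1)
    return candidate
```

For example, with `schedule_hour=6` a call at 05:30 UTC yields 06:00 on the same day, while a call at exactly 06:00 UTC yields 06:00 on the following day.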

Database Design

Schema Enhancements

The existing news_articles table requires minimal modifications to support the final 5%:

-- Existing table structure (95% complete)
CREATE TABLE IF NOT EXISTS news_articles (
    id SERIAL PRIMARY KEY,
    headline TEXT NOT NULL,
    url TEXT UNIQUE NOT NULL,
    source TEXT NOT NULL,
    published_date TIMESTAMPTZ NOT NULL,
    summary TEXT,
    entities TEXT[] DEFAULT '{}',
    sentiment_score JSONB,  -- Enhanced for structured format
    author TEXT,
    category TEXT,
    title_embedding vector(1536),     -- New: pgvector vector type (indexed via pgvectorscale)
    content_embedding vector(1536),   -- New: pgvector vector type (indexed via pgvectorscale)
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- New indexes for final 5% performance
-- A GIN index supports the entities @> ARRAY[...] ticker filter (a B-tree does not)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_entities 
    ON news_articles USING GIN (entities);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_published_date 
    ON news_articles (published_date DESC);

-- pgvectorscale exposes its StreamingDiskANN index as the diskann access method
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_title_embedding 
    ON news_articles USING diskann (title_embedding vector_cosine_ops);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_content_embedding 
    ON news_articles USING diskann (content_embedding vector_cosine_ops);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_sentiment 
    ON news_articles ((sentiment_score->>'sentiment')) 
    WHERE sentiment_score IS NOT NULL;
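
Because `CREATE TABLE IF NOT EXISTS` does nothing when the table is already present, an existing deployment would need the embedding columns added separately. The sketch below builds idempotent DDL statements for that case. It is an illustration under assumptions: `build_vector_migration` is a hypothetical helper, the project's actual migration tooling is not described in this document, and the table name is treated as a trusted constant rather than user input.

```python
from typing import List


def build_vector_migration(table: str = "news_articles", dimensions: int = 1536) -> List[str]:
    """Return idempotent DDL that adds the embedding columns to an existing table."""
    if dimensions <= 0:
        raise ValueError("dimensions must be positive")
    return [
        # pgvector provides the vector type; CASCADE installs it alongside vectorscale
        "CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;",
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS title_embedding vector({dimensions});",
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS content_embedding vector({dimensions});",
    ]
```

Each statement can be run repeatedly without error, which keeps the migration safe to re-apply.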

Query Patterns

Time-based News Queries (News Analysts)

-- Optimized for Agent queries: recent news for specific ticker
SELECT headline, summary, sentiment_score, published_date
FROM news_articles 
WHERE entities @> ARRAY[$1::text] 
  AND published_date >= NOW() - INTERVAL '30 days'
ORDER BY published_date DESC 
LIMIT 20;

Semantic Similarity Queries (Vector Search)

-- Find similar articles by cosine distance (the <=> operator), served by the pgvectorscale index
SELECT headline, url, summary, 
       1 - (title_embedding <=> $1::vector) AS similarity_score
FROM news_articles 
WHERE entities @> ARRAY[$2::text]
  AND title_embedding IS NOT NULL
ORDER BY title_embedding <=> $1::vector 
LIMIT 10;
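
The `<=>` operator in the query above returns cosine distance, so `1 - distance` is the cosine similarity reported as `similarity_score`. The pure-Python sketch below reproduces that arithmetic for small vectors; it is meant for reasoning about results and test fixtures, not as a replacement for the database operator. The function names are illustrative.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def similarity_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Mirror of the SQL expression 1 - (embedding <=> query)."""
    return 1.0 - cosine_distance(a, b)
```

Identical directions score close to 1 and orthogonal vectors score close to 0, which is why ordering by distance ascending returns the most similar articles first.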

Batch Upsert Operations (Daily Collection)

-- Efficient upsert for daily news collection
INSERT INTO news_articles (headline, url, source, published_date, summary, entities, sentiment_score, title_embedding, content_embedding)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
ON CONFLICT (url) DO UPDATE SET
    headline = EXCLUDED.headline,
    summary = EXCLUDED.summary,
    entities = EXCLUDED.entities,
    sentiment_score = EXCLUDED.sentiment_score,
    title_embedding = EXCLUDED.title_embedding,
    content_embedding = EXCLUDED.content_embedding,
    updated_at = NOW();
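
Depending on the database driver, a Python list may not bind directly to a `vector` parameter such as `$8`. pgvector accepts a text literal of the form `[x1,x2,...]`, so one option is to serialize embeddings before binding and cast with `::vector` in the statement. The helper below is a hypothetical sketch of that conversion; it is unnecessary if the repository already registers a pgvector adapter for its driver.

```python
import math
from typing import Optional, Sequence


def to_pgvector_literal(embedding: Optional[Sequence[float]]) -> Optional[str]:
    """Serialize an embedding to pgvector's text format, or None for SQL NULL."""
    if embedding is None:
        return None
    values = [float(x) for x in embedding]
    if not all(math.isfinite(v) for v in values):
        # pgvector rejects NaN and infinite components
        raise ValueError("embedding contains non-finite values")
    return "[" + ",".join(repr(v) for v in values) + "]"
```

Returning `None` for a missing embedding keeps the best-effort behaviour intact: the column is simply stored as NULL.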

API Integration

OpenRouter Unified Client

Single OpenRouter integration for both sentiment analysis and embeddings:

from typing import List, Optional, Dict, Any
import httpx
from tradingagents.config import TradingAgentsConfig

class OpenRouterClient:
    """Unified OpenRouter client for sentiment analysis and embeddings"""
    
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        self.base_url = "https://openrouter.ai/api/v1"
        self.headers = {
            "Authorization": f"Bearer {config.openrouter_api_key}",
            "Content-Type": "application/json"
        }
    
    async def analyze_sentiment(self, text: str, model: Optional[str] = None) -> SentimentScore:
        """Generate structured sentiment analysis using LLM"""
        model = model or self.config.quick_think_llm
        
        # The article text is truncated to 2000 characters to stay within token limits
        prompt = f"""Analyze the sentiment of this news article text and respond with ONLY a JSON object:

Article: {text[:2000]}

Required JSON format:
{{
    "sentiment": "positive|negative|neutral",
    "confidence": 0.0-1.0,
    "reasoning": "brief explanation"
}}"""
        
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.1,  # Low temperature for consistent structured output
            "max_tokens": 200
        }
        
        async with httpx.AsyncClient() as client:
            try:
                response = await client.post(
                    f"{self.base_url}/chat/completions",
                    headers=self.headers,
                    json=payload,
                    timeout=30.0
                )
                response.raise_for_status()
                
                result = response.json()
                content = result["choices"][0]["message"]["content"].strip()
                
                # Parse JSON response
                import json
                sentiment_data = json.loads(content)
                return SentimentScore(**sentiment_data)
                
            except Exception as e:
                # Best-effort: return neutral sentiment on failure
                return SentimentScore(
                    sentiment="neutral",
                    confidence=0.3,  # Below reliability threshold
                    reasoning=f"Analysis failed: {str(e)[:100]}"
                )
    
    async def generate_embeddings(self, texts: List[str], model: Optional[str] = None) -> List[Optional[List[float]]]:
        """Generate embeddings for multiple texts"""
        model = model or "text-embedding-3-large"
        
        # Truncate texts to avoid token limits
        truncated_texts = [text[:8000] for text in texts]
        
        payload = {
            "model": model,
            "input": truncated_texts
        }
        
        async with httpx.AsyncClient() as client:
            try:
                response = await client.post(
                    f"{self.base_url}/embeddings",
                    headers=self.headers,
                    json=payload,
                    timeout=60.0
                )
                response.raise_for_status()
                
                result = response.json()
                return [item["embedding"] for item in result["data"]]
                
            except Exception as e:
                # Return None embeddings on failure (stored as NULL in DB)
                return [None] * len(texts)
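
Models do not always return bare JSON, even when asked to. A common failure mode is a reply wrapped in a Markdown code fence or preceded by a short sentence, which makes a direct `json.loads` call fail. The sketch below shows one tolerant way to extract and check the object before building a `SentimentScore`; `parse_sentiment_json` and `ALLOWED_SENTIMENTS` are hypothetical names, and the extraction rule (first `{` to last `}`) is a simplifying assumption.

```python
import json
from typing import Any, Dict

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def parse_sentiment_json(content: str) -> Dict[str, Any]:
    """Extract and validate the sentiment object from raw model output."""
    start = content.find("{")
    end = content.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a ValueError subclass, so callers catch one type
    data = json.loads(content[start:end + 1])

    sentiment = str(data.get("sentiment", "")).lower()
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment label: {sentiment!r}")

    try:
        confidence = float(data["confidence"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError("missing or non-numeric confidence") from exc
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")

    return {
        "sentiment": sentiment,
        "confidence": confidence,
        "reasoning": str(data.get("reasoning", "")),
    }
```

Every failure path raises `ValueError`, so the existing best-effort fallback to a low-confidence neutral score can stay as a single `except` branch.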

Enhanced NewsService Integration

Update existing NewsService to integrate LLM capabilities:

class NewsService:
    """Enhanced NewsService with LLM sentiment and embeddings (final 5%)"""
    
    def __init__(self, 
                 repository: NewsRepository, 
                 google_client: GoogleNewsClient,
                 scraper_client: ArticleScraperClient,
                 openrouter_client: OpenRouterClient):
        self.repository = repository
        self.google_client = google_client
        self.scraper_client = scraper_client
        self.openrouter_client = openrouter_client
    
    async def update_company_news(self, 
                                symbol: str, 
                                lookback_days: int = 7,
                                max_articles: int = 20,
                                include_sentiment: bool = True,
                                include_embeddings: bool = True) -> List[NewsArticle]:
        """Enhanced method with LLM sentiment analysis and embeddings"""
        
        # Step 1: Use existing 95% infrastructure for collection
        cutoff_date = datetime.datetime.now() - datetime.timedelta(days=lookback_days)
        
        # Fetch from Google News (existing)
        google_results = await self.google_client.fetch_company_news(symbol, max_articles)
        
        articles = []
        for result in google_results:
            if result.published_date < cutoff_date:
                continue
                
            # Scrape full content (existing)
            scraped_content = await self.scraper_client.scrape_article(result.url)
            
            # Create base article (existing pattern)
            article = NewsArticle(
                headline=result.title,
                url=result.url,
                source=result.source,
                published_date=result.published_date,
                summary=scraped_content.summary if scraped_content else result.description,
                entities=[symbol],
                author=scraped_content.author if scraped_content else None
            )
            
            # Step 2: NEW - Add LLM sentiment analysis
            if include_sentiment and scraped_content and scraped_content.content:
                article.sentiment_score = await self.openrouter_client.analyze_sentiment(
                    scraped_content.content
                )
            
            articles.append(article)
        
        # Step 3: NEW - Batch generate embeddings
        if include_embeddings and articles:
            titles = [a.headline for a in articles]
            contents = [a.summary or a.headline for a in articles]
            
            title_embeddings = await self.openrouter_client.generate_embeddings(titles)
            content_embeddings = await self.openrouter_client.generate_embeddings(contents)
            
            for i, article in enumerate(articles):
                if i < len(title_embeddings) and title_embeddings[i]:
                    article.title_embedding = title_embeddings[i]
                if i < len(content_embeddings) and content_embeddings[i]:
                    article.content_embedding = content_embeddings[i]
        
        # Step 4: Batch persist (existing pattern)
        await self.repository.upsert_batch(articles)
        return articles
    
    async def find_similar_articles(self, 
                                  query_text: str, 
                                  symbol: Optional[str] = None,
                                  limit: int = 10) -> List[NewsArticle]:
        """NEW: Semantic similarity search for News Analysts"""
        
        # Generate query embedding
        query_embeddings = await self.openrouter_client.generate_embeddings([query_text])
        if not query_embeddings[0]:
            # Fallback to text search
            return await self.repository.find_by_text_search(query_text, symbol, limit)
            
        return await self.repository.find_similar_articles(
            query_embeddings[0], symbol, limit
        )
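
`update_company_news()` scrapes and analyzes articles one at a time, so total latency grows linearly with the number of articles. If that becomes a bottleneck, the per-article work could run concurrently with a cap on in-flight requests. The following is a sketch under assumptions: `map_bounded` is a hypothetical helper, and the default limit of 5 is an arbitrary illustrative value, not a measured or recommended setting.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_bounded(
    items: List[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = 5,
) -> List[R]:
    """Apply `worker` to every item with at most `limit` calls in flight.

    Results are returned in the same order as `items`.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    semaphore = asyncio.Semaphore(limit)

    async def run(item: T) -> R:
        async with semaphore:
            return await worker(item)

    # gather preserves input order regardless of completion order
    return list(await asyncio.gather(*(run(item) for item in items)))
```

Note that `asyncio.gather` propagates the first exception by default, so the worker passed in should itself be best-effort (returning `None` on failure) to preserve the design principle above.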

Job Scheduling Architecture

APScheduler Integration

Robust scheduled execution using APScheduler:

import datetime
import logging
from typing import Any, Dict

from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.executors.asyncio import AsyncIOExecutor
# Optional persistent job store; importing it requires the redis package:
# from apscheduler.jobstores.redis import RedisJobStore

class ScheduledNewsCollector:
    """Orchestrates scheduled news collection jobs"""
    
    def __init__(self, 
                 news_service: NewsService,
                 config: TradingAgentsConfig,
                 job_config: NewsJobConfig):
        self.news_service = news_service
        self.config = config
        self.job_config = job_config
        
        # Configure APScheduler
        jobstores = {
            'default': {'type': 'memory'}  # Use Redis for production
        }
        executors = {
            'default': AsyncIOExecutor(),
        }
        job_defaults = {
            'coalesce': False,  # Don't combine missed jobs
            'max_instances': 1,  # One job per ticker at a time
            'misfire_grace_time': 300  # 5 minute grace period
        }
        
        self.scheduler = AsyncIOScheduler(
            jobstores=jobstores,
            executors=executors,
            job_defaults=job_defaults,
            timezone='UTC'
        )
    
    async def start(self):
        """Start the scheduler and register jobs"""
        
        for ticker in self.job_config.tickers:
            # Schedule daily collection for each ticker
            self.scheduler.add_job(
                func=self._collect_ticker_news,
                trigger='cron',
                hour=self.job_config.schedule_hour,
                minute=0,
                args=[ticker],
                id=f"news_collection_{ticker}",
                replace_existing=True,
                max_instances=1
            )
            
        self.scheduler.start()
        logging.info(f"Started news collection scheduler for {len(self.job_config.tickers)} tickers")
    
    async def stop(self):
        """Gracefully stop the scheduler"""
        if self.scheduler.running:
            self.scheduler.shutdown(wait=True)
    
    async def _collect_ticker_news(self, ticker: str):
        """Execute news collection for a single ticker"""
        
        start_time = datetime.datetime.now()
        
        try:
            logging.info(f"Starting news collection for {ticker}")
            
            articles = await self.news_service.update_company_news(
                symbol=ticker,
                lookback_days=self.job_config.lookback_days,
                max_articles=self.job_config.max_articles_per_ticker,
                include_sentiment=True,
                include_embeddings=True
            )
            
            # Log metrics
            sentiment_count = sum(1 for a in articles if a.has_reliable_sentiment())
            embedding_count = sum(1 for a in articles if a.title_embedding)
            
            duration = (datetime.datetime.now() - start_time).total_seconds()
            
            logging.info(
                f"Completed news collection for {ticker}: "
                f"{len(articles)} articles, {sentiment_count} with sentiment, "
                f"{embedding_count} with embeddings in {duration:.1f}s"
            )
            
        except Exception as e:
            logging.error(f"News collection failed for {ticker}: {str(e)}")
            # Don't raise - let scheduler continue with other tickers
    
    def get_job_status(self) -> Dict[str, Any]:
        """Get status of all scheduled jobs"""
        jobs = self.scheduler.get_jobs()
        return {
            "scheduler_running": self.scheduler.running,
            "job_count": len(jobs),
            "jobs": [
                {
                    "id": job.id,
                    "next_run": job.next_run_time.isoformat() if job.next_run_time else None,
                    "trigger": str(job.trigger)
                }
                for job in jobs
            ]
        }

Error Handling and Monitoring

Comprehensive error handling for production reliability:

import datetime
import logging
from collections import defaultdict

class NewsCollectionMonitor:
    """Monitor and handle news collection job failures"""
    
    def __init__(self, collector: ScheduledNewsCollector):
        self.collector = collector
        self.failure_counts = defaultdict(int)
        self.max_failures = 3
    
    async def handle_job_failure(self, ticker: str, error: Exception):
        """Handle job failure with exponential backoff"""
        
        self.failure_counts[ticker] += 1
        
        if self.failure_counts[ticker] >= self.max_failures:
            logging.error(f"Max failures reached for {ticker}, disabling job")
            self.collector.scheduler.remove_job(f"news_collection_{ticker}")
            # Could send alert here
        else:
            # Schedule retry with exponential backoff
            delay_minutes = 2 ** self.failure_counts[ticker]
            # Use an aware UTC timestamp: the scheduler is configured with timezone='UTC'
            retry_time = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(minutes=delay_minutes)
            
            self.collector.scheduler.add_job(
                func=self.collector._collect_ticker_news,
                trigger='date',
                run_date=retry_time,
                args=[ticker],
                id=f"news_retry_{ticker}_{int(retry_time.timestamp())}",
                max_instances=1
            )
    
    def reset_failure_count(self, ticker: str):
        """Reset failure count on successful job"""
        if ticker in self.failure_counts:
            del self.failure_counts[ticker]
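
The retry schedule implied by `handle_job_failure()` is easy to work out by hand: the n-th consecutive failure schedules a retry after 2^n minutes, and the failure that reaches `max_failures` disables the job instead. The small function below, a hypothetical helper written only to illustrate that arithmetic, lists the delays for a given limit.

```python
from typing import List


def retry_delays_minutes(max_failures: int = 3) -> List[int]:
    """Delays (in minutes) scheduled after each failure before the job is disabled."""
    if max_failures < 1:
        raise ValueError("max_failures must be at least 1")
    # Failures 1 .. max_failures-1 each schedule a retry after 2**n minutes;
    # the max_failures-th failure removes the job, so it has no delay entry.
    return [2 ** n for n in range(1, max_failures)]
```

With the default `max_failures = 3`, a ticker is retried after 2 minutes and then after 4 minutes, and its daily job is removed on the third consecutive failure.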

Implementation Strategy

Phase 1: Entity and Database Enhancements (Week 1)

Deliverables:

Enhanced NewsArticle entity with SentimentScore and vector support
New NewsJobConfig entity with validation
Database migration for vector indexes and sentiment_score JSONB enhancement
Repository method find_similar_articles() with pgvectorscale integration

Testing Focus:

Unit tests for entity validation and serialization
Repository integration tests with vector similarity queries
Database migration verification

Phase 2: OpenRouter Integration (Week 2)

Deliverables:

OpenRouterClient with sentiment analysis and embeddings
Enhanced NewsService.update_company_news() with LLM integration
Error handling for LLM failures (best-effort approach)
Integration tests with OpenRouter API (using pytest-vcr)

Testing Focus:

Mock OpenRouter responses for consistent testing
Error handling scenarios (API failures, malformed responses)
Embedding dimension validation

Phase 3: Job Scheduling System (Week 3)

Deliverables:

ScheduledNewsCollector with APScheduler integration
NewsCollectionMonitor for error handling and retries
Configuration management for job scheduling
Graceful startup and shutdown procedures

Testing Focus:

Scheduler lifecycle testing
Job execution and failure handling
Configuration validation

Phase 4: Testing and Performance Optimization (Week 4)

Deliverables:

Complete test coverage maintaining >85% threshold
Performance optimization for vector queries
Documentation and deployment guides
Integration with existing News Analyst AgentToolkit

Testing Focus:

End-to-end integration tests
Performance benchmarks for vector similarity queries
Load testing for scheduled job execution

Testing Strategy

Test Architecture

Following the existing pragmatic TDD approach with mock boundaries:

tests/domains/news/
├── __init__.py
├── test_news_entities.py          # Entity validation and serialization
├── test_news_service.py           # Mock repository and OpenRouter client  
├── test_news_repository.py        # PostgreSQL test database
├── test_openrouter_client.py      # pytest-vcr for API responses
├── test_scheduled_collector.py    # Mock APScheduler and services
└── integration/
    ├── test_sentiment_pipeline.py    # End-to-end sentiment analysis
    ├── test_embedding_pipeline.py    # End-to-end embedding generation
    └── test_scheduled_execution.py   # Full job execution cycle

Key Test Categories

Entity Tests (Fast Unit Tests)

import datetime

import pytest
from pydantic import ValidationError

def test_news_article_sentiment_validation():
    """Test sentiment score validation and reliability checks"""
    
    # Valid sentiment
    sentiment = SentimentScore(
        sentiment="positive",
        confidence=0.8,
        reasoning="Strong positive language"
    )
    
    article = NewsArticle(
        headline="Test headline",
        url="https://example.com",
        source="Test Source",
        published_date=datetime.datetime.now(),
        sentiment_score=sentiment
    )
    
    assert article.has_reliable_sentiment()
    
    # Low confidence sentiment
    low_confidence = SentimentScore(
        sentiment="neutral",
        confidence=0.3,
        reasoning="Ambiguous language"
    )
    
    article.sentiment_score = low_confidence
    assert not article.has_reliable_sentiment()

def test_news_article_vector_validation():
    """Test vector embedding validation"""
    
    # Valid 1536-dimension embedding
    valid_embedding = [0.1] * 1536
    article = NewsArticle(
        headline="Test",
        url="https://example.com", 
        source="Test",
        published_date=datetime.datetime.now(),
        title_embedding=valid_embedding
    )
    
    assert len(article.title_embedding) == 1536
    
    # Invalid dimension should raise ValidationError
    with pytest.raises(ValidationError):
        NewsArticle(
            headline="Test",
            url="https://example.com",
            source="Test", 
            published_date=datetime.datetime.now(),
            title_embedding=[0.1] * 512  # Wrong dimension
        )

Service Integration Tests (Mock Boundaries)

@pytest.mark.asyncio
async def test_news_service_with_sentiment_analysis(mock_openrouter_client, mock_repository, mock_google_client, mock_scraper_client):
    """Test NewsService integration with mocked LLM client"""
    
    # Mock successful sentiment analysis
    mock_sentiment = SentimentScore(
        sentiment="positive",
        confidence=0.9,
        reasoning="Optimistic financial outlook"
    )
    mock_openrouter_client.analyze_sentiment.return_value = mock_sentiment
    
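    # Mock embeddings: the service calls generate_embeddings twice, first for
    # titles and then for contents (this assumes a single collected article)
    mock_openrouter_client.generate_embeddings.side_effect = [
        [[0.1] * 1536],  # title embeddings
        [[0.2] * 1536],  # content embeddings
    ]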
    
    service = NewsService(
        repository=mock_repository,
        google_client=mock_google_client,
        scraper_client=mock_scraper_client,
        openrouter_client=mock_openrouter_client
    )
    
    articles = await service.update_company_news("AAPL", include_sentiment=True)
    
    # Verify LLM integration
    assert len(articles) > 0
    assert articles[0].sentiment_score == mock_sentiment
    assert articles[0].title_embedding == [0.1] * 1536
    assert mock_openrouter_client.analyze_sentiment.called
    assert mock_openrouter_client.generate_embeddings.called

Repository Integration Tests (Real Database)

@pytest.mark.asyncio 
async def test_repository_vector_similarity_search(test_db):
    """Test vector similarity search with real pgvectorscale"""
    
    repository = NewsRepository(test_db)
    
    # Insert articles with embeddings
    article1 = NewsArticle(
        headline="Apple reports strong iPhone sales",
        url="https://example.com/1",
        source="TechNews",
        published_date=datetime.datetime.now(),
        entities=["AAPL"],
        title_embedding=[0.1, 0.2] + [0.0] * 1534  # Similar to query
    )
    
    article2 = NewsArticle(
        headline="Microsoft launches new Azure features", 
        url="https://example.com/2",
        source="CloudNews",
        published_date=datetime.datetime.now(),
        entities=["MSFT"],
        title_embedding=[0.9, 0.8] + [0.0] * 1534  # Different from query
    )
    
    await repository.upsert_batch([article1, article2])
    
    # Query with similar embedding
    query_embedding = [0.15, 0.25] + [0.0] * 1534
    similar_articles = await repository.find_similar_articles(
        query_embedding, symbol="AAPL", limit=1
    )
    
    assert len(similar_articles) == 1
    assert similar_articles[0].headline == "Apple reports strong iPhone sales"

API Integration Tests (pytest-vcr)

@pytest.mark.vcr
@pytest.mark.asyncio
async def test_openrouter_sentiment_analysis():
    """Test real OpenRouter API calls with VCR cassettes"""
    
    config = TradingAgentsConfig.from_env()
    client = OpenRouterClient(config)
    
    test_text = "Apple's quarterly earnings exceeded expectations with strong iPhone sales."
    
    sentiment = await client.analyze_sentiment(test_text)
    
    assert isinstance(sentiment, SentimentScore)
    assert sentiment.sentiment in ["positive", "negative", "neutral"]
    assert 0.0 <= sentiment.confidence <= 1.0
    assert len(sentiment.reasoning) > 0

@pytest.mark.vcr
@pytest.mark.asyncio
async def test_openrouter_embeddings_generation():
    """Test real OpenRouter embeddings API with VCR"""
    
    config = TradingAgentsConfig.from_env()
    client = OpenRouterClient(config)
    
    texts = ["Apple stock rises", "Market volatility increases"]
    
    embeddings = await client.generate_embeddings(texts)
    
    assert len(embeddings) == 2
    assert all(len(emb) == 1536 for emb in embeddings)
    assert all(isinstance(val, float) for emb in embeddings for val in emb)

Coverage Requirements

Maintain existing >85% coverage with new components:

Entity Layer: 95% coverage (comprehensive validation testing)
Service Layer: 90% coverage (mock external dependencies)
Repository Layer: 85% coverage (real database integration tests)
Client Layer: 80% coverage (pytest-vcr for API calls)
Integration Tests: End-to-end scenarios covering complete workflows

Performance Testing

@pytest.mark.performance
@pytest.mark.asyncio
async def test_vector_similarity_performance(test_db):
    """Ensure vector similarity queries perform under 100ms"""
    
    repository = NewsRepository(test_db)
    
    # Insert 1000 articles with embeddings
    articles = [create_test_article_with_embedding() for _ in range(1000)]
    await repository.upsert_batch(articles)
    
    query_embedding = [random.random() for _ in range(1536)]
    
    start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    results = await repository.find_similar_articles(query_embedding, limit=10)
    duration = time.perf_counter() - start_time
    
    assert duration < 0.1  # Under 100ms
    assert len(results) == 10
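
A single timed call can be noisy, since the first query often pays for connection setup and cold caches. One way to make the 100ms target more meaningful is to take several samples after a warm-up and assert on a percentile. The sketch below shows that pattern with standard-library tools only; `measure_latencies` and `p95` are hypothetical helpers, and the run and warm-up counts are illustrative defaults.

```python
import asyncio
import math
import time
from typing import Awaitable, Callable, List


async def measure_latencies(
    query: Callable[[], Awaitable[object]],
    runs: int = 20,
    warmup: int = 3,
) -> List[float]:
    """Time `runs` executions of `query` in seconds, after `warmup` untimed calls."""
    for _ in range(warmup):
        await query()
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        await query()
        samples.append(time.perf_counter() - start)
    return samples


def p95(samples: List[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]
```

In the performance test, the query callable would wrap `repository.find_similar_articles(...)`, and the assertion would become a check that `p95(samples)` stays under 0.1 seconds.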

Integration Points

News Analyst AgentToolkit Integration

The completed News domain plugs into the existing News Analyst agents through the toolkit, which chooses between semantic and time-based search:

class NewsAnalystToolkit:
    """Enhanced toolkit with semantic search capabilities"""
    
    def __init__(self, news_service: NewsService):
        self.news_service = news_service
    
    async def get_relevant_news(self, 
                              ticker: str, 
                              query: Optional[str] = None,
                              days_back: int = 30) -> List[Dict[str, Any]]:
        """Get news with optional semantic search"""
        
        if query:
            # Use semantic similarity search
            articles = await self.news_service.find_similar_articles(
                query_text=query,
                symbol=ticker,
                limit=20
            )
        else:
            # Use time-based search (existing)
            articles = await self.news_service.find_recent_news(
                symbol=ticker,
                days_back=days_back
            )
        
        return [
            {
                "headline": article.headline,
                "summary": article.summary,
                "published_date": article.published_date.isoformat(),
                "sentiment": article.sentiment_score.sentiment if article.sentiment_score else "unknown",
                "confidence": article.sentiment_score.confidence if article.sentiment_score else 0.0,
                "source": article.source,
                "url": article.url
            }
            for article in articles
        ]

Configuration Integration

Seamless integration with existing TradingAgentsConfig:

import os

# Enhanced configuration for news domain completion
config = TradingAgentsConfig(
    # Existing LLM configuration
    llm_provider="openrouter",
    openrouter_api_key=os.getenv("OPENROUTER_API_KEY"),
    quick_think_llm="anthropic/claude-3.5-haiku",  # For sentiment analysis
    
    # New news-specific settings
    news_collection_enabled=True,
    news_schedule_hour=6,  # UTC
    news_sentiment_enabled=True,
    news_embeddings_enabled=True,
    news_max_articles_per_ticker=20,
    
    # Database (existing)
    database_url=os.getenv("DATABASE_URL"),
)

# Job configuration
news_job_config = NewsJobConfig(
    tickers=["AAPL", "GOOGL", "MSFT", "TSLA", "NVDA"],
    schedule_hour=6,  # 6 AM UTC daily collection
    sentiment_model=config.quick_think_llm,
    embedding_model="text-embedding-3-large",
    max_articles_per_ticker=20
)

This design completes the final 5% of the News domain on top of the existing 95% infrastructure. It keeps the architecture consistent while adding the scheduled execution, LLM-powered sentiment analysis, and vector embeddings that advanced News Analyst capabilities depend on.
