News Domain Technical Design
Overview
This document details the technical design for completing the final 5% of the News domain implementation. The existing infrastructure is 95% complete with Google News collection, article scraping, and basic storage implemented. The remaining work focuses on Dagster-orchestrated scheduled execution, LLM-powered sentiment analysis, and vector embeddings using OpenRouter as the unified LLM provider.
Architecture Overview
Component Relationships
graph TD
A[Dagster Scheduler] --> B[Dagster Job: news_collection_daily]
B --> C[Dagster Op: collect_news_for_symbol]
C --> D[NewsService]
D --> E[GoogleNewsClient]
D --> F[ArticleScraperClient]
D --> G[OpenRouter Sentiment Client]
D --> H[OpenRouter Embeddings Client]
D --> I[NewsRepository]
I --> J[PostgreSQL + TimescaleDB + pgvectorscale]
K[News Analysts] --> L[AgentToolkit]
L --> D
L --> I
Data Flow Architecture
- Scheduled Collection Flow (Dagster): Dagster Schedule → Dagster Job → Dagster Op (per symbol) → NewsService.update_company_news() → GoogleNewsClient (RSS) → ArticleScraperClient (content) → OpenRouter (sentiment + embeddings) → NewsRepository.upsert_batch() → PostgreSQL
- Agent Query Flow (RAG): News Analyst → AgentToolkit → NewsService.find_similar_news() → NewsRepository.find_similar_articles() → pgvectorscale vector similarity (cosine distance) → Return ranked results with sentiment
Key Design Principles
- Leverage Existing 95%: Build on proven GoogleNewsClient and ArticleScraperClient infrastructure
- OpenRouter Unified: Single API for both sentiment analysis and embeddings
- Best-Effort Processing: LLM failures don't block article storage
- Vector-Enhanced Search: Semantic similarity for News Analysts via RAG
- Dagster Orchestration: Fault-tolerant scheduling with built-in monitoring and alerting
- Layered Architecture: Entity → Repository → Service → Dagster Op → Dagster Job
Domain Model
Enhanced NewsArticle Dataclass
The existing NewsArticle dataclass requires enhancements for LLM sentiment and vector support:
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, List
# NewsArticleEntity (the SQLAlchemy model) is defined below; in code it is imported from the news entity module
@dataclass
class NewsArticle:
"""Represents a news article with sentiment and embeddings."""
# Existing fields (95% complete)
headline: str
url: str # Unique identifier for deduplication
source: str # "Google News", "Finnhub", etc.
published_date: date
# Optional existing fields
summary: Optional[str] = None
entities: List[str] = field(default_factory=list)
author: Optional[str] = None
category: Optional[str] = None
# Enhanced fields (final 5% - LLM sentiment)
sentiment_score: Optional[float] = None # -1.0 to 1.0
sentiment_confidence: Optional[float] = None # 0.0 to 1.0
sentiment_label: Optional[str] = None # "positive", "negative", "neutral"
# Enhanced fields (final 5% - vector embeddings)
title_embedding: Optional[List[float]] = None # 1536 dimensions
content_embedding: Optional[List[float]] = None # 1536 dimensions
    def to_entity(self, symbol: Optional[str] = None) -> "NewsArticleEntity":
"""Convert NewsArticle dataclass to NewsArticleEntity SQLAlchemy model."""
return NewsArticleEntity(
headline=self.headline,
url=self.url,
source=self.source,
published_date=self.published_date,
summary=self.summary,
entities=self.entities if self.entities else None,
sentiment_score=self.sentiment_score,
sentiment_confidence=self.sentiment_confidence,
sentiment_label=self.sentiment_label,
author=self.author,
category=self.category,
symbol=symbol,
title_embedding=self.title_embedding,
content_embedding=self.content_embedding,
)
@staticmethod
    def from_entity(entity: "NewsArticleEntity") -> "NewsArticle":
"""Convert NewsArticleEntity SQLAlchemy model to NewsArticle dataclass."""
return NewsArticle(
headline=entity.headline,
url=entity.url,
source=entity.source,
published_date=entity.published_date,
summary=entity.summary,
entities=entity.entities or [],
sentiment_score=entity.sentiment_score,
sentiment_confidence=entity.sentiment_confidence,
sentiment_label=entity.sentiment_label,
author=entity.author,
category=entity.category,
title_embedding=entity.title_embedding,
content_embedding=entity.content_embedding,
)
def has_reliable_sentiment(self) -> bool:
"""Check if sentiment analysis is reliable (confidence >= 0.6)."""
return bool(
self.sentiment_score is not None
and self.sentiment_confidence is not None
and self.sentiment_confidence >= 0.6
)
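To make the round-trip contract concrete, here is a minimal, hypothetical usage sketch (all values are invented) showing construction, the confidence gate, and dataclass ↔ entity conversion:

from datetime import date

article = NewsArticle(
    headline="Acme beats Q3 earnings estimates",
    url="https://example.com/acme-q3",
    source="Google News",
    published_date=date(2025, 1, 10),
    summary="Acme reported 12% revenue growth...",
    sentiment_score=0.62,
    sentiment_confidence=0.85,
    sentiment_label="positive",
)

assert article.has_reliable_sentiment()  # confidence 0.85 >= 0.6

# Round-trip through the SQLAlchemy entity; `symbol` lives only on the entity.
entity = article.to_entity(symbol="ACME")
restored = NewsArticle.from_entity(entity)
assert restored.url == article.url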
NewsArticleEntity SQLAlchemy Model
The existing SQLAlchemy model already has vector embedding columns. We need to add sentiment fields:
from typing import Optional, List
from sqlalchemy import Float, String, Text, DateTime, Date, JSON, Index, func
from sqlalchemy.dialects.postgresql import UUID as PG_UUID
from sqlalchemy.orm import Mapped, mapped_column
from pgvector.sqlalchemy import Vector
import uuid
from datetime import datetime, date
# Base (declarative base) and uuid7 come from the project's shared database module
class NewsArticleEntity(Base):
"""SQLAlchemy model for news articles with vector embedding support."""
__tablename__ = "news_articles"
__table_args__ = (
Index("idx_symbol_date", "symbol", "published_date"),
Index("idx_published_date", "published_date"),
Index("idx_url_unique", "url", unique=True),
        # Vector index for similarity search; IVFFlat requires an explicit operator class
        Index(
            "idx_title_embedding_vector",
            "title_embedding",
            postgresql_using="ivfflat",
            postgresql_ops={"title_embedding": "vector_cosine_ops"},
        ),
)
# Primary key
id: Mapped[uuid.UUID] = mapped_column(PG_UUID(as_uuid=True), primary_key=True, default=uuid7)
# Core article fields
headline: Mapped[str] = mapped_column(Text, nullable=False)
url: Mapped[str] = mapped_column(Text, nullable=False, unique=True)
source: Mapped[str] = mapped_column(String(100), nullable=False)
published_date: Mapped[date] = mapped_column(Date, nullable=False, index=True)
# Optional fields
summary: Mapped[Optional[str]] = mapped_column(Text, nullable=True)
entities: Mapped[Optional[List[str]]] = mapped_column(JSON, nullable=True)
author: Mapped[Optional[str]] = mapped_column(String(255), nullable=True)
category: Mapped[Optional[str]] = mapped_column(String(100), nullable=True)
symbol: Mapped[Optional[str]] = mapped_column(String(20), index=True, nullable=True)
# LLM sentiment fields (NEW)
sentiment_score: Mapped[Optional[float]] = mapped_column(Float, nullable=True)
sentiment_confidence: Mapped[Optional[float]] = mapped_column(Float, nullable=True)
sentiment_label: Mapped[Optional[str]] = mapped_column(String(20), nullable=True)
# Vector embeddings (EXISTING - already in 95% complete infrastructure)
title_embedding: Mapped[Optional[List[float]]] = mapped_column(Vector(1536), nullable=True)
content_embedding: Mapped[Optional[List[float]]] = mapped_column(Vector(1536), nullable=True)
# Audit timestamps
created_at: Mapped[datetime] = mapped_column(DateTime, server_default=func.now())
updated_at: Mapped[datetime] = mapped_column(DateTime, server_default=func.now(), onupdate=func.now())
Data Access Layer
NewsRepository Enhancements
Add RAG-powered vector similarity search methods to the existing repository:
from typing import List, Optional, Tuple
import uuid

from sqlalchemy import select, update, bindparam, func

class NewsRepository:
"""Repository for news articles with vector similarity search."""
def __init__(self, database_manager: DatabaseManager):
self.db_manager = database_manager
# ... existing methods (list, get, upsert, delete, list_by_date_range, upsert_batch) ...
async def find_similar_articles(
self,
embedding: List[float],
limit: int = 10,
threshold: float = 0.7,
symbol: Optional[str] = None
) -> List[NewsArticle]:
"""
Find articles similar to given embedding using pgvectorscale cosine distance.
Args:
embedding: Query embedding vector (1536 dimensions)
limit: Maximum number of results to return
threshold: Minimum similarity score (0.0-1.0)
symbol: Optional symbol filter
Returns:
List of NewsArticle objects ranked by similarity
"""
async with self.db_manager.get_session() as session:
# Cosine similarity: 1 - cosine_distance
# pgvectorscale operator: <=> for cosine distance
query = select(
NewsArticleEntity,
(1 - NewsArticleEntity.title_embedding.cosine_distance(embedding)).label('similarity')
).filter(
NewsArticleEntity.title_embedding.is_not(None)
)
# Optional symbol filter
if symbol:
query = query.filter(NewsArticleEntity.symbol == symbol)
# Filter by similarity threshold and order by similarity desc
query = query.filter(
(1 - NewsArticleEntity.title_embedding.cosine_distance(embedding)) >= threshold
).order_by(
NewsArticleEntity.title_embedding.cosine_distance(embedding)
).limit(limit)
result = await session.execute(query)
rows = result.all()
# Convert to NewsArticle dataclass
articles = [NewsArticle.from_entity(row[0]) for row in rows]
logger.info(f"Found {len(articles)} similar articles (threshold={threshold})")
return articles
async def batch_update_embeddings(
self,
article_embeddings: List[Tuple[uuid.UUID, List[float], List[float]]]
) -> int:
"""
Efficiently batch update embeddings for multiple articles.
Args:
article_embeddings: List of (article_id, title_embedding, content_embedding) tuples
Returns:
Number of articles updated
"""
if not article_embeddings:
return 0
async with self.db_manager.get_session() as session:
# Use bulk update with PostgreSQL
stmt = update(NewsArticleEntity).where(
NewsArticleEntity.id == bindparam('article_id')
).values(
title_embedding=bindparam('title_emb'),
content_embedding=bindparam('content_emb'),
updated_at=func.now()
)
# Prepare batch data
batch_data = [
{
'article_id': article_id,
'title_emb': title_emb,
'content_emb': content_emb
}
for article_id, title_emb, content_emb in article_embeddings
]
await session.execute(stmt, batch_data)
logger.info(f"Batch updated embeddings for {len(article_embeddings)} articles")
return len(article_embeddings)
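As a usage sketch for the methods above (the db_manager and embeddings_client wiring is project-specific and assumed here), a RAG lookup embeds the query text and then ranks stored articles by cosine similarity:

async def rag_demo(db_manager, embeddings_client) -> None:
    """Hypothetical call site: semantic search over stored news."""
    repo = NewsRepository(db_manager)
    query_embedding = (await embeddings_client.generate_embeddings(["chip demand outlook"]))[0]
    similar = await repo.find_similar_articles(
        embedding=query_embedding,
        limit=10,
        threshold=0.7,  # keep only matches with similarity >= 0.7
        symbol="NVDA",  # optional per-ticker filter
    )
    for article in similar:
        print(article.headline, article.sentiment_label)

batch_update_embeddings follows the same pattern and is intended for backfilling articles that were stored before enrichment existed.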
Service Layer
OpenRouter LLM Clients
Sentiment Analysis Client
from dataclasses import dataclass
from typing import Optional, Dict, Any
import aiohttp
import asyncio
import json
from tradingagents.config import TradingAgentsConfig
@dataclass
class SentimentResult:
"""Result from sentiment analysis."""
score: float # -1.0 to 1.0
confidence: float # 0.0 to 1.0
label: str # "positive", "negative", "neutral"
reasoning: str
class OpenRouterSentimentClient:
"""Client for sentiment analysis via OpenRouter."""
def __init__(self, config: TradingAgentsConfig):
self.api_key = config.openrouter_api_key
self.model = config.quick_think_llm # claude-3.5-haiku
self.base_url = "https://openrouter.ai/api/v1/chat/completions"
async def analyze_sentiment(
self,
title: str,
content: str
) -> SentimentResult:
"""
Analyze sentiment of news article using OpenRouter LLM.
Args:
title: Article headline
content: Article content/summary
Returns:
SentimentResult with score, confidence, label, and reasoning
"""
try:
prompt = self._build_sentiment_prompt(title, content)
response = await self._call_openrouter(prompt)
return self._parse_sentiment_response(response)
except Exception as e:
logger.warning(f"OpenRouter sentiment analysis failed: {e}, using keyword fallback")
return self._fallback_sentiment(title, content)
def _build_sentiment_prompt(self, title: str, content: str) -> str:
"""Build structured prompt for sentiment analysis."""
return f"""Analyze the financial sentiment of this news article.
Title: {title}
Content: {content[:1000]}...
Provide sentiment analysis as JSON:
{{
"score": <float between -1.0 (very negative) and 1.0 (very positive)>,
"confidence": <float between 0.0 and 1.0>,
"label": "<positive|negative|neutral>",
"reasoning": "<brief 1-2 sentence explanation>"
}}
Focus on financial market implications."""
async def _call_openrouter(self, prompt: str) -> Dict[str, Any]:
"""Call OpenRouter API with retry logic."""
headers = {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json"
}
payload = {
"model": self.model,
"messages": [{"role": "user", "content": prompt}],
"response_format": {"type": "json_object"}
}
async with aiohttp.ClientSession() as session:
for attempt in range(3): # Retry up to 3 times
try:
async with session.post(
self.base_url,
headers=headers,
json=payload,
timeout=aiohttp.ClientTimeout(total=30)
) as response:
response.raise_for_status()
data = await response.json()
return json.loads(data['choices'][0]['message']['content'])
except (aiohttp.ClientError, asyncio.TimeoutError) as e:
if attempt == 2: # Last attempt
raise
await asyncio.sleep(2 ** attempt) # Exponential backoff
def _parse_sentiment_response(self, response: Dict[str, Any]) -> SentimentResult:
"""Parse OpenRouter JSON response into SentimentResult."""
return SentimentResult(
score=float(response['score']),
confidence=float(response['confidence']),
label=response['label'],
reasoning=response.get('reasoning', '')
)
def _fallback_sentiment(self, title: str, content: str) -> SentimentResult:
"""Keyword-based fallback sentiment analysis."""
text = f"{title} {content}".lower()
positive_keywords = ['gain', 'up', 'rise', 'growth', 'profit', 'beat', 'success']
negative_keywords = ['loss', 'down', 'fall', 'decline', 'miss', 'failure', 'concern']
pos_count = sum(1 for keyword in positive_keywords if keyword in text)
neg_count = sum(1 for keyword in negative_keywords if keyword in text)
if pos_count > neg_count:
return SentimentResult(score=0.3, confidence=0.5, label="positive", reasoning="Keyword-based fallback")
elif neg_count > pos_count:
return SentimentResult(score=-0.3, confidence=0.5, label="negative", reasoning="Keyword-based fallback")
else:
return SentimentResult(score=0.0, confidence=0.5, label="neutral", reasoning="Keyword-based fallback")
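A short, hypothetical call site for the client above; from_env() mirrors the config loading used in the Dagster op later in this document:

import asyncio

async def main() -> None:
    config = TradingAgentsConfig.from_env()
    client = OpenRouterSentimentClient(config)
    result = await client.analyze_sentiment(
        title="Acme shares slide after guidance cut",
        content="Acme lowered its full-year revenue outlook, citing weak demand...",
    )
    # On API failure this silently degrades to the keyword fallback above.
    print(result.label, result.score, result.confidence)

asyncio.run(main())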
Embeddings Client
from typing import List, Tuple

class OpenRouterEmbeddingsClient:
"""Client for generating embeddings via OpenRouter."""
def __init__(self, config: TradingAgentsConfig):
self.api_key = config.openrouter_api_key
self.model = "openai/text-embedding-ada-002" # Via OpenRouter
self.base_url = "https://openrouter.ai/api/v1/embeddings"
async def generate_embeddings(self, texts: List[str]) -> List[List[float]]:
"""
Generate embeddings for multiple texts.
Args:
texts: List of text strings to embed
Returns:
List of 1536-dimensional embedding vectors
"""
if not texts:
return []
try:
# Preprocess texts
processed_texts = [self._preprocess_text(text) for text in texts]
headers = {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json"
}
payload = {
"model": self.model,
"input": processed_texts
}
async with aiohttp.ClientSession() as session:
async with session.post(
self.base_url,
headers=headers,
json=payload,
timeout=aiohttp.ClientTimeout(total=60)
) as response:
response.raise_for_status()
data = await response.json()
# Extract embeddings
embeddings = [item['embedding'] for item in data['data']]
# Validate dimensions
for i, emb in enumerate(embeddings):
if len(emb) != 1536:
raise ValueError(f"Invalid embedding dimension at index {i}: {len(emb)}")
return embeddings
except Exception as e:
logger.error(f"Embeddings generation failed: {e}, using zero vectors")
# Return zero vectors as fallback
return [[0.0] * 1536 for _ in texts]
async def generate_article_embeddings(
self,
article: NewsArticle
) -> Tuple[List[float], List[float]]:
"""
Generate embeddings for article title and content.
Args:
article: NewsArticle to generate embeddings for
Returns:
Tuple of (title_embedding, content_embedding)
"""
texts = []
if article.headline:
texts.append(article.headline)
if article.summary:
# Combine title and summary for comprehensive content embedding
combined = f"{article.headline} {article.summary}"
texts.append(combined)
if not texts:
return [0.0] * 1536, [0.0] * 1536
embeddings = await self.generate_embeddings(texts)
title_embedding = embeddings[0] if len(embeddings) > 0 else [0.0] * 1536
content_embedding = embeddings[1] if len(embeddings) > 1 else [0.0] * 1536
return title_embedding, content_embedding
def _preprocess_text(self, text: str) -> str:
"""Preprocess text for optimal embedding generation."""
# Remove extra whitespace
cleaned = " ".join(text.split())
# Limit to 8000 characters (OpenAI embedding limit)
return cleaned[:8000]
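To make the similarity threshold used elsewhere concrete: the repository computes 1 - cosine_distance. A small self-contained check, independent of any API (toy 3-dimensional vectors stand in for the real 1536-dimensional embeddings):

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Plain-Python cosine similarity, matching 1 - cosine_distance in the repository query."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cosine_similarity([1.0, 0.1, 0.0], [0.9, 0.2, 0.0]))  # ~0.99 -> passes threshold 0.7
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0  -> filtered out

One consequence worth noting: the zero-vector fallback stored on embedding failure has no meaningful cosine similarity, so such articles will not surface in RAG results until their embeddings are backfilled.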
Enhanced NewsService
Integrate LLM clients into the existing NewsService:
class NewsService:
"""Service for news data, sentiment analysis, and vector embeddings."""
def __init__(
self,
google_client: GoogleNewsClient,
repository: NewsRepository,
article_scraper: ArticleScraperClient,
sentiment_client: OpenRouterSentimentClient,
embeddings_client: OpenRouterEmbeddingsClient,
):
self.google_client = google_client
self.repository = repository
self.article_scraper = article_scraper
self.sentiment_client = sentiment_client
self.embeddings_client = embeddings_client
async def update_company_news(self, symbol: str) -> NewsUpdateResult:
"""
Update company news with full LLM enrichment pipeline.
Flow:
1. Fetch RSS feed from Google News
2. Scrape article content
3. Generate LLM sentiment analysis
4. Generate vector embeddings
5. Store in PostgreSQL with embeddings
Args:
symbol: Stock ticker symbol
Returns:
NewsUpdateResult with statistics
"""
try:
logger.info(f"Updating company news for {symbol}")
# 1. Get RSS feed data
google_articles = self.google_client.get_company_news(symbol)
if not google_articles:
logger.warning(f"No articles found for {symbol}")
return NewsUpdateResult(
status="completed",
articles_found=0,
articles_scraped=0,
articles_failed=0,
symbol=symbol,
)
# 2. Scrape article content
scraped_articles = await self._scrape_articles(google_articles)
# 3. Enrich with LLM sentiment and embeddings
enriched_articles = await self._enrich_articles(scraped_articles)
# 4. Store in repository
stored_articles = await self.repository.upsert_batch(enriched_articles, symbol)
logger.info(f"Completed news update for {symbol}: {len(stored_articles)} articles stored")
return NewsUpdateResult(
status="completed",
articles_found=len(google_articles),
articles_scraped=len(scraped_articles),
articles_failed=len(google_articles) - len(scraped_articles),
symbol=symbol,
)
except Exception as e:
logger.error(f"Error updating company news for {symbol}: {e}")
raise
async def _scrape_articles(
self,
google_articles: List[GoogleNewsArticle]
) -> List[NewsArticle]:
"""Scrape content for Google News RSS articles."""
scraped = []
for article in google_articles:
if not article.link:
continue
scrape_result = self.article_scraper.scrape_article(article.link)
if scrape_result.status in ["SUCCESS", "ARCHIVE_SUCCESS"]:
news_article = NewsArticle(
headline=scrape_result.title or article.title,
url=article.link,
source=article.source,
published_date=date.fromisoformat(
scrape_result.publish_date or article.published.strftime("%Y-%m-%d")
),
summary=scrape_result.content,
author=scrape_result.author,
)
scraped.append(news_article)
return scraped
async def _enrich_articles(
self,
articles: List[NewsArticle]
) -> List[NewsArticle]:
"""Enrich articles with LLM sentiment and vector embeddings."""
enriched = []
for article in articles:
try:
# Generate sentiment
sentiment_result = await self.sentiment_client.analyze_sentiment(
article.headline,
article.summary or ""
)
article.sentiment_score = sentiment_result.score
article.sentiment_confidence = sentiment_result.confidence
article.sentiment_label = sentiment_result.label
# Generate embeddings
title_emb, content_emb = await self.embeddings_client.generate_article_embeddings(article)
article.title_embedding = title_emb
article.content_embedding = content_emb
enriched.append(article)
except Exception as e:
logger.warning(f"Failed to enrich article {article.url}: {e}, storing without enrichment")
enriched.append(article)
return enriched
async def find_similar_news(
self,
query_text: str,
symbol: Optional[str] = None,
limit: int = 5
) -> List[NewsArticle]:
"""
Find news articles similar to query text using RAG vector search.
Args:
query_text: Text to search for similar articles
symbol: Optional symbol filter
limit: Maximum number of results
Returns:
List of similar NewsArticle objects
"""
# Generate embedding for query text
query_embeddings = await self.embeddings_client.generate_embeddings([query_text])
query_embedding = query_embeddings[0]
# Search for similar articles
similar_articles = await self.repository.find_similar_articles(
embedding=query_embedding,
limit=limit,
threshold=0.7,
symbol=symbol
)
return similar_articles
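From the agent side, find_similar_news() is the RAG entry point. A hypothetical AgentToolkit helper (names invented) might combine it with the confidence gate on the dataclass:

async def research_topic(news_service: NewsService, topic: str, symbol: str) -> str:
    """Hypothetical toolkit helper: summarize sentiment of articles similar to a topic."""
    articles = await news_service.find_similar_news(topic, symbol=symbol, limit=5)
    reliable = [a for a in articles if a.has_reliable_sentiment()]
    if not reliable:
        return f"No high-confidence news found for {symbol} on '{topic}'."
    avg = sum(a.sentiment_score for a in reliable) / len(reliable)
    return f"{len(reliable)} relevant articles for {symbol}; average sentiment {avg:+.2f}."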
Dagster Orchestration Layer
Directory Structure
tradingagents/data/
├── __init__.py
├── jobs/
│ ├── __init__.py
│ └── news_collection.py
├── ops/
│ ├── __init__.py
│ └── news_ops.py
├── schedules/
│ ├── __init__.py
│ └── news_schedules.py
└── sensors/
├── __init__.py
└── news_sensors.py
Dagster Ops (Operations)
# tradingagents/data/ops/news_ops.py
import asyncio
from dagster import op, OpExecutionContext, Out
from tradingagents.domains.news.news_service import NewsService
from tradingagents.config import TradingAgentsConfig
from tradingagents.lib.database import DatabaseManager
@op(
required_resource_keys={"database_manager"},
out=Out(dict),
tags={"kind": "news", "domain": "news"},
)
def collect_news_for_symbol(context: OpExecutionContext, symbol: str) -> dict:
"""
Collect and process news for a single stock symbol.
Args:
symbol: Stock ticker symbol
Returns:
Dictionary with collection statistics
"""
context.log.info(f"Starting news collection for {symbol}")
try:
# Build NewsService with dependencies
config = TradingAgentsConfig.from_env()
db_manager = context.resources.database_manager
news_service = NewsService.build(db_manager, config)
# Execute news update
        # Dagster ops are synchronous; drive the async service pipeline to completion
        result = asyncio.run(news_service.update_company_news(symbol))
context.log.info(
f"Completed news collection for {symbol}: "
f"{result.articles_found} found, {result.articles_scraped} scraped"
)
return {
"symbol": symbol,
"articles_found": result.articles_found,
"articles_scraped": result.articles_scraped,
"articles_failed": result.articles_failed,
"status": result.status,
}
except Exception as e:
context.log.error(f"News collection failed for {symbol}: {e}")
raise
Dagster Jobs
# tradingagents/data/jobs/news_collection.py
from typing import Generator
from dagster import job, op, DynamicOut, DynamicOutput, Field, OpExecutionContext
from tradingagents.data.ops.news_ops import collect_news_for_symbol
@op(
    out=DynamicOut(),
    config_schema={"symbols": Field([str], default_value=["AAPL", "GOOGL", "MSFT", "TSLA"])},
)
def get_symbols_to_collect(context: OpExecutionContext) -> Generator[DynamicOutput, None, None]:
"""
Get list of symbols to collect news for.
Yields:
DynamicOutput for each symbol
"""
    # Symbols come from run config (e.g. the schedule below) or fall back to the schema default
    symbols = context.op_config["symbols"]
context.log.info(f"Collecting news for {len(symbols)} symbols: {symbols}")
for symbol in symbols:
yield DynamicOutput(symbol, mapping_key=symbol)
@job(
tags={"dagster/priority": "high", "domain": "news"},
)
def news_collection_daily():
"""
Daily news collection job for all configured symbols.
Workflow:
1. Get symbols to collect
2. Fan out: collect news for each symbol in parallel
3. Aggregate results
"""
get_symbols_to_collect().map(collect_news_for_symbol)
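For local verification before wiring the schedule, the job can be run in-process with explicit config (execute_in_process is standard Dagster; the database_manager resource instance is assumed):

result = news_collection_daily.execute_in_process(
    run_config={
        "ops": {
            "get_symbols_to_collect": {
                "config": {"symbols": ["AAPL", "MSFT"]}
            }
        }
    },
    resources={"database_manager": database_manager},  # hypothetical configured resource
)
assert result.success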
Dagster Schedules
# tradingagents/data/schedules/news_schedules.py
from dagster import schedule, ScheduleEvaluationContext, RunRequest
from tradingagents.data.jobs.news_collection import news_collection_daily
@schedule(
job=news_collection_daily,
cron_schedule="0 6 * * *", # Daily at 6 AM UTC
execution_timezone="UTC",
)
def news_collection_daily_schedule(context: ScheduleEvaluationContext):
"""
Schedule for daily news collection at 6 AM UTC.
Returns:
RunRequest with job configuration
"""
return RunRequest(
run_key=f"news_collection_{context.scheduled_execution_time.isoformat()}",
run_config={
"ops": {
"get_symbols_to_collect": {
"config": {
"symbols": ["AAPL", "GOOGL", "MSFT", "TSLA", "AMZN", "META", "NVDA"]
}
}
}
},
tags={
"scheduled_time": context.scheduled_execution_time.isoformat(),
"job_type": "news_collection",
},
)
Dagster Sensors (Failure Alerting)
# tradingagents/data/sensors/news_sensors.py
from dagster import run_failure_sensor, RunFailureSensorContext
from tradingagents.data.jobs.news_collection import news_collection_daily
@run_failure_sensor(
name="news_collection_failure_sensor",
monitored_jobs=[news_collection_daily],
)
def news_collection_failure_alert(context: RunFailureSensorContext):
"""
Alert when news collection job fails.
This could send notifications via Slack, PagerDuty, email, etc.
"""
context.log.error(
f"News collection job failed!\n"
f"Run ID: {context.dagster_run.run_id}\n"
f"Failure info: {context.failure_event.event_specific_data}"
)
# TODO: Implement alerting (Slack, PagerDuty, email)
# send_slack_alert(...)
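Dagster only discovers the job, schedule, and sensor once they are collected into a Definitions object. A sketch of that wiring (the module path and database_manager_resource constructor are assumptions):

# tradingagents/data/definitions.py (hypothetical module)
from dagster import Definitions
from tradingagents.data.jobs.news_collection import news_collection_daily
from tradingagents.data.schedules.news_schedules import news_collection_daily_schedule
from tradingagents.data.sensors.news_sensors import news_collection_failure_alert

defs = Definitions(
    jobs=[news_collection_daily],
    schedules=[news_collection_daily_schedule],
    sensors=[news_collection_failure_alert],
    resources={
        "database_manager": database_manager_resource,  # assumed wrapper around DatabaseManager
    },
)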
Database Schema Changes
Migration Script (Alembic)
# alembic/versions/20250111_add_sentiment_fields.py
"""Add sentiment fields to news_articles
Revision ID: add_sentiment_fields
Revises: previous_revision
Create Date: 2025-01-11
"""
from alembic import op
import sqlalchemy as sa
# revision identifiers
revision = 'add_sentiment_fields'
down_revision = 'previous_revision'
branch_labels = None
depends_on = None
def upgrade():
# Add sentiment analysis fields
op.add_column('news_articles', sa.Column('sentiment_confidence', sa.Float(), nullable=True))
op.add_column('news_articles', sa.Column('sentiment_label', sa.String(20), nullable=True))
# Vector columns already exist from 95% complete infrastructure:
# - title_embedding vector(1536)
# - content_embedding vector(1536)
# - sentiment_score float
# Add index on sentiment_label for filtering
op.create_index('idx_news_sentiment_label', 'news_articles', ['sentiment_label'])
def downgrade():
op.drop_index('idx_news_sentiment_label', table_name='news_articles')
op.drop_column('news_articles', 'sentiment_label')
op.drop_column('news_articles', 'sentiment_confidence')
Testing Strategy
Unit Tests (Mock Boundaries)
# tests/domains/news/test_news_service_llm.py
import pytest
from unittest.mock import AsyncMock
from tradingagents.domains.news.news_service import NewsService
from tradingagents.domains.news.openrouter_sentiment_client import SentimentResult
@pytest.fixture
def mock_sentiment_client():
return AsyncMock()
@pytest.fixture
def mock_embeddings_client():
return AsyncMock()
@pytest.mark.asyncio
async def test_enrich_articles_handles_llm_failures_gracefully(
mock_sentiment_client,
mock_embeddings_client
):
"""Test that LLM failures don't block article storage."""
# Mock sentiment failure
mock_sentiment_client.analyze_sentiment.side_effect = Exception("API Error")
# Mock embeddings success
mock_embeddings_client.generate_article_embeddings.return_value = (
[0.1] * 1536, [0.2] * 1536
)
service = NewsService(
google_client=AsyncMock(),
repository=AsyncMock(),
article_scraper=AsyncMock(),
sentiment_client=mock_sentiment_client,
embeddings_client=mock_embeddings_client,
)
    articles = [create_test_article()]  # create_test_article: assumed shared test factory helper
enriched = await service._enrich_articles(articles)
# Article should still be returned even though sentiment failed
assert len(enriched) == 1
assert enriched[0].url == articles[0].url
Integration Tests (Real Database)
# tests/domains/news/integration/test_news_workflow.py
import pytest
from datetime import date

from tradingagents.config import TradingAgentsConfig
from tradingagents.lib.database import create_test_database_manager
from tradingagents.domains.news.news_service import NewsService
@pytest.mark.asyncio
async def test_complete_news_pipeline_end_to_end(test_db_manager):
"""Test complete pipeline: RSS → Scrape → LLM → Vector → Store."""
config = TradingAgentsConfig.from_test_env()
service = NewsService.build(test_db_manager, config)
# Execute full pipeline
result = await service.update_company_news("AAPL")
# Verify results
assert result.status == "completed"
assert result.articles_scraped > 0
# Verify database storage
articles = await service.repository.list_by_date_range(
symbol="AAPL",
start_date=date.today(),
end_date=date.today()
)
assert len(articles) > 0
# Verify LLM enrichment
for article in articles:
assert article.sentiment_score is not None
assert article.title_embedding is not None
assert len(article.title_embedding) == 1536
Dagster Tests
# tests/data/jobs/test_news_collection.py
from dagster import build_op_context
from tradingagents.data.ops.news_ops import collect_news_for_symbol
def test_collect_news_for_symbol_op():
"""Test Dagster op for news collection."""
    context = build_op_context(
        resources={"database_manager": mock_database_manager}  # assumed fixture stubbing DatabaseManager
    )
result = collect_news_for_symbol(context, "AAPL")
assert result["symbol"] == "AAPL"
assert result["status"] == "completed"
assert result["articles_found"] >= 0
Performance Optimization
Query Performance Targets
- News retrieval: < 2 seconds for 30-day lookback
- Vector similarity search: < 1 second for top-10 results
- Batch insertion: < 5 seconds for 50 articles
Optimization Strategies
- Vector Indexes: Use IVFFlat vector indexes for similarity search
- Batch Operations: Use executemany() for bulk inserts and updates
- Connection Pooling: Configure the asyncpg connection pool (min=5, max=20; see the sketch below)
- Async Operations: All I/O operations are async (HTTP, database)
- Caching: Dagster asset materialization for computed aggregates
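A hedged sketch of the pool configuration named above, using SQLAlchemy's async engine over asyncpg (the DSN is a placeholder; exact settings belong in the project's DatabaseManager):

from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@localhost:5432/tradingagents",
    pool_size=5,         # persistent connections kept open ("min=5")
    max_overflow=15,     # bursts up to 20 total connections ("max=20")
    pool_pre_ping=True,  # validate connections before use
)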
Monitoring and Observability
Dagster UI Monitoring
- Job runs: View execution history and status
- Asset lineage: Track data dependencies
- Performance metrics: Execution time, success rate
- Logs: Structured logging with context
Custom Metrics
import time
from dagster import op, Output, MetadataValue

@op
def collect_news_for_symbol(context, symbol: str):
    start = time.monotonic()
    # ... collection logic producing the `result` dict (as in the op above) ...
    execution_time_seconds = time.monotonic() - start
    yield Output(
        result,
        metadata={
            "articles_found": MetadataValue.int(result["articles_found"]),
            "articles_scraped": MetadataValue.int(result["articles_scraped"]),
            "success_rate": MetadataValue.float(
                result["articles_scraped"] / max(result["articles_found"], 1)  # guard divide-by-zero
            ),
            "execution_time": MetadataValue.float(execution_time_seconds),
        },
    )
Error Handling and Resilience
LLM Failure Strategies
- Sentiment Analysis Failures: Fall back to keyword-based sentiment
- Embedding Failures: Use zero vectors, log for manual review
- API Rate Limits: Exponential backoff with jitter
- Timeout Handling: 30s timeout for sentiment, 60s for embeddings
Dagster Retry Policies
from dagster import op, RetryPolicy, Backoff, Jitter

@op(
    retry_policy=RetryPolicy(
        max_retries=3,
        delay=10,  # seconds
        backoff=Backoff.EXPONENTIAL,
        jitter=Jitter.PLUS_MINUS,  # "exponential backoff with jitter" per the list above
    )
)
def collect_news_for_symbol(context, symbol):
# ... implementation ...
Success Criteria
✅ Layered Architecture: Entity → Repository → Service → Dagster Op → Dagster Job
✅ LLM Sentiment: OpenRouter structured sentiment with confidence and fallback
✅ Vector RAG: pgvectorscale semantic search operational with <1s query time
✅ Dagster Orchestration: Daily automated collection via Dagster schedules
✅ Test Coverage: >85% maintained with pytest-vcr for HTTP mocking
✅ Performance: Query < 2s, vector search < 1s, batch insert < 5s
✅ Error Resilience: Graceful fallbacks for all LLM and API failures
✅ Monitoring: Dagster UI provides complete observability and alerting
Timeline
Phase 1: Entity + Migration (2-3h)
Phase 2: Repository RAG methods (2-3h)
Phase 3: LLM Clients (4-5h)
Phase 4: Service Enhancement (2-3h)
Phase 5: Dagster Orchestration (3-4h)
Phase 6: Testing & Documentation (2-3h)
Total: 15-20 hours with AI assistance