12 KiB

Raw Blame History

News Domain Completion Specification

Feature Overview

Complete the final 5% of the news domain by adding scheduled execution, LLM sentiment analysis, and vector embeddings to the existing 95% complete infrastructure. This enables automated daily news collection with advanced sentiment analysis and semantic search capabilities for News Analysts in the multi-agent trading framework.

User Story

Primary User: Dagster Job (automated system)
Secondary Users: News Analysts (LLM agents)

As a Dagster Job, I want to automatically fetch Google News articles for tracked tickers, extract content, perform LLM sentiment analysis, and store with embeddings in the database, so that News Analysts can access comprehensive, up-to-date news data for trading decisions.

Acceptance Criteria

AC1: Scheduled Execution

GIVEN a scheduled job runs daily
WHEN it executes
THEN it fetches news for all configured tickers without manual intervention

Validation:

Job executes at configured time (default: daily at 6 AM UTC)
All tickers in configuration are processed
Job completion status is logged with metrics

AC2: Content Extraction Resilience

GIVEN a news article is found
WHEN content extraction fails due to paywall
THEN a warning is logged and processing continues with available metadata

Validation:

Paywall detection doesn't halt processing
Warning messages include article URL and error reason
Metadata (title, source, publish_date) is still stored

AC3: Fast News Retrieval

GIVEN a ticker symbol
WHEN a News Analyst requests news data
THEN they receive articles with sentiment scores and embeddings within 2 seconds

Validation:

Database queries return results in < 2 seconds
Results include sentiment scores and vector embeddings
Pagination supports large result sets

AC4: LLM Sentiment Analysis

GIVEN news articles are processed
WHEN LLM sentiment analysis runs
THEN each article gets a structured sentiment score (positive/negative/neutral with confidence)

Validation:

Sentiment scores use structured format: {"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0}
LLM integration uses OpenRouter unified provider
Failed sentiment analysis doesn't prevent article storage

AC5: Vector Embeddings Storage

GIVEN news articles are stored
WHEN saved to database
THEN they include vector embeddings for both title and content for semantic search

Validation:

1536-dimension embeddings generated for title and content
Embeddings stored in pgvectorscale-optimized columns
Semantic similarity search returns relevant results

Business Rules

BR1: Best Effort Processing

Log warnings for paywalled/blocked content but continue processing
Network failures don't halt entire job execution
API rate limits are respected with exponential backoff

BR2: Daily Schedule Execution

Configurable ticker list supports adding/removing symbols
Job execution time is configurable (default: daily at 6 AM UTC)
Manual job execution available for testing and backfill

BR3: Data Quality Standards

URL-based deduplication prevents duplicate articles
Article publish dates must be within last 30 days
Source URLs must be valid and accessible

BR4: LLM Integration Standards

Use OpenRouter unified provider for sentiment analysis
Quick-think LLM for sentiment processing (cost optimization)
Structured prompts ensure consistent sentiment format

BR5: Vector Search Optimization

Embeddings enable semantic similarity search for agents
Vector indexes optimize query performance
Embedding generation uses consistent model for coherence

BR6: Graceful Error Handling

Individual article failures don't stop batch processing
Comprehensive logging for monitoring and debugging
Database transactions ensure data consistency

Technical Implementation

Architecture Alignment

Follows established Router → Service → Repository → Entity → Database pattern with Dagster orchestration:

Dagster Schedule → Dagster Job → NewsService → NewsRepository → NewsArticle → PostgreSQL+pgvectorscale

Database Schema Integration

Leverages existing NewsRepository with vector extensions:

-- Existing news_articles table enhanced with:
ALTER TABLE news_articles 
ADD COLUMN IF NOT EXISTS sentiment_score JSONB,
ADD COLUMN IF NOT EXISTS title_embedding vector(1536),
ADD COLUMN IF NOT EXISTS content_embedding vector(1536);

-- Vector similarity indexes
CREATE INDEX IF NOT EXISTS idx_title_embedding 
ON news_articles USING ivfflat (title_embedding vector_cosine_ops);

LLM Integration Pattern

# OpenRouter sentiment analysis
sentiment_result = await llm_client.analyze_sentiment(
    text=article.content,
    model="anthropic/claude-3.5-haiku",  # quick_think_llm
    structured_output=True
)

# Expected response format
{
    "sentiment": "positive|negative|neutral",
    "confidence": 0.85,
    "reasoning": "Brief explanation"
}

Vector Embedding Strategy

# Generate embeddings for semantic search
title_embedding = await embedding_client.create_embedding(
    text=article.title,
    model="text-embedding-3-small"  # 1536 dimensions
)

content_embedding = await embedding_client.create_embedding(
    text=article.content[:8000],  # Truncate for token limits
    model="text-embedding-3-small"
)

Scheduled Execution Framework

Use Dagster for job orchestration (existing dependency in project):

from dagster import (
    job, 
    schedule, 
    ScheduleDefinition,
    op,
    In,
    Out,
    AssetMaterialization
)
from dagster._core.scheduler import ScheduleExecutionContext

@op
def fetch_news_for_tickers(context, tickers: list[str]) -> list[dict]:
    """Fetch news articles for configured tickers"""
    pass

@op  
def process_articles_with_sentiment(context, articles: list[dict]) -> list[dict]:
    """Process articles with LLM sentiment analysis and embeddings"""
    pass

@op
def store_articles(context, processed_articles: list[dict]) -> None:
    """Store articles with sentiment and embeddings in database"""
    pass

@job
def daily_news_collection_job():
    """Daily news collection pipeline"""
    tickers = ["AAPL", "GOOGL", "MSFT", "TSLA"]  # From config
    articles = fetch_news_for_tickers(tickers)
    processed = process_articles_with_sentiment(articles)
    store_articles(processed)

@schedule(
    cron_schedule="0 6 * * *",  # Daily at 6 AM UTC
    job=daily_news_collection_job,
    execution_timezone="UTC"
)
def daily_news_collection_schedule(context: ScheduleExecutionContext):
    """Schedule for daily news collection"""
    run_config = {
        "ops": {
            "fetch_news_for_tickers": {
                "inputs": {
                    "tickers": ["AAPL", "GOOGL", "MSFT", "TSLA"]
                }
            }
        }
    }
    return run_config

Implementation Approach

Phase 1: Dagster Scheduling Integration (2-3 hours)

Create Dagster ops for news collection pipeline
Configure daily schedule with cron expression
Set up job configuration management for ticker lists
Add manual job execution capability for testing
Implement job monitoring and asset tracking

Phase 2: LLM Sentiment Integration (3-4 hours)

Integrate OpenRouter LLM for sentiment analysis
Create structured sentiment analysis prompts
Update NewsService to include sentiment processing
Add sentiment data to NewsArticle domain model

Phase 3: Vector Embeddings (2-3 hours)

Add embedding generation to article processing
Update database schema for vector storage
Implement semantic search capabilities in NewsRepository
Create vector similarity query methods

Phase 4: Testing & Monitoring (2 hours)

Add comprehensive test coverage for new components
Implement job monitoring and alerting
Create configuration validation
Performance testing for 2-second query requirement

Total Estimated Effort: 9-12 hours

Dependencies

Required APIs

OpenRouter API: LLM sentiment analysis (OPENROUTER_API_KEY)
OpenAI API: Vector embeddings (OPENAI_API_KEY for embeddings)

Database Requirements

PostgreSQL: Base storage with async support
TimescaleDB: Time-series optimization for news data
pgvectorscale: Vector storage and similarity search

Existing Infrastructure (95% Complete)

NewsService with update_news_for_symbol method
GoogleNewsClient for RSS feed parsing
ArticleScraperClient with newspaper4k integration
NewsRepository with async PostgreSQL operations
NewsArticle domain model with validation
Comprehensive test coverage with pytest-vcr
Dagster framework for data orchestration (existing dependency)

New Dependencies

Enhanced vector embedding capabilities
LLM client integration for sentiment analysis
Dagster scheduling integration (existing dependency)

Configuration Management

Environment Variables

# Existing
OPENROUTER_API_KEY="sk-or-..."
DATABASE_URL="postgresql://..."

# New requirements
OPENAI_API_KEY="sk-..."  # For embeddings
NEWS_SCHEDULE_HOUR=6     # UTC hour for daily execution
NEWS_TICKERS="AAPL,GOOGL,MSFT,TSLA"  # Comma-separated ticker list

Configuration File Support

# config/news_collection.yaml
schedule:
  hour: 6
  minute: 0
  timezone: "UTC"

tickers:
  - "AAPL"
  - "GOOGL" 
  - "MSFT"
  - "TSLA"

sentiment:
  llm_model: "anthropic/claude-3.5-haiku"
  confidence_threshold: 0.5

embeddings:
  model: "text-embedding-3-small"
  dimensions: 1536
  content_max_length: 8000

Success Metrics

Performance Targets

Query Response Time: < 2 seconds for news retrieval with sentiment
Job Execution Time: < 30 minutes for daily collection (4 tickers)
Success Rate: > 95% article processing success rate
Test Coverage: Maintain > 85% coverage including new components

Operational Metrics

Daily job completion status and execution time
Article processing success/failure rates per ticker
LLM sentiment analysis success rates
Vector embedding generation performance
Database query performance monitoring

Risk Mitigation

Technical Risks

LLM API Rate Limits: Implement exponential backoff and batch processing
Vector Storage Performance: Monitor query times and optimize indexes
Paywall Content Blocking: Graceful degradation with metadata-only storage
Database Migration Complexity: Test schema changes thoroughly

Operational Risks

Scheduled Job Failures: Implement monitoring and alerting
API Key Management: Secure configuration management
Data Quality Issues: Validation at multiple pipeline stages
Performance Degradation: Regular performance monitoring and optimization

Testing Strategy

Unit Testing (pytest with pytest-vcr)

Scheduled job execution logic
LLM sentiment analysis integration
Vector embedding generation
Configuration management

Integration Testing

End-to-end news collection pipeline
Database vector operations
LLM API integration
Job scheduling functionality

Performance Testing

Query response time validation (< 2 seconds)
Batch processing performance
Vector similarity search optimization
Concurrent job execution handling

Monitoring and Observability

Logging Strategy

Job execution start/completion with metrics
Individual article processing success/failure
LLM API call status and timing
Database operation performance

Health Checks

Daily job completion status
Database connectivity and performance
LLM API availability and response times
Vector search functionality

Alerting Triggers

Failed daily news collection jobs
API rate limit violations
Database query performance degradation
Sentiment analysis failure rates > 10%

This specification completes the news domain infrastructure to support advanced news analysis for the multi-agent trading framework, providing News Analysts with comprehensive, sentiment-analyzed, and semantically searchable news data for informed trading decisions.

12 KiB Raw Blame History

News Domain Completion Specification

Feature Overview

User Story

Acceptance Criteria

AC1: Scheduled Execution

AC2: Content Extraction Resilience

AC3: Fast News Retrieval

AC4: LLM Sentiment Analysis

AC5: Vector Embeddings Storage

Business Rules

BR1: Best Effort Processing

BR2: Daily Schedule Execution

BR3: Data Quality Standards

BR4: LLM Integration Standards

BR5: Vector Search Optimization

BR6: Graceful Error Handling

Technical Implementation

Architecture Alignment

Database Schema Integration

LLM Integration Pattern

Vector Embedding Strategy

Scheduled Execution Framework

Implementation Approach

Phase 1: Dagster Scheduling Integration (2-3 hours)

Phase 2: LLM Sentiment Integration (3-4 hours)

Phase 3: Vector Embeddings (2-3 hours)

Phase 4: Testing & Monitoring (2 hours)

Total Estimated Effort: 9-12 hours

Dependencies

Required APIs

Database Requirements

Existing Infrastructure (95% Complete)

New Dependencies

Configuration Management

Environment Variables

Configuration File Support

Success Metrics

Performance Targets

Operational Metrics

Risk Mitigation

Technical Risks

Operational Risks

Testing Strategy

Unit Testing (pytest with pytest-vcr)

Integration Testing

Performance Testing

Monitoring and Observability

Logging Strategy

Health Checks

Alerting Triggers

12 KiB

Raw Blame History