TradingAgents/docs/specs/news/spec-lite.md

3.8 KiB

News Domain Completion - Implementation Summary

Core Requirement

Complete final 5% of news domain: add scheduled execution, LLM sentiment analysis, and vector embeddings to existing 95% complete infrastructure.

User Story

Dagster Job automatically fetches Google News articles for tracked tickers, extracts content, performs LLM sentiment analysis, and stores with embeddings → News Analysts get comprehensive, up-to-date news data for trading decisions.

Essential Requirements

1. Scheduled Execution

  • Daily job at 6 AM UTC for all configured tickers
  • Dagster orchestration with partitioned schedules
  • Graceful error handling with Dagster sensors and alerting

2. LLM Sentiment Analysis

  • OpenRouter integration using quick_think_llm (claude-3.5-haiku)
  • Structured output: {"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0, "label": "positive|negative|neutral"}
  • Best-effort processing - failures don't stop pipeline

3. Vector Embeddings

  • 1536-dimension embeddings for title and content
  • pgvectorscale storage with similarity indexes
  • Semantic search capability for News Analysts

Technical Implementation

Architecture Pattern

Dagster Job → Dagster Op → NewsService → NewsRepository → NewsArticle → PostgreSQL+pgvectorscale

Database Changes

ALTER TABLE news_articles
ADD COLUMN sentiment_confidence FLOAT,
ADD COLUMN sentiment_label VARCHAR(20);

-- Vector columns already exist from 95% complete infrastructure
-- title_embedding vector(1536)
-- content_embedding vector(1536)

Key Integration Points

  • Existing NewsService: Enhance update_company_news method
  • LLM Integration: OpenRouter unified provider for sentiment and embeddings
  • Vector Generation: OpenAI text-embedding-ada-002 via OpenRouter (1536 dims)
  • Job Scheduling: Dagster jobs with daily partitioned schedules

Implementation Phases

  1. Entity Layer (2-3h): Enhance NewsArticle dataclass + migration
  2. Repository Layer (2-3h): RAG vector similarity search methods
  3. LLM Integration (4-5h): OpenRouter sentiment + embeddings clients
  4. Service Enhancement (2-3h): Integrate LLM clients into NewsService
  5. Dagster Orchestration (3-4h): Jobs, ops, and schedules
  6. Testing & Monitoring (2-3h): Coverage + performance validation

Total: 15-20 hours

Success Criteria

  • Daily automated news collection via Dagster without manual intervention
  • News retrieval with sentiment scores < 2 seconds response time
  • Vector embeddings enable semantic search for News Analysts
  • >95% article processing success rate despite paywall/blocking
  • Maintain >85% test coverage including new components
  • Dagster UI provides monitoring and alerting for job failures

Dependencies

  • APIs: OpenRouter (sentiment + embeddings via unified provider)
  • Infrastructure: PostgreSQL + TimescaleDB + pgvectorscale
  • Orchestration: Dagster for job scheduling and monitoring
  • Existing: 95% complete news domain components (clients, repository, service)

Configuration

# Dagster workspace.yaml
schedules:
  news_collection_daily:
    cron_schedule: "0 6 * * *"  # Daily at 6 AM UTC
    execution_timezone: "UTC"

# Dagster run config
ops:
  collect_news:
    config:
      symbols: ["AAPL", "GOOGL", "MSFT", "TSLA"]
      lookback_days: 1
# Environment variables
OPENROUTER_API_KEY="sk-or-..."  # Unified LLM provider
DATABASE_URL="postgresql+asyncpg://..."

Risk Mitigation

  • API Rate Limits: Exponential backoff + batch processing
  • Paywall Blocking: Metadata-only storage with warnings
  • Job Failures: Dagster sensors + alerting for operational visibility
  • Performance: Vector indexes + query optimization for <2s target
  • LLM Failures: Keyword-based fallback for sentiment, zero-vector fallback for embeddings