News Domain Completion - Implementation Summary
Core Requirement
Complete final 5% of news domain: add scheduled execution, LLM sentiment analysis, and vector embeddings to existing 95% complete infrastructure.
User Story
A Dagster job automatically fetches Google News articles for tracked tickers, extracts their content, performs LLM sentiment analysis, and stores the results with embeddings, so that News Analysts get comprehensive, up-to-date news data for trading decisions.
Essential Requirements
1. Scheduled Execution
- Daily job at 6 AM UTC for all configured tickers
- Dagster orchestration with partitioned schedules
- Graceful error handling with Dagster sensors and alerting
2. LLM Sentiment Analysis
- OpenRouter integration using quick_think_llm (claude-3.5-haiku)
- Structured output: {"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0, "label": "positive|negative|neutral"}
- Best-effort processing: failures don't stop the pipeline
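The structured output and best-effort rule above can be sketched as a small validator. This is a minimal sketch, not the project's implementation: `SentimentResult` and `parse_sentiment` are hypothetical names, and the choice to read `label` first and fall back to `sentiment` is an assumption, since the summary lists both keys with the same value set.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Allowed values, taken from the structured-output spec above.
VALID_LABELS = {"positive", "negative", "neutral"}


@dataclass(frozen=True)
class SentimentResult:
    """Hypothetical container for one validated LLM sentiment reply."""
    label: str
    confidence: float


def parse_sentiment(raw: str) -> Optional[SentimentResult]:
    """Return a validated result, or None when the reply is unusable.

    Returning None instead of raising keeps processing best-effort:
    the caller can store the article without sentiment and move on.
    """
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(payload, dict):
        return None

    # Assumption: prefer "label", fall back to "sentiment".
    label = payload.get("label", payload.get("sentiment"))
    confidence = payload.get("confidence")

    if not isinstance(label, str) or label not in VALID_LABELS:
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return SentimentResult(label=label, confidence=float(confidence))
```

A caller would treat a `None` result as "no sentiment available" and still persist the article.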
3. Vector Embeddings
- 1536-dimension embeddings for title and content
- pgvectorscale storage with similarity indexes
- Semantic search capability for News Analysts
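To illustrate what semantic search over embeddings computes, here is a minimal pure-Python sketch of cosine-similarity ranking. In the planned system this ranking would presumably be delegated to the database's vector index; the function names `cosine_similarity` and `top_k` and the in-memory dictionary of embeddings are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Cosine is undefined for a zero vector; treat it as "no match".
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k article ids most similar to the query, best first."""
    scored = [
        (article_id, cosine_similarity(query, vector))
        for article_id, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The sketch uses short vectors in practice for readability; the same code works unchanged on 1536-dimension embeddings, only more slowly than an indexed database query.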
Technical Implementation
Architecture Pattern
Dagster Job → Dagster Op → NewsService → NewsRepository → NewsArticle → PostgreSQL+pgvectorscale
Database Changes
ALTER TABLE news_articles
ADD COLUMN sentiment_confidence FLOAT,
ADD COLUMN sentiment_label VARCHAR(20);
-- Vector columns already exist from 95% complete infrastructure
-- title_embedding vector(1536)
-- content_embedding vector(1536)
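On the entity side, the new columns might be mirrored roughly as follows. This is a sketch under stated assumptions: the real NewsArticle dataclass is not shown in this summary, so the `url` and `title` fields and the validation in `__post_init__` are hypothetical; only the sentiment and embedding field names, the 20-character label limit, and the 1536 dimension come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional

EMBEDDING_DIM = 1536  # matches vector(1536) in the schema above


@dataclass
class NewsArticle:
    # Hypothetical pre-existing fields, for illustration only.
    url: str
    title: str
    # New sentiment columns from the ALTER TABLE statement.
    sentiment_label: Optional[str] = None
    sentiment_confidence: Optional[float] = None
    # Vector columns that already exist in the schema.
    title_embedding: Optional[List[float]] = None
    content_embedding: Optional[List[float]] = None

    def __post_init__(self) -> None:
        # Reject vectors that would not fit the vector(1536) columns.
        for name in ("title_embedding", "content_embedding"):
            vector = getattr(self, name)
            if vector is not None and len(vector) != EMBEDDING_DIM:
                raise ValueError(f"{name} must have {EMBEDDING_DIM} dimensions")
        # Mirror the VARCHAR(20) limit on sentiment_label.
        if self.sentiment_label is not None and len(self.sentiment_label) > 20:
            raise ValueError("sentiment_label exceeds 20 characters")
        confidence = self.sentiment_confidence
        if confidence is not None and not 0.0 <= confidence <= 1.0:
            raise ValueError("sentiment_confidence must be within [0.0, 1.0]")
```

Keeping every new field optional lets articles without sentiment or embeddings still be stored.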
Key Integration Points
- Existing NewsService: Enhance update_company_news method
- LLM Integration: OpenRouter unified provider for sentiment and embeddings
- Vector Generation: OpenAI text-embedding-ada-002 via OpenRouter (1536 dims)
- Job Scheduling: Dagster jobs with daily partitioned schedules
Implementation Phases
- Entity Layer (2-3h): Enhance NewsArticle dataclass + migration
- Repository Layer (2-3h): RAG vector similarity search methods
- LLM Integration (4-5h): OpenRouter sentiment + embeddings clients
- Service Enhancement (2-3h): Integrate LLM clients into NewsService
- Dagster Orchestration (3-4h): Jobs, ops, and schedules
- Testing & Monitoring (2-3h): Coverage + performance validation
Total: 15-21 hours
Success Criteria
- ✅ Daily automated news collection via Dagster without manual intervention
- ✅ News retrieval with sentiment scores < 2 seconds response time
- ✅ Vector embeddings enable semantic search for News Analysts
- ✅ >95% article processing success rate despite paywall/blocking
- ✅ Maintain >85% test coverage including new components
- ✅ Dagster UI provides monitoring and alerting for job failures
Dependencies
- APIs: OpenRouter (sentiment + embeddings via unified provider)
- Infrastructure: PostgreSQL + TimescaleDB + pgvectorscale
- Orchestration: Dagster for job scheduling and monitoring
- Existing: 95% complete news domain components (clients, repository, service)
Configuration
# Dagster workspace.yaml
schedules:
  news_collection_daily:
    cron_schedule: "0 6 * * *"  # Daily at 6 AM UTC
    execution_timezone: "UTC"

# Dagster run config
ops:
  collect_news:
    config:
      symbols: ["AAPL", "GOOGL", "MSFT", "TSLA"]
      lookback_days: 1

# Environment variables
OPENROUTER_API_KEY="sk-or-..."  # Unified LLM provider
DATABASE_URL="postgresql+asyncpg://..."
Risk Mitigation
- API Rate Limits: Exponential backoff + batch processing
- Paywall Blocking: Metadata-only storage with warnings
- Job Failures: Dagster sensors + alerting for operational visibility
- Performance: Vector indexes + query optimization for <2s target
- LLM Failures: Keyword-based fallback for sentiment, zero-vector fallback for embeddings
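Two of the mitigations above, exponential backoff and the keyword-based sentiment fallback, can be sketched with the standard library alone. Everything here is illustrative: `call_with_backoff`, `keyword_sentiment`, the default delays, and the word lists are assumptions, not values taken from the project.

```python
import re
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical word lists; a real fallback would need a curated lexicon.
POSITIVE_WORDS = frozenset({"beat", "surge", "gain", "upgrade", "record"})
NEGATIVE_WORDS = frozenset({"miss", "plunge", "loss", "downgrade", "lawsuit"})


def call_with_backoff(
    fn: Callable[[], T],
    attempts: int = 4,
    base: float = 1.0,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fn up to `attempts` times, doubling the wait after each failure.

    Returns None when every attempt fails so the caller can fall back.
    Catching Exception broadly is a simplification for this sketch.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                return None
            # Waits grow as base, 2*base, 4*base, ... up to cap.
            sleep(min(cap, base * 2 ** attempt))
    return None


def keyword_sentiment(text: str) -> str:
    """Crude fallback label based on counting positive and negative words."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A caller could wrap the LLM request in `call_with_backoff` and, on a `None` result, label the article with `keyword_sentiment` instead. Passing `sleep` as a parameter keeps the retry logic testable without real waits.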