News Domain Completion - Implementation Summary
Core Requirement
Complete final 5% of news domain: add scheduled execution, LLM sentiment analysis, and vector embeddings to existing 95% complete infrastructure.
User Story
A scheduled job automatically fetches Google News articles for tracked tickers, extracts content, performs LLM sentiment analysis, and stores the results with embeddings → News Analysts get comprehensive, up-to-date news data for trading decisions.
Essential Requirements
1. Scheduled Execution
- Daily job at 6 AM UTC for all configured tickers
- APScheduler integration (no Dagster dependency)
- Graceful error handling with comprehensive logging
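A minimal sketch of the scheduled-execution requirement, assuming APScheduler 3.x and its `add_job(func, "cron", ...)` form. The names `cron_kwargs`, `run_daily_update`, `register`, and the `update_one` callback are hypothetical and not part of the existing codebase; the scheduler is passed in so the logic can be exercised without APScheduler installed.

```python
import logging
from typing import Any, Callable, Dict, Iterable, List

logger = logging.getLogger("news.schedule")


def cron_kwargs(hour: int = 6) -> Dict[str, Any]:
    """Cron trigger arguments for one run per day at `hour`:00 UTC."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    return {"hour": hour, "minute": 0, "timezone": "UTC"}


def run_daily_update(
    tickers: Iterable[str], update_one: Callable[[str], None]
) -> Dict[str, bool]:
    """Update every ticker; log a failure and continue with the next one."""
    results: Dict[str, bool] = {}
    for ticker in tickers:
        try:
            update_one(ticker)
            results[ticker] = True
        except Exception:
            # Graceful handling: one bad ticker must not abort the whole run.
            logger.exception("news update failed for %s", ticker)
            results[ticker] = False
    return results


def register(
    scheduler: Any,
    tickers: List[str],
    update_one: Callable[[str], None],
    hour: int = 6,
) -> Any:
    """Register the daily job on an APScheduler-style scheduler object."""
    return scheduler.add_job(
        run_daily_update,
        "cron",
        args=[tickers, update_one],
        id="daily_news",  # hypothetical job id
        **cron_kwargs(hour),
    )
```

In a real process, `scheduler` would typically be a `BlockingScheduler` or `BackgroundScheduler` instance that is started after `register` is called.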
2. LLM Sentiment Analysis
- OpenRouter integration using quick_think_llm (claude-3.5-haiku)
- Structured output: {"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0}
- Best-effort processing - failures don't stop pipeline
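One way to keep sentiment analysis best-effort is to validate the model's reply against the structured-output shape above and return nothing when it does not match. The sketch below assumes the reply arrives as a JSON string; `parse_sentiment` is a hypothetical helper, and the actual OpenRouter call is deliberately left out.

```python
import json
from typing import Optional

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def parse_sentiment(raw: str) -> Optional[dict]:
    """Return a validated sentiment dict, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None  # not JSON at all
    if not isinstance(data, dict):
        return None
    sentiment = data.get("sentiment")
    confidence = data.get("confidence")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        return None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return {"sentiment": sentiment, "confidence": float(confidence)}
```

A caller can store the article with an empty sentiment when this returns `None`, so a malformed reply never stops the pipeline.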
3. Vector Embeddings
- 1536-dimension embeddings for title and content
- pgvectorscale storage with similarity indexes
- Semantic search capability for News Analysts
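To illustrate what semantic search over embeddings computes, here is a small in-memory sketch that ranks articles by cosine similarity to a query vector. It uses toy vectors rather than real 1536-dimension embeddings, and `cosine_similarity` and `rank_articles` are hypothetical names; in the planned design the database index would do this work instead.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_articles(
    query: Sequence[float],
    articles: List[Tuple[str, Sequence[float]]],
    top_k: int = 3,
) -> List[str]:
    """Return the titles of the `top_k` articles most similar to `query`."""
    scored = [(cosine_similarity(query, emb), title) for title, emb in articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]
```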
Technical Implementation
Architecture Pattern
ScheduledNewsJob → NewsService → NewsRepository → NewsArticle → PostgreSQL+pgvectorscale
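The layering above might look roughly like the following sketch. Only the names `NewsService`, `NewsRepository`, `NewsArticle`, and `update_news_for_symbol` come from this document; the fields, constructor arguments, `save` method, and `InMemoryRepository` stand-in are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Protocol, Tuple


@dataclass
class NewsArticle:
    symbol: str
    title: str
    content: str
    sentiment: Optional[dict] = None


class NewsRepository(Protocol):
    def save(self, article: NewsArticle) -> None: ...


class InMemoryRepository:
    """Stand-in for the PostgreSQL-backed repository."""

    def __init__(self) -> None:
        self.rows: List[NewsArticle] = []

    def save(self, article: NewsArticle) -> None:
        self.rows.append(article)


class NewsService:
    def __init__(
        self,
        repo: NewsRepository,
        fetch: Callable[[str], Iterable[Tuple[str, str]]],
        analyze: Callable[[str], Optional[dict]],
    ) -> None:
        self.repo = repo
        self.fetch = fetch      # yields (title, content) pairs for a symbol
        self.analyze = analyze  # returns a sentiment dict, or raises

    def update_news_for_symbol(self, symbol: str) -> int:
        """Fetch, analyze, and store articles; return how many were stored."""
        stored = 0
        for title, content in self.fetch(symbol):
            article = NewsArticle(symbol, title, content)
            try:
                article.sentiment = self.analyze(content)
            except Exception:
                article.sentiment = None  # best-effort: store without sentiment
            self.repo.save(article)
            stored += 1
        return stored
```

Injecting `fetch` and `analyze` keeps the service testable without network access, which also helps with the coverage target.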
Database Changes
ALTER TABLE news_articles
ADD COLUMN sentiment_score JSONB,
ADD COLUMN title_embedding vector(1536),
ADD COLUMN content_embedding vector(1536);
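Once these columns exist, a similarity lookup could be issued along the following lines. This is a sketch under assumptions: it uses pgvector's cosine-distance operator `<=>` and its `[x,y,...]` text format for vectors, psycopg-style named placeholders, and hypothetical helper names (`to_vector_literal`, `search_params`); the selected columns would depend on the real table.

```python
from typing import Dict, Sequence, Union

EMBEDDING_DIM = 1536  # matches vector(1536) in the schema change


def to_vector_literal(embedding: Sequence[float], dim: int = EMBEDDING_DIM) -> str:
    """Format an embedding in pgvector's text form, e.g. '[0.1,0.2]'."""
    if len(embedding) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(embedding)}")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Nearest articles by cosine distance on the content embedding.
SIMILAR_ARTICLES_SQL = """
SELECT *, content_embedding <=> %(q)s::vector AS distance
FROM news_articles
WHERE content_embedding IS NOT NULL
ORDER BY content_embedding <=> %(q)s::vector
LIMIT %(k)s
"""


def search_params(embedding: Sequence[float], k: int = 10) -> Dict[str, Union[str, int]]:
    """Parameters to pass alongside SIMILAR_ARTICLES_SQL."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {"q": to_vector_literal(embedding), "k": k}
```

With a psycopg cursor, usage would be roughly `cur.execute(SIMILAR_ARTICLES_SQL, search_params(query_embedding, k=5))`.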
Key Integration Points
- Existing NewsService: Enhance update_news_for_symbol method
- LLM Integration: OpenRouter unified provider for sentiment
- Vector Generation: text-embedding-3-small model (1536 dims)
- Job Scheduling: APScheduler with cron trigger
Implementation Phases
- Scheduled Execution (2-3h): APScheduler + config management
- LLM Sentiment (3-4h): OpenRouter integration + structured prompts
- Vector Embeddings (2-3h): Embedding generation + database schema
- Testing & Monitoring (2h): Coverage + performance validation
Total: 9-12 hours
Success Criteria
- ✅ Daily automated news collection without manual intervention
- ✅ News retrieval with sentiment scores in under 2 seconds
- ✅ Vector embeddings enable semantic search for News Analysts
- ✅ >95% article processing success rate despite paywall/blocking
- ✅ Maintain >85% test coverage including new components
Dependencies
- APIs: OpenRouter (sentiment), OpenAI (embeddings)
- Infrastructure: PostgreSQL + TimescaleDB + pgvectorscale
- New Package: apscheduler for job scheduling
- Existing: 95% complete news domain components
Configuration
OPENROUTER_API_KEY="sk-or-..."
OPENAI_API_KEY="sk-..."
NEWS_SCHEDULE_HOUR=6
NEWS_TICKERS="AAPL,GOOGL,MSFT,TSLA"
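These settings could be read and validated at startup with something like the sketch below. `NewsConfig` and `load_config` are hypothetical names, and the fallback of hour 6 when `NEWS_SCHEDULE_HOUR` is unset is an assumption based on the daily 6 AM UTC requirement.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class NewsConfig:
    openrouter_api_key: str
    openai_api_key: str
    schedule_hour: int
    tickers: Tuple[str, ...]


def load_config(env: Mapping[str, str] = os.environ) -> NewsConfig:
    """Read the four settings and fail fast on missing or invalid values."""
    required = ("OPENROUTER_API_KEY", "OPENAI_API_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))

    hour = int(env.get("NEWS_SCHEDULE_HOUR", "6"))
    if not 0 <= hour <= 23:
        raise ValueError(f"NEWS_SCHEDULE_HOUR must be in 0..23, got {hour}")

    # Split on commas, drop blanks, and normalize to upper case.
    raw_tickers = env.get("NEWS_TICKERS", "")
    tickers = tuple(t.strip().upper() for t in raw_tickers.split(",") if t.strip())
    if not tickers:
        raise ValueError("NEWS_TICKERS must list at least one ticker")

    return NewsConfig(env["OPENROUTER_API_KEY"], env["OPENAI_API_KEY"], hour, tickers)
```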
Risk Mitigation
- API Rate Limits: Exponential backoff + batch processing
- Paywall Blocking: Metadata-only storage with warnings
- Job Failures: Monitoring + alerting for operational visibility
- Performance: Vector indexes + query optimization for <2s target
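The exponential backoff mentioned for API rate limits can be sketched as a small retry wrapper. `with_backoff` and its parameters are hypothetical; a real integration would likely catch only the provider's rate-limit errors rather than every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retries: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying with delays of base_delay * 2**attempt on failure."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter avoids many workers retrying in lockstep.
            sleep(delay + random.uniform(0, delay * jitter))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Passing `sleep` as a parameter lets tests record the delays instead of actually waiting.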