# News Domain Completion - Implementation Summary

## Core Requirement
Complete the final 5% of the news domain: add scheduled execution, LLM sentiment analysis, and vector embeddings to the existing infrastructure, which is already 95% complete.

## User Story
A **Scheduled Job** automatically fetches Google News articles for tracked tickers, extracts content, performs LLM sentiment analysis, and stores the results with embeddings → **News Analysts** get comprehensive, up-to-date news data for trading decisions.

## Essential Requirements

### 1. Scheduled Execution
- Daily job at 6 AM UTC for all configured tickers
- APScheduler integration (no Dagster dependency)
- Graceful error handling with comprehensive logging
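
The requirements above could be sketched roughly as follows, assuming APScheduler 3.x. The names `run_daily_news_job`, `build_scheduler`, and the `update_one` callback are hypothetical placeholders for whatever the existing service exposes; the scheduler import is deferred so the job function can be exercised without APScheduler installed.

```python
import logging
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger("news.scheduled_job")


def run_daily_news_job(
    tickers: Iterable[str], update_one: Callable[[str], None]
) -> Dict[str, bool]:
    """Update every ticker, logging failures instead of aborting the run."""
    results: Dict[str, bool] = {}
    for ticker in tickers:
        try:
            update_one(ticker)
            results[ticker] = True
        except Exception:
            # One failing ticker must not stop the remaining tickers.
            logger.exception("News update failed for %s", ticker)
            results[ticker] = False
    logger.info(
        "News job finished: %d/%d tickers succeeded",
        sum(results.values()),
        len(results),
    )
    return results


def build_scheduler(
    tickers: List[str], update_one: Callable[[str], None], hour: int = 6
):
    """Return an unstarted scheduler that runs the job daily at `hour`:00 UTC."""
    # Assumption: APScheduler 3.x; imported lazily to keep the job testable alone.
    from apscheduler.schedulers.background import BackgroundScheduler
    from apscheduler.triggers.cron import CronTrigger

    scheduler = BackgroundScheduler(timezone="UTC")
    scheduler.add_job(
        run_daily_news_job,
        trigger=CronTrigger(hour=hour, minute=0, timezone="UTC"),
        args=[tickers, update_one],
        id="daily_news_update",
        replace_existing=True,
    )
    return scheduler
```

Calling `build_scheduler(...).start()` from the application entry point would activate the daily run; a blocking scheduler may be a better fit if the job runs in its own process.
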
### 2. LLM Sentiment Analysis
- OpenRouter integration using `quick_think_llm` (claude-3.5-haiku)
- Structured output: `{"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0}`
- Best-effort processing: failures don't stop the pipeline
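
One way to honor both the structured-output contract and the best-effort rule is to validate the model's reply and return `None` on anything unexpected, so the caller can store the article without a sentiment score. This is a minimal sketch; `parse_sentiment` is a hypothetical helper, and the actual OpenRouter call is left out.

```python
import json
from typing import Optional

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def parse_sentiment(raw: str) -> Optional[dict]:
    """Return a validated sentiment dict, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        # Not JSON at all (json.JSONDecodeError is a ValueError subclass).
        return None
    if not isinstance(data, dict):
        return None

    sentiment = data.get("sentiment")
    confidence = data.get("confidence")

    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None

    return {"sentiment": sentiment, "confidence": float(confidence)}
```

A caller would typically log a warning when this returns `None` and continue with the next article.
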
### 3. Vector Embeddings
- 1536-dimension embeddings for title and content
- pgvectorscale storage with similarity indexes
- Semantic search capability for News Analysts
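
A semantic search over the stored embeddings might look like the sketch below. It uses pgvector's `<=>` cosine-distance operator and psycopg-style named parameters; the `id` and `title` columns and the helper names are assumptions, since the document only specifies the embedding columns.

```python
from typing import Sequence

EMBEDDING_DIM = 1536  # matches the vector(1536) columns


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding as a pgvector text literal such as '[0.1,0.2]'."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}"
        )
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Assumption: news_articles has id and title columns alongside the embeddings.
SEMANTIC_SEARCH_SQL = """
SELECT id, title, content_embedding <=> %(query)s::vector AS distance
FROM news_articles
WHERE content_embedding IS NOT NULL
ORDER BY content_embedding <=> %(query)s::vector
LIMIT %(limit)s
"""


def semantic_search_params(query_embedding: Sequence[float], limit: int = 10) -> dict:
    """Build the parameter dict to pass alongside SEMANTIC_SEARCH_SQL."""
    return {"query": to_vector_literal(query_embedding), "limit": limit}
```

With a psycopg cursor this would be run as `cur.execute(SEMANTIC_SEARCH_SQL, semantic_search_params(vec))`, where `vec` is the embedding of the analyst's query text.
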
## Technical Implementation

### Architecture Pattern
```
ScheduledNewsJob → NewsService → NewsRepository → NewsArticle → PostgreSQL+pgvectorscale
```
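
To illustrate how the layers in this chain could depend on each other, here is a small sketch with an in-memory repository standing in for PostgreSQL. Only the class names and `update_news_for_symbol` come from this document; the fields on `NewsArticle`, the `upsert` method, and the `fetch_articles` callback are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Protocol


@dataclass
class NewsArticle:
    url: str
    title: str
    content: Optional[str] = None  # None when extraction was blocked
    sentiment_score: Optional[dict] = None
    title_embedding: Optional[List[float]] = None
    content_embedding: Optional[List[float]] = None


class NewsRepository(Protocol):
    def upsert(self, article: NewsArticle) -> None: ...


class InMemoryNewsRepository:
    """Stand-in for the PostgreSQL-backed repository, keyed by URL."""

    def __init__(self) -> None:
        self.articles: Dict[str, NewsArticle] = {}

    def upsert(self, article: NewsArticle) -> None:
        self.articles[article.url] = article


class NewsService:
    def __init__(
        self,
        repository: NewsRepository,
        fetch_articles: Callable[[str], List[NewsArticle]],
    ) -> None:
        self._repository = repository
        self._fetch_articles = fetch_articles

    def update_news_for_symbol(self, symbol: str) -> int:
        """Fetch and store articles for one symbol; return how many were stored."""
        articles = self._fetch_articles(symbol)
        for article in articles:
            self._repository.upsert(article)
        return len(articles)


class ScheduledNewsJob:
    def __init__(self, service: NewsService, tickers: List[str]) -> None:
        self._service = service
        self._tickers = tickers

    def run(self) -> Dict[str, int]:
        """Run the service once per ticker and report stored-article counts."""
        return {t: self._service.update_news_for_symbol(t) for t in self._tickers}
```

Keeping the job unaware of storage details means sentiment and embedding steps can be added inside the service without touching the scheduling code.
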
### Database Changes
```sql
ALTER TABLE news_articles
    ADD COLUMN sentiment_score JSONB,
    ADD COLUMN title_embedding vector(1536),
    ADD COLUMN content_embedding vector(1536);
```

### Key Integration Points
- **Existing NewsService**: Enhance `update_news_for_symbol` method
- **LLM Integration**: OpenRouter unified provider for sentiment
- **Vector Generation**: text-embedding-3-small model (1536 dims)
- **Job Scheduling**: APScheduler with cron trigger

## Implementation Phases
1. **Scheduled Execution** (2-3h): APScheduler + config management
2. **LLM Sentiment** (3-4h): OpenRouter integration + structured prompts
3. **Vector Embeddings** (2-3h): Embedding generation + database schema
4. **Testing & Monitoring** (2h): Coverage + performance validation

**Total: 9-12 hours**

## Success Criteria
- ✅ Daily automated news collection without manual intervention
- ✅ News retrieval with sentiment scores in under 2 seconds
- ✅ Vector embeddings enable semantic search for News Analysts
- ✅ >95% article processing success rate despite paywalls and blocking
- ✅ Maintain >85% test coverage, including new components

## Dependencies
- **APIs**: OpenRouter (sentiment), OpenAI (embeddings)
- **Infrastructure**: PostgreSQL + TimescaleDB + pgvectorscale
- **New Package**: `apscheduler` for job scheduling
- **Existing**: 95% complete news domain components

## Configuration
```bash
OPENROUTER_API_KEY="sk-or-..."
OPENAI_API_KEY="sk-..."
NEWS_SCHEDULE_HOUR=6
NEWS_TICKERS="AAPL,GOOGL,MSFT,TSLA"
```
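
These variables could be read and validated at startup along the lines below. `NewsConfig` and `load_news_config` are hypothetical names, and defaulting the hour to 6 is an assumption based on the 6 AM UTC requirement.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class NewsConfig:
    openrouter_api_key: str
    openai_api_key: str
    schedule_hour: int
    tickers: Tuple[str, ...]


def load_news_config(env: Optional[Mapping[str, str]] = None) -> NewsConfig:
    """Read the news settings from `env` (defaults to the process environment)."""
    env = os.environ if env is None else env

    missing = [k for k in ("OPENROUTER_API_KEY", "OPENAI_API_KEY") if not env.get(k)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))

    # Assumption: fall back to 6 (UTC) when NEWS_SCHEDULE_HOUR is unset.
    hour = int(env.get("NEWS_SCHEDULE_HOUR", "6"))
    if not 0 <= hour <= 23:
        raise ValueError(f"NEWS_SCHEDULE_HOUR must be 0-23, got {hour}")

    # Split the comma-separated list, dropping blanks and normalizing case.
    tickers = tuple(
        t.strip().upper() for t in env.get("NEWS_TICKERS", "").split(",") if t.strip()
    )
    if not tickers:
        raise ValueError("NEWS_TICKERS must list at least one ticker")

    return NewsConfig(
        openrouter_api_key=env["OPENROUTER_API_KEY"],
        openai_api_key=env["OPENAI_API_KEY"],
        schedule_hour=hour,
        tickers=tickers,
    )
```

Failing fast on a missing key or an empty ticker list avoids discovering the problem only when the first scheduled run fires.
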
## Risk Mitigation
- **API Rate Limits**: Exponential backoff + batch processing
- **Paywall Blocking**: Metadata-only storage with warnings
- **Job Failures**: Monitoring + alerting for operational visibility
- **Performance**: Vector indexes + query optimization for the <2s target
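
The exponential backoff mentioned for rate limits could be sketched as a small retry wrapper. `call_with_backoff` and its defaults are hypothetical; a real integration would likely retry only on rate-limit or transient network errors rather than on every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                # Out of retries: surface the last error to the caller.
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists so tests can inject a recorder instead of actually waiting.
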