TradingAgents/docs/specs/news/spec-lite.md

80 lines
3.0 KiB
Markdown

# News Domain Completion - Implementation Summary
## Core Requirement
Complete final 5% of news domain: add scheduled execution, LLM sentiment analysis, and vector embeddings to existing 95% complete infrastructure.
## User Story
**Dagster Job** automatically fetches Google News articles for tracked tickers, extracts content, performs LLM sentiment analysis, and stores with embeddings → **News Analysts** get comprehensive, up-to-date news data for trading decisions.
## Essential Requirements
### 1. Scheduled Execution
- Daily job at 6 AM UTC for all configured tickers
- APScheduler integration (no Dagster dependency)
- Graceful error handling with comprehensive logging
### 2. LLM Sentiment Analysis
- OpenRouter integration using `quick_think_llm` (claude-3.5-haiku)
- Structured output: `{"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0}`
- Best-effort processing - failures don't stop pipeline
### 3. Vector Embeddings
- 1536-dimension embeddings for title and content
- pgvectorscale storage with similarity indexes
- Semantic search capability for News Analysts
## Technical Implementation
### Architecture Pattern
```
ScheduledNewsJob → NewsService → NewsRepository → NewsArticle → PostgreSQL+pgvectorscale
```
### Database Changes
```sql
ALTER TABLE news_articles
ADD COLUMN sentiment_score JSONB,
ADD COLUMN title_embedding vector(1536),
ADD COLUMN content_embedding vector(1536);
```
### Key Integration Points
- **Existing NewsService**: Enhance `update_news_for_symbol` method
- **LLM Integration**: OpenRouter unified provider for sentiment
- **Vector Generation**: text-embedding-3-small model (1536 dims)
- **Job Scheduling**: APScheduler with cron trigger
## Implementation Phases
1. **Scheduled Execution** (2-3h): APScheduler + config management
2. **LLM Sentiment** (3-4h): OpenRouter integration + structured prompts
3. **Vector Embeddings** (2-3h): Embedding generation + database schema
4. **Testing & Monitoring** (2h): Coverage + performance validation
**Total: 9-12 hours**
## Success Criteria
- ✅ Daily automated news collection without manual intervention
- ✅ News retrieval with sentiment scores < 2 seconds response time
- Vector embeddings enable semantic search for News Analysts
- >95% article processing success rate despite paywall/blocking
- ✅ Maintain >85% test coverage including new components
## Dependencies
- **APIs**: OpenRouter (sentiment), OpenAI (embeddings)
- **Infrastructure**: PostgreSQL + TimescaleDB + pgvectorscale
- **New Package**: `apscheduler` for job scheduling
- **Existing**: 95% complete news domain components
## Configuration
```bash
OPENROUTER_API_KEY="sk-or-..."
OPENAI_API_KEY="sk-..."
NEWS_SCHEDULE_HOUR=6
NEWS_TICKERS="AAPL,GOOGL,MSFT,TSLA"
```
## Risk Mitigation
- **API Rate Limits**: Exponential backoff + batch processing
- **Paywall Blocking**: Metadata-only storage with warnings
- **Job Failures**: Monitoring + alerting for operational visibility
- **Performance**: Vector indexes + query optimization for <2s target