TradingAgents/docs/specs/socialmedia/spec-lite.md

# Social Media Domain - Specification Lite
## Summary
Specifies a complete implementation of social media data collection from Reddit, with LLM sentiment analysis and vector embeddings that support RAG queries from AI agents.
## Core Requirements
### Data Collection
- **Daily Reddit collection** from financial subreddits (wallstreetbets, investing, stocks, SecurityAnalysis)
- **OpenRouter LLM sentiment analysis** with confidence scoring
- **Vector embeddings** for semantic similarity search
- **PostgreSQL storage** with TimescaleDB + pgvectorscale optimization
### Agent Integration
- **AgentToolkit methods**: `get_reddit_news()` and `get_reddit_stock_info()`
- **RAG-enhanced queries** with < 2 second response time
- **Vector similarity search** for contextual social media insights
## Technical Implementation
### Architecture Pattern
**Router → Service → Repository → Entity → Database** (matching news domain)
### Database Schema
```sql
social_media_posts (
post_id, ticker, subreddit, title, content, author,
created_at, upvotes, comment_count,
sentiment_score, sentiment_label, sentiment_confidence,
embedding vector(1536), -- pgvector column type, indexed via pgvectorscale
data_quality_score, processing_status
)
```
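The column list above can be mirrored as a typed entity. The following is a minimal sketch using a plain dataclass as a stand-in for the SQLAlchemy entity the spec calls for; the Python types, the `"pending"` default status, and the 0.0 to 1.0 confidence range are assumptions, not part of the schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

EMBEDDING_DIM = 1536  # matches vector(1536) in the schema


@dataclass
class SocialMediaPost:
    """In-memory mirror of one social_media_posts row (types are assumed)."""

    post_id: str
    ticker: str
    subreddit: str
    title: str
    content: str
    author: str
    created_at: datetime
    upvotes: int = 0
    comment_count: int = 0
    sentiment_score: Optional[float] = None       # -1.0 .. +1.0 once analyzed
    sentiment_label: Optional[str] = None
    sentiment_confidence: Optional[float] = None  # assumed range 0.0 .. 1.0
    embedding: Optional[list[float]] = None
    data_quality_score: Optional[float] = None
    processing_status: str = "pending"            # hypothetical default value

    def __post_init__(self) -> None:
        # Reject values that could never be valid rows.
        score = self.sentiment_score
        if score is not None and not -1.0 <= score <= 1.0:
            raise ValueError("sentiment_score must be within [-1.0, 1.0]")
        if self.embedding is not None and len(self.embedding) != EMBEDDING_DIM:
            raise ValueError(f"embedding must have {EMBEDDING_DIM} dimensions")


post = SocialMediaPost(
    post_id="abc123", ticker="AAPL", subreddit="stocks",
    title="Earnings thread", content="...", author="example_user",
    created_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Validating in `__post_init__` keeps malformed scores and wrongly sized embeddings out of the repository layer before any database round trip.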
### Key Components
#### 1. RedditClient
- PRAW integration with rate limiting
- Financial subreddit targeting
- Ticker-specific post filtering
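Ticker-specific filtering could look roughly like the sketch below. It works on already-fetched posts represented as plain dicts, so it does not touch PRAW; the function names and the cashtag/bare-symbol matching rules are illustrative assumptions rather than the project's actual logic.

```python
import re

# Cashtags such as "$AAPL" (case-insensitive), 1 to 5 letters.
_CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")
# Bare upper-case tokens such as "AAPL". Short symbols ("A", "IT") will
# produce false positives, so a real filter would need stricter rules.
_BARE = re.compile(r"\b([A-Z]{1,5})\b")


def mentions_ticker(text: str, ticker: str) -> bool:
    """Return True if the text mentions the ticker as a cashtag or bare symbol."""
    ticker = ticker.upper()
    cashtags = {match.upper() for match in _CASHTAG.findall(text)}
    if ticker in cashtags:
        return True
    return ticker in _BARE.findall(text)


def filter_posts(posts: list[dict], ticker: str) -> list[dict]:
    """Keep posts whose title or content mentions the ticker."""
    return [
        post for post in posts
        if mentions_ticker(f"{post['title']} {post['content']}", ticker)
    ]


sample = [
    {"title": "Buying $aapl calls", "content": "Earnings next week"},
    {"title": "Market recap", "content": "Nothing notable today"},
]
matching = filter_posts(sample, "AAPL")  # keeps only the first post
```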
#### 2. SentimentAnalyzer
- OpenRouter LLM integration
- Structured sentiment scoring (-1.0 to +1.0)
- Financial context awareness
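The structured scoring step can be illustrated by how an LLM reply might be validated. This sketch assumes the model is prompted to return JSON with `score` and `confidence` keys; that response shape, the label names, and the 0.2 neutral band are hypothetical choices, and the OpenRouter call itself is omitted.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SentimentResult:
    score: float       # clamped to [-1.0, +1.0]
    label: str         # "positive", "neutral", or "negative" (assumed labels)
    confidence: float  # clamped to [0.0, 1.0]


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def label_for(score: float, neutral_band: float = 0.2) -> str:
    """Map a score to a label; scores inside the band count as neutral."""
    if score >= neutral_band:
        return "positive"
    if score <= -neutral_band:
        return "negative"
    return "neutral"


def parse_sentiment(raw: str) -> SentimentResult:
    """Parse the (assumed) JSON reply and force values into valid ranges."""
    data = json.loads(raw)
    score = _clamp(float(data["score"]), -1.0, 1.0)
    confidence = _clamp(float(data.get("confidence", 0.0)), 0.0, 1.0)
    return SentimentResult(score, label_for(score), confidence)


result = parse_sentiment('{"score": 1.7, "confidence": 0.9}')
```

Clamping instead of rejecting out-of-range values is a design choice: it keeps the pipeline moving when the model overshoots, at the cost of hiding prompt problems that stricter validation would surface.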
#### 3. SocialRepository
- PostgreSQL with deduplication by `post_id`
- Vector similarity search using pgvectorscale
- TimescaleDB time-series optimization
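The repository's two core behaviors, deduplication by `post_id` and similarity ranking, can be sketched with an in-memory stand-in. In the real design PostgreSQL would do both (for example, an upsert keyed on `post_id` and a vector distance query); the class below is hypothetical and only illustrates the intended semantics.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySocialRepository:
    """Hypothetical stand-in for SocialRepository, keyed by post_id."""

    def __init__(self) -> None:
        self._posts: dict[str, dict] = {}

    def upsert(self, post: dict) -> bool:
        """Store the post; return True if new, False if it replaced a duplicate."""
        is_new = post["post_id"] not in self._posts
        self._posts[post["post_id"]] = post
        return is_new

    def find_similar(self, query: list[float], ticker: str, limit: int = 5) -> list[dict]:
        """Return up to `limit` posts for the ticker, most similar first."""
        candidates = [
            p for p in self._posts.values()
            if p["ticker"] == ticker and p.get("embedding")
        ]
        candidates.sort(
            key=lambda p: cosine_similarity(query, p["embedding"]), reverse=True
        )
        return candidates[:limit]


repo = InMemorySocialRepository()
repo.upsert({"post_id": "a", "ticker": "AAPL", "embedding": [1.0, 0.0]})
repo.upsert({"post_id": "b", "ticker": "AAPL", "embedding": [0.0, 1.0]})
first_seen = repo.upsert({"post_id": "c", "ticker": "AAPL", "embedding": [0.9, 0.1]})
duplicate = repo.upsert({"post_id": "c", "ticker": "AAPL", "embedding": [0.9, 0.1]})
nearest = repo.find_similar([1.0, 0.0], "AAPL", limit=2)
```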
#### 4. SocialMediaService
- Orchestrates collection pipeline: Reddit → Sentiment → Embeddings → Storage
- Provides ticker-specific social context
- Calculates aggregate sentiment metrics
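One plausible reading of "aggregate sentiment metrics" is a set of per-ticker summary statistics. The sketch below computes a plain mean, a confidence-weighted mean, and the share of positive posts; these particular metrics and the `SentimentSummary` type are assumptions, since the spec does not enumerate them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentimentSummary:
    post_count: int
    mean_score: float      # unweighted average of scores
    weighted_score: float  # average weighted by sentiment_confidence
    positive_ratio: float  # share of posts with score > 0


def aggregate_sentiment(posts: list[tuple[float, float]]) -> SentimentSummary:
    """Summarize (score, confidence) pairs for one ticker."""
    if not posts:
        return SentimentSummary(0, 0.0, 0.0, 0.0)
    count = len(posts)
    mean_score = sum(score for score, _ in posts) / count
    total_confidence = sum(conf for _, conf in posts)
    # Fall back to 0.0 when every post has zero confidence.
    weighted_score = (
        sum(score * conf for score, conf in posts) / total_confidence
        if total_confidence else 0.0
    )
    positive_ratio = sum(1 for score, _ in posts if score > 0) / count
    return SentimentSummary(count, mean_score, weighted_score, positive_ratio)


summary = aggregate_sentiment([(1.0, 1.0), (-1.0, 0.0)])
```

In this example the plain mean is 0.0 while the weighted score is 1.0, because the negative post carries zero confidence and therefore no weight.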
#### 5. AgentToolkit Integration
```python
from typing import Optional

async def get_reddit_news(ticker: str, days: int = 7) -> str:
    """Return formatted social media context with sentiment analysis."""
    ...

async def get_reddit_stock_info(ticker: str, query: Optional[str] = None) -> str:
    """Return semantic search results with sentiment aggregation."""
    ...
```
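Both toolkit methods return a string for the agent to read, so some formatting step is implied. The following sketch shows one way that string might be built from stored posts; the `format_social_context` helper, the dict keys it reads, and the output layout are all hypothetical.

```python
def format_social_context(ticker: str, posts: list[dict], days: int = 7) -> str:
    """Render posts as a short text summary, highest-upvoted first."""
    if not posts:
        return f"No Reddit posts found for {ticker} in the last {days} days."
    average = sum(p["sentiment_score"] for p in posts) / len(posts)
    lines = [
        f"## Reddit sentiment for {ticker} (last {days} days)",
        f"Posts: {len(posts)} | Average sentiment: {average:+.2f}",
        "",
    ]
    for post in sorted(posts, key=lambda p: p["upvotes"], reverse=True):
        lines.append(
            f"- [r/{post['subreddit']}] {post['title']} "
            f"({post['sentiment_label']}, {post['sentiment_score']:+.2f}, "
            f"{post['upvotes']} upvotes)"
        )
    return "\n".join(lines)


context = format_social_context("AAPL", [
    {"subreddit": "stocks", "title": "Solid quarter", "upvotes": 10,
     "sentiment_score": 0.5, "sentiment_label": "positive"},
    {"subreddit": "investing", "title": "Valuation worries", "upvotes": 42,
     "sentiment_score": -0.1, "sentiment_label": "neutral"},
])
```

Sorting by upvotes puts the most visible discussion first, which matters when the agent's context window truncates long tool outputs.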
## Implementation Scope
### Complete Implementation (In Scope) ✅
- PostgreSQL migration from file storage
- Reddit API client (currently empty stub)
- SQLAlchemy entities with vector fields
- LLM sentiment analysis pipeline
- Vector embedding generation and search
- Dagster pipeline for scheduled collection
- Comprehensive test coverage (pytest-vcr for APIs)
### Current Status
**Basic stub implementation** - requires complete rebuild of all components
### Dependencies
- Reddit API credentials (PRAW)
- OpenRouter API access
- PostgreSQL with TimescaleDB + pgvectorscale
- Existing TradingAgentsConfig
- News domain patterns for consistency
## Data Flow
1. **Dagster pipeline** triggers daily collection
2. **RedditClient** fetches posts from financial subreddits
3. **SentimentAnalyzer** processes posts via OpenRouter LLM
4. **EmbeddingGenerator** creates vector embeddings
5. **SocialRepository** stores in PostgreSQL with deduplication
6. **AI Agents** query via AgentToolkit with RAG-enhanced context
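Steps 2 through 5 of the flow above can be sketched as a single orchestration function. The four stages are passed in as plain callables, so the example runs without Reddit, OpenRouter, or PostgreSQL; the `run_collection` name, the callable signatures, and the stats dictionary are assumptions, and the Dagster trigger from step 1 is left out.

```python
from typing import Callable, Iterable


def run_collection(
    fetch: Callable[[], Iterable[dict]],  # RedditClient stand-in
    analyze: Callable[[str], dict],       # SentimentAnalyzer stand-in
    embed: Callable[[str], list[float]],  # EmbeddingGenerator stand-in
    store: Callable[[dict], bool],        # SocialRepository stand-in; True if new
) -> dict[str, int]:
    """Run fetch -> sentiment -> embedding -> storage and count the outcomes."""
    stats = {"fetched": 0, "stored": 0, "duplicates": 0, "failed": 0}
    for post in fetch():
        stats["fetched"] += 1
        try:
            text = f"{post['title']}\n{post['content']}"
            enriched = {**post, **analyze(text), "embedding": embed(text)}
            if store(enriched):
                stats["stored"] += 1
            else:
                stats["duplicates"] += 1
        except Exception:
            # One bad post should not abort the whole daily batch.
            stats["failed"] += 1
    return stats


# Usage with trivial stand-ins for each stage.
_seen: set[str] = set()


def _store(post: dict) -> bool:
    is_new = post["post_id"] not in _seen
    _seen.add(post["post_id"])
    return is_new


stats = run_collection(
    fetch=lambda: [
        {"post_id": "a", "title": "Up", "content": "bullish"},
        {"post_id": "a", "title": "Up", "content": "bullish"},
    ],
    analyze=lambda text: {"sentiment_score": 0.5},
    embed=lambda text: [0.0, 1.0],
    store=_store,
)
```

Injecting the stages as callables also fits the testing strategy: service logic can be exercised with mocks while the repository gets its own integration tests.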
## Testing Strategy
- **pytest-vcr** for Reddit API mocking
- **Real PostgreSQL** for repository integration tests
- **Service mocks** for business logic testing
- **85%+ coverage** matching project standards
## Success Criteria
- Daily automated Reddit collection with sentiment analysis
- Sub-2-second agent queries with vector search
- Seamless RAG integration matching news domain patterns
- Production-ready reliability with comprehensive error handling