Social Media Domain - Specification Lite
Summary
Complete implementation of social media data collection from Reddit with LLM sentiment analysis and vector embeddings for AI agent RAG integration.
Core Requirements
Data Collection
- Daily Reddit collection from financial subreddits (wallstreetbets, investing, stocks, SecurityAnalysis)
- OpenRouter LLM sentiment analysis with confidence scoring
- Vector embeddings for semantic similarity search
- PostgreSQL storage with TimescaleDB + pgvectorscale optimization
Agent Integration
- AgentToolkit methods: get_reddit_news() and get_reddit_stock_info()
- RAG-enhanced queries with < 2 second response time
- Vector similarity search for contextual social media insights
Technical Implementation
Architecture Pattern
Router → Service → Repository → Entity → Database (matching news domain)
Database Schema
social_media_posts (
post_id, ticker, subreddit, title, content, author,
created_at, upvotes, comment_count,
sentiment_score, sentiment_label, sentiment_confidence,
embedding vector(1536), -- pgvector type, indexed via pgvectorscale
data_quality_score, processing_status
)
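To make the column list concrete, here is a minimal sketch of the row as a plain Python dataclass. The field types, the default "pending" status, and the validation rules are assumptions for illustration; the schema above names the columns but does not specify types beyond vector(1536).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

EMBEDDING_DIM = 1536  # matches vector(1536) in the schema


@dataclass
class SocialMediaPost:
    """In-memory mirror of one social_media_posts row (types are assumed)."""

    post_id: str
    ticker: str
    subreddit: str
    title: str
    content: str
    author: str
    created_at: datetime
    upvotes: int = 0
    comment_count: int = 0
    sentiment_score: Optional[float] = None
    sentiment_label: Optional[str] = None
    sentiment_confidence: Optional[float] = None
    embedding: Optional[list[float]] = None
    data_quality_score: Optional[float] = None
    processing_status: str = "pending"  # assumed status value

    def __post_init__(self) -> None:
        # Reject values that could never be valid for the columns above.
        if self.sentiment_score is not None and not -1.0 <= self.sentiment_score <= 1.0:
            raise ValueError("sentiment_score must be within [-1.0, 1.0]")
        if self.embedding is not None and len(self.embedding) != EMBEDDING_DIM:
            raise ValueError(f"embedding must have {EMBEDDING_DIM} dimensions")


example = SocialMediaPost(
    post_id="abc123",
    ticker="AAPL",
    subreddit="stocks",
    title="Earnings thread",
    content="Discussion of quarterly results",
    author="some_user",
    created_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    sentiment_score=0.5,
)
```

A SQLAlchemy entity would carry the same fields; the dataclass is used here only so the sketch runs without a database.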
Key Components
1. RedditClient
- PRAW integration with rate limiting
- Financial subreddit targeting
- Ticker-specific post filtering
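Ticker-specific filtering could be sketched as below. The helper names and the matching rule (a "$TSLA" cashtag or a bare uppercase token) are assumptions rather than the project's actual logic, and posts are represented as plain dicts instead of PRAW objects.

```python
import re


def mentions_ticker(text: str, ticker: str) -> bool:
    """Return True if text contains the ticker as a cashtag or uppercase token.

    Matching is case-sensitive on purpose, so a ticker such as ALL does not
    match the ordinary word "all".
    """
    symbol = re.escape(ticker.upper())
    pattern = rf"(?<![A-Za-z0-9$])\$?{symbol}(?![A-Za-z0-9])"
    return re.search(pattern, text) is not None


def filter_posts(posts: list[dict], ticker: str) -> list[dict]:
    """Keep posts whose title or content mentions the ticker."""
    return [
        post
        for post in posts
        if mentions_ticker(f"{post.get('title', '')} {post.get('content', '')}", ticker)
    ]


sample_posts = [
    {"title": "Buying $TSLA calls", "content": ""},
    {"title": "TSLAQ gang", "content": "no real mention here"},
    {"title": "Weekly thread", "content": "Thoughts on TSLA earnings?"},
]
matching = filter_posts(sample_posts, "tsla")
```

A stricter variant might validate tokens against a list of known symbols to reduce false positives on short tickers.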
2. SentimentAnalyzer
- OpenRouter LLM integration
- Structured sentiment scoring (-1.0 to +1.0)
- Financial context awareness
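One way to keep the scoring structured is to ask the LLM for JSON and validate it before storage. The sketch below covers only that parsing step; the reply shape ({"score": ..., "confidence": ...}), the 0.0 to 1.0 confidence range, and the label thresholds are assumptions, and no OpenRouter call is shown.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SentimentResult:
    score: float       # clamped to -1.0 .. +1.0, as in the spec
    label: str         # "positive", "neutral" or "negative"
    confidence: float  # assumed range 0.0 .. 1.0


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def parse_sentiment(raw_reply: str) -> SentimentResult:
    """Parse a JSON reply from the model into a validated SentimentResult."""
    data = json.loads(raw_reply)
    score = _clamp(float(data["score"]), -1.0, 1.0)
    confidence = _clamp(float(data.get("confidence", 0.0)), 0.0, 1.0)
    # The +/-0.2 neutral band is an illustrative threshold, not a project rule.
    if score > 0.2:
        label = "positive"
    elif score < -0.2:
        label = "negative"
    else:
        label = "neutral"
    return SentimentResult(score=score, label=label, confidence=confidence)


result = parse_sentiment('{"score": 1.7, "confidence": 0.9}')
```

Clamping rather than rejecting out-of-range scores is a design choice; a stricter pipeline could mark such posts as failed in processing_status instead.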
3. SocialRepository
- PostgreSQL with deduplication by post_id
- Vector similarity search using pgvectorscale
- TimescaleDB time-series optimization
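The repository contract (deduplicate by post_id, rank by vector similarity) can be illustrated with an in-memory stand-in. This is not the PostgreSQL implementation: the class name, method names, and the brute-force cosine ranking are assumptions that only mirror the behaviour a pgvectorscale-backed query would provide.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySocialRepository:
    """Hypothetical stand-in for SocialRepository, keyed by post_id."""

    def __init__(self) -> None:
        self._posts: dict[str, dict] = {}

    def add(self, post: dict) -> bool:
        """Store a post; return False when its post_id is already present."""
        if post["post_id"] in self._posts:
            return False
        self._posts[post["post_id"]] = post
        return True

    def search_similar(self, query_embedding: list[float], limit: int = 5) -> list[dict]:
        """Return up to `limit` posts, most similar to the query first."""
        scored = [
            (cosine_similarity(query_embedding, post["embedding"]), post)
            for post in self._posts.values()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [post for _, post in scored[:limit]]
```

In the database-backed version, deduplication would typically rely on a unique constraint on post_id, and the ranking would be pushed down into an indexed vector query.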
4. SocialMediaService
- Orchestrates collection pipeline: Reddit → Sentiment → Embeddings → Storage
- Provides ticker-specific social context
- Calculates aggregate sentiment metrics
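As an illustration of the aggregate metrics, the sketch below computes a plain mean and a confidence-weighted mean over scored posts. The function name, the returned keys, and the choice of confidence weighting are assumptions; the spec does not say which aggregates are calculated.

```python
from typing import Optional


def aggregate_sentiment(posts: list[dict]) -> dict[str, Optional[float]]:
    """Summarise sentiment over posts that already carry a sentiment_score."""
    scored = [p for p in posts if p.get("sentiment_score") is not None]
    if not scored:
        return {"post_count": 0, "mean_score": None, "weighted_score": None}

    mean_score = sum(p["sentiment_score"] for p in scored) / len(scored)

    # Weight each score by the model's confidence; fall back to the plain
    # mean when no post reports any confidence.
    total_confidence = sum(p.get("sentiment_confidence") or 0.0 for p in scored)
    if total_confidence:
        weighted_score = (
            sum(p["sentiment_score"] * (p.get("sentiment_confidence") or 0.0) for p in scored)
            / total_confidence
        )
    else:
        weighted_score = mean_score

    return {
        "post_count": len(scored),
        "mean_score": mean_score,
        "weighted_score": weighted_score,
    }


summary = aggregate_sentiment(
    [
        {"sentiment_score": 1.0, "sentiment_confidence": 0.75},
        {"sentiment_score": -1.0, "sentiment_confidence": 0.25},
        {"sentiment_score": None},  # not yet analysed, so ignored
    ]
)
```

Here the plain mean is neutral while the weighted score leans positive, because the positive post carries more confidence.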
5. AgentToolkit Integration
from typing import Optional

async def get_reddit_news(ticker: str, days: int = 7) -> str:
    """Return formatted social media context with sentiment analysis."""

async def get_reddit_stock_info(ticker: str, query: Optional[str] = None) -> str:
    """Return semantic search results with sentiment aggregation."""
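A possible body for get_reddit_news is sketched below. The extra fetch_posts parameter, the post dict keys, and the output layout are assumptions made so the example runs without a database; the real method would presumably obtain posts through SocialMediaService.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical fetcher signature: (ticker, days) -> list of post dicts.
PostFetcher = Callable[[str, int], Awaitable[list[dict]]]


async def get_reddit_news(ticker: str, days: int = 7, *, fetch_posts: PostFetcher) -> str:
    """Format recent posts for a ticker as plain text for an agent prompt."""
    posts = await fetch_posts(ticker, days)
    if not posts:
        return f"No Reddit posts found for {ticker} in the last {days} days."

    lines = [f"Reddit posts for {ticker} (last {days} days):"]
    for post in posts:
        lines.append(
            f"- [r/{post['subreddit']}] {post['title']} "
            f"(sentiment: {post['sentiment_label']}, score {post['sentiment_score']:+.2f})"
        )
    return "\n".join(lines)


async def _fake_fetch(ticker: str, days: int) -> list[dict]:
    # Stand-in for a repository query; returns one canned post.
    return [
        {
            "subreddit": "stocks",
            "title": "Earnings beat",
            "sentiment_label": "positive",
            "sentiment_score": 0.6,
        }
    ]


report = asyncio.run(get_reddit_news("AAPL", fetch_posts=_fake_fetch))
```

Returning a plain string keeps the tool output directly usable in an agent prompt, which matches the -> str return type in the signatures above.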
Implementation Scope
Complete Implementation ✅
- PostgreSQL migration from file storage
- Reddit API client (currently empty stub)
- SQLAlchemy entities with vector fields
- LLM sentiment analysis pipeline
- Vector embedding generation and search
- Dagster pipeline for scheduled collection
- Comprehensive test coverage (pytest-vcr for APIs)
Current Status
Basic stub implementation only; all components require a complete rebuild.
Dependencies
- Reddit API credentials (PRAW)
- OpenRouter API access
- PostgreSQL with TimescaleDB + pgvectorscale
- Existing TradingAgentsConfig
- News domain patterns for consistency
Data Flow
- Dagster pipeline triggers daily collection
- RedditClient fetches posts from financial subreddits
- SentimentAnalyzer processes posts via OpenRouter LLM
- EmbeddingGenerator creates vector embeddings
- SocialRepository stores in PostgreSQL with deduplication
- AI Agents query via AgentToolkit with RAG-enhanced context
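The collection steps above can be sketched as a single function with each stage injected as a callable. The function name, the stats keys, and the per-post error handling are assumptions; the Dagster scheduling and the agent query step are left out.

```python
from typing import Callable, Iterable


def run_collection(
    fetch: Callable[[], Iterable[dict]],
    analyze: Callable[[str], dict],
    embed: Callable[[str], list[float]],
    store: Callable[[dict], bool],
) -> dict[str, int]:
    """Run fetch -> sentiment -> embedding -> storage and return counters."""
    stats = {"fetched": 0, "stored": 0, "duplicates": 0, "failed": 0}
    for post in fetch():
        stats["fetched"] += 1
        try:
            text = f"{post['title']}\n{post.get('content', '')}"
            enriched = {**post, **analyze(text), "embedding": embed(text)}
            # store() is assumed to return False when post_id already exists.
            if store(enriched):
                stats["stored"] += 1
            else:
                stats["duplicates"] += 1
        except Exception:
            # One bad post should not abort the whole daily run.
            stats["failed"] += 1
    return stats
```

Injecting the stages keeps the orchestration testable with simple fakes, and the returned counters give a scheduled job something concrete to log or alert on.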
Testing Strategy
- pytest-vcr for Reddit API mocking
- Real PostgreSQL for repository integration tests
- Service mocks for business logic testing
- 85%+ coverage matching project standards
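A service mock for business logic testing might look like the sketch below, using unittest.mock from the standard library. The AgentToolkit shape, the sentiment_summary method, and the get_aggregate_sentiment service method are hypothetical names chosen for the example.

```python
from unittest.mock import Mock


class AgentToolkit:
    """Hypothetical minimal toolkit that delegates to a service object."""

    def __init__(self, service) -> None:
        self._service = service

    def sentiment_summary(self, ticker: str) -> str:
        metrics = self._service.get_aggregate_sentiment(ticker)
        return (
            f"{ticker}: mean sentiment {metrics['mean_score']:+.2f} "
            f"over {metrics['post_count']} posts"
        )


def test_sentiment_summary_formats_service_metrics() -> None:
    # The service is mocked, so no Reddit, LLM, or database access happens.
    service = Mock()
    service.get_aggregate_sentiment.return_value = {"mean_score": 0.25, "post_count": 4}

    toolkit = AgentToolkit(service)

    assert toolkit.sentiment_summary("AAPL") == "AAPL: mean sentiment +0.25 over 4 posts"
    service.get_aggregate_sentiment.assert_called_once_with("AAPL")
```

pytest would collect the test function by its test_ prefix; repository behaviour itself stays covered by the real-PostgreSQL integration tests listed above.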
Success Criteria
- Daily automated Reddit collection with sentiment analysis
- Sub-2-second agent queries with vector search
- Seamless RAG integration matching news domain patterns
- Production-ready reliability with comprehensive error handling