# Social Media Domain - Technical Design Document

## Executive Summary

This document specifies the complete greenfield implementation of the Social Media domain within TradingAgents, transitioning from empty stubs to a production-ready system for collecting and analyzing social media sentiment from financial subreddits. This domain will provide AI agents with social sentiment context for trading decisions through a PostgreSQL + TimescaleDB + pgvectorscale architecture with RAG-powered capabilities.

**Implementation Scope**: Complete domain implementation (0% → 100% completion)

**Architecture**: PostgreSQL + TimescaleDB + pgvectorscale with PRAW Reddit integration and OpenRouter LLM processing

**Target**: 400+ posts daily across 4 financial subreddits with 85%+ test coverage

---

## 1. Architecture Overview

### 1.1 System Architecture

The Social Media domain follows the established layered architecture pattern while introducing new capabilities for social media data collection and semantic search:

```
┌─────────────────────────────────────────────────────────────┐
│                      Dagster Pipeline                       │
│                   (Scheduled Collection)                    │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                        RedditClient                         │
│                   (PRAW + Rate Limiting)                    │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                     SocialMediaService                      │
│             (Business Logic + LLM Integration)              │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                      SocialRepository                       │
│         (PostgreSQL + TimescaleDB + pgvectorscale)          │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│          PostgreSQL + TimescaleDB + pgvectorscale           │
│               (Time-series + Vector Storage)                │
└─────────────────────────────────────────────────────────────┘
```

### 1.2 Data Flow Architecture
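As a rough orientation for the collection flow in this section, the stages can be read as a fetch, enrich, store pipeline in which enrichment is best-effort. The following is a minimal sketch under stated assumptions, not the project's implementation: `RawPost`, `run_collection`, and the `fake_*` stages are hypothetical names, and the three callables merely stand in for `RedditClient`, the LLM-backed service logic, and `SocialRepository`.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Optional


@dataclass
class RawPost:
    """Hypothetical, trimmed-down stand-in for a collected Reddit post."""
    post_id: str
    title: str
    sentiment: Optional[str] = None  # filled in by the enrichment stage


async def run_collection(
    subreddits: List[str],
    fetch: Callable[[str], Awaitable[List[RawPost]]],
    enrich: Callable[[RawPost], Awaitable[RawPost]],
    store: Callable[[List[RawPost]], Awaitable[int]],
) -> int:
    """Run fetch -> enrich -> store per subreddit and return rows stored."""
    stored = 0
    for name in subreddits:
        try:
            posts = await fetch(name)
        except Exception:
            continue  # one failing subreddit must not stop the others
        enriched: List[RawPost] = []
        for post in posts:
            try:
                enriched.append(await enrich(post))
            except Exception:
                enriched.append(post)  # keep the post without sentiment
        stored += await store(enriched)
    return stored


# --- Stub stages used only to exercise the sketch (all hypothetical) ---
async def fake_fetch(subreddit: str) -> List[RawPost]:
    if subreddit == "broken":
        raise RuntimeError("simulated Reddit API failure")
    return [
        RawPost(f"t3_{subreddit}1", "AAPL earnings"),
        RawPost(f"t3_{subreddit}2", "fail me"),
    ]


async def fake_enrich(post: RawPost) -> RawPost:
    if post.title == "fail me":
        raise RuntimeError("simulated LLM outage")
    post.sentiment = "positive"
    return post


SAVED: List[RawPost] = []


async def fake_store(posts: List[RawPost]) -> int:
    SAVED.extend(posts)
    return len(posts)


if __name__ == "__main__":
    total = asyncio.run(
        run_collection(
            ["stocks", "broken", "investing"], fake_fetch, fake_enrich, fake_store
        )
    )
    print(total, [p.sentiment for p in SAVED])
```

The design choice illustrated here is that failures are contained at the smallest sensible unit: a fetch error skips one subreddit, and an enrichment error leaves one post without sentiment, while storage still receives everything that was collected.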
**Collection Flow:**
```
Reddit API → RedditClient → SocialMediaService → OpenRouter LLM → SocialRepository → PostgreSQL + Vector Storage
```

**Agent Query Flow:**
```
AgentToolkit → SocialMediaService → SocialRepository → Vector Similarity Search + Sentiment Aggregation → Structured Response
```

### 1.3 Key Architectural Principles

- **Consistent Patterns**: Follow news domain architecture for maintainability
- **Vector-Enhanced Search**: Semantic similarity using pgvectorscale for contextual social media analysis
- **Best-Effort Processing**: Continue operation even when LLM services are unavailable
- **Rate Limiting Compliance**: Respect Reddit API limits with exponential backoff
- **Event-Driven Design**: Publish domain events for system integration

---

## 2. Domain Model

### 2.1 Core Entities

#### SocialPost (Domain Entity)

The primary domain entity managing business rules and data transformations:

```python
from __future__ import annotations  # allows forward references in annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional

import praw


@dataclass
class SocialPost:
    """Core domain entity for Reddit posts with sentiment and engagement data."""

    # Core Reddit Data
    post_id: str                 # Reddit fullname, e.g. 't3_abc123' (PRAW: submission.fullname)
    title: str                   # Post title
    content: Optional[str]       # Post content (selftext for text posts)
    author: str                  # Reddit username
    subreddit: str               # Subreddit name
    created_utc: datetime        # Post creation time
    url: str                     # Reddit permalink or external URL

    # Engagement Metrics
    upvotes: int                 # Post score
    downvotes: int               # Calculated from score + upvote_ratio
    comments_count: int          # Number of comments

    # Enhanced Data
    sentiment_score: Optional[SentimentScore] = None
    tickers: List[str] = field(default_factory=list)
    title_embedding: Optional[List[float]] = None
    content_embedding: Optional[List[float]] = None

    @classmethod
    def from_praw_submission(cls, submission: praw.models.Submission) -> 'SocialPost':
        """Create SocialPost from PRAW Submission object."""

    def to_entity(self) -> SocialMediaPostEntity:
        """Transform to database entity for storage."""

    def validate(self) -> List[str]:
        """Validate business rules and return errors."""

    def extract_tickers(self) -> List[str]:
        """Extract stock ticker symbols from title and content."""

    def has_reliable_sentiment(self) -> bool:
        """Check if sentiment confidence >= 0.5."""

    def to_response(self) -> Dict[str, Any]:
        """Format for agent consumption."""
```

**Validation Rules:**
- `post_id` must match the Reddit fullname format (starts with 't3_')
- `title` cannot be empty
- `created_utc` cannot be in the future
- `sentiment_score.confidence` must be 0.0-1.0
- `title_embedding` and `content_embedding` must be 1536 dimensions if present
- `subreddit` must be in the allowed financial subreddits list

#### SentimentScore (Value Object)

Structured sentiment analysis result from OpenRouter LLM:

```python
from dataclasses import dataclass
from typing import Any, Dict, Literal


@dataclass
class SentimentScore:
    """Structured sentiment analysis result with confidence and reasoning."""

    sentiment: Literal['positive', 'negative', 'neutral']
    confidence: float            # 0.0-1.0
    reasoning: str               # Brief explanation

    def is_reliable(self) -> bool:
        """Check if confidence >= 0.5 for reliable sentiment."""
        return self.confidence >= 0.5

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for JSON storage."""
```

#### SocialJobConfig (Configuration)

Configuration for scheduled Reddit collection:

```python
@dataclass
class SocialJobConfig:
    """Configuration for scheduled Reddit data collection."""

    # Collection Settings
    subreddits: List[str] = field(default_factory=lambda: [
        'wallstreetbets', 'investing', 'stocks', 'SecurityAnalysis'
    ])
    max_posts_per_subreddit: int = 50
    lookback_hours: int = 12
    min_score: int = 10

    # Processing Settings
    sentiment_model: str = "anthropic/claude-3.5-haiku"
    embedding_model: str = "text-embedding-3-large"

    # Rate Limiting
    rate_limit_delay: float = 1.0  # seconds between API calls

    # Scheduling
    schedule_times: List[str] = field(default_factory=lambda: [
        '0 6 * * *',   # 6 AM UTC
        '0 18 * * *'   # 6 PM UTC
    ])
```

---

## 3. Database Design

### 3.1 Schema Definition

The `social_media_posts` table leverages PostgreSQL with TimescaleDB for time-series optimization and pgvectorscale for vector similarity search:

```sql
-- Core table definition
-- uuid7() is assumed to come from an extension or user-defined function;
-- it is not a PostgreSQL built-in.
CREATE TABLE social_media_posts (
    id UUID NOT NULL DEFAULT uuid7(),
    post_id VARCHAR(50) NOT NULL,
    title TEXT NOT NULL,
    content TEXT,
    author VARCHAR(100) NOT NULL,
    subreddit VARCHAR(50) NOT NULL,
    created_utc TIMESTAMPTZ NOT NULL,
    upvotes INTEGER NOT NULL DEFAULT 0,
    downvotes INTEGER NOT NULL DEFAULT 0,
    comments_count INTEGER NOT NULL DEFAULT 0,
    url TEXT NOT NULL,
    sentiment_score JSONB,
    sentiment_label VARCHAR(20),
    tickers TEXT[] DEFAULT '{}',
    title_embedding VECTOR(1536),
    content_embedding VECTOR(1536),
    inserted_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    -- TimescaleDB requires every unique key on a hypertable to include the
    -- partitioning column, so the primary key carries created_utc.
    PRIMARY KEY (id, created_utc)
);

-- TimescaleDB hypertable for time-series optimization
SELECT create_hypertable('social_media_posts', 'created_utc', chunk_time_interval => INTERVAL '1 day');

-- Performance indexes
-- A post's created_utc never changes, so (post_id, created_utc) still deduplicates by post.
CREATE UNIQUE INDEX idx_social_posts_post_id ON social_media_posts (post_id, created_utc);
CREATE INDEX idx_social_posts_subreddit_time ON social_media_posts (subreddit, created_utc DESC);
CREATE INDEX idx_social_posts_tickers_gin ON social_media_posts USING GIN (tickers);
-- pgvectorscale exposes its index as the diskann access method
CREATE INDEX idx_social_posts_title_embedding ON social_media_posts USING diskann (title_embedding vector_cosine_ops);
CREATE INDEX idx_social_posts_content_embedding ON social_media_posts USING diskann (content_embedding vector_cosine_ops);
CREATE INDEX idx_social_posts_sentiment ON social_media_posts ((sentiment_score->>'sentiment')) WHERE sentiment_score IS NOT NULL;

-- Data validation constraints
ALTER TABLE social_media_posts ADD CONSTRAINT chk_sentiment_score
    CHECK (sentiment_score IS NULL OR ((sentiment_score->>'confidence')::float BETWEEN 0 AND 1));
ALTER TABLE social_media_posts ADD CONSTRAINT chk_created_utc
    CHECK (created_utc <= NOW());
```

### 3.2 SQLAlchemy Entity

```python
from pgvector.sqlalchemy import Vector
from sqlalchemy import Column, DateTime, Index, Integer, String, Text, func
from sqlalchemy.dialects.postgresql import ARRAY, JSONB, UUID

# Base and uuid7 are assumed to be defined elsewhere in the project.


class SocialMediaPostEntity(Base):
    """SQLAlchemy entity for PostgreSQL persistence with vector support."""

    __tablename__ = "social_media_posts"
    # Composite keys mirror the DDL: TimescaleDB needs created_utc in every unique key.
    __table_args__ = (
        Index("idx_social_posts_post_id", "post_id", "created_utc", unique=True),
    )

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid7)
    post_id = Column(String(50), nullable=False)
    title = Column(Text, nullable=False)
    content = Column(Text)
    author = Column(String(100), nullable=False)
    subreddit = Column(String(50), nullable=False)
    created_utc = Column(DateTime(timezone=True), primary_key=True, nullable=False)
    upvotes = Column(Integer, nullable=False, default=0)
    downvotes = Column(Integer, nullable=False, default=0)
    comments_count = Column(Integer, nullable=False, default=0)
    url = Column(Text, nullable=False)
    sentiment_score = Column(JSONB)
    sentiment_label = Column(String(20))
    tickers = Column(ARRAY(String), default=[])
    title_embedding = Column(Vector(1536))
    content_embedding = Column(Vector(1536))
    inserted_at = Column(DateTime(timezone=True), default=func.now())
    updated_at = Column(DateTime(timezone=True), default=func.now(), onupdate=func.now())

    def to_domain(self) -> SocialPost:
        """Convert to domain entity."""

    @classmethod
    def from_domain(cls, post: SocialPost) -> 'SocialMediaPostEntity':
        """Create from domain entity."""
```

### 3.3 Access Patterns and Query Optimization

**Common Access Patterns:**
- Ticker-based queries: `SELECT * WHERE 'AAPL' = ANY(tickers)`
- Time-range filtering: `SELECT * WHERE created_utc BETWEEN ? AND ?`
- Vector similarity: `SELECT * ORDER BY title_embedding <=> ? LIMIT 10`
- Sentiment aggregations: `SELECT AVG((sentiment_score->>'confidence')::float) GROUP BY subreddit` (the JSONB column itself cannot be averaged)

**Performance Targets:**
- Vector similarity queries: < 1s for top 10 results
- Batch upserts: < 5s for 1000 posts
- Ticker-based queries: < 100ms for 30-day ranges

---

## 4. API Integration

### 4.1 Reddit Client (PRAW Integration)

Complete implementation of Reddit data collection using PRAW (Python Reddit API Wrapper):

```python
from typing import Any, Dict, List, Optional

import praw
from aiolimiter import AsyncLimiter


class RedditClient:
    """PRAW wrapper with rate limiting and error handling.

    PRAW is synchronous, so the async methods below are expected to run PRAW
    calls off the event loop (for example with asyncio.to_thread) or to use
    Async PRAW instead.
    """

    def __init__(self, config: RedditClientConfig):
        """Initialize Reddit client with OAuth2 credentials."""
        self.reddit = praw.Reddit(
            client_id=config.client_id,
            client_secret=config.client_secret,
            user_agent=config.user_agent
        )
        self.rate_limiter = AsyncLimiter(1, 1)  # 1 request per second

    async def fetch_subreddit_posts(
        self,
        subreddit: str,
        limit: int = 50,
        time_filter: str = 'day'
    ) -> List[Dict[str, Any]]:
        """Fetch posts from subreddit with rate limiting.

        Note: in PRAW, time_filter applies to the top/controversial listings,
        not to hot.
        """

    async def search_posts(
        self,
        query: str,
        subreddit: Optional[str] = None,
        limit: int = 25
    ) -> List[Dict[str, Any]]:
        """Search posts with ticker symbols or keywords."""

    async def get_post_details(self, post_id: str) -> Optional[Dict[str, Any]]:
        """Get detailed information for a specific post."""
```

**Configuration Requirements:**
- Reddit App Credentials: `client_id`, `client_secret`, `user_agent`
- Rate Limiting: 1 request per second (60 requests/minute limit)
- Error Handling: Exponential backoff for rate limits, graceful degradation for authentication errors

### 4.2 OpenRouter LLM Integration

Leverage existing OpenRouter infrastructure with social media-specific enhancements:

**Sentiment Analysis Prompt:**
```
Analyze this Reddit post about stocks/finance. Consider the informal language, memes, and community context typical of financial subreddits.

Post: {title} - {content}

Respond with valid JSON:
{
  "sentiment": "positive|negative|neutral",
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation considering context"
}
```

**Embedding Configuration:**
- Model: `text-embedding-3-large`, reduced to 1536 dimensions to match the `VECTOR(1536)` columns (the model's native output is 3072 dimensions; OpenAI's API exposes a `dimensions` parameter for this)
- Batch processing for efficiency
- Generate embeddings for both title and content when available
- Store NULL for failed embedding generation (best-effort processing)

---

## 5. Component Architecture

### 5.1 Repository Layer (Data Access)

```python
from typing import Any, Dict, List, Optional

from sqlalchemy.ext.asyncio import AsyncSession


class SocialRepository:
    """Data access layer for social media posts with vector capabilities."""

    def __init__(self, session: AsyncSession):
        self.session = session

    async def find_by_ticker(
        self,
        ticker: str,
        days: int = 30,
        limit: int = 50
    ) -> List[SocialPost]:
        """Find posts mentioning specific ticker within time range."""

    async def find_similar_posts(
        self,
        query_embedding: List[float],
        ticker: Optional[str] = None,
        limit: int = 10
    ) -> List[SocialPost]:
        """Find semantically similar posts using vector similarity."""

    async def get_sentiment_summary(
        self,
        ticker: str,
        subreddit: Optional[str] = None,
        hours: int = 24
    ) -> Dict[str, Any]:
        """Generate sentiment aggregation for ticker."""

    async def upsert_batch(self, posts: List[SocialPost]) -> List[SocialPost]:
        """Batch upsert posts with conflict resolution."""

    async def cleanup_old_posts(self, days: int = 90) -> int:
        """Remove posts older than retention period."""
```

### 5.2 Service Layer (Business Logic)

```python
class SocialMediaService:
    """Business logic orchestration with LLM integration."""

    def __init__(
        self,
        repository: SocialRepository,
        reddit_client: RedditClient,
        openrouter_client: OpenRouterClient
    ):
        self.repository = repository
        self.reddit_client = reddit_client
        self.openrouter_client = openrouter_client

    async def collect_subreddit_posts(self, config: SocialJobConfig) -> int:
        """Orchestrate complete collection process for configured subreddits."""

    async def update_post_sentiment(
        self,
        posts: List[SocialPost]
    ) -> List[SocialPost]:
        """Add sentiment analysis to posts using OpenRouter LLM."""

    async def generate_embeddings(
        self,
        posts: List[SocialPost]
    ) -> List[SocialPost]:
        """Generate vector embeddings for semantic search."""

    async def find_trending_tickers(
        self,
        hours: int = 24
    ) -> List[Dict[str, Any]]:
        """Identify trending ticker mentions across subreddits."""
```

### 5.3 Agent Integration Layer

```python
class SocialMediaAgentToolkit:
    """RAG methods for AI agent integration."""

    def __init__(self, service: SocialMediaService):
        self.service = service

    async def get_reddit_sentiment(
        self,
        ticker: str,
        days: int = 7
    ) -> Dict[str, Any]:
        """Get sentiment summary for ticker from Reddit discussions."""

    async def search_social_posts(
        self,
        query: str,
        ticker: Optional[str] = None
    ) -> List[Dict[str, Any]]:
        """Semantic search for relevant social media posts."""

    async def get_trending_discussions(
        self,
        ticker: str
    ) -> List[Dict[str, Any]]:
        """Get trending discussions and sentiment for specific ticker."""

    async def get_subreddit_analysis(
        self,
        subreddit: str,
        ticker: str
    ) -> Dict[str, Any]:
        """Analyze sentiment and engagement for ticker in specific subreddit."""
```

**Agent Response Format:**
```json
{
  "posts": [
    {
      "post_id": "t3_abc123",
      "title": "AAPL earnings beat expectations",
      "subreddit": "stocks",
      "created_utc": "2024-01-15T14:30:00Z",
      "sentiment": {
        "sentiment": "positive",
        "confidence": 0.85,
        "reasoning": "Strong positive language about earnings"
      },
      "engagement": {
        "upvotes": 245,
        "comments_count": 67
      },
      "tickers": ["AAPL"],
      "url": "https://reddit.com/r/stocks/comments/abc123"
    }
  ],
  "summary": {
    "total_posts": 15,
    "sentiment_breakdown": {
      "positive": 0.6,
      "negative": 0.2,
      "neutral": 0.2
    },
    "avg_confidence": 0.78,
    "data_quality": "high"
  }
}
```

---

## 6. Dagster Pipeline Architecture

### 6.1 Scheduled Collection Pipeline

```python
from dagster import (
    AssetExecutionContext,
    DailyPartitionsDefinition,
    MaterializeResult,
    ScheduleDefinition,
    asset,
    define_asset_job,
)


# SocialJobConfig.schema() is assumed to return a Dagster config schema;
# a plain dataclass does not provide it out of the box.
@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
    config_schema=SocialJobConfig.schema()
)
def reddit_posts_collection(context: AssetExecutionContext) -> MaterializeResult:
    """Collect Reddit posts from financial subreddits."""


@asset(deps=[reddit_posts_collection])
def reddit_sentiment_analysis(context: AssetExecutionContext) -> MaterializeResult:
    """Add sentiment analysis to collected posts."""


@asset(deps=[reddit_sentiment_analysis])
def reddit_embeddings_generation(context: AssetExecutionContext) -> MaterializeResult:
    """Generate vector embeddings for semantic search."""


# Schedule: Twice daily collection
reddit_collection_schedule = ScheduleDefinition(
    name="reddit_collection_schedule",
    job=define_asset_job("reddit_collection", selection=[
        reddit_posts_collection,
        reddit_sentiment_analysis,
        reddit_embeddings_generation
    ]),
    cron_schedule="0 6,18 * * *"  # 6 AM and 6 PM UTC
)
```

### 6.2 Data Quality and Monitoring

**Collection Metrics:**
- Posts collected per subreddit per run
- Sentiment analysis success rate
- Embedding generation success rate
- API error rates and retry attempts

**Data Quality Checks:**
- Post deduplication verification
- Sentiment confidence distribution
- Embedding vector validation
- Reddit API rate limit utilization

**Failure Handling:**
- Best-effort processing: Continue with remaining subreddits if one fails
- Exponential backoff for Reddit API rate limits
- Graceful degradation: Store posts without sentiment/embeddings if LLM fails
- Dead letter queue for failed posts with retry mechanism

---

## 7. Testing Strategy

### 7.1 Test Structure

Following the project's pragmatic outside-in TDD approach:

```
tests/domains/socialmedia/
├── __init__.py
├── test_social_post.py              # Domain entity validation
├── test_social_repository.py        # PostgreSQL + vector operations
├── test_reddit_client.py            # PRAW integration with VCR
├── test_social_media_service.py     # Business logic with mocked deps
├── test_social_agent_toolkit.py     # Agent integration methods
└── fixtures/
    ├── reddit_responses.json        # Sample PRAW responses
    └── vcr_cassettes/               # HTTP cassettes for external APIs
```

### 7.2 Testing Approach

**Unit Tests (Mock I/O boundaries):**
- `SocialPost` entity validation and transformations
- `SocialRepository` with test PostgreSQL database
- `RedditClient` with mocked PRAW responses
- `SocialMediaService` with mocked dependencies

**Integration Tests (Real components):**
- End-to-end collection pipeline with test Reddit data
- Vector similarity search with actual pgvectorscale
- LLM integration with pytest-vcr cassettes
- Dagster pipeline execution

**Performance Tests:**
- Vector similarity query performance (< 1s target)
- Batch upsert performance (< 5s for 1000 posts)
- Memory usage during large collection runs

### 7.3 Test Fixtures and Mocking

**Reddit API Mocking:**
```python
import pytest


@pytest.fixture
def mock_reddit_response():
    """Sample Reddit API response for testing."""
    return {
        "id": "abc123",
        "title": "AAPL earnings discussion",
        "selftext": "Strong quarter, bullish outlook",
        "author": "test_user",
        "subreddit_display_name": "stocks",
        "created_utc": 1705315200,
        "score": 150,
        "upvote_ratio": 0.85,
        "num_comments": 45,
        "permalink": "/r/stocks/comments/abc123/aapl_earnings/"
    }
```

**Vector Similarity Testing:**
```python
@pytest.mark.asyncio
async def test_vector_similarity_search(social_repository, sample_posts):
    """Test semantic similarity search using pgvectorscale."""
    # Insert test posts with embeddings
    await social_repository.upsert_batch(sample_posts)

    # Test similarity search
    query_embedding = [0.1] * 1536  # Sample embedding
    similar_posts = await social_repository.find_similar_posts(
        query_embedding, limit=5
    )

    assert len(similar_posts) <= 5
    assert all(post.title_embedding for post in similar_posts)
```

---

## 8. Implementation Roadmap

### 8.1 Phase 1: Database Foundation (Week 1)

**Priority 1: Database Schema**
1. Create PostgreSQL migration for `social_media_posts` table
2. Add TimescaleDB hypertable configuration
3. Set up pgvectorscale indexes for vector similarity
4. Implement data validation constraints

**Priority 2: Core Entities**
1. `SocialMediaPostEntity` (SQLAlchemy entity)
2. `SocialPost` (domain entity with validation)
3. `SentimentScore` (value object)
4. Entity transformation methods (`to_domain`, `from_domain`)

### 8.2 Phase 2: Data Collection (Week 2)

**Priority 1: Reddit Integration**
1. `RedditClient` with PRAW implementation
2. Rate limiting and error handling
3. Subreddit post collection methods
4. Reddit API authentication setup

**Priority 2: Repository Layer**
1. `SocialRepository` with PostgreSQL operations
2. Vector similarity search methods
3. Batch upsert operations
4. Sentiment aggregation queries

### 8.3 Phase 3: Processing & Intelligence (Week 3)

**Priority 1: Service Layer**
1. `SocialMediaService` business logic
2. OpenRouter LLM integration for sentiment
3. Vector embedding generation
4. Batch processing workflows

**Priority 2: Agent Integration**
1. `SocialMediaAgentToolkit` RAG methods
2. Structured response formatting
3. Context-aware social media analysis
4. Integration with existing agent workflows

### 8.4 Phase 4: Automation & Monitoring (Week 4)

**Priority 1: Dagster Pipeline**
1. Scheduled Reddit collection assets
2. Processing pipeline orchestration
3. Data quality monitoring
4. Error handling and retry logic

**Priority 2: Testing & Documentation**
1. Comprehensive test suite (>85% coverage)
2. Performance testing and optimization
3. API documentation updates
4. Integration with existing test infrastructure

---

## 9. Monitoring and Observability

### 9.1 Key Metrics

**Collection Metrics:**
- Posts collected per subreddit per day
- Collection job success/failure rates
- Reddit API rate limit utilization
- Data deduplication effectiveness

**Processing Metrics:**
- Sentiment analysis success rate and latency
- Embedding generation success rate and latency
- LLM token usage and costs
- Vector similarity query performance

**Business Metrics:**
- Active tickers with social sentiment data
- Sentiment distribution across subreddits
- Trending ticker detection accuracy
- Agent query response times

### 9.2 Alerting Strategy

**Critical Alerts:**
- Collection job failures (> 2 consecutive failures)
- Reddit API authentication errors
- Database connection failures
- High LLM processing error rates (> 20%)

**Warning Alerts:**
- Low collection volumes (< 50% of expected)
- High sentiment analysis latency (> 30s per batch)
- Vector similarity performance degradation
- Approaching Reddit API rate limits

### 9.3 Logging and Debugging

**Structured Logging Format:**
```json
{
  "timestamp": "2024-01-15T14:30:00Z",
  "level": "INFO",
  "component": "SocialMediaService",
  "operation": "collect_subreddit_posts",
  "subreddit": "stocks",
  "posts_collected": 45,
  "sentiment_analyzed": 43,
  "embeddings_generated": 41,
  "duration_ms": 12500,
  "metadata": {
    "reddit_api_calls": 3,
    "llm_tokens_used": 15420
  }
}
```

---

## 10. Security and Compliance

### 10.1 Data Privacy

**Reddit Data Handling:**
- Store only publicly available Reddit posts
- Respect user privacy: hash usernames for analytics
- Implement data retention policies (90-day maximum)
- No collection of private or deleted content

**API Key Management:**
- Environment variable storage for Reddit credentials
- OpenRouter API key rotation support
- No credential logging or persistence in plain text

### 10.2 Rate Limiting Compliance

**Reddit API Compliance:**
- Respect 60 requests per minute OAuth limit
- Implement exponential backoff for rate limit violations
- User-Agent string identification as required
- Monitor and log API usage statistics

**OpenRouter Usage:**
- Monitor token usage and costs
- Implement request batching for efficiency
- Handle API rate limits gracefully
- Cost optimization through model selection

---

## 11. Future Enhancements

### 11.1 Extended Social Media Sources

**Twitter/X Integration:**
- Similar architecture pattern for Twitter API v2
- Real-time streaming for high-frequency updates
- Hashtag and mention tracking

**News Comment Sections:**
- Integration with financial news comment sections
- Cross-platform sentiment correlation
- Enhanced context for news articles

### 11.2 Advanced Analytics

**Sentiment Trend Analysis:**
- Time-series sentiment tracking
- Volatility correlation with social sentiment
- Predictive sentiment modeling

**Influence Network Analysis:**
- User influence scoring based on engagement
- Community detection within financial subreddits
- Viral content identification and tracking

### 11.3 Real-time Processing

**Streaming Architecture:**
- Real-time Reddit post collection
- Event-driven sentiment processing
- Live sentiment dashboards for agents

**Market Hours Integration:**
- Increased collection frequency during market hours
- After-hours sentiment tracking
- Weekend vs. weekday sentiment patterns

---

This technical design provides a comprehensive blueprint for implementing the complete Social Media domain from empty stubs to a production-ready system. The architecture leverages proven patterns from the news domain while introducing specialized capabilities for social media data collection, semantic search, and AI agent integration.
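
One detail from Section 2.1 that may benefit from a worked example is the `downvotes` field, which is described only as calculated from score and `upvote_ratio`. A minimal sketch follows, assuming score = upvotes - downvotes and upvote_ratio = upvotes / (upvotes + downvotes), which together give upvotes + downvotes = score / (2 * upvote_ratio - 1). The function name `estimate_votes` is hypothetical, and because the ratio is rounded and Reddit's reported scores are not exact, the result should be treated as an estimate rather than a true count.

```python
from typing import Optional, Tuple


def estimate_votes(score: int, upvote_ratio: float) -> Optional[Tuple[int, int]]:
    """Estimate (upvotes, downvotes) from a net score and an upvote ratio.

    Returns None when the ratio is 0.5 (the total is undetermined) or when
    the score and ratio contradict each other.
    """
    denominator = 2.0 * upvote_ratio - 1.0
    if abs(denominator) < 1e-9:
        return None  # evenly split votes: the total cannot be recovered
    total = score / denominator
    if total < 0:
        return None  # e.g. a positive score with a ratio below 0.5
    upvotes = round(upvote_ratio * total)
    downvotes = round((1.0 - upvote_ratio) * total)
    return upvotes, downvotes


if __name__ == "__main__":
    # Values taken from the mock_reddit_response fixture in Section 7.3
    print(estimate_votes(150, 0.85))
```

With the fixture values from Section 7.3 (score 150, ratio 0.85), the estimated total is about 214 votes, split into roughly 182 upvotes and 32 downvotes, which is consistent with the reported net score.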