# Social Media Domain - Technical Design Document
## Executive Summary
This document specifies the greenfield implementation of the Social Media domain within TradingAgents, taking it from empty stubs to a production-ready system that collects and analyzes sentiment from financial subreddits. The domain gives AI agents social sentiment context for trading decisions, stored in PostgreSQL with TimescaleDB and pgvectorscale and exposed through RAG-style retrieval.

- **Implementation Scope**: Complete domain implementation (0% → 100% completion)
- **Architecture**: PostgreSQL + TimescaleDB + pgvectorscale with PRAW Reddit integration and OpenRouter LLM processing
- **Target**: Roughly 400 posts daily (4 financial subreddits × up to 50 posts × 2 scheduled runs, per the default `SocialJobConfig`) with 85%+ test coverage

---
## 1. Architecture Overview
### 1.1 System Architecture
The Social Media domain follows the established layered architecture pattern while introducing new capabilities for social media data collection and semantic search:
```
┌─────────────────────────────────────────────────────────────┐
│ Dagster Pipeline │
│ (Scheduled Collection) │
└─────────────────────┬───────────────────────────────────────┘
┌─────────────────────▼───────────────────────────────────────┐
│ RedditClient │
│ (PRAW + Rate Limiting) │
└─────────────────────┬───────────────────────────────────────┘
┌─────────────────────▼───────────────────────────────────────┐
│ SocialMediaService │
│ (Business Logic + LLM Integration) │
└─────────────────────┬───────────────────────────────────────┘
┌─────────────────────▼───────────────────────────────────────┐
│ SocialRepository │
│ (PostgreSQL + TimescaleDB + pgvectorscale) │
└─────────────────────┬───────────────────────────────────────┘
┌─────────────────────▼───────────────────────────────────────┐
│ PostgreSQL + TimescaleDB + pgvectorscale │
│ (Time-series + Vector Storage) │
└─────────────────────────────────────────────────────────────┘
```
### 1.2 Data Flow Architecture
**Collection Flow:**
```
Reddit API → RedditClient → SocialMediaService → OpenRouter LLM →
SocialRepository → PostgreSQL + Vector Storage
```
**Agent Query Flow:**
```
AgentToolkit → SocialMediaService → SocialRepository →
Vector Similarity Search + Sentiment Aggregation → Structured Response
```
### 1.3 Key Architectural Principles
- **Consistent Patterns**: Follow news domain architecture for maintainability
- **Vector-Enhanced Search**: Semantic similarity using pgvectorscale for contextual social media analysis
- **Best-Effort Processing**: Continue operation even when LLM services are unavailable
- **Rate Limiting Compliance**: Respect Reddit API limits with exponential backoff
- **Event-Driven Design**: Publish domain events for system integration
---
## 2. Domain Model
### 2.1 Core Entities
#### SocialPost (Domain Entity)
The primary domain entity managing business rules and data transformations:
```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional

from praw.models import Submission


@dataclass
class SocialPost:
    """Core domain entity for Reddit posts with sentiment and engagement data."""

    # Core Reddit Data
    post_id: str                    # Reddit fullname (e.g., 't3_abc123')
    title: str                      # Post title
    content: Optional[str]          # Post content (selftext for text posts)
    author: str                     # Reddit username
    subreddit: str                  # Subreddit name
    created_utc: datetime           # Post creation time
    url: str                        # Reddit permalink or external URL

    # Engagement Metrics
    upvotes: int                    # Post score
    downvotes: int                  # Calculated from score + upvote_ratio
    comments_count: int             # Number of comments

    # Enhanced Data
    sentiment_score: Optional[SentimentScore] = None
    tickers: List[str] = field(default_factory=list)
    title_embedding: Optional[List[float]] = None
    content_embedding: Optional[List[float]] = None

    @classmethod
    def from_praw_submission(cls, submission: Submission) -> SocialPost:
        """Create SocialPost from PRAW Submission object."""

    def to_entity(self) -> SocialMediaPostEntity:
        """Transform to database entity for storage."""

    def validate(self) -> List[str]:
        """Validate business rules and return errors."""

    def extract_tickers(self) -> List[str]:
        """Extract stock ticker symbols from title and content."""

    def has_reliable_sentiment(self) -> bool:
        """Check if sentiment confidence >= 0.5."""

    def to_response(self) -> Dict[str, Any]:
        """Format for agent consumption."""
```
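The `downvotes` field above is described as calculated from the score and `upvote_ratio`. A minimal sketch of that arithmetic follows, assuming `score = ups - downs` and `upvote_ratio = ups / (ups + downs)`. The function name `estimate_votes` is hypothetical, and the result should be treated as an approximation rather than an exact count.

```python
from typing import Tuple


def estimate_votes(score: int, upvote_ratio: float) -> Tuple[int, int]:
    """Estimate (upvotes, downvotes) from a post's score and upvote ratio.

    Assumes score = ups - downs and upvote_ratio = ups / (ups + downs),
    which rearranges to ups = score * ratio / (2 * ratio - 1).
    """
    denominator = 2 * upvote_ratio - 1
    if abs(denominator) < 1e-9:
        # A ratio of 0.5 means ups == downs, so the totals cannot be
        # recovered from the score alone; fall back to zeros.
        return 0, 0
    ups = round(score * upvote_ratio / denominator)
    downs = ups - score
    # Guard against negative values caused by rounding or noisy inputs.
    return max(ups, 0), max(downs, 0)


print(estimate_votes(150, 0.85))  # (182, 32)
```

With a score of 150 and a ratio of 0.85, the estimate is 182 upvotes and 32 downvotes, which reproduces both inputs (182 - 32 = 150 and 182 / 214 ≈ 0.85).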
**Validation Rules:**
- `post_id` must match Reddit format (starts with 't3_')
- `title` cannot be empty
- `created_utc` cannot be in the future
- `sentiment_score.confidence` must be 0.0-1.0
- `title_embedding` and `content_embedding` must each have 1536 dimensions if present
- `subreddit` must be in allowed financial subreddits list
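The rules above could be checked along the following lines. This is a sketch only: `PostDraft` is a hypothetical, trimmed stand-in for `SocialPost`, the allowed subreddit list is taken from the default `SocialJobConfig`, and `created_utc` is assumed to be timezone-aware.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

ALLOWED_SUBREDDITS = {"wallstreetbets", "investing", "stocks", "SecurityAnalysis"}
EMBEDDING_DIM = 1536


@dataclass
class PostDraft:
    """Hypothetical stand-in holding only the fields that validation reads."""
    post_id: str
    title: str
    subreddit: str
    created_utc: datetime  # assumed timezone-aware (UTC)
    confidence: Optional[float] = None
    title_embedding: Optional[List[float]] = None


def validate(post: PostDraft) -> List[str]:
    """Return a list of rule violations; an empty list means the post is valid."""
    errors = []
    if not post.post_id.startswith("t3_"):
        errors.append("post_id must start with 't3_'")
    if not post.title.strip():
        errors.append("title cannot be empty")
    if post.created_utc > datetime.now(timezone.utc):
        errors.append("created_utc cannot be in the future")
    if post.confidence is not None and not 0.0 <= post.confidence <= 1.0:
        errors.append("sentiment confidence must be within 0.0-1.0")
    if post.title_embedding is not None and len(post.title_embedding) != EMBEDDING_DIM:
        errors.append(f"title_embedding must have {EMBEDDING_DIM} dimensions")
    if post.subreddit not in ALLOWED_SUBREDDITS:
        errors.append("subreddit is not in the allowed list")
    return errors
```

Returning every violation at once, instead of raising on the first, matches the `validate(self) -> List[str]` signature in the entity.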
#### SentimentScore (Value Object)
Structured sentiment analysis result from OpenRouter LLM:
```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Literal


@dataclass
class SentimentScore:
    """Structured sentiment analysis result with confidence and reasoning."""

    sentiment: Literal['positive', 'negative', 'neutral']
    confidence: float               # 0.0-1.0
    reasoning: str                  # Brief explanation

    def is_reliable(self) -> bool:
        """Check if confidence >= 0.5 for reliable sentiment."""
        return self.confidence >= 0.5

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for JSON storage."""
        return asdict(self)
```
#### SocialJobConfig (Configuration)
Configuration for scheduled Reddit collection:
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SocialJobConfig:
    """Configuration for scheduled Reddit data collection."""

    # Collection Settings
    subreddits: List[str] = field(default_factory=lambda: [
        'wallstreetbets', 'investing', 'stocks', 'SecurityAnalysis'
    ])
    max_posts_per_subreddit: int = 50
    lookback_hours: int = 12
    min_score: int = 10

    # Processing Settings
    sentiment_model: str = "anthropic/claude-3.5-haiku"
    # Native output is 3072 dimensions; request dimensions=1536 to match VECTOR(1536)
    embedding_model: str = "text-embedding-3-large"

    # Rate Limiting
    rate_limit_delay: float = 1.0  # seconds between API calls

    # Scheduling (cron expressions)
    schedule_times: List[str] = field(default_factory=lambda: [
        '0 6 * * *',   # 6 AM UTC
        '0 18 * * *'   # 6 PM UTC
    ])
```
---
## 3. Database Design
### 3.1 Schema Definition
The `social_media_posts` table leverages PostgreSQL with TimescaleDB for time-series optimization and pgvectorscale for vector similarity search:
```sql
-- Core table definition
-- uuid7() is assumed to be supplied by an extension or user-defined function.
CREATE TABLE social_media_posts (
    id UUID NOT NULL DEFAULT uuid7(),
    post_id VARCHAR(50) NOT NULL,
    title TEXT NOT NULL,
    content TEXT,
    author VARCHAR(100) NOT NULL,
    subreddit VARCHAR(50) NOT NULL,
    created_utc TIMESTAMPTZ NOT NULL,
    upvotes INTEGER NOT NULL DEFAULT 0,
    downvotes INTEGER NOT NULL DEFAULT 0,
    comments_count INTEGER NOT NULL DEFAULT 0,
    url TEXT NOT NULL,
    sentiment_score JSONB,
    sentiment_label VARCHAR(20),
    tickers TEXT[] DEFAULT '{}',
    title_embedding VECTOR(1536),
    content_embedding VECTOR(1536),
    inserted_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    -- Unique constraints on a hypertable must include the partitioning column
    PRIMARY KEY (id, created_utc)
);

-- TimescaleDB hypertable for time-series optimization
SELECT create_hypertable('social_media_posts', 'created_utc',
    chunk_time_interval => INTERVAL '1 day');

-- Performance indexes
-- The unique index also includes created_utc, as hypertables require
CREATE UNIQUE INDEX idx_social_posts_post_id ON social_media_posts (post_id, created_utc);
CREATE INDEX idx_social_posts_subreddit_time ON social_media_posts (subreddit, created_utc DESC);
CREATE INDEX idx_social_posts_tickers_gin ON social_media_posts USING GIN (tickers);
-- pgvectorscale exposes its index as the diskann access method
CREATE INDEX idx_social_posts_title_embedding ON social_media_posts
    USING diskann (title_embedding vector_cosine_ops);
CREATE INDEX idx_social_posts_content_embedding ON social_media_posts
    USING diskann (content_embedding vector_cosine_ops);
CREATE INDEX idx_social_posts_sentiment ON social_media_posts
    (((sentiment_score->>'sentiment'))) WHERE sentiment_score IS NOT NULL;

-- Data validation constraints
ALTER TABLE social_media_posts ADD CONSTRAINT chk_sentiment_score
    CHECK (sentiment_score IS NULL OR
           ((sentiment_score->>'confidence')::float BETWEEN 0 AND 1));
ALTER TABLE social_media_posts ADD CONSTRAINT chk_created_utc
    CHECK (created_utc <= NOW());
```
### 3.2 SQLAlchemy Entity
```python
from pgvector.sqlalchemy import Vector
from sqlalchemy import Column, DateTime, Integer, String, Text, func
from sqlalchemy.dialects.postgresql import ARRAY, JSONB, UUID


class SocialMediaPostEntity(Base):
    """SQLAlchemy entity for PostgreSQL persistence with vector support."""

    __tablename__ = "social_media_posts"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid7)
    post_id = Column(String(50), unique=True, nullable=False, index=True)
    title = Column(Text, nullable=False)
    content = Column(Text)
    author = Column(String(100), nullable=False)
    subreddit = Column(String(50), nullable=False)
    created_utc = Column(DateTime(timezone=True), nullable=False)
    upvotes = Column(Integer, nullable=False, default=0)
    downvotes = Column(Integer, nullable=False, default=0)
    comments_count = Column(Integer, nullable=False, default=0)
    url = Column(Text, nullable=False)
    sentiment_score = Column(JSONB)
    sentiment_label = Column(String(20))
    # Use a callable default so rows never share one mutable list
    tickers = Column(ARRAY(String), default=list)
    title_embedding = Column(Vector(1536))
    content_embedding = Column(Vector(1536))
    inserted_at = Column(DateTime(timezone=True), default=func.now())
    updated_at = Column(DateTime(timezone=True), default=func.now(), onupdate=func.now())

    def to_domain(self) -> SocialPost:
        """Convert to domain entity."""

    @classmethod
    def from_domain(cls, post: SocialPost) -> 'SocialMediaPostEntity':
        """Create from domain entity."""
```
### 3.3 Access Patterns and Query Optimization
**Common Access Patterns:**
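- Ticker-based queries: `SELECT * FROM social_media_posts WHERE tickers @> ARRAY['AAPL']` (the containment operator can use the GIN index on `tickers`; `'AAPL' = ANY(tickers)` cannot)
- Time-range filtering: `SELECT * FROM social_media_posts WHERE created_utc BETWEEN ? AND ?`
- Vector similarity: `SELECT * FROM social_media_posts ORDER BY title_embedding <=> ? LIMIT 10`
- Sentiment aggregations: `SELECT subreddit, AVG((sentiment_score->>'confidence')::float) FROM social_media_posts GROUP BY subreddit` (`sentiment_score` is JSONB, so the numeric field must be extracted before averaging)

**Performance Targets:**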
- Vector similarity queries: < 1s for top 10 results
- Batch upserts: < 5s for 1000 posts
- Ticker-based queries: < 100ms for 30-day ranges
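The `<=>` operator used in the vector similarity pattern returns cosine distance (1 minus cosine similarity), so ascending order puts the nearest posts first. The pure-Python sketch below illustrates that ordering on toy 3-dimensional vectors; the helper names are hypothetical, and the production query is expected to run inside PostgreSQL against the indexed column.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 0.0 for identical direction, 1.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k candidates closest to the query, nearest first."""
    scored = [(post_id, cosine_distance(query, emb)) for post_id, emb in candidates.items()]
    return sorted(scored, key=lambda item: item[1])[:k]


posts = {
    "t3_a": [1.0, 0.0, 0.0],
    "t3_b": [0.9, 0.1, 0.0],
    "t3_c": [0.0, 1.0, 0.0],
}
nearest = top_k([1.0, 0.0, 0.0], posts, k=2)
print([post_id for post_id, _ in nearest])  # ['t3_a', 't3_b']
```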
---
## 4. API Integration
### 4.1 Reddit Client (PRAW Integration)
Complete implementation of Reddit data collection using PRAW (Python Reddit API Wrapper):
```python
from typing import Any, Dict, List, Optional

import praw
from aiolimiter import AsyncLimiter


class RedditClient:
    """PRAW wrapper with rate limiting and error handling."""

    def __init__(self, config: RedditClientConfig):
        """Initialize Reddit client with OAuth2 credentials."""
        self.reddit = praw.Reddit(
            client_id=config.client_id,
            client_secret=config.client_secret,
            user_agent=config.user_agent,
        )
        # PRAW itself is synchronous, so the async methods below are expected
        # to run its blocking calls off the event loop (e.g. asyncio.to_thread).
        self.rate_limiter = AsyncLimiter(1, 1)  # 1 request per second

    async def fetch_subreddit_posts(
        self,
        subreddit: str,
        limit: int = 50,
        time_filter: str = 'day'
    ) -> List[Dict[str, Any]]:
        """Fetch posts from subreddit with rate limiting.

        PRAW's hot listing takes no time_filter; it applies to listings such as top.
        """

    async def search_posts(
        self,
        query: str,
        subreddit: Optional[str] = None,
        limit: int = 25
    ) -> List[Dict[str, Any]]:
        """Search posts with ticker symbols or keywords."""

    async def get_post_details(self, post_id: str) -> Optional[Dict[str, Any]]:
        """Get detailed information for a specific post."""
```
**Configuration Requirements:**
- Reddit App Credentials: `client_id`, `client_secret`, `user_agent`
- Rate Limiting: 1 request per second (60 requests/minute limit)
- Error Handling: Exponential backoff for rate limits, graceful degradation for authentication errors
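The exponential backoff mentioned under error handling could look like the sketch below. `RateLimitedError` and `with_backoff` are hypothetical names, and the delays and attempt count are illustrative defaults, not values taken from this design.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical error raised when the API signals a rate limit."""


async def with_backoff(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> T:
    """Retry `call` on rate-limit errors, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so concurrent workers do not retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")


async def demo() -> str:
    attempts = {"count": 0}

    async def flaky() -> str:
        # Fails twice with a rate-limit error, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimitedError()
        return "ok"

    return await with_backoff(flaky, base_delay=0.01)


print(asyncio.run(demo()))  # ok
```

Only the rate-limit error is retried here; authentication failures would propagate immediately, which fits the graceful-degradation note above.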
### 4.2 OpenRouter LLM Integration
Leverage existing OpenRouter infrastructure with social media-specific enhancements:

**Sentiment Analysis Prompt:**
```
Analyze this Reddit post about stocks/finance. Consider the informal language,
memes, and community context typical of financial subreddits.
Post: {title} - {content}
Respond with valid JSON:
{
"sentiment": "positive|negative|neutral",
"confidence": 0.0-1.0,
"reasoning": "brief explanation considering context"
}
```
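Because the prompt asks the model for JSON, the reply needs defensive parsing before it becomes a `SentimentScore`. The sketch below shows one way to do that; `parse_sentiment` is a hypothetical helper, and a trimmed copy of `SentimentScore` is redefined only to keep the example self-contained. Returning `None` on any malformed reply fits the best-effort principle of storing posts without sentiment.

```python
import json
from dataclasses import dataclass
from typing import Optional

VALID_SENTIMENTS = {"positive", "negative", "neutral"}


@dataclass
class SentimentScore:
    sentiment: str
    confidence: float
    reasoning: str


def parse_sentiment(raw: str) -> Optional[SentimentScore]:
    """Parse the model's JSON reply; return None when it is unusable."""
    try:
        data = json.loads(raw)
        sentiment = data["sentiment"]
        confidence = float(data["confidence"])
        reasoning = str(data.get("reasoning", ""))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, AttributeError):
        return None
    if not isinstance(sentiment, str) or sentiment not in VALID_SENTIMENTS:
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return SentimentScore(sentiment, confidence, reasoning)


reply = '{"sentiment": "positive", "confidence": 0.85, "reasoning": "earnings beat"}'
print(parse_sentiment(reply))
print(parse_sentiment("not json"))  # None
```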
**Embedding Configuration:**
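- Model: `text-embedding-3-large`, requested at 1536 dimensions via the API's `dimensions` parameter (the model's native output is 3072 dimensions, which would not fit the `VECTOR(1536)` columns)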
- Batch processing for efficiency
- Generate embeddings for both title and content when available
- Store NULL for failed embedding generation (best-effort processing)
---
## 5. Component Architecture
### 5.1 Repository Layer (Data Access)
```python
class SocialRepository:
    """Data access layer for social media posts with vector capabilities."""

    def __init__(self, session: AsyncSession):
        self.session = session

    async def find_by_ticker(
        self,
        ticker: str,
        days: int = 30,
        limit: int = 50
    ) -> List[SocialPost]:
        """Find posts mentioning specific ticker within time range."""

    async def find_similar_posts(
        self,
        query_embedding: List[float],
        ticker: Optional[str] = None,
        limit: int = 10
    ) -> List[SocialPost]:
        """Find semantically similar posts using vector similarity."""

    async def get_sentiment_summary(
        self,
        ticker: str,
        subreddit: Optional[str] = None,
        hours: int = 24
    ) -> Dict[str, Any]:
        """Generate sentiment aggregation for ticker."""

    async def upsert_batch(self, posts: List[SocialPost]) -> List[SocialPost]:
        """Batch upsert posts with conflict resolution."""

    async def cleanup_old_posts(self, days: int = 90) -> int:
        """Remove posts older than retention period."""
```
### 5.2 Service Layer (Business Logic)
```python
class SocialMediaService:
    """Business logic orchestration with LLM integration."""

    def __init__(
        self,
        repository: SocialRepository,
        reddit_client: RedditClient,
        openrouter_client: OpenRouterClient
    ):
        self.repository = repository
        self.reddit_client = reddit_client
        self.openrouter_client = openrouter_client

    async def collect_subreddit_posts(self, config: SocialJobConfig) -> int:
        """Orchestrate complete collection process for configured subreddits."""

    async def update_post_sentiment(
        self,
        posts: List[SocialPost]
    ) -> List[SocialPost]:
        """Add sentiment analysis to posts using OpenRouter LLM."""

    async def generate_embeddings(
        self,
        posts: List[SocialPost]
    ) -> List[SocialPost]:
        """Generate vector embeddings for semantic search."""

    async def find_trending_tickers(
        self,
        hours: int = 24
    ) -> List[Dict[str, Any]]:
        """Identify trending ticker mentions across subreddits."""
```
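Ticker extraction and trending detection are only named in the interfaces above. The following sketch shows one plausible approach, assuming cashtags (`$AAPL`) and bare uppercase words count as mentions. The regular expression and the `NOT_TICKERS` stop list are illustrative assumptions; a real implementation would likely validate against a list of known symbols.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Cashtags such as $AAPL, or bare uppercase words of 2-5 letters.
TICKER_PATTERN = re.compile(r"\$([A-Za-z]{1,5})\b|\b([A-Z]{2,5})\b")
# Hypothetical stop list of common uppercase words that are not tickers.
NOT_TICKERS = {"CEO", "IPO", "ETF", "USA", "YOLO", "EPS", "SEC", "ATH"}


def extract_tickers(text: str) -> List[str]:
    """Return unique ticker candidates in order of first appearance."""
    found: List[str] = []
    for cashtag, bare in TICKER_PATTERN.findall(text):
        symbol = (cashtag or bare).upper()
        if symbol not in NOT_TICKERS and symbol not in found:
            found.append(symbol)
    return found


def trending_tickers(texts: Iterable[str], top_n: int = 3) -> List[Tuple[str, int]]:
    """Count how many posts mention each ticker and return the most common."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(extract_tickers(text))  # each post counts a ticker once
    return counts.most_common(top_n)


sample = [
    "$AAPL beat earnings, YOLO on TSLA",
    "AAPL guidance looks weak",
    "NVDA and AAPL both up",
]
print(extract_tickers(sample[0]))          # ['AAPL', 'TSLA']
print(trending_tickers(sample, top_n=1))   # [('AAPL', 3)]
```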
### 5.3 Agent Integration Layer
```python
class SocialMediaAgentToolkit:
    """RAG methods for AI agent integration."""

    def __init__(self, service: SocialMediaService):
        self.service = service

    async def get_reddit_sentiment(
        self,
        ticker: str,
        days: int = 7
    ) -> Dict[str, Any]:
        """Get sentiment summary for ticker from Reddit discussions."""

    async def search_social_posts(
        self,
        query: str,
        ticker: Optional[str] = None
    ) -> List[Dict[str, Any]]:
        """Semantic search for relevant social media posts."""

    async def get_trending_discussions(
        self,
        ticker: str
    ) -> List[Dict[str, Any]]:
        """Get trending discussions and sentiment for specific ticker."""

    async def get_subreddit_analysis(
        self,
        subreddit: str,
        ticker: str
    ) -> Dict[str, Any]:
        """Analyze sentiment and engagement for ticker in specific subreddit."""
```
**Agent Response Format:**
```json
{
  "posts": [
    {
      "post_id": "t3_abc123",
      "title": "AAPL earnings beat expectations",
      "subreddit": "stocks",
      "created_utc": "2024-01-15T14:30:00Z",
      "sentiment": {
        "sentiment": "positive",
        "confidence": 0.85,
        "reasoning": "Strong positive language about earnings"
      },
      "engagement": {
        "upvotes": 245,
        "comments_count": 67
      },
      "tickers": ["AAPL"],
      "url": "https://reddit.com/r/stocks/comments/abc123"
    }
  ],
  "summary": {
    "total_posts": 15,
    "sentiment_breakdown": {
      "positive": 0.6,
      "negative": 0.2,
      "neutral": 0.2
    },
    "avg_confidence": 0.78,
    "data_quality": "high"
  }
}
```
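The `summary` object in the response above can be derived from the returned posts. The sketch below assumes that `sentiment_breakdown` holds fractions of the posts that have a sentiment and that `avg_confidence` is their mean confidence. `summarize` is a hypothetical helper, and `data_quality` is left out because this document does not define how it is computed.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize(posts: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate per-post sentiment into the summary shape used for agents."""
    scored = [p["sentiment"] for p in posts if p.get("sentiment")]
    counts = Counter(s["sentiment"] for s in scored)
    total = len(scored)
    breakdown = {
        label: round(counts[label] / total, 2) if total else 0.0
        for label in ("positive", "negative", "neutral")
    }
    avg_confidence = round(sum(s["confidence"] for s in scored) / total, 2) if total else 0.0
    return {
        "total_posts": len(posts),
        "sentiment_breakdown": breakdown,
        "avg_confidence": avg_confidence,
    }


example_posts = [
    {"sentiment": {"sentiment": "positive", "confidence": 0.9}},
    {"sentiment": {"sentiment": "positive", "confidence": 0.7}},
    {"sentiment": {"sentiment": "negative", "confidence": 0.8}},
    {"sentiment": {"sentiment": "neutral", "confidence": 0.6}},
    {"sentiment": None},  # stored without sentiment after an LLM failure
]
print(summarize(example_posts))
```

Posts without sentiment still count toward `total_posts` but are excluded from the breakdown, so a failed LLM call does not skew the fractions.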
---
## 6. Dagster Pipeline Architecture
### 6.1 Scheduled Collection Pipeline
```python
from dagster import (
    AssetExecutionContext,
    DailyPartitionsDefinition,
    MaterializeResult,
    ScheduleDefinition,
    asset,
    define_asset_job,
)


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
    # Assumes SocialJobConfig exposes a schema() helper; a plain dataclass does not
    config_schema=SocialJobConfig.schema(),
)
def reddit_posts_collection(context: AssetExecutionContext) -> MaterializeResult:
    """Collect Reddit posts from financial subreddits."""


@asset(deps=[reddit_posts_collection])
def reddit_sentiment_analysis(context: AssetExecutionContext) -> MaterializeResult:
    """Add sentiment analysis to collected posts."""


@asset(deps=[reddit_sentiment_analysis])
def reddit_embeddings_generation(context: AssetExecutionContext) -> MaterializeResult:
    """Generate vector embeddings for semantic search."""


# Schedule: Twice daily collection
reddit_collection_schedule = ScheduleDefinition(
    name="reddit_collection_schedule",
    job=define_asset_job("reddit_collection", selection=[
        reddit_posts_collection,
        reddit_sentiment_analysis,
        reddit_embeddings_generation,
    ]),
    cron_schedule="0 6,18 * * *",  # 6 AM and 6 PM UTC
)
```
### 6.2 Data Quality and Monitoring
**Collection Metrics:**
- Posts collected per subreddit per run
- Sentiment analysis success rate
- Embedding generation success rate
- API error rates and retry attempts

**Data Quality Checks:**
- Post deduplication verification
- Sentiment confidence distribution
- Embedding vector validation
- Reddit API rate limit utilization

**Failure Handling:**
- Best-effort processing: Continue with remaining subreddits if one fails
- Exponential backoff for Reddit API rate limits
- Graceful degradation: Store posts without sentiment/embeddings if LLM fails
- Dead letter queue for failed posts with retry mechanism
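The best-effort rule for subreddits can be sketched as a loop that isolates each failure. `collect_all` and the `fetch` callable are hypothetical; in this design `fetch` would wrap `RedditClient.fetch_subreddit_posts`. The loop is sequential on purpose so it stays within the one-request-per-second budget.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


async def collect_all(
    subreddits: List[str],
    fetch: Callable[[str], Awaitable[List[dict]]],
) -> Tuple[Dict[str, List[dict]], Dict[str, str]]:
    """Fetch each subreddit in turn; record failures instead of aborting the run."""
    collected: Dict[str, List[dict]] = {}
    failed: Dict[str, str] = {}
    for name in subreddits:
        try:
            collected[name] = await fetch(name)
        except Exception as exc:  # one bad subreddit must not stop the others
            failed[name] = repr(exc)
    return collected, failed


async def fake_fetch(name: str) -> List[dict]:
    # Stand-in for the Reddit client: one subreddit always fails.
    if name == "investing":
        raise ConnectionError("simulated outage")
    return [{"id": f"{name}_1"}]


collected, failed = asyncio.run(
    collect_all(["wallstreetbets", "investing", "stocks"], fake_fetch)
)
print(sorted(collected))  # ['stocks', 'wallstreetbets']
print(sorted(failed))     # ['investing']
```

The `failed` mapping gives the pipeline something concrete to report as a collection metric or to hand to a retry mechanism.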
---
## 7. Testing Strategy
### 7.1 Test Structure
Following the project's pragmatic outside-in TDD approach:
```
tests/domains/socialmedia/
├── __init__.py
├── test_social_post.py # Domain entity validation
├── test_social_repository.py # PostgreSQL + vector operations
├── test_reddit_client.py # PRAW integration with VCR
├── test_social_media_service.py # Business logic with mocked deps
├── test_social_agent_toolkit.py # Agent integration methods
└── fixtures/
├── reddit_responses.json # Sample PRAW responses
└── vcr_cassettes/ # HTTP cassettes for external APIs
```
### 7.2 Testing Approach
**Unit Tests (Mock I/O boundaries):**
- `SocialPost` entity validation and transformations
- `SocialRepository` with test PostgreSQL database
- `RedditClient` with mocked PRAW responses
- `SocialMediaService` with mocked dependencies

**Integration Tests (Real components):**
- End-to-end collection pipeline with test Reddit data
- Vector similarity search with actual pgvectorscale
- LLM integration with pytest-vcr cassettes
- Dagster pipeline execution

**Performance Tests:**
- Vector similarity query performance (< 1s target)
- Batch upsert performance (< 5s for 1000 posts)
- Memory usage during large collection runs
### 7.3 Test Fixtures and Mocking
**Reddit API Mocking:**
```python
import pytest


@pytest.fixture
def mock_reddit_response():
    """Sample Reddit API response for testing."""
    return {
        "id": "abc123",
        "title": "AAPL earnings discussion",
        "selftext": "Strong quarter, bullish outlook",
        "author": "test_user",
        "subreddit_display_name": "stocks",
        "created_utc": 1705315200,
        "score": 150,
        "upvote_ratio": 0.85,
        "num_comments": 45,
        "permalink": "/r/stocks/comments/abc123/aapl_earnings/"
    }
```
**Vector Similarity Testing:**
```python
@pytest.mark.asyncio
async def test_vector_similarity_search(social_repository, sample_posts):
    """Test semantic similarity search using pgvectorscale."""
    # Insert test posts with embeddings
    await social_repository.upsert_batch(sample_posts)

    # Test similarity search
    query_embedding = [0.1] * 1536  # Sample embedding
    similar_posts = await social_repository.find_similar_posts(
        query_embedding, limit=5
    )

    assert len(similar_posts) <= 5
    assert all(post.title_embedding for post in similar_posts)
```
---
## 8. Implementation Roadmap
### 8.1 Phase 1: Database Foundation (Week 1)
**Priority 1: Database Schema**
1. Create PostgreSQL migration for `social_media_posts` table
2. Add TimescaleDB hypertable configuration
3. Set up pgvectorscale indexes for vector similarity
4. Implement data validation constraints

**Priority 2: Core Entities**
1. `SocialMediaPostEntity` (SQLAlchemy entity)
2. `SocialPost` (domain entity with validation)
3. `SentimentScore` (value object)
4. Entity transformation methods (`to_domain`, `from_domain`)
### 8.2 Phase 2: Data Collection (Week 2)
**Priority 1: Reddit Integration**
1. `RedditClient` with PRAW implementation
2. Rate limiting and error handling
3. Subreddit post collection methods
4. Reddit API authentication setup

**Priority 2: Repository Layer**
1. `SocialRepository` with PostgreSQL operations
2. Vector similarity search methods
3. Batch upsert operations
4. Sentiment aggregation queries
### 8.3 Phase 3: Processing & Intelligence (Week 3)
**Priority 1: Service Layer**
1. `SocialMediaService` business logic
2. OpenRouter LLM integration for sentiment
3. Vector embedding generation
4. Batch processing workflows

**Priority 2: Agent Integration**
1. `SocialMediaAgentToolkit` RAG methods
2. Structured response formatting
3. Context-aware social media analysis
4. Integration with existing agent workflows
### 8.4 Phase 4: Automation & Monitoring (Week 4)
**Priority 1: Dagster Pipeline**
1. Scheduled Reddit collection assets
2. Processing pipeline orchestration
3. Data quality monitoring
4. Error handling and retry logic

**Priority 2: Testing & Documentation**
1. Comprehensive test suite (>85% coverage)
2. Performance testing and optimization
3. API documentation updates
4. Integration with existing test infrastructure
---
## 9. Monitoring and Observability
### 9.1 Key Metrics
**Collection Metrics:**
- Posts collected per subreddit per day
- Collection job success/failure rates
- Reddit API rate limit utilization
- Data deduplication effectiveness

**Processing Metrics:**
- Sentiment analysis success rate and latency
- Embedding generation success rate and latency
- LLM token usage and costs
- Vector similarity query performance

**Business Metrics:**
- Active tickers with social sentiment data
- Sentiment distribution across subreddits
- Trending ticker detection accuracy
- Agent query response times
### 9.2 Alerting Strategy
**Critical Alerts:**
- Collection job failures (> 2 consecutive failures)
- Reddit API authentication errors
- Database connection failures
- High LLM processing error rates (> 20%)

**Warning Alerts:**
- Low collection volumes (< 50% of expected)
- High sentiment analysis latency (> 30s per batch)
- Vector similarity performance degradation
- Approaching Reddit API rate limits
### 9.3 Logging and Debugging
**Structured Logging Format:**
```json
{
  "timestamp": "2024-01-15T14:30:00Z",
  "level": "INFO",
  "component": "SocialMediaService",
  "operation": "collect_subreddit_posts",
  "subreddit": "stocks",
  "posts_collected": 45,
  "sentiment_analyzed": 43,
  "embeddings_generated": 41,
  "duration_ms": 12500,
  "metadata": {
    "reddit_api_calls": 3,
    "llm_tokens_used": 15420
  }
}
```
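A log line in this shape can be produced with the standard `logging` module and a small formatter. The sketch below is one way to do it, not a prescribed implementation: `JsonFormatter` and the `fields` key passed through `extra` are hypothetical conventions, the logger name is used as the component, and the log message is used as the operation.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "component": record.name,
            "operation": record.getMessage(),
        }
        # Extra structured fields arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


logger = logging.getLogger("SocialMediaService")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "collect_subreddit_posts",
    extra={"fields": {"subreddit": "stocks", "posts_collected": 45, "duration_ms": 12500}},
)
```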
---
## 10. Security and Compliance
### 10.1 Data Privacy
**Reddit Data Handling:**
- Store only publicly available Reddit posts
- Respect user privacy: hash usernames for analytics
- Implement data retention policies (90-day maximum)
- No collection of private or deleted content

**API Key Management:**
- Environment variable storage for Reddit credentials
- OpenRouter API key rotation support
- No credential logging or persistence in plain text
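The username hashing mentioned under Reddit Data Handling could use a keyed hash, so that analytics can group posts by author without storing the raw name. This is a sketch under assumptions: `pseudonymize_author` and the `AUTHOR_HASH_SECRET` environment variable are hypothetical, and usernames are lowercased on the assumption that case is not significant.

```python
import hashlib
import hmac
import os


def pseudonymize_author(username: str, secret: bytes) -> str:
    """Return a stable, non-reversible token for a Reddit username."""
    digest = hmac.new(secret, username.lower().encode("utf-8"), hashlib.sha256)
    # A 16-hex-character prefix keeps tokens short while remaining stable.
    return digest.hexdigest()[:16]


# The secret is read from the environment, in line with the key management
# rules; the fallback value is for local experiments only.
secret = os.environ.get("AUTHOR_HASH_SECRET", "dev-only-secret").encode("utf-8")

token = pseudonymize_author("test_user", secret)
print(len(token))  # 16
```

Using HMAC with a secret, instead of a bare SHA-256, makes it harder to recover names by hashing a list of known usernames.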
### 10.2 Rate Limiting Compliance
**Reddit API Compliance:**
- Respect 60 requests per minute OAuth limit
- Implement exponential backoff for rate limit violations
- User-Agent string identification as required
- Monitor and log API usage statistics

**OpenRouter Usage:**
- Monitor token usage and costs
- Implement request batching for efficiency
- Handle API rate limits gracefully
- Cost optimization through model selection
---
## 11. Future Enhancements
### 11.1 Extended Social Media Sources
**Twitter/X Integration:**
- Similar architecture pattern for Twitter API v2
- Real-time streaming for high-frequency updates
- Hashtag and mention tracking

**News Comment Sections:**
- Integration with financial news comment sections
- Cross-platform sentiment correlation
- Enhanced context for news articles
### 11.2 Advanced Analytics
**Sentiment Trend Analysis:**
- Time-series sentiment tracking
- Volatility correlation with social sentiment
- Predictive sentiment modeling

**Influence Network Analysis:**
- User influence scoring based on engagement
- Community detection within financial subreddits
- Viral content identification and tracking
### 11.3 Real-time Processing
**Streaming Architecture:**
- Real-time Reddit post collection
- Event-driven sentiment processing
- Live sentiment dashboards for agents

**Market Hours Integration:**
- Increased collection frequency during market hours
- After-hours sentiment tracking
- Weekend vs. weekday sentiment patterns
---
This technical design provides a comprehensive blueprint for implementing the complete Social Media domain from empty stubs to a production-ready system. The architecture leverages proven patterns from the news domain while introducing specialized capabilities for social media data collection, semantic search, and AI agent integration.