TradingAgents/docs/specs/socialmedia/design.md


Social Media Domain - Technical Design Document

Executive Summary

This document specifies the complete greenfield implementation of the Social Media domain within TradingAgents, transitioning from empty stubs to a production-ready system for collecting and analyzing social media sentiment from financial subreddits. This domain will provide AI agents with social sentiment context for trading decisions through a PostgreSQL + TimescaleDB + pgvectorscale architecture with RAG-powered capabilities.

  • Implementation Scope: Complete domain implementation (0% → 100% completion)
  • Architecture: PostgreSQL + TimescaleDB + pgvectorscale with PRAW Reddit integration and OpenRouter LLM processing
  • Target: 400+ posts daily across 4 financial subreddits with 85%+ test coverage


1. Architecture Overview

1.1 System Architecture

The Social Media domain follows the established layered architecture pattern while introducing new capabilities for social media data collection and semantic search:

┌─────────────────────────────────────────────────────────────┐
│                    Dagster Pipeline                         │
│                 (Scheduled Collection)                      │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                 RedditClient                                │
│           (PRAW + Rate Limiting)                           │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│              SocialMediaService                             │
│        (Business Logic + LLM Integration)                  │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│              SocialRepository                               │
│    (PostgreSQL + TimescaleDB + pgvectorscale)             │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│         PostgreSQL + TimescaleDB + pgvectorscale           │
│          (Time-series + Vector Storage)                    │
└─────────────────────────────────────────────────────────────┘

1.2 Data Flow Architecture

Collection Flow:

Reddit API → RedditClient → SocialMediaService → OpenRouter LLM → 
SocialRepository → PostgreSQL + Vector Storage

Agent Query Flow:

AgentToolkit → SocialMediaService → SocialRepository → 
Vector Similarity Search + Sentiment Aggregation → Structured Response

1.3 Key Architectural Principles

  • Consistent Patterns: Follow news domain architecture for maintainability
  • Vector-Enhanced Search: Semantic similarity using pgvectorscale for contextual social media analysis
  • Best-Effort Processing: Continue operation even when LLM services are unavailable
  • Rate Limiting Compliance: Respect Reddit API limits with exponential backoff
  • Event-Driven Design: Publish domain events for system integration

2. Domain Model

2.1 Core Entities

SocialPost (Domain Entity)

The primary domain entity managing business rules and data transformations:

@dataclass
class SocialPost:
    """Core domain entity for Reddit posts with sentiment and engagement data."""
    
    # Core Reddit Data
    post_id: str                    # Reddit unique ID (e.g., 't3_abc123')
    title: str                      # Post title
    content: Optional[str]          # Post content (selftext for text posts)
    author: str                     # Reddit username
    subreddit: str                  # Subreddit name
    created_utc: datetime           # Post creation time
    url: str                        # Reddit permalink or external URL
    
    # Engagement Metrics
    upvotes: int                    # Post score
    downvotes: int                  # Calculated from score + upvote_ratio
    comments_count: int             # Number of comments
    
    # Enhanced Data
    sentiment_score: Optional[SentimentScore] = None
    tickers: List[str] = field(default_factory=list)
    title_embedding: Optional[List[float]] = None
    content_embedding: Optional[List[float]] = None
    
    @classmethod
    def from_praw_submission(cls, submission: praw.Submission) -> 'SocialPost':
        """Create SocialPost from PRAW Submission object."""
        
    def to_entity(self) -> SocialMediaPostEntity:
        """Transform to database entity for storage."""
        
    def validate(self) -> List[str]:
        """Validate business rules and return errors."""
        
    def extract_tickers(self) -> List[str]:
        """Extract stock ticker symbols from title and content."""
        
    def has_reliable_sentiment(self) -> bool:
        """Check if sentiment confidence >= 0.5."""
        
    def to_response(self) -> Dict[str, Any]:
        """Format for agent consumption."""

Validation Rules:

  • post_id must match Reddit format (starts with 't3_')
  • title cannot be empty
  • created_utc cannot be in the future
  • sentiment_score.confidence must be 0.0-1.0
  • embeddings must be 1536 dimensions if present
  • subreddit must be in allowed financial subreddits list
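
A minimal sketch of how validate() could enforce these rules; the error strings and the ALLOWED_SUBREDDITS constant are illustrative (in practice the allowed list should come from SocialJobConfig):

```python
from datetime import datetime, timezone

ALLOWED_SUBREDDITS = {"wallstreetbets", "investing", "stocks", "SecurityAnalysis"}
EMBEDDING_DIM = 1536

def validate_post(post) -> list[str]:
    """Return a list of business-rule violations (empty list means valid)."""
    errors: list[str] = []
    if not post.post_id.startswith("t3_"):
        errors.append("post_id must start with 't3_'")
    if not post.title:
        errors.append("title cannot be empty")
    if post.created_utc > datetime.now(timezone.utc):
        errors.append("created_utc cannot be in the future")
    if post.sentiment_score is not None and not 0.0 <= post.sentiment_score.confidence <= 1.0:
        errors.append("sentiment confidence must be 0.0-1.0")
    for emb in (post.title_embedding, post.content_embedding):
        if emb is not None and len(emb) != EMBEDDING_DIM:
            errors.append(f"embeddings must be {EMBEDDING_DIM} dimensions")
    if post.subreddit not in ALLOWED_SUBREDDITS:
        errors.append(f"subreddit '{post.subreddit}' not in allowed list")
    return errors
```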

SentimentScore (Value Object)

Structured sentiment analysis result from OpenRouter LLM:

@dataclass
class SentimentScore:
    """Structured sentiment analysis result with confidence and reasoning."""
    
    sentiment: Literal['positive', 'negative', 'neutral']
    confidence: float  # 0.0-1.0
    reasoning: str     # Brief explanation
    
    def is_reliable(self) -> bool:
        """Check if confidence >= 0.5 for reliable sentiment."""
        return self.confidence >= 0.5
        
    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for JSON storage."""

SocialJobConfig (Configuration)

Configuration for scheduled Reddit collection:

@dataclass
class SocialJobConfig:
    """Configuration for scheduled Reddit data collection."""
    
    # Collection Settings
    subreddits: List[str] = field(default_factory=lambda: [
        'wallstreetbets', 'investing', 'stocks', 'SecurityAnalysis'
    ])
    max_posts_per_subreddit: int = 50
    lookback_hours: int = 12
    min_score: int = 10
    
    # Processing Settings
    sentiment_model: str = "anthropic/claude-3.5-haiku"
    embedding_model: str = "text-embedding-3-large"  # requested at 1536 dims (native output is 3072)
    
    # Rate Limiting
    rate_limit_delay: float = 1.0  # seconds between API calls
    
    # Scheduling
    schedule_times: List[str] = field(default_factory=lambda: [
        '0 6 * * *',   # 6 AM UTC
        '0 18 * * *'   # 6 PM UTC
    ])

3. Database Design

3.1 Schema Definition

The social_media_posts table leverages PostgreSQL with TimescaleDB for time-series optimization and pgvectorscale for vector similarity search:

-- Core table definition
-- Note: on a TimescaleDB hypertable, the primary key must include the
-- partitioning column (created_utc), so a composite key is used.
-- uuid7() assumes a UUIDv7-generating function/extension is installed.
CREATE TABLE social_media_posts (
    id UUID DEFAULT uuid7(),
    post_id VARCHAR(50) NOT NULL,
    title TEXT NOT NULL,
    content TEXT,
    author VARCHAR(100) NOT NULL,
    subreddit VARCHAR(50) NOT NULL,
    created_utc TIMESTAMPTZ NOT NULL,
    upvotes INTEGER NOT NULL DEFAULT 0,
    downvotes INTEGER NOT NULL DEFAULT 0,
    comments_count INTEGER NOT NULL DEFAULT 0,
    url TEXT NOT NULL,
    sentiment_score JSONB,
    sentiment_label VARCHAR(20),
    tickers TEXT[] DEFAULT '{}',
    title_embedding VECTOR(1536),
    content_embedding VECTOR(1536),
    inserted_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (id, created_utc)
);

-- TimescaleDB hypertable for time-series optimization
SELECT create_hypertable('social_media_posts', 'created_utc', 
                         chunk_time_interval => INTERVAL '1 day');

-- Performance indexes
-- Uniqueness on post_id must include created_utc on a hypertable.
CREATE UNIQUE INDEX idx_social_posts_post_id ON social_media_posts (post_id, created_utc);
CREATE INDEX idx_social_posts_subreddit_time ON social_media_posts (subreddit, created_utc DESC);
CREATE INDEX idx_social_posts_tickers_gin ON social_media_posts USING GIN (tickers);
-- pgvectorscale StreamingDiskANN indexes for vector similarity
CREATE INDEX idx_social_posts_title_embedding ON social_media_posts 
    USING diskann (title_embedding vector_cosine_ops);
CREATE INDEX idx_social_posts_content_embedding ON social_media_posts 
    USING diskann (content_embedding vector_cosine_ops);
CREATE INDEX idx_social_posts_sentiment ON social_media_posts 
    ((sentiment_score->>'sentiment')) WHERE sentiment_score IS NOT NULL;

-- Data validation constraints
ALTER TABLE social_media_posts ADD CONSTRAINT chk_sentiment_score 
    CHECK (sentiment_score IS NULL OR 
           ((sentiment_score->>'confidence')::float BETWEEN 0 AND 1));
ALTER TABLE social_media_posts ADD CONSTRAINT chk_created_utc 
    CHECK (created_utc <= NOW());

3.2 SQLAlchemy Entity

class SocialMediaPostEntity(Base):
    """SQLAlchemy entity for PostgreSQL persistence with vector support."""
    
    __tablename__ = "social_media_posts"
    
    # Composite primary key: TimescaleDB requires the partitioning column
    # (created_utc) in the primary key and in any unique index.
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid7)
    post_id = Column(String(50), nullable=False, index=True)
    title = Column(Text, nullable=False)
    content = Column(Text)
    author = Column(String(100), nullable=False)
    subreddit = Column(String(50), nullable=False)
    created_utc = Column(DateTime(timezone=True), primary_key=True, nullable=False)
    upvotes = Column(Integer, nullable=False, default=0)
    downvotes = Column(Integer, nullable=False, default=0)
    comments_count = Column(Integer, nullable=False, default=0)
    url = Column(Text, nullable=False)
    sentiment_score = Column(JSONB)
    sentiment_label = Column(String(20))
    tickers = Column(ARRAY(String), default=list)  # callable avoids a shared mutable default
    title_embedding = Column(Vector(1536))
    content_embedding = Column(Vector(1536))
    inserted_at = Column(DateTime(timezone=True), default=func.now())
    updated_at = Column(DateTime(timezone=True), default=func.now(), onupdate=func.now())
    
    def to_domain(self) -> SocialPost:
        """Convert to domain entity."""
        
    @classmethod
    def from_domain(cls, post: SocialPost) -> 'SocialMediaPostEntity':
        """Create from domain entity."""

3.3 Access Patterns and Query Optimization

Common Access Patterns:

  • Ticker-based queries: SELECT * WHERE 'AAPL' = ANY(tickers)
  • Time-range filtering: SELECT * WHERE created_utc BETWEEN ? AND ?
  • Vector similarity: SELECT * ORDER BY embedding <=> ? LIMIT 10
  • Sentiment aggregations: SELECT AVG((sentiment_score->>'confidence')::float) GROUP BY subreddit (the JSONB column must be unpacked to a numeric field before aggregating)

Performance Targets:

  • Vector similarity queries: < 1s for top 10 results
  • Batch upserts: < 5s for 1000 posts
  • Ticker-based queries: < 100ms for 30-day ranges

4. API Integration

4.1 Reddit Client (PRAW Integration)

Complete implementation of Reddit data collection using PRAW (Python Reddit API Wrapper):

class RedditClient:
    """PRAW wrapper with rate limiting and error handling."""
    
    def __init__(self, config: RedditClientConfig):
        """Initialize Reddit client with OAuth2 credentials.

        Note: PRAW itself is synchronous; the async methods below require
        Async PRAW (asyncpraw) or running PRAW calls in a thread executor.
        """
        self.reddit = praw.Reddit(
            client_id=config.client_id,
            client_secret=config.client_secret,
            user_agent=config.user_agent
        )
        self.rate_limiter = AsyncLimiter(1, 1)  # 1 request per second
        
    async def fetch_subreddit_posts(
        self, 
        subreddit: str, 
        limit: int = 50, 
        time_filter: str = 'day'
    ) -> List[Dict[str, Any]]:
        """Fetch hot posts from subreddit with rate limiting."""
        
    async def search_posts(
        self, 
        query: str, 
        subreddit: Optional[str] = None, 
        limit: int = 25
    ) -> List[Dict[str, Any]]:
        """Search posts with ticker symbols or keywords."""
        
    async def get_post_details(self, post_id: str) -> Optional[Dict[str, Any]]:
        """Get detailed information for a specific post."""

Configuration Requirements:

  • Reddit App Credentials: client_id, client_secret, user_agent
  • Rate Limiting: 1 request per second (60 requests/minute limit)
  • Error Handling: Exponential backoff for rate limits, graceful degradation for authentication errors
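
The exponential-backoff requirement can be sketched as a small retry wrapper. The function name and the bare `Exception` catch are illustrative; a real client would catch prawcore's rate-limit and server-error exceptions specifically:

```python
import asyncio
import random

async def with_backoff(coro_factory, max_retries: int = 5, base_delay: float = 1.0):
    """Retry an async call with exponential backoff and jitter (sketch).

    `coro_factory` is a zero-argument callable returning a fresh coroutine,
    since a coroutine object cannot be awaited twice.
    """
    for attempt in range(max_retries):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_retries - 1:
                raise  # exhausted retries: surface the error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
```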

4.2 OpenRouter LLM Integration

Leverage existing OpenRouter infrastructure with social media-specific enhancements:

Sentiment Analysis Prompt:

Analyze this Reddit post about stocks/finance. Consider the informal language, 
memes, and community context typical of financial subreddits.

Post: {title} - {content}

Respond with valid JSON:
{
  "sentiment": "positive|negative|neutral",
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation considering context"
}

Embedding Configuration:

  • Model: text-embedding-3-large, requested at 1536 dimensions via the embeddings API's dimensions parameter (the model's native output is 3072 dimensions)
  • Batch processing for efficiency
  • Generate embeddings for both title and content when available
  • Store NULL for failed embedding generation (best-effort processing)
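
The batch-plus-best-effort policy above could be shaped as follows; `embed_batch` stands in for the OpenRouter/OpenAI embeddings call and is an assumption of this sketch, not an API from the codebase:

```python
from typing import Callable, Optional

def embed_best_effort(
    texts: list[Optional[str]],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[Optional[list[float]]]:
    """Embed texts in batches; store None on failure instead of aborting."""
    results: list[Optional[list[float]]] = [None] * len(texts)
    # Only embed non-empty texts; empty slots stay None (best-effort policy).
    indexed = [(i, t) for i, t in enumerate(texts) if t]
    for start in range(0, len(indexed), batch_size):
        chunk = indexed[start:start + batch_size]
        try:
            vectors = embed_batch([t for _, t in chunk])
        except Exception:
            continue  # leave this chunk's embeddings as None
        for (i, _), vec in zip(chunk, vectors):
            results[i] = vec
    return results
```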

5. Component Architecture

5.1 Repository Layer (Data Access)

class SocialRepository:
    """Data access layer for social media posts with vector capabilities."""
    
    def __init__(self, session: AsyncSession):
        self.session = session
        
    async def find_by_ticker(
        self, 
        ticker: str, 
        days: int = 30, 
        limit: int = 50
    ) -> List[SocialPost]:
        """Find posts mentioning specific ticker within time range."""
        
    async def find_similar_posts(
        self, 
        query_embedding: List[float], 
        ticker: Optional[str] = None, 
        limit: int = 10
    ) -> List[SocialPost]:
        """Find semantically similar posts using vector similarity."""
        
    async def get_sentiment_summary(
        self, 
        ticker: str, 
        subreddit: Optional[str] = None, 
        hours: int = 24
    ) -> Dict[str, Any]:
        """Generate sentiment aggregation for ticker."""
        
    async def upsert_batch(self, posts: List[SocialPost]) -> List[SocialPost]:
        """Batch upsert posts with conflict resolution."""
        
    async def cleanup_old_posts(self, days: int = 90) -> int:
        """Remove posts older than retention period."""

5.2 Service Layer (Business Logic)

class SocialMediaService:
    """Business logic orchestration with LLM integration."""
    
    def __init__(
        self, 
        repository: SocialRepository,
        reddit_client: RedditClient,
        openrouter_client: OpenRouterClient
    ):
        self.repository = repository
        self.reddit_client = reddit_client
        self.openrouter_client = openrouter_client
        
    async def collect_subreddit_posts(self, config: SocialJobConfig) -> int:
        """Orchestrate complete collection process for configured subreddits."""
        
    async def update_post_sentiment(
        self, 
        posts: List[SocialPost]
    ) -> List[SocialPost]:
        """Add sentiment analysis to posts using OpenRouter LLM."""
        
    async def generate_embeddings(
        self, 
        posts: List[SocialPost]
    ) -> List[SocialPost]:
        """Generate vector embeddings for semantic search."""
        
    async def find_trending_tickers(
        self, 
        hours: int = 24
    ) -> List[Dict[str, Any]]:
        """Identify trending ticker mentions across subreddits."""

5.3 Agent Integration Layer

class SocialMediaAgentToolkit:
    """RAG methods for AI agent integration."""
    
    def __init__(self, service: SocialMediaService):
        self.service = service
        
    async def get_reddit_sentiment(
        self, 
        ticker: str, 
        days: int = 7
    ) -> Dict[str, Any]:
        """Get sentiment summary for ticker from Reddit discussions."""
        
    async def search_social_posts(
        self, 
        query: str, 
        ticker: Optional[str] = None
    ) -> List[Dict[str, Any]]:
        """Semantic search for relevant social media posts."""
        
    async def get_trending_discussions(
        self, 
        ticker: str
    ) -> List[Dict[str, Any]]:
        """Get trending discussions and sentiment for specific ticker."""
        
    async def get_subreddit_analysis(
        self, 
        subreddit: str, 
        ticker: str
    ) -> Dict[str, Any]:
        """Analyze sentiment and engagement for ticker in specific subreddit."""

Agent Response Format:

{
  "posts": [
    {
      "post_id": "t3_abc123",
      "title": "AAPL earnings beat expectations",
      "subreddit": "stocks",
      "created_utc": "2024-01-15T14:30:00Z",
      "sentiment": {
        "sentiment": "positive",
        "confidence": 0.85,
        "reasoning": "Strong positive language about earnings"
      },
      "engagement": {
        "upvotes": 245,
        "comments_count": 67
      },
      "tickers": ["AAPL"],
      "url": "https://reddit.com/r/stocks/comments/abc123"
    }
  ],
  "summary": {
    "total_posts": 15,
    "sentiment_breakdown": {
      "positive": 0.6,
      "negative": 0.2,
      "neutral": 0.2
    },
    "avg_confidence": 0.78,
    "data_quality": "high"
  }
}
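
The summary section of this response can be computed with a small aggregation over the fetched posts. This is a sketch: the row shape and the 0.8 coverage threshold behind data_quality are assumptions, not values from the design:

```python
from collections import Counter

def summarize_sentiment(rows: list[dict]) -> dict:
    """Aggregate per-post sentiment dicts into the agent summary shape.

    Each row may carry a 'sentiment_score' dict like
    {"sentiment": "positive", "confidence": 0.85}; rows without sentiment
    count toward total_posts but are excluded from the breakdown.
    """
    scored = [r["sentiment_score"] for r in rows if r.get("sentiment_score")]
    counts = Counter(s["sentiment"] for s in scored)
    total = len(scored)
    breakdown = {
        label: round(counts.get(label, 0) / total, 2) if total else 0.0
        for label in ("positive", "negative", "neutral")
    }
    avg_conf = round(sum(s["confidence"] for s in scored) / total, 2) if total else 0.0
    return {
        "total_posts": len(rows),
        "sentiment_breakdown": breakdown,
        "avg_confidence": avg_conf,
        # Coverage-based quality flag; the 0.8 threshold is an assumption.
        "data_quality": "high" if rows and total / len(rows) >= 0.8 else "low",
    }
```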

6. Dagster Pipeline Architecture

6.1 Scheduled Collection Pipeline

@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
    # SocialJobConfig is a plain dataclass (no .schema() method); expose it
    # to Dagster via a Config subclass or equivalent run-config mapping.
)
def reddit_posts_collection(context: AssetExecutionContext) -> MaterializeResult:
    """Collect Reddit posts from financial subreddits."""
    
@asset(deps=[reddit_posts_collection])
def reddit_sentiment_analysis(context: AssetExecutionContext) -> MaterializeResult:
    """Add sentiment analysis to collected posts."""
    
@asset(deps=[reddit_sentiment_analysis])
def reddit_embeddings_generation(context: AssetExecutionContext) -> MaterializeResult:
    """Generate vector embeddings for semantic search."""

# Schedule: Twice daily collection
reddit_collection_schedule = ScheduleDefinition(
    name="reddit_collection_schedule",
    job=define_asset_job("reddit_collection", selection=[
        reddit_posts_collection,
        reddit_sentiment_analysis,
        reddit_embeddings_generation
    ]),
    cron_schedule="0 6,18 * * *"  # 6 AM and 6 PM UTC
)

6.2 Data Quality and Monitoring

Collection Metrics:

  • Posts collected per subreddit per run
  • Sentiment analysis success rate
  • Embedding generation success rate
  • API error rates and retry attempts

Data Quality Checks:

  • Post deduplication verification
  • Sentiment confidence distribution
  • Embedding vector validation
  • Reddit API rate limit utilization

Failure Handling:

  • Best-effort processing: Continue with remaining subreddits if one fails
  • Exponential backoff for Reddit API rate limits
  • Graceful degradation: Store posts without sentiment/embeddings if LLM fails
  • Dead letter queue for failed posts with retry mechanism
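
A minimal in-memory sketch of the dead letter queue with bounded retries; a production system would back this with a durable store, and the max_attempts policy is an illustrative assumption:

```python
from collections import deque

class DeadLetterQueue:
    """Hold failed posts for retry; drop them after max_attempts (sketch)."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self._queue = deque()          # pending (post, attempts) pairs
        self.dropped: list[dict] = []  # exhausted posts, surfaced for alerting

    def add(self, post: dict, attempts: int = 0) -> None:
        if attempts >= self.max_attempts:
            self.dropped.append(post)
        else:
            self._queue.append((post, attempts))

    def retry_all(self, process) -> int:
        """Re-process queued posts; failures are re-queued with attempts+1."""
        succeeded = 0
        for _ in range(len(self._queue)):
            post, attempts = self._queue.popleft()
            try:
                process(post)
                succeeded += 1
            except Exception:
                self.add(post, attempts + 1)
        return succeeded
```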

7. Testing Strategy

7.1 Test Structure

Following the project's pragmatic outside-in TDD approach:

tests/domains/socialmedia/
├── __init__.py
├── test_social_post.py                 # Domain entity validation
├── test_social_repository.py           # PostgreSQL + vector operations
├── test_reddit_client.py               # PRAW integration with VCR
├── test_social_media_service.py        # Business logic with mocked deps
├── test_social_agent_toolkit.py        # Agent integration methods
└── fixtures/
    ├── reddit_responses.json           # Sample PRAW responses
    └── vcr_cassettes/                   # HTTP cassettes for external APIs

7.2 Testing Approach

Unit Tests (Mock I/O boundaries):

  • SocialPost entity validation and transformations
  • SocialRepository with test PostgreSQL database
  • RedditClient with mocked PRAW responses
  • SocialMediaService with mocked dependencies

Integration Tests (Real components):

  • End-to-end collection pipeline with test Reddit data
  • Vector similarity search with actual pgvectorscale
  • LLM integration with pytest-vcr cassettes
  • Dagster pipeline execution

Performance Tests:

  • Vector similarity query performance (< 1s target)
  • Batch upsert performance (< 5s for 1000 posts)
  • Memory usage during large collection runs

7.3 Test Fixtures and Mocking

Reddit API Mocking:

@pytest.fixture
def mock_reddit_response():
    """Sample Reddit API response for testing."""
    return {
        "id": "abc123",
        "title": "AAPL earnings discussion",
        "selftext": "Strong quarter, bullish outlook",
        "author": "test_user",
        "subreddit_display_name": "stocks",
        "created_utc": 1705315200,
        "score": 150,
        "upvote_ratio": 0.85,
        "num_comments": 45,
        "permalink": "/r/stocks/comments/abc123/aapl_earnings/"
    }

Vector Similarity Testing:

@pytest.mark.asyncio
async def test_vector_similarity_search(social_repository, sample_posts):
    """Test semantic similarity search using pgvectorscale."""
    # Insert test posts with embeddings
    await social_repository.upsert_batch(sample_posts)
    
    # Test similarity search
    query_embedding = [0.1] * 1536  # Sample embedding
    similar_posts = await social_repository.find_similar_posts(
        query_embedding, limit=5
    )
    
    assert len(similar_posts) <= 5
    assert all(post.title_embedding for post in similar_posts)

8. Implementation Roadmap

8.1 Phase 1: Database Foundation (Week 1)

Priority 1: Database Schema

  1. Create PostgreSQL migration for social_media_posts table
  2. Add TimescaleDB hypertable configuration
  3. Set up pgvectorscale indexes for vector similarity
  4. Implement data validation constraints

Priority 2: Core Entities

  1. SocialMediaPostEntity (SQLAlchemy entity)
  2. SocialPost (domain entity with validation)
  3. SentimentScore (value object)
  4. Entity transformation methods (to_domain, from_domain)

8.2 Phase 2: Data Collection (Week 2)

Priority 1: Reddit Integration

  1. RedditClient with PRAW implementation
  2. Rate limiting and error handling
  3. Subreddit post collection methods
  4. Reddit API authentication setup

Priority 2: Repository Layer

  1. SocialRepository with PostgreSQL operations
  2. Vector similarity search methods
  3. Batch upsert operations
  4. Sentiment aggregation queries

8.3 Phase 3: Processing & Intelligence (Week 3)

Priority 1: Service Layer

  1. SocialMediaService business logic
  2. OpenRouter LLM integration for sentiment
  3. Vector embedding generation
  4. Batch processing workflows

Priority 2: Agent Integration

  1. SocialMediaAgentToolkit RAG methods
  2. Structured response formatting
  3. Context-aware social media analysis
  4. Integration with existing agent workflows

8.4 Phase 4: Automation & Monitoring (Week 4)

Priority 1: Dagster Pipeline

  1. Scheduled Reddit collection assets
  2. Processing pipeline orchestration
  3. Data quality monitoring
  4. Error handling and retry logic

Priority 2: Testing & Documentation

  1. Comprehensive test suite (>85% coverage)
  2. Performance testing and optimization
  3. API documentation updates
  4. Integration with existing test infrastructure

9. Monitoring and Observability

9.1 Key Metrics

Collection Metrics:

  • Posts collected per subreddit per day
  • Collection job success/failure rates
  • Reddit API rate limit utilization
  • Data deduplication effectiveness

Processing Metrics:

  • Sentiment analysis success rate and latency
  • Embedding generation success rate and latency
  • LLM token usage and costs
  • Vector similarity query performance

Business Metrics:

  • Active tickers with social sentiment data
  • Sentiment distribution across subreddits
  • Trending ticker detection accuracy
  • Agent query response times

9.2 Alerting Strategy

Critical Alerts:

  • Collection job failures (> 2 consecutive failures)
  • Reddit API authentication errors
  • Database connection failures
  • High LLM processing error rates (> 20%)

Warning Alerts:

  • Low collection volumes (< 50% of expected)
  • High sentiment analysis latency (> 30s per batch)
  • Vector similarity performance degradation
  • Approaching Reddit API rate limits

9.3 Logging and Debugging

Structured Logging Format:

{
  "timestamp": "2024-01-15T14:30:00Z",
  "level": "INFO",
  "component": "SocialMediaService",
  "operation": "collect_subreddit_posts",
  "subreddit": "stocks",
  "posts_collected": 45,
  "sentiment_analyzed": 43,
  "embeddings_generated": 41,
  "duration_ms": 12500,
  "metadata": {
    "reddit_api_calls": 3,
    "llm_tokens_used": 15420
  }
}
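
One way to emit log lines in this shape; using stdlib logging plus json here is an assumption, as the project may prefer structlog or another structured-logging library:

```python
import json
import logging
from datetime import datetime, timezone

def log_collection_run(component: str, operation: str, **fields) -> str:
    """Emit one structured JSON log line in the format shown above (sketch)."""
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": "INFO",
        "component": component,
        "operation": operation,
        **fields,  # e.g. subreddit, posts_collected, duration_ms, metadata
    }
    line = json.dumps(record)
    logging.getLogger(component).info(line)
    return line
```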

10. Security and Compliance

10.1 Data Privacy

Reddit Data Handling:

  • Store only publicly available Reddit posts
  • Respect user privacy: hash usernames for analytics
  • Implement data retention policies (90-day maximum)
  • No collection of private or deleted content
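
The username-hashing requirement can be met with a salted one-way hash; the function and parameter names are illustrative, and the salt should come from secret configuration:

```python
import hashlib

def hash_username(username: str, salt: str) -> str:
    """Pseudonymize a Reddit username for analytics (sketch).

    A salted SHA-256 keeps per-user joins possible within our data while
    preventing trivial reverse lookup of the original username.
    """
    digest = hashlib.sha256(f"{salt}:{username}".encode("utf-8")).hexdigest()
    return digest[:16]  # truncated for storage; the length is a design choice
```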

API Key Management:

  • Environment variable storage for Reddit credentials
  • OpenRouter API key rotation support
  • No credential logging or persistence in plain text

10.2 Rate Limiting Compliance

Reddit API Compliance:

  • Respect 60 requests per minute OAuth limit
  • Implement exponential backoff for rate limit violations
  • User-Agent string identification as required
  • Monitor and log API usage statistics

OpenRouter Usage:

  • Monitor token usage and costs
  • Implement request batching for efficiency
  • Handle API rate limits gracefully
  • Cost optimization through model selection

11. Future Enhancements

11.1 Extended Social Media Sources

Twitter/X Integration:

  • Similar architecture pattern for Twitter API v2
  • Real-time streaming for high-frequency updates
  • Hashtag and mention tracking

News Comment Sections:

  • Integration with financial news comment sections
  • Cross-platform sentiment correlation
  • Enhanced context for news articles

11.2 Advanced Analytics

Sentiment Trend Analysis:

  • Time-series sentiment tracking
  • Volatility correlation with social sentiment
  • Predictive sentiment modeling

Influence Network Analysis:

  • User influence scoring based on engagement
  • Community detection within financial subreddits
  • Viral content identification and tracking

11.3 Real-time Processing

Streaming Architecture:

  • Real-time Reddit post collection
  • Event-driven sentiment processing
  • Live sentiment dashboards for agents

Market Hours Integration:

  • Increased collection frequency during market hours
  • After-hours sentiment tracking
  • Weekend vs. weekday sentiment patterns

This technical design provides a comprehensive blueprint for implementing the complete Social Media domain from empty stubs to a production-ready system. The architecture leverages proven patterns from the news domain while introducing specialized capabilities for social media data collection, semantic search, and AI agent integration.