Product Requirements Document: SocialMediaService Completion

Overview

Complete the SocialMediaService to provide strongly-typed social media data and sentiment analysis to trading agents using a local-first data strategy with gap detection and intelligent caching.

Current State Analysis

Issues to Fix

  • CRITICAL: Missing RedditClient implementation - the service calls client methods that do not exist
  • CRITICAL: Service depends on BaseClient inheritance but needs a typed RedditClient
  • CRITICAL: SocialRepository's interface differs from the standard service pattern
  • CRITICAL: Repository uses date objects internally while the service expects a string-date interface
  • Missing strongly-typed interfaces between components
  • Service calls reddit_client.search_posts(), get_top_posts(), and filter_posts_by_date(), none of which are implemented

What Works

  • Local-first data strategy implementation (_get_social_data_local_first)
  • Force refresh logic (_fetch_and_cache_fresh_social_data)
  • SocialContext Pydantic model for agent consumption
  • Comprehensive sentiment analysis with keyword-based scoring
  • Engagement metrics calculation and post ranking
  • Error handling and metadata creation patterns
  • SocialRepository with JSON storage and post deduplication
  • PostData and SentimentScore models for structured data
  • Real-time sentiment analysis with weighted scoring

Technical Requirements

1. Strongly-Typed Interfaces

Client → Service Interface

# RedditClient methods (to be implemented)
def search_posts(self, query: str, subreddit_names: list[str], start_date: date, end_date: date, limit: int, time_filter: str) -> dict[str, Any]: ...
def get_top_posts(self, subreddit_names: list[str], start_date: date, end_date: date, limit: int, time_filter: str) -> dict[str, Any]: ...
def get_company_posts(self, symbol: str, subreddit_names: list[str], start_date: date, end_date: date, limit: int) -> dict[str, Any]: ...

Service → Repository Interface

# SocialRepository methods (to be implemented/bridged)
def has_data_for_period(self, query: str, start_date: str, end_date: str, symbol: str | None) -> bool: ...
def get_data(self, query: str, start_date: str, end_date: str, symbol: str | None) -> dict[str, Any]: ...
def store_data(self, query: str, cache_data: dict, symbol: str | None, overwrite: bool) -> bool: ...
def clear_data(self, query: str, start_date: str, end_date: str, symbol: str | None) -> bool: ...

Service → Agent Interface

# Service output (already defined)
def get_context(self, query: str, start_date: str, end_date: str, symbol: str | None, subreddits: list[str], force_refresh: bool) -> SocialContext: ...
def get_company_social_context(self, symbol: str, start_date: str, end_date: str, subreddits: list[str]) -> SocialContext: ...
def get_global_trends(self, start_date: str, end_date: str, subreddits: list[str]) -> SocialContext: ...

2. Local-First Data Strategy

Flow

  1. Repository Lookup: Check SocialRepository.has_data_for_period()
  2. Gap Detection: Identify missing social media data periods
  3. Selective Fetching: Fetch only missing data from RedditClient
  4. Cache Updates: Store new data via repository.store_data()
  5. Context Assembly: Return validated SocialContext
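The gap-detection step in the flow above can be sketched as a pure function over date ranges. The `missing_periods` name and the `(start, end)` tuple convention are illustrative assumptions, not existing code:

```python
from datetime import date, timedelta


def missing_periods(
    requested: tuple[date, date],
    cached: list[tuple[date, date]],
) -> list[tuple[date, date]]:
    """Return the sub-ranges of `requested` not covered by any cached range.

    Hypothetical helper illustrating the gap-detection step; the real
    service may track coverage differently.
    """
    start, end = requested
    gaps: list[tuple[date, date]] = []
    cursor = start
    for c_start, c_end in sorted(cached):
        if c_end < cursor:
            continue  # cached range lies entirely before the cursor
        if c_start > end:
            break  # cached range lies entirely after the requested period
        if c_start > cursor:
            gaps.append((cursor, c_start - timedelta(days=1)))
        cursor = max(cursor, c_end + timedelta(days=1))
    if cursor <= end:
        gaps.append((cursor, end))
    return gaps
```

Selective fetching then iterates over the returned gaps, calling the client once per missing sub-range instead of re-fetching the whole period.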

Force Refresh Support

  • force_refresh=True bypasses local data completely
  • Clears existing cache before fetching fresh data
  • Stores refreshed data with metadata indicating refresh

3. Date Object Conversion

Service Boundary Conversion

# Service receives string dates from agents
def get_context(self, query: str, start_date: str, end_date: str, ...) -> SocialContext:
    # Convert to date objects for client calls
    start_dt = date.fromisoformat(start_date)
    end_dt = date.fromisoformat(end_date)
    
    # Use date objects when calling RedditClient
    posts_data = self.reddit_client.search_posts(query, subreddits, start_dt, end_dt, limit, time_filter)
    
    # Repository bridge handles string to date conversion internally
    cached_data = self.repository.get_data(query, start_date, end_date, symbol)

4. Reddit API Integration

RedditClient Implementation Strategy

# RedditClient following FinnhubClient standard
class RedditClient:
    """Client for Reddit API access with PRAW library integration."""
    
    def __init__(self, client_id: str, client_secret: str, user_agent: str):
        """Initialize Reddit client with PRAW."""
        import praw
        self.reddit = praw.Reddit(
            client_id=client_id,
            client_secret=client_secret,
            user_agent=user_agent
        )
    
    def search_posts(self, query: str, subreddit_names: list[str], 
                    start_date: date, end_date: date, limit: int = 50, 
                    time_filter: str = "week") -> dict[str, Any]:
        """Search for posts across subreddits within date range."""
        
    def get_top_posts(self, subreddit_names: list[str], 
                     start_date: date, end_date: date, limit: int = 50, 
                     time_filter: str = "week") -> dict[str, Any]:
        """Get top posts from subreddits within date range."""
        
    def get_company_posts(self, symbol: str, subreddit_names: list[str],
                         start_date: date, end_date: date, limit: int = 50) -> dict[str, Any]:
        """Get company-specific posts from subreddits."""

Reddit Response Format

{
    "query": "AAPL",
    "period": {"start": "2024-01-01", "end": "2024-01-31"},
    "posts": [
        {
            "title": "Apple earnings discussion",
            "content": "What do you think about...",
            "author": "redditor123",
            "subreddit": "investing",
            "created_utc": 1704067200,
            "score": 125,
            "num_comments": 45,
            "upvote_ratio": 0.87,
            "url": "https://reddit.com/r/investing/comments/abc123",
            "id": "abc123"
        }
    ],
    "metadata": {
        "source": "reddit",
        "retrieved_at": "2024-01-31T10:00:00Z",
        "data_quality": "HIGH",
        "subreddits": ["investing", "stocks"],
        "total_posts": 25
    }
}
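Bridging a raw response post into the repository's PostData shape is mostly field renaming plus a derived engagement_score (upvotes plus comment count). The `to_post_data_dict` helper name is an assumption; the output keys follow the PostData model:

```python
from datetime import datetime, timezone
from typing import Any


def to_post_data_dict(raw: dict[str, Any]) -> dict[str, Any]:
    """Map a raw Reddit post dict onto the repository's PostData shape.

    Hypothetical bridge helper: engagement_score is derived as
    score + comments, and created_utc is rendered as an ISO date.
    """
    created = datetime.fromtimestamp(raw["created_utc"], tz=timezone.utc)
    return {
        "title": raw["title"],
        "content": raw["content"],
        "author": raw["author"],
        "source": raw["subreddit"],
        "date": created.date().isoformat(),
        "url": raw["url"],
        "score": raw["score"],
        "comments": raw["num_comments"],
        "engagement_score": raw["score"] + raw["num_comments"],
        "subreddit": raw["subreddit"],
        "sentiment": None,  # filled in later by the sentiment analyzer
        "metadata": {
            "platform_id": raw["id"],
            "upvote_ratio": raw.get("upvote_ratio"),
        },
    }
```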

5. Sentiment Analysis Enhancement

Advanced Sentiment Features

  • Weighted Scoring: High-engagement posts have more influence on overall sentiment
  • Keyword Analysis: Comprehensive positive/negative keyword detection
  • Score Adjustment: Reddit score (upvotes) influences sentiment confidence
  • Confidence Metrics: Based on post count and engagement levels
  • Multi-level Analysis: Individual post sentiment + overall summary sentiment

Sentiment Calculation Strategy

def _calculate_advanced_sentiment(self, posts: list[PostData]) -> SentimentScore:
    """Enhanced sentiment analysis with multiple factors."""
    # Weight by engagement score (upvotes + comments)
    # Adjust for subreddit context (WSB vs investing)
    # Consider temporal patterns (recent posts weighted higher)
    # Apply confidence scoring based on data volume
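An engagement-weighted version of the keyword scorer might look like the following. The keyword sets, the `(1 + engagement_score)` weighting, and the dict-shaped result are all illustrative assumptions, not the service's actual implementation:

```python
POSITIVE = {"beat", "strong", "bullish", "growth", "upgrade"}
NEGATIVE = {"miss", "weak", "bearish", "decline", "downgrade"}


def weighted_sentiment(posts: list[dict]) -> dict:
    """Engagement-weighted keyword sentiment over a batch of posts.

    Sketch: each post's keyword score in [-1, 1] is weighted by
    (1 + engagement_score), and confidence grows with post count.
    """
    if not posts:
        return {"score": 0.0, "confidence": 0.0, "label": "neutral"}
    weighted_sum = 0.0
    total_weight = 0.0
    for post in posts:
        words = f"{post['title']} {post['content']}".lower().split()
        pos = sum(w in POSITIVE for w in words)
        neg = sum(w in NEGATIVE for w in words)
        raw = (pos - neg) / max(pos + neg, 1)  # per-post score in [-1, 1]
        weight = 1 + post.get("engagement_score", 0)
        weighted_sum += raw * weight
        total_weight += weight
    score = weighted_sum / total_weight
    confidence = min(1.0, len(posts) / 10)  # more posts -> more confidence
    label = "positive" if score > 0.1 else "negative" if score < -0.1 else "neutral"
    return {"score": round(score, 3), "confidence": confidence, "label": label}
```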

6. Pydantic Validation

Context Structure

class SocialContext(BaseModel):
    symbol: str | None
    period: dict[str, str]  # {"start": "2024-01-01", "end": "2024-01-31"}
    posts: list[PostData]
    engagement_metrics: dict[str, float]
    sentiment_summary: SentimentScore
    post_count: int
    platforms: list[str]  # ["reddit"]
    metadata: dict[str, Any]

PostData Format

class PostData(BaseModel):
    title: str
    content: str
    author: str
    source: str  # subreddit name
    date: str
    url: str
    score: int
    comments: int
    engagement_score: int
    subreddit: str | None
    sentiment: SentimentScore | None
    metadata: dict[str, Any]

Implementation Tasks

Phase 1: Create RedditClient

  1. RedditClient Implementation

    • Create tradingagents/clients/reddit_client.py
    • Follow FinnhubClient standard: no BaseClient inheritance, date objects, proper error handling
    • Use PRAW (Python Reddit API Wrapper) library for Reddit API access
    • Methods: search_posts(), get_top_posts(), get_company_posts()
    • Implement date filtering for posts within specified ranges
    • Handle Reddit API rate limits and authentication
  2. Comprehensive Testing

    • Create tradingagents/clients/test_reddit_client.py
    • Use pytest-vcr for Reddit API interaction recording
    • Test all client methods with multiple queries and subreddits
    • Test error handling and API rate limit scenarios
    • Mock Reddit API responses for consistent testing

Phase 2: Bridge SocialRepository Interface

  1. Repository Interface Standardization

    • Add standard service interface methods to SocialRepository
    • Bridge existing get_social_data() with get_data()
    • Bridge existing store_social_posts() with store_data()
    • Add missing has_data_for_period() and clear_data() methods
    • File: tradingagents/repositories/social_repository.py
    • Maintain existing dataclass functionality while adding service compatibility
  2. Repository Method Implementation

    # Add these methods to SocialRepository
    def has_data_for_period(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool
    def get_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> dict[str, Any]
    def store_data(self, query: str, cache_data: dict, symbol: str | None = None, overwrite: bool = False) -> bool
    def clear_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool
    

Phase 3: Update SocialMediaService

  1. Client Integration Fix

    • Replace BaseClient dependency with RedditClient
    • File: tradingagents/services/social_media_service.py:27
    • Update constructor: reddit_client: RedditClient
  2. Date Conversion Fix

    • Add date.fromisoformat() conversion in service methods
    • Update all client calls to use date objects instead of strings
    • File: tradingagents/services/social_media_service.py:182-190, 418-429
  3. Repository Interface Integration

    • Update repository method calls to use new standard interface
    • Ensure proper error handling for repository operations
    • File: tradingagents/services/social_media_service.py:302-311, 325-337

Phase 4: Type Safety & Validation

  1. Comprehensive Type Checking

    • Run mise run typecheck - must pass with 0 errors
    • Validate all date object conversions
    • Ensure SocialContext compliance
  2. Enhanced Testing

    • Update existing service tests for new RedditClient interface
    • Add gap detection test scenarios
    • Test sentiment analysis accuracy with known datasets
    • Test multi-subreddit aggregation and deduplication

Success Criteria

Functional Requirements

  • Service successfully calls RedditClient with date objects
  • Local-first strategy works: checks cache → identifies gaps → fetches missing → stores updates
  • Returns properly validated SocialContext to agents
  • Sentiment analysis provides accurate scores with confidence metrics
  • Multi-subreddit support with post deduplication
  • Force refresh bypasses cache and refreshes data

Technical Requirements

  • Zero type checking errors: mise run typecheck
  • Zero linting errors: mise run lint
  • All existing tests pass with updated architecture
  • No runtime errors with date conversions

Quality Requirements

  • Strongly-typed interfaces between all components
  • PRAW library integration for reliable Reddit API access
  • Comprehensive error handling and logging
  • Efficient caching with minimal API calls
  • Clear separation of concerns between service, client, and repository
  • Accurate sentiment analysis with engagement weighting

Data Architecture

RedditClient Response Format

{
    "query": "Tesla",
    "period": {"start": "2024-01-01", "end": "2024-01-31"},
    "posts": [
        {
            "title": "Tesla Q4 earnings beat expectations",
            "content": "Tesla reported strong Q4 results...",
            "author": "teslaInvestor",
            "subreddit": "TeslaInvestors",
            "created_utc": 1704067200,
            "score": 245,
            "num_comments": 67,
            "upvote_ratio": 0.92,
            "url": "https://reddit.com/r/TeslaInvestors/comments/xyz789",
            "id": "xyz789"
        }
    ],
    "metadata": {
        "source": "reddit",
        "retrieved_at": "2024-01-31T10:00:00Z",
        "data_quality": "HIGH",
        "subreddits": ["TeslaInvestors", "stocks"],
        "post_count": 25,
        "api_calls": 3
    }
}

SocialRepository Data Bridge Format

# Repository stores data in existing SocialPost format but provides service interface
{
    "query": "Tesla",
    "symbol": "TSLA",
    "posts": [
        {
            "title": "Tesla Q4 earnings beat expectations",
            "content": "Tesla reported strong Q4 results...",
            "author": "teslaInvestor",
            "source": "TeslaInvestors",
            "date": "2024-01-15",
            "url": "https://reddit.com/r/TeslaInvestors/comments/xyz789",
            "score": 245,
            "comments": 67,
            "engagement_score": 312,
            "subreddit": "TeslaInvestors",
            "sentiment": {
                "score": 0.7,
                "confidence": 0.8,
                "label": "positive"
            },
            "metadata": {
                "platform_id": "xyz789",
                "upvote_ratio": 0.92
            }
        }
    ],
    "metadata": {
        "cached_at": "2024-01-31T10:00:00Z",
        "post_count": 25,
        "sources": ["reddit"]
    }
}
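Post deduplication across overlapping fetches can key on the platform_id stored in each post's metadata. This standalone helper is an assumption about how the repository's deduplication works, keeping the first occurrence of each id:

```python
from typing import Any


def deduplicate_posts(posts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop duplicate posts, keeping the first occurrence of each platform_id.

    Sketch of the repository's deduplication step; posts without a
    platform_id are kept as-is.
    """
    seen: set[str] = set()
    unique: list[dict[str, Any]] = []
    for post in posts:
        pid = post.get("metadata", {}).get("platform_id")
        if pid is not None and pid in seen:
            continue  # already stored this post from an earlier fetch
        if pid is not None:
            seen.add(pid)
        unique.append(post)
    return unique
```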

Dependencies

Missing Components (Need Creation)

  • RedditClient needs full implementation from scratch
  • Service interface bridge methods for SocialRepository
  • Comprehensive pytest-vcr test suites for Reddit API

Existing Components (Ready)

  • SocialRepository with JSON storage and deduplication
  • SocialContext and PostData Pydantic models
  • Sentiment analysis and engagement metrics logic

Required

  • PRAW (Python Reddit API Wrapper) library for Reddit integration
  • Valid Reddit API credentials (client_id, client_secret, user_agent)
  • Working internet connection for live data fetching
  • Writable data directory for repository storage

Timeline

Immediate (Phase 1)

  • Create RedditClient following FinnhubClient standard with PRAW integration
  • Implement comprehensive testing with pytest-vcr for Reddit API
  • Validate client functionality with multiple subreddits and queries

Phase 2-3

  • Add standard service interface methods to SocialRepository
  • Update SocialMediaService to use RedditClient with date objects
  • Bridge repository interfaces while maintaining existing functionality

Phase 4

  • Comprehensive type checking and validation
  • Integration testing with sentiment analysis workflows
  • Performance optimization and caching efficiency

Acceptance Criteria

Must Have

  1. Type Safety: Service passes mise run typecheck with zero errors
  2. Client Integration: All RedditClient calls use date objects correctly
  3. Local-First: Service checks repository before Reddit API calls
  4. Context Validation: Returns valid SocialContext with Pydantic validation
  5. Sentiment Analysis: Provides accurate sentiment scores with confidence metrics
  6. Multi-Platform: Aggregates Reddit data behind a platform-agnostic interface that can extend to other sources

Should Have

  1. Gap Detection: Intelligent identification of missing data periods
  2. Cache Efficiency: Minimal redundant API calls to Reddit
  3. Force Refresh: Complete cache bypass when requested
  4. Data Quality: Metadata indicating data source and quality metrics
  5. Deduplication: Automatic removal of duplicate posts by platform_id

Nice to Have

  1. Performance Metrics: Timing and cache hit rate logging
  2. Data Staleness: Automatic refresh of old cached social data
  3. Enhanced Sentiment: Integration with advanced NLP libraries (TextBlob, VADER)
  4. Real-time Social: Support for live social media feeds and alerts
  5. Platform Expansion: Easy addition of Twitter, Discord, other social platforms

This PRD focuses on completing the SocialMediaService as a strongly-typed, local-first data service that integrates Reddit social media data through a new RedditClient following the established FinnhubClient standard patterns, while providing comprehensive sentiment analysis and engagement metrics to trading agents.