Product Requirements Document: SocialMediaService Completion

Overview

Complete the SocialMediaService to provide strongly-typed social media data and sentiment analysis to trading agents using a local-first data strategy with gap detection and intelligent caching.

Current State Analysis

Issues to Fix

  • CRITICAL: Missing RedditClient implementation - the service calls client methods that do not exist
  • CRITICAL: Service depends on BaseClient inheritance but needs a typed RedditClient
  • CRITICAL: SocialRepository's interface differs from the standard service pattern
  • CRITICAL: Repository uses date objects internally while the service expects a string-date interface
  • Missing strongly-typed interfaces between components
  • Service calls reddit_client.search_posts(), get_top_posts(), and filter_posts_by_date(), none of which are implemented

What Works

  • Local-first data strategy implementation (_get_social_data_local_first)
  • Force refresh logic (_fetch_and_cache_fresh_social_data)
  • SocialContext Pydantic model for agent consumption
  • Comprehensive sentiment analysis with keyword-based scoring
  • Engagement metrics calculation and post ranking
  • Error handling and metadata creation patterns
  • SocialRepository with JSON storage and post deduplication
  • PostData and SentimentScore models for structured data
  • Real-time sentiment analysis with weighted scoring

Technical Requirements

1. Strongly-Typed Interfaces

Client → Service Interface

# RedditClient methods (to be implemented)
def search_posts(self, query: str, subreddit_names: list[str], start_date: date, end_date: date, limit: int, time_filter: str) -> dict[str, Any]: ...
def get_top_posts(self, subreddit_names: list[str], start_date: date, end_date: date, limit: int, time_filter: str) -> dict[str, Any]: ...
def get_company_posts(self, symbol: str, subreddit_names: list[str], start_date: date, end_date: date, limit: int) -> dict[str, Any]: ...

Service → Repository Interface

# SocialRepository methods (to be implemented/bridged)
def has_data_for_period(self, query: str, start_date: str, end_date: str, symbol: str | None) -> bool: ...
def get_data(self, query: str, start_date: str, end_date: str, symbol: str | None) -> dict[str, Any]: ...
def store_data(self, query: str, cache_data: dict, symbol: str | None, overwrite: bool) -> bool: ...
def clear_data(self, query: str, start_date: str, end_date: str, symbol: str | None) -> bool: ...

Service → Agent Interface

# Service output (already defined)
def get_context(self, query: str, start_date: str, end_date: str, symbol: str | None, subreddits: list[str], force_refresh: bool) -> SocialContext: ...
def get_company_social_context(self, symbol: str, start_date: str, end_date: str, subreddits: list[str]) -> SocialContext: ...
def get_global_trends(self, start_date: str, end_date: str, subreddits: list[str]) -> SocialContext: ...

2. Local-First Data Strategy

Flow

  1. Repository Lookup: Check SocialRepository.has_data_for_period()
  2. Gap Detection: Identify missing social media data periods
  3. Selective Fetching: Fetch only missing data from RedditClient
  4. Cache Updates: Store new data via repository.store_data()
  5. Context Assembly: Return validated SocialContext
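The gap-detection step in the flow above can be sketched as a pure function over date ranges. The `missing_periods` name and the `(start, end)` tuple convention are illustrative assumptions, not existing code:

```python
from datetime import date, timedelta


def missing_periods(
    requested: tuple[date, date],
    cached: list[tuple[date, date]],
) -> list[tuple[date, date]]:
    """Return the sub-ranges of `requested` not covered by any cached range.

    Hypothetical helper illustrating the gap-detection step; the real
    service may track coverage differently.
    """
    start, end = requested
    gaps: list[tuple[date, date]] = []
    cursor = start
    for c_start, c_end in sorted(cached):
        if c_end < cursor:
            continue  # cached range lies entirely before the cursor
        if c_start > end:
            break  # cached range lies entirely after the requested period
        if c_start > cursor:
            gaps.append((cursor, c_start - timedelta(days=1)))
        cursor = max(cursor, c_end + timedelta(days=1))
    if cursor <= end:
        gaps.append((cursor, end))
    return gaps
```

Selective fetching then iterates over the returned gaps, calling the client once per missing sub-range instead of re-fetching the whole period.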

Force Refresh Support

  • force_refresh=True bypasses local data completely
  • Clears existing cache before fetching fresh data
  • Stores refreshed data with metadata indicating refresh

3. Date Object Conversion

Service Boundary Conversion

# Service receives string dates from agents
def get_context(self, query: str, start_date: str, end_date: str, ...) -> SocialContext:
    # Convert to date objects for client calls
    start_dt = date.fromisoformat(start_date)
    end_dt = date.fromisoformat(end_date)
    
    # Use date objects when calling RedditClient
    posts_data = self.reddit_client.search_posts(query, subreddits, start_dt, end_dt, limit, time_filter)
    
    # Repository bridge handles string to date conversion internally
    cached_data = self.repository.get_data(query, start_date, end_date, symbol)

4. Reddit API Integration

RedditClient Implementation Strategy

# RedditClient following FinnhubClient standard
class RedditClient:
    """Client for Reddit API access with PRAW library integration."""
    
    def __init__(self, client_id: str, client_secret: str, user_agent: str):
        """Initialize Reddit client with PRAW."""
        import praw
        self.reddit = praw.Reddit(
            client_id=client_id,
            client_secret=client_secret,
            user_agent=user_agent
        )
    
    def search_posts(self, query: str, subreddit_names: list[str], 
                    start_date: date, end_date: date, limit: int = 50, 
                    time_filter: str = "week") -> dict[str, Any]:
        """Search for posts across subreddits within date range."""
        
    def get_top_posts(self, subreddit_names: list[str], 
                     start_date: date, end_date: date, limit: int = 50, 
                     time_filter: str = "week") -> dict[str, Any]:
        """Get top posts from subreddits within date range."""
        
    def get_company_posts(self, symbol: str, subreddit_names: list[str],
                         start_date: date, end_date: date, limit: int = 50) -> dict[str, Any]:
        """Get company-specific posts from subreddits."""

Reddit Response Format

{
    "query": "AAPL",
    "period": {"start": "2024-01-01", "end": "2024-01-31"},
    "posts": [
        {
            "title": "Apple earnings discussion",
            "content": "What do you think about...",
            "author": "redditor123",
            "subreddit": "investing",
            "created_utc": 1704067200,
            "score": 125,
            "num_comments": 45,
            "upvote_ratio": 0.87,
            "url": "https://reddit.com/r/investing/comments/abc123",
            "id": "abc123"
        }
    ],
    "metadata": {
        "source": "reddit",
        "retrieved_at": "2024-01-31T10:00:00Z",
        "data_quality": "HIGH",
        "subreddits": ["investing", "stocks"],
        "total_posts": 25
    }
}
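Bridging a raw response post into the repository's PostData shape is mostly field renaming plus a derived engagement_score (upvotes plus comment count). The `to_post_data_dict` helper name is an assumption; the output keys follow the PostData model:

```python
from datetime import datetime, timezone
from typing import Any


def to_post_data_dict(raw: dict[str, Any]) -> dict[str, Any]:
    """Map a raw Reddit post dict onto the repository's PostData shape.

    Hypothetical bridge helper: engagement_score is derived as
    score + comments, and created_utc is rendered as an ISO date.
    """
    created = datetime.fromtimestamp(raw["created_utc"], tz=timezone.utc)
    return {
        "title": raw["title"],
        "content": raw["content"],
        "author": raw["author"],
        "source": raw["subreddit"],
        "date": created.date().isoformat(),
        "url": raw["url"],
        "score": raw["score"],
        "comments": raw["num_comments"],
        "engagement_score": raw["score"] + raw["num_comments"],
        "subreddit": raw["subreddit"],
        "sentiment": None,  # filled in later by the sentiment analyzer
        "metadata": {
            "platform_id": raw["id"],
            "upvote_ratio": raw.get("upvote_ratio"),
        },
    }
```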

5. Sentiment Analysis Enhancement

Advanced Sentiment Features

  • Weighted Scoring: High-engagement posts have more influence on overall sentiment
  • Keyword Analysis: Comprehensive positive/negative keyword detection
  • Score Adjustment: Reddit score (upvotes) influences sentiment confidence
  • Confidence Metrics: Based on post count and engagement levels
  • Multi-level Analysis: Individual post sentiment + overall summary sentiment

Sentiment Calculation Strategy

def _calculate_advanced_sentiment(self, posts: list[PostData]) -> SentimentScore:
    """Enhanced sentiment analysis with multiple factors."""
    # Weight by engagement score (upvotes + comments)
    # Adjust for subreddit context (WSB vs investing)
    # Consider temporal patterns (recent posts weighted higher)
    # Apply confidence scoring based on data volume
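An engagement-weighted version of the keyword scorer might look like the following. The keyword sets, the `(1 + engagement_score)` weighting, and the dict-shaped result are all illustrative assumptions, not the service's actual implementation:

```python
POSITIVE = {"beat", "strong", "bullish", "growth", "upgrade"}
NEGATIVE = {"miss", "weak", "bearish", "decline", "downgrade"}


def weighted_sentiment(posts: list[dict]) -> dict:
    """Engagement-weighted keyword sentiment over a batch of posts.

    Sketch: each post's keyword score in [-1, 1] is weighted by
    (1 + engagement_score), and confidence grows with post count.
    """
    if not posts:
        return {"score": 0.0, "confidence": 0.0, "label": "neutral"}
    weighted_sum = 0.0
    total_weight = 0.0
    for post in posts:
        words = f"{post['title']} {post['content']}".lower().split()
        pos = sum(w in POSITIVE for w in words)
        neg = sum(w in NEGATIVE for w in words)
        raw = (pos - neg) / max(pos + neg, 1)  # per-post score in [-1, 1]
        weight = 1 + post.get("engagement_score", 0)
        weighted_sum += raw * weight
        total_weight += weight
    score = weighted_sum / total_weight
    confidence = min(1.0, len(posts) / 10)  # more posts -> more confidence
    label = "positive" if score > 0.1 else "negative" if score < -0.1 else "neutral"
    return {"score": round(score, 3), "confidence": confidence, "label": label}
```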

6. Pydantic Validation

Context Structure

class SocialContext(BaseModel):
    symbol: str | None
    period: dict[str, str]  # {"start": "2024-01-01", "end": "2024-01-31"}
    posts: list[PostData]
    engagement_metrics: dict[str, float]
    sentiment_summary: SentimentScore
    post_count: int
    platforms: list[str]  # ["reddit"]
    metadata: dict[str, Any]

PostData Format

class PostData(BaseModel):
    title: str
    content: str
    author: str
    source: str  # subreddit name
    date: str
    url: str
    score: int
    comments: int
    engagement_score: int
    subreddit: str | None
    sentiment: SentimentScore | None
    metadata: dict[str, Any]

Implementation Tasks

Phase 1: Create RedditClient

  1. RedditClient Implementation

    • Create tradingagents/clients/reddit_client.py
    • Follow FinnhubClient standard: no BaseClient inheritance, date objects, proper error handling
    • Use PRAW (Python Reddit API Wrapper) library for Reddit API access
    • Methods: search_posts(), get_top_posts(), get_company_posts()
    • Implement date filtering for posts within specified ranges
    • Handle Reddit API rate limits and authentication
  2. Comprehensive Testing

    • Create tradingagents/clients/test_reddit_client.py
    • Use pytest-vcr for Reddit API interaction recording
    • Test all client methods with multiple queries and subreddits
    • Test error handling and API rate limit scenarios
    • Mock Reddit API responses for consistent testing

Phase 2: Bridge SocialRepository Interface

  1. Repository Interface Standardization

    • Add standard service interface methods to SocialRepository
    • Bridge existing get_social_data() with get_data()
    • Bridge existing store_social_posts() with store_data()
    • Add missing has_data_for_period() and clear_data() methods
    • File: tradingagents/repositories/social_repository.py
    • Maintain existing dataclass functionality while adding service compatibility
  2. Repository Method Implementation

    # Add these methods to SocialRepository
    def has_data_for_period(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool
    def get_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> dict[str, Any]
    def store_data(self, query: str, cache_data: dict, symbol: str | None = None, overwrite: bool = False) -> bool
    def clear_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool
    

Phase 3: Update SocialMediaService

  1. Client Integration Fix

    • Replace BaseClient dependency with RedditClient
    • File: tradingagents/services/social_media_service.py:27
    • Update constructor: reddit_client: RedditClient
  2. Date Conversion Fix

    • Add date.fromisoformat() conversion in service methods
    • Update all client calls to use date objects instead of strings
    • File: tradingagents/services/social_media_service.py:182-190, 418-429
  3. Repository Interface Integration

    • Update repository method calls to use new standard interface
    • Ensure proper error handling for repository operations
    • File: tradingagents/services/social_media_service.py:302-311, 325-337

Phase 4: Type Safety & Validation

  1. Comprehensive Type Checking

    • Run mise run typecheck - must pass with 0 errors
    • Validate all date object conversions
    • Ensure SocialContext compliance
  2. Enhanced Testing

    • Update existing service tests for new RedditClient interface
    • Add gap detection test scenarios
    • Test sentiment analysis accuracy with known datasets
    • Test multi-subreddit aggregation and deduplication

Success Criteria

Functional Requirements

  • Service successfully calls RedditClient with date objects
  • Local-first strategy works: checks cache → identifies gaps → fetches missing → stores updates
  • Returns properly validated SocialContext to agents
  • Sentiment analysis provides accurate scores with confidence metrics
  • Multi-subreddit support with post deduplication
  • Force refresh bypasses cache and refreshes data

Technical Requirements

  • Zero type checking errors: mise run typecheck
  • Zero linting errors: mise run lint
  • All existing tests pass with updated architecture
  • No runtime errors with date conversions

Quality Requirements

  • Strongly-typed interfaces between all components
  • PRAW library integration for reliable Reddit API access
  • Comprehensive error handling and logging
  • Efficient caching with minimal API calls
  • Clear separation of concerns between service, client, and repository
  • Accurate sentiment analysis with engagement weighting

Data Architecture

RedditClient Response Format

{
    "query": "Tesla",
    "period": {"start": "2024-01-01", "end": "2024-01-31"},
    "posts": [
        {
            "title": "Tesla Q4 earnings beat expectations",
            "content": "Tesla reported strong Q4 results...",
            "author": "teslaInvestor",
            "subreddit": "TeslaInvestors",
            "created_utc": 1704067200,
            "score": 245,
            "num_comments": 67,
            "upvote_ratio": 0.92,
            "url": "https://reddit.com/r/TeslaInvestors/comments/xyz789",
            "id": "xyz789"
        }
    ],
    "metadata": {
        "source": "reddit",
        "retrieved_at": "2024-01-31T10:00:00Z",
        "data_quality": "HIGH",
        "subreddits": ["TeslaInvestors", "stocks"],
        "post_count": 25,
        "api_calls": 3
    }
}

SocialRepository Data Bridge Format

# Repository stores data in existing SocialPost format but provides service interface
{
    "query": "Tesla",
    "symbol": "TSLA",
    "posts": [
        {
            "title": "Tesla Q4 earnings beat expectations",
            "content": "Tesla reported strong Q4 results...",
            "author": "teslaInvestor",
            "source": "TeslaInvestors",
            "date": "2024-01-15",
            "url": "https://reddit.com/r/TeslaInvestors/comments/xyz789",
            "score": 245,
            "comments": 67,
            "engagement_score": 312,
            "subreddit": "TeslaInvestors",
            "sentiment": {
                "score": 0.7,
                "confidence": 0.8,
                "label": "positive"
            },
            "metadata": {
                "platform_id": "xyz789",
                "upvote_ratio": 0.92
            }
        }
    ],
    "metadata": {
        "cached_at": "2024-01-31T10:00:00Z",
        "post_count": 25,
        "sources": ["reddit"]
    }
}
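Post deduplication across overlapping fetches can key on the platform_id stored in each post's metadata. This standalone helper is an assumption about how the repository's deduplication works, keeping the first occurrence of each id:

```python
from typing import Any


def deduplicate_posts(posts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop duplicate posts, keeping the first occurrence of each platform_id.

    Sketch of the repository's deduplication step; posts without a
    platform_id are kept as-is.
    """
    seen: set[str] = set()
    unique: list[dict[str, Any]] = []
    for post in posts:
        pid = post.get("metadata", {}).get("platform_id")
        if pid is not None and pid in seen:
            continue  # already stored this post from an earlier fetch
        if pid is not None:
            seen.add(pid)
        unique.append(post)
    return unique
```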

Dependencies

Missing Components (Need Creation)

  • RedditClient needs full implementation from scratch
  • Service interface bridge methods for SocialRepository
  • Comprehensive pytest-vcr test suites for Reddit API

Existing Components (Ready)

  • SocialRepository with JSON storage and deduplication
  • SocialContext and PostData Pydantic models
  • Sentiment analysis and engagement metrics logic

Required

  • PRAW (Python Reddit API Wrapper) library for Reddit integration
  • Valid Reddit API credentials (client_id, client_secret, user_agent)
  • Working internet connection for live data fetching
  • Writable data directory for repository storage

Timeline

Immediate (Phase 1)

  • Create RedditClient following FinnhubClient standard with PRAW integration
  • Implement comprehensive testing with pytest-vcr for Reddit API
  • Validate client functionality with multiple subreddits and queries

Phase 2-3

  • Add standard service interface methods to SocialRepository
  • Update SocialMediaService to use RedditClient with date objects
  • Bridge repository interfaces while maintaining existing functionality

Phase 4

  • Comprehensive type checking and validation
  • Integration testing with sentiment analysis workflows
  • Performance optimization and caching efficiency

Acceptance Criteria

Must Have

  1. Type Safety: Service passes mise run typecheck with zero errors
  2. Client Integration: All RedditClient calls use date objects correctly
  3. Local-First: Service checks repository before Reddit API calls
  4. Context Validation: Returns valid SocialContext with Pydantic validation
  5. Sentiment Analysis: Provides accurate sentiment scores with confidence metrics
  6. Multi-Platform: Aggregates Reddit data behind a platform-agnostic interface that can extend to other sources

Should Have

  1. Gap Detection: Intelligent identification of missing data periods
  2. Cache Efficiency: Minimal redundant API calls to Reddit
  3. Force Refresh: Complete cache bypass when requested
  4. Data Quality: Metadata indicating data source and quality metrics
  5. Deduplication: Automatic removal of duplicate posts by platform_id

Nice to Have

  1. Performance Metrics: Timing and cache hit rate logging
  2. Data Staleness: Automatic refresh of old cached social data
  3. Enhanced Sentiment: Integration with advanced NLP libraries (TextBlob, VADER)
  4. Real-time Social: Support for live social media feeds and alerts
  5. Platform Expansion: Easy addition of Twitter, Discord, other social platforms

This PRD focuses on completing the SocialMediaService as a strongly-typed, local-first data service that integrates Reddit social media data through a new RedditClient following the established FinnhubClient standard patterns, while providing comprehensive sentiment analysis and engagement metrics to trading agents.