TradingAgents/SocialMediaService_PRD.md

# Product Requirements Document: SocialMediaService Completion

## Overview

Complete the `SocialMediaService` to provide strongly-typed social media data and sentiment analysis to trading agents using a local-first data strategy with gap detection and intelligent caching.

## Current State Analysis

### Issues to Fix
- **CRITICAL**: Missing `RedditClient` implementation - service calls non-existent client methods
- **CRITICAL**: Service uses `BaseClient` inheritance but needs typed `RedditClient`
- **CRITICAL**: `SocialRepository` has different interface than standard service pattern
- **CRITICAL**: Repository uses `date` objects internally but service expects string date interface
- Missing strongly-typed interfaces between components
- Service calls `reddit_client.search_posts()`, `get_top_posts()`, `filter_posts_by_date()` methods that don't exist

### What Works
- ✅ Local-first data strategy implementation (`_get_social_data_local_first`)
- ✅ Force refresh logic (`_fetch_and_cache_fresh_social_data`)
- ✅ `SocialContext` Pydantic model for agent consumption
- ✅ Comprehensive sentiment analysis with keyword-based scoring
- ✅ Engagement metrics calculation and post ranking
- ✅ Error handling and metadata creation patterns
- ✅ `SocialRepository` with JSON storage and post deduplication
- ✅ `PostData` and `SentimentScore` models for structured data
- ✅ Real-time sentiment analysis with weighted scoring

## Technical Requirements

### 1. Strongly-Typed Interfaces

#### Client → Service Interface
```python
# RedditClient methods (to be implemented)
def search_posts(query: str, subreddit_names: list[str], start_date: date, end_date: date, limit: int, time_filter: str) -> dict[str, Any]
def get_top_posts(subreddit_names: list[str], start_date: date, end_date: date, limit: int, time_filter: str) -> dict[str, Any]
def get_company_posts(symbol: str, subreddit_names: list[str], start_date: date, end_date: date, limit: int) -> dict[str, Any]
```

#### Service → Repository Interface
```python
# SocialRepository methods (to be implemented/bridged)
def has_data_for_period(query: str, start_date: str, end_date: str, symbol: str | None) -> bool
def get_data(query: str, start_date: str, end_date: str, symbol: str | None) -> dict[str, Any]
def store_data(query: str, cache_data: dict, symbol: str | None, overwrite: bool) -> bool
def clear_data(query: str, start_date: str, end_date: str, symbol: str | None) -> bool
```

#### Service → Agent Interface
```python
# Service output (already defined)
def get_context(query: str, start_date: str, end_date: str, symbol: str | None, subreddits: list[str], force_refresh: bool) -> SocialContext
def get_company_social_context(symbol: str, start_date: str, end_date: str, subreddits: list[str]) -> SocialContext
def get_global_trends(start_date: str, end_date: str, subreddits: list[str]) -> SocialContext
```

### 2. Local-First Data Strategy

#### Flow
1. **Repository Lookup**: Check `SocialRepository.has_data_for_period()`
2. **Gap Detection**: Identify missing social media data periods
3. **Selective Fetching**: Fetch only missing data from `RedditClient`
4. **Cache Updates**: Store new data via `repository.store_data()`
5. **Context Assembly**: Return validated `SocialContext`

#### Force Refresh Support
- `force_refresh=True` bypasses local data completely
- Clears existing cache before fetching fresh data
- Stores refreshed data with metadata indicating refresh

### 3. Date Object Conversion

#### Service Boundary Conversion
```python
# Service receives string dates from agents
def get_context(self, query: str, start_date: str, end_date: str, ...) -> SocialContext:
    # Convert to date objects for client calls
    start_dt = date.fromisoformat(start_date)
    end_dt = date.fromisoformat(end_date)

    # Use date objects when calling RedditClient
    posts_data = self.reddit_client.search_posts(query, subreddits, start_dt, end_dt, limit, time_filter)

    # Repository bridge handles string to date conversion internally
    cached_data = self.repository.get_data(query, start_date, end_date, symbol)
```

### 4. Reddit API Integration

#### RedditClient Implementation Strategy
```python
# RedditClient following FinnhubClient standard
class RedditClient:
    """Client for Reddit API access with PRAW library integration."""

    def __init__(self, client_id: str, client_secret: str, user_agent: str):
        """Initialize Reddit client with PRAW."""
        import praw
        self.reddit = praw.Reddit(
            client_id=client_id,
            client_secret=client_secret,
            user_agent=user_agent
        )

    def search_posts(self, query: str, subreddit_names: list[str],
                    start_date: date, end_date: date, limit: int = 50,
                    time_filter: str = "week") -> dict[str, Any]:
        """Search for posts across subreddits within date range."""

    def get_top_posts(self, subreddit_names: list[str],
                     start_date: date, end_date: date, limit: int = 50,
                     time_filter: str = "week") -> dict[str, Any]:
        """Get top posts from subreddits within date range."""

    def get_company_posts(self, symbol: str, subreddit_names: list[str],
                         start_date: date, end_date: date, limit: int = 50) -> dict[str, Any]:
        """Get company-specific posts from subreddits."""
```

#### Reddit Response Format
```python
{
    "query": "AAPL",
    "period": {"start": "2024-01-01", "end": "2024-01-31"},
    "posts": [
        {
            "title": "Apple earnings discussion",
            "content": "What do you think about...",
            "author": "redditor123",
            "subreddit": "investing",
            "created_utc": 1704067200,
            "score": 125,
            "num_comments": 45,
            "upvote_ratio": 0.87,
            "url": "https://reddit.com/r/investing/comments/abc123",
            "id": "abc123"
        }
    ],
    "metadata": {
        "source": "reddit",
        "retrieved_at": "2024-01-31T10:00:00Z",
        "data_quality": "HIGH",
        "subreddits": ["investing", "stocks"],
        "total_posts": 25
    }
}
```

### 5. Sentiment Analysis Enhancement

#### Advanced Sentiment Features
- **Weighted Scoring**: High-engagement posts have more influence on overall sentiment
- **Keyword Analysis**: Comprehensive positive/negative keyword detection
- **Score Adjustment**: Reddit score (upvotes) influences sentiment confidence
- **Confidence Metrics**: Based on post count and engagement levels
- **Multi-level Analysis**: Individual post sentiment + overall summary sentiment

#### Sentiment Calculation Strategy
```python
def _calculate_advanced_sentiment(self, posts: list[PostData]) -> SentimentScore:
    """Enhanced sentiment analysis with multiple factors."""
    # Weight by engagement score (upvotes + comments)
    # Adjust for subreddit context (WSB vs investing)
    # Consider temporal patterns (recent posts weighted higher)
    # Apply confidence scoring based on data volume
```

### 6. Pydantic Validation

#### Context Structure
```python
@dataclass
class SocialContext(BaseModel):
    symbol: str | None
    period: dict[str, str]  # {"start": "2024-01-01", "end": "2024-01-31"}
    posts: list[PostData]
    engagement_metrics: dict[str, float]
    sentiment_summary: SentimentScore
    post_count: int
    platforms: list[str]  # ["reddit"]
    metadata: dict[str, Any]
```

#### PostData Format
```python
@dataclass
class PostData(BaseModel):
    title: str
    content: str
    author: str
    source: str  # subreddit name
    date: str
    url: str
    score: int
    comments: int
    engagement_score: int
    subreddit: str | None
    sentiment: SentimentScore | None
    metadata: dict[str, Any]
```

## Implementation Tasks

### Phase 1: Create RedditClient

1. **RedditClient Implementation**
   - Create `tradingagents/clients/reddit_client.py`
   - Follow FinnhubClient standard: no BaseClient inheritance, date objects, proper error handling
   - Use PRAW (Python Reddit API Wrapper) library for Reddit API access
   - Methods: `search_posts()`, `get_top_posts()`, `get_company_posts()`
   - Implement date filtering for posts within specified ranges
   - Handle Reddit API rate limits and authentication

2. **Comprehensive Testing**
   - Create `tradingagents/clients/test_reddit_client.py`
   - Use pytest-vcr for Reddit API interaction recording
   - Test all client methods with multiple queries and subreddits
   - Test error handling and API rate limit scenarios
   - Mock Reddit API responses for consistent testing

### Phase 2: Bridge SocialRepository Interface

3. **Repository Interface Standardization**
   - Add standard service interface methods to `SocialRepository`
   - Bridge existing `get_social_data()` with `get_data()`
   - Bridge existing `store_social_posts()` with `store_data()`
   - Add missing `has_data_for_period()` and `clear_data()` methods
   - File: `tradingagents/repositories/social_repository.py`
   - Maintain existing dataclass functionality while adding service compatibility

4. **Repository Method Implementation**
   ```python
   # Add these methods to SocialRepository
   def has_data_for_period(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool
   def get_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> dict[str, Any]
   def store_data(self, query: str, cache_data: dict, symbol: str | None = None, overwrite: bool = False) -> bool
   def clear_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool
   ```

### Phase 3: Update SocialMediaService

5. **Client Integration Fix**
   - Replace `BaseClient` dependency with `RedditClient`
   - File: `tradingagents/services/social_media_service.py:27`
   - Update constructor: `reddit_client: RedditClient`

6. **Date Conversion Fix**
   - Add `date.fromisoformat()` conversion in service methods
   - Update all client calls to use date objects instead of strings
   - File: `tradingagents/services/social_media_service.py:182-190, 418-429`

7. **Repository Interface Integration**
   - Update repository method calls to use new standard interface
   - Ensure proper error handling for repository operations
   - File: `tradingagents/services/social_media_service.py:302-311, 325-337`

### Phase 4: Type Safety & Validation

8. **Comprehensive Type Checking**
   - Run `mise run typecheck` - must pass with 0 errors
   - Validate all date object conversions
   - Ensure SocialContext compliance

9. **Enhanced Testing**
   - Update existing service tests for new RedditClient interface
   - Add gap detection test scenarios
   - Test sentiment analysis accuracy with known datasets
   - Test multi-subreddit aggregation and deduplication

## Success Criteria

### Functional Requirements
- ✅ Service successfully calls `RedditClient` with `date` objects
- ✅ Local-first strategy works: checks cache → identifies gaps → fetches missing → stores updates
- ✅ Returns properly validated `SocialContext` to agents
- ✅ Sentiment analysis provides accurate scores with confidence metrics
- ✅ Multi-subreddit support with post deduplication
- ✅ Force refresh bypasses cache and refreshes data

### Technical Requirements
- ✅ Zero type checking errors: `mise run typecheck`
- ✅ Zero linting errors: `mise run lint`
- ✅ All existing tests pass with updated architecture
- ✅ No runtime errors with date conversions

### Quality Requirements
- ✅ Strongly-typed interfaces between all components
- ✅ PRAW library integration for reliable Reddit API access
- ✅ Comprehensive error handling and logging
- ✅ Efficient caching with minimal API calls
- ✅ Clear separation of concerns between service, client, and repository
- ✅ Accurate sentiment analysis with engagement weighting

## Data Architecture

### RedditClient Response Format
```python
{
    "query": "Tesla",
    "period": {"start": "2024-01-01", "end": "2024-01-31"},
    "posts": [
        {
            "title": "Tesla Q4 earnings beat expectations",
            "content": "Tesla reported strong Q4 results...",
            "author": "teslaInvestor",
            "subreddit": "TeslaInvestors",
            "created_utc": 1704067200,
            "score": 245,
            "num_comments": 67,
            "upvote_ratio": 0.92,
            "url": "https://reddit.com/r/TeslaInvestors/comments/xyz789",
            "id": "xyz789"
        }
    ],
    "metadata": {
        "source": "reddit",
        "retrieved_at": "2024-01-31T10:00:00Z",
        "data_quality": "HIGH",
        "subreddits": ["TeslaInvestors", "stocks"],
        "post_count": 25,
        "api_calls": 3
    }
}
```

### SocialRepository Data Bridge Format
```python
# Repository stores data in existing SocialPost format but provides service interface
{
    "query": "Tesla",
    "symbol": "TSLA",
    "posts": [
        {
            "title": "Tesla Q4 earnings beat expectations",
            "content": "Tesla reported strong Q4 results...",
            "author": "teslaInvestor",
            "source": "TeslaInvestors",
            "date": "2024-01-15",
            "url": "https://reddit.com/r/TeslaInvestors/comments/xyz789",
            "score": 245,
            "comments": 67,
            "engagement_score": 312,
            "subreddit": "TeslaInvestors",
            "sentiment": {
                "score": 0.7,
                "confidence": 0.8,
                "label": "positive"
            },
            "metadata": {
                "platform_id": "xyz789",
                "upvote_ratio": 0.92
            }
        }
    ],
    "metadata": {
        "cached_at": "2024-01-31T10:00:00Z",
        "post_count": 25,
        "sources": ["reddit"]
    }
}
```

## Dependencies

### Missing Components (Need Creation)
- ⏳ `RedditClient` needs full implementation from scratch
- ⏳ Service interface bridge methods for `SocialRepository`
- ⏳ Comprehensive pytest-vcr test suites for Reddit API

### Existing Components (Ready)
- ✅ `SocialRepository` with JSON storage and deduplication
- ✅ `SocialContext` and `PostData` Pydantic models
- ✅ Sentiment analysis and engagement metrics logic

### Required
- PRAW (Python Reddit API Wrapper) library for Reddit integration
- Valid Reddit API credentials (client_id, client_secret, user_agent)
- Working internet connection for live data fetching
- Writable data directory for repository storage

## Timeline

### Immediate (Phase 1)
- Create RedditClient following FinnhubClient standard with PRAW integration
- Implement comprehensive testing with pytest-vcr for Reddit API
- Validate client functionality with multiple subreddits and queries

### Phase 2-3
- Add standard service interface methods to SocialRepository
- Update SocialMediaService to use RedditClient with date objects
- Bridge repository interfaces while maintaining existing functionality

### Phase 4
- Comprehensive type checking and validation
- Integration testing with sentiment analysis workflows
- Performance optimization and caching efficiency

## Acceptance Criteria

### Must Have
1. **Type Safety**: Service passes `mise run typecheck` with zero errors
2. **Client Integration**: All `RedditClient` calls use `date` objects correctly
3. **Local-First**: Service checks repository before Reddit API calls
4. **Context Validation**: Returns valid `SocialContext` with Pydantic validation
5. **Sentiment Analysis**: Provides accurate sentiment scores with confidence metrics
6. **Multi-Platform**: Seamlessly aggregates social data from Reddit with extensibility

### Should Have
1. **Gap Detection**: Intelligent identification of missing data periods
2. **Cache Efficiency**: Minimal redundant API calls to Reddit
3. **Force Refresh**: Complete cache bypass when requested
4. **Data Quality**: Metadata indicating data source and quality metrics
5. **Deduplication**: Automatic removal of duplicate posts by platform_id

### Nice to Have
1. **Performance Metrics**: Timing and cache hit rate logging
2. **Data Staleness**: Automatic refresh of old cached social data
3. **Enhanced Sentiment**: Integration with advanced NLP libraries (TextBlob, VADER)
4. **Real-time Social**: Support for live social media feeds and alerts
5. **Platform Expansion**: Easy addition of Twitter, Discord, other social platforms

---

This PRD focuses on completing the `SocialMediaService` as a strongly-typed, local-first data service that integrates Reddit social media data through a new `RedditClient` following the established FinnhubClient standard patterns, while providing comprehensive sentiment analysis and engagement metrics to trading agents.