TradingAgents/SocialMediaService_PRD.md

424 lines
16 KiB
Markdown

# Product Requirements Document: SocialMediaService Completion
## Overview
Complete the `SocialMediaService` to provide strongly-typed social media data and sentiment analysis to trading agents using a local-first data strategy with gap detection and intelligent caching.
## Current State Analysis
### Issues to Fix
- **CRITICAL**: Missing `RedditClient` implementation - service calls non-existent client methods
- **CRITICAL**: Service uses `BaseClient` inheritance but needs typed `RedditClient`
- **CRITICAL**: `SocialRepository` has different interface than standard service pattern
- **CRITICAL**: Repository uses `date` objects internally but service expects string date interface
- Missing strongly-typed interfaces between components
- Service calls `reddit_client.search_posts()`, `get_top_posts()`, `filter_posts_by_date()` methods that don't exist
### What Works
- ✅ Local-first data strategy implementation (`_get_social_data_local_first`)
- ✅ Force refresh logic (`_fetch_and_cache_fresh_social_data`)
-`SocialContext` Pydantic model for agent consumption
- ✅ Comprehensive sentiment analysis with keyword-based scoring
- ✅ Engagement metrics calculation and post ranking
- ✅ Error handling and metadata creation patterns
-`SocialRepository` with JSON storage and post deduplication
-`PostData` and `SentimentScore` models for structured data
- ✅ Real-time sentiment analysis with weighted scoring
## Technical Requirements
### 1. Strongly-Typed Interfaces
#### Client → Service Interface
```python
# RedditClient methods (to be implemented)
def search_posts(query: str, subreddit_names: list[str], start_date: date, end_date: date, limit: int, time_filter: str) -> dict[str, Any]
def get_top_posts(subreddit_names: list[str], start_date: date, end_date: date, limit: int, time_filter: str) -> dict[str, Any]
def get_company_posts(symbol: str, subreddit_names: list[str], start_date: date, end_date: date, limit: int) -> dict[str, Any]
```
#### Service → Repository Interface
```python
# SocialRepository methods (to be implemented/bridged)
def has_data_for_period(query: str, start_date: str, end_date: str, symbol: str | None) -> bool
def get_data(query: str, start_date: str, end_date: str, symbol: str | None) -> dict[str, Any]
def store_data(query: str, cache_data: dict, symbol: str | None, overwrite: bool) -> bool
def clear_data(query: str, start_date: str, end_date: str, symbol: str | None) -> bool
```
#### Service → Agent Interface
```python
# Service output (already defined)
def get_context(query: str, start_date: str, end_date: str, symbol: str | None, subreddits: list[str], force_refresh: bool) -> SocialContext
def get_company_social_context(symbol: str, start_date: str, end_date: str, subreddits: list[str]) -> SocialContext
def get_global_trends(start_date: str, end_date: str, subreddits: list[str]) -> SocialContext
```
### 2. Local-First Data Strategy
#### Flow
1. **Repository Lookup**: Check `SocialRepository.has_data_for_period()`
2. **Gap Detection**: Identify missing social media data periods
3. **Selective Fetching**: Fetch only missing data from `RedditClient`
4. **Cache Updates**: Store new data via `repository.store_data()`
5. **Context Assembly**: Return validated `SocialContext`
#### Force Refresh Support
- `force_refresh=True` bypasses local data completely
- Clears existing cache before fetching fresh data
- Stores refreshed data with metadata indicating refresh
### 3. Date Object Conversion
#### Service Boundary Conversion
```python
# Service receives string dates from agents
def get_context(self, query: str, start_date: str, end_date: str, ...) -> SocialContext:
# Convert to date objects for client calls
start_dt = date.fromisoformat(start_date)
end_dt = date.fromisoformat(end_date)
# Use date objects when calling RedditClient
posts_data = self.reddit_client.search_posts(query, subreddits, start_dt, end_dt, limit, time_filter)
# Repository bridge handles string to date conversion internally
cached_data = self.repository.get_data(query, start_date, end_date, symbol)
```
### 4. Reddit API Integration
#### RedditClient Implementation Strategy
```python
# RedditClient following FinnhubClient standard
class RedditClient:
"""Client for Reddit API access with PRAW library integration."""
def __init__(self, client_id: str, client_secret: str, user_agent: str):
"""Initialize Reddit client with PRAW."""
import praw
self.reddit = praw.Reddit(
client_id=client_id,
client_secret=client_secret,
user_agent=user_agent
)
def search_posts(self, query: str, subreddit_names: list[str],
start_date: date, end_date: date, limit: int = 50,
time_filter: str = "week") -> dict[str, Any]:
"""Search for posts across subreddits within date range."""
def get_top_posts(self, subreddit_names: list[str],
start_date: date, end_date: date, limit: int = 50,
time_filter: str = "week") -> dict[str, Any]:
"""Get top posts from subreddits within date range."""
def get_company_posts(self, symbol: str, subreddit_names: list[str],
start_date: date, end_date: date, limit: int = 50) -> dict[str, Any]:
"""Get company-specific posts from subreddits."""
```
#### Reddit Response Format
```python
{
"query": "AAPL",
"period": {"start": "2024-01-01", "end": "2024-01-31"},
"posts": [
{
"title": "Apple earnings discussion",
"content": "What do you think about...",
"author": "redditor123",
"subreddit": "investing",
"created_utc": 1704067200,
"score": 125,
"num_comments": 45,
"upvote_ratio": 0.87,
"url": "https://reddit.com/r/investing/comments/abc123",
"id": "abc123"
}
],
"metadata": {
"source": "reddit",
"retrieved_at": "2024-01-31T10:00:00Z",
"data_quality": "HIGH",
"subreddits": ["investing", "stocks"],
"total_posts": 25
}
}
```
### 5. Sentiment Analysis Enhancement
#### Advanced Sentiment Features
- **Weighted Scoring**: High-engagement posts have more influence on overall sentiment
- **Keyword Analysis**: Comprehensive positive/negative keyword detection
- **Score Adjustment**: Reddit score (upvotes) influences sentiment confidence
- **Confidence Metrics**: Based on post count and engagement levels
- **Multi-level Analysis**: Individual post sentiment + overall summary sentiment
#### Sentiment Calculation Strategy
```python
def _calculate_advanced_sentiment(self, posts: list[PostData]) -> SentimentScore:
"""Enhanced sentiment analysis with multiple factors."""
# Weight by engagement score (upvotes + comments)
# Adjust for subreddit context (WSB vs investing)
# Consider temporal patterns (recent posts weighted higher)
# Apply confidence scoring based on data volume
```
### 6. Pydantic Validation
#### Context Structure
```python
@dataclass
class SocialContext(BaseModel):
symbol: str | None
period: dict[str, str] # {"start": "2024-01-01", "end": "2024-01-31"}
posts: list[PostData]
engagement_metrics: dict[str, float]
sentiment_summary: SentimentScore
post_count: int
platforms: list[str] # ["reddit"]
metadata: dict[str, Any]
```
#### PostData Format
```python
@dataclass
class PostData(BaseModel):
title: str
content: str
author: str
source: str # subreddit name
date: str
url: str
score: int
comments: int
engagement_score: int
subreddit: str | None
sentiment: SentimentScore | None
metadata: dict[str, Any]
```
## Implementation Tasks
### Phase 1: Create RedditClient
1. **RedditClient Implementation**
- Create `tradingagents/clients/reddit_client.py`
- Follow FinnhubClient standard: no BaseClient inheritance, date objects, proper error handling
- Use PRAW (Python Reddit API Wrapper) library for Reddit API access
- Methods: `search_posts()`, `get_top_posts()`, `get_company_posts()`
- Implement date filtering for posts within specified ranges
- Handle Reddit API rate limits and authentication
2. **Comprehensive Testing**
- Create `tradingagents/clients/test_reddit_client.py`
- Use pytest-vcr for Reddit API interaction recording
- Test all client methods with multiple queries and subreddits
- Test error handling and API rate limit scenarios
- Mock Reddit API responses for consistent testing
### Phase 2: Bridge SocialRepository Interface
3. **Repository Interface Standardization**
- Add standard service interface methods to `SocialRepository`
- Bridge existing `get_social_data()` with `get_data()`
- Bridge existing `store_social_posts()` with `store_data()`
- Add missing `has_data_for_period()` and `clear_data()` methods
- File: `tradingagents/repositories/social_repository.py`
- Maintain existing dataclass functionality while adding service compatibility
4. **Repository Method Implementation**
```python
# Add these methods to SocialRepository
def has_data_for_period(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool
def get_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> dict[str, Any]
def store_data(self, query: str, cache_data: dict, symbol: str | None = None, overwrite: bool = False) -> bool
def clear_data(self, query: str, start_date: str, end_date: str, symbol: str | None = None) -> bool
```
### Phase 3: Update SocialMediaService
5. **Client Integration Fix**
- Replace `BaseClient` dependency with `RedditClient`
- File: `tradingagents/services/social_media_service.py:27`
- Update constructor: `reddit_client: RedditClient`
6. **Date Conversion Fix**
- Add `date.fromisoformat()` conversion in service methods
- Update all client calls to use date objects instead of strings
- File: `tradingagents/services/social_media_service.py:182-190, 418-429`
7. **Repository Interface Integration**
- Update repository method calls to use new standard interface
- Ensure proper error handling for repository operations
- File: `tradingagents/services/social_media_service.py:302-311, 325-337`
### Phase 4: Type Safety & Validation
8. **Comprehensive Type Checking**
- Run `mise run typecheck` - must pass with 0 errors
- Validate all date object conversions
- Ensure SocialContext compliance
9. **Enhanced Testing**
- Update existing service tests for new RedditClient interface
- Add gap detection test scenarios
- Test sentiment analysis accuracy with known datasets
- Test multi-subreddit aggregation and deduplication
## Success Criteria
### Functional Requirements
- ✅ Service successfully calls `RedditClient` with `date` objects
- ✅ Local-first strategy works: checks cache → identifies gaps → fetches missing → stores updates
- ✅ Returns properly validated `SocialContext` to agents
- ✅ Sentiment analysis provides accurate scores with confidence metrics
- ✅ Multi-subreddit support with post deduplication
- ✅ Force refresh bypasses cache and refreshes data
### Technical Requirements
- ✅ Zero type checking errors: `mise run typecheck`
- ✅ Zero linting errors: `mise run lint`
- ✅ All existing tests pass with updated architecture
- ✅ No runtime errors with date conversions
### Quality Requirements
- ✅ Strongly-typed interfaces between all components
- ✅ PRAW library integration for reliable Reddit API access
- ✅ Comprehensive error handling and logging
- ✅ Efficient caching with minimal API calls
- ✅ Clear separation of concerns between service, client, and repository
- ✅ Accurate sentiment analysis with engagement weighting
## Data Architecture
### RedditClient Response Format
```python
{
"query": "Tesla",
"period": {"start": "2024-01-01", "end": "2024-01-31"},
"posts": [
{
"title": "Tesla Q4 earnings beat expectations",
"content": "Tesla reported strong Q4 results...",
"author": "teslaInvestor",
"subreddit": "TeslaInvestors",
"created_utc": 1704067200,
"score": 245,
"num_comments": 67,
"upvote_ratio": 0.92,
"url": "https://reddit.com/r/TeslaInvestors/comments/xyz789",
"id": "xyz789"
}
],
"metadata": {
"source": "reddit",
"retrieved_at": "2024-01-31T10:00:00Z",
"data_quality": "HIGH",
"subreddits": ["TeslaInvestors", "stocks"],
"post_count": 25,
"api_calls": 3
}
}
```
### SocialRepository Data Bridge Format
```python
# Repository stores data in existing SocialPost format but provides service interface
{
"query": "Tesla",
"symbol": "TSLA",
"posts": [
{
"title": "Tesla Q4 earnings beat expectations",
"content": "Tesla reported strong Q4 results...",
"author": "teslaInvestor",
"source": "TeslaInvestors",
"date": "2024-01-15",
"url": "https://reddit.com/r/TeslaInvestors/comments/xyz789",
"score": 245,
"comments": 67,
"engagement_score": 312,
"subreddit": "TeslaInvestors",
"sentiment": {
"score": 0.7,
"confidence": 0.8,
"label": "positive"
},
"metadata": {
"platform_id": "xyz789",
"upvote_ratio": 0.92
}
}
],
"metadata": {
"cached_at": "2024-01-31T10:00:00Z",
"post_count": 25,
"sources": ["reddit"]
}
}
```
## Dependencies
### Missing Components (Need Creation)
-`RedditClient` needs full implementation from scratch
- ⏳ Service interface bridge methods for `SocialRepository`
- ⏳ Comprehensive pytest-vcr test suites for Reddit API
### Existing Components (Ready)
-`SocialRepository` with JSON storage and deduplication
-`SocialContext` and `PostData` Pydantic models
- ✅ Sentiment analysis and engagement metrics logic
### Required
- PRAW (Python Reddit API Wrapper) library for Reddit integration
- Valid Reddit API credentials (client_id, client_secret, user_agent)
- Working internet connection for live data fetching
- Writable data directory for repository storage
## Timeline
### Immediate (Phase 1)
- Create RedditClient following FinnhubClient standard with PRAW integration
- Implement comprehensive testing with pytest-vcr for Reddit API
- Validate client functionality with multiple subreddits and queries
### Phase 2-3
- Add standard service interface methods to SocialRepository
- Update SocialMediaService to use RedditClient with date objects
- Bridge repository interfaces while maintaining existing functionality
### Phase 4
- Comprehensive type checking and validation
- Integration testing with sentiment analysis workflows
- Performance optimization and caching efficiency
## Acceptance Criteria
### Must Have
1. **Type Safety**: Service passes `mise run typecheck` with zero errors
2. **Client Integration**: All `RedditClient` calls use `date` objects correctly
3. **Local-First**: Service checks repository before Reddit API calls
4. **Context Validation**: Returns valid `SocialContext` with Pydantic validation
5. **Sentiment Analysis**: Provides accurate sentiment scores with confidence metrics
6. **Multi-Platform**: Seamlessly aggregates social data from Reddit with extensibility
### Should Have
1. **Gap Detection**: Intelligent identification of missing data periods
2. **Cache Efficiency**: Minimal redundant API calls to Reddit
3. **Force Refresh**: Complete cache bypass when requested
4. **Data Quality**: Metadata indicating data source and quality metrics
5. **Deduplication**: Automatic removal of duplicate posts by platform_id
### Nice to Have
1. **Performance Metrics**: Timing and cache hit rate logging
2. **Data Staleness**: Automatic refresh of old cached social data
3. **Enhanced Sentiment**: Integration with advanced NLP libraries (TextBlob, VADER)
4. **Real-time Social**: Support for live social media feeds and alerts
5. **Platform Expansion**: Easy addition of Twitter, Discord, other social platforms
---
This PRD focuses on completing the `SocialMediaService` as a strongly-typed, local-first data service that integrates Reddit social media data through a new `RedditClient` following the established FinnhubClient standard patterns, while providing comprehensive sentiment analysis and engagement metrics to trading agents.