Social Media Domain Specification
Feature Overview
Complete implementation of social media data collection and analysis: transform the current stub implementation into a production-ready social media domain that provides comprehensive Reddit sentiment analysis for trading agents.
User Story
As a Dagster pipeline, I want to collect Reddit posts from financial subreddits with LLM sentiment analysis and vector embeddings, so that AI Agents can access comprehensive social media context for ticker-specific trading decisions through RAG-powered queries.
Acceptance Criteria
Daily Data Collection
- GIVEN a scheduled Dagster pipeline WHEN it executes daily THEN it collects Reddit posts from configured financial subreddits without manual intervention
- GIVEN Reddit posts are collected WHEN processed THEN they are stored in PostgreSQL with TimescaleDB optimization and vector embeddings for semantic search
LLM Sentiment Analysis
- GIVEN social media posts WHEN processed THEN each post receives OpenRouter LLM sentiment analysis with structured scores (positive/negative/neutral with confidence)
Agent Integration
- GIVEN a ticker symbol WHEN AI agents request social context THEN they receive relevant Reddit posts with sentiment scores and vector similarity ranking within 2 seconds
- GIVEN social media data WHEN agents query THEN AgentToolkit provides RAG-enhanced context including post content, sentiment trends, and engagement metrics
Business Rules and Constraints
Data Collection Rules
- Daily automated collection from configured financial subreddits (wallstreetbets, investing, stocks, SecurityAnalysis)
- OpenRouter LLM sentiment analysis for all posts with confidence scoring
- Vector embeddings generation for semantic similarity search
- Post deduplication by Reddit post ID to prevent duplicates
- Rate limiting compliance with Reddit API terms of service
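The deduplication rule above can be illustrated as a pure function (names hypothetical; the production path relies on the repository's post_id check):

```python
from typing import Dict, List, Set


def dedupe_posts(posts: List[Dict], seen_ids: Set[str]) -> List[Dict]:
    """Keep only posts whose Reddit post_id has not been seen before."""
    fresh = []
    for post in posts:
        pid = post["post_id"]
        if pid not in seen_ids:
            seen_ids.add(pid)
            fresh.append(post)
    return fresh


batch = [{"post_id": "abc"}, {"post_id": "abc"}, {"post_id": "xyz"}]
unique = dedupe_posts(batch, seen_ids=set())
# unique keeps one "abc" post and one "xyz" post
```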
Data Management
- Data retention policy: 90 days for social media posts
- Best effort processing: API failures or rate limits don't block other posts
Scope Definition
Included Features ✅
- Complete socialmedia domain implementation from stub to production
- PostgreSQL migration from current file-based storage
- Reddit API integration using PRAW or Reddit API client
- OpenRouter LLM sentiment analysis integration
- Vector embeddings generation and similarity search
- AgentToolkit integration with get_reddit_news and get_reddit_stock_info methods
- Dagster pipeline for scheduled daily collection
- SQLAlchemy entities with TimescaleDB and pgvectorscale support
- Comprehensive test coverage with pytest-vcr for API mocking
Excluded Features ❌
- Other social media platforms beyond Reddit (Twitter, LinkedIn, etc.)
- Real-time social media streaming (batch processing only)
- Custom sentiment models (use OpenRouter LLMs only)
- Social media influence scoring or user reputation tracking
- Multi-language post support (English only)
- Historical Reddit data backfilling beyond 30 days
Technical Implementation Details
Architecture Pattern
Router → Service → Repository → Entity → Database (matching news domain)
Current Implementation Status
Basic stub implementation - requires complete rebuild
Missing Components
- PostgreSQL database migration from file storage
- Reddit API client implementation (RedditClient is empty stub)
- SQLAlchemy entity models for social posts with vector fields
- LLM sentiment analysis integration via OpenRouter
- Vector embedding generation and similarity search
- AgentToolkit RAG methods (get_reddit_news, get_reddit_stock_info)
- Dagster pipeline for scheduled data collection
- Comprehensive test suite with domain-specific patterns
Existing Stub Components
- SocialMediaService with empty method stubs
- SocialRepository with file-based JSON storage
- Basic data models: SocialPost, PostData, SocialContext
- Empty RedditClient class requiring full implementation
- Agent references to social methods (not yet implemented)
Database Integration
PostgreSQL Schema Design
-- Social media posts table with TimescaleDB optimization
CREATE TABLE social_media_posts (
    id SERIAL,
    post_id VARCHAR(50) NOT NULL,             -- Reddit post ID
    ticker VARCHAR(10),                       -- Associated ticker
    subreddit VARCHAR(50) NOT NULL,           -- Source subreddit
    title TEXT NOT NULL,                      -- Post title
    content TEXT,                             -- Post content
    author VARCHAR(50),                       -- Reddit username
    created_at TIMESTAMPTZ NOT NULL,          -- Post creation time
    collected_at TIMESTAMPTZ DEFAULT NOW(),   -- Data collection time
    upvotes INTEGER DEFAULT 0,                -- Reddit upvotes
    downvotes INTEGER DEFAULT 0,              -- Reddit downvotes (the API usually reports 0)
    comment_count INTEGER DEFAULT 0,          -- Number of comments
    url TEXT,                                 -- Reddit URL
    permalink TEXT,                           -- Reddit permalink
    -- Sentiment analysis fields
    sentiment_score DECIMAL(3,2),             -- -1.0 to +1.0
    sentiment_label VARCHAR(20),              -- positive/negative/neutral
    sentiment_confidence DECIMAL(3,2),        -- 0.0 to 1.0
    -- Vector embeddings
    embedding vector(1536),                   -- pgvector embedding
    -- Metadata
    data_quality_score DECIMAL(3,2) DEFAULT 1.0,
    processing_status VARCHAR(20) DEFAULT 'pending',
    error_message TEXT,
    -- TimescaleDB requires unique constraints to include the partitioning column
    PRIMARY KEY (id, created_at),
    UNIQUE (post_id, created_at)
);
-- TimescaleDB hypertable for time-series optimization
SELECT create_hypertable('social_media_posts', 'created_at');
-- Vector similarity index (diskann is pgvectorscale's index type; plain pgvector's hnsw also works)
CREATE INDEX idx_social_posts_embedding ON social_media_posts USING diskann (embedding vector_cosine_ops);
-- Performance indexes
CREATE INDEX idx_social_posts_ticker ON social_media_posts (ticker, created_at DESC);
CREATE INDEX idx_social_posts_subreddit ON social_media_posts (subreddit, created_at DESC);
CREATE INDEX idx_social_posts_sentiment ON social_media_posts (sentiment_label, sentiment_score);
Entity Model
# tradingagents/domains/socialmedia/entities.py
from sqlalchemy import Column, Integer, String, Text, DECIMAL, TIMESTAMP, Index, text
from pgvector.sqlalchemy import Vector  # SQLAlchemy type from the pgvector package
from tradingagents.database import Base
from typing import Optional

class SocialMediaPostEntity(Base):
    __tablename__ = 'social_media_posts'

    id = Column(Integer, primary_key=True)
    post_id = Column(String(50), nullable=False, index=True)  # unique per (post_id, created_at) in the hypertable
    ticker = Column(String(10), index=True)
    subreddit = Column(String(50), nullable=False, index=True)
    title = Column(Text, nullable=False)
    content = Column(Text)
    author = Column(String(50))
    created_at = Column(TIMESTAMP(timezone=True), nullable=False, index=True)
    collected_at = Column(TIMESTAMP(timezone=True), server_default=text('NOW()'))
    upvotes = Column(Integer, default=0)
    downvotes = Column(Integer, default=0)
    comment_count = Column(Integer, default=0)
    url = Column(Text)
    permalink = Column(Text)
    # Sentiment analysis
    sentiment_score = Column(DECIMAL(3, 2))
    sentiment_label = Column(String(20))
    sentiment_confidence = Column(DECIMAL(3, 2))
    # Vector embeddings
    embedding = Column(Vector(1536))
# Metadata
data_quality_score = Column(DECIMAL(3,2), default=1.0)
processing_status = Column(String(20), default='pending')
error_message = Column(Text)
def to_domain(self) -> 'SocialPost':
"""Convert entity to domain model"""
return SocialPost(
post_id=self.post_id,
ticker=self.ticker,
subreddit=self.subreddit,
title=self.title,
content=self.content,
author=self.author,
created_at=self.created_at,
upvotes=self.upvotes,
downvotes=self.downvotes,
comment_count=self.comment_count,
url=self.url,
            sentiment_score=float(self.sentiment_score) if self.sentiment_score is not None else None,
            sentiment_label=self.sentiment_label,
            sentiment_confidence=float(self.sentiment_confidence) if self.sentiment_confidence is not None else None
)
@classmethod
def from_domain(cls, post: 'SocialPost', embedding: Optional[list] = None) -> 'SocialMediaPostEntity':
"""Create entity from domain model"""
return cls(
post_id=post.post_id,
ticker=post.ticker,
subreddit=post.subreddit,
title=post.title,
content=post.content,
author=post.author,
created_at=post.created_at,
upvotes=post.upvotes,
downvotes=post.downvotes,
comment_count=post.comment_count,
url=post.url,
sentiment_score=post.sentiment_score,
sentiment_label=post.sentiment_label,
sentiment_confidence=post.sentiment_confidence,
embedding=embedding
)
Reddit API Integration
RedditClient Implementation
# tradingagents/domains/socialmedia/clients.py
import logging
import praw
from typing import List, Optional, Dict, Any
from datetime import datetime, timezone
from tradingagents.config import TradingAgentsConfig

logger = logging.getLogger(__name__)

class RedditClient:
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        self.reddit = praw.Reddit(
            client_id=config.reddit_client_id,
            client_secret=config.reddit_client_secret,
            user_agent=config.reddit_user_agent
        )

    async def fetch_financial_posts(
        self,
        subreddits: List[str],
        ticker: Optional[str] = None,
        limit: int = 100,
        time_filter: str = "day"
    ) -> List[Dict[str, Any]]:
        """Fetch financial posts from specified subreddits.

        Note: PRAW is synchronous; call this via a thread executor (or use
        Async PRAW) so it does not block the event loop.
        """
        posts = []
        for subreddit_name in subreddits:
            try:
                subreddit = self.reddit.subreddit(subreddit_name)
                # top() honors time_filter; hot() would silently ignore it
                submissions = subreddit.top(time_filter=time_filter, limit=limit)
                for submission in submissions:
                    # Filter by ticker if specified
                    if ticker and ticker.upper() not in submission.title.upper():
                        continue
                    post_data = {
                        'post_id': submission.id,
                        'subreddit': subreddit_name,
                        'title': submission.title,
                        'content': submission.selftext,
                        'author': str(submission.author) if submission.author else '[deleted]',
                        'created_at': datetime.fromtimestamp(submission.created_utc, tz=timezone.utc),
                        'upvotes': submission.ups,
                        'downvotes': submission.downs,  # usually 0: Reddit no longer exposes downvotes
                        'comment_count': submission.num_comments,
                        'url': submission.url,
                        'permalink': submission.permalink
                    }
                    posts.append(post_data)
            except Exception as e:
                # Log error but continue processing other subreddits
                logger.warning("Error fetching from %s: %s", subreddit_name, e)
                continue
        return posts
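The business rules call for rate-limit compliance but the client does not show it. A minimal call-spacing limiter could look like the sketch below (class name and parameters are assumptions; PRAW already throttles to Reddit's limits, so this is only a belt-and-braces guard for any extra raw HTTP calls):

```python
import asyncio
import time


class RateLimiter:
    """Allow at most `rate` calls per `per` seconds by spacing calls evenly."""

    def __init__(self, rate: int, per: float):
        self.interval = per / rate
        self._last_call = 0.0
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            wait = self._last_call + self.interval - now
            if wait > 0:
                await asyncio.sleep(wait)
            self._last_call = time.monotonic()


async def demo() -> float:
    limiter = RateLimiter(rate=10, per=1.0)  # at most ten calls per second
    start = time.monotonic()
    for _ in range(3):
        await limiter.acquire()
    return time.monotonic() - start


elapsed = asyncio.run(demo())
# three spaced acquisitions take roughly 0.2s in total
```

The client could call `await limiter.acquire()` before each subreddit fetch.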
LLM Sentiment Analysis
OpenRouter Integration
# tradingagents/domains/socialmedia/services.py
import json
from typing import Dict, Any, Optional, Tuple
import openai
from tradingagents.config import TradingAgentsConfig

class SentimentAnalyzer:
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        self.client = openai.AsyncOpenAI(  # async client so completions can be awaited
            base_url="https://openrouter.ai/api/v1",
            api_key=config.openrouter_api_key
        )
async def analyze_sentiment(self, text: str) -> Tuple[float, str, float]:
"""
Analyze sentiment of social media post
Returns: (score, label, confidence)
"""
prompt = f"""
Analyze the financial sentiment of this social media post.
Post: "{text}"
Return sentiment as JSON with:
- score: float from -1.0 (very negative) to +1.0 (very positive)
- label: "positive", "negative", or "neutral"
- confidence: float from 0.0 to 1.0 indicating confidence
Focus on financial and trading sentiment, not general sentiment.
"""
try:
response = await self.client.chat.completions.create(
model=self.config.quick_think_llm,
messages=[{"role": "user", "content": prompt}],
max_tokens=100,
temperature=0.1
)
result = json.loads(response.choices[0].message.content)
return result['score'], result['label'], result['confidence']
except Exception as e:
# Return neutral sentiment on error
return 0.0, "neutral", 0.0
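LLM replies do not always come back as clean JSON: models may wrap the payload in markdown fences or emit out-of-range numbers. A defensive parser along these lines (helper name hypothetical) keeps the `json.loads` path from silently collapsing to neutral on recoverable output:

```python
import json
import re
from typing import Tuple

VALID_LABELS = {"positive", "negative", "neutral"}


def parse_sentiment_response(raw: str) -> Tuple[float, str, float]:
    """Parse (score, label, confidence) from an LLM reply, defensively.

    Tolerates replies wrapped in markdown code fences, clamps out-of-range
    numbers, and falls back to neutral on anything malformed.
    """
    try:
        match = re.search(r"\{.*\}", raw, re.DOTALL)  # extract JSON from fences
        data = json.loads(match.group(0)) if match else {}
        score = max(-1.0, min(1.0, float(data["score"])))
        label = data["label"] if data["label"] in VALID_LABELS else "neutral"
        confidence = max(0.0, min(1.0, float(data["confidence"])))
        return score, label, confidence
    except (KeyError, TypeError, ValueError):
        return 0.0, "neutral", 0.0


parse_sentiment_response('{"score": 1.7, "label": "positive", "confidence": 0.9}')
# → (1.0, 'positive', 0.9)   (score clamped into [-1.0, 1.0])
```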
Vector Embeddings and Search
Embedding Generation
# tradingagents/domains/socialmedia/embeddings.py
import openai
from typing import Dict, Any, List, Optional
from tradingagents.config import TradingAgentsConfig

class EmbeddingGenerator:
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        self.client = openai.AsyncOpenAI(  # async client so embeddings can be awaited
            base_url="https://openrouter.ai/api/v1",
            api_key=config.openrouter_api_key
        )
async def generate_embedding(self, text: str) -> Optional[List[float]]:
"""Generate vector embedding for text"""
try:
response = await self.client.embeddings.create(
model="text-embedding-3-small",
input=text,
encoding_format="float"
)
return response.data[0].embedding
except Exception as e:
print(f"Embedding generation failed: {e}")
return None
def prepare_text_for_embedding(self, post: Dict[str, Any]) -> str:
"""Combine title and content for embedding"""
title = post.get('title', '')
content = post.get('content', '')
return f"{title} {content}".strip()
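For intuition about what the similarity search returns: pgvector's cosine distance is 1 minus cosine similarity, and the underlying metric can be illustrated in pure Python:

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity in [-1, 1]; pgvector's cosine distance is 1 minus this."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate zero vector
    return dot / (norm_a * norm_b)


cosine_similarity([1.0, 0.0], [1.0, 0.0])  # → 1.0 (identical direction)
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # → 0.0 (orthogonal)
```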
Repository Implementation
SocialRepository with PostgreSQL
# tradingagents/domains/socialmedia/repositories.py
from typing import List, Optional, Dict, Any
from sqlalchemy.orm import Session
from sqlalchemy import desc, and_, text
from tradingagents.domains.socialmedia.entities import SocialMediaPostEntity
from tradingagents.domains.socialmedia.models import SocialPost, SocialContext
from tradingagents.database import get_db_session
from datetime import datetime, timedelta
class SocialRepository:
def __init__(self):
self.session = get_db_session()
async def save_posts(self, posts: List[SocialPost]) -> List[str]:
"""Save social media posts with deduplication"""
saved_ids = []
for post in posts:
# Check for existing post
existing = self.session.query(SocialMediaPostEntity).filter(
SocialMediaPostEntity.post_id == post.post_id
).first()
if existing:
continue # Skip duplicates
entity = SocialMediaPostEntity.from_domain(post)
self.session.add(entity)
saved_ids.append(post.post_id)
self.session.commit()
return saved_ids
async def get_posts_for_ticker(
self,
ticker: str,
days: int = 7,
limit: int = 50
) -> List[SocialPost]:
"""Get social media posts for specific ticker"""
cutoff_date = datetime.now() - timedelta(days=days)
results = self.session.query(SocialMediaPostEntity).filter(
and_(
SocialMediaPostEntity.ticker == ticker,
SocialMediaPostEntity.created_at >= cutoff_date
)
).order_by(desc(SocialMediaPostEntity.created_at)).limit(limit).all()
return [entity.to_domain() for entity in results]
async def vector_similarity_search(
self,
query_embedding: List[float],
ticker: Optional[str] = None,
limit: int = 10
) -> List[SocialPost]:
"""Find similar posts using vector search"""
query = self.session.query(SocialMediaPostEntity)
if ticker:
query = query.filter(SocialMediaPostEntity.ticker == ticker)
        # Cosine-distance search; the cosine_distance comparator matches the
        # vector_cosine_ops index and binds the embedding as a parameter
        # instead of interpolating it into raw SQL
        query = query.order_by(
            SocialMediaPostEntity.embedding.cosine_distance(query_embedding)
        ).limit(limit)
results = query.all()
return [entity.to_domain() for entity in results]
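If raw SQL is ever used instead of the ORM comparator, the embedding should be rendered as a pgvector input literal and passed as a bound parameter, never interpolated into the query string. A hedged helper (name hypothetical):

```python
from typing import List


def to_pgvector_literal(embedding: List[float]) -> str:
    """Render a Python list as a pgvector input literal, e.g. '[0.25,-1.0,3.5]'.

    repr() keeps full float precision; pass the result as a bound query
    parameter, never via string interpolation into SQL.
    """
    return "[" + ",".join(repr(x) for x in embedding) + "]"


to_pgvector_literal([0.25, -1.0, 3.5])  # → '[0.25,-1.0,3.5]'
```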
Service Layer
SocialMediaService
# tradingagents/domains/socialmedia/services.py
from typing import List, Optional, Dict, Any
from tradingagents.domains.socialmedia.repositories import SocialRepository
from tradingagents.domains.socialmedia.clients import RedditClient
from tradingagents.domains.socialmedia.embeddings import EmbeddingGenerator
from tradingagents.domains.socialmedia.models import SocialPost, SocialContext
from tradingagents.config import TradingAgentsConfig
class SocialMediaService:
def __init__(self, config: TradingAgentsConfig):
self.config = config
self.repository = SocialRepository()
self.reddit_client = RedditClient(config)
self.sentiment_analyzer = SentimentAnalyzer(config)
self.embedding_generator = EmbeddingGenerator(config)
async def collect_social_data(
self,
ticker: Optional[str] = None,
subreddits: Optional[List[str]] = None
) -> SocialContext:
"""Main entry point for social media data collection"""
if not subreddits:
subreddits = ['wallstreetbets', 'investing', 'stocks', 'SecurityAnalysis']
# Fetch posts from Reddit
raw_posts = await self.reddit_client.fetch_financial_posts(
subreddits=subreddits,
ticker=ticker,
limit=100
)
# Process posts: sentiment analysis + embeddings
processed_posts = []
for raw_post in raw_posts:
# Generate sentiment
text = f"{raw_post['title']} {raw_post['content']}"
score, label, confidence = await self.sentiment_analyzer.analyze_sentiment(text)
# Generate embedding
embedding = await self.embedding_generator.generate_embedding(text)
post = SocialPost(
**raw_post,
sentiment_score=score,
sentiment_label=label,
sentiment_confidence=confidence
)
processed_posts.append(post)
# Save to database
await self.repository.save_posts(processed_posts)
# Return context
return SocialContext(
posts=processed_posts,
ticker=ticker,
total_posts=len(processed_posts),
sentiment_summary=self._calculate_sentiment_summary(processed_posts)
)
def _calculate_sentiment_summary(self, posts: List[SocialPost]) -> Dict[str, Any]:
"""Calculate aggregate sentiment metrics"""
if not posts:
return {}
scores = [p.sentiment_score for p in posts if p.sentiment_score is not None]
labels = [p.sentiment_label for p in posts if p.sentiment_label]
return {
'avg_sentiment': sum(scores) / len(scores) if scores else 0.0,
'positive_count': labels.count('positive'),
'negative_count': labels.count('negative'),
'neutral_count': labels.count('neutral'),
'total_posts': len(posts)
}
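The summary above weights every post equally. Since the acceptance criteria mention engagement metrics, an engagement-weighted variant could later be added along these lines (helper name and weighting scheme are assumptions, not part of the current scope):

```python
from typing import List, Tuple


def weighted_sentiment(posts: List[Tuple[float, int]]) -> float:
    """Average sentiment weighted by upvotes, given (score, upvotes) pairs.

    Every post gets weight >= 1 so zero-upvote posts still contribute.
    """
    if not posts:
        return 0.0
    total_weight = sum(max(1, up) for _, up in posts)
    return sum(score * max(1, up) for score, up in posts) / total_weight


weighted_sentiment([(0.8, 300), (-0.4, 100)])  # ≈ 0.5
```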
AgentToolkit Integration
RAG-Enhanced Methods
# tradingagents/agents/libs/agent_toolkit.py (additions)
async def get_reddit_news(self, ticker: str, days: int = 7) -> str:
"""Get Reddit posts related to a ticker with RAG context"""
try:
# Get recent posts for ticker
posts = await self.social_service.repository.get_posts_for_ticker(
ticker=ticker,
days=days,
limit=20
)
if not posts:
return f"No Reddit posts found for {ticker} in the last {days} days."
# Format for agent consumption
context = f"Reddit Social Media Context for {ticker} ({len(posts)} posts):\n\n"
for post in posts[:10]: # Limit to top 10
sentiment_emoji = {"positive": "📈", "negative": "📉", "neutral": "➡️"}.get(post.sentiment_label, "")
context += f"{sentiment_emoji} r/{post.subreddit} - {post.title}\n"
            if post.sentiment_score is not None:
                context += f" Sentiment: {post.sentiment_label} ({post.sentiment_score:.2f})\n"
context += f" Engagement: {post.upvotes} upvotes, {post.comment_count} comments\n"
if post.content:
context += f" Content: {post.content[:200]}...\n"
context += "\n"
return context
except Exception as e:
return f"Error fetching Reddit data for {ticker}: {str(e)}"
async def get_reddit_stock_info(self, ticker: str, query: Optional[str] = None) -> str:
"""Get Reddit stock information with semantic search"""
try:
if query:
# Generate embedding for semantic search
query_embedding = await self.social_service.embedding_generator.generate_embedding(query)
if query_embedding:
posts = await self.social_service.repository.vector_similarity_search(
query_embedding=query_embedding,
ticker=ticker,
limit=10
)
else:
posts = await self.social_service.repository.get_posts_for_ticker(ticker, days=7)
else:
posts = await self.social_service.repository.get_posts_for_ticker(ticker, days=7)
if not posts:
return f"No relevant Reddit discussions found for {ticker}."
# Aggregate sentiment and key insights
sentiment_summary = self.social_service._calculate_sentiment_summary(posts)
context = f"Reddit Stock Analysis for {ticker}:\n\n"
        context += f"Overall Sentiment: {sentiment_summary.get('avg_sentiment', 0):.2f} (scale: -1.0 to +1.0)\n"
context += f"Posts: {sentiment_summary.get('positive_count', 0)} positive, "
context += f"{sentiment_summary.get('negative_count', 0)} negative, "
context += f"{sentiment_summary.get('neutral_count', 0)} neutral\n\n"
context += "Key Discussions:\n"
for post in posts[:5]:
context += f"• {post.title} (r/{post.subreddit})\n"
            if post.sentiment_score is not None:
                context += f" Sentiment: {post.sentiment_label} ({post.sentiment_score:.2f})\n"
return context
except Exception as e:
return f"Error analyzing Reddit stock info for {ticker}: {str(e)}"
Dagster Pipeline
Social Media Collection Asset
# tradingagents/data/assets/social_media.py
from typing import Any, Dict
from dagster import asset, AssetExecutionContext
from tradingagents.domains.socialmedia.services import SocialMediaService
from tradingagents.config import TradingAgentsConfig
@asset(
group_name="social_media",
description="Collect Reddit posts from financial subreddits with sentiment analysis"
)
async def reddit_financial_posts(context: AssetExecutionContext) -> Dict[str, Any]:
"""Daily collection of Reddit financial posts"""
config = TradingAgentsConfig.from_env()
social_service = SocialMediaService(config)
# Collect from financial subreddits
subreddits = ['wallstreetbets', 'investing', 'stocks', 'SecurityAnalysis']
total_collected = 0
results = {}
for subreddit in subreddits:
try:
social_context = await social_service.collect_social_data(
subreddits=[subreddit]
)
results[subreddit] = {
'posts_collected': len(social_context.posts),
'sentiment_summary': social_context.sentiment_summary
}
total_collected += len(social_context.posts)
context.log.info(f"Collected {len(social_context.posts)} posts from r/{subreddit}")
except Exception as e:
context.log.error(f"Failed to collect from r/{subreddit}: {e}")
results[subreddit] = {'error': str(e)}
context.log.info(f"Total posts collected: {total_collected}")
return results
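The per-subreddit results mapping returned by the asset can be rolled into run-level metrics for logging or Dagster metadata; a small sketch (helper name hypothetical):

```python
from typing import Any, Dict


def summarize_run(results: Dict[str, Dict[str, Any]]) -> Dict[str, int]:
    """Tally posts collected and failed subreddits across a pipeline run."""
    total = sum(r.get("posts_collected", 0) for r in results.values())
    failures = sum(1 for r in results.values() if "error" in r)
    return {"total_posts": total, "failed_subreddits": failures}


summarize_run({
    "stocks": {"posts_collected": 42},
    "investing": {"error": "rate limited"},
})
# → {'total_posts': 42, 'failed_subreddits': 1}
```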
Testing Strategy
Test Structure
tests/domains/socialmedia/
├── conftest.py # Fixtures and test setup
├── test_reddit_client.py # API integration tests with VCR
├── test_social_repository.py # PostgreSQL database tests
├── test_social_service.py # Business logic with mocks
├── test_sentiment_analyzer.py # LLM sentiment analysis tests
├── test_embedding_generator.py # Vector embedding tests
└── fixtures/ # VCR cassettes and test data
└── reddit_api_responses.yaml
Key Test Patterns
# tests/domains/socialmedia/test_social_service.py
import pytest
from unittest.mock import AsyncMock, MagicMock
from tradingagents.domains.socialmedia.services import SocialMediaService
@pytest.mark.asyncio
async def test_collect_social_data_success(mock_social_service):
"""Test successful social media data collection"""
# Mock Reddit API response
mock_posts = [
{
'post_id': 'abc123',
'title': 'AAPL to the moon!',
'subreddit': 'wallstreetbets',
# ... other fields
}
]
mock_social_service.reddit_client.fetch_financial_posts.return_value = mock_posts
mock_social_service.sentiment_analyzer.analyze_sentiment.return_value = (0.8, 'positive', 0.9)
result = await mock_social_service.collect_social_data(ticker='AAPL')
assert len(result.posts) == 1
assert result.posts[0].sentiment_label == 'positive'
assert result.sentiment_summary['positive_count'] == 1
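The mock_social_service fixture used above is assumed to live in conftest.py; a minimal sketch of how it could be built (helper name hypothetical). Note that the async collaborators must be AsyncMock rather than plain MagicMock, so that their configured return values are awaitable:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


def make_mock_social_service() -> MagicMock:
    """Build a SocialMediaService stand-in whose async collaborators are
    AsyncMocks, so awaiting e.g. fetch_financial_posts() resolves to the
    configured return_value."""
    service = MagicMock()
    service.reddit_client.fetch_financial_posts = AsyncMock(return_value=[])
    service.sentiment_analyzer.analyze_sentiment = AsyncMock(
        return_value=(0.0, "neutral", 0.0)
    )
    service.embedding_generator.generate_embedding = AsyncMock(return_value=None)
    return service


svc = make_mock_social_service()
asyncio.run(svc.reddit_client.fetch_financial_posts())  # → []
```

Individual tests then override the return values, as test_collect_social_data_success does.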
Dependencies
Technical Dependencies
- Reddit API access (PRAW or Reddit API client)
- OpenRouter API for LLM sentiment analysis
- PostgreSQL with TimescaleDB and pgvectorscale extensions
- Existing database infrastructure from news domain
- OpenRouter configuration in TradingAgentsConfig
- Dagster orchestration framework for scheduled execution
Reference Implementations
- News domain patterns: Follow NewsService, NewsRepository, NewsArticleEntity patterns for consistency
- Database schema: Mirror NewsArticleEntity vector embedding approach for social posts
- Agent integration: Follow existing AgentToolkit get_news() pattern for social media methods
- Testing approach: Apply news domain testing patterns: VCR for API, real DB for repositories
Success Criteria
Functionality
- Daily Reddit collection with sentiment analysis and vector search
- Seamless integration with existing multi-agent trading framework
- RAG-enhanced social context for AI agents
Performance
- < 2 second social context queries
- < 100ms repository operations
- Efficient vector similarity search
Quality
- 85%+ test coverage matching project standards
- Comprehensive error handling and resilience
- Data quality monitoring and validation
Integration
- Seamless AgentToolkit RAG integration for AI agents
- Architecture and patterns match successful news domain implementation
- Consistent with existing TradingAgents configuration and conventions
Implementation Approach
Complete domain implementation following successful news domain patterns:
- Database migration from file storage to PostgreSQL
- Entity models with TimescaleDB and vector support
- Reddit client implementation with rate limiting
- Repository layer with vector search capabilities
- Service layer with sentiment analysis and embedding generation
- AgentToolkit integration with RAG-enhanced methods
- Dagster pipeline for automated daily collection
- Comprehensive testing with VCR mocking and real database tests
This comprehensive implementation transforms the social media domain from basic stubs into a production-ready system that seamlessly integrates with the existing TradingAgents framework.