# Social Media Domain Specification

## Feature Overview

**Complete implementation of social media data collection and analysis** - Transform the current stub implementation into a production-ready social media domain that provides comprehensive Reddit sentiment analysis for trading agents.

### User Story

As a Dagster pipeline, I want to collect Reddit posts from financial subreddits with LLM sentiment analysis and vector embeddings, so that AI agents can access comprehensive social media context for ticker-specific trading decisions through RAG-powered queries.

## Acceptance Criteria

### Daily Data Collection

- **GIVEN** a scheduled Dagster pipeline **WHEN** it executes daily **THEN** it collects Reddit posts from configured financial subreddits without manual intervention
- **GIVEN** Reddit posts are collected **WHEN** processed **THEN** they are stored in PostgreSQL with TimescaleDB optimization and vector embeddings for semantic search

### LLM Sentiment Analysis

- **GIVEN** social media posts **WHEN** processed **THEN** each post receives OpenRouter LLM sentiment analysis with structured scores (positive/negative/neutral with confidence)

### Agent Integration

- **GIVEN** a ticker symbol **WHEN** AI agents request social context **THEN** they receive relevant Reddit posts with sentiment scores and vector similarity ranking within 2 seconds
- **GIVEN** social media data **WHEN** agents query **THEN** AgentToolkit provides RAG-enhanced context including post content, sentiment trends, and engagement metrics

## Business Rules and Constraints

### Data Collection Rules

1. **Daily automated collection** from configured financial subreddits (wallstreetbets, investing, stocks, SecurityAnalysis)
2. **OpenRouter LLM sentiment analysis** for all posts with confidence scoring
3. **Vector embeddings generation** for semantic similarity search
4. **Post deduplication** by Reddit post ID to prevent duplicates
5. **Rate limiting compliance** with Reddit API terms of service
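
Rule 4 applies at two levels: inside each collected batch and against rows already stored (the repository handles the latter). A minimal in-batch sketch (the helper name `dedupe_by_post_id` is illustrative, not from the codebase):

```python
from typing import Any, Dict, List


def dedupe_by_post_id(posts: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop duplicate posts within a batch, keeping the first occurrence.

    Listings for overlapping subreddits can return the same submission more
    than once; database-level uniqueness on post_id handles the rest.
    """
    seen: set = set()
    unique: List[Dict[str, Any]] = []
    for post in posts:
        if post["post_id"] in seen:
            continue
        seen.add(post["post_id"])
        unique.append(post)
    return unique
```

Running this before the save step keeps the repository's duplicate check from ever seeing intra-batch repeats.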

### Data Management

1. **Data retention policy**: 90 days for social media posts
2. **Best effort processing**: API failures or rate limits don't block other posts
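
The 90-day retention rule maps directly onto TimescaleDB's native retention policies, which drop whole chunks rather than running row-level deletes; a sketch, assuming the `social_media_posts` hypertable from the Database Integration section:

```sql
-- Automatically drop chunks older than the 90-day retention window
SELECT add_retention_policy('social_media_posts', INTERVAL '90 days');

-- Remove the policy later if the retention window changes:
-- SELECT remove_retention_policy('social_media_posts');
```

This keeps retention enforcement in the database instead of requiring a scheduled application job.
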

## Scope Definition

### Included Features ✅

- Complete socialmedia domain implementation from stub to production
- PostgreSQL migration from current file-based storage
- Reddit API integration using PRAW or a Reddit API client
- OpenRouter LLM sentiment analysis integration
- Vector embeddings generation and similarity search
- AgentToolkit integration with `get_reddit_news` and `get_reddit_stock_info` methods
- Dagster pipeline for scheduled daily collection
- SQLAlchemy entities with TimescaleDB and pgvectorscale support
- Comprehensive test coverage with pytest-vcr for API mocking

### Excluded Features ❌

- Other social media platforms beyond Reddit (Twitter, LinkedIn, etc.)
- Real-time social media streaming (batch processing only)
- Custom sentiment models (use OpenRouter LLMs only)
- Social media influence scoring or user reputation tracking
- Multi-language post support (English only)
- Historical Reddit data backfilling beyond 30 days

## Technical Implementation Details

### Architecture Pattern

**Router → Service → Repository → Entity → Database** (matching the news domain)

### Current Implementation Status

**Basic stub implementation - requires a complete rebuild**

### Missing Components

1. PostgreSQL database migration from file storage
2. Reddit API client implementation (RedditClient is an empty stub)
3. SQLAlchemy entity models for social posts with vector fields
4. LLM sentiment analysis integration via OpenRouter
5. Vector embedding generation and similarity search
6. AgentToolkit RAG methods (`get_reddit_news`, `get_reddit_stock_info`)
7. Dagster pipeline for scheduled data collection
8. Comprehensive test suite with domain-specific patterns

### Existing Stub Components

- SocialMediaService with empty method stubs
- SocialRepository with file-based JSON storage
- Basic data models: SocialPost, PostData, SocialContext
- Empty RedditClient class requiring full implementation
- Agent references to social methods (not yet implemented)

## Database Integration

### PostgreSQL Schema Design

```sql
-- Social media posts table with TimescaleDB optimization
CREATE TABLE social_media_posts (
    id SERIAL,
    post_id VARCHAR(50) NOT NULL,            -- Reddit post ID
    ticker VARCHAR(10),                      -- Associated ticker
    subreddit VARCHAR(50) NOT NULL,          -- Source subreddit
    title TEXT NOT NULL,                     -- Post title
    content TEXT,                            -- Post content
    author VARCHAR(50),                      -- Reddit username
    created_at TIMESTAMPTZ NOT NULL,         -- Post creation time
    collected_at TIMESTAMPTZ DEFAULT NOW(),  -- Data collection time
    upvotes INTEGER DEFAULT 0,               -- Reddit upvotes
    downvotes INTEGER DEFAULT 0,             -- Reddit downvotes
    comment_count INTEGER DEFAULT 0,         -- Number of comments
    url TEXT,                                -- Reddit URL
    permalink TEXT,                          -- Reddit permalink

    -- Sentiment analysis fields
    sentiment_score DECIMAL(3,2),            -- -1.0 to +1.0
    sentiment_label VARCHAR(20),             -- positive/negative/neutral
    sentiment_confidence DECIMAL(3,2),       -- 0.0 to 1.0

    -- Vector embeddings
    embedding vector(1536),                  -- pgvector/pgvectorscale embedding

    -- Metadata
    data_quality_score DECIMAL(3,2) DEFAULT 1.0,
    processing_status VARCHAR(20) DEFAULT 'pending',
    error_message TEXT,

    -- TimescaleDB requires the partitioning column (created_at) in every
    -- unique constraint, so keys are composite rather than single-column
    PRIMARY KEY (id, created_at),
    UNIQUE (post_id, created_at)
);

-- TimescaleDB hypertable for time-series optimization
SELECT create_hypertable('social_media_posts', 'created_at');

-- Vector similarity index (pgvectorscale StreamingDiskANN)
CREATE INDEX idx_social_posts_embedding ON social_media_posts
    USING diskann (embedding vector_cosine_ops);

-- Performance indexes
CREATE INDEX idx_social_posts_ticker ON social_media_posts (ticker, created_at DESC);
CREATE INDEX idx_social_posts_subreddit ON social_media_posts (subreddit, created_at DESC);
CREATE INDEX idx_social_posts_sentiment ON social_media_posts (sentiment_label, sentiment_score);
```

### Entity Model

```python
# tradingagents/domains/socialmedia/entities.py
from typing import Optional

from pgvector.sqlalchemy import Vector  # pgvector type; pgvectorscale reuses it
from sqlalchemy import (
    Column, DECIMAL, Integer, String, Text, TIMESTAMP, UniqueConstraint, text,
)

from tradingagents.database import Base
from tradingagents.domains.socialmedia.models import SocialPost


class SocialMediaPostEntity(Base):
    __tablename__ = 'social_media_posts'
    # TimescaleDB requires the partitioning column in unique constraints,
    # so post_id uniqueness is scoped to (post_id, created_at)
    __table_args__ = (UniqueConstraint('post_id', 'created_at'),)

    # Composite primary key: created_at must be part of it for the hypertable
    id = Column(Integer, primary_key=True, autoincrement=True)
    created_at = Column(TIMESTAMP(timezone=True), primary_key=True, nullable=False, index=True)
    post_id = Column(String(50), nullable=False, index=True)
    ticker = Column(String(10), index=True)
    subreddit = Column(String(50), nullable=False, index=True)
    title = Column(Text, nullable=False)
    content = Column(Text)
    author = Column(String(50))
    collected_at = Column(TIMESTAMP(timezone=True), server_default=text('NOW()'))
    upvotes = Column(Integer, default=0)
    downvotes = Column(Integer, default=0)
    comment_count = Column(Integer, default=0)
    url = Column(Text)
    permalink = Column(Text)

    # Sentiment analysis
    sentiment_score = Column(DECIMAL(3, 2))
    sentiment_label = Column(String(20))
    sentiment_confidence = Column(DECIMAL(3, 2))

    # Vector embeddings
    embedding = Column(Vector(1536))

    # Metadata
    data_quality_score = Column(DECIMAL(3, 2), default=1.0)
    processing_status = Column(String(20), default='pending')
    error_message = Column(Text)

    def to_domain(self) -> SocialPost:
        """Convert entity to domain model"""
        return SocialPost(
            post_id=self.post_id,
            ticker=self.ticker,
            subreddit=self.subreddit,
            title=self.title,
            content=self.content,
            author=self.author,
            created_at=self.created_at,
            upvotes=self.upvotes,
            downvotes=self.downvotes,
            comment_count=self.comment_count,
            url=self.url,
            # Guard with `is not None` so a legitimate 0.0 score survives
            sentiment_score=float(self.sentiment_score) if self.sentiment_score is not None else None,
            sentiment_label=self.sentiment_label,
            sentiment_confidence=float(self.sentiment_confidence) if self.sentiment_confidence is not None else None,
        )

    @classmethod
    def from_domain(cls, post: SocialPost, embedding: Optional[list] = None) -> 'SocialMediaPostEntity':
        """Create entity from domain model"""
        return cls(
            post_id=post.post_id,
            ticker=post.ticker,
            subreddit=post.subreddit,
            title=post.title,
            content=post.content,
            author=post.author,
            created_at=post.created_at,
            upvotes=post.upvotes,
            downvotes=post.downvotes,
            comment_count=post.comment_count,
            url=post.url,
            sentiment_score=post.sentiment_score,
            sentiment_label=post.sentiment_label,
            sentiment_confidence=post.sentiment_confidence,
            embedding=embedding,
        )
```

## Reddit API Integration

### RedditClient Implementation

```python
# tradingagents/domains/socialmedia/clients.py
import logging
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

import praw

from tradingagents.config import TradingAgentsConfig

logger = logging.getLogger(__name__)


class RedditClient:
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        self.reddit = praw.Reddit(
            client_id=config.reddit_client_id,
            client_secret=config.reddit_client_secret,
            user_agent=config.reddit_user_agent,
        )

    async def fetch_financial_posts(
        self,
        subreddits: List[str],
        ticker: Optional[str] = None,
        limit: int = 100,
        time_filter: str = "day",
    ) -> List[Dict[str, Any]]:
        """Fetch financial posts from the specified subreddits.

        Note: PRAW is synchronous; for heavy workloads, offload these calls
        with asyncio.to_thread or switch to Async PRAW.
        """
        posts = []

        for subreddit_name in subreddits:
            try:
                subreddit = self.reddit.subreddit(subreddit_name)
                submissions = subreddit.hot(limit=limit)

                for submission in submissions:
                    # Filter by ticker if specified
                    if ticker and ticker.upper() not in submission.title.upper():
                        continue

                    posts.append({
                        'post_id': submission.id,
                        'subreddit': subreddit_name,
                        'title': submission.title,
                        'content': submission.selftext,
                        # author is None for deleted accounts
                        'author': str(submission.author) if submission.author else '[deleted]',
                        'created_at': datetime.fromtimestamp(submission.created_utc, tz=timezone.utc),
                        'upvotes': submission.ups,
                        # Reddit's API reports 0 downvotes on most listings
                        'downvotes': submission.downs,
                        'comment_count': submission.num_comments,
                        'url': submission.url,
                        'permalink': submission.permalink,
                    })

            except Exception:
                # Log and continue with remaining subreddits (best-effort rule)
                logger.exception("Error fetching from r/%s", subreddit_name)
                continue

        return posts
```

## LLM Sentiment Analysis

### OpenRouter Integration

```python
# tradingagents/domains/socialmedia/services.py
import json
from typing import Tuple

import openai

from tradingagents.config import TradingAgentsConfig


class SentimentAnalyzer:
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        # AsyncOpenAI is required because analyze_sentiment awaits the call
        self.client = openai.AsyncOpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=config.openrouter_api_key,
        )

    async def analyze_sentiment(self, text: str) -> Tuple[float, str, float]:
        """Analyze sentiment of a social media post.

        Returns: (score, label, confidence)
        """
        prompt = f"""
Analyze the financial sentiment of this social media post.

Post: "{text}"

Return sentiment as JSON with:
- score: float from -1.0 (very negative) to +1.0 (very positive)
- label: "positive", "negative", or "neutral"
- confidence: float from 0.0 to 1.0 indicating confidence

Focus on financial and trading sentiment, not general sentiment.
"""

        try:
            response = await self.client.chat.completions.create(
                model=self.config.quick_think_llm,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=100,
                temperature=0.1,
            )

            result = json.loads(response.choices[0].message.content)
            return result['score'], result['label'], result['confidence']

        except Exception:
            # Best-effort rule: degrade to neutral sentiment on any failure
            return 0.0, "neutral", 0.0
```
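
`json.loads` on the raw completion is brittle: models often wrap JSON in a Markdown fence or surround it with prose. A small, tolerant extraction helper can sit in front of the neutral fallback (a sketch; the name `extract_json_object` is illustrative):

```python
import json
from typing import Any, Dict, Optional


def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Find and parse the first JSON object embedded in an LLM response.

    Handles responses wrapped in ```json fences or mixed with commentary by
    scanning from the first '{' to each candidate closing '}'.
    """
    start = text.find("{")
    if start == -1:
        return None
    # Try progressively shorter spans ending at a '}' until one parses
    for end in range(len(text), start, -1):
        if text[end - 1] != "}":
            continue
        try:
            parsed = json.loads(text[start:end])
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

`analyze_sentiment` could call this on the message content and only fall back to `(0.0, "neutral", 0.0)` when no parsable object is found.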
## Vector Embeddings and Search

### Embedding Generation

```python
# tradingagents/domains/socialmedia/embeddings.py
import logging
from typing import Any, Dict, List, Optional

import openai

from tradingagents.config import TradingAgentsConfig

logger = logging.getLogger(__name__)


class EmbeddingGenerator:
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        # Confirm the configured gateway exposes an embeddings endpoint for
        # this model; point base_url at the embedding provider if it does not
        self.client = openai.AsyncOpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=config.openrouter_api_key,
        )

    async def generate_embedding(self, text: str) -> Optional[List[float]]:
        """Generate a vector embedding for text"""
        try:
            response = await self.client.embeddings.create(
                model="text-embedding-3-small",
                input=text,
                encoding_format="float",
            )
            return response.data[0].embedding
        except Exception:
            logger.exception("Embedding generation failed")
            return None

    def prepare_text_for_embedding(self, post: Dict[str, Any]) -> str:
        """Combine title and content for embedding"""
        title = post.get('title') or ''
        content = post.get('content') or ''
        return f"{title} {content}".strip()
```

## Repository Implementation

### SocialRepository with PostgreSQL

```python
# tradingagents/domains/socialmedia/repositories.py
from datetime import datetime, timedelta, timezone
from typing import List, Optional

from sqlalchemy import and_, desc

from tradingagents.database import get_db_session
from tradingagents.domains.socialmedia.entities import SocialMediaPostEntity
from tradingagents.domains.socialmedia.models import SocialPost


class SocialRepository:
    def __init__(self):
        self.session = get_db_session()

    async def save_posts(self, posts: List[SocialPost]) -> List[str]:
        """Save social media posts with deduplication"""
        saved_ids = []

        # One query for all existing IDs instead of one query per post
        incoming_ids = [post.post_id for post in posts]
        existing_ids = {
            row.post_id
            for row in self.session.query(SocialMediaPostEntity.post_id)
            .filter(SocialMediaPostEntity.post_id.in_(incoming_ids))
            .all()
        }

        for post in posts:
            if post.post_id in existing_ids:
                continue  # Skip duplicates
            self.session.add(SocialMediaPostEntity.from_domain(post))
            saved_ids.append(post.post_id)

        self.session.commit()
        return saved_ids

    async def get_posts_for_ticker(
        self,
        ticker: str,
        days: int = 7,
        limit: int = 50,
    ) -> List[SocialPost]:
        """Get social media posts for a specific ticker"""
        # created_at is timezone-aware, so compare against an aware datetime
        cutoff_date = datetime.now(timezone.utc) - timedelta(days=days)

        results = self.session.query(SocialMediaPostEntity).filter(
            and_(
                SocialMediaPostEntity.ticker == ticker,
                SocialMediaPostEntity.created_at >= cutoff_date,
            )
        ).order_by(desc(SocialMediaPostEntity.created_at)).limit(limit).all()

        return [entity.to_domain() for entity in results]

    async def vector_similarity_search(
        self,
        query_embedding: List[float],
        ticker: Optional[str] = None,
        limit: int = 10,
    ) -> List[SocialPost]:
        """Find similar posts using vector search"""
        query = self.session.query(SocialMediaPostEntity)

        if ticker:
            query = query.filter(SocialMediaPostEntity.ticker == ticker)

        # pgvector's comparator emits a parameterized cosine-distance (`<=>`)
        # expression, matching the vector_cosine_ops index; never interpolate
        # the raw vector into an SQL string
        query = query.order_by(
            SocialMediaPostEntity.embedding.cosine_distance(query_embedding)
        ).limit(limit)

        return [entity.to_domain() for entity in query.all()]
```

## Service Layer

### SocialMediaService

```python
# tradingagents/domains/socialmedia/services.py
from typing import Any, Dict, List, Optional

from tradingagents.config import TradingAgentsConfig
from tradingagents.domains.socialmedia.clients import RedditClient
from tradingagents.domains.socialmedia.embeddings import EmbeddingGenerator
from tradingagents.domains.socialmedia.models import SocialPost, SocialContext
from tradingagents.domains.socialmedia.repositories import SocialRepository


class SocialMediaService:
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        self.repository = SocialRepository()
        self.reddit_client = RedditClient(config)
        self.sentiment_analyzer = SentimentAnalyzer(config)  # defined above in this module
        self.embedding_generator = EmbeddingGenerator(config)

    async def collect_social_data(
        self,
        ticker: Optional[str] = None,
        subreddits: Optional[List[str]] = None,
    ) -> SocialContext:
        """Main entry point for social media data collection"""

        if not subreddits:
            subreddits = ['wallstreetbets', 'investing', 'stocks', 'SecurityAnalysis']

        # Fetch posts from Reddit
        raw_posts = await self.reddit_client.fetch_financial_posts(
            subreddits=subreddits,
            ticker=ticker,
            limit=100,
        )

        # Process posts: sentiment analysis + embeddings
        processed_posts = []
        for raw_post in raw_posts:
            # `content` may be empty for link posts
            text = f"{raw_post['title']} {raw_post['content'] or ''}".strip()
            score, label, confidence = await self.sentiment_analyzer.analyze_sentiment(text)

            # Generate embedding
            # TODO: thread `embedding` through to the save path so it is
            # persisted via SocialMediaPostEntity.from_domain(post, embedding)
            embedding = await self.embedding_generator.generate_embedding(text)

            post = SocialPost(
                **raw_post,
                sentiment_score=score,
                sentiment_label=label,
                sentiment_confidence=confidence,
            )
            processed_posts.append(post)

        # Save to database
        await self.repository.save_posts(processed_posts)

        # Return context
        return SocialContext(
            posts=processed_posts,
            ticker=ticker,
            total_posts=len(processed_posts),
            sentiment_summary=self._calculate_sentiment_summary(processed_posts),
        )

    def _calculate_sentiment_summary(self, posts: List[SocialPost]) -> Dict[str, Any]:
        """Calculate aggregate sentiment metrics"""
        if not posts:
            return {}

        scores = [p.sentiment_score for p in posts if p.sentiment_score is not None]
        labels = [p.sentiment_label for p in posts if p.sentiment_label]

        return {
            'avg_sentiment': sum(scores) / len(scores) if scores else 0.0,
            'positive_count': labels.count('positive'),
            'negative_count': labels.count('negative'),
            'neutral_count': labels.count('neutral'),
            'total_posts': len(posts),
        }
```

## AgentToolkit Integration

### RAG-Enhanced Methods

```python
# tradingagents/agents/libs/agent_toolkit.py (additions)

async def get_reddit_news(self, ticker: str, days: int = 7) -> str:
    """Get Reddit posts related to a ticker with RAG context"""
    try:
        # Get recent posts for ticker
        posts = await self.social_service.repository.get_posts_for_ticker(
            ticker=ticker,
            days=days,
            limit=20,
        )

        if not posts:
            return f"No Reddit posts found for {ticker} in the last {days} days."

        # Format for agent consumption
        context = f"Reddit Social Media Context for {ticker} ({len(posts)} posts):\n\n"

        for post in posts[:10]:  # Limit to top 10
            sentiment_emoji = {"positive": "📈", "negative": "📉", "neutral": "➡️"}.get(post.sentiment_label, "")
            # Sentiment may be missing if LLM analysis failed for this post
            score = f"{post.sentiment_score:.2f}" if post.sentiment_score is not None else "n/a"
            context += f"{sentiment_emoji} r/{post.subreddit} - {post.title}\n"
            context += f"  Sentiment: {post.sentiment_label} ({score})\n"
            context += f"  Engagement: {post.upvotes} upvotes, {post.comment_count} comments\n"
            if post.content:
                context += f"  Content: {post.content[:200]}...\n"
            context += "\n"

        return context

    except Exception as e:
        return f"Error fetching Reddit data for {ticker}: {str(e)}"


async def get_reddit_stock_info(self, ticker: str, query: Optional[str] = None) -> str:
    """Get Reddit stock information with semantic search"""
    try:
        if query:
            # Generate embedding for semantic search
            query_embedding = await self.social_service.embedding_generator.generate_embedding(query)
            if query_embedding:
                posts = await self.social_service.repository.vector_similarity_search(
                    query_embedding=query_embedding,
                    ticker=ticker,
                    limit=10,
                )
            else:
                posts = await self.social_service.repository.get_posts_for_ticker(ticker, days=7)
        else:
            posts = await self.social_service.repository.get_posts_for_ticker(ticker, days=7)

        if not posts:
            return f"No relevant Reddit discussions found for {ticker}."

        # Aggregate sentiment and key insights
        sentiment_summary = self.social_service._calculate_sentiment_summary(posts)

        context = f"Reddit Stock Analysis for {ticker}:\n\n"
        # Average sentiment is on the -1.0 to +1.0 scale, not 0 to 1
        context += f"Overall Sentiment: {sentiment_summary.get('avg_sentiment', 0):.2f} (scale -1.0 to +1.0)\n"
        context += f"Posts: {sentiment_summary.get('positive_count', 0)} positive, "
        context += f"{sentiment_summary.get('negative_count', 0)} negative, "
        context += f"{sentiment_summary.get('neutral_count', 0)} neutral\n\n"

        context += "Key Discussions:\n"
        for post in posts[:5]:
            score = f"{post.sentiment_score:.2f}" if post.sentiment_score is not None else "n/a"
            context += f"• {post.title} (r/{post.subreddit})\n"
            context += f"  Sentiment: {post.sentiment_label} ({score})\n"

        return context

    except Exception as e:
        return f"Error analyzing Reddit stock info for {ticker}: {str(e)}"
```

## Dagster Pipeline

### Social Media Collection Asset

```python
# tradingagents/data/assets/social_media.py
from typing import Any, Dict

from dagster import asset, AssetExecutionContext

from tradingagents.config import TradingAgentsConfig
from tradingagents.domains.socialmedia.services import SocialMediaService


@asset(
    group_name="social_media",
    description="Collect Reddit posts from financial subreddits with sentiment analysis",
)
async def reddit_financial_posts(context: AssetExecutionContext) -> Dict[str, Any]:
    """Daily collection of Reddit financial posts"""

    config = TradingAgentsConfig.from_env()
    social_service = SocialMediaService(config)

    # Collect from financial subreddits
    subreddits = ['wallstreetbets', 'investing', 'stocks', 'SecurityAnalysis']

    total_collected = 0
    results = {}

    for subreddit in subreddits:
        try:
            social_context = await social_service.collect_social_data(
                subreddits=[subreddit],
            )

            results[subreddit] = {
                'posts_collected': len(social_context.posts),
                'sentiment_summary': social_context.sentiment_summary,
            }
            total_collected += len(social_context.posts)

            context.log.info(f"Collected {len(social_context.posts)} posts from r/{subreddit}")

        except Exception as e:
            context.log.error(f"Failed to collect from r/{subreddit}: {e}")
            results[subreddit] = {'error': str(e)}

    context.log.info(f"Total posts collected: {total_collected}")
    return results
```
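
The acceptance criteria call for daily execution without manual intervention, which means the asset above needs a schedule. A sketch of the wiring (job and cron names are illustrative; adjust to the project's Dagster `Definitions`):

```python
from dagster import AssetSelection, ScheduleDefinition, define_asset_job

# Job that materializes everything in the social_media asset group
social_media_job = define_asset_job(
    name="social_media_daily_job",
    selection=AssetSelection.groups("social_media"),
)

# Run once per day at 06:00 UTC, ahead of market open
social_media_daily_schedule = ScheduleDefinition(
    job=social_media_job,
    cron_schedule="0 6 * * *",
    execution_timezone="UTC",
)
```

Both objects would then be registered in the repository's `Definitions` alongside the asset.
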
## Testing Strategy

### Test Structure

```
tests/domains/socialmedia/
├── conftest.py                  # Fixtures and test setup
├── test_reddit_client.py        # API integration tests with VCR
├── test_social_repository.py    # PostgreSQL database tests
├── test_social_service.py       # Business logic with mocks
├── test_sentiment_analyzer.py   # LLM sentiment analysis tests
├── test_embedding_generator.py  # Vector embedding tests
└── fixtures/                    # VCR cassettes and test data
    └── reddit_api_responses.yaml
```

### Key Test Patterns

```python
# tests/domains/socialmedia/test_social_service.py
import pytest


@pytest.mark.asyncio
async def test_collect_social_data_success(mock_social_service):
    """Test successful social media data collection"""
    # Mock Reddit API response
    mock_posts = [
        {
            'post_id': 'abc123',
            'title': 'AAPL to the moon!',
            'subreddit': 'wallstreetbets',
            # ... other fields
        }
    ]

    mock_social_service.reddit_client.fetch_financial_posts.return_value = mock_posts
    mock_social_service.sentiment_analyzer.analyze_sentiment.return_value = (0.8, 'positive', 0.9)

    result = await mock_social_service.collect_social_data(ticker='AAPL')

    assert len(result.posts) == 1
    assert result.posts[0].sentiment_label == 'positive'
    assert result.sentiment_summary['positive_count'] == 1
```
## Dependencies

### Technical Dependencies

- **Reddit API access** (PRAW or a Reddit API client)
- **OpenRouter API** for LLM sentiment analysis
- **PostgreSQL** with TimescaleDB and pgvectorscale extensions
- **Existing database infrastructure** from the news domain
- **OpenRouter configuration** in TradingAgentsConfig
- **Dagster orchestration framework** for scheduled execution

### Reference Implementations

- **News domain patterns**: Follow NewsService, NewsRepository, and NewsArticleEntity patterns for consistency
- **Database schema**: Mirror the NewsArticleEntity vector embedding approach for social posts
- **Agent integration**: Follow the existing AgentToolkit get_news() pattern for social media methods
- **Testing approach**: Apply news domain testing patterns: VCR for APIs, a real database for repositories

## Success Criteria

### Functionality

- Daily Reddit collection with sentiment analysis and vector search
- Seamless integration with the existing multi-agent trading framework
- RAG-enhanced social context for AI agents

### Performance

- < 2 second social context queries
- < 100 ms repository operations
- Efficient vector similarity search

### Quality

- 85%+ test coverage matching project standards
- Comprehensive error handling and resilience
- Data quality monitoring and validation

### Integration

- Seamless AgentToolkit RAG integration for AI agents
- Architecture and patterns match the successful news domain implementation
- Consistent with existing TradingAgents configuration and conventions

## Implementation Approach

**Complete domain implementation following successful news domain patterns:**

1. **Database migration** from file storage to PostgreSQL
2. **Entity models** with TimescaleDB and vector support
3. **Reddit client** implementation with rate limiting
4. **Repository layer** with vector search capabilities
5. **Service layer** with sentiment analysis and embedding generation
6. **AgentToolkit integration** with RAG-enhanced methods
7. **Dagster pipeline** for automated daily collection
8. **Comprehensive testing** with VCR mocking and real database tests

This implementation transforms the social media domain from basic stubs into a production-ready system that integrates seamlessly with the existing TradingAgents framework.