TradingAgents/docs/specs/socialmedia/spec-lite.md

3.6 KiB

Social Media Domain - Specification Lite

Summary

Complete implementation of social media data collection from Reddit with LLM sentiment analysis and vector embeddings for AI agent RAG integration.

Core Requirements

Data Collection

  • Daily Reddit collection from financial subreddits (wallstreetbets, investing, stocks, SecurityAnalysis)
  • OpenRouter LLM sentiment analysis with confidence scoring
  • Vector embeddings for semantic similarity search
  • PostgreSQL storage with TimescaleDB + pgvectorscale optimization

Agent Integration

  • AgentToolkit methods: get_reddit_news() and get_reddit_stock_info()
  • RAG-enhanced queries with < 2 second response time
  • Vector similarity search for contextual social media insights

Technical Implementation

Architecture Pattern

Router → Service → Repository → Entity → Database (matching news domain)

Database Schema

social_media_posts (
    post_id, ticker, subreddit, title, content, author,
    created_at, upvotes, comment_count, 
    sentiment_score, sentiment_label, sentiment_confidence,
    embedding vector(1536), -- pgvectorscale
    data_quality_score, processing_status
)

Key Components

1. RedditClient

  • PRAW integration with rate limiting
  • Financial subreddit targeting
  • Ticker-specific post filtering

2. SentimentAnalyzer

  • OpenRouter LLM integration
  • Structured sentiment scoring (-1.0 to +1.0)
  • Financial context awareness

3. SocialRepository

  • PostgreSQL with deduplication by post_id
  • Vector similarity search using pgvectorscale
  • TimescaleDB time-series optimization

4. SocialMediaService

  • Orchestrates collection pipeline: Reddit → Sentiment → Embeddings → Storage
  • Provides ticker-specific social context
  • Calculates aggregate sentiment metrics

5. AgentToolkit Integration

async def get_reddit_news(ticker: str, days: int = 7) -> str:
    # Returns formatted social media context with sentiment analysis
    
async def get_reddit_stock_info(ticker: str, query: Optional[str] = None) -> str:  
    # Returns semantic search results with sentiment aggregation

Implementation Scope

Complete Implementation

  • PostgreSQL migration from file storage
  • Reddit API client (currently empty stub)
  • SQLAlchemy entities with vector fields
  • LLM sentiment analysis pipeline
  • Vector embedding generation and search
  • Dagster pipeline for scheduled collection
  • Comprehensive test coverage (pytest-vcr for APIs)

Current Status

Basic stub implementation - requires complete rebuild of all components

Dependencies

  • Reddit API credentials (PRAW)
  • OpenRouter API access
  • PostgreSQL with TimescaleDB + pgvectorscale
  • Existing TradingAgentsConfig
  • News domain patterns for consistency

Data Flow

  1. Dagster pipeline triggers daily collection
  2. RedditClient fetches posts from financial subreddits
  3. SentimentAnalyzer processes posts via OpenRouter LLM
  4. EmbeddingGenerator creates vector embeddings
  5. SocialRepository stores in PostgreSQL with deduplication
  6. AI Agents query via AgentToolkit with RAG-enhanced context

Testing Strategy

  • pytest-vcr for Reddit API mocking
  • Real PostgreSQL for repository integration tests
  • Service mocks for business logic testing
  • 85%+ coverage matching project standards

Success Criteria

  • Daily automated Reddit collection with sentiment analysis
  • Sub-2-second agent queries with vector search
  • Seamless RAG integration matching news domain patterns
  • Production-ready reliability with comprehensive error handling