TradingAgents Personal Fork Roadmap
Overview
This roadmap outlines the technical development path for the personal fork of TradingAgents, focusing on building a robust data infrastructure with PostgreSQL + TimescaleDB + pgvectorscale, implementing RAG-powered agents, and establishing automated data collection pipelines with Dagster.
Last Updated: 2025-11-11
Key Roadmap Changes
- Pragmatic Dagster Integration: Dagster jobs are built incrementally per domain (not as a separate phase)
- Accurate Timeline: 10-14 weeks total (vs original 16-22 weeks) based on actual progress
- Incremental Automation: Each domain gets automated collection as it completes
- Earlier Production Readiness: Automated data collection starts Week 1 (not Month 4)
Development Velocity
- Observed Completion Rate: News clients 85-90% complete with 600+ lines of quality tests
- AI-Assisted Multiplier: 3-4x faster development with spec-driven workflow
- Target Task Velocity: 15-20 tasks/week with AI assistance
- Test Coverage: Maintained 85%+ with pytest-vcr pattern
Current Status: Phase 1 - News Domain + Dagster Integration (85% Complete)
The foundation is in place: the core domain architecture, a comprehensive testing framework, and the news domain clients are complete.
Completed Infrastructure
- Domain Architecture: Clean separation of news, marketdata, and socialmedia domains
- Testing Framework: Pragmatic TDD with 85%+ coverage, pytest-vcr for HTTP mocking
- News Clients: Google News RSS + Article Scraper with comprehensive tests (600+ lines)
- Database Stack: PostgreSQL + TimescaleDB + pgvectorscale ready
- Basic Agent System: Multi-agent trading analysis framework with LangGraph
Current Priorities (Next 5-7 Days)
- Complete News Domain Foundation - Repository, Service, Entity layers
- LLM Integration - OpenRouter sentiment analysis + vector embeddings
- Basic Dagster Job - Automated daily news collection
- Spec Documentation - Create status.md and tasks.md for progress tracking
Development Phases
Phase 1: News Domain + Basic Dagster (Current - 85% Complete)
Timeline: 5-7 days remaining
Status: 🔄 In Progress
Remaining Work (5-7 days)
- News Repository Layer: PostgreSQL async operations with TimescaleDB (1-2 days)
- News Service Layer: Business logic with LLM integration (1-2 days)
- NewsArticle Entity: Domain models with sentiment and embeddings (1 day)
- OpenRouter Integration: Sentiment analysis via LLM (1-2 days)
- Vector Embeddings: OpenAI embeddings via OpenRouter for semantic search (1 day)
- Basic Dagster Job: Daily news collection automation (1-2 days)
- Integration Testing: End-to-end workflow validation (1 day)
Key Deliverables
- News domain following Router → Service → Repository → Entity → Database pattern
- OpenRouter LLM sentiment analysis operational
- pgvectorscale vector embeddings for semantic search
- Automated Dagster job for daily news collection
- 85%+ test coverage maintained
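The Service → Repository → Entity layering listed above can be sketched roughly as follows. This is a minimal illustration, not the fork's actual code: the field names on `NewsArticle`, the `NewsRepository` protocol, the `InMemoryNewsRepository` stand-in (used here in place of the PostgreSQL-backed repository), and the value ranges for sentiment and confidence are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional, Protocol


@dataclass(frozen=True)
class NewsArticle:
    """Entity: one article plus optional LLM-derived fields (hypothetical schema)."""
    url: str
    title: str
    published_at: datetime
    sentiment: Optional[float] = None    # assumed range: -1.0 .. 1.0
    confidence: Optional[float] = None   # assumed range: 0.0 .. 1.0
    embedding: tuple = ()                # vector for semantic search, empty if not embedded


class NewsRepository(Protocol):
    """Persistence boundary; a real implementation would talk to PostgreSQL."""
    def upsert(self, article: NewsArticle) -> None: ...
    def find_since(self, since: datetime) -> list[NewsArticle]: ...


class InMemoryNewsRepository:
    """Stand-in repository keyed by URL, so re-ingesting an article overwrites it."""
    def __init__(self) -> None:
        self._rows: dict[str, NewsArticle] = {}

    def upsert(self, article: NewsArticle) -> None:
        self._rows[article.url] = article

    def find_since(self, since: datetime) -> list[NewsArticle]:
        rows = [a for a in self._rows.values() if a.published_at >= since]
        return sorted(rows, key=lambda a: a.published_at, reverse=True)


class NewsService:
    """Business logic layer; depends only on the repository protocol."""
    def __init__(self, repo: NewsRepository) -> None:
        self._repo = repo

    def ingest(self, articles: Iterable[NewsArticle]) -> int:
        count = 0
        for article in articles:
            self._repo.upsert(article)
            count += 1
        return count

    def recent(self, since: datetime) -> list[NewsArticle]:
        return self._repo.find_since(since)


if __name__ == "__main__":
    service = NewsService(InMemoryNewsRepository())
    now = datetime(2025, 11, 11, tzinfo=timezone.utc)
    service.ingest([NewsArticle("https://example.com/a", "Example", now)])
    print(service.recent(now))
```

Because the service depends only on the protocol, the same tests could in principle run against the in-memory stand-in and the database-backed repository.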
Success Criteria
- ✅ Complete layered architecture implemented
- ✅ LLM sentiment scores with confidence ratings
- ✅ Vector embeddings enabling semantic search
- ✅ Dagster job running daily news collection
- ✅ Query performance < 2 seconds
- ✅ News domain ready for agent integration
Phase 2: Market Data Domain + Dagster Integration (Next Priority)
Timeline: 4-5 weeks
Status: 📋 Planned
Core Objectives
- TimescaleDB Hypertables: Efficient time-series storage for price/volume data
- Market Data Collection: FinnHub/yfinance integration with retry logic
- PostgreSQL Migration: Move from file-based to database storage
- Technical Indicators: MACD, RSI, Bollinger Bands calculations
- Dagster Market Data Job: Twice-daily price data collection automation
- Performance Optimization: Sub-100ms queries with proper indexing
Key Deliverables
- MarketDataRepository with TimescaleDB optimization
- MarketDataService with technical analysis calculations
- MarketData entities (Price, OHLCV, TechnicalIndicators)
- Dagster job for automated twice-daily collection
- pytest-vcr tests for API clients
- Performance benchmarks for time-series queries
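As a rough idea of what the technical analysis calculations in MarketDataService might look like, here is a stdlib-only sketch of Bollinger Bands and RSI. The function names and default parameters are assumptions for illustration. The RSI shown uses simple averages over the last `period` changes rather than Wilder's smoothing, so its values will differ slightly from charting tools that use the smoothed variant. MACD is omitted to keep the example short.

```python
from statistics import fmean, pstdev


def sma(values, window):
    """Simple moving average: one output per full window, oldest first."""
    return [fmean(values[i - window:i]) for i in range(window, len(values) + 1)]


def bollinger(values, window=20, k=2.0):
    """Return (lower, middle, upper) bands computed over the latest window."""
    if len(values) < window:
        raise ValueError("need at least `window` values")
    recent = values[-window:]
    middle = fmean(recent)
    spread = k * pstdev(recent)  # population standard deviation of the window
    return middle - spread, middle, middle + spread


def rsi(closes, period=14):
    """RSI over the last `period` price changes, using simple (unsmoothed) averages."""
    if len(closes) < period + 1:
        raise ValueError("need at least `period + 1` closes")
    # Pair each close with the next one to get `period` consecutive changes.
    changes = [b - a for a, b in zip(closes[-period - 1:-1], closes[-period:])]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if gains == 0 and losses == 0:
        return 50.0   # flat series: treated as neutral by convention here
    if losses == 0:
        return 100.0  # only gains in the window
    rs = gains / losses  # ratio of sums equals ratio of averages
    return 100.0 - 100.0 / (1.0 + rs)


if __name__ == "__main__":
    prices = [10, 11, 12, 11, 12]
    print(sma(prices, 3))
    print(bollinger(prices, window=5))
    print(rsi(prices, period=4))
```

In practice these would likely be computed over rows read from the TimescaleDB hypertable, and a vectorized library could replace the list comprehensions once data volumes grow.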
Success Criteria
- ✅ TimescaleDB hypertables storing historical price data
- ✅ Sub-100ms queries for price lookups and indicators
- ✅ Technical indicators calculating accurately
- ✅ Dagster job running twice daily (market open/close)
- ✅ Complete migration from file-based storage
- ✅ Market data domain ready for agent integration
Phase 3: Social Media Domain + Dagster Integration
Timeline: 2-3 weeks
Status: 📋 Planned
Core Objectives
- Reddit Integration: PRAW library for financial subreddits (r/wallstreetbets, r/stocks)
- Twitter/X Alternative: Evaluate Reddit-only approach or alternative sources
- Social Sentiment Analysis: OpenRouter LLM sentiment across posts
- Cross-Domain Relations: Link social sentiment to market data and news
- Dagster Social Media Job: Daily social sentiment collection
- Vector Embeddings: Semantic search across social discussions
Key Deliverables
- RedditClient with pytest-vcr tests
- SocialMediaRepository with PostgreSQL + pgvectorscale
- SocialMediaService with sentiment aggregation
- Dagster job for daily Reddit data collection
- Cross-domain correlation queries (social ↔ news ↔ price)
- Vector embeddings for semantic post search
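The sentiment aggregation in SocialMediaService could, for example, roll individual post scores up to one value per ticker per day. The sketch below assumes a hypothetical `PostSentiment` record with a score in -1..1 and an LLM confidence in 0..1, and weights each post by its confidence. The roadmap does not specify the weighting scheme, so treat this as one possible choice.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PostSentiment:
    """One scored post (hypothetical shape)."""
    ticker: str
    day: date
    score: float        # assumed range: -1.0 (bearish) .. 1.0 (bullish)
    confidence: float   # assumed range: 0.0 .. 1.0


def aggregate_daily(posts):
    """Return {(ticker, day): confidence-weighted mean score}."""
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for post in posts:
        key = (post.ticker, post.day)
        weighted_sum[key] += post.score * post.confidence
        weight_total[key] += post.confidence
    # Skip groups whose total confidence is zero to avoid dividing by zero.
    return {key: weighted_sum[key] / total
            for key, total in weight_total.items() if total > 0}


if __name__ == "__main__":
    day = date(2025, 11, 11)
    posts = [
        PostSentiment("AAPL", day, 0.8, 1.0),
        PostSentiment("AAPL", day, -0.4, 0.5),
    ]
    print(aggregate_daily(posts))
```

The `(ticker, day)` key is also a natural join key for the cross-domain queries against news and price data.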
Success Criteria
- ✅ Reddit data collected daily from financial subreddits
- ✅ Sentiment scores integrated with market events
- ✅ Cross-domain relationships queryable in database
- ✅ Dagster job running daily social collection
- ✅ Vector embeddings enabling semantic social search
- ✅ Three-domain architecture complete
Blockers to Resolve
- Reddit API Access: Obtain REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET
- Twitter/X Alternative: Evaluate API costs or alternative data sources
Phase 4: RAG Enhancement + Advanced Orchestration
Timeline: 3-4 weeks
Status: 📋 Planned
Core Objectives
- RAG Agent Enhancement: All agents use vector similarity search for context
- Historical Pattern Matching: Semantic search for comparable market scenarios
- Cross-Domain RAG: Agents query across news, price, and social data
- Advanced Dagster Features: Data quality monitoring, gap detection, backfill
- Performance Optimization: Vector query tuning, database optimization
- Monitoring & Alerting: Pipeline health tracking and failure notifications
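Gap detection and backfill planning can be illustrated with a small date-based sketch. It assumes daily data is expected on every Monday through Friday; market holidays are deliberately not modeled, so a real check would need an exchange calendar. The function names are hypothetical.

```python
from datetime import date, timedelta


def expected_weekdays(start, end):
    """All Monday-Friday dates in [start, end]; holidays are NOT excluded."""
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            days.append(current)
        current += timedelta(days=1)
    return days


def find_gaps(present, start, end):
    """Expected dates with no stored data: the candidates for a backfill run."""
    have = set(present)
    return [d for d in expected_weekdays(start, end) if d not in have]


def missing_ratio(present, start, end):
    """Fraction of expected dates that are missing (0.0 when nothing is expected)."""
    expected = expected_weekdays(start, end)
    if not expected:
        return 0.0
    return len(find_gaps(present, start, end)) / len(expected)


if __name__ == "__main__":
    stored = [date(2025, 11, 3), date(2025, 11, 4), date(2025, 11, 6)]
    start, end = date(2025, 11, 3), date(2025, 11, 9)
    print(find_gaps(stored, start, end))
    print(missing_ratio(stored, start, end))
```

The ratio returned by `missing_ratio` is the kind of figure a data quality check could compare against a missing-data threshold before triggering a backfill.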
Key Deliverables
- RAG-enhanced agents with similarity-based context retrieval
- Cross-domain vector search (find similar market conditions)
- Dagster data quality checks and validation
- Automated backfill for missing historical data
- Monitoring dashboard for pipeline health
- Performance benchmarks for vector queries (< 50ms target)
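The similarity-based context retrieval behind the RAG deliverables comes down to ranking stored embeddings by how close they are to a query embedding. The brute-force sketch below shows the idea with cosine similarity in plain Python. It is only an illustration: in the planned stack the search would run inside PostgreSQL with pgvectorscale and an index rather than a linear scan, and the `top_k` helper and toy two-dimensional vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=3):
    """Rank {doc_id: embedding} by similarity to `query`; best match first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    corpus = {
        "earnings-beat": [1.0, 0.0],
        "rate-hike": [0.0, 1.0],
        "mixed-signals": [1.0, 1.0],
    }
    for doc_id, score in top_k([1.0, 0.1], corpus, k=2):
        print(doc_id, round(score, 3))
```

An agent would then pass the text of the top-ranked documents to the LLM as context for its decision.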
Success Criteria
- ✅ All agents using RAG for contextual decisions
- ✅ Vector similarity search < 50ms across all domains
- ✅ Cross-domain queries enabling holistic analysis
- ✅ Dagster monitoring with automated alerts
- ✅ Data quality metrics tracked and reported
- ✅ Historical gaps detected and auto-filled
- ✅ Production-ready data infrastructure complete
Technical Milestones
Revised Timeline: 10-14 weeks (vs original 16-22 weeks)
Phase Breakdown:
- Phase 1 (News + Dagster): 5-7 days
- Phase 2 (Market Data + Dagster): 4-5 weeks
- Phase 3 (Social Media + Dagster): 2-3 weeks
- Phase 4 (RAG + Advanced Orchestration): 3-4 weeks
Database Architecture
- Week 1: PostgreSQL + TimescaleDB + pgvectorscale operational (News domain)
- Week 6: TimescaleDB hypertables optimized for market data time-series
- Week 9: Three-domain database architecture complete with vector embeddings
- Week 12: Full RAG implementation with cross-domain similarity search
Agent Capabilities
- Week 1: News Analysts accessing news with LLM sentiment
- Week 6: Technical Analysts using market data with indicators
- Week 9: Sentiment Analysts using social media data
- Week 12: All agents RAG-enhanced with historical context
Data Pipeline Maturity (Incremental Dagster)
- Week 1: Daily news collection automated via Dagster
- Week 6: Twice-daily market data collection automated
- Week 9: Daily social media collection automated
- Week 12: Production-grade orchestration with monitoring, backfill, and alerting
Success Metrics
Technical Excellence
- Test Coverage: Maintain 85%+ across all domains
- Query Performance: < 100ms for common database operations
- Pipeline Reliability: 99%+ uptime for data collection
- Data Quality: < 0.1% missing data points across all domains
Feature Completeness
- Domain Coverage: 100% implementation across news, marketdata, socialmedia
- Agent Capabilities: RAG-enhanced decision making operational
- Data Infrastructure: Complete PostgreSQL + TimescaleDB + pgvectorscale stack
- Automation: Fully automated data collection and processing
Development Velocity
- Code Quality: Consistent formatting, type checking, and documentation
- Testing Strategy: Comprehensive test suite with domain-specific approaches
- Architecture Consistency: Clean domain separation and layered architecture
- Performance Optimization: Regular profiling and optimization cycles
Risk Management
Technical Risks
- Database Performance: Mitigate with proper indexing and query optimization
- API Rate Limits: Implement intelligent backoff and caching strategies
- Data Quality: Establish comprehensive validation and monitoring
- Vector Search Performance: Optimize pgvectorscale configuration and queries
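For the API rate limit risk, one common form of backoff is exponential delay with random jitter. The sketch below is a generic helper, not code from the project: `retry_with_backoff` and its parameters are hypothetical, and the `sleep` and `rng` arguments are injectable only so the behavior can be tested without real waiting.

```python
import random
import time


def retry_with_backoff(func, *, attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep, rng=random.random):
    """Call `func()` until it succeeds or `attempts` calls have failed.

    The wait before retry n (0-based) is a random fraction of
    min(max_delay, base_delay * 2**n), often called "full jitter".
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(cap * rng())


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_fetch():
        # Simulated client call that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("rate limited")
        return "ok"

    print(retry_with_backoff(flaky_fetch, base_delay=0.01))
```

Narrowing `retry_on` to the specific exceptions a client raises for rate limiting or timeouts avoids retrying on genuine bugs.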
Development Risks
- Scope Creep: Maintain focus on sequential domain completion
- Technical Debt: Regular refactoring and code quality maintenance
- Testing Coverage: Continuous integration with coverage enforcement
- Documentation: Maintain comprehensive documentation throughout development
Long-Term Vision (6+ Months)
Advanced Capabilities
- Strategy Backtesting: Historical strategy validation with complete data
- Real-Time Analysis: Live market analysis with sub-second agent responses
- Advanced RAG: Multi-modal RAG with charts, documents, and audio data
- Performance Analytics: Comprehensive analysis of agent decision accuracy
Research Applications
- Academic Research: Platform for publishing trading AI research
- Strategy Development: Complete environment for developing proprietary strategies
- Data Science: Advanced analytics and machine learning on financial data
- Educational Use: Comprehensive learning platform for financial AI
This roadmap prioritizes building a solid data foundation before enhancing agent capabilities, ensuring each phase delivers measurable value while maintaining high code quality and comprehensive testing.