docs: Adopt Spec-Driven Development framework
Establish complete Spec-Driven Development documentation structure to enable AI-assisted implementation with product context, feature specs, and architectural standards.

Documentation:
- Add product docs (product.md, roadmap.md) for business context
- Add feature specs for marketdata, news, and socialmedia domains
- Add technical standards (practices.md, security.md, style.md, tech.md)
- Update README with SDD workflow and PostgreSQL architecture

Restructure:
- Move Docker files to docker/db/ for cleaner organization
- Move docker-compose.yml to project root
- Remove deprecated configs (litellm.yml, package.json, setup.py)
- Update tests for pytest-vcr integration

This establishes the foundation for /spec:* workflow commands and structured AI-agent collaboration.
Parent: 4565a41600
Commit: c20771bf20
```toml
python = "3.13"
uv = "latest"
ruff = "latest"
docker = "latest"

[env]
_.file = ".env"
```
LICENSE — 3 lines changed
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2025 Martin C. Richards
Copyright 2025 Tauric Research

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
README.md — 340 lines changed
<p align="center">
  <img src="assets/TauricResearch.png" style="width: 60%; height: auto;">
</p>

# TradingAgents Project Overview

<div align="center" style="line-height: 1;">
  <a href="https://arxiv.org/abs/2412.20138" target="_blank"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-2412.20138-B31B1B?logo=arxiv"/></a>
  <a href="https://discord.com/invite/hk9PGKShPK" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-TradingResearch-7289da?logo=discord&logoColor=white&color=7289da"/></a>
  <a href="./assets/wechat.png" target="_blank"><img alt="WeChat" src="https://img.shields.io/badge/WeChat-TauricResearch-brightgreen?logo=wechat&logoColor=white"/></a>
  <a href="https://x.com/TauricResearch" target="_blank"><img alt="X Follow" src="https://img.shields.io/badge/X-TauricResearch-white?logo=x&logoColor=white"/></a>
  <br>
  <a href="https://github.com/TauricResearch/" target="_blank"><img alt="Community" src="https://img.shields.io/badge/Join_GitHub_Community-TauricResearch-14C290?logo=discourse"/></a>
</div>

## Spec-Driven Development Integration

TradingAgents integrates with the Spec-Driven Development workflow to accelerate feature development while maintaining architectural consistency. This project uses the specialized agent system described in your global CLAUDE.md for structured specifications and AI-assisted implementation.

### Project Context for AI Agents

**Product Definition**: Multi-agent LLM financial trading framework that mirrors real-world trading firm dynamics for research-based market analysis and trading decisions.

**Target Users**: Single developer/researcher focused on personal trading research and data infrastructure development.

**Core Architecture**: Domain-driven design with three domains (marketdata, news, socialmedia), a PostgreSQL + TimescaleDB + pgvectorscale data stack, and RAG-powered multi-agent collaboration through LangGraph workflows.

**Key Constraints**: Research-only framework (not production trading), OpenRouter as the sole LLM provider, an 85%+ test coverage requirement, and TDD with pytest.

### Documentation Structure

- **Product Docs**: `docs/product/` - Business context and roadmap
- **Feature Specs**: `docs/spec/` - Implementation specifications
- **Standards**: `docs/standards/` - Technical architecture and practices

### Agent Context for Implementation

When implementing features, AI agents should reference:
- `docs/product/product.md` for business context and user requirements
- `docs/standards/tech.md` for architectural patterns and technical standards
- `docs/standards/practices.md` for TDD workflow and development practices
- `docs/standards/style.md` for code style and naming conventions

Apply the layered architecture pattern: **Router → Service → Repository → Entity → Database** consistently across all domains.
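A minimal sketch of that layering, with hypothetical class and method names (the project's actual routers, services, and repositories live in the domain packages and will differ); the database layer is stood in for by an in-memory list so the flow of a request through the layers is visible:

```python
from dataclasses import dataclass


# Entity: a plain data object, hypothetical fields for illustration.
@dataclass
class Article:
    ticker: str
    headline: str


class NewsRepository:
    """Repository: the only layer that touches storage (in-memory here)."""

    def __init__(self) -> None:
        self._rows: list[Article] = []

    def save(self, article: Article) -> None:
        self._rows.append(article)

    def find_by_ticker(self, ticker: str) -> list[Article]:
        return [a for a in self._rows if a.ticker == ticker]


class NewsService:
    """Service: business logic, depends only on the repository."""

    def __init__(self, repo: NewsRepository) -> None:
        self._repo = repo

    def headlines_for(self, ticker: str) -> list[str]:
        return [a.headline for a in self._repo.find_by_ticker(ticker)]


class NewsRouter:
    """Router: translates an external request into a service call."""

    def __init__(self, service: NewsService) -> None:
        self._service = service

    def get(self, ticker: str) -> dict:
        return {"ticker": ticker, "headlines": self._service.headlines_for(ticker)}


repo = NewsRepository()
repo.save(Article("NVDA", "Chip demand surges"))
router = NewsRouter(NewsService(repo))
print(router.get("NVDA"))
```

Each layer depends only on the layer directly beneath it, which is what makes the service testable with a mocked repository and the repository testable against a real database.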

---

# TradingAgents: Multi-Agents LLM Financial Trading Framework

> **Personal Fork Notice**: This is a personal fork of the original TradingAgents framework by TauricResearch, originally licensed under Apache 2.0. This fork focuses on individual research and development with significant architectural changes, including PostgreSQL + TimescaleDB + pgvectorscale data infrastructure and RAG-powered agents.
>
> **Original Work**: [TauricResearch/TradingAgents](https://github.com/TauricResearch/TradingAgents) - [arXiv:2412.20138](https://arxiv.org/abs/2412.20138)

---

<div align="center">

🚀 [TradingAgents](#tradingagents-framework) | ⚡ [Installation & CLI](#installation-and-cli) | 🎬 [Demo](https://www.youtube.com/watch?v=90gr5lwjIho) | 📦 [Package Usage](#tradingagents-package) | 📚 [API Docs](./docs/api-reference.md) | 🔧 [Troubleshooting](./docs/troubleshooting.md) | 👥 [Agent Dev](./docs/agent-development.md) | 🤝 [Contributing](#contributing) | 📄 [Citation](#citation)

</div>

## TradingAgents Framework
<p align="center">
  <img src="assets/schema.png" style="width: 100%; height: auto;">
</p>

> TradingAgents framework is designed for research purposes. Trading performance may vary based on many factors, including the chosen backbone language models, model temperature, trading periods, the quality of data, and other non-deterministic factors. It is not intended as financial, investment, or trading advice.

Our framework decomposes complex trading tasks into specialized roles. This ensures the system achieves a robust, scalable approach to market analysis and decision-making.

Clone TradingAgents:
```bash
git clone https://github.com/martinrichards23/TradingAgents.git
cd TradingAgents
```

Install development tools (mise manages Python, uv, and other tools):
```bash
# Install mise if not already installed
curl https://mise.run | sh

# Install project tools and dependencies
mise install      # Installs Python, uv, ruff, pyright
mise run install  # Installs project dependencies with uv
```

Alternative manual setup:
```bash
# Create virtual environment with uv
uv venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install dependencies
uv sync
```

### Database Setup

This fork uses PostgreSQL with TimescaleDB and pgvectorscale extensions:

```bash
# Using Docker Compose (recommended)
docker-compose up -d

# Or install PostgreSQL with extensions manually
# See docs/setup-database.md for detailed instructions
```

### Required APIs

OpenRouter API (unified LLM provider):
```bash
export OPENROUTER_API_KEY=$YOUR_OPENROUTER_API_KEY
```

FinnHub API for financial data (optional):
```bash
export FINNHUB_API_KEY=$YOUR_FINNHUB_API_KEY
```

Database connection:
```bash
export DATABASE_URL="postgresql://user:pass@localhost:5432/tradingagents"
```

### CLI Usage

Run the CLI directly:
```bash
mise run dev  # or python -m cli.main
```
You will see a screen where you can select your desired tickers, date, LLMs, research depth, etc.

<p align="center">
  <img src="assets/cli/cli_init.png" width="100%" style="display: inline-block; margin: 0 2%;">
</p>

An interface will appear showing results as they load, letting you track the agent's progress as it runs.

<p align="center">
  <img src="assets/cli/cli_news.png" width="100%" style="display: inline-block; margin: 0 2%;">
</p>

<p align="center">
  <img src="assets/cli/cli_transaction.png" width="100%" style="display: inline-block; margin: 0 2%;">
</p>

## Quick Start

Get up and running with TradingAgents in 3 simple steps:

### Step 1: Set API Keys
```bash
export OPENROUTER_API_KEY="your_openrouter_api_key"
export FINNHUB_API_KEY="your_finnhub_api_key"  # Optional for financial data
export DATABASE_URL="postgresql://user:pass@localhost:5432/tradingagents"
```

### Step 2: Run Your First Analysis

The analysis returns:
- **Decision**: `BUY`, `SELL`, or `HOLD`
- **Result**: Detailed analysis from all agents, including market data, news sentiment, and risk assessment

**Next Steps**: Explore the [CLI interface](#cli-usage), check out [usage examples](#openrouter-configuration), or dive into the [API documentation](./docs/api-reference.md).

## TradingAgents Package

### Implementation Details

This fork is built with:
- **LangGraph** for agent orchestration
- **PostgreSQL + TimescaleDB + pgvectorscale** for data storage and vector search
- **OpenRouter** as the unified LLM provider
- **RAG** for context-aware agent decision making
- **Dagster** for data collection orchestration

### Python Usage

To use TradingAgents inside your code, import the `tradingagents` module and initialize a `TradingAgentsGraph()` object. The `.propagate()` function returns a decision. You can run `main.py`; here's also a quick example:

```python
from tradingagents.graph.trading_graph import TradingAgentsGraph
from tradingagents.config import TradingAgentsConfig

config = TradingAgentsConfig.from_env()
ta = TradingAgentsGraph(debug=True, config=config)

# Forward propagate
_, decision = ta.propagate("NVDA", "2024-05-10")
print(decision)
```

### Custom Configuration

```python
from tradingagents.graph.trading_graph import TradingAgentsGraph
from tradingagents.config import TradingAgentsConfig

# Create a custom config
config = TradingAgentsConfig(
    llm_provider="openrouter",
    deep_think_llm="anthropic/claude-3.5-sonnet",
    quick_think_llm="anthropic/claude-3.5-haiku",
    max_debate_rounds=3,
    use_rag=True,  # Enable RAG-powered agents
    database_url="postgresql://user:pass@localhost:5432/tradingagents"
)

# Initialize with custom config
ta = TradingAgentsGraph(debug=True, config=config)

# Forward propagate
_, decision = ta.propagate("NVDA", "2024-05-10")
print(decision)
```

You can view the full list of configurations in `tradingagents/config.py`.

### Environment Variables Reference

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `LLM_PROVIDER` | LLM provider to use | `openrouter` | `openrouter` |
| `OPENROUTER_API_KEY` | OpenRouter API key | Required | `sk-or-...` |
| `DEEP_THINK_LLM` | Model for complex analysis | `anthropic/claude-3.5-sonnet` | `openai/gpt-4` |
| `QUICK_THINK_LLM` | Model for fast responses | `anthropic/claude-3.5-haiku` | `openai/gpt-4o-mini` |
| `MAX_DEBATE_ROUNDS` | Investment debate rounds | `1` | `3` |
| `MAX_RISK_DISCUSS_ROUNDS` | Risk discussion rounds | `1` | `2` |
| `ONLINE_TOOLS` | Use live APIs vs cached data | `true` | `false` |
| `USE_RAG` | Enable RAG for agents | `true` | `false` |
| `DATABASE_URL` | PostgreSQL connection string | Required | `postgresql://...` |
| `DEFAULT_LOOKBACK_DAYS` | Historical data range | `30` | `60` |
| `TRADINGAGENTS_RESULTS_DIR` | Output directory | `./results` | `./my_results` |
| `TRADINGAGENTS_DATA_DIR` | Data storage directory | System default | `./data` |
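
An illustrative sketch of how the table above could map onto a `from_env`-style loader; the real implementation lives in `tradingagents/config.py` and may differ, and the helper name `load_config` is hypothetical:

```python
def load_config(env: dict[str, str]) -> dict:
    """Parse environment-style settings with the defaults from the table."""

    def flag(name: str, default: str) -> bool:
        # Boolean env vars arrive as strings such as "true"/"false".
        return env.get(name, default).lower() in ("1", "true", "yes")

    return {
        "llm_provider": env.get("LLM_PROVIDER", "openrouter"),
        "deep_think_llm": env.get("DEEP_THINK_LLM", "anthropic/claude-3.5-sonnet"),
        "quick_think_llm": env.get("QUICK_THINK_LLM", "anthropic/claude-3.5-haiku"),
        "max_debate_rounds": int(env.get("MAX_DEBATE_ROUNDS", "1")),
        "online_tools": flag("ONLINE_TOOLS", "true"),
        "use_rag": flag("USE_RAG", "true"),
        "database_url": env["DATABASE_URL"],  # required: raises KeyError if unset
    }


config = load_config({
    "DATABASE_URL": "postgresql://localhost/tradingagents",
    "MAX_DEBATE_ROUNDS": "3",
})
print(config["max_debate_rounds"])
```

Unset optional variables fall back to the documented defaults, while a missing `DATABASE_URL` fails loudly, matching the "Required" entries in the table.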

### OpenRouter Configuration

This fork exclusively uses OpenRouter for unified LLM access:

```python
from tradingagents.graph.trading_graph import TradingAgentsGraph
from tradingagents.config import TradingAgentsConfig

config = TradingAgentsConfig(
    llm_provider="openrouter",
    deep_think_llm="anthropic/claude-3.5-sonnet",
    quick_think_llm="openai/gpt-4o-mini",
    max_debate_rounds=2
)

ta = TradingAgentsGraph(debug=True, config=config)
_, decision = ta.propagate("TSLA", "2024-01-15")
```

See [docs/api-reference.md](./docs/api-reference.md) for complete API documentation.

## Development Guide

This section provides development guidance for contributors working on the TradingAgents codebase.

### Common Development Commands

This project uses [mise](https://mise.jdx.dev/) for tool and task management:

#### Essential Commands
- **CLI Application**: `mise run dev` - Interactive CLI for running trading analysis
- **Type checking**: `mise run typecheck` - Run the pyright type checker
- **Run all tests**: `mise run test` - Run tests with pytest

#### Database Commands
- **Start database**: `docker-compose up -d`
- **Run migrations**: `mise run migrate`
- **Seed test data**: `mise run seed`

### Testing Principles

```
tests/
│   └── news/
│       ├── __init__.py
│       ├── test_news_service.py        # Mock repo + clients
│       ├── test_news_repository.py     # PostgreSQL test DB
│       └── test_google_news_client.py  # pytest-vcr
```

#### Mocking Strategy by Layer
- **Services**: Mock Repository + Clients, test real transformations
- **Repositories**: Real persistence (temp files/Docker), no mocks
- **Clients**: Real HTTP with pytest-vcr cassettes
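
A runnable sketch of the service-layer half of that strategy, using hypothetical names: the repository is mocked at the boundary while the service's real transformation logic (deduplication here) is exercised.

```python
from unittest.mock import Mock


class NewsService:
    """Hypothetical service whose transformation logic we want to test."""

    def __init__(self, repo):
        self._repo = repo

    def unique_headlines(self, ticker):
        # Real logic under test: order-preserving deduplication.
        seen, out = set(), []
        for headline in self._repo.fetch_headlines(ticker):
            if headline not in seen:
                seen.add(headline)
                out.append(headline)
        return out


# Mock the repository boundary instead of standing up a database.
repo = Mock()
repo.fetch_headlines.return_value = ["Fed holds rates", "Fed holds rates", "NVDA beats"]

service = NewsService(repo)
result = service.unique_headlines("NVDA")

assert result == ["Fed holds rates", "NVDA beats"]    # behavior, not internals
repo.fetch_headlines.assert_called_once_with("NVDA")  # boundary interaction
print(result)
```

Repository tests would instead hit the real PostgreSQL test database, and client tests would replay recorded HTTP via pytest-vcr cassettes.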

#### Quality Standards
- **85% coverage** minimum
- **< 100ms** per unit test
- **Mock boundaries, test behavior**

## Architecture Overview

### Multi-Agent Trading System

#### 1. Domain-Driven Architecture
Three main domains with clean separation:
- **Financial Data** (`tradingagents/domains/marketdata/`): Market prices, technical analysis, fundamentals
- **News** (`tradingagents/domains/news/`): News articles and sentiment analysis (95% complete)
- **Social Media** (`tradingagents/domains/socialmedia/`): Social sentiment from Reddit/Twitter

#### 2. PostgreSQL + TimescaleDB + pgvectorscale Stack
- **PostgreSQL**: Primary database for structured data
- **TimescaleDB**: Time-series optimization for market data
- **pgvectorscale**: Vector storage for RAG and semantic search
- **Automated migrations**: Database schema versioning

#### 3. RAG-Powered Agent Integration
- `AgentToolkit` with RAG capabilities for contextual decision making
- Vector search for relevant historical data and patterns
- Semantic similarity matching for comparable market conditions
- Context-aware analysis based on historical performance
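
A toy illustration of the vector-similarity idea behind those bullets; in the real stack the nearest-neighbour search runs inside PostgreSQL via pgvectorscale, and the embeddings and labels below are invented for the example:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings of past market summaries (3-dim for readability;
# real text embeddings have hundreds of dimensions).
history = {
    "2022 rate-hike selloff": [0.9, 0.1, 0.0],
    "2021 meme-stock rally":  [0.1, 0.9, 0.2],
    "2020 covid crash":       [0.7, 0.3, 0.2],
}

query = [0.85, 0.15, 0.05]  # embedding of the current market context
best = max(history, key=lambda label: cosine(query, history[label]))
print(best)
```

The retrieved summary (here the closest historical regime to the query vector) is what gets injected into the agent's prompt as context.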

#### 4. Dagster Data Orchestration
- Daily/twice-daily data collection pipelines
- Automated data quality checks and validation
- Gap detection and backfill capabilities
- Monitoring and alerting for data pipeline health
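
The gap-detection step can be sketched as follows; the function name is hypothetical and the real pipeline runs inside Dagster assets (and would also account for weekends and market holidays):

```python
from datetime import date, timedelta


def find_gaps(stored: set[date], start: date, end: date) -> list[date]:
    """Return the days in [start, end] with no stored data, oldest first."""
    gaps = []
    day = start
    while day <= end:
        if day not in stored:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Days already present for a ticker, per the repository.
stored = {date(2024, 5, 1), date(2024, 5, 2), date(2024, 5, 4)}
gaps = find_gaps(stored, date(2024, 5, 1), date(2024, 5, 5))
print(gaps)  # the missing days to backfill
```

Each detected gap becomes a backfill request against the upstream API, after which the freshly fetched rows are deduplicated against the repository.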

### Key Design Patterns

1. **RAG-Enhanced Decisions**: Agents use vector similarity search for context
2. **Time-Series Optimized**: TimescaleDB for efficient market data queries
3. **Quality-Aware Data**: All contexts include data quality metadata
4. **Structured Outputs**: Pydantic models with database persistence

### File Structure
```
tradingagents/
├── agents/          # Agent implementations with RAG capabilities
│   └── libs/        # AgentToolkit and utilities
├── domains/         # Domain-specific services
│   ├── marketdata/  # Financial data domain
│   ├── news/        # News domain (95% complete)
│   └── socialmedia/ # Social media domain
├── graph/           # LangGraph workflow orchestration
├── data/            # Dagster pipelines and data management
└── config.py        # Configuration management
```

### Performance Optimization

**Database Strategy:**
- TimescaleDB hypertables for efficient time-series queries
- pgvectorscale for fast vector similarity search
- Materialized views for common aggregations

**Model Selection:**
- OpenRouter unified interface reduces API complexity
- `quick_think_llm` for data retrieval and formatting
- `deep_think_llm` for complex analysis and decisions

**Cost Optimization:**
```python
config = TradingAgentsConfig(
    deep_think_llm="gpt-4o-mini",  # Lower cost
    max_debate_rounds=1,           # Fewer debates
    online_tools=False,            # Use cached data
    default_lookback_days=30       # Limit data range
)
```

## Need Help?

- **Detailed Architecture**: `docs/architecture.md`
- **API Documentation**: `docs/api-reference.md`
- **Troubleshooting**: `docs/troubleshooting.md`
- **Agent Development**: `docs/agent-development.md`

## Contributing

We welcome contributions from the community! Whether it's fixing a bug, improving documentation, or suggesting a new feature, your input helps make this project better. If you are interested in this line of research, please consider joining our open-source financial AI research community [Tauric Research](https://tauric.ai/).

## Citation

Please reference the original work if you find *TradingAgents* provides you with some help:

```
@misc{xiao2025tradingagentsmultiagentsllmfinancial,
}
```

## License

This personal fork maintains the Apache 2.0 license from the original TauricResearch/TradingAgents project.
docker-compose.yml

```yaml
services:
  timescaledb:
    build: ./db
    container_name: tradingagents_timescaledb
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: tradingagents
      POSTGRES_DB: tradingagents
    ports:
      - "5432:5432"
```
@ -0,0 +1,150 @@
|
|||
# TradingAgents Product Definition
|
||||
|
||||
## Product Overview
|
||||
|
||||
**TradingAgents** is a personal fork of the multi-agent LLM financial trading framework designed for individual trading research and data infrastructure development. This fork focuses on PostgreSQL + TimescaleDB + pgvectorscale architecture with RAG-powered agents for enhanced decision making through historical context and pattern recognition.
|
||||
|
||||
## Target User
|
||||
|
||||
### Primary User
|
||||
- **Single Developer/Researcher**: Individual focused on personal trading research, strategy development, and building robust data infrastructure for financial analysis
|
||||
|
||||
### Use Cases
|
||||
- **Personal Trading Research**: Developing and testing proprietary trading strategies with AI-powered analysis
|
||||
- **Data Infrastructure Development**: Building scalable time-series and vector search capabilities for financial data
|
||||
- **RAG Implementation**: Experimenting with retrieval-augmented generation for context-aware trading decisions
|
||||
- **Academic Research**: Individual research projects exploring AI applications in financial markets
|
||||
|
||||
## Core Value Proposition
|
||||
|
||||
This personal fork transforms the original TradingAgents framework into a focused research and development platform that:
|
||||
- **Enables Personal Research**: Provides a complete data infrastructure for individual trading research and strategy development
|
||||
- **Implements Modern Architecture**: PostgreSQL + TimescaleDB + pgvectorscale stack for efficient time-series and vector operations
|
||||
- **Supports RAG-Powered Decisions**: Agents leverage historical context through vector similarity search for informed decisions
|
||||
- **Streamlines Data Collection**: Automated daily/twice-daily data pipelines with Dagster orchestration
|
||||
- **Unifies LLM Access**: Single OpenRouter integration for consistent model access across all agents
|
||||
|
||||
## Key Features
|
||||
|
||||
### Enhanced Data Architecture
|
||||
- **PostgreSQL Foundation**: Robust relational database for structured financial data
|
||||
- **TimescaleDB Integration**: Optimized time-series storage and querying for market data
|
||||
- **pgvectorscale Extension**: High-performance vector search for RAG and similarity matching
|
||||
- **Automated Migrations**: Database schema versioning and management
|
||||
|
||||
### RAG-Powered Multi-Agent System
|
||||
- **Context-Aware Analysis**: Agents use vector similarity search to find relevant historical patterns
|
||||
- **Enhanced Decision Making**: Retrieval-augmented generation provides historical context for trading decisions
|
||||
- **Pattern Recognition**: Semantic similarity matching for comparable market conditions
|
||||
- **Learning from History**: Agents reference past decisions and outcomes for improved analysis
|
||||
### Automated Data Collection
- **Dagster Orchestration**: Daily/twice-daily data collection pipelines with monitoring and alerting
- **Quality Assurance**: Automated data validation, gap detection, and backfill capabilities
- **Domain Coverage**: Comprehensive data collection for news (95% complete), market data, and social media domains
- **Scalable Processing**: Efficient batch processing with dependency management

### Unified LLM Provider
- **OpenRouter Integration**: Single provider for all model access, reducing API complexity
- **Cost Optimization**: Strategic model selection with clear separation between analysis and data processing models
- **Model Flexibility**: Easy switching between different models through OpenRouter's unified interface
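Because OpenRouter exposes an OpenAI-compatible chat completions endpoint, one request shape serves every model. A minimal sketch with only the standard library; the model id and API key are placeholders, not the project's configured values.

```python
# Sketch of the unified OpenRouter access pattern (OpenAI-compatible API).
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build a chat-completion request; swapping models is just a string change."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        OPENROUTER_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_request("openai/gpt-4o-mini", "Summarize today's market news.", "sk-...")
print(req.full_url)
# urllib.request.urlopen(req) would send it; omitted here since it needs a real key.
```

Switching an agent to a different model means changing only the `model` string, which is what makes the single-provider decision cheap to live with.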
## Business Context

### Research Focus Areas
- **Individual Strategy Development**: Personal trading algorithm research and backtesting
- **Data Infrastructure**: Building scalable financial data storage and retrieval systems
- **AI/ML in Finance**: Experimenting with RAG, vector search, and multi-agent systems
- **Time-Series Analysis**: Advanced market data analysis with TimescaleDB optimization

### Technical Advantages
- **Modern Data Stack**: PostgreSQL + TimescaleDB + pgvectorscale provides production-grade data infrastructure
- **RAG Implementation**: Real-world application of retrieval-augmented generation in financial decision making
- **Comprehensive Testing**: Maintains 85%+ test coverage with a pragmatic TDD approach
- **Scalable Architecture**: Domain-driven design supports extensibility and maintainability

### Development Metrics
- **Code Quality**: 85%+ test coverage, comprehensive type checking, automated formatting
- **Data Pipeline Health**: Automated monitoring and alerting for data collection processes
- **Performance**: Optimized queries with TimescaleDB, fast vector search with pgvectorscale
- **Maintainability**: Clean architecture patterns, comprehensive documentation

## Technical Constraints

### Requirements
- **Database**: PostgreSQL with TimescaleDB and pgvectorscale extensions
- **Python Environment**: Python 3.13+ with comprehensive dependency management
- **API Access**: OpenRouter API key for LLM access, optional FinnHub for real-time data
- **Infrastructure**: Docker Compose for local development, Dagster for data orchestration

### Architectural Decisions
- **Single-Developer Focus**: Optimized for individual use rather than multi-user collaboration
- **PostgreSQL-First**: All data persistence through PostgreSQL with appropriate extensions
- **OpenRouter Exclusive**: Unified LLM provider reduces complexity and improves consistency
- **Domain Completion**: Sequential domain development (news 95% → marketdata → socialmedia)

## Project Scope

### Current Implementation Status
- **News Domain**: 95% complete with comprehensive article scraping and sentiment analysis
- **Core Infrastructure**: PostgreSQL + TimescaleDB + pgvectorscale foundation established
- **Agent Framework**: RAG-powered agents with vector search capabilities
- **Data Pipelines**: Dagster orchestration for automated data collection

### Included Features
- Complete PostgreSQL-based data architecture with time-series and vector extensions
- RAG-enhanced multi-agent analysis framework with historical context
- Automated data collection pipelines with Dagster orchestration
- OpenRouter integration for unified LLM access
- Comprehensive test suite with domain-specific testing strategies
- CLI interface for interactive analysis and debugging

### Excluded Features
- Multi-user collaboration features
- Real-money trading capabilities
- Production-grade risk management for live trading
- Multiple database backend support
- Legacy LLM provider integrations (focus on OpenRouter only)

## Development Phases

### Phase 1: News Domain Completion (Current - 95% Complete)
- Finalize news article scraping and processing
- Complete sentiment analysis pipeline
- Optimize news data storage and retrieval
- Implement comprehensive testing for the news domain

### Phase 2: Market Data Domain + PostgreSQL Migration
- Complete market data collection and processing
- Implement TimescaleDB optimizations for price data
- Add technical analysis calculations
- Migrate all data persistence to PostgreSQL

### Phase 3: Social Media Domain
- Implement Reddit and Twitter data collection
- Add social sentiment analysis
- Complete the three-domain architecture
- Optimize cross-domain data relationships

### Phase 4: Dagster Pipeline Implementation
- Daily/twice-daily data collection automation
- Comprehensive monitoring and alerting
- Data quality validation and gap detection
- Performance optimization and scaling

### Phase 5: RAG Enhancement and OpenRouter Migration
- Complete RAG implementation for all agents
- Migrate to OpenRouter as the sole LLM provider
- Optimize vector search performance
- Implement advanced pattern recognition

## Success Criteria

This personal fork is successful when it provides:
- **Robust Data Infrastructure**: PostgreSQL + TimescaleDB + pgvectorscale handling all financial data efficiently
- **Intelligent Decision Making**: RAG-powered agents making context-aware trading recommendations
- **Reliable Data Collection**: Automated pipelines collecting high-quality data consistently
- **Research Capability**: A complete platform for individual trading strategy research and development
- **Maintainable Codebase**: 85%+ test coverage with clear architecture and comprehensive documentation

The fork serves as both a practical trading research platform and a demonstration of modern data architecture patterns applied to financial AI systems.
@@ -0,0 +1,206 @@
# TradingAgents Personal Fork Roadmap

## Overview

This roadmap outlines the technical development path for the personal fork of TradingAgents, focusing on building a robust data infrastructure with PostgreSQL + TimescaleDB + pgvectorscale, implementing RAG-powered agents, and establishing automated data collection pipelines with Dagster.

## Current Status: Phase 1 - News Domain (95% Complete)

The foundation has been established with core domain architecture, a comprehensive testing framework, and the news domain nearly complete.

### Completed Infrastructure
- **Domain Architecture**: Clean separation of news, marketdata, and socialmedia domains
- **Testing Framework**: Pragmatic TDD with 85%+ coverage, pytest-vcr for HTTP mocking
- **Repository Pattern**: Efficient data caching and management system
- **News Domain**: Article scraping, sentiment analysis, and storage (95% complete)
- **Basic Agent System**: Multi-agent trading analysis framework with LangGraph

## Development Phases

### Phase 1: News Domain Completion (Current - 95% Complete)
**Timeline**: 2-3 weeks
**Status**: 🔄 In Progress

#### Remaining Work
- **News Processing Pipeline**: Complete article content processing and deduplication
- **Sentiment Analysis Optimization**: Fine-tune sentiment scoring algorithms
- **News Repository**: Finalize PostgreSQL integration for news storage
- **Testing Coverage**: Achieve 85%+ test coverage for the news domain
- **Performance Optimization**: Optimize news retrieval and search performance

#### Success Criteria
- ✅ All news APIs integrated and tested
- ✅ Sentiment analysis producing consistent scores
- ✅ News data properly stored in PostgreSQL
- ✅ Comprehensive test suite covering edge cases
- ✅ News domain ready for RAG integration

### Phase 2: Market Data Domain + PostgreSQL Migration (Next Priority)
**Timeline**: 4-6 weeks
**Status**: 📋 Planned

#### Core Objectives
- **TimescaleDB Integration**: Implement hypertables for efficient time-series storage
- **Market Data Collection**: Complete price, volume, and technical indicator collection
- **PostgreSQL Migration**: Move all data persistence from file-based storage to PostgreSQL
- **Technical Analysis**: Implement MACD, RSI, and other technical indicators
- **Database Schema**: Design an optimized schema for market data with proper indexing

#### Key Deliverables
- Market data repository with TimescaleDB optimization
- Real-time and historical price data collection
- Technical analysis calculation engine
- Migration scripts for moving existing data
- Performance benchmarks for time-series queries

#### Success Criteria
- ✅ Market data efficiently stored in TimescaleDB hypertables
- ✅ Sub-100ms queries for common market data retrievals
- ✅ All technical indicators calculating accurately
- ✅ Complete migration from file-based storage
- ✅ Market data domain ready for agent integration
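The migration-script deliverable boils down to turning legacy CSV rows into tuples ready for a bulk database insert. A hedged sketch; the column layout (`Date,Open,High,Low,Close,Volume`) is an assumption about the legacy files, not a documented format.

```python
# Sketch: parse legacy CSV OHLC rows into tuples for a bulk executemany() insert.
import csv
import io
from datetime import date

def parse_ohlc_csv(text: str, symbol: str) -> list[tuple]:
    """Yield (symbol, date, open, high, low, close, volume) tuples from CSV text."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append((
            symbol,
            date.fromisoformat(rec["Date"]),
            float(rec["Open"]),
            float(rec["High"]),
            float(rec["Low"]),
            float(rec["Close"]),
            int(rec["Volume"]),
        ))
    return rows

legacy = "Date,Open,High,Low,Close,Volume\n2024-01-02,187.15,188.44,183.89,185.64,82488700\n"
rows = parse_ohlc_csv(legacy, "AAPL")
print(rows[0][0], rows[0][5])  # AAPL 185.64
```

The same tuples feed both the one-off backfill and the daily ingestion path, so the parser only has to be written once.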
### Phase 3: Social Media Domain (Following Phase 2)
**Timeline**: 3-4 weeks
**Status**: 📋 Planned

#### Core Objectives
- **Reddit Integration**: Implement the Reddit API for financial subreddits
- **Twitter/X Integration**: Add social sentiment from Twitter feeds
- **Social Sentiment Analysis**: Aggregate sentiment scoring across platforms
- **Cross-Domain Relations**: Link social sentiment to market data and news
- **pgvectorscale Preparation**: Prepare social data for vector search

#### Key Deliverables
- Reddit and Twitter data collection clients
- Social sentiment aggregation algorithms
- Social media data repository with PostgreSQL storage
- Cross-domain correlation analysis tools
- Foundation for RAG implementation

#### Success Criteria
- ✅ Social media data collected from multiple sources
- ✅ Sentiment scores integrated with market events
- ✅ Cross-domain relationships established in database
- ✅ Social media domain ready for RAG enhancement
- ✅ Three-domain architecture complete

### Phase 4: Dagster Data Collection Orchestration
**Timeline**: 3-4 weeks
**Status**: 📋 Planned

#### Core Objectives
- **Pipeline Architecture**: Design daily/twice-daily data collection workflows
- **Data Quality Monitoring**: Implement validation and gap detection
- **Automated Backfill**: Handle missing data and API failures gracefully
- **Performance Monitoring**: Track pipeline health and data freshness
- **Alerting System**: Notify on pipeline failures or data quality issues

#### Key Deliverables
- Dagster asset definitions for all data domains
- Automated data quality checks and validation
- Gap detection and backfill capabilities
- Monitoring dashboard for pipeline health
- Comprehensive logging and error handling

#### Success Criteria
- ✅ Fully automated data collection running daily
- ✅ Data quality monitoring with automated alerts
- ✅ Zero-downtime pipeline updates and maintenance
- ✅ Historical data gaps automatically detected and filled
- ✅ Pipeline performance metrics tracked and optimized
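The gap-detection check at the heart of the backfill story is simple to state: diff the trading dates we expect against the dates actually stored. A sketch in plain Python; a real run would pull both sets from TimescaleDB and would also consult a market-holiday calendar, which this ignores.

```python
# Sketch: report which expected trading days are missing and need backfill.
from datetime import date, timedelta

def expected_weekdays(start: date, end: date) -> set[date]:
    """All weekdays in [start, end]; market holidays are ignored in this sketch."""
    days = set()
    d = start
    while d <= end:
        if d.weekday() < 5:  # Monday-Friday
            days.add(d)
        d += timedelta(days=1)
    return days

def find_gaps(stored: set[date], start: date, end: date) -> list[date]:
    """Dates a Dagster backfill job should re-request."""
    return sorted(expected_weekdays(start, end) - stored)

stored = {date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 5)}
print(find_gaps(stored, date(2024, 1, 2), date(2024, 1, 5)))  # [datetime.date(2024, 1, 4)]
```

Wiring this into a Dagster asset check turns missing dates into alerts and targeted backfill runs rather than silent holes in the data.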
### Phase 5: RAG Implementation + OpenRouter Migration
**Timeline**: 4-5 weeks
**Status**: 📋 Planned

#### Core Objectives
- **pgvectorscale Integration**: Implement vector storage for historical patterns
- **RAG Agent Enhancement**: Agents use similarity search for context
- **OpenRouter Migration**: Complete migration to the unified LLM provider
- **Historical Context**: Agents reference past decisions and market conditions
- **Pattern Recognition**: Semantic similarity for comparable market scenarios

#### Key Deliverables
- pgvectorscale extension configured and optimized
- Vector embeddings for all historical data
- RAG-enhanced agent decision making
- OpenRouter integration replacing all LLM providers
- Similarity search for historical pattern matching

#### Success Criteria
- ✅ All agents using RAG for contextual decisions
- ✅ Vector search performing sub-50ms similarity queries
- ✅ OpenRouter as sole LLM provider across all agents
- ✅ Agents demonstrating improved decision accuracy
- ✅ Historical pattern matching enhancing trading analysis

## Technical Milestones

### Database Architecture
- **Month 1**: Complete PostgreSQL foundation with news domain
- **Month 2**: TimescaleDB hypertables optimized for market data
- **Month 3**: pgvectorscale configured for RAG implementation
- **Month 4**: Full database optimization and performance tuning

### Agent Capabilities
- **Month 1**: Basic multi-agent framework operational
- **Month 2**: Agents using PostgreSQL for all data access
- **Month 3**: Cross-domain agent collaboration established
- **Month 4**: RAG-powered agents with historical context

### Data Pipeline Maturity
- **Month 1**: Manual data collection with basic automation
- **Month 2**: Automated collection for market data
- **Month 3**: Full three-domain automated collection
- **Month 4**: Production-grade pipeline with monitoring and alerting

## Success Metrics

### Technical Excellence
- **Test Coverage**: Maintain 85%+ across all domains
- **Query Performance**: < 100ms for common database operations
- **Pipeline Reliability**: 99%+ uptime for data collection
- **Data Quality**: < 0.1% missing data points across all domains

### Feature Completeness
- **Domain Coverage**: 100% implementation across news, marketdata, socialmedia
- **Agent Capabilities**: RAG-enhanced decision making operational
- **Data Infrastructure**: Complete PostgreSQL + TimescaleDB + pgvectorscale stack
- **Automation**: Fully automated data collection and processing

### Development Velocity
- **Code Quality**: Consistent formatting, type checking, and documentation
- **Testing Strategy**: Comprehensive test suite with domain-specific approaches
- **Architecture Consistency**: Clean domain separation and layered architecture
- **Performance Optimization**: Regular profiling and optimization cycles

## Risk Management

### Technical Risks
- **Database Performance**: Mitigate with proper indexing and query optimization
- **API Rate Limits**: Implement intelligent backoff and caching strategies
- **Data Quality**: Establish comprehensive validation and monitoring
- **Vector Search Performance**: Optimize pgvectorscale configuration and queries
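The "intelligent backoff" mitigation is worth pinning down. A minimal exponential-backoff sketch: retry counts and base delay are illustrative defaults, not tuned values, and a production version would also honor any `Retry-After` header the API returns.

```python
# Sketch: exponential backoff for rate-limited API calls (1s, 2s, 4s, ...).
import time

class RateLimitError(Exception):
    """Raised by a client when the API answers with a rate-limit response."""

def with_backoff(call, retries: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Run call(); on RateLimitError wait exponentially longer, then retry."""
    for attempt in range(retries):
        try:
            return call()
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            sleep(base_delay * (2 ** attempt))

# Demo: a call that is rate-limited twice, then succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimitError
    return "ok"

print(with_backoff(flaky, sleep=lambda s: None))  # ok
```

Injecting `sleep` keeps the helper testable without real delays, which matters for keeping the suite fast.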
### Development Risks
- **Scope Creep**: Maintain focus on sequential domain completion
- **Technical Debt**: Regular refactoring and code quality maintenance
- **Testing Coverage**: Continuous integration with coverage enforcement
- **Documentation**: Maintain comprehensive documentation throughout development

## Long-Term Vision (6+ Months)

### Advanced Capabilities
- **Strategy Backtesting**: Historical strategy validation with complete data
- **Real-Time Analysis**: Live market analysis with sub-second agent responses
- **Advanced RAG**: Multi-modal RAG with charts, documents, and audio data
- **Performance Analytics**: Comprehensive analysis of agent decision accuracy

### Research Applications
- **Academic Research**: Platform for publishing trading AI research
- **Strategy Development**: Complete environment for developing proprietary strategies
- **Data Science**: Advanced analytics and machine learning on financial data
- **Educational Use**: Comprehensive learning platform for financial AI

This roadmap prioritizes building a solid data foundation before enhancing agent capabilities, ensuring each phase delivers measurable value while maintaining high code quality and comprehensive testing.
@@ -0,0 +1,66 @@
{
  "product_vision": "Multi-agent LLM financial trading framework with PostgreSQL + TimescaleDB + pgvectorscale architecture for research-based market analysis and trading decisions",
  "existing_features": [
    "marketdata_domain_85_complete_file_based",
    "yfinance_client_fully_implemented",
    "finnhub_client_with_insider_data",
    "talib_technical_analysis_integration",
    "postgresql_timescaledb_foundation",
    "agent_toolkit_rag_ready",
    "news_domain_postgresql_patterns",
    "database_manager_async_operations"
  ],
  "architecture": {
    "layer_pattern": "Router → Service → Repository → Entity → Database",
    "database": "PostgreSQL + TimescaleDB + pgvectorscale with asyncpg driver",
    "llm_provider": "OpenRouter unified interface",
    "agent_orchestration": "LangGraph workflows with RAG-enhanced AgentToolkit",
    "data_pipeline": "Dagster planned for daily market data collection",
    "domain_structure": "news (95% PostgreSQL), marketdata (85% file-based), socialmedia (planned)",
    "testing_strategy": "Pragmatic TDD: services (mocked), repositories (real PostgreSQL), clients (pytest-vcr)"
  },
  "marketdata_implementation_status": {
    "current_components": {
      "MarketDataService": "Technical analysis with 20 TA-Lib indicators, trading style presets",
      "MarketDataRepository": "CSV-based storage - NEEDS PostgreSQL migration",
      "YFinanceClient": "Historical OHLC, company info, financials - fully implemented",
      "FinnhubClient": "Insider transactions, sentiment, company profiles - fully implemented",
      "FundamentalDataService": "Balance sheet, income statement, cash flow analysis",
      "InsiderDataService": "SEC insider transaction and sentiment analysis"
    },
    "current_limitations": {
      "storage": "CSV files in ./data/market_data/ - not scalable",
      "query_performance": "File-based lookups instead of indexed database queries",
      "concurrency": "No concurrent access support",
      "vector_embeddings": "No RAG capabilities for historical pattern matching"
    },
    "migration_needed": [
      "PostgreSQL entities for OHLC, fundamental, and insider data",
      "TimescaleDB hypertables for time-series optimization",
      "Vector embeddings for technical analysis RAG",
      "Async repository operations matching news domain patterns",
      "Batch data ingestion for daily collection"
    ]
  },
  "reference_patterns": {
    "news_domain_success": {
      "NewsRepository": "Async PostgreSQL with vector embeddings and batch operations",
      "NewsArticleEntity": "SQLAlchemy model with UUID v7, TimescaleDB optimization",
      "database_patterns": "Connection pooling, async sessions, proper error handling",
      "testing_approach": "Real PostgreSQL for repositories, pytest-vcr for API clients"
    },
    "agent_integration": "AgentToolkit expects PostgreSQL-backed services for RAG capabilities"
  },
  "technical_dependencies": {
    "external": [
      "yfinance for daily OHLC data (already implemented)",
      "FinnHub API for insider and fundamental data (already implemented)",
      "PostgreSQL with TimescaleDB and pgvectorscale extensions (ready)"
    ],
    "internal": [
      "DatabaseManager for async PostgreSQL connections (established)",
      "News domain PostgreSQL patterns for consistency (available)",
      "AgentToolkit integration for RAG-powered market analysis (ready)"
    ]
  }
}
@@ -0,0 +1,52 @@
{
  "requirements": {
    "entities": {
      "MarketDataEntity": "SQLAlchemy entity for OHLC price data with TimescaleDB optimization and vector embeddings",
      "FundamentalDataEntity": "Financial statement data (balance sheet, income statement, cash flow) with PostgreSQL storage",
      "InsiderDataEntity": "SEC insider transaction records with sentiment analysis and PostgreSQL persistence",
      "TechnicalIndicatorEntity": "Calculated TA-Lib indicator values with vector embeddings for RAG analysis"
    },
    "data_persistence": {
      "migration_scope": "CSV file storage to PostgreSQL + TimescaleDB + pgvectorscale",
      "current_storage": "./data/market_data/ CSV files with 85% complete functionality",
      "target_storage": "PostgreSQL with TimescaleDB hypertables and pgvectorscale vector storage",
      "performance_goal": "10x improvement with sub-100ms query times",
      "data_volume": "10 years OHLC, 5 years fundamentals, 3 years insider data for 500+ tickers"
    },
    "api_needed": {
      "preservation_requirement": "100% API compatibility with existing services",
      "existing_apis": [
        "MarketDataService with 20 TA-Lib technical indicators and trading style presets",
        "FundamentalDataService for balance sheet, income statement, cash flow analysis",
        "InsiderDataService for SEC transaction data and sentiment scoring"
      ],
      "external_apis": [
        "YFinanceClient (fully implemented) for daily OHLC data",
        "FinnhubClient (fully implemented) for insider transactions and fundamental data"
      ]
    },
    "components": {
      "repository_migration": "MarketDataRepository from CSV to async PostgreSQL operations",
      "entity_models": "SQLAlchemy entities with TimescaleDB and pgvectorscale integration",
      "service_preservation": "API-compatible service layer with PostgreSQL backend",
      "vector_embeddings": "RAG enhancement for historical pattern matching",
      "dagster_integration": "Daily data collection pipeline automation"
    },
    "domains": {
      "primary": "MarketData (PostgreSQL migration from 85% complete CSV system)",
      "integration": "Follows news domain PostgreSQL patterns for architectural consistency"
    },
    "business_rules": [
      "Preserve 100% API compatibility with existing MarketDataService, FundamentalDataService, InsiderDataService",
      "Daily automated collection from yfinance (OHLC) and FinnHub (insider + fundamentals)",
      "TimescaleDB hypertables for market_data, fundamental_data, insider_data tables",
      "Vector embeddings for technical analysis patterns using pgvectorscale",
      "Sub-100ms query performance for common market data operations",
      "Sub-200ms RAG queries for historical pattern matching",
      "Data retention: 10 years OHLC, 5 years fundamentals, 3 years insider data",
      "FinnHub API rate limiting compliance with backoff strategies",
      "Comprehensive audit logging and ACID transaction support",
      "Concurrent agent access with PostgreSQL async operations"
    ]
  }
}
File diff suppressed because it is too large
@@ -0,0 +1,6 @@
{
  "raw_user_story": "As a dagster job and AI Agent I want to collect daily OHLC data from yfinance for all my tickers, insider data from FinnHub, and fundamental data from FinnHub so that agents have comprehensive market data for trading decisions",
  "raw_criteria": "Daily OHLC data from yfinance for configured tickers, insider trading data from FinnHub API, fundamental data from FinnHub API, all data stored in PostgreSQL with TimescaleDB optimization, agents can query market data for analysis",
  "raw_rules": "Daily automated collection, FinnHub API rate limiting compliance, data quality validation, TimescaleDB for time-series optimization",
  "raw_scope": "Included: Daily OHLC from yfinance, insider data from FinnHub, fundamental data from FinnHub, PostgreSQL storage, agent integration. Excluded: Real-time data streaming, other data providers beyond yfinance and FinnHub."
}
@@ -0,0 +1,98 @@
# MarketData Domain - PostgreSQL Migration (Lite Spec)

## Migration Overview

**Project**: 85% complete MarketData domain → PostgreSQL + TimescaleDB + pgvectorscale
**Objective**: 10x performance + RAG capabilities while preserving 100% API compatibility
**Pattern**: Follow the news domain PostgreSQL implementation for architectural consistency

## Key Requirements

### Performance Targets
- Sub-100ms market data queries (10x improvement over CSV)
- Sub-200ms RAG vector similarity search
- Support 500+ tickers with concurrent agent access

### API Preservation (Critical)
- **MarketDataService**: All existing methods preserved
- **FundamentalDataService**: Complete compatibility maintained
- **InsiderDataService**: Zero breaking changes
- **20 TA-Lib indicators**: Full functionality preserved

### Data Sources & Collection
- **yfinance**: Daily OHLC data via Dagster pipelines
- **FinnHub**: Insider transactions + fundamental data
- **TimescaleDB hypertables**: market_data, fundamental_data, insider_data
- **Vector storage**: pgvectorscale for RAG pattern matching

## Technical Implementation

### Database Schema (TimescaleDB)
```sql
-- Hypertables for time-series optimization (column sketch, not final DDL)
market_data (symbol, date, ohlc, volume)                        -- 10-year retention
fundamental_data (symbol, report_date, metrics)                 -- 5-year retention
insider_data (symbol, transaction_date, person, shares)         -- 3-year retention
technical_indicators (symbol, date, values, pattern_embedding)  -- RAG support
```
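The retention windows noted in the schema sketch map directly onto TimescaleDB retention policies. A hedged sketch, assuming the table names above and that the tables are already hypertables:

```sql
-- Sketch: drop chunks older than the spec's retention windows automatically.
SELECT add_retention_policy('market_data',      INTERVAL '10 years', if_not_exists => TRUE);
SELECT add_retention_policy('fundamental_data', INTERVAL '5 years',  if_not_exists => TRUE);
SELECT add_retention_policy('insider_data',     INTERVAL '3 years',  if_not_exists => TRUE);
```

Retention then becomes a database concern enforced per chunk, with no application-level cleanup jobs to maintain.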
### Entity Models
- **MarketDataEntity**: OHLC + validation + database conversion
- **FundamentalDataEntity**: Financial statement data
- **InsiderDataEntity**: SEC transaction records
- **TechnicalIndicatorEntity**: Calculated values + vector embeddings

### Repository Pattern (Async PostgreSQL)
```python
class MarketDataRepository:
    async def get_ohlc_data(self, symbol, start, end) -> List[MarketDataEntity]: ...
    async def bulk_upsert_market_data(self, entities) -> int: ...       # Dagster ingestion
    async def find_similar_patterns(self, embedding, limit) -> List[Dict]: ...  # RAG
```
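Behind `bulk_upsert_market_data` sits a single upsert statement, which is what makes daily ingestion idempotent: re-running a Dagster backfill overwrites rows instead of duplicating them. Column names here are assumptions for the sketch, not the final schema.

```sql
-- Sketch: idempotent bulk upsert for daily OHLC ingestion (asyncpg placeholders).
INSERT INTO market_data (symbol, ts, open, high, low, close, volume)
VALUES ($1, $2, $3, $4, $5, $6, $7)
ON CONFLICT (symbol, ts) DO UPDATE
SET open   = EXCLUDED.open,
    high   = EXCLUDED.high,
    low    = EXCLUDED.low,
    close  = EXCLUDED.close,
    volume = EXCLUDED.volume;
```

Executed via `executemany` over the day's tuples, this gives one round-trip pattern for both fresh loads and corrections.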
### Service Layer (100% Compatible)
```python
class MarketDataService:
    async def get_stock_data(self, symbol, period) -> pd.DataFrame: ...             # Preserved API
    async def calculate_technical_indicators(self, symbol, indicators) -> Dict: ...  # 20 TA-Lib
    async def get_trading_style_preset(self, style) -> Dict: ...                    # Existing presets
```
## Migration Strategy

### Phase 1: Entities & Schema
1. Create SQLAlchemy entities following news domain patterns
2. Set up TimescaleDB hypertables with proper indexing
3. Configure pgvectorscale for vector embeddings

### Phase 2: Repository Migration
1. Implement async PostgreSQL repositories (mirror the NewsRepository pattern)
2. Create data migration scripts (CSV → PostgreSQL)
3. Add vector embedding generation for RAG

### Phase 3: Service Preservation
1. Update services to use PostgreSQL repositories
2. Maintain exact API signatures and return types
3. Add RAG-enhanced pattern analysis capabilities

### Phase 4: Integration & Testing
1. Real PostgreSQL tests for repositories
2. Preserve pytest-vcr for YFinanceClient/FinnhubClient
3. Validate 100% API compatibility with existing agents
## Ready Dependencies
- YFinanceClient + FinnhubClient (fully implemented)
- PostgreSQL + TimescaleDB + pgvectorscale (established)
- DatabaseManager async operations (available)
- News domain patterns for consistency (reference implementation)

## Success Metrics
- **Performance**: 10x query improvement, sub-100ms operations
- **Compatibility**: Zero API breaking changes, seamless agent migration
- **Scalability**: 500+ concurrent tickers, efficient bulk ingestion
- **Quality**: 85%+ test coverage, comprehensive validation

## Implementation Approach
**Follow news domain patterns** → Create entities → Migrate repositories → Preserve service APIs → Enhance with vector RAG → Integrate Dagster pipelines

This migration provides the high-performance, RAG-enabled market data foundation essential for sophisticated multi-agent trading analysis while maintaining complete backward compatibility.
@@ -0,0 +1,95 @@
{
  "feature": "MarketData",
  "user_story": "As a Dagster pipeline and AI Agent, I want to collect daily OHLC data from yfinance, insider data from FinnHub, and fundamental data from FinnHub with PostgreSQL + TimescaleDB storage, so that agents have high-performance, RAG-enhanced market data access for comprehensive trading analysis",
  "acceptance_criteria": [
    "GIVEN the MarketData domain migration WHEN PostgreSQL + TimescaleDB integration is complete THEN all existing MarketDataService APIs remain 100% compatible with 10x performance improvement",
    "GIVEN daily market data collection WHEN Dagster pipelines execute THEN OHLC data from yfinance and insider/fundamental data from FinnHub are stored in TimescaleDB hypertables",
    "GIVEN historical market data queries WHEN AI agents request technical analysis THEN responses are delivered within 100ms using TimescaleDB time-series optimization",
    "GIVEN technical analysis requests WHEN agents query indicators THEN all 20 existing TA-Lib indicators are preserved with PostgreSQL-backed data access",
    "GIVEN RAG-powered analysis WHEN agents search for historical patterns THEN vector similarity search using pgvectorscale returns relevant market conditions within 200ms",
    "GIVEN concurrent agent operations WHEN multiple agents access market data THEN PostgreSQL async operations support concurrent reads without file system limitations",
    "GIVEN data quality requirements WHEN market data is collected THEN comprehensive validation, audit trails, and error handling maintain data integrity with PostgreSQL ACID transactions"
  ],
  "business_rules": [
    "Preserve 100% API compatibility with existing MarketDataService for seamless migration",
    "Daily automated collection from yfinance (OHLC) and FinnHub (insider + fundamentals) via Dagster pipelines",
    "FinnHub API rate limiting compliance with proper backoff strategies",
    "TimescaleDB hypertables for market_data, fundamental_data, and insider_data tables",
    "Vector embeddings generation for technical analysis patterns using pgvectorscale",
    "Data retention policy: 10 years for OHLC, 5 years for fundamentals, 3 years for insider data",
    "Sub-100ms query performance for common market data operations",
    "Comprehensive audit logging for all data collection and agent queries",
    "Graceful degradation when external APIs are unavailable"
  ],
  "scope": {
    "included": [
      "PostgreSQL + TimescaleDB + pgvectorscale migration from CSV storage",
      "Preserve all existing YFinanceClient and FinnhubClient integrations",
      "Maintain complete MarketDataService, FundamentalDataService, InsiderDataService APIs",
      "Async PostgreSQL repository operations following news domain patterns",
      "Vector embeddings for RAG-powered historical pattern matching",
      "TimescaleDB hypertables for time-series optimization",
      "Batch data ingestion pipeline for daily Dagster collection",
      "Comprehensive testing with real PostgreSQL database",
      "Agent integration enhancement with RAG capabilities"
    ],
    "excluded": [
      "Real-time data streaming (daily batch collection only)",
      "Additional data providers beyond yfinance and FinnHub",
      "New technical indicators beyond existing 20 TA-Lib indicators",
      "Custom financial calculations beyond current scope",
      "Multi-database support (PostgreSQL only)",
      "GraphQL or REST API endpoints (agent integration only)"
    ]
  },
  "current_implementation_status": "85% complete with file-based CSV storage - migration project to PostgreSQL",
  "existing_components": [
    "MarketDataService with 20 TA-Lib technical indicators and trading style presets",
    "YFinanceClient fully implemented for OHLC, company info, and financials",
    "FinnhubClient with structured models for insider transactions and sentiment",
    "FundamentalDataService for balance sheet, income statement, cash flow analysis",
    "InsiderDataService for SEC transaction data and sentiment scoring",
    "MarketDataRepository with CSV storage - MIGRATION TARGET",
    "AgentToolkit integration ready for PostgreSQL-backed RAG enhancement",
    "Comprehensive testing suite with pytest-vcr for API clients"
  ],
"migration_components": [
|
||||
"MarketDataEntity SQLAlchemy models for PostgreSQL storage",
|
||||
"FundamentalDataEntity for financial statement data",
|
||||
"InsiderDataEntity for SEC transaction records",
|
||||
"TechnicalIndicatorEntity for calculated indicator values",
|
||||
"Async PostgreSQL repository operations matching news domain patterns",
|
||||
"TimescaleDB hypertable setup for time-series optimization",
|
||||
"Vector embedding generation for technical analysis RAG",
|
||||
"Data migration scripts from CSV to PostgreSQL"
|
||||
],
|
||||
"aligns_with": "Multi-agent trading framework vision - provides high-performance market data foundation for sophisticated agent analysis with RAG-powered historical context",
|
||||
"dependencies": [
|
||||
"Existing YFinanceClient and FinnhubClient implementations (ready)",
|
||||
"PostgreSQL + TimescaleDB + pgvectorscale database infrastructure (established)",
|
||||
"News domain PostgreSQL patterns for migration consistency (available)",
|
||||
"DatabaseManager async operations and connection management (ready)",
|
||||
"OpenRouter configuration for vector embeddings generation (available)",
|
||||
"Dagster orchestration framework for daily data collection (planned)"
|
||||
],
|
||||
"technical_details": {
|
||||
"architecture_pattern": "Router → Service → Repository → Entity → Database (preserving existing service interfaces)",
|
||||
"database_integration": "PostgreSQL + TimescaleDB + pgvectorscale migration from CSV storage",
|
||||
"performance_optimization": "TimescaleDB hypertables, proper indexing, connection pooling, async operations",
|
||||
"vector_storage": "pgvectorscale for RAG-powered historical pattern matching in technical analysis",
|
||||
"api_preservation": "100% compatibility with existing MarketDataService, FundamentalDataService, InsiderDataService APIs",
|
||||
"testing_strategy": "Real PostgreSQL for repository tests, preserved pytest-vcr for API clients, service compatibility testing"
|
||||
},
|
||||
"implementation_approach": "PostgreSQL migration project following news domain patterns: create entities → migrate repositories → preserve service APIs → enhance with vector RAG → integrate Dagster pipelines",
|
||||
"reference_implementations": {
|
||||
"news_domain_patterns": "Follow NewsRepository, NewsArticleEntity, DatabaseManager async patterns for consistency",
|
||||
"database_migration": "Use established TimescaleDB hypertable and pgvectorscale vector storage patterns",
|
||||
"testing_approach": "Apply news domain testing strategy: real PostgreSQL for repositories, VCR for API clients"
|
||||
},
|
||||
"success_criteria": {
|
||||
"performance": "10x query performance improvement, sub-100ms market data operations, sub-200ms RAG queries",
|
||||
"compatibility": "100% existing API preservation, seamless migration without agent disruption",
|
||||
"scalability": "Support 500+ tickers with concurrent agent access, efficient bulk data ingestion",
|
||||
"quality": "85%+ test coverage maintained, comprehensive data validation and audit trails"
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,352 @@
# MarketData Domain - PostgreSQL Migration Specification

## Feature Overview

**Feature**: MarketData Domain PostgreSQL Migration
**Status**: Migration project (85% complete → PostgreSQL integration)
**Priority**: High (foundational infrastructure for AI agents)

This specification defines the migration of the MarketData domain from CSV-based storage to PostgreSQL + TimescaleDB + pgvectorscale integration, while preserving 100% API compatibility and delivering 10x performance improvements for AI agent operations.

## User Stories

### Primary User Story
> As a Dagster pipeline and AI Agent, I want to collect daily OHLC data from yfinance, insider data from FinnHub, and fundamental data from FinnHub with PostgreSQL + TimescaleDB storage, so that agents have high-performance, RAG-enhanced market data access for comprehensive trading analysis.

### Supporting User Stories

**Agent Performance**
- As an AI Agent, I want market data queries to complete in under 100ms, so that real-time trading analysis is efficient
- As a Technical Analyst Agent, I want vector similarity search for historical patterns, so that pattern-based trading decisions are context-aware

**Data Pipeline Reliability**
- As a Dagster pipeline, I want atomic data ingestion with PostgreSQL ACID transactions, so that data integrity is guaranteed during bulk operations
- As a Risk Management Agent, I want comprehensive audit trails for all market data access, so that trading decisions are fully traceable

## Acceptance Criteria

### Migration Compatibility
- **AC1**: GIVEN the MarketData domain migration WHEN PostgreSQL + TimescaleDB integration is complete THEN all existing MarketDataService APIs remain 100% compatible with 10x performance improvement

### Data Collection Pipeline
- **AC2**: GIVEN daily market data collection WHEN Dagster pipelines execute THEN OHLC data from yfinance and insider/fundamental data from FinnHub are stored in TimescaleDB hypertables

### Performance Requirements
- **AC3**: GIVEN historical market data queries WHEN AI agents request technical analysis THEN responses are delivered within 100ms using TimescaleDB time-series optimization
- **AC4**: GIVEN technical analysis requests WHEN agents query indicators THEN all 20 existing TA-Lib indicators are preserved with PostgreSQL-backed data access

### RAG Integration
- **AC5**: GIVEN RAG-powered analysis WHEN agents search for historical patterns THEN vector similarity search using pgvectorscale returns relevant market conditions within 200ms

### Scalability
- **AC6**: GIVEN concurrent agent operations WHEN multiple agents access market data THEN PostgreSQL async operations support concurrent reads without file system limitations

### Data Quality
- **AC7**: GIVEN data quality requirements WHEN market data is collected THEN comprehensive validation, audit trails, and error handling maintain data integrity with PostgreSQL ACID transactions

## Business Rules

### API Preservation
- **BR1**: Preserve 100% API compatibility with existing MarketDataService for seamless migration
- **BR2**: Maintain all existing method signatures in FundamentalDataService and InsiderDataService

### Data Collection Standards
- **BR3**: Daily automated collection from yfinance (OHLC) and FinnHub (insider + fundamentals) via Dagster pipelines
- **BR4**: FinnHub API rate limiting compliance with proper backoff strategies
- **BR5**: Graceful degradation when external APIs are unavailable

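The backoff strategy BR4 calls for could look like the following minimal sketch. `fetch_with_backoff` and `RateLimitError` are illustrative names, not part of the existing FinnhubClient; the retry budget and delays are assumptions to be tuned against FinnHub's actual limits.

```python
import random
import time


class RateLimitError(Exception):
    """Raised when the API responds with HTTP 429 (hypothetical error type)."""


def fetch_with_backoff(fetch, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a zero-argument callable with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # retry budget exhausted; surface the rate-limit error
            # Exponential backoff (1s, 2s, 4s, ...) with jitter so retries
            # across many tickers do not synchronize
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

In the Dagster collection jobs, each FinnHub call would be wrapped this way so a burst of 429 responses slows the pipeline down rather than failing it.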
### Database Architecture
- **BR6**: TimescaleDB hypertables for market_data, fundamental_data, and insider_data tables
- **BR7**: Vector embeddings generation for technical analysis patterns using pgvectorscale

### Performance Standards
- **BR8**: Sub-100ms query performance for common market data operations
- **BR9**: Data retention policy: 10 years for OHLC, 5 years for fundamentals, 3 years for insider data

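With TimescaleDB, BR9's retention windows can be enforced by the database itself rather than by application code. A hedged sketch, assuming TimescaleDB 2.x retention policies and the table names defined later in this document:

```sql
-- Hypothetical retention policies implementing BR9; chunks older than the
-- interval are dropped automatically by TimescaleDB's background workers.
SELECT add_retention_policy('market_data', INTERVAL '10 years');
SELECT add_retention_policy('fundamental_data', INTERVAL '5 years');
SELECT add_retention_policy('insider_data', INTERVAL '3 years');
```
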
### Audit and Compliance
- **BR10**: Comprehensive audit logging for all data collection and agent queries

## Technical Implementation Details

### Architecture Pattern
**Router → Service → Repository → Entity → Database**

The migration preserves the existing service interfaces while upgrading the underlying data persistence layer.

### Database Schema Design

#### TimescaleDB Hypertables

```sql
-- Market Data (OHLC)
-- Note: a hypertable's unique constraints must include the partition column,
-- so the primary keys below are composite (id, time column).
CREATE TABLE market_data (
    id SERIAL,
    symbol VARCHAR(10) NOT NULL,
    date TIMESTAMPTZ NOT NULL,
    open DECIMAL(12,4),
    high DECIMAL(12,4),
    low DECIMAL(12,4),
    close DECIMAL(12,4),
    adj_close DECIMAL(12,4),
    volume BIGINT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (id, date)
);

SELECT create_hypertable('market_data', 'date', chunk_time_interval => INTERVAL '1 month');

-- Fundamental Data
CREATE TABLE fundamental_data (
    id SERIAL,
    symbol VARCHAR(10) NOT NULL,
    report_date TIMESTAMPTZ NOT NULL,
    period_type VARCHAR(20), -- annual, quarterly
    metric_name VARCHAR(100),
    metric_value DECIMAL(20,4),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (id, report_date)
);

SELECT create_hypertable('fundamental_data', 'report_date', chunk_time_interval => INTERVAL '3 months');

-- Insider Data
CREATE TABLE insider_data (
    id SERIAL,
    symbol VARCHAR(10) NOT NULL,
    transaction_date TIMESTAMPTZ NOT NULL,
    person_name VARCHAR(200),
    position VARCHAR(100),
    transaction_type VARCHAR(50),
    shares BIGINT,
    price DECIMAL(12,4),
    value DECIMAL(20,4),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (id, transaction_date)
);

SELECT create_hypertable('insider_data', 'transaction_date', chunk_time_interval => INTERVAL '1 month');
```

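The time-series optimization referenced in AC3 typically surfaces in queries through TimescaleDB's `time_bucket` and the `first`/`last` aggregates. A hedged example against the `market_data` table above, rolling daily bars up into weekly OHLC (the ticker is illustrative):

```sql
-- Sketch: weekly OHLC rollup over the market_data hypertable
SELECT time_bucket('7 days', date) AS week,
       first(open, date)           AS open,
       max(high)                   AS high,
       min(low)                    AS low,
       last(close, date)           AS close,
       sum(volume)                 AS volume
FROM market_data
WHERE symbol = 'AAPL'
  AND date >= NOW() - INTERVAL '1 year'
GROUP BY week
ORDER BY week;
```
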
#### Vector Storage for RAG

```sql
-- Technical Indicators with Vector Embeddings
CREATE TABLE technical_indicators (
    id SERIAL PRIMARY KEY,
    symbol VARCHAR(10) NOT NULL,
    date TIMESTAMPTZ NOT NULL,
    indicator_name VARCHAR(50),
    indicator_value DECIMAL(12,6),
    pattern_embedding vector(384), -- OpenRouter embeddings
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON technical_indicators USING hnsw (pattern_embedding vector_cosine_ops);
```

### SQLAlchemy Entity Models

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional

import pandas as pd


@dataclass
class MarketDataEntity:
    symbol: str
    date: datetime
    open: Optional[Decimal] = None
    high: Optional[Decimal] = None
    low: Optional[Decimal] = None
    close: Optional[Decimal] = None
    adj_close: Optional[Decimal] = None
    volume: Optional[int] = None
    id: Optional[int] = None
    created_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None

    @classmethod
    def from_yfinance_data(cls, symbol: str, row: pd.Series) -> "MarketDataEntity":
        """Convert yfinance data to entity"""

    @classmethod
    def from_database_record(cls, record: dict) -> "MarketDataEntity":
        """Create entity from a database row (used by the repository)"""

    def to_database_record(self) -> dict:
        """Convert entity to database record"""

    def validate(self) -> None:
        """Validate entity data integrity"""
```

### Repository Migration

```python
class MarketDataRepository:
    """PostgreSQL + TimescaleDB repository with async operations"""

    def __init__(self, database_manager: DatabaseManager):
        self.db = database_manager

    async def get_ohlc_data(
        self,
        symbol: str,
        start_date: datetime,
        end_date: datetime,
    ) -> List[MarketDataEntity]:
        """Retrieve OHLC data with TimescaleDB optimization"""
        query = """
            SELECT * FROM market_data
            WHERE symbol = $1 AND date BETWEEN $2 AND $3
            ORDER BY date DESC
        """
        rows = await self.db.fetch(query, symbol, start_date, end_date)
        return [MarketDataEntity.from_database_record(row) for row in rows]

    async def bulk_upsert_market_data(
        self,
        entities: List[MarketDataEntity],
    ) -> int:
        """Atomic bulk upsert for Dagster pipelines"""

    async def find_similar_patterns(
        self,
        pattern_embedding: List[float],
        limit: int = 10,
    ) -> List[Dict]:
        """RAG-powered pattern matching using pgvectorscale"""
        query = """
            SELECT symbol, date, indicator_name, indicator_value,
                   pattern_embedding <=> $1 AS similarity
            FROM technical_indicators
            ORDER BY pattern_embedding <=> $1
            LIMIT $2
        """
        return await self.db.fetch(query, pattern_embedding, limit)
```

### Service Compatibility Layer

```python
class MarketDataService:
    """Preserved API with PostgreSQL backend"""

    def __init__(self, repository: MarketDataRepository, yfinance_client: YFinanceClient):
        self.repository = repository
        self.yfinance_client = yfinance_client

    async def get_stock_data(self, symbol: str, period: str = "1y") -> pd.DataFrame:
        """100% compatible with existing API signature"""
        # Implementation using PostgreSQL repository

    async def calculate_technical_indicators(
        self,
        symbol: str,
        indicators: List[str],
    ) -> Dict[str, np.ndarray]:
        """Preserve all 20 TA-Lib indicators with PostgreSQL data"""

    async def get_trading_style_preset(self, style: str) -> Dict:
        """Preserved trading style presets with enhanced performance"""
```

### Vector RAG Enhancement

```python
class MarketDataRAGService:
    """RAG-powered market analysis enhancement"""

    async def find_historical_patterns(
        self,
        current_indicators: Dict[str, float],
        lookback_days: int = 30,
    ) -> List[Dict]:
        """Vector similarity search for historical patterns"""

    async def generate_pattern_embedding(
        self,
        indicator_values: Dict[str, float],
    ) -> List[float]:
        """Generate embeddings using OpenRouter for pattern matching"""
```

## Migration Components

### Phase 1: Database Schema & Entities
1. **SQLAlchemy Entity Models**
   - MarketDataEntity for OHLC data
   - FundamentalDataEntity for financial statements
   - InsiderDataEntity for SEC transactions
   - TechnicalIndicatorEntity for calculated values

2. **TimescaleDB Setup**
   - Hypertable creation for time-series optimization
   - Proper indexing strategy
   - Vector extension configuration

### Phase 2: Repository Migration
1. **Async PostgreSQL Operations**
   - Follow news domain patterns for consistency
   - Connection pooling and transaction management
   - Error handling and retry logic

2. **Data Migration Scripts**
   - CSV to PostgreSQL data transfer
   - Data validation and integrity checks
   - Performance optimization

### Phase 3: Service Preservation
1. **API Compatibility**
   - Maintain all existing method signatures
   - Preserve return types and data formats
   - Performance optimization through PostgreSQL

2. **Vector RAG Integration**
   - Pattern embedding generation
   - Similarity search capabilities
   - Historical context enhancement

### Phase 4: Testing & Integration
1. **Comprehensive Testing**
   - Real PostgreSQL database for repository tests
   - Preserved pytest-vcr for API clients
   - Service compatibility validation

2. **Agent Integration**
   - AgentToolkit RAG capabilities
   - Performance benchmarking
   - Concurrent access testing

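The CSV-to-PostgreSQL transfer in Phase 2 could start from a small transform like the one below. This is a sketch under assumptions: the CSV headers follow yfinance's export format (`Date`, `Open`, …, `Volume`), and the target column order matches the `market_data` schema defined earlier; the `copy_records_to_table` call in the comment is asyncpg's bulk-load API.

```python
import csv
from datetime import datetime, timezone
from decimal import Decimal
from io import StringIO

# Column order of the market_data table defined in this spec
COLUMNS = ("symbol", "date", "open", "high", "low", "close", "adj_close", "volume")


def csv_rows_to_records(symbol: str, csv_text: str) -> list:
    """Convert one ticker's CSV export into records matching market_data columns."""
    records = []
    for row in csv.DictReader(StringIO(csv_text)):
        records.append((
            symbol,
            # Treat exported dates as UTC midnight timestamps
            datetime.fromisoformat(row["Date"]).replace(tzinfo=timezone.utc),
            *(Decimal(row[k]) for k in ("Open", "High", "Low", "Close", "Adj Close")),
            int(row["Volume"]),
        ))
    return records


# With asyncpg, the records could then be bulk-loaded in one transaction:
#   await conn.copy_records_to_table("market_data", records=records, columns=COLUMNS)
```

A validation pass (row counts, NaN checks) between the transform and the load would satisfy the "data validation and integrity checks" step above.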
## Dependencies

### Ready Dependencies
- **YFinanceClient and FinnhubClient**: Fully implemented and tested
- **PostgreSQL + TimescaleDB + pgvectorscale**: Database infrastructure established
- **News domain PostgreSQL patterns**: Migration templates available
- **DatabaseManager**: Async operations and connection management ready
- **OpenRouter configuration**: Vector embeddings generation available

### Planned Dependencies
- **Dagster orchestration**: Framework for daily data collection pipelines

## Success Criteria

### Performance Metrics
- **10x query performance improvement** over CSV-based storage
- **Sub-100ms market data operations** for common agent queries
- **Sub-200ms RAG queries** for vector similarity search
- **Support for 500+ tickers** with concurrent agent access

### Compatibility Standards
- **100% existing API preservation** without breaking changes
- **Seamless migration** without agent disruption
- **Efficient bulk data ingestion** for Dagster pipelines

### Quality Assurance
- **85%+ test coverage maintained** across all components
- **Comprehensive data validation** and audit trails
- **PostgreSQL ACID transactions** for data integrity

## Architecture Alignment

This migration aligns with the multi-agent trading framework vision by providing:

1. **High-performance market data foundation** for sophisticated agent analysis
2. **RAG-powered historical context** for pattern-based trading decisions
3. **Scalable concurrent access** supporting multiple agents simultaneously
4. **Comprehensive audit trails** for regulatory compliance and risk management
5. **Time-series optimization** for efficient technical analysis operations

The migration follows established news domain patterns to ensure architectural consistency across the entire TradingAgents framework.

@@ -0,0 +1,47 @@
{
  "product_vision": "Multi-agent LLM financial trading framework that mirrors real-world trading firm dynamics for research-based market analysis and trading decisions",
  "existing_features": [
    "news_domain_95_complete",
    "google_news_client",
    "article_scraper_client",
    "news_repository_with_embeddings",
    "postgresql_timescaledb_stack",
    "agent_toolkit_rag_integration",
    "openrouter_llm_provider"
  ],
  "architecture": {
    "layer_pattern": "Router → Service → Repository → Entity → Database",
    "database": "PostgreSQL + TimescaleDB + pgvectorscale",
    "llm_provider": "OpenRouter unified interface",
    "agent_orchestration": "LangGraph workflows",
    "data_pipeline": "Dagster (planned, not implemented)",
    "domain_structure": "news (95% complete), marketdata (planned), socialmedia (planned)",
    "testing_strategy": "Domain-specific: mocks for services, real DB for repositories, pytest-vcr for HTTP"
  },
  "news_implementation_status": {
    "core_components": {
      "NewsService": "Business logic with company/global news context",
      "NewsRepository": "Async PostgreSQL with batch upsert, vector embeddings",
      "GoogleNewsClient": "RSS feed client for live data",
      "ArticleScraperClient": "newspaper4k with paywall detection"
    },
    "data_models": {
      "NewsArticle": "Domain dataclass with validation",
      "NewsArticleEntity": "SQLAlchemy model with 1536-dim vector embeddings"
    },
    "key_features": [
      "URL-based deduplication",
      "Vector embeddings for similarity",
      "Paywall detection and fallback",
      "Comprehensive test coverage with pytest-vcr"
    ]
  },
  "dagster_status": "Planned but not implemented - documentation references exist but no pipeline code",
  "technical_patterns": {
    "async_operations": "All repository methods async with session management",
    "batch_operations": "upsert_batch for performance",
    "error_handling": "Graceful degradation with logging",
    "vector_search": "Semantic similarity for RAG",
    "connection_management": "DatabaseManager with asyncpg and pooling"
  }
}
@@ -0,0 +1,127 @@
{
  "requirements": {
    "entities": {
      "NewsArticle": "Existing domain entity, enhance with structured sentiment and vector embedding support",
      "NewsJobConfig": "New configuration entity for scheduled job parameters (tickers, schedule, model settings)"
    },
    "data_persistence": {
      "news_articles_table": "Existing table with vector embedding columns, enhance sentiment_score JSONB column",
      "vector_indexes": "pgvectorscale indexes for title_embedding and content_embedding (1536 dimensions)",
      "data_flows": [
        "APScheduler → NewsService.update_company_news() → NewsRepository.upsert_batch()",
        "ArticleData → OpenRouter API → structured sentiment → NewsArticle entity",
        "Article content → OpenRouter embeddings API → pgvectorscale storage"
      ]
    },
    "api_needed": {
      "external_apis": [
        "OpenRouter for LLM sentiment analysis using quick_think_llm",
        "OpenRouter for embeddings using text-embedding models",
        "Existing GoogleNewsClient and ArticleScraperClient"
      ],
      "internal_apis": [
        "Enhanced NewsService.update_company_news() method",
        "New NewsRepository.find_similar_articles() for semantic search",
        "New ScheduledNewsCollector job orchestration class"
      ]
    },
    "components": {
      "scheduler": "APScheduler integration for daily news collection",
      "sentiment_analyzer": "OpenRouter LLM client for structured sentiment analysis",
      "embedding_generator": "OpenRouter embeddings client for vector generation",
      "job_orchestrator": "ScheduledNewsCollector class for job coordination"
    },
    "domains": {
      "primary": "news (completing final 5%)",
      "integration": "Leverages existing Router → Service → Repository → Entity → Database pattern"
    },
    "business_rules": [
      "Best-effort sentiment analysis - LLM failures don't block article storage",
      "URL-based deduplication using existing NewsRepository patterns",
      "Paywall resilience via existing ArticleScraperClient graceful degradation",
      "Date filtering: articles within last 30 days only",
      "Sentiment confidence threshold: 0.5 minimum for reliable scores",
      "Content length limits: 8000 chars for embedding generation",
      "Embedding generation: Both title and content vectors required"
    ]
  },
  "technical_needs": {
    "domain_model": {
      "entities": {
        "NewsArticle": {
          "status": "exists_needs_enhancement",
          "enhancements": [
            "Structured sentiment JSON format: {sentiment: positive|negative|neutral, confidence: 0.0-1.0, reasoning: string}",
            "Vector embedding support for title and content (1536 dimensions)",
            "Enhanced validation for sentiment confidence thresholds"
          ]
        },
        "NewsJobConfig": {
          "status": "new_entity",
          "fields": ["tickers: list[str]", "schedule_hour: int", "sentiment_model: str", "embedding_model: str", "max_articles_per_ticker: int"],
          "validation": "Schedule hour 0-23, max articles 50-500 range"
        }
      },
      "services": {
        "NewsService": {
          "status": "exists_needs_enhancement",
          "enhancements": [
            "Integrate LLM sentiment analysis in update methods",
            "Add vector embedding generation pipeline",
            "Enhanced error handling for LLM and embedding failures"
          ]
        },
        "ScheduledNewsCollector": {
          "status": "new_service",
          "responsibilities": [
            "Orchestrate daily news collection jobs",
            "Manage job configuration and scheduling",
            "Monitor job execution and handle failures",
            "Integrate with existing NewsService methods"
          ]
        }
      }
    },
    "persistence": {
      "database": "PostgreSQL + TimescaleDB + pgvectorscale",
      "schema_updates": {
        "news_articles": {
          "existing_columns": "headline, url, source, published_date, summary, entities, sentiment_score, author, category, title_embedding, content_embedding",
          "modifications": [
            "Enhance sentiment_score JSONB to support structured format",
            "Add vector similarity indexes for title_embedding and content_embedding",
            "Add composite index on (symbol, published_date) for News Analyst queries"
          ]
        }
      },
      "access_patterns": [
        "Time-based queries: articles for ticker in date range",
        "Semantic similarity: find similar articles using vector search",
        "Sentiment filtering: articles by sentiment type and confidence",
        "Batch operations: efficient upsert of daily collection results"
      ]
    },
    "router": {
      "status": "not_needed",
      "reason": "News Analysts access via AgentToolkit anti-corruption layer, no direct REST API required"
    },
    "events": {
      "status": "not_applicable",
      "reason": "Scheduled batch processing, no real-time event requirements"
    },
    "dependencies": {
      "external": [
        "OpenRouter API (existing TradingAgentsConfig integration)",
        "OpenRouter embeddings models (existing TradingAgentsConfig integration)",
        "APScheduler (new dependency for job scheduling)"
      ],
      "internal": [
        "Existing NewsService (95% complete)",
        "Existing NewsRepository with async PostgreSQL patterns",
        "Existing GoogleNewsClient and ArticleScraperClient",
        "DatabaseManager for connection management",
        "TradingAgentsConfig for LLM and API configuration"
      ]
    }
  }
}
@@ -0,0 +1,946 @@
# News Domain Technical Design

## Overview

This document details the technical design for completing the final 5% of the News domain implementation. The existing infrastructure is 95% complete with Google News collection, article scraping, and basic storage implemented. The remaining work focuses on **scheduled execution**, **LLM-powered sentiment analysis**, and **vector embeddings** using OpenRouter as the unified LLM provider.

## Architecture Overview

### Component Relationships

```mermaid
graph TD
    A[APScheduler] --> B[ScheduledNewsCollector]
    B --> C[NewsService]
    C --> D[GoogleNewsClient]
    C --> E[ArticleScraperClient]
    C --> F[OpenRouter LLM Client]
    C --> G[OpenRouter Embeddings Client]
    C --> H[NewsRepository]
    H --> I[PostgreSQL + TimescaleDB + pgvectorscale]

    J[News Analysts] --> K[AgentToolkit]
    K --> C
    K --> H
```

### Data Flow Architecture

1. **Scheduled Collection Flow**
   ```
   APScheduler → ScheduledNewsCollector → NewsService.update_company_news()
   → GoogleNewsClient → ArticleScraperClient → OpenRouter (sentiment + embeddings)
   → NewsRepository.upsert_batch() → PostgreSQL
   ```

2. **Agent Query Flow**
   ```
   News Analyst → AgentToolkit → NewsService.find_relevant_articles()
   → NewsRepository (semantic search) → pgvectorscale vector similarity
   ```

### Key Design Principles

- **Leverage Existing 95%**: Build on proven GoogleNewsClient and ArticleScraperClient infrastructure
- **OpenRouter Unified**: Single API for both sentiment analysis and embeddings
- **Best-Effort Processing**: LLM failures don't block article storage
- **Vector-Enhanced Search**: Semantic similarity for News Analysts
- **Fault-Tolerant Scheduling**: Robust error handling and monitoring

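The "Best-Effort Processing" principle can be captured in a small guard around the sentiment call. This is a minimal sketch: `enrich_article` and the `analyze_sentiment` callable stand in for the real NewsService/OpenRouter integration and are not existing APIs.

```python
import logging

logger = logging.getLogger(__name__)


def enrich_article(article: dict, analyze_sentiment) -> dict:
    """Attach sentiment if analysis succeeds; never block storage on failure.

    `analyze_sentiment` stands in for the OpenRouter call; any exception is
    logged and the article proceeds to storage without a sentiment score.
    """
    try:
        article["sentiment_score"] = analyze_sentiment(article["headline"])
    except Exception as exc:
        logger.warning("Sentiment analysis failed for %s: %s", article.get("url"), exc)
        article["sentiment_score"] = None
    return article
```

The same pattern applies to embedding generation: a failed OpenRouter call downgrades the article rather than dropping it.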
## Domain Model

### Enhanced NewsArticle Entity

The existing `NewsArticle` entity requires enhancements for structured sentiment and vector support:

```python
|
||||
from typing import Optional, Dict, Any, List
|
||||
from pydantic import BaseModel, Field, validator
|
import datetime
from typing import Any, Dict, List, Literal, Optional

from pydantic import BaseModel, Field, validator


class SentimentScore(BaseModel):
    """Structured sentiment analysis result"""

    sentiment: Literal["positive", "negative", "neutral"]
    # Low-confidence scores are stored, not rejected; reliability is
    # checked via NewsArticle.has_reliable_sentiment() (>= 0.5)
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str


class NewsArticle(BaseModel):
    """Enhanced NewsArticle entity with sentiment and vector support"""

    # Existing fields (95% complete)
    headline: str
    url: str = Field(..., regex=r'^https?://')
    source: str
    published_date: datetime.datetime
    summary: Optional[str] = None
    entities: List[str] = Field(default_factory=list)
    author: Optional[str] = None
    category: Optional[str] = None

    # Enhanced fields (final 5%)
    sentiment_score: Optional[SentimentScore] = None
    title_embedding: Optional[List[float]] = Field(None, min_items=1536, max_items=1536)
    content_embedding: Optional[List[float]] = Field(None, min_items=1536, max_items=1536)

    # Metadata
    created_at: datetime.datetime = Field(default_factory=datetime.datetime.now)
    updated_at: datetime.datetime = Field(default_factory=datetime.datetime.now)

    @validator('content_embedding', 'title_embedding')
    def validate_embeddings(cls, v):
        if v and len(v) != 1536:
            raise ValueError("Embeddings must be 1536 dimensions to match the configured embedding model")
        return v

    def has_reliable_sentiment(self) -> bool:
        """Check if sentiment analysis is reliable (confidence >= 0.5)"""
        return bool(self.sentiment_score and self.sentiment_score.confidence >= 0.5)

    def to_record(self) -> Dict[str, Any]:
        """Convert to database record format"""
        record = self.dict()
        # Convert sentiment to JSONB format
        if self.sentiment_score:
            record['sentiment_score'] = self.sentiment_score.dict()
        return record

    @classmethod
    def from_record(cls, record: Dict[str, Any]) -> 'NewsArticle':
        """Create entity from database record"""
        if record.get('sentiment_score'):
            record['sentiment_score'] = SentimentScore(**record['sentiment_score'])
        return cls(**record)
```

### New NewsJobConfig Entity

Configuration entity for scheduled news collection:

```python
from typing import List

from pydantic import BaseModel, Field, validator


class NewsJobConfig(BaseModel):
    """Configuration for scheduled news collection jobs"""

    tickers: List[str] = Field(..., min_items=1, max_items=50)
    schedule_hour: int = Field(..., ge=0, le=23)
    sentiment_model: str = Field(default="anthropic/claude-3.5-haiku")
    embedding_model: str = Field(default="text-embedding-3-small")  # 1536 dimensions
    max_articles_per_ticker: int = Field(default=20, ge=5, le=100)
    lookback_days: int = Field(default=7, ge=1, le=30)

    @validator('tickers')
    def validate_tickers(cls, v):
        # Normalize to uppercase stock symbols
        return [ticker.upper().strip() for ticker in v]

    @validator('sentiment_model')
    def validate_sentiment_model(cls, v):
        # Ensure OpenRouter model format
        if '/' not in v:
            raise ValueError("Model must be in OpenRouter format (provider/model)")
        return v

    def to_cron_expression(self) -> str:
        """Convert to a cron expression for APScheduler"""
        return f"0 {self.schedule_hour} * * *"  # Daily at the specified hour
```
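
The normalization and scheduling logic above can be exercised in isolation. A minimal standalone sketch (the two functions mirror the `validate_tickers` validator and `to_cron_expression` method; no pydantic required):

```python
def normalize_tickers(tickers):
    """Mirror of the validate_tickers validator: uppercase, stripped symbols."""
    return [ticker.upper().strip() for ticker in tickers]


def to_cron_expression(schedule_hour: int) -> str:
    """Mirror of to_cron_expression: minute 0 of the given hour, daily."""
    return f"0 {schedule_hour} * * *"


print(normalize_tickers([" aapl", "googl "]))  # ['AAPL', 'GOOGL']
print(to_cron_expression(6))                   # 0 6 * * *
```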

## Database Design

### Schema Enhancements

The existing `news_articles` table requires minimal modifications to support the final 5%:

```sql
-- Existing table structure (95% complete)
CREATE TABLE IF NOT EXISTS news_articles (
    id SERIAL PRIMARY KEY,
    headline TEXT NOT NULL,
    url TEXT UNIQUE NOT NULL,
    source TEXT NOT NULL,
    published_date TIMESTAMPTZ NOT NULL,
    summary TEXT,
    entities TEXT[] DEFAULT '{}',
    sentiment_score JSONB,           -- Enhanced for structured format
    author TEXT,
    category TEXT,
    title_embedding vector(1536),    -- New: vector column, indexed via pgvectorscale
    content_embedding vector(1536),  -- New: vector column, indexed via pgvectorscale
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- New indexes for final 5% performance
-- GIN index supports the entities @> containment queries below
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_entities
    ON news_articles USING GIN (entities);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_published_date
    ON news_articles (published_date DESC);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_title_embedding
    ON news_articles USING diskann (title_embedding vector_cosine_ops);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_content_embedding
    ON news_articles USING diskann (content_embedding vector_cosine_ops);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_news_articles_sentiment
    ON news_articles ((sentiment_score->>'sentiment'))
    WHERE sentiment_score IS NOT NULL;
```

### Query Patterns

**Time-based News Queries (News Analysts)**
```sql
-- Optimized for Agent queries: recent news for a specific ticker
SELECT headline, summary, sentiment_score, published_date
FROM news_articles
WHERE entities @> ARRAY[$1::text]
  AND published_date >= NOW() - INTERVAL '30 days'
ORDER BY published_date DESC
LIMIT 20;
```

**Semantic Similarity Queries (Vector Search)**
```sql
-- Find similar articles using pgvectorscale
SELECT headline, url, summary,
       1 - (title_embedding <=> $1::vector) AS similarity_score
FROM news_articles
WHERE entities @> ARRAY[$2::text]
  AND title_embedding IS NOT NULL
ORDER BY title_embedding <=> $1::vector
LIMIT 10;
```
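
The `<=>` operator is pgvector's cosine distance, so `1 - distance` yields the similarity score selected above (1.0 for identical directions, 0.0 for orthogonal vectors). A small pure-Python sketch of the same arithmetic, for intuition only:

```python
import math


def cosine_distance(a, b):
    """Cosine distance, as computed by pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# similarity_score as in the SQL above: 1 - distance
identical = 1 - cosine_distance([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
orthogonal = 1 - cosine_distance([1.0, 0.0], [0.0, 1.0])
print(round(identical, 6))   # 1.0
print(round(orthogonal, 6))  # 0.0
```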

**Batch Upsert Operations (Daily Collection)**
```sql
-- Efficient upsert for daily news collection
INSERT INTO news_articles (headline, url, source, published_date, summary,
                           entities, sentiment_score, title_embedding, content_embedding)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
ON CONFLICT (url) DO UPDATE SET
    headline = EXCLUDED.headline,
    summary = EXCLUDED.summary,
    entities = EXCLUDED.entities,
    sentiment_score = EXCLUDED.sentiment_score,
    title_embedding = EXCLUDED.title_embedding,
    content_embedding = EXCLUDED.content_embedding,
    updated_at = NOW();
```
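
The `ON CONFLICT (url) DO UPDATE` semantics can be tried in miniature with stdlib SQLite, which supports the same upsert clause (`excluded.*` lowercase; the two-column table here is an illustrative stand-in for the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news_articles (url TEXT PRIMARY KEY, headline TEXT)")

upsert = """
INSERT INTO news_articles (url, headline) VALUES (?, ?)
ON CONFLICT (url) DO UPDATE SET headline = excluded.headline
"""
conn.execute(upsert, ("https://example.com/1", "First headline"))
conn.execute(upsert, ("https://example.com/1", "Revised headline"))  # same URL: updates in place

rows = conn.execute("SELECT url, headline FROM news_articles").fetchall()
print(rows)  # [('https://example.com/1', 'Revised headline')]
```

Re-running the daily collection therefore never duplicates an article; it refreshes the stored sentiment and embeddings for URLs already seen.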

## API Integration

### OpenRouter Unified Client

A single OpenRouter integration serves both sentiment analysis and embeddings:

```python
import json
from typing import List, Optional

import httpx

from tradingagents.config import TradingAgentsConfig


class OpenRouterClient:
    """Unified OpenRouter client for sentiment analysis and embeddings"""

    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        self.base_url = "https://openrouter.ai/api/v1"
        self.headers = {
            "Authorization": f"Bearer {config.openrouter_api_key}",
            "Content-Type": "application/json"
        }

    async def analyze_sentiment(self, text: str, model: Optional[str] = None) -> SentimentScore:
        """Generate structured sentiment analysis using an LLM"""
        model = model or self.config.quick_think_llm

        # Truncate the article text to stay within token limits
        prompt = f"""Analyze the sentiment of this news article text and respond with ONLY a JSON object:

Article: {text[:2000]}

Required JSON format:
{{
    "sentiment": "positive|negative|neutral",
    "confidence": 0.0-1.0,
    "reasoning": "brief explanation"
}}"""

        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.1,  # Low temperature for consistent structured output
            "max_tokens": 200
        }

        async with httpx.AsyncClient() as client:
            try:
                response = await client.post(
                    f"{self.base_url}/chat/completions",
                    headers=self.headers,
                    json=payload,
                    timeout=30.0
                )
                response.raise_for_status()

                result = response.json()
                content = result["choices"][0]["message"]["content"].strip()

                # Parse the JSON response
                sentiment_data = json.loads(content)
                return SentimentScore(**sentiment_data)

            except Exception as e:
                # Best-effort: return neutral sentiment on failure
                return SentimentScore(
                    sentiment="neutral",
                    confidence=0.3,  # Below the reliability threshold
                    reasoning=f"Analysis failed: {str(e)[:100]}"
                )

    async def generate_embeddings(self, texts: List[str],
                                  model: Optional[str] = None) -> List[Optional[List[float]]]:
        """Generate embeddings for multiple texts"""
        model = model or "text-embedding-3-small"  # 1536 dimensions

        # Truncate texts to avoid token limits
        truncated_texts = [text[:8000] for text in texts]

        payload = {
            "model": model,
            "input": truncated_texts
        }

        async with httpx.AsyncClient() as client:
            try:
                response = await client.post(
                    f"{self.base_url}/embeddings",
                    headers=self.headers,
                    json=payload,
                    timeout=60.0
                )
                response.raise_for_status()

                result = response.json()
                return [item["embedding"] for item in result["data"]]

            except Exception:
                # Return None embeddings on failure (stored as NULL in the DB)
                return [None] * len(texts)
```

### Enhanced NewsService Integration

Update the existing NewsService to integrate the LLM capabilities:

```python
class NewsService:
    """Enhanced NewsService with LLM sentiment and embeddings (final 5%)"""

    def __init__(self,
                 repository: NewsRepository,
                 google_client: GoogleNewsClient,
                 scraper_client: ArticleScraperClient,
                 openrouter_client: OpenRouterClient):
        self.repository = repository
        self.google_client = google_client
        self.scraper_client = scraper_client
        self.openrouter_client = openrouter_client

    async def update_company_news(self,
                                  symbol: str,
                                  lookback_days: int = 7,
                                  max_articles: int = 20,
                                  include_sentiment: bool = True,
                                  include_embeddings: bool = True) -> List[NewsArticle]:
        """Enhanced method with LLM sentiment analysis and embeddings"""

        # Step 1: Use existing 95% infrastructure for collection
        cutoff_date = datetime.datetime.now() - datetime.timedelta(days=lookback_days)

        # Fetch from Google News (existing)
        google_results = await self.google_client.fetch_company_news(symbol, max_articles)

        articles = []
        for result in google_results:
            if result.published_date < cutoff_date:
                continue

            # Scrape full content (existing)
            scraped_content = await self.scraper_client.scrape_article(result.url)

            # Create base article (existing pattern)
            article = NewsArticle(
                headline=result.title,
                url=result.url,
                source=result.source,
                published_date=result.published_date,
                summary=scraped_content.summary if scraped_content else result.description,
                entities=[symbol],
                author=scraped_content.author if scraped_content else None
            )

            # Step 2: NEW - Add LLM sentiment analysis
            if include_sentiment and scraped_content and scraped_content.content:
                article.sentiment_score = await self.openrouter_client.analyze_sentiment(
                    scraped_content.content
                )

            articles.append(article)

        # Step 3: NEW - Batch generate embeddings
        if include_embeddings and articles:
            titles = [a.headline for a in articles]
            contents = [a.summary or a.headline for a in articles]

            title_embeddings = await self.openrouter_client.generate_embeddings(titles)
            content_embeddings = await self.openrouter_client.generate_embeddings(contents)

            for i, article in enumerate(articles):
                if i < len(title_embeddings) and title_embeddings[i]:
                    article.title_embedding = title_embeddings[i]
                if i < len(content_embeddings) and content_embeddings[i]:
                    article.content_embedding = content_embeddings[i]

        # Step 4: Batch persist (existing pattern)
        await self.repository.upsert_batch(articles)
        return articles

    async def find_similar_articles(self,
                                    query_text: str,
                                    symbol: Optional[str] = None,
                                    limit: int = 10) -> List[NewsArticle]:
        """NEW: Semantic similarity search for News Analysts"""

        # Generate query embedding
        query_embeddings = await self.openrouter_client.generate_embeddings([query_text])
        if not query_embeddings[0]:
            # Fallback to text search
            return await self.repository.find_by_text_search(query_text, symbol, limit)

        return await self.repository.find_similar_articles(
            query_embeddings[0], symbol, limit
        )
```

## Job Scheduling Architecture

### APScheduler Integration

Robust scheduled execution using APScheduler:

```python
import datetime
import logging
from typing import Any, Dict

from apscheduler.executors.asyncio import AsyncIOExecutor
from apscheduler.schedulers.asyncio import AsyncIOScheduler


class ScheduledNewsCollector:
    """Orchestrates scheduled news collection jobs"""

    def __init__(self,
                 news_service: NewsService,
                 config: TradingAgentsConfig,
                 job_config: NewsJobConfig):
        self.news_service = news_service
        self.config = config
        self.job_config = job_config

        # Configure APScheduler
        jobstores = {
            'default': {'type': 'memory'}  # Use a persistent job store (e.g. Redis) in production
        }
        executors = {
            'default': AsyncIOExecutor(),
        }
        job_defaults = {
            'coalesce': False,         # Don't combine missed jobs
            'max_instances': 1,        # One job per ticker at a time
            'misfire_grace_time': 300  # 5-minute grace period
        }

        self.scheduler = AsyncIOScheduler(
            jobstores=jobstores,
            executors=executors,
            job_defaults=job_defaults,
            timezone='UTC'
        )

    async def start(self):
        """Start the scheduler and register jobs"""
        for ticker in self.job_config.tickers:
            # Schedule daily collection for each ticker
            self.scheduler.add_job(
                func=self._collect_ticker_news,
                trigger='cron',
                hour=self.job_config.schedule_hour,
                minute=0,
                args=[ticker],
                id=f"news_collection_{ticker}",
                replace_existing=True,
                max_instances=1
            )

        self.scheduler.start()
        logging.info(f"Started news collection scheduler for {len(self.job_config.tickers)} tickers")

    async def stop(self):
        """Gracefully stop the scheduler"""
        if self.scheduler.running:
            self.scheduler.shutdown(wait=True)

    async def _collect_ticker_news(self, ticker: str):
        """Execute news collection for a single ticker"""
        start_time = datetime.datetime.now()

        try:
            logging.info(f"Starting news collection for {ticker}")

            articles = await self.news_service.update_company_news(
                symbol=ticker,
                lookback_days=self.job_config.lookback_days,
                max_articles=self.job_config.max_articles_per_ticker,
                include_sentiment=True,
                include_embeddings=True
            )

            # Log metrics
            sentiment_count = sum(1 for a in articles if a.has_reliable_sentiment())
            embedding_count = sum(1 for a in articles if a.title_embedding)

            duration = (datetime.datetime.now() - start_time).total_seconds()

            logging.info(
                f"Completed news collection for {ticker}: "
                f"{len(articles)} articles, {sentiment_count} with sentiment, "
                f"{embedding_count} with embeddings in {duration:.1f}s"
            )

        except Exception as e:
            logging.error(f"News collection failed for {ticker}: {str(e)}")
            # Don't raise - let the scheduler continue with other tickers

    def get_job_status(self) -> Dict[str, Any]:
        """Get the status of all scheduled jobs"""
        jobs = self.scheduler.get_jobs()
        return {
            "scheduler_running": self.scheduler.running,
            "job_count": len(jobs),
            "jobs": [
                {
                    "id": job.id,
                    "next_run": job.next_run_time.isoformat() if job.next_run_time else None,
                    "trigger": str(job.trigger)
                }
                for job in jobs
            ]
        }
```

### Error Handling and Monitoring

Comprehensive error handling for production reliability:

```python
from collections import defaultdict


class NewsCollectionMonitor:
    """Monitor and handle news collection job failures"""

    def __init__(self, collector: ScheduledNewsCollector):
        self.collector = collector
        self.failure_counts = defaultdict(int)
        self.max_failures = 3

    async def handle_job_failure(self, ticker: str, error: Exception):
        """Handle a job failure with exponential backoff"""
        self.failure_counts[ticker] += 1

        if self.failure_counts[ticker] >= self.max_failures:
            logging.error(f"Max failures reached for {ticker}, disabling job")
            self.collector.scheduler.remove_job(f"news_collection_{ticker}")
            # Could send an alert here
        else:
            # Schedule a retry with exponential backoff
            delay_minutes = 2 ** self.failure_counts[ticker]
            retry_time = datetime.datetime.now() + datetime.timedelta(minutes=delay_minutes)

            self.collector.scheduler.add_job(
                func=self.collector._collect_ticker_news,
                trigger='date',
                run_date=retry_time,
                args=[ticker],
                id=f"news_retry_{ticker}_{int(retry_time.timestamp())}",
                max_instances=1
            )

    def reset_failure_count(self, ticker: str):
        """Reset the failure count after a successful job"""
        if ticker in self.failure_counts:
            del self.failure_counts[ticker]
```

## Implementation Strategy

### Phase 1: Entity and Database Enhancements (Week 1)

**Deliverables:**
- [ ] Enhanced `NewsArticle` entity with `SentimentScore` and vector support
- [ ] New `NewsJobConfig` entity with validation
- [ ] Database migration for vector indexes and sentiment_score JSONB enhancement
- [ ] Repository method `find_similar_articles()` with pgvectorscale integration

**Testing Focus:**
- Unit tests for entity validation and serialization
- Repository integration tests with vector similarity queries
- Database migration verification

### Phase 2: OpenRouter Integration (Week 2)

**Deliverables:**
- [ ] `OpenRouterClient` with sentiment analysis and embeddings
- [ ] Enhanced `NewsService.update_company_news()` with LLM integration
- [ ] Error handling for LLM failures (best-effort approach)
- [ ] Integration tests with OpenRouter API (using pytest-vcr)

**Testing Focus:**
- Mock OpenRouter responses for consistent testing
- Error handling scenarios (API failures, malformed responses)
- Embedding dimension validation

### Phase 3: Job Scheduling System (Week 3)

**Deliverables:**
- [ ] `ScheduledNewsCollector` with APScheduler integration
- [ ] `NewsCollectionMonitor` for error handling and retries
- [ ] Configuration management for job scheduling
- [ ] Graceful startup and shutdown procedures

**Testing Focus:**
- Scheduler lifecycle testing
- Job execution and failure handling
- Configuration validation

### Phase 4: Testing and Performance Optimization (Week 4)

**Deliverables:**
- [ ] Complete test coverage maintaining >85% threshold
- [ ] Performance optimization for vector queries
- [ ] Documentation and deployment guides
- [ ] Integration with existing News Analyst AgentToolkit

**Testing Focus:**
- End-to-end integration tests
- Performance benchmarks for vector similarity queries
- Load testing for scheduled job execution

## Testing Strategy

### Test Architecture

Following the existing pragmatic TDD approach with mock boundaries:

```
tests/domains/news/
├── __init__.py
├── test_news_entities.py           # Entity validation and serialization
├── test_news_service.py            # Mock repository and OpenRouter client
├── test_news_repository.py         # PostgreSQL test database
├── test_openrouter_client.py       # pytest-vcr for API responses
├── test_scheduled_collector.py     # Mock APScheduler and services
└── integration/
    ├── test_sentiment_pipeline.py  # End-to-end sentiment analysis
    ├── test_embedding_pipeline.py  # End-to-end embedding generation
    └── test_scheduled_execution.py # Full job execution cycle
```

### Key Test Categories

**Entity Tests (Fast Unit Tests)**
```python
def test_news_article_sentiment_validation():
    """Test sentiment score validation and reliability checks"""

    # Reliable sentiment
    sentiment = SentimentScore(
        sentiment="positive",
        confidence=0.8,
        reasoning="Strong positive language"
    )

    article = NewsArticle(
        headline="Test headline",
        url="https://example.com",
        source="Test Source",
        published_date=datetime.datetime.now(),
        sentiment_score=sentiment
    )

    assert article.has_reliable_sentiment()

    # Low-confidence sentiment
    low_confidence = SentimentScore(
        sentiment="neutral",
        confidence=0.3,
        reasoning="Ambiguous language"
    )

    article.sentiment_score = low_confidence
    assert not article.has_reliable_sentiment()


def test_news_article_vector_validation():
    """Test vector embedding validation"""

    # Valid 1536-dimension embedding
    valid_embedding = [0.1] * 1536
    article = NewsArticle(
        headline="Test",
        url="https://example.com",
        source="Test",
        published_date=datetime.datetime.now(),
        title_embedding=valid_embedding
    )

    assert len(article.title_embedding) == 1536

    # Invalid dimension should raise ValidationError
    with pytest.raises(ValidationError):
        NewsArticle(
            headline="Test",
            url="https://example.com",
            source="Test",
            published_date=datetime.datetime.now(),
            title_embedding=[0.1] * 512  # Wrong dimension
        )
```

**Service Integration Tests (Mock Boundaries)**
```python
@pytest.mark.asyncio
async def test_news_service_with_sentiment_analysis(mock_openrouter_client,
                                                    mock_repository,
                                                    mock_google_client,
                                                    mock_scraper_client):
    """Test NewsService integration with a mocked LLM client"""

    # Mock successful sentiment analysis
    mock_sentiment = SentimentScore(
        sentiment="positive",
        confidence=0.9,
        reasoning="Optimistic financial outlook"
    )
    mock_openrouter_client.analyze_sentiment.return_value = mock_sentiment

    # Mock embeddings
    mock_openrouter_client.generate_embeddings.return_value = [
        [0.1] * 1536,  # title embedding
        [0.2] * 1536   # content embedding
    ]

    service = NewsService(
        repository=mock_repository,
        google_client=mock_google_client,
        scraper_client=mock_scraper_client,
        openrouter_client=mock_openrouter_client
    )

    articles = await service.update_company_news("AAPL", include_sentiment=True)

    # Verify the LLM integration
    assert len(articles) > 0
    assert articles[0].sentiment_score == mock_sentiment
    assert articles[0].title_embedding == [0.1] * 1536
    assert mock_openrouter_client.analyze_sentiment.called
    assert mock_openrouter_client.generate_embeddings.called
```

**Repository Integration Tests (Real Database)**
```python
@pytest.mark.asyncio
async def test_repository_vector_similarity_search(test_db):
    """Test vector similarity search with real pgvectorscale"""

    repository = NewsRepository(test_db)

    # Insert articles with embeddings
    article1 = NewsArticle(
        headline="Apple reports strong iPhone sales",
        url="https://example.com/1",
        source="TechNews",
        published_date=datetime.datetime.now(),
        entities=["AAPL"],
        title_embedding=[0.1, 0.2] + [0.0] * 1534  # Similar to the query
    )

    article2 = NewsArticle(
        headline="Microsoft launches new Azure features",
        url="https://example.com/2",
        source="CloudNews",
        published_date=datetime.datetime.now(),
        entities=["MSFT"],
        title_embedding=[0.9, 0.8] + [0.0] * 1534  # Different from the query
    )

    await repository.upsert_batch([article1, article2])

    # Query with a similar embedding
    query_embedding = [0.15, 0.25] + [0.0] * 1534
    similar_articles = await repository.find_similar_articles(
        query_embedding, symbol="AAPL", limit=1
    )

    assert len(similar_articles) == 1
    assert similar_articles[0].headline == "Apple reports strong iPhone sales"
```

**API Integration Tests (pytest-vcr)**
```python
@pytest.mark.vcr
@pytest.mark.asyncio
async def test_openrouter_sentiment_analysis():
    """Test real OpenRouter API calls with VCR cassettes"""

    config = TradingAgentsConfig.from_env()
    client = OpenRouterClient(config)

    test_text = "Apple's quarterly earnings exceeded expectations with strong iPhone sales."

    sentiment = await client.analyze_sentiment(test_text)

    assert isinstance(sentiment, SentimentScore)
    assert sentiment.sentiment in ["positive", "negative", "neutral"]
    assert 0.0 <= sentiment.confidence <= 1.0
    assert len(sentiment.reasoning) > 0


@pytest.mark.vcr
@pytest.mark.asyncio
async def test_openrouter_embeddings_generation():
    """Test the real OpenRouter embeddings API with VCR"""

    config = TradingAgentsConfig.from_env()
    client = OpenRouterClient(config)

    texts = ["Apple stock rises", "Market volatility increases"]

    embeddings = await client.generate_embeddings(texts)

    assert len(embeddings) == 2
    assert all(len(emb) == 1536 for emb in embeddings)
    assert all(isinstance(val, float) for emb in embeddings for val in emb)
```

### Coverage Requirements

Maintain existing >85% coverage with the new components:

- **Entity Layer**: 95% coverage (comprehensive validation testing)
- **Service Layer**: 90% coverage (mock external dependencies)
- **Repository Layer**: 85% coverage (real database integration tests)
- **Client Layer**: 80% coverage (pytest-vcr for API calls)
- **Integration Tests**: End-to-end scenarios covering complete workflows
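
One way to enforce the project-wide floor is via pytest-cov configuration (a sketch assuming a `pyproject.toml`-based setup with the `tradingagents` package; the per-layer targets above are checked in review rather than gated):

```toml
[tool.pytest.ini_options]
addopts = "--cov=tradingagents --cov-report=term-missing --cov-fail-under=85"

[tool.coverage.run]
branch = true
source = ["tradingagents"]
```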

### Performance Testing

```python
import random
import time


@pytest.mark.performance
@pytest.mark.asyncio
async def test_vector_similarity_performance():
    """Ensure vector similarity queries complete in under 100ms"""

    repository = NewsRepository(test_db)

    # Insert 1000 articles with embeddings
    articles = [create_test_article_with_embedding() for _ in range(1000)]
    await repository.upsert_batch(articles)

    query_embedding = [random.random() for _ in range(1536)]

    start_time = time.time()
    results = await repository.find_similar_articles(query_embedding, limit=10)
    duration = time.time() - start_time

    assert duration < 0.1  # Under 100ms
    assert len(results) == 10
```

## Integration Points

### News Analyst AgentToolkit Integration

The completed News domain integrates seamlessly with existing News Analyst agents:

```python
class NewsAnalystToolkit:
    """Enhanced toolkit with semantic search capabilities"""

    def __init__(self, news_service: NewsService):
        self.news_service = news_service

    async def get_relevant_news(self,
                                ticker: str,
                                query: Optional[str] = None,
                                days_back: int = 30) -> List[Dict[str, Any]]:
        """Get news with optional semantic search"""

        if query:
            # Use semantic similarity search
            articles = await self.news_service.find_similar_articles(
                query_text=query,
                symbol=ticker,
                limit=20
            )
        else:
            # Use time-based search (existing)
            articles = await self.news_service.find_recent_news(
                symbol=ticker,
                days_back=days_back
            )

        return [
            {
                "headline": article.headline,
                "summary": article.summary,
                "published_date": article.published_date.isoformat(),
                "sentiment": article.sentiment_score.sentiment if article.sentiment_score else "unknown",
                "confidence": article.sentiment_score.confidence if article.sentiment_score else 0.0,
                "source": article.source,
                "url": article.url
            }
            for article in articles
        ]
```

### Configuration Integration

Seamless integration with the existing `TradingAgentsConfig`:

```python
# Enhanced configuration for news domain completion
config = TradingAgentsConfig(
    # Existing LLM configuration
    llm_provider="openrouter",
    openrouter_api_key=os.getenv("OPENROUTER_API_KEY"),
    quick_think_llm="anthropic/claude-3.5-haiku",  # For sentiment analysis

    # New news-specific settings
    news_collection_enabled=True,
    news_schedule_hour=6,  # UTC
    news_sentiment_enabled=True,
    news_embeddings_enabled=True,
    news_max_articles_per_ticker=20,

    # Database (existing)
    database_url=os.getenv("DATABASE_URL"),
)

# Job configuration
news_job_config = NewsJobConfig(
    tickers=["AAPL", "GOOGL", "MSFT", "TSLA", "NVDA"],
    schedule_hour=6,  # 6 AM UTC daily collection
    sentiment_model=config.quick_think_llm,
    embedding_model="text-embedding-3-small",
    max_articles_per_ticker=20
)
```

This design completes the final 5% of the News domain while leveraging the existing 95% infrastructure, maintaining architectural consistency, and providing the robust scheduled execution, LLM-powered sentiment analysis, and vector embeddings needed for advanced News Analyst capabilities.
{
    "raw_user_story": "a) As a Dagster Job I want to fetch all the google news articles for a ticker, fetch the article, perform sentimate analysis with LLM's and store in in the DB. b) As a News Analyst I want to fetch all relavent news data for a specific ticker and related tickers.",
    "raw_criteria": "a) the news data is updated on a schedule, daily to start. B) I can update the news for a ticker c) I can get the news for a ticker d) News is stored in DB with embeddings c) News is fetched from DB",
    "raw_rules": "a) best effort to fetch article, if it is paywalled or blocked, log a waring and continue",
    "raw_scope": "Included: Only fetch data from google new xml feed, use newspaper4k to fetch article content, use LLM to run sentiment analysis. Excluded: Other news sources beyond Google News XML feed."
}
# News Domain Completion - Implementation Summary

## Core Requirement
Complete the final 5% of the news domain: add scheduled execution, LLM sentiment analysis, and vector embeddings to the existing 95% complete infrastructure.

## User Story
**Dagster Job** automatically fetches Google News articles for tracked tickers, extracts content, performs LLM sentiment analysis, and stores with embeddings → **News Analysts** get comprehensive, up-to-date news data for trading decisions.

## Essential Requirements

### 1. Scheduled Execution
- Daily job at 6 AM UTC for all configured tickers
- APScheduler integration (no Dagster dependency)
- Graceful error handling with comprehensive logging

### 2. LLM Sentiment Analysis
- OpenRouter integration using `quick_think_llm` (claude-3.5-haiku)
- Structured output: `{"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0}`
- Best-effort processing - failures don't stop the pipeline
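
The best-effort rule amounts to a defensive parse of the structured output. A standalone illustration (the production path uses the `SentimentScore` pydantic entity instead of a plain dict):

```python
import json


def parse_sentiment(raw: str) -> dict:
    """Best-effort parse: any malformed LLM output degrades to low-confidence neutral."""
    try:
        data = json.loads(raw)
        if data["sentiment"] not in {"positive", "negative", "neutral"}:
            raise ValueError("unexpected sentiment label")
        if not 0.0 <= float(data["confidence"]) <= 1.0:
            raise ValueError("confidence out of range")
        return data
    except (ValueError, KeyError, TypeError):
        return {"sentiment": "neutral", "confidence": 0.0, "reasoning": "parse failed"}


print(parse_sentiment('{"sentiment": "positive", "confidence": 0.9}')["sentiment"])  # positive
print(parse_sentiment("not json at all")["sentiment"])                               # neutral
```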

### 3. Vector Embeddings
- 1536-dimension embeddings for title and content
- pgvectorscale storage with similarity indexes
- Semantic search capability for News Analysts

## Technical Implementation

### Architecture Pattern
```
ScheduledNewsJob → NewsService → NewsRepository → NewsArticle → PostgreSQL+pgvectorscale
```
|
||||
|
||||
### Database Changes
|
||||
```sql
|
||||
ALTER TABLE news_articles
|
||||
ADD COLUMN sentiment_score JSONB,
|
||||
ADD COLUMN title_embedding vector(1536),
|
||||
ADD COLUMN content_embedding vector(1536);
|
||||
```
|
||||
|
||||
### Key Integration Points
|
||||
- **Existing NewsService**: Enhance `update_news_for_symbol` method
|
||||
- **LLM Integration**: OpenRouter unified provider for sentiment
|
||||
- **Vector Generation**: text-embedding-3-small model (1536 dims)
|
||||
- **Job Scheduling**: APScheduler with cron trigger
|
||||
|
||||
## Implementation Phases
|
||||
1. **Scheduled Execution** (2-3h): APScheduler + config management
|
||||
2. **LLM Sentiment** (3-4h): OpenRouter integration + structured prompts
|
||||
3. **Vector Embeddings** (2-3h): Embedding generation + database schema
|
||||
4. **Testing & Monitoring** (2h): Coverage + performance validation
|
||||
|
||||
**Total: 9-12 hours**
|
||||
|
||||
## Success Criteria
|
||||
- ✅ Daily automated news collection without manual intervention
|
||||
- ✅ News retrieval with sentiment scores < 2 seconds response time
|
||||
- ✅ Vector embeddings enable semantic search for News Analysts
|
||||
- ✅ >95% article processing success rate despite paywall/blocking
|
||||
- ✅ Maintain >85% test coverage including new components
|
||||
|
||||
## Dependencies
|
||||
- **APIs**: OpenRouter (sentiment), OpenAI (embeddings)
|
||||
- **Infrastructure**: PostgreSQL + TimescaleDB + pgvectorscale
|
||||
- **New Package**: `apscheduler` for job scheduling
|
||||
- **Existing**: 95% complete news domain components
|
||||
|
||||
## Configuration
|
||||
```bash
|
||||
OPENROUTER_API_KEY="sk-or-..."
|
||||
OPENAI_API_KEY="sk-..."
|
||||
NEWS_SCHEDULE_HOUR=6
|
||||
NEWS_TICKERS="AAPL,GOOGL,MSFT,TSLA"
|
||||
```
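The environment variables above could be loaded into a typed config object at startup; a minimal sketch, assuming a dict-like environment (the `NewsConfig` and `load_news_config` names are illustrative, not part of the codebase):

```python
from dataclasses import dataclass, field


@dataclass
class NewsConfig:
    """Illustrative container for news-collection settings."""
    schedule_hour: int = 6
    tickers: list[str] = field(default_factory=list)


def load_news_config(env: dict[str, str]) -> NewsConfig:
    # Parse the comma-separated ticker list, dropping empty entries.
    tickers = [t.strip() for t in env.get("NEWS_TICKERS", "").split(",") if t.strip()]
    hour = int(env.get("NEWS_SCHEDULE_HOUR", "6"))
    if not 0 <= hour <= 23:
        raise ValueError(f"NEWS_SCHEDULE_HOUR out of range: {hour}")
    return NewsConfig(schedule_hour=hour, tickers=tickers)
```

Validating the hour at load time surfaces misconfiguration before the scheduler registers the job.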
## Risk Mitigation
- **API Rate Limits**: Exponential backoff + batch processing
- **Paywall Blocking**: Metadata-only storage with warnings
- **Job Failures**: Monitoring + alerting for operational visibility
- **Performance**: Vector indexes + query optimization for <2s target
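The exponential-backoff mitigation can be sketched as a small retry helper; this is a hypothetical utility, not an existing one, and in practice it would catch the client library's specific rate-limit exception rather than `Exception`:

```python
import asyncio
import logging
import random

logger = logging.getLogger(__name__)


async def retry_with_backoff(coro_factory, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry an async call with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception as exc:  # narrow to the API client's rate-limit error in practice
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            await asyncio.sleep(delay)
```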
@ -0,0 +1,68 @@

{
  "feature": "news",
  "user_story": "As a Dagster Job, I want to automatically fetch Google News articles for tracked tickers, extract content, perform LLM sentiment analysis, and store with embeddings in the database, so that News Analysts can access comprehensive, up-to-date news data for trading decisions",
  "acceptance_criteria": [
    "GIVEN a scheduled job runs daily WHEN it executes THEN it fetches news for all configured tickers without manual intervention",
    "GIVEN a news article is found WHEN content extraction fails due to paywall THEN a warning is logged and processing continues with available metadata",
    "GIVEN a ticker symbol WHEN a News Analyst requests news data THEN they receive articles with sentiment scores and embeddings within 2 seconds",
    "GIVEN news articles are processed WHEN LLM sentiment analysis runs THEN each article gets a structured sentiment score (positive/negative/neutral with confidence)",
    "GIVEN news articles are stored WHEN saved to database THEN they include vector embeddings for both title and content for semantic search"
  ],
  "business_rules": [
    "Best effort article fetching - log warnings for paywalled/blocked content but continue processing",
    "Daily schedule execution with configurable ticker list",
    "Deduplication by URL to prevent duplicate articles",
    "Sentiment analysis using OpenRouter LLM integration",
    "Vector embeddings generated for semantic similarity search",
    "Graceful error handling for network failures and API limits"
  ],
  "scope": {
    "included": [
      "Scheduled news collection job using existing NewsService",
      "LLM-based sentiment analysis replacing current keyword approach",
      "Vector embedding generation for articles",
      "Configuration management for ticker lists and schedules",
      "Integration with existing GoogleNewsClient and ArticleScraperClient",
      "Database storage using existing NewsRepository patterns"
    ],
    "excluded": [
      "Other news sources beyond Google News XML feed",
      "Real-time news streaming (daily batch processing only)",
      "Custom sentiment models (use OpenRouter LLMs only)",
      "News source reliability scoring",
      "Multi-language news support"
    ]
  },
  "current_implementation_status": "95% complete - core components exist",
  "missing_components": [
    "Scheduled execution framework (Dagster alternative needed)",
    "LLM sentiment analysis integration",
    "Vector embedding generation",
    "Configuration management for tickers and schedules",
    "Pipeline monitoring and status tracking"
  ],
  "existing_components": [
    "NewsService with update_news_for_symbol method",
    "GoogleNewsClient for RSS feed parsing",
    "ArticleScraperClient with newspaper4k integration",
    "NewsRepository with async PostgreSQL and vector schema",
    "NewsArticle domain model with validation",
    "Comprehensive test coverage with pytest-vcr"
  ],
  "aligns_with": "Multi-agent trading framework vision - provides news context for agent decision making",
  "dependencies": [
    "OpenRouter API for LLM sentiment analysis",
    "PostgreSQL with pgvectorscale for embeddings",
    "Existing news domain components (95% complete)",
    "APScheduler or similar for job scheduling (Dagster not in current dependencies)"
  ],
  "technical_details": {
    "architecture_pattern": "Router → Service → Repository → Entity → Database",
    "database_integration": "Async PostgreSQL with TimescaleDB optimization",
    "llm_integration": "OpenRouter unified provider with two-tier model strategy",
    "vector_storage": "1536-dimension embeddings using pgvectorscale",
    "error_handling": "Graceful degradation with comprehensive logging",
    "testing_strategy": "Domain-specific with pytest-vcr for HTTP mocking"
  },
  "implementation_approach": "Complete the missing 5% by adding scheduled execution, LLM sentiment analysis, and vector embedding generation to existing NewsService infrastructure"
}
@ -0,0 +1,334 @@

# News Domain Completion Specification

## Feature Overview

Complete the final 5% of the news domain by adding scheduled execution, LLM sentiment analysis, and vector embeddings to the existing 95% complete infrastructure. This enables automated daily news collection with advanced sentiment analysis and semantic search capabilities for News Analysts in the multi-agent trading framework.

## User Story

**Primary User**: Dagster Job (automated system)
**Secondary Users**: News Analysts (LLM agents)

> As a Dagster Job, I want to automatically fetch Google News articles for tracked tickers, extract content, perform LLM sentiment analysis, and store with embeddings in the database, so that News Analysts can access comprehensive, up-to-date news data for trading decisions.

## Acceptance Criteria

### AC1: Scheduled Execution
**GIVEN** a scheduled job runs daily
**WHEN** it executes
**THEN** it fetches news for all configured tickers without manual intervention

**Validation**:
- Job executes at configured time (default: daily at 6 AM UTC)
- All tickers in configuration are processed
- Job completion status is logged with metrics

### AC2: Content Extraction Resilience
**GIVEN** a news article is found
**WHEN** content extraction fails due to paywall
**THEN** a warning is logged and processing continues with available metadata

**Validation**:
- Paywall detection doesn't halt processing
- Warning messages include article URL and error reason
- Metadata (title, source, publish_date) is still stored

### AC3: Fast News Retrieval
**GIVEN** a ticker symbol
**WHEN** a News Analyst requests news data
**THEN** they receive articles with sentiment scores and embeddings within 2 seconds

**Validation**:
- Database queries return results in < 2 seconds
- Results include sentiment scores and vector embeddings
- Pagination supports large result sets

### AC4: LLM Sentiment Analysis
**GIVEN** news articles are processed
**WHEN** LLM sentiment analysis runs
**THEN** each article gets a structured sentiment score (positive/negative/neutral with confidence)

**Validation**:
- Sentiment scores use structured format: `{"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0}`
- LLM integration uses OpenRouter unified provider
- Failed sentiment analysis doesn't prevent article storage

### AC5: Vector Embeddings Storage
**GIVEN** news articles are stored
**WHEN** saved to database
**THEN** they include vector embeddings for both title and content for semantic search

**Validation**:
- 1536-dimension embeddings generated for title and content
- Embeddings stored in pgvectorscale-optimized columns
- Semantic similarity search returns relevant results

## Business Rules

### BR1: Best Effort Processing
- Log warnings for paywalled/blocked content but continue processing
- Network failures don't halt entire job execution
- API rate limits are respected with exponential backoff

### BR2: Daily Schedule Execution
- Configurable ticker list supports adding/removing symbols
- Job execution time is configurable (default: daily at 6 AM UTC)
- Manual job execution available for testing and backfill

### BR3: Data Quality Standards
- URL-based deduplication prevents duplicate articles
- Article publish dates must be within last 30 days
- Source URLs must be valid and accessible
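The URL-based deduplication rule can be enforced at the database level rather than in application code; a sketch, assuming the existing `news_articles` table exposes a `url` column (column names here are illustrative):

```sql
-- A unique index makes duplicate URLs a silent no-op on insert
CREATE UNIQUE INDEX IF NOT EXISTS idx_news_articles_url
    ON news_articles (url);

INSERT INTO news_articles (url, title, source, published_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (url) DO NOTHING;
```

`ON CONFLICT DO NOTHING` keeps the batch pipeline moving when the same article appears in multiple feed fetches.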
### BR4: LLM Integration Standards
- Use OpenRouter unified provider for sentiment analysis
- Quick-think LLM for sentiment processing (cost optimization)
- Structured prompts ensure consistent sentiment format

### BR5: Vector Search Optimization
- Embeddings enable semantic similarity search for agents
- Vector indexes optimize query performance
- Embedding generation uses consistent model for coherence

### BR6: Graceful Error Handling
- Individual article failures don't stop batch processing
- Comprehensive logging for monitoring and debugging
- Database transactions ensure data consistency

## Technical Implementation

### Architecture Alignment

Follows established **Router → Service → Repository → Entity → Database** pattern:

```
ScheduledNewsJob → NewsService → NewsRepository → NewsArticle → PostgreSQL+pgvectorscale
```

### Database Schema Integration

Leverages existing NewsRepository with vector extensions:

```sql
-- Existing news_articles table enhanced with:
ALTER TABLE news_articles
ADD COLUMN IF NOT EXISTS sentiment_score JSONB,
ADD COLUMN IF NOT EXISTS title_embedding vector(1536),
ADD COLUMN IF NOT EXISTS content_embedding vector(1536);

-- Vector similarity indexes
CREATE INDEX IF NOT EXISTS idx_title_embedding
ON news_articles USING ivfflat (title_embedding vector_cosine_ops);
```
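With that index in place, the semantic search a News Analyst would run against the table could look like the following sketch, using pgvector's cosine-distance operator `<=>` (the `$1` placeholder is assumed to be bound to a 1536-dimension query embedding by the async driver):

```sql
-- Top 10 articles most similar to a query embedding, last-30-day window
SELECT id,
       title,
       sentiment_score,
       1 - (content_embedding <=> $1) AS similarity
FROM news_articles
WHERE published_at > now() - interval '30 days'
ORDER BY content_embedding <=> $1
LIMIT 10;
```

Ordering by the raw distance (rather than the derived similarity) lets the planner use the ivfflat index.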
### LLM Integration Pattern

```python
# OpenRouter sentiment analysis
sentiment_result = await llm_client.analyze_sentiment(
    text=article.content,
    model="anthropic/claude-3.5-haiku",  # quick_think_llm
    structured_output=True
)

# Expected response format
{
    "sentiment": "positive|negative|neutral",
    "confidence": 0.85,
    "reasoning": "Brief explanation"
}
```
### Vector Embedding Strategy

```python
# Generate embeddings for semantic search
title_embedding = await embedding_client.create_embedding(
    text=article.title,
    model="text-embedding-3-small"  # 1536 dimensions
)

content_embedding = await embedding_client.create_embedding(
    text=article.content[:8000],  # Truncate for token limits
    model="text-embedding-3-small"
)
```
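Embedding APIs generally accept batched inputs, so the daily job can group article texts before calling the client; a sketch of the truncate-and-batch step (the helper name and batch size are illustrative assumptions):

```python
MAX_CONTENT_CHARS = 8000  # stay under embedding token limits


def prepare_embedding_batches(texts: list[str], batch_size: int = 32) -> list[list[str]]:
    """Truncate each text and split the list into API-sized batches."""
    truncated = [t[:MAX_CONTENT_CHARS] for t in texts]
    return [truncated[i:i + batch_size] for i in range(0, len(truncated), batch_size)]
```

Each batch then maps to one embeddings-API call, reducing request overhead for the daily collection volume.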
### Scheduled Execution Framework

Use APScheduler for job orchestration (Dagster not in current dependencies):

```python
from datetime import timezone

from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()
scheduler.add_job(
    run_news_collection,
    'cron',
    hour=6,  # 6 AM UTC
    minute=0,
    timezone=timezone.utc,
    id='daily_news_collection'
)
```
## Implementation Approach

### Phase 1: Scheduled Execution (2-3 hours)
1. Configure APScheduler for daily news collection
2. Create job configuration management for ticker lists
3. Implement job monitoring and status tracking
4. Add manual execution capability for testing

### Phase 2: LLM Sentiment Integration (3-4 hours)
1. Integrate OpenRouter LLM for sentiment analysis
2. Create structured sentiment analysis prompts
3. Update NewsService to include sentiment processing
4. Add sentiment data to NewsArticle domain model

### Phase 3: Vector Embeddings (2-3 hours)
1. Add embedding generation to article processing
2. Update database schema for vector storage
3. Implement semantic search capabilities in NewsRepository
4. Create vector similarity query methods

### Phase 4: Testing & Monitoring (2 hours)
1. Add comprehensive test coverage for new components
2. Implement job monitoring and alerting
3. Create configuration validation
4. Performance testing for 2-second query requirement

### Total Estimated Effort: 9-12 hours

## Dependencies

### Required APIs
- **OpenRouter API**: LLM sentiment analysis (`OPENROUTER_API_KEY`)
- **OpenAI API**: Vector embeddings (`OPENAI_API_KEY` for embeddings)

### Database Requirements
- **PostgreSQL**: Base storage with async support
- **TimescaleDB**: Time-series optimization for news data
- **pgvectorscale**: Vector storage and similarity search

### Existing Infrastructure (95% Complete)
- `NewsService` with `update_news_for_symbol` method
- `GoogleNewsClient` for RSS feed parsing
- `ArticleScraperClient` with newspaper4k integration
- `NewsRepository` with async PostgreSQL operations
- `NewsArticle` domain model with validation
- Comprehensive test coverage with pytest-vcr

### New Dependencies
- `apscheduler` for job scheduling
- Enhanced vector embedding capabilities
- LLM client integration for sentiment analysis

## Configuration Management

### Environment Variables
```bash
# Existing
OPENROUTER_API_KEY="sk-or-..."
DATABASE_URL="postgresql://..."

# New requirements
OPENAI_API_KEY="sk-..."              # For embeddings
NEWS_SCHEDULE_HOUR=6                 # UTC hour for daily execution
NEWS_TICKERS="AAPL,GOOGL,MSFT,TSLA"  # Comma-separated ticker list
```

### Configuration File Support
```yaml
# config/news_collection.yaml
schedule:
  hour: 6
  minute: 0
  timezone: "UTC"

tickers:
  - "AAPL"
  - "GOOGL"
  - "MSFT"
  - "TSLA"

sentiment:
  llm_model: "anthropic/claude-3.5-haiku"
  confidence_threshold: 0.5

embeddings:
  model: "text-embedding-3-small"
  dimensions: 1536
  content_max_length: 8000
```

## Success Metrics

### Performance Targets
- **Query Response Time**: < 2 seconds for news retrieval with sentiment
- **Job Execution Time**: < 30 minutes for daily collection (4 tickers)
- **Success Rate**: > 95% article processing success rate
- **Test Coverage**: Maintain > 85% coverage including new components

### Operational Metrics
- Daily job completion status and execution time
- Article processing success/failure rates per ticker
- LLM sentiment analysis success rates
- Vector embedding generation performance
- Database query performance monitoring

## Risk Mitigation

### Technical Risks
1. **LLM API Rate Limits**: Implement exponential backoff and batch processing
2. **Vector Storage Performance**: Monitor query times and optimize indexes
3. **Paywall Content Blocking**: Graceful degradation with metadata-only storage
4. **Database Migration Complexity**: Test schema changes thoroughly

### Operational Risks
1. **Scheduled Job Failures**: Implement monitoring and alerting
2. **API Key Management**: Secure configuration management
3. **Data Quality Issues**: Validation at multiple pipeline stages
4. **Performance Degradation**: Regular performance monitoring and optimization

## Testing Strategy

### Unit Testing (pytest with pytest-vcr)
- Scheduled job execution logic
- LLM sentiment analysis integration
- Vector embedding generation
- Configuration management

### Integration Testing
- End-to-end news collection pipeline
- Database vector operations
- LLM API integration
- Job scheduling functionality

### Performance Testing
- Query response time validation (< 2 seconds)
- Batch processing performance
- Vector similarity search optimization
- Concurrent job execution handling

## Monitoring and Observability

### Logging Strategy
- Job execution start/completion with metrics
- Individual article processing success/failure
- LLM API call status and timing
- Database operation performance

### Health Checks
- Daily job completion status
- Database connectivity and performance
- LLM API availability and response times
- Vector search functionality

### Alerting Triggers
- Failed daily news collection jobs
- API rate limit violations
- Database query performance degradation
- Sentiment analysis failure rates > 10%

This specification completes the news domain infrastructure to support advanced news analysis for the multi-agent trading framework, providing News Analysts with comprehensive, sentiment-analyzed, and semantically searchable news data for informed trading decisions.
@ -0,0 +1,336 @@

# News Domain Completion - Progress Status

## Overview

**Feature**: News Domain Final 5% Completion
**Status**: Ready for Implementation
**Total Estimated Time**: 12-16 hours with AI assistance
**Target Timeline**: 3-4 days
**Current Progress**: 95% complete (infrastructure ready)

---

## Progress Summary

### Overall Completion: 0% of the final 5% (domain infrastructure at 95%)

| Phase | Status | Progress | Duration | Completion |
|-------|--------|----------|----------|------------|
| Phase 1: Foundation | ⏳ Not Started | 0/3 tasks | 0h/4-7h | ⬜⬜⬜⬜⬜⬜⬜ |
| Phase 2: Data Access | ⏳ Not Started | 0/1 tasks | 0h/2-3h | ⬜⬜⬜ |
| Phase 3: LLM Integration | ⏳ Not Started | 0/3 tasks | 0h/5-8h | ⬜⬜⬜⬜⬜⬜⬜⬜ |
| Phase 4: Scheduling | ⏳ Not Started | 0/2 tasks | 0h/4-6h | ⬜⬜⬜⬜⬜⬜ |
| Phase 5: Validation | ⏳ Not Started | 0/2 tasks | 0h/3-5h | ⬜⬜⬜⬜⬜ |

**Legend**: ✅ Complete | 🟡 In Progress | ⏳ Not Started | ❌ Blocked

---

## Task Status Tracking

### Phase 1: Foundation (0% Complete)

#### ⏳ T001: Database Migration - NewsJobConfig Table
- **Status**: Not Started
- **Priority**: Critical
- **Estimated**: 1-2 hours
- **Dependencies**: None
- **Progress**: 0%
- **Acceptance Criteria**: 0/4 completed
  - [ ] `news_job_configs` table created with UUID primary key
  - [ ] JSONB fields for symbols and categories with validation
  - [ ] Proper indexes for enabled/frequency queries
  - [ ] Migration script tests with rollback capability
- **Blocking Issues**: None
- **Next Actions**: Create Alembic migration script

#### ⏳ T002: Enhance NewsArticle Entity - Sentiment and Embeddings
- **Status**: Not Started
- **Priority**: Critical
- **Estimated**: 2-3 hours
- **Dependencies**: T001
- **Progress**: 0%
- **Acceptance Criteria**: 0/5 completed
  - [ ] Add sentiment_score, sentiment_confidence, sentiment_label fields
  - [ ] Add title_embedding and content_embedding vector fields
  - [ ] Enhanced validate() method with sentiment range checks
  - [ ] Updated transformations for vector handling
  - [ ] Embedding dimension validation (1536)
- **Blocking Issues**: None
- **Next Actions**: Extend NewsArticle dataclass

#### ⏳ T003: Create NewsJobConfig Entity
- **Status**: Not Started
- **Priority**: Critical
- **Estimated**: 1-2 hours
- **Dependencies**: T001
- **Progress**: 0%
- **Acceptance Criteria**: 0/5 completed
  - [ ] NewsJobConfig dataclass with all required fields
  - [ ] Business rule validation for job configuration
  - [ ] Cron expression validation for frequency
  - [ ] Symbol list validation
  - [ ] JSON serialization for database storage
- **Blocking Issues**: None
- **Next Actions**: Create new entity file
### Phase 2: Data Access (0% Complete)

#### ⏳ T004: Enhance NewsRepository - Vector and Job Operations
- **Status**: Not Started
- **Priority**: Critical
- **Estimated**: 2-3 hours
- **Dependencies**: T002, T003
- **Progress**: 0%
- **Acceptance Criteria**: 0/5 completed
  - [ ] Vector similarity search with cosine distance
  - [ ] Batch embedding update operations
  - [ ] NewsJobConfig CRUD methods
  - [ ] Optimized query performance for vector operations
  - [ ] Proper async connection handling
- **Blocking Issues**: Waiting for T002, T003
- **Next Actions**: Extend NewsRepository class

### Phase 3: LLM Integration (0% Complete)

#### ⏳ T005: OpenRouter Client - Sentiment Analysis
- **Status**: Not Started
- **Priority**: Critical
- **Estimated**: 2-3 hours
- **Dependencies**: T002
- **Progress**: 0%
- **Acceptance Criteria**: 0/5 completed
  - [ ] OpenRouter API integration for sentiment analysis
  - [ ] Structured prompts for financial news sentiment
  - [ ] Response parsing with Pydantic models
  - [ ] Error handling with graceful fallbacks
  - [ ] Retry logic with exponential backoff
- **Blocking Issues**: Waiting for T002
- **Next Actions**: Create OpenRouter sentiment client

#### ⏳ T006: OpenRouter Client - Vector Embeddings
- **Status**: Not Started
- **Priority**: Critical
- **Estimated**: 1-2 hours
- **Dependencies**: T002
- **Progress**: 0%
- **Acceptance Criteria**: 0/5 completed
  - [ ] OpenRouter embeddings API integration
  - [ ] Text preprocessing for embedding generation
  - [ ] Batch processing for multiple articles
  - [ ] 1536-dimensional vector validation
  - [ ] Proper error handling and retries
- **Blocking Issues**: Waiting for T002
- **Next Actions**: Create OpenRouter embeddings client

#### ⏳ T007: Enhance NewsService - LLM Integration
- **Status**: Not Started
- **Priority**: Critical
- **Estimated**: 2-3 hours
- **Dependencies**: T005, T006
- **Progress**: 0%
- **Acceptance Criteria**: 0/5 completed
  - [ ] Replace keyword sentiment with LLM analysis
  - [ ] Add embedding generation to article processing
  - [ ] End-to-end article processing pipeline
  - [ ] Proper error handling and fallback strategies
  - [ ] Integration with existing service methods
- **Blocking Issues**: Waiting for T005, T006
- **Next Actions**: Integrate LLM clients into NewsService

### Phase 4: Scheduling (0% Complete)

#### ⏳ T008: APScheduler Integration - Job Scheduling
- **Status**: Not Started
- **Priority**: High
- **Estimated**: 3-4 hours
- **Dependencies**: T003, T004, T007
- **Progress**: 0%
- **Acceptance Criteria**: 0/5 completed
  - [ ] APScheduler setup with PostgreSQL job store
  - [ ] Scheduled job execution with proper error handling
  - [ ] Job configuration loading and validation
  - [ ] Status monitoring and failure recovery
  - [ ] CLI integration for job management
- **Blocking Issues**: Waiting for T003, T004, T007
- **Next Actions**: Implement ScheduledNewsCollector

#### ⏳ T009: CLI Integration - Job Management Commands
- **Status**: Not Started
- **Priority**: Medium
- **Estimated**: 1-2 hours
- **Dependencies**: T008
- **Progress**: 0%
- **Acceptance Criteria**: 0/5 completed
  - [ ] CLI commands for job creation/management
  - [ ] Manual job execution commands
  - [ ] Job status and monitoring commands
  - [ ] Integration with existing CLI structure
  - [ ] Proper error handling and user feedback
- **Blocking Issues**: Waiting for T008
- **Next Actions**: Extend CLI with news job commands

### Phase 5: Validation (0% Complete)

#### ⏳ T010: Integration Tests - End-to-End Workflow
- **Status**: Not Started
- **Priority**: High
- **Estimated**: 2-3 hours
- **Dependencies**: T007, T008
- **Progress**: 0%
- **Acceptance Criteria**: 0/5 completed
  - [ ] End-to-end workflow tests from RSS to vector storage
  - [ ] Agent integration tests via AgentToolkit
  - [ ] Performance tests for daily collection volumes
  - [ ] Error recovery and fallback tests
  - [ ] Test coverage maintained above 85%
- **Blocking Issues**: Waiting for T007, T008
- **Next Actions**: Create comprehensive integration test suite

#### ⏳ T011: Documentation and Monitoring
- **Status**: Not Started
- **Priority**: Medium
- **Estimated**: 1-2 hours
- **Dependencies**: T010
- **Progress**: 0%
- **Acceptance Criteria**: 0/5 completed
  - [ ] Updated API documentation for new methods
  - [ ] Job scheduling configuration examples
  - [ ] Performance monitoring dashboard queries
  - [ ] Troubleshooting guide for common issues
  - [ ] Agent integration documentation
- **Blocking Issues**: Waiting for T010
- **Next Actions**: Update documentation and monitoring

---
|
||||
|
||||
## Success Criteria Validation
|
||||
|
||||
### Technical Requirements Status
|
||||
- [ ] **OpenRouter-only LLM Integration**: Not started
|
||||
- [ ] **Vector Embeddings with pgvectorscale**: Not started
|
||||
- [ ] **APScheduler Job Execution**: Not started
|
||||
- [ ] **Test Coverage >85%**: Baseline established (needs monitoring)
|
||||
- [ ] **Query Performance <100ms**: Not tested
|
||||
- [ ] **Vector Search Performance <1s**: Not tested
|
||||
- [ ] **Backward Compatibility**: Not validated
|
||||
|
||||
### Functional Requirements Status
|
||||
- [ ] **Sentiment Analysis Pipeline**: Not implemented
|
||||
- [ ] **Embedding Generation Pipeline**: Not implemented
|
||||
- [ ] **Scheduled News Collection**: Not implemented
|
||||
- [ ] **CLI Job Management**: Not implemented
|
||||
- [ ] **AgentToolkit Integration**: Not validated
|
||||
- [ ] **Error Handling & Fallbacks**: Not implemented
|
||||
|
||||
### Quality Requirements Status
|
||||
- [ ] **TDD Implementation**: Process defined, not applied
|
||||
- [ ] **Layered Architecture**: Pattern defined, not validated
|
||||
- [ ] **Async Connection Pooling**: Not implemented
|
||||
- [ ] **Production Monitoring**: Not implemented
|
||||
- [ ] **Documentation Completeness**: Not updated
|
||||
|
||||
---
|
||||
|
||||
## Current Blocking Issues
|
||||
|
||||
### Critical Blockers
|
||||
**None currently** - All dependencies are internal to this implementation
|
||||
|
||||
### Potential Risk Areas
|
||||
1. **OpenRouter API Access**: Requires valid API keys and model access
|
||||
2. **Database Migration**: Need proper PostgreSQL permissions for schema changes
|
||||
3. **Vector Extension**: pgvectorscale must be properly installed and configured
|
||||
4. **Performance Testing**: Need realistic data volumes for benchmark validation
|
||||
|
||||
---
|
||||
|
||||
## Weekly Progress Targets
|
||||
|
||||
### Week 1 Target (Days 1-2)
|
||||
- **Goal**: Complete Phase 1 & 2 (Foundation + Data Access)
|
||||
- **Expected Completion**: T001, T002, T003, T004
|
||||
- **Target Progress**: 45% overall completion
|
||||
|
||||
### Week 1 Target (Days 3-4)
|
||||
- **Goal**: Complete Phase 3 & 4 (LLM Integration + Scheduling)
|
||||
- **Expected Completion**: T005, T006, T007, T008, T009
|
||||
- **Target Progress**: 90% overall completion
|
||||
|
||||
### Week 2 Target (Day 1)
|
||||
- **Goal**: Complete Phase 5 (Validation)
|
||||
- **Expected Completion**: T010, T011
|
||||
- **Target Progress**: 100% overall completion
|
||||
|
||||
---
|
||||
|
||||
## Metrics Dashboard
|
||||
|
||||
### Code Coverage
|
||||
- **Current**: 95% (existing infrastructure)
|
||||
- **Target**: >85% (including new functionality)
|
||||
- **Status**: ⏳ Pending implementation
|
||||
|
||||
### Performance Benchmarks
|
||||
- **Query Performance**: Not measured (Target: <100ms)
|
||||
- **Vector Search**: Not measured (Target: <1s)
|
||||
- **Batch Processing**: Not measured (Target: TBD)
|
||||
- **Status**: ⏳ Pending implementation
|
||||
|
||||

### Test Execution

- **Unit Tests**: 0/11 tasks have tests
- **Integration Tests**: 0/11 tasks have integration tests
- **VCR Tests**: 0/3 API clients have VCR tests
- **Status**: ⏳ Pending implementation

---

## Communication & Reporting

### Daily Standup Format

```
Yesterday: [Tasks completed with IDs]
Today: [Tasks planned with IDs]
Blockers: [Any issues requiring attention]
Help Needed: [Specific areas for collaboration]
```

### Weekly Status Report Format

```
Completed: [Phase progress with task counts]
In Progress: [Current focus areas]
Upcoming: [Next phase priorities]
Risks: [Technical or timeline concerns]
Metrics: [Coverage, performance, test results]
```

### Milestone Checkpoints

- **Checkpoint 1** (End of Day 2): Foundation Complete (T001-T004)
- **Checkpoint 2** (End of Day 4): LLM Integration Complete (T005-T009)
- **Checkpoint 3** (End of Day 5): Full Implementation Complete (T001-T011)
---

## Notes

### Implementation Context

- Building on 95% complete news domain infrastructure
- Focus on OpenRouter-only LLM integration (no other providers)
- Maintaining backward compatibility with AgentToolkit
- Following established TDD and layered architecture patterns

### Key Success Factors

1. **Incremental Progress**: Validate each layer before proceeding
2. **Comprehensive Testing**: Maintain test coverage throughout
3. **Performance Monitoring**: Validate benchmarks at each step
4. **Error Resilience**: Implement fallbacks for all LLM dependencies
5. **Documentation**: Keep implementation and usage docs current

### Last Updated

**Date**: 2024-08-30
**By**: System
**Next Review**: Daily during implementation

---

*This status document will be updated as implementation progresses. Use it as the single source of truth for current progress and blocking issues.*
File diff suppressed because it is too large

@ -0,0 +1,70 @@
{
  "product_vision": "Multi-agent LLM financial trading framework that mirrors real-world trading firm dynamics for research-based market analysis and trading decisions with PostgreSQL + TimescaleDB + pgvectorscale architecture",
  "existing_features": [
    "news_domain_95_complete",
    "social_media_domain_stub_only",
    "postgresql_timescaledb_stack",
    "agent_toolkit_rag_integration",
    "openrouter_llm_provider",
    "reddit_client_empty_stub",
    "social_repository_file_based"
  ],
  "architecture": {
    "layer_pattern": "Router → Service → Repository → Entity → Database",
    "database": "PostgreSQL + TimescaleDB + pgvectorscale",
    "llm_provider": "OpenRouter unified interface",
    "agent_orchestration": "LangGraph workflows",
    "data_pipeline": "APScheduler/Dagster (planned, not implemented)",
    "domain_structure": "news (95% complete), marketdata (planned), socialmedia (stub only)",
    "testing_strategy": "Domain-specific: mocks for services, real DB for repositories, pytest-vcr for HTTP"
  },
  "socialmedia_implementation_status": {
    "current_components": {
      "SocialMediaService": "Stub implementation with empty methods",
      "SocialRepository": "File-based JSON storage with deduplication",
      "RedditClient": "Empty stub class - needs full implementation",
      "Data Models": "Basic SocialPost, PostData, SocialContext models exist"
    },
    "missing_components": {
      "PostgreSQL_migration": "Current file storage needs database migration",
      "Reddit_API_integration": "RedditClient is empty - needs PRAW implementation",
      "LLM_sentiment_analysis": "No sentiment analysis for social posts",
      "Vector_embeddings": "No embedding generation or similarity search",
      "Agent_toolkit_methods": "get_reddit_news and get_reddit_stock_info missing",
      "Scheduled_execution": "No daily data collection pipeline"
    },
    "implementation_gaps": [
      "SocialRepository uses file storage instead of PostgreSQL",
      "No SQLAlchemy entity for social posts with vector support",
      "RedditClient has no API integration code",
      "No LLM integration for sentiment analysis",
      "Agent toolkit missing social media methods",
      "No scheduled execution framework"
    ]
  },
  "reference_patterns": {
    "news_domain_success": {
      "NewsService": "95% complete business logic orchestration",
      "NewsRepository": "Async PostgreSQL with vector embeddings",
      "GoogleNewsClient": "RSS feed integration with error handling",
      "Agent_integration": "RAG-powered context via AgentToolkit"
    },
    "database_patterns": "Async PostgreSQL with TimescaleDB optimization and pgvectorscale",
    "llm_integration": "OpenRouter unified provider with two-tier model strategy",
    "testing_approach": "pytest-vcr for HTTP, real DB for repositories, mocks for services"
  },
  "technical_dependencies": {
    "external": [
      "PRAW (Python Reddit API Wrapper) for Reddit data access",
      "OpenRouter API for LLM sentiment analysis",
      "PostgreSQL with pgvectorscale for embeddings",
      "APScheduler or Dagster for scheduled execution"
    ],
    "internal": [
      "Existing database infrastructure from news domain",
      "OpenRouter configuration in TradingAgentsConfig",
      "DatabaseManager for connection management",
      "AgentToolkit patterns for RAG integration"
    ]
  }
}

@ -0,0 +1,567 @@
{
  "requirements": {
    "entities": {
      "SocialPost": "Core domain entity for Reddit posts with sentiment and engagement data",
      "SocialMediaPostEntity": "New SQLAlchemy entity for PostgreSQL storage with vector embeddings"
    },
    "data_persistence": {
      "migration_required": "File-based JSON storage to PostgreSQL + TimescaleDB + pgvectorscale",
      "schema": "social_media_posts table with vector embeddings, sentiment fields, and TimescaleDB optimization",
      "deduplication": "Reddit post_id unique constraint prevents duplicates"
    },
    "api_needed": {
      "external_apis": [
        "PRAW (Python Reddit API Wrapper) for Reddit data collection",
        "OpenRouter API for LLM sentiment analysis and embeddings"
      ],
      "internal_apis": [
        "AgentToolkit methods: get_reddit_news, get_reddit_stock_info",
        "SocialMediaService orchestration methods",
        "SocialRepository PostgreSQL operations"
      ]
    },
    "components": {
      "reddit_client": "Complete PRAW implementation (currently empty stub)",
      "repository": "PostgreSQL migration from file storage",
      "service": "Business logic with LLM integration",
      "agent_toolkit": "RAG methods for AI agents",
      "dagster_pipeline": "Scheduled daily collection"
    },
    "domains": {
      "primary": "socialmedia (complete greenfield implementation)",
      "integration": "Follows news domain patterns for consistency"
    },
    "business_rules": [
      "Daily collection from financial subreddits (wallstreetbets, investing, stocks, SecurityAnalysis)",
      "OpenRouter LLM sentiment analysis with structured scoring",
      "Vector embeddings for semantic similarity search",
      "Post deduplication by Reddit post_id",
      "90-day data retention policy",
      "Rate limiting compliance with Reddit API",
      "Best effort processing for API failures"
    ]
  },
  "technical_needs": {
    "domain_model": {
      "entities": {
        "SocialPost": {
          "purpose": "Domain entity managing business rules and data transformations",
          "responsibilities": [
            "fromRequest() - Create from Reddit API response",
            "toRecord() - Transform for PostgreSQL storage",
            "toResponse() - Format for agent consumption",
            "validate() - Business rule validation",
            "calculateSentiment() - Derived sentiment scoring",
            "extractTickers() - Ticker symbol detection"
          ],
          "fields": [
            "post_id: str (Reddit unique ID)",
            "title: str",
            "content: str",
            "author: str",
            "subreddit: str",
            "created_utc: datetime",
            "upvotes: int",
            "downvotes: int",
            "comments_count: int",
            "url: str",
            "sentiment_score: float",
            "sentiment_label: str",
            "tickers: List[str]",
            "embedding: Optional[List[float]]"
          ]
        },
        "SocialMediaPostEntity": {
          "purpose": "SQLAlchemy entity for PostgreSQL persistence",
          "table": "social_media_posts",
          "hypertable": "TimescaleDB partitioned by created_utc",
          "indexes": [
            "post_id (unique)",
            "subreddit, created_utc",
            "tickers (GIN array)",
            "embedding (pgvectorscale HNSW)"
          ]
        }
      }
    },
    "persistence": {
      "database_type": "PostgreSQL + TimescaleDB + pgvectorscale",
      "schema_design": {
        "table": "social_media_posts",
        "columns": [
          "id: UUID PRIMARY KEY",
          "post_id: VARCHAR(50) UNIQUE NOT NULL",
          "title: TEXT",
          "content: TEXT",
          "author: VARCHAR(100)",
          "subreddit: VARCHAR(50)",
          "created_utc: TIMESTAMPTZ (hypertable partition key)",
          "upvotes: INTEGER",
          "downvotes: INTEGER",
          "comments_count: INTEGER",
          "url: TEXT",
          "sentiment_score: FLOAT",
          "sentiment_label: VARCHAR(20)",
          "tickers: TEXT[] (array)",
          "embedding: VECTOR(1536) (pgvectorscale)",
          "inserted_at: TIMESTAMPTZ DEFAULT NOW()",
          "updated_at: TIMESTAMPTZ DEFAULT NOW()"
        ],
        "constraints": [
          "UNIQUE(post_id)",
          "CHECK(sentiment_score BETWEEN -1 AND 1)"
        ]
      },
      "access_patterns": [
        "Ticker-based queries: SELECT * WHERE 'AAPL' = ANY(tickers)",
        "Time-range filtering: SELECT * WHERE created_utc BETWEEN ? AND ?",
        "Vector similarity: SELECT * ORDER BY embedding <=> ? LIMIT 10",
        "Sentiment aggregations: SELECT AVG(sentiment_score) GROUP BY subreddit"
      ],
      "data_volume": "~400+ posts daily, 90-day retention = ~36K posts max"
    },
    "router": {
      "type": "AgentToolkit Integration (No HTTP Router)",
      "methods": [
        "get_reddit_news(ticker: str, days: int) -> List[SocialPost]",
        "get_reddit_stock_info(ticker: str) -> Dict",
        "search_similar_posts(query: str, limit: int) -> List[SocialPost]",
        "get_subreddit_sentiment(subreddit: str, ticker: str) -> SentimentSummary"
      ],
      "dependencies": [
        "SocialMediaService for business orchestration",
        "Entity transformations: SocialPost.toResponse()"
      ]
    },
    "events": {
      "domain_events": [
        "SocialPostCollected: Published when new posts are scraped",
        "SentimentAnalyzed: Published after LLM sentiment analysis",
        "EmbeddingGenerated: Published after vector embedding creation"
      ],
      "integration_events": [
        "MarketDataRequested: Subscribe to ticker validation events",
        "TradingDecisionMade: Consume for social sentiment correlation"
      ]
    },
    "dependencies": {
      "external_services": [
        "Reddit API (PRAW): Post collection and metadata",
        "OpenRouter API: Sentiment analysis and embeddings",
        "PostgreSQL: Data persistence and queries",
        "TimescaleDB: Time-series optimization",
        "pgvectorscale: Vector similarity search"
      ],
      "internal_services": [
        "None (greenfield implementation)"
      ],
      "required_by": [
        "AI agents: Social sentiment context for trading decisions",
        "Multi-agent workflows: RAG-powered social media analysis",
        "Risk management: Social sentiment risk factors"
      ],
      "component_order": [
        "1. SocialMediaPostEntity (database schema)",
        "2. SocialPost (domain entity with transformations)",
        "3. RedditClient (PRAW implementation)",
        "4. SocialRepository (PostgreSQL operations)",
        "5. SocialMediaService (business orchestration + LLM)",
        "6. AgentToolkit methods (RAG integration)",
        "7. Dagster pipeline (scheduled collection)"
      ]
    }
  },
  "design": {
    "architecture_overview": {
      "pattern": "Event-driven microservice with layered internal architecture",
      "data_flow": "Dagster Pipeline → RedditClient → SocialMediaService → SocialRepository → PostgreSQL + pgvectorscale",
      "agent_flow": "AgentToolkit → SocialMediaService → SocialRepository → Vector Similarity Search + Sentiment Aggregation",
      "key_principles": [
        "Leverage news domain patterns for consistency",
        "OpenRouter unified LLM provider",
        "Best-effort processing for API failures",
        "Vector-enhanced semantic search",
        "Rate limiting compliance with Reddit API",
        "Complete greenfield implementation from empty stubs"
      ]
    },
    "domain_model": {
      "SentimentScore": {
        "purpose": "Structured sentiment analysis result from OpenRouter LLM",
        "fields": {
          "sentiment": "Literal['positive', 'negative', 'neutral']",
          "confidence": "float (0.0-1.0)",
          "reasoning": "str (brief explanation)"
        },
        "validation": [
          "confidence >= 0.5 for reliable sentiment",
          "reasoning must be non-empty"
        ]
      },
      "SocialPost": {
        "purpose": "Core domain entity with business rules and transformations",
        "base_fields": {
          "post_id": "str (Reddit unique ID, e.g., 't3_abc123')",
          "title": "str",
          "content": "Optional[str] (selftext for text posts)",
          "author": "str",
          "subreddit": "str",
          "created_utc": "datetime",
          "upvotes": "int (score)",
          "downvotes": "int (calculated from score + upvote_ratio)",
          "comments_count": "int (num_comments)",
          "url": "str (permalink or external URL)"
        },
        "enhanced_fields": {
          "sentiment_score": "Optional[SentimentScore]",
          "tickers": "List[str] (extracted ticker symbols)",
          "title_embedding": "Optional[List[float]] (1536 dimensions)",
          "content_embedding": "Optional[List[float]] (1536 dimensions)"
        },
        "methods": {
          "from_praw_submission": "Create from PRAW Submission object",
          "to_entity": "Transform to SocialMediaPostEntity for database storage",
          "from_entity": "Create from database entity",
          "validate": "Business rule validation",
          "extract_tickers": "Extract stock symbols from title and content",
          "has_reliable_sentiment": "Check if sentiment confidence >= 0.5",
          "to_response": "Format for agent consumption"
        },
        "validation_rules": [
          "post_id must match Reddit format (starts with 't3_')",
          "title cannot be empty",
          "created_utc cannot be in future",
          "sentiment_score confidence must be 0.0-1.0",
          "embeddings must be 1536 dimensions if present",
          "subreddit must be in allowed financial subreddits"
        ]
      },
      "SocialJobConfig": {
        "purpose": "Configuration for scheduled Reddit collection",
        "fields": {
          "subreddits": "List[str] (financial subreddits to monitor)",
          "schedule_times": "List[str] (cron expressions for collection)",
          "sentiment_model": "str (OpenRouter model for sentiment)",
          "embedding_model": "str (OpenRouter model for embeddings)",
          "max_posts_per_subreddit": "int (limit per collection run)",
          "lookback_hours": "int (how far back to collect)",
          "min_score": "int (minimum upvotes threshold)",
          "rate_limit_delay": "float (seconds between API calls)"
        },
        "defaults": {
          "subreddits": "['wallstreetbets', 'investing', 'stocks', 'SecurityAnalysis']",
          "schedule_times": "['0 6 * * *', '0 18 * * *']",
          "sentiment_model": "anthropic/claude-3.5-haiku",
          "embedding_model": "text-embedding-3-large",
          "max_posts_per_subreddit": 50,
          "lookback_hours": 12,
          "min_score": 10,
          "rate_limit_delay": 1.0
        }
      }
    },
    "data_persistence": {
      "database_schema": {
        "table_definition": "CREATE TABLE social_media_posts (\n  id UUID PRIMARY KEY DEFAULT uuid7(),\n  post_id VARCHAR(50) UNIQUE NOT NULL,\n  title TEXT NOT NULL,\n  content TEXT,\n  author VARCHAR(100) NOT NULL,\n  subreddit VARCHAR(50) NOT NULL,\n  created_utc TIMESTAMPTZ NOT NULL,\n  upvotes INTEGER NOT NULL DEFAULT 0,\n  downvotes INTEGER NOT NULL DEFAULT 0,\n  comments_count INTEGER NOT NULL DEFAULT 0,\n  url TEXT NOT NULL,\n  sentiment_score JSONB,\n  sentiment_label VARCHAR(20),\n  tickers TEXT[] DEFAULT '{}',\n  title_embedding VECTOR(1536),\n  content_embedding VECTOR(1536),\n  inserted_at TIMESTAMPTZ DEFAULT NOW(),\n  updated_at TIMESTAMPTZ DEFAULT NOW()\n);",
        "hypertable": "SELECT create_hypertable('social_media_posts', 'created_utc', chunk_time_interval => INTERVAL '1 day');",
        "indexes": [
          "CREATE UNIQUE INDEX idx_social_posts_post_id ON social_media_posts (post_id);",
          "CREATE INDEX idx_social_posts_subreddit_time ON social_media_posts (subreddit, created_utc DESC);",
          "CREATE INDEX idx_social_posts_tickers_gin ON social_media_posts USING GIN (tickers);",
          "CREATE INDEX idx_social_posts_title_embedding ON social_media_posts USING vectors (title_embedding vector_cosine_ops);",
          "CREATE INDEX idx_social_posts_content_embedding ON social_media_posts USING vectors (content_embedding vector_cosine_ops);",
          "CREATE INDEX idx_social_posts_sentiment ON social_media_posts (((sentiment_score->>'sentiment'))) WHERE sentiment_score IS NOT NULL;"
        ],
        "constraints": [
          "ALTER TABLE social_media_posts ADD CONSTRAINT chk_sentiment_score CHECK (sentiment_score IS NULL OR ((sentiment_score->>'confidence')::float BETWEEN 0 AND 1));",
          "ALTER TABLE social_media_posts ADD CONSTRAINT chk_created_utc CHECK (created_utc <= NOW());"
        ]
      },
      "repository_methods": {
        "find_by_ticker": "async def find_by_ticker(self, ticker: str, days: int = 30, limit: int = 50) -> List[SocialPost]",
        "find_by_subreddit": "async def find_by_subreddit(self, subreddit: str, hours: int = 24, limit: int = 100) -> List[SocialPost]",
        "find_similar_posts": "async def find_similar_posts(self, query_embedding: List[float], ticker: Optional[str] = None, limit: int = 10) -> List[SocialPost]",
        "get_sentiment_summary": "async def get_sentiment_summary(self, ticker: str, subreddit: Optional[str] = None, hours: int = 24) -> Dict[str, Any]",
        "upsert_batch": "async def upsert_batch(self, posts: List[SocialPost]) -> List[SocialPost]",
        "cleanup_old_posts": "async def cleanup_old_posts(self, days: int = 90) -> int"
      },
      "query_optimizations": [
        "TimescaleDB hypertables for time-based partitioning",
        "pgvectorscale HNSW indexes for fast vector similarity",
        "GIN indexes for ticker array queries",
        "Composite indexes for common access patterns",
        "Materialized views for sentiment aggregations"
      ]
    },
    "api_specification": {
      "reddit_client": {
        "class": "RedditClient",
        "purpose": "PRAW wrapper with rate limiting and error handling",
        "configuration": {
          "client_id": "Reddit app client ID",
          "client_secret": "Reddit app client secret",
          "user_agent": "TradingAgents/1.0 by /u/tradingagents",
          "rate_limit": "1 request per second",
          "timeout": "30 seconds per request"
        },
        "methods": {
          "fetch_subreddit_posts": "async def fetch_subreddit_posts(self, subreddit: str, limit: int = 50, time_filter: str = 'day') -> List[Dict[str, Any]]",
          "search_posts": "async def search_posts(self, query: str, subreddit: Optional[str] = None, limit: int = 25) -> List[Dict[str, Any]]",
          "get_post_details": "async def get_post_details(self, post_id: str) -> Optional[Dict[str, Any]]"
        },
        "error_handling": [
          "Rate limit exceeded: Exponential backoff",
          "Authentication errors: Log and continue with next subreddit",
          "Network timeouts: Retry up to 3 times",
          "Invalid subreddit: Skip and log warning"
        ]
      },
      "openrouter_client": {
        "reuse": "Leverage existing OpenRouterClient from news domain",
        "enhancements": [
          "Social media specific prompts for sentiment analysis",
          "Batch processing for Reddit post embeddings",
          "Optimized token usage for short social media text"
        ],
        "sentiment_prompt": "Analyze this Reddit post about stocks/finance. Consider the informal language, memes, and community context. Respond with JSON: {\"sentiment\": \"positive|negative|neutral\", \"confidence\": 0.0-1.0, \"reasoning\": \"brief explanation\"}"
      }
    },
    "components": {
      "RedditClient": {
        "layer": "External API Integration",
        "responsibilities": [
          "Authenticate with Reddit API using PRAW",
          "Fetch posts from financial subreddits",
          "Handle rate limiting and API errors",
          "Transform PRAW responses to standard format"
        ],
        "dependencies": [
          "PRAW library",
          "Reddit API credentials",
          "Async HTTP client (httpx)"
        ],
        "error_handling": "Best-effort with graceful degradation"
      },
      "SocialRepository": {
        "layer": "Data Access",
        "responsibilities": [
          "PostgreSQL + TimescaleDB operations",
          "Vector similarity searches using pgvectorscale",
          "Batch upsert operations for performance",
          "Sentiment aggregation queries"
        ],
        "dependencies": [
          "AsyncSession (SQLAlchemy)",
          "SocialMediaPostEntity",
          "Vector similarity functions"
        ],
        "performance_targets": [
          "Batch upsert: <5s for 1000 posts",
          "Vector similarity: <1s for top 10 results",
          "Ticker queries: <100ms for 30-day range"
        ]
      },
      "SocialMediaService": {
        "layer": "Business Logic",
        "responsibilities": [
          "Orchestrate Reddit data collection",
          "Coordinate LLM sentiment analysis",
          "Generate vector embeddings",
          "Apply business rules and validation"
        ],
        "methods": {
          "collect_subreddit_posts": "async def collect_subreddit_posts(self, config: SocialJobConfig) -> int",
          "update_post_sentiment": "async def update_post_sentiment(self, posts: List[SocialPost]) -> List[SocialPost]",
          "generate_embeddings": "async def generate_embeddings(self, posts: List[SocialPost]) -> List[SocialPost]",
          "find_trending_tickers": "async def find_trending_tickers(self, hours: int = 24) -> List[Dict[str, Any]]"
        },
        "integration_patterns": [
          "OpenRouter for sentiment and embeddings",
          "Repository for data persistence",
          "Event publishing for domain events"
        ]
      },
      "AgentToolkit": {
        "layer": "Agent Integration",
        "responsibilities": [
          "Provide RAG methods for AI agents",
          "Format social data for agent consumption",
          "Semantic search for relevant posts",
          "Sentiment aggregation and analysis"
        ],
        "methods": {
          "get_reddit_sentiment": "async def get_reddit_sentiment(self, ticker: str, days: int = 7) -> Dict[str, Any]",
          "search_social_posts": "async def search_social_posts(self, query: str, ticker: Optional[str] = None) -> List[Dict[str, Any]]",
          "get_trending_discussions": "async def get_trending_discussions(self, ticker: str) -> List[Dict[str, Any]]",
          "get_subreddit_analysis": "async def get_subreddit_analysis(self, subreddit: str, ticker: str) -> Dict[str, Any]"
        },
        "response_format": [
          "Structured JSON with post content, metadata, and sentiment",
          "Data quality indicators",
          "Source attribution and confidence scores"
        ]
      }
    },
    "events": {
      "domain_events": {
        "SocialPostCollected": {
          "trigger": "New Reddit post successfully stored",
          "payload": {
            "post_id": "str",
            "subreddit": "str",
            "tickers": "List[str]",
            "created_utc": "datetime",
            "collection_timestamp": "datetime"
          }
        },
        "SentimentAnalyzed": {
          "trigger": "LLM sentiment analysis completed",
          "payload": {
            "post_id": "str",
            "sentiment": "str",
            "confidence": "float",
            "processing_time": "float"
          }
        },
        "EmbeddingGenerated": {
          "trigger": "Vector embedding created and stored",
          "payload": {
            "post_id": "str",
            "embedding_type": "str (title|content)",
            "dimensions": "int",
            "model_used": "str"
          }
        }
      },
      "integration_events": {
        "MarketDataRequested": {
          "purpose": "Validate ticker symbols against market data",
          "consumption": "Subscribe to ensure social posts reference valid tickers"
        },
        "TradingDecisionRequested": {
          "purpose": "Provide social sentiment context for trading decisions",
          "consumption": "Publish social sentiment summaries when trading decisions are being made"
        }
      }
    },
    "dependencies": {
      "external_dependencies": {
        "Reddit API": {
          "library": "PRAW (Python Reddit API Wrapper)",
          "authentication": "OAuth2 with client credentials",
          "rate_limits": "60 requests per minute per OAuth client",
          "required_credentials": ["client_id", "client_secret", "user_agent"]
        },
        "OpenRouter API": {
          "reuse": "Existing OpenRouterClient from news domain",
          "models": {
            "sentiment": "anthropic/claude-3.5-haiku",
            "embeddings": "text-embedding-3-large"
          },
          "cost_optimization": "Batch requests and token-efficient prompts"
        },
        "PostgreSQL Stack": {
          "database": "PostgreSQL 16+",
          "extensions": ["TimescaleDB", "pgvectorscale", "uuid-ossp"],
          "connection": "AsyncSession with asyncpg driver"
        }
      },
      "internal_dependencies": {
        "news_domain": "Reference implementation patterns for consistency",
        "config_management": "TradingAgentsConfig for unified configuration",
        "database_manager": "Shared DatabaseManager and session handling"
      },
      "implementation_order": [
        "1. Database migration: Create social_media_posts table with TimescaleDB and vector support",
        "2. SocialMediaPostEntity: SQLAlchemy entity with proper field mappings",
        "3. SocialPost: Domain entity with validation and transformation methods",
        "4. RedditClient: PRAW integration with rate limiting and error handling",
        "5. SocialRepository: Database operations with vector similarity search",
        "6. SocialMediaService: Business logic orchestration with LLM integration",
        "7. AgentToolkit integration: RAG methods for AI agent consumption",
        "8. Dagster pipeline: Scheduled collection and processing"
      ]
    },
    "implementation_guidance": {
      "database_setup": {
        "migration_script": [
          "Create social_media_posts table with all columns",
          "Add TimescaleDB hypertable partitioning on created_utc",
          "Create all indexes including vector similarity indexes",
          "Add constraints for data validation",
          "Set up retention policy for 90-day data cleanup"
        ],
        "seed_data": "Optional test data with sample Reddit posts for development"
      },
      "reddit_integration": {
        "praw_setup": [
          "Create Reddit app at https://www.reddit.com/prefs/apps/",
          "Configure OAuth2 credentials in environment variables",
          "Implement rate limiting to respect API limits",
          "Handle subreddit access and content filtering"
        ],
        "data_collection_strategy": [
          "Focus on financial subreddits: wallstreetbets, investing, stocks, SecurityAnalysis",
          "Collect hot/trending posts twice daily (6 AM, 6 PM UTC)",
          "Filter by minimum score threshold (10+ upvotes)",
          "Extract ticker symbols from post titles and content",
          "Deduplicate by Reddit post_id"
        ]
      },
      "llm_integration": {
        "sentiment_analysis": [
          "Use OpenRouter with anthropic/claude-3.5-haiku for cost efficiency",
          "Social media-specific prompts accounting for informal language and memes",
          "Structured JSON output with sentiment, confidence, and reasoning",
          "Best-effort processing: store posts even if sentiment analysis fails"
        ],
        "embeddings": [
          "Use text-embedding-3-large for 1536-dimension vectors",
          "Batch process for efficiency",
          "Generate embeddings for both title and content when available",
          "Store NULL for failed embedding generation"
        ]
      },
      "testing_strategy": {
        "unit_tests": [
          "Entity validation and transformation methods",
          "Reddit client with mocked PRAW responses",
          "Repository operations with test database",
          "Service orchestration with mocked dependencies"
        ],
        "integration_tests": [
          "End-to-end collection pipeline",
          "Vector similarity search with real pgvectorscale",
          "LLM integration with pytest-vcr cassettes",
          "Dagster pipeline execution"
        ],
        "performance_tests": [
          "Vector similarity query performance (<1s for top 10)",
          "Batch upsert performance (<5s for 1000 posts)",
          "Memory usage during large collection runs"
        ]
      },
      "monitoring_and_observability": {
        "metrics": [
          "Posts collected per subreddit per day",
          "Sentiment analysis success rate",
          "Embedding generation success rate",
          "Vector similarity query performance",
          "Reddit API rate limit utilization"
        ],
        "logging": [
          "Collection job start/completion with statistics",
          "API errors and retry attempts",
          "Data quality issues and validation failures",
          "Performance metrics for optimization"
        ],
        "alerts": [
          "Collection job failures",
          "Reddit API authentication issues",
          "High error rates in LLM processing",
          "Database connection problems"
        ]
      }
    }
  }
}

@ -0,0 +1,834 @@
# Social Media Domain - Technical Design Document

## Executive Summary

This document specifies the complete greenfield implementation of the Social Media domain within TradingAgents, transitioning from empty stubs to a production-ready system for collecting and analyzing social media sentiment from financial subreddits. This domain will provide AI agents with social sentiment context for trading decisions through a PostgreSQL + TimescaleDB + pgvectorscale architecture with RAG-powered capabilities.

**Implementation Scope**: Complete domain implementation (0% → 100% completion)
**Architecture**: PostgreSQL + TimescaleDB + pgvectorscale with PRAW Reddit integration and OpenRouter LLM processing
**Target**: 400+ posts daily across 4 financial subreddits with 85%+ test coverage

---

## 1. Architecture Overview

### 1.1 System Architecture

The Social Media domain follows the established layered architecture pattern while introducing new capabilities for social media data collection and semantic search:

```
┌─────────────────────────────────────────────────────────────┐
│                     Dagster Pipeline                        │
│                  (Scheduled Collection)                     │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                      RedditClient                           │
│                 (PRAW + Rate Limiting)                      │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                   SocialMediaService                        │
│            (Business Logic + LLM Integration)               │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                   SocialRepository                          │
│         (PostgreSQL + TimescaleDB + pgvectorscale)          │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│          PostgreSQL + TimescaleDB + pgvectorscale           │
│               (Time-series + Vector Storage)                │
└─────────────────────────────────────────────────────────────┘
```
### 1.2 Data Flow Architecture
|
||||
|
||||
**Collection Flow:**
|
||||
```
|
||||
Reddit API → RedditClient → SocialMediaService → OpenRouter LLM →
|
||||
SocialRepository → PostgreSQL + Vector Storage
|
||||
```
|
||||
|
||||
**Agent Query Flow:**
|
||||
```
|
||||
AgentToolkit → SocialMediaService → SocialRepository →
|
||||
Vector Similarity Search + Sentiment Aggregation → Structured Response
|
||||
```
|
||||
|
||||
### 1.3 Key Architectural Principles
|
||||
|
||||
- **Consistent Patterns**: Follow news domain architecture for maintainability
|
||||
- **Vector-Enhanced Search**: Semantic similarity using pgvectorscale for contextual social media analysis
|
||||
- **Best-Effort Processing**: Continue operation even when LLM services are unavailable
|
||||
- **Rate Limiting Compliance**: Respect Reddit API limits with exponential backoff
|
||||
- **Event-Driven Design**: Publish domain events for system integration
|
||||
|
||||
---
|
||||
|
||||
## 2. Domain Model

### 2.1 Core Entities

#### SocialPost (Domain Entity)

The primary domain entity managing business rules and data transformations:

```python
@dataclass
class SocialPost:
    """Core domain entity for Reddit posts with sentiment and engagement data."""

    # Core Reddit Data
    post_id: str              # Reddit unique ID (e.g., 't3_abc123')
    title: str                # Post title
    content: Optional[str]    # Post content (selftext for text posts)
    author: str               # Reddit username
    subreddit: str            # Subreddit name
    created_utc: datetime     # Post creation time
    url: str                  # Reddit permalink or external URL

    # Engagement Metrics
    upvotes: int              # Post score
    downvotes: int            # Calculated from score + upvote_ratio
    comments_count: int       # Number of comments

    # Enhanced Data
    sentiment_score: Optional[SentimentScore] = None
    tickers: List[str] = field(default_factory=list)
    title_embedding: Optional[List[float]] = None
    content_embedding: Optional[List[float]] = None

    @classmethod
    def from_praw_submission(cls, submission: praw.Submission) -> 'SocialPost':
        """Create SocialPost from a PRAW Submission object."""

    def to_entity(self) -> SocialMediaPostEntity:
        """Transform to database entity for storage."""

    def validate(self) -> List[str]:
        """Validate business rules and return a list of error messages."""

    def extract_tickers(self) -> List[str]:
        """Extract stock ticker symbols from title and content."""

    def has_reliable_sentiment(self) -> bool:
        """Check if sentiment confidence >= 0.5."""

    def to_response(self) -> Dict[str, Any]:
        """Format for agent consumption."""
```
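The `extract_tickers()` helper above could be sketched as a regex scan filtered against a reference list of known symbols (the `KNOWN_TICKERS` set and pattern below are illustrative assumptions, not the project's actual implementation; in practice the universe would come from reference data):

```python
import re
from typing import List, Set

# Illustrative universe; in practice this would come from a reference-data table.
KNOWN_TICKERS: Set[str] = {"AAPL", "TSLA", "GME", "NVDA", "SPY"}

# Matches cashtags like $AAPL and bare 1-5 letter uppercase words.
TICKER_PATTERN = re.compile(r"\$?([A-Z]{1,5})\b")


def extract_tickers(title: str, content: str = "") -> List[str]:
    """Return deduplicated ticker symbols found in title and content."""
    text = f"{title} {content}"
    seen: List[str] = []
    for match in TICKER_PATTERN.finditer(text):
        symbol = match.group(1)
        # Filter noise like 'I', 'DD', 'YOLO' by checking the reference list.
        if symbol in KNOWN_TICKERS and symbol not in seen:
            seen.append(symbol)
    return seen
```

The reference-list check matters because financial subreddits are full of uppercase slang that would otherwise be misread as tickers.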
**Validation Rules:**
- `post_id` must match the Reddit format (starts with 't3_')
- `title` cannot be empty
- `created_utc` cannot be in the future
- `sentiment_score.confidence` must be 0.0-1.0
- Embeddings must be 1536 dimensions if present
- `subreddit` must be in the allowed financial subreddits list
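A minimal sketch of how `validate()` could enforce these rules (the `DraftPost` dataclass is a simplified stand-in for `SocialPost`, not the real entity):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

ALLOWED_SUBREDDITS = {"wallstreetbets", "investing", "stocks", "SecurityAnalysis"}
EMBEDDING_DIM = 1536


@dataclass
class DraftPost:
    """Simplified stand-in for SocialPost, carrying only the validated fields."""
    post_id: str
    title: str
    subreddit: str
    created_utc: datetime
    confidence: Optional[float] = None
    title_embedding: Optional[List[float]] = None


def validate(post: DraftPost) -> List[str]:
    """Return a list of business-rule violations (empty list means valid)."""
    errors: List[str] = []
    if not post.post_id.startswith("t3_"):
        errors.append("post_id must start with 't3_'")
    if not post.title.strip():
        errors.append("title cannot be empty")
    if post.created_utc > datetime.now(timezone.utc):
        errors.append("created_utc cannot be in the future")
    if post.confidence is not None and not 0.0 <= post.confidence <= 1.0:
        errors.append("sentiment confidence must be between 0.0 and 1.0")
    if post.title_embedding is not None and len(post.title_embedding) != EMBEDDING_DIM:
        errors.append(f"title_embedding must have {EMBEDDING_DIM} dimensions")
    if post.subreddit not in ALLOWED_SUBREDDITS:
        errors.append("subreddit is not in the allowed list")
    return errors
```

Returning a list of errors rather than raising on the first violation lets the collection pipeline log every problem with a post in one pass.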
#### SentimentScore (Value Object)

Structured sentiment analysis result from the OpenRouter LLM:

```python
@dataclass
class SentimentScore:
    """Structured sentiment analysis result with confidence and reasoning."""

    sentiment: Literal['positive', 'negative', 'neutral']
    confidence: float  # 0.0-1.0
    reasoning: str     # Brief explanation

    def is_reliable(self) -> bool:
        """Check if confidence >= 0.5 for reliable sentiment."""
        return self.confidence >= 0.5

    def to_dict(self) -> Dict[str, Any]:
        """Convert to a dictionary for JSON storage."""
```
#### SocialJobConfig (Configuration)

Configuration for scheduled Reddit collection:

```python
@dataclass
class SocialJobConfig:
    """Configuration for scheduled Reddit data collection."""

    # Collection Settings
    subreddits: List[str] = field(default_factory=lambda: [
        'wallstreetbets', 'investing', 'stocks', 'SecurityAnalysis'
    ])
    max_posts_per_subreddit: int = 50
    lookback_hours: int = 12
    min_score: int = 10

    # Processing Settings
    sentiment_model: str = "anthropic/claude-3.5-haiku"
    embedding_model: str = "text-embedding-3-large"

    # Rate Limiting
    rate_limit_delay: float = 1.0  # seconds between API calls

    # Scheduling (cron expressions)
    schedule_times: List[str] = field(default_factory=lambda: [
        '0 6 * * *',   # 6 AM UTC
        '0 18 * * *',  # 6 PM UTC
    ])
```

---
## 3. Database Design

### 3.1 Schema Definition

The `social_media_posts` table leverages PostgreSQL with TimescaleDB for time-series optimization and pgvectorscale for vector similarity search:

```sql
-- Core table definition
CREATE TABLE social_media_posts (
    id UUID PRIMARY KEY DEFAULT uuid7(),
    post_id VARCHAR(50) UNIQUE NOT NULL,
    title TEXT NOT NULL,
    content TEXT,
    author VARCHAR(100) NOT NULL,
    subreddit VARCHAR(50) NOT NULL,
    created_utc TIMESTAMPTZ NOT NULL,
    upvotes INTEGER NOT NULL DEFAULT 0,
    downvotes INTEGER NOT NULL DEFAULT 0,
    comments_count INTEGER NOT NULL DEFAULT 0,
    url TEXT NOT NULL,
    sentiment_score JSONB,
    sentiment_label VARCHAR(20),
    tickers TEXT[] DEFAULT '{}',
    title_embedding VECTOR(1536),
    content_embedding VECTOR(1536),
    inserted_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- TimescaleDB hypertable for time-series optimization
SELECT create_hypertable('social_media_posts', 'created_utc',
    chunk_time_interval => INTERVAL '1 day');

-- Performance indexes
CREATE UNIQUE INDEX idx_social_posts_post_id ON social_media_posts (post_id);
CREATE INDEX idx_social_posts_subreddit_time ON social_media_posts (subreddit, created_utc DESC);
CREATE INDEX idx_social_posts_tickers_gin ON social_media_posts USING GIN (tickers);
CREATE INDEX idx_social_posts_title_embedding ON social_media_posts
    USING vectors (title_embedding vector_cosine_ops);
CREATE INDEX idx_social_posts_content_embedding ON social_media_posts
    USING vectors (content_embedding vector_cosine_ops);
CREATE INDEX idx_social_posts_sentiment ON social_media_posts
    ((sentiment_score->>'sentiment')) WHERE sentiment_score IS NOT NULL;

-- Data validation constraints
ALTER TABLE social_media_posts ADD CONSTRAINT chk_sentiment_score
    CHECK (sentiment_score IS NULL OR
           ((sentiment_score->>'confidence')::float BETWEEN 0 AND 1));
ALTER TABLE social_media_posts ADD CONSTRAINT chk_created_utc
    CHECK (created_utc <= NOW());
```
### 3.2 SQLAlchemy Entity

```python
class SocialMediaPostEntity(Base):
    """SQLAlchemy entity for PostgreSQL persistence with vector support."""

    __tablename__ = "social_media_posts"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid7)
    post_id = Column(String(50), unique=True, nullable=False, index=True)
    title = Column(Text, nullable=False)
    content = Column(Text)
    author = Column(String(100), nullable=False)
    subreddit = Column(String(50), nullable=False)
    created_utc = Column(DateTime(timezone=True), nullable=False)
    upvotes = Column(Integer, nullable=False, default=0)
    downvotes = Column(Integer, nullable=False, default=0)
    comments_count = Column(Integer, nullable=False, default=0)
    url = Column(Text, nullable=False)
    sentiment_score = Column(JSONB)
    sentiment_label = Column(String(20))
    tickers = Column(ARRAY(String), default=[])
    title_embedding = Column(Vector(1536))
    content_embedding = Column(Vector(1536))
    inserted_at = Column(DateTime(timezone=True), default=func.now())
    updated_at = Column(DateTime(timezone=True), default=func.now(), onupdate=func.now())

    def to_domain(self) -> SocialPost:
        """Convert to domain entity."""

    @classmethod
    def from_domain(cls, post: SocialPost) -> 'SocialMediaPostEntity':
        """Create from domain entity."""
```
### 3.3 Access Patterns and Query Optimization

**Common Access Patterns:**
- Ticker-based queries: `SELECT * WHERE 'AAPL' = ANY(tickers)`
- Time-range filtering: `SELECT * WHERE created_utc BETWEEN ? AND ?`
- Vector similarity: `SELECT * ORDER BY embedding <=> ? LIMIT 10`
- Sentiment aggregations: `SELECT AVG((sentiment_score->>'confidence')::float) ... GROUP BY subreddit`

**Performance Targets:**
- Vector similarity queries: < 1s for top 10 results
- Batch upserts: < 5s for 1000 posts
- Ticker-based queries: < 100ms for 30-day ranges

---
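The `<=>` operator in the vector-similarity pattern above is pgvector's cosine-distance operator; a pure-Python sketch of the ranking it performs (for intuition only, the real query runs inside PostgreSQL against the pgvectorscale index):

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance as computed by pgvector's <=> operator: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query: Sequence[float],
          rows: List[Tuple[str, Sequence[float]]],
          k: int = 10) -> List[str]:
    """Rank stored embeddings by distance to the query,
    like ORDER BY embedding <=> ? LIMIT k."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [post_id for post_id, _ in ranked[:k]]
```

Identical vectors have distance 0 and orthogonal vectors distance 1, which is why `ORDER BY ... <=> ?` ascending returns the most similar posts first.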
## 4. API Integration

### 4.1 Reddit Client (PRAW Integration)

Complete implementation of Reddit data collection using PRAW (Python Reddit API Wrapper):

```python
class RedditClient:
    """PRAW wrapper with rate limiting and error handling."""

    def __init__(self, config: RedditClientConfig):
        """Initialize Reddit client with OAuth2 credentials."""
        self.reddit = praw.Reddit(
            client_id=config.client_id,
            client_secret=config.client_secret,
            user_agent=config.user_agent
        )
        self.rate_limiter = AsyncLimiter(1, 1)  # 1 request per second

    async def fetch_subreddit_posts(
        self,
        subreddit: str,
        limit: int = 50,
        time_filter: str = 'day'
    ) -> List[Dict[str, Any]]:
        """Fetch hot posts from a subreddit with rate limiting."""

    async def search_posts(
        self,
        query: str,
        subreddit: Optional[str] = None,
        limit: int = 25
    ) -> List[Dict[str, Any]]:
        """Search posts by ticker symbols or keywords."""

    async def get_post_details(self, post_id: str) -> Optional[Dict[str, Any]]:
        """Get detailed information for a specific post."""
```

**Configuration Requirements:**
- Reddit app credentials: `client_id`, `client_secret`, `user_agent`
- Rate limiting: 1 request per second (60 requests/minute limit)
- Error handling: exponential backoff for rate limits, graceful degradation for authentication errors
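The exponential-backoff behavior named in the configuration requirements could be sketched as a small retry wrapper (the `RateLimited` exception and function names are illustrative; PRAW raises its own exception types for 429 responses):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Illustrative stand-in for the API's rate-limit error."""


def call_with_backoff(fn: Callable[[], T],
                      max_retries: int = 5,
                      base_delay: float = 1.0) -> T:
    """Retry fn with exponential backoff plus jitter on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("unreachable")
```

The jitter term keeps multiple collection workers from retrying in lockstep after a shared rate-limit window resets.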
### 4.2 OpenRouter LLM Integration

Leverage the existing OpenRouter infrastructure with social media-specific enhancements:

**Sentiment Analysis Prompt:**
```
Analyze this Reddit post about stocks/finance. Consider the informal language,
memes, and community context typical of financial subreddits.

Post: {title} - {content}

Respond with valid JSON:
{
  "sentiment": "positive|negative|neutral",
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation considering context"
}
```

**Embedding Configuration:**
- Model: `text-embedding-3-large`, reduced to 1536 dimensions via the `dimensions` parameter
- Batch processing for efficiency
- Generate embeddings for both title and content when available
- Store NULL for failed embedding generation (best-effort processing)

---
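Because LLM output is untrusted text, the JSON reply to the prompt above needs defensive parsing before it becomes a `SentimentScore`; a sketch of that validation (function and class names are illustrative, not the project's actual API):

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedSentiment:
    sentiment: str     # 'positive' | 'negative' | 'neutral'
    confidence: float  # 0.0-1.0
    reasoning: str


def parse_sentiment_response(raw: str) -> Optional[ParsedSentiment]:
    """Parse the LLM's JSON reply; return None on malformed output (best-effort)."""
    try:
        data = json.loads(raw)
        sentiment = data["sentiment"]
        confidence = float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    if sentiment not in {"positive", "negative", "neutral"}:
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return ParsedSentiment(sentiment, confidence, str(data.get("reasoning", "")))
```

Returning `None` instead of raising matches the domain's best-effort principle: a post whose sentiment reply is malformed is still stored, just without a sentiment score.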
## 5. Component Architecture

### 5.1 Repository Layer (Data Access)

```python
class SocialRepository:
    """Data access layer for social media posts with vector capabilities."""

    def __init__(self, session: AsyncSession):
        self.session = session

    async def find_by_ticker(
        self,
        ticker: str,
        days: int = 30,
        limit: int = 50
    ) -> List[SocialPost]:
        """Find posts mentioning a specific ticker within the time range."""

    async def find_similar_posts(
        self,
        query_embedding: List[float],
        ticker: Optional[str] = None,
        limit: int = 10
    ) -> List[SocialPost]:
        """Find semantically similar posts using vector similarity."""

    async def get_sentiment_summary(
        self,
        ticker: str,
        subreddit: Optional[str] = None,
        hours: int = 24
    ) -> Dict[str, Any]:
        """Generate a sentiment aggregation for a ticker."""

    async def upsert_batch(self, posts: List[SocialPost]) -> List[SocialPost]:
        """Batch upsert posts with conflict resolution."""

    async def cleanup_old_posts(self, days: int = 90) -> int:
        """Remove posts older than the retention period."""
```
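The aggregation `get_sentiment_summary` performs can be sketched in plain Python over `(sentiment_label, confidence)` pairs; the output shape mirrors the `summary` object in the agent response format later in this document (the function itself is an illustrative sketch, not the repository method, which would aggregate in SQL):

```python
from collections import Counter
from typing import Dict, List, Tuple


def sentiment_summary(posts: List[Tuple[str, float]]) -> Dict[str, object]:
    """Aggregate (sentiment_label, confidence) pairs into a summary dict."""
    if not posts:
        return {"total_posts": 0, "sentiment_breakdown": {}, "avg_confidence": 0.0}
    counts = Counter(label for label, _ in posts)
    total = len(posts)
    return {
        "total_posts": total,
        # Fraction of posts per sentiment label, e.g. {"positive": 0.6, ...}
        "sentiment_breakdown": {label: counts[label] / total for label in counts},
        "avg_confidence": sum(conf for _, conf in posts) / total,
    }
```

Guarding the empty case up front avoids a division by zero when a ticker has no recent Reddit activity.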
### 5.2 Service Layer (Business Logic)

```python
class SocialMediaService:
    """Business logic orchestration with LLM integration."""

    def __init__(
        self,
        repository: SocialRepository,
        reddit_client: RedditClient,
        openrouter_client: OpenRouterClient
    ):
        self.repository = repository
        self.reddit_client = reddit_client
        self.openrouter_client = openrouter_client

    async def collect_subreddit_posts(self, config: SocialJobConfig) -> int:
        """Orchestrate the complete collection process for configured subreddits."""

    async def update_post_sentiment(
        self,
        posts: List[SocialPost]
    ) -> List[SocialPost]:
        """Add sentiment analysis to posts using the OpenRouter LLM."""

    async def generate_embeddings(
        self,
        posts: List[SocialPost]
    ) -> List[SocialPost]:
        """Generate vector embeddings for semantic search."""

    async def find_trending_tickers(
        self,
        hours: int = 24
    ) -> List[Dict[str, Any]]:
        """Identify trending ticker mentions across subreddits."""
```
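The core of `find_trending_tickers` is a mention count over recently collected posts; a minimal sketch (illustrative only, assuming posts are dicts with a `tickers` list as stored in the repository):

```python
from collections import Counter
from typing import Any, Dict, List


def trending_tickers(posts: List[Dict[str, Any]], top_n: int = 5) -> List[Dict[str, Any]]:
    """Rank tickers by how many recent posts mention them."""
    mentions: Counter = Counter()
    for post in posts:
        # Count each post at most once per ticker, regardless of repeats in the text.
        mentions.update(set(post.get("tickers", [])))
    return [
        {"ticker": ticker, "mentions": count}
        for ticker, count in mentions.most_common(top_n)
    ]
```

Deduplicating per post keeps a single spammy post that repeats a symbol fifty times from dominating the trending list.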
### 5.3 Agent Integration Layer

```python
class SocialMediaAgentToolkit:
    """RAG methods for AI agent integration."""

    def __init__(self, service: SocialMediaService):
        self.service = service

    async def get_reddit_sentiment(
        self,
        ticker: str,
        days: int = 7
    ) -> Dict[str, Any]:
        """Get a sentiment summary for a ticker from Reddit discussions."""

    async def search_social_posts(
        self,
        query: str,
        ticker: Optional[str] = None
    ) -> List[Dict[str, Any]]:
        """Semantic search for relevant social media posts."""

    async def get_trending_discussions(
        self,
        ticker: str
    ) -> List[Dict[str, Any]]:
        """Get trending discussions and sentiment for a specific ticker."""

    async def get_subreddit_analysis(
        self,
        subreddit: str,
        ticker: str
    ) -> Dict[str, Any]:
        """Analyze sentiment and engagement for a ticker in a specific subreddit."""
```
**Agent Response Format:**
```json
{
  "posts": [
    {
      "post_id": "t3_abc123",
      "title": "AAPL earnings beat expectations",
      "subreddit": "stocks",
      "created_utc": "2024-01-15T14:30:00Z",
      "sentiment": {
        "sentiment": "positive",
        "confidence": 0.85,
        "reasoning": "Strong positive language about earnings"
      },
      "engagement": {
        "upvotes": 245,
        "comments_count": 67
      },
      "tickers": ["AAPL"],
      "url": "https://reddit.com/r/stocks/comments/abc123"
    }
  ],
  "summary": {
    "total_posts": 15,
    "sentiment_breakdown": {
      "positive": 0.6,
      "negative": 0.2,
      "neutral": 0.2
    },
    "avg_confidence": 0.78,
    "data_quality": "high"
  }
}
```

---
## 6. Dagster Pipeline Architecture

### 6.1 Scheduled Collection Pipeline

```python
@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
    config_schema=SocialJobConfig.schema()
)
def reddit_posts_collection(context: AssetExecutionContext) -> MaterializeResult:
    """Collect Reddit posts from financial subreddits."""

@asset(deps=[reddit_posts_collection])
def reddit_sentiment_analysis(context: AssetExecutionContext) -> MaterializeResult:
    """Add sentiment analysis to collected posts."""

@asset(deps=[reddit_sentiment_analysis])
def reddit_embeddings_generation(context: AssetExecutionContext) -> MaterializeResult:
    """Generate vector embeddings for semantic search."""

# Schedule: twice-daily collection
reddit_collection_schedule = ScheduleDefinition(
    name="reddit_collection_schedule",
    job=define_asset_job("reddit_collection", selection=[
        reddit_posts_collection,
        reddit_sentiment_analysis,
        reddit_embeddings_generation
    ]),
    cron_schedule="0 6,18 * * *"  # 6 AM and 6 PM UTC
)
```
### 6.2 Data Quality and Monitoring

**Collection Metrics:**
- Posts collected per subreddit per run
- Sentiment analysis success rate
- Embedding generation success rate
- API error rates and retry attempts

**Data Quality Checks:**
- Post deduplication verification
- Sentiment confidence distribution
- Embedding vector validation
- Reddit API rate limit utilization

**Failure Handling:**
- Best-effort processing: continue with remaining subreddits if one fails
- Exponential backoff for Reddit API rate limits
- Graceful degradation: store posts without sentiment/embeddings if the LLM fails
- Dead letter queue for failed posts with a retry mechanism

---
## 7. Testing Strategy

### 7.1 Test Structure

Following the project's pragmatic outside-in TDD approach:

```
tests/domains/socialmedia/
├── __init__.py
├── test_social_post.py           # Domain entity validation
├── test_social_repository.py     # PostgreSQL + vector operations
├── test_reddit_client.py         # PRAW integration with VCR
├── test_social_media_service.py  # Business logic with mocked deps
├── test_social_agent_toolkit.py  # Agent integration methods
└── fixtures/
    ├── reddit_responses.json     # Sample PRAW responses
    └── vcr_cassettes/            # HTTP cassettes for external APIs
```
### 7.2 Testing Approach

**Unit Tests (mock I/O boundaries):**
- `SocialPost` entity validation and transformations
- `SocialRepository` with a test PostgreSQL database
- `RedditClient` with mocked PRAW responses
- `SocialMediaService` with mocked dependencies

**Integration Tests (real components):**
- End-to-end collection pipeline with test Reddit data
- Vector similarity search with actual pgvectorscale
- LLM integration with pytest-vcr cassettes
- Dagster pipeline execution

**Performance Tests:**
- Vector similarity query performance (< 1s target)
- Batch upsert performance (< 5s for 1000 posts)
- Memory usage during large collection runs
### 7.3 Test Fixtures and Mocking

**Reddit API Mocking:**
```python
@pytest.fixture
def mock_reddit_response():
    """Sample Reddit API response for testing."""
    return {
        "id": "abc123",
        "title": "AAPL earnings discussion",
        "selftext": "Strong quarter, bullish outlook",
        "author": "test_user",
        "subreddit_display_name": "stocks",
        "created_utc": 1705315200,
        "score": 150,
        "upvote_ratio": 0.85,
        "num_comments": 45,
        "permalink": "/r/stocks/comments/abc123/aapl_earnings/"
    }
```

**Vector Similarity Testing:**
```python
@pytest.mark.asyncio
async def test_vector_similarity_search(social_repository, sample_posts):
    """Test semantic similarity search using pgvectorscale."""
    # Insert test posts with embeddings
    await social_repository.upsert_batch(sample_posts)

    # Test similarity search
    query_embedding = [0.1] * 1536  # Sample embedding
    similar_posts = await social_repository.find_similar_posts(
        query_embedding, limit=5
    )

    assert len(similar_posts) <= 5
    assert all(post.title_embedding for post in similar_posts)
```

---
## 8. Implementation Roadmap

### 8.1 Phase 1: Database Foundation (Week 1)

**Priority 1: Database Schema**
1. Create PostgreSQL migration for the `social_media_posts` table
2. Add TimescaleDB hypertable configuration
3. Set up pgvectorscale indexes for vector similarity
4. Implement data validation constraints

**Priority 2: Core Entities**
1. `SocialMediaPostEntity` (SQLAlchemy entity)
2. `SocialPost` (domain entity with validation)
3. `SentimentScore` (value object)
4. Entity transformation methods (`to_domain`, `from_domain`)

### 8.2 Phase 2: Data Collection (Week 2)

**Priority 1: Reddit Integration**
1. `RedditClient` with PRAW implementation
2. Rate limiting and error handling
3. Subreddit post collection methods
4. Reddit API authentication setup

**Priority 2: Repository Layer**
1. `SocialRepository` with PostgreSQL operations
2. Vector similarity search methods
3. Batch upsert operations
4. Sentiment aggregation queries

### 8.3 Phase 3: Processing & Intelligence (Week 3)

**Priority 1: Service Layer**
1. `SocialMediaService` business logic
2. OpenRouter LLM integration for sentiment
3. Vector embedding generation
4. Batch processing workflows

**Priority 2: Agent Integration**
1. `SocialMediaAgentToolkit` RAG methods
2. Structured response formatting
3. Context-aware social media analysis
4. Integration with existing agent workflows

### 8.4 Phase 4: Automation & Monitoring (Week 4)

**Priority 1: Dagster Pipeline**
1. Scheduled Reddit collection assets
2. Processing pipeline orchestration
3. Data quality monitoring
4. Error handling and retry logic

**Priority 2: Testing & Documentation**
1. Comprehensive test suite (>85% coverage)
2. Performance testing and optimization
3. API documentation updates
4. Integration with existing test infrastructure

---
## 9. Monitoring and Observability

### 9.1 Key Metrics

**Collection Metrics:**
- Posts collected per subreddit per day
- Collection job success/failure rates
- Reddit API rate limit utilization
- Data deduplication effectiveness

**Processing Metrics:**
- Sentiment analysis success rate and latency
- Embedding generation success rate and latency
- LLM token usage and costs
- Vector similarity query performance

**Business Metrics:**
- Active tickers with social sentiment data
- Sentiment distribution across subreddits
- Trending ticker detection accuracy
- Agent query response times

### 9.2 Alerting Strategy

**Critical Alerts:**
- Collection job failures (> 2 consecutive failures)
- Reddit API authentication errors
- Database connection failures
- High LLM processing error rates (> 20%)

**Warning Alerts:**
- Low collection volumes (< 50% of expected)
- High sentiment analysis latency (> 30s per batch)
- Vector similarity performance degradation
- Approaching Reddit API rate limits

### 9.3 Logging and Debugging

**Structured Logging Format:**
```json
{
  "timestamp": "2024-01-15T14:30:00Z",
  "level": "INFO",
  "component": "SocialMediaService",
  "operation": "collect_subreddit_posts",
  "subreddit": "stocks",
  "posts_collected": 45,
  "sentiment_analyzed": 43,
  "embeddings_generated": 41,
  "duration_ms": 12500,
  "metadata": {
    "reddit_api_calls": 3,
    "llm_tokens_used": 15420
  }
}
```

---
## 10. Security and Compliance

### 10.1 Data Privacy

**Reddit Data Handling:**
- Store only publicly available Reddit posts
- Respect user privacy: hash usernames for analytics
- Implement data retention policies (90-day maximum)
- No collection of private or deleted content

**API Key Management:**
- Environment variable storage for Reddit credentials
- OpenRouter API key rotation support
- No credential logging or plain-text persistence
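The username-hashing policy above could be implemented as a salted one-way hash, so analytics can group posts by author without storing the Reddit username (the function name and prefix format are illustrative assumptions; the salt would come from deployment configuration):

```python
import hashlib


def hash_username(username: str, salt: str) -> str:
    """One-way, salted hash of a Reddit username for analytics use."""
    digest = hashlib.sha256(f"{salt}:{username}".encode("utf-8")).hexdigest()
    # A short stable identifier is enough for grouping; it is not reversible.
    return f"u_{digest[:16]}"
```

A deployment-specific salt prevents trivially reversing the hash by hashing known usernames, while keeping the mapping stable within one deployment.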
### 10.2 Rate Limiting Compliance

**Reddit API Compliance:**
- Respect the 60 requests per minute OAuth limit
- Implement exponential backoff for rate limit violations
- User-Agent string identification as required
- Monitor and log API usage statistics

**OpenRouter Usage:**
- Monitor token usage and costs
- Implement request batching for efficiency
- Handle API rate limits gracefully
- Cost optimization through model selection

---
## 11. Future Enhancements

### 11.1 Extended Social Media Sources

**Twitter/X Integration:**
- Similar architecture pattern for Twitter API v2
- Real-time streaming for high-frequency updates
- Hashtag and mention tracking

**News Comment Sections:**
- Integration with financial news comment sections
- Cross-platform sentiment correlation
- Enhanced context for news articles

### 11.2 Advanced Analytics

**Sentiment Trend Analysis:**
- Time-series sentiment tracking
- Volatility correlation with social sentiment
- Predictive sentiment modeling

**Influence Network Analysis:**
- User influence scoring based on engagement
- Community detection within financial subreddits
- Viral content identification and tracking

### 11.3 Real-time Processing

**Streaming Architecture:**
- Real-time Reddit post collection
- Event-driven sentiment processing
- Live sentiment dashboards for agents

**Market Hours Integration:**
- Increased collection frequency during market hours
- After-hours sentiment tracking
- Weekend vs. weekday sentiment patterns

---

This technical design provides a comprehensive blueprint for implementing the complete Social Media domain from empty stubs to a production-ready system. The architecture leverages proven patterns from the news domain while introducing specialized capabilities for social media data collection, semantic search, and AI agent integration.
@@ -0,0 +1,6 @@

{
  "raw_user_story": "a) As a dagster job I want to scrape specific sub reddits from market sentiment b) As an AI Agent I want to get relavent social media data about a specific ticker or market",
  "raw_criteria": "a) All reddit posts are stored with sentiment analysis in db b) Agents can get RAG data from db",
  "raw_rules": "updated daily",
  "raw_scope": "Included: reddit. Excluded: Other social media platforms beyond Reddit."
}
@@ -0,0 +1,105 @@

# Social Media Domain - Specification Lite

## Summary
Complete implementation of social media data collection from Reddit with LLM sentiment analysis and vector embeddings for AI agent RAG integration.

## Core Requirements

### Data Collection
- **Daily Reddit collection** from financial subreddits (wallstreetbets, investing, stocks, SecurityAnalysis)
- **OpenRouter LLM sentiment analysis** with confidence scoring
- **Vector embeddings** for semantic similarity search
- **PostgreSQL storage** with TimescaleDB + pgvectorscale optimization

### Agent Integration
- **AgentToolkit methods**: `get_reddit_news()` and `get_reddit_stock_info()`
- **RAG-enhanced queries** with < 2 second response time
- **Vector similarity search** for contextual social media insights
## Technical Implementation

### Architecture Pattern
**Router → Service → Repository → Entity → Database** (matching the news domain)

### Database Schema
```sql
social_media_posts (
    post_id, ticker, subreddit, title, content, author,
    created_at, upvotes, comment_count,
    sentiment_score, sentiment_label, sentiment_confidence,
    embedding vector(1536),  -- pgvectorscale
    data_quality_score, processing_status
)
```
### Key Components

#### 1. RedditClient
- PRAW integration with rate limiting
- Financial subreddit targeting
- Ticker-specific post filtering

#### 2. SentimentAnalyzer
- OpenRouter LLM integration
- Structured sentiment scoring (-1.0 to +1.0)
- Financial context awareness

#### 3. SocialRepository
- PostgreSQL with deduplication by post_id
- Vector similarity search using pgvectorscale
- TimescaleDB time-series optimization

#### 4. SocialMediaService
- Orchestrates the collection pipeline: Reddit → Sentiment → Embeddings → Storage
- Provides ticker-specific social context
- Calculates aggregate sentiment metrics

#### 5. AgentToolkit Integration
```python
async def get_reddit_news(ticker: str, days: int = 7) -> str:
    """Return formatted social media context with sentiment analysis."""

async def get_reddit_stock_info(ticker: str, query: Optional[str] = None) -> str:
    """Return semantic search results with sentiment aggregation."""
```
## Implementation Scope

### Complete Implementation ✅
- PostgreSQL migration from file storage
- Reddit API client (currently an empty stub)
- SQLAlchemy entities with vector fields
- LLM sentiment analysis pipeline
- Vector embedding generation and search
- Dagster pipeline for scheduled collection
- Comprehensive test coverage (pytest-vcr for APIs)

### Current Status
**Basic stub implementation** - requires a complete rebuild of all components

### Dependencies
- Reddit API credentials (PRAW)
- OpenRouter API access
- PostgreSQL with TimescaleDB + pgvectorscale
- Existing TradingAgentsConfig
- News domain patterns for consistency
## Data Flow
1. **Dagster pipeline** triggers daily collection
2. **RedditClient** fetches posts from financial subreddits
3. **SentimentAnalyzer** processes posts via OpenRouter LLM
4. **EmbeddingGenerator** creates vector embeddings
5. **SocialRepository** stores posts in PostgreSQL with deduplication
6. **AI Agents** query via AgentToolkit with RAG-enhanced context
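
The six steps above can be sketched as a single collection pass. This is a minimal illustration of the control flow only; `posts_source`, `analyze`, `embed`, and `store` below are stand-in stubs for the real RedditClient, SentimentAnalyzer, EmbeddingGenerator, and SocialRepository components, not actual project code.

```python
# Minimal sketch of the daily collection pass. Each stage is injected as a
# callable so the flow (fetch -> sentiment -> embedding -> storage) is visible
# without any external dependencies.
def run_collection(posts_source, analyze, embed, store):
    stored = []
    for raw in posts_source():
        score, label, confidence = analyze(raw["title"])
        vector = embed(raw["title"])
        stored.append(
            store({**raw, "sentiment": (score, label, confidence), "embedding": vector})
        )
    return stored

# Stand-in stages, for illustration only
fetched = run_collection(
    posts_source=lambda: [{"post_id": "abc", "title": "AAPL earnings beat"}],
    analyze=lambda text: (0.6, "positive", 0.9),
    embed=lambda text: [0.0] * 4,  # real embeddings are 1536-dimensional
    store=lambda post: post["post_id"],
)
print(fetched)  # ['abc']
```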

## Testing Strategy
- **pytest-vcr** for Reddit API mocking
- **Real PostgreSQL** for repository integration tests
- **Service mocks** for business logic testing
- **85%+ coverage** matching project standards

## Success Criteria
- Daily automated Reddit collection with sentiment analysis
- Sub-2-second agent queries with vector search
- Seamless RAG integration matching news domain patterns
- Production-ready reliability with comprehensive error handling

---

```json
{
  "feature": "socialmedia",
  "user_story": "As a Dagster pipeline, I want to collect Reddit posts from financial subreddits with LLM sentiment analysis and vector embeddings, so that AI Agents can access comprehensive social media context for ticker-specific trading decisions through RAG-powered queries",
  "acceptance_criteria": [
    "GIVEN a scheduled Dagster pipeline WHEN it executes daily THEN it collects Reddit posts from configured financial subreddits without manual intervention",
    "GIVEN Reddit posts are collected WHEN processed THEN they are stored in PostgreSQL with TimescaleDB optimization and vector embeddings for semantic search",
    "GIVEN social media posts WHEN processed THEN each post receives OpenRouter LLM sentiment analysis with structured scores (positive/negative/neutral with confidence)",
    "GIVEN a ticker symbol WHEN AI agents request social context THEN they receive relevant Reddit posts with sentiment scores and vector similarity ranking within 2 seconds",
    "GIVEN social media data WHEN agents query THEN AgentToolkit provides RAG-enhanced context including post content, sentiment trends, and engagement metrics"
  ],
  "business_rules": [
    "Daily automated collection from configured financial subreddits (wallstreetbets, investing, stocks, SecurityAnalysis)",
    "OpenRouter LLM sentiment analysis for all posts with confidence scoring",
    "Vector embeddings generation for semantic similarity search",
    "Post deduplication by Reddit post ID to prevent duplicates",
    "Rate limiting compliance with Reddit API terms of service",
    "Data retention policy: 90 days for social media posts",
    "Best effort processing: API failures or rate limits don't block other posts"
  ],
  "scope": {
    "included": [
      "Complete socialmedia domain implementation from stub to production",
      "PostgreSQL migration from current file-based storage",
      "Reddit API integration using PRAW (Python Reddit API Wrapper)",
      "OpenRouter LLM sentiment analysis integration",
      "Vector embeddings generation and similarity search",
      "AgentToolkit integration with get_reddit_news and get_reddit_stock_info methods",
      "Dagster pipeline for scheduled daily collection",
      "SQLAlchemy entities with TimescaleDB and pgvectorscale support",
      "Comprehensive test coverage with pytest-vcr for API mocking"
    ],
    "excluded": [
      "Other social media platforms beyond Reddit (Twitter, LinkedIn, etc.)",
      "Real-time social media streaming (batch processing only)",
      "Custom sentiment models (use OpenRouter LLMs only)",
      "Social media influence scoring or user reputation tracking",
      "Multi-language post support (English only)",
      "Historical Reddit data backfilling beyond 30 days"
    ]
  },
  "current_implementation_status": "Basic stub implementation - requires complete rebuild",
  "missing_components": [
    "PostgreSQL database migration from file storage",
    "Reddit API client implementation (RedditClient is empty stub)",
    "SQLAlchemy entity models for social posts with vector fields",
    "LLM sentiment analysis integration via OpenRouter",
    "Vector embedding generation and similarity search",
    "AgentToolkit RAG methods (get_reddit_news, get_reddit_stock_info)",
    "Dagster pipeline for scheduled data collection",
    "Comprehensive test suite with domain-specific patterns"
  ],
  "existing_stub_components": [
    "SocialMediaService with empty method stubs",
    "SocialRepository with file-based JSON storage",
    "Basic data models: SocialPost, PostData, SocialContext",
    "Empty RedditClient class requiring full implementation",
    "Agent references to social methods (not yet implemented)"
  ],
  "aligns_with": "Multi-agent trading framework vision - provides social sentiment context for comprehensive market analysis alongside news and market data",
  "dependencies": [
    "PRAW (Python Reddit API Wrapper) for Reddit API access",
    "OpenRouter API for LLM sentiment analysis",
    "PostgreSQL with TimescaleDB and pgvectorscale extensions",
    "Existing database infrastructure from news domain",
    "OpenRouter configuration in TradingAgentsConfig",
    "Dagster orchestration framework for scheduled execution"
  ],
  "technical_details": {
    "architecture_pattern": "Router → Service → Repository → Entity → Database (matching news domain)",
    "database_integration": "PostgreSQL + TimescaleDB + pgvectorscale (consistent with news domain)",
    "llm_integration": "OpenRouter unified provider with two-tier model strategy",
    "vector_storage": "1536-dimension embeddings using pgvectorscale (consistent with news)",
    "api_integration": "PRAW (Python Reddit API Wrapper) with rate limiting and error handling",
    "testing_strategy": "pytest-vcr for HTTP mocking, real PostgreSQL for repository tests, service mocks for business logic"
  },
  "implementation_approach": "Complete domain implementation following successful news domain patterns: database migration → entity models → Reddit client → repository → service → AgentToolkit → Dagster pipeline",
  "reference_implementations": {
    "news_domain_patterns": "Follow NewsService, NewsRepository, NewsArticleEntity patterns for consistency",
    "database_schema": "Mirror NewsArticleEntity vector embedding approach for social posts",
    "agent_integration": "Follow existing AgentToolkit get_news() pattern for social media methods",
    "testing_approach": "Apply news domain testing patterns: VCR for API, real DB for repositories"
  },
  "success_criteria": {
    "functionality": "Daily Reddit collection with sentiment analysis and vector search",
    "performance": "< 2 second social context queries, < 100ms repository operations",
    "quality": "85%+ test coverage, comprehensive error handling",
    "integration": "Seamless AgentToolkit RAG integration for AI agents",
    "consistency": "Architecture and patterns match successful news domain implementation"
  }
}
```

---

# Social Media Domain Specification

## Feature Overview

**Complete implementation of social media data collection and analysis** - Transform the current stub implementation into a production-ready social media domain that provides comprehensive Reddit sentiment analysis for trading agents.

### User Story

As a Dagster pipeline, I want to collect Reddit posts from financial subreddits with LLM sentiment analysis and vector embeddings, so that AI Agents can access comprehensive social media context for ticker-specific trading decisions through RAG-powered queries.

## Acceptance Criteria

### Daily Data Collection
- **GIVEN** a scheduled Dagster pipeline **WHEN** it executes daily **THEN** it collects Reddit posts from configured financial subreddits without manual intervention
- **GIVEN** Reddit posts are collected **WHEN** processed **THEN** they are stored in PostgreSQL with TimescaleDB optimization and vector embeddings for semantic search

### LLM Sentiment Analysis
- **GIVEN** social media posts **WHEN** processed **THEN** each post receives OpenRouter LLM sentiment analysis with structured scores (positive/negative/neutral with confidence)

### Agent Integration
- **GIVEN** a ticker symbol **WHEN** AI agents request social context **THEN** they receive relevant Reddit posts with sentiment scores and vector similarity ranking within 2 seconds
- **GIVEN** social media data **WHEN** agents query **THEN** AgentToolkit provides RAG-enhanced context including post content, sentiment trends, and engagement metrics

## Business Rules and Constraints

### Data Collection Rules
1. **Daily automated collection** from configured financial subreddits (wallstreetbets, investing, stocks, SecurityAnalysis)
2. **OpenRouter LLM sentiment analysis** for all posts with confidence scoring
3. **Vector embeddings generation** for semantic similarity search
4. **Post deduplication** by Reddit post ID to prevent duplicates
5. **Rate limiting compliance** with Reddit API terms of service

### Data Management
1. **Data retention policy**: 90 days for social media posts
2. **Best effort processing**: API failures or rate limits don't block other posts
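
With TimescaleDB, the 90-day retention rule can be enforced by the database itself rather than by application code. The sketch below only builds the policy statement as a string; in practice it would be executed once at migration time against the `social_media_posts` hypertable. The helper name is illustrative.

```python
# Sketch: express the 90-day retention rule as a TimescaleDB policy.
# add_retention_policy() schedules automatic dropping of chunks older than
# the given interval on a hypertable.
RETENTION_DAYS = 90

def retention_policy_sql(table: str = "social_media_posts") -> str:
    """Build the one-time statement that registers the retention policy."""
    return f"SELECT add_retention_policy('{table}', INTERVAL '{RETENTION_DAYS} days');"

print(retention_policy_sql())
```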

## Scope Definition

### Included Features ✅
- Complete socialmedia domain implementation from stub to production
- PostgreSQL migration from current file-based storage
- Reddit API integration using PRAW (Python Reddit API Wrapper)
- OpenRouter LLM sentiment analysis integration
- Vector embeddings generation and similarity search
- AgentToolkit integration with `get_reddit_news` and `get_reddit_stock_info` methods
- Dagster pipeline for scheduled daily collection
- SQLAlchemy entities with TimescaleDB and pgvectorscale support
- Comprehensive test coverage with pytest-vcr for API mocking

### Excluded Features ❌
- Other social media platforms beyond Reddit (Twitter, LinkedIn, etc.)
- Real-time social media streaming (batch processing only)
- Custom sentiment models (use OpenRouter LLMs only)
- Social media influence scoring or user reputation tracking
- Multi-language post support (English only)
- Historical Reddit data backfilling beyond 30 days

## Technical Implementation Details

### Architecture Pattern
**Router → Service → Repository → Entity → Database** (matching news domain)

### Current Implementation Status
**Basic stub implementation - requires complete rebuild**

### Missing Components
1. PostgreSQL database migration from file storage
2. Reddit API client implementation (RedditClient is an empty stub)
3. SQLAlchemy entity models for social posts with vector fields
4. LLM sentiment analysis integration via OpenRouter
5. Vector embedding generation and similarity search
6. AgentToolkit RAG methods (`get_reddit_news`, `get_reddit_stock_info`)
7. Dagster pipeline for scheduled data collection
8. Comprehensive test suite with domain-specific patterns

### Existing Stub Components
- SocialMediaService with empty method stubs
- SocialRepository with file-based JSON storage
- Basic data models: SocialPost, PostData, SocialContext
- Empty RedditClient class requiring full implementation
- Agent references to social methods (not yet implemented)

## Database Integration

### PostgreSQL Schema Design
```sql
-- Social media posts table with TimescaleDB optimization
CREATE TABLE social_media_posts (
    id SERIAL,
    post_id VARCHAR(50) NOT NULL,            -- Reddit post ID
    ticker VARCHAR(10),                      -- Associated ticker
    subreddit VARCHAR(50) NOT NULL,          -- Source subreddit
    title TEXT NOT NULL,                     -- Post title
    content TEXT,                            -- Post content
    author VARCHAR(50),                      -- Reddit username
    created_at TIMESTAMPTZ NOT NULL,         -- Post creation time
    collected_at TIMESTAMPTZ DEFAULT NOW(), -- Data collection time
    upvotes INTEGER DEFAULT 0,               -- Reddit upvotes
    downvotes INTEGER DEFAULT 0,             -- Reddit downvotes
    comment_count INTEGER DEFAULT 0,         -- Number of comments
    url TEXT,                                -- Reddit URL
    permalink TEXT,                          -- Reddit permalink

    -- Sentiment analysis fields
    sentiment_score DECIMAL(3,2),            -- -1.0 to +1.0
    sentiment_label VARCHAR(20),             -- positive/negative/neutral
    sentiment_confidence DECIMAL(3,2),       -- 0.0 to 1.0

    -- Vector embeddings
    embedding vector(1536),                  -- pgvectorscale embedding

    -- Metadata
    data_quality_score DECIMAL(3,2) DEFAULT 1.0,
    processing_status VARCHAR(20) DEFAULT 'pending',
    error_message TEXT,

    -- Hypertable constraint: unique keys must include the partition column
    PRIMARY KEY (id, created_at),
    UNIQUE (post_id, created_at)
);

-- TimescaleDB hypertable for time-series optimization
SELECT create_hypertable('social_media_posts', 'created_at');

-- Vector similarity index (pgvectorscale StreamingDiskANN)
CREATE INDEX idx_social_posts_embedding ON social_media_posts USING diskann (embedding vector_cosine_ops);

-- Performance indexes
CREATE INDEX idx_social_posts_ticker ON social_media_posts (ticker, created_at DESC);
CREATE INDEX idx_social_posts_subreddit ON social_media_posts (subreddit, created_at DESC);
CREATE INDEX idx_social_posts_sentiment ON social_media_posts (sentiment_label, sentiment_score);
```

### Entity Model
```python
# tradingagents/domains/socialmedia/entities.py
from typing import Optional

from pgvector.sqlalchemy import Vector  # sqlalchemy.dialects.postgresql has no VECTOR type
from sqlalchemy import Column, DECIMAL, Integer, String, TIMESTAMP, Text, text

from tradingagents.database import Base


class SocialMediaPostEntity(Base):
    __tablename__ = 'social_media_posts'

    id = Column(Integer, primary_key=True)
    post_id = Column(String(50), unique=True, nullable=False)
    ticker = Column(String(10), index=True)
    subreddit = Column(String(50), nullable=False, index=True)
    title = Column(Text, nullable=False)
    content = Column(Text)
    author = Column(String(50))
    created_at = Column(TIMESTAMP(timezone=True), nullable=False, index=True)
    collected_at = Column(TIMESTAMP(timezone=True), server_default=text('NOW()'))
    upvotes = Column(Integer, default=0)
    downvotes = Column(Integer, default=0)
    comment_count = Column(Integer, default=0)
    url = Column(Text)
    permalink = Column(Text)

    # Sentiment analysis
    sentiment_score = Column(DECIMAL(3, 2))
    sentiment_label = Column(String(20))
    sentiment_confidence = Column(DECIMAL(3, 2))

    # Vector embeddings
    embedding = Column(Vector(1536))

    # Metadata
    data_quality_score = Column(DECIMAL(3, 2), default=1.0)
    processing_status = Column(String(20), default='pending')
    error_message = Column(Text)

    def to_domain(self) -> 'SocialPost':
        """Convert entity to domain model."""
        # "is not None" checks keep a legitimate 0.0 score from becoming None
        return SocialPost(
            post_id=self.post_id,
            ticker=self.ticker,
            subreddit=self.subreddit,
            title=self.title,
            content=self.content,
            author=self.author,
            created_at=self.created_at,
            upvotes=self.upvotes,
            downvotes=self.downvotes,
            comment_count=self.comment_count,
            url=self.url,
            sentiment_score=float(self.sentiment_score) if self.sentiment_score is not None else None,
            sentiment_label=self.sentiment_label,
            sentiment_confidence=float(self.sentiment_confidence) if self.sentiment_confidence is not None else None,
        )

    @classmethod
    def from_domain(cls, post: 'SocialPost', embedding: Optional[list] = None) -> 'SocialMediaPostEntity':
        """Create entity from domain model."""
        return cls(
            post_id=post.post_id,
            ticker=post.ticker,
            subreddit=post.subreddit,
            title=post.title,
            content=post.content,
            author=post.author,
            created_at=post.created_at,
            upvotes=post.upvotes,
            downvotes=post.downvotes,
            comment_count=post.comment_count,
            url=post.url,
            sentiment_score=post.sentiment_score,
            sentiment_label=post.sentiment_label,
            sentiment_confidence=post.sentiment_confidence,
            embedding=embedding,
        )
```

## Reddit API Integration

### RedditClient Implementation
```python
# tradingagents/domains/socialmedia/clients.py
import logging
from datetime import datetime
from typing import Any, Dict, List, Optional

import praw

from tradingagents.config import TradingAgentsConfig

logger = logging.getLogger(__name__)


class RedditClient:
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        self.reddit = praw.Reddit(
            client_id=config.reddit_client_id,
            client_secret=config.reddit_client_secret,
            user_agent=config.reddit_user_agent,
        )

    async def fetch_financial_posts(
        self,
        subreddits: List[str],
        ticker: Optional[str] = None,
        limit: int = 100,
        time_filter: str = "day",
    ) -> List[Dict[str, Any]]:
        """Fetch financial posts from the specified subreddits."""
        posts = []

        for subreddit_name in subreddits:
            try:
                subreddit = self.reddit.subreddit(subreddit_name)
                # top() honours time_filter; hot() would silently ignore it
                submissions = subreddit.top(time_filter=time_filter, limit=limit)

                for submission in submissions:
                    # Filter by ticker if specified
                    if ticker and ticker.upper() not in submission.title.upper():
                        continue

                    post_data = {
                        'post_id': submission.id,
                        'subreddit': subreddit_name,
                        'title': submission.title,
                        'content': submission.selftext,
                        'author': str(submission.author) if submission.author else '[deleted]',
                        'created_at': datetime.fromtimestamp(submission.created_utc),
                        'upvotes': submission.ups,
                        'downvotes': submission.downs,  # always 0 in the modern Reddit API
                        'comment_count': submission.num_comments,
                        'url': submission.url,
                        'permalink': submission.permalink,
                    }
                    posts.append(post_data)

            except Exception as e:
                # Log the error but continue processing other subreddits
                logger.warning("Error fetching from r/%s: %s", subreddit_name, e)
                continue

        return posts
```
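
The ticker check above is a plain substring match, so a short symbol such as `ALL` would also fire inside ordinary words. A stricter word-boundary matcher could look like the sketch below; this helper is illustrative and not part of the current stub.

```python
import re

# Match a ticker only as a standalone token or cashtag ($AAPL), so symbols
# like "A" or "ALL" do not match inside unrelated words. Illustrative only.
def mentions_ticker(text: str, ticker: str) -> bool:
    pattern = rf"(?<![A-Za-z0-9])\$?{re.escape(ticker)}(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(mentions_ticker("Thoughts on $AAPL earnings?", "AAPL"))  # True
print(mentions_ticker("Totally unrelated post", "ALL"))        # False
```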

## LLM Sentiment Analysis

### OpenRouter Integration
```python
# tradingagents/domains/socialmedia/services.py
import json
from typing import Tuple

import openai

from tradingagents.config import TradingAgentsConfig


class SentimentAnalyzer:
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        # AsyncOpenAI so chat.completions.create can be awaited below
        self.client = openai.AsyncOpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=config.openrouter_api_key,
        )

    async def analyze_sentiment(self, text: str) -> Tuple[float, str, float]:
        """
        Analyze the sentiment of a social media post.

        Returns: (score, label, confidence)
        """
        prompt = f"""
        Analyze the financial sentiment of this social media post.

        Post: "{text}"

        Return sentiment as JSON with:
        - score: float from -1.0 (very negative) to +1.0 (very positive)
        - label: "positive", "negative", or "neutral"
        - confidence: float from 0.0 to 1.0 indicating confidence

        Focus on financial and trading sentiment, not general sentiment.
        """

        try:
            response = await self.client.chat.completions.create(
                model=self.config.quick_think_llm,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=100,
                temperature=0.1,
            )

            result = json.loads(response.choices[0].message.content)
            return result['score'], result['label'], result['confidence']

        except Exception:
            # Return neutral sentiment on error
            return 0.0, "neutral", 0.0
```
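
The model's reply is not guaranteed to be valid JSON or in range, so the parsing step deserves its own guard. A defensive parser could look like this sketch; the clamping and label normalisation are assumptions about desired behaviour, not existing code.

```python
import json

# Defensively parse the model's sentiment JSON: tolerate malformed output,
# clamp out-of-range numbers, and normalise unknown labels to "neutral".
def parse_sentiment(raw: str) -> tuple[float, str, float]:
    try:
        data = json.loads(raw)
        score = max(-1.0, min(1.0, float(data["score"])))
        confidence = max(0.0, min(1.0, float(data["confidence"])))
        label = data.get("label")
        if label not in {"positive", "negative", "neutral"}:
            label = "neutral"
        return score, label, confidence
    except (ValueError, KeyError, TypeError):
        return 0.0, "neutral", 0.0

print(parse_sentiment('{"score": 1.7, "label": "positive", "confidence": 0.9}'))
# (1.0, 'positive', 0.9) — the out-of-range score is clamped
```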

## Vector Embeddings and Search

### Embedding Generation
```python
# tradingagents/domains/socialmedia/embeddings.py
import logging
from typing import Any, Dict, List, Optional

import openai

from tradingagents.config import TradingAgentsConfig

logger = logging.getLogger(__name__)


class EmbeddingGenerator:
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        # AsyncOpenAI so embeddings.create can be awaited below
        self.client = openai.AsyncOpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=config.openrouter_api_key,
        )

    async def generate_embedding(self, text: str) -> Optional[List[float]]:
        """Generate a vector embedding for text."""
        try:
            response = await self.client.embeddings.create(
                model="text-embedding-3-small",
                input=text,
                encoding_format="float",
            )
            return response.data[0].embedding
        except Exception as e:
            logger.warning("Embedding generation failed: %s", e)
            return None

    def prepare_text_for_embedding(self, post: Dict[str, Any]) -> str:
        """Combine title and content for embedding."""
        title = post.get('title', '')
        content = post.get('content', '')
        return f"{title} {content}".strip()
```
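
Embedding providers generally accept a list of inputs per request, so posts are cheaper to embed in batches than one call at a time. A small batching helper could look like this; it is a purely illustrative utility, and the batch size is an assumption rather than a provider limit.

```python
# Split texts into fixed-size batches so a single embeddings call can cover
# many posts at once. Illustrative helper; batch size is an assumption.
def batch_texts(texts: list[str], batch_size: int = 64) -> list[list[str]]:
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

print(batch_texts(["a", "b", "c"], batch_size=2))  # [['a', 'b'], ['c']]
```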

## Repository Implementation

### SocialRepository with PostgreSQL
```python
# tradingagents/domains/socialmedia/repositories.py
from datetime import datetime, timedelta
from typing import List, Optional

from sqlalchemy import and_, desc

from tradingagents.database import get_db_session
from tradingagents.domains.socialmedia.entities import SocialMediaPostEntity
from tradingagents.domains.socialmedia.models import SocialPost


class SocialRepository:
    def __init__(self):
        self.session = get_db_session()

    async def save_posts(self, posts: List[SocialPost]) -> List[str]:
        """Save social media posts with deduplication."""
        saved_ids = []

        for post in posts:
            # Check for an existing post (INSERT ... ON CONFLICT DO NOTHING
            # would avoid this extra round trip per post)
            existing = self.session.query(SocialMediaPostEntity).filter(
                SocialMediaPostEntity.post_id == post.post_id
            ).first()

            if existing:
                continue  # Skip duplicates

            entity = SocialMediaPostEntity.from_domain(post)
            self.session.add(entity)
            saved_ids.append(post.post_id)

        self.session.commit()
        return saved_ids

    async def get_posts_for_ticker(
        self,
        ticker: str,
        days: int = 7,
        limit: int = 50,
    ) -> List[SocialPost]:
        """Get social media posts for a specific ticker."""
        cutoff_date = datetime.now() - timedelta(days=days)

        results = self.session.query(SocialMediaPostEntity).filter(
            and_(
                SocialMediaPostEntity.ticker == ticker,
                SocialMediaPostEntity.created_at >= cutoff_date,
            )
        ).order_by(desc(SocialMediaPostEntity.created_at)).limit(limit).all()

        return [entity.to_domain() for entity in results]

    async def vector_similarity_search(
        self,
        query_embedding: List[float],
        ticker: Optional[str] = None,
        limit: int = 10,
    ) -> List[SocialPost]:
        """Find similar posts using vector search."""
        query = self.session.query(SocialMediaPostEntity)

        if ticker:
            query = query.filter(SocialMediaPostEntity.ticker == ticker)

        # Vector similarity via pgvector's column comparator; interpolating the
        # embedding into a raw SQL string would be fragile and injectable
        query = query.order_by(
            SocialMediaPostEntity.embedding.cosine_distance(query_embedding)
        ).limit(limit)

        results = query.all()
        return [entity.to_domain() for entity in results]
```


## Service Layer

### SocialMediaService
```python
# tradingagents/domains/socialmedia/services.py
from typing import Any, Dict, List, Optional

from tradingagents.config import TradingAgentsConfig
from tradingagents.domains.socialmedia.clients import RedditClient
from tradingagents.domains.socialmedia.embeddings import EmbeddingGenerator
from tradingagents.domains.socialmedia.models import SocialContext, SocialPost
from tradingagents.domains.socialmedia.repositories import SocialRepository


class SocialMediaService:
    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        self.repository = SocialRepository()
        self.reddit_client = RedditClient(config)
        self.sentiment_analyzer = SentimentAnalyzer(config)  # defined above in this module
        self.embedding_generator = EmbeddingGenerator(config)

    async def collect_social_data(
        self,
        ticker: Optional[str] = None,
        subreddits: Optional[List[str]] = None,
    ) -> SocialContext:
        """Main entry point for social media data collection."""

        if not subreddits:
            subreddits = ['wallstreetbets', 'investing', 'stocks', 'SecurityAnalysis']

        # Fetch posts from Reddit
        raw_posts = await self.reddit_client.fetch_financial_posts(
            subreddits=subreddits,
            ticker=ticker,
            limit=100,
        )

        # Process posts: sentiment analysis + embeddings
        processed_posts = []
        for raw_post in raw_posts:
            # Generate sentiment
            text = f"{raw_post['title']} {raw_post['content']}"
            score, label, confidence = await self.sentiment_analyzer.analyze_sentiment(text)

            # Generate embedding (must also be handed to the repository via
            # SocialMediaPostEntity.from_domain so it is actually persisted)
            embedding = await self.embedding_generator.generate_embedding(text)

            post = SocialPost(
                **raw_post,
                sentiment_score=score,
                sentiment_label=label,
                sentiment_confidence=confidence,
            )
            processed_posts.append(post)

        # Save to database
        await self.repository.save_posts(processed_posts)

        # Return context
        return SocialContext(
            posts=processed_posts,
            ticker=ticker,
            total_posts=len(processed_posts),
            sentiment_summary=self._calculate_sentiment_summary(processed_posts),
        )

    def _calculate_sentiment_summary(self, posts: List[SocialPost]) -> Dict[str, Any]:
        """Calculate aggregate sentiment metrics."""
        if not posts:
            return {}

        scores = [p.sentiment_score for p in posts if p.sentiment_score is not None]
        labels = [p.sentiment_label for p in posts if p.sentiment_label]

        return {
            'avg_sentiment': sum(scores) / len(scores) if scores else 0.0,
            'positive_count': labels.count('positive'),
            'negative_count': labels.count('negative'),
            'neutral_count': labels.count('neutral'),
            'total_posts': len(posts),
        }
```


## AgentToolkit Integration

### RAG-Enhanced Methods
```python
# tradingagents/agents/libs/agent_toolkit.py (additions)

async def get_reddit_news(self, ticker: str, days: int = 7) -> str:
    """Get Reddit posts related to a ticker with RAG context."""
    try:
        # Get recent posts for the ticker
        posts = await self.social_service.repository.get_posts_for_ticker(
            ticker=ticker,
            days=days,
            limit=20,
        )

        if not posts:
            return f"No Reddit posts found for {ticker} in the last {days} days."

        # Format for agent consumption
        context = f"Reddit Social Media Context for {ticker} ({len(posts)} posts):\n\n"

        for post in posts[:10]:  # Limit to the top 10
            sentiment_emoji = {"positive": "📈", "negative": "📉", "neutral": "➡️"}.get(post.sentiment_label, "")
            context += f"{sentiment_emoji} r/{post.subreddit} - {post.title}\n"
            context += f"   Sentiment: {post.sentiment_label} ({post.sentiment_score:.2f})\n"
            context += f"   Engagement: {post.upvotes} upvotes, {post.comment_count} comments\n"
            if post.content:
                context += f"   Content: {post.content[:200]}...\n"
            context += "\n"

        return context

    except Exception as e:
        return f"Error fetching Reddit data for {ticker}: {str(e)}"

async def get_reddit_stock_info(self, ticker: str, query: Optional[str] = None) -> str:
    """Get Reddit stock information with semantic search."""
    try:
        if query:
            # Generate an embedding for semantic search
            query_embedding = await self.social_service.embedding_generator.generate_embedding(query)
            if query_embedding:
                posts = await self.social_service.repository.vector_similarity_search(
                    query_embedding=query_embedding,
                    ticker=ticker,
                    limit=10,
                )
            else:
                posts = await self.social_service.repository.get_posts_for_ticker(ticker, days=7)
        else:
            posts = await self.social_service.repository.get_posts_for_ticker(ticker, days=7)

        if not posts:
            return f"No relevant Reddit discussions found for {ticker}."

        # Aggregate sentiment and key insights
        sentiment_summary = self.social_service._calculate_sentiment_summary(posts)

        context = f"Reddit Stock Analysis for {ticker}:\n\n"
        context += f"Overall Sentiment: {sentiment_summary.get('avg_sentiment', 0):.2f} (range -1.0 to +1.0)\n"
        context += f"Posts: {sentiment_summary.get('positive_count', 0)} positive, "
        context += f"{sentiment_summary.get('negative_count', 0)} negative, "
        context += f"{sentiment_summary.get('neutral_count', 0)} neutral\n\n"

        context += "Key Discussions:\n"
        for post in posts[:5]:
            context += f"• {post.title} (r/{post.subreddit})\n"
            context += f"  Sentiment: {post.sentiment_label} ({post.sentiment_score:.2f})\n"

        return context

    except Exception as e:
        return f"Error analyzing Reddit stock info for {ticker}: {str(e)}"
```


## Dagster Pipeline

### Social Media Collection Asset
```python
# tradingagents/data/assets/social_media.py
from typing import Any, Dict

from dagster import asset, AssetExecutionContext

from tradingagents.config import TradingAgentsConfig
from tradingagents.domains.socialmedia.services import SocialMediaService


@asset(
    group_name="social_media",
    description="Collect Reddit posts from financial subreddits with sentiment analysis",
)
async def reddit_financial_posts(context: AssetExecutionContext) -> Dict[str, Any]:
    """Daily collection of Reddit financial posts."""

    config = TradingAgentsConfig.from_env()
    social_service = SocialMediaService(config)

    # Collect from financial subreddits
    subreddits = ['wallstreetbets', 'investing', 'stocks', 'SecurityAnalysis']

    total_collected = 0
    results = {}

    for subreddit in subreddits:
        try:
            social_context = await social_service.collect_social_data(
                subreddits=[subreddit]
            )

            results[subreddit] = {
                'posts_collected': len(social_context.posts),
                'sentiment_summary': social_context.sentiment_summary,
            }
            total_collected += len(social_context.posts)

            context.log.info(f"Collected {len(social_context.posts)} posts from r/{subreddit}")

        except Exception as e:
            context.log.error(f"Failed to collect from r/{subreddit}: {e}")
            results[subreddit] = {'error': str(e)}

    context.log.info(f"Total posts collected: {total_collected}")
    return results
```

|
||||
|
||||
## Testing Strategy

### Test Structure
```
tests/domains/socialmedia/
├── conftest.py                    # Fixtures and test setup
├── test_reddit_client.py          # API integration tests with VCR
├── test_social_repository.py      # PostgreSQL database tests
├── test_social_service.py         # Business logic with mocks
├── test_sentiment_analyzer.py     # LLM sentiment analysis tests
├── test_embedding_generator.py    # Vector embedding tests
└── fixtures/                      # VCR cassettes and test data
    └── reddit_api_responses.yaml
```

### Key Test Patterns
```python
# tests/domains/socialmedia/test_social_service.py
import pytest
from unittest.mock import AsyncMock, MagicMock
from tradingagents.domains.socialmedia.services import SocialMediaService

@pytest.mark.asyncio
async def test_collect_social_data_success(mock_social_service):
    """Test successful social media data collection"""
    # Mock Reddit API response
    mock_posts = [
        {
            'post_id': 'abc123',
            'title': 'AAPL to the moon!',
            'subreddit': 'wallstreetbets',
            # ... other fields
        }
    ]

    mock_social_service.reddit_client.fetch_financial_posts.return_value = mock_posts
    mock_social_service.sentiment_analyzer.analyze_sentiment.return_value = (0.8, 'positive', 0.9)

    result = await mock_social_service.collect_social_data(ticker='AAPL')

    assert len(result.posts) == 1
    assert result.posts[0].sentiment_label == 'positive'
    assert result.sentiment_summary['positive_count'] == 1
```
## Dependencies

### Technical Dependencies
- **Reddit API access** (PRAW or Reddit API client)
- **OpenRouter API** for LLM sentiment analysis
- **PostgreSQL** with TimescaleDB and pgvectorscale extensions
- **Existing database infrastructure** from news domain
- **OpenRouter configuration** in TradingAgentsConfig
- **Dagster orchestration framework** for scheduled execution

### Reference Implementations
- **News domain patterns**: Follow NewsService, NewsRepository, NewsArticleEntity patterns for consistency
- **Database schema**: Mirror NewsArticleEntity vector embedding approach for social posts
- **Agent integration**: Follow existing AgentToolkit get_news() pattern for social media methods
- **Testing approach**: Apply news domain testing patterns: VCR for API clients, real DB for repositories

## Success Criteria

### Functionality
- Daily Reddit collection with sentiment analysis and vector search
- Seamless integration with existing multi-agent trading framework
- RAG-enhanced social context for AI agents

### Performance
- < 2 second social context queries
- < 100ms repository operations
- Efficient vector similarity search

### Quality
- 85%+ test coverage matching project standards
- Comprehensive error handling and resilience
- Data quality monitoring and validation

### Integration
- Seamless AgentToolkit RAG integration for AI agents
- Architecture and patterns match successful news domain implementation
- Consistent with existing TradingAgents configuration and conventions

## Implementation Approach

**Complete domain implementation following successful news domain patterns:**

1. **Database migration** from file storage to PostgreSQL
2. **Entity models** with TimescaleDB and vector support
3. **Reddit client** implementation with rate limiting
4. **Repository layer** with vector search capabilities
5. **Service layer** with sentiment analysis and embedding generation
6. **AgentToolkit integration** with RAG-enhanced methods
7. **Dagster pipeline** for automated daily collection
8. **Comprehensive testing** with VCR mocking and real database tests

This implementation transforms the social media domain from basic stubs into a production-ready system that integrates seamlessly with the existing TradingAgents framework.
@ -0,0 +1,184 @@
# Social Media Domain Implementation Status

## Project Overview

**Feature:** Complete socialmedia domain implementation from empty stubs to production
**Total Estimated Time:** 32 hours across 3 phases
**Approach:** Parallel development with multiple AI agents
**Target:** >85% test coverage, PostgreSQL migration, PRAW Reddit integration, OpenRouter LLM sentiment analysis

---

## Progress Summary

| Phase | Status | Completed | Total | Progress | Est. Time |
|-------|--------|-----------|-------|----------|-----------|
| **Phase 1: Foundation** | 🟡 Not Started | 0 | 4 | 0% | 12h |
| **Phase 2: API Integration** | 🟡 Not Started | 0 | 4 | 0% | 12h |
| **Phase 3: Integration** | 🟡 Not Started | 0 | 3 | 0% | 8h |
| **Overall Progress** | 🟡 Not Started | **0** | **11** | **0%** | **32h** |

---

## Phase 1: Foundation (12 hours)

### 🏗️ Database & Core Models

| Task | Agent | Status | Progress | Time | Priority |
|------|-------|--------|----------|------|----------|
| **1.1** Database Schema Migration | Database Specialist | 🟡 Not Started | 0% | 3h | 🔴 Blocking |
| **1.2** SQLAlchemy Entity Implementation | Entity Specialist | 🟡 Not Started | 0% | 3h | 🔴 Blocking |
| **1.3** Domain Model Enhancement | Domain Specialist | 🟡 Not Started | 0% | 3h | 🔴 Blocking |
| **1.4** Repository Implementation | Repository Specialist | 🟡 Not Started | 0% | 3h | 🟠 Medium |

#### Phase 1 Dependencies
- Task 1.1 → Task 1.2 (Entity requires database schema)
- Task 1.4 depends on Tasks 1.1 + 1.2
- Task 1.3 can run parallel with others

#### Phase 1 Acceptance Criteria
- [ ] PostgreSQL table `social_media_posts` with TimescaleDB + pgvectorscale
- [ ] SocialMediaPostEntity with proper field mappings and transformations
- [ ] SocialPost domain model with validation and business rules
- [ ] SocialRepository with vector similarity search and sentiment aggregation

---

## Phase 2: API Integration & Processing (12 hours)

### 🔌 Clients & Services

| Task | Agent | Status | Progress | Time | Priority |
|------|-------|--------|----------|------|----------|
| **2.1** Reddit Client Implementation | API Integration Specialist | 🟡 Not Started | 0% | 4h | 🔴 Blocking |
| **2.2** OpenRouter Sentiment Analysis | LLM Integration Specialist | 🟡 Not Started | 0% | 3h | 🟠 Medium |
| **2.3** Vector Embedding Generation | ML Integration Specialist | 🟡 Not Started | 0% | 2h | 🟠 Medium |
| **2.4** Service Layer Implementation | Service Integration Specialist | 🟡 Not Started | 0% | 3h | 🟠 Medium |

#### Phase 2 Dependencies
- All tasks can run in parallel initially
- Task 2.4 depends on completion of Tasks 2.1, 2.2, 2.3

#### Phase 2 Acceptance Criteria
- [ ] PRAW Reddit client with rate limiting and error handling
- [ ] OpenRouter sentiment analysis with social media-specific prompts
- [ ] Vector embeddings (1536-dim) for titles and content using text-embedding-3-large
- [ ] SocialMediaService orchestrating collection, sentiment, and embeddings

---

## Phase 3: Integration & Validation (8 hours)

### 🎯 AgentToolkit & Pipeline

| Task | Agent | Status | Progress | Time | Priority |
|------|-------|--------|----------|------|----------|
| **3.1** AgentToolkit Integration | Agent Integration Specialist | 🟡 Not Started | 0% | 3h | 🔴 High |
| **3.2** Dagster Pipeline Implementation | Pipeline Specialist | 🟡 Not Started | 0% | 2h | 🟠 Medium |
| **3.3** Comprehensive Testing Suite | Testing Specialist | 🟡 Not Started | 0% | 3h | 🔴 High |

#### Phase 3 Dependencies
- Task 3.1 depends on Task 2.4 (SocialMediaService)
- Task 3.2 depends on Task 2.4
- Task 3.3 can start after any component is implemented

#### Phase 3 Acceptance Criteria
- [ ] AgentToolkit RAG methods: `get_reddit_sentiment()`, `get_reddit_stock_info()`, etc.
- [ ] Daily Dagster pipeline with sentiment analysis and embedding generation
- [ ] >85% test coverage with VCR cassettes and mocked dependencies

---

## Current Blocking Issues

| Issue | Impact | Affected Tasks | Resolution |
|-------|---------|----------------|------------|
| No active blocking issues | - | - | Ready to start Phase 1 |

---

## Implementation Readiness

### Prerequisites Status
| Requirement | Status | Notes |
|-------------|---------|-------|
| PostgreSQL + Extensions | ✅ Available | TimescaleDB + pgvectorscale ready |
| Reddit API Credentials | ⚠️ Required | Need REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET |
| OpenRouter API Access | ✅ Available | Existing OpenRouterClient integration |
| Database Migration System | ✅ Available | Existing migration infrastructure |
| Testing Framework | ✅ Available | pytest, pytest-vcr, pytest-asyncio |

### Risk Assessment
| Risk Level | Tasks | Mitigation |
|------------|-------|------------|
| 🔴 **High** | 2.1 (Reddit Client) | Use proven PRAW library, implement circuit breaker |
| 🟠 **Medium** | 1.1, 1.4, 2.2, 2.4 | Follow existing news domain patterns |
| 🟢 **Low** | 1.2, 1.3, 2.3, 3.1, 3.2, 3.3 | Standard implementation patterns |

---

## Key Success Metrics

### Technical Metrics
- [ ] **Database Performance:** <1s vector similarity queries for top 10 results
- [ ] **API Performance:** <2s social context generation for AI agents
- [ ] **Processing Performance:** <5s batch processing for 1000 posts
- [ ] **Test Coverage:** >85% across all socialmedia domain components
- [ ] **Data Quality:** >80% posts with reliable sentiment analysis

### Integration Metrics
- [ ] **AgentToolkit Integration:** 4 RAG methods implemented and tested
- [ ] **Dagster Pipeline:** Daily automated collection with monitoring
- [ ] **Architecture Consistency:** Follows news domain patterns exactly
- [ ] **Error Resilience:** Graceful degradation on API failures

### Business Metrics
- [ ] **Data Collection:** 400+ posts collected daily from financial subreddits
- [ ] **Sentiment Analysis:** Structured scoring with confidence levels
- [ ] **Semantic Search:** Vector-based similarity search operational
- [ ] **Agent Context:** Rich social media context for trading decisions

---

## Next Steps

### Immediate Actions (Next Sprint)
1. **🚀 Start Phase 1:** Begin database schema migration (Task 1.1)
2. **📋 Environment Setup:** Configure Reddit API credentials
3. **👥 Agent Assignment:** Assign specialized agents to parallel tasks
4. **📊 Progress Tracking:** Update status after each task completion

### Phase Transition Criteria
- **Phase 1 → Phase 2:** All foundation tasks complete, database operational
- **Phase 2 → Phase 3:** Service layer operational, sentiment and embeddings working
- **Phase 3 → Production:** All tests passing, AgentToolkit integration complete

---

## Change Log

| Date | Change | Impact | Updated By |
|------|--------|---------|------------|
| 2024-08-30 | Initial status tracking setup | Baseline established | System |

---

## Notes and Observations

**Implementation Strategy:**
- Leverage existing news domain as reference implementation
- Prioritize blocking tasks (database, core models) first
- Enable parallel development in Phase 2 for efficiency
- Comprehensive testing throughout to maintain >85% coverage

**Key Dependencies:**
- Reddit API reliability and rate limiting compliance
- OpenRouter LLM performance for sentiment analysis
- PostgreSQL vector extension performance at scale
- Integration with existing TradingAgents configuration

**Success Indicators:**
- Clean migration from file-based to PostgreSQL storage
- Reliable daily data collection without manual intervention
- AI agents receiving rich social context within performance targets
- Production-ready error handling and monitoring
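The 1536-dim target above relies on text-embedding-3-large's ability to shorten its output (via the API's `dimensions` parameter), which is equivalent to truncating the full vector and re-normalizing to unit length. A minimal sketch of that operation, for when vectors are shortened locally:

```python
import math

def truncate_embedding(vec: list[float], dims: int = 1536) -> list[float]:
    """Truncate an embedding to `dims` components and re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Re-normalizing matters because cosine-distance indexes assume unit-length vectors; a bare slice would skew similarity scores.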
File diff suppressed because it is too large
@ -0,0 +1,649 @@
# Development Practices - TradingAgents

## Testing Standards

### Pragmatic Outside-In TDD

**Philosophy**: Mock I/O boundaries, test real logic, optimize for fast feedback.

**Core Principle**: Test behavior, not implementation. Focus on public interfaces and data transformations while mocking external dependencies (HTTP, database, filesystem).

### Testing Strategy by Layer

#### 1. Services (Business Logic) - Mock Boundaries
```python
# tests/domains/news/test_news_service.py
from datetime import date
from unittest.mock import AsyncMock

import pytest

from tradingagents.domains.news.news_service import NewsService
from tradingagents.domains.news.news_repository import NewsArticle, NewsRepository
from tradingagents.domains.news.google_news_client import GoogleNewsClient

@pytest.fixture
def mock_repository():
    return AsyncMock(spec=NewsRepository)

@pytest.fixture
def mock_google_client():
    return AsyncMock(spec=GoogleNewsClient)

async def test_get_articles_returns_empty_on_repository_error(mock_repository):
    # Mock repository failure
    mock_repository.list.side_effect = Exception("Database connection failed")

    service = NewsService(repository=mock_repository, clients={})

    # Service should handle error gracefully
    articles = await service.get_articles("AAPL", date(2024, 1, 15))

    assert articles == []
    mock_repository.list.assert_called_once_with("AAPL", date(2024, 1, 15))

async def test_update_articles_transforms_external_data_correctly():
    # Real business logic: test data transformation and coordination
    # create_external_article is a local test helper (not shown here)
    external_articles = [create_external_article("Breaking News", "CNN")]

    mock_repository = AsyncMock()
    mock_google_client = AsyncMock()
    mock_google_client.search.return_value = external_articles

    service = NewsService(
        repository=mock_repository,
        clients={"google": mock_google_client}
    )

    # Test business logic: coordination and transformation
    result_count = await service.update_articles("AAPL", date(2024, 1, 15))

    # Verify transformation happened correctly
    stored_articles = mock_repository.upsert_batch.call_args[0][0]
    assert len(stored_articles) == 1
    assert isinstance(stored_articles[0], NewsArticle)
    assert stored_articles[0].headline == "Breaking News"
```

#### 2. Repositories (Data Access) - Real Persistence
```python
# tests/domains/news/test_news_repository.py
import time
from datetime import date, timedelta

import pytest

from tradingagents.lib.database import create_test_database_manager
from tradingagents.domains.news.news_repository import NewsRepository, NewsArticle

@pytest.fixture
async def db_manager():
    """Use real PostgreSQL for repository tests"""
    manager = create_test_database_manager()
    await manager.create_tables()
    yield manager
    await manager.drop_tables()
    await manager.close()

async def test_upsert_batch_handles_duplicates_correctly(db_manager):
    """Test actual database behavior with real SQL operations"""
    repository = NewsRepository(db_manager)

    # Insert initial articles
    articles = [
        NewsArticle("Apple Earnings Beat", "https://cnn.com/1", "CNN", date(2024, 1, 15)),
        NewsArticle("Apple Stock Rises", "https://cnn.com/2", "CNN", date(2024, 1, 15))
    ]

    result1 = await repository.upsert_batch(articles, "AAPL")
    assert len(result1) == 2

    # Update one article (same URL)
    updated_articles = [
        NewsArticle("Apple Earnings Beat Expectations", "https://cnn.com/1", "CNN", date(2024, 1, 15))
    ]

    result2 = await repository.upsert_batch(updated_articles, "AAPL")

    # Should update existing, not create duplicate
    all_articles = await repository.list("AAPL", date(2024, 1, 15))
    assert len(all_articles) == 2
    assert any("Beat Expectations" in a.headline for a in all_articles)

async def test_list_by_date_range_performance(db_manager):
    """Test query performance with indexed queries"""
    repository = NewsRepository(db_manager)

    # Insert test data (one article per day, starting 2024-01-01)
    articles = [
        NewsArticle(f"News {i}", f"https://example.com/{i}", "Test",
                    date(2024, 1, 1) + timedelta(days=i))
        for i in range(100)
    ]
    await repository.upsert_batch(articles, "AAPL")

    # Test indexed query performance
    start_time = time.time()
    results = await repository.list_by_date_range(
        "AAPL", date(2024, 1, 1), date(2024, 1, 10), limit=50
    )
    elapsed = time.time() - start_time

    assert len(results) == 10
    assert elapsed < 0.1  # < 100ms for simple query
```
#### 3. Clients (External APIs) - pytest-vcr
```python
# tests/domains/news/test_google_news_client.py
import pytest

from tradingagents.domains.news.google_news_client import GoogleNewsClient


class TestGoogleNewsClient:
    # pytest-vcr: records real HTTP traffic on the first run, then replays
    # the cassette (cassettes/test_search_returns_structured_articles.yaml)
    @pytest.mark.vcr
    async def test_search_returns_structured_articles(self):
        """Real HTTP calls recorded with VCR cassettes"""
        client = GoogleNewsClient()

        articles = await client.search("AAPL", max_results=5)

        # Test real API response structure
        assert len(articles) > 0
        assert all(article.title for article in articles)
        assert all(article.link.startswith("http") for article in articles)
        assert all(article.source for article in articles)

    @pytest.mark.vcr
    async def test_search_handles_no_results_gracefully(self):
        """Test error cases with real API responses"""
        client = GoogleNewsClient()

        articles = await client.search("NONEXISTENT_SYMBOL_XYZ", max_results=5)

        assert articles == []
```
### Quality Standards

#### Coverage Requirements
- **85% minimum coverage** across all domains
- **100% coverage** for critical financial calculations
- **Branch coverage** for error handling paths

**Coverage Enforcement**:
```toml
# mise tasks for coverage (mise.toml)
[tasks.test-coverage]
description = "Run tests with coverage report"
run = "uv run pytest --cov=tradingagents --cov-report=html --cov-fail-under=85"

[tasks.coverage-report]
description = "Open coverage report in browser"
run = "open htmlcov/index.html"
```

#### Performance Standards
- **< 100ms per unit test** (fast feedback)
- **< 5s for integration test suite** (rapid development)
- **< 30s for full test suite** (CI/CD efficiency)

**Performance Monitoring**:
```python
# conftest.py - Test timing
import time
import warnings

import pytest

@pytest.fixture(autouse=True)
def test_timer(request):
    start_time = time.time()
    yield
    duration = time.time() - start_time
    if duration > 0.1:  # 100ms threshold
        warnings.warn(f"Slow test: {request.node.nodeid} took {duration:.2f}s")
```

#### Test Structure Standards

**Mirror Source Structure**:
```
tests/
├── conftest.py                          # Shared fixtures
├── domains/
│   ├── news/
│   │   ├── test_news_service.py         # Business logic tests (mocked boundaries)
│   │   ├── test_news_repository.py      # Data persistence tests (real DB)
│   │   └── test_google_news_client.py   # External API tests (VCR cassettes)
│   ├── marketdata/
│   └── socialmedia/
├── agents/
│   └── test_trading_graph.py            # Agent workflow tests
└── integration/
    └── test_end_to_end.py               # Full system tests
```

**Naming Conventions**:
- `test_{method_name}_{expected_behavior}_{context}`
- Example: `test_upsert_batch_handles_duplicates_correctly`
## Development Workflow with Mise

### Daily Development Commands

**Core Development Flow**:
```bash
# 1. Start development environment
mise run docker          # Start PostgreSQL + TimescaleDB

# 2. Install/update dependencies
mise run install         # uv sync --dev

# 3. Development iteration
mise run format          # Auto-format with ruff
mise run lint            # Check code quality
mise run typecheck       # Type checking with pyrefly
mise run test            # Run test suite

# 4. Run application
mise run dev             # Interactive CLI
mise run run             # Direct execution
```

**Quality Assurance**:
```bash
# Run all quality checks before commit
mise run all             # format + lint + typecheck

# Coverage analysis
mise run test-coverage
mise run coverage-report
```

**Troubleshooting**:
```bash
# Clean build artifacts
mise run clean

# Reset development environment
mise run docker          # Restart containers
mise run install         # Reinstall dependencies
```

### Code Quality Standards

#### Linting with Ruff
```toml
# pyproject.toml
[tool.ruff]
target-version = "py313"
line-length = 88
extend-exclude = ["migrations/", "alembic/"]

[tool.ruff.lint]
select = [
    "E",    # pycodestyle errors
    "W",    # pycodestyle warnings
    "F",    # Pyflakes
    "I",    # isort
    "B",    # flake8-bugbear
    "C4",   # flake8-comprehensions
    "UP",   # pyupgrade
    "ERA",  # eradicate
    "PIE",  # flake8-pie
    "SIM",  # flake8-simplify
]

ignore = [
    "E501",  # Line too long (handled by formatter)
    "B008",  # Do not perform function calls in argument defaults
    "B904",  # raise ... from None
]

[tool.ruff.lint.per-file-ignores]
"tests/**/*.py" = [
    "S101",  # Use of assert detected
    "ARG",   # Unused function args
    "FBT",   # Boolean trap
]
```

#### Type Checking with Pyrefly
```toml
[tool.pyrefly]
python-version = "3.13"
warn-unused-ignores = true
show-error-codes = true
strict = true

# Per-module configuration
[[tool.pyrefly.overrides]]
module = "tests.*"
disallow_untyped_defs = false
```
### Database Development Patterns

#### Migration Workflow
```bash
# 1. Create migration after model changes
alembic revision --autogenerate -m "Add user preferences table"

# 2. Review generated migration
# Edit alembic/versions/{hash}_add_user_preferences_table.py

# 3. Apply migration
alembic upgrade head

# 4. Test with sample data
mise run test-migrations
```

#### Development Database Management
```bash
# Reset development database
mise run docker                    # Stop/start containers
alembic upgrade head               # Apply all migrations
python scripts/seed_dev_data.py    # Load sample data
```

#### Testing Database Strategy
```python
from datetime import date

import pytest

from tradingagents.lib.database import create_test_database_manager
from tradingagents.domains.news.news_repository import NewsArticle

# Test database isolation
@pytest.fixture(scope="function")
async def clean_db():
    """Fresh database for each test"""
    db_manager = create_test_database_manager()
    await db_manager.create_tables()
    yield db_manager
    await db_manager.drop_tables()
    await db_manager.close()

# Shared test data
@pytest.fixture
def sample_news_articles():
    """Reusable test data across test modules"""
    return [
        NewsArticle("Apple Earnings", "https://cnn.com/1", "CNN", date(2024, 1, 15)),
        NewsArticle("Tesla Updates", "https://reuters.com/2", "Reuters", date(2024, 1, 16))
    ]
```
## Error Handling and Retry Strategies

### Resilient External API Integration

#### Exponential Backoff with Circuit Breaker
```python
import asyncio
import logging
import random
from collections.abc import Awaitable, Callable
from functools import wraps
from typing import TypeVar

import aiohttp

T = TypeVar('T')


def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Decorator for exponential backoff retry logic with jitter"""
    def decorator(func: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        @wraps(func)
        async def wrapper(*args, **kwargs) -> T:
            last_exception = None

            for attempt in range(max_retries + 1):
                try:
                    return await func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    if attempt == max_retries:
                        break

                    delay = base_delay * (2 ** attempt)   # Exponential backoff
                    jitter = random.uniform(0.1, 0.9)     # Add jitter
                    logging.warning(f"Retry {attempt + 1}/{max_retries} for {func.__name__}: {e}")
                    await asyncio.sleep(delay * jitter)

            raise last_exception
        return wrapper
    return decorator


class APIClient:
    def __init__(self):
        # CircuitBreaker is defined elsewhere in the codebase
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            reset_timeout=60,
            expected_exception=aiohttp.ClientError
        )

    @retry_with_backoff(max_retries=3, base_delay=1.0)
    async def fetch_data(self, url: str) -> dict:
        """Resilient HTTP requests with retry logic"""
        async with self.circuit_breaker:
            async with aiohttp.ClientSession() as session:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
                    if response.status >= 500:
                        raise aiohttp.ClientError(f"Server error: {response.status}")
                    return await response.json()
```
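The `CircuitBreaker` referenced above is not shown in this document. A minimal asynchronous sketch of the pattern (constructor parameters match the usage above; the behavior is assumed, not the project's actual implementation):

```python
import time


class CircuitBreaker:
    """Minimal async circuit breaker.

    Opens after `failure_threshold` consecutive expected failures; while open,
    calls are rejected immediately. After `reset_timeout` seconds it half-opens
    and lets one trial call through.
    """

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60,
                 expected_exception: type[BaseException] = Exception):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.opened_at: float | None = None  # None means closed

    async def __aenter__(self):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("Circuit breaker is open; call rejected")
            self.opened_at = None  # half-open: allow one trial call
        return self

    async def __aexit__(self, exc_type, exc, tb):
        if exc_type is not None and issubclass(exc_type, self.expected_exception):
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
        elif exc_type is None:
            self.failure_count = 0  # success closes the breaker again
        return False  # never swallow exceptions
```

Production implementations usually add locking for concurrent callers and distinct open/half-open/closed state reporting; libraries such as `aiobreaker` cover those cases.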
### Database Error Handling

#### Graceful Degradation
```python
class NewsService:
    async def get_articles(self, symbol: str, date: date) -> list[NewsArticle]:
        """Service-level error handling with fallbacks"""
        try:
            # Try primary repository
            articles = await self.repository.list(symbol, date)
            logger.info(f"Retrieved {len(articles)} articles from database")
            return articles

        except DatabaseConnectionError:
            logger.warning("Database unavailable, trying cache fallback")
            # Fallback to file cache
            return await self.cache_repository.list(symbol, date)

        except Exception as e:
            logger.error(f"Failed to retrieve articles for {symbol}: {e}")
            # Graceful degradation - return empty list rather than crash
            return []

    async def update_articles_with_partial_failure_handling(self, symbol: str, date: date) -> dict:
        """Handle partial failures in batch operations"""
        results = {"successful": 0, "failed": 0, "errors": []}

        try:
            # Attempt batch fetch from multiple sources
            sources = ["google_news", "finnhub", "alpha_vantage"]
            articles_by_source = {}

            for source in sources:
                try:
                    client = self.clients[source]
                    articles = await client.fetch_news(symbol, date)
                    articles_by_source[source] = articles
                    logger.info(f"Fetched {len(articles)} from {source}")
                except Exception as e:
                    results["errors"].append(f"{source}: {str(e)}")
                    logger.warning(f"Failed to fetch from {source}: {e}")

            # Process successful fetches
            all_articles = []
            for source, articles in articles_by_source.items():
                try:
                    validated = [a for a in articles if self.validate_article(a)]
                    all_articles.extend(validated)
                    results["successful"] += len(validated)
                except Exception as e:
                    results["failed"] += len(articles)
                    results["errors"].append(f"Validation failed for {source}: {str(e)}")

            # Store successfully processed articles
            if all_articles:
                await self.repository.upsert_batch(all_articles, symbol)

            return results

        except Exception as e:
            logger.error(f"Critical error in update_articles: {e}")
            results["errors"].append(f"Critical failure: {str(e)}")
            return results
```
### Logging Standards
|
||||
|
||||
#### Structured Logging Configuration
|
||||
```python
|
||||
import logging
|
||||
import json
|
||||
from datetime import datetime
|
||||
|
||||
class JSONFormatter(logging.Formatter):
|
||||
"""Structured JSON logging for production"""
|
||||
|
||||
def format(self, record):
|
||||
log_entry = {
|
||||
"timestamp": datetime.utcnow().isoformat(),
|
||||
"level": record.levelname,
|
||||
"logger": record.name,
|
||||
"message": record.getMessage(),
|
||||
}
|
||||
|
||||
# Add context information
|
||||
if hasattr(record, 'symbol'):
|
||||
log_entry["symbol"] = record.symbol
|
||||
if hasattr(record, 'user_id'):
|
||||
log_entry["user_id"] = record.user_id
|
||||
if hasattr(record, 'request_id'):
|
||||
log_entry["request_id"] = record.request_id
|
||||
|
||||
# Add exception info
|
||||
if record.exc_info:
|
||||
log_entry["exception"] = self.formatException(record.exc_info)
|
||||
|
||||
return json.dumps(log_entry)
|
||||
|
||||
# Configuration
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.StreamHandler(),
|
||||
logging.FileHandler('tradingagents.log')
|
||||
]
|
||||
)
|
||||
|
||||
# Domain-specific loggers
|
||||
news_logger = logging.getLogger('tradingagents.domains.news')
|
||||
market_logger = logging.getLogger('tradingagents.domains.marketdata')
|
||||
agent_logger = logging.getLogger('tradingagents.agents')
|
||||
```

#### Contextual Logging in Services
```python
class NewsService:
    def __init__(self):
        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")

    async def get_articles(self, symbol: str, date: date) -> list[NewsArticle]:
        # Add context to log messages
        extra = {"symbol": symbol, "date": date.isoformat()}

        self.logger.info("Starting article retrieval", extra=extra)

        try:
            articles = await self.repository.list(symbol, date)
            self.logger.info(
                f"Successfully retrieved {len(articles)} articles",
                extra={**extra, "count": len(articles)}
            )
            return articles
        except Exception as e:
            self.logger.error(
                f"Failed to retrieve articles: {e}",
                extra=extra,
                exc_info=True
            )
            raise
```

## Performance Monitoring

### Application Metrics

#### Key Performance Indicators
```python
import logging
import time
from collections import defaultdict
from functools import wraps


class PerformanceMonitor:
    def __init__(self):
        self.metrics = defaultdict(list)

    def track_execution_time(self, operation: str):
        """Decorator to track method execution time"""
        def decorator(func):
            @wraps(func)
            async def wrapper(*args, **kwargs):
                start_time = time.time()
                try:
                    return await func(*args, **kwargs)
                finally:
                    duration = time.time() - start_time
                    self.metrics[f"{operation}_duration"].append(duration)

                    # Log slow operations
                    if duration > 1.0:
                        logging.warning(f"Slow operation {operation}: {duration:.2f}s")
            return wrapper
        return decorator

    def get_performance_summary(self) -> dict:
        """Get performance statistics"""
        summary = {}
        for operation, durations in self.metrics.items():
            if durations:
                summary[operation] = {
                    "count": len(durations),
                    "avg": sum(durations) / len(durations),
                    "min": min(durations),
                    "max": max(durations),
                    "p95": sorted(durations)[int(len(durations) * 0.95)]
                }
        return summary


# Usage in services
monitor = PerformanceMonitor()


class NewsService:
    @monitor.track_execution_time("news_fetch")
    async def get_articles(self, symbol: str, date: date) -> list[NewsArticle]:
        return await self.repository.list(symbol, date)

    @monitor.track_execution_time("news_update")
    async def update_articles(self, symbol: str, date: date) -> int:
        return await self._fetch_and_store_articles(symbol, date)
```
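A minimal, self-contained run of the decorator; the `demo_fetch` operation name and the `fetch_demo` coroutine are made up for this sketch:

```python
import asyncio
import time
from collections import defaultdict
from functools import wraps


class PerformanceMonitor:
    """Trimmed to the tracking logic shown above"""

    def __init__(self):
        self.metrics = defaultdict(list)

    def track_execution_time(self, operation: str):
        def decorator(func):
            @wraps(func)
            async def wrapper(*args, **kwargs):
                start_time = time.time()
                try:
                    return await func(*args, **kwargs)
                finally:
                    # Record the duration even if the call raised
                    self.metrics[f"{operation}_duration"].append(time.time() - start_time)
            return wrapper
        return decorator


monitor = PerformanceMonitor()


@monitor.track_execution_time("demo_fetch")
async def fetch_demo() -> str:
    await asyncio.sleep(0.01)  # stand-in for real I/O
    return "ok"


result = asyncio.run(fetch_demo())
durations = monitor.metrics["demo_fetch_duration"]
print(result, len(durations))  # ok 1
```

Because the timing happens in a `finally` block, failed calls are recorded too, which keeps the p95 honest.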

### Database Query Optimization

#### Query Performance Monitoring
```python
# Custom SQLAlchemy event listeners for query timing
import logging
import time

from sqlalchemy import event
from sqlalchemy.engine import Engine

query_logger = logging.getLogger('tradingagents.database.queries')


@event.listens_for(Engine, "before_cursor_execute")
def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    context._query_start_time = time.time()


@event.listens_for(Engine, "after_cursor_execute")
def after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    total = time.time() - context._query_start_time

    # Log slow queries
    if total > 0.1:  # 100ms threshold
        query_logger.warning(
            f"Slow query ({total:.2f}s): {statement[:100]}...",
            extra={"duration": total, "query": statement[:200]}
        )
```

This development practices document establishes the foundation for maintaining high code quality, rapid development cycles, and robust error handling in the TradingAgents system.
@ -0,0 +1,837 @@
# Security Standards - TradingAgents

## API Key Management

### OpenRouter and LLM Provider Security

**Environment Variable Management**:
```bash
# Required API keys
export OPENROUTER_API_KEY="sk-or-v1-xxxxxxxxxxxx"

# Optional provider keys (for fallback)
export OPENAI_API_KEY="sk-xxxxxxxxxxxx"
export ANTHROPIC_API_KEY="sk-ant-xxxxxxxxxxxx"

# Financial data APIs
export FINNHUB_API_KEY="xxxxxxxxxxxx"
export ALPHA_VANTAGE_API_KEY="xxxxxxxxxxxx"
```

**Configuration Security**:
```python
import os


class SecureConfig:
    """Secure configuration management with validation"""

    @classmethod
    def get_required_env(cls, key: str, description: str = "") -> str:
        """Get a required environment variable, validating its value"""
        value = os.getenv(key)
        if not value:
            raise EnvironmentError(
                f"Required environment variable {key} not set. {description}"
            )

        # Validate API key format
        if key.endswith("_API_KEY"):
            cls._validate_api_key(key, value)

        return value

    @classmethod
    def _validate_api_key(cls, key: str, value: str) -> None:
        """Validate API key format and fail fast on likely misconfiguration"""
        if len(value) < 20:
            raise ValueError(f"API key {key} appears too short (< 20 chars)")

        if value.startswith("sk-") and len(value) < 40:
            raise ValueError(f"OpenAI/OpenRouter API key {key} appears invalid")

        # Detect placeholder values
        placeholder_patterns = ["your_", "replace_", "xxxx", "test"]
        if any(pattern in value.lower() for pattern in placeholder_patterns):
            raise ValueError(f"API key {key} appears to be a placeholder")

    @classmethod
    def load_openrouter_config(cls) -> dict[str, str]:
        """Load and validate OpenRouter configuration"""
        return {
            "api_key": cls.get_required_env(
                "OPENROUTER_API_KEY",
                "Get your key from https://openrouter.ai/keys"
            ),
            "base_url": os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"),
            "app_name": os.getenv("OPENROUTER_APP_NAME", "TradingAgents"),
            "site_url": os.getenv("OPENROUTER_SITE_URL", "https://github.com/TauricResearch/TradingAgents")
        }
```
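The placeholder check can be exercised on its own; `DEMO_API_KEY` is a made-up variable name set here purely for the demonstration, and the second key is a fabricated value of realistic shape:

```python
import os

# Hypothetical variable, used only in this sketch
os.environ["DEMO_API_KEY"] = "your_openrouter_api_key_here"


def looks_like_placeholder(value: str) -> bool:
    # The same patterns SecureConfig._validate_api_key rejects
    placeholder_patterns = ["your_", "replace_", "xxxx", "test"]
    return any(pattern in value.lower() for pattern in placeholder_patterns)


print(looks_like_placeholder(os.environ["DEMO_API_KEY"]))  # True
print(looks_like_placeholder("sk-or-v1-8f3a1b2c4d5e6f7a8b9c0d1e2f3a4b5c"))  # False
```

Substring matching is deliberately loose here: it will occasionally flag a legitimate key containing `test`, which is an acceptable trade-off for catching unedited `.env.example` copies early.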

**Development vs Production Key Management**:
```bash
# .env.example (committed to the repo)
OPENROUTER_API_KEY=your_openrouter_api_key_here
DATABASE_URL=postgresql+asyncpg://postgres:tradingagents@localhost:5432/tradingagents
TRADINGAGENTS_RESULTS_DIR=./results
TRADINGAGENTS_DATA_DIR=./data

# .env (never committed; gitignored)
OPENROUTER_API_KEY=sk-or-v1-actual-key-here
DATABASE_URL=postgresql+asyncpg://user:password@prod-db:5432/tradingagents
```

### Secret Rotation and Management

**Key Rotation Strategy**:
```python
import logging
from datetime import datetime, timedelta
from typing import Dict

logger = logging.getLogger(__name__)


class APIKeyManager:
    """Manages API key rotation and health monitoring"""

    def __init__(self):
        self.key_health: Dict[str, Dict] = {}
        self.rotation_schedule: Dict[str, datetime] = {}

    async def validate_key_health(self, service: str, api_key: str) -> bool:
        """Test API key validity with a minimal request"""
        try:
            if service == "openrouter":
                return await self._test_openrouter_key(api_key)
            elif service == "finnhub":
                return await self._test_finnhub_key(api_key)
            else:
                logger.warning(f"No health check implemented for {service}")
                return True
        except Exception as e:
            logger.error(f"API key health check failed for {service}: {e}")
            return False

    async def _test_openrouter_key(self, api_key: str) -> bool:
        """Test OpenRouter key with a lightweight request"""
        import aiohttp

        headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }

        # Use a minimal model-list request to test auth
        async with aiohttp.ClientSession() as session:
            async with session.get(
                "https://openrouter.ai/api/v1/models",
                headers=headers,
                timeout=aiohttp.ClientTimeout(total=10)
            ) as response:
                return response.status == 200

    def schedule_rotation(self, service: str, days: int = 90) -> None:
        """Schedule API key rotation"""
        rotation_date = datetime.now() + timedelta(days=days)
        self.rotation_schedule[service] = rotation_date
        logger.info(f"Scheduled {service} key rotation for {rotation_date.date()}")

    def get_rotation_alerts(self) -> list[str]:
        """Get the list of keys requiring rotation"""
        alerts = []
        now = datetime.now()
        warning_threshold = timedelta(days=7)

        for service, rotation_date in self.rotation_schedule.items():
            if now >= rotation_date:
                alerts.append(f"URGENT: {service} API key rotation overdue")
            elif now >= rotation_date - warning_threshold:
                alerts.append(
                    f"WARNING: {service} API key rotation due in {(rotation_date - now).days} days"
                )

        return alerts
```
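The alerting rules can be checked in isolation; the service names and rotation dates below are fabricated for the example:

```python
from datetime import datetime, timedelta

# Stand-in for APIKeyManager.rotation_schedule, populated with fabricated dates
rotation_schedule = {
    "openrouter": datetime.now() - timedelta(days=1),      # overdue
    "finnhub": datetime.now() + timedelta(days=3),         # due soon
    "alpha_vantage": datetime.now() + timedelta(days=60),  # fine
}

# Same rules as get_rotation_alerts above
alerts = []
now = datetime.now()
warning_threshold = timedelta(days=7)
for service, rotation_date in rotation_schedule.items():
    if now >= rotation_date:
        alerts.append(f"URGENT: {service} API key rotation overdue")
    elif now >= rotation_date - warning_threshold:
        alerts.append(f"WARNING: {service} API key rotation due in {(rotation_date - now).days} days")

print(alerts)
```

Only the overdue and due-soon services produce alerts; keys whose rotation date is outside the warning window stay quiet.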

## Database Security Patterns

### Connection Security

**Secure Connection Configuration**:
```python
import os
import ssl
import sys

from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.pool import NullPool


class SecureDatabaseManager:
    """Database manager with security-first configuration"""

    def __init__(self, database_url: str, require_ssl: bool = True):
        # Parse and validate the database URL
        if not database_url.startswith(("postgresql+asyncpg://", "postgresql://")):
            raise ValueError("Only PostgreSQL databases are supported")

        # Ensure the asyncpg driver for better async performance
        if database_url.startswith("postgresql://"):
            database_url = database_url.replace("postgresql://", "postgresql+asyncpg://")

        # SSL/TLS configuration for production
        connect_args = {}
        if require_ssl:
            ssl_context = ssl.create_default_context()
            ssl_context.check_hostname = False  # Often needed for cloud databases
            ssl_context.verify_mode = ssl.CERT_REQUIRED
            connect_args["ssl"] = ssl_context

        self.engine = create_async_engine(
            database_url,
            # Security settings
            connect_args=connect_args,
            pool_pre_ping=True,  # Verify connections before use
            pool_recycle=3600,   # Recycle connections hourly

            # Connection limits to prevent resource exhaustion
            pool_size=10,     # Base connection pool
            max_overflow=20,  # Additional connections under load

            # Prevent connection leaks in tests
            poolclass=NullPool if self._is_test_env() else None,

            # Disable SQL echo in production (information disclosure)
            echo=os.getenv("ENVIRONMENT") != "production",
        )

    def _is_test_env(self) -> bool:
        """Detect a test environment"""
        return any([
            "test" in os.getenv("DATABASE_URL", "").lower(),
            os.getenv("TESTING") == "true",
            "pytest" in sys.modules,
        ])

    async def create_tables_secure(self):
        """Create tables with security considerations"""
        async with self.engine.begin() as conn:
            # Set secure session parameters
            await conn.execute(text("SET session_replication_role = 'origin'"))
            await conn.execute(text("SET log_statement = 'none'"))  # Disable query logging for DDL

            # Create tables
            await conn.run_sync(Base.metadata.create_all)

            # Set up row-level security policies if needed
            await self._setup_row_level_security(conn)

    async def _setup_row_level_security(self, conn):
        """Configure row-level security for multi-tenant data"""
        # Enable RLS on sensitive tables
        await conn.execute(text("ALTER TABLE news_articles ENABLE ROW LEVEL SECURITY"))

        # Create a policy for data isolation (if implementing multi-user features)
        # await conn.execute(text("""
        #     CREATE POLICY user_data_policy ON news_articles
        #         FOR ALL TO app_user
        #         USING (user_id = current_setting('app.user_id')::UUID)
        # """))
```

### Data Privacy and Anonymization

**Financial Data Protection**:
```python
import hashlib
import logging
import os
import secrets
from datetime import datetime
from pathlib import Path

logger = logging.getLogger(__name__)


class DataPrivacyManager:
    """Handles sensitive financial data with privacy controls"""

    def __init__(self):
        self.salt = self._get_or_create_salt()

    def _get_or_create_salt(self) -> bytes:
        """Get the hashing salt from secure storage, creating it on first use"""
        salt_path = Path(os.getenv("TRADINGAGENTS_DATA_DIR", "./data")) / ".salt"

        if salt_path.exists():
            return salt_path.read_bytes()

        # Generate a cryptographically secure salt
        salt = secrets.token_bytes(32)
        salt_path.write_bytes(salt)
        salt_path.chmod(0o600)  # Restrict file permissions
        return salt

    def hash_symbol(self, symbol: str) -> str:
        """Create a consistent hash for symbols (analytics without exposure)"""
        return hashlib.pbkdf2_hmac(
            'sha256',
            symbol.encode(),
            self.salt,
            100000  # iterations
        ).hex()[:16]

    def sanitize_article_content(self, content: str) -> str:
        """Remove PII and sensitive information from article content"""
        import re

        # Remove potential SSNs, account numbers, emails, etc.
        patterns = [
            r'\b\d{3}-\d{2}-\d{4}\b',  # SSN
            r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b',  # Credit card
            r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',  # Email
        ]

        sanitized = content
        for pattern in patterns:
            sanitized = re.sub(pattern, '[REDACTED]', sanitized)

        return sanitized

    def audit_data_access(self, table: str, operation: str, record_count: int = 1):
        """Log data access for compliance auditing"""
        logger.info(
            "Data access audit",
            extra={
                "table": table,
                "operation": operation,
                "record_count": record_count,
                "timestamp": datetime.utcnow().isoformat(),
                "user": os.getenv("USER", "system")
            }
        )
```
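The redaction patterns can be tried standalone; the sample sentence and its email address and card number are invented:

```python
import re

# The same patterns sanitize_article_content applies, in the same order
patterns = [
    r'\b\d{3}-\d{2}-\d{4}\b',                                # SSN
    r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b',              # Credit card
    r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',  # Email
]

content = "Contact analyst@example.com about card 1234-5678-9012-3456."
for pattern in patterns:
    content = re.sub(pattern, '[REDACTED]', content)

print(content)  # Contact [REDACTED] about card [REDACTED].
```

Regex redaction is a coarse first pass; production PII scrubbing usually layers additional detectors on top of patterns like these.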

### Query Security

**SQL Injection Prevention**:
```python
from datetime import date
from typing import Any, Optional

from sqlalchemy import and_, select, text
from sqlalchemy.ext.asyncio import AsyncSession


class SecureQueryBuilder:
    """Build secure parameterized queries"""

    def __init__(self, session: AsyncSession):
        self.session = session

    async def get_articles_secure(
        self,
        symbol: str,
        date_filter: date,
        user_input_query: Optional[str] = None
    ) -> list[NewsArticle]:
        """Secure article query with parameterization"""

        # Base query with parameterized symbol and date
        query = select(NewsArticleEntity).where(
            and_(
                NewsArticleEntity.symbol == symbol,  # Parameterized automatically
                NewsArticleEntity.published_date == date_filter
            )
        )

        # Secure text search if provided
        if user_input_query:
            # Use full-text search instead of LIKE to prevent injection;
            # sanitize and escape the search term first
            sanitized_query = self._sanitize_search_term(user_input_query)
            query = query.where(
                NewsArticleEntity.headline.match(sanitized_query)  # PostgreSQL full-text search
            )

        result = await self.session.execute(query)
        return [NewsArticle.from_entity(e) for e in result.scalars()]

    def _sanitize_search_term(self, query: str) -> str:
        """Sanitize user input for full-text search"""
        import re

        # Remove SQL injection patterns
        dangerous_patterns = [
            r"[';\"\\]",  # SQL metacharacters
            r"\b(union|select|drop|delete|update|insert)\b",  # SQL keywords
            r"--",  # SQL line comments
            r"/\*.*?\*/"  # SQL block comments
        ]

        sanitized = query
        for pattern in dangerous_patterns:
            sanitized = re.sub(pattern, "", sanitized, flags=re.IGNORECASE)

        # Limit length to prevent DoS
        sanitized = sanitized[:100]

        # Convert to PostgreSQL full-text search format
        terms = sanitized.split()
        return " & ".join(f'"{term}"' for term in terms if term.isalnum())

    async def execute_safe_raw_query(self, query_template: str, **params) -> Any:
        """Execute raw SQL with parameter validation"""
        # Whitelist allowed query templates
        allowed_templates = {
            "performance_stats": "SELECT * FROM pg_stat_statements WHERE query LIKE :pattern",
            "table_sizes": (
                "SELECT schemaname, tablename, "
                "pg_total_relation_size(schemaname||'.'||tablename) AS size "
                "FROM pg_tables WHERE schemaname = :schema"
            ),
        }

        if query_template not in allowed_templates:
            raise ValueError(f"Query template not in whitelist: {query_template}")

        # Validate parameters
        for key, value in params.items():
            if not self._validate_parameter(key, value):
                raise ValueError(f"Invalid parameter {key}: {value}")

        query = text(allowed_templates[query_template])
        result = await self.session.execute(query, params)
        return result.fetchall()

    def _validate_parameter(self, key: str, value: Any) -> bool:
        """Validate query parameters"""
        # Length limits
        if isinstance(value, str) and len(value) > 100:
            return False

        # Type restrictions
        if key.endswith("_id") and not isinstance(value, (str, int)):
            return False

        # No SQL injection patterns
        if isinstance(value, str):
            dangerous = ["'", '"', ";", "--", "/*", "*/", "union", "select"]
            if any(pattern in value.lower() for pattern in dangerous):
                return False

        return True
```
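Run standalone, the search-term sanitizer behaves like this (the input string is an invented injection attempt):

```python
import re


def sanitize_search_term(query: str) -> str:
    """Same logic as SecureQueryBuilder._sanitize_search_term above"""
    dangerous_patterns = [
        r"[';\"\\]",
        r"\b(union|select|drop|delete|update|insert)\b",
        r"--",
        r"/\*.*?\*/",
    ]
    sanitized = query
    for pattern in dangerous_patterns:
        sanitized = re.sub(pattern, "", sanitized, flags=re.IGNORECASE)
    sanitized = sanitized[:100]  # length cap against DoS
    terms = sanitized.split()
    # Only purely alphanumeric terms survive into the tsquery string
    return " & ".join(f'"{term}"' for term in terms if term.isalnum())


result = sanitize_search_term("earnings report'; DROP TABLE users; --")
print(result)  # "earnings" & "report" & "TABLE" & "users"
```

The keyword stripping removes `DROP` but leaves `TABLE` and `users` as harmless search terms; the real protection remains parameterization, with this sanitizer as defense in depth.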

## Development Environment Security

### Local Development Protection

**Secure Development Setup**:
```bash
#!/bin/bash
# secure_dev_setup.sh - Secure development environment initialization

set -euo pipefail

# 1. Create a secure data directory
DATA_DIR="${TRADINGAGENTS_DATA_DIR:-./data}"
mkdir -p "$DATA_DIR"
chmod 700 "$DATA_DIR"  # Owner read/write/execute only

# 2. Create the .env file with secure permissions
if [ ! -f .env ]; then
    cp .env.example .env
    chmod 600 .env  # Owner read/write only
    echo "Created .env file. Please update with actual API keys."
fi

# 3. Set up a secure Docker environment
if [ ! -f docker-compose.override.yml ]; then
    cat > docker-compose.override.yml << EOF
version: '3.8'
services:
  timescaledb:
    environment:
      # Use a strong password in development
      POSTGRES_PASSWORD: \${DB_PASSWORD:-$(openssl rand -base64 32)}
    volumes:
      - ./data/postgres:/var/lib/postgresql/data
EOF
    echo "Created docker-compose.override.yml with secure settings"
fi

# 4. Configure Git security
git config --local core.hooksPath .githooks
chmod +x .githooks/pre-commit

# 5. Install security scanning tools
if command -v pip &> /dev/null; then
    pip install bandit safety
    echo "Installed security scanning tools"
fi

echo "Secure development environment configured"
echo "Remember to:"
echo "  1. Update .env with real API keys"
echo "  2. Never commit .env or API keys"
echo "  3. Run 'bandit -r tradingagents/' before commits"
```

**Git Security Hooks**:
```bash
#!/bin/bash
# .githooks/pre-commit - Prevent secrets from being committed

# Check for common secret patterns
if git diff --cached --name-only | grep -qE "\.(py|yml|yaml|json|env)$"; then
    echo "Scanning for secrets..."

    # Pattern matching for common secrets
    if git diff --cached | grep -i -E "(api_key|secret|password|token)" | grep -v -E "(example|template|your_|replace_)"; then
        echo "ERROR: Potential secrets detected in staged files!"
        echo "Please review and remove any sensitive information."
        exit 1
    fi

    # Check for hardcoded URLs with credentials
    if git diff --cached | grep -E "postgresql://[^:]+:[^@]+@"; then
        echo "ERROR: Database URL with credentials detected!"
        echo "Use environment variables instead."
        exit 1
    fi
fi

# Run security linting if bandit is available
if command -v bandit &> /dev/null; then
    echo "Running security scan..."
    bandit -r tradingagents/ -f json | jq '.results[] | select(.issue_severity == "HIGH")' | grep -q . && {
        echo "ERROR: High-severity security issues found!"
        echo "Run 'bandit -r tradingagents/' for details."
        exit 1
    }
fi

echo "Pre-commit security checks passed"
```

### Secrets Management with Environment Variables

**Environment Variable Security**:
```python
import os
from pathlib import Path


class EnvironmentManager:
    """Secure environment variable management"""

    def __init__(self):
        self.env_file = Path(".env")
        self.required_vars = [
            "OPENROUTER_API_KEY",
            "DATABASE_URL"
        ]
        self.sensitive_vars = [
            "API_KEY", "SECRET", "PASSWORD", "TOKEN", "PRIVATE_KEY"
        ]

    def validate_environment(self) -> list[str]:
        """Validate environment setup and return any issues"""
        issues = []

        # Check required variables
        for var in self.required_vars:
            if not os.getenv(var):
                issues.append(f"Missing required environment variable: {var}")

        # Check .env file permissions
        if self.env_file.exists():
            stat = self.env_file.stat()
            if stat.st_mode & 0o077:  # Group/other have some permissions
                issues.append(".env file has overly permissive permissions (should be 600)")

        # Validate that sensitive variables aren't using placeholder values
        for var_name in os.environ:
            if any(sensitive in var_name for sensitive in self.sensitive_vars):
                value = os.getenv(var_name, "")
                if self._is_placeholder_value(value):
                    issues.append(f"{var_name} appears to contain a placeholder value")

        return issues

    def _is_placeholder_value(self, value: str) -> bool:
        """Detect common placeholder patterns"""
        placeholders = [
            "your_", "replace_", "change_me", "xxxx", "test_key",
            "example", "sample", "placeholder", "todo"
        ]
        return any(placeholder in value.lower() for placeholder in placeholders)

    def setup_production_env(self) -> dict[str, str]:
        """Configure the production environment with security hardening"""
        return {
            # Security settings
            "PYTHONDONTWRITEBYTECODE": "1",  # Don't create .pyc files
            "PYTHONUNBUFFERED": "1",         # Unbuffered output
            "PYTHONHASHSEED": "random",      # Random hash seed

            # Application security
            "ENVIRONMENT": "production",
            "DEBUG": "false",
            "LOG_LEVEL": "INFO",  # Don't log debug info

            # Database security
            "DB_SSL_MODE": "require",
            "DB_POOL_PRE_PING": "true",
            "DB_ECHO": "false",  # Don't log SQL queries

            # API security
            "API_RATE_LIMIT": "100",  # Requests per minute
            "API_TIMEOUT": "30",      # Request timeout in seconds
        }


def main():
    """Development environment security check"""
    env_manager = EnvironmentManager()
    issues = env_manager.validate_environment()

    if issues:
        print("⚠️ Environment Security Issues:")
        for issue in issues:
            print(f"  - {issue}")
        print("\nRun ./scripts/secure_dev_setup.sh to fix common issues")
        return 1

    print("✅ Environment security validation passed")
    return 0


if __name__ == "__main__":
    exit(main())
```

## Production Security Considerations

### API Rate Limiting and DoS Protection

**Request Throttling**:
```python
import time
from collections import defaultdict
from typing import Dict, Optional


class RateLimiter:
    """Protect against API abuse and DoS attacks"""

    def __init__(self):
        self.request_counts: Dict[str, list] = defaultdict(list)
        self.blocked_ips: Dict[str, float] = {}
        self.violation_counts: Dict[str, int] = defaultdict(int)
        self.rate_limits = {
            "default": (100, 60),    # 100 requests per 60 seconds
            "openrouter": (50, 60),  # 50 LLM requests per 60 seconds
            "database": (1000, 60),  # 1000 DB operations per 60 seconds
        }

    async def check_rate_limit(
        self,
        identifier: str,
        category: str = "default"
    ) -> tuple[bool, Optional[str]]:
        """Check whether a request should be allowed"""

        # Check if the identifier is temporarily blocked
        if identifier in self.blocked_ips:
            block_until = self.blocked_ips[identifier]
            if time.time() < block_until:
                return False, f"Temporarily blocked until {time.ctime(block_until)}"
            del self.blocked_ips[identifier]

        # Get the rate limit for this category
        max_requests, window_seconds = self.rate_limits.get(
            category, self.rate_limits["default"]
        )

        # Drop old requests outside the window
        now = time.time()
        cutoff = now - window_seconds
        self.request_counts[identifier] = [
            req_time for req_time in self.request_counts[identifier]
            if req_time > cutoff
        ]

        # Check if within limits
        current_count = len(self.request_counts[identifier])
        if current_count >= max_requests:
            # Block for an increasing duration based on repeat violations
            self.violation_counts[identifier] += 1
            block_duration = min(300, 30 * self.violation_counts[identifier])  # Max 5 minutes
            self.blocked_ips[identifier] = now + block_duration

            return False, f"Rate limit exceeded. Blocked for {block_duration} seconds"

        # Record this request
        self.request_counts[identifier].append(now)
        return True, None

    async def check_api_health(self) -> dict:
        """Monitor for suspicious patterns"""
        now = time.time()

        # Count recent requests across all identifiers
        recent_requests = 0
        for requests in self.request_counts.values():
            recent_requests += len([r for r in requests if r > now - 60])

        # Calculate metrics
        total_blocked = len(self.blocked_ips)
        active_identifiers = len([
            requests for requests in self.request_counts.values()
            if any(r > now - 300 for r in requests)  # Active in last 5 minutes
        ])

        status = "healthy"
        if recent_requests > 500:  # Threshold for concern
            status = "high_load"
        if total_blocked > 10:
            status = "under_attack"

        return {
            "status": status,
            "recent_requests_per_minute": recent_requests,
            "blocked_identifiers": total_blocked,
            "active_identifiers": active_identifiers,
            "timestamp": now
        }
```
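The sliding-window core of the check can be demonstrated on its own, with a deliberately tiny limit of 3 requests per window (the `client-1` identifier is made up):

```python
import asyncio
import time
from collections import defaultdict

# Sliding-window core of check_rate_limit, with a tiny limit for the demo
request_counts = defaultdict(list)
MAX_REQUESTS, WINDOW_SECONDS = 3, 60


async def allow(identifier: str) -> bool:
    now = time.time()
    cutoff = now - WINDOW_SECONDS
    # Drop timestamps that have aged out of the window
    request_counts[identifier] = [t for t in request_counts[identifier] if t > cutoff]
    if len(request_counts[identifier]) >= MAX_REQUESTS:
        return False  # over the limit; the full class would also start a block
    request_counts[identifier].append(now)
    return True


async def main() -> list[bool]:
    return [await allow("client-1") for _ in range(5)]


results = asyncio.run(main())
print(results)  # [True, True, True, False, False]
```

The fourth and fifth calls are rejected because three timestamps already sit inside the 60-second window.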

### Audit Logging and Compliance

**Security Event Logging**:
```python
import json
import logging
import os
from datetime import datetime
from enum import Enum
from typing import Any, Dict, Optional


class SecurityEventType(Enum):
    AUTH_SUCCESS = "auth_success"
    AUTH_FAILURE = "auth_failure"
    DATA_ACCESS = "data_access"
    DATA_EXPORT = "data_export"
    CONFIG_CHANGE = "config_change"
    API_ABUSE = "api_abuse"
    SYSTEM_ERROR = "system_error"


class SecurityLogFormatter(logging.Formatter):
    """Custom formatter for security logs"""

    def format(self, record: logging.LogRecord) -> str:
        # Security logs are already JSON formatted
        return record.getMessage()


class SecurityAuditor:
    """Centralized security event logging for compliance"""

    def __init__(self):
        # Separate logger for security events
        self.security_logger = logging.getLogger("tradingagents.security")

        # Configure a structured logging handler
        handler = logging.FileHandler("logs/security.log")
        handler.setFormatter(SecurityLogFormatter())
        self.security_logger.addHandler(handler)
        self.security_logger.setLevel(logging.INFO)

    def log_event(
        self,
        event_type: SecurityEventType,
        message: str,
        user_id: Optional[str] = None,
        ip_address: Optional[str] = None,
        resource: Optional[str] = None,
        additional_data: Optional[Dict[str, Any]] = None
    ) -> None:
        """Log a security event with structured data"""

        event_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "event_type": event_type.value,
            "message": message,
            "severity": self._get_severity(event_type),
            "user_id": user_id or "system",
            "ip_address": ip_address or "unknown",
            "resource": resource,
            "additional_data": additional_data or {},
            "process_id": os.getpid(),
            "hostname": os.uname().nodename
        }

        # Log at the appropriate level based on severity
        if event_data["severity"] == "critical":
            self.security_logger.critical(json.dumps(event_data))
        elif event_data["severity"] == "warning":
            self.security_logger.warning(json.dumps(event_data))
        else:
            self.security_logger.info(json.dumps(event_data))

    def _get_severity(self, event_type: SecurityEventType) -> str:
        """Determine event severity"""
        critical_events = {
            SecurityEventType.AUTH_FAILURE,
            SecurityEventType.API_ABUSE,
            SecurityEventType.CONFIG_CHANGE
        }

        if event_type in critical_events:
            return "critical"
        elif event_type == SecurityEventType.SYSTEM_ERROR:
            return "warning"
        return "info"

    def log_data_access(
        self,
        table: str,
        operation: str,
        record_count: int,
        user_id: str = "system"
    ) -> None:
        """Log data access for compliance auditing"""
        self.log_event(
            SecurityEventType.DATA_ACCESS,
            f"Database {operation} on {table}",
            user_id=user_id,
            resource=table,
            additional_data={
                "operation": operation,
                "record_count": record_count
            }
        )

    def log_api_key_usage(
        self,
        provider: str,
        model: str,
        tokens_used: int,
        cost_estimate: float
    ) -> None:
        """Log LLM API usage for cost monitoring and abuse detection"""
        self.log_event(
            SecurityEventType.DATA_ACCESS,
            f"LLM API call to {provider}/{model}",
            resource=f"{provider}/{model}",
            additional_data={
                "tokens_used": tokens_used,
                "cost_estimate": cost_estimate,
                "timestamp": datetime.utcnow().isoformat()
            }
        )


# Usage in repository classes
class NewsRepository:
    def __init__(self, database_manager: DatabaseManager):
        self.db_manager = database_manager
        self.auditor = SecurityAuditor()

    async def list(self, symbol: str, date: date) -> list[NewsArticle]:
        # ... existing implementation ...

        # Log data access for compliance
        self.auditor.log_data_access(
            table="news_articles",
            operation="SELECT",
            record_count=len(result),
            user_id=getattr(self, 'current_user_id', 'system')
        )

        return result
```
This comprehensive security standards document provides the foundation for protecting sensitive financial data, API keys, and system resources while maintaining compliance with data protection regulations in the TradingAgents system.
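The severity mapping in `_get_severity` above is pure logic and can be exercised in isolation; a minimal sketch with a trimmed-down `SecurityEventType` enum (member names taken from the snippet, the standalone `get_severity` function is illustrative):

```python
from enum import Enum, auto

class SecurityEventType(Enum):
    AUTH_FAILURE = auto()
    API_ABUSE = auto()
    CONFIG_CHANGE = auto()
    SYSTEM_ERROR = auto()
    DATA_ACCESS = auto()

def get_severity(event_type: SecurityEventType) -> str:
    """Mirror the SecurityAuditor._get_severity classification rules."""
    critical_events = {
        SecurityEventType.AUTH_FAILURE,
        SecurityEventType.API_ABUSE,
        SecurityEventType.CONFIG_CHANGE,
    }
    if event_type in critical_events:
        return "critical"
    if event_type is SecurityEventType.SYSTEM_ERROR:
        return "warning"
    return "info"

assert get_severity(SecurityEventType.API_ABUSE) == "critical"
assert get_severity(SecurityEventType.SYSTEM_ERROR) == "warning"
assert get_severity(SecurityEventType.DATA_ACCESS) == "info"
```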
@@ -0,0 +1,715 @@
# Style Guide - TradingAgents

## Python Code Style

### Formatting with Ruff

**Configuration** (pyproject.toml):
```toml
[tool.ruff]
target-version = "py313"
line-length = 88
fix = true
extend-exclude = [
    "migrations/",
    "alembic/versions/",
    ".env",
    "venv/",
    ".venv/",
]

[tool.ruff.lint]
select = [
    "E",   # pycodestyle errors
    "W",   # pycodestyle warnings
    "F",   # Pyflakes
    "I",   # isort
    "B",   # flake8-bugbear
    "C4",  # flake8-comprehensions
    "UP",  # pyupgrade
    "ERA", # eradicate
    "PIE", # flake8-pie
    "SIM", # flake8-simplify
    "TCH", # flake8-type-checking
    "ARG", # flake8-unused-arguments
    "PTH", # flake8-use-pathlib
    "FIX", # flake8-fixme
    "TD",  # flake8-todos
]

ignore = [
    "E501",   # Line too long (handled by formatter)
    "B008",   # Do not perform function calls in argument defaults
    "B904",   # Use `raise ... from ...` for exception chaining
    "TD002",  # Missing author in TODO
    "TD003",  # Missing issue link on line following TODO
    "FIX002", # Line contains TODO
]

[tool.ruff.lint.per-file-ignores]
"tests/**/*.py" = [
    "S101",    # Use of assert detected
    "ARG001",  # Unused function argument
    "FBT001",  # Boolean positional arg
    "PLR2004", # Magic value used in comparison
]

"migrations/**/*.py" = [
    "ERA001", # Found commented-out code
]

[tool.ruff.lint.isort]
known-first-party = ["tradingagents"]
force-sort-within-sections = true
```

### Type Hints and Annotations

**Modern Type Syntax** (Python 3.13):
```python
# Use built-in generics (no typing.List, typing.Dict)
def process_articles(articles: list[NewsArticle]) -> dict[str, int]:
    """Process articles and return symbol counts"""
    counts: dict[str, int] = {}
    for article in articles:
        symbol = article.symbol or "UNKNOWN"
        counts[symbol] = counts.get(symbol, 0) + 1
    return counts

# Union types with |
def get_article(article_id: str | int) -> NewsArticle | None:
    """Get article by ID (string or integer)"""
    if isinstance(article_id, str):
        return get_by_url(article_id)
    return get_by_id(article_id)

# Optional with explicit None
def calculate_sentiment(text: str, model: str | None = None) -> float | None:
    """Calculate sentiment score"""
    if not text.strip():
        return None
    # Implementation
    return 0.5
```

**Type Annotations for Complex Types**:
```python
from typing import TypeVar, Generic, Protocol, TypedDict, Awaitable
from collections.abc import Callable, AsyncGenerator
from datetime import date, datetime

# Type variables
T = TypeVar('T')
ArticleT = TypeVar('ArticleT', bound='NewsArticle')

# Protocol for type checking
class Repository(Protocol[T]):
    async def list(self, symbol: str, date: date) -> list[T]:
        ...

    async def upsert(self, item: T) -> T:
        ...

# TypedDict for structured data
class ArticleData(TypedDict):
    headline: str
    url: str
    published_date: str
    sentiment_score: float | None

# Callable types
ProcessorFunc = Callable[[list[NewsArticle]], Awaitable[dict[str, int]]]
```

### Docstring Standards

**Google Style Docstrings**:
```python
class NewsRepository:
    """Repository for news article data access with PostgreSQL backend.

    Handles CRUD operations for news articles with support for batch operations,
    vector similarity search, and TimescaleDB time-series optimization.

    Attributes:
        db_manager: AsyncIO database connection manager

    Example:
        >>> db_manager = DatabaseManager("postgresql://...")
        >>> repo = NewsRepository(db_manager)
        >>> articles = await repo.list("AAPL", date(2024, 1, 15))
    """

    def __init__(self, database_manager: DatabaseManager) -> None:
        """Initialize repository with database connection.

        Args:
            database_manager: Async database connection manager with
                PostgreSQL + TimescaleDB + pgvector support.
        """
        self.db_manager = database_manager

    async def upsert_batch(
        self,
        articles: list[NewsArticle],
        symbol: str,
        *,
        chunk_size: int = 1000
    ) -> list[NewsArticle]:
        """Batch insert or update articles with deduplication.

        Uses PostgreSQL ON CONFLICT for atomic upserts based on URL uniqueness.
        Processes articles in chunks to optimize memory usage for large datasets.

        Args:
            articles: News articles to store
            symbol: Stock symbol to associate with articles
            chunk_size: Number of articles to process per database transaction.
                Defaults to 1000 for optimal PostgreSQL performance.

        Returns:
            List of stored articles with database-generated metadata

        Raises:
            IntegrityError: If URL constraint violations occur
            DatabaseConnectionError: If database is unavailable

        Example:
            >>> articles = [NewsArticle("Title", "https://...", ...)]
            >>> stored = await repo.upsert_batch(articles, "AAPL")
            >>> assert len(stored) == len(articles)
        """
        if not articles:
            return []

        # Implementation...
```

**Module-Level Docstrings**:
```python
"""
News repository with PostgreSQL + TimescaleDB backend.

This module provides data access patterns for financial news articles with
support for:
- Time-series queries optimized by TimescaleDB
- Vector similarity search using pgvector
- Bulk operations with PostgreSQL-specific optimizations
- Async/await patterns for high-performance I/O

Example Usage:
    from tradingagents.domains.news.news_repository import NewsRepository
    from tradingagents.lib.database import DatabaseManager

    db = DatabaseManager("postgresql+asyncpg://...")
    repo = NewsRepository(db)

    # Get articles for a symbol and date
    articles = await repo.list("AAPL", date(2024, 1, 15))

    # Batch store new articles
    new_articles = [...]
    stored = await repo.upsert_batch(new_articles, "AAPL")
"""

from __future__ import annotations
```

### Variable and Function Naming

**Snake Case for Everything**:
```python
# Variables
article_count = len(articles)
sentiment_threshold = 0.5
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")

# Functions
def calculate_portfolio_risk(positions: list[Position]) -> float:
    """Calculate portfolio-wide risk metrics"""

async def fetch_news_articles(symbol: str, date: date) -> list[NewsArticle]:
    """Fetch news articles from external APIs"""

# Private methods
def _validate_sentiment_score(score: float | None) -> bool:
    """Internal validation for sentiment scores"""

# Constants
MAX_ARTICLES_PER_REQUEST = 100
DEFAULT_LOOKBACK_DAYS = 30
OPENAI_EMBEDDING_DIMENSIONS = 1536
```

**Descriptive Names Over Short Names**:
```python
# Good - Clear intent
async def update_articles_for_symbol(symbol: str, target_date: date) -> int:
    successful_count = 0
    failed_count = 0

    for news_source in self.configured_sources:
        try:
            articles = await news_source.fetch(symbol, target_date)
            stored_articles = await self.repository.upsert_batch(articles, symbol)
            successful_count += len(stored_articles)
        except Exception as e:
            failed_count += 1
            logger.warning(f"Failed to fetch from {news_source.name}: {e}")

    return successful_count

# Avoid - Unclear abbreviations
async def upd_arts(sym: str, dt: date) -> int:
    cnt = 0
    for src in self.srcs:
        arts = await src.get(sym, dt)
        cnt += len(arts)
    return cnt
```

### Import Organization

**Import Order with isort**:
```python
# 1. Standard library imports
import asyncio
import logging
import uuid
from datetime import date, datetime
from pathlib import Path
from typing import Any

# 2. Third-party imports
import aiohttp
from sqlalchemy import select, and_
from sqlalchemy.ext.asyncio import AsyncSession
import pytest

# 3. First-party imports
from tradingagents.config import TradingAgentsConfig
from tradingagents.domains.news.news_repository import NewsArticle, NewsRepository
from tradingagents.lib.database import DatabaseManager

# 4. Relative imports (avoid when possible)
from .google_news_client import GoogleNewsClient
```

**Import Aliases**:
```python
# Standard aliases for common packages
import pandas as pd
import numpy as np
from datetime import datetime as dt, date

# Avoid long module paths
from tradingagents.domains.news.news_repository import (
    NewsArticle,
    NewsRepository,
    NewsArticleEntity
)

# Type-only imports for forward references
from typing import TYPE_CHECKING
if TYPE_CHECKING:
    from tradingagents.agents.trading_agent import TradingAgent
```

## Database Naming Conventions

### Table Names

**Snake Case with Domain Prefix**:
```sql
-- Domain-prefixed tables
news_articles              -- Core news data
news_article_embeddings    -- Vector embeddings (if separate)

market_data_daily          -- Daily market prices
market_data_intraday       -- Intraday tick data

social_media_posts         -- Social media content
social_sentiment_scores    -- Sentiment analysis results

-- Agent-specific tables
agent_decisions            -- Trading decisions
agent_portfolios           -- Portfolio states
agent_memories             -- RAG memory store
```

### Column Names

**Descriptive Snake Case**:
```sql
-- Good - Clear and consistent
CREATE TABLE news_articles (
    id UUID PRIMARY KEY DEFAULT uuid7(),
    headline TEXT NOT NULL,
    url TEXT UNIQUE NOT NULL,
    published_date DATE NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    -- Foreign key relationships
    symbol VARCHAR(20) REFERENCES stocks(symbol),
    source_id UUID REFERENCES news_sources(id),

    -- Metrics and scores
    sentiment_score DECIMAL(3,2) CHECK (sentiment_score BETWEEN -1 AND 1),
    readability_score INTEGER CHECK (readability_score BETWEEN 0 AND 100),

    -- Vector embeddings
    title_embedding VECTOR(1536),
    content_embedding VECTOR(1536)
);

-- Avoid - Unclear abbreviations
CREATE TABLE art (
    id UUID,
    ttl TEXT,        -- title?
    dt DATE,         -- published_date?
    scr DECIMAL,     -- score? source?
    emb VECTOR(1536) -- embedding?
);
```

### Index Names

**Descriptive with Purpose**:
```sql
-- Pattern: idx_{table}_{columns}_{purpose}
CREATE INDEX idx_news_articles_symbol_date_lookup
    ON news_articles (symbol, published_date);

CREATE INDEX idx_news_articles_published_date_timeseries
    ON news_articles (published_date DESC);

CREATE INDEX idx_news_articles_url_unique
    ON news_articles (url);

-- Vector indexes with algorithm
CREATE INDEX idx_news_articles_title_embedding_cosine
    ON news_articles USING ivfflat (title_embedding vector_cosine_ops);

-- Partial indexes for specific queries
CREATE INDEX idx_news_articles_recent_high_sentiment
    ON news_articles (published_date, sentiment_score)
    WHERE published_date > CURRENT_DATE - INTERVAL '30 days'
    AND sentiment_score > 0.5;
```

## API Design Patterns

### RESTful URL Structure

**Resource-Based URLs**:
```python
# Good - Resource-oriented
GET    /api/v1/symbols/AAPL/articles?date=2024-01-15  # Get articles
POST   /api/v1/symbols/AAPL/articles                  # Create articles
PUT    /api/v1/articles/{article_id}                  # Update article
DELETE /api/v1/articles/{article_id}                  # Delete article

GET  /api/v1/symbols/AAPL/market-data?start=2024-01-01&end=2024-01-31
POST /api/v1/trading/decisions                        # Create trading decision
GET  /api/v1/agents/portfolios/{portfolio_id}         # Get portfolio state

# Avoid - Action-oriented
POST /api/v1/getArticles             # Should be GET
POST /api/v1/updateSymbolData        # Should be PUT
GET  /api/v1/performTradingAnalysis  # Should be POST
```

**Query Parameter Standards**:
```python
from datetime import date
from pydantic import BaseModel, Field, validator

class ArticleQueryParams(BaseModel):
    """Query parameters for article endpoints"""

    # Date filtering
    date: date | None = None
    start_date: date | None = Field(None, alias="start")
    end_date: date | None = Field(None, alias="end")

    # Pagination
    limit: int = Field(default=50, ge=1, le=1000)
    offset: int = Field(default=0, ge=0)

    # Filtering
    sources: list[str] | None = Field(None, description="Filter by news sources")
    min_sentiment: float | None = Field(None, ge=-1.0, le=1.0)
    max_sentiment: float | None = Field(None, ge=-1.0, le=1.0)

    # Search
    query: str | None = Field(None, max_length=200)

    @validator('end_date')
    def end_date_after_start(cls, v, values):
        if v and values.get('start_date') and v < values['start_date']:
            raise ValueError('end_date must be after start_date')
        return v
```

### Response Formats

**Consistent JSON Structure**:
```python
from typing import Generic, TypeVar
from pydantic import BaseModel

T = TypeVar('T')

class APIResponse(BaseModel, Generic[T]):
    """Standard API response wrapper"""

    data: T | None = None
    success: bool = True
    message: str | None = None
    errors: list[str] = []

    # Metadata
    request_id: str | None = None
    timestamp: str = Field(default_factory=lambda: datetime.utcnow().isoformat())

class PaginatedResponse(APIResponse[list[T]]):
    """Paginated response with metadata"""

    pagination: dict[str, int] = Field(default_factory=dict)

    @classmethod
    def create(
        cls,
        data: list[T],
        total: int,
        limit: int,
        offset: int
    ) -> 'PaginatedResponse[T]':
        return cls(
            data=data,
            pagination={
                "total": total,
                "limit": limit,
                "offset": offset,
                "has_more": offset + len(data) < total
            }
        )

# Usage example
@app.get("/api/v1/symbols/{symbol}/articles")
async def get_articles(
    symbol: str,
    params: ArticleQueryParams = Depends(),
    db: AsyncSession = Depends(get_db_session)
) -> PaginatedResponse[ArticleData]:
    """Get news articles for a symbol"""

    # Query implementation
    articles, total = await article_service.get_paginated(
        symbol=symbol,
        limit=params.limit,
        offset=params.offset,
        date_filter=params.date
    )

    return PaginatedResponse.create(
        data=[ArticleData.from_entity(a) for a in articles],
        total=total,
        limit=params.limit,
        offset=params.offset
    )
```

## Documentation Standards

### Code Comments

**When to Comment**:
```python
class NewsRepository:
    async def upsert_batch(self, articles: list[NewsArticle], symbol: str) -> list[NewsArticle]:
        # Don't comment obvious code
        if not articles:
            return []

        # DO comment complex business logic
        # Use PostgreSQL ON CONFLICT for atomic upsert operations.
        # This prevents race conditions when multiple processes
        # are updating the same articles simultaneously.
        stmt = insert(NewsArticleEntity).values(entity_data_list)
        upsert_stmt = stmt.on_conflict_do_update(
            index_elements=["url"],  # Deduplication key
            set_={
                # Update all fields except ID and created_at
                **{col: stmt.excluded[col] for col in updateable_columns},
                "updated_at": func.now(),
            },
        )

        # DO comment performance optimizations
        # Batch size of 1000 optimizes PostgreSQL memory usage
        # while avoiding transaction timeout for large datasets
        for chunk in chunks(entity_data_list, 1000):
            result = await session.execute(upsert_stmt)
```
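The `chunks` helper used in the snippet above is not defined there; a minimal sketch of what it is assumed to do (split a sequence into fixed-size slices) could look like:

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")

def chunks(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 2500 rows at a batch size of 1000 produce batches of 1000, 1000, 500
batches = list(chunks(list(range(2500)), 1000))
assert [len(b) for b in batches] == [1000, 1000, 500]
```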

**TODO Comments**:
```python
# TODO(martin): Implement caching layer for frequently accessed articles
# TODO(martin): Add vector similarity search for related articles
# FIXME(martin): Handle edge case where published_date is in future
# HACK(martin): Temporary workaround for API rate limiting - remove after v2.0
```

### README Structure

**Repository README.md Template**:
````markdown
# TradingAgents - Multi-Agent Financial Analysis

Brief description of what the project does and why it exists.

## Quick Start

```bash
# 1. Setup environment
export OPENROUTER_API_KEY="your_key"
mise run docker   # Start PostgreSQL

# 2. Install and run
mise run install
mise run dev      # Interactive CLI
```

## Architecture

High-level overview with diagrams if helpful.

## Development

### Prerequisites
- Python 3.13+
- PostgreSQL 16+ with TimescaleDB
- OpenRouter API access

### Setup
```bash
mise run install  # Install dependencies
mise run test     # Run test suite
mise run format   # Format code
```

### Testing
Details about test strategy and running tests.

## Configuration

Environment variables and configuration options.

## Contributing

Link to contributing guidelines.
````

### Commit Message Conventions

**Conventional Commits Format**:
```
type(scope): description

[optional body]

[optional footer(s)]
```

**Types**:
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
- `style`: Code style changes (formatting, missing semicolons, etc.)
- `refactor`: Code refactoring
- `test`: Adding missing tests or correcting existing tests
- `chore`: Changes to build process or auxiliary tools

**Examples**:
```
feat(news): add vector similarity search for related articles

Implements pgvector-based similarity search using OpenAI embeddings.
Articles can now find related content based on semantic similarity
rather than just keyword matching.

- Add title_embedding and content_embedding columns
- Implement cosine similarity search in NewsRepository
- Add vector index for performance optimization

Closes #123

---

fix(database): handle connection timeouts in async sessions

Connection pooling was causing timeouts under high load.
Added proper timeout handling and connection recycling.

- Set pool_recycle=3600 for connection health
- Add retry logic for transient connection errors
- Improve error logging for debugging

---

test(news): add integration tests for batch upsert operations

Covers edge cases for duplicate URL handling and large batch processing.

---

docs(api): update OpenAPI spec for news endpoints

- Add pagination parameters
- Document error response formats
- Include example requests and responses
```

### Code Organization

**File and Directory Structure**:
```
tradingagents/
├── __init__.py
├── config.py                      # Application configuration
├── main.py                        # Entry point
│
├── domains/                       # Domain-driven design
│   ├── __init__.py
│   ├── news/                      # News domain
│   │   ├── __init__.py
│   │   ├── news_service.py        # Business logic
│   │   ├── news_repository.py     # Data access
│   │   ├── google_news_client.py  # External API
│   │   └── models.py              # Domain models
│   ├── marketdata/                # Market data domain
│   └── socialmedia/               # Social media domain
│
├── agents/                        # LLM agents
│   ├── __init__.py
│   ├── trading_agent.py
│   ├── analyst_agent.py
│   └── libs/                      # Agent utilities
│       ├── __init__.py
│       └── agent_toolkit.py
│
├── lib/                           # Shared utilities
│   ├── __init__.py
│   ├── database.py                # Database connection
│   ├── logging.py                 # Logging configuration
│   └── utils.py                   # Common utilities
│
└── types/                         # Shared type definitions
    ├── __init__.py
    ├── common.py
    └── financial.py
```

This style guide ensures consistent, maintainable code across the TradingAgents project while leveraging modern Python features and database optimization techniques.
@@ -0,0 +1,543 @@
# Technical Standards - TradingAgents

## Database Architecture

### Core Stack: PostgreSQL + TimescaleDB + pgvectorscale

**Primary Database**: PostgreSQL 16+ with TimescaleDB and pgvector extensions
- **TimescaleDB**: Optimized for time-series financial data (prices, volumes, news timestamps)
- **pgvector/pgvectorscale**: Vector embeddings for RAG-powered agents
- **Connection**: asyncpg driver for high-performance async operations

**Database URL Pattern**:
```python
# Development
DATABASE_URL = "postgresql+asyncpg://postgres:tradingagents@localhost:5432/tradingagents"

# Production
DATABASE_URL = "postgresql+asyncpg://username:password@host:port/database"
```

**Required Extensions**:
```sql
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
CREATE EXTENSION IF NOT EXISTS vector CASCADE;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
```

### Schema Design Standards

**Time-Series Tables (TimescaleDB)**:
```sql
-- Market data with time-based partitioning
CREATE TABLE market_data (
    id UUID PRIMARY KEY DEFAULT uuid7(),
    symbol VARCHAR(20) NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    price DECIMAL(18,8),
    volume BIGINT,
    -- Metadata
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Convert to hypertable for time-series optimization
SELECT create_hypertable('market_data', 'timestamp');

-- Indexes for common query patterns
CREATE INDEX ON market_data (symbol, timestamp DESC);
```

**Vector-Enabled Tables**:
```sql
-- News articles with embeddings
CREATE TABLE news_articles (
    id UUID PRIMARY KEY DEFAULT uuid7(),
    headline TEXT NOT NULL,
    url TEXT UNIQUE NOT NULL,      -- Deduplication key
    published_date DATE NOT NULL,
    title_embedding VECTOR(1536),  -- OpenAI embedding size
    content_embedding VECTOR(1536),
    -- TimescaleDB partitioning on published_date
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Vector similarity index
CREATE INDEX ON news_articles USING ivfflat (title_embedding vector_cosine_ops);
```

**Composite Indexes for Query Optimization**:
```sql
-- Common query patterns
CREATE INDEX idx_symbol_date ON news_articles (symbol, published_date);
CREATE INDEX idx_published_date ON news_articles (published_date);
CREATE INDEX idx_url_unique ON news_articles (url);
```

### Connection Management

**Async Session Factory**:
```python
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine

class DatabaseManager:
    def __init__(self, database_url: str, echo: bool = False):
        # Ensure asyncpg driver
        if not database_url.startswith("postgresql+asyncpg://"):
            database_url = database_url.replace("postgresql://", "postgresql+asyncpg://")

        self.engine = create_async_engine(
            database_url,
            echo=echo,
            pool_recycle=3600,   # 1-hour connection recycling
            pool_pre_ping=True,  # Connection health checks
        )

        self.AsyncSessionLocal = async_sessionmaker(
            bind=self.engine,
            class_=AsyncSession,
            autocommit=False,
            autoflush=False,
        )
```
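The driver-prefix rewrite in `__init__` is pure string handling and easy to check in isolation; a sketch as a standalone function (the `normalize_database_url` name is illustrative, not part of the codebase):

```python
def normalize_database_url(database_url: str) -> str:
    """Ensure the URL uses the asyncpg driver, mirroring DatabaseManager.__init__."""
    if not database_url.startswith("postgresql+asyncpg://"):
        database_url = database_url.replace("postgresql://", "postgresql+asyncpg://")
    return database_url

# Plain postgres URLs gain the asyncpg driver prefix
assert (
    normalize_database_url("postgresql://u:p@localhost:5432/db")
    == "postgresql+asyncpg://u:p@localhost:5432/db"
)
# Already-normalized URLs pass through unchanged
assert normalize_database_url("postgresql+asyncpg://u:p@h/db") == "postgresql+asyncpg://u:p@h/db"
```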

**Session Context Management**:
```python
@asynccontextmanager
async def get_session(self) -> AsyncGenerator[AsyncSession, None]:
    """Type-checker friendly session management"""
    session = self.AsyncSessionLocal()
    try:
        yield session
        await session.commit()
    except Exception:
        await session.rollback()
        raise
    finally:
        await session.close()
```

## LLM Integration Standards

### OpenRouter as Unified Provider

**Configuration**:
```python
# Environment variables
OPENROUTER_API_KEY = "your_openrouter_key"
LLM_PROVIDER = "openrouter"
DEEP_THINK_LLM = "openai/gpt-4o"        # Complex analysis
QUICK_THINK_LLM = "openai/gpt-4o-mini"  # Fast responses
BACKEND_URL = "https://openrouter.ai/api/v1"
```

**Model Selection Strategy**:
- **Deep Think**: Complex reasoning, debates, risk analysis (`openai/gpt-4o`, `anthropic/claude-3.5-sonnet`)
- **Quick Think**: Data formatting, simple queries (`openai/gpt-4o-mini`, `anthropic/claude-3-haiku`)

**Cost Optimization**:
```python
# Development/testing configuration
config = TradingAgentsConfig(
    llm_provider="openrouter",
    deep_think_llm="openai/gpt-4o-mini",   # Lower cost
    quick_think_llm="openai/gpt-4o-mini",  # Consistent model
    max_debate_rounds=1,                   # Reduce API calls
    online_tools=False,                    # Use cached data
)
```

### Agent Integration Patterns

**Anti-Corruption Layer**:
```python
class AgentToolkit:
    """Mediates between LLM agents and domain services"""

    def __init__(self, config: TradingAgentsConfig):
        self.config = config
        self.services = self._initialize_services()

    async def get_news_context(self, symbol: str, date: date) -> dict:
        """Convert domain models to structured LLM context"""
        articles = await self.news_service.get_articles(symbol, date)

        return {
            "articles": [article.to_dict() for article in articles],
            "count": len(articles),
            "data_quality": self._assess_data_quality(articles),
            "source_distribution": self._analyze_sources(articles)
        }
```

## Layered Architecture Enforcement

### Standard Layer Pattern

**Data Flow**: `Request → Router → Service → Repository → Entity → Database`

**Component Responsibilities**:

1. **Entity (Domain Model)**:
```python
@dataclass
class NewsArticle:
    """Domain entity with business rules and transformations"""

    headline: str
    url: str
    published_date: date
    sentiment_score: float | None = None

    def to_entity(self, symbol: str | None = None) -> NewsArticleEntity:
        """Transform to database model"""
        return NewsArticleEntity(
            headline=self.headline,
            url=self.url,
            published_date=self.published_date,
            symbol=symbol
        )

    @staticmethod
    def from_entity(entity: NewsArticleEntity) -> 'NewsArticle':
        """Transform from database model"""
        return NewsArticle(
            headline=entity.headline,
            url=entity.url,
            published_date=entity.published_date,
            sentiment_score=entity.sentiment_score
        )

    def validate(self) -> list[str]:
        """Business rule validation"""
        errors = []
        if not self.headline.strip():
            errors.append("Headline cannot be empty")
        if not self.url.startswith(("http://", "https://")):
            errors.append("Invalid URL format")
        return errors
```
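The `validate` rules above can be checked without any database; a self-contained sketch that reproduces only the fields the rules touch (`NewsArticleEntity` and the transform methods are omitted, and the sample articles are hypothetical):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class NewsArticle:
    headline: str
    url: str
    published_date: date

    def validate(self) -> list[str]:
        """Same business rules as the full entity above."""
        errors = []
        if not self.headline.strip():
            errors.append("Headline cannot be empty")
        if not self.url.startswith(("http://", "https://")):
            errors.append("Invalid URL format")
        return errors

good = NewsArticle("Fed holds rates", "https://example.com/a", date(2024, 1, 15))
bad = NewsArticle("   ", "ftp://example.com/a", date(2024, 1, 15))
assert good.validate() == []
assert bad.validate() == ["Headline cannot be empty", "Invalid URL format"]
```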

2. **Repository (Data Access)**:

```python
class NewsRepository:
    """Handles data persistence with async operations"""

    def __init__(self, database_manager: DatabaseManager):
        self.db_manager = database_manager

    async def list(self, symbol: str, date: date) -> list[NewsArticle]:
        """Query with proper error handling and logging"""
        async with self.db_manager.get_session() as session:
            result = await session.execute(
                select(NewsArticleEntity)
                .filter(
                    and_(
                        NewsArticleEntity.symbol == symbol,
                        NewsArticleEntity.published_date == date,
                    )
                )
                .order_by(NewsArticleEntity.published_date.desc())
            )
            entities = result.scalars().all()
            return [NewsArticle.from_entity(e) for e in entities]

    async def upsert_batch(self, articles: list[NewsArticle], symbol: str) -> list[NewsArticle]:
        """Bulk operations for performance"""
        if not articles:
            return []

        async with self.db_manager.get_session() as session:
            # Use PostgreSQL ON CONFLICT for atomic upserts.
            # Build plain column dicts; passing entity.__dict__ would leak
            # SQLAlchemy internals such as _sa_instance_state into VALUES.
            rows = [
                {
                    "headline": article.headline,
                    "url": article.url,
                    "published_date": article.published_date,
                    "symbol": symbol,
                }
                for article in articles
            ]
            stmt = insert(NewsArticleEntity).values(rows)
            upsert_stmt = stmt.on_conflict_do_update(
                index_elements=["url"],
                # Update every inserted column except the conflict key and PK
                set_={c.name: c for c in stmt.excluded if c.name not in ("id", "url")},
            ).returning(NewsArticleEntity)

            result = await session.execute(upsert_stmt)
            entities = result.scalars().all()
            return [NewsArticle.from_entity(e) for e in entities]
```
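
The `ON CONFLICT (url) DO UPDATE` upsert above has simple semantics worth internalizing: the batch behaves like writes into a map keyed by URL. A pure-Python sketch of just that behavior (not the repository code itself):

```python
# Illustrative sketch of upsert semantics only: ON CONFLICT (url) DO UPDATE
# acts like replacing the stored row that shares the same URL key.
def upsert_by_url(stored: dict[str, dict], incoming: list[dict]) -> dict[str, dict]:
    for row in incoming:
        stored[row["url"]] = row  # insert a new URL or overwrite the existing row
    return stored


db = {"https://a": {"url": "https://a", "headline": "old"}}
db = upsert_by_url(db, [
    {"url": "https://a", "headline": "updated"},    # conflicts -> update
    {"url": "https://b", "headline": "brand new"},  # no conflict -> insert
])
print(len(db))                      # 2
print(db["https://a"]["headline"])  # updated
```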

3. **Service (Business Logic)**:

```python
class NewsService:
    """Orchestrates business operations"""

    def __init__(self, repository: NewsRepository, clients: dict):
        self.repository = repository
        self.clients = clients

    async def get_articles(self, symbol: str, date: date) -> list[NewsArticle]:
        """Business logic with error handling"""
        try:
            articles = await self.repository.list(symbol, date)
            logger.info(f"Retrieved {len(articles)} articles for {symbol}")
            return articles
        except Exception as e:
            logger.error(f"Failed to get articles for {symbol}: {e}")
            return []  # Graceful degradation

    async def update_articles(self, symbol: str, date: date) -> int:
        """Coordinated data refresh"""
        new_articles = await self._fetch_from_sources(symbol, date)
        if new_articles:
            stored = await self.repository.upsert_batch(new_articles, symbol)
            return len(stored)
        return 0
```

### Domain Isolation

**Three Core Domains**:

1. **News Domain** (`tradingagents/domains/news/`)
2. **Market Data Domain** (`tradingagents/domains/marketdata/`)
3. **Social Media Domain** (`tradingagents/domains/socialmedia/`)

**Domain Boundary Rules**:

- Domains communicate through service interfaces only
- No direct database access between domains
- Shared types in `tradingagents/types/`
- Domain events for loose coupling
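
The first boundary rule can be made concrete with a structural interface: a consumer in another domain depends on a `Protocol`, never on the news domain's repository or tables. An illustrative sketch; the interface and facade names here are assumptions, not types from the codebase:

```python
from datetime import date
from typing import Protocol, runtime_checkable


@runtime_checkable
class NewsReader(Protocol):
    """The only surface another domain is allowed to depend on."""
    def get_headlines(self, symbol: str, day: date) -> list[str]: ...


class NewsServiceFacade:
    """Lives inside the news domain; satisfies NewsReader structurally."""
    def get_headlines(self, symbol: str, day: date) -> list[str]:
        return [f"{symbol} headline for {day.isoformat()}"]


def build_report(news: NewsReader, symbol: str, day: date) -> str:
    # A market-data component receives the interface, not the implementation.
    return "; ".join(news.get_headlines(symbol, day))


svc = NewsServiceFacade()
print(isinstance(svc, NewsReader))  # True
print(build_report(svc, "AAPL", date(2025, 1, 2)))
```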

## Vector Integration and RAG Patterns

### Vector Embedding Storage

**OpenAI Embeddings (1536 dimensions)**:

```python
# Entity definition
class NewsArticleEntity(Base):
    title_embedding: Mapped[list[float] | None] = mapped_column(
        Vector(1536), nullable=True
    )
    content_embedding: Mapped[list[float] | None] = mapped_column(
        Vector(1536), nullable=True
    )

# Similarity search
async def find_similar_articles(self, query_embedding: list[float], limit: int = 10) -> list[NewsArticle]:
    async with self.db_manager.get_session() as session:
        result = await session.execute(
            select(NewsArticleEntity)
            .order_by(NewsArticleEntity.title_embedding.cosine_distance(query_embedding))
            .limit(limit)
        )
        return [NewsArticle.from_entity(e) for e in result.scalars()]
```
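
pgvector's `cosine_distance` orders rows by `1 - cosine_similarity`, so a distance of 0 means identical direction. A pure-Python sketch of the metric the query above sorts by:

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - (a . b) / (|a| * |b|): the ordering pgvector's cosine_distance uses."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


query = [1.0, 0.0]
candidates = {"same": [2.0, 0.0], "orthogonal": [0.0, 3.0], "close": [1.0, 0.2]}
# Nearest first, exactly like ORDER BY ... cosine_distance(query)
ranked = sorted(candidates, key=lambda k: cosine_distance(query, candidates[k]))
print(ranked)  # ['same', 'close', 'orthogonal']
```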

### RAG Context Assembly

**Agent Context Pattern**:

```python
async def build_agent_context(self, symbol: str, date: date) -> dict:
    """Assemble multi-source context for agents"""

    # Recent news with embeddings
    news_articles = await self.news_service.get_articles(symbol, date)

    # Market data
    market_data = await self.market_service.get_recent_data(symbol, days=30)

    # Social sentiment
    social_data = await self.social_service.get_sentiment(symbol, date)

    return {
        "news": {
            "articles": [a.to_dict() for a in news_articles],
            # Guard the average against an empty article list
            "sentiment_avg": (
                sum(a.sentiment_score or 0 for a in news_articles) / len(news_articles)
                if news_articles
                else 0.0
            ),
            "sources": list({a.source for a in news_articles}),
        },
        "market": {
            "current_price": market_data.current_price,
            "volatility": market_data.volatility_30d,
            "volume_trend": market_data.volume_trend,
        },
        "social": {
            "reddit_sentiment": social_data.reddit_score,
            "twitter_mentions": social_data.twitter_mentions,
        },
        "context_quality": self._assess_context_quality(news_articles, market_data, social_data),
    }
```
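
`_assess_context_quality` is referenced above but not defined in this document. One plausible, purely hypothetical sketch scores context by source coverage:

```python
# Hypothetical sketch only: the real helper is not shown in this document.
# Score context quality as the fraction of sources that returned data.
def assess_context_quality(news: list, market: object, social: object) -> float:
    available = [bool(news), market is not None, social is not None]
    return sum(available) / len(available)


print(assess_context_quality([], None, None))  # 0.0
print(assess_context_quality([{"headline": "x"}], object(), None))
```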

## Migration and Deployment Standards

### Database Migrations

**Alembic Configuration**:

```python
# alembic/env.py
import asyncio

from alembic import context
from sqlalchemy.ext.asyncio import create_async_engine

from tradingagents.lib.database import Base

def do_run_migrations(connection):
    """Synchronous migration body, driven through run_sync()."""
    context.configure(connection=connection, target_metadata=Base.metadata)
    with context.begin_transaction():
        context.run_migrations()

def run_async_migrations():
    config = context.config
    database_url = config.get_main_option("sqlalchemy.url")

    # Ensure asyncpg driver
    if database_url.startswith("postgresql://"):
        database_url = database_url.replace("postgresql://", "postgresql+asyncpg://")

    engine = create_async_engine(database_url)

    async def run():
        async with engine.begin() as connection:
            await connection.run_sync(do_run_migrations)

    asyncio.run(run())
```
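
The driver normalization step in `env.py` is worth isolating: only the scheme should be rewritten, and only once. A sketch, not the project's actual helper:

```python
def normalize_async_url(url: str) -> str:
    """Rewrite plain postgresql:// URLs to use the asyncpg driver."""
    if url.startswith("postgresql://"):
        # count=1 so only the scheme prefix is touched
        return url.replace("postgresql://", "postgresql+asyncpg://", 1)
    return url


print(normalize_async_url("postgresql://u:p@localhost:5432/db"))
# postgresql+asyncpg://u:p@localhost:5432/db
print(normalize_async_url("sqlite:///local.db"))  # unchanged
```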

**TimescaleDB-Specific Migrations**:

```python
"""Add TimescaleDB hypertable

Revision ID: 001
"""
import sqlalchemy as sa
from alembic import op
from sqlalchemy.dialects import postgresql

def upgrade():
    # Create table first
    op.create_table(
        'market_data',
        sa.Column('id', postgresql.UUID(), nullable=False),
        sa.Column('symbol', sa.String(20), nullable=False),
        sa.Column('timestamp', sa.TIMESTAMP(timezone=True), nullable=False),
        sa.Column('price', sa.Numeric(18, 8)),
        # TimescaleDB requires the time column in any unique constraint or PK
        sa.PrimaryKeyConstraint('id', 'timestamp')
    )

    # Convert to hypertable
    op.execute("SELECT create_hypertable('market_data', 'timestamp');")

    # Add indexes
    op.create_index('idx_market_symbol_time', 'market_data', ['symbol', 'timestamp'])
```

### Docker Configuration

**Development Environment**:

```yaml
# docker-compose.yml
services:
  timescaledb:
    build: ./db
    container_name: tradingagents_timescaledb
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: tradingagents
      POSTGRES_DB: tradingagents
    ports:
      - "5432:5432"
    volumes:
      - ./seed.sql:/docker-entrypoint-initdb.d/seed.sql
      - timescale_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d tradingagents"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  timescale_data:  # named volume must be declared at top level
```

### Environment Configuration

**Required Environment Variables**:

```bash
# Database
DATABASE_URL=postgresql+asyncpg://postgres:tradingagents@localhost:5432/tradingagents

# OpenRouter LLM
OPENROUTER_API_KEY=your_openrouter_key
LLM_PROVIDER=openrouter
DEEP_THINK_LLM=openai/gpt-4o
QUICK_THINK_LLM=openai/gpt-4o-mini
BACKEND_URL=https://openrouter.ai/api/v1

# Application
TRADINGAGENTS_RESULTS_DIR=./results
TRADINGAGENTS_DATA_DIR=./data
DEFAULT_LOOKBACK_DAYS=30
ONLINE_TOOLS=true

# Performance
MAX_DEBATE_ROUNDS=1
MAX_RISK_DISCUSS_ROUNDS=1
```

## Quality Gates

### Database Performance

**Query Performance Standards**:

- Simple queries: < 100ms
- Complex aggregations: < 500ms
- Vector similarity searches: < 1s
- Batch operations: < 5s for 1000 records
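
These budgets can be encoded as a small check usable in tests or monitoring; the threshold values mirror the list above, the helper itself is illustrative:

```python
# Latency budgets from the standards above, in milliseconds.
BUDGETS_MS = {
    "simple_query": 100,
    "complex_aggregation": 500,
    "vector_search": 1_000,
    "batch_1000": 5_000,
}


def within_budget(kind: str, elapsed_ms: float) -> bool:
    """True when a measured latency meets the documented standard."""
    return elapsed_ms < BUDGETS_MS[kind]


print(within_budget("simple_query", 42))     # True
print(within_budget("vector_search", 1500))  # False
```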

**Monitoring Queries**:

```sql
-- Query performance monitoring
SELECT query, mean_exec_time, calls, total_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC;

-- TimescaleDB chunk information
SELECT * FROM chunk_relation_size('market_data');
```

### Connection Health

**Health Check Implementation**:

```python
async def health_check() -> dict:
    """Comprehensive system health check"""
    checks = {}

    # Database connectivity
    try:
        async with db_manager.get_session() as session:
            await session.execute(text("SELECT 1"))
        checks["database"] = {"status": "healthy", "latency_ms": None}
    except Exception as e:
        checks["database"] = {"status": "unhealthy", "error": str(e)}

    # OpenRouter API
    try:
        # Test API connection
        checks["llm_api"] = {"status": "healthy"}
    except Exception as e:
        checks["llm_api"] = {"status": "unhealthy", "error": str(e)}

    return checks
```

### Data Quality Enforcement

**Validation Pipeline**:

```python
class DataQualityValidator:
    """Ensures data meets quality standards before storage"""

    def validate_news_article(self, article: NewsArticle) -> list[str]:
        errors = []

        # Business rules
        if not article.headline.strip():
            errors.append("Empty headline")

        if len(article.headline) > 500:
            errors.append("Headline too long")

        # Explicit None check so a legitimate 0.0 score is still range-checked
        if article.sentiment_score is not None and not (-1 <= article.sentiment_score <= 1):
            errors.append("Invalid sentiment score range")

        # Data freshness
        if article.published_date > date.today():
            errors.append("Future publication date")

        return errors
```

This technical standards document provides the foundation for maintaining consistency across the TradingAgents codebase while ensuring optimal performance for financial data processing and AI agent operations.
litellm.yml (17 lines deleted)
@@ -1,17 +0,0 @@
model_list:
  - model_name: "*" # Catches any model request
    litellm_params:
      model: "openrouter/qwen/qwen3-coder"
      api_key: os.environ/OPENROUTER_API_KEY
      stream: false
      timeout: 600 # 10 minutes total - complex code can take time
      stop: []

general_settings:
  drop_params: true
  stream: false

router_settings:
  num_retries: 10
  retry_after: 2
  allowed_fails: 100
@@ -1,6 +0,0 @@
{
  "name": "TradingAgents",
  "lockfileVersion": 3,
  "requires": true,
  "packages": {}
}
@@ -1 +0,0 @@
{}
prd/news_service.md (1019 lines; diff suppressed because it is too large)
setup.py (43 lines deleted)
@@ -1,43 +0,0 @@
"""
Setup script for the TradingAgents package.
"""

from setuptools import find_packages, setup

setup(
    name="tradingagents",
    version="0.1.0",
    description="Multi-Agents LLM Financial Trading Framework",
    author="TradingAgents Team",
    author_email="yijia.xiao@cs.ucla.edu",
    url="https://github.com/TauricResearch",
    packages=find_packages(),
    install_requires=[
        "langchain>=0.1.0",
        "langchain-openai>=0.0.2",
        "langchain-experimental>=0.0.40",
        "langgraph>=0.0.20",
        "numpy>=1.24.0",
        "pandas>=2.0.0",
        "praw>=7.7.0",
        "stockstats>=0.5.4",
        "yfinance>=0.2.31",
        "typer>=0.9.0",
        "rich>=13.0.0",
        "questionary>=2.0.1",
    ],
    python_requires=">=3.10",
    entry_points={
        "console_scripts": [
            "tradingagents=cli.main:app",
        ],
    },
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Financial and Trading Industry",
        "License :: OSI Approved :: Apache Software License",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.10",
        "Topic :: Office/Business :: Financial :: Investment",
    ],
)
@@ -1,4 +0,0 @@
#!/bin/bash
echo "Running type check..."
cd /Users/martinrichards/code/TradingAgents
mise run typecheck
@@ -2,6 +2,8 @@
 Tests for ArticleScraperClient using pytest-vcr for HTTP interactions.
 """
 
+from unittest.mock import patch
+
 import pytest
 
 from tradingagents.domains.news.article_scraper_client import (
@@ -9,34 +11,21 @@ from tradingagents.domains.news.article_scraper_client import (
     ScrapeResult,
 )
 
-# VCR configuration optimized for minimal cassette size
-def response_content_filter(response):
-    """Filter response content to reduce cassette size."""
-    if "text/html" in response.get("headers", {}).get("content-type", [""])[0]:
-        # For HTML responses, keep only the first 1KB for basic structure
-        if "string" in response["body"]:
-            content = response["body"]["string"]
-            if len(content) > 1024:
-                response["body"]["string"] = (
-                    content[:1024] + "... [TRUNCATED for test size]"
-                )
-    return response
-
-
+# VCR configuration
 vcr = pytest.mark.vcr(
     cassette_library_dir="tests/fixtures/vcr_cassettes/news",
     record_mode="once",  # Record once, then replay
     match_on=["uri", "method"],
-    filter_headers=["authorization", "cookie", "user-agent", "set-cookie"],
-    before_record_response=response_content_filter,
+    filter_headers=["authorization", "cookie", "user-agent"],
 )
 
 
 @pytest.fixture
 def scraper():
     """ArticleScraperClient instance for testing."""
-    return ArticleScraperClient(user_agent="Test-Agent/1.0", delay=0.1)
+    # Mock NLTK downloads to avoid external HTTP requests during tests
+    with patch("nltk.download"):
+        return ArticleScraperClient(user_agent="Test-Agent/1.0", delay=0.1)
 
 
 class TestArticleScraperClient:
@@ -14,33 +14,12 @@ from tradingagents.domains.news.google_news_client import (
     GoogleNewsClient,
 )
 
-# VCR configuration optimized for minimal cassette size
-def rss_content_filter(response):
-    """Filter RSS content to reduce cassette size while preserving test data."""
-    content_type = response.get("headers", {}).get("content-type", [""])[0]
-    if "xml" in content_type and "string" in response["body"]:
-        content = response["body"]["string"]
-        # For RSS feeds, keep only first 5 items to reduce size
-        if len(content) > 5000:  # Only truncate large RSS feeds
-            # Find closing tag of 5th item
-            item_count = content.count("<item>")
-            if item_count > 5:
-                # Keep RSS structure but limit to 5 items
-                parts = content.split("</item>")
-                if len(parts) > 6:  # 5 items + everything after
-                    response["body"]["string"] = (
-                        "</item>".join(parts[:6]) + "</channel></rss>"
-                    )
-    return response
-
-
+# VCR configuration
 vcr = pytest.mark.vcr(
     cassette_library_dir="tests/fixtures/vcr_cassettes/news",
     record_mode="once",  # Record once, then replay
     match_on=["uri", "method"],
-    filter_headers=["authorization", "cookie", "user-agent", "set-cookie"],
-    before_record_response=rss_content_filter,
+    filter_headers=["authorization", "cookie"],
 )