# TradingAgents SLURM Cluster Guide

This guide explains how to run the TradingAgents framework on a SLURM cluster.

## Overview

The TradingAgents framework has been configured to run efficiently on SLURM clusters, with the following features:

- **Multi-job support**: Single analysis, batch processing, and GPU-accelerated runs
- **Resource management**: Optimized CPU, memory, and GPU allocation
- **Environment isolation**: Python virtual environments and dependency management
- **Result collection**: Structured output and error handling
- **LLM flexibility**: Support for various LLM providers (OpenAI, Anthropic, Ollama, etc.)
## Files Created

| File                       | Purpose                                       |
| -------------------------- | --------------------------------------------- |
| `slurm_setup.sh`           | Environment setup and dependency installation |
| `slurm_single_analysis.sh` | Single stock analysis job                     |
| `slurm_batch_analysis.sh`  | Batch analysis for multiple stocks            |
| `slurm_gpu_analysis.sh`    | GPU-accelerated analysis with local models    |
| `slurm_manager.sh`         | Job management and utility script             |
| `.env.slurm.template`      | Environment configuration template            |
## Quick Start

### 1. Initial Setup

```bash
# Make the manager script executable
chmod +x slurm_manager.sh

# Set up the environment and create directories
./slurm_manager.sh setup

# Submit the setup job to install dependencies
./slurm_manager.sh submit-setup
```
### 2. Configure Environment

Edit the `.env` file (created from the template) to configure your LLM provider. Set the variables for exactly one provider:

```bash
# For Ollama (local models)
LLM_PROVIDER=ollama
LLM_BACKEND_URL=http://localhost:11434/v1
DEEP_THINK_LLM=llama3.2
QUICK_THINK_LLM=llama3.2

# For OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=your_api_key_here
DEEP_THINK_LLM=gpt-4
QUICK_THINK_LLM=gpt-3.5-turbo

# For Anthropic
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=your_api_key_here
DEEP_THINK_LLM=claude-3-sonnet-20240229
QUICK_THINK_LLM=claude-3-haiku-20240307
```
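The job scripts are expected to load these variables before launching an analysis. One common pattern is `set -a` plus `source`; a minimal sketch, shown here against a throwaway `.env` (how `slurm_*.sh` actually loads the file may differ):

```bash
# Create a throwaway .env purely for demonstration.
tmp=$(mktemp -d)
printf 'LLM_PROVIDER=ollama\n' > "$tmp/.env"

# `set -a` auto-exports every variable defined while it is active,
# so sourcing .env puts the settings into the job's environment.
set -a
. "$tmp/.env"
set +a

echo "provider: $LLM_PROVIDER"
```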
### 3. Submit Jobs

```bash
# Single stock analysis
./slurm_manager.sh submit-single AAPL

# Batch analysis (multiple stocks)
./slurm_manager.sh submit-batch

# GPU-accelerated analysis
./slurm_manager.sh submit-gpu TSLA
```
### 4. Monitor Jobs

```bash
# Check all recent jobs
./slurm_manager.sh status

# Check a specific job
./slurm_manager.sh status 12345

# View job output
./slurm_manager.sh output 12345

# View job errors
./slurm_manager.sh output 12345 err
```
### 5. Collect Results

```bash
# View results for all symbols
./slurm_manager.sh results

# View results for a specific symbol
./slurm_manager.sh results AAPL

# View results for a specific date
./slurm_manager.sh results AAPL 2024-01-15
```
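For programmatic access, the `results/[SYMBOL]/[DATE]/` layout can also be walked directly. A hedged sketch, assuming that layout (the `list_results` helper and the `analysis_12345.json` filename are illustrative, not part of the framework):

```python
import tempfile
from pathlib import Path

def list_results(root, symbol=None, date=None):
    """Collect result JSON filenames under results/[SYMBOL]/[DATE]."""
    base = Path(root)
    for part in (symbol, date):
        if part:
            base = base / part
    return sorted(p.name for p in base.rglob("*.json")) if base.exists() else []

# Demo against a throwaway directory tree.
root = Path(tempfile.mkdtemp())
(root / "AAPL" / "2024-01-15").mkdir(parents=True)
(root / "AAPL" / "2024-01-15" / "analysis_12345.json").write_text("{}")
print(list_results(root, "AAPL"))
```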
## Job Types

### 1. Single Analysis (`slurm_single_analysis.sh`)

- **Purpose**: Analyze a single stock symbol
- **Resources**: 8 CPUs, 16GB RAM, 4 hours
- **Usage**: Best for focused analysis or testing

```bash
sbatch slurm_single_analysis.sh SYMBOL DATE
# or
./slurm_manager.sh submit-single SYMBOL DATE
```
### 2. Batch Analysis (`slurm_batch_analysis.sh`)

- **Purpose**: Analyze multiple stocks in parallel
- **Resources**: Array job with up to 5 concurrent tasks
- **Default symbols**: SPY, QQQ, AAPL, MSFT, GOOGL, AMZN, TSLA, NVDA, META, NFLX
- **Usage**: Efficient for portfolio-wide analysis

```bash
sbatch slurm_batch_analysis.sh
# or
./slurm_manager.sh submit-batch
```
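The "up to 5 concurrent tasks" behavior comes from SLURM's job-array throttle (`%5`). A hedged sketch of how a batch script like this plausibly maps array indices to symbols (directive values and variable names are illustrative; check the actual `slurm_batch_analysis.sh`):

```bash
#!/bin/bash
#SBATCH --job-name=trading_batch
#SBATCH --array=0-9%5   # 10 tasks, at most 5 running concurrently
set -euo pipefail

SYMBOLS=("SPY" "QQQ" "AAPL" "MSFT" "GOOGL" "AMZN" "TSLA" "NVDA" "META" "NFLX")

# Each array task selects its own symbol by index.
TASK_ID=${SLURM_ARRAY_TASK_ID:-0}   # defaults to 0 when run outside SLURM
SYMBOL=${SYMBOLS[$TASK_ID]}
echo "task $TASK_ID analyzing $SYMBOL"
```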
### 3. GPU Analysis (`slurm_gpu_analysis.sh`)

- **Purpose**: GPU-accelerated analysis with local models
- **Resources**: 1 GPU, 8 CPUs, 32GB RAM, 8 hours
- **Usage**: Best for Ollama or other local LLM providers

```bash
sbatch slurm_gpu_analysis.sh SYMBOL DATE
# or
./slurm_manager.sh submit-gpu SYMBOL DATE
```
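When a job runs a local backend such as Ollama, the script has to start the server and wait until it accepts requests before launching the analysis. A sketch of that readiness check (the `wait_for` helper is an illustrative assumption, and the demo polls the always-succeeding `true` command rather than a real server):

```bash
# Poll a health-check command until it succeeds, giving up after 30 tries.
wait_for() {
  local tries=0
  until "$@"; do
    tries=$((tries + 1))
    if [ "$tries" -ge 30 ]; then
      return 1
    fi
    sleep 1
  done
}

# In the real GPU job this would look something like:
#   ollama serve > logs/ollama.log 2>&1 &
#   wait_for curl -sf http://localhost:11434/api/tags
wait_for true && echo "backend ready"
```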
## Resource Requirements

### Minimum Requirements

- **CPU Jobs**: 4-8 cores, 8-16GB RAM
- **GPU Jobs**: 1 GPU, 8 cores, 32GB RAM
- **Storage**: ~1GB for dependencies, variable for results/cache

### Recommended Partitions

- **CPU Partition**: For most analysis jobs
- **GPU Partition**: For local LLM acceleration
- **High-Memory Partition**: For large-scale batch processing
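As a concrete reference point, the CPU-job requirements above translate into a job-script header along these lines (the partition name and log paths are illustrative; adapt them to your cluster):

```bash
#!/bin/bash
#SBATCH --partition=cpu            # illustrative partition name
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --output=logs/trading_%j.out
#SBATCH --error=logs/trading_%j.err
```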
## LLM Provider Configuration

### Ollama (Recommended for Clusters)

- Runs locally on compute nodes
- No external API dependencies
- GPU acceleration support
- Models: llama3.2, mistral, etc.

### OpenAI/OpenRouter

- Requires API key and internet access
- Fast inference
- Usage costs apply
- Models: gpt-4, gpt-3.5-turbo, etc.

### Anthropic

- Requires API key and internet access
- High-quality reasoning
- Usage costs apply
- Models: claude-3-sonnet, claude-3-haiku
## File Structure

```
TradingAgents/
├── slurm_*.sh           # SLURM job scripts
├── slurm_manager.sh     # Job management utility
├── .env                 # Environment configuration
├── logs/                # Job output and error logs
├── results/             # Analysis results by symbol/date
├── venv/                # Python virtual environment
└── data_cache/          # Cached market data
```
## Error Handling and Exit Behavior

### **Automatic Script Exit**

✅ **Yes, scripts exit automatically on failures**, with the following behavior:

#### **1. Bash Script Level**

- **`set -euo pipefail`**: Scripts exit immediately on any command failure
- **`-e`**: Exit when any command returns a non-zero status
- **`-u`**: Treat use of undefined variables as an error
- **`-o pipefail`**: Fail a pipeline if any command in it fails
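The combined effect of these options can be seen in a tiny self-contained demo; the failing pipeline makes the subshell exit immediately instead of reaching the final `echo`:

```bash
set +e   # keep the outer shell alive so we can inspect the status

# `false | true` succeeds without pipefail but fails with it,
# and -e then aborts the subshell before the echo runs.
( set -euo pipefail; false | true; echo "not reached" )
status=$?
echo "subshell exited with status $status"
```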
#### **2. Python Script Level**

- **Exception handling**: All Python errors are caught and logged
- **Explicit exit**: `sys.exit(1)` on any analysis failure
- **Error logging**: Failures are saved to JSON files for debugging
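That catch-log-exit pattern plausibly looks like the sketch below (the `save_error` helper and its payload fields are illustrative; the real scripts write to `results/[SYMBOL]/[DATE]/error_[JOB_ID].json`). The demo writes to a temporary directory and comments out the final exit:

```python
import json
import tempfile
from datetime import date
from pathlib import Path

def save_error(results_dir, symbol, job_id, exc):
    """Write failure details to results/[SYMBOL]/[DATE]/error_[JOB_ID].json."""
    out_dir = Path(results_dir) / symbol / date.today().isoformat()
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"error_{job_id}.json"
    out_file.write_text(json.dumps(
        {"symbol": symbol, "job_id": job_id, "error": str(exc)}, indent=2))
    return out_file

try:
    raise RuntimeError("LLM backend unreachable")  # stand-in for a failed analysis
except Exception as exc:
    error_file = save_error(tempfile.mkdtemp(), "AAPL", "12345", exc)
    print(f"error details saved to {error_file}")
    # the real job script would follow with sys.exit(1)
```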
#### **3. SLURM Level**

- **Job status**: Failed jobs are marked as FAILED in SLURM
- **Resource cleanup**: Allocated resources are automatically released
- **Log preservation**: Output and error logs are saved for investigation

### **What Happens on Failure**

1. **Immediate termination** of the failing script
2. **Error information saved** to `results/[SYMBOL]/[DATE]/error_[JOB_ID].json`
3. **SLURM job status** set to FAILED
4. **Exit code 1** returned to the SLURM scheduler
5. **Resources released** back to the cluster
## Troubleshooting

### Common Issues

1. **Job Fails to Start**

   - Check SLURM partition availability: `sinfo`
   - Verify resource requirements match cluster limits
   - Ensure the environment setup job completed successfully

2. **Python Dependencies Missing**

   - Run the setup job: `./slurm_manager.sh submit-setup`
   - Check the setup job output: `./slurm_manager.sh output SETUP_JOB_ID`

3. **LLM Connection Issues**

   - Verify API keys in the `.env` file
   - Check network connectivity for external providers
   - For Ollama, ensure GPU resources are available

4. **Out of Memory Errors**

   - Increase the memory allocation in job scripts
   - Reduce `max_debate_rounds` in the configuration
   - Use the GPU partition for memory-intensive models

5. **Script Exit Issues**

   - Check exit codes: `sacct -j JOB_ID --format=JobID,State,ExitCode`
   - Review error logs: `./slurm_manager.sh output JOB_ID err`
   - Verify all prerequisites are met before job submission
### Debugging

```bash
# Check job status and exit codes
squeue -u $USER
sacct -j JOB_ID --format=JobID,State,ExitCode,Reason

# View detailed job information
scontrol show job JOB_ID

# Check node resources
sinfo -N -l

# View job output in real time
tail -f logs/trading_JOB_ID.out

# Check for error files
find results -name "error_*.json" -exec echo "Found error in: {}" \; -exec cat {} \;
```
## Customization

### Modify Stock Lists

Edit the `SYMBOLS` array in `slurm_batch_analysis.sh`:

```bash
SYMBOLS=("AAPL" "MSFT" "GOOGL" "AMZN" "TSLA")
```

### Adjust Resources

Modify the SLURM directives in the job scripts:

```bash
#SBATCH --cpus-per-task=16   # More CPUs
#SBATCH --mem=64G            # More memory
#SBATCH --time=12:00:00      # Longer runtime
```

### Configure Analysis Parameters

Edit the config in the Python scripts:

```python
config["max_debate_rounds"] = 3        # More thorough analysis
config["max_risk_discuss_rounds"] = 3  # More risk assessment
config["online_tools"] = True          # Enable web scraping
```
## Best Practices

1. **Start Small**: Test with single analysis before batch jobs
2. **Monitor Resources**: Check CPU/memory usage during jobs
3. **Batch Wisely**: Use array jobs for multiple symbols
4. **Cache Data**: Leverage data caching to reduce API calls
5. **Log Everything**: Review job logs for optimization opportunities
6. **Backup Results**: Copy important results to permanent storage

## Performance Tips

1. **Use Local Models**: Ollama reduces API latency and costs
2. **Parallel Processing**: Leverage array jobs for batch analysis
3. **Resource Matching**: Match job resources to actual needs
4. **Data Locality**: Store frequently accessed data on fast storage
5. **Network Optimization**: Use cluster-internal services when possible
## Security Considerations

1. **API Keys**: Store sensitive keys in the `.env` file, never in scripts
2. **File Permissions**: Ensure job scripts and data have appropriate permissions
3. **Network Access**: Some clusters restrict external API access
4. **Data Privacy**: Be aware of data residency requirements for financial data

## Support

For issues specific to:

- **SLURM**: Consult your cluster documentation or administrator
- **TradingAgents**: Check the main repository issues and documentation
- **LLM Providers**: Refer to the respective provider documentation