# TradingAgents SLURM Cluster Guide
This guide explains how to run the TradingAgents framework on a SLURM cluster environment.
## Overview
The TradingAgents framework has been configured to run efficiently on SLURM clusters with the following features:
- **Multi-job support**: Single analysis, batch processing, and GPU-accelerated runs
- **Resource management**: Optimized CPU, memory, and GPU allocation
- **Environment isolation**: Python virtual environments and dependency management
- **Result collection**: Structured output and error handling
- **LLM flexibility**: Support for various LLM providers (OpenAI, Anthropic, Ollama, etc.)
## Files Created
| File | Purpose |
| -------------------------- | --------------------------------------------- |
| `slurm_setup.sh` | Environment setup and dependency installation |
| `slurm_single_analysis.sh` | Single stock analysis job |
| `slurm_batch_analysis.sh` | Batch analysis for multiple stocks |
| `slurm_gpu_analysis.sh` | GPU-accelerated analysis with local models |
| `slurm_manager.sh` | Job management and utility script |
| `.env.slurm.template` | Environment configuration template |
## Quick Start
### 1. Initial Setup
```bash
# Make the manager script executable
chmod +x slurm_manager.sh
# Setup environment and create directories
./slurm_manager.sh setup
# Submit setup job to install dependencies
./slurm_manager.sh submit-setup
```
### 2. Configure Environment
Edit the `.env` file (created from `.env.slurm.template`) to configure your LLM provider. Keep only one provider block active:
```bash
# For Ollama (local models)
LLM_PROVIDER=ollama
LLM_BACKEND_URL=http://localhost:11434/v1
DEEP_THINK_LLM=llama3.2
QUICK_THINK_LLM=llama3.2

# For OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=your_api_key_here
DEEP_THINK_LLM=gpt-4
QUICK_THINK_LLM=gpt-3.5-turbo

# For Anthropic
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=your_api_key_here
DEEP_THINK_LLM=claude-3-sonnet-20240229
QUICK_THINK_LLM=claude-3-haiku-20240307
```
### 3. Submit Jobs
```bash
# Single stock analysis
./slurm_manager.sh submit-single AAPL
# Batch analysis (multiple stocks)
./slurm_manager.sh submit-batch
# GPU-accelerated analysis
./slurm_manager.sh submit-gpu TSLA
```
### 4. Monitor Jobs
```bash
# Check all recent jobs
./slurm_manager.sh status
# Check specific job
./slurm_manager.sh status 12345
# View job output
./slurm_manager.sh output 12345
# View job errors
./slurm_manager.sh output 12345 err
```
### 5. Collect Results
```bash
# View results for all symbols
./slurm_manager.sh results
# View results for specific symbol
./slurm_manager.sh results AAPL
# View results for specific date
./slurm_manager.sh results AAPL 2024-01-15
```
## Job Types
### 1. Single Analysis (`slurm_single_analysis.sh`)
- **Purpose**: Analyze a single stock symbol
- **Resources**: 8 CPUs, 16GB RAM, 4 hours
- **Usage**: Best for focused analysis or testing
```bash
sbatch slurm_single_analysis.sh SYMBOL DATE
# or
./slurm_manager.sh submit-single SYMBOL DATE
```
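For example, to analyze a specific symbol and trading day (dates are assumed to follow the `YYYY-MM-DD` form used by the results paths elsewhere in this guide):
```bash
# Single analysis of AAPL for 2024-01-15 (date format assumed: YYYY-MM-DD)
./slurm_manager.sh submit-single AAPL 2024-01-15
```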
### 2. Batch Analysis (`slurm_batch_analysis.sh`)
- **Purpose**: Analyze multiple stocks in parallel
- **Resources**: Array job with up to 5 concurrent tasks
- **Default symbols**: SPY, QQQ, AAPL, MSFT, GOOGL, AMZN, TSLA, NVDA, META, NFLX
- **Usage**: Efficient for portfolio-wide analysis
```bash
sbatch slurm_batch_analysis.sh
# or
./slurm_manager.sh submit-batch
```
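As a sketch of how the throttled array works: SLURM's `--array` directive with a `%` limit caps concurrency, and each task indexes the symbol list with `SLURM_ARRAY_TASK_ID`. The exact directives in the shipped script may differ:
```bash
#SBATCH --array=0-9%5   # 10 tasks (one per default symbol), at most 5 running at once

# Illustrative mapping from array task ID to symbol
SYMBOLS=("SPY" "QQQ" "AAPL" "MSFT" "GOOGL" "AMZN" "TSLA" "NVDA" "META" "NFLX")
SYMBOL="${SYMBOLS[$SLURM_ARRAY_TASK_ID]}"
echo "Array task ${SLURM_ARRAY_TASK_ID} analyzing ${SYMBOL}"
```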
### 3. GPU Analysis (`slurm_gpu_analysis.sh`)
- **Purpose**: GPU-accelerated analysis with local models
- **Resources**: 1 GPU, 8 CPUs, 32GB RAM, 8 hours
- **Usage**: Best for Ollama or other local LLM providers
```bash
sbatch slurm_gpu_analysis.sh SYMBOL DATE
# or
./slurm_manager.sh submit-gpu SYMBOL DATE
```
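The resource figures above correspond to SBATCH directives along these lines; treat them as illustrative, since the shipped script may use different GPU syntax (`--gpus`, a GPU type, or a named partition):
```bash
#SBATCH --gres=gpu:1        # 1 GPU (some clusters use --gpus=1 instead)
#SBATCH --cpus-per-task=8   # 8 CPUs
#SBATCH --mem=32G           # 32GB RAM
#SBATCH --time=08:00:00     # 8-hour limit
```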
## Resource Requirements
### Minimum Requirements
- **CPU Jobs**: 4-8 cores, 8-16GB RAM
- **GPU Jobs**: 1 GPU, 8 cores, 32GB RAM
- **Storage**: ~1GB for dependencies, variable for results/cache
### Recommended Partitions
- **CPU Partition**: For most analysis jobs
- **GPU Partition**: For local LLM acceleration
- **High-Memory Partition**: For large-scale batch processing
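Partition names vary by cluster. List what is available with `sinfo`, then target a partition in the job script; the name below is a placeholder:
```bash
# List partitions and their limits
sinfo -s

# In a job script, request a specific partition (name is cluster-specific)
#SBATCH --partition=gpu
```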
## LLM Provider Configuration
### Ollama (Recommended for Clusters)
- Runs locally on compute nodes (startup sketch below)
- No external API dependencies
- GPU acceleration support
- Models: llama3.2, mistral, etc.
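A minimal sketch of bringing up Ollama inside a GPU job before the analysis step; this assumes `ollama` is installed on the compute node and is not part of the shipped scripts:
```bash
# Start the Ollama server in the background (assumes ollama is on PATH)
ollama serve &

# Wait until the API answers on the default port
until curl -sf http://localhost:11434/api/tags > /dev/null; do
    sleep 2
done

# Pull the model configured in .env before the analysis uses it
ollama pull llama3.2
```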
### OpenAI/OpenRouter
- Requires API key and internet access
- Fast inference
- Usage costs apply
- Models: gpt-4, gpt-3.5-turbo, etc.
### Anthropic
- Requires API key and internet access
- High-quality reasoning
- Usage costs apply
- Models: claude-3-sonnet, claude-3-haiku
## File Structure
```
TradingAgents/
├── slurm_*.sh # SLURM job scripts
├── slurm_manager.sh # Job management utility
├── .env # Environment configuration
├── logs/ # Job output and error logs
├── results/ # Analysis results by symbol/date
├── venv/ # Python virtual environment
└── data_cache/ # Cached market data
```
## Error Handling and Exit Behavior
### **Automatic Script Exit**
**Scripts exit automatically on failure**, with safeguards at three levels:
#### **1. Bash Script Level**
- **`set -euo pipefail`**: The job scripts use this line, combining three safeguards (see the sketch after this list)
- **`-e`**: Exit immediately when any command returns a non-zero status
- **`-u`**: Treat references to unset variables as errors
- **`-o pipefail`**: A pipeline's exit status reflects the first failing command, so `-e` catches failures anywhere in the pipeline
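A minimal sketch of this pattern at the top of a job script; the `trap` line and the entry-point name are illustrative additions, not necessarily present in the shipped scripts:
```bash
#!/bin/bash
set -euo pipefail

# Illustrative: report the failing line to stderr before the script exits
trap 'echo "Command failed at line $LINENO" >&2' ERR

python run_analysis.py "$SYMBOL" "$DATE"   # hypothetical entry point
```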
#### **2. Python Script Level**
- **Exception handling**: All Python errors are caught and logged
- **Explicit exit**: `sys.exit(1)` on any analysis failure
- **Error logging**: Failures are saved to JSON files for debugging
#### **3. SLURM Level**
- **Job status**: Failed jobs are marked FAILED in SLURM accounting
- **Resource cleanup**: Allocated resources are automatically released
- **Log preservation**: Output and error logs are saved for investigation
### **What Happens on Failure**
1. **Immediate termination** of the failing script
2. **Error information saved** to `results/[SYMBOL]/[DATE]/error_[JOB_ID].json`
3. **SLURM job status** set to FAILED
4. **Exit code 1** returned to SLURM scheduler
5. **Resources released** back to the cluster
## Troubleshooting
### Common Issues
1. **Job Fails to Start**
- Check SLURM partition availability: `sinfo`
- Verify resource requirements match cluster limits
- Ensure environment setup job completed successfully
2. **Python Dependencies Missing**
- Run setup job: `./slurm_manager.sh submit-setup`
- Check setup job output: `./slurm_manager.sh output SETUP_JOB_ID`
3. **LLM Connection Issues**
- Verify API keys in `.env` file
- Check network connectivity for external providers
- For Ollama, ensure the server is running on the compute node and GPU resources are available
4. **Out of Memory Errors**
- Increase memory allocation in job scripts
- Reduce `max_debate_rounds` in configuration
- Use GPU partition for memory-intensive models
5. **Script Exit Issues**
- Check exit codes: `sacct -j JOB_ID --format=JobID,State,ExitCode`
- Review error logs: `./slurm_manager.sh output JOB_ID err`
- Verify all prerequisites are met before job submission
### Debugging
```bash
# Check job status and exit codes
squeue -u $USER
sacct -j JOB_ID --format=JobID,State,ExitCode,Reason
# View detailed job information
scontrol show job JOB_ID
# Check node resources
sinfo -N -l
# View job output in real-time
tail -f logs/trading_JOB_ID.out
# Check for error files
find results -name "error_*.json" -exec echo "Found error in: {}" \; -exec cat {} \;
```
## Customization
### Modify Stock Lists
Edit the `SYMBOLS` array in `slurm_batch_analysis.sh`:
```bash
SYMBOLS=("AAPL" "MSFT" "GOOGL" "AMZN" "TSLA")
```
### Adjust Resources
Modify SLURM directives in job scripts:
```bash
#SBATCH --cpus-per-task=16 # More CPUs
#SBATCH --mem=64G # More memory
#SBATCH --time=12:00:00 # Longer runtime
```
### Configure Analysis Parameters
Edit the config in Python scripts:
```python
config["max_debate_rounds"] = 3 # More thorough analysis
config["max_risk_discuss_rounds"] = 3 # More risk assessment
config["online_tools"] = True # Enable web scraping
```
## Best Practices
1. **Start Small**: Test with single analysis before batch jobs
2. **Monitor Resources**: Check CPU/memory usage during jobs
3. **Batch Wisely**: Use array jobs for multiple symbols
4. **Cache Data**: Leverage data caching to reduce API calls
5. **Log Everything**: Review job logs for optimization opportunities
6. **Backup Results**: Copy important results to permanent storage (example below)
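For the backup step, a one-liner along these lines works; the destination is a placeholder for your site's permanent storage:
```bash
# Mirror analysis results to permanent storage (destination is a placeholder)
rsync -av results/ /path/to/permanent/storage/tradingagents-results/
```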
## Performance Tips
1. **Use Local Models**: Ollama reduces API latency and costs
2. **Parallel Processing**: Leverage array jobs for batch analysis
3. **Resource Matching**: Match job resources to actual needs
4. **Data Locality**: Store frequently accessed data on fast storage
5. **Network Optimization**: Use cluster-internal services when possible
## Security Considerations
1. **API Keys**: Store sensitive keys in `.env` file, not in scripts
2. **File Permissions**: Ensure job scripts and data have appropriate permissions (example below)
3. **Network Access**: Some clusters restrict external API access
4. **Data Privacy**: Be aware of data residency requirements for financial data
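For example, restricting the `.env` file so only the owner can read the API keys it contains:
```bash
# Make .env readable and writable by the owner only
chmod 600 .env
```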
## Support
For issues specific to:
- **SLURM**: Consult your cluster documentation or administrator
- **TradingAgents**: Check the main repository issues and documentation
- **LLM Providers**: Refer to respective provider documentation