# TradingAgents SLURM Cluster Guide
This guide explains how to run the TradingAgents framework on a SLURM cluster environment.
## Overview
The TradingAgents framework has been configured to run efficiently on SLURM clusters with the following features:
- **Multi-job support**: Single analysis, batch processing, and GPU-accelerated runs
- **Resource management**: Optimized CPU, memory, and GPU allocation
- **Environment isolation**: Python virtual environments and dependency management
- **Result collection**: Structured output and error handling
- **LLM flexibility**: Support for various LLM providers (OpenAI, Anthropic, Ollama, etc.)
## Files Created
| File | Purpose |
|---|---|
| `slurm_setup.sh` | Environment setup and dependency installation |
| `slurm_single_analysis.sh` | Single stock analysis job |
| `slurm_batch_analysis.sh` | Batch analysis for multiple stocks |
| `slurm_gpu_analysis.sh` | GPU-accelerated analysis with local models |
| `slurm_manager.sh` | Job management and utility script |
| `.env.slurm.template` | Environment configuration template |
## Quick Start
### 1. Initial Setup

```bash
# Make the manager script executable
chmod +x slurm_manager.sh

# Set up the environment and create directories
./slurm_manager.sh setup

# Submit a setup job to install dependencies
./slurm_manager.sh submit-setup
```
### 2. Configure Environment

Edit the `.env` file (created from the template) to configure your LLM provider:
```bash
# For Ollama (local models)
LLM_PROVIDER=ollama
LLM_BACKEND_URL=http://localhost:11434/v1
DEEP_THINK_LLM=llama3.2
QUICK_THINK_LLM=llama3.2

# For OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=your_api_key_here
DEEP_THINK_LLM=gpt-4
QUICK_THINK_LLM=gpt-3.5-turbo

# For Anthropic
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=your_api_key_here
DEEP_THINK_LLM=claude-3-sonnet-20240229
QUICK_THINK_LLM=claude-3-haiku-20240307
```
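If the `.env` file was not created for you during setup, seed it from the bundled template first (a sketch, assuming the template sits at the repository root as listed above):

```bash
# Copy the template, then edit it with your provider settings
cp .env.slurm.template .env
```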
### 3. Submit Jobs
```bash
# Single stock analysis
./slurm_manager.sh submit-single AAPL

# Batch analysis (multiple stocks)
./slurm_manager.sh submit-batch

# GPU-accelerated analysis
./slurm_manager.sh submit-gpu TSLA
```
### 4. Monitor Jobs
```bash
# Check all recent jobs
./slurm_manager.sh status

# Check a specific job
./slurm_manager.sh status 12345

# View job output
./slurm_manager.sh output 12345

# View job errors
./slurm_manager.sh output 12345 err
```
### 5. Collect Results
```bash
# View results for all symbols
./slurm_manager.sh results

# View results for a specific symbol
./slurm_manager.sh results AAPL

# View results for a specific date
./slurm_manager.sh results AAPL 2024-01-15
```
## Job Types
### 1. Single Analysis (`slurm_single_analysis.sh`)

- **Purpose**: Analyze a single stock symbol
- **Resources**: 8 CPUs, 16 GB RAM, 4 hours
- **Usage**: Best for focused analysis or testing

```bash
sbatch slurm_single_analysis.sh SYMBOL DATE
# or
./slurm_manager.sh submit-single SYMBOL DATE
```
### 2. Batch Analysis (`slurm_batch_analysis.sh`)

- **Purpose**: Analyze multiple stocks in parallel
- **Resources**: Array job with up to 5 concurrent tasks
- **Default symbols**: SPY, QQQ, AAPL, MSFT, GOOGL, AMZN, TSLA, NVDA, META, NFLX
- **Usage**: Efficient for portfolio-wide analysis

```bash
sbatch slurm_batch_analysis.sh
# or
./slurm_manager.sh submit-batch
```
### 3. GPU Analysis (`slurm_gpu_analysis.sh`)

- **Purpose**: GPU-accelerated analysis with local models
- **Resources**: 1 GPU, 8 CPUs, 32 GB RAM, 8 hours
- **Usage**: Best for Ollama or other local LLM providers

```bash
sbatch slurm_gpu_analysis.sh SYMBOL DATE
# or
./slurm_manager.sh submit-gpu SYMBOL DATE
```
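These resource figures translate directly into SLURM directives. A header sketch matching them (the `gpu` partition name is an assumption; check `sinfo` for your site's names):

```bash
#!/bin/bash
#SBATCH --job-name=trading_gpu   # Name shown in squeue
#SBATCH --partition=gpu          # Assumed partition name
#SBATCH --gres=gpu:1             # 1 GPU
#SBATCH --cpus-per-task=8        # 8 CPUs
#SBATCH --mem=32G                # 32 GB RAM
#SBATCH --time=08:00:00          # 8-hour wall clock limit
```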
## Resource Requirements
### Minimum Requirements

- **CPU jobs**: 4-8 cores, 8-16 GB RAM
- **GPU jobs**: 1 GPU, 8 cores, 32 GB RAM
- **Storage**: ~1 GB for dependencies, plus variable space for results and cache
### Recommended Partitions

- **CPU partition**: For most analysis jobs
- **GPU partition**: For local LLM acceleration
- **High-memory partition**: For large-scale batch processing
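Partition names differ between clusters, so the values hard-coded in the job scripts may need overriding at submit time (a sketch using standard `sbatch` options; `highmem` is a hypothetical partition name):

```bash
# List available partitions, then route a job to a specific one
sinfo -s
sbatch --partition=highmem slurm_batch_analysis.sh
```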
## LLM Provider Configuration
### Ollama (Recommended for Clusters)

- Runs locally on compute nodes
- No external API dependencies
- GPU acceleration support
- Models: `llama3.2`, `mistral`, etc.
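Because compute nodes usually cannot reach services running on the login node, the Ollama server typically has to be started inside the job itself. A minimal sketch, assuming the `ollama` binary is available on the compute node:

```bash
# Start the Ollama server in the background on the compute node
ollama serve &
sleep 5               # Give the server a moment to start listening

# Fetch the model once; later runs hit the local cache
ollama pull llama3.2
```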
### OpenAI/OpenRouter

- Requires an API key and internet access
- Fast inference
- Usage costs apply
- Models: `gpt-4`, `gpt-3.5-turbo`, etc.
### Anthropic

- Requires an API key and internet access
- High-quality reasoning
- Usage costs apply
- Models: `claude-3-sonnet`, `claude-3-haiku`
## File Structure
```
TradingAgents/
├── slurm_*.sh           # SLURM job scripts
├── slurm_manager.sh     # Job management utility
├── .env                 # Environment configuration
├── logs/                # Job output and error logs
├── results/             # Analysis results by symbol/date
├── venv/                # Python virtual environment
└── data_cache/          # Cached market data
```
## Error Handling and Exit Behavior
### Automatic Script Exit

✅ Yes, scripts exit automatically on failure, with the following behavior:

#### 1. Bash Script Level

All job scripts start with `set -euo pipefail`, so they exit immediately on any command failure:

- `-e`: exit on any non-zero exit status
- `-u`: exit on undefined variables
- `-o pipefail`: exit if any command in a pipeline fails
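A sketch of what the top of each job script looks like under this policy; the cleanup `trap` is illustrative and may not appear in the shipped scripts:

```bash
#!/bin/bash
set -euo pipefail  # Fail fast: -e on errors, -u on unset vars, pipefail in pipes

# Illustrative hook: log the failing line to stderr before the script exits
trap 'echo "Job failed at line $LINENO" >&2' ERR
```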
#### 2. Python Script Level

- **Exception handling**: All Python errors are caught and logged
- **Explicit exit**: `sys.exit(1)` on any analysis failure
- **Error logging**: Failures are saved to JSON files for debugging
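A minimal sketch of this failure path, assuming a hypothetical `run_analysis()` entry point and the `error_[JOB_ID].json` layout described below:

```python
import json
import os
import sys
import traceback


def run_analysis(symbol: str, date: str) -> None:
    """Hypothetical stand-in for the real analysis entry point."""
    raise NotImplementedError("replace with the actual TradingAgents call")


def main(symbol: str, date: str) -> None:
    try:
        run_analysis(symbol, date)
    except Exception as exc:
        # Persist failure details next to where results would have gone
        out_dir = os.path.join("results", symbol, date)
        os.makedirs(out_dir, exist_ok=True)
        job_id = os.environ.get("SLURM_JOB_ID", "local")
        with open(os.path.join(out_dir, f"error_{job_id}.json"), "w") as f:
            json.dump({"error": str(exc), "traceback": traceback.format_exc()}, f, indent=2)
        sys.exit(1)  # Non-zero exit code marks the SLURM job as FAILED
```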
#### 3. SLURM Level

- **Job cancellation**: Failed jobs are marked as `FAILED` in SLURM
- **Resource cleanup**: Allocated resources are automatically released
- **Log preservation**: Output and error logs are saved for investigation
### What Happens on Failure

- Immediate termination of the failing script
- Error information saved to `results/[SYMBOL]/[DATE]/error_[JOB_ID].json`
- SLURM job status set to `FAILED`
- Exit code 1 returned to the SLURM scheduler
- Resources released back to the cluster
## Troubleshooting
### Common Issues

1. **Job Fails to Start**
   - Check SLURM partition availability: `sinfo`
   - Verify resource requirements match cluster limits
   - Ensure the environment setup job completed successfully

2. **Python Dependencies Missing**
   - Run the setup job: `./slurm_manager.sh submit-setup`
   - Check the setup job output: `./slurm_manager.sh output SETUP_JOB_ID`

3. **LLM Connection Issues**
   - Verify API keys in the `.env` file
   - Check network connectivity for external providers
   - For Ollama, ensure GPU resources are available

4. **Out-of-Memory Errors**
   - Increase the memory allocation in job scripts
   - Reduce `max_debate_rounds` in the configuration
   - Use the GPU partition for memory-intensive models

5. **Script Exit Issues**
   - Check exit codes: `sacct -j JOB_ID --format=JobID,State,ExitCode`
   - Review error logs: `./slurm_manager.sh output JOB_ID err`
   - Verify all prerequisites are met before job submission
### Debugging
```bash
# Check job status and exit codes
squeue -u $USER
sacct -j JOB_ID --format=JobID,State,ExitCode,Reason

# View detailed job information
scontrol show job JOB_ID

# Check node resources
sinfo -N -l

# View job output in real time
tail -f logs/trading_JOB_ID.out

# Check for error files
find results -name "error_*.json" -exec echo "Found error in: {}" \; -exec cat {} \;
```
## Customization
### Modify Stock Lists

Edit the `SYMBOLS` array in `slurm_batch_analysis.sh`:

```bash
SYMBOLS=("AAPL" "MSFT" "GOOGL" "AMZN" "TSLA")
```
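If you change the symbol count, keep the array directive in the same script in step with it (a sketch; the `%5` throttle matches the five concurrent tasks noted above):

```bash
# 5 symbols -> array indices 0-4, at most 5 tasks running at once
#SBATCH --array=0-4%5
```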
### Adjust Resources

Modify the SLURM directives in the job scripts:

```bash
#SBATCH --cpus-per-task=16   # More CPUs
#SBATCH --mem=64G            # More memory
#SBATCH --time=12:00:00      # Longer runtime
```
### Configure Analysis Parameters

Edit the config in the Python scripts:

```python
config["max_debate_rounds"] = 3        # More thorough analysis
config["max_risk_discuss_rounds"] = 3  # More risk assessment
config["online_tools"] = True          # Enable web scraping
```
## Best Practices
- **Start Small**: Test with a single analysis before running batch jobs
- **Monitor Resources**: Check CPU and memory usage during jobs
- **Batch Wisely**: Use array jobs for multiple symbols
- **Cache Data**: Leverage data caching to reduce API calls
- **Log Everything**: Review job logs for optimization opportunities
- **Backup Results**: Copy important results to permanent storage (see the sketch below)
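A sketch of the backup step; the destination path is a placeholder:

```bash
# Mirror analysis results to permanent storage (destination is hypothetical)
rsync -av results/ /path/to/permanent/storage/results/
```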
## Performance Tips

- **Use Local Models**: Ollama reduces API latency and costs
- **Parallel Processing**: Leverage array jobs for batch analysis
- **Resource Matching**: Match job resources to actual needs
- **Data Locality**: Store frequently accessed data on fast storage
- **Network Optimization**: Use cluster-internal services when possible
## Security Considerations

- **API Keys**: Store sensitive keys in the `.env` file, not in scripts
- **File Permissions**: Ensure job scripts and data have appropriate permissions (see the sketch below)
- **Network Access**: Some clusters restrict external API access
- **Data Privacy**: Be aware of data-residency requirements for financial data
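A minimal permissions-hardening step for shared filesystems (a sketch; align with your site's policy):

```bash
# Restrict the API-key file to your user only
chmod 600 .env

# Keep results and logs private on shared storage
chmod 700 results logs
```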
## Support
For issues specific to:
- **SLURM**: Consult your cluster documentation or administrator
- **TradingAgents**: Check the main repository's issues and documentation
- **LLM Providers**: Refer to the respective provider's documentation