TradingAgents SLURM Cluster Guide

This guide explains how to run the TradingAgents framework on a SLURM cluster environment.

Overview

The TradingAgents framework has been configured to run efficiently on SLURM clusters with the following features:

  • Multi-job support: Single analysis, batch processing, and GPU-accelerated runs
  • Resource management: Optimized CPU, memory, and GPU allocation
  • Environment isolation: Python virtual environments and dependency management
  • Result collection: Structured output and error handling
  • LLM flexibility: Support for various LLM providers (OpenAI, Anthropic, Ollama, etc.)

Files Created

File                        Purpose
slurm_setup.sh              Environment setup and dependency installation
slurm_single_analysis.sh    Single stock analysis job
slurm_batch_analysis.sh     Batch analysis for multiple stocks
slurm_gpu_analysis.sh       GPU-accelerated analysis with local models
slurm_manager.sh            Job management and utility script
.env.slurm.template         Environment configuration template

Quick Start

1. Initial Setup

# Make the manager script executable
chmod +x slurm_manager.sh

# Setup environment and create directories
./slurm_manager.sh setup

# Submit setup job to install dependencies
./slurm_manager.sh submit-setup

2. Configure Environment

Edit the .env file (created from .env.slurm.template) to configure your LLM provider:

# For Ollama (local models)
LLM_PROVIDER=ollama
LLM_BACKEND_URL=http://localhost:11434/v1
DEEP_THINK_LLM=llama3.2
QUICK_THINK_LLM=llama3.2

# For OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=your_api_key_here
DEEP_THINK_LLM=gpt-4
QUICK_THINK_LLM=gpt-3.5-turbo

# For Anthropic
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=your_api_key_here
DEEP_THINK_LLM=claude-3-sonnet-20240229
QUICK_THINK_LLM=claude-3-haiku-20240307

3. Submit Jobs

# Single stock analysis
./slurm_manager.sh submit-single AAPL

# Batch analysis (multiple stocks)
./slurm_manager.sh submit-batch

# GPU-accelerated analysis
./slurm_manager.sh submit-gpu TSLA

4. Monitor Jobs

# Check all recent jobs
./slurm_manager.sh status

# Check specific job
./slurm_manager.sh status 12345

# View job output
./slurm_manager.sh output 12345

# View job errors
./slurm_manager.sh output 12345 err

5. Collect Results

# View results for all symbols
./slurm_manager.sh results

# View results for specific symbol
./slurm_manager.sh results AAPL

# View results for specific date
./slurm_manager.sh results AAPL 2024-01-15

Job Types

1. Single Analysis (slurm_single_analysis.sh)

  • Purpose: Analyze a single stock symbol
  • Resources: 8 CPUs, 16GB RAM, 4 hours
  • Usage: Best for focused analysis or testing

sbatch slurm_single_analysis.sh SYMBOL DATE
# or
./slurm_manager.sh submit-single SYMBOL DATE
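
The resource figures above map onto SBATCH directives roughly like the following sketch; the exact header in slurm_single_analysis.sh may differ:

#SBATCH --job-name=trading_single   # job name here is an assumption
#SBATCH --cpus-per-task=8           # 8 CPUs
#SBATCH --mem=16G                   # 16GB RAM
#SBATCH --time=04:00:00             # 4-hour limit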

2. Batch Analysis (slurm_batch_analysis.sh)

  • Purpose: Analyze multiple stocks in parallel
  • Resources: Array job with up to 5 concurrent tasks
  • Default symbols: SPY, QQQ, AAPL, MSFT, GOOGL, AMZN, TSLA, NVDA, META, NFLX
  • Usage: Efficient for portfolio-wide analysis

sbatch slurm_batch_analysis.sh
# or
./slurm_manager.sh submit-batch
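
Behind the scenes, "up to 5 concurrent tasks" over the 10 default symbols is typically expressed with a SLURM array directive; a sketch (array bounds and variable names are illustrative):

#SBATCH --array=0-9%5    # 10 symbols, at most 5 tasks running at once

SYMBOLS=("SPY" "QQQ" "AAPL" "MSFT" "GOOGL" "AMZN" "TSLA" "NVDA" "META" "NFLX")
SYMBOL=${SYMBOLS[$SLURM_ARRAY_TASK_ID]}   # each array task analyzes one symbol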

3. GPU Analysis (slurm_gpu_analysis.sh)

  • Purpose: GPU-accelerated analysis with local models
  • Resources: 1 GPU, 8 CPUs, 32GB RAM, 8 hours
  • Usage: Best for Ollama or other local LLM providers

sbatch slurm_gpu_analysis.sh SYMBOL DATE
# or
./slurm_manager.sh submit-gpu SYMBOL DATE
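
Again as a sketch (the actual directives in slurm_gpu_analysis.sh may differ), the GPU request corresponds to:

#SBATCH --gres=gpu:1          # 1 GPU
#SBATCH --cpus-per-task=8     # 8 CPUs
#SBATCH --mem=32G             # 32GB RAM
#SBATCH --time=08:00:00       # 8-hour limit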

Resource Requirements

Minimum Requirements

  • CPU Jobs: 4-8 cores, 8-16GB RAM
  • GPU Jobs: 1 GPU, 8 cores, 32GB RAM
  • Storage: ~1GB for dependencies, variable for results/cache

Recommended Partitions

  • CPU Partition: For most analysis jobs
  • GPU Partition: For local LLM acceleration
  • High-Memory Partition: For large-scale batch processing
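
Partition names are cluster-specific; list what is available with sinfo. As a sketch (the name gpu below is an assumption), a job script selects a partition with:

#SBATCH --partition=gpu    # replace with a partition name from your cluster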

LLM Provider Configuration

Ollama (Local Models)

  • Runs locally on compute nodes
  • No external API dependencies
  • GPU acceleration support
  • Models: llama3.2, mistral, etc.

OpenAI/OpenRouter

  • Requires API key and internet access
  • Fast inference
  • Usage costs apply
  • Models: gpt-4, gpt-3.5-turbo, etc.

Anthropic

  • Requires API key and internet access
  • High-quality reasoning
  • Usage costs apply
  • Models: claude-3-sonnet, claude-3-haiku

File Structure

TradingAgents/
├── slurm_*.sh           # SLURM job scripts
├── slurm_manager.sh     # Job management utility
├── .env                 # Environment configuration
├── logs/                # Job output and error logs
├── results/             # Analysis results by symbol/date
├── venv/                # Python virtual environment
└── data_cache/          # Cached market data

Error Handling and Exit Behavior

Automatic Script Exit

Scripts exit automatically on failure, with safeguards at three levels:

1. Bash Script Level

  • set -euo pipefail: Scripts exit immediately on any command failure
  • -e: Exit on any non-zero exit status
  • -u: Exit when an undefined variable is referenced
  • -o pipefail: A pipeline fails if any command in it fails, so -e also catches mid-pipeline errors
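
For example, with these options set, a failing step aborts the job script before later steps run (a minimal sketch; the script names are illustrative):

set -euo pipefail

python fetch_data.py "$SYMBOL"      # a non-zero exit here stops the script immediately...
python run_analysis.py "$SYMBOL"    # ...so this line is never reached on failure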

2. Python Script Level

  • Exception handling: All Python errors are caught and logged
  • Explicit exit: sys.exit(1) on any analysis failure
  • Error logging: Failures are saved to JSON files for debugging
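
The pattern is roughly the following sketch (run_analysis is a hypothetical placeholder; the real script's names and JSON fields may differ):

import json
import os
import sys
import traceback
from pathlib import Path

def run_analysis(symbol: str, date: str) -> None:
    """Placeholder for the actual TradingAgents analysis call."""
    raise NotImplementedError

symbol, date = sys.argv[1], sys.argv[2]
job_id = os.environ.get("SLURM_JOB_ID", "local")

try:
    run_analysis(symbol, date)
except Exception as exc:
    # Save failure details for debugging, then signal failure to SLURM.
    error_file = Path("results") / symbol / date / f"error_{job_id}.json"
    error_file.parent.mkdir(parents=True, exist_ok=True)
    error_file.write_text(json.dumps(
        {"error": str(exc), "traceback": traceback.format_exc()}, indent=2))
    sys.exit(1)  # non-zero exit code marks the SLURM job as FAILED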

3. SLURM Level

  • Job status: Failed jobs are marked as FAILED in SLURM
  • Resource cleanup: Allocated resources are automatically released
  • Log preservation: Output and error logs are saved for investigation

What Happens on Failure

  1. Immediate termination of the failing script
  2. Error information saved to results/[SYMBOL]/[DATE]/error_[JOB_ID].json
  3. SLURM job status set to FAILED
  4. Exit code 1 returned to SLURM scheduler
  5. Resources released back to the cluster

Troubleshooting

Common Issues

  1. Job Fails to Start

    • Check SLURM partition availability: sinfo
    • Verify resource requirements match cluster limits
    • Ensure environment setup job completed successfully
  2. Python Dependencies Missing

    • Run setup job: ./slurm_manager.sh submit-setup
    • Check setup job output: ./slurm_manager.sh output SETUP_JOB_ID
  3. LLM Connection Issues

    • Verify API keys in .env file
    • Check network connectivity for external providers
    • For Ollama, ensure GPU resources are available
  4. Out of Memory Errors

    • Increase memory allocation in job scripts
    • Reduce max_debate_rounds in configuration
    • Use GPU partition for memory-intensive models
  5. Script Exit Issues

    • Check exit codes: sacct -j JOB_ID --format=JobID,State,ExitCode
    • Review error logs: ./slurm_manager.sh output JOB_ID err
    • Verify all prerequisites are met before job submission

Debugging

# Check job status and exit codes
squeue -u $USER
sacct -j JOB_ID --format=JobID,State,ExitCode,Reason

# View detailed job information
scontrol show job JOB_ID

# Check node resources
sinfo -N -l

# View job output in real-time
tail -f logs/trading_JOB_ID.out

# Check for error files
find results -name "error_*.json" -exec echo "Found error in: {}" \; -exec cat {} \;

Customization

Modify Stock Lists

Edit the SYMBOLS array in slurm_batch_analysis.sh:

SYMBOLS=("AAPL" "MSFT" "GOOGL" "AMZN" "TSLA")

Adjust Resources

Modify SLURM directives in job scripts:

#SBATCH --cpus-per-task=16    # More CPUs
#SBATCH --mem=64G             # More memory
#SBATCH --time=12:00:00       # Longer runtime

Configure Analysis Parameters

Edit the config in Python scripts:

config["max_debate_rounds"] = 3        # More thorough analysis
config["max_risk_discuss_rounds"] = 3  # More risk assessment
config["online_tools"] = True          # Enable web scraping

Best Practices

  1. Start Small: Test with single analysis before batch jobs
  2. Monitor Resources: Check CPU/memory usage during jobs
  3. Batch Wisely: Use array jobs for multiple symbols
  4. Cache Data: Leverage data caching to reduce API calls
  5. Log Everything: Review job logs for optimization opportunities
  6. Backup Results: Copy important results to permanent storage

Performance Tips

  1. Use Local Models: Ollama reduces API latency and costs
  2. Parallel Processing: Leverage array jobs for batch analysis
  3. Resource Matching: Match job resources to actual needs
  4. Data Locality: Store frequently accessed data on fast storage
  5. Network Optimization: Use cluster-internal services when possible

Security Considerations

  1. API Keys: Store sensitive keys in .env file, not in scripts
  2. File Permissions: Ensure job scripts and data have appropriate permissions
  3. Network Access: Some clusters restrict external API access
  4. Data Privacy: Be aware of data residency requirements for financial data

Support

For issues specific to:

  • SLURM: Consult your cluster documentation or administrator
  • TradingAgents: Check the main repository issues and documentation
  • LLM Providers: Refer to respective provider documentation