TradingAgents SLURM Cluster Guide

This guide explains how to run the TradingAgents framework on a SLURM cluster environment.

Overview

The TradingAgents framework has been configured to run efficiently on SLURM clusters with the following features:

  • Multi-job support: Single analysis, batch processing, and GPU-accelerated runs
  • Resource management: Optimized CPU, memory, and GPU allocation
  • Environment isolation: Python virtual environments and dependency management
  • Result collection: Structured output and error handling
  • LLM flexibility: Support for various LLM providers (OpenAI, Anthropic, Ollama, etc.)

Files Created

File                        Purpose
slurm_setup.sh              Environment setup and dependency installation
slurm_single_analysis.sh    Single stock analysis job
slurm_batch_analysis.sh     Batch analysis for multiple stocks
slurm_gpu_analysis.sh       GPU-accelerated analysis with local models
slurm_manager.sh            Job management and utility script
.env.slurm.template         Environment configuration template

Quick Start

1. Initial Setup

# Make the manager script executable
chmod +x slurm_manager.sh

# Setup environment and create directories
./slurm_manager.sh setup

# Submit setup job to install dependencies
./slurm_manager.sh submit-setup

2. Configure Environment

Edit the .env file (created from .env.slurm.template) to configure your LLM provider:

# For Ollama (local models)
LLM_PROVIDER=ollama
LLM_BACKEND_URL=http://localhost:11434/v1
DEEP_THINK_LLM=llama3.2
QUICK_THINK_LLM=llama3.2

# For OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=your_api_key_here
DEEP_THINK_LLM=gpt-4
QUICK_THINK_LLM=gpt-3.5-turbo

# For Anthropic
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=your_api_key_here
DEEP_THINK_LLM=claude-3-sonnet-20240229
QUICK_THINK_LLM=claude-3-haiku-20240307

3. Submit Jobs

# Single stock analysis
./slurm_manager.sh submit-single AAPL

# Batch analysis (multiple stocks)
./slurm_manager.sh submit-batch

# GPU-accelerated analysis
./slurm_manager.sh submit-gpu TSLA

4. Monitor Jobs

# Check all recent jobs
./slurm_manager.sh status

# Check specific job
./slurm_manager.sh status 12345

# View job output
./slurm_manager.sh output 12345

# View job errors
./slurm_manager.sh output 12345 err

5. Collect Results

# View results for all symbols
./slurm_manager.sh results

# View results for specific symbol
./slurm_manager.sh results AAPL

# View results for specific date
./slurm_manager.sh results AAPL 2024-01-15

Job Types

1. Single Analysis (slurm_single_analysis.sh)

  • Purpose: Analyze a single stock symbol
  • Resources: 8 CPUs, 16GB RAM, 4 hours
  • Usage: Best for focused analysis or testing

sbatch slurm_single_analysis.sh SYMBOL DATE
# or
./slurm_manager.sh submit-single SYMBOL DATE
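
The resource figures above map onto SBATCH directives roughly like the following sketch; the exact header in slurm_single_analysis.sh may differ:

#SBATCH --job-name=trading_single   # job name here is an assumption
#SBATCH --cpus-per-task=8           # 8 CPUs
#SBATCH --mem=16G                   # 16GB RAM
#SBATCH --time=04:00:00             # 4-hour limit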

2. Batch Analysis (slurm_batch_analysis.sh)

  • Purpose: Analyze multiple stocks in parallel
  • Resources: Array job with up to 5 concurrent tasks
  • Default symbols: SPY, QQQ, AAPL, MSFT, GOOGL, AMZN, TSLA, NVDA, META, NFLX
  • Usage: Efficient for portfolio-wide analysis

sbatch slurm_batch_analysis.sh
# or
./slurm_manager.sh submit-batch
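
Behind the scenes, "up to 5 concurrent tasks" over the 10 default symbols is typically expressed with a SLURM array directive; a sketch (array bounds and variable names are illustrative):

#SBATCH --array=0-9%5    # 10 symbols, at most 5 tasks running at once

SYMBOLS=("SPY" "QQQ" "AAPL" "MSFT" "GOOGL" "AMZN" "TSLA" "NVDA" "META" "NFLX")
SYMBOL=${SYMBOLS[$SLURM_ARRAY_TASK_ID]}   # each array task analyzes one symbol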

3. GPU Analysis (slurm_gpu_analysis.sh)

  • Purpose: GPU-accelerated analysis with local models
  • Resources: 1 GPU, 8 CPUs, 32GB RAM, 8 hours
  • Usage: Best for Ollama or other local LLM providers

sbatch slurm_gpu_analysis.sh SYMBOL DATE
# or
./slurm_manager.sh submit-gpu SYMBOL DATE
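
Again as a sketch (the actual directives in slurm_gpu_analysis.sh may differ), the GPU request corresponds to:

#SBATCH --gres=gpu:1          # 1 GPU
#SBATCH --cpus-per-task=8     # 8 CPUs
#SBATCH --mem=32G             # 32GB RAM
#SBATCH --time=08:00:00       # 8-hour limit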

Resource Requirements

Minimum Requirements

  • CPU Jobs: 4-8 cores, 8-16GB RAM
  • GPU Jobs: 1 GPU, 8 cores, 32GB RAM
  • Storage: ~1GB for dependencies, variable for results/cache

Recommended Partitions

  • CPU Partition: For most analysis jobs
  • GPU Partition: For local LLM acceleration
  • High-Memory Partition: For large-scale batch processing
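
Partition names are cluster-specific; list what is available with sinfo. As a sketch (the name gpu below is an assumption), a job script selects a partition with:

#SBATCH --partition=gpu    # replace with a partition name from your cluster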

LLM Provider Configuration

Ollama (Local Models)

  • Runs locally on compute nodes
  • No external API dependencies
  • GPU acceleration support
  • Models: llama3.2, mistral, etc.

OpenAI/OpenRouter

  • Requires API key and internet access
  • Fast inference
  • Usage costs apply
  • Models: gpt-4, gpt-3.5-turbo, etc.

Anthropic

  • Requires API key and internet access
  • High-quality reasoning
  • Usage costs apply
  • Models: claude-3-sonnet, claude-3-haiku

File Structure

TradingAgents/
├── slurm_*.sh           # SLURM job scripts
├── slurm_manager.sh     # Job management utility
├── .env                 # Environment configuration
├── logs/                # Job output and error logs
├── results/             # Analysis results by symbol/date
├── venv/                # Python virtual environment
└── data_cache/          # Cached market data

Error Handling and Exit Behavior

Automatic Script Exit

Scripts exit automatically on failure, with safeguards at three levels:

1. Bash Script Level

  • set -euo pipefail: Scripts exit immediately on any command failure
  • -e: Exit on any non-zero exit status
  • -u: Exit when an undefined variable is referenced
  • -o pipefail: A pipeline fails if any command in it fails, so -e also catches mid-pipeline errors
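
For example, with these options set, a failing step aborts the job script before later steps run (a minimal sketch; the script names are illustrative):

set -euo pipefail

python fetch_data.py "$SYMBOL"      # a non-zero exit here stops the script immediately...
python run_analysis.py "$SYMBOL"    # ...so this line is never reached on failure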

2. Python Script Level

  • Exception handling: All Python errors are caught and logged
  • Explicit exit: sys.exit(1) on any analysis failure
  • Error logging: Failures are saved to JSON files for debugging
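
The pattern is roughly the following sketch (run_analysis is a hypothetical placeholder; the real script's names and JSON fields may differ):

import json
import os
import sys
import traceback
from pathlib import Path

def run_analysis(symbol: str, date: str) -> None:
    """Placeholder for the actual TradingAgents analysis call."""
    raise NotImplementedError

symbol, date = sys.argv[1], sys.argv[2]
job_id = os.environ.get("SLURM_JOB_ID", "local")

try:
    run_analysis(symbol, date)
except Exception as exc:
    # Save failure details for debugging, then signal failure to SLURM.
    error_file = Path("results") / symbol / date / f"error_{job_id}.json"
    error_file.parent.mkdir(parents=True, exist_ok=True)
    error_file.write_text(json.dumps(
        {"error": str(exc), "traceback": traceback.format_exc()}, indent=2))
    sys.exit(1)  # non-zero exit code marks the SLURM job as FAILED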

3. SLURM Level

  • Job status: Failed jobs are marked as FAILED in SLURM
  • Resource cleanup: Allocated resources are automatically released
  • Log preservation: Output and error logs are saved for investigation

What Happens on Failure

  1. Immediate termination of the failing script
  2. Error information saved to results/[SYMBOL]/[DATE]/error_[JOB_ID].json
  3. SLURM job status set to FAILED
  4. Exit code 1 returned to SLURM scheduler
  5. Resources released back to the cluster

Troubleshooting

Common Issues

  1. Job Fails to Start

    • Check SLURM partition availability: sinfo
    • Verify resource requirements match cluster limits
    • Ensure environment setup job completed successfully
  2. Python Dependencies Missing

    • Run setup job: ./slurm_manager.sh submit-setup
    • Check setup job output: ./slurm_manager.sh output SETUP_JOB_ID
  3. LLM Connection Issues

    • Verify API keys in .env file
    • Check network connectivity for external providers
    • For Ollama, ensure GPU resources are available
  4. Out of Memory Errors

    • Increase memory allocation in job scripts
    • Reduce max_debate_rounds in configuration
    • Use GPU partition for memory-intensive models
  5. Script Exit Issues

    • Check exit codes: sacct -j JOB_ID --format=JobID,State,ExitCode
    • Review error logs: ./slurm_manager.sh output JOB_ID err
    • Verify all prerequisites are met before job submission

Debugging

# Check job status and exit codes
squeue -u $USER
sacct -j JOB_ID --format=JobID,State,ExitCode,Reason

# View detailed job information
scontrol show job JOB_ID

# Check node resources
sinfo -N -l

# View job output in real-time
tail -f logs/trading_JOB_ID.out

# Check for error files
find results -name "error_*.json" -exec echo "Found error in: {}" \; -exec cat {} \;

Customization

Modify Stock Lists

Edit the SYMBOLS array in slurm_batch_analysis.sh:

SYMBOLS=("AAPL" "MSFT" "GOOGL" "AMZN" "TSLA")

Adjust Resources

Modify SLURM directives in job scripts:

#SBATCH --cpus-per-task=16    # More CPUs
#SBATCH --mem=64G             # More memory
#SBATCH --time=12:00:00       # Longer runtime

Configure Analysis Parameters

Edit the config in Python scripts:

config["max_debate_rounds"] = 3        # More thorough analysis
config["max_risk_discuss_rounds"] = 3  # More risk assessment
config["online_tools"] = True          # Enable web scraping

Best Practices

  1. Start Small: Test with single analysis before batch jobs
  2. Monitor Resources: Check CPU/memory usage during jobs
  3. Batch Wisely: Use array jobs for multiple symbols
  4. Cache Data: Leverage data caching to reduce API calls
  5. Log Everything: Review job logs for optimization opportunities
  6. Backup Results: Copy important results to permanent storage

Performance Tips

  1. Use Local Models: Ollama reduces API latency and costs
  2. Parallel Processing: Leverage array jobs for batch analysis
  3. Resource Matching: Match job resources to actual needs
  4. Data Locality: Store frequently accessed data on fast storage
  5. Network Optimization: Use cluster-internal services when possible

Security Considerations

  1. API Keys: Store sensitive keys in .env file, not in scripts
  2. File Permissions: Ensure job scripts and data have appropriate permissions
  3. Network Access: Some clusters restrict external API access
  4. Data Privacy: Be aware of data residency requirements for financial data

Support

For issues specific to:

  • SLURM: Consult your cluster documentation or administrator
  • TradingAgents: Check the main repository issues and documentation
  • LLM Providers: Refer to respective provider documentation