TradingAgents/docs/plans/2026-02-09-ml-win-probabili...

ML Win Probability Model — TabPFN + Triple-Barrier

Overview

Add an ML model that predicts win probability for each discovery candidate.

  • Training data: Universe-wide historical simulation (~375K labeled samples, roughly 500 stocks × 3 years of trading days)
  • Model: TabPFN (foundation model for tabular data) with LightGBM fallback
  • Labels: Triple-barrier method (+5% profit target, -3% stop loss, 7-trading-day timeout)
  • Integration: Adds ml_win_probability field during enrichment

Components

1. Feature Engineering (tradingagents/ml/feature_engineering.py)

Shared feature extraction used by both training and inference, so the model sees identically computed inputs in both paths. It computes 20 features locally from OHLCV data using stockstats and pandas.

2. Dataset Builder (scripts/build_ml_dataset.py)

  • Fetches OHLCV for ~500 stocks × 3 years
  • Computes features locally (no API calls for indicators)
  • Applies triple-barrier labels
  • Outputs data/ml/training_dataset.parquet

3. Model Trainer (scripts/train_ml_model.py)

  • Time-based train/validation split
  • TabPFN or LightGBM training
  • Walk-forward evaluation
  • Outputs data/ml/tabpfn_model.pkl + data/ml/metrics.json
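The time-based split and walk-forward evaluation could be sketched as follows. This is a minimal illustration, not the trainer's actual code: the function name `walk_forward_splits`, the fold count, and the embargo gap are assumptions. The embargo drops the last few trading days before each test window, because a label that looks 7 trading days ahead would otherwise overlap the test period.

```python
import numpy as np
import pandas as pd


def walk_forward_splits(dates, n_folds=3, embargo=7):
    """Yield (train_mask, test_mask) boolean arrays, one pair per fold.

    Hypothetical helper: each fold trains on all trading days that end at
    least `embargo` trading days before the test window starts.
    """
    dates = pd.Series(pd.to_datetime(dates)).reset_index(drop=True)
    unique_days = np.unique(dates.to_numpy())  # sorted unique trading days
    # Cut the timeline into n_folds + 1 contiguous blocks; block 0 is train-only.
    bounds = np.linspace(0, len(unique_days), n_folds + 2).astype(int)
    for k in range(1, n_folds + 1):
        start, stop = bounds[k], bounds[k + 1]
        train_stop = max(start - embargo, 0)  # leave a gap before the test block
        train_mask = dates.isin(unique_days[:train_stop]).to_numpy()
        test_mask = dates.isin(unique_days[start:stop]).to_numpy()
        yield train_mask, test_mask
```

Each mask indexes rows of the training dataset, so the same splitter works whether the model is TabPFN or LightGBM.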

4. Pipeline Integration

  • tradingagents/ml/predictor.py — model loading + inference
  • tradingagents/dataflows/discovery/filter.py — call predictor during enrichment
  • tradingagents/dataflows/discovery/ranker.py — surface in LLM prompt
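A sketch of what the predictor's loading and inference path might look like is given below. The function names `load_model` and `win_probability` are hypothetical, and the code assumes the pickled model follows the scikit-learn classifier convention (`predict_proba` plus a `classes_` attribute) and was trained on the three labels -1, 0, and +1. Returning `None` when no model file exists is an assumed design choice that would let enrichment continue without the field.

```python
import pickle
from pathlib import Path
from typing import Optional, Sequence

import numpy as np


def load_model(path: str):
    """Load a pickled classifier, or return None if the file is missing."""
    model_path = Path(path)
    if not model_path.exists():
        return None
    with model_path.open("rb") as fh:
        return pickle.load(fh)


def win_probability(model, features: Sequence[float]) -> Optional[float]:
    """Return P(label == +1) for one candidate, or None without a model."""
    if model is None:
        return None
    row = np.asarray(features, dtype=float).reshape(1, -1)
    proba = model.predict_proba(row)[0]
    # Column order follows model.classes_, so look up the WIN class explicitly.
    win_col = list(model.classes_).index(1)
    return float(proba[win_col])
```

Looking up the WIN column through `classes_` avoids silently reading the LOSS or TIMEOUT probability if the class order differs between models.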

Triple-Barrier Labels

+1 (WIN):     Price hits +5% before -3%, within 7 trading days
-1 (LOSS):    Price hits -3% before +5%, within 7 trading days
 0 (TIMEOUT): Neither barrier hit within 7 trading days
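The labeling rule above can be sketched for a single entry as follows. Several details are assumptions, since the plan does not specify them: entry at the close of the signal day, barriers checked against each later bar's high and low, and a bar that touches both barriers counted conservatively as a loss. The function name `triple_barrier_label` and the lowercase column names are hypothetical.

```python
import pandas as pd


def triple_barrier_label(ohlc: pd.DataFrame, entry_idx: int,
                         profit_pct: float = 0.05, stop_pct: float = 0.03,
                         max_days: int = 7) -> int:
    """Return +1 (WIN), -1 (LOSS), or 0 (TIMEOUT) for one entry.

    Assumes `ohlc` has one row per trading day with 'high', 'low', 'close'.
    """
    entry = ohlc["close"].iloc[entry_idx]      # assumed entry price: signal-day close
    upper = entry * (1 + profit_pct)           # profit barrier
    lower = entry * (1 - stop_pct)             # stop-loss barrier
    window = ohlc.iloc[entry_idx + 1: entry_idx + 1 + max_days]
    for high, low in zip(window["high"], window["low"]):
        if low <= lower:                       # assumed tie-break: stop checked first
            return -1
        if high >= upper:
            return 1
    return 0                                   # neither barrier hit before timeout


bars = pd.DataFrame({
    "high":  [100.0, 102.0, 106.0, 101.0],
    "low":   [100.0,  99.0, 101.0,  99.0],
    "close": [100.0, 101.0, 105.0, 100.0],
})
label = triple_barrier_label(bars, entry_idx=0)  # second later bar reaches 106 >= 105
```

Entries near the end of the history have fewer than 7 later bars, so a dataset builder would likely drop them rather than label them as timeouts.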

Features (20)

All computed locally from OHLCV, with zero API calls for indicators:

rsi_14, macd, macd_signal, macd_hist, atr_pct,
bb_width_pct, bb_position, adx, mfi, stoch_k,
volume_ratio_5d, volume_ratio_20d, return_1d, return_5d, return_20d,
sma50_distance, sma200_distance, high_low_range, gap_pct, log_market_cap
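To illustrate how a few of these features might be derived from OHLCV alone, here is a pandas-only sketch. The exact formulas are assumptions inferred from the feature names, not the definitions in feature_engineering.py, and the indicator features (rsi_14, macd, and so on) that the plan delegates to stockstats are left out. The function name `price_volume_features` and the lowercase column names are hypothetical.

```python
import pandas as pd


def price_volume_features(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Compute a subset of the feature list using only past and current bars."""
    close, volume = ohlcv["close"], ohlcv["volume"]
    out = pd.DataFrame(index=ohlcv.index)
    out["return_1d"] = close / close.shift(1) - 1
    out["return_5d"] = close / close.shift(5) - 1
    # Today's volume relative to its trailing 5-day average (includes today).
    out["volume_ratio_5d"] = volume / volume.rolling(5).mean()
    # Overnight gap: today's open versus yesterday's close.
    out["gap_pct"] = ohlcv["open"] / close.shift(1) - 1
    # Intraday range as a fraction of the close.
    out["high_low_range"] = (ohlcv["high"] - ohlcv["low"]) / close
    # Distance of the close from its 50-day simple moving average.
    out["sma50_distance"] = close / close.rolling(50).mean() - 1
    return out
```

Every column uses only trailing windows and shifts, so a row never sees prices from after its own date. The first rows of each column are NaN until enough history is available.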