Pipeline Scoring & Ranking

Current Understanding

The LLM assigns each candidate a final_score (0-100) and a confidence (1-10). The two are correlated but not identical: a speculative setup can score 80 with confidence 6. The ranker uses final_score as its primary sort key. There is no evidence yet on whether confidence or final_score is the better predictor of outcomes.
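
As a minimal sketch of the ranking rule described above (not the pipeline's actual ranker): the `Candidate` class, the `rank` helper, the placeholder tickers, and the use of confidence as a tie-breaker are all assumptions made for illustration. The note only states that final_score is the primary sort key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one scored candidate."""
    ticker: str
    scanner: str
    final_score: int  # 0-100, assigned by the LLM
    confidence: int   # 1-10, assigned by the LLM


def rank(candidates):
    # Primary key: final_score, descending (as stated in the note).
    # Secondary key: confidence, descending (an assumption for tie-breaking).
    return sorted(candidates, key=lambda c: (-c.final_score, -c.confidence))


# Placeholder tickers and scanners, not real recommendations.
candidates = [
    Candidate("AAA", "scanner_x", final_score=80, confidence=6),
    Candidate("BBB", "scanner_y", final_score=80, confidence=8),
    Candidate("CCC", "scanner_z", final_score=72, confidence=9),
]
ranked = rank(candidates)
print([c.ticker for c in ranked])
```

Here CCC has the highest confidence but still ranks last, which is the behavior that makes the "confidence vs. score" question worth testing.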

Evidence Log

2026-04-12 — Cross-scanner calibration analysis

  • Calibration is tight: for every scanner, avg score/10 is within 0.5 of avg confidence. No systemic miscalibration.
  • The current min_score_threshold=55 in discovery_config.py:52 allows borderline candidates (GME social_dd score 56, TSLA options_flow 60, FRT early_accumulation 60) into final rankings.
  • These low-scoring picks carry confidence 5-6 and are explicitly speculative. Raising threshold to 65 would eliminate them without losing high-conviction picks.
  • insider_buying has 136 recs, only 1 of them below score 60 (the 50-59 bucket). Raising the threshold to 65 would trim at most ~15% of insider picks: 20 fall in the 60-69 bucket, and only those scoring 60-64 would actually be cut.
  • Confidence: medium
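
The calibration check in this entry can be sketched as follows. This is an illustration under assumptions: the record layout (dicts with `scanner`, `final_score`, `confidence`), the helper names, and the sample rows are hypothetical, and only the 0.5 tolerance and the score/10 comparison come from the note.

```python
from collections import defaultdict
from statistics import mean


def calibration_gaps(recs):
    """Return {scanner: avg(final_score)/10 - avg(confidence)}."""
    by_scanner = defaultdict(list)
    for rec in recs:
        by_scanner[rec["scanner"]].append(rec)
    return {
        scanner: mean(r["final_score"] for r in rows) / 10
        - mean(r["confidence"] for r in rows)
        for scanner, rows in by_scanner.items()
    }


def is_calibrated(gaps, tolerance=0.5):
    # A scanner counts as calibrated when score/10 and confidence
    # agree to within the tolerance used in the note.
    return all(abs(gap) <= tolerance for gap in gaps.values())


# Illustrative rows only; these are not real pipeline records.
recs = [
    {"scanner": "scanner_a", "final_score": 80, "confidence": 8},
    {"scanner": "scanner_a", "final_score": 60, "confidence": 6},
    {"scanner": "scanner_b", "final_score": 90, "confidence": 6},
]
gaps = calibration_gaps(recs)
print(gaps, is_calibrated(gaps))
```

In this made-up data scanner_a is perfectly calibrated while scanner_b scores far above its stated confidence, so the overall check fails.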

Pending Hypotheses

  • Is confidence a better outcome predictor than final_score?
  • Does raising the score threshold to 65 improve hit rate? → Evidence so far supports it: the low-score candidates are weak (social sentiment without data, speculative momentum). Implement threshold raise to 65.
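
To make the proposed change concrete, here is a small sketch of the threshold filter applied at 55 and at 65. The three borderline picks and their scores are taken from the evidence log; the `XYZ` row, the `apply_threshold` helper, and the inclusive `>=` comparison are assumptions, since the note does not show how discovery_config.py applies min_score_threshold.

```python
CURRENT_THRESHOLD = 55   # min_score_threshold as reported in the note
PROPOSED_THRESHOLD = 65  # value proposed in the hypothesis


def apply_threshold(candidates, threshold):
    # Inclusive comparison is an assumption about the real filter.
    return [c for c in candidates if c["final_score"] >= threshold]


candidates = [
    {"ticker": "GME", "scanner": "social_dd", "final_score": 56},
    {"ticker": "TSLA", "scanner": "options_flow", "final_score": 60},
    {"ticker": "FRT", "scanner": "early_accumulation", "final_score": 60},
    # Hypothetical high-conviction pick, added only for contrast.
    {"ticker": "XYZ", "scanner": "insider_buying", "final_score": 78},
]

kept_now = apply_threshold(candidates, CURRENT_THRESHOLD)
kept_proposed = apply_threshold(candidates, PROPOSED_THRESHOLD)
print(len(kept_now), [c["ticker"] for c in kept_proposed])
```

All four candidates pass at 55, while only the hypothetical high scorer survives at 65.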

2026-04-12 — P&L outcome analysis (mature recs, 2nd iteration)

  • news_catalyst: 0% 7d win rate, -8.79% avg 7d return (7 samples). Worst performing strategy by far.
  • social_hype: 14.3% 7d win rate, -4.84% avg 7d, -10.45% avg 30d (21-22 samples). Consistently loses money at both horizons.
  • social_dd: surprisingly the best long-term performer, with a 55% 30d win rate and +0.94% avg 30d return. It is the only scanner positive at 30d.
  • minervini: best short-term signal but small sample (n=3 for 1d tracking).
  • Critical gap confirmed: format_stats_summary() shows only the top 3 strategies, so the LLM never sees news_catalyst (0% 7d) or social_hype (14.3% 7d) flagged as poor performers.
  • Confidence: high
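
One way to close the gap noted above is to report the weakest strategies alongside the strongest. The sketch below is not the real format_stats_summary(), whose signature and output format are not shown in this note; `summarize_best_and_worst` is a hypothetical helper, and the `scanner_a`/`scanner_b`/`scanner_c` win rates are placeholders. Only the news_catalyst and social_hype figures come from the entry above.

```python
def summarize_best_and_worst(win_rates, n=3):
    """Build a text summary listing the n best and n worst 7d win rates."""
    ranked = sorted(win_rates.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[:n]
    # Walk from the bottom and skip anything already listed as "best",
    # so short inputs do not report the same strategy twice.
    worst = [kv for kv in ranked[::-1][:n] if kv not in best]

    lines = ["Best 7d win rates:"]
    lines += [f"  {name}: {rate:.1f}%" for name, rate in best]
    lines.append("Worst 7d win rates:")
    lines += [f"  {name}: {rate:.1f}%" for name, rate in worst]
    return "\n".join(lines)


win_rates_7d = {
    "news_catalyst": 0.0,   # from the note
    "social_hype": 14.3,    # from the note
    "scanner_a": 55.0,      # placeholder value
    "scanner_b": 48.0,      # placeholder value
    "scanner_c": 40.0,      # placeholder value
}
summary = summarize_best_and_worst(win_rates_7d, n=2)
print(summary)
```

With a summary shaped like this, the ranker prompt would carry the poor track records explicitly instead of relying on their absence from a top-3 list.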

2026-04-14 — P&L update (mature recs, 3rd iteration: Apr 3-9)

  • news_catalyst: still 0% 7d win rate, -8.37% avg 7d (8 samples, +1). WTI appeared Apr 3 (score=72) and Apr 6 (score=78) despite 0% track record. Ranker prompt updated: news_catalyst now explicitly flagged as "AVOID by default" with 0% win rate stated in criteria section.
  • social_hype: 18.2% 7d win rate (updated from 14.3%), -4.58% avg 7d (22 samples). LLY scored 82 and AI scored 80 from social_hype in Apr 3-9 — overconfident. Ranker prompt already warns "SPECULATIVE" for social_hype.
  • short_squeeze: 60% 7d win rate confirmed, but only 30% at 30d. The signal degrades sharply. Noted in short_squeeze.md.
  • insider_buying staleness: 50% of insider_buying picks in Apr 3-9 were stale repeats (PAGS×4, ZBIO×4, HMH×3). Staleness suppression filter implemented in insider_buying.py.
  • Overall pipeline: 626 tracked recs, 41.9% 7d win rate, 34.7% 30d win rate, -2.79% avg 30d return.
  • Confidence: high
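
For the staleness issue in this entry, a suppression rule could look like the sketch below. It is an assumption-laden illustration, not the filter actually implemented in insider_buying.py: the `suppress_stale` helper, the 7-day window, the record layout, and the specific dates are all hypothetical (the note gives repeat counts for Apr 3-9 but not the individual dates).

```python
from datetime import date, timedelta


def suppress_stale(recs, window_days=7):
    """Keep a rec only if the same ticker was not already kept
    within the previous window_days."""
    last_kept = {}
    fresh = []
    for rec in sorted(recs, key=lambda r: r["date"]):
        previous = last_kept.get(rec["ticker"])
        if previous is not None and rec["date"] - previous < timedelta(days=window_days):
            continue  # stale repeat of a recent pick
        last_kept[rec["ticker"]] = rec["date"]
        fresh.append(rec)
    return fresh


# Illustrative dates; only the tickers come from the note.
recs = [
    {"ticker": "PAGS", "date": date(2026, 4, 3)},
    {"ticker": "HMH", "date": date(2026, 4, 3)},
    {"ticker": "PAGS", "date": date(2026, 4, 6)},
    {"ticker": "PAGS", "date": date(2026, 4, 7)},
    {"ticker": "PAGS", "date": date(2026, 4, 9)},
]
fresh = suppress_stale(recs)
print([(r["ticker"], r["date"].isoformat()) for r in fresh])
```

Under these assumptions the four PAGS entries collapse to the first one, and a repeat only reappears once the window has elapsed.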