Data & Feature Engineering
Data is pulled from two independent OHLCV sources, annual fundamental CSVs, and a news sentiment API, then combined into a stacked panel of ~145 features per ticker-day. The four feature layers must be applied in a fixed order — each depends on the output of the previous — and all degrade gracefully when upstream data is absent.
- Technical (30+): trend, momentum, volatility, volume, price structure, and market regime indicators via pandas-ta
- Fundamental (57): margins, growth, balance sheet health, cash flow quality, Piotroski F-Score, valuation multiples, and 1Y/2Y/3Y valuation momentum from annual FastFS exports; forward-filled with a 90-day reporting lag
- Sentiment (33): daily news sentiment aggregated at the 16:00 ET market-close cut-off, with exponential decay fill-forward; rolling lags, momentum, volatility, directional mix, and regime features
- Peer-relative (22): cross-sectional rankings and ratios against industry and sector peers; composite value, quality, and momentum scores
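The 90-day reporting lag on fundamentals is the detail most likely to cause look-ahead bias if it is applied incorrectly, so a small illustration may help. The sketch below is not the project's code: the function name and the sample figures are hypothetical, and it assumes the lag is counted from the fiscal year-end date. It uses only the standard library, whereas the real pipeline presumably works on DataFrames.

```python
from bisect import bisect_right
from datetime import date, timedelta

# 90 days comes from the text; counting it from fiscal year-end is an assumption.
REPORTING_LAG = timedelta(days=90)

def point_in_time_value(fiscal_year_ends, values, as_of):
    """Return the latest annual value usable on `as_of`, or None if none is yet.

    `fiscal_year_ends` must be sorted ascending and aligned with `values`.
    """
    # Date from which each annual figure is treated as known to the market.
    available_from = [d + REPORTING_LAG for d in fiscal_year_ends]
    # Position of the last figure whose availability date is <= as_of.
    idx = bisect_right(available_from, as_of) - 1
    return values[idx] if idx >= 0 else None

# Made-up figures, for illustration only.
year_ends = [date(2022, 12, 31), date(2023, 12, 31)]
gross_margin = [0.41, 0.44]

early = point_in_time_value(year_ends, gross_margin, date(2023, 3, 1))
stale = point_in_time_value(year_ends, gross_margin, date(2024, 3, 1))
fresh = point_in_time_value(year_ends, gross_margin, date(2024, 4, 1))
print(early, stale, fresh)
```

On 1 March 2024 the lookup still returns the 2022 figure, because the 2023 figure only becomes usable 90 days after its year-end. Returning `None` before any figure is available is one way to get the graceful degradation described above.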
Models & Backtesting
Two model architectures are implemented behind a common interface. The stacked ensemble uses Optuna TPE search to tune four XGBoost and LightGBM base learners, whose out-of-fold predictions are fed to logistic regression and ridge meta-learners. The LSTM uses a dual classification/regression head over 30-day sliding windows and trains with cosine-annealing LR on CUDA when available. Both support resumable checkpoints and a versioned model registry.
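To make the 30-day sliding windows and the two LSTM heads concrete, here is a minimal sketch of how training samples might be assembled. The function name is hypothetical, and the one-day-ahead target is an assumption: the text does not state the prediction horizon.

```python
SEQ_LEN = 30  # sequence length stated in the text

def make_windows(features, returns, seq_len=SEQ_LEN):
    """Build (window, direction, next_return) samples in chronological order.

    features: list of per-day feature vectors, oldest first.
    returns:  returns[t] is the return realised on day t (same length).
    The target for a window ending on day t is the return on day t + 1,
    which is an assumed horizon rather than a documented one.
    """
    samples = []
    for end in range(seq_len - 1, len(features) - 1):
        window = features[end - seq_len + 1 : end + 1]  # the seq_len days up to `end`
        next_ret = returns[end + 1]                     # regression target (MSE head)
        direction = 1 if next_ret > 0 else 0            # binary target (BCE head)
        samples.append((window, direction, next_ret))
    return samples

# Toy data: 35 days, one feature per day, alternating +1% / -1% returns.
toy_features = [[float(i)] for i in range(35)]
toy_returns = [0.01 if i % 2 == 0 else -0.01 for i in range(35)]
samples = make_windows(toy_features, toy_returns)
print(len(samples), samples[0][1], samples[1][1])
```

Each window contains only days up to and including `end`, so the target never overlaps the inputs.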
The walk-forward backtester retrains a fresh model per window on historical-only data, applies one of three position strategies (long-only, market-neutral, threshold), and reports Sharpe, Sortino, max drawdown, CAGR, win rate, and Spearman IC — with transaction costs and slippage configurable in basis points.
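As a rough illustration of how the reported metrics relate to a daily return series, the sketch below computes net returns after costs, then Sharpe, Sortino, and maximum drawdown. All names are hypothetical. It assumes 252 trading days per year, a zero target return for Sortino, and that costs and slippage are charged in basis points of the notional traded each day (`turnover`); the project's exact conventions may differ.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # common annualisation convention, assumed here

def net_returns(gross, turnover, cost_bps=5.0, slippage_bps=2.0):
    """Subtract frictions, charged in basis points of the notional traded."""
    friction = (cost_bps + slippage_bps) / 10_000
    return [g - t * friction for g, t in zip(gross, turnover)]

def sharpe(returns):
    """Annualised mean over sample standard deviation (risk-free rate ignored)."""
    return mean(returns) / stdev(returns) * math.sqrt(TRADING_DAYS)

def sortino(returns):
    """Like Sharpe, but only returns below zero count towards the risk term."""
    downside = math.sqrt(mean([min(r, 0.0) ** 2 for r in returns]))
    return mean(returns) / downside * math.sqrt(TRADING_DAYS)

def max_drawdown(returns):
    """Largest peak-to-trough fall of the compounded equity curve (<= 0)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst

# Made-up daily returns and turnover, for illustration only.
gross = [0.01, -0.02, 0.015, 0.005, -0.005]
turnover = [1.0, 0.0, 0.5, 0.0, 1.0]
net = net_returns(gross, turnover)
print(sharpe(gross), sharpe(net), sortino(net), max_drawdown(net))
```

A constant return series or one with no losing days would divide by zero here, so a production version would need to guard those cases.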
- Stacked ensemble: XGBoost + LightGBM → LogisticRegression + Ridge meta-learners, tuned with Optuna
- PyTorch LSTM: dual-head (BCE classification + MSE regression), sequence length 30, CUDA-aware
- Walk-forward engine: expanding or rolling windows, per-window retraining, full trade log output
- 250+ tests across 9 modules using synthetic data; CI on Python 3.11 and 3.12
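The difference between expanding and rolling windows in the walk-forward engine can be shown with a small index generator. This is a sketch under assumptions rather than the engine itself: the function name and parameters are hypothetical, and it steps forward by exactly one test window at a time.

```python
def walk_forward_splits(n_obs, train_size, test_size, expanding=True):
    """Yield (train_range, test_range) pairs in chronological order.

    expanding=True  -> training always starts at observation 0 and grows.
    expanding=False -> training is a fixed-length window that rolls forward.
    """
    train_end = train_size
    while train_end + test_size <= n_obs:
        train_start = 0 if expanding else train_end - train_size
        yield range(train_start, train_end), range(train_end, train_end + test_size)
        train_end += test_size  # advance by one test window

expanding_splits = list(walk_forward_splits(10, train_size=4, test_size=2))
rolling_splits = list(walk_forward_splits(10, train_size=4, test_size=2, expanding=False))

for train, test in expanding_splits:
    print("expanding", list(train), list(test))
for train, test in rolling_splits:
    print("rolling  ", list(train), list(test))
```

In both modes every training index precedes every test index, which is what keeps each retrained model on historical-only data.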
Forecasting & Equity Research
The forecast pipeline runs the full feature stack end-to-end and outputs a ranked table of S&P 500 tickers with predicted return, direction probability, bootstrap confidence intervals, and a conviction score. A separate research command queries the xAI Grok API for a structured market summary, then re-scores it independently with a local Ollama model and FinBERT for positive/negative/neutral probabilities.
- Ranked daily forecast: predicted return, direction probability, bootstrap CIs, conviction
- Equity research pipeline: Grok (live web access) → Ollama sentiment score → FinBERT classification
- Trade visualisation: annotated two-panel price + volume charts with entry/exit markers and P&L shading
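The text does not say what the forecast pipeline resamples to obtain its bootstrap confidence intervals. As one plausible reading, the sketch below applies a percentile bootstrap to the mean of a set of per-ticker return predictions (for example, one per base learner). The function name, defaults, and sample numbers are all hypothetical.

```python
import random
from statistics import mean

def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(samples, k=n)) for _ in range(n_resamples))
    # Symmetric cut at the alpha/2 and 1 - alpha/2 percentiles.
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - 1 - lo_idx
    return means[lo_idx], means[hi_idx]

# Made-up predicted returns for a single ticker, for illustration only.
predictions = [0.010, 0.020, -0.005, 0.015, 0.030, 0.000, 0.012]
ci_low, ci_high = bootstrap_ci(predictions)
print(mean(predictions), ci_low, ci_high)
```

A narrow interval that excludes zero could then feed into a conviction score, although how the project defines conviction is not specified here.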