piotr@marciniak ~/projects/athena
type · Personal Project

Athena

A modular quantitative research system for S&P 500 equities — multi-layer feature engineering, stacked ensemble and LSTM models, walk-forward backtesting, and ranked forecasting.

Languages · Python
Models · XGBoost, LightGBM, LSTM, FinBERT
Tools · PyTorch, Optuna, pandas-ta, pydantic
Focus · Quant Research, Feature Engineering, Backtesting
01 // summary

Athena is a research-grade stock prediction pipeline for S&P 500 equities. It ingests 15 years of OHLCV data from two sources (yfinance and EODHD), annual fundamental statements, and daily news sentiment, then engineers approximately 145 features across four ordered layers before training and evaluating ML models.

The system is designed around strict no-leakage principles: walk-forward retraining, point-in-time fundamentals with a 90-day reporting lag, and sentiment cut-offs aligned to the market close. Transaction costs are included in all backtest results. The CLI exposes 12 commands covering the full lifecycle, from data fetching through to ranked daily forecasts and single-ticker equity research.
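
As a minimal sketch of the point-in-time rule, assuming a statement only becomes usable 90 days after its fiscal period end, the lookup below returns the latest value that was public on a given date. The function name, the tuple layout, and the revenue figures are hypothetical and not taken from Athena's code.

```python
from bisect import bisect_right
from datetime import date, timedelta

REPORTING_LAG = timedelta(days=90)  # the lag stated in the text


def point_in_time_value(statements, as_of):
    """Return the latest fundamental value that was public on `as_of`.

    `statements` is a list of (fiscal_period_end, value) tuples sorted by date.
    A statement counts as known only from period end + 90 days onwards.
    """
    available_from = [period_end + REPORTING_LAG for period_end, _ in statements]
    idx = bisect_right(available_from, as_of)  # how many statements are already public
    return statements[idx - 1][1] if idx else None


# Hypothetical annual revenue figures keyed by fiscal year end.
revenue = [(date(2022, 12, 31), 100.0), (date(2023, 12, 31), 120.0)]

early = point_in_time_value(revenue, date(2024, 3, 1))   # FY2023 is not usable yet
late = point_in_time_value(revenue, date(2024, 4, 15))   # FY2023 is usable now
```

Under these assumptions a feature row dated 1 March 2024 still sees the FY2022 figure, which is the behaviour a forward-fill with a reporting lag is meant to guarantee.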

02 // system
3 components

Data & Feature Engineering

Data is pulled from two independent OHLCV sources, annual fundamental CSVs, and a news sentiment API, then combined into a stacked panel of ~145 features per ticker-day. The four feature layers must be applied in a fixed order, because each depends on the output of the previous one, and all degrade gracefully when upstream data is absent.

  • Technical (30+) — trend, momentum, volatility, volume, price structure, and market regime indicators via pandas-ta
  • Fundamental (57) — margins, growth, balance sheet health, cash flow quality, Piotroski F-Score, valuation multiples, and 1Y/2Y/3Y valuation momentum from annual FastFS exports; forward-filled with a 90-day reporting lag
  • Sentiment (33) — daily news sentiment aggregated at the 16:00 ET market-close cut-off, with exponential decay fill-forward; rolling lags, momentum, volatility, directional mix, and regime features
  • Peer-relative (22) — cross-sectional rankings and ratios against industry and sector peers; composite value, quality, and momentum scores
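
The exponential decay fill-forward used in the sentiment layer can be illustrated roughly as follows. This is a sketch under assumptions: days without news are represented as `None`, the last observed score decays towards zero, and the half-life parameter is hypothetical. Athena's actual decay rate and data layout are not stated here.

```python
def decay_fill_forward(scores, half_life_days=3.0):
    """Fill gaps (None) with the last observed score, decayed towards zero.

    After `half_life_days` days without news, the carried score has halved.
    Leading gaps stay None because there is nothing to carry forward yet.
    """
    decay = 0.5 ** (1.0 / half_life_days)  # per-day decay factor
    filled, last, gap = [], None, 0
    for score in scores:
        if score is not None:
            last, gap = score, 0
            filled.append(score)
        elif last is None:
            filled.append(None)
        else:
            gap += 1
            filled.append(last * decay ** gap)
    return filled


# Made-up daily sentiment scores; None marks a day with no news.
daily = [None, 0.8, None, None, -0.4, None]
smoothed = decay_fill_forward(daily, half_life_days=1.0)
```

Decaying rather than copying the last value keeps stale news from looking as strong as fresh news, which matters for the rolling lag and momentum features built on top.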

Models & Backtesting

Two model architectures are implemented behind a common interface. The stacked ensemble uses Optuna's TPE (Tree-structured Parzen Estimator) search to tune four XGBoost and LightGBM base learners, whose out-of-fold predictions are fed to logistic regression and ridge meta-learners. The LSTM uses a dual classification/regression head over 30-day sliding windows and trains with a cosine-annealing learning-rate schedule on CUDA when available. Both support resumable checkpoints and a versioned model registry.
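
The out-of-fold step is what keeps the meta-learners from seeing predictions made on rows the base learners were trained on. The sketch below shows the idea in plain Python with a toy stand-in for a base learner. The helper names, the contiguous fold layout, and the data are assumptions for illustration. The real pipeline uses XGBoost and LightGBM, and its fold scheme is not described here.

```python
def out_of_fold_predictions(X, y, fit, n_folds=4):
    """Predict every row with a model that never saw that row during training.

    `fit(X_train, y_train)` must return a callable `predict(row) -> float`.
    Folds are contiguous blocks of rows, in order.
    """
    n = len(X)
    oof = [None] * n
    for fold in range(n_folds):
        start, stop = fold * n // n_folds, (fold + 1) * n // n_folds
        train_idx = [i for i in range(n) if not start <= i < stop]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in range(start, stop):
            oof[i] = predict(X[i])
    return oof


def fit_mean(X_train, y_train):
    """Toy base learner: always predicts the mean of its training targets."""
    mean = sum(y_train) / len(y_train)
    return lambda row: mean


X = [[float(i)] for i in range(8)]   # made-up single-feature rows
y = [0, 1, 0, 1, 1, 1, 0, 1]         # made-up direction labels
meta_features = out_of_fold_predictions(X, y, fit_mean, n_folds=4)
```

Each base learner would contribute one such column, and the meta-learners are then fitted on those columns rather than on in-sample predictions.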

The walk-forward backtester retrains a fresh model per window on historical-only data, applies one of three position strategies (long-only, market-neutral, threshold), and reports Sharpe, Sortino, max drawdown, CAGR, win rate, and the Spearman information coefficient (IC). Transaction costs and slippage are configurable in basis points.
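
A walk-forward split can be sketched as a generator of chronological train/test index ranges, where the training range always ends before the test range begins. The function name and parameters below are hypothetical. Only the expanding and rolling modes come from the description of the engine.

```python
def walk_forward_windows(n_obs, initial_train, test_size, mode="expanding"):
    """Yield (train_range, test_range) index pairs in chronological order.

    expanding: training always starts at index 0 and grows each window.
    rolling:   training keeps a fixed length of `initial_train` observations.
    """
    if mode not in ("expanding", "rolling"):
        raise ValueError("mode must be 'expanding' or 'rolling'")
    test_start = initial_train
    while test_start + test_size <= n_obs:
        train_start = 0 if mode == "expanding" else test_start - initial_train
        yield range(train_start, test_start), range(test_start, test_start + test_size)
        test_start += test_size


rolling = list(walk_forward_windows(10, initial_train=4, test_size=2, mode="rolling"))
expanding = list(walk_forward_windows(10, initial_train=4, test_size=2))

for train, test in rolling:
    print(f"train {train.start}..{train.stop - 1}  test {test.start}..{test.stop - 1}")
```

A fresh model would be fitted on each training range and scored only on the matching test range, so no observation is ever predicted by a model that has seen it.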

  • Stacked ensemble: XGBoost + LightGBM → LogisticRegression + Ridge meta-learners, tuned with Optuna
  • PyTorch LSTM: dual-head (BCE classification + MSE regression), sequence length 30, CUDA-aware
  • Walk-forward engine: expanding or rolling windows, per-window retraining, full trade log output
  • 250+ tests across 9 modules using synthetic data; CI on Python 3.11 and 3.12
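
To make the cost handling and two of the reported metrics concrete, here is a small sketch. It assumes costs are charged in proportion to daily turnover, a zero risk-free rate, and 252 trading days per year. All names and numbers are illustrative and not Athena's.

```python
import math
import statistics


def net_returns(gross, turnover, cost_bps):
    """Subtract trading costs: each unit of turnover costs `cost_bps` basis points."""
    return [g - t * cost_bps / 10_000 for g, t in zip(gross, turnover)]


def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio, assuming a zero risk-free rate."""
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve (a negative fraction)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


gross = [0.010, -0.020, 0.015, 0.005]  # made-up daily strategy returns
turnover = [1.0, 0.0, 0.5, 0.0]        # fraction of the book traded each day
net = net_returns(gross, turnover, cost_bps=10)
```

Computing the metrics on `net` rather than `gross` is the point: a strategy that only looks good before costs should show that in its reported Sharpe.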

Forecasting & Equity Research

The forecast pipeline runs the full feature stack end-to-end and outputs a ranked table of S&P 500 tickers with predicted return, direction probability, bootstrap confidence intervals, and a conviction score. A separate research command queries the xAI Grok API for a structured market summary, then re-scores that summary independently: a local Ollama model assigns a sentiment score and FinBERT produces positive/negative/neutral probabilities.

  • Ranked daily forecast: predicted return, direction probability, bootstrap CIs, conviction
  • Equity research pipeline: Grok (live web access) → Ollama sentiment score → FinBERT classification
  • Trade visualisation: annotated two-panel price + volume charts with entry/exit markers and P&L shading
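
A percentile bootstrap is one common way to attach a confidence interval to a forecast. The sketch below resamples a set of predictions for one ticker and reports the spread of the resampled means. What Athena actually resamples is not stated, so the inputs, the function name, and the defaults are assumptions.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi


# Made-up predicted returns for one ticker, e.g. from several model runs.
preds = [0.012, 0.008, 0.015, -0.002, 0.010, 0.006]
lo, hi = bootstrap_ci(preds)
print(f"mean {statistics.fmean(preds):.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

A narrow interval that stays on one side of zero is the kind of signal a conviction score could reasonably build on, though how the score is actually derived is not specified.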