This repository contains a quantitative investment strategy pipeline developed for the McGill-FIAM Hackathon. The end-to-end project showcases:
- Data Preprocessing
- Feature Engineering & Selection
- Machine Learning Predictor (returns alpha model)
- Portfolio Construction (Black-Litterman, Hierarchical Risk Parity)
- Backtesting & Performance Evaluation
Additionally, it integrates causal discovery techniques and LLM-driven fundamental features (from 10-K filings) for richer alpha signals.
Slides presented at McGill-FIAM: http://bit.ly/40GVIlI
- ML-Based Alpha Generation
  - We use a Bagging Random Forest (with advanced feature engineering) to predict stock excess returns.
- Causal Discovery
  - Uses AVICI (pretrained causal discovery model) to identify potential causal links among financial factors.
- Portfolio Optimization
  - Implements Hierarchical Risk Parity (HRP) and Black-Litterman for robust asset allocation.
  - Multiple strategies exploring Carhart factors, custom alpha signals, and different optimization methods.
- NLP & Fundamental Analysis
  - Zero-shot language model (LLaMA) to extract risk factor scores, readability, and sentiment from SEC 10-K filings.
  - Integrates these textual features into factor and alpha models.
- Comprehensive Backtesting
  - Rolling-window approach from 2010–2023.
  - Reports Sharpe Ratio, Information Ratio, max drawdown, log-loss, confusion matrices, etc.
  - Final results in 06-Backtesting/BlackLitermann-HRP/HRP_backtest.png showcasing multiple strategies:
    - Carhart + HRP
    - Alpha-signal + HRP (no BL)
    - Alpha-signal + HRP + Black-Litterman
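For readers unfamiliar with the reported metrics, here is a minimal sketch of how an annualized Sharpe ratio and a maximum drawdown can be computed from a series of monthly returns. It is not taken from backtest_stats.py; the function names, the 12-periods-per-year annualization, and the use of the sample standard deviation are assumptions made for illustration.

```python
import math


def sharpe_ratio(excess_returns, periods_per_year=12):
    """Annualized Sharpe ratio of periodic excess returns (assumed monthly)."""
    n = len(excess_returns)
    if n < 2:
        raise ValueError("need at least two return observations")
    mean = sum(excess_returns) / n
    # Sample variance (n - 1 in the denominator); an illustrative choice.
    variance = sum((r - mean) ** 2 for r in excess_returns) / (n - 1)
    std = math.sqrt(variance)
    if std == 0:
        raise ValueError("returns have zero volatility")
    return mean / std * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded wealth curve (<= 0)."""
    wealth = 1.0
    peak = 1.0
    worst = 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = min(worst, wealth / peak - 1.0)
    return worst
```

The Information Ratio follows the same formula as `sharpe_ratio` when the input is the portfolio return minus the benchmark return instead of the excess return over the risk-free rate.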
Below is a high-level breakdown (see scripts/directory_tree.py for full detail):
mcgill_fiam
├─ 01-Data_Preprocessing
│ ├─ preprocessing_code.py
│ ├─ factors_theme.json
├─ 02-Feature_Engineering
│ └─ feature_engineering_code.py
├─ 03-Feature_Importance
│ ├─ feature_importance_code.py
│ ├─ feature_selection_code.py
│ └─ top_100_features.json
├─ 04-Predictor
│ ├─ train_AlphaSignals.py
│ └─ predictions_performance.py
├─ 05-Asset_Allocation
│ ├─ strategy_3/...
│ ├─ strategy_10/...
│ ├─ ...
│ └─ original_hrp.py
├─ 06-Backtesting
│ ├─ backtester_parallel.py
│ ├─ backtest_stats.py
│ ├─ BlackLitermann-HRP/
│ │ ├─ HRP_backtest.png ← **Final backtest chart**
│ └─ ...
├─ 0X-Causal_discovery
│ ├─ discovery.py
│ └─ Final_Features.json
├─ 0L-CoTZeroShotFeatures
│ ├─ create_dataset.py
│ └─ llama-3.2-3B-Instruct-Inference-*.py
├─ objects/
│ ├─ (Intermediate data, model outputs, predictions, performance metrics, etc.)
├─ raw_data/
│ ├─ factor_char_list.csv
│ ├─ hackathon_sample_v2.csv
│ └─ mkt_ind.csv
├─ notebooks/
│ ├─ (Assorted exploratory & debugging Jupyter notebooks)
├─ requirements.txt
├─ README.md (this file)
└─ ...
- Clone this repository:
    git clone https://github.com/theAayushbajaj/mcgill_fiam.git
    cd mcgill_fiam
- Install dependencies:
    pip install -r requirements.txt
  - Python 3.7+ recommended
  - If you plan on running large language models locally, ensure you have GPU support or suitable hardware.
- Data Files
  - The main data files (hackathon_sample_v2.csv, mkt_ind.csv, etc.) should reside in raw_data or raw_data_v3.
  - Additional references:
    - Factor definitions: factor_char_list.csv
    - Market index: mkt_ind.csv
    - SEC 10-Ks for fundamental NLP in 0L-CoTZeroShotFeatures/assets/ or datasets/.
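Before running the pipeline, it can help to confirm that the raw data files are where the scripts expect them. The following is a small sketch of such a check, not part of the repository: the helper name `missing_data_files` is hypothetical, and the file list is copied from the raw_data/ entries in the directory tree above.

```python
from pathlib import Path

# File names as listed under raw_data/ in the directory tree.
REQUIRED_FILES = (
    "factor_char_list.csv",
    "hackathon_sample_v2.csv",
    "mkt_ind.csv",
)


def missing_data_files(data_dir="raw_data", required=REQUIRED_FILES):
    """Return the required file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in required if not (data_dir / name).is_file()]
```

Calling `missing_data_files("raw_data")` from the repository root returns an empty list when all three files are present; pass `"raw_data_v3"` instead if your data lives there.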
Below is the typical end-to-end workflow:
- Data Preprocessing
    cd 01-Data_Preprocessing
    python preprocessing_code.py
    cd ..
- Feature Engineering
    cd 02-Feature_Engineering
    python feature_engineering_code.py
    cd ..
- Feature Importance & Selection
    cd 03-Feature_Importance
    python feature_importance_code.py
    python feature_selection_code.py
    cd ..
- (Optional) Causal Discovery
    cd 0X-Causal_discovery
    python discovery.py
    cd ..
  - Produces a causal DAG using the AVICI model.
  - Helps refine final feature sets.
- Predictor Training
    cd 04-Predictor
    python train_AlphaSignals.py  # Produces out-of-sample predictions
    cd ..
  - Rolling-window Bagging RF
  - Generates monthly predictions in objects/predictions_*.csv
- Asset Allocation
  - Multiple strategies exist in 05-Asset_Allocation.
  - By default, the backtester calls main.py from your chosen strategy folder.
    - E.g., strategy_3/main.py or strategy_12/main.py.
- Backtesting
    cd 06-Backtesting
    python backtester_parallel.py
    cd ..
  - Aggregates predictions, builds a monthly portfolio, and tracks performance.
  - Final results (Sharpe ratio, alpha, drawdowns, etc.) are stored in objects/, and plots are saved in 06-Backtesting/... .
- (Optional) NLP on 10-K Filings
    cd 0L-CoTZeroShotFeatures
    python create_dataset.py
    python llama-3.2-3B-Instruct-Inference-RISK_FACTORS.py
    python llama-3.2-3B-Instruct-Inference-READABILITY-SCORE.py
    python llama-3.2-3B-Instruct-Inference-SENTIMENT-SCORES.py
    cd ..
  - Integrates fundamental signals into your pipeline or factor set.
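The required steps above can also be chained from a single script. The sketch below is one possible driver, not something shipped with the repository: `run_pipeline` and `STEPS` are hypothetical names, the optional causal discovery and NLP steps are left out, and each script is launched from inside its own folder to mirror the `cd` commands in the workflow.

```python
import subprocess
import sys
from pathlib import Path

# (folder, script) pairs in the order given by the workflow above.
STEPS = [
    ("01-Data_Preprocessing", "preprocessing_code.py"),
    ("02-Feature_Engineering", "feature_engineering_code.py"),
    ("03-Feature_Importance", "feature_importance_code.py"),
    ("03-Feature_Importance", "feature_selection_code.py"),
    ("04-Predictor", "train_AlphaSignals.py"),
    ("06-Backtesting", "backtester_parallel.py"),
]


def run_pipeline(repo_root=".", steps=STEPS, dry_run=False):
    """Run each step inside its folder; stop at the first failing script.

    Returns the list of (folder, command) pairs. With dry_run=True the
    commands are only collected and nothing is executed.
    """
    root = Path(repo_root)
    commands = []
    for folder, script in steps:
        command = [sys.executable, script]
        commands.append((folder, command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, cwd=root / folder, check=True)
    return commands
```

Use `run_pipeline(dry_run=True)` to inspect the planned commands first; `run_pipeline()` from the repository root executes them in order.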
- Backtest Results
  - 06-Backtesting/BlackLitermann-HRP/HRP_backtest.png: Shows three final portfolios:
    - Carhart + HRP
    - Alpha signals + HRP (no BL)
    - Alpha signals + Black-Litterman + HRP
- Model Predictions & Performance
  - objects/predictions_*.csv: ML predictions for each monthly test set
  - objects/performance_summary.csv: Overall classification metrics (accuracy, F1, etc.)
  - objects/Trading_Stats.pkl: Dictionary with final trading performance stats
  - objects/TradingLog_Stats.pkl: Detailed log-level stats (hit-miss, average trade return, etc.)
- Intermediate Files
  - objects/prices.csv, objects/signals.csv, objects/market_caps.csv, etc.
  - objects/X_DATASET.csv & objects/Y_DATASET.csv: Aggregated features & targets
  - objects/FULL_stacked_data.csv: Full feature-target-time panel
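The .pkl outputs listed above can be inspected with Python's standard pickle module. This is a minimal sketch: only the file name and the fact that it holds a dictionary come from the list above, while the helper name `load_trading_stats` is hypothetical and the dictionary's keys are not documented here.

```python
import pickle
from pathlib import Path


def load_trading_stats(path="objects/Trading_Stats.pkl"):
    """Load the pickled trading-statistics dictionary from disk."""
    with Path(path).open("rb") as handle:
        stats = pickle.load(handle)
    if not isinstance(stats, dict):
        raise TypeError(f"expected a dict, got {type(stats).__name__}")
    return stats
```

After a backtest run, `sorted(load_trading_stats())` lists the available statistic names. Only unpickle files you generated yourself, since loading a pickle can execute arbitrary code; if the stored values are pandas objects, pandas must be installed for the load to succeed.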
Raw Data
   |
(1) Data Preprocessing
   |
(2) Feature Engineering ----> (3) Feature Importance & Causal Discovery
   |                                   |
   v                                   |
(4) ML Alpha Model <-------------------
   |
(5) Portfolio Construction & Optimization (HRP / BL)
   |
(6) Backtesting & Evaluation
   |
Results & Plots
- Causal Discovery: The pipeline can incorporate causality-based feature selection. Ensure avici is installed if you run discovery.py.
- Performance Tips:
  - For large data or heavy ML training, consider parallel execution / HPC.
  - For LLM-based scripts, ensure GPU availability or use smaller models.
- References
  - Advances in Financial Machine Learning by Marcos López de Prado
  - Black-Litterman (1992)
  - Carhart 4-Factor Model (Carhart, 1997)
Thank you for using the McGill-FIAM Quant Asset Management Project! Feel free to open issues or pull requests to improve the codebase. Happy hacking and investing!