Combining GPT-2 based embeddings with a simple machine learning classifier outperformed the baseline models in the comparative analysis. The hybrid approach pairs the context-rich representations produced by GPT-2 with traditional machine learning techniques and shows improved predictive performance. This result suggests that pairing advanced language embeddings with domain-specific machine learning models can improve classification on tasks that use historical stock prices and news headline features.
This study applies sentiment analysis to information sourced from news outlets and merges it with stock market data to forecast the volatility movement of the DJIA index. Unlike prevalent methods, which focus primarily on stock price prediction, it explores whether contextual analysis of textual data can improve the precision of DJIA index volatility forecasts. The primary research question is: Can sentiment analysis of news headlines contribute to predicting volatility movements in the DJIA index? To address this question, two sub-questions are formulated:
- How do stock-market-related news headlines enhance the accuracy scores of deep neural network models in predicting DJIA index volatility movement?
- In what ways do Natural Language Processing techniques and qualitative analysis of news headlines contribute to forecasting DJIA index volatility movements?
The dataset utilized in this investigation is sourced from Kaggle covering the period from June 8th, 2008, to July 1st, 2016. Originally designed for students participating in a Deep Learning and NLP course, the dataset comprises both stock and news data, provided in .csv format and conveniently accessible on the associated website. Link: https://www.kaggle.com/aaron7sun/stocknews
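As a rough illustration of how the news data might be prepared, the sketch below reads a CSV and joins each day's headlines into one text field. The column names (`Date`, `Label`, `Top1` ... `TopN`), the helper `load_headlines`, and the sample rows are assumptions made for this example, not details confirmed by the project; an in-memory string stands in for the downloaded file.

```python
import io

import pandas as pd

# Made-up stand-in for the combined news/label CSV (illustrative rows only).
# Column names are assumed: Date, Label, Top1..TopN.
SAMPLE_CSV = """Date,Label,Top1,Top2,Top3
2008-08-08,0,Headline one,Headline two,
2008-08-11,1,Headline three,,Headline four
"""


def load_headlines(csv_source, n_headlines=3):
    """Read the CSV and join the day's headlines into a single 'text' column."""
    df = pd.read_csv(csv_source, parse_dates=["Date"])
    headline_cols = [f"Top{i}" for i in range(1, n_headlines + 1)]
    # Missing headlines become empty strings so the join never fails.
    headlines = df[headline_cols].astype("string").fillna("")
    df["text"] = headlines.apply(
        lambda row: " ".join(h.strip() for h in row if h.strip()), axis=1
    )
    # Keep rows in chronological order for later time-based splitting.
    return df.sort_values("Date").reset_index(drop=True)[["Date", "Label", "text"]]


news = load_headlines(io.StringIO(SAMPLE_CSV))
print(news)
```

With the real file, `csv_source` would be the path to the downloaded CSV and `n_headlines` would be 25.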
- Data Preprocessing: Loading and processing datasets, particularly using Pandas (pandas library). Tokenizing and encoding text data for model input.
- Model Building: Utilizing the Hugging Face Transformers library (transformers) to work with pre-trained models for NLP tasks. Creating and configuring a text classification model.
- Model Training: Training the text classification model on the provided dataset. Fine-tuning the pre-trained transformer model for specific tasks using custom data.
- Evaluation: Evaluating the trained model's performance on a validation dataset, likely using metrics such as accuracy, precision, recall, and F1 score.
- Prediction: Using the trained model to make predictions on new, unseen text data.
- Integration with Scikit-Learn: Leveraging Scikit-Learn (scikit-learn) for machine learning functionalities, such as splitting the dataset into training and validation sets.
- TensorFlow and Keras Integration: Combining the Hugging Face Transformers library with TensorFlow (tensorflow) and Keras (keras) for building and training models.
- Logging and Reporting: Logging information during the training process, possibly for monitoring training progress and model performance.
- Custom Tokenization and Padding: Handling tokenization and padding of text sequences for model input.
- Data Visualization (Possibly): Plotting and visualizing data or model performance using libraries like Matplotlib or Seaborn.
- Requirements File: Organizing project dependencies using a requirements.txt file.
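The splitting and evaluation steps listed above can be sketched with scikit-learn as follows. This is a minimal example on synthetic data: the random features, the `LogisticRegression` stand-in classifier, and the 80/20 split are assumptions for illustration and not the project's actual model or settings. `shuffle=False` is used so the validation set contains only the most recent rows, which is a common choice for time-ordered market data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and up/down labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# shuffle=False keeps chronological order: the last 20% of rows are held out.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Stand-in classifier (assumption); any estimator with fit/predict works here.
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_val, y_pred, average="binary", zero_division=0
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```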
- Financial Indicators: ['Stochastic_K', 'Stochastic_D', 'Momentum', 'Rate_of_Change', 'William_R', 'A/D_Oscillator', 'Disparity_5']
- Textual Features: 25 news headlines extracted from the Reddit World-News Channel
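The financial indicators named above could be derived from daily price data roughly as sketched below. The formulas follow commonly used textbook definitions; the window lengths (`n=14`, `d=3`, `lag=4`), the input column names (`High`, `Low`, `Close`), and the helper `add_indicators` are assumptions for this example, since the source does not state how the project computes them. Conventions also vary: Williams %R is shown here on a 0 to 100 scale, while many charting tools multiply it by -100.

```python
import numpy as np
import pandas as pd


def add_indicators(df, n=14, d=3, lag=4):
    """Append technical indicators to a frame with High, Low, Close columns."""
    out = df.copy()
    high_n = out["High"].rolling(n).max()   # highest high over n days
    low_n = out["Low"].rolling(n).min()     # lowest low over n days
    span = (high_n - low_n).replace(0, np.nan)  # avoid division by zero

    out["Stochastic_K"] = (out["Close"] - low_n) / span * 100
    out["Stochastic_D"] = out["Stochastic_K"].rolling(d).mean()
    out["Momentum"] = out["Close"] - out["Close"].shift(lag)
    out["Rate_of_Change"] = out["Close"] / out["Close"].shift(lag) * 100
    out["William_R"] = (high_n - out["Close"]) / span * 100

    day_range = (out["High"] - out["Low"]).replace(0, np.nan)
    out["A/D_Oscillator"] = (out["High"] - out["Close"].shift(1)) / day_range
    out["Disparity_5"] = out["Close"] / out["Close"].rolling(5).mean() * 100
    return out


# Toy price series: close rises by 1 per day, high/low sit 1 above/below it.
close = np.arange(101.0, 131.0)
prices = pd.DataFrame({"High": close + 1, "Low": close - 1, "Close": close})
ind = add_indicators(prices)
print(ind.tail(3).round(2))
```

The first rows of each indicator are NaN until its rolling window is full, so they need to be dropped or imputed before training.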
- Pandas: Used for data manipulation and analysis.
import pandas as pd
- Scikit-learn: Used for dataset splitting, evaluation metrics, custom transformers, pipelines, and feature scaling.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
- Transformers (Hugging Face): Used for working with transformer models.
from transformers import GPT2Tokenizer, GPT2Model
- PyTorch: Used for building and training neural networks.
import torch
- NumPy: Used for numerical operations.
import numpy as np
- Keras (with TensorFlow backend): Used for building and training neural networks.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout
from tensorflow.keras.optimizers import Adam
- SimpleImputer (scikit-learn): Used for imputing missing values in data.
from sklearn.impute import SimpleImputer
- Matplotlib: Used for data visualization.
import matplotlib.pyplot as plt
- Conv1D, MaxPooling1D, GRU (Keras layers): Used for building convolutional and recurrent neural network layers.
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GRU
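To show how the scikit-learn pieces imported above (`SimpleImputer`, `StandardScaler`, `Pipeline`, `ColumnTransformer`) might fit together, here is a small sketch that imputes and scales the numeric indicator columns while leaving the text column out. The median imputation strategy, the two-column subset, and the toy values are assumptions for illustration; the source does not specify these settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Subset of the indicator columns, kept short for the example.
INDICATOR_COLS = ["Stochastic_K", "Momentum"]

# Fill gaps first, then standardize to zero mean and unit variance.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # strategy is an assumption
    ("scale", StandardScaler()),
])

# Only the indicator columns pass through; other columns (e.g. text) are dropped.
preprocess = ColumnTransformer(
    [("indicators", numeric, INDICATOR_COLS)], remainder="drop"
)

frame = pd.DataFrame({
    "Stochastic_K": [20.0, np.nan, 80.0, 50.0],
    "Momentum": [1.0, -2.0, np.nan, 3.0],
    "text": ["a", "b", "c", "d"],
})
features = preprocess.fit_transform(frame)
print(features)
```

In practice the transformer should be fitted on the training split only and then applied to validation data, so that scaling statistics do not leak from the held-out period.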
- LSTM Model: The app utilizes a pre-trained LSTM model for predicting DJIA movements.
- GPT-2 Embedding: News headlines are transformed using the GPT-2 model for better input representation.
- User Interaction: Users can input 25 news headlines and select a date for prediction.
- Technical Indicators: Additional features such as Stochastic K/D, Momentum, Rate of Change, etc., are fetched from Yahoo Finance.
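The GPT-2 embedding step needs some way to turn per-token outputs into one fixed-size vector per headline. One common option is masked mean pooling, sketched below in NumPy; whether the app uses this or another pooling method is not stated in the source, so treat it as an assumption. With the Transformers library, `hidden_states` would typically come from the `last_hidden_state` of `GPT2Model` and `attention_mask` from the tokenizer output. Note that the GPT-2 tokenizer has no padding token by default, so batched padding usually requires assigning one (for example, reusing the end-of-sequence token).

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states:  array of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against all-padding rows to avoid dividing by zero.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Toy batch: 2 sequences, 3 token positions, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded position in the first sequence does not affect its pooled vector, which is the point of applying the mask before averaging.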
- Clone the repository (the app lives under djia_app):
git clone https://github.com/Jeetanand/IntradayDJIAForecast.git
- Install dependencies:
pip install -r requirements.txt
- Run the app:
streamlit run app.py