Financial Text Mining

Learning Financial Domain Word Embedding based on BERT

Reference	Title	Data source (open-sourced?)	Model Type	Evaluation Metric(s)	Time Span	Primary Research Problem	Venue
Liu et al. (2020)	FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining	English Wikipedia and BooksCorpus (general domain); financial web data such as the CommonCrawl News dataset, YahooFinance, and RedditFinanceQA (financial domain); over 61 GB of text in total (open-sourced)	BERT	Financial Sentence Boundary Detection: outperforms the baseline by 0.085 mean score. Financial Sentiment Analysis: accuracy 0.94, F1 score 0.93. Financial Question Answering: normalized discounted cumulative gain (NDCG) 0.76, mean reciprocal rank (MRR) 0.68	-/07/2013- -/12/2019	Deep learning for financial text mining is often unsuccessful because labeled training data is scarce	IJCAI-20
Yang et al. (2020)	A Pretrained BERT Model for Financial Communications	Corporate Reports 10-K & 10-Q: 2.5B tokens; Earnings Call Transcripts: 1.3B tokens; Analyst Reports: 1.1B tokens (financial domain). Available at link	BERT	-	-	FinBERT is a BERT model pre-trained on financial communication text to enhance financial NLP research and practice. It is trained on three financial communication corpora with a total size of 4.9B tokens.	arXiv paper
Araci (2019)	Financial Sentiment Analysis with BERT Resources	Two datasets are used for FinBERT. Further training of the language model is done on a subset of the Reuters TRC2 dataset. Available at link	BERT	-	-	FinBERT is a pre-trained NLP model for analyzing the sentiment of financial text. It is built by further training the BERT language model on a large financial corpus and then fine-tuning it for financial sentiment classification.	arXiv paper
McClelland et al. (1986)	In: Parallel distributed processing: Explorations in the microstructure of cognition	\	\	\	\	The first work to raise the idea of encoding knowledge using script- or frame-like representations	Book
Rumelhart et al. (1986)	Learning representations by back-propagating errors	\	NN	Difference between the actual output vector of the net and the desired output vector	\	The earliest paper to represent words as continuous vectors	Nature
Du et al. (2019)	AIG Investments.AI at the FinSBD Task: Sentence Boundary Detection through Sequence Labelling and BERT Fine-tuning	FinSBD-2019; pre-trained word embedding: glove.6B; public domain implementation	BERT, LSTM	1) F1 scores for predicting beginning (BS) and ending (ES) tokens separately; 2) the mean of the two separate F1 scores; precision; recall	\	Financial document sentence boundary detection	Proceedings of the First Workshop on Financial Technology and Natural Language Processing
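
The question-answering results in the table above are reported as NDCG and MRR. As a reading aid, the following is a minimal sketch of how these two ranking metrics are commonly computed. It assumes linear gain for NDCG (some papers use an exponential gain instead), and the function names and the example queries are hypothetical; the surveyed papers may use a different variant or cut-off.

```python
import math


def reciprocal_rank(relevances):
    """Return 1/rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(ranked_lists):
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(r) for r in ranked_lists) / len(ranked_lists)


def dcg(relevances):
    """Discounted cumulative gain with linear gain (an assumption of this sketch)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical example: one relevance list per query, in ranked order
# (1 = relevant answer, 0 = irrelevant answer).
queries = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
mrr = mean_reciprocal_rank(queries)
mean_ndcg = sum(ndcg(q) for q in queries) / len(queries)
print(f"MRR={mrr:.4f}  mean NDCG={mean_ndcg:.4f}")
```

Both metrics lie in [0, 1]. MRR only considers the first relevant answer per query, whereas NDCG takes every graded answer in the ranking into account.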

Text Mining

Reference	Title	Data source (open-sourced?)	Model Type	Evaluation Metric(s)	Time Span	Primary Research Problem	Venue
Guo et al. (2020)	Deep Semantic Compliance Advisor for Unstructured Document Compliance Checking	Stanford Natural Language Inference (SNLI) dataset (open-sourced); a real English contract dataset (NOT open-sourced)	Graph Neural Network, attention-based RNN	A legal professional needs 4+ hours to check each contract; DSCA returns the checking results with detailed comparison information in one minute.	\	Unstructured document checking, sentiment analysis	IJCAI-20
Guo et al. (2020)	IGNITE: A Minimax Game Toward Learning Individual Treatment Effects from Networked Observational Data	Semi-synthetic data created to mimic real-world situations (NOT open-sourced)	\	\	\	Learn Individual Treatment Effects (ITEs) from network information	Eco
Wang & Zhu (2020)	Interpretable Multimodal Learning for Intelligent Regulation in Online Payment Systems	WeChat Pay of Tencent (NOT open-sourced)	Attention mechanism	85.9% accuracy; triplet loss 0.01 lower than the baseline model	01/07/2019- 31/08/2019	Investigates the relationship between transactions and texts in an e-commerce system	IJCAI-20
Biesner et al. (2020)	Leveraging Contextual Text Representations for Anonymizing German Financial Documents	Bundesanzeiger (BANZ) (open-sourced)	Bi-directional Character-based Recurrent Neural Network	98.9% Precision, 0.973 Recall, 0.972 F1	\	Application for anonymizing sensitive components in financial documents	AAAI-20
Izumi et al. (2020)	Economic News Impact Analysis, Using Causal-Chain Search from Textual Data	Tokyo Stock Exchange (open-sourced)	Causal Chain Search vs. Absolute Return in Stock Market	Both related (using similarity of AR)	01/10/2012- 31/05/2018	Lists of related companies were created and the impact on their stock prices was measured for two important news items about wheat prices in 2018. Market impacts appeared in the companies related to the ripple effects when the news concerned an obvious fact.	AAAI-20
Edmiston et al. (2020)	Unsupervised Discovery of Firm-Level Variables in Earnings Call Transcript Embeddings	Compustat	SAFE - Graph Algorithm	SAFE Score	Q1-2020	Repurposes an algorithm from computational biology. Compares embedding methods across economic variables.	FinNLP-2020
Taylor & Keselj (2020)	Using Extractive Lexicon-based Sentiment Analysis to Enhance Understanding of the Impact of Non-GAAP Measures in Financial Reporting	McDonald (2019) 10-K	\	Hypothesis Test	1998-2019	First to use an extractive approach for sentiment analysis in finance	FinNLP-2020
Chen & Sarkar (2020)	A Semantic Approach to Financial Fundamentals	Stage One 10-X Parse Data	BERT	Cross-industry variation	2006-2018	Introduces the Semantically-Informed Financial Index	FinNLP-2020
Bambrick et al. (2020)	NSTM: Real-Time Query-Driven News Overview Composition at Bloomberg	Not open-sourced	NSTM	User feedback	\	Developed a novel system that composes concise, human-readable news overviews for arbitrary user search queries.	ACL-2020
Zheng et al. (2019)	Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction	Chinese Financial Announcements	Doc2EDAG	Precision, Recall, F1	2008-2018	New model to directly generate event tables. Reformalise DEE task without trigger words. New real-world dataset.	EMNLP-2019
Moreno-Sandoval et al. (2019)	Tone Analysis in Spanish Financial Reporting Narratives	ORBIS & Annual Reports	Lexicon/Rule-based	F1, Accuracy, Precision, Recall	2014-2017	First corpus of "letters to shareholders" in Spanish. Created a gold standard to evaluate opinion systems.	2019 (FNP)
Tian & Peng (2019)	Finance Document Extraction Using Data Augmentation and Attention	\	Attention-based LSTM	Weighted F1	\	Title detection using an attention-based LSTM	2019 (FNP)
Blumenthal & Graf (2019)	Utilizing Pre-Trained Word Embeddings to Learn Classification Lexicons with Little Supervision	SST-2 & FNHL	Neural Network	Accuracy	\	Presents a novel method to learn classification lexicons from a labeled text corpus that incorporates word similarities in the form of pre-trained word embeddings	2019 (FNP)
Gooding & Briscoe (2019)	Active Learning for Financial Investment Reports	All Street Research	Linear SVC	F1-Score	\	Built a classification pipeline to categorise investment-related content.	2019 (FNP)
Chen et al. (2019)	Numeracy-600K: Learning Numeracy for Detecting Exaggerated Information in Market Comments	Reuters	BiGRU, LR, CNN…	F1-Score	\	Provides a novel challenge and dataset. Sets a strong baseline.	ACL-2019
Dereli & Saraclar (2019)	Convolutional Neural Networks for Financial Text Regression	10-K Data - Tsai et al. (2016)	CNN	Spearman's Rank Correlation	\	Reduced dependency on lexicons.	ACL-2019
Sedinkina et al. (2019)	Automatic Domain Adaptation Outperforms Manual Domain Adaptation for Predicting Financial Outcomes	H4N and L&M	OLS	t-statistic, R^2	\	Automatic domain adaptation of lexicons outperforms manual.	ACL-2019
Chen et al. (2017)	NLG301 at SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs and News	SemEval-2017 Task 5	SVM	Cosine Similarity	01/01/2015 - 31/10/2016	Text Span, Ensemble.	SemEval-2017 Task 5
Chen et al. (2018)	Fine-Grained Analysis of Financial Tweets	FiQA 2018 Task 1	CNN / Bi-LSTM / CRNN	Accuracy/MSE/R2	\	Aspect, Extension Dataset	FiQA 2018 Task 1
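
Many entries in the tables above report precision, recall, and F1. The sketch below shows how the three relate for a set-based task such as boundary or span detection. The function name and the example positions are hypothetical and are not taken from any of the surveyed papers.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 from sets of gold and predicted items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: token positions marked as sentence endings.
gold_endings = {3, 9, 15, 21}
predicted_endings = {3, 9, 16}
p, r, f1 = precision_recall_f1(gold_endings, predicted_endings)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it always lies between the precision and recall values it is computed from, which is a quick sanity check when comparing reported scores.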
