PDF-mining

I was given about 600 PDF files of various formats and sources, and a CSV file bse_companies.csv that had names of about 7000 Indian companies.

Goal:

Go through each PDF file, extract names of the authors of the document.
extract the institution that authored this document, e.g., Kotak Equitiies Research, ICICI securities etc. In certain cases there will be no institution associated with the document, in that case mark it as "Others".
Extract the names of companies mentioned in each PDF.
If it is a company report, then get the broker recommendation - BUY/SELL etc
If it is a company report, get the Target Price of the stock.

Approach:

Extracted pagewise text from PDF files with pdfminer
Built Indian data of companies and names, finetuned dislim/BERT-based-NER with Indian data
Trained NER model to recognise ORG, PER, used it for Goals 1, 2, 3
Used fasttext with weighted words from text and title as seperate features and used it with unsupervised KNN clustering to identfify reports
Used regex on reports classified to get Goals 4, 5

Provide feedback