
Repository for Off-Top's Flask App and for our Machine Learning services






Repository for Off-Top's Flask App and for our Machine Learning services.

PDF Miner:

Python scripts that extract the text from PDF files with the PDF Miner library and write the collected text to .txt files.


  • Python 3 and pip3 must be installed.

  • Set up a Python virtual environment:

    • Install the virtual environment tooling: (Linux) sudo apt install python3-venv (Windows) pip3 install virtualenv
    • Create the virtual environment: python3 -m venv my-venv
    • Activate the virtual environment: (Linux/Mac) source my-venv/bin/activate (Windows) my-venv\Scripts\activate
  • Install the pdfminer package: (Linux) sudo apt-get install python-pdfminer (Windows) pip3 install pdfminer


  • In this branch, navigate to pdfminer/src. It contains two scripts:

    • A multi-file miner that takes every PDF file from a folder called pdf and writes the .txt versions to a folder called txt in the branch.
    • A single-file miner that takes one PDF file, whose name is hardcoded at the bottom of the script, and writes a .txt version of it to the same directory.

How to use:

  1. For multiple files: navigate to pdfminer/src and run the multi-file script with python3. The .txt files are generated automatically and can be found in the txt folder.

  2. For a single file: navigate to pdfminer/src and run the single-file script with python3. The script displays the content of the PDF file in the terminal and generates the .txt file in the same directory.

  • If you use the single-file miner, first hardcode the name of the PDF file you want to convert inside the script.
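
The multi-file flow described above can be sketched roughly as follows. This is a minimal illustration, not the script in pdfminer/src: the function names txt_path_for and convert_folder are hypothetical, and the sketch assumes the pdfminer.six distribution, which provides pdfminer.high_level.extract_text.

```python
from pathlib import Path


def txt_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Map pdf/report.pdf to txt/report.txt (same stem, .txt suffix)."""
    return out_dir / (pdf_path.stem + ".txt")


def convert_folder(pdf_dir: str = "pdf", out_dir: str = "txt") -> list:
    """Convert every PDF in pdf_dir and return the paths that were written."""
    # Imported lazily so the path helper above works without the library.
    # Assumption: pdfminer.six is installed (it ships high_level.extract_text).
    from pdfminer.high_level import extract_text

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        target = txt_path_for(pdf_path, out)
        # extract_text returns the whole document as one string.
        target.write_text(extract_text(str(pdf_path)), encoding="utf-8")
        written.append(target)
    return written
```

Calling convert_folder() from the directory that holds the pdf folder would write one .txt file per PDF into txt.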

Mozilla DeepSpeech:

Python script that takes a .WAV audio file and transcribes its content into a text file called output.txt.


  • Python 3 and pip3 must be installed.

  • Set up a Python virtual environment (skip to 'Activate' if you already have one set up):

    • Install the virtual environment tooling: (Linux) sudo apt install python3-venv (Windows) pip3 install virtualenv
    • Create the virtual environment: python3 -m venv my-venv
    • Activate the virtual environment: (Linux/Mac) source my-venv/bin/activate (Windows) my-venv\Scripts\activate
  • Install the PyTorch package:

    • (Linux) pip3 install torch torchvision
    • (Windows) pip3 install torch===1.4.0 torchvision===0.5.0 -f
    • (Mac) pip3 install torch torchvision
    • The macOS binaries do not support CUDA; install from source if you need CUDA.

  • Install the deepspeech package inside the activated virtual environment: (Linux) pip3 install deepspeech (Windows) pip3 install deepspeech

  • Install the sox package:

    • (Linux) sudo apt-get install -y sox
    • (Windows) download and install the sox .exe installer
    • (Mac) install Homebrew, then install sox using: brew install sox

  • Download the pre-trained English model archive with curl -LO, extract it with tar xvf deepspeech-0.6.1-models.tar.gz, and then rename the extracted models folder to "deepspeech-models".


  • In this branch, navigate to mozilla-deepspeech. It contains a script called mds, which sets the pre-trained model arguments, takes an audio file from the mozilla-deepspeech/audio folder, and writes the results to output.txt inside the same folder.

How to use:

  • Navigate to mozilla-deepspeech and run the mds script with python3. The script transcribes the audio file and generates output.txt with the transcribed content.
  • Alternatively, import the mds script elsewhere and call its transcribeData() function.
  • Check output.txt after the script runs to verify the results.
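
For orientation, the transcription step could look roughly like the sketch below. It is not the mds script: read_wav and transcribe_wav are hypothetical names, the 500 beam width is an arbitrary illustrative value, and the sketch assumes the deepspeech 0.6.x Python API (Model(model_path, beam_width) and Model.stt) together with 16-bit mono audio at 16 kHz.

```python
import wave


def read_wav(path: str):
    """Return (sample_rate, raw 16-bit PCM bytes) for a mono WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono audio; convert it with sox first")
        return wav.getframerate(), wav.readframes(wav.getnframes())


def transcribe_wav(wav_path: str, model_path: str, beam_width: int = 500,
                   output_path: str = "output.txt") -> str:
    """Transcribe wav_path and write the text to output_path."""
    # Imported lazily so read_wav works without the heavy dependencies.
    import numpy as np
    from deepspeech import Model  # assumption: deepspeech 0.6.x

    sample_rate, frames = read_wav(wav_path)
    if sample_rate != 16000:
        # Assumption: the pre-trained English model expects 16 kHz input.
        raise ValueError("resample the audio to 16 kHz with sox first")
    model = Model(model_path, beam_width)
    audio = np.frombuffer(frames, dtype=np.int16)
    text = model.stt(audio)
    with open(output_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text
```

The model_path argument would point at the model file inside the "deepspeech-models" folder.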

Stanford CoreNLP/Stanza:

Python Flask server that checks for input from a text file containing the results from Mozilla DeepSpeech and performs the following Natural Language Processing operations using the StanfordNLP library, now known as Stanza:

  • TokenizeProcessor.
  • POSProcessor.
  • LemmaProcessor.
  • NERProcessor.


  • Python 3 and pip3 must be installed.

  • Set up a Python virtual environment (skip to 'Activate' if you already have one set up):

    • Install the virtual environment tooling: (Linux) sudo apt install python3-venv (Windows) pip3 install virtualenv
    • Create the virtual environment: python3 -m venv my-venv
    • Activate the virtual environment: (Linux/Mac) source my-venv/bin/activate (Windows) my-venv\Scripts\activate
  • Install the flask package inside the activated virtual environment (all platforms): pip3 install flask

  • Install the flask_apscheduler package (all platforms): pip3 install flask_apscheduler

  • The datetime module is part of the Python standard library and needs no installation.

  • Install the stanza package inside the activated virtual environment (all platforms): pip3 install stanza

  • Download the pre-trained English model: launch the Python interactive interpreter, run import stanza, and then run stanza.download('en', '~/stanza_resources')

    • By default, Stanza stores its models in a folder in your home directory. For this project, the English model should be downloaded to off-top-python/stanfordnlp.
    • Make sure to specify your own directory (the second argument of the download command).
    • Make sure the model folder is named "stanza_resources".


  • In this branch, navigate to stanfordnlp. It contains a script that runs a Flask server. The server runs a task scheduler that checks for input from a text file every 10 seconds and displays the processed results (tokens, POS tags, NER tags) on the terminal.
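
A scheduled polling job of this kind might be wired up as in the sketch below. It is an illustration under assumptions, not the server in stanfordnlp: read_input and create_app are hypothetical names, the input file name output.txt is assumed from the DeepSpeech section, and the job only prints the text where the real server would run the NLP processors.

```python
import os


def read_input(path):
    """Return the stripped text of the input file, or None if missing or empty."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        text = handle.read().strip()
    return text or None


def create_app(input_path="output.txt", interval_seconds=10):
    """Build a Flask app whose scheduler polls input_path at a fixed interval."""
    # Imported lazily so read_input can be used without Flask installed.
    from flask import Flask
    from flask_apscheduler import APScheduler

    app = Flask(__name__)

    def poll():
        text = read_input(input_path)
        if text is not None:
            print("input:", text)  # placeholder for the NLP processing step

    scheduler = APScheduler()
    scheduler.init_app(app)
    scheduler.add_job(id="poll-input", func=poll,
                      trigger="interval", seconds=interval_seconds)
    scheduler.start()
    return app
```

Running create_app().run() would start the server and keep the polling job alive in the background.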

How to use: Navigate to stanfordnlp and run the Flask server script with python3. The server runs the processors automatically and prints the results to the terminal. Make sure the input .txt file, which is the file that Mozilla DeepSpeech generated with its results, is in the same directory.
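
As a rough illustration of the four processors listed above, a Stanza pipeline can be built and its output collected as sketched here. The helper names build_pipeline and summarize are hypothetical, and the sketch assumes that the Pipeline constructor accepts the dir keyword for the model folder, as Stanza 1.x does.

```python
def build_pipeline(model_dir="stanza_resources"):
    """Create an English pipeline with the four processors used here."""
    import stanza  # imported lazily; the models must already be downloaded

    return stanza.Pipeline(lang="en", dir=model_dir,
                           processors="tokenize,pos,lemma,ner")


def summarize(doc):
    """Collect (token, POS, lemma) triples and (entity, type) pairs."""
    words = [(word.text, word.upos, word.lemma)
             for sentence in doc.sentences
             for word in sentence.words]
    entities = [(entity.text, entity.type) for entity in doc.ents]
    return words, entities
```

With the models in place, summarize(build_pipeline()("some transcribed text")) would return the token and entity lists for that text.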

NER Training Dataset [ML] | Webscraping | Data Preprocessing:

This part deals with acquiring data from the web, then preprocessing, cleaning, and annotating it so that it is well formatted as a training dataset.


  • Python 3 and pip3 must be installed.

  • Set up a Python virtual environment (skip to 'Activate' if you already have one set up):

    • Install the virtual environment tooling: (Linux) sudo apt install python3-venv (Windows) pip3 install virtualenv
    • Create the virtual environment: python3 -m venv my-venv
    • Activate the virtual environment: (Linux/Mac) source my-venv/bin/activate (Windows) my-venv\Scripts\activate
  • Install the requests package inside the activated virtual environment:

    • (Linux) pip3 install requests
    • (Windows) pip3 install requests
    • (Mac) pip3 install requests
  • The urllib.request module is part of the Python 3 standard library and needs no installation.

  • Install the pandas package:

    • (Windows) pip3 install pandas
    • (Mac) pip3 install pandas
  • Install the beautifulsoup4 package:

    • (Linux) pip3 install beautifulsoup4
    • (Windows) pip3 install beautifulsoup4
    • (Mac) pip3 install beautifulsoup4
  • Install the re2 package:

    • (Linux) pip3 install re2
    • (Windows) pip3 install re2
    • (Mac) pip3 install re2
  • Install the nltk package:

    • (Linux) pip3 install nltk
    • (Windows) pip3 install nltk
    • (Mac) pip3 install nltk
  • The unicodedata module is part of the Python standard library and needs no installation.


  • To build a training dataset for our custom NERProcessor model, we used web scraping tools to mine terms, names, keywords, and brands related to sport and food.
  • The gathered data was not in a usable form, so we had to clean and preprocess it. The steps included dataset generation, splitting, tokenization, character removal, and filtering out punctuation.
  • The training dataset had to be in BIO format, so we annotated our data accordingly.
  • In this branch, navigate to stanfordnlp/preprocessing. It contains two folders: 'clean-data', which holds the processed and cleaned data, and 'data', which holds the raw, unprocessed scraped data. It also contains 3 Python scripts that were used to preprocess and clean the raw data.
  • Each website you scrape needs a dedicated script that fits its layout. In the same branch, navigate to stanfordnlp/webscraping to find the 12 Python scripts that were used to scrape raw data from various websites.
  • The resulting annotated data and the built training dataset can be found inside stanfordnlp/dataset.
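
To make the BIO format concrete: the first token of an entity is tagged B-LABEL, the following tokens of the same entity are tagged I-LABEL, and every other token is tagged O. The sketch below shows one simple way to produce such tags from a list of known terms. The function name bio_tag and the SPORT and FOOD labels are assumptions for illustration, not the labels or scripts used in this repository.

```python
def bio_tag(tokens, entities):
    """Tag tokens in BIO format.

    entities maps a tuple of lowercase tokens to a label,
    e.g. {("real", "madrid"): "SPORT"}.
    """
    tags = ["O"] * len(tokens)
    lowered = [token.lower() for token in tokens]
    # Try longer phrases first so a full name wins over one of its parts.
    for phrase, label in sorted(entities.items(), key=lambda kv: -len(kv[0])):
        n = len(phrase)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[i:i + n])
            if window_free and tuple(lowered[i:i + n]) == phrase:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
    return list(zip(tokens, tags))
```

For the tokens I, watched, Real, Madrid with the entity ("real", "madrid") labeled SPORT, the tags come out as O, O, B-SPORT, I-SPORT.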

How to use:

  • Navigate to stanfordnlp/preprocessing and run one of the preprocessing scripts with python3. The script runs the process, prints the results to the terminal, and creates an output file such as clean-output.txt
  • Navigate to stanfordnlp/webscraping and run one of the scraping scripts with python3. The script runs the process, prints the results to the terminal, and creates an output file such as food-scrape.txt
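
A cleaning pass of the kind a preprocessing run performs could be sketched as follows. This is an assumption-laden illustration using only the standard library: clean_line and clean_terms are hypothetical names, and the exact rules (ASCII folding, punctuation removal, case-insensitive de-duplication) are examples, not the rules of the repository's scripts.

```python
import re
import string
import unicodedata


def clean_line(line):
    """Fold accents to ASCII, drop punctuation, and collapse whitespace."""
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with "ignore" then drops the combining marks.
    ascii_text = (unicodedata.normalize("NFKD", line)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    no_punct = ascii_text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()


def clean_terms(lines):
    """Clean every line, skipping empty results and case-insensitive repeats."""
    seen = set()
    result = []
    for line in lines:
        term = clean_line(line)
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            result.append(term)
    return result
```

Writing "\n".join(clean_terms(raw_lines)) to a file would give one cleaned term per line.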

