- Upload files (PDF, text, or image) that contain Hindi text. The text is extracted with an OCR (optical character recognition) tool and stored in the database.
MongoDB is used to store the file data. For every file, it creates a new document with two fields:
1. file_name - stores the name of the file
2. content - stores the content of the file
- Search for a Hindi string using English script. The system transliterates the English-script query into Hindi script.
Transliteration is the process of converting text from one script to another.
Example: "namaste" --> "नमस्ते" (same pronunciation)
- The system then searches for the Hindi string in all documents in the database and returns the file names, matching accuracy, and matching content for the documents where the string matches best.
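
The storage and search steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `make_document`, `best_window`, and `search`, the `min_score` threshold, and the result keys `accuracy` and `match` are all hypothetical. To stay dependency-free it scores matches with the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, and it assumes the query has already been transliterated to Hindi script.

```python
from difflib import SequenceMatcher


def make_document(file_name, content):
    """Build a record with the two fields described above (hypothetical helper)."""
    return {"file_name": file_name, "content": content}


def best_window(query, content):
    """Slide a query-sized window over content; return (score 0-100, best substring)."""
    n = len(query)
    if n == 0 or not content:
        return 0, ""
    if len(content) <= n:
        score = SequenceMatcher(None, query, content).ratio()
        return round(score * 100), content
    best_score, best_text = 0.0, ""
    for start in range(len(content) - n + 1):
        window = content[start:start + n]
        score = SequenceMatcher(None, query, window).ratio()
        if score > best_score:
            best_score, best_text = score, window
    return round(best_score * 100), best_text


def search(query, documents, min_score=60):
    """Return file name, accuracy, and matching text for documents above min_score.

    min_score is an assumed cutoff chosen for illustration only.
    """
    results = []
    for doc in documents:
        score, text = best_window(query, doc["content"])
        if score >= min_score:
            results.append(
                {"file_name": doc["file_name"], "accuracy": score, "match": text}
            )
    # Best matches first.
    return sorted(results, key=lambda r: r["accuracy"], reverse=True)
```

In the real system the records would live in a MongoDB collection rather than a Python list; a dictionary shaped like the one returned by `make_document` is the kind of value pymongo's `insert_one` accepts.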
- Download and install Python
- Install Flask
- Download MongoDB and install pymongo
- Install the Python Imaging Library (PIL)
- Download and install Tesseract OCR and add its path to the system environment variables.
- Install pdf2image (you can also refer to the GeeksforGeeks article on pdf2image)
- Install the latest version of poppler, extract it into C:\Program Files, and add 'C:\Program Files\[poppler folder]\bin' to the system environment path.
- Install fuzzywuzzy
- Install Jinja2 (in Visual Studio, add the Jinja2 extension)
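
Once everything in the list above is installed, a short script along these lines can confirm that the pieces are visible to Python. It is only a sketch under a few assumptions: the import names are the usual ones for these packages (the imaging library is imported as `PIL`), the Tesseract executable is named `tesseract`, and poppler is detected through its `pdftoppm` tool, which pdf2image relies on. The function name `check_environment` is hypothetical.

```python
import importlib.util
import shutil

# Assumed import names for the Python packages listed above.
PYTHON_MODULES = ["flask", "pymongo", "PIL", "pdf2image", "fuzzywuzzy", "jinja2"]
# Executables expected on PATH: the Tesseract CLI and poppler's pdftoppm.
EXECUTABLES = ["tesseract", "pdftoppm"]


def check_environment():
    """Return a dict mapping each requirement name to True if it was found."""
    status = {}
    for name in PYTHON_MODULES:
        # find_spec returns None when a top-level module is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    for exe in EXECUTABLES:
        # shutil.which returns None when the executable is not on PATH.
        status[exe] = shutil.which(exe) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If `tesseract` or `pdftoppm` is reported as missing even though it is installed, the most likely cause is that its folder has not been added to the system path yet.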
