Text detection using Tesseract
The `pytesseract` Python module is used to detect text in the images.
Tesseract is run on this image to get the bounding boxes for the text.
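As a minimal sketch of this step, the snippet below collects word-level boxes from the dictionary that `pytesseract.image_to_data` returns. The helper names `boxes_from_data` and `detect_text_boxes` and the `min_conf` parameter are illustrative assumptions, not names taken from this repository.

```python
def boxes_from_data(data, min_conf=0.0):
    """Return (left, top, width, height) for every non-empty word in `data`.

    `data` is expected to look like the dict returned by
    pytesseract.image_to_data(image, output_type=Output.DICT).
    """
    boxes = []
    for i, text in enumerate(data["text"]):
        # Rows that are not words (pages, blocks, lines) have empty text and
        # a confidence of -1. Some versions return conf as strings, so cast.
        if text.strip() and float(data["conf"][i]) >= min_conf:
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


def detect_text_boxes(image, min_conf=0.0):
    """Run Tesseract on `image` and return the word bounding boxes."""
    # Imported here so the pure-Python helper above works without Tesseract.
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    return boxes_from_data(data, min_conf)
```

`detect_text_boxes` accepts anything `pytesseract` accepts as an image, for example a PIL image or a NumPy array.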
An option `bw` is provided. When it is `True`, the text is assumed to be black, so the image is converted to HSV and only the black color is kept before the bounding boxes are detected. Reading the text directly from this filtered image is inaccurate, however. Instead, the original image is whitened everywhere except inside these bounding boxes, and Tesseract is run again to get the text. The difference between the bounding boxes on the original image and on the HSV-filtered image is shown below.
Bounding boxes for normal image.
Bounding boxes when `bw` is set to `True`.
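The two image operations behind the `bw` option can be sketched as follows. This is an illustration under stated assumptions rather than the repository's code: the function names are hypothetical, and the brightness threshold `max_value=80` used to decide what counts as "black" is a guess that would need tuning per image.

```python
import numpy as np


def filter_black(image_bgr, max_value=80):
    """Keep only dark pixels: returns black text on a white background.

    max_value is an assumed upper bound on the HSV V channel.
    """
    import cv2  # imported lazily; only this helper needs OpenCV

    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    # Any hue and saturation, but low brightness, is treated as black.
    mask = cv2.inRange(hsv, (0, 0, 0), (180, 255, max_value))
    # mask is 255 where the pixel is dark; invert so the text is black.
    return cv2.bitwise_not(mask)


def whiten_outside_boxes(image, boxes):
    """Return a copy of `image` that is white everywhere except inside `boxes`.

    Each box is (left, top, width, height) in pixels.
    """
    out = np.full_like(image, 255)
    for left, top, width, height in boxes:
        region = (slice(top, top + height), slice(left, left + width))
        out[region] = image[region]
    return out
```

Under these assumptions, the boxes would be detected on the output of `filter_black`, and the second Tesseract pass would run on `whiten_outside_boxes(original, boxes)`.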
Use the config `-l eng --oem 1 --psm 6 -c tessedit_char_whitelist=.0123456789` to detect only numbers with Tesseract. In the experiments, psm mode 6 was observed to work better than the other psm modes for detecting numerical values. The comparison between psm modes 11 and 6 is shown below.
We can see that with psm mode `11`, the numbers 90, 60, 40 and 0 are missed (or detected as non-numeric text), whereas with psm mode `6` they are detected as numeric values.
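A minimal sketch of passing this config through `pytesseract` and turning the result into numbers is given below. The config string is the one quoted above; the helpers `numeric_config`, `parse_numbers` and `read_numbers` are hypothetical names introduced only for this example.

```python
import re


def numeric_config(psm=6):
    """Build the digits-only Tesseract config for the given psm mode."""
    return f"-l eng --oem 1 --psm {psm} -c tessedit_char_whitelist=.0123456789"


def parse_numbers(text):
    """Extract every integer or decimal token from OCR output as floats."""
    return [float(tok) for tok in re.findall(r"\d*\.\d+|\d+", text)]


def read_numbers(image, psm=6):
    """Run Tesseract with the digits-only config and return the numbers found."""
    # Imported here so the helpers above can be used without Tesseract.
    import pytesseract

    text = pytesseract.image_to_string(image, config=numeric_config(psm))
    return parse_numbers(text)
```

Calling `read_numbers(image, psm=11)` and `read_numbers(image, psm=6)` on the same image is one way to reproduce the kind of comparison described above.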