Text detection using Tesseract

Text detection

The pytesseract Python module is used to detect text in the images. Tesseract is first run on the image to get the bounding boxes of the text. A bw option is provided. When it is set to True, the image is converted to HSV and only the black color is kept, since the text is expected to be black. Text read directly from this filtered image is inaccurate, however. Hence the original image is whitened everywhere except inside these bounding boxes, and Tesseract is run again on the result to get the text. The difference between the bounding boxes on the original image and on the HSV-filtered image is shown below.

Bounding boxes for normal image.

Bounding boxes when bw is set to True.
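
The two-pass procedure described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names (boxes_from_data, whiten_outside, detect_text), the min_conf parameter, and the HSV value threshold of 100 used to decide what counts as "black" are all assumptions. It also assumes the whitening pass is applied whether or not bw is set, and that OpenCV and pytesseract are installed.

```python
def boxes_from_data(data, min_conf=0.0):
    """Convert the dict returned by pytesseract.image_to_data(...,
    output_type=Output.DICT) into a list of (x, y, w, h) word boxes."""
    boxes = []
    for i, text in enumerate(data["text"]):
        # Layout-only rows have empty text and a confidence of -1; skip them.
        if text.strip() and float(data["conf"][i]) >= min_conf:
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


def whiten_outside(image, boxes):
    """Return a copy of `image` (a NumPy array) that is white everywhere
    except inside the given boxes."""
    out = image.copy()
    out[:] = 255
    for x, y, w, h in boxes:
        out[y:y + h, x:x + w] = image[y:y + h, x:x + w]
    return out


def detect_text(path, bw=False):
    # Imported here so the helpers above can be used without OpenCV/Tesseract.
    import cv2
    import pytesseract
    from pytesseract import Output

    image = cv2.imread(path)
    source = image
    if bw:
        # Keep only dark pixels (low V in HSV). The threshold is an assumption.
        hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, (0, 0, 0), (180, 255, 100))
        source = cv2.bitwise_not(mask)  # black text on a white background

    # First pass: bounding boxes only.
    data = pytesseract.image_to_data(source, output_type=Output.DICT)
    boxes = boxes_from_data(data)

    # Second pass: read the text from the original pixels inside the boxes.
    cleaned = whiten_outside(image, boxes)
    return boxes, pytesseract.image_to_string(cleaned)
```

Calling detect_text("chart.png", bw=True) (with a hypothetical file name) would return the boxes found on the filtered image together with the text read from the whitened original.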

Number detection (using tesseract)

To detect only numbers with Tesseract, use the config -l eng --oem 1 --psm 6 -c tessedit_char_whitelist=.0123456789. In our experiments, psm mode 6 worked better than the other psm modes for detecting numerical values. The comparison between psm modes 11 and 6 is shown below.

We can see that, with psm mode 11, the numbers 90, 60, 40 and 0 are missed (or detected as non-numeric text), whereas with psm mode 6, these are detected as numeric values.
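
As a rough illustration of how this config might be passed to pytesseract, here is a small sketch. The helper names (number_config, numeric_tokens, detect_numbers) and the post-filtering regex are assumptions added for the example; only the config string itself comes from the text above.

```python
import re

# Matches plain decimals such as "90", "0", "1.5" or ".5" (an assumed filter).
NUMERIC = re.compile(r"\d+(?:\.\d+)?|\.\d+")


def number_config(psm=6):
    """Build the Tesseract config string for digits-only recognition."""
    return f"-l eng --oem 1 --psm {psm} -c tessedit_char_whitelist=.0123456789"


def numeric_tokens(ocr_text):
    """Keep only the whitespace-separated tokens that are well-formed numbers."""
    return [float(tok) for tok in ocr_text.split() if NUMERIC.fullmatch(tok)]


def detect_numbers(image, psm=6):
    # Imported here so the helpers above can be used without Tesseract installed.
    import pytesseract

    text = pytesseract.image_to_string(image, config=number_config(psm))
    return numeric_tokens(text)
```

Running detect_numbers on the same image with psm=11 and psm=6 is one way to reproduce the kind of comparison described above.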
