Python Tesseract 4.0 OCR: Recognize only Numbers / Digits and exclude all other Characters
Google's Tesseract (originally developed at HP) is one of the most popular free Optical Character Recognition (OCR) engines out there. It can be used from several programming languages because many wrappers exist for this project. pytesseract, which the code in this post uses, is an example of a Python wrapper for Tesseract.
The "get numbers only" problem
One day, I wanted to build a small Python program that recognizes only the numbers in an image and ignores everything else: spaces, letters, special characters and so on. After installing Tesseract 4.0 from my Ubuntu distribution's repository and playing around a bit, I couldn't find a way to extract only the digits from my image.
After some googling I found the cause in a GitHub issue: up to Tesseract 3, the option tessedit_char_whitelist was supported, which lets you define a whitelist of allowed characters. This feature is sadly missing in Tesseract 4.0 (more precisely, its new default LSTM engine ignores the option). So how do you recognize only numbers from an image in Python with Tesseract?
Solution 1: Update Tesseract
The first and simplest solution is to upgrade Tesseract to version 4.1 or later, because the missing feature was added back in version 4.1 (see this comment). The only problem is that currently (March 2020) you have to install version 4.1 manually (by building from source or adding a PPA), because the Ubuntu 18.04 package sources only provide version 4.0.0-beta.1.
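To check whether an installed Tesseract honours the whitelist before relying on it, something like the following sketch could be used. The helper names (parse_major_minor, whitelist_supported, read_digits) are made up for illustration, and the version rule simply encodes the claim above: the option works up to Tesseract 3 and again from 4.1 onwards. pytesseract and Pillow are imported inside read_digits so the version helpers can be used on their own.

```python
import re

# Tesseract option that restricts the output to the ten digits
DIGITS_ONLY_CONFIG = "-c tessedit_char_whitelist=0123456789"


def parse_major_minor(version_text):
    """Extract (major, minor) from a version string such as '4.0.0-beta.1'."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no version number found in %r" % version_text)
    return int(match.group(1)), int(match.group(2))


def whitelist_supported(version_text):
    """Assumption taken from the article: only the 4.0 series lacks the whitelist."""
    major, minor = parse_major_minor(version_text)
    return major < 4 or (major, minor) >= (4, 1)


def read_digits(image_path):
    """Run OCR with the digit whitelist, refusing to run on Tesseract 4.0."""
    # Imported here so the helpers above work without OCR libraries installed
    import pytesseract
    from PIL import Image

    version_text = str(pytesseract.get_tesseract_version())
    if not whitelist_supported(version_text):
        raise RuntimeError("Tesseract %s ignores the whitelist" % version_text)
    image = Image.open(image_path)
    return pytesseract.image_to_string(image, config=DIGITS_ONLY_CONFIG).strip()
```

Calling read_digits('image.tif') would then either return the recognized digits or fail loudly on a 4.0 installation, instead of silently returning letters as well.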
Solution 2: Use the old Tesseract engine (Legacy mode)
A dirty workaround is to use the Legacy mode that is built into Tesseract 4.0 and runs the old recognition engine. You have to add the --oem 0 flag for this. Then the tessedit_char_whitelist option works again and can be used to filter for numbers only: -c tessedit_char_whitelist=0123456789.
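From Python, these flags can be passed through the config argument of pytesseract's image_to_string. The following is only a sketch: build_config and read_digits_legacy are hypothetical helper names, and it assumes that the installed language data actually contains the legacy model (as far as I know, the LSTM-only "fast" and "best" data files do not, in which case --oem 0 fails).

```python
def build_config(whitelist, legacy=False):
    """Assemble the Tesseract command-line options as a single string."""
    parts = []
    if legacy:
        # --oem 0 selects the legacy (pre-LSTM) recognition engine
        parts.append("--oem 0")
    parts.append("-c tessedit_char_whitelist=" + whitelist)
    return " ".join(parts)


def read_digits_legacy(image_path):
    """Run OCR in Legacy mode with a digits-only whitelist."""
    # Imported here so build_config can be used without OCR libraries installed
    import pytesseract
    from PIL import Image

    config = build_config("0123456789", legacy=True)
    return pytesseract.image_to_string(Image.open(image_path), config=config).strip()
```

Keeping the option string in one small function makes it easy to drop the --oem 0 part again once a newer Tesseract version is available.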
Solution 3: Use a Python function (my suggestion)
If you are annoyed by the versioning, why not filter in Python instead of juggling several flags and versions?
The idea:
Just use pytesseract's image_to_string(...) function to recognize all characters and pass the resulting string to a Python function that removes every non-numeric character.
The result:
The whole Python code that outputs only the number in image.tif looks like this:
# default imports
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
import re


def replace_chars(text):
    """
    Removes every character that is not a digit from 'text'.
    :param text: Text string to be filtered
    :return: Resulting number as a string of digits
    """
    # Collect every run of consecutive digits, e.g. ['12', '345']
    list_of_numbers = re.findall(r'\d+', text)
    # Concatenate the runs into one string
    result_number = ''.join(list_of_numbers)
    return result_number


# Run OCR on the image, then keep only the digits
ocr_result = pytesseract.image_to_string(Image.open('image.tif'), lang='eng')
print(replace_chars(ocr_result))
The replace_chars function uses a pretty simple regex to extract all numbers from the input string. Every block of consecutive digits is placed into a Python list. Then, this list is joined into a single string. You can modify this if you need the individual numbers or if you are only interested in specific blocks. At the end of the code, the above-mentioned image_to_string function performs the actual recognition.
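As an illustration of such modifications, here is a small sketch of three variants. The function names are my own and not part of any library; the sample string is invented.

```python
import re


def number_blocks(text):
    """Return every run of consecutive digits as a separate list entry."""
    return re.findall(r"\d+", text)


def first_number(text):
    """Return only the first digit block, or None if the text has no digits."""
    blocks = number_blocks(text)
    return blocks[0] if blocks else None


def digits_only(text):
    """Strip every non-digit character; same result as joining all blocks."""
    return re.sub(r"\D", "", text)


sample = "Invoice 2020-03: total 417 EUR"
print(number_blocks(sample))  # ['2020', '03', '417']
print(first_number(sample))   # 2020
print(digits_only(sample))    # 202003417
```

One thing to keep in mind: with Python 3 strings, \d also matches digits from other scripts. If only the ASCII digits 0-9 should survive, use [0-9] in the pattern or pass the re.ASCII flag.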
To conclude, I like solution 3 the most because it isn't dependent on the Tesseract version and you can modify the function according to your own needs.