In Part 1, we saw how to extract text from a textbook page image and from a noisy image using Tesseract.
In this part, we will see how to extract text from photographs.
Let's try a new example and bring together some of the things we have learned.
Here's an image of a storefront; let's load it and try to get the name of the store out of the image.
from PIL import Image
import pytesseract
# Let's read in the storefront image I've loaded into the course and display it
image=Image.open('../input/OCR1/storefront.png')
display(image)
# Now, let's run Tesseract on that image and see what the results are
pytesseract.image_to_string(image)
'fa | INTERNATIONAL\n\nEe oat\n\n \n\nae\n\n| bile\n\n-_\nS =\nE “ee —\n.\n\n| pe 1 800 GO DRAKE PTV Cheol i i\n\noes\n\n \n\nK iM he ie'
Looking at the output, we see a jumble of characters that we cannot easily make sense of. Tesseract is unable to take the full photograph and pull out the store name. But if we crop the image down to the region where we expect to see text, Tesseract should be able to identify it. So let's help Tesseract by cropping out certain pieces.
First, let's set the bounding box. In this image, the store name is inside this box.
bounding_box=(470, 150, 1020, 320)
# Now let's crop the image
title_image=image.crop(bounding_box)
# Now let's display it and pull out the text
display(title_image)
pytesseract.image_to_string(title_image)
'DRAKE\n\nINTERNATIONAL'
Great, we see how a bit of problem reduction can make this work. We have now been able to take an image, preprocess it by cropping to where we expect to see text, and turn that text into a string that Python can understand.
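Since we will be repeating this crop-then-OCR pattern, we could wrap it in a small helper. Here's a minimal sketch; the function name ocr_region is my own, not part of pytesseract:
def ocr_region(image, box):
    # Crop the image to the (left, upper, right, lower) box and run OCR on just that region
    region = image.crop(box)
    return pytesseract.image_to_string(region)

# For example, the store name box we used above:
# ocr_region(image, (470, 150, 1020, 320))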
If you look back up at the image, though, you'll see there is a small sign outside of the shop that has the shop's website on it. I wonder if we're able to recognize the text on that sign? Let's give it a try.
First, we need to determine a bounding box for that sign. For that, let's just use the bounding box I decided on.
# Now, let's crop the image
little_sign=image.crop((1000, 548, 1215, 690))
display(little_sign)
All right, that is a little sign! OCR works better with higher-resolution images, so let's increase the size of this image using Pillow's resize() function.
Let's set the width and height equal to five times their current size, in a (w, h) tuple.
new_size=(little_sign.width*5,little_sign.height*5)
display(little_sign.resize( new_size, Image.NEAREST))
pytesseract.image_to_string(little_sign.resize( new_size, Image.NEAREST))
'DRAKEINTL.COM'
With the increased size, we are able to extract the text.
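If we didn't know ahead of time how much to enlarge a sign, one approach is to try a few scale factors and compare what Tesseract returns at each one. Here's a rough sketch, with scale values I picked arbitrarily:
# Try a few enlargement factors and print what Tesseract finds at each one
for scale in (2, 5, 10):
    enlarged = little_sign.resize((little_sign.width*scale, little_sign.height*scale), Image.NEAREST)
    print(scale, repr(pytesseract.image_to_string(enlarged).strip()))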
Looking back at the image once more, there is another small sign outside the shop that carries the store's slogan. I wonder if we're able to recognize the text on that sign too? Let's give it a try.
We will crop the image again, this time around the slogan sign.
little_sign=image.crop((570, 490, 690, 720))
display(little_sign)
We will resize this image for a better view.
new_size=(little_sign.width*5,little_sign.height*5)
display(little_sign.resize( new_size, Image.NEAREST))
pytesseract.image_to_string(little_sign.resize( new_size, Image.NEAREST))
'eel\ner D\n\nLe ee de\nWITH ONE OF\n\nOUR CONSULTANTS\nTODAY\n\nRring'
I can read the text, but it looks really pixelated, so I think we should be able to do better. Let's see what all the different resize options look like.
options=[Image.NEAREST, Image.BOX, Image.BILINEAR, Image.HAMMING, Image.BICUBIC, Image.LANCZOS]
for option in options:
    # let's print the option name
    print(option)
    # let's display what this option looks like on our little sign
    display(little_sign.resize( new_size, option))
0
4
2
5
3
1
From this, we can notice two things. First, when we print out one of the resampling values, it actually just prints an integer! This is really common: an API developer defines a property such as Image.BICUBIC and assigns it to an integer value to pass around. Some languages use enumerations of values, which is common in, say, Java, but in Python this is a pretty normal way of doing things.
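We can check which integer corresponds to which filter by looking the constants up by name. (This assumes an older Pillow release where these filters are plain integer constants; newer Pillow versions expose them through the Image.Resampling enumeration instead.)
# Map each filter name to its integer value so the numbers printed above make sense
for name in ['NEAREST', 'BOX', 'BILINEAR', 'BICUBIC', 'HAMMING', 'LANCZOS']:
    print(name, getattr(Image, name))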
The second thing we learned is that there are a number of different algorithms for image resampling. In this case, the Image.BICUBIC filter does a good job. Let's see if we are able to recognize the text off of this resized image.
# First let's resize to the larger size
bigger_sign=little_sign.resize(new_size, Image.BICUBIC)
# Let's print out the text
pytesseract.image_to_string(bigger_sign)
'noua\nCOME\n\n IN AND SPEAK \nWITH ONE OF\n\nOUR CONSULTANTS\n TODAYwees'
Well, not perfect, but we can see the sentence.
This is not a bad way to clean up OCR data. In practice, it can be useful to check the output against a language- or domain-specific dictionary, especially if you are building a search engine over specialized text such as a medical knowledge base or place names. And if you scroll up and look at the data we were working with - this small board hanging outside of the store - the result is not so bad.
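As a sketch of that idea, we could keep only the OCR'd words that appear in a small dictionary. The word list below is made up for illustration; a real application would use a proper vocabulary:
# A hypothetical mini-dictionary of words we expect on this sign
known_words = {'come', 'in', 'and', 'speak', 'with', 'one', 'of', 'our', 'consultants', 'today'}
# Keep only the OCR'd words that survive a dictionary check
words = pytesseract.image_to_string(bigger_sign).split()
cleaned = [w for w in words if w.strip('.,').lower() in known_words]
print(' '.join(cleaned))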
At this point, you've now learned how to manipulate images and convert them into text.