shwetasainiksr

How to Extract Text from an Image Using Selenium WebDriver - Page Object Model

Introduction:

WebDriver does not support the functionality of extracting text from an image. This task can be achieved by using Tesseract, an OCR (Optical Character Recognition) engine that can recognize and extract text from images, handwritten notes, PDFs, etc., into machine-readable text format.

Tesseract is an open-source text recognition engine whose development has been sponsored by Google. It supports more than 100 languages, including ideographic and right-to-left languages.

 

Let’s take an example. We want to extract the text from the image given below:


 

Step 1: Set Up Tesseract OCR:


·         Download the eng.traineddata file from https://github.com/tesseract-ocr/tessdata/blob/main/eng.traineddata

The eng.traineddata file is part of Tesseract OCR. It contains language-specific data needed for text recognition.


o   Create a folder named “tessdata” and store the eng.traineddata file in it, as shown below.




o   Configure the Tesseract path in your system environment variables.

  • Open the Start menu, search for "Environment Variables," and select "Edit the system environment variables."

  • In the System Properties window, click the "Environment Variables" button.

  • In the Environment Variables window, find the "Path" variable in the "System variables" section and select it. Click "Edit."

  • In the Edit Environment Variable window, click "New" and add the path to the Tesseract executable, then add the path to the “tessdata” folder where you stored the “eng.traineddata” file. For example: C:\Git\..\..\tessdata

  • Click "OK" to close all windows.
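Before moving on, it can help to confirm that the folder really contains the trained data file. The snippet below is a minimal sketch of such a check using only the Java standard library. The class name `TessdataCheck` and the default folder name `tessdata` are assumptions made for illustration; pass your own folder path as the first argument.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: confirms a tessdata folder holds the trained data for a language. */
class TessdataCheck {

    /** Builds the expected file path, for example tessdata/eng.traineddata. */
    static Path trainedDataFile(Path tessdataDir, String language) {
        return tessdataDir.resolve(language + ".traineddata");
    }

    /** Returns true only if the trained data file exists as a regular file. */
    static boolean hasTrainedData(Path tessdataDir, String language) {
        return Files.isRegularFile(trainedDataFile(tessdataDir, language));
    }

    public static void main(String[] args) {
        // Assumption: the folder is passed as the first argument;
        // "tessdata" is only a placeholder default.
        Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String status = hasTrainedData(dir, "eng") ? " found" : " is missing";
        System.out.println(trainedDataFile(dir, "eng").toAbsolutePath() + status);
    }
}
```

If the file is reported as missing, re-check the folder name and the spelling of eng.traineddata before running any OCR code.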



Step 2: Add Tesseract and Selenium dependencies or .jar files:

If you are using Maven, ensure that the required dependencies are in your pom.xml:

Tess4J is a Java wrapper for the Tesseract APIs that provides OCR support for various image formats like JPEG, GIF, PNG, and BMP.



Asprise OCR allows you to perform OCR and barcode recognition on images (JPEG, PNG, TIFF, PDF, etc.) and output the results as plain text, XML, searchable PDF, or editable RTF.


OR


Download the following files.


1. Java API (.jar file)




2. Asprise OCR files



·         Extract all the files after downloading them, then copy and paste them into a single folder named libs in your workspace.

·         Right-click on the project ---> Build Path ---> Configure Build Path ---> click Add JARs ---> open the libs folder and select all the jar files.

·         You should be able to see all jar files added to the build path.
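A quick way to double-check the build path from code is to ask the class loader whether the library classes can be found. The following is a minimal sketch, not part of either library: `BuildPathCheck` is a hypothetical helper, and the two class names it looks for (`net.sourceforge.tess4j.Tesseract` from Tess4J and `org.openqa.selenium.WebDriver` from Selenium) are chosen here as representative entry points.

```java
/** Hypothetical helper: reports whether library classes are visible on the classpath. */
class BuildPathCheck {

    /** Returns true if the named class can be located without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, BuildPathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Representative classes from the libraries added in this step.
        String[] required = {
            "net.sourceforge.tess4j.Tesseract",
            "org.openqa.selenium.WebDriver"
        };
        for (String name : required) {
            System.out.println(name + (isOnClasspath(name) ? " : available" : " : NOT found"));
        }
    }
}
```

Run it from inside the same project so that it sees the same build path as your test code.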



Step 3: Write Code



Console Output




Explanation


o   Sets up the WebDriver and navigates to the desired URL.

o   Uses Selenium WebDriver to find an element on the web page.

o   Reads the ‘src’ attribute of that element, which contains the URL of the image.

o   Creates a URL object using the image URL, allowing us to work with the image as a resource.

o   Creates a buffered image from the image URL.

o   Initializes a Tesseract instance.

o   Uses Tesseract to perform OCR on the buffered image and prints the extracted text.
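The image-handling steps listed above can be sketched with the Java standard library alone. This is an illustrative sketch under stated assumptions, not the original code of this post: `ImageTextExtractor` and its `OcrEngine` interface are hypothetical names, and the OCR engine is kept behind that interface so the example runs even without Tess4J on the classpath.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import javax.imageio.ImageIO;

/** Hypothetical helper covering the URL, buffered image, and OCR steps. */
class ImageTextExtractor {

    /** Stand-in for the OCR engine; a hypothetical interface, not part of Tess4J. */
    interface OcrEngine {
        String recognize(BufferedImage image) throws Exception;
    }

    private final OcrEngine engine;

    ImageTextExtractor(OcrEngine engine) {
        this.engine = engine;
    }

    /** Creates a URL object from the value of the element's 'src' attribute. */
    static URL toUrl(String src) {
        try {
            return URI.create(src).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid image URL: " + src, e);
        }
    }

    /** Creates a buffered image from the image URL. */
    static BufferedImage download(URL imageUrl) throws IOException {
        BufferedImage image = ImageIO.read(imageUrl);
        if (image == null) {
            // ImageIO.read returns null when no registered reader understands the data.
            throw new IOException("Unsupported image format: " + imageUrl);
        }
        return image;
    }

    /** Runs the OCR engine on the downloaded image and trims the result. */
    String extractText(String src) throws Exception {
        return engine.recognize(download(toUrl(src))).trim();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("Usage: java ImageTextExtractor <image-url>");
            return;
        }
        // Placeholder engine so the sketch runs without an OCR library:
        // it reports the image size instead of recognizing text.
        OcrEngine placeholder =
                image -> image.getWidth() + "x" + image.getHeight() + " image loaded";
        System.out.println(new ImageTextExtractor(placeholder).extractText(args[0]));
    }
}
```

In a real test, the `src` string would come from Selenium, for example `driver.findElement(...).getAttribute("src")`, and the engine could be supplied as `image -> tesseract.doOCR(image)`, where `tesseract` is a `net.sourceforge.tess4j.Tesseract` instance whose `setDatapath` points at your tessdata folder. Keeping the engine behind a small interface also fits the Page Object Model, because the page class only has to expose the image's `src` value.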



Notes


  • Ensure that the paths to chromedriver and tessdata are set correctly.

  • Tesseract requires its trained data files (typically .traineddata files) to be located in the tessdata directory. This directory is usually part of the Tesseract installation.

 

 I hope this blog will help you. Happy Learning!
