At Eradicate Diabetes, we process 1,200 to 1,500 blood reports a day. A dietitian needs about 10 minutes to read each report and manually enter the blood marker values into the database. That adds up to roughly 200 to 250 hours of manual work per day, or close to 100,000 hours per year. On top of that, manual reading and data entry had an error rate of 15-20%.
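As a quick check of the workload arithmetic, here is a minimal sketch. The report counts and the 10 minutes per report come from the paragraph above; the assumption that reports arrive on all 365 days of the year is mine, as is the helper name manual_hours.

```python
# Figures from the paragraph above; DAYS_PER_YEAR is an assumption.
REPORTS_PER_DAY = (1200, 1500)
MINUTES_PER_REPORT = 10
DAYS_PER_YEAR = 365  # assumption: reports arrive every day of the year


def manual_hours(reports_per_day, minutes_per_report=MINUTES_PER_REPORT):
    """Return the hours spent per day reading reports by hand."""
    return reports_per_day * minutes_per_report / 60


# Lower and upper bound of the daily and yearly manual effort.
daily_hours = [manual_hours(n) for n in REPORTS_PER_DAY]
yearly_hours = [hours * DAYS_PER_YEAR for hours in daily_hours]

print("Hours per day:", daily_hours)
print("Hours per year:", yearly_hours)
```

Under these assumptions the daily effort is 200 to 250 hours, and the upper end of the yearly effort comes out a little above 90,000 hours.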
We set out to address the following specific challenges:
Extracting information from handwritten images and handwritten text is difficult because handwriting varies so much from person to person.
Documents such as blood reports and prescriptions arrive in different formats (PDF and various image formats), and the text has to be read from all of them.
The extracted data has to be readily available to users in a precise, consistent format.
We therefore explored solutions to these challenges by applying computer vision technology to read and extract text from reports and handwritten documents.
Of the many solutions available, we chose to try the Textract service provided by Amazon Web Services (AWS).
Let's understand what Textract is.
Amazon Textract is an AWS service that automatically extracts text from scanned documents and handwritten images.
It provides operations that only detect text, and operations that analyze text to discover richer relationships, such as form data and tables.
Many companies today struggle with extracting data from PDF documents, tables, and handwritten information. Textract can read and extract this information automatically with reasonably good results.
Implementing Textract to read text from images
Amazon Textract detects and analyzes text in documents and converts it into machine-readable text.
About the Boto3 library: Boto3 is the AWS SDK for Python. You use it to create, configure, and manage AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). The SDK provides an object-oriented API as well as low-level access to AWS services.
To use the Textract service, follow the steps below:
!pip install boto3  # install the AWS SDK for Python (notebook syntax; in a shell, drop the leading "!")
import boto3  # AWS SDK for Python
client = boto3.client('textract')  # create a low-level client representing Amazon Textract
import boto3.session
# The boto3 module acts as a proxy to a default session, which is created automatically when a client is first requested. A session can also be created explicitly:
my_session = boto3.session.Session()
# For a successful connection to AWS, set the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION.
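Before creating the client, it can help to confirm that the connection settings are actually present. The following is a minimal sketch, not part of Boto3: the helper name missing_aws_settings is hypothetical, and it only looks at the three standard environment variables. Boto3 can also pick up credentials from other places, such as the shared credentials file, so a missing variable here is a hint rather than a definite error.

```python
import os

# Standard environment variable names that Boto3 checks for credentials and region.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_settings(environ=None):
    """Return the names in REQUIRED_VARS that are unset or empty.

    Hypothetical helper for illustration; pass a dict to check something
    other than the real process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_settings()
    if missing:
        print("Not set in the environment:", ", ".join(missing))
    else:
        print("All three settings are present in the environment.")
```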
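Another way to supply the credentials is to pass them directly when creating the client:
client = boto3.client('textract', aws_access_key_id="", aws_secret_access_key="", region_name="")
# Fill in your own key values and region; this creates a Textract client that uses those credentials. Avoid committing real keys to source control.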
documentName = r"C:/Users//Desktop/project/Example.jpg"  # path of the document stored on your system
with open(documentName, 'rb') as document:  # open the file in binary read mode
    imageBytes = bytearray(document.read())  # read the whole file as bytes
response = client.detect_document_text(Document={'Bytes': imageBytes})  # detect text in the input document
Amazon Textract can detect lines of text and the words that make up each line. The input document here is an image in JPEG or PNG format.
DetectDocumentText returns the detected text as an array of Block objects.
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":  # BlockType tells whether the block is a line of text (LINE) or a word (WORD)
        print(item["Text"])
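The steps above can be wrapped into small reusable functions. This is a sketch under a few assumptions: the function names read_image_bytes, extract_lines, and detect_lines are hypothetical, the client is passed in so the parsing can be tried without an AWS connection, and the optional min_confidence filter relies on the Confidence score (0 to 100) that Textract attaches to each block.

```python
def read_image_bytes(path):
    """Read a local image file and return its raw bytes."""
    with open(path, "rb") as document:
        return document.read()


def extract_lines(response, min_confidence=0.0):
    """Return the text of every LINE block at or above min_confidence."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        if block.get("Confidence", 100.0) < min_confidence:
            continue  # drop lines Textract is less sure about
        lines.append(block.get("Text", ""))
    return lines


def detect_lines(client, path, min_confidence=0.0):
    """Send one image to Textract and return its detected lines of text.

    `client` is expected to be a Textract client, e.g. boto3.client('textract').
    """
    response = client.detect_document_text(Document={"Bytes": read_image_bytes(path)})
    return extract_lines(response, min_confidence)
```

With a real client, detect_lines(client, documentName, min_confidence=90) would return only the lines Textract scored at 90 or higher, which may help flag low-confidence readings for manual review.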
While analyzing the Textract implementation, I came across two methods we can use for text extraction: detect_document_text(), seen above, and analyze_document().
The images below are examples used for text extraction.
Image 1 (handwritten): extracted text
Let's see how the Analyze Document method helps with text extraction. To analyze text in a document, call Analyze Document and pass a document file as input.
1. Analyze Document returns a JSON structure that contains the analyzed text, including form data as key-value pairs.
2. Each key-value pair is returned as two Block objects of type KEY_VALUE_SET: a KEY Block object and a VALUE Block object.
3. A TABLE Block object contains data about a detected table, and a CELL Block object is returned for each cell in the table.
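To make the KEY and VALUE blocks concrete, here is a minimal sketch of how a response could be turned into a plain dictionary. The function names are hypothetical. It assumes the usual response layout: a KEY block points to its VALUE block through a relationship of type VALUE, and both point to their WORD blocks through relationships of type CHILD.

```python
def _text_of(block, blocks_by_id):
    """Join the text of a block's CHILD words (and selection marks)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])  # checkbox state
    return " ".join(parts)


def extract_key_values(response):
    """Return {key text: value text} for every KEY block in the response."""
    blocks_by_id = {b["Id"]: b for b in response.get("Blocks", [])}
    pairs = {}
    for block in blocks_by_id.values():
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(_text_of(blocks_by_id[value_id], blocks_by_id))
        # Note: a repeated key text overwrites the earlier entry.
        pairs[_text_of(block, blocks_by_id)] = " ".join(value_parts)
    return pairs
```

For a blood report, the resulting dictionary would map each marker label to the value printed next to it, which is close to the shape needed for a database update.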
Image 2: Text extracted from a blood cell report
We can choose which types of analysis to perform by specifying the FeatureTypes list (for example TABLES and FORMS). With just a few Textract operations, we can extract text from the provided images and documents, including text laid out in tables.
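As an illustration of requesting table analysis and reading the result, here is a sketch under stated assumptions. The function names are hypothetical and the client is passed in as a parameter. It assumes a TABLE block lists its CELL blocks as CHILD relationships, and that each CELL carries 1-based RowIndex and ColumnIndex fields plus CHILD links to its WORD blocks. Merged cells are not handled.

```python
def analyze_tables_and_forms(client, image_bytes):
    """Call AnalyzeDocument asking for both table and form analysis.

    `client` is expected to be a Textract client, e.g. boto3.client('textract').
    """
    return client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES", "FORMS"],
    )


def extract_tables(response):
    """Return each detected table as a list of rows of cell text."""
    blocks_by_id = {b["Id"]: b for b in response.get("Blocks", [])}

    def child_ids(block):
        return [i for rel in block.get("Relationships", [])
                if rel["Type"] == "CHILD" for i in rel["Ids"]]

    tables = []
    for block in blocks_by_id.values():
        if block["BlockType"] != "TABLE":
            continue
        cells = {}  # (row, column) -> cell text
        for cell_id in child_ids(block):
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [blocks_by_id[w]["Text"] for w in child_ids(cell)
                     if blocks_by_id[w]["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables
```

Each table comes back as a list of rows, so the first row of a blood report table would typically hold the column headers and the remaining rows the marker names and values.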
I hope these details help in understanding how to extract text from documents that come in various formats.
Happy Learning!