OCR with Machine Learning

Optical Character Recognition (OCR) is the process of converting images of text into machine-readable text. OCR software opens a digital image, e.g., a TIFF file containing printed characters, attempts to recognize the characters, and saves the result as a full-text file. The process is fast and can be automated, which makes it possible to convert millions of images into full-text files that can then be searched by word or character. This makes OCR a useful and cost-efficient tool for large-scale digitization projects for text-based materials, including books, journals, and newspapers. There are several OCR software packages on the market; a popular package for older material or for material in languages other than English is ABBYY FineReader, which is used by several newspaper digitization projects internationally.

Machine learning enables the automatic extraction and interpretation of text from images or scanned documents. Models are trained on large datasets of images and their corresponding text labels so that they learn to recognize and transcribe characters accurately. Before recognition, OCR systems apply a combination of image processing techniques such as noise reduction, image enhancement, and segmentation, which help isolate individual characters or words within an image. The extracted text then undergoes further processing to improve accuracy and to cope with varying fonts, sizes, and orientations.
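As a rough illustration of the segmentation step mentioned above, the sketch below splits a tiny binary image into character regions using a column projection profile: columns that contain no ink are treated as gaps between characters. This is one simple classical technique, not necessarily what any particular OCR engine does; the function names and the toy image are made up for the example.

```python
def column_profile(binary):
    """Count foreground (1) pixels in each column of a binary image."""
    return [sum(col) for col in zip(*binary)]


def segment_columns(binary):
    """Return (start, end) column ranges that contain ink; end is exclusive."""
    profile = column_profile(binary)
    segments = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a new character region begins
        elif count == 0 and start is not None:
            segments.append((start, x))    # an empty column closes the region
            start = None
    if start is not None:                  # region that touches the right edge
        segments.append((start, len(profile)))
    return segments


# Toy image (1 = ink, 0 = background): two glyphs separated by an empty column.
toy = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
]
print(segment_columns(toy))  # [(0, 2), (3, 4)]
```

Real pages usually need more than this (touching characters, skew, and noise all break the simple "empty column" assumption), which is part of why modern OCR relies on trained models.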

The quality of OCR output depends on a number of factors, and these factors influence the results considerably. Experience to date has shown that running OCR software on good-quality, clean images (e.g., a newly created PDF file) gives excellent results: most characters are recognized correctly, which leads to successful word searching and retrieval. On older materials, e.g., old books and newspapers, OCR accuracy is extremely variable, and for this reason some projects advocate re-keying the text from scratch rather than attempting OCR. Re-keying is labor-intensive, so a project will sometimes combine re-keying and OCR. It is usual to run sample tests on the actual source material to be digitized before making decisions about OCR and re-keying.

OCR can save you time and effort when extracting text from images, since you no longer have to type the whole text yourself.

There are some factors you should pay attention to:

The quality of the image and of the written content
The font size
Whether the text can be separated from the background
Whether the text is skewed or distorted
The size of the image
The quality of the lighting

ocr.space

ocr.space is an OCR engine that offers a free API, so it does nearly all the work of text detection for us. We only need to send an image containing the text we want to scan to the API, and it returns the recognized text.

First of all, you need to get an API key.

Go to http://ocr.space/OCRAPI and then click on "Register for free API Key".

Note: The free OCR API plan has a rate limit of 500 requests within one day per IP address to prevent accidental spamming.

Code:

Importing Libraries

import io                        # in-memory byte streams (io.BytesIO)
import json                      # parse the JSON response returned by the API
import cv2                       # OpenCV: cv2.imread(), cv2.imencode(), ...
import numpy as np               # array operations on image data
import requests                  # send the HTTP request to the API
import matplotlib.pyplot as plt  # display images and plots

Loading the Image

Now we will load the image using OpenCV (cv2). For OCR, the image is usually converted to a binary image, which first requires grayscaling it if it is a color image. Grayscaling takes the three RGB values of each pixel and combines them, as a weighted sum, into a single value in the range [0, 255] that represents a shade of gray: 255 is the brightest shade (white) and 0 is the darkest (black).

After grayscaling comes thresholding: each pixel value is compared with a chosen threshold.

If pixel < the threshold ===> turned to a black pixel
If pixel > the threshold ===> turned to a white pixel

The result of these two steps (grayscaling and thresholding) is a binary image (for dark text on a light page, a white background and a black foreground).
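The two steps above can be sketched in plain Python on a tiny hand-made image. The weights 0.299, 0.587 and 0.114 are the standard luma weights that OpenCV's color conversion also uses; the threshold of 127, the function names, and the toy pixels are assumptions for illustration. Pixels are given here as (R, G, B) tuples, whereas cv2.imread returns channels in BGR order.

```python
def to_gray(pixel):
    """Weighted sum of R, G, B, rounded to an integer in 0..255."""
    r, g, b = pixel
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))


def binarize(rgb_image, threshold=127):
    """Grayscale each pixel, then map it to 255 (white) or 0 (black)."""
    return [[255 if to_gray(p) > threshold else 0 for p in row]
            for row in rgb_image]


# Toy 2x2 image: light "paper" pixels on the left, dark "ink" pixels on the right.
toy = [
    [(255, 255, 255), (20, 20, 20)],
    [(250, 240, 230), (0, 0, 128)],
]
print(binarize(toy))  # [[255, 0], [255, 0]]
```

In practice the same result is obtained far faster with cv2.cvtColor and cv2.threshold; the loop version only shows the arithmetic.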

# load the image using OpenCV
img = cv2.imread("../input/tbs-image/TBS_image.png")
height, width, _ = img.shape
print(width, height)

After loading the image of the TBS bachelor, we need to call the OCR engine: the image is sent to the ocr.space server to be processed. A few notes on this step:

Sending the image to the ocr.space server
Since we are using the free service, we cannot send an image larger than 1 MB, so we need to reduce the size of our image by compressing it.
Also, to send the image to the server, we need to convert the image into bytes.
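One possible way to respect the size limit is to try a few JPEG quality levels and keep the first result that fits. The sketch below is only an illustration: the helper name, the list of quality levels, and the reading of "1 MB" as 1024 * 1024 bytes are assumptions, and `encode` is a hypothetical callable that you would implement yourself (for example by wrapping cv2.imencode).

```python
MAX_BYTES = 1024 * 1024  # free-plan limit from the text, assumed to mean 1 MiB


def encode_under_limit(encode, max_bytes=MAX_BYTES,
                       qualities=(90, 80, 70, 60, 50)):
    """Try decreasing quality levels until the encoded image fits.

    `encode` is a hypothetical callable: quality (int) -> bytes.
    Returns (quality, data); raises ValueError if nothing fits.
    """
    for quality in qualities:
        data = encode(quality)
        if len(data) <= max_bytes:
            return quality, data
    raise ValueError("image does not fit under the limit at any tested quality")


# Stand-in encoder for demonstration only: pretends size grows with quality.
def fake_encode(quality):
    return b"x" * (quality * 15_000)


quality, data = encode_under_limit(fake_encode)
print(quality, len(data))  # 60 900000
```

With OpenCV, `encode` could be something like `lambda q: cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, q])[1].tobytes()`. If lowering the quality is not enough, resizing the image is the other obvious option.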

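# OCR: compress the image to JPEG (quality 90) and wrap the bytes for upload
url_api = "https://api.ocr.space/parse/image"
_, compressedimage = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, 90])
file_bytes = io.BytesIO(compressedimage.tobytes())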

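# Send the image to the API (use your own API key here).
# The dictionary key doubles as the upload filename; .jpg matches the JPEG encoding above.
result = requests.post(url_api,
              files = {"TBS_image.jpg": file_bytes},
              data = {"apikey": "eb516eb1f288957",
                      "language": "eng"})

# Decode the JSON response into a Python dictionary
result = result.content.decode()
result = json.loads(result)

result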

# Take the first parsed result and read the recognized text from it
parsed_results = result.get("ParsedResults")[0]
text_detected = parsed_results.get("ParsedText")
text_detected
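The lookup above fails with an unhelpful error if the API reports a problem and returns no results. A slightly more defensive version might look like the sketch below. It assumes the response carries the fields "IsErroredOnProcessing" and "ErrorMessage" next to "ParsedResults"; check the API documentation before relying on those names. The sample dictionaries are hand-written for illustration and are not real API output.

```python
def extract_text(response):
    """Return the recognized text from a decoded OCR response dictionary.

    Raises RuntimeError if the response reports an error or has no results.
    """
    if response.get("IsErroredOnProcessing"):
        raise RuntimeError(f"OCR failed: {response.get('ErrorMessage')}")
    parsed = response.get("ParsedResults") or []
    if not parsed:
        raise RuntimeError("OCR response contains no ParsedResults")
    # There may be one entry per page; join the text of all of them.
    return "\n".join(page.get("ParsedText", "") for page in parsed)


# Hand-written sample responses (illustrative only).
ok = {"IsErroredOnProcessing": False,
      "ParsedResults": [{"ParsedText": "Hello"}, {"ParsedText": "World"}]}
failed = {"IsErroredOnProcessing": True, "ErrorMessage": ["File too large"]}

print(extract_text(ok))
try:
    extract_text(failed)
except RuntimeError as err:
    print(err)
```

In the walkthrough above, this would be called as `extract_text(result)` after the JSON has been decoded.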

Extracting Text Using Tesseract

# Generic libraries
import os
import gc  # garbage collection
import re, string, unicodedata
import numpy as np
import pandas as pd
import cv2
import matplotlib.pyplot as plt
from PIL import Image

# Tesseract library
import pytesseract

# Warnings
import warnings
warnings.filterwarnings("ignore")


# Let's start with a simple image
img = cv2.imread("../input/tbs-image/TBS_image.png") # image in BGR format
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
fig = plt.figure(figsize = [10,10])
height,width,channel = img.shape
plt.imshow(img)
print(type(img))
print(height,width,channel)

# As the image is simple enough, the image_to_string method reads all characters almost perfectly!
text = pytesseract.image_to_string(img)
print(text)

# The output of OCR can be saved in a file if necessary
with open('output.txt', 'a') as file:  # file opened in append mode
    file.write(text)

Alternative Method

img_pil = Image.open("../input/ocr-working-in-progress/7.jpg")
MAX_SIZE = 2000
if img_pil.height > MAX_SIZE or img_pil.width > MAX_SIZE:
    scale = max(img_pil.height / MAX_SIZE, img_pil.width / MAX_SIZE)

    new_width = int(img_pil.width / scale + 0.5)
    new_height = int(img_pil.height / scale + 0.5)
    img_pil = img_pil.resize((new_width, new_height), Image.BICUBIC)

print(img_pil.width, img_pil.height)
# img_pil

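from PIL import ImageDraw

gray_pil = img_pil.convert("L")

# NOTE: detect() and FLAG_RECT are not defined in this article; they are assumed
# to come from a helper that returns a list of (x, y, w, h) text rectangles.
rect_arr = detect(img_pil, FLAG_RECT)

# Draw each detected rectangle on the image, cycling through the colors
img_draw = ImageDraw.Draw(img_pil)
colors = ['red', 'green', 'blue', "yellow", "pink"]

for i, rect in enumerate(rect_arr):
    x, y, w, h = rect
    img_draw.rectangle(
        (x, y, x + w, y + h),
        outline=colors[i % len(colors)],
        width=4)

img_pil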

OpenCV

src_path = "./"  # folder for intermediate files (assumed; not defined in the original)
img_path = "../input/tbs-image/TBS_image.png"

def get_string(img_path):
    # Read the image with OpenCV
    img = cv2.imread(img_path)
    # Convert to grayscale
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise
    # (a 1x1 kernel leaves the image unchanged; a larger kernel, e.g. 3x3, has a real effect)
    kernel = np.ones((1,1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

    cv2.imwrite(src_path + "removed_noise.png", img)

    # Apply threshold to get an image with only black and white
    img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
    cv2.imwrite(src_path + "thres.png", img)

    # Recognize text with Tesseract for Python
    result = pytesseract.image_to_string(Image.open(src_path + "thres.png"))
    return result


print("---------Start Recognize text from image---------")
print(get_string(img_path))
print("--------Done-----------")

Files generated during the above process: removed_noise.png and thres.png.
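To make the thresholding step in the code above more concrete, here is a minimal pure-Python sketch of adaptive thresholding. It is a simplification, not a reimplementation of cv2.adaptiveThreshold: it uses a plain mean of the neighborhood instead of the Gaussian-weighted one selected above, and it clips the neighborhood at the image border. The function name, the 3x3 block size, and the toy image are assumptions for illustration.

```python
def adaptive_mean_threshold(gray, block_size=3, c=2):
    """Binarize `gray` (rows of 0..255 ints) with a per-pixel threshold.

    threshold(x, y) = mean of the block_size x block_size neighborhood - c
    A pixel becomes 255 if it is above its local threshold, else 0.
    """
    h, w = len(gray), len(gray[0])
    r = block_size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the neighborhood, clipped at the image border
            window = [gray[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            threshold = sum(window) / len(window) - c
            row.append(255 if gray[y][x] > threshold else 0)
        out.append(row)
    return out


# Toy page: a dark stroke (middle column) on a background that is darker on the right.
toy = [
    [200, 60, 120],
    [200, 60, 120],
    [200, 60, 120],
]
print(adaptive_mean_threshold(toy))
# [[255, 0, 255], [255, 0, 255], [255, 0, 255]]
```

Note that a single global threshold of 127 would turn the right-hand background (value 120) black, while the local threshold keeps it white. This is why adaptive thresholding tends to work better on unevenly lit scans.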

Conclusion

In conclusion, OCR powered by machine learning makes it possible to extract and interpret text from images and scanned documents automatically. Models trained on large datasets recognize and transcribe characters with high accuracy on clean input, although, as noted above, results on older or degraded material vary. OCR is used across many industries for document digitization, form processing, and data analysis based on text extracted from images. By automating these information management tasks, it improves productivity and streamlines workflows.
