2. Optical character recognition (OCR)

Note: As the OCR engine we propose Tesseract, probably the most accurate open-source OCR engine available. Tesseract’s accuracy is close to 98% for character recognition and 95-97% for word recognition.

Now we describe a simple algorithm, implemented in MATLAB, for recognizing a business card layout. The algorithm works on a grayscale image, so we start by converting the color image to grayscale. To detect text areas we apply a special filter: a modified version of the standard deviation computed over a sliding window. The filter output is converted to a binary image with Otsu’s thresholding algorithm. We then pick the blobs that satisfy certain criteria for length, width, and direction, and build a bounding box for each of them. The resulting set of bounding boxes gives us a mask for finding text areas. The pictures below illustrate the effectiveness of this approach. Note that this is only a prototype, and all control parameters are hardcoded for this type of image.

1.1. Original color image

1.2. Gray-scaled image

1.3. Filtered image

1.4. Thresholded image

1.5. Filtered blobs

1.6. Bounding boxes for found blobs

1.7. Found text areas
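
The steps illustrated above can be sketched in a few dozen lines. The sketch below is not the MATLAB prototype: it is a plain-Python approximation that uses an unmodified sliding-window standard deviation (the exact modification is not described here), and it treats the "direction" criterion as a simple width-to-height ratio. All function names and parameter values (`radius`, `min_w`, `min_h`, `min_aspect`) are assumptions chosen for illustration.

```python
from math import sqrt

def to_gray(rgb):
    """Convert rows of (r, g, b) tuples to luminance (BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]

def local_std(gray, radius=1):
    """Standard deviation in a (2*radius+1)^2 window, clipped at the borders."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [gray[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mean = sum(vals) / len(vals)
            out[y][x] = sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return out

def otsu_threshold(img, bins=256):
    """Return the value that maximizes between-class variance (Otsu)."""
    flat = [v for row in img for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return lo
    scale = (bins - 1) / (hi - lo)
    hist = [0] * bins
    for v in flat:
        hist[int((v - lo) * scale)] += 1
    total = len(flat)
    sum_all = sum(i * c for i, c in enumerate(hist))
    best_t, best_var, w0, sum0 = 0, -1.0, 0, 0
    for t in range(bins):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        var = w0 * w1 * (sum0 / w0 - (sum_all - sum0) / w1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    # Pixels in bins above best_t are foreground.
    return lo + (best_t + 1) / scale

def find_boxes(binary, min_w=3, min_h=1, min_aspect=1.5):
    """Label 8-connected blobs; keep boxes that are wide enough and horizontal."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            stack, seen[y][x] = [(y, x)], True
            x0 = x1 = x
            y0 = y1 = y
            while stack:  # flood fill one blob, tracking its extent
                cy, cx = stack.pop()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            bw, bh = x1 - x0 + 1, y1 - y0 + 1
            if bw >= min_w and bh >= min_h and bw / bh >= min_aspect:
                boxes.append((x0, y0, x1, y1))
    return boxes

def find_text_boxes(rgb):
    """Grayscale -> local std filter -> Otsu -> blob filter -> bounding boxes."""
    filtered = local_std(to_gray(rgb))
    t = otsu_threshold(filtered)
    binary = [[v >= t for v in row] for row in filtered]
    return find_boxes(binary)

# Synthetic 12x20 "card": white background with a striped two-row "text line".
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
card = [[BLACK if y in (5, 6) and 4 <= x <= 15 and x % 2 == 0 else WHITE
         for x in range(20)] for y in range(12)]
print(find_text_boxes(card))  # (x0, y0, x1, y1) tuples, inclusive
```

On the synthetic card the filter responds only near the striped region, so a single box is returned that covers the stripes plus a one-pixel margin from the window. On real photographs the thresholds would need tuning, as the prototype's hardcoded parameters suggest.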

Here are a few examples of the algorithm at work:

Card 1 – horizontal layout, dark text on light background

Card 2 – horizontal layout, light text on dark background

Card 3 – vertical layout, combination of dark/light text and backgrounds

With the algorithm described above we can efficiently find text areas on business cards and make reasonable guesses about the purpose of each area. Then, using Tesseract for optical character recognition (the second stage of our task), we can reliably achieve 90% precision in business card layout and text recognition.
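
As a rough illustration of the second stage, each found text area could be cropped to its own image file and passed to the Tesseract command-line tool. The sketch below assumes that the `tesseract` executable (version 4 or later, where the page segmentation flag is spelled `--psm`) is installed and on the PATH; the helper names and the file name `area_01.png` are hypothetical.

```python
import subprocess

def tesseract_command(image_path, lang="eng", psm=7):
    """Build the CLI call. `stdout` as the output base prints the text;
    page segmentation mode 7 treats the image as a single text line."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]

def recognize(image_path, **kwargs):
    """Run Tesseract on one cropped text area and return the recognized text."""
    result = subprocess.run(tesseract_command(image_path, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Example (requires Tesseract and an existing image file):
# print(recognize("area_01.png"))
```

Running OCR per text area rather than on the whole card lets each call use a single-line segmentation mode, which may suit the short fields typical of business cards.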
