BookWorm - the BongOCR: Mid-Term Summary : Part 2

Monday, July 22, 2013

Mid-Term Summary : Part 2 - Shaping BookWorm

Approach :
For the first half of the summer I decided to tackle the problem of the shirorekha (the headline that joins the letters of a word) in connected scripts, and the result is the WhiteWashing Algorithm. It detects the shirorekha in a script and removes it, which makes it easier to separate the "akkhars" (letters) in a "shabda" (word).
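To make the idea concrete, here is a minimal sketch of headline removal using a horizontal projection profile. It is not the WhiteWashing Algorithm itself, only one common way to do the same kind of thing, and it assumes a binarized word image stored as a NumPy array where 1 is ink and 0 is background. The function names, the band_ratio parameter and the toy word image are all made up for illustration.

```python
import numpy as np


def find_shirorekha_rows(binary, band_ratio=0.8):
    """Return indices of rows whose ink count is close to the densest row.

    Assumption: the headline is the row (or band of rows) with the most ink.
    """
    row_ink = binary.sum(axis=1)
    peak = row_ink.max()
    if peak == 0:
        # Blank image: nothing to detect.
        return np.array([], dtype=int)
    return np.where(row_ink >= band_ratio * peak)[0]


def whitewash(binary, band_ratio=0.8):
    """Return a copy of the image with the detected headline rows cleared."""
    cleaned = binary.copy()
    cleaned[find_shirorekha_rows(binary, band_ratio)] = 0
    return cleaned


def split_columns(binary):
    """Return (start, end) column ranges that contain ink, end exclusive."""
    has_ink = binary.sum(axis=0) > 0
    segments = []
    start = None
    for col, ink in enumerate(has_ink):
        if ink and start is None:
            start = col
        elif not ink and start is not None:
            segments.append((start, col))
            start = None
    if start is not None:
        segments.append((start, len(has_ink)))
    return segments


# Toy "word": a headline on row 1 joining two blocks that stand in for letters.
word = np.zeros((6, 9), dtype=np.uint8)
word[1, :] = 1        # hypothetical shirorekha spanning the whole word
word[2:5, 1:3] = 1    # first letter
word[2:5, 5:8] = 1    # second letter

before = split_columns(word)            # one joined segment
after = split_columns(whitewash(word))  # two separate segments
```

With the headline in place the whole word comes out as a single segment; once the headline rows are cleared, the blank columns between the letters become visible and the word splits in two. Real scans would need a tolerance for skew and for letters whose strokes rise above the headline, which this sketch ignores.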
Next, I am working on a matrix to help us determine the best-suited filtering method for making a document OCR-ready. It will give specific instructions for a document based on its type. An early example of the same is here.

Tools Used :

Language : Python 2.7
Development Environment :

Most of the work is done on EPD 1.0.1 on Windows 7 (32-bit), using IPython notebooks.

However, I had to switch to Linux (Fedora 14 on VMware Workstation 9.0) for a while, as I was unable to make pango work on Windows.

Notepad++ 6.3.2

Software :

Tesseract 3.02 on the command line (NOTE : 3.0 is the version available on Fedora 14, but this did not affect the work, as I switched back to Windows for the part that uses Tesseract).
Python Libraries :

Repository : GitHub
