Monday, July 22, 2013

Mid-Term Summary : Part 2 - Shaping BookWorm

Approach :
I decided to tackle the problem of the shirorekha in connected scripts for the first half of the summer, and the result is the WhiteWashing Algorithm. It detects the shirorekha (the headline that joins the letters of a word) in a script and removes it, making it easier to separate the "akkhars" (letters) in a "shabda" (word).
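To make the idea concrete, here is a minimal sketch of one common way to find and blank out a headline: a horizontal projection profile. This is not necessarily how the WhiteWashing Algorithm works internally; the function names, the 0/1 image representation (1 = ink, 0 = background) and the 0.8 threshold are all assumptions made for illustration.

```python
def find_shirorekha_rows(image, threshold=0.8):
    """Return the rows whose ink count is at least `threshold` times
    the ink count of the densest row (assumed to be the headline)."""
    row_ink = [sum(row) for row in image]
    peak = max(row_ink) if row_ink else 0
    if peak == 0:
        return []
    return [i for i, ink in enumerate(row_ink) if ink >= threshold * peak]


def whiten_rows(image, rows):
    """Return a copy of the image with the given rows set to background (0)."""
    rows = set(rows)
    return [[0] * len(row) if i in rows else list(row)
            for i, row in enumerate(image)]


def split_columns(image):
    """Return (start, end) column ranges separated by fully blank columns."""
    if not image:
        return []
    width = len(image[0])
    col_ink = [sum(row[x] for row in image) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(col_ink):
        if ink and start is None:
            start = x                      # a letter begins here
        elif not ink and start is not None:
            segments.append((start, x))    # a letter ended at the previous column
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# Hypothetical toy "word": two letters joined by a headline in row 1.
word = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
headline = find_shirorekha_rows(word)
letters = split_columns(whiten_rows(word, headline))
```

With the headline in place the whole word is one connected column range; once it is whitened, the blank column between the two letters shows up and they can be cut apart.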
Next, I am working on a matrix to help us determine the best-suited filtering method for making a document OCR-ready. It will give specific instructions for a document based on its type. An early example of the same is here.
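As a rough picture of what such a matrix could look like in code, the sketch below maps a document type to an ordered list of filter steps. Every document type and filter name here is a hypothetical placeholder, not an entry from the actual matrix, and the fallback step is likewise an assumption.

```python
# Hypothetical lookup table: document type -> ordered filter steps.
# None of these names come from the real matrix; they only show the shape.
FILTER_MATRIX = {
    "clean_scan": ["binarize"],
    "noisy_scan": ["median_blur", "binarize"],
    "connected_script": ["binarize", "remove_shirorekha"],
}

DEFAULT_FILTERS = ["binarize"]  # assumed fallback for unknown types


def filters_for(doc_type):
    """Return a copy of the filter steps recommended for a document type."""
    return list(FILTER_MATRIX.get(doc_type, DEFAULT_FILTERS))


steps = filters_for("connected_script")
```

Keeping the matrix as plain data means new document types can be added without touching the code that applies the filters.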

Tools Used :

  • Language : Python 2.7
  • Development Environment : 
    • Most of the work is done on EPD 1.0.1 on Windows 7 (32-bit), using IPython notebooks.
      • However, I had to switch to Linux (Fedora 14 on VMware Workstation 9.0) for a while, as I was unable to get pango working on Windows.
    • Notepad++ 6.3.2
  • Software : 
    • Tesseract 3.02 on the command line (NOTE : 3.0 is available on Fedora 14, but this did not matter, as I switched to Windows for the part that uses Tesseract).
    • Python Libraries :
  • Repository : GitHub
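For reference, calling the Tesseract command line from Python can be sketched as below. The file names and the helper names are hypothetical, and the "hin" language code is only an assumption about which trained data is installed; the argument order follows the usual `tesseract imagename outputbase -l lang` form.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="hin"):
    """Build the argument list for a command-line Tesseract run.
    Tesseract writes the recognised text to output_base + '.txt'."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path, output_base, lang="hin"):
    """Run Tesseract and return its exit code (0 means success).
    Requires the tesseract executable to be on the PATH."""
    return subprocess.call(tesseract_command(image_path, output_base, lang))


# Hypothetical file names, shown only to illustrate the call.
cmd = tesseract_command("page_whitewashed.png", "page_out")
```

Building the argument list separately from running it makes it easy to print or log the exact command before invoking Tesseract.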
