Approach :
For the first half of the summer, I decided to tackle the problem of the shirorekha (the horizontal headline that joins the letters of a word) in connected scripts, and the result is the WhiteWashing Algorithm. It detects the shirorekha in a script and removes it, making it easier to separate the "akkhars" (letters) in a "shabda" (word).
Next, I am developing a matrix to help us determine the best-suited filtering method for making a document OCR-ready. It will give specific instructions for a document based on its type. An early example of the same is here.
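To make the shirorekha removal idea concrete, here is a minimal sketch of one common way to do it: a horizontal projection profile, where the rows with the most ink are treated as the headline and cleared. This is not the actual WhiteWashing code. The function names, the toy image, and the `threshold` value are all assumptions made for illustration, and the image is assumed to be already binarized (1 = ink, 0 = background).

```python
def row_ink_counts(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def find_shirorekha_rows(image, threshold=0.8):
    """Return indices of rows whose ink count is close to the densest row.

    `threshold` is an assumed tuning parameter, not a value from the post.
    """
    counts = row_ink_counts(image)
    peak = max(counts) if counts else 0
    if peak == 0:
        return []  # blank image: nothing to remove
    return [i for i, c in enumerate(counts) if c >= threshold * peak]


def whitewash(image, threshold=0.8):
    """Return a copy of the image with the detected headline rows cleared."""
    rows = set(find_shirorekha_rows(image, threshold))
    return [[0] * len(row) if i in rows else list(row)
            for i, row in enumerate(image)]


# Toy word: a headline in row 1 joins two characters hanging below it.
word = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = whitewash(word)
```

In the toy example, once the headline row is cleared, the middle column contains no ink at all, so the two characters can be split at that vertical gap. A real scan would need binarization and deskewing first, and the headline is usually several pixels thick, which is why the sketch keeps every row near the peak rather than only the single densest one.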
Tools Used :
- Language : Python 2.7
- Development Environment :
  - Most of the work is done on EPD 1.0.1 on Windows 7 (32-bit), using IPython notebooks.
  - However, I had to switch to Linux (Fedora 14 on VMware Workstation 9.0) for a while, as I was unable to get Pango working on Windows.
  - Notepad++ 6.3.2
- Software :
  - Tesseract 3.02 on the command line (NOTE : 3.0 is the version available on Fedora 14, but this had no effect, as I switched back to Windows for the part that uses Tesseract).
- Python Libraries :
- Repository : GitHub