Showing posts with label windows. Show all posts

Monday, July 22, 2013

Mid-Term Summary : Part 2 - Shaping BookWorm

Approach :
I decided to tackle the problem of the shirorekha in connected scripts for the first half of the summer, and the result is the WhiteWashing Algorithm. It detects the shirorekha in a script and removes it, making it easier to separate the "akkhars" (alphabets) in a "shabda" (word).
Next, I am working on developing a matrix to help us determine the best-suited filtering method for making a document OCR-ready. It will have specific instructions for a document based on its type. An early example of the same is here.
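
To give a rough idea of what detecting and removing a shirorekha can involve, here is a minimal sketch based on a horizontal projection profile: the rows with the most ink are treated as the headline and blanked out. This is an illustration under stated assumptions, not the actual WhiteWashing code. The function names, the 0/1 list-of-rows image format, and the 0.8 threshold are all made up for the example.

```python
def find_headline_rows(image, threshold=0.8):
    """Return indices of rows whose ink count is at least `threshold`
    times the ink count of the densest row.

    `image` is a list of rows; each pixel is 1 for ink, 0 for background.
    The threshold is an assumed value, not taken from the real algorithm.
    """
    row_ink = [sum(row) for row in image]
    peak = max(row_ink) if row_ink else 0
    if peak == 0:
        return []
    return [y for y, ink in enumerate(row_ink) if ink >= threshold * peak]


def remove_headline(image, padding=0, threshold=0.8):
    """Return a copy of `image` with the headline band set to background.

    The band runs from the first to the last dense row, widened by
    `padding` rows on each side and clipped to the image.
    """
    rows = find_headline_rows(image, threshold)
    if not rows:
        return [list(row) for row in image]
    top = max(0, min(rows) - padding)
    bottom = min(len(image) - 1, max(rows) + padding)
    return [
        [0] * len(row) if top <= y <= bottom else list(row)
        for y, row in enumerate(image)
    ]


# A tiny 5x7 "word": row 1 is the headline joining two vertical strokes.
word = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
]
cleaned = remove_headline(word)
```

With the headline row blanked, the two strokes are no longer connected, which is what makes the later character separation easier. A real scan would also need binarisation and probably per-word processing.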

Tools Used :

  • Language : Python 2.7
  • Development Environment : 
    • Most of the work is done on EPD 1.0.1 on Windows 7 (32-bit), using IPython notebooks.
      • However, I had to switch to Linux (Fedora 14 on VMware Workstation 9.0) for a while, as I was unable to make Pango work on my Windows machine.
    • Notepad++ 6.3.2
  • Software : 
    • Tesseract 3.02 on the command line (NOTE : 3.0 is available on Fedora 14, but this did not affect me, as I switched back to Windows for the part that uses Tesseract).
    • Python Libraries :
  • Repository : GitHub

Sunday, July 14, 2013

WhiteWashing and box files

Finally, I have fixed the padding values for the WhiteWash algorithm and it seems to be working nicely.
Here is a preview of the current output :
However, the padding uses hard-coded integer values right now. It would be better if I could somehow relate it to the document statistics. It should be interesting to try out later.
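
One possible way to tie the padding to the document, sketched here purely as an idea: measure how thick the headline band is (the longest run of consecutive dense rows) and take the padding as a fraction of that. The helper names and the 0.5 ratio are assumptions for illustration, and nothing here has been tried on real scans.

```python
def longest_run(indices):
    """Length of the longest run of consecutive integers in a sorted list."""
    best = 0
    current = 0
    previous = None
    for i in indices:
        if previous is not None and i == previous + 1:
            current += 1
        else:
            current = 1
        best = max(best, current)
        previous = i
    return best


def padding_from_thickness(dense_rows, ratio=0.5):
    """Derive a padding in pixels from the headline thickness.

    `dense_rows` is a sorted list of row indices judged to belong to the
    headline. `ratio` is an assumed tuning factor, not a measured value.
    """
    thickness = longest_run(dense_rows)
    # Round half up and never return a negative padding.
    return max(0, int(thickness * ratio + 0.5))


# Rows 3-5 form a 3-pixel band; row 9 is an isolated dense row.
padding = padding_from_thickness([3, 4, 5, 9])
```

The appeal of this over a constant is that a high-resolution scan with a thicker headline would automatically get a larger padding.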
This success with the algorithm allows me to focus on the box file generation again. Currently, I have managed to generate one using my workaround in the virtual machine, after tweaking Debayan and Sayamindu's original algorithm that uses Pango.
The box file generated now looks like :

This is an improvement over last time.
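
For reference, a Tesseract 3.x box file has one glyph per line in the form "symbol left bottom right top page", with coordinates measured from the bottom-left corner of the image. The sketch below converts a bounding box given in the more common top-left-origin form into such a line. The function name and its parameters are hypothetical and are not part of the box file generation script mentioned above.

```python
def box_line(char, left, top, width, height, image_height, page=0):
    """Format one box-file line from a top-left-origin bounding box.

    `left` and `top` locate the upper-left corner of the glyph, in pixels
    from the top-left corner of the image.
    """
    right = left + width
    # Flip the y axis: box files measure from the bottom of the image.
    box_bottom = image_height - (top + height)
    box_top = image_height - top
    return "%s %d %d %d %d %d" % (char, left, box_bottom, right, box_top, page)


# A 30x40 glyph whose upper-left corner is at (10, 20) in a 100-pixel-tall image.
line = box_line("a", 10, 20, 30, 40, 100)
```

Getting the y axis wrong is an easy mistake here, since most imaging libraries put the origin at the top-left.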
Now, I have to combine these two methods to try out my plan. 
Next, I will make an entry sometime this week about my updated plans and goals for the midterm evaluation period.
So Long.

Friday, June 28, 2013

roadblock

I've hit a roadblock.
In an attempt to generate a new, hopefully improved, training set, I've been trying to run a script that generates the box files. But it needs two libraries, Cairo and Pango, to work.
I'm having trouble getting Pango to work on my Windows machine.
I'll spend another day or two on it, and if it doesn't work, I will have to switch.
It's going to be a long night.