Showing posts with label pango. Show all posts

Friday, September 13, 2013

data generation method- complete

So finally, the method for generating data easily is complete.
I tried to continue making the script with EPD, but it kept giving bad results, mostly due to inefficient memory handling on my part. Here is a sample output of the previous code.
It can be seen that it's really bad, and it should not be used for any kind of training or testing.
However, I switched to ImageMagick for the merging part, and now it works perfectly. The sample output looks like this.
Note the clarity in spacing of the individual images generated before, and that it takes up the whole space instead of crowding in the corner, like the previous result.
It can be done with a simple command:
montage img1.ext img2.ext .... imgN.ext -geometry SizexSize output.ext
Here, montage is the command and -geometry is the option; the rest can be modified as per our needs.
Also, the image above is just a snip of the actual output as it is too big to publish here.
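For anyone who wants to call the same command from a Python script, here is a minimal sketch. It assumes ImageMagick's montage is on the PATH; the helper names build_montage_command and run_montage are made up for illustration and are not part of my actual script.

```python
import subprocess


def build_montage_command(image_paths, output_path, tile_size):
    """Build the argument list for ImageMagick's montage command.

    tile_size is used for both width and height, matching the
    SizexSize form of the -geometry option shown above.
    """
    geometry = "{0}x{0}".format(tile_size)
    return ["montage"] + list(image_paths) + ["-geometry", geometry, output_path]


def run_montage(image_paths, output_path, tile_size):
    """Run montage; raises CalledProcessError if the command fails."""
    subprocess.check_call(build_montage_command(image_paths, output_path, tile_size))
```

Building the command as a list instead of one string avoids shell quoting problems when file names contain spaces.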

Monday, July 22, 2013

Mid-Term Summary : Part 2 - Shaping BookWorm

Approach :
I decided to tackle the problem of the shirorekha in connected scripts for the first half of the summer, and the result is the WhiteWashing Algorithm. It detects the shirorekha in a script and removes it, making it easier to separate the "akkhars" (alphabets) in a "shabda" (word).
Next, I am working on developing a matrix to help us determine the best-suited filtering method for making a document OCR-ready. It will have specific instructions for a document based on its type. An early example of the same is here.

Tools Used :

  • Language : Python 2.7
  • Development Environment : 
    • Most of the work is done on EPD 1.0.1 on Windows 7 (32-bit), using IPython notebooks.
      • However, I had to switch to Linux (Fedora 14 on VMware Workstation 9.0) for a while, as I was unable to make pango work on Windows.
    • Notepad++ 6.3.2
  • Software : 
    • Tesseract 3.02 on the command line (NOTE : 3.0 is available on Fedora 14, but this did not matter, as I switched to Windows for the part that uses Tesseract).
    • Python Libraries :
  • Repository : GitHub

Sunday, July 14, 2013

WhiteWashing and box files

Finally, I have fixed the padding values for the WhiteWash algorithm, and it seems to be working nicely.
Here is a preview of the current output :
However, the padding uses fixed integer values right now. It would be better if I could somehow relate it to the document statistics. It should be interesting to try out later.
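To show what the padding does, here is a rough sketch of the idea, not the actual WhiteWash code. It assumes the word image is already binarised into a list of rows of 0 (background) and 1 (ink), and that the shirorekha is simply the row with the most ink; the function name whitewash_headline and the padding parameter are illustrative.

```python
def whitewash_headline(word, padding=1):
    """Blank out the densest row of a binary word image plus `padding`
    rows above and below it. Returns a new image; the input is unchanged.
    """
    # Horizontal projection: amount of ink in each row.
    row_ink = [sum(row) for row in word]
    # Assumption: the shirorekha is the row with the most ink.
    headline = row_ink.index(max(row_ink))
    top = max(0, headline - padding)
    bottom = min(len(word) - 1, headline + padding)
    return [
        [0] * len(row) if top <= y <= bottom else list(row)
        for y, row in enumerate(word)
    ]
```

A padding tied to document statistics could replace the constant here, for example a fraction of the image height, but that is untested.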
This success with the algorithm allows me to focus on the box file generation again. Currently, I have managed to generate the box files using my workaround in the virtual machine, after tweaking Debayan and Sayamindu's original pango-based algorithm.
The box file generated now looks like :

This is an improvement over last time.
Now, I have to combine these two methods to try out my plan. 
Next, I will make an entry sometime this week about my updated plans and goals for the midterm evaluation period.
So Long.

Wednesday, July 3, 2013

workaround for Pango

Having tried to make pango work on Windows for over a week now, I have finally given up and switched to Fedora.
On Fedora, it took a little over 3 minutes to install the dependencies and get everything up and running. More reasons to love Linux.
Anyway, the first trial of generating the images looks like this.
It is good in itself, but still needs work. I also need to make the same fonts available in both my Windows and Fedora environments; it was a silly mistake on my part not to have taken care of that already.
Coding for the clipping algorithm is about 30 % done. Right now, I am looking into connected-component analysis to see if it will work.
I plan to fix the fonts, and then work on the connected-component for the rest of the week.
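For reference, a minimal sketch of what connected-component analysis means here, assuming a binarised image stored as a list of rows of 0 and 1 and using 4-connectivity. The function name label_components is illustrative, and this is a plain breadth-first labelling, not the clipping algorithm itself.

```python
from collections import deque


def label_components(image):
    """Label 4-connected groups of ink pixels (value 1).

    Returns (labels, count): labels has the same shape as image, with 0
    for background and 1..count for each component.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or labels[y][x] != 0:
                continue
            # Found an unlabelled ink pixel: start a new component.
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and labels[ny][nx] == 0):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

Each label could then be turned into a bounding box for one candidate character, provided the characters do not touch each other.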
So Long.


Friday, June 28, 2013

roadblock

I've hit a roadblock.
In an attempt to generate a new, hopefully improved, training set, I've been trying to run a script that generates the box files. But it needs two libraries, Cairo and Pango, to work.
I'm having trouble getting Pango to work on my Windows machine.
I'll spend another day or two on it, and if it doesn't work, I will have to switch.
It's going to be a long night.