
Sunday, July 14, 2013

WhiteWashing and box files

Finally, I have fixed the padding-values for the WhiteWash algorithm and it seems to be working nicely.
Here is a preview of the current output :
However, the padding uses fixed integer values right now. It would be better if I could somehow relate this to the document statistics; that should be interesting to try out later.
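One idea for that (just a sketch, nothing tested yet): derive the padding from the median text-line height measured off the row projection, rather than from a constant.

```python
# Idea sketch, not tested: pick the whitewash padding as a fraction of the
# median text-line height, estimated from the row projection profile,
# instead of hard-coding an integer.
from PIL import Image

def estimate_padding(path, threshold=128, fraction=0.15):
    img = Image.open(path).convert("L")
    w, h = img.size
    px = img.load()

    # Mark rows that contain at least one dark pixel.
    inked = [any(px[x, y] < threshold for x in range(w)) for y in range(h)]

    # Heights of consecutive inked-row runs = heights of the text lines.
    heights, run = [], 0
    for on in inked + [False]:
        if on:
            run += 1
        elif run:
            heights.append(run)
            run = 0

    if not heights:
        return 1
    median_height = sorted(heights)[len(heights) // 2]
    return max(1, int(median_height * fraction))
```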
This success with the algorithm allows me to focus on the box file generation again. Currently, I have managed to generate it using my workaround in the virtual machine, after tweaking Debayan and Sayamindu's original algorithm with pango.
The box file generated now looks like :

Which is an improvement over the last time.
Now, I have to combine these two methods to try out my plan. 
Next, I will make an entry sometime this week about my updated plans and goals for the midterm evaluation period.
So Long.

Monday, July 8, 2013

manual fix - working proof of concept

So I looked into the csv files and, using a pivot table, found the highest-intensity y-coordinates and the boundary coordinates for a line. With those coordinates, my code generates these two images.
Boundary markers :

Intended whitewashed output :
This shows that the algorithm is correct; however, there is something wrong in the way the code finds the maximum-intensity line. Currently, I am using max(..., key=itemgetter(...)) to find the maximum-intensity element within the sublist. I will try to do it some other way.
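For reference, the pattern in question looks roughly like this (with made-up numbers); the gotcha is that keying itemgetter on the wrong column makes max() pick the wrong thing silently:

```python
# Made-up numbers: the sublist holds (row, dark_pixel_count) pairs for one
# horizontal strip, and I want the row with the highest count.
from operator import itemgetter

strip = [(180, 12), (181, 47), (182, 95), (183, 91), (184, 40)]

peak = max(strip, key=itemgetter(1))    # (182, 95)  -> correct: darkest row

# The silent failure mode: key on the wrong column and max() happily returns
# the largest row number instead of the darkest row.
wrong = max(strip, key=itemgetter(0))   # (184, 40)  -> just the last row
```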
So Long.


image wash - working

So, I fixed the code that wasn't working yesterday (putpixel). The input and output currently look like this:
Original :
Output (NOTE : the black line is actually intended to be white, it's black here just for visualisation purposes)
Now, here we can see that the whitewashing code works. But the line was supposed to overlap with the "matra", which did not happen, and that points to a bug in the code that finds the matra's location.
I was following something I learnt from Prof. Mostafa's class (which I completed in June 2013) and NOT looking at the data before making the model, but it seems I will have to break the pattern and delve deeper.
So, next thing on the list : 
  1. Go back and fix the bug in code to find the matra.
Also, originally, the matra clipper was intended to work as a pre-processing module for Tesseract. But having worked with Tesseract 3.0 for real, I have a change of plan.
Now I intend to use it to whitewash the matras, and then introduce gaps between every character, making them into individual glyphs.
Then, we can use these glyphs easily to make new box files and train Tesseract.
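To make the plan concrete, here is a rough sketch of the idea (the matra band coordinates and the gap width are placeholders, not final values):

```python
# Rough sketch of the plan, not final code: (1) paint the matra band white,
# (2) cut the word at column gaps, (3) re-paste the pieces onto a fresh
# canvas with extra spacing so each glyph stands alone for box-file work.
from PIL import Image

def separate_glyphs(word_img, matra_top, matra_bottom, gap=8, threshold=128):
    img = word_img.convert("L")
    w, h = img.size
    px = img.load()

    # 1. Whitewash the matra band (coordinates come from the clipping code).
    for y in range(matra_top, matra_bottom + 1):
        for x in range(w):
            px[x, y] = 255

    # 2. Find runs of columns that still contain ink -- those are the glyphs.
    inked = [any(px[x, y] < threshold for y in range(h)) for x in range(w)]
    glyphs, start = [], None
    for x, on in enumerate(inked + [False]):
        if on and start is None:
            start = x
        elif not on and start is not None:
            glyphs.append(img.crop((start, 0, x, h)))
            start = None

    # 3. Paste the glyphs onto a white canvas with a gap between each pair.
    out_w = sum(g.size[0] for g in glyphs) + gap * (len(glyphs) + 1)
    out = Image.new("L", (out_w, h), 255)
    x = gap
    for g in glyphs:
        out.paste(g, (x, 0))
        x += g.size[0] + gap
    return out
```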
Also, Tesseract 3.0, combined with a few filters (see here), seems to work nicely with individual words at this level.
One sample input :
Final output :
At the word level, that's a whopping 100% accuracy with just a few filters and a slight increase in resolution. This proves the mettle of Tesseract 3.0, and gives hope for Indic OCR.
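For the record, the word-level experiment boils down to something like the sketch below; the exact filter combination is in the post linked above, so the median + sharpen pair here is only a stand-in, and -l ben assumes the Bengali traineddata is installed:

```python
# Minimal sketch of the word-level test: upscale the word image, apply a
# couple of filters, then hand it to Tesseract 3.0 on the command line.
# The filters below are placeholders for the ones in the linked post.
import subprocess
from PIL import Image, ImageFilter

def ocr_word(path, scale=3, lang="ben"):
    img = Image.open(path).convert("L")
    w, h = img.size
    img = img.resize((w * scale, h * scale), Image.BICUBIC)
    img = img.filter(ImageFilter.MedianFilter(3))
    img = img.filter(ImageFilter.SHARPEN)
    img.save("word_upscaled.png")

    # tesseract <image> <output base> -l <lang>  -> writes word_out.txt
    subprocess.call(["tesseract", "word_upscaled.png", "word_out", "-l", lang])
    return open("word_out.txt").read().strip()
```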
However, as an insight, this project is turning out to be way more research-oriented than I initially thought. That makes it all the more interesting.

On a more personal note, I will be visiting Kolkata for a week, from the 14th to the 22nd of July. I will continue with the work but may not be able to update the blog regularly; however, I can still be reached through email.
And with that, it's about time to call it a night.
So Long.



Sunday, July 7, 2013

washed image test

Things are looking up since the last post.
The problems with the clipping code are fixed now, and it detects and writes the darkest pixels in each subdivision to a separate csv (yes, I love csvs). The modification it needed was to use the indexes instead of the coordinates directly. That did the trick.
Also, while I was at it, I optimised the code, and now everything is under 50 lines. Once again, the IPython Notebook has been a great help, and I'm convinced it was a good decision to switch from Eclipse+PyDev to EPD.
However, the file now looks like this :

Then, I tried to wash over those pixels with putpixel(255), i.e., a white line, but it doesn't seem to work.
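For reference, the wash step I'm after is roughly this (file names and csv column order are placeholders); putpixel wants an (x, y) tuple and a value that matches the image mode, so that is where I'll look first:

```python
# The intended wash step: read the darkest-pixel coordinates back from the
# csv and overwrite them with white. putpixel is picky: the coordinate must
# be an (x, y) tuple, and the value must match the image mode (a single int
# for "L", an (r, g, b) tuple for "RGB").
import csv
from PIL import Image

img = Image.open("page.png").convert("L")

with open("darkest_pixels.csv") as f:
    for row in csv.reader(f):
        x, y = int(row[0]), int(row[1])      # placeholder column order
        img.putpixel((x, y), 255)            # 255 = white in 8-bit grayscale

img.save("page_washed.png")
```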
Next on list :
  1. Fix the part to "wash" the pixels
  2. Talk to mentor if it's okay to push notebooks directly (sadly, I don't know much about software packaging)

Also, I have been thinking about the InfoRescue project, and I think I would modify my plans for the second half a bit. Will write more about that soon.
So Long.

Thursday, July 4, 2013

chop-chop

The work on the clipping code is going well. Thanks to Debayan for laying the ground-work for the algorithm.
It's complete to the point where it detects the lines as a whole and, for my convenience, prints the coordinates to a csv in a (line number, count) format. Here's an interesting glimpse:
The jump from line 66 to 124 represents a white gap, the space between the two lines.
However, it still needs some tuning. For instance, this :
The presence of 183 and 184 bothers me. It most probably represents some noise and shows that the code still needs tuning.
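For context, the core idea of the clipping code is just a row-wise dark-pixel count, roughly this shape (simplified, with placeholder file names):

```python
# Count dark pixels per row and dump (row number, count) to a csv.
# Rows with a zero count are the white gaps between lines, which is
# where the jumps in the csv (like 66 -> 124) come from.
import csv
from PIL import Image

img = Image.open("page.png").convert("L")
w, h = img.size
px = img.load()

with open("line_profile.csv", "w") as f:
    writer = csv.writer(f)
    for y in range(h):
        count = sum(1 for x in range(w) if px[x, y] < 128)
        if count:                       # skip blank rows, hence the jumps
            writer.writerow([y, count])
```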
But I admit, IPython notebooks are a great way to modularise and optimise when experimenting with code. It only enhances my love for Python.
Also, my algorithm is enhanced in the sense that it should (in theory) be able to distinguish straight and curved lines; I'll get started on that modification after completing the basic code.
So Long.

p.s. - I'm starting to get worried now, as I think I am a bit behind the schedule I had set for myself. So push, push and chop-chop.