BookWorm - the BongOCR: image wash

So, I fixed the code that wasn't working yesterday (putpixel). The input and output currently look like this:
Original :

Output (NOTE : the black line is actually intended to be white, it's black here just for visualisation purposes)

Now, here we can see that the whitewashing code works. But the whitewashed line was supposed to overlap the "matra", which did not happen, so there is a bug in my code that finds the location of the matra.
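One common way to locate the matra is a horizontal projection profile: count the dark pixels in every row and take the densest band. The sketch below is only an illustration of that idea, not the code from this project. It assumes the word image is already binarised into a list of rows (1 = ink, 0 = white); the function names and the 0.8 ratio are made up for the example.

```python
# Sketch: find the matra as the densest band of rows, then whitewash it.
# Assumption: img is a list of rows, 1 = ink (dark), 0 = background (white).

def row_ink_counts(img):
    """Horizontal projection profile: number of ink pixels in each row."""
    return [sum(row) for row in img]


def find_matra_rows(img, ratio=0.8):
    """Return (top, bottom) row indices of the band around the densest row.

    Neighbouring rows are included while their ink count stays at or
    above ratio * peak. The default ratio is an arbitrary guess.
    """
    counts = row_ink_counts(img)
    peak = max(counts)
    if peak == 0:
        return None  # blank image, nothing to find
    centre = counts.index(peak)
    threshold = ratio * peak
    top = centre
    while top > 0 and counts[top - 1] >= threshold:
        top -= 1
    bottom = centre
    while bottom < len(counts) - 1 and counts[bottom + 1] >= threshold:
        bottom += 1
    return top, bottom


def whitewash_rows(img, top, bottom):
    """Return a copy of img with rows top..bottom (inclusive) set to white."""
    return [
        [0] * len(row) if top <= y <= bottom else list(row)
        for y, row in enumerate(img)
    ]


word = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # matra
    [1, 1, 1, 1, 1, 0],  # matra, slightly ragged
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
band = find_matra_rows(word)
washed = whitewash_rows(word, *band)
```

With a real image the same loop would read pixels through an imaging library and write them back with something like putpixel. Printing the row counts next to the image is also a quick way to see where the detected band and the real matra disagree.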

I was following something I learnt from Prof. Mostafa's class (which I completed in June 2013): NOT looking at the data before making the model. But it seems I will have to break the pattern and delve deeper.

So, next thing on the list :

Go back and fix the bug in the code that finds the matra.

Also, originally, the matra clipper was intended to work as a pre-processing module for Tesseract. But having worked with Tesseract 3.0 for real, I have a change of plan.

Now I intend to use it to whitewash the matras, and then introduce gaps between every character, making them into individual glyphs.
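Once the matra is gone, the characters of a word are often separated by empty columns, so a vertical projection can split them. The following is a rough sketch of that step under the same assumption as before (a binarised image as a list of rows, 1 = ink). The function names and the gap width are hypothetical, and characters that still touch below the matra would need something smarter.

```python
# Sketch: split a matra-free word into glyphs and spread them apart.
# Assumption: img is a list of equal-length rows, 1 = ink, 0 = white.

def column_has_ink(img, x):
    """True if any row has an ink pixel in column x."""
    return any(row[x] for row in img)


def split_glyphs(img):
    """Return inclusive (start, end) column ranges of consecutive inked columns."""
    width = len(img[0]) if img else 0
    spans = []
    start = None
    for x in range(width):
        if column_has_ink(img, x):
            if start is None:
                start = x  # a new glyph begins here
        elif start is not None:
            spans.append((start, x - 1))  # glyph ended at previous column
            start = None
    if start is not None:
        spans.append((start, width - 1))  # glyph runs to the right edge
    return spans


def spread_glyphs(img, spans, gap=2):
    """Rebuild the image with `gap` blank columns between glyphs."""
    out = []
    for row in img:
        new_row = []
        for i, (start, end) in enumerate(spans):
            if i:
                new_row.extend([0] * gap)
            new_row.extend(row[start:end + 1])
        out.append(new_row)
    return out


washed = [
    [0, 0, 0, 0, 0, 0, 0],  # row where the matra used to be
    [1, 1, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
]
spans = split_glyphs(washed)
spaced = spread_glyphs(washed, spans, gap=2)
```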

Then, we can use these glyphs easily to make new box files and train Tesseract.
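For reference, a Tesseract 3.0 box file has one line per symbol in the form `symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image. Here is a small sketch of turning glyph bounding boxes into such lines. The helper names and the sample boxes are invented for illustration, and the bounding boxes are assumed to be in the usual top-left image coordinates as (x0, y0, x1, y1).

```python
# Sketch: write Tesseract-style box lines from glyph bounding boxes.
# Assumption: each bbox is (x0, y0, x1, y1) with the origin at the top-left
# of the image; box files use a bottom-left origin, so y is flipped.

def to_box_line(symbol, bbox, image_height, page=0):
    """Format one box line: symbol left bottom right top page."""
    x0, y0, x1, y1 = bbox
    bottom = image_height - y1
    top = image_height - y0
    return f"{symbol} {x0} {bottom} {x1} {top} {page}"


def make_box_file(symbols, bboxes, image_height, page=0):
    """Return the full box file text, one line per glyph."""
    if len(symbols) != len(bboxes):
        raise ValueError("need exactly one symbol per bounding box")
    lines = [
        to_box_line(symbol, bbox, image_height, page)
        for symbol, bbox in zip(symbols, bboxes)
    ]
    return "\n".join(lines) + "\n"


# Two made-up glyph boxes on a 30-pixel-high image.
box_text = make_box_file(
    ["ক", "ল"],
    [(2, 5, 12, 25), (16, 5, 27, 25)],
    image_height=30,
)
```

The symbols still have to be supplied by hand (or corrected in a box editor); the code only takes care of the coordinates.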

Also, Tesseract 3.0, combined with a few filters (see here), seems to work nicely with individual words at this level.

One sample input :

Final output :

At the word level, that's a whopping 100% accuracy on this sample, with just a few filters and a slight increase in resolution. This shows the mettle of Tesseract 3.0, and gives hope for Indic OCR.
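The exact filters are the ones linked earlier and are not reproduced here. Purely as a stand-in to show the shape of such a step, the sketch below upscales a grayscale image by nearest-neighbour repetition and then applies a fixed threshold. Both choices, the function names, and the threshold of 128 are assumptions for the example, not what was actually used.

```python
# Sketch: enlarge a grayscale image and binarise it before OCR.
# Assumption: gray is a list of rows of 0..255 values (0 = black).

def upscale_nearest(gray, factor):
    """Repeat every pixel `factor` times horizontally and vertically."""
    out = []
    for row in gray:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def binarise(gray, threshold=128):
    """Map pixels below the threshold to black (0), the rest to white (255)."""
    return [[0 if p < threshold else 255 for p in row] for row in gray]


sample = [
    [30, 200],
    [220, 90],
]
big = upscale_nearest(sample, 2)
clean = binarise(big)
```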

However, as an insight, this project is turning out to be way more research-oriented than I initially thought. That makes it all the more interesting.

On a more personal note, I will be visiting Kolkata for a week, from 14th to 22nd July. I will continue with the work but may not be able to update the blog regularly. I can still be reached by email.

And with that, it's about time to call it a night.

So Long.

Monday, July 8, 2013

image wash - working
