BookWorm - the BongOCR: filters


Thursday, September 5, 2013

importance of color profiles

Two days ago, talking to a professor made me realise how useful HSI profiles could be. I have spent the time since then looking into them, and into whether they could help us reach our goal.
The basic idea is: we could jump straight to the I (intensity) parameter and, depending on its value, infer whether any useful information can be retrieved from the data. In theory, this could also lead to a probabilistic model that assigns weights to different training samples.
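To make the idea a little more concrete, here is a minimal sketch using NumPy. It uses the standard HSI definition of intensity, I = (R + G + B) / 3. The `sample_weight` function and its percentile cut-offs are purely my own illustrative assumptions, not a tested model.

```python
import numpy as np

def intensity_channel(rgb):
    """Return the HSI intensity I = (R + G + B) / 3, scaled to [0, 1].

    rgb is an (H, W, 3) array of 0..255 values.
    """
    rgb = np.asarray(rgb, dtype=np.float64) / 255.0
    return rgb[..., :3].mean(axis=2)

def sample_weight(rgb):
    """Hypothetical weight in [0, 1] for a training sample.

    Assumption: a sample is only useful if its intensities are spread
    out (dark ink on bright paper). The 5th/95th percentiles are an
    illustrative choice that ignores a few outlier pixels.
    """
    i = intensity_channel(rgb)
    dark, bright = np.percentile(i, [5, 95])
    return float(np.clip(bright - dark, 0.0, 1.0))
```

A nearly uniform image would get a weight close to 0, and a crisp black-on-white scan a weight close to 1. With PIL, the input array could come from `np.asarray(Image.open(path).convert("RGB"))`.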
However, while very interesting, HSI profiling seems to be a vast topic in itself. I am back to my senses, and will not spend any more time on it for now. It's time to focus on wrapping up the things at hand. So Long.

Monday, July 22, 2013

Mid-Term Summary : Part 3 - working on the BookWorm

My working Methods.

This post probably needs a bit more structure, and hence, I will use some formatting styles Blogger provides.

The work I have done, and how I have done it, during the previous weeks can be summarised in the following steps :

Objective A. Get practical with Tesseract

I started the project by getting a first-hand feel of Tesseract. From all the reading I had done by this point, I was familiar with the working theory, but actually trying out the software gave me a better insight into the current scenario.
Another valuable lesson was that all my reading, and the previous work, was based on earlier versions of Tesseract. Tesseract 3.02 is a major upgrade as far as BookWorm is concerned: it already has some base support for Indic languages, and the training method has also changed a bit from the previous versions (Tesseract 2.x).
I updated my plan several times along the way, and finally decided to form the matrix mentioned in my previous post. I also finalised my version of the shirorekha chopping algorithm, the WhiteWashing Algorithm.
Also, discussions with +Abhishek Gupta made it clear that our projects are closely related: his project is like a post-processor for OCR, while mine is a pre-processor.

Objective B. Decide on Imaging Libraries

After familiarising myself practically with Tesseract, I tried out different imaging libraries in Python, and finally decided to use PIL and Scikit-Image. Some entries about the same can be found here.

Objective C. WhiteWashing Algorithm

Now, having made sure that the chosen libraries were suitable and sufficient for my purpose, I moved on, and started working on the WhiteWashing Algorithm.
The coding for whitewashing an image is now complete. After the midterm, I will turn my focus back to the filters, and start with forming the matrix.

Monday, July 15, 2013

testing with close-typed document

As planned, I tested the algorithm with some image samples of Bengali newspapers, found through a simple Google Image search.
The original image I used had a lot of noise; here is a preview :

Here, we can see that it has a lot of salt-and-pepper style noise, and the resolution is not particularly good.

Trying to whitewash it as it is yielded poor results.

It can be seen that, due to the noise, the algorithm fails to detect the individual lines. So I applied an adaptive thresholding filter to the image, and the denoised image looked like this :

Applying whitewashing to this as input gave the expected result :

This shows that the algorithm works in general, once the input is clean enough.

However, I still had to reset the padding values manually, which is something I need to resolve.
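For reference, the adaptive thresholding step mentioned above can be sketched roughly as follows. This is a plain NumPy local-mean threshold, written only to illustrate the idea; it is not the exact filter or the parameters I used, and `block_size` and `offset` are illustrative guesses. Scikit-Image ships its own local thresholding helper (`threshold_local` in recent versions), which is the more sensible choice in practice.

```python
import numpy as np

def adaptive_threshold(gray, block_size=15, offset=10):
    """Binarise with a local-mean threshold: True = paper, False = ink.

    gray is a 2-D array of 0..255 values. Each pixel is compared with
    the mean of the block_size x block_size window around it, minus
    offset, so uneven lighting does not swallow whole regions.
    """
    if block_size % 2 == 0:
        raise ValueError("block_size must be odd")
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    k = block_size
    # Repeat the border pixels so every window is full-sized.
    padded = np.pad(gray, k // 2, mode="edge")
    # Integral image with a leading zero row and column: any window
    # sum then needs only four lookups.
    ii = np.pad(padded, ((1, 0), (1, 0)), mode="constant")
    ii = ii.cumsum(axis=0).cumsum(axis=1)
    window_sum = (ii[k:k + h, k:k + w] - ii[:h, k:k + w]
                  - ii[k:k + h, :w] + ii[:h, :w])
    local_mean = window_sum / (k * k)
    return gray > local_mean - offset
```

Isolated salt-and-pepper specks that survive the threshold would still need a separate clean-up pass, for example a small median filter.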

As the midterm approaches, I have a lot at hand and on my mind.

Here is my new plan, in brief, and in the order I plan to carry them out :

To Do :

Document Everything done so far. (2 notebooks - ImageTest with filters and WhiteWashing algorithm)
Update and Publish the documents.
Document effects of domain-dependency (like this post deals with close-typed documents) and useful filters.
Document summarising what I learnt during this period, and what made me change my initial plan.
Test with different resolutions. (After mid-term)
Merge the notebooks.
Document.

I think I will stop coding for a while and get started with the documentation now. I will have to discuss the same with my mentor.

So Long.

Monday, July 8, 2013

image wash - working

So, I fixed the code that wasn't working yesterday (putpixel). The input and output currently look like this:
Original :

Output (NOTE : the black line is actually intended to be white, it's black here just for visualisation purposes)

Now, here we can see that the whitewashing code works. But the line was supposed to overlap with the "matra", and it does not, which indicates a bug in the code that finds the location of the matra.
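For readers who have not seen the idea before, one common way to locate a matra is a horizontal projection profile: count the ink pixels in every row of a text line and take the densest band. The sketch below is only that generic technique, not my actual code. The function names and the `min_fill` parameter are made up for illustration, and it assumes the input is a binarised image of a single text line.

```python
import numpy as np

def find_matra_rows(ink, min_fill=0.8):
    """Return (top, bottom) row indices of the densest horizontal band.

    ink is a 2-D boolean array for ONE text line (True = ink).
    Assumption: the matra is the row with the most ink, plus any
    adjacent rows whose ink count is at least min_fill of that peak.
    Returns None for a blank strip.
    """
    ink = np.asarray(ink, dtype=bool)
    profile = ink.sum(axis=1)            # ink pixels per row
    peak = int(profile.argmax())
    if profile[peak] == 0:
        return None
    cutoff = min_fill * profile[peak]
    top = bottom = peak
    while top > 0 and profile[top - 1] >= cutoff:
        top -= 1
    while bottom < len(profile) - 1 and profile[bottom + 1] >= cutoff:
        bottom += 1
    return top, bottom

def whitewash(ink, rows):
    """Return a copy of ink with the rows top..bottom cleared."""
    top, bottom = rows
    out = np.array(ink, dtype=bool)
    out[top:bottom + 1, :] = False
    return out
```

Plotting `profile` next to the image is a quick way to check whether the detected band really sits on the matra, which is exactly the kind of looking at the data I had been avoiding.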

I was following something I learnt in Prof. Mostafa's class (which I completed in June 2013), and NOT looking at the data before making the model, but it seems I will have to break the pattern and delve deeper.

So, next thing on the list :

Go back and fix the bug in code to find the matra.

Also, the matra clipper was originally intended to work as a pre-processing module for Tesseract. But having worked with Tesseract 3.0 for real, I have changed my plan.

Now I intend to use it to whitewash the matras, and then introduce gaps between every character, making them into individual glyphs.

Then, we can use these glyphs easily to make new box files and train Tesseract.
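A rough sketch of how that could look, assuming the matra has already been whitewashed so that neighbouring characters no longer touch. It splits the line wherever a column has no ink, then writes one box-file line per glyph in the Tesseract 3.x layout (`symbol left bottom right top page`, with the origin at the bottom-left of the image). The function names are hypothetical, the labels have to be supplied by hand, and real Bengali text with conjuncts and vowel signs will not always split this cleanly.

```python
import numpy as np

def glyph_spans(ink):
    """Return [(x0, x1), ...] column ranges (x1 exclusive) containing ink."""
    has_ink = np.asarray(ink, dtype=bool).any(axis=0)
    spans, start = [], None
    for x, filled in enumerate(has_ink):
        if filled and start is None:
            start = x                    # a glyph begins
        elif not filled and start is not None:
            spans.append((start, x))     # a gap ends the glyph
            start = None
    if start is not None:
        spans.append((start, len(has_ink)))
    return spans

def box_lines(ink, labels, page=0):
    """Build box-file lines for a whitewashed line image.

    ink is a 2-D boolean array (True = ink); labels holds one symbol
    per detected glyph, in left-to-right order.
    """
    ink = np.asarray(ink, dtype=bool)
    height = ink.shape[0]
    spans = glyph_spans(ink)
    if len(spans) != len(labels):
        raise ValueError("expected %d labels, got %d" % (len(spans), len(labels)))
    lines = []
    for (x0, x1), label in zip(spans, labels):
        rows = np.flatnonzero(ink[:, x0:x1].any(axis=1))
        y0, y1 = int(rows[0]), int(rows[-1]) + 1
        # Box files measure y upwards from the bottom edge.
        lines.append(f"{label} {x0} {height - y1} {x1} {height - y0} {page}")
    return lines
```

The spans also say where extra blank columns could be inserted if the glyphs need to sit further apart before training.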

Also, Tesseract 3.0, combined with a few filters (see here), seems to work nicely with individual words at this level.

One sample input :

Final output :

At the word level, that's a whopping 100% accuracy on this sample, with just a few filters and a slight increase in resolution. This proves the mettle of Tesseract 3.0, and gives hope for Indic OCR.

However, as an insight, this project is turning out to be way more research-oriented than I initially thought. That makes it all the more interesting.

On a more personal note, I will be visiting Kolkata for a week, from 14th to 22nd July. I will continue with the work, but may not be able to update the blog regularly. I can still be reached by email.

And with that, it's about time to call it a night.

So Long.