Monday, July 29, 2013

minor tweaks and documentation

This past week, I have been working on tweaking the code to optimise it at various places.
Now, it checks the input image mode and converts to grayscale only if needed, unlike earlier, when it always converted. A pretty simple but important change, and one I had been thinking about for some time; at long last I managed to get around to it.
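The check itself is tiny; a minimal sketch with PIL (the mode test is the whole point, the function name is just illustrative):

```python
from PIL import Image

def to_grayscale(img):
    # convert only when the image is not already 8-bit grayscale (mode "L")
    if img.mode != "L":
        return img.convert("L")
    return img
```

Already-grayscale images are returned untouched, which skips a needless copy.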
Another notable change is how the directories work. Earlier, I was using explicit, absolute paths for the input files; now they are built as variable-based relative paths. Next to do : take the path as an argument input, but I will do that when I translate everything from the notebooks to *.py files at the end.
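In sketch form, the change amounts to something like this (the directory name is hypothetical, not the repo's actual layout):

```python
import os

# hypothetical folder name; adjust to match the actual repository layout
DATA_DIR = "samples"

def sample_path(name):
    # variable-based relative path instead of a hard-coded absolute location
    return os.path.join(DATA_DIR, name)
```

Switching the variable later to a command-line argument then touches only one place.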
Also, I spent some time documenting the existing code; though it is split across three notebooks, they all carry the documentation in the manner I intended, for now. Later, when merging the notebooks, I will use the respective parts (cell blocks) from the different notebooks to form the final script.
So Long.

Tuesday, July 23, 2013

Mid-Term Summary : Part 5 - using the notebooks

This post is meant to guide anyone who wants to use the notebooks.
So far, I have pushed the notebooks (JSON format) to the repository directly. To have a look at the code, follow the instructions below.


  1. Make sure you have the required software and libraries installed. A list of the same can be found here.
  2. Make sure you have IPython and notebook support. Installation instructions can be found here.
  3. Get the notebooks (*.ipynb) from the repository. Preferably, do a fork/clone to let us know you are testing out the code. 
  4. Import the notebooks in your browser session or open directly with EPD.
p.s. - I will provide everything in native python (*.py) format at the end of the project. 

Mid-Term Summary : Part 4 - Results so far

This post is all about the various results I have had so far, including both the good ones (obviously) and the bad ones that compelled me to update my plans.

1. Tesseract
To start with, I am happy with the decision to use Tesseract 3.02. It's effective, not difficult to set up or use on any platform, and the training procedure, though long and tedious, is not very complex once you get the hang of it.

2. Environment
Next, my decision to use EPD and IPython has also been helpful. In fact, I love Python more than ever now, and think that using the Notebook is a really good option when you are concerned with heavily documenting your work.

3. Test Results
Resolution : Contrary to my initial hypotheses, tests suggest resolution does not affect the outcome much. However, more extensive testing needs to be done. This also inspired me to focus harder on the image filters.

Image Filters : So far, bilateral filters and adaptive thresholding seem the most effective on all kinds of documents. For the final matrix, the document-specific effects of the filters need to be sorted out.
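As a rough illustration of what adaptive (local) thresholding does, here is a minimal NumPy sketch using a local-mean threshold; the block size and offset are illustrative values, and this is a stand-in, not the project's actual filter code:

```python
import numpy as np

def adaptive_threshold(gray, block=15, offset=10):
    """Local-mean binarisation: a pixel is set white if it is brighter than
    the mean of its block x block neighbourhood minus `offset`.
    (Illustrative stand-in for the adaptive thresholding discussed above.)"""
    pad = block // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    out = np.zeros(gray.shape, dtype=np.uint8)
    for y in range(gray.shape[0]):
        for x in range(gray.shape[1]):
            local = padded[y:y + block, x:x + block]
            out[y, x] = 255 if gray[y, x] > local.mean() - offset else 0
    return out
```

Because the threshold follows the local neighbourhood, uneven lighting and background noise hurt far less than with a single global cutoff.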

WhiteWashing Algorithm : The tests show that the code works as intended. 

Box-Files and traineddata : The initial box files I generated were devastating, but with the workaround of using Fedora, they have improved. The traineddata I generated is not yet an improvement over the stock data; I will continue to work on it.


Monday, July 22, 2013

Mid-Term Summary : Part 3 - working on the BookWorm

My working methods.

This post probably needs a bit more structure, so I will use some of the formatting styles Blogger provides.
The work I have done during the previous weeks, and how I have done it, can be summarised in the following steps : 

Objective A. Get practical with Tesseract

I started the project by getting a first-hand feel of Tesseract. From all the reading I had done by this point, I was familiar with the working theory, but actually trying out the software gave me a better insight into the current scenario.
Another valuable lesson learnt: all the reading and previous work was based on earlier versions of Tesseract, but Tesseract 3.02 is a major upgrade as far as BookWorm is concerned. It already has some base support for Indic languages, and the training method has also changed a bit from the previous versions (Tesseract 2.x).
I updated my plan several times along the way, and finally decided to form the matrix mentioned in my previous post. I also finalised my version of the shirorekha-chopping algorithm, the WhiteWashing Algorithm.
Also, discussions with +Abhishek Gupta made it clear that our projects are closely related: his project is like a post-processor for OCR, while mine is a pre-processor.

Objective B. Decide on Imaging Libraries

After familiarising myself practically with Tesseract, I started trying different libraries in Python, and finally decided that I will be using PIL and Scikit-Image. Some entries about the same can be found here.

Objective C. WhiteWashing Algorithm

Now, having made sure that the chosen libraries were suitable and sufficient for my purpose, I moved on and started working on the WhiteWashing Algorithm.
Now that the coding for whitewashing an image is complete, after the midterm, I will turn my focus back to the filters, and start with forming the matrix.

Mid-Term Summary : Part 2 - Shaping BookWorm

Approach :
I decided to tackle the problem of the shirorekha in connected scripts for the first half of the summer, and the result is the WhiteWashing Algorithm. It detects the shirorekha in a script and removes it, making it easier to separate the "akkhars" (alphabets) in a "shabda" (word).
Next, I am working on developing a matrix to help determine the best-suited filtering method for making a document OCR-ready. It will give specific instructions for a document based on its type. An early example of the same is here.
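The core idea can be sketched in a few lines of NumPy (an illustration of the approach as described above, not the project's actual code): in a binarised word image, the shirorekha is the row with the most ink, and whitewashing paints it over.

```python
import numpy as np

def whitewash(binary):
    """Find the headline (shirorekha) as the row with the most ink
    (0 = ink, 255 = background) and overwrite it with white."""
    ink_per_row = np.count_nonzero(binary == 0, axis=1)
    headline = int(np.argmax(ink_per_row))
    washed = binary.copy()
    washed[headline, :] = 255
    return washed, headline
```

A real document needs this per text line rather than per image, and a padding band rather than a single row, but the projection-and-argmax step is the heart of it.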

Tools Used :

  • Language : Python 2.7
  • Development Environment : 
    • most of the work is done on EPD 1.0.1 on Windows 7 (32-bit), using IPython notebooks.
      • However, I had to switch to Linux (Fedora 14 on VMware Workstation 9.0) for a while, as I was unable to make pango work on my Windows setup.
    • Notepad++ 6.3.2
  • Software : 
    • Tesseract 3.02 on the command line (NOTE : only 3.0 is available on Fedora 14, but that did not affect me, as I switched back to Windows for the parts that use Tesseract).
    • Python Libraries : PIL and scikit-image
  • Repository : GitHub

Sunday, July 21, 2013

Mid-Term Summary : Why BookWorm

I have paused the coding for a while, and have been concentrating on documentation. A part of the same is this multiple-part entry to summarise my thoughts and efforts to help with the mid-term evaluation.
This interlude begins here. 
  • Part 1 - Background and Motivation

Tesseract has undoubtedly been the most prominent open-source OCR tool, and in the past few years, support for many languages has been incorporated into it. The most notable of these (to me personally) are :
  1. The Eutypon Project
  2. Debayan's work on IndicOCR
  3. Shirorekha chopping for Hindi
My motivation for taking up this project is rooted in the fact that Bengali is my mother tongue, and I have first-hand experience of situations where an OCR for Bengali would have helped me.
With this project, my aim is to help digitize the huge literary heritage of the language and, at the same time, make the language itself more easily accessible to people across the world.

So I read up on the existing projects, talked to some of the awesome people who originally took this initiative, and decided to go forward and try to improve OCR accuracy for Bengali. And thus began the BookWorm.

Thursday, July 18, 2013

some quick training

For the last two days, I decided to go with my impulse and try generating some training data. I now have a good grasp of the Tesseract commands for generating the required files, but still need a deeper understanding of the clustering part. Amazingly, the long and tedious training procedure in the original documentation doesn't look so difficult after staring at it for almost 3 months.
However, after testing with the new data, no miracle happened; while it managed to detect the language correctly, the accuracy wasn't worth mentioning.
Next step, I will generate some training data with the WhiteWashing algorithm and see how it fares.
But now, off to documentation.
So Long.

Monday, July 15, 2013

testing with close-typed document

As planned, I tested the algorithm with some image samples of a Bengali newspaper, found via a simple Google Image search.
The original image I used had a lot of noise; here is a preview :

Here, we can see that it has a lot of salt-and-pepper noise, and the resolution is not particularly good.
Trying to whitewash it as-is yielded poor results.
Due to the noise, the algorithm fails to detect the individual lines. So I applied an adaptive-thresholding filter on the image, and the denoised image looked like this :
Applying whitewashing to this input gave the expected result :
This confirms that the algorithm works in the general case.
However, I still had to manually reset the padding values, which needs to be resolved.

As the midterm approaches, I have a lot at hand and on my mind.
Here is my new plan, in brief, and in the order I plan to carry them out :
To Do : 
  1. Document everything done so far. (2 notebooks - ImageTest with filters and the WhiteWashing algorithm)
  2. Update and Publish the documents.
  3. Document effects of domain-dependency (like this post deals with close-typed documents) and useful filters.
  4. Document summarising what I learnt during this period, and what made me change my initial plan.
  5. Test with different resolutions. (After mid-term)
  6. Merge the notebooks.
  7. Document. 
I think I will stop coding for a while and get started with the documentation now. I will have to discuss the same with my mentor.
So Long.


Sunday, July 14, 2013

WhiteWashing and box files

Finally, I have fixed the padding-values for the WhiteWash algorithm and it seems to be working nicely.
Here is a preview of the current output :
However, the padding uses fixed integer values right now. It would be better if I could somehow relate them to the document statistics; that should be interesting to try later.
This success with the algorithm lets me focus on box file generation again. Currently, I manage to generate these using my workaround in the virtual machine, after tweaking Debayan and Sayamindu's original pango-based script.
The box file generated now looks like :

This is an improvement over the last time. 
Now, I have to combine these two methods to try out my plan. 
Next, I will make an entry sometime this week about my updated plans and goals for the midterm evaluation period.
So Long.

Thursday, July 11, 2013

WhiteWashing Algo - Working

It took a bit longer than expected, but I have finally fixed the script to detect the shirorekha and whitewash over them.
I will write more about what I was doing wrong and how I fixed it later.
Sample Input :
 Sample Output : (NOTE : The black highlights show the marked coordinates)
Actual whitewashed Output :
On a closer look, we can see :
Here, the white lines are too thin. Hence, my next step is to find appropriate padding values around the coordinates when doing the whitewash.
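One way to add such padding, sketched with NumPy (the pad width here is illustrative, not a tuned value, and this is not the project's actual code):

```python
import numpy as np

def whitewash_padded(binary, row, pad=2):
    """Wash a band of rows around the detected headline instead of a
    single 1-pixel line; `pad` extra rows are cleared on each side."""
    washed = binary.copy()
    top = max(row - pad, 0)                       # clamp at the image top
    bottom = min(row + pad + 1, binary.shape[0])  # clamp at the image bottom
    washed[top:bottom, :] = 255
    return washed
```

Tying `pad` to a document statistic such as the median stroke width would remove the manual tuning mentioned above.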

So Long.


Monday, July 8, 2013

manual fix - working proof of concept

So I looked into the CSV files and, using a pivot table, found the highest-intensity y-coordinates and the boundary coordinates for each line. With those coordinates, my code generates these two images.
Boundary markers :

Intended whitewashed output :
This shows that the algorithm is correct; however, there is something wrong in the way the code finds the maximum-intensity line. Currently, I am using the max( , key=itemgetter()) method to find the maximum-intensity element within each sublist. I will try to do it some other way.
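For reference, the max-with-itemgetter pattern mentioned above looks like this on hypothetical (y, count) data:

```python
from operator import itemgetter

# each sublist holds (y_coordinate, ink_count) pairs for one text line
rows = [(60, 3), (61, 48), (62, 12)]

# pick the pair with the highest ink count (index 1 of each tuple)
darkest = max(rows, key=itemgetter(1))
```

The pattern itself is sound, so the bug is more likely in how the sublists are built than in the selection step.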
So Long.


image wash - working

So, I fixed the code that wasn't working yesterday (putpixel). The input and output currently look like this:
Original :
Output (NOTE : the black line is actually intended to be white, it's black here just for visualisation purposes)
Here we can see that the whitewashing code works. But it was supposed to overlap with the "matra", which did not happen, so there must be a bug in the code that finds the location of the matra.
I was following something I learnt in Prof. Mostafa's class (which I completed in June 2013): NOT looking at the data before making the model. But it seems I will have to break that pattern and delve deeper.
So, next thing on the list : 
  1. Go back and fix the bug in code to find the matra.
Originally, the matra clipper was intended to work as a pre-processing module for Tesseract, but having worked with Tesseract 3.0 for real, I have a change of plan.
Now I intend to use it to whitewash the matras, and then introduce gaps between every character, turning them into individual glyphs.
We can then easily use these glyphs to make new box files and train Tesseract.
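Splitting the washed word into glyphs could then work off the vertical projection; a hedged NumPy sketch (column spans with no ink separate the characters; this is an illustration of the idea, not the project's code):

```python
import numpy as np

def split_glyphs(binary):
    """Return (start, end) column spans of individual glyphs in a
    binarised word image (0 = ink, 255 = background); blank columns
    between spans are the character gaps."""
    has_ink = np.count_nonzero(binary == 0, axis=0) > 0
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                 # a glyph begins
        elif not ink and start is not None:
            spans.append((start, x))  # a glyph ends at the blank column
            start = None
    if start is not None:
        spans.append((start, len(has_ink)))
    return spans
```

This only works once the shirorekha is gone, since the headline otherwise connects every column of the word.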
Also, the Tesseract 3.0, combined with a few filters (see here) seems to work nicely with individual words at this level.
One sample input :
Final output :
At the word level, that's a whopping 100% accuracy with just a few filters and a slight increase in resolution. It proves the mettle of Tesseract 3.0, and gives hope for Indic OCR.
However, as an insight, this project is turning out to be way more research-oriented than I initially thought. That makes it all the more interesting.

On a more personal note, I will be visiting Kolkata for a week, from the 14th to the 22nd of July. I will continue with the work but may not be able to update the blog regularly; however, I can still be reached through email.
And with that, it's about time to call it a night.
So Long.



Sunday, July 7, 2013

washed image test

Things are looking up since the last post.
The problems with the clipping code are fixed now, and it detects and writes the darkest pixels in each subdivision to a separate CSV (yes, I love CSVs). The modification it needed was to use the indexes instead of using the coordinates directly. That did the trick.
Also, while at it, I optimised the code and now everything is under 50 lines. Once again, Notebook has been a great help, and I'm convinced it was a good decision to switch from Eclipse+PyDev to EPD.
However, the file now looks like this :

Then, I tried to wash over those pixels with putpixel(255), i.e., a white line, but it doesn't seem to work.
Next on list :
  1. Fix the part to "wash" the pixels
  2. Talk to my mentor about whether it's okay to push notebooks directly (sadly, I don't know much about software packaging)
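For the record, Pillow's putpixel takes an (x, y) coordinate tuple followed by the pixel value; a minimal sketch of washing one row this way (not the project's code):

```python
from PIL import Image

img = Image.new("L", (5, 5), 0)
# putpixel takes an (x, y) tuple first, then the value;
# a common slip is swapping the argument order or the axes
for x in range(img.width):
    img.putpixel((x, 2), 255)  # white line across row y = 2
```

Passing a bare coordinate or (row, column) order silently targets the wrong pixels, which is the kind of failure that looks like "doesn't seem to work".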

Also, I have been thinking about the InfoRescue project, and I think I would modify my plans for the second half a bit. Will write more about that soon.
So Long.

Thursday, July 4, 2013

chop-chop

The work on the clipping code is going well. Thanks to Debayan for laying the ground-work for the algorithm.
It's complete to the point where it detects the lines as a whole and, for my convenience, prints the coordinates to a CSV in (line number, count) format. Here's an interesting glimpse :
The jump from line 66 to 124 represents a white gap, the space between the two lines.
However, it still needs some tuning. For instance, this :
The presence of 183 and 184 bothers me. It most probably represents some noise, and shows that the code still needs tuning.
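The grouping logic the CSV implies can be sketched like this (illustrative, not the actual clipping code): consecutive ink-bearing row indices form one text line, and a jump between indices marks the white gap.

```python
def group_lines(ink_rows):
    """Group consecutive ink-bearing row indices into (first, last) spans;
    a jump between indices (like 66 -> 124 above) is the gap between lines."""
    lines, current = [], [ink_rows[0]]
    for r in ink_rows[1:]:
        if r - current[-1] > 1:
            lines.append((current[0], current[-1]))  # close the finished line
            current = [r]
        else:
            current.append(r)
    lines.append((current[0], current[-1]))
    return lines
```

Dropping spans shorter than a couple of rows would be one simple way to filter out noise rows like 183/184.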
But I admit, IPython notebooks are a great way to modularise and optimise when experimenting with code. It only enhances my love for Python.
Also, my algorithm is enhanced so that it should (in theory) be able to distinguish straight and curved lines. I'll get started on that modification after completing the basic code.
So Long.

p.s. - I'm starting to get worried now, as I think I am a bit behind the schedule I had set for myself. So push, push, and chop-chop.

Wednesday, July 3, 2013

workaround for Pango

Having tried to make pango work on Windows for over a week now, I have finally given up and switched to Fedora.
On Fedora, it took a little over 3 minutes to install the dependencies and get everything up and running. More reasons to love Linux.
Anyway, the first trial of generating the images looks like this.
It is good in itself but still needs work, and I need to synchronise the same fonts between my Windows and Fedora environments; it was a silly mistake on my part not to take care of this already.
Coding for the clipping algorithm is about 30% done. Right now, I am looking into connected-component analysis to see if it will work.
I plan to fix the fonts, and then work on the connected-component for the rest of the week.
So Long.