BookWorm - the BongOCR: June 2013

Friday, June 28, 2013

roadblock

I've hit a roadblock.
While working on a new, hopefully improved, training set, I've been trying to run a script that generates the box files. It needs two libraries, Cairo and Pango, to work.
I'm having trouble getting Pango to work on my Windows machine.
I'll spend another day or two on it, and if it still doesn't work, I will have to switch.
It's going to be a long night.

Wednesday, June 26, 2013

more tests and results

As planned, I ran some denoising functions on the images (NOTE: conversion to float was needed).
After this, I compared the Tesseract outputs on the 5 images side by side. The denoising filters I tried are:

Bilateral filter: this seems to give improved results.
Total variation filter using split-Bregman: no improvement.
Total variation filter using Chambolle: very little improvement.
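
For reference, here is a minimal sketch of how these three filters can be called in scikit-image. It is not the script from the repo: the synthetic image stands in for a scanned page, and the parameter values are illustrative guesses rather than the settings used in the tests above.

```python
import numpy as np
from skimage import img_as_float
from skimage.restoration import (
    denoise_bilateral,
    denoise_tv_bregman,
    denoise_tv_chambolle,
)

# Stand-in for a scanned page: a synthetic 8-bit grayscale image with noise.
rng = np.random.RandomState(0)
page = np.full((32, 32), 255, dtype=np.uint8)
page[12:20, 4:28] = 0  # a dark bar playing the role of a stroke
noise = rng.randint(-40, 41, size=page.shape)
noisy_uint8 = np.clip(page.astype(int) + noise, 0, 255).astype(np.uint8)

# The conversion to float mentioned above: values are rescaled to [0, 1].
noisy = img_as_float(noisy_uint8)

# Parameter values below are assumptions, not tuned settings.
bilateral = denoise_bilateral(noisy, sigma_color=0.1, sigma_spatial=2)
bregman = denoise_tv_bregman(noisy, weight=5.0)
chambolle = denoise_tv_chambolle(noisy, weight=0.1)
```

Each call returns a float image of the same shape as the input, so the three results can be saved and passed to Tesseract one after another for a side-by-side comparison.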

The sample outputs can be seen here.

Bilateral filters

Chambolle filters

Just as a reminder, the original image looked like this:

Next up: some experiments with the resolution, before moving on to work with individual characters from next week. So Long.

As a side note, I also tried the same procedure with another test image using a separate font, but as expected, it did not perform well.

Monday, June 24, 2013

profile conversion and library tests

After spending a few days looking into different PIL- and ImageMagick-based libraries, I have finally settled on what seems to be the most user-friendly and awesome among them: Scikit-Image.
Having used Scikit-Learn briefly, I was aware of the popularity of the scikit-* family of libraries, but was unaware of this particular member.

After 3 days (or rather nights), I have figured out the data format and conversion techniques, and have pushed a sample script to the repo. It converts a test TIFF from RGB to grayscale and then performs the following operations on it:

Canny Edge Detector
Sobel Operator
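
A rough sketch of that pipeline in scikit-image is shown here. It is not the script from the repo: a synthetic RGB array replaces the test TIFF so that the example runs on its own, and the sigma value is an assumption.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import canny
from skimage.filters import sobel

# Synthetic RGB stand-in for the test TIFF: a white square on black.
# A real file could be loaded with skimage.io.imread instead.
rgb = np.zeros((64, 64, 3), dtype=np.uint8)
rgb[16:48, 16:48, :] = 255

gray = rgb2gray(rgb)                  # 2-D float image with values in [0, 1]
edges_canny = canny(gray, sigma=1.0)  # boolean edge map (sigma is a guess)
edges_sobel = sobel(gray)             # float gradient magnitude
```

Canny gives a thin boolean outline, while Sobel gives a continuous edge-strength image, so the two outputs need different handling when they are displayed or saved.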

Here is how the output looks:

However, I still need to work on the image viewer. Presently, I am waiting on Travis. It is doing something (not sure what or why) as I write this blog.

The new To Do list:

Try increasing size of the files.
Try out the output on Tesseract.
Optimise the method for reading the file.
Optimise the overall code.
Learn more about using the Notebooks.

But it's time to hit the bed now, so the new experiments will have to wait for another night. So Long.

Wednesday, June 19, 2013

quick peek into the data.

It just occurred to me that even if I do not get the correct information about the dataset, I should still be able to use it.
Given the vast amount of data, it can always be split into training and test data using k-fold cross-validation (maybe 4-fold). But I will have to make sure all the samples use the same font.
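
A minimal sketch of such a 4-fold split, using only the standard library. The sample names are placeholders, since the real dataset layout is not known yet.

```python
import random


def k_fold_splits(samples, k=4, seed=0):
    """Yield (train, test) lists so that each sample is tested exactly once."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled samples into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Placeholder sample names, not real files from the dataset.
samples = ["sample_%02d" % i for i in range(10)]
splits = list(k_fold_splits(samples, k=4))
```

Each round trains on three folds and tests on the fourth, so every sample is used for testing once and for training three times.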

Tuesday, June 18, 2013

making box files

After trying for a few days, I succeeded in making a few box files today.
First, a view of the excerpt I used. I came across this piece when searching for Bengali pangrams. As far as I can tell at first glance, this passage satisfies my purposes.

So I started out and made a simple box file, which, after lots of manual corrections, looks like this.
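
For readers who have not seen one: each line of a Tesseract box file holds a symbol followed by the left, bottom, right and top pixel coordinates of its box (measured from the bottom-left corner of the image) and a page number. Here is a small sketch for reading and writing such lines, which is handy when making manual corrections. The helper names and the sample coordinates are made up for illustration.

```python
from collections import namedtuple

Box = namedtuple("Box", "symbol left bottom right top page")


def parse_box_line(line):
    """Parse one 'symbol left bottom right top page' line into a Box."""
    # Split from the right so the symbol is whatever precedes the 5 numbers.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def format_box_line(box):
    """Turn a Box back into a box-file line."""
    return "%s %d %d %d %d %d" % box


sample_line = "ম 10 20 42 58 0"  # hypothetical coordinates
box = parse_box_line(sample_line)
width = box.right - box.left
```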

Now, this makes clear that I have to do the shirorekha-chopping (removing the horizontal headline that joins the characters of a word) even before making box files. Otherwise, whole words are detected as a single blob. I'll try another box-file maker to see if the problem persists.
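
The single-blob effect can be reproduced on a toy image. This sketch is only an illustration, not the planned pre-processing algorithm: the "word" is three synthetic strokes, and the headline is removed by the naive assumption that it is the row containing the most ink.

```python
import numpy as np
from skimage.measure import label

# Toy binary "word": three vertical strokes joined by a headline on top.
word = np.zeros((10, 15), dtype=np.uint8)
word[2, 1:14] = 1        # the shirorekha (headline)
for col in (3, 7, 11):
    word[2:9, col] = 1   # one stroke per "character"

# With the headline in place, everything is one connected component.
_, blobs_with_headline = label(word, connectivity=2, return_num=True)

# Naive removal (an assumption): blank the row with the most ink.
headline_row = int(np.argmax(word.sum(axis=1)))
chopped = word.copy()
chopped[headline_row, :] = 0
_, blobs_without_headline = label(chopped, connectivity=2, return_num=True)
```

On real scans the headline is several pixels thick and rarely perfectly horizontal, so a usable version would need more than a single-row cut.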

If it does, that could be a problem, as it means I have to stop the training procedure until my pre-processing module is complete.

On the other hand, I spoke to Abhishek Gupta today, and he helped me out regarding the datasets I have been thinking about. But it seems they do not have the information about the "font" associated with them. However, I'll have a look at them over the week.

Also, the trouble with NPP++ is fixed, though it does not take the specific encoding by default, and since I change it manually, it asks to save the file, which is a bit irritating.
So Long.

Saturday, June 15, 2013

More Tests

So, after yesterday's bleak attempt, I decided to test the system on a random image of Bengali script, and fed it one of the results of a simple Google image search for "Bengali text": a full-page excerpt.

The result was terrible.

The noise level in the image was very high compared to the previous test sample, and the font was obviously different. But Tesseract still managed to identify the language and a few of the characters.

The work on the pre-processing algorithm is going well, and I can start coding as soon as I learn more about the imaging library. I'll go through the previous work once again to look for some more help in this regard.
However, I still haven't figured out a way to make NPP++ display Bengali characters.

So Long.

Friday, June 14, 2013

Testing the Tesseract.

After 10 days, I am back, and with lessons to be shared.
First of all, the very much expected mention about the "Delhi Heat". It is still burning.

However, from the corner of my couch, I have been enjoying the mangoes and messing with the Tesseract.
Starting with a simple test, I typed something in Bengali, and converted it to an image, then fed it to Tesseract. As expected, BOOM! the result was devastating.
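
For reference, the "feed it to Tesseract" step can be scripted roughly as follows. This is a sketch that assumes the tesseract command-line tool is on the PATH and that the Bengali language data ("ben") is installed. The function names and file names are hypothetical.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="ben"):
    """Build the command line: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path, output_base, lang="ben"):
    """Run Tesseract and return the recognised text.

    Tesseract writes its result to <output_base>.txt.
    """
    subprocess.check_call(tesseract_command(image_path, output_base, lang))
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


# Example call (needs a real image and a Tesseract install):
# text = run_tesseract("test.png", "out")
```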

The interesting similarity in both errors got me curious, and for reasons yet unknown, I mapped the input and output directly. A pretty simple thing to do actually.

The orange-ish blobs show the corresponding mappings. Note the similarity between the shape of "ম " and "W" , for example, and also, that "ম " is almost at the mean position for the particular blob.
This shows that the errors are clearly NOT random, which gives us some insight into them.
(NOTE to Self : This could be useful for the post-processing module.)
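
One possible way to keep track of such mappings is to count how often each expected character comes out as something else. This is only a sketch: of the pairs listed, just the "ম" to "W" confusion comes from the test above, and the repeat count and the correct pair are placeholders.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (expected, recognised) pairs where the two differ."""
    return Counter((exp, rec) for exp, rec in pairs if exp != rec)


# Placeholder data: only the ম -> W confusion was actually observed.
pairs = [("ম", "W"), ("ম", "W"), ("ক", "ক")]
counts = confusion_counts(pairs)
most_common = counts.most_common(1)
```

A table like this, built from many test images, could later drive substitution rules in the post-processing module.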

But then, having looked around a bit for the box files now, and after fixing some configurations, I could get it to work a bit more decently.

However, these are just tests, and I am not going to push them to the repo yet.

Now time to work up a new "To Do" list :

Finalize pre-processing algorithm and move/share from Evernote and through the BookWorm repository.
Try to get support for "Bengali" in NPP++. - Would make life a bit easier :)
Decide on and start with ImageMagick or something similar.
Test my basic image on different versions of Tesseract (OPTIONAL) - still looking into the differences in detail, and they look very different indeed.

Also, I tried to contact Abhishek Gupta, who is working on a similar project, to have a chat. It should be interesting to talk with him, as I feel our projects are closely related. Maybe I will be using the same dataset, and that should be the benchmark! Maybe my insight into the "errors" mentioned above might help him in formulating his 'error model'.

We shall see what happens when he replies to my email.

So Long.

Thursday, June 6, 2013

Canopy and CI

The days since the last post have been interesting and busy, between preparations for BookWorm, trying to understand CI and finishing up the LFD course, not to mention the numerous meetups before everyone departs for the summer.
But things are looking up today. The LFD final exam is 80% done, and hopefully will be complete by this time tomorrow. And Travis-CI has been set up and linked with the BookWorm repository. The current commits give an error, but that is most probably due to the lack of any actual code in there. (NOTE TO SELF: needs more reading).
However, excitement of going home after almost 10 months makes me jumpy, and reading about the weather in Delhi is not helping at all.

Among the good things, 3 things have been crossed off the last 'To Do' list.

Done :

Understand the training procedure for Tesseract 3.02. - pretty clear view inside the head
Make sure no documents are left to be submitted to Melange. - Done
Decide whether to use IPython or not. - not sure about IPython, but definitely switching to Canopy. Almost in love with EPD now.

To Do :

Start making the box files. - looks like it will have to wait till I get to Kolkata next week, as I think it's better to complete the process without a disruption in the flow.
Pack bags for home. - Friday night/Saturday early morning should be the time.
Read about the build issue with CI, though according to their blog, a documentation change should skip the build process. It looks like the problem is with the build configuration; hopefully it will be clearer with the next commit.

Monday, June 3, 2013

beginning the BookWorm

Summer 2013 has arrived. And so begins the Summer of Code.

Today marks a week since the results were announced, and with help from my mentors at Ankur India, we have set up a working repository for the BookWorm.
In the meantime, I have been busy making sure I submit the necessary documents to Melange, while trying to get familiar with IPython, especially the notebooks, which could be helpful for documenting the code, and also trying to wrap up matters at University.

As planned, I am currently reading the "training the Tesseract" procedure, and trying to understand the differences in dependencies between 2.0x and 3.0x, which seem to have changed a lot.
I hope to train Tesseract 3.02 and then start with my scripts for the pre-processing module soon.

The "To Do" list for this week looks like :

Understand the training procedure for Tesseract 3.02.
Make sure no documents are left to be submitted to Melange.
Decide whether to use IPython or not.
Start making the box files.
Pack bags for home. :)