After 10 days, I am back, with lessons to share.
First of all, the very much expected mention of the "Delhi Heat": it is still burning.
However, from the corner of my couch, I have been enjoying the mangoes and messing with Tesseract.
Starting with a simple test, I typed something in Bengali, converted it to an image, and fed it to Tesseract. As expected, BOOM! The result was devastating.
The interesting similarity between both errors got me curious, and for reasons yet unknown, I mapped the input directly onto the output. A pretty simple thing to do, actually.
The orange-ish blobs show the corresponding mappings. Note, for example, the similarity in shape between "ম" and "W", and also that "ম" sits almost at the mean position of its blob.
This shows that the errors are clearly NOT random, and the mapping gives us some insight into them.
(NOTE to self: this could be useful for the post-processing module.)
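For the record, here is a minimal sketch of how such an input-to-output mapping could be tallied in code. It is not the exact method I used for the mapping described above: the function name `confusion_counts` and the sample strings are made up for illustration, and it aligns the two strings with Python's standard `difflib`, one code point at a time. Since Bengali uses combining vowel signs, a real version would probably want to align whole character clusters instead.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(expected: str, recognized: str) -> Counter:
    """Count (expected_piece, recognized_piece) pairs where the OCR output differs.

    Hypothetical helper: aligns the typed text with the OCR output and
    records every non-matching span as one confusion.
    """
    counts = Counter()
    # autojunk=False keeps difflib from ignoring frequent characters
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical span, nothing to record
        # "replace", "delete" and "insert" all end up here;
        # one side of the pair is empty for deletes and inserts
        counts[(expected[i1:i2], recognized[j1:j2])] += 1
    return counts


if __name__ == "__main__":
    # Illustrative strings only, not real Tesseract output
    typed = "আমার"
    ocr_output = "আWার"
    for (src, dst), n in confusion_counts(typed, ocr_output).most_common():
        print(repr(src), "->", repr(dst), "x", n)
```

Run over many input/output pairs, the most common entries in such a tally would be the systematic confusions (like "ম" turning into "W"), which is exactly the kind of table a post-processing step could start from.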
But then, after looking around a bit for the box files and fixing some configurations, I got it to work a bit more decently.
However, these are just tests, and I am not going to push them to the repo yet.
Now, time to work up a new "To Do" list:
- Finalize the pre-processing algorithm, then move it out of Evernote and share it through the BookWorm repository.
- Try to get support for "Bengali" in NPP++. It would make life a bit easier :)
- Decide on and start with ImageMagick or something similar.
- Test my basic image on different versions of Tesseract (OPTIONAL). I am still looking into the differences in detail, and the versions look very different indeed.
Also, I tried to contact Abhishek Gupta, who is working on a similar project, to have a chat. It should be interesting to talk with him, as I feel our projects are closely related. Maybe I will be using the same dataset, and that should be the benchmark! And maybe my insight into the "errors" mentioned above will help him in formulating his 'error model'.
We shall see what happens when he replies to my email.
So Long.