Tuesday, July 23, 2013

Mid-Term Summary: Part 4 - Results So Far

This post covers the results I have had so far: both the good ones and the bad ones that compelled me to revise my plans.

1. Tesseract
To start with, I am happy with the decision to use Tesseract 3.02. It's effective, not difficult to set up or use on any platform, and the training procedure, though long and tedious, is not very complex once you get the hang of it.

2. Environment
Next, my decision to use EPD and IPython has also been helpful. In fact, I love Python more than ever now, and I think the Notebook is a really good option when you care about heavily documenting your work.

3. Test Results
Resolution: Contrary to my initial hypothesis, the tests suggest that image resolution does not affect the outcome much. However, more extensive testing needs to be done. This result also pushed me to focus harder on the image filters.

Image Filters: So far, bilateral filtering and adaptive thresholding seem to be the most effective across all kinds of documents. For the final matrix, the document-specific effects of the filters still need to be sorted out.
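
To make the adaptive thresholding idea concrete, here is a minimal pure-Python sketch of mean-based adaptive thresholding. It is not the code used in these tests; the function name `adaptive_mean_threshold` and the parameters `block_size` and `offset` are hypothetical, and in practice a library routine such as OpenCV's `cv2.adaptiveThreshold` would normally be used instead (it treats image borders differently from this sketch).

```python
def adaptive_mean_threshold(image, block_size=3, offset=5):
    """Binarize a grayscale image given as a list of rows of ints (0-255).

    A pixel becomes 255 if it is brighter than the mean of the
    block_size x block_size window centred on it, minus `offset`;
    otherwise it becomes 0. Windows are clipped at the image borders.
    All names here are illustrative assumptions, not a library API.
    """
    if block_size < 1 or block_size % 2 == 0:
        raise ValueError("block_size must be a positive odd number")

    height, width = len(image), len(image[0])

    # Integral image: integral[y][x] holds the sum of image[0:y][0:x],
    # so any window sum can be read off with four lookups.
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    half = block_size // 2
    result = []
    for y in range(height):
        top, bottom = max(0, y - half), min(height, y + half + 1)
        out_row = []
        for x in range(width):
            left, right = max(0, x - half), min(width, x + half + 1)
            total = (integral[bottom][right] - integral[top][right]
                     - integral[bottom][left] + integral[top][left])
            count = (bottom - top) * (right - left)
            # pixel > mean - offset, multiplied through by count to stay in ints
            is_bright = image[y][x] * count > total - offset * count
            out_row.append(255 if is_bright else 0)
        result.append(out_row)
    return result
```

Because the threshold is computed per neighbourhood rather than once for the whole page, unevenly lit scans can keep their text on both the bright and the dark side, which is one plausible reason this filter holds up across different kinds of documents.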

WhiteWashing Algorithm: The tests show that the code works as intended.

Box Files and traineddata: The initial box files I generated were disastrous, but with the workaround of using Fedora, they have improved. The traineddata I generated is not yet an improvement over the stock traineddata. I will continue to work on them.
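
For readers who want to sanity-check their own box files, the following is a small sketch rather than anything used in this project. It assumes the Tesseract 3.x box file layout of one symbol per line followed by `left bottom right top page`, with the origin at the bottom-left of the image. The names `Box`, `parse_box_line`, and `find_suspect_boxes` are hypothetical.

```python
from collections import namedtuple

# One entry of a box file: the symbol plus its bounding box and page number.
Box = namedtuple("Box", "symbol left bottom right top page")


def parse_box_line(line):
    """Parse one box file line of the form 'symbol left bottom right top page'."""
    # Split from the right so the five numeric fields are always taken last.
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("not a valid box line: %r" % line)
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def find_suspect_boxes(lines):
    """Return boxes whose width or height is zero or negative."""
    suspects = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        box = parse_box_line(line)
        if box.right <= box.left or box.top <= box.bottom:
            suspects.append(box)
    return suspects
```

A check like this only catches degenerate boxes; boxes that are well-formed but wrap the wrong glyph still need to be reviewed by eye.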

