Tuesday, September 17, 2013

Final Report : Part 2

At the beginning of the project, I spent some time finalising the tools I would be using.
Having done so, I started with the pre-processor, and by the mid-term a working prototype of the Whitewashing code was complete.

During this period, I also realised how difficult it is to generate good data for training and testing. The results were awful in the beginning, when I was getting whole words back instead of individual characters, but the quality gradually improved. During these trials, I tried out various methods for generating the data and, eventually, the box files. Some of them are:

  1. Ari's trainer helped a lot in learning about the training procedure.
  2. OCR chopper was easy to use for making box files, but not always accurate. It needed a lot of manual editing.
  3. BoxMaker was similar to OCR chopper, but more flexible in terms of size.
  4. I also tried some data from Parichit.
  5. Open source icr, a project related to Ari's project mentioned above, was somewhat helpful.
  6. Some amazing work at Silpa inspired and motivated me, but I ended up not using it because I felt it was not very easy to incorporate into the project.
  7. Debayan's Tesseract Indic project has been a great help and provided me with the much-needed guidance to get started.
After trying all of these methods, I decided to use Cairo and Pango in combination. I had some problems initially, but it finally worked out.

Further reading made it clear that I should always jumble up and mix the characters in a training set. I took it one step further and decided to mix up the sizes of the different images as well, and I wrote another script for that. Both steps help in better training.
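To give a rough idea of the jumbling and size mixing, here is a minimal sketch in Python. It is not the script I actually wrote: the function name `plan_training_images`, its parameters, the sample characters and the point sizes are all made up for illustration. It only plans which characters go into which image and at what size; the rendering itself is left out.

```python
import random


def plan_training_images(characters, sizes, per_image=10, copies=3, seed=0):
    """Plan shuffled training images (hypothetical helper, for illustration).

    Returns a list of (point_size, text) pairs, one pair per image.
    """
    rng = random.Random(seed)  # fixed seed so the same plan can be regenerated

    # Repeat every character a few times, then jumble the whole pool so that
    # no character always appears next to the same neighbours.
    pool = [ch for ch in characters for _ in range(copies)]
    rng.shuffle(pool)

    plan = []
    for start in range(0, len(pool), per_image):
        chunk = pool[start:start + per_image]
        # Pick a size at random for each image, so sizes are mixed as well.
        plan.append((rng.choice(sizes), " ".join(chunk)))
    return plan


if __name__ == "__main__":
    # Placeholder character set and sizes; a real run would use the
    # characters of the script being trained.
    for size, text in plan_training_images(list("abcdefgh"), [12, 18, 24, 36]):
        print(size, text)
```

Each pair can then be handed to whatever renders the images. Keeping the seed fixed means the matching box files can be regenerated from the same plan.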
For the final leg of this, I tried to write a Python script, but could not get it to work. In the end, I switched to ImageMagick, which does the trick with a single command. There was no point in making things more complicated than necessary.

Along with this, I continued my work on testing various filters on different types of documents. I will publish the results in the next post.
