At the beginning of the project, I spent some time finalising the tools I would be using.
Having done so, I started on the pre-processor, and by the mid-term I had completed a working prototype of the Whitewashing code.
During this period, I also realised how difficult it is to generate good data for training and testing. The results were awful in the beginning, when I was getting output for whole words instead of individual characters, but the quality gradually improved. During these trials, I tried out various methods for generating the data and, eventually, the box files. Some of them are:
- Ari's trainer helped a lot in learning about the training procedure.
- OCR chopper was easy to use for making box files, but not always accurate. It needed lots of manual editing.
- BoxMaker was similar to OCR chopper, but more flexible in terms of size.
- I also tried some data from Parichit.
- Open source icr, a project related to Ari's project mentioned above, was somewhat helpful.
- Some amazing work at Silpa inspired and motivated me, but I ended up not using it because I felt it was not very easy to incorporate into the project.
- Debayan's Tesseract Indic project has been a great help and provided me with much-needed guidance to get started.
After trying all of these methods, I decided to use Cairo and Pango in combination. I had some problems initially, but in the end it worked out.
Further reading made it clear that I should always jumble up and mix the characters in a training set. I took it one step further and decided to mix up the sizes of the different images as well, and I wrote another script for that. Mixing both makes for better training.
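To give a rough idea of what "mixing up" means here, the snippet below is a minimal sketch and not the actual script I used. It only builds a shuffled plan of (character, font size) pairs; rendering them with Cairo and Pango is left out. The function names, the placeholder alphabet and the list of sizes are all made up for illustration.

```python
import random


def build_training_plan(characters, sizes, repeats=3, seed=0):
    """Return a shuffled list of (character, font_size) pairs.

    Every character appears `repeats` times, each time with a size
    picked at random from `sizes`. All names here are hypothetical.
    """
    rng = random.Random(seed)  # seeded so the same plan can be regenerated
    plan = [(ch, rng.choice(sizes))
            for ch in characters
            for _ in range(repeats)]
    rng.shuffle(plan)  # jumble the order so identical characters are spread out
    return plan


def to_lines(plan, per_line=10):
    """Split the plan into rows of at most `per_line` samples."""
    return [plan[i:i + per_line] for i in range(0, len(plan), per_line)]


if __name__ == "__main__":
    alphabet = list("abcde")   # placeholder alphabet, not the real character set
    font_sizes = [12, 18, 24]  # assumed sizes, in points
    for row in to_lines(build_training_plan(alphabet, font_sizes), per_line=5):
        print(" ".join(f"{ch}@{size}" for ch, size in row))
```

Each row of the plan would then correspond to one line of text in a training image, with every sample drawn at its own size.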
For the final leg of this, I was trying to write a Python script, but it was not working. In the end, I switched to ImageMagick, as it does the trick with a single command. There was no point in making things more complicated than necessary.
Along with this, I continued my work on testing various filters on different types of documents. I will publish the results in the next post.