Monday, September 16, 2013

Final Report : Part 1

Finally, the summer has come to an end, and it's time to look back on the journey.
It turned out to be very different from what I had expected at the beginning. To sum up, my work for GSoC 2013 can be divided into two parts : the whitewashing algorithm, and the data generation method.

My first plan was to have a pre-processor and a post-processor, which would work together around the OCR engine to improve the quality of the final result. But after learning about the InfoRescue project, the plan for the post-processor changed, and the final implementation consists of the pre-processor and a system for generating data easily.
For the results, my initial plan was to chart the performance of the individual systems, but instead I ended up producing a table that helps with the pre-processing of documents.
Here, I will briefly sum up my aimed vs achieved goals, and then in the following post(s), I will explain in detail the arguments and the train of thoughts and events that led to the changes in the plan.

Set Goals :
  1. Pre-processor : shirorekha chopper
  2. Post-processor : CBR-based
  3. Output matrix : performance based comparison
Achieved Goals :
  1. Pre-processor : Whitewashing code, a modified shirorekha chopper that paints over the shirorekha instead of chopping only at the gaps.
  2. Data generation : A method for generating data that can be used for testing, as well as for making the box files used to train the system.
  3. Output table : Several filters and their effects on different types of documents were tested, in the hope of providing a guideline for better pre-processing of any document that needs to be OCR'ed.
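To make the whitewashing idea in point 1 a little more concrete, here is a minimal sketch. It is not the project's actual code, only an illustration under stated assumptions: the word image is already binarised (1 = ink, 0 = background), the shirorekha is taken to be the row with the most ink, and the function name `whitewash_shirorekha` and its `band` parameter are hypothetical.

```python
def whitewash_shirorekha(image, band=1):
    """Return a copy of a binary word image with the headline painted white.

    image: list of rows, each a list of 0/1 values (1 = ink).
    band:  number of extra rows above and below the detected headline
           that are also painted over (hypothetical tuning parameter).
    """
    if not image:
        return []

    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    # Assumption: the shirorekha is the row containing the most ink.
    peak = profile.index(max(profile))
    top = max(0, peak - band)
    bottom = min(len(image) - 1, peak + band)

    # Copy the image so the input is left untouched.
    result = [list(row) for row in image]

    # Paint the whole headline band white, rather than cutting
    # only at the gaps between characters.
    for y in range(top, bottom + 1):
        result[y] = [0] * len(result[y])
    return result


# Tiny synthetic "word": row 1 is the headline joining the strokes below it.
word = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0],
]
cleaned = whitewash_shirorekha(word, band=0)
for row in cleaned:
    print("".join("#" if pixel else "." for pixel in row))
```

A real document image would need skew correction and per-line, per-word processing before a projection profile like this is reliable; the sketch only shows the "paint over" step itself.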
Key extra takeaways :
  • Learnt a lot about OCR software, specifically about Tesseract.
  • Learnt about Pango-Cairo.
  • Used IPython and Notebooks with EPD. I am definitely going to use them a lot now.
  • Practised modular development as a single developer on a project this big for the first time.
In the next few posts, I will explain my reasoning for the changes I made to the plan along the way.
So long.
