Final Report : Part 1

Finally, the summer has come to an end, and it's time to look back on the journey.
It turned out to be very different from what I had expected at the beginning. To sum up, my work for GSoC 2013 falls into two parts: the whitewashing algorithm and the data generation method.

My first plan was to have a pre-processor and a post-processor, which would work together around the OCR engine to improve the quality of the final result. But after learning about the InfoRescue project, the plan for the post-processor changed, and the final implementation has the pre-processor and a system for generating data easily.
For the results, my initial plan was to chart the performance of the individual systems, but instead I ended up producing a table that helps with the pre-processing of documents.
Here, I will sum up my Aimed vs Achieved Goals in brief, and then in the following post(s), I will explain in detail the arguments and the train of thoughts and events that led to the changes in the plan.

Set Goals :

Pre-processor : shirorekha chopper
Post-processor : CBR-based
Output matrix : performance based comparison

Achieved Goals :

Pre-processor : Whitewashing code, a modified shirorekha chopper that paints over the shirorekha instead of chopping only at the gaps.
Data generation : A system for generating data that can be used for testing, as well as for making the box files used to train the system.
Output table : Several filters were tested for their effect on different types of documents, in the hope of providing a guideline for better pre-processing any document that needs to be OCR'ed.
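
To give a rough idea of what "painting over the shirorekha" means, here is a minimal sketch. It is not the project code, only an illustration under assumptions: the text line is a binary image stored as a list of rows (1 = ink, 0 = background), the shirorekha is taken to be the densest row of ink plus any adjacent rows that are nearly as dense, and the function name `whitewash_shirorekha` and the `threshold` parameter are hypothetical.

```python
def whitewash_shirorekha(image, threshold=0.8):
    """Return a copy of a binary text-line image with the headline band blanked.

    image: list of rows, each a list of 0 (background) or 1 (ink).
    threshold: assumed fraction of the densest row's ink count that a
               neighbouring row must reach to count as part of the headline.
    """
    if not image:
        return []

    # Horizontal projection profile: amount of ink in each row.
    ink_per_row = [sum(row) for row in image]
    peak = max(ink_per_row)
    if peak == 0:
        # Blank image, nothing to paint over.
        return [row[:] for row in image]

    # Assumption: the densest row belongs to the shirorekha.
    peak_row = ink_per_row.index(peak)
    cutoff = threshold * peak

    # Grow the band up and down while neighbouring rows stay dense.
    top = peak_row
    while top > 0 and ink_per_row[top - 1] >= cutoff:
        top -= 1
    bottom = peak_row
    while bottom < len(image) - 1 and ink_per_row[bottom + 1] >= cutoff:
        bottom += 1

    # Paint the whole band with background; copy every other row unchanged.
    return [
        [0] * len(row) if top <= i <= bottom else row[:]
        for i, row in enumerate(image)
    ]


line = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # the headline joining the characters
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = whitewash_shirorekha(line)
```

With the headline band blanked, the two character stems in the toy example are no longer connected, which is the effect the whitewashing step is after. A real document would first need to be split into text lines and binarised, which this sketch leaves out.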

Key extra takeaways :

Learnt a lot about OCR software, specifically about Tesseract.
Learnt about Pango-Cairo.
Used IPython and Notebooks with EPD. I am definitely going to use them a lot now.
Practised modular development as a single developer on a project this big for the first time.

In the next few posts, I will explain my reasoning for the changes I made to the plan along the way.

So Long.

BookWorm - the BongOCR

Monday, September 16, 2013
