Finally, the summer has come to an end, and it's time to look back on the journey.
It turned out to be very different from what I had expected at the beginning. To sum up, my work for GSoC 2013 falls into two parts: the whitewashing algorithm and the data generation method.
My first plan was to have a pre-processor and a post-processor, which would work together around the OCR engine to improve the quality of the final result. But after I learnt about the InfoRescue project, the plan for the post-processor changed, and the final implementation has the pre-processor and a system for generating data easily.
For the results, my initial plan was to chart the performance of the individual systems; instead, I ended up producing a table that helps in pre-processing documents.
Here, I will sum up my Aimed vs Achieved Goals in brief, and then in the following post(s), I will explain in detail the arguments and the train of thought and events that led to the changes in the plan.
Set Goals :
- Pre-processor : shirorekha chopper
- Post-processor : CBR-based
- Output matrix : performance-based comparison
Achieved Goals :
- Pre-processor : Whitewashing code, a modified shirorekha chopper that paints over the shirorekha instead of chopping only at the gaps.
- Data generation : A system that generates data which can be used for testing, as well as for making the box files used to train the system.
- Output table : Several filters and their effects on different types of documents were tested, in the hope of providing a guideline for better pre-processing of any document that needs to be OCR'ed.
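To give a rough feel for what "painting over the shirorekha" means, here is a minimal sketch. It is not the project's actual code. It assumes a binarised word image stored as rows of 0 (white) and 1 (ink), takes the densest horizontal band of ink to be the shirorekha, and uses a made-up `band_ratio` threshold to decide how thick that band is.

```python
def whitewash_shirorekha(image, band_ratio=0.8):
    """Return a copy of `image` with its densest horizontal ink band painted white.

    image      -- list of rows, each a list of 0 (white) / 1 (ink) pixels
    band_ratio -- hypothetical threshold: neighbouring rows with at least this
                  fraction of the peak row's ink count as part of the headline
    """
    if not image:
        return []

    # Horizontal projection: how much ink each row carries.
    row_ink = [sum(row) for row in image]
    peak = max(row_ink)
    if peak == 0:
        # Blank image, nothing to paint over.
        return [row[:] for row in image]

    peak_row = row_ink.index(peak)
    cutoff = band_ratio * peak

    # Grow the band upwards and downwards from the densest row.
    top = peak_row
    while top > 0 and row_ink[top - 1] >= cutoff:
        top -= 1
    bottom = peak_row
    while bottom < len(image) - 1 and row_ink[bottom + 1] >= cutoff:
        bottom += 1

    # Paint the whole band white; copy every other row unchanged.
    return [
        [0] * len(row) if top <= y <= bottom else row[:]
        for y, row in enumerate(image)
    ]


word = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # headline, densest row
    [1, 1, 1, 1, 1, 0],  # headline, second row
    [0, 1, 0, 0, 1, 0],  # character strokes
    [0, 1, 0, 0, 1, 0],
]
cleaned = whitewash_shirorekha(word)
```

In this toy image the two headline rows are cleared while the strokes below stay untouched. A real scan would also need handling for skew, noise and multi-line text, which this sketch ignores.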
Key extra takeaways :
- Learnt a lot about OCR software, specifically about Tesseract.
- Learnt about Pango-Cairo.
- Used IPython and Notebooks with EPD. I am definitely going to use them a lot now.
- Practised modular development as a single developer, for the first time on a project this big.
In the next few posts, I will explain my reasoning for the changes I made to the plan along the way.
So long.