Thursday, September 26, 2013

Datasets : for the love of OCR

I am back to post some information about the data I used.
There is a lot of training data available 'out there'. Initially, I looked into the FIRE dataset from RISOT, thanks to +Abhishek Gupta, and it might be good for testing purposes. But it is difficult to use for training, as the associated font information is not available.
However, here is a link to some data I generated/used.
Also, for testing, most of the images I used were obtained by a simple Google Image search with relevant keywords, like "bangla newspaper" or "bangla poster".
Since Google personalizes its results (the so-called filter bubble), I suggest using the generic Google site to get around the country defaults, and either logging out of your Google account or using an incognito/private window. This should keep the search results reasonably consistent across platforms, geographical and temporal differences, and user preferences. However, this is just a suggestion and not of the utmost importance.
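For anyone who wants to repeat the searches, here is a minimal sketch that builds the image search URLs for the keywords above. The helper name `image_search_url` is made up for illustration, and the `tbm=isch` (image results) and `pws=0` (turn off personalization) query parameters are assumptions based on commonly seen Google URLs, not a documented API, so they may stop working at any time.

```python
from urllib.parse import urlencode

# Hypothetical helper: builds a Google Image search URL for some keywords.
# "tbm=isch" and "pws=0" are assumed, unofficial query parameters.
def image_search_url(keywords: str) -> str:
    params = {"q": keywords, "tbm": "isch", "pws": "0"}
    return "https://www.google.com/search?" + urlencode(params)

# The two example queries mentioned in this post.
QUERIES = ["bangla newspaper", "bangla poster"]

if __name__ == "__main__":
    for query in QUERIES:
        print(image_search_url(query))
```

Opening the printed URLs in an incognito/private window should give roughly the same starting point on any machine, but I would still treat the results as indicative rather than reproducible.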
