Thursday, September 26, 2013

Datasets : for the love of OCR

I am back to post some information about the data I used.
There is a lot of training data available 'out there'. Initially, I looked into the FIRE dataset from RISOT, thanks to +Abhishek Gupta, and it might be good for testing purposes. But it is difficult to use for training, as the associated font information is not available.
However, here is a link to some data I generated/used.
Also, for testing, most of images I used were obtained by a simple Google Image search, with relevant keywords, like "bangla newspaper" or "bangla poster".
Since Google personalises results (the so-called filter bubble), I suggest using the generic Google site to override the country defaults, and logging out of your Google account or using an incognito/private window. This should ensure some consistency in the search results across platforms, geographical and temporal differences, as well as user preferences. However, it is not of the utmost importance, just a suggestion.

Friday, September 20, 2013

Self Evaluation

During the last few months, I have seen and learnt a lot. I read about software and the related state of the art for my project. I exchanged emails with some pretty awesome people working in the field, some still continuing with their passion and going strong, while others are caught up in other phases of life. They inspired me with their work and helped me with guidance.
I made use of knowledge acquired in the past, and gained new knowledge that will be useful in the future. I realised how hard it can be to manage time efficiently, especially when juggling matters of health where you are helpless, and work that you desperately want to see through to completion.

My project goals transformed over the course, but I believe I have a deeper understanding of the scenario now. I tried several methods/libraries, tested out my hypotheses that were sometimes right, and sometimes very wrong, and most importantly, I got to work on a project that I had designed myself and was close to my heart.
One major benefit of this project is that I now understand how to manage my projects on a bigger scale and in a better way. And I can already feel the difference it has made. Overall, it has been a great life experience.

Thursday, September 19, 2013

Final report : Part 3

The filters I tested out generated some results that help in preparing a document for OCR.
In a tabular format, the results look like :





| Document type | Median Filter | Gaussian Filters | Greyscale Conversion | Bilateral Filters | Adaptive Thresholding | Resolution Changes | WhiteWashing Algo |
| ------------- | ------------- | ---------------- | -------------------- | ----------------- | --------------------- | ------------------ | ----------------- |
| Color Images  | G             | R                | G                    | G                 | G                     | N, A               | Works as intended |
| NewsPaper     | R             | G                | G                    | G                 | R, S                  | N, A               | Works as intended |
| Books         | R             | G                | G                    | G                 | R, S                  | N, A               | Works as intended |
| Posters       | G             | N, A             | G                    | G                 | G, S                  | N, A               | Works as intended |

Legend :
R : Recommended
G : Good Results, Useful to have
N : Neutral
A : Adverse Effects
S : Improved results with WhiteWashing

Explanations :

All tests were performed on random image samples found via a basic image search on Google.
NOTE : Tests mentioned in Column N should be performed before tests in column N+1. 

Median Filters:
These are useful for Color Images, and seem to improve the output in general for all documents.
In case of samples from newspapers and books, they helped greatly in removing noise.
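As an illustration (not the exact code from the repository), a median filter pass with PIL looks something like this; the tiny synthetic image stands in for a noisy scan:

```python
from PIL import Image, ImageFilter

# A small grayscale test image with a single "salt" speck of noise.
img = Image.new("L", (9, 9), 0)
img.putpixel((4, 4), 255)  # isolated bright pixel

# A 3x3 median filter removes isolated salt-and-pepper specks
# while largely preserving stroke edges.
denoised = img.filter(ImageFilter.MedianFilter(size=3))

print(denoised.getpixel((4, 4)))  # the speck is replaced by the local median, 0
```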

Gaussian Smoothing:
It helped in smoothing the images before greyscale conversion, but in case of Posters, the smoothing often distorted the text as well.

Greyscale Conversion :
It is always better to do a greyscale conversion as there is no downside to it.
Often, using a filter to "skeletonize" the input resulted in improved output. But this was also affected by the resolution, and it is suggested not to use skeletonize for input with very low resolution.
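The skeletonize step can be sketched with scikit-image like this, on a synthetic stroke (the real inputs are, of course, scanned glyphs):

```python
import numpy as np
from skimage.morphology import skeletonize

# A binary image containing a 3-pixel-thick vertical stroke.
stroke = np.zeros((20, 20), dtype=bool)
stroke[2:18, 8:11] = True

# Skeletonization thins the stroke towards single-pixel width, which can
# sharpen glyph shapes when the resolution is adequate.
skeleton = skeletonize(stroke)

print(int(skeleton.sum()), "skeleton pixels vs", int(stroke.sum()), "stroke pixels")
```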

Bilateral Filters and Adaptive Thresholding :
These tests were started before the mid-term, and are explained in detail here and here, respectively.
However, I continued checking the combinations, and they were found to be mostly useful.
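A rough sketch of the bilateral-filter-plus-adaptive-thresholding combination with scikit-image, run on a synthetic "page" rather than a real scan (the parameter values here are arbitrary):

```python
import numpy as np
from skimage.restoration import denoise_bilateral
from skimage.filters import threshold_local

rng = np.random.default_rng(0)
# Synthetic page: a dark horizontal stroke on a bright, noisy background.
page = np.full((40, 40), 0.9)
page[18:22, 5:35] = 0.1
page = np.clip(page + rng.normal(0, 0.05, page.shape), 0, 1)

# Edge-preserving smoothing first, then a locally adaptive threshold,
# following the column order in the table above.
smooth = denoise_bilateral(page)
binary = smooth > threshold_local(smooth, block_size=15)

print(binary.shape, binary.dtype)
```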

I will write one last post about me, myself, life, the universe and everything that happened this summer. So Long.


Tuesday, September 17, 2013

Final Report : Part 2

At the beginning of the project, I spent some time finalising the tools I would be using.
Having done so, I started with the pre-processor, and finally, by the mid-term, a working prototype of the Whitewashing code was completed.

During this period, I also realised how difficult it was to generate good data for training/testing purposes. The results were awful in the beginning, when I got whole words instead of characters. But gradually the quality improved. During these trials, I tried out various methods for generating the data and, eventually, the box files. Some of them are :

  1. Ari's trainer helped a lot in learning about the training procedure.
  2. OCR chopper was easy to use to make box files, but not always accurate. It needed lots of manual editing.
  3. BoxMaker was similar to OCR chopper, but more flexible in terms of size.
  4. I also tried with some data from Parichit.
  5. Open source icr, a project related to Ari's trainer mentioned above, was somewhat helpful.
  6. Some amazing work at Silpa inspired and motivated me, but I ended up not using it because I felt it was not easy to incorporate into the project.
  7. Debayan's Tesseract Indic project has been a great help and provided me with much-required guidance to get started.
After having tried all the methods, I decided to use Cairo and Pango in combination, and though I had some problems initially, it finally worked out.

Further reading made it clear that I should always jumble up and mix the characters in a training set. I took it one step further, and decided to mix up even the sizes of the different images. And for that, I wrote another script. This helps in better training.
For the final leg of this, I was trying to make a  python script, but it was not working. In the end, I switched to using ImageMagick as it does the trick with a single command. There was no point in making things more complicated than necessary.
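The script itself lives in the repository; the shuffling of characters and sizes amounts to something like this sketch (the file names and sizes here are made up):

```python
import random

# Hypothetical character-image file names and candidate sizes; shuffling
# both avoids ordering and size bias in the training sheet.
glyphs = ["ka.png", "kha.png", "ga.png", "gha.png", "cha.png", "ja.png"]
sizes = [32, 48, 64]

random.seed(13)
random.shuffle(glyphs)
plan = [(name, random.choice(sizes)) for name in glyphs]

for name, size in plan:
    print(name, size)
```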

Along with this, I continued my work on testing various filters on different types of documents. I will publish it in the next post.

Monday, September 16, 2013

Final Report : Part 1

Finally, the summer has come to an end, and it's time to look back on the journey.
It turned out to be very different from what I had thought at the beginning. To sum up, my work towards GSoC 2013 can be categorised in two parts : the whitewashing algorithm, and the data generation method.

My first plan was to have a pre-processor and a post-processor, which would work together around the OCR engine to improve the quality of the final result. But after learning about the InfoRescue project, the plan for the post-processor changed, and in the final implementation, I have the pre-processor and a system for generating data easily.
For the result, my initial plan was to chart out the performance of the individual systems, but instead, I ended up doing a table that helps in the pre-processing of the documents.
Here, I will sum up my aimed vs achieved goals in brief, and then in the following post(s), I will explain in detail the arguments and the train of thought/events that led to the changes in the plan.

Set Goals :
  1. Pre-processor : shirorekha chopper
  2. Post-processor : CBR-based
  3. Output matrix : performance based comparison
Achieved Goals :
  1. Pre-processor : WhiteWashing code, a modified shirorekha chopper that paints over the shirorekha instead of chopping only at the gaps.
  2. Data generation : It helps in generating data that may be used for testing, as well as used to make the box-files to train the system.
  3. Output table : Several filters and their effects on different types of documents were tested, in the hope of providing a guideline to better pre-process any document that needs to be OCR'ed.
Key extra takeaways :
  • Learnt a lot about OCR software, specifically about Tesseract.
  • Learnt about Pango-Cairo. 
  • Used IPython and Notebooks with EPD. I am definitely going to use them a lot now.
  • Practised modular development as a single developer on a project this big for the first time.
In the next few posts, I will explain my reasoning for the changes I made to the plan along the way.
So Long.

Sunday, September 15, 2013

code clean up

It was bugging me to put up the whole code in a single file, especially after having used a very nice, modular structure in the notebooks. So I spent the day scrubbing, and finally I have split it up into manageable parts and pushed everything to the repository.
Also included with the code are instructions on how to use it, along with my personal views and suggestions at some points, in the form of README files.
In the future, I would like to continue and maybe develop a GUI to make it easier to use.  
However, it's time to stop coding, and start documenting whatever is left. 

Saturday, September 14, 2013

porting to py completed

It has been a long but productive day. After a compulsory 40-hour week at school, I have finally completed porting all the code from *.ipynb to *.py, and collecting all the code from Windows and Fedora into one place. However, I still prefer the notebook format, and would recommend using the notebooks if possible.
Tomorrow, I will improve the format of the final matrix, and publish it. So Long.

Friday, September 13, 2013

data generation method- complete

So finally, the method for generating data easily is complete.
I tried to continue making the script with EPD, but it kept giving bad results, mostly due to inefficient memory handling on my part. Here is a sample output of the previous code.
It can be seen that it's really bad, and this should not be used for any kind of training or testing.
However, I switched to ImageMagick for the merging part, and now it works perfectly. The sample output looks like this.
Note the clear spacing of the individual images generated earlier, and that the output takes up the whole space instead of crowding into a corner like the previous result.
It can be done with a simple command : 
montage img1.ext img2.ext .... imgN.ext -geometry SizexSize output.ext
Here, montage is the command and -geometry is the option; the rest can be modified as per our needs.
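The same call can also be driven from a Python script via subprocess (the file names here are placeholders):

```python
import subprocess

# Build the same ImageMagick call as above; inputs are hypothetical.
inputs = ["img1.png", "img2.png", "img3.png"]
size = 64
cmd = ["montage", *inputs, "-geometry", f"{size}x{size}", "output.png"]

print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment when ImageMagick is installed
```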
Also, the image above is just a snip of the actual output as it is too big to publish here.

Thursday, September 5, 2013

importance of color profiles

Two days ago, talking to a professor made me realise how useful HSI profiles could be. I have spent the time since then looking into it, and into whether it could help us with our goal.
The basic idea is : we could jump to the I-parameter, and depending on that, infer whether any useful information could be retrieved from the data. This could theoretically also lead to a probabilistic model to assign weights to different training samples.
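A minimal sketch of the idea, using HSV's V channel as a stand-in for the HSI intensity parameter (the pixel values are made up):

```python
import numpy as np
from skimage.color import rgb2hsv

# A tiny synthetic RGB patch: one paper-like pixel, one ink-like pixel.
rgb = np.zeros((1, 2, 3))
rgb[0, 0] = (0.9, 0.9, 0.8)   # bright background
rgb[0, 1] = (0.1, 0.05, 0.1)  # dark ink

# The V (value) channel plays the role of the intensity parameter: a large
# spread between text and background suggests recoverable information.
value = rgb2hsv(rgb)[..., 2]
print(value)
```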
However, while very interesting, it seems to be too vast a topic in itself, so I am back to my senses and will not spend any more time on this issue immediately. It's time to focus on wrapping up things at hand. So Long.

Sunday, September 1, 2013

more training data

Still a bit irregular with the updates, but finally I have some more training data that is about 80% ready.
Also, I am working on a separate script to generate the test image for producing the test data and hence, the 20% incompleteness in the new data. A preliminary version of the same is already in the repository.
However, I expect this script to be complete by next Thursday, and then have the final matrix by next Sunday night.
After that, as I have been reminded by my mentor again, I should focus on preparing the final report.
It is a sad but harsh reality that OCR projects never seem to produce enough data. I plan to publish at least 2 sets of training data I generated during this summer in some way through the repository. The stock data still seems to perform better, though.
Also, I need to start working on an outline plan for the final report. Lots of work to do in a seemingly short period of time, maybe more so because it coincides with some period of health issues and then re-opening of school. It is one of the points I would like to mention in the GSoC review, but of course as my personal opinion. 
The most important factor right now is not to lose motivation and to carry on with the project. So Long.

Sunday, August 25, 2013

batch resize script

Hi again.
After an online absence of almost 3 weeks, I am finally in a position to update the blog again.
During these "dark" days, I have been continuing with my attempt at making the matrix, and hopefully, I'll be able to present the final structure of the matrix by next weekend.
However, as far as improvements go, I have figured out the tweaks in Notepad++ (just updated to the latest version) to include localisation, and now it renders without me having to change the settings manually for each individual file.

Another part of the project I have been working on is the optimisation of the existing code, and modifying it to handle large amounts of data at the same time.
Due to my extensive unplanned travels over the last few weeks (Delhi - Kolkata - Delhi - Stockholm - Vasteras) I was unable to push any code to the repository, but finally I have updated it with the current, working version of the same script. 
However, overall, I am a bit disappointed that I had to move away from my very initial plan, but the need for a matrix to guide us in preparing the data is also important. I am happy (though not satisfied) with the performance of Tesseract's latest stock data on individual words, and if I get enough time, I will try to publish some statistical data on the same.

Goals for next week :
  1. Update blog with activities from past few weeks.
  2. Finalize structure of the matrix.
So Long.

Monday, July 29, 2013

minor tweaks and documentation

This past week, I have been working on tweaking the code to optimise it at various places.
Now, it has the facility to check the input image mode and convert to grayscale only if needed, unlike earlier, when it always converted. It was a pretty simple but important change, one that I had been thinking about for some time and, at long last, managed to get around to.
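The check itself is tiny; in PIL terms, it amounts to something like this sketch (not the repository code):

```python
from PIL import Image

def ensure_grayscale(img):
    # Convert to 8-bit grayscale only when the image is not already mode "L".
    return img if img.mode == "L" else img.convert("L")

rgb = Image.new("RGB", (4, 4), (200, 10, 10))
gray = Image.new("L", (4, 4), 128)

# The RGB image gets converted; the grayscale one is returned untouched.
print(ensure_grayscale(rgb).mode, ensure_grayscale(gray) is gray)
```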
Another notable change is the way the directories work. Earlier, I was using explicit, absolute-path-based locations for the input files. Now, I have updated it to take them as a variable-based relative path. Next to do : take the path as an argument input, but I will do that when I translate everything from the notebooks to *.py at the end.
Also, I spent some time documenting the existing code, and though it is split across 3 notebooks, it now has all the documentation in the manner I intended. Later, when merging the notebooks, I will use the respective parts (cell blocks) from the different notebooks to form the final script.
So Long.

Tuesday, July 23, 2013

Mid-Term Summary : Part 5 - using the notebooks

This post is meant to guide anyone who wants to use the notebooks.
So far, I have pushed the notebooks (JSON format) to the repository directly. To have a look at the code, follow the instructions below.


  1. Make sure you have the software and libraries required installed. A list of the same can be found here.
  2. Make sure you have IPython and notebooks support. Installation instructions can be found here
  3. Get the notebooks (*.ipynb) from the repository. Preferably, do a fork/clone to let us know you are testing out the code.
  4. Import the notebooks in your browser session or open directly with EPD.
p.s. - I will provide everything in native python (*.py) format at the end of the project. 

Mid-Term Summary : Part 4 - Results so far

This post is going to be all about the various results I have had so far, which will include both  the good ones (obviously) and the bad ones, that compelled me to update my plans somehow.

1. Tesseract
To start with, I am happy with the decision to use Tesseract 3.02. It's effective, not difficult to set up or use on any platform, and the training procedure, though long and tedious, is not very complex once you get the hang of it.
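For anyone curious, a basic Tesseract invocation for Bengali can be scripted like this (the file names are placeholders; "ben" is Tesseract's Bengali language code):

```python
import subprocess

# tesseract <image> <output base> -l <lang>; the recognised text
# lands in out.txt.
cmd = ["tesseract", "scan.png", "out", "-l", "ben"]

print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment with Tesseract installed
```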

2. Environment
Next, my decision to use EPD and IPython has also been helpful. In fact, I love Python more than ever now, and think that using the Notebook is a really good option when you want to document your work heavily.

3. Test Results
Resolution : Contrary to my initial hypothesis, tests suggest resolution changes do not affect the outcome much. However, more extensive testing needs to be done. Also, this inspired me to focus harder on the image filters.

Image Filters : So far, bilateral filters and adaptive thresholding seem to be the most effective on all kinds of documents. For the final matrix, the document-specific effects of the filters need to be sorted out.

WhiteWashing Algorithm : The tests show that the code works as intended. 

Box-Files and traineddata : The initial box files I generated were disastrous, but with the workaround of using Fedora, they have improved. The traineddata I generated is not yet an improvement over the stock data. I will continue to work on it.


Monday, July 22, 2013

Mid-Term Summary : Part 3 - working on the BookWorm

My working Methods.

This post probably needs a bit more structure, and hence,  I will use some formatting styles Blogger provides.
The work I have done, and how I have done it, during the previous weeks  can be summarised in the following steps : 

Objective A. Get practical with Tesseract

I started the project by getting a first-hand feel for Tesseract. From all the reading I had done by this point, I was familiar with the working theory, but actually trying out the software provided me with a better insight into the current scenario.
Another valuable lesson learnt was that all the reading and previous work was based on earlier versions of Tesseract, while Tesseract 3.02 is a major upgrade as far as BookWorm is concerned : it already has some base support for Indic languages, and the training method has also changed a bit from the previous versions (Tesseract 2.x).
I updated my plan several times along the way, and finally arrived at the conclusion to form the matrix mentioned in my previous post. I also finalised my version of the shirorekha chopping algorithm, the WhiteWashing Algorithm.
Also, discussions with +Abhishek Gupta  made it clear that our projects are closely related, and his project is like a post-processor for OCR while mine is a pre-processor.

 Objective B. Decide on Imaging Libraries

After familiarising myself practically with Tesseract, I started trying different libraries in Python, and finally decided that I will be using PIL and Scikit-Image. Some entries about the same can be found here.

Objective C. WhiteWashing Algorithm

Now, having made sure that the chosen libraries were suitable and sufficient for my purpose, I moved on, and started working on the WhiteWashing  Algorithm.
Now that the coding for whitewashing an image is complete, after the midterm, I will turn my focus back to the filters, and start with forming the matrix.

Mid-Term Summary : Part 2 - Shaping BookWorm

Approach :
I decided to tackle the problem of the shirorekha in connected scripts for the first half of the summer, and the result is the WhiteWashing Algorithm. It detects the shirorekha in a script and removes it, making it easier to separate the "akkhars" (alphabets) in a "shabda" (word).
Next, I am working on developing a matrix to help us determine the best-suited filtering method for making a document OCR-ready. It will have specific instructions for a document based on its type. An early example of the same is here.

Tools Used :

  • Language : Python  2.7
  • Development Environment : 
    • Most of the work was done on EPD 1.0.1 on Windows 7 (32-bit), using IPython notebooks.
      • However, I had to switch to Linux (Fedora 14 on VMware Workstation 9.0) for a while, as I was unable to make Pango work on my Windows machine.
    • Notepad++ 6.3.2
  • Software : 
    • Tesseract 3.02 on the command line (NOTE : only 3.0 is available on Fedora 14, but it did not matter, as I switched back to Windows for the parts that use Tesseract).
    • Python Libraries :
  • Repository : GitHub

Sunday, July 21, 2013

Mid-Term Summary : Why BookWorm

I have paused the coding for a while, and have been concentrating on documentation. A part of the same is this multiple-part entry to summarise my thoughts and efforts to help with the mid-term evaluation.
This interlude begins here. 
  • Part 1 - Background and Motivation

Tesseract has undoubtedly been the most prominent open-source OCR tool, and in the past few years, support for many languages has been incorporated into it. The most notable of these (to me personally) are :
  1. The Eutypon Project
  2. Debayan's work on IndicOCR
  3. Shirorekha chopping for Hindi
My motivation for taking up this project is rooted in the fact that Bengali is my mother tongue, and I have first-hand experience of situations where an OCR for Bengali would have helped me.
With this project, my aim is to help in digitization of the huge literary heritage of the language and at the same time, making the language itself more easily accessible to people across the world.

So I read up on the existing projects, talked to some of the awesome people who originally took this initiative, and decided to go forward and try to improve OCR accuracy for Bengali. And thus began the BookWorm.

Thursday, July 18, 2013

some quick training

The last two days, I decided to go with my impulse and try generating some training data. I now have a good grasp of the Tesseract commands for generating the required files, but still need a deeper understanding of the clustering part. Amazingly, the long and tedious training procedure in the original documentation doesn't look so difficult after staring at it for almost 3 months.
However, after testing with the new data, no miracle happened, and while it managed to detect the language correctly, the accuracy wasn't worth mentioning.
Next step, I will generate some training data with the WhiteWashing algorithm and see how it fares.
But now, off to documentation.
So Long.

Monday, July 15, 2013

testing with close-typed document

As planned, I tested the algorithm with some image samples of Bengali newspapers, found via a simple Google Image search.
The original image I used had a lot of noise; here is a preview :

Here, we can see that it has a lot of salt-and-pepper style noise, and the resolution is not particularly good.
Trying to whitewash it as-is yielded poor results.
It can be seen that due to the noise, the algorithm fails to detect the individual lines. So I applied an adaptive thresholding filter to the image, and the denoised image looked like this :
Applying whitewashing to this input produced the expected result,
This proves the generic working of the algorithm.
However, I still had to manually reset the padding values, which needs to be resolved.

As the midterm approaches, I have a lot at hand and on my mind.
Here is my new plan, in brief, and in the order I plan to carry them out :
To Do : 
  1. Document everything done so far. (2 notebooks - ImageTest with filters and the WhiteWashing algorithm)
  2. Update and Publish the documents.
  3. Document effects of domain-dependency (like this post deals with close-typed documents) and useful filters.
  4. Document summarising what I learnt during this period, and what made me change my initial plan.
  5. Test with different resolutions. (After mid-term)
  6.  Merge the notebooks.
  7. Document. 
I think I will stop coding for a while and get started with the documentation now. I will have to discuss the same with my mentor.
So Long.


Sunday, July 14, 2013

WhiteWashing and box files

Finally, I have fixed the padding-values for the WhiteWash algorithm and it seems to be working nicely.
Here is a preview of the current output :
However, the padding uses fixed integer values right now. It would be better if I could somehow relate this to the document statistics. It should be interesting to try out later.
This success with the algorithm allows me to focus on box file generation again. Currently, I have managed to generate this using my workaround in the virtual machine, after tweaking Debayan and Sayamindu's original algorithm using Pango.
The box file generated now looks like :

Which is an improvement over last time.
Now, I have to combine these two methods to try out my plan. 
Next, I will make an entry sometime this week about my updated plans and goals for the midterm evaluation period.
So Long.

Thursday, July 11, 2013

WhiteWashing Algo - Working

It took a bit longer than expected, but I have finally fixed the script to detect the shirorekha and whitewash over them.
I will write more about what I was doing wrong and how I fixed it later.
Sample Input :
 Sample Output : (NOTE : The black highlights show the marked coordinates)
Actual whitewashed Output :
On a closer look, we can see :
Here, we can see that the white lines are too thin. Hence, the next step for me is to find appropriate padding values around the coordinates when doing the whitewash.

So Long.


Monday, July 8, 2013

manual fix - working proof of concept

So I looked into the CSV files and found the highest-intensity y-coordinates using a pivot table, as well as the boundary coordinates for a line. With those coordinates, my code generates these two images.
Boundary markers :

Intended whitewashed output :
This shows that the algorithm is correct; however, there is something wrong in the way the code finds the maximum-intensity line. Currently, I am using the max(..., key=itemgetter(...)) method to find the maximum-intensity element within each sublist. I will try to do it some other way.
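For reference, the itemgetter-based lookup behaves like this on made-up (y-coordinate, intensity) pairs:

```python
from operator import itemgetter

# Hypothetical (y-coordinate, dark-pixel count) pairs for one text line;
# the shirorekha should be the row with the highest count.
rows = [(64, 12), (65, 14), (66, 87), (67, 31), (68, 9)]

shirorekha_y, count = max(rows, key=itemgetter(1))
print(shirorekha_y)  # 66
```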
So Long.


image wash - working

So, I fixed the code that wasn't working yesterday (putpixel). The input and output currently look like this:
Original :
Output (NOTE : the black line is actually intended to be white, it's black here just for visualisation purposes)
Now, here we can see that the whitewashing code works. But it was supposed to overlap with the "matra", which did not happen, so it indicates a bug in my code that finds the location of the matra.
I was following something I learnt in Prof. Mostafa's class (which I completed in June 2013) : NOT looking at the data before making the model. But it seems I will have to break the pattern and delve deeper.
So, next thing on the list : 
  1. Go back and fix the bug in code to find the matra.
Also, originally, the matra clipper was intended to work as a pre-processing module for Tesseract. But having worked with Tesseract 3.0 for real, I have had a change of plan.
Now I intend to use it to whitewash the matras, and then introduce gaps between every character, making them into individual glyphs.
Then, we can use these glyphs easily to make new box files and train Tesseract.
Also, Tesseract 3.0, combined with a few filters (see here), seems to work nicely on individual words at this level.
One sample input :
Final output :
At the word level, that's a whopping 100% accuracy with just a few filters and a small increase in resolution. This proves the mettle of Tesseract 3.0, and gives hope for Indic OCR.
However, as an insight, this project is turning out to be way more research-oriented than I initially thought. That makes it all the more interesting.

On a more personal note, I will be visiting Kolkata for a week from 14th to 22nd July. I will continue with the work, but may not be able to update the blog regularly, however, I can still be reached through emails.
And with that, it's about time to call it a night.
So Long.



Sunday, July 7, 2013

washed image test

Things are looking up since the last post.
The problems with the clipping code are fixed now, and it detects and writes the darkest pixels in each subdivision to a separate CSV (yes, I love CSVs). The modification it needed was to use the indexes instead of using the coordinates directly. That did the trick.
Also, while at it, I optimised the code and now everything is under 50 lines. Once again, Notebook has been a great help, and I'm convinced it was a good decision to switch from Eclipse+PyDev to EPD.
However, the file now looks like this :

Then, I tried to wash over those pixels with putpixel(255), i.e., a white line, but it doesn't seem to work.
Next on list :
  1. Fix the part to "wash" the pixels
  2. Talk to mentor if it's okay to push notebooks directly (sadly, I don't know much about software packaging)

Also, I have been thinking about the InfoRescue project, and I think I would modify my plans for the second half a bit. Will write more about that soon.
So Long.

Thursday, July 4, 2013

chop-chop

The work on the clipping code is going well. Thanks to Debayan for laying the ground-work for the algorithm.
It's complete to a point where it detects the lines as a whole, and for my convenience, prints the coordinates to a csv in a (line number, count) format. Here's an interesting glimpse :
The jump from line 66 to 124 represents a white gap, the space between the two lines.
However, it still needs some tuning. For instance, this :
The presence of 183 and 184 bothers me. It most probably represents some noise, and shows that the code still needs tuning.
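The gap detection itself is simple; on made-up row numbers it amounts to:

```python
# Hypothetical row numbers that contain dark pixels (the CSV's first
# column); a jump larger than 1 marks a white gap between two text lines.
dark_rows = [60, 61, 62, 63, 64, 65, 66, 124, 125, 126, 127]

gaps = [(a, b) for a, b in zip(dark_rows, dark_rows[1:]) if b - a > 1]
print(gaps)  # [(66, 124)]
```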
But I admit, IPython notebooks are a great way to modularise and optimise when experimenting with code. It only enhances my love for Python.
Also, my algorithm is enhanced in the sense that it should (in theory) be able to distinguish straight and curved lines; I'll get started on that modification after completing the basic code.
So Long.

p.s. - I'm starting to get worried now, as I think I am a bit behind the schedule I had set for myself. So push, push and chop-chop.

Wednesday, July 3, 2013

workaround for Pango

Having tried to make Pango work on Windows for over a week now, I have finally given up, and have switched to Fedora.
On Fedora, it took a little over 3 minutes to install the dependencies and get everything up and running. More reasons to love Linux.
Anyway, the first trial of generating the images looks like this.
It is good in itself, but still needs work, and I need to synchronise the same fonts between my Windows and Fedora environments, a silly mistake on my part not to have taken care of already.
Coding for the clipping algorithm is about 30% done. Right now, I am looking into connected-component analysis to see if it will work.
I plan to fix the fonts, and then work on the connected-component for the rest of the week.
So Long.


Friday, June 28, 2013

roadblock

I've hit a roadblock.
Trying to generate a new, hopefully improved, training set, I've been trying to run a script that would generate the boxfiles. But it needed 2 libraries, Cairo and Pango, to work.
I'm having trouble getting Pango to work on my Windows machine.
I'll spend another day or two on it, and if it doesn't work, will have to switch.
It's going to be a long night. 

Wednesday, June 26, 2013

more tests and results

As planned, I ran some denoising functions on the images (NOTE : conversion to float was needed).
After this, I compared the outputs from Tesseract on the 5 images side by side. The denoising filters I tried are :

  1. Bilateral filters: This seems to give improved results.
  2. Total variation filter using split-Bregman : No improvements.
  3. Chambolle : Very little Improvements.
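For anyone wanting to reproduce the comparison, here is a hedged sketch using scikit-image's restoration module on a synthetic noisy image (the weights are arbitrary):

```python
import numpy as np
from skimage.restoration import (denoise_bilateral, denoise_tv_bregman,
                                 denoise_tv_chambolle)

rng = np.random.default_rng(0)
# A noisy float image (the conversion to float noted above).
noisy = np.clip(0.8 + rng.normal(0, 0.1, (32, 32)), 0, 1)

results = {
    "bilateral": denoise_bilateral(noisy),
    "tv_bregman": denoise_tv_bregman(noisy, weight=5.0),
    "chambolle": denoise_tv_chambolle(noisy, weight=0.1),
}

for name, out in results.items():
    print(name, round(float(out.std()), 4))  # smoothing reduces variation
```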
The sample outputs can be seen here.

Bilateral filters
Chambolle filters
Just as a reminder, the original image looked like this :


Next up, some playing with the resolution before moving on to working with individual characters next week. So Long.

As a side note, I also tried the same procedure with another test image using a separate font, but as expected, it did not perform well. 

Monday, June 24, 2013

profile conversion and library tests

After spending a few days looking into different PIL and ImageMagick-based libraries, I have finally settled on what seems to be the most user-friendly and awesome among them: scikit-image.
Having used scikit-learn briefly, I was aware of the popularity of the scikit-* family of libraries, but was unaware of this particular member.

After 3 days (rather nights), I have figured out the data format and conversion techniques, and have pushed a sample script to the repo, which converts a test TIF from RGB to grayscale and then performs the following operations on it:
  1. Canny Edge Detector
  2. Sobel Operator
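The script boils down to something like this sketch (the synthetic image here stands in for the actual test TIF):

```python
import numpy as np
from skimage import color, feature, filters

# A synthetic RGB "page": light background with a dark block of text.
rgb = np.ones((64, 64, 3))
rgb[20:40, 10:50] = 0.1

gray = color.rgb2gray(rgb)          # RGB -> grayscale
edges_canny = feature.canny(gray)   # Canny edge detector (boolean edge map)
edges_sobel = filters.sobel(gray)   # Sobel operator (gradient magnitude)
```

Canny gives a clean boolean edge map, while Sobel returns a float gradient image that still needs thresholding.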
Here is how the output looks:
However, I still need to work on the image viewer. Presently, I am waiting on Travis. It is doing something (not sure what or why) as I write this post.

The new To Do list:
  1. Try increasing size of the files.
  2. Try out the output on Tesseract.
  3. Optimise the method for reading the file. 
  4. Optimise the overall code.
  5. Learn more about using the Notebooks. 


But it's time to hit the bed now, so the new experiments will have to wait for another night. So Long.

Wednesday, June 19, 2013

quick peek into the data.

It just occurred to me that even if I do not get the correct information about the dataset, I should still be able to use it.
Due to the vast amount of data, it can always be used in a k-fold scheme (maybe 4-fold), split into training and test sets. But I will have to make sure all the samples use the same font.
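A 4-fold split is simple enough to do by hand; a sketch (the integer samples here are just placeholders for the real images):

```python
def k_fold_splits(samples, k=4):
    """Split samples into k folds; yield (train, test) pairs where each
    fold takes one turn as the held-out test set."""
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

for train, test in k_fold_splits(list(range(8)), k=4):
    print(len(train), len(test))  # 6 2, four times
```

Every sample gets tested exactly once, which should make the evaluation numbers more trustworthy than a single fixed split.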

Tuesday, June 18, 2013

making box files

After trying for a few days, I succeeded in making a few box files today.
First, a view of the excerpt I used. I came across this piece when searching for Bengali pangrams. As far as I can tell at first glance, this passage suits my purposes.
So I started out and made a simple box file, which, after lots of manual corrections, looks like this.
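For anyone unfamiliar with the format: each line of a Tesseract box file is `character left bottom right top page`, with the origin at the bottom-left of the image. A hypothetical helper to emit such lines from the more usual top-left-origin bounding boxes (the characters and coordinates below are made up for illustration):

```python
def to_boxfile(chars, image_height, page=0):
    """Write Tesseract box-file lines from (char, left, top, right, bottom)
    boxes given in top-left image coordinates.  Tesseract wants
    'char left bottom right top page' with a BOTTOM-left origin,
    so the y axis has to be flipped."""
    lines = []
    for ch, left, top, right, bottom in chars:
        lines.append("%s %d %d %d %d %d" % (
            ch, left, image_height - bottom, right, image_height - top, page))
    return "\n".join(lines)

# Two hypothetical characters on a 100 px tall image:
out = to_boxfile([("ক", 10, 20, 30, 50), ("খ", 35, 20, 55, 50)], 100)
print(out)
```

The manual corrections I mentioned amount to editing exactly these lines: fixing the character and nudging the four coordinates.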
Now, this makes it clear that I have to do the shirorekha-chopping even before making box files. Otherwise, it detects whole words as single blobs. I'll try another box-file maker to see if the problem persists.
If it does, that could be a problem, as it means I have to stop the training procedure until my pre-processing module is complete.
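One simple way to locate the shirorekha before chopping is a horizontal projection profile: the headline row carries the most ink. A toy sketch, assuming a clean binarised word image (1 = ink, 0 = background):

```python
def shirorekha_row(bitmap):
    """Return the row index with the most ink pixels -- a rough guess
    at where the shirorekha (headline) runs across the word."""
    sums = [sum(row) for row in bitmap]  # horizontal projection profile
    return sums.index(max(sums))

# Toy word: a solid headline on row 1, with strokes hanging below it.
word = [[0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1],   # shirorekha
        [0, 1, 0, 0, 1, 0],
        [0, 1, 0, 0, 1, 0]]
print(shirorekha_row(word))  # 1
```

Blanking that row (and maybe a pixel above and below) should break the word back into separate character blobs before box-file generation.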

On the other hand, I spoke to Abhishek Gupta today, and he helped me out regarding the datasets I have been thinking about. But it seems they do not have the information about the "font" associated with them. However, I'll have a look at them over the week.
Also, the trouble with NPP++ is fixed, though it does not pick up the specific encoding by default, and since I change it manually, it asks to save the file, which is a bit irritating.
So Long.

Saturday, June 15, 2013

More Tests

So, after yesterday's bleak attempt, I decided to test the system on a random image of Bengali script, and passed it one of the results of a simple Google image search for "Bengali text": a full-page excerpt.

The result was terrible.

The noise level in the image was very high compared to the previous test sample, and the font was obviously different. Still, Tesseract did manage to identify the language and a few of the characters.

The work on the pre-processing algorithm is going well, and I can start coding as soon as I learn more about the imaging library. I'll go through the previous work once again to look for some more help in this regard.
However, I still haven't figured out a way to make NPP++ display Bengali characters.

So Long. 

Friday, June 14, 2013

Testing the Tesseract.

After 10 days, I am back, and with lessons to be shared.
First of all, the much-expected mention of the "Delhi heat". It is still burning.

However, from the corner of my couch, I have been enjoying the mangoes and messing with the Tesseract.
Starting with a simple test, I typed something in Bengali, converted it to an image, and fed it to Tesseract. As expected, BOOM! The result was devastating.

The interesting similarity in the errors got me curious, and for reasons yet unknown, I mapped the input and output directly. A pretty simple thing to do, actually.


The orange-ish blobs show the corresponding mappings. Note the similarity between the shapes of "ম" and "W", for example, and also that "ম" is almost at the mean position for the particular blob.
This shows that the errors, obviously, are NOT random, and gives us some insight into them.
(NOTE to Self : This could be useful for the post-processing module.)
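The mapping can be made systematic with a simple confusion count, assuming the ground truth and the OCR output are already aligned one-to-one (a big simplification; real alignment would need edit distance). A sketch with made-up aligned strings:

```python
from collections import Counter

def confusion_counts(ground_truth, ocr_output):
    """Count how often each input character was read as each output
    character, given two aligned equal-length strings."""
    return Counter(zip(ground_truth, ocr_output))

# Hypothetical aligned strings: 'ম' repeatedly misread as 'W'.
counts = confusion_counts("মমাম", "WWaW")
# counts[("ম", "W")] == 3
```

Accumulated over many test pages, such a table would be exactly the kind of error model a post-processing module could invert.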

But then, having looked around a bit for box files, and after fixing some configurations, I could get it to work a bit more decently.


However, these are just tests, and I am not going to push them to the repo yet.

Now time to work up a new "To Do" list :

  1. Finalize pre-processing algorithm and move/share from Evernote and through the BookWorm repository.
  2. Try to get support for "Bengali" in NPP++. - Would make life a bit easier :)
  3. Decide on and start with ImageMagick or something similar.
  4. Test my basic image on different versions of Tesseract (OPTIONAL) - still looking into the differences in detail, and they look very different indeed.
Also, I tried to contact Abhishek Gupta, who is working on a similar project, to have a chat. It should be interesting to talk with him, as I feel our projects are closely related. Maybe I will be using the same dataset, and that should be the benchmark! Maybe my insight into the "errors" mentioned above might help him in formulating his 'error model'. 
We shall see what happens when he replies to my email.

So Long.

Thursday, June 6, 2013

Canopy and CI

The days since the last post have been interesting and busy, between preparations for BookWorm, trying to understand CI and finishing up the LFD course, not to mention the numerous meetups before everyone departs for the summer.
But things are looking up today. The LFD final exam is 80% done, and hopefully will be complete by this time tomorrow. And Travis-CI has been set up and linked with the BookWorm repository. The current commits give an error, but most probably that is due to the lack of any actual code in there. (NOTE TO SELF: needs more reading.)
However, the excitement of going home after almost 10 months makes me jumpy, and reading about the weather in Delhi is not helping at all.
Among the good things, 3 things have been crossed off from the last 'To Do' list.

Done :
  1. Understand the training procedure for Tesseract 3.02. - pretty clear view inside the head
  2. Make sure no documents are left to be submitted to Melange. - Done
  3. Decide whether to use IPython or not. - not sure about IPython, but definitely switching to Canopy. Almost in love with EPD now.
To Do :
  1. Start making the box files. - looks like it will have to wait till I get to Kolkata next week, as I think it's better to complete the process without a disruption in the flow.
  2. Pack bags for home. - Friday night/Saturday early morning should be the time.
  3. Read about the build issue with CI , though according to their blog, a documentation change should skip the build process. It looks like the problem is with the build configuration, hopefully it will be clearer with the next commit.
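For reference, a minimal .travis.yml along these lines should get the builds going once there is real code in the repository (the dependency list is a guess at what BookWorm will need, not the actual config):

```yaml
language: python
python:
  - "2.7"
install:
  - pip install numpy scikit-image
script:
  - nosetests
```

Also worth noting: putting [ci skip] in a commit message makes Travis skip the build entirely, which covers the documentation-only-change case their blog mentions.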

Monday, June 3, 2013

beginning the BookWorm

Summer 2013 has arrived. And so begins the Summer of Code.
Today marks a week since the results were announced, and with help from my mentors at Ankur India, we have set up a working repository for the BookWorm.
In the meantime, I have been busy making sure I submit the necessary documents to Melange, while trying to get familiarised with IPython, especially the notebooks, which could be helpful for documenting the code, and also trying to wrap up matters at University.

As planned, I am currently reading the "Training Tesseract" procedure, and trying to understand the differences in the dependencies of 2.0x and 3.0x, which seem to have changed a lot.
I hope to train Tesseract 3.02 and then start with my scripts for the pre-processing module soon.
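For my own reference, the training sequence documented for Tesseract 3.0x boils down to roughly the following (the "fontname" in the file names is a placeholder for the actual Bengali font; 3.02 adds the shapeclustering step):

```shell
# Generate an initial box file from the training image (then hand-correct it)
tesseract ben.fontname.exp0.tif ben.fontname.exp0 batch.nochop makebox
# Train on the corrected box file, producing a .tr file
tesseract ben.fontname.exp0.tif ben.fontname.exp0 box.train
# Extract the character set from the box file
unicharset_extractor ben.fontname.exp0.box
# Cluster shapes and features (shapeclustering is new in 3.02)
shapeclustering -F font_properties -U unicharset ben.fontname.exp0.tr
mftraining -F font_properties -U unicharset -O ben.unicharset ben.fontname.exp0.tr
cntraining ben.fontname.exp0.tr
# Prefix the outputs (unicharset, inttemp, pffmtable, normproto, shapetable)
# with "ben." and bundle them into ben.traineddata
combine_tessdata ben.
```

This is my reading of the wiki so far; I'll correct it here if the 3.02 specifics turn out to differ once I actually run it.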

The "To Do" list for this week looks like : 
  1.  Understand the training procedure for Tesseract 3.02.
  2. Make sure no documents are left to be submitted to Melange.
  3. Decide whether to use IPython or not.
  4. Start making the box files.
  5. Pack bags for home. :)