Cool, let us know when you open source it.
Regarding ABBYY: it does not seem to be free, so it did not meet my requirement that all tools be free/open. I also found a comparison of both: http://lib.psnc.pl/Content/358/PSNC_Tesseract-FineReader-rep...
Obviously it's hard to compare, and both have their strengths.
I appreciate your focus on open-source here since this feels like a really sensitive security scenario.
I mean, personally, I love the idea of just scanning all of my paper documents into a massive electronic database, but I'd really need to be able to trust the software. If something's closed-source, I'd be worried about everything from spyware to malicious bugs that might intentionally introduce errors (e.g., social-engineering attacks, or simple trolling) to unintentional bugs.
Not that open-source is a magical fix to everything, but it's a vast improvement over closed-source solutions.
I cannot say what has changed lately (meaning in the last several years), and the test above is mainly about antique Polish documents, printed before 1850 in antiqua and gothic fonts. But when I needed reliable OCR software for work (reading "normal" modern documents, almost invariably printed in Arial or Times New Roman), there was simply no contest: the difference between the accuracy of Tesseract (very, very low) and FineReader (very high) was very noticeable.
Either way, the OCRed documents still needed some serious editing/correcting.
I also suspect that part of the good (or bad) results might be connected to the language the documents are written in (Italian in my case), to the "width" of the vocabulary used, and to the way the document is structured. In my case there were often tables, which for some reason were easily identified by FineReader and rarely or never by Tesseract.
In any case, I would say (without having actually measured it) that Tesseract was roughly at or below 60% accuracy, whereas FineReader was around 80%, possibly a little more.
But even though it was the best one (we also tested other software, I cannot remember the names), FineReader was far from being a "set and forget" kind of tool.
I would be curious to know how you would rate the word accuracy of your solution.
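For what it's worth, one common way to put a number on "word accuracy" is to compare the OCR output against a hand-corrected reference and count word-level edits (substitutions, insertions, deletions). The sketch below is only an illustration of that idea, not how any of the figures in this thread were obtained; the function names are made up for the example, and it assumes plain whitespace tokenization with no normalization of case or punctuation.

```python
def word_edit_distance(ref_words, ocr_words):
    """Levenshtein distance computed over word tokens instead of characters."""
    prev = list(range(len(ocr_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i]
        for j, ocr_word in enumerate(ocr_words, start=1):
            cost = 0 if ref_word == ocr_word else 1
            cur.append(min(prev[j] + 1,          # word missing from OCR output
                           cur[j - 1] + 1,       # extra word in OCR output
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def word_accuracy(reference, ocr_output):
    """Return 1 - (word edits / reference word count), clamped at 0.

    Assumption: both texts are split on whitespace only.
    """
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        raise ValueError("reference text is empty")
    errors = word_edit_distance(ref_words, ocr_words)
    return max(0.0, 1.0 - errors / len(ref_words))


if __name__ == "__main__":
    reference = "the quick brown fox jumps"
    ocr_output = "the quick brown fax jumps"
    print(f"word accuracy: {word_accuracy(reference, ocr_output):.0%}")
```

A single page is a very small sample, so a number like this only means something when averaged over a representative batch of documents.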
Anyway - and only as another data point - it was 1986 (or possibly 1987) when I had Xerox representatives tell me that "soon" (meaning in no more than a few years) the office would become "paperless".
Wow - now I feel foolish! :-) How many more years of Unix use will it take before it sinks into my thick skull that, 99 times out of 100, the feature I want is never so obscure that it's not already catered for...