Sometimes I play with OCR (it's actually closely related to
compression, via MDL-based recognition and such), and a while ago I made a GUI to mark and extract the
symbol bitmaps from a book page (manually, to create an initial dictionary
for my own OCR).
Then it's also necessary to somehow associate the symbol bitmaps with codes,
so I decided to get some help from existing software, and used the
extracted bitmaps to generate a table (perl script -> html -> msword -> pdf).
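For illustration, here is a rough sketch of the "script -> html" step. It's in Python rather than perl, and the two-column layout (id, bitmap), the function name `build_table_html` and the image paths are all my assumptions, not the actual script:

```python
from html import escape

def build_table_html(bitmaps):
    """Build an HTML page with one table row per symbol.

    bitmaps: list of (symbol_id, image_path) pairs (hypothetical input format).
    """
    rows = []
    for symbol_id, image_path in bitmaps:
        # First cell holds the internal id as text, second cell the bitmap.
        rows.append(
            '<tr><td>{}</td><td><img src="{}"></td></tr>'.format(
                escape(str(symbol_id)), escape(image_path, quote=True)
            )
        )
    return (
        '<html><body><table border="1">\n'
        + "\n".join(rows)
        + "\n</table></body></html>\n"
    )
```

The visible id in the first cell is what lets a later script find each symbol again in the OCR output.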
After that, the pdf was loaded into the two OCR programs with Japanese
language support that I was able to find (Finereader 10 and Readiris 11),
and the OCR results were saved as UTF-8 plaintext and processed again
to make an expanded table:
(columns: 1=internal id, 2=bitmap, 3=finereader result, 4=readiris result,
-/- means that the script couldn't find the id for the symbol in the OCR output)
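The id lookup could be sketched roughly like this, assuming the OCR plaintext is a whitespace-separated stream where each id is followed by its recognized symbol. The function name `extract_symbols` and that layout are assumptions; real OCR output is messier (stray frame characters and so on):

```python
MISSING = "-/-"  # marker for ids that could not be located in the OCR output

def extract_symbols(ocr_text, ids):
    """Map each id to the token that follows it in the OCR text."""
    id_set = set(ids)
    tokens = ocr_text.split()
    found = {}
    for tok, nxt in zip(tokens, tokens[1:]):
        # Take the first occurrence of an id, and skip the case where the
        # next token is itself an id (the symbol was dropped by the OCR).
        if tok in id_set and tok not in found and nxt not in id_set:
            found[tok] = nxt
    return {sid: found.get(sid, MISSING) for sid in ids}
```

Running it once per OCR program gives the two result columns of the expanded table.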
Then again, this was processed manually to sort out the correctly
recognized symbols (unfortunately I don't have much experience with
kanji and sometimes they look considerably different in different fonts,
so maybe mistakes still remain there...).
And finally, another script collected the statistics by matching the
filtered table of supposedly correct symbols against the OCR results table,
and the results are... well...
Total symbols = 453 (including not recognized)
Finereader matches = 339
Readiris11 matches = 145
Total matches = 363 (symbols recognized by either program)
Ok, Readiris looks like it doesn't understand tables, so it frequently
tried to "recognize" the frames, but Finereader explicitly supports them.
Also there's quite a number of special symbols like quotes and such,
and symbols are separated, so these programs can't properly use the word statistics.
But still, it sure looks like there's a place for some OCR software which
would actually work ;)
Update: I recalculated the stats after discarding "." and "," and removing
"|" from the OCR results (they're frequently added for frames):
Total symbols = 425
Finereader matches = 357 (84.0%)
Readiris11 matches = 161 (37.9%)
Total matches = 388 (91.3%)
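For reference, the counting can be sketched like this. The row format (correct symbol, Finereader output, Readiris output) and the names `collect_stats`, `strip_frames` and `percent` are assumptions for illustration, not the actual script:

```python
MISSING = "-/-"  # id was not found in the OCR output

def strip_frames(text):
    """Remove '|' characters that OCR tends to add for table frames."""
    return text.replace("|", "").strip()

def collect_stats(rows, ignored=(".", ",")):
    """rows: iterable of (correct, finereader_out, readiris_out) strings."""
    total = fr = ri = either = 0
    for correct, fr_out, ri_out in rows:
        if correct in ignored:
            continue  # "." and "," are discarded entirely
        total += 1
        fr_ok = fr_out != MISSING and strip_frames(fr_out) == correct
        ri_ok = ri_out != MISSING and strip_frames(ri_out) == correct
        fr += fr_ok
        ri += ri_ok
        either += fr_ok or ri_ok  # recognized by at least one program
    return {"total": total, "finereader": fr, "readiris": ri, "either": either}

def percent(matches, total):
    """Match rate as a percentage, rounded to one decimal place."""
    return round(100.0 * matches / total, 1)
```

With the totals above, `percent(357, 425)` gives 84.0, `percent(161, 425)` gives 37.9 and `percent(388, 425)` gives 91.3.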