
Thread: Japanese OCR evaluation

  1. #1
    Shelwien (Administrator)

    Japanese OCR evaluation

    Sometimes I play with OCR (it's actually closely related to
    compression, via MDL-based recognition and such), and a while ago I made
    a GUI to manually mark and extract symbol bitmaps from a book page,
    to create an initial dictionary for my own OCR.

    Then it's also necessary to somehow associate the symbol bitmaps with
    character codes, so I decided to get some help from existing software, and
    used the extracted bitmaps to generate a table (perl script -> html -> msword -> pdf):
    http://nishi.dreamhosters.com/u/1b2.pdf
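
    A minimal sketch of the "bitmaps -> html table" step. The original used a
    perl script whose details are not given, so the Python version below, the
    function name build_table_html, the zero-padded ids and the cells-per-row
    layout are all assumptions for illustration only.

```python
import html


def build_table_html(bitmap_files, per_row=4):
    """Return an HTML page with one (id, bitmap) cell pair per symbol.

    bitmap_files: list of image file names (hypothetical naming).
    per_row: how many symbols go on one table row (assumed layout).
    """
    rows = []
    for start in range(0, len(bitmap_files), per_row):
        cells = []
        chunk = bitmap_files[start:start + per_row]
        for idx, name in enumerate(chunk, start=start):
            # The printed id lets a later script find the symbol in OCR output.
            cells.append(f"<td>{idx:04d}</td>")
            cells.append(f'<td><img src="{html.escape(name, quote=True)}"></td>')
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return ('<html><body><table border="1">\n'
            + "\n".join(rows)
            + "\n</table></body></html>\n")
```

    Printing an id next to each bitmap is what would make it possible to look
    the symbol up again in the OCR text afterwards.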

    After that, I loaded the pdf into the two OCR programs with Japanese
    language support that I was able to find (Finereader 10 and Readiris 11),
    saved the OCR results as UTF-8 plaintext, and processed them again
    to make an expanded table:
    http://nishi.dreamhosters.com/u/1b2_frri.pdf
    (columns: 1=internal id, 2=bitmap, 3=finereader result, 4=readiris result;
    -/- means that the script couldn't find the id for the symbol in the OCR output)

    This table was then processed manually to sort out the correctly
    recognized symbols (unfortunately I don't have much experience with
    kanji, and sometimes they look considerably different in different fonts,
    so some mistakes may still remain there...).

    And finally, another script collected the statistics by matching the
    filtered table of supposedly correct symbols against the OCR results table,
    and the results are... well...
    Code:
    Total symbols      = 453 (including not recognized)
    Finereader matches = 339
    Readiris11 matches = 145
    Total matches      = 363 (symbols recognized by either program)
    Ok, Readiris looks like it doesn't understand tables, so it frequently
    tried to "recognize" the frames, while Finereader explicitly supports them.
    Also there are quite a few special symbols like quotes and such,
    and the symbols are isolated, so these programs can't properly use word statistics.
    But still, it sure looks like there's room for some OCR software which
    would actually work ;)
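
    The counting step can be sketched roughly as follows. This is not the
    original script: the function name match_stats, the id -> symbol
    dictionaries and the toy symbols are assumptions made only to show how
    per-program matches and the "either program" total relate.

```python
def match_stats(truth, finereader, readiris):
    """Count ids whose OCR result equals the manually verified symbol.

    truth, finereader, readiris: dicts mapping internal id -> symbol.
    An id missing from an OCR dict counts as not recognized.
    """
    fr_hits = {i for i, ch in truth.items() if finereader.get(i) == ch}
    ri_hits = {i for i, ch in truth.items() if readiris.get(i) == ch}
    return {
        "total": len(truth),
        "finereader": len(fr_hits),
        "readiris": len(ri_hits),
        # Union: a symbol counts once if at least one program got it right.
        "either": len(fr_hits | ri_hits),
    }


# Toy data (hypothetical, not taken from the actual tables).
truth = {1: "日", 2: "本", 3: "語", 4: "の"}
finereader = {1: "日", 2: "本", 3: "話"}
readiris = {2: "本", 3: "-", 4: "の"}
stats = match_stats(truth, finereader, readiris)
```

    If "Total matches" is such a union, the numbers above imply by
    inclusion-exclusion that 339 + 145 - 363 = 121 symbols were recognized by
    both programs.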

    Update: recalculated the stats after discarding "." and "," and removing
    "|" from the OCR results (it's frequently added in place of the frames)
    Code:
    Total symbols      = 425
    Finereader matches = 357 (84.0%)
    Readiris11 matches = 161 (37.9%)
    Total matches      = 388 (91.3%)
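
    A sketch of the cleanup and of the percentage arithmetic, under the
    assumption that "." and "," are dropped from the reference table and "|"
    is stripped from the OCR strings before comparison. The helper names
    (clean_ocr, filter_truth, percent) are hypothetical.

```python
DISCARDED = {".", ","}  # symbols excluded from the reference table


def clean_ocr(text):
    """Remove '|' characters, which the OCR tends to emit for table frames."""
    return text.replace("|", "")


def filter_truth(truth):
    """Drop reference entries whose symbol is in DISCARDED."""
    return {i: ch for i, ch in truth.items() if ch not in DISCARDED}


def percent(matches, total):
    """Format a match count as a percentage with one decimal place."""
    return f"{100.0 * matches / total:.1f}%"


# Re-deriving the percentages from the counts reported above.
total = 425
report = {
    "Finereader": percent(357, total),
    "Readiris11": percent(161, total),
    "Either": percent(388, total),
}
```

    With the reported counts this gives 84.0%, 37.9% and 91.3%, which agrees
    with the table; the total also drops by 453 - 425 = 28 entries.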
    Last edited by Shelwien; 30th May 2010 at 13:42.

  2. #2
    Member

    Thanks! Interesting test
