
What is OCR?

OCR stands for optical character recognition: the process of scanning physical text (from a photograph or a book, for example) and converting it to digital text. Accuracy depends on a number of factors, including font family and font weight.

Correcting OCR Mistakes

As with any data munging activity, the first step is always to explore your data. Understanding the characteristics of your data input can help inform the most effective way to process it.

Recently I've been working with an OCRed copy of a book from the late 1800s (1895, to be exact). As is characteristic of books printed in that period, the original typeset was bold, the ink was somewhat splotchy, and in some places the pages of the original book were clearly very worn. Some words were completely illegible. As a result, there were quite a few OCR errors to correct before the data would become useful. Trouble was, the document consisted of 30,000 lines. Checking it by hand, line by line and character by character, was just not practical. So where to start?

As always, start simple and work your way to the more challenging cases.

Catching the Obvious Mistakes

The first step was noticing the types of errors. The book I was working with was a dictionary, and not likely to contain any punctuation besides commas and periods. If I encountered any other punctuation, I knew I had an OCR issue. So I wrote a little Java program to run through the file and check it line by line.

    if (line.matches(".*[\\*#><=^;\\?\\)\\(&'\"~:!_}{].*")) {
        throw new RuntimeException("Unexpected punctuation in line: " + line);
    }

When the exception was thrown, I'd search for the line that tripped it and manually correct it.
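Throwing on the first hit means one run per error. As a minimal sketch of a variation on the same check, the class below collects every flagged line together with its line number, so a whole batch can be corrected per run. The names `PunctuationScan` and `findSuspectLines` are made up for illustration; the character class is the one from the check above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PunctuationScan {
    // Same character class as the check above, compiled once and reused.
    private static final Pattern UNEXPECTED =
            Pattern.compile("[\\*#><=^;\\?\\)\\(&'\"~:!_}{]");

    /** Returns "lineNumber: text" for every line containing unexpected punctuation. */
    static List<String> findSuspectLines(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            // find() looks for the pattern anywhere in the line,
            // so no leading or trailing ".*" is needed.
            if (UNEXPECTED.matcher(lines.get(i)).find()) {
                hits.add((i + 1) + ": " + lines.get(i));
            }
        }
        return hits;
    }
}
```

The input list could come from something like `Files.readAllLines(path)`; at 30,000 lines, holding the whole file in memory should not be a problem.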

Another dead giveaway that a word contained an OCR mistake was a capitalization error. The dictionary capitalized the entirety of the first word in each section, as well as the first character of the first word in each line. If a word was mixed case, either starting with a lower case character or containing at least two upper case characters, it had at least one OCR mistake. A common mistake was an l or t character transformed into a capital I.

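    // A lower case letter followed, within the same word, by an upper case letter.
    if (line.matches(".*[a-z]\\S*[A-Z].*")) {
        throw new RuntimeException("Unexpected capitalization in line: " + line);
    }
    // Two or more upper case letters anywhere in the line.
    if (line.matches(".*[A-Z].*[A-Z].*")) {
        throw new RuntimeException("Unexpected capitalization in line: " + line);
    }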

Catching the Likely Mistakes

Next, because I was working with a dictionary, I could be reasonably certain that the original contained no spelling errors, so the next step was to scan for misspelled words. Unfortunately, this approach ended up backfiring a little. Language has a tendency to evolve, and between 70 and 80% of the misspellings I encountered were rare, sometimes archaic, but still valid spellings. I had to get a little more specific.

Certain characters were often mistaken for others. With a thick typeset, t, f, and l can look similar, as can i and r. An m was sometimes mistaken for iii. In fact, ii was the most common OCR mistake I encountered. While it's a valid sequence (as in the word skiing), it's rare.

The next step was to scan for character sequences that were the same letter three times in a row, or the same unlikely letter twice in a row:

    if (line.matches(".*([a-z])\\1\\1.*")) {
        throw new RuntimeException("Unlikely Character Sequence (3 in a row) in line: " + line);
    }
    if (line.matches(".*([ij])\\1.*")) {
        throw new RuntimeException("Unlikely Character Sequence (2 in a row) in line: " + line);
    }

As it turned out, skiing was not in my dictionary. If it had been, I would have added another check before throwing my RuntimeException.
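For illustration, that extra check might look something like the sketch below: split the line into words and skip any word on a short allowlist before flagging a doubled i or j. The allowlist contents and the names `DoubledLetterCheck` and `hasSuspectWord` are hypothetical; I never needed this in practice.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoubledLetterCheck {
    // Hypothetical allowlist of valid words that contain a doubled i or j.
    private static final Set<String> ALLOWED = Set.of("skiing", "taxiing");

    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");
    private static final Pattern DOUBLED = Pattern.compile("([ij])\\1");

    /** True if some word on the line has "ii" or "jj" and is not allowlisted. */
    static boolean hasSuspectWord(String line) {
        Matcher words = WORD.matcher(line);
        while (words.find()) {
            String word = words.group();
            boolean allowed = ALLOWED.contains(word.toLowerCase(Locale.ROOT));
            if (DOUBLED.matcher(word).find() && !allowed) {
                return true;
            }
        }
        return false;
    }
}
```

Checking word by word rather than line by line matters here: a line that contains skiing could still contain a genuine ii error in another word.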

Finally, I checked for likely OCR mistakes in common suffixes and prefixes. For example, the letter n was sometimes mistaken for u, so I'd search for words ending in -uess. As before, while there are words that end with the -uess character sequence, more often than not it was an OCR mistake. Similarly, there's a good chance that a word ending in -tlon was originally a word ending in -tion. The goal was to catch multiple mistakes with each pass, so that the time spent manually verifying them was worth the trade-off.
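A minimal sketch of the suffix scan, using only the two suffixes mentioned above. The names `SuffixCheck` and `findSuspectWords` are made up for illustration, and a real list of suspect suffixes would be longer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SuffixCheck {
    // Word endings that are more often OCR damage than real spellings:
    // "uess" is probably "ness", and "tlon" is probably "tion".
    private static final Pattern SUSPECT_SUFFIX =
            Pattern.compile("\\b[A-Za-z]+(?:uess|tlon)\\b");

    /** Returns every word on the line that ends in a suspect suffix. */
    static List<String> findSuspectWords(String line) {
        List<String> hits = new ArrayList<>();
        Matcher m = SUSPECT_SUFFIX.matcher(line);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}
```

Note that a valid word such as guess is flagged too. That is the expected cost of this kind of check: every hit still needs a quick manual look.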

Catching the Possible Mistakes

As a final step, I ran the digital dictionary through my own custom-built spell checker, which used word stems. A word stem is the morphological root of a word. For example, teacher, teach, and teaching might all have the same morphological root (teach). When the spell checker incorrectly flagged a rare word as misspelled, I'd add its stem to the dictionary. That way, when I encountered a rare word like appropinquation, I wouldn't also have to check the validity of appropinquate.
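The idea can be sketched in a few lines. This is not the spell checker described above: the `stem` method here is a crude stand-in that strips one suffix from a tiny hard-coded list, where a real stemmer handles far more cases. The names `StemSpellCheck`, `addWord`, and `isKnown` are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class StemSpellCheck {
    // Toy suffix list, checked in order; a real stemmer is much more thorough.
    private static final String[] SUFFIXES = {"ation", "ing", "ate", "er"};

    private final Set<String> knownStems = new HashSet<>();

    /** Crude stand-in for a stemmer: strips the first matching suffix. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so short words are left alone.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    /** Records a manually verified word by storing its stem. */
    void addWord(String word) {
        knownStems.add(stem(word));
    }

    /** True if the word's stem matches a previously verified stem. */
    boolean isKnown(String word) {
        return knownStems.contains(stem(word));
    }
}
```

After `addWord("appropinquation")`, a call to `isKnown("appropinquate")` returns true, because both words reduce to the same stem under this toy rule set.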

A word of warning: an incorrectly spelled word can sometimes be stemmed to a correctly spelled word. Most stemmers will treat an -ence and an -ance suffix similarly, so my stemming dictionary approach would think both parlance and parlence are valid words. In practice it was a non-issue, since most OCR mistakes resulted in obviously wrong words.
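The collision is easy to reproduce with a toy example. The sketch below assumes a stemmer that strips -ance and -ence alike, mirroring the behavior described above; it is not a real stemming algorithm, and the names `StemCollisionDemo` and `accepts` are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

final class StemCollisionDemo {
    /** Toy stemmer that strips -ance and -ence alike. */
    static String stem(String word) {
        if (word.endsWith("ance") || word.endsWith("ence")) {
            return word.substring(0, word.length() - 4);
        }
        return word;
    }

    /** True if the candidate's stem equals the stem of some known word. */
    static boolean accepts(Set<String> knownWords, String candidate) {
        Set<String> stems = new HashSet<>();
        for (String known : knownWords) {
            stems.add(stem(known));
        }
        return stems.contains(stem(candidate));
    }
}
```

With only parlance in the known set, `accepts` returns true for parlence as well, since both reduce to parl. A misspelling outside the suffix, such as parlanse, is still rejected.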