OCR stands for optical character recognition: the process of scanning physical text (from a photograph or a book, for example) and converting it to digital text. Accuracy depends on a number of factors, including font family and font weight.
As with any data munging activity, the first step is always to explore your data. Understanding the characteristics of your data input can help inform the most effective way to process it.
Recently I've been working with an OCRed copy of a book from the late 1800s, 1895 to be exact. As is characteristic of books printed in that period, the original typeface was bold, the ink was somewhat splotchy, and in some places the pages of the original book were clearly very worn. Some words were completely illegible. As a result, there were quite a few OCR errors to correct before the data would become useful. Trouble was, the document consisted of 30,000 lines. Checking it line by line, character by character, by hand was just not practical. So where to start?
As always, start simple and work your way to the more challenging cases.
The first step was noticing the types of errors. The book I was working with was a dictionary, and not likely to contain any punctuation besides commas and periods. If I encountered any other punctuation, I knew I had an OCR issue. So I wrote a little Java program to run through the file and check it line by line.
The program threw an exception whenever it hit an unexpected character. When that happened, I'd search for the line that tripped it and correct it manually.
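A minimal sketch of what such a check might look like follows. The class name `PunctuationCheck` and the exact set of allowed characters are assumptions made for illustration, not the original program.

```java
import java.util.List;

final class PunctuationCheck {

    // Assumption: letters, digits, whitespace, commas, and periods are the
    // only characters expected in the dictionary text.
    static boolean isAllowed(char c) {
        return Character.isLetterOrDigit(c)
                || Character.isWhitespace(c)
                || c == ','
                || c == '.';
    }

    // Stops at the first unexpected character and reports its 1-based line number.
    static void check(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            for (int j = 0; j < line.length(); j++) {
                char c = line.charAt(j);
                if (!isAllowed(c)) {
                    throw new RuntimeException(
                            "Unexpected character '" + c + "' on line " + (i + 1) + ": " + line);
                }
            }
        }
    }
}
```

Failing fast on the first bad line keeps the loop simple: fix the line, rerun, repeat until the whole file passes.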
Another dead giveaway that a word contained an OCR mistake was a capitalization error. The dictionary capitalized the entirety of the first word in each section, and capitalized the first word in each line. If a word had a lowercase letter and a capital letter that was not the first character, it had at least one OCR mistake. Common mistakes were a lowercase l or t being read as a capital I.
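The capitalization rule translates almost directly into code. This is a sketch under the rule as stated above; the class and method names are hypothetical.

```java
final class CapitalizationCheck {

    // Flags a word that mixes at least one lowercase letter with a capital
    // letter anywhere other than the first position.
    static boolean hasSuspiciousCapital(String word) {
        boolean hasLower = false;
        boolean hasInnerUpper = false;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (Character.isLowerCase(c)) {
                hasLower = true;
            } else if (i > 0 && Character.isUpperCase(c)) {
                hasInnerUpper = true;
            }
        }
        return hasLower && hasInnerUpper;
    }
}
```

Fully capitalized headwords such as ABACK and ordinary words such as Backward pass, while a misread like wiIl (for will) is flagged.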
Next, because I was working with a dictionary, I could be reasonably certain that the source contained no spelling errors, so the next step was to scan for words that were spelled incorrectly. Unfortunately, this approach ended up backfiring a little. Language has a tendency to evolve, and 70-80% of the misspellings I encountered were instances of a rare, sometimes archaic, but still valid word. I had to get a little more specific.
Certain characters were often mistaken for others. With a thick typeface, t, f, and l can look similar, as can i and r. An m was sometimes mistaken for iii. In fact, ii was the most common OCR mistake I encountered. While it's a valid sequence (as in the word skiing), it's rare.
The next step was to scan for character sequences with the same letter three times in a row, or the same unlikely letter twice in a row.
As it turned out, skiing was not in my dictionary. If it had been, I would have added another check before throwing my RuntimeException.
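The repeated-letter scan might be sketched as follows. The set of "unlikely" doubled letters and the `KNOWN_VALID` allow-list are illustrative assumptions; only ii and iii come from the text above, and the allow-list shows one way the extra check for a word like skiing could be added.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class RepeatedLetterCheck {

    // Assumption: letters that rarely appear doubled in English.
    private static final String UNLIKELY_DOUBLES = "ijqvwxy";

    // Hypothetical allow-list for valid words that would otherwise be flagged.
    private static final Set<String> KNOWN_VALID =
            new HashSet<>(Arrays.asList("skiing"));

    // True if any letter appears three times in a row, or an unlikely
    // letter appears twice in a row.
    static boolean isSuspicious(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (int i = 1; i < w.length(); i++) {
            char c = w.charAt(i);
            if (!Character.isLetter(c) || c != w.charAt(i - 1)) {
                continue;
            }
            if (UNLIKELY_DOUBLES.indexOf(c) >= 0) {
                return true; // unlikely letter doubled, e.g. "ii"
            }
            if (i >= 2 && w.charAt(i - 2) == c) {
                return true; // any letter tripled, e.g. "lll"
            }
        }
        return false;
    }

    static void check(String word) {
        if (KNOWN_VALID.contains(word.toLowerCase(Locale.ROOT))) {
            return;
        }
        if (isSuspicious(word)) {
            throw new RuntimeException("Suspicious repeated letters in: " + word);
        }
    }
}
```

Ordinary doubles such as the tt in letter pass, while coiiimon (for common) is flagged.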
Finally, I checked for likely OCR mistakes in common suffixes and prefixes. For example, sometimes n was mistaken for u, so I'd search for words ending in -uess. As before, while there are words that end with the -uess character sequence, more often than not it was an OCR mistake. Similarly, there's a good chance that a word ending in -tlon was originally a word ending in -tion. The goal was to catch multiple mistakes with each pass to make the trade-off in time worth it.
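A suffix pass like this can be driven by a small lookup table. The sketch below uses only the two suffix pairs mentioned above; the class name and the idea of returning a suggested correction are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class SuffixCheck {

    // Suspect suffix -> the suffix it most likely was in the original.
    private static final Map<String, String> SUSPECT_SUFFIXES = new LinkedHashMap<>();
    static {
        SUSPECT_SUFFIXES.put("uess", "ness");
        SUSPECT_SUFFIXES.put("tlon", "tion");
    }

    // Returns a suggested correction if the word ends in a suspect suffix.
    // The suggestion is only a hint and still needs a human look.
    static Optional<String> suggest(String word) {
        for (Map.Entry<String, String> entry : SUSPECT_SUFFIXES.entrySet()) {
            String suspect = entry.getKey();
            if (word.endsWith(suspect)) {
                String head = word.substring(0, word.length() - suspect.length());
                return Optional.of(head + entry.getValue());
            }
        }
        return Optional.empty();
    }
}
```

Here kinduess yields the suggestion kindness and natlon yields nation. A valid word such as guess is flagged too, which is why each hit is reviewed rather than replaced automatically.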
As a final step, I ran the digital dictionary through my own custom-built spell checker using word stems. A word stem is the morphological root of a word. For example, teacher, teach, and teaching might all have the same morphological root (teach). When the spell checker incorrectly identified a rare word as misspelled, I'd add its stem to the dictionary. That way, when I encountered a rare word like appropinquation, I wouldn't also have to check the validity of appropinquate.
A word of warning: an incorrectly spelled word can sometimes be stemmed to a correctly spelled word. Most stemmers will treat an -ence and an -ance suffix similarly, so my stemming dictionary approach would think both parlance and parlence are valid words. In practice it was a non-issue, since most OCR mistakes resulted in obviously wrong words.
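To make the stem dictionary and its blind spot concrete, here is a toy sketch. The suffix list is a deliberately tiny stand-in for a real stemmer, and `StemSpellChecker` is a hypothetical name; it is not the spell checker described above.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class StemSpellChecker {

    // Assumption: a toy suffix list, longest first. A real stemmer has many
    // more rules and conditions than this.
    private static final String[] SUFFIXES = {"ation", "ance", "ence", "ing", "ate", "er"};

    private final Set<String> knownStems = new HashSet<>();

    // Strips the first matching suffix, keeping at least three characters.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Records the stem of a word confirmed as valid.
    void addWord(String word) {
        knownStems.add(stem(word));
    }

    // A word is accepted if its stem has been seen before.
    boolean isKnown(String word) {
        return knownStems.contains(stem(word));
    }
}
```

After `addWord("appropinquation")`, `isKnown("appropinquate")` returns true because both reduce to the same stem. The same mechanism produces the false negative: after `addWord("parlance")`, `isKnown("parlence")` is also true.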