Correcting noisy OCR

Abstract

We describe a system for automatic post-OCR text correction of digital collections of historical texts. Documents such as old newspapers are often degraded, so even the best OCR tools can yield garbled text. When keywords are corrupted, the text is invisible to search tools, and manual correction is not feasible for large collections. Our non-interactive OCR correction method uses a "noisy channel" approach. The error model uses statistically weighted multiple-character edits and a novel visual correlation adjustment based on low-resolution "reverse OCR". The language model uses normal and "gap" word 3-grams, plus some 5-grams. Word correction candidates are generated by a deep heuristic search of weighted edit combinations, guided by a trie. Testing shows good improvements in word error rate. Experiments demonstrate resilience and justify the use of a deep candidate search.
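As a rough illustration of the noisy channel idea mentioned in the abstract, the sketch below picks the candidate word that maximizes the product of an error-model probability and a language-model probability. It is not the system described here: the tables `ERROR_MODEL` and `LANGUAGE_MODEL`, their values, and the functions `score` and `correct` are hypothetical, and a unigram table stands in for the 3-gram and 5-gram language model.

```python
import math

# Hypothetical toy tables; a real system would estimate these from data.
# Error (channel) model: P(observed OCR string | intended word).
ERROR_MODEL = {
    ("tbe", "the"): 0.05,
    ("tbe", "tube"): 0.01,
    ("tbe", "tbe"): 0.90,
}

# Language model: P(word). A unigram stand-in for the n-gram model.
LANGUAGE_MODEL = {"the": 0.05, "tube": 0.0001, "tbe": 1e-9}


def score(observed: str, candidate: str) -> float:
    """Return log P(observed | candidate) + log P(candidate)."""
    p_err = ERROR_MODEL.get((observed, candidate), 0.0)
    p_lm = LANGUAGE_MODEL.get(candidate, 0.0)
    if p_err == 0.0 or p_lm == 0.0:
        # Unknown pairs or words are treated as impossible in this sketch.
        return float("-inf")
    return math.log(p_err) + math.log(p_lm)


def correct(observed: str, candidates: list[str]) -> str:
    """Pick the candidate with the highest noisy-channel score."""
    return max(candidates, key=lambda c: score(observed, c))


best = correct("tbe", ["the", "tube", "tbe"])
```

With these toy numbers, the language model outweighs the high probability that "tbe" was read correctly, so `best` is "the".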
