Publication | Closed Access
Correcting noisy OCR
Citations: 57
References: 19
Year: 2014
Unknown Venue
Topics: Engineering, Reverse OCR, Corpus Linguistics, Text Mining, Natural Language Processing, Image Analysis, Language Documentation, Data Science, Information Retrieval, Pattern Recognition, Text Recognition, Computational Linguistics, Language Studies, Character Recognition, Machine Translation, Machine Vision, Optical Character Recognition, Noisy OCR, Computer Vision, Speech Processing, Text Processing, Digital Collections, Linguistics, Document Processing
We describe a system for automatic post-OCR text correction of digital collections of historical texts. Documents such as old newspapers are often degraded, so even the best OCR tools can yield garbled text. When keywords are corrupted, the text becomes invisible to search tools, and manual correction is not feasible for large collections. Our non-interactive OCR correction method uses a "noisy channel" approach. The error model uses statistically weighted multiple-character edits and a novel visual correlation adjustment based on low-resolution "reverse OCR". The language model uses normal and "gap" word 3-grams, plus some 5-grams. Word correction candidates are generated by a deep heuristic search of weighted edit combinations, guided by a trie. Testing shows good improvements in word error rate. Experiments demonstrate the method's resilience and justify the use of a deep candidate search.
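To make the "noisy channel" idea concrete, here is a minimal Python sketch, not the authors' implementation. It scores each candidate word by combining a language-model term with an error-model term and picks the best. Everything in it is an assumption made for illustration: the toy word counts (`WORD_COUNTS`), the confusion costs (`CONFUSION_COST`), the use of a unigram model in place of the 3-gram and 5-gram models, single-character edits in place of multiple-character edits, and an exhaustive scan of the vocabulary in place of the trie-guided heuristic search. The "reverse OCR" visual adjustment is not modeled.

```python
import math

# Hypothetical toy counts standing in for a real language model.
WORD_COUNTS = {"the": 50, "then": 5, "than": 4, "them": 6, "tree": 3}
TOTAL = sum(WORD_COUNTS.values())

# Hypothetical substitution costs keyed by (observed char, intended char).
# Visually similar pairs are cheaper; costs act as negative log-likelihoods.
CONFUSION_COST = {("c", "e"): 0.5, ("e", "c"): 0.5, ("1", "l"): 0.3, ("l", "1"): 0.3}
DEFAULT_COST = 2.0  # any other substitution, insertion, or deletion


def weighted_edit_cost(observed: str, candidate: str) -> float:
    """Weighted Levenshtein distance between the OCR output and a candidate."""
    rows, cols = len(observed) + 1, len(candidate) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * DEFAULT_COST
    for j in range(1, cols):
        dp[0][j] = j * DEFAULT_COST
    for i in range(1, rows):
        for j in range(1, cols):
            o, c = observed[i - 1], candidate[j - 1]
            sub = 0.0 if o == c else CONFUSION_COST.get((o, c), DEFAULT_COST)
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub,           # match or substitute
                dp[i - 1][j] + DEFAULT_COST,      # extra character in OCR output
                dp[i][j - 1] + DEFAULT_COST,      # character missing from OCR output
            )
    return dp[-1][-1]


def score(observed: str, candidate: str) -> float:
    """log P(candidate) + log P(observed | candidate), up to a constant."""
    log_prior = math.log(WORD_COUNTS[candidate] / TOTAL)
    return log_prior - weighted_edit_cost(observed, candidate)


def correct(observed: str) -> str:
    """Return the vocabulary word with the highest noisy-channel score."""
    return max(WORD_COUNTS, key=lambda word: score(observed, word))


print(correct("thc"))  # the
```

In this toy setup the misread "thc" maps to "the" because the cheap c/e confusion and the high frequency of "the" both favor it. A system like the one described would replace the vocabulary scan with a guided search, since scoring every word against every token does not scale to large collections.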