Publication | Closed Access
Script and language identification from document images
Citations: 55
References: 17
Year: 2002
Unknown Venue
Current Script, Engineering, Character Segmentation, Corpus Linguistics, Document Images, Text Mining, Speech Recognition, Natural Language Processing, Language Documentation, Image Analysis, Pattern Recognition, Text Recognition, Computational Linguistics, Language Studies, Character Recognition, Machine Translation, Optical Character Recognition, Language Recognition, Document Image, Linguistics, Document Processing
In this paper we present a detailed review of current script and language identification techniques. The main criticism of existing techniques is that most of them rely on either connected component analysis or character segmentation. We then present a new method for script identification, based on texture analysis, which does not require character segmentation. Simple processing of the document image produces a uniform text block on which texture analysis can be performed. Multi-channel (Gabor) filters and grey level co-occurrence matrices are used in independent experiments to extract texture features. Test documents are classified with a K-NN classifier based on the features of the training documents. Initial results of over 95% accuracy on the classification of 105 test documents from 7 scripts are very promising. The method is robust to noise and to the presence of foreign characters or numerals, and can be applied to very small amounts of text.
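The texture-based pipeline described in the abstract (texture features from a uniform block, then K-NN classification) might be sketched as follows. This is a minimal illustration and not the authors' implementation. It covers only the grey level co-occurrence matrix branch, not the Gabor filter branch. The helper names (`glcm`, `glcm_features`, `knn_predict`, `synthetic_block`), the three pixel offsets, the 8 grey levels, the choice of contrast, energy and homogeneity as statistics, and `k=3` are all assumptions made for the example. The striped synthetic blocks are a hypothetical stand-in for normalised text blocks of two scripts.

```python
import numpy as np


def glcm(image, dx=1, dy=0, levels=8):
    """Normalised, symmetric grey level co-occurrence matrix for offset (dx, dy).

    `image` must hold integers in [0, levels); offsets must be non-negative.
    """
    img = np.asarray(image)
    h, w = img.shape
    a = img[0:h - dy, 0:w - dx]   # reference pixels
    b = img[dy:h, dx:w]           # neighbours at the given offset
    m = np.zeros((levels, levels), dtype=float)
    np.add.at(m, (a.ravel(), b.ravel()), 1.0)  # count co-occurring level pairs
    m = m + m.T                   # count each pair in both directions
    return m / m.sum()


def glcm_features(image, levels=8):
    """Contrast, energy and homogeneity for three offsets (assumed feature set)."""
    i, j = np.indices((levels, levels))
    feats = []
    for dx, dy in [(1, 0), (0, 1), (1, 1)]:
        p = glcm(image, dx, dy, levels)
        feats.append(np.sum(p * (i - j) ** 2))           # contrast
        feats.append(np.sum(p ** 2))                     # energy
        feats.append(np.sum(p / (1.0 + np.abs(i - j))))  # homogeneity
    return np.array(feats)


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    dist = np.linalg.norm(train_x - x, axis=1)
    nearest = np.argsort(dist)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]


def synthetic_block(kind, rng, size=32, levels=8):
    """Hypothetical stand-in for a uniform text block: noisy stripes."""
    ramp = (np.arange(size) // 2) % 2 * (levels - 1)
    base = np.tile(ramp, (size, 1))          # stripes that vary along columns
    if kind == "horizontal":
        base = base.T                        # stripes that vary along rows
    noise = rng.integers(-1, 2, size=base.shape)
    return np.clip(base + noise, 0, levels - 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kinds = ["horizontal", "vertical"]
    train_x = np.array([glcm_features(synthetic_block(k, rng))
                        for k in kinds for _ in range(5)])
    train_y = np.array([k for k in kinds for _ in range(5)])
    query = glcm_features(synthetic_block("vertical", rng))
    print(knn_predict(train_x, train_y, query))
```

With real data, each training vector would come from a text block of known script and the query from a test document. The quantisation to a small number of grey levels keeps the co-occurrence matrix compact, which is one common design choice rather than anything the abstract specifies.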