
TLDR

Some Thai documents contain mixed Thai and English scripts in a single line, making OCR challenging; identifying script portions first improves recognition. The study proposes an SVM-based method to identify word-wise printed English and Thai scripts within a single line of a document page. The method segments the page into lines and words, then classifies each word's script using an SVM trained on features derived from structural shape, profile, component overlap, topological properties, and water reservoir concepts. On 6,110 samples, the scheme achieved 99.36% script identification accuracy.

Abstract

In some Thai documents, a single text line of a document page may contain both Thai and English scripts. For optical character recognition (OCR) of such a page, it is better to first identify the Thai and English script portions and then apply a separate OCR system for each script to the identified portions. This paper proposes an SVM-based method for word-wise identification of printed English and Thai scripts within a single line of a document page. The document is first segmented into lines, and the lines are then segmented into character groups (words). The proposed scheme identifies the script of each character group by combining character features obtained from structural shape, profile, component overlap information, topological properties, the water reservoir concept, and others. In an experiment on 6,110 samples, the proposed scheme achieved 99.36% script identification accuracy.
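The abstract does not say how the page is split into lines and character groups. As a rough illustration only, the sketch below assumes a common approach, projection profiles on a binarized image (1 = ink, 0 = background): rows with no ink separate lines, and runs of empty columns at least `word_gap` wide separate words. The function names and the `word_gap` threshold are hypothetical and are not taken from the paper; the feature extraction and SVM classification steps are not shown.

```python
from typing import List, Tuple

Image = List[List[int]]  # binarized page: 1 = ink pixel, 0 = background


def runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) spans of non-zero profile entries, end exclusive.

    Spans separated by fewer than min_gap zero entries are merged.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # the run ends at an empty entry
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)  # gap too narrow: same unit
        else:
            merged.append((s, e))
    return merged


def segment_lines(img: Image) -> List[Tuple[int, int]]:
    """Split a page into text lines using the horizontal projection profile."""
    return runs([sum(row) for row in img])


def segment_words(img: Image, line: Tuple[int, int],
                  word_gap: int = 3) -> List[Tuple[int, int]]:
    """Split one line into character groups using the vertical profile.

    word_gap is an assumed threshold (in pixels) for an inter-word space.
    """
    top, bottom = line
    width = len(img[0]) if img else 0
    profile = [sum(img[r][c] for r in range(top, bottom)) for c in range(width)]
    return runs(profile, min_gap=word_gap)
```

Each (start, end) column span returned by `segment_words` would then be cropped and passed to feature extraction before classification. In practice the gap threshold would probably need to scale with the line height rather than stay fixed.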
