An OCR system to read two Indian language scripts: Bangla and Devnagari (Hindi)

TLDR

Bangla and Devnagari share a common Brahmi origin and many features, so the authors propose a single OCR system that reads both scripts. One shared set of algorithms handles document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, and the grouping of characters into basic, modifier, and compound categories. The feature sets, classification trees, and error-correction knowledge base (such as the lexicon) are script-specific. The system performs well on single-font text printed on clear documents.

Abstract

An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent. These scripts, having the same origin in the ancient Brahmi script, have many features in common, and hence a single system can be modeled to recognize them. In the proposed model, document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, and the grouping of characters into basic, modifier and compound categories are done for both scripts by the same set of algorithms. The feature sets and classification tree, as well as the knowledge base required for error correction (such as the lexicon), differ for Bangla and Devnagari. The system shows good performance for single-font scripts printed on clear documents.
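The abstract does not say how text lines and zones are found. As a rough illustration only, the sketch below uses a horizontal projection profile, a common technique for printed text that is assumed here rather than taken from the paper. The function names `segment_lines` and `headline_row`, the `min_ink` threshold, and the 0/1 image encoding are all hypothetical. The headline heuristic assumes that the row with the most ink in a line is the headline, which often holds for printed Bangla and Devnagari.

```python
from typing import List, Tuple

Image = List[List[int]]  # binarized page: 1 = ink, 0 = background (assumed encoding)


def segment_lines(image: Image, min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start_row, end_row) bands, end exclusive, one per text line.

    A row belongs to a text line when it holds at least `min_ink` ink pixels.
    """
    profile = [sum(row) for row in image]  # ink pixels per row
    lines: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y  # first inked row of a new band
        elif ink < min_ink and start is not None:
            lines.append((start, y))  # blank row closes the band
            start = None
    if start is not None:  # band runs to the bottom edge
        lines.append((start, len(profile)))
    return lines


def headline_row(image: Image, start: int, end: int) -> int:
    """Return the row with the most ink inside the band [start, end).

    Assumption: this row is the headline, which could then split the line
    into an upper zone and the zones below it.
    """
    return max(range(start, end), key=lambda y: sum(image[y]))


# Tiny synthetic page with two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
bands = segment_lines(page)
headlines = [headline_row(page, s, e) for s, e in bands]
```

A projection profile of this kind only works after skew has been corrected, which is consistent with skew detection preceding line segmentation in the pipeline described above.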
