
Abstract

Document image processing begins at the OCR phase and faces the difficulty of automatic document analysis and understanding. Most existing systems perform well only in their specific application domains. In this paper, we describe a domain-independent automatic document image understanding system with learning ability. A segmentation method based on "logical closeness" is proposed. A novel and natural representation of document layout structure, a directed weight graph (DWG), is described. To classify a given document, a string representation matching algorithm is applied first, instead of comparing the document against all the sample graphs. A frame template represents a document's logical structure, and a document type hierarchy (DTH) represents the hierarchical relationships among these frame templates. Two learning methodologies are applied: learning from experience and an enhanced perceptron learning algorithm.
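The two-stage classification idea mentioned in the abstract (a cheap string match first, graph comparison only for the survivors) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the LayoutGraph class, the one-letter block labels, the use of Python's difflib ratio as the string matcher, the summed edge-weight difference as the graph distance, and the 0.6 shortlist threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


@dataclass
class LayoutGraph:
    """Hypothetical directed weighted graph over layout blocks (not the paper's DWG definition)."""
    labels: Dict[int, str]  # block id -> one-letter block type, e.g. "H" header, "T" text
    edges: Dict[Tuple[int, int], float] = field(default_factory=dict)  # (src, dst) -> weight

    def to_string(self) -> str:
        # Assumed encoding: block labels concatenated in block-id order.
        return "".join(self.labels[i] for i in sorted(self.labels))


def string_similarity(a: str, b: str) -> float:
    # Stand-in string matcher; the paper's matching algorithm is not specified here.
    return SequenceMatcher(None, a, b).ratio()


def edge_distance(g: LayoutGraph, h: LayoutGraph) -> float:
    # Assumed graph distance: sum of absolute weight differences, missing edges count as 0.
    keys = set(g.edges) | set(h.edges)
    return sum(abs(g.edges.get(k, 0.0) - h.edges.get(k, 0.0)) for k in keys)


def classify(doc: LayoutGraph,
             samples: Dict[str, LayoutGraph],
             shortlist_threshold: float = 0.6) -> Optional[str]:
    """Return the best-matching sample type name, or None if no sample passes the string stage."""
    doc_str = doc.to_string()
    # Stage 1: cheap string comparison prunes the candidate set.
    shortlist = [name for name, g in samples.items()
                 if string_similarity(doc_str, g.to_string()) >= shortlist_threshold]
    if not shortlist:
        return None
    # Stage 2: the more expensive graph comparison runs only on the shortlist.
    return min(shortlist, key=lambda name: edge_distance(doc, samples[name]))
```

The point of the design is that the graph comparison, which is the costly step, runs on a handful of candidates rather than on every stored sample; the particular similarity and distance functions are interchangeable.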
