Publication | Closed Access
Binarization-Free Text Line Segmentation for Historical Documents Based on Interest Point Clustering
Citations: 42
References: 23
Year: 2012
Unknown Venue
Document Clustering, Saint Gall Database, Image Analysis, Information Retrieval, Data Science, Data Mining, Pattern Recognition, Engineering, Text Recognition, Text Segmentation, Optical Character Recognition, Text Lines, Historical Documents, Character Recognition, Interest Point Clustering, Document Processing, Text Mining, Page Images
Segmenting page images into text lines is a crucial pre-processing step for automated reading of historical documents. Challenges in this open research field include, e.g., background noise from paper or parchment, ink bleed-through, artifacts due to aging, stains, and touching text lines. In this paper, we present a novel binarization-free line segmentation method that is robust to noise and copes with overlapping and touching text lines. First, interest points representing parts of characters are extracted from gray-scale images. Next, word clusters are identified in high-density regions, and touching components such as ascenders and descenders are separated using seam carving. Finally, text lines are generated by concatenating neighboring word clusters, where neighborhood is defined by the prevailing orientation of the words in the document. An experimental evaluation on the Latin manuscript images of the Saint Gall database shows promising results for real-world applications in terms of both accuracy and efficiency.
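The seam-carving step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the energy function or the seam constraints, so the code below makes its own assumptions: energy is pixel darkness (ink is expensive to cross), and a seam runs left to right, moving at most one row up or down per column. The function name `min_horizontal_seam` and the toy grids are hypothetical and are not taken from the paper.

```python
from typing import List


def min_horizontal_seam(energy: List[List[float]]) -> List[int]:
    """Return, for each column, the row index of a minimum-energy
    left-to-right seam. The row may change by at most 1 between
    neighboring columns (assumed connectivity, not from the paper)."""
    rows, cols = len(energy), len(energy[0])
    # cost[r][c]: cheapest cumulative energy of any seam ending at (r, c)
    cost = [[0.0] * cols for _ in range(rows)]
    # back[r][c]: row of the predecessor pixel in column c - 1
    back = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        cost[r][0] = energy[r][0]
    for c in range(1, cols):
        for r in range(rows):
            candidates = [p for p in (r - 1, r, r + 1) if 0 <= p < rows]
            best = min(candidates, key=lambda p: cost[p][c - 1])
            cost[r][c] = energy[r][c] + cost[best][c - 1]
            back[r][c] = best
    # Start from the cheapest end point and follow the back pointers.
    r = min(range(rows), key=lambda i: cost[i][cols - 1])
    seam = [0] * cols
    for c in range(cols - 1, -1, -1):
        seam[c] = r
        r = back[r][c]
    return seam


# Toy energy map: rows 0 and 4 are two text lines (high energy = ink),
# with a descender at (1, 2) and an ascender at (3, 1) reaching into the gap.
toy = [
    [9, 9, 9, 9, 9],
    [0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 9, 0, 0, 0],
    [9, 9, 9, 9, 9],
]
seam = min_horizontal_seam(toy)
seam_cost = sum(toy[r][c] for c, r in enumerate(seam))

# Touching case: the middle column is inked in every row, so the seam
# has to cut through the faintest pixel of the touching stroke.
touching = [
    [9, 9, 9],
    [0, 5, 0],
    [9, 9, 9],
]
touching_seam = min_horizontal_seam(touching)
```

In the first grid the seam stays in the ink-free gap between the two lines. In the second it has no ink-free path and cuts the touching stroke where it is lightest, which is the behavior one would want when separating ascenders and descenders of adjacent lines.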