A Hybrid Method for Text Line Extraction in Handwritten Document Images

View/ Open
File version
Accepted Manuscript (AM)
Author(s)
Kiumarsi, Ehsan
Alaei, Alireza
Griffith University Author(s)
Year published
2018
Metadata
Show full item recordAbstract
Text line segmentation in handwritten document image, as one of the preliminarily steps for document image recognition, is a challenging problem. In this paper, a hybrid method for text line extraction in handwritten document images is presented. Initially, a connected component (CC) labelling method following by a CC filtering is employed to extract a set of CCs from the input document image. A new distance measure is introduced to compute normal distances between the extracted CCs. By traversing the normal distance matrix from both the right and left directions, half-chains of CCs are constructed. The CCs half-chains are ...
View more >Text line segmentation in handwritten document image, as one of the preliminarily steps for document image recognition, is a challenging problem. In this paper, a hybrid method for text line extraction in handwritten document images is presented. Initially, a connected component (CC) labelling method following by a CC filtering is employed to extract a set of CCs from the input document image. A new distance measure is introduced to compute normal distances between the extracted CCs. By traversing the normal distance matrix from both the right and left directions, half-chains of CCs are constructed. The CCs half-chains are merged to obtain CCs full-chains. From the extracted full-chains separator lines are obtained. A gradient metric is proposed to detect and remove touching text lines. Using remaining separator lines the adaptive projection profile of the image is computed. Based on the projection profile, coarse text line extraction is performed. Finally, a fine text lines extraction is performed by applying a postprocessing step. To evaluate the method, two benchmarks named ICDAR2013 handwriting segmentation contest, and Kannada datasets composed of handwritten document images in English, Greek, Bengali, and Kannada languages were considered for experimentation. Experimental results indicate a promising performance was obtained compared to some of the state-of-the-art methods.
View less >
View more >Text line segmentation in handwritten document image, as one of the preliminarily steps for document image recognition, is a challenging problem. In this paper, a hybrid method for text line extraction in handwritten document images is presented. Initially, a connected component (CC) labelling method following by a CC filtering is employed to extract a set of CCs from the input document image. A new distance measure is introduced to compute normal distances between the extracted CCs. By traversing the normal distance matrix from both the right and left directions, half-chains of CCs are constructed. The CCs half-chains are merged to obtain CCs full-chains. From the extracted full-chains separator lines are obtained. A gradient metric is proposed to detect and remove touching text lines. Using remaining separator lines the adaptive projection profile of the image is computed. Based on the projection profile, coarse text line extraction is performed. Finally, a fine text lines extraction is performed by applying a postprocessing step. To evaluate the method, two benchmarks named ICDAR2013 handwriting segmentation contest, and Kannada datasets composed of handwritten document images in English, Greek, Bengali, and Kannada languages were considered for experimentation. Experimental results indicate a promising performance was obtained compared to some of the state-of-the-art methods.
View less >
Conference Title
2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR)
Copyright Statement
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Subject
Pattern recognition
Data mining and knowledge discovery