Automated Historical Document Analysis Using Image Processing
Primary Supervisor
Blumenstein, Michael M
Other Supervisors
Liew, Wee-Chung
Pal, Umapada
Abstract
Digitization of historical documents, which enables space-free preservation of and open access to historical information, is an active and popular field of research in document analysis and recognition (DAR). Automation in the digitization of historical documents is essential to save physical space, to preserve historical documents, to integrate valuable historical resources with current systems, and to offer public access to those documents. Automation saves time and expense when processing large amounts of historical documents, and it has therefore been a prime issue for many libraries and archives. Many techniques have been proposed in the literature to provide effective information extraction from scanned historical documents. However, the challenge for automatic processing of digitized historical documents lies in finding reliable systems or refining well-known approaches. State-of-the-art methods in the various subfields of historical document analysis mainly rely on assumptions and are highly specific to particular databases. Hence, information extraction from handwritten historical documents is not a straightforward task, and the process includes several phases, from preprocessing to recognition. This study aims to automate several of the processing steps involved in reproducing readable and dynamic forms of valuable degraded historical resources. The historical handwritten documents collected from Australian archives and libraries for this experiment are comparatively new to this field, and this study discusses the difficulties and suitability of state-of-the-art methods for tasks at various levels. The issues that need to be tackled for historical documents are mainly poor physical preservation; different writing styles; varying and challenging layout/structure; degradation due to aging; black margins and bleed-through due to scanning; significant noise due to degradation, dirt, margins and artifacts; and obsolete languages/scripts.
Segmentation of handwritten double-page historical document images is a crucial and elementary preprocessing task. In the literature, state-of-the-art methods are either data-specific (they only split centrally) or inefficient at preserving text content in complex datasets. This research therefore demonstrates a fast, generic and robust method for segmenting historical handwritten double-page document images while preserving text content with zero tolerance. The key idea is to locate the transitional space between the two textual contents using a 1D discrete wavelet transform. Experimental results show that the proposed method achieves higher accuracy in less time than state-of-the-art page segmentation methods. Removal of marginal noise from handwritten historical document images is another vital task. An overview of the literature reveals that existing marginal noise removal techniques have not yet been developed to preserve isolated side text content such as page numbers. This thesis further proposes a dynamic, fast and non-parametric method that preserves text content with zero tolerance while removing marginal noise from handwritten historical document images. The proposed method combines two state-of-the-art approaches and endeavours to improve page segmentation in a more robust way. The effectiveness of the proximity-based approach over other approaches in obtaining the maximum removal result is established. For document text images, this study develops a complete restoration method that eliminates several physical degradations while simultaneously restoring the textual content of historical handwritten document images. In contrast to many conventional methods, the proposed method works without any prior knowledge of the degradation and without any assumption based on the properties of the document images.
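The double-page splitting idea described above, locating the transitional space between two text blocks with a 1D discrete wavelet transform, can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes a binarized image (1 = ink), a column ink-count profile as the 1D signal, a Haar wavelet, and a search restricted to the central part of the page; the function names and the `levels` and `search_fraction` parameters are hypothetical.

```python
from typing import List, Tuple


def column_profile(image: List[List[int]]) -> List[float]:
    """Count ink pixels (value 1) in every column of a binary image."""
    return [float(sum(column)) for column in zip(*image)]


def haar_approximation(signal: List[float], levels: int) -> Tuple[List[float], int]:
    """Return Haar approximation coefficients and the number of levels applied."""
    approx = list(signal)
    applied = 0
    for _ in range(levels):
        if len(approx) < 2:
            break
        if len(approx) % 2:  # pad odd lengths by repeating the last sample
            approx.append(approx[-1])
        approx = [(approx[i] + approx[i + 1]) / 2 ** 0.5
                  for i in range(0, len(approx), 2)]
        applied += 1
    return approx, applied


def find_split_column(image: List[List[int]], levels: int = 2,
                      search_fraction: float = 0.5) -> int:
    """Estimate the column that separates the two pages (assumed heuristic)."""
    profile = column_profile(image)
    approx, applied = haar_approximation(profile, levels)
    n = len(approx)
    # Only search the central part of the smoothed profile, so that empty
    # outer margins are not mistaken for the gap between the pages.
    margin = int(n * (1 - search_fraction) / 2)
    candidates = range(margin, max(margin + 1, n - margin))
    best = min(candidates, key=lambda i: approx[i])
    scale = 2 ** applied  # each coefficient covers `scale` original columns
    return min(best * scale + scale // 2, len(profile) - 1)
```

The coarse approximation coefficients smooth out gaps between individual strokes, so the minimum reflects the wide blank band between pages rather than narrow inter-character spaces. A real system would also need to handle skewed gutters and bleed-through, which this sketch ignores.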
To increase performance, the proposed approach uses two different optimization models: one for denoising the scanned images with a Genetic Algorithm in an incremental evolutionary process, and a second for restoring the valuable text content through an iterative optimization process. Maximum adaptation and exploration of the degraded document images during the evolutionary process provide better restored and enhanced document images than the results reported for other systems. This thesis also proposes modifications to the piece-wise algorithm for text-line segmentation for (i) estimating vertical stripes and their widths; (ii) identifying appropriate line segments; and (iii) reducing the over-segmentation problem. The experiments show that these modifications have a substantial effect on detecting appropriate line segments in each stripe and on forming the line segments. Furthermore, extensive investigation and experimental analysis are still required to segment overlapping and touching components. This automated digitization process for text content preservation and segmentation will help researchers and organizations access valuable Australian historical resources with convenience, speed and accuracy.
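The piece-wise approach to text-line segmentation mentioned above can be sketched in its basic, unmodified form: the page is divided into vertical stripes, and within each stripe the blank valleys of the horizontal projection profile mark candidate line separators. The sketch below is an illustration only. It assumes a binarized image (1 = ink) and equal-width stripes, and it does not reproduce the thesis's modifications for stripe-width estimation or over-segmentation; all function names and the `threshold` parameter are hypothetical.

```python
from typing import List, Tuple


def stripe_bounds(width: int, n_stripes: int) -> List[Tuple[int, int]]:
    """Split the column range [0, width) into n_stripes near-equal stripes."""
    edges = [round(i * width / n_stripes) for i in range(n_stripes + 1)]
    return [(edges[i], edges[i + 1]) for i in range(n_stripes)]


def row_profile(image: List[List[int]], left: int, right: int) -> List[int]:
    """Count ink pixels per row inside the stripe [left, right)."""
    return [sum(row[left:right]) for row in image]


def valley_centres(profile: List[int], threshold: int = 0) -> List[int]:
    """Return the middle row of each blank run that lies between two text rows."""
    centres = []
    start = None
    seen_text = False
    for y, value in enumerate(profile):
        if value <= threshold:
            if start is None:
                start = y  # a blank run begins here
        else:
            if start is not None and seen_text:
                centres.append((start + y - 1) // 2)
            start = None
            seen_text = True
    return centres


def piecewise_separators(image: List[List[int]], n_stripes: int = 4) -> List[List[int]]:
    """Compute candidate line-separator rows independently for each stripe."""
    width = len(image[0])
    return [valley_centres(row_profile(image, left, right))
            for left, right in stripe_bounds(width, n_stripes)]
```

Working per stripe lets the separators follow skewed or curved handwriting, because each stripe can place its valleys at different rows. Joining the per-stripe segments into full text lines, and handling touching or overlapping components, is left out of this sketch.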
Thesis Type
Thesis (PhD Doctorate)
Degree Program
Doctor of Philosophy (PhD)
School
School of Info & Comm Tech
Rights Statement
The author owns the copyright in this thesis, unless stated otherwise.
Subject
automatic processing
historical documents
digitisation
Document Analysis