|dc.description.abstract||Storing and manipulating documents in digital form to contribute to a paperless society has been the propensity of emerging technology. There has been notable growth in the variety and quantity of digitised documents, which have often been scanned/photographed and archived as images without any labelling or sufficient index information. The growth of these kinds of document images will undoubtedly continue with new technology. To provide an effective way for retrieving and organizing these document images, many techniques have been implemented in the literature. However, designing automation systems to accurately retrieve document images from archives remains a challenging problem.
Finding discriminative and effective features is the fundamental task for developing an efficient retrieval system. An overview of the literature reveals that research on document image retrieval using texture-based features has not yet been broadly investigated. Texture features are suitable for large volume data and are generally fast to compute. In this study, the effectiveness of more than 50 different texture-based feature extraction methods from four categories of texture features - statistical, transform-based, model-based, and structural approaches - are investigated in order to propose a more accurate method for document image retrieval. Moreover, the influence of resolution and similarity metrics on document image retrieval are examined. The MTDB, ITESOFT, and CLEF_IP datasets, which are heterogeneous datasets providing a great variety of page layouts and contents, are considered for experimentation, and the results are computed in terms of retrieval precision, recall, and F-score. By considering the performance, time complexity, and memory usage of different texture features on three datasets, the best category of texture features for obtaining the best retrieval results is discussed. The effectiveness of the transform-based category over other categories in regard to obtaining higher retrieval result is proven.
Many new feature extraction and document image retrieval methods are proposed in this research. To attain fast document image retrieval, the number of extracted features and time complexity play a significant role in the retrieval process. Thus, a fast and non-parametric texture feature extraction method based on summarising the local grey-level structure of the image is further proposed in this research work. The proposed fast local binary pattern provided promising results, with lower computing time as well as smaller memory space consumption compared to other variations of local binary pattern-based methods.
There is a challenge in DIR systems when document images in queries are of different resolutions from the document images considered for training the system. In addition, a small number of document image samples with a particular resolution may only be available for training a DIR system. To investigate these two issues, an under-sampling concept is considered to generate under-sampled images and to improve the retrieval results.
In order to use more than one characteristic of document images for document image retrieval, two different texture-based features are used for feature extraction. The fast-local binary method as a statistical approach, and a wavelet analysis technique as a transform-based approach, are used for feature extraction, and two feature vectors are obtained for every document image. The classifier fusion method using the weighted average fusion of distance measures obtained in relation to each feature vector is then proposed to improve document image retrieval results.
To extract features similar to human visual system perception, an appearance-based feature extraction method for document images is also proposed. In the proposed method, the Gist operator is employed on the sub-images obtained from the wavelet transform. Thereby, a set of global features from the original image as well as sub-images are extracted. Wavelet-based features are also considered as the second feature set. The classifier fusion technique is finally employed to find similarity distances between the extracted features using the Gist and wavelet transform from a given query and the knowledge-base. Higher document image retrieval results have been obtained from this proposed system compared to the other systems in the literature. The other appearance-based document image retrieval system proposed in this research is based on the use of a saliency map obtained from human visual attention. The saliency map obtained from the input document image is used to form a weighted document image. Features are then extracted from the weighted document images using the Gist operator. The proposed retrieval system provided the best document image retrieval results compared to the results reported from other systems.
Further research could be undertaken to combine the properties of other approaches to improve retrieval result. Since in the conducted experiments, a priori knowledge regarding document image layout and content has not been considered, the use of prior knowledge about the document classes may also be integrated into the feature set to further improve the retrieval performance||en_US