Structural Alignments for Similarity Detection in Bioinformatics

Embargoed until: 2021-12-20
Author(s)
Primary Supervisor
Pullan, Wayne J
Other Supervisors
Zhou, Yaoqi
Year published
2019-12-20
Abstract
This thesis addresses problems involving structural alignments for similarity detection between entities. In the general computational context, a structural alignment is defined as an optimization problem in which representative inputs are assigned to relative positions so as to minimize some objective function. The output is an inferred relationship based on the resulting value of the objective function, the arrangement of aligned positions, or both. Two bioinformatics similarity detection applications were used as case studies in this work: the structural alignment of protein biomolecules and the document similarity detection problem in biomedical literature. The structural alignment of protein biomolecules involves generating residue pair correspondences of maximal overlap with minimal geometric divergence, using each protein’s set of three-dimensional atomic coordinates. Because a protein's structure determines its function, similarity in structure usually implies similarity in function. During the investigation of this structural alignment problem, it became apparent that a fast, approximate algorithm for the asymmetric linear sum assignment problem was required. Accordingly, a new heuristic algorithm, Asymmetric Greedy Search (AGS), was developed. Extensive computational experiments using a range of model graphs demonstrated the effectiveness of the algorithm. In addition, a new type of deterministic model graph suitable for reproducible benchmarking of such algorithms was developed. Incorporating AGS, a new non-sequential protein structure alignment method, SPalignNS, was then developed. Compared with existing methods, SPalignNS achieved greater alignment accuracy on commonly used protein alignment test datasets, and also achieved the highest agreement with manually curated reference alignments.
The document similarity detection problem is a fundamental application of natural language processing and forms the basis of information retrieval systems. Document matching systems for locating relevant literature have mostly relied on methods developed over a decade ago, largely because no common evaluation framework has been available. A database of relevance annotations for over 180,000 PubMed-listed document pairs was developed and subsequently used to train a sentence-based transfer learning model, HuBERT (Hierarchical PubMed BERT). When applied to searches for relevant biomedical literature in PubMed, the new HuBERT method produced superior results compared with the baseline methods from existing document matching systems.
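To make the asymmetric linear sum assignment problem mentioned in the abstract concrete, the following is a minimal sketch of a plain greedy heuristic on a rectangular cost matrix. It is not the thesis's AGS algorithm, whose details are not given in this record; the function name `greedy_assignment` and the example costs are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def greedy_assignment(cost: List[List[float]]) -> Tuple[List[Tuple[int, int]], float]:
    """Approximate a rectangular (asymmetric) linear sum assignment.

    Hypothetical illustration only: a simple cheapest-cell-first greedy
    pass, not the AGS algorithm described in the thesis.
    """
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0
    target = min(n_rows, n_cols)  # each row and column is used at most once

    # Visit every cell in order of increasing cost.
    cells = sorted(
        (cost[r][c], r, c) for r in range(n_rows) for c in range(n_cols)
    )

    used_rows, used_cols = set(), set()
    pairs: List[Tuple[int, int]] = []
    total = 0.0
    for value, r, c in cells:
        if len(pairs) == target:
            break
        if r in used_rows or c in used_cols:
            continue  # row or column already matched
        pairs.append((r, c))
        used_rows.add(r)
        used_cols.add(c)
        total += value
    return sorted(pairs), total


if __name__ == "__main__":
    # Two rows (e.g. residues of a short protein) and three candidate columns.
    example = [[4, 1, 3],
               [2, 0, 5]]
    print(greedy_assignment(example))
```

A greedy pass like this is fast but offers no optimality guarantee, which is the usual trade-off when an approximate assignment is acceptable.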
Thesis Type
Thesis (PhD Doctorate)
Degree Program
Doctor of Philosophy (PhD)
School
School of Information and Communication Technology
Copyright Statement
The author owns the copyright in this thesis, unless stated otherwise.
Subject
bioinformatics
similarity detection applications
structural alignment of biomolecular proteins
document similarity detection problem