Data-driven prediction of molecular function
Primary Supervisor
Zhou, Yaoqi
Other Supervisors
Zhan, Jian
Abstract
This thesis outlines the development of several tools for the prediction of protein molecular functions. The limited data available for training represents a challenge for developing accurate, generalizable models. To combat this challenge, we established several molecular-function predictors based on sequence- and structure-based inference from weakly annotated data. The core of the approach is driven by a geometric alignment between 3D structures as a proxy for functional relatedness. We also exploited a large dataset annotated for a related task to inform a model via transfer learning. In Chapter 2, SPOT-Ligand 2 was established for predicting protein-ligand interactions. The method employs weakly annotated ligand-binding sequences to improve virtual screening performance by 93% (top 1% enrichment factor) when compared to a baseline using complex structures only. In Chapter 3, SPOT-peptide was developed for identifying peptide-binding domains and peptide-binding sites. The method is an implementation of a structure-based homology modelling pipeline augmented by local measures of interface complementarity. This is the first method devoted specifically to the identification of peptide-binding domains at genome scale, and it outperformed a simple sequence-based baseline by 30% according to the Matthews correlation coefficient (MCC). Binding site MCC was also improved by 20% compared with the next best method from the literature. In Chapter 4, the SPalign-based structure homology framework was further validated in a prospective study involving the characterization of Bacillus subtilis YesU as a carbohydrate-binding protein. In Chapter 5, we developed a deep learning model called SPOT-MoRF for the identification of short peptide segments that undergo a disorder-to-order transition when binding a functional partner. The deep learning model was facilitated by a transfer learning framework which allowed an MCC improvement of 40% compared with a model trained directly from random initialization. These tools are made available as web services from http://sparks-lab.org/ and should facilitate the annotation of protein function at various levels of resolution.
Thesis Type
Thesis (PhD Doctorate)
Degree Program
Doctor of Philosophy (PhD)
School
School of Info & Comm Tech
Rights Statement
The author owns the copyright in this thesis, unless stated otherwise.
Subject
protein molecular functions
predicting
tool development