Protein Function Prediction by Machine Learning
Author(s)
Primary Supervisor
Liew, Wee-Chung
Zhou, Yaoqi
Other Supervisors
Yang, Yuedong
Year published
2018-05
Abstract
With genomic data accumulating rapidly, determining the functions of previously unseen proteins is one of the most challenging problems. Although protein functions can often be inferred from homologous counterparts with known functions in other species, not all proteins have homologs whose functions have been determined. Proteins perform their functional roles through interactions with other biologically active molecules. Thus, the first step in identifying a protein's function through its interactions is to detect its potential binding sites. Moreover, protein functions may change when proteins undergo modifications. Experimental determination of functions for millions of new proteins is not practical because of the vast number of possible functions to be tested. It is therefore highly desirable to have computational tools that prioritize possible functions for new proteins. In this thesis, we proposed machine learning-based methods for predicting putative binding sites of proteins interacting with small molecules, specifically peptides and carbohydrates, in addition to predicting putative sites of post-translational modifications (PTMs). The main contributions of our methods lie in three aspects. First, we proposed the first model to predict protein-peptide binding sites without knowledge of the protein structure (Taherzadeh et al. 2016). The method was further improved by using experimental structures. Its performance remains robust even when unbound structures or high-quality model structures built from homologs are used, indicating the wide applicability of the method (Taherzadeh et al. 2017). Second, we established the first publicly available tool for predicting carbohydrate-binding sites in the absence of protein structures (Taherzadeh et al. 2016). The accuracy of this method is supported by its predicting more binding residues in carbohydrate-binding proteins than in non-binding proteins in the human proteome, and by its successful application to the 1000 Genomes Project. Third, we proposed a method for predicting sites of the PTM lysine malonylation (Taherzadeh et al.). This predictive model, built from M. musculus proteins, achieved comparable performance when tested on H. sapiens proteins. All of these methods were assessed by cross-validation and on independent test sets after removing homologous sequences. Consistent performance on the cross-validation and independent datasets confirmed the accuracy and robustness of the predictive methods. All methods significantly outperform existing techniques.
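The abstract states that homologous sequences were removed before cross-validation and independent testing, but does not say how. The Python sketch below illustrates one common way such a filter might look: a greedy pass that keeps a sequence only if it is sufficiently dissimilar from everything already kept. It is an illustration under stated assumptions, not the thesis's procedure. The function names, the threshold values, and the use of `difflib.SequenceMatcher` as a crude stand-in for alignment-based sequence identity are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1].

    ASSUMPTION: this is only a stand-in for pairwise sequence identity.
    A real pipeline would use an alignment-based measure instead.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundant(seqs, threshold, reference=()):
    """Greedily keep sequences that are dissimilar to all kept ones.

    A sequence is kept only if its similarity to every previously kept
    sequence, and to every sequence in `reference`, is below `threshold`.
    Passing the training set as `reference` filters a test set so that it
    shares no close relatives with the training data.
    """
    kept = []
    for seq in seqs:
        others = list(reference) + kept
        if all(similarity(seq, other) < threshold for other in others):
            kept.append(seq)
    return kept


# Hypothetical toy sequences; the threshold of 0.5 is arbitrary.
train = remove_redundant(["AAAAAAAAAA", "AAAAAAAAAA", "CCCCCCCCCC"], 0.5)
test = remove_redundant(["CCCCCCCCCC", "GGGGGGGGGG"], 0.5, reference=train)
```

In this toy run the duplicate poly-A sequence is dropped from the training set, and the poly-C sequence is dropped from the test set because it already appears in training. Filtering the test set against the training set is what makes agreement between cross-validation and independent-test results meaningful as evidence of robustness.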
Thesis Type
Thesis (PhD Doctorate)
Degree Program
Doctor of Philosophy (PhD)
School
School of Info & Comm Tech
Copyright Statement
The author owns the copyright in this thesis, unless stated otherwise.
Subject
Protein function prediction
Machine learning
Peptides
Carbohydrates
Lysine malonylation