Protein Function Prediction by Machine Learning

Primary Supervisor

Liew, Wee-Chung

Zhou, Yaoqi

Other Supervisors

Yang, Yuedong


With genomic data accumulating rapidly, determining the functions of previously unseen proteins remains one of the most challenging problems. Although protein function can often be inferred from homologs with known functions in other species, not all proteins have homologs whose functions have been determined. Proteins perform their functional roles through interactions with other biologically active molecules, so the first step in identifying a protein's function through its interactions is to detect its potential binding sites. Moreover, protein function may change when a protein undergoes modification. Experimental determination of function for millions of new proteins is impractical because of the vast number of possible functions to be tested. It is therefore highly desirable to have computational tools that prioritize possible functions for new proteins. In this thesis, we propose machine learning-based methods for predicting putative binding sites of proteins that interact with small molecules, specifically peptides and carbohydrates, as well as putative sites of post-translational modifications (PTMs). Our main contributions are threefold. First, we proposed the first model for predicting protein-peptide binding sites without knowledge of the protein structure (Taherzadeh et al. 2016). The method was further improved by using experimental structures. Its performance remains robust even when unbound structures or high-quality model structures built from homologs are used, indicating the wide applicability of the method (Taherzadeh et al. 2017). Second, we established the first publicly available tool for predicting carbohydrate-binding sites in the absence of protein structures (Taherzadeh et al. 2016). The accuracy of this method is supported by its prediction of more binding residues in carbohydrate-binding proteins than in non-binding proteins in the human proteome and by its successful application to the 1000 Genomes Project. Third, we proposed a method for predicting sites of lysine malonylation, a type of PTM (Taherzadeh et al.). This model, built from M. musculus proteins, achieved comparable performance when tested on H. sapiens proteins. All of these methods were thoroughly assessed by cross-validation and on independent test sets after removing homologous sequences. Consistent performance across cross-validation and independent datasets confirmed the accuracy and robustness of the methods. All methods significantly outperform existing techniques.

Thesis Type

Thesis (PhD Doctorate)

Degree Program

Doctor of Philosophy (PhD)


School of Info & Comm Tech

Rights Statement

The author owns the copyright in this thesis, unless stated otherwise.


Protein function prediction

Machine learning



Lysine malonylation
