Effects of data quality and quantity on deep learning for protein-ligand binding affinity prediction
File version
Accepted Manuscript (AM)
Author(s)
Shi, Yun
Abstract
Prediction of protein-ligand binding affinities is crucial for computational drug discovery. A number of deep learning approaches have been developed in recent years to improve the accuracy of such affinity prediction. While the predictive power of these systems has advanced to varying degrees depending on the dataset used for model training and testing, the effects of the quality and quantity of the underlying data have not been thoroughly examined. In this study, we employed erroneous datasets and data subsets of different sizes, created from one of the largest databases of experimental binding affinities, to train and evaluate a deep learning system based on convolutional neural networks. Our results show that data quality and quantity have significant impacts on the prediction performance of trained models. Depending on the variations in data quality and quantity, the performance discrepancies can be comparable to or even larger than those observed among different deep learning approaches. In particular, the presence of proteins in the training data leads to a dramatic increase in prediction accuracy. This implies that continued accumulation of high-quality affinity data, especially for new protein targets, is indispensable for improving deep learning models to better predict protein-ligand binding affinities.
Journal Title
Bioorganic & Medicinal Chemistry
Volume
72
Rights Statement
© 2022 Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Licence (http://creativecommons.org/licenses/by-nc-nd/4.0/) which permits unrestricted, non-commercial use, distribution and reproduction in any medium, providing that the work is properly cited.
Subject
Medicinal and biomolecular chemistry
Science & Technology
Life Sciences & Biomedicine
Physical Sciences
Biochemistry & Molecular Biology
Chemistry, Medicinal
Citation
Fan, FJ; Shi, Y, Effects of data quality and quantity on deep learning for protein-ligand binding affinity prediction, Bioorganic & Medicinal Chemistry, 2022, 72, pp. 117003