Accurate prediction of genome-wide RNA secondary structure profile based on extreme gradient boosting

File version
Accepted Manuscript (AM)
Author(s)
Ke, Y
Rao, J
Zhao, H
Lu, Y
Xiao, N
Yang, Y
Year published
2020
Abstract
Motivation: RNA secondary structure plays a vital role in fundamental cellular processes, and identification of RNA secondary structure is a key step to understand RNA functions. Recently, a few experimental methods were developed to profile genome-wide RNA secondary structure, i.e. the pairing probability of each nucleotide, through high-throughput sequencing techniques. However, these high-throughput methods have low precision and cannot cover all nucleotides due to limited sequencing coverage.

Results: Here, we have developed a new method for the prediction of genome-wide RNA secondary structure profile from RNA sequence based on the extreme gradient boosting technique. The method achieves predictions with areas under the receiver operating characteristic curve (AUC) >0.9 on three different datasets, and AUC of 0.888 by another independent test on the recently released Zika virus data. These AUCs are consistently >5% greater than those by the CROSS method recently developed based on a shallow neural network. Further analysis on the 1000 Genome Project data showed that our predicted unpaired probabilities are highly correlated (>0.8) with the minor allele frequencies at synonymous, non-synonymous mutations, and mutations in untranslated regions, which were higher than those generated by RNAplfold. Moreover, the prediction over all human mRNA indicated a consistent result with previous observation that there is a periodic distribution of unpaired probability on codons. The accurate predictions by our method indicate that such model trained on genome-wide experimental data might be an alternative for analytical methods.

Availability and implementation: The GRASP is available for academic use at https://github.com/sysu-yanglab/GRASP.
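The abstract reports its results as areas under the ROC curve (AUC). As a reading aid, the sketch below shows one common way to compute that quantity for per-nucleotide predictions: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical illustrations; they are not taken from the paper or from the GRASP code, and the choice of "paired" as the positive class is an assumption made only for this example.

```python
def roc_auc(labels, scores):
    """Return the ROC AUC via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 values (assumed here: 1 = paired, 0 = unpaired).
    scores: iterable of predicted probabilities, same length as labels.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data for eight nucleotides (not from the paper).
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.70, 0.65, 0.40, 0.30, 0.20, 0.10]

auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

The pairwise loop is quadratic in the number of nucleotides, which is fine for illustration; for genome-wide profiles a rank-based implementation from an established library would be the practical choice.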
Journal Title
Bioinformatics
Volume
36
Issue
17
Copyright Statement
© 2020 Oxford University Press. This is a pre-copy-editing, author-produced PDF of an article accepted for publication in Bioinformatics following peer review. The definitive publisher-authenticated version Accurate prediction of genome-wide RNA secondary structure profile based on extreme gradient boosting, Bioinformatics, 2020, 36 (17), pp. 4576-4582 is available online at: https://doi.org/10.1093/bioinformatics/btaa534.
Subject
Mathematical sciences
Biological sciences