Multiple sequence alignment-based RNA language model and its application to structural inference

File version

Version of Record (VoR)

Author(s)
Zhang, Yikun
Lang, Mei
Jiang, Jiuhong
Gao, Zhiqiang
Xu, Fan
Litfin, Thomas
Chen, Ke
Singh, Jaswinder
Huang, Xiansong
Song, Guoli
Tian, Yonghong
Zhan, Jian
Chen, Jie
Zhou, Yaoqi
Date
2023
Abstract

Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding with its embedding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.

Journal Title

Nucleic Acids Research

Rights Statement

© The Author(s) 2023. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

Note

This publication has been entered in Griffith Research Online as an advanced online version.

Subject

Biological sciences

Chemical sciences

Environmental sciences

Citation

Zhang, Y; Lang, M; Jiang, J; Gao, Z; Xu, F; Litfin, T; Chen, K; Singh, J; Huang, X; Song, G; Tian, Y; Zhan, J; Chen, J; Zhou, Y, Multiple sequence alignment-based RNA language model and its application to structural inference, Nucleic Acids Research, 2023, pp. gkad1031
