Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning

Abstract Effective embedding of biomolecular information is being actively pursued by applying deep learning. Obtaining better embeddings enhances the quality of downstream analyses, such as DNA sequence motif detection and protein function prediction. In this study, we adopt a pre-training algorithm for the effective embedding of RNA bases to acquire semantically rich representations, and we apply this algorithm to two fundamental RNA sequence problems: structural alignment and clustering. By using the pre-training algorithm to embed the four bases of RNA in a position-dependent manner, using a large number of RNA sequences from various RNA families, a context-sensitive embedding representation is obtained. As a result, not only base information but also secondary-structure and context information of the RNA sequence is embedded for each base. We call this 'informative base embedding' and use it to achieve accuracies superior to those of existing state-of-the-art methods on RNA structural alignment and RNA family clustering tasks. Furthermore, by performing RNA sequence alignment that combines this informative base embedding with a simple Needleman–Wunsch alignment algorithm, we succeed in calculating structural alignments with a time complexity of O(n^2) instead of the O(n^6) time complexity of a naive implementation of the Sankoff-style algorithm for input RNA sequences of length n.
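To illustrate the idea stated above (this is a minimal sketch, not the authors' implementation), the following code runs the standard O(nm) Needleman–Wunsch recursion but scores a match as the similarity, here a dot product, between hypothetical per-base embedding vectors; the embedding inputs and the gap penalty are placeholder assumptions.

    import numpy as np

    def needleman_wunsch(emb_x, emb_y, gap=-1.0):
        """Global alignment score where the match score is the dot product of
        per-base embedding vectors (a stand-in for learned informative base
        embeddings). Runs in O(n*m) time for sequence lengths n and m."""
        n, m = len(emb_x), len(emb_y)
        dp = np.zeros((n + 1, m + 1))
        dp[1:, 0] = gap * np.arange(1, n + 1)   # leading gaps in the first sequence
        dp[0, 1:] = gap * np.arange(1, m + 1)   # leading gaps in the second sequence
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                match = dp[i - 1, j - 1] + float(np.dot(emb_x[i - 1], emb_y[j - 1]))
                dp[i, j] = max(match, dp[i - 1, j] + gap, dp[i, j - 1] + gap)
        return dp[n, m]  # alignment score; traceback omitted for brevity

In this sketch, replacing a base-identity substitution score with an embedding similarity is what allows structural information captured by the embedding to influence the alignment while keeping the quadratic dynamic-programming recursion unchanged.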


Self-Attention Mechanism
Supplemental Figure S1 illustrates the single-head case of the self-attention mechanism. The transformer is an encoder–decoder type of feed-forward neural network. The self-attention function for an encoder–decoder neural network is a dot-product attention formulated as

$$\mathrm{Attention}(s, h) = \mathrm{softmax}\left(s h^{\top}\right) h,$$

where $h$ represents the encoder layer and $s$ represents the decoder layer. The BERT algorithm generalizes it by considering $s$ as the (search) query $Q$ and separating $h$ into a key $K$ and a value $V$, formulated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.$$

In this formulation, the attention function computes an output (attention weight) based on a query ($Q$) and a set of key–value pairs ($K$, $V$). The key–value pairs ($K$, $V$) can be regarded as a kind of dictionary. By separating $h$ into key $K$ and value $V$, the dot product between query $Q$ and key $K$ measures the relevance of the value $V$ to the query $Q$ (how much attention it receives). $Q$, $K$ and $V$ are computed by linear projections of the input $X$ with learnable parameters $W^{Q}$, $W^{K}$ and $W^{V}$, formulated as:

$$Q = X W^{Q}, \quad K = X W^{K}, \quad V = X W^{V}.$$
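A minimal single-head sketch of this computation in numpy is shown below; the input X, the projection sizes, and the random parameters standing in for W^Q, W^K and W^V are illustrative assumptions, not the model's trained weights.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        """Single-head scaled dot-product self-attention.
        X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k)."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv        # linear projections of the input
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)         # relevance of every key to every query
        A = softmax(scores, axis=-1)            # attention weights; each row sums to 1
        return A @ V, A                         # weighted values and the attention map

    # Toy example: 12 bases embedded in 8 dimensions, projected to d_k = 4.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(12, 8))
    Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
    out, attn = self_attention(X, Wq, Wk, Wv)
    print(out.shape, attn.shape)  # (12, 4) (12, 12)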

RNA Motif Detection using Self-Attention Map
The attention map is computed by taking the inner product between the query vector of each base and the key vectors of the other bases in the input RNA sequence, which measures the relevance of that base to the other bases, as illustrated in supplemental Figure S1. Supplemental Figure S2 shows the strength of the relevance of each base, represented by the intensity of red. The lower sequence in Figure S2 shows the relevance scores for the 10th base "G" from the left in the upper sequence, which is surrounded by a blue frame; arrows point to the bases that are particularly relevant to this base "G". The relevance values computed for each base are summed to define the attention map. Thus, the attention map indicates how much each base contributed to the prediction in the pre-training task: in the MLM task, the bases that are important for predicting the masked base obtain high attention values, and in the SAL task, the bases that are important for predicting the structural alignment do so.
Finally, the bases with high attention values are identified as sequence motifs.
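The following hypothetical sketch shows one way such an attention map could be turned into candidate motif positions: the attention weights (e.g. the attn matrix returned by the self_attention sketch above) are summed over all queries, and bases whose totals stand out are reported. The z-score threshold and the aggregation rule are illustrative assumptions, not the paper's exact procedure.

    import numpy as np

    def motif_positions(attn, z_thresh=1.0):
        """attn: (L, L) attention weights, attn[i, j] = attention of query i on base j.
        Sum the attention each base receives over all queries, standardize the totals
        across the sequence, and report bases whose z-score exceeds the threshold."""
        received = attn.sum(axis=0)                          # total attention on each base
        z = (received - received.mean()) / received.std()    # standardize across the sequence
        return np.where(z > z_thresh)[0]                     # indices of high-attention bases

    # Example with the attention map computed in the previous sketch:
    # print(motif_positions(attn))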

Datasets