ConsAlign: simultaneous RNA structural aligner based on rich transfer learning and thermodynamic ensemble model of alignment scoring

Abstract
Motivation: To capture structural homology in RNAs, alignment and folding (AF) of RNA homologs has been a fundamental framework in RNA science. Learning sufficient scoring parameters for simultaneous AF (SAF) is an undeveloped subject because evaluating them is computationally expensive.
Results: We developed ConsTrain, a gradient-based machine learning method for rich SAF scoring. We also implemented ConsAlign, a SAF tool built on ConsTrain's learned scoring parameters. To improve AF quality, ConsAlign employs (1) transfer learning from well-defined scoring models and (2) an ensemble of the ConsTrain model and a well-established thermodynamic scoring model. While keeping a comparable running time, ConsAlign demonstrated competitive AF prediction quality among current AF tools.
Availability and implementation: Our code and our data are freely available at https://github.com/heartsh/consalign and https://github.com/heartsh/consprob-trained.

Here, $p_{uv}(\theta; d)$ is the explicit form of each sparse loop-matching probability $p_{uv}(\theta)$ for the $d$-th RNA sequence pair. Also, $t(u) \in \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{U}\}$ returns the type of the nucleotide at position $u$.
As another example, the element of the expected counts $E_{S_d}[\phi(\cdot); \theta]$ corresponding to CONTRAfold's hairpin loop length parameter $\theta^{\mathrm{hairpin\ length}}_{30}$ is computed from the probabilities $p^{\mathrm{hairpin\ loop}}_{ijkl}(\theta; d)$. Here, $p^{\mathrm{hairpin\ loop}}_{ijkl}(\theta; d)$ is the explicit form of each posterior pair-matching probability $p_{ijkl}(\theta)$ for the $d$-th RNA sequence pair, under the assumption that the two base pairings $(i, j)$ and $(k, l)$ each enclose a hairpin loop. [ConsProb can calculate $p^{\mathrm{hairpin\ loop}}_{ijkl}(\theta; d)$ during the computation of $p_{ijkl}(\theta)$.]
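The expected count described above can be illustrated with a minimal sketch. It assumes that each matched pair of base pairings $(i, j)$, $(k, l)$ contributes one hairpin loop per sequence, that the length of a hairpin loop closed by $(i, j)$ is $j - i - 1$, and that the feature fires for lengths of at least 30. The function names and the toy probabilities are hypothetical and are not taken from ConsProb or ConsTrain.

```python
# Hypothetical sketch: expected count of the "hairpin length >= 30" feature
# for one RNA sequence pair, given toy hairpin-loop pair-matching probabilities.
MIN_LEN = 30  # assumed threshold of the hairpin-length feature


def hairpin_length(i, j):
    """Number of unpaired nucleotides enclosed by the base pairing (i, j)."""
    return j - i - 1


def expected_hairpin_length_count(pairmatch_probs, min_len=MIN_LEN):
    """Sum each probability times the number of its two hairpins that reach min_len."""
    total = 0.0
    for (i, j, k, l), prob in pairmatch_probs.items():
        hits = int(hairpin_length(i, j) >= min_len)
        hits += int(hairpin_length(k, l) >= min_len)
        total += prob * hits
    return total


# Toy probabilities keyed by (i, j, k, l); the values are made up for illustration.
toy_probs = {
    (0, 35, 2, 40): 0.5,    # both hairpins are long enough
    (3, 10, 5, 12): 0.25,   # neither hairpin is long enough
    (1, 32, 4, 20): 0.125,  # only the first hairpin is long enough
}
expected = expected_hairpin_length_count(toy_probs)
```

In this toy case, only the first and third entries contribute, with weights 2 and 1, respectively.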

S2.1 Decomposing ConsTrain's SAF scoring model
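CONTRAfold's parameters $\theta^{\mathrm{fold}}$ score any secondary structure $S$ as follows (Do et al., 2006a): $s^{\mathrm{fold}}(S; \theta^{\mathrm{fold}}) = \theta^{\mathrm{fold}} \cdot \phi^{\mathrm{fold}}(S)$. Here, the function $\phi^{\mathrm{fold}}(S)$ maps any secondary structure $S$ to a vector that counts the occurrences in $S$ of each $f$-th scoring parameter $\theta^{\mathrm{fold}}_f$, where $(\theta^{\mathrm{fold}}_f) \equiv \theta^{\mathrm{fold}}$. Likewise, CONTRAlign's parameters $\theta^{\mathrm{align}}$ score any pairwise sequence alignment $B$ in the following form (Do et al., 2006b): $s^{\mathrm{align}}(B; \theta^{\mathrm{align}}) = \theta^{\mathrm{align}} \cdot \phi^{\mathrm{align}}(B)$. Here, the function $\phi^{\mathrm{align}}(B)$ maps any pairwise sequence alignment $B$ to a vector that counts the occurrences in $B$ of each $f$-th scoring parameter $\theta^{\mathrm{align}}_f$, where $(\theta^{\mathrm{align}}_f) \equiv \theta^{\mathrm{align}}$. Any pairwise SAF candidate $A$ of two RNA sequences is composed of (1) a secondary structure $S$ of one sequence, (2) a secondary structure $S'$ of the other sequence, and (3) a pairwise sequence alignment $B$ of the two sequences: $A \equiv (S, S', B)$. We specify the form of our SAF scoring parameters $\theta$ by concatenating CONTRAfold's and CONTRAlign's parameters: $\theta \equiv (\theta^{\mathrm{fold}}, \theta^{\mathrm{align}})$. Finally, we can rewrite our SAF scoring $s(A; \theta)$ using $\theta^{\mathrm{fold}}$ and $\theta^{\mathrm{align}}$: $s(A; \theta) \equiv s^{\mathrm{fold}}(S; \theta^{\mathrm{fold}}) + s^{\mathrm{fold}}(S'; \theta^{\mathrm{fold}}) + s^{\mathrm{align}}(B; \theta^{\mathrm{align}})$.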
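The decomposition in this subsection can be checked numerically with a small sketch. It assumes that each score is a plain inner product between a parameter vector and a feature count vector. The feature names, parameter values, and counts are hypothetical placeholders, not the actual CONTRAfold or CONTRAlign parameter sets.

```python
from collections import Counter


def linear_score(params, counts):
    """Inner product between scoring parameters and feature counts."""
    return sum(params[name] * n for name, n in counts.items())


# Hypothetical toy parameters (not the trained CONTRAfold/CONTRAlign values).
theta_fold = {"helix_stacking": 1.5, "hairpin_length_3": -0.5}
theta_align = {"match_AA": 1.0, "gap_open": -2.0}

# Hypothetical feature counts of a SAF candidate A = (S, S', B).
phi_fold_s = Counter({"helix_stacking": 2, "hairpin_length_3": 1})  # structure S
phi_fold_s_prime = Counter({"helix_stacking": 1})                   # structure S'
phi_align_b = Counter({"match_AA": 3, "gap_open": 1})               # alignment B

# Concatenated parameters theta = (theta_fold, theta_align), keyed by model name.
theta = {("fold", k): v for k, v in theta_fold.items()}
theta.update({("align", k): v for k, v in theta_align.items()})

# Feature counts of A: fold counts of S and S' are added; align counts are appended.
phi_a = Counter()
for counts in (phi_fold_s, phi_fold_s_prime):
    for k, n in counts.items():
        phi_a[("fold", k)] += n
for k, n in phi_align_b.items():
    phi_a[("align", k)] += n

joint = linear_score(theta, phi_a)
decomposed = (
    linear_score(theta_fold, phi_fold_s)
    + linear_score(theta_fold, phi_fold_s_prime)
    + linear_score(theta_align, phi_align_b)
)
```

Both `joint` and `decomposed` evaluate to the same value, which is the point of the concatenation: the SAF score splits into two folding terms and one alignment term.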

S2.2 CONTRAfold model
CONTRAfold's parameters $\theta^{\mathrm{fold}}$ score any secondary structure $S$ by decomposing it into its set of loops $\mathcal{L}(S)$ and assigning a score to each loop $L \in \mathcal{L}(S)$ based on the class $c_L$ of $L$ [one of external loops, hairpin loops, stackings, bulge loops, interior loops, and multi-loops (Fig. S1)]: $s^{\mathrm{fold}}(S; \theta^{\mathrm{fold}}) = \sum_{L \in \mathcal{L}(S)} s^{c_L}(L; \theta^{\mathrm{fold}})$. We characterize each loop scoring function $s^{c_L}(L; \theta^{\mathrm{fold}})$ with loop-specific parameters in $\theta^{\mathrm{fold}}$ (Table S1). For example, the hairpin loop scoring function $s^{\mathrm{hairpin\ loop}}(L; \theta^{\mathrm{fold}})$ scores a hairpin loop $L$ of length $|L|$ (i.e., the number of unpaired nucleotides) by summing the hairpin-length, helix-end, and terminal-mismatch parameters that apply to $L$. Here, $p_1$ is the base pairing enclosing the hairpin loop $L$, and $p_2$ is the unpaired nucleotide pair neighboring $p_1$ (called the terminal mismatch). Moreover, $\theta^{\mathrm{hairpin\ length}}_x$ scores hairpin loops with lengths of at least $x$; $\theta^{\mathrm{helix\ end}}_{p_1}$ scores the helix end formed by the base pairing $p_1$; and $\theta^{\mathrm{terminal\ mismatch}}_{p_1 p_2}$ scores the terminal mismatch formed by the two nucleotide pairs $p_1$, $p_2$.
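As an illustration of the hairpin loop scoring described above, the sketch below sums the three kinds of parameters for a single hairpin loop. It assumes that the length parameters are cumulative (every $\theta^{\mathrm{hairpin\ length}}_x$ with $x \le |L|$ applies, because each one scores loops of length at least $x$), that $x$ starts at 0, and that lengths are capped at 30. The function name, the dictionary layout, and all parameter values are hypothetical.

```python
MAX_HAIRPIN_LENGTH = 30  # assumed cap on the hairpin-length parameters


def hairpin_loop_score(length, closing_pair, mismatch_pair, theta):
    """Score one hairpin loop from length, helix-end, and terminal-mismatch terms."""
    capped = min(length, MAX_HAIRPIN_LENGTH)
    # Cumulative assumption: every "length at least x" parameter with x <= |L| fires.
    length_score = sum(theta["hairpin_length"][x] for x in range(capped + 1))
    return (
        length_score
        + theta["helix_end"][closing_pair]
        + theta["terminal_mismatch"][(closing_pair, mismatch_pair)]
    )


# Hypothetical toy parameters (indices 0..30 for the length terms).
theta = {
    "hairpin_length": [0.0, 0.0, 0.0, -1.0] + [0.25] * 27,
    "helix_end": {("G", "C"): 1.0},
    "terminal_mismatch": {(("G", "C"), ("A", "A")): 0.25},
}

# A hairpin of 4 unpaired nucleotides closed by G-C with an A/A terminal mismatch.
score = hairpin_loop_score(4, ("G", "C"), ("A", "A"), theta)
```

Under the capping assumption, hairpin loops longer than 30 nucleotides receive the same length contribution as a loop of exactly 30.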

S2.3 CONTRAlign model
Based on the pair-conditional random field shown in Fig. S2, CONTRAlign's parameters $\theta^{\mathrm{align}}$ assign different scores to the emissions and the transitions involving nucleotide matches and nucleotide indels (Table S1). For example, the sequence alignment

ACCGU--GU
AC--UUUGU

is scored by summing the CONTRAlign parameters that appear in it.
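The example alignment can be scored with a simplified sketch. It assumes one emission feature per matched or gapped column and affine-style transition features (`gap_open` when a gap run starts, `gap_extend` when it continues). These feature names and the parameter values are hypothetical and only approximate the emission and transition parameters listed in Table S1.

```python
from collections import Counter


def alignment_features(row_a, row_b):
    """Count hypothetical emission and transition features of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    counts = Counter()
    prev_state = None
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if a != "-" and b != "-":
            state = "match"
            counts[("match", a, b)] += 1
        elif b == "-":
            state = "insert_a"  # nucleotide present only in the first sequence
            counts[("indel", a)] += 1
        else:
            state = "insert_b"  # nucleotide present only in the second sequence
            counts[("indel", b)] += 1
        if state != "match":
            # A gap run opens when the state changes and extends otherwise.
            counts["gap_open" if state != prev_state else "gap_extend"] += 1
        prev_state = state
    return counts


def alignment_score(counts, theta):
    """Sum the parameters of all counted features; unknown features score 0."""
    return sum(theta.get(name, 0.0) * n for name, n in counts.items())


# Hypothetical toy parameters.
theta_align = {("match", x, x): 2.0 for x in "ACGU"}
theta_align.update({"gap_open": -2.0, "gap_extend": -0.5})

features = alignment_features("ACCGU--GU", "AC--UUUGU")
score = alignment_score(features, theta_align)
```

For this alignment, the sketch counts five matched columns and two gap runs of length two, one in each sequence.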

S3 Conventional AF tools' features
(1) RAF is a SAF tool that (a) takes both CONTRAfold's and CONTRAlign's posterior probabilities and (b) trains the weight parameters of these posterior probabilities by max-margin optimization (Do et al., 2008). (2, 3) LocARNA and SPARSE are different implementations of Sankoff's algorithm that utilize SAF sparsity (Will et al., 2007, 2015). SPARSE is a variant of LocARNA and exploits only structure-based constraints (Will et al., 2015), whereas LocARNA exploits both matching-based and structure-based constraints (Will et al., 2007). (4) DAFS is a SAF tool that achieves reasonable computational complexity by applying dual decomposition to integer programming (Sato et al., 2012). (5) LinearTurboFold is an application of LinearPartition (Zhang et al., 2020) to TurboFold, an iterative AF tool (Tan et al., 2017). LinearTurboFold achieves its fast predictive iteration using both LinearPartition and beam search-based sequence alignment (Li et al., 2021).
[Table columns: Conventional tool; SPS-based p-value; SCI-based p-value]