rnaCrosslinkOO: an object-oriented R package for the analysis of RNA structural data generated by RNA crosslinking experiments

Abstract Summary RNA (ribonucleic acid) molecules have secondary and tertiary structures in vivo which play a crucial role in cellular processes such as the regulation of gene expression, RNA processing and localization. The ability to investigate these structures will enhance our understanding of their function and contribute to the diagnosis and treatment of diseases caused by RNA dysregulation. However, there are no mature pipelines or packages for processing and analyzing complex in vivo RNA structural data. Here, we present rnaCrosslinkOO (RNA Crosslink Object-Oriented), a novel software package for the comprehensive analysis of data derived from the COMRADES (Crosslinking of Matched RNA and Deep Sequencing) method. rnaCrosslinkOO offers a comprehensive pipeline from raw sequencing reads to the identification and comparison of RNA structural features. It includes read processing and alignment, clustering of duplexes, data exploration, folding and comparisons of RNA structures. rnaCrosslinkOO also enables comparisons between conditions, the identification of inter-RNA interactions, and the incorporation of reactivity data to improve structure prediction. Availability and implementation rnaCrosslinkOO is freely available to noncommercial users and implemented in R, with the source code and documentation accessible at https://CRAN.R-project.org/package=rnaCrosslinkOO. The software is supported on Linux, macOS, and Windows platforms.


Introduction
RNA molecules exhibit secondary and tertiary structures in vivo.While ribosomal RNA (rRNA) with secondary structure and base pairings between nucleotides is a familiar concept, mRNA is frequently represented visually as a linear entity, typically marked with 5 0 and 3 0 labels (Vicens and Kieft 2022).This bias in conceptualization, compounded by the complexities of investigating RNA structures in vivo has led to the study of RNA structure lagging behind other fields of structural biology.
RNA structure is observed as dynamic in vivo, adapting to localized spatiotemporal conditions within the cell (Solayman et al. 2022).Factors such as minor changes in pH, salt concentrations, ligand availability, temperature, or point mutations can influence the behavior of covalent base pairs, consequently affecting the structure (Wan et al. 2011).These structural changes have a diverse impact on cellular biology (Mortimer et al. 2014), including transcriptional regulation (Tsai et al. 2010), splicing (Kar et al. 2011), translation (Ray et al. 2009), and RNA decay (Fukuchi and Tsuda 2010).
Studying RNA structure in vivo is becoming a combinatorial assay with recent success in the field coming from utilizing psoralen crosslinking methods, chemical probing and in silico folding for the same RNA (Spitale and Incarnato 2023).This is because the limitation of each of the methods are mitigated by the others; psoralen crosslinking methods such as COMRADES (Crosslinking of Matched RNA and Deep Sequencing) (Ziv et al. 2018(Ziv et al. , 2020)), PARIS (Lu et al. 2018), SPLASH (Aw et al. 2016), Karr-Seq (Wu et al. 2024), and LIGR-seq (Sharma et al. 2016) provide evidence for long-range base-pairing although not at base-pair resolution which complicates the use of in silico folding methods.Chemical probing methods, such as icSHAPE (Flynn et al. 2016), when applied alone, are limited by the presence of RNA binding proteins, solvent accessibility and their inability to detect long-range base pairing.However, their ability to provide information at single nucleotide resolution can improve in silico folding predictions of RNA crosslinking data.The subsequent analysis of this combinatorial data is complicated.
We present the rnaCrosslinkOO (RNA Crosslinking Object-Oriented) R package, a novel and versatile R package, that focusses on the downstream analysis of RNA crosslinking data.Analysis strategies exist for analyzing RNA crosslinking data including CRSSNT (Gabryelska et al. 2022, Zhang et al. 2022).Their focus is on the alignment of the chimeric reads and integrate well with this R package that provides visualization and analysis of processed reads.Although the package was designed to analyze COMRADES data, the rnaCrosslinkOO R package will accept any crosslinking data presented in the correct format.The objectoriented nature of the package allows the storage of raw and processed data.

Methods and application
2.1 Read pre-processing The COMRADES experimental protocol results in highthroughput sequencing data in FASTQ format.To process these raw sequencing reads for downstream analysis with the rnaCrosslinkOO package, we have developed a Nextflow (Di Tommaso et al. 2017) pipeline.Parameters for steps in the Nextflow pipeline can be found in Supplementary Table S3 (https://github.com/JLP-BioInf/rnaCrosslinkNF).Crosslinking experiments have varied library preparation protocols and often small differences mean that it is not possible to follow a prescribed pipeline for data pre-processing.For this reason, users can also create their own input files provided they follow the guidelines set out in the vignette and Supplementary Table S4.

The rnaCrosslinkOODataSet object
The rnaCrosslinkOO package can be installed using install.packages in R. A full vignette and usage documentation can be found on the CRAN website (https://CRAN.R-project.org/package=rnaCrosslinkOO) and through the vignette function.The object-oriented R package centers around a new S4 class, the rnaCrosslinkDataSet.This class consists of slots that facilitate the storage and accessibility of the data.

Reading in and exploring the global interactions
Loading the data into the rnaCrosslinkOO package requires; (i) Sample metadata in the form of a tab-separated table .(ii) The output of the COMRADES Nextflow pipeline or files in the same format (Supplementary Table S3).(iii) The ID of the RNA of interest.The COMRADES experimental protocol involves a round of enrichment for a specific RNA.However, the resulting data also contains structural information from other RNAs, as well as inter-and intra-RNA interactions for the RNA of interest.To gain a comprehensive overview of the RNAs within the dataset, there are three primary methods: featureInfo, topTranscripts, and topInteractions.These methods present the user with a table showing highly abundant RNAs and RNA-RNA interactions in the dataset (Fig. 1d).

Exploring the inter-RNA interactions
The COMRADES protocol crosslinks any nucleotide bound RNAs, this includes inter-RNA interactions.Exploring these interactions after using the topInteractions and topInteractors methods can be performed with getInteractions and getReverseInteractions.Plotting the resultant tables shows the location of reads for another chosen transcript.Users can also explore inter-RNA interactions as a 2D contact map with plotInteractions (Fig. 1i).

Clustering and trimming duplexes
In the COMRADES data, crosslinking and fragmentation leads to the production of redundant structural information, where the same in vivo structure from different RNA molecules produces slightly different RNA fragments.Clustering of these duplexes that originate from the same place in the reference transcript reduces computational time during the folding step and allows trimming of these clusters to improve the resolution.Clustering is performed as described in Ziv et al. (2020).Briefly, gapped alignments can be described by the transcript coordinates of the left (L) and right (R) side of the reads and by the nucleotides between L and R (g).Reads with similar or identical g values are likely to originate from the same structure of different molecules.In rnaCrosslinkOO, an adjacency matrix is created for all chimeric reads based on the nucleotide difference between their g values.From these weights the network can be defined as: G ¼ (V, E).To identify clusters within the graph, the graph is clustered using random walks with the cluster_waltrap function (steps ¼ 2) from the iGraph package (Cs� ardi et al. 2023).These clusters often contain a small number of longer L or R sequences due to the random fragmentation in the COMRADES protocol.Given the assumption that the reads within each cluster likely originate from the same structure in different molecules these clusters can be trimmed to contain the regions from L and R that have the most evidence (Fig. 1b  and c).The clustering and trimming is achieved with the clusterrnaCrosslink and trimClusters methods.The cluster agreement between replicates can be inspected with plotClusterAgreement and clusterAgreementHeat (Fig. 1g).

Check for domains
Folding RNA in silico becomes more computationally expensive and inaccurate as the size of the RNA increases.To allow the user to fold smaller parts of the RNA of interest rnaCrosslinkOO employs the plotDomains method.In the analysis of Hi-C data, domains are used to compartmentalize areas of the DNA with high inter-domain interactions and less interactions outside of the domain.Here we utilize a package designed for Hi-C analysis, TopDom to achieve this effect (Shin et al. 2016).The package was designed for larger molecules and the function provides output for a range of parameters (Supplementary Fig. S1B).

Folding
After choosing a domain, the user can create predicted structures for any region or the whole RNA of interest using the foldrnaCrosslink method.The folding works as follows; firstly, all clusters in the region are folded in silico using RNAFold from the Vienna package (Lorenz et al. 2011).For short range clusters (g > 10 nt) this is done by folding the region with RNAFold.For long-range clusters, an artificial linker is created between the two sides of the cluster and this sequence is folded using RNAFold.From these predicted structures of the clusters, the nucleotide contacts are then stored as constraints for the next step in the folding (Fig. 1e).Due to alternative topologies of the RNA in vivo, some of the cluster constraints may be mutually exclusive.In step two, the transcript region is folded 100 times by default, to produce a representative structural ensemble.Each time the RNA is folded, hard constraints that were identified in the first step are added sequentially and each time a constraint is added, the RNA is refolded.In the case where a constraint that is added shares a nucleotide position with a previously added constraint this new constraint is simply removed, and a new constraint is added.The user specifies how many constraints are added to each of the folded molecules.This produces an ensemble of structures that is stored in the object (Fig. 1e).To aid the analysis of the representative structural ensemble there are three functions; plotEnsemblPCA,  S3) and the matrixList contains the data as contact matrices.clusterGRangesList, ClusterTableList-These two slots are related to the clustering of the duplexes and contain a GRanges object and data frame of the cluster coordinates.viennaStructures, dgs, clusterTableFolded, and interactionTable-These slots relate to the folding of the RNA.The viennaStructures and dgs slots contain the vienna format structures and the free energy value for the predicted structures for each sample.The interactionTable contains the constraints identified from folding each of the clusters and the clusterTableFolded slot contains the predicted fold for each cluster.(a-i) Schematics and examples of steps in the rnaCrosslinkOO package, further detail for each step can be found in the main text. rnaCrosslinkOO

Figure 1 .
Figure 1.rnaCrosslinkOO.Main steps of the algorithm are shown in a,b, and c and optional analysis steps are shown in d-i.(a) Input for rnaCrosslinkOOand the main slots of the rnaCrosslinkDataSet object displayed with grey boxes with the functions used to access the slots in green.sampleTable-A table with the following column names; file (file path for sample), group (a code for the sample group), sample (a numeric value showing which replicates belong to which group), sampleName (a unique sample identifier).inputFiles, matrixList-The inputFiles slot contains the duplexes in their original input format (Supplementary TableS3) and the matrixList contains the data as contact matrices.clusterGRangesList, ClusterTableList-These two slots are related to the clustering of the duplexes and contain a GRanges object and data frame of the cluster coordinates.viennaStructures, dgs, clusterTableFolded, and interactionTable-These slots relate to the folding of the RNA.The viennaStructures and dgs slots contain the vienna format structures and the free energy value for the predicted structures for each sample.The interactionTable contains the constraints identified from folding each of the clusters and the clusterTableFolded slot contains the predicted fold for each cluster.(a-i) Schematics and examples of steps in the rnaCrosslinkOO package, further detail for each step can be found in the main text.