Tailored machine learning models for functional RNA detection in genome-wide screens

Abstract The in silico prediction of non-coding and protein-coding genetic loci has received considerable attention in comparative genomics aiming in particular at the identification of properties of nucleotide sequences that are informative of their biological role in the cell. We present here a software framework for the alignment-based training, evaluation and application of machine learning models with user-defined parameters. Instead of focusing on the one-size-fits-all approach of pervasive in silico annotation pipelines, we offer a framework for the structured generation and evaluation of models based on arbitrary features and input data, focusing on stable and explainable results. Furthermore, we showcase the usage of our software package in a full-genome screen of Drosophila melanogaster and evaluate our results against the well-known but much less flexible program RNAz.


INTRODUCTION
Experimental and theoretical work in molecular biology typically presupposes the correct annotation of genomic data and ther efor e depends on the compilation and curation of accessible sequence databases. This demand for highquality genomes led to the emergence of high-throughput sequencing (HTS) techniques and huge amounts of data since the 2000s ( 1 , 2 ).
Automa ted annota tion pipelines increasingly supplement -and to a certain extent replace -time consuming, errorprone and often poorly reproducible manual curation techniques ( 3 ). Still, until recently, annota tion ef forts have largely focused on protein-coding genes (re vie wed by ( 4 )).
Howe v er, less than 25% of the transcribed human genes account for the entirety of the 19 000 protein-coding RNAs ( 5 ). Non-coding RNAs (ncRNAs) have been shown to perform a di v erse range of biological and regulatory functions ( 6 , 7 ) and play an important role in the genesis of cancer and other pathophysiological processes ( 8 ). They are much more di v erse than protein-coding RNAs with respect to their biogenesis, processing, molecular mechanism, and evolutionary histories. While some prominent classes, such as microRN As, siRN As, snoRN As, rRN As and tRN As ( 9 ) are well understood, information is still sparse for most ncRNAs. These molecules display a wide range of features such as folding and assembling into complex superstructures, interactions with other RN As, DN A or proteins, and the regulation of their activity ( 10 ). Non-coding RNAs have also been found to play a major role in chromatin remodeling and epigenetics ( 11 ). Despite their abundance and biological relevance, identification and annotation of ncRNAs is still much sparser and more superficial than the functional annotation of protein-coding genes. At least in part this is explained by a pr efer ence for poly-A enrichment protocols in HTS-based studies, and by the predominant interest in protein-coding genes in medical r esear ch ( 12 ). As a consequence, protein-coding genes have also received more attention in computational studies.
A subset of well-described non-coding RNAs features a significant degree of conservation of their secondary structures, which is often associated with biological function ( 14 , 15 ), Figure 1 . Evolutionary conservation of RNA secondary structur e ther efor e serves as a promising pr edictor of function in the de novo identification of ncRNA elements. Both minimum free energy (MFE) structures and intrinsic properties of nucleotide sequences are used in tools such as RNAz ( 16 , 17 ) or Evofold ( 17 ) to identify evolutionary conserved, and thus likely biologically functional, secondary structures. Such approaches are facilitated by the Many examples, such as the depicted tRNA (left) and Hammerhead Ribozyme (right), show a particularly high degree of conservation of a common structure and a strong evolutionary selection towards muta tions tha t stabilize it. The high degree of observed co-varying and compatible m utations strongl y support the hypothesis that their structure is r equir ed for the sequence to r etain its structur e-to-function r elationship. Figur es wer e generated using R2R ( 13 ).
relati v e ease of predicting RNA secondary structures for a gi v en sequence using a simple, well-understood energy model ( 18 ). Recent studies suggest that the sequence of a gi v en RNA motif can e volv e at normal or e v en accelerated pace while conserving the associated secondary structure, thus suggesting evolutionary pr essur e targeted specifically a t conserva tion of structure configura tion ( 19 ). There is also evidence for a high diversity of functional motifs, such as structural switches and others, that is not yet fully mapped out e v en in bacterial regula tory pa thways ( 20 ).
The majority of recently de v eloped automatic annotation pipelines focus on the classification of confirmed transcripts, e.g. fr om high-thr oughput RNA sequencing experiments, as non-coding RNAs or, more common, long noncoding RN As (lncRN As) against a backgr ound of pr oteincoding sequences. Prime examples of such tools include CPAT , CPC2 and CNCI (21)(22)(23). Effecti v ely, these methods define ncRNAs as e v erything that is not recognized as protein-coding, instead of assuming an ambiguous background. The limitations of such a two-way classification approach for the identification of long non-coding RNAs were already discussed elsewhere ( 24 ). In the context of a screen conducted on genomic alignment data, it has to be noted that large parts of the genome are only transcribed in specific cell types ( 25 , 26 ).
While the reality of pervasi v e transcription of genomic DN A is widel y accepted, it has remained a matter of debate what fraction of the transcriptional output conveys recognizable biological function and whether ther e ar e high le v els of 'transcriptional noise' ( 27 ). The latter view is supported in particular by the low le v els of sequence conservation observed for most lncRNAs, although the majority of transcriptional units are evolutionarily old ( 28 ). Classification of transcribed sequences , furthermore , is only viable if the available transcriptome data is sufficiently accurate to assemble essentially complete full-length transcripts. This is often not the case due to comparably low expression levels. Taken together, modeling an ambiguous background is of utmost importance when screening sequences of unknown function, as is the case with genome-wide screens for identification of potentially transcribed loci of interest and their hypothetical function.
Machine learning methods are utilized in many automa tic annota tion pipelines in the field of compara ti v e genomics ( 29 , 30 ). A key issue, howe v er, remains the inability of trained models to adapt to other species not sufficiently r epr esented in the training data. This poses a serious problem since features like nucleotide frequencies, codon usage, and structure motifs are distributed unequally across differ ent r ealms of life. Ther efor e, one-size-fits-all approaches typically cannot tap their full potential as they have to forego properties of high discriminati v e power specific only to a smaller set of related species.
To address this issue, we implemented Svhip (Software and manual accessible under: https://github.com/ chrisBioInf/svhip ), a novel software pipeline for the calibration of di v erse machine learning models used in genomewide non-coding and / or protein-coding RNA screens. The software is available as source code on Github as well as provided via the conda environment manager. The key feature of Svhip is the ability to train a multitude of different models automaticall y w hile gi ving full control ov er indi vidual steps to the user. We offer the possibility to generate both models for the identification of structurally conserved noncoding RNA elements against an ambiguous background as well as for the dif ferentia tion between non-coding RNAs, protein-coding sequences and / or undefined others. Generated models can be used in the frame wor k of Svhip itself or, in the case of two-class models, be exported to be used with the highly successful annotation tool RNAz ( 16 , 31 ). Long-term, we aim towards creating a di v erse collection of models for various applications working out of the box, as well as enabling users to share their own models and parameter sets. This, by itself, ensures a large amount of reproducibility and accountability to individual experiments and genome screens and their resulting discoveries.
The Results section of this contribution focuses on new de v elopments and the evaluation of automatically trained Svhip classifiers, while the Materials and Methods section summarizes in detail how data has been acquired and processed. In the latter we first summarize how test data has been compiled that is used to evaluate prediction accuracy of existing approaches, i.e. RNAz and RNAcode , and the models generated with Svhip . Second, the processing of raw alignment data for model training is described. This includes random background model generation and a detailed description of well established and newly de v eloped features. The Materials and Methods section is complemented by the description on what kinds of models Svhip supports, how they are trained and how their quality is accessed. Finally, data acquisition for an exemplary genome wide screen on D. melanogaster is summarized. All this is utilized to present Svhip 's applicability to classify non-coding RNA elements and / or protein-coding sequences against an ambiguous background throughout the Results section.

Test Data pr epar ation
Two separate data sets were used in the evaluation process of Svhip . For initial testing, we used a preexisting data set that was originally applied to evaluate RNAz ( 31 ) and SISSIz ( 32 ). It consists of 3,832 established and structurally conserved alignments with 2-6 sequences of noncoding RNAs sourced from the Rfam database ( 33 , 34 ). A wide range of vertebrate species is covered in this set. As a control, it also contains a set of random genomic locations taken from full genome alignments, which mirrors the noncoding set in length, number of sequences and dinucleotide composition.
To test the capabilities of our own genomic background sim ulation a pproach for future da ta set genera tion, we shuffled each non-coding RNA alignment on a column by column basis using the rnazRandomizeAln.pl tool from the RNAz software frame wor k. We also simulated alignments based on the non-coding set with conserved dinucleotide composition and gap patterns utilizing SISSIz . Ther efor e, the data set consists of four subsets (non-coding, random genomic loca tions, column shuf fled and SISSIzsimulated) with 3,832 alignments each. This data set was later reduced in size to 3,060 alignments in each subset, as all alignments containing only two sequences have been removed. This was necessary, because SISSIz shows some substantial drawbacks in the simulation of such alignments, see Supplementary Figure S1. If not otherwise indicated, this is the data set used in evaluations.
A second data set is based on a 27-way full genome alignment with the model organism Drosophila melanogaster as r efer ence provided by the UCSC genome browser ( 35 , 36 ) ( https://hgdownload.soe.ucsc.edu/goldenPath/dm6/ multiz27way/maf/ ). The corresponding annotation in GTF format was obtained from FlyBase (version 6.44, last accessed on Mar ch 20, 2022) ( 37 , 38 ). To pr epar e full genome data for later assessment and comparison with the RNAz software, alignments were sliced into overlapping windows with the rnazWindow.pl software on a per-chromosome basis. We used windows of length 120 nucleotides and a step length of 40 nucleotides, generating an expected overlap of 80 nucleotides between neighboring windows. This rather large overlap is necessary to reduce the likelihood of falsely excluding genetic loci that display only very localized conservation signals. It should be noted that the parameters listed here are not mandatory for Svhip , but are considered optimal for most RNAz use cases. As we want to achie v e a direct comparison with RNAz on a per alignment basis, we ther efor e followed its original protocol as closely as possible.
True labels of Drosophila alignment windows for evaluation purposes were assigned using the annotation in GTF format as follows: For each window we calculated the overlap with annotated transcripts, either coding or non-coding, as a fraction of the total window length. If the overlap exceeded 0.5, we assigned the corresponding label to the window. The cutoff at 0.5 was chosen to leave room for disadvantageous window slices and to increase the likelihood of at least one window covering most genes with sufficient length for detection. The alignment windows that were not filter ed ar e then consider ed as r epr esentati v e of their respecti v e class and serve as the basis for the generated training set. We use fiv e features to capture essential information from these alignments while keeping redundancy at a minimum.

Data processing
The data processing pipeline is explicitly intended for the preparation of raw alignment data for model training. It is not to be used for prediction runs, as the filtering steps implemented here may disrupt signals. Svhip accepts multiple sequence alignments in Clustal or MAF format as input. In the first processing step, these are grouped by the label assigned to them during initialization, either ncRNA or other in the case of two-way classification or ncRNA, proteincoding and other for three-way classification. We then use the rnazSelectSeqs.pl script for the initial selection of sequences. All sequences with a pairwise identity beyond a certain threshold (defaul is 98%), are rejected. This is to ensure that redundancy in the data set is reduced to a minimum.
All alignments in the set are cut into overlapping windows of lengths 50-200 nucleotides. These windows then form the basis for the calculation of feature vectors for subsequent model training. Structure conservation is considered a prime indicator of functional RNA elements, windows r epr esenting the ncRNA subset of the data ar e filter ed for this property. Alignment windows are shuffled columnwise and a tree representation of the most stable secondary structure for each sequence is calculated utilizing RNAfold . Subsequently, the average tree edit distance is calculated using RNAdistance for the window as an approximation of the r emaining structur e conserva tion after shuf fling. Both of these tools are part of the ViennaRNA package ( 39 ). This step is repeated for all windows and a Gaussian background distribution is fitted to the observ ed av erage edit distances. This distribution is then used to filter the input alignments by calculating an empirical p -value of their average tree edit distance, which is considered significant if it is lower than 0.05, see Figure 2 A. This ensures that alignment fragments with a low structure conservation do not dilute the ncRNA training set, as would be the case with, for example, lncRNAs containing long unstructured stretches. In highly conserved alignments, the column-wise shuffling ensures a reduction in structure but not sequence conservation. An acceptance statistic for 100 randomly chosen align-A B Figure 2. The effect of column-wise shuffling on the average tree edit distance of the sequences in an alignment. The tree edit distance between two sequences of the alignment is deri v ed from the tree representations of the secondary structur e corr esponding to the minimum fr ee energy states. ( A ) Average tree edit distances of four selected alignments ( ) and the resulting distribution obtained by column-wise shuffling of these alignments (boxplot). Red lines indicate the estimated cutoff for statistical significance with a P -value of 0.05. ( B ) Possible outcomes of this filtering approach using different p -value cutoffs. The figure compares the acceptance rates of 100 randomly chosen strongly conserved non-coding RNA alignments and 100 alignments of random genomic locations as a control group. While it can be observed that the acceptance rate for conserved ncRNA elements remains relati v ely high at 60% e v en at a strict cutoff of 0.01, the acceptance of control alignments becomes increasingly unreliable with higher p -values. For this reason, a low P -value cut-off is usually preferable to reduce the false positi v e rate. ments from the pool of our 3,832 non-coding RNA alignments or genomic locations can be viewed in Figure 2 B.
If no negati v e set r epr esenting the genomic background is provided on program initialization, a synthetic negati v e set will be generated based on the input data. To achie v e this, SISSIz is applied to simulate alignments controlled for dinucleotide composition and gap pattern. Tests show that SISSIz generated alignments can be used as a valid a pproximation for randoml y selected genomic background yielding comparab le le v els of structure conservation and minimum free energy ( Figure 3 ). It can be observed that the variance using a randomly selected genomic background is notably larger, in particular for the alignment entropy. As this value, as well as the mean pairwise sequence identity, should be as stable as possible between positi v e and negati v e sets, this also serves to illustrate the necessity of generating artificial control sets based on input data. This makes it viable to simulate an arbitrary background, while mostly preserving species-specific nucleotide distributions based on the input alignments with SISSIz . We also inves-tigated the viability of using column-wise shuffled versions of the input alignments as negati v e sets ( Figure 3 ). For this we used rnazRandomizeAln.pl , which also attempts to preserve local conservation and gap patterns during shuffling. The substantial overlap between all distributions, Figure 3 B and C, indicates that all three control methods, i.e. randomly selected genomic background, SISSIz simulated and rnazRandomizeAln.pl randomized alignments, are useful and viable in principle. Howe v er, both the higher variance in sequence identity as well as the more difficult sampling process suggest to avoid the usage of random genomic locations.
The alignment windows that were not filtered are consider ed as r epr esentati v e of their respecti v e class and serv e as the basis for the generated training set. We use fiv e featur es to captur e essential information from these alignments while keeping redundancy at a minimum.
Structure conservation index. The structure conservation index (SCI) is a metric for the alignment-wide presence, size and frequency of sites that are conserved in their secondary structure. The SCI compares the MFE of individual sequences with the consensus sequence MFE. Significant devia tions indica te structured elements tha t are not present in other sequences. Thus, a value greater or equal to 1 r epr esents a v ery strong le v el of conservation, while a value close to 0 indicates none. The SCI for the alignment A is estimated by where x ∈ A are the individual sequences, E x is the MFE of x , and E consensus is the MFE of A 's consensus sequence as predicted by RNAalifold .
z-score of minimum free energy. The minimum free energy is a reliable indicator for the presence of paired bases and, hence, secondary structure. A point of interest in the search for conserved non-coding RNA elements is identifying (sub-)sequences with an MFE that significantly devia tes from tha t of sequences with equal nucleotide composition but no notable secondary structure. To detect these, the z -score of MFE is used. It can be calculated using one of the following two procedures. In the first case, a gi v en sequence is 100 times dinucleotide shuffled using the Altschul-Erickson algorithm ( 40 ) and the MFE is calculated for each resulting sequence. For the resulting distribution of random MFEs, the standard deviation and mean ar e calculated. Finally, the z -scor e of the gi v en MFE X is computed as The mean normalized z -score of the MFE for the full alignment can thus be expressed as: where x denotes the standard deviation, E x the MFE and x the mean for sequence x . Explicit shuffling, howe v er, is a highly e xpensi v e operation in terms of computation time per input alignment as it effecti v ely r equir es the sampling and estimation of the unimodal distribution for each sequence in the alignment. For this reason, we implemented support vector r egr ession models that estimate the expected standard deviation and mean based on fast-tocalcula te, sequence-intrinsic fea tur es. These ar e: all 16 dinucleotide frequencies, C to G+C nucleotide fraction, A to A+U nucleotide fraction and the normalized length of the sequence. Normalization was carried out by dividing the sequence length by the maximum considered length of 200, thereby yielding a fraction between 0 and 1. The models were trained on multiple large sets of synthetic sequences r epr esenting GC-content fractions in 0.1 intervals from 0.2 to 0.8 and sequence lengths from 50 to 200. The estimated values are within an acceptable deviation from those obtained with the e xhausti v e approach outlined abov e, see Figure 4 . In our use cases, this approximation yields sufficiently accurate results. It is, howe v er, possib le to enforce a shuffling and distribution sampling for all input alignments for maximum accuracy. It should be further noted that both a pproaches yield reasonabl y stable results: The SVR will naturally return identical values for identical input vectors, the mean MFE values obtained by manual sampling in our tests typically deviate from the mean by less than 5% for 100-fold resampling (Supplementary Figure S2).
Shannon entropy. The Shannon entropy H is a metric for the sequence variation not otherwise captured by the features outlined here. It is calculated for each alignment as where B = { A, U, G, C} is the alphabet of nucleobase symbols, and p b i r epr esents the empirical probability of nucleobase b at the i th of (in total) n alignment columns.
Alignment-wide hexamer score. The hexamer score ( Hex ) is a common metric for estimating coding potential in Figure 4. Performance of the support vector r egr ession engine trained to predict the standard deviation and averages of minimum free energy (MFE) values of sequences with a gi v en mono-and dinucleotide composition. A test set of 24,000 sequences was randomly selected from the non-coding and the random genomic subsets of our test data. The SVR was trained on 285,527 synthetic sequences generated to account for different GC content in 10%-intervals and different A / (A+U) and C / (C+G) ratios in 5%-steps. Left: The predicted and the calculated z -scores obtained from real distribution sampling. For a small number of sequences within the test set, the SVR significantly underestimates the average MFE of the shuffled sequence set (Supplementary Figure S2). The absolute mean average error (MAE) of all data points is 0.49. Right: The distribution of GC content ranges in the test set is shown. Sequences with a GC content of less than 30% and more than 70% are comparably rare in the test set. In our understanding, this accurately reflects the expected frequencies encountered in genomic test data, e.g. in the D. mel. genomic data analyzed as a case study. arbitrary sequences and transcripts. A log-odds ratio is calculated based on the probability to find a gi v en 6-mer of nucleotides in either a protein-coding or non-coding background ( 21 , 41 ). It is conceptually built on the observation that amino acids in a gi v en protein influence their neighbors in their empirical likelihood to appear in this position. The score is largely dependent on the observed reading frame.
Our approach for the calculation of an alignment-wide hexamer score is as follows: first, the hexamer score is calculated for each sequence and reading frame using the formula where l is the number of possib le he xamers in the sequence, F ( H i ) is the empirical probability to find hexamer H i in a coding sequence and F ( H i ) is the empirical probability assuming a non-coding background ( 21 ). As these genomic backgrounds are obviously not equal for all organisms, probability models are provided for human, mouse and drosophila genomes. Howe v er, as these are only an insufficient approximation for many r esear ch cases, the Svhip software also provides the program hexcalibrate to recalibrate these models gi v en fasta files with coding and noncoding sequences. We then interpret the reading frame with the highest score as the 'true' reading frame for the purpose of evaluation, as non-coding sequences tend to produce low scores in any frame. The average value of the maximum scores is then used as the alignment-wide hexamer score. We found the hexamer score to have high predictive capabilities for the purpose of identifying protein-coding sequences. As expected it is however not suitable for the dif ferentia tion of structurally conserved regions from an unstructured background (see Figure 7 ).
Codon conservation score. In terms of alignment-based coding potential estimation, the tool RNAcode is one of the most reliable estimators, even a decade after its initial release ( 42 ). It relies on the simulation of a large number of potential phylogenetic trees to assess the statistical significance of the calculated score, which is computationally expensi v e. In addition RNAcode extracts the optimally scoring interval from a gi v en alignment b lock and accounts for frameshifts and mis-aligned sequences. Here, we built a simplified version of the underlying scoring algorithm that sacrifices some accuracy in exchange for greatly reduced computational costs. The raw pair-wise codon conservation score (CCS) for each two sequences in an alignment is calculated as where B62( A i , A i ) is the BLOSUM62-score of the two amino acids A i and A i encoded by the i th aligned codon in the first and the second sequence, respecti v ely. N is the number of aligned codons. A rudimentary phylogenetic tree for the input alignment is estimated using a neighbor joining (NJ) approach based on pairwise sequence identity. The NJ algorithm is implemented using the corresponding function of the Phylo module of the Biopython library ( 43 ). This allows an estimation of the relati v e e volutionary distance d between the sequences. The default NJ method was chosen for simplicity and speed, howe v er, custom phylogenetic trees obtained with arbitrary means are supported.
Figur e 5. Different ga p pa ttern handling stra tegies for the purpose of codon conservation calculation. The currently analyzed triplets are marked with red boxes. The left illustrates common patterns of gaps while the right hand side shows approaches how these are handled by highlighting the next analyzed triplet. 1) In the simplest case, there is just an offset by 'aligned' gaps that are present in both sequences. In this case, the reading frame will simply be adjusted by the number of shared gaps and a small gap penalty will be applied. 2) If gaps occur that are not 'conserved' between sequences but that sum up to the same number, the frame is corrected by being extended one-sided exactly by this number of gaps. 3) If there is a single gap in the non-r efer ence sequence, it is likely that a sequence error is pr esent her e. In this case a constant penalty is applied and the gap is ignored, leaving the original (reference sequence) frame intact. 4) If there is one or a number indi visib le by three gaps in the curr ent r efer ence sequence but not in the aligned sequence, we will assume the correctness of the r efer ence sequence and the corr esponding sequence frame is extended by that number. 5) If ther e ar e gaps that are multiples of three, the reading frame stays intact and is just shifted past the ga ps. Onl y a slight penalty is applied to reflect the deletion of complete codons. Applied penalties are determined as follows: in case of a single or m ultiple ga p penalty, the penalty has a value of -1 times the number of observed gaps. In case of a frame gap penalty, the value is fixed at -12.
P ( t i j , t i j ) is then defined as the empirical probability to observe the mutation from the nucleotide in position j in triplet t i encoding A i to the corresponding nucleotide in t i encoding A i gi v en the tree. The above score is calculated accounting for all possible reading frames. Errors in the alignment are skipped over by correcting for single-nucleotide deletions and frame breaks, see Figure 5 . The conservation scores are stored in a matrix and used to calculate the most probable coding subregion in the alignment by calculating the maximum subarray. The score is then divided by the alignment length for normalization and used as a feature for classification.
The calcula ted fea tur es ar e saved in CSV format and can thus readily be accessed. Statistical information regarding feature distribution and degree of separability is provided as part of the output, Figure 6 .

Model training
By default, Svhip supports thr ee differ ent types of models: support vector machines (SVMs), random forests (RFs) and logistic r egr ession (LR). In principle, e v ery feature set can be used to train each type of model, e v en though there are some qualitati v e differences between them. SVMs hav e by far the highest computational complexity, whereas LR has the lowest ( 44 ). Furthermore, LR might not be suitable for particularly complica ted classifica tion scenarios with large overlap of feature distributions between classes, which is exemplified by the fact that many biological classification problems are not linear ly separ ab le. Howe v er, the lower training and classification time might make using an LR model worthwhile for a preliminary screen or when using an exceptionally clean data set. The question whether SVMs or RFs are fundamentally better in terms of raw accuracy for genomic applications is currently unanswered, e v en though both were capable of separating the same data sets equally well in our own studies when gi v en appropriate training da ta (da ta not shown). W hen we trained RF classifiers instead of SVMs using the same training data and using the same test data, we recei v ed marginally worse accuracy results (data not sho wn). Ho we v er, we belie v e that the size of the data sets is too small for generalized conclusions here.
Svhip offers automatic hyperparameter optimization for all models using either a grid search or a random walk based approach. Both options are fully customizable in their search depth and number of cross-validation steps, howe v er, we found that more than fiv e cross-validation iterations typically do not offer a significant improvement in accuracy gi v en the features used (data not shown). Generated models are saved in a binary format and can be loaded using the predict option. We also note that, under certain circumstances, the expected increase in accuracy by hyperparameter optimization might be marginal when compared with the quality of the training data. Parameters associated with training, including hyperparameters and scaling parameters for normalization are saved together with the model and loaded as r equir ed for prediction purposes.

Prediction with generated models
Svhip classifies alignment data by either employing one of the built-in prediction models, or by resorting to a custom model previously generated using the features command as described above. For the classification of input data, the alignments have to be cut into overlapping windows to achie v e accurate results, analogous to their prepara tion for fea ture calcula tion. An alignment length of 50-200 nucleotides with a sliding overlap of 40-80 nucleotides yields r eliable r esults for the cases pr esented her e. For the preparation of windows, the software rnazWindow.pl of the RNAz frame wor k was applied. The predicted class labels are assigned to the data sets on a per-window basis. Furthermore, an overall class probability is reported which, in the case of SVM classifiers, is based on a regularized maximum likelihood score ( 45 ). RFs nati v ely support the calculation of class probabilities as a fraction of class votes by trees in the ensemble, and LR inherently calculates probabilities by virtue of being a likelihood function fitted to binary observations.

Quality assessment of Svhip -trained classifiers
To assess the quality of the models generated by our software, we first constructed a classifier from curated Rfam data and compared the performance with the unmodified 8 NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 3 A B Figur e 6. Gra phical output automaticall y generated by Svhip , exemplified on an alignment of bacterial RNAse P RNA sequences (Rfam ID: RF0009). Every data point r epr esents a generated alignment window incorporated into the training set (see Supplementary Table S1). For a comparison using an input file containing both eukaryotic and prokaryotic sequences as input, see Supplementary Figure S4. The graphical reports serve as a first indicator for the assessment of input data quality and separability of relevant features, thus hexamer score and codon conservation distributions are not shown for the RNA example here. ( A ) Sca tter ma trix showing rela tions between individual fea tures as well as density plots. A good separa tion between na ti v e non-coding sequences and the SISSIz -generated control set can be achie v ed based on SCI and the z -score of MFE. The discrepancies between class cardinalities, as inferred from the density plots in the diagonal, can be attributed to the structural conservation filter employed in the pipeline, causing a reduction of viable data points for the non-coding RNA set. ( B ) Box plots illustrating the distribution of individual features, allowing the same observations in a more str eamlined way. These r eports may also serve as a basic exploratory analysis of ne wly assemb led alignments, taking note of statistical significance of secondary structure properties as compared to the automatically generated control set. Figure 7. Distribution of the alignment-wide hexamer score and the codon conservation score across different gene classifications and the genomic background. Classification performance on D. melanogaster Multiz genome alignments (left) and applied to protein-coding alignments obtained from the InsectBase, see section Data processing, when compared with the ncRNA classification training set (right). In both cases, the codon conservation score is significantly higher in most sequences annotated as a known coding sequence. To a lesser extent, the same is true for the hexamer score.
RNAz software. Sensitivity and specificity were estimated on the same data set used in the original assessment of RNAz 's accuracy ( 31 ), see section Test Data preparation. The training set was assembled by manually selecting the first 50 Rfam alignments with known and significant secondary structure as evident by the literature, see Supplementary Table S1. The alignments also had to contain more than 50 sequences. If this was the case for the seed alignment, it was used directly, otherwise we chose the regular alignment instead. These were downloaded as fasta files, and a corresponding data set was generated using Svhip with standard parameters. Based on this data, a SVM classifier was trained, using the builtin grid search hyperparameter optimization with a range of 1 to 10,000 and a logarithmic increment for the cost parameter C of the SVM. As this par ameter in gener al governs the weight of misclassifying any gi v en observation in a support vector machine, it is vital to optimize it f or an y modified training set. The same range was used as a baseline for the gamma parameter, howe v er the auto-scaling option as provided by the scikit-learn ( 46 ) package was also added to the test range and proved to be the most efficient. The auto scaling sets the gamma value to one divided by the number of features and is, in many cases, a good estimate ( 46 ).

NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 3 9
We note that far mor e car e could have been taken to ensure the selection of high-quality and r epr esentati v e input data sets. Howe v er, an important point to demonstrate here is the underlying hypothesis that the data preprocessing pipeline as implemented in Svhip already optimizes the data set for maximum discriminati v e power and lowest redundancy. For this reason, we also compared the quality of the generated classifier to one using the same input data but with most of Svhip preprocessing steps deactivated. This was achie v ed by running the Svhipdata command with flag -F set to False and flagmax-id to 100. This deactivates both the tree-edit distance based filter for an insignificant structure signal and the filtering of (almost) identical sequences, respecti v ely. The e xpectation here was a statistically significant reduction of the resulting capability to differentiate positi v e from negati v e class instances, presumab ly below both the optimized Svhip -generated classifier and RNAz .

Genome-wide screen on Drosophila melanogaster
To acquire a reasonable estimate of the expected accuracy of a Svhip -trained classifier and showcase a basic experiment with a retrained classifier, we performed a genome-wide screen on D. melanogaster full genome Multiz alignments in MAF format and compared predicted loci with a contemporary annota tion. Alignment da ta for D. melanogaster was pr epar ed as outlined in section Test Data preparation. Two different Svhip -trained classifiers were utilized for this study: one already trained and tested in the previous section, and one with added data for protein-coding sequence dif ferentia tion.
Training data for the protein classification was assembled as follows. Coding sequences were sampled from the InsectBase ( 47 ). To this end, we randomly selected 400 protein-coding genes from the Drosphila annotation and then downloaded all corresponding sequences from the database. To further reduce the amount of data, we limited ourselves to the Ephydroidea (containing Drosophilids ) and Oestroidea superfamilies. Supplementary Table S3 summarizes all protein-coding genes that were processed using the Svhip pipeline and produced feature vectors used in training.

MAF alignment block merging
We implemented and applied mafmerge to merge short MAF blocks into larger consensus alignments. The tool, which is available as part of the Svhip software, performs the following steps: The r efer ence genome is traversed in sense direction and for each individual alignment block, the following block is analyzed for consensus of species aligned, genomic distance and continuous reading direction between sequences aligned in both blocks. By default we use a threshold of 75% of aligned species having to be present as consensus in both blocks for merging. Furthermore, all aligned sequences have to be directly continued in the next block with respect to their genome coordinates, i.e. the start nucleotide index of one sequence has to be its end nucleotide index plus one in the other alignment block. The merging process of continuous blocks terminates automaticall y w hen a length threshold is surpassed. A default value of 1,000 was chosen based on the observa tion tha t the majority of blocks in the Drosophila alignments do not get joined past this point due to lack of a sufficient consenus of aligned species, see Supplementary Figure S3. Considering the compatibility of genomic coordinates of non-r efer ence sequences implicitly adds information regarding synteny in the r esulting alignments. Ther e is no indication that this introduces biases that might ad versel y affect the accuracy of secondary structure prediction.

RESULTS
We tested the Svhip pipeline and deri v ed models for a range of different test sets and scenarios. The primary goal here was to properly estimate the specificity and expected false positi v e rate on both alignments of known non-coding RNAs as well as from the context of a full genome screen on D. melanogaster . Specifically, we also wanted to estimate the prediction quality on non-coding RNA data against the RNAz frame wor k, which shares se v eral of the implemented machine learning features, but comes with a hard-coded decision model, not allowing for retraining of the underlying classifier. In particular, we demonstrate that our preprocessing workflow for the raw training data does in fact yield high-quality positi v e instances, while the generation of synthetic negati v e data successfully reduces false discov eries to a minimum. We also show the separability of coding and non-coding data in an alignment context using our features designated for estimating coding potential.

Workflow and commands
Svhip of fers dif ferent modes of action, denoted as data , data3 , training , features and predict . The first two serve to generate training instances for the machine learning engine from multiple sequence alignment data, either assuming a two-way (structurally conserved noncoding RNA or protein-coding RNA vs genomic background) or three-way classification (non-coding RNA, protein-coding RNA and ambiguous background), respecti v ely. The data and data3 commands also evaluate the suitability of a gi v en set of alignments for training the classifier by analyzing properties such as the average minimum free energy, le v els of conservation and the ov erall quality of separation achie vab le with selected features. Ev ery step of the pipeline is accompanied by graphical output to facilitate the assessment of the results and potential issues, Figure 6 . The training command computes a classification model based on a data set generated with either of the previous commands. The remaining two modes, features and predict , serve to calculate the featur es r equir ed for pr ediction of alignment data, and to perform the actual classification using an existing model, respectively.

Protein-coding features
We generalized the hexamer score, commonly used to estimate coding potential of individual sequences, to be applicable to alignments. Furthermore, a reduced version of the scoring scheme applied in RNAcode is implemented in the codon conservation score that sacrifices some accuracy Table 1. Evaluation of classifier performance using two Svhip -generated classifiers and RNAz as gold standard. We calculated sensitivity , specificity , false positi v e r ates and r aw accur acy dir ectly from the classification r esults. Note that these metrics r efer to both control sets combined. In any case, the r esulting numbers ar e very similar. As an immedia te observa tion, the da ta preprocessing step of filtering secondary structure alignments for significance implemented by Svhip notably causes a trade-off between sensitivity and specificity towards a lower false positive rate for substantially improved speed, see Materials and Methods --Data processing and Supplementary Figure S5 for a comparison of run times. As the codon conservation score was deri v ed as a simplified version of the algorithm employed by the RNAcode software, we did a comparati v e analysis between both approaches on the Drosophila genome alignments (see section Materials and Methods --Test Data preparation). All alignment windows containing at least three sequences were screened using RNAcode and their respecti v e codon conservation scor es wer e calculated. The r estriction of the sequence number was necessary as RNAcode does not nati v ely support the calculation of scores for alignments of only two sequences. As RNAcode r eports scor es for multiple possible subalignments while the codon conservation score is alignment-wide, only the best RNAcode hit was taken into account for this comparison. The resulting Pearson and Spearman correlation coefficients of 0.49 and 0.5 indicated only a weak correlation of both scores. In part, this can be explained by the fact that the codon conservation score takes the full alignment length into account and not just the subalignment with the highest individual score. A mor e r ele vant question, howe v er, is whether both scores are suited metrics to tell coding and non-coding sequence alignments apart. Using annotated alignment windows as a basis and separating them into coding alignments and other alignments , we performed a Wilco x on rank sum test on both sets of data and metrics, respecti v el y. Using this a pproach, the null hypotheses that both sets of measurements originate from the same underlying distribution can be rejected for both metrics with a very high confidence as indicated by p -values smaller than 10 −15 . Thus, while ther e ar e certain intrinsic differences in both approaches, they do achie v e a strong dif ferentia tion between coding and non-coding sequence alignments.
The approach utilized by RNAcode yields higher raw discrimination power in comparison to the hexamer score (data not shown) when screening explicitly for the highest scoring, connected area in an alignment. In this sense, a protein-coding training set that was pr e-scr eened using the RNAcode approach would, in most cases, be of higher reliability than one relying e xclusi v ely on the alignment-wide hexamer score. The former is however associated with significantly higher computational cost in an already computationally e xpensi v e pipeline. We ther efor e leave the decision for either of the methods to the user.
By combining information of InsectBase and FlyBase, we generated protein-coding alignments and investigated the distribution of the alignment-wide hexamer score and codon conservation score in coding and non-coding context. The codon conservation score shows a very strong sep-arability of classes just based on this feature alone, while the alignment-wide hexamer score exhibits a moderate performance, Figure 7 . The separation is much stronger in the training data set than for the genomic alignment data. This, howe v er, is to be expected as a large number of genomic alignment windows naturally overlap with non-coding sequences, thus reducing the overall yield. The training instances can ther efor e be seen as an ideal classification set, depending on input data. It should also be kept in mind tha t both fea tur es measur e differ ent properties of coding sequences and thus naturally might produce results of varying specificity depending on the input data. In particular, the hexamer scor e measur es the relationship of codons and their immediate neighbors, while the codon conservation score estimates the overall retention of selection for coding potential in an evolutionary context. It has to be noted that, in this classification task, no single feature carries the full inf ormation f or a correct label assignment. In most cases, all of them have to be considered in the context of the remaining parameters.

Analysis of tw o-w ay classification perf ormance on alignments of known ncRNAs
Accuracy, sensitivity and specificity of RNAz and two Svhip -trained classifiers were compared on the original RNAz test set. The difference between the latter two lies in their optimization by filtering for secondary structure significance of the input. The structure filtered classifier refers to the one trained by a ppl ying the secondary structure filter method and serves to illustrate the relati v e effecti v eness in increasing the specificity by comparison (see section Quality assessment of Svhip -trained classifiers). All three classifiers show a high degree of dif ferentia tion between true positi v e and true negati v e instances, Tab le 1 . Generated ROCcurves further support this impression as all three classifiers cover an area under the curve of 0.97, see Table 1 . However, the ROC metric compares only relati v e la bel proba bilities assigned to the test instances, and it should be understood as supplementary to the raw assigned labels, see Supplementary Table S2. In terms of false positi v e rate, the Svhip classifier with structure filter enabled reaches the le v el of dif ferentia tion achie v ed with RNAz and surpasses the nonoptimized classifiers by a notable margin, see Table 1 and Figure 8 .
Based on the false positi v e rate on both control sets, the optimized Svhip classifier shows specificity well within the lines of expectation. We note that the classifiers with no structure filter fall short in terms of specificity when compared to the other classifiers and that the higher sensitivity can thus be partially attributed to this. This as well is to Figure 8. Label assignment of classifiers, comparing RNAz (right) and an optimized Svhip classifier (middle) as well as another Svhip-trained classifier using the same training set but without structure conservation significance filter (left). The test set originally contained the 3,832 alignments of known non-coding RNAs used to benchmark RNAz , which was reduced to 3,060 alignments by removing all those with only two sequences. Negati v e sets were generated using SISSIz and rnazWindow.pl for column-wise shuffling. Both control sets were based on the nati v e ncRNA alignments. Removal of alignments with only two sequences was deemed necessary, as generation of a fair control set is less often possible for these (see Figure 3 ). The raw numbers of classifier assigned labels is summarized in Supplementary Table S2. be expected as we suggest a strong link between the selection of statistically significant training examples and overall performance. We note howe v er that, in the context of genome wide screens, a low FPR is in many cases more valuable than a high sensitivity. This is the case for two reasons: Firstly, using a sliding window approach, the average screened genome will be separated into millions of overlapping windows, leading to an expected false positive classification of thousands of windows. Considering this, e v en a reduction of FPR by one percent can be considered beneficial, as the experimental validation of each individual discovered locus is substantially more costly than the genome screen itself. Secondly, as most genomic loci are larger than one window of around 120 nt, there is a reasonable chance for at least strongly conserved subsequences of the full locus to be classified correctly, such that many candidates can be identified e v en at lower sensiti vity. For these reasons, we strongly suggest an optimization towards the lowest possib le FPR feasib le for a gi v en use case. Aside from these considerations, we also note again that the analyzed classifiers are not optimized in terms of initial input data and serve more to showcase the overall capabilities of the pipeline. Alternati v e av enues of inv estigation here include for example the specific selection of only certain types of non-coding RNAs as input, or a restriction to certain species of origin. We further note that the stagnation of raw accuracy at around 0.96 seen here indica tes tha t a certain threshold is probably reached in terms of classifying this particular data set and any further attempts at improvement risk overfitting.

Analysis of tw o-w ay classification perf ormance on Drosophila genome screen
Gi v en the above observations, we decided to use only the classifier trained with structure filter for a non-coding RNA screen on the genome of D. melanogaster . A running time analysis for an increasing number of windows shows that the Svhip classifier out-performs RNAz , see Supplementary Figure S6. We expected that the overall classification accuracy of structurally conserved non-coding RNA elements would further increase when considering proteincoding sequences as a third class. This approach, however, cannot be compared to RNAz directl y, w hich e xclusi v ely screens for conserved RNA structures in an otherwise ambiguous genomic background and does not take protein coding sequences as a separate class into account. This latter fact makes a second comparison necessary, with RNAz on the one side, the better-performing, Svhip -generated classifier on the other, and a third classifier that differentiates non-coding RNA, protein-coding and background genetic locations, which will be discussed in the following section.
We performed genome-wide screens based on Multiz alignments with D. melanogaster as r efer ence and the current annotation from FlyBase v2 as ground truth ( 37 , 38 ). Comparison of D. melanogaster non-coding RNA annotations in FlyBase and RNACentral ( 48 ), a comprehensi v e collection of about 50 data sour ces, r e v ealed a large consensus for all analyzed classes, see Supplementary Figur e S10. Ther efor e, the well maintained data of FlyBase is used throughout this contribution. When comparing total genes covered by preprocessed alignment windows with those r ecover ed by individual classifiers, we note that the Svhip classifier (described in the previous section) performs highly similar to the RNAz classifier in recovering snoRN As, miRN As and tRN As, Table 2 . This demonstra tes tha t the Svhip pipeline is well capable of isolating high-quality training instances e v en from suboptimal input data. It has to be noted, howe v er, that it is difficult to correctly determine the false positi v e rate of the classifiers, especially for the scenario of a full genome screen. This is primarily caused by the lack of gold standard data that would guarantee correctness of the used annotation, and the higher abundance of spurious alignments containing non-homologous or very dissimilar sequences thus introducing the possibility for reduced conservation signals despite there being a genuine conserved transcript encoded. The ambiguity and likely incompleteness of available annotations is our primary motivation for restricting the analysis to the well-studied cases of tRN A, snoRN A and miRN A genes. Although the two primary sub-types of snoRNAs ar e cover ed in the input equally well, C / D-box snoRNAs ar e r ecalled at a substantially lower rate than H / ACA-box snoRNAs, Table 2 . This is most likely caused by the differences in complexity of the characteristic secondary structures of both snoRNA sub-types, ( 49 ). C / D-box snoR-NAs form large loops between the conserved sequence mo- Table 2. Comparison of the gene loci recall using both RNAz and the Svhip classifier trained on data from the Rfam database. Annotated refers to the number of genes of each analyzed RNA class annotated in FlyBase. Input alignments can be dif ferentia ted by their preprocessing status: (i) raw refers to the unmodified MAF blocks and (ii) mer g ed to joined alignment blocks if sufficient consensus in the number of species exists. The latter was created using the Svhipmafmerge subprogram. The next column indicates the number of annotations actually covered by sliced alignment windows with at least 50% of their total length, i.e. those that are reasonably recov erab le using the prediction strategy. For both RNAz and Svhip the respecti v e recall, i.e. the number of r e-cover ed annotations, and the r esulting sensiti vity is listed. For a comparison between the ov erlap of true positi v e hits see Supplementary Figure ( 50 , 51 ). The detection of box C / D snoRNA genes thus cannot rely on secondary structure predictions alone, calling for specialized tools ( 52 ). While analyzing the Multiz whole genome alignments we noted that a primary issue in the screening for conservation signals is the quality and coverage of the input alignments provided. Even a hypothetical ideal classifier would fail to recover all known annotated genes if the total coverage of the provided MAF blocks for these genes is insuf ficient. To investiga te the magnitude of this problem, we analyzed the number of annotated genes theoretically covered by sliced alignment windows and discovered that in most cases less than half are actually present in the final input with sufficient cov erage, Tab le 2 . In general the more species are included in the alignment process, the more fragmented the genome-wide alignment becomes, i.e. the number of alignment blocks increases. If a minimum alignment block length is a r equir ement, which is one of the filter criteria applied in rnazWindow.pl , this becomes a serious issue on genome wide screens. In many cases these short blocks still contain valid information and can be combined into larger blocks using a quite simple approach as implement in mafmerge of the Svhip software, see Material and Methods for details. Independent of the classifier, a substantial increase in both gene coverage and total recall of ncRNA genes is observable for the merged alignments. This indica tes tha t such preprocessing adds substantial value to the analysis.
Analyzing all hits with a minimal cutoff of P > 0.5, we get 48,217 total hits with Svhip and 46,596 with RNAz , respecti v ely. Of these, 29,015 (62%) are shared between the two, suggesting a relati v ely high consensus. In general, the confidence or P -value distribution of hits looks different for both classifiers, which is presumably an artifact of underlying model ar chitectur es, as LibSVM and scikit-learn di v erged in de v elopment a substantial time ago, e v en though the la tter's SVM implementa tion is built on top of the former. Of the shared hits, 4,183 map onto existing annotations of either coding or non-coding genes, or 6,941 in the case of Svhip and 6,717 for RNAz . Considering the very similar proportions of total hits and fractions that map onto existing annotations, we come to the conclusion that the false positi v e rate of the Svhip classifier is likely at least as good as that of RNAz for this confidence le v el. Together with the observations from the previous section we conclude that our classifier is about as efficient for the purpose of Drosophila non-coding RNA genome annotation as the RNAz frame wor k.

Analysis of protein classification performance on the Drosophila genome screen
A new classifier for the dif ferentia tion of protein-coding sequences and genomic background was trained based on the data obtained from InsectBase. The entire pool of training data from the ncRNA classification experiment, i.e. both the exemplary Rfam non-coding RNA alignments and the automa tically genera ted nega ti v e set, was reused as a negati v e set. The goal of this experiment was the identification of coding regions in a full genome screen under the exclusion of e v erything else. W hile genera tion of a three-way classifier dif ferentia ting between ncRNA, coding sequences and genomic background is in principal possible with Svhip , it showed subpar results in this particular use case when compared with the usage of two individual classifiers (data not shown).
Of the 17,835 protein-coding sequences covered by the pr eprocessed alignments, 13,129 wer e r ecover ed by the Svhip protein classifier (recall 73.8%). Note that one gene can consist of multiple coding sequences and that such a region was counted as recovered if ( ≥2) consecuti v e hits occurred on the same strand within that gene's boundaries. If we only count regions as recovered when at least 50% of their length is covered by hits, we still obtain a recall of 66.1%, corresponding to 11,787 r ecover ed r egions. Obviously, these numbers are heavily dependent on the quality and quantity of the available training data and thus the resulting classifier.
Since the annotation of protein-coding genes is more matur e as compar ed to that of non-coding RNA, the estimate of the expected false positive rate (FPR) can be considered to be much more reliable in this case. Of 477,593 alignment windows not annotated as protein-coding, 54,309 (11.3%) were classified as protein-coding loci. It is of course likely that the Flybase annotation does not include all coding sequences that exist in reality, in particular taking into account that also features currently annotated as non-coding can harbor some coding potential or code for sORFS / smORFS ( 53 ). Ther efor e, assuming that the annotation also does not include any wrongly identified coding regions, we consider it reasonable to interpret this rate as an upper bound, that may be lowered when more accurate ground truth data becomes available.
Usually, protein-coding genes are substantially longer than the non-coding RNA genes discovered with this approach. In fact, there is not a single region in the Drosophila genome annotated as protein-coding that is not split over two or more consecuti v e alignment windows. This means that, in the case of protein-coding screens, isolated windows reported as hits are most likely false positi v es and can be ignored in almost all cases. A ppl ying this correction as a postprocessing step reduces the false positi v e rate to 34,156 (i.e. 7.2%) incorrectly classified windows.
In general, it is difficult to compare the approach outlined here with other prediction methods. Typically, a lot of contemporary tools for coding sequence prediction, like CPAT ( 21 ), CPC2 ( 22 ) and RNAsamba ( 54 ), use machine learning models on individual transcripts or genomic sequences instead of multiple sequence alignments. This also means that most of them do not take evolutionary conservation into account directly, making a fair comparison difficult due to se v erel y differing a pplication strategies. For a baseline estimate of reliability, we did a genome-wide screen in D. melanogaster using RNAcode , which also relies on alignment windows, and compared predictions with both the Svhip trained classifier and the Flybase annotation. Although the applied Svhip features are chosen for simplicity and speed 90% of all covered coding regions have been correctly identified, Supplementary Figure S7. As expected, the much more sophisticated model of RNAcode outperforms Svhip . When comparing unannotated loci of both approaches in total 77,181 have been identified. In 30% of all cases the tools agree while Svhip and RNAcode predict additional 29% and 40%, respecti v ely. One has to consider these loci as seeds for an in depth downstream analysis that tries to extend these seeds to complete coding sequences.

Chaining of classifiers for reduced FPR
In an attempt to further reduce the expected false positive rate of the non-coding RNA classifier, we investigated ways to combine the previously described classifiers. The goal here was to effecti v ely remov e as many coding sequences from the non-coding RNA classification pool as possible as these form the largest chunk of all annotated data. It can be shown that there is a substantial overlap between potential for structure conservation in non-coding RNA and coding RNA since the conservation of base triplets tends to indirectly conserve (possibly random) secondary struc-tur es, Supplementary Figur e S8. In this interpretation, the potential for conservation of secondary structure is highest in genuine non-coding RNAs and lowest in the genomic background, while coding sequences exhibit an intermedia te conserva tion le v el. Thus, their remov al b y a preceding classification step based on coding potential could substantially reduce the margin of errors in a subsequent, conservation-based non-coding RNA screening. In general, the non-coding RNA classifier outlined in "Analysis of two-wa y classification perf ormance on Drosophila genome screen" can be assumed to have a worst-case FPR of FPR = FP FP + TN = 12 , 542 928 , 728 = 1 . 4% (7) or, in the case of merged alignments, where FP (false positi v es) refers to non-coding RNA classification hits on windows not overlapping with a non-coding RNA annotation and TN (true negati v es) refers to hits where coding sequence or genomic background windows ar e corr ectl y identified as not being non-coding RN A. We explicitly counted here all potentially incorrect hits --and not only the high-confidence hits --as we assume that the probability of overlapping with genuine but not annotated genes is higher. If the goal is to estimate an upper bound on the FPR gi v en the classifier at hand, then all hits irrespecti v e of assigned probability should be counted. Ther e ar e two obvious options to attempt an exclusion of coding sequences from the false positi v e set during classification. One is to first make a full genome screen using the protein model from "Analysis of protein classification performance on the Drosophila genome screen", and subsequently classify the remaining alignment windows. In principal this introduces the possibility of misclassifying noncoding RNAs as coding, which then subsequently disappear from the pool of true positi v e non-coding RNA hits. During our analysis, howe v er, this effect appeared to have no relevant impact on non-coding RNA classification. Unfortunatel y, this a pproach resulted onl y in a marginal reduction of FPR, which fell to 1.3% due to 206 alignment windows being removed from the false positive set.
The second option involves retraining the classifier after also including the protein-coding training set as part of the negati v e set for non-coding RNA classifica tion, thus a ttempting to generate a classifier that more directly excludes coding sequences instead of only being based on the original negati v e set of synthetic alignments. Mar king all training examples in the protein classifier training set as background (relati v e to the class of non-coding RNA) and retraining the non-coding RNA classifier with Svhip using the same parameters, yields a drastically improved FPR on the annotated Drosophila genome: It should be noted that this classifier, as outlined in the result section "Analysis of two-way classification performance on alignments of known ncRN As", onl y uses the features of SCI, z -score of MFE, and entropy, as it was noted that the more coding-sequence specific features do not substantially add to the non-coding RNA classification performance. These featur es wer e thus excluded from the training set for this experiment. We conclude that, depending on the use case, it may be worthw hile to manuall y add known negati v e e xamples to the training set instead of relying e xclusi v ely on the artificial generation of negati v e training instances.