uORF4u: a tool for annotation of conserved upstream open reading frames

Abstract
Summary: Upstream open reading frames (uORFs, often encoding so-called leader peptides) can regulate translation and transcription of downstream main ORFs (mORFs) in prokaryotes and eukaryotes. However, annotation of novel functional uORFs is challenging because they are short, usually <100 codons. While transcription- and translation-level next-generation sequencing methods can be used for genome-wide functional uORF identification, these data are not available for the vast majority of species with sequenced genomes. At the same time, the exponentially increasing number of genome assemblies gives us the opportunity to take advantage of evolutionary conservation in our predictions of functional ORFs. Here, we present a tool for conserved uORF annotation in 5ʹ upstream sequences of a user-defined protein of interest or a set of protein homologs. It can also be used to find small conserved ORFs within a set of nucleotide sequences. The output includes publication-quality figures with multiple sequence alignments, sequence logos, and locus annotation of the predicted conserved uORFs in graphical vector format.
Availability and implementation: uORF4u is written in Python3 and runs on Linux and MacOS. The command-line interface covers most practical use cases, while the provided Python API allows usage within a Python program and additional customization. Source code is available from the GitHub page: github.com/GCA-VH-lab/uorf4u. Detailed documentation that includes an example-driven guide is available at the software home page: gca-vh-lab.github.io/uorf4u. A web version of uORF4u is available at server.atkinson-lab.com/uorf4u.


Introduction
Functional upstream open reading frames (uORFs, encoding what are often referred to as leader peptides) regulate expression of downstream genes via translational and/or transcriptional attenuation in bacteria, archaea, eukaryotes, and viruses (Ito and Chiba 2013; Dever et al. 2020). Although the structures of mRNA transcripts and the mechanisms of translation initiation differ among the domains of life, the concept of uORFs as evolutionarily conserved cis-acting regulatory elements is universal (Dever et al. 2020). Regulatory uORFs typically act via condition-specific ribosome stalling on nascent peptides. However, some uORFs can be translated into functional proteins (Brown et al. 2017; Chen et al. 2020; Jayaram et al. 2021). Prediction and annotation of potentially functional uORFs is essential for understanding complex regulation mechanisms, including inducible expression of antibiotic resistance genes upon an antibiotic challenge (Ramu et al. 2009; Ito and Chiba 2013). However, the prediction of functional uORFs is complicated by their properties, such as short length, variable distance to the main ORF (mORF), and unusual sequence composition. A breakthrough in genome-wide annotation of translated regions including uORFs came with the single-nucleotide resolution sequencing method ribosome profiling (Ribo-Seq) (Ingolia et al. 2009; Brar and Weissman 2015).
Several bioinformatics tools have been created for annotation of uORFs based on Ribo-Seq data, for example uORF-Tools (Scholz et al. 2019) and uORF-seqr (Spealman et al. 2021). However, Ribo-Seq data are available only for a limited number of organisms, usually model organisms, and annotation of short uORFs is often complicated by noise in phased signal tracks that arises from the stochastic nature and specific cutting preferences of RNases (Gerashchenko and Gladyshev 2017). Another limitation of the Ribo-Seq approach to annotation is that translation of uORFs may only be induced under certain environmental or cellular conditions; for example, translation of some non-AUG uORFs is only detectable under stress conditions.
In the absence of ribosome profiling data, researchers often annotate potentially functional uORFs in a manual or semi-manual way. This involves retrieval of the 5ʹ upstream regions of protein-coding genes of interest from sequence databases, and visual inspection of the region, with or without the aid of a sequence alignment to indicate functional conservation. Tell-tale signatures of potentially functional uORFs are the presence of Shine-Dalgarno elements (in the case of prokaryotes), along with start and stop codons in the right context and frame (Sakiyama et al. 2021; Mangano et al. 2022; Takada et al. 2022). However, inspection of large sequence alignments by eye is a time-consuming and tedious task. Thus, various methods have been developed to automate ORF annotation, even in the absence of expression data. The tool sORF finder (Hanada et al. 2010) takes advantage of nucleotide composition bias and can predict small eukaryotic ORFs of high coding potential within a length range of 10-100 amino acids. MiPepid (Zhu and Gribskov 2019) and csORFfinder (Zhang et al. 2022) are machine learning-based approaches trained on a limited set of eukaryotic organisms for prediction of micropeptides translated from small ORFs. We found only one tool, uPEPperoni (Skarshewski et al. 2014), that implemented conservation analysis for prediction, and it is no longer available. Other methods that take conservation into account in functional uORF prediction are not distributed as tools (McGillivray et al. 2018; Spealman et al. 2018; Liu et al. 2023). Importantly, all these methods are designed for use with eukaryotic genomes.
Thus, there is currently no simple tool for functional uORF prediction in both prokaryotes and eukaryotes that leverages sequence conservation. To fill this gap, we set out to build a tool with the following key properties: 1) Ease of installation and use, with a command-line interface and a Python API for further customization. 2) No requirement to build or download large databases. The tool uses the NCBI API to access the RefSeq database (O'Leary et al. 2016) and is therefore always up-to-date. 3) Support for various input formats: a user-defined protein as the mORF, a set of mORF homologs, or nucleotide sequences in FASTA format. 4) Thorough documentation, with a home page that contains an example-driven guide and a detailed API description. 5) Output that contains publication-ready and editable vector graphics. 6) Applicability to sequences across the tree of life (bacteria, archaea, eukaryotes, and viruses).

The uORF4u workflow
The architecture of the uORF4u workflow depends on the user input (Fig. 1A). If the input is a single RefSeq protein accession number, uORF4u performs a BlastP search (Camacho et al. 2009) against the online version of the RefSeq protein database (O'Leary et al. 2016). The retrieved list of homologs is saved to be used in the subsequent steps. Alternatively, a list of homologs previously curated by the user can be used as input. This allows the user to decide the breadth and depth of the search; uORFs may differ in their conservation levels across strains and species, and therefore it might be necessary to test different input sets. Using the accession list, uORF4u retrieves the corresponding upstream sequences using the NCBI API as implemented in Biopython (Cock et al. 2009). For eukaryotes, the upstream region is the complete 5ʹ UTR sequence of the transcript, and for non-eukaryotic microbes, the upstream region is a sequence of user-defined length (default 500 nucleotides) upstream of the mORF start codon. The retrieved nucleotide sequences are saved as intermediate output in FASTA format. These sequences, as well as the list of homologs obtained in the previous step, can also be used as optional input for uORF4u in order to skip the previous steps.
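To make the definition of the prokaryotic upstream region concrete, the following is a minimal sketch of how a fixed-length window upstream of an mORF start codon could be cut from a genomic sequence on either strand. It is not the uORF4u implementation: the function names, the 0-based end-exclusive coordinate convention, and the truncation at sequence ends (circular replicons are ignored) are assumptions made for illustration only.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def upstream_region(genome: str, cds_start: int, cds_end: int,
                    strand: int, length: int = 500) -> str:
    """Return up to `length` nucleotides located 5' of a coding sequence.

    Hypothetical helper: coordinates are 0-based and end-exclusive on the
    forward strand; `strand` is 1 (forward) or -1 (reverse). The window is
    truncated at the ends of the sequence rather than wrapped around.
    """
    if strand == 1:
        begin = max(0, cds_start - length)
        return genome[begin:cds_start]
    if strand == -1:
        # On the reverse strand the region 5' of the gene lies after cds_end
        # in forward coordinates and must be reverse-complemented.
        end = min(len(genome), cds_end + length)
        return reverse_complement(genome[cds_end:end])
    raise ValueError("strand must be 1 or -1")


if __name__ == "__main__":
    toy_genome = "AAACCCATGGGGTTT"
    print(upstream_region(toy_genome, 6, 12, 1, length=3))
    print(upstream_region(toy_genome, 0, 9, -1, length=4))
```

The returned sequence is always written 5ʹ to 3ʹ relative to the gene, so downstream ORF scanning only needs to consider the forward orientation of the extracted region.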
Note that when nucleotide sequences are used as input, uORF4u can serve as a general conserved ORF search tool, that is, it can find ORFs that are not necessarily upstream of any particular mORF. The next step after sequence retrieval is ORF annotation. An ORF is defined as a region between a start codon (alternative start codons can be included as well) and a downstream in-frame stop codon. The minimal length set by default is nine nucleotides (three codons). For prokaryotes, this step also includes a Shine-Dalgarno (SD) sequence search within a 20-nucleotide window upstream of the start codon. SD sequence annotation is based on the calculation of the Gibbs free energy of the SD-antiSD interaction (Yang et al. 2016). For identified potential frames, the tool searches for conserved ORFs using a greedy algorithm: uORF4u iterates through sequences and tries to maximize the sum of pairwise alignment scores between uORFs. The detailed scheme of the algorithm is available at the GitHub and server home pages. The last step in our workflow is the generation of multiple sequence alignments (MSAs) of the identified conserved uORFs, the writing of reports, and the creation of result visualization files (annotation plots, sequence logos, and MSAs). To do this, we have made our own MSA visualization package, MSA4u, which is bundled with uORF4u (github.com/GCA-VH-lab/msa4u). Examples of output plots are shown in Fig. 1C-E, G, and H.

Implementation
uORF4u is written in Python3 and uses multiple Python libraries: Biopython (Cock et al. 2009), configs, argparse, pandas, statistics, Logomaker (Tareen and Kinney 2020), matplotlib (Hunter 2007), reportlab, and msa4u.
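As an illustration of the ORF definition used in the workflow (a start codon followed by a downstream in-frame stop codon, with a default minimal length of nine nucleotides), the sketch below scans the forward strand of a nucleotide sequence. It is a simplified stand-in rather than the uORF4u code: the function name and return format are hypothetical, the standard stop codons TAA, TAG, and TGA are assumed, the length is assumed to be counted from the start codon through the stop codon, and no SD search is performed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def find_orfs(seq, start_codons=("ATG",), min_length=9):
    """Return (start, end, sequence) tuples for forward-strand ORFs.

    Hypothetical helper: coordinates are 0-based and end-exclusive, and the
    reported sequence includes the stop codon. Every start codon is reported
    separately, so nested ORFs that share a stop codon all appear.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] not in start_codons:
            continue
        # Walk codon by codon in the frame set by the start codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if end - start >= min_length:
                    orfs.append((start, end, seq[start:end]))
                break  # the first in-frame stop codon terminates the ORF
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
    print(find_orfs("GTGCCCTAA", start_codons=("ATG", "GTG")))
```

Passing additional codons through `start_codons` mirrors the option of including alternative start codons; candidates without an in-frame stop codon inside the region are simply not reported in this sketch.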
The uORF4u Python package is available from PyPI (python3 -m pip install uorf4u), and the source code is provided on the GitHub page (github.com/GCA-VH-lab/uorf4u). Detailed documentation, with an installation guide and an example-driven manual, is available at the uORF4u home page (gca-vh-lab.github.io/uorf4u). Additionally, a web version of uORF4u is available at server.atkinson-lab.com/uorf4u.
The command-line interface allows users to run the tool in various standard usage scenarios without any user-side scripting. Furthermore, we provide a Python API that allows additional customization.
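The greedy conservation search described in the workflow section can be pictured with the toy sketch below. It is only a schematic under stated assumptions and does not reproduce the algorithm documented on the uORF4u home page: the function names are hypothetical, candidate uORFs are represented as plain peptide strings, and the similarity ratio from Python's difflib is used as a stand-in for a real pairwise alignment score.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in for a pairwise alignment score (assumption: difflib ratio)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_conserved_set(candidates_per_sequence, seed_index=0):
    """Greedily pick one candidate ORF per sequence.

    Hypothetical helper: starting from a seed ORF taken from the first
    sequence, each following sequence contributes the candidate whose summed
    similarity to all ORFs chosen so far is highest. Sequences without
    candidates are skipped. Returns the chosen ORFs and the accumulated score.
    """
    first, *rest = candidates_per_sequence
    chosen = [first[seed_index]]
    total = 0.0
    for candidates in rest:
        if not candidates:
            continue
        scored = [(sum(similarity(c, s) for s in chosen), c) for c in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        chosen.append(best)
        total += best_score
    return chosen, total


if __name__ == "__main__":
    toy_candidates = [
        ["MKLLPS", "MAAAA"],  # candidate peptides from upstream sequence 1
        ["MGT", "MKLIPS"],    # upstream sequence 2
        ["MKLLPT", "MQQ"],    # upstream sequence 3
    ]
    selected, score = greedy_conserved_set(toy_candidates)
    print(selected, round(score, 3))
```

In a fuller version one would repeat the search with each candidate of the first sequence as the seed and keep the highest-scoring set; a greedy search of this kind is fast but is not guaranteed to find the globally optimal combination.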

Conclusion
The problem of novel functional uORF annotation requires specialized tools. Here, we present uORF4u, which performs database parsing, uORF searching, and conservation analysis, and produces publication-quality images of the results. The utility of uORF4u has been demonstrated with the discovery of uORFs that have been validated to regulate expression of ABCF antibiotic resistance genes (Obana et al. 2023), and the rediscovery of known functional uORFs presented in Fig. 1B-G. We believe that, as well as identifying potentially functional uORFs of specific mORFs in targeted analyses that include experimental validation, our tool paves the way for systematic analysis of uORF genesis, conservation, and distribution on the scale of whole proteomes.

Figure 1. [...] shown in yellow) is inducible by erythromycin. The arrest alters the regional mRNA structure, exposing the ermC SD sequence and allowing translation of the ermC mORF (Vazquez-Laslop et al. 2008). (C) Annotation plot of the upstream sequences with the conserved ermCL uORF. ermCL is shown in yellow, the 5ʹ end of the ermC mORF is shown with a green outline, grey outlines indicate other putative ORFs around this locus, and the blue outline shows ORFs annotated in the RefSeq database. In this case, three of the eight were already annotated, and five additional ORFs were annotated by uORF4u (black outline). uORF4u does not use RefSeq ORF annotations in its predictions, and its ability to rediscover both known and missing uORFs validates our strategy. (D, E) Multiple sequence alignment and sequence logo visualization of the identified ermCL ORFs. (F-G) Results example using twelve eukaryotic ATF4 proteins as a query (command: uorf4u -hl NP_001666.2 XP_036720744.1 XP_024434925.1 XP_034632036.1 XP_008703764.1 XP_034983127.1 XP_019400505.1 XP_003989324.2 XP_003419800.1 XP_019302483.1 XP_047407736.1 XP_032062344.1 -c eukaryotes). As with ermC, the list was built using the extended run results.
(F) A eukaryotic uORF example: the expression of ATF4 (activating transcription factor) is regulated by two uORFs. After translation of uORF1, ribosomes are normally able to reinitiate translation at the downstream uORF2 after rebinding the initiating ternary complex (eIF2-GTP-Met-tRNA). Reduced levels of the ternary complex during stress conditions lead to the ribosome scanning through the uORF2 start codon and instead reinitiating at the ATF4 mORF (Vattem and Wek 2004). (H) Annotation plot of the 5ʹ UTRs with both conserved uORFs shown in yellow with black outline. (G) Multiple sequence alignment and sequence logo visualization of the identified uORF1.