Sequencing the cap-snatching repertoire of H1N1 influenza provides insight into the mechanism of viral transcription initiation

The influenza polymerase cleaves host RNAs ∼10–13 nucleotides downstream of their 5′ ends and uses this capped fragment to prime viral mRNA synthesis. To better understand this process of cap snatching, we used high-throughput sequencing to determine the 5′ ends of A/WSN/33 (H1N1) influenza mRNAs. The sequences provided clear evidence for nascent-chain realignment during transcription initiation and revealed a strong influence of the viral template on the frequency of realignment. After accounting for the extra nucleotides inserted through realignment, analysis of the capped fragments indicated that the different viral mRNAs were each prepended with a common set of sequences and that the polymerase often cleaved host RNAs after a purine and often primed transcription on a single base pair to either the terminal or penultimate residue of the viral template. We also developed a bioinformatic approach to identify the targeted host transcripts despite limited information content within snatched fragments and found that small nuclear RNAs and small nucleolar RNAs contributed the most abundant capped leaders. These results provide insight into the mechanism of viral transcription initiation and reveal the diversity of the cap-snatched repertoire, showing that noncoding transcripts as well as mRNAs are used to make influenza mRNAs.


Oligonucleotides
Oligonucleotides were ordered from IDT and are summarized in Supplemental Table 2. Oligonucleotides with more than 60 bases were gel-purified prior to use.

Data Analysis
Preprocessing Randomized nucleotides in the TSO and all subsequent contiguous Gs were used as unique molecular identifiers to collapse PCR duplicates (1). Cutadapt was used to trim sequences that perfectly matched the complement of positions 0-9 in the vRNA (2, 3) (NS1, HA, NP, NA: GCAAAAGCAG; MP, PA, PB1, PB2: GCGAAAGCAG). Reads without a perfect match and reads containing Ns within the remaining sequence were not considered, as were reads in which the remaining heterogeneous sequence was <9 or >15 nt. For Figure 1C, Figure 3A and B, and Figure 4A, the length restriction of <9 or >15 nt was not imposed. To account for prime-andrealign, any sequences with ≥2 3′-terminal nucleotides matching the 5′ end of the complement of positions 0-9 in the vRNA were iteratively trimmed up to four times from heterogeneous sequences. The resulting reads were designated trimmed host leaders. Trimmed host leaders < 9 nt were again discarded (except in Figure 4A). The number of read counts and nucleotides trimmed at the 5′ and 3′ end were retained in the FASTQ header as meta-information (4).

Mapping
Annotated TSSs were obtained from Gencode 17 (5) and iteratively trimmed at each 5′ terminus until the first nucleotide was not a G. A 51 nt window around each adjusted TSS was extracted using BEDTools and converted to a Bowtie index (6,7). Cellular fragments were mapped to the custom Bowtie index, requiring a perfect, sense match (Bowtie parameters: -S -norc -p 24 -alltryhard -v 0), converted to hg19 human genome coordinates in BAM format, and intersected with the Gencode annotations using BEDTools (6,8,9). If a read mapped to multiple paralogs, one paralog was chosen at random and the rest were removed from the dataset by collapsing on the Gencode 17 gene_name attribute. Reads mapping to multiple snRNA/snoRNA paralogs were also removed, accounting for somewhat variable gene_name attributes sometimes assigned to these paralogs.
To determine the background mapping of short sequences to TSS windows, we generated shuffled versions of trimmed host leaders, requiring that 1) the shuffled fragment does not start with a G, 2) the CG dinucleotide frequency is the same as the original sequence, and 3) the shuffled sequence is not the same as the original sequence. If all shuffled permutations of the original leader did not meet these criteria, the original leader was not considered.

Prime and realign
To efficiently find reads that are 3′ extensions of other reads in an unbiased manner, we developed an algorithm termed a "sequence tree." Consider a sequence s consisting of To insert a sequence into the tree, start at the base node. Find the child node i such that nucleotide u = s[0]. If it does not exist, create it. If the length of s = 1, set the count c of node i to c s . Otherwise, recurse on this node using sequence s [1]…s [n]. Repeat for all sequences in dataset. To find the greatest common factor (GCF) nodes, examine the base node. For each child i, if it has an associated count c, it is a GCF. Otherwise, recurse on the children. To determine extensions of GCF nodes, find all nodes with counts beneath the GCF nodes. The program is available in the prime_and_realign.py file in the code repository.

Nucleotide composition
The nucleotide composition of cellular fragments with respect to their 3′ ends was determined by creating a position weight matrix (PWM) of nucleotide frequencies from the [-8, -1] interval and assessing the signal at each specific position and normalizing this signal to the overall nucleotide composition of the 51-nt Gencode 17 TSS windows. The interval [-8, -1] was chosen because positions [-15, -9] contain a high proportion of purines due to the nucleotide composition at Pol II TSSs and because some fragments are not that long.
The nucleotide composition downstream of the cellular fragments was inferred from mapping to annotated TSSs as described above and considering only those trimmed host leaders that mapped precisely to a TSS and had 10 or more nucleotides. Sequences were weighted in proportion to the ranked read count of the corresponding leaders and normalized to the overall nucleotide composition of TSS windows as before. Dinucleotide frequencies at positions -1 and 0 were inferred from the same set of mapped host leaders, imposing the same weighting but normalizing to the dinucleotide frequencies of the TSS windows rather than to the overall nucleotide composition.

Correlations
Trimmed host leaders with >6 reads in both of the two datasets being compared were considered. Correlations were visualized in tandem with histograms and kernel density estimates using default parameters from the seaborn package (http://stanford.edu/~mwaskom/software/seaborn/). Figure 1. Systematic analysis of nucleotides potentially added through the primeand-realign mechanism. Heterogeneous sequences were inserted into a sequence tree (Supplemental Methods), and sequences that were 3′-extensions of "greatest common factor" sequences in the dataset were determined. Extensions were grouped by the number of nucleotides in the extension (columns) and sorted in decreasing order by frequency in which that extension was observed in the dataset (rows, colored according to the key). Figure 2. Number of reads corresponding to the same host leaders from different influenza mRNAs, after trimming prime-and-realigned nucleotides, as in Figure 4C.  Figure 4C. Figure 6. Abundances of trimmed host leaders found using the CIP-TAP method compared to those found using the template-switching method. Sequencing statistics and highlighting of snRNA sequences are as in Figure 4C. Figure 7. Literature values of snRNA/snoRNA abundances (11) compared to abundances of host leaders corresponding to these snRNAs/snoRNAs in the NS1 templateswitching dataset. Figure 8. Mapping trimmed host leaders to annotated TSSs, showing where actual host leaders (original) and control host leaders (shuffled) map relative to the adjusted annotated TSS. The increased signal for the in the controls at position 0 reflects the requirement that both the adjusted TSSs and the controls lack a 5′-terminal G. Sequences <10 nt were not considered. Using lower or higher cutoffs for sequence length reduced the signal-to-background ratio (calculated as original/shuffled).