Prospects for a sequence-based taxonomy of influenza A virus subtypes

Abstract Hemagglutinin (HA) and neuraminidase (NA) proteins are the primary antigenic targets of influenza A virus (IAV) infections. IAV infections are generally classified into subtypes of HA and NA proteins, e.g. H3N2. Most of the known subtypes were originally defined by a lack of antibody cross-reactivity. However, genetic sequencing has played an increasingly important role in characterizing the evolving diversity of IAV. Novel subtypes have recently been described solely by their genetic sequences, and IAV infections are routinely subtyped by molecular assays, or the comparison of sequences to references. In this study, I carry out a comparative analysis of all available IAV protein sequences in the Genbank database (over 1.1 million, reduced to 272,292 unique sequences prior to phylogenetic reconstruction) to determine whether the serologically defined subtypes can be reproduced with sequence-based criteria. I show that a robust genetic taxonomy of HA and NA subtypes can be obtained using a simple clustering method, namely, by progressively partitioning the phylogeny on its longest internal branches. However, this taxonomy also requires some amendments to the current nomenclature. For example, two IAV isolates from bats previously characterized as a divergent lineage of H9N2 should be separated into their own subtype. With the exception of these small and highly divergent lineages, the phylogenies relating each of the other six genomic segments do not support partitions into major subtypes.


Introduction
Influenza A virus (IAV) genomes comprise eight single-stranded, negative sense RNA segments, which are numbered in decreasing order with respect to their lengths.The diversity of IAV is usually characterized by the two segments (numbered 4 and 6) carrying the genes encoding the hemagglutinin (HA) and neuraminidase (NA) proteins, respectively.HA and NA are surface-exposed proteins responsible for binding and release of virus particles from host cells, respectively.As significant antigenic targets of the adaptive immune system, they are the most rapidly evolving proteins of the IAV genome (Bhatt et al., 2011).The global diversities of both HA and NA sequences are clustered into serotypes or subtypes, terms which are used interchangeably in the literature; herein, I will use the term 'subtypes' for consistency.These subtypes are identified by H and N prefixes for the respective proteins, followed by an integer suffix, producing the widely familiar HnNn nomenclature, e.g.H3N2.Certain combinations of HA and NA subtypes are observed more frequently than others due to pandemic growth.Although these combinations are also referred to as subtypes, each genome segment has its own subtype nomenclature.
There are presently eighteen described HA subtypes that are numbered H1 to H18 (Table 1); a potential new subtype H19 has only recently been described (Fereidouni et al., 2023) from a partial HA sequence isolated in 2008 from waterfowl in Kazakhstan.In addition, there are eleven described subtypes of NA (N1 to N11).These subtypes were originally defined solely from their antigenic characteristics, predominantly from hemagglutination inhibition (HI, specifically for HA), complement fixation (CF), or double immunodiffusion (DID; Dowdle et al., 1974) testing.For instance, H2, initially labeled the 'Far East' or 'Asian' strain of IAV (Dunn, 1958), was distinguished from H1 by the lack of HI activity by sera from hosts infected by the other subtype (Jensen, 1957).The nomenclature for HA and NA subtypes also formerly incorporated the host organism (World Health Organization, 1971).For instance, H1, Hsw1, Heq1, and Hav1 were originally different HA subtypes isolated from human, swine, equine, and avian hosts, respectively.These were subsequently merged by Schild et al. (1980) into the current host-agnostic Hn and Nn nomenclature on the basis of the cross-reactivity of isolates from different host species.
Since the description of subtype H13 by Hinshaw et al. (1983), genetic sequence analysis has been increasingly incorporated into  Fereidouni et al. (2023) Previous subtype and prototype strain designations ('Previous') for subtypes H1 to H12 were obtained from Schild et al. (1980).Abbreviations: CF, complement fixation; P, phylogeny; AAI, amino acid sequence identity.a The authors use an immunodiffusion assay that resembles DID (described in Beard, 1970).b H19 was only recently been proposed from a single partial HA sequence.
the characterization of novel HA and NA subtypes (Table 1).Indeed, the last three HA subtypes (H17-H19) have been described solely by their genetic sequences (Tong et al., 2012;Tong et al., 2013;Fereidouni et al., 2023).High-throughput genetic sequencing has become a ubiquitous feature of clinical and public health laboratories, boosted recently by the expansion of resources to confront the global severe acute respiratory syndrome coronavirus 2 pandemic.For instance, the number of IAV sequences deposited in the Global Initiative for Sharing All Influenza Data (GISAID) (https://gisaid.org)database in a given year has grown exponentially from 943 sequences in 1992 to 307,198 in 2022 (Supplementary Figure S1).
Here, I investigate whether it is feasible to distinguish HA and NA subtypes based only on genetic diversity.This is a different problem than the supervised classification of genetic sequences to subtypes based on their similarity to a predefined set of reference sequences, commonly performed by BLAST search (Spackman, 2014).Instead, I ask whether a phylogenetic clustering method can robustly group HA and NA sequences into the antigenically defined subtypes.Furthermore, I assess whether it is feasible to apply a similar classification scheme to the other six genomic segments.There is surprisingly limited work in this area.When describing serotype H14, for instance, Kawaoka et al. (1990) proposed that HA sequences that differed at 30 per cent of amino acids or more should be separated into different subtypes.However, this proportional (p) distance criterion is not attained for all pairwise comparisons of subtypes (Röhm et al., 1996;Tong et al., 2012).This method also makes no adjustment for multiple substitutions at the same sites over long evolutionary time scales.Furthermore, the shortest p-distance between groups can depend by chance on which sequences have been incorporated into the analysis.Ideally, the phylogenetic analysis should incorporate as many genetic sequences as possible.

Methods
I downloaded all available protein sequences associated with all eight IAV segments from the National Center for Biotechnology Information (NCBI) Genbank database (accessed 6 June 2023 except for HA, which were retrieved 28 April 2023), using a minimum length filter to exclude records with incomplete nucleotide sequences (Supplementary Table S1).Although there are substantially more IAV genome sequences available in the GISAID database, I elected to use data exclusively from Genbank, which has no restrictions on redistribution of data and derived outputs, to maximize the reproducibility of this analysis.For each segment, I used regular expressions to filter sequences from other segments based on gene and protein annotations in the sequence names.In addition, I excluded sequences derived from alternate open reading frames (PB1-F2 and PA-X) and concatenated sequences involved in alternative splicing with overlapping regions excluded (M1/M2 and NS1/NS2).Next, I discarded sequences with more than 10 per cent ambiguous residues ('X') and compressed the remainder into unique sequences, recording the labels of duplicate sequences into a separate file.The remaining sequences were aligned using MAFFT (version 7.453;Katoh and Standley, 2013).I manually evaluated and amended the resulting alignments with AliView (v.2018;Larsson, 2014).Using FastTree (version 2.1.11, compiled to use double precision;Price et al., 2010), I reconstructed a preliminary tree from the alignment without maximum-likelihood or minimum-evolution steps (i.e., neighbor-joining) and then visually assessed the tree using Taxonium (Sanderson, 2022).After excluding any problematic sequences identified in the preliminary tree (e.g.recombinant sequences produced by molecular cloning; Hai et al., 2012), I reran FastTree with the default optimization steps to generate a maximum-likelihood tree.
I investigated two different methods to partition each tree into monophyletic clades (i.e.subtrees) as putative subtypes.First, I calculated the following quantities for each internal node in the tree: (1) y i , the mean tip-to-tip (patristic) distance between every tip descending from node i; (2) d i , the mean distance from node i to every descendant tip; and (3) v i , the total distance from node i to its sibling node j.
The quantities d i and y i were calculated by postorder traversal of the tree for increased efficiency.For instance, y i was calculated by this recurrence relation, where a and b are child nodes of i: where n a is the number of tips descending from node a and l a is the length of the branch associated with (below) node a. Tip nodes were initialized with n i = 1, d i = 0, and y i = 0. Hence, the y i for the internal node of a cherry reduces to l a + l b .The quantity y i is analogous to the mean pairwise distance within a subtype, while v i is analogous to the shortest distance between two subtypes.Given minimum v and maximum y cutoffs, subtrees meeting these criteria were located by evaluating nodes by preorder traversal of the tree.This method, which will be referred to as 'nodewise' clustering, was modeled on the clustering algorithms that have previously been used to partition a given IAV subtype into smaller clades (Evolution Working Group, 2008;Anderson et al., 2016).For example, Anderson et al. (2016) partitioned IAV H1 sequences into monophyletic clades with a mean within-clade distance less than 0.07 (analogous to y i ), a between-clade distance greater than 0.07 (analogous to v i ), and a node support ≥ 70%.Node support values highly correlate with the branch length; for instance, local support values for the longest internal branches in the HA tree were almost entirely 1.0 (Supplementary Figure S2).Thus, support values were unlikely to contribute additional information for nodewise clustering of subtypes and I excluded these values to simplify the method.
Second, I progressively cut the tree into subtrees on internal branches (edges) with a length exceeding a given cutoff, using the following algorithm: (1) initialize subtree list S with input tree (2) for each subtree s in S (a) locate longest internal branch b s in s (b) if length of b s exceeds cutoff (i) remove b s to yield subtrees s 1 and s 2 (ii) for each new subtree, remove the node associated with b s and join the branches (iii) append s 1 and s 2 to new list S ′ (c) otherwise append s to new list S ′ (3) replace S with S ′ (4) repeat from step 2 until no internal branches exceed cutoff.This method will be referred to as 'edgewise' clustering.Both algorithms were implemented using the Phylo toolkit in Biopython (Talevich et al., 2012).
To measure the correlation between the subtree clusters and the subtype annotations of sequences, I calculated the normalized mutual information (nMI) of these two partitions: where p i is the marginal proportion of labeled tips in subtree i, p j is the marginal proportion of tips labeled with subtype j, and p ij is the number of tips in subtree i with label j divided by the total number of labels, i.e. the joint proportion (see also Equation 14.4.17 in Press et al. (1988)).The nMI adjusts for different numbers of subsets between these two partitions, and it ranges from 0 (no correlation) to 1 (for identical partitions).

Results
Maximum-likelihood trees for n = 66, 864 HA and n = 52, 146 NA amino acid sequences are displayed in Figure 1 with branches colored by subtype annotations, which were extracted from NCBI Genbank record metadata (i.e.source qualifiers).As anticipated, the subtrees labeled by subtype annotations are distinctly separated by long internal branches in both trees, implying that these subtypes should be readily distinguished by basic sequence-based criteria.The overall scale of the NA tree is substantially greater than the HA, driven by the enormous divergence of N10 and N11 from the other subtypes (Figure 1).Based on their discordant locations in the tree, I identified 15 HA and 30 NA sequences with incorrect subtype annotations (Supplementary Tables S2 and S3).
In addition, I classified 2,353 HA and 2,563 NA sequences with missing subtype annotations (e.g.'mixed', 'HX', or 'unknown') by locating their nearest neighbors in the respective phylogenies.These sequences were substantially less likely to be assigned to the most common subtypes (Supplementary Figure S3).These inferred subtype labels were excluded from further analyses to mitigate potential biases.I evaluated two different methods to partition a phylogeny into subtrees that are consistent with the antigenically defined IAV HA and NA subtypes.First, I assessed a nodewise method in which I calculated the mean tip-to-tip distance y and total branch length to the sibling node v for subtrees descending from every internal node.These quantities are analogous to the within-and betweengroup measures of genetic divergence that have been used to define clusters within IAV subtypes, i.e. clades (WHO/OIE/FAO, 2012; Anderson et al., 2016).Both v and y were required to partition the tree de novo into subtrees.For instance, using only a minimum cutoff on v tends to prematurely terminate a search by preorder traversal of nodes, i.e. yielding two large subtrees that descend directly from the root node.
I evaluated the discordance between subtypes extracted from the sequence metadata (herein referred to as 'labels') and the phylogenetic subtrees produced at varying cutoffs of y and v (Supplementary Figure S4A-C).Overall, the discordance in the number of subtrees was minimized at a tip-to-tip distance cutoff of about y = 1.2.The results were less sensitive to varying cutoffs for v.However, the correspondence between subtrees and labels at the best settings was generally poor.Sequences labeled as subtype H9 were consistently partitioned into multiple subtrees.In addition, several groups of subtype labels become merged into single subtrees, namely H2/H5, H3/H4/H14, H7/H10, H8/H12, and H11/H13/H16 (Supplementary Figure S4D).These discrepancies could not be resolved by varying the clustering settings for this method.
Since the results of nodewise clustering on the HA phylogeny were so poor, I abandoned this method and implemented a simpler edgewise method.This method partitions the phylogeny by progressively cutting internal branches with lengths exceeding some cutoff.The number of subtrees obtained from the HA phylogeny under varying cutoffs and their correlation with subtype labels, as measured by nMI, are summarized in Figure 2a.Overall, the correlation between subtrees and subtype labels was high across a range of cutoffs, with nMI exceeding 0.99 at cutoffs from 0.12 to 0.2675, where nMI = 1 indicates perfect correlation.On the other hand, the expected number of 18 subtrees is attained only within a narrow range of cutoffs around 0.18.At this cutoff, the partition of subtrees was highly concordant with subtype labels, as summarized in Figure 2b, with the exception of two discrepancies.First, one subtree ('s8') comprised only two sequences corresponding to IAV isolated from bats resembling H9 (Kandeil et al., 2019;Rademan et al., 2023), which predominantly circulates in avian host species.This subtree does not merge with the adjacent subtree ('s7') carrying the other 8,197 H9 labels until the cutoff is relaxed to 0.27, at which there is a sudden drop in the number of subtrees from seventeen to fourteen (Figure 2a).Second, the nine HA sequences labeled as H15 are grouped with H7 at this cutoff.They are split when the cutoff is decreased from 0.18 to 0.1775.If the cutoff is increased from 0.18 to 0.1825, then subtrees labeled H13 and H16 become merged into a single subtree.More importantly, the number of subtrees does not decrease from 17 until the cutoff is further increased to 0.27.This implies that an optimal cutoff may be selected by locating the longest interval between change points in the number of subtrees and taking the midpoint of that interval as the cutoff value.In the case of Figure 2a, this would select the interval between 0.1825 and 0.2675 with a midpoint of 0.225.
Next, I applied the edgewise method to a maximum-likelihood phylogeny of 52, 146 NA sequences.This method yielded twelve subtrees for a broad range of cutoffs (from 0.222 to 0.4 expected substitutions, Figure 3a).Similar to the case with the HA phylogeny, the target number of 11 subtrees could only be obtained within a narrow range of cutoffs (from 0.405 to 0.425).At a cutoff of 0.41, I observed the same two major discrepancies among the resulting 11 subtrees (Figure 3b).Two sequences labeled as N2 were separated from the subtree carrying the bulk of the remaining N2 labels (n = 23, 803).Not surprisingly, these corresponded to the same divergent H9N2-like viruses isolated from bats as noted earlier (Kandeil et al., 2019;Rademan et al., 2023).In addition, 394 sequences labeled as N5 are placed in the same subtree as N8 unless the cutoff is reduced below 0.405.
Finally, I repeated this same process for the remaining IAV genome segments.To increase the amount of genetic variation available for reconstructing trees, I concatenated the nonoverlapping regions of the M1 and M2 proteins encoded on Segment 8, and the NS1 and NS2 proteins encoded on Segment 8. Unlike the HA and NA phylogenies, there was limited support for partitioning these other phylogenies into distinct subtrees.Using the results in Figures 2a and 3a as a guide, I selected cutoffs at which the number of subtrees was relatively stable for each of these other segments (Supplementary Figure S5).These cutoffs tended to be around 0.05, yielding four to six subtrees, with the exception of the NS1/NS2 phylogeny where five subtrees were obtained at cutoffs ranging from 0.12 to 0.175.This difference in scale is consistent with previous findings that diversifying selection at the amino acid level for NS1 is comparable to the surface-exposed HA and NA proteins (Chen and Holmes, 2006).In every case, nearly all sequences were assigned to only one of the four to six subtrees (Supplementary Table S4).Two of the remaining small subtrees corresponded to the recently described H17N10 (Tong et al., 2012) and H18N11 (Tong et al., 2013) strains, respectively.Additionally, the divergent lineage of H9N2 that was consistently separated into its own HA and NA subtrees was also separated for all other segments except PB1.For all phylogenies except NP, one of the small subtrees corresponded to isolates of equine influenza (H7N7) from 1956.Finally, the phylogeny relating concatenated M1 and M2 protein sequences contained an additional subtree comprising three reassortant sequences isolated from swine in Australia (Wong et al., 2018).S2).

Discussion
These results indicate that it is feasible to define subtypes on the basis of the variation among HA and NA protein sequences.Subtypes derived from phylogenies are largely consistent with the current definitions of eighteen HA and eleven NA subtypes, which are mostly derived from serological data.However, there are some important discrepancies.The analysis presented here supports seventeen genetically-defined HA subtypes, not eighteen.Two subtypes are removed by merging H15 with H7 and combining H13 and H16.This is consistent with previous work reporting average amino acid identities of 79.7 and 81.4 per cent between these respective groups, where the typical mean identity was about 49 per cent (Tong et al., 2012).In addition, it supports the creation of a new subtype from a divergent lineage of H9N2-like isolates from bats.This lineage not only also creates an additional twelfth subtype for NA but also forms distinct subtrees for nearly every other segment.The two sequences comprising this lineage (Genbank accessions MH376902 and OQ216561) were isolated from bats in Egypt in 2017 (Kandeil et al., 2019) and South Africa in 2018 (Rademan et al., 2023).Kandeil et al. (2019) reported the HA sequence was the most similar (73 per cent nucleotide identity) to H9 sequences that circulate predominantly in avian host species.Even though this divergence was comparable to that between other HA subtypes, neither study made a case for defining a new subtype.
Reconstructing a phylogeny relating tens of thousands of sequences is inherently uncertain given a finite amount of information, i.e. alignment length.As a result, the partition of IAV sequences into subtypes may depend on the methods used to reconstruct the tree.The stochastic nature of maximumlikelihood optimization can also induce some uncertainty in the topology and branch lengths of the resulting tree.However, the edgewise clustering method described in this paper should be relatively robust to this variation because it operates on the longest internal branches of the tree.To illustrate, I re-ran the HA and NA sequence alignments using the Le-Gascuel (LG) amino acid substitution model instead of the default Jones-Taylor-Thornton (JTT) model in Fasttree and also re-ran Fasttree under default settings for each dataset.The resulting numbers of subtrees are summarized in Supplementary Figure S6.Overall, the number of subtrees as a function of minimum branch length for the ML tree under the LG model is almost identical to the tree under the JTT model for HA sequences.We also observed similar trends for trees relating NA sequences, with the exception that the LG model tended to produce longer internal branch lengths, expanding the range of lengths for which edgewise clustering yielded eleven subtrees.The conservative approach would be to require a sequence-based taxonomy to specify the substitution model for reconstructing the tree.This is not the first attempt to cluster the genomic segments of IAV into subtypes.For instance, Lu et al. (2007) performed a similar clustering analysis of roughly all segments for 2,300 complete IAV genomes.For each segment, they used neighbor-joining to reconstruct trees from the HKY85 distance matrix for the multiple sequence alignment and then extracted clusters based on a bootstrap support threshold of >90 per cent and a nucleotide p-distance threshold of about 10 per cent.Their analysis partitioned the HA sequence phylogeny into 78 subtrees, for example.Unfortunately, the web interface for classifying IAV sequences under this system (https://www.flugenome.org) was discontinued in 2016, and the expired domain name is presently being used by a laboratory reagents supplier under the guise of the original website (last accessed 30 June 2023).In addition, the classifier developed by Lu et al. (2007) was not made publicly available, and it was not described with sufficient detail to reproduce.
Similarly, Zhuang et al. (2019) used neighbor joining to reconstruct phylogenies relating between 4,194 to 33,066 sequences for each of the eight IAV segments.They partitioned each tree into clusters based on the presence of a 'relatively separated branch', as well as the distribution of host species, sampling location, and  S3).collection dates among tips.For example, they described three major clusters in the PB2 phylogeny.Unfortunately, their clustering method was not described with sufficient quantitative detail to reproduce, and the study data (e.g.annotated trees) are not available online.Finally, there are multiple studies that have clustered sequences for all IAV genome segments for the purpose of detecting reassortment events (Gong et al., 2021;Nelson et al., 2008;Holmes et al., 2005).For example, Gong et al. (2021) partitioned the maximum-likelihood trees for 40,296 full IAV genomes into clusters based on the mean tip-to-tip distances associated with internal nodes, i.e. the y statistic in this study.Using the overall mean of y as a cutoff for each segment, they obtained 493-663 clusters per phylogeny.Putative reassortment events were identified from genomes with novel combinations of clusters among segments.Although this fine-grained clustering may confer greater sensitivity for detecting reassortment, it does not provide insight into the genetic characteristics of established IAV subtypes.
The process of updating the nomenclature of IAV subtypes has been complicated by the transition from serological testing to genetic sequencing to characterize novel variants (Table 1).Given the serological origins of most IAV subtype definitions, it is remarkable that these groupings can be readily recovered from genetic variation alone, although not all methods are equally successful.There are significant advantages of developing a genetic taxonomy for IAV subtypes.The number of published IAV sequences is growing exponentially (Supplementary Figure S1).IAV infections are increasingly subtyped by genetic sequencing methods that are more scaleable than serological testing.At the current pace, we can expect nearly half a million new IAV sequences to be deposited in GISAID per year.We can also expect an increasing number of these sequences to be derived from environmental and metagenomic samples (Greninger, 2018), where intact virus particles would not be available for serological characterization.Expanded sampling of non-human, non-domesticated host species with these technologies will increase our chances of encountering more novel and highly divergent lineages (Tong et al., 2012;Tong et al., 2013;Fereidouni et al., 2023).Finally, a major advantage of a genetic taxonomy is that sequence-based criteria for defining subtypes can be transparent and reproducible if the clustering methods are described sufficiently well or better yet, made freely available online.

Figure 1 .
Figure 1.Maximum-likelihood trees reconstructed from unique amino acid sequences of HA (n = 66, 864) and NA (n = 52, 146).The trees are midpoint-rooted and ladderized.Branches are colored by respective 'serotype' annotations in the NCBI Genbank database.Scale bars indicate the branch length corresponding to 0.1 expected amino acid substitutions per site.These plots were generated using the R package ggfree (https://github.com/ArtPoon/ggfree).

Figure 2 .
Figure 2. (a) A solid line indicates the number of subtrees produced by partitioning the phylogeny of HA sequences at varying branch length cutoffs (edgewise clustering).A dashed horizontal line is drawn at n = 18 to indicate the target number of subtrees.Points indicate the nMI (measuring the correlation between subtree and HA subtype labels) at varying cutoffs.The maximum nMI at 99.86 per cent is indicated by a dashed line.(b) Distribution of HA subtype labels among subtrees produced at an internal branch length cutoff of 0.18.Boxes are shaded in proportion to relative counts of labels.Total counts are listed along the right side.Individual counts are printed in the respective boxes for low-frequency labels, most of which correspond to misclassified sequences (see Supplementary TableS2).

Figure 3 .
Figure 3. (a) A solid line indicates the number of subtrees produced by partitioning the phylogeny of NA sequences at varying branch length cutoffs (edgewise clustering).A dashed horizontal line is drawn at n = 11 to indicate the target number of subtrees.Points indicate the nMI (measuring the correlation between subtree and NA subtype labels) at varying cutoffs.The maximum nMI at 99.46 per cent is indicated by a dashed line.(b) Distribution of NA subtype labels among subtrees produced at an internal branch length cutoff of 0.41.Boxes are shaded in proportion to relative counts of labels.Total counts are listed along the right side.Individual counts are printed in the respective boxes for minority serotypes.The smaller counts generally correspond to misclassified sequences (see Supplementary TableS3).

Table 1 .
Summary of the currently known HA subtypes and the methods used to describe them.