Long read assembly of the pandemic strain of Austropuccinia psidii (myrtle rust) reveals an unusually large (gigabase sized) and repetitive fungal genome

Austropuccinia psidii, originating in South America, is a globally invasive plant pathogen causing rust disease on Myrtaceae. Several biotypes are recognized with the most widely distributed pandemic strain spreading throughout the Asia-Pacific and Oceania regions within the last decade. Austropuccinia psidii has a broad host range (currently 480 myrtaceous species), making it a particularly dangerous plant pathogen. In the nine years since the pandemic biotype was first found in Australia in 2010, the pathogen has caused the near extinction of at least three species, the decline of at least one keystone species, and negatively affected commercial production of several Myrtaceae. To enable molecular and genomic studies into A. psidii pathogenicity, we assembled a highly contiguous genome for the pandemic biotype based on PacBio sequence data and scaffolding with Hi-C technology. With an estimated haploid genome size of just over 1 Gbp, it is the largest assembled fungal genome to date. We found the A. psidii genome to have a lower GC content (33.8 %), greatly expanded telomeric and intergenic regions and more repetitive regions (> 90 %) compared to other genomes of species in the Pucciniales, however numbers of predicted coding regions (18,875) are comparable. Most of the increase in genome size is caused by a recent expansion of transposable elements belonging to the Gypsy superfamily. Post-inoculation mRNA sequence capture from a susceptible host provides expression evidence for 10,613 predicted coding genes, including 210 of the 367 putative effectors. The completeness of the genome provides a greatly needed resource for strategic approaches to combat disease spread.

18,875 predicted coding regions. Post-inoculation mRNA sequence capture from a susceptible host provided evidence for expression of 10,613 predicted coding genes, including 210 of the 367 putative effectors. We determined a high proportion of repetitive sequences (> 90 %), predominated by Class I long terminal repeat (LTR) transposable elements of the Gypsy superfamily [28]. Telomeric regions appear expanded in A. psidii (>1000 nt in 21 scaffolds) and predicted telomerase transcripts are present during infection indicating a role in early pathogenesis. While predicted gene numbers appear comparable to other genomes of species in the Pucciniales we found greatly expanded intergenic regions. The overall GC content is 33.8 %, with 41.5 % present in coding sequences, making A. psidii a particularly AT-rich genome. The high contiguity of the genome, combined with the unique genomic features identified in this study, provide essential resources to understand and address the spread of this invasive plant pathogen.

Biological sample from an Australian isolate
Austropuccinia psidii urediniospores, from a single pustule isolate, were used for all workflows including flow cytometry, Hi-C libraries and high molecular weight DNA extraction, as described in the methods of this manuscript.

Predicting genome size with flow cytometry and kmer analysis
To guide the quantity of long read sequence required to assemble the A. psidii genome, the nuclear DNA content of A. psidii cells was estimated using flow cytometry. Details for all experimental approaches are described in the methods section of this manuscript. Flow cytometry estimated the 1C (single nuclei) genome content at 1.42 pg (1,388.91 Mbp) using the host genome (Eucalyptus grandis) as the internal reference. As an additional approach, we ran a kmer analysis using Jellyfish (v.2.2.6), to predict the diploid genome size at 2,100 Mbp. From these results we estimated a haploid genome of around 1 Gbp and sequenced first 21 and then a further 8 PacBio SMRT® cells (18 X RSII plus 11 X Sequel).

Long read genome assembly with Canu (v1.6)
After converting RSII data to conform with the Sequel output of subreads.bam, we obtained 162 GigaBytes (GB) of data. Details on data pre-processing and assembly are available in the methods section of this manuscript and scripts are available at the myrtle-rust-genome-assembly github repository [29]. An initial diploid assembly with Canu (v1.6), using data from 21 SMRT cells, produced an assembly size of 1.52 Gbp ( Table 1). The contig numbers for this assembly were high and completeness was low (79.8% complete and 7.0% fragmented) based on Benchmarking Universal Single-Copy Orthologs (BUSCO) (v3.1.0) in genome mode [30]. We therefore sequenced eight additional SMRT cells and assembled the diploid genome of A. psidii at 1.98 Gbp, with 13,361 contigs, using Canu (v1.6) [31] long read sequence assembler. The Canu corrected and trimmed reads, at 35 times coverage, were used in the final construction of contigs (Table 1).  e  d  r  e  a  d  s  2  0  4  8  3  2  2  1  8  5  3  9  8  0  C  o  r  r  e  c  t  e  d  a  n  d  t  r  i  m  m  e  d  (  b  p  )  2  3  7  7  4  1  7  8  1  6  7  3  5  0  4  1  5  8  9  6  9  3   A  s  s  e  m  b  l  y  s  t  a  t  i  s  t  i  c  s   A  s  s  e  m  b  l  y  p  a  r  a  m  e  t  e  r  s  d  i  p  l  o  i  d  d  i  p  l  o  i  We deduplicated the assembly contigs with the Purge Haplotigs pipeline [32] to produce primary contigs and secondary haplotigs of ~ 1 Gbp each (Table 2) as well as contigs labelled as assembly artefacts (34 MegaBytes ). We then generated scaffolds from the assembly contigs using Hi-C [24] generated read data with ALLHIC (v.0.8.11) [33]. Initially, we produced 40 primary scaffolds but found several telomeres from the pre-scaffolded assembly embedded within scaffolds. We therefore curated to break these scaffolds at the start/end of telomeres for a   (  A  )  A  s  s  e  m  b  l  y  c  o  n  t  i  g  s  N  5  0  L  5  0  L  a  r  g  e  s  t  B  a  s  e  s  G  C  %   P  r  i  m  a  r  y  3  1  8  7  5  2  0  3  1  1  6  1  9  2  5  5  9  5  0  6  1  0  1  8  0  8  6  7  3  2  3  3  .  8  0   S  e  c  o  n  d  a  r  y  8  6  2  6  1  5  9  7  2  7  1  7  9  4  9  5  7  9  7  4  9  3  3  8  8  7  3  3  3  3  4  .  2  We polished the primary assembly using the genomic consensus algorithm arrow, as described in the methods and github [29]. This process was repeated twice and genome completeness was assessed with BUSCO (v3.1.0) [30] and identified the presence of 94.7% conserved proteins for the primary assembly and 97.6% for the combined primary and secondary. The BUSCO analysis was run using protein mode following gene prediction and annotation, described in the methods.

The genome size expansion is driven by transposable elements
Our analysis of the A. psidii genome assembly provides evidence for very large and diverse repetitive regions that constitute greater than 91 % (932,575,815 bp) of the genome. When we repeat-masked the APSI_primary and secondary genomes with published fungi and viridiplantae repeat databases (in RepBase) [34], we found very low percentages of matches (Table 3). The A.
psidii genome is largely novel repeat regions, prompting further analysis with the REPET pipeline [35] to characterise transposable elements (TEs).  M  a  s  k  e  d  w  i  t  h  l  i  b  r  a  r  y  f  u  n  g  i  v  i  r  i  d  i  p  l  a  n  t  a  e   %  o  f  g  e  n  o  m  e  m  a  s  k  e  d  i  n  A  P  S  I  1  Using the REPET pipeline, we show that the massive repeat expansion in A. psidii is mostly driven by TEs belonging to the Gypsy superfamily ( Figure 1). Most of the expansion was driven by a limited amount of TE families at discrete time points ( Figure 2). For example, the extensive expansion observed around 78% family level identity is mostly driven by a single TE family.
This is similar to early expansions (30-50% family level identity) which are driven by only a limited amount of TE families and in contrast to the most recent expansion (80-90% family level identity), which is driven by a more diverse set of TE families including several Class II superfamilies, such as those belonging to the TIR order.  The high percent of TEs and the overall low GC content prompted us to investigate the presence of AT-rich region in the A. psidii genome. We identified AT-rich regions using OcculterCut [36], a program specifically designed to identify AT-isochore. The major peak identified is around 33% GC content and a second peak around 41% GC content ( Figure 3A).
The former largely overlaps with the GC content profile of TEs. The second peak overlaps with the GC content of genes in the A. psidii genome. This specific AT-isochore structure with two separate genome compartments appears to be specific to A. psidii, as the closely related Puccinia striiformis f. sp. tritici displays only a single peak at around 45% GC content ( Figure 3B). We further investigated the GC-content for two scaffolds, APSI_P025 (8 512 731 bp) and APSI_P027 (3 316 254 bp). Plotting of gene density and repeat density shows a lack of repeat-or gene-rich islands, but rather accumulation of TEs between genes ( Figure 3C).

Extended telomere repeats identified
Prior to scaffolding the assembled contigs, we identified 29 telomeric regions at the start or end of primary contigs (and 23 in the secondary contigs), based on the hexamer TTAGGG(n) [37].
We used these numbers to guide scaffolding and checked the telomere locations, post  (Table 4) indicate extraordinarily long telomeres compared to other studied rust fungi [37,38]. Previous studies indicate that telomere localisation in the nucleus show affinity to the nuclear membrane [39], therefore cross-linking during our Hi-C library preparation may have proximity-ligated adjacent telomeres leading to errors in scaffolding. Manual curation before and after scaffolding corrected for this problem by identifying and breaking at embedded telomeres. With no current cytological evidence for the number of A. psidii haploid chromosomes to guide scaffolding, we made predictions for scaffold grouping based on telomeres and evidence from a related rust fungus, Numbers and average length in nucleotides (nt) of contigs/scaffolds with n(TTAGGG) sequences. n = number of hexamers, for example >1000 means more than 1000 x (TTAGGG) or more than 6000 nt.

RNA used for genome annotation
In order to annotate the genome, we extracted total RNA from infected young leaves of the susceptible host, Syzygium jambos, and harvested at 6, 12, 24, 48 hours (h), 5, and 7 days (d) post-inoculation with A. psidii. RNA-seq data were generated for the six time points after inoculation and the assembled transcripts were merged for a total of 80,804 representative genes and 108,659 alternative forms. Of the total transcripts, 8% and 9% were classified as fungi in the main and alternative gene sets respectively (Additional Figure 1.HTML).
RNA-seq alignments showed that the majority of reads were from the host plant and a minor percentage of data were from the fungal isolate. Mapping the RNA reads to the assembled A. psidii genome, the lowest overall mapping rate was observed in sample 6 h (0.77%) and the highest in 7 d (8.58%). Mapping to the Metrosideros polymorpha (ōhi a) plant genome, the most closely related publicly available genome to the inoculation host plant, the overall mapping rate was stable and greater than 76.7% in all samples except 7 d (which was 71.55%).

Gene prediction and functional annotation
We used Braker2 [41] and RNA-seq data to computationally predict 18,875 (15,684) protein coding genes within the primary and secondary assemblies respectively (Table 5). We then used SignalP (v.4.1f ) [42], followed by EffectorP (v.2.0) [20] to identify putative effector genes within the primary and secondary assemblies at 367 and 304 respectively. These putative effectors did not appear to be compartmentalized and had similar intergenic distances when compared to BUSCOs or all genes ( Figure 4). We also ran the predicted proteins through ApoplastP [43] to identify sequences that are likely to be located to the apoplast within host plants. Predicted effectors from the primary assembly were visualized on scaffolds with MapChart v.2.3.2 [44] (Additional Figure 2) and expression validation, based on mapping reads to the primary assembly with Star (2.7.2b), was found for 10,613 sequences, including 210 predicted effectors. Of the predicted effectors in the primary and secondary assemblies, exact amino acid duplicates were found for 97 sequences. Annotation files and multi-sequence fasta files for predicted coding sequences, amino acid sequences and effectors for the combined primary and secondary assemblies are available at DOI:10.5281/zenodo.3567172.

Pucciniales orthologues and comparative statistics
We did a comparative analyses of fungi within the order Pucciniales with predicted proteins using Orthofinder (v2.2.7) [45]. We included protein sequences from two tree rust species, Cronartium. quercuum (white pine blister rust) [46] and Melampsora larici-populina (poplar rust) [47], to test whether protein orthologues may relate to life history of the host. We visualized the evolutionary phylogeny, using the species rooted tree based on orthologs, with Dendroscope (v3) [48]. Based on a species rooted phylogenetic tree from the multiple sequence alignment of single copy orthologues, A. psidii was placed closer to the cereal rusts than to tree rust species ( Figure 5A). This finding is in accordance with previous analyses using ribosomal DNA and cytochrome c oxidase subunit 3 (CO3) of mitochondrial DNA to determine the type specimen for A. psidii [1].  Comparative statistics for the Pucciniales genomes show that the scaffolded assembly for A. psidii is more contiguous, more repetitive and larger than other assembled genomes (Table 6 and Additional Table 1.xlsx). A comparative analysis using annotation files downloaded from Mycocosm [50] and NCBI [50], reveal dramatically large expansions of intergenic lengths in A.

Pandemic isolate variation identified in Australia
Whole genome DNA resequencing on the Illumina sequencing platform (paired-end 150 bp reads) was obtained from six other Australian isolates, one isolate from Hawaii believed to be from the pandemic strain, and a Brazilian isolate stored at the at the University of Sydney, Plant Breeding Institute (PBI). The DNA from these isolates was previously tested using simple sequence repeat markers and all Australian isolates were determined to be genetically similar [6].
When comparing pathogen isolates, we found 44,375 SNPs in total. On average one SNP per 23 Kbp. The maximum likelihood phylogenetic tree, in newick format, was visualized in Dendroscope v3 [48] (Figure 7) and indicates variations present in the pandemic strain of the pathogen. These isolates were collected from diverse hosts and locations in Australia dating back to 2011 (

Discussion
Austropuccinia psidii, the causal agent of the pandemic myrtle rust, is a dangerous and rapidly spreading plant pathogen. It infects perennial hosts in the Myrtaceae family, including iconic trees such as eucalypts, and has an expanding global host list [15]. Here we present the first highly contiguous genome assembly for A. psidii using DNA from an Australian isolate of the pandemic strain previously assembled with short-read sequence data [21]. We assembled the genome using long read sequence data to produce the largest assembled fungal genome to date at 1.02 Gbp haploid size.

Why is the Austropuccinia psidii genome so big?
Data based on flow cytometry determined the largest fungal genome at ~3.  [51], the evidence to date is that genome expansion in rust species is due to repetitive elements, notably a proliferation of transposable elements (TEs) [47,53].
We annotated 18,875 predicted protein coding genes, comparable to other cereal rusts (Table 6 and Additional Table 2.xlsx). We determined that the extraordinary size of the A. psidii genome is not a product of gene expansion but is due to extensive repetitive DNA (> 90%) and very large intergenic regions (an order of magnitude greater when compared to six other Puccinales species) ( Figure 6). Despite these large intergenic regions, we see no difference in the gene lengths for these species, further explaining that the increased genome size is related to non-genic regions that are largely repetitive. Our investigation into the nature of the repeat regions identified Gypsy superfamilies belonging to the LTR order as the major contributor to overall genomic expansion covering 73% of the genome (744,220,388 bp).

Active telomerase and longer telomeres
Telomeres are structural 5′-TTAGGG-3′ repeats that cap the physical ends of chromosomes and are important in maintenance of DNA integrity [37]. In human cell lines, where the telomerase holo-enzyme is not expressed beyond embryo stage, short telomeres are linked to ageing and telomerase activity is linked to cancer [54]. While telomeres can be from 5-15 Kbp in humans, fungal telomeres are generally around 240 bp in length [37,38]. We identified very long telomeres in the A. psidii assembly (more than 6 Kb for 10 of the scaffolds) and early infection transcripts (12 hours post inoculation) showed the presence of the predicted telomerase enzyme, indicating a putative role in maintaining telomeres during active growth. Annotations for telomerase reverse transcriptase (RNA-dependent DNA polymerase) and telomerase ribonucleoprotein complex -RNA binding domains (PF12009 and PF00078) within a predicted A. psidii protein were found to be expressed in in both the primary and secondary assemblies (protein ID: APSI_P017.12297.t3 and APSI_H002.12793.t2).
A study in 2000 [55] identified and located the Magnaporthe grisea (rice blast fungus) AVR-Pita effector gene directly adjacent to a telomeric region on chromosome 3. The gene encodes a secreted effector with a zinc metalloprotease domain and mutations occur spontaneously, due to telomeric proximity, leading to loss of avirulence. Effectors were not predicted within subtelomeric regions within the A. psidii assembly but there are several uncharacterised predicted genes within 10kb on scaffold 27, a putative complete chromosome (Additional Figure 2). High expression levels were identified for one of these predicted genes (APSI_P027.780) and might warrant further investigation during pathogenesis. Also of interest, several subtelomeric regions were annotated, perhaps due to roles as RNA templates for telomeric DNA synthesis [56]. The presence of transcriptional activity of telomerase within infecting spores provides interesting mechanistic evidence for potential genome maintenance or even telomere expansion. Caledonia [8]. Further studies and deeper coverage sequencing will clarify these results and enable proper containment of newly introduced and more virulent isolates.

Potential implications of the research
We have generated a highly contiguous reference genome assembly for the pandemic strain of A.
psidii using long-read sequencing and chromosomal confirmation capture sequencing technology. The resource of a genome assembly from long reads has enabled the spanning of highly repetitive sequence regions, previously problematic to assemble with short read sequence data. We used in vivo expression data for genome annotation and have identified genome

Nuclei size determination using flow cytometry
We estimated nuclear DNA content of A. psidii cells using flow cytometry. The fungal nuclei fluorescence intensity were compared with that of Eucalyptus grandis, as an internal standard, 2C=1.33pg, 640 Mbp [27,60]. Nuclei were extracted from cells using methods previously described [25]  were visualised and coverage determined at 3X. The diploid genome size was calculated at 2,100 Mbp by dividing the total number of kmers (21mer) by the mean coverage. From these results a haploid genome of around 1 Gbp was predicted.

HMW DNA extraction and PacBio sequencing
For long read sequencing a modified CTAB high molecular weight DNA extraction procedure

Genome assembly
We optimized the sequencing based on our predicted genome size of ~1 Gbp and sequenced 29 SMRT® cells (18 X RSII plus 11 X Sequel). After converting RSII data to subreads.bam we obtained 162 GB of data. Details on the genome assembly are available on the github repository [29]. In brief, RSII bax.h5 files were converted to subreads.bam before fasta files were extracted from all datasets. After extracting fasta files for assembly we had approximately 72 times raw sequence coverage (7.24E+10 bases and 5.80E+06 reads). The reads were then mapped to an inhouse A. psidii mitochondrial sequence to retain only genomic DNA sequence data (APSI_mitochondria.fa available on the github repository) [29]. Canu version 1.6 [31] long read assembly software was used to assemble the genome with correctedErrorRate=0.040, batOptions="-dg 3 -db 3 -dr 1 -ca 500 -cp 50" and predicted genome size of 1 Gb. The assembly ran for over four months continually on the University of Sydney High Performance Compute

Deduplication of assembly and polish
To deduplicate the assembly into primary contigs and secondary (haplotigs) we used the Purge Haplotigs (v1.0) pipeline [32] with alignment coverage of 65 percent. We polished the primary assembly using the genomic consensus algorithm arrow within Anaconda2 (v.5.0.1) (PacBio), after aligning with Minimap2 (v2.3) [68] and processing the data for compatibility, as described in the github repository [29]. This process was repeated twice and genome completeness for the primary and secondary genomes was assessed with BUSCO (v3.1.0) [30] using genome mode and the Basidiomycota database (basidiomycota_odb9 downloaded 24/02/2017) of 1335 conserved genes. Where conserved single copy orthologues were present on the secondary contigs and absent from the primary contigs, these contigs were detected and moved to the primary assembly using custom python scripts.

Telomere identification for scaffolding
Prior to scaffolding we first checked for telomeric regions using blast+ v.2.3.0 (blastn evalue 1e-5) [69] based on the hexamer TTAGGG(n) [37] with the aim of guiding the scaffold numbers to putative chromosomes. We tested n=20, 40 repeat locations were at the start and/or end of contigs and were visualized using Integrated Genomics Viewer (IGV v2.6.3) [71]. We reasoned that the repeats were likely to represent telomeric regions and used these numbers to guide Hi-C scaffolding.

Transposable elements (TE) analysis
Repeat regions were annotated as described previously [53] using the REPET pipeline (v2.5) for repeat annotation in combination with Repbase (v21.05). For de novo identification, we predicted repeat regions with TEdenovo in the initial assembly by subsampling 300 Mb of sequence space at random to reduce compute time in the risk of losing low abundant TEs. We called the resulting repeat library MR_P2A_300Mb. We annotated the primary assembly with three repeat libraries (repbase2005_aaSeq, repbase2005_ntSeq, and MR_P2A_300Mb) using TEanno from the REPET pipeline. Detailed description of the repeat annotation and analysis can be found in the associated github repository. We identified AT-isochomers using OcculterCut [36]. We calculated GC content for specific genome features or sliding windows (window size 1kb, slide 200bp) using bedtools [72]. Karyplots of scaffolds were plotted with karyplotR. [73]

Hi-C libraries and scaffolding
Hi-C libraries were prepared using the Phase Genomics, Inc. (www.phasegenomics.com) Proximo Hi-C Kit (Microbe) to the manufacturer's specifications. Illumina paired-end reads (125 bp on HiSeq) generated from Hi-C libraries were used to scaffold contigs. Reads were firstly trimmed with adapters and low-quality bases, and then mapped to the primary and secondary assemblies independently following the mapping workflow from Arima Genomics [74]. Three Hi-C scaffolding programs were tested (LACHESIS, SALSA and ALLHIC ) before settling with ALLHIC (v.0.8.11) ) [33]. We tested the grouping, ordering and orienting of the contigs with both MluCl and Sau3AI, (SALSA) as enzyme cutting sites, both together and separately (LACHESIS and ALLHIC). Final scaffolds were generated from

RNA data analysis
RNA-seq data were generated for six time points. Each set of data for different time point was processed in parallel in a consistent way. The quality of raw RNA-seq data was checked using FastQC (v.0.11.7) and an overall summary for all samples was created with MultiQC (v.1.5).

Gene prediction and functional annotation
We annotated the primary and secondary scaffolds independently using Braker2 [81] [42]. A reduced file of sequences conforming to these predictions was submitted to both EffectorP (v.2.0) [20] and ApoplastP (v.1.01) [43] to identify predicted effectors and apoplastic proteins from each scaffolded assembly. Predicted effector fasta files were submitted to the online HMMER [86], using phmmer and the reference proteomes database, for an alternative annotation output based on homology. In order to determine allelic counterparts within both the primary and secondary assemblies we ran blast+ (v.2.3.0) (blastp evalue 1e-5) [69] with the predicted protein fasta files and identified query and subject equal aligned lengths at 100 percent match.

Comparative analysis of Pucciniales proteins
We Additionally, we reviewed the literature to determine basic genome statistics for these rust pathogens and compared the datasets gene architecture based on annotations (gff/gff3 files).
For the analysis of gene and intergenic length we used the more complete assembly for Puccinia striiformis f. sp. tritici_104 E137 A, previously excluded from orthologue analysis due to diploidy, but excluded the strain Puccinia striiformis 93-210. We used custom R scripts to determine gene length and intergenic length distribution across datasets to determine alternative explanations for genome expansion. Untranslated regions (UTR) were not annotated for every gene therefore the analysis includes UTRs.

Variant analysis of Australian isolates
DNA from six other Australian isolates, one Hawaiian and one Brazilian isolate stored at the PBI were sequenced on Illumina sequencing platform (Table 7). A total of 209 GB of (150 bp) paired-end sequence data was first trimmed and then Bowtie2 [80] was used for mapping reads to the primary genome assembly using the --end-to-end option for each sample independently.
Each sample sam file was them converted into bam files using Samtools [88], sorted and indexed against the reference genome. Samtools mpileup and bcftools were used to generate compressed variant calling files (vcf.gz). Variants were filtered using vcftools to a minimum depth of 6X and insertion-deletions were removed. The vcf file was sorted and format converted with Tassel v5 [89] and the phylip alignment file was run with iqtree v1.6.7 [90] with parameters -bb 1000 -nt  Availability of supporting data Genome assembly and annotation data files: The primary genome assembly file, named APSI_primary.v1.fasta, has been deposited at the European Nucleotide Archive (ENA) at EMBL-EBI under the following ENA accession: ERZ1194194 (GCA_902702905). Raw data is deposited at ENA under project: PRJEB34084 (Study accession ERP116938). Locus tags are registered as APSI (for Austropuccinia psidii) and scaffolds identified as APSI_Pxxx (where x indicates scaffold number) for primary assembly.