Whole genome experimental maps of DNA G-quadruplexes in multiple species

Abstract Genomic maps of DNA G-quadruplexes (G4s) can help elucidate the roles that these secondary structures play in various organisms. Herein, we employ an improved version of a G-quadruplex sequencing method (G4-seq) to generate whole genome G4 maps for 12 species that include widely studied model organisms and also pathogens of clinical relevance. We identify G4 structures that form under physiological K+ conditions and also G4s that are stabilized by the G4-targeting small molecule pyridostatin (PDS). We discuss the various structural features of the experimentally observed G-quadruplexes (OQs), highlighting differences in their prevalence and enrichment across species. Our study describes diversity in sequence composition and genomic location for the OQs in the different species and reveals that the enrichment of OQs in gene promoters is particular to mammals such as mouse and human, among the species studied. The multi-species maps have been made publicly available as a resource to the research community. The maps can serve as blueprints for biological experiments in those model organisms, where G4 structures may play a role.

Several methods have been devised in the last decade to study the formation and stability of these structures in vitro (12,13) or to computationally predict their formation in genomic contexts (14)(15)(16)(17)(18). Biophysical studies have shed light on factors that influence G4 formation but are typically low throughput and limited to sequences of short length. Conversely, computational predictions can be applied to any given genome but lack a thorough experimental validation, and rather employ algorithms derived from experimental data on a small number of sequences. Folded G4s have been detected in small genomes by polymerase pausing using PacBio SMRT sequencing (19), though this approach has not yet been scaled to larger genomes. To overcome all such limitations and also go beyond computational prediction, we recently developed G4-seq to experimentally detect and map G4 structures in a way that is scalable to the human genome (20). This method identified hundreds of thousands (n = 716310) of G4 forming structures and has revealed important features that govern G4 formation and stability including the relevance of genomic sequence context. The human G4 map was generated using purified DNA and has served as a reference to interpret biological studies (21)(22)(23)(24). The initial G4-seq human genome dataset revealed some shortfalls in the various G4 computational prediction algorithms, which can lead to an over-or underestimation of G4 structures. Furthermore, the dataset (20) was largely made up of non-canonical G4s with long loops, bulges or comprising only two G-tetrads, all of which are difficult to predict with accuracy from the primary DNA sequence alone. Some improvements in sequence-based G4   prediction have been recently made via computational approaches that have employed the G4-seq dataset for training a machine learning model (17,25,26). We applied the refined G4-seq protocol to the genomes of 12 different species, comprising important model organisms and pathogens of clinical relevance (see list in Supplementary Table S1 and Figure 1A). As part of this study, we also addressed some limitations of the original G4-seq approach (20), which included poor coverage at high GC-rich regions, limiting its use for G-rich genomes or regions, and insufficient spatial resolution to disentangle G4s in close proximity ( Figure 1B). The comparative G4 maps generated provide important insights into G4 structural classes across species, the key sequence features that determine different patterns of G4 formation and the relevance of G4 localization across genomes.

Species naming convention
The full scientific name of each of the 12 species analysed in this study are reported at Supplementary Table S1, together with the short name used as convention throughout the text in this work. The short name used is either a concise version of the full scientific name (e.g. Drosophila in-stead of Drosophila melanogaster, or C. elegans instead of Caenorhabditis elegans) or the common name used in the field for that species (e.g. human, mouse and zebrafish for Homo sapiens, Mus Musculus and Danio rerio respectively), as detailed in Supplementary Table S1.

PQS motifs
PQS (Putative G-Quadruplex Sequences) are computationally defined sequence motifs that have features compatible with G-quadruplex formation. A PQS generally consist of stretches at least four G runs (i.e. two or more consecutive Gs) separated by nucleotide stretches of different length (loops). The PQS used in this study comply to the following regular expressions: . Human genomic DNA was extracted from human HEK-293T cells, cultured as previously described in section 7.1.6, by phenol/chloroform extraction, using a phenol:chloroform:isoamyl alcohol solution (25:24:1, Thermo Scientific). The resulting aqueous layer was purified and concentrated using ethanol precipitation at -20 • C overnight, to give the purified genomic material. Genomic DNA was obtained from yeast, Saccharomyces cerevisiae purchased from Sigma Aldrich (Type II, YSC2). The yeast (50 mg) was hydrated overnight at 20 • C, lysed using a proteinase K containing buffer (10 mM Tris pH 8.0, 100 mM NaCl, 10 mM EDTA pH 8.0, 0.5% SDS and 2 l Proteinase K) at 56 • C for 5 h and then incubated at 4 • C overnight. Phenol/chloroform extraction and ethanol precipitation were then used to obtain the purified genomic material. Finally, mouse (Mus musculus) genomic DNA was extracted from skin tissues of a 12-weekold male mouse (C57BL/6J strain, standard JAX reference strain from Charles River), provided by Biological resources unit, CRUK-CI genomics core, using DNeasy Blood and Tissues kit (Qiagen, 69504) according to the manufacturer's protocol. The integrity of all samples was assessed using a genomic DNA screentape on the Tapestation and DNA was quantified using dsDNA HS assay kit (Qubit). Genomic DNA samples were then sonicated and the library prepared, as in (20). Only DNA from H. sapiens was used with TruSeq Nano DNA LT Library Prep Kit, however all genomic DNA was used with PCR-Free Library Prep Kit.
Modified sequencing buffers were prepared as previously described (20), with the only difference being the addition of the small-molecule PDS to the K + instead of the Na + buffer.

Raw processing: alignment and mismatch calculation
Fastq files are generated through a customized protocol, where DNA fragments are sequenced two times with 150 bp, similarly to a paired-end protocol. However, the fragment read is not 'flipped' at Read-2 but just re-sequenced in different buffer conditions, as detailed in Chambers et al. (20). Fasta genome files were downloaded from public repositories for each species (Supplementary Table S4). The main processing steps are as follows: 1) Fastq files from Read-1 are aligned to the respective reference genomes using the bwa mem aligner (http://biobwa.sourceforge.net). 2) Aligned bam files are processed with a customized script that converts bam to bed files (bedtools bamtobed) and then extracts the alignment with the highest mapping quality (MapQ) for each read (bedtools groupBy). 3) An R script (27) takes in input the Read-1 and Read-2 fastq files and the bed alignment files generated at step 2 and calculates for each read the quality scores at Read-1 and Read-2, the delta quality score (i.e., the quality drop Read-1 minus Read-2) and the percentage of mismatches (mismatches %) between Read-1 and Read-2. The Mis-matchAnalyzer R script is deposited as part of the Supplementary Code in Chambers et al. (20). 4) The results are saved to mismatch files.

Mismatch combination and hit calling
Mismatch files generated as explained in the previous paragraph are processed as follows: 1) Each mismatch file is split by chromosome and each read is further split in single bases and the mismatches % is assigned to the first 50 bases belonging to a give read (bedtools makewindows). 2) Alignments shorter than 50 bases are filtered out. 3) Another customized R script (27)

Characterization of the improved G4-seq method
To quantify the effect of Li + sequencing and PCR-free library preparation methods, as compared to PCR-amplified, Na + sequencing, we analysed in-depth the coverage at Chr 1, which contains over 60k PQS of the form G 3+ L 1-12 over 250 Mb (Supplementary Data). We made sure that the subsets analysed would have similar coverage in all the compared methods: 8.2 per-nt per-strand in the published G4seq method (label PAPER in figures); 7.4 in the Li + , PCRamplified method (label PCR in figures); 8.6 for the Li + , PCR-free method (label PCR-FREE in figures). Coverage was inspected at PQS motifs with different loop length: 1-12, and the sub-categories 1-3, 4-7 and 8-12, and compared across sequencing methods. Coverage was also inspected at over 9 million windows of 50 nt, overlapping 25 nt with each other, covering the entire Chr 1. Windows exhibiting GC content >70% were calculated (n = 105 420) and further subset to those also containing PQS motifs (G 3+ L 1-12 , n = 38 329), and inspected for coverage. For the impact of averaging window size during analysis, we compared the case with 50 nt to the one with 150 nt. We re-analysed in the same way Chr 1 of the Li + , PCR-free human library, and assessed the number of PQS motifs (G 3+ L 1-7 ) present in OQs and detected for both window sizes, the average OQs region size and the number of OQs containing more than a single G 3+ L 1-7 motif.

Sequence analysis for G4 structural features
Hit files, also called OQs (Observed Quadruplexes), are then intersected to different predicted G-quadruplexes (PQS) files (see Table 2): G 3+ L 1-7 , G 3+ L 8-12 , G 2 L 1-12 (see PQS motifs in Materials and Methods). The intersection of each one of these files with the OQs in the respective species is calculated, and conversely also the overlap of the OQs to each PQ file (bedtools intersect, by swapping the -a and -b options for the two cases). To generate the pie charts, the number of OQs in each PQS category was calculated, and OQs were assigned hierarchically to the first matching category (according to the order presented above). OQs without any coverage (not even partial coverage) were classified as noncovered; PQS not scoring and overlapping areas with no coverage were extracted (bedtools intersect) as well as PQS not overlapping regions with no coverage (i.e. entirely covered; bedtools intersect option -v).

Genomic region analysis
The gene annotation files used in this analysis have been downloaded from publicly available databases, as listed in Supplementary Table S4. First, the different transcript regions, such as 5 UTR, exons, introns, lncRNA and promoter regions have been extracted from the .gff files. For promoters, we used as definition 1 kb upstream of the transcription start site (TSS); for TSS regions, we used ±1 kb from the TSS. Each annotation so generated has been intersected (bedtools intersect) with the OQ files in the respective species, and the overlap counted. Random genomewide shuffling of the OQ file has been performed three independent times (bedtools shuffle), and the overlap has been assessed for the random case. The fold enrichment of the actual overlap divided by the average random overlap has been calculated for each species separately, and the standard error of the mean fold enrichment computed.

Cross-species promoter co-occurrence analysis of OQs
The transcription start sites (TSS) of all genes in the human genome have been retrieved and the corresponding orthologous genes have been cross-mapped in five other species. Human, mouse, Drosophila, C. elegans, zebrafish and Saccharomyces genomes were considered for this analysis, due to the presence of well-annotated assemblies available programmatically from the Ensembl genome database (http: //www.ensembl.org) and because they represent the most related species to human in this study. For each considered genome, the TSSs have been retrieved along with the corresponding gene names (Ensembl gene IDs for ortholog genes in species other than human), chromosome name and strand, transcript start and end coordinates. The retrieval was done using the biomaRt (28) library in R (27), which provides an R interface to the Biomart data query and retrieval system of Ensembl genome database. Human gene set was taken as a reference, with all the corresponding ortholog information pulled from the other species. The versions of the genome assemblies for which the genomic coordinates were retrieved were hg38 (GRCh38, for human), mm10 (GRCm38, for mouse), dm6 (BDGP6, for Drosophila), ce11 (WBcel235, for C. elegans), danRer10 (GRCz10, for zebrafish) and sacCer3 (R64-1-1, for Saccharomyces). These assemblies matched those used for all the OQs analyses, except for hg38 and ce11 for human and C. elegans, respectively. For the latter two species, the obtained genomic coordinates were then converted into the hg19 (GRCh37) and ce10 (WormBaseWS220) versions, using the program liftOver (http://genome.ucsc.edu/cgi-bin/hgLiftOver). Promoter regions were defined as 1 kb upstream of the TSS for each species, and the maximal mismatch value in the regions was calculated, and hierarchical cluster analysis on a matrix where rows represent the TSS (n = 24 164 human TSS, of which 17 400 have at least one orthologues) and column represent the mismatch values in the 6 species considered (Supplementary File S2).

Sequencing G4s in multiple genomes
Details of the previous limitations and current refinements of G4-seq and the characterization of the improved method are described in the paragraph 'The improved G4-seq method' of the Supplementary Data. In essence, the use of Li + in the initial sequencing run improved the fraction of PQS sites with sequencing coverage, as compared to using Na + , especially for G4 motifs with loops shorter than four nucleotides. The use of a PCR-free protocol eliminated certain biases and improved coverage in regions with GCcontent over 70%, including those containing PQS motifs. We applied the revised G4-seq method to the genomic DNA of 12 different species that were chosen for being either important model organisms such as mouse, Drosophila and C. elegans, or important pathogens, such as Leishmania and E. coli (Table 1; see Supplementary Table S1 for genome sources). The chosen genomes also provided natural variation in genome size, GC content and the number/density of computationally predicted G4s (PQS motifs) ( Table 1 and Supplementary Figure S1) enabling a comparative assessment of G4 formation in various genomic contexts. Details about the sequencing yield and the total number of different motifs of PQS used in the following analyses are listed in Table 1 and Supplementary Table S2 (Materials and Methods).
The key difference in the analysis, compared to the previous approach (20), is the windows size and scoring thresholds. Sequences in Read-1 (in Li + ), where G4 formation is less favoured, are compared to the same sequences acquired in Read-2 (in either K + or K + +PDS) where the G4 formation is favoured. The mismatches are calculated and averaged for each genomic location with windows of 50 bp (see Methods for details). Thresholds to identify observed Gquadruplex (OQ) have been set to 25% for K + and 35% for K + +PDS after inspecting the mismatch distribution for PQS motifs, detailed in the next section. Mismatch percentage, coverage and OQs can be visualized as tracks in a genome browser such as IGV (29). Representative tracks are shown in Supplementary Figure S2.

G-quadruplex maps for 12 model species
After calculating the mismatch percentages for all 12 species in both K + and PDS conditions, we inspected the distribution of mismatches for PQS (G 3+ L 1-12 , Figure 2A and Supplementary Figure S3). Notably, the score distribution is essentially bimodal, with a consistent proportion of PQS exhibiting very low mismatches close to 0%, indicating that many predicted G4s do not form stably at physiological K + concentration. However, on addition of the small-molecule stabiliser PDS, the majority of these predicted G4s show mismatches >40%. This shift is evident in the distributions for all species (Figure 2A and Supplementary Figure  S3), demonstrating that insufficient G4 stability, rather than technical artefacts, determines the absence of G4 scoring under K + conditions. The bimodal nature of these distributions also shows that the choice of the OQs thresholds (25% for K + and 35% for K + +PDS) is appropriate, as the two PQS populations with high and low mismatches (stably forming and not forming, respectively) are split correctly.
We define specificity as the proportion of OQs that satisfy the minimal requirement for a G4 to have at least two tetrads (i.e. G 2+ L 1-12 ). We reasoned that sequences that do not conform to this relaxed G4 motif are likely false positives (Materials and Methods). We define sensitivity as the proportion of PQS (G 3+ L 1-12 ) that are identified as OQs, since there is consensus supporting in vitro G4 formation for G 3+ L 1-12 motifs (30,31). We observed that the majority of OQs exhibited high specificity (>80% for most species, Figure 2B). Some species exhibited higher specificity, such as the bacterial genomes, human, mouse, Drosophila, C. elegans and Leishmania, while others gave lower specificity, such as Plasmodium, Saccharomyces and zebrafish, especially in the K + condition. The use of a higher threshold in K + +PDS, made possible by the extra G4 stabilization provided by PDS, helps reduce false positives and other non-G4 related sequencing errors ( Figure 2B). Crucially, the good specificity did not compromise the sensitivity of the assay. Figure  2C shows that under K + conditions, most species show a percentage of PQS (G 3+ L 1-12 ) scoring in the range 40-60% (and 55-75% for the more stable G 3+ L 1-7 category), but there are some notable exceptions with lower scoring percentages, such as E. coli, Rhodobacter, Arabidopsis and Saccharomyces. The sensitivity increases strongly under PDS stabilizing conditions for all species (>70% for 10 out of 12 species, and >50% for the other two).

Structural categories of OQS
G4s can diverge from the commonly used motif comprising four runs of three guanines separated by loops up to seven nucleotides in length (14). Variants of G4s with longer loops, interruptions in the G-run such as mismatches or bulges (32), and G4s with only two G-tetrads are also possible though typically exhibit lower stability (2,33). From the OQs identified, we classified different structural categories, from the most stable to the least stable, based on previous biophysical knowledge (Methods) (20,21). The categories considered were, starting from the highest predicted stability, G 3+ L 1-7 (standard, three-tetrad G4 motif), G 3+ L [8][9][10][11][12] (longer loops, three-tetrad G4 motif), G 2 L 1-12 (two-tetrad G4 motif), Other (sequences that cannot be directly ascribed by any of the classifier presented in this work). Note that the third category G 2 L 1-12 includes both two-tetrad motifs and sequences with bulges, i.e. three-tetrads structures with interruptions in the G run, which cannot be unambiguously assigned without structural analysis, such as NMR spectroscopy. The relative proportions of each of the different G4 categories in K + varies across species, as shown in the pie charts ( Figure 3A and Table 2): the canonical PQS G 3+ L 1-7 motif (blue sector in the pie charts), i.e. those sequences considered to have high stability, occupy a large fraction of the OQs identified in K + conditions in human, mouse, Leishmania, C. elegans and Rhodobacter, while it represents only a minority for zebrafish, Trypanosoma, Arabidopsis, E. coli and Saccharomyces. Longer loop PQS motifs of the form G 3+ L 8-12 (red sector), i.e. three-tetrads structures with less predicted stability compared to the previous category, follow a similar pattern. Conversely, the twotetrad motif (green sector) show an opposite trend, occupying a larger fraction of OQs in those species with less canonical PQS motifs.
In fact, the proportion of observed two-tetrad G4s in particular increases upon PDS stabilization (green sectors in Figure 3A and B) for all genomes. Also, the total numbers of PQS motifs identified as OQs and the identified percentage of PQS out of the total genomic motifs increases in all Table 1. Genome size, GC and PQS content for the 12 species. # PQS G 3+ L 1-12 = number of motifs of the form G 3+ L 1-12 in the genome; density of PQS = 1000 × PQS total size / (2 x genome size); depth is calculated per each strand separately.  Figure 3A and fold enrichments for all species, calculated as explained in Figure 3E (see Materials and Methods  Figure  3B, C and Supplementary Figure S4), consistent with the small-molecule enabling more sensitive G4 detection. Overall, PDS treatment greatly increases the average assay sensitivity from 31% to 66%, since the small-molecule stabilization allows more PQS to be identified, and also to a lesser extent increases the average specificity from 81% to 85%, since OQs are scored using an increased threshold, which suppresses false positives. In mouse, for instance, there is a general increase in all G4 categories ( Figure 3D), but only the two-tetrad category shows a significant increase in the fold enrichment over random (from 1.4 to 1.8; t-test P-value = 10 -6 ; Figure 3E).
Poly-G stretches, i.e. sequences consisting of 12 or more consecutive Gs, also appear to be prone to stably form G4s, as 83% of the ∼40 000 poly-G stretches combined across all species is identified as OQs in K + conditions, with the percentage further increasing to 92% in PDS stabilizing conditions. The long stretch (∼427 nucleotides) of Gs present in the human genome in chromosome 2 (chr2:33141266-33141693) previously reported in Huppert (34) also displays OQs formation in PDS across the entire region, while in K + OQs are detected just below threshold (% mismatches of 23).
We also observed an increased enrichment for the twotetrad category in PDS versus in K + of each respective species (see Table 2 for K + and Supplementary Table S3 for PDS). Notably, the enrichment for the three-tetrad motifs (both G 3+ L 1-7 and G 3+ L 8-12 ) under K + condition was very high for Arabidopsis, C. elegans, Plasmodium and zebrafish and low for Saccharomyces, E. coli and Leishmania, suggesting that PQS motifs can score differently across species (Table 2). The additional stabilisation induced by PDS, caused a higher enrichment for the three-tetrad G4 motif in Saccharomyces and E. coli (Supplementary Table S3). Given G4seq is carried out on isolated single-stranded DNA, we reasoned that the observed differences must be due to speciesdependent sequence effects within and around the G4 motif, which we discuss in the next section.

Features of stable G4s identified
We next evaluated how particular sequence-related features, such as G4 loop sequence and G4 flank sequences, which influence G4 formation and stability, might explain the observed inter-species differences in G4 stability ( Figure 2C plementary Figure S5). Rather, the PQS scoring proportion showed, a strong dependency on G-and GG-richness (R = 0.62 and 0.72) and a negative correlation with C-and CCrichness (R = -0.68 and -0.64) (Figure 4), while T-and Arichness had a smaller effect (R = 0.29 and -0.34, respectively). G/C ratio, defined as G fraction divided by C fraction within the PQS motif, was an even stronger determinant of PQS scoring proportion (R = 0.82), with the degree of GC-richness in the flanking regions having no effect (R = -0.08) (Figure 4). Interestingly, these observations are in agreement with a recent multi-organism computational study reported by Burrows and co-workers (35). The C-, G-, T-, A-richness and G/C ratio were calculated within the PQS motifs, and the average PQS values for those are reported in Supplementary Figure S6. Notably, bacterial genomes, which showed a general absence of OQs in K + , are characterized by low G/C ratio within PQS motifs.
Sequence features, either individually or in combination with each other, can be used to predict the PQS scoring proportion by performing a linear model fitting. We also considered as additional predictive feature the G4Hunter score, which considers G-and C-richness and G/C asymmetry (18). Among all the features tested, G/C ratio, the G4Hunter score and the linear combination of G and C (or GG and CC) produced the best fitting (all R > 0.8; Supplementary Figure S7), confirming that G and C richness are the strongest determinant of G4 formation and stability. The negative effect of cytosines on G4 stability, assessed either as C-richness alone or in relation to G richness (G/C ratio and G4Hunter score), has been suggested for RNA (36)(37)(38) and for DNA (18,20). We have now observed and quantified this genome-wide across species, which explains the majority of PQS not scoring as stable G4s in bacteria genomes.

Genomic location of OQs in different species
To understand how OQs distribute with respect to key genomic structural elements, we downloaded gene annotations for all species and counted the OQs (considering all three categories) occurring in each region (Supplementary Table S4, Materials and Methods). The distribution of G4s showed considerable variation across species (Supplementary Figure S8). Enrichment or depletion of OQs in different regions was determined by comparison to random occurrence (Materials and Methods) within the same species ( Figure 5A-D and Supplementary File S1). The most striking observation was for human, mouse and Trypanosoma, where we observed a strong enrichment of OQs at gene promoters (1 kb upstream of TSS) and in 5 UTR regions, with human having the strongest enrichment ( Figure  5A). In contrast, other eukaryotic species (C. elegans, zebrafish and Drosophila) showed depletion at these and other (e.g. exons, 3 UTR) intragenic regions ( Figure 5B). Saccharomyces, Leishmania and Plasmodium genomes similarly showed depletion at intragenic regions ( Figure 5C), but differently from the previous group (C. elegans, zebrafish and Drosophila) did not exhibit enrichment at non-coding RNAs. The last group, Rhodobacter, E. coli and Arabidopsis, did not show enrichment or depletion of OQs at any genomic regions.

Cross species analysis of OQs occurrence at promoter regions
The incidence of G4s at gene promoter regions is relevant for hypotheses linking G4 formation to gene transcriptional activity (21,(39)(40)(41)(42). We considered OQs at promoters ( Saccharomyces), of those we studied, to evaluate any crossspecies co-occurrence patterns (Methods, Supplementary File S2). A proper analysis of G-quadruplex evolutionary conservation was not the goal of this study and would require a different choice of organisms. However, inspecting multiple genomic maps can still provide insights into G4 promoter occurrence, and help generating hypotheses about similarities and differences of the G4-mediated transcriptional control in different species. Therefore, as part of this analysis we did not consider exact sequence conservation, as the genomes of these species are not closely conserved.
Hierarchical clustering analysis, where we analysed the signal at promoter of orthologues (Materials and Methods), showed 8 interesting co-occurrence patterns present in at least 200 promoters ( Figure 5E). More clusters could be observed (Supplementary Figure S9) but were relatively low in abundance (less than 200 promoters per group), therefore we restricted the analysis to highly abundant patterns.
Many gene promoters did not have OQs in the promoters of any species (n = 8514, cluster 1; Figure 5E and Supplementary Figure S10), but interestingly a consistent number had OQs only in human and mouse, either specifically in one species (n = 1951 and n = 2673 for human-and mouse-specific, respectively; clusters 2 and 5) or in both species (n = 1623, cluster 3). On the other hand, some promoters exhibited OQs with higher mismatch values, hence higher predicted stability, in human compared to mouse (n = 459, cluster 6) ( Figure 5E). Interestingly, a direct comparison of the mismatch values in human and mouse promoters highlights some similarities in OQs formation at both promoters (over 2000 regions) but also substantial differences, with over 5300 related promoters exhibiting OQs (i.e. mismatches ≥ 25%) in only one species ( Supplementary Figure S11). A lower but substantial number of promoters had OQs only in one species, either C. elegans (n = 216, cluster 12), Drosophila (n = 529, cluster 13) or zebrafish (n = 406, cluster 14) ( Figure 5E). Detailed heat-maps of the 8 major promoter OQs co-occurrence patterns can be inspected at Supplementary Figure S10 and Supplementary File S2. Other combinations, such as OQs at conserved promoters in multiple species (e.g. human, mouse and Drosophila or human, mouse and zebrafish) also existed (Supplementary Figure S9, clusters 9 and 17), but in lower number.
We performed gene ontology and KEGG pathways enrichment analysis to infer if any particular functional category was enriched in the 8 major cluster groups (Methods, Supplementary File S3). Consistent with previous reports (20,21,43), we observed G4s in human to be strongly associated with regulatory regions of cancer-related genes and somatic copy number variations. In particular, promoters having OQs in both human and mouse, but not other species, displayed enrichment in cancer pathways as well as in genes from the cancer gene catalogue COSMIC (83 genes from the cluster 3, hypergeometric test P-value <0.01 for the cosmic gene enrichment within the cluster) (Supplementary File S3, all clusters KEGG tab). Regarding the human/mouse specific OQs promoter (cluster 3), we also noted that pathways involved in development, neurological activity and cardiac function were enriched. These genes were also strongly enriched for transcriptional regulation and developmental processes (Supplementary File S3, all clusters BP tab), whereas human only OQs-containing promoters (cluster 2) were enriched specifically in amino acid transport pathways, protein sumoylation and protein folding, to name a few.

DISCUSSION
The G4-seq approach for sequencing G-quadruplexes exploits specific properties of G4 folding by comparing sequencing outputs in conditions that stabilise folded G4s with sequencing under conditions that do not stabilise G4s (20). Our second-generation approach employed here applies the same general principles but provides improved coverage at GC-rich and G4 regions. This improvement was particularly advantageous for establishing G4 maps in challenging, GC-rich genomes such as Leishmania and Rhodobacter, and for obtaining accurate information at GC-rich regions in the human and mouse genomes that, otherwise, would lack sufficient coverage. In our original G4seq based human genome map, 20% of the identified OQs could not be ascribed to a defined G4 motif, which represents the false positive rate of the method. Our improved method uses Li + instead of Na + in the reference sequencing run (Read-1), leading to a lower basal level of G4 stabilisation. This change reduced the apparent false positive rate to just 8%. Another improvement was the ability to separate proximal G4 peaks by adopting a smaller window in the scoring pipeline to increase spatial resolution, which increased G4 peak resolution.
A striking observation based on the multi-species OQ maps is the strong depletion of G4s observed in bacterial genomes and in yeast ( Figure 2C). Previous computational studies have made predictions about G4 formation leading to suggestions about potential regulatory roles on transcription in bacteria (44,45) and highlighted the effects of G4s in causing genetic instability in yeast (46). At first glace, our data actually aligns with a recent study (47) that experimentally investigated RNA G-quadruplex formation in bacteria and suggested that G4s may have been deselected through evolution. However, a higher proportion of predicted G4 motifs were detected as OQs in bacteria and yeast upon inclusion of the small molecule PDS ( Figure 2C), suggesting some potential to form G4s. Further stabilization of the G4 during specific cellular processes, e.g. by protein interaction or in specific genetic backgrounds, could enable G4 formation and induce associated cellular effects. For example, functional effects of G4s have been specifically observed in FANCJ mutants both in C. elegans (48) and human cells (49), and after G4 stabilisation by small molecules or genetic PIF1 deletion in yeast (50).
We found that key sequence features, such as G and C richness, and the G to C ratio within the PQS motifs, can explain the global depletion or abundance of G4s observed in different species in K + ( Figure 4). As this is a global correlation analysis, predictions of individual G4s would need more sophisticated machine-learning approaches, as recently exemplified for the human genome (17). Another striking outcome of our study is the difference between species with regard to where in the genome G4s are positionally enriched. We observed strong G4 enrichment at promoter and TSS regions specifically in higher species such as human and mouse. Interestingly, the Trypanosoma genome also showed a similar enrichment pattern, despite being evolutionarily distant from mouse and human, which may reflect similarities in the G4 biology. Thus, the link between G4s and transcriptional control is worthy of further exploration in the future. Other vertebrates show, instead, a mild depletion at promoter regions, or even a strong reduction at any intragenic features, such as exons and UTRs ( Figure 5A-D). In Ding et al. (35), where they studied Gquadruplex formation for hundreds of microorganisms with a focus on the bacterial orders Deinococcales and Thermales, Thermales does not show G4 density near the TSS while Deinococcales does. In our data, on the other hand, Trypanosoma shows a genomic pattern of G-quadruplex occurrence similar to human and mouse. These observations, take together, suggest that species close in evolution can display fundamental differences in G4 localization and function, and vice versa distant species can present intriguing similarities. This interesting aspect will benefit from future evolutionary studies.
Clustering analysis of promoter co-occurrence for the six higher eukaryotic species revealed patterns of speciesspecific occurrence, such as the OQs found exclusively in mouse and human, or OQs unique to a single species. Preliminary enrichment analysis revealed processes or pathways associated with particular sub-classes of promoters, such as cancer genes for the promoter OQs specific to both mouse and human. A related observation was previously reported for human by computational analysis (43), in vitro (20) and in cells (21). This association suggests a specific role for G4 at promoters of oncogenes in mammals.
Evaluating where G4s form within the genome provides insights into how they may be exploited functionally, for example during development (43,51), and how they may be targeted for the treatment of conditions such as cancers (52,53). Our study has identified over 3.5 million G4s across 12 species, which constitutes the largest experimental G4 dataset to-date. This large dataset may enable computational and machine-learning approaches to better elucidate sequence and structural features for G4 formation, leading to improved predictors. We anticipate the G4 maps will be a valuable resource for the scientific community to probe and understand biology that might involve G4s.

DATA AVAILABILITY
Raw and processed data for all 12 species from this study have been submitted to the NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE110582. A detailed explanation of all the processed files is provided in Supplementary File S4.