Assembly-free discovery of human novel sequences using long reads

Abstract DNA sequences that are absent in the human reference genome are classified as novel sequences. The discovery of these missed sequences is crucial for exploring the genomic diversity of populations and understanding the genetic basis of human diseases. However, various DNA lengths of reads generated from different sequencing technologies can significantly affect the results of novel sequences. In this work, we designed an assembly-free novel sequence (AF-NS) approach to identify novel sequences from Oxford Nanopore Technology long reads. Among the newly detected sequences using AF-NS, more than 95% were omitted from those using long-read assemblers and 85% were not present in short reads of Illumina. We identified the common novel sequences among all the samples and revealed their association with the binding motifs of transcription factors. Regarding the placements of the novel sequences, we found about 70% enriched in repeat regions and generated 430 for one specific subpopulation that might be related to their evolution. Our study demonstrates the advance of the assembly-free approach to capture more novel sequences over other assembler based methods. Combining the long-read data with powerful analytical methods can be a robust way to improve the completeness of novel sequences.


Introduction
Building a complete reference genome in humans is fundamental for decoding genetic variation and the associations with human diseases. 1 However, the current reference genome was derived from few individuals, resulting in an underrepresentation of the human population. 2,3 To increase its diversity, human genome projects have assembled genomes across and within subpopulations. 4,5 In general, DNA sequences missed in the human reference genomes are called novel sequences. These missed genomic sequences may contain thousands of unknown genetic variants implicated in biological functions. Therefore, identifying novel sequences can enrich the human reference genome and facilitate genomebased research.
Discovering novel sequences is dependent largely on the development of sequencing technologies. Initially, novel sequences were defined using fosmid end sequence pairs 6 and the entire fosmid clone. 7 However, the high expense of capillary sequencing made the creation of large-scale genomes impractical. In contrast, next-generation sequencing (NGS, also known as short-read sequencing) technologies enable the sequencing of thousands of samples with higher efficiency and at lower cost, resulting in more complete genomic information. Recently, an African pan-genome built with 910 African descents discovered 296 Mbp of novel sequences. 8 We built a Chinese pan-genome using 486 Chinese genomes and got 276 Mbp of novel sequences. 9 These studies have increased the human genome diversity and the related functional implications of novel sequences.
However, there is a major limitation for NGS short-reads <300 bases long that are too short to detect more variants, especially >70% of human genome structural variations. 10 To overcome this shortcoming, long-read sequencing has emerged, including Oxford Nanopore Technology (ONT), which generates long (10-100 kbp) and ultra-long (>100 kbp) DNA reads. 11 The average lengths of over 10,000 bp are helpful for analyzing structural variations and de novo assembly. [12][13][14] The first gapless human reference was created using PacBio HiFi and ONT, which incorporated 200 Mbp sequences absent in GRCh38.p13. 3 Currently, the detection of novel sequences relies mainly on genome assembling from long-read data. For example, 12.8 Mbp of novel sequences from a Chinese assembly 15 and over 10 Mbp of novel sequences from each individual in two Swedish genome projects. 16 In 2019, a landmark work used 15 samples to produce the largest long-read structural variant callset and reported 6.4 Mbp of novel sequences per individual. 17 These studies discovered a certain number of novel sequences, but considering varying lengths of DNA reads and different analytical approaches, it is necessary to evaluate their impact on building complete novel sequences.
In this study, we defined DNA sequences missed from the human reference and longer than 300 bp as "novel sequences." We proposed an assembly-free novel sequence (AF-NS) approach that performs quick identification of novel sequences without assembling processes. Derived from ONT long reads, the AF-NS detected novel sequences covered over 90% of Illumina novel sequences and contained more DNA information missing from the Illumina data. Our findings show the advantage of AF-NS in obtaining more intact DNA sequences while saving considerable computational resources, and the importance of long-read sequencing in building a complete picture of novel sequences. Furthermore, we uncovered the biological significance of the identified novel sequences and the population-specific novel insertions.

Samples and data sources
To study the effects of sequencing technologies on detecting novel sequences, we selected both ONT long reads and Illumina paired-end short reads from six samples-an Ashkenazim trio (HG002, HG003, HG004) and a Chinese trio (HG005, HG006, HG007)-extracted from the Genome in a Bottle (GIAB) project. 18 The ONT reads were basecalled using Guppy v4.2.2. To keep the consistent sequencing depth for parent-offspring trios, we down-sampled the three Ashkenazim samples (HG002-HG004) to 50-fold and the three Chinese samples (HG005-HG007) to 40-fold. To guarantee the identified sequences/reads that are truly absent in the human reference, we combined the latest references, GRCh38.p13 and chm13.v1.1, as the new reference. In addition, we removed chromosome Y when identifying novel sequences of HG004 and HG007 because the two samples were female.

AF-NS: assembly-free novel sequences construction
As shown in Fig. 1, AF-NS carried out an assembly-free pipeline consisting of three main steps. First, we aimed to obtain high-quality unmapped reads. All ONT long reads were aligned to references using minimap 2.17-r941, 19 and reads with unmapped fragments longer than 300 bp were selected. We then discarded low-quality reads with <10 Q-score by NanoFilt 20 and cut the adaptors of reads using porechop v0.2.4. 21 Second, we collected the unmapped sequences by three-round alignment-filtering. The first round was to align the filtered reads to references using minimap2 (-map-ont) and to obtain the unmapped fragments. The second round was to align these unmapped sequences generated from the first round to references using NUCmer 4.0.0rc122 22 (-l 15 -c 31). This would further remove the sequences mapped to the references. In last round, we aligned the remaining sequences to references again using minimap2 with the default parameters. Any fragments aligned to references and shorter than 300 bp were removed. Third, we eliminated sequences labelled as archaea, bacteria, fungi, plasmid, viral and UniVec (confidence score > 0.05) by Kraken2 23 and used minimap2 to overlap the remaining sequences. Two sequences were clustered together if they aligned with each another with >80% coverage. We chose the longest sequence in a cluster as its representative. Since the long-read sequencing could be inaccurate in low-complexity regions, 24 we utilized RepeatMasker 4.1.2-p1 25 to annotate the novel sequences and removed sequences if 80% of them were low-complexity or simple repeats. The remaining ones were collected as the final set of novel sequences.
2.3. novel_WG: extraction of novel sequences from the whole-genome assembly We executed whole-genome assemblies using two popular assemblers, Shasta 0.7.0 26 and raven 1.7.0, 27 and aligned the assemblies to references using NUCmer. Any fragments mapped to the reference with ≥80% identity were removed. Next, we aligned the unmapped fragments >300 bp to references using BWA-mem v0.7.17 28 and further eliminated the mapped parts if the alignment identity reached 80%. Then, to ensure the novelty of the remaining sequences, we aligned them to references using BWA-mem again and discarded any sequences that had hits to references satisfying >80% identity and >50% coverage. Sequentially, we used Kraken2 to remove contaminated sequences to obtain the final set of novel sequences. For Illumina data, we performed de novo assembly using MEGAHIT v1.2.9. 29 Other procedures were the same as those used for long-read sequencing data.

unmap_ASM: assembly of unmapped reads
Illumina reads were mapped to human references using BWAmem, and the unmapped parts were obtained and assembled into sequences using MEGAHIT. Next, we deleted novel sequences shorter than 300 bp and aligned the remaining ones to references using BWA-mem. Any sequences mapped to references with identity ≥ 80% and coverage ≥ 50% were filtered out. Finally, we eliminated contaminants identified by Kraken2 and generated a set of novel sequences.

Detection of novel placements
We aligned the ONT long reads to chm13 v1.1 using minimap2 and retained alignments with ≥80% identity. For reads with one or two consistent alignments, we extracted the corresponding unmapped fragments longer than 1,000 bp. These fragments were classified into two groups: (i) those with a single end aligned to the reference (SEP) and (ii) those with both ends mapped to the reference (BEP). Then we clustered two types of fragments. Two BEP sequences were merged if the length difference between their overlap and length was less than 100 bp. SEP sequences within 100 bp of each other were combined using BEDtools. 30 The longest sequence in a cluster was selected as its representative. We excluded clusters containing fewer than three sequences to obtain the final placement clusters.

Procedure for detecting long-read novel sequences
To examine the effects of different length reads on the discovery of novel sequences, we chose six samples, an Ashkenazim trio (HG002, HG003, HG004) and a Chinese trio (HG005, HG006, HG007), both of which have ONT long-read and Illumina short-read data. First, we conducted two widely used strategies to acquire long-read novel sequences, named novel_WG and unmap_ASM. The first approach involved extracting novel sequences from the wholegenome assembly. We used Shasta and raven to generate the whole-genome assemblies. However, the resulting novel sequences differed greatly in size, and more than 50% of them could not be aligned against each other with ≥80% identity ( Table 1, Supplementary Table 1). This inconsistency might be due to different whole-genome assemblers. The authenticity of these unmapped sequences requires further verification to determine whether there are assembly-caused artifacts. The mapped ones were treated as common novel_WG sequences with high confidence. Using the unmap_ASM approach, we obtained only a few novel sequences, indicating its ineffectiveness for long-read data (Supplementary Table 2).
To discover more complete long-read novel sequences, we designed a new pipeline AF-NS, consisting of a three-step procedure without using assemblers (Fig. 1). The first step obtained high-quality reads unmapped to the human reference and trimmed their adaptors. Then, through a threeround alignment-filtering process, the unmapped fragments longer than 300 bp were extracted from the trimmed reads as novel candidates. During the last step, after removing contaminants, similar candidates were clustered together to generate the final set of novel sequences. Compared with other methods, which assemble reads into contigs (contiguous sequences), AF-NS obtains novel genomic information from high-quality long reads. This ensures its efficiency in detecting novel sequences through minimizing sequence loss due to assembling. Workflow for identification of long-read novel sequences by assemble-free novel sequence (AF-NS) approach. Using long reads of Oxford Nanopore Technology (ONT), AF-NS carries out three main steps to discover the novel sequences missed from genome references in humans. AF-NS is a newly designed assembly-free approach. novel_WG detects novel sequences from the long-read whole-genome assembly generated using two assemblers, Shasta and raven, respectively.

Comparison of novel sequences
First, we compared AF-NS with the assembly-based approach novel_WG. AF-NS does not require computationally intensive whole-genome assembly and can detect novel sequences 15 times more than novel_WG (Table 1). We note that over 85% of the common novel_WG sequences were mapped to AF-NS-identified sequences with ≥80% identity (Table 2), showing the superiority of AF-NS in capturing more novel genomic information. Then we aligned the AF-NS sequences to whole-genome assemblies using NUCmer. The mapping percentage was less than 5% with ≥50% coverage (Table 3). This result indicates that current long-read assemblers lose most of the novel genomic information.
Next, we compared the novel sequences generated from different sequencers. We first constructed Illumina novel sequences of the six samples using two methods, unmap_ ASM and novel_WG. The size of novel_WG sequences was slightly larger than that of unmap_ASM sequences (Table  4). We aligned novel_WG and unmap_ASM sequences with each other and found that ~85% of unmap_ASM sequences or ~95% of novel_WG sequences could be mapped together with ≥ 80% identity (Supplementary Table 3). This result shows the consistency of novel sequences obtained from the two approaches. To minimize assembly-induced artifacts, we retained only the novel sequences detected by both.
The HG005 sample was excluded because the size of novel sequences discovered from its Illumina reads was considerably higher than that of the other samples (Table 4).
As shown in Fig. 2A, the AF-NS identified sequences covered more than 91% of Illumina novel sequences under ≥80% identity and ≥50-bp alignment length. By contrast, only 58-80% of the Illumina novel sequences were mapped to the long-read whole genomes using an assembler Shasta or raven (Fig. 2A). This analysis verifies the effectiveness of the AF-NS approach in discovering intact novel sequences. Because the above two identification methods may lose some novel sequences, we mapped entire Illumina raw reads to the AF-NS sequences using BWA-mem. Under the same threshold, less than 15% of the AF-NS sequences were covered by Illumina data (Fig. 2B). We also ran BUSCO 31 on both ONT and Illumina whole-genome assemblies using the eukaryote set of orthologs from OrthoDB v.10. 32 As expected, the shortread whole genome was not as complete as the long-read assemblies (Supplementary Table 4). These results indicate that short-read sequencing does miss a lot of genomic information.

Characteristics of novel sequences
According to our previous definition of the novel sequence composition, 9 we classified the AF-NS detected sequences as common and individual-specific components. A sequence was considered common if it could be mapped to other samples' sequences by satisfying ≥80% identity. Each of the six samples contained 1-2 Mbp common sequences, accounting for 6-8% of total sequences (Supplementary Table 5). To search for the origin of the AF-NS sequences, we aligned them to the chimpanzee genome (GCF_002880755.1) using NUCmer. Under the minimum requirement of 80% identity, about 7% of the novel sequences were present in the chimpanzee genome. By contrast, the common sequences of each sample obtained much higher alignment percentages of ~60% compared with individual-specific sequences of 2-3% (Fig. 3A).
To determine the novelty of the AF-NS sequences, we mapped them to existing novel sequences of three long-read sequencing samples, HX1 (Chinese) and two Swedes. 15,16 Few AF-NS sequences were aligned with ≥80% identity, indicating that most of them have not been reported (Supplementary Table 6). Next, we compared the AF-NS sequences with two pan-genomes, the African pan-genome and the Chinese pangenome. 8,33 Only about 3-7% of the AF-NS sequences could be matched to either one, implying a great number of novel sequences lost in the pan-genomes ( Fig. 3B and C). The low Common novel_WG: the novel sequences extracted from both Shasta and raven assemblies using novel_WG approach can be aligned with each other with ≥ 80% identity. AF-NS sequences: long-read novel sequences identified by the AF-NS approach. The AF-NS sequences mapped to long-read whole genome assembly generated by two assemblers, Shasta and raven, with ≥50% coverage. Novel_WG represents a method that identifies novel sequences from the whole-genome assembly. unmap_ASM represents a method for assembling reads unmapped to the human reference into sequences.
Assembly-free discovery of human novel sequences 5 alignment rate is possibly because the current pan-genomes were assembled using short-read data. However, the percentage of alignment to pan-genomes was still three to four times higher than that to the long-read novel sequences of an individual, emphasizing the importance of large-scale sequencing and assembling for obtaining complete genomic information.
In order to reveal the biological significance of the identified novel sequences, we, respectively, extracted the highly common sequences shared among at least four, five, and all six samples. Transcription factors (TFs) represent the main regulators that control gene transcription by binding to the DNA sequences. We performed TF binding motif searching on the highly common and individual-specific sequences using fimo 34 and based it on the TRANSFAC human dataset. 35 With a cutoff of P-value < 5e-8, we searched for TF binding to these highly common sequences. There were 71 TF motifs identified to bind to the sequences shared by all six samples (Fig. 3D, Supplementary Table 7). We display binding patterns of 15 main TFs that are believed to have a broad set of biological functions (Fig. 3E)

Analysis of novel placements
We positioned novel sequences in chm13.v1.1 and allowed the placed sequences within 100 bp of each other to be clustered together. We obtained 1,637 placed clusters among the six samples and used the corresponding placements to analyze the inserting preferences. The majority of placements (69%) were located in repeat regions, especially satellite and microsatellite regions (Supplementary Table 8, Fig. 4A). The enrichment of novel sequences in repeats might be due to the The percentage of short-read novel sequences of Illumina that were aligned to long-read sequences; AS-NS: long-read novel sequences identified by the AS-NS method; Shasta: long-read whole-genome assemblies generated using Shasta; raven: long-read whole-genome assemblies generated using raven. (B) The percentage of long-read novel sequences that were mapped to short-read raw reads; mapped: long-read novel sequences were mapped to short-read raw reads; unmapped: long-read novel sequences were not mapped to short-read raw reads. All alignments were based on at least 80% identity and 50-bp alignment length.
highly unstable nature of these regions, which are prone to generating variants. These variations probably drive genome plasticity and promote evolution. 36,37 In addition, we found 54 and 68 novel placements were in centromeres and telomeres, respectively, where genomic information is difficult to obtain using short-read data. 11 Long-read sequencing enables the mapping of novel sequences in highly condensed heterochromatin regions.
Furthermore, we noticed among the novel placements, 1,514 common ones were present in at least two samples (Fig. 5A). Since the GIAB samples were from Ashkenazim and Chinese subpopulations, we further divided the common novel placements into population-shared (across subpopulations) and population-specific (within a subpopulation). The 1,084 population-shared insertions as patches can supplement the latest reference genome. We obtained 430 populationspecific insertions, including 198 Chinese-specific and 232 Ashkenazim-specific (Fig. 5B) insertions. Then, we found 260 population-specific insertions present in only one parent and its offspring. If these mutations are newly generated in the parents, they may serve as new heritable variants.

Discussion
The identification of novel sequences is a basis for complementing the existing reference genome in humans. 38 However, influenced by different sequencing technologies and computational methods, the derived novel sequences vary considerably. 16,39,40 Finding a widely accepted standard for constructing novel sequences is a challenging issue in genomics research. Currently, there are two strategies for building novel sequences. The first one involves filtering out reads mapped to references and assembling the unmapped reads into novel sequence sets. 8,9,41 Another involves performing the whole-genome de novo assembly and extracting the unmapped sequences. 16,40 However, the quality of the identified novel sequences using these strategies is highly dependent on the performance of assemblers. When assembling unmapped reads, some novel sequences can be lost because of insufficient supporting reads or overlaps between reads that fail to meet assembly requirements. 42 Owing to the short length of NGS data, genome assembly is essential for identifying novel sequences that exceed the read lengths. By contrast, longread sequencing can generate reads longer than 100 kbp. Therefore, a single read can decipher large structural variations, which guarantees the feasibility of AF-NS to discover novel sequences at read level. Our assembly-free strategy includes a series of processes to ensure the completeness and accuracy of the identified novel sequences. First, the selected ONT reads during the first step should be of high quality. Second, the contaminants in the clustered candidate sequences must be filtered out. Last, we removed highly simple repetitive sequences from the final set due to the high error rate of longread sequencing in low-complexity regions. 24 To obtain more complete novel sequences, we did not use the self-correction approaches that are expected to correct the sequencing errors because their performances are largely affected by relatively low sequencing depths, 43 this possibly causes missing of novel sequences due to no sufficient long-read coverages. The performance of AF-NS demonstrates its advances in identifying novel sequences that have a larger size and better characteristics over other methods and saves considerable computational resources.
With the advancement of sequencing technologies, different ones have been applied to detect novel sequences. 15,39,44 However, the corresponding analyses focused only on specific data sets. Here, we compared the novel sequences generated using short-and long-read data and found a big difference, i.e., that most of the long-read sequences cannot be covered by short-read ones. Also, most of the long-read novel sequences were absent in Illumina pan-genomes even though the pan-genomes were built using hundreds of samples. These findings demonstrate that utilizing long-read technologies is the inevitable trend for creating complete novel DNA sequences.
Interaction between TFs and TF binding DNA sites on promotors is a key for transcriptional gene regulation. To analyze biological associations of the AF-NS novel sequences, we identified 71 TF binding motifs on the highly common sequences shared by all samples. These TFs are assumed to play regulatory functions involving many basic biological processes and pathways. Although the TF binding sites were not confirmed to promoters or non-coding regions, this finding suggests a wide biological connection to the newly detected sequences. Moreover, we discovered the population-shared and population-specific insertions by clustering placed novel sequences of six samples. The population-shared insertions indicate that the latest human reference genome is still underestimated, possibly because chm13 was derived from one genome. 3 The populationspecific insertions may be the force of the subpopulation evolution. Considering our classification using only six samples, increasing individuals is expected to assist in establishing more representative population-specific or common insertions. These population-specific insertions will probably allow us to determine the relationships of novel sequences with human evolution.
In summary, AF-NS represents a rapid sequence detection method from long reads and outperforms the existing assemblers in discovering intact novel sequences. The assembly-free design considerably reduces computation costs and facilitates the construction of large-scale novel sequence genomes. The superiority of long-read sequencing in identifying more novel sequences and positioning them in difficult-to-assemble regions, like centromeres and telomeres, shows its great potential for future novel sequence research. The high cost of long-read sequencing is still a challenge for generating numerous genomes. However, as more samples using advanced sequencers become available, we believe that we can establish a more comprehensive landscape of novel sequences.

Conflict of interest
R.L. receives research funding from Oxford Nanopore Technologies. The remaining authors declare no competing interests.