Mapinsights: deep exploration of quality issues and error profiles in high-throughput sequence data

Abstract High-throughput sequencing (HTS) has revolutionized science by enabling super-fast detection of genomic variants at base-pair resolution. Consequently, it poses the challenging problem of identification of technical artifacts, i.e. hidden non-random error patterns. Understanding the properties of sequencing artifacts holds the key in separating true variants from false positives. Here, we develop Mapinsights, a toolkit that performs quality control (QC) analysis of sequence alignment files, capable of detecting outliers based on sequencing artifacts of HTS data at a deeper resolution compared with existing methods. Mapinsights performs a cluster analysis based on novel and existing QC features derived from the sequence alignment for outlier detection. We applied Mapinsights on community standard open-source datasets and identified various quality issues including technical errors related to sequencing cycles, sequencing chemistry, sequencing libraries and across various orthogonal sequencing platforms. Mapinsights also enables identification of anomalies related to sequencing depth. A logistic regression-based model built on the features of Mapinsights shows high accuracy in detecting ‘low-confidence’ variant sites. Quantitative estimates and probabilistic arguments provided by Mapinsights can be utilized in identifying errors, bias and outlier samples, and also aid in improving the authenticity of variant calls.


Supplementary Methods
Library preparation and sequencing. Whole genome libraries were prepared using the TruSeq Nano DNA Library Preparation Kit (Illumina) with 100ng of good quality genomic DNA (assayed by nanodrop-based spectrophotometric and Qubit based fluorometric method) in accordance with the manufacturer's instructions. Briefly, genomic DNA was sheared using the Covaris sonicator to a target size of 350 bp followed by end-repair, selection of library size, adenylation and adapter ligation steps. Further, DNA fragments were amplified using PCR. The amplified DNA libraries were checked in Agilent 2200 TapeStation (Agilent Technologies) using high sensitivity D1000 ScreenTape, quantified using Qubit based fluorometric method and Real Time PCR, normalised and subjected to equimolar pooling. Finally the pooled libraries were loaded on S4 flow-cell (illumina) and sequenced on an Illumina NovaSeq 6000 instrument using 2 x 150bp cycles protocol. All tools are run on Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz with 1 core and in default mode except qualimap2. Qualimap2 throws a memory error for --java-mem-size <=4GB. So we run it with --java-mem-size=6G. Samtools-stats took least time to complete the analysis because it included less number of features compared to other tools. Among the rest of the tools (except samtool stats) Mapinsights-bamqc modules show comparable or less run time, although it includes few novel features that are computationally intensive.

Processing of open source data
For Nextera data FASTQ files were downloaded from Genome in a bottle consortium (See Supplementary Table S1 for more details). The reads are generated on two different lanes of a single flow-cell using two different indexes. The FASTQ files contain~168.4 million reads of length 100bp. Lane and index wise processing was done following GATK best practices protocols. Downloaded reads are aligned against hg19 decoy reference sequences using BWA mem. Sequence alignment map (SAM) file generated after alignment was sorted based on chromosomal coordinates and converted to BAM file using PICARD. Afterwards, the BAM file fed to the PICARD tool for marking the optical duplicates followed by indel realignment and base quality recalibration using GATK. Lane and index wise recalibrated BAM files were merged to generate the final aligned file.
For TruSeq data, FASTQ files were not accessible from GIAB resources. Two BAM files namely NIST-hg001-7001-ready.bam and NIST-hg001-7001-b-ready.bam were available. NIST-hg001-7001-ready.bam was taken for analysis. Samtools was used to sort the BAM file according to read-id and converted to paired FASTQ files using BEDtools (65). Resultant FASTQ contains reads of various lengths and maximum read length was 150bp. Out of~53 million reads present in the FASTQ files~37 million reads (69.8%) are taken based on read length >= 148bp for further processing. This data was sequenced on two different lanes of a single flow-cell. The selected reads are then lane wise separated and processed following protocol mentioned above. Finally, a BAM file was generated by merging the final alignment file from both the lanes. The mean depth of coverage is~30X for TruSeq whereas~146X for Nextera. We have downsampled the Nextera data to~30X for better comparison between both the data sets. Similarly. For lung carcinoma data we only extract reads having read length of 100 from FASTQ files and follow the best practice protocol for further processing. See supplementary Data 5 for detailed Information pertains to data downloads.

Mapinsights-bamqc feature description with respect to HG01402 sample:
We have observed a sudden rise in substitution rate in cycle number 62 in the overall per cycle profile. As the bamqc module provides detailing in paired-end orientation fashion so we are able to enquire further with respect to cycle specific error. We found that the bias occurs only in the second-in-pair and in both the strands (Supplementary Figure S3A). C>A/G>T and A>C/T>G are found to be the major substitutions as observed in the overall substitution profile. On further investigation, we observed C>A/G>T substitution has strand specificity. C>A occurs more in the forward strand and G>T in the reverse strand whereas A>C/T>G occurs almost equally in both the strands (Supplementary Figure S3B). Further from Supplementary Figure S3C, it is evident that read2 contains more errors compared to read1.

Profile of exome sequencing data downloaded from 1000G:
Profile of substitution categories: The substitution pairs like C>T/G>A, A>T/T>A etc have shown similar trends in percentage change with the exception in G>T and C>A where the deviation is sharp. Like whole genome HiSeq data here also A>C and G>T suffer the most with respect to low base quality where more than 50% of such changes have base quality <=10 in almost all 30 samples ( Figure 34-middle panel).
Per cycle profile: The average substitution rate is around 1.27 in all cycles in all 30 samples and the range is between o.53 to 33.3. In general, the rate is higher in the first cycle and in the last five cycles. There is a sharp drop in quality observed in the first and last cycles, however even if the substitution rate is higher in the last few cycles (except last one), an increase in base quality is also observed ( Figure 6A) indicating error incorporation before sequencing most likely due to end-repair artefacts. C>T/G>A and A>G/T>C are the major nucleotide changes across cycles and decrease at the end ( Figure 6A-lower panel) whereas A>C/T>G occur more at the end of the reads, showing a reverse pattern. A>T/T>A substitutions are higher in first base which is unusual and likely artifacts.

Read mismatch content:
In 30 samples the average % of reads having one mismatch is 14.68 with range from 11.98 to 18.97(Supplementary Figure S7) and the percentage decreases sharply with increased mismatches.
Exonwise gene depth of coverage profile: Depth status of BRCA1 gene is used to explain the functionality and usefulness of Mapinsight's genedepth module on two samples i.e HG00114 and HG02155 ( Figure 6D). In the exome target bed file downloaded from 1000G has 21 exons in it for BRCA1 gene. The mean depth of coverage (DCOV) for BRCA1 is around 122X in sample HG00114, whereas it is 88X in HG02155, however DCOV varies across different exons and it ranges from 0 to 228X in HG02155 and the range is 20 to 210X in HG00114.

Profile of NA12878 exome sequencing data (Truseq Vs Nextera):
Profile of substitution categories: It is observed that A>G/T>C (11.4%/11.6%) transversions are the highest among other substitution categories in Nextera data whereas C>A(13.7%) transitions are predominant in TruSeq data which is more than 2 fold higher compared to G>T(5.7%) substitutions indication error ( Figure 4C-upper panel). Importantly the quality score of C>A changes are mostly (70%) > QV 20, that means good quality bases suggesting the errors might have occurred before sequencing possibly during library preparation. Like the other datasets analyzed here also we observed the same pattern, i.e, more than 70% A>C and T>G changes have quality < 10 and an abundance of C_C (30.49%) and G_G (29.93%) context ( Figure 4C-lower panel). Additionally, two other contextual changes were observed in higher frequency i.e GAG>GGG and CTC>CCC in nextera data.
Per cycle profile: In Nextera data ( Figure 6B), the average substitution rate per cycle is~1.5 and a significantly high substitution rate is observed in cycle number 3,4 and the last cycles with substitution rate 3.5, 2.1 and 5.5 respectively. Whereas cycle number 95 is severely affected in TruSeq data with a substitution rate~15.2, however the other cycles are not that much affected and the substitution rates are relatively low, mostly similar to other nearby cycles. Additionally, C>A substitutions are predominant in each cycle except cycle number 95 in Truseq data ( Figure 6C).

Exon-wise gene depth of coverage profile:
We have run the genedepth module on both Nextera and TruSeq data sets for CDKN2A gene. CDKN2A was found to be associated with different diseases including multiple cancers. The number of exons (n=4) and their coordinates are the same in both TruSeq and Nextera captures. The bam files used in this analysis have~30X mean depth of coverage for both the data. In Nextera data the mean DCOV of CDKN2A gene is 19.38X whereas it is 24.55X in TruSeq data ( Figure 6E).

Profile of fresh frozen and & FFPE exomes:
Profile of substitution categories: A>G and T>C are the dominating changes (> 13.7%) observed in fresh frozen tumor and normal samples in both the patients whereas C>T and G>A substitutions are found to be major changes (>13.9%) in both the FFPE data reflecting the know error signal associated with FFPE ( Figure 4D).
Per cycle profile: Substitution rate is higher at both end cycles in case of FFPE data compared to fresh frozen data. In T3-FFPE data cycle number 94 has a substitution rate of 4.7% that is 2 fold higher compared to nearby cycles indicating bias and a drop of quality also observed in the same cycle (Supplementary Figure S8). Figure S1. Flowchart Figure S6. Percentage of C>A and G>T variants (both raw (left) and filtered (right)) between two clusters; cluster1 : variant calls from 21 exomes and cluster2 : variant calls from 9 exomes. A significant deviation between C>A and G>T is observed in cluster2 with p-value = 0.01 in raw and p-value = 0.005 in filtered variant calls across 9 exomes. Figure S7. Read-mismatch-content profile of NA12878 sample sequence in various sequencing platforms (HiSeq, NovaSeq, MGI and AVITI) and also of 30 exome datasets sequenced using HiSeq in overall and paired-end orientation manner. In each data set, read2 contains more mismatches compared to read1.