PBSIM3: a simulator for all types of PacBio and ONT long reads

Abstract Long-read sequencers, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencers, have improved their read length and accuracy, thereby opening up unprecedented research. Many tools and algorithms have been developed to analyze long reads, and rapid progress in PacBio and ONT has further accelerated their development. Together with the development of high-throughput sequencing technologies and their analysis tools, many read simulators have been developed and effectively utilized. PBSIM is one of the popular long-read simulators. In this study, we developed PBSIM3 with three new functions: error models for long reads, multi-pass sequencing for high-fidelity read simulation and transcriptome sequencing simulation. Therefore, PBSIM3 is now able to meet a wide range of long-read simulation requirements.

The error rates were calculated from the alignments between the reads and their reference genomes.

Figure S3: Non-uniformity of errors of ONT reads for E. coli O127. After grouping reads by their accuracy, they were segmented into fixed-size (100, 200, 400, 800, 1600, and 3200 bp) disjoint intervals, and the accuracy of each interval was computed. Each graph shows the distribution of the averaged accuracy of each interval, where the color of the plotted lines represents read groups (e.g., 'Acc.78' refers to a read group with an accuracy of 77.5-78.4%).
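The interval analysis described in this caption can be sketched as follows. This is a minimal illustration rather than the authors' code. It assumes that each read's alignment has already been reduced to a string of per-column states (M: match, S: substitution, I: insertion, D: deletion), that intervals are taken over alignment columns, and that accuracy means the fraction of match columns. The function name `interval_accuracies` is hypothetical.

```python
from typing import List


def interval_accuracies(states: str, size: int) -> List[float]:
    """Split an alignment-state string into disjoint fixed-size intervals
    and return the fraction of match ('M') columns in each full interval."""
    if size <= 0:
        raise ValueError("size must be positive")
    accuracies = []
    # Step through the alignment in non-overlapping windows. A trailing
    # partial window is dropped so that every interval has the same size.
    for start in range(0, len(states) - size + 1, size):
        window = states[start:start + size]
        accuracies.append(window.count("M") / size)
    return accuracies


# Toy alignment of 20 columns (assumed data, for illustration only).
example = "MMMMSMMMIMMMMMMMDMMM"
print(interval_accuracies(example, 10))  # [0.8, 0.9]
```

Repeating the call for each interval size (100, 200, 400, 800, 1600, and 3200 bp) and pooling the results per accuracy group would give distributions of the kind plotted in the figure.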

Figure S10: Emission probability matrices of states of FIC-HMM. The horizontal axis represents alignment states (M: Match, S: Substitution, I: Insertion, D: Deletion). The vertical axis represents states of FIC-HMM, which are sorted in descending order of the M(atch) probability emitted by each state. The states on the vertical axis emit the alignment states on the horizontal axis. The sum of emission probabilities for each state on the vertical axis is 100%. These are matrices of 'Acc.82'-'Acc.88' (e.g., 'Acc.84' refers to a read group with an accuracy of 83.5%-84.4%).

Figure S13: Read length distribution of PacBio Iso-seq (HiFi reads). Reads were grouped in 1 kb bins by their template length. Each graph shows the distribution of the read length, where the colors of the plotted lines represent read groups (e.g., '1kb' refers to a read group with a template length of 1-1000 bp). CLR reads were simulated using the PBSIM3 quality score models, and HiFi reads were generated by the ccs software as consensus sequences from PBSIM3 outputs. The template transcript from which each read was most likely sequenced was obtained from alignments between the reads and their reference genomes.

Figure S5: Distributions of insertion and deletion (indel) length for real reads and simulated reads. The vertical axis represents the percentage, while the horizontal axis represents the indel length. Frequencies of indel length were obtained from alignments between the reads and their reference genomes.
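One way to obtain indel length frequencies from alignments is sketched below. The caption does not state the alignment format or the tooling used, so this sketch assumes SAM-style CIGAR strings, in which `I` and `D` operations carry the insertion and deletion lengths. The function names `indel_length_counts` and `to_percent` are hypothetical.

```python
import re
from collections import Counter

# One CIGAR operation: a run length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_length_counts(cigars):
    """Count how often each insertion and deletion length occurs."""
    insertions, deletions = Counter(), Counter()
    for cigar in cigars:
        for length, op in CIGAR_OP.findall(cigar):
            if op == "I":
                insertions[int(length)] += 1
            elif op == "D":
                deletions[int(length)] += 1
    return insertions, deletions


def to_percent(counts):
    """Convert raw counts to percentages, keyed by indel length."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: 100.0 * v / total for k, v in sorted(counts.items())}


# Toy CIGAR strings (assumed data, for illustration only).
ins, dels = indel_length_counts(["10M1I5M2D3M", "4M1I4M3I2M"])
print(to_percent(ins))
print(to_percent(dels))
```

Plotting the resulting percentages against indel length, separately for real and simulated reads, gives a comparison of the kind shown in the figure.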

Figure S11: Transition probability matrices of states of FIC-HMM. The vertical and horizontal axes represent states of FIC-HMM, which are sorted in the same order as the emission probability matrices (Supplementary Figure S10). The states on the vertical axis transition to the states on the horizontal axis. The sum of transition probabilities for each state on the vertical axis is 100%. These are matrices of 'Acc.82'-'Acc.88' (e.g., 'Acc.84' refers to a read group with an accuracy of 83.5%-84.4%).
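To illustrate how emission and transition matrices of this form can generate a sequence of alignment states, here is a minimal sketch of the generic hidden Markov model sampling process. It is not the PBSIM3 implementation. The two-state matrices are toy values chosen so that each row sums to 100%, and the start state and the function name `sample_alignment_states` are assumptions made for illustration.

```python
import random

ALIGNMENT_STATES = ["M", "S", "I", "D"]  # Match, Substitution, Insertion, Deletion


def sample_alignment_states(emission, transition, length, rng, start=0):
    """Walk the hidden states and emit one alignment state per step.

    emission[i]   : weights over ALIGNMENT_STATES emitted by hidden state i
    transition[i] : weights over the hidden states reachable from state i
    """
    state = start
    emitted = []
    for _ in range(length):
        # Emit an alignment state from the current hidden state.
        emitted.append(rng.choices(ALIGNMENT_STATES, weights=emission[state])[0])
        # Move to the next hidden state.
        state = rng.choices(range(len(transition)), weights=transition[state])[0]
    return "".join(emitted)


# Hypothetical matrices in percent; each row sums to 100.
emission = [[95, 2, 2, 1],     # a mostly accurate hidden state
            [70, 10, 12, 8]]   # an error-prone hidden state
transition = [[90, 10],
              [30, 70]]

rng = random.Random(0)
print(sample_alignment_states(emission, transition, 60, rng))
```

Because the hidden state persists across steps, errors emitted by the error-prone state tend to cluster, which is one way a model of this kind can reproduce the non-uniform error distribution seen in real reads.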

Table S1: Datasets for whole genome sequencing

Table S3: Alignment statistics of whole genome sequencing. Sub.: Substitution, Ins.: Insertion, Del.: Deletion. Statistics were calculated from the alignments between the reads and their reference genomes.

Table S4: Parameter settings of aligners and simulators

Table S5: Alignment statistics of transcriptome sequencing. Sub.: Substitution, Ins.: Insertion, Del.: Deletion. Statistics were calculated from the alignments between the reads and their reference genomes.

Table S6: Comparison of whole genome sequencing alignment statistics between real and simulated reads

Table S8: The effect of the number of passes on the simulation of PacBio HiFi reads

Figure S1: Non-uniformity of errors of PacBio RS II CLR reads for C. elegans. After grouping reads by their accuracy, they were segmented into fixed-size (100, 200, 400, 800, 1600, and 3200 bp) disjoint intervals, and the accuracy of each interval was computed. Each graph shows the distribution of the averaged accuracy of each interval, where the color of the plotted lines represents read groups (e.g., 'Acc.78' refers to a read group with an accuracy of 77.5-78.4%).

The random model randomly generates errors according to an error rate and error ratio.

After grouping reads by their accuracy, they were segmented into fixed-size (100, 200, 400, 800, 1600, and 3200 bp) disjoint intervals, and the accuracy of each interval was computed. Each graph shows the distribution of the averaged accuracy of each interval, where the color of the plotted lines represents read groups (e.g., 'Acc.78' refers to a read group with an accuracy of 77.5-78.4%).
A) PacBio RS II CLR reads for C. elegans
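The random model mentioned in the captions above, which generates errors according to an error rate and an error ratio, could look roughly like the following sketch. It is not the PBSIM3 code. It interprets the error ratio as the relative proportions of substitutions, insertions, and deletions, which is an assumption, and the function name `random_error_states` and the numeric values are hypothetical.

```python
import random


def random_error_states(length, error_rate, ratio, rng):
    """Draw one alignment state per position, independently of its neighbors.

    error_rate : probability that a position is an error
    ratio      : relative weights of (substitution, insertion, deletion)
    """
    states = []
    for _ in range(length):
        if rng.random() < error_rate:
            # Pick the error type according to the assumed S:I:D ratio.
            states.append(rng.choices("SID", weights=ratio)[0])
        else:
            states.append("M")
    return "".join(states)


# Hypothetical settings: 15% errors, split 1:2:1 among S, I, and D.
rng = random.Random(0)
print(random_error_states(60, 0.15, (1, 2, 1), rng))
```

Because each position is drawn independently, the errors from a model of this kind are spread uniformly along the read, in contrast to the non-uniform accuracy of real reads described in the figures.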