TKSM: highly modular, user-customizable, and scalable transcriptomic sequencing long-read simulator

Abstract
Motivation: Transcriptomic long-read (LR) sequencing is an increasingly cost-effective technology for probing various RNA features. Numerous tools have been developed to tackle various transcriptomic sequencing tasks (e.g. isoform and gene fusion detection). However, the lack of abundant gold-standard datasets hinders the benchmarking of such tools. Therefore, the simulation of LR sequencing is an important and practical alternative. While the existing LR simulators aim to imitate the sequencing machine noise and to target specific library protocols, they lack some important library preparation steps (e.g. PCR) and are difficult to modify to new and changing library preparation techniques (e.g. single-cell LRs).
Results: We present TKSM, a modular and scalable LR simulator, designed so that each RNA modification step is targeted explicitly by a specific module. This allows the user to assemble a simulation pipeline as a combination of TKSM modules to emulate a specific sequencing design. Additionally, the input/output of all the core modules of TKSM follows the same simple format (Molecule Description Format) allowing the user to easily extend TKSM with new modules targeting new library preparation steps.
Availability and implementation: TKSM is available as an open source software at https://github.com/vpc-ccg/tksm.

However, because LR sequencing is still an emerging technology, there are very few well-established benchmark or gold-standard datasets for assessing transcriptomic LR bioinformatics tools. Such bioinformatics tools targeting these tasks require realistic simulations in order to assess their accuracy and performance. This includes the ability to explicitly simulate specific library or cellular processes such as single-cell barcoding and UMI tagging, PCR, or molecule truncation.
Existing LR simulators such as Badread (Wick 2019), DeepSimulator (Li et al. 2020), Icarust (Munro et al. 2023), PBSIM3 (Ono et al. 2022), and Nanosim (Yang et al. 2017) typically focus on simulating the sequencing process, i.e. the point of contact between the sequencing platform and the RNA/DNA molecule. Some have extensions focusing on specific sequencing libraries, such as Trans-Nanosim transcriptomic and plasmid simulation (Hafezqorani et al. 2020), Meta-Nanosim metagenomic simulation (Yang et al. 2023), SLSim single-cell simulation (You et al. 2023a), and SQANTI-SIM alternative splicing simulation (Mestre-Tomás et al. 2023). However, these tools are not designed with modularity in mind and cannot be easily modified to address changes in library preparation protocols such as adding a barcode tag or simulating the PCR process. A comprehensive survey of long-read tools, including simulation tools, is available at the Long-read Tools catalogue (Amarasinghe et al. 2021).
We describe TKSM, a software tool that simulates realistic transcriptomic long-read datasets. TKSM's modular design allows it to target a wide range of library and cellular processes. The power of TKSM lies in two key aspects: (i) the ease with which its simulation pipeline can be modified to cater to specific sequencing designs and (ii) high performance in terms of time and memory use. TKSM is open source and accessible via GitHub.

Methods
TKSM is flexible, both in that it can simulate a wide variety of datasets and in that it is extendable. It is composed of several independent modules, each representing a cellular (e.g. polyadenylation) or a library preparation (e.g. PCR) process that modifies a nucleic acid molecule. This design allows the user to simulate different sequencing protocols by using TKSM's modules in various arrangements, imitating the different steps in the desired sequencing protocol. Additionally, this modular design allows TKSM to be easily extended with future modules targeting additional library and cellular processes. To enable this modularity, we designed TKSM's modules to take and generate files in the same format, which we call Molecule Description Format (MDF). An MDF file is a tabular file that describes molecules by listing for each molecule its genomic intervals alongside any sequence-level modifications to these intervals (e.g. substitutions). The rationale for using a tabular format is that users can write their own scripts that generate or modify intermediate MDF files. We expand on the details of MDF files in Supplementary Section S1. The only exceptions to this design pattern are the entry module, which generates the initial set of molecules from a transcript abundance profile, and the exit module, which generates the reads obtained by simulating the sequencing of the given molecules.
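To make the idea of a molecule description concrete, the following Python sketch models a molecule as an ordered list of genomic intervals, each carrying optional substitutions, and renders its sequence from a reference. The class names, field names, and the 0-based half-open coordinate convention are assumptions made for illustration only; the actual MDF syntax is specified in Supplementary Section S1 and is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Interval:
    """One genomic interval of a molecule (hypothetical structure, not MDF syntax)."""
    contig: str
    start: int            # 0-based, inclusive (assumed convention)
    end: int              # exclusive
    strand: str = "+"
    # Substitutions keyed by offset from `start`, in reference orientation.
    substitutions: dict = field(default_factory=dict)

@dataclass
class Molecule:
    mol_id: str
    intervals: list

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def render(molecule, reference):
    """Build the molecule sequence by concatenating its (modified) intervals."""
    parts = []
    for iv in molecule.intervals:
        seq = list(reference[iv.contig][iv.start:iv.end])
        for offset, base in iv.substitutions.items():
            seq[offset] = base
        piece = "".join(seq)
        if iv.strand == "-":
            # Reverse-complement intervals on the minus strand.
            piece = piece.translate(_COMPLEMENT)[::-1]
        parts.append(piece)
    return "".join(parts)

reference = {"chr1": "ACGTACGTAC"}
mol = Molecule("m1", [
    Interval("chr1", 0, 4),
    Interval("chr1", 6, 10, substitutions={0: "T"}),
])
rendered = render(mol, reference)
```

Under these assumptions, a core module is simply a function from molecules to molecules (for example, one that appends an interval or records a substitution), which is why a shared molecule description lets independently written steps be chained.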
Each of TKSM's modules can be run as a separate process (tksm <module_name>). We also provide as part of TKSM a Snakemake (Mölder et al. 2021) script that the user can configure to specify a wide range of simulation experiments and run them all with a single command. Additionally, to optimize computation time, we take advantage of Snakemake's piped input/output feature, which allows a module to start running the moment it receives any input from the previous module rather than waiting for that module to terminate.
TKSM can use real sequencing datasets to parameterize the behaviour of its modules; alternatively, these parameters can be specified manually by the user. For example, TKSM contains preprocessing modules that compute the expression profile of transcripts from a given real sample. This profile is then used to generate the molecules in the initial MDF file, and the subsequent modules simulate the sequencing of these molecules according to a chosen protocol.
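The benefit of piping can be illustrated with an in-process analogy based on Python generators. This is only a sketch of the streaming idea: TKSM modules are separate processes connected by Snakemake pipes, and the function names below (`entry`, `add_tail`, `sequence`) are hypothetical stand-ins, not TKSM modules.

```python
def entry(n, log):
    """Hypothetical entry step: lazily emits n molecule identifiers."""
    for i in range(n):
        log.append(f"entry emits m{i}")
        yield f"m{i}"

def add_tail(molecules, log):
    """Hypothetical core step: processes each molecule as soon as it arrives."""
    for m in molecules:
        log.append(f"tail processes {m}")
        yield m + "+polyA"

def sequence(molecules, log):
    """Hypothetical exit step: consumes the stream and collects reads."""
    reads = []
    for m in molecules:
        log.append(f"sequencer reads {m}")
        reads.append(m)
    return reads

log = []
reads = sequence(add_tail(entry(2, log), log), log)
```

The log shows the first molecule passing through all three steps before the second one is even generated, which is the behaviour that lets downstream work overlap with upstream work instead of waiting for complete intermediate files.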
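As a rough illustration of how an expression profile can parameterize molecule generation, the sketch below estimates relative transcript abundances from per-read transcript assignments and then samples molecules in proportion to them. The input representation (one transcript ID per read) and the function names are assumptions for this example; TKSM's own preprocessing and entry modules may work differently.

```python
import random
from collections import Counter

def expression_profile(assignments):
    """Relative abundance of each transcript from per-read assignments."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {tid: c / total for tid, c in counts.items()}

def sample_molecules(profile, n, seed=0):
    """Draw n transcript IDs with probability proportional to abundance."""
    rng = random.Random(seed)        # seeded for reproducible simulations
    tids = sorted(profile)           # fixed order so the seed is meaningful
    weights = [profile[t] for t in tids]
    return rng.choices(tids, weights=weights, k=n)

# Toy "real sample": three reads assigned to T1 and one to T2.
observed = ["T1", "T1", "T1", "T2"]
profile = expression_profile(observed)
molecules = sample_molecules(profile, 1000)
```

Specifying the parameters manually corresponds, in this sketch, to writing the `profile` dictionary by hand instead of deriving it from observed reads.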

TKSM modules
TKSM contains three classes of modules, defined by features of their input and output: (i) entry-point modules start a TKSM pipeline and output an MDF file, (ii) core modules take an MDF file as input and output another MDF file, and (iii) exit (sequencing) modules take an MDF file as input and generate FASTA/FASTQ file(s) as output. Additionally, some preprocessing utilities in TKSM can take a real sequencing dataset and output model parameters for some of the TKSM modules. A list of the implemented TKSM modules is presented in Fig. 1A and detailed in Supplementary Section S1. Additional modules and utilities can be implemented and easily integrated into TKSM in order to target specific steps in alternate sequencing protocols.
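The input/output contract of the three module classes implies a simple well-formedness rule for a linear pipeline: it begins with an entry-point module, ends with an exit module, and contains only core modules in between. The sketch below checks that rule. The module names and their classification in `MODULE_CLASS` are hypothetical placeholders chosen for this example; the real module list is given in Fig. 1A and Supplementary Section S1.

```python
ENTRY, CORE, EXIT = "entry", "core", "exit"

# Hypothetical module names and classes, for illustration only.
MODULE_CLASS = {
    "transcribe": ENTRY,   # produces the initial molecule descriptions
    "polyA": CORE,         # molecule descriptions in, molecule descriptions out
    "pcr": CORE,
    "truncate": CORE,
    "sequence": EXIT,      # molecule descriptions in, reads out
}

def validate_pipeline(modules):
    """Return True if `modules` forms a valid linear entry -> core* -> exit chain."""
    if len(modules) < 2:
        raise ValueError("a pipeline needs at least an entry and an exit module")
    classes = []
    for name in modules:
        if name not in MODULE_CLASS:
            raise ValueError(f"unknown module: {name}")
        classes.append(MODULE_CLASS[name])
    if classes[0] != ENTRY:
        raise ValueError("the first module must be an entry-point module")
    if classes[-1] != EXIT:
        raise ValueError("the last module must be an exit module")
    if any(c != CORE for c in classes[1:-1]):
        raise ValueError("only core modules may appear between entry and exit")
    return True

# The same core module may be used more than once.
ok = validate_pipeline(["transcribe", "pcr", "pcr", "sequence"])
```

Because every core module shares the same input and output type, any ordering or repetition of core modules between the two ends passes this check.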

Customizable TKSM pipelines using Snakemake
An important design choice for TKSM is to make it easily customizable by the user, i.e. to make it easy to build a possibly complex simulation pipeline using the TKSM modules. To achieve that, we packaged TKSM with Snakemake and configuration scripts that can be edited by the user to add new modules or to define simulation experiments using any arrangement of TKSM modules. To define a simulation pipeline, the user lists the names of the required TKSM modules and specifies, for the modules that require model construction, the real samples on which to build such models. Additionally, using the Merging module, the user may build complex pipelines that are composed of different linear pipelines. An example of the configuration script is presented in Supplementary Listing S1.
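A branching pipeline of the kind enabled by merging can be sketched as follows: molecules are split by a tag, one branch receives an extra processing step, and the branches are merged back into a single stream. All function names, the dictionary-based molecule representation, the two-output filter, and the choice to prepend the barcode are assumptions made for this illustration and do not describe the interfaces of TKSM's Filtering or Merging modules.

```python
def filter_split(molecules, predicate):
    """Split molecules into those that satisfy the predicate and the rest."""
    passed, failed = [], []
    for m in molecules:
        (passed if predicate(m) else failed).append(m)
    return passed, failed

def add_barcode(molecules, barcode):
    """Return copies of the molecules with a barcode prepended (assumed placement)."""
    return [{**m, "seq": barcode + m["seq"]} for m in molecules]

def merge(*streams):
    """Concatenate the outputs of several linear pipelines."""
    merged = []
    for stream in streams:
        merged.extend(stream)
    return merged

molecules = [
    {"id": "m1", "seq": "ACGT", "tags": {"cell"}},   # should receive a barcode
    {"id": "m2", "seq": "GGCC", "tags": set()},      # should be left unchanged
]
tagged, untagged = filter_split(molecules, lambda m: "cell" in m["tags"])
result = merge(add_barcode(tagged, "TTTT"), untagged)
```

Each branch here is itself a linear pipeline, so the branching design reduces to composing linear pieces plus one merge step.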

Results
To illustrate TKSM and assess its performance, we designed three simulation pipelines to emulate examples of standard transcriptomic sequencing protocols. Specifically, we present simulations of a standard bulk RNA sequencing experiment, a hybrid long-short read single-cell RNA sequencing (scRNA-seq) experiment, and an RNA sequencing experiment similar to the bulk RNA sequencing experiment but with 100 random gene fusion events added. The Snakemake configuration files that specify these simulation pipelines are presented in Supplementary Listings S2-S4.
In the standard bulk RNA-seq experiment, we primarily compare against Trans-Nanosim (Hafezqorani et al. 2020) and try to conform to its pipeline design using TKSM modules. For both the bulk and gene fusion experiments, we use an RNA-seq sample generated from the MCF7 cell line by Chen et al. (2021) (direct RNA, replicate 1, run 2). We first accessed the SG-NEx data on 2020-06-17 via https://registry.opendata.aws/sgnex/. For the scRNA-seq experiment, we used an in-house dataset, named N1, first described by Ebrahimi et al. (2022). N1 follows the short-long single-cell hybrid protocol described previously in the literature (Gupta et al. 2018, Singh et al. 2019, Tian et al. 2021). In this manuscript, we use a random subsample of N1 with ~1M long reads. The three TKSM pipelines are illustrated in Fig. 1B and C and Supplementary Figure S11.
Using these experiments, our goal is to assess TKSM on multiple metrics: (i) the similarity of the simulated data to the input real data on measures such as transcript expression, molecule sequence truncation, single-cell barcode detection rates, and gene fusion generation, (ii) the time and memory footprint of various steps, and (iii) the ability to generate gene fusion events that can be detected by standard gene fusion tools. The results of all these experiments are presented in Supplementary Section S2. Note that all these results are reproducible using the Snakemake scripts provided in the TKSM GitHub repository.

Conclusion
TKSM is a modular, accurate, and efficient transcriptomic LR sequencing simulator. Its modular design enables the user to construct a large variety of sequencing experiments with minimal effort. TKSM's standardized module input and output allow users to add new modules that target existing and future library preparation techniques that TKSM currently does not target. For example, it is easy to envision an alternative entry-point module to the Transcribing module that generates nucleic acid molecules from DNA fragmentation while still making use of the rest of the TKSM modules. TKSM also performs well in terms of generating realistic datasets with characteristics matching the real datasets it is simulating. Additionally, TKSM is engineered with efficient CPU and memory use in mind, and its performance on those metrics is excellent.

Figure 1. (A) Existing TKSM modules and utilities alongside their high-level descriptions. TKSM is designed with modularity in mind; the user can specify a simulation pipeline of their choosing by chaining any number of TKSM modules, including the possibility of using the same module multiple times. (B) Typical RNA-seq simulation pipeline that imitates Trans-Nanosim's workflow. (C) Single-cell long-read simulation pipeline. The pipeline makes use of the Filtering and Merging modules to add the short-read Illumina adapter and 10x Genomics cellular barcodes only to molecules that have a tag indicating that they should have a cellular barcode.