ScanNeo2: a comprehensive workflow for neoantigen detection and immunogenicity prediction from diverse genomic and transcriptomic alterations

Abstract Motivation Neoantigens, tumor-specific protein fragments, are invaluable in cancer immunotherapy due to their ability to serve as targets for the immune system. Computational prediction of these neoantigens from sequencing data often requires multiple algorithms and sophisticated workflows, which are currently restricted to specific types of variants, such as single-nucleotide variants or insertions/deletions. Nevertheless, other sources of neoantigens are often overlooked. Results We introduce ScanNeo2 an improved and fully automated bioinformatics pipeline designed for high-throughput neoantigen prediction from raw sequencing data. Unlike its predecessor, ScanNeo2 integrates multiple sources of somatic variants, including canonical- and exitron-splicing, gene fusion events, and various somatic variants. Our benchmark results demonstrate that ScanNeo2 accurately identifies neoantigens, providing a comprehensive and more efficient solution for neoantigen prediction. Availability and implementation ScanNeo2 is freely available at https://github.com/ylab-hi/ScanNeo2/ and is accompanied by instruction and application data.


Pre-processing & Alignment
ScanNeo2 is a comprehensive snakemake-based pipeline to predict tumor neoantigens from raw or pre-processed DNA sequencing data, such as whole-exome or whole-genome sequencing (WES/WGS), and/or RNA sequencing data.It is implemented in the snakemake (Mölder et al., 2021) workflow language to enable maximum reproducibility.The installation process is simplified using conda environments and Docker containers, which are automatically fetched, installed, and executed by snakemake, thereby eliminating the need for users to manually install tools and dependencies.Fig. A1 depicts the different rules in the workflow.To run the pipeline, users can provide tumor/normal FASTQ or BAM files of either WES/WGS and/or RNA-seq data.In addition, pre-computed variants in VCF are also supported.When the sequencing reads are provided in FASTQ format, an optional pre-processing step can be invoked.Here, quality control of the reads is performed before and after the pre-processing using FastQC v0.12.1 (www.bioinformatics.babraham.ac.uk/projects/fastqc/).For the actual pre-processing procedure, fastp v0.23.4 (Chen et al., 2018) is used, which incorporates automated removal of adapter contamination.If the pre-processing is omitted, FASTQC is only applied to the input data.In the following, the RNA-seq reads are aligned to the reference genome (hg38) using STAR v2.7.10b.Here, the read groups are determined using either the sample identifiers or are directly extracted from BAM files (when provided).In other words, the former indicates that reads of different read groups need to be specified in separate files.In the following, the aligned reads are post-processed and the duplicates are removed using samtools v1.16.1 (Li et al., 2009;Danecek et al., 2021).If ScanNeo2 has been configured to perform indel/SNV calling, the reads are subsequently realigned using BWA v0.7.17 (Li and Durbin, 2009).
In that regard, the WES/WGS data are solely aligned using BWA.

HLA genotyping
In the HLA genotyping, WES/WGS and/or RNA-seq reads are used to determine the HLA alleles.In the case of RNA-seq data, the STAR alignments are used.For HLA class I alleles, the data is filtered HLA reads using yara (Siragusa et al., 2013), and subsequently the aligned reads are subjected to Optitype v1.3.5.In a similar manner, HLA-HD (Kawaguchi et al., 2017) is used to determine HLA class II alleles.

Variant Calling
ScanNeo2 integrates different modules for the variant calling.These include alternative splicing, exitron-splicing, and gene fusion events that are solely detected on RNA-seq data.As a consequence, these methods are omitted in the absence of RNA-seq data.
In contrast, the SNV/indel detection is done either on WES/WGS, RNA-seq, or both.

Alternative Splicing
Alternative splicing events are identified using SplAdder v3.0.4 (Kahles et al., 2016).We considered all event types (e.g., exon skips, intron retentions, alternative 3' splice sites, alternative 5' splice sites, mutually exclusive exons, multiple (coordinated) exons skips).As SplAdder only reports the events in Hierarchical Data Format (HDF5) and TXT format, ScanNeo2 converts the event files into VCF.ScanNeo2 mainly applies the default parameters of spladder, but confidence level (option confidence), and the number of iterations (option iterations) can be specified in the configuration.The former controls how strongly input alignments are filtered before new nodes and edges are added to the splicing graph.Here, the default is set to the highest level of confidence (default: 3, the maximum level of filtering).The latter controls the number of iterations when adding new intron edges (default: 5).

Exitron Splicing
We used ScanExitron (Wang et al., 2021) to detect exitron-derived splicing events.Here, the mapping quality is used that is specified for the whole analysis.In addition, default values for AO and PSO cutoffs are used.Subsequently, the results were converted into VCF using the provided scripts.

Gene Fusion
In this case, Arriba (Uhrig et al., 2021) is used.ScanNeo2 applies the recommended settings for the STAR alignments to use in Arriba.
In addition, parameters such as the minimum number of supported reads, the maximum E-value can be specified, whereas others are left to the default values.The resulting fusion events in tab-delimited format are converted into VCF using the provided scripts.

SNVs/Indels
ScanNeo2 employs multiple tools which include mutect2 and haplotypecaller from GATK (Van der Auwera and O'Connor, 2020) for SNVs and short indels, and transIndel (Yang et al., 2018) for long indels.However, ScanNeo2 allows to call these individually.It is to be noted, that this is applied both on the WES/WGS and RNA-seq data.Here, the (re-)alignments from BWA are used.Scanneo2 follows the best practice approach for short variant discovery.At first, the alignments are subjected to haplotypecaller which calls germline variants and is used with a recent build of the dbSNP database (Sherry et al., 2001).Subsequently, the variants are recalibrated using the Variant Quality Score Recalibration (VQSR) method that first builds a model using known, highly validated variant resources.This selects a subset of variants within the callset of high significance that functions as training set.
Overlaps of the training/truth resources sets and the detected callsets are used to model the distribution of these variants and groups them into clusters and assign so-called VQSLOD scores to all variants.In the next step, the filtering thresholds are applied to the variants in which the authors recommend a truth sensitivity level of 99.5% and 99% for SNV and indels, respectively.This results in filtered variants call sets for SNVs, and indels which are used for the base recalibration.The recalibrated alignment files are then applied to haplotypecaller, and the variants are again recalibrated and filtered to end up with call sets for SNVs and indels.Similarly, the recalibrated BAM file is subjected to mutect2 which identifies somatic variants.Subsequently, the indels are filtered and the SNVs and indels are sorted into separate files.Finally, the transIndel is used on the data to detect long indels.This is followed by the removal of PCR slippage events.

Annotation and Priorization
In the following, the individual VCF files are annotated using the Variant Effect Predictor (McLaren et al., 2016).Here, we used the plugin for nonsense-mediate decay events and predict the corresponding peptide sequences.For that, the flanking regions of the corresponding mutant alleles are extracted to the corresponding lengths that match the length of the corresponding peptides.This also allows to consider frameshift deletions that escape NMD.By default, ScanNeo2 determines peptide sequences of 8-11 for class I pMHCs, and 13-25 for class II pMHCs.Subsequently, netMHCpan and netHMCII (Reynisson et al., 2020) are used with the genotyped HLA alleles and the peptides.Peptides that fall short of a binding affinity 500nm are discarded.The surviving peptides are subjected to the immunogenicity caluclation for which the prediction tools from IEDB were used.Also the transcript expression was assessed.In addition, ScanNeo2 proves the ranking score to determine the significance for the detected pMHC.This is defined as score = 1 B + F + A * 100 in which B corresponds to the binding affinity, F as the fold-change between WT and MT alleles and A as the variant allele frequency.

Benchmarking
For the benchmarking of ScanNeo2, the TESLA datasets (Wells et al., 2020) was used (https://www.synapse.org/#!Synapse: syn21048999).In particular, normal/tumor whole exome sequencing (WES), and tumor RNA-Seq data of six patients in melanoma and two patients in non-small-cell lung carcinoma (NSCLC) was used.The samples were subjected to pre-processing applying a minimum length of 10, and an average quality score of 20.In addition, the window trimming with a window size of 3 and quality of 20 has been applied.In the following, stringent settings were used in the the variant calling.The results are listed in Table S3.Other parameters were set to default values.In a more stringent settings, the TPM was set to reads > 2, ranking score > 1000, and variant allele frequency was set to 0.02 and the immunogenicity score was set to 0.5.Table S4 summarizes the findings.

Runtime and Memory Requirements
All samples were subjected to each module individually.We tested single patient data in which the runtime between nextNEOpi took about 1.83min per million reads, wherease ScanNeo2 required 2.245min.However, this increase can be mainly attributed to the somatic variant calling that accounts for 0.7min.,311,929,587 2,284,672,789 (98.8%) 1,841,995,390 (79.7%) 1,196,134,909 (51.7%)Supplementary Figures Fig. A1.Visualization of ScanNeo2 workflow and the interplay of the rules as directed acyclic graph

Table S1 .
Feature comparison of common workflows for the prediction of neoantigens.

Table S2 .
Statistics of the pre-processing and alignments when analysing the TESLA dataset using ScanNeo2.Percentages in brackets refer to the raw sequencing reads.