Vacceed: a high-throughput in silico vaccine candidate discovery pipeline for eukaryotic pathogens based on reverse vaccinology

Summary: We present Vacceed, a highly configurable and scalable framework designed to automate the process of high-throughput in silico vaccine candidate discovery for eukaryotic pathogens. Given thousands of protein sequences from the target pathogen as input, the main output is a ranked list of protein candidates determined by a set of machine learning algorithms. Vacceed has the potential to save time and money by reducing the number of false candidates allocated for laboratory validation. Vacceed, if required, can also predict protein sequences from the pathogen’s genome. Availability and implementation: Vacceed is tested on Linux and can be freely downloaded from https://github.com/sgoodswe/vacceed/releases (includes a worked example with sample data). Vacceed User Guide can be obtained from https://github.com/sgoodswe/vacceed. Contact: John.Ellis@uts.edu.au Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Vacceed is the collective name for a framework of linked bioinformatics programs, Perl scripts, R functions, and Linux scripts. It has been designed to facilitate an automated, high-throughput in silico approach to vaccine candidate discovery for eukaryotic pathogens. This document explains how to install, configure, and run Vacceed. The first important point to note is not to be overwhelmed by the length of this user guide. It is not intended to be read from cover to cover; rather, it is a reference to assist in a stress-free experience in getting started and, only if desired, in obtaining a more detailed understanding of how Vacceed works.
We recommend the following published articles for background information on the motivation and theory behind Vacceed:

Test installation by running sample data
Prior to running tests, you need to install dependent programs. The programs to install depend on which of the following two main tasks you intend to perform: 1) build a proteome for the target pathogen and/or 2) run the vaccine candidate discovery pipeline. See the following section, How to get started, to assist in choosing which task to perform. Subject to your choice, refer to Prerequisite programs in either or both of the PART A - Build Proteome and PART B - Run Pipeline sections to determine which programs need to be installed. Each installed prerequisite program should be independently tested before attempting to run Vacceed.
Sample data for the species Toxoplasma gondii is provided as part of the Vacceed download to specifically test the installation (T. gondii is a eukaryotic protozoan responsible for human disease).
Test 1 - Build proteome for T. gondii
1. Edit the species configuration file 'toxoplasma_build.ini' located in the directory <install_dir>/vacceed/start/config_dir (where <install_dir> is the directory in which Vacceed was installed). Change the current path assigned to work_dir to the correct path to the 'vacceed' directory. Also, assign your e-mail address to email_url.
2. Change directory to <install_dir>/vacceed/start in a command-line terminal.
3. Enter the command: perl startup build tg
Note that only chromosomes 'Ia' and 'Ib' from T. gondii are used in the test to reduce the running time. Even so, step 3 can take between 5 and 8 hours to complete depending on your computing environment. The slowness is caused by one program in particular, N-Scan. It is possible to run this test in less than 10 minutes by removing N-Scan from the building process. This is achieved by removing the word NSCAN from the list of program names assigned to name under the [Resources] header in the file 'toxoplasma_build.ini'. This test (including N-Scan) completed in 6 hours, 30 minutes, and 30 seconds using Red Hat Enterprise Linux Workstation release 6.4 (64-bit kernel) with 12 GB memory and 6 CPUs (without N-Scan, it took only 8 minutes).
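Step 1 amounts to a one-line substitution in the configuration file; a minimal sed sketch, assuming the key appears as work_dir="..." (the old path below is illustrative):

```shell
# Point work_dir in a copy of toxoplasma_build.ini at your vacceed directory.
# The original path is a made-up placeholder.
ini=$(mktemp)
echo 'work_dir="/old/path/vacceed"' > "$ini"
sed -i.bak "s|^work_dir=.*|work_dir=\"$HOME/vacceed\"|" "$ini"
new=$(grep '^work_dir=' "$ini")
```

The `-i.bak` form works with both GNU and BSD sed, leaving the original file as a `.bak` backup.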
An e-mail will automatically be sent to you either when the building of the proteome is successfully completed OR immediately when an error occurs. A log file is attached to the e-mail that provides details of the success or failure.
common_programs - contains programs that are common to more than one resource.
chromosomes - contains nucleotide sequences for each chromosome from the target pathogen. A separate file in a FASTA format is required for each chromosome. The filename should consist of a consistent prefix (e.g. chr or chromosome), a chromosome number (e.g. 1 or Ia), and a consistent extension (e.g. fasta or seq). Example filenames: chr1.fasta, chr2.fasta.
genes - contains nucleotide sequences for each gene within a chromosome. A separate file in a FASTA format is required for each chromosome. The filename should consist of a consistent prefix (e.g. genes_chr), a chromosome number (e.g. 1 or Ia), and a consistent extension (e.g. fasta or seq). Example filenames: genes_chr1.fasta, genes_chr2.fasta.
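The prefix + chromosome number + extension convention can be parsed with plain parameter expansion; a sketch using one of the example names:

```shell
# Strip the consistent prefix and extension to recover the chromosome number
fname="genes_chrIa.fasta"
prefix="genes_chr"
ext="fasta"
chrnum=${fname#"$prefix"}   # remove the leading prefix
chrnum=${chrnum%."$ext"}    # remove the trailing .extension
```

This is how a consistent naming scheme lets files be grouped and processed on a per-chromosome basis.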
proteins - contains amino acid sequences for all known proteins from the target pathogen. Only one file in a FASTA format is required. Any filename can be used. The FASTA sequence identifier for each protein should be in a consistent format, e.g. >gi|490147168, >gi|52000749| or >tr|F0V7C2|, >tr|F0VF65|.
ests - contains nucleotide sequences for Expressed Sequence Tags (ESTs). Only one file in a FASTA format is required. Any filename can be used. The FASTA sequence identifier for each EST should be in a consistent format.
mapping - contains all files used to map different gene or protein identifiers, e.g. map a UniProt ID to an NCBI gi number.
training_files - contains user-created training files for resources (e.g. AUGUSTUS provides the option to create a training dataset specific to the target pathogen).
proteome - contains the two most important output files: one containing resource scores for each protein and the other, the built proteome in the form of protein sequences (see section Output Files).
pipeline - the main work area for running the pipeline (see PART B - Run Pipeline).

Prerequisite Programs
Perl - Vacceed has been developed and tested on Perl 5. Note: Installing the prerequisite programs is perhaps the most challenging aspect of preparing Vacceed for use. It is highly recommended that you seek the help of an administrator (or an experienced Linux user). Ensure that each program successfully runs with sample data before running Vacceed.

Important requirements:
[1] Append the program location to the PATH variable so that the program will run from any directory. The best place to add the location is the user's .bash_profile file, e.g. PATH=$PATH:<program_dir>/bin

Configuration of prerequisite programs
Some of the prerequisite programs need to be configured specifically for the target pathogen. Augustus has been trained for many organisms (see the list at: http://bioinf.uni-

Prerequisite Starting Data
The only absolutely mandatory data to build the proteome are nucleotide sequences. Ideally, however, known gene and protein sequences for the target pathogen are also obtained from public databases and saved in the directories specified below. Without these existing genes and proteins, Vacceed can only assume that any predicted genes and proteins are novel.
Expressed sequence tags, although not mandatory, provide useful evidence to support existing or novel genes. Vacceed uses the second entity in the identifier as the protein ID, e.g. 490147168 and F0V7C2 from the previous example. The default prefix is assigned to the key 'prot_id_prefix' in the species configuration file.

Nucleotide sequences for each chromosome

Expressed Sequence Tag (EST) sequences
The programs Blat and GMAP are used to align EST sequences to the chromosome sequences. Partial genes are then constructed. These predicted partial genes are compared to known genes in the genes directory using blastn. A score is determined as the percentage of query coverage multiplied by the percentage of sequence similarity (see Output Files). The score provides evidence to support the existence of the known genes. Only one file containing the ESTs in a FASTA format is required. Any filename can be used, but the name used must be assigned to the key 'est_file' in the species configuration file (see Species configuration file). The EST file should be saved in the ests directory (see Directory structure following Vacceed installation).
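The score calculation can be illustrated with made-up numbers; a sketch, assuming both values are percentages:

```shell
# Query coverage (%) x sequence similarity (%), normalised to 0-1.
# The two input values below are illustrative only.
qcov=90
ident=80
score=$(awk -v c="$qcov" -v i="$ident" 'BEGIN { printf "%.2f", (c / 100) * (i / 100) }')
```

A perfect alignment (100% coverage, 100% identity) would therefore score 1.00.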

Mapping requirements
Vacceed needs to link the genes in the genes directory to the proteins in the proteins directory and vice versa. Two mandatory mapping files are required: 1) gene ID to protein ID, and 2) protein ID to gene ID and chromosome number.
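A hypothetical mapping file and lookup; the tab-delimited two-column layout and the gene IDs are assumptions for illustration (the protein IDs reuse those quoted earlier):

```shell
# Build a tiny gene-ID-to-protein-ID mapping file and look one entry up.
# Gene IDs (TGGT1_*) are invented; the delimiter choice is an assumption.
mapfile=$(mktemp)
printf 'TGGT1_200010\t490147168\nTGGT1_200020\t52000749\n' > "$mapfile"
prot=$(awk -F'\t' '$1 == "TGGT1_200010" { print $2 }' "$mapfile")
```

The second mandatory file would carry an extra column for the chromosome number.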

Vacceed Execution
Vacceed is invoked with a Perl script from within a Linux/Unix shell. The Perl script is called startup and is located in the start directory (see Figure 1).
To build the proteome: change directory to ~/vacceed/start and type:
perl startup build <pathogen>
e.g. perl startup build tg
Description of arguments: 'build' instructs the script to build the proteome of the target pathogen; <pathogen> is a user-definable name that determines which configuration file to use (see Configuration file - startup.ini).

Configuration file -startup.ini
Each target pathogen requires its own configuration files (see Species configuration file).
Typically there is one species configuration file for building the proteome, and one for running the in silico vaccine discovery pipeline. An argument passed to the startup script dictates which species configuration file to use. For example, perl startup build tg or perl startup tg, where 'tg' is a user-definable code in the configuration file startup.ini (see Figure 2). The startup.ini configuration file contains 4 columns separated by a '<' character. The first column can be any number of characters and is used by the startup script to make the association with the appropriate species configuration file. Column 2 is simply a description and is not used by any program. Column 3 should be either 'build' to indicate that the species configuration file relates to building the proteome or 'pipeline' to indicate that the configuration file is used to run a vaccine discovery pipeline. Column 4 is the species configuration filename (any user-definable name). The startup.ini file is located in the start directory (see Figure 1).
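A single line of startup.ini can be pulled apart with cut on the '<' delimiter; the line below mirrors the toxoplasma example supplied with the distribution:

```shell
# Columns: 1 = code, 2 = description, 3 = build|pipeline, 4 = config filename
line='tg<Toxoplasma gondii<build<toxoplasma_build.ini'
code=$(printf '%s' "$line" | cut -d'<' -f1)
type=$(printf '%s' "$line" | cut -d'<' -f3)
config=$(printf '%s' "$line" | cut -d'<' -f4)
```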

Species configuration file
The core of Vacceed is a species configuration file in a header-key format. User-definable configuration files are required for each species. Typically, each species will have two configuration files: one for building the proteome and one for running the pipeline. Four example configuration files (template_build.ini, template.ini, toxoplasma_build.ini, toxoplasma.ini) are supplied with the Vacceed distribution. The distributed configuration files are found in config_dir under the start directory (see Figure 1).
The following section describes the content of a typical configuration file and presents an example of how a user can modify the contents to suit the target species. The species configuration file is in header-key format (see Figure 3). For example, [Resources] is regarded as the header, and 'name' is the key. Once startup is invoked, the script executes each resource listed after the 'name' key in turn. A resource name, e.g. AUGUSTUS, can in principle be any name, provided that the same name is used consistently throughout the rest of the configuration file. In most cases, only keys under the Main header (see Figure 4) will need to be modified by the user. Any key can be used as a variable replacement in the rest of the configuration file. That is, a '$' character preceding a word denotes a variable, e.g. $work_dir is replaced by '$HOME/vacceed' throughout the configuration file on execution of startup.
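The variable replacement can be mimicked in shell; a sketch of the effect only (startup itself performs the substitution internally, which is an assumption of mechanism here):

```shell
# Expand $work_dir and $species_dir inside a raw configuration value,
# as startup would on execution.
work_dir="$HOME/vacceed"
species_dir="toxoplasma"
raw='$work_dir/$species_dir/build_proteome/chromosomes'
chr_dir=$(printf '%s' "$raw" | sed -e 's|\$work_dir|'"$work_dir"'|' -e 's|\$species_dir|'"$species_dir"'|')
```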
Description of keys under the Main header:
work_dir - the path to the directory that contains the Vacceed installation.
species_dir - the directory name that will contain all data and output for a specific species.
assembly_dir - path to the directory that will contain the files used to build the proteome.
common_dir - path to the directory containing programs that are common to more than one resource.
resource_dir - the value for this key must not change. This is a special case in which $resource_dir is replaced by the relevant resource name, e.g. augustus.
The keys under the Variables header (see Figure 5) are essentially used to save on typing and to limit the number of changes required to the configuration file. The user can add any number of variables.

[Main]
work_dir="$HOME/vacceed"
species_dir="toxoplasma"
chromosomes="Ia Ib II III"
master_script="master_script"
build_script_only="NO"
log_file="$work_dir/$species_dir_logfile.txt"
email_url=Fred.Bloggs@student.uts.edu.au

[Variables]
protein_fasta="UniProt_proteins.fasta"
prot_id_prefix="tr"
chr_dir="$work_dir/$species_dir/build_proteome/chromosomes"
gene_dir="$work_dir/$species_dir/build_proteome/genes"
est_dir="$work_dir/$species_dir/build_proteome/ests"
prot_dir="$work_dir/$species_dir/build_proteome/proteins"
train_dir="$work_dir/$species_dir/build_proteome/training_files"
map_dir="$work_dir/$species_dir/build_proteome/mapping"
proteome_dir="$work_dir/$species_dir/proteome"
assembly_dir="$work_dir/$species_dir/build_proteome/build_dir/assembly/output"
common_dir="$work_dir/$species_dir/build_proteome/build_dir/common_programs"

The resourceName should be consistent with the name of the resource used under the Resources header. Figure 6 represents a typical configuration for a resource. The directory for each resource contains an identical structure of three directories, which by default are called output, scripts, and summary_files.
script_dir - the directory name that will contain the Linux scripts that execute the various commands of the resource. A separate script is created for each chromosome, e.g. script_Ia, script_Ib. These scripts are run either in parallel or consecutively (see Running scripts in parallel). The scripts can be run independently and are useful for debugging.

Resource Scripts
In the distribution version of Vacceed, each resource has one main Linux script (typically named after the resource) that creates a new script for each chromosome to be processed.
These scripts are saved in the scripts directory (see Figure 1). Each chromosome script (e.g. scriptIa) contains all the required commands to execute the resource and to extract relevant data for that particular chromosome. The extracted data is analysed and then used to build the proteome incrementally on a chromosome-by-chromosome basis. There is a set hierarchical structure for the execution of all Vacceed scripts, i.e. startup -> master_script -> resource_script -> chromosome_script. Any script can be run independently, which is ideal for debugging. Each resource script is constructed from a generic format. Figure 7 shows an example of this format. There are four main sections in the script: # Get command-line arguments - these arguments are passed by startup. The resource arguments to pass are read from the appropriate species configuration file (see Species configuration file). The arguments constitute the variables (denoted by the prefix '$') used throughout the rest of the script.
# Hard coding - local variables should be added here. bg_mode is required to implement parallel processing of scripts (see Running scripts in parallel).
# Main loop for writing scripts - the idea behind this section is to create a subordinate script that encapsulates all the commands required to process and manipulate data for a given chromosome. The general pattern for each step is to add one line with a description of the step and another line with the actual command. Each command line should also include '|| error_exit'. A generic function called error_exit is executed in the event of an error raised by the command. All errors are written to the log file. A generic script called 'error_script' is used to write the error_exit function.
# Run the scripts - a generic script called 'run_scripts' is used to execute each chromosome script.
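The '|| error_exit' convention can be sketched generically; in the real framework error_script writes this function, so the inlined version below is illustrative only:

```shell
# Generic error-trapping pattern used in chromosome scripts.
log_file=$(mktemp)
error_exit() {
    echo "ERROR: command failed" >> "$log_file"
    exit 1
}
# Description of the step: a stand-in command that succeeds
true || error_exit
status="ok"
```

If the command before '||' fails, error_exit logs the error and aborts the script, so later steps never run on bad data.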

Output Files
The two most important output files are proteome_info.txt and proteome.fasta, which are saved in the proteome directory (see Directory structure following Vacceed installation). It is recommended that these files be examined after startup has finished and no errors were detected.
proteome_info.txt - lists all known and predicted pathogen proteins by descending score in a table format. The file also lists proteins that are potentially novel. A novel protein in this context is one that matches no known protein from the target pathogen and is derived from a predicted gene that matches no known gene. A novel protein is denoted by the word 'new' as part of the ID, e.g. chrIa_new7. This notation also indicates which chromosome contained the gene that encoded the novel protein. Novel proteins will always have a 0 score from each resource. However, the last column for novel proteins is a probability score rather than an average of the resource scores. The probability score (a value between 0 and 1) is derived by clustering the sequences predicted by different resources that do not match known genes or proteins. The clustering is based on sequence similarity using blastn or blastp. An assumption made is that a cluster containing several sequences is more likely to represent a 'real' (i.e. novel) sequence than a cluster containing only a few sequences.
proteome.fasta - contains the amino acid sequences in a FASTA format for all known and predicted novel proteins of the target eukaryotic pathogen. This file provides the starting prerequisite for the in silico vaccine discovery pipeline (i.e. PART B - Run Pipeline). Note that, for consistency, the FASTA definition line for novel proteins contains the same characters that precede the known protein IDs. For example, if known proteins have 'sp' preceding the ID in the FASTA definition line, as in >sp|QQAAA|, the novel proteins will also have 'sp' preceding the ID, as in >sp|chrIa_new7|. The preceding characters are assigned to the key 'prot_id_prefix' in the species configuration file.
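Building the definition line for a novel protein from prot_id_prefix, using the values from the example above:

```shell
# Compose a FASTA definition line consistent with the known-protein prefix
prot_id_prefix="sp"
novel_id="chrIa_new7"
defline=">${prot_id_prefix}|${novel_id}|"
```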
If the above files are empty or have missing or unexpected data, then the next recommended step is to review the files in the summary_files directory for each resource. These summary files typically provide an overview of the output on a chromosome-by-chromosome basis. If the results are not what you expect, then probing the more detailed files in the output directories may provide some clues to the source of the problem. Each resource can potentially generate many output files. The output filenames, a description of their contents, and the program that generates them are listed in the following table. In general, the files represent an audit trail of the various steps performed by each resource. For the most part, you do not need to be concerned with these output files unless there is a requirement to scrutinise the quality of each step.
The table uses some abbreviations and terms:
prefix - represents a consistent set of characters that precede the rest of the filename. The default is chr# (where # is the number of the chromosome). The prefix allows for the grouping and processing of files on a chromosome basis.
# - represents a number, either for a chromosome or a gene.
Query coverage - the percent of the query sequence that overlaps the subject sequence with respect to a BLAST output.
Ident - the percent of similarity between the query and subject sequences over the length of the query coverage.

Running scripts in parallel
The resources encapsulate, for the most part, a large number of independent computation-intensive tasks. Vacceed takes advantage of multi-core processors. The default when building the proteome is to process one chromosome per CPU in parallel. Chromosomes are queued if there are more chromosomes than CPUs, i.e. when a chromosome has finished processing, a new one will commence. The user can specify the number of chromosomes to process in parallel by altering the number assigned to the variable 'no_in_parallel' in run_scripts, located in the common_programs directory. The chromosomes can be processed consecutively by setting bg_mode=0 (the default is bg_mode=1 for parallel processing - see Resource Scripts).
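The queuing behaviour can be sketched in a few lines of shell; this is illustrative only and simpler than the real run_scripts logic:

```shell
# Process four chromosomes, at most two at a time.
outfile=$(mktemp)
no_in_parallel=2
set -- Ia Ib II III
while [ "$#" -gt 0 ]; do
    batch=0
    while [ "$#" -gt 0 ] && [ "$batch" -lt "$no_in_parallel" ]; do
        ( echo "processed chromosome $1" >> "$outfile" ) &   # stand-in for a chromosome script
        shift
        batch=$((batch + 1))
    done
    wait    # block until the current batch finishes before queuing more
done
done_count=$(wc -l < "$outfile")
```

Setting no_in_parallel=1 would reproduce the consecutive (bg_mode=0) behaviour.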

Adding a new resource
This section describes the steps required to add a new resource. It is assumed here that the resource is to contain a fictitious program called program_x that predicts genes from chromosome sequences. The primary goal is to deduce protein sequences from the gene sequences:
1. Install program_x and append the program location to the PATH variable so that the program will run from any directory. The best place to add the location is the user's .bash_profile file, e.g. PATH=$PATH:$HOME/Gene_Prediction_Programs/program_x/bin
2. Test that the program runs with sample data and ensure it can be invoked from any directory. Furthermore, determine the input and output requirements.
[NEW_RESOURCE] - this must be the same name as that used in step 3.
Provided you have added relevant programs to create files containing gene and protein sequences, no further amendments or steps are required.
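Step 1 above, as it would appear in .bash_profile, together with a portable check that the (fictitious) directory actually landed on PATH:

```shell
# Append the fictitious program_x bin directory to PATH
PATH="$PATH:$HOME/Gene_Prediction_Programs/program_x/bin"
export PATH
# Verify the directory is now on PATH
case ":$PATH:" in
    *":$HOME/Gene_Prediction_Programs/program_x/bin:"*) on_path=yes ;;
    *) on_path=no ;;
esac
```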

PART B -Run Pipeline
This section describes everything you need to know to automate the process of high-throughput in silico vaccine candidate discovery for eukaryotic pathogens. The primary goal here is to computationally generate a file containing only the protein sequences (in a FASTA format) that represent the predicted vaccine candidates for the target pathogen. A prerequisite for running the pipeline is a file in a FASTA format containing protein sequences from the target pathogen. These protein sequences can be downloaded from public databases and/or predicted following the steps described in PART A - Build Proteome.
The pipeline in this context is a framework of data-processing stages. Each stage in the pipeline is encapsulated as a resource that mainly contains commands to execute a central

2. Configure the installed programs, if required, for the target pathogen (see Configuration of prerequisite programs).
3. Add the target pathogen to startup.ini (see Configuration file - startup.ini).
4. Create a configuration file for the target pathogen (see Species configuration file).

Toxoplasma species is invalid; Toxoplasma_species is valid (directory names must not contain spaces). Also, remember that Linux/UNIX is case sensitive.
The contents of the directories are:
start - contains the Perl script, called startup, that invokes Vacceed. The master Linux script is also created in this directory.
config_dir - a directory within the start directory that contains the species-specific configuration files.
pipeline - the parent directory that contains all the resource directories.
[Resource name] - a separately named but identical directory structure is used for each evidence prediction resource. Each resource directory contains two subdirectories: output (contains the main output files from the resource programs) and scripts (contains the Linux scripts that invoke the resource programs). Some resources contain an additional directory called training_files, which contains the necessary files for training the resource.
common_programs -contains programs that are common to more than one resource.
proteome - contains the prerequisite input file to run the pipeline, i.e. a file containing protein sequences from the target pathogen in a FASTA format. It also contains the main output file from the pipeline: a file containing protein sequences for predicted vaccine candidates in a FASTA format.

Prerequisite Programs
Perl - Vacceed has been developed and tested on Perl 5.10.1 for Linux. The following Perl modules MUST be installed:
For the most part, the resource scripts will not need to be altered unless a program parameter is required to be changed. ignore_predictors is used to list the headers to ignore as predictors.

Configuration of prerequisite programs
Some of the prerequisite programs need to be configured specifically for the target pathogen: the MHC I binding predictor program needs a file containing MHC Class I alleles of the host of the pathogen. This is required for the peptide-MHC binding predictions. The file should be in a comma-delimited format with the columns: allele name and peptide length.

Example of an allele file for cattle
BoLA-T2C,8
BoLA-T2C,9
BoLA-T2C,10
BoLA-T2C,11
BoLA-T2C,12
BoLA-T2C,13
BoLA-T2C,14
It is recommended that you copy the appropriate allele file to the 'alleles' directory.
WoLF PSORT has a default sequence length restriction of 10000. If your input protein sequences are likely to be longer than this restriction, it is recommended that the Perl script 'checkFastaInput.pl', located in the bin directory of the WoLF PSORT installation, be edited.

Prerequisite Starting Data
The only absolute mandatory input required to run the pipeline is a file in a FASTA format containing amino acid sequences for proteins from the target eukaryotic pathogen. This file must be contained in the 'proteome' directory (see Directory structure following Vacceed installation).

Vacceed Execution
Vacceed is invoked with a Perl script from within a Linux/Unix shell. The Perl script is called startup and is located in the start directory (see Figure 8).
To run the pipeline: change directory to ~/vacceed/start and type:
perl startup <pathogen>
e.g. perl startup tg
Description of arguments: <pathogen> is a user-definable name that determines which configuration file to use (see Configuration file - startup.ini).

Configuration file -startup.ini
Each target pathogen requires its own configuration files (see Species configuration file).
Typically there is one species configuration file for building the proteome, and one for running the in silico vaccine discovery pipeline. An argument passed to the startup script dictates which species configuration file to use. For example, perl startup build tg or perl startup tg, where 'tg' is a user-definable code in the configuration file startup.ini (see Figure 9). The startup.ini configuration file contains 4 columns separated by a '<' character. The first column can be any number of characters and is used by the startup script to make the association with the appropriate species configuration file. Column 2 is simply a description and is not used by any program. Column 3 should be either 'build' to indicate that the species configuration file relates to building the proteome or 'pipeline' to indicate that the configuration file is used to run a vaccine discovery pipeline. Column 4 is the species configuration filename (any user-definable name). The startup.ini file is located in the start directory (see Figure 8).

Species configuration file
The core of Vacceed is a species configuration file in a header-key format. User-definable configuration files are required for each species. Typically, each species will have two configuration files: one for building the proteome and one for running the pipeline. Four example configuration files (template_build.ini, template.ini, toxoplasma_build.ini, toxoplasma.ini) are supplied with the Vacceed distribution. The distributed configuration files are found in config_dir under the start directory (see Figure 8).
The following section describes the content of a configuration file and presents an example of how a user can modify the contents to suit the target species. The first three steps in configuring Vacceed for a new target pathogen are to: 1) add a new line in startup.ini, 2) copy toxoplasma.ini (or another species configuration file if available) to <new_species>.ini, and 3) copy the entire template_species directory to a user-named directory, e.g. neospora for Neospora caninum.

# startup.ini - user-defined configuration files
# code<species<type e.g. build or pipeline<config<config directory path
tg<Toxoplasma gondii<build<toxoplasma_build.ini<$HOME/Vacceed/start/config_dir
tg<Toxoplasma gondii<pipeline<toxoplasma.ini<$HOME/Vacceed/start/config_dir
The <new_species >.ini configuration file needs to be modified appropriately for the target pathogen. Any line in the configuration file that begins with '#' is interpreted by the script as a comment. The Vacceed framework is built around the concept of a resource. A resource, in this context, is a program or group of programs executed as an independent modular unit.
That is, each module contains everything necessary to execute only one aspect of the desired functionality and can be run independently. Modular units improve maintainability and allow new resources to be added when required (see Adding a new resource).
The species configuration file is in header-key format (see Figure 9). For example, [Resources] is regarded as the header, and 'name' is the key. Once startup is invoked, the script executes each resource listed after the 'name' key in turn. A resource name, e.g. WOLF, can in principle be any name, provided that the same name is used consistently throughout the rest of the configuration file. For example, instead of WOLF one could use 'WoLF_PSORT'. The resource names can be in any order with the exception of VALIDATE and EVIDENCE, which must always be first and last in the list, respectively. In most cases, only keys under the Main header (see Figure 10) will need to be modified by the user. Any key can be used as a variable replacement in the rest of the configuration file.
That is, a '$' character preceding a word denotes a variable, e.g. $work_dir is replaced by '$HOME/vacceed' throughout the configuration file on execution of startup.
email_url - e-mail address. An e-mail will be sent to indicate that the proteome either built successfully OR failed. A log file is attached to the e-mail.

Figure 10: Main -extract from a species configuration file
The keys under the Variables header (see Figure 11) are essentially used to save on typing and to limit the number of changes required to the configuration file. The user can add any number of variables.

[Main]
work_dir="$HOME/vacceed"
species_dir="toxoplasma"
master_script="master_script"
build_script_only="NO"
log_file="$work_dir/$species_dir_logfile.txt"
email_url=Fred.Bloggs@student.uts.edu.au

[Variables]
proteome_fasta="test.fasta"
prot_id_prefix="sp"
proteome_dir="$work_dir/$species_dir/proteome"
common_dir="$work_dir/$species_dir/pipeline/common_programs"
evidence_dir="$work_dir/$species_dir/pipeline/evidence/output"
resource_dir="[Resources.name]" # do not change

The resourceName should be consistent with the name of the resource used under the [Resources] header. Figure 12 represents a typical configuration for a resource. The directory for each resource contains an identical structure of two directories, which by default are called output and scripts. The list of keys under the header resourceName_files is used to specify variables for filenames. These variables are used as arguments to the resource script. It is highly recommended that careful attention is paid to checking filenames because they may be species-specific.

Resource Scripts
In the distribution version of Vacceed, each resource has one main Linux script (typically named after the resource) that in many instances creates subordinate scripts, each processing a subset of the total number of proteins (see Running scripts in parallel). These scripts are saved in the scripts directory (see Figure 8). Each script (e.g. script1) contains all the required commands to execute the resource and to extract relevant evidence for a particular protein characteristic. There is a set hierarchical structure for the execution of all Vacceed scripts e.g.

Output Files
The two most important output files are vaccine_candidates and vaccine_candidates.fasta, which are saved in the proteome directory (see Directory structure following Vacceed installation).
vaccine_candidates contains an ordered list in a table format of each known and predicted protein in the proteome of the target pathogen (see example extract below). The order is determined by a final score based on the average of the machine learning (ML) scores.
There is one protein per row. Each ML column contains an average probability score between 0 and 1, which represents the likelihood, or confidence level, that a 'YES' vaccine classification is correct. The R functions for adaptive boosting, random forest, SVM, and the naive Bayes classifier support class probabilities, i.e. an estimated probability of each protein belonging to the 'YES' and 'NO' classes. The output from the R functions for the k-nearest neighbour classifier and the neural network is only a binary 'YES' or 'NO'; the score from these latter algorithms used in vaccine_candidates is therefore an average frequency of 'YES' vaccine candidacy. The average of all machine learning (ML) scores is reported in the last column, and this value determines the order in which the proteins appear in the list. There is also a column that indicates the probability that the protein is 'real'. This probability score is extracted from proteome_info.txt (see PART A -Output Files).
vaccine_candidates.fasta contains the protein sequences, in FASTA format, of only those predicted vaccine candidates whose average ML score and existence score from vaccine_candidates are greater than the user-defined threshold values. The threshold scores are assigned to the ml_threshold and existence_threshold keys under the EVIDENCE resource in the species configuration file (default value = 0.75).
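As an illustration of the threshold filter, the following sketch selects proteins whose existence and average ML scores both exceed the defaults. The three-column layout (protein ID, existence score, average ML score) is an assumption for the sketch, not the actual vaccine_candidates format.

```shell
# Hypothetical three-column table: id, existence score, average ML score.
cat > vaccine_candidates.tab <<'EOF'
prot1 0.98 0.91
prot2 0.80 0.60
prot3 0.40 0.95
EOF
ml_threshold=0.75
existence_threshold=0.75
# Keep only proteins passing both thresholds.
awk -v ml="$ml_threshold" -v ex="$existence_threshold" \
    '$2 > ex && $3 > ml { print $1 }' vaccine_candidates.tab
# prints: prot1
```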
It is recommended that both 'vaccine_candidates' and 'vaccine_candidates.fasta' be examined after startup has finished and no errors were detected. If these files are empty or have missing or unexpected data, the next recommended step is to review the files in the output directory of the evidence resource. This latter directory should contain an output file from each resource. If a particular resource output file is missing or not what you expected, then probing the files in the output directory of the resource in question may provide some clues to the source of the problem. Each resource can potentially generate many output files.
The output filenames, a description of their contents, and the programs that generate them are listed in the following table. In general, the files represent an audit trail for the various steps performed by each resource. For the most part, you do not normally need to be concerned with these output files unless there is a requirement to scrutinise the quality of each step.

Running scripts in parallel
The resources encapsulate, for the most part, a large number of independent computation-intensive tasks. Vacceed takes advantage of multi-core processors. The default when running the pipeline is to divide the protein sequences into a number of temporary files so that the files can be processed in parallel. The number of proteins in each temporary file is the total number of proteins to be processed divided by the number assigned to the 'split_by' variable in the resource script (see Resource Scripts). For example, if the total number of proteins is 5000 and split_by = 6, then five temporary files containing 833 protein sequences and one containing 835 are created. The default value for 'split_by' is the number of CPUs, but you can override this by assigning the desired split number to 'split_by'. The temporary files can be run in parallel or consecutively depending on the setting of bg_mode (default is bg_mode=1 for parallel processing - see Resource Scripts).
If there are more temporary files than CPUs, then the surplus files are queued i.e. when a file has finished processing a new one will commence.
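The split arithmetic and background execution described above can be sketched as follows (variable names are illustrative, not Vacceed's own).

```shell
# Split arithmetic from the 5000-protein example above.
total=5000
split_by=6
base=$((total / split_by))     # proteins per file
rem=$((total % split_by))      # remainder goes to the last file
echo "$base $((base + rem))"   # prints: 833 835

# Parallel execution: launch each chunk in the background, then
# wait for all of them to finish (the shell queues nothing itself;
# Vacceed's own scripts manage the surplus-file queue).
for i in 1 2 3; do
  ( echo "chunk $i done" > "chunk$i.log" ) &
done
wait
cat chunk1.log chunk2.log chunk3.log
```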

Adding a new resource
This section describes the steps required to add a new resource. It is assumed here that the resource is to contain a fictitious program called program_z that predicts a particular protein characteristic. The primary goal is to extract relevant evidence from the output of program_z to add to the overall evidence profile for the purpose of vaccine candidacy decision making using machine learning algorithms.

1. Install program_z and append the program's location to the PATH variable so that the program will run from any directory. The best place to add the location is the user's .bash_profile file, e.g.
PATH=$PATH:$HOME/Pipeline_Programs/program_z/bin
2. Test that the program runs with sample data and ensure it can be invoked from any directory. Furthermore, determine the program's input and output requirements.
3. Add the new resource name in the appropriate configuration file for running the pipeline:
[Resources]
name=VALIDATE,WOLF,TMHMM,TARGETP,PHOBIUS,NEW_RESOURCE,EVIDENCE
4. Add a new section to the same configuration file. The easiest way to do this is to copy an existing resource section and amend it accordingly. The text highlighted in red indicates the only parts expected to change (see Species configuration file for more details).
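A copied-and-amended section might look like the sketch below. The key names shown are illustrative only; copy the real keys from an existing resource section in your configuration file.

```
[NEW_RESOURCE]
# Illustrative only -- copy an existing resource section and amend it.
resource_dir="$work_dir/$species_dir/pipeline/new_resource"
script="program_z_script"

[NEW_RESOURCE_files]
# Filenames passed as arguments to the resource script.
output_file="new_resource_out.txt"
```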

Introduction
The current trend in vaccine development is towards epitope-based vaccines, due to their potential to be more specific, safer, and easier to produce than traditional vaccines [1]. The key to subunit vaccine development is the successful identification of proteins of a pathogen, as opposed to the entire organism, that evoke a protective, safe immune response. Proteins that are present on the surface of the pathogen, or are secreted from it, are the most likely to induce an immune response and are consequently the target of this study. Five programs (WoLF PSORT [2], SignalP [3], TargetP [4], TMHMM [5], and Phobius [6]) were used to predict protein characteristics relevant to subcellular location, given amino acid sequences as input.
It is the recognition of epitopes on pathogens by T- and B-cells (and soluble antibodies) that activates the cellular and humoral immune responses [7]. The premise here is that if a high affinity epitope can be associated with a protein, then this provides further evidence for the protein's vaccine candidacy. Two programs (MHC I Binding Predictor and MHC II Binding Predictor [8,9]) were used to predict peptide binding to MHC class I and class II molecules.
All seven programs were chosen because they are applicable to eukaryotes, can be freely downloaded, run in standalone mode, allow high-throughput processing, and execute in a Linux environment. All of the programs can also be executed via web interfaces. However, processing enormous amounts of input is currently impractical through web interfaces, and in particular the web versions of WoLF PSORT, SignalP, and TargetP restrict the number of input sequences; hence standalone versions of the programs were employed here in a Linux environment.

WoLF PSORT
WoLF PSORT computationally predicts a protein's localisation and in effect mimics the biological mechanism of protein sorting [10], by which a protein, after its encoding and synthesis, is transported to the appropriate position in or outside the pathogen. The main determinant of a protein's localisation is its amino acid sequence [2]; in effect, the sequence contains a delivery address. Many programs have been developed to predict subcellular locations of proteins [4,11], most of them web based. The prediction methods can be broadly grouped into two classes: rules/knowledge-based and machine learning. A rules-based method exploits static knowledge of what determines subcellular location, whereas a machine learning method dynamically utilises training data to identify subcellular locations by focusing on the differences between proteins from different known locations. PSORT [12] is a well-known, widely used example; PSORT II [12] is both rules- and machine learning-based.
WoLF PSORT is an extension of the PSORT II program and can be used for the prediction of protein localisation sites in eukaryotic pathogens [2]. It requires as input full-length amino acid sequences of proteins in FASTA format. The program detects sorting signals within the sequence, from which it then computationally predicts the protein's subcellular localisation. Signal detection is achieved by applying stored rules for various sequence features with specific criteria of known sorting signal motifs, e.g. the feature is a GPI-anchor and the criterion for the feature is 'type-1a membrane protein with a very short tail'. One of the extensions to PSORT II implemented by WoLF PSORT is the identification of localisation features using amino acid composition and functional motifs such as DNA-binding motifs. A weighted k-nearest-neighbour classifier estimates the likelihood of a protein being sorted to each candidate site (referred to in the program as a localisation class) and outputs the most probable sites with a score. A training dataset is required comprising protein sequences with known localisation labels. The training set supplied with the program is stated to be applicable to animals and contains 12,000 UniProt sequences [2]. Figure A1 shows a typical output from WoLF PSORT. Information about each protein sequence is displayed on a separate line (only three sequences are shown in Figure A1). Each field along the line contains a localisation class (based on UniProt "Subcellular Localization" field keywords) and a score, separated by a comma. There are 12 localisation classes, which also map to Gene Ontology (GO) terms. As an example of how to interpret the output in Figure A1, protein 'seq1' has six candidate sites listed in descending order of likelihood based on the score. The most likely site is extracellular (extr) and plasma membrane (plas), i.e. there is dual localisation, with a score of 11.5.
The plasma membrane (on its own) is the next most likely site, followed by extracellular, endoplasmic reticulum (E.R.), lysosome (lyso), and finally peroxisome (pero).
The accuracy of WoLF PSORT is influenced by the number of examples of each localisation site in the training data, e.g. sites with few examples in the training dataset are seldom correctly predicted.
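Because the candidate sites are listed in descending order of likelihood, extracting the top-ranked class from a WoLF PSORT-style line is a one-field parse. The line format below is an assumption reconstructed from the description above ("id class score, class score, ..."), not verbatim program output.

```shell
# Hypothetical WoLF PSORT-style output line: candidate sites are
# ordered by descending score, so field 2 is the top-ranked class.
line='seq1 extr 11.5, plas 11.5, E.R. 2, lyso 1, pero 0.5'
top=$(printf '%s\n' "$line" | awk '{ print $2 }')
echo "$top"   # prints: extr
```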

SignalP
One of the most well-known protein sorting signals is the secretory signal peptide, which targets its passenger protein to the secretory pathway via the endoplasmic reticulum. The secretory pathway is a series of steps that ends with the secretion of a protein through the cell plasma membrane to the outside of the pathogen. It is important to know that not all secretory proteins have signal peptides, or are necessarily secreted to the outside of the pathogen [4].
Some proteins have specific retention signals that hold them back in the ER or the Golgi or divert them to the lysosomes [4]. There are many different types of secretory signal peptides but the most common type is the signal peptide cleaved by signal peptidase. Although there are no simple consensus sequences, three distinct compositional regions on the peptide's amino acid sequence help define this type of signal peptide: N-terminal, central hydrophobic, and C-terminal regions. Specific motifs that target the protein are within the N-terminal, and the signal peptidase cleavage site that precedes the mature protein is within the C-terminal.
Signal peptides are cleaved off while the protein is translocated through the cell membrane [13]. There are also signal peptides that are not cleaved called signal anchors i.e. a transmembrane protein with one transmembrane segment near the N-terminal of the protein [14,15].
Secretory signal peptides can be computationally predicted using machine learning techniques, such as neural networks and hidden Markov models (HMMs) [16]. The program SignalP (version 4.0) predicts the presence and location of the signal peptidase I cleavage site at the C-terminal end of the presequence, and classifies each residue in the presequence as either belonging or not belonging to a signal peptide using two neural networks. Two different types of negative data were used for training the neural network models: sequences (mostly derived from UniProt) with transmembrane regions located within the first 70 residues from the N-terminal were used to train the SignalP-TM network, and sequences from non-secretory proteins were used to train the SignalP-noTM network. If the SignalP-TM network predicts four or more positions as transmembrane positions, SignalP-TM is used for the final prediction; otherwise SignalP-noTM is used. Training data for eukaryotes are supplied with SignalP.
The input format required is multi-FASTA. It is recommended in the SignalP user manual that only the first 50 to 70 amino acids of each sequence should be used in the prediction (N-terminal peptides typically comprise 15-30 amino acids), as longer sequences increase the risk of false positives. To restrict the length of the input sequence, a command-line parameter is used (e.g. -trunc 70). An example of the summary output from SignalP is shown in Figure A2.
SignalP computes five different scores between 0 and 1: 1) Cmax is the maximum "cleavage site" score (a C-score is calculated for each position in the submitted sequence, and a significantly high score indicates a cleavage site); 2) Ymax is a derivative of the C-score combined with the S-score, resulting in a better cleavage site prediction than the raw C-score alone; 3) Smax is the maximum "signal peptide" prediction score (an S-score is calculated for every amino acid position in the submitted sequence; a high score indicates that the corresponding amino acid is part of a signal peptide, and a low score indicates that it is part of the mature protein); 4) Smean is the average of the S-scores; and 5) D is the average of Smean and Ymax. Position (pos) is the location in the amino acid sequence where Cmax (i.e. the cleavage site position), Ymax (i.e. the signal peptide length), and Smax occur. The "Y" or "N" is a yes or no indication of whether the sequence has a cleavage site and a signal peptide, determined by whether D is above or below the Dmaxcut.
High scores also indicate that the sequence is a secretory protein. According to the authors of SignalP, a high D-score is the best indicator of secretory proteins [14].
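The D-score decision above reduces to a single comparison against the cutoff. In this sketch, the cutoff value of 0.45 is an illustrative assumption, not a value taken from the SignalP manual.

```shell
# Sketch of the D-score verdict: "Y" (signal peptide) when D
# exceeds the cutoff, "N" otherwise. The 0.45 cutoff is assumed
# for illustration only.
D=0.62
cutoff=0.45
verdict=$(awk -v d="$D" -v c="$cutoff" \
  'BEGIN { if (d > c) print "Y"; else print "N" }')
echo "$verdict"   # prints: Y
```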

TargetP
TargetP is similar to SignalP. Neural networks are also implemented to predict subcellular locations of eukaryotic protein sequences. More specifically, TargetP predicts the presence and length of secretory pathway signal peptides (SP) and mitochondrial targeting peptides (mTP) in the N-terminal presequences [17]. As with SignalP, the input is protein sequences in FASTA format. An example of TargetP output is shown in Figure A3. Len is the sequence length. The reliability class (RC) runs from 1 (most reliable) to 5 (least reliable) and is a measure of prediction certainty. The truncated peptide length (TPlen) indicates the predicted presequence length to the cleavage site.

TMHMM
Being exposed to the outside environment, surface membranes of pathogens are in full view of a host's immune system surveillance. Consequently, membrane molecules, including proteins spanning or anchored to the membrane, are likely to be antigenic. A transmembrane protein that spans an entire membrane has a predominantly hydrophobic domain consisting of one or multiple α-helix motifs [5]. Numerous programs to predict transmembrane helices have been developed over the last 30 years, including DAS, SOSUI, SPLIT, TMAP, TMpred, TopPred 2, MEMSAT, HMMTOP, ALOM 2, and TMpro. Moller and colleagues evaluated methods for the prediction of membrane-spanning regions [18]. Most prediction methods are based on the hydrophobicity of amino acid residues and/or the abundance of positively charged residues on the cytoplasmic side of the membrane and/or the protein's topology patterns of cytoplasmic and non-cytoplasmic loops. These methods are applicable to almost all organisms [5]. The majority of programs are web servers and therefore are unsuitable for high-throughput processing.
The program TMHMM, based on a hidden Markov model approach [5], predicts transmembrane helices in protein sequences given in FASTA format. Figure A4 shows a typical output.

Phobius
An evaluation of signal sequence prediction methods conducted by Menne and colleagues indicated that SignalP was more sensitive than other methods but included many false positive predictions [19]. An inherent problem in signal peptide prediction is that the high similarity between the hydrophobic regions of a transmembrane (TM) helix and a signal peptide (SP) can result in a TM helix being falsely classified as an SP or, conversely, an SP being falsely classified as a TM helix [15]. Phobius is a combined transmembrane domain and signal peptide predictor that can help discriminate between TM helices and SPs, and can also add endorsement to TMHMM predictions. The different sequence regions of a signal peptide and a transmembrane protein are modelled in Phobius with hidden Markov models. Figure A5 shows the output from Phobius in its short format. The output consists of one line per protein sequence (SEQENCE), giving the number of transmembrane (TM) helices, a "Y" or "N" indicator of whether the sequence has a signal peptide (SP), and a predicted topology (information for only one protein sequence is shown).

MHC Binding Predictors
One of the foremost resources for T-cell MHC class I and II binding prediction tools is the Immune Epitope Database Analysis Resource (IEDB). IEDB provides a downloadable Linux package (for a 32-bit system) that contains a collection of peptide binding prediction tools for MHC class I and class II molecules. Included in the package are NetMHCpan and NetMHCIIpan, which are extended versions of NetMHC and NetMHCII.
The collection of tools is a mixture of Python scripts and Linux specific binary files. Python 2.5 or higher is therefore a prerequisite to run the tools. These tools take as input an amino acid sequence (or a set of sequences) and determine the ability of each subsequence to bind to a specific MHC molecule. For MHC class I the available prediction methods are: artificial neural network (ANN) [20], Average relative binding (ARB) [21], Stabilized matrix method (SMM) [22], SMM with a Peptide-MHC Binding Energy Covariance matrix (SMMPMBEC), Scoring Matrices derived from Combinatorial Peptide Libraries (Comblib_Sidney2008) [23], Consensus [24], and NetMHCpan [25]. A large scale evaluation of three MHC class I binding prediction methods (ANN, SMM, and ARB) was conducted in 2006 [26]. Each of the three methods predicts the quantitative affinity of a peptide for an MHC molecule. In the evaluation, the predicted affinities of all three methods were compared to a collection of experimentally measured peptide affinities to MHC class I molecules. Linear correlation coefficients were calculated between predicted and measured affinities on a logarithmic scale.
The evaluation reported that ANN performed the best in a statistically significant manner (with a correlation coefficient of 0.69), followed by SMM (0.62) and then ARB (0.55) [26].
The MHC II binding peptide predictions are more computationally challenging than for MHC class I and this seems to be reflected in the inferior prediction performance of class II algorithms in comparison to those in class I [21]. In the IEDB download package, the available prediction methods for MHC class II are: Consensus [27], Average relative binding (ARB) [21], combinatorial library (unpublished method), NN-align [28] (this method is the equivalent to netMHCII version 2.2), SMM-align [29] (equivalent to netMHCII version 1.1), Sturniolo [30] (a method also used in the program TEPITOPE [31]), and NetMHCIIpan [32].
Wang and colleagues assessed MHC class II peptide binding prediction methods in 2008 [27].
The IEDB curators rank Consensus as the best method, followed by NN-align, SMM-align, combinatorial library, Sturniolo, and ARB.
The performance of a prediction method is governed by the availability of MHC alleles; in other words, not all methods can currently make predictions for all MHC allele and peptide length combinations (e.g. there may be insufficient experimental data available to cover a combination). The Consensus method is recommended by the creators because it tries several prediction methods consecutively. For example, for each MHC class I allele and peptide length combination, the ANN method is tried first, then SMM, then comblib_Sidney2008, then ARB, and finally NetMHCpan if no previous method was available for the allele-length combination.
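The Consensus fallback order above amounts to taking the first available method from a fixed priority list. The sketch below illustrates this; the availability value is invented for the example, and the real tools resolve availability internally.

```shell
# Sketch of the Consensus fallback: walk the priority list and
# take the first method available for this allele/length
# combination ("NetMHCpan" availability is invented here).
available="NetMHCpan"
for method in ANN SMM comblib_Sidney2008 ARB NetMHCpan; do
  case " $available " in
    *" $method "*) chosen=$method; break ;;
  esac
done
echo "$chosen"   # prints: NetMHCpan
```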
Prediction methods are encapsulated in two programs: predict_binding for MHC class I and mhc_II_binding for MHC class II. The method to use in the prediction is given as a command-line parameter. Figure A6 shows an example of the command line syntax. Only one MHC allele is analysed at a time. Therefore, the relevant program needs to be executed multiple times for each possible MHC allele-peptide length combination. Figure A7 shows a typical output from the MHC class I predictor using a Consensus method (some columns have been deleted and the format adjusted to fit output on the page).
Beginning at the start amino acid (numbered 1) of each sequence (denoted by #), a test subsequence of a specific peptide length (e.g. PepLength = 9) is created (e.g. Sequence = MSMEGDRPS, located from amino acids 1 to 9 on sequence input #1). The subsequence is scored (e.g. in units of IC50 nM) for binding affinity against an MHC allele, e.g. HLA-A*02:05. Using the chosen prediction method, scores are calculated for each amino acid at each position in the subsequence, which are then combined to yield the overall binding affinity.
In the example in Figure A7, the NetMHCpan method was used because no previous method was available for the allele-length combination. However, the output could in theory contain scores from multiple methods if more than one method was available for the allele-length combination.
The next test subsequence in Figure A7 is "SMEGDRPSG", from amino acids 2 to 10 on sequence input #1, and is scored against the same MHC allele, and so on. The lower the IC50 value, the greater the binding affinity between the MHC allele and the subsequence. As per Figure A7, there are multiple peptide affinity scores for each protein. The purpose of using the IEDB epitope prediction tools was to gather further evidence of a protein's vaccine candidacy rather than to identify specific epitopes for vaccine development. Ideally, a single score was required to encapsulate the collective potential of the epitopes on a protein antigen.
In Vacceed, random forest (a machine learning algorithm) is used to predict a single probability value that a protein has vaccine candidacy potential.
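For intuition only: before any model-based reduction, a crude single-number summary per protein is its best (lowest) IC50 across all tested peptides. Vacceed itself feeds the scores to a random forest, so this is not Vacceed's method, and the three-column layout (protein ID, peptide, IC50) is hypothetical.

```shell
# Hypothetical per-peptide scores: protein_id, peptide, ic50 (nM).
cat > mhc_scores.tab <<'EOF'
prot1 MSMEGDRPS 4300
prot1 SMEGDRPSG 120
prot2 AAAAAAAAA 900
EOF
# Lowest IC50 per protein (lower IC50 = stronger binding).
awk '!(ic[$1]) || $3 < ic[$1] { ic[$1] = $3 }
     END { for (p in ic) print p, ic[p] }' mhc_scores.tab | sort
```

This prints one line per protein (prot1 with 120, prot2 with 900), illustrating why some reduction from many peptide scores to one protein-level value is needed.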