Large-scale structure-informed multiple sequence alignment of proteins with SIMSApiper

Abstract Summary SIMSApiper is a Nextflow pipeline that creates reliable, structure-informed MSAs of thousands of protein sequences faster than standard structure-based alignment methods. Structural information can be provided by the user or collected by the pipeline from online resources. Parallelization with sequence identity-based subsets can be activated to significantly speed up the alignment process. Finally, the number of gaps in the final alignment can be reduced by leveraging the position of conserved secondary structure elements. Availability and implementation The pipeline is implemented using Nextflow, Python3, and Bash. It is publicly available on github.com/Bio2Byte/simsapiper.


Technical documentation
This documentation pertains to SIMSApiper version 1, published on March 20st, 2024.We continuously work on expanding and improving the pipeline.For the most current version, please visit our GitHub.
Recommended settings can be enabled using the flag --magic.The pipeline can be launched using this command line or the launch file called "align magic.sh" to run the toy example dataset.

Options and Parameters
By default most flags are set to False.Adding a flag to the command line will set it to True and activate it.Some flags can carry additional information, such as percentages or filenames.The complete list can be found table S3.

Input files
Config file: Nextflow.config The config file holds all parameters needed to adapt SIMSApiper to your system and their default values.It includes execution profiles for local, server and HPC-SLURM as well as Conda, Docker and Singularity.Standard parameters and execution profiles can be easily adapted to the users system.

Sequence Input (--seqs)
Sequence data can be presented either as one file with all sequences or multiple sequence files, which can later be used for subsetting (--use subsets).Sequence data can be provided in any file format supported by the biopython package.Should the user plan to retrieve structural information via AlphaFold Database (--retrieve), sequence IDs should start with the Uniprot ID.

Structural input (--structures)
Structural input is optional, but if provided, it must be in .pdbformat of any origin (experimentally generated, from Protein Data Bank or modelled).Sequence labels and the structure filenames must match exactly.If experimental structures are used, only one chain should be provided to SIMSApiper.Mutations or less then 5% sequence divergence between input-sequence and model-sequence will not impact the MSA quality if the mutations do not impact the overall organisation of the protein.Modelled structures by ESMF, or AlphaFold2 using ColabFold can be used to omit this constraint.

Convert Sequence Files:
Parse sequence files using biopython to provide sequences in fasta-2-line format.

Reduce dataset size ( --dropSimilar):
Calculation time scales exponentially with number of input sequences.We therefore recommend at least a cutoff of 90% SI to remove redundant sequences or 70% to have a more general overview of the protein family and address sampling bias.SIMSApiper uses CD-Hit to cluster the input sequences and keeps cluster representatives in the dataset to align.Proteins inside clusters will be excluded from the downstream processes, but important proteins (such as reference sequences) can be added to the representatives with --favoriteSeqs.
1.3 Quality control for input sequences (--seqQC): Too many non-standard amino acids can result in a poor alignment, we therefore remove all sequences containing more than --seqQC % non-standard or unresolved amino acids from the process.

Match sequence with structure information
2.1 Identify missing models: SIMSApiper matches sequence ID and structure model filenames, and compiles a list of all protein sequences that have no structural information available.
2.2 Retrieve models from AlphaFold2 Protein Structure Database (--retrieve): AlphaFold2 models are considered the most accurate protein structure predictions, and for most Uniprot entries there already exists a pre-computed model in the AF2 DB.Where possible, we enable users with SIMSApiper to avoid re-computing existing models and collect these models automatically for them.

Identify missing models:
The list gets updated and all proteins for which SIMSApiper could find model in AF2 DB are removed.

Model Sequences with ESMF (--model):
The ESM Atlas provides very fast prediction of novel protein structure models from sequence without the need for Uniprot ID-labelled sequences.We use the ESM Atlas API to submit all sequences shorter then 400 residues and collect the resulting models.

Identify missing models:
The list gets updated and all proteins for which SIMSApiper could generate a model with ESMF are removed.
2.6 Assess number of structure models and sequences to be aligned (--strucQC): After these rounds of data collection, the final amount of structural information is assessed.The final alignment quality decreases if too many sequences are not matched to a model, and we therefore establish a minimal cutoff.SIMSApiper fails if there is too many missing models before the time-intensive T-Coffee step to allow users to add more information manually or continue the run with a more permissive cutoff.Sequences not matched to a model are collected in a separate file to aid this.

Subsetting -Division of the dataset into different subsets
Calculation time and space requirements of T-Coffee alignments scale exponentially with the number of sequences and the length of these sequences.We therefore established two presets: For an average sequence length shorter than 400 residues, maximally 100 sequences per subset are permitted.For longer sequences we suggest a cutoff of 50.Users can also set a custom maximum number of members per cluster with --maxSubsetSize.Sequences that do not fit with any subset can be labelled orphan, and will later be aligned to the T-coffee alignments with MAFFT based on sequence information.

User-provided subsets (--useSubsets):
Users can generate these subsets based on prior knowledge on function, phylogenetic relationships or SI and provide these subsets in as separate sequence files to SIMSApiper.These file will still be filtered to only include sequences for which a model has been found.

Automatically generated subsets (--createSubsets):
SISMApiper can create subsets automatically with PSI-CD-HIT and a low SI cutoff.If the CD-Hit generated clusters are are too large to become subsets, we split these clusters to obtain evenly distributed clusters smaller then --maxSubsetSize.If there are too many sequences that do not fit in any cluster, we interactively decrease the CD-Hit SI threshold by 5% until --minSubsetIdentity.
We observed that we obtained the best alignments when we submitted the minimal number of subsets, and provide the setting --minSubsetIdentity "min" to reduce overall number of subsets by collating small clusters and singletons until --maxSubsetSize reached.

Run MAFFT (--mafftParams)
After all T-Coffee subalignments are ready, these and any orphan or structureless sequences are combined into the final alignment using MAFFT.MAFFT mode 'merge' is conserving the subalignment structure.Users can add other mafft flag or alignment modes with --mafftParams "--localpair --maxiterate 100" 6. Run DSSP (--dssp) All structure models are translated into 2D secondary structure nomenclature using the DSSP codes 7. Improve MSA

Map DSSP to MSA
SIMSApiper maps DSSP sequences for each model on sequences of the MSA, conserving the gaps.We permit model sequence and alignment sequence to diverge in up to 5% of residues if the protein length is maintained (point mutations).If the length is not maintained, insertions/deletions can only appear at the C-/N-terminus.If these conditions are not met, the sequence will be excluded from this step and remains unconverted in the mapped alignment.
7.2 Squeeze MSA towards conserved secondary structure elements (--squeeze): MSA in unstructured or less conserved regions (e.g.loops) are usually very disperse.The MAFFT-merge step additionally is prone to add many gaps to the alignment.SIMSApiper identifies conserved secondary structure categories such as helices and β-sheets with --squeeze "H,E", and squeezes the MSA towards these regions.DSSP codes representing helices, i.e, H, G and I, are considered the same by SIMSApiper We also implemented a minimum of 3 consecutive conserved columns to be considered a region.The percentage threshold for region to be 'conserved' can be set with --squeezePerc.

Map DSSP to squeezed MSA
The alignment of secondary structure elements has been a useful tool for us to assess the quality of a MSA without a reference alignemnt.SIMSApiper maps the DSSP codes on the squeezed MSA as well to facilitate this analysis.The same constraints apply as before.

Reorder MSA (--reorder)
Order MSA according to the order of sequences in the input files.If more then one sequence input file is provided, the MSA will be reordered based on the order given in the files organized alphabetically.Instead of alphabetically, the user can also select explicitly the order of the files with --reorder "gamma.fasta,delta.fasta".

Convert MSA
SIMSApiper outputs the final MSA in fasta-2-line format, but can convert to any format supported by python package biopython.

Log files
SIMSApiper provides different log files containing information about the run and resource usage, or the data analyzed.
Fig. S1.Representation of data handling and tools in SIMSApiper.File and directory names match automatically generated out directory.Flags related to each step in bold.Detailed explanations for every step as well as all possible flags can be found in the Technical documentation below.
Fig. S2.Computational resources needed to generate a high-quality MSA of the TIM barrel protein family with SIMSApiper.Left: CPU, memory, runtime and flags needed by SIMSApiper to align 379 proteins using subsets (step 3) to speed up the alignment step (step 4).The data was split in 6 subsets based on its Sequence Identity (SI).Right: CPU, memory, runtime and flags needed by SIMSApiper to align 379 proteins (step 4) when step 3 is skipped.The entire dataset was submitted in one batch to T-Coffee.This results in a significant increase in required resources and highlights the importance of step 3.The plots were automatically generated by Nextflow and the values are represented by box-plots as different jobs ran in parallel to speed up the processes.
Fig. S3.Aligning highly divergent proteins with conserved secondary structure with SIMSApiper.a) b) c) Secondary and tertiary structure of natural and two designed TIM barrels.β-sheets are in cyan, α-helices are in magenta, loops in grey.The natural TIM barrels consist of 4 antisymmetric (βαβα)units[Wierenga, 2001], sTIM 11 consists of 4 symmetric (αβαβ)[Figueroa et al., 2013] units and Octarellin V1 presents a (αβα) sandwich architecture[Figueroa et al., 2013].d) Sequence and structural comparison of a natural TIM barrel (human TPIS, UNIPROT id:P60174, PDB id:1HTI) and two designed TIM barrels.Both at the sequence and structural level, the designed TIM barrels have relatively low identity.e) Alignment of the secondary structure elements of natural and designed TIM barrels obtained with SIMSApiper.SIMSApiper successfully aligned the secondary elements of the different TIM barrels

Table S1 .
Computer architecture used by SIMSApiper for each validation dataset.This hardware is part of the VSC Tier-2 general-purpose clusters provided by VUB-HPC.As independent steps or the alignment of the different subsets can happen in parallel, the two types of architecture available can be used depending on general cluster work load.

Table S2 .
Column Scores (CSs) between MSASIMSApiper and MSA ref per region and overall.

Table S3 .
All avilable flags in SIMSApiper with default values and explanation of function and recommended input (type).AA stands for amino acids in this table.