BioStructMap: a Python tool for integration of protein structure and sequence-based features

Abstract Summary A sliding window analysis over a protein or genomic sequence is commonly performed, and we present a Python tool, BioStructMap, that extends this concept to three-dimensional (3D) space, allowing the application of a 3D sliding window analysis over a protein structure. BioStructMap is easily extensible, allowing the user to apply custom functions to spatially aggregated data. BioStructMap also allows mapping of underlying genomic sequences to protein structures, allowing the user to perform genetic-based analysis over spatially linked codons—this has applications when selection pressures arise at the level of protein structure. Availability and implementation The Python BioStructMap package is available at https://github.com/andrewguy/biostructmap and released under the MIT License. An online server implementing standard functionality is available at https://biostructmap.burnet.edu.au. Supplementary information Supplementary data are available at Bioinformatics online.


Availability and installation
BioStructMap also has soft dependencies on the NCBI BLAST+ tool (https://www.ncbi.nlm.nih.gov/guide/howto/runblast-local/) and Exonerate (https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate). If you choose not to install these, or don't want to use them, all sequence alignments will be performed using the Biopython Bio.pairwise2 module. This should work just as well if your reference sequence is reasonably similar to the sequence of the PDB file. If this is not the case, then we suggest that a better approach may be to build a homology model using MODELLER (Webb and Sali 2016) and use this instead of using the poorly aligned PDB structure. Some functions within BioStructMap also require installation of the DSSP tool (http://swift.cmbi.ru.nl/gv/dssp/). These include calculation of relative solvent accessibility and secondary structure determination. If you wish to use these functions, you must have DSSP installed.
Although the BioStuctMap package contains several modules, most of these work behind the scenes. The biostructmap module should be the only module that needs to be directly used in most cases.
The main class within the biostructmap module is the Structure class. This is initialised as such: from biostructmap import biostructmap my_structure = biostructmap.Structure (pdbfile='./1zrl.pdb',pdbname='1ZRL',mmcif=False) The pdbfile argument can be either a string of the file path to the PDB file of interest, while the pdbname argument is an optional descriptive string that is used when naming output files. The optional mmcif flag is used to indicate if the input file is in mmcif format. If you are using an mmcif file, you would instead run: my_structure = biostructmap.Structure (pdbfile='./1zrl.cif',pdbname='1ZRL',mmcif=True) The Structure class contains a number of methods, the most important being the map() method. This method allows for the mapping of data over a protein structure, with the ability to also apply some sort of spatial aggregation to data. The map method takes a number of arguments, the most important of which are data , method , ref and radius .
The Structure.map method returns a dictionary-like object, which can also be used to write output data to a PDB file or CSV file.
The data parameter is a dictionary mapping individual chains within the protein structure to relevant data objects.
The exact form of the data object will depend on the method argument selected. For example, if you are mapping the location of polymorphic residues onto a structure, the data would be a list of polymorpic sites:

The Structure class
2.2 Mapping data over a structure data data = {('A, B'): [1,34,56,77,120,121,125], ('C', ): [5, 34, 67, 122] } In this example, identical chains A and B are both assigned the same set of data, whilst the unique chain C is assigned another set of polymorphic residues. Note that the given polymorphic sites for each chain are aligned to the reference sequence supplied -see below for more detail.
The method parameter is either a string representing a method for mapping data (one of a number of pre-defined methods), or a custom function for mapping data (explained in more detail below). For example, to map polymorphic sites onto a protein structure, set method=snps .
The ref parameter is a set of reference sequences for all chains, and is used to align the user-supplied data to the protein structure. All data supplied via the data argument should align to these reference sequences.
For example, with identical chains A and B , and a unique chain C , we would have: All data provided in the data parameter should align to these sequences.
Also note the subtle difference in the keys needed for the ref dictionary vs. the data dictionary. The ref dictionary should have a sequence provided for every chain being evaluated, with the dictionary key being the string identifier for that chain. The data dictionary requires a tuple of chain identifiers for each related data value.
This difference arises so that it is possible to map the same data over multiple chains and subsequently ensure we don't duplicate identical data points that might fall within the same radius (from different chains). This will be discussed in more detail in a later section.
If the supplied data is a genomic multiple sequence alignment, then the provided reference sequence should also be a genomic sequence. In this case biostructmap will align this genomic sequence to the protein sequence for the relevant chain in the structure. If a genomic sequence is supplied, then the map_to_dna argument should also be set to True .
If the ref argument is not provided, then the sequences for each chain in the structure are used.
The radius (in Angstrom) over which to select nearby residues for inclusion within each 3D window. This defaults to 15 Angstrom, which is the typical maximum dimension for an antibody epitope. If you simply want to map data to individual residues (eg. to display polymorphic sites on a protein structure), set radius=0 .
When determining which residues fall within a given radius of a central residue, there are a number of ways in which to compute distances between residues. The default behaviour is to compute the minimum distance between any two atoms in each pair of residues. The selector argument allows the user to specify other atoms by which to compute residue distance. By default this argument is 'all' , which gets all non-heterologous atoms. Other potential options include 'CA' , 'CB' etc. If an atom is not found within a residue object, then the selection method reverts to using 'CA' .

method ref radius selector
If the user wishes to restrict analysis to residues that fall within a given range of relative solvent accessibility (RSA) values (eg. only surface exposed residues), they can provide a tuple to the rsa_range argument. This argument takes a tuple in the form (minimum, maximum) , where minimum and maximum are float values between 0 and 1 .
If any residue falls outside the given range of RSA values, then this residue will ignored in all calculations.
RSA is calculated using the DSSP software. If this is not installed and available on the users PATH, then any attempt to use the rsa_range argument will fail, throwing an exception.
The map_to_dna argument is a binary flag to indicate if the reference sequence to be aligned is a DNA sequence.
This needs to be set to True if the reference sequence is a DNA sequence (e.g. when using the Tajima's D method).
In order to make biostructmap flexible and extensible, the map method also takes additional arguments that will be passed to the data aggregation method. These arguments should be provided to the method_params argument in a dictionary of keyword arguments (key) and associated values (value).
To provide a concrete example of this, we can consider the 'default_mapping' method that applies a data aggregation method to sequence-aligned numerical data. By default, this method calculates the mean of all data within each radius. However, we can apply other data aggregation functions (e.g. calculate the median) to each 3D window by passing a method argument to the default_mapping function: import numpy as np my_structure.map(..., method_params={'method': np.median}) One usage of the biostructmap tool is to determine polymorphic hotspots on a protein structure. This requires the user to provide a list of all polymorphic residues of interest and an associated reference sequence. In this example we have a single-chain structure with the PDB file 1zrl.pdb and a reference sequence in FASTA format in the file reference.fasta . We will use the Biopython SeqIO module to read in the reference sequence from file. Polymorphic residues are residues 3, 67, 78, 99, 100, 120 and 121, relative to the reference sequence (where the first residue is number 1).
If we were interested in averaging the number of polymorphisms within a 10 Angstrom radius, we would run: We can also apply a 3D sliding window to calculation of amino acid propensity scales. In this example we will apply the Kyte & Doolittle index of hydrophobicity to the protein structure initialized in the above example. We will also demonstrate how to apply a custom amino acid scale as a 3D sliding window.
We can obtain the Kyte & Doolittle scale from the Biopython package: from Bio.SeqUtils import ProtParamData kd_scale = ProtParamData.kd For the 'aa_scale' method, the data argument should be a dictionary representing the amino acid scale of interest. In this example we will use a window size of 15 Angstrom, and only consider surface exposed residues (RSA > 0.2). To use a custom amino acid propensity scale, we just need to provide a dictionary of numerical values for all amino acids. We will apply the 'relative mutability scale' defined by (Dayhoff, Schwartz & Orcutt, 1978). Again, we are only considering surface exposed residues. Tajima's D is a statistical test used to determine if a sequence is evolving under non-neutral selection pressure.
Here we will apply Tajima's D as a 3D sliding window over our protein structure. We need to supply a multiple sequence alignment, using the biostructmap.SequenceAlignment class. The multiple sequence alignment is initially supplied as a FASTA file.
In this case, the reference sequence is taken as the first sequence in the multiple sequence alignment. Note the need to set map_to_dna=True .

Calculation of Tajima's D
Nucleotide diversity is a metric that is used to quantify the degree of diversity within a particular window on a gene.
We can extend this here to a 3D window over a structure to get a sense of the particular regions of the protein structure that are most diverse within a population (at a genomic level).
Again, we need to supply a multiple sequence alignment. The 'default_mapping' method allows the user to apply a custom data aggregation function to data within each window. For example, you could calculate the arithmetic mean of data within a window, calculate the maximum or minimum value within a radius, or apply some other metric to data. We will illustrate with a simple calculation of the maximum data value within a 5 Angstrom window.
Note the use of the additional keyword argument method_params , which takes a dictionary of additional parameters to pass to the default_mapping method. In this case, default_mapping takes the keyword argument method , which should be a function that can be used to aggregate a list of data points. This default_mapping method can be quite useful when constructing custom mapping procedures! data_values = list(range(1000)) # Just some placeholder data maximum_values = my_structure(data=data_values, method='default_mapping', ref={'A': reference_seq}, radius=5, method_params={ 'method': max } ) The results for each mapping call are returned in a dictionary-like object ( DataMap class -a simple class that extends the dict class by adding a couple of additional methods to deal with writing results to files).
The main method that is likely to be used from the DataMap object is the write_data_to_pdb_b_factor method.
This writes all data to the B-factor column of a PDB file, allowing easy visualisation in a program such as PyMOL.
We demonstrate the use of this following a simple calculation of average hydrophobicity (see section 2.3.2). import biostructmap from Bio import SeqIO from Bio.SeqUtils import ProtParamData kd_scale = ProtParamData.kd reference_seq = SeqIO.read("reference.fasta", "fasta") mean_hydrophocity = my_structure(data=ProtParamData.kd_scale, method='aa_scale', For the write_data_to_pdb_b_factor method, the fileobj keyword argument can be either an output file name as a string, or a file-like object to write output data to. Additionaly keyword arguments for this method are default_no_value and scale_factor . The default_no_value argument is used to specify the numerical value written to the B-factor column if the value for this residue is None (non-numerical values can't be written to the Bfactor column). The scale factor argument is used to scale output values in situations where they are either too big or small to fit within the B-factor column. For example, it is usually sensible to scale nucleotide diversity values by a factor of 1000 ( scale_factor=1000 ).
BioStructMap can be extended by providing custom functions with which to process data within each 3D sliding window. We will briefly discuss the format required for these custom data processing functions.
Each data processing function has the format: def some_method(structure, data, residues, ref, **kwargs): ... return final_data where structure is the parent biostructmap.Structure object from which the map method has been called, data is the data argument supplied to the map method (no filtering has been applied to this object yet!), residues is a list of PDB residues within that particular window, and ref is a dictionary mapping PDB residue numbers to reference sequence indices.
If the custom method being written needs to deal with data that is aligned to a protein sequence, then the key steps that need to be followed are: 1. From the list of PDB residues within the window, extract the positions of these residues in the corresponding reference sequence.
2. Construct a list of applicable data points, given the list of reference-sequence aligned residues.
3. Perform a data aggregation function over these data points, returning a single value (a single, numerical return value is not absolutely required, although writing data to a PDB file will not be possible if data is nonnumerical).
We illustrate these steps with a function to calculate the mean of selected data points: Step 1: A little bit more manipulation, as each reference residue is given # by a (chain, residue number) tuple, while the data keys are tuples that # can contain multiple chains (to enable mapping data between several chains). # The list of residues below will look like: Alternatively, if the user simply wants to apply an aggregation function to selected data points (as in the above example), then the default_mapping method provides a nice interface to perform this via the additional keyword argument method_params : data_values = list(range (1000) It is also possible to pass data to a command line tool for processing using the general format outline above.

Data aligned to a protein sequence
We firstly need to write a temporary file containing a multiple sequence alignment for each window, and then call our command line tool. In this example we will use 'possum' (https://github.com/jeetsukumaran/possum), which calculates a number of population statistics from a multiple sequence alignment.