PiDNA: predicting protein–DNA interactions with structural models

Predicting binding sites of a transcription factor in the genome is an important, but challenging, issue in studying gene regulation. In the past decade, a large number of protein–DNA co-crystallized structures available in the Protein Data Bank have facilitated the understanding of interacting mechanisms between transcription factors and their binding sites. Recent studies have shown that both physics-based and knowledge-based potential functions can be applied to protein–DNA complex structures to deliver position weight matrices (PWMs) that are consistent with the experimental data. To further use the available structural models, the proposed Web server, PiDNA, aims at first constructing reliable PWMs by applying an atomic-level knowledge-based scoring function on numerous in silico mutated complex structures, and then using the PWM constructed by the structure models with small energy changes to predict the interaction between proteins and DNA sequences. With PiDNA, the users can easily predict the relative preference of all the DNA sequences with limited mutations from the native sequence co-crystallized in the model in a single run. More predictions on sequences with unlimited mutations can be realized by additional requests or file uploading. Three types of information can be downloaded after prediction: (i) the ranked list of mutated sequences, (ii) the PWM constructed by the favourable mutated structures, and (iii) any mutated protein–DNA complex structure models specified by the user. This study first shows that the constructed PWMs are similar to the annotated PWMs collected from databases or literature. Second, the prediction accuracy of PiDNA in detecting relatively high-specificity sites is evaluated by comparing the ranked lists against in vitro experiments from protein-binding microarrays. Finally, PiDNA is shown to be able to select the experimentally validated binding sites from 10 000 random sites with high accuracy. With PiDNA, the users can design biological experiments based on the predicted sequence specificity and/or request mutated structure models for further protein design. As well, it is expected that PiDNA can be incorporated with chromatin immunoprecipitation data to refine large-scale inference of in vivo protein–DNA interactions. PiDNA is available at: http://dna.bime.ntu.edu.tw/pidna.


INTRODUCTION
Interactions between transcription factors (TFs) and their binding sites play important roles in many biological processes. Many previous studies have attempted to characterize the binding sequences of a TF by summarizing its known sites as a position weight matrix (PWM), and then using the PWM to discover more potential binding sites in the genome. However, the number of well-characterized PWMs is still far behind the number of known TFs. In this regard, it is desirable to exploit other resources, such as protein-DNA complexes in protein structure databases, to improve the coverage of TFs on which the prediction of binding sites can be made or improved.
In the past decade, as more protein-DNA complexes are becoming available, researchers are able to investigate protein-DNA interactions at an atomic level (1,2). Many studies have developed structure-based computational methods for predicting protein-DNA interactions (3)(4)(5).
Most of the studies focus on predicting protein functions or DNA-binding residues based on structure models (4)(5)(6)(7)(8), and the prediction of binding residues has achieved a high degree of accuracy (6,8). Many potential functions, including physics-based and knowledge-based (7,(9)(10)(11), have been developed for improving protein-DNA docking (11)(12)(13). These potential functions are also being applied to predict binding specificity and construct PWMs, as long as a complex structure exists (9,(14)(15)(16)(17)(18). A recently published approach further uses an energy function that is uniquely trained on each structure for recognition of TF-binding sites (19). While the prediction of PWMs based on native structures achieves better performance year after year, this success has also been extended to synthetic protein-DNA complexes in our recent study (15).
To further use the available structural models, the proposed Web server, PiDNA, aims at first constructing reliable PWMs by applying an atomic-level knowledgebased scoring function on numerous in silico mutated complex structures, and then predicting the interaction between a protein and a single DNA sequence using the PWM suggested by the structure models with small energy changes. Given a protein-DNA complex structure, all of the potential DNA sequences with limited mutations from the native sequence in the co-crystallized structure are scored by the atomic-level knowledge-based scoring function, and a subset of relatively high-specificity sequences is selected to construct a PWM for re-ranking the mutated sequences and for making more predictions. In addition to a ranked list that reveals the relative binding specificity of the mutated sequences, users can also download mutated protein-DNA complex structures of interest for other applications.
Although many Web servers exist for predicting protein-DNA interactions based on structure models, most of them are designed to predict whether a protein binds to DNA such as iDBPs (20) or to predict DNA-binding residues such as DISPLAR (21), DNABINDPROT (22) and PreDNA (8). Moreover, the Robetta server allows for residue design on DNA-binding proteins by modelling the changes in binding-free energy associated with amino acid and base substitutions, given a protein-DNA complex (23). On the other hand, Web servers that use protein-DNA complexes to predict PWMs or sequence specificity of TFs are still limited. Our previous work, DBD2BS, was developed for predicting PWMs from protein unbound structures (24). A more similar work to PiDNA is a recently published Web server, 3DTF, which also uses knowledge-based potential on mutated structures to produce PWMs (25). The 3Dfootprint database also provides pre-calculated PWMs for all protein-DNA complexes in the RCSB Protein Data Bank (PDB) (26), where the PWMs are constructed by using a hybrid approach that combines contact and readout models (27). In this regard, both 3DTF and 3Dfootprint are included in this study for comparison when evaluating the performance of PiDNA.

Input
Step 1. Provide a structure A protein-DNA complex is expected as the input to PiDNA. For the user's convenience, PiDNA has collected all of the existing protein-DNA complexes deposited in PDB on the local site. In the 28 November 2012 release, there are 2589 structures that contain both protein and DNA molecules. The four-character PDB identifiers can be specified in this step. Alternatively, users can upload their own structures for analysis.
Step 2. Select a double-stranded DNA After a structure is given, PiDNA automatically detects the chain identifiers of double-stranded DNA (dsDNA) molecules present in the structure. If only one dsDNA is found, PiDNA will use it by default for mutation analysis. On the other hand, if more than one dsDNA is detected, PiDNA will provide relevant information and wait for the user to select one dsDNA to make predictions. For example, the PDB structure '1RUN' contains two dsDNA units, where chain C is paired with chain F (denoted as 'C <-> F') and chain D is paired with chain E. As long as the dsDNA for analysis has been determined, PiDNA will provide the information about the paired complementary bases. Bases within 4.5 Å with respect to any heavy atoms of a protein chain are highlighted in red, where the threshold of 4.5 follows the suggestion from previous studies (5,9). The contact residues along with the protein chain identifiers appear when the mouse moves over a contact base. At this stage, the user is allowed to specify the range of base pairs to mutate as well as the maximum number of mutations. Users are suggested to start with a small value on the setting of the maximal number of mutations, especially for a small binding site. If the number of mutations is large with respect to the site length, the assumption of the rigidity of the DNA backbone might not hold anymore.

Output
After clicking the 'submit' button in Step 2, PiDNA synthesizes the structures for all sequence combinations with limited mutations from the native sequence. As exemplified in Supplementary Table S1, a position has three potential bases to mutate to. In this regard, a DNA sequence of length six can be mutated into 153 mutant forms when up to two mutations are allowed. For each mutated sequence, the corresponding synthetic structure is generated (see Materials and Methods) and the change in binding free energy is estimated (see Materials and Methods). In the result panel, only partial mutated sequences are listed, whereas the complete list can be downloaded as a text file. The list can be ranked by different columns: the number of mutated positions, the estimated change in binding free energy, the final prediction score (see Materials and Methods) and the rootmean-square deviation (RMSD) with respect to the native structure. In addition to the ranked list of the sequences with limited mutations, PiDNA also reports a position frequency matrix (PFM) constructed from structural models with small energy changes. With the reported PFM, the following three steps can be selected for execution: Step 3a (optional). More predictions on manually modified sequences More predictions can be made based on the PFM constructed previously. At this stage, the user can also select a mutated sequence to generate the structure for visualization and downloading. The mutated sequence can be further modified manually. Users will need the JAVA Runtime Environment to activate the Jmol applet (http://jmol.sourceforge.net/).
Step 3b (optional). More predictions on sequences set by uploading At this stage, the user can make more predictions by uploading files. The sequence file can be prepared according to an example file provided by the server.
Step 3c (optional). More predictions on random sequences generated by the server At this stage, the user can make more predictions on a larger set of sequences generated by PiDNA.

Base mutation
To perform base-pair mutations on a given structure, structure models for different base types are required in advance. The atom coordinates of a particular base (A, T, C or G) are collected from the available PDB structures that contain dsDNA. The coordinates retrieved from a single base are considered as a rigid body and structurally aligned with the others. In total, 10 000 structure models are collected for each of the base types, and the averaged atom coordinates are stored as the template model for each base type. Whenever a base pair, e.g. A <-> T, is requested to mutate into another one, e.g. C <-> G, the base 'A' is replaced by the template model of 'C' and the base 'T' is replaced by the template model of 'G' without affecting the backbone structure of the dsDNA. For example, the replacement of base 'A' by base 'C' is performed by the following procedures: (i) consider the template model of 'C' as a rigid body and superimpose the coordinates of the atom 'N1' in 'C' onto the corresponding atom 'N9' in 'A', (ii) align the normal vector of the new base plane with that of the original base plane, (iii) align the interior bisector of a specified angle in 'C' with that of 'A', and (iv) remove the atom coordinates of base 'A' from the complex structure. The information regarding the atom names and the atoms selected to construct the base planes and angle bisectors is provided in Supplementary Figure S1.

All-atom scoring function
PiDNA uses an all-atom distance-dependant knowledgebased scoring function to evaluate the preference of the synthetic complexes with respect to the native structure. Let atom i be an atom from the protein chains in a complex structure, and atom j be an atom from the DNA chains. An atom pair i and j, with a distance of r, is scored by the following equation: where P(i, j, r) = N obs (i, j, r)/AE r N obs (i, j, r) and P ref (r) is a weight function that reduces the influence of a long distance, as a longer distance covers more atom pairs. All the combinations of the frequency N obs (i, j, r) are derived from a pre-collected protein-DNA complex set, where i can be any atom type from amino acids and j can be any atom type from nucleotides. The a-carbon in amino acid cysteine is an atom type different from the a-carbon of alanine. In total, there are 167 atom types from proteins and 82 atom types from DNA. More details about the weight function P ref (r) and how the table entries N obs (i, j, r) are constructed based on the protein-DNA complex database can be found in our previous study (15). For a given complex, the binding free energy, ÁG, is defined as the sum of all the statistical potential of the observed atom pairs (9): uði, j, rÞ: Whenever a mutated structure is generated, PiDNA recalculates the ÁG and denotes it as ÁG 0 . Afterwards, all mutated structures along with the native one are sorted by the change of ÁG, i.e. ÁG 0 À ÁG native , in ascending order. The minimum value of ÁG 0 suggests acceptable flexibility on the structure change. In this regard, PiDNA discards the mutated structures with a change of ÁG larger than the absolute value of ÁG 0 minimum À ÁG native . In other words, although a negative change on ÁG is preferred, a positive change that falls in the range of acceptable flexibility is still favourable.
With the mutated structures that satisfy the flexibility criterion, PiDNA uses the corresponding mutated sequences to construct a PFM. PiDNA will then use the derived PFM to re-rank the binding sites by the procedure described in the following section. The analyses shown in this study reveal that the PFM scoring is, in general, superior to the scoring based on the estimated energy change when the number of mutations is getting larger. This might be owing to the fact that PiDNA assumes that the backbone of the dsDNA molecule does not suffer conformational change when base-pair mutations are performed, but this assumption loses its validity when the number of mutations increases. PiDNA also provides the information regarding potential structural change by calculating the RMSD of the mutated structure against the native complex as follows: where is the distance between N pairs of equivalent atoms.

PFM scoring
Given a PFM, P, with a width of w, a particular sequence S is scored with the following equation after proper alignment (given that the k-th position of S, denoted as S k , is aligned with the first position of the PFM without insertions or deletions and assuming that the length of S is larger than or equal to w): Score PFM ðS k ::: where the first dimension of the PFM, P, is the position identifier and the second dimension specifies the type of bases: A, T, C or G. In case the length of the substring starting from S k is smaller than w, the value of w is adjusted to a proper value before calculating the scores.
The Score PFM is assigned as the final score to reflect the relative sequence specificity of the mutated sequences. The performance of PiDNA in distinguishing the high-specificity sites from the lower ones is evaluated in the following section. In PiDNA, the PFM is calculated from aligned mutated sequences with equal length. Therefore, k is always set to 1 when calculating Score PFM . On the other hand, to generate Score PFM from an annotated PFM or from the PFM predicted by existing Web servers for comparison, the value of k is set to a proper value after sequence alignment with consideration for the reverse complementary form.

Validation sets
This study first evaluates whether the PFMs constructed using the highly reliable structures with limited mutations are consistent with the known binding sites of the query protein. Mouse and yeast TFs with structure models available in PDB are examined to see if annotated PFMs can be found in literature or databases. In this study, PFMs are collected from the study of Morozov et al.  Table S2).
To further evaluate the performance of PiDNA in distinguishing high-specificity sites from lower ones, this study uses in vitro protein-binding microarrays (PBMs) to retrieve relative specificity information of a DNAbinding protein against different dsDNA sequences. The PBM data provide information regarding how strongly a given protein binds to a probe relative to the others. The mouse and yeast proteins in the UniPROBE database (30) are examined against validation set 1 for available structures. The available PDB structures are further examined to remove structures in which the binding sites are bound by hetero-multimeric proteins. Because not all lengths of sequences can find a copy in a PBM array, a structure with a binding site larger than 10 is avoided in the validation set. In total, there are 11 proteins (validation set 2) that satisfy all the aforementioned criteria, as shown in Supplementary Table S3.
Finally, to evaluate the performance of PiDNA in selecting real binding sequences from a set of random sequences, we used the binding sequences collected in the study by Morozov et al. (9) as positive samples and randomly generated 10 000 sequences as negative samples to construct validation set 3. This set includes 15 proteins.

Comparison with annotated PFMs
The PFMs constructed by PiDNA, based on the set of favourable structure models with limited mutations, are compared with the annotated PFMs, and the performance is compared with the predictions by 3DTF and 3D-footprint. The C-test described in Morozov et al. (9) was used to evaluate the consistency between the predicted and annotated weight scores. The definition of the C-test is provided as follows: where p j i and q j i are predicted and annotated weight scores, respectively, for base type i at position j, and w is the length of the binding site in base pairs. A smaller value on the C-test implies a higher degree of consistency between two PFMs. The range of mutated positions for each test case is shown in Supplementary Table S4, and the same range is applied to 3DTF. The results provided in Table 1 reveal that PiDNA is able to deliver PFMs similar to the annotated PFMs from databases or literature. Both PFMs from PiDNA and 3D-footprint are better than that constructed by the original sequence in the complex. It can be seen in Table 1 that, in general, setting the maximum number of mutations to two performs better than three and four. However, it is also observed that a larger number is preferred for binding sites with a large number of degenerated positions, as shown by the sequence logos plotted in Figure 1.

Identification of binding sites with high specificity
Given a protein-DNA complex, it is of interest to know what other sequences with limited mutations from the native sequence can also be bound by the proteins in the complex. The expected relative binding specificity of these mutated sequences is retrieved from the in vitro PBM data. A 10-mer sequence can find two copies (including the reverse complementary form) in a PBM array constructed based on the de Bruijn sequence that covers all 10-mer binding sites. A site with shorter length can find more copies. It is assumed that the relative binding specificity of a k-mer string can be represented by the scores of the probes that contain the string. Because there is a risk that the binding specificity of the string of interest might be affected by other substrings on the same probe sequence, the average scores across all the probes that contain either the target string or its reverse complementary form are adopted. To evaluate the performance of PiDNA in detecting the binding sites with high specificity, the mutated sequences with relatively high binding specificity are assigned as the positive instances. The top-k scored sequences from PBM are adopted as the positives, where k is set to 10, 20, 50 and 100, respectively. Afterwards, the receiver operating characteristic (ROC) curve for each method is plotted based on the ranking list it produced.
The testing data first include all sequences with up to two mutations. For the 11 proteins in validation set 2, the number of testing sequences ranges from 153 to 435. The ROC curves and the area under ROC (AUC) scores are calculated using the R package 'ROCR'. The AUC scores of PiDNA are compared with the ranked lists generated by the PFMs from 3DTF and 3D-footprint. The same ranges shown in Supplementary Table S4 are applied when invoking the Web server 3DTF to generate PFMs for comparison. For 3DTF, the mode 'long (for convergence of full model)' is adopted. The average AUC scores for different methods are provided in Figure 2, and the AUC scores for each query protein are shown in Table 2 when the top-10 PBM scored sequences are used as positives. It is observed both in Figure 2 and Table 2 that PiDNA with PFM scoring is superior to the other approaches in identifying high-specificity sequences from the lower ones, including the approach of using the change on ÁG (ddG) for the ranking. This reveals that the mutated structure models and/or the selected atomiclevel scoring function still have potential limitations.   Next, evaluation is performed on a larger testing test, i.e. sequences with up to four mutations from the native sequence. For the 11 proteins in validation set 2, the number of testing sequences ranges from 1909 to 20 686. In this analysis, PiDNA is executed with different settings on the maximal number of mutations allowed when constructing the PFMs. The results are shown in Supplementary Figure S2. Two observations are summarized here. First, it is observed that setting the maximal number of mutations to two when constructing PFMs performs better than setting it to three or four. This is consistent with the conclusion drawn from Table 1, the comparison of the predicted PFMs with the annotated ones. Second, it is shown again that PiDNA with PFM scoring is superior to the other approaches in identifying high-specificity sequences from the lower ones.

Identification of known binding sites from random sequences
With validation set 3, PiDNA is evaluated according to its ability in identifying true binding sites from a set of 10 000 random sequences. The known binding sites are collected from the study of Morozov et al. (9), and length of the sites (the positions to mutate when constructing PFMs) is specified accordingly. 3D-footprint was not included in this comparison because most of the curated PFMs are shorter than the length of the sites for prediction. The results shown in Table 3 reveal that PiDNA can distinguish true binding sites from random ones with high accuracy. This data set includes 244 known binding sites, where 93 of the known sites contain more than four mutations with respect to the native sequences in the complexes used (Supplementary Figure S3). Table 3 reports the sensitivity (true-positive rate) and specificity (true-negative rate) for each query protein when high-specificity rates are considered, resulting in 28 false-negatives. It is summarized in Supplementary Figure S3 that most of the false-negatives are with a large number of mutations, although some sites with a large number of mutations can still be identified by PiDNA successfully. Instead of using sensitivity and specificity rates (there is a trade-off between them), the AUC scores are adopted when different methods are compared in Supplementary Table S5. Although the results in Supplementary Table S5 show that the change on ÁG alone does not serve as a good indicator for identifying true binding sites, it was observed that the change on ÁG usually assigns good rankings to the known binding sites among the sequences with the same number of mutations. In fact, the binding site with 12 mutations has a negative change on ÁG, and it is observed that more than half of the known binding sites are with a considerably small change on ÁG. Similarly, the RMSD values alone are not a good indicator, either. In PiDNA, RMSD values are reported along with the predictions such that users can be aware of large RMSDs, as the real conformational change might be even larger when the backbone flexibility is allowed.

CONCLUSION
PiDNA is designed to predict PFMs of DNA-binding proteins based on available protein-DNA complexes. Numerous structure models are first generated with limited mutations. Afterwards, the mutated sequences with relatively small energy changes are used for constructing PFMs for more predictions. In this study, PiDNA is shown to achieve good performance in delivering reliable PFMs and discovering binding sites with high binding specificity among the set of sequences with limited mutations from the native structure. The analysis shown in this study also reveals that the PFMs reported by PiDNA are able to distinguish true binding sites from random ones with high accuracy. Therefore, PiDNA can be considered as an alternative approach to predicting binding sites of a TF in the genome when lacking wellcharacterized PFMs (or PWMs). With PiDNA, the users can design biological experiments based on the predicted The testing data in this table include sequences with up to two mutations. The top-10 high-specificity sequences are assigned as the positives. The performance of PiDNA based on PFM scoring or based on the change on ÁG (denoted as 'ddG') is also compared. sequence specificity and/or request mutated structure models for further protein design. Also, it is expected that PiDNA can integrate with other high-throughput data to refine large-scale inference of in vivo protein-DNA interactions.  Testing data include the listed numbers of positives and 10 000 negatives.