Email updates

Keep up to date with the latest news and content from Proteome Science and BioMed Central.

Open Access Research

Prediction of DNA-binding proteins from relational features

Andrea Szabóová1*, Ondřej Kuželka1, Filip Železný1 and Jakub Tolar2

Author Affiliations

1 Czech Technical University, Prague, Czech Republic

2 University of Minnesota, Minneapolis, USA

For all author emails, please log on.

Proteome Science 2012, 10:66  doi:10.1186/1477-5956-10-66


The electronic version of this article is the complete one and can be found online at: http://www.proteomesci.com/content/10/1/66


Received:28 March 2012
Accepted:26 October 2012
Published:12 November 2012

© 2012 Szabóová et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

The process of protein-DNA binding has an essential role in the biological processing of genetic information. We use relational machine learning to predict DNA-binding propensity of proteins from their structures. Automatically discovered structural features are able to capture some characteristic spatial configurations of amino acids in proteins.

Results

Prediction based only on structural relational features already achieves competitive results to existing methods based on physicochemical properties on several protein datasets. Predictive performance is further improved when structural features are combined with physicochemical features. Moreover, the structural features provide some insights not revealed by physicochemical features. Our method is able to detect common spatial substructures. We demonstrate this in experiments with zinc finger proteins.

Conclusions

We introduced a novel approach for DNA-binding propensity prediction using relational machine learning which could potentially be used also for protein function prediction in general.

Keywords:
DNA-binding propensity prediction; DNA-binding proteins; Relational machine learning

Background

The process of protein-DNA interaction has been an important subject of recent computational-biology research, however, it has not been completely understood yet. DNA-binding proteins have a vital role in the biological processing of genetic information like DNA transcription, replication, maintenance and the regulation of gene expression. Several computational approaches have recently been proposed for the prediction of DNA-binding function from protein structure. In this paper we are interested in prediction of DNA-binding propensity of proteins using their structural information and physicochemical properties. This approach is in contrast with some of the most recent methods which are based on similarity of proteins, for example structural alignment or threading-based methods [1-3] or methods exploiting information about evolutionary conservation of amino acids in proteins [4]. In general, methods exploiting evolutionary information can be more accurate than the approaches aiming to infer binding propensity purely from physicochemical or structural protein properties. On the other hand, the main advantage of the approaches not using evolutionary information is that they do not rely on the existence of homologous proteins and also they may provide interpretable patterns describing the binding principles.

In one of the pioneering works on the prediction of DNA-binding propensity, Stawiski et al. [5] investigated the structural and sequence properties of large, positively charged electrostatic patches on DNA-binding protein surfaces. They used a neural network with 12 features such as molecular weight per residue, patch size, percent α-helix in patch, average surface area per residue, number of residues with hydrogen-bonding capacity, percent of patch and cleft overlap, number of lysine and polar isosteres in Lysoffpatches, and percent of conserved positive and aromatic residues in the patch. Ahmad and Sarai [6] trained a neural network based on the net charge, the electric dipole and quadrupole moments of the protein. The combination of charge and dipole moment performed best, while information about the quadrupole moment improved the accuracy only slightly. They found out that DNA-binding proteins have significantly higher net positive charges and electric moments than other proteins. Bhardwaj et al. [7] examined the sizes of positively charged patches on the surface of DNA-binding proteins. They trained a support vector machine classifier using the protein’s overall charge and its overall and surface amino acid composition. Szilágyi and Skolnick [8] created a logistic regression classifier based on the amino acid composition, the asymmetry of the spatial distribution of specific residues and the dipole moment of the protein. Patel et al. [9] used an artificial neural network to discriminate DNA-binding proteins from non-DNA binding proteins using amino-acid sequential information. For each amino acid sequence they created a set of 62 sequence features. Nimrod et al. [4] presented a random forest classifier for identifying DNA-binding proteins among proteins with known 3D structures. First, their method detects clusters of evolutionarily conserved regions on the surface of proteins using the PatchFinder algorithm. Next, a classifier is trained using features like the electrostatic potential, cluster-based amino acid conservation patterns, the secondary structure content of the patches and features of the whole protein, including all the features used by Szilágyi and Skolnick [8].

In the present work, we use an automatic feature construction method based on relational machine learning to discover structural patterns capturing spatial configuration of amino acids in proteins. Numbers of occurrences of each discovered pattern in a protein become attributes of the protein, which are then used by a machine learning algorithm to predict the DNA-binding propensity of the protein. We combine two categories of features to predict the DNA-binding propensity of proteins. The first category contains physicochemical features which enabled Szilágyi and Skolnick’s method [8] to achieve state-of-the-art predictive accuracies. The second category contains structural features representing the discovered spatial patterns in protein structures. Using predictive classifiers based on these features we obtain accuracies competitive with existing physicochemical based methods on several datasets of proteins. Moreover, our method is able to detect conserved spatial substructures, which we demonstrate in experiments with zinc finger proteins.

Nassif et al. [10] previously used a relational learning based approach in a similar context, in particular to classify hexose-binding proteins. The main differences of our approach from the method of Nassif et al. [10] are as follows. First, the fast relational learning algorithm [11] that we use enables us to produce features by inspecting much larger structures (up to tens of thousands of entries in a learning example) than those considered in the work of Nassif et al. [10] using the standard learning system Aleph. Second, our structural features acquire values equal to the number of occurrences of the corresponding spatial pattern, whereas Nassif et al. [10] only distinguished the presence of a pattern in a learning example from its absence. Our preliminary results [12] indicated that occurrence-counting indeed substantially lifts predictive accuracy. Lastly, the approach of Nassif et al. [10] resulted in classifiers that are more easily interpretable than state-of-the-art classifiers and comparable in predictive accuracy. Here we maintain the interpretability advantage and achieve accuracies competitive to the state-of-the-art predictive accuracies both by a purely structural approach (without the physicochemical features) and also through the combination of structural and physicochemical features.

Materials and methods

Data

DNA-binding proteins are proteins that are composed of DNA-binding domains. A DNA-binding domain is an independently folded protein domain that contains at least one motif that recognizes double- or single-stranded DNA. We worked with the following datasets in our experiments:

• PD138 - dataset of 138 DNA-binding protein structures in complex with DNA,

• UD54 - dataset of 54 DNA-binding protein structures in unbound conformation,

• BD54 - dataset of 54 DNA-binding protein structures in DNA-bound conformation corresponding to the set UD54

• APO104 - dataset of 104 DNA-binding protein structures in unbound conformation,

• ZF - dataset of 33 Zinc Finger protein structures in complex with DNA,

• NB110 - dataset of 110 non-DNA-binding protein structures,

• NB843 - dataset of 843 non-DNA-binding protein structures.

Dataset PD138 was created using the Nucleic Acid Database (NDB) by Szilágyi and Skolnick [8] - it contains a set of DNA-binding proteins in complex with DNA strands with a maximum pairwise sequence identity of 35% between any two sequences.

Both the protein and the DNA can alter their conformation during the process of binding. This conformational change can involve small changes in side-chain location, and also local refolding, in case of the proteins. Predicting DNA-binding propensity from a structural model of a protein makes sense if the available structure is not a protein-DNA complex, i.e. it does not contain a bound nucleic acid molecule. In order to find out how the results would change according to the conformation before and after binding, we used two other datasets (UD54, BD54). BD54 contains bound conformations of DNA-binding proteins, i.e. DNA-protein complexes. UD54 contains the same sequences in their unbound, free conformation. These datasets were also obtained from Szilágyi and Skolnick [8].

Another set of DNA-binding protein structures (APO104) determined in the absence of DNA was obtained from Gao et al. [2].

Thirty-three examples of Cys2His2ZF-DNA complexes were sourced from Siggers et al. [13]. Their structural description was obtained from the Protein Data Bank.

Rost and Sander constructed a dataset (RS126) for secondary structure prediction. Ahmad & Sarai [6] removed the proteins related to DNA binding from it, thus getting a final dataset of non-DNA-binding proteins. As our negative dataset (NB110) we used this set of non-DNA-binding proteins.

We also used an extended dataset (NB843) by Nimrod et al. [4]. This dataset contains additional 733 structures of non-DNA-binding proteins. The additional structures were gathered using the PISCES server. Entries in this list include crystal structures with a resolution better than 3.0Å. The sequence identity between each pair of sequences is smaller than 25%.

From the structural description of each protein we extracted the list of all contained residues with information on their type and the list of pairwise spatial distances among all residues. As for the physicochemical features, we followed Szilágyi and Skolnick’s work [8] and extracted features indicating the respective proportions of the Arg, Lys, Asp, Ala and Gly residues, the spatial asymmetry of Arg, Gly, Asn and Ser, and the dipole moment of the protein.

Method

Our method exploits techniques of relational machine learning [14] in conjunction with state-of-the-art attribute-value learning algorithms [15]. Very briefly, our method can be viewed as proceeding in three steps. It starts with PDB files, which is a widely used format for proteins. Then it creates a relational representation of the proteins (step 1). After that it tries to extract meaningful relational patterns from the relational structures describing proteins and uses them to create an approximate attribute-value representation of the proteins (step 2) which is then used for learning attribute-value classifiers (step 3).

Although the field of attribute-value machine learning is more mature than the field of relational machine learning, attribute-value learning algorithms, such as decision trees or support vector machines, suffer from the limitation that they can deal only with data which is in the form of data tuples (such as real-valued or boolean vectors) of fixed length. Attribute-value learning algorithms face problems when dealing with data in a more structured form, for example spatial structures of proteins. On the other hand, relational learning algorithms can directly learn from data expressed as relational structures such as graphs or the logic-based form which we adopt and explain below. Spatial structures of proteins, which is what we are interested in, can be represented very naturally within the relational-learning framework.

Propositionalization[16] is a general strategy which combines advantages of attribute-value learning algorithms (usually higher accuracy) and relational learning algorithms (ability to handle structured examples). In propositionalization, one tries to convert a relational learning problem to an attribute-value learning problem by transforming the original relational representation to an (approximate) attribute-value representation, i.e. to representation where learning examples are represented as vectors of fixed size, and then to train an attribute-value classifier for such data. Thus, roughly speaking, propositionalization corresponds to steps 2 and 3 of our method.

The representation of examples that we use is rooted in the field of inductive logic programming [14] which is a sub-field of relational learning. However, for brevity, we mostly avoid the whole logical machinery usually used in inductive logic programming and we speak instead (rather informally) about relational structures instead of first-order formulas and logical interpretations. A literal is an expression of the form literalName(A1,…,Ak) where A1,…,Ak are variables or constants. We use the convention from logic programming that variables start with an upper-case letter. For example residue(A,his) or distance(AB,10.0Å) are literals and A, B are variables whereas his and 10Å are constants. An example is simply a set of literals none of which contains a variable. For instance

<a onClick="popup('http://www.proteomesci.com/content/10/1/66/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.proteomesci.com/content/10/1/66/mathml/M1">View MathML</a>

is an example describing a dipeptide.

Besides examples, we also need patterns. A pattern is a set of literals which, unlike examples, may contain variables. An example of a pattern is

<a onClick="popup('http://www.proteomesci.com/content/10/1/66/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.proteomesci.com/content/10/1/66/mathml/M2">View MathML</a>

A pattern p is said to covera an example e when we are able to find a substitution θ to variables of p such that e. For example the pattern p1 covers the example e1 because p1θ e for substitution θ = {A/bB/a}. We are not interested only whether a pattern p covers a given example e but also how many covering substitutions there are, i.e. how many substitutions θ such that e there are. We call the number of covering substitutions of a pattern p its value. Although counting the number of covering substitutions is not very common in ordinary propositionalization approaches, it makes perfect sense for the problem of predicting DNA-binding propensity of proteins, since ability to bind DNA is often connected with count or proportion of atom-groups with certain properties (e.g. charged residues [17]).

In the experiments we used a representation of proteins that consisted of literals representing types of the residues and literals representing pair-wise distances between the residues up to 10Å. These distances were computed from alpha-carbon coordinates obtained from PDBb. We also restricted the shapes of possible patterns by insisting that the patterns have to be tree-like. Despite these simplifications, some of the examples contained, in the end, tens of thousands literals which would be very challenging for common relational learning systems such as Alephc, not to mention that these systems do not allow computing numbers of covering substitutions. Therefore we customized the pattern search algorithm [11] which is more appropriate for problems of this size due to its pruning mechanisms and strong structural language bias (it constructs only tree-like patterns). This pattern search algorithm prunes pattern space using two measures: redundancy (described by Kuželka et al. [11]) and minimum frequency which is a minimum number of examples that must be covered by a pattern. An example of a tree-like pattern is res(A,arg), res(B,arg), res(C,lys), dist(A,B,10.0), dist(A,C,10.0). This pattern assumes the presence of two Arginines – A and B – and one Lysine – C. The distance between the Arginines is 10Å, the distance between the Arginine A and the Lysine C is also 10Å. A pattern like this can be used as a feature, counting the number of occurrences of this particular spatial configuration of amino acids in proteins.

The generated patterns were used for classification using six state-of-the-art attribute-value learning algorithms listed in Table 1. We used implementation of these learning algorithms present in the WEKA [18] open-source machine learning software. We also combined the patterns constructed automatically by the relational pattern search algorithm with numerical features devised by Szilágyi and Skolnick [8].

Table 1. Learning algorithms

Parameters of the classifiers were tuned using internal cross-validation. When performing cross-validation, the set of patterns was created separately for each train-test split corresponding to iterations of cross-validation procedure. The number of trees for random forest and the number of iterations for Ada-boost was selected from the set {10,20,50,100,200,500,1000}. The complexity parameter c for linear support vector machine and for support vector machine with RBF kernel was selected from the set {1,10,102,103,104,105,106}. The regularization parameter of L2-regularized logistic regression was selected from the set {10−3,10−2,10−1,1,10,102,103}. The minimum frequency of features on one of the classes was 0.7.

We used a different methodology for experiments with datasets PD138/NB843, because the size of the dataset required a sampling-based approach to feature construction rather than exhaustive search. Therefore, we followed an approach in which patterns were constructed on several randomly selected subsets of data and then evaluated on the complete dataset. The number of random samples was set to 10, the number of proteins in the samples from each class was set to 20. The minimum frequency for each sample was set to 1.

Results and discussion

We experimented with several datasets to evaluate the predictive accuracy and also the interpretability of our approach. We compared classifiers based on structural patterns discovered by our method (SF) with classifiers based on 10 physicochemical features (PF) identified as most predictive by Szilágyi and Skolnick’s method [8]. We also trained classifiers based on both structural features and physicochemical features (PSF). For each experiment we estimated predictive accuracy and the area under the ROC curve (AUC) by 10-fold cross-validation. Lastly, we inspected the most informative structural patterns in order to evaluate interpretability of these patterns. We assessed the informativeness by the χ2 criterion [24].

We performed five sets of experiments with datasets of DNA-binding proteins - PD138, UD54, BD54, APO104 and ZF - each one as a set of positive examples and dataset of non-DNA-binding proteins NB110 - as a set of negative examples. We obtained about 1400 structural patterns for datasets PD138/NB110, approximately 1500 structural patterns for datasets UD54/NB110, about 2400 structural patterns for datasets BD54/NB110, about 2800 structural patterns for datasets APO104/NB110 and approximately 6000 structural patterns for datasets ZF/NB110. Accuracies and areas under the ROC curve (AUC) obtained on the respective datasets by stratified 10-fold cross validation using physicochemical features (PF), structural pattern features (SF) and combination of both of them (PSF) are shown in Table 2. The results for the method based on physicochemical features (PF) differs slightly from the results reported by Szilágyi and Skolnick [8], because we used 10-fold cross-validation whereas Szilágyi and Skolnick used leave-one-out cross-validation.

Table 2. Results

We computed average rankings (over several machine learning algorithms) for accuracies and AUCs. The average ranking (over several machine learning algorithms) of classifiers based on structural features (SF) was best on datasets UD54/NB110, APO104/NB110 (tie with PSF) and ZF/NB110 for accuracies and on datasets UD54/NB110 and ZF/NB110 in terms of AUC. The average ranking of classifiers based on combination of structural and physicochemical features (PSF) was highest on datasets PD138/NB110, APO104/NB110 (tie with SF) and BD54/NB110 for accuracies and on datasets PD138/NB110, APO104/NB110 and ZF/NB110 (tie with SF) in terms of AUC. Classifiers based on physicochemical features (PF) obtained highest ranking only for AUCs on dataset BD54/NB110.

We made an additional experiment with datasets PD138/NB843 in order to be able to compare our method with the method of Nimrod et al. [4]. In this experiment we used only the random forest classifier which was also used by Nimrod et al. On this dataset Nimrod et al. obtained AUC 0.9. We obtained AUC 0.84 with the method of Szilágyi and Skolnick, 0.82 with the method based on structural features and 0.82 with the method based on the combination of structural and physicochemical features. It is important to note that unlike the method of Nimrod et al. our method does not rely on information about evolutionary conservation.

In order to find out whether our method did not just capture the consensus patterns of particular protein folds, we performed an experiment in which we made use of the division of DNA-binding proteins (of the dataset PD138) into seven protein groups. Our method was always applied on sets of proteins consisting of all but one protein group, then the obtained classifiers were tested on this excluded group. The resulting accuracies of linear SVM classifier on the excluded groups were reasonably high with the exception of the enzyme group. The enzyme group turned out to be more difficult for DNA-binding prediction also in previous works [5,8].

The performed experiments allow us to evaluate usability of the relational learning approach for prediction of DNA-binding propensity as well as its usability for discovery of interesting spatial patterns in proteins. Results of our experiments suggest that the method is suitable for both of the tasks. Here, we also discuss the factors influencing predictive performance and biological relevancy of discovered structural patterns.

DNA-binding proteins in general

We made several sets of experiments for DNA-binding proteins in general (datasets PD138, UD54, BD54, APO104). The method based on purely structural features (SF) and the method based on the combination of structural and physicochemical features (PSF) achieved higher predictive accuracies than the method based purely on physicochemical features (PF) - features introduced by Szilágyi and Skolnick [8]. The only exception was in case of the dataset BD54/NB110, where the method based on purely physicochemical features performed better than the method based on purely structural features. The results were not as definite in the case of AUC as in the case of predictive accuracy. The method based on structural features turned out to be better than the method based on physicochemical features on two datasets. Interestingly, these two datasets contain DNA-binding proteins in their unbound conformations. The method based on the combination of structural and physicochemical features was better than the method based on purely physicochemical features on three datasets.

It may seem counter-intuitive that in some of the experiments, physicochemical features (PF) or structural features (SF) outperformed the combined feature set (PSF). However, this is a rather natural manifestation of the overfitting effect; expansion of the feature set may indeed be detrimental especially with small data sets [15].

It is interesting to compare the results for the datasets UD54 and BD54. Dataset UD54 contains DNA-binding proteins in unbound conformation, dataset BD54 contains the same DNA-binding proteins, but in bound conformation with DNA. Whereas the highest predictive accuracies and best AUCs were obtained by the method based on structural features on dataset UD54, this method performed worst on dataset BD54. Interestingly, the number of frequent structural patterns was significantly higher for dataset BD54 (approximately 2400 structural patterns) than for the dataset UD54 (approximately 1500 structural patterns). This suggests that conformational changes after DNA-binding give rise to greater variability of spatial arrangements of some amino acid groups. Moreover, conformational changes may be responsible for increase of spatial asymmetry of some amino acids or protein’s dipole moment. This can explain the better performance of the method based on physicochemical features on the dataset BD54 (recall that these features were selected by experimenting on DNA-binding proteins in bound conformation with DNA by Szilágyi and Skolnick [8]). Also note that prediction of DNA-binding propensity from unbound conformations is more important for practical applications.

We examined the best discovered patterns in detail. For each split of the dataset PD138 induced by 10-fold cross-validation we selected the ten most informative structural patterns according to the χ2 criterion. Table 3 shows the number of occurrences of the ten best patterns. There are four structural patterns which are present in all ten folds. The first is res(A), residue(A,arg). This pattern counts the number of Arginines in the protein. It is known that the Arginine plays an important role in the DNA binding process. For now, we are interested in structural patterns. Since this pattern included no spatial information relating to other amino acids, we decided to analyse just the remaining three patterns.

Table 3. The most informative patterns for PD138

We inspected how structural patterns are reflected in protein’s primary structure. First, we examined whether amino acids matched by a pattern occur in a preferred order in the proteins’ sequences. We calculated the distribution of permutations of the amino acids matched by the first analysed structural pattern res(A,arg), res(B,lys), dist(A,B,4.0). The distribution of permutations on positive dataset was almost identical. Next, we were looking for relative positions of these amino acids in the sequences of DNA-binding proteins. Mostly the amino acids were situated next to each other in the proteins’ sequences for both permutations of amino acids: [arg,lys] and [lys,arg], i.e. on positions n and n+1. We also obtained occurrences of this pattern, where the amino acids were on positions n and n+3 for permutation [arg, lys].

The next analysed structural pattern was res(A,arg), res(B,arg), res(C,lys), dist(A,B,10.0), dist(A,C,10.0). There were no prevailing permutations for this structural pattern and also no prevailing local arrangements of amino acids in sequence. It would be hard to express this pattern using only primary structure information, unlike in the case of the previous pattern.

The third analysed structural pattern was res(A,arg), res(B,arg), dist(A,B,6.0). The most frequent relative positions of the amino acids were [n, n+2], [n, n+3], [n, n+4], where the first relative positions were approximately two times more frequent than the other two.

Zinc finger proteins

Zinc finger proteins are one of the most common DNA-binding proteins in eukaryotic transcription factors. Several studies [25-32] have tried to determine the DNA recognition by these proteins. The sequence of three fingers of the protein Zif268, which served as the prototype for understanding DNA recognition by this family of proteins, is shown with the cysteines and histidines involved in zinc coordination indicated in bold font in Table 4 (reproduced from Wolfe et al. [32]). Filledsquares below the sequences indicate the position of the conserved hydrophobic residues. Filledcircles and stars indicate residue positions that are involved in phosphate and base contacts (respectively) in most of the fingers. We evaluated relevance of the discovered structural patterns matching them to observations in the paper of Wolfe et al. [32].

Table 4. Annotated sequence of three fingers of Zif268

We made predictive classification experiments on dataset of zinc finger proteins (ZF). The best results, in terms of accuracy and AUC, were obtained by the method based on structural features. However, here the results were influenced by the fact that the zinc finger proteins were highly homologous. Therefore, we were more interested in the question whether the structural patterns were able to discover some basic characteristic of DNA-binding process shared by zinc finger proteins.

We inspected the best discovered patterns. We selected the ten most informative structural patterns according to the χ2 criterion, following the same procedure as for the DNA-binding proteins in general. Table 5 shows the number of occurrences of the ten best patterns. There were three structural patterns present in all of the dataset splits. We show them in Figures 1 and 2.

thumbnailFigure 1. Structural patterns for Zinc Fingers. Most informative structural patterns according to the χ2 criterion for the data set of Zinc Fingers (edges not to scale).

thumbnailFigure 2. The most informative structural pattern for Zinc Fingers. Example proteins (1A1F and 1AAY) containing one discovered pattern shown for the Zinc-finger proteins’ dataset using the protein viewer software [33]. Residues assumed by the pattern are indicated in the following way: CYS - pink, HIS - violet, ARG - yellow.

Table 5. The most informative patterns for ZF

We calculated the distribution of permutations of the amino acids matched by the first analysed structural pattern res(A,cys), res(B,cys), res(C,his), res(D,his), res(E,arg), dist(A,B,6.0), dist(A,C,8.0), dist(A,D,10.0), dist(A,E,10.0). The most frequent permutation was [cys, cys, arg, his, his]. We looked for the relative positions of these amino acids in zinc finger proteins’ sequences. The most frequently occurring relative positions were: [n, n+5, n+17, n+18, n+22]. We compared this result with the observation described in the paper of Wolfe et al. [32] (reproduced in Table 4). This discovered structural pattern exactly matched the positions of some of the amino acids which are supposed to be directly involved in DNA-binding. In case of the second structural pattern res(A,cys), res(B,his), res(C,his), res(D,arg), dist(A,B,8.0), dist(A,C,10.0), dist(A,D,10.0) and the third structural pattern res(A,his), res(B,his), res(C,cys), res(D,arg), dist(A,B,8.0), dist(A,C,8.0), dist(A,D,4.0) the most frequent permutation was [cys, arg, his, his] and the resulting relative positions were [n, n+17, n+18, n+22]. Table 4 indicates that these two patterns (P2 and P3) cover the first pattern (P1).

While, as already commented, the discovered patterns matched the positions of some of the amino acids supposed to be directly involved in DNA-binding, they in fact do not capture specific properties of DNA-binding process but rather a consensus amino acid pattern known to be present in Cys2His2 zinc fingersd[32]. One could be concerned whether the patterns discovered for DNA-binding proteins in general (datasets PD138, UD54, BD54, APO104) just captured conserved consensus patterns of different folds as well. However, this was not the case, because every discovered pattern was contained in at least 70% of DNA-binding proteins (recall that minimum frequency 0.7 was used for feature construction). In order to assure validity of this claim we performed an additional experiment in which the relational learning model was always constructed for proteins from all but one protein group and then tested on this excluded group (see sectionEvaluation of binding motif independence). Nevertheless, these observations indicate that caution should be exercised when applying our relational learning method on datasets with highly homologous proteins, because conserved consensus patterns not necessarily related to the function of the proteins could be discovered instead of the sought patterns responsible for the function.

PD138/NB843 Dataset

We performed an additional experiment involving the method of Nimrod et al. [4] on the dataset PD138/NB843. The method of Nimrod et al. exploits also evolutionary information therefore it is interesting to see whether methods relying only on physicochemical and/or structural features could come close to its predictive accuracy.

In this additional experiment, we used only random forest classifier because this classifier was also used by Nimrod et al. The AUC values of the approaches based on the physicochemical features (PF), structural features (SF), and their combination (PSF) were (respectively) 0.84, 0.82, and 0.82, whereas the method of Nimrod et al. achieved AUC of 0.9. This indicates that there is still a large gap between the structural and physicochemical feature based approaches on one hand, and methods relying on evolutionary conservation information.

Evaluation of binding motif independence

In order to further support our claim that the patterns discovered for DNA-binding proteins in general (datasets PD138, UD54, BD54, APO104) did not just capture the consensus patterns of particular folds, we performed an experiment in which the relational learning model was always constructed for proteins from all but one protein group and then tested on this excluded group. Proteins of the dataset PD138 were divided into seven groups following the work of Szilágyi and Skolnick [8]. They were the following: helix-turn-helix, zinc-coordinating, zipper-type, other α-helix, β-sheet, other and enzyme. We used linear SVM based on our structural features (SF), because SVM turned out to perform best in the experiments described in Table 2. We show both the predictive accuracies obtained by testing the learnt classifiers on the excluded groups and the cross-validated accuracies obtained by the classifiers on the remaining parts of the dataset in Table 6. The resulting accuracies on the excluded groups, which should correlate with the ability of our method to discover patterns characteristic for DNA-binding proteins in general, are reasonably high with the exception of the enzyme group. This agrees with the results of Szilágyi and Skolnick [8] and Stawiski et al. [5], who also noticed a drop in the ability of their method to detect DNA-binding proteins in the enzyme group. We can conclude that our method is indeed able to construct classifiers which can work accurately over various (non-enzyme) groups of proteins and that its ability to detect DNA-binding proteins is not due to discovery of conserved consensus patterns of different protein folds.

Table 6. Evaluation of binding motif independence

Conclusions

We applied relational machine learning techniques to predict DNA-binding propensity of proteins. We utilized our relational learning method [11]. We have shown that our relational learning approach is competitive to a state-of-the-art physicochemical approach for DNA-binding propensity prediction in terms of predictive accuracy. Moreover, we have illustrated that our method is capable to also provide interpretable patterns describing spatial configurations of amino acids in protein structures. In the future we would like to apply the method to protein function prediction in general.

Endnotes

aThis definition is equivalent to what is known as hypergraph homomorphism or θ-subsumption in inductive logic programming [14].

bhttp://www.pdb.org webcite

cSrinivasan, A.: The Aleph Manual, 4th edn. (2007), http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html webcite

dWe are grateful to an anonymous reviewer of the paper for pointing out this fact.

Competing interests

The authors have no competing interests to declare.

Authors’ contributions

Fž and JT conceived the idea of using relational machine learning for DNA-binding prediction. AS and OK conceived, designed and implemented the method, performed the experiments and analysed the results. AS, OK, Fž and JT wrote the paper. All authors read and approved the manuscript.

Acknowledgements

This research was supported by the Czech Science Foundation through project P202/12/2032, except for the part concerned with Zinc-Finger proteins which was supported by the Czech Ministry of Education through project ME10047.

References

  1. Zhao H, Yang Y, Zhou Y: Structure-based prediction of DNA-binding proteins by structural alignment and a volume-fraction corrected DFIRE-based energy function.

    Bioinformatics 2010, 26(15):1857-1863. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  2. Gao M, Skolnick J: DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions.

    Nucleic Acids Res 2008, 36(12):3978-3992. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Gao M, Skolnick J: A threading-based method for the prediction of DNA-binding proteins with application to the human genome.

    Plos Comput Biol 2009, 5(11):e1000567. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Nimrod G, Szilágyi A, Leslie C, Ben-Tal N: Identification of DNA-binding proteins using structural, electrostatic and evolutionary features.

    J Mol Biol 2009, 387(4):1040-1053. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  5. Stawiski E, Gregoret L, Mandel-Gutfreund Y: Annotating nucleic acid-binding function based on protein structure.

    J Mol Biol 2003, 326:1065-1079. PubMed Abstract | Publisher Full Text OpenURL

  6. Ahmad S, Sarai A: Moment-based prediction of DNA-binding proteins.

    J Mol Biol 2004, 341:65-71. PubMed Abstract | Publisher Full Text OpenURL

  7. Bhardwaj N, Langlois R, Zhao G, H L: Kernel-based machine learning protocol for predicting DNA-binding proteins.

    Nucleic Acids Res 2005, 33(20):6486-6493. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  8. Szilágyi A, Skolnick J: Efficient prediction of nucleic acid binding function from low-resolution protein structures.

    J Mol Biol 2006, 358(3):922-933. PubMed Abstract | Publisher Full Text OpenURL

  9. Patel A, Patel S, Naik P: Binary classification of uncharacterized proteins into DNA binding/non-DNA binding proteins from sequence derived features using ANN.

    Digest J Nanomaterials Biostructures 2009, 4(4):775-782. OpenURL

  10. Nassif H, Al-Ali H, Khuri S, Keirouz W, Page D: An inductive logic programming approach to validate hexose biochemical knowledge. In Proceedings of the 19th International Conference on ILP. Leuven: Springer-Verlag; 2009:149-165. OpenURL

  11. Kuželka O, železný F: Block-wise construction of tree-like relational features with monotone reducibility and redundancy.

    Mach Learn 2011, 83:163-192. Publisher Full Text OpenURL

  12. Szabóová A, Kuželka O, železný F, Tolar J: Prediction of DNA-binding proteins from structural features. In MLSB 2010: 4th International Workshop on Machine Learning in Systems Biology. UK: University of Edinburgh; 2010:71-74. OpenURL

  13. Siggers T, Honig B: Structure-based prediction of C2H2 zinc-finger binding specificity: sensitivity to docking geometry.

    Nucleic Acids Res 2007, 35(4):1085-1097. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. De Raedt L: Logical and Relational Learning. Heidelberg: Springer-Verlag; 2008. OpenURL

  15. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag; 2001. OpenURL

  16. Lavrač N, Flach P: An extended transformation approach to inductive logic programming.

    ACM Trans Comput Logic 2001, 2:458-494. Publisher Full Text OpenURL

  17. Jones S, Shanahan H, Thornton J, Berman1 H: Using electrostatic potentials to predict DNA-binding sites on DNA-binding proteins.

    Nucleic Acids Res 2003, 31:7189-7198. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  18. Witten I, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. San Francisco: Morgan Kaufmann; 2005. OpenURL

  19. Burges C: A tutorial on support vector machines for pattern Recognition.

    Data Min Knowl Discov 1998, 2(2):121-167. Publisher Full Text OpenURL

  20. Landwehr N, Hall M, Frank E: Logistic model trees.

    Mach Learn 2005, 59(1-2):161-205. Publisher Full Text OpenURL

  21. Hilbe J: Logistic Regression Models. New York: Taylor & Francis, Inc.; 2009. OpenURL

  22. Freund Y, Schapire R: A decision-theoretic generalization of on-line learning and an application to boosting. London, UK: Springer-Verlag; 1995. OpenURL

  23. Breiman L: Random forests.

    Mach Learn 2001, 45:5-32. Publisher Full Text OpenURL

  24. Liu H, Setiono R: Chi2: Feature selection and discretization of numeric attributes. In Proceedings of IEEE 7th International Conference on Tools with Artificial Intelligence. New Jersey, USA: IEEE Piscataway; 1995:338-391. OpenURL

  25. Desjarlais J, Berg J: Toward rules relating zinc finger protein sequences and DNA binding site preferences.

    Proc Nat Acad Sci 1992, 89(16):7345-7349. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  26. Desjarlais J, Berg J: Use of a zinc-finger consensus sequence framework and specificity rules to design specific DNA binding proteins.

    Proc Nat Acad Sci 1993, 90(6):2256-2260. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  27. Desjarlais J, Berg J: Length-encoded multiplex binding site determination: application to zinc finger proteins.

    Proc Nat Acad Sci 1994, 91(23):11099-11103. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  28. Nardelli J, Gibson T, Charnay P: Zinc finger-DNA recognition: analysis of base specificity by site-directed mutagenesis.

    Nucleic Acids Res 1992, 20(16):4137-4144. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  29. Thukral S, Morrison M, Young E: Mutations in the zinc fingers of ADR1 that change the specificity of DNA binding and transactivation.

    Mol Cell Biol 1992, 12(6):2784-2792. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  30. Pavletich N, Pabo C: Zinc finger-DNA recognition: crystal structure of a Zif268-DNA complex at 2.1Å.

    Science 1991, 252(5007):809-817. PubMed Abstract | Publisher Full Text OpenURL

  31. Elrod-Erickson M, Rould M, Nekludova L, Pabo C: Zif268 protein-DNA complex refined at 1.6Å: a model system for understanding zinc finger-DNA interactions.

    Structure 1996, 4(10):1171-1180. PubMed Abstract | Publisher Full Text OpenURL

  32. Wolfe S, Nekludova L, Pabo C: DNA recognition by Cys-2-His-2 zinc finger proteins.

    Annu Rev Biophys Biomol Struct 2000, 29:183-212. PubMed Abstract | Publisher Full Text OpenURL

  33. Moreland J, Gramada A, Buzko O, Zhang Q, Bourne P: The Molecular Biology Toolkit (MBT): a modular platform for developing molecular visualization applications.

    BMC Bioinformatics 2005, 6(1):21+. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL