Abstract
There is a growing interest in the identification of proteins on the proteome wide scale. Among different kinds of protein structure identification methods, graphtheoretic methods are very sharp ones. Due to their lower costs, higher effectiveness and many other advantages, they have drawn more and more researchers’ attention nowadays. Specifically, graphtheoretic methods have been widely used in homology identification, sidechain cluster identification, peptide sequencing and so on. This paper reviews several methods in solving protein structure identification problems using graph theory. We mainly introduce classical methods and mathematical models including homology modeling based on clique finding, identification of sidechain clusters in protein structures upon graph spectrum, and de novo peptide sequencing via tandem mass spectrometry using the spectrum graph model. In addition, concluding remarks and future priorities of each method are given.
Background
Protein structure identification is a central research area in proteomics [1]. Proteins, as we know, are complex organic compounds, which consist of series of amino acids. Protein structures are usually considered as four different levels from amino acids sequences to various folding patterns. They are very important in proteomics since they usually determine the function, homology and other features of proteins. Therefore, increasing number of researchers are focusing on protein structure identification problems. Usually, biological experiments for identifying protein structures produce huge quantity of data. Facing these molecular biology data, researchers aim to find perspective relationships of proteins through effective analyzing and then, focusing on further biological relationships and functions of them [2]. In order to deal with these, biological ways have been used at first time. However, due to various limitations such as strict environment request and high experiment cost, these methods have encountered tough difficulties. Mathematical methods, by contrast, are effective in summarizing and predicting biological characteristics with lower cost, which are drawing increasing attention and being widely used in this area. Among different kinds of mathematical methods, graph theory is an essential one [3], which owns advantages in various protein structure identification problems including predicting protein structure, identification of sidechain clusters in protein structures, de novo sequencing, and so on [4,5].
In this paper, we summarize current applications and development of graph theory modeling in protein identification, mainly introducing three classical methods and mathematical models including homology modeling based on clique finding, identification of sidechain clusters in protein structures upon graph spectrum, and de novo peptide sequencing via tandem mass spectrometry using the spectrum graph model. Besides, we briefly analyze the advantages and disadvantages of these methods and give some possible directions for future research.
Review
Basic knowledge of graph theory
In order to understand the problem modeling, we need to know some basic concepts and background knowledge in graph theory. A graph G is an ordered pair (V (G), E(G)) consisting of a set V(G) of vertices and a set E(G), disjoint from V(G), of edges, together with an incident function ψ_{G} that associates with each edge of G an unordered pair of vertices (not necessary distinct), if e is an edge and u and v are vertices such that ψG(e) ={u, v}, then the edge e is said to join the vertices u and v, and u and v are called the ends of e[6]. We denote the numbers of vertices and edges in G by v(G) and e(G), which are called the order and size of G, respectively. In this paper, we always use G to represent a graph we are concerning.
The following is an example of a graph to clarify the definition. For notational simplicity, we use uv for the unordered pair {u,v}. Let G = (V(G), E(G)), where V(G) = {u, v, w, x, y}, E(G) = {a, b, c, d, e, f, g, h}. The function ψ_{G} is defined as: ψ_{G}(a) = uv, ψ_{G}(b) = uu, ψ_{G}(c) = vw, ψ_{G}(d) = wx, ψ_{G}(e) = vx, ψ_{G}(f) = wx, ψ_{G}(g) = ux, ψ_{G}(h) = xy. The graph G could be drawn as in Figure 1.
An edge with identical ends is called a loop, and an edge with distinct ends a link . Two or more links with the same pair of ends are said to be parallel edges. A graph is simple if it has no loops or parallel edges. In this paper, all the graphs we concern are simple graphs.
A complete graph is a simple graph in which any two vertices are adjacent, an empty graph one in which no two vertices are adjacent (that is, one whose edge set is empty). A path is a simple graph whose vertices can be arranged in a linear sequence in such a way that two vertices are adjacent if they are consecutive in the sequence, and are nonadjacent otherwise. The length of a path is the number of its edges. In a graph G, the degree of a vertex v, denoted by d_{G}(v), is the number of edges of G incident with v, each loop counting as two edges. The set of all vertices incident with v is denoted by N_{G}(v) [6].
In a graph, a clique is a set of mutually adjacent vertices, in other words, a subset of V(G) that has completely connected vertices. So in a clique, arbitrarily choosing two vertices, they are connected with each other. A clique in a graph is maximum if the graph contains no larger cliques. If a subgraph S in a graph G is a clique, then the clique center is a vertex v in S satisfying that, ∀u ∈ V(S) \ v, maxd(u, v) is minimal. The clique center is weighted if G is weighted in calculating distance.
Adjacency matrix of a graph G is the n × n matrix A_{G} := (a_{uv}), where a_{uv} is the number of edges joining vertices u and v. Each loop is counted as two edges [6]. A set of points in space can be represented in the form of a graph where the points represent the vertices of the graph and the distances between the points represent edges. The constructed graph can be represented mathematically in the form of a matrix called the Laplacian matrix[7]. Graph spectrum is the information on analyzing the eigenvalues and eigenvectors related to Laplacian matrix in the graph spectrum research. It can gain information on cliques and clique centers in the graph.
Construction of homology modeling upon bestweight clique finding
Problem description
Homology modeling is a key aspect in preteome study. When we say that sequence A has high homology to sequence B, we claim that not only sequence A looks much the same as sequence B, but also all of their ancestors look the same, going all the way back to a common ancestor [8]. Identification of homological sequences enables us to assign information from one known sequence to another unknown sequence, which enables to save lots of time and energy in research, too. However, homology modeling is facing many difficulties nowadays. One problem is that it is usually hard to find acceptable conformations of proteins because many conformations are highly dependent on experiment environment which would definitely limit the experiment design. Another problem is that there is no much effective algorithm available to cope with biological methods. Therefore, researchers are thinking of different mathematical approaches to solve these problems. Among them, the graphtheoretic method is a typical one. In this section, we will introduce a graphtheoretic method that constructs homology modeling upon bestweight clique finding. We first introduce some concepts, followed by modeling process, and then evaluate this method, giving some future research directions at last.
Homology modeling, also known as comparative modeling of proteins, is a technique that identifies approximate structure of a target protein from a related known homologous protein. When the target sequence is closely related to some known sequence, their overall folds are similar [9], so we can reconstruct the structure of target protein (from sequence) if we recognize its folding way by the known protein.
The steps of homology modeling can be arranged as follows. First, identifying an alignment between the target and related protein sequences [10]. Second, copying the mainchain coordinates from the related protein for equivalent residues and inferring some sidechain conformations. Last, building other structures left. In this procedure, current numerical methods encounter difficulties because it is hard to find suitable models [1116]. A good model should not only satisfies the polypeptide chain property that steric exclusive effect makes energy surface discontinuous and that the conformation is contextdependent, but also has effective algorithms in implementing. Here, a graphtheoretic method can be applied to solve this problem well [17].
Graphtheoretic modeling
In 1998, Samudrala and Moult transferred homology modeling into a clique finding problem in graph theory and used an effective algorithm to solve it [18]. The vertices and edges of the graph are defined as follows.
Vertex: Each possible conformation of an amino acid residue in the sequence stands for a vertex in the graph. The weight of the vertex depends on interaction strength between local mainchain atoms and sidechain atoms. The mainchain atoms up to four residues on each side of the residue position, and the mainchain atoms of this residue, should be considered to calculate the weight.
Edge: Edges would be drawn when vertices present residue conformations within the same mainchain segment but not between clash atoms or different possible sidechain conformations of the same residue. The weight of an edge stands for interaction strength between two differen vertices (which represent residues).
Once the qualified graph has been drawn, all the maximal sets of cliques can be found using a clique finding algorithm [19,20]. Here, we propose an algorithm developed by Bron and Kerbosch [21].
This algorithm uses a recursive backtracking procedure and a branchbound technique to achieve quick time clique finding [22]. There are three sets that play key roles in the algorithm: (1) potential clique; in this set, all the vertices are connected to each other, so this set can be extended by some new qualified vertices and has the potential to be the maximal clique. (2) candidates; this set consists of the vertices that can be added into the potential clique set. (3) not; this is a set of vertices that not belong to either of the former two sets, which means that the vertex has already served as an extension to the current potential clique set but not qualified.
At the beginning of the algorithm, potential clique and not are both empty while candidates consists of all the vertices of graph G, which represents all the possible conformations and their interactions. After that, choosing vertex v in candidates with maximal degree to the potential clique set. This kind of strategy makes larger cliques being found faster. Then, the vertices in candidates should be the vertices connected to v, and the vertices in not be the vertices disconnected to v. After that, choosing vertex u with maximal degree in the current candidates set, and repeating the procedure till the candidates set is empty. The procedure can also be written as the following steps. We use P, C, N to represent the sets potential clique, candidates, and not, respectively.
step 1: Set C = V(G), P = ∅, N = ∅;
step 2: If C ≠ ∅, calculate , go to step 3; else go to step 4.
step 3:P = P⋃{v}, C = C⋂N_{G}{v}, N = V(G)\(P⋃C), go to step 2.
step 4: Output P, stop.
Following this procedure, we can find (one of) the maximal cliques in G. Since each of the cliques represents a possible conformation of the sequence, the maximal one with the best weight would be considered as the most similar one to the native protein structure.
The score of each clique used to find maximal one with best weight is defined as
where S(d_{ab}) represents the score of atoms type a and b with distance d, P(d_{ab}C) represents the probability of observing a distance d between atom type a and b in a correct structure, and P(d_{ab}) represents the probability of observing such a distance in all conditions without considering it is correct or not. The value of P(d_{ab}C)/P(d_{ab}) is calculated by
where N(d_{ab}) represents the number of observations of atom types a and b in a particular distance d, ∑_{d}N(d_{ab}) represents the number of a – b contacts observed for all distances, ∑_{ab} N(d_{ab}) represents the total number of contacts between all pair of atom types in a particular distance d, and ∑_{d}∑_{ab} N(d_{ab}) represents the total number of contacts between all pair of atom types observed for all distances.
Given a weighted clique with n vertices and m edges representing a possible conformation, its score that represents the correctness of the probability can be calculated by
where S(vertex) is the sum of the scores for distances between all atoms p of the sidechain and atoms q of the total mainchain. Therefore, we have
and S(edge) is the sum of the scores for the distance between an atom r of one residue and an atom s of the other, which can be calculated by
If the distance between r and s is no more than four residues, only sidechain atoms are used to calculate scores. All S(vertex) and S(edge) are calculated only once. By this means, the calculating cost can be reduced a lot.
Discussion and further improvement
This section gives a typical graphtheoretic method which solves homology modeling problem. It has mainly three advantages. First, it transfers a protein structure identification problem to a graph theory one, uses the algorithm of graph theory (clique finding) to solve it and makes the original problem easier to handle. Second, in this model, each score can be calculated fast, which makes the computation easy to accomplish. At last, this method excludes impossible conformation before giving weight, which eliminates the number of edges and reduces the computation scale.
However, we can also see that there are some disadvantages in this method. One is that clique finding in a given graph is an NPhard problem that the computation time of the worst case is O(3^{n/}^{3}) [21], so it cannot be applied to large proteins. The other is that the function used to calculating weights of vertices and edges eliminates that the weight must be independent from other vertices and edges.
This method showed its effectiveness in the experiments done by Samudrala and Moult [18]. When the scoring function is appropriate and the CF algorithm is suitable, it can find out the nativelike conformations and native structure. This method successfully calculates the fitness of a conformation, excluding a large number of unacceptable conformations, then finds the conformations represented by the cliques independently. However, if the scale of the graph is extremely large, the clique finding algorithm would be timing consuming. Further improvements of the proposed method can be focused on at least two aspects. One is improving the algorithm and the other is modifying the model. For the former one, we can try to find other advanced clique finding (CF) algorithms to reduce the computation time and broaden the range of protein size, or we may use some parallel approaches to fasten the speed. For the latter one, we can modify the original model in selection part, adding filters to exclude more unacceptable conformations to reduce the scale of the graph.
Identification of sidechain clusters in protein structures upon graph spectrum
problem description
Sidechain interactions are essential to protein stability, function and folding. In protein secondary structures, the role of noncovalent sidechain interactions in stabilizing the mutual orientation has been studied well [2325]. It is well known that clusters of hydrophobic sidechains on the surface are important for proteinprotein recognition [2630], protein oligomerization [3133] and protein DNA interactions [34]. However, identifying sidechain interactions by experimental ways is very difficult, thus researchers prefer mathematical methods. In 1999, Kannan and Vishveswara explored a method to detect sidechain clusters in protein threedimensional structures using a graph spectral approach [7].
Graphtheoretic modeling
The protein structure can be represented by a weighted graph being made up of residues. The vertices and edges are defined as follows.
Vertex: The C^{β} atoms of the interacting residues are represented by vertices in a graph. Since atoms are labeled by Greek alphabetic order, C^{α} is the carbon closest to the hydroxyl group(–OH), and C^{β} is the second closest one.
Edge: If the distance between two C^{β} atoms satisfies specific interaction, we draw an edge between them.
In protein structure, sidechain interactions are represented by a weighted graph and the constructed graph is represented by its Laplacian matrix. Clusters are obtained directly from the eigenvector associated with the second lowest eigenvalue of the Laplacian matrix, and the sidechains which make the largest number of interactions in a cluster (cluster centers) are obtained from the eigenvectors associated with the top eigenvalues [7]. Particularly, clustering information is sorted in the vector components of the second lowest eigenvalue, for example, all vector components in the same cluster have the same value [35], and the vector components of the top eigenvalues carry the information regarding the branching of the points forming the cluster [36] and cluster centers [37,38]. This methodology, also been used in other disciplines like electrical engineering for obtaining clusters in circuit netlists [39], has been used here for the identification of clusters in protein structures.
An easy way to construct an adjacency matrix is to assign 1 or 0 to a_{ij} according vertex i and j are adjacent or not in the graph. Here, we use the following weight to construct adjacency matrix.
where d_{ij} is the distance between C^{β} atoms of the residues i and j.
A distance of 100 is assigned to the two sidechains not satisfying the interaction criteria, hence their corresponding weight (1/100) are close to zero. The degree matrix D := (d_{ij}) is constructed as:
thus, the Laplacian matrix B can be calculated as:
Here, we also need to define a function that evaluates sidechain interactions since the definition of A uses it. The interaction can be calculated as
where R_{i}, R_{j} are two different residues, Int(R_{i}, R_{j}) is the sidechain interaction of residues R_{i} and R_{j}, and N(R_{i}, R_{j}) is the number of all pairs of interacting sidechain atoms. Here only those atoms of residues have distance within 4.5 Å are calculated. Normal(type(R_{i})) is the normalization value of residue R_{i} that can be calculated in advance. Here we do not concern the way of calculating this value, but only show the Normal(type(R_{i})) for all 20 residues (see Table 1). Detailed calculation process can be found in [7].
Table 1. The normal(type(R_{i})) for 20 residues
After that, we can define the sidechain interaction criteria in different values. Noticing that when R_{i} and R_{j} are fixed, Int(R_{i}, R_{j}) is fixed, too. When the sidechain interaction threshold becomes higher, fewer residues will be considered, which leads to fewer clusters being found. However, if the threshold is too low, it will result in large expanded clusters. Therefore, there is a tradeoff of setting the proper threshold in this method.
Since sidechain information can be calculated through the clique and clique center, our goal here is to find them. Specifically, Clusters are acquired from the eigenvectors associated with the second lowest eigenvalue of the Laplacian matrix, and sidechains that have the most interaction in cluster (cluster center) are acquired from the eigenvectors associated with the top eigenvalues. Therefore, the Laplacian matrix B contains the information of cliques and clique centers, and useful sidechains in the protein structure can be found by the above method. The detailed approach of calculating clique center upon graph spectrum and an example can be found in the Appendix of [7].
Discussion and further improvement
This section discusses the aspects of graph spectral approach that used for identification of sidechain clusters. Clusters are obtained directly from the eigenvectors associated with the second lowest eigenvalue of the Laplacian matrix and the sidechains which make the largest number of interactions in a cluster (cluster centers) are obtained from the eigenvectors associated with the top eigenvalues. This approach detects clusters by using different sidechain interaction criteria which can be changed by users easily. Higher sidechain interaction threshold results in less clusters while lower threshold leads to expanded clusters. Users may change the threshold to fit the specific problem they are concerning. Also, this approach can be implemented by numerical methods and the output is a simple twodimensional cluster plot which contains the cluster and cluster center information.
However, this approach also has some disadvantages. One is that the sidechain interaction criteria is defined by researchers without any deep analysis on why this criteria is suitable, the other is that the way of constructing adjacency matrix A may be still simple and does not reflect interaction properly. Therefore, main issues in future can be the improvement of sidechain criteria and ways of constructing A.
De novo peptide sequencing via tandem mass spectrometry
Tandem mass spectrometry
Nowadays, tandem mass spectrometry (MS/MS) plays an important role in protein identification problems [40,41]. It breaks a peptide into smaller fragments and measures the mass of each fragment. A typical procedure of MS/MS contains the following steps. Protein mixtures are first digested into suitable sized peptides for mass spectrometric analysis using sitespecific proteases (usually trypsin). Then the peptides are ionized during a ionization process. After that, Some of the peptides are fragmented by collisioninduced dissociation (CID) and their tandem mass spectra are collected then [4245].
A tandem mass spectrometry works like a charged sieve, we can only get a series of charged fragments from it [46,47]. Large molecules are broken into small pieces, and the problem of peptide sequencing is to find out the whole sequence of the peptide from these fragments [48]. A schematic of MS/MS is shown in Figure 2. More introduction about mass spectrometry and tandem mass spectrometry can be found in [4954].
Figure 2. schematic of tandem mass spectrometry (from wikipedia)
Problem of peptide sequencing
In the following subsection, we will provide the method of modeling peptide sequencing based on [5]. Let A be the set of amino acids, since there are 20 different amino acids in nature, A can be defined as:
Then, the mass of each amino acid can be denoted as m(a_{i}), where i ∈ [1, 2,…, 20].
Let P = p_{1}⋯p_{n}be a sequence of amino acids. The mass of each amino acid and the mass of parent peptide P are denoted as m( p_{i}) and , respectively. A protein can be viewed as a chain of amino acids, which connected by a peptide bound. A peptide bound starts at a nitrogen(N) and ends at a carbon(C). We use P_{i} to represent Nterminal peptide p_{1}⋯p_{i}, and its mass can be calculated by . Similarly, We use to represent Cterminal peptide p_{i}_{+}_{1} ⋯p_{n} with mass m(P) – m_{i}.
When the peptide breaks down during MS/MS, it loses small pieces of molecules like water (H_{2}O), CO–group and NH–group[5557]. Assuming that there are k different types of ions that correspond to the removal of k chemical groups, the set of ions can be defined as
We also use δ_{j} to represent its mass, where j = 1, 2,…, k. A δ – ion of an N–terminal partial peptide P_{i} is a modification of P_{i} losing a small molecule of mass δ, and its mass is m_{i} – δ. Similarly, we can define δ – ion of the C–terminal partial peptides [58,59].
We denote the theoretical spectrum of peptide P as T(P), it can be calculated by subtracting all possible ion types δ_{1},δ_{2},…,δ_{k} from the masses of all partial peptide of P, such that every partial peptide generates k masses in the theoretical spectrum.
An experimental spectrum, denoted by S, is what we get from MS/MS, which can be defined as
where s_{t} is a fragment ion (peak) in S, t = 1, 2,…, q. In the following, we also use s_{t} to represent its mass. The experimental spectrum usually includes loss of some small fragments and chemical noises. Actually, MS/MS measures m/z ratio, where m stands for mass and z stands for charge value (typically, it is 1, 2, or 3). Here, we assume that z = 1 for simplicity. The distinction of the theoretical spectrum T(P) and the experimental spectrum S is the mathematical results (T(P)) given the peptide sequence P, and the experimental spectrum (S) without knowing what the peptide sequence is behind this spectrum (S). A match of T(P) and S can be used to measure the relationship between the two as well as to predict peptide sequence of S. Therefore, the problem of peptide sequencing can be described as below.
Problem of Peptide Sequencing
Finding a peptide whose theoretical spectrum has a maximum match to a measured experimental spectrum.
Input: Experimental spectrum S, the set of possible ion types ∆, and the parent mass m.
Output: A peptide P of mass m whose theoretical spectrum matches S better than any other peptide of mass m
De novo peptide sequencing method
There are mainly two ways to solve peptide sequencing problems, one is database search, and the other is de novo method [57,60]. The former one involves generating all 20l amino acid sequences of a certain length l and the theoretical spectrum related to each sequence, finding the maximal match among all the spectra [6163]. Considering the number of possible sequences grows exponentially with the length of peptide sequences, the computing time would also increase exponentially. De novo sequencing which usually uses a spectrum graph model, on the other hand, dose not need to generate all the amino acid sequences, thus developing fast and drawing increasing attention in recent years [6466]. Here, we introduce basic models and principles of this kind of method [5,65]. Some recent improvements and advanced approaches can be found in [6770].
In this method, a spectrum graph representing the experimental spectrum is constructed. Assuming that experimental spectrum S = s_{l},…,s_{q} consists of N–terminal ions. Here, we ignore C–terminal ions because we can build a similar model of C–terminal ions by changing N–terminal ions into C–terminal ions. Every mass of s_{t} ∈ S (t = 1, 2,…, q) may have been created from a partial peptide by one of the k different ion types. In other words, each s_{t} (t = 1, 2,…, q) corresponds to a spectrum of an ion, which is derived from some peptide P_{i} (i = 1, 2,…, n) losing some small group δ_{j} (j = 1, 2,…, k). However, we do not know what ion type of ∆ = {δ_{1}, δ_{2},…, δ_{k}} brings the mass of s_{t}, so we need to generate k different guesses for each mass in the experimental spectrum. Every guess corresponds to a hypothesis that, let x be the mass of some partial peptide, then s_{t} = x – δ_{j}, where t = 1, 2,…, q and j = 1, 2,…, k. Therefore, there are k different guesses of a partial peptide with mass x that s_{t} + δ_{1}s_{t} + δ_{2},…, s_{t} + δ_{k} corresponding to the mass s_{t} in experimental spectrum. That is to say, a partial peptide with mass x has k different possible conformations in this model.
After that, each mass in the experimental spectrum is transferred into a set consisting of k vertices in spectrum graph, corresponding to each possible ion type. The problem now can be solved by using graph theory. In particular, we use a directed acyclic graph (DAG) to represent the experimental spectrum. The vertices and edges of the graph are defined as follows.
Vertex: Each possible conformation of a partial peptide is represented by a vertex. The vertex for δ_{j} of the mass s_{t} is labeled with mass s_{t} + δ_{j} .
Edge: An directed edge is drawn from vertex u to v if the mass of v is larger than that of u by the mass of a single amino acid.
Now, if we add a vertex at 0 representing the starting vertex (with mass 0) and a vertex at m representing the parent peptide (with mass M), the peptide sequencing problem can be translated into a path (from 0 to m) finding problem in the resulting DAG. Specifically, if there exists an edge from u to v, the chain of amino acids will be extended by adding a chemical group whose mass is the mass difference between vertex u and v. Therefore, by finding a path from 0 to m in the DAG, amino acid chain increases gradually and the peptide sequence can be found eventually.
In addition, vertices of the resulting spectrum graph is a set of numbers s_{t} + δ_{j} representing potential masses of N–terminal peptides adjusted by the ion type δ_{j} . Every mass s_{t} generates k different vertices, denoted by V_{t}(s), then
There is the possibility that V_{t}(s) and V_{τ}(s) may overlap when s_{t} and s_{τ} are close, where s_{t}, s_{τ} ∈ S. The set of vertices in a spectrum graph is therefore {s_{initial}}⋃ V_{1} ⋃ ⋯ ⋃ V_{q} ⋃ {s_{final}}, where s_{initial} = 0 and s_{final} = m.
The spectrum graph has at most qk + 2 vertices. We label the edge of the spectrum graph by amino acid whose mass is equal to the mass difference between two possible conformations (vertices). If we view vertices as putative N–terminal peptides, the edge from u to v implies that the N–terminal sequence corresponding to v can be obtained by extending the sequence at u by the amino acid that labels on the edge from u to v, where u,v∈V(G).
For any i ∈ [1, n], if S contains at least one ion type corresponding to every N–terminal partial peptide P_{i} , we say that the spectrum S of a peptide sequence P = p_{l} … p_{n} is complete. The use of a spectrum graph is based on the fact that, for a complete spectrum, there exists a path of length n + 1 from s_{initial} to s_{final} in the spectrum graph that is labeled by P. This observation casts the peptide sequencing problem as one of finding the correct path in the set of all paths between two given vertices in a DAG. In addition, if the spectrum is complete, the correct path that we are finding will be the longest path in the graph usually [5].
Discussion and further improvement
In this section, we describe the de novo peptide sequencing problem and give an effective solution by a graphtheoretic method. The de novo method aims at inferring peptide sequences without using database, and the spectrum graph model solves this problem in a mathematical way. The solution successfully solves the problem by finding a longest path in a given spectrum graph. This kind of approach involves automatically interpreting the spectrum using the table of amino acids masses, and not relies on the completeness of database and effectiveness of searching algorithm, which the database method just relies on. Therefore, it usually costs less computation time, especially when the spectrum is with good quality.
However, this approach still has limitations. First, the success of finding the longest path in the graph relies on the completeness of mass spectrum, but in experiments, spectrum is always incomplete and combines with different kinds of noises, which makes the proposed approach hard to achieve. Second, finding the longest path in a given graph is an NPcomplete problem which is difficult to find optimal solution. Third, when peptide breaks into MS/MS, it loses different kinds of small molecules, and considering all these losses needs a lot of vertices been created in the spectrum graph. When the number of vertices of the graph increases, computation time of solving this problem increases too, and even faster. At last, this kind of approach does not pay much attention to the peak intensity but using the m/z value only.
The performance of de novo peptide sequencing depends on the quality of the MS/MS spectra and the algorithms. When the spectra is complete or with high quality, de novo algorithm can find the correct sequences faster than the database search method, and also has the ability of finding new peptide which is not in the current database. Also, with advanced algorithm, de novo method could handle with spectra containing much noise, with missing peaks and so on. However, due to the limitation of tandem mass spectrometry, the database method is still the most popular and widely used one today. Some possible ways of improvements of de novo method are given below. First, when the spectrum is incomplete, we can add the missing ones by their complementary ions. Since any ion with a mass X in MS/MS, there should be an ion with mass Y such that X + Y = M, where M is the mass of the parent peptide. Thus we can add complementary ions back in an experimental spectral data set [71]. Second, we can consider effective algorithms on finding the longest path in a given graph such as dynamic programming and parallel approach. Third, this method can be partly solved by modifying the original model from finding global solution to possible local solutions. Some suboptimal algorithms can be considered, too [69]. Last but not least, a meaningful issue for the future research can be the combination of de novo method and other approaches, for example, database search [72].
Conclusions
This paper reviews several methods in solving protein structure identification problems using graph theory. We first introduce the development of protein structure identification and existing problems, then giving basic knowledge of graph theory, and focusing on three typical methods using graph theory to solve protein identification problems. These methods are effective but still have problems or some inadequacy, so we also give concluding remarks of them.
In homology modeling based on clique finding, a graph that represents all the possible conformations of residues in amino acids and their interactions is drawn. We use a clique finding algorithm to find out the cliques with the best weight that are viewed as the optimal combinations of various sidechain and mainchain conformations. In identification of sidechain clusters in protein structures, graph spectral method is used. Clusters are obtained directly from the eigenvectors associated with the second lowest eigenvalue of the Laplacian matrix and the sidechains which make the largest number of interactions in a cluster (cluster centers) are obtained from the eigenvectors associated with the top eigenvalues. In de novo peptide sequencing via tandem mass spectrometry, the spectrum graph represents all the possible conformation of the partial peptide and the mass difference between each pair of conformations is drawn first. Then by finding the longest path in the spectrum graph, we can obtain the peptide sequence.
The above three methods all change protein identification problems into graphtheoretical ones and find effective ways of solving them. They give novel methods for handling proteomics problems and can be improved in various aspects in future. There are mainly two directions of improvements. One is the algorithm, such as improving CF algorithm and the longest path algorithm; the other is the model, for example, modifying sidechain interaction criteria. These improvements will enhance the computation ability and make the graph scale an acceptable size. We have seen that in recent literature, researchers are focusing on some of the improvements and have already done partial work successfully. However, there are still a vast amount of work for us to do to improve the current modified methods and find better ways to solve different protein identification problems in graph theoretical methods.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
YY wrote the first draft of the review. SGZ intensively revised the manuscript. FXW supervised and gave suggestions of modifications of the manuscript. All authors read and approved the manuscript.
Acknowledgements
This work was supported by Natural Sciences and Engineering Research council of Canada (NSERC) and National Natural Science Foundation of China (No. 10871158).
This article has been published as part of Proteome Science Volume 9 Supplement 1, 2011: Proceedings of the International Workshop on Computational Proteomics. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/9/S1.
References

Williams KL, Gooley AA, Packer NH: Proteome: not just a makeup name.
Today’s Life Science 1996, 1621. PubMed Abstract

GonzálezDíaz H, GonzálezDíaz Y, Santana L, Ubeira FM, Uriarte E: Proteomics, networks and connectivity indices.
Proteomics 2008, 8:750778. PubMed Abstract  Publisher Full Text

Pevzner PA: Computational Molecular Biology: An Alogorithmic Approach. Cambridge, Massachusetts: The MIT Press; 2000.

Jones NC, Pevzner PA: An Introduction to Bioinformatics Algorithms. Cambridge, Massachusetts: MIT press; 2004.

Bondy JA, Murty USR: Graph Theory. New York: Springer; 2008.

Kannan N, Vishveshwara S: Identification of sidechain clusters in protein structures by a graph spectral method.
J. Mol. Biol 1999, 292:441464. PubMed Abstract  Publisher Full Text

Pertsemlidis A, Fondon I, John W: Having a BLAST with bioinformatics (and avoiding BLASTphemy).

Chothia C, Lesk A: The relation between the divergence of sequence and structure in proteins.
EMBO Journal 1986, 5:823826. PubMed Abstract  PubMed Central Full Text

Greer J: Comparative modeling methods: application to the family of the mammalian serine proteases.
Proteins: Struct. Funct. Genet 1990, 7:317334. Publisher Full Text

Chen R: Monte Carlo simulations for the study of hemoglobin fragment conformations.

Skolnick J, Kolinski A: Simulations of the folding of a globular protein.
Science 1990, 250:11211125. PubMed Abstract  Publisher Full Text

Wilson S, Cui W: Applications of simulated annealing to peptides.
Biopolymers 1990, 29:225235. PubMed Abstract  Publisher Full Text

Venclovas C, Zemla A, Fidelis K, Moult J: Numerical criteria for evaluating protein structures derived from comparative modeling.

Abagyan R, Totrov M: Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and proteins.
J. Mol. Biol 1994, 235:9831002. PubMed Abstract  Publisher Full Text

Avbelj F, Moult J: Determination of the conformation of folding initiation sites in proteins by computer simulation.
Proteins: Struct. Funct. Genet 1995, 23:129141. Publisher Full Text

Harel D: Algorithmics: The Spirit of Computing. New York: Pearson Education; 1992.

Samudrala R, Moult J: A graphtheoretic algorithm for comparative modeling of protein structure.
J. Mol. Biol 1998, 279:287302. PubMed Abstract  Publisher Full Text

Moon J, Moser L: On cliques in graphs.
Israel J. Math 1965, 3:2328. Publisher Full Text

Augustson JG, Minker J: An analysis of some graph theoretical cluster techniques.
Journal of the ACM 1970, 17:571588. Publisher Full Text

Bron C, Kerbosch J: Algorithm 457: finding all cliques of an undirected graph.
Communications of the ACM 1973, 16:575577. Publisher Full Text

Little , John D, et al.: An algorithm for the traveling salesman problem.
Oper. Res 1963, 11:972989. Publisher Full Text

Chou KC, Nemeth G, Scheraga HA: Energetics of interactions of regular structural elements in proteins.
Accts. Chem. Res 1990, 23:134141. Publisher Full Text

Nemethy G, Scheraga HA: A possible folding pathway of bovine pancreatic RNase.
Proc. Natl. Acad. Sci. USA 1979, 76:60506054. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Creighton TE, Chothia C: Electing buried residues.
Nature 1989, 339:1415. PubMed Abstract  Publisher Full Text

Young L, Jernigan BL, Covell DG: A role for surface hydrophobicity in proteinprotein recognition.
Protein Sci 1994, 3:717729. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Guss JM, Freeman HC: Structure of oxidized polar plastocyanin at 1.6 Å resolution [abstract].
J. Mol. Biol 1983, 169:521563. PubMed Abstract  Publisher Full Text

Vam de Kamp M, Silvestrini MC, Brunoir M, Van Beumen J, Hali FC, Canters GW: Involvement of the hydrophobic patch of azurin in the electron transfer reactions with cytochrome c551 and nitrite reductase.
Eur. J. Biochem 1990, 194:109118. PubMed Abstract  Publisher Full Text

Pelletier H, Kraut J: Crystal structure of a complex between electron transfer partners, cytochrome c peroxidase and cytochrome c.

Chen L, Durley RCE, Mathews FS, Davidson VL: Structure of an electron transfer complex: methylamine dehydrogenase, amicyanin and cytochrome c551i.
Science 1994, 264:8689. PubMed Abstract  Publisher Full Text

Jones DH, McMillan AJ, Fersht AR: Reversible dissociation of dimeric tyrosiltRNA synthetase by mutagenesis at the subunit interface.

Ponder JW, Richards FM: Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes.
J. Mol. Biol 1987, 193:775791. PubMed Abstract  Publisher Full Text

Mossing MC, Sauer RT: Stable, monomeric variants of lambdaCro obtained by insertion of a designed betahairpin sequence.
Science 1990, 250:17121715. PubMed Abstract  Publisher Full Text

Anderson JE, Ptashne M, Harrison SC: Structure of the repressoroperator complex of bacteriophage 434.
Nature 1987, 326:846852. PubMed Abstract  Publisher Full Text

Hall KM: An rdimensional quadratic placement algorithm.
Manag. Sci 1970, 17:219229. Publisher Full Text

Randic M: Unique numbering of atoms and unique codes for molecular graphs.
J. Chem. Inf. Comp. Sci 1975, 15:105108. Publisher Full Text

Patra SM, Vishveshwara S: Classification of polymer structures by a graph theory.

Hagen L, Kahng AB: New spectral methods for ratio cut partitioning and clustering.
IEEE Trans. Comp.Design 1992, 11:10741084. Publisher Full Text

Johoson GJ, Biemann K: Computer program (DEQPEP) to aid in the interpretation of highenergy collision tandem mass spectra of peptides.
Biomed. Environ. Mass Spectrom 1989, 18:945957. PubMed Abstract  Publisher Full Text

McHugh L, Arthur JW: Computational methods for protein identification from mass spectrometry data.

Wysockia VH, Resingb KA, Zhang QF, Cheng GL: Mass spectrometry of peptides and proteins.
Methods 2005, 35:211222. PubMed Abstract  Publisher Full Text

McLafferty FW, Turecek F: Interpretation of Mass Spectra(Fourth Edition). California: United Science Books; 1993.

Pitt JJ: Principles and aplications of liquid chromatography mass spectrometry in clinical biochemistry.
Clin. Biochem. Rev 2009, 30:1934. PubMed Abstract  PubMed Central Full Text

Marshall AG, Hendrickson CL, Jackson GS: Fourier transform ion cyclotron resonance mass spectrometry: a primer.
Mass Spectrom. Rev 1998, 17:135. PubMed Abstract  Publisher Full Text

March RE: Quadrupole ion trap mass spectrometry: theory, simulation, recent developments and applications.
Rapid Commun. Mass Spectrom 1998, 12:15431554. Publisher Full Text

Na S, Paek E, Cheolju L: CIFTER: automated chargestate determination for peptide tandem mass spectra.
Anal. Chem 2008, 80:15201528. PubMed Abstract  Publisher Full Text

Wang P, Polce MJ, Bleiholder C, Paizs B, Wesdemiotis C: Structural characterization of peptides via tandem mass spectrometry of their dilithiated monocations.

Thomson JJ: Rays of positive electricity and their application to chemical analysis.
Proc. Roy. Soc 1913, 89:120. Publisher Full Text

Beynon J: The use of the mass spectrometer for the identification of organic compounds.

Biemann K, Cone C, Webster BR, Arsenault GP: Determination of the amino acid sequence in oligopeptides by computer interpretation of their highresolution mass spectra.
J. Am. Chem. Soc 1966, 88:55985606. PubMed Abstract  Publisher Full Text

Chamrad DC, Korting G, Stuhler K, Meyer HE, Klose J, et al.: Evaluation of algorithms for protein identification from sequence databases using mass spectrometry data.
Proteomics 2004, 4:619628. PubMed Abstract  Publisher Full Text

Wong J, Sullivan M, Cartwright H, Cagney G: msmsEval: tandem mass spectral quality assignment for highthroughput proteomics.
BMC Bioinformatics 2007, 8:51. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Futrell JH: Development of tandem mass spectrometry: one perspective.
Int. J. Mass Spectrom 2000, 200:495508. Publisher Full Text

Gray AL, Williams JG, Ince AT, Liezers M: Noise sources in inductively coupled plasma mass spectrometry: an investigation of their importance to the precision of isotope ratio measurements.
J. Anal. At. Spectrom 1994, 9:11791181. Publisher Full Text

Zhang JF, He SM, Ling CX, Cao XJ, Zeng R, Gao W: PeakSelect: preprocessing tandem mass spectra for better peptide identification.
Rapid Commun. Mass Spectrom 2008, 22:12031212. PubMed Abstract  Publisher Full Text

Resing KA, Ahn NG: Proteomics strategies for protein identification.
FEBS Letters 2005, 579:885889. PubMed Abstract  Publisher Full Text

Wysocki VH, Tsaprailis G, Simth LL, Mobile B, Protons L: A framework for understanding peptide dissociation.
J. Mass Spectrom 2000, 35:13991406. PubMed Abstract  Publisher Full Text

Aebersold R, Goodlett DR: Mass spectrometry in proteomics.
Chem. Rev 2001, 101:269295. PubMed Abstract  Publisher Full Text

Protein ID: comparing de novo based and database search methods http://www.bioinformaticssolutions.com/functions db download.php?id=3558 webcite

Eng J, McCormack A, Yates J: An approach to correlate tandem mass spectral data of peptides with amoni acid sequences in a protein database.
J. Am. Soc. Mass Spectrom 1994, 5:976989. Publisher Full Text

Mann M, Wilm M: Errortolerant identification of peptides in sequence tags.
Anal. chem 1994, 66:43904399. PubMed Abstract  Publisher Full Text

Sadygov RG, Cociorva D, Yates JR: Largescale database searching using tandem mass spectra: looking up the answer in the back of the book.
Nature methods 2004, 1(3):195202. PubMed Abstract  Publisher Full Text

Bassil I, Dahiyat , Mayo SL: De novo protein design: fully automated sequence selection.
Science 1997, 278:8287. PubMed Abstract  Publisher Full Text

Dancik V, Addona TA, Clauser KR, et al.: De novo peptide sequencing via tandem mass spectrometry.
J. Comput. Biol 1999, 6:327342. PubMed Abstract  Publisher Full Text

Lu BW, Chen T: Algorithms for de novo peptide sequencing using tandem mass spectrometry.

Chen T, Kao MY, Tepel M, et al.: A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry.
J. Comput. Biol 2001, 8(3):325337. PubMed Abstract  Publisher Full Text

Ma B, Zhang K, Hendrie C, et al.: PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry.
Rapid Commun. Mass Spectrom 2003, 17:23371342. PubMed Abstract  Publisher Full Text

Lu BW, Chen T: A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry.
J. Comput. Biol 2003, 10:112. PubMed Abstract  Publisher Full Text

Frank A, Pevzner PA: PepNovo: de novo peptide sequencing via probabilistic network modeling.
Anal. Chem 2005, 77:964973. PubMed Abstract  Publisher Full Text

Yan B, Pan CL, Olman VN, Hettich RL, Xu Y: A graphtheoretic approach for the separation of b and y ions in tandem mass spectra.
Bioinformatics 2005, 21:563574. PubMed Abstract  Publisher Full Text

Taylor JA, Johnson RS: Sequence database searches via de novo peptide sequencing by tandem mass spectrometry.