next up previous contents index
Next: 3.6 Conclusions Up: 3. Statistical and Computational Previous: 3.4 Spatial and Compositional

3.5 Protein Structure Comparison and Classification

Using the information from the Delaunay tessellation of a protein's backbone, it is possible to build a statistical representation of that protein, which takes into account the way its sequence must ''twist and turn'' in order to bring each four-body residue cluster into contact. Each residue - $ i, j, k$, and $ l$ of a four-body cluster comprising a simplex are nearest neighbors in Euclidean space as defined by the tessellation, but are separated by the three distances -  $ d_{ij}, d_{jk}$, and $ d_{kl}$ in sequence space. Based on this idea, we build a $ 1000$-tuple representation of a single protein by making use of two metrics: (1) the Euclidean metric used to define the Delaunay tessellation of the protein's C $ _{\alpha}$ atomic coordinates and (2) the distance between residues in sequence space.

If we consider a tessellated protein with N residues integrally enumerated according to their position along the primary sequence, the length of a simplex edge in sequence space can be defined as $ d_{ij}
=j-i-1$, where $ d_{ij}$ is the length of the simplex edge, $ \overline{ij} $, corresponding to the $ i$th and $ j$th $ \alpha $-carbons along the sequence. If one considers the graph formed by the union of the simplex edge between the two points $ i$ and $ j$ and the set of edges between all $ d_{ij}$ points along the sequence between $ i$ and $ j$, it is seen that the Euclidean simplex edge, $ \overline{ij} $, can generally be classified as a far edge (Pandit and Amritkar, 1999). Every simplex in the protein's tessellation will have three such edges associated with its vertices: $ i$, $ j$, $ k$, and $ l$ where $ i, j, k$, and $ l$ are integers corresponding to C $ _{\alpha}$ atoms enumerated according to their position along the primary sequence. Thus, we proceed to quantify the degree of ''farness'' in an intuitive way, by applying a transformation, $ T$, which maps the length, $ d$, of each edge to an integer value according to

$\displaystyle T:d \mapsto \left\{ \begin{array}{ll} 1&\;\text{if}\;d=0 \\ [-.4m...
...ext{if}\;50\le d\le 100 \\ [-.4mm] 10&\;\text{if}\;d\ge 101 \end{array} \right.$ (3.5)

The reasoning behind the design of the transformation is described by Bostick and Vaisman (2003). This transformation is used to construct an array that is representative of the distribution of combinations of segment lengths along the protein backbone forming four-residue clusters within the protein's structure as defined by the tessellation of its C$ _{\alpha}$ atomic coordinates. Each simplex in the protein's tessellation contributes to a 3D array, $ M$, where $ M_{npr}$ is the number of simplices whose edges satisfy the following conditions:
  1. The Euclidean length of any one simplex edge is not greater than $ 10\,{\text{\char0197}}$.
  2. $ d_{ij}= n$
  3. $ d_{jk}=p$
  4. $ d_{kl}=r$

Condition 1 is provided because simplices with a Euclidean edge length above $ {10}\,{\text{\char0197}}$ are generally a result of the positions of $ \alpha $-carbons on the exterior of the protein. We filter out contributions from these simplices to $ M$, because they do not represent physical interactions between the participating residues. The simplices with the long edges are formed due to the absence of solvent and other molecules around the protein in the tessellation, they would not have existed if the protein was solvated. The data structure, $ M$, contains $ 1000\,$elements. The number of elements is invariant with respect to the number of residues of the protein. In order to more easily conceptualize the mapping of the protein topology to the data structure, $ M$, we rewrite it as a $ 1000$-tuple vector $ \overrightarrow{M}=\left\{M_{000} ,M_{001} ,\ldots ,M_{010} ,M_{011}
,\ldots ,M_{999} \right\}$.

Given that each element of this vector represents a statistical contribution to the global topology, a comparison of two proteins making use of this mapping must involve the evaluation of the differences in single corresponding elements of the proteins' $ 1000$-tuples. We define, therefore, a raw topological score, $ Q$, representative of the topological distance between any two proteins represented by data structures, $ \overrightarrow{M}$ and $ \overrightarrow{{M}^{\prime}}$, as the supremum norm,

$\displaystyle Q\equiv \left\Vert {\overrightarrow {M}-{\overrightarrow{M}}^{\pr...
...iv \sum\limits_{i=0}^{999} \left\vert {M_{i} -M_{i} ^{\prime }} \right\vert \,.$ (3.6)

This norm is topologically equivalent to the Euclidean norm and has the added advantage that it is less computationally expensive to calculate.

Figure 3.9: Topological and geometric comparison within 6 protein families
\includegraphics[width=117mm,clip]{text/4-3/fig9.eps}

This topological score has an obvious dependence on the sequence length difference between the two proteins being compared due to the following implicit relation for a single protein representation,

$\displaystyle N_{s} =\sum\limits_{i=0}^{999} {M_{i} } \,,$ (3.7)

where $ i$ is the number of simplices with no edge having a Euclidian length greater than $ {10}\,{\text{\char0197}}$, and the $ M_{i} $ are the elements of the protein representation. In other words, since $ N_{s} $ is proportional to the number of residues in the protein, the difference in the length between two compared proteins might provide a systematic extraneous contribution to their score, $ Q$, in (3.6). This is not to say that the sequence length of a protein does not play a role in its topology. In fact, the length should be quite crucial (Bostick et al, 2004). However, the length dependence of our score implied by (3.7) is endemic to our protein representation (derived from its tessellation), and not due to protein topology itself. This length dependence may be removed by first normalizing the vector representation as follows:

$\displaystyle \underrightarrow{M} =\frac{\overrightarrow{M}}{\left\Vert {\overrightarrow{M}} \right\Vert}$ (3.8)

resulting in the unit-vector representation, $ \underrightarrow{M}$. The corresponding normalized topological score,

$\displaystyle \underline{Q}\equiv \left\Vert\underrightarrow{M}- \underrightarrow{M} \right\Vert _{\sup }$ (3.9)

can be expected to be less sensitive to the chain length difference between the two proteins being compared. Despite normalization, however, this score should still have an inherent dependence on the length difference between the compared proteins. A protein's structure must be dependent on the length of its sequence, because the number of configurational degrees of freedom in a polymer's structure is proportional to the number of residues it possesses. Such a dependence on the size of compared proteins is present in geometric methods of comparison such as structural alignment as well, and in some cases, has been accounted for (Carugo and Pongor, 2001).

The results of topological protein structure comparison can be illustrated using an example of proteins that belong to the same family. Six protein families were selected from the FSSP (Families of Structurally Similar Proteins) database for topological evaluation. We selected families that span various levels of secondary structural content. The representatives of these families are as follows: 1alv and 1avm (having greater than $ 50\,{\%}$ $ \alpha $-helical content), 2bbk and 2bpa (having greater than $ 50\,{\%}$ $ \beta$-sheet content), and 1hfc and 1plc (having at least $ 50\,{\%}$ content that is classified as neither $ \alpha $-helical nor $ \beta$-sheet). The FSSP database contains the results of the alignments of the extended family of each of these representative chains. Each family in the database consists of all structural neighbors excluding very close homologs (proteins having a sequence identity greater than $ 70\,{\%}$). The topological score was calculated for each representative in a one-against-all comparison with its neighbors. All of the scores are plotted against RMSD for each of the families in Fig. 3.9. A strong correlation between the topological score and structure similarity and the power-law trend can be seen for all families.


next up previous contents index
Next: 3.6 Conclusions Up: 3. Statistical and Computational Previous: 3.4 Spatial and Compositional