Using the information from the Delaunay tessellation of a protein's backbone, it is possible to build a statistical representation of that protein that takes into account the way its sequence must "twist and turn" in order to bring each four-body residue cluster into contact. The residues $i$, $j$, $k$, and $l$ of a four-body cluster comprising a simplex are nearest neighbors in Euclidean space as defined by the tessellation, but are separated by the three distances $d_{ij}$, $d_{jk}$, and $d_{kl}$ in sequence space. Based on this idea, we build a fixed-length tuple representation of a single protein by making use of two metrics: (1) the Euclidean metric used to define the Delaunay tessellation of the protein's $C_\alpha$ atomic coordinates and (2) the distance between residues in sequence space.
If we consider a tessellated protein with $N$ residues enumerated by integers according to their position along the primary sequence, the length of a simplex edge in sequence space can be defined as $d_{ij} = |i - j|$, where $d_{ij}$ is the length of the simplex edge, $ij$, corresponding to the $i$th and $j$th $\alpha$-carbons along the sequence. If one considers the graph formed by the union of the simplex edge between the two points $i$ and $j$ and the set of edges between all points along the sequence between $i$ and $j$, it is seen that the Euclidean simplex edge, $ij$, can generally be classified as a far edge (Pandit and Amritkar, 1999). Every simplex in the protein's tessellation will have three such edges associated with its vertices $i$, $j$, $k$, and $l$, where $i$, $j$, $k$, and $l$ are integers corresponding to $C_\alpha$ atoms enumerated according to their position along the primary sequence. Thus, we proceed to quantify the degree of "farness" in an intuitive way, by applying a transformation that maps the sequence-space length, $d$, of each edge to an integer value.
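The three sequence-space edge lengths of a simplex can be computed directly from its residue indices. The following is a minimal sketch, assuming the four vertices are sorted so that $i < j < k < l$ and that the three edges of interest join consecutive vertices in that order; the function name `sequence_edge_lengths` and the example indices are hypothetical.

```python
def sequence_edge_lengths(simplex):
    """Return (d_ij, d_jk, d_kl) for one simplex given as four residue indices.

    Assumption: the vertices are sorted along the primary sequence and the
    three edges join consecutive vertices in that order.
    """
    i, j, k, l = sorted(simplex)
    return (j - i, k - j, l - k)


# Hypothetical simplex joining residues 12, 13, 57 and 90.
print(sequence_edge_lengths((57, 12, 13, 90)))  # (1, 44, 33)
```

Residues 12 and 13 are adjacent along the chain, whereas the other two edges bring together residues that are far apart in sequence.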
Condition 1 is provided because simplices with a Euclidean edge length above the cutoff are generally a result of the positions of $\alpha$-carbons on the exterior of the protein. We filter out contributions from these simplices to the data structure, because they do not represent physical interactions between the participating residues. Simplices with such long edges form because solvent and other molecules around the protein are absent from the tessellation; they would not exist if the protein were solvated. The number of elements in the data structure is invariant with respect to the number of residues of the protein. To conceptualize the mapping of the protein topology to the data structure more easily, we rewrite it as a tuple vector.
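The construction described above can be sketched as a counting procedure. Several assumptions are made here: the bin boundaries in `ASSUMED_BIN_UPPER` are hypothetical stand-ins for the transformation; the cutoff is passed in by the caller rather than fixed; the filter rejects a simplex if any of its six Euclidean edges exceeds the cutoff; and each retained simplex adds one count to the element indexed by its three transformed edge lengths. All function and variable names are illustrative.

```python
from itertools import combinations
from math import dist

# Hypothetical upper bounds of the sequence-distance bins (illustration only);
# a final open-ended bin catches every larger separation.
ASSUMED_BIN_UPPER = (1, 3, 10)


def bin_index(d):
    """Map a sequence-space edge length to an integer bin (assumed binning)."""
    for idx, upper in enumerate(ASSUMED_BIN_UPPER):
        if d <= upper:
            return idx
    return len(ASSUMED_BIN_UPPER)


def build_representation(simplices, coords, cutoff):
    """Count simplices by their three binned sequence-space edge lengths.

    simplices: iterable of 4-tuples of residue indices
    coords:    mapping from residue index to (x, y, z)
    cutoff:    Euclidean edge-length limit; longer-edged simplices are skipped
    """
    n_bins = len(ASSUMED_BIN_UPPER) + 1
    counts = [0] * (n_bins ** 3)  # fixed length, independent of protein size
    for simplex in simplices:
        # Skip simplices having any Euclidean edge longer than the cutoff.
        if any(dist(coords[a], coords[b]) > cutoff
               for a, b in combinations(simplex, 2)):
            continue
        i, j, k, l = sorted(simplex)
        n, p, r = bin_index(j - i), bin_index(k - j), bin_index(l - k)
        counts[(n * n_bins + p) * n_bins + r] += 1  # flatten the 3D index
    return counts


# Toy input: residue 9 lies far from the others, so the second simplex is dropped.
coords = {0: (0, 0, 0), 1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1), 9: (50, 0, 0)}
simplices = [(0, 1, 2, 3), (1, 2, 3, 9)]
rep = build_representation(simplices, coords, cutoff=5.0)
print(len(rep), sum(rep))  # 64 1
```

The length of the returned list depends only on the number of bins, which is why the representation has the same size for every protein.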
Given that each element of this vector represents a statistical contribution to the global topology, a comparison of two proteins using this mapping must evaluate the differences between corresponding elements of the proteins' tuples. We therefore define a raw topological score, representative of the topological distance between any two proteins represented by such data structures, as the supremum norm of the difference between their tuples.
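The supremum norm named above can be sketched in a few lines. This is a minimal illustration of the standard definition, $\max_i |a_i - b_i|$, and not necessarily the exact expression used for the score; the function name `supremum_distance` and the short example lists are hypothetical.

```python
def supremum_distance(rep_a, rep_b):
    """Largest absolute difference between corresponding elements."""
    if len(rep_a) != len(rep_b):
        raise ValueError("representations must have the same length")
    return max(abs(a - b) for a, b in zip(rep_a, rep_b))


# Two made-up count vectors of equal length.
print(supremum_distance([3, 0, 5, 1], [1, 0, 9, 1]))  # 4
```

Because both representations have the same fixed length, the comparison needs no alignment of the two sequences.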
This topological score has an obvious dependence on the sequence length difference between the two proteins being compared due to the following implicit relation for a single protein representation,
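One relation of this kind follows from the way the representation is built, under the assumption that every simplex passing Condition 1 contributes exactly one count to exactly one element. Writing $M_{npr}$ for the element indexed by the three transformed edge lengths and $N_s$ for the number of retained simplices (both symbols are assumed names, not taken from the text), the elements sum to the simplex count:

```latex
\sum_{n,p,r} M_{npr} = N_s
```

Because the number of simplices in a tessellation tends to grow with the number of residues, two proteins of very different length would differ in their element totals, which is one way to read the length dependence noted above.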
The results of topological protein structure comparison can be illustrated using proteins that belong to the same family. Six protein families were selected from the FSSP (Families of Structurally Similar Proteins) database for topological evaluation. We selected families that span various levels of secondary structural content. The representatives of these families are as follows: 1alv and 1avm (having greater than $\alpha$-helical content), 2bbk and 2bpa (having greater than $\beta$-sheet content), and 1hfc and 1plc (having at least content that is classified as neither $\alpha$-helical nor $\beta$-sheet). The FSSP database contains the results of the alignments of the extended family of each of these representative chains. Each family in the database consists of all structural neighbors excluding very close homologs (proteins having a sequence identity greater than ). The topological score was calculated for each representative in a one-against-all comparison with its neighbors. The scores are plotted against RMSD for each of the families in Fig. 3.9. For all families, the topological score correlates strongly with structural similarity and follows a power-law trend.