3.4 Spatial and Compositional Three-dimensional Patterns in Proteins

Delaunay simplices obtained from the tessellation can be used to objectively define the nearest-neighbor residues in 3D protein structures. The most significant feature of Delaunay tessellation, as compared with other methods of nearest-neighbor identification, is that every simplex in three dimensions has exactly four vertices, so nearest neighbors are always identified in groups of four; this reflects a fundamental topological property of 3D space. Statistical analysis of the amino acid composition of Delaunay simplices provides information about the spatial propensities of all quadruplets of amino acid residues clustered together in folded protein structures. The compositional statistics can also be used to construct four-body empirical contact potentials, which may improve on traditional pairwise statistical potentials (e.g., Miyazawa and Jernigan, 2000) for protein structure analysis and prediction.

To perform the tessellation, protein residues should be represented by single points located, for example, at the positions of the $C_{\alpha}$ atoms or at the centers of the side chains. The tessellation training set includes high-quality representative protein structures with low primary-sequence identity (Wang and Dunbrack, 2003). The tessellated proteins are analyzed by computing various geometrical properties and compositional statistics of the Delaunay simplices.
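The single-point representation described above can be sketched as follows, assuming the structure is available as fixed-column PDB `ATOM` records and that the $C_{\alpha}$ choice is used. The function name `ca_points`, the helper `make_atom_line`, and the demo coordinates are illustrative assumptions, not part of the original analysis.

```python
from typing import List, Tuple

# (chain ID, residue number, residue name, (x, y, z))
CaPoint = Tuple[str, int, str, Tuple[float, float, float]]


def ca_points(pdb_lines: List[str]) -> List[CaPoint]:
    """Return one point per residue, taken from the C-alpha ATOM records."""
    points: List[CaPoint] = []
    seen = set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":      # atom name, columns 13-16
            continue
        if line[16] not in (" ", "A"):       # keep only the first alternate location
            continue
        chain = line[21]
        res_seq = int(line[22:26])
        key = (chain, res_seq, line[26])     # include the insertion code
        if key in seen:
            continue
        seen.add(key)
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        points.append((chain, res_seq, line[17:20], xyz))
    return points


def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Hypothetical helper: build a fixed-column ATOM record for the demo."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:3s} {chain}{res_seq:4d}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")


# Illustrative records only; the coordinates are placeholders.
demo = [
    make_atom_line(1, " N  ", "THR", "A", 1, 17.047, 14.099, 3.625),
    make_atom_line(2, " CA ", "THR", "A", 1, 16.967, 12.784, 4.338),
    make_atom_line(9, " CA ", "THR", "A", 2, 13.856, 11.469, 6.066),
]
points = ca_points(demo)
```

The resulting list of points is what a 3D Delaunay routine would take as input; using side-chain centers instead would only change which coordinates are collected per residue.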

An example of the Delaunay tessellation of a folded protein is illustrated in Fig. 3.3 for crambin (1crn). The tessellation of this 46-residue protein generates an aggregate of nonoverlapping, space-filling irregular tetrahedra (Delaunay simplices). Each Delaunay simplex uniquely defines four nearest-neighbor $C_{\alpha}$ atoms and thus four nearest-neighbor amino acid residues.

**Figure 3.3:** Delaunay tessellation of Crambin
$\includegraphics[width=55mm,clip]{text/4-3/fig3.eps}$

For the analysis of correlations between the structure and sequence of proteins, we introduced a classification of simplices based on the relative positions of vertex residues in the primary sequence (Singh et al., 1996). Two residues were defined as distant if they were separated by one or more residues in the protein primary sequence. Simplices were divided into five nonredundant classes: class $\{4\}$, where all four residues in the simplex are consecutive in the protein primary sequence; class $\{3,1\}$, where three residues are consecutive and the fourth is a distant one; class $\{2,2\}$, where two pairs of consecutive residues are separated in the sequence; class $\{2,1,1\}$, where two residues are consecutive and the other two are distant both from the first two and from each other; and class $\{1,1,1,1\}$, where all four residues are distant from each other (Fig. 3.4). All five classes usually occur in any given protein.

**Figure 3.4:** Five classes of Delaunay simplices
$\includegraphics[clip]{text/4-3/fig4.eps}$
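The five-class scheme can be expressed compactly by grouping the four sequence positions of a simplex into runs of consecutive residues. The sketch below is a minimal illustration; it assumes a single chain with gap-free integer numbering, and the function names are hypothetical.

```python
from collections import Counter


def simplex_class(seq_positions):
    """Return the class of a simplex, e.g. (3, 1), from four sequence positions."""
    pos = sorted(seq_positions)
    if len(pos) != 4 or len(set(pos)) != 4:
        raise ValueError("a simplex needs four distinct residue positions")
    runs = [1]                        # lengths of runs of consecutive residues
    for prev, cur in zip(pos, pos[1:]):
        if cur - prev == 1:
            runs[-1] += 1             # extends the current consecutive run
        else:
            runs.append(1)            # residues are distant: start a new run
    return tuple(sorted(runs, reverse=True))


def class_fractions(simplices):
    """Relative frequency of each simplex class in one tessellated protein."""
    counts = Counter(simplex_class(s) for s in simplices)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


examples = {
    (10, 11, 12, 13): simplex_class((10, 11, 12, 13)),   # class {4}
    (10, 11, 12, 40): simplex_class((10, 11, 12, 40)),   # class {3,1}
    (10, 11, 13, 14): simplex_class((10, 11, 13, 14)),   # class {2,2}
    (5, 6, 20, 30): simplex_class((5, 6, 20, 30)),       # class {2,1,1}
    (1, 5, 9, 20): simplex_class((1, 5, 9, 20)),         # class {1,1,1,1}
}
```

Sorting the run lengths makes the label independent of where in the sequence the consecutive stretch lies, so `(10, 12, 13, 14)` and `(10, 11, 12, 40)` both map to class $\{3,1\}$.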

The differences between classes of simplices can be evaluated using geometrical parameters of tetrahedra such as volume and tetrahedrality (Eq. 3.1). Distributions of volume and tetrahedrality for all five classes of simplices are shown in Fig. 3.5. The sharp, narrow peaks correspond to the simplices of classes $\{4\}$ and $\{2,2\}$, which tend to have well-defined distributions of volume and of distortion of tetrahedrality. These results suggest that tetrahedra of these two classes may occur in regular protein conformations such as $\alpha$-helices and may be indicative of a protein fold family.

We calculated the relative frequency of occurrence of tetrahedra of each class in each protein in a small dataset of one hundred proteins from different families and plotted the results in Fig. 3.6. The proteins were sorted in ascending order of the fraction of tetrahedra of class $\{4\}$. Notably, the content of class $\{3,1\}$ simplices decreases as the content of class $\{4\}$ simplices increases. According to common classifications of protein fold families (Orengo et al., 1997), at the top level of the hierarchy most proteins can be characterized as all-alpha, all-beta, or alpha/beta. The fold families for the proteins in the dataset are also shown in Fig. 3.6. These results suggest that proteins with a high content of tetrahedra of classes $\{4\}$ and $\{2,2\}$ (i.e., proteins in the right part of the plot in Fig. 3.6) belong to the all-alpha fold family. Similarly, proteins with a low content of tetrahedra of classes $\{4\}$ and $\{2,2\}$ but a high content of tetrahedra of classes $\{2,1,1\}$ and $\{3,1\}$ (i.e., proteins in the left part of the plot in Fig. 3.6) belong to the all-beta fold family. Finally, proteins in the middle of the plot belong to the alpha/beta fold family. Thus, the ratio of tetrahedra of different classes is indicative of the protein fold family.

**Figure 3.5:** Distribution of tetrahedrality and volume (in Å$^{3}$) of Delaunay simplices in proteins
$\includegraphics[width=100mm,clip]{text/4-3/fig5.eps}$

**Figure 3.6:** Classes of Delaunay simplices and protein fold families. Contents of simplices of class $\{4\}$ (*solid line*), class $\{3,1\}$ (*dashed line*), class $\{2,2\}$ (*dotted line*), class $\{2,1,1\}$ (*dash-dotted line*), and class $\{1,1,1,1\}$ (*dash-dot-dotted line*). The *upper part* of the figure displays the fold family assignment: all-alpha (*circles*), all-beta (*squares*), and alpha/beta (*triangles*)
$\includegraphics[width=117mm,clip]{text/4-3/fig6.eps}$
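The two geometrical parameters used to compare simplex classes can be computed directly from the four vertex coordinates. The sketch below assumes the commonly used definition of tetrahedrality, $T = \sum_{i>j} (l_i - l_j)^2 / (15\,\bar{l}^{\,2})$, where $l_i$ are the six edge lengths and $\bar{l}$ is their mean; the tetrahedrality equation itself lies outside this section, so this form is an assumption and should be checked against it. Function names are illustrative.

```python
import itertools
import math


def tetrahedron_volume(a, b, c, d):
    """Volume of a tetrahedron as |det[b-a, c-a, d-a]| / 6."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(det) / 6.0


def tetrahedrality(a, b, c, d):
    """Distortion from a regular tetrahedron; 0 for a regular one (assumed form)."""
    edges = [math.dist(p, q) for p, q in itertools.combinations((a, b, c, d), 2)]
    mean = sum(edges) / len(edges)
    spread = sum((li - lj) ** 2 for li, lj in itertools.combinations(edges, 2))
    return spread / (15.0 * mean ** 2)


# A regular tetrahedron (all edges equal) and a right-angled, distorted one.
regular = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
corner = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]

regular_volume = tetrahedron_volume(*regular)
regular_t = tetrahedrality(*regular)
corner_t = tetrahedrality(*corner)
```

With this definition the regular tetrahedron gives a tetrahedrality of zero, and larger values indicate stronger distortion, which is how narrow peaks for classes $\{4\}$ and $\{2,2\}$ would show up in a histogram.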

Identification of significant patterns in biomolecular objects depends on the ability to distinguish what is likely from what is unlikely to occur by chance (Karlin et al., 1991). Statistical analysis of the amino acid composition of the Delaunay simplices provides information about the spatial propensities of all quadruplets of amino acid residues to be clustered together in folded protein structures. We analyzed the results of the Delaunay tessellation of these proteins in terms of the statistical likelihood of occurrence of four nearest-neighbor amino acid residues for all observed quadruplet combinations of the $20$ natural amino acids. The log-likelihood factor, $q_{ijkl}$, for each quadruplet was calculated using the following equation:

$$q_{ijkl} = \log \frac{f_{ijkl}}{p_{ijkl}} \qquad (3.2)$$

where $i$, $j$, $k$, and $l$ are any of the $20$ natural amino acid residues, $f_{ijkl}$ is the observed normalized frequency of occurrence of a given quadruplet, and $p_{ijkl}$ is the frequency of occurrence expected by chance.

Theoretically, the maximum number of all possible quadruplets of $20$ natural amino acid residues is $C_{20}^{4} + 3C_{20}^{3} + 2C_{20}^{2} + C_{20}^{2} + C_{20}^{1} = 8855$. The first term accounts for simplices with four distinct residue types; the second, for three types in a 1-1-2 distribution; the third, for two types in a 1-3 distribution; the fourth, for two types in a 2-2 distribution; and the fifth, for four identical residues. The log-likelihood factor is plotted in Fig. 3.7 for all observed quadruplets of natural amino acids. Each quadruplet is thus characterized by a certain value of this factor, which describes the nonrandom bias for the four amino acid residues to be found in the same Delaunay simplex. This value can be interpreted as a four-body statistical potential energy function. The statistical potential can be used in a variety of structure prediction, protein modeling, and computational mutagenesis applications.
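The count of possible quadruplets can be checked by brute-force enumeration of all unordered quadruplets of residue types, grouped by how the types repeat. This is only a verification sketch; the one-letter alphabet and variable names are illustrative choices.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import comb

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # one-letter codes of the 20 natural types

# Group every unordered quadruplet by its multiplicity pattern,
# e.g. ('A', 'A', 'C', 'D') has pattern (1, 1, 2).
pattern_counts = Counter()
for quad in combinations_with_replacement(AMINO_ACIDS, 4):
    pattern = tuple(sorted(Counter(quad).values()))
    pattern_counts[pattern] += 1

total = sum(pattern_counts.values())

# The same five terms written with binomial coefficients.
by_formula = comb(20, 4) + 3 * comb(20, 3) + 2 * comb(20, 2) + comb(20, 2) + comb(20, 1)
```

The five pattern counts (4845, 3420, 380, 190, and 20) match the five terms of the sum, and the total of 8855 equals the number of multisets of size four drawn from 20 types, $C_{23}^{4}$.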

**Figure 3.7:** Log-likelihood ratio for the Delaunay simplices
$\includegraphics[width=105mm,clip]{text/4-3/fig7.eps}$
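A log-likelihood factor of this kind compares the observed frequency of a quadruplet with the frequency expected by chance. The sketch below makes several assumptions that are not stated in this section: the chance expectation is taken as a multinomial term $C\,a_i a_j a_k a_l$ with $C = 4!/\prod_t n_t!$ (where $n_t$ counts repeats of each residue type), the residue frequencies $a_i$ are estimated from the simplex vertices rather than from the sequences, and the logarithm is base 10. All function names are hypothetical.

```python
import math
from collections import Counter


def permutation_factor(quad):
    """Number of distinct orderings of a quadruplet: 4! / prod(n_t!)."""
    c = math.factorial(4)
    for n in Counter(quad).values():
        c //= math.factorial(n)
    return c


def log_likelihoods(simplices):
    """Map each observed quadruplet (sorted tuple of types) to log10(f / p).

    simplices: iterable of 4-character strings or 4-tuples of residue types.
    """
    quads = [tuple(sorted(s)) for s in simplices]
    quad_counts = Counter(quads)
    residue_counts = Counter(r for quad in quads for r in quad)
    n_quads = len(quads)
    n_residues = 4 * n_quads

    scores = {}
    for quad, count in quad_counts.items():
        f = count / n_quads                      # observed frequency
        p = permutation_factor(quad)             # assumed chance expectation
        for r in quad:
            p *= residue_counts[r] / n_residues
        scores[quad] = math.log10(f / p)
    return scores


# Toy input: two simplices with the same composition.
toy_scores = log_likelihoods(["ACDE", "EDCA"])
```

Positive scores mark quadruplets found together more often than expected by chance and negative scores mark those found less often; only observed quadruplets receive a score, since the ratio is undefined for a zero count.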

**Figure 3.8:** Potential profile of HIV-1 protease
$\includegraphics[width=108mm,clip]{text/4-3/fig8.eps}$

Computational mutagenesis is based on the analysis of a protein potential profile, which is constructed by summing the log-likelihood scores from Eq. (3.2) over all simplices in which a particular residue participates. A plot of the potential profile for a small protein, HIV-1 protease, is shown in Fig. 3.8. The shape of the potential profile frequently reflects important features of the protein. For example, residues at local maxima of the profile are usually located in the hydrophobic core of the protein and play an important role in maintaining protein stability.
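The construction of a potential profile can be sketched as a simple accumulation over simplices. The data layout here is an assumption made for illustration: simplices are 4-tuples of zero-based residue indices, the sequence is a string of one-letter codes, and `scores` maps a sorted tuple of residue types to its log-likelihood score. Quadruplets missing from the table are given a score of zero, which is a simplification rather than a documented choice.

```python
def potential_profile(simplices, sequence, scores):
    """Sum the quadruplet scores of all simplices each residue takes part in."""
    profile = [0.0] * len(sequence)
    for simplex in simplices:
        quad = tuple(sorted(sequence[i] for i in simplex))
        q = scores.get(quad, 0.0)      # assumption: unobserved quadruplets score 0
        for i in simplex:
            profile[i] += q            # every vertex residue receives the score
    return profile


# Toy example: five residues, two simplices sharing residues 1-3.
toy_sequence = "ACDEA"
toy_simplices = [(0, 1, 2, 3), (1, 2, 3, 4)]
toy_scores = {("A", "C", "D", "E"): 0.5}

toy_profile = potential_profile(toy_simplices, toy_sequence, toy_scores)
```

Residues 1 to 3 take part in both simplices and therefore accumulate twice the score of the terminal residues, which illustrates why well-packed residues with many favorable neighbors stand out as maxima of the profile.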

A potential profile can easily be calculated for both wild-type and mutant proteins, assuming that the structural differences between them are small and that their tessellation results are similar. In this case, the difference between the profiles is determined only by the change in composition of the simplices involving the substitution site. The resulting difference profile provides insight into the changes in protein energetics caused by the mutation.
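Under the stated assumption that the mutant shares the wild-type tessellation, a difference profile can be sketched by rescoring the same simplices with the substituted residue type. The names, the data layout (index tuples, one-letter sequence, score table keyed by sorted residue types), and the zero score for unlisted quadruplets are illustrative assumptions.

```python
def profile(simplices, sequence, scores):
    """Per-residue sum of quadruplet scores over the given simplices."""
    totals = [0.0] * len(sequence)
    for simplex in simplices:
        quad = tuple(sorted(sequence[i] for i in simplex))
        q = scores.get(quad, 0.0)      # assumption: unlisted quadruplets score 0
        for i in simplex:
            totals[i] += q
    return totals


def difference_profile(simplices, sequence, scores, site, new_type):
    """Mutant minus wild-type profile, reusing the wild-type tessellation."""
    mutant = sequence[:site] + new_type + sequence[site + 1:]
    wild = profile(simplices, sequence, scores)
    mut = profile(simplices, mutant, scores)
    return [m - w for w, m in zip(wild, mut)]


# Toy example: substitute residue 0 (A -> G) in a five-residue chain.
toy_simplices = [(0, 1, 2, 3), (1, 2, 3, 4)]
toy_scores = {("A", "C", "D", "E"): 0.5, ("C", "D", "E", "G"): -0.25}

toy_difference = difference_profile(toy_simplices, "ACDEA", toy_scores, 0, "G")
```

Only residues that share a simplex with the substitution site change; residue 4 in the toy example keeps a difference of zero because it never appears in a simplex with residue 0.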