18.1 Simplicial Depth

Simplicial depth generalizes the notion of data depth as introduced in Section 1.1. This general definition allows us to define a multivariate median and to visually present high-dimensional data in low dimensions. For univariate data we have well-known parameters of location that describe the center of the distribution of a random variable $X$. These parameters are, for example, the mean

\begin{displaymath}
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i},
\end{displaymath} (18.1)

or the mode

\begin{displaymath}x_{mod} = \arg \max_{x} \hat{f}(x),\end{displaymath}

where $\hat{f}$ is the estimated density function of $X$ (see Section 1.3). The median

\begin{displaymath}x_{med} = \left\{\begin{array}{ll}
x_{((n+1)/2)} & \textrm{ if $n$ odd} \\
\frac{x_{(n/2)}+x_{(n/2+1)}}{2} & \textrm{ otherwise},
\end{array}\right. \end{displaymath}

where $x_{(i)}$ are the order statistics of the $n$ observations $x_i$, is yet another measure of location.
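As a small illustration of these univariate location parameters, the following Python sketch computes the mean and the median from the order statistics. The function name `sample_median` and the sample values are hypothetical and chosen only for illustration; the mode is omitted because it requires a density estimate $\hat{f}$.

```python
from statistics import mean


def sample_median(xs):
    """Median computed from the order statistics x_(1) <= ... <= x_(n)."""
    s = sorted(xs)                      # order statistics (0-based in Python)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]      # x_((n+1)/2) for odd n
    return (s[n // 2 - 1] + s[n // 2]) / 2   # (x_(n/2) + x_(n/2+1)) / 2


# Hypothetical samples, one with odd n and one with even n.
data_odd = [2.0, 7.0, 3.0, 10.0, 5.0]
data_even = [2.0, 7.0, 3.0, 10.0]

print("mean:", mean(data_odd))
print("median (n odd):", sample_median(data_odd))
print("median (n even):", sample_median(data_even))
```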

The first two parameters can be easily extended to multivariate random variables. The mean in higher dimensions is defined as in (18.1) and the mode accordingly,

\begin{displaymath}x_{mod} = \arg \max_{x} \hat{f}(x) \end{displaymath}

with $\hat{f}$ the estimated multidimensional density function of $X$ (see Section 1.3). The median poses a problem though since in a multivariate sense we cannot interpret the element-wise median
\begin{displaymath}
x_{med,j} = \left\{\begin{array}{ll}
x_{((n+1)/2),j} & \textrm{if $n$ odd} \\
\frac{x_{(n/2),j}+x_{(n/2+1),j}}{2} & \textrm{otherwise}
\end{array}\right.
\end{displaymath} (18.2)

as a point that is ``most central''. The same argument applies to other observations of a sample that have a certain ``depth'' as defined in Section 1.1. The ``fourths'' or the ``extremes'' are not defined in a straightforward way in higher dimensions, not even in two.
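A minimal sketch of why the element-wise median (18.2) need not be central, using a hypothetical three-dimensional sample of three points (the data are an assumption made for illustration, not taken from the text). Every observation lies in the plane $x_1+x_2+x_3=1$, yet the element-wise median is the origin, which lies outside that plane and therefore outside the convex hull of the data.

```python
import numpy as np

# Hypothetical sample: three observations in three dimensions.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Element-wise median: the univariate median of each coordinate.
med = np.median(X, axis=0)

print("row sums of the data:", X.sum(axis=1))   # each observation sums to 1
print("element-wise median:", med)
print("sum of its coordinates:", med.sum())     # differs from 1
```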

An equivalent definition of the median in one dimension is given by the simplicial depth. It is defined as follows: for each pair of datapoints $x_i$ and $x_j$ we generate a closed interval, a one-dimensional simplex, with $x_i$ and $x_j$ as endpoints. The median is then redefined as the datapoint $x_{med}$ that is enclosed in the maximum number of intervals:

\begin{displaymath}
x_{med} = \arg\max_i \# \{k,l; x_i \in [x_k, x_l]\}.
\end{displaymath} (18.3)

Under this definition, the median is the ``deepest'' and ``most central'' point in a data set as discussed in Section 1.1. The definition involves a computationally intensive operation since we generate $n (n-1)/2$ intervals for $n$ observations.
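The counting in (18.3) can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not an efficient implementation: the function name `depth_1d` and the sample values are hypothetical, and the data are assumed to contain no ties.

```python
from itertools import combinations


def depth_1d(xs):
    """For each x_i, count the closed intervals [x_k, x_l], k < l, containing it."""
    depths = []
    for x in xs:
        count = sum(
            1
            for a, b in combinations(xs, 2)      # all n(n-1)/2 pairs
            if min(a, b) <= x <= max(a, b)       # closed interval test
        )
        depths.append(count)
    return depths


# Hypothetical sample with n = 5 distinct observations.
data = [2.0, 7.0, 3.0, 10.0, 5.0]
depths = depth_1d(data)
deepest = data[max(range(len(data)), key=depths.__getitem__)]

print("depths:", depths)
print("deepest datapoint:", deepest)
```

For this sample the deepest datapoint coincides with the usual median, the middle order statistic. When $n$ is even, the two middle order statistics attain the same maximal count, so the maximizer is not unique.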

In two dimensions, the computation is even more intensive since the interval $[x_k,x_l]$ is replaced by a triangle constructed from three different datapoints. The median as the deepest point is then defined by that datapoint that is covered by the maximum number of triangles. In three dimensions triangles become pyramids formed from 4 points and the median is that datapoint that lies in the maximum number of pyramids.
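The two-dimensional case can be sketched in the same brute-force way. The code below is a minimal illustration, assuming that no three datapoints are collinear; the helper names `cross`, `in_triangle` and `depth_2d` and the five sample points are hypothetical. A point is taken to lie in a closed triangle if it is not strictly on different sides of two of the edges.

```python
from itertools import combinations


def cross(o, a, b):
    """Signed area (times 2) of the triangle o, a, b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(q, a, b, c):
    """True if q lies in the closed triangle a, b, c (assumed non-degenerate)."""
    d1, d2, d3 = cross(a, b, q), cross(b, c, q), cross(c, a, q)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def depth_2d(points):
    """Number of triangles of datapoints that cover each datapoint."""
    depths = [0] * len(points)
    for a, b, c in combinations(points, 3):
        for i, q in enumerate(points):
            if in_triangle(q, a, b, c):
                depths[i] += 1
    return depths


# Hypothetical configuration: four corners of a square and one interior point.
points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 1)]
depths = depth_2d(points)
deepest = points[max(range(len(points)), key=depths.__getitem__)]

print("depths:", depths)
print("deepest datapoint:", deepest)
```

Each datapoint is covered at least by the triangles of which it is a vertex, so with $n$ points every depth is at least $(n-1)(n-2)/2$.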

Figure 18.1: Construction of simplicial depth. MVAsimdep1.xpl
\includegraphics[width=1\defpicwidth]{mvasimdep1.ps}

An example of the depth in two dimensions is given by the constellation of points in Figure 18.1. If we build, for example, the triangle of the points 1, 3, 5 (denoted as $\triangle$ 135 in Table 18.1), it contains the point 4. From Table 18.1 we count the number of coverages to obtain the simplicial depth values of Table 18.2.


Table 18.1: Coverages for artificial configuration of points.
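\begin{tabular}{rlccccccc}
\hline
 & Triangle & & \multicolumn{6}{c}{Coverages} \\
\hline
1 & $\triangle$ 123 & $\ni$ & 1 & 2 & 3 & & & \\
2 & $\triangle$ 124 & $\ni$ & 1 & 2 & & 4 & & \\
3 & $\triangle$ 125 & $\ni$ & 1 & 2 & & & 5 & \\
4 & $\triangle$ 126 & $\ni$ & 1 & 2 & 3 & 4 & & 6 \\
5 & $\triangle$ 134 & $\ni$ & 1 & & 3 & 4 & & \\
6 & $\triangle$ 135 & $\ni$ & 1 & & 3 & 4 & 5 & \\
7 & $\triangle$ 136 & $\ni$ & 1 & & 3 & & & 6 \\
8 & $\triangle$ 145 & $\ni$ & 1 & & & 4 & 5 & \\
9 & $\triangle$ 146 & $\ni$ & 1 & & 3 & 4 & & 6 \\
10 & $\triangle$ 156 & $\ni$ & 1 & & 3 & 4 & 5 & 6 \\
11 & $\triangle$ 234 & $\ni$ & & 2 & 3 & 4 & & \\
12 & $\triangle$ 235 & $\ni$ & & 2 & 3 & 4 & 5 & \\
13 & $\triangle$ 236 & $\ni$ & & 2 & 3 & 4 & & 6 \\
14 & $\triangle$ 245 & $\ni$ & & 2 & & 4 & 5 & \\
15 & $\triangle$ 246 & $\ni$ & & 2 & & 4 & & 6 \\
16 & $\triangle$ 256 & $\ni$ & & 2 & & & 5 & 6 \\
17 & $\triangle$ 345 & $\ni$ & & & 3 & 4 & 5 & \\
18 & $\triangle$ 346 & $\ni$ & & & 3 & 4 & & 6 \\
19 & $\triangle$ 356 & $\ni$ & & & 3 & & 5 & 6 \\
20 & $\triangle$ 456 & $\ni$ & & & & 4 & 5 & 6 \\
\hline
\end{tabular}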



Table 18.2: Simplicial depths for artificial configuration of points.
\begin{tabular}{lcccccc}
\hline
point & 1 & 2 & 3 & 4 & 5 & 6 \\
depth & 10 & 10 & 13 & 15 & 10 & 10 \\
\hline
\end{tabular}
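The counting step can be sketched in Python by tallying how often each point appears among the coverages. The dictionary below is transcribed row by row from Table 18.1; the variable names are hypothetical. The tally confirms that point 4 is covered most often and is therefore the deepest point of this configuration.

```python
from collections import Counter

# Coverage lists transcribed from Table 18.1:
# triangle (vertex labels) -> points lying in that closed triangle.
coverages = {
    (1, 2, 3): [1, 2, 3],
    (1, 2, 4): [1, 2, 4],
    (1, 2, 5): [1, 2, 5],
    (1, 2, 6): [1, 2, 3, 4, 6],
    (1, 3, 4): [1, 3, 4],
    (1, 3, 5): [1, 3, 4, 5],
    (1, 3, 6): [1, 3, 6],
    (1, 4, 5): [1, 4, 5],
    (1, 4, 6): [1, 3, 4, 6],
    (1, 5, 6): [1, 3, 4, 5, 6],
    (2, 3, 4): [2, 3, 4],
    (2, 3, 5): [2, 3, 4, 5],
    (2, 3, 6): [2, 3, 4, 6],
    (2, 4, 5): [2, 4, 5],
    (2, 4, 6): [2, 4, 6],
    (2, 5, 6): [2, 5, 6],
    (3, 4, 5): [3, 4, 5],
    (3, 4, 6): [3, 4, 6],
    (3, 5, 6): [3, 5, 6],
    (4, 5, 6): [4, 5, 6],
}

# Simplicial depth of a point = number of triangles that cover it.
depth = Counter(p for covered in coverages.values() for p in covered)
deepest = max(depth, key=depth.get)

for point in sorted(depth):
    print("point", point, "depth", depth[point])
print("deepest point:", deepest)
```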


In arbitrary dimension $p$, we look for datapoints that lie inside a simplex (the convex hull) formed from $p+1$ points. We therefore extend the definition of the median to the multivariate case as follows:

\begin{displaymath}
x_{med} = \arg\max_i \# \{k_0,\dots,k_p; x_i \in hull(x_{k_0},\dots,x_{k_p}) \}.
\end{displaymath} (18.4)

Here $k_0,\dots,k_p$ denote the indices of $p+1$ datapoints. Thus for each datapoint we have a multivariate data depth. If we compute all the necessary simplices $hull(x_{k_0}, \ldots,x_{k_p})$, of which there are $\binom{n}{p+1}$, the computing time will unfortunately grow exponentially as the dimension increases.
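A brute-force sketch of (18.4) in Python with NumPy is given below. It is an illustration under stated assumptions rather than a definitive implementation: the datapoints are assumed to be in general position (no $p+1$ of them lie in a common hyperplane, otherwise the linear system is singular), the function names `in_simplex` and `simplicial_depth`, the tolerance `tol` and the sample points are hypothetical. Membership in a simplex is checked through barycentric coordinates: a point lies in the closed simplex exactly when it can be written as a combination of the vertices with nonnegative weights that sum to one.

```python
from itertools import combinations
from math import comb

import numpy as np


def in_simplex(q, vertices, tol=1e-12):
    """True if q lies in the closed simplex spanned by the p+1 rows of vertices."""
    V = np.asarray(vertices, dtype=float)            # shape (p+1, p)
    # Solve sum_j lam_j * v_j = q together with sum_j lam_j = 1.
    A = np.vstack([V.T, np.ones(len(V))])            # shape (p+1, p+1)
    b = np.append(np.asarray(q, dtype=float), 1.0)
    lam = np.linalg.solve(A, b)                      # barycentric coordinates
    return bool(np.all(lam >= -tol))


def simplicial_depth(X):
    """Count, for every row of X, the simplices of p+1 datapoints covering it."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    depths = np.zeros(n, dtype=int)
    for idx in combinations(range(n), p + 1):        # all C(n, p+1) simplices
        simplex = X[list(idx)]
        for i in range(n):
            if in_simplex(X[i], simplex):
                depths[i] += 1
    return depths


# Hypothetical 3-dimensional sample: a tetrahedron and one interior point.
X = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0.2, 0.2, 0.2]]
depths = simplicial_depth(X)
print("depths:", depths)
print("index of deepest point:", int(np.argmax(depths)))

# Number of simplices to enumerate for n = 50 datapoints in dimension p.
for p in (1, 2, 3, 5):
    print("p =", p, "simplices:", comb(50, p + 1))
```

The last loop only illustrates how quickly the number of simplices grows with $p$ for a fixed sample size; the nested loops above make the cost of this naive approach proportional to $n$ times that number.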

In Figure 18.2 we calculate the simplicial depth for a two-dimensional, 10-point distribution. The deepest point, the two-dimensional median, is indicated as a big star in the center. Points with less depth are shown in shades of grey.

Figure 18.2: $10$-point distribution with the median shown as a big star in the center. MVAsimdepex.xpl
\includegraphics[width=1\defpicwidth]{simdepex.ps}

Summary
\begin{itemize}
\item[$\ast$] The ``depth'' of a datapoint in one dimension can be computed by counting all (closed) intervals of two datapoints which contain the datapoint.
\item[$\ast$] The ``deepest'' datapoint is the central point of the distribution, the median.
\item[$\ast$] The ``depth'' of a datapoint in arbitrary dimension $p$ is defined as the number of simplices (constructed from $p+1$ points) covering this point. It is called simplicial depth.
\item[$\ast$] A multivariate extension of the median is to take the ``deepest'' datapoint of the distribution.
\item[$\ast$] In the bivariate case we count all triangles of datapoints which contain the datapoint to compute its depth.
\end{itemize}