Simplicial depth generalizes the notion of data depth as introduced in
Section 1.1. This general definition allows us to define a
multivariate median and to visually present high-dimensional data in
low dimensions.
For univariate data we have well-known parameters
of location which describe the center of the distribution of a random
variable $X$. These parameters
are, for example, the mean
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \tag{18.1}$$
the mode, and the median.
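The first two parameters can be easily extended to multivariate
random variables. The mean in higher dimensions is defined as in
(18.1), and the mode accordingly, as the point at which the
(estimated) multivariate density attains its maximum. The median has
no such direct extension, because it relies on ordering the
observations.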
An equivalent definition of the median in one dimension is given by
the simplicial depth. It is defined as follows: for each
pair of datapoints $x_k$ and $x_l$
we generate a closed
interval, a one-dimensional simplex, which has $x_k$
and $x_l$
as its endpoints. We then redefine the median as the datapoint
$x_{\mathrm{med}}$
that is enclosed in the maximum number of intervals:
$$x_{\mathrm{med}} = \arg\max_{x_i} \#\{(k, l) : x_i \in [x_k, x_l]\}.$$
With this definition, the median is
the ``deepest'' and ``most central'' point in a data set,
as discussed in Section 1.1. The definition involves a
computationally intensive operation, since we generate $n(n-1)/2$ intervals
for $n$
observations.
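
The interval-counting definition above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not an efficient algorithm; the function name `simplicial_depth_1d` and the sample data are assumptions made for this example. Intervals are closed, so a datapoint is counted in every interval of which it is an endpoint.

```python
from itertools import combinations


def simplicial_depth_1d(data):
    """For each observation, count the closed intervals [x_k, x_l], k < l,
    that contain it (hypothetical helper, brute force)."""
    depth = [0] * len(data)
    # One interval per unordered pair of observations: n(n-1)/2 in total.
    for k, l in combinations(range(len(data)), 2):
        lo, hi = min(data[k], data[l]), max(data[k], data[l])
        for i, x in enumerate(data):
            if lo <= x <= hi:
                depth[i] += 1
    return depth


# Illustrative data, not taken from the text.
data = [2.0, 7.0, 3.5, 10.0, 5.0]
depth = simplicial_depth_1d(data)
deepest = data[depth.index(max(depth))]
print(depth, deepest)
```

For these five values the deepest point coincides with the usual sample median, 5.0.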
In two dimensions, the computation is even more intensive, since
the interval is replaced by a triangle constructed
from three different datapoints.
The median, as the deepest point, is then the datapoint that is covered by the maximum
number of triangles. In three dimensions triangles become pyramids (tetrahedra) formed
from 4 points, and the median is the datapoint that lies in the maximum number of
pyramids.
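
The two-dimensional case can be sketched in the same brute-force way. The sketch below assumes that the points are in general position (no three on a line) and that a point on the boundary of a triangle, including a vertex, counts as covered; the helper names `cross`, `in_triangle` and `simplicial_depth_2d` and the sample points are hypothetical.

```python
from itertools import combinations


def cross(o, a, b):
    """Twice the signed area of the triangle (o, a, b)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(p, a, b, c):
    """True if p lies in the closed triangle abc (boundary included)."""
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    # p is inside if it is not strictly on opposite sides of two edges.
    return not (has_neg and has_pos)


def simplicial_depth_2d(points):
    """For each point, count the triangles of three datapoints covering it."""
    depth = [0] * len(points)
    for a, b, c in combinations(points, 3):
        for i, p in enumerate(points):
            if in_triangle(p, a, b, c):
                depth[i] += 1
    return depth


# Illustrative data: four corners of a square and one interior point.
points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 1)]
depth = simplicial_depth_2d(points)
print(depth)
```

With these points the interior point (2, 1) is covered by more triangles than any corner, so it is the deepest point of this small sample.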
An example of the depth in 2 dimensions is given by the constellation of
points in Figure 18.1. If we build, for example, the triangle
of the points 1, 3, 5 (denoted as 135 in Table 18.1),
it contains the point 4. From Table 18.1 we count the number
of coverages to obtain the simplicial depth values of Table 18.2.
In arbitrary dimension $d$, we look for datapoints that lie inside a
simplex (or convex hull) formed from $d+1$
points.
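We therefore extend the definition of the median to the
multivariate case as follows:
$$x_{\mathrm{med}} = \arg\max_{x_i} \#\{(k_0, \ldots, k_d) : x_i \in \mathrm{simplex}(x_{k_0}, \ldots, x_{k_d})\}.$$
Here $k_0, \ldots, k_d$ denote the indices of $d+1$
datapoints.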
Thus for each datapoint we have a multivariate data
depth. If we compute all the necessary simplices
$\mathrm{simplex}(x_{k_0}, \ldots, x_{k_d})$,
the computing time will unfortunately grow exponentially as
the dimension increases.
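
To get a feeling for this growth, one can count the simplices directly: $d+1$ out of $n$ datapoints can be chosen in $\binom{n}{d+1}$ ways. The following sketch only evaluates this count; the function name `n_simplices` and the choice of $n = 100$ are assumptions for illustration. A brute-force depth computation would additionally test every datapoint against every simplex.

```python
from math import comb


def n_simplices(n, d):
    """Number of simplices spanned by d + 1 out of n datapoints."""
    return comb(n, d + 1)


# Number of simplices for n = 100 observations in increasing dimension.
for d in (1, 2, 3, 5, 10):
    print(d, n_simplices(100, d))
```

For $d = 1$ the count reduces to the $n(n-1)/2$ intervals of the univariate case.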
In Figure 18.2 we calculate the simplicial depth for a two-dimensional, 10-point distribution. The deepest point, the two-dimensional median, is indicated by a big star in the center. The points with less depth are indicated by grey shades.