6. Data sets with outliers

                In exploratory data analysis one might wish instead to discover patterns while making few assumptions about data structure, using techniques with properties that change only gradually across a wide range of noise distributions. Nonlinear data smoothers provide a practical method of finding general smooth patterns for sequenced data confounded with long-tailed noise.

P. Velleman (1980, p. 609)

Suppose that one observes data such as those in Figure 6.1: the main body of the data lies in a strip around zero, and a few observations, which govern the scaling of the scatter plot, lie apart from this region. These few data points are obviously outliers. This terminology does not mean that outliers are not part of the joint distribution of the data or that they contain no information for estimating the regression curve. It means rather that outliers constitute so small a fraction of the data that they should not be allowed to dominate the small-sample behavior of the statistics to be calculated. Any smoother (based on local averages) applied to data like those in Figure 6.1 will exhibit a tendency to ``follow the outlying observations.'' Methods for handling data sets with outliers are called robust or resistant.

From a data-analytic viewpoint, a nonrobust behavior of the smoother is sometimes undesirable. Suppose that, a posteriori, a parametric model for the response curve is to be postulated. Any erratic behavior of the nonparametric pilot estimate will then lead to a biased parametric formulation. Imagine, for example, a situation in which an outlier has not been identified and the nonparametric smoothing method has produced a slight peak in the neighborhood of that outlier. A parametric model which fitted that ``non-existing'' peak would be too high-dimensional.

Figure 6.1: A simulated data set with outliers. The joint probability density function of $\scriptstyle \{ (X_i,Y_i) \}^n_{i=1}$, $\scriptstyle n=100$, was $\scriptstyle f(x,y)=g(y-m(x)) I(x \in [0,1])$ with $\scriptstyle m(x)=\sin(\pi x)$ and the mixture density $\scriptstyle g(x)=(9/10)\varphi(x)+(1/10)(1/9)\varphi(x/9)$, where $\scriptstyle \varphi$ denotes the standard normal density. The data points coming from the long tail mixture part $\scriptstyle (1/9)\varphi(x/9)$ are indicated by squares. The regression line $\scriptstyle m(x)$ is shown as a solid line. From Härdle (1989).
\includegraphics[scale=0.15]{ANR6,1.ps}
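
To make the effect concrete, the following sketch (my own illustration, not part of the original text) simulates data according to the recipe in the caption of Figure 6.1 and applies an ordinary local-average (Nadaraya--Watson) smoother; the random seed, the bandwidth $h=0.1$ and the evaluation grid are illustrative choices, not taken from the source. Near a wild observation the local average is pulled towards it, producing exactly the kind of spurious peak discussed above.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 1.0, n)                 # X_i uniform on [0, 1]
m = np.sin(np.pi * x)                        # regression curve m(x) = sin(pi x)
contaminated = rng.uniform(size=n) < 0.1     # 1/10 long-tailed mixture component
scale = np.where(contaminated, 9.0, 1.0)     # contaminated errors have sd 9
y = m + scale * rng.standard_normal(n)       # Y_i = m(X_i) + noise drawn from g

def nadaraya_watson(x_grid, x, y, h):
    """Local-average (kernel) smoother with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_grid[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

grid = np.linspace(0.0, 1.0, 101)
mhat = nadaraya_watson(grid, x, y, h=0.1)
# Near a wild observation the local average is pulled towards it,
# producing a spurious peak in the estimated curve.
\end{verbatim}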

In this case, a robust estimator, insensitive to a single wild spike outlier, would be advisable. Carroll and Ruppert (1988, p. 175) aptly describe this as follows:

Robust estimators can handle both data and model inadequacies. They will downweight and, in some cases, completely reject grossly erroneous data. In many situations, a simple model will adequately fit all but a few unusual observations.
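
The downweighting mentioned in this quotation can be sketched as follows (a minimal illustration of my own, not the estimators studied by Carroll and Ruppert): an ordinary kernel smoother is made resistant by multiplying the kernel weights with Huber weights, so that observations with large standardized residuals lose influence. The tuning constant $c=1.345$, the MAD scale estimate and the fixed number of iterations are conventional but illustrative choices.

\begin{verbatim}
import numpy as np

def huber_weight(r, c=1.345):
    """Huber weight function: 1 inside [-c, c], c/|r| outside."""
    ar = np.abs(r)
    return np.where(ar <= c, 1.0, c / np.maximum(ar, 1e-12))

def local_m_smoother(x_grid, x, y, h, c=1.345, n_iter=20):
    """Kernel M-smoother: local averaging with Huber downweighting."""
    s = 1.4826 * np.median(np.abs(y - np.median(y)))    # global robust scale (MAD)
    fit = np.empty_like(x_grid, dtype=float)
    for j, x0 in enumerate(x_grid):
        k = np.exp(-0.5 * ((x0 - x) / h) ** 2)          # Gaussian kernel weights
        theta = np.sum(k * y) / np.sum(k)               # start from the local mean
        for _ in range(n_iter):
            w = k * huber_weight((y - theta) / s, c)    # shrink weight of gross errors
            theta = np.sum(w * y) / np.sum(w)           # reweighted local average
        fit[j] = theta
    return fit
\end{verbatim}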

In this chapter, several resistant smoothing techniques are presented. It is shown how basic ideas from robust estimation of location carry over to nonparametric resistant smoothing. The discussion also yields an asymptotically efficient rule for selecting the smoothing parameter.
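
As a foretaste of these techniques, here is a minimal sketch (my own, under the assumption that the observations have been sorted by their $X$-values) of the simplest way to carry a robust location estimate over to smoothing: replace the local average by a local median. The window size $k=5$ is an arbitrary illustrative choice.

\begin{verbatim}
import numpy as np

def running_median(y, k=5):
    """Resistant smoother: median over a window of k neighbours on each side.
    Assumes y is already ordered by the corresponding x values."""
    n = len(y)
    return np.array([np.median(y[max(0, i - k):i + k + 1]) for i in range(n)])
\end{verbatim}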