Next: References Up: 5. Smoothing: Local Regression Previous: 5.4 Statistics for Linear

5.5 Multivariate Smoothers

When there are multiple predictor variables, the smoothing problem becomes multivariate: $\mu(x)$ is now a surface. The definition of kernel and local regression smoothers can be extended to estimate a regression surface with any number of predictor variables, although the methods become less useful for more than 2 or 3 variables. There are several reasons for this:

Data sparsity - the curse of dimensionality.
Visualization issues - how does one view and interpret a high dimensional smooth regression surface?
Computation is often much more expensive in high dimensions.

For these reasons, use of local polynomials and other smoothers to model high dimensional surfaces is rarely recommended, and the presentation here is restricted to the two-dimensional case. In higher dimensions, smoothers can be used in conjunction with dimension reduction procedures (Chap. III.6), which attempt to model the high-dimensional surface through low-dimensional components. Examples of this type of procedure include Projection Pursuit ([9]), Additive Models ([14]), Semiparametric Models ([26] and Chap. III.10) and recursive partitioning (Chap. III.14).

5.5.1 Two Predictor Variables

Suppose the dataset consists of vectors , where and are considered predictor variables, and is the response. For simplicity, we'll use $x_i = \begin{pmatrix}u_i & v_i\end{pmatrix}^{\top}$ to denote a vector of the predictor variables. The data are modeled as

$\displaystyle Y_i = \mu(u_i,v_i) + \epsilon_i = \mu(x_i) + \epsilon_i{}.$

Bivariate smoothers attempt to estimate the surface $\mu(u_i,v_i)$ . Kernel and local regression methods can be extended to the bivariate case, simply by defining smoothing weights on a plane rather than on a line. Formally, a bivariate local regression estimate at a point $x=(u,v)^{\top}$ can be constructed as follows:

Define a distance measure $\rho(x,x_i)$ between the data points and fitting point. A common choice is Euclidean distance,

$\displaystyle \rho(x,x_i) = \sqrt{ (u_i-u)^2 + (v_i-v)^2 }{}.$
Define the smoothing weights using a kernel function and bandwidth:

$\displaystyle w_i(x) = W \left( \frac{ \rho(x,x_i) }{ h}\right){}.$
Define a local polynomial approximation, such as a local linear approximation

$\displaystyle \mu(u_i,v_i) \approx a_0 + a_1 (u_i-u) + a_2 (v_i-v)$

when is close to . More generally, a local polynomial approximation can be written

$\displaystyle \mu(x_i) \approx \langle a, A \left(x_i-x\right)\rangle{},$

where is a vector of coefficients, and $A(\cdot)$ is a vector of basis polynomials.
Estimate the coefficient vector by local least squares. That is, choose $\hat{a}$ to minimize

$\displaystyle \sum_{i=1}^n w_i(x) \left( Y_i - \langle {a}, A \left({x_i-x}\right)\rangle \right)^2{}.$
The local polynomial estimate is then

$\displaystyle \hat{\mu}(x) = \hat{a}_0{}.$

5.5.2 Likelihood Smoothing

A likelihood smoother replaces the model (5.1) with a distributional assumption

$\displaystyle Y_i \sim f(y,\mu_i){},$

where $f(y,\mu)$ is a specified family of densities, parameterized so that $E(Y_i) = \mu_i$ . The family may be chosen depending on the response variable. If

is a count, then the Poisson family is a natural choice:

$\displaystyle f(y,\mu) = \frac{ \mu^y \mathrm{e}^{-\mu } }{ y! }{}; \quad y = 0,1,2,\ldots{}.$

is a

(or no/yes) response, then the Bernoulli family is appropriate:

$\displaystyle f(y,\mu) = \mu^y (1-\mu)^{1-y}{};\quad y = 0, 1{}.$

Given the data, the log-likelihood is

$\displaystyle \mathcal{L}(\mu_1,\ldots,\mu_n) = \sum_{i=1}^n \log\, f(Y_i,\mu_i){}.$

The goal is to estimate the mean function, $\mu_i = \mu(x_i)$ for an observed set of covariates . A generalized linear model (Chap. III.7) uses a parametric model for the mean function. Likelihood smoothers assume only that the mean is a smooth function of the covariates.

The earliest work on likelihood smoothing is [16], who used a penalized binomial likelihood to estimate mortality rates. The local likelihood method described below can be viewed as an extension of local polynomial regression, and was introduced by [30].

5.5.2.1 Local Likelihood Estimation.

Local likelihood estimation is based on a locally weighted version of the log-likelihood:

$\displaystyle \mathcal{L}_x(\mu_1,\ldots,\mu_n) = \sum_{i=1}^n w_i(x) \log\, f(Y_i,\mu_i){}.$

A local polynomial approximation is then used for a transformation of the mean function. For example, a local quadratic approximation is

$\displaystyle \theta(x_i)$	$\displaystyle = g(\mu(x_i))$
	$\displaystyle \approx a_0 + a_1 (x_i-x) + \frac{a_2}{2} (x_i-x)^2{}.$

The function $g(\mu)$ is the link function. Its primary goal is to remove constraints on the mean by mapping the parameter space to $(-\infty,\infty)$ . For example, in the Poisson case, the parameter space is $0 < \mu < \infty$ . If the log transformation $\theta = \log(\mu)$ is used, then the parameter space becomes $-\infty < \theta < \infty$ .

Let $l(y,\theta) = \log f(y,\mu)$ where $\theta = g(\mu)$ , so that the locally weighted log-likelihood becomes

$\displaystyle \mathcal{L}_x = \sum_{i=1}^n w_i(x) l\left(Y_i,\theta(x_i)\right){}.$

The maximizer satisfies the likelihood equations,

$\displaystyle \sum_{i=1}^n w_i(x) \begin{pmatrix}1 \\ x_i-x \\ {} \frac{1}{2}(x_i-x)^2 \end{pmatrix} \dot{l}(Y_i,\theta(x_i)) = 0{},$

(5.20)

where

$\displaystyle \dot{l} = \frac{\partial}{\partial\theta} l(y,\theta){}.$

In matrix notation, this system of equations can be written in a form similar to (5.7):

$\displaystyle \boldsymbol{X}^{\top}\boldsymbol{W} \dot{l}(Y,\boldsymbol{X}a) = 0{}.$

(5.21)

This system of equations is solved to find parameter estimates $\hat{a}_0,\hat{a}_1$ and $\hat{a}_2$ . The local likelihood estimate is defined as

$\displaystyle \hat{\mu}(x) = g^{-1}(\hat{a}_0){}.$

5.5.2.2 Solving the Local Likelihood Equations.

The local likelihood equations (5.20) are usually non-linear, and so the solution must be obtained through iterative methods. The Newton-Raphson updating formula is

$\displaystyle \hat{a}^{(\kern.3pt j+1)} = \hat{a}^{(\kern.3pt j)} + \left(\bold... ...}\boldsymbol{W} \dot{l}\left(Y,\boldsymbol{X} \hat{a}^{(\kern.3pt j)}\right){},$

(5.22)

where $\boldsymbol{V}$ is a diagonal matrix with entries

$\displaystyle - \frac{\partial^2}{\partial\theta^2} l(y,\theta){}.$

For many common likelihoods $l(Y,\theta)$ is concave. Under mild conditions on the design points, this implies that the local likelihood is also concave, and has a unique global maximizer. If the Newton-Raphson algorithm converges, it must converge to this global maximizer.

The Newton-Raphson algorithm (5.22) cannot be guaranteed to converge from arbitrary starting values. But for concave likelihoods, $\hat{a}^{(\kern.3pt j+1)} - \hat{a}^{(\kern.3pt j)}$ is guaranteed to be an ascent direction, and convergence can be ensured by controlling the step size.

5.5.2.3 Statistics for the Local Likelihood Estimate.

Since the local likelihood estimate does not have an explicit representation, statistical properties cannot be derived as easily as in the local regression case. But a Taylor series expansion of the local likelihood gives an approximate linearization of the estimate, leading to theory parallel to that developed in Sects. 5.3 and 5.4 for local regression. See Chap. 4 of [21].

5.5.3 Extensions of Local Likelihood

The local likelihood method has been formulated for regression models. But variants of the method have been derived for numerous other settings, including robust regression, survival models, censored data, proportional hazards models, and density estimation. References include [30], [17], Loader ([19], [21]).

5.5.3.1 Robust Smoothing.

Robust smoothing combines the ideas of robust estimation (Chap. III.9) with smoothing. One method is local M-estimation: choose $\hat{a}$ to minimize

$\displaystyle \sum_{i=1}^n w_i(x) \rho \left( Y_i - \langle {a}, A\left( {x_i-x}\right)\rangle\right){},$

and estimate $\hat{\mu}(x) = \hat{a}_0$ . If $\rho(u) = u^2$ , this corresponds to local least squares estimation. If $\rho(u)$ is a symmetric function that increases more slowly than

, then the resulting estimate is more robust to outliers in the data. One popular choice of $\rho(u)$ is the Huber function:

$\displaystyle \rho(u) = \begin{cases}u^2 & \vert u\vert \le c \\ c(2\vert u\vert-c) & \vert u\vert > c \end{cases}{}.$

References include [11] and [21]. Another variant of M-estimation for local regression is the iterative procedure of [4].

5.5.3.2 Density Estimation.

Suppose $X_1, \ldots, X_n$ are an independent sample from a density . The goal is to estimate . The local likelihood for this problem is

$\displaystyle \mathcal{L}_x(a) = \sum_{i=1}^n w_i(x) \langle {a} , A \left({x_i... ...}\right) \mathrm{e}^{\langle {a}, A \left({u-x}\right)\rangle} {\mathrm{d}}u{}.$

Letting $\hat{a}$ be the maximizer of the local log-likelihood, the local likelihood estimate is $\hat{f}(x) = \exp( \hat{a}_0)$ . See [17] and [19].

The density estimation problem is discussed in detail, together with graphical techniques for visualizing densities, in Chap. III.4.

Acknowledgements. This work was supported by National Science Foundation Grant DMS 0306202.

Next: References Up: 5. Smoothing: Local Regression Previous: 5.4 Statistics for Linear