3.4 $k$-nearest neighbor estimates

The construction of nearest neighbor estimates differs from that of kernel estimates. The kernel estimate $\hat m_h(x)$ was defined as a weighted average of the response variables in a fixed neighborhood around $x$, whose shape is determined by the kernel $K$ and the bandwidth $h$. The $k$-nearest neighbor ($k$-$NN$) estimate is a weighted average in a varying neighborhood. This neighborhood is defined through those $X$-variables which are among the $k$ nearest neighbors of $x$ in Euclidean distance. The $k$-$NN$ weight sequence was introduced by Loftsgaarden and Quesenberry (1965) in the related field of density estimation and was used by Cover and Hart (1967) for classification purposes. In the present regression setting the $k$-$NN$ smoother is defined as

\begin{displaymath}
\hat m_k(x)=n^{-1} \sum^n_{i=1} W_{ki}(x)Y_i,
\end{displaymath} (3.4.18)

where $\{ W_{ki}(x)\}_{i=1}^n$ is a weight sequence defined through the set of indexes

\begin{displaymath}
J_x = \{ i: X_i\ \hbox{is one of the}\ k\ \hbox{nearest observations to}\ x \}.
\end{displaymath}

With this set of indexes of neighboring observations the $k$-$NN$ weight sequence is constructed:
\begin{displaymath}
W_{ki}(x)= \left\{ \begin{array}{l@{\quad}l}
n/k, & \hbox{if}\ i \in J_x;\\
0, & \hbox{otherwise}. \end{array} \right.
\end{displaymath}

To give some insight into the construction of weights consider the following example. Let $\{ (X_i,Y_i) \}_{i=1}^5$ be $\{ (1,5), (7,12), (3,1), (2,0), (5,4) \} $ and let us compute the $k$-$NN$ estimate $\hat m_k(x)$ for $x=4$ and $k=3$. The $k$ observations closest to $x$ are the last three data points, therefore $J_x=J_4=\{ 3,4,5 \}$ and thus

\begin{displaymath}W_{k1}(4)=0,\ W_{k2}(4)=0,\ W_{k3}(4)=5/3,\ W_{k4}(4)= 5/3,\ W_{k5}(4)=5/3,\end{displaymath}

which results in $\hat m_3(4)= 5^{-1}\,(5/3)\,(1+0+4)=5/3$.
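The following short Python sketch reproduces this toy calculation; the data, $x=4$ and $k=3$ are taken from the example above, while the function name knn_smoother is introduced here only for illustration.

\begin{verbatim}
import numpy as np

def knn_smoother(x, X, Y, k):
    """Uniform k-NN estimate: the average of Y over the k nearest X-values."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    J_x = np.argsort(np.abs(X - x))[:k]   # indices of the k observations closest to x
    return Y[J_x].mean()

X = [1, 7, 3, 2, 5]
Y = [5, 12, 1, 0, 4]
print(knn_smoother(4, X, Y, k=3))         # 1.666... = 5/3
\end{verbatim}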

In an experiment in which the $X$-variable is chosen from an equidistant grid (say, on the unit interval) the $k$-$NN$ weights are equivalent to kernel weights. Let $k=2nh$ and compare $\{ W_{ki}(x) \}$ with $\{ W_{hi}(x)\} $ for a uniform kernel $K(u)= {1 \over 2} I(\vert u \vert\le 1)$ at an $x$ not too close to the boundary. Indeed, for $i\in J_x$:

\begin{displaymath}W_{ki}(x) = {n \over (2nh)} = {1 \over 2 }h^{-1}=W_{hi}(x).\end{displaymath}

The smoothing parameter $k$ regulates the degree of smoothness of the estimated curve. It plays a role similar to the bandwidth for kernel smoothers. The influence of varying $k$ on qualitative features of the estimated curve is similar to that observed for kernel estimation with a uniform kernel.

Consider, for fixed $n$, the case that $k$ becomes larger than $n$: the $k$-$NN$ smoother is then equal to the average of all response variables. The other limiting case is $k=1$, in which the observations are reproduced at the $X_i$, and for an $x$ between two adjacent predictor variables a step function is obtained with a jump halfway between the two observations. Again a smoothing parameter selection problem arises: $k$ has to be chosen as a function of $n$ or even of the data. A first goal is to reduce the noise by letting $k=k_n$ tend to infinity as a function of the sample size. The second goal is to keep the approximation error (bias) low, which is achieved if the neighborhood around $x$ shrinks asymptotically to zero. This can be done by choosing $k=k_n$ such that $k_n/n \to 0.$ Unfortunately, this condition conflicts with the first goal: in order to keep the variance as small as possible one would like to choose $k$ as large as possible.

So again we face a trade-off problem between a ``good approximation" to the regression function and a ``good reduction" of observational noise. This trade-off problem can be expressed formally by an expansion of the mean squared error of the $k$-$NN$ estimate.

Proposition 3.4.1   (Lai; 1977) Let $k\rightarrow\infty ,\ k/n\rightarrow 0,\ n\rightarrow\infty$. Bias and variance of the $k$-$NN$ estimate $\hat m_k$ with weights as in (3.4.19) are given by

\begin{eqnarray*}
E\hat m_k(x)-m(x)& \approx & {1\over 24 f(x)^3}\ [(m''f + 2 m'f')(x)]
\left({k \over n}\right)^2,\\
var\{ \hat m_k(x)\} & \approx & {\sigma^2(x) \over k}.
\end{eqnarray*}



The trade-off between bias$^2$ and variance is thus achieved in an asymptotic sense by setting $k \sim n^{4/5}$. A consequence is that the mean squared error itself converges to zero at the rate $k^{-1}\sim n^{-(4/5)}.$
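To see where this rate comes from, abbreviate the bias constant of Proposition 3.4.1 by $C_B=(m''f + 2 m'f')(x)/\{24 f(x)^3\}$ (a shorthand used only for this sketch). The leading terms of the mean squared error are then

\begin{displaymath}
C_B^2 \left({k\over n}\right)^4 + {\sigma^2(x)\over k},
\end{displaymath}

and setting the derivative with respect to $k$ equal to zero gives

\begin{displaymath}
4\, C_B^2\, {k^3\over n^4} = {\sigma^2(x)\over k^2},
\qquad \hbox{that is,} \qquad
k^5 = {\sigma^2(x)\over 4\, C_B^2}\, n^4,
\end{displaymath}

so that $k_{opt}\sim n^{4/5}$ and the mean squared error at $k_{opt}$ is of order $n^{-4/5}$ (compare Exercise 3.4.2).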

Other weight sequences $W_{ki}(x)$ have been proposed by Stone (1977). In addition to the ``uniform" weights (3.4.18) he defined ``triangular and quadratic $k$-$NN$ weights." Generally speaking, the weights can be thought of as being generated by a kernel function $K$,

\begin{displaymath}
W_{Ri}(x) = {K_R (x-X_i)\over \hat f_R(x)},
\end{displaymath} (3.4.19)

where

\begin{displaymath}\hat f_R(x) = n^{-1}\sum_{i=1}^n K_R (x-X_i)\end{displaymath}

is a kernel density estimate of $f(x)$ with kernel sequence

\begin{displaymath}K_R(u)=R^{-1} K(u/R) \end{displaymath}

and $R=R_n$ is the distance between $x$ and its $k$-th nearest neighbor. In the example given above with $x=4$ and $k=3$, the distance $R$ would be equal to 2 since the observation (2,0) is the furthest away among the three neighbors of $x=4$.
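As an illustration, the following Python sketch evaluates the weights $W_{Ri}(x)$ for the toy data above, using the Epanechnikov kernel; the kernel choice and the function names are assumptions made only for this example. Note that the $k$-th nearest neighbor itself receives weight zero here, since the Epanechnikov kernel vanishes at the boundary of its support.

\begin{verbatim}
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def knn_kernel_weights(x, X, k, kernel=epanechnikov):
    """Weights W_Ri(x) = K_R(x - X_i) / f_R_hat(x), with R the distance
    between x and its k-th nearest neighbor."""
    X = np.asarray(X, float)
    d = np.abs(X - x)
    R = np.sort(d)[k - 1]             # distance to the k-th nearest neighbor
    K_R = kernel((x - X) / R) / R     # K_R(u) = R^{-1} K(u/R)
    f_hat = K_R.mean()                # kernel density estimate f_R_hat(x)
    return K_R / f_hat

X = [1, 7, 3, 2, 5]
Y = [5, 12, 1, 0, 4]
W = knn_kernel_weights(4, X, k=3)     # R = 2, weights (0, 0, 2.5, 0, 2.5)
print(W, np.mean(W * Y))              # m_R_hat(4) = 2.5
\end{verbatim}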

To give insight into this weight sequence consider the potato example again. In Figure 3.5 the effective $k$-$NN$ weights $W_{Ri}(x)$ are shown, analogous to the kernel weights $W_{hi}(x)$ in Figure 3.2.

Figure 3.5: The effective $\scriptstyle k$-$NN$ weights for the food versus net income data set. $ \scriptstyle K_R(x-.)/\hat f_R(x)$ at $\scriptstyle x=1$ and $\scriptstyle x=2.5$ for $ \scriptstyle k=100$ (label 1), $ \scriptstyle k=200$ (label 2), $ \scriptstyle k=300$ (label 3) with Epanechnikov kernel $ \scriptstyle K(u)=0.75 (1-u^2) I(\left\vert u \right\vert \le 1) $ and density estimate as in Figure 1.3, year = 1973, $n=7125$. Survey (1968-1983). ANRkNNweights.xpl
\includegraphics[scale=0.7]{ANRkNNweights.ps}

One can see quite clearly the difference from the kernel weights: at the right end of the data, where the observations become sparser, the $k$-$NN$ weights spread out more than the kernel weights presented in Figure 3.2. Mack (1981) computed bias and variance for this parameterization of the weights $\{ W_{Ri}\}_{i=1}^n$.

Proposition 3.4.2   (Mack; 1981) Let $k \to \infty, \ k/n \to 0, \ n \to \infty $ and let $c_K, d_K$ be defined as in Theorem 3.1.1. Then
\begin{displaymath}
E\hat m_R(x) - m(x) \approx \left({k\over n}\right)^2
{ (m''f + 2 m'f')(x)\over 8 f(x)^3}\,d_K,
\end{displaymath} (3.4.20)


\begin{displaymath}
var(\hat m_R(x) \vert X=x) \approx 2 {\sigma^2(x) \over k}c_K.
\end{displaymath} (3.4.21)

A consequence of this proposition is that a balance between the $\hbox{\rm {bias}} ^2$ and the variance contribution to the mean squared error is achieved, as for the uniform $k$-$NN$ weights (3.4.19), by letting $k=k_n$ be proportional to $n^{4/5}$. Thinking of the bandwidth parameter $h$ for kernel smoothers as being roughly equivalent to ${1 \over 2} kn^{-1}$, it is seen that the bias and variance rates of $\hat m_R$ are completely equivalent to those for kernel smoothers; only the constants differ. The bias of $\hat m_R$ tends to be large in the tails of the marginal distribution. The kernel estimators show a different behavior: there the variance is a multiple of $f(x)^{-1}$ and the bias turns out not to be a function of $f(x)^{-3}$. A comparison of the mean squared error properties of kernel and $k$-$NN$ smoothers can be found in Table 3.1.

Table 3.1: Bias and variance of $k$-$NN$ and kernel smoother. Source: Mack (1981, Table 1).

\begin{tabular}{lcc}
 & kernel & $k$-$NN$ \\
bias & $h^2\,{(m''f+2m'f')(x)\over 2f(x)}\,d_K$ & $\left({k\over n}\right)^2 {(m''f+2m'f')(x)\over 8f^3(x)}\,d_K$ \\
variance & ${\sigma^2(x)\over nhf(x)}\,c_K$ & ${2\sigma^2(x)\over k}\,c_K$
\end{tabular}


The entries of Table 3.1 show the asymptotic dependence of bias and variance on $f$, $k$ and $h$. The essential equivalence of the row entries of Table 3.1 can be seen by using the relation

\begin{displaymath}
k=2nh f(x).
\end{displaymath} (3.4.22)

Using this $k$ leads to the same (asymptotic) mean squared error (at $x$) for the $k$-$NN$ and the kernel smoothers. The accuracy of $\hat m_R$ can be stated in terms of a central limit theorem, which was given by Mack (1981, theorem 3). Rates of convergence for this $k$-$NN$ smoother have also been derived by Devroye (1978a) and Györfi (1981).
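Indeed, substituting (3.4.22) into the $k$-$NN$ column of Table 3.1 recovers the kernel column term by term:

\begin{eqnarray*}
\left({k\over n}\right)^2 {(m''f+2m'f')(x)\over 8f^3(x)}\,d_K
&=& \{2hf(x)\}^2\, {(m''f+2m'f')(x)\over 8f^3(x)}\,d_K
\;=\; h^2\,{(m''f+2m'f')(x)\over 2f(x)}\,d_K, \\
{2\sigma^2(x)\over k}\,c_K
&=& {2\sigma^2(x)\over 2nhf(x)}\,c_K
\;=\; {\sigma^2(x)\over nhf(x)}\,c_K.
\end{eqnarray*}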

A third kind of $k$-$NN$ smoother is the symmetrized nearest neighbor estimator. Let $F_n$ denote the empirical distribution function of the sample $\{ X_i \}_{i=1}^n$, and let $h$ be a bandwidth tending to zero. The estimate proposed by Yang (1981) and studied by Stute (1984) is

\begin{displaymath}
\hat m_{k(h)}(x)=(nh)^{-1}\sum_{i=1}^n K\left({F_n(X_i)-F_n(x) \over
h}\right) Y_i.
\end{displaymath} (3.4.23)

The estimate (3.4.23) is also a nearest neighbor estimate, but now neighbors are defined in terms of distance based on the empirical distribution function of the $\{ X_i \}_{i=1}^n$. Thus a weight sequence (symmetric in $F_n(X)$ space)

\begin{displaymath}W_{k(h)i}(x)=K_h (F_n(X_i)-F_n(x) )\end{displaymath}

is used.
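A minimal Python sketch of the estimator (3.4.23), assuming the quartic kernel that is used later in this section (function names are illustrative only):

\begin{verbatim}
import numpy as np

def quartic(u):
    """Quartic kernel K(u) = (15/16)(1 - u^2)^2 I(|u| <= 1)."""
    return (15 / 16) * (1 - u**2)**2 * (np.abs(u) <= 1)

def symmetrized_nn(x, X, Y, h, kernel=quartic):
    """Symmetrized nearest neighbor estimate (3.4.23):
    (nh)^{-1} sum_i K((F_n(X_i) - F_n(x)) / h) Y_i."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n = len(X)
    F_n = lambda t: np.mean(X <= t)       # empirical distribution function
    u = (np.array([F_n(Xi) for Xi in X]) - F_n(x)) / h
    return kernel(u) @ Y / (n * h)

X = [1, 7, 3, 2, 5]                       # toy data from the example above
Y = [5, 12, 1, 0, 4]
print(symmetrized_nn(4, X, Y, h=0.5))     # h acts on the F_n(X) scale
\end{verbatim}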

Note that $\hat m_R$ always averages over a symmetric neighborhood in the $X$-space, but may have an asymmetric neighborhood in $F_n(X)$ space. By contrast, $\hat m_{k(h)}$ always averages over the same number of points to the left and to the right of $x$, but may in effect average over an asymmetric neighborhood in the $X$-space. The estimate $\hat m_{k(h)}$ has an intriguing relationship with the $k$-$NN$ estimator used by Friedman (1984). The variable span smoother (supersmoother) proposed by Friedman uses the same type of neighborhood as does $\hat m_{k(h)}$; see Section 5.3. The estimate (3.4.23) also looks appealingly like a kernel regression estimate of $Y$ against not $X$ but rather $F_n(X)$. Define the expected value

\begin{displaymath}\bar m_{k(h)}(x)=h^{-1}\int m(u)K\left({F(u)-F(x)\over h}\right) dF(u).\end{displaymath}

Then Stute (1984) shows that as $n\to\infty ,\ h\to 0$ and $n h^3 \to \infty ,$

\begin{displaymath}(nh)^{1/2}(\hat m_{k(h)}(x)-\bar m_{k(h)}(x))
{\buildrel {\cal L}\over \to} N (0,c_K \sigma^2(x) ). \end{displaymath}

With this choice of $h$, $\hat m_{k(h)}(x) $ has the same limit properties as a kernel or ordinary nearest neighbor estimate as long as its bias term is of order $O(h^2)$. It is in fact not hard to show that the bias is, to order $O({h}^2)$, given by

\begin{displaymath}
{h}^2 d_K {(m''f - m'f')(x) \over 2 f^3 (x)}.
\end{displaymath} (3.4.24)

Carroll and Härdle (1988) compare (3.4.24) with the bias for kernel smoothers and $k$-$NN$ smoothers. They show that even when the variances of all three estimates are the same (the case in which the bandwidth of the symmetrized estimate equals $h f(x)$), the bias properties differ unless

\begin{displaymath}m'(x)f'(x) =0.\end{displaymath}

Otherwise, the smoothing parameter balancing variance versus bias$^2$ for the kernel and ordinary nearest neighbor estimates will lead to a different mean squared error than what one obtains for the symmetrized nearest neighbor estimate. An example is given in Figure 3.6.

Figure 3.6: The symmetrized nearest neighbor estimator together with the kernel estimator for the potato versus income data set. From Carroll and Härdle (1988) with permission of Elsevier Science Publishers. ANRsimnn.xpl
\includegraphics[scale=0.7]{ANRsimnn.ps}

For these data, we used the quartic kernel

\begin{displaymath}K(u) = (15/16) (1- u^2)^2 I(\vert u \vert \le 1).\end{displaymath}

We computed the ordinary kernel estimate $\hat m_h(x)$ and the symmetrized nearest neighbor estimate $\hat m_{k(h)}(x) $, the bandwidths being selected by cross-validation (see Chapter 4). The data-driven bandwidths were $h = 0.25$ for the kernel smoother on the scale $[0, 3]$ of Figure 3.6 and $h = 0.15 $ on the $F_n(X)$ scale. The resulting regression curves are similar for $x \le 1$, which is where most of the data lie. There is a sharp discrepancy for larger values of $x$, the kernel estimate showing evidence of a bimodal relationship and the symmetrized nearest neighbor estimate indicating either an asymptote or even a slight decrease as income rises. In this context, the latter seems to make more sense economically and looks quite similar to the curve in Hildenbrand and Hildenbrand (1986). Statistically, it is the range of the data where the density $f(x)$ takes on small values; see Figure 2.1, for example. This is exactly where we expect the biggest differences between the estimates, that is, the kernel estimate should be more variable but less biased.

3.4.1 Computational aspects of $k$-$NN$ smoothing

A great advantage of the $k$-$NN$ estimate (3.4.18) is that its computation can be updated quite easily when $x$ runs along the sorted array of $X$-variables. The algorithm requires essentially $O(n)$ operations to compute the smooth at all $X_i$, as compared to $O(n^2 h)$ operations for direct computation of the kernel estimate.

Let us construct $k$-$NN$ smoothers as weighted averages over a fixed number of observations with uniform weights as in (3.4.19). Suppose that the data have been pre-sorted, so that $X_i \le X_{i+1},\ i=1,\ldots,n-1$. Then, if the estimate has already been computed at some point $X_i$, the $k$-$NN$ smoother at $X_{i+1}$ can be determined recursively as

\begin{displaymath}\hat m_k(X_{i+1})= \hat m_k(X_i)+ k^{-1}(Y_{i+[k/2] +1} - Y_{i- [k/2]}),\end{displaymath}

where $[u]=\sup \{ i\ \hbox{integer} : i \le u \}$ denotes the integer part of $u.$
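A minimal Python sketch of this $O(n)$ updating scheme, assuming odd $k$ and restricting attention to interior points (variable names are illustrative):

\begin{verbatim}
import numpy as np

def knn_update(X, Y, k):
    """Running uniform k-NN smooth at the interior sorted X_i, updated in O(n) (k odd)."""
    order = np.argsort(X)                  # pre-sort the data
    Y = np.asarray(Y, float)[order]
    n, half = len(Y), k // 2               # half = [k/2]
    m = np.full(n, np.nan)
    m[half] = Y[:k].mean()                 # first full window
    for i in range(half, n - half - 1):    # slide the window one point to the right
        m[i + 1] = m[i] + (Y[i + half + 1] - Y[i - half]) / k
    return m

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, 200)
print(knn_update(X, Y, k=11)[5:10])
\end{verbatim}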

This updating formula is also applicable to local polynomial fits. For simplicity, only the local linear fit is described. The slope $\beta$ and intercept $\alpha$ of the least squares line through the neighborhood determined by the uniform weights (3.4.19) are given by

\begin{eqnarray*}
\alpha_{X_i} &=& \hat m_k(X_i)- \beta \overline \mu_{X_i}, \\
\beta &=& {C_{X_i} - k\,\overline \mu_{X_i}\, \hat m_k(X_i) \over
V_{X_i} - k\,\overline \mu_{X_i}^2}, \\
\overline \mu_x &=& k^{-1} \sum_{i \in J_x} X_i, \\
C_x &=& \sum_{i \in J_x} X_i Y_i, \\
V_x &=& \sum_{i \in J_x} X_i^2.
\end{eqnarray*}



If an observation $(X_{i+ [k/2] +1}, Y_{i+ [k/2] +1})$ is added and $(X_{i- [k/2]}, Y_{i- [k/2]})$ falls out of the window over which to average, the following formulas can be used:

\begin{eqnarray*}
\overline \mu_{X_{i+1}}
&=& \overline \mu_{X_i} + k^{-1} (X_{i+ [k/2] +1} - X_{i- [k/2]}), \\
C_{X_{i+1}}
&=& C_{X_i} + X_{i+ [k/2] +1} Y_{i+ [k/2] +1} - X_{i- [k/2]} Y_{i- [k/2]}, \\
V_{X_{i+1}}
&=& V_{X_i} + X^2_{i+ [k/2] +1} -X^2_{i- [k/2]}.
\end{eqnarray*}
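A Python sketch of the corresponding running local linear fit, built on the running sums above (again assuming odd $k$, pre-sorted data and interior points; this is a sketch of the standard least squares formulas, not of any particular published code):

\begin{verbatim}
import numpy as np

def knn_local_linear(X, Y, k):
    """Running local linear fit over symmetric k-NN windows, updated in O(n)."""
    order = np.argsort(X)
    X, Y = np.asarray(X, float)[order], np.asarray(Y, float)[order]
    n, half = len(X), k // 2
    alpha, beta = np.full(n, np.nan), np.full(n, np.nan)
    # running sums over the first window
    mu, m, C, V = X[:k].mean(), Y[:k].mean(), X[:k] @ Y[:k], X[:k] @ X[:k]
    for i in range(half, n - half):
        beta[i] = (C - k * mu * m) / (V - k * mu**2)
        alpha[i] = m - beta[i] * mu        # fitted line at u is alpha[i] + beta[i] * u
        if i + half + 1 < n:               # update the sums for the next window
            x_in, y_in = X[i + half + 1], Y[i + half + 1]
            x_out, y_out = X[i - half], Y[i - half]
            mu += (x_in - x_out) / k
            m += (y_in - y_out) / k
            C += x_in * y_in - x_out * y_out
            V += x_in**2 - x_out**2
    return alpha, beta

X = np.linspace(0, 1, 200)
Y = np.sin(2 * np.pi * X) + np.random.default_rng(1).normal(0, 0.3, 200)
a, b = knn_local_linear(X, Y, k=11)
\end{verbatim}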



This recursive algorithm is a component of the supersmoother to be described in Section 5.3.

The principal idea of updating also applies to the $k$-$NN$ estimate $\hat m_R(x)$ if a discrete approximation to the kernel $K$ is used. Suppose that for fixed $k$ the effective kernel weight is sufficiently well approximated by a sum of indicator functions

\begin{displaymath}(2 [k/2])^{-1} \sum^{[k/2]}_{j=0} I (\left\vert x-X_i
\right\vert \le
R_n(k-2j)),\end{displaymath}

where $R_n(k)$ is the distance of $x$ to its $k$-th nearest neighbor. Then $\hat m_R(x)$ can be represented as a sum of simple $k$-$NN$ estimates,

\begin{displaymath}\hat m_R(x) \approx (2 [k/2])^{-1} \sum^{[k/2]}_{j=0} \hat
m_{k-2j}(x).\end{displaymath}

Every term in this sum can be updated as before. The computational costs are linear in $n$.

Exercises
3.4.1 Show that the bias and variance of $\hat m_R$ are identical to the bias and variance of $\hat m_k$ if a uniform kernel is used.

3.4.2 Define $d_M(k) = \hbox{variance}(k) + \hbox{bias}^2(k)$. Compute

\begin{displaymath}k_{opt}= \arg\min_k d_M(k)\end{displaymath}

as a function of $m$, $f$ and $K$. Also compute $d_M(k_{opt})$. Compare with the speed of convergence for kernel smoothers. Interpret the constants occurring in these expressions.

3.4.3 Implement the $k$-$NN$ algorithm on a computer and compare it with a brute force implementation of the kernel smoother. Where do you see numerical advantages or disadvantages of the recursive updating algorithm?

3.4.4 Compare kernel and $k$-$NN$ smooths subjectively. In which regions of the data would you prefer the one over the other?

3.4.5 Verify the bias formula (3.4.24) for the symmetrized nearest neighbor estimator. Compare with the bias for the ordinary $k$-$NN$ smoother and the kernel smoother.