14.7 Thresholding

In this section we consider the problem of recovering the regression function from noisy data based on a wavelet decomposition.



The upper two plots in the display show the underlying regression function (left) and noisy data (right). The lower two plots show the distribution of the wavelet coefficients in the time-scale domain for the regression function and the noisy data.

For simplicity we deal with a regression estimation problem. The noise is assumed to be additive and Gaussian with zero mean and unknown variance. Since the wavelet transform is an orthogonal transform we can consider the filtering problem in the space of wavelet coefficients. One takes the wavelet transform of the noisy data and tries to estimate the wavelet coefficients. Once the estimator is obtained one takes the inverse wavelet transform and recovers the unknown regression function.

To suppress the noise in the data, two approaches are commonly used. The first one is the so-called linear method. We have seen in Section 14.6 that the wavelet decomposition reflects the properties of the signal in the frequency domain well. Roughly speaking, higher decomposition scales correspond to higher frequency components of the regression function. If we assume that the underlying regression function is concentrated in the low frequency domain, then the filtering procedure becomes evident: all empirical wavelet coefficients beyond some resolution scale are set to zero. This procedure works well if the signal is sufficiently smooth and there are no boundary effects in the data. For many practical problems, however, such an approach is not fully appropriate; images, for example, cannot be regarded as smooth functions.
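A minimal sketch of this linear procedure, written in Python with the PyWavelets package (our own choice for illustration, not the XploRe implementation; the wavelet and the cutoff level are arbitrary assumptions):

import numpy as np
import pywt

def linear_wavelet_smoother(y, wavelet="db4", keep_levels=3):
    # Decompose the noisy data: coeffs = [cA_J, cD_J, ..., cD_1] (coarse to fine).
    coeffs = pywt.wavedec(y, wavelet)
    # Keep the approximation and the keep_levels coarsest detail levels,
    # set all finer detail coefficients to zero (linear projection).
    for j in range(1 + keep_levels, len(coeffs)):
        coeffs[j] = np.zeros_like(coeffs[j])
    # The inverse transform gives the smoothed regression estimate.
    return pywt.waverec(coeffs, wavelet)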

To avoid this shortcoming, a nonlinear filtering procedure is often used to suppress the noise in the empirical wavelet coefficients. The main idea is based on a fundamental property of the wavelet transform: the father and mother functions are well localized in the time domain. Therefore one can estimate the empirical wavelet coefficients independently of each other. To do this, let us compare the absolute value of an empirical wavelet coefficient with the standard deviation of the noise. Clearly, if the wavelet coefficient is of the same order as the noise level, then we cannot separate the signal from the noise; in this situation a good estimator for the wavelet coefficient is zero. When an empirical wavelet coefficient is larger than the noise level, a natural estimator for the wavelet coefficient is the empirical coefficient itself. This idea is called thresholding.

We distinguish mainly three modifications of thresholding: hard thresholding, soft thresholding, and levelwise thresholding using Stein's unbiased risk estimator, see Subsection 14.7.3.


14.7.1 Hard Thresholding

To suppress the noise we apply the following nonlinear transform to the empirical wavelet coefficients: $ F(x) = x\cdotp I(\vert x\vert>t)$, where $ t$ is a certain threshold. The choice of the threshold is a very delicate and important statistical problem.

On the one hand, a large threshold leads to a large bias of the estimator; on the other hand, a small threshold increases the variance of the smoother. Theoretical considerations yield the following value of the threshold:

$\displaystyle t = \sqrt{2 \sigma^2 \log(n)/n},$

where $ n$ is the length of the input vector and $ \sigma^2$ is the variance of the noise. The variance of the noise is estimated based on the data. We do this by averaging the squares of the empirical wavelet coefficients at the highest resolution scale.
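The following Python sketch (again using PyWavelets, not XploRe code) combines this threshold rule with the variance estimate from the finest scale. Since the orthonormal discrete wavelet transform used here keeps the noise level of every coefficient equal to $ \sigma$, the rule appears as $ \hat{\sigma}\sqrt{2\log n}$; it corresponds to the formula above under the coefficient normalization used in the text.

import numpy as np
import pywt

def hard_threshold_estimator(y, wavelet="db4"):
    coeffs = pywt.wavedec(y, wavelet)
    # Noise variance: average of the squared empirical wavelet coefficients
    # at the highest (finest) resolution scale.
    sigma2 = np.mean(coeffs[-1] ** 2)
    n = len(y)
    # Universal threshold for coefficients with noise level sigma.
    t = np.sqrt(2.0 * sigma2 * np.log(n))
    # Hard thresholding F(x) = x * I(|x| > t) applied to the detail coefficients.
    coeffs[1:] = [c * (np.abs(c) > t) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)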

We provide two possibilities for choosing a threshold. First, you can do it by ``eye'' using the Hard threshold item and entering the desired value of the threshold; the threshold offered is the one described in the preceding paragraph. Note that this value is in most cases conservative, so one may choose a threshold below the offered one. The item Automatic means that the threshold is chosen as $ \sqrt{2 \sigma^2 \log(n)/n}$ with a suitably estimated variance.

The plots of the underlying regression function and the data corrupted by noise are shown at the top of the display. You can change the variance of the noise and the regression function using the Noise level and the Change function items.



In the display you can see the estimated regression function obtained by the thresholding procedure. The upper left window shows how the thresholding procedure distorts the underlying regression function when there is no noise in the data. The final estimator is shown in the upper right window.

The two plots at the bottom of the display show the thresholded and the nonthresholded wavelet coefficients. The thresholded coefficients are shown as blue circles and the nonthresholded ones as red circles. The radii of the circles correspond to the magnitudes of the empirical wavelet coefficients.

You can try to improve the estimator using the items Change basis and Change level. Under an appropriate choice of the parameters you can decrease the bias of your smoother.




14.7.2 Soft Thresholding

Along with hard thresholding, soft thresholding procedures are often used in statistical applications. In this section we study the so-called wavelet shrinkage procedure for recovering the regression function from noisy data.



The display shows the same regression function as in the previous lesson, without and with noise, after soft thresholding.

The only difference between the hard and the soft thresholding procedures lies in the choice of the nonlinear transform applied to the empirical wavelet coefficients. For soft thresholding the following nonlinear transform is used:

$\displaystyle S(x) = \textrm{sign}(x) (\vert x\vert-t)I(\vert x\vert>t),$

where $ t$ is a threshold. The menu provides you with all possibilities for choosing the threshold and exploring the data.
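In code, the soft threshold transform (and the hard one, for comparison) can be written as follows; this is a plain illustration in Python, and PyWavelets offers the same transforms via pywt.threshold(x, t, mode="soft") and mode="hard".

import numpy as np

def soft_threshold(x, t):
    # S(x) = sign(x) (|x| - t) I(|x| > t): shrink by t towards zero,
    # set coefficients with |x| <= t to zero.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def hard_threshold(x, t):
    # F(x) = x I(|x| > t): keep or kill, no shrinkage.
    return x * (np.abs(x) > t)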

In addition you can use a data driven procedure for choosing the threshold. This procedure is based on Stein's principle of unbiased risk estimation (Donoho and Johnstone; 1995). The main idea of this procedure is the following: since $ S(x)$ is a continuous function, one can obtain, for given $ t$, an unbiased estimate of the risk of the soft thresholding procedure. The optimal threshold is then obtained by minimizing the estimated risk.




14.7.3 Adaptive Thresholding

We first give an intuitive idea about adaptive thresholding using Stein's principle. Later on we give a detailed mathematical description of the procedure.

We define a risk function $ R(s_k, Z_k,t)$ for the model $ Z_k = c_k + s_k e_k$, $ k = 1,\dots,M$, where $ c_k$ is the unknown coefficient, $ s_k$ are known (or to be estimated) scale parameters, $ e_k$ are i.i.d. $ N(0,1)$ random variables and $ t$ is the threshold.

Since Stein's method enables us to estimate the risk

$\displaystyle \sum_{k=1}^{M} R(s_k,Z_k,t),$

the risk minimizing argument $ t^*$ can also be estimated and will then be taken as the ``optimal'' adaptive threshold. But we still have to decide, through the definition of the risk function (and therefore of the thresholding), which kind of threshold function (hard or soft) we want to use.

This procedure can be performed for each resolution level.

Mathematical derivation

Now we discuss the data driven choice of threshold, initial level $ j_0$ and the wavelet basis by the Stein (1981) method of unbiased risk estimation. The argument below follows Donoho and Johnstone (1995).

We first explain the Stein method for the idealized one-level observation model discussed in the previous section: $ Z_k =c_k + s_k e_k$, for $ k =1,\dots,M$, where $ c=(c_1,\dots,c_M)$ is the vector of unknown parameters, $ s_k > 0$ are known scale parameters and $ e_k$ are i.i.d. $ N(0,1)$ random variables.

Let $ c' = (c_1',\dots,c_M')$ be an estimator of $ c$. Introduce the mean squared risk of the estimate of $ c$: $ R= \sum_{k=1}^M E(c_k' - c_k)^2$.

Assume that the estimators $ c_k'$ have the form

$\displaystyle c_k' = Z_k +H_t ( Z_k ),$ (14.1)

where $ t$ is a parameter and $ H_t$ is a weakly differentiable real valued function for any fixed $ t$. One may initially think of $ t$ as a threshold, but Stein's argument works in the general case as well. The parameter $ t$ can be chosen by the statistician. In other words, (14.1) defines a family of estimators, indexed by $ t$, and the question is how to choose the optimal $ t=t^*$. Define the optimal $ t^*$ as a minimizer of the risk function with respect to $ t$.

If the true parameters $ c_k$ were known, one could compute $ t^*$ explicitly. In practice this is not possible, and one chooses a certain approximation $ \hat{t}$ of $ t^*$ as a minimizer of an unbiased estimator $ R'$ of the risk $ R$. To construct $ R'$, note that

$\displaystyle E(c_k' - c_k)^2 =E\{ R(s_k,Z_k,t)\},$ (14.2)

where

$\displaystyle R(s, x, t) = s^2 + 2 s^2 (d / dx) H_t (x) +H_t ^2 (x).$

In fact,

$\displaystyle E(c_k' - c_k)^2 =s_k^2+2s_k E(e_kH_t(Z_k))+E(H_t^2(Z_k)),$ (14.3)

and, by partial integration, (14.2) follows. The relation (14.2) yields $ R =E(R')$, where the value $ R' = \sum_{k=1}^M R(s_k, Z_k,t)$ is an unbiased risk estimator, or risk predictor. It is called Stein's unbiased risk estimator (SURE).

The Stein principle is to minimize $ R'$ with respect to $ t$ and take the minimizer

$\displaystyle \hat{t}=\arg\min_{t\geq 0}\sum_{k=1}^M R(s_k,Z_k,t)$ (14.4)

as a data driven estimator of the optimal $ t^*$. The unbiasedness relation $ E(R') = R$ (which holds for every $ t$) alone does not guarantee that the estimate is close to $ t^*$. A more refined argument is needed to prove this, see Donoho and Johnstone (1995).

In the rest of this paragraph we formulate the Stein principle for the example of soft thresholding wavelet estimators. For soft thresholding (see above) we have

$\displaystyle H_t (x) = -x I\{\vert x\vert < t\}-t I\{\vert x\vert \geq t\} \textrm{sign}(x)$

and

$\displaystyle R(s,x,t)= (x^2-s^2)I\{\vert x\vert<t\}+(s^2+t^2)I\{\vert x\vert\geq t\} = [x^2-s^2]+(2s^2-x^2+t^2)I\{\vert x\vert\geq t\}.$ (14.5)

An equivalent expression is

$\displaystyle R(s,x,t) = \min(x^2, t^2) - 2 s^2 I\{x^2 \leq t^2\} + s^2.$
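The unbiasedness of this risk expression is easy to check numerically. The following Python sketch (purely illustrative; the true coefficients and scales are assumed for the check) simulates the one-level model and compares the Monte Carlo risk of soft thresholding with the average of the risk expression above:

import numpy as np

rng = np.random.default_rng(0)

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sure_soft(s, z, t):
    # R(s, x, t) = (x^2 - s^2) I(|x| < t) + (s^2 + t^2) I(|x| >= t), cf. (14.5).
    return np.where(np.abs(z) < t, z ** 2 - s ** 2, s ** 2 + t ** 2)

c = np.array([0.0, 0.5, 3.0])        # true coefficients (assumed for the check)
s = np.ones_like(c)                  # unit noise scales
t = 1.0
Z = c + s * rng.standard_normal((200_000, c.size))

true_risk = np.mean(np.sum((soft(Z, t) - c) ** 2, axis=1))
sure_risk = np.mean(np.sum(sure_soft(s, Z, t), axis=1))
print(true_risk, sure_risk)          # the two averages nearly coincide

Both averages agree up to simulation error, in line with the relation $ E(R') = R$.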

The expression in square brackets in (14.5) does not depend on $ t$. Thus, the definition (14.4) is equivalent to

$\displaystyle \hat{t}=\arg\min_{t\geq 0}\sum_{k=1}^M(2s_k^2+t^2-Z_k^2) I\{\vert Z_k\vert\geq t\}.$ (14.6)

Let $ (p_1,\dots,p_M)$ be the permutation that orders the array $ \vert Z_k\vert$, $ k =1,\dots,M$, i.e. $ \vert Z_{p_1}\vert \leq \vert Z_{p_2}\vert \leq \dots \leq \vert Z_{p_M}\vert$, and set $ \vert Z_{p_0}\vert = 0$. According to (14.6) one obtains $ \hat{t} = \vert Z_{p_l}\vert$, where

$\displaystyle l=\arg\min_{0\leq k\leq M} \sum_{s=k+1}^M(2s_{p_s}^2+ Z_{p_k}^2-Z_{p_s}^2).$

In particular for $ M=1$ the above equation yields the following estimator

\begin{displaymath}
\hat{c}_1=\left\{
\begin{array}{ll}
Z_1;& Z_1^2\geq 2s_1^2,\\
0;& Z_1^2< 2s_1^2.
\end{array}\right.
\end{displaymath}

It is easy to see that the computation of the estimate of $ t^*$ defined above requires approximately $ M \ln(M)$ operations provided that a quick sort algorithm is used to order the array $ \vert Z_k\vert$, $ k =1,\dots,M$.
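A possible Python implementation of this ordering argument (our own sketch; names and the handling of the scale parameters are illustrative) computes the data driven threshold in $ O(M\log M)$ time:

import numpy as np

def sure_soft_threshold(Z, s):
    # Data-driven soft threshold via Stein's unbiased risk estimate,
    # minimized over the candidate thresholds {0} U {|Z_k|} after sorting.
    Z = np.asarray(Z, dtype=float)
    s2 = np.broadcast_to(np.asarray(s, dtype=float) ** 2, Z.shape)
    order = np.argsort(np.abs(Z))
    absZ = np.abs(Z)[order]                     # |Z_{p_1}| <= ... <= |Z_{p_M}|
    z2, s2 = absZ ** 2, s2[order]
    M = len(Z)
    t_cand = np.concatenate(([0.0], absZ))      # t_k = |Z_{p_k}|, with |Z_{p_0}| = 0
    # Tail sums over s = k+1, ..., M of (2 s_{p_s}^2 - Z_{p_s}^2).
    tail = np.concatenate((np.cumsum((2.0 * s2 - z2)[::-1])[::-1], [0.0]))
    counts = M - np.arange(M + 1)               # number of terms in each tail sum
    objective = tail + counts * t_cand ** 2
    return t_cand[np.argmin(objective)]

For instance, sure_soft_threshold(Z, np.full(len(Z), sigma)) returns the threshold for coefficients with common noise level sigma.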

We proceed from the idealized model $ Z_k =c_k + s_k e_k$ to a more realistic density estimation model. In the context of wavelet smoothing the principle of unbiased risk estimation gives the following possibilities for adaptation:

  1. adaptive threshold choice at every resolution level $ j\geq j_0$;
  2. adaptive choice of $ j_0$, in addition to 1;
  3. adaptive choice of the father and mother wavelets, in addition to 2.

Please notice that in XploRe we have implemented the first version. To demonstrate these possibilities consider the family of wavelet estimators

$\displaystyle f^*(x,t,j_0,\psi)=\sum_k\alpha^*_{j_0 k}[\varphi,t]\varphi_{j_0 k}(x)+ \sum_{j=j_0}^{j_1}\sum_k \beta^*_{jk}[\psi,t_j]\psi_{jk}(x),$ (14.7)

where $ \alpha^*_{j_0 k}[\varphi,t]=\hat{\alpha}_{j_0 k}+H_t(\hat{\alpha}_{j_0 k})$ and $ \beta^*_{jk}[\psi,t_j]=\hat{\beta}_{jk}+H_{t_j}(\hat{\beta}_{jk})$ are the soft thresholded empirical wavelet coefficients obtained with $ H_t$. Here $ t = (t,t_{j_0},\dots, t_{j_1})$ denotes the vector of thresholds; with a slight abuse of notation, its first component $ t$ is the threshold applied to the father coefficients. The dependence of $ f^*$ on the mother wavelet is suppressed in the notation since the mother wavelet is assumed to be canonically associated with the father wavelet. As in (14.2) it can be shown that, under certain general conditions, $ E\Vert f^*-f\Vert _2^2=E\left(\hat{R}(t,j_0,\varphi)\right)$. Here Stein's unbiased risk estimator is given by

$\displaystyle \hat{R}(t,j_0,\varphi)=\sum_kR(s_{j_0 k}[\varphi],\hat{\alpha}_{j_0 k},t)+ \sum_{j=j_0}^{j_1}\sum_kR(s_{jk}[\psi],\hat{\beta}_{jk},t_j),$

where $ R(s,x,t)$ is defined above, and $ s^2_{j_0 k}[\varphi]$ and $ s^2_{jk}[\psi]$ are the variances of the corresponding empirical wavelet coefficients. To obtain the best estimator from the family (14.7), one chooses the unknown parameters of the estimator by minimizing $ \hat{R}(t,j_0, \varphi)$. For the cases 1, 2 and 3 these parameters can be chosen respectively as follows:
  1. adaptive choice of thresholds: $ \hat{t}=\arg\min_t\hat{R}(t,j_0,\varphi)$.
  2. adaptive choice of thresholds and $ j_0$: $ (\hat{t},\hat{\jmath_0})=\arg\min_{t,j_0}\hat{R}(t,j_0,\varphi)$.
  3. adaptive choice of thresholds, $ j_0$ and wavelet basis: $ (\hat{t},\hat{\jmath_0},\hat{\varphi}) =\arg\min_{t,j_0,\varphi}\hat{R}(t,j_0,\varphi)$.

In the third case we assume that the minimum is taken over a finite number of given wavelet bases.

Remark 1: Since in practice the values of the different variances are not available, one can use their empirical versions instead. For example if (14.7) is the wavelet density estimator based on the sample $ X_1, \dots, X_n$, one can replace $ s^2_{jk}[\psi]$ by its estimator.

It is clear that this yields a consistent estimator for the variances under rather general assumptions on the basis functions and on the underlying density of the $ X_i$'s.

Remark 2: If one wants to threshold only the coefficients $ \beta_{jk}$, which is usually the case, the function $ H_t$ for the $ \alpha_{j_0 k}$ should be identically zero. Therefore, $ R(s_{j_0 k}[\varphi],\hat{\alpha}_{j_0 k},t)$ should be replaced by $ s^2_{j_0 k}[\varphi]$ and Stein's unbiased risk estimator takes the form

$\displaystyle \hat{R}((t_{j_0},\dots,t_{j_1}),j_0,\varphi)= \sum_k s^2_{j_0 k}[\varphi]+\sum_{j=j_0}^{j_1}\sum_k R\left(s_{jk}[\psi], \hat{\beta}_{jk},t_j\right).$

Let us now apply the Stein principle to a regression estimation example. We choose the following step function for $ x \in [0,1]$:

$\displaystyle f(x)=0.1\,I(x<0.4)+2\,I(x\in[0.4,0.6])+0.5\,I(x\in[0.6,0.8]).$

The function was observed at 128 equispaced points and disturbed by Gaussian noise with variance $ 1/128$. We use the Stein rule only for threshold choice 1 (level by level) and not for cases 2 and 3, where the adaptive choice of $ j_0$ and of the basis is considered. We thus choose the thresholds $ \hat{t}$ as the minimizer with respect to $ t= (t_{j_0},\dots,t_{j_1})$ of

$\displaystyle \hat{R}(t)=\sum_{j=j_0}^{j_1}\sum_k R\left(\hat{s}_{jk}[\psi],\hat{\beta}_{jk}, t_j\right),$

where $ R(s,x,t)$ is defined as above and $ \hat{s}_{jk}[\psi]$ is an empirical estimate of the scale (standard deviation) of the corresponding wavelet regression coefficients. For the computation we use the discrete wavelet transform.
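The example can be reproduced, for instance, with the following Python sketch (PyWavelets with the Haar basis is our own choice, as is the exact handling of the interval boundaries; XploRe's discrete wavelet transform may differ in basis and normalization):

import numpy as np
import pywt

rng = np.random.default_rng(1)

# Step function from the text, observed at 128 equispaced points on [0, 1].
# Boundary handling at 0.4, 0.6 and 0.8 is our own assumption.
n = 128
x = np.arange(n) / n
f = 0.1 * (x < 0.4) + 2.0 * ((x >= 0.4) & (x <= 0.6)) + 0.5 * ((x > 0.6) & (x <= 0.8))
y = f + rng.normal(scale=np.sqrt(1.0 / n), size=n)      # noise variance 1/128

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sure_threshold(z, s):
    # Minimize Stein's unbiased risk estimate over the candidate thresholds
    # {0} U {|z_k|}, using R(s,x,t) = min(x^2,t^2) - 2 s^2 I{x^2 <= t^2} + s^2;
    # brute force O(M^2) for the sake of brevity.
    cand = np.concatenate(([0.0], np.abs(z)))
    risks = [np.sum(np.minimum(z ** 2, t ** 2) - 2.0 * s ** 2 * (z ** 2 <= t ** 2) + s ** 2)
             for t in cand]
    return cand[int(np.argmin(risks))]

# Discrete wavelet transform, then level-by-level SURE soft thresholding of the
# detail coefficients only; the father coefficients are left untouched.
coeffs = pywt.wavedec(y, "haar")
sigma = np.sqrt(np.mean(coeffs[-1] ** 2))               # noise level from the finest scale
thresholded = [coeffs[0]] + [soft(d, sure_threshold(d, sigma)) for d in coeffs[1:]]
f_hat = pywt.waverec(thresholded, "haar")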