3.7 Applications

In this section, we return to the opening questions of this chapter concerning some real data sets. We use the Gaussian kernel throughout our calculations.

Example 3.7.1. We return to the data set in Example 3.1.1. Previous analyses, such as Fan and Zhang (1999), have ignored the weather effect. The omission of the weather effect seems reasonable from the point of view of linear regression, such as model (3.2). However, as we shall see, the weather has an important role to play.

The daily admissions shown in Figure 3.8 (a) suggest non-stationarity, which is, however, not discernible in the explanatory variables. This kind of trend was also observed by Smith, Davis, and Speckman (1999) in their study of the effect of particulates on human health. They conjecture that the trend is due to the epidemic effect. We therefore estimate the time dependence by a simple kernel method; the result is shown in Figure 3.8 (a). Another factor is the day-of-the-week effect, presumably due to the booking system. This effect can be estimated by a simple regression method using dummy variables. To better assess the effect of pollutants, we remove these two factors first. By an abuse of notation, we shall continue to use $ y_t $ to denote the `filtered' data, now shown in Figure 3.8 (b).

Figure 3.8: For Example 3.7.1. (a) The original data set and the time trend; (b) the data after removing the time trend and the day-of-the-week effect.
\includegraphics[width=1.2\defpicwidth]{figure9.ps}
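The two filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the computation behind Figure 3.8: the series is synthetic, the bandwidth `h=30.0` is picked by eye, and the function names `gaussian_kernel_smooth` and `remove_day_of_week` are hypothetical.

```python
import numpy as np

def gaussian_kernel_smooth(t, y, h):
    """Nadaraya-Watson estimate of E[y | t] at each observed t (Gaussian kernel)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    # weights[i, j] = K((t_i - t_j) / h)
    weights = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    return weights @ y / weights.sum(axis=1)

def remove_day_of_week(y, day):
    """Regress y on seven day-of-week dummies; return residuals and the effects."""
    dummies = (np.asarray(day)[:, None] == np.arange(7)[None, :]).astype(float)
    effects, *_ = np.linalg.lstsq(dummies, y, rcond=None)
    return y - dummies @ effects, effects

# Synthetic stand-in for the admissions series (NOT the Hong Kong data).
rng = np.random.default_rng(0)
n = 700
t = np.arange(n)
day = t % 7
true_dow = np.array([3.0, 1.0, 0.0, 0.0, -1.0, -1.0, -2.0])
raw = 50 + 10 * np.sin(2 * np.pi * t / n) + true_dow[day] + rng.normal(0, 1, n)

trend = gaussian_kernel_smooth(t, raw, h=30.0)            # step 1: time trend
filtered, dow_effects = remove_day_of_week(raw - trend, day)  # step 2: weekday effect
```

Here `filtered` plays the role of the `filtered' series $ y_t $. Removing the trend before the dummy regression keeps the slow drift out of the weekday estimates.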

 

As the different pollutant-based and weather-based covariates may affect the circulatory and respiratory systems after different time delays, we first use the method of Yao and Tong (1994) to select a suitable lag for each covariate within the model framework

$\displaystyle y_t = {\sl g}(x_{k, t-\ell}) + \varepsilon_{k,t}, \qquad k = 1, 2, \cdots, 6, \quad \ell = 1, 2, \cdots, 8.$      

The selected lag variables are $ x_{1,t-6}, x_{2, t-1}, x_{3,
t-1}, x_{4, t-7}, x_{5, t-6}, x_{6, t-4} $ respectively. Since rapid changes in temperature are also expected to affect health, we also incorporate the local temperature variation $ v_{5, t} $; see Fan and Yao (1998). Finally, we consider the model
$\displaystyle y_t = {\sl g}(X_t) + \varepsilon_t, \ \textrm{with } X_t = (x_{1,t-6},
x_{2,t-1}, x_{3,t-1}, x_{4, t-7}, v_{5,t-6},x_{5, t-6},
x_{6,t-4})^{\top },$      

where all the variables are standardized.
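The lag-selection step can be illustrated by comparing candidate lags through a leave-one-out prediction error. The sketch below is not the procedure of Yao and Tong (1994) itself; it only shows the general idea for a single covariate, using a Gaussian-kernel Nadaraya-Watson fit. The data are synthetic, and the bandwidth `h=0.3` and the names `loo_cv_score` and `select_lag` are assumptions made for illustration.

```python
import numpy as np

def loo_cv_score(x, y, h):
    """Leave-one-out CV error of a Gaussian-kernel Nadaraya-Watson fit of y on x."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    np.fill_diagonal(w, 0.0)              # leave point i out of its own fit
    fitted = w @ y / w.sum(axis=1)
    return np.mean((y - fitted) ** 2)

def select_lag(x, y, max_lag=8, h=0.3):
    """Return the lag in 1..max_lag with the smallest CV score, plus all scores."""
    n = len(x)
    scores = {}
    for lag in range(1, max_lag + 1):
        # pair y_t with x_{t-lag} on the common sample t = max_lag, ..., n-1
        scores[lag] = loo_cv_score(x[max_lag - lag:n - lag], y[max_lag:], h)
    return min(scores, key=scores.get), scores

# Synthetic check (hypothetical data): y depends on x at lag 3.
rng = np.random.default_rng(1)
x = rng.normal(size=600)
y = np.zeros(600)
y[3:] = np.sin(2 * x[:-3]) + 0.3 * rng.normal(size=597)

best_lag, cv_scores = select_lag(x, y)
```

All lags are scored on the same set of time points, so the CV values are directly comparable across lags.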

Now, using the MAVE method with the bandwidth $ h = 0.8 $, we have $ CV(1) = 0.3802, CV(2) = 0.3632, CV(3)= 0.3614, CV(4) = 0.3563, CV(5) = 0.3613, CV(6) = 0.3800, CV(7) = 0.4241$. The CV value is smallest at 4, so the estimated number of e.d.r. directions is 4. The corresponding directions are


$\displaystyle \begin{tabular}{lcccccccr}
$\hat \beta_1 $\ & = & (-0.0606 & 0.53... ...0.0271 & 0.8051 & -0.0255 & 0.3033 & -0.4745 & 0.1514$)^{\top }$.
\end{tabular}$

XEGex71.xpl
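The choice of the number of e.d.r. directions amounts to minimising the reported CV values over the candidate dimensions. A minimal check, using only the numbers quoted in the text (the variable names are illustrative):

```python
# CV values reported in the text for d = 1, ..., 7 (bandwidth h = 0.8).
cv = {1: 0.3802, 2: 0.3632, 3: 0.3614, 4: 0.3563,
      5: 0.3613, 6: 0.3800, 7: 0.4241}

# The estimated number of e.d.r. directions is the minimiser of CV(d).
d_hat = min(cv, key=cv.get)
print(d_hat)  # 4
```

The minimum is fairly flat: CV(3), CV(4) and CV(5) differ only in the third decimal place, so the choice of 4 rests on small differences.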


Figures 3.9 (a)-(d) show $ y_t $ plotted against the e.d.r. directions. Figures 3.9 (a$ '$)-(d$ '$) show the estimated regression functions of $ y_t $ on the e.d.r. directions, with pointwise 95% confidence bands; see, for example, Fan and Gijbels (1996). These plots suggest that along these directions there are discernible changes in the function value.

Figure 3.9: For Example 3.7.1. (a)-(d) show $ y_t $ plotted against $ \hat \beta_1^{\top }X $, $ \hat \beta_2^{\top }X $, $ \hat \beta_3^{\top }X $ and $ \hat \beta_4^{\top }X $ respectively. (a$ '$)-(d$ '$) show the estimated regression functions of $ y_t $ on $ \hat \beta_i^{\top }X $, $ i = 1,2, 3, 4$, with pointwise 95% confidence bands.
\includegraphics[width=1.4\defpicwidth]{hkreg1.ps}

\includegraphics[width=1.4\defpicwidth]{hkreg2.ps}

\includegraphics[width=1.4\defpicwidth]{hkreg3.ps}

XEGfig10.xpl
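Curves with pointwise bands of the kind shown in panels (a$ '$)-(d$ '$) can be produced, in outline, by a local linear smoother. The sketch below is one simple way to do this and is not claimed to be the construction used for Figure 3.9: it assumes homoscedastic errors, uses a crude residual-based variance estimate, and ignores smoothing bias. The data, the bandwidth and the function names are hypothetical.

```python
import numpy as np

def local_linear(x, y, grid, h):
    """Local linear fit (Gaussian kernel); returns estimates and smoother weights."""
    est = np.empty(len(grid))
    hat = np.empty((len(grid), len(x)))
    for i, x0 in enumerate(grid):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        design = np.column_stack([np.ones_like(x), x - x0])
        xtw = design.T * w
        # first row of (D'WD)^{-1} D'W gives the weights of the intercept estimate
        hat[i] = np.linalg.solve(xtw @ design, xtw)[0]
        est[i] = hat[i] @ y
    return est, hat

def pointwise_band(x, y, grid, h, z=1.96):
    """Estimate with approximate pointwise 95% band (bias ignored)."""
    est, hat = local_linear(x, y, grid, h)
    fitted, _ = local_linear(x, y, x, h)
    sigma2 = np.mean((y - fitted) ** 2)      # crude homoscedastic variance estimate
    se = np.sqrt(sigma2 * (hat ** 2).sum(axis=1))
    return est, est - z * se, est + z * se

# Synthetic example (hypothetical data): response against one index value.
rng = np.random.default_rng(2)
x = rng.normal(size=400)
y = np.sin(x) + 0.2 * rng.normal(size=400)
grid = np.linspace(-1.5, 1.5, 31)
est, lower, upper = pointwise_band(x, y, grid, h=0.3)
```

In the setting of this example, `x` would be replaced by an estimated index $ \hat \beta_i^{\top }X $ and `y` by the filtered admissions.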

We may draw some preliminary conclusions about which weather-pollutant conditions are more likely to produce adverse effects on the human circulatory and respiratory systems. We have identified four conditions, which we list in descending order of importance as follows. (i) The main covariates in $ \hat \beta_1^{\top }X $ are nitrogen dioxide ($ x_2$), variation of temperature ($ v_5 $) and temperature ($ x_5 $), with coefficients 0.5394, -0.4652 and -0.6435 respectively. From Figures 3.9 (a) and (a$ '$), the first e.d.r. direction suggests that continuous cool days with a high nitrogen dioxide level constitute the most important condition. This kind of weather is very common in the winter in Hong Kong. (ii) The main covariates in $ \hat \beta_2^{\top }X $ are ozone ($ x_4$) and humidity ($ x_6$), with coefficients 0.6045 and -0.7544 respectively. Figures 3.9 (b) and (b$ '$) suggest that dry days with a high ozone level constitute the second most important condition. Dry days are very common in the autumn in Hong Kong. (iii) The main covariates in $ \hat \beta_3^{\top }X $ are nitrogen dioxide ($ x_2$) and the variation of temperature ($ v_5 $), with coefficients 0.7568 and 0.5232 respectively. Figures 3.9 (c) and (c$ '$) suggest that rapid temperature variation with a high level of nitrogen dioxide constitutes the third most important condition. Rapid temperature variations can take place at almost any time in Hong Kong. (iv) The main covariates in $ \hat \beta_4^{\top }X $ are the respirable suspended particulates ($ x_3$), the variation of temperature ($ v_5 $) and the level of temperature ($ x_5 $), with coefficients $ 0.8051 $, $ 0.3033 $ and $ -0.4745$ respectively. Figures 3.9 (d) and (d$ '$) suggest that a high particulate level with rapid temperature variation in the winter constitutes the fourth most important condition.

Although individually the levels of the major pollutants may be considered to be below the acceptable threshold values according to the National Ambient Quality Standard (NAQS) of the U.S.A., as shown in Figure 3.10, there is evidence to suggest that, given certain weather conditions which exist in Hong Kong, current levels of nitrogen dioxide, ozone and particulates in the territory already pose a considerable health risk to its citizens. Our results have points of contact with the analysis of Smith, Davis, and Speckman (1999), which focused on the effect of particulates on human health.

Figure 3.10: Histograms of pollutants in Hong Kong. The vertical lines are the NAQS threshold values, above which pollutants are considered to have a significant effect on the circulatory and respiratory systems. (We do not have the corresponding value for ozone.)
\includegraphics[width=1.6\defpicwidth]{hkhists.ps}

XEGfig11.xpl

Example 3.7.2. We continue with Example 3.1.2. The first model was fitted by Moran (1953):

$\displaystyle y_t = 1.05 + 1.41y_{t-1} - 0.77 y_{t-2} + \varepsilon_t,$     (3.46)

where $ \varepsilon_t \stackrel{i.i.d.}\sim N(0, 0.0459)$. The fit of this linear model is known to be inadequate (see, e.g., Tong (1990)). Let us consider a nonlinear (nonparametric) autoregressive model, say,
$\displaystyle y_t = {\sl g}(X_t) + \varepsilon_t,$     (3.47)

where $ X_t = (y_{t-1}, y_{t-2})^{\top } $.
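To make the linear benchmark (3.46) concrete, the following sketch simulates a series from that model and refits an AR(2) by ordinary least squares. The simulated series is only a stand-in, since the lynx data are not reproduced here. The sample size, the seed and the function names `simulate_ar2` and `fit_ar2` are assumptions made for illustration.

```python
import numpy as np

def simulate_ar2(n, c, a1, a2, sigma2, rng, burn_in=200):
    """Simulate y_t = c + a1*y_{t-1} + a2*y_{t-2} + eps_t with N(0, sigma2) noise."""
    y = np.zeros(n + burn_in)
    y[:2] = c / (1 - a1 - a2)              # start at the stationary mean
    eps = rng.normal(0.0, np.sqrt(sigma2), n + burn_in)
    for t in range(2, n + burn_in):
        y[t] = c + a1 * y[t - 1] + a2 * y[t - 2] + eps[t]
    return y[burn_in:]

def fit_ar2(y):
    """Least-squares fit of an AR(2) with intercept; returns coefficients, residual variance."""
    design = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])
    coef, *_ = np.linalg.lstsq(design, y[2:], rcond=None)
    resid = y[2:] - design @ coef
    return coef, resid.var()

rng = np.random.default_rng(3)
series = simulate_ar2(5000, 1.05, 1.41, -0.77, 0.0459, rng)
coef, resid_var = fit_ar2(series)          # order: intercept, lag 1, lag 2
```

With $ X_t = (y_{t-1}, y_{t-2})^{\top } $, the linear model restricts $ {\sl g}$ in (3.47) to a function of a single linear combination of the two lags; the nonparametric model removes that restriction.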


Table 3.5: Estimation of models for the lynx data in Example 3.7.2.
\begin{tabular}{lccc}
\hline
Model & SS of residuals & cv$ ^*$ & bandwidth by cv \\
\hline
$ y_t = {\sl g}_0(\hat \beta_1^{\top } X_t, \hat \beta_2^{\top } X_t) + \varepsilon_t$ & 0.0417 & 0.0475 & 0.5316 \\
$ y_t = {\sl g}_1(\hat \beta_1^{\top } X_t) + \varepsilon_t$ & 0.0475 & 0.0504 & 0.2236 \\
$ y_t - \hat {\sl g}_1(\hat \beta_1^{\top } X_t) = {\sl g}_2(\hat \beta_2^{\top } X_t) + \varepsilon_t$ & 0.0420 & 0.0449 & 0.5136 \\
$ y_t = \theta^{\top } X_t + {\sl g}(\beta^{\top } X_t) + \varepsilon_t $ & 0.0401 & 0.0450 & 0.2240 \\
\hline
\end{tabular}

$ ^*$ The data have not been standardised because we want to compare the results with Moran's (1953).


Now, using the dimension reduction method, we obtain the e.d.r. directions as $ \hat \beta_1 =(.87, -.49)^{\top } $ and $ \hat \beta_2= (.49, .87)^{\top } $. No dimension reduction is needed: the number of e.d.r. directions is 2. It is interesting to see that the first e.d.r. direction practically coincides with that of model (3.46). (Note that $ 1.41/(-0.77) \approx 0.87/(-0.49)$.) Without the benefit of the second direction, the inadequacy of model (3.46) is not surprising. Next, the sums of squares (SS) of the residuals listed in Table 3.5 suggest that an additive model can achieve almost the same SS of the residuals. Therefore, we may entertain an additive model of the form

$\displaystyle y_t = {\sl g}_1(\beta_1^{\top } X_t) + {\sl g}_2(\beta_2^{\top } X_t)
+\varepsilon_t.$     (3.48)

The estimated $ {\sl g}_1 $ and $ {\sl g}_2 $ are shown in Figures 3.11 (a) and (b) respectively. Note that $ {\sl g}_1 $ is nearly a linear function.

Figure 3.11: For Example 3.7.2. (a) and (b) are the estimated $ {\sl g}_1 $ and $ {\sl g}_2 $ respectively.
\includegraphics[width=1.6\defpicwidth]{lynxreg1.ps}
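Returning to the e.d.r. directions reported for this example, the claim that $ \hat \beta_1 $ practically coincides with the direction of model (3.46) can be checked by comparing the two unit vectors. The short calculation below uses only the coefficients quoted in the text; the helper `unit` is illustrative.

```python
import math

def unit(v):
    """Scale a 2-vector to unit length."""
    norm = math.hypot(*v)
    return tuple(c / norm for c in v)

ar_direction = unit((1.41, -0.77))      # lag coefficients of model (3.46)
edr_direction = unit((0.87, -0.49))     # first e.d.r. direction from the text

cosine = sum(a * b for a, b in zip(ar_direction, edr_direction))
angle_deg = math.degrees(math.acos(min(1.0, cosine)))

# The second reported direction is orthogonal to the first.
orthogonality = 0.87 * 0.49 + (-0.49) * 0.87

print(round(cosine, 4), round(angle_deg, 2), orthogonality)
```

The two directions differ by an angle of less than one degree, which is what the ratio comparison $ 1.41/(-0.77) \approx 0.87/(-0.49)$ expresses.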

Based on the above observation, it further seems reasonable to fit a generalized partially linear single-index model of the form (3.41). Using the method described in Example 3.6.3, we have fitted the model

$\displaystyle y_t = 1.1339y_{t-1} + 0.2420 y_{t-2} + {\sl g}( -0.2088 y_{t-1} + 0.9780 y_{t-2} ) + \varepsilon_t.$      

Comparing the sums of squares of the residuals in Table 3.5, we see that the partially linear single-index model fits the lynx data quite well. Separating out the linear part, $ {\sl g}$ may be written in the form
$\displaystyle {\sl g}( -0.2088 y_{t-1} + 0.9780 y_{t-2} ) = 1.0986 - 0.9794(-0.2088 y_{t-1} + 0.9780 y_{t-2} ) + {\sl g}_o(-0.2088 y_{t-1} + 0.9780 y_{t-2} ),$

where $ \textrm{Corr}( {\sl g}_o(-0.2088 y_{t-1} + 0.9780 y_{t-2} ), X_t) = 0 $. Then, the fitted partially linear single-index model can be written as
$\displaystyle y_t = 1.099 + 1.3384y_{t-1} - 0.716y_{t-2}+ {\sl g}_o( -0.209 y_{t-1} + 0.978 y_{t-2} ) + \varepsilon_t,$     (3.49)

which looks quite similar to model (3.46) except for $ {\sl g}_o$. Note that the coefficient of $ y_{t-1}$ in $ {\sl g}_o$ is small compared with that of $ y_{t-2}$. Therefore, model (3.49) can be further approximated by
$\displaystyle y_t \approx 1.3384y_{t-1} + f_o( y_{t-2} ) + \varepsilon_t.$      

This is close to the model suggested by Lin and Pourahmadi (1998) and clearly has points of contact with threshold models, where the ``thresholding'' is at lag 2 (Tong, 1990).
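The coefficients in model (3.49) follow from adding the linear part extracted from $ {\sl g}$ to the linear term of the fitted partially linear single-index model. The arithmetic can be verified directly from the numbers quoted above (the variable names are illustrative):

```python
theta = (1.1339, 0.2420)             # linear coefficients of y_{t-1}, y_{t-2}
beta = (-0.2088, 0.9780)             # single-index direction
intercept, slope = 1.0986, -0.9794   # linear part extracted from g

# Combined coefficient of each lag: theta_k + slope * beta_k.
combined = tuple(th + slope * b for th, b in zip(theta, beta))

print(round(intercept, 3), [round(c, 4) for c in combined])
```

Up to rounding, the combined coefficients reproduce the values 1.3384 and -0.716 in (3.49), which are close to the coefficients 1.41 and -0.77 of model (3.46).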