1.6 Impact of Heteroscedasticity and Correlation

In our climate example we used one fifth of all measurements in the year 1990. Figure 1.9 shows all measurements in 1990 and the periodic spline fits to all of them, with the smoothing parameter chosen by the GCV, GML and UBR criteria. The GCV and UBR criteria clearly underestimate the smoothing parameter, which leads to wiggly fits. What causes the GCV and UBR methods to break down?

**Figure 1.9:** *Points* are all measurements in 1990. *Lines* are the periodic spline estimates. The methods for selecting the smoothing parameter are indicated in strips

**Figure 1.10:** *Circles* are observations. *Dotted lines* are true functions. *Solid lines* are the estimated functions. Simulation settings and model selection criteria are marked in strips

In model (1.2) we have assumed that the random errors are iid with mean zero and variance $\sigma^2$. The middle panel of Fig. 1.1 indicates that the variation of the maximum temperature is larger during the winter. Also, temperatures close in time may be correlated. Thus the assumptions of homoscedasticity and independence may not hold. What kind of impact, if any, do these potential violations have on the model selection procedures?

For illustration, we again consider two simulations, one with heteroscedastic and one with auto-correlated random errors. We use the same function and design points as the simulation in Sect. 1.2, with the true function shown in the left panel of Fig. 1.4. For heteroscedasticity, we generate random errors $\epsilon_i \sim N(0,((i+36.5)/147)^2)$, $i=1,\ldots,73$, so that the variance increases with $i$. For correlation, we generate the $\epsilon_i$'s as a first-order autoregressive process with mean zero, standard deviation 0.5 and first-order correlation 0.5. The first and second rows in Fig. 1.10 show the fits by the trigonometric model with cross-validation, $\mathrm{BIC}$ and $\mathrm{C}_p$ choices of the order under heteroscedastic and auto-correlated random errors, respectively, without any adjustment for the heteroscedasticity or correlation. The third and fourth rows in Fig. 1.10 show the fits by the periodic spline with GCV, GML and UBR choices of the smoothing parameter under the same two error settings, again without adjustment. These kinds of fits are typical under the two simulation settings. Heteroscedasticity has some effect on model selection, but it is far less severe than the impact of auto-correlation. It is well known that positive auto-correlation leads to under-smoothing for non-parametric models with data-driven choices of the smoothing parameter [40, 53]. Figure 1.10 shows that the same problem exists for parametric regression models as well.
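The two error-generating mechanisms described above can be sketched in Python. This is a minimal illustration and not the code used to produce Fig. 1.10: the function names are hypothetical, and the AR(1) sketch assumes that "standard deviation 0.5" refers to the stationary (marginal) standard deviation, so that the innovations have standard deviation $0.5\sqrt{1-0.5^2}$.

```python
import math
import random


def error_sd(i):
    """Standard deviation of the i-th heteroscedastic error, (i + 36.5) / 147."""
    return (i + 36.5) / 147.0


def heteroscedastic_errors(n=73, seed=None):
    """Independent errors eps_i ~ N(0, ((i + 36.5) / 147)^2), i = 1, ..., n."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, error_sd(i)) for i in range(1, n + 1)]


def ar1_errors(n=73, rho=0.5, sd=0.5, seed=None):
    """Stationary AR(1) errors with mean zero.

    Assumption: `sd` is the marginal standard deviation of each eps_i and
    `rho` is the lag-one correlation.
    """
    rng = random.Random(seed)
    # Innovation sd chosen so that every eps_t has marginal sd equal to `sd`.
    innov_sd = sd * math.sqrt(1.0 - rho ** 2)
    eps = [rng.gauss(0.0, sd)]  # start the chain in its stationary distribution
    for _ in range(1, n):
        eps.append(rho * eps[-1] + rng.gauss(0.0, innov_sd))
    return eps
```

With $n = 73$ the heteroscedastic standard deviations rise linearly from about 0.26 at $i = 1$ to about 0.74 at $i = 73$ and equal 0.5 at $i = 37$, so both settings have a comparable overall noise level.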

The breakdown of the GCV and UBR criteria for the climate data is likely caused by auto-correlation, which is higher when all daily measurements are used as observations. Extensions of the GCV, GML and UBR criteria for correlated data can be found in [53].
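The effect of ignoring positive auto-correlation on a data-driven order choice can be explored with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration rather than the setup behind Fig. 1.10: the true function $\sin(2\pi x)$ and the equally spaced design points $x_i = i/n$ are hypothetical stand-ins, the criterion is a Mallows-type score $\mathrm{RSS}/n + 2\sigma^2(2m+1)/n$ with $\sigma^2$ treated as known, and the AR(1) errors are taken to have marginal standard deviation 0.5.

```python
import math
import random


def select_order_cp(y, sigma2, max_order=10):
    """Order m of the trigonometric model minimizing RSS/n + 2*sigma2*(2m+1)/n.

    Assumes equally spaced points x_i = i/n, for which the basis functions
    1, cos(2*pi*k*x), sin(2*pi*k*x) with k < n/2 are mutually orthogonal, so
    least-squares coefficients are plain projections.
    """
    n = len(y)
    x = [(i + 1) / n for i in range(n)]
    mean_y = sum(y) / n
    resid = [yi - mean_y for yi in y]  # residuals of the order-0 (constant) fit
    best_order, best_score = 0, None
    for m in range(max_order + 1):
        if m > 0:
            c = [math.cos(2 * math.pi * m * xi) for xi in x]
            s = [math.sin(2 * math.pi * m * xi) for xi in x]
            a = 2.0 / n * sum(r * ci for r, ci in zip(resid, c))
            b = 2.0 / n * sum(r * si for r, si in zip(resid, s))
            # Remove the frequency-m component from the current residuals.
            resid = [r - a * ci - b * si for r, ci, si in zip(resid, c, s)]
        score = sum(r * r for r in resid) / n + 2.0 * sigma2 * (2 * m + 1) / n
        if best_score is None or score < best_score:
            best_order, best_score = m, score
    return best_order


def mean_selected_order(rho, n=73, sd=0.5, reps=200, seed=0):
    """Average selected order over `reps` data sets with AR(1) errors.

    rho = 0 gives independent errors. The true function is hypothetical.
    """
    rng = random.Random(seed)
    truth = [math.sin(2 * math.pi * (i + 1) / n) for i in range(n)]
    innov_sd = sd * math.sqrt(1.0 - rho ** 2)
    total = 0
    for _ in range(reps):
        eps = [rng.gauss(0.0, sd)]
        for _ in range(1, n):
            eps.append(rho * eps[-1] + rng.gauss(0.0, innov_sd))
        y = [f + e for f, e in zip(truth, eps)]
        # The criterion uses the marginal variance and ignores the correlation.
        total += select_order_cp(y, sd ** 2)
    return total / reps
```

Comparing `mean_selected_order(0.0)` with `mean_selected_order(0.5)` gives a rough sense of how much larger the selected order tends to be when the correlation is ignored; the hypothetical true function here needs only order 1.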