9.4 Testing in Additive Models, GAM, and GAPLM

For testing the GAM or GAPLM specification we concentrate on presenting a general approach and discuss a typical testing procedure similar to that in Chapter 7.

In order to cover the most complex case, we now focus on the GAPLM. The modification of the test procedure to simpler models is straightforward.

Certainly, the most interesting testing problem is that of the specification of a single component function. Let us consider null hypotheses which assume a polynomial structure for $ g_\alpha$ ($ \alpha$ fixed). The simplest such polynomial is the zero function: testing $ H_0 : g_\alpha \equiv 0$ means testing whether $ T_\alpha$ has a significant impact on the response. The alternative is an arbitrary functional form.

We explain the test procedure using the example of a linear hypothesis for the first component function $ g_1$. This means

$\displaystyle H_0 : \ g_1 (t_1) = \gamma \cdotp t_1, \quad \textrm{ for all } t_1.
$

As the procedure and motivation for each step are the same as in Chapter 7, we condense our presentation to the most essential steps. Recall the GAPLM:

$\displaystyle E(Y\vert {\boldsymbol{U}},{\boldsymbol{T}}) = G\{{\boldsymbol{U}}^\top {\boldsymbol{\beta}}+ c + g_1 (T_{1}) +
g_2(T_{2}) + \cdots + g_q(T_{q}) \}. $

We already know that a direct comparison of the parametric estimate

$\displaystyle \widetilde g_1(t_1) = \widehat \gamma \, t_1$

and the nonparametric estimate $ \widehat g_1 (t_1)$ causes the problem of comparing two functional estimates with biases of different magnitudes. To avoid this discrepancy we replace $ \widetilde g_1$ with a bootstrap estimate $ E^*\widehat g^*_1 $ that takes the bias into account. $ E^*\widehat g^*_1 $ is the bootstrap expectation, given the data $ (Y_i,{\boldsymbol{U}}_i,{\boldsymbol{T}}_i)$ ($ i=1,\ldots,n$), of the nonparametric estimates of $ g_1$ obtained under $ H_0$ from bootstrap samples $ (Y_i^*,{\boldsymbol{U}}_i,{\boldsymbol{T}}_i)$ ($ i=1,\ldots,n$). The $ Y^*_i$ are generated under the $ H_0$ model as in Chapter 7.
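To make the quantity $ E^*\widehat g^*_1$ concrete, the following minimal sketch (Python/NumPy, not part of the original text) shows how the bootstrap expectation could be approximated once the nonparametric estimates $ \widehat g^*_1(T_{i1})$ from the individual bootstrap samples are available; the function name and array layout are assumptions made for illustration only.

import numpy as np

def bootstrap_expectation_g1(g1_star):
    """Approximate E* g^*_1(T_i1) by averaging over bootstrap replications.

    g1_star : array of shape (n_boot, n); row b holds the nonparametric
              estimate g^*_1 evaluated at T_11, ..., T_n1, computed from
              the b-th bootstrap sample (Y*_i, U_i, T_i) drawn under H0.
    """
    g1_star = np.asarray(g1_star)
    return g1_star.mean(axis=0)  # pointwise average over the bootstrap samples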

Estimation under $ H_0$ means that we consider the model

$\displaystyle E(Y\vert {\boldsymbol{U}},{\boldsymbol{T}}) = G \{ {\boldsymbol{U}}^\top {\boldsymbol{\beta}}+ b + \gamma T_{1} +
g_2(T_{2}) + \cdots + g_q(T_{q}) \}. $

The constant $ b$ in this equation can be different from $ c$, because the function $ \gamma T_{1}$ is not necessarily centered as we assumed for $ g_1(T_1)$. The estimation of the parametric components $ {\boldsymbol{\beta}}$, $ b$ and $ \gamma$ as well as of the components $ g_2, \ldots, g_q$ is performed as presented in Section 9.3.2.

We define the test statistic in analogy to (7.34):

$\displaystyle \widetilde{LR} = \sum\limits^n_{i=1} w({\boldsymbol{T}}_{i})\,
\frac{\left[ G'\{{\boldsymbol{U}}_i^\top\widehat{\boldsymbol{\beta}}
+\widehat m({\boldsymbol{T}}_i)\}\right]^2}{V\{\widehat\mu_i\}}\, \left\{ \widehat{g}_1 (T_{i1}) - E^*
\widehat g_1^*(T_{i1}) \right\}^2 , $

where $ \widehat m({\boldsymbol{T}}_i) = \widehat c + \widehat g_1(T_{i1}) + \ldots
+\widehat g_q(T_{iq})$ and $ \widehat\mu_i=G\{{\boldsymbol{U}}_i^\top\widehat{\boldsymbol{\beta}}
+\widehat m({\boldsymbol{T}}_i)\}$. The function $ w$ defines trimming weights to obtain numerically stable estimators on the range of $ {\boldsymbol{T}}$ that is of interest.

Härdle et al. (2004) prove that (under some regularity assumptions) the test statistic $ \widetilde{LR}$ has an asymptotic normal distribution under $ H_0$. As in the GPLM case, the convergence of the test statistic is very slow. Therefore we prefer the bootstrap approximation of the quantiles of $ \widetilde{LR}$. The approach is analogous to that in Subsection 7.3.2. We will now study what the test implies for our example on migration intention.
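A sketch of how the bootstrap approximation of the critical values might be organized in practice follows; it is purely illustrative, the resampling assumes a binary response as in the migration example, and refit_statistic stands for a user-supplied routine (hypothetical, not from the text) that re-estimates the GAPLM and the $ H_0$ model on a bootstrap sample and returns $ \widetilde{LR}^*$.

import numpy as np

def bootstrap_test(lr_obs, mu_H0, refit_statistic, n_boot=499, seed=None):
    """Approximate the H0 distribution of LR~ by resampling.

    lr_obs          : observed test statistic LR~
    mu_H0           : fitted means under H0, used to draw Y* (binary response)
    refit_statistic : callable; given a bootstrap response vector Y*, re-estimates
                      the model and returns LR~* (user-supplied, hypothetical)
    """
    rng = np.random.default_rng(seed)
    lr_boot = np.empty(n_boot)
    for b in range(n_boot):
        y_star = rng.binomial(1, mu_H0)        # draw Y* under the fitted H0 model
        lr_boot[b] = refit_statistic(y_star)   # recompute the statistic
    # bootstrap p-value: proportion of LR~* at least as large as LR~
    return np.mean(lr_boot >= lr_obs)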

EXAMPLE 9.3  
We continue Example 9.2 but concentrate now on testing whether the nonlinearities found for the impact of AGE and INCOME are significant.

As a test statistic we compute $ \widetilde{LR}$ and derive its critical values from the bootstrap test statistics $ \widetilde{LR}^*$. The bootstrap sample size is set to $ n_{boot}=499$ replications; all other parameters are set to the values of Example 9.2. We find the following results: For AGE, linearity is rejected at the 1% level for all bandwidths that we used. This result may be surprising, but a closer inspection of the numerical results shows that the bootstrap estimates under the hypothesis show almost no variation. Consequently, even a slight deviation from linearity leads to the rejection of $ H_0$.

This is different for the variable INCOME: here the bootstrap estimates do vary. Linearity is rejected at the 2% level for $ h=0.75$ and at the 1% level for $ h=1.0$. Note that this does not contradict the results in Chapter 7, as the results here are based on different samples and models. $ \Box$

Let us consider a second example. This example is interesting since some of the results seem contradictory at first glance. However, we have to take into account that nonparametric methods may not reveal their power when the sample size is too small.

EXAMPLE 9.4  
We use again the data of Proença & Werwatz (1995), already introduced in Example 6.1. The data are a subsample of $ 462$ individuals from the first nine waves of the GSOEP, comprising all individuals who completed an apprenticeship between 1985 and 1992.


Table 9.4: Logit and GAPLM coefficients for unemployment data

                     GLM (logit)               GAPLM
                 Coefficients      S.E.        Coefficients
FEMALE              -0.3651       0.3894         -0.3962
AGE                  0.0311       0.1144            --
SCHOOL               0.0063       0.1744          0.0452
EARNINGS            -0.0009       0.0010            --
CITY SIZE           -5.e-07       4.e-07            --
FIRM SIZE           -0.0120       0.4686         -0.1683
DEGREE              -0.0017       0.0021            --
URATE                0.2383       0.0656            --
constant            -3.9849       2.2517         -2.8949

(--: variable enters the GAPLM nonparametrically, see Figure 9.6)

We are interested in the question of which factors cause unemployment after the apprenticeship. In contrast to Example 6.1 we use a larger set of explanatory variables:

$ U_1$   female (1 if yes),
$ U_2$   years of school education,
$ U_3$   firm size ($ 1$ if large firm),
$ T_1$   age of the person,
$ T_2$   earnings as an apprentice (in DM),
$ T_3$   city size (in $ 100,000$ inhabitants),
$ T_4$   degree of apprenticeship, i.e., the percentage of people apprenticed in a certain occupation, divided by the percentage of people employed in this occupation in the entire economy, and
$ T_5$   unemployment rate in the particular county the apprentice is living in.

Here, SCHOOL and AGE represent human capital, EARNINGS represents the value of an apprenticeship, and CITY SIZE could be of interest since larger cities often offer more employment opportunities. The variable URATE fits into this context as well. FIRM SIZE tells us whether, for example, small firms provide more apprenticeship positions than the number of workers they retain after the apprenticeship is completed. Density plots of the continuous variables are given in Figure 9.5.

We estimate a parametric logit model and the corresponding GAPLM to compare the results. Table 9.4 reports the coefficient estimates for both models; for the parametric model, standard errors are also given. The nonparametric function estimates are plotted in Figure 9.6. We used bandwidths obtained by inflating the standard deviations of the variables by certain factors.
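For concreteness, a bandwidth rule of the kind just described could be coded as follows; this is an illustrative sketch only, and the inflation factors are left unspecified exactly as in the description above.

import numpy as np

def inflated_bandwidths(T, factors):
    """Bandwidths h_alpha = factor_alpha * std(T_alpha) for the continuous covariates.

    T       : (n, q) array of continuous covariates
    factors : length-q sequence of inflation factors (values not given in the text)
    """
    T = np.asarray(T)
    return np.asarray(factors) * T.std(axis=0, ddof=1)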

Figure 9.5: Density plots for some of the explanatory variables

Looking at the function estimates, we get the impression of strong nonlinearity in all variables except URATE. In contrast, the test results show that the linearity hypothesis cannot be rejected for any of the variables at significance levels from 1% to 20%.

What does this mean? It turns out that the parametric logit coefficients for all variables (except the constant and URATE) are already insignificant, see Table 9.4. As we now see, this is not due to misspecification of the parametric logit model. It seems that the data can be explained by neither the parametric logit model nor the semiparametric GAPLM. Possible reasons are the small sample size or the lack of appropriate explanatory variables. Let us also remark that the density plots indicate that nonparametric function estimation could be problematic for all variables except DEGREE and URATE. $ \Box$

Figure 9.6: Estimates of additive components for unemployment data

The semiparametric approach to partial linear models can already be found in Green & Yandell (1985). The form in which we present it here was developed by Speckman (1988) and Robinson (1988b). A variety of generalized models, in particular backfitting algorithms, can be found in the monograph of Hastie & Tibshirani (1990).

The literature on marginal integration for generalized models is very recent. Particularly interesting is the combination of marginal integration and backfitting to yield efficient estimators as discussed in Linton (2000).

Variants of GAM and GAPLM are the extension to parametric nonlinear components (Carroll et al., 2002), nonparametric components of single index form (Carroll et al., 1997), models with nonparametric link and nonparametric component functions (Horowitz, 1998a), and weak separable models (Mammen & Nielsen, 2003; Rodríguez-Póo et al., 2003).

For further reading on hypothesis testing we refer to Gozalo & Linton (2001) and Yang et al. (2003).

EXERCISE 9.1   What would be a proper algorithm using backfitting in the semiparametric APLM when the dimension $ q$ (of the nonparametric part) is bigger than one?

EXERCISE 9.2   Discuss the following question for the semiparametric APLM: Why don't we use the approach of Speckman (1988) (Chapter 7) to obtain an estimate for $ {\boldsymbol{\beta}}$ and afterwards apply backfitting and/or marginal integration to the nonparametric part? a) Does this work? b) What follows for the properties of the estimator of $ {\boldsymbol{\beta}}$?

EXERCISE 9.3   Recall equation (9.13). Why is $ \widehat {\boldsymbol{\gamma}}_2$ from (9.13) inefficient for estimating $ {\boldsymbol{\beta}}$?

EXERCISE 9.4   Recall Subsection 8.2.3 (estimation of interaction terms) and Section 9.4 (testing). Construct a test for additivity in additive models using only procedures from these two sections.

EXERCISE 9.5   Again recall the estimation of interaction terms from Subsection 8.2.3. In Subsection 9.2.2 we introduced a direct extension of marginal integration to the GAM. In Section 9.3.2 we extended it, in a less straightforward way, to the GAPLM. How can interaction terms be incorporated into these models? What would a test for interaction look like?

EXERCISE 9.6   In Chapter 8 we discussed in detail the differences between backfitting and marginal integration and indicated that their interpretations carry over to the GAM and GAPLM. Based on these differences, construct a test for separability of the impact of $ T_1$ and $ T_2$. What would a general test of additivity look like?

EXERCISE 9.7   Think of a general test to compare multidimensional regression against an additive model structure. Discuss the construction of such a test, its advantages and disadvantages.


Summary
$ \ast$
The nonparametric additive components in all these extensions of simple additive models can be estimated at the rate that is typical for one-dimensional smoothing.
$ \ast$
An additive partial linear model (APLM) is of the form

$\displaystyle E(Y\vert {\boldsymbol{U}}, {\boldsymbol{T}}) = {\boldsymbol{U}}^\top {\boldsymbol{\beta}}+c+\sum_{\alpha =1}^q g_\alpha
(T_\alpha ) \ . $

Here, $ {\boldsymbol{\beta}}$ and $ c$ can be estimated with the parametric rate $ \sqrt{n}$. While the marginal integration estimator in the suggested procedure requires undersmoothing, it is still unclear how to proceed with backfitting when $ q>1$.
$ \ast$
A generalized additive model (GAM) has the form

$\displaystyle E(Y\vert {\boldsymbol{X}}) = G \{ c+\sum_{\alpha =1}^d g_\alpha (X_\alpha ) \} $

with a (known) link function $ G$. To estimate this model with backfitting we combine local scoring with the Gauss-Seidel algorithm; theory is lacking here. Using marginal integration we obtain a closed-form expression for the estimator, for which asymptotic theory can also be derived.
$ \ast$
The generalized additive partial linear model (GAPLM) is of the form

$\displaystyle E(Y\vert{\boldsymbol{U}},{\boldsymbol{T}}) = G\left\{ {\boldsymbol{U}}^\top {\boldsymbol{\beta}}+c+\sum_{\alpha=1}^q
g_\alpha (T_\alpha ) \right\} $

with a (known) link function $ G$. In the parametric part, $ {\boldsymbol{\beta}}$ and $ c$ can again be estimated with the $ \sqrt{n}$-rate. For backfitting we combine the estimation in the APLM with the local scoring algorithm; but again the case $ q>1$ is unclear and no theory has been provided. For the marginal integration approach we combine the quasi-likelihood procedure with marginal integration applied afterwards.
$ \ast$
In all of the models considered we have so far only developed theory for marginal integration. Interpretation, advantages and drawbacks remain the same as discussed in the context of additive models (AM).
$ \ast$
We can perform test procedures on the additive components separately. Due to the complex structure of the estimation procedures we have to apply (wild) bootstrap methods.