For testing the GAM or GAPLM specification we concentrate on a general approach and discuss a typical test procedure, which is similar to that in Chapter 7.
To cover the most complex case, we focus on the GAPLM. Modifying the test procedure for simpler models is straightforward.
Certainly, the most interesting testing problem is that of
testing the specification of a single component function.
Let us consider a null hypothesis which assumes a polynomial
structure for $g_\alpha$ ($\alpha$ fixed). For example,
the null function $g_\alpha \equiv 0$ is the simplest polynomial. Testing
$H_0: g_\alpha \equiv 0$ means testing for a significant
impact of $T_\alpha$ on the response. The alternative
is an arbitrary functional form.
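We explain the test procedure using the example of a
linear hypothesis for the first component function $g_1$. This means
$$H_0: g_1(T_1) = \gamma_1 T_1 .$$
Estimation under $H_0$ means that we consider the model
$$E(Y \mid U, T) = G\{U^\top \beta + c + \gamma_1 T_1 + g_2(T_2) + \ldots + g_q(T_q)\} .$$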
We define the test statistic in analogy to (7.34):
Härdle et al. (2004) prove that (under some regularity assumptions)
the test statistic has an asymptotic normal distribution under $H_0$.
As in the GPLM case, the convergence of the test statistic is very slow.
Therefore we prefer the bootstrap approximation of the
quantiles of the test statistic. The approach here is analogous
to that in Subsection 7.3.2. We will now
study what the test implies for our example on migration
intention.
We compute the test statistic on the original sample and derive
its critical values from the corresponding bootstrap test statistics.
The bootstrap sample size is set to
replications,
all other parameters were set to the values of Example 9.2.
We find the following results: For AGE,
linearity is always rejected at the 1% level,
that is, for all bandwidths that we used. This result may
be surprising, but a closer inspection of the numerical results
shows that the bootstrap estimates generated under the hypothesis
have almost no variation. In consequence, even a slight deviation
from linearity already leads to the rejection of $H_0$.
This is different for the variable INCOME. The bootstrap
estimates vary in this case. Here, linearity is rejected at
the 2% level for and at 1% level for
.
Note that this is not in contradiction to the results in
Chapter 7 as the results here are based on different
samples and models.
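The logic of such a bootstrap linearity test can be illustrated with a minimal sketch. It is not the GAPLM procedure used for the results above: we assume a logit model with a single continuous covariate, a Nadaraya-Watson smoother, and a simplified squared-distance statistic that compares the smoothed responses with the equally smoothed fit under $H_0$. All function names, the simulated data, the bandwidth and the number of bootstrap replications are hypothetical choices made for illustration only.

```python
import numpy as np

def fit_logit(X, y, n_iter=50, tol=1e-8):
    """Maximum-likelihood logit fit by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        # Tiny ridge term keeps the Hessian invertible (numerical safeguard).
        hess = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def smoother_matrix(t, h):
    """Nadaraya-Watson weights with a Gaussian kernel; each row sums to one."""
    k = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)

def linearity_statistic(t, y, S):
    """Distance between smoothed responses and the smoothed H0 fit.

    Smoothing the parametric fit as well puts the same smoothing bias on
    both sides of the comparison. This is a simplified statistic chosen for
    the sketch, not the statistic defined in the text.
    """
    X = np.column_stack([np.ones_like(t), t])
    beta = fit_logit(X, y)
    p0 = 1.0 / (1.0 + np.exp(-(X @ beta)))      # fitted probabilities under H0
    m_np = S @ y                                # nonparametric estimate
    m_h0 = np.clip(S @ p0, 1e-6, 1.0 - 1e-6)    # smoothed H0 fit
    stat = np.sum((m_np - m_h0) ** 2 / (m_h0 * (1.0 - m_h0)))
    return stat, p0

def bootstrap_linearity_test(t, y, h, n_boot=99, seed=0):
    """Parametric bootstrap: resample responses from the fitted H0 model."""
    rng = np.random.default_rng(seed)
    S = smoother_matrix(t, h)
    stat, p0 = linearity_statistic(t, y, S)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        y_star = rng.binomial(1, p0).astype(float)
        boot[b], _ = linearity_statistic(t, y_star, S)
    # Share of bootstrap statistics at least as large as the observed one.
    p_value = (1 + np.sum(boot >= stat)) / (n_boot + 1)
    return stat, boot, p_value

# Simulated data with a clearly nonlinear (quadratic) index; hypothetical values.
rng = np.random.default_rng(42)
n = 300
t = rng.uniform(-2.0, 2.0, n)
eta_true = 1.5 * (t ** 2 - 1.0)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta_true))).astype(float)

stat, boot, p_value = bootstrap_linearity_test(t, y, h=0.4)
print(f"statistic = {stat:.2f}, bootstrap p-value = {p_value:.3f}")
```

Because the bootstrap responses are generated from the model fitted under $H_0$, the spread of the bootstrap statistics reflects only what is compatible with linearity. If that spread is very small, even a minor deviation in the original sample produces a rejection, which is the effect observed for AGE above.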
Let us consider a second example. This example is interesting since some of the results seem contradictory at first glance. However, we have to take into account that nonparametric methods may not reveal their power when the sample size is too small.
Variable | GLM (logit) Coefficient | GLM (logit) S.E. | GAPLM Coefficient
FEMALE | -0.3651 | 0.3894 | -0.3962
AGE | 0.0311 | 0.1144 | --
SCHOOL | 0.0063 | 0.1744 | 0.0452
EARNINGS | -0.0009 | 0.0010 | --
CITY SIZE | -5e-07 | 4e-07 | --
FIRM SIZE | -0.0120 | 0.4686 | -0.1683
DEGREE | -0.0017 | 0.0021 | --
URATE | 0.2383 | 0.0656 | --
constant | -3.9849 | 2.2517 | -2.8949
We are interested in the question of which factors cause unemployment after the apprenticeship. In contrast to Example 6.1 we use a larger set of explanatory variables: FEMALE, AGE, SCHOOL, EARNINGS, CITY SIZE, FIRM SIZE, DEGREE, and URATE.
Here, SCHOOL and AGE represent human capital, EARNINGS represents the value of an apprenticeship, and CITY SIZE could be interesting since larger cities often offer more employment opportunities. The variable URATE fits into the same line of reasoning. FIRM SIZE tells us whether, for example, small firms provide more apprenticeship positions than the number of workers they retain after the apprenticeship is completed. Density plots of the continuous variables are given in Figure 9.5.
We estimate a parametric logit model and the corresponding GAPLM to compare the results. Table 9.4 reports the coefficient estimates for both models; for the parametric model standard errors are also given. The nonparametric function estimates are plotted in Figure 9.6. The bandwidths were obtained by inflating the standard deviations of the variables by certain factors.
Looking at the function estimates we get the impression of strong nonlinearity in all variables except URATE. In contrast, the test results show that the linearity hypothesis cannot be rejected for any variable at significance levels from 1% to 20%.
What does this mean?
It turns out that the parametric logit coefficients
for all variables (except the constant and URATE) are already
insignificant, see Table 9.4.
As we see now, this is not due to a misspecification
of the parametric logit model. It seems that the data can
be explained neither by the parametric logit model nor by the semiparametric
GAPLM. Possible reasons are an insufficient sample size
or a lack of appropriate explanatory variables.
Let us also remark that the
density plots suggest that applying nonparametric
function estimates could be problematic
except for DEGREE and URATE.
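The statement about the parametric logit coefficients can be checked directly from Table 9.4. The following sketch computes Wald ratios (coefficient divided by standard error) and two-sided p-values from the standard normal distribution. The normal approximation and the helper name `wald_test` are assumptions of this illustration. Several standard errors in the table are heavily rounded, so the ratios are only approximate.

```python
import math

# (coefficient, standard error) of the parametric logit fit, from Table 9.4.
logit_table = {
    "FEMALE": (-0.3651, 0.3894),
    "AGE": (0.0311, 0.1144),
    "SCHOOL": (0.0063, 0.1744),
    "EARNINGS": (-0.0009, 0.0010),
    "CITY SIZE": (-5e-07, 4e-07),
    "FIRM SIZE": (-0.0120, 0.4686),
    "DEGREE": (-0.0017, 0.0021),
    "URATE": (0.2383, 0.0656),
    "constant": (-3.9849, 2.2517),
}

def wald_test(coef, se):
    """Return the Wald ratio z = coef / se and its two-sided normal p-value."""
    z = coef / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

results = {name: wald_test(coef, se) for name, (coef, se) in logit_table.items()}

for name, (z, p) in results.items():
    print(f"{name:10s} z = {z:7.3f}   p = {p:.4f}")
```

Under this approximation only URATE is clearly significant, the constant is borderline (between the 5% and 10% levels), and all remaining ratios are below 1.3 in absolute value.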
The semiparametric approach to partial linear models can already be found in Green & Yandell (1985). The way we present it here was developed by Speckman (1988) and Robinson (1988b). A variety of generalized models can be found in the monograph of Hastie & Tibshirani (1990). This concerns in particular backfitting algorithms.
The literature on marginal integration for generalized models is very recent. Particularly interesting is the combination of marginal integration and backfitting to yield efficient estimators as discussed in Linton (2000).
Variants of GAM and GAPLM include extensions to parametric nonlinear components (Carroll et al., 2002), nonparametric components of single index form (Carroll et al., 1997), models with nonparametric link and nonparametric component functions (Horowitz, 1998a), and weakly separable models (Mammen & Nielsen, 2003; Rodríguez-Póo et al., 2003).
For further reading on hypothesis testing we refer to Gozalo & Linton (2001) and Yang et al. (2003).