6.5 Statistical Evaluation and Presentation


6.5.1 Statistical Characteristics


stat = gplmstat (code, x, t, y, h, b, bv, m, df{, opt})
computes statistical characteristics for an estimated GPLM

gplmest provides a number of statistical characteristics of the estimated model in the output component stat. The quantlet gplmstat can be used to compute the above-mentioned statistics by hand. Suppose we have input x, y and have estimated the coefficient vector b (with covariance bv) and the nonparametric curve m for model "nopow". Then the list of statistics is obtained from

  stat=gplmstat("nopow",x,y,b,bv,m,df)
Of course, a list of options opt can be added at the end. If options from opt have been used for the estimation, they should be passed to gplmstat as well.

The following characteristics are contained in the output stat. This is itself a list and covers the components

df
approximate degrees of freedom according to Hastie and Tibshirani (1990).
deviance
the deviance of the estimated model.
pearson
the Pearson statistic.
loglik
the log-likelihood of the estimated model, using the estimated dispersion parameter.
dispersion
an estimate for the dispersion parameter (deviance/df).
aic, bic
Akaike's AIC and Schwarz' BIC criterion, respectively.
r2, adr2
the (pseudo) coefficient of determination and its adjusted version, respectively.
it
the number of iterations needed.
ret
the return code, which is 0 if everything went without problems, 1 if the maximal number of iterations was reached, and negative if missing values have been encountered.

Occasionally a statistic may be unavailable because it was not applicable. This can always be checked by listing the components of stat:

  names(stat)
The quantlet names will report all components of the list stat.
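To make the roles of these quantities concrete, the following Python sketch computes a few of the stat components (deviance, Pearson statistic, log-likelihood, AIC, BIC and a pseudo coefficient of determination) for a binary-response model, given fitted probabilities mu and approximate degrees of freedom df. All function and variable names here are our own, and the pseudo-R² shown is the McFadden variant, which need not coincide exactly with the definition XploRe uses.

```python
import math

def binary_model_stats(y, mu, df):
    """Illustrative computation of gplmstat-style statistics for a
    binary response y with fitted probabilities mu and (approximate)
    degrees of freedom df. Hypothetical names, not XploRe code."""
    n = len(y)
    eps = 1e-12
    # log-likelihood of the fitted model
    loglik = sum(yi * math.log(max(mi, eps)) + (1 - yi) * math.log(max(1 - mi, eps))
                 for yi, mi in zip(y, mu))
    # for binary y the saturated model has likelihood 1, so deviance = -2*loglik
    deviance = -2.0 * loglik
    # Pearson statistic: sum of squared Pearson residuals
    pearson = sum((yi - mi) ** 2 / max(mi * (1 - mi), eps) for yi, mi in zip(y, mu))
    # null model: constant fit at the sample mean of y
    ybar = min(max(sum(y) / n, eps), 1 - eps)
    loglik0 = n * (ybar * math.log(ybar) + (1 - ybar) * math.log(1 - ybar))
    return {
        "deviance": deviance,
        "pearson": pearson,
        "loglik": loglik,
        "aic": -2.0 * loglik + 2.0 * df,
        "bic": -2.0 * loglik + math.log(n) * df,
        "r2": 1.0 - loglik / loglik0,  # McFadden pseudo-R^2 (one common choice)
    }
```

The dispersion estimate deviance/df and the adjusted R² follow directly from these quantities.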


6.5.2 Output Display


gplmout (code, x, t, y, h, b, bv, m, stat{, opt})
creates a nice output display for an estimated GPLM

An output display containing statistical characteristics and a plot of the fitted link function can be obtained by gplmout.

Recall our example from Section 6.3:

  opt=gplmopt("meth",1,"shf",1)
  opt=gplmopt("xvars",xvars,opt)
  opt=gplmopt("tg",grid(0|0,0.05|0.05,21|21),opt)
  g=gplmest("bilo",x,t,y,h,opt)
The optional component xvars will be used in the output display:
  gplmout("bilo",x,t,y,h,g.b,g.bv,g.m,g.stat,opt)
XAGgplm03.xpl

produces the output given in Figure 6.3.

Figure 6.3: GPLM output display.

The optional parameters that can be used to modify the result of gplmout can be found in Subsection 6.4.7.


6.5.3 Model Selection


g = gplmbootstraptest (code, x, t, y, h, nboot{, opt})
tests a GLM against the GPLM

To assess the estimated model it might be useful to check the significance of single parameter values or of linear combinations of parameters. To compare two different nested models, a likelihood ratio (LR) test can be performed using the test statistic

$\displaystyle R = 2\sum\limits^n_{i=1} \left\{ L(\widehat{\mu}_i,y_i) - L(\widetilde{\mu}_i,y_i) \right\}.$ (6.5)

Here we denote the GLM fit by $ \widetilde{\mu}$ and the GPLM fit by $ \widehat{\mu}$. This approach corresponds fully to the parametric case, except that for the GPLM the approximate degrees of freedom have to be used. Please consult the corresponding subsections of the GLM tutorial for more information on the LR test.
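For a binary response, the statistic (6.5) can be written down directly from the per-observation log-likelihood contributions of the two fits. The following is a hypothetical Python illustration (the function name and the clipping constant are our own, not XploRe code):

```python
import math

def lr_statistic(y, mu_gplm, mu_glm):
    """Illustrative version of (6.5) for a binary response:
    R = 2 * sum_i [ L(muhat_i, y_i) - L(mutilde_i, y_i) ],
    where muhat is the GPLM fit and mutilde the GLM fit."""
    eps = 1e-12  # guard against log(0)

    def L(mu, yi):
        # Bernoulli log-likelihood contribution of one observation
        return yi * math.log(max(mu, eps)) + (1 - yi) * math.log(max(1 - mu, eps))

    return 2.0 * sum(L(mh, yi) - L(mt, yi)
                     for yi, mh, mt in zip(y, mu_gplm, mu_glm))
```

A positive value of R indicates that the GPLM fit is closer to the data; its size is judged against the difference of the (approximate) degrees of freedom of the two models.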

A modified likelihood ratio test for testing $ H_0: G(X^T\beta +T^T\gamma +c)$ (GLM) against $ H_1: G\{X^T\beta +m(T)\}$ (GPLM) was introduced by Härdle, Mammen, and Müller (1998). They propose to use a ``biased'' parametric estimate $ \overline{m}(t)$ instead of $ t^T\widetilde{\gamma }+ c$ and the test statistic

$\displaystyle R^\mu = 2\sum\limits^n_{i=1} \left\{ L(\widehat{\mu}_i,\widehat{\mu}_i) - L(\overline{\mu}_i,\widehat{\mu}_i) \right\}.$ (6.6)

Asymptotically, this test statistic is equivalent to

$\displaystyle \widetilde{R}^\mu= \sum\limits^n_{i=1} w_i \left\{x_i^T(\widehat{\beta } - \widetilde{\beta }) + \widehat{m}(t_i) - \overline{m}(t_i)\right\}^2$ (6.7)

and

$\displaystyle \widetilde{R}_o^\mu= \sum\limits^n_{i=1} w_i \left\{x_i^T(\widehat{\beta } - \widetilde{\beta }) + \widehat{m}(t_i) - \overline{m}(t_i)\right\}^2$ (6.8)

with

$\displaystyle w_i = \frac{[G'\{x_i^T\widehat{\beta } + \widehat{m}(t_i)\}]^2}{V[G\{x_i^T\widehat{\beta} +\widehat{m}(t_i)\}]}.$

All three test statistics are asymptotically equivalent and have an asymptotic normal distribution. However, since the convergence to the limiting normal distribution is slow, it is recommended to determine the critical values of the test by bootstrap. The quantlet gplmbootstraptest performs this bootstrap test.
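The bootstrap significance level reported by gplmbootstraptest (component alpha of the result) is, in essence, the fraction of bootstrap replicates of the test statistic that reach or exceed the observed value. A minimal Python sketch, with hypothetical names, which also drops failed replications in the same spirit as the quantlet's handling of missing values:

```python
def bootstrap_pvalue(r_obs, r_boot):
    """Bootstrap significance level: share of valid bootstrap
    replicates of the test statistic that are >= the observed
    statistic r_obs. Failed replications are passed as None."""
    valid = [r for r in r_boot if r is not None]  # drop failed replications
    return sum(r >= r_obs for r in valid) / len(valid)
```

The hypothesis is rejected at level alpha_0 whenever this fraction falls below alpha_0.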

Let us continue with our credit scoring example and test whether the correct specification of the model is $ G(X^T\beta + T^T\gamma + c)$ or $ G\{X^T\beta + m(T)\}$. The following code first computes the GLM and then applies the quantlet gplmbootstraptest to estimate the GPLM and perform the bootstrap test.

  library("glm")   ; GLM estimation
  n=rows(x)
  opt=glmopt("xvars",xvars|tvars|"constant")
  l=glmest("bilo",x~t~matrix(n),y,opt)
  glmout("bilo",x~t~matrix(n),y,l.b,l.bv,l.stat,opt)

  library("gplm")  ; GPLM estimation and test
  h=0.4
  nboot=10
  randomize(742742)
  opt=gplmopt("meth",1,"shf",1,"xvars",xvars)
  opt=gplmopt("wr",prod((abs(t-0.5) < 0.40*trange),2),opt)
  g=gplmbootstraptest("bilo",x,t,y,h,nboot,opt)
  gplmout("bilo",x,t,y,h,g.b,g.bv,g.m,g.stat,opt)
XAGgplm05.xpl

Note the optional weight vector wr, which defines weights for the test statistics. All observations outside a radius of 0.4 times the range of t around the center of t are excluded. This ensures that the test result is not disturbed by outliers and boundary effects. Table 6.3 summarizes the coefficients from the output windows for the GLM (left column) and the GPLM (right column).
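The construction of wr via prod((abs(t-0.5) < 0.40*trange),2) can be mirrored in Python as follows. This is a hypothetical re-implementation for illustration only; it assumes that trange is the componentwise range of t and that the weights are indicators that every component of $t_i$ lies within 0.4 times that range around 0.5:

```python
def boundary_weights(t, center=0.5, radius_frac=0.4):
    """Illustrative reconstruction of the wr weight vector: an
    observation gets weight 1 only if every component of t_i lies
    within radius_frac * (componentwise range of t) around center."""
    ncols = len(t[0])
    # componentwise range of t (the assumed meaning of trange)
    ranges = [max(row[j] for row in t) - min(row[j] for row in t)
              for j in range(ncols)]
    # product of per-component indicators, as in prod(..., 2)
    return [int(all(abs(row[j] - center) < radius_frac * ranges[j]
                    for j in range(ncols)))
            for row in t]
```

Observations with weight 0 contribute nothing to the test statistics (6.6)-(6.8).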


Table 6.3: Coefficients from GLM (with and without interaction term) and GPLM, $ t$-values in parentheses.

  Coeff.         GLM              GLM (interaction)   GPLM
  previous        0.974 ( 3.99)    0.954 ( 3.91)       0.974 ( 3.90)
  employed        0.783 ( 3.34)    0.765 ( 3.26)       0.753 ( 3.17)
  duration       -0.048 (-4.04)   -0.050 (-4.15)      -0.050 (-4.35)
  amount          0.092 (-0.12)    1.405 (-1.09)         --
  age             0.989 ( 1.93)    2.785 ( 1.82)         --
  interaction       --            -3.355 (-1.26)         --
  constant        0.916 ( 2.40)    0.275 ( 0.44)         --


The obtained significance levels for the test (computed for all three test statistics $ {R}^\mu$, $ \widetilde{R}^\mu$ and $ \widetilde{R}_o^\mu$) can be found in the component alpha of the result g. Note that the approximations $ \widetilde{R}^\mu$ and $ \widetilde{R}_o^\mu$ (the latter in particular) may give bad results when the sample size $ n$ is small. If we run the test with random seed 742742 and nboot=250 we get:

Contents of alpha
  [1,]  0.035857 
  [2,]  0.035857 
  [3,]  0.043825
The GLM hypothesis can hence be rejected (at the 5% level for $ {R}^\mu$, $ \widetilde{R}^\mu$ and $ \widetilde{R}_o^\mu$).

It is also possible to test more complicated GLMs against the GPLM. For example, the nonlinear influence of amount and age could be caused by an interaction of these two variables. Consider now the GLM hypothesis $ G(X^T\beta + T^T\gamma + \delta\, t_1\cdotp t_2 +c)$. The code for this test needs to define an optional design matrix tdesign which is used instead of the default t~matrix(n) in the previous test. The essential changes are as follows:

  tdesign=t~prod(t,2)~matrix(n)
  opt=gplmopt("tdesign",tdesign,opt)
  g=gplmbootstraptest("bilo",x,t,y,h,nboot,opt)
XAGgplm06.xpl

The resulting coefficients for the GLM can be found in the middle column of Table 6.3. Performing the test with random seed 742742 and nboot=250 yields:
  Contents of alpha
  [1,]    0.052 
  [2,]    0.056 
  [3,]    0.064
The hypothesis that the correct specification is a GLM with interaction term can hence be rejected as well (now at the 10% level for $ {R}^\mu$, $ \widetilde{R}^\mu$ and $ \widetilde{R}_o^\mu$).

Note that gplmbootstraptest also prints a warning if missing values occurred in the bootstrap procedure. In our last example we have:

  [1,] ======================================================
  [2,]  WARNING!
  [3,] ======================================================
  [4,]  Missing values in bootstrap encountered!
  [5,]  The actually used bootstrap sample sizes are:
  [6,]    nboot[1] =          249  ( 99.60%)
  [7,]    nboot[2] =          249  ( 99.60%)
  [8,]    nboot[3] =          249  ( 99.60%)
  [9,] ======================================================
Missing values are mostly due to numerical errors when the sample size is small or the dataset contains outliers.