7.8 Application for Real Data

This section demonstrates the processing of real data. We use two data sets of Wisconsin farm data from 1987, drawn from an original sample of 1000 observations. Middle-sized animal farms were selected and outliers were removed. The first data set, animal.dat, contains 250 observations (rows) of family labor, hired labor, miscellaneous inputs, animal inputs and intermediate run assets. The response variable, livestock, is contained in the second data set, goods.dat. A detailed description of the data, their source, possible models of interest and some nonparametric analysis can be found in Sperlich (1998).

In this example we deal with the first four inputs, i.e. family labor, hired labor, miscellaneous inputs and animal inputs. We store them in the variable t and read the response variable into y:

  data=read("animal.dat")
  t1 = data[,1]
  t2 = data[,2]
  t3 = data[,3]
  t4 = data[,4]
  t=t1~t2~t3~t4
  y=read("goods.dat")
Now we can calculate an approximate bandwidth $h$ for each input, here half of its standard deviation:
  h1=0.5*sqrt(cov(t1))
  h2=0.5*sqrt(cov(t2))
  h3=0.5*sqrt(cov(t3))
  h4=0.5*sqrt(cov(t4))
  h=h1|h2|h3|h4
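The bandwidth rule used above (half the standard deviation of each input) can be restated outside XploRe. The following is a minimal Python sketch, not part of the original quantlets; the function name, the toy columns and the use of the sample standard deviation with the n-1 denominator are assumptions made for illustration.

```python
import statistics

def rule_of_thumb_bandwidths(columns, factor=0.5):
    """Return factor * sample standard deviation for each input column.

    Assumption: the n-1 denominator is used, as in statistics.stdev.
    """
    return [factor * statistics.stdev(col) for col in columns]

# Hypothetical toy columns standing in for two inputs of animal.dat
t1 = [1.0, 2.0, 3.0, 4.0, 5.0]
t2 = [10.0, 10.0, 20.0, 20.0, 15.0]

h = rule_of_thumb_bandwidths([t1, t2])
print(h)
```

Such a rule only gives a starting value; the fitted curves should still be inspected for over- or undersmoothing.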
Finally we set up the parameters for the estimation and run the partial integration procedure intest. The option shf is set so that the progress of the computation is displayed.
  g=h
  loc=0
  opt=gamopt("shf",1)
  m = intest(t,y,h,g,loc,opt)
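To make the idea behind partial (marginal) integration concrete, here is a small Python sketch. It is not the implementation of intest: it assumes a local constant (Nadaraya-Watson) pre-estimator with a product quartic kernel, uses bandwidth h in the direction of interest and g in the other directions, and runs on a hypothetical toy grid instead of animal.dat. All function names are illustrative.

```python
def quartic(u):
    """Quartic (biweight) kernel on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0

def nw(point, T, y, bw):
    """Multivariate Nadaraya-Watson estimate at `point` (product kernel)."""
    num = den = 0.0
    for row, yi in zip(T, y):
        w = 1.0
        for x, ti, b in zip(point, row, bw):
            w *= quartic((x - ti) / b)
            if w == 0.0:
                break
        num += w * yi
        den += w
    return num / den if den > 0.0 else None

def marginal_integration(T, y, h, g):
    """Estimate centered additive components at the observations.

    For component j and observation i, the pre-estimator is averaged over
    the observed values of all other inputs while input j is held at T[i][j].
    Returns an n x d list of lists.
    """
    n, d = len(T), len(T[0])
    ybar = sum(y) / n
    m = [[0.0] * d for _ in range(n)]
    for j in range(d):
        bw = [h[k] if k == j else g[k] for k in range(d)]
        for i in range(n):
            vals = []
            for l in range(n):
                point = list(T[l])
                point[j] = T[i][j]
                v = nw(point, T, y, bw)
                if v is not None:  # skip points without data nearby
                    vals.append(v)
            m[i][j] = sum(vals) / len(vals) - ybar
    return m

# Hypothetical toy design: 5 x 5 grid on the unit square, additive truth
T = [[a / 4.0, b / 4.0] for a in range(5) for b in range(5)]
y = [2.0 * t1 + t2 for t1, t2 in T]
m = marginal_integration(T, y, h=[0.6, 0.6], g=[0.6, 0.6])
print(m[0][0], m[12][0], m[20][0])
```

The sketch also shows why sparse regions are troublesome: wherever no observation lies within the kernel support, the pre-estimator has a zero denominator and is undefined.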
For a visual check of the results we create the graphical output shown in Figure 7.1.

Figure 7.1: Generalized additive model for animal.dat, partial integration.
\includegraphics[scale=0.6]{gam_real_1}

It is produced by the following statements:
  const=mean(y)*0.25
  m1 = t[,1]~(m[,1]+const)
  m2 = t[,2]~(m[,2]+const)
  m3 = t[,3]~(m[,3]+const)
  m4 = t[,4]~(m[,4]+const)
  setmaskp(m1,4,4,4)
  setmaskp(m2,4,4,4)
  setmaskp(m3,4,4,4)
  setmaskp(m4,4,4,4)
  setmaskl(m1,(sort(m1~(1:rows(m1)))[,3])',4,1,1)
  setmaskl(m2,(sort(m2~(1:rows(m2)))[,3])',4,1,1)
  setmaskl(m3,(sort(m3~(1:rows(m3)))[,3])',4,1,1)
  setmaskl(m4,(sort(m4~(1:rows(m4)))[,3])',4,1,1)
  yy=y-mean(y)-sum(m,2)
  d1=t[,1]~(yy+m[,1])
  d2=t[,2]~(yy+m[,2])
  d3=t[,3]~(yy+m[,3])
  d4=t[,4]~(yy+m[,4])
  setmaskp(d1,1,11,4)
  setmaskp(d2,1,11,4)
  setmaskp(d3,1,11,4)
  setmaskp(d4,1,11,4)
  pic = createdisplay(2,2)
  show(pic,1,1,m1,d1)
  show(pic,1,2,m2,d2)
  show(pic,2,1,m3,d3)
  show(pic,2,2,m4,d4)
XAGgam15.xpl
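Read off the statements above, the points drawn in panel $j$ are partial residuals: the overall residual of the additive fit with the $j$-th estimated component added back. Writing $\hat{m}_k$ for the $k$-th column of m and $\bar{y}$ for mean(y), which is notation introduced here for illustration, this is

```latex
\[
  r_{ij} \;=\; y_i - \bar{y} - \sum_{k=1}^{4} \hat{m}_k(t_{ik}) + \hat{m}_j(t_{ij}),
  \qquad i = 1,\dots,n, \quad j = 1,\dots,4 .
\]
```

The curves are the estimated components themselves, shifted upwards by the constant $0.25\,\bar{y}$ as in the statements above.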

Figure 7.1 reveals two properties of the data:
  1. the bandwidth $h$ was chosen quite well; the estimates seem neither oversmoothed nor undersmoothed.
  2. there are several outliers in the data; they can be seen in the right part of the pictures.
If we try to run the quantlet intest with an inner grid for the computation (optional variable opt.tg), the quantlet ends with an error message. The cause is the outliers: around them the data are too sparse.

To understand the data better, we can estimate the model with the backfitting algorithm (quantlet backfit) and compare the results.

  kern="qua"
  {mb,b,const} = backfit(t,y,h,loc,kern,opt)
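The principle of backfitting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of the quantlet backfit: it uses a one-dimensional Nadaraya-Watson smoother with the quartic kernel, a fixed number of sweeps instead of a convergence criterion, and a hypothetical toy grid as data. The names smooth and backfit_sketch are invented for this example.

```python
def quartic(u):
    """Quartic (biweight) kernel on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0

def smooth(x, r, h):
    """One-dimensional Nadaraya-Watson smooth of r on x at the data points."""
    out = []
    for x0 in x:
        w = [quartic((x0 - xi) / h) for xi in x]
        # sum(w) > 0 because each point weights itself with quartic(0)
        out.append(sum(wi * ri for wi, ri in zip(w, r)) / sum(w))
    return out

def backfit_sketch(T, y, h, n_iter=20):
    """Return (m, const): m[j][i] is component j at observation i."""
    n, d = len(T), len(T[0])
    const = sum(y) / n
    cols = [[row[j] for row in T] for j in range(d)]
    m = [[0.0] * n for _ in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            # partial residuals: remove the constant and all other components
            partial = [
                y[i] - const - sum(m[k][i] for k in range(d) if k != j)
                for i in range(n)
            ]
            fit = smooth(cols[j], partial, h[j])
            mean_fit = sum(fit) / n
            m[j] = [f - mean_fit for f in fit]  # center each component
    return m, const

# Hypothetical toy design: 5 x 5 grid on the unit square, additive truth
T = [[a / 4.0, b / 4.0] for a in range(5) for b in range(5)]
y = [2.0 * t1 + t2 for t1, t2 in T]
mb, const = backfit_sketch(T, y, h=[0.6, 0.6])
print(const, mb[0][0], mb[0][20])
```

Because every step smooths in one dimension only, the smoother is defined at each observation even where the joint design is sparse, which is one plausible reason why backfitting runs on data where a multivariate pre-estimator fails.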
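For the graphical output we can use an approach similar to the one above, with a few differences: no constant is added to the estimated functions, and the residuals are computed with the constant const returned by backfit instead of mean(y).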
  m1 = t[,1]~mb[,1]
  m2 = t[,2]~mb[,2]
  m3 = t[,3]~mb[,3]
  m4 = t[,4]~mb[,4]
  setmaskp(m1,4,4,4)
  setmaskp(m2,4,4,4)
  setmaskp(m3,4,4,4)
  setmaskp(m4,4,4,4)
  setmaskl(m1,(sort(m1~(1:rows(m1)))[,3])',4,1,1)
  setmaskl(m2,(sort(m2~(1:rows(m2)))[,3])',4,1,1)
  setmaskl(m3,(sort(m3~(1:rows(m3)))[,3])',4,1,1)
  setmaskl(m4,(sort(m4~(1:rows(m4)))[,3])',4,1,1)
  yy=y-const-sum(mb,2)
  d1=t[,1]~(yy+mb[,1])
  d2=t[,2]~(yy+mb[,2])
  d3=t[,3]~(yy+mb[,3])
  d4=t[,4]~(yy+mb[,4])
  setmaskp(d1,1,11,4)
  setmaskp(d2,1,11,4)
  setmaskp(d3,1,11,4)
  setmaskp(d4,1,11,4)
  pic2 = createdisplay(2,2)
  show(pic2,1,1,m1,d1)
  show(pic2,1,2,m2,d2)
  show(pic2,2,1,m3,d3)
  show(pic2,2,2,m4,d4)
XAGgam15.xpl

Figure 7.2: Generalized additive model for animal.dat, backfitting.
\includegraphics[scale=0.6]{gam_real_2}

The graphs of this estimation in Figure 7.2 resemble those in Figure 7.1 obtained with intest; only a different scale factor was used. The dependence of the variable $y$ on the miscellaneous inputs seems to be almost linear. Unfortunately, the quantlet intestpl for the additive partially linear model ends with an error because of the outliers. Likewise, the test for interactions (quantlet intertest1) aborts. To process the data with these quantlets, the outliers would first have to be removed from the data sets.
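How the outliers should be removed is not specified here. One simple possibility, shown purely as a hypothetical sketch in Python, is to drop every observation in which some input exceeds an upper empirical quantile of that input. The function name trim_upper, the cutoff level q and the toy data are assumptions for illustration, not part of the XploRe library.

```python
import math

def trim_upper(T, y, q=0.95):
    """Drop rows in which any input exceeds its empirical q-quantile.

    The cutoff for column j is the ceil(q*n)-th smallest value of that column.
    Returns the trimmed inputs and the matching responses.
    """
    n, d = len(T), len(T[0])
    cut = []
    for j in range(d):
        col = sorted(row[j] for row in T)
        k = max(0, math.ceil(q * n) - 1)
        cut.append(col[k])
    keep = [i for i in range(n) if all(T[i][j] <= cut[j] for j in range(d))]
    return [T[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: one extreme value in each input column
T = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 50.0], [5.0, 5.0],
     [6.0, 5.0], [7.0, 5.0], [8.0, 5.0], [9.0, 5.0], [100.0, 5.0]]
y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

T_trim, y_trim = trim_upper(T, y, q=0.85)
print(len(T_trim), y_trim)
```

Any such rule changes the sample, so the bandwidths should be recomputed on the trimmed data before the estimation is repeated.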