10.5 Examples


10.5.1 Simulated Example

Let us generate the observations

$$Y_i = f(X_i) + \epsilon_i, \qquad i=1,\ldots,100,$$

where
$$f(x_1,x_2) = 100 \, I(0 \leq x_1 \leq 0.5,\; 0.5 < x_2 \leq 1) + 120 \, I(0.5 < x_1 \leq 1,\; 0.5 < x_2 \leq 1),$$

$ X_i$ are independent and uniformly distributed on $ [0,1] \times [0,1]$, and the $ \epsilon_i$ are independent standard normal random variables.

Figure 10.1 shows the data simulated from the function $ f$.

Figure 10.1: Plot of 100 simulated data points from the function $ f(x_1,x_2)$. The data points in the upper left (marked with crosses) lie in the region where $ f(x_1,x_2)=100$, the data points in the upper right (marked with triangles) lie in the region where $ f(x_1,x_2)=120$, and the data points in the lower part (marked with circles) lie in the region where $ f(x_1,x_2)=0$.
\includegraphics[scale=0.55]{tutosim1}

The quantlet for generating the observations is
  proc(y)=tuto(seed,n)
    ; set the random seed
    randomize(seed)
    ; n points uniformly distributed on [0,1]x[0,1]
    xdat=uniform(n,2)
    ; group index: 1 = lower half, 2 = upper left, 0 = upper right
    index=(xdat[,2]<=0.5)+(xdat[,2]>0.5).*(xdat[,1]<=0.5)*2
    ; plotting symbols for the three groups
    layout=3*(index==1)+4.*(index==0)+5.*(index==2)
    ; mean function: 100 in the upper left, 120 in the upper right
    ydat=100.*(index==2)+120.*(index==0)
    y=list(xdat,ydat,layout)
  endp

  library("xclust")
  d=createdisplay(1,1)
  data=tuto(1,100)
  x=data.xdat
  setmaskp(x, data.layout, data.layout, 8)
  show(d,1,1,x)
XAGcart01.xpl

Let us grow a tree such that splitting stops once the number of observations in a leaf node is less than or equal to $ 5$ (mincut), any deviance greater than or equal to $ 0$ is allowed in a leaf node (mindev), and a cut is only made if each resulting node contains at least $ 1$ observation (minsize); a rough scikit-learn analogy is sketched after the code below. Both predictor variables are continuous.
  library("xclust")
  data=tuto(1,100)
  type=#(1,1)
  opt=cartsplitopt("minsize",1,"mindev",0,"mincut",5)
  tr=cartsplit(data.xdat,data.ydat,type,opt)
  totleaves=leafnum(tr,1)
  totleaves
  plotcarttree(tr)
XAGcart02.xpl
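The stopping parameters correspond roughly to those of other CART implementations. As an analogy only (scikit-learn is not part of XploRe, and the parameter mapping below is approximate), the same stopping behavior might be configured as:

  # approximate scikit-learn analogue of cartsplitopt above:
  #   mincut=5  ~ min_samples_split=6 (nodes with <= 5 obs are not split)
  #   minsize=1 ~ min_samples_leaf=1  (each child keeps >= 1 observation)
  #   mindev=0  ~ min_impurity_decrease=0.0
  from sklearn.tree import DecisionTreeRegressor

  cart = DecisionTreeRegressor(min_samples_split=6,
                               min_samples_leaf=1,
                               min_impurity_decrease=0.0)
  # cart.fit(xdat, ydat) would then grow the analogous regression tree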

Figure 10.2 shows the regression tree tr with 41 leaves. Since the data clearly fall into three groups, we prefer the tree with 3 leaves.

Figure 10.2: Initial regression tree for 100 simulated data from function $ f(x_1,x_2)$ (left). The total number of leaves ($ 41$) is shown at the right.
\includegraphics[scale=0.5]{tutosim2}


Let us select the tree with $ 3$ leaves using the following commands.

  trfin=prunetot(tr,3)
  plotcarttree(trfin)
XAGcart03.xpl

Figure 10.3: Final regression tree for 100 simulated data from function $ f(x_1,x_2)$ after pruning. The final tree consists of three leaves which separate the $ x_1,x_2$-plane into three parts.
\includegraphics[scale=0.55]{tutosim3}

Figure 10.3 shows the final tree for the simulated data.


10.5.2 Boston Housing Data

The Boston housing data set bostonh.dat was collected by Harrison and Rubinfeld (1978). The following variables are in the data:

  1. crime rate
  2. percent of land zoned for large lots
  3. percent non-retail business
  4. Charles River indicator, $ 1$ if on Charles River, 0 otherwise
  5. nitrogen oxide concentration
  6. average number of rooms
  7. percent built before $ 1940$
  8. weighted distance to employment centers
  9. accessibility to radial highways
  10. tax rate
  11. pupil-teacher ratio
  12. percent black
  13. percent lower status
  14. median value of owner-occupied homes in thousands of dollars.
Variable 14 is the response variable and variables 1-13 are the predictors. The 4th and 9th variables are categorical; the others are continuous. There are 506 observations; to keep the example manageable, the code below works with a random 20% subsample. Let us grow a tree such that the number of observations in a leaf node is less than or equal to 8 (mincut).
  library("xclust")
  randomize(10)
  boston=read("bostonh")
  boston=paf(boston,uniform(rows(boston))<0.20)
  yvar=boston[,14]
  xvar=boston[,1:13]
  type=matrix(13)
  type[4]=0
  type[9]=0
  opt=cartsplitopt("minsize",1,"mindev",0,"mincut",8)
  tr=cartsplit(xvar,yvar,type,opt)
  totleaves=leafnum(tr,1)
  totleaves
  plotcarttree(tr)
XAGcart04.xpl

We can observe that the tree tr with $ 29$ leaves is large.

Figure 10.4: Initial regression tree for Boston housing data. The total number of leaves (29) is shown at the right.
\includegraphics[scale=0.5]{tutobos1}


Figure 10.4 is not so easy to read. We can look at the optimal subtree consisting of $ 10$ leaves by using these commands:

  prtr=prunetot(tr,10)
  plotcarttree(prtr)
XAGcart05.xpl

Figure 10.5 shows the pruned tree for the Boston housing data.

Figure 10.5: Subtree consisting of 10 leaves for a 20% sample of the Boston housing data
\includegraphics[scale=0.55]{tutobos2}

Let us try to choose the optimal number of leaves with $ 10$-fold cross-validation.
  cval=cartcv(xvar,yvar,type,opt,10)
  res=cval.lnumber~cval.alfa~cval.cv~cval.cvstd
  res=sort(res,1)
  res=res[1:12,]
  title=" no   alfa    cv   cvstd"
  restxt=title|string("%3.0f %6.2f %6.2f %6.2f",
  res[,1], res[,2], res[,3], res[,4])

  dd=createdisplay(2,2)
  show(dd, 1, 1, cval.lnumber~cval.alfa)
  setgopt(dd, 1, 1, "title","number of leaves vs alpha")
  show(dd, 1, 2, cval.lnumber~cval.cv)
  setgopt(dd, 1, 2, "title","number of leaves vs cv")
  show(dd, 2, 1, cval.lnumber~cval.cvstd)
  setgopt(dd, 2, 1, "title","number of leaves vs cvstd")
  show(dd, 2, 2, restxt)
XAGcart06.xpl

We get the result shown in Figure 10.6.

Figure 10.6: Cross-validation for 20% sample of Boston housing data.
\includegraphics[scale=0.55]{tutobos3}

The first column gives the number of leaves in the sequence of pruned subtrees and the second column gives the sequence $ \alpha_i$. The estimates of the expectation of the mean of squared residuals, $ ER(\hat{f}_{\alpha_i})$, are in the third column of the above matrix. The fourth column gives the estimates of the standard error of the corresponding estimators.
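The third column is the usual $V$-fold cross-validation estimate. In our notation (not taken from the quantlet), with folds $I_1,\ldots,I_V$ and $\hat{f}^{(-v)}_{\alpha}$ denoting the tree estimate computed without fold $I_v$, it has the standard form

$$\widehat{ER}(\hat{f}_{\alpha}) = \frac{1}{n} \sum_{v=1}^{V} \sum_{i \in I_v} \left\{ Y_i - \hat{f}^{(-v)}_{\alpha}(X_i) \right\}^2 .$$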

We can see that there is a clear minimum for the estimates for the expectation of the mean of squared residuals.

Therefore, it seems reasonable to choose the tree with $ 7$ leaves as the final estimate. Let us choose $ \alpha = 0.9$ and form the corresponding tree.
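The pruning parameter has the standard CART cost-complexity interpretation (the usual definition; notation ours): for a subtree $T$ with $\vert\tilde{T}\vert$ leaves and residual sum of squares $R(T)$, pruning with parameter $\alpha$ selects the subtree minimizing

$$R_{\alpha}(T) = R(T) + \alpha \, \vert\tilde{T}\vert ,$$

so larger values of $\alpha$ penalize large trees more heavily and yield smaller subtrees.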

  fin=prune(tr,0.9)
  plotcarttree(fin)
XAGcart07.xpl

The final estimate is shown in Figure 10.7.

Figure 10.7: Final tree for 20% sample of Boston housing data
\includegraphics[scale=0.55]{tutobos4}

Let us look at the number of observations and the mean value in each node with the commands
  plotcarttree(fin,"nelem")
  plotcarttree(fin,"mean")
XAGcart08.xpl

The results are displayed in Figures 10.8 and 10.9, respectively.

Figure 10.8: Final tree for 20% sample of Boston housing data with numbers of observations
\includegraphics[scale=0.55]{tutobos5}

Figure 10.9: Final tree for 20% sample of Boston housing data with mean values
\includegraphics[scale=0.55]{tutobos6}


10.5.3 Density Estimation


regdat = dentoreg(dendat, binlkm)
transforms density data into regression data using a variance-stabilizing transform

Instead of writing separate procedures for the estimation of density functions, we transform the density data into regression data and use a regression tree to estimate the density function.

The basic idea is to divide the sample space into bins, calculate the number of observations in every bin, and consider these frequencies as the dependent regression variable. The independent regression variables are the midpoints of the bins. To be more precise, after we have calculated the bin frequencies $ Z_i$, we transform these to

$$Y_i = \sqrt{Z_i + 3/8}.$$

This transform was suggested by Anscombe (1948) and used by Donoho, Johnstone, Kerkyacharian, and Picard (1995, page 327); the square root approximately stabilizes the variance of the bin counts, so that the transformed frequencies have roughly constant variance, as the regression setting assumes.
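To make the construction concrete, here is a minimal NumPy sketch of this binning-and-transform step (a hypothetical helper written for illustration; it mimics the idea of dentoreg, not its exact implementation):

  import numpy as np

  def density_to_regression(dendat, binlkm):
      # bin the data into binlkm^d equal-width cells
      dendat = np.asarray(dendat, dtype=float)
      n, d = dendat.shape
      edges = [np.linspace(dendat[:, j].min(), dendat[:, j].max(),
                           binlkm + 1) for j in range(d)]
      counts, _ = np.histogramdd(dendat, bins=edges)     # frequencies Z_i
      # bin midpoints form the independent regression variables
      mids = [(e[:-1] + e[1:]) / 2 for e in edges]
      grid = np.stack(np.meshgrid(*mids, indexing="ij"), -1).reshape(-1, d)
      y = np.sqrt(counts.ravel() + 3.0 / 8.0)            # Y_i = sqrt(Z_i + 3/8)
      return grid, y                                     # independent, dependent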

We first use the procedure to make a histogram estimator for the density. This estimator has a large number of equal-sized bins and so it is not a good density estimator by itself, but we then combine some of these bins in an optimal way using CART. The number of points in the new regression data equals the number of bins raised to the power of the number of variables; with $ 9$ bins and $ 3$ variables, for example, this gives $ 9^3 = 729$ regression observations. Given current computing capability, probably $ 9$ is the maximum number of variables for this method.

As an example we will analyze data which consists of $ 200$ measurements on Swiss bank notes. These data are taken from Flury and Riedwyl (1988). One half of these bank notes are genuine, the other half are forged bank notes. The following variables are in the data.

  1. length of the note (width)
  2. height of the note (left)
  3. height of the note (right)
  4. distance of the inner frame to the lower border (bottom)
  5. distance of the inner frame to the upper border (top)
  6. length of the diagonal of the central picture (diagonal)
The macro dentoreg transforms density data to regression data. Let us choose $ 9$ bins for each coordinate axis and use only the last $ 3$ variables in the data.
  ; load library xclust and plot
  library("xclust")
  library("plot")

  ; set random seed
  randomize(1)
  ; read swiss banknote data
  dendat=read("bank2")
  ; select the last three variables
  dendat=dendat[,4:6]
  ; choose 9 bins in each dimension
  binlkm=9

  ; transform density data to regression data
  regdat=dentoreg(dendat,binlkm)

  ; compute CART and tree
  type=matrix(cols(dendat))
  opt=cartsplitopt("minsize",50,"mindev",0,"mincut",1)
  tr=cartsplit(regdat.ind,regdat.dep,type,opt)
  ; color data points according to the node they fall in
  g=cartregr(tr, dendat, "node")
  {gcat,gind}=groupcol(g, rows(g))
  ; compute cuts up to level 2 for (X1,X2)
  xdat=regdat.ind
  gr12=grcart2(xdat, tr, 1, 2, 10, 0)
  xdat12=dendat[,1|2]
  setmaskp(xdat12, gind)
  ; compute cuts up to level 2 for (X1,X3)
  gr13=grcart2(xdat, tr, 1, 3, 10, 0)
  xdat13=dendat[,1|3]
  setmaskp(xdat13, gind)
  ; compute cuts up to level 2 for (X2,X3)
  gr23=grcart2(xdat, tr, 2, 3, 10, 0)
  xdat23=dendat[,2|3]
  setmaskp(xdat23, gind)

  ; compute tree and its labels
  {tree, treelabel}=grcarttree(tr)
  ; show all projections and the tree in a display
  setsize(640, 480)
  d=createdisplay(2,2)
  show(d, 1,1, xdat12, gr12)
  setgopt(d,1,1, "xlabel", "top (X1)", "ylabel", "bottom (X2)")

  show(d,2,1, xdat13, gr13)
  setgopt(d,2,1, "xlabel", "top (X1)")
  setgopt(d,2,1, "ylabel", "diagonal (X3)")
  show(d, 2,2, xdat23, gr23)
  setgopt(d,2,2, "xlabel", "bottom (X2)")
  setgopt(d,2,2, "ylabel", "diagonal (X3)")
  axesoff()
  show(d, 1,2, tree, treelabel)
  axeson()
XAGcart09.xpl

Figure 10.10: The upper left plot gives the cuts in the bottom-top plane, the lower left plot the cuts in the bottom-diagonal plane and the lower right plot the cuts in the top-diagonal plane. The CART tree is shown in the upper right window.
\includegraphics[scale=0.55]{bankdens}

The result is shown in Figure 10.10. The upper left plot gives the cuts in the bottom-top plane, the lower left plot the cuts in the bottom-diagonal plane and the lower right plot the cuts in the top-diagonal plane. The CART tree is shown in the upper right window.

All splits are done in the bottom-diagonal plane. The lower right plot shows that the CART algorithm just cuts away the tails from the main bulk of the data. Note the different colors in the left plots, which show that some cuts are not visible in the top-bottom or top-diagonal projections.

Since we have chosen to stop splitting when a node contains too few observations (minsize$ =50$; see the parameters of cartsplitopt in XAGcart09.xpl), we may try a smaller value.

In XAGcart10.xpl we choose a smaller value ($ 20$), do not color the data points, and omit the tree labels. The main result is again that the CART algorithm cuts away the tails of the distribution and generates at least $ 4$ different groups of nodes.

  ; load library xclust and plot
  library("xclust")
  library("plot")
  ; set random seed
  randomize(1)
  ; read swiss banknote data
  dendat=read("bank2")
  ; select the last three variables
  dendat=dendat[,4:6]
  ; choose 9 bins in each dimension
  binlkm=9
  ; transform density data to regression data
  regdat=dentoreg(dendat,binlkm)
  ; compute CART and tree
  type=matrix(cols(dendat))
  opt=cartsplitopt("minsize",20,"mindev",0,"mincut",1)
  tr=cartsplit(regdat.ind,regdat.dep,type,opt)
  ; compute cuts up to level 2 for (X1,X2)
  xdat=regdat.ind
  gr12=grcart2(xdat, tr, 1, 2, 10, 0)
  xdat12=dendat[,1|2]
  ; compute cuts up to level 2 for (X1,X3)
  gr13=grcart2(xdat, tr, 1, 3, 10, 0)
  xdat13=dendat[,1|3]
  ; compute cuts up to level 2 for (X2,X3)
  gr23=grcart2(xdat, tr, 2, 3, 10, 0)
  xdat23=dendat[,2|3]
  ; compute tree and its labels
  {tree, treelabel}=grcarttree(tr)
  ; show all projections and the tree in a display
  setsize(640, 480)
  d=createdisplay(2,2)
  show(d, 1,1, xdat12, gr12)
  setgopt(d,1,1, "xlabel", "top (X1)", "ylabel", "bottom (X2)")
  show(d, 2,1, xdat13, gr13)
  setgopt(d,2,1, "xlabel", "top (X1)", "ylabel", "diagonal (X3)")
  show(d, 2,2, xdat23, gr23)
  setgopt(d,2,2, "xlabel", "bottom (X2)", "ylabel", "diagonal (X3)")
  show(d, 1,2, tree)
  setgopt(d,1,2, "xlabel", " ", "ylabel", "log10(1+SSR)")
XAGcart10.xpl

Figure 10.11: The upper left plot gives the cuts in the bottom-top plane, the lower left plot the cuts in the bottom-diagonal plane and the lower right plot the cuts in the top-diagonal plane. The CART tree is shown in the upper right window.
\includegraphics[scale=0.55]{bankdens2}