Let us generate the observations.
Figure 10.1 shows the data simulated from a piecewise constant function on the unit square, which equals 0 for $x_2 \le 0.5$, 100 for $x_2 > 0.5$ and $x_1 \le 0.5$, and 120 for $x_2 > 0.5$ and $x_1 > 0.5$.
```
; define a procedure that simulates data from the piecewise constant function
proc(y)=tuto(seed,n)
  randomize(seed)
  xdat=uniform(n,2)
  index=(xdat[,2]<=0.5)+(xdat[,2]>0.5).*(xdat[,1]<=0.5)*2
  layout=3*(index==1)+4.*(index==0)+5.*(index==2)
  ydat=100.*(index==2)+120.*(index==0)
  y=list(xdat,ydat,layout)
endp
; load the cluster library, simulate 100 observations and plot them
library("xclust")
d=createdisplay(1,1)
data=tuto(1,100)
x=data.xdat
setmaskp(x, data.layout, data.layout, 8)
show(d,1,1,x)
```
library("xclust") data=tuto(1,100) type=#(1,1) opt=cartsplitopt("minsize",1,"mindev",0,"mincut",5) tr=cartsplit(data.xdat,data.ydat,type,opt) totleaves=leafnum(tr,1) totleaves plotcarttree(tr)
Let us choose the tree with three leaves with the following command.
```
trfin=prunetot(tr,3)
plotcarttree(trfin)
```
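As a cross-check outside XploRe, here is a minimal sketch (in Python with scikit-learn, which is not the package used in this tutorial) that fits the same piecewise constant data and restricts the tree to three leaves via max_leaf_nodes; this plays the role of prunetot(tr,3) here, although scikit-learn does not reproduce the XploRe pruning sequence exactly.

```python
# Sketch only: scikit-learn stands in for the XploRe xclust library.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
x = rng.uniform(size=(100, 2))
# f(x) = 0 on {x2 <= 0.5}, 100 on {x2 > 0.5, x1 <= 0.5}, 120 on {x2 > 0.5, x1 > 0.5}
y = np.where(x[:, 1] <= 0.5, 0.0, np.where(x[:, 0] <= 0.5, 100.0, 120.0))

# limiting the number of leaves is used here as a stand-in for pruning the full tree
tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(x, y)
print(export_text(tree, feature_names=["x1", "x2"]))
```

Since the data are noiseless, the three-leaf tree should essentially recover the splits near $x_2 = 0.5$ and $x_1 = 0.5$.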
The Boston housing data set bostonh.dat was collected by Harrison and Rubinfeld (1978). The following variables are in the data:

1. per capita crime rate by town
2. proportion of residential land zoned for large lots
3. proportion of non-retail business acres per town
4. Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
5. nitric oxides concentration
6. average number of rooms per dwelling
7. proportion of owner-occupied units built prior to 1940
8. weighted distances to five Boston employment centres
9. index of accessibility to radial highways
10. full-value property tax rate
11. pupil-teacher ratio by town
12. transformed proportion of the black population by town
13. percentage of lower status of the population
14. median value of owner-occupied homes

The median value (variable 14) is used as the response; variables 4 and 9 are treated as categorical in the tree construction.
library("xclust") randomize(10) boston=read("bostonh") boston=paf(boston,uniform(rows(boston))<0.20) yvar=boston[,14] xvar=boston[,1:13] type=matrix(13) type[4]=0 type[9]=0 opt=cartsplitopt("minsize",1,"mindev",0,"mincut",8) tr=cartsplit(xvar,yvar,type,opt) totleaves=leafnum(tr,1) totleaves plotcarttree(tr)
It is not so easy to read Figure 10.4. We can look at the optimal subtree consisting of ten leaves by using these commands:
```
prtr=prunetot(tr,10)
plotcarttree(prtr)
```
```
; 10-fold cross-validation over the sequence of pruned subtrees
cval=cartcv(xvar,yvar,type,opt,10)
res=cval.lnumber~cval.alfa~cval.cv~cval.cvstd
res=sort(res,1)
; keep the first 12 rows of the table
res=res[1:12,]
title=" no alfa cv cvstd"
restxt=title|string("%3.0f %6.2f %6.2f %6.2f", res[,1], res[,2], res[,3], res[,4])
dd=createdisplay(2,2)
show(dd, 1, 1, cval.lnumber~cval.alfa)
setgopt(dd, 1, 1, "title","number obs. vs alpha")
show(dd, 1, 2, cval.lnumber~cval.cv)
setgopt(dd, 1, 2, "title","number obs. vs cv")
show(dd, 2, 1, cval.lnumber~cval.cvstd)
setgopt(dd, 2, 1, "title","number obs. vs cvstd")
show(dd, 2, 2, restxt)
```
The first column gives the number of leaves in the sequence of pruned subtrees, and the second column gives the corresponding sequence of complexity parameters $\alpha_k$. The cross-validation estimates of the expected mean of squared residuals are in the third column of the above matrix. The fourth column gives the estimates of the standard errors of the corresponding estimators.
We can see that there is a clear minimum among the cross-validation estimates of the expected mean of squared residuals. Therefore, it seems reasonable to choose as the final estimate the tree for which this estimate attains its minimum. Let us choose $\alpha = 0.9$ and form the corresponding tree.
```
fin=prune(tr,0.9)
plotcarttree(fin)
```
plotcarttree(fin,"nelem") plotcarttree(fin,"mean")
The results are displayed in Figure 10.8 and Figure 10.9, respectively.
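Readers working outside XploRe can assemble a table comparable to the cartcv output with scikit-learn's cost-complexity pruning path. The following is only a sketch of the idea, not the cartcv implementation: the helper cv_table, its column layout, and the use of the fold-wise standard deviation as a standard error are choices made here, and X, y stand for any regression data such as the Boston subsample above.

```python
# Sketch only: a (leaves, alpha, cv, cvstd) table via scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def cv_table(X, y, n_folds=10):
    # sequence of complexity parameters from cost-complexity pruning
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    rows = []
    for alpha in path.ccp_alphas:
        tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
        # cross-validated mean of squared residuals
        mse = -cross_val_score(tree, X, y, cv=n_folds,
                               scoring="neg_mean_squared_error")
        leaves = tree.fit(X, y).get_n_leaves()
        # rough standard error of the cross-validation mean
        rows.append((leaves, alpha, mse.mean(), mse.std() / np.sqrt(n_folds)))
    return np.array(rows)  # columns: leaves, alpha, cv, cvstd
```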
Instead of writing separate procedures for the estimation of density functions, we will transform the density estimation problem into a regression problem and use a regression tree to estimate the density function.
The basic idea is to divide the sample space into bins,
calculate the number of observations in every bin, and
consider these frequencies as a dependent regression variable.
The independent regression variables are the midpoints of the bins.
To be more precise, after we have calculated the frequencies $n_j$ of the bins $B_j$, we transform these to the histogram values $y_j = n_j /(n \, \mathrm{vol}(B_j))$, where $n$ is the total number of observations and $\mathrm{vol}(B_j)$ is the volume of the $j$-th bin.
Use the dentoreg procedure first to make a histogram estimator for the density. This estimator has a large number of equal-size bins and is therefore not a good density estimator by itself, but we then combine some of these bins in an optimal way using CART. The new regression data set has as many observations as there are bins, that is, the number of bins per dimension raised to the power of the number of variables. Given present computing capacity, nine is probably the maximum number of variables for this method.
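To make the transformation concrete, the following sketch shows one way to build such regression data with numpy (this is Python, not the XploRe dentoreg procedure, whose scaling may differ in detail); the helper name den_to_reg is ours, and its two return values correspond roughly to regdat.ind and regdat.dep in the code below.

```python
# Sketch only: turn a d-dimensional sample into (bin midpoints, histogram heights).
import numpy as np

def den_to_reg(data, nbins=9):
    n, d = data.shape
    counts, edges = np.histogramdd(data, bins=nbins)
    # midpoints of the equal-width bins in each dimension
    mids = [0.5 * (e[:-1] + e[1:]) for e in edges]
    grid = np.stack(np.meshgrid(*mids, indexing="ij"), axis=-1).reshape(-1, d)
    # histogram value of each bin: count / (n * bin volume)
    vol = np.prod([e[1] - e[0] for e in edges])
    dep = counts.reshape(-1) / (n * vol)
    return grid, dep  # independent and dependent regression variables
```

A regression tree fitted to (grid, dep) then approximates the histogram by a coarser piecewise constant density estimate.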
As an example we will analyze data consisting of measurements on Swiss bank notes. These data are taken from Flury and Riedwyl (1988). One half of these bank notes are genuine, the other half are forged. Six size measurements are recorded for each note: the length of the bill, its height measured on the left and on the right sides, the widths of the bottom and top margins, and the length of the diagonal of the central image. In the analysis below we use only the last three variables (columns 4 to 6 of the data matrix).
```
; load library xclust and plot
library("xclust")
library("plot")
; set random seed
randomize(1)
; read swiss banknote data
dendat=read("bank2")
; select the last three variables
dendat=dendat[,4:6]
; choose 9 bins in each dimension
binlkm=9
; compute density estimate
regdat=dentoreg(dendat,binlkm)
; compute CART and tree
type=matrix(cols(dendat))
opt=cartsplitopt("minsize",50,"mindev",0,"mincut",1)
tr=cartsplit(regdat.ind,regdat.dep,type,opt)
; color data points according to the node they fall in
g=cartregr(tr, dendat, "node")
{gcat,gind}=groupcol(g, rows(g))
; compute cuts up to level 2 for (X1,X2)
xdat=regdat.ind
gr12=grcart2(xdat, tr, 1, 2, 10, 0)
xdat12=dendat[,1|2]
setmaskp(xdat12, gind)
; compute cuts up to level 2 for (X1,X3)
gr13=grcart2(xdat, tr, 1, 3, 10, 0)
xdat13=dendat[,1|3]
setmaskp(xdat13, gind)
; compute cuts up to level 2 for (X2,X3)
gr23=grcart2(xdat, tr, 2, 3, 10, 0)
xdat23=dendat[,2|3]
setmaskp(xdat23, gind)
; compute tree and its labels
{tree, treelabel}=grcarttree(tr)
; show all projections and the tree in a display
setsize(640, 480)
d=createdisplay(2,2)
show(d, 1,1, xdat12, gr12)
setgopt(d,1,1, "xlabel", "top (X1)", "ylabel", "bottom (X2)")
show(d,2,1, xdat13, gr13)
setgopt(d,2,1, "xlabel", "top (X1)")
setgopt(d,2,1, "ylabel", "diagonal (X3)")
show(d, 2,2, xdat23, gr23)
setgopt(d,2,2, "xlabel", "bottom (X2)")
setgopt(d,2,2, "ylabel", "diagonal (X3)")
axesoff()
show(d, 1,2, tree, treelabel)
axeson()
```
The result is shown in Figure 10.10. The upper left plot gives the cuts in the top-bottom plane, the lower left plot the cuts in the top-diagonal plane, and the lower right plot the cuts in the bottom-diagonal plane. The CART tree is shown in the upper right window.
All splits are done in the bottom-diagonal plane. The lower right plot shows that the CART algorithm essentially cuts the tails away from the main bulk of the data. Note the different colors in the left plots, which show that some cuts are not visible in the top-bottom or top-diagonal projections.
Since we have chosen to stop splitting when the number of observations in a node is less than 50 (see the parameter "minsize" of cartsplitopt in XAGcart09.xpl), we may choose a smaller number.
In XAGcart10.xpl we have chosen a smaller number (20), do not color the data points, and omit the tree labels. The main result is again that the CART algorithm cuts away the tails of the distribution and generates several distinct groups of nodes.
```
; load library xclust and plot
library("xclust")
library("plot")
; set random seed
randomize(1)
; read swiss banknote data
dendat=read("bank2")
; select the last three variables
dendat=dendat[,4:6]
; choose 9 bins in each dimension
binlkm=9
; compute density estimate
regdat=dentoreg(dendat,binlkm)
; compute CART and tree with a smaller minimal node size
type=matrix(cols(dendat))
opt=cartsplitopt("minsize",20,"mindev",0,"mincut",1)
tr=cartsplit(regdat.ind,regdat.dep,type,opt)
; compute cuts up to level 2 for (X1,X2)
xdat=regdat.ind
gr12=grcart2(xdat, tr, 1, 2, 10, 0)
xdat12=dendat[,1|2]
; compute cuts up to level 2 for (X1,X3)
gr13=grcart2(xdat, tr, 1, 3, 10, 0)
xdat13=dendat[,1|3]
; compute cuts up to level 2 for (X2,X3)
gr23=grcart2(xdat, tr, 2, 3, 10, 0)
xdat23=dendat[,2|3]
; compute tree and its labels
{tree, treelabel}=grcarttree(tr)
; show all projections and the tree in a display
setsize(640, 480)
d=createdisplay(2,2)
show(d, 1,1, xdat12, gr12)
setgopt(d,1,1, "xlabel", "top (X1)", "ylabel", "bottom (X2)")
show(d, 2,1, xdat13, gr13)
setgopt(d,2,1, "xlabel", "top (X1)", "ylabel", "diagonal (X3)")
show(d, 2,2, xdat23, gr23)
setgopt(d,2,2, "xlabel", "bottom (X2)", "ylabel", "diagonal (X3)")
show(d, 1,2, tree)
setgopt(d,1,2, "xlabel", " ", "ylabel", "log10(1+SSR)")
```