|
The semiparametric Cox proportional hazards model is the most commonly used model in hazard regression. In this model, the conditional hazard function, given the covariate value , is assumed to be of the form
where is the vector of regression coefficients, and denotes the baseline hazard function. No particular shape is assumed for the baseline hazard; it is estimated nonparametrically. The contributions of covariates to the hazard are multiplicative. An accessible introduction to the Cox model is given, for example, in Klein and Moeschberger (1997).The quantlib hazreg provides quantlets for estimating the regression coefficients, , standard deviations of these estimates, and estimates for the cumulative baseline hazard and the conditional survival function. Our calculations are based on standard partial likelihood methods in the proportional hazards model. Additionally, we provide three commonly used tests for the hypothesis that one or more of the 's are zero. These tests are useful for model choice procedures.
Let us assume that there are no ties between the event times. In this case, the partial likelihood function is given by
Let denote the maximum (partial) likelihood estimate of , obtained by maximizing the partial log-likelihood function, . From (5.4), it follows immediately that
For the case of ties in the event times, there are several ways to define a partial likelihood function. Currently, we are using formula (5.4) both for data with and without ties. Each event time contributes one factor to the likelihood function; for tied events, all events in the tie appear with the same denominator. This approach was suggested by Breslow (1974), and is implemented in most statistical packages. When there are few ties, this approximation to the partial likelihood works rather well, see Klein and Moeschberger (1997), p.238.
The information matrix
is given by the negative of the second
derivative of . Let
denote the
indicator vector of the risk set ; this means, the -th
element of
is when
, and
0, otherwise. Then, the information matrix takes the form
For large samples, the maximum likelihood estimate is known to follow an asymptotic -variate normal distribution,
The inverse of the information matrix, is a consistent estimate of the covariance matrix of . It may be used to construct confidence intervals for the components of .The quantlet hazregll computes the partial log-likelihood function, its first derivative (efficient scores), and the negative of the second derivative (information matrix). The first and second derivatives of the log-likelihood function (5.5) are later used to obtain , as well as for computing test statistics for local tests on . The syntax of hazregll is given below:
{ll,ll1,ll2} = hazregll(data,beta)
Input:
Output:
Example 5.The simulated data in the file haz01.dat were generated from a proportional hazards model with covariates, the conditional hazard function , and . The baseline hazard is constant, . We use the quantlet hazregll to calculate the partial log-likelihood function, the efficient scores and the information matrix at the true parameter value, .
library("hazreg") dat=read("haz01.dat") t = dat[,1] ; observed times delta = dat[,2] ; censoring indicator z = dat[,3:4] ; covariates {data,ties} = hazdat(t,delta, z) ; preparing data beta = 1|2 {ll,ll1,ll2} = hazregll(data,beta)
The calculations yield ll = -34.679 for the value of the log-likelihood function, for the first derivatives, and
The quantlet hazbeta calculates the maximum likelihood estimate by solving the nonlinear equation system , defined in (5.6). We use a Newton-Raphson algorithm with the stopping criterion
{betahat, betak, ck} = hazbeta(data {,maxit})
Input:
Output:
Example 6.In this example, we compute for the data in haz01.dat , and estimate the covariance matrix of by . We use the quantlets hazbeta and hazregll . The data was generated from a proportional hazards model with . Details are given in Examples 4 and 5.
library("hazreg") dat=read("haz01.dat") t = dat[,1] ; observed times delta = dat[,2] ; censoring indicator z = dat[,3:4] ; covariates {data,ties} = hazdat(t,delta, z) ; preparing data {betahat,betak,ck} = hazbeta(data) {ll, ll1, ll2} = hazregll(data, betahat) sigma = inv(ll2) ; covariance matrix estimate
The calculation results in , with the estimated covariance matrix
We estimate the cumulative baseline hazard function, , by
The estimate is a right-continuous step function, with jumps in the event times. The index cycles through all observations, In the case of tied events, each of the events in the tie contributes its own term to the sum; this term is the same for all events in a particular tie. The estimate can be derived through a profile likelihood approach, see Klein and Moeschberger (1997), pages 260 and 237, and Johansen (1983).We estimate the baseline survival function, by
In the Cox proportional hazards model, the survival function of an individual with covariate values is given by
Note that the estimates , and are all step functions, with jumps at the event times. All three estimates are non-negative, is monotonously increasing, and the survival function estimates are monotonously decreasing.
The quantlet hazcoxb provides the estimates and . It has the following syntax:
{bcumhaz, bsurv} = hazcoxb(data)
Input:
Example 7.In this example, we calculate and plot estimates of the cumulative baseline hazard and of the corresponding survival function , for the simulated data in haz01.dat . The estimates are calculated using the quantlet hazcoxb . Plotting of the step functions is supported by steps4plot . The resulting plots are displayed in Figures 5.2 and 5.3. The data in haz01.dat were generated from a proportional hazards model with and ; details are given in Examples 4 and 5.
library("hazreg") dat=read("haz01.dat") t = dat[,1] ; observed times delta = dat[,2] ; censoring indicator z = dat[,3:4] ; covariates {data,ties} = hazdat(t,delta, z) ; preparing data {bcumhaz,bsurv} = hazcoxb(data) ; compute estimates setsize(600,400) ; initiating graph plot1=createdisplay(1,1) ; initiating graph plot2=createdisplay(1,1) n = rows(data) ; sample size pm = (#(1,n+2)'+ (0:n))|(#(2*n+2,3*n+3)'+ (0:n)) ; points to be connected cn = matrix(2*n+2) ; color_num, controls colors ar = matrix(2*n+2) ; art, controls line types th = matrix(2*n+2) ; thick, controls line thickness bsurvline = steps4plot(bsurv, 0~1) ; points for step function plot setmaskl(bsurvline, pm, cn, ar, th) ; lines connected setmaskp(bsurvline, 4, 0, 8) ; points controlled bcumhazline = steps4plot(bcumhaz, 0~0) ; points for step function plot setmaskl(bcumhazline, pm, cn, ar, th) setmaskp(bcumhazline, 4, 0, 8) show(plot1, 1, 1, bcumhazline) ; plot baseline hazard setgopt(plot1, 1, 1, "title","Cumulative Baseline Hazard") setgopt(plot1, 1, 1, "xlabel","Time") setgopt(plot1, 1, 1, "ylabel","Cumulative Hazard") setgopt(plot1, 1, 1, "ymajor", 0.5) print (plot1,"hazbcumhaztest.ps") show(plot2,1, 1, bsurvline) ; plot baseline survival setgopt(plot2, 1, 1, "title","Baseline Survival Function") setgopt(plot2, 1, 1, "xlabel","Time") setgopt(plot2, 1, 1, "ylabel","Survival Function") setgopt(plot2, 1, 1, "ymajor", 0.2, "ylim", (0|1.01)) print (plot2,"hazbsurvtest.ps")
|
|
The quantlet hazsurv provides an estimate of the conditional survival function, , as defined in formula (5.9). It has the following syntax:
surv = hazsurv(data,z)
Input:
Example 8. We calculate and plot the estimate of the conditional survival function for , using the simulated data in haz01.dat . The resulting graph is displayed in Figure 5.4.
library("hazreg") dat=read("haz01.dat") t = dat[,1] ; observed times delta = dat[,2] ; censoring indicator z = dat[,3:4] ; covariates {data,ties} = hazdat(t,delta, z) ; preparing data z1 = 0.1|-0.3 ; covariate values surv = hazsurv(data, z1) ; estimate conditional survival function setsize(600, 400) ; initiating graph plot1=createdisplay(1,1) ; initiating graph n = rows(data) pm = (#(1,n+2)'+ (0:n))|(#(2*n+2,3*n+3)'+ (0:n)) ; points to be connected cn = matrix(2*n+2) ; color_num, controls colors ar = matrix(2*n+2) ; art, controls line types th = matrix(2*n+2) ; thick, controls line thickness survline = steps4plot(surv, 0~1) ; points for step function plot setmaskl(survline, pm, cn , ar, th) ; lines connected setmaskp(survline, 4, 0, 8) ; points controlled setsize(600,400) show(plot1, 1, 1, survline) setgopt(plot1, 1, 1, "title","Conditional Survival Function") setgopt(plot1, 1, 1, "xlabel","Time") setgopt(plot1, 1, 1, "ylabel","Survival Function") setgopt(plot1, 1, 1, "ylim", (0|1.01), "ymajor", 0.2) print (plot1,"hazsurvtest.ps")
The quantlib hazreg offers three tests for hypotheses about regression parameters: the likelihood ratio test, Wald's test and the score test. Assume that , where the -dimensional subvector consists of the regression coefficients of interest, and the -dimensional subvector contains the remaining parameters. We are testing the hypotheses versus , in the presence of the remaining unknown regression coefficients; denotes the -dimensional zero vector. This type of tests is often used in model choice procedures, testing whether a given model can be improved by including certain additional covariates or covariate combinations.
Under the null hypothesis, the asymptotic distribution of is . We calculate -values as tail probabilities of the distribution, .
For more details on hypothesis testing in the Cox model, see Klein and Moeschberger (1997), Section 8.4.
{ttest, val, df, pval} = haztest(data,index)
Input:
Example 9. We are testing the null hypothesis for the data in haz01.dat . The quantlet haztest provides values for three test statistics and computes the corresponding -values as tail probabilities of the distribution.
library("hazreg") dat=read("haz01.dat") t = dat[,1] ; observed times delta = dat[,2] ; censoring indicator z = dat[,3:4] ; covariates {data,ties} = hazdat(t,delta, z) ; preparing data index = 1|0 ; testing if the second ; coefficient is zero {testt, val,df,pval} = haztest(data, index) testt ; print summary table
The variable testt contains the summary table. The last line of the code prints the table into the XploRe output window:
"--------------------------------------------------------------" "Cox Proportional Hazards Model" "" "Hypothesis: beta1=0 for a subvector of regression coefficients" "" "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " " Test statistic DF P-value " "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " "LR Test 7.56687 1 0.00595" "Wald Test 7.04612 1 0.00794" "Score Test 4.25763 1 0.03908" "--------------------------------------------------------------" ""
Additionally, the test statistic values and the -values are stored in the variables val and pval, respectively. The data in haz01.dat were generated from a proportional hazards model with . For this sample, all three tests result in small -values, providing evidence against the null hypothesis .
Nursing homes provide both short-term and long-term care, and patients may stay from a few days to several years. In this section, we investigate how certain characteristics of nursing home patients influence their length of stay. We use a subset of the nursing home data presented by Morris, Norton, and Zhou (1994). The original study was sponsored by the National Center for Health Services in 1980-1982, and investigated the the impact of certain financial incentives on the nursing home care of Medicaid patients. Thirty six nursing homes were randomized into a treatment or control group; nursing homes in the treatment group received financial incentives for admitting more disabled Medicare patients, for improving their health status, and for discharging the patients to their homes within 90 days. The nursing home data consist of patients from this study. Patients were admitted to participating nursing homes between May 1, 1981, and April 30, 1982, and followed over a period of up to three years. The following characteristics were recorded: age, marital status, gender, health status. The data set is available on a floppy disk distributed with Lange et al. (1994), and at StatLib, http://www.stat.cmu.edu/datasets/csb/ .
We restrict our analysis to nursing homes in the control group, which represents the standard care without additional financial incentives, and to patients that were at least 65 years old and of comparatively good health (health status = 2). Our subset consists of patients. The patients were followed from admission to either discharge, or death. For patients that were still in a nursing home on April 30, 1983, the length of stay is recorded as censored (25 % of the observations).
Our subset of the nursing home data is provided in nursing.dat . The first column of the data file contains the length of stay in the nursing home (in days), the second column is the censoring indicator, and columns 3-5 contain the age (in years), the gender (1=male, 0=female), and the marital status (1=married, 0=unmarried), respectively. Twenty one percent of the patients are male, and 14 % are married.
In order to identify the impact of age, marital status and gender on the length of stay in a nursing home, we are fitting a Cox proportional hazards model with covariates: agespline, gender and married. The first variable measures the age of a patient as this transformation was suggested in Morris, Norton, and Zhou (1994). The other two covariates are indicator variables for the gender and marital status, respectively. The second column of Table 5.1 provides the sample means of the covariates and the standard deviation of agespline.
|
The following code reads in the data and calculates estimates of the regression coefficients, , and their covariance matrix:
library("hazreg") dat=read("nursing.dat") ; read data from file t = dat[,1] ; time = length of stay delta = dat[,2] ; censoring age = dat[,3] ; covariate AGE gender = dat[,4] ; covariate GENDER married = dat[,5] ; covariate MARRIED limit = matrix(rows(dat), 1)*90 ; transform AGE agespline = min(age~limit,2) - 65 {data, ties} = hazdat(t, delta, agespline~gender~married) ; prepare data {betahat, betak, ck}=hazbeta(data) ; estimate beta {ll, ll1, ll2} = hazregll(data,betahat) sigma = inv(ll2) ; covariance matrix of betahat
Table 5.1 presents the estimated regression coefficients in the fitted model (, the value of betahat) and their estimated standard deviation, SE(), obtained as square root of the diagonal elements of sigma.
Let us test, for each of the covariates, whether it contributes to the hazard in the presence of the other two variables:
{ttest1, val1, df1, pval1} = haztest(data, (0|1|1)) ;test for AGESPLINE {ttest2, val2, df2, pval2} = haztest(data, (1|0|1)) ;test for GENDER {ttest3, val3, df3, pval3} = haztest(data, (1|1|0)) ;test for MARRIED
The only covariate with a significant contribution to the hazard is the age. In comparison, in Morris, Norton, and Zhou (1994), a Cox model is fitted to all observations, with additional variables that identify the health status at entry and the treatment group. Here, the age does not appear to be significant, while gender, marital status and poor health significantly contribute to the length of stay.
These results are an example that caution is advised in interpreting fitted models. In our case, gender and marital status are correlated with the health status: married patients tend to enter the nursing home with more advanced health problems, and men are more likely than women to be admitted in poorer health. In our restricted data set of patients with similar, good health at entry, neiter gender nor marital status are helpful for modeling the expected length of stay in the framework of proportional hazards.