# 5.3 The Cox Proportional Hazards Model

 {ll, ll1, ll2} = hazregll (data,beta) calculates the value of the partial log-likelihood function and of the first two derivatives {betahat, betak, ck} = hazbeta (data {,maxit}) estimates the regression coefficients for the Cox proportional hazards model {bhaz, bsurv} = hazbase (data) estimates the baseline hazard and survival functions surv = hazsurv (data, z) estimates the conditional survival function {val, df, pval} = haztest (data, index) performs the likelihood ratio test, Wald's test and the score test

The semiparametric Cox proportional hazards model is the most commonly used model in hazard regression. In this model, the conditional hazard function, given the covariate value , is assumed to be of the form

where is the vector of regression coefficients, and denotes the baseline hazard function. No particular shape is assumed for the baseline hazard; it is estimated nonparametrically. The contributions of covariates to the hazard are multiplicative. An accessible introduction to the Cox model is given, for example, in Klein and Moeschberger (1997).

The quantlib hazreg provides quantlets for estimating the regression coefficients, , standard deviations of these estimates, and estimates for the cumulative baseline hazard and the conditional survival function. Our calculations are based on standard partial likelihood methods in the proportional hazards model. Additionally, we provide three commonly used tests for the hypothesis that one or more of the 's are zero. These tests are useful for model choice procedures.

## 5.3.1 Estimating the Regression Coefficients

Let us assume that there are no ties between the event times. In this case, the partial likelihood function is given by

 (5.4)

where denotes the risk set at time . Note that only event times contribute their own factor to the partial likelihood. However, both censored and uncensored observations appear in the denominator, where the sum over the risk set includes all individuals who are still at risk immediately prior to .

Let denote the maximum (partial) likelihood estimate of , obtained by maximizing the partial log-likelihood function, . From (5.4), it follows immediately that

 (5.5)

The first derivative of with respect to is called vector of efficient scores, given by

 (5.6)

where denotes the vector of censoring indicators, and is the -matrix of covariate values, with the -th row containing the covariate values of the -th individual, .

For the case of ties in the event times, there are several ways to define a partial likelihood function. Currently, we are using formula (5.4) both for data with and without ties. Each event time contributes one factor to the likelihood function; for tied events, all events in the tie appear with the same denominator. This approach was suggested by Breslow (1974), and is implemented in most statistical packages. When there are few ties, this approximation to the partial likelihood works rather well, see Klein and Moeschberger (1997), p.238.

The information matrix is given by the negative of the second derivative of . Let  denote the indicator vector of the risk set ; this means, the -th element of is when , and 0, otherwise. Then, the information matrix takes the form

 (5.7)

where the are scalars; for any vector , denotes the diagonal matrix with the main diagonal , and is defined elementwise; and . The matrices are modifications of the design matrix , setting the rows of to zero when the corresponding observation is not in the risk set for time . Note that the index  runs through all  observations. When ties are present, each of the tied event times appears once, with the same risk set, and contributes the same term to the information matrix.

For large samples, the maximum likelihood estimate is known to follow an asymptotic -variate normal distribution,

The inverse of the information matrix, is a consistent estimate of the covariance matrix of . It may be used to construct confidence intervals for the components of .

The quantlet hazregll computes the partial log-likelihood function, its first derivative (efficient scores), and the negative of the second derivative (information matrix). The first and second derivatives of the log-likelihood function (5.5) are later used to obtain , as well as for computing test statistics for local tests on . The syntax of hazregll is given below:

   {ll,ll1,ll2} = hazregll(data,beta)


Input:

data
matrix, the sorted data matrix obtained as output data of the quantlet hazdat ;
beta
vector, the regression coefficient vector .

Output:

ll
scalar, the log-likelihood function at parameter value ;
ll1
vector, the first derivatives of the log-likelihood function at parameter value ;
ll2
matrix, the negative Hessian matrix of the log-likelihood function at parameter value (information matrix).

Example 5.The simulated data in the file haz01.dat were generated from a proportional hazards model with covariates, the conditional hazard function , and . The baseline hazard is constant, . We use the quantlet hazregll to calculate the partial log-likelihood function, the efficient scores and the information matrix at the true parameter value, .

  library("hazreg")
t = dat[,1]                        ; observed times
delta = dat[,2]                    ; censoring indicator
z = dat[,3:4]                      ; covariates
{data,ties} = hazdat(t,delta, z)   ; preparing data
beta = 1|2
{ll,ll1,ll2} = hazregll(data,beta)


The calculations yield ll = -34.679 for the value of the log-likelihood function, for the first derivatives, and

for the information matrix.

The quantlet hazbeta calculates the maximum likelihood estimate by solving the nonlinear equation system , defined in (5.6). We use a Newton-Raphson algorithm with the stopping criterion

The syntax of hazbeta is given below:

  {betahat, betak, ck} = hazbeta(data {,maxit})


Input:

data
matrix, the sorted data matrix obtained as output data of hazdat ;
maxit
scalar, maximum number of iteration for the Newton-Raphson procedure, default is .

Output:

betahat
vector, estimate of the regression parameter ;
betak
matrix, iterated parameter values through the Newton-Raphson procedure;
ck
vector, values of the convergence criterion at each iteration of the Newton-Raphson procedure.

Example 6.In this example, we compute for the data in haz01.dat , and estimate the covariance matrix of by . We use the quantlets hazbeta and hazregll . The data was generated from a proportional hazards model with . Details are given in Examples 4 and 5.

  library("hazreg")
t = dat[,1]                      ; observed times
delta = dat[,2]                  ; censoring indicator
z = dat[,3:4]                    ; covariates
{data,ties} = hazdat(t,delta, z) ; preparing data
{betahat,betak,ck} = hazbeta(data)
{ll, ll1, ll2} = hazregll(data, betahat)
sigma = inv(ll2)                 ; covariance matrix estimate


The calculation results in , with the estimated covariance matrix

Both components and are within their respective asymptotic 95% confidence intervals that may be constructed with betahat and the square root of the diagonal elements of sigma .

## 5.3.2 Estimating the Hazard and Survival Functions

We estimate the cumulative baseline hazard function, , by

The estimate is a right-continuous step function, with jumps in the event times. The index cycles through all observations, In the case of tied events, each of the events in the tie contributes its own term to the sum; this term is the same for all events in a particular tie. The estimate can be derived through a profile likelihood approach, see Klein and Moeschberger (1997), pages 260 and 237, and Johansen (1983).

We estimate the baseline survival function, by

In the Cox proportional hazards model, the survival function of an individual with covariate values is given by

 (5.8)

Consequently, we estimate the conditional survival function by substituting estimates for and ,

 (5.9)

Note that the estimates , and are all step functions, with jumps at the event times. All three estimates are non-negative, is monotonously increasing, and the survival function estimates are monotonously decreasing.

The quantlet hazcoxb provides the estimates and . It has the following syntax:

  {bcumhaz, bsurv} = hazcoxb(data)


Input:

data
matrix, the sorted data matrix given by the output data of hazdat .
Output:

bcumhaz
matrix, with rows , sorted in the same order as the in data;
bsurv
matrix, with rows , sorted in the same order as the in data.

Example 7.In this example, we calculate and plot estimates of the cumulative baseline hazard and of the corresponding survival function , for the simulated data in haz01.dat . The estimates are calculated using the quantlet hazcoxb . Plotting of the step functions is supported by steps4plot . The resulting plots are displayed in Figures 5.2 and 5.3. The data in haz01.dat were generated from a proportional hazards model with and ; details are given in Examples 4 and 5.

  library("hazreg")
t = dat[,1]                            ; observed times
delta = dat[,2]                        ; censoring indicator
z = dat[,3:4]                          ; covariates
{data,ties} = hazdat(t,delta, z)       ; preparing data
{bcumhaz,bsurv} = hazcoxb(data)        ; compute estimates

setsize(600,400)                       ; initiating graph
plot1=createdisplay(1,1)               ; initiating graph
plot2=createdisplay(1,1)
n = rows(data)                         ; sample size
pm = (#(1,n+2)'+ (0:n))|(#(2*n+2,3*n+3)'+ (0:n))
; points to be connected
cn = matrix(2*n+2)             ; color_num, controls colors
ar = matrix(2*n+2)             ; art, controls line types
th = matrix(2*n+2)             ; thick, controls line thickness

bsurvline = steps4plot(bsurv, 0~1)
; points for step function plot
setmaskl(bsurvline, pm, cn, ar, th)    ; lines connected
setmaskp(bsurvline, 4, 0, 8)           ; points controlled

bcumhazline = steps4plot(bcumhaz, 0~0)
; points for step function plot

show(plot1, 1, 1, bcumhazline)      	; plot baseline hazard
setgopt(plot1, 1, 1, "title","Cumulative Baseline Hazard")
setgopt(plot1, 1, 1, "xlabel","Time")
setgopt(plot1, 1, 1, "ylabel","Cumulative Hazard")
setgopt(plot1, 1, 1, "ymajor", 0.5)
print (plot1,"hazbcumhaztest.ps")

show(plot2,1, 1, bsurvline)            ; plot baseline survival
setgopt(plot2, 1, 1, "title","Baseline Survival Function")
setgopt(plot2, 1, 1, "xlabel","Time")
setgopt(plot2, 1, 1, "ylabel","Survival  Function")
setgopt(plot2, 1, 1,  "ymajor", 0.2, "ylim", (0|1.01))
print (plot2,"hazbsurvtest.ps")


The quantlet hazsurv provides an estimate of the conditional survival function, , as defined in formula (5.9). It has the following syntax:

   surv = hazsurv(data,z)


Input:

data
matrix, the sorted data matrix given by the output data of hazdat ;
z
vector, value of the covariates;
Output:
surv
matrix, the first column is the sorted , followed by the estimated conditional survival function .

Example 8. We calculate and plot the estimate of the conditional survival function for , using the simulated data in haz01.dat . The resulting graph is displayed in Figure 5.4.

  library("hazreg")
t = dat[,1]                      ; observed times
delta = dat[,2]                  ; censoring indicator
z = dat[,3:4]                    ; covariates
{data,ties} = hazdat(t,delta, z) ; preparing data
z1 = 0.1|-0.3                    ; covariate values
surv = hazsurv(data, z1)
; estimate conditional survival function

setsize(600, 400)                ; initiating graph
plot1=createdisplay(1,1)         ; initiating graph
n = rows(data)
pm = (#(1,n+2)'+ (0:n))|(#(2*n+2,3*n+3)'+ (0:n))
; points to be connected
cn = matrix(2*n+2)               ; color_num, controls colors
ar = matrix(2*n+2)               ; art, controls line types
th = matrix(2*n+2)
; thick, controls line thickness

survline = steps4plot(surv, 0~1)
; points for step function plot
setmaskl(survline, pm, cn , ar, th) ; lines connected
setmaskp(survline, 4, 0, 8)         ; points controlled
setsize(600,400)

show(plot1, 1, 1, survline)
setgopt(plot1, 1, 1, "title","Conditional Survival Function")
setgopt(plot1, 1, 1, "xlabel","Time")
setgopt(plot1, 1, 1, "ylabel","Survival Function")
setgopt(plot1, 1, 1, "ylim", (0|1.01), "ymajor", 0.2)
print (plot1,"hazsurvtest.ps")


## 5.3.3 Hypothesis Testing

The quantlib hazreg offers three tests for hypotheses about regression parameters: the likelihood ratio test, Wald's test and the score test. Assume that , where the -dimensional subvector consists of the regression coefficients of interest, and the -dimensional subvector contains the remaining parameters. We are testing the hypotheses versus , in the presence of the remaining unknown regression coefficients; denotes the -dimensional zero vector. This type of tests is often used in model choice procedures, testing whether a given model can be improved by including certain additional covariates or covariate combinations.

### 5.3.3.1 Likelihood Ratio Test

The test statistic for the likelihood ratio test is given by

where , and and are the -dimensional zero vector and the conditional maximum likelihood estimate for , given , respectively. The estimate is obtained by substituting the fixed null hypothesis value, , for the corresponding 's in the partial log-likelihood function (5.5).

Under the null hypothesis, the asymptotic distribution of is . We calculate -values as tail probabilities of the distribution, .

### 5.3.3.2 Wald Test

Let denote the usual maximum partial likelihood estimate of the full parameter vector . Now, let us partition the inverse of the information matrix into

where denotes the submatrix corresponding to . The information matrix is defined in (5.8). The test statistic for the Wald test is given by

Under the null hypothesis, the distribution of converges to .

### 5.3.3.3 Score Test

Let denote the subvector of the first elements of the score function , defined in (5.6). The test statistic for the score test is

where is the maximum likelihood estimate for under the null hypothesis, introduced in Section 5.3.3. Again, the large sample distribution of the test statistic under the null hypothesis is .

For more details on hypothesis testing in the Cox model, see Klein and Moeschberger (1997), Section 8.4.

### 5.3.3.4 Implementation

Values of the three test statistics , and , and the corresponding asymptotic -values are provided by the quantlet haztest . -values are computed as tail probabilities of the distribution, which is the asymptotic distribution for each of the three tests. The syntax of haztest is given below:

   {ttest, val, df, pval} = haztest(data,index)


Input:

data
matrix, the sorted data matrix given by the output data of hazdat ;
index
vector, with index[i]=0 when is part of the null hypothesis, and index[i]=1, otherwise;
Output:
ttest
printed output, table with values of the test statistics, degrees of freedom and -values for the likelihood ratio test, Wald's test and the score test.
val
vector, values of the test statistics, in the following order: likelihood ratio test, Wald's test, score test;
df
scalar, degrees of freedom of the reference distribution;
pval
vector, -values of the tests.

Example 9. We are testing the null hypothesis for the data in haz01.dat . The quantlet haztest provides values for three test statistics and computes the corresponding -values as tail probabilities of the distribution.

  library("hazreg")
t = dat[,1]                        ; observed times
delta = dat[,2]                    ; censoring indicator
z = dat[,3:4]                      ; covariates
{data,ties} = hazdat(t,delta, z)   ; preparing data
index = 1|0                        ; testing if the second
; coefficient is zero
{testt, val,df,pval} = haztest(data, index)
testt                              ; print summary table


The variable testt contains the summary table. The last line of the code prints the table into the XploRe output window:

 "--------------------------------------------------------------"
"Cox Proportional Hazards Model"
""
"Hypothesis: beta1=0 for a subvector of regression coefficients"
""
"- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - "
"                  Test statistic     DF        P-value        "
"- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - "
"LR Test              7.56687         1         0.00595"
"Wald Test            7.04612         1         0.00794"
"Score Test           4.25763         1         0.03908"
"--------------------------------------------------------------"
""


Additionally, the test statistic values and the -values are stored in the variables val and pval, respectively. The data in haz01.dat were generated from a proportional hazards model with . For this sample, all three tests result in small -values, providing evidence against the null hypothesis .

## 5.3.4 Example: Length of Stay in Nursing Homes

Nursing homes provide both short-term and long-term care, and patients may stay from a few days to several years. In this section, we investigate how certain characteristics of nursing home patients influence their length of stay. We use a subset of the nursing home data presented by Morris, Norton, and Zhou (1994). The original study was sponsored by the National Center for Health Services in 1980-1982, and investigated the the impact of certain financial incentives on the nursing home care of Medicaid patients. Thirty six nursing homes were randomized into a treatment or control group; nursing homes in the treatment group received financial incentives for admitting more disabled Medicare patients, for improving their health status, and for discharging the patients to their homes within 90 days. The nursing home data consist of patients from this study. Patients were admitted to participating nursing homes between May 1, 1981, and April 30, 1982, and followed over a period of up to three years. The following characteristics were recorded: age, marital status, gender, health status. The data set is available on a floppy disk distributed with Lange et al. (1994), and at StatLib, http://www.stat.cmu.edu/datasets/csb/ .

We restrict our analysis to nursing homes in the control group, which represents the standard care without additional financial incentives, and to patients that were at least 65 years old and of comparatively good health (health status = 2). Our subset consists of patients. The patients were followed from admission to either discharge, or death. For patients that were still in a nursing home on April 30, 1983, the length of stay is recorded as censored (25 % of the observations).

Our subset of the nursing home data is provided in nursing.dat . The first column of the data file contains the length of stay in the nursing home (in days), the second column is the censoring indicator, and columns 3-5 contain the age (in years), the gender (1=male, 0=female), and the marital status (1=married, 0=unmarried), respectively. Twenty one percent of the patients are male, and 14 % are married.

In order to identify the impact of age, marital status and gender on the length of stay in a nursing home, we are fitting a Cox proportional hazards model with covariates: agespline, gender and married. The first variable measures the age of a patient as this transformation was suggested in Morris, Norton, and Zhou (1994). The other two covariates are indicator variables for the gender and marital status, respectively. The second column of Table 5.1 provides the sample means of the covariates and the standard deviation of agespline.

Table: Covariates in a Cox proportional hazards model fitted to the data in nursing.dat . The time-to-event is the length of stay of patients in a nursing home (in days).
 agespline Covariate Mean LR Test Wald Test Score Test (SD) SE() -value -value -value 15.6 -0.052 0.00001 0.00001 (7.06) (0.011) gender 0.210 0.037 0.86 0.86 0.09 (0.213) married 0.140 0.040 0.87 0.87 0.12 (0.246)

The following code reads in the data and calculates estimates of the regression coefficients, , and their covariance matrix:

  library("hazreg")
t = dat[,1]                            ; time = length of stay
delta = dat[,2]                        ; censoring
age = dat[,3]                          ; covariate AGE
gender = dat[,4]                       ; covariate GENDER
married = dat[,5]                      ; covariate MARRIED
limit = matrix(rows(dat), 1)*90        ; transform AGE
agespline = min(age~limit,2) - 65
{data, ties} = hazdat(t, delta, agespline~gender~married)
; prepare data
{betahat, betak, ck}=hazbeta(data)     ; estimate beta
{ll, ll1, ll2} = hazregll(data,betahat)
sigma = inv(ll2)
; covariance matrix of betahat


Table 5.1 presents the estimated regression coefficients in the fitted model (, the value of betahat) and their estimated standard deviation, SE(), obtained as square root of the diagonal elements of sigma.

Let us test, for each of the covariates, whether it contributes to the hazard in the presence of the other two variables:

  {ttest1, val1, df1, pval1} = haztest(data, (0|1|1))
;test for AGESPLINE
{ttest2, val2, df2, pval2} = haztest(data, (1|0|1))
;test for GENDER
{ttest3, val3, df3, pval3} = haztest(data, (1|1|0))
;test for MARRIED


The variables ttest1, ttest2 and ttest3 contain the summary tables for the covariates agespline, gender, and married, respectively. The -values are provided in Table 5.1.

The only covariate with a significant contribution to the hazard is the age. In comparison, in Morris, Norton, and Zhou (1994), a Cox model is fitted to all observations, with additional variables that identify the health status at entry and the treatment group. Here, the age does not appear to be significant, while gender, marital status and poor health significantly contribute to the length of stay.

These results are an example that caution is advised in interpreting fitted models. In our case, gender and marital status are correlated with the health status: married patients tend to enter the nursing home with more advanced health problems, and men are more likely than women to be admitted in poorer health. In our restricted data set of patients with similar, good health at entry, neiter gender nor marital status are helpful for modeling the expected length of stay in the framework of proportional hazards.