5.1 Data Structure


{data, ties} = hazdat (t, delta{, z})
sorts the times t in ascending order, cosorts the censoring indicator delta and the covariates in z, and provides tie information
nar = haznar (data)
calculates the size of the risk set at each observed time point
atrisk = hazrisk (data, i)
determines which observations are at risk at time $ t_i$

The quantlib hazreg provides methods for analyzing right-censored time-to-event data. The observed data are triples $ (t_{i}, \delta_{i}, z_{i})$, $ i=1, \ldots, n$, where $ t_{i}$ denotes the observed survival time of the $ i$-th individual, $ z_{i}=(z_{i1}, \ldots ,z_{ip})^T\ $ denotes the $ p$-dimensional covariate vector associated with the $ i$-th individual, and $ \delta_{i}$ is the censoring indicator.

Let $ y_{i}$ denote the uncensored survival time, and $ c_{i}$ the random censoring time. The observed survival time of the $ i$-th individual is then given by $ t_{i} = \min (y_{i}, c_{i})$. The censoring indicator takes the value $ \delta_{i}=1$ when $ y_i \leq c_i$; in this case, the observed time, $ t_i = y_i$, is called an event time. Otherwise, $ \delta_{i}=0$, and the observed time is censored, $ t_i = c_i$. We assume that censoring is uninformative: given the covariate values, the survival time and the censoring time are conditionally independent.
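The construction of the observed pair $ (t_i, \delta_i)$ from $ y_i$ and $ c_i$ can be sketched in a few lines. The following Python fragment is an illustration only and not part of hazreg; the function name observe is hypothetical, and the input values are the ones used in Example 1 of this section.

```python
# Illustrative sketch, not hazreg code: "observe" is a hypothetical helper.
def observe(y, c):
    """Return (t, delta) with t_i = min(y_i, c_i) and delta_i = 1 if y_i <= c_i."""
    t = [min(yi, ci) for yi, ci in zip(y, c)]
    delta = [1 if yi <= ci else 0 for yi, ci in zip(y, c)]
    return t, delta

y = [2, 1, 3, 2, 4, 7, 1, 3, 2]   # uncensored event times
c = [3, 1, 5, 6, 1, 6, 2, 4, 5]   # censoring times
t, delta = observe(y, c)
print(t)      # [2, 1, 3, 2, 1, 6, 1, 3, 2]
print(delta)  # [1, 1, 1, 1, 0, 0, 1, 1, 1]
```

Observations 5 and 6 have $ y_i > c_i$, so they are the two censored cases.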

Many computations require information on the presence and location of ties. We could locate the ties each time a method needs this information. However, in a typical session the same data set is studied for various purposes, so it is more efficient to gather the tie information once and link it to the data set. We therefore compile most of the necessary information into a matrix data, which is passed as an argument to the various data analysis quantlets.

The quantlet hazdat sorts the right-censored data $ (t_{i}, \delta_{i}, z_{i})$, $ i=1, \ldots, n$, in ascending order with respect to time $ t$, cosorts the censoring indicator and covariate values, evaluates ties, and organizes the data and tie information in the matrix data.

It has the following syntax:

  {data,ties} = hazdat(t, delta {,z})

Input:

t
$ n \times 1 $ vector of survival times $ t_{i}$,
delta
$ n \times 1 $ vector of censoring indicators $ \delta_{i}$,
z
$ n \times p$ matrix of covariate values, with rows $ z_{i}^T$; default is an empty matrix.

Output:

data
$ n \times (p+4)$ matrix of cosorted time-to-event data, with
column 1: observed times $ t_{i}$, sorted in ascending order,
column 2: censoring indicator $ \delta_{i}$, cosorted,
column 3: original observation labels ( $ 1, \ldots, n$), cosorted,
column 4: number of tied observations in time $ t_{i}$, cosorted,
columns 5 through $ (p+4)$: covariate values $ z_{i}^T=(z_{i1}, \ldots, z_{ip})$, cosorted;
ties
scalar, indicator of ties, with ties=1 when ties in the $ t_{i}$ are present, and ties=0 when there are no ties.

Example 1. This example illustrates the use of the quantlet hazdat . The censoring times and the uncensored event times are chosen so that the observed times contain ties, which demonstrates how ties are handled (column 4 in data, and tie indicator ties=1). There are no covariates. Note that at the start of each session, the quantlib hazreg has to be loaded manually with the command library("hazreg") .

  library("hazreg") 
  y = 2|1|3|2|4|7|1|3|2        ; uncensored event times
  c = 3|1|5|6|1|6|2|4|5        ; censoring times
  t = min(y~c,2)               ; observed (censored) times             
  delta = (y<=c)               ; censoring indicator            
  {data,ties} = hazdat(t,delta)                             
  data
  ties
haz01.xpl

The variables data and ties take the following values:

data =
        1        0        5        3 
        1        1        7        3 
        1        1        2        3 
        2        1        4        3 
        2        1        9        3 
        2        1        1        3 
        3        1        8        2 
        3        1        3        2 
        6        0        6        1 

ties = 
        1

The first column of data provides the observed times in ascending order. Column 3 gives the original observation labels, that is, the position of each observation in the unsorted sample. The elements of column 4 count how many observations (censored or uncensored) are tied at the corresponding time. In our data, three observations each are tied at the time points $ t=1$ and $ t=2$, and two observations are tied at $ t=3$.
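To make the layout of data concrete, the following Python sketch rebuilds the columns described above from t and delta. It is an illustration under stated assumptions, not the hazdat source: the name hazdat_sketch is hypothetical, and a stable sort is assumed, so rows within a group of tied times keep their original order and may appear in a different order than in the hazdat output shown above.

```python
from collections import Counter

# Hypothetical Python analogue of the data layout described for hazdat.
def hazdat_sketch(t, delta, z=None):
    n = len(t)
    z = z if z is not None else [[] for _ in range(n)]   # no covariates by default
    counts = Counter(t)                                  # observations per distinct time
    order = sorted(range(n), key=lambda i: t[i])         # stable sort by observed time
    # columns: time, censoring indicator, original label (1-based), tie count, covariates
    data = [[t[i], delta[i], i + 1, counts[t[i]]] + list(z[i]) for i in order]
    ties = 1 if any(k > 1 for k in counts.values()) else 0
    return data, ties

t = [2, 1, 3, 2, 1, 6, 1, 3, 2]
delta = [1, 1, 1, 1, 0, 0, 1, 1, 1]
data, ties = hazdat_sketch(t, delta)
for row in data:
    print(row)
print(ties)
```

Columns 1 and 4 of the result agree with the matrix above; only the order of rows within a group of tied times depends on the sorting routine.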

REMARK 5.1   Most of our hazard regression quantlets require an input variable data, which provides the time-to-event data and tie information in exactly the same format as the hazdat output variable data (first element in the output list). Therefore, we recommend running the quantlet hazdat at the beginning of each session, and again whenever a different set of covariates or a subset of time points is to be considered.

In order to simplify notation, we assume from now on that the observed times are sorted, $ t_{1}\leq t_{2}\leq \ldots \leq t_{n}$.

For many calculations we need to know which observations are in the risk set for any given event time. The risk set at time $ t$ is defined as $ R(t) = \{j\!:\, t\leq t_j\}$. It consists of all observations that neither had an event nor were censored prior to time $ t$, and thus are still at risk for an event. The quantlet hazrisk determines the observations at risk at a given observed time point, $ t_i$. The syntax is given below:

  atrisk = hazrisk(data,i)

Input:

data
$ n \times (p+4)$ matrix, the sorted data matrix given by the output data of hazdat ;
i
scalar, the position of $ t_{i}$ in the ordered list $ t_{1}\leq t_{2}\leq \ldots \leq t_{n}$.

Output:

atrisk
$ n \times 1 $ vector with elements 0 or 1 that indicate whether observations are in the risk set at time $ t_{i}$: atrisk[j] = 1 when $ t_{i}\leq t_{j}$, and atrisk[j] = 0 otherwise.

Example 2. We illustrate the use of the quantlet hazrisk with the data set of Example 1. The first six lines of the XploRe code are identical to those of Example 1. In line 6, we call hazdat to organize the observations and the tie information into the matrix data, which is displayed as output of hazdat in Example 1. In line 7, data is passed as an input argument to the quantlet hazrisk .

  library("hazreg")          
  y = 2|1|3|2|4|7|1|3|2           ; uncensored event times
  c = 3|1|5|6|1|6|2|4|5           ; censoring times
  t = min(y~c,2)                  ; observed (censored) times            
  delta = (y<=c)                  ; censoring indicator                                        
  {data,ties} = hazdat(t,delta)   ; organize data
  atrisk = hazrisk(data,6)        ; risk set at observation 6    
  atrisk
haz02.xpl

The variable atrisk takes the value $ \texttt{atrisk} = (0, 0, 0, 1, 1, 1, 1, 1, 1)^T$. In this example, the times $ t_4 = t_5 = t_6$ are tied. Therefore, the risk set at time $ t_{6}$ includes all observations with index $ j\geq 4$.
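The rule atrisk[j] = 1 when $ t_{i}\leq t_{j}$ can be checked by hand with a short Python sketch. This is not the hazrisk implementation; the function name hazrisk_sketch is hypothetical, and it takes only the sorted times (column 1 of data) rather than the full matrix.

```python
# Hypothetical illustration of the risk-set indicator, not hazreg code.
def hazrisk_sketch(times, i):
    """times: observed times in ascending order; i: 1-based position of t_i."""
    ti = times[i - 1]
    # observation j is at risk at t_i exactly when t_i <= t_j
    return [1 if ti <= tj else 0 for tj in times]

times = [1, 1, 1, 2, 2, 2, 3, 3, 6]   # column 1 of data in Example 1
atrisk = hazrisk_sketch(times, 6)
print(atrisk)  # [0, 0, 0, 1, 1, 1, 1, 1, 1]
```

Because the comparison uses $ \leq$, calling the sketch with i = 4 or i = 5 gives the same vector: tied observations share one risk set.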

The quantlet haznar returns the size of the risk set at each observed time $ t_{i}$, $ i=1, \ldots, n$. Its syntax follows below:

   nar = haznar(data)

Input:

data
$ n \times (p+4)$ matrix, the sorted data matrix given by the output data of hazdat .

Output:

nar
$ n \times 1 $ vector, the number of observations at risk at each observed time point.

Example 3. The use of the quantlet haznar is illustrated with the same data set as in the previous two examples. Again, the first six lines of code are identical to those of Example 1 and prepare the data. The input matrix data is obtained as part of the output of the hazdat call; data is displayed in Example 1.

  library("hazreg") 
  y = 2|1|3|2|4|7|1|3|2           ; uncensored event times
  c = 3|1|5|6|1|6|2|4|5           ; censoring times
  t = min(y~c,2)                  ; observed (censored) times            
  delta = (y<=c)                  ; censoring indicator                                        
  {data,ties} = hazdat(t,delta)
  nar = haznar(data)              ; calculate the number at risk             
  nar
haz03.xpl

The output variable nar takes the value $ \texttt{nar}=(9, 9, 9, 6, 6, 6, 3, 3, 1)^T$. The first three observations are tied, and, therefore, have identical risk sets.
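The size of the risk set at $ t_i$ is the number of indices $ j$ with $ t_i \leq t_j$. Since the times are sorted, this count can be read off from the position of the first observation with time $ t_i$. The Python sketch below illustrates that idea; it is not the haznar source, and the name haznar_sketch is hypothetical.

```python
from bisect import bisect_left

# Hypothetical illustration of the number at risk, not hazreg code.
def haznar_sketch(times):
    """times: observed times in ascending order (column 1 of data)."""
    n = len(times)
    # bisect_left returns the first position whose time is >= t_i,
    # so all observations from that position onward are at risk at t_i.
    return [n - bisect_left(times, ti) for ti in times]

times = [1, 1, 1, 2, 2, 2, 3, 3, 6]   # column 1 of data in Example 1
nar = haznar_sketch(times)
print(nar)  # [9, 9, 9, 6, 6, 6, 3, 3, 1]
```

The binary search relies on the ascending order of the times, which is one reason to sort the data once at the start of a session.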