With the continuing advancement in the use of biological markers in epidemiology and genetic studies, which often involve expensive assays, there is a growing incentive to further improve study efficiency and power by optimally incorporating into the statistical analysis the available auxiliary covariate. Some proposed methods have been developed for the univariate survival time data in the areas of mismeasured covariates, missing data, and auxiliary covariate problems. This includes, but is not limited to [1-3].
Models dealing with multivariate failure time data where the true covariates of interest are fully available for all subjects have been well studied. In particular, if the correlation among the observations is not of interest, the marginal proportional hazards model is widely used, e.g., [4-9]. There has been limited progress on the methods for dealing with covariate measurement error for multivariate failure time. Greene and Cai [10] proposed using the SIMEX approach for handling measurement errors in the marginal hazards model for multivariate failure time data, when a validation set is not available. Liu, Zhou and Cai [11] consider an inference procedure for multivariate failure time with auxiliary covariate information.
Clustered failure time data arise when the study subjects are sampled in clusters so that the failure times within the same cluster tend to be corrected. In this article, assuming a validation set is available, we develop an estimated pseudopartial likelihood method for handling auxiliary covariates for clustered failure time data under the framework of the marginal hazards model with distinguishable baseline hazards.
The rest of the article is organized as follows. Section 2 outlines the marginal hazard model and present the estimated pseudopartial likelihood estimator. In Section 3, We characterize the asymptotic properties of the proposed estimator and propose a variance estimator. We conclude the article with some discussion in Section 4. Outline of the proof for theoretical results are given in the Appendix.
Suppose that there are $n$ independent clusters. In cluster $i$, there are $J$ subjects. For subject $j$ in cluster $i$, $K$ different types of failures may occur. Let $(i, j, k)$ denote the $k$th type of failure on subject $j$ in $i$th cluster, for $i=1, \cdots, n; j=1, \cdots, J; k=1, \cdots, K$. Let $T_{ijk}$ and $C_{ijk}$ denote the potential failure time and censoring time, respectively. With censoring, we observe $X_{ijk}=\min(T_{ijk}, C_{ijk})$. Let $\Delta_{ijk}=I(X_{ijk}\leq C_{ijk})$ be the failure indicator and $Y_{ijk}(t)=I(X_{ijk}\geq t)$ denote the at-risk indicator process. Let $(E_{ijk}, Z_{ijk})$ denote a set of covariate, where $E_{ijk}$ is the primary exposure subjecting to missing and $Z_{ijk}=(Z_{ijk1}, \cdots, Z_{ijkd})'$ is the remaining observed covariates vector that is always. We denote variable $A$ as an auxiliary variable for the exposure variable $E$, assuming that conditional on $E, A$ provides no additional information to the regression model, i.e.,
Suppose that there is a simple random validation sample with sample size $n_V$, denote by $V$, such that $(i, j, k)$ belonging to $V$ have their $(E, A)$ measured. Similarly, let $\overline{V}$ denote the remaining subjects, the nonvalidation set, the subjects in $\overline{V}$ will only have their $A$ measured. Hence, the observed data structure for $(i, j, k)$ is
Assume that, the marginal hazard function for the $k$th failure type of subject $j$ in cluster $i$ takes the form
where $E_{ijk}^*$ is an $m$-vector consisting of $E_{ijk}$ and possibly interaction terms between $E_{ijk}$ and some fully observed covariates, $\beta=(\beta'_1, \beta'_2)'$ is the parameter to be estimated, and $\lambda_{0jk}(t)$ is an unspecified marginal distinct baseline hazard function pertaining to the type $k$ failure.
If $(i, j, k)$ belongs to the validation set, then $Z_{ijk}$ and $E_{ijk}$ are observed and the marginal model takes the form as in equation (2.1). If $(i, j, k)$ belongs to the nonvalidation set $\overline{V}$, we only observe $Z_{ijk}(t)$ and $A_{ijk}(t)$. Under this situation, we can show, using the argument of Liu [11], that the hazard function for $\lambda_{ijk}(t;Z_{ijk}(t), A_{ijk}(t))$ satisfied the induced model
where $A^*$ includes auxiliary variable $A$ and the part of the information in covariate $Z$ that, given $A$, are still related to $E$. That is, $A^*$ satisfying the following conditional dependence $f(E_{ijk}(t)|X_{ijk}(t)\geq t, Z_{ijk}(t), A_{ijk}(t))=f(E_{ijk}(t)|X_{ijk}(t)\geq t, A_{ijk}^*(t))$. Notice that under this formulation, $A^*$ still satisfies the auxiliary assumption that given $E$ and $Z$, $A^*$ does not contribute to the regression model, i.e., $\lambda(t;Z(t), E(t), A^*(t))=\lambda(t;Z(t), E(t))$.
Equation (2.2) implies that this induced hazard model is also a proportional hazard model with the relative risk function $\exp(\beta'_2Z_{ijk}(t))\phi_{ijk}(\beta_1;t)$, where
Based on equations (2.1) and (2.2), the relative risk function can be written as
where $R_{ijk}(\beta_1, t)=\exp(\beta'_1E_{ijk}^*(t))\rho_{ijk}+\phi_{ijk}(\beta_1, t)(1-\rho_{ijk})$ and the binary variable $\rho_{ijk}=1$ or $0$ denote whether $(i, j, k)$ is in validation set $V$ or not. If $f(E_{ijk}(t)|X_{ijk}(t)\geq t, A_{ijk}^*(t))$ is a known function up to a parameter $\theta$, then the inference about $\beta$ and $\theta$ can be drawn from a pseudopartial likelihood [4, 6]. However, misspecification of such parameterization may lead to biased estimates. We develop an estimated pseudopartial likelihood approach for clustered correlated failure time data that avoids making undesirable parametric assumptions on the conditional distribution.
If all the observations were independent, we could write the partial likelihood as
When the failure times within a subject are not independent, the above function is referred to as the pseudopartial likelihood [4, 6]. Without loss of generality, we assume that $\{A_{ijk}^*\}$ are identically distributed categorical variables with the distribution $\text{Pr}(A^*=a_m)=p_m, m=1, \cdots, L, \sum\limits_{m=1}^L p_m=1$. Hence, if $(i, j, k)$ is in the nonvalidation set $\overline{V}$, we will estimate the induced hazard function, $\phi_{ijk}(\beta_1, t)$, as
It follows that the estimated relative risk function is
where
Replacing $r_{ijk}(\beta, t)$ by $\widehat{r}_{ijk}(\beta, t)$ in equation (2.3), we obtain an estimated pseudopartial likelihood function
We define our proposed estimator ${\mathit{\hat \beta }}_\mathit{E}$ as the maximizer of equation (2.5). ${\mathit{\hat \beta }}_\mathit{E}$ can be obtained by solving the estimated pseudo partial likelihood score equation, $\widehat{U}(\beta)=0$, where
and $N_{ijk}(t)=I(X_{ijk}\leq t, \Delta_{ijk}=1)$ is the counting process corresponding to failure time $T_{ijk}$. For a function $g(\beta, u)$, $g^{(j)}(\beta, u)$ denotes the $j$th derivative of $g(\beta, u)$ with respect to $\beta$. A Newton–Raphson iterative procedure can be invoked to obtain ${\mathit{\hat \beta }}_\mathit{E}$.
To investigate the asymptotic properties of the estimated pseudopartial likelihood estimator ${\mathit{\hat \beta }}_\mathit{E}$, we define the following notations. For a vector $a$, define $a^{\bigotimes0}=1, a^{\bigotimes1}=a, a^{\bigotimes2}=aa', ||a||=\sup_i|a_i|$. For a matrix $A$, define $||A||=\sup_{i, j}|a_{ij}|$. We also define
Assume that the study duration is from $0$ to $\tau$. Suppose that $\beta_0=(\beta'_{10}, \beta'_{20})'$ is the true hazards parameter. Our asymptotic results rely on the following assumptions:
[A1] $\displaystyle\int_0^{\tau}\lambda_{0jk}(t)<\infty, j=1, \cdots, J; k=1, \cdots, K$.
[A2] $Pr(Y_{ijk}(t)=1|A_{ijk}^*(t)=a_m)>0, m=1, \cdots, L$.
[A3] For any $j=1, \cdots, J; k=1, \cdots, K$, there exists a neighborhood $B_2$ of $\beta_{20}$ such that
[A4] There exists an open set $B_1$, containing $\beta_{10}$, such that $\phi_{ijk}(\beta_1, t)$ is bounded away from $0$ on $B_1\times[0, \tau]$. $\sum(\beta_0)$, as defined in Theorem 3.2, is positive definite.
[A5] For any $j=1, \cdots, J; k=1, \cdots, K$,
[A6] $\sup\limits_{t\in[0, \tau]}|L_k^{(d)}(t)|=O_p(1)$, $d=0, 1$, where
and $\gamma_{ijk}(\beta_1, t)=\exp(\beta'_1E_{ijk})$.
Following closely the argument of [11, 12], we can show the asymptotic properties of ${\mathit{\hat \beta }}_\mathit{E}$. We summarize the results in the following theorems and give the outline of the proofs in the Appendix.
Theorem 3.1 (Consistency) ${\mathit{\hat \beta }}_\mathit{E}$ is a consistent estimator of $\beta_0$ under assumptions (A1)–(A6).
Theorem 3.2 (Asymptotic Normality) Under the assumptions (A1)–(A6) in Appendix, we have that $n^{1/2}({\mathit{\hat \beta }}_\mathit{E}-\beta_0)$ is asymptotically normally distributed with mean zero and variance matrix $\sum_{EPPL}(\beta_0)=\sum^{-1}(\beta_0)\sum_1(\beta_0)\sum^{-1}(\beta_0)$, where
Here $s_{jk}^{(11)}(\beta_{0}, t)$ is the first m elements of $s_{jk}^{(1)}(\beta_{0}, t)$ and $s_{jk}^{(12)}(\beta_{0}, t)$ is the remaining $p$ elements,
$q=_{\mathit{n} \to \infty }^{\;{\rm{lim}}}\left( {{\mathit{n}_\mathit{v}}/\mathit{n}} \right), M_{ijk}(t)=N_{ijk}(t)-\int_0^\mathit{\tau } {} \lambda_{ijk}(u)du$ is the marginal martingale.
The variance estimator for ${\mathit{\hat \beta }}_\mathit{E}$ can be consistently estimated by replacing the population quantities in the covariance matrix $\sum_{EPPL}(\beta_0)$ with their corresponding sample quantities. The cumulative hazard $\Lambda_{0jk}(t)$ can be estimated by Aalen –Breslow type of estimator:
where $\widehat{S}_{jk}^{(0)}(\beta, t)=n^{-1}\sum\limits_{\mathit{i} = 1}^\mathit{n} {} Y_{ijk} \widehat{r}_{ijk}(\beta, t)$.
In this article, we studied an estimated pseudopartial likelihood method for clustered failure time data with an auxiliary covariate. A key feature of this method is that it is nonparametric with respect to the association between the missing covariate and the observed auxiliary covariate. The auxiliary variable is assumed to be discrete with the number of categories fixed. One way to deal with a continuous auxiliary variable is to discretize it into categories and then apply the proposed method. Future work about common baseline hazard models and mixed baseline hazard models for clustered correlated failure time data with auxiliary covariates will be considered.
In this appendix, we outline the proofs of the theorems.
Proof of Theorem 3.1 Note that ${\mathit{\hat \beta }}_\mathit{E}$ solves $n^{-1}\widehat{U}(\beta)=0$. Follow closely the argument of [12], one can show that ${\mathit{\hat \beta }}_\mathit{E}$ is consistent for $\beta_0$, provided:
[R1] $n^{-1}\partial\widehat{U}(\beta)/\partial\beta$ exists and is continuous in an open neighborhood $B$ of $\beta_0$.
[R2] $n^{-1}\partial\widehat{U}(\beta_0)/\partial\beta_0$ is negative definite with probability going to $1$.
[R3] $n^{-1}\partial\widehat{U}(\beta)/\partial\beta$ converges in probability to a fixed function, $\sum(\beta)$, uniformly in an open neighborhood of $\beta_0$.
[R4] $n^{-1}\widehat{U}(\beta_0)\rightarrow 0$ in probability.
Let
similar to [11], we can show that the four conditions are satisfied. Therefore, ${\mathit{\hat \beta }}_\mathit{E}$ converges in probability to $\beta_0$.
Proof of Theorem 3.2 It can be shown that the score function $n^{-1}\partial\log EPPL(\beta)/\partial\beta$ can be expressed as
By Taylor expansion of $\widehat{U}(\beta_0)$, we have
where $\beta_*$ is between ${\mathit{\hat \beta }}_\mathit{E}$ and $\beta_0$. To prove the asymptomatic normality, it suffices to prove that $n^{-1/2}\widehat{U}(\beta_0)$ converges to a normal random variable in distribution and that $n^{-1}\partial\widehat{U}(\beta_*)/\partial\beta_*$ converges to an invertible matrix. By consistency of ${\mathit{\hat \beta }}_\mathit{E}$ and the convergence proof of $n^{-1}\partial\widehat{U}(\beta)/\partial\beta$ for (R3), it can be shown that $n^{-1}\partial\widehat{U}(\beta_*)/\partial\beta_*$ converges to the invertible matrix $\sum(\beta_0)$. These results together with the Slutsky Lemma give the desired normally result for ${\mathit{\hat \beta }}_\mathit{E}$ in Theorem 3.2.