Journal of Mathematics (数学杂志), 2018, Vol. 38, Issue 1: 67-74
REGRESSION ANALYSIS OF CLUSTERED CURRENT STATUS DATA UNDER THE ADDITIVE HAZARDS MODEL
LIU Yu-huan1, WANG Cheng-yong2    
1. School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China;
2. School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang 441053, China
Abstract: In this paper, we discuss regression analysis of clustered current status data under the additive hazards model. For the situation where the correlated failure times of interest may be related to the cluster sizes, we propose a within-cluster resampling (WCR) method and derive the limiting distribution theory for the corresponding estimators under some regularity conditions. Simulation studies are conducted to assess the finite-sample behavior of the estimators.
Key words: additive hazards model     current status data     within-cluster resampling    
1 Introduction

Case Ⅰ interval-censored failure time data, or current status data, arise in many areas, including demographic studies, economics, medical studies, reliability studies and the social sciences; see, e.g., [1-4]. By case Ⅰ interval-censored data, we mean that the failure time of interest is not observed exactly; instead, each observation is either left- or right-censored. A typical example arises in tumorigenicity studies, where the time to tumor onset is of interest but is usually not observable, since the presence or absence of a tumor in an animal is known only at its death or sacrifice. In particular, clustered current status data are commonly encountered in biomedicine.

Many procedures have been developed for regression analysis of interval-censored failure time data under various models. For example, Huang [3] developed a maximum likelihood approach for fitting the proportional hazards model to case Ⅰ interval-censored data, and Chen and Sun [5] and Sun and Shen [6] discussed the same problem in the presence of clustering and of competing risks, respectively. Hu and Xiang [7] considered efficient estimation for semiparametric cure models with case Ⅱ interval-censored data, and Lin et al. [8] and Chen and Sun [5] discussed fitting the additive hazards model to case Ⅰ interval-censored data. However, these methods either do not take clustering into account or assume that the cluster size is completely random or noninformative, and it is well known that this may not hold, as the outcome of interest among individuals in a cluster may be associated with the size of the cluster. That is, we may have informative cluster sizes. In the following, we present an approach to the regression analysis of clustered current status data under the additive hazards model.

In the presence of informative cluster sizes, among others, Dunson et al. [9] proposed a Bayesian procedure that models the relationship between the failure times of interest and the cluster size through a latent variable. Williamson et al. [10] and Cong et al. [11] also considered the same problem and investigated a weighted score function (WSF) approach and a within-cluster resampling (WCR) procedure, respectively. However, there does not appear to exist an estimation procedure for regression analysis of clustered current status data with informative cluster sizes under the additive hazards model.

The rest of the article is organized as follows. Section 2 introduces the model and the notation used throughout the paper. Section 3 develops the WCR method, using the inference procedure of Lin et al. [8] for the additive hazards model with case Ⅰ interval-censored failure time data, and Section 4 presents simulation studies assessing the performance of the proposed approach.

2 Notation and Model

Let $i=1, \cdots, n$ index the independent clusters and $j=1, \cdots, n_i$ the subjects within the $i$-th cluster. For subject $j$ in cluster $i$, let $T_{ij}$ and $C_{ij}$ denote the failure time of interest and the censoring or observation time, respectively, and let $Z_{ij}(t)$ be a $p$-dimensional vector of covariates that may depend on time $t$. The $T_{ij}$ may be dependent for subjects within the same cluster but are assumed independent for subjects from different clusters. We also assume that $T_{ij}$ is conditionally independent of $C_{ij}$ given $Z_{ij}(t)$.

We assume that the survival probabilities of individuals in a cluster may depend on the size of that cluster. As noted in Cong et al. [11], the mechanism that makes cluster sizes informative can be complicated and is usually unknown; some latent variables may implicitly affect the baseline hazard for each cluster and/or the covariates. To account for the within-cluster correlation induced by such latent variables, we consider the following additive hazards frailty model, an extension of the marginal additive hazards model of [12]:

$ \begin{equation} \lambda_{ij}(t \, | \, Z_{ij})\, = \, \lambda_0(t) \, +\, \omega_i\beta_0'Z_{ij}(t), \end{equation} $ (2.1)

where $\beta_0$ is an unknown $p$-dimensional vector of regression coefficients, $\omega_i$ is a cluster-specific random effect that accounts for the within-cluster correlation in cluster $i, $ and $\lambda_0(t)$ is an unknown baseline hazard function. If cluster sizes are ignorable (noninformative to survival), the usual marginal additive hazards model is applicable, given by

$ \begin{equation} \lambda_{ij}(t \, | \, Z_{ij})\, = \, \lambda_0(t) \, +\, \beta_0'Z_{ij}(t). \end{equation} $ (2.2)

For each $(i, j)$, we define $N_{ij} (t) \, = \, I ( C_{ij} \leq \min ( t , T_{ij} ) ), \delta_{ij}=I(C_{ij}\leq T_{ij})$ and $Y_{ij} (t) \, = \, I ( C_{ij} \geq t )$ and let $\lambda_{c} (t)$ denote the hazard function of the $C_{ij}$'s. Also define

$ \begin{equation*} \tilde\lambda_{ij} ( t | Z_{ij} (s)) \, = \, \lambda_{c} (t) \, e^{ - \Lambda_0 (t) } \, e^{ - \beta_0 ' Z_{ij}^* (t) } \, := \, \lambda_0^{c} (t) \, e^{ - \beta_0 ' Z_{ij}^* (t) }, \end{equation*} $

where $\Lambda_0 (t) \, = \, \displaystyle\int_0^t \, \lambda_0 (s) ds$ and $Z_{ij}^* (t) \, = \, \displaystyle\int_0^t \, Z_{ij} (s) d s$, and

$ M_{ij}(t)=N_{ij}(t)-\displaystyle\int_0^tY_{ij}(u)\lambda_0^{c}(u)e^{-\beta_0'Z_{ij}^*(u)}du. $

Note that $M_{ij}(t)$ is a local square-integrable martingale with respect to the marginal filtration

$ {\mathcal F}_{ij}(t)=\sigma\{N_{ij}(u), Y_{ij}(u), Z_{ij}(u), 0\leq u\leq t\} $

(see Lin et al. [8]), and $\tilde\lambda_{ij} ( t | Z_{ij} (s))$ satisfies the Cox proportional hazards model. However, due to the within-cluster dependence, $M_{ij}(t)$ is not a martingale with respect to the joint filtration generated by the history of all the failure, censoring and covariate information up to time $t.$
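A practical note: the transformed covariate $Z_{ij}^*(t)=\int_0^t Z_{ij}(s)ds$ used throughout can be computed by numerical integration when the covariate path is observed on a time grid. The following is a minimal sketch (our own illustrative helper, not from the paper); for a time-independent covariate it reduces to $Z\cdot t$.

```python
import numpy as np

def cumulative_covariate(times, Z_values):
    """Z*(t) = int_0^t Z(s) ds evaluated at each point of the grid `times`,
    using the trapezoidal rule on a covariate path observed on that grid.
    For a time-independent covariate this reduces to Z * t."""
    times = np.asarray(times, dtype=float)
    Z_values = np.asarray(Z_values, dtype=float)
    out = np.zeros_like(times)
    for k in range(1, len(times)):
        # accumulate the trapezoid area between consecutive grid points
        out[k] = out[k - 1] + 0.5 * (Z_values[k] + Z_values[k - 1]) * (times[k] - times[k - 1])
    return out
```

The trapezoidal rule is exact for piecewise-linear covariate paths, which covers the constant-covariate case used in the simulations of Section 4.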

3 A Method Based on the Within-Cluster Resampling Technique

When cluster sizes are informative, estimates and inference based on model (2.2) may be incorrect. To account for informative cluster sizes, this section proposes a method based on the within-cluster resampling (WCR) technique (Hoffman et al. [13]): one subject is randomly sampled with replacement from each of the $n$ clusters, and the resampling is repeated $K$ times, where $K$ is a large fixed number. Let $\tau$ denote the known length of the study period. The $k$-th resampled data set, denoted by $\{C_{i, k}, \delta_{i, k}, Z_{i, k}(t);i=1, \cdots, n, 0\leq t\leq\tau\}, $ consists of $n$ independent observations and can therefore be analyzed with model (2.2) as an independent data set. Define $Y_{i, k}(t)=I(C_{i, k}\geq t)$ and $N_{i, k}(t)=\delta_{i, k}I(C_{i, k}\leq t)$. For the $k$-th resampled data set, the partial likelihood function is

$ \begin{equation} L_k(\beta)=\prod\limits_{i=1}^n\left(\frac{\exp(-\beta'Z_{i, k}^*(C_{i, k}))} {\sum\limits_{j=1}^nY_{j, k}(C_{i, k})\exp(-\beta'Z_{j, k}^*(C_{i, k}))}\right)^{\delta_{i, k}}, \end{equation} $ (3.1)

and the partial likelihood score function and observed information matrix are

$ \begin{eqnarray} U_k(\beta)&=&\sum\limits_{i=1}^n\displaystyle\int_0^\tau\left(Z_{i, k}^*(t)-\frac{S_k^{(1)}(\beta, t)}{S_k^{(0)}(\beta, t)}\right)dN_{i, k}(t), \\ \Sigma_k(\beta)&=&\sum\limits_{i=1}^n\displaystyle\int_0^\tau\left(\frac{S_k^{(2)}(\beta, t)}{S_k^{(0)}(\beta, t)} -\left(\frac{S_k^{(1)}(\beta, t)}{S_k^{(0)}(\beta, t)}\right)^{\otimes 2}\right)dN_{i, k}(t), \nonumber \end{eqnarray} $ (3.2)

where

$ \begin{eqnarray*} &&Z_{i, k}^*(t)=\displaystyle\int_0^tZ_{i, k}(s)ds, \\ &&S_k^{(b)}(\beta, t)=\frac{1}{n}\sum\limits_{j=1}^nY_{j, k}(t)\left(Z_{j, k}^*(t)\right)^{\otimes b}e^{-\beta'Z_{j, k}^*(t)}, \end{eqnarray*} $

and $a^{\otimes b}=1, a, aa'$ for $b=0, 1, 2$, respectively. The maximum partial likelihood estimator (see [14]) $\hat\beta^k$ is the solution to $U_k(\beta)=0.$ Furthermore, Lin et al. [8] showed that $\sqrt{n}(\hat\beta^k-\beta_0)$ converges in distribution to a zero-mean normal random vector whose covariance matrix can be consistently estimated by $n\Sigma_k^{-1}(\hat\beta^k)$; in particular, $\hat\beta^k$ is consistent.
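As a concrete numerical illustration, the score $U_k(\beta)$ and information $\Sigma_k(\beta)$ of (3.2) can be computed directly for a single resampled data set with a scalar, time-independent covariate, so that $Z_{i,k}^*(t)=Z_{i,k}t$, and the equation $U_k(\beta)=0$ solved by Newton's method. This is our own minimal sketch, not the authors' code; all names are illustrative.

```python
import numpy as np

def score_info(beta, C, delta, Z):
    """Score U(beta) and information Sigma(beta) of (3.2) for one resampled
    data set with a time-independent scalar covariate Z, so Z*(t) = Z * t."""
    n = len(C)
    U, Sig = 0.0, 0.0
    for i in range(n):
        if delta[i] == 0:          # dN_i jumps only at C_i when delta_i = 1
            continue
        t = C[i]
        at_risk = C >= t           # Y_j(t) = I(C_j >= t)
        Zs = Z[at_risk] * t        # Z*_j(t) for subjects at risk at t
        w = np.exp(-beta * Zs)
        S0 = w.sum() / n
        S1 = (Zs * w).sum() / n
        S2 = (Zs ** 2 * w).sum() / n
        U += Z[i] * t - S1 / S0
        Sig += S2 / S0 - (S1 / S0) ** 2
    return U, Sig

def solve_score(C, delta, Z, beta_init=0.0, tol=1e-8, max_iter=50):
    """Newton iteration for the root of U(beta) = 0; because covariates enter
    as exp(-beta Z*), dU/dbeta = +Sigma(beta) here, so the step is -U/Sigma."""
    beta = beta_init
    for _ in range(max_iter):
        U, Sig = score_info(beta, C, delta, Z)
        step = U / Sig
        beta -= step
        if abs(step) < tol:
            break
    return beta
```

Since $\Sigma_k(\beta)$ is a variance-type quantity and hence nonnegative, $U_k(\beta)$ is monotone in $\beta$ for scalar covariates, and the Newton iteration converges quickly in practice.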

Since averaging over resamples reduces the variability introduced by the resampling, after repeating this procedure $K$ times the WCR estimator of $\beta_0$ is taken to be the average of the $K$ resample-based estimators, that is,

$ \hat\beta_{wcr}=\frac{1}{K}\sum\limits_{k=1}^K\hat\beta^k. $

Under some regularity conditions, we can show that $\sqrt{n}(\hat\beta_{wcr}-\beta_0)$ converges in distribution to a zero-mean normal random vector, and the covariance matrix can be consistently estimated by

$ \hat\Sigma_{wcr}=\frac{n}{K}\sum\limits_{k=1}^K\Sigma_k^{-1}(\hat\beta_{wcr})-\frac{n}{K}\sum\limits_{k=1}^K(\hat\beta^k-\hat\beta_{wcr})(\hat\beta^k-\hat\beta_{wcr})'. $

The proof of this result is sketched in the Appendix. No special software is needed to implement the proposed method: one simply inputs the data $\{C_{i, k}, \delta_{i, k}, Z_{i, k}^*(\cdot), \ i= 1, \cdots, n\}$ into standard software for fitting the proportional hazards model with right-censored data.
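Given the $K$ resample estimates $\hat\beta^k$ and their observed informations $\Sigma_k$, the WCR estimator and the variance estimator above reduce to simple averages. The following is our own sketch for a scalar parameter, with the informations approximated at $\hat\beta^k$ rather than $\hat\beta_{wcr}$ for simplicity; all names are illustrative.

```python
import numpy as np

def wcr_combine(beta_ks, info_ks):
    """Combine K per-resample estimates into the WCR estimator and a
    variance estimate (scalar-parameter case).

    beta_ks : the K maximum partial likelihood estimates beta^k
    info_ks : the K observed informations Sigma_k (evaluated at beta^k here)

    Returns (beta_wcr, var_wcr), where var_wcr estimates var(beta_wcr);
    in the paper's sqrt(n)-scaling, Sigma_wcr = n * var_wcr.
    """
    beta_ks = np.asarray(beta_ks, dtype=float)
    info_ks = np.asarray(info_ks, dtype=float)
    beta_wcr = beta_ks.mean()
    # E(var(beta_k | data)) term: spread of the resample estimates (Omega)
    omega = np.mean((beta_ks - beta_wcr) ** 2)
    # var(beta_k) term: average of the inverse informations Sigma_k^{-1}
    avg_inv_info = np.mean(1.0 / info_ks)
    return beta_wcr, avg_inv_info - omega
```

Note that the second term is subtracted, not added: the spread of the $\hat\beta^k$ reflects pure resampling noise, which the WCR average removes.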

4 Simulation Study

In this section, we conduct simulations to assess the finite-sample performance of the method developed in the previous section. The failure times were generated from model (2.1) with $\lambda_0(\cdot)=2$. For simplicity, the covariate was taken to be time independent and generated from the Bernoulli distribution with success probability $p=0.5$. The cluster sizes were randomly generated from the discrete uniform distribution on $\{2, 3, 4, 5, 6, 7\}$, regardless of the frailty values. We took $\beta_0 = \pm0.5, \pm 0.2$ and $0.$ The censoring times were generated from the exponential distribution with mean $1/\exp (\beta Z_i ), $ calibrated to achieve censoring rates of approximately $30\%, 40\%, 50\%$ and $60\%$.
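The data-generating scheme just described can be sketched as follows (our own code, with illustrative names). Under model (2.1) with $\lambda_0(t)=2$ and a time-independent covariate, the conditional hazard is constant, so $T_{ij}$ is exponential with rate $2+\omega_i\beta_0 Z_{ij}$. The paper does not specify the frailty distribution; a mean-one gamma frailty is used below purely as an assumption, and for negative $\beta_0$ the frailty would need to be bounded so that the hazard stays positive.

```python
import numpy as np

def generate_clustered_current_status(n, beta0, c_rate, rng):
    """Generate clustered current status data following the Section 4 design:
    cluster sizes uniform on {2,...,7}, Bernoulli(0.5) covariates, failure
    times from model (2.1) with lambda_0(t) = 2, and exponential observation
    times with rate c_rate (a tuning constant for the censoring percentage;
    our own parameterization). Returns per-subject arrays."""
    cluster_id, C, delta, Z = [], [], [], []
    for i in range(n):
        ni = rng.integers(2, 8)             # cluster size in {2, ..., 7}
        omega = rng.gamma(2.0, 0.5)         # frailty, mean 1 (illustrative choice)
        for _ in range(ni):
            z = rng.binomial(1, 0.5)
            rate = 2.0 + omega * beta0 * z  # constant hazard under model (2.1)
            t = rng.exponential(1.0 / rate)     # failure time T_ij
            c = rng.exponential(1.0 / c_rate)   # observation time C_ij
            cluster_id.append(i)
            C.append(c)
            delta.append(int(c <= t))       # current status indicator delta_ij
            Z.append(z)
    return (np.array(cluster_id), np.array(C),
            np.array(delta), np.array(Z))
```

Only $(C_{ij}, \delta_{ij}, Z_{ij})$ are returned; the failure times themselves are never observed, which is exactly the current status structure.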

The results include the estimated bias (Bias), given by the average of the proposed estimates minus the true value; the sample standard deviation (SSE) of the estimates; the average of the estimated standard errors (SEE); and the empirical 95% coverage probability (CP). All results in the table below are based on 500 replications with the number of clusters $n = 200$ or $300$ and $K=500.$ Table 1 shows that the proposed estimates appear to be unbiased, the variance estimates also appear reasonable, and all estimates improve as the sample size increases.

Table 1: Simulation results for estimates of $\beta_0$
Appendix: Proof of the Asymptotic Normality of $\hat\beta_{wcr}$

We first assume that $\frac{1}{n}\sum\limits_{i=1}^n \frac{1}{n_i}\sum\limits_{j=1}^{n_i}Y_{ij}(t)e^{-\beta_0'Z^*_{ij}(t)}Z^*_{ij}(t), $ $\frac{1}{n}\sum\limits_{i=1}^n \frac{1}{n_i}\sum\limits_{j=1}^{n_i}Y_{ij}(t)e^{-\beta_0'Z^*_{ij}(t)}, $ $\frac{1}{n}\sum\limits_{i=1}^nY_{i, k}(t)e^{-\beta_0'Z_{i, k}^*(t)}Z_{i, k}^*(t)$ and $\frac{1}{n}\sum\limits_{i=1}^nY_{i, k}(t)e^{-\beta_0'Z_{i, k}^*(t)}$ converge uniformly to $\kappa(t), \ \pi (t), $ $\tilde\kappa(t)$ and $\tilde\pi (t), $ respectively. For $i=1, \cdots, n$, $j=1, \cdots, n_i$ and some constant $\tau$, we assume that $P\{Y_{ij}(t)=1, 0\leq t\leq\tau\}>0$, $\displaystyle\int_0^\tau\lambda_c(t)dt<\infty$, $Z_{ij}(t)$ is bounded and the cluster sizes are finite.

Since $\hat\beta_k$ solves the estimating equation $U_k(\beta)=0$, a Taylor expansion gives

$ \begin{equation} -U_k(\beta_0)=U_k(\hat\beta_k)-U_k(\beta_0)=\frac{\partial U_k(\beta_\xi)}{\partial\beta_\xi}(\hat\beta_k-\beta_0), \end{equation} $ (5.1)

where $\beta_\xi$ is on the line segment between $\hat\beta_k$ and $\beta_0.$ Rewriting (5.1) yields that

$ \begin{equation*} \sqrt{n}(\hat\beta_k-\beta_0)=\left(\frac{1}{n}\frac{\partial U_k(\beta_\xi)}{\partial\beta_\xi}\right)^{-1}\left(-\frac{1}{\sqrt{n}}U_k(\beta_0)\right). \end{equation*} $

Note that

$ \begin{eqnarray*} \frac{1}{n}\frac{\partial U_k(\beta)}{\partial\beta}&=&\frac{1}{n}\sum\limits_{i=1}^n\displaystyle\int_0^\tau\frac{S_k^{(2)}(\beta, s)S_k^{(0)}(\beta, s)-\left(S_k^{(1)}(\beta, s)\right)^{\otimes 2}}{\left(S_k^{(0)}(\beta, s)\right)^2}dN_{i, k}(s)\\ &=&\frac{1}{n}\sum\limits_{i=1}^n\displaystyle\int_0^\tau\left(Z_{i, k}^*(s)-\bar Z^k(\beta, s)\right)^{\otimes 2}Y_{i, k}(s)e^{-\beta'Z_{i, k}^*(s)} \frac{d\bar N_{k}(s)}{S_k^{(0)}(\beta, s)}\\ &:=&A_k(\beta), \end{eqnarray*} $

where $\bar Z^k(\beta, s)=S_k^{(1)}(\beta, s)/S_k^{(0)}(\beta, s)$ and $\bar N_k(s)=n^{-1}\sum\limits_{i=1}^nN_{i, k}(s).$ Note that $A_k(\beta)$ is positive definite. Since the $K$ resamples are identically distributed, it can be seen that $A_k(\beta_0)$ converges in probability to a deterministic and positive definite matrix denoted by $\mathcal A_{wcr}.$

Averaging over the $K$ resamples yields

$ \begin{eqnarray*} \sqrt{n}(\hat\beta_{wcr}-\beta_0)&=&\frac{1}{K}\sum\limits_{k=1}^K\sqrt{n}(\hat\beta_k-\beta_0) =\frac{1}{K}\sum\limits_{k=1}^K A_k(\beta_\xi)^{-1}\left(-\frac{1}{\sqrt{n}}U_k(\beta_0)\right) \\ &=&-{\mathcal A_{wcr}}^{-1}\frac{1}{\sqrt{n}K}\sum\limits_{k=1}^K U_k(\beta_0)+o_p(1). \end{eqnarray*} $

It thus suffices to show that $\frac{1}{\sqrt{n}K}\sum\limits_{k=1}^K U_k(\beta_0)$ converges in distribution to a normal random vector as $n\rightarrow\infty$. Changing the order of summation yields

$ \begin{eqnarray*} \frac{1}{\sqrt{n}K}\sum\limits_{k=1}^KU_k(\beta_0)&=&\frac{1}{\sqrt{n}}\sum\limits_{i=1}^n\frac{1}{K}\sum\limits_{k=1}^K\displaystyle\int_0^\tau\left(Z_{i, k}^*(t)-\bar Z^k(t)\right)dM_{i, k}(t)\\ &=&\frac{1}{\sqrt{n}}\sum\limits_{i=1}^n\frac{1}{K}\sum\limits_{k=1}^K\displaystyle\int_0^\tau\left(Z_{i, k}^*(t)-\frac{\tilde\kappa(t)}{\tilde\pi(t)}\right)dM_{i, k}(t)+o_p(1)\\ &:=&\frac{1}{\sqrt{n}}\sum\limits_{i=1}^n{\mathcal U_i(\beta_0)}+o_p(1), \end{eqnarray*} $

where ${\mathcal U_i(\beta_0)}, \ i=1, \cdots, n, $ are independent with zero mean and finite variance. By the multivariate central limit theorem, $n^{-1/2}K^{-1}\sum\limits_{k=1}^K U_k(\beta_0)$ is asymptotically normal with zero mean and some positive definite covariance matrix. Combining this with Slutsky's theorem, $\sqrt{n}(\hat\beta_{wcr}-\beta_0)$ converges in distribution to a zero-mean normal random vector; it remains to construct a consistent estimator $\hat\Sigma_{wcr}$ of its covariance matrix.

To obtain a consistent estimator of the covariance matrix, we proceed as in Hoffman et al. [13] and first write

$ {\rm var}(\hat\beta_k)=E\left({\rm var}(\hat\beta_k|{\rm data})\right) +{\rm var}\left({E}(\hat\beta_k|{\rm data})\right), $

where the inner expectation and variance on the right-hand side are taken over the resampling distribution of $\hat\beta_k$ given the data. Since ${\rm E}(\hat\beta_k|{\rm data})=\hat\beta_{wcr}, $ it follows that

$ \begin{equation} {\rm var}(\hat\beta_{wcr})={\rm var}(\hat\beta_k)-E({\rm var}(\hat\beta_k|{\rm data})). \end{equation} $ (5.2)

For each resampled data set, by the result of Lin et al. [8], ${\rm var}(\hat\beta_k)$ can be consistently estimated by $\Sigma_k^{-1}(\hat\beta_k)$; averaging over the $K$ resamples, the resulting estimator $K^{-1}\sum\limits_{k=1}^K\Sigma_k^{-1}(\hat\beta_k)$ is also consistent. For the second term on the right-hand side of (5.2), since

$ E({\rm var}(\hat\beta_k|{\rm data}))=E\left(\frac{1}{K}\sum\limits_{k=1}^K(\hat\beta_k-\hat\beta_{wcr})(\hat\beta_k-\hat\beta_{wcr})'\right), $

it can be estimated by the sample covariance matrix of the $K$ resample estimators $\hat\beta_k, $ that is,

$ \Omega=\frac{1}{K}\sum\limits_{k=1}^K(\hat\beta_k-\hat\beta_{wcr})(\hat\beta_k-\hat\beta_{wcr})'. $

Thus the estimated variance-covariance matrix of $\hat\beta_{wcr}$ is

$ \tilde\Sigma_{wcr}=\frac{1}{K}\sum\limits_{k=1}^K\Sigma_k^{-1}(\hat\beta_k)-\frac{1}{K}\sum\limits_{k=1}^K(\hat\beta_k-\hat\beta_{wcr})(\hat\beta_k-\hat\beta_{wcr})'=\frac{1}{K}\sum\limits_{k=1}^K\Sigma_k^{-1}(\hat\beta_k)-\Omega. $

To show the consistency of $\tilde\Sigma_{wcr}, $ it suffices to show that $\Omega-E(\Omega)\rightarrow 0$ in probability as $n\rightarrow\infty, $ which follows by the same arguments as those in the proof of Cong et al. [11]. This completes the proof.

References
[1] Andersen P K, Gill R D. Cox's regression model for counting processes: a large sample study[J]. Ann. Stat., 1982, 10: 1100–1120. DOI:10.1214/aos/1176345976
[2] Jewell N P, van der Laan M. Generalizations of current status data with applications[J]. Lifetime Data Anal., 1995, 1: 101–110. DOI:10.1007/BF00985261
[3] Huang J. Efficient estimation for the proportional hazards model with interval censoring[J]. Ann. Stat., 1996, 24: 540–568. DOI:10.1214/aos/1032894452
[4] Rossini A J, Tsiatis A A. A semiparametric proportional odds regression model for the analysis of current status data[J]. J. Amer. Stat. Assoc., 1996, 91: 713–721. DOI:10.1080/01621459.1996.10476939
[5] Chen L, Sun J. A multiple imputation approach to the analysis of current status data with the additive hazards model[J]. Comm. Stat. The. Meth., 2009, 38: 1009–1018. DOI:10.1080/03610920802361407
[6] Sun J, Shen J. Efficient estimation for the proportional hazards model with competing risks and current status data[J]. Canad. J. Stat., 2009, 37: 592–606. DOI:10.1002/cjs.v37:4
[7] Hu T, Xiang L. Efficient estimation for semiparametric cure models with interval-censored data[J]. J. Multi. Anal., 2013, 121: 139–151. DOI:10.1016/j.jmva.2013.06.006
[8] Lin D, Oakes D, Ying Z. Additive hazards regression with current status data[J]. Biom., 1998, 85: 289–298.
[9] Dunson D B, Chen Z, Harry J. Bayesian joint models of cluster size and subunit-specific outcomes[J]. Biom., 2003, 63: 663–672.
[10] Williamson J, Kim H Y, Manathuga A, Addiss D G. Modeling survival data with informative cluster size[J]. Stat. Med., 2008, 27: 543–555. DOI:10.1002/(ISSN)1097-0258
[11] Cong X, Yin G, Shen Y. Marginal analysis of correlated failure time data with informative cluster sizes[J]. Biom., 2007, 63: 663–672. DOI:10.1111/biom.2007.63.issue-3
[12] Lin D, Ying Z. Semiparametric analysis of the additive risk model[J]. Biom., 1994, 81: 61–71.
[13] Hoffman E B, Sen P K, Weinberg C R. Within cluster resampling[J]. Biom., 2001, 88: 1121–1134.
[14] Xiao Z, Zhu Q. Moderate deviation of maximum likelihood estimators for truncated and censored data[J]. J. Math., 2009, 29(3): 273–278.