In many epidemiological studies, meaningful results can be obtained only by observing thousands of subjects over a long period. Owing to financial limitations or technical difficulties, cost-effective designs are needed for selecting subjects from the underlying cohort on whom to measure the expensive covariates. Case-cohort sampling (Prentice, 1986) is a well-known cost-effective design for responses subject to censoring, in which the expensive covariates are measured only for a subcohort randomly selected from the cohort and for the additional failures outside the subcohort. Statistical methods for case-cohort sampling have been well studied in the literature (e.g., Self and Prentice, 1988; Chen and Lo, 1999; Kulich and Lin, 2000; Kong, Cai and Sen, 2004; Kong and Cai, 2009).
The aforementioned works show that case-cohort sampling is especially useful when the failure rate is low. In practice, however, the failure rate may be high, and it is then impractical to assemble the covariates of all failures under a fixed budget. For such situations, generalized case-cohort (GCC) sampling has been proposed, which selects only a subset of the failures rather than all failures as in the case-cohort design. For example, Chen (2001) proposed the GCC design and studied its statistical properties. Kang and Cai (2009) studied the GCC design with multivariate failure times. Cao, Yu and Liu (2015) studied the optimal GCC design through the power function of a significance test. These works are all within the framework of Cox's proportional hazards model (Cox, 1972). Yu et al. (2014) and Cao and Yu (2017) studied the GCC design under the additive hazards model (Lin and Ying, 1994).
Both the Cox proportional hazards model and the additive hazards model are based on modeling the hazard function. In some applications, however, it is important to model the failure time directly. The accelerated failure time (AFT) model, which linearly relates the logarithm of the failure time to the covariates, has therefore gained increasing attention. Kong and Cai (2009) studied case-cohort sampling under the AFT model. Chiou, Kang and Yan (2014) proposed a fast algorithm for fitting the AFT model under case-cohort sampling. Cao et al. (2017) studied GCC sampling under the AFT model and discussed the optimal subsample allocation via the asymptotic relative efficiency of the proposed estimators with respect to estimators obtained under a simple random sampling scheme.
When designing a GCC study in practice, an important question for principal investigators is how to calculate the power function under a fixed budget. To the best of our knowledge, this problem has not been considered for the generalized case-cohort design. In this paper, we fill this gap under the accelerated failure time model.
The article is organized as follows. In Section 2, we describe the generalized case-cohort sampling design, use a smoothed weighted Gehan estimating equation approach to estimate the unknown regression parameters in the accelerated failure time model, and present the corresponding asymptotic properties. In Section 3, we study the power calculation under a fixed budget. In Section 4, we conduct simulation studies to evaluate the performance of the proposed methods. A real data set is analyzed with the proposed method in Section 5. Some concluding remarks are presented in Section 6.
Let $\tilde{T}$ and $C$ denote the failure time and the corresponding censoring time, respectively. Due to right censoring, we only observe $T=\min(\tilde{T}, C)$ and $\delta=I(\tilde{T}\leq C)$, where $I(\cdot)$ is an indicator function. Let $Z_e$ be a $d_1$-dimensional vector of covariates which are expensive to measure and $Z_c$ be a $d_2$-dimensional vector of covariates which are cheap or easy to measure. It is assumed that, given the covariates $(Z_e, Z_c)$, $\tilde{T}$ and $C$ are independent. We consider the following accelerated failure time model

$$\log(\tilde{T})=\beta^{'}_{0}Z_{e}+\gamma^{'}_{0}Z_{c}+\epsilon, \qquad (2.1)$$

where $\beta_0$ and $\gamma_0$ are unknown regression parameters and $\epsilon$ is a random error with an unknown distribution function.
Suppose the underlying population has $n$ subjects and that $\{T_i, \delta_i, Z_{e, i}, Z_{c, i}, i=1, \cdots, n\}$ are independent copies of $(T, \delta, Z_e, Z_c)$. In the generalized case-cohort sampling, a binary random variable $\xi_i$ indicates whether or not the $i$-th subject is selected into the subcohort, with success probability $p$. Let $\eta_i$ be the selection indicator for whether or not the $i$-th subject is selected into the supplemental failure samples, with conditional probability $P(\eta_i=1|\xi_i=0, \delta_i=1)=q$. In the GCC sampling, the covariates $Z_e$ are observed only for the selected subjects. Hence, the observed data structure is as follows.
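As a sketch (assuming the standard GCC observation pattern, in which $Z_{e, i}$ is available exactly when $\xi_i=1$ or $\delta_i\eta_i=1$), the observed data can be written as

$$\{(T_i, \delta_i, Z_{c, i}, Z_{e, i}): \xi_i=1\}\cup\{(T_i, \delta_i=1, Z_{c, i}, Z_{e, i}): \xi_i=0, \eta_i=1\}\cup\{(T_i, \delta_i, Z_{c, i}): \xi_i=0, \delta_i\eta_i=0\}.$$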
Define $\theta=(\beta{'}, \gamma{'}){'}$, $\theta_0=(\beta^{'}_0, \gamma^{'}_0){'}$, $X_i=(Z^{'}_{e, i}, Z^{'}_{c, i}){'}$, and $e_i(\theta)=\log(T_i)-\beta{'}Z_{e, i}-\gamma{'}Z_{c, i}$ for $i=1, \cdots, n$. Let $N_i(t, \theta)=I(e_i(\theta)\leq t, \delta_i=1)$ and $Y_i(t, \theta)=I(e_i(\theta)\geq t)$ denote the counting process and the at-risk process, respectively. If the data $\{T_i, \delta_i, Z_{e, i}, Z_{c, i}, i=1, \cdots, n\}$ were completely observed, the unknown regression parameters in model (2.1) could be estimated by solving the following estimating equations
where $\psi(\cdot)$ is a possibly data-dependent weight function and $\bar{X}(t, \theta)=S^{(1)}(t, \theta)/S^{(0)}(t, \theta)$ with $S^{(d)}(t, \theta)=n^{-1}\sum\limits_{j=1}^{n}Y_j(t, \theta)X_{j}^{d}$ for $d=0, 1$. The weight functions $\psi(t, \theta)=1$ and $\psi(t, \theta)=S^{(0)}(t, \theta)$ correspond to the log-rank and Gehan statistics, respectively.
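For reference, a sketch of such full-data weighted rank estimating equations, written in the counting-process notation above (the standard form; the weight $\psi$ is as just described), is

$$U_{\psi}(\theta)=\sum_{i=1}^{n}\int_{-\infty}^{\infty}\psi(t, \theta)\left\{X_i-\bar{X}(t, \theta)\right\}dN_i(t, \theta)=0.$$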
Unfortunately, in the GCC sampling, the covariates $Z_e$ are observed only for the selected subjects, and the distribution of the selected supplemental failures differs from that of the underlying population. Therefore, the inverse probability weighting method (Horvitz and Thompson, 1952) is needed to adjust for the biased sampling mechanism of the GCC sampling.
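A sketch of such weights, assuming the Horvitz–Thompson form in which each subject is weighted by the inverse of its overall selection probability (the paper's exact weights, e.g., with estimated selection probabilities, may differ slightly), is

$$\omega_i=(1-\delta_i)\frac{\xi_i}{p}+\delta_i\frac{\xi_i+(1-\xi_i)\eta_i}{p+(1-p)q}, \quad i=1, \cdots, n.$$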
Then, the true regression parameters $\theta_0$ in model (2.1) can be estimated by solving the following weighted estimating equations
where $\tilde{\psi}(\cdot)$ is also a possibly data-dependent weight function and $\tilde{X}(t, \theta)=\tilde{S}^{(1)}(t, \theta)/\tilde{S}^{(0)}(t, \theta)$ with

$$\tilde{S}^{(d)}(t, \theta)=n^{-1}\sum\limits_{j=1}^{n}\omega_j Y_j(t, \theta)X_{j}^{d}$$

for $d=0, 1$, where $\omega_j$ is the inverse probability weight of the $j$-th subject. In this paper, we consider the Gehan statistic, $\tilde{\psi}(t, \theta)=\tilde{S}^{(0)}(t, \theta)$. Hence, the weighted Gehan estimating equations can be rewritten as
which are monotone in each component of $\theta$. Let $\tilde{\theta}_n$ denote the estimator obtained by solving (2.5).
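For concreteness, the Gehan-weighted equations admit the familiar double-sum representation; a sketch, assuming the inverse probability weights $\omega_i$ above, is

$$U_G(\theta)=n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{n}\omega_i\omega_j\delta_i(X_i-X_j)I\big(e_j(\theta)\geq e_i(\theta)\big)=0,$$

and the monotonicity in each component of $\theta$ follows from this form.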
Because the weighted Gehan estimating equations are not continuous, an induced smoothing procedure is adopted to smooth them (Brown and Wang, 2007; Cao, Yang and Yu, 2017). The smoothed weighted Gehan estimating equations can be written as
where $r_{ij}^{2}=n^{-1}(X_j-X_i){'}(X_j-X_i)$ and $\Phi(\cdot)$ is the distribution function of the standard normal distribution. As $n$ goes to infinity, $r_{ij}\rightarrow 0$, so $\Phi\{(e_j(\theta)-e_i(\theta))/r_{ij}\}$ converges to the indicator function $I(e_j(\theta)\geq e_i(\theta))$. Let $\widehat{\theta}_n$ denote the estimator obtained by solving estimating equation (2.6).
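A sketch of the smoothed equations, assuming the double-sum form above with the indicator replaced by the normal kernel as in Brown and Wang (2007), is

$$\widetilde{U}_G(\theta)=n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{n}\omega_i\omega_j\delta_i(X_i-X_j)\Phi\left(\frac{e_j(\theta)-e_i(\theta)}{r_{ij}}\right)=0.$$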
In this subsection, we establish the consistency and asymptotic distribution of $\widehat{\theta}_n$; the asymptotic distribution of $\widehat{\theta}_n$ is the same as that of $\tilde{\theta}_n$. Define $M_i(t, \theta)=N_i(t, \theta)-\Lambda_i(t, \theta)$ with $\Lambda_i(t, \theta)=\displaystyle\int_{-\infty }^{t} Y_i(u, \theta)\lambda(u)du$, where $\lambda(\cdot)$ is the common hazard function of the error term, and let $a^{\otimes 2}=aa{'}$ for a vector $a$.
Theorem 2.1 Under some regularity conditions,
$\widehat{\theta}_n$ is strongly consistent, and $\sqrt{n}(\widehat{\theta}_n-\theta_0)$ converges in distribution to a zero-mean normal distribution with covariance matrix
where the matrix $\Sigma_A(\theta_0)$ is the limit of
with
The regularity conditions and the proof of Theorem 2.1 can be found in [15].
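Although we do not restate the matrices here, the limiting covariance has the usual sandwich structure of estimating-equation asymptotics; as a sketch (with $\Sigma_V(\theta_0)$ denoting the limiting covariance matrix of the normalized estimating function, our notation),

$$\Sigma(\theta_0)=\Sigma_A(\theta_0)^{-1}\Sigma_V(\theta_0)\Sigma_A(\theta_0)^{-1},$$

where $\Sigma_A(\theta_0)$ is the limiting slope matrix of the smoothed weighted Gehan estimating function.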
In this section, we consider the power calculation for GCC sampling with a fixed budget. To simplify notation, let $\bf{B}$ denote the fixed budget, $C_c$ the unit price of measuring the observed failure time, censoring indicator and cheap covariates $\{T_i, \delta_i, Z_{c, i}\}$, and $C_e$ the unit price of measuring the expensive covariates $Z_{e, i}$. Hence,
where $\pi=P(\delta=1)$. In practice, $n$, $\bf{B}$, $C_c$ and $C_e$ are known, and $\pi$ can be estimated by $n^{-1}\sum\limits_{i=1}^{n}\delta_i$; the budget constraint is then equivalent to fixing $p+(1-p)\pi q$. Let $\rho_v=p+(1-p)\pi q$, which is the proportion of the validation data set in the missing-data literature, that is, the subjects whose data are completely observed.
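A sketch of the budget constraint (3.1), assuming every cohort member incurs the cost $C_c$ and only subjects with measured $Z_e$ incur the additional cost $C_e$, is

$$\mathbf{B}=nC_c+n\{p+(1-p)\pi q\}C_e=nC_c+n\rho_v C_e,$$

so a fixed budget $\mathbf{B}$ pins down $\rho_v=(\mathbf{B}/n-C_c)/C_e$.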
We consider the following significance test
where $k$ is a non-zero $d_1$-dimensional constant vector. Let $\widehat{\beta}_n$ denote the proposed estimator of $\beta_0$ and $\alpha$ denote the type I error, respectively. From Theorem 2.1, the rejection region of the test (3.2) at the significance level $\alpha$ is
where
$\Psi^{-1}(1-\frac{\alpha}{2})$ is the $d_1$-dimensional vector $(\Phi^{-1}(1-\frac{\alpha}{2}), \cdots, \Phi^{-1}(1-\frac{\alpha}{2})){'}$ with identical elements, where $\Phi(\cdot)$ is the distribution function of the standard normal distribution, and $[A]_{d_1\times d_1}$ is the upper-left $d_1\times d_1$ submatrix of a matrix $A$. Obviously, the power function of the significance test is a function of $(p, q)$, given as follows
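A sketch of this power function under the normal approximation of Theorem 2.1, assuming a scalar contrast and a fixed alternative value $\beta_1$ of $\beta_0$ (our notation), is

$$\mathrm{Power}(p, q)\approx\Phi\left(-z_{1-\alpha/2}+\frac{\sqrt{n}|k^{'}\beta_1|}{\sigma(p, q)}\right)+\Phi\left(-z_{1-\alpha/2}-\frac{\sqrt{n}|k^{'}\beta_1|}{\sigma(p, q)}\right),$$

where $z_{1-\alpha/2}=\Phi^{-1}(1-\frac{\alpha}{2})$ and $\sigma^{2}(p, q)=k^{'}[\Sigma(\theta_0)]_{d_1\times d_1}k$; the sampling probabilities $(p, q)$ enter through the covariance matrix $\Sigma(\theta_0)$.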
When we calculate the power function, owing to constraint (3.1), we need to consider the following optimization problem through a Lagrange multiplier argument
where $\|\cdot\|_1$ denotes the $L_1$ norm. Because the power function is positive, the optimal solution $(p^{*}, q^{*})$ is easy to obtain, and the corresponding power is ${\rm Power}(p^{*}, q^{*})$.
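As an illustration, the constrained search can be carried out numerically. The sketch below uses hypothetical inputs, and the function sigma is only a placeholder for the asymptotic standard deviation implied by Theorem 2.1.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical inputs (illustrative assumptions, not values from the paper).
n, alpha = 600, 0.05
pi = 0.20             # failure rate P(delta = 1)
rho_v = 0.30          # validation proportion fixed by the budget (3.1)
beta1, k = 0.3, 1.0   # alternative value of beta_0 and a scalar contrast

def sigma(p, q):
    """Placeholder for (k' [Sigma(theta_0)] k)^{1/2} as a function of (p, q);
    in practice this comes from the covariance in Theorem 2.1.  Here an
    inverse-probability-style proxy is used for illustration only."""
    return 1.0 / np.sqrt((1 - pi) * p + pi * (p + (1 - p) * q))

def power(p, q):
    """Two-sided normal-approximation power of the test (3.2)."""
    z = norm.ppf(1 - alpha / 2)
    ncp = np.sqrt(n) * abs(k * beta1) / sigma(p, q)
    return norm.cdf(-z + ncp) + norm.cdf(-z - ncp)

# Grid search over the budget line p + (1 - p) * pi * q = rho_v.
candidates = []
for p in np.linspace(0.01, min(rho_v, 0.99), 199):
    q = (rho_v - p) / ((1 - p) * pi)
    if 0.0 <= q <= 1.0:
        candidates.append((p, q, power(p, q)))
p_opt, q_opt, pow_opt = max(candidates, key=lambda t: t[2])
print("p* = %.3f, q* = %.3f, power = %.3f" % (p_opt, q_opt, pow_opt))
```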
In this section, simulation studies are conducted to evaluate the finite-sample performance of the proposed method. We generate the failure time from the accelerated failure time model
where $Z_e$ follows a standard normal distribution, $Z_c$ follows a Bernoulli distribution with success probability 0.5, the regression parameters are $\beta_0=0$ and $\gamma_0=0.5$, and the error term $\epsilon$ follows a standard normal distribution or a standard extreme value distribution, which results in a log-normal or a Weibull distribution for the failure time, respectively. The censoring time is generated from the uniform distribution over the interval $[0, c]$, where $c$ is chosen to yield a censoring rate of around $80\%$.
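For illustration, a minimal sketch of this data-generating mechanism is given below; the censoring bound c is a placeholder to be tuned to the target censoring rate.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed

def generate_cohort(n=600, beta0=0.0, gamma0=0.5, error="normal", c=1.0):
    """Generate one cohort from the simulation design of this section.
    c is a placeholder; it should be tuned to give roughly 80% censoring."""
    Z_e = rng.standard_normal(n)            # expensive covariate
    Z_c = rng.binomial(1, 0.5, size=n)      # cheap covariate
    if error == "normal":                   # log-normal failure time
        eps = rng.standard_normal(n)
    else:                                   # standard extreme value -> Weibull
        eps = np.log(rng.exponential(size=n))
    T_tilde = np.exp(beta0 * Z_e + gamma0 * Z_c + eps)
    C = rng.uniform(0.0, c, size=n)         # uniform censoring on [0, c]
    T = np.minimum(T_tilde, C)
    delta = (T_tilde <= C).astype(int)
    return T, delta, Z_e, Z_c

T, delta, Z_e, Z_c = generate_cohort()
print("censoring rate: %.2f" % (1 - delta.mean()))
```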
We consider the following test at the significance level $\alpha=0.05$:
The size of the underlying population is $n=600$. We investigate different scenarios for the sampling probabilities $(p, q)$ under constraint (3.1), which is equivalent to fixing $\rho_v$. For each configuration, we generate $1000$ simulated data sets. The results of the simulation studies are summarized in Figure 1. From Figure 1, we obtain the following results:
(I) When the error term follows the standard normal distribution, the powers are 0.746 and 0.967 with $\rho_v$ being 0.200 and 0.400, respectively, and the corresponding sampling probabilities $(p, q)$ are $(0.100, 0.556)$ and $(0.260, 0.946)$, respectively.
(II) When the error term follows the extreme value distribution, the powers are 0.893 and 0.994 with $\rho_v$ being 0.200 and 0.400, respectively, and the corresponding sampling probabilities $(p, q)$ are $(0.120, 0.455)$ and $(0.260, 0.946)$, respectively.
The National Wilms' Tumor Study Group (NWTSG) study is a cancer study that was conducted to improve the survival of children with Wilms' tumor by evaluating the relationship between the time to tumor relapse and the tumor histology (Green et al., 1998). However, the tumor histology is difficult and expensive to measure. According to the cell type, the tumor histology can be classified into two categories, namely favorable and unfavorable. Let the variable $histol$ denote the category of the tumor histology. We also consider other covariates, including the patient age, the disease stage and the study group.
We consider the accelerated failure time model
where the covariates $stage2, stage3, stage4$ indicate the disease stages and the variable $study$ indicates the study group. There are 4028 subjects in the full cohort, among whom 571 experienced tumor relapse. We randomly select a subcohort with $p=0.166$ and select a subset of the failures outside the subcohort with $q=0.400$. We compare the proposed estimator $\widehat{\alpha}_G$ with $\widehat{\alpha}_S$, which is based on a simple random sampling design with the same sample size as the GCC design. The results are summarized in Table 1.
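For reference, a sketch of the model just described, assuming each of the six covariates enters linearly with coefficient vector $\alpha=(\alpha_1, \cdots, \alpha_6){'}$ (our reconstruction, matching the estimators $\widehat{\alpha}_G$ and $\widehat{\alpha}_S$), is

$$\log(\tilde{T})=\alpha_{1}\,histol+\alpha_{2}\,age+\alpha_{3}\,stage2+\alpha_{4}\,stage3+\alpha_{5}\,stage4+\alpha_{6}\,study+\epsilon.$$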
From Table 1, both methods confirm that the tumor histology is significantly associated with cancer relapse. The proposed method shows that age is also significant for cancer relapse, which differs from the result based on $\widehat{\alpha}_{S}$.
In this paper, we study the power calculation for the generalized case-cohort (GCC) design under the accelerated failure time model. Due to the biased sampling mechanism of the GCC design, weighted Gehan estimating equations are adopted to estimate the regression coefficients. The induced smoothing procedure is introduced to overcome the discontinuity of the weighted Gehan estimating equations; it yields continuously differentiable estimating equations that can be solved by standard numerical methods. Simulation studies are conducted to evaluate the finite-sample performance of the proposed method, and we also analyze a real data set from the National Wilms' Tumor Study Group.
In this paper, we have considered time-invariant covariates. Next, we will consider the power calculation for the accelerated failure time model under the GCC design with time-dependent covariates. It would also be interesting to evaluate the performance of stratified sampling of the subcohort to enhance efficiency. Studies along these directions are currently under way.