Abstract
A new multivariate shared frailty model based on the truncated normal distribution is proposed. For the basal distribution of failure times, we assume a parametric approach through the Weibull and piecewise exponential distributions and also a nonparametric approach. Similar to the traditional gamma frailty model, the Laplace transform, the hazard and survival functions of our proposal have a simple and closed form. In addition, the n-th derivative of the Laplace transform can be expressed recursively. Parameter estimation is performed by a classical approach through the EM algorithm. A simulation study is presented to demonstrate the consistency of the estimators in finite samples. Finally, two applications to medical data modelling the recurrence of infection in renal patients and patients with fibrosarcoma are presented to demonstrate the effectiveness of the model compared to other classical approaches in the literature. The computational implementation of the model is available in the extrafrail package of R.
Introduction
Survival models study the time until an event of interest occurs. They are characterized by the presence of censored (incomplete) information, either because the individual did not experience the event during follow-up or because follow-up was interrupted during the study. In this context, one of the most widely used models in the literature is Cox’s proportional hazards (PH) model, because it does not assume a specific distribution for the failure times. For \({\varvec{x}}_i=(x_{i1},\ldots ,x_{ip})\), a set of p observed covariates (without intercept term), the hazard function for this model is given by
\[h(t\mid {\varvec{x}}_i)=h_0(t)\exp ({\varvec{x}}_i^\top {\varvec{\beta }}),\qquad (1)\]
where \({\varvec{\beta }}=(\beta _1,\ldots ,\beta _p)\) denotes the vector of regression coefficients and \(h_0(\cdot )\) denotes the baseline hazard function. Note that this model provides a proportional hazard structure because the hazard ratio for two individuals with profiles \({\varvec{x}}_i\) and \({\varvec{x}}_{i'}\),
\[\frac{h(t\mid {\varvec{x}}_i)}{h(t\mid {\varvec{x}}_{i'})}=\exp \big \{({\varvec{x}}_i-{\varvec{x}}_{i'})^\top {\varvec{\beta }}\big \},\]
does not depend on t. One way to relax the proportional hazards assumption is to use univariate frailty models, although in practice the concept of frailty is more intuitive in the context of data grouped into clusters or data that have some type of association (measurements on the same individual, for example). In the literature, many distributions have been considered for the frailty in a univariate context, including the gamma1,2, inverse Gaussian (IG)3,4, Birnbaum-Saunders (BS)5,6, folded normal7, weighted Lindley (WL)8,9 and mixture of IG10 distributions, among others. The restriction that the frailty variable has mean 1 is usually imposed to avoid identifiability problems.
However, when the observations are grouped in clusters of different sizes, a multivariate frailty model framework is required. In addition to the aforementioned restriction, in this case the derivatives of the Laplace transform must have a known form, because the joint density function depends on them. Few distributions in the literature satisfy these conditions: the gamma, the IG and the recently proposed WL shared frailty model9. For this reason, the literature on multivariate frailty models has developed mainly for the bivariate or trivariate case, in which all the clusters have 2 or 3 observations, respectively. In addition to the three distributions mentioned above, we found the generalized exponential discussed in11 and the generalized inverse Gaussian presented in12. The truncated normal (TN) model was mentioned as a possible frailty distribution in7. However, it was used without imposing any mean restriction or reparameterization, and it was applied solely within a copula model restricted to the bivariate case. To date, the behavior of the TN model in the context of frailty has not been explored for clusters larger than two, let alone for clusters of varying sizes.
In this paper, we use the TN distribution as the frailty distribution for clustered survival data. For our model to be identifiable, we employ a new parameterization of the TN distribution with mean one, so that the frailty variance is a function of a single parameter. The conditional distribution of frailties among the survivors and the frailty of individuals dying at time t can be explicitly determined. Furthermore, we propose a recursive closed form for the derivatives of the Laplace transform. For parameter estimation, we give a simple EM algorithm, since all conditional expectations involved in the E-step are obtained in explicit form. Finally, the results of this paper have been implemented in the R statistical software. The manuscript is organized as follows. Section 2 presents a background of frailty models and introduces the TN frailty model with a parameterization such that the mean of the distribution is 1. Section 3 discusses the estimation procedure for the model based on a classical approach. Section 4 presents a simulation study to assess the performance of the proposed estimators in finite samples. In Section 5, we present two real data applications, the first related to the recurrence times of infection in patients with renal problems and the second to fibrosarcoma data. Finally, Section 6 presents the main conclusions of this work.
Background of frailty models
In this Section, we introduce the truncated normal distribution and present a background of frailty models. We then introduce the novel truncated normal frailty model for the univariate and multivariate cases.
The truncated normal distribution
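A random variable Z has a TN distribution on the positive axis if its probability density function (PDF) is given by
\[f(z;\mu ,\sigma )=\frac{1}{\sigma \,\Phi (\mu /\sigma )}\,\phi \left( \frac{z-\mu }{\sigma }\right) ,\quad z>0,\]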
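where \(\phi (\cdot )\) and \(\Phi (\cdot )\) denote the PDF and cumulative distribution function (CDF) of the standard normal distribution, \(-\infty< \mu < \infty\) represents a location parameter and \(\sigma> 0\) a scale parameter. The mean and variance of the TN distribution are given by
\[\text {E}(Z)=\mu +\sigma \frac{\phi (\mu /\sigma )}{\Phi (\mu /\sigma )}\quad \text {and}\quad \text {Var}(Z)=\sigma ^2\left[ 1-\frac{\mu }{\sigma }\frac{\phi (\mu /\sigma )}{\Phi (\mu /\sigma )}-\left( \frac{\phi (\mu /\sigma )}{\Phi (\mu /\sigma )}\right) ^2\right] .\]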
Considering the reparameterization \(\nu =\mu /\sigma\) and the restriction \(\sigma =\bigg (\nu +\dfrac{\phi (\nu )}{\Phi (\nu )}\bigg )^{-1}\), the PDF of the model reduces to
\[f(z;\nu )=\frac{\gamma \,\phi (\gamma z-\nu )}{\Phi (\nu )},\quad z>0,\qquad (2)\]
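with \(\gamma =\gamma (\nu )=\nu +\phi (\nu )/\Phi (\nu )\), and the mean and variance of the model are given by
\[\text {E}(Z)=1\quad \text {and}\quad \theta =\text {Var}(Z)=\gamma ^{-2}-\frac{\phi (\nu )}{\Phi (\nu )}\gamma ^{-1},\]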
respectively. From now on we will use the notation TN\((\nu )\) to refer to a random variable with PDF given in Equation (2). To our knowledge, this parameterization has not been proposed previously in the statistical literature. Although the density cannot be written directly in terms of the frailty variance \(\theta\), there is a one-to-one relationship between \(\theta\) and \(\nu\). Thus, this parameterization is very useful because it allows a direct comparison with other frailty models that are also parameterized in terms of the frailty variance.
Note that under the restriction \(\text {E}(Z) = 1\), the frailty variance satisfies \(0 \le \theta =\text{ Var }(Z) \le 1\). In principle, this can be a disadvantage. However, in practice the frailty variance usually satisfies this condition (see Section 6).
Figure 1 shows the pdf and variance of the TN\((\nu )\) model with different values for \(\nu\). The flexibility of the TN distribution is apparent. Furthermore, the variance of the TN distribution decreases as \(\nu\) increases.
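As an illustration of the parameterization above, the following Python sketch (standard library only) evaluates \(\gamma (\nu )\) and the frailty variance \(\theta\), and inverts the one-to-one map between \(\theta\) and \(\nu\) by bisection. The function names and the bracketing interval \([-30, 30]\) are assumptions made for this example; they are not part of the extrafrail package.

```python
from math import erfc, exp, pi, sqrt

def phi(x):
    """Standard normal PDF."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def Phi(x):
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * erfc(-x / sqrt(2.0))

def gamma_of(nu):
    """gamma(nu) = nu + phi(nu)/Phi(nu); the implied scale is sigma = 1/gamma."""
    return nu + phi(nu) / Phi(nu)

def tn_pdf(z, nu):
    """Density of TN(nu) on z >= 0 under the mean-one restriction."""
    g = gamma_of(nu)
    return g * phi(g * z - nu) / Phi(nu) if z >= 0 else 0.0

def frailty_variance(nu):
    """theta = gamma^-2 - gamma^-1 * phi(nu)/Phi(nu)."""
    g = gamma_of(nu)
    return 1.0 / g**2 - (phi(nu) / Phi(nu)) / g

def nu_from_theta(theta, lo=-30.0, hi=30.0, tol=1e-12):
    """Invert theta(nu) by bisection, using that theta decreases as nu increases."""
    if not frailty_variance(hi) < theta < frailty_variance(lo):
        raise ValueError("theta is outside the range covered by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if frailty_variance(mid) > theta:
            lo = mid  # variance still too large: move towards larger nu
        else:
            hi = mid
    return 0.5 * (lo + hi)

# The three frailty variances used later in the simulation study.
for theta in (0.20, 0.50, 0.75):
    nu = nu_from_theta(theta)
    print(f"theta={theta:.2f}  nu={nu:.4f}  gamma={gamma_of(nu):.4f}")
```

For \(\nu =0\) the model is a half-normal distribution rescaled to mean one, so the sketch should return \(\theta =\pi /2-1\), which gives a quick sanity check.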
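The Laplace transform for the TN\((\nu )\) model is given by
\[\mathcal {L}_g(s)=\exp \left\{ \frac{\kappa ^2-\nu ^2}{2}\right\} \frac{\Phi (\kappa )}{\Phi (\nu )},\qquad (3)\]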
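where \(\kappa =\kappa (s,\nu )=\nu -s/\gamma\). Let \(\mathcal {L}^{(d)}_g\big (s\big )\) be the d-th derivative of the Laplace transform. For \(d=1\) and \(d=2\), these derivatives are given by
\[\mathcal {L}^{(1)}_g(s)=-\frac{1}{\gamma }\left( \kappa +\frac{\phi (\kappa )}{\Phi (\kappa )}\right) \mathcal {L}_g(s)\quad \text {and}\quad \mathcal {L}^{(2)}_g(s)=\frac{1}{\gamma ^2}\left( 1+\kappa ^2+\kappa \frac{\phi (\kappa )}{\Phi (\kappa )}\right) \mathcal {L}_g(s).\qquad (4)\]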
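In13, Corollary 2.1 presents a recurrence relation for derivatives of order 3 or higher of the moment-generating function (denoted as \(M_g(\cdot )\)) for the TN model. Using the property \(M_g(s)=\mathcal {L}_g(-s)\) we can derive the following relation:
\[\mathcal {L}^{(d)}_g(s)=-\frac{\kappa }{\gamma }\,\mathcal {L}^{(d-1)}_g(s)+\frac{d-1}{\gamma ^2}\,\mathcal {L}^{(d-2)}_g(s),\quad d\ge 3,\qquad (5)\]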
which depends only on the two previous derivatives and is simple to implement computationally. Higher-order derivatives of the Laplace transform are provided in the table included in the Supplementary Material file. The results in Equations (4) and (5) are very important for the development of our approach for the TN model within the context of frailty models.
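The recursion can be sketched in a few lines of Python. The closed forms used below were obtained for this example directly from the density of the TN\((\nu )\) model, by completing the square for the transform and by integration by parts for the recursion: \(\mathcal {L}_g(s)=\exp \{(\kappa ^2-\nu ^2)/2\}\Phi (\kappa )/\Phi (\nu )\), \(\mathcal {L}^{(1)}_g(s)=-\gamma ^{-1}\{\kappa +\phi (\kappa )/\Phi (\kappa )\}\mathcal {L}_g(s)\) and \(\mathcal {L}^{(d)}_g(s)=-(\kappa /\gamma )\mathcal {L}^{(d-1)}_g(s)+\{(d-1)/\gamma ^2\}\mathcal {L}^{(d-2)}_g(s)\). They may be written differently in the paper's own equations, and the function names are hypothetical.

```python
from math import erfc, exp, pi, sqrt

def phi(x):
    """Standard normal PDF."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def Phi(x):
    """Standard normal CDF via erfc."""
    return 0.5 * erfc(-x / sqrt(2.0))

def gamma_of(nu):
    """gamma(nu) = nu + phi(nu)/Phi(nu)."""
    return nu + phi(nu) / Phi(nu)

def laplace_derivatives(s, nu, d_max):
    """Return [L(s), L'(s), ..., L^(d_max)(s)] for a TN(nu) frailty.

    Assumed closed forms (derived for this sketch):
      L(s)     = exp((kappa^2 - nu^2)/2) * Phi(kappa) / Phi(nu)
      L'(s)    = -(kappa + phi(kappa)/Phi(kappa)) * L(s) / gamma
      L^(d)(s) = -(kappa/gamma) L^(d-1)(s) + ((d-1)/gamma^2) L^(d-2)(s), d >= 2
    with kappa = nu - s/gamma. For very large s the exponential can overflow,
    so this is only meant for moderate arguments.
    """
    g = gamma_of(nu)
    kappa = nu - s / g
    out = [exp(0.5 * (kappa**2 - nu**2)) * Phi(kappa) / Phi(nu)]
    if d_max >= 1:
        out.append(-(kappa + phi(kappa) / Phi(kappa)) * out[0] / g)
    for d in range(2, d_max + 1):
        out.append(-(kappa / g) * out[d - 1] + ((d - 1) / g**2) * out[d - 2])
    return out

# At s = 0 the derivatives are the signed moments: 1, -E(Z), E(Z^2), ...
print(laplace_derivatives(0.0, 0.7, 3))
```

At \(s=0\) the first three entries should equal 1, \(-1\) and \(1+\theta\), which is a convenient check that the mean-one restriction and the variance formula are respected.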
Univariate frailty models
In a univariate context, the extended Cox model with an unobserved source of heterogeneity has a conditional hazard function given by
\[h(t\mid \textbf{x}_i,z_i)=z_i\,h_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }}),\qquad (6)\]
where \(\textbf{x}_i\) denotes a vector of covariates and \(z_i\) is a latent variable representing the unobserved heterogeneity of the i-th observation. The frailties \(z_1, z_2,\ldots ,z_n\) are assumed to follow a positive distribution (say, one with PDF \(g(\cdot )\)), typically with mean 1 to avoid identifiability problems14. Similar to Eq. (1), this implies that the ratio of the conditional hazard functions of two individuals does not depend on t, but we remark that in this case it is the conditional (and not the marginal) hazard function that satisfies this property. Also note that the larger \(z_i\) is, the greater the risk associated with that observation. The conditional survival function for the i-th individual obtained from Equation (6) is given by
\[S(t\mid \textbf{x}_i,z_i)=\exp \big \{-z_i\,H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})\big \},\]
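where \(H_0(t)=\int _{0}^{t}h_0(u)du\) represents the baseline cumulative hazard function. The marginal survival function can be obtained as
\[S(t\mid \textbf{x}_i)=\int _0^\infty S(t\mid \textbf{x}_i,z)\,g(z)\,dz=\mathcal {L}_g\big (H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})\big ),\]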
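where \(\mathcal {L}_g(\cdot )\) corresponds to the Laplace transform of the PDF \(g(\cdot )\). On the other hand, the marginal hazard function is given by
\[h(t\mid \textbf{x}_i)=-h_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})\,\frac{\mathcal {L}^{(1)}_g\big (H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})\big )}{\mathcal {L}_g\big (H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})\big )},\qquad (7)\]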
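where \(\mathcal {L}^{(d)}_g(\cdot )\), \(d\in \mathbb {Z}\), denotes the d-th derivative of \(\mathcal {L}_g(\cdot )\). It is clear from Eq. (7) that the PH assumption is not satisfied in this case. In particular, when \(Z_i \sim \text{ TN }(\nu )\), the marginal survival and hazard functions reduce to
\[S(t\mid \textbf{x}_i)=\exp \left\{ \frac{\kappa _i^2-\nu ^2}{2}\right\} \frac{\Phi (\kappa _i)}{\Phi (\nu )}\quad \text {and}\quad h(t\mid \textbf{x}_i)=\frac{h_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})}{\gamma }\left( \kappa _i+\frac{\phi (\kappa _i)}{\Phi (\kappa _i)}\right) ,\]
where \(\kappa _i=\nu -H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})/\gamma\).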
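To make the loss of proportionality concrete, the sketch below evaluates the marginal survival and hazard functions under a TN\((\nu )\) frailty with a Weibull baseline, \(H_0(t)=\lambda t^{\rho }\). It assumes the closed forms \(S(t\mid \textbf{x})=\exp \{(\kappa ^2-\nu ^2)/2\}\Phi (\kappa )/\Phi (\nu )\) and \(h(t\mid \textbf{x})=h_0(t)e^{\textbf{x}^\top {\varvec{\beta }}}\{\kappa +\phi (\kappa )/\Phi (\kappa )\}/\gamma\) with \(\kappa =\nu -H_0(t)e^{\textbf{x}^\top {\varvec{\beta }}}/\gamma\), which follow from the Laplace transform of the model. The parameter values and function names are illustrative only.

```python
from math import erfc, exp, pi, sqrt

def phi(x):
    """Standard normal PDF."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def Phi(x):
    """Standard normal CDF via erfc."""
    return 0.5 * erfc(-x / sqrt(2.0))

def gamma_of(nu):
    """gamma(nu) = nu + phi(nu)/Phi(nu)."""
    return nu + phi(nu) / Phi(nu)

def marginal_survival(t, x_beta, nu, lam, rho):
    """Marginal survival: Laplace transform evaluated at s = H0(t) * exp(x'beta)."""
    s = lam * t**rho * exp(x_beta)
    kappa = nu - s / gamma_of(nu)
    return exp(0.5 * (kappa**2 - nu**2)) * Phi(kappa) / Phi(nu)

def marginal_hazard(t, x_beta, nu, lam, rho):
    """Marginal hazard: h0(t) exp(x'beta) * (kappa + phi/Phi(kappa)) / gamma."""
    g = gamma_of(nu)
    s = lam * t**rho * exp(x_beta)
    kappa = nu - s / g
    h0 = lam * rho * t ** (rho - 1.0)
    return h0 * exp(x_beta) * (kappa + phi(kappa) / Phi(kappa)) / g

# Marginal hazard ratio between x'beta = 1 and x'beta = 0 at several times.
# Conditionally on the frailty the ratio is exp(1); marginally it shrinks with t.
for t in (0.01, 0.5, 2.0):
    ratio = marginal_hazard(t, 1.0, 0.0, 1.0, 1.0) / marginal_hazard(t, 0.0, 0.0, 1.0, 1.0)
    print(f"t={t:4.2f}  marginal hazard ratio={ratio:.3f}")
```

The ratio starts close to \(e^{\beta }\) for small t and moves towards 1 as t grows, because individuals with the higher covariate effect who are still at risk tend to have smaller frailties.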
Finally, for the univariate case we present two propositions related to the conditional distribution for the frailty given the events \(T>t\) and \(T=t\), respectively.
Proposition 1.1
The conditional distribution for the frailty \(Z\mid T>t\), follows a TN(\(\varepsilon\)), where \(\varepsilon =\varepsilon (H_0(t),\nu )=\nu -H_0(t)/\gamma\).
Proposition 1.2
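The conditional distribution for the frailty \(Z\mid T=t\) follows a modified half-normal (MHN) distribution15, whose density function is given, up to its normalizing constant, by
\[f(z\mid T=t)\propto z\exp \left\{ -\frac{\gamma ^2}{2}z^2+\big (\gamma \nu -H_0(t)\big )z\right\} ,\quad z>0.\]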
Proofs of Propositions 1.1 and 1.2 are provided in the Supplementary Material.
Multivariate shared frailty model
In a more general context, it is possible to consider that the observations are grouped in m clusters and the i-th cluster has \(n_i\) observations, for \(i=1,\ldots ,m\). This scenario is appropriate when the observations in the same cluster have some kind of dependence, for instance, measurements on the same individual or members of the same family. The assumption here is that all the observations in the same cluster are conditionally independent given the corresponding frailty term (\(z_i\)). With this assumption, the conditional hazard and the joint conditional survival function are given by
\[h(t_{ij}\mid \textbf{x}_{ij},z_i)=z_i\,h_0(t_{ij})\exp (\textbf{x}_{ij}^\top {\varvec{\beta }})\quad \text {and}\quad S({\varvec{t}}_i\mid \textbf{X}_i,z_i)=\exp \Big \{-z_i\sum _{j=1}^{n_i}H_0(t_{ij})\exp (\textbf{x}_{ij}^\top {\varvec{\beta }})\Big \},\]
respectively, where \(\textbf{x}_{ij}^\top =(x_{ij1},\ldots ,x_{ijp})\) denotes a vector of p covariates related to the j-th individual in the i-th cluster, \(\textbf{X}_i^\top =(\textbf{x}_{i1}^\top ,\ldots ,\textbf{x}_{in_i}^\top )\) denotes the vector with all the information for the p covariates associated with the \(n_i\) observations in the i-th cluster, and \(z_i\) represents the influence of the i-th cluster on its observations. Integrating \(z_i\) out with respect to its density function, the marginal survival function for \({\varvec{t}}_i=(t_{i1},\ldots ,t_{in_i})\) is given by
\[S({\varvec{t}}_i\mid \textbf{X}_i)=\mathcal {L}_g\Big (\sum _{j=1}^{n_i}H_0(t_{ij})\exp (\textbf{x}_{ij}^\top {\varvec{\beta }})\Big ),\]
and then, the marginal hazard function is given by
For the TN model, Equation (8) can be evaluated using the recursive formula in (5). For the bivariate case (i.e., \(n_i=2\), \(\forall i=1,\ldots ,m\)), the marginal hazard function reduces to
where \(s_i=H_0(t_{i1})\exp (\textbf{x}_{i1}^\top {\varvec{\beta }})+H_0(t_{i2})\exp (\textbf{x}_{i2}^\top {\varvec{\beta }})\).
Kendall’s tau
Kendall’s \(\tau\) is a measure that quantifies the dependence between observations in the same cluster. This measure does not depend on the unit of measurement of the data, so it is preferable to the variance and the correlation, which have several limitations in this setting (non-existence of the second moment, presence of censored observations, dependence on the measurement scale; see16, page 153, for details). Considering the Laplace transform and its second derivative (see Equations (3) and (4), respectively), we can determine the value of \(\tau\) for the TN distribution, which is defined as
\[\tau =4\int _0^\infty s\,\mathcal {L}_g(s)\,\mathcal {L}^{(2)}_g(s)\,ds-1.\]
The integral is solved computationally since it does not have a closed expression. Figure 2 shows the different dependency values (\(\tau\)) according to the variance value for different frailty models. Note that \(\tau \in [0,0.33]\), for \(\theta \in (0,1)\) in the TN frailty model. We also note that, for a given frailty variance \(\theta \in (0,0.864)\), the TN frailty model produces a higher degree of dependence \(\tau\) than the GA, IG, and WL frailty models.
On the basal hazard function
The baseline hazard function \(h_0(t)\) is usually modeled with common distributions with positive support, such as the Weibull, gamma and Gompertz, among others. For the Weibull distribution, we consider the parameterization \(h_0(t)=\lambda \rho t^{\rho -1}\) and \(H_0(t)=\lambda t^{\rho }\), \(t, \lambda , \rho>0\), and we write \(T\sim \text{ W }(\lambda ,\rho )\) to refer to this particular parameterization. The Weibull model has been widely used in the literature because it adapts well to diverse biological, physical, chemical, and industrial processes, to name a few. Furthermore, its hazard function can be increasing, decreasing, or constant, depending only on \(\rho\). On the other hand, the piecewise exponential (PE) distribution was introduced in17 and extended in18 to the case with covariates. This model assumes a constant hazard within each interval of a predefined partition \(0=a_0<a_1<...<a_{L-1}<a_L = \infty\). This distribution is extremely useful for accommodating critical points where there may be abrupt changes in the baseline hazard function, which cannot be captured by non-segmented distributions such as the Weibull distribution. We say that T follows the PE model with vector of parameters \({\varvec{\lambda }}=(\lambda _1,...,\lambda _L)\) and known time partition \({\varvec{a}}=(a_1,...,a_{L-1})\) (we write \(T\sim PE_a({\varvec{\lambda }})\)) if its survival function is given by
\[S(t)=\exp \Big \{-\sum _{\ell =1}^{L}\lambda _\ell \,\nabla _\ell (t)\Big \},\quad t>0,\]
where
\[\nabla _\ell (t)={\left\{ \begin{array}{ll} 0, &{} t<a_{\ell -1},\\ t-a_{\ell -1}, &{} a_{\ell -1}\le t<a_\ell ,\\ a_\ell -a_{\ell -1}, &{} t\ge a_\ell . \end{array}\right. }\]
The hazard function is given by
\[h(t)=\lambda _\ell ,\quad a_{\ell -1}\le t<a_\ell ,\quad \ell =1,\ldots ,L,\]
and the cumulative hazard function is given by
\[H(t)=\sum _{\ell =1}^{L}\lambda _\ell \,\nabla _\ell (t).\]
In the literature, when the PE model is used in the context of frailty models, it is typically referred to as a semi-parametric model19. However, in this work, we also consider a non-parametric form for the baseline hazard distribution.
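The PE baseline described above is straightforward to code. The following Python sketch computes the hazard, cumulative hazard and survival functions for a given vector of rates and interior cut points; the function names are hypothetical, and the numerical values are the rates and partition used later in the simulation study.

```python
from bisect import bisect_right
from math import exp

def pe_hazard(t, rates, cuts):
    """Constant hazard lambda_l on the interval [a_{l-1}, a_l) that contains t."""
    return rates[bisect_right(cuts, t)]

def pe_cumulative_hazard(t, rates, cuts):
    """H(t): each rate times the time spent in its interval up to t."""
    edges = [0.0] + list(cuts) + [float("inf")]
    total = 0.0
    for lam, lo, hi in zip(rates, edges[:-1], edges[1:]):
        total += lam * max(0.0, min(t, hi) - lo)
    return total

def pe_survival(t, rates, cuts):
    """S(t) = exp(-H(t))."""
    return exp(-pe_cumulative_hazard(t, rates, cuts))

rates = (0.3, 2.6, 1.9)        # lambda_1, lambda_2, lambda_3
cuts = (7 / 365, 56 / 365)     # a_1, a_2 (1 and 8 weeks, in years)
for t in (0.01, 0.1, 1.0):
    print(t, pe_hazard(t, rates, cuts), pe_survival(t, rates, cuts))
```

Only the interior cut points \(a_1,\ldots ,a_{L-1}\) are passed, so the number of rates must be one more than the number of cut points.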
Estimation
In this section, we discuss the parameter estimation for the TN frailty model. Let \(Y_{ij}\) and \(C_{ij}\) be the failure and censoring times for the j-th individual in the i-th cluster and \({\textbf {x}}_{ij}\) be a \(p\times 1\) covariate vector (without intercept term), where \(1\le i \le m\) and \(1 \le j \le n_i\). Under a right censoring scheme, we observe the random variables \(T_{ij}=\min (Y_{ij}, C_{ij})\) and \(\delta _{ij}=I(Y_{ij} \le C_{ij})\), where \(I(A)=1\) if the event A occurs (0 otherwise). We assume the frailty terms \(Z_1,\ldots ,Z_m\) to be a random sample from the TN\((\nu )\) distribution. We consider the following assumptions:
i) The pairs \((Y_{i1}, C_{i1}),\ldots ,(Y_{in_i}, C_{in_i})\) are conditionally independent given \(Z_i\), and \(Y_{ij}\) and \(C_{ij}\) are mutually independent for \(j=1,\ldots ,n_i\).
ii) \(C_{i1}, \ldots , C_{in_i}\) are non-informative about \(Z_i\).
Under this setting, the observed log-likelihood function is given by
where \(r_i=\sum _{j=1}^{n_i} \delta _{ij}\) is the number of failures in the i-th cluster, \(b_\nu =\gamma ^2/2\) and \(c_{\varvec{\psi }}^{(i)}=\gamma \nu -\sum _{j=1}^{n_i} H_0(t_{ij})e^{\textbf{x}_{ij}^\top {\varvec{\beta }}}\). The last integral is related to the modified half-normal (MHN) distribution15 and it can be written as
where
is a specific case of the Fox-Wright function. The supplementary material in15 discusses different ways to compute this term. Therefore,
with \(r=\sum _{i=1}^m r_i\) the total number of failures in the sample. In a parametric approach, \(H_0(t)\) or \(h_0(t)\) are specified by a set of parameters, say \({\varvec{\lambda }}\), and then the parameter vector reduces to \(({\varvec{\beta }}, {\varvec{\lambda }}, \nu )\). For instance, for the Weibull (WEI) distribution, we use the parameterization \(H_0(t)=\lambda \, t^\rho\) and \(h_0(t)=\lambda \, \rho \, t^{\rho -1}\), where \(t>0\) and \({\varvec{\lambda }}=(\lambda ,\rho )\in \mathbb {R}_+^2\). From a classical approach, the ML estimator can be obtained by maximizing \(\log L({\varvec{\beta }}, {\varvec{\lambda }}, \nu )\) with respect to \({\varvec{\beta }}, {\varvec{\lambda }}\) and \(\nu\). Given its flexibility, discussed in the previous section, we also consider the PE model. A non-parametric approach for the baseline distribution can also be attractive. For this reason, in the next subsection we consider an estimation procedure based on the EM algorithm.
EM algorithm
Given the unobservable nature of the frailty terms, the EM algorithm is a natural tool in this context. Let \({\varvec{t}}_i^\top =(t_{i1},...,t_{in_i})\), \({\varvec{\delta }}_i^\top =(\delta _{i1},...,\delta _{in_i})\) and \({\varvec{x}}_i^\top =({\varvec{x}}_{i1},...,{\varvec{x}}_{in_i})\) be the observed times, failure indicators and covariates related to the \(n_i\) observations in the i-th cluster, \(i=1,\ldots ,m\). For our particular problem, \(\mathcal {D}_c= ({\varvec{t}}^\top ,{\varvec{\delta }}^\top ,{\varvec{X}}^\top ,{\varvec{Z}}^\top )\) represents the complete data, where \({\varvec{t}}^\top =({\varvec{t}}_1^\top ,...,{\varvec{t}}_m^\top )\), \({\varvec{\delta }}^\top =({\varvec{\delta }}_1^\top ,...,{\varvec{\delta }}_m^\top )\), \({\varvec{X}}^\top =({\varvec{x}}_1^\top ,...,{\varvec{x}}_m^\top )\) and \({\varvec{Z}}^\top =(z_1,...,z_m)\). Here \(\mathcal {D}_{o}=({\varvec{t}}^\top ,{\varvec{\delta }}^\top ,{\varvec{X}}^\top )\) is the observed data and \({\varvec{Z}}^\top\) represents the vector of latent variables. Note that the complete likelihood function can be written as \(L({\varvec{\beta }},H_0,\nu ;\mathcal {D}_{c})=L_1({\varvec{\beta }}, H_0;\mathcal {D}_{c})\times L_2(\nu ; {\varvec{Z}})\), where \(L_1({\varvec{\beta }}, H_0;\mathcal {D}_{c})\) \(=\prod _{i=1}^m \prod _{j=1}^{n_i} \left[ z_i h_0(t_{ij})\exp ({\textbf {x}}_{ij}^\top {\varvec{\beta }})\right] ^{\delta _{ij}}\)\(\exp (-z_i H_0(t_{ij})\text {e}^{{\textbf {x}}_{ij}^\top {\varvec{\beta }}})\) and \(L_2(\nu ; {\varvec{Z}})=\prod _{i=1}^m f(z_i;\nu )\).
The complete log-likelihood function is given by \(\ell _c({\varvec{\beta }}, H_0,\nu ;\mathcal {D}_{c})=\ell _{1c}({\varvec{\beta }}, H_0;\mathcal {D}_{c})+\ell _{2c}(\nu ; {\varvec{Z}})\), where except for a constant that does not depend on \({\varvec{\beta }}\), \(H_0\) or \(\nu\), such functions are given by
Let \({\varvec{\psi }}^{(k)} = \left( {\varvec{\beta }}^{(k)},H_0^{(k)}, \nu ^{(k)}\right)\) be the estimated vector of \({\varvec{\psi }}= ({\varvec{\beta }},H_0, \nu )\) at the k-th iteration and
i.e., the conditional expectation of \(\ell _c({\varvec{\beta }}, H_0,\nu ;\mathcal {D}_{c})\) given the observed data and \({\varvec{\psi }}^{(k)}\). Note that \(Q({\varvec{\psi }}\mid {\varvec{\psi }}^{(k)})=Q_1(({\varvec{\beta }},H_0)\mid {\varvec{\psi }}^{(k)})+Q_2(\nu \mid {\varvec{\psi }}^{(k)})\)
where \(\widehat{z}_i^{(k)}=\mathbb {E}\big [Z_i\mid \mathcal {D}_o,{\varvec{\psi }}={\varvec{\psi }}^{(k)}\big ]\) and \(\widehat{z_i^2}^{(k)} =\mathbb {E}\big [Z_i^2\mid \mathcal {D}_o,{\varvec{\psi }}={\varvec{\psi }}^{(k)}\big ]\). It is possible to show that
Refer to the supplementary material file for a proof of this fact. Using this notation, and applying Lemma 2 from Sun et al.15, it follows immediately that
On the other hand, it is possible to construct a discrete version of the cumulative baseline hazard function, considering \(H_0^D(t)=\sum _{\ell : t_{(\ell )}\le t} h_0(t_{(\ell )})\), where \(t_{(1)},\ldots ,t_{(q)}\) are the ordered distinct failure times and q is the number of different observed failure times. Replacing \(H_0(\cdot )\) and \(h_0(\cdot )\) in Equation (9), we obtain
Replacing the solution for \(h_0(t_{(\ell )})\), i.e., \(\widehat{h}_0(t_{(\ell )})=d_{(\ell )}/\left[ \sum _{(i,j) \in R(t_{(\ell )})}\exp \left( {\textbf {x}}_{ij}^\top {\varvec{\beta }}+\log \widehat{z}^{(k)}_i\right) \right]\), the expression for \(Q_1\) reduces, up to a constant that does not depend on \({\varvec{\beta }}\), to
Note that \(Q_1(\cdot )\) has the same form as the partial log-likelihood function of the Cox model, except for the offset \(\log \widehat{z}_i^{(k)}\). Therefore, to update \({\varvec{\beta }}\) in the M-step we can use the Cox approach. Finally, the non-parametric estimator for \(H_0(\cdot )\) in the k-th step of the algorithm is given by
In summary, the EM algorithm is given by the following steps.
- E-step: For \(i=1,...,m\), compute \(\widehat{z}_i^{(k+1)}\) and \(\widehat{z_i^2}^{(k+1)}\) using equations (12) and (13), respectively, with \({\varvec{\beta }}^{(k)}\), \(H_0(\cdot )^{(k)}\) and \(\nu ^{(k)}\) as the estimated parameters at the k-th iteration.
- M1-step: Update \({\varvec{\beta }}^{(k+1)}\) and \(H_0^{(k+1)}(\cdot )\) by fitting a Cox regression model with offset \(\log \widehat{z}_i^{(k+1)}\) for the nonparametric case, or maximizing \(Q_1(\beta ,H_0)\) for the parametric (WEI) and semi-parametric (PE) cases.
- M2-step: Update \(\nu ^{(k+1)}\) by maximizing \(Q_2(\nu \mid {\varvec{\psi }}^{(k)})\) in relation to \(\nu\).
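The E-step can be checked numerically without the closed-form expressions. Given the observed data, the conditional density of \(Z_i\) is proportional to \(z^{r_i}\exp (-b_\nu z^2+c_{\varvec{\psi }}^{(i)}z)\), with \(b_\nu =\gamma ^2/2\) and \(c_{\varvec{\psi }}^{(i)}=\gamma \nu -\sum _{j}H_0(t_{ij})e^{\textbf{x}_{ij}^\top {\varvec{\beta }}}\) as in the observed likelihood. The Python sketch below approximates the first two conditional moments by Simpson's rule. It is a numerical stand-in for the closed forms based on the MHN distribution, not the implementation used in extrafrail, and the function name, grid size and integration bound are assumptions.

```python
from math import erfc, exp, log, pi, sqrt

def phi(x):
    """Standard normal PDF."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def Phi(x):
    """Standard normal CDF via erfc."""
    return 0.5 * erfc(-x / sqrt(2.0))

def gamma_of(nu):
    """gamma(nu) = nu + phi(nu)/Phi(nu)."""
    return nu + phi(nu) / Phi(nu)

def frailty_posterior_moments(r_i, cum_haz_sum, nu, n_grid=4000):
    """Approximate E[Z_i | data] and E[Z_i^2 | data] by Simpson's rule.

    r_i         : number of observed failures in cluster i
    cum_haz_sum : sum_j H0(t_ij) * exp(x_ij' beta) over the cluster
    n_grid      : even number of Simpson sub-intervals (an assumption)
    """
    g = gamma_of(nu)
    b = 0.5 * g * g
    c = g * nu - cum_haz_sum
    mode = (c + sqrt(c * c + 8.0 * b * r_i)) / (4.0 * b)  # maximizer of the kernel
    upper = mode + 12.0 / g                               # kernel is negligible beyond this
    # Log-kernel at the mode, subtracted to avoid overflow or underflow.
    shift = (r_i * log(mode) if r_i > 0 else 0.0) - b * mode**2 + c * mode

    def kernel(z, power):
        if z == 0.0:
            return exp(-shift) if power == 0 else 0.0
        return exp(power * log(z) - b * z * z + c * z - shift)

    h = upper / n_grid

    def simpson(power):
        total = kernel(0.0, power) + kernel(upper, power)
        for i in range(1, n_grid):
            total += (4.0 if i % 2 else 2.0) * kernel(i * h, power)
        return total * h / 3.0

    norm = simpson(r_i)
    return simpson(r_i + 1) / norm, simpson(r_i + 2) / norm

# A cluster with two failures and accumulated hazard 0.8, with nu = 0.7.
print(frailty_posterior_moments(2, 0.8, 0.7))
```

With no failures and no accumulated hazard the function returns the prior moments, 1 and \(1+\theta\), and with \(r_i=0\) it reproduces the mean of the conditional distribution of the frailty among survivors.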
Maximization with respect to \(H_0\) refers to optimizing the parameters in \(H_0(\cdot )\): \(\rho\) and \(\lambda\) for the Weibull baseline distribution, or the vector \({\varvec{\lambda }}\) for the piecewise exponential case. With this formulation, the same algorithm covers the parametric, semi-parametric and non-parametric cases. The algorithm iterates until a convergence criterion is satisfied. For instance, we consider \(||\widehat{{\varvec{\psi }}}^{(k-1)}-\widehat{{\varvec{\psi }}}^{(k)}||<\epsilon\), where \(\epsilon\) is a predefined value and \(||\cdot ||\) denotes the Euclidean norm. Initial values are derived from the ordinary Cox model, taking \(\nu ^{(0)}=0.5\). On the other hand, following the suggestion of20, we estimate the standard errors of \(\widehat{\varvec{\beta }}\) and \(\widehat{\nu }\) via a profile log-likelihood function: \(\ell ({\varvec{\beta }},\nu )=\log L({\varvec{\beta }}, H_0, \nu )\), replacing \(H_0\) with its estimate \(\widehat{H}_0\). The variance-covariance matrix of \((\widehat{\varvec{\beta }},\widehat{\nu })\) is then:
Finally, more important than \(\widehat{\nu }\) is \(\widehat{\theta }:= \widehat{\gamma }^{-2}-\dfrac{\phi (\widehat{\nu })}{\Phi (\widehat{\nu })}\widehat{\gamma }^{-1}\) (the frailty variance), because it allows us to compare this term with the variance of other models parameterized directly in the frailty variance. The variance of \(\widehat{\theta }\) is estimated as:
Remark 1
Note that the result in Equation (11) is also of interest if a Bayesian approach is applied to the model, because it remains valid when conditioning on the parameters. This facilitates, among other things, the application of an MCMC-type method to simulate from the corresponding conditional distribution of the frailties.
Computational aspects
The extrafrail21 package of R22 includes the computational implementation of the TN frailty model, considering as the baseline model the Weibull, exponential and PE distributions, as well as the non-parametric specification. For instance, to fit the Weibull case, one can use
frailty.fit(formula, data, dist = "weibull", dist.frail = "TN")
where, as is usual in survival analysis with random effects in R, the formula can be defined as
Surv(time, event) ~ covariates + cluster(id)
A similar syntax can be used to fit the other cases by specifying dist = "exponential", dist = "pe" or dist = "np" in the call above. We highlight that the function allows us to perform the estimation even when the clusters have different sizes (i.e., \(n_1, n_2\),\(\ldots\),\(n_m\) are not necessarily the same).
Simulation study
In this Section, we present a simulation study to assess the performance of the maximum likelihood estimators obtained via the EM algorithm with samples of different percentages of censoring.
Recovery parameters
We consider the following three different scenarios:
- Scenario 1: 19 clusters with 2 observations each and 19 clusters with 4 observations each, totalling 114 observations (\(n_1=\ldots =n_{19}=2, n_{20}=\ldots =n_{38}=4\) and \(m=38\)).
- Scenario 2: 38 clusters with 2 observations each and 38 clusters with 4 observations each, totalling 228 observations (\(n_1=\ldots =n_{38}=2, n_{39}=\ldots =n_{76}=4\) and \(m=76\)).
- Scenario 3: 19 clusters with 4 observations each and 19 clusters with 8 observations each, totalling 228 observations (\(n_1=\ldots =n_{19}=4, n_{20}=\ldots =n_{38}=8\) and \(m=38\)).
The idea is to verify if, under a certain amount of data, it is advisable to increase the number of clusters or increase the cluster observations. We consider as baseline model the PE distribution with \(L=3\) and time partition \({\varvec{a}} = (7/365, 56/365)\). Similar to the real data application, we also consider one dichotomous covariate x, which was drawn from the Bernoulli distribution with success probability 20/76. We also consider three values for \(\theta\), the variance of the frailty terms: 0.20, 0.50 and 0.75. The percentage of censoring was fixed at 10%, 25% and 50%. In all the cases, the regression coefficient was fixed as \(\beta =1.8\) and the parameters from the PE distribution were fixed as \({\varvec{\lambda }}=(\lambda _1=0.3,\lambda _2=2.6,\lambda _3=1.9)\). To simulate values from the model, we use the following steps:
i) Draw \(z_i\sim \text{ TN }(\nu )\), \(i=1,\ldots ,m\), using the inverse transform method, i.e., set \(z_i=\Big (\Phi ^{-1} \big (u_i \Phi (\nu ) + \Phi (-\nu ) \big ) + \nu \Big )\gamma ^{-1}\), where \(u_i\sim \text{ U }(0,1)\) (the standard uniform distribution).
ii) Draw the failure times from the conditional distribution \(y_{ij}\mid z_i\sim \text{ PE }({\varvec{\lambda }} z_i \exp (\textbf{x}_{ij}^{\top }{\varvec{\beta }}), {\varvec{a}})\).
iii) Define the censoring times, \(c_{ij}\), as the \(100\times (1-q)\)-th quantile of the corresponding conditional distribution \(\text{ PE }({\varvec{\lambda }} z_i \exp (\textbf{x}_{ij}^{\top }{\varvec{\beta }}), {\varvec{a}})\).
iv) Define the observed failure times and failure indicators as \(t_{ij}=\min (y_{ij},c_{ij})\) and \(\delta _{ij}=I(y_{ij}\le c_{ij})\), respectively, for \(i=1,\ldots ,m\), \(j=1,\ldots ,n_i\).
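Steps i) to iv) can be sketched in Python as follows. This is an illustrative reading of the simulation design, not the code used for the reported tables: the function names, the seed and the choice of passing \(\nu\) directly (rather than \(\theta\)) are assumptions, and q is taken to be the target censoring proportion.

```python
import random
from math import erfc, exp, log, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal: pdf and inv_cdf

def Phi(x):
    """Standard normal CDF via erfc."""
    return 0.5 * erfc(-x / sqrt(2.0))

def gamma_of(nu):
    """gamma(nu) = nu + phi(nu)/Phi(nu)."""
    return nu + STD.pdf(nu) / Phi(nu)

def draw_frailty(nu, rng):
    """Step i): inverse-transform draw from TN(nu)."""
    p = rng.random() * Phi(nu) + Phi(-nu)
    p = min(p, 1.0 - 2.0**-53)  # keep the argument of inv_cdf strictly below 1
    return (STD.inv_cdf(p) + nu) / gamma_of(nu)

def pe_inverse_cumhaz(target, rates, cuts):
    """Time t at which the baseline PE cumulative hazard equals target."""
    edges = [0.0] + list(cuts)
    acc = 0.0
    for l, lam in enumerate(rates):
        width = cuts[l] - edges[l] if l < len(cuts) else float("inf")
        if acc + lam * width >= target:
            return edges[l] + (target - acc) / lam
        acc += lam * width

def simulate(cluster_sizes, nu, beta, rates, cuts, q, p_x=20 / 76, seed=1):
    """Return rows (cluster, time, status, x) and the list of drawn frailties."""
    rng = random.Random(seed)
    rows, frailties = [], []
    for i, n_i in enumerate(cluster_sizes):
        z = draw_frailty(nu, rng)
        frailties.append(z)
        for _ in range(n_i):
            x = 1 if rng.random() < p_x else 0
            scale = z * exp(beta * x)  # multiplies every lambda_l
            # Step ii): failure time by inverting the conditional cumulative hazard.
            y = pe_inverse_cumhaz(-log(1.0 - rng.random()) / scale, rates, cuts)
            # Step iii): censoring time = 100(1-q)-th conditional quantile.
            c = pe_inverse_cumhaz(-log(q) / scale, rates, cuts)
            # Step iv): observed time and failure indicator.
            rows.append((i, min(y, c), int(y <= c), x))
    return rows, frailties

# Scenario 1 with 25% censoring; nu = 0 corresponds to theta = pi/2 - 1.
rows, z = simulate([2] * 19 + [4] * 19, nu=0.0, beta=1.8,
                   rates=(0.3, 2.6, 1.9), cuts=(7 / 365, 56 / 365), q=0.25)
print(len(rows), sum(1 - r[2] for r in rows) / len(rows))
```

Because the censoring time is a fixed quantile of each observation's own conditional distribution, every observation is censored with probability q, which is what fixes the censoring percentage in the design.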
For each scenario and combination of censoring and \(\theta\), we draw 1,000 samples and compute the ML estimates. For each parameter, Tables 1 and 2 summarize the average bias (bias), the root of the estimated mean squared error (RMSE), the mean of the standard errors (SE) and the coverage probabilities (CP) of the asymptotic 95% confidence intervals.
An increase in the sample size improves the precision and accuracy of the estimates. In particular, scenarios 2 and 3, which have larger sample sizes, exhibit better performance than scenario 1. In general, an increase in heterogeneity (\(\theta\)) and in the censoring percentage tends to raise the bias, standard error, and RMSE, while reducing coverage probability (CP). However, the behavior of the estimator for \(\theta\) improves under higher censoring, showing reduced bias and increased coverage, possibly due to a better identification of the random effect in the presence of censored events. The most affected estimator is \(\lambda _3\), since censored information tends to concentrate within its interval. When comparing Scenarios 2 and 3, the former yields better results. This suggests that for a fixed total sample size, increasing the number of clusters is preferable to increasing the number of observations per cluster. This leads to greater diversity in latent effects, which enhances the estimation of frailty terms.
Applications with real data sets
In this Section, we present two applications to illustrate the performance of the TN frailty model in comparison with traditional models. The first application is related to patients with Chronic Kidney Disease (CKD), while the second application is related to patients with fibrosarcoma.
Kidney data set
CKD is the slow and progressive loss of kidney function over time. The main job of the kidneys is to remove waste and excess water from the body. This disease may be asymptomatic for some time until the kidneys have almost stopped working, so it is usually diagnosed in its final stages. The final stage of CKD is called End-Stage Renal Disease (ESRD). At this stage, the kidneys can no longer sufficiently remove waste and excess fluid from the body, requiring the patient to undergo dialysis (a life-sustaining treatment) or a kidney transplant (US National Library of Medicine). Dialysis has two main modalities: hemodialysis and peritoneal dialysis. Hemodialysis consists of extracting blood from the body and directing it to a machine that eliminates waste and excess fluid; after filtration, the blood is reintroduced into the bloodstream. Peritoneal dialysis, for its part, is a simpler process and can be done on an outpatient basis. Liquid is inserted into the peritoneal cavity through a catheter located in the abdomen. This solution absorbs waste and excess fluid and is later removed through the same channel.
CKD represents one of the most important non-communicable diseases worldwide23. For many patients, dialysis is the focal point around which their lives revolve, not only because of the time spent travelling to and from the sessions in specialized centres and the time dedicated to the dialysis treatment itself, but also due to the diet that accompanies it, the fluid restrictions and the medication load24. Thus, one of the most advantageous options, considering quality of life, is treatment by ambulatory peritoneal dialysis (with a portable machine). The peritoneal catheter is a foreign body that facilitates the appearance of infections and serves as a reservoir for bacteria. Infection can appear in the exit orifice, in the tunnel (the tunnelled path of the catheter) or in the peritoneum (peritonitis). Peritonitis continues to be an important complication of peritoneal dialysis (PD), as it contributes to technique failure, hospitalization, and even death25.
We focus on a real dataset named kidney, available in the R22 package frailtyHL26. For further details, see page 11 of its documentation: https://cran.r-project.org/web/packages/frailtyHL/frailtyHL.pdf. The study collected bivariate times, consisting of the times of first and second recurrence of infection at the catheter insertion point in patients with kidney problems using a portable dialysis machine. The catheter is later removed if infection occurs and can be removed for other reasons, in which case the observation is censored. Available covariates are sex and type of kidney disease: Glomerulonephritis (GN), acute nephritis (AN), Polycystic kidney disease (PKD) and others. Previous analysis suggests that only sex is significant in this context27. The study has 38 patients, 10 men and 28 women, each person has 2 times of recurrence of the infection, so there are a total of 76 observations. A summary of such times is presented in Table 3 and Figure 3 presents the Kaplan-Meier (KM) estimator by both times and by sex.
For comparison purposes, we also consider the GA, WL and IG frailty models with baseline distribution WE and PE. Figure 4 shows the cumulative hazard function for the kidney data. The proposed partition for the PE model was set at 1 and 8 weeks (indicated by the vertical segments in the graph). A change in the slope behavior is evident, as highlighted in the zoomed-in view on the right. This supports the conclusion that the PE model provides a better fit than a non-segmented model for this dataset. Practically, this suggests that the risk of infection at the catheter insertion site is highest during the first week post-insertion and gradually decreases over time. After two months, the risk stabilizes and remains relatively low. This understanding can help healthcare professionals in identifying critical time periods for infection prevention and monitoring patients accordingly.
Table 4 shows the Akaike information criterion (AIC)28 and the Bayesian information criterion (BIC)29 for these models. Both criteria suggest that the PE baseline is more appropriate for these data than the WE baseline, regardless of the frailty model used. In addition, the TN frailty model provides better results than the competing frailty models. Table 5 presents the estimates for all the models under the PE baseline distribution, including the ordinary PE model (i.e., without frailty).
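For reference, both criteria penalize the maximized log-likelihood \(\ell\) by the number of parameters \(k\): AIC \(= 2k - 2\ell\) and BIC \(= k\log n - 2\ell\), with smaller values preferred. A minimal Python sketch follows; the log-likelihood values and parameter counts are hypothetical, and taking \(n\) as the number of observations (76 for the kidney data) rather than the number of clusters is an assumption.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: 2k - 2*loglik."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion: k*log(n) - 2*loglik."""
    return k * math.log(n) - 2 * loglik


# Hypothetical fits (values invented for illustration only)
fits = {
    "model A": {"loglik": -100.0, "k": 5},
    "model B": {"loglik": -103.5, "k": 4},
}
n_obs = 76  # assumed sample size entering the BIC penalty

for name, fit in fits.items():
    print(name,
          "AIC =", round(aic(fit["loglik"], fit["k"]), 2),
          "BIC =", round(bic(fit["loglik"], fit["k"], n_obs), 2))
```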
Note that ignoring the dependence within clusters leads to an underestimation of the effect of sex. On the other hand, the estimated Kendall’s \(\tau\) for the different models is around 0.13. However, the estimated frailty variance under the GA, WL and IG models is at least 50% larger than under the TN frailty model. In practical terms, this means that the TN frailty model estimates a greater effect of sex on the recurrence of infection at the catheter insertion point and less variability between the measures associated with the same individual.
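To illustrate how a frailty variance maps to Kendall’s \(\tau\), the sketch below uses the standard relation for the gamma frailty model with mean one and variance \(\theta\), namely \(\tau = \theta/(\theta+2)\). This relation holds for the GA model only; the corresponding expressions for the TN, WL and IG models are different and are not reproduced here. The function names are hypothetical.

```python
def gamma_kendall_tau(theta):
    """Kendall's tau under a gamma frailty with mean 1 and variance theta."""
    if theta < 0:
        raise ValueError("theta must be non-negative")
    return theta / (theta + 2.0)


def gamma_variance_from_tau(tau):
    """Inverse map: frailty variance implied by a given tau (0 <= tau < 1)."""
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1)")
    return 2.0 * tau / (1.0 - tau)


# A tau of about 0.13 corresponds to this gamma frailty variance
theta = gamma_variance_from_tau(0.13)
print(f"theta = {theta:.4f}, tau = {gamma_kendall_tau(theta):.2f}")
```

Because each frailty family has its own map between variance and \(\tau\), models can agree on \(\tau\) while reporting quite different frailty variances.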
Fibrosarcoma data set
Fibrosarcoma is a rare malignant tumor that originates from fibroblasts, the connective tissue cells responsible for the production of collagen and extracellular matrix. This neoplasm exhibits infiltrative growth, a high propensity for local recurrence, and metastatic potential. It can develop in any part of the body, although it is most commonly found in the extremities, trunk, and retroperitoneal region. Clinically, it typically presents as a progressively enlarging mass, initially painless. Diagnosis is based on histopathological findings, where tumor cells are arranged in a characteristic herringbone pattern, and is often supported by immunohistochemical studies to differentiate it from other soft tissue tumors30. The treatment of choice is surgical excision with wide margins, and adjuvant radiotherapy is frequently considered; chemotherapy is generally reserved for advanced or metastatic cases31.
This dataset includes information from 261 patients diagnosed with fibrosarcoma SOE (from the Portuguese “sem outra especificação”, meaning “not otherwise specified”), with diagnosis dates ranging from 2000 to 2022 and follow-up data extending through December 2022. The dataset was obtained from the Oncocenter Foundation of São Paulo, Brazil (Fundação Oncocentro de São Paulo, FOSP), which oversees the Hospital Cancer Registry of the State of São Paulo (http://fosp.saude.sp.gov.br). This neoplasm is coded as 8810/3 Fibrosarcoma, NOS (not otherwise specified), according to the International Classification of Diseases for Oncology (ICD-O32), which is used in cancer registries to classify tumors that lack further histological subtyping at the time of diagnosis.
Cancer-specific death was defined as the event of interest, and time-to-event was measured from the date of diagnosis to the patient’s death (in years: mean\(=5.72\), standard deviation (SD)\(=5.78\), median\(=3.12\), range\(=0.025\) to \(21.86\)). During the follow-up period, a total of 103 events (39%) occurred. As covariate we use the type of treatment, with eight possible labels: A - Surgery (84 patients, 32.2%), B - Radiotherapy (14 patients, 5.4%), C - Chemotherapy (18 patients, 6.9%), D - Surgery \(+\) Radiotherapy (42 patients, 16.1%), E - Surgery \(+\) Chemotherapy (34 patients, 13.0%), F - Radiotherapy \(+\) Chemotherapy (11 patients, 4.2%), G - Surgery \(+\) Radiotherapy \(+\) Chemotherapy (29 patients, 11.1%) and I - other combination (29 patients, 11.1%). Figure 5 presents the KM estimator by type of treatment.
The clusters considered in this analysis correspond to the 26 clinical areas responsible for treating the patients, which are summarized in Table 6. Note that these clusters are highly unbalanced in terms of sample size. In this analysis, we consider the TN, GA, WL, and IG frailty models, using the Weibull distribution for the baseline hazard. The results are summarized in Table 7. Notably, the TN frailty model provides the lowest AIC among the models considered. Once again, the Kendall’s \(\tau\) values provided by the models are similar. However, the estimated intra-cluster variance for the TN model (0.226) is lower than for the other models. Finally, Figure 6 shows the survival functions (SF) for patients treated in neurology and clinical oncology centers, as well as the marginal SF (i.e., the SF for a patient randomly selected from the entire cohort).
Concluding remarks
A new survival model with TN frailty was proposed and studied in detail. This model can accommodate complex data structures, because it allows the modelling of both univariate and multivariate data and adapts even to clusters of different sizes. For the baseline risk, the Weibull and PE distributions were adopted, as well as a non-parametric approach. For a fixed frailty variance, the TN frailty model provides a greater Kendall’s \(\tau\) than the gamma and IG frailty models. We obtained a recursive closed-form expression for the derivatives of the Laplace transform for the TN model. Furthermore, the conditional distributions of frailties among the survivors and the frailty of individuals dying at time t were determined explicitly.
The simulation studies, based on the EM algorithm, indicate that a lower proportion of censored observations improves the accuracy and precision of the estimates. Scenarios 2 and 3 did not differ greatly in bias, which suggests that the bias depends on the sample size rather than on the data configuration. On the other hand, in terms of RMSE and SE, Scenario 2 showed an improvement in precision relative to Scenario 3. This suggests that clusters carrying more information each yield greater precision than more numerous clusters with little information each. To show the potential of the new frailty model, we fitted it to a real dataset on the times to the first and second recurrence of infection at the catheter insertion point in patients with kidney problems using a portable dialysis machine. This application demonstrates the practical relevance of the new regression model. In particular, the frailty variance estimated under the GA, WL and IG models is larger than that estimated under the TN frailty model.
Data availability
The real dataset used, named kidney, is available in the frailtyHL package in R. For details on its use, refer to page 11 of the manual: https://cran.r-project.org/web/packages/frailtyHL/frailtyHL.pdf.
References
Vaupel, J., Manton, K. & Stallard, E. The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography 16, 439–454 (1979).
Congdon, P. Modelling frailty in area mortality. Statistics in Medicine 14, 1859–1874 (1995).
Hougaard, P. Life table methods for heterogeneous populations. Biometrika 71, 75–83 (1984).
Manton, K., Stallard, E. & Vaupel, J. Alternative models for heterogeneity of mortality risks among the aged. Journal of the American Statistical Association 81, 635–644 (1986).
Leão, J., Leiva, V., Saulo, H. & Tomazella, V. Birnbaum-Saunders frailty regression models: Diagnostics and application to medical data. Biometrical Journal 59, 291–317 (2017).
Gallardo, D. I., Bourguignon, M. & Romeo, J. S. Birnbaum-Saunders frailty regression models for clustered survival data. Statistics and Computing 34, 141 (2024).
Wang, Y. & Emura, T. Multivariate failure time distributions derived from shared frailty and copulas. Japanese Journal of Statistics and Data Science 4, 1105–1131 (2021).
Mota, A. et al. Weighted Lindley frailty model: estimation and application to lung cancer data. Lifetime Data Analysis 27, 561–587 (2021).
Gallardo, D. I., Bourguignon, M. & Santibáñez, J. L. The shared weighted Lindley frailty model for clustered failure time data. Biometrical Journal 67, e70044 (2025).
Kiprotich, G., Gallardo, D. I., Ramos, P. L. & Augustin, T. A shared frailty regression model for clustered survival data. Statistical Methods in Medical Research, https://doi.org/10.1177/09622802251338984.
Barreto-Souza, W. & Mayrink, V. Semiparametric generalized exponential frailty model for clustered survival data. Annals of the Institute of Statistical Mathematics 71, 679–701 (2019).
Piancastelli, L., Barreto-Souza, W. & Mayrink, V. Generalized inverse-Gaussian frailty models with application to TARGET neuroblastoma data. Annals of the Institute of Statistical Mathematics 73, 979–1010 (2021).
Gómez, H. J., Olmos, N. M., Varela, H. & Bolfarine, H. Inference for a truncated positive normal distribution. Applied Mathematics-A Journal of Chinese Universities 33, 163–176 (2018).
Elbers, C. & Ridder, G. True and spurious duration dependence: The identifiability of the proportional hazard model. The Review of Economic Studies 49, 403–409 (1982).
Sun, J., Kong, M. & Pal, S. The modified-half-normal distribution: Properties and an efficient sampling scheme. Communications in Statistics-Theory and Methods 52, 1591–1613 (2023).
Wienke, A. Frailty models in survival analysis (CRC press, 2010).
Feigl, P. & Zelen, M. Estimation of exponential survival probabilities with concomitant information. Biometrics 21, 826–838 (1965).
Friedman, M. Piecewise exponential models for survival data with covariates. The Annals of Statistics 10, 101–113 (1982).
Balakrishnan, N. & Liu, K. Semi-parametric likelihood inference for Birnbaum-Saunders frailty model. REVSTAT 231–255 (2018).
Klein, J. P. Semiparametric estimation of random effects using the Cox model based on the EM algorithm. Biometrics 48, 795–806 (1992).
Gallardo, D., Bourguignon, M. & Santibáñez, J. extrafrail: Estimation and Additional Tools for Alternative Shared Frailty Models (2025). R package version 1.13.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2025).
Torres, R. S. L., Ushiña, J., Lloay, A. & Balseca, M. Cuidados de enfermería en pacientes con enfermedad renal crónica en hemodiálisis durante infección por covid-19. RECIAMUC 6, 81–90 (2003).
Davenport, A. Portable and wearable dialysis devices for the treatment of patients with end-stage kidney failure: Wishful thinking or just over the horizon? Pediatric Nephrology 30, 2053–2060 (2015).
Fariñas, M. C., García-Palomo, J. D. & Gutiérrez-Cuadra, M. Infecciones asociadas a los catéteres utilizados para la hemodiálisis y la diálisis peritoneal. Enfermedades Infecciosas y Microbiología Clínica 26, 518–526 (2008).
Ha, I. D., Noh, M., Kim, J. & Lee, Y. frailtyHL: Frailty Models via Hierarchical Likelihood (2019). R package version 2.3.
McGilchrist, C. & Aisbett, C. Regression with frailty in survival analysis. Biometrics 47, 461–466 (1991).
Akaike, H. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716–723 (1974).
Schwarz, G. Estimating the dimension of a model. The Annals of Statistics 6, 461–464 (1978).
Fletcher, C. D. M., Bridge, J. A., Hogendoorn, P. C. W. & Mertens, F. WHO Classification of Tumours of Soft Tissue and Bone (IARC Press, Lyon, 2020), 5 edn.
Pisters, P. W., Leung, D. H., Woodruff, J., Shi, W. & Brennan, M. F. Analysis of prognostic factors in 1,041 patients with localized soft tissue sarcomas of the extremities. Journal of Clinical Oncology 25, 785–790. https://doi.org/10.1200/JCO.2006.08.1363 (2007).
World Health Organization. International Classification of Diseases for Oncology (ICD-O), 3rd Edition, 1st Revision (WHO Press, 2013).
Acknowledgements
Yolanda M. Gómez was partially funded by FONDECYT, project grant number 11230397, from the National Agency for Research and Development (ANID) of the Chilean government under the Ministry of Science, Technology, Knowledge, and Innovation. Marcelo Bourguignon gratefully acknowledges partial financial support from the Brazilian agency Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq: grant 304140/2021-0).
Author information
Contributions
D.I.G.: Conceptualization of this study, Methodology, Software. Y.M.G.: Data curation, Methodology, Software, Writing - Original draft preparation. J.L.S.: Data curation, Methodology, Software, Writing - Original draft preparation. O.V.: Formal analysis, Investigation, Writing - Review and editing, Funding acquisition. M.B.: Data curation, Methodology, Software. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Gallardo, D.I., Gómez, Y.M., Santibañez, J.L. et al. The multivariate shared truncated normal frailty model with application to medical data. Sci Rep 15, 30099 (2025). https://doi.org/10.1038/s41598-025-15903-y