Introduction

Survival models study the time until an event of interest occurs. They are characterized by the inclusion of censored (incomplete) observations, arising either because an individual never experienced the event during follow-up or because the individual's follow-up was truncated during the study. In this context, because it does not assume a specific distribution for the failure times, one of the most referenced models in the literature is Cox's proportional hazards (PH) model. For \({\varvec{x}}_i=(x_{i1},\ldots ,x_{ip})\) a set of p observed covariates (without intercept term), the hazard function for this model is given by

$$\begin{aligned} h(t\, |\, {\varvec{x}}_i) = h_0(t)\exp ({\varvec{x}}_i^\top {\varvec{\beta }}), \end{aligned}$$
(1)

where \({\varvec{\beta }}=(\beta _1,\ldots ,\beta _p)\) denotes the vector of p regression coefficients and \(h_0(\cdot )\) denotes the baseline hazard function. Note that this model has a proportional hazards structure because the hazard ratio for two individuals with profiles \({\varvec{x}}_i\) and \({\varvec{x}}_{i'}\),

$$\begin{aligned} \frac{h(t\, |\, {\varvec{x}}_i)}{h(t\, |\, {\varvec{x}}_{i'})} =\exp \left( ({\varvec{x}}_i-{\varvec{x}}_{i'})^\top {\varvec{\beta }}\right) , \end{aligned}$$

does not depend on t. A way to break the proportional hazards assumption is to use univariate frailty models, although in practice the concept of frailty is more intuitive in the context of data grouped into clusters or data with some type of association (repeated measurements on the same individual, for example). The literature considers many models for the frailty distribution in a univariate context. To name a few: gamma1,2, inverse Gaussian (IG)3,4, Birnbaum-Saunders (BS)5,6, folded normal7, weighted Lindley (WL)8,9, and mixture of IG10, among others, where the restriction that the frailty variable has mean 1 is usually imposed to avoid identifiability problems.

However, when the observations are grouped in clusters of different sizes, a multivariate frailty model framework is required. In addition to the aforementioned restriction, in this case it is required that the derivatives of the Laplace transform have a known form, because the joint density function depends on them. Few distributions in the literature satisfy these conditions; the gamma, the IG, and the recently proposed WL shared frailty model9 do. For this reason, the literature on frailty models in a multivariate context has grown mainly for the bivariate or trivariate case, in which all the clusters have 2 or 3 observations, respectively. In addition to the three distributions mentioned above, we found the generalized exponential discussed in11 and the generalized inverse Gaussian presented in12. The truncated normal (TN) model was mentioned as a possible frailty distribution in7. However, it was used without imposing any mean restriction or reparameterization, applied solely within a copula model, and restricted to the bivariate case. To date, the behavior of the TN model in the frailty context has not been explored for clusters larger than two, let alone for groups with varying sample sizes.

In this paper, we use the TN distribution as the frailty distribution for clustered survival data. To make the model identifiable, we employ, as the frailty distribution, a TN distribution with mean one, obtained through a new parameterization of the TN distribution. The conditional distributions of the frailty among the survivors and of the frailty of individuals dying at time t can be determined explicitly. Furthermore, we propose a closed-form recurrence for the derivatives of the Laplace transform. For parameter estimation, we give a simple EM algorithm, since all the conditional expectations involved in the E-step are obtained in explicit form. Finally, the results of this paper have been implemented in the R statistical software. The manuscript is organized as follows. Section 2 presents a background on frailty models and introduces the TN frailty model with a parameterization such that the mean of the distribution is 1. Section 3 discusses the estimation procedure for the model based on a classical approach. Section 4 presents a simulation study to assess the performance of the proposed estimators in finite samples. In Section 5, we present two real data sets, the first related to the recurrence times of patients with renal problems and the second to fibrosarcoma data. Finally, Section 6 presents the main conclusions of this work.

Background of frailty models

In this section, we introduce the truncated normal distribution and present a background on frailty models. Then, we introduce the novel truncated normal frailty model for the univariate and multivariate cases.

The truncated normal distribution

A random variable Z has a TN distribution on the positive axis if its probability density function (PDF) is given by

$$\begin{aligned} g(z)=\dfrac{\phi \Big (\frac{z-\mu }{\sigma }\Big )}{\sigma \, \Phi \Big (\frac{\mu }{\sigma }\Big )},\quad z> 0, \end{aligned}$$

where \(\phi (\cdot )\) and \(\Phi (\cdot )\) denote the PDF and cumulative distribution function (CDF) of the standard normal distribution, \(-\infty< \mu < \infty\) represents a location parameter and \(\sigma> 0\) a scale parameter. The mean and variance of the TN distribution are given by

$$\begin{aligned} \mathbb {E}(Z)=\mu +\sigma \dfrac{\phi (\mu /\sigma )}{\Phi (\mu /\sigma )}, \quad \text{ and } \quad \text{ Var }(Z)=\sigma ^2\Bigg \{1-\dfrac{\mu \,\phi (\mu /\sigma )}{\sigma \,\Phi (\mu /\sigma )}-\bigg (\dfrac{\phi (\mu /\sigma )}{\Phi (\mu /\sigma )}\bigg )^2\Bigg \}. \end{aligned}$$

Considering the reparameterization \(\nu =\mu /\sigma\) and the restriction \(\sigma =\bigg (\nu +\dfrac{\phi (\nu )}{\Phi (\nu )}\bigg )^{-1}\) (which forces the mean to equal one), the PDF of the model reduces to

$$\begin{aligned} g(z)=\dfrac{\gamma \phi \bigg (\gamma z-\nu \bigg )}{\Phi \big (\nu \big )}, \quad \nu \in \mathbb {R}, z>0, \end{aligned}$$
(2)

with \(\gamma =\gamma (\nu )=\nu +\phi (\nu )/\Phi (\nu )\), and the mean and variance of the model are given by

$$\begin{aligned} \mathbb {E}(Z)=1 \quad \text{ and } \quad \theta =\text{ Var }(Z)=\gamma ^{-2}-\dfrac{\phi (\nu )}{\Phi (\nu )}\gamma ^{-1}, \end{aligned}$$

respectively. From now on, we use the notation TN\((\nu )\) to refer to a random variable with PDF given in Equation (2). To the best of our knowledge, this parameterization has not been proposed previously in the statistical literature. It is not possible to reparameterize the model directly in terms of the frailty variance \(\theta\); however, there is a one-to-one relationship between \(\theta\) and \(\nu\). Thus, this parameterization is very useful because it allows us to compare the TN model directly with other frailty models parameterized in terms of the frailty variance.

Note that, under the restriction \(\text {E}(Z) = 1\), we have \(0 \le \theta =\text{ Var }(Z) \le 1\). In principle, this can be a disadvantage. However, in practice the frailty variance usually satisfies this condition (see the applications in Section 5).
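The mean-one property and the variance formula can be checked by numerical integration of the density in Equation (2). A short Python sketch (the value of \(\nu\) and the quadrature settings are arbitrary illustrative choices):

```python
import math

def phi(x):                         # standard normal PDF
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):                         # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gamma_nu(nu):                   # gamma(nu) = nu + phi(nu)/Phi(nu)
    return nu + phi(nu) / Phi(nu)

def tn_pdf(z, nu):                  # density (2) of the mean-one TN(nu) model
    g = gamma_nu(nu)
    return g * phi(g * z - nu) / Phi(nu)

def simpson(f, a, b, n=4000):       # composite Simpson rule (n even)
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4.0 * sum(f(a + (2 * k - 1) * h) for k in range(1, n // 2 + 1))
    s += 2.0 * sum(f(a + 2 * k * h) for k in range(1, n // 2))
    return s * h / 3.0

nu = 0.7                            # arbitrary illustrative value
g = gamma_nu(nu)
m1 = simpson(lambda z: z * tn_pdf(z, nu), 0.0, 30.0)         # E(Z)
m2 = simpson(lambda z: z * z * tn_pdf(z, nu), 0.0, 30.0)     # E(Z^2)
theta = g ** -2 - (phi(nu) / Phi(nu)) / g   # closed-form frailty variance
print(m1, m2 - m1 ** 2, theta)
```

The first printed value is 1 up to quadrature error, and the numerical variance matches the closed-form \(\theta\).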

Figure 1 shows the PDF and the variance of the TN\((\nu )\) model for different values of \(\nu\). The flexibility of the TN distribution is apparent. Furthermore, the variance of the TN distribution decreases as \(\nu\) increases.

Fig. 1

PDF of the TN model (left) and variance of the TN distribution (right), in terms of \(\nu\).

The Laplace transform for the TN\((\nu )\) model is given by

$$\begin{aligned} \mathcal {L}_g\big (s\big )&=\dfrac{\Phi \big (\kappa \big )}{\Phi \big (\nu \big )}\exp \bigg \{\frac{s}{\gamma }\,\Big (\frac{s}{2\gamma }-\nu \Big ) \bigg \}, \end{aligned}$$
(3)

where \(\kappa =\kappa (s,\nu )=\nu -s/\gamma\). Let \(\mathcal {L}^{(d)}_g\big (s\big )\) be the d-th derivative of the Laplace transform. For \(d=1\) and \(d=2\), these derivatives are given by

$$\begin{aligned} \mathcal {L}^{(1)}_g\big (s\big ) = -\dfrac{\mathcal {L}_g\big (s\big )}{\gamma }\Bigg (\kappa +\frac{\phi (\kappa )}{\Phi (\kappa )}\Bigg ) \quad \text{ and } \quad \mathcal {L}^{(2)}_g \big (s\big ) = \frac{ \mathcal {L}_g \big (s\big )}{\gamma ^2} \Bigg [\kappa \left( \kappa +\frac{\phi (\kappa )}{\Phi (\kappa )}\right) +1 \Bigg ]. \end{aligned}$$
(4)

In13, Corollary 2.1 presents a recurrence relation for the derivatives of order 3 or higher of the moment-generating function (denoted \(M_g(\cdot )\)) of the TN model. Using the property \(M_g(s)=\mathcal {L}_g(-s)\), we can derive the following relation:

$$\begin{aligned} \mathcal {L}_g^{(d)}(s) = \frac{(d-1)}{\gamma ^2}\mathcal {L}_g^{(d-2)}(s) - \frac{\kappa }{\gamma }\mathcal {L}_g^{(d-1)}(s), \quad d = 3,4,\ldots , \end{aligned}$$
(5)

which depends on the two previous derivatives but is simple to implement computationally. Higher-order derivatives of the Laplace transform are provided in the table included in the Supplementary Material file. The results in Equations (4) and (5) are central to the development of our approach to the TN model within the context of frailty models.
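Since \(\mathcal {L}^{(d)}_g(s)=(-1)^d\,\mathbb {E}\big (Z^d e^{-sZ}\big )\), the closed forms (3)-(4) and the recurrence (5) can be cross-checked by direct quadrature. A short Python sketch with arbitrary illustrative values of \(\nu\), s and d:

```python
import math

phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def laplace_derivs(s, nu, d):
    """[L(s), L'(s), ..., L^(d)(s)] from the closed forms (3)-(4)
    and the recurrence (5) for the TN(nu) model."""
    g = nu + phi(nu) / Phi(nu)                 # gamma(nu)
    k = nu - s / g                             # kappa(s, nu)
    L0 = Phi(k) / Phi(nu) * math.exp(s / g * (s / (2.0 * g) - nu))
    out = [L0,
           -L0 / g * (k + phi(k) / Phi(k)),
           L0 / g ** 2 * (k * (k + phi(k) / Phi(k)) + 1.0)]
    for j in range(3, d + 1):                  # recurrence (5)
        out.append((j - 1) / g ** 2 * out[j - 2] - k / g * out[j - 1])
    return out[:d + 1]

nu, s, d = 0.5, 0.8, 4                         # arbitrary illustrative values
g = nu + phi(nu) / Phi(nu)
n, upper = 40000, 30.0
h = upper / n
num = 0.0
for i in range(1, n + 1):                      # E[Z^d exp(-s Z)] by quadrature
    z = i * h
    num += (z ** d) * math.exp(-s * z) * g * phi(g * z - nu) / Phi(nu)
num *= h
print(laplace_derivs(s, nu, d)[d], (-1) ** d * num)
```

Both printed values agree, confirming the recurrence up to the chosen order.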

Univariate frailty models

In a univariate context, the extended Cox model with an unobserved source of heterogeneity has a conditional hazard function given by

$$\begin{aligned} h(t\, |\, z_i,\textbf{x}_i) = z_i \, h_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }}), \end{aligned}$$
(6)

where \(\textbf{x}_i\) denotes a vector of covariates and \(z_i\) is a latent variable representing the unobserved heterogeneity of the i-th observation. For \(z_1, z_2,\ldots ,z_n\), a positive distribution (say one with PDF \(g(\cdot )\)) is assumed, typically with mean 1 to avoid identifiability problems14. Similar to Eq. (1), this implies that the quotient of the conditional hazard functions of two individuals does not depend on t; we remark that in this case it is the conditional (and not the marginal) hazard function that satisfies this property. Also note that the larger \(z_i\) is, the greater the risk associated with that observation. The conditional survival function for the i-th individual, obtained from Equation (6), is given by

$$\begin{aligned} S(t\, |\, z_i,\textbf{x}_i)=\exp \left\{ -\int _0^t h(u\, |\, z_i,\textbf{x}_i)du\right\} =\exp \Big \{-z_i H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})\Big \}, \end{aligned}$$

where \(H_0(t)=\int _{0}^{t}h_0(u)du\) represents the baseline cumulative hazard function. The marginal survival function can be obtained as

$$\begin{aligned} S(t\, |\, \textbf{x}_i)=\int _{0}^{\infty }\exp \Big \{-z_i H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})\Big \}g(z_i)dz_i=\mathcal {L}_g\left( H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})\right) , \end{aligned}$$

where \(\mathcal {L}_g(\cdot )\) corresponds to the Laplace transform of the PDF \(g(\cdot )\). On the other hand, the marginal hazard function is given by

$$\begin{aligned} h(t\, |\, \textbf{x}_i)=-\frac{\partial S(t\, |\, \textbf{x}_i)/\partial t}{S(t\, |\, \textbf{x}_i)}=-\dfrac{h_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})\, \mathcal {L}^{(1)}_g\big (H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})\big )}{\mathcal {L}_g\big (H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})\big )}, \end{aligned}$$
(7)

where \(\mathcal {L}^{(d)}_g(\cdot )\), \(d\in \mathbb {Z}^{+}\), denotes the d-th derivative of \(\mathcal {L}_g(\cdot )\). It is clear from Eq. (7) that the PH assumption is not satisfied in this case. In particular, when \(Z_i \sim \text{ TN }(\nu )\), the marginal survival and hazard functions reduce to

$$\begin{aligned} S(t\mid \textbf{x}_i)&=\dfrac{\Phi \big (\nu -\frac{H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})}{\gamma }\big )}{\Phi \big (\nu \big )}\exp \bigg \{\frac{H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})}{\gamma }\,\left( \frac{H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})}{2\gamma }-\nu \right) \bigg \}, \quad \text{ and } \nonumber \\ h(t\mid \textbf{x}_i)&=\frac{h_0(t)}{\gamma }\exp \left( \textbf{x}_i^\top {\varvec{\beta }}\right) \left\{ \nu -\frac{H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})}{\gamma }+\frac{\phi \left( \nu -\frac{H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})}{\gamma }\right) }{\Phi \left( \nu -\frac{H_0(t)\exp (\textbf{x}_i^\top {\varvec{\beta }})}{\gamma }\right) } \right\} . \nonumber \end{aligned}$$
(8)
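The closed-form survival function in (8) and the relation \(h(t\mid \textbf{x}) = -\partial \log S(t\mid \textbf{x})/\partial t\) can be cross-checked numerically; the check also confirms that the marginal hazard is positive. A short Python sketch, assuming a Weibull baseline \(H_0(t)=\lambda t^\rho\) and arbitrary illustrative values of \(\lambda\), \(\rho\), \(\nu\) and \(\textbf{x}^\top {\varvec{\beta }}\):

```python
import math

phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

nu = 0.8                                       # arbitrary illustrative value
g = nu + phi(nu) / Phi(nu)                     # gamma(nu)
lam, rho, eta = 0.5, 1.3, math.exp(0.4)        # Weibull baseline and exp(x'beta)

H0 = lambda t: lam * t ** rho                  # cumulative baseline hazard
h0 = lambda t: lam * rho * t ** (rho - 1.0)    # baseline hazard

def S(t):                                      # closed-form marginal survival
    s = H0(t) * eta
    return Phi(nu - s / g) / Phi(nu) * math.exp(s / g * (s / (2.0 * g) - nu))

def h(t):                                      # closed-form marginal hazard
    k = nu - H0(t) * eta / g
    return h0(t) * eta / g * (k + phi(k) / Phi(k))

t, eps = 1.7, 1e-6
h_num = -(math.log(S(t + eps)) - math.log(S(t - eps))) / (2.0 * eps)
print(h(t), h_num)
```

The closed-form value and the central-difference approximation of \(-\partial \log S/\partial t\) agree.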

Finally, for the univariate case, we present two propositions giving the conditional distribution of the frailty given the events \(T>t\) and \(T=t\), respectively.

Proposition 1.1

The conditional distribution of the frailty \(Z\mid T>t\) follows a TN(\(\varepsilon\)) distribution, where \(\varepsilon =\varepsilon (H_0(t),\nu )=\nu -H_0(t)/\gamma\).

Proposition 1.2

The conditional distribution of the frailty \(Z\mid T=t\) follows a modified half-normal (MHN) distribution15, whose density function is given by

$$\begin{aligned} f(z|T=t)=\frac{\gamma ^2 \exp \big (-\frac{\kappa ^2}{2}\big )}{\sqrt{2\pi }\big (\kappa \Phi (\kappa )+\phi (\kappa )\big )} z \exp \Big \{ -\frac{\gamma ^2}{2} z^2 + \gamma \kappa z\Big \}. \end{aligned}$$

Proofs of Propositions 1.1 and 1.2 are provided in the Supplementary Material.
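Proposition 1.1 can also be verified numerically via Bayes' rule: \(f(z\mid T>t)=e^{-zH_0(t)}\,g(z)/\mathcal {L}_g(H_0(t))\), which must coincide with the truncated-normal form with location ratio \(\varepsilon\) (note that the factor \(\gamma =\gamma (\nu )\) is unchanged by the conditioning). A Python sketch with arbitrary illustrative values of \(\nu\) and \(H_0(t)\):

```python
import math

phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

nu, H0t = 0.6, 0.9                   # arbitrary nu and value of H_0(t)
g = nu + phi(nu) / Phi(nu)           # gamma(nu), unchanged by the conditioning
eps = nu - H0t / g                   # epsilon(H_0(t), nu)

tn_pdf = lambda z: g * phi(g * z - nu) / Phi(nu)       # frailty density (2)
laplace = lambda s: (Phi(nu - s / g) / Phi(nu)
                     * math.exp(s / g * (s / (2.0 * g) - nu)))

diffs = []
for z in (0.2, 0.7, 1.1, 1.8):
    bayes = math.exp(-z * H0t) * tn_pdf(z) / laplace(H0t)   # f(z | T > t)
    closed = g * phi(g * z - eps) / Phi(eps)                # TN form with ratio eps
    diffs.append(abs(bayes - closed))
print(max(diffs))
```

The two densities coincide pointwise up to floating-point roundoff.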

Multivariate shared frailty model

In a more general context, it is possible to consider that the observations are grouped in m clusters, where the i-th cluster has \(n_i\) observations, for \(i=1,\ldots ,m\). This setting is appropriate when the observations in the same cluster have some kind of dependence: for instance, repeated measurements on the same individual, or members of the same family, among others. The assumption here is that all the observations in the same cluster are conditionally independent given the corresponding frailty term (\(z_i\)). Under this assumption, the conditional hazard and the joint survival function are given by

$$\begin{aligned} h(t_{i1},\ldots , t_{in_i} \mid z_i,\textbf{X}_{i})&=\sum _{j=1}^{n_i} h(t_{ij}\mid z_i, \textbf{x}_{ij})=z_i \sum _{j=1}^{n_i}\exp \left( \textbf{x}_{ij}^\top {\varvec{\beta }}\right) h_0(t_{ij}), \quad \text{ and }\\ S(t_{i1},\ldots , t_{in_i} \mid z_i,\textbf{X}_{i})&=\exp \left( -z_i \sum _{j=1}^{n_i}\exp \left( \textbf{x}_{ij}^\top {\varvec{\beta }}\right) H_0(t_{ij})\right) , \end{aligned}$$

respectively, where \(\textbf{x}_{ij}^\top =(x_{ij1},\ldots ,x_{ijp})\) denotes the vector of p covariates related to the j-th individual in the i-th cluster, \(\textbf{X}_i^\top =(\textbf{x}_{i1}^\top ,\ldots ,\textbf{x}_{in_i}^\top )\) denotes the vector with all the information on the p covariates associated with the \(n_i\) observations in the i-th cluster, and \(z_i\) represents the influence of the i-th cluster on its observations. Integrating over the density of \(z_i\), we obtain that the marginal survival function for \({\varvec{t}}_i=(t_{i1},\ldots ,t_{in_i})\) is given by

$$\begin{aligned} S({\varvec{t}}_i\mid \textbf{X}_i) = \mathcal {L}_g \left( \sum _{j=1}^{n_i}H_0(t_{ij})\exp (\textbf{x}_{ij}^\top {\varvec{\beta }})\right) , \end{aligned}$$

and then, the marginal hazard function is given by

$$\begin{aligned} h({\varvec{t}}_i\mid \textbf{X}_i) = \frac{(-1)^{n_i} \displaystyle \sum _{j=1}^{n_i} h_0(t_{ij})\exp (\textbf{x}_{ij}^\top {\varvec{\beta }}) \, \mathcal {L}_g^{(n_i)}\Big (\displaystyle \sum _{j=1}^{n_i} H_0(t_{ij})\exp (\textbf{x}_{ij}^\top {\varvec{\beta }})\Big )}{\mathcal {L}_g \Big (\sum _{j=1}^{n_i}H_0(t_{ij})\exp (\textbf{x}_{ij}^\top {\varvec{\beta }})\Big )}. \end{aligned}$$
(9)

For the TN model, the expression in Equation (9) can be evaluated using the recursive formula in (5). For the bivariate case (i.e., \(n_i=2\) for all \(i=1,\ldots ,m\)), the marginal hazard function reduces to

$$\begin{aligned} h(t_{i1},t_{i2}\mid \textbf{x}_{i1},\textbf{x}_{i2})=\frac{1}{\gamma ^2}\sum _{j=1}^{2} h_0(t_{ij}) \exp (\textbf{x}_{ij}^\top {\varvec{\beta }}) \left[ \left( \nu -\frac{s_i}{\gamma }\right) \left( \left( \nu -\frac{s_i}{\gamma }\right) +\frac{\phi \left( \nu -\frac{s_i}{\gamma }\right) }{\Phi \left( \nu -\frac{s_i}{\gamma }\right) }\right) +1 \right] , \end{aligned}$$

where \(s_i=H_0(t_{i1})\exp (\textbf{x}_{i1}^\top {\varvec{\beta }})+H_0(t_{i2})\exp (\textbf{x}_{i2}^\top {\varvec{\beta }})\).

Kendall’s tau

Kendall’s \(\tau\) is a measure that quantifies the dependence between observations in the same cluster. This measure does not depend on the unit of measurement of the data, so it is preferable to the variance or the correlation of the data, which have known limitations (non-existence of the second moment, presence of censored observations, different measurement scales; see16, page 153, for details). Using the Laplace transform and its second derivative (Equations (3) and (4), respectively), we can determine the value of \(\tau\) for the TN distribution, which is defined as

$$\begin{aligned} \tau&=4\int _{0}^{\infty } s\mathcal {L}^{(2)}(s)\mathcal {L}(s)ds - 1,\\&=4\int _{0}^{\infty } s \left[ \dfrac{\Phi \big (\kappa \big )}{\gamma \Phi \big (\nu \big )}\right] ^2\exp \bigg \{\frac{s}{\gamma }\,\Big (\frac{s}{\gamma }-2\nu \Big ) \bigg \} \Bigg [\kappa \left( \kappa +\frac{\phi (\kappa )}{\Phi (\kappa )}\right) +1 \Bigg ]ds-1. \end{aligned}$$

The integral is computed numerically since it does not have a closed-form expression. Figure 2 shows the dependence (\(\tau\)) as a function of the frailty variance for different frailty models. Note that \(\tau \in [0,0.33]\) for \(\theta \in (0,1)\) in the TN frailty model. We also note that, for a given frailty variance \(\theta \in (0,0.864)\), the TN frailty model produces a higher degree of dependence \(\tau\) than the GA, IG, and WL frailty models.
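The integrand involves large exponents and far normal tails, so a careful numerical implementation works in log scale. A Python sketch (the integration limit and grid size are pragmatic choices, and the far left normal tail uses a standard asymptotic expansion):

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def log_phi(x):
    return -0.5 * x * x - LOG_SQRT_2PI

def log_Phi(x):
    if x > -20.0:
        return math.log(0.5 * math.erfc(-x / math.sqrt(2.0)))
    # asymptotic expansion of log Phi(x) in the far left tail
    return log_phi(x) - math.log(-x) + math.log1p(-1.0 / x ** 2 + 3.0 / x ** 4)

def kendall_tau_tn(nu, upper=100.0, n=20000):
    """tau = 4 * int_0^inf s L''(s) L(s) ds - 1 for the TN(nu) frailty,
    using the closed forms of L and L'' and Simpson's rule."""
    g = nu + math.exp(log_phi(nu) - log_Phi(nu))    # gamma(nu)
    def f(s):
        k = nu - s / g
        logL = log_Phi(k) - log_Phi(nu) + s / g * (s / (2.0 * g) - nu)
        r_k = math.exp(log_phi(k) - log_Phi(k))     # phi(kappa)/Phi(kappa)
        return s * math.exp(2.0 * logL) / g ** 2 * (k * (k + r_k) + 1.0)
    h = upper / n
    tot = f(0.0) + f(upper)
    tot += 4.0 * sum(f((2 * i - 1) * h) for i in range(1, n // 2 + 1))
    tot += 2.0 * sum(f(2 * i * h) for i in range(1, n // 2))
    return 4.0 * (tot * h / 3.0) - 1.0

tau0, tau2 = kendall_tau_tn(0.0), kendall_tau_tn(2.0)
print(tau0, tau2)
```

Consistent with Figure 2, \(\tau\) decreases as \(\nu\) increases (i.e., as the frailty variance decreases).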

Fig. 2

Comparison of Kendall’s \(\tau\) for the TN, weighted Lindley (WL), gamma (GA) and inverse Gaussian (IG) frailty models.

On the baseline hazard function

The baseline hazard function \(h_0(t)\) is usually modeled with common distributions with positive support, such as the Weibull, gamma, and Gompertz, among others. For the Weibull distribution, we consider the parameterization such that \(h_0(t)=\lambda \rho t^{\rho -1}\) and \(H_0(t)=\lambda t^{\rho }\), \(t, \lambda , \rho>0\), and we write \(T\sim \text{ W }(\lambda ,\rho )\) to refer to this particular parameterization. The Weibull model has been widely used in the literature because it adapts well to diverse biological, physical, chemical, and industrial processes, to name a few. Furthermore, its hazard function can assume monotonic forms (increasing, decreasing, or constant), controlled only by \(\rho\). Another option is the piecewise exponential (PE) distribution, introduced in17 and extended in18 to the case with covariates. This model considers a constant risk within each predefined interval, based on cut points \((a_1,...,a_{L-1})\) such that \(0=a_0<a_1<...<a_{L-1}<a_L = \infty\). This distribution is extremely useful for accommodating critical points where there may be abrupt changes in the baseline hazard function that cannot be captured by non-segmented distributions such as the Weibull. We say that T has a PE distribution with parameter vector \({\varvec{\lambda }}=(\lambda _1,...,\lambda _L)\) and known time partition \({\varvec{a}}=(a_1,...,a_{L-1})\) (we denote \(T\sim PE_a({\varvec{\lambda }})\)), if its survival function is given by

$$\begin{aligned} S(t) = \exp \bigg (-\sum _{l=1}^{L}\lambda _l\nabla _l(t) \bigg ),\,\,\, t>0, \end{aligned}$$

where

$$\begin{aligned} \nabla _l(t) = \left\{ \begin{aligned} 0&,\ \text {if} \ t< a_{l-1},\\ t-a_{l-1}&,\ \text {if} \ a_{l-1} \le t < a_l, \\ a_l-a_{l-1}&,\ \text {if} \ t\ge a_l. \end{aligned} \right. \end{aligned}$$

The hazard function is given by

$$\begin{aligned} h_0(t)=\lambda _\ell , \quad t\in (a_{\ell -1},a_\ell ],\ \ell =1,...,L, \end{aligned}$$

and the cumulative hazard function is given by

$$\begin{aligned} H_0(t)=\sum _{l=1}^{L}\lambda _l\nabla _l(t). \end{aligned}$$

In the literature, when the PE model is used in the context of frailty models, it is typically referred to as a semi-parametric model19. However, in this work, we also consider a non-parametric form for the baseline hazard distribution.
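The PE baseline functions \(h_0\), \(H_0\) and the implied survival function are straightforward to implement. A Python sketch with hypothetical cut points and rates:

```python
import bisect
import math

def pe_hazard(t, a, lam):
    """h0(t) = lam[l] for t in (a_{l-1}, a_l]; a holds the interior
    cut points (a_1, ..., a_{L-1}) and lam the L constant rates."""
    return lam[bisect.bisect_left(a, t)]

def pe_cum_hazard(t, a, lam):
    """H0(t) = sum_l lam[l] * nabla_l(t), with nabla_l as defined above."""
    grid = [0.0] + list(a) + [math.inf]        # a_0 = 0, a_L = infinity
    H = 0.0
    for l in range(len(lam)):
        if t <= grid[l]:
            break
        H += lam[l] * (min(t, grid[l + 1]) - grid[l])
    return H

a = (1.0, 3.0)                                 # hypothetical cut points
lam = (0.5, 0.2, 0.8)                          # hypothetical rates per interval
S = lambda t: math.exp(-pe_cum_hazard(t, a, lam))
print(pe_cum_hazard(2.0, a, lam), S(2.0))
```

For instance, with these values \(H_0(2)=0.5\cdot 1+0.2\cdot 1=0.7\), so \(S(2)=e^{-0.7}\).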

Estimation

In this section, we discuss parameter estimation for the TN frailty model. Let \(Y_{ij}\) and \(C_{ij}\) be the failure and censoring times for the j-th individual in the i-th cluster and \({\textbf {x}}_{ij}\) be a \(p\times 1\) covariate vector (without intercept term), where \(1\le i \le m\) and \(1 \le j \le n_i\). Under a right-censoring scheme, we observe the random variables \(T_{ij}=\min (Y_{ij}, C_{ij})\) and \(\delta _{ij}=I(Y_{ij} \le C_{ij})\), where \(I(A)=1\) if the event A occurs (0 otherwise). We assume the frailty terms \(Z_1,\ldots ,Z_m\) to be a random sample from the TN\((\nu )\) distribution. We consider the following assumptions:

  i) The pairs \((Y_{i1}, C_{i1}),\ldots ,(Y_{in_i}, C_{in_i})\) are conditionally independent given \(Z_i\), and \(Y_{ij}\) and \(C_{ij}\) are mutually independent for \(j=1,\ldots ,n_i\).

  ii) \(C_{i1}, \ldots , C_{in_i}\) are non-informative about \(Z_i\).

Under this setting, the observed likelihood function is given by

$$\begin{aligned} L({\varvec{\beta }}, H_0, \nu )&=\prod _{i=1}^m\int _0^{+\infty } \prod _{j=1}^{n_i} \left[ z_i h_0(t_{ij})\exp \left( {\textbf {x}}_{ij}^\top {\varvec{\beta }}\right) \right] ^{\delta _{ij}} \exp \left( -z_i H_0(t_{ij})\text {e}^{{\textbf {x}}_{ij}^\top {\varvec{\beta }}}\right) \dfrac{\gamma \phi (\gamma z_i-\nu )}{\Phi \big (\nu \big )}dz_i \nonumber \\&=\left( \frac{\gamma \text {e}^{-\nu ^2/2}}{\sqrt{2\pi }\Phi (\nu )}\right) ^m \exp \left( \sum _{i=1}^m \sum _{j=1}^{n_i} \delta _{ij}\textbf{x}_{ij}^\top {\varvec{\beta }}\right) \prod _{i=1}^m \int _0^\infty z_i^{r_i} \exp \left( -b_{\nu } z_i^2+c_{\varvec{\psi }}^{(i)} z_i\right) dz_i \prod _{j=1}^{n_i} h_0(t_{ij})^{\delta _{ij}}. \end{aligned}$$

where \(r_i=\sum _{j=1}^{n_i} \delta _{ij}\) is the number of failures in the i-th cluster, \(b_\nu =\gamma ^2/2\) and \(c_{\varvec{\psi }}^{(i)}=\gamma \nu -\sum _{j=1}^{n_i} H_0(t_{ij})e^{\textbf{x}_{ij}^\top {\varvec{\beta }}}\). The last integral is related to the modified half-normal (MHN) distribution15 and can be written as

$$\begin{aligned} \int _0^\infty z_i^{r_i} \exp \left( -b_\nu z_i^2+c_{\varvec{\psi }}^{(i)} z_i\right) dz_i=\frac{1}{2}b_\nu ^{-(r_i+1)/2}\Psi \left( \frac{r_i+1}{2},\frac{c_{\varvec{\psi }}^{(i)}}{\sqrt{b_\nu }}\right) , \end{aligned}$$

where

$$\begin{aligned} \Psi \left( \frac{\alpha }{2},x\right) =\sum _{k=0}^\infty \frac{\Gamma (\frac{\alpha +k}{2})}{k!}x^k, \end{aligned}$$

is a specific case of the Fox-Wright function. The supplementary material in15 discusses different ways to compute this term. Therefore,

$$\begin{aligned} L({\varvec{\beta }}, H_0, \nu )=b_\nu ^{-(r+m)/2}\left( \frac{\gamma \text {e}^{-\nu ^2/2}}{2 \sqrt{2\pi }\Phi (\nu )}\right) ^m \exp \left( \sum _{i=1}^m \sum _{j=1}^{n_i} \delta _{ij}\textbf{x}_{ij}^\top {\varvec{\beta }}\right) \prod _{i=1}^m \Psi \left( \frac{r_i+1}{2},\frac{c_{\varvec{\psi }}^{(i)}}{\sqrt{b_\nu }}\right) \prod _{j=1}^{n_i} h_0(t_{ij})^{\delta _{ij}}, \end{aligned}$$

with \(r=\sum _{i=1}^m r_i\) the total number of failures in the sample. In a parametric approach, \(H_0(t)\) or \(h_0(t)\) is specified by a set of parameters, say \({\varvec{\lambda }}\), and then the parameter vector reduces to \(({\varvec{\beta }}, {\varvec{\lambda }}, \nu )\). For instance, for the Weibull (WEI) distribution, we use the parameterization \(H_0(t)=\lambda \, t^\rho\) and \(h_0(t)=\lambda \, \rho \, t^{\rho -1}\), where \(t>0\) and \({\varvec{\lambda }}=(\lambda ,\rho )\in \mathbb {R}_+^2\). From a classical approach, the ML estimator can be obtained by maximizing \(\log L({\varvec{\beta }}, {\varvec{\lambda }}, \nu )\) with respect to \({\varvec{\beta }}, {\varvec{\lambda }}\) and \(\nu\). Owing to the flexibility discussed in the previous section, we also consider the PE model. However, it can also be attractive to discuss a non-parametric approach for the baseline distribution. For this, in the next subsection, we consider an estimation procedure based on the EM algorithm.
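The series defining \(\Psi\) converges for every argument, so the likelihood can be evaluated by direct summation. A Python sketch (with arbitrary illustrative values of \(r_i\), \(b_\nu\) and \(c^{(i)}_{\varvec{\psi }}\)) that also cross-checks the integral identity above by quadrature:

```python
import math

def Psi(u, x, kmax=400):
    """Series Psi(u, x) = sum_{k>=0} Gamma(u + k/2) x^k / k!,
    i.e. the Fox-Wright-type function with first argument u = alpha/2."""
    total = math.gamma(u)                      # k = 0 term
    for k in range(1, kmax):
        if x == 0.0:
            break
        sign = -1.0 if (x < 0.0 and k % 2 == 1) else 1.0
        total += sign * math.exp(math.lgamma(u + 0.5 * k)
                                 + k * math.log(abs(x)) - math.lgamma(k + 1.0))
    return total

r_i, b, c = 2, 0.8, -0.3                       # arbitrary illustrative values
n, upper = 40000, 25.0
h = upper / n
integral = h * sum((i * h) ** r_i * math.exp(-b * (i * h) ** 2 + c * (i * h))
                   for i in range(1, n + 1))   # quadrature of the MHN integral
closed = 0.5 * b ** (-(r_i + 1) / 2) * Psi((r_i + 1) / 2, c / math.sqrt(b))
print(integral, closed)
```

The quadrature value and the closed form \(\tfrac{1}{2}b^{-(r_i+1)/2}\Psi \big (\tfrac{r_i+1}{2},c/\sqrt{b}\big )\) agree; log-gamma arithmetic keeps the series stable for moderately large arguments.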

EM algorithm

Given the unobservable nature of the frailty terms, the EM algorithm is a natural tool in this context. Let \({\varvec{t}}_i^\top =(t_{i1},...,t_{in_i})\), \({\varvec{\delta }}_i^\top =(\delta _{i1},...,\delta _{in_i})\) and \({\varvec{x}}_i^\top =({\varvec{x}}_{i1},...,{\varvec{x}}_{in_i})\) denote the observed times, failure indicators and covariates related to the \(n_i\) observations in the i-th cluster, \(i=1,\ldots ,m\). For our particular problem, \(\mathcal {D}_c= ({\varvec{t}}^\top ,{\varvec{\delta }}^\top ,{\varvec{X}}^\top ,{\varvec{Z}}^\top )\) represents the complete data, where \({\varvec{t}}^\top =({\varvec{t}}_1^\top ,...,{\varvec{t}}_m^\top )\), \({\varvec{\delta }}^\top =({\varvec{\delta }}_1^\top ,...,{\varvec{\delta }}_m^\top )\), \({\varvec{X}}^\top =({\varvec{x}}_1^\top ,...,{\varvec{x}}_m^\top )\) and \({\varvec{Z}}^\top =(z_1,...,z_m)\); here \(\mathcal {D}_{o}=({\varvec{t}}^\top ,{\varvec{\delta }}^\top ,{\varvec{X}}^\top )\) is the observed data and \({\varvec{Z}}^\top\) represents the vector of latent variables. Note that the complete likelihood function can be written as \(L({\varvec{\beta }},H_0,\nu ;\mathcal {D}_{c})=L_1({\varvec{\beta }}, H_0;\mathcal {D}_{c})\times L_2(\nu ; {\varvec{Z}})\), where \(L_1({\varvec{\beta }}, H_0;\mathcal {D}_{c})\) \(=\prod _{i=1}^m \prod _{j=1}^{n_i} \left[ z_i h_0(t_{ij})\exp ({\textbf {x}}_{ij}^\top {\varvec{\beta }})\right] ^{\delta _{ij}}\)\(\exp (-z_i H_0(t_{ij})\text {e}^{{\textbf {x}}_{ij}^\top {\varvec{\beta }}})\) and \(L_2(\nu ; {\varvec{Z}})=\prod _{i=1}^m f(z_i;\nu )\).

The complete log-likelihood function is given by \(\ell _c({\varvec{\beta }}, H_0,\nu ;\mathcal {D}_{c})=\ell _{1c}({\varvec{\beta }}, H_0;\mathcal {D}_{c})+\ell _{2c}(\nu ; {\varvec{Z}})\), where except for a constant that does not depend on \({\varvec{\beta }}\), \(H_0\) or \(\nu\), such functions are given by

$$\begin{aligned} \ell _{1c}({\varvec{\beta }},H_0;\mathcal {D}_c) =&\sum _{i=1}^{m} \sum _{j=1}^{n_i}\bigg \{ \delta _{ij}\big [\log h_0(t_{ij})+{\varvec{x}}_{ij}^\top {\varvec{\beta }}\big ] -z_i H_0(t_{ij}) \exp ({\varvec{x}}_{ij}^\top {\varvec{\beta }})\bigg \}, \quad \text{ and }\\ \ell _{2c}(\nu ; {\varvec{Z}}) =&\sum _{i=1}^{m} \Big \{\log \gamma - \log \Phi (\nu ) - \frac{1}{2}\log (2\pi ) - \frac{1}{2}\big (z_i^2\gamma ^2 - 2\gamma \nu z_i+\nu ^2\big ) \Big \}. \end{aligned}$$

Let \({\varvec{\psi }}^{(k)} = \left( {\varvec{\beta }}^{(k)},H_0^{(k)}, \nu ^{(k)}\right)\) be the estimated vector of \({\varvec{\psi }}= ({\varvec{\beta }},H_0, \nu )\) at the k-th iteration and

$$\begin{aligned} Q({\varvec{\psi }}\,|\, {\varvec{\psi }}^{(k)})=\mathbb {E}\left( \ell _c({\varvec{\beta }}, H_0,\nu ;\mathcal {D}_{c})\mid \mathcal {D}_o, {\varvec{\psi }}={\varvec{\psi }}^{(k)}\right) , \end{aligned}$$

i.e., the conditional expectation of \(\ell _c({\varvec{\beta }}, H_0,\nu ;\mathcal {D}_{c})\) given the observed data and \({\varvec{\psi }}^{(k)}\). Note that \(Q({\varvec{\psi }}\mid {\varvec{\psi }}^{(k)})=Q_1(({\varvec{\beta }},H_0)\mid {\varvec{\psi }}^{(k)})+Q_2(\nu \mid {\varvec{\psi }}^{(k)})\), where

$$\begin{aligned} Q_1(({\varvec{\beta }},H_0)\mid {\varvec{\psi }}^{(k)})=&\sum _{i=1}^{m} \sum _{j=1}^{n_i}\bigg \{ \delta _{ij}\big [\log h_0(t_{ij})+{\varvec{x}}_{ij}^\top {\varvec{\beta }}\big ] -\widehat{z}_i^{(k)} H_0(t_{ij}) \exp ({\varvec{x}}_{ij}^\top {\varvec{\beta }})\bigg \}, \quad \text{ and } \end{aligned}$$
(10)
$$\begin{aligned} Q_2(\nu \mid {\varvec{\psi }}^{(k)})=&\sum _{i=1}^{m} \Big \{\log \gamma - \log \Phi (\nu ) - \frac{1}{2}\log (2\pi ) - \frac{1}{2}\big (\widehat{z_i^2}^{(k)}\gamma ^2 - 2\gamma \nu \widehat{z}_i^{(k)}+\nu ^2\big ) \Big \}, \end{aligned}$$
(11)

where \(\widehat{z}_i^{(k)}=\mathbb {E}\big [Z_i\mid \mathcal {D}_o,{\varvec{\psi }}={\varvec{\psi }}^{(k)}\big ]\) and \(\widehat{z_i^2}^{(k)} =\mathbb {E}\big [Z_i^2\mid \mathcal {D}_o,{\varvec{\psi }}={\varvec{\psi }}^{(k)}\big ]\). It is possible to show that

$$\begin{aligned} Z_i\mid {\varvec{t}}_i^\top ,{\varvec{\delta }}_i^\top \sim \text{ MHN }\left( a_i=1+r_i,b_\nu =\frac{\gamma ^2}{2}, c_{\varvec{\psi }}^{(i)}=\gamma \nu - \sum _{j=1}^{n_i} H_0(t_{ij}) \exp ({\varvec{x}}_{ij}^\top {\varvec{\beta }})\right) . \end{aligned}$$
(12)

Refer to the supplementary material file for a proof of this fact. Using this notation, and applying Lemma 2 from Sun et al.15, it follows immediately that

$$\begin{aligned} {\widehat{z}}_i^{(k)}&= \mathbb {E}\left( Z_i\mid \mathcal {D}_o,{\varvec{\psi }}={\varvec{\psi }}^{(k)}\right) = \frac{\Psi \left( \frac{r_i+2}{2};\frac{c_{\psi }^{(i)}}{\sqrt{b_\nu }}\right) }{\sqrt{b_\nu }\,\Psi \left( \frac{r_i+1}{2};\frac{c_{\psi }^{(i)}}{\sqrt{b_\nu }}\right) }, \quad \text{ and } \\ \widehat{z_i^2}^{(k)}&= \mathbb {E}\left( Z_i^{2}\mid \mathcal {D}_o,{\varvec{\psi }}={\varvec{\psi }}^{(k)}\right) = \frac{\Psi \left( \frac{r_i+3}{2};\frac{c_{\psi }^{(i)}}{\sqrt{b_\nu }}\right) }{b_\nu \,\Psi \left( \frac{r_i+1}{2};\frac{c_{\psi }^{(i)}}{\sqrt{b_\nu }}\right) }. \end{aligned}$$
(13)
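These conditional expectations can be verified numerically: for a density proportional to \(z^{\alpha -1}\exp (-bz^2+cz)\), applying the integral identity of the previous section to numerator and denominator gives \(\mathbb {E}(Z^k)=b^{-k/2}\,\Psi \big (\frac{\alpha +k}{2};\frac{c}{\sqrt{b}}\big )/\Psi \big (\frac{\alpha }{2};\frac{c}{\sqrt{b}}\big )\). A Python sketch with \(\alpha =r_i+1\) as in (12) and arbitrary illustrative values:

```python
import math

def Psi(u, x, kmax=400):
    """Psi(u, x) = sum_{k>=0} Gamma(u + k/2) x^k / k!."""
    total = math.gamma(u)
    for k in range(1, kmax):
        if x == 0.0:
            break
        sign = -1.0 if (x < 0.0 and k % 2 == 1) else 1.0
        total += sign * math.exp(math.lgamma(u + 0.5 * k)
                                 + k * math.log(abs(x)) - math.lgamma(k + 1.0))
    return total

def mhn_moment(k, alpha, b, c):
    """k-th moment of the MHN(alpha, b, c) law, whose density is
    proportional to z^(alpha - 1) exp(-b z^2 + c z) on (0, inf)."""
    x = c / math.sqrt(b)
    return b ** (-0.5 * k) * Psi(0.5 * (alpha + k), x) / Psi(0.5 * alpha, x)

r_i, b, c = 3, 0.9, -0.4                       # arbitrary illustrative values
alpha = r_i + 1                                # as in (12)
n, upper = 40000, 25.0
h = upper / n
num = [h * sum((i * h) ** (alpha - 1 + k) * math.exp(-b * (i * h) ** 2 + c * (i * h))
               for i in range(1, n + 1)) for k in range(3)]   # raw integrals
print(num[1] / num[0], mhn_moment(1, alpha, b, c))
print(num[2] / num[0], mhn_moment(2, alpha, b, c))
```

The quadrature-based moments and the \(\Psi\)-ratio expressions agree.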

On the other hand, it is possible to construct a discrete version of the cumulative baseline hazard function by considering \(H_0^D(t)=\sum _{\ell : t_{(\ell )}\le t} h_0(t_{(\ell )})\), where \(t_{(1)},\ldots ,t_{(q)}\) are the ordered distinct failure times and q is the number of distinct observed failure times. Replacing \(H_0(\cdot )\) and \(h_0(\cdot )\) in Equation (10), we obtain

$$\begin{aligned} Q_1(({\varvec{\beta }},H_0)\mid {\varvec{\psi }}^{(k)})=\sum _{\ell =1}^q d_{(\ell )} \log \left[ h_0(t_{(\ell )})\right] +\sum _{i=1}^m \sum _{j=1}^{n_i} \delta _{ij} {\textbf {x}}_{ij}^\top {\varvec{\beta }}-\sum _{\ell =1}^q h_0(t_{(\ell )})\sum _{i,j \in R(t_{(\ell )})} \widehat{z}_i^{(k)} \text {e}^{{\textbf {x}}_{ij}^\top {\varvec{\beta }}}. \end{aligned}$$

Here, \(d_{(\ell )}\) denotes the number of failures at \(t_{(\ell )}\) and \(R(t_{(\ell )})\) the set of individuals at risk at \(t_{(\ell )}\). Replacing the solution for \(h_0(t_{(\ell )})\), i.e., \(\widehat{h}_0(t_{(\ell )})=d_{(\ell )}/\left[ \sum _{i,j \in R(t_{(\ell )})}\exp \left( {\textbf {x}}_{ij}^\top {\varvec{\beta }}+\log \widehat{z}^{(k)}_i\right) \right]\), the expression for \(Q_1\) reduces, up to a constant that does not depend on \({\varvec{\beta }}\), to

$$\begin{aligned} Q_1({\varvec{\beta }}\mid {\varvec{\psi }}^{(k)})&=-\sum _{\ell =1}^q d_{(\ell )}\log \left( \sum _{i,j \in R(t_{(\ell )})} \exp \left( {\textbf {x}}_{ij}^\top {\varvec{\beta }}+\log \widehat{z}_i^{(k)}\right) \right) +\sum _{i=1}^m\sum _{j=1}^{n_i} \delta _{ij} {\textbf {x}}_{ij}^\top {\varvec{\beta }}. \end{aligned}$$

Note that \(Q_1(\cdot )\) has the same form as the partial log-likelihood function of the Cox model, except for the offset term \(\log \widehat{z}_i^{(k)}\). Hence, to update \({\varvec{\beta }}\) in the M-step we can use standard Cox regression routines. Finally, the non-parametric estimator for \(H_0(\cdot )\) at the k-th step of the algorithm is given by

$$\begin{aligned} \widehat{H}^{(k)}_0(t)=\sum _{\ell : t_{(\ell )}\le t} \frac{d_{(\ell )}}{\sum _{i,j \in R(t_{(\ell )})}\exp \left( {\textbf {x}}_{ij}^\top \varvec{\beta }^{(k)}+\log \widehat{z}^{(k)}_i\right) }, \quad t>0. \end{aligned}$$
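A minimal sketch of this estimator in Python (the toy data, frailty estimates \(\widehat{z}_i\) and linear predictors \(\exp ({\textbf {x}}_{ij}^\top {\varvec{\beta }})\) below are hypothetical):

```python
from collections import defaultdict

def breslow_H0(times, events, eta, zhat, cluster):
    """Nonparametric cumulative baseline hazard with frailty offsets:
    a jump d_(l) / sum_{(i,j) in R(t_(l))} zhat_i * exp(x_ij' beta)
    at each distinct failure time t_(l)."""
    d = defaultdict(int)                       # failures per distinct time
    for t, e in zip(times, events):
        if e == 1:
            d[t] += 1
    jumps = []
    for tl in sorted(d):                       # risk set: subjects with t >= t_(l)
        risk = sum(zhat[c] * w for t, w, c in zip(times, eta, cluster) if t >= tl)
        jumps.append((tl, d[tl] / risk))
    return lambda t: sum(j for tl, j in jumps if tl <= t)

# hypothetical toy data: 2 clusters of size 2
times  = [1.0, 2.0, 2.0, 3.5]
events = [1, 1, 0, 1]
eta    = [1.2, 0.8, 1.0, 1.5]                  # exp(x_ij' beta), assumed known here
zhat   = {0: 0.9, 1: 1.1}                      # current E-step frailty estimates
clus   = [0, 0, 1, 1]
H0 = breslow_H0(times, events, eta, zhat, clus)
print(H0(2.0), H0(4.0))
```

The resulting step function is non-decreasing and jumps only at observed failure times, as required.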

In summary, the EM algorithm is given by the following steps.

  • E-step: For \(i=1,...,m\), compute \(\widehat{z}_i^{(k+1)}\) and \(\widehat{z_i^2}^{(k+1)}\) using Equation (13), with \({\varvec{\beta }}^{(k)}\), \(H_0^{(k)}(\cdot )\) and \(\nu ^{(k)}\) the estimated parameters at the k-th iteration.

  • M1-step: Update \({\varvec{\beta }}^{(k+1)}\) and \(H_0^{(k+1)}(\cdot )\) by fitting a Cox regression model with offset \(\log \widehat{z}_i^{(k+1)}\) for the non-parametric case, or by maximizing \(Q_1(({\varvec{\beta }},H_0)\mid {\varvec{\psi }}^{(k)})\) for the parametric (WEI) and semi-parametric (PE) cases.

  • M2-step: Update \(\nu ^{(k+1)}\) by maximizing \(Q_2(\nu \mid {\varvec{\psi }}^{(k)})\) in relation to \(\nu\).

Maximization with respect to \(H_0\) refers to optimizing the parameters of \(H_0(\cdot )\): \(\rho\) and \(\lambda\) for the Weibull baseline distribution, or the vector \({\varvec{\lambda }}\) for the piecewise exponential case. This unified formulation keeps the algorithm general across baseline specifications. The algorithm iterates until a convergence criterion is satisfied; for instance, we consider \(||\widehat{{\varvec{\psi }}}^{(k-1)}-\widehat{{\varvec{\psi }}}^{(k)}||<\epsilon\), where \(\epsilon\) is a predefined value and \(||\cdot ||\) denotes the Euclidean norm. Initial values are derived from the ordinary Cox model, taking \(\nu ^{(0)}=0.5\). On the other hand, following the suggestion of20, we estimate the standard errors of \(\widehat{\varvec{\beta }}\) and \(\widehat{\nu }\) via a profile log-likelihood function: \(\ell ({\varvec{\beta }},\nu )=\log L({\varvec{\beta }}, H_0, \nu )\), replacing \(H_0\) with its estimate \(\widehat{H}_0\). The variance-covariance matrix of \((\widehat{\varvec{\beta }},\widehat{\nu })\) is then estimated by the inverse of

$$\begin{aligned} I(\widehat{\varvec{\beta }}, \widehat{\nu })=-\frac{\partial ^2 \ell ({\varvec{\beta }},\nu )}{\partial ({\varvec{\beta }},\nu ) \partial ^\top ({\varvec{\beta }},\nu )}\Bigg |_{{\varvec{\beta }}=\widehat{\varvec{\beta }}, \nu =\widehat{\nu }}. \end{aligned}$$

Finally, more important than \(\widehat{\nu }\) is \(\widehat{\theta }:= \widehat{\gamma }^{-2}-\dfrac{\phi (\widehat{\nu })}{\Phi (\widehat{\nu })}\widehat{\gamma }^{-1}\) (the estimated frailty variance), because it allows us to compare this term with the variance of other models parameterized directly in terms of the frailty variance. By the delta method, the variance of \(\widehat{\theta }\) is estimated as:

$$\begin{aligned} \widehat{Var (\widehat{\theta })} = \widehat{Var (\widehat{\nu })} \left[ \frac{\phi (\widehat{\nu })}{\Phi (\widehat{\nu })}\widehat{\gamma }^{-2} \left( 1- \frac{\phi (\widehat{\nu })}{\Phi (\widehat{\nu })}\widehat{\gamma } \right) - 2\widehat{\gamma }^{-3} \left( 1- \frac{\phi (\widehat{\nu })}{\Phi (\widehat{\nu })}\widehat{\gamma } \right) + \frac{\phi (\widehat{\nu })}{\Phi (\widehat{\nu })} \right] ^2. \end{aligned}$$
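The expression above is the delta method, \(\widehat{Var}(\widehat{\theta }) \approx \widehat{Var}(\widehat{\nu })\,(d\theta /d\nu )^2\), with the bracketed factor playing the role of \(d\theta /d\nu\). The following Python sketch (ours, not part of the paper's software) checks the bracketed derivative numerically; it assumes the mean-one constraint \(\gamma = \nu + \phi (\nu )/\Phi (\nu )\), which is consistent with the inverse-transform sampling step used later in the simulation study:

```python
from statistics import NormalDist

N = NormalDist()  # standard normal pdf/cdf

def mills(nu):
    """phi(nu) / Phi(nu) for the standard normal."""
    return N.pdf(nu) / N.cdf(nu)

def theta_of(nu):
    """Frailty variance theta = gamma^-2 - (phi/Phi) gamma^-1, under the
    assumed mean-one constraint gamma = nu + phi(nu)/Phi(nu)."""
    g = nu + mills(nu)
    return g ** -2 - mills(nu) / g

def dtheta_dnu(nu):
    """Closed form of d theta / d nu: the bracketed factor in the
    delta-method variance above."""
    g, r = nu + mills(nu), mills(nu)
    return r * g ** -2 * (1 - r * g) - 2 * g ** -3 * (1 - r * g) + r
```

A central finite difference of `theta_of` agrees with `dtheta_dnu` to numerical precision, confirming that the bracket is indeed the derivative of \(\theta\) with respect to \(\nu\).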

Remark 1

Note that the result in Equation (11) is also of interest if a Bayesian approach is applied to the model, because it remains valid when conditioning on the parameters. This facilitates, among other things, the application of MCMC-type methods to simulate from the corresponding conditional distribution of the frailties.

Computational aspects

The extrafrail21 package of R22 includes the computational implementation of the TN frailty model, with the Weibull, exponential and PE distributions, as well as the non-parametric specification, available as baseline models. For instance, the Weibull case can be fitted with

frailty.fit(formula, data, dist = "weibull", dist.frail = "TN")

where, as is usual in survival analysis with random effects in R, the formula is defined as

Surv(time, event) ~ covariates + cluster(id)

A similar syntax can be used to fit the other cases, specifying dist = "exponential", dist = "pe" or dist = "np" in the call above. We highlight that the function performs the estimation even when the clusters have different sizes (i.e., \(n_1, n_2,\ldots ,n_m\) are not necessarily equal).

Simulation study

In this Section, we present a simulation study to assess the performance of the maximum likelihood estimators obtained via the EM algorithm under different censoring percentages.

Parameter recovery

We consider the following three different scenarios:

  • Scenario 1: 19 clusters with 2 observations each and 19 clusters with 4 observations each, totalling 114 observations. (\(n_1=\ldots =n_{19}=2, n_{20}=\ldots =n_{38}=4\) and \(m=38\)).

  • Scenario 2: 38 clusters with 2 observations each and 38 clusters with 4 observations each, totalling 228 observations. (\(n_1=\ldots =n_{38}=2, n_{39}=\ldots =n_{76}=4\) and \(m=76\)).

  • Scenario 3: 19 clusters with 4 observations each and 19 clusters with 8 observations each, totalling 228 observations. (\(n_1=\ldots =n_{19}=4, n_{20}=\ldots =n_{38}=8\) and \(m=38\)).

The idea is to assess whether, for a fixed total amount of data, it is preferable to increase the number of clusters or the number of observations per cluster. We consider as baseline model the PE distribution with \(L=3\) and time partition \({\varvec{a}} = (7/365, 56/365)\). Similar to the real data application, we also consider one dichotomous covariate x, drawn from the Bernoulli distribution with success probability 20/76. We consider three values for \(\theta\), the variance of the frailty terms: 0.20, 0.50 and 0.75. The percentage of censoring was fixed at 10%, 25% and 50%. In all cases, the regression coefficient was fixed at \(\beta =1.8\) and the parameters of the PE distribution were fixed at \({\varvec{\lambda }}=(\lambda _1=0.3,\lambda _2=2.6,\lambda _3=1.9)\). To simulate values from the model, we use the following steps:

  i) Draw \(z_i\sim \text{ TN }(\nu )\), \(i=1,\ldots ,m\), using the inverse transform method, i.e., set \(z_i=\Big (\Phi ^{-1} \big (u_i \Phi (\nu ) + \Phi (-\nu ) \big ) + \nu \Big )\gamma ^{-1}\), where \(u_i\sim \text{ U }(0,1)\) (the standard uniform distribution).

  ii) Draw the failure times from the conditional distribution \(y_{ij}\mid z_i\sim \text{ PE }({\varvec{\lambda }} z_i \exp (\textbf{x}_{ij}^{\top }{\varvec{\beta }}), {\varvec{a}})\).

  iii) Define the censoring times, \(c_{ij}\), as the \(100\times (1-q)\)-th quantile of the same conditional \(\text{ PE }({\varvec{\lambda }} z_i \exp (\textbf{x}_{ij}^{\top }{\varvec{\beta }}), {\varvec{a}})\) distribution, where \(q\) denotes the target censoring proportion.

  iv) Define the observed failure times and failure indicators as \(t_{ij}=\min (y_{ij},c_{ij})\) and \(\delta _{ij}=I(y_{ij}\le c_{ij})\), respectively, for \(i=1,\ldots ,m\), \(j=1,\ldots ,n_i\).
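Steps i)-iv) can be sketched in Python as follows. This is an illustrative sketch of ours (function names and the clipped details are our own choices); it uses the mean-one constraint \(\gamma = \nu + \phi (\nu )/\Phi (\nu )\), the Bernoulli(20/76) covariate, and the quantile-based censoring scheme described above:

```python
import math, random
from statistics import NormalDist

N = NormalDist()  # standard normal cdf / inv_cdf

def rtn_frailty(nu, rng):
    """Step i): inverse-transform draw of the mean-one TN frailty."""
    gamma = nu + N.pdf(nu) / N.cdf(nu)   # mean-one constraint (assumption)
    u = rng.random()
    return (N.inv_cdf(u * N.cdf(nu) + N.cdf(-nu)) + nu) / gamma

def pe_quantile(p, lam, a, mult):
    """p-quantile of a piecewise exponential with rates lam*mult and cut
    points a; used for both the failure (ii) and censoring (iii) times."""
    cuts = [0.0, *a, math.inf]
    target, H = -math.log(1.0 - p), 0.0  # target cumulative hazard
    for k, rate in enumerate(r * mult for r in lam):
        width = cuts[k + 1] - cuts[k]
        if H + rate * width >= target:
            return cuts[k] + (target - H) / rate
        H += rate * width

def simulate(sizes, lam, a, beta, nu, q, seed=1):
    """Steps i)-iv): clustered survival data with TN frailty; q is the
    target censoring proportion."""
    rng, data = random.Random(seed), []
    for i, ni in enumerate(sizes):
        z = rtn_frailty(nu, rng)                          # shared within cluster i
        for _ in range(ni):
            x = rng.random() < 20 / 76                    # Bernoulli covariate
            mult = z * math.exp(beta * x)
            y = pe_quantile(rng.random(), lam, a, mult)   # step ii): failure time
            c = pe_quantile(1.0 - q, lam, a, mult)        # step iii): censoring time
            data.append((i, min(y, c), y <= c, int(x)))   # step iv)
    return data
```

Because the censoring time is the \((1-q)\)-quantile of the same conditional distribution as the failure time, each observation is censored with probability exactly \(q\), so the realized censoring fraction concentrates around the target.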

For each scenario and each combination of censoring percentage and \(\theta\), we drew 1,000 samples and computed the ML estimates. For each parameter, Tables 1 and 2 summarize the average bias (bias), the estimated root mean squared error (RMSE), the mean of the estimated standard errors (SE) and the coverage probabilities (CP) of the asymptotic 95% confidence intervals.
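The reported summaries are the usual Monte Carlo metrics; a minimal sketch, assuming per-replicate estimates and standard errors are available (names are ours):

```python
import math

def mc_summary(estimates, ses, true_value, z=1.96):
    """Bias, RMSE, mean SE and nominal 95% coverage probability computed
    from Monte Carlo replicates of an estimator and its standard error."""
    n = len(estimates)
    bias = sum(e - true_value for e in estimates) / n
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    mean_se = sum(ses) / n
    cp = sum(1 for e, s in zip(estimates, ses)
             if e - z * s <= true_value <= e + z * s) / n
    return bias, rmse, mean_se, cp
```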

Table 1 Estimated bias, RMSE, SE and approximate 95% coverage probabilities for the TN frailty model with baseline distribution PE under different scenarios (censoring 10% and 25%).
Table 2 Estimated bias, RMSE, SE and approximate 95% coverage probabilities for the TN frailty model with baseline distribution PE under different scenarios (censoring 50%).

An increase in the sample size improves the precision and accuracy of the estimates. In particular, scenarios 2 and 3, which have larger sample sizes, exhibit better performance than scenario 1. In general, an increase in heterogeneity (\(\theta\)) and in the censoring percentage tends to raise the bias, standard error, and RMSE, while reducing coverage probability (CP). However, the behavior of the estimator for \(\theta\) improves under higher censoring, showing reduced bias and increased coverage, possibly due to a better identification of the random effect in the presence of censored events. The most affected estimator is \(\lambda _3\), since censored information tends to concentrate within its interval. When comparing Scenarios 2 and 3, the former yields better results. This suggests that for a fixed total sample size, increasing the number of clusters is preferable to increasing the number of observations per cluster. This leads to greater diversity in latent effects, which enhances the estimation of frailty terms.

Applications with real data sets

In this Section, we present two applications to illustrate the performance of the TN frailty model in comparison with traditional models. The first application is related to patients with Chronic Kidney Disease (CKD), while the second application is related to patients with fibrosarcoma.

Kidney data set

CKD is the slow and progressive loss of kidney function over time. The main job of the kidneys is to remove waste and excess water from the body. The disease may remain asymptomatic until the kidneys have almost stopped working, which is why it is usually diagnosed in its final stages. The final stage of CKD is called End-Stage Renal Disease (ESRD). At this stage, the kidneys can no longer sufficiently remove waste and excess fluid from the body, requiring the patient to undergo dialysis (a life-sustaining treatment) or a kidney transplant (US National Library of Medicine). Dialysis has two main modalities: hemodialysis and peritoneal dialysis. Hemodialysis consists of extracting blood from the body and directing it to a machine that eliminates waste and excess fluid; after filtration, the blood is reintroduced into the bloodstream. Peritoneal dialysis, for its part, is a simpler process and can be done on an outpatient basis. A liquid solution is inserted into the peritoneal cavity through a catheter placed in the abdomen; the solution absorbs waste and excess fluid and is later removed through the same catheter.

CKD represents one of the most important non-communicable diseases worldwide23. For many patients, dialysis is the focal point around which their lives revolve, not only because of the time spent travelling to and from the sessions in specialized centres and the time dedicated to the dialysis treatment itself, but also due to the accompanying diet, fluid restrictions and medication load24. Thus, one of the most advantageous options in terms of quality of life is treatment by ambulatory peritoneal dialysis (with a portable machine). The peritoneal catheter is a foreign body that facilitates the appearance of infections and serves as a reservoir for bacteria. Infection can appear in the exit orifice, the tunnel (tunnelled path of the catheter) or the peritoneum (peritonitis). Peritonitis continues to be an important complication of peritoneal dialysis, as it contributes to technique failure, hospitalization, and even death25.

We focus on a real dataset named kidney, available in the R22 package frailtyHL26. For further details, see page 11 of its documentation: https://cran.r-project.org/web/packages/frailtyHL/frailtyHL.pdf. The study collected bivariate times, consisting of the times to first and second recurrence of infection at the catheter insertion point in patients with kidney problems using a portable dialysis machine. The catheter is removed if infection occurs; it can also be removed for other reasons, in which case the observation is censored. Available covariates are sex and type of kidney disease: glomerulonephritis (GN), acute nephritis (AN), polycystic kidney disease (PKD) and others. Previous analysis suggests that only sex is significant in this context27. The study has 38 patients, 10 men and 28 women, each with 2 recurrence times, for a total of 76 observations. Table 3 summarizes these times, and Figure 3 presents the Kaplan-Meier (KM) estimator stratified by recurrence time and by sex.

Fig. 3
figure 3

KM estimator for the kidney dataset considering recurrence time (left panel) and sex (right panel).

Table 3 Summary of the first and second time of recurrence (TR\(_1\) and TR\(_2\)).

For comparison purposes, we also consider the GA, WL and IG frailty models with the WEI and PE baseline distributions. Figure 4 shows the cumulative hazard function for the kidney data. The proposed partition for the PE model was set at 1 and 8 weeks (indicated by the vertical segments in the graph). A change in the slope behavior is evident, as highlighted in the zoomed-in view on the right. This supports the conclusion that the PE model provides a better fit than a non-segmented model for this dataset. Practically, this suggests that the risk of infection at the catheter insertion site is highest during the first week post-insertion and gradually decreases over time. After two months, the risk stabilizes and remains relatively low. This understanding can help healthcare professionals in identifying critical time periods for infection prevention and monitoring patients accordingly.

Table 4 shows the Akaike information criterion (AIC)28 and the Bayesian information criterion (BIC)29 for these models. Both criteria suggest that the PE baseline is more appropriate for these data than the WEI baseline, regardless of the frailty distribution used. Moreover, the TN frailty model provides the best results. Table 5 presents the estimates for all the models considering the PE baseline distribution, including the ordinary PE model (i.e., without frailty).
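For reference, both criteria penalize the maximized log-likelihood by model complexity; a minimal sketch of their computation, with k the number of parameters and n the number of observations:

```python
import math

def aic(loglik, k):
    """Akaike information criterion: smaller is better."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: the log(n) penalty per parameter
    exceeds AIC's constant 2 once n >= 8, favoring smaller models."""
    return -2.0 * loglik + k * math.log(n)
```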

Fig. 4
figure 4

Cumulative hazard function for the kidney dataset considering the full time axis (left panel) and a zoom on the first 100 days (right panel).

Table 4 Maximized log-likelihood function (log-Like), AIC and BIC of the TN, GA, WL and IG models for kidney dataset.
Table 5 Estimates, standard errors (in parenthesis) and Kendall’s \(\tau\) for the TN, GA, WL and IG frailty model with baseline PE and the ordinary PE model for kidney dataset.

Note that ignoring the dependence within clusters leads to underestimation of the effect of sex. On the other hand, the estimated Kendall's \(\tau\) is around 0.13 for all the models. However, the frailty variance estimated by the GA, WL and IG models is at least 50% larger than that of the TN frailty model. In practical terms, this means that the TN frailty model estimates a greater effect of sex on the recurrence of infection at the catheter insertion point, and less variability between measurements from the same individual.

Fibrosarcoma data set

Fibrosarcoma is a rare malignant tumor that originates from fibroblasts, the connective tissue cells responsible for the production of collagen and extracellular matrix. This neoplasm exhibits infiltrative growth, a high propensity for local recurrence, and metastatic potential. It can develop in any part of the body, although it is most commonly found in the extremities, trunk, and retroperitoneal region. Clinically, it typically presents as a progressively enlarging mass, initially painless. Diagnosis is based on histopathological findings, where tumor cells are arranged in a characteristic herringbone pattern, and is often supported by immunohistochemical studies to differentiate it from other soft tissue tumors30. The treatment of choice is surgical excision with wide margins, and adjuvant radiotherapy is frequently considered; chemotherapy is generally reserved for advanced or metastatic cases31.

This dataset includes information from 251 patients diagnosed with fibrosarcoma SOE (from the Portuguese “sem outra especificação”, meaning “not otherwise specified”), with diagnosis dates ranging from 2000 to 2022 and follow-up data extending through December 2022. The dataset was obtained from the Oncocenter Foundation of São Paulo, Brazil (Fundação Oncocentro de São Paulo, FOSP), which oversees the Hospital Cancer Registry of the State of São Paulo (http://fosp.saude.sp.gov.br). This neoplasm is coded as 8810/3 Fibrosarcoma, NOS (not otherwise specified), according to the International Classification of Diseases for Oncology (ICD-O32), which is used in cancer registries to classify tumors that lack further histological subtyping at the time of diagnosis.

Cancer-specific death was defined as the event of interest, and time-to-event was measured from the date of diagnosis to the patient’s death (in years: mean\(=5.72\), standard deviation (SD)\(=5.78\), median\(=3.12\), range\(= 0.025-21.86\)). During the follow-up period, a total of 103 events (39%) occurred. As a covariate, we use the type of treatment, with eight possible labels: A - surgery (84 patients, 32.2%), B - Radiotherapy (14 patients, 5.4%), C - Chemotherapy (18 patients, 6.9%), D - Surgery \(+\) Radiotherapy (42 patients, 16.1%), E - Surgery \(+\) Chemotherapy (34 patients, 13.0%), F - Radiotherapy \(+\) Chemotherapy (11 patients, 4.2%), G - Surgery \(+\) Radiotherapy \(+\) Chemotherapy (29 patients, 11.1%) and I - other combination (29 patients, 11.1%). Figure 5 presents the KM estimator for the full cohort and stratified by type of treatment. The clusters considered in this analysis correspond to the 26 clinical areas responsible for treating the patients, which are summarized in Table 6. Note that these clusters are highly unbalanced in terms of sample size. In this analysis, we consider the TN, GA, WL, and IG frailty models, using the Weibull distribution for the baseline hazard. The results are summarized in Table 7. Notably, the TN frailty model provides the lowest AIC among the models considered. Once again, the Kendall’s \(\tau\) values provided by the models are similar. However, the estimated intra-cluster variance (0.226) is lower for the TN model than for the others. Finally, Figure 6 shows the survival functions (SF) for patients treated in neurology and clinical oncology centers, as well as the marginal SF (i.e., the SF for a patient randomly selected from the entire cohort).

Table 6 Number of records per medical specialty (cluster size in parenthesis).
Fig. 5
figure 5

Kaplan-Meier estimator for the fibrosarcoma data with 95% confidence interval (left panel) and stratified by treatment received (right panel).

Table 7 Parameter estimates, standard errors (in parentheses), and Kendall’s \(\tau\) for TN, GA, WL, and IG frailty models assuming a Weibull baseline hazard.
Fig. 6
figure 6

SF for patients treated with chemotherapy (left panel) and surgery + radiotherapy (right panel), for the neurology and clinical oncology specialties, together with the marginal SF.

Concluding remarks

A new survival model with TN frailty was proposed and studied in detail. The model accommodates a rich dependence structure, since it handles both univariate and multivariate data and adapts to clusters of different sizes. For the baseline risk, the Weibull and PE distributions were adopted, as well as a non-parametric approach. For a fixed frailty variance, the TN frailty model provides a greater Kendall's \(\tau\) than the gamma and IG frailty models. We obtained a recursive closed-form expression for the derivatives of the Laplace transform of the TN model. Furthermore, the conditional distributions of the frailty among survivors and of the frailty of individuals dying at time t were determined explicitly. The simulation studies, based on the EM algorithm, show that having more complete (uncensored) information improves the accuracy and precision of the estimates. Scenarios 2 and 3 did not differ greatly in bias, which suggests that the bias depends on the total sample size rather than on the data configuration. Concerning the RMSE and SE, however, Scenario 2 improved on Scenario 3, suggesting that, for a fixed total sample size, a larger number of smaller clusters yields better precision than fewer, larger clusters. We fitted the proposed frailty model to a real dataset on the times to first and second recurrence of infection at the catheter insertion point in patients with kidney problems using a portable dialysis machine, showing the potential of the new frailty model. This application demonstrates the practical relevance of the new regression model. In particular, the GA, WL and IG models overestimate the frailty variance relative to the TN frailty model.