Abstract
Among diseases, cancer exhibits the fastest global spread, presenting a substantial challenge for patients, their families, and the communities they belong to. This paper is devoted to modeling such a disease as a special case. A new distribution called the binomial-discrete Erlang-truncated exponential (BDETE) is introduced. The BDETE is a binomial mixture in which the number of trials (parameter \(n\)) follows a discrete Erlang-truncated exponential distribution. A comprehensive mathematical treatment of the proposed distribution is provided, including expressions for its probability mass function, cumulative distribution function, survival function, failure rate function, quantile function, moment generating function, Shannon entropy, order statistics, and stress-strength reliability. The distribution's parameters are estimated using the maximum likelihood method. Two real-world lifetime count data sets from the cancer disease, both of which are right-skewed and over-dispersed, are fitted using the proposed BDETE distribution to evaluate its efficacy and viability. We expect the findings to become standard works in probability theory and its related fields.
Introduction
Cancer is the disease that spreads the most quickly around the world. It is a big problem for patients, their families, and their communities. If this sickness is caught early, it can be treated. Because of this, modelling of disease has become an important tool in the areas of public health research and disease epidemiology in the past few years.
A mixed distribution in statistics is the mixing of two or more probability distributions. It may be used to represent a statistical population with subpopulations, where the weights are the percentage of each subpopulation in the overall population and the mixture probability density components are the subpopulation densities. The probability distributions of the subpopulations may be univariate or multivariate and discrete or continuous. Also, the mixture distribution can come from different distribution families or the same distribution families with different parameters. Certain data sets may be suitable for a mixed distribution because discrete subgroups of the whole data set have unique characteristics that are best described independently.
In recent years, the challenge of creating a mixing distribution from the binomial distribution has gained a lot of attention. Breslow and Day1 extensively utilized the negative binomial distribution in their cancer research statistics. Roy et al.2 investigated the Poisson mixture of the binomial distribution. Wood3 used a cumulative distribution function to derive the mixture of the binomial distribution. Binomial mixtures of the Poisson, normal, chi-squared, F, t, beta, gamma, exponential, rectangular, and Erlang distributions were developed by Roy et al.4. Zhu et al.5 used a beta-binomial-Poisson mixture distribution to model the number of successes and the number of binary trials at the same time. Shkedy et al.6 created the hierarchical Binomial-Poisson model, assuming that the number of responses is a Poisson random variable, for the analysis of a crossover design for correlated binary data when the number of trials depends on the dose. To predict how many credits first-year students at the University of Florence's School of Economics will obtain, Grilli et al.7 used a binomial finite mixture model. Knape et al.8 tested the sensitivity of binomial N-mixture models to over-dispersion in abundance and detection using simulations and a case study. El-Alosey9 proposed the binomial-exponential mixture by deriving the probability mass function of discrete mixtures of distributions using the probability-generating function approach. The binomial mixture of the Erlang distribution was created by Abed Al-Kadim and Al-Hussani10 using the method of moments and the Laplace transform.
Very recently, Triple Binomials, defined as a multiplicative mixing of the same three distributions, were developed by Adnan and Kiser11. Eledum and El-Alosey12 derived the binomial-geometric mixtures by using the probability-generating function technique.
Mixture distributions can be applied in cancer disease to identify different subtypes and stages of the disease based on the expression of biomarkers. This approach can lead to better diagnosis, prognosis, and treatment of cancer patients. Prabakaran et al.13 developed the Gaussian mixture model (GMM)-based classifier to improve molecular stratification of patients with breast cancer. Gaussian mixture models are often used for clustering and classification tasks in epidemiology. Their application in genotyping and disease subtyping has been explored in numerous studies, as highlighted by McLachlan et al.14. Held et al.15 applied the Beta distribution to infectious disease data analysis. Noor et al.16 used a novel four-component mixture model under Bayesian estimation to estimate the average number of incidences and deaths of both genders in different age groups, considering 28 different kinds of cancer diagnosed in recent years. In this paper, the proposed mixture distribution is fitted to two cancer data sets, and the results show that it is well suited to model these data sets. In other words, we devote this paper to modeling cancer data using a new mixture distribution called the binomial-discrete Erlang-truncated exponential distribution (BDETE). This mixture distribution is a combination of the binomial distribution with the discrete Erlang-truncated exponential distribution. We use the probability-generating function of mixtures to find the pmf of the BDETE distribution. We examine some statistical properties of the proposed distribution and use maximum likelihood estimation (MLE) to estimate its parameters.
The proposed BDETE distribution with three parameters is interesting because it has an increasing hazard rate function and a decreasing probability mass function. The novel lifetime mixture distribution is useful because it can model a real lifetime count data set of cancer disease that is skewed to the right and over-dispersed.
Binomial and discrete Erlang-truncated exponential distributions
The probability mass function (pmf) and the associated probability-generating function (pgf) of a binomial random variable \(X\) with parameters \(n\) and \(p\) are given as
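In standard notation, and writing \(d\) for the argument of the pgf as in \({P}_{D}(d;n,p)\) used later in this section, the binomial pmf and pgf take the following well-known form. This is offered as a reference statement of textbook results, not as a transcription of the paper's numbered equation.

\[
P(X=x)=\binom{n}{x}p^{x}(1-p)^{n-x},\quad x=0,1,\dots ,n,
\qquad
{P}_{D}(d;n,p)=E\left[d^{X}\right]=(1-p+pd)^{n}.
\]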
The pmf of a discrete Erlang-truncated exponential (DETE) random variable \(N\) with parameters \(\beta\) and \(\omega\) is given as17
where \(n\) is the number of failures before the first success. The DETE distribution's mean and variance are stated as
Mixing binomial and other distributions with a probability-generating function method
If we assume that the parameter \(n\) in the binomial distribution in Eq. (1) is a random variable with pmf \({f}_{N}(n,\omega ,p)\), then we can use the probability generating function approach to get the binomial mixed distribution as12
where \({P}_{D}(d;n,p)\) is pgf for the binomial distribution, while \(p\), \(\omega\) and \(\beta\) are the parameters of the mixture distribution.
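The mixing step can be illustrated numerically. The Python sketch below sums the binomial pmf over the values of \(n\), weighted by a mixing pmf, which is equivalent to reading off the coefficient of \(d^{x}\) in the compound pgf. The function names, the truncation point `n_max`, and the geometric mixing pmf are assumptions made for illustration; the geometric law is only a stand-in for the DETE pmf, which is not restated here.

```python
from math import comb

def binomial_mixture_pmf(x, p, mixing_pmf, n_max=500):
    """P(X = x) = sum_{n >= x} f_N(n) * C(n, x) * p^x * (1 - p)^(n - x).

    The infinite sum over n is truncated at n_max (an illustrative choice).
    """
    return sum(
        mixing_pmf(n) * comb(n, x) * p**x * (1.0 - p) ** (n - x)
        for n in range(x, n_max + 1)
    )

def geometric_pmf(theta):
    """Stand-in mixing pmf on {0, 1, 2, ...}: f_N(n) = theta * (1 - theta)^n."""
    return lambda n: theta * (1.0 - theta) ** n

# Illustrative parameter values, not taken from the paper.
p, theta = 0.4, 0.3
pmf = [binomial_mixture_pmf(x, p, geometric_pmf(theta)) for x in range(6)]

# For a geometric mixing law the mixture is again geometric.
theta_mix = theta / (theta + p * (1.0 - theta))
closed_form = [theta_mix * (1.0 - theta_mix) ** x for x in range(6)]
```

For a geometric mixing law the mixture is again geometric with parameter \(\theta /(\theta +p(1-\theta ))\), which gives a convenient check on the truncated sum.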
This paper's remaining sections are organized as follows: The proposed distribution BDETE is presented in “Binomial-discrete Erlang-truncated exponential distribution” section, and “Distributional properties” section demonstrates its statistical features, including the quantile function, the moment-generating function, the Shannon entropy, the order statistics, and the stress-strength parameter. The maximum likelihood technique is described in “Maximum likelihood estimation” section for estimating BDETE mixing parameters. In “Application” section, two real data sets are used to illustrate the performance of the BDETE distribution. Finally, some final thoughts are offered in “Conclusion remarks” section.
Binomial-discrete Erlang-truncated exponential distribution
This section evaluates and discusses the mathematical formulae for the pmf and cdf of the proposed Binomial-discrete Erlang-truncated exponential mixture (BDETE). Here we also derive the hazard and survival functions for the BDETE distribution.
Probability mass and cumulative distribution functions for the BDETE
If we assume that the parameter \(n\) in the binomial distribution in Eq. (1) follows a discrete Erlang-truncated exponential distribution in Eq. (4), we can use the probability generating function method in Eq. (2) to get the pmf of the proposed BDETE distribution as
Thus, the pmf of BDETE is the coefficient of \({d}^{x}\) in the pgf as
with the corresponding cdf as:
where \(x\in \left\{0,1,2,\dots \right\}\), \(0\le p\le 1\), \(0\le \omega \le 1\), and \(\beta >0\).
The binomial-geometric distribution can be obtained from Eq. (5) by taking \(\beta =1\) and \(\omega =1-\theta\) as follows12
The pmf of the BDETE distribution for varying values of the distribution's parameters is shown in Figs. 1, 2, and 3, while the cdf is presented in Figs. 4, 5 and 6.
The proposed BDETE distribution is right-skewed, and its pmf is a declining function, as shown in Figs. 1 through 3.
Survival and hazard rate functions
The survival function of X is:
The hazard function is as follows:
The hazard function of the BDETE is shown in Table 1 and Fig. 7 for a given set of \(p, \omega\) and \(\beta\) values.
Based on Table 1 and Fig. 7, we observe that the hazard function decreases as both \(p\) and \(\omega\) increase. On the other hand, as \(\beta\) increases, the hazard function increases.
Distributional properties
In this section, we develop some statistical properties of the BDETE distribution, such as the quantile function, the moment generating function, and some other related measures. We also define some other techniques, like the Shannon entropy and the order statistics.
Quantile function
By inverting the cdf in Eq. (6), the quantile of order \(0<r<1\) could be derived as follows
Then \({\mathrm{F}}_{\mathrm{X}}^{-1}\left(\mathrm{r}\right)=\mathrm{min}\{\mathrm{x}\in \mathrm{R}:{\mathrm{F}}_{\mathrm{X}}(\mathrm{x})\ge \mathrm{r}\}\)
Thus, the \({r}^{th}\) quantile is
The BDETE distribution's median can be computed by substituting \(r=\frac{1}{2}\) in Eq. (8) as follows:
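The generalized-inverse definition of the quantile can also be evaluated directly when a closed form is inconvenient. The Python sketch below accumulates a pmf on the non-negative integers until the cdf first reaches \(r\). The function names and the cap `x_max` are hypothetical, and the geometric pmf is only a stand-in for the BDETE pmf of Eq. (5).

```python
def discrete_quantile(r, pmf, x_max=100_000):
    """Smallest non-negative integer x with F(x) >= r, F being the running sum of pmf."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    cumulative = 0.0
    for x in range(x_max + 1):
        cumulative += pmf(x)
        if cumulative >= r:
            return x
    raise RuntimeError("x_max reached before the cdf attained r")

def stand_in_pmf(x, theta=0.2):
    """Illustrative pmf theta * (1 - theta)^x; replace with the BDETE pmf in practice."""
    return theta * (1.0 - theta) ** x

median = discrete_quantile(0.5, stand_in_pmf)
```

With the stand-in pmf and \(\theta =0.2\), the cdf is \(1-0.8^{x+1}\), so the median returned is 3.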
The moment-generating function
The moment-generating function of a random variable \(X\) with a BDETE and parameters \((p,\omega ,\beta )\) is deduced as
The mean (first moment) of the BDETE distribution can be calculated using Eq. (9) as follows:
The 2nd moment about the origin is
As a result, the BDETE distribution's variance is given by
It is obvious from Eqs. (10) and (11) that
This demonstrates that the BDETE distribution is always over-dispersed (the variance is larger than the mean), making it appropriate for usage with such data.
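A short consistency check on the over-dispersion claim follows from the laws of total expectation and total variance. It assumes only that \(X\mid N=n\) is binomial with parameters \(n\) and \(p\); it is a generic identity for binomial mixtures and is not a restatement of Eqs. (10) and (11).

\[
E(X)=E\left[E(X\mid N)\right]=p\,E(N),
\qquad
\mathrm{Var}(X)=E\left[\mathrm{Var}(X\mid N)\right]+\mathrm{Var}\left[E(X\mid N)\right]=p(1-p)\,E(N)+{p}^{2}\,\mathrm{Var}(N),
\]

\[
\mathrm{Var}(X)-E(X)={p}^{2}\left[\mathrm{Var}(N)-E(N)\right].
\]

Under this identity, for \(p>0\) the mixture is over-dispersed exactly when the mixing variable \(N\) is over-dispersed.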
The 3rd moment about the origin is
The 4th moment about the origin is
The coefficient of variation (C.V.), the coefficient of skewness (\(\sqrt{{\beta }_{1}}\)), the coefficient of kurtosis (\({\beta }_{2}\)), and the index of dispersion (\(\gamma\)) of the BDETE distribution are
Table 2 shows the mean, variance, and skewness of the BDETE distribution for various combinations of \(p, \omega\) and \(\beta\).
The results in Table 2 show that when both \(p\) and \(\omega\) increase, so do the proposed distribution's mean and variance. Conversely, when \(\beta\) rises, the mean and variance fall. The coefficient of skewness behaves in the opposite way: it decreases when both \(p\) and \(\omega\) increase, and it increases when \(\beta\) rises. Table 2 also demonstrates that the proposed BDETE distribution has over-dispersion and positive skewness.
Shannon entropy
The Shannon entropy is one of many entropy and information indices that have been made and used in a wide range of fields and situations. This measure is defined as
The Shannon entropy of a random variable \(X\) with a BDETE distribution pmf of Eq. (5) is
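For a count distribution the Shannon entropy is the sum \(H(X)=-\sum_{x}f(x)\log f(x)\), which can be approximated by truncating the sum. The Python sketch below does this with natural logarithms; the base of the logarithm, the function name, the cap `x_max`, and the geometric stand-in pmf are all assumptions for illustration.

```python
from math import log

def shannon_entropy(pmf, x_max=10_000):
    """H(X) = -sum_x f(x) * ln f(x) over x = 0..x_max; zero-probability terms add 0."""
    total = 0.0
    for x in range(x_max + 1):
        fx = pmf(x)
        if fx > 0.0:
            total -= fx * log(fx)
    return total

theta = 0.3  # illustrative value, not from the paper

def stand_in_pmf(x):
    """Illustrative pmf theta * (1 - theta)^x; replace with the BDETE pmf in practice."""
    return theta * (1.0 - theta) ** x

entropy = shannon_entropy(stand_in_pmf)

# Closed form for the geometric stand-in, used only as a check.
closed_form = (-(1.0 - theta) * log(1.0 - theta) - theta * log(theta)) / theta
```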
Order statistics
In the field of non-parametric statistics and inference, order statistics are the most significant and fundamental tools. They employ a variety of approaches to address estimation and hypothesis testing issues. Therefore, the purpose of this subsection is to develop some order statistics for the BDETE distribution, including the maximum, minimum, and median order statistics.
Suppose \({f}_{k}(x;p, \omega ,\beta )\) and \({F}_{k}(x;p, \omega ,\beta )\) are the pmf and cdf of the kth order statistic of a random sample \({X}_{1},{X}_{2},\dots ,{X}_{n}\) of size \(n\) taken from the BDETE distribution.
The kth order statistic's pmf is
The kth order statistic's cdf is
Let \({X}_{(1)}=min({X}_{1},{X}_{2},\dots ,{X}_{n})\), \({X}_{(n)}=max({X}_{1},{X}_{2},\dots ,{X}_{n})\), and \({X}_{(m+1)}\) with \(\mathrm{m}=\frac{\mathrm{n}}{2}\) be the minimum, maximum and median order statistics, respectively. Therefore, the pmfs of the minimum, maximum, and median are
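The order-statistic quantities can be computed from any parent cdf on the integers using the standard binomial-sum representation \(P({X}_{(k)}\le x)=\sum_{j=k}^{n}\binom{n}{j}F{(x)}^{j}{[1-F(x)]}^{n-j}\), with the pmf obtained as a difference of consecutive cdf values. The Python sketch below is a minimal illustration; the function names and the geometric parent cdf are assumptions, with the geometric standing in for the BDETE cdf of Eq. (6).

```python
from math import comb

def order_stat_cdf(x, k, n, cdf):
    """P(X_(k) <= x) = sum_{j=k}^{n} C(n, j) * F(x)^j * (1 - F(x))^(n - j)."""
    if x < 0:
        return 0.0
    F = cdf(x)
    return sum(comb(n, j) * F**j * (1.0 - F) ** (n - j) for j in range(k, n + 1))

def order_stat_pmf(x, k, n, cdf):
    """P(X_(k) = x) = F_k(x) - F_k(x - 1) on the integer support."""
    return order_stat_cdf(x, k, n, cdf) - order_stat_cdf(x - 1, k, n, cdf)

def stand_in_cdf(x, theta=0.25):
    """Illustrative parent cdf 1 - (1 - theta)^(x + 1); replace with the BDETE cdf."""
    return 1.0 - (1.0 - theta) ** (x + 1)

n = 5  # illustrative sample size
pmf_min = [order_stat_pmf(x, 1, n, stand_in_cdf) for x in range(4)]
pmf_max = [order_stat_pmf(x, n, n, stand_in_cdf) for x in range(4)]
```

Setting \(k=1\) and \(k=n\) recovers the familiar forms \(1-{[1-F(x)]}^{n}\) and \(F{(x)}^{n}\) for the cdfs of the minimum and maximum.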
Estimation of Stress-strength for BDETE distribution
In this part, we look at how to estimate the stress-strength parameter when both the strength and the stress are random variables with the BDETE distribution.
The discrete version of a stress-strength parameter is specified as
where \({\mathrm{f}}_{\mathrm{X}}(\mathrm{x})\) is the pmf of \(X\) and \({\mathrm{F}}_{\mathrm{Y}}(\mathrm{x})\) is the cdf of \(Y\), with \(X\) and \(Y\) independent discrete random variables.
Suppose X and Y are two independent random variables having the BDETE distribution with parameters BDETE(\({\mathrm{p}}_{1}, {\upomega }_{1},{\upbeta }_{1}\)) and BDETE(\({\mathrm{p}}_{2}, {\upomega }_{2},{\upbeta }_{2}\)) respectively. The stress-strength parameter for the BDETE distribution is given as
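A numerical version of the stress-strength sum can serve as a check on any closed-form expression. The Python sketch below assumes the convention \(R=\sum_{x}{f}_{X}(x){F}_{Y}(x)=P(Y\le X)\); some authors use \({F}_{Y}(x-1)\) instead, which gives \(P(Y<X)\), so the convention should be matched to the definition in use. The function name, the truncation point `x_max`, and the two geometric laws standing in for BDETE(\({p}_{1},{\omega }_{1},{\beta }_{1}\)) and BDETE(\({p}_{2},{\omega }_{2},{\beta }_{2}\)) are illustrative assumptions.

```python
def stress_strength(pmf_x, cdf_y, x_max=5_000):
    """R = sum_x f_X(x) * F_Y(x) = P(Y <= X) for independent counts, truncated at x_max."""
    return sum(pmf_x(x) * cdf_y(x) for x in range(x_max + 1))

# Illustrative geometric stand-ins: X with parameter a, Y with parameter b.
a, b = 0.2, 0.5

def pmf_x(x):
    return a * (1.0 - a) ** x

def cdf_y(x):
    return 1.0 - (1.0 - b) ** (x + 1)

R = stress_strength(pmf_x, cdf_y)

# Closed form for the two-geometric stand-in, used only as a check.
closed_form = 1.0 - a * (1.0 - b) / (1.0 - (1.0 - a) * (1.0 - b))
```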
Maximum likelihood estimation
The goal of this section is to find the maximum likelihood estimate (MLE) for the BDETE distribution parameters.
Let \({X}_{1},{X}_{2},\dots ,{X}_{n}\) be a random sample of size \(n\) having the BDETE distribution. The log-likelihood is
Further differentiating the log-likelihood in Eq. (12) partially with respect to \(p\), \(\omega\) and \(\beta\), we get the likelihood equations as
The solutions of the likelihood Eqs. (13), (14), and (15) provide the MLEs of \(p\), \(\omega\) and \(\beta\). Because the MLE of the vector of unknown parameters \(\tau ={( p , \omega ,\upbeta )}^{T}\) cannot be derived in closed form, these equations must be solved by numerical methods.
The second partial derivatives are given below
Lawless18 defined the asymptotic distribution of the MLE \(\widehat{\tau }\) as
where \({I}^{-1}\left(\tau \right)\) is the inverse of Fisher’s information matrix of the unknown parameters \(\tau ={( p ,\omega ,\beta )}^{T}\) as follows:
On the other hand, Fisher’s information matrix can be computed by using the approximation
where \(\widehat{p}\), \(\widehat{\omega }\) and \(\widehat{\beta }\) are the MLEs of \(p\), \(\omega\) and \(\beta\) respectively.
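The approximation of Fisher's information by the observed information at the MLE can be sketched numerically with finite differences. The Python example below is a one-parameter illustration only: it uses a geometric pmf as a stand-in for the BDETE pmf, made-up counts rather than the data of Tables 3 or 4, and hypothetical function names. For the three-parameter BDETE the same idea would require the full \(3\times 3\) matrix of second partial derivatives.

```python
from math import log, sqrt

def neg_log_lik(theta, data):
    """Negative log-likelihood of the stand-in pmf f(x) = theta * (1 - theta)^x."""
    return -sum(log(theta) + x * log(1.0 - theta) for x in data)

def observed_information(theta_hat, data, h=1e-4):
    """Central-difference second derivative of the negative log-likelihood at theta_hat."""
    def f(t):
        return neg_log_lik(t, data)
    return (f(theta_hat + h) - 2.0 * f(theta_hat) + f(theta_hat - h)) / h**2

# Illustrative counts, not taken from the paper's data sets.
data = [0, 3, 1, 7, 2, 0, 5, 4, 1, 2]

# Closed-form MLE for the geometric stand-in: theta_hat = 1 / (1 + sample mean).
theta_hat = 1.0 / (1.0 + sum(data) / len(data))

info = observed_information(theta_hat, data)
std_err = sqrt(1.0 / info)  # asymptotic standard error of theta_hat

# Exact observed information for the stand-in, used only as a check.
exact_info = len(data) / theta_hat**2 + sum(data) / (1.0 - theta_hat) ** 2
```

The inverse of the observed information gives the approximate variance of the estimator, from which asymptotic confidence intervals can be formed.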
Application
In this section, we examine two data sets to illustrate the use of the proposed BDETE distribution. The BDETE distribution is compared with several related distributions, namely the binomial geometric (BG)12, negative binomial-discrete Erlang-truncated exponential (NBDETE)19, discrete Erlang-truncated exponential (DETE)17, discrete extended Erlang-truncated exponential (DEETE)20, and the discrete Kumaraswamy Erlang-truncated exponential distribution (DKw_ETE)21, to evaluate its performance and check its goodness of fit. Both the chi-square statistic and the negative log-likelihood (−log(L)) are used as evaluation tools. Two right-skewed, over-dispersed real lifetime count data sets from the cancer disease are fitted with the BDETE distribution.
The first data set in Table 3, provided by Klein and Moeschberger22, describes the death times, expressed in weeks, of 30 tongue cancer patients. This data set was used by Eledum and El-Alosey12 to study the binomial geometric distribution. Its mean, variance, and skewness are 50.03, 1945.84, and 0.972, respectively. The second data set in Table 4, released by Lawless18, gives the lengths of remission in weeks for a group of 30 leukemia patients taking a specific kind of medicine. This data set was used by Eledum and El-Alosey12 to assess the binomial geometric distribution. The results for the two data sets are given in Tables 5 and 6.
The results in Table 5 show that the proposed BDETE distribution has the smallest −logL value (157.487) among the compared distributions (the smaller, the better). This value, together with the \({\chi }^{2}\) statistic (23.12) and its associated p-value (0.5960), indicates that the proposed BDETE distribution is the best model for the tongue cancer patients' data set. That said, all the studied distributions fit this data set well.
Table 6 shows that, among the compared distributions, the proposed BDETE distribution has the smallest −logL value (127.24). This result, combined with the \({\chi }^{2}\) statistic (33.01) and the corresponding p-value (0.1038), indicates that the proposed BDETE distribution is the most appropriate model for the leukemia patients' data set. Again, all the distributions considered fit the data well.
Conclusion remarks
This paper developed a novel binomial mixture called the binomial-discrete Erlang-truncated exponential distribution (BDETE), which was created by combining the binomial with the discrete Erlang-truncated exponential distribution using the probability generating function method. We examined some of the BDETE statistical properties and used the maximum likelihood method to estimate its parameters. The new compounding distribution has an increasing hazard rate function depending on the behavior of the distribution's parameters. Two real-world lifetime count data sets from the cancer disease, both of which are right-skewed and over-dispersed, were fitted using the proposed BDETE distribution to evaluate its efficacy and viability. The application showed that the proposed distribution is the best-fitting of the compared models for real lifetime count data from cancer diseases that are right-skewed, over-dispersed, and have a decreasing probability mass function. Based on the merits of an increasing failure rate and a decreasing probability mass function, we recommend the proposed BDETE distribution for modeling lifetime count data from the medical field, especially in cancer diseases. In future studies, a further mixing of the BDETE distribution could be carried out to increase its flexibility.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Breslow, N. E., Day, N. E. & Heseltine, E. Statistical Methods in Cancer Research (The International Agency for Research on Cancer (IARC), 1980).
Roy, M. K., Rahman, S. & Ali, M. M. A class of poisson mixtured distributions. J. Inf. Optim. Sci. 13(2), 207–218 (1992).
Wood, G. Binomial mixtures and finite exchangeability. Ann. Probab. 20(3), 1167–1173. https://doi.org/10.1214/aop/1176989684 (1992).
Roy, M. K., Roy, A. K. & Ali, M. M. Binomial mixtures of some standard distributions. J. Inf. Optim. Sci. 14(1), 57–71. https://doi.org/10.1080/02522667.1993.10699136 (1993).
Zhu, J., Eickhoff, J. C. & Kaiser, M. S. Modeling the dependence between number of trials and success probability in beta-Binomial-Poisson mixture distributions. Biometrics 59(4), 955–961. https://doi.org/10.1111/j.0006-341X.2003.00110.x (2004).
Shkedy, Z., Molenberghs, G., Craenendonck, H. V., Steckler, T. & Bijnens, L. A hierarchical Binomial-Poisson model for the analysis of a crossover design for correlated binary data when the number of trials is dose-dependent. J. Biopharm. Stat. 15(2), 225–239. https://doi.org/10.1081/BIP-200049825 (2005).
Grilli, L., Rampichini, C. & Varriale, R. Binomial mixture modeling of university credits. Commun. Stat. Theory Methods 44(22), 4866–4879. https://doi.org/10.1080/03610926.2013.804565 (2015).
Knape, J. et al. Sensitivity of binomial N-mixture models to overdispersion: The importance of assessing model fit. Methods Ecol. Evol. 9, 2102–2114. https://doi.org/10.1111/2041-210X.13062 (2018).
El-Alosey, A. R. Random sum of new type of mixtures of distributions. Int. J. Stat. Syst. 2(1), 49–57 (2007).
Abed Al-Kadim, K. & Al-Hussani, R. N. Binomial mixture of Erlang distribution. Int. J. Math. Stat. Stud. 4(2), 28–38 (2016).
Adnan, M. A. S. & Kiser, H. A class of triple mixture distributions. Far East J. Theor. Stat. 59(2), 59–79. https://doi.org/10.17654/TS059020059 (2020).
Eledum, H. & El-Alosey, A. R. Binomial-geometric mixture and its applications. Math. Stat. 10(6), 1218–1228. https://doi.org/10.13189/ms.2022.100608 (2022).
Prabakaran, I. et al. Gaussian mixture models for probabilistic classification of breast cancer. Can. Res. 79(13), 3492–3502 (2019).
McLachlan, G. J., Lee, S. X. & Rathnayake, S. I. Finite mixture models. Annu. Rev. Stat. Appl. 6, 355–378 (2019).
Held, L., Hens, N., O'Neill, P. D. & Wallinga, J. (eds.). Handbook of Infectious Disease Data Analysis (CRC Press, 2019).
Noor, F. et al. Bayesian analysis of cancer data using a 4-component exponential mixture model. Comput. Math. Methods Med. 6289337 (2021).
El-Alosey, A. R. Discrete Erlang-truncated exponential distribution. Int. J. Stat. Appl. Math. 6(1), 230–236. https://doi.org/10.22271/maths.2021.v6.i1c.653 (2021).
Lawless, J. F. Statistical Models and Methods for Lifetime Data 2nd edn. (Wiley, 2002).
El-Alosey, A. R. & Eledum, H. On the negative binomial-discrete Erlang-truncated exponential mixture. Inf. Sci. Lett. 12(1), 1–13. https://doi.org/10.18576/isl/120203 (2023).
El-Alosey, A. R. & Eledum, H. Discrete extended Erlang-truncated exponential distribution and its applications. Appl. Math. Inf. Sci. 16(1), 127–138. https://doi.org/10.18576/amis/160113 (2022).
Eledum, H. & El-Alosey, A. R. Discrete Kumaraswamy Erlang-truncated exponential distribution with applications to count data. J. Stat. Appl. 12(2), 725–739. https://doi.org/10.18576/jsap/120232 (2023).
Klein, J. P. & Moeschberger, M. L. Survival Analysis: Techniques for Censored and Truncated Data (Springer, 2003).
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Contributions
A.R.E. did the theoretical part and H.E. made the estimation and applications.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
El-Alosey, A.R., Eledum, H. Binomial-discrete Erlang-truncated exponential mixture and its application in cancer disease. Sci Rep 13, 12229 (2023). https://doi.org/10.1038/s41598-023-38709-2