Marginalized two part model for analyzing multilevel semicontinuous medical costs in Iranian households

Daghaghele, Elham; Angali, Kambiz Ahmadi; Kamyari, Naser; Seyedtabib, Maryam

doi:10.1038/s41598-025-91309-0

Article
Open access
Published: 03 March 2025

Scientific Reports volume 15, Article number: 7491 (2025)

Abstract

Medical costs (MCs) represent a significant burden on household finances and often lead to economic challenges. This study analyzed the data of 8,993 Iranian households from 2021 collected by the Iranian Statistical Center. Using a marginalized two-part model (MTP) with lognormal and gamma distributions, the relationship between MCs and factors such as age, gender, education, and household characteristics was examined. A two-level structure was applied to account for heterogeneity across provinces, with analyses performed using R software. The mean annual MC was $180 with high variability (SD = $324.39). The main determinants included family size, residence area, education level, and socioeconomic status. Single households and families with more students had lower MCs. Among the models evaluated, the MTP-Lognormal model (MTP-LN) performed better than the MTP-Gamma model (MTP-G), as it provided better predictive accuracy and better reflected cost differences between provinces. These results highlight the socio-economic and demographic factors that influence household MCs in Iran. The MTP-LN model provides valuable insights for identifying at-risk groups and developing targeted interventions to reduce the financial burden of healthcare, especially for vulnerable populations. This study emphasizes the importance of tailored interventions to address regional inequalities and promote equitable access to healthcare.

Introduction

Medical costs (MCs) are an essential but often underestimated component of maintaining and improving one’s health¹. These expenses cover many healthcare costs, such as doctor visits, hospitalizations, surgeries, medications, and checkups. Individuals need to be aware of the significant costs associated with healthcare and to manage their healthcare actively. By staying informed and engaged, they can make informed decisions about their healthcare, reduce unnecessary spending and prioritize their well-being^2,3. The escalation of medical costs in recent decades reflects the development of living standards and quality of life in different societies^4,5. A closer look at the course and extent of this increase reveals considerable differences in the level and development of healthcare expenditure between individual countries^6,7,8,9,10. The costs associated with healthcare and medical services are a significant component of household finances and can lead to economic stress and financial challenges. These expenses, known as the costs of health impoverishment, have the potential to drive households into financial instability and jeopardize their economic prosperity¹¹.

In many developing countries, particularly in countries such as Iran, a considerable portion of medical costs is covered by households¹². This heavy reliance on out-of-pocket payments for healthcare underscores a critical aspect of the healthcare system in these areas. However, heavy reliance on direct household payments is considered an inadequate method of ensuring financial stability¹³. When families need healthcare beyond their means, they encounter significant challenges and may resort to measures such as borrowing, selling assets or cutting back on other essential spending to cover the costs¹⁴. A study conducted by Meharara in Iran on household health expenditure revealed that a significant proportion of the population, about 5.2%, is affected by catastrophic health expenditure¹⁵. Vulnerable groups, including rural households, households with unemployed people, households with young children and elderly people, and people without insurance coverage, are more susceptible to unsustainable healthcare costs. A priority objective of a country’s health policy is therefore to reduce direct payments and introduce more equitable financing mechanisms to ensure wider access to health services¹⁶.

Chronic diseases, such as diabetes, obesity, and depression, impose a significant economic burden on families and contribute to escalating healthcare costs^17,18,19,20. Several studies have shown a positive correlation between a person’s economic status and medical costs^21,22,23, with expenses generally increasing with age²⁴. In addition, insurance coverage has been shown to be an important factor influencing medical costs^23,25,26,27. Household head characteristics such as gender, education level and urban or rural residence have also been associated with differences in medical expenditure in various studies^28,29,30. Although the coefficients associated with these variables may vary from country to country due to different circumstances, overall, they influence healthcare expenditure patterns^{23,28,29,30,31,32,33}. These findings highlight the importance of considering household head characteristics in understanding differences in healthcare spending patterns and emphasize the need for tailored interventions to address inequalities in MC based on these factors.

Modeling medical costs presents challenges due to the unique characteristics of the data, including non-negativity, right skewness, and a substantial percentage of observations equal to zero, defining it as "semi-continuous" data^34,35,36. Traditional approaches such as linear models cannot adequately handle this complexity without additional modifications³⁷. Advanced statistical models tailored to semi-continuous healthcare cost data have been shown to be valuable tools for addressing the complexity of cost analysis³⁸. Techniques such as Tweedie distributions in generalized linear models³⁹, two-part mixed-effects models^40,41,42, joint models⁴³, machine learning^38,43, and Bayesian inference methods⁴¹ provide greater insight into healthcare cost patterns and allow researchers to examine the impact of various determinants on medical costs. Marginalized two-part (MTP) models are among the models commonly used for mixed semi-continuous data^36,43,44,45. In summary, the literature shows the importance of using specialized statistical models such as MTP to effectively analyze semi-continuous data, which is a common characteristic of healthcare cost data and other medical variables.

This study aims to investigate the factors associated with MC in Iranian households by using marginalized two-part models in a multilevel framework. By utilizing advanced statistical methods, this study attempts to elucidate the various determinants affecting medical costs and their impact on the level of household health expenditure. Finally, this study attempts to provide valuable insights for health care financing and insurance coverage strategies to ultimately promote a more equitable health care system for Iranian households.

Materials and methods

Study population

This is an analytical, applied, data-oriented epidemiological study. The data used in this study refer to the information on Iranian households’ medical expenditures obtained from a national project entitled “The Households Income and Expenditure Survey (HIES)” in 2021. The data come from the Statistical Center of Iran (SCI), which can be found at https://www.amar.org.ir⁵². This information was provided to the researchers of this project in raw form and as a random sample. The final sample size, after applying inclusion and exclusion criteria, was 8993 household heads in Iranian provinces.

To minimize potential biases that may result from the use of self-reporting, the HIES survey used standardized procedures and strict quality control protocols implemented by SCI to ensure the reliability and validity of the data collected. The nationally representative sampling design of the survey and subsequent data processing further reduced errors and inconsistencies prior to analysis.

The household data was kept confidential throughout the study. The study was conducted after approval from the Research Ethics Committee (REC) of Ahvaz Jundishapur College of Medical Sciences (AJUMS) with project number U-02034 and ethics code IR.AJUMS.REC.1402.064.

Predictor variables

Based on the information available, the 19 variables that the researchers believed might have an impact on MC were grouped into two categories: information related to the household head (five variables) and information related to the household (14 variables).

Information about the household head includes age (young adults; middle-aged adults; older adults), gender (male; female), education level (illiterate and elementary; lower than diploma; diploma and associate; bachelors; MSc or PhD), marital status (married; widowed or divorced; single), employment status (employed; not working; have income without a job; others). Household variables include the number of family members (1; 2; 3; 4 or more persons), residential area (urban; rural), number of employees (none; one; two or more), number of students in the family (none; one; two or more), number of educated persons in the family (none; one; two or more), type of home ownership (have a home; mortgage, rent & other), subsidies (no; yes), internet access (no; yes), car ownership (no; yes), bicycle ownership (no; yes), family income in the year and household expenditure on food (below; above the national average), clothing (below; above the national average) and housing (below; above the national average) in the year. These variables were extracted and used in the analysis phase.

Outcome variable

In this study, the total medical costs of a year (such as costs for dental and eye care, medication, addiction treatment, surgery, etc.) for each household were taken as the outcome variable. The cost variable is either zero or a positive amount. Assuming that the cost of healthcare services is correlated within each province and that there is heterogeneity among these cities, the provincial cities in the center of Iran were considered as clusters, and random effects were included to account for this heterogeneity.

Semi-continuous data

Data on healthcare costs are often characterized as semi-continuous and exhibit a non-normal distribution with an unbalanced ratio of zeros to positive values³⁴. In health economics and service research, such data are widely used and pose a challenge for analysis due to their unique characteristics. These types of data, known as semi-continuous data with zero inflation, cover a wide range of areas, such as research on healthcare costs, medical care services, health assessments^34,37, average daily alcohol consumption^37,53, annual car insurance claims, and the relative abundance of the microbiome^37,46.

In the context of semi-continuous data with zero inflation, the presence of a significant proportion of zeros alongside positive skewed values requires special treatment in statistical analysis. Failure to account for this peculiarity when running regression models can lead to biased estimates, incorrect conclusions and ultimately misleading results. Taking into account the atypical distribution of semi-continuous data with zeros is crucial for the accuracy and validity of research results in various fields of study.

Two-part models for semi-continuous data

Semi-continuous data usually needs two-part mixture models to effectively capture both the discrete and continuous aspects of the data. In the case of independent observations, the typical format of the two-part model is outlined below:

$$f\left({y}_{i}\right)={(1-{\pi }_{i})}^{{\mathbb{l}}_{({y}_{i}=0)}}\times {\left[{\pi }_{i}g({y}_{i}|{y}_{i}>0;{\mu }_{i},\sigma ,\kappa )\right]}^{{\mathbb{l}}_{({y}_{i}>0)}}, { y}_{i}\ge 0, i=1,\dots ,n$$

(1)

where ${\pi }_{i}=Pr({Y}_{i}>0)$, ${\mathbb{l}}_{(.)}$ serves as an indicator function, and $g({y}_{i}|{y}_{i}>0)$ is a function that depends on a specific location parameter ${\mu }_{i}$, a positive scale parameter $\sigma$, and $\kappa \in \mathfrak{R}$ that determines the shape or skewness of the distribution. Commonly preferred densities include the gamma (G)⁵⁴ or generalized gamma (GG)^48,55,56, lognormal (LN)^57,58, weibull (W)⁵⁹, and log-skew-normal (LSN)^57,60, which will be further elaborated on below.
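To make Eq. (1) concrete, the following minimal Python sketch evaluates the two-part log-likelihood for independent observations with a lognormal positive part. It is an illustration only, not the authors' R code: the function names and the toy data are assumptions, and covariates are left out so that each observation simply carries its own ${\pi }_{i}$ and ${\mu }_{i}$.

```python
import math

def lognormal_logpdf(y, mu, sigma):
    """Log density of a lognormal with log-scale location mu and scale sigma."""
    z = (math.log(y) - mu) / sigma
    return -math.log(y * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z

def two_part_loglik(ys, pis, mus, sigma):
    """Sum of log f(y_i) from Eq. (1), with g taken to be lognormal."""
    total = 0.0
    for y, pi, mu in zip(ys, pis, mus):
        if y == 0:
            # Zero part: contributes log Pr(Y_i = 0) = log(1 - pi_i)
            total += math.log(1.0 - pi)
        else:
            # Positive part: log pi_i + log g(y_i | y_i > 0)
            total += math.log(pi) + lognormal_logpdf(y, mu, sigma)
    return total

# Toy data (made up for illustration): two zeros and two positive costs
costs = [0.0, 0.0, 50.0, 400.0]
ll = two_part_loglik(costs, pis=[0.7] * 4, mus=[4.5] * 4, sigma=1.2)
print(ll)
```

Swapping `lognormal_logpdf` for a gamma log density would give the gamma variant of the same likelihood.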

In Eq. (1), covariates are included in two separate linear predictors, one for ${\pi }_{i}$ and one for ${\mu }_{i}$. An instance of this is the conditional two-part (CTP) model⁶¹, where a logit link is utilized for the binary part and a positive continuous distribution is used for $g({y}_{i}|{y}_{i}>0)$. The model is structured as below:

$$\text{Part I}: logit\left({\pi }_{i}\right)=logit\left[\text{Pr}\left({Y}_{i}>0\right)\right]={{\varvec{Z}}}_{i}{\prime}\boldsymbol{\alpha }={\alpha }_{0}+{z}_{1i}{\alpha }_{1}+\dots +{z}_{qi}{\alpha }_{q}$$

$$\text{Part II}: {\mu }_{i}=E\left[\text{ln}({Y}_{i}|{Y}_{i}>0)\right]={{\varvec{X}}}_{i}{\prime}{\varvec{\beta}}={\beta }_{0}+{x}_{1i}{\beta }_{1}+\dots +{x}_{pi}{\beta }_{p} , i=1,\dots ,n$$

(2)

where ${{\varvec{Z}}}_{{\varvec{i}}}^{\boldsymbol{^{\prime}}}$ is a $1\times q$ covariate vector and $\boldsymbol{\alpha }$ a $q\times 1$ regression coefficient in the binary part. Also, ${{\varvec{X}}}_{{\varvec{i}}}^{\boldsymbol{^{\prime}}}$ is a $1\times p$ covariate vector and ${\varvec{\beta}}$ a $p\times 1$ regression coefficient in the continuous part. By disregarding the intercept, the components of $\boldsymbol{\alpha }$ indicate unit changes in the log-odds of a positive response, while the components of ${\varvec{\beta}}$ represent unit changes on the conditional mean of the logged positive values, ${\mu }_{i}=E\left[\text{ln}({Y}_{i}|{Y}_{i}>0)\right]$. The conditional interpretation of ${\varvec{\beta}}$ suggests that it assesses the effects of covariates on individuals who exhibit a positive response, rather than on the overall population.

In statistical modeling, researchers often focus on examining the effects of certain factors on the transformed marginal means. However, there are cases where it is crucial to examine the impact on the untransformed marginal mean, denoted $E({Y}_{i})$, in order to draw conclusions about the overall population, which includes both users and non-users of health services^44,45,46,50. To address the need for such inferences, Smith and colleagues introduced a marginalized two-part (MTP) model that allows direct parameterization of the effects of covariates on the marginal mean⁵¹. The MTP model is characterized by its parameters as:

$$\text{Part I}: logit\left({\pi }_{i}\right)={{\varvec{Z}}}_{i}{\prime}\boldsymbol{\alpha }$$

$$\text{Part II}:\text{ E}\left({Y}_{i}\right)={\nu }_{i}=\text{exp}({{\varvec{X}}}_{i}{\prime}{\varvec{\beta}}) , i=1,\dots ,n$$

(3)

In this context, $\boldsymbol{\alpha }$ has the same meaning as in the CTP model and represents a vector of log-odds ratios. The model allows the estimation of covariate effects on the overall marginal mean and standard error by linear combinations of the parameters in the second part. Specifically, $\text{exp}({\beta }_{k})$ represents the multiplicative effect on the overall mean when the kth covariate increases by one unit. With this parameterization, the marginal means and standard errors predicted by the model can be easily determined by calculating $\text{exp}({{\varvec{X}}}_{i}{\prime}{\varvec{\beta}})$ at the specified values of the covariates.
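As a short worked step, and assuming the usual conditional means of the two candidate distributions, namely $E({Y}_{i}|{Y}_{i}>0)=\text{exp}({\mu }_{i}+{\sigma }^{2}/2)$ for the lognormal and $E({Y}_{i}|{Y}_{i}>0)=\text{exp}({\mu }_{i})$ for a gamma with log-mean ${\mu }_{i}$, the marginal mean in Eq. (3) ties the location parameter of the positive part to the two linear predictors:

$${\nu }_{i}={\pi }_{i}\,\text{exp}\left({\mu }_{i}+\frac{{\sigma }^{2}}{2}\right)\;\Rightarrow\; {\mu }_{i}={{\varvec{X}}}_{i}{\prime}{\varvec{\beta}}-\text{log}\,{\pi }_{i}-\frac{{\sigma }^{2}}{2}\quad \text{(lognormal)}$$

$${\nu }_{i}={\pi }_{i}\,\text{exp}\left({\mu }_{i}\right)\;\Rightarrow\; {\mu }_{i}={{\varvec{X}}}_{i}{\prime}{\varvec{\beta}}-\text{log}\,{\pi }_{i}\quad \text{(gamma)}$$

Under these assumptions ${\varvec{\beta}}$ acts directly on the overall mean, while ${\mu }_{i}$ is no longer modeled on its own but is implied by ${\pi }_{i}$ and ${\nu }_{i}$.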

In this class of models, different distributions can be used to effectively analyze the semi-continuous data. Based on AIC and BIC, we used the Vuong test (V) to determine whether the zero and positive components of the cost variables are generated by different processes⁶². This non-nested hypothesis test produces a Z statistic, where a value greater than 1.96 supports the alternative assumption that the first model fits the data better, while a value less than −1.96 indicates that the second model provides a better fit. In our analysis, we evaluated our independent two-equation model against a Tobit model that accounts for interdependence. The calculated test statistic was 69,729.1, and since V is greater than 1.96, we find evidence supporting the hypothesis of independent processes. Although Tobit or Heckman models could account for interdependence, their use is not justified here. The principle of parsimony favors the simpler independent two-equation model, which effectively captures the data and provides more interpretable coefficients without the complexity of interdependence. In particular, the MTP model allows for adaptability by considering a range of distributions and variance structures^{34,44,50,51,63}. To justify the choice of distributions in the marginalized two-part model (MTP), we selected lognormal and gamma distributions based on their theoretical suitability for modeling right-skewed, non-negative medical cost data, their frequent use in related studies, and their applicability to the multilevel structure of the data in the final section of the models.
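The Vuong statistic described above can be sketched in a few lines of Python. This is the basic, uncorrected form built from per-observation log-likelihoods of two fitted non-nested models; whether the authors applied an AIC- or BIC-type correction is not stated, so none is applied here. The function name and the numbers are made up for illustration.

```python
import math
import statistics

def vuong_statistic(loglik_model1, loglik_model2):
    """Uncorrected Vuong Z statistic from per-observation log-likelihoods.

    Values above 1.96 favor model 1; values below -1.96 favor model 2.
    """
    # Pointwise log-likelihood ratios m_i = log f1(y_i) - log f2(y_i)
    m = [a - b for a, b in zip(loglik_model1, loglik_model2)]
    n = len(m)
    mean_m = statistics.fmean(m)
    sd_m = statistics.pstdev(m)  # population standard deviation of the m_i
    return math.sqrt(n) * mean_m / sd_m

# Hypothetical per-observation log-likelihoods for four observations
ll_a = [-1.0, -2.0, -1.5, -1.0]
ll_b = [-2.0, -2.5, -2.5, -1.5]
v = vuong_statistic(ll_a, ll_b)
print(v)
```

In practice the two input lists would come from evaluating each fitted model's log density at every household's observed cost.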

Multilevel models in cluster analysis

According to Ning Li’s research, data can be organized in a composite or stratified format, where hierarchies signify that observations within identical groups or contexts share commonalities or similarities that imply some degree of uniformity⁶⁴. Consequently, a mixed framework can be used to describe a model for semi-continuous data with two levels. The first level pertains to observations ($i=1,\dots ,{n}_{j}$) nested in level-two units ($j=1,\dots ,m$) that correspond to the center provinces.

The model’s parameterization is divided into two parts that are fitted separately.

In part I, the binary outcome is modeled as:

$$logit\left({\pi }_{ij}\right)=logit\left(\text{Pr}\left({Y}_{ij}>0\right)\right)={{\varvec{Z}}}_{{\varvec{i}}{\varvec{j}}}^{\boldsymbol{^{\prime}}}\boldsymbol{\alpha }+{{\varvec{b}}}_{1{\varvec{i}}}$$

(4)

where ${b}_{1i}\sim N(0,{\sigma }_{b1}^{2})$ represents the random effect that accounts for the correlation within a cluster (level 2) in the zero part.

Assuming a logarithmic link function $g$, the location parameter ${\mu }_{ij}$ for the continuous component in the second part is modeled as:

$$g\left(E\left({Y}_{ij}|{Y}_{ij}>0\right)\right)=\text{log}\left({\mu }_{ij}|{Y}_{ij}>0\right)={{\varvec{X}}}_{{\varvec{i}}{\varvec{j}}}^{\boldsymbol{^{\prime}}}{\varvec{\beta}}+{{\varvec{b}}}_{2{\varvec{i}}}$$

(5)

where ${b}_{2i}\sim N(0,{\sigma }_{b2}^{2})$ represents the random effect that accounts for the correlation within a cluster (level 2) in the continuous part. These random effects capture the unobserved characteristics or factors that may influence the outcome variable within each cluster. By including this random effect in the model, we can account for the clustering of observations within each cluster and better estimate the true relationship between the predictors and the outcome variable. In this context, it is assumed that the random effects ${b}_{1i}$ and ${b}_{2i}$, pertaining to the processes zero and non-zero, are independent and uncorrelated.

${{\varvec{Z}}}_{{\varvec{i}}{\varvec{j}}}^{\boldsymbol{^{\prime}}}$ represents the covariates for the i-th subject in the j-th cluster for the binary part, and ${{\varvec{X}}}_{{\varvec{i}}{\varvec{j}}}^{\boldsymbol{^{\prime}}}$ represents the covariates for the i-th subject in the j-th cluster used for the continuous part. The two parts may have common or completely different covariates. $\boldsymbol{\alpha }$ represents the vector of model coefficients for the binary part, while ${\varvec{\beta}}$ represents the vector of coefficients for the continuous part, under the condition that the values are non-zero.

For a TP model, the marginal mean and variance of ${Y}_{ij}$ can be derived as follows:

$$E\left({Y}_{ij}\right)={\pi }_{ij}E\left({Y}_{ij}|{Y}_{ij}>0\right)$$

$$Var\left({Y}_{ij}\right)={\pi }_{ij}\left[E({Y}_{ij}^{2}|{Y}_{ij}>0)-{\pi }_{ij}{E({Y}_{ij}|{Y}_{ij}>0)}^{2}\right]$$

(6)

When lognormal is assumed in the continuous part, the marginal mean is

$$E\left({Y}_{ij}\right)={\pi }_{ij}\times exp\left\{{\mu }_{ij}+\frac{{\sigma }^{2}}{2}\right\}=\frac{1}{1+exp\left\{-\left({Z}_{ij}{\prime}\alpha +{b}_{1i}\right)\right\}}\times exp\left\{{X}_{ij}{\prime}\beta +{b}_{2i}+\frac{{\sigma }^{2}}{2}\right\}$$

(7)

and when gamma is assumed in the continuous part, the marginal mean is

$$E\left({Y}_{ij}\right)={\pi }_{ij}\times exp\left\{{\mu }_{ij}\right\}=\frac{1}{1+exp\left\{-\left({Z}_{ij}{\prime}\alpha +{b}_{1i}\right)\right\}}\times exp\left\{{X}_{ij}{\prime}\beta +{b}_{2i}\right\}$$

(8)
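The moment formulas in Eqs. (6) to (8) can be checked numerically. The sketch below, with made-up predictor values and hypothetical function names, computes the mean and variance of a two-part outcome with a lognormal positive part and compares them with a Monte Carlo simulation. It conditions on the random effects, so `eta_zero` stands for ${Z}_{ij}{\prime}\alpha +{b}_{1i}$ and `eta_pos` for ${X}_{ij}{\prime}\beta +{b}_{2i}$.

```python
import math
import random

def inv_logit(eta):
    """Inverse of the logit link in Eq. (4)."""
    return 1.0 / (1.0 + math.exp(-eta))

def two_part_moments(pi, cond_mean, cond_second_moment):
    """Mean and variance of Y from Eq. (6)."""
    mean = pi * cond_mean
    var = pi * (cond_second_moment - pi * cond_mean ** 2)
    return mean, var

# Illustrative values only (not estimates from the paper)
eta_zero, eta_pos, sigma = 0.85, 1.0, 0.5
pi = inv_logit(eta_zero)

# Lognormal positive part: E(Y|Y>0) and E(Y^2|Y>0)
cond_mean = math.exp(eta_pos + sigma ** 2 / 2.0)
cond_m2 = math.exp(2.0 * eta_pos + 2.0 * sigma ** 2)
mean_ln, var_ln = two_part_moments(pi, cond_mean, cond_m2)

# Gamma positive part with log link: the mean in Eq. (8) has no sigma^2/2 term
mean_gamma = pi * math.exp(eta_pos)

# Monte Carlo check of the lognormal case
rng = random.Random(0)
draws = [rng.lognormvariate(eta_pos, sigma) if rng.random() < pi else 0.0
         for _ in range(100_000)]
sim_mean = sum(draws) / len(draws)
sim_var = sum((d - sim_mean) ** 2 for d in draws) / len(draws)
print(mean_ln, sim_mean, var_ln, sim_var)
```

The simulated and analytical values should agree up to simulation noise, which is a quick sanity check on the variance expression in Eq. (6).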

In binary models, the $\boldsymbol{\alpha }$ estimates describe the log-odds of positive values in the population. On an exponential scale, $\text{exp}(\alpha )$ is the odds ratio for a one-unit increase in the covariate. In continuous models, the ${\varvec{\beta}}$ estimates are only for non-zero positive values, a subset of the data. When a log link is used, $\text{exp}(\beta )$ shows the multiplicative change in the mean as the covariate increases by one unit, assuming the observation is not zero. To summarize, the binary part estimates the probabilities of non-zero values in the population, while the continuous part shows the effects on the population mean when the values are non-zero. Moving forward, to simplify the presentation, we refer to the marginalized two-part lognormal and marginalized two-part gamma models as MTP-LN and MTP-G, respectively.

Parameter estimation and inference for MTP

Let $n=\sum_{j=1}^{m}{n}_{j}$ be the total number of subjects and assume that subjects $(i=1,\dots ,{n}_{j})$ in different clusters $(j=1,\dots ,m)$ are independent. Given the random effects ${{\varvec{b}}}_{1{\varvec{i}}}$ and ${{\varvec{b}}}_{2{\varvec{i}}}$, the likelihood function can be written as:

$$L\left(\alpha ,\beta |{b}_{1},{b}_{2}\right)=\prod_{j=1}^{m}\prod_{i=1}^{{n}_{j}}\iint {\left(1-{\pi }_{ij}\right)}^{{\mathbb{l}}_{\left({y}_{ij}=0\right)}}{\left[{\pi }_{ij}g\left({y}_{ij}|{y}_{ij}>0,{b}_{2i}\right)\right]}^{{\mathbb{l}}_{\left({y}_{ij}>0\right)}}\varphi \left({b}_{1}\right)\varphi \left({b}_{2}\right)d{b}_{1i}d{b}_{2i}$$

(9)

where ${n}_{j}$ is the number of subjects in cluster j, ${\pi }_{ij}$ is given by Eq. (4) if the logit link function is used in Part I, $g({y}_{ij}|{y}_{ij}>0)$ depends on the distribution assumed for ${y}_{ij}>0$ (lognormal or gamma), and $\varphi \left({b}_{1}\right)$ and $\varphi \left({b}_{2}\right)$ are the normal densities of the two random effects ${b}_{1}$ and ${b}_{2}$.

The likelihood in Eq. (9) requires the integration of a nonlinear function over the two random effects in the likelihood function. To obtain maximum likelihood estimators for $\alpha$, $\beta$, and the random effects, numerical methods combined with integration approximation techniques are essential. Some researchers used a high-order Laplace approximation to estimate the marginal likelihood and employed an approximate Fisher scoring algorithm for maximization^65,66,67. Similarly, Tooze et al. used a quasi-Newton algorithm in conjunction with an adaptive Gaussian quadrature for likelihood maximization⁶⁸. Hubin⁶⁹ and Wang⁷⁰ also investigated the Integrated Nested Laplace Approximation (INLA) and a generalized version of the Fisher scoring method for estimating the marginal likelihood and maximizing the likelihood, respectively.
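To illustrate the kind of integration approximation involved, the sketch below applies a plain (non-adaptive) 5-point Gauss-Hermite rule to the zero part of one cluster's likelihood, integrating a normal random intercept out of a logistic model. It is a simplified stand-in, not the adaptive quadrature or the Laplace approximation used by the fitting software, and it treats the random intercept as shared by all observations in the cluster. Function and variable names are hypothetical.

```python
import math

# 5-point Gauss-Hermite nodes and weights for the weight function exp(-x^2)
GH_NODES = [-2.0201828705, -0.9585724646, 0.0, 0.9585724646, 2.0201828705]
GH_WEIGHTS = [0.01995324206, 0.3936193232, 0.9453087205,
              0.3936193232, 0.01995324206]

def cluster_likelihood(is_positive, etas, sigma_b):
    """Marginal likelihood of one cluster's zero/positive indicators.

    is_positive: booleans, True when y_ij > 0
    etas:        fixed-effect linear predictors Z'alpha, one per observation
    sigma_b:     standard deviation of the normal random intercept
    """
    total = 0.0
    for x, w in zip(GH_NODES, GH_WEIGHTS):
        # Change of variables b = sqrt(2) * sigma_b * x maps N(0, sigma_b^2)
        # onto the exp(-x^2) weight of the Gauss-Hermite rule
        b = math.sqrt(2.0) * sigma_b * x
        lik = 1.0
        for pos, eta in zip(is_positive, etas):
            p = 1.0 / (1.0 + math.exp(-(eta + b)))
            lik *= p if pos else (1.0 - p)
        total += w * lik
    return total / math.sqrt(math.pi)

# Toy cluster with three households, two of them with positive costs
lik = cluster_likelihood([True, True, False], [0.4, 1.1, -0.2], sigma_b=0.9)
print(lik)
```

The full likelihood would multiply such terms over clusters and add the positive-part density with its own random effect; adaptive quadrature additionally recenters and rescales the nodes for each cluster.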

In this article, we use different methods to estimate the parameters in the utilization models. For the MTP-LN model, we use an integration method known as adaptive Gauss-Hermite quadrature and an optimization method that combines hybrid EM and quasi-Newton approaches. For the MTP-G model, on the other hand, we implement the Laplace approximation as the integration method and use maximum likelihood estimation via ‘TMB’ (Template Model Builder) as the optimization method. These different techniques meet the unique requirements of each model and ensure accurate parameter estimation and robust model fitting for both MTP-LN and MTP-G models. This methodology can be easily implemented in widely used standard statistical packages. Data cleaning, statistical analyses, and data visualization were primarily conducted using R version 4.3.2⁷¹. The corresponding code is included in Appendix A for reference. Maps were created using the free plan of the Datawrapper site (https://app.datawrapper.de)⁷².

Model fit assessment

The log-likelihood ($LL$) determined by maximum likelihood estimation serves as an indicator of how well a model fits the data, with higher values indicating a stronger fit. However, when comparing different models, it is more appropriate to use information criteria such as Deviance ($D=-2LL$), Akaike information criterion ($AIC=-2LL+2k$), and Schwarz’s Bayesian information criterion ($BIC=-2LL+k\text{log}n$), where $n$ is the sample size (the number of data points) and $k$ is the number of estimable parameters⁷³. These criteria are based on the log-likelihood function, but AIC and BIC include a penalty for the number of parameters in the model, which helps to prevent overfitting. The model with the smallest value of the information criterion is generally preferred, as it represents the best balance between model fit and complexity. We also recorded the time each model needed to converge on this large sample.
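The three criteria can be computed directly from a fitted model's log-likelihood. The short Python sketch below follows the formulas above; the log-likelihood and parameter count are made-up placeholders, and only the sample size of 8993 comes from the study.

```python
import math

def information_criteria(log_lik, k, n):
    """Deviance, AIC and BIC from a log-likelihood, k parameters, n observations."""
    deviance = -2.0 * log_lik
    return {
        "deviance": deviance,
        "AIC": deviance + 2.0 * k,
        "BIC": deviance + k * math.log(n),
    }

# Hypothetical log-likelihood and parameter count; n is the study's sample size
ic = information_criteria(log_lik=-1000.0, k=10, n=8993)
print(ic)
```

Because BIC multiplies $k$ by $\text{log}\,n$, its penalty exceeds that of AIC whenever $n$ is larger than about 7, which is always the case for a sample of this size.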

To assess the fit of the models used, scatter plots and heat maps were created to compare the actual values with the fitted values for the MTP models. These scatter plots provide a visual representation of how well the models can predict the data, allowing a more comprehensive evaluation of their performance.

Results

The average annual medical cost of Iranian households (n = 8993), including dental and eye care, surgery, etc., which served as the response variable, was about $180 with a notable standard deviation of $324.39 (median = $52.32, IQR = $204), indicating a wide range of costs. In addition, the minimum and maximum medical costs in 2021 were zero and over $2000, respectively. The skewness and kurtosis of the response variable were 29.3 and 16.13, respectively.

The histogram of total MCs of Iranian households in 2021 is shown in Fig. 1. This plot shows that medical costs were a semi-continuous variable with 2386 (26.5%) values equal to zero and a continuous right-skewed distribution among the positive values. The fitted values extracted from the two proposed models, MTP-LN and MTP-G, are also inserted into the histogram.

Table 1 shows the factors associated with MC in Iranian households for the year 2021, broken down by head of family and family variables. This data provides valuable insights into the determinants of medical expenditure in households in Iran. The total number of families included in the study was 8993, out of which 2386 (26.53%) had zero MC and 6607 (73.47%) had nonzero (positive) MC.

Table 1 Socio-demographic information on Iranian families by medical costs in 2021.

The mean age of the head of family was 52.64 years in families with positive MC, which was higher than in families with zero MC (51.21 years). There is also a significant association between age groups and MC, with a higher proportion of young adults among those with nonzero MC (P < 0.001). The majority of the heads of Iranian families were male in both groups, with no significant difference between zero and nonzero MC (P = 0.425). In terms of education, heads with higher education levels (bachelor’s degree or higher) had a higher proportion of positive MC compared to family heads with lower education levels (P = 0.034). Married individuals constituted the majority in both groups, but married heads had a higher proportion of nonzero MC compared to widowed/divorced or single heads (P = 0.004). Employed heads had a higher proportion of positive MC compared to families whose head was not working or had income without a job (P < 0.001).

Families with positive MC tend to have a slightly larger family size compared to those with zero MC. The mean family size was 3.32, which was significantly higher in families with positive MC. There is also a significant association between family size categories and MC (P < 0.001). The majority of families resided in urban areas (71.5%), with a significantly higher proportion of families with nonzero MC residing in urban areas (P < 0.001). The mean number of employees was not significantly different between the two groups. However, the presence of employed members per family shows a significant difference between the groups based on MC. There is no significant difference in the distribution based on the number of student members (P = 0.575), but there is a significant difference in the distribution based on the number of educated members (P = 0.004). The majority of families owned their homes (70.9%), received subsidies (89.1%), and had internet access (78.3%), with no significant difference between the two groups. In contrast, ownership of a car and of a bicycle did differ significantly between the groups. Finally, the mean income, feeding cost, clothing cost, and housing cost were all significantly higher in families with positive MC (P < 0.001).

Table 2 shows the estimated coefficients with standard errors for fitting the marginalized two-part lognormal model (MTP-LN) and the marginalized two-part gamma model (MTP-G) to the MC of Iranian households in 2021. Factors considered include various demographic and socioeconomic variables related to the head of the family and family characteristics.

Table 2 Fitting marginalized two-part log-normal and two-part gamma models to medical costs of Iranian households in 2021.

In both models, the zero part evaluates the probability that a household has a positive MC. The MTP-LN model shows that households whose head is aged 60 and over have significantly higher odds, by a factor of 1.46 (${e}^{0.3797}\cong 1.46$), of having a positive MC than households whose head is aged 18 to 39. The MTP-G model also provides comparable results. In addition, households headed by individuals with a bachelor’s degree have odds of positive MC that are lower by a factor of approximately 1.38 ($1/{e}^{-0.3246}\cong 1.38$) than households with lower or higher levels of education. Notably, rural households have significantly lower odds of a positive MC, by a factor of 1.4 ($1/{e}^{-0.3390}\cong 1.40$), than urban households in both models. In addition, households with two or more educated members have 1.35 times higher odds (${e}^{0.3030}\cong 1.35$) of positive MC in both models. Interestingly, households with a mortgage, renting or in alternative forms of living also have 17% higher odds (${e}^{0.1572}\cong 1.17$) of positive MC in the MTP-G model.
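The conversions quoted above are simple exponentials of the zero-part coefficients. The following Python sketch reproduces them; the coefficient values are those cited in the text, while the dictionary labels and helper names are only illustrative.

```python
import math

# Zero-part coefficients quoted in the text (labels are informal)
zero_part_coefs = {
    "head aged 60+": 0.3797,
    "head with bachelor's degree": -0.3246,
    "rural residence": -0.3390,
    "two or more educated members": 0.3030,
    "mortgage, rent or other (MTP-G)": 0.1572,
}

def odds_ratio(alpha):
    """Odds ratio for a one-unit change in a covariate with logit coefficient alpha."""
    return math.exp(alpha)

def describe(alpha):
    """Express an odds ratio as 'times higher' or, when below 1, 'times lower'."""
    ratio = odds_ratio(alpha)
    if ratio >= 1.0:
        return f"{ratio:.2f} times higher odds"
    return f"{1.0 / ratio:.2f} times lower odds"

for label, alpha in zero_part_coefs.items():
    print(f"{label}: {describe(alpha)}")
```

Note that these are ratios of odds, not of probabilities; the two are close only when the baseline probability of a positive cost is small.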

The positive part of the two models estimates the amount of MC for households with positive costs. The results of the MTP-G model are generally consistent with those of the MTP-LN model, although the MTP-G model yields some additional significant results. In both models, household heads with a higher level of education and those with a different employment status have a significantly higher amount of MC (24% and 43%, respectively), the latter in comparison with employed household heads. At the mean of the sample, $180.58, a 43% increase (${e}^{0.3555}\cong 1.43$) corresponds to an additional $77.67 spent on MC by Iranian households in one year. In contrast, single household heads have significantly lower (− 35%) MC compared to married, widowed or divorced household heads. At the mean of the sample, a 35% decrease (${e}^{-0.4317}\cong 0.65$) corresponds to $63.11 less spent on MC by single Iranian households in one year. Among the family characteristics, larger families (4 or more people) (33%) and lower student group membership (20%) were associated with a higher amount of MC in both the MTP-LN and MTP-G models. At the sample mean, the increases of 33% (${e}^{0.2819}\cong 1.33$) and 20% ($1/{e}^{-0.1835}\cong 1.20$) mean that Iranian households with larger size and lower student membership spend about $58.62 and $36.25 more on MCs in a year, respectively. Households with two or more educated members were associated with a higher (20%) amount of MC in the MTP-G model. Higher than average income (23−26%) and higher than average food (20%), clothing (4−5%) and housing costs (22−26%) were also associated with a higher amount of MC in both models.
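The step from a log-scale coefficient to a percentage and a dollar amount can be sketched as follows. The coefficients and the sample mean of $180.58 are taken from the text; the helper names are hypothetical. Dollar amounts computed from unrounded multipliers may differ slightly from the rounded figures quoted above.

```python
import math

SAMPLE_MEAN = 180.58  # mean annual MC in dollars, as quoted in the text

def percent_change(beta):
    """Percentage change in mean cost implied by a log-link coefficient."""
    return 100.0 * (math.exp(beta) - 1.0)

def dollar_change(beta, baseline=SAMPLE_MEAN):
    """Change in dollars when the multiplicative effect is applied to a baseline."""
    return baseline * (math.exp(beta) - 1.0)

# Positive-part coefficients quoted in the text (labels are informal)
effects = {
    "other employment status": 0.3555,
    "single household head": -0.4317,
    "family of 4 or more": 0.2819,
    "fewer students (inverse of -0.1835)": 0.1835,
}

for label, beta in effects.items():
    print(f"{label}: {percent_change(beta):+.0f}% ({dollar_change(beta):+.2f} dollars)")
```

Choosing a different baseline, for example a provincial mean instead of the national mean, changes the dollar figure but not the percentage.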

Random effects models estimate variance components at each level. These components partition the total variability of the dependent variable into differences between clusters (level 2) and differences within clusters (level 1). The estimated variances of the random effects in the two parts of the models are reported at the end of Table 2. The dispersion of the responses in the second part of the log-normal model is significantly larger than in the gamma model, indicating that the log-normal model covers a broader range of responses. The large share of the variability in medical costs attributable to the provincial level, 78.5% [(8.94/(0.77 + 1.66 + 8.94)) × 100 = 78.5%], necessitated the use of the two-part multilevel model.
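The provincial share in the bracketed calculation is one variance component divided by the sum of all three. A short Python sketch of this ratio follows, using the three rounded estimates quoted in the text. The labels for the two smaller components are placeholders, since the text does not say which model part or level each belongs to. With these rounded inputs the ratio comes out between 78% and 79%, in line with the reported value.

```python
def variance_share(components, key):
    """Percentage of the total variance attributable to one component."""
    total = sum(components.values())
    return 100.0 * components[key] / total

# Rounded variance estimates from the text; the first two labels are placeholders.
components = {
    "component_a": 0.77,
    "component_b": 1.66,
    "province_level": 8.94,
}

share = variance_share(components, "province_level")
print(f"Share of variance at the provincial level: {share:.1f}%")
```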

Table 3 compares the goodness-of-fit statistics of the two models. The results show that the MTP-LN model fits better than the MTP-G model, as evidenced by its lower deviance (−2 log-likelihood), AIC and BIC values. Note, however, that the MTP-LN model takes longer to estimate than the MTP-G model.
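Fit statistics of this kind are computed from each model's maximized log-likelihood, and smaller values of the deviance (−2 × log-likelihood), AIC and BIC indicate the better-fitting model. The Python sketch below shows the standard formulas. The log-likelihoods and parameter counts are hypothetical placeholders, not the values in Table 3; only the number of households (8,993) is taken from the study.

```python
import math

def fit_statistics(log_lik, n_params, n_obs):
    """Deviance, AIC and BIC from a maximized log-likelihood."""
    deviance = -2.0 * log_lik
    aic = deviance + 2.0 * n_params
    bic = deviance + n_params * math.log(n_obs)
    return {"deviance": deviance, "AIC": aic, "BIC": bic}

N_HOUSEHOLDS = 8993  # sample size reported in the study

# Hypothetical log-likelihoods and parameter counts, for illustration only.
model_a = fit_statistics(log_lik=-41000.0, n_params=40, n_obs=N_HOUSEHOLDS)
model_b = fit_statistics(log_lik=-41500.0, n_params=40, n_obs=N_HOUSEHOLDS)

# The model with the smaller AIC (or BIC) is preferred.
preferred = "A" if model_a["AIC"] < model_b["AIC"] else "B"
print(model_a, model_b, preferred)
```

Because BIC multiplies the parameter count by log(n) rather than 2, it penalizes additional parameters more heavily than AIC at this sample size.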

Table 3 Comparison and assessment of fit statistics for two-part lognormal and two-part gamma models.


Figure 2A compares the actual MC values with the values fitted by the MTP-LN model, while Fig. 2B does the same for the MTP-G model. The data show a concentration of observations at lower cost levels. Notably, the two-part lognormal model shows closer agreement between the actual and fitted values than the two-part gamma model. The lognormal model with random effects captures the dispersion better than the gamma model, which tends to underestimate the data, reproduces little of their skewness and has low predictive power.

Figure 3 shows the actual values of MC (3A) and the average costs predicted by the MTP-LN (3B) and MTP-G (3C) models, categorized by different provinces in Iran. The results show that, despite possible discrepancies in the estimated figures, the predictions of the MTP-LN model correspond very well with the actual values, with a slight overestimation. In contrast, the MTP-G model tends to significantly underestimate the values. Therefore, it appears that the MTP-LN model performs better than the MTP-G model in terms of predictive accuracy.
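Over- and underestimation by province can be summarized as the difference between the mean predicted and the mean observed cost within each province. The Python sketch below illustrates this with a hypothetical helper, `province_bias`, and made-up toy records; neither the province labels nor the numbers come from the study.

```python
from collections import defaultdict
from statistics import mean

def province_bias(records):
    """Map each province to mean(predicted) - mean(observed).

    records: iterable of (province, observed_cost, predicted_cost) tuples.
    A positive value indicates overestimation, a negative value underestimation.
    """
    observed = defaultdict(list)
    predicted = defaultdict(list)
    for province, y, y_hat in records:
        observed[province].append(y)
        predicted[province].append(y_hat)
    return {p: mean(predicted[p]) - mean(observed[p]) for p in observed}

# Made-up toy data: (province label, observed MC, predicted MC).
toy_records = [
    ("P1", 0.0, 20.0),
    ("P1", 200.0, 190.0),
    ("P2", 100.0, 60.0),
    ("P2", 300.0, 250.0),
]

bias = province_bias(toy_records)
print(bias)
```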

Discussion

The study of the medical costs of Iranian households in 2021 provides valuable insights into the determinants and patterns of medical expenditure within the population. The average MC for Iranian households was approximately $180 per year, with a notable standard deviation, indicating a wide range of costs. The data exhibited a right-skewed distribution with a significant proportion of households having zero MC. Factors such as age, education level, employment status, family size, urban/rural residence, and income were found to be associated with MC in Iranian households.

Researchers working with semi-continuous data often use two-part models that combine logistic or probit regression for predicting zero values with linear regression for positive values^51,74. In more recent studies, these models have been extended to include zero-augmented beta prime models³⁴, multivariate proportionally constrained models⁷⁵, random effects models for longitudinal data⁶⁷, marginalized models⁵¹, and quantile regression models⁷⁶. The MTP model helps to assess the influence of factors on the marginal mean, which improves insight into health outcomes in different populations⁶³. Multilevel models, such as the two-part models introduced by Belotti⁴¹, are crucial for dealing with nested data structures in which variables differ between groups. Understanding hierarchical data structures is essential for accurate modeling, especially in areas of decision making such as business and science. In this study, comparison of the MTP-LN and MTP-G models showed that the MTP-LN model had better predictive accuracy, closely approximating actual medical costs in Iranian provinces and outperforming the MTP-G model. This research also shows, through scatter plots and maps, the importance of understanding the hierarchical structure of data and its impact on modeling.
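The marginal mean targeted by an MTP model is the product of the probability of a positive cost and the mean cost among those with positive costs. For a lognormal positive part this relationship can be written out as in the Python sketch below. It is a simplified single-level illustration without random effects, and the numerical values are hypothetical rather than estimates from this study.

```python
import math

def marginal_mean(pi, mu, sigma2):
    """E[Y] = P(Y > 0) * E[Y | Y > 0] with a lognormal positive part.

    pi: probability of a positive cost
    mu, sigma2: mean and variance of log(Y) given Y > 0
    """
    return pi * math.exp(mu + sigma2 / 2.0)

def conditional_log_mean(nu, pi, sigma2):
    """Log-scale location mu implied by a marginal mean nu (inverse of the above)."""
    return math.log(nu) - math.log(pi) - sigma2 / 2.0

# Hypothetical values, for illustration only.
pi, sigma2 = 0.8, 1.0
nu = 180.0  # a marginal mean on the dollar scale

mu = conditional_log_mean(nu, pi, sigma2)
recovered = marginal_mean(pi, mu, sigma2)
print(mu, recovered)
```

Parameterizing the model in terms of the marginal mean means that covariate effects are interpreted for the whole population, including households with zero cost, rather than only for households with positive costs.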

Our study provides new insights into the factors influencing medical costs in Iranian households, particularly regarding the role of education, family size, income, and rural versus urban residence. The analysis revealed that older individuals are more likely to have positive MC than younger individuals. Research consistently shows that age is a significant factor in determining MC and access to care. Richman found that both younger and middle-aged adults are vulnerable to burdensome MC, with younger adults particularly at risk even with moderate incomes⁷⁷. This is further supported by Na, who found that younger Medicare beneficiaries are less likely to receive recommended care than older age groups⁷⁸. However, Faraji et al.⁷⁹, Brockmann⁸⁰, and Mueller⁸¹ provide a different perspective, suggesting that older individuals may actually have lower MC, with the latter study indicating that high medical spending among older households is associated with decreased nonmedical spending. These findings highlight the complex interplay of age, medical costs, and access to care.

Some studies show that higher education levels among family heads are associated with a lower likelihood of incurring MC^79,81,82. The results of this study suggest that, holding family income and insurance status constant, higher education has a positive effect on MC in different quantiles, especially for spouses with higher education and in higher quantiles of health care expenditures. In addition, the study showed that having a PhD increases household MC by about $47 per year compared to an illiterate household.

In terms of rural vs. urban differences, our study aligns with findings from Lee⁸³ and Hartley⁸⁴, who noted that rural populations face more financial barriers in accessing care, despite generally lower medical costs. Our study highlights the importance of improving healthcare access and providing financial subsidies to rural households.

In both models, heads of households with a higher level of education and a different employment status have significantly higher MC, with non-working households spending around 43% more on MC compared to working households, while single heads of households spend around 35% less on MC than married, widowed or divorced heads of households. Larger families (4 or more people) and households with fewer students tend to have higher MC, while households with two or more educated members, above average income and above average expenditure on food, clothing and household costs are also associated with higher MC. Research consistently shows that larger family sizes, lower student group membership, and higher income and spending are associated with higher levels of MC^81,82,85,86. These factors can be seen as indicators of increased healthcare needs and utilization, which in turn drive up medical spending. However, the specific mechanisms through which these factors influence MC may vary, and further research is needed to fully understand these relationships.

The study compared statistical models for analyzing MC, focusing on methods for positively skewed healthcare costs. The two-part models, MTP-LN and MTP-G, were used to analyze the multilevel MC data. The results indicated that the MTP-LN model provided a better fit than the MTP-G model, as evidenced by lower deviance (−2 log-likelihood), AIC, and BIC values. Liu and Powers both explore the use of two-part models in analyzing MC data^55,87. Liu (2010) specifically focuses on a two-part random effects model, while Powers evaluates the predictive modeling of total healthcare costs using pharmacy claims data. Both studies find that the two-part models are effective in their respective analyses. However, Lin and Crawford present alternative approaches. Lin compares the two-part model with neural networks, finding strong evidence in favor of the latter⁸⁸. Crawford compares the accuracy of total population and disease-specific neural network models in predicting MC, with the latter proving more effective⁸⁹. These studies suggest that while the two-part models can be useful, they may not always provide the best fit compared to other modeling techniques^59,90.

The MTP-LN model was found to be more accurate than the MTP-G model in capturing data dispersion, particularly in the presence of random effects. This is supported by Liu, who highlighted the impact of model misspecification on the marginalized models⁴⁹, and Iddi, who emphasized the importance of considering overdispersion and correlation in modeling⁹¹. Voronca further demonstrated the superiority of the MTP-LN model in the context of the Generalized Gamma family of distributions⁴⁷. However, other sources do not provide direct evidence that the MTP-LN model is more accurate than the MTP-G model in capturing data dispersion, particularly in the presence of random effects^40,92.

Implications for health policy

The findings of this study offer several important insights that could guide health policy in Iran, especially given the substantial burden of medical costs on households. The wide variability in medical expenditures, with a significant proportion of households having no medical costs while others have much higher expenditures, points to the need for more targeted interventions. First, the relationship between education level and medical costs shows that households with a higher level of education tend to have higher medical costs. This could be due to greater awareness and utilization of healthcare services. Policymakers could consider improving access to preventive measures and health literacy programs, especially for low-income and less educated households, to avoid high medical costs. Second, the finding that rural households tend to have lower medical costs but face greater financial barriers underscores the importance of improving access to health care in these areas. Measures could include improving health infrastructure, providing financial subsidies to rural populations and offering mobile health services to reduce the economic burden of seeking healthcare. Third, the significant relationship between family size and medical costs suggests that larger families are more likely to have higher health care expenditures. Financial support measures such as expanding family-oriented health insurance plans or providing subsidies for larger households could help reduce these costs. Finally, the observed relationship between higher income and higher health care costs underscores the importance of addressing the health care needs of wealthier households, particularly with regard to insurance coverage and out-of-pocket spending. Policymakers could consider expanding coverage to reduce spending for high-income households while ensuring that low-income households continue to receive adequate support. When these factors are taken into account, health policies in Iran can be better tailored to reduce the financial burden on households, especially for the more vulnerable populations who face higher medical costs.

Study limitations

When interpreting the results of this study, it is important to consider that, as with other studies using household data, the data are self-reported and may be influenced by recall bias. This means that respondents may have forgotten, underestimated or overestimated some of the information provided. Since the study focuses on 8993 Iranian households, its generalizability to the broader population is limited. The unclear methods of data collection could affect the reliability of the results and lead to bias. Due to the selective inclusion of certain cost determinants, the study could lack depth and robustness. The specificity of the study to Iran in 2021 also limits the broader implications. To address these limitations, future research should consider a broader range of data and statistical approaches to improve the depth and robustness of the analysis. The use of advanced statistical techniques such as multivariate regression analysis, weighted samples, propensity score matching, structural equation modeling, or machine learning algorithms can provide a more comprehensive understanding of the factors influencing healthcare costs in Iran.

Conclusion

In this cross-sectional study, we conducted retrospective reporting on the MCs of Iranian households in 2021. In summary, the study illustrates the complexity of factors influencing MC in Iranian households and demonstrates the utility of MTP models in analyzing such data. The results emphasize the importance of considering demographic, socioeconomic, and geographic variables in understanding and predicting medical expenditure patterns within a population. The results of this study offer several important insights that could guide healthcare policy in Iran, especially in light of the significant medical cost burden faced by households. The wide variability in medical expenditures, with a substantial portion of households incurring zero medical costs while others face much higher expenses, suggests the need for more targeted interventions. Understanding these factors can help policymakers and healthcare providers develop targeted interventions to reduce the financial burden of healthcare for vulnerable populations. The MTP-LN model provides valuable insights into the factors associated with MCs of Iranian households. The model can be used to identify groups with a higher risk of high MC and to develop targeted measures to reduce the financial burden on the healthcare system for these groups.

Data availability

The data sets utilized and/or analyzed in this study are not publicly accessible due to sensitivity concerns. However, they can be made available upon a reasoned request to the corresponding author. The data originates from the Statistical Center of Iran (SCI), and the sources can be found at https://www.amar.org.ir.

Abbreviations

MC: Medical cost
HIES: Households Income and Expenditure Survey
SCI: Statistical Center of Iran
MTP-LN: Marginalized two-part lognormal
MTP-G: Marginalized two-part gamma
GG: Generalized gamma
W: Weibull
LSN: Log skew-normal
CTP: Conditional two-part
D: Deviance
AIC: Akaike information criterion
BIC: Bayesian information criterion
IQR: Interquartile range

References

Roebuck, M. C., Liberman, J. N., Gemmill-Toyama, M. & Brennan, T. A. Medication adherence leads to lower health care use and costs despite increased drug spending. Health Aff. 30(1), 91–99. https://doi.org/10.1377/hlthaff.2009.1087 (2011).
How Proactive Healthcare Can Save on Costs | Chicago Booth Review. (2024). https://www.chicagobooth.edu/review/how-proactive-healthcare-can-save-costs
Liu, P.-H. et al. Cost-effectiveness of human papillomavirus vaccination for prevention of cervical cancer in Taiwan. BMC Health Serv. Res. 10, 11 (2010).
Hussey, P. S., Wertheimer, S. & Mehrotra, A. The association between health care quality and cost a systematic review. Ann. Intern. Med. 158(1), 27 (2013).
Prioritizing health: A prescription for prosperity|McKinsey. (2024). https://www.mckinsey.com/industries/healthcare/our-insights/prioritizing-health-a-prescription-for-prosperity
Brück, C. C., Wolters, F. J., Ikram, M. A. & de Kok, I. M. C. M. Projections of costs and quality adjusted life years lost due to dementia from 2020 to 2050: A population-based microsimulation study. Alzheimer’s Dement. 19(10), 4532–4541 (2023).
Brück, C. C., Wolters, F. J., Ikram, M. A., de Kok, I. M. C. M. Projections of costs and quality adjusted life years lost due to dementia from 2020 to 2050: A population‐based microsimulation study. Alzheimer’s Dement. (2023).
Galvani, A. P., Parpia, A. S., Foster, E. M., Singer, B. H. & Fitzpatrick, M. C. Improving the prognosis of health care in the USA. Lancet 395(10223), 524–533 (2020).
O’Donnell, O. et al. Who pays for health care in Asia? J. Health Econ. 27(2), 460–475 (2008).
Queenan, J. T. The increasing cost of medical care. Obstet. Gynecol. 100(4), 629–630 (2002).
Doshmangir, L., Yousefi, M., Hasanpoor, E., Eshtiagh, B. & Haghparast-Bidgoli, H. Determinants of catastrophic health expenditures in Iran: A systematic review and meta-analysis. Cost Eff. Resour. Alloc. 18(1), 1–21 (2020).
Pauly, M. V., Zweifel, P., Scheffler, R. M., Preker, A. S. & Bassett, M. Private health insurance in developing countries. Health Aff. 25(2), 369–379. https://doi.org/10.1377/hlthaff.25.2.369 (2006).
Health financing. https://www.who.int/health-topics/health-financing#tab=tab_1
Cohen, R. A. & Kirzinger, W. K. Financial burden of medical care: a family perspective. NCHS Data Brief 142, 1–8 (2014).
Mehrara, M. & Fazaeli, A. A. Health finance equity in Iran: An analysis of household survey data (1382–1386). J. Health Adm. 13(40), 51–62 (2010).
Ahmadi, A. M., Nikravan, A., Naseri, A. & Asari, A. Effective determinants in household out of packet payments in health system of Iran, using two part regression model. (2014).
Xu X, Huang X, Zhang X, Chen L. Family economic burden of elderly chronic diseases: evidence from China. In: Healthcare, 99 (MDPI, 2019).
Health and Economic Costs of Chronic Diseases | CDC [Internet]. (2024). https://www.cdc.gov/chronicdisease/about/costs/index.htm
Gambert, S. R. The burden of chronic disease. Mayo Clin. Proc. Innov. Qual. Outcomes 8(1), 112 (2024).
Chronic conditions lead health care spend in the U.S. | Employer | UnitedHealthcare. (2024). https://www.uhc.com/employer/news-strategies/chronic-conditions-lead-health-care-spend-in-the-us
Han, K.-T., Kim, W. & Kim, S. Disparities in healthcare expenditures according to economic status in cancer patients undergoing end-of-life care. BMC Cancer. 22(1), 303 (2022).
Lee, H. J. et al. Association between changes in economic activity and catastrophic health expenditure: Findings from the Korea Health Panel Survey, 2014–2016. Cost Eff. Resour. Alloc. 18, 1–9 (2020).
Okunade, A. A., Suraratdecha, C. & Benson, D. A. Determinants of Thailand household healthcare expenditure: The relevance of permanent resources and other correlates. Health Econ. 19(3), 365–376 (2010).
Polder, J. J., Bonneux, L., Meerding, W. J. & Van Der Maas, P. J. Age-specific increases in health care costs. Eur. J. Public Health 12(1), 57–62 (2002).
Romley, J. A. et al. The relationship between commercial health care prices and medicare spending and utilization. Health Serv. Res. 50(3), 883–896 (2015).
Young, G. J. Do financial barriers to healthcare services affect health status?. Med. Care. 48(2), 1–10 (2010).
Health coverage protects you from high medical costs | HealthCare.gov. (2024). https://www.healthcare.gov/why-coverage-is-important/protection-from-high-medical-costs/
Liu, H. et al. Catastrophic health expenditure incidence and its equity in China: A study on the initial implementation of the medical insurance integration system. BMC Public Health. 19(1), 1–10 (2019).
Fang, K. et al. Illness, medical expenditure and household consumption: Observations from Taiwan. BMC Public Health 13(1), 743 (2013).
Azzani, M., Roslani, A. C. & Su, T. T. Determinants of household catastrophic health expenditure: A systematic review. Malays. J. Med. Sci. 26(1), 15 (2019).
Tur-Sinai, A., Magnezi, R. & Grinvald-Fogel, H. Assessing the determinants of healthcare expenditures in single-person households. Isr. J. Health Policy Res. 7(1), 191996 (2018).
Almalki, Z. S. et al. Original research: Investigating households’ out-of-pocket healthcare expenditures based on number of chronic conditions in Riyadh, Saudi Arabia: a cross-sectional study using quantile regression approach. BMJ Open 12(9), 66145 (2022).
Tur-Sinai, A., Magnezi, R. & Grinvald-Fogel, H. Assessing the determinants of healthcare expenditures in single-person households. Isr. J. Health Policy Res. 7(1), 48. https://doi.org/10.1186/s13584-018-0246-8 (2018).
Kamyari, N., Soltanian, A. R., Mahjub, H., Moghimbeigi, A. & Seyedtabib, M. Zero-augmented beta-prime model for multilevel semi-continuous data: A Bayesian inference. BMC Med. Res. Methodol. 22(1), 283. https://doi.org/10.1186/s12874-022-01736-0 (2022).
Liu, L. Joint modeling longitudinal semi-continuous data and survival, with application to longitudinal medical cost data. Stat. Med. 28(6), 972–986 (2009).
Shahrokhabadi, M. S., Chen, D. G., Mirkamali, S. J., Kazemnejad, A. & Zayeri, F. Marginalized two-part joint modeling of longitudinal semi-continuous responses and survival data: With application to medical costs. Math 9, 2603 (2021).
Liu, L. et al. Statistical analysis of zero-inflated nonnegative continuous data. Stat. Sci. 34(2), 253–279 (2019).
Mazumdar, M. et al. Comparison of statistical and machine learning models for healthcare cost data: A simulation study motivated by Oncology Care Model (OCM) data. BMC Health Serv. Res. 20(1), 7183716 (2020).
Kurz, C. F. Tweedie distributions for fitting semicontinuous health care utilization cost data. BMC Med. Res. Methodol. 17(1), 171. https://doi.org/10.1186/s12874-017-0445-y (2017).
Zero-Inflated and Two-Part Mixed Effects Models • GLMMadaptive. (2024). https://drizopoulos.github.io/GLMMadaptive/articles/ZeroInflated_and_TwoPart_Models.html
Belotti, F., Deb, P., Manning, W. G., Norton, E. C. & Arbor, A. twopm: Two-part models. Stata J. 15(1), 3–20 (2015).
Blozis, S. A. Bayesian two-part multilevel model for longitudinal media use data. J. Mark. Anal. 10(4), 311–328. https://doi.org/10.1057/s41270-022-00172-9 (2022).
Rustand, D., Briollais, L. & Rondeau, V. A marginalized two-part joint model for a longitudinal biomarker and a terminal event with application to advanced head and neck cancers. Pharm. Stat. 23(1), 60–80 (2024).
Smith, V. A., West, B. T. & Zhang, S. Fitting marginalized two-part models to semicontinuous survey data arising from complex samples. Health Serv. Res. 56(3), 558 (2021).
Kamyari, N., Soltanian, A. R., Mahjub, H. & Moghimbeigi, A. Diet, nutrition, obesity, and their implications for COVID-19 mortality: Development of a marginalized two-part model for semicontinuous data. JMIR Public Health Surveill. 7(1), e22717 (2021).
Chai, H., Jiang, H., Lin, L. & Liu, L. A marginalized two-part Beta regression model for microbiome compositional data. PLoS Comput. Biol. 14(7), e1006329 (2018).
Voronca, D. C., Gebregziabher, M., Durkalski, V. L., Liu, L. & Egede L. E. Marginalized two part models for generalized gamma family of distributions. arXiv151105629 (2015).
Jaffa, M. A. et al. Analysis of longitudinal semicontinuous data using marginalized two-part model. J. Transl. Med. 16(1), 1–15 (2018).
Liu, X. et al. Are marginalized two-part models superior to non-marginalized two-part models for count data with excess zeroes? Estimation of marginal effects, model misspecification, and model selection. Health Serv. Outcomes Res. Methodol. 18, 175–214 (2018).
Smith, V. A. & Preisser, J. S. A marginalized two-part model with heterogeneous variance for semicontinuous data. Stat. Methods Med. Res. 28(5), 1412–1426. https://doi.org/10.1177/0962280218758358 (2018).
Smith, V. A., Preisser, J. S., Neelon, B. & Maciejewski, M. L. A marginalized two-part model for semicontinuous data. Stat. Med. 33(28), 4891–4903 (2014).
Statistical Centre of Iran > Metadata > Statistical Survey > Household, Expenditure and Income. https://www.amar.org.ir/english/Metadata/Statistical-Survey/Household-Expenditure-and-Income
Liu, L., Ma, J. Z. & Johnson, B. A. A multi-level two-part random effects model, with application to an alcohol-dependence study. Stat. Med. 27(18), 3528–3539 (2008).
Gebregziabher, M. et al. Joint modeling of multiple longitudinal cost outcomes using multivariate generalized linear mixed models. Health Serv. Outcomes Res. Methodol. 13, 39–57 (2013).
Liu, L., Strawderman, R. L., Cowen, M. E. & Shih, Y.-C.T. A flexible two-part random effects model for correlated medical costs. J. Health Econ. 29(1), 110–123 (2010).
Manning, W. G., Basu, A. & Mullahy, J. Generalized modeling approaches to risk adjustment of skewed outcomes data. J. Health Econ. 24(3), 465–488 (2005).
Neelon, B., O’Malley, A. J. & Smith, V. A. Modeling zero-modified count and semicontinuous data in health services research part 2: Case studies. Stat. Med. 35(27), 5094–5112 (2016).
Li, N., Elashoff, D. A., Robbins, W. A. & Xun, L. A hierarchical zero-inflated log-normal model for skewed responses. Stat. Methods Med. Res. 20(3), 175–189 (2011).
Malehi, A. S., Pourmotahari, F. & Angali, K. A. Statistical models for the analysis of skewed healthcare cost data: A simulation study. Health Econ. Rev. 5, 1–16 (2015).
Chai, H. S. & Bailey, K. R. Use of log-skew-normal distribution in analysis of continuous data with a discrete component at zero. Stat. Med. 27(18), 3643–3655 (2008).
Manning, W. G. et al. A two-part model of the demand for medical care: Preliminary results from the health insurance study. Health Econ. 137, 103–123 (1981).
Vuong, Q. H. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica. 57(2), 307 (1989).
Smith, V. A., Neelon, B., Maciejewski, M. L. & Preisser, J. S. Two parts are better than one: Modeling marginal means of semicontinuous data. Health. Serv. Outcomes Res. Methodol. 17, 198–218 (2017).
Li, N., Elashoff, D. A., Robbins, W. A. & Xun, L. A hierarchical zero-inflated log-normal model for skewed responses. Stat. Methods Med. Res. 20(3), 175–189. https://doi.org/10.1177/0962280208097372 (2008).
Raudenbush, S. W., Yang, M.-L. & Yosef, M. Maximum likelihood for generalized linear models with nested random effects via high-order, multivariate Laplace approximation. J. Comput. Graph. Stat. 9(1), 141–157 (2000).
Bock, R. D. Maximum Marginal Likelihood Item Factor Analysis By Adaptive Quadrature Stephen Schilling school of Education (University of Michigan, 2005). https://api.semanticscholar.org/CorpusID:58914630
Olsen, M. K. & Schafer, J. L. A two-part random-effects model for semicontinuous longitudinal data. J. Am. Stat. Assoc. 96(454), 730–745 (2001).
Tooze, J. A., Grunwald, G. K. & Jones, R. H. Analysis of repeated measures data with clumping at zero. Stat. Methods Med. Res. 11(4), 341–355 (2002).
Hubin, A. & Storvik, G. Estimating the Marginal Likelihood with Integrated nested Laplace Approximation (INLA). arXiv161101450. (2016).
Wang, Y. Maximum likelihood computation based on the Fisher scoring and Gauss-Newton quadratic approximations. Comput. Stat. Data Anal. 51(8), 3776–3787 (2007).
R Core Team. R: A Language and Environment for Statistical Computing. (R Foundation for Statistical Computing, 2013).
Datawrapper: Create charts, maps, and tables. (2024). https://www.datawrapper.de/
Burnham, K. P. & Anderson, D. R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (Springer, 2002).
Tom, B. D. M., Su, L. & Farewell, V. T. A corrected formulation for marginal inference derived from two-part mixed models for longitudinal semi-continuous data. Stat. Methods Med. Res. 25(5), 2014–2020 (2016).
Xie, Y., Zhang, Z., Rathouz, P. J. & Barrett, B. P. Multivariate semi-continuous proportionally constrained two-part fixed effects models and applications. Stat. Methods Med. Res. 28(12), 3516–3533 (2019).
Merlo, L., Maruotti, A. & Petrella, L. Two-part quantile regression models for semi-continuous longitudinal data: A finite mixture approach. Stat. Model. 22(6), 485–508. https://doi.org/10.1177/1471082X21993603 (2021).
Richman, I. B. & Brodie, M. A national study of burdensome health care costs among non-elderly Americans. BMC Health Serv. Res. 14, 1–7 (2014).
Na, L. et al. Disparities in receipt of recommended care among younger versus older medicare beneficiaries: A cohort study. BMC Health Serv. Res. 17, 1–13 (2017).
Faraji, M. et al. Out-of-pocket pharmaceutical expenditure and its determinants among Iranian households with elderly members: A double-hurdle model. Cost Eff. Resour. Alloc. 22(1), 1–9 (2024).
Brockmann, H. Why is less money spent on health care for the elderly than for the rest of the population? Health care rationing in German hospitals. Soc. Sci. Med. 55(4), 593–608 (2002).
Mueller, C. W., Charron-Chénier, R., Bartlett, B. J. & Brown, T. H. Budgetary consequences of high medical spending across age and social status: evidence from the consumer expenditure surveys. Gerontologist. 60(7), 1322–1331 (2020).
Blackburn, J. & Choi, S. Patterns and factors associated with medical expenses and health insurance premium payments. Financ. Couns. Plan. 29(1), 6–18 (2018).
Lee, W.-C., Jiang, L., Phillips, C. D. & Ohsfeldt, R. L. Rural-Urban differences in health care expenditures: Empirical data from US households. Adv. Public Health. 2014, 1–10 (2014).
Hartley, D., Quam, L. & Lurie, N. Urban and rural differences in health insurance and access to care. J. Rural Health. 10(2), 98–108 (1994).
Lu, S., Zhang, Y., Niu, Y. & Zhang, L. Exploring medical expenditure clustering and the determinants of high-cost populations from the family perspective: A population-based retrospective study from rural China. Int. J. Environ. Res. Public Health. 15(12), 2673 (2018).
Halliday, T. J. & Park, M. Household size, home health care, and medical expenditures. Inst Study Labor (IZA) Univ Hawai’i Manoa. http://www.Economic.hawaii.edu/research/workingpapers/WP_09-16.pdf. (2009).
Powers, C. A., Meyer, C. M., Roebuck, M. C. & Vaziri, B. Predictive modeling of total healthcare costs using pharmacy claims data: A comparison of alternative econometric cost modeling techniques. Med Care. 43(11), 1065–1072 (2005).
Lin, C., Hsu, S. & Takao, A. A review and comparison of medical expenditures models: Two neural networks versus two-part models. J. Risk Res. 11(8), 967–982 (2008).
Crawford, A. G., Fuhr, J. P. Jr., Clarke, J. & Hubbs, B. Comparative effectiveness of total population versus disease-specific neural network models in predicting medical costs. Dis. Manag. 8(5), 277–287 (2005).
Garrido, M. M., Deb, P., Burgess, J. F. Jr. & Penrod, J. D. Choosing models for health care cost analyses: Issues of nonlinearity and endogeneity. Health Serv. Res. 47(6), 2377–2397 (2012).
Iddi, S. & Molenberghs, G. A combined overdispersed and marginalized multilevel model. Comput. Stat. Data Anal. 56(6), 1944–1951 (2012).
Duan, Y., Emir, B., Bell, G. & Cabrera, J. twopartm: Two-part model with marginal effects. CRAN Contrib Packag. (2022). https://cran.r-project.org/package=twopartm

Acknowledgements

The authors thank the Statistical Center of Iran (SCI) for providing the data used in this study. This research is part of the Biostatistics MS thesis of Elham Daghaghele and was supported by Ahvaz Jundishapur University of Medical Sciences (AJUMS). Special thanks go to the Research Deputy of AJUMS for providing financial support for this project.

Funding

This research was supported by project U-02034 from Ahvaz Jundishapur University of Medical Sciences. However, the source of funding had no influence on the study design, data collection, analysis and interpretation, writing of the report, or the decision to publish the article.

Author information

Authors and Affiliations

Department of Biostatistics and Epidemiology, School of Health, Ahvaz Jundishapur University of Medical Sciences, Ahvaz, Iran
Elham Daghaghele
Department of Biostatistics and Epidemiology, School of Health, Social Determinants of Health Research Center, Ahvaz Jundishapur University of Medical Sciences, Ahvaz, Iran
Kambiz Ahmadi Angali & Maryam Seyedtabib
Department of Biostatistics and Epidemiology, School of Health, Research Center for Environmental Contaminants (RCEC), Abadan University of Medical Sciences, Abadan, 63198-11154, Iran
Naser Kamyari

Authors

Elham Daghaghele
Kambiz Ahmadi Angali
Naser Kamyari
Maryam Seyedtabib

Contributions

CRediT authorship contribution statement: ED: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Project administration. KA-A: Conceptualization, Data curation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing. NK: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision. MS: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Project administration.

Corresponding authors

Correspondence to Naser Kamyari or Maryam Seyedtabib.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

This study was approved by the Research Ethics Committee (REC) of Ahvaz Jundishapur University of Medical Sciences under the ID number IR.AJUMS.REC.1402.064. Methods used complied with all relevant ethical guidelines and regulations. The Ethics Committee of Ahvaz Jundishapur University of Medical Sciences waived the requirement for written informed consent from study participants.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Daghaghele, E., Angali, K.A., Kamyari, N. et al. Marginalized two part model for analyzing multilevel semicontinuous medical costs in Iranian households. Sci Rep 15, 7491 (2025). https://doi.org/10.1038/s41598-025-91309-0

Received: 14 May 2024
Accepted: 19 February 2025
Published: 03 March 2025
DOI: https://doi.org/10.1038/s41598-025-91309-0
