Introduction

The significance of population wellbeing is gaining widespread recognition globally, prompting governments to broaden their evaluative criteria beyond the traditional measure of GDP (Gross Domestic Product) to assess the overall success of their population1. While GDP and productivity measures continue to be central for policymaking, there is an emerging shift towards a more comprehensive approach that includes the assessment of wellbeing. Initiatives like the Wellbeing Economy Governments partnership (WEGo) exemplify this shift, where national and regional governments collaboratively advance the concept of Wellbeing Economies2. Despite sustained economic growth, New Zealand faces pressing challenges such as high rates of child poverty, homelessness, and suicide. In response, the government introduced its inaugural ‘wellbeing budget’ in 20193, signifying a renewed commitment to prioritising people’s wellbeing alongside economic growth.

Understanding wellbeing presents challenges due to the evolving nature and diverse perspectives around its meaning. Initially, wellbeing was often perceived as positive human functioning, referred to as “eudaimonia,” encompassing aspects such as self-actualisation and autonomy4. Other researchers have integrated eudaimonic and hedonic components, combining aspects of functioning and emotions5. For example, Diener’s tripartite model identified cognitive, positive affect, and negative affect components6, while Seligman’s PERMA model introduced positive emotion, engagement, relationships, meaning, and accomplishment as key dimensions7. Thompson et al.’s dynamic model of ‘flourishing’ further highlights the interplay between positive feelings, effective functioning, external conditions, and personal resources8. This comprehensive perspective suggests that ‘flourishing’ or elevated wellbeing emerges from the interplay of positive emotions and effective functioning within an individual’s unique circumstances and available resources. Thus, a ‘flourishing’ nation indicates elevated wellbeing among its citizens.

Incorporating wellbeing indicators into policy decisions is becoming increasingly prominent in New Zealand. Despite this growing importance, there still exists a considerable gap in our understanding of the factors that influence population wellbeing in the country. This knowledge gap is partially attributed to the scarcity of detailed, population-level wellbeing data. The NZ General Social Survey (GSS), a biennial survey of around 9000 individuals9, offers wellbeing data across twelve domains: health, housing, income and consumption, jobs and earnings, leisure and free time, knowledge and skills, safety and security, social connections, cultural identity, civic engagement and governance, environmental quality, and subjective wellbeing. Designed based on the NZ Living Standards Framework10, which in turn was drawn from the OECD’s framework11, the GSS lays the foundation for wellbeing assessment in New Zealand. In the context of this study, we focus primarily on the subjective wellbeing domain, concentrating on indicators such as life satisfaction, sense of purpose, family wellbeing, and mental wellbeing.

Although the GSS sample is considered nationally representative, certain subgroups of the population (that may be of significant policy interest) remain underrepresented due to limitations in sample size. For instance, it is impractical to examine the wellbeing experiences of individuals living in government-sponsored social housing, because the number of GSS respondents who are also residents of social housing may be very small. Therefore, to assess the impact of government initiatives targeting this specific population sub-group, comprehensive wellbeing measures applicable to the entire population are needed.

To address this challenge, two strategies offer potential solutions. One approach involves collecting regular wellbeing data for the entire population in a census activity; however, this method is resource-intensive and time-consuming. An alternative approach involves leveraging existing routinely collected data to extrapolate GSS wellbeing measures to the broader population. This may be feasible due to New Zealand’s Integrated Data Infrastructure (IDI): a complex database managed by Stats NZ12. The IDI contains individual response data (microdata) on people and households, supplemented with anonymised information on education, income, health, justice, and housing. Notably, the IDI facilitates dataset linkage across these areas using a unique identifier variable. Details about this linking process are available elsewhere13. Crucially, the IDI houses the GSS data, allowing linkage with the country’s Census data which the majority of the nation’s population completes (given it is a legal requirement to do so).

The Census is a comprehensive nationwide survey conducted once every five years in New Zealand, with the primary aim of officially counting individuals and households in the country14. It also provides a snapshot of various aspects of life, including demographic information, educational qualifications, employment status, and more. Additionally, the Census gathers data on addresses for each household, which are then aggregated at the meshblock level for reporting purposes. A meshblock represents the smallest administrative geographical unit, typically encompassing about 30 to 60 households15. Environmental data, such as the extent of green spaces, are also available at the meshblock level and can therefore be linked to the Census data. One notable example is the Healthy Location Index, which captures accessibility to health-promoting elements (e.g., green spaces, physical activity facilities) and health-constraining elements (e.g., alcohol outlets, fast-food shops)16. The ability to link such key environmental information to the Census is crucial, given the established links between the environment and wellbeing17,18.

The aim of this study is to predict GSS-derived wellbeing measures from Census-based sociodemographic information and meshblock-level environmental indicators. It is important to note that this study does not make causal claims or explore the determinants of subjective wellbeing; instead, it is purely predictive. If successful, such a predictive model could be used to extrapolate these predicted wellbeing scores to the entire IDI population, thereby creating a population-level estimate of subjective wellbeing. This could yield transformative benefits by facilitating the integration of wellbeing metrics into policy analysis. It also holds the potential to significantly enhance our understanding of how the political, social, and economic landscape impacts the wellbeing and overall functioning of individuals in New Zealand. This would further empower decision-makers to formulate more informed, targeted, and effective policies that address the genuine needs and concerns of New Zealanders.

Methods

Data sources

The data used in this study were sourced from three datasets: the New Zealand General Social Survey (GSS)9, the New Zealand Census of Population and Dwellings14, and the Healthy Location Index (HLI)19. Of these, two are present in the New Zealand Integrated Data Infrastructure (IDI), namely the GSS and the Census. All datasets within the IDI are structured as tables in an SQL database and can be linked with one another using the Stats NZ unique identifier variable20. All datasets within the IDI can be accessed only from a Stats NZ data laboratory. A formal application to access the IDI datasets and the IDI data laboratory was submitted to and approved by Stats NZ. The methodology used in this research was approved by the AUT University Ethics Committee (AUTEC #21/115).

The study used GSS data from the 2018 survey wave, with a sample size of 8,793. More information regarding the GSS and its data collection methodology can be found elsewhere21,22. The wellbeing outcome variables, the unique identifier variable (snz_uid), and the meshblock_code variable were selected from the GSS. The subjective wellbeing outcome variables investigated in this study are listed in Table 1.

Table 1 GSS wellbeing outcome measures.

Next, the Census 2018 dataset was utilised in this study. Further details about the Census and its methodology are available elsewhere25. The size of the dataset was approximately 4.9 million observations with over 300 variables, of which 29 demographic variables were selected as predictors. The choice of these variables was guided by their availability for most of the population. To enhance interpretability, some variables were consolidated into fewer categories due to low counts in some specific categories. Table 2 shows the full list of demographic variables used in the study.

Table 2 Predictor variables from the Census 2018 dataset.

Lastly, environmental data were acquired from the Healthy Location Index (HLI) dataset19. As this dataset is not present in the IDI, it was imported into the IDI data environment by Stats NZ. The HLI provides a rank (ranging between 1 and 52,593) for every New Zealand meshblock (excluding oceanic meshblocks). This ranking is determined by each meshblock’s proximity to both health-promoting features of the environment (e.g., physical activity facilities) and health-constraining features (e.g., fast-food and takeaway outlets). The ranking method is a straightforward, transparent way of comparing meshblocks in terms of their accessibility to these environmental factors. More details about this dataset and the methodology used to develop the measure can be found elsewhere16. A total of 13 environmental variables (shown in Table 3) were used as predictors in this study. All of these variables were measured in deciles, ranging from 1 (the highest decile, indicating the closest proximity to the environmental feature) to 10 (the lowest decile, indicating the farthest distance from the feature).

Table 3 Environment related variables from the Healthy Location dataset (HLI).

The GSS dataset was linked with the Census using the unique identifier variable (snz_uid) and to the HLI dataset using the meshblock number. After linking, the combined dataset underwent a cleaning process to ensure data quality and consistency. Observations with missing values were removed (n = 3,135), and ‘unknown’ or ‘did not answer’ response categories were excluded, resulting in a final dataset of 5,658 observations and 42 predictor variables (29 Census variables and 13 HLI variables). The demographic distribution of the final dataset (shown in Supplementary table S-1) closely resembles that of the GSS 2018 dataset, indicating a balanced representation of most demographic sub-groups without any noticeable over- or under-representation.
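For illustration, a minimal sketch of this linkage and cleaning step is given below. The data frame and column names (gss, census, hli, meshblock_code) are assumptions for the example; in practice the tables are queried from the IDI SQL database inside the Stats NZ data laboratory.

```r
# Minimal sketch of the linkage and cleaning step (illustrative object
# and column names; not the actual IDI table definitions).
library(dplyr)
library(tidyr)

analysis_data <- gss %>%
  # Person-level link between GSS respondents and their Census 2018 records.
  inner_join(census, by = "snz_uid") %>%
  # Area-level link to the meshblock-ranked HLI deciles.
  left_join(hli, by = "meshblock_code") %>%
  # Remove observations with any missing values.
  drop_na() %>%
  # Drop 'unknown' / 'did not answer' categories (labels assumed here).
  filter(if_all(where(is.factor), ~ !.x %in% c("Unknown", "Did not answer"))) %>%
  droplevels()
```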

Modelling

The development of precise predictive models is pivotal in extrapolating GSS data to the broader population. A robust predictive model assists in uncovering patterns within the dataset and establishes a solid foundation for reliable extrapolation. In this study, we employed three distinct predictive models with varying degrees of complexity: (1) Stepwise Linear Regression, (2) Elastic Net Regression, and (3) Random Forest. The modelling process described below was repeated for each of the four wellbeing outcome variables separately (life satisfaction, life worthwhileness, family wellbeing, and mental wellbeing). These models were chosen because of the substantial number of predictor variables (n = 42) and their ability to handle variable selection effectively. Furthermore, the inclusion of Random Forests allowed us to evaluate their ability to model non-linear relationships and complex interactions, compared with traditional regression models, for predicting subjective wellbeing outcomes.

To begin, the Stepwise Linear Regression method was utilised. It employed an iterative forward and backward selection process to add and remove predictor variables using the Akaike Information Criterion (AIC) as the selection criterion. This ensured that variables were retained or removed based on their joint contribution to model fit, ultimately yielding a subset of relevant variables27. The selection process is automated and driven by statistical criteria rather than arbitrary choice, making it suitable for situations with numerous potential predictors. For more detailed information on this model, please refer to Draper and Smith (1998)28. Next, we incorporated the Elastic Net Regression model to evaluate its predictive performance in comparison to the Stepwise method. Elastic Net regression offers a unique set of advantages over other regression methods because it combines the Lasso (L1) and Ridge (L2) regularisation penalties29. This combination facilitates automatic variable selection, enhances model interpretability, and reduces overfitting, making it particularly well suited to regression tasks involving high-dimensional data29. Lastly, we introduced a Random Forest model to compare its performance against the traditional regression models. The Random Forest is an ensemble learning technique that constructs multiple decision trees and aggregates their predictions to enhance accuracy and reduce overfitting30. The Random Forest is effective at handling high-dimensional data as it has inbuilt variable selection, and it can capture complex non-linear relationships between variables more effectively than traditional linear regression techniques31.
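For context, the elastic net penalty referred to above can be written in the parameterisation used by the glmnet package (a standard formulation provided here for reference, not reproduced from the study):

$$\min_{\beta_0,\,\beta}\;\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 \;+\; \lambda\left[\alpha\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\lVert\beta\rVert_2^2\right],$$

where $\alpha = 1$ corresponds to the Lasso, $\alpha = 0$ to Ridge regression, and $\lambda$ controls the overall strength of the penalty; both $\alpha$ and $\lambda$ are tuned during cross-validation.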

All models were implemented using the train function in the R package ‘caret’ (version 6.0-94), with the appropriate ‘method’ argument specified as follows: Stepwise regression: ‘glmStepAIC’, Elastic Net: ‘glmnet’ (version 4.1-8), and Random Forest: ‘rf’ (using the randomForest package, version 4.7-1). Furthermore, to mitigate class imbalances inherent in the dataset, class weights were computed as the inverse of the class frequencies and integrated into the model training process. These weights, operationalised through the ‘weights’ parameter of the train function, recalibrate the model’s focus towards underrepresented classes, thereby improving accuracy in predicting these classes. For instance, an outcome score reported by substantially fewer respondents than others is assigned a higher weight, so that errors on that score contribute more to the training loss. This adjustment is intended to foster a balanced predictive performance, counteracting the model’s inherent tendency to bias predictions in favour of overrepresented classes.
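As a toy illustration of this inverse-frequency weighting (the response vector below is invented for the example and is not drawn from the GSS):

```r
# Toy example of inverse-frequency case weights for a clustered outcome.
scores <- c(7, 8, 8, 7, 7, 10, 8, 7, 2, 8)            # illustrative responses
freq   <- table(scores)                                # frequency of each score
w      <- as.numeric(1 / freq[as.character(scores)])   # rarer scores get larger weights
round(w, 2)
#> [1] 0.25 0.25 0.25 0.25 0.25 1.00 0.25 0.25 1.00 0.25
# The resulting vector is passed to caret::train() via its 'weights' argument.
```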

Firstly, the dataset was split into a training set and a testing set in a 70:30 ratio. The training set, consisting of 70% of the data (n = 3963 observations), was subjected to a tenfold cross-validation process to evaluate and select the best model parameters. During this cross-validation process, various combinations of hyperparameters (e.g., mtry and ntree values for the random forest model) were evaluated, and the optimal values (those that yielded the lowest root mean squared error) were selected. The final models were then trained on the entire training dataset using these optimal parameters. The performance of the final models was evaluated on the testing dataset (n = 1695 observations) to assess their predictive capabilities and generalisation to unseen data. For the Random Forest, the importance of each variable was estimated using the varImp function in the ‘caret’ R package, which evaluates the contribution of each predictor to the overall predictive performance of the model. Specifically, for these regression forests, importance reflects the increase in prediction error when a variable’s values are permuted, or the total decrease in node impurity attributable to splits on that variable. Two variations of each model were fitted, one incorporating environment-related variables from the HLI dataset and one excluding the HLI indicators, to examine how environmental data affected predictive performance. The performance of all models was assessed using root mean squared error (RMSE), mean absolute error (MAE), and R-squared (R2). As a further check, the Pearson correlation between the observed and predicted values was also evaluated.
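A hedged sketch of this training and evaluation workflow, using the Random Forest as the example model, is given below. The object and column names (analysis_data, life_satisfaction) and the tuning grid are assumptions for illustration rather than the actual IDI variable names or settings.

```r
# Sketch of the 70:30 split, tenfold cross-validation, and test-set
# evaluation described above (illustrative names and tuning values).
library(caret)

set.seed(2018)
idx        <- createDataPartition(analysis_data$life_satisfaction, p = 0.7, list = FALSE)
train_data <- analysis_data[idx, ]
test_data  <- analysis_data[-idx, ]

# Inverse-frequency weights computed on the training set; caret applies
# case weights only for models that support them.
freq <- table(train_data$life_satisfaction)
w    <- as.numeric(1 / freq[as.character(train_data$life_satisfaction)])

ctrl <- trainControl(method = "cv", number = 10)          # tenfold cross-validation

rf_fit <- train(
  life_satisfaction ~ .,
  data      = train_data,
  method    = "rf",                                       # randomForest
  weights   = w,
  trControl = ctrl,
  tuneGrid  = expand.grid(mtry = c(6, 14, 28)),           # illustrative mtry grid
  metric    = "RMSE"                                      # select the lowest-RMSE candidate
)

# Evaluate generalisation on the held-out 30% test set.
pred <- predict(rf_fit, newdata = test_data)
postResample(pred, test_data$life_satisfaction)           # RMSE, R-squared, MAE
cor.test(pred, test_data$life_satisfaction)               # Pearson correlation check

# Variable importance for the Random Forest.
varImp(rf_fit)
```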

Code availability

Code associated with this study is available as a supplementary file. However, given that the analysis was carried out within the Stats NZ data laboratory environment, the fitted models are not publicly available. For more details, please refer to the Data Availability section.

Results

We employed three distinct models to predict four wellbeing variables: life satisfaction, life worthwhileness, family wellbeing, and mental wellbeing. Table 4 provides a summary of both the observed mean and standard deviation, alongside the predicted values for all models. Notably, the Random Forest model exhibited superior performance, with predictions that were closely aligned with the observed values. These results were obtained through the evaluation of model performance on the test dataset, comprising 30% of the original dataset (n = 1695).

Table 4 Descriptive statistics (obtained from the testing dataset, n = 1695) for observed and predicted wellbeing variables.

Table 5 provides an overview of the performance metrics for all predictive models. Notably, the Random Forest models demonstrated stronger performance, with lower RMSE values (ranging between 1.5 and 1.6 for life satisfaction, life worthwhileness, and family wellbeing). However, the R-squared (R2) values were relatively low (~ 0.006), suggesting that these models had limited explanatory capabilities. The traditional models (Stepwise regression and Elastic Net) produced higher RMSE values (~ 2.5) and even lower R2 values (< 0.003) for these wellbeing variables. Table 5 displays the results with and without the inclusion of environmental variables for the Random Forest model only, given that it was the best-performing model. The incorporation of environmental features had a negligible impact on the model’s predictive capacity. Furthermore, we assessed the correlation between the observed and predicted values produced by the Random Forest model (without environmental variables). This correlation was weak to moderate, ranging from 0.202 to 0.250 across all wellbeing outcome variables (all p < 0.05). Supplementary table S-2 shows the importance of the top 10 predictor variables in the Random Forest model.

Table 5 Model performance metrics.

Discussion

The primary aim of this study was to evaluate the predictive efficacy of population-level socio-demographic variables in predicting GSS-based subjective wellbeing outcomes, encompassing life satisfaction, life worthwhileness, family wellbeing, and mental wellbeing. This analysis was augmented by incorporating environmental data from the Healthy Location Index. The study employed three distinct predictive models: Stepwise Regression, Elastic Net, and Random Forest. Our results demonstrated the models’ ability to predict wellbeing outcomes, as evidenced by their low RMSE values, using a concise set of easily accessible socio-demographic variables from the Census. However, the low R2 values suggest a constrained capacity to account for the extensive variability in the dependent variables. In practical terms, while the models are adept at approximating group-level averages with reasonable precision (an approach relevant for policy applications), they fail to capture the underlying dynamics or variance in individual-level wellbeing outcomes, which is critical for tailoring interventions and understanding subjective wellbeing in depth. This limitation may be influenced by various factors, including dataset characteristics, as discussed in subsequent sections. Notably, this aligns with findings from Lundberg et al. (2024) and Salganik et al. (2020), which highlight that even advanced models may struggle to explain the variability in subjective outcomes due to irreducible error32,33. While our findings emphasise the need for further improvements in predictive modelling, they also underscore the fundamental limits of explainability for subjective and multidimensional outcomes due to the complex and dynamic nature of human lives.

In our investigation, Random Forest models outperformed conventional modelling techniques like Elastic Net and Stepwise regression in terms of predictive capability. This may be because random forest algorithms are capable of capturing complex nonlinear relationships in the data, handling multicollinearity, and reducing overfitting through their ensemble nature31,34. While previous studies have employed similar methodologies to predict clinical outcomes such as the incidence of cardiovascular diseases and other chronic conditions35, our study stands out by predicting subjective wellbeing outcomes (e.g., life satisfaction) from a straightforward set of demographic variables.

The inclusion of environmental variables from the HLI dataset did not result in a significant improvement in model performance when compared to models that relied solely on socio-demographic factors. Yet, these environmental variables ranked among the top 10 important predictors when assessed using the varImp function. This suggests that while environmental factors are associated with subjective wellbeing outcomes, the predictive information they carry largely overlaps with that of the socio-demographic predictors. Prior research has indicated a connection between the HLI indicators and deprivation16, which is itself primarily determined using socio-demographic indicators such as education, income, and housing data from the Census. Given that a range of these Census-level socio-demographic variables were already included in our analyses, the environmental variables may not have offered additional information beyond what was already captured through the Census data.

Additionally, it is worth noting that the environmental variables from the HLI dataset primarily capture proximity to various environmental elements but do not consider the total number, variety, or quality of such facilities. The overall extent and quality of green/blue space within an area are known to be related to mental health36,37, and studies have established the importance of environmental factors in influencing an individual’s mental health38,39. It should also be noted that the HLI is an area-level measure, yet we were predicting individual-level outcomes, which could also have attenuated the apparent effect of the environment. Future studies could explore the utility of a more nuanced selection of environmental variables in the modelling process.

Although our predictions were reasonable, there are limitations in our approach that should be discussed. Firstly, the wellbeing data from the GSS 2018 dataset used to train the models did not have a uniform distribution of responses across the measurement scale. For instance, the outcome variable ‘life satisfaction’ ranged from 1 to 11, and over 50% of respondents reported a score of either 7 or 8. This imbalance may be inherent to the subjective nature of the question. Despite incorporating weights into the model training process, the majority of our predictions tended to cluster around scores of 7 and 8. Since this range of values closely aligns with that of the observed values, the models achieved a relatively low RMSE (< 1.6). However, the low correlation between the observed and predicted values (0.20–0.25) suggests that the model predictions within this narrow range were only weakly associated with the observed data. This discrepancy can likely be attributed to the limited range of values present in the GSS dataset. It is also worth noting that while our predictions typically fell within a 1–2-point range of the true scores, this apparent accuracy could be misleading: because the true scores themselves predominantly fell within this same 1–2-point range, the proportional error is relatively high. While this study did not include explicit uncertainty quantification, future work could employ methods such as bootstrap resampling to estimate confidence intervals for predictions. These techniques could provide additional insights into the variability and robustness of model outputs, particularly in contexts where subjective outcomes are clustered within a narrow range.
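One simple variant of such a bootstrap, resampling the held-out test cases to obtain a percentile interval for the test-set RMSE, is sketched below (the object names follow the hypothetical workflow outlined in the Methods and are assumptions for illustration):

```r
# Sketch of a test-set bootstrap to quantify uncertainty in the RMSE
# estimate (assumes rf_fit and test_data from the earlier workflow sketch).
set.seed(123)
n_boot    <- 1000
boot_rmse <- numeric(n_boot)

pred <- predict(rf_fit, newdata = test_data)
obs  <- test_data$life_satisfaction

for (b in seq_len(n_boot)) {
  i            <- sample(length(obs), replace = TRUE)    # resample test cases with replacement
  boot_rmse[b] <- sqrt(mean((obs[i] - pred[i])^2))       # RMSE on the bootstrap sample
}

quantile(boot_rmse, c(0.025, 0.975))                     # percentile 95% interval for RMSE
```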

Understanding subjective wellbeing, especially when collected through surveys, is complex. Unlike quantifying tangible health conditions (e.g., cardiovascular disease, obesity, diabetes), subjective wellbeing relies on self-reported responses, which can vary based on how an individual interprets the question. For example, two people who choose scores of 7 might perceive those scores differently. Moreover, a lower score might not necessarily indicate less satisfaction relative to another person; it could simply reflect an individual’s unique understanding of the scale. Without a benchmark for validation, it is challenging to confidently interpret model results. Another important consideration is that these outcome scores are intended to reflect an individual’s overall wellbeing experience over time, not just their feelings on the day of the survey. However, someone generally satisfied with life might choose a lower score if recent unpleasant events influenced their mood. The subjective nature of these outcomes makes their validation difficult.

Another limitation arises from the dataset cleaning process, particularly the exclusion of nearly 3% of the Māori population due to missing values (see Supplementary table S-1). This exclusion could have introduced bias into the model’s predictions and overall outcomes. Similarly, another limitation pertains to the Census 2018 dataset, which had a lower response rate than expected. To address this challenge, Stats NZ employed alternative strategies to impute missing data. These strategies involved leveraging other available microdata within the Integrated Data Infrastructure (IDI) to fill in the gaps and enhance the completeness of the dataset. Although the data imputation process is beneficial, it introduces a potential source of bias or uncertainty in our results, as the imputed values may not accurately capture the true characteristics of the non-respondent population. Further information regarding this issue can be found in "2018 Census collection response rates unacceptably low" by Stats NZ (2018)40.

To enhance predictive performance, future studies could explore additional analyses, improved data handling, and alternative feature-engineering strategies. For instance, while the re-weighting strategy used in this study aims to address the underrepresentation of certain classes in the GSS dataset, it does not eliminate biases inherent in the original data, potentially leading to the replication of these biases in model predictions. An alternative approach, such as bootstrapped re-sampling41, could be used to explore the impact of synthetic samples on prediction; one such approach is sketched below. Future work should compare re-sampling techniques, synthetic data generation, and augmentation methods to address class imbalance and underlying biases more effectively while preserving model robustness. Additionally, this study addressed missing data by removing incomplete observations to maintain dataset consistency. While this approach is straightforward, it may have introduced bias by excluding certain groups of respondents. Future work could explore alternative methods, such as imputation techniques, to better understand the potential impact of missing data on model performance and ensure the robustness of findings.
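As an illustration of one such re-sampling alternative, the sketch below over-samples training rows in inverse proportion to the frequency of their outcome score (hypothetical object names carried over from the earlier workflow sketch; this flattens the score distribution but does not remove biases present in the original data):

```r
# Sketch of inverse-frequency over-sampling as an alternative to case weights
# (assumes train_data and life_satisfaction from the earlier workflow sketch).
set.seed(456)
freq <- table(train_data$life_satisfaction)
p    <- as.numeric(1 / freq[as.character(train_data$life_satisfaction)])  # sampling probabilities

idx_bs         <- sample(nrow(train_data), size = nrow(train_data),
                         replace = TRUE, prob = p)                        # bootstrap-style resample
train_balanced <- train_data[idx_bs, ]

table(train_balanced$life_satisfaction)   # distribution of scores is now flatter
```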

Next, a broader range of demographic variables could be considered to provide a more comprehensive representation of individual characteristics. One area for future exploration could be examining how different treatments of the outcome variable impact the model’s predictive accuracy. For instance, our model used life satisfaction as a continuous variable, rather than categorizing it (e.g., low, medium, high). However, establishing thresholds for these classes could be uncertain and may require guidance from industry experts. Additionally, creating composite indices that capture multiple dimensions of wellbeing, or integrating other data sources available in the IDI (e.g., health data) as predictors could potentially lead to improved model performance.

While the IDI and Census provide rich, granular datasets, their reliance on periodic collection (such as the five-year census interval in New Zealand) presents a fundamental limitation for generating real-time or frequently updated predictions. Addressing this limitation may involve exploring complementary data sources, such as Earth observation (EO) data, which have been effectively combined with machine learning to estimate health and living conditions42,43,44. Lastly, considering alternative modelling techniques beyond the ones explored in this study, such as neural networks or support vector machines, may provide further insights into predicting wellbeing outcomes and improve model performance.

Conclusion

Our findings indicate that a Random Forest model, in conjunction with census-level socio-demographic variables, yields moderate predictive efficacy for a range of GSS-based subjective wellbeing measures. This outcome underscores the potential of this methodological approach. However, it is imperative to acknowledge limitations arising from the subjective nature and distribution characteristics of the outcome variables. While our study offers valuable insights into predicting wellbeing outcomes using predictive modelling techniques, there is significant scope for improvement. By refining the modelling approach, incorporating more diverse data sources (e.g., health records within the IDI), and employing advanced analytical methods (e.g., deep learning), future research can contribute to a more accurate and comprehensive understanding of population wellbeing and offer robust tools for evidence-based policymaking.