Introduction

Deciphering the demographic history of natural populations from genetic polymorphism data remains an ongoing challenge in population genetics, owing to the complexity of these data and of their underlying processes. A better understanding of past demographic events can provide insights into specific questions such as the impact of anthropogenic processes (Pujolar et al. 2017; Dong et al. 2021) or climatic events (Bai et al. 2018; Fedorov et al. 2020) on population dynamics. It can also inform conservation biology (Der Sarkissian et al. 2015; Abascal et al. 2016).

The exploration of genetic diversity also sheds light on the intricate dynamics of human populations (Cavalli-Sforza et al. 1994; Henn et al. 2012; Schraiber and Akey 2015). Analyzing polymorphisms in the genomes of modern human populations enables us to infer past demographic and evolutionary events on different timescales, including colonization, separation, migration and expansion events, as well as adaptive processes. As such, reconstructing the demographic history of populations is essential to disentangle the effects of demography from those of selection (Akey et al. 2004; Lohmueller 2014; Johri et al. 2023).

Taking full advantage of the information contained in large genomic data sets remains a methodological and computational challenge. Several kinds of inference methods have been developed to infer past demographic events from these genomic data (Beichman et al. 2018; Marchi et al. 2021). Some of these methods are based on the sequential Markovian coalescent (SMC) model (McVean and Cardin 2005; Marjoram and Wall 2006; Li and Durbin 2011; Sheehan et al. 2013; Schiffels and Durbin 2014; Terhorst et al. 2017). They use the pattern of diversity along the genomes to estimate the inverse coalescent rate through time, which is a proxy for past population sizes under specific assumptions (Chikhi et al. 2018). Other methods use the lengths of regions that are identical-by-state (IBS) or identical-by-descent (IBD) between haplotypes (Palamara et al. 2012; Harris and Nielsen 2013; Browning and Browning 2015). Finally, some methods rely solely on the site frequency spectrum (Gutenkunst et al. 2009; Liu and Fu 2015). Unlike the other methods, they do not take into account the linkage disequilibrium among SNPs.

Approximate Bayesian computation (ABC) is a simulation-based method used when the likelihood function is unavailable or too expensive to compute. It initially emerged to tackle population genetics issues (Tavaré et al. 1997; Pritchard et al. 1999; Beaumont et al. 2002) and was then used in various domains such as ecology (Jabot and Lohier 2016; Wood and Simon 2018), epidemiology (Tanaka et al. 2006; McKinley et al. 2018), systems biology (Toni et al. 2008; Liepe et al. 2014) and linguistics (Thouzeau et al. 2017, 2022). This framework appears particularly efficient when dealing with complex demographic models (Beaumont 2010; Bertorelle et al. 2010). It relies on the comparison of summary statistics computed on the genomic data set under investigation and on genomic data sets simulated under a given demographic model, for which the parameters are generally drawn from uninformative prior distributions. In the classical rejection sampling approach, the posterior distributions of the parameters are estimated by keeping only the simulations yielding summary statistics close to those from the real data (Tavaré et al. 1997). More elaborate methods were subsequently designed (Blum and Francois 2010; Estoup et al. 2012).

More recently, simulation-based approaches combined with supervised machine learning algorithms were developed for demographic inference (Schrider and Kern 2018; Collin et al. 2021; Korfmann et al. 2023). Some of these approaches are based on random forests (Pudlo et al. 2016; Raynal et al. 2019; Ghirotto et al. 2021), while others use neural networks (Sheehan and Song 2016; Mondal et al. 2019; Sanchez et al. 2021). Although machine learning (ML) methods for demographic inference have gained ground in recent years, little attention has been paid to comparing their characteristics and performance.

In this study, we focused on the inference of demographic parameters using coalescent simulations for models in which an ancestral population splits into two at some point in the past, with these populations remaining connected by migration. We implemented three machine-learning methods aimed at inferring demographic parameters from a large number of summary statistics computed on genomic data. These methods were based, respectively, on random forest (Breiman 2001), XGBoost (Chen and Guestrin 2016), and multilayer perceptron (Haykin 1994). To our knowledge, this is the first time that XGBoost has been used in this context.

After comparing their overall performance and dissecting the impact of demographic parameters on the results, we analyzed which summary statistics contributed most and introduced a method to evaluate their influence on predictions.

Methods

Broadly, our approach consists of (i) producing many simulated datasets under a demographic model, (ii) computing summary statistics on these simulated datasets, and (iii) using these summary statistics as a training dataset for three machine-learning methods.

Demographic models

We first considered a standard model of isolation with migration (IM) (Fig. 1A), where an ancestral population splits into two populations at a given time (Split_time). We assumed that, right after this splitting event, a continuous and symmetrical migration rate (Migration_rate), defined as the proportion of individuals moving from one population to the other per generation, was maintained between the two populations until the present. These two populations were assumed to have distinct effective population sizes (N_current_1 and N_current_2), themselves distinct from the ancestral effective population size (N_ancestral). In this model, effective population sizes were assumed to be constant except for the change at split time.

Fig. 1: Demographic models with the varying demographic parameters.

A Isolation with migration (IM) model. B Secondary contact with varying population sizes (SC) model. Both populations can either grow or decline. Split_time is expressed in generations, Migration_rate is defined per generation, and Growth_rate corresponds to a per-generation exponential rate.

We then considered a more complex model of secondary contact with varying population sizes (SC) (Fig. 1B). In this model, the effective population sizes of the two populations resulting from the split were no longer assumed to be constant but instead to increase or decrease exponentially, with different growth rates for the two populations. Furthermore, the populations did not undergo gene flow immediately after their separation: migration started at a later stage and lasted until the present time. This allowed us to assess the performance of the ML methods on a more complex scenario.

Simulations

We generated 10,000 simulated data sets using msprime (Kelleher et al. 2016; Baumdicker et al. 2021) for each of the two demographic models. For the simulated data sets under the IM model, the five parameters were drawn from uniform prior distributions: Uniform[1;5000] for Split_time, corresponding to the number of generations since the split occurred; Uniform[100;10,000] for N_ancestral, N_current_1, and N_current_2; and Uniform[0;0.001] for Migration_rate (per generation). While other prior distributions could be considered, we opted for uniform distributions to ensure an even exploration of the parameter space. Note that, because the lower bound of the Split_time prior is 1 generation, predictions can be particularly difficult in such extreme scenarios. For each parameter set, we simulated 20 independent neutral loci for a sample of 10 diploid individuals in each population. Each locus was a genomic sequence of length 2 Mb, with a constant recombination rate of 1.0 × 10−8 per bp per generation and a constant mutation rate of 1.25 × 10−8 per bp per generation, which are standard values for human genomes (Campbell et al. 2012; Kong et al. 2012; Scally and Durbin 2012; Schiffels and Durbin 2014).
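Drawing parameter sets from these priors can be sketched as follows (a minimal illustration using NumPy; the function name is ours, and the msprime simulation call that consumes each parameter set is omitted):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def draw_im_parameters(n_sets):
    """Draw IM-model parameter sets from the uniform priors described above."""
    return {
        "Split_time": rng.uniform(1, 5000, n_sets),        # generations since the split
        "N_ancestral": rng.uniform(100, 10_000, n_sets),
        "N_current_1": rng.uniform(100, 10_000, n_sets),
        "N_current_2": rng.uniform(100, 10_000, n_sets),
        "Migration_rate": rng.uniform(0, 0.001, n_sets),   # per generation
    }

# One draw per simulated data set; each set would then be passed to msprime.
params = draw_im_parameters(10_000)
```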

For the SC model, the five parameters already present in the IM model were drawn from uniform prior distributions that differed to some extent from the priors of the IM model: Uniform[100;5000] for Split_time (in generations); Uniform[1000;5000] for N_ancestral, N_current_1, and N_current_2; and Uniform[0;0.005] for Migration_rate (per generation). In addition to these IM parameters, each population experienced exponential growth with per-generation rates Growth_rate_1 and Growth_rate_2, drawn independently from a uniform distribution: Uniform[−0.001;0.002]. Finally, the migration duration parameter was defined as the ratio between the number of generations during which migration occurred and the total number of generations since the split time. It was drawn from a uniform distribution Uniform[0;1], where 0 corresponded to a case with no migration and 1 to a case of continuous migration since the split.

Among the 10,000 simulated data sets, 5000 were used as a training data set (used to train the models), 2500 as a validation data set (used to tune hyperparameters, i.e., the parameters of the models set before the training process), and 2500 as a test data set (used to evaluate the final performance of the trained models).
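One way to produce this 5000/2500/2500 partition is to shuffle the simulation indices once and slice (a sketch; the seed and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 10_000
idx = rng.permutation(n_sims)       # random ordering of the simulated data sets

train_idx = idx[:5000]              # used to fit the models
valid_idx = idx[5000:7500]          # used to tune hyperparameters
test_idx = idx[7500:]               # used only for the final evaluation
```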

Summary statistics

The summary statistics that we computed on the genomic data belonged to 11 classes: eight classes were computed within populations and three among populations. Note, however, that the within-population statistics were computed both for each population sample separately and for the pooled sample combining the two. The number of individual summary statistics varied considerably among the different summary statistics classes. Note that during training, these individual features were considered independently from the class to which they belonged.

The within-population statistics classes were:

(i) S: the proportion of segregating sites.

(ii) D: Tajima’s D statistic (Tajima 1989).

(iii) PI: the mean and standard deviation of the expected heterozygosity across segregating sites (Nei and Li 1979).

(iv) WinH: the mean and standard deviation of the haplotypic heterozygosity computed in non-overlapping windows of 50 kb.

(v) Site frequency spectrum (SFS) statistics: the percentage of single-nucleotide polymorphisms (SNPs) where the derived allele is present at frequency i, for all i in [1, …, 2n], where n is the sample size, and the standard deviation of the distances between two adjacent SNPs with the derived allele at frequency i (Boitard et al. 2016; Jay et al. 2019).

(vi) LD: the mean and standard deviation of the squared correlation r2 of all pairs of SNPs in a given distance bin. Following Boitard et al. (2016) and Jay et al. (2019), 19 distance bins with means ranging from 282 bp to 1.4 Mb were considered.

(vii) IBS: deciles of the distribution of the lengths of segments completely identical between m haplotypes, for m = 2, 4, 8, and 16 (Boitard et al. 2016; Jay et al. 2019; inspired by Harris and Nielsen 2013).

(viii) AFIBS: for each SNP, the AFIBS segment is the region surrounding that SNP that is identical across all haplotypes carrying the derived allele at that SNP (Theunert et al. 2012). We computed the mean and standard deviation of the lengths of AFIBS segments for SNPs of frequency i, for all i in [1, …, 2n].

The among-population statistics classes were:

(ix) Fst: the fixation index (Wright 1950).

(x) Dxy: the genetic divergence between the two populations (Nei and Li 1979).

(xi) JSFS: the joint site frequency spectrum (Wakeley and Hey 1997): the percentage of SNPs where the derived allele is present at frequency i in population 1 and j in population 2.

The statistics of the classes S, D, PI, Fst, Dxy, and JSFS were calculated using the tskit package (Kelleher et al. 2016), while those for classes WinH, SFS, LD, IBS and AFIBS were computed using scripts adapted from Jay et al. (2019) in Python 3.9.7. All summary statistics were computed for each independent locus. We then computed the mean, median, and variance across loci of these values for each simulation, yielding a total of 3024 summary statistics (Table S1).
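The aggregation step can be sketched as follows, assuming each locus yields a vector of 1008 raw statistics, so that the mean, median, and variance across the 20 loci produce the 3024 features reported above (the exact per-locus count follows from Table S1; the function name and random placeholder data are ours):

```python
import numpy as np

def aggregate_across_loci(per_locus_stats):
    """per_locus_stats: array of shape (n_loci, n_stats), one row per locus.
    Returns the mean, median, and variance of each statistic across loci,
    concatenated into a single feature vector."""
    mean = per_locus_stats.mean(axis=0)
    median = np.median(per_locus_stats, axis=0)
    var = per_locus_stats.var(axis=0)
    return np.concatenate([mean, median, var])

# 20 loci x 1008 raw statistics -> 3 x 1008 = 3024 aggregated features
per_locus = np.random.default_rng(1).normal(size=(20, 1008))
features = aggregate_across_loci(per_locus)
```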

Parameter inference methods

We explored the effectiveness of three regression methods to infer the demographic parameters: random forest (RF), the gradient boosting system XGBoost (XGB), and multilayer perceptron (MLP). All models underwent optimization of their hyperparameters using the validation set. The CPU times required for training and prediction with these three methods are shown in Fig. S1.

RF (Breiman 2001) is an ensemble method based on the construction of multiple decision trees. This method reduces prediction uncertainty by averaging the predictions of the individual trees. The hyperparameter optimization procedure led us to set the number of trees between 200 and 300, with depths ranging from 15 to 20 (the exact values depended upon the demographic parameter).
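A minimal sketch of this setup with scikit-learn, using hyperparameter values within the ranges reported above and synthetic placeholder data (the real inputs are the 3024 summary statistics):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data standing in for (summary statistics, parameter) pairs.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=200)

# Illustrative values within the tuned ranges: 200-300 trees, depth 15-20.
rf = RandomForestRegressor(n_estimators=250, max_depth=18, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_train[:5])
```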

While RF uses independent predictors, XGB (Chen and Guestrin 2016) uses a sequential approach where each new tree is adjusted to correct errors in the existing model. This method assigns higher weights to observations that were poorly predicted in earlier iterations, thus prioritizing their correction in the construction of subsequent decision trees. For this second ensemble learning technique, the tuning of hyperparameters on the validation set led us to set the number of trees to 100, with a depth of seven.
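The boosted-tree setup can be sketched as follows, with the tuned hyperparameters reported above (100 trees of depth seven) and synthetic placeholder data; a scikit-learn gradient-boosting estimator is used as a stand-in if the XGBoost library is not installed:

```python
import numpy as np

try:
    from xgboost import XGBRegressor as BoostedTrees        # library used in the study
except ImportError:
    # Fallback sketch only: accepts the same keyword arguments.
    from sklearn.ensemble import GradientBoostingRegressor as BoostedTrees

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)

# Each new tree is fitted to correct the errors of the current ensemble.
booster = BoostedTrees(n_estimators=100, max_depth=7)
booster.fit(X, y)
pred = booster.predict(X[:5])
```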

Finally, MLP (Haykin 1994) is a neural network method made up of fully connected layers. We trained a separate network for each demographic parameter, although a multi-output MLP could also be used. Depending on the demographic parameter, a preliminary hyperparameter optimization study on the validation set led us to set the number of neuron layers to six or seven (Table S2). For all models, summary statistics were standardized before training and the loss function was the mean square error (MSE). Unlike RF or XGB, MLP can produce predictions outside the prior intervals. In that case, the prediction was set to the closest bound of the prior interval. All MLP models used the ReLU activation function (Agarap 2019). We also performed a preliminary study of the impact of introducing an L1 regularization penalty on prediction quality, which led us not to use this option (Fig. S2).
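The standardization and clipping steps can be sketched as follows with scikit-learn; the layer sizes here are illustrative rather than the tuned architectures of Table S2, and the data are synthetic placeholders:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = rng.uniform(1, 5000, size=300)     # e.g. the Split_time prior bounds

# Standardize the summary statistics before training, as described above.
scaler = StandardScaler().fit(X)
mlp = MLPRegressor(hidden_layer_sizes=(64,) * 6,   # six fully connected layers
                   activation="relu", max_iter=200, random_state=0)
mlp.fit(scaler.transform(X), y)

# MLP outputs are unconstrained, so clip predictions to the prior interval.
pred = np.clip(mlp.predict(scaler.transform(X[:5])), 1, 5000)
```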

We implemented the RF and MLP methods using the scikit-learn python library (Pedregosa et al. 2011), and the XGBoost library for the XGB method (Chen and Guestrin 2016).

Feature contributions

We used permutation feature importance (PFI) to gauge the contribution of each feature (i.e., each summary statistic) to the performance of a specific model. It consists of randomly shuffling the values of a given feature across replicates and observing the subsequent degradation of the model’s performance. By breaking the feature-target relationship, it allows us to determine the extent to which the model depends on this particular feature. In this study, we used the permutation_importance function from the scikit-learn library.
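A minimal sketch of this procedure on a toy regression where only the first feature is informative, using scikit-learn's permutation_importance:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=300)   # only feature 0 carries signal

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the resulting drop in model score.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
most_important = int(np.argmax(result.importances_mean))
```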

Moreover, we used Shapley values to assess the extent to which each summary statistic impacts the model’s predictions for each demographic parameter, measuring both the direction and magnitude of their contributions. Shapley values first appeared in the context of cooperative game theory (Shapley 1953) to estimate the contribution of individual players to the outcome of a game. In the ML context, the players are represented by the features and the cooperation is represented by combinations of features. We used the Python SHAP library (Lundberg and Lee 2017), which provides visual representations of the influence of the feature value on the outcome.
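For intuition, the Shapley value of a feature is its marginal contribution averaged over all coalitions of the other features. The following toy sketch computes exact Shapley values for a small model, replacing features absent from a coalition by their background mean (the SHAP library used in the study approximates this efficiently for tree and neural models; this code is purely illustrative):

```python
from itertools import combinations
from math import factorial
import numpy as np

def exact_shapley(f, x, background):
    """Exact Shapley values for one instance x under model f.
    Features absent from a coalition are set to the background mean."""
    n = len(x)
    base = background.mean(axis=0)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # Weight of coalition S in the Shapley average.
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                with_i, without = base.copy(), base.copy()
                for j in S:
                    with_i[j] = x[j]
                    without[j] = x[j]
                with_i[i] = x[i]
                phi[i] += weight * (f(with_i) - f(without))
    return phi

# For a linear model with a zero background, phi_i reduces to w_i * x_i.
w = np.array([2.0, -1.0, 0.0])
f = lambda v: float(v @ w)
phi = exact_shapley(f, np.array([1.0, 1.0, 1.0]), np.zeros((1, 3)))
```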

Comparison with ABC methods

The three regression methods were compared with several ABC algorithms from the ‘abc’ package (Csilléry et al. 2012), namely ABC rejection, ABC loclinear and ABC neuralnet, tested on the same simulation set. In addition to a rejection step, the last two methods apply corrections based on a local linear regression and a neural network, respectively. Their performance was evaluated on the same 2500 samples from the test data set used to assess the regression methods. For each of these 2500 samples, we kept the simulations closest to the target among the 7500 simulations not used in the test data set, for tolerance thresholds ranging from 5 × 10−4 to 5 × 10−2 (Table S3).

For the loclinear method, we selected the summary statistics with the highest correlation with the demographic parameters, to comply with this method’s limit on the maximum number of summary statistics. Thus, we retained up to 300 statistics in the case of the 5 × 10−2 tolerance threshold. Similarly, to follow neuralnet specifications, we kept the 300 statistics with the highest correlation with the target parameter while considering a network with three neurons in its hidden layer. For all three algorithms, we then used either the mean or the median of the posterior as a point estimate.
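The rejection step underlying all three algorithms can be sketched as follows (a minimal Python illustration with synthetic data; the ‘abc’ package additionally applies the regression corrections described above):

```python
import numpy as np

def abc_rejection(target_stats, sim_stats, sim_params, tolerance):
    """Keep the fraction `tolerance` of simulations whose summary statistics
    are closest (Euclidean distance after standardization) to the target,
    and return the mean and median of the retained parameters."""
    mu, sd = sim_stats.mean(axis=0), sim_stats.std(axis=0)
    d = np.linalg.norm((sim_stats - mu) / sd - (target_stats - mu) / sd, axis=1)
    n_keep = max(1, int(tolerance * len(sim_params)))
    kept = sim_params[np.argsort(d)[:n_keep]]
    return kept.mean(axis=0), np.median(kept, axis=0)

# Toy reference table: 7500 simulations with one parameter and one noisy statistic.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 1, size=(7500, 1))
stats = theta + 0.05 * rng.normal(size=(7500, 1))
mean_est, median_est = abc_rejection(np.array([0.5]), stats, theta, 5e-2)
```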

Results

Accuracy of the machine-learning method for parameter estimation under the IM model

After training the three ML methods for each demographic parameter of the standard IM model on the training data set, we evaluated their performances on the test data set. We compared the predictions of these three ML methods using different metrics (Table 1). In the following, N_current refers to results for population 1. Since the model is symmetric and both population sizes have the same prior, the results obtained for N_current_1 and N_current_2 are similar (see Table S4 for population 2).

Table 1 Prediction errors on the test data set of the three ML methods, for each demographic parameter of the standard IM model.

Root mean square error (RMSE) and mean absolute error (MAE) are commonly used measures, and the choice between them depends on the specificities of the issue to be tackled (Chai and Draxler 2014). The squaring operation in RMSE penalizes large errors more heavily, which is particularly appropriate when avoiding outlier predictions is critical. For three of the four demographic parameters, Split_time, Migration_rate and N_current, MLP performed better than RF and XGB on both metrics. Conversely, for the N_ancestral parameter, the XGB model performed better according to the RMSE metric, while the MLP model performed better according to the MAE metric.

The MAE metric has the advantage of directly reporting the average magnitude of the errors. Split_time was the demographic parameter for which the MLP model outperformed the two other models by the largest margin: the MAEs of RF and XGB were 522 and 467 generations, respectively, i.e., 36% and 21% higher than the MAE of MLP, which was only 385 generations. For the migration rate, which ranged between zero and 10−3 per generation in the simulations, MLP achieved a MAE of around 7.5 × 10−5 per generation, compared with 8.0 × 10−5 and 8.8 × 10−5 for the XGB and RF models, respectively, i.e., MAEs higher by 6% and 17%. For the ancestral and current effective population sizes, the lowest average errors were also achieved by the MLP models. For the ancestral effective population size, MLP only marginally outperformed the other methods, with a mean error of 127 individuals compared with 129 and 130 individuals for the RF and XGB models, respectively. The difference was more substantial for the current effective population size: for this parameter, the MAEs of RF and XGB were 14% and 4% higher, respectively, than the MLP’s MAE of 375.

The NMAE is a normalized version of the mean absolute error (MAE), in which the error for a given parameter is divided by the difference between the minimum and maximum bounds of the prior of this parameter. This allowed us to compare the estimates of all parameters, whatever their ranges. N_ancestral had the lowest NMAE (0.0127). It was thus the parameter that could be estimated with the highest accuracy, followed by N_current (NMAE of 0.0375). The Migration_rate and Split_time parameters, with respective NMAEs of 0.0752 and 0.0771, showed the lowest accuracy.
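This normalization can be written as a one-line function (a sketch using the N_ancestral prior bounds from the Methods; the numbers below are illustrative, not the study's values):

```python
import numpy as np

def nmae(y_true, y_pred, prior_min, prior_max):
    """Mean absolute error divided by the width of the prior interval."""
    return np.mean(np.abs(y_pred - y_true)) / (prior_max - prior_min)

# e.g. a 127-individual mean error on the N_ancestral prior Uniform[100;10,000]
err = nmae(np.array([5000.0]), np.array([5127.0]), 100, 10_000)
```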

Plotting predicted versus observed values gave us more detailed information about the predictions for each parameter (Fig. 2; Fig. S3). The RF and XGB methods both showed a downward bias for high values of the splitting time and the migration rate, in contrast to MLP (Fig. 2). We also observed markedly upward-biased predictions by the RF model for low migration rates, leading to a large overall error for this method on this demographic parameter.

Fig. 2

Predicted values versus observed values for the four isolation-with-migration parameters: Split_time (in generations), Migration_rate (per generation), N_ancestral, and N_current, estimated with three different ML methods: RF (Random Forest), XGB (XGBoost), and MLP (Multilayer perceptron).

Comparison with ABC methods

The performance of the ABC algorithms depended on both the tolerance threshold and the choice of the median or the mean of the posterior as estimator (Table S3). Whatever the tolerance threshold chosen, the three ML methods all outperformed these ABC algorithms on the same simulation set (Fig. S4). The ABC rejection algorithm always showed lower performance than the ABC loclinear and ABC neuralnet methods. Compared with the three ML methods, the performance of the ABC algorithms in predicting current and ancestral effective population sizes was substantially lower. The difference between the three ML methods and the ABC algorithms was less pronounced for the migration rate parameter, but still in favor of the ML methods.

Accuracy of the machine-learning method for parameter estimation under the SC model

We then trained the same hyper-parametrized models on the SC scenarios. In the following, N_current and Growth_rate refer to results for population 1. As the model is symmetrical and the priors are the same for both populations, the results for population 2 are similar.

For the parameters already present in the standard IM model, MLP was again generally the best-performing method (Table 2). This was particularly noteworthy for Split_time, for which both RF and XGB showed a downward bias at high values of that parameter (Fig. S5). N_ancestral remained the most accurately predicted parameter, with the lowest NMAE (4.19 × 10−2 for MLP). The model with the lowest RMSE for Growth_rate prediction was XGB (6.06 × 10−4), while MLP showed the lowest MAE (4.51 × 10−4). Finally, for the migration duration, MLP achieved the best performance in terms of both RMSE and MAE.

Table 2 Prediction errors on the test data set of the three ML methods, for each demographic parameter of the secondary contact with varying population sizes model.

In the following sections, we focus on the standard isolation with migration model (IM).

Assessing accuracy discrepancies across demographic scenarios

We focus in this section on MLP, as it showed the lowest MAE. We investigated whether prediction quality was heterogeneous across demographic scenarios. To address this question, we used regression to identify the extent to which the different demographic parameters could explain prediction errors (Fig. S6). We then plotted these prediction errors as a function of the two most explanatory parameters (Fig. 3).

Fig. 3: Standardized error patterns in MLP predictions.

Standardized errors \((\hat{y}_{i}-y_{i})/(y_{\max}-y_{\min})\), where y is the target demographic parameter: Split_time (in generations) (A), Migration_rate (per generation) (B), N_ancestral (C), and N_current (D), as a function of the two variables most predictive of these errors under the MLP method. Cold colors correspond to areas of low error and therefore better model prediction, while warm colors correspond to areas of higher error.

Regarding the prediction of split time, we observed that the best performance was achieved for smaller ancestral effective population sizes (Fig. 3A). As this effective population size increased, the prediction error increased. This effect was more pronounced for older split times. When predicting the migration rate, MLP performed less efficiently for recent split times. As the split time increased, the prediction quality was higher for a lower migration rate (Fig. 3B).

As for the ancestral effective population size, the values taken by the current effective population sizes had the largest impact on the quality of the prediction. The lowest accuracy was observed for the ancestral size when both current populations showed low effective sizes. When one of the two effective population sizes was small, prediction quality remained lower than average, regardless of the effective size of the second population (Fig. 3C).

Split time was the most decisive parameter in the quality of the prediction for current effective population size parameters. As this time increased, the prediction became increasingly accurate, whereas demographic scenarios with recent split time and high current effective population sizes led to the highest errors.

Different patterns of summary statistic usage across ML models

We estimated the importance levels of the various summary statistics classes for each ML model and each demographic parameter, in order to determine the extent to which the models differ in this respect. We computed the importance of each variable using the PFI method for the three ML methods, which measures how much the prediction quality of a given model decreases when a single variable is randomly shuffled.

We observed first that RF and XGB allocated quite similar importance levels to the various classes of summary statistics (Fig. 4), while the MLP method generally showed different behaviors. For Split_time, the SFS class stood out for the two ensemble methods, with importance levels of 0.71 and 0.66 respectively. The JSFS class was second in terms of importance (importance levels of 0.10 and 0.16 for RF and XGB, respectively), but the gap was high with the SFS class. Conversely, for MLP, the JSFS class was the most important, with an importance of 0.63, ahead of the SFS and AFIBS classes. The differences among methods were much less pronounced for the Migration_rate parameter, where the AFIBS, SFS, and foremost JSFS classes were used predominantly by all methods. Nevertheless, the XGB and RF models also gave significant importance to Fst, unlike the MLP model.

Fig. 4

Importance levels of the different classes of summary statistics as a percentage of the total for each demographic parameter and each method (RF, XGB, MLP).

The differences were much more striking for the effective population sizes. For the ancestral effective population size, RF and XGB focused almost exclusively on the IBS class, with importance levels of 0.99 and 0.96, respectively. Conversely, MLP used the different classes much more evenly, with five classes showing noticeable levels of importance. The discrepancy among methods was somewhat lower, but still striking, for the current effective population sizes. All methods used several classes of summary statistics, but the importance levels were strongly unbalanced toward AFIBS for RF and XGB, with values of 0.66 and 0.65, respectively, while MLP again used the different classes more evenly.

This similarity in the variables used by XGB and RF was confirmed by observing the individual importance levels of each summary statistic (see Fig. S7). In the case of split time, the variables related to the SFS were of the utmost importance for the predictions made by these two methods. In a similar manner, these methods assigned the greatest importance to Fst, followed by statistics from the JSFS class, for predicting the migration rate. The most important variables for predicting the ancestral effective population size all belonged to the IBS class, while for the current effective population size these variables belonged to different classes: AFIBS, JSFS and WinH. The most important summary statistics chosen by MLP differed strikingly from those chosen by RF and XGB. They also exhibited markedly lower importance values. Indeed, the maximum importance attributed to a single feature by MLP was 2.6%, for the Migration_rate parameter. For this method, importance was clearly spread over a much larger number of variables than for the two other methods (Fig. S8).

Towards improved explainability

Due to their inherent complexity, their nonlinear relationships, or the fact that they combine a set of predictors, the ML methods do not directly provide the impact of each individual statistic on the parameter estimation. This makes interpreting the individual impact of each summary statistic more complex. While some methods, such as the PFI method used above, are effective in revealing features that are important for prediction, they reach their limit when it comes to quantifying the impact of features while accounting for how they interact with each other. Interpretation techniques may then prove necessary to provide insights into the role of summary statistics in complex models. We therefore used SHAP (Lundberg and Lee 2017), a method based on Shapley values that measures the average marginal contribution of each feature, when combined with all others, to the prediction of a given parameter. In addition to identifying the most important summary statistics for each parameter, this method also allows us to investigate finely how these statistics affect the model outputs, quantifying both the direction and magnitude of their contributions to the prediction of each parameter. We illustrate the contributions of this method in the IM model for the Split_time predictions obtained with the XGB model, which combined high accuracy with an ability to bring out statistics with substantial importance (Fig. S7).

Focusing first on the summary statistics with the greatest impact, we observed that the frequency of tripletons in the overall population (SFS_3_all-mu) showed the highest mean of absolute SHAP values (Fig. 5A). This statistic was also the one that showed the highest importance (Fig. S7). The distribution of SHAP values for the most important variables allowed us to understand the impact of individual summary statistics on predictions (Fig. 5B). We observed that SFS_3_all-mu had the greatest upward and downward impact on the prediction of Split_time. While the highest values of this variable almost always had an upward impact, intermediate values had a substantial impact, particularly downward, when combined with other statistics (Fig. 5C). We also observed that for the variable SFS_19_p1-mu, which represents the frequency of SNPs with the ancestral allele found in a single haplotype, higher values had a greater impact on the parameter than lower values, whose impact was more limited (Fig. 5B).

Fig. 5: SHAP-based summary statistics analysis in Split_time prediction by XGB model.

A Top features with the highest means of absolute SHAP values for the XGB model on the Split_time parameter. B Feature effects indicated by the Shapley value on the x-axis for the top 5 features with their values represented by color. Each point corresponds to one instance. C Shapley values as a function of the normalized SFS_3_all-mu (gray points), distribution of the normalized SFS_3_all-mu (gray bars).

Discussion

We have shown here how an approach combining a large number of summary statistics and machine learning methods was particularly effective in predicting demographic parameters in scenarios involving two connected populations. These methods do not require preselecting a subset of summary statistics in order to reduce the number of variables. Both ensemble and neural network methods are indeed suitable for learning from a very large number of statistics, some of which are highly correlated, and some of which are aggregated indicators that can be derived from others. We showed that MLP overall outperformed the RF and XGB methods in parameter prediction for the IM and SC models considered. Other demographic and genetic modeling assumptions could be tested in future work using the available scripts, as well as the effect of the number of simulations on accuracy. Other prior distributions could also be tested, such as log-uniform distributions, which are of particular interest when the goal is to infer the order of magnitude of a parameter. Although achieving lower accuracy, RF and XGB still showed good performance. Depending on the metric and degree of accuracy required, they could still be considered in some cases, as they are less computationally intensive.
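As a sketch of this workflow (all data here are synthetic, and scikit-learn's GradientBoostingRegressor stands in for XGBoost), the three classes of regressors can be trained directly on a simulations-by-summary-statistics table, without any preselection of features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Hypothetical training set: rows = simulations, columns = summary
# statistics, target = one demographic parameter (e.g. a split time).
X = rng.normal(size=(500, 20))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.normal(size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),  # stand-in for XGB
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                        random_state=0),
}
# Fit each regressor on the same feature table and score it on held-out
# simulations; no summary statistic is removed beforehand.
scores = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
```

The held-out R² scores can then be compared across methods, mirroring the accuracy comparison carried out in this study.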

The intricate effects of the demographic parameters on the summary statistics undermine the predictive power of these statistics. We therefore emphasize the importance of studying the impact of confounding factors (here, the other demographic parameters) and the heterogeneity of prediction accuracy across the parameter space. Within the isolation with migration framework, we identified which demographic parameters were most decisive in the accuracy of the prediction of our target parameter, and visualized how this accuracy varies as a function of these most explanatory parameters. Taking such an approach when building a new model for a given evolutionary question ensures a precise picture of the inference strengths and weaknesses.
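One simple way to expose such heterogeneity, sketched here on synthetic data rather than our actual test sets, is to bin the prediction error of the target parameter by the value of another demographic parameter acting as a confounder:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical test set: true and predicted values of a target parameter,
# plus one other demographic parameter of each simulation (the confounder).
confounder = rng.uniform(0.0, 1.0, size=1000)   # e.g. a migration rate
true = rng.uniform(size=1000)
# Synthetic predictions whose error grows with the confounder's value.
pred = true + rng.normal(scale=0.05 + 0.3 * confounder)

# Mean absolute error within five equal-width bins of the confounder.
bins = np.linspace(0.0, 1.0, 6)
idx = np.digitize(confounder, bins[1:-1])
mae_per_bin = np.array([np.abs(pred[idx == b] - true[idx == b]).mean()
                        for b in range(5)])
```

A rising error profile across bins identifies the confounding parameter as a driver of accuracy heterogeneity, and delimits the regions of the parameter space where the inference is most reliable.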

The three machine learning methods showed contrasting patterns in terms of summary statistics usage for predicting demographic parameters in the modeling settings of the IM model. Although XGB showed generally better performance than RF, both methods used quite similar learning schemes. MLP, on the other hand, used strikingly different classes of summary statistics from the other two methods. Furthermore, whereas RF and XGB focused on a limited number of summary statistics to which they gave a high importance, MLP used a much larger number of statistics, each being given a small importance. In other words, the information used by MLP was spread much more evenly over the different statistics. Owing to its higher degree of complexity, this method therefore showed a greater capacity to capture intricate patterns and complex relationships from a high number of variables, such as those of the JSFS class.
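This concentration or spreading of importance can be measured with a permutation-style importance, as in the PFI analysis above. The sketch below (toy data and an oracle predictor, not our trained models) shuffles each feature column in turn, records the resulting increase in mean squared error, and then computes the share of total importance held by the top features:

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    base = np.mean((y - predict(X)) ** 2)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break feature-target link
            imp[j] += (np.mean((y - predict(Xp)) ** 2) - base) / n_repeats
    return imp

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))              # 10 hypothetical summary statistics
y = 3.0 * X[:, 0] - 2.0 * X[:, 1]           # only the first two are informative
predict = lambda Z: 3.0 * Z[:, 0] - 2.0 * Z[:, 1]  # oracle model for the sketch
imp = permutation_importance(predict, X, y)

# Fraction of total importance carried by the two most important features:
top2_share = imp[np.argsort(imp)[-2:]].sum() / imp.sum()
```

A top-k share close to 1, as for this toy model, corresponds to the concentrated RF/XGB pattern; a share spread thinly across many statistics corresponds to the MLP pattern.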

Conversely, tree-based methods concentrate their predictions on a limited number of summary statistics that they identify as important for inference. We therefore note that an error on one of these most important summary statistics is likely to have a strong impact on the estimation of the demographic parameter. However, this has the advantage of allowing us to identify and select the specific classes of summary statistics to which we may wish to restrict ourselves for a particular demographic model, enhancing the interpretability of the results. For this purpose, we can rely on the recently developed interpretability method SHAP (Lundberg and Lee 2017), which was notably used to interpret mutation rate predictions from a neural network based on the SFS (Burger et al. 2022) and in a deep learning framework for selection inference (Cecil and Sugden 2023). This method allowed us to assess how the values of a given summary statistic affect the prediction of the demographic parameters under study. Ultimately, while the methods developed here were tested in specific demographic scenarios, with mutation and recombination rates consistent with known values for humans, they could be adapted to other contexts, in which their performances could be tested with the same procedure as we used here. The impact of the sample size and of the number of loci could also be studied, either using realistic ranges or matching the exact values of the user's dataset.

Note also that the methods that we implemented here are likely to undergo several developments in the future. Our simulation approach was indeed based on a strictly neutral model, without any selective pressure. However, it has been shown that positive selection (Schrider et al. 2016), as well as background and purifying selection (Ewing and Jensen 2016; Pouyet et al. 2018; Johri et al. 2021), have an impact on demographic inferences. Therefore, while it is difficult for other approaches such as composite-likelihood and SMC-based methods to account for selection explicitly, the methods developed in this study could naturally be extended to models that incorporate these selective effects, allowing for a more comprehensive assessment of their impact on demographic inferences.

Similarly, we simulated sequences with uniform mutation and recombination rates along the genome, and without sequencing errors. A relaxation of these assumptions could also be explored (see e.g., Boitard et al. 2016; Jay et al. 2019). We further point out that while all demographic inference methods are sensitive to model assumptions, simulation-based approaches, including ABC methods, are also sensitive to simulation misspecification. Since the prediction of demographic parameters is restricted to predefined prior bounds, it is essential to ensure that these bounds are set carefully, without being too restrictive. An additional area of focus could be to investigate how these methods perform for model selection, often a preliminary step to parameter estimation.

Finally, even though we showed here the high predictive power of our methods, we suggest comparing their efficiency with that of methods that apply deep learning directly to raw genomic data in order to infer demographic (Sanchez et al. 2021) or selection events (Flagel et al. 2019; Torada et al. 2019). By learning more complex patterns from the data that may not be captured by summary statistics devised beforehand, these methods may prove particularly effective in the case of complex demographic models. However, as they do not benefit from the guidance provided by the summary statistics, it is necessary to compare their performances with those of methods such as the ones that we developed here, as carried out by Sanchez et al. (2021) for the estimation of the effective size of a population over time. In that study, neural networks yielded accurate predictions without requiring handcrafted features, but carefully optimized inference based on summary statistics remained competitive.