Introduction

The shift from foraging to plant cultivation and domestication represents the most important transition in the economic history of early Holocene societies. This transformation occurred independently across different regions of the world1,2,3, leading to the development of diverse assemblages of domesticated plants. The composition of these assemblages varied considerably, contributing uniquely to diets and landscape management practices depending on the region. For instance, while some areas – such as Europe and Southwest Asia – adopted intensive cereal-based agriculture, others – such as the Neotropics – favored home gardening and agroforestry systems4,5.

A central, long-standing question remains: did the shift to plant cultivation, whether through the local domestication of plants or the adoption of non-native cultivars, arise from global driving forces, or were region-specific factors more influential? This debate has persisted since early studies on the origins of agriculture, with theories attributing the primary impetus to factors such as population pressure and diminishing returns6,7, as well as niche construction8. Alternatively, the emphasis on local environmental conditions has been proposed to explain regional transitions to cultivation, such as the influence of the Younger Dryas in the Near East9 and the early Holocene forest expansion in South America10.

In this study, we utilize survival analysis to model and predict the timing of initial transitions to food production on a global scale, incorporating archaeological data and environmental variables. Our analysis reveals strong spatial autocorrelation in the timing of these transitions, underscoring the substantial role of diffusion and interregional contact. When controlling for spatial lag, we observe different contributions from bioclimatic predictors depending on the specific region examined. For example, annual mean temperature and seasonality are key predictors in the Americas, whereas the temperatures during the wettest and driest quarters are most influential in the Near East and surrounding regions. We propose that these differences are linked to the distinct plant assemblages and the local biogeographic conditions that shaped their spread. In conclusion, our findings support the view that the global emergence of food production was driven not by a singular, universal cause, but by a range of local factors, further shaped by processes of contact and diffusion.

Overall chronology

Globally, the first centers of plant domestication emerged during the early Holocene. In the Americas, early evidence is found in southwestern Mesoamerica, where maize (Zea mays L.) was domesticated approximately 10,000 to 8,000 years before present (BP, calibrated years before 1950)11. Northwestern South America also stands out for the early domestication of crops such as squash (Cucurbita sp.), documented since the early Holocene5,10. The Llanos de Moxos region, in southwestern Amazonia, represents another early center, with manioc (Manihot sp.) and squash (Cucurbita sp.) occurring around 10,000 BP12.

In Africa, independent centers have been documented in the western part of the continent, where crops such as pearl millet (Pennisetum glaucum (L.) R.Br.), African rice (Oryza glaberrima Steud.), and fonio (Digitaria sp.) were domesticated after the African humid period, ca. 4,500 BP13. In eastern Africa, the available evidence indicates that sorghum (Sorghum bicolor (L.) Moench) was domesticated in the eastern Sahel region ca. 6,000 BP14, while the earliest remains of other eastern African domesticates such as finger millet (Eleusine coracana (L.) Gaertn.) and teff (Eragrostis teff (Zucc.) Trotter) appear around 2,000 BP in the Horn of Africa15,16.

Eurasia’s Neolithic is marked by the domestication of staple cereals such as wheat (Triticum spp.) and barley (Hordeum spp.) within the Fertile Crescent around 10,000 to 8,000 BP. Although full domestication is associated with the Pre-Pottery Neolithic B (PPNB) period, early indicators of storage and cultivation appear in Pre-Pottery Neolithic A (PPNA) sites, which is why these dates are included in this study2,17. In the Indian subcontinent, southern regions witnessed independent domestication of crops including millets (Panicum sp., Setaria sp.) and beans (Vigna sp.) around 5,000 BP2. East Asia saw two major centers: the Yellow River basin in northern China, where millets (Panicum sp., Setaria sp.) were cultivated potentially as early as the early Holocene, and the Yangtze basin in southern China, where rice (Oryza sp.) was domesticated by 10,000 to 8,000 BP18.

As for Oceania, in the highlands of Papua New Guinea, an independent domestication process involved the cultivation of bananas (Musa sp.) at approximately 7,000 BP19.

Following these early domestication events, the spread of agricultural practices varied regionally. In the Americas, the eastern lowlands of South America exhibit more recent dates for the presence of cultivars, often linked to migration and cultural diffusion during the mid to late Holocene. The spread of crops, particularly from the Amazon Basin, is closely associated with population expansions, most notably those of the Arawak and Tupi-Guarani20,21. In North America, the early adoption of maize in the Southwest contrasts with the mid-Holocene domestication of goosefoot (Chenopodium sp.) and sunflower (Helianthus sp.) in the eastern temperate forests. The spread of maize and other crops of Mesoamerican origin followed later22.

In Africa, a major vector for the dissemination of agriculture was the Bantu expansion, which began in West Africa around 8,000 BP and continued throughout sub-Saharan Africa into the late Holocene23.

The spread of the Neolithic crop package from the Fertile Crescent into Europe has been interpreted as a case of demic diffusion, marked by the westward expansion of farming24,25,26. Eastward diffusion into South Asia was delayed before reaching the Indus basin, potentially due to climatic challenges posed by monsoon-dominated regions27.

Finally, in East and Southeast Asia, the distribution of rice cultivation has been studied in the context of demic diffusion, highlighting the movement of agricultural practices and populations across these regions28,29,30.

Materials and methods

Data compilation

Data were compiled from published radiocarbon databases of global and continental coverage18,30,31,32,33,34. We selected dates acquired directly from plant remains of cultivated and domesticated species, or associated with them through context, stratum/level, or general site period information (Fig. 1, Supplementary Fig. S1). In regions with limited evidence or preservation of plant macroremains, such as South America, South Asia or Africa, we also used dates derived from pollen and phytolith data. Additionally, we included dates not directly associated with archaeobotanical evidence, but which belonged to archaeological cultures known to have practiced agriculture (for example, dates associated with a Neolithic occupation in Europe, Bantu occupation in Africa or Tupi-Guarani – among many other agroforestry-practicing cultures – in lowland South America).

To conform to the spatial resolution of our environmental data and prioritize the date of first transition in each region, avoiding as much noise as possible, we retained only the earliest dates within a 50-kilometer radius. Our data set comprises a total of 1589 dates.
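
The text does not specify how the 50-kilometer filter was implemented. As a minimal sketch of one possible procedure, the following Python code visits dates from oldest to youngest and keeps a date only if no previously kept date lies within the radius. The record layout, the function names and the three example records are hypothetical and serve only as an illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def keep_earliest(records, radius_km=50.0):
    """Greedy thinning (an assumed procedure, not necessarily the authors').

    records: list of (site_id, lat, lon, age_bp); a larger age_bp means an older date.
    A record is kept only if every already-kept record is farther than radius_km.
    """
    kept = []
    for rec in sorted(records, key=lambda r: r[3], reverse=True):  # oldest first
        if all(haversine_km(rec[1], rec[2], k[1], k[2]) > radius_km for k in kept):
            kept.append(rec)
    return kept


# Hypothetical records on the equator: B is about 22 km from A, C about 111 km.
records = [
    ("A", 0.0, 0.0, 9000),
    ("B", 0.0, 0.2, 7500),
    ("C", 0.0, 1.0, 6000),
]
thinned = keep_earliest(records)
print([r[0] for r in thinned])  # ['A', 'C']
```

In this toy case the younger date B is discarded because it falls within 50 km of the older date A, while C is retained.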

Fig. 1

Dates associated with the transition to food production used in this study. Map created by the authors using R v4.4.2 (https://www.r-project.org/).

Environmental predictors

We utilized 17 bioclimatic variables and net primary production data derived from climate simulations based on the HadCM3 and HadAM3H models, accessed using the R package pastclim35,36. Because we focused on the Holocene epoch, when most of the transitions took place, and considering that the dates span from the early to the late Holocene, we averaged the values across time slices, each spanning 1,000 years, from 12,000 years before present (BP) to the present, retaining the average of the 12 slices for each bioclimatic variable in the analysis. Our edaphic predictors included 20 parameters for topsoil and subsoil characteristics, obtained from the Harmonized World Soil Database37.

Terrain-related variables included elevation and rugosity, sourced alongside the climatic data from pastclim. Additionally, we calculated the distance from major rivers and the coast using the Global Self-consistent, Hierarchical, High-resolution Geography Database38, separating the distance from first, second and third-level rivers.

To enhance model parsimony and mitigate multicollinearity, we excluded predictors with a correlation coefficient greater than 0.9. After addressing collinearity, our final dataset comprised 36 predictors.
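
The collinearity filter can be illustrated with a short Python sketch. It assumes that the threshold applies to the absolute Pearson correlation and that, within a correlated pair, the predictor encountered later is the one dropped; neither detail is stated in the text. The variable names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Note: a constant column has zero variance and would cause a division by zero.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_collinear(columns, threshold=0.9):
    """Keep a predictor only if |r| <= threshold with every predictor kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical predictor values at five sites.
predictors = {
    "bio01": [10.0, 12.0, 15.0, 20.0, 25.0],
    "bio05": [21.0, 24.5, 30.0, 41.0, 50.5],       # nearly linear in bio01
    "elevation": [300.0, 50.0, 900.0, 120.0, 400.0],
}
selected = drop_collinear(predictors)
print(selected)  # ['bio01', 'elevation']
```

Because the result of a greedy filter depends on the order in which predictors are visited, a real analysis would need to fix that order explicitly.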

All raster data were projected to the WGS84 coordinate system and standardized to a resolution of 0.5 degrees, consistent with the original resolution of the bioclimatic dataset. Raster processing was conducted using R version 4.3.1 and QGIS version 3.34.

Spatial structure

Demic and cultural diffusion played an important role in the global transition to food production20,24,39. To account for this phenomenon, we explored spatial autocorrelation in the data using variation partitioning and distance-based Moran’s eigenvector maps (dbMEMs). We conducted variation partitioning to assess the proportion of variance influenced by spatial structuring40,41. This analysis broke down the total inertia into independent and shared fractions: the exclusive contribution of each explanatory dataset, the joint fractions due to intercorrelation, and the unexplained fraction. We also constructed dbMEMs using the distances based on the latitude and longitude of each site42,43. The median of the calibrated ages was used as the target variable.
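
The logic of variation partitioning is easiest to see in the simplified case of two explanatory sets, for example one environmental set and one spatial set. Given the (adjusted) R² of three models, the fractions follow by subtraction. The sketch below is a reduction to two sets for illustration only, whereas the analysis described above uses several sets; the function name and the R² values are hypothetical and are not results of this study.

```python
def partition_two_sets(r2_env, r2_space, r2_both):
    """Split explained variation between two explanatory sets.

    r2_env:   R-squared of the model with environmental predictors only
    r2_space: R-squared of the model with spatial predictors only
    r2_both:  R-squared of the model with both sets together
    """
    return {
        "env_only": r2_both - r2_space,           # exclusive to the environment
        "space_only": r2_both - r2_env,           # exclusive to space
        "shared": r2_env + r2_space - r2_both,    # joint (intercorrelated) fraction
        "unexplained": 1.0 - r2_both,
    }


# Hypothetical R-squared values, chosen only to make the arithmetic easy to follow.
fractions = partition_two_sets(r2_env=0.40, r2_space=0.45, r2_both=0.60)
for name, value in fractions.items():
    print(f"{name}: {value:.2f}")
```

The four fractions always sum to one, which is a convenient check on the arithmetic.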

Recognizing the pronounced spatial autocorrelation present in the target variable, we included latitude and longitude as predictors in a Random Forest model to capture nonlinear spatial dependencies44. This strategy helps to account for spatial patterns in the target variable and aligns with prior research suggesting that incorporating spatial information in the form of geographic coordinates can enhance the performance of tree-based machine learning models45. Additionally, using coordinates as model predictors enables global extrapolation, facilitating predictions beyond the original dataset coverage.

Random forest

We employ Random Forest to predict the time of transition to food production. Random Forest is a machine learning ensemble algorithm that excels at capturing complex relationships between predictors and the outcome without requiring strict prior specifications. This method is particularly suitable for our analysis due to its robustness against overfitting and its ability to handle non-normally distributed independent variables46.

For our study, we utilize a variant of Random Forest tailored for survival analysis known as Random Survival Forests (RSF)47. Survival analysis is a regression technique designed for modeling the time duration until an event takes place. It differs from conventional regression models by dealing with inherently positive time-to-event data48,49.

RSF adapts the traditional Random Forest framework to handle time-to-event data, where the event of interest is the transition to food production. This approach constructs multiple decision trees, each calculating a survival function representing the probability that the event time exceeds a given time point t.

The RSF model calculates the survival function using the Kaplan-Meier estimator, defined as:

$$S_h(t) = \prod_{t_{j,h} \le t} \left(1 - \frac{d_{j,h}}{Y_{j,h}}\right)$$

where $d_{j,h}$ is the number of events and $Y_{j,h}$ is the number of individuals at risk at time $t_{j,h}$ in terminal node $h$. The node splits in each tree are determined to maximize the difference in survival between nodes, enhancing the model’s predictive performance.
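
To make the estimator concrete, the following Python sketch computes the Kaplan-Meier product for the observations falling into a single terminal node. It is a stand-alone illustration rather than the randomForestSRC implementation, and the four event times are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate at each distinct event time.

    times:  time-to-event values for the observations in one terminal node
    events: 1 if the event was observed, 0 if the observation is censored
    Returns a list of (t_j, S(t_j)) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)      # Y_j: observations still at risk just before t_j
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0                # d_j: events occurring exactly at t_j
        leaving = 0          # observations leaving the risk set at t_j
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            leaving += 1
            i += 1
        if d > 0:
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


# Four hypothetical sites, all with an observed transition.
curve = kaplan_meier(times=[2, 3, 3, 5], events=[1, 1, 1, 1])
for t, s in curve:
    print(t, round(s, 4))
```

For these values the survival function steps down to 0.75 at t = 2 (one event among four at risk), to 0.25 at t = 3 (two events among three at risk) and to 0 at t = 5.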

The dependent variable in our model is the date of transition to food production, measured from the earliest date in our dataset (12,700 BP). Our predictor matrix includes a comprehensive set of 36 bioclimatic, edaphic, and terrain variables, as previously described.
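
One straightforward reading of this definition is that the time-to-event for each site is the number of years elapsed between 12,700 BP and the site's date. The sketch below assumes that reading; the function name and the two example ages are hypothetical.

```python
ORIGIN_BP = 12_700  # earliest date in the dataset, as stated in the text


def time_to_event(age_bp, origin_bp=ORIGIN_BP):
    """Years elapsed between the origin and a date, both in calibrated years BP."""
    if age_bp > origin_bp:
        raise ValueError("date is older than the chosen origin")
    return origin_bp - age_bp


print(time_to_event(10_000))  # 2700
print(time_to_event(4_500))   # 8200
```

Under this convention, earlier transitions correspond to shorter survival times.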

We implemented the Random Survival Forest model using the randomForestSRC package in R50. We tuned the model to minimize out-of-sample error, using 80% of the data for model fitting (in-bag samples) and the remainder for error estimation. The best model was retained with 500 trees, a terminal node size of 2, and 35 variables tried at each split.

To assess the performance of the Random Forest model, we utilized Harrell’s concordance index (C-index)51,52 of the out-of-bag (OOB) samples. The C-index measures the agreement between the observed survival outcomes and the predicted outcomes. Specifically, it evaluates whether observations with earlier dates are correctly predicted by the model as having a worse outcome (in terms of survival) than the observations with later dates. The prediction error is calculated as 1 - C, where a prediction error of 0.5 indicates that the model’s performance is no better than random chance.
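
The pairwise logic of the C-index can be sketched in a few lines of Python. This is a simplified version that scores tied predictions as 0.5 and treats a pair as comparable when the shorter time corresponds to an observed event; it is not meant to reproduce the exact tie handling of any particular package. The times and risk scores are hypothetical.

```python
def harrell_c(times, risk_scores, events=None):
    """Simplified Harrell's concordance index.

    times:       observed times to the event
    risk_scores: model output where a higher score means an earlier expected event
    events:      1 if the event was observed, 0 if censored (default: all observed)
    """
    n = len(times)
    if events is None:
        events = [1] * n
    concordant = 0.0
    comparable = 0
    for i in range(n):
        for j in range(n):
            # i must have the shorter time, and its event must have been observed
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Four hypothetical sites: 4 of the 6 comparable pairs are ordered correctly.
times = [1000, 3000, 5000, 8000]
risk = [0.9, 0.4, 0.7, 0.5]
c_index = harrell_c(times, risk)
print(round(c_index, 4), round(1 - c_index, 4))  # 0.6667 0.3333
```

A model that ranks every pair correctly yields C = 1 (prediction error 0), whereas a fully reversed ranking yields C = 0.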

Variable importance

We assessed variable importance using the Breiman-Cutler permutation importance method46. This method measures the importance of each variable by quantifying the increase in prediction error when the values of that variable are randomly shuffled within the out-of-bag samples.
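
The idea behind permutation importance can be shown with a model-agnostic sketch. In the Breiman-Cutler procedure the shuffling is applied to the out-of-bag samples of the trees; the version below simplifies this to a single shuffle over one held-out set, with a toy prediction function and a mean squared error standing in for the forest and its error measure. All names and values are hypothetical.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, error, X, y, column, seed=0):
    """Increase in error after shuffling one column of X.

    predict: function mapping one row (a list of feature values) to a prediction
    error:   function (y_true, y_pred) -> float, lower is better
    """
    rng = random.Random(seed)
    baseline = error(y, [predict(row) for row in X])
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
    permuted = error(y, [predict(row) for row in X_perm])
    return permuted - baseline


def toy_model(row):
    """Toy model that depends only on the first feature."""
    return 2.0 * row[0]


X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0], [5.0, 3.0]]
y = [toy_model(row) for row in X]

imp_first = permutation_importance(toy_model, mean_squared_error, X, y, column=0)
imp_second = permutation_importance(toy_model, mean_squared_error, X, y, column=1)
print(imp_first, imp_second)
```

Shuffling the second feature leaves the error unchanged, so its importance is exactly zero, while shuffling the first feature can only increase the error.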

Additionally, we estimated the individual feature contributions to the model using Shapley values, a concept derived from cooperative game theory53. Shapley values measure the marginal impact of each feature on the model’s prediction across all possible feature combinations. We calculated Shapley values using the R library fastshap54.
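
For a model with only a few features, Shapley values can be computed exactly by enumerating every feature subset. The sketch below does so for a toy function, replacing "absent" features with a baseline value. It illustrates the definition only and is not intended to reproduce how fastshap computes its estimates; the toy model, the instance and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x.

    Features outside a coalition take their baseline value. The cost grows
    exponentially with the number of features, so this suits toy cases only.
    """
    n = len(x)

    def value(coalition):
        row = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(row)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(row):
    """One additive term plus one interaction between the last two features."""
    return 3.0 * row[0] + row[1] * row[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print([round(p, 6) for p in phi])  # [3.0, 3.0, 3.0]
```

The additive term is credited entirely to the first feature, the interaction term is split equally between the other two, and the values sum to the difference between the prediction for the instance and the prediction for the baseline.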

Results and discussion

Spatial autocorrelation

Variation partitioning showed significant effects of all sets of variables on the response (Supplementary Fig. S2). A notable portion of the variance related to terrain, bioclimatic, and soil factors overlapped with the spatial component, revealing significant linear spatial patterns among the environmental variables. Crucially, the spatial component alone accounted for 16% of the total inertia, highlighting the influence of cultural transmission or migration in the spread of plant cultivation, as anticipated. Nevertheless, the environmental variables account for a similar portion of the variability, which is not shared with the spatial component, pointing to the importance of local environmental processes in shaping the timing of the transition to plant cultivation55.

The final model, incorporating 33 dbMEM eigenvectors, achieved an adjusted R² of 0.64. The primary axis of variation closely replicates the distinction between regions exhibiting the earliest transition dates compared to those receiving the spread of these transitions. This pattern is particularly evident in contrasts such as Southwest Asia versus South Asia, and Northwest South America/Mesoamerica versus other regions of the American continent (Supplementary Fig. S3).

Random forest and variable importance

The Random Forest model achieved high predictive accuracy (0.83), calculated from the prediction error of the OOB samples. The model accurately predicts the emergence of agriculture in different centers and its spread (Fig. 2, Supplementary Fig. S4).

Fig. 2

Predicted time of transition to food production based on the median survival results from the Random Survival Forest. Map created by the authors using R v4.4.2 (https://www.r-project.org/).

Fig. 3

Shapley feature importance plot showing the ten most important features.

To understand the factors that drive this process, we primarily focus on the interpretation of Shapley values (Fig. 3), with permutation test results validating a similar order of variable importance (Supplementary Fig. S5). Notably, latitude and longitude emerged among the most critical features, highlighting pronounced spatial autocorrelation. Longitude’s importance reflects the early domestication dates in and around the Fertile Crescent, a center from which agriculture spread east and west, contrasting with the generally later transition in the western hemisphere (Fig. 2). A similar trend can be observed in Africa regarding latitude. In this case, the southwards expansion of Bantu-speaking populations from western Central Africa was a major driver of agricultural diffusion, with earlier transitions occurring in regions closer to the Sahel and progressively later adoptions further south. Although distance from rivers does not appear among the top features, this variable might have played a role in supplying habitats for the diffusion of cultivars such as rice in East Asia, where the Yangtze river would have offered the wetland margins of the early cultivation systems of this water-loving plant56. Distance from top-level rivers holds greater importance in the permutation test compared to Shapley values (Supplementary Fig. S5), which could tentatively be attributed to the role of diffusion.

Given that processes such as diffusion and migration are key drivers in the spread of plant domestication20,24,39, proximity to an established center of food production was expected (and confirmed) to increase the likelihood of transitioning to agriculture. While disentangling the effects of diffusion/migration from spatial autocorrelation in environmental variables can be challenging, variation partitioning demonstrates that not all variance explained by the spatial component overlaps with bioclimatic predictors (Supplementary Fig. S2). Shapley values further reveal the individual contributions of each variable. Excluding the spatial lag, the most influential environmental predictors were identified as the mean temperature of the wettest quarter, temperature seasonality, annual mean temperature, and the mean temperature of the driest quarter (Fig. 3).

To explore spatial patterns in the impact of different variables, we analyzed the spatial distribution of Shapley values, an approach which enables an assessment of how the relative contributions of variables change across global regions57,58 (Fig. 4).

Fig. 4

Spatial distribution of the Shapley values for the four most important features (excluding geographical coordinates). Maps created by the authors using R v4.4.2 (https://www.r-project.org/).

The spatial pattern in the environmental variables is evident also from the dependence plots, showing that the highest contributing variables related to temperature, seasonal temperature and precipitation displayed interactions with longitude (Fig. 5). Specifically, the mean temperature of the wettest quarter greatly influenced early domestication dates in the Near East and Mediterranean (Fig. 4). These regions are characterized by low temperatures during the wettest quarter (rainy winters) and elevated temperatures during the driest quarter (dry summers). The spatial variation in these variables elucidates some of the observed diffusion patterns, namely the deceleration of agricultural spread from the Near East to South Asia. This shift involved crossing from a Mediterranean climate to a monsoon-influenced tropical climate, which contributed to delays in adapting the original plant package from the Near East27.

Temperature seasonality and annual mean temperature also interact with longitude (Fig. 5), and were critical predictors in the western hemisphere, explaining the transition gradient from Mesoamerica and Northwestern South America to other areas of the American continent. Regions with less seasonality and warmer climates saw earlier domestication. These areas then became the primary centers for the initial diffusion of domesticated species, while further diffusion reflected challenges, especially in the adaptation of tropical domesticates, such as maize, to temperate zones. An alternative explanation is that less seasonal environments were associated with higher forager population densities in regions with high net primary productivity, as previously suggested59. If that is the case, such population pressures might have accelerated transitions to agriculture due to diminishing foraging returns, suggesting that forager population density could be a relevant but unexamined variable in our analysis. However, foraging population density – though previously identified as critical in the transition to farming60,61 – is partly predictable from bioclimatic variables59,62. Including it would introduce collinearity, so it is best excluded from the analysis.

Fig. 5

Shapley dependence plots for the four most important features (excluding geographical coordinates).

A notable conclusion is the contrast in plant domestication chronologies between the Americas and Southwest Asia, stemming from the photoperiodic needs of key domesticates. In the Americas, the main crops were short-day plants, which flowered as daylight shortened. Conversely, cereals in Southwest Asia were long-day plants, thriving through winter and flowering as daylight lengthened during dry summer months63. This differentiation helps explain the variable importance of climatic factors: in the Americas, temperature seasonality and mean temperature were predominant, whereas in Southwest Asia, interactions between temperature and precipitation held more significance. The combination of mild, wet winters and hot, dry summers created favorable conditions for wild cereals like wheat and barley, which composed the most important part of the Southwest Asian Neolithic crop package. Previous simulations incorporating bioclimatic factors and technological development replicated the rapid spread of cereal cultivation in the Near East, compared to the slower adoption of less energy-efficient American cultigens (except maize), delaying agriculture in the Americas60.

These differences underscore the gradient in adoption dates, suggesting that delays in domestication and spread were influenced by regional variations in temperature and precipitation seasonality in relation to the local plant package, a phenomenon previously discussed regarding the expansion of cereal crops into South Asia27. At the same time, it should be noted that, unlike Southwest Asia, where diffusion occurred across regions with relatively similar bioclimatic conditions, the western hemisphere presents a more complex scenario, with distinct centers of development in diverse biomes, complicating the analysis of date gradients and feature importance at a continental scale. To that, we must add the lower diversity of domesticable species, and a slower transition influenced by cultural diffusion rather than demic diffusion, as highlighted by previous simulations focused on Eastern North America versus Europe64.

Similarly, the results are much more ambiguous for regions such as the African continent, where the effect of the bioclimatic variables is not as clear. This might be related to a different model of agricultural expansion, likely following a nonlinear trajectory65. Indeed, the expansion of agriculture in Africa occurred later and at a slower pace than in other areas such as Southwest Asia or the Americas. Recent studies indicate that plant cultivation in Africa was gradually incorporated into more complex models of subsistence, not necessarily representing the main source of food, but instead being part of a more holistic approach to subsistence that included the use of both wild and domestic plant and animal resources for long periods of time16,66,67,68,69. Such a complex introduction trajectory may hinder the model’s ability to identify the influence of bioclimatic variables in the expansion of agriculture, as the interaction between the environment and agricultural practices occurs within a more diverse and dynamic framework.

Finally, although soil properties were crucial for selecting species suitable for cultivation in arid environments55, they did not emerge as key predictors for the timing of initial food production transitions. It is likely that while edaphic factors influence agricultural practices in later stages, initial transitions were more dependent on broader bioclimatic conditions.

Conclusion

In conclusion, our findings do not identify a set of variables that globally drive the transition to food-producing economies. Instead, our analysis highlights the significance of diffusion, supported by the observed spatial autocorrelation in the timing of agricultural adoption and the strong influence of spatial predictors. The spatial patterns evident from the Shapley values indicate that the bioclimatic factors influencing the timing of agricultural adoption differ by region. Specifically, in the Americas, factors related to temperature and temperature seasonality play a more prominent role, whereas in Southwest Asia precipitation relative to temperature emerges as more important.

This regional variability may be influenced by the types of plants domesticated in each area, such as the distinction between short- and long-day plants. Our model appears to have effectively captured these local bioclimatic influences alongside the overarching process of diffusion. However, while the model elucidates region-specific factors tied to plant availability, it does not point to any singular global determinant for the emergence of food production.

In summary, although our model accurately captures the factors driving the spread of food production across diverse bioclimatic conditions, the specific determinants at each center of origin require independent analysis. Local environmental shifts, such as changes in precipitation, land cover and other disruptions linked to the Younger Dryas and the Pleistocene-Holocene transition, may have independently triggered domestication processes3,70. This hypothesis necessitates separate investigation using distinct methodological approaches. Thus, while diffusion and regional environmental factors play essential roles, the precise factors driving agricultural development in each center of food production remain an open question for further investigation.