Abstract
Global food security is increasingly challenged by climate change and unsustainable agriculture, emphasizing the need for strategies to enhance crop productivity. Understanding the interplay between crop health and soil microbiomes is crucial. This study explores the link between crop health, observed via multi-spectral satellite imagery, and fungal soil microbiome taxonomy. We associate the normalized difference vegetation index with fungal microbiomes in wheat, barley, and maize using a two-step machine learning process. The first step adjusts normalized difference vegetation index values for abiotic confounders using a random forest model trained on LUCAS 2018 topsoil and ERA5 climate datasets. The second step clusters operational taxonomic unit counts from fungal DNA, revealing significant differences in residual normalized difference vegetation index values between clusters. To identify potential bio-fertilizer candidates, we compare the average relative abundance of operational taxonomic unit clusters and construct sparse biological networks. Key findings are: (I) clusters with higher plant pathogenic genera have lower normalized difference vegetation index values; (II) clusters with higher influence scores for multiple beneficial genera have higher normalized difference vegetation index values; (III) lower-abundance taxa (1-3%) seem to regulate microbial networks; (IV) the influence of beneficial vs. pathogenic taxa is relative to their abundance. The study links satellite imagery to fungal microbiomes, providing a baseline for exploring fungal bio-fertilizers.
Introduction
The world relies on agricultural food production as one of the primary sources of nutrition1, making it a critical area for global food security. With an ever-growing population, increasing changes to the climate, and a need for more sustainable production, it is necessary to identify sustainable methods for optimizing agricultural yield and health. One such method is the development of smart farming2, which can aid in optimizing crop yield and health by monitoring individual field conditions, making near real-time adjustments possible. Smart farming integrates sensors or remote sensing technologies, such as satellite imagery3 and drones4, with data analytics and decision support systems5. A wide array of data sources and techniques is available to monitor agricultural fields via satellite imagery. Some of the common satellites utilized in agricultural imagery are Sentinel-2 and Landsat 8, which are both considered to be of medium resolution6. Lower resolution satellites, such as MODIS, are also applied in agriculture, although they often do not provide enough accuracy for precision agriculture6,7,8,9,10. MODIS does have the upside of daily samples, whilst Landsat 8 and Sentinel-2 only sample every sixteenth and fifth day, respectively11. Apart from resolution differences, these satellites all provide multi-spectral images3,12. For agriculture, the multi-spectral images are mostly used to compute vegetation indices (VIs)13. Some of the most applied indices are the normalized difference vegetation index (NDVI)14, the enhanced vegetation index (EVI)15, the leaf area index (LAI)16, and the Chlorophyll Index17. For a more extensive review of multispectral VIs, we refer to the Index Database18 and the work of Xue and Su19.
One of the most commonly used VIs for assessing crop health is NDVI, which is calculated as the difference between the near-infrared (NIR) and red bands of a multi-spectral image divided by their sum. NDVI ranges from –1 to 1, where higher values indicate healthy crops with higher vegetation density19,20. Several factors have been reported to impact NDVI, including soil moisture21, temperature22, seasonality23, climate24, nutrients25, and crop type26. These factors are, for the most part, well covered and described in the literature and mainly linked to NDVI via non-linear models23,27,28. However, studies linking remote sensing with microbial soil composition remain limited compared to those examining nutrient effects.
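The NDVI definition above can be illustrated with a minimal sketch. The function names `ndvi` and `mean_ndvi` and the representation of a region of interest as a list of (NIR, red) reflectance pairs are assumptions made for illustration; they are not taken from the study's processing code.

```python
def ndvi(nir, red):
    """NDVI for one pixel: (NIR - Red) / (NIR + Red).

    Returns None when both bands are zero, where the index is undefined.
    """
    denominator = nir + red
    if denominator == 0:
        return None
    return (nir - red) / denominator


def mean_ndvi(pixels):
    """Mean NDVI over an iterable of (nir, red) pairs, skipping undefined pixels."""
    values = [v for v in (ndvi(nir, red) for nir, red in pixels) if v is not None]
    if not values:
        return None
    return sum(values) / len(values)
```

Dense, healthy vegetation reflects strongly in the NIR band and absorbs red light, so its pixels approach 1, while bare soil or water yields values near or below 0.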
In the work of Carvalho29, species richness was concluded to be positively linked to NDVI, but high variance across sampling sites led to high uncertainty in the conclusions. A more recent study suggests similar findings, linking higher richness and specific bacteria to higher soil fertility30. Still, that study did not include the effects of climate and weather conditions when estimating the effect of the bacteria. Another study investigated the role of Cyanobacteria and showed that when inoculated over 90 days, NDVI was significantly increased in fields with added Cyanobacteria compared to fields with no Cyanobacteria31. Other studies confirm that Cyanobacteria play a role in NDVI measurements23,32, but these are mainly linked to Cyanobacteria blooming in lakes. In another study, the alpha diversity of bacteria was linked to hyperspectral imagery using the Airborne Visible InfraRed Imaging Spectrometer-Next Generation (AVIRIS-NG)33. A combination of linear discriminant analysis and partial least squares regression was applied to determine the dominant bacterial families’ abundance from airborne hyperspectral imagery33. This confirms that a link between microbial diversity and remote sensing is possible.
Apart from bacteria, fungi are of particular interest since ectomycorrhizal associations34 could be applied in agriculture as a means of reducing traditional fertilization in favor of bio-fertilization35. This property makes the remote sensing connection to soil fungal communities very attractive, as cheap satellite imagery may provide field diagnostics indicating if particular parts of a fungal community may be lacking. However, the link between NDVI (or other VIs) and the soil microbial fungal composition is only sparsely covered in the literature. In the work of Nutter36, NDVI was applied to find the epicenters of soybean rust disease caused by the fungus Phakopsora pachyrhizi. Another study links higher NDVI values with a larger species richness, having a larger proportion of Onygenales37. The effect of richness is also evident in the work of Liu, concluding that soil decomposers seem to play a role in higher NDVI values and in retaining community stability38. In a recent study, the alpha diversity of fungi and bacteria in forests was investigated using hyperspectral imagery from the DESIS satellite. The study concluded that high fungal richness (hot spots) is linked to environmental parameters such as pH and landscape type. These findings make remote sensing an attractive opportunity for the monitoring of crops.
However, each of these studies is on a relatively small scale (few fields), and the data have similar spatio-temporal properties (fields in the same area sampled at the same time interval). In addition, only a limited influence of climate is accounted for, which may lead to nuisance effects influencing the NDVI response, in particular across spatio-temporal data.
Our study aims to explore the associations of the fungal soil microbiome composition and the NDVI response while accounting for the effects occurring in data across regions and one growth season. We propose a two-step methodology: first, we adjust NDVI values for abiotic influences, and then link the residual NDVI values to fungal biotic variance. In the initial step, a model is used to adjust NDVI values for abiotic factors, based on previous findings25,39. Following this adjustment, the adjusted NDVI values are analyzed in relation to the fungal soil microbiome (see Fig. 1 for a graphical outline).
(1) We estimate crop health/growth as NDVI values from satellite images. (2) We adjust the NDVI values by removing abiotic influence through a Random Forest model. (3) We identify clusters of pre-processed biotic data from soil samples using hierarchical clustering and investigate the link between the residual NDVI values and the derived clusters. (4) We filter rare taxa by applying IQR bootstrapping, then generate sparse networks and identify and investigate nodes of importance in the biotic networks of the clusters.
Summing up, our contributions are (I) A methodology that adjusts for abiotic factors prior to investigating biotic relations to the residual NDVI. (II) Demonstrating a significant difference in NDVI values for different clusters of observed fungal microbiome compositions. (III) The Mortierella genus has the strongest influence regardless of its abundance; the presence of multiple influential plant pathogenic genera is associated with lower NDVI, while multiple influential beneficial genera are associated with higher NDVI.
Data description
This paper utilized multiple data sources to investigate the features affecting the NDVI responses. Specifically, the datasets used within our analysis were the 2018 LUCAS biodiversity dataset40, the LUCAS 2018 topsoil dataset41, and the ERA5 Copernicus climate dataset42. The remote sensing data contained the 2018 Copernicus LUCAS multi-polygons43 and our own polygons for 2015, which were used to generate Sentinel-2 and Landsat 8 satellite images. From these images, the mean NDVI values were extracted from regions of interest (ROIs) via Sentinel Hub44 (see the Satellite images subsection for details). The data linkage and processing, including how the aforementioned datasets were combined and pre-processed, are visually outlined in section 1.1 of the supplementary material. Furthermore, the pre-processing of the acquired satellite images and bioinformatics data is outlined in the Data pre-processing subsection. In the following subsections, the tabular datasets are described in detail.
Abiotic data
Multiple abiotic factors, such as soil nutrients, crop type, soil type, seasons, and climate conditions, are all known to influence the NDVI values and should therefore be taken into account when estimating the influence of the microbial composition. The LUCAS 2018 topsoil dataset consists of a total of 7430 unique crop-related samples with 10 features (see Table 1) describing the soil composition at different sampling sites, each with a unique set of latitude and longitude coordinates across Europe41. In total, 41 unique crop types are found in the data, and for our study, we focus on the three most abundant crop types: common wheat, barley, and maize. To obtain additional geospatial information, each sampling site was assigned to a climate zone based on the Köppen-Geiger climate zone classification system45. An additional filtration based on months was done, removing the months of winter (December, January, February), early spring (March, April), and late autumn (October and November), as the images mostly consisted of barren soil, snow, and clouds. Finally, the data was reduced by leaving out climate zones with fewer than 5 observations, which removed the observations from the BSh and ET zones. The final data consists of 2245 observations and is visualized for each crop type in Fig. 2C. The same procedure was carried out for the Landsat 8 2015 and 2018 datasets with the results depicted in Fig. 2A, B.
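The month and climate-zone filtering described above can be sketched as follows. This is a minimal illustration, not the study's code: the record layout (dictionaries with hypothetical `month` and `zone` keys) and the function name are assumptions, and zone counts are taken after the month filter because the text lists the zone reduction as the final step.

```python
from collections import Counter

# Winter, early spring, and late autumn months excluded in the text.
EXCLUDED_MONTHS = {12, 1, 2, 3, 4, 10, 11}
# Climate zones with fewer observations than this are dropped.
MIN_ZONE_OBSERVATIONS = 5


def filter_observations(records):
    """Keep in-season records whose climate zone is sufficiently represented."""
    in_season = [r for r in records if r["month"] not in EXCLUDED_MONTHS]
    zone_counts = Counter(r["zone"] for r in in_season)
    return [r for r in in_season if zone_counts[r["zone"]] >= MIN_ZONE_OBSERVATIONS]
```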
A NDVI values (mean within ROIs) over time for Landsat 8 2015. B NDVI values (mean within ROIs) over time for Landsat 8 2018. C NDVI values (mean within ROIs) over time for Sentinel-2 2018. In all panels, data are grouped by crop type and colored according to the Köppen-Geiger climate zone classification. The black line represents a linear trendline showing the progression of NDVI values over time for each crop. Further details of NDVI distribution across crop types are provided in the supplementary section 1.2.
In addition to the topsoil composition affecting the observed NDVI values, climate data were considered. The climate data were obtained from the ERA5 Copernicus dataset, which contains detailed climate records from 1940 to the present day42. For our purpose, we chose variables linked to the properties of soil, which were soil temperature (K), soil moisture (m3), soil type (categorical, 6 soil types in total), and air temperature (K). The climate data were grouped such that the yearly average air temperature, soil moisture, and soil temperature were linked to each NDVI observation. Additional temporal information for the aforementioned variables was also included in the form of the previous month’s average values. A summarized overview of all abiotic variables from the year 2018 is provided in Table 1, with the year 2015 found in the supplementary material section 1.2. For a complete overview of the distribution of each variable of Table 1, the reader is referred to the supplementary material section 1.2.
In the topsoil data (Table 1), the variable of water-based pH was excluded due to a high correlation with the CaCl2 based pH index (see the correlation plot found in supplementary section 1.3). The calcium chloride pH was chosen as it has been reported to be less affected by soil electrolyte concentration and produces more consistent measurements46. The variables of CaCO3, Aluminum Oxalate, and Iron Oxalate were also excluded from the analysis as the inclusion would reduce the data size to approximately one-third of the original size due to missing observations.
Biodiversity data
The final dataset included in this paper was the 2018 LUCAS biodiversity dataset40. The data contains a total of 885 bio-samples, each with a unique barcode ID (1-885), from which fungal ITS DNA sequences were obtained; 347 of these samples were linked to cropland. For the crops of interest (wheat, barley, and maize) within the specified time period, a total of 115 samples were present. The taxonomy was classified from the raw DNA based on the UNITE-INSDc 9.0 database47 (see the Data pre-processing subsection for details regarding the classification and processing of the raw DNA data). To ensure the quality of the classification, a cutoff based on the operational taxonomic unit (OTU) counts was set at 1, filtering any sample that had OTU counts lower than the set threshold48 (see the ITS PacBio sequencing subsection for details on how the OTU counts were generated).
Results
Modeling of abiotic effects
The regression results on the test set, modeling the influence of the abiotic factors on the NDVI values, show that the RF outperforms the linear regression model with lower root mean squared error (RMSE) and higher explained variance, R2 (Table 2). Both models are based on the 16 variables in Table 1.
The final linear regression models, including up to two-factor interaction terms, are detailed in Section 1.6 of the supplementary material for each model (see Table 2). The hyperparameter values for the final RF model are provided in Section 1.7 of the supplementary material.
For the RF model, the observed NDVI values vs the predicted NDVI values showed good association, and we conclude that the model adequately captures dependencies within the data when comparing the test and full dataset to the true NDVI values (section 1.8 of supplementary material). Furthermore, we investigated the residuals based on single crop type predictions for every model listed in Table 2, finding that no inherent bias towards any single crop type is present (supplementary section 1.10). The residual analysis for every model likewise found no larger variance across any of the less-represented levels of the categorical and numerical variables listed in Table 1 (see supplementary section 1.10). We observed that the most important variables are the season and crop type. In contrast, soil type and potassium levels seem to be less important (Fig. 3 and supplementary material section 1.9). To summarize, the results show a robust estimation of abiotic influences via the RF model. This step ensures a reliable quantification of residual NDVI, allowing for a clearer investigation of biological relationships.
The y-axis shows the variable names in accordance with Table 1, and the x-axis is the percent increase in mean squared error.
OTU data analysis
In the second step of our two-step process, we identified two unique clusters of specific taxonomy through unsupervised clustering of OTU samples (Fig. 4). The samples are approximately equally distributed between the first (Cluster 1) and the second cluster (Cluster 2).
Visualization of the two clusters identified from the average silhouette method (see supplementary material section 1.4) on the pre-processed PACBIO data (biotic). Note that the data has been scaled via the Hellinger transformation to reduce the effect of large OTU values. The letters in the cluster refer to the crop type of each sample (see the LUCAS 2018 biodiversity dataset for details40).
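The Hellinger transformation mentioned in the caption takes the square root of each sample's relative abundances, which dampens the effect of very large OTU counts before clustering. A minimal sketch, assuming each sample is a plain list of OTU counts (the function name and input layout are illustrative only):

```python
import math


def hellinger(counts):
    """Hellinger-transform one sample: sqrt(count / sample total) per OTU."""
    total = sum(counts)
    if total == 0:
        # An empty sample has no defined relative abundance; return zeros.
        return [0.0] * len(counts)
    return [math.sqrt(count / total) for count in counts]
```

After the transformation, the squared values of each non-empty sample sum to 1, so Euclidean distances between transformed samples correspond to Hellinger distances between the original compositions, up to a constant factor.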
As a sensitivity analysis, we additionally carried out a clustering, which focused on each crop individually (wheat, maize, and barley - supplementary material section 1.5). We identified two clusters for wheat with 30 and 19 observations, respectively. Due to low sample sizes, no meaningful patterns were identified for the separate clustering of barley and maize.
The clusters (including the separate wheat clusters) were further analyzed by connecting the residual NDVI values of the samples to the fungal taxonomy present in each cluster.
Microbiome impact modeling
OTU Cluster 1 showed significantly higher residual NDVI values compared to Cluster 2 (Fig. 5 and Table 2 in supplementary material section 1.4). The results for the raw NDVI values for each cluster are found in the supplementary section 1.4 for both Landsat 8 and Sentinel-2.
Similarly, for the wheat clusters, we found significant differences between the two clusters (supplementary material table 5, section 1.5). The average relative abundance analysis (Fig. 6) showed that genera associated with plant health (such as Tomentella and Mortierella49,50) were more abundant in Cluster 1 (higher NDVI), while pathogenic genera, such as Fusarium, were more prevalent in Cluster 2 (lower NDVI). Specifically, Fusarium was more abundant in maize and wheat within Cluster 2, with similar levels in barley across both clusters.
Abundance analysis for the wheat clusters revealed a similar pattern, with pathogenic genera being more abundant in clusters with lower NDVI (section 1.12 in the supplementary material).
Using our bootstrapping approach (algorithm 1), we identified 99 and 114 genera (Cluster 1 and 2, respectively) as non-outlier taxa (outlier taxa being low-abundance genera with occasional large, random OTU spikes) (Fig. 7).
A Graphical results of the bootstrap resampling for Cluster 1. In total, the relative abundance IQR confidence intervals of 99 genera (blue lines) do not overlap 0 (red line), while the remaining genera contain 0 in the confidence intervals (orange line). B Graphical results of the bootstrap resampling for Cluster 2. In total, the relative abundance IQR confidence intervals of 114 genera (blue lines) do not overlap 0 (red line), while the remaining genera contain 0 in the confidence intervals (orange line). Note that the Wald approximation has been applied for estimating the CIs, hence the negative CI values.
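Algorithm 1 is not reproduced in this section, so the following is only one plausible reading of the bootstrap filter, written as a hedged sketch: for each genus, the per-sample relative abundances are resampled with replacement, the interquartile range (IQR) is computed for each resample, and a Wald-type interval (mean ± z × standard deviation of the bootstrap IQRs) is formed. A genus is kept when the interval excludes 0. All function names, the number of resamples, and z = 1.96 are assumptions.

```python
import random
import statistics


def iqr(values):
    """Interquartile range using inclusive quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def bootstrap_iqr_wald_ci(abundances, n_boot=1000, z=1.96, seed=0):
    """Wald-type CI for the IQR of one genus's relative abundances."""
    rng = random.Random(seed)
    n = len(abundances)
    # Resample the samples with replacement and record the IQR each time.
    boot_iqrs = [iqr(rng.choices(abundances, k=n)) for _ in range(n_boot)]
    centre = statistics.mean(boot_iqrs)
    spread = statistics.stdev(boot_iqrs)
    # The symmetric Wald interval is not bounded at 0, so the lower
    # limit can be negative even though an IQR itself never is.
    return centre - z * spread, centre + z * spread


def is_non_outlier(abundances, **kwargs):
    """Keep a genus when its IQR confidence interval lies above 0."""
    lower, _ = bootstrap_iqr_wald_ci(abundances, **kwargs)
    return lower > 0
```

Under this reading, a genus that is absent from most samples but spikes in one or two has a bootstrap IQR of 0 in almost every resample, so its interval contains 0 and the genus is flagged as an outlier.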
Network models of the non-outlier taxonomy show that many of the taxonomy correlations have been penalized by the sparse lasso model, retaining few connected nodes for both Cluster 1 and 2 (Figs. 8, 9). Results on the optimal regularization parameter are provided in the supplementary section 1.11.
Red and blue edges indicate negative and positive correlations, respectively. The full network includes all non-connected genera and is shown in the supplementary section 1.13. Note that only the genus names are used for the nodes and that smaller absolute correlations (smaller than 0.3) have been removed from the plot.
Only a few strong negative correlations are present in the network for Cluster 1 (Fig. 8), forming a denser network compared to the network of Cluster 2. The network of Cluster 2 is smaller with fewer strong connections compared to the Cluster 1 network (Fig. 9). A notable strong negative correlation appears in both networks between the Fusarium and Russula genera.
Red and blue edges indicate negative and positive correlations, respectively. The full network includes all non-connected genera and is shown in the supplementary section 1.13. Note that only the genus names are used for the nodes and that smaller absolute correlations (smaller than 0.3) have been removed from the plot.
Network analysis
Analyzing each sparse network showed that Mortierella is the most influential genus for both clusters (Table 3), having the highest abundance at approximately 15% in each cluster. In Cluster 1, the second most influential genus is Cortinarius, which shows a higher abundance than in Cluster 2, where Fusarium ranks second. Both genera appear highly influential in each cluster but have different abundance and ivi scores. Low-abundance genera (Trechispora, Schizothecium, and Serendipita) are also assigned high ivi scores and are of high influence in Cluster 2. They are present in Cluster 1 as well, but at a much lower abundance (below 0.5% in all cases).
In wheat-specific clusters, Mortierella was identified as the most influential genus across clusters, despite not being the most abundant (supplementary material section 1.15). In the higher NDVI cluster, Funneliformis was the most abundant genus but did not contribute to network influence, whereas lower-abundance genera such as Neoschizothecium and Curvularia demonstrated higher influence. Additionally, the plant pathogenic genera Fusarium and Penicillium51 showed the highest ivi scores in the lower NDVI cluster. In the wheat-specific Cluster 2 (lower NDVI), the same beneficial genera were present as in Cluster 1 (higher NDVI), but with lower ivi scores (except for Mortierella).
Discussion
When comparing the variable importance of our RF model with the literature, we find the following agreements. Previous studies have shown that the canopy structure of vegetation influences the NDVI with a non-linear relationship26,52. As for seasons, it has been shown that NDVI is temporally affected, with a maximum NDVI around harvest months53. In terms of nutrient influences, Loozen54 used RF modeling to demonstrate a relationship between NDVI and nitrogen content in forest canopies. Furthermore, the soil carbon content has been identified to have a non-linear link with NDVI55,56. In addition, the high importance of meteorological variables (e.g., temperatures and soil moisture) reported in the literature57,58,59 agrees with our findings. We note that the temporal resolution of the satellite imagery may affect the results, as the sampling date does not always correspond with the satellite recording date. Thus, the Sentinel-2 imagery may be more accurate with respect to the NDVI responses, as it has a higher temporal resolution. In the future, it would be interesting to repeat our analysis on PlanetScope data, which has daily recordings, although at a higher cost than Sentinel-2 and Landsat 8. Finally, the results of our abiotic model confirm previous findings in the literature, and the model serves as an important step towards removing abiotic influence from the NDVI responses.
Even so, our model is still subject to certain limitations. First, our model is limited as most of the meteorological observations are confined to moisture and temperature levels in the range 0.1–0.4 m3 and 280–290 K. This implies that the model may not be able to handle drought or very high temperatures. Second, our model has limited temporal coverage, as only the year 2018 is covered for biological data, possibly making the clustering susceptible to temporal distribution shifts. The temporal robustness and/or evolution is worth investigating in future studies.
For future research, it would be valuable to include a wider spatio-temporal area, including multiple years and using large-scale geospatial data with higher resolution than the Landsat 8 (30 m) and Sentinel-2 (10 m) satellites. This is particularly interesting, as differences have been observed between modeling based on Sentinel-2 and Landsat 8 data within the same year. An increased resolution could provide a more accurate representation of biotic effects. However, a significant limitation of higher-resolution imagery is its cost, as high-resolution data is much more expensive. It is also worth noting that the 2022 LUCAS Topsoil data is forthcoming. Although it is not yet publicly accessible, it could be used to perform an analysis with Sentinel-2 similar to that of the Landsat 8 analysis performed in this study. We strongly encourage such future external validations.
Our study mainly focuses on investigating the fungal impact on NDVI values across heterogeneous spatial areas across Europe. Overall, our findings successfully demonstrate that the fungal taxonomic composition is significantly associated with the NDVI values adjusted for abiotic influence.
From the LUCAS data alone, we cannot explicitly state which mechanism may be behind the regularization of the networks (Figs. 8, 9). We can infer that high abundance is not the sole criterion for network influence, although it seems to play a role, as the most influential genera (Table 3) are mainly comprised of genera with high average relative abundance. However, lower-abundance genera seem to have a much higher impact relative to their presence.
Moving further from our exploratory analysis, the highly influential taxonomy may provide a baseline candidate list for further studies investigating the fungal interaction both in vivo and in vitro. For instance, it would be interesting to investigate the effects on the NDVI values when Mortierella has minimal presence vs being highly abundant. From the literature, it is known that Mortierella has been found very beneficial for agriculture, in general60. However, to the best of our knowledge, no larger-scale controlled field experiments have been conducted with different levels of Mortierella. This could be carried out with the statistical split-plot design type61, which is frequently utilized in agronomy for investigating the influence of other factors, such as the impact of fertilizer types or irrigation methods. Likewise, the negative correlation between Russula and Fusarium is interesting to explore further, as the same relation is identified in the study of canker disease for citrus fruits62, revealing Russula to counteract the influence of Fusarium causing the disease. This relation has, to the best of our knowledge, never been reported in crops, and further investigation is required to assess our finding.
It should be noted that data is only assigned to the genus level; hence, species information is lost, and the above discussion is solely based on the most common occurrence of fungal species in maize, wheat, and barley. For future studies, the acquisition of species-level taxonomy would enrich the analysis, as specific species linked to NDVI may be identified. In addition, it would be relevant to analyze the metabolomics profile of the soil communities, pinpointing the taxonomic origin of metabolites positively or negatively affecting plant growth. Furthermore, an investigation of the rhizosphere of the plants would enrich our results, as some fungal-plant interactions are known to work through the rhizosphere. Specifically, this investigation could reveal if bulk soil content plays a mediating role in shaping fungal community composition by influencing root exudates, microbial recruitment, or nutrient availability within the rhizosphere. This is particularly relevant as Mortierella, which is known for having soil-plant nexus interactions63,64, seems to be a central genus for regulation of the microbiome. However, this is very difficult to investigate, as the metagenome of a single field is subject to high spatial biological variance even when sampling within a few meters of the same site. Aggregating samples across a grid may help, but outliers could skew results. Hence, further studies are best done in a controlled lab with few crops and limited field space.
Our analysis demonstrates the potential of relating satellite imagery and the composition of fungal soil microbiomes across datasets characterized by varying spatiotemporal attributes. However, the biological analysis is solely based on a total of 115 bio-samples, which currently limits the construction of robust models for predicting NDVI based on the soil microbiome. Therefore, more bio-samples should be collected in accordance with the LUCAS initiative to provide a ground truth foundation for prediction models. Hence, our approach reveals, in an exploratory sense, that certain microbiome compositions may affect the overall observed vegetation health, which potentially opens the way for selective microbial fertilization. One potential benefit of establishing this link is that expensive soil sample analysis can be replaced by satellite imagery. This would not only save time and resources but also enable much faster fine-tuning of microbial compositions for individual fields. An interesting approach would be to combine fungal fertilization with well-established bio-fertilization methods, such as crop rotation and the introduction of cover crops like legumes, which promote biodiversity by introducing nitrogen-fixing bacteria into the soil65. As noted by Liu et al.66, fungal functional diversity plays a crucial role in ecosystem stability, suggesting that the addition of specific fungal taxa could enhance the benefits already provided by practices like crop rotation and cover crops. Furthermore, a consequence of introducing fungal bio-fertilization may be increased soil taxonomic stability, making the system more robust to invasive crop pathogens67.
In the end, this fine-tuning may aid in an optimized agricultural yield35 through the increase of crop health, aiding in food security. Furthermore, the possibility of using selective microbial fertilizers may not only aid in achieving higher crop yields but ultimately replace some parts of the conventional fertilization process35, saving resources and opting for a more sustainable future through green farming.
Methods
Data pre-processing
Satellite images
To ensure consistent atmospheric correction, the Sentinel-2 images were taken from the atmospherically corrected sen2LA product rather than manually correcting the sen2L1C product68. For the Landsat 8 images, atmospheric disturbances were corrected by using the level 2 OT product69. Furthermore, we removed images with more than 5% cloud cover by applying the Sentinelhub API cloud filtering algorithm68. This resulted in a reduction of data from 7430 crop samples to 5410 crop samples. To ensure band resolution was uniform at a targeted 10 meters, we used bilinear interpolation to resample each image70. The same procedure was utilized on the Landsat 8 imagery. To ensure the pixel band values are not affected by environmental artifacts, we applied masking based on the Sentinel-2 scene classification maps (SCM)71, which removes pixels containing dark areas, snow, smaller clouds, and water from each sample tile. From the filtered, interpolated images, the mean NDVI values were aggregated based on the pixels found in each multi-polygon. Sampling dates did not always match the Sentinel-2 or Landsat 8 temporal recording scheme. To meet this challenge, a search range of 3 weeks before and after the sampling date was utilized. Imagery that did not meet the 5% cloud cover cutoff was discarded in the search process. The image closest to the sampling date was chosen to represent NDVI values for a sample. When no suitable imagery could be found in the temporal search span, the sample was removed from the final data.
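The image selection rule described above (search 3 weeks either side of the sampling date, discard images above the 5% cloud cover cutoff, take the closest remaining image, and drop the sample if none remains) can be sketched as follows. The dictionary keys `date` and `cloud_cover` and the function name are hypothetical; this does not call the Sentinel Hub API.

```python
from datetime import date, timedelta

SEARCH_WINDOW = timedelta(weeks=3)  # search range before and after sampling
MAX_CLOUD_COVER = 5.0               # percent; cloudier images are discarded


def pick_image(sampling_date, images):
    """Return the acceptable image closest in time to the sampling date.

    Returns None when no image qualifies, in which case the sample
    would be removed from the final data.
    """
    candidates = [
        img for img in images
        if img["cloud_cover"] <= MAX_CLOUD_COVER
        and abs(img["date"] - sampling_date) <= SEARCH_WINDOW
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda img: abs(img["date"] - sampling_date))
```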
ITS PacBio sequencing
An initial quality filtering was performed by discarding reads containing more than 1 ambiguous base and more than 2 expected errors. Adapters were trimmed and read orientation corrected using CutAdapt 2.1072. ITSxpress73 was then used to extract ITS regions. The UNITE 9.0 UCHIME reference dataset74 was used for reference-based chimera removal. Sequences were clustered using open reference clustering by including the ITS sequences extracted from the UNITE-INSDc 9.0 database75 using ITSx. The sequences were clustered at 98% similarity using VSEARCH76, applying the “--cluster_smallmem --usersort” arguments to prioritize full-length UNITE-INSDc and PACBIO sequences before partial sequences, as described by Tedersoo et al.77. A sample-by-OTU table was produced using VSEARCH, and OTUs shorter than 250 nucleotides were discarded. Representative sequences from each OTU were classified by Megablast queries using BLAST 2.12.0, applying taxon-specific e-value and sequence similarity thresholds as described by Tedersoo et al.77. Conservatively, OTUs were classified to the level of genus.
Supervised machine learning and statistical modeling
We built two models that adjust the NDVI for abiotic confounders/contributions. We compared an RF model78 and a linear regression model79, using the variables listed in Table 1 as the predictors and NDVI as the response (see the linear regression model results in the supplementary for the parameters included in the fully reduced regression model). The RF model parameters were tuned using 5 repetitions of 5-fold cross-validation (CV). For the 2018 Sentinel-2 data, we split the observations into 20% for one-off unseen testing and 80% for training via cross-validation. The Landsat 8 data underwent several splitting scenarios based on different combinations of datasets from 2015 and 2018. In the first scenario, the 2018 dataset was split into 20% for testing and 80% for training. The second scenario involved combining the 2018 and 2015 datasets into a single dataset, which was then split in the same 20% testing and 80% training ratio. The third scenario treated the 2015 and 2018 datasets as independent, with 2015 serving as the test data and 2018 as the training data. The final scenario mirrored the third but swapped the roles of the datasets, using 2018 for testing and 2015 for training.
In the repeated CV on the training data, the RF hyperparameters (variable splits, number of trees, and minimal terminal size) were selected using a grid search over a range of values (see the supplementary material) according to the minimal RMSE value. We estimated the expected error using the unseen test data for the selected set of hyperparameter values. The final model used for adjusting the NDVI for abiotic confounders was constructed by fitting a model with the selected hyperparameters to the full data (see 2). An analysis of residuals for the RF model is also included. For readers new to RF models, it should be noted that the residuals do not follow the same assumptions as those of traditional statistical models (e.g., independent and identically distributed)78. However, the analysis is included to test whether any bias is present in the prediction of less-represented levels of variables.
The linear regression model included two-factor interaction terms and was reduced according to the principles described in the Statistical analysis subsection and the results section. The models were compared using the test set errors and the errors estimated from making predictions onto the full dataset. Both models were evaluated based on the explained variance (R2) and root mean squared error (RMSE).
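The resampling scheme and the two evaluation metrics can be sketched in a few lines of standard-library Python. This is only an illustration of the 80/20 split, the 5 repetitions of 5-fold CV, RMSE, and R2; the function names and the seed handling are assumptions, and the study itself carried out the modeling in R.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Explained variance: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and hold out a one-off test set."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_kfold(indices, k=5, repeats=5, seed=0):
    """Yield (train, validation) index lists for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
            yield train, folds[i]


train_idx, test_idx = train_test_split_indices(100)
splits = list(repeated_kfold(train_idx))
print(len(train_idx), len(test_idx), len(splits))
```

In a grid search, each hyperparameter combination would be fitted on every `train` list and scored with `rmse` on the matching validation list, and the combination with the lowest average RMSE would then be evaluated once on `test_idx`.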
The estimated linear regression model coefficients \(\hat{{{\boldsymbol{\beta }}}}\) were computed by least squares79.
To mitigate over-fitting, the significance of each linear model regression coefficient was determined by ANOVA tests (see the Statistical Analysis subsection). Model diagnostics were also evaluated to ensure adequate fulfillment of the underlying assumptions. For a more extensive theoretical review of the linear model, we refer to the work of Madsen79.
To adjust the NDVI values for confounders, the NDVI values estimated by the abiotic RF model (see the Modeling of abiotic effects subsection), which represent the non-microbiome-related variance (YRF), were subtracted from the raw NDVI values (Yraw), resulting in the residual NDVI values (Yresidual) given as:
\[{Y}_{residual}={Y}_{raw}-{Y}_{RF}\qquad (1)\]
Subsequently, we investigate the association between the residual NDVI values and clusters derived from the microbiome sequencing data (Fig. 4).
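As a small worked illustration of this adjustment, the sketch below subtracts model-predicted NDVI from raw NDVI and averages the residuals per microbiome cluster. All numbers, cluster labels, and function names are made up for the example and do not come from the study's data.

```python
def residual_ndvi(raw, predicted):
    """Residual NDVI: raw NDVI minus the NDVI predicted from abiotic variables."""
    return [y - y_hat for y, y_hat in zip(raw, predicted)]


def mean_residual_by_cluster(residuals, clusters):
    """Average the residual NDVI within each cluster label."""
    grouped = {}
    for value, cluster in zip(residuals, clusters):
        grouped.setdefault(cluster, []).append(value)
    return {cluster: sum(vals) / len(vals) for cluster, vals in grouped.items()}


# Hypothetical values for four samples assigned to two clusters.
raw = [0.80, 0.60, 0.70, 0.50]
predicted = [0.70, 0.65, 0.60, 0.60]
clusters = [1, 2, 1, 2]

residuals = residual_ndvi(raw, predicted)
print(mean_residual_by_cluster(residuals, clusters))
```

A positive cluster mean indicates samples whose observed NDVI is higher than the abiotic variables alone would suggest; whether such a difference is significant is a separate question for the hypothesis tests.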
Clustering
We used hierarchical clustering to create clusters from the pre-processed PACBIO data in an unsupervised manner. We based the clustering on the Euclidean distance and chose the Agnes method, which applies Agglomerative coefficients80 to identify the best linkage. Hierarchical clustering computes the dissimilarities between observations based on a distance matrix, where each row/column represents an observation and the distances reflect how dissimilar or similar two observations are. The Agnes method iteratively merges the closest clusters based on these dissimilarities until an optimal number of clusters is reached. To identify the optimal number of clusters, we applied the average silhouette width s(i), defined as:
\[s(i)=\frac{b(i)-a(i)}{\max \{a(i),b(i)\}}\qquad (2)\]
Here, a(i) describes the average within-cluster distance between observation i and all other observations of the same cluster. The term b(i) denotes the average between-cluster distance between observation i and the observations assigned to the neighboring cluster. The term s(i) expresses how well each point is clustered, resulting in the optimal clustering being assigned when all clusters have observations above the average silhouette width (see the work of Friedman80 for more details on how to estimate between- and within-cluster distance). It should be noted that prior to clustering, the data was standardized in accordance with the Hellinger transformation81.
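The silhouette width and the Hellinger transformation can be sketched in plain Python as follows. The OTU counts and cluster labels are invented for illustration, the function names are hypothetical, and assigning a width of 0 to singleton clusters is an assumed convention rather than something stated in the text.

```python
import math


def hellinger(counts):
    """Hellinger transformation: square root of the relative abundances."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.sqrt(c / total) for c in counts]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(values):
    return sum(values) / len(values)


def silhouette_widths(points, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every observation i."""
    widths = []
    for i, point in enumerate(points):
        distances = {}  # cluster label -> distances from observation i
        for j, other in enumerate(points):
            if j != i:
                distances.setdefault(labels[j], []).append(euclidean(point, other))
        own = distances.pop(labels[i], [])
        if not own or not distances:
            widths.append(0.0)  # assumed convention for singletons
            continue
        a = mean(own)                                  # within-cluster distance
        b = min(mean(d) for d in distances.values())   # neighboring cluster
        largest = max(a, b)
        widths.append((b - a) / largest if largest > 0 else 0.0)
    return widths


# Hypothetical OTU counts: 4 samples (rows) x 3 genera (columns).
counts = [[90, 10, 0], [80, 20, 0], [0, 10, 90], [0, 20, 80]]
labels = [1, 1, 2, 2]
transformed = [hellinger(row) for row in counts]
widths = silhouette_widths(transformed, labels)
print(sum(widths) / len(widths))
```

Repeating this calculation for each candidate number of clusters and comparing the average widths is one way to pick the number of clusters.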
Bootstrapping
The non-parametric bootstrap approach82 was utilized for filtering the rare taxonomy of the pre-processed PACBIO data. A total of 5000 bootstrap samples were generated for each genus within each cluster. The interquartile range (IQR) of each bootstrap sample was then used to filter out taxonomy: a genus was discarded when the confidence interval of its IQR contained 0. The full process is outlined in Algorithm 1.
Algorithm 1
Genus filtration pseudo code
Initial inputs:
A matrix X of OTU abundances, with N rows (samples) and K columns (genera), i.e., \({{\boldsymbol{X}}}\in {{\mathbb{R}}}^{N\times K}\), is given.
for k = 1 to K do:
Step 1: Draw B = 5000 bootstrap samples, \({X}_{k}^{* }\), with replacement from the kth column of X:
\({X}_{k}^{* }=[{X}_{k,1}^{* },{X}_{k,2}^{* },\ldots ,{X}_{k,B}^{* }]\), with \({X}_{k}^{* }\in {{\mathbb{R}}}^{N\times B}\),
where \({X}_{k,b}^{* }=({x}_{1,k}^{(b)},{x}_{2,k}^{(b)},\ldots ,{x}_{N,k}^{(b)})\), with \({X}_{k,b}^{* }\in {{\mathbb{R}}}^{N\times 1}\).
Step 2: Estimate the IQR of the kth genus for each bootstrap sample, i.e., for each column of \({X}_{k}^{* }\):
\(IQ{R}_{k,b}={q}_{3}({X}_{k,b}^{* })-{q}_{1}({X}_{k,b}^{* })\), with \(IQ{R}_{k}=(IQ{R}_{k,1},\ldots ,IQ{R}_{k,B})\in {{\mathbb{R}}}^{B\times 1}\),
where q3 and q1 denote the 75% and 25% quantiles, respectively.
Step 3: From the B bootstrap IQR values of the kth genus (IQRk), estimate:
\({\bar{IQR}}_{k}=\frac{1}{B}{\sum }_{b = 1}^{B}IQ{R}_{k,b}\), mean IQR
\(SD({IQR}_{k})=\sqrt{\frac{1}{B-1}\mathop{\sum }_{b = 1}^{B}{(IQ{R}_{k,b}-{\bar{IQR}}_{k})}^{2}}\), IQR standard deviation
\(C{I}_{0.95}(IQ{R}_{k})={\bar{IQR}}_{k}\pm \frac{{Z}_{1-\alpha /2}SD({IQR}_{k})}{\sqrt{B}}\), IQR Wald 95% confidence interval,
where Z1−α/2 ≈ 1.96 is the (1 − α/2) quantile of the standard normal distribution at α = 0.05.
Step 4: remove outlier taxonomy
if 0 ∈ CI0.95(IQRk) then
The k’th column of X is considered outlier taxonomy and excluded
end if
end for
From the resampled relative abundance data, a 95% confidence interval (CI) for the IQR values was computed. Any taxonomy with an IQR confidence interval containing 0 is considered rare and is therefore discarded, whereas taxonomies with IQR confidence intervals not containing 0 are retained. The method was applied because, even with the addition of the lasso penalty, low-abundance genera with large random OTU spikes may produce spurious correlations. Still, setting a fixed cutoff is problematic, as there are no universally accepted lower abundance thresholds83,84. Consequently, we applied a data-driven thresholding strategy via the IQR bootstrapping approach.
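Algorithm 1 can be sketched with the Python standard library as shown below. This is an illustrative re-implementation rather than the study's R code: the function names, the example matrix, and the choice of the `inclusive` quantile method are assumptions. The quantile definition in particular may differ from the one used by the authors.

```python
import math
import random
import statistics


def iqr(values):
    """Interquartile range: 75% quantile minus 25% quantile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def bootstrap_iqr_ci(values, n_boot=5000, z=1.96, rng=None):
    """Wald 95% confidence interval for the bootstrap mean of the IQR."""
    rng = rng or random.Random(0)
    n = len(values)
    # Steps 1-2: resample with replacement and compute the IQR of each sample.
    iqrs = [iqr(rng.choices(values, k=n)) for _ in range(n_boot)]
    # Step 3: mean, standard deviation, and Wald interval of the IQR values.
    mean_iqr = statistics.fmean(iqrs)
    sd_iqr = statistics.stdev(iqrs)
    half_width = z * sd_iqr / math.sqrt(n_boot)
    return mean_iqr - half_width, mean_iqr + half_width


def filter_rare_genera(abundance, genera, n_boot=5000, seed=0):
    """Step 4: keep only genera whose IQR confidence interval excludes 0."""
    rng = random.Random(seed)
    kept = []
    for k, name in enumerate(genera):
        column = [row[k] for row in abundance]
        lower, upper = bootstrap_iqr_ci(column, n_boot=n_boot, rng=rng)
        if not (lower <= 0.0 <= upper):
            kept.append(name)
    return kept


# Hypothetical relative abundances: 12 samples (rows) x 2 genera (columns).
common = [0.10, 0.30, 0.20, 0.50, 0.40, 0.60, 0.25, 0.35, 0.45, 0.15, 0.55, 0.05]
abundance = [[value, 0.0] for value in common]
print(filter_rare_genera(abundance, ["genus_common", "genus_absent"], n_boot=200))
```

A genus that is absent from (nearly) all samples has an IQR of 0 in every bootstrap sample, so its interval collapses onto 0 and the genus is removed.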
Graphical lasso model and networks
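To derive the sparse network analysis visualized in Figs. 8 and 9, the graphical lasso model was applied to the correlation matrix of the filtered and pre-processed PACBIO data. The main principles are outlined here, with reference to the works of Friedman85,86 for a more detailed description. The idea behind the graphical lasso model is to omit edges (correlations) from a network by controlling the number of zero entries of the precision matrix (or inverse covariance matrix, Θ) via the lasso penalty; see Friedman85 for more details. The graphical lasso model can be defined through the following optimization problem:
\[\hat{{{\boldsymbol{\Theta }}}}=\arg \mathop{\max }_{{{\boldsymbol{\Theta }}}\succ 0}\left\{\log \det {{\boldsymbol{\Theta }}}-{{\rm{tr}}}({{\boldsymbol{S}}}{{\boldsymbol{\Theta }}})-\lambda \parallel {{\boldsymbol{\Theta }}}{\parallel }_{1}\right\}\qquad (3)\]
Here, S denotes the empirical correlation matrix of the data, λ ≥ 0 is the lasso penalty parameter, and \(\parallel {{\boldsymbol{\Theta }}}{\parallel }_{1}\) is the elementwise \({\ell }_{1}\) norm of Θ.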
The optimization problem of Eq. (3) does not have an analytical solution. Element-wise coordinate descent is therefore applied, identifying one parameter (one entry of the precision matrix) at a time. For more details, we refer to the work of Mazumder and Hastie87, which reviews the original algorithm while implementing additional improvements.
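The networks are constructed from the sparse precision matrix estimated via the graphical lasso model. The sparse covariance matrix is estimated as
\[\hat{{{\boldsymbol{\Sigma }}}}={\hat{{{\boldsymbol{\Theta }}}}}^{-1}\qquad (4)\]
from which the sparse correlation matrix (ρ) may be obtained as
\[{\rho }_{ij}=\frac{{\hat{\Sigma }}_{ij}}{\sqrt{{\hat{\Sigma }}_{ii}{\hat{\Sigma }}_{jj}}}\qquad (5)\]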
The sparse correlation matrix is then utilized as the edges of the networks, with each genus being a network node.
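The step from an estimated precision matrix to network edges can be illustrated with a toy example in plain Python. The sketch assumes that the covariance matrix is obtained by inverting the precision matrix and that the correlation matrix is the covariance matrix standardized by its diagonal. The 3 × 3 precision matrix, the genus names, and the helper functions are hypothetical. Estimating the precision matrix itself (the graphical lasso fit) is not shown.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def precision_to_correlation(theta):
    """Covariance = inverse precision; correlation = standardized covariance."""
    sigma = invert(theta)
    n = len(sigma)
    return [
        [sigma[i][j] / math.sqrt(sigma[i][i] * sigma[j][j]) for j in range(n)]
        for i in range(n)
    ]


def network_edges(corr, names, tol=1e-8):
    """List node pairs whose correlation is non-zero (above a tolerance)."""
    n = len(names)
    return [
        (names[i], names[j], corr[i][j])
        for i in range(n) for j in range(i + 1, n)
        if abs(corr[i][j]) > tol
    ]


# Hypothetical sparse precision matrix: genus_C is disconnected from the rest.
theta = [[2.0, -1.0, 0.0], [-1.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
corr = precision_to_correlation(theta)
print(network_edges(corr, ["genus_A", "genus_B", "genus_C"]))
```

In this toy case only the pair genus_A and genus_B forms an edge, with a correlation of about 0.5, while genus_C remains an isolated node.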
To enhance the insights for each network, the Integrated Value of Influence88,89 (ivi) is evaluated for each connected node in the sparse networks. The ivi score was applied since it combines six centrality metrics88,89 into a score that considers both local and global centrality. We refer to the works of Salavaty88 covering the details of the ivi score and to the works of Klein90 for readers not too familiar with graph network theory.
Statistical analysis
The significance level for the hypothesis tests was set to α = 0.05 and was further adjusted for multiple hypothesis testing via the Tukey honestly significant difference (HSD) test when a factor had more than two levels91. For the subsequent tuning of the linear adjustment model, Type II ANOVA was applied based on a series of nested χ2 hypothesis tests92. Note that the principle of hierarchical modeling was applied, retaining non-significant main effects when they are part of significant interaction terms79. The model diagnostics of the linear abiotic adjustment model are reported in the supplementary.
Code
Pre-processing of the satellite images was carried out in Python (version 3.10.9)93, while all remaining data processing, machine learning tasks, and statistical evaluations were carried out in R (version 4.1.2)94. The source code for both is available on our GitHub (https://github.com/Mabso1/Artimate), and the software packages used are listed in the supplementary material, together with references and a brief description of how each was used.
Data availability
All pre-processed data applied within this study is located in our GitHub repository: https://github.com/Mabso1/Artimate. For the raw data, we refer to the original sources.
Code availability
All code applied for this study is located in our GitHub repository: https://github.com/Mabso1/Artimate.
References
Viana, C. M. & Rocha, J. Evaluating dominant land use/land cover changes and predicting future scenario in a rural region using a memoryless stochastic method. Sustainability 12, 4332 (2020).
Mohamed, E. et al. Smart farming for improving agricultural management. Egyptian J. Remote Sens. Space Sci. 24, 971–981 (2021).
Im, J. & Jensen, J. R. Hyperspectral remote sensing of vegetation. Geogr. Compass 2, 1943–1961 (2008).
Singh, A. P., Yerudkar, A., Mariani, V., Iannelli, L. & Glielmo, L. A bibliometric review of the use of unmanned aerial vehicles in precision agriculture and precision viticulture for sensing applications. Remote Sensing 14 (2022).
Abhiram, M., Kuppili, J. & Manga, N. Smart farming system using iot for efficient crop growth. In 2020 IEEE International Students’ Conference on Electrical, Electronics and Computer Science (SCEECS), 1–4 (2020).
Jiménez-Jiménez, S. I. et al. Vical: Global calculator to estimate vegetation indices for agricultural areas with landsat and sentinel-2 data. Agronomy 12, 1518 (2022).
Sishodia, R. P., Ray, R. L. & Singh, S. K. Applications of remote sensing in precision agriculture: A review. Remote Sensing 12, 3136 (2020).
Segarra, J., Buchaillot, M. L., Araus, J. L. & Kefauver, S. C. Remote sensing for precision agriculture: Sentinel-2 improved features and applications. Agronomy 10, 641 (2020).
Giri, C., Pengra, B., Long, J. & Loveland, T. R. Next generation of global land cover characterization, mapping, and monitoring. Int. J. Appl. Earth Observ. Geoinf. 25, 30–37 (2013).
Chen, J. et al. Global land cover mapping at 30m resolution: A pok-based operational approach. ISPRS J. Photogramm. Remote Sens. 103, 7–27 (2015).
Martinis, S., Wieland, M. & Rättich, M. Chapter 2 - An automatic system for near-real time flood extent and duration mapping based on multi-sensor satellite data. In Earth Observation for Flood Applications, Earth Observation, (ed. Schumann, G. J.-P.) 7–37 (Elsevier, 2021).
Qian, S.-E. Hyperspectral satellites, evolution, and development history. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 14, 7032–7056 (2021).
Zeng, Y. et al. Optical vegetation indices for monitoring terrestrial ecosystems globally. Nat. Rev. Earth Environ. 3, 477–493 (2022).
Rouse, J. W., Haas, R. H., Schell, J. A. & Deering, D. W. et al. Monitoring vegetation systems in the great plains with erts. NASA Spec. Publ. 351, 309 (1974).
Liu, H. Q. & Huete, A. A feedback based modification of the ndvi to minimize canopy background and atmospheric noise. IEEE Trans. Geosci. Remote Sens. 33, 457–465 (1995).
Chen, J. M., Rich, P. M., Gower, S. T., Norman, J. M. & Plummer, S. Leaf area index of boreal forests: Theory, techniques, and measurements. J. Geophys. Res.: Atmospheres 102, 29429–29443 (1997).
Gitelson, A. A., Viña, A., Ciganda, V., Rundquist, D. C. & Arkebauer, T. J. Remote estimation of canopy chlorophyll content in crops. Geophys. Res. Lett. 32, https://doi.org/10.1029/2005GL022688 (2005).
Development of an online indices database: Motivation, concept, and implementation. In Proc. 6th EARSeL Imaging Spectroscopy SIG Workshop Innovative Tool for Scientific and Commercial Environment Applications, 16–18 (2009).
Jinru, X. & Su, B. Significant remote sensing vegetation indices: A review of developments and applications. J. Sens. 2017, 1–17 (2017).
Stamford, J. D., Vialet-Chabrand, S., Cameron, I. & Lawson, T. Development of an accurate low cost ndvi imaging system for assessing plant health. Plant Methods 19, 9 (2023).
Klimavičius, L., Rimkus, E., Stonevičius, E. & Mačiulyte, V. Seasonality and long-term trends of ndvi values in different land use types in the eastern part of the baltic sea basin. Oceanologia 65, 171–181 (2023).
Dabrowska-Zielinska, K., Kogan, F., Ciolkosz, A., Gruszczynska, M. & Kowalik, W. Modelling of crop growth conditions and crop yield in poland using avhrr-based indices. Int. J. Remote Sens. 23, 1109–1123 (2002).
Zhao, H. et al. Monitoring cyanobacteria bloom in dianchi lake based on ground-based multispectral remote-sensing imaging: Preliminary results. Remote Sensing 13, 3970 (2021).
Zhang, J.-f., Liu, H.-b., Wu, W. & Fan, L. Correlation analysis of NDVI and meteorological variables. In PIAGENG 2010: Photonics and Imaging for Agricultural Engineering (ed. Tan, H.), vol. 7752 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, 77521K (2011).
Cabrera-Bosquet, L. et al. Ndvi as a potential tool for predicting biomass, plant nitrogen content and growth in wheat genotypes subjected to different water and nitrogen conditions. Cereal Res. Commun. 39, 147–159 (2011).
Gamon, J. A. et al. Relationships between ndvi, canopy structure, and photosynthesis in three californian vegetation types. Ecol. Appl. 5, 28–41 (1995).
Marques Ramos, A. P. et al. A random forest ranking approach to predict yield in maize with uav-based vegetation spectral indices. Comput. Electron. Agric. 178, 105791 (2020).
Stepchenko, A. & Chizhov, J. Ndvi short-term forecasting using recurrent neural networks. Environ. Technol. Resour. Proc. Int. Sci. Pract. Conf. 3, 180 (2015).
Carvalho, S., Putten, W. & Hol, G. The potential of hyperspectral patterns of winter wheat to detect changes in soil microbial community composition. Front. Plant Sci. 7, https://doi.org/10.3389/fpls.2016.00759 (2016).
Costa, D. et al. Soil fertility impact on recruitment and diversity of the soil microbiome in sub-humid tropical pastures in northeastern brazil. Sci. Rep. 14, 3919 (2024).
Chamizo, S., Mugnai, G., Rossi, F., Certini, G. & De Philippis, R. Cyanobacteria inoculation improves soil stability and fertility on different textured soils: Gaining insights for applicability in soil restoration. Front. Environ. Sci. 6, 49 (2018).
Choi, B., Lee, J., Park, B. & Sungjong, L. A study of cyanobacterial bloom monitoring using unmanned aerial vehicles, spectral indices, and image processing techniques. Heliyon 9, e16343 (2023).
Skidmore, A. K. et al. Mapping the relative abundance of soil microbiome biodiversity from edna and remote sensing. Sci. Remote Sens. 6, 100065 (2022).
Charya, L. S. & Garg, S. Chapter 19 - advances in methods and practices of ectomycorrhizal research. In Meena, S. N. & Naik, M. M. (eds.) Advances in Biological Science Research, 303–325 (Academic Press, 2019).
Hyde, K. et al. The amazing potential of fungi: 50 ways we can exploit fungi industrially. Fungal Diversity 1–136, https://doi.org/10.1007/s13225-019-00430-9 (2019).
Jr, N. et al. Integrating GPS, GIS, and remote sensing technologies with disease management principles to improve plant health, 59–90 (CRC Press, 2011).
Yamauchi, D. H. et al. Soil mycobiome is shaped by vegetation and microhabitats: A regional-scale study in southeastern brazil. J. Fungi 7, 587 (2021).
Liu, S. et al. Phylotype diversity within soil fungal functional groups drives ecosystem stability. Nat. Ecol. Evolut. 6, 1–10 (2022).
du Plessis, W. Linear regression relationships between ndvi, vegetation and rainfall in etosha national park, namibia. J. Arid Environ. 42, 235–260 (1999).
Labouyrie, M. et al. Patterns in soil microbial diversity across europe. Nat. Commun. 14, 3311 (2023).
O, F. U. et al. Lucas 2018 soil module. In LUCAS 2018 Soil Module, KJ-NA-31-144-EN-N (online) (Publications Office of the European Union, Luxembourg, 2022).
Hersbach, H. et al. ERA5 hourly data on single levels from 1940 to present (2023).
European Commission, Joint Research Centre (JRC). LUCAS Copernicus 2018. European Commission, Joint Research Centre (JRC) [Dataset] (2018). PID: http://data.europa.eu/89h/cfe66a0c-bdee-4074-96e1-a2f7030b9515.
Sinergise. Sentinel hub. https://www.sentinel-hub.com/ (2023).
Beck, H. et al. Present and future köppen-geiger climate classification maps at 1-km resolution. Sci. Data 5, 180214 (2018).
Minasny, B., McBratney, A. B., Brough, D. M. & Jacquier, D. Models relating soil ph measurements in water and calcium chloride that incorporate electrolyte concentration. Eur. J. Soil Sci. 62, 728–732 (2011).
Abarenkov, K. et al. The UNITE database for molecular identification and taxonomic communication of fungi and other eukaryotes: sequences, taxa,and classifications reconsidered. Nucleic Acids Res. 1039 (2023).
Bálint, M. et al. Millions of reads, thousands of taxa: microbial community structure and associations analyzed via marker genes. FEMS Microbiol. Rev. 40, 686–700 (2016).
Cheng, Z. et al. Cortinarius and tomentella fungi become dominant taxa in taiga soil after fire disturbance. J. Fungi 9, 1113 (2023).
Lilleskov, E. A. & Bruns, T. D. Spore dispersal of a resupinate ectomycorrhizal fungus, tomentella sublilacina, via soil food webs. Mycologia 97, 762–769 (2005).
Hallas-Møller, M., Nielsen, K. F. & Frisvad, J. C. Secondary metabolite production by cereal-associated penicillia during cultivation on cereal grains. Appl. Microbiol. Biotechnol. 102, 8477–8491 (2018).
Liu, J., Pattey, E. & Jégo, G. Assessment of vegetation indices for regional crop green lai estimation from landsat images over multiple growing seasons. Remote Sens. Environ. 123, 347–358 (2012).
Tottrup, C. & Rasmussen, M. S. Mapping long-term changes in savannah crop productivity in senegal through trend analysis of time series of remote sensing data. Agric. Ecosyst. Environ. 103, 545–560 (2004).
Loozen, Y. et al. Mapping canopy nitrogen in european forests using remote sensing and environmental variables with the random forests method. Remote Sens. Environ. 247, 111933 (2020).
Kariyeva, J. & Van Leeuwen, W. J. D. Environmental drivers of ndvi-based vegetation phenology in central asia. Remote Sens. 3, 203–246 (2011).
Zhang, Y. et al. Prediction of soil organic carbon based on landsat 8 monthly ndvi data for the jianghan plain in hubei province, China. Remote Sens. 11, 1683 (2019).
Fathollahi, L., Wu, F., Melaki, R., Jamshidi, P. & Sarwar, S. Global normalized difference vegetation index forecasting from air temperature, soil moisture and precipitation using a deep neural network. Appl. Comput. Geosci. 23, 100174 (2024).
Piao, S. et al. Leaf onset in the northern hemisphere triggered by daytime temperature. Nat. Commun. 6, 6911 (2015).
Huang, S., Huang, Q., Leng, G., Zhao, M. & Meng, E. Variations in annual water-energy balance and their correlations with vegetation and soil moisture dynamics: A case study in the wei river basin, China. J. Hydrol. 546, 515–525 (2017).
Ozimek, E. & Hanaka, A. Mortierella species as the plant growth-promoting fungi present in the agricultural soils. Agric. 11, 7 (2021).
Casella, G. Split plot designs. Statistical Design 171–241 (2008).
Huang, F. et al. Canker disease intensifies cross-kingdom microbial interactions in the endophytic microbiota of citrus phyllosphere. Phytobiomes J. 7, 365–374 (2023).
Li, F. et al. Mortierella elongata’s roles in organic agriculture and crop growth promotion in a mineral soil. Land Degrad. Dev. 29, 1642–1651 (2018).
Zhang, K. et al. Mortierella elongata increases plant biomass among non-leguminous crop species. Agronomy 10, 754 (2020).
De Notaris, C., Øster Mortensen, E., Sørensen, P., Olesen, J. E. & Rasmussen, J. Cover crop mixtures including legumes can self-regulate to optimize n2 fixation while reducing nitrate leaching. Agriculture, Ecosyst. Environ. 309, 107287 (2021).
Liu, S. et al. Phylotype diversity within soil fungal functional groups drives ecosystem stability. Nat. Ecol. Evolut. 6, 900–909 (2022).
Mallon, C. A. et al. Resource pulses can alleviate the biodiversity-invasion relationship in soil microbial communities. Ecology 96, 915–926 (2015).
Main-Knorn, M. et al. Sen2cor for sentinel-2. In Image and signal processing for remote sensing XXIII, vol. 10427, 37–48 (SPIE, 2017).
EROS. Usgs eros archive-landsat archives-landsat 8-9 oli/tirs collection 2 level-2 science products (2020).
Hurtik, P. & Madrid, N. Bilinear interpolation over fuzzified images: Enlargement. In 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 1–8 (2015).
Gascon, F. et al. Copernicus sentinel-2a calibration and products validation status. Remote Sensing 9, 1–81 (2017).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011).
Rivers, A. R., Weber, K. C., Gardner, T. G., Liu, S. & Armstrong, S. D. Itsxpress: Software to rapidly trim internally transcribed spacer sequences with quality scores for marker gene analysis. F1000Research 7 (2018).
Abarenkov, K. et al. Unite uchime 9.0 reference data (2022).
Abarenkov, K. et al. Full unite+insd dataset for fungi (2023).
Rognes, T., Flouri, T., Nichols, B., Quince, C. & Mahé, F. Vsearch: A versatile open source tool for metagenomics. PeerJ 4, https://doi.org/10.7717/peerj.2584 (2016).
Tedersoo, L. et al. The global soil mycobiome consortium dataset for boosting fungal diversity research. Fungal Diversity 111, 573–588 (2021).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Madsen, H. & Thyregod, P. Introduction to General and Generalized Linear Models, 302 (CRC Press, 2011).
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer series in statistics (Springer, 2009).
Legendre, P. & Gallagher, E. Ecologically meaningful transformations for ordination of species data. OECOLOGIA 129, 271–280 (2001).
Davison, A. C. & Hinkley, D. V. Bootstrap Methods and Their Applications (Cambridge University Press, Cambridge, 1997).
Yuanyuan, X., Chen, H., Yang, J., Liu, M. & Huang, B. Distinct patterns and processes of abundant and rare eukaryotic plankton communities following a reservoir cyanobacterial bloom. ISME J. 12, 2263–2277 (2018).
Logares, R. et al. Patterns of rare and abundant marine microbial eukaryotes. Curr. Biol. 24, 813–821 (2014).
Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2007).
Lauritzen, S. L. Graphical models, vol. 17 (Clarendon Press, 1996).
Mazumder, R. & Hastie, T. The graphical lasso: New insights and alternatives. Electron. J. Stat. 6, 2125–2149 (2012).
Salavaty, A., Ramialison, M. & Currie, P. D. Integrated value of influence: An integrative method for the identification of the most influential nodes within networks. Patterns 1, 100052 (2020).
Pavlopoulos, G. A. et al. Using graph theory to analyze biological networks. BioData Min. 4, 1–27 (2011).
Klein, D. J. Centrality measure in graphs. J. Math. Chem. 47, 1209–1223 (2010).
Tukey, J. W. Comparing individual means in the analysis of variance. Biometrics 99–114 (1949).
Langsrud, O. Anova for unbalanced data: Use type II instead of type III sums of squares. Stat. Comput. 13, 163–167 (2003).
van Rossum, G. Python tutorial. Technical Report CS-R9526, Centrum voor Wiskunde en Informatica (CWI) (1995).
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2021).
Acknowledgements
We thank the Villum Foundation for funding this study under the project grant number 00050095. In addition, we thank the NoR Foundation (a European Space Agency initiative) for providing access to the Sentinel Hub platform under project grant number 4517rH: ArtiMATE. D.F. acknowledges funding from the Danish National Research Foundation (DNRF137) as part of the Center for Microbial Secondary Metabolites (CeMiSt) and funding by the Novo Nordisk Foundation (NNF20CC0035580).
Author information
Contributions
Sequence data pre-processing was done by D.F. with the final data format set up by D.F. and M.B.S. Satellite data and tabular data was pre-processed and fused by M.B.S. Initial data exploration was done by M.B.S, E.D.J and G.S. In addition E.D.J, M.K.J and G.S provided knowledge of plant-fungal interactions, classifying pathogenic genera from plant beneficial genera. All statistical and machine learning models applied throughout the paper was done by M.B.S. The selection of models and the development of the analytical workflow were developed by L.K.H.C. and M.B.S. The initial manuscript draft, all figures and tables were made by M.B.S, while all authors reviewed the manuscript. The overall project scope, ideas, and funding were initialized and obtained by M.K.J. and L.K.H.C.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Earth & Environment thanks Hoa Thi Pham and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Alice Drinkwater. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sørensen, M.B., Faurdal, D., Schiesaro, G. et al. Exploring crop health and its associations with fungal soil microbiome composition using machine learning applied to remote sensing data. Commun Earth Environ 6, 355 (2025). https://doi.org/10.1038/s43247-025-02330-0
DOI: https://doi.org/10.1038/s43247-025-02330-0