Exploring crop health and its associations with fungal soil microbiome composition using machine learning applied to remote sensing data

Sørensen, Mathies Brinks; Faurdal, David; Schiesaro, Giovanni; Jensen, Emil Damgaard; Jensen, Michael Krogh; Clemmensen, Line Katrine Harder

doi:10.1038/s43247-025-02330-0

Article
Open access
Published: 07 May 2025

Mathies Brinks Sørensen¹,
David Faurdal ORCID: orcid.org/0000-0001-8871-1357²,
Giovanni Schiesaro ORCID: orcid.org/0000-0002-6309-4337²,
Emil Damgaard Jensen²,
Michael Krogh Jensen² &
…
Line Katrine Harder Clemmensen^1,3

Communications Earth & Environment volume 6, Article number: 355 (2025)

Abstract

Global food security is increasingly challenged by climate change and unsustainable agriculture, emphasizing the need for strategies to enhance crop productivity. Understanding the interplay between crop health and soil microbiomes is crucial. This study explores the link between crop health, observed via multi-spectral satellite imagery, and fungal soil microbiome taxonomy. We associate the normalized difference vegetation index with fungal microbiomes in wheat, barley, and maize using a two-step machine learning process. The first step adjusts normalized difference vegetation index values for abiotic confounders using a random forest model trained on LUCAS 2018 topsoil and ERA5 climate datasets. The second step clusters operational taxonomic unit counts from fungal DNA, revealing significant differences in residual normalized difference vegetation index values. To identify potential bio-fertilizer candidates, we compare the average relative abundance of operational taxonomic unit clusters and construct sparse biological networks. Key findings are: (I) clusters with higher plant pathogenic genera have lower normalized difference vegetation index values; (II) clusters with higher influential scores for multiple beneficial genera have higher normalized difference vegetation index values; (III) lower abundance taxonomy (1-3%) seems to regulate microbial networks; (IV) the influence of beneficial vs. pathogenic taxonomy is relative to their abundance. The study links satellite imagery to fungal microbiomes, providing a baseline for exploring fungal bio-fertilizers.

Introduction

The world relies on agricultural food production as one of the primary sources of nutrition¹, making it a critical area for global food security. With an ever-growing population, increasing changes to the climate, and a need for more sustainable production, it is necessary to identify sustainable methods for optimizing agricultural yield and health. One such method is the development of smart farming², which can aid in optimizing crop yield and health by monitoring individual field conditions, making near real-time adjustments possible. Smart farming integrates sensors or remote sensing technologies, such as satellite imagery³ and drones⁴, with data analytics and decision support systems⁵. A wide array of data sources and techniques are available to monitor agricultural fields via satellite imagery. Two of the satellites commonly used in agricultural imagery are Sentinel-2 and Landsat 8, which are both considered to be of medium resolution⁶. Lower-resolution satellites, such as the MODIS satellite, are also applied in agriculture, although they often do not provide enough accuracy for precision agriculture^6,7,8,9,10. MODIS does have the upside of daily samples, whilst Landsat 8 and Sentinel-2 only sample every sixteenth and fifth day, respectively¹¹. Apart from resolution differences, these satellites all provide multi-spectral images^3,12. For agriculture, the multi-spectral images are mostly used to compute vegetation indices (VIs)¹³. Some of the most applied indices are the normalized difference vegetation index (NDVI)¹⁴, the enhanced vegetation index (EVI)¹⁵, the leaf area index (LAI)¹⁶, and the Chlorophyll Index¹⁷. For a more extensive review of multispectral VIs, we refer to the Index Database¹⁸ and the work of Xue and Su¹⁹.

One of the most commonly used VIs for assessing crop health is NDVI, which is calculated as the normalized difference of the near-infrared (NIR) and red bands from multi-spectral images, (NIR − Red)/(NIR + Red). NDVI ranges from –1 to 1, where higher values indicate healthy crops with higher vegetation density^19,20. Several factors have been reported to impact NDVI, including soil moisture²¹, temperature²², seasonality²³, climate²⁴, nutrients²⁵, and crop type²⁶. These factors are, for the most part, well covered and described in the literature and mainly linked to NDVI via non-linear models^23,27,28. However, studies linking remote sensing with microbial soil composition remain limited compared to those examining nutrient effects.
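As a minimal sketch of the index itself, NDVI for a pixel can be computed from its NIR and red reflectance and then averaged over a region of interest. The function names and the handling of undefined pixels below are illustrative assumptions, not the processing chain used in this study.

```python
def ndvi(nir, red):
    """Return NDVI = (NIR - Red) / (NIR + Red) for one pixel."""
    denom = nir + red
    if denom == 0:
        return float("nan")  # undefined when both bands are zero
    return (nir - red) / denom


def mean_ndvi(pixels):
    """Average NDVI over (nir, red) pairs, skipping undefined pixels.

    `pixels` is a hypothetical list of reflectance pairs for one region.
    """
    values = [ndvi(nir, red) for nir, red in pixels]
    valid = [v for v in values if v == v]  # NaN is the only value != itself
    return sum(valid) / len(valid) if valid else float("nan")
```

Dense, healthy vegetation reflects strongly in NIR and absorbs red light, which pushes the value towards 1.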

In the work of Carvalho²⁹, species richness was concluded to be positively linked to NDVI, but high variance across sampling sites led to high uncertainty in the conclusions. A more recent study suggests similar findings, linking higher richness and specific bacteria to higher soil fertility³⁰. Still, that study did not include the effects of climate and weather conditions when estimating the effect of the bacteria. Another study investigated the role of Cyanobacteria and showed that, after inoculation over 90 days, NDVI was significantly higher in fields with added Cyanobacteria than in fields without Cyanobacteria³¹. Other studies confirm that Cyanobacteria play a role in NDVI measurements^23,32, but these are mainly linked to Cyanobacteria blooming in lakes. In another study, the alpha diversity of bacteria was linked to hyperspectral imagery using the Airborne Visible InfraRed Imaging Spectrometer-Next Generation (AVIRIS-NG)³³. A combination of linear discriminant analysis and partial least squares regression was applied to determine the dominant bacterial families’ abundance from airborne hyperspectral imagery³³. This confirms that a link between microbial diversity and remote sensing is possible.

Apart from bacteria, fungi are of particular interest since ectomycorrhizal associations³⁴ could be applied in agriculture as a means of reducing traditional fertilization in favor of bio-fertilization³⁵. This property makes the remote sensing connection to soil fungal communities very attractive, as cheap satellite imagery may provide field diagnostics indicating whether particular parts of a fungal community are lacking. However, the link between NDVI (or other VIs) and the soil fungal composition is only sparsely covered in the literature. In the work of Nutter³⁶, NDVI was applied to find the epicenters of soybean rust disease caused by the fungus Phakopsora pachyrhizi. Another study links higher NDVI values with a larger species richness, having a larger proportion of Onygenales³⁷. The effect of richness is also evident in the work of Liu, which concludes that soil decomposers seem to play a role in higher NDVI values and in retaining community stability³⁸. In a recent study, the alpha diversity of fungi and bacteria in forests was investigated using hyperspectral imagery from the DESIS satellite. The study concluded that high fungal richness (hot spots) is linked to environmental parameters such as pH and landscape type. This makes remote sensing an attractive opportunity for the monitoring of crops.

However, each of these studies is on a relatively small scale (few fields), and the data have similar spatio-temporal properties (fields in the same area sampled in the same time interval). In addition, only a limited influence of climate is accounted for, which may lead to nuisance effects influencing the NDVI response, particularly across spatio-temporal data.

Our study aims to explore the associations of the fungal soil microbiome composition and the NDVI response while accounting for the effects occurring in data across regions and one growth season. We propose a two-step methodology: first, we adjust NDVI values for abiotic influences, and then link the residual NDVI values to fungal biotic variance. In the initial step, a model is used to adjust NDVI values for abiotic factors, based on previous findings^25,39. Following this adjustment, the adjusted NDVI values are analyzed in relation to the fungal soil microbiome (see Fig. 1 for a graphical outline).

**Fig. 1: Graphical overview of the paper.**

Summing up, our contributions are (I) A methodology that adjusts for abiotic factors prior to investigating biotic relations to the residual NDVI. (II) Demonstrating a significant difference in NDVI values for different clusters of observed fungal microbiome compositions. (III) The Mortierella genus has the strongest influence regardless of its abundance; the presence of multiple influential plant pathogenic genera is associated with lower NDVI, while multiple influential beneficial genera are associated with higher NDVI.

Data description

This paper utilized multiple data sources to investigate the features affecting the NDVI responses. Specifically, the datasets used within our analysis were the 2018 LUCAS biodiversity dataset⁴⁰, the LUCAS 2018 topsoil dataset⁴¹, and the ERA5 Copernicus climate dataset⁴². The remote sensing data contained the 2018 Copernicus LUCAS multi-polygons⁴³ and our own polygons for 2015, which were used to generate Sentinel-2 and Landsat 8 satellite images from which the mean NDVI values were extracted for regions of interest (ROIs) via Sentinel Hub⁴⁴ (see the Satellite images subsection for details). For an overview of how the aforementioned datasets were combined and how pre-processing was carried out, the data linkage and processing are visually outlined in section 1.1 of the supplementary material. Furthermore, the pre-processing of the acquired satellite images and bioinformatics data is outlined in the Data pre-processing subsection. In the following subsections, the tabular datasets are described in detail.

Abiotic data

Multiple abiotic factors, such as soil nutrients, crop type, soil type, seasons, and climate conditions, are all known to influence the NDVI values and should therefore be taken into account when estimating the influence of the microbial composition. The LUCAS 2018 topsoil dataset consists of a total of 7430 unique crop-related samples with 10 features (see Table 1) describing the soil composition at different sampling sites, each with a unique set of latitude and longitude coordinates across Europe⁴¹. In total, 41 unique crop types are found in the data, and for our study, we focus on the three most abundant crop types: common wheat, barley, and maize. To obtain additional geospatial information, each sampling site was assigned to a climate zone based on the Köppen-Geiger climate zone classification system⁴⁵. An additional filtration based on months was done, removing the months of winter (December, January, February), early spring (March, April), and late autumn (October and November), as the images mostly consisted of barren soil, snow, and clouds. Finally, the data was reduced by leaving out climate zones with fewer than 5 observations, which removed the observations from the BSh and ET zones. The final data consists of 2245 observations and is visualized for each crop type in Fig. 2C. The same procedure was carried out for the Landsat 8 2015 and 2018 datasets, with the results depicted in Fig. 2A, B.
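The filtering steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the record keys (`crop`, `month`, `climate_zone`) are hypothetical names, and climate zones are counted after the crop and month filters, which is one plausible reading of the order described.

```python
from collections import Counter

# Months removed in the text: winter, early spring, and late autumn.
EXCLUDED_MONTHS = {12, 1, 2, 3, 4, 10, 11}
CROPS = {"common wheat", "barley", "maize"}
MIN_ZONE_OBS = 5  # climate zones with fewer observations are dropped


def filter_samples(samples):
    """Keep target crops in growing-season months and well-populated zones.

    `samples` is a list of dicts with hypothetical keys
    'crop', 'month' (1-12), and 'climate_zone' (Koppen-Geiger code).
    """
    kept = [
        s for s in samples
        if s["crop"] in CROPS and s["month"] not in EXCLUDED_MONTHS
    ]
    zone_counts = Counter(s["climate_zone"] for s in kept)
    return [s for s in kept if zone_counts[s["climate_zone"]] >= MIN_ZONE_OBS]
```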

Table 1 Summary of LUCAS top-soil 2018 abiotic data variables

In addition to the topsoil composition affecting the observed NDVI values, climate data were considered. The climate data were obtained from the ERA5 Copernicus dataset, which contains detailed climate records from 1940 to the present day⁴². For our purpose, we chose variables linked to the properties of soil, which were soil temperature (K), soil moisture (m³), soil type (categorical, 6 soil types in total), and air temperature (K). The climate data were grouped such that the yearly average air temperature, soil moisture, and soil temperature were linked to each NDVI observation. Additional temporal information for the aforementioned variables was also included in the form of the previous month’s average values. A summarized overview of all abiotic variables from the year 2018 is provided in Table 1, with the year 2015 found in the supplementary material section 1.2. For a complete overview of the distribution of each variable of Table 1, the reader is referred to the supplementary material section 1.2.
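To illustrate how the two temporal features could be derived for one variable at one site, the sketch below assumes monthly means are already available and keyed by (year, month). The input layout and the function name are assumptions made for illustration; the actual ERA5 extraction is not reproduced here.

```python
def climate_features(monthly_means, year, month):
    """Return the yearly average and the previous month's average.

    `monthly_means` maps (year, month) -> monthly mean of one variable
    (e.g. soil temperature in K) at one sampling site. This layout is a
    hypothetical simplification of the climate records.
    """
    in_year = [v for (y, _), v in monthly_means.items() if y == year]
    yearly_mean = sum(in_year) / len(in_year)
    # The month before January is December of the preceding year.
    prev_key = (year, month - 1) if month > 1 else (year - 1, 12)
    return {"yearly_mean": yearly_mean, "prev_month_mean": monthly_means[prev_key]}
```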

In the topsoil data (Table 1), the variable of water-based pH was excluded due to a high correlation with the CaCl₂ based pH index (see the correlation plot found in supplementary section 1.3). The calcium chloride pH was chosen as it has been reported to be less affected by soil electrolyte concentration and produces more consistent measurements⁴⁶. The variables of CaCO₃, Aluminum Oxalate, and Iron Oxalate were also excluded from the analysis as the inclusion would reduce the data size to approximately one-third of the original size due to missing observations.

Biodiversity data

The final dataset included in this paper was the 2018 LUCAS biodiversity dataset⁴⁰. The data contains a total of 885 bio-samples, each with a unique barcode ID (1-885), from which fungal ITS DNA sequences were obtained; 347 of these samples were linked to cropland. For the crops of interest (wheat, barley, and maize) within the specified time period, a total of 115 samples were present. The taxonomy was classified from the raw DNA based on the UNITE-INSDc 9.0 database⁴⁷ (see the Data pre-processing subsection for details regarding the classification and processing of the raw DNA data). To ensure the quality of the classification, a cutoff based on the operational taxonomic unit (OTU) counts was set at 1, filtering out any sample with OTU counts lower than this threshold⁴⁸ (see the ITS PacBio sequencing subsection for details on how the OTU counts were generated).

Results

Modeling of abiotic effects

The regression results on the test set, modeling the abiotic factors' influence on the NDVI values, show that the RF outperforms the linear regression model with lower root mean squared error (RMSE) and higher explained variance, R² (Table 2). Both models are based on the 16 variables in Table 1.
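For reference, the two comparison metrics can be written out as below. This is a generic sketch that takes R² to be the usual coefficient of determination, which is an assumption about how the explained variance in Table 2 was computed.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted NDVI."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

A lower RMSE and a higher R² on the held-out test set both point to the same model, which is the pattern reported for the RF.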

Table 2 Model comparisons overview

The final linear regression models, including up to two-factor interaction terms, are detailed in Section 1.6 of the supplementary material for each model (see Table 2). The hyperparameter values for the final RF model are provided in Section 1.7 of the supplementary material.

For the RF model, the observed NDVI values vs the predicted NDVI values showed good association, and we conclude that the model adequately captures dependencies within the data when comparing the test and full dataset to the true NDVI values (section 1.8 of supplementary material). Furthermore, we investigated the residuals based on single crop type predictions for every model listed in Table 2, finding that no inherent bias towards any single crop type is present (supplementary section 1.10). The residual analysis for every model likewise found no larger variance across any of the less-represented levels of the categorical and numerical variables listed in Table 1 (see supplementary section 1.10). We observed that the most important variables are the season and crop type. In contrast, soil type and potassium levels seem to be less important (Fig. 3 and supplementary material section 1.9). To summarize, the results show a robust estimation of abiotic influences via the RF model. This step ensures a reliable quantification of residual NDVI, allowing for a clearer investigation of biological relationships.

**Fig. 3: Visualization of the RF model variable importance plot (Model A-2).**

OTU data analysis

In the second step of our two-step process, we identified two unique clusters of specific taxonomy through unsupervised clustering of OTU samples (Fig. 4). The samples are approximately equally distributed between the first (Cluster 1) and the second cluster (Cluster 2).

As a sensitivity analysis, we additionally carried out a clustering, which focused on each crop individually (wheat, maize, and barley - supplementary material section 1.5). We identified two clusters for wheat with 30 and 19 observations, respectively. Due to low sample sizes, no meaningful patterns were identified for the separate clustering of barley and maize.

The clusters (including the separate wheat clusters) were further analyzed by connecting the residual NDVI values of the samples to the fungal taxonomy present in each cluster.

Microbiome impact modeling

OTU Cluster 1 showed significantly higher residual NDVI values than Cluster 2 (Fig. 5 and Table 2 in supplementary material section 1.4). The results for the raw NDVI values for each cluster are found in the supplementary section 1.4 for both Landsat 8 and Sentinel-2.
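The comparison can be illustrated with a small sketch: residual NDVI is the observed value minus the abiotic model's prediction, and the residuals of the two clusters are then compared. The statistical test used in the study is not specified in this section, so the two-sided permutation test on the difference in means below is a swapped-in illustration, and all names are hypothetical.

```python
import random


def residuals(observed, predicted):
    """Residual NDVI: observed value minus the abiotic model prediction."""
    return [o - p for o, p in zip(observed, predicted)]


def _mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed difference, p-value). Illustrative only; this is
    not necessarily the test applied in the study.
    """
    rng = random.Random(seed)
    observed = _mean(group_a) - _mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of cluster labels
        diff = _mean(pooled[:n_a]) - _mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```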

Similarly, for the wheat clusters, we found significant differences between the two clusters (supplementary material table 5, section 1.5). The average relative abundance analysis (Fig. 6) showed that genera associated with plant health (such as Tomentella and Mortierella^49,50) were more abundant in Cluster 1 (higher NDVI), while pathogenic genera, such as Fusarium, were more prevalent in Cluster 2 (lower NDVI). Specifically, Fusarium was more abundant in maize and wheat within Cluster 2, with similar levels in barley across both clusters.

**Fig. 6: Comparison of the top 30 most abundant genera per crop based on the average relative abundance of each unique taxon within each cluster for all crops.**

Abundance analysis for the wheat clusters revealed a similar pattern, with pathogenic genera being more abundant for clusters with lower NDVI (section 1.12 in the supplementary material).
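A minimal sketch of the average relative abundance comparison is given below, assuming each sample is a mapping from genus to OTU count and each sample carries a cluster label. The data layout and function names are illustrative assumptions; samples are assumed to have at least one non-zero count.

```python
from collections import defaultdict


def relative_abundance(otu_counts):
    """Convert one sample's genus -> OTU count mapping into fractions."""
    total = sum(otu_counts.values())
    return {genus: count / total for genus, count in otu_counts.items()}


def mean_relative_abundance(samples, clusters):
    """Average relative abundance of each genus within each cluster.

    A genus absent from a sample contributes zero to the cluster average,
    because sums are divided by the number of samples in the cluster.
    """
    sums = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for counts, label in zip(samples, clusters):
        sizes[label] += 1
        for genus, fraction in relative_abundance(counts).items():
            sums[label][genus] += fraction
    return {
        label: {genus: s / sizes[label] for genus, s in genera.items()}
        for label, genera in sums.items()
    }
```

Working with fractions rather than raw counts keeps samples with different sequencing depths comparable.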

Using our bootstrapping approach (algorithm 1), we identified 99 and 114 genera (Cluster 1 and 2, respectively) as non-outlier taxa (outlier taxa being low-abundance genera with large, random OTU spikes) (Fig. 7).

Network models of the non-outlier taxonomy show that many of the taxonomy correlations have been penalized by the sparse lasso model, retaining few connected nodes for both Cluster 1 and 2 (Figs. 8, 9). Results on the optimal regularization parameter are provided in the supplementary section 1.11.

**Fig. 8: Illustration of the connections in the network based on the sparse lasso graphical model with 99 genera located in the first cluster.**

Only a few strong negative correlations are present in the network for Cluster 1 (Fig. 8), forming a denser network compared to the network of Cluster 2. The network of Cluster 2 is smaller with fewer strong connections compared to the Cluster 1 network (Fig. 9). A notable strong negative correlation appears in both networks between the Fusarium and Russula genera.

**Fig. 9: Illustration of the connections in the network based on the sparse lasso graphical model with 114 genera located in the second cluster.**

Network analysis

Analyzing each sparse network showed that Mortierella is the most influential genus for both clusters (Table 3), having the highest abundance at approximately 15% in each cluster. In Cluster 1, the second most influential genus is Cortinarius, which shows a higher abundance than in Cluster 2, where Fusarium ranks second. Both genera appear highly influential in each cluster but have different abundance and Integrated Value of Influence (ivi) scores. Low-abundance genera (Trechispora, Schizothecium, and Serendipita) are also assigned high ivi scores and are of high influence in Cluster 2. They are present in Cluster 1 as well, but at a much lower abundance (below 0.5% in all cases).

Table 3 Top 10 node influence based on the Integrated Value of Influence

In wheat-specific clusters, Mortierella was identified as the most influential genus across clusters, despite not being the most abundant (supplementary material section 1.15). In the higher NDVI cluster, Funneliformis was the most abundant genus but did not contribute to network influence, whereas lower-abundance genera such as Neoschizothecium and Curvularia demonstrated higher influence. Additionally, the plant pathogenic genera Fusarium and Penicillium⁵¹ showed the highest ivi scores in the lower NDVI cluster. In the wheat-specific Cluster 2 (lower NDVI), the same beneficial genera were present as in Cluster 1 (higher NDVI), but with lower ivi scores (except for Mortierella).

Discussion

When comparing the variable importance of our RF model with the literature, we find the following agreements. Previous studies have shown that the canopy structure of vegetation influences the NDVI with a non-linear relationship^26,52. As for seasons, it has been shown that NDVI is temporally affected, with a maximum NDVI around harvest months⁵³. In terms of nutrient influences, Loozen⁵⁴ used RF modeling to demonstrate a relationship between NDVI and nitrogen content in forest canopies. Furthermore, the soil carbon content has been identified to have a non-linear link with NDVI^55,56. In addition, the literature reports that meteorological variables (e.g., temperatures and soil moisture)^57,58,59 are highly important with respect to NDVI, in agreement with our findings. We note that the temporal resolution of the satellite imagery may affect the results, as the sampling date does not always correspond with the satellite recording date. Thus, the Sentinel-2 imagery may be more accurate with respect to the NDVI responses, as it has a higher temporal resolution. In the future, it would be interesting to repeat our analysis on PlanetScope data, which has daily recordings, although at a higher cost than Sentinel-2 and Landsat 8. Finally, the results of our abiotic model confirm previous findings in the literature, and the model serves as an important step towards removing abiotic influence from the NDVI responses.

Even so, our model is still subject to certain limitations. First, our model is limited as most of the meteorological observations are confined to moisture and temperature levels in the range 0.1–0.4 m³ and 280–290 K. This implies that the model may not be able to handle drought or very high temperatures. Second, our model has limited temporal coverage, as only the year 2018 is covered for biological data, possibly making the clustering susceptible to temporal distribution shifts. The temporal robustness and/or evolution is worth investigating in future studies.

For future research, it would be valuable to include a wider spatio-temporal area, including multiple years and using large-scale geospatial data with higher resolution than the Landsat 8 (30 m) and Sentinel-2 (10 m) satellites. This is particularly interesting, as differences have been observed between modeling based on Sentinel-2 and Landsat 8 data within the same year. An increased resolution could provide a more accurate representation of biotic effects. However, a significant limitation of higher-resolution imagery is its cost, as high-resolution data is much more expensive. It is also worth noting that the 2022 LUCAS Topsoil data is forthcoming. Although it is not yet publicly accessible, it could be used to perform an analysis with Sentinel-2 similar to that of the Landsat 8 analysis performed in this study. We strongly encourage such future external validations.

Our study mainly focuses on investigating the fungal impact on NDVI values in heterogeneous spatial areas across Europe. Overall, our findings demonstrate that the fungal taxonomic composition is significantly associated with the NDVI values adjusted for abiotic influence.

From the LUCAS data alone, we cannot explicitly state which mechanism may be behind the regularization of the networks (Figs. 8, 9). We can infer that high abundance is not the sole criterion for network influence, although it seems to play a role, as the most influential genera (Table 3) are mainly comprised of genera with high average relative abundance. However, lower-abundance genera seem to have a much higher impact relative to their presence.

Moving further from our exploratory analysis, the highly influential taxonomy may provide a baseline candidate list for further studies investigating the fungal interaction both in vivo and in vitro. For instance, it would be interesting to investigate the effects on the NDVI values when Mortierella has minimal presence vs being highly abundant. From the literature, it is known that Mortierella has been found very beneficial for agriculture, in general⁶⁰. However, to the best of our knowledge, no larger-scale controlled field experiments have been conducted with different levels of Mortierella. This could be carried out with the statistical split-plot design type⁶¹, which is frequently utilized in agronomy for investigating the influence of other factors, such as the impact of fertilizer types or irrigation methods. Likewise, the negative correlation between Russula and Fusarium is interesting to explore further, as the same relation is identified in the study of canker disease for citrus fruits⁶², revealing Russula to counteract the influence of Fusarium causing the disease. This relation has, to the best of our knowledge, never been reported in crops, and further investigation is required to assess our finding.

Further relevant actions would be to analyze the metabolomics profile of the soil communities and pinpoint the taxonomic origin of metabolites positively or negatively affecting plant growth. It should be noted that taxonomy is only assigned down to the genus level; hence, species information is lost, and the above discussion is based solely on the most common occurrence of fungal species in maize, wheat, and barley. For future studies, the acquisition of species-level taxonomy would enrich the analysis, as specific species linked to NDVI may be identified. In addition, an investigation of the rhizosphere of the plants would enrich our results further, as some fungal-plant interactions are known to work through the rhizosphere. Specifically, this investigation could reveal whether bulk soil content plays a mediating role in shaping fungal community composition by influencing root exudates, microbial recruitment, or nutrient availability within the rhizosphere. This is particularly relevant as Mortierella, which is known for having soil-plant nexus interactions^63,64, seems to be a central genus for regulation of the microbiome. However, this is very difficult to investigate, as the metagenome of a single field is subject to high spatial biological variance even when sampling within a few meters of the same site. Aggregating samples across a grid may help, but outliers could skew results. Hence, further studies are best done in a controlled lab with few crops and limited field space.

Our analysis demonstrates the potential of relating satellite imagery and the composition of fungal soil microbiomes across datasets characterized by varying spatiotemporal attributes. However, the biological analysis is based solely on a total of 115 bio-samples, which currently limits the construction of robust models for predicting NDVI based on the soil microbiome. Therefore, more bio-samples should be collected in accordance with the LUCAS initiative to provide a ground truth foundation for prediction models. Hence, our approach reveals, in an exploratory sense, that certain microbiome compositions may affect the overall observed vegetation health, which potentially opens the way for selective microbial fertilization. One potential benefit of establishing this link is that expensive soil sample analysis can be replaced by satellite imagery. This would not only save time and resources but also enable much faster fine-tuning of microbial compositions for individual fields. An interesting approach would be to combine fungal fertilization with well-established bio-fertilization methods, such as crop rotation and the introduction of cover crops like legumes, which promote biodiversity by introducing nitrogen-fixing bacteria into the soil⁶⁵. As noted by Liu et al.⁶⁶, fungal functional diversity plays a crucial role in ecosystem stability, suggesting that the addition of specific fungal taxa could enhance the benefits already provided by practices like crop rotation and cover crops. Furthermore, a consequence of introducing fungal bio-fertilization may be increased soil taxonomic stability, making the system more robust to invasive crop pathogens⁶⁷.

In the end, this fine-tuning may aid in an optimized agricultural yield³⁵ through the increase of crop health, aiding in food security. Furthermore, the possibility of using selective microbial fertilizers may not only aid in achieving higher crop yields but ultimately replace some parts of the conventional fertilization process³⁵, saving resources and opting for a more sustainable future through green farming.

Methods

Data pre-processing

Satellite images

To ensure consistent atmospheric correction, the satellite images were corrected for atmospheric disturbances using the sen2LA product rather than manually correcting the sen2L1C product⁶⁸. For the Landsat 8 images, atmospheric disturbances were corrected using the level 2 OT product⁶⁹. Furthermore, we removed images with more than 5% cloud cover by applying the Sentinelhub API cloud filtering algorithm⁶⁸. This reduced the data from 7430 crop samples to 5410 crop samples. To ensure band resolution was uniform at a targeted 10 meters, we used bilinear interpolation to resample each image⁷⁰. The same procedure was applied to the Landsat 8 imagery. To ensure the pixel band values were not affected by environmental artifacts, we applied masking based on the Sentinel-2 scene classification maps (SCM)⁷¹, which removes pixels containing dark areas, snow, smaller clouds, and water from each sample tile. From the filtered, interpolated images, the mean NDVI values were aggregated based on the pixels found in each multi-polygon. Sampling dates did not always match the Sentinel-2 or Landsat 8 temporal recording scheme. To meet this challenge, a search range of 3 weeks before and after the sampling date was utilized. Imagery that did not meet the 5% cloud cover cutoff was discarded in the search process. The image closest to the sampling date was chosen to represent NDVI values for a sample. When no suitable imagery could be found in the temporal search span, the sample was removed from the final data.
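The temporal matching rule described above can be sketched as a small selection function. The image catalogue is represented here as hypothetical (acquisition date, cloud cover percent) pairs; the real search was carried out through Sentinel Hub, whose API is not reproduced in this sketch.

```python
from datetime import date, timedelta

SEARCH_WINDOW = timedelta(weeks=3)  # 3 weeks before and after sampling
MAX_CLOUD_COVER = 5.0               # percent


def pick_image(sampling_date, images):
    """Return the admissible image closest in time to the sampling date.

    `images` is a list of (acquisition_date, cloud_cover_percent) tuples.
    Returns None when no image qualifies, in which case the sample would
    be removed from the final data.
    """
    candidates = [
        (acquired, cloud)
        for acquired, cloud in images
        if abs(acquired - sampling_date) <= SEARCH_WINDOW
        and cloud <= MAX_CLOUD_COVER
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda image: abs(image[0] - sampling_date))
```

Note that a cloudy image taken on the sampling date itself is skipped in favor of a clearer image a few days away.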

ITS PacBio sequencing

An initial quality filtering was performed by discarding reads containing more than 1 ambiguous base and more than 2 expected errors. Adapters were trimmed and read orientation corrected using CutAdapt 2.10⁷². ITSxpress⁷³ was then used to extract ITS regions. The UNITE 9.0 UCHIME reference dataset⁷⁴ was used for reference-based chimera removal. Sequences were clustered using open reference clustering by including the ITS sequences extracted from the UNITE-INSDc 9.0 database⁷⁵ using ITSx. The sequences were clustered at 98% similarity using VSEARCH⁷⁶, applying the “--cluster_smallmem --usersort” arguments to prioritize full-length UNITE-INSDc and PacBio sequences before partial sequences as described by Tedersoo et al.⁷⁷. A sample-by-OTU table was produced using VSEARCH, and OTUs shorter than 250 nucleotides were discarded. Representative sequences from each OTU were classified by Megablast queries using BLAST 2.12.0, applying taxon-specific e-value and sequence similarity thresholds as described by Tedersoo et al.⁷⁷. Conservatively, OTUs were classified to the level of genus.
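The initial quality filter can be illustrated with the standard definition of expected errors, the sum of per-base error probabilities 10^(-Q/10) over the Phred scores of a read. The sketch assumes that a read is discarded when it exceeds either threshold, which is one reading of the sentence above, and it does not reproduce the actual tool used.

```python
# IUPAC nucleotide codes other than A, C, G, and T count as ambiguous.
AMBIGUOUS = set("NRYSWKMBDHV")


def expected_errors(phred_scores):
    """Sum of per-base error probabilities, 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def passes_filter(sequence, phred_scores, max_ambiguous=1, max_ee=2.0):
    """Keep a read with at most 1 ambiguous base and 2 expected errors."""
    n_ambiguous = sum(base in AMBIGUOUS for base in sequence.upper())
    return n_ambiguous <= max_ambiguous and expected_errors(phred_scores) <= max_ee
```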

Supervised machine learning and statistical modeling

We built two models that adjust the NDVI for abiotic confounders/contributions. We compared an RF model⁷⁸ and a linear regression model⁷⁹, using the variables listed in Table 1 as the predictors and NDVI as the response (see the linear regression model results in the supplementary for the parameters included in the fully reduced regression model). The RF model parameters were tuned using 5 repetitions of 5-fold cross-validation (CV). For the 2018 Sentinel-2 data, we split the observations into 20% for one-off unseen testing and 80% for training via cross-validation. The Landsat 8 data underwent four splitting scenarios based on different combinations of the datasets from 2015 and 2018. In the first scenario, the 2018 dataset was split into 20% for testing and 80% for training. The second scenario involved combining the 2018 and 2015 datasets into a single dataset, which was then split in the same ratio of 20% testing and 80% training. The third scenario treated the 2015 and 2018 datasets as independent, with 2015 serving as the test data and 2018 as the training data. The final scenario mirrored the third but swapped the roles of the datasets, using 2018 for testing and 2015 for training. In the repeated CV on the training data, the RF hyperparameters (variable splits, number of trees, and minimal terminal node size) were selected by a grid search over a range of values (see the supplementary material), choosing the combination with the minimal RMSE. We estimated the expected error on the unseen test data for the selected set of hyperparameter values. The final model used for adjusting the NDVI for abiotic confounders was constructed by fitting a model with the selected hyperparameters to the full data (see 2). An analysis of residuals for the RF model is also included. For readers new to RF models, it should be noted that the residuals are not subject to the same assumptions as those of traditional statistical models (e.g., independent and identically distributed errors)⁷⁸. However, the analysis is included to test whether any bias is present in the predictions for less-represented levels of variables.
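The resampling scheme above (a 20% hold-out, followed by 5 repetitions of 5-fold CV on the remaining 80% and a grid search by RMSE) can be sketched in plain Python. This only illustrates the index bookkeeping. The study itself was carried out in R, and `holdout_split`, `repeated_kfold`, `grid_search`, and the `fit_and_score` callback are hypothetical names. The callback stands in for fitting an RF on the training indices and returning the validation RMSE.

```python
import random


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the sample indices and split them into (train, test)."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_kfold(indices, n_splits=5, n_repeats=5, seed=0):
    """Yield (training, validation) index lists for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            training = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield training, folds[k]


def grid_search(param_grid, splits, fit_and_score):
    """Return the parameter set with the lowest mean validation RMSE."""
    best_params, best_rmse = None, float("inf")
    for params in param_grid:
        scores = [fit_and_score(params, train, val) for train, val in splits]
        mean_rmse = sum(scores) / len(scores)
        if mean_rmse < best_rmse:
            best_params, best_rmse = params, mean_rmse
    return best_params, best_rmse


train_idx, test_idx = holdout_split(100)
splits = list(repeated_kfold(train_idx))  # 5 x 5 = 25 splits
```

The test indices are touched only once, after the hyperparameters have been selected, which keeps the reported test error an estimate on unseen data.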

The linear regression model included two-factor interaction terms and was reduced according to the principles described in the Statistical analysis subsection and the results section. The models were compared using the test set errors and the errors estimated from making predictions onto the full dataset. Both models were evaluated based on the explained variance (R²) and root mean squared error (RMSE).
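For reference, the two evaluation metrics can be written out as follows. This is a small sketch that assumes the common definition R² = 1 − SS_res / SS_tot, and the function names are illustrative rather than taken from the study's code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Explained variance, 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean NDVI has an R² of 0 under this definition, while a perfect prediction gives an R² of 1 and an RMSE of 0.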

The estimated linear regression model coefficients $\hat{{{\boldsymbol{\beta }}}}$ were computed by least squares⁷⁹.

To mitigate over-fitting, the significance of each linear model regression coefficient was determined by ANOVA tests (see the Statistical Analysis subsection). Model diagnostics were also evaluated to ensure adequate fulfillment of the underlying assumptions. For a more extensive theoretical review of the linear model, we refer to the work of Madsen⁷⁹.

To adjust the NDVI values for confounders, the NDVI values predicted by the abiotic RF model (Y_RF; see the Modeling of abiotic effects subsection), which represent the non-microbiome-related variation, were subtracted from the raw NDVI values (Y_raw), yielding the residual NDVI values (Y_residual):

$${{{\boldsymbol{Y}}}}_{residual}={{{\boldsymbol{Y}}}}_{raw}-{{{\boldsymbol{Y}}}}_{{{\boldsymbol{RF}}}},$$

(1)

Subsequently, we investigate the association between the residual NDVI values and clusters derived from the microbiome sequencing data (Fig. 4).

Clustering

We used hierarchical clustering to create clusters from the pre-processed PacBio data in an unsupervised manner. We based the clustering on the Euclidean distance and chose the Agnes method, which uses the agglomerative coefficient⁸⁰ to identify the best linkage. Hierarchical clustering operates on a dissimilarity (distance) matrix, in which each row/column represents an observation and each entry reflects how dissimilar two observations are. The Agnes method iteratively merges the closest clusters based on these dissimilarities until the chosen number of clusters is reached. To identify the optimal number of clusters, we applied the average silhouette width, where the silhouette width s(i) of observation i is defined as:

$$s(i)=\frac{b(i)-a(i)}{\max [a(i),b(i)]}$$

(2)

Here, a(i) is the average within-cluster distance between observation i and all other observations of the same cluster, and b(i) is the average between-cluster distance between observation i and the observations assigned to the neighboring (nearest) cluster. The value s(i) expresses how well observation i is clustered, and the clustering is considered optimal when all clusters contain observations above the average silhouette width (see the work of Friedman⁸⁰ for more details on how to estimate between- and within-cluster distances). Note that, prior to clustering, the data were standardized using the Hellinger transformation⁸¹.
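The two ingredients described here, the Hellinger transformation and the silhouette width, can be sketched with NumPy. The sketch uses the standard silhouette form s(i) = (b(i) − a(i)) / max[a(i), b(i)] and takes cluster labels as given, so the hierarchical clustering itself is not reproduced. The function names are hypothetical, and every sample is assumed to have a non-zero total count.

```python
import numpy as np


def hellinger_transform(counts):
    """Square root of the row-wise relative abundances."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)  # assumed non-zero
    return np.sqrt(counts / row_sums)


def silhouette_widths(X, labels):
    """Silhouette width s(i) for each observation, Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    widths = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the observation itself
        if not same.any():
            continue  # singleton cluster: s(i) is left at 0
        a = dist[i, same].mean()
        # b(i): smallest mean distance to any other cluster.
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        widths[i] = (b - a) / max(a, b)
    return widths


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
widths = silhouette_widths(X, [0, 0, 1, 1])
average_width = widths.mean()
```

Comparing `average_width` across candidate numbers of clusters is one way to apply the selection criterion described above.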

Bootstrapping

The non-parametric bootstrap approach⁸² was used to filter rare taxa from the pre-processed PacBio data. A total of 5000 bootstrap samples were generated for each genus within each cluster. The interquartile range (IQR) of the generated samples was then used for filtering: genera whose IQR confidence interval contained 0 were removed. The full process is outlined mathematically in Algorithm 1.

Algorithm 1

Genus filtration pseudo code

Initial inputs:

A matrix $\boldsymbol{X}\in \mathbb{R}^{N\times K}$ of OTU abundances, with N rows (samples) and K columns (genera).

for k = 1 to K do:

Step 1: Draw B = 5000 bootstrap samples, X^*, with replacement from the k^th column of X:

$X_{k}^{*}=[X_{k,1}^{*},X_{k,2}^{*},\ldots ,X_{k,B}^{*}]$, with $X_{k}^{*}\in \mathbb{R}^{N\times B}$,

where $X_{k,b}^{*}=(x_{1,k}^{(b)},x_{2,k}^{(b)},\ldots ,x_{N,k}^{(b)})^{T}$, with $X_{k,b}^{*}\in \mathbb{R}^{N\times 1}$.

Step 2: Estimate the IQR of each bootstrap sample of the k^th genus, i.e., of each column of $X_{k}^{*}$:

$IQR_{k,b}=q_{3}(X_{k,b}^{*})-q_{1}(X_{k,b}^{*})$, collected as $IQR_{k}=(IQR_{k,1},\ldots ,IQR_{k,B})^{T}\in \mathbb{R}^{B\times 1}$,

where q₃ and q₁ indicate the 75% and 25% quantiles, respectively.

Step 3: From the B bootstrap IQR values of the k^th genus (IQR_k), estimate:

$\overline{IQR}_{k}=\frac{1}{B}\sum_{b=1}^{B}IQR_{k,b}$, the mean IQR

$SD(IQR_{k})=\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}(IQR_{k,b}-\overline{IQR}_{k})^{2}}$, the IQR standard deviation

$CI_{0.95}(IQR_{k})=\overline{IQR}_{k}\pm \frac{Z_{1-\alpha /2}\,SD(IQR_{k})}{\sqrt{B}}$, the IQR Wald 95% confidence interval,

where Z_{1−α/2} ≈ 1.96 is the (1 − α/2) quantile of the standard normal distribution at α = 0.05.

Step 4: Remove outlier taxonomy

if 0 ∈ CI_0.95(IQR_k) then

The k^th column of X is considered outlier taxonomy and excluded

end if

end for

From the resampled relative abundance data, a 95% confidence interval (CI) for the IQR values was computed. Any taxon with an IQR confidence interval containing 0 was considered rare and therefore discarded, whereas taxa with IQR confidence intervals not containing 0 were retained. The method was applied because, even with the lasso penalty, low-abundance genera with sporadic large OTU spikes may produce spurious correlations. Still, setting a fixed cutoff is problematic, as there are no universally accepted lower values⁸³,⁸⁴. Consequently, we applied a data-driven thresholding strategy via the IQR bootstrapping approach.
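The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration for a single cluster's abundance matrix, not the authors' R implementation. The function name `iqr_bootstrap_filter` and its parameters are hypothetical, and quartiles are computed with NumPy's default (linear interpolation) percentile method, which is an assumption.

```python
import numpy as np


def iqr_bootstrap_filter(X, n_boot=5000, z=1.96, seed=0):
    """Return a boolean mask over genera (columns): True means retained."""
    X = np.asarray(X, dtype=float)
    n_samples, n_genera = X.shape
    rng = np.random.default_rng(seed)
    keep = np.zeros(n_genera, dtype=bool)
    for k in range(n_genera):
        # Step 1: B bootstrap resamples (with replacement) of genus k,
        # stored one resample per row.
        resamples = rng.choice(X[:, k], size=(n_boot, n_samples), replace=True)
        # Step 2: IQR of each bootstrap resample.
        q1, q3 = np.percentile(resamples, [25, 75], axis=1)
        iqr = q3 - q1
        # Step 3: mean, standard deviation, and Wald 95% CI of the IQR.
        mean_iqr = iqr.mean()
        half_width = z * iqr.std(ddof=1) / np.sqrt(n_boot)
        lower, upper = mean_iqr - half_width, mean_iqr + half_width
        # Step 4: a genus whose CI contains 0 is treated as rare.
        keep[k] = not (lower <= 0.0 <= upper)
    return keep


rng = np.random.default_rng(1)
common = rng.uniform(1.0, 10.0, size=100)  # broadly present genus
rare = np.zeros(100)
rare[0] = 50.0                             # a single large spike
X = np.column_stack([common, rare])
mask = iqr_bootstrap_filter(X, n_boot=500)
X_filtered = X[:, mask]
```

In this toy input, the genus consisting of a single spike has a bootstrap IQR of 0 in practically every resample, so its confidence interval contains 0 and it is removed, while the broadly present genus is retained.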

Graphical lasso model and networks

To derive the sparse networks visualized in Figs. 8 and 9, the graphical lasso model was applied to the correlation matrix of the filtered and pre-processed PacBio data. The main principles are outlined here; for a more detailed description, we refer to the works of Friedman et al.⁸⁵ and Lauritzen⁸⁶. The idea behind the graphical lasso model is to omit edges (correlations) from a network by controlling the number of zero entries of the precision matrix (the inverse covariance matrix, Θ) via the lasso penalty. The graphical lasso model can be defined through the following optimization problem:

$$\hat{\Theta }=\arg \min_{\Theta \ge 0}\left(-\log \det (\Theta )+\mathrm{Tr}(S\Theta )+\lambda \lVert \Theta \rVert_{1}\right)$$

(3)

Here, S is the empirical covariance matrix (in our case, the correlation matrix of the filtered data), and λ is the lasso penalty parameter that controls the sparsity of Θ. The optimization problem of Eq. (3) does not have an analytical solution; thus, element-wise coordinate descent is applied, estimating one parameter (one entry of the precision matrix) at a time. For more details, we refer to the work of Mazumder and Hastie⁸⁷, which reviews the original algorithm while implementing additional improvements.
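To make the objective of Eq. (3) concrete, the sketch below evaluates it for a candidate precision matrix. It does not solve the optimization problem. It assumes that the L1 norm sums the absolute values of all entries of Θ, including the diagonal, whereas some implementations penalize only off-diagonal entries. The function name `glasso_objective` is hypothetical.

```python
import numpy as np


def glasso_objective(theta, S, lam):
    """Value of -log det(Theta) + Tr(S Theta) + lam * ||Theta||_1."""
    theta = np.asarray(theta, dtype=float)
    S = np.asarray(S, dtype=float)
    # The log-determinant term requires a positive definite Theta.
    if np.any(np.linalg.eigvalsh(theta) <= 0):
        return np.inf
    _, logdet = np.linalg.slogdet(theta)
    penalty = lam * np.abs(theta).sum()  # assumed: all entries penalized
    return -logdet + np.trace(S @ theta) + penalty


S = np.array([[1.0, 0.5], [0.5, 1.0]])
value_identity = glasso_objective(np.eye(2), S, lam=0.0)
value_inverse = glasso_objective(np.linalg.inv(S), S, lam=0.0)
```

With λ = 0, the objective is minimized by the inverse of S, so `value_inverse` is smaller than `value_identity`. Increasing λ trades this fit for more zero entries in Θ.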

The networks are constructed from the sparse precision matrix estimated via the graphical lasso model. The sparse covariance matrix is estimated as

$$\Sigma ={\Theta }^{-1},$$

(4)

from which the sparse correlation matrix (ρ) may be obtained as

$$\begin{aligned}D&=\sqrt{\mathrm{diag}(\Sigma )}\\ \rho &=D^{-1}\Sigma {(D^{-1})}^{T}\end{aligned}$$

(5)

The non-zero off-diagonal entries of the sparse correlation matrix then define the edges of the networks, with each genus being a network node.
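Equations (4) and (5) and the construction of network edges can be sketched with NumPy as below. The precision matrix here is a hand-made toy input rather than a graphical lasso estimate. The genus names, function names, and the tolerance for treating an entry as zero are all hypothetical.

```python
import numpy as np


def precision_to_correlation(theta):
    """Apply Eq. (4) and Eq. (5): invert Theta, then rescale to correlations."""
    sigma = np.linalg.inv(np.asarray(theta, dtype=float))  # Eq. (4)
    d_inv = 1.0 / np.sqrt(np.diag(sigma))                  # diagonal of D^-1
    return sigma * np.outer(d_inv, d_inv)                  # Eq. (5)


def edge_list(rho, names, tol=1e-8):
    """List (node, node, weight) for non-zero off-diagonal correlations."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(rho[i, j]) > tol:
                edges.append((names[i], names[j], float(rho[i, j])))
    return edges


theta = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, 0.0],
                  [0.0, 0.0, 1.0]])
rho = precision_to_correlation(theta)
edges = edge_list(rho, ["genus_A", "genus_B", "genus_C"])
```

In this toy example only genus_A and genus_B are connected, with a correlation of 0.5, while genus_C remains an isolated node.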

To enhance the insights from each network, the Integrated Value of Influence⁸⁸,⁸⁹ (ivi) was evaluated for each connected node in the sparse networks. The ivi score was applied since it combines six centrality metrics⁸⁸,⁸⁹ into a single score that considers both local and global centrality. We refer to the work of Salavaty et al.⁸⁸ for the details of the ivi score and to the work of Klein⁹⁰ for readers less familiar with graph network theory.

Statistical analysis

The significance level for the hypothesis tests was set to α = 0.05 and was adjusted for multiple hypothesis testing via the Tukey honestly significant difference (HSD) test when the number of factor levels exceeded two⁹¹. For the subsequent tuning of the linear adjustment model, Type 2 ANOVA was applied based on a series of nested χ² hypothesis tests⁹². Note that the principle of hierarchical modeling was used, retaining non-significant main effects when these were part of significant interaction terms⁷⁹. The model diagnostics of the linear abiotic adjustment model are listed, investigated, and reported in the supplementary.

Code

Pre-processing of the satellite images was carried out in Python (version 3.10.9)⁹³, while all remaining data processing, machine learning tasks, and statistical evaluations were carried out in R (version 4.1.2)⁹⁴. The source code for both is available on our GitHub (https://github.com/Mabso1/Artimate), and the specific software packages used are all listed in the supplementary material, including references and a brief description of how each was used.

Data availability

All pre-processed data applied within this study is located in our GitHub repository: https://github.com/Mabso1/Artimate. For the raw data, we refer to the original sources.

Code availability

All code applied for this study is located in our GitHub repository: https://github.com/Mabso1/Artimate.

References

Viana, C. M. & Rocha, J. Evaluating dominant land use/land cover changes and predicting future scenario in a rural region using a memoryless stochastic method. Sustainability 12, 4332 (2020).
Mohamed, E. et al. Smart farming for improving agricultural management. Egyptian J. Remote Sens. Space Sci. 24, 971–981 (2021).
Im, J. & Jensen, J. R. Hyperspectral remote sensing of vegetation. Geogr. Compass 2, 1943–1961 (2008).
Singh, A. P., Yerudkar, A., Mariani, V., Iannelli, L. & Glielmo, L. A bibliometric review of the use of unmanned aerial vehicles in precision agriculture and precision viticulture for sensing applications. Remote Sensing 14 (2022).
Abhiram, M., Kuppili, J. & Manga, N. Smart farming system using iot for efficient crop growth. In 2020 IEEE International Students’ Conference on Electrical, Electronics and Computer Science (SCEECS), 1–4 (2020).
Jiménez-Jiménez, S. I. et al. Vical: Global calculator to estimate vegetation indices for agricultural areas with landsat and sentinel-2 data. Agronomy 12, 1518 (2022).
Sishodia, R. P., Ray, R. L. & Singh, S. K. Applications of remote sensing in precision agriculture: A review. Remote Sensing 12, 3136 (2020).
Segarra, J., Buchaillot, M. L., Araus, J. L. & Kefauver, S. C. Remote sensing for precision agriculture: Sentinel-2 improved features and applications. Agronomy 10, 641 (2020).
Giri, C., Pengra, B., Long, J. & Loveland, T. R. Next generation of global land cover characterization, mapping, and monitoring. Int. J. Appl. Earth Observ. Geoinf. 25, 30–37 (2013).
Chen, J. et al. Global land cover mapping at 30m resolution: A pok-based operational approach. ISPRS J. Photogramm. Remote Sens. 103, 7–27 (2015).
Martinis, S., Wieland, M. & Rättich, M. Chapter 2 - An automatic system for near-real time flood extent and duration mapping based on multi-sensor satellite data. In Earth Observation for Flood Applications, Earth Observation, (ed. Schumann, G. J.-P.) 7–37 (Elsevier, 2021).
Qian, S.-E. Hyperspectral satellites, evolution, and development history. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 14, 7032–7056 (2021).
Zeng, Y. et al. Optical vegetation indices for monitoring terrestrial ecosystems globally. Nat. Rev. Earth Environ. 3, 477–493 (2022).
Rouse, J. W., Haas, R. H., Schell, J. A. & Deering, D. W. et al. Monitoring vegetation systems in the great plains with erts. NASA Spec. Publ. 351, 309 (1974).
Liu, H. Q. & Huete, A. A feedback based modification of the ndvi to minimize canopy background and atmospheric noise. IEEE Trans. Geosci. Remote Sens. 33, 457–465 (1995).
Chen, J. M., Rich, P. M., Gower, S. T., Norman, J. M. & Plummer, S. Leaf area index of boreal forests: Theory, techniques, and measurements. J. Geophys. Res.: Atmospheres 102, 29429–29443 (1997).
Gitelson, A. A., Viña, A., Ciganda, V., Rundquist, D. C. & Arkebauer, T. J. Remote estimation of canopy chlorophyll content in crops. Geophys. Res. Lett. 32, https://doi.org/10.1029/2005GL022688 (2005).
Development of an online indices database: Motivation, concept, and implementation. In Proc. 6th EARSeL Imaging Spectroscopy SIG Workshop Innovative Tool for Scientific and Commercial Environment Applications, 16–18 (2009).
Jinru, X. & Su, B. Significant remote sensing vegetation indices: A review of developments and applications. J. Sens. 2017, 1–17 (2017).
Stamford, J. D., Vialet-Chabrand, S., Cameron, I. & Lawson, T. Development of an accurate low cost ndvi imaging system for assessing plant health. Plant Methods 19, 9 (2023).
Klimavičius, L., Rimkus, E., Stonevičius, E. & Mačiulyte, V. Seasonality and long-term trends of ndvi values in different land use types in the eastern part of the baltic sea basin. Oceanologia 65, 171–181 (2023).
Dabrowska-Zielinska, K., Kogan, F., Ciolkosz, A., Gruszczynska, M. & Kowalik, W. Modelling of crop growth conditions and crop yield in poland using avhrr-based indices. Int. J. Remote Sens. 23, 1109–1123 (2002).
Zhao, H. et al. Monitoring cyanobacteria bloom in dianchi lake based on ground-based multispectral remote-sensing imaging: Preliminary results. Remote Sensing 13, 3970 (2021).
Zhang, J.-f., Liu, H.-b., Wu, W. & Fan, L. Correlation analysis of NDVI and meteorological variables. In PIAGENG 2010: Photonics and Imaging for Agricultural Engineering (ed. Tan, H.), vol. 7752 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, 77521K (2011).
Cabrera-Bosquet, L. et al. Ndvi as a potential tool for predicting biomass, plant nitrogen content and growth in wheat genotypes subjected to different water and nitrogen conditions. Cereal Res. Commun. 39, 147–159 (2011).
Gamon, J. A. et al. Relationships between ndvi, canopy structure, and photosynthesis in three californian vegetation types. Ecol. Appl. 5, 28–41 (1995).
Marques Ramos, A. P. et al. A random forest ranking approach to predict yield in maize with uav-based vegetation spectral indices. Comput. Electron. Agric. 178, 105791 (2020).
Stepchenko, A. & Chizhov, J. Ndvi short-term forecasting using recurrent neural networks. Environ. Technol. Resour. Proc. Int. Sci. Pract. Conf. 3, 180 (2015).
Carvalho, S., Putten, W. & Hol, G. The potential of hyperspectral patterns of winter wheat to detect changes in soil microbial community composition. Front. Plant Sci. 7, https://doi.org/10.3389/fpls.2016.00759 (2016).
Costa, D. et al. Soil fertility impact on recruitment and diversity of the soil microbiome in sub-humid tropical pastures in northeastern brazil. Sci. Rep. 14, 3919 (2024).
Chamizo, S., Mugnai, G., Rossi, F., Certini, G. & De Philippis, R. Cyanobacteria inoculation improves soil stability and fertility on different textured soils: Gaining insights for applicability in soil restoration. Front. Environ. Sci. 6, 49 (2018).
Choi, B., Lee, J., Park, B. & Sungjong, L. A study of cyanobacterial bloom monitoring using unmanned aerial vehicles, spectral indices, and image processing techniques. Heliyon 9, e16343 (2023).
Skidmore, A. K. et al. Mapping the relative abundance of soil microbiome biodiversity from edna and remote sensing. Sci. Remote Sens. 6, 100065 (2022).
Charya, L. S. & Garg, S. Chapter 19 - advances in methods and practices of ectomycorrhizal research. In Meena, S. N. & Naik, M. M. (eds.) Advances in Biological Science Research, 303–325 (Academic Press, 2019).
Hyde, K. et al. The amazing potential of fungi: 50 ways we can exploit fungi industrially. Fungal Diversity 1–136, https://doi.org/10.1007/s13225-019-00430-9 (2019).
Jr, N. et al. Integrating GPS, GIS, and remote sensing technologies with disease management principles to improve plant health, 59–90 (CRC Press, 2011).
Yamauchi, D. H. et al. Soil mycobiome is shaped by vegetation and microhabitats: A regional-scale study in southeastern brazil. J. Fungi 7, 587 (2021).
Liu, S. et al. Phylotype diversity within soil fungal functional groups drives ecosystem stability. Nat. Ecol. Evolut. 6, 1–10 (2022).
du Plessis, W. Linear regression relationships between ndvi, vegetation and rainfall in etosha national park, namibia. J. Arid Environ. 42, 235–260 (1999).
Labouyrie, M. et al. Patterns in soil microbial diversity across europe. Nat. Commun. 14, 3311 (2023).
O, F. U. et al. Lucas 2018 soil module. In LUCAS 2018 Soil Module, KJ-NA-31-144-EN-N (online) (Publications Office of the European Union, Luxembourg, 2022).
Hersbach, H. et al. ERA5 hourly data on single levels from 1940 to present (2023).
European Commission, Joint Research Centre (JRC). LUCAS Copernicus 2018. European Commission, Joint Research Centre (JRC) [Dataset] (2018). PID: http://data.europa.eu/89h/cfe66a0c-bdee-4074-96e1-a2f7030b9515.
Sinergise. Sentinel hub. https://www.sentinel-hub.com/ (2023).
Beck, H. et al. Present and future köppen-geiger climate classification maps at 1-km resolution. Sci. Data 5, 180214 (2018).
Minasny, B., McBratney, A. B., Brough, D. M. & Jacquier, D. Models relating soil ph measurements in water and calcium chloride that incorporate electrolyte concentration. Eur. J. Soil Sci. 62, 728–732 (2011).
Abarenkov, K. et al. The UNITE database for molecular identification and taxonomic communication of fungi and other eukaryotes: sequences, taxa,and classifications reconsidered. Nucleic Acids Res. 1039 (2023).
Bálint, M. et al. Millions of reads, thousands of taxa: microbial community structure and associations analyzed via marker genes. FEMS Microbiol. Rev. 40 5, 686–700 (2016).
Cheng, Z. et al. Cortinarius and tomentella fungi become dominant taxa in taiga soil after fire disturbance. J. Fungi 9, 1113 (2023).
Lilleskov, E. A. & Bruns, T. D. Spore dispersal of a resupinate ectomycorrhizal fungus, tomentella sublilacina, via soil food webs. Mycologia 97, 762–769 (2005).
Hallas-Møller, M., Nielsen, K. F. & Frisvad, J. C. Secondary metabolite production by cereal-associated penicillia during cultivation on cereal grains. Appl. Microbiol. Biotechnol. 102, 8477–8491 (2018).
Liu, J., Pattey, E. & Jégo, G. Assessment of vegetation indices for regional crop green lai estimation from landsat images over multiple growing seasons. Remote Sens. Environ. 123, 347–358 (2012).
Tottrup, C. & Rasmussen, M. S. Mapping long-term changes in savannah crop productivity in senegal through trend analysis of time series of remote sensing data. Agric. Ecosyst. Environ. 103, 545–560 (2004).
Loozen, Y. et al. Mapping canopy nitrogen in european forests using remote sensing and environmental variables with the random forests method. Remote Sens. Environ. 247, 111933 (2020).
Kariyeva, J. & Van Leeuwen, W. J. D. Environmental drivers of ndvi-based vegetation phenology in central asia. Remote Sens. 3, 203–246 (2011).
Zhang, Y. et al. Prediction of soil organic carbon based on landsat 8 monthly ndvi data for the jianghan plain in hubei province, China. Remote Sens. 11, 1683 (2019).
Fathollahi, L., Wu, F., Melaki, R., Jamshidi, P. & Sarwar, S. Global normalized difference vegetation index forecasting from air temperature, soil moisture and precipitation using a deep neural network. Appl. Comput. Geosci. 23, 100174 (2024).
Piao, S. et al. Leaf onset in the northern hemisphere triggered by daytime temperature. Nat. Commun. 6, 6911 (2015).
Huang, S., Huang, Q., Leng, G., Zhao, M. & Meng, E. Variations in annual water-energy balance and their correlations with vegetation and soil moisture dynamics: A case study in the wei river basin, China. J. Hydrol. 546, 515–525 (2017).
Ozimek, E. & Hanaka, A. Mortierella species as the plant growth-promoting fungi present in the agricultural soils. Agric. 11, 7 (2021).
Casella, G. Split plot designs. Statistical Design 171–241 (2008).
Huang, F. et al. Canker disease intensifies cross-kingdom microbial interactions in the endophytic microbiota of citrus phyllosphere. Phytobiomes J. 7, 365–374 (2023).
Li, F. et al. Mortierella elongata’s roles in organic agriculture and crop growth promotion in a mineral soil. Land Degrad. Dev. 29, 1642–1651 (2018).
Zhang, K. et al. Mortierella elongata increases plant biomass among non-leguminous crop species. Agronomy 10, 754 (2020).
De Notaris, C., Øster Mortensen, E., Sørensen, P., Olesen, J. E. & Rasmussen, J. Cover crop mixtures including legumes can self-regulate to optimize n2 fixation while reducing nitrate leaching. Agriculture, Ecosyst. Environ. 309, 107287 (2021).
Liu, S. et al. Phylotype diversity within soil fungal functional groups drives ecosystem stability. Nat. Ecol. Evolut. 6, 900–909 (2022).
Mallon, C. A. et al. Resource pulses can alleviate the biodiversity-invasion relationship in soil microbial communities. Ecology 96, 915–926 (2015).
Main-Knorn, M. et al. Sen2cor for sentinel-2. In Image and signal processing for remote sensing XXIII, vol. 10427, 37–48 (SPIE, 2017).
EROS. Usgs eros archive-landsat archives-landsat 8-9 oli/tirs collection 2 level-2 science products (2020).
Hurtik, P. & Madrid, N. Bilinear interpolation over fuzzified images: Enlargement. In 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 1–8 (2015).
Gascon, F. et al. Copernicus sentinel-2a calibration and products validation status. Remote Sensing 9, 1–81 (2017).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011).
Rivers, A. R., Weber, K. C., Gardner, T. G., Liu, S. & Armstrong, S. D. Itsxpress: Software to rapidly trim internally transcribed spacer sequences with quality scores for marker gene analysis. F1000Research 7 (2018).
Abarenkov, K. et al. Unite uchime 9.0 reference data (2022).
Abarenkov, K. et al. Full unite+insd dataset for fungi (2023).
Rognes, T., Flouri, T., Nichols, B., Quince, C. & Mahé, F. Vsearch: A versatile open source tool for metagenomics. PeerJ 4, https://doi.org/10.7717/peerj.2584 (2016).
Tedersoo, L. et al. The global soil mycobiome consortium dataset for boosting fungal diversity research. Fungal Diversity 111, 573–588 (2021).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Madsen, H. & Thyregod, P. Introduction to General and Generalized Linear Models, 302 (CRC Press, 2011).
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer series in statistics (Springer, 2009).
Legendre, P. & Gallagher, E. Ecologically meaningful transformations for ordination of species data. OECOLOGIA 129, 271–280 (2001).
Davison, A. C. & Hinkley, D. V. Bootstrap Methods and Their Applications (Cambridge University Press, Cambridge, 1997).
Yuanyuan, X., Chen, H., Yang, J., Liu, M. & Huang, B. Distinct patterns and processes of abundant and rare eukaryotic plankton communities following a reservoir cyanobacterial bloom. ISME J. 12, 2263–2277 (2018).
Logares, R. et al. Patterns of rare and abundant marine microbial eukaryotes. Curr. Biol. 24, 813–821 (2014).
Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2007).
Lauritzen, S. L. Graphical models, vol. 17 (Clarendon Press, 1996).
Mazumder, R. & Hastie, T. The graphical lasso: New insights and alternatives. Electron. J. Stat. 6, 2125–2149 (2012).
Salavaty, A., Ramialison, M. & Currie, P. D. Integrated value of influence: An integrative method for the identification of the most influential nodes within networks. Patterns 1, 100052 (2020).
Pavlopoulos, G. A. et al. Using graph theory to analyze biological networks. BioData Min. 4, 1–27 (2011).
Klein, D. J. Centrality measure in graphs. J. Math. Chem. 47, 1209–1223 (2010).
Tukey, J. W. Comparing individual means in the analysis of variance. Biometrics 99–114 (1949).
Langsrud, O. Anova for unbalanced data: Use type II instead of type III sums of squares. Stat. Comput. 13, 163–167 (2003).
van Rossum, G. Python tutorial. Technical Report CS-R9526, Centrum voor Wiskunde en Informatica (CWI) (1995).
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2021).

Acknowledgements

We thank the Villum Foundation for funding this study under the project grant number 00050095. In addition, we thank the NoR Foundation (a European Space Agency initiative) for providing access to the Sentinel Hub platform under project grant number 4517rH: ArtiMATE. D.F. acknowledges funding from the Danish National Research Foundation (DNRF137) as part of the Center for Microbial Secondary Metabolites (CeMiSt) and funding by the Novo Nordisk Foundation (NNF20CC0035580).

Author information

Authors and Affiliations

Technical University of Denmark, Department of Applied Mathematics and Computer Science, 2800, Kgs. Lyngby, Denmark
Mathies Brinks Sørensen & Line Katrine Harder Clemmensen
Technical University of Denmark, The Novo Nordisk Foundation Center for Biosustainability, 2800, Kgs. Lyngby, Denmark
David Faurdal, Giovanni Schiesaro, Emil Damgaard Jensen & Michael Krogh Jensen
University of Copenhagen, Department of Mathematical Sciences, 2100, Copenhagen, Denmark
Line Katrine Harder Clemmensen

Contributions

Sequence data pre-processing was done by D.F., with the final data format set up by D.F. and M.B.S. Satellite data and tabular data were pre-processed and fused by M.B.S. Initial data exploration was done by M.B.S., E.D.J., and G.S. In addition, E.D.J., M.K.J., and G.S. provided knowledge of plant-fungal interactions, classifying pathogenic genera from plant-beneficial genera. All statistical and machine learning modeling throughout the paper was done by M.B.S. The selection of models and the analytical workflow were developed by L.K.H.C. and M.B.S. The initial manuscript draft and all figures and tables were made by M.B.S., while all authors reviewed the manuscript. The overall project scope, ideas, and funding were initiated and obtained by M.K.J. and L.K.H.C.

Corresponding authors

Correspondence to Mathies Brinks Sørensen, Michael Krogh Jensen or Line Katrine Harder Clemmensen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Earth & Environment thanks Hoa Thi Pham and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Alice Drinkwater. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Sørensen, M.B., Faurdal, D., Schiesaro, G. et al. Exploring crop health and its associations with fungal soil microbiome composition using machine learning applied to remote sensing data. Commun Earth Environ 6, 355 (2025). https://doi.org/10.1038/s43247-025-02330-0

Received: 03 July 2024
Accepted: 24 April 2025
Published: 07 May 2025
DOI: https://doi.org/10.1038/s43247-025-02330-0
