Introduction

The world relies on agricultural food production as one of the primary sources of nutrition1, making it a critical area for global food security. With an ever-growing population, a changing climate, and a need for more sustainable production, it is necessary to identify sustainable methods for optimizing agricultural yield and health. One such method is smart farming2, which can aid in optimizing crop yield and health by monitoring individual field conditions, making near real-time adjustments possible. Smart farming integrates sensors or remote sensing technologies, such as satellite imagery3 and drones4, with data analytics and decision support systems5. A wide array of data sources and techniques is available to monitor agricultural fields via satellite imagery. Two of the satellites commonly used in agricultural imagery are Sentinel-2 and Landsat 8, both of which are considered to be of medium resolution6. Lower-resolution satellites, such as the MODIS satellite, are also applied in agriculture, although they often do not provide enough accuracy for precision agriculture6,7,8,9,10. MODIS does have the advantage of daily samples, whilst Landsat 8 and Sentinel-2 only sample every sixteenth and fifth day, respectively11. Apart from resolution differences, these satellites all provide multi-spectral images3,12. For agriculture, the multi-spectral images are mostly used to compute vegetation indices (VIs)13. Some of the most applied indices are the normalized difference vegetation index (NDVI)14, the enhanced vegetation index (EVI)15, the leaf area index (LAI)16, and the Chlorophyll Index17. For a more extensive review of multi-spectral VIs, we refer to the Index Database18 and the work of Xue and Su19.

One of the most commonly used VIs for assessing crop health is NDVI, which is calculated as the normalized difference between the near-infrared (NIR) and red bands of multi-spectral images. NDVI ranges from -1 to 1, where higher values indicate healthy crops with higher vegetation density19,20. Several factors have been reported to impact NDVI, including soil moisture21, temperature22, seasonality23, climate24, nutrients25, and crop type26. These factors are, for the most part, well covered and described in the literature and mainly linked to NDVI via non-linear models23,27,28. However, studies linking remote sensing with microbial soil composition remain limited compared to those examining nutrient effects.
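
As a minimal illustration of the index described above, the sketch below computes NDVI with the standard normalized-difference definition, (NIR - Red) / (NIR + Red), and averages it over the pixels of one region. The function names and the list-based band representation are assumptions made for this example only; they are not taken from the processing pipeline used in this study.

```python
def ndvi(nir, red):
    """Return (NIR - Red) / (NIR + Red) for one pixel, or None when undefined."""
    total = nir + red
    if total == 0:
        # Both reflectances are zero: the ratio is undefined for this pixel.
        return None
    return (nir - red) / total


def mean_ndvi(nir_band, red_band):
    """Average NDVI over the pixels of one region (hypothetical flat lists)."""
    values = [ndvi(n, r) for n, r in zip(nir_band, red_band)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None
```

For non-negative reflectances the result always lies between -1 and 1, matching the range stated above.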

In the work of Carvalho29, species richness was concluded to be positively linked to NDVI, but high variance across sampling sites led to high uncertainty in the conclusions. A more recent study suggests similar findings, linking higher richness and specific bacteria to higher soil fertility30. Still, that study did not include the effects of climate and weather conditions when estimating the effect of the bacteria. Another study investigated the role of Cyanobacteria and showed that, when inoculated over 90 days, NDVI was significantly increased in fields with added Cyanobacteria compared to fields with no Cyanobacteria31. Other studies confirm that Cyanobacteria play a role in NDVI measurements23,32, but these are mainly linked to Cyanobacteria blooming in lakes. In another study, the alpha diversity of bacteria was linked to hyperspectral imagery using the Airborne Visible InfraRed Imaging Spectrometer-Next Generation (AVIRIS-NG)33. A combination of linear discriminant analysis and partial least squares regression was applied to determine the dominant bacterial families' abundance from airborne hyperspectral imagery33. This confirms that linking microbial diversity and remote sensing is possible.

Apart from bacteria, fungi are of particular interest since ectomycorrhizal associations34 could be applied in agriculture as a means of reducing traditional fertilization in favor of bio-fertilization35. This property makes the remote sensing connection to soil fungal communities very attractive, as cheap satellite imagery may provide field diagnostics indicating whether particular parts of a fungal community are lacking. However, the link between NDVI (or other VIs) and the soil fungal composition is only sparsely covered in the literature. In the work of Nutter36, NDVI was applied to find the epicenters of soybean rust disease caused by the fungus Phakopsora pachyrhizi. Another study links higher NDVI values with a larger species richness, having a larger proportion of Onygenales37. The effect of richness is also evident in the work of Liu, which concludes that soil decomposers seem to play a role in higher NDVI values and in retaining community stability38. In a recent study, the alpha diversity of fungi and bacteria in forests was investigated using hyperspectral imagery from the DESIS satellite. The study concluded that high fungal richness (hot spots) is linked to environmental parameters such as pH and landscape type?. This makes remote sensing an attractive opportunity for the monitoring of crops.

However, each of these studies is on a relatively small scale (few fields), and the data have similar spatio-temporal properties (fields in the same area sampled at the same time interval). In addition, only a limited influence of climate is accounted for, which may lead to nuisance effects influencing the NDVI response, in particular across spatio-temporal data.

Our study aims to explore the associations of the fungal soil microbiome composition and the NDVI response while accounting for the effects occurring in data across regions and one growth season. We propose a two-step methodology: first, we adjust NDVI values for abiotic influences, and then link the residual NDVI values to fungal biotic variance. In the initial step, a model is used to adjust NDVI values for abiotic factors, based on previous findings25,39. Following this adjustment, the adjusted NDVI values are analyzed in relation to the fungal soil microbiome (see Fig. 1 for a graphical outline).

Fig. 1: Graphical overview of the paper.

(1) We estimate crop health/growth as NDVI values from satellite images. (2) We adjust the NDVI values by removing abiotic influence through a Random Forest model. (3) We identify clusters of pre-processed biotic data from soil samples using hierarchical clustering and investigate the link between the residual NDVI values and the derived clusters. (4) We filter rare taxa using IQR bootstrapping, then generate sparse networks and identify and investigate nodes of importance in the biotic networks of the clusters.

In summary, our contributions are: (I) a methodology that adjusts for abiotic factors prior to investigating biotic relations to the residual NDVI; (II) a demonstration of a significant difference in NDVI values between clusters of observed fungal microbiome compositions; and (III) the finding that the Mortierella genus has the strongest influence regardless of its abundance, that the presence of multiple influential plant pathogenic genera is associated with lower NDVI, and that multiple influential beneficial genera are associated with higher NDVI.

Data description

This paper utilized multiple data sources to investigate the features affecting the NDVI responses. Specifically, the datasets used within our analysis were the 2018 LUCAS biodiversity dataset40, the LUCAS 2018 topsoil dataset41, and the ERA5 Copernicus climate dataset42. The remote sensing data comprised the 2018 Copernicus LUCAS multi-polygons43 and our own polygons for 2015, which were used to generate Sentinel-2 and Landsat 8 satellite images. The mean NDVI values were then extracted from regions of interest (ROIs) via Sentinel Hub44 (see the Satellite images subsection for details). The way in which the aforementioned datasets were combined and pre-processed is visually outlined in section 1.1 of the supplementary material. Furthermore, the pre-processing of the acquired satellite images and bioinformatics data is outlined in the Data pre-processing subsection. In the following subsections, the tabular datasets are described in detail.

Abiotic data

Multiple abiotic factors, such as soil nutrients, crop type, soil type, seasons, and climate conditions, are all known to influence the NDVI values and should therefore be taken into account when estimating the influence of the microbial composition. The LUCAS 2018 topsoil dataset consists of a total of 7430 unique crop-related samples with 10 features (see Table 1) describing the soil composition at different sampling sites, each with a unique set of latitude and longitude coordinates across Europe41. In total, 41 unique crop types are found in the data, and for our study, we focus on the three most abundant crop types: common wheat, barley, and maize. To obtain additional geospatial information, each sampling site was assigned to a climate zone based on the Köppen-Geiger climate zone classification system45. An additional filtration based on months was done, removing the months of winter (December, January, February), early spring (March, April), and late autumn (October and November), as the images mostly consisted of barren soil, snow, and clouds. Finally, climate zones with fewer than 5 observations were left out, which removed the observations from the BSh and ET zones. The final data consists of 2245 observations and is visualized for each crop type in Fig. 2C. The same procedure was carried out for the Landsat 8 2015 and 2018 datasets, with the results depicted in Fig. 2A, B.
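
The filtering steps described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (dictionaries with the hypothetical keys `crop`, `month`, and `zone`) is invented for the example, and the retained months (May to September) are simply the complement of the months listed as removed.

```python
from collections import Counter

KEPT_CROPS = {"common wheat", "barley", "maize"}
KEPT_MONTHS = {5, 6, 7, 8, 9}  # May to September; all other months are removed
MIN_ZONE_OBSERVATIONS = 5      # climate zones with fewer observations are dropped


def filter_samples(samples):
    """Filter a list of dicts with hypothetical keys 'crop', 'month', 'zone'."""
    # Step 1: keep only the three crops of interest within the retained months.
    kept = [
        s for s in samples
        if s["crop"] in KEPT_CROPS and s["month"] in KEPT_MONTHS
    ]
    # Step 2: count what is left per Köppen-Geiger zone and drop sparse zones.
    zone_counts = Counter(s["zone"] for s in kept)
    return [s for s in kept if zone_counts[s["zone"]] >= MIN_ZONE_OBSERVATIONS]
```

Counting zones only after the crop and month filters means a zone is judged on the observations that would actually enter the model, which is the order assumed here.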

Table 1 Summary of LUCAS top-soil 2018 abiotic data variables
Fig. 2: NDVI time series overview.

A NDVI values (mean within ROIs) over time for Landsat 8 2015. B NDVI values (mean within ROIs) over time for Landsat 8 2018. C NDVI values (mean within ROIs) over time for Sentinel-2 2018. In all panels, data are grouped by crop type and colored according to the Köppen-Geiger climate zone classification. The black line represents a linear trendline showing the progression of NDVI values over time for each crop. Further details of NDVI distribution across crop types are provided in the supplementary section 1.2.

In addition to the topsoil composition affecting the observed NDVI values, climate data were considered. The climate data were obtained from the ERA5 Copernicus dataset, which contains detailed climate records from 1940 to the present day42. For our purpose, we chose variables linked to the properties of soil: soil temperature (K), soil moisture (m3), soil type (categorical, 6 soil types in total), and air temperature (K). The climate data were grouped such that the yearly average air temperature, soil moisture, and soil temperature were linked to each NDVI observation. Additional temporal information for the aforementioned variables was also included in the form of the previous month's average values. A summarized overview of all abiotic variables from the year 2018 is provided in Table 1, with the year 2015 found in the supplementary material section 1.2. For a complete overview of the distribution of each variable of Table 1, the reader is referred to the supplementary material section 1.2.

In the topsoil data (Table 1), the water-based pH variable was excluded due to a high correlation with the CaCl2-based pH index (see the correlation plot found in supplementary section 1.3). The calcium chloride pH was chosen as it has been reported to be less affected by soil electrolyte concentration and to produce more consistent measurements46. The variables of CaCO3, Aluminum Oxalate, and Iron Oxalate were also excluded from the analysis, as their inclusion would reduce the data size to approximately one-third of the original size due to missing observations.

Biodiversity data

The final dataset included in this paper was the 2018 LUCAS biodiversity dataset40. The data contains a total of 885 bio-samples, each with a unique barcode ID (1-885), from which fungal ITS DNA sequences were obtained; 347 of these samples were linked to cropland. For the crops of interest (wheat, barley, and maize) within the specified time period, a total of 115 samples were present. The taxonomy was classified from the raw DNA based on the UNITE-INSDc 9.0 database47 (see the Data pre-processing subsection for details regarding the classification and processing of the raw DNA data). To ensure the quality of the classification, a cutoff based on the operational taxonomic unit (OTU) counts was set at 1, filtering any sample that had OTU counts lower than the set threshold48 (see the ITS PacBio sequencing subsection for details on how the OTU counts were generated).

Results

Modeling of abiotic effects

The regression results on the test set, modeling the abiotic factors' influence on the NDVI values, show that the RF outperforms the linear regression model with a lower root mean squared error (RMSE) and a higher explained variance, R2 (Table 2). Both models are based on the 16 variables in Table 1.
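
For readers less familiar with the two metrics, the sketch below shows how RMSE and R2 can be computed from observed and predicted NDVI values. The R2 definition used here (one minus the ratio of residual to total sum of squares) is an assumption; the exact definition behind Table 2 is not stated in this section, and the function names are illustrative.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Explained variance as 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

A lower RMSE and a higher R2 on the held-out test set both indicate that a model reproduces the observed NDVI values more closely.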

Table 2 Model comparisons overview

The final linear regression models, including up to two-factor interaction terms, are detailed in Section 1.6 of the supplementary material for each model (see Table 2). The hyperparameter values for the final RF model are provided in Section 1.7 of the supplementary material.

For the RF model, the observed NDVI values vs the predicted NDVI values showed good association, and we conclude that the model adequately captures dependencies within the data when comparing the test and full dataset to the true NDVI values (section 1.8 of supplementary material). Furthermore, we investigated the residuals based on single crop type predictions for every model listed in Table 2, finding that no inherent bias towards any single crop type is present (supplementary section 1.10). The residual analysis for every model likewise found no larger variance across any of the less-represented levels of the categorical and numerical variables listed in Table 1 (see supplementary section 1.10). We observed that the most important variables are the season and crop type. In contrast, soil type and potassium levels seem to be less important (Fig. 3 and supplementary material section 1.9). To summarize, the results show a robust estimation of abiotic influences via the RF model. This step ensures a reliable quantification of residual NDVI, allowing for a clearer investigation of biological relationships.

Fig. 3: Visualization of the RF model variable importance plot (Model A-2).

The y-axis shows the variable names in accordance with Table 1, and the x-axis is the percent increase in mean squared error.

OTU data analysis

In the second step of our two-step process, we identified two unique clusters of specific taxonomy through unsupervised clustering of OTU samples (Fig. 4). The samples are approximately equally distributed between the first (Cluster 1) and the second cluster (Cluster 2).

Fig. 4

Visualization of the two clusters identified from the average silhouette method (see supplementary material section 1.4) on the pre-processed PACBIO data (biotic). Note that the data has been scaled via the Hellinger transformation to reduce the effect of large OTU values. The letters in the cluster refer to the crop type of each sample (see the LUCAS 2018 biodiversity dataset for details40).
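
The Hellinger transformation mentioned in the caption can be sketched as below: each OTU count is divided by the total count of its sample, and the square root of that proportion is taken. This is the standard definition of the transformation; the list-of-lists table layout and the handling of empty samples are assumptions made for the example.

```python
import math


def hellinger(otu_table):
    """Hellinger-transform a table given as one list of OTU counts per sample."""
    transformed = []
    for counts in otu_table:
        total = sum(counts)
        if total == 0:
            # Assumed convention: an empty sample maps to all zeros.
            transformed.append([0.0 for _ in counts])
        else:
            transformed.append([math.sqrt(c / total) for c in counts])
    return transformed
```

Because of the square root, a very large count in a single OTU contributes much less to distances between samples than it would on the raw scale, which is the effect referred to in the caption.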

As a sensitivity analysis, we additionally clustered the samples of each crop individually (wheat, maize, and barley - supplementary material section 1.5). We identified two clusters for wheat with 30 and 19 observations, respectively. Due to low sample sizes, no meaningful patterns were identified for the separate clustering of barley and maize.

The clusters (including the separate wheat clusters) were further analyzed by connecting the residual NDVI values of the samples to the fungal taxonomy present in each cluster.

Microbiome impact modeling

OTU Cluster 1 showed significantly higher residual NDVI values than Cluster 2 (Fig. 5 and Table 2 in supplementary material section 1.4). The results for the raw NDVI values for each cluster are found in the supplementary section 1.4 for both Landsat 8 and Sentinel-2.

Fig. 5

Boxplots of the residual NDVI values from each RF model of Table 2 within clusters 1 and 2 originating from the unsupervised clustering of the biodiversity samples (Fig. 4).

Similarly, for the wheat clusters, we found significant differences between the two clusters (supplementary material table 5, section 1.5). The average relative abundance analysis (Fig. 6) showed that genera associated with plant health (such as Tomentella and Mortierella49,50) were more abundant in Cluster 1 (higher NDVI), while pathogenic genera, such as Fusarium, were more prevalent in Cluster 2 (lower NDVI). Specifically, Fusarium was more abundant in maize and wheat within Cluster 2, with similar levels in barley across both clusters.

Fig. 6: Comparison of the top 30 most abundant genera per crop based on the average relative abundance of each unique taxon within each cluster for all crops.

A Visualization of the 30 most abundant taxonomies per crop for the first cluster. B Visualization of the 30 most abundant taxonomies per crop for the second cluster.

Abundance analysis for the wheat clusters revealed a similar pattern, with pathogenic genera being more abundant in the cluster with lower NDVI (section 1.12 in the supplementary material).

Using our bootstrapping approach (algorithm 1), we identified 99 and 114 genera (Cluster 1 and 2, respectively) as non-outlier taxa (Fig. 7). Outlier taxa are low-abundance genera with sporadic, large OTU spikes.

Fig. 7: IQR filtration overview.

A Graphical results of the bootstrap resampling for Cluster 1. In total, the relative abundance IQR confidence intervals of 99 genera (blue lines) do not overlap 0 (red line), while the confidence intervals of the remaining genera contain 0 (orange line). B Graphical results of the bootstrap resampling for Cluster 2. In total, the relative abundance IQR confidence intervals of 114 genera (blue lines) do not overlap 0 (red line), while the confidence intervals of the remaining genera contain 0 (orange line). Note that the Wald approximation has been applied for estimating the CIs, hence the negative CI values.
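
The idea behind the filtration can be sketched as follows. Algorithm 1 itself is not reproduced in this section, so the code is only one plausible reading of the caption: the samples of a cluster are resampled with replacement, the IQR of a genus' relative abundance is recomputed each time, and a Wald-type interval (bootstrap mean plus or minus z times the bootstrap standard deviation) is compared with 0. All function names, the number of resamples, and the quantile method are assumptions.

```python
import random
import statistics


def iqr(values):
    """Interquartile range (third quartile minus first quartile)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def wald_iqr_interval(abundances, n_boot=1000, z=1.96, seed=0):
    """Wald-type interval for the IQR from bootstrap resamples of one genus."""
    rng = random.Random(seed)
    n = len(abundances)
    # Resample the per-sample relative abundances with replacement.
    boot = [iqr(rng.choices(abundances, k=n)) for _ in range(n_boot)]
    centre = statistics.fmean(boot)
    spread = statistics.stdev(boot)
    # A symmetric interval like this one can extend below zero.
    return centre - z * spread, centre + z * spread


def is_non_outlier(abundances, **kwargs):
    """Keep a genus only when the whole interval lies above 0."""
    lower, _ = wald_iqr_interval(abundances, **kwargs)
    return lower > 0
```

Under this reading, a genus that is absent from most samples but spikes in a few has a bootstrap IQR that is mostly 0, so its interval contains 0 and the genus is filtered out, whereas a genus that varies steadily across samples is retained.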

Network models of the non-outlier taxonomy show that many of the taxonomy correlations have been penalized by the sparse lasso model, retaining few connected nodes for both Cluster 1 and 2 (Figs. 8, 9). Results on the optimal regularization parameter are provided in the supplementary section 1.11.

Fig. 8: Illustration of the connections in the network based on the sparse lasso graphical model with 99 genera located in the first cluster.

Red and blue edges indicate negative and positive correlations, respectively. The full network includes all non-connected genera and is shown in the supplementary section 1.13. Note that only the genus names are used for the nodes and that smaller absolute correlations (smaller than 0.3) have been removed from the plot.

Only a few strong negative correlations are present in the network for Cluster 1 (Fig. 8), which is denser than the network of Cluster 2. The network of Cluster 2 is smaller, with fewer strong connections (Fig. 9). A notable strong negative correlation appears in both networks between the Fusarium and Russula genera.

Fig. 9: Illustration of the connections in the network based on the sparse lasso graphical model with 114 genera located in the second cluster.

Red and blue edges indicate negative and positive correlations, respectively. The full network includes all non-connected genera and is shown in the supplementary section 1.13. Note that only the genus names are used for the nodes and that smaller absolute correlations (smaller than 0.3) have been removed from the plot.

Network analysis

Analyzing each sparse network showed that Mortierella is the most influential genus for both clusters (Table 3), having the highest abundance at approximately 15% in each cluster. In Cluster 1, the second most influential genus is Cortinarius, which shows a higher abundance than in Cluster 2, where Fusarium ranks second. Both genera appear highly influential in each cluster but have different abundances and Integrated Value of Influence (ivi) scores. Low-abundance genera (Trechispora, Schizothecium, and Serendipita) are also assigned high ivi scores and are of high influence in Cluster 2. They are present in Cluster 1 as well, but at a much lower abundance (below 0.5% in all cases).

Table 3 Top 10 node influence based on the Integrated Value of Influence

In wheat-specific clusters, Mortierella was identified as the most influential genus across clusters, despite not being the most abundant (supplementary material section 1.15). In the higher NDVI cluster, Funneliformis was the most abundant genus but did not contribute to network influence, whereas lower-abundance genera such as Neoschizothecium and Curvularia demonstrated higher influence. Additionally, the plant pathogenic genera Fusarium and Penicillium51 showed the highest ivi scores in the lower NDVI cluster. In the wheat-specific Cluster 2 (lower NDVI), the same beneficial genera were present as in Cluster 1 (higher NDVI), but with lower ivi scores (except for Mortierella).

Discussion

When comparing the variable importance of our RF model with the literature, we find the following agreements. Previous studies have shown that the canopy structure of vegetation influences the NDVI with a non-linear relationship26,52. As for seasons, it has been shown that NDVI is temporally affected, with a maximum NDVI around harvest months53. In terms of nutrient influences, Loozen54 used RF modeling to demonstrate a relationship between NDVI and nitrogen content in forest canopies. Furthermore, the soil carbon content has been identified to have a non-linear link with NDVI55,56. In addition, the literature reports that meteorological variables (e.g., temperatures and soil moisture)57,58,59 are of high importance with respect to NDVI, which agrees with our findings. We note that the temporal resolution of the satellite imagery may affect the results, as the sampling date does not always correspond with the satellite recording date. Thus, the Sentinel-2 imagery may be more accurate with respect to the NDVI responses, as it has a higher temporal resolution. In the future, it would be interesting to repeat our analysis on PlanetScope data, which has daily recordings, although at a higher cost than Sentinel-2 and Landsat 8. Finally, the results of our abiotic model confirm previous findings in the literature, and the model serves as an important step towards removing abiotic influence from the NDVI responses.

Even so, our model is still subject to certain limitations. First, our model is limited as most of the meteorological observations are confined to moisture and temperature levels in the range 0.1–0.4 m3 and 280–290 K. This implies that the model may not be able to handle drought or very high temperatures. Second, our model has limited temporal coverage, as only the year 2018 is covered for biological data, possibly making the clustering susceptible to temporal distribution shifts. The temporal robustness and/or evolution is worth investigating in future studies.

For future research, it would be valuable to include a wider spatio-temporal area, including multiple years and using large-scale geospatial data with higher resolution than the Landsat 8 (30 m) and Sentinel-2 (10 m) satellites. This is particularly interesting, as differences have been observed between modeling based on Sentinel-2 and Landsat 8 data within the same year. An increased resolution could provide a more accurate representation of biotic effects. However, a significant limitation of higher-resolution imagery is its cost, as high-resolution data is much more expensive. It is also worth noting that the 2022 LUCAS Topsoil data is forthcoming. Although it is not yet publicly accessible, it could be used to perform an analysis with Sentinel-2 similar to that of the Landsat 8 analysis performed in this study. We strongly encourage such future external validations.

Our study mainly focuses on investigating the fungal impact on NDVI values in heterogeneous spatial areas across Europe. Overall, our findings demonstrate that the fungal taxonomic composition is significantly associated with the NDVI values adjusted for abiotic influence.

From the LUCAS data alone, we cannot explicitly state which mechanism may be behind the regularization of the networks (Figs. 8, 9). We can infer that high abundance is not the sole criterion for network influence, although it seems to play a role, as the most influential genera (Table 3) are mainly comprised of genera with high average relative abundance. However, lower-abundance genera seem to have a much higher impact relative to their presence.

Moving further from our exploratory analysis, the highly influential taxonomy may provide a baseline candidate list for further studies investigating the fungal interaction both in vivo and in vitro. For instance, it would be interesting to investigate the effects on the NDVI values when Mortierella has minimal presence vs being highly abundant. From the literature, it is known that Mortierella has generally been found to be very beneficial for agriculture60. However, to the best of our knowledge, no larger-scale controlled field experiments have been conducted with different levels of Mortierella. This could be carried out with a split-plot design61, which is frequently utilized in agronomy for investigating the influence of other factors, such as the impact of fertilizer types or irrigation methods. Likewise, the negative correlation between Russula and Fusarium is interesting to explore further, as the same relation was identified in a study of canker disease in citrus fruits62, which revealed that Russula counteracts the influence of the disease-causing Fusarium. This relation has, to the best of our knowledge, never been reported in crops, and further investigation is required to assess our finding.

Further relevant actions would be to analyze the metabolomics profile of the soil communities and pinpoint the taxonomic origin of metabolites positively or negatively affecting plant growth. It should be noted that taxonomy is only assigned to the genus level; hence, species information is lost, and the above discussion is based solely on the most common occurrence of fungal species in maize, wheat, and barley. For future studies, the acquisition of species-level taxonomy would enrich the analysis, as specific species linked to NDVI may be identified. In addition, an investigation of the rhizosphere of the plants would enrich our results further, as some fungal-plant interactions are known to work through the rhizosphere. Specifically, this investigation could reveal whether bulk soil content plays a mediating role in shaping fungal community composition by influencing root exudates, microbial recruitment, or nutrient availability within the rhizosphere. This is particularly relevant as Mortierella, which is known for having soil-plant nexus interactions63,64, seems to be a central genus for regulation of the microbiome. However, this is very difficult to investigate, as the metagenome of a single field is subject to high spatial biological variance even when sampling within a few meters of the same site. Aggregating samples across a grid may help, but outliers could skew results. Hence, further studies are best done in a controlled lab with few crops and limited field space.

Our analysis demonstrates the potential of relating satellite imagery and the composition of fungal soil microbiomes across datasets characterized by varying spatiotemporal attributes. However, the biological analysis is based solely on a total of 115 bio-samples, which currently limits the construction of robust models for predicting NDVI based on the soil microbiome. Therefore, more bio-samples should be collected in accordance with the LUCAS initiative to provide a ground truth foundation for prediction models. Our approach thus reveals, in an exploratory sense, that certain microbiome compositions may affect the overall observed vegetation health, which potentially opens the way for selective microbial fertilization. One potential benefit of establishing this link is that expensive soil sample analysis could be replaced by satellite imagery. This would not only save time and resources but also enable much faster fine-tuning of microbial compositions for individual fields. An interesting approach would be to combine fungal fertilization with well-established bio-fertilization methods, such as crop rotation and the introduction of cover crops like legumes, which promote biodiversity by introducing nitrogen-fixing bacteria into the soil65. As noted by Liu et al.66, fungal functional diversity plays a crucial role in ecosystem stability, suggesting that the addition of specific fungal taxa could enhance the benefits already provided by practices like crop rotation and cover crops. Furthermore, a consequence of introducing fungal bio-fertilization may be increased soil taxonomic stability, making the system more robust to invasive crop pathogens67.

In the end, this fine-tuning may aid in optimizing agricultural yield35 by increasing crop health, thereby aiding food security. Furthermore, selective microbial fertilizers may not only help achieve higher crop yields but ultimately replace some parts of the conventional fertilization process35, saving resources and moving toward a more sustainable future through green farming.

Methods

Data pre-processing

Satellite images

To ensure consistent atmospheric correction, the Sentinel-2 images were corrected for atmospheric disturbances using the sen2LA product rather than manual correction of the sen2L1C product68. For the Landsat 8 images, atmospheric disturbances were corrected using the level 2 OT product69. Furthermore, we removed images with more than 5% cloud cover by applying the Sentinelhub API cloud filtering algorithm68. This resulted in a reduction of data from 7430 crop samples to 5410 crop samples. To ensure band resolution was uniform at a targeted 10 meters, we used bilinear interpolation to resample each image70. The same procedure was utilized on the Landsat 8 imagery. To ensure the pixel band values are not affected by environmental artifacts, we applied masking based on the Sentinel-2 scene classification maps (SCM)71, which removes pixels containing dark areas, snow, smaller clouds, and water from each sample tile. From the filtered, interpolated images, the mean NDVI values were aggregated based on the pixels found in each multi-polygon. Sampling dates did not always match the Sentinel-2 or Landsat 8 temporal recording scheme. To meet this challenge, a search range of 3 weeks before and after the sampling date was utilized. Imagery that did not meet the 5% cloud cover cutoff was discarded in the search process. The image closest to the sampling date was chosen to represent NDVI values for a sample. When no suitable imagery could be found in the temporal search span, the sample was removed from the final data.
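
The image selection rule at the end of this subsection can be sketched as follows. The representation of an image as an (acquisition date, cloud cover percentage) tuple and the function name are assumptions for illustration; the actual retrieval was done through Sentinel Hub and is not reproduced here.

```python
from datetime import date, timedelta

SEARCH_WINDOW = timedelta(weeks=3)  # 3 weeks before and after the sampling date
MAX_CLOUD_COVER = 5.0               # percent; images above this are discarded


def pick_image(sampling_date, images):
    """Pick the usable image closest in time to the sampling date.

    images: list of hypothetical (acquisition_date, cloud_cover_percent) tuples.
    Returns None when no image qualifies, in which case the sample is dropped.
    """
    candidates = [
        (acquired, cloud)
        for acquired, cloud in images
        if abs(acquired - sampling_date) <= SEARCH_WINDOW
        and cloud <= MAX_CLOUD_COVER
    ]
    if not candidates:
        return None
    # On ties, min() keeps the first candidate in the input order.
    return min(candidates, key=lambda item: abs(item[0] - sampling_date))
```

For a sample taken on 15 June 2018, a cloudy image from 14 June would be skipped in favour of a clear image from 20 June, since the cloud filter is applied before the nearest date is chosen.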

ITS PacBio sequencing

An initial quality filtering was performed by discarding reads containing more than 1 ambiguous base and more than 2 expected errors. Adapters were trimmed and read orientation was corrected using CutAdapt 2.1072. ITSxpress73 was then used to extract ITS regions. The UNITE 9.0 UCHIME reference dataset74 was used for reference-based chimera removal. Sequences were clustered using open reference clustering by including the ITS sequences extracted from the UNITE-INSDc 9.0 database75 using ITSx. The sequences were clustered at 98% similarity using VSEARCH76, applying the "--cluster_smallmem --usersort" arguments to prioritize full-length UNITE-INSDc and PacBio sequences before partial sequences as described by Tedersoo et al.77. A sample-by-OTU table was produced using VSEARCH, and OTUs shorter than 250 nucleotides were discarded. Representative sequences from each OTU were classified by Megablast queries using BLAST 2.12.0, applying taxon-specific e-value and sequence similarity thresholds as described by Tedersoo et al.77. Conservatively, OTUs were classified to the level of genus.
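
The initial read filter can be illustrated with the sketch below. It assumes the usual definition of expected errors (the sum of the per-base error probabilities 10^(-Q/10) derived from Phred scores) and assumes that a read is discarded when either threshold is exceeded; the tool actually used for this step is not named in the text, and the function names are hypothetical.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities implied by Phred quality scores."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def passes_quality_filter(sequence, phred_scores,
                          max_ambiguous=1, max_expected_errors=2.0):
    """Return True when a read survives both thresholds (assumed reading)."""
    # Any symbol other than A, C, G, or T is counted as ambiguous (e.g. N).
    ambiguous = sum(1 for base in sequence.upper() if base not in "ACGT")
    return (ambiguous <= max_ambiguous
            and expected_errors(phred_scores) <= max_expected_errors)
```

Expected errors penalize many mediocre bases as well as a few very poor ones, which makes the measure stricter than a filter on the mean quality score alone.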

Supervised machine learning and statistical modeling

We built two models that adjust the NDVI for abiotic confounders/contributions. We compared an RF model78 and a linear regression model79, using the variables listed in Table 1 as the predictors and NDVI as the response (see the linear regression model results in the supplementary for the parameters included in the fully reduced regression model). The RF model parameters were tuned using 5 repetitions of 5-fold cross-validation (CV). For the 2018 Sentinel-2 data, we split the observations into 20% for one-off unseen testing and 80% for training via cross-validation. The Landsat 8 data underwent several splitting scenarios based on different combinations of the datasets from 2015 and 2018. In the first scenario, the 2018 dataset was split into 20% for testing and 80% for training. The second scenario involved combining the 2018 and 2015 datasets into a single dataset, which was then split in the same 20% testing and 80% training ratio. The third scenario treated the 2015 and 2018 datasets as independent, with 2015 serving as the test data and 2018 as the training data. The final scenario mirrored the third but swapped the roles of the datasets, using 2018 for testing and 2015 for training. In the repeated CV on the training data, the RF hyperparameters (variable splits, number of trees, and minimal terminal size) were selected using a grid search over a range of values (see the supplementary material) according to the minimal RMSE value. We estimated the expected error using the unseen test data for the selected set of hyperparameter values. The final model used for adjusting the NDVI for abiotic confounders was constructed by training a model with the selected hyperparameters on the full data (see 2). An analysis of residuals for the RF model is also included. For readers new to RF models, it should be noted that the residuals are not subject to the same assumptions as in traditional statistical models (e.g., independent and identically distributed)78. However, the analysis is included to test whether any bias is present in the predictions for less-represented levels of variables.
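The data splitting scheme can be sketched in a few lines. The example below is illustrative only and does not reproduce the authors' R code: it assumes a simple random 20%/80% hold-out split followed by 5 repetitions of 5-fold CV on the training indices, with RMSE as the selection criterion. The function names and seeds are hypothetical, and the RF fitting itself is omitted.

```python
import math
import random


def holdout_split(n, test_fraction=0.2, seed=0):
    """Randomly split indices 0..n-1 into (train, test) index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def repeated_kfold(train_idx, k=5, repeats=5, seed=0):
    """Yield (repeat, fold, fit_idx, val_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(repeats):
        shuffled = list(train_idx)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]  # k near-equal folds
        for i, val in enumerate(folds):
            fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield rep, i, fit, val


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

In a grid search, each hyperparameter combination would be fitted on `fit` and scored with `rmse` on `val` for all 25 splits, and the combination with the lowest mean RMSE would be retained before the one-off evaluation on the test indices.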

The linear regression model included two-factor interaction terms and was reduced according to the principles described in the Statistical analysis subsection and the results section. The models were compared using the test set errors and the errors estimated from making predictions onto the full dataset. Both models were evaluated based on the explained variance (R2) and root mean squared error (RMSE).

The estimated linear regression model coefficients \(\hat{{{\boldsymbol{\beta }}}}\) were computed by least squares79.
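For reference, the least squares estimate has a well-known closed form. The notation here is an assumption introduced for illustration and is not taken from the original text: \({{\boldsymbol{A}}}\) denotes the design matrix (intercept, main effects, and two-factor interaction columns), \({{{\boldsymbol{Y}}}}_{raw}\) the vector of raw NDVI values, and \({{{\boldsymbol{A}}}}^{T}{{\boldsymbol{A}}}\) is assumed to be invertible:

$$\hat{{{\boldsymbol{\beta }}}}={({{{\boldsymbol{A}}}}^{T}{{\boldsymbol{A}}})}^{-1}{{{\boldsymbol{A}}}}^{T}{{{\boldsymbol{Y}}}}_{raw}$$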

To mitigate over-fitting, the significance of each linear model regression coefficient was determined by ANOVA tests (see the Statistical Analysis subsection). Model diagnostics were also evaluated to ensure adequate fulfillment of the underlying assumptions. For a more extensive theoretical review of the linear model, we refer to the work of Madsen79.

To adjust the NDVI values for confounders, the NDVI values estimated by the abiotic RF model (see the Modeling of abiotic effects subsection), which represent the non-microbiome-related variance (\({{{\boldsymbol{Y}}}}_{{{\boldsymbol{RF}}}}\)), were subtracted from the raw NDVI values (\({{{\boldsymbol{Y}}}}_{raw}\)), resulting in the residual NDVI values (\({{{\boldsymbol{Y}}}}_{residual}\)), given as:

$${{{\boldsymbol{Y}}}}_{residual}={{{\boldsymbol{Y}}}}_{raw}-{{{\boldsymbol{Y}}}}_{{{\boldsymbol{RF}}}},$$
(1)

Subsequently, we investigate the association between the residual NDVI values and clusters derived from the microbiome sequencing data (Fig. 4).

Clustering

We used hierarchical clustering to create clusters from the pre-processed PACBIO data in an unsupervised manner. We based the clustering on the Euclidean distance and chose the Agnes method, which applies agglomerative coefficients80 to identify the best linkage. Hierarchical clustering computes the dissimilarities between observations based on a distance matrix, where each row/column represents an observation and the distances reflect how dissimilar or similar two observations are. The Agnes method iteratively merges the closest clusters based on these dissimilarities until an optimal number of clusters is reached. To identify the optimal number of clusters, we applied the average silhouette width, where the silhouette width s(i) of observation i is defined as:

$$s(i)=\frac{b(i)-a(i)}{\max [a(i),b(i)]}$$
(2)

Here, a(i) describes the average within-cluster distance between observation i and all other observations of the same cluster. The term b(i) denotes the average between-cluster distance between observation i and the observations assigned to the neighboring cluster. The term s(i) expresses how well each point is clustered, resulting in the optimal clustering being assigned when all clusters have observations above the average silhouette width (see the work of Friedman80 for more details on how to estimate between- and within-cluster distance). It should be noted that prior to clustering, the data was standardized in accordance with the Hellinger transformation81.
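As a hedged illustration of these quantities, the sketch below applies the Hellinger transformation (square root of relative abundances) and computes silhouette widths from Euclidean distances. It uses the standard silhouette form, (b(i) - a(i)) / max[a(i), b(i)], in which values near 1 indicate well-separated observations. The function names are hypothetical, the cluster labels are assumed to be given, and the Agnes clustering step itself is not reproduced.

```python
import math


def hellinger(counts):
    """Hellinger transformation of one sample: sqrt of relative abundances."""
    total = sum(counts)
    return [math.sqrt(c / total) for c in counts] if total else [0.0] * len(counts)


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_widths(points, labels):
    """Silhouette width s(i) for every observation, given cluster labels."""
    widths = []
    for i, p in enumerate(points):
        dists = {}  # cluster label -> distances from observation i
        for j, q in enumerate(points):
            if j != i:
                dists.setdefault(labels[j], []).append(euclidean(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            widths.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                               # within-cluster
        b = min(sum(d) / len(d) for d in dists.values())      # nearest other cluster
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths
```

The average silhouette width of a candidate clustering is then the mean of the returned list, and candidate numbers of clusters can be compared on that value.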

Bootstrapping

The non-parametric bootstrap approach82 was used to filter rare taxa from the pre-processed PACBIO data. A total of 5000 bootstrap samples were generated for each genus within each cluster. The interquartile range (IQR) of each bootstrap sample was then computed, and a genus was filtered out when the confidence interval of its IQR contained 0. The full process is outlined mathematically in Algorithm 1.

Algorithm 1

Genus filtration pseudo code

Initial inputs:

A matrix X of OTU abundances, with N rows (samples) and K columns (genera), i.e, \({{\boldsymbol{X}}}\in {{\mathbb{R}}}^{N\times K}\) is given.

for k= 1 to K do:

Step 1: Draw B = 5000 bootstrap samples, X*, with replacement from X

\({X}_{k}^{* }=[{X}_{k,1}^{* },{X}_{k,2}^{* },..,{X}_{k,B}^{* }]\), with \({X}_{k}^{* }\in {{\mathbb{R}}}^{N\times B}\).

Where \({X}_{k,b}^{* }=({x}_{1,k}^{(b)},{x}_{2,k}^{(b)},...,{x}_{N,k}^{(b)})\), with \({X}_{k,b}^{* }\in {{\mathbb{R}}}^{N\times 1}\).

Step 2: Estimate the IQR of the kth genus for each bootstrap sample, i.e., for each column of \({X}_{k}^{* }\).

\(IQ{R}_{k,b}={q}_{3}({X}_{k,b}^{* })-{q}_{1}({X}_{k,b}^{* })\), with \(IQ{R}_{k}=[IQ{R}_{k,1},IQ{R}_{k,2},...,IQ{R}_{k,B}]\in {{\mathbb{R}}}^{B\times 1}\).

Where q3 and q1 indicate the 75% and 25% quantiles, respectively.

Step 3: From the B bootstrap IQR values of the kth genus (\(IQ{R}_{k}\)), estimate:

\({\bar{IQR}}_{k}=\frac{1}{B}{\sum }_{b = 1}^{B}IQ{R}_{k,b}\), mean IQR

\(SD({IQR}_{k})=\sqrt{\frac{1}{B-1}\mathop{\sum }_{b = 1}^{B}{(IQ{R}_{k,b}-{\bar{IQR}}_{k})}^{2}}\), IQR standard deviation

\(C{I}_{0.95}(IQ{R}_{k})={\bar{IQR}}_{k}\pm \frac{{Z}_{1-\alpha /2}SD({IQR}_{k})}{\sqrt{B}}\), IQR Wald 95% confidence interval.

Where \({Z}_{1-\alpha /2}\approx 1.96\) is the \(1-\alpha /2\) quantile of the standard normal distribution at α = 0.05.

Step 4: remove outlier taxonomy

if \(0\in C{I}_{0.95}(IQ{R}_{k})\) then

The k’th column of X is considered outlier taxonomy and excluded

end if

end for
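Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudo code for the genera of a single cluster, not the authors' R implementation: the function names, the dictionary layout (genus name mapped to a list of abundances), the fixed random seed, and the "inclusive" quantile convention are all assumptions made for the example.

```python
import random
import statistics


def iqr(values):
    """Interquartile range q3 - q1 (inclusive quantile convention assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def bootstrap_iqr_ci(abundances, n_boot=5000, z=1.96, seed=0):
    """Wald 95% CI of the bootstrap IQR for one genus (Steps 1 to 3)."""
    rng = random.Random(seed)
    n = len(abundances)
    # Step 1 and 2: resample with replacement and take the IQR of each sample.
    iqrs = [iqr(rng.choices(abundances, k=n)) for _ in range(n_boot)]
    # Step 3: mean, standard deviation, and Wald interval of the IQR values.
    mean_iqr = statistics.fmean(iqrs)
    half_width = z * statistics.stdev(iqrs) / n_boot ** 0.5
    return mean_iqr - half_width, mean_iqr + half_width


def filter_genera(table, **kwargs):
    """Step 4: keep only genera whose IQR confidence interval excludes 0."""
    kept = {}
    for genus, abundances in table.items():
        lower, upper = bootstrap_iqr_ci(abundances, **kwargs)
        if not (lower <= 0.0 <= upper):
            kept[genus] = abundances
    return kept
```

For example, `filter_genera({"common": [...], "rare": [...]}, n_boot=5000)` would return only the genera that are retained for one cluster, and the call would be repeated per cluster.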

From the resampled relative abundance data, a 95% confidence interval (CI) for the IQR values was computed. Any genus whose IQR confidence interval contains 0 is considered rare and is therefore discarded, whereas genera whose IQR confidence intervals do not contain 0 are retained. The method was applied because, even with the addition of the lasso penalty, low-abundance genera with sporadic large OTU spikes may produce spurious correlations. Setting a fixed abundance cutoff is problematic, however, as there are no universally accepted lower values83,84. Consequently, we applied a data-driven thresholding strategy via the IQR bootstrapping approach.

Graphical lasso model and networks

To derive the sparse networks visualized in Figs. 8 and 9, the graphical lasso model was applied to the correlation matrix of the filtered and pre-processed PACBIO data. The main principles are outlined here, and we refer to the works of Friedman85,86 for a more detailed description. The idea behind the graphical lasso model is to omit edges (correlations) from a network by controlling the number of zero entries of the precision matrix (or inverse covariance matrix, Θ) via the lasso penalty. The graphical lasso model can be defined through the following optimization problem:

$$\hat{\Theta }=\arg \mathop{\min }_{\Theta \ge 0}(-\log (\det (\Theta ))+{{\rm{Tr}}}(S\Theta )+\lambda | | \Theta | {| }_{1})$$
(3)

Here, S denotes the empirical correlation matrix of the data and λ the lasso penalty parameter, which controls the sparsity of Θ. The optimization problem of Eq. (3) does not have an analytical solution; therefore, element-wise coordinate descent is applied, estimating one entry of the precision matrix at a time. For more details, we refer to the work of Mazumder and Hastie87, which reviews the original algorithm while implementing additional improvements.

The networks are constructed from the sparse precision matrix estimated via the graphical lasso model. The sparse covariance matrix is estimated as

$$\Sigma ={\Theta }^{-1},$$
(4)

from which the sparse correlation matrix (ρ) may be obtained as

$$\begin{aligned}D&=\sqrt{{{\rm{diag}}}(\Sigma )}\\ \rho &={D}^{-1}\Sigma {({D}^{-1})}^{T}\end{aligned}$$
(5)

The sparse correlation matrix is then utilized as the edges of the networks, with each genus being a network node.
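The conversion in Eq. (5) and the construction of network edges can be illustrated with a short sketch. It assumes the covariance matrix Σ is already available as a nested list (the matrix inversion of Eq. (4) and the graphical lasso fit are not shown), and the function names, the genus labels, and the tolerance used to decide that an entry is zero are hypothetical.

```python
import math


def cov_to_corr(sigma):
    """Eq. (5): rho = D^-1 Sigma D^-1, with D = sqrt(diag(Sigma))."""
    n = len(sigma)
    d = [math.sqrt(sigma[i][i]) for i in range(n)]
    return [[sigma[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]


def network_edges(rho, names, tol=1e-12):
    """List (genus_a, genus_b, correlation) for non-zero off-diagonal entries."""
    n = len(rho)
    return [
        (names[i], names[j], rho[i][j])
        for i in range(n)
        for j in range(i + 1, n)  # upper triangle only: undirected edges
        if abs(rho[i][j]) > tol
    ]
```

Each returned tuple corresponds to one weighted edge between two genus nodes, so genera without any non-zero off-diagonal entry remain unconnected.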

To enhance the insights for each network, the Integrated Value of Influence88,89 (ivi) is evaluated for each connected node in the sparse networks. The ivi score was applied since it combines six centrality metrics88,89 into a score that considers both local and global centrality. We refer to the works of Salavaty88 covering the details of the ivi score and to the works of Klein90 for readers not too familiar with graph network theory.

Statistical analysis

The significance level for the hypothesis tests was set to α = 0.05, with adjustment for multiple hypothesis testing via the Tukey honestly significant difference (HSD) test when a factor had more than two levels91. For the subsequent reduction of the linear adjustment model, Type 2 ANOVA was applied based on a series of nested χ2 hypothesis tests92. Note that the principle of hierarchical modeling was followed, retaining non-significant main effects when these are part of significant interaction terms79. The model diagnostics of the linear abiotic adjustment model are reported in the supplementary.

Code

Pre-processing of the satellite images was carried out in Python (version 3.10.9)93, while all remaining data processing, machine learning tasks, and statistical evaluations were carried out in R (version 4.1.2)94. The source code for both is available on our GitHub (https://github.com/Mabso1/Artimate), and the specific software packages used are listed in the supplementary material, together with references and a brief description of how each was used.