Aboveground biomass estimation using multimodal remote sensing observations and machine learning in mixed temperate forest

Lamahewage, Shashika Himandi Gardeye; Witharana, Chandi; Riemann, Rachel; Fahey, Robert; Worthley, Thomas

doi:10.1038/s41598-025-15585-6

Article
Open access
Published: 24 August 2025

Shashika Himandi Gardeye Lamahewage¹, Chandi Witharana^1,2, Rachel Riemann³, Robert Fahey^1,2 & Thomas Worthley^1,2

Scientific Reports volume 15, Article number: 31120 (2025)

Abstract

Plants sequester carbon in their aboveground components, making aboveground tree biomass a key metric for assessing forest carbon storage. Traditional methods of aboveground biomass (AGB) estimation via Forest Inventory and Analysis (FIA) plots lack sufficient sampling intensity to directly produce accurate estimates at fine granularities. Increasing the sampling intensity with additional FIA plots would be labor and time intensive, particularly for large-scale carbon studies. Utilizing remote sensing (RS) data, such as airborne Light Detection and Ranging (LiDAR), aerial imagery, and satellite images, can significantly enhance the efficiency of forest carbon monitoring efforts. The principal objective of this study is to utilize the random forest (RF) algorithm to build predictive AGB models. We utilized 67 explanatory variables, extracted from three RS sources, resulting in nine RF models. Each RF model was subjected to variable selection, hyperparameter tuning, and model evaluation. The optimum model considered 28 explanatory variables, with a root mean square error (RMSE) of 27.19 Mg ha⁻¹ and R² of 0.41. Combining LiDAR with image metrics increased the accuracy of prediction models, serving as a pivotal tool for large area biomass mapping and carbon related decision making.

Introduction

Forests span over 410 million hectares globally, serving as vital carbon sinks that store atmospheric carbon mainly as aboveground biomass¹. Understanding carbon sources and their sequestration is crucial for mitigating global climate change². Aboveground biomass (AGB) of trees in the forest is the foundation for estimating aboveground carbon and understanding the carbon sink and source balance³. Forests and woodlands occupy over one third of the United States landscape, containing approximately 1 trillion cubic feet of wood volume⁴. An estimated 58% of the land area of the state of Connecticut meets the Forest Inventory and Analysis (FIA) definition of forest land. Connecticut encompasses 1.8 million acres of forested land, which bears approximately 135.69 million tons of oven-dry AGB⁵.

There are two primary methods of calculating aboveground biomass: (i) direct and (ii) indirect methods⁶. The direct (or destructive) method measures the oven-dry weight of the entire aboveground plant material. Destructive sampling is a complex, time-consuming method used primarily to develop and calibrate allometric equations of AGB from tree structural parameters. Indirect methods utilize those allometric equations to provide AGB estimates from field inventories of tree structural characteristics^7,8. In forests, indirect estimation is almost always the method used because directly measuring AGB is practically infeasible. As applied, indirect estimation uses species-specific allometric equations to calculate the approximate AGB of trees and forested areas^9,10. The FIA program, which has been conducting comprehensive forest inventories for over seven decades from local to national scales¹¹, provides a statewide database of AGB estimates for inventoried trees, subplots and plots. FIA has historically used the component ratio method (CRM) to calculate the AGB estimates, but in 2023 the National Scale Volume and Biomass (NSVB) model was introduced for more accurate biomass estimations^12,58.

Rapid and accurate estimation of large-area (e.g., statewide) forest AGB using field inventory data remains challenging in the forestry research sector¹³. The use of remote sensing (RS) techniques in forestry has been increasing due to their efficiency, accuracy and predictability when studies require large-scale and high temporal resolution data^13,14. RS data facilitates building large-area maps at different spatiotemporal scales, particularly when field sampling is insufficient for statistical estimation in state or landscape level modeling¹⁵.

Combining RS data with FIA data is a popular and frequently used technique^16,17,18. Light Detection and Ranging (LiDAR) data is often employed to build random forest (RF) models to predict forest AGB¹⁶. Hu et al.¹⁹ used spaceborne LiDAR, satellite images, climate surfaces, topographic data, and optical imagery to build 1 km resolution AGB density maps at global scale with R² of 0.56 and RMSE of 87.53 Mg ha⁻¹. Hudak et al.²⁰ built a carbon monitoring system utilizing 3805 field measured plots for mapping regional, annual aboveground biomass across the northwestern USA, achieving R² of 0.8 and RMSE of 152 Mg ha⁻¹. Tang et al.²¹ utilized 1986 FIA subplot locations, LiDAR, and National Agriculture Imagery Program (NAIP) data to publish a multi-state forest carbon map with 30 m predictions and R² of 0.38. Other studies have utilized synthetic aperture radar (SAR), vegetation indices (VIs), and Moderate Resolution Imaging Spectroradiometer (MODIS) data to create a range of machine learning (ML) and multiple linear regression (MLR) models^18,22,23,24.

Integrating FIA data with both passive and active remote sensing technologies (such as LiDAR, Landsat, Sentinel-2, and NAIP) enhances large-area biomass predictions by providing valuable details at higher temporal and spatial resolutions that are particularly relevant for canopy cover and disturbance characteristics²⁵. According to Mancini et al.²⁶, remotely acquired data enhances the information in field data and reduces the need for additional extensive fieldwork. LiDAR provides a comprehensive characterization of forest structure, canopy height information, vertical profile, and tree density^16,27,28. Such detailed 3D representation enables a precise estimation of AGB and carbon stocks due to the strong linear relationship between tree height and tree diameter²⁹. In terms of RS image observations, Sentinel-2 data provides high temporal resolution and NAIP offers high spatial resolution (0.6 m). Variables developed from remotely sensed imagery have become crucial data for characterizing forest health, composition, and qualitative characteristics. For example, grey-level co-occurrence matrix (GLCM) textures³⁰ inform canopy complexity and spatial heterogeneity, which are important for distinguishing biomass variation^31,32. Vegetation indices provide valuable information about vegetation characteristics, vigor, moisture content, canopy density, and distribution^32,33. Principal Component Analysis (PCA) has been employed to reduce a large number of variables with redundant information into a more manageable set while preserving the primary information dimensions available in the original data³⁴.

Among previous AGB modeling efforts, Sheridan et al.²⁷ and Li et al.¹⁵ built linear regression models, linear dummy variable models, and linear mixed-effects models. Previous studies have extensively explored different parametric and non-parametric algorithms to model forest AGB^18,35. The RF algorithm has been frequently used in forest AGB modeling^16,27,36,37, but in addition to RF, Li et al.¹⁵ used gradient boosting and Sivasankar et al.³⁸ used support vector machine (SVM) regression to predict forest AGB. Johnson et al.¹⁸ and Wolpert³⁹ developed stacked ensemble models for previously developed prediction models. In addition to pure statistical models, Ma et al.⁴⁰ estimated AGB using the ecosystem demography (ED) model. Other studies further utilized other image composites and percent tree cover, land cover proportions, and climate parameters to build tree-based algorithms^41,42.

For large-area estimations, traditional forest inventory approaches are time and labor intensive, which limits their effectiveness for high-resolution AGB assessment. This limitation underscores the need for scalable, accurate, and cost-effective methods to estimate forest AGB across large and heterogeneous landscapes. Previous studies have leveraged remote sensing data and ML algorithms as potential solutions. Despite the growing use of ML algorithms such as random forest (RF) for AGB prediction, critical knowledge gaps remain^16,43,44,45. Notably, the implications of Forest Inventory and Analysis (FIA) data policies, which restrict access to georeferenced field plots, are not well studied. Furthermore, the utility and structural optimization of RF models under limited training data conditions remain underexplored in the context of forest AGB estimation. These gaps limit the generalizability and robustness of RF-based approaches, particularly in data-sparse or policy-constrained regions. In this study, we hypothesize that optimizing the tree structure of RF models through variable selection and hyperparameter tuning, combined with the integration of multimodal remote sensing (RS) data, can significantly improve the accuracy and robustness of aboveground biomass predictions. This approach is expected to be especially effective in scenarios with limited access to training data, and necessary for generating reliable AGB estimates across large, forested landscapes.

We utilized FIA data for both model training and validation, applying FIA-developed methods to evaluate the consistency between our biomass estimates and those generated by FIA. To achieve a spatially explicit representation of AGB across Connecticut, we employed a model-based approach that translated FIA’s discrete plot-level measurements into spatially continuous AGB predictions at a minimum of 15 m resolution. We addressed several common challenges associated with combining FIA and RS data for large-scale, fine-resolution biomass modeling and mapping. We employed the Random Forest (RF) algorithm, a method commonly utilized by researchers for estimating AGB due to its robustness in handling collinear data and spatial autocorrelation^44,45. In this study we also investigated the effectiveness of using detailed hyperparameter tuning and variable importance analysis to optimize the RF model and to identify the most meaningful set of input variables^16,43. We attempted to identify an effective method for integrating multi-modal remote sensing data and auxiliary spatial products with FIA subplot data. Our goal was to build robust models with improved accuracy of AGB estimates in Connecticut forests compared to previous efforts in the literature^17,27. We developed a comprehensive framework to address the challenges associated with utilizing FIA data, emphasizing its critical value in large-scale biomass estimation. Finally, we evaluated the agreement between our spatially explicit AGB estimates and FIA-derived biomass estimates with existing AGB map products^18,46.

Data and methods

Study area

Our study area was the state of Connecticut (41.6° N, 72.7° W), located in the northeastern United States and spanning approximately 13,023 km² ⁴⁷. Connecticut has a humid continental climate with cold winters and warm, humid summers and lies within the Northeastern coastal zone and lower New England ecoregion⁴⁸. We focused on the forests of the state (Fig. 1), which are dominated by Oak and Hickory species. As defined by the Global Forest Resources Assessment⁴⁹, a forest is characterized as land covering more than 0.5 ha, with trees exceeding 5 m in height and a canopy cover greater than 10%, or with trees that have the potential to naturally meet these criteria. In 2016, Connecticut’s forested land was estimated at 7,274.35 km² (1.8 million acres). Of this, 71% is privately owned, while 28.59% is owned by state and local governments. Federal ownership accounts for 0.46%, excluding developed and agricultural lands⁵⁰.

Modeling framework

The methodology (Fig. 2) involves identifying FIA plot locations belonging to the 2016 collection year, and creating raster layers from LiDAR, Sentinel-2 images, NAIP images, CT soil maps, and CT forest cover types⁵¹. The study consisted of 42 FIA true plot locations (168 subplots) and 67 explanatory variables derived for each subplot location corresponding to 2016 data collection. Each explanatory variable was meticulously selected by a thorough literature review to represent forest and vegetation characteristics (i.e. tree height, slope, forest heterogeneity, and tree health). Prior to model training, candidate remote sensing variables underwent a rigorous selection process to reduce dimensionality and identify the optimal subset of variables for RF models given the sample size.

FIA plot data

The FIA process integrates two active phases as outlined in the FIA user guide⁵⁸. According to the 1998 legislation, a five-year data collection cycle is required for permanent phase two plots nationwide^11,52,53. The FIA program measured and recorded 320 sample plots with some forested conditions (plot status code = 1) in Connecticut from 2011 to 2016. Approximately 14% out of 320 were measured annually resulting in 42 plots for 2016. The FIA plot design consists of a systematic cluster of four subplots, as shown in Fig. 3^54,55.

We used the 2016 FIA data to align with Connecticut’s 2016 LiDAR mission. This kept the workflow efficient and allowed the project to be completed on schedule, since the entire analysis was done remotely. Additionally, growth adjustments were not integrated, as we used single-year data to avoid introducing further uncertainties into the analysis. In 2016, the Northern Research Station Forest Inventory and Analysis program (NRS-FIA) utilized Rockwell Precision Lightweight GPS Receivers. The GPS readings have an average discrepancy of 8.0 m between the observed coordinates and the known reference points⁵⁶. We utilized true locations of 42 FIA plots, resulting in 168 FIA subplots (Fig. 3a)^16,27. The public FIA data is perturbed via fuzzing and swapping to reduce plot coordinate accuracy and maintain landowner privacy⁵⁷. Fuzzing involves shifting the plot coordinates to within 1.0 mile of the actual location. In addition, up to 20% of private plot coordinates are swapped with similar plots within the same county to prevent linking data to specific landowners. Obtaining true FIA subplot locations was essential for this study because of the fine spatial resolution of the analysis and imagery used; however, this posed challenges due to FIA’s data consent policies. To work within this constraint, all R code was written and initially tested using the publicly available data through FIA’s Spatial Data Services. The final development of R code and model tuning was run remotely using FIA true plot locations via a collaboration with FIA scientists.

Aboveground biomass calculation

We used the FIA subplot-level AGB estimates calculated using the component ratio method (CRM) for 168 FIA subplots²⁷. The CRM calculations were already recorded in the FIA database, following the methods in the FIA user guide Version 9.0.1 for Phase 2⁵⁸. The subplot locations allow for a better assessment of heterogeneity within the forested areas. This approach increased our ability to extract high-resolution RS data and to improve the model predictions^12,16,57,58,59. The FIA data tables with tree measurement records were used to calculate the cubic foot volume of each tree stem in all 168 subplots^57,58. Afterwards, tree-level AGB was totaled to get the AGB per subplot⁵⁸. Allometric estimates of AGB were calculated by using Eq. 1⁹ for tops and limbs as a proportion of the merchantable bole.

$$bm = \mathrm{Exp}\left(\beta_0 + \beta_1 \ln dbh\right) \tag{1}$$

Where;

bm = total aboveground biomass for trees 12 cm and larger¹⁶.

dbh = diameter at breast height (cm).

Exp = exponential function.

ln = log base e (2.718282).

β₀, β₁ = specific parameters for species group^9,58.
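As a minimal sketch of how Eq. 1 can be evaluated for a single tree, the Python snippet below applies the exponential form directly. The function name and the coefficient values are hypothetical placeholders chosen for illustration only; the species-group parameters used in the study come from the cited sources and are not reproduced here.

```python
import math


def allometric_biomass(dbh_cm: float, beta0: float, beta1: float) -> float:
    """Evaluate Eq. 1: bm = Exp(beta0 + beta1 * ln(dbh))."""
    if dbh_cm <= 0:
        raise ValueError("dbh must be positive")
    return math.exp(beta0 + beta1 * math.log(dbh_cm))


# Placeholder coefficients (hypothetical, NOT taken from the study).
BETA0, BETA1 = -2.0, 2.4

for dbh in (12.7, 25.0, 50.0):
    bm = allometric_biomass(dbh, BETA0, BETA1)
    print(f"dbh = {dbh:5.1f} cm -> bm = {bm:9.2f}")
```

Because Exp(β₀ + β₁ ln dbh) equals e^β₀ · dbh^β₁, the equation is a power law in diameter, so predicted biomass grows faster than linearly with dbh whenever β₁ > 1.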

In our study, we only considered the trees with diameter at breast height (dbh) ≥ 12.7 cm (≥ 5 in) for the AGB calculations¹⁸. The conversion factor for AGB was 892.17 lb ac⁻¹, equivalent to 1 Mg ha⁻¹^60,61. Although non-forest areas in CT are known to contain trees, in this study we assumed that non-forested areas contain 0 Mg ha⁻¹ AGB to align with the FIA definition of forestland^16,18. The FIA-estimated AGB values were the response variable of our study. After excluding subplots in non-forested areas^12,29, other spatial outliers generated from instrument errors and geolocation uncertainties of the field dataset were identified and removed by comparing discrepancies between observed and expected values¹⁶. We excluded locations with small field biomass estimates (< 50 Mg ha⁻¹) but high LiDAR heights (> 30 m), as well as subplots with zero AGB but LiDAR heights greater than 10 m²⁹. This reduced the training data set from 168 to 142 subplots.
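The unit conversion and the two outlier rules described above can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' R code: the function names are hypothetical, the example subplot values are invented, and the text does not specify which LiDAR height metric was compared against the thresholds.

```python
# Conversion factor stated in the text: 892.17 lb/ac corresponds to 1 Mg/ha.
LB_PER_ACRE_PER_MG_PER_HA = 892.17


def lb_ac_to_mg_ha(agb_lb_ac: float) -> float:
    """Convert biomass density from lb/ac to Mg/ha."""
    return agb_lb_ac / LB_PER_ACRE_PER_MG_PER_HA


def is_spatial_outlier(agb_mg_ha: float, lidar_height_m: float) -> bool:
    """Flag subplots whose field AGB disagrees with the LiDAR height."""
    low_biomass_tall_canopy = agb_mg_ha < 50 and lidar_height_m > 30
    zero_biomass_with_canopy = agb_mg_ha == 0 and lidar_height_m > 10
    return low_biomass_tall_canopy or zero_biomass_with_canopy


# Invented example subplots: (field AGB in lb/ac, LiDAR height in m).
subplots = [(120000.0, 24.5), (30000.0, 31.2), (0.0, 12.0), (0.0, 3.0)]
kept = [
    (lb_ac_to_mg_ha(agb), height)
    for agb, height in subplots
    if not is_spatial_outlier(lb_ac_to_mg_ha(agb), height)
]
print(f"kept {len(kept)} of {len(subplots)} subplots")
```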

Remote sensing variables

LiDAR data

The state of Connecticut has a statewide airborne LiDAR data set, which was collected from March 11, 2016, through April 16, 2016, at USGS Quality Level 2. The point density is 2 points/m², covering about 13,567 km² (5240 miles²) in 23,381 tiles. LiDAR data and derived 1 m resolution digital elevation models are referenced to the Connecticut State Plane NAD83 (2011) feet horizontal coordinate system and the NAVD88 feet vertical datum⁶². We derived⁶³ LiDAR metrics using ArcGIS Pro 3.1.2 and FUSION/LDV processing software (version 4.40)²⁷. We extracted the mean raster value of each LiDAR metric listed in Appendix 1 at subplot level. The canopy height model (CHM), derived from LiDAR first returns, was one of the input variables and provided tree heights to the model^13,32,64. We created rasters for LiDAR height percentiles, height bins, and height densities using the distribution of the Z Cartesian coordinate from the point cloud. These elevation metrics were used to train the model with information on the vertical structure and complexity of the forest vegetation^17,20,24,27,65.
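To make the height metrics concrete, the sketch below computes a height percentile and a simple density metric from a list of return heights. It is written in Python for illustration only: the study derived these rasters with FUSION/LDV, whose exact percentile interpolation and density definitions may differ, and the function names, height break and sample heights here are hypothetical.

```python
def height_percentile(heights, p):
    """Percentile of return heights, interpolating linearly between ranks."""
    if not heights:
        raise ValueError("heights must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(heights)
    rank = (p / 100) * (len(ordered) - 1)
    lower = int(rank)  # floor, since rank is non-negative
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])


def density_above(heights, height_break):
    """Share of returns strictly above a height break."""
    if not heights:
        raise ValueError("heights must not be empty")
    return sum(h > height_break for h in heights) / len(heights)


# Invented return heights (m above ground) for one grid cell.
returns_z = [0.2, 1.5, 4.0, 9.8, 14.1, 18.6, 21.3, 22.7, 24.0, 25.2]
print("P95:", height_percentile(returns_z, 95))
print("P90:", height_percentile(returns_z, 90))
print("share above 2 m:", density_above(returns_z, 2.0))
```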

Remote sensing images and auxiliary geospatial data layers

We utilized the National Agriculture Imagery Program (NAIP) data acquired in 2016 (in leaf-on condition) at a nominal resolution of 0.6 m^24,31. We built the second order GLCM textural rasters of NAIP green and near-infrared (NIR) bands^24,31,66, VIs^24,64,67 and first four PCA rasters based on remaining NAIP bands^31,34 (Appendix 2). We employed Sentinel-2 satellite images from spring 2016, coinciding with the LiDAR data collection period. The selected Sentinel-2 images had ≤ 10% cloud coverage and were preprocessed^68,69. The VIs and spectral responses of image bands were derived from bottom of atmosphere (BOA) reflectance within the selected Sentinel-2 images (Appendix 3)¹³. We derived a series of VIs, including the normalized difference vegetation index (NDVI)^24,64,67, normalized difference red edge (NDRE)^70,71, and normalized difference moisture index (NDMI)⁷¹, to represent vegetation health, greenness and moisture content. The existing National Oceanic and Atmospheric Administration (NOAA) Coastal Change Analysis Program 1 m land cover map and the Connecticut soil type shapefile were also employed to extract categorical variables⁷².
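The three indices named above share the same normalized-difference form, which the following Python sketch makes explicit. The band pairings follow the standard definitions of NDVI, NDRE and NDMI; the function names and reflectance values are illustrative assumptions, and the specific Sentinel-2 bands used in the study are listed in its Appendix 3 rather than here.

```python
def normalized_difference(band_a: float, band_b: float) -> float:
    """Return (a - b) / (a + b), or NaN when the denominator is zero."""
    denominator = band_a + band_b
    if denominator == 0:
        return float("nan")
    return (band_a - band_b) / denominator


def ndvi(nir: float, red: float) -> float:
    """Normalized difference vegetation index."""
    return normalized_difference(nir, red)


def ndre(nir: float, red_edge: float) -> float:
    """Normalized difference red edge index."""
    return normalized_difference(nir, red_edge)


def ndmi(nir: float, swir: float) -> float:
    """Normalized difference moisture index."""
    return normalized_difference(nir, swir)


# Invented surface reflectance values for a single pixel.
pixel = {"red": 0.05, "red_edge": 0.15, "nir": 0.40, "swir": 0.20}
print("NDVI:", ndvi(pixel["nir"], pixel["red"]))
print("NDRE:", ndre(pixel["nir"], pixel["red_edge"]))
print("NDMI:", ndmi(pixel["nir"], pixel["swir"]))
```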

Grouping remote sensing variables

We grouped the explanatory variables based on their original RS data source³². We identified three distinct groups: (i) Group-1: all 67 metrics combined (LiDAR, NAIP and Sentinel-2), (ii) Group-2: LiDAR metrics, and (iii) Group-3: image metrics (Appendix 1–3). This categorization enabled us to gauge the contribution of different RS sources to AGB modeling^24,32. Following the approach suggested by Erdody and Moskal⁷³, we tested nine RF model configurations to systematically evaluate the effects of different variable combinations, aiming to balance model efficiency and accuracy during tuning and training. This process also helped us identify the most important variables for training and reduce model complexity.

Random forest algorithm

Although the Random Forest (RF) algorithm is widely used in AGB research, its performance with limited training datasets remains underexplored^19,20,21. Due to the FIA data non-consent policy, we conducted our analysis off-site using the computing infrastructure of the USDA Northern Research Station, without direct access to precise plot coordinates. Our approach included the following treatments: (i) grouping of remote sensing (RS) variables based on RS data source³², (ii) feature selection based on cross-validated R² and IncNodePurity metrics to reduce multicollinearity, dimensionality and overfitting risk³², and (iii) meticulous hyperparameter tuning to optimize model performance^13,16,32,43. This integrated strategy allowed us to effectively train the RF model despite limited access to ground-truth data.

Variable reduction and selection

The RS data extraction produced 67 explanatory variables as inputs for the RF models, with FIA-AGB estimates serving as the response variable. We identified three explanatory variable groups based on RS data source: (i) Group-1: all 67 metrics combined, (ii) Group-2: LiDAR metrics, and (iii) Group-3: image metrics^24,32. For each group, we used the increase in node purity (IncNodePurity)⁷⁴ and the coefficient of determination (R²) from 5-fold cross-validation (CV) to identify the most important set of variables for all the RF models in a forward selection process^43,75. We trained a series of RF models using repeated k-fold cross-validation, adding one variable at a time from the most important to the least important based on IncNodePurity^13,32,43. As the last step, partial dependency curves were plotted to illustrate the contribution of the most important variables to AGB prediction⁷⁶.
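The forward selection loop described above can be sketched as follows. This Python outline is an assumption-laden illustration rather than the study's R implementation: `forward_select` and `toy_cv_score` are hypothetical names, and the toy scorer stands in for what would in practice be the mean cross-validated R² of an RF model fitted on the current subset of variables.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    ranked_vars: Sequence[str],
    cv_score: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float, List[float]]:
    """Add variables in importance order and keep the best-scoring prefix."""
    best_k, best_score = 0, float("-inf")
    history = []
    for k in range(1, len(ranked_vars) + 1):
        score = cv_score(ranked_vars[:k])  # e.g. mean 5-fold CV R²
        history.append(score)
        if score > best_score:
            best_k, best_score = k, score
    return list(ranked_vars[:best_k]), best_score, history


def toy_cv_score(subset: Sequence[str]) -> float:
    """Stand-in scorer (hypothetical): peaks when three variables are used."""
    return 0.30 - 0.02 * (len(subset) - 3) ** 2


# Variables already sorted from most to least important (toy ranking).
ranked = ["p95", "p90", "sen_swir", "green_layer8", "p50"]
selected, score, history = forward_select(ranked, toy_cv_score)
print(selected, round(score, 2))
```

Keeping the full score history makes it possible to plot the cross-validated score against the number of variables and to see where additional variables stop helping.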

Hyperparameter tuning and random forest model training

The RF algorithm is known for making robust predictions even if the predictor variables are highly correlated⁷⁷. Further, we incorporated bootstrap sampling to reduce the effect of spatial autocorrelation among the subplots. The tree-based structure of the RF algorithm utilizes a different training dataset to build each decision tree, minimizing spatial dependency among subplots⁷⁸. We implemented the RF algorithm using the randomForest package in R version 4.3.1^36,79. The identified groups, (i) Group-1: all 67 metrics combined, (ii) Group-2: LiDAR metrics, and (iii) Group-3: image metrics^24,32, separately went through a careful hyperparameter tuning process, resulting in nine models.
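The bootstrap step that gives each tree its own training set can be illustrated with a few lines of Python. This is a generic sketch of sampling with replacement, not the internals of the randomForest package; the function name and the seed are arbitrary choices made for the example.

```python
import random


def bootstrap_sample(n: int, rng: random.Random):
    """Draw n indices with replacement; unused indices are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(42)  # arbitrary seed, for reproducibility of the example
n_subplots = 114  # training subplots after an 80:20 split of 142
for tree_id in range(3):
    in_bag, out_of_bag = bootstrap_sample(n_subplots, rng)
    share = len(out_of_bag) / n_subplots
    print(f"tree {tree_id}: {len(set(in_bag))} unique in-bag, OOB share {share:.2f}")
```

Each tree sees a different subset of subplots, and the out-of-bag subplots it never saw can be used to estimate its prediction error.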

We employed an 80:20 split strategy to divide the 142 subplots into training and testing subsets, ensuring a robust training process and providing a separate testing dataset for all three variable groups¹⁸. Given the relatively small size of the dataset (142), hyperparameter tuning (Table 1) was designed to improve performance without overfitting³². We initially trained a default RF model for each group. Secondly, we optimized two key parameters, M_try (the number of explanatory variables considered at each split) and N_tree (the number of regression trees), to determine whether an exhaustive hyperparameter tuning was required⁴³. Finally, a complete hyperparameter optimization was carried out using the tuneRF function of the randomForest package in the R environment^32,43. This final optimization employed the expand.grid function in the R environment, providing a sequence of values for each hyperparameter of the parameter space (Appendix 5). The grid search involved an exhaustive exploration of this space, with a stopping metric equal to 20 to streamline the process. This procedure resulted in nine individual RF models, three for each variable group³². We used 5-fold CV for tuning and variable selection without requiring a separate validation set⁴³ (Table 1).
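The split and the grid construction can be sketched in Python as below. The helper `expand_grid` is a hypothetical analogue of R's expand.grid built on itertools.product, and the candidate values are invented for illustration; the values actually searched are given in the study's Appendix 5 and are not reproduced here.

```python
import itertools
import random


def expand_grid(**params):
    """Return every combination of the supplied hyperparameter values."""
    names = list(params)
    return [dict(zip(names, combo)) for combo in itertools.product(*params.values())]


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and hold out a fraction for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(142)
print(len(train_idx), "training subplots,", len(test_idx), "testing subplots")

# Hypothetical candidate values (NOT the study's grid).
grid = expand_grid(mtry=[2, 4, 6], ntree=[100, 500], nodesize=[5, 10])
print(len(grid), "hyperparameter combinations, first:", grid[0])
```

With 142 subplots, a 20% hold-out yields 28 testing subplots, which matches the testing sample size reported for the map comparison.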

We used the hold-out 20% of the data for RF model testing; these data were never used in the model training or validation process. Testing evaluation metrics, such as root mean square error (RMSE), mean square error of the out-of-bag errors (MSE), coefficient of determination (R²), and percentage variance explained, were utilized to compare model performances^17,27,32. This approach ensures robust evaluation, enhances model generalization to unseen data, and provides a platform for comparative analysis⁴³.
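For reference, the two main accuracy metrics can be computed as in the Python sketch below. The R² form used here, 1 − SS_res/SS_tot, is an assumption about the definition applied in the study; it is chosen because it can take slightly negative values for a poorly fitting model, which is consistent with the −0.01 reported for one default model in the results. The observed and predicted values are invented.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Invented hold-out values in Mg/ha.
fia_agb = [80.0, 120.0, 95.0, 150.0, 60.0]
rf_agb = [90.0, 110.0, 100.0, 130.0, 75.0]
print("RMSE:", round(rmse(fia_agb, rf_agb), 2))
print("R²:", round(r_squared(fia_agb, rf_agb), 2))
```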

Table 1 Hyperparameter combinations for each model in the three variable groups.

AGB comparison with existing map products

Finally, we compared the prediction performance of our RF model with three existing regional and landscape-level AGB map products: (i) a 90 m resolution map by Ma et al.⁴⁰, which employed the Ecosystem Demography (ED) model (Map_1), (ii) a 250 m resolution map developed by Blackard et al.⁴¹ using tree-based algorithms (Map_2), and (iii) a regional map created by Menlove and Healey⁴² (Map_3). Our plot-pixel assessment was based on absolute difference (i.e. mean, range) and similarity metrics (RMSE, R²)³⁵ to understand the possible improvements that our RF model offers at the FIA subplot level. We compared all four predicted AGB values with growth adjusted FIA-CRM calculated biomass values for the 28 testing subplots, with resampling if necessary (Appendix 4).

Results

Variable importance

This study considered IncNodePurity value during the variable importance analysis for all three variable groups. We identified the most important set of variables to build the RF models followed by the ranking of variables based on IncNodePurity and forward selection of mean R² of 5-fold CV (Fig. 4). The most important variable of both Group-1 (Combined) and Group-2 (LiDAR_metrics) was the 95th percentile derived from LiDAR point clouds (Fig. 4a,b). From the cohort of image metrics in Group-3, the short-wave infrared (SWIR) band of Sentinel-2 leaf-off images was identified as the most important predictor variable for building RF models (Fig. 4c). Out of all the explanatory variables (Group-1), 68% of the most important variables were LiDAR height-related metrics. The SWIR (1610 nm) band of Sentinel-2 and correlation texture rasters from the NAIP green band were the only image-derived metrics in the top 10 explanatory variables of Group-1 RF models.

Finding the best subset of the predictor variables is crucial to reduce dimensionality and overfitting of RF algorithm⁸⁰. Group-1 reported the maximum cross validation R² value of 0.3364 for the first 28 explanatory variables. Group-2 and Group-3 yielded the maximum cross validation R² of 0.2265 and 0.2039, respectively using their top 16 variables. At this stage, we were able to reduce overfitting and ultimately reduce the complexity of variables.

Random forest model comparison

We carefully compared the hyperparameter tuning outcomes for all three groups, as listed in Table 1. All nine models were cross-validated and tested using a 20% hold-out dataset. Model test results are shown in Table 2. We identified the grid tuned (grid search technique with 5-fold CV) RF models as the highest performing models through a comparative analysis of R², RMSE and percentage variance explained for each variable group. The Group-1 grid tuned model (All_RF03) reported an RMSE of 27.19 Mg ha⁻¹, R² of 0.41, and 40.54% variance explained. The Group-2 grid tuned model (LiDAR_RF03) reported an RMSE of 27.88 Mg ha⁻¹, R² of 0.23, and 23.49% variance explained. Lastly, the Group-3 grid tuned model (Img_RF03) showed an RMSE of 26.50 Mg ha⁻¹, R² of 0.31, and 30.85% variance explained. We observed the smallest mean absolute error (MAE) and confidence interval (CI) for the All_RF03 model.

Table 2 Evaluation metrics of random forest regression models.

For Group-1, the grid tuning process improved the R² from 0.19 to 0.41 while the RMSE decreased by 23.71%. Similarly, for Group-2, the R² increased from 0.01 to 0.23 with an RMSE reduction of 12.16%. With grid tuning, Group-3 also exhibited an increase in R² from −0.01 to 0.31 while RMSE was reduced by 18.06%. Increasing only the number of trees (N_tree) in the RF model does not necessarily result in improved performance; rather, it escalates the computational burden⁴³. A comparison between FIA-based AGB estimates and all the RF model estimates for different hyperparameter and variable spaces is illustrated in Fig. 5.

Based on the evaluation metrics outlined in Table 2 and the residual plots in Fig. 6, the tuned RF models (All_RF03, LiDAR_RF03, and Img_RF03) of each group had the lowest RMSE and the highest R² values compared with their default RF counterparts. We selected the All_RF03 model as the optimal RF model after comparing its evaluation metrics with the next best model, Img_RF03. Relative to All_RF03, the Img_RF03 model showed a 24.39% decrease in R², despite a 2.54% reduction in RMSE (Table 2). An examination of residuals was conducted for the best models chosen from each group. Figure 6 shows that all the RF models give smaller residuals for AGB values below 100 Mg ha⁻¹, which indicates that the tuned models agree less well with the observed data in areas with higher AGB. Figure 6a shows the smallest dispersion of residuals from the x axis, which represents the FIA-AGB estimates. Based on the evaluation metrics and the residual analysis, the All_RF03 model was selected as the optimal RF model for estimating forest aboveground tree biomass within the study area.
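The two relative differences quoted above follow directly from the reported metrics, as this short Python check shows. The helper name is hypothetical; the input values are the R² and RMSE figures reported for All_RF03 and Img_RF03.

```python
def percent_change(reference: float, value: float) -> float:
    """Relative change of value with respect to reference, in percent."""
    return (value - reference) / reference * 100


# Reported metrics: All_RF03 (reference) versus Img_RF03.
r2_change = percent_change(0.41, 0.31)
rmse_change = percent_change(27.19, 26.50)

print(f"R² change:   {r2_change:.2f}%")    # -24.39%
print(f"RMSE change: {rmse_change:.2f}%")  # -2.54%
```

The image-only model therefore gains a small RMSE advantage at the cost of a much larger loss in explained variance, which is the trade-off behind choosing All_RF03.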

Partial dependency curves

The partial dependency curves (PDC) of the All_RF03 model depict the average relationship between each individual predictor variable and the AGB prediction of the optimal RF model. In this study, the top four explanatory variables from the optimum RF model were documented, considering interactions and non-linearities captured by the ensemble of RF decision trees. In Fig. 7, PDCs display the explanatory variables on the X-axis, while the Y-axis represents the mean RF predictions of AGB. According to Fig. 7a–c, the mean AGB response increases steadily, yet at different rates, as the 95th percentile (P95th), 90th percentile (P90th), and Sentinel-2 SWIR band (sen_swir) increase, until it flattens out. This suggests that higher variable values increase the predicted AGB up to a certain point, after which the response stabilizes. However, the GLCM correlation from the NAIP green band (green_layer8) exhibited a negative slope, indicating that the predicted AGB declines steeply with higher values of this correlation texture (Fig. 7d).
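The way such a curve is obtained can be sketched generically in Python: for each grid value, the variable of interest is overwritten in every observation and the model predictions are averaged. The function name, the toy model and the data below are hypothetical stand-ins; in the study the predictions would come from the fitted All_RF03 model.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average model prediction when `feature` is fixed at each grid value."""
    curve = []
    for value in grid:
        predictions = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve


def toy_predict(row):
    """Stand-in model (hypothetical): the response saturates at p95 = 25 m."""
    return 4 * min(row["p95"], 25) + 10 * row["sen_swir"]


# Invented observations with two predictors.
rows = [{"p95": 10, "sen_swir": 0.25}, {"p95": 20, "sen_swir": 0.5}]
curve = partial_dependence(toy_predict, rows, "p95", [10, 30, 40])
for value, mean_prediction in curve:
    print(f"p95 = {value:>2} -> mean prediction {mean_prediction}")
```

In this toy case the curve rises and then flattens once the height variable passes the saturation point, the same qualitative shape described for the height percentiles in Fig. 7.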

Comparison with available map products

This section presents the results of comparing our RF model performance and the three existing AGB map products against FIA-CRM estimates. The All_RF03 model achieved the lowest RMSE of 28.33 Mg ha⁻¹ and the highest R² value of 0.41, demonstrating better predictive accuracy than the existing regional and national maps (Table 3). Map_1 (regional 90 m) had the highest RMSE (145.35 Mg ha⁻¹) and the lowest R² (0.01), while Map_2 (regional 250 m) and Map_3 (national scale) performed better but still exhibited higher RMSE values (82.47 Mg ha⁻¹ and 81.84 Mg ha⁻¹, respectively) than our optimal RF model. This suggests that the optimal RF model significantly outperforms coarse-resolution AGB maps for plot-level biomass estimation in Connecticut.

Table 3 Evaluation metrics of the plot-pixel analysis among the existing map products and the optimum random forest model biomass predictions with forest inventory and analysis field biomass estimates.

According to Fig. 8, the wider confidence intervals of the available maps, particularly at higher AGB values, further highlight the advantages of the RF approach for finer-scale forest biomass prediction. The RF model (in red) shows the closest alignment with the 1:1 reference line, indicating better agreement with FIA estimates across a wide range of biomass values. In contrast, MAP_1 consistently overestimates AGB at the FIA subplot level. MAP_2 and MAP_3 show more moderate overestimation but still deviate further from the 1:1 line than the RF model. The RF model also had a narrower uncertainty range at low and moderate AGB values. These results suggest that our RF model outperforms the existing products in estimating plot-level AGB under restricted data conditions.

Discussion

The study aimed to evaluate the predictive capability of remotely sensed variables derived from a combination of multimodal sources (LiDAR, Sentinel-2, and NAIP) and FIA data for enhancing state-wide forest aboveground biomass (AGB) estimation at fine spatial scales via the random forest (RF) algorithm. Initially, we combined publicly available RS data with single-year FIA data. This choice avoids introducing growth increment-related uncertainties into the model predictions¹⁸, yet it resulted in a small sample size to start with. We used the 2016 FIA data to align with Connecticut’s 2016 LiDAR mission, which kept the project timeline manageable and addressed the challenges associated with using true FIA plot data. Accurate estimation of forest AGB is essential for a range of ecological research areas, including carbon cycling, forest management, and forest dynamics⁸¹. The RF algorithm is widely used in forestry analysis because it can cope with multicollinearity and spatial autocorrelation^82,83. We studied the importance of variable selection and hyperparameter optimization of the RF algorithm in estimating AGB. The AGB values (response variable) in this study were estimated using the FIA-CRM technique. As of September 2023, FIA has transitioned to the National Scale Volume and Biomass Estimators (NSVB) for more consistent and accurate tree structure accounting⁸⁴.

Hyperparameter tuning led to a substantial improvement in R² across all three groups, particularly for the optimal RF model, whose R² increased by 115.79% compared with its default counterpart. Negative R² values indicate a poor model fit, that is, predictions that perform worse than simply using the mean of the observations. Johnson et al.¹⁶ provide a comprehensive framework for using the RF algorithm and achieved an R² of 0.49 with an RMSE of 91.5 Mgha⁻¹. Our study achieved an R² of 0.41, demonstrating the value of meticulous hyperparameter tuning and variable optimization even with a small sample size (N = 142). Torre-Tojal et al.⁴³ provide recommendations for enhancing the accuracy of AGB estimates: tuning hyperparameters for small datasets (N = 55), achieving high geolocation accuracy, and using area-specific allometric equations.
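Grid-search tuning with cross-validation on a small sample can be sketched as below. This is an illustrative example with scikit-learn and synthetic data, not the code used in this study: the data, the grid values, and the train/test split are assumptions. The scikit-learn parameters `n_estimators`, `max_features`, and `min_samples_leaf` only roughly correspond to the number of trees, the number of variables tried at each split, and the terminal node size in other RF implementations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)

# Synthetic data sized like the study sample (N = 142); values are made up.
n_samples, n_features = 142, 10
X = rng.normal(size=(n_samples, n_features))
y = 40.0 + 15.0 * X[:, 0] + 8.0 * X[:, 1] ** 2 + rng.normal(0.0, 10.0, n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hypothetical grid; the study's actual tuning grid is not reproduced here.
param_grid = {
    "n_estimators": [50, 150],
    "max_features": [2, 4],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=5,
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated RMSE:", round(-search.best_score_, 2))
print("held-out R2:", round(search.best_estimator_.score(X_test, y_test), 2))
```

With samples this small, each cross-validation fold contains only a few dozen observations, so the selected parameters and scores can vary noticeably with the random seed.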

A majority of our estimates at the upper end of the AGB distribution fall outside the 95% confidence interval, indicating underestimation relative to the FIA-estimated AGB at the upper levels of the reference distribution (Fig. 5). This behavior can result from the smoothing that the RF algorithm⁸⁵ applies to predictions of extreme values. We had only a few field AGB samples exceeding 100 Mgha⁻¹ in both the training and testing datasets, so data augmentation or a split analysis was infeasible⁸⁶. Synthetic data generation or multi-year data acquisition was challenging due to FIA plot data restrictions⁸⁷. To avoid bias, we did not apply imbalance-handling techniques such as oversampling.
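The tendency of RF regression to underestimate high values can be illustrated with a small synthetic experiment. The sketch below is an assumption-laden example using scikit-learn, not study data. Each tree predicts the mean of the training responses in a leaf, so the forest's prediction can never exceed the largest training response, even when the underlying trend keeps rising.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic linear trend with noise; the training range of x stops at 30.
x_train = rng.uniform(0.0, 30.0, 150)
y_train = 3.0 * x_train + rng.normal(0.0, 5.0, 150)

model = RandomForestRegressor(n_estimators=200, random_state=1)
model.fit(x_train.reshape(-1, 1), y_train)

# Query points beyond the training range, where the true trend keeps rising.
x_new = np.array([35.0, 40.0, 50.0]).reshape(-1, 1)
pred_new = model.predict(x_new)
true_trend = 3.0 * x_new.ravel()

print("largest training response:", round(float(y_train.max()), 1))
print("RF predictions:", np.round(pred_new, 1))
print("underlying trend:", true_trend)
```

The same averaging pulls predictions toward the centre of the training distribution wherever high-biomass samples are scarce, which is consistent with the underestimation observed above 100 Mgha⁻¹.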

In this study, LiDAR metrics were employed to quantify forest structural attributes, including tree height percentiles and slope, which are critical indicators of forest vertical complexity⁶⁰. In contrast, the RS image metrics, obtained from sources such as Sentinel-2 and NAIP, provided information on vegetation heterogeneity and overall vegetation health³¹. Combining image data with LiDAR data increased the overall performance of the model by 24% in terms of R², owing to the inclusion of information on both forest vertical structure and heterogeneity. These complementary datasets gave us a more comprehensive understanding of forest dynamics and contributed to the improved predictive accuracy of AGB estimation at the subplot level.

Our study also focused on reducing dimensionality by identifying the best variable subset for model tuning⁴³. We used IncNodePurity as the selection criterion. A variable with a higher IncNodePurity contributes more to the RF model by improving the homogeneity at each node⁷⁴. Of the 28 variables selected from the Group-1 RF models, 68% were LiDAR-derived, while 18% and 14% were derived from NAIP and Sentinel-2, respectively. The prominence of upper LiDAR height metrics, such as P95th or P99th, in RF models for biomass estimation can be attributed to their strong correlation with canopy structure and tree height. Alongside diameter at breast height (DBH), tree height is one of the most crucial parameters for calculating AGB^88,89. According to Wang et al.⁹⁰, the relationship between tree height and AGB has an R² of 0.77. Riggins et al.⁹¹ achieved an R² of 0.72 using only LiDAR height percentiles with highly accurate field AGB values, but they noted that it is extremely difficult to achieve such accuracy in complex forested environments. As a solution, most recent studies use an array of LiDAR height metrics (percentiles, densities, and height bins) to train models^17,27,32. We observed a strong influence of LiDAR height metrics in our RF model, as 68% of the selected predictors were LiDAR-derived. LiDAR height percentiles provide a statistical summary of the canopy height distribution within the plot. These metrics reflect important structural characteristics such as the central tendency (mean and median), variability, and asymmetry of tree heights^92,93. The selected LiDAR percentiles, from the 95th to the 60th, represent the upper canopy structure, with the 95th percentile indicating the tallest trees, which correlate strongly with mean AGB at the plot level⁹⁰. The height bins provide a normalized measure of the number of points²⁷, whereas the density metrics represent the cumulative frequency of LiDAR returns at or above each specified height threshold. Together they represent the canopy closure, density, and layering of the plots⁹⁴. Generally, taller and more mature trees accumulate more biomass⁸⁸. This explains the strong contribution of LiDAR metrics to the predictive accuracy of the RF model.
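To make the height percentile and density metrics concrete, the following minimal sketch derives them from the normalised return heights of a single plot. It is an illustration under stated assumptions: the function name, the thresholds, and the simulated returns are hypothetical, and height-bin metrics are omitted for brevity.

```python
import numpy as np

def lidar_height_metrics(heights, percentiles=(60, 70, 80, 90, 95),
                         thresholds=(2.0, 5.0, 10.0, 20.0)):
    """Summarise normalised return heights (metres above ground) for one plot."""
    heights = np.asarray(heights, dtype=float)
    metrics = {}
    # Height percentiles; P95th approximates the top of the canopy.
    for p in percentiles:
        metrics[f"P{p}th"] = float(np.percentile(heights, p))
    # Cumulative density: share of returns at or above each height threshold.
    for t in thresholds:
        metrics[f"density_ge_{t:g}m"] = float(np.mean(heights >= t))
    return metrics

# Hypothetical return heights for a single subplot (not study data).
rng = np.random.default_rng(7)
returns = np.concatenate([
    rng.uniform(0.0, 2.0, 200),   # ground and understory returns
    rng.normal(22.0, 3.0, 800),   # canopy returns
]).clip(min=0.0)

for name, value in lidar_height_metrics(returns).items():
    print(f"{name}: {value:.2f}")
```

Because the density metrics are cumulative, they can only decrease or stay equal as the threshold rises, and the rate of that decrease reflects how the returns are layered through the canopy.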

Tree aboveground biomass consists mainly of organic matter and water stored in aboveground components. The Sentinel-2 SWIR band is sensitive to vegetation water content⁹⁵ and to structural and biochemical constituents such as nitrogen, lignin, and cellulose⁹⁶. Dang et al.⁹⁷ reported that SWIR was the best predictor of AGB, with an R² of 0.81. By contrast, traditional vegetation indices such as NDVI primarily reflect greenness and chlorophyll concentration and often saturate in closed-canopy or mature forest conditions^2,98. Partial dependency curves (PDCs) illustrate the relationship between a predictor variable and the predicted outcome while averaging over the values of all other variables. PDCs provide valuable insights into how RF predictions change as a function of each predictor variable, allowing us to identify the importance and influence of each variable in RF regression models⁹⁹. In our analysis, soil-related variables and landcover classes were also examined, but their importance was lower than that of the LiDAR height and image metrics.

Our study systematically addressed and mitigated uncertainties associated with machine learning model selection, training, and the geolocation error of FIA field measurements. In the publicly available FIA data, up to 20% of private FIA plot coordinates are swapped with those of a similar private plot within the same county, and a small subset of coordinates is fuzzed by up to 1 mile⁵⁸. Even when using the true FIA subplot data, the average centroid GPS location error is about 8 m. Therefore, removing non-forested areas and screening for obvious spatial outliers by comparing LiDAR heights with FIA-AGB estimates were necessary to maintain the consistency and accuracy of AGB predictions²⁹.

Trees can be found in both forested and non-forested land use areas in Connecticut¹⁷. The AGB of trees outside forests (non-forested areas) may not be estimated accurately when the model is applied, because those areas were not represented in model training or in the selection of allometric equations⁶³. Therefore, our RF models are only suitable for predicting the AGB of forested areas. Including trees outside forests could significantly improve estimates of total AGB in Connecticut¹⁷. Adjusting tree increment factors is a useful way to increase the number of FIA subplot samples. However, this study used only the FIA data collected within 2016, since such adjustments would have exceeded the expected timeline of the project⁴³.

Most previous research attained R² values in the range of 0.30–0.60, even with fewer restrictions and more flexible access to FIA data^19,43. Our research achieved a moderate R² of 0.41, which is within the range of accuracy reported by previous research. The main reason for the moderate R² is the restriction on using actual plot locations, which substantially reduced the size of the training data. RF models typically perform better with larger datasets. However, we considered only single-year data since our code was run on off-site USDA-FIA computers. In addition, the true FIA plot locations potentially have a geolocation error of 8 m (± 2 m), which adds an unknown error to the predictions and noise to the RS variables⁵⁶. The research could be improved by adding more subplots and using growth increment factors, which would allow the entire data cycle to be investigated and improve generalization. Torre-Tojal et al.⁴³ and Luo et al.³⁴ conducted biomass estimations in various forest age classes using airborne LiDAR data with a significantly higher pulse density (4.1 pulses/m²) than that used in our study. They achieved better results with full-waveform metrics (R² between 0.81 and 0.84) than with discrete-return metrics (R² of 0.8), which points to potential improvements to our RF models. The scale of analysis could also be investigated thoroughly by considering FIA subplot, plot, or hectare plot scales. However, the size of the dataset and the restrictions on FIA plot coordinates limited our ability to use these techniques in our analysis²⁷.

In this study, careful consideration was given to the RF tuning grids since both the overall sample and the test set were small. Additional analysis of spatial autocorrelation (e.g., Moran’s I) could clarify the specific effects of subplot-level autocorrelation beyond the assumptions and treatments used in our RF model structure. Integrating ecology-based biomass modeling concepts could enhance RF model accuracy by accounting for variations in site quality, tree age, or available nutrient content. The current study compares its predictions with a 90 m AGB map⁴⁰ trained on LiDAR, NAIP, NLCD, and FIA data using ecological modeling, and with the 250 m AGB map by Blackard et al.⁴¹ and Menlove and Healey⁴². In Fig. 8, wider grey shaded confidence interval (CI) bands indicate higher uncertainty, while narrower bands suggest more precise estimates. Furthermore, our study was unable to provide an in-depth analysis of which forest types the model predicts best. This also prevented us from using a representative stratified random sampling technique across forest types. However, the forest type information incorporated from the CT 1 m landcover (Cover_type) was ranked among the least important variables based on IncNodePurity, likely for one of two reasons: (i) the dataset lacked sufficient representation to differentiate forest types (Fig. 4a), or (ii) the existing set of explanatory variables was already sufficient to achieve the current accuracy without an additional contribution from forest type information. Future studies could consider methods such as destructive sampling, integrating non-forested tree data, or establishing FIA-like plots outside the federal data system to enable more spatially robust validation strategies and to increase ecological representation in training and validation^32,43.
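As a sketch of the kind of spatial autocorrelation check suggested above, global Moran’s I can be computed on model residuals with a simple distance-based neighbourhood. The example below is illustrative only: the binary weighting scheme, the distance threshold, and the gridded residuals are assumptions, and no such analysis was run in this study.

```python
import math

def morans_i(values, coords, max_distance):
    """Global Moran's I with binary weights: w_ij = 1 if 0 < d_ij <= max_distance."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross_sum = 0.0   # sum over neighbour pairs of dev_i * dev_j
    weight_sum = 0.0  # total number of (directed) neighbour pairs
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if math.dist(coords[i], coords[j]) <= max_distance:
                cross_sum += dev[i] * dev[j]
                weight_sum += 1.0
    denom = sum(d * d for d in dev)
    return (n / weight_sum) * (cross_sum / denom)

# Hypothetical residuals on a 4 x 4 grid of plot centres (spacing of 1 unit).
coords = [(x, y) for y in range(4) for x in range(4)]
clustered = [1.0 if x < 2 else -1.0 for x, y in coords]              # two halves
checkerboard = [1.0 if (x + y) % 2 == 0 else -1.0 for x, y in coords]

print("clustered residuals:", round(morans_i(clustered, coords, 1.0), 3))
print("alternating residuals:", round(morans_i(checkerboard, coords, 1.0), 3))
```

Values near +1 indicate that similar residuals cluster in space, values near -1 indicate alternating residuals, and values near the expectation of -1/(n - 1) suggest little spatial structure. The function assumes every dataset has at least one neighbour pair within `max_distance`.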

We compared our AGB estimates with three existing biomass map products (Appendix 4) to evaluate the potential benefits of fine-resolution models. However, differences in spatial resolution and data sources can significantly influence the comparison metrics (Table 3). For instance, our RF model achieved the lowest RMSE (28.33 Mgha⁻¹) and highest R² (0.41), while Map_1 showed the highest RMSE (145.35 Mgha⁻¹) and lowest R² (0.01), likely due to scale mismatches at broader spatial extents. Additionally, the mean and variance suggest that these coarser-resolution maps tend to overestimate biomass and show higher variability (Map_1: variance = 5170.04 Mgha⁻¹), while our RF model produced estimates more closely aligned with the growth-adjusted FIA-CRM estimates (mean = 43.92 Mgha⁻¹; variance = 277.78 Mgha⁻¹). These results highlight how spatial resolution and model selection influence the accuracy of AGB predictions, supporting the value of high-resolution AGB modeling for capturing local characteristics.

Finally, we acknowledge that ML models are fundamentally data driven. Our study addressed this challenge through extensive hyperparameter tuning, variable grouping, and variable selection. However, recent strategies such as data augmentation, ensemble modeling, or theory-guided constraints^100,101 may further enhance predictive accuracy in future research. In addition, transformer-based foundation models such as Prithvi-WxC¹⁰² and SatCLIP¹⁰³ show promise for future AGB estimation. Our research can help future researchers identify areas for development, using improved ML and DL models not only to evaluate spatial context but also as a guide for building a protocol when data access is limited.

Conclusion

The study utilized the RF algorithm to estimate forest AGB in Connecticut, showcasing the algorithm’s robustness even with a limited training data set. Employing grid search for hyperparameter optimization together with cross-validation produced promising results, with an RMSE of 27.19 Mgha⁻¹ and an R² of 0.41. Combining RS image data with LiDAR metrics improved R² by 34.78% compared with using LiDAR exclusively. Of the most important variables, 68% are LiDAR height-related. Integrating LiDAR data with RS image data and hyperparameter tuning significantly enhanced model performance. The optimum RF model shows high accuracy within the inter-quartile range of field CRM-AGB. Future research could focus on increasing the amount of test data, refining plot locations, accounting for spatial autocorrelation, and leveraging high-density LiDAR/RS data for improved performance in biomass mapping.

Data availability

All publicly accessible data supporting the findings of this study are provided within the manuscript and its supplementary information files (Appendices 1–5). However, due to the Forest Inventory and Analysis (FIA) landowners’ data protection policy, the training data cannot be shared herewith. Interested users may contact the authors directly via the email addresses provided in the manuscript for guidance or collaboration opportunities involving FIA data. Our analysis codes, trained models and workflows will be made available upon request on GitHub to ensure transparency and support reproducibility. Please contact Shashika Himandi at shashika_himandi.lamahewa@uconn.edu or Dr. Chandi Witharana at chandi.witharana@uconn.edu to request access.

References

Dixon, R. K. et al. Carbon pools and flux of global forest ecosystems. Science 263(5144), 185–190 (1994).
Baccini, A. G. S. J. et al. Estimated carbon dioxide emissions from tropical deforestation improved by carbon-density maps. Nature climate change 2(3), 182–185 (2012).
Zaki, N. A., Mohd, Z. A., Latif & Mohd Zainee Zainal. and. Predicting above-ground biomass and carbon stocks by using geographically weighted regression (GWR). In 38th Asian Conf Remote Sens–Sp Appl Touching Hum Lives, ACRS (2017).
Oswalt, S. N., Smith, W. B., Miles, P. D. & Pugh, S. A. Forest resources of the United States, 2017. In General Technical Report-US Department of Agriculture, Forest Service (2019).
FS-130 & Update, R. Forests of Connecticut. (2016).
Wang, L. P., Basu, S. & Zhang, Z. M. Direct and indirect methods for calculating thermal emission from layered structures with nonuniform temperatures. 072701. (2011).
Shi, L. & Liu, S. Methods of estimating forest biomass: A review. Biomass Volume Estimation Valorization Energy. 10, 65733 (2017).
Wang, M., Im, J., Zhao, Y. & Zhen, Z. Multi-Platform lidar for Non-Destructive individual aboveground biomass Estimation for Changbai larch (Larix olgensis Henry) using a hierarchical bayesian approach. Remote Sens. 14 (17), 4361 (2022).
Jenkins, J. C., Chojnacky, D. C., Heath, L. S. & Birdsey, R. A. National-scale biomass estimators for united States tree species. For. Sci. 49 (1), 12–35 (2003).
Somogyi, Z. et al. Indirect methods of large-scale forest biomass Estimation. European J. For. Research. 126, 197–207 (2007).
Smith, W. B. Forest inventory and analysis: a National inventory and monitoring program. Environ. Pollut. 116, S233–S242 (2002).
Woudenberg, S. W. et al. The Forest Inventory and Analysis Database: Database Description and User’s Manual Version 4.0 for Phase 2 (United States Department of Agriculture, Forest Service, Rocky Mountain Research Station, 2010).
Tamiminia, H., Salehi, B., Mahdianpari, M. & Goulden, T. State-wide forest canopy height and aboveground biomass map for new York with 10 m resolution, integrating GEDI, Sentinel-1, and Sentinel-2 data. Ecol. Inf. 79, 102404 (2024).
Huang, H., Liu, C., Wang, X., Zhou, X. & Gong, P. Integration of multi-resource remotely sensed data and allometric models for forest aboveground biomass Estimation in China. Remote Sens. Environ. 221, 225–234 (2019).
Li, C., Li, Y. & Li, M. Improving Forest aboveground biomass (AGB) estimation by incorporating crown density and using landsat 8 OLI images of a subtropical forest in Western Hunan in Central China. Forests 10(2), 104 (2019).
Johnson, K. D. et al. Integrating forest inventory and analysis data into a LIDAR-based carbon monitoring system. Carbon Balance Manag. 9, 1–11 (2014).
Johnson, K. D. et al. Integrating LIDAR and forest inventories to fill the trees outside forests data gap. Environ. Monit. Assess. 187, 1–8 (2015).
Johnson, L. K. et al. Fine-resolution landscape-scale biomass mapping using a Spatiotemporal patchwork of lidar coverages. Int. J. Appl. Earth Obs. Geoinf. 114, 103059 (2022).
Hu, T. et al. Mapping global forest aboveground biomass with spaceborne lidar, optical imagery, and forest inventory data. Remote Sens. 8 (7), 565 (2016).
Hudak, A. T. et al. A carbon monitoring system for mapping regional, annual aboveground biomass across the Northwestern USA. Environmental Res. Letters. 15 (9), 095003 (2020).
Tang, H. et al. Lidar derived biomass, canopy height, and cover for New England region, USA. ORNL DAAC (2021).
Chen, H. et al. Mapping forest aboveground biomass with MODIS and Fengyun-3 C VIRR imageries in Yunnan province, Southwest China using linear regression, K-Nearest neighbor and random forest. Remote Sens. 14 (21), 5456 (2022).
Dubayah, R. O. et al. Estimation of tropical forest height and biomass dynamics using lidar remote sensing at La selva, Costa Rica. J. Geophys. Res. Biogeosci. 115, (G2) (2010).
Ehlers, D. et al. Mapping forest aboveground biomass using multisource remotely sensed data. Remote Sens. 14 (5), 1115 (2022).
Zheng, D., Heath, L. S. & Ducey, M. J. Spatial distribution of forest aboveground biomass estimated from remote sensing and forest inventory data in new england, USA. J. Appl. Remote Sens. 2 (1), 021502 (2008).
Mancini, F. et al. An integrated procedure to assess the stability of coastal Rocky cliffs: from UAV close-range photogrammetry to Geomechanical finite element modeling. Remote Sens. 9 (12), 1235 (2017).
Sheridan, R. D. et al. Modeling forest aboveground biomass and volume using airborne lidar metrics and forest inventory and analysis data in the Pacific Northwest. Remote Sens. 7 (1), 229–255 (2014).
Urbazaev, M. et al. Estimation of forest aboveground biomass and uncertainties by integration of field measurements, airborne lidar, and SAR and optical satellite data in Mexico. Carbon Balance Manag. 13, 1–20 (2018).
Duncanson, L. et al. Implications of allometric model selection for county-level biomass mapping. Carbon Balance Manag. 12, 1–11 (2017).
Haralick, R. M., Shanmugam, K. & Dinstein, I. Textural features for image classification. IEEE Trans. Syst. Man. Cybernetics. 6, 610–621 (1973).
Csillik, O., Kumar, P., Mascaro, J., O’Shea, T. & Asner, G. P. Monitoring tropical forest carbon stocks and emissions using planet satellite data. Sci. Rep. 9 (1), 1–12 (2019).
Nandy, S., Srinet, R. & Padalia, H. Mapping forest height and aboveground biomass by integrating ICESat-2, Sentinel‐1 and Sentinel‐2 data using Random Forest algorithm in northwest Himalayan foothills of India. Geophysical Research Letters 48(14), e2021GL093799 (2021).
Naik, P., Dalponte, M. & Bruzzone, L. Prediction of forest aboveground biomass using multitemporal multispectral remote sensing data. Remote Sens. 13 (7), 1282 (2021).
Luo, P., Liao, J. & Shen, G. Combining spectral and texture features for estimating leaf area index and biomass of maize using Sentinel-1/2, and Landsat-8 data. IEEE Access. 8, 53614–53626 (2020).
Gao, Y. et al. Comparative analysis of modeling algorithms for forest aboveground biomass Estimation in a subtropical region. Remote Sens. 10 (4), 627 (2018).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Han, S., Williamson, B. D. & Fong, Y. Improving random forest predictions in small datasets from two-phase sampling designs. BMC Med. Inf. Decis. Mak. 21, 1–9 (2021).
Sivasankar, T., Lone, J. M., Sarma, K. K., Qadir, A. & Raju, P. L. N. Estimation of above ground biomass using support vector. Vietnam Journal of Earth Sciences 41(2), 95–104 (2013).
Wolpert, D. H. Stacked generalization. Neural Netw. 5 (2), 241–259 (1992).
Ma, L. et al. Mapping US forest biomass using nationwide forest inventory data and moderate resolution information. Remote sensing of Environment 112(4), 1658–1677 (2022).
Menlove, J. & Healey, S. CMS: Forest Aboveground Biomass from FIA Plots across the Conterminous USA 2009–2019 (ORNL DAAC, 2021).
Fox, E. W., Ver Hoef, J. M. & Olsen, A. R. Comparing spatial regression to random forests for large environmental data sets. PLoS ONE 15 (3), e0229509 (2020).
Torre-Tojal, L., Bastarrika, A. & Boyano, A. Above-ground biomass estimation from LiDAR data using random forest algorithms. J. Comput. Sci. 58, 101517 (2022).
Luna, Soriano et al. Determinants of above-ground biomass and its spatial variability in a temperate forest managed for timber production. Forests 9(8), 490 (2018).
Riemann, R., Wilson, B. T., Lister, A. & Parks, S. An effective assessment protocol for continuous Geospatial datasets of forest characteristics using USFS forest inventory and analysis (FIA) data. Remote Sens. Environ. 114 (10), 2337–2352 (2010).
Food and Agriculture Organization of the United Nations (FAO). Global Forest Resources Assessment 2020: Main Report (FAO, 2020).
EPA. Level III and IV Ecoregions of the Continental United States. U.S. Environmental Protection Agency. (Accessed 15 May 2025). https://www.epa.gov/eco-research/ecoregions (2013).
Liu, Z., Luong, P., Boley, M. & Schmidt, D. F. Improving random forests by smoothing. https://arXiv.org/abs/2505.06852. (2025).
USDA Forest Service. Forests of Connecticut, 2020. Resource Update FS-334. Madison 2. https://doi.org/10.2737/FS-RU-334 (U.S. Department of Agriculture, Forest Service, 2021).
Brand, G. J., Nelson, M. D., Wendt, D. G. & Nimerfro, K. K. The Hexagon/panel System for Selecting FIA Plots Under an Annual Inventory (USFS FIA research, 2003).
Lu, D. et al. A survey of remote sensing-based aboveground biomass Estimation methods in forest ecosystems. Int. J. Digit. Earth. 9 (1), 63–105 (2016).
Tinkham, W. T. et al. Applications of the united States forest inventory and analysis dataset: a review and future directions. Can. J. For. Res. 48 (11), 1251–1268 (2018).
Burkman, B. Forest inventory and analysis: sampling and plot design. FIA Fact. Sheet Ser. (2005).
Butler, B. J. Forests of Connecticut, 2017. Resource Update FS-159. Newtown Square, PA: U.S. Department of Agriculture, Forest Service, Northern Research Station. 3. https://doi.org/10.2737/FS-RU-159 (2018).
Hoppus, M. & Lister, A. The status of accurately locating forest inventory and analysis plots using the global positioning system. In Proceedings of the Seventh Annual Forest Inventory and Analysis Symposium, Portland, OR, USA 36, 179–184 (2005).
Lister, A. et al., Strategies for preserving owner privacy in the national information management system of the USDA Forest Service’s Forest Inventory and Analysis unit. United States department of agriculture forest service general technical report NC 352 163 (2005).
Woodall, C. W., Heath, L. S., Domke, G. M. & Nichols, M. C. Methods and equations for estimating aboveground volume, biomass, and carbon for trees in the US forest inventory. Gen. Tech. Rep. NRS-88. Newtown Square, PA: US Department of Agriculture, Forest Service, Northern Research Station 30 (2011).
Burrill, E. A. et al. The Forest Inventory and Analysis Database: Database Description and User Guide Version 9.0.1 for Phase 2. U.S. Department of Agriculture, Forest Service. 1026. (Accessed 03 March 2022). https://research.fs.usda.gov/programs/fia#data-and-tools (2021).
Chen, Q., Laurin, G. V. & Valentini, R. Uncertainty of remotely sensed aboveground biomass over an African tropical forest: propagating errors from trees to plots to pixels. Remote Sens. Environ. 160, 134–143 (2015).
Chen, Q., Laurin, G. V., Battles, J. J. & Saah, D. Integration of airborne lidar and vegetation types derived from aerial photography for mapping aboveground live biomass. Remote Sens. Environ. 121, 108–117 (2012).
CT department of energy and environmental protection. (n.d.). Connecticut environmental conditions online. Connecticut Environmental Conditions Online Maps & Geospatial Data for Everyone. https://cteco.uconn.edu/guides/Soils.htm (2024).
Shao, G. et al. Improving Lidar-based aboveground biomass Estimation of temperate hardwood forests with varying site productivity. Remote Sens. Environ. 204, 872–882 (2018).
McPherson, E. G., van Doorn, N. S. & Peper, P. J. Urban tree database and allometric equations. Gen. Tech. Rep. PSW-GTR-253. Albany, CA: US department of agriculture, forest service. Pac. Southwest. Res. Stn. 86, 253 (2016).
Hayashi, M., Saigusa, N., Yamagata, Y. & Hirano, T. Regional forest biomass Estimation using icesat/glas spaceborne lidar over Borneo. Carbon Manag. 6 (1–2), 19–33 (2015).
Chen, L., Ren, C., Zhang, B., Wang, Z. & Xi, Y. Estimation of forest above-ground biomass by geographically weighted regression and machine learning with Sentinel imagery. Forests 9 (10), 582 (2018).
Bright, B. C., Hicke, J. A. & Hudak, A. T. Estimating aboveground carbon stocks of a forest affected by mountain pine beetle in Idaho using lidar and multispectral imagery. Remote Sens. Environ. 124, 270–281 (2012).
Pandit, S., Tsuyuki, S. & Dube, T. Estimating above-ground biomass in sub-tropical buffer zone community forests, nepal, using Sentinel 2 data. Remote Sens. 10 (4), 601 (2018).
Moradi, F., Darvishsefat, A. A., Pourrahmati, M. R., Deljouei, A. & Borz, S. A. Estimating aboveground biomass in dense Hyrcanian forests by the use of Sentinel-2 data. Forests 13(1), 104 (2022).
Parent, J. R., Gold, A. J., Vogler, E. & Lowder, K. A. Guiding decisions on the future of dams: A GIS database characterizing ecological and social considerations of dam decisions. J. Environ. Manage. 351, 119683 (2024).
Huete, A. R. A soil-adjusted vegetation index (SAVI). Remote sensing of environment 25(3), 295–309 (1988).
Mohammadpour, P., Viegas, D. X. & Viegas, C. Vegetation mapping with random forest using Sentinel 2 and GLCM texture feature - A case study for lousã region, Portugal. Remote Sens. 14 (18), 4585 (2022).
Genuer, R., Poggi, J. M. & Tuleau-Malot, C. Variable selection using random forests. Pattern Recognit. Lett. 31 (14), 2225–2236 (2010).
Wang, Y., Zhang, X. & Guo, Z. Estimation of tree height and aboveground biomass of coniferous forests in North China using stereo ZY-3, multispectral Sentinel-2, and DEM data. Ecol. Ind. 126, 107645 (2021).
Breiman, L. Classification and Regression Trees (Routledge, 2017).
Wongchai, W., Onsree, T., Sukkam, N., Promwungkwa, A. & Tippayawong, N. Machine learning models for estimating above ground biomass of fast-growing trees. Expert Syst. Appl. 199, 117186 (2022).
Biau, G. Analysis of a random forests model. J. Mach. Learn. Res. 13 (1), 1063–1095 (2012).
Sekulić, A., Kilibarda, M., Heuvelink, G. B. M., Nikolić, M. & Bajat, B. Random forest spatial interpolation. Remote Sens. 12 (10), 1687 (2020).
Liaw, A. & Wiener, M. Classification and regression by randomForest. R News. 2 (3), 18–22 (2002).
Dewi, C. & Chen, R.-C. Random forest and support vector machine on features selection for regression analysis. Int. J. Innov. Comput. Inf. Control. 15 (6), 2027–2037 (2019).
Xu, D., Wang, H., Xu, W., Luan, Z. & Xu, X. LiDAR applications to estimate forest biomass at individual tree scale: Opportunities, challenges and future perspectives. Forests 12(5), 550 (2021).
Hong, Y. et al. Combining multisource data and machine learning approaches for multiscale Estimation of forest biomass. Forests 14 (11), 2248 (2023).
Tang, Z., Xia, X., Huang, Y., Lu, Y. & Guo, Z. Estimation of National forest aboveground biomass from multi-source remotely sensed dataset with machine learning algorithms in China. Remote Sens. 14 (21), 5487 (2022).
Article Google Scholar
U.S. Forest Service. National Scale Volume and Biomass Estimators (NSVB). Forest Inventory and Analysis Program. (Accessed 27 October 2024). https://research.fs.usda.gov/programs/fia/nsvb
Asner, G. P. et al. High-resolution forest carbon stocks and emissions in the Amazon. Proc. Natl. Acad. Sci. 107 (38), 16738–16742. https://doi.org/10.1073/pnas.1004875107 (2010).
Shorten, C. & Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. J. Big Data. 6 (1), 1–48 (2019).
Lu, Y. et al. Machine learning for synthetic data generation: a review. arXiv preprint arXiv:2302.04062 (2023).
Qadeer, A., Shakir, M., Wang, L. & Talha, S. M. Evaluating machine learning approaches for aboveground biomass prediction in fragmented high-elevated forests using multi-sensor satellite data. Remote Sens. Applications: Soc. Environ. 36, 101291 (2024).
White, J. et al. A model development and application guide for generating an enhanced forest inventory using airborne laser scanning data and an area-based approach. (2017).
Friedman, J. H. & Popescu, B. E. Predictive learning via rule ensembles. 916–954 (2008).
Riggins, J. J., Tullis, J. A. & Stephen, F. M. Per-segment aboveground forest biomass estimation using LIDAR-derived height percentile statistics. GIScience Remote Sens. 46 (2), 232–248 (2009).
Holmgren, J. Prediction of tree height, basal area and stem volume in forest stands using airborne laser scanning. Scand. J. For. Res. 19 (6), 543–553 (2004).
Lim, K. S. & Treitz, P. M. Estimation of above ground forest biomass from airborne discrete return laser scanner data using canopy-based quantile estimators. Scand. J. For. Res. 19 (6), 558–570 (2004).
Næsset, E. Airborne laser scanning as a method in operational forest inventory: status of accuracy assessments accomplished in Scandinavia. Scand. J. For. Res. 22 (5), 433–442 (2007).
Chen, L., Wang, Y., Ren, C., Zhang, B. & Wang, Z. Optimal combination of predictors and algorithms for forest above-ground biomass mapping from Sentinel and SRTM data. Remote Sens. 11 (4), 414 (2019).
Wai, P., Su, H. & Li, M. Estimating aboveground biomass of two different forest types in Myanmar from Sentinel-2 data with machine learning and geostatistical algorithms. Remote Sens. 14 (9), 2146 (2022).
Dang, A. T. N. et al. Forest aboveground biomass estimation using machine learning regression algorithm in Yok Don National Park, Vietnam. Ecol. Inf. 50, 24–32 (2019).
Huete, A. et al. Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens. Environ. 83 (1–2), 195–213 (2002).
Liu, Z., Luong, P., Boley, M. & Schmidt, D. F. Improving random forests by smoothing. https://arXiv.org/abs/2505.06852 (2025).
He, Q., Chen, E., An, R. & Li, Y. Above-ground biomass and biomass components estimation using lidar data in a coniferous forest. Forests 4 (4), 984–1002 (2013).
Chen, L. et al. Evaluating the transferability of spectral variables and prediction models for mapping forest aboveground biomass using transfer learning methods. Remote Sens. 15 (22), 5358 (2023).
Schmude, J. et al. Prithvi WxC: foundation model for weather and climate. https://arXiv.org/abs/2409.13598 (2024).
Klemmer, K., Rolf, E., Robinson, C., Mackey, L. & Rußwurm, M. SatCLIP: Global, general-purpose location embeddings with satellite imagery. In Proceedings of the AAAI Conference on Artificial Intelligence 39 (4), 4347–4355 (2025).
Biau, G. Analysis of a random forests model. J. Mach. Learn. Res. 13, 1063–1095 (2012).

Acknowledgements

The work reported in this paper was fully funded by the USDA-NIFA McIntire-Stennis Capacity Grant (#CONS01050), USA (2022–2024). We are grateful to the U.S. Forest Service, U.S. Department of Agriculture, Northern Research Station, for sharing data and expertise. We also thank Andrew J. Lister and Charles Paulson of the U.S. Forest Service, Northern Research Station, for their insightful comments and invaluable support during the data extraction process.

Author information

Authors and Affiliations

Department of Natural Resources and the Environment, College of Agriculture, Health and Natural Resources, University of Connecticut, Storrs, CT, 06269, USA
Shashika Himandi Gardeye Lamahewage, Chandi Witharana, Robert Fahey & Thomas Worthley
Eversource Energy Center, University of Connecticut, Storrs, CT, 06262, USA
Chandi Witharana, Robert Fahey & Thomas Worthley
Forest Inventory and Analysis, Northern Research Station, USDA Forest Service, Troy, NY, USA
Rachel Riemann

Authors

Shashika Himandi Gardeye Lamahewage
Chandi Witharana
Rachel Riemann
Robert Fahey
Thomas Worthley

Contributions

Shashika Himandi: Conceptualization, Software, Formal analysis, Figures, Tables, Writing - original draft. Chandi Witharana: Conceptualization, Supervision, Funding acquisition, Reviewing - original draft. Rachel Riemann: Conceptualization, Formal analysis, Reviewing - original draft. Robert Fahey: Supervision, Reviewing - original draft. Thomas E. Worthley: Supervision, Funding acquisition, Reviewing - original draft.

Corresponding author

Correspondence to Shashika Himandi Gardeye Lamahewage.

Ethics declarations

Competing interests

The authors declare no competing interests.

Permissions and licenses

This study was conducted across the forests of Connecticut in collaboration with scientists from the U.S. Forest Service’s Forest Inventory and Analysis (FIA) program. Because the work was carried out as part of this established collaboration, relied on publicly available data, and did not reveal true plot locations, no special permissions or licenses were required.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Lamahewage, S.H.G., Witharana, C., Riemann, R. et al. Aboveground biomass estimation using multimodal remote sensing observations and machine learning in mixed temperate forest. Sci Rep 15, 31120 (2025). https://doi.org/10.1038/s41598-025-15585-6

Received: 26 December 2024
Accepted: 08 August 2025
Published: 24 August 2025
Version of record: 24 August 2025
DOI: https://doi.org/10.1038/s41598-025-15585-6
