Abstract
Plants sequester carbon in their aboveground components, making aboveground tree biomass a key metric for assessing forest carbon storage. Traditional methods of aboveground biomass (AGB) estimation via Forest Inventory and Analysis (FIA) plots lack sufficient sampling intensity to directly produce accurate estimates at fine granularities. Increasing the sampling intensity with additional FIA plots would be labor and time intensive, particularly for large-scale carbon studies. Utilizing remote sensing (RS) data, such as airborne Light Detection and Ranging (LiDAR), aerial imagery, and satellite images, can significantly enhance the efficiency of forest carbon monitoring efforts. The principal objective of this study is to utilize the random forest (RF) algorithm to build predictive AGB models. We utilized 67 explanatory variables extracted from three RS sources, resulting in nine RF models. Each RF model was subjected to variable selection, hyperparameter tuning, and model evaluation. The optimum model considered 28 explanatory variables, with a root mean square error (RMSE) of 27.19 Mg ha−1 and an R2 of 0.41. Combining LiDAR with image metrics increased the accuracy of the prediction models, which can serve as a pivotal tool for large-area biomass mapping and carbon-related decision making.
Introduction
Forests span over 410 million hectares globally, serving as vital carbon sinks that store atmospheric carbon mainly as above-ground biomass1. Understanding carbon sources and their sequestration is crucial for mitigating actions against global climate change2. Aboveground biomass (AGB) of trees in the forest is the foundation of estimating above ground carbon and understanding the carbon sink and source balance3. Forests and woodlands occupy over one third of the United States landscape, containing approximately 1 trillion cubic feet of wood volume4. An estimated 58% of the land area of the state of Connecticut meets the Forest Inventory and Analysis (FIA) definition of forest land. Connecticut encompasses 1.8 million acres of forested land, which bears approximately 135.69 million tons of oven-dry AGB5.
There are two primary methods of calculating aboveground biomass: (i) direct and (ii) indirect6. The direct (or destructive) method measures the oven-dry weight of all aboveground plant material. Destructive sampling is a complex, time-consuming method used primarily to develop and calibrate allometric equations of AGB from tree structural parameters. Indirect methods utilize those allometric equations to provide AGB estimates from field inventories of tree structural characteristics7,8. In forests, indirect estimation is almost always used because directly measuring AGB is practically infeasible. As applied, indirect estimation uses species-specific allometric equations to calculate the approximate AGB of trees and forested areas9,10. The FIA program, which has been conducting comprehensive forest inventories for over seven decades from local to national scales11, provides a statewide database of AGB estimates for inventoried trees, subplots and plots. FIA has historically used the component ratio method (CRM) to calculate the AGB estimates, but in 2023 the National Scale Volume Biomass (NSVB) model was introduced for more accurate biomass estimations12,58.
A rapid and accurate estimation of large-area (e.g., statewide) forest AGB using field inventory data remains challenging in the forestry research sector13. The use of remote sensing (RS) techniques in forestry has been increasing due to their efficiency, accuracy and predictability when studies require large-scale data at high temporal resolution13,14. RS data facilitate building large-area maps at different spatiotemporal scales, particularly when field sampling is insufficient for statistical estimation in state- or landscape-level modeling15.
Combining RS data with FIA data is a popular and frequently used technique16,17,18. Light Detection and Ranging (LiDAR) data is often employed to build random forest (RF) models to predict forest AGB16. Hu et al.19 used spaceborne LiDAR, satellite images, climate surfaces, topographic data, and optical imagery to build 1 km resolution AGB density maps at global scale with an R2 of 0.56 and an RMSE of 87.53 Mg ha−1. Hudak et al.20 built a carbon monitoring system utilizing 3805 field-measured plots for mapping regional, annual aboveground biomass across the northwestern USA, achieving an R2 of 0.8 and an RMSE of 152 Mg ha−1. Tang et al.21 published a multi-state forest carbon map with 30 m predictions and an R2 of 0.38, utilizing 1986 FIA subplot locations, LiDAR, and National Agriculture Imagery Program (NAIP) data. Other studies have utilized synthetic aperture radar (SAR), vegetation indices (VIs), and Moderate Resolution Imaging Spectroradiometer (MODIS) data to create a range of machine learning (ML) and multiple linear regression (MLR) models18,22,23,24.
Integrating FIA data with both passive and active remote sensing technologies (such as LiDAR, Landsat, Sentinel-2, and NAIP) enhances large-area biomass predictions by providing valuable details at higher temporal and spatial resolutions that are particularly relevant for canopy cover and disturbance characteristics25. According to Mancini et al.26, remotely acquired data enhance the information in field data and reduce the need for additional extensive fieldwork. LiDAR provides a comprehensive characterization of forest structure, canopy height information, vertical profile, and tree density16,27,28. Such detailed 3D representation enables a precise estimation of AGB and carbon stocks due to the strong linear relationship between tree height and tree diameter29. In terms of RS image observations, Sentinel-2 data provides high temporal resolution and NAIP offers high spatial resolution (0.6 m). Variables developed from remotely sensed imagery have become crucial data for characterizing forest health, composition, and qualitative characteristics. For example, grey-level co-occurrence matrix (GLCM) textures30 inform canopy complexity and spatial heterogeneity, which are important for distinguishing biomass variation31,32. Vegetation indices provide valuable information about vegetation characteristics, vigor, moisture content, canopy density, and distribution32,33. Principal Component Analysis (PCA) has been employed to reduce a large number of variables and redundant information into a more manageable set of variables while preserving the primary information dimensions available in the original data34.
Among previous AGB modeling efforts, Sheridan et al.27 and Li et al.15 built linear regression models, linear dummy variable models, and linear mixed-effects models. Previous studies have extensively explored different parametric and non-parametric algorithms to model forest AGB18,35. The RF algorithm has been frequently used in forest AGB modeling16,27,36,37; in addition to RF, Li et al.15 used gradient boosting and Sivasankar et al.38 used support vector machine (SVM) regression to predict forest AGB. Johnson et al.18 and Wolpert39 developed stacked ensemble models from previously developed prediction models. In addition to purely statistical models, Ma et al.40 estimated AGB using the ecosystem demography (ED) model. Other studies utilized additional image composites, percent tree cover, land cover proportions, and climate parameters to build tree-based algorithms41,42.
For large-area estimations, traditional forest inventory approaches are time and labor intensive, which limits their effectiveness for high-resolution AGB assessment. This limitation underscores the need for scalable, accurate, and cost-effective methods to estimate forest AGB across large and heterogeneous landscapes. Previous studies have leveraged remote sensing data and ML algorithms as potential solutions. Despite the growing use of ML algorithms such as random forest (RF) for AGB prediction, critical knowledge gaps remain16,43,44,45. Notably, the implications of Forest Inventory and Analysis (FIA) data policies, which restrict access to georeferenced field plots, are not well studied. Furthermore, the utility and structural optimization of RF models under limited training data conditions remain underexplored in the context of forest AGB estimation. These gaps limit the generalizability and robustness of RF-based approaches, particularly in data-sparse or policy-constrained regions. In this study, we hypothesize that optimizing the tree structure of random forest (RF) models through variable selection and hyperparameter tuning, combined with the integration of multimodal remote sensing (RS) data, can significantly improve the accuracy and robustness of aboveground biomass predictions. This approach is expected to be especially effective in scenarios with limited access to training data, and necessary for generating reliable AGB estimates across large, forested landscapes.
We utilized FIA data for both model training and validation, applying FIA-developed methods to evaluate the consistency between our biomass estimates and those generated by FIA. To achieve a spatially explicit representation of AGB across Connecticut, we employed a model-based approach that translated FIA’s discrete plot-level measurements into spatially continuous AGB predictions at a minimum of 15 m resolution. We addressed several common challenges associated with combining FIA and RS data for large-scale, fine-resolution biomass modeling and mapping. We employed the Random Forest (RF) algorithm, a method commonly utilized by researchers for estimating AGB due to its robustness in handling collinear data and spatial autocorrelation44,45. In this study we also investigated the effectiveness of using detailed hyperparameter tuning and variable importance analysis to optimize the RF model and to identify the most meaningful set of input variables16,43. We attempted to identify an effective method for integrating multi-modal remote sensing data and auxiliary spatial products with FIA subplot data. Our goal was to build robust models with improved accuracy of AGB estimates in Connecticut forests compared to previous efforts in the literature17,27. We developed a comprehensive framework to address the challenges associated with utilizing FIA data, emphasizing its critical value in large-scale biomass estimation. Finally, we evaluated the agreement of our spatially explicit AGB estimates, and of existing AGB map products18,46, with FIA-derived biomass estimates.
Data and methods
Study area
Our study area was the state of Connecticut (41.6° N, 72.7° W), located in the northeastern United States and spanning approximately 13,023 km²47. Connecticut has a humid continental climate with cold winters and warm, humid summers, and lies within the Northeastern coastal zone and lower New England ecoregion48. We focused on the forests of the state (Fig. 1), which are dominated by oak and hickory species. As defined by the Global Forest Resources Assessment49, a forest is land covering more than 0.5 ha, with trees exceeding 5 m in height and a canopy cover greater than 10%, or with trees that have the potential to naturally meet these criteria. In 2016, Connecticut’s forested land was estimated at 7,274.35 km² (1.8 million acres). Of this, 71% is privately owned, while 28.59% is owned by state and local governments. Federal ownership accounts for 0.46%, excluding developed and agricultural lands50.
The study area. Forests of Connecticut were extracted using the 2016 Connecticut 1 m land cover map. Publicly available Forest Inventory and Analysis (FIA) plot locations from the 2016 phase 01 FIA inventory are shown as dark purple points. The bottom right corner shows the FIA subplot design, which is identical for all FIA plot locations.
Modeling framework
The methodology (Fig. 2) involves identifying FIA plot locations belonging to the 2016 collection year, and creating raster layers from LiDAR, Sentinel-2 images, NAIP images, CT soil maps, and CT forest cover types51. The study consisted of 42 FIA true plot locations (168 subplots) and 67 explanatory variables derived for each subplot location corresponding to 2016 data collection. Each explanatory variable was meticulously selected by a thorough literature review to represent forest and vegetation characteristics (i.e. tree height, slope, forest heterogeneity, and tree health). Prior to model training, candidate remote sensing variables underwent a rigorous selection process to reduce dimensionality and identify the optimal subset of variables for RF models given the sample size.
FIA plot data
The FIA process integrates two active phases as outlined in the FIA user guide58. According to the 1998 legislation, a five-year data collection cycle is required for permanent phase two plots nationwide11,52,53. The FIA program measured and recorded 320 sample plots with some forested conditions (plot status code = 1) in Connecticut from 2011 to 2016. Approximately 14% of the 320 plots were measured annually, resulting in 42 plots for 2016. The FIA plot design consists of a systematic cluster of four subplots54,55, as shown in Fig. 3.
(a) Phase 02 and phase 03 Forest Inventory and Analysis (FIA) subplot design with assigned subplot numbers58. (b) LiDAR point cloud of a true Forest Inventory and Analysis plot location shown as a 3D local scene. (c) A raster layer of the 95th height percentile derived from the 2016 LiDAR point cloud.
We used the 2016 FIA data to align with Connecticut’s 2016 LiDAR mission. This choice kept the analysis efficient and allowed the project to be completed within its timeline, since the entire analysis was done remotely. Additionally, growth adjustments were not integrated, as we used single-year data to avoid introducing further uncertainties into the analysis. In 2016, the Northern Research Station Forest Inventory and Analysis program (NRS-FIA) utilized Rockwell Precision Lightweight GPS Receivers. The GPS readings have an average discrepancy of 8.0 m between the observed coordinates and the known reference points56. We utilized the true locations of 42 FIA plots, resulting in 168 FIA subplots (Fig. 3a)16,27. The public FIA data is perturbed via fuzzing and swapping to reduce plot coordinate accuracy and so maintain landowner privacy57. Fuzzing involves shifting the plot coordinates to within 1.0 mile of the actual location. In addition, up to 20% of private plot coordinates are swapped with similar plots within the same county to prevent linking data to specific landowners. Obtaining true FIA subplot locations was essential for this study because of the fine spatial resolution of the analysis and imagery used; however, this posed challenges due to FIA’s data consent policies. To work within these constraints, all R code was written and initially tested using the publicly available data through FIA’s Spatial Data Services. The final development of the R code and the model tuning were run remotely using FIA true plot locations via a collaboration with FIA scientists.
Aboveground biomass calculation
We used the FIA subplot-level AGB estimates calculated using the component ratio method (CRM) for 168 FIA subplots27. The CRM calculations were already recorded in the FIA database, following the methods of the Phase 2 FIA user guide (Version 9.0.1)58. The subplot locations allow for a better assessment of heterogeneity within the forested areas. This approach increased our ability to extract high-resolution RS data and to improve the model predictions12,16,57,58,59. The FIA data tables with tree measurement records were used to calculate the cubic foot volume of each tree stem in all 168 subplots57,58. Afterwards, tree-level AGB was totaled to get the AGB per subplot58. Allometric estimates of AGB were calculated using Eq. (1)9, with tops and limbs estimated as a proportion of the merchantable bole.
bm = Exp(β0 + β1 ln dbh)   (1)
Where;
bm = total aboveground biomass for trees 12 cm and larger16.
dbh = diameter at breast height (cm).
Exp = exponential function.
ln = log base e (2.718282).
β0, β1 = specific parameters for species group9,58.
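The definitions above imply a log-linear allometric form, bm = Exp(β0 + β1 ln dbh). The following is a minimal sketch of that calculation, not the FIA implementation: the function name is hypothetical, the coefficient values are placeholders chosen only for illustration (they are not taken from this study or from any species group table), and the output unit is assumed to be kg.

```python
import math


def allometric_agb(dbh_cm: float, beta0: float, beta1: float) -> float:
    """Tree-level biomass from bm = Exp(beta0 + beta1 * ln(dbh)).

    dbh_cm is the diameter at breast height in cm; the result is assumed
    to be in kg (an assumption, the text does not state the unit).
    """
    if dbh_cm <= 0:
        raise ValueError("dbh must be positive")
    return math.exp(beta0 + beta1 * math.log(dbh_cm))


# Placeholder coefficients for illustration only (NOT from the paper).
BETA0, BETA1 = -2.0, 2.4

if __name__ == "__main__":
    for dbh in (12.7, 25.0, 50.0):
        bm = allometric_agb(dbh, BETA0, BETA1)
        print(f"dbh = {dbh:5.1f} cm -> bm = {bm:9.1f} (assumed kg)")
```

Because the model is linear on the log scale, biomass grows as a power of dbh, so doubling the diameter multiplies the estimate by 2^β1.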
In our study, we only considered trees with diameter at breast height (dbh) ≥ 12.7 cm (≥ 5 in) for the AGB calculations18. Moreover, the AGB conversion factor60,61 was 892.17 lb ac−1 per 1 Mg ha−1. Although non-forest areas in CT are known to contain trees, in this study we assumed that non-forested areas contain 0 Mg ha−1 of AGB to align with the FIA definition of forestland16,18. The FIA-estimated AGB values were the response variable of our study. After excluding subplots in non-forested areas12,29, other spatial outliers generated from instrument errors and geolocation uncertainties of the field dataset were identified and removed by comparing discrepancies between observed and expected values16. We excluded locations with small field biomass estimates (< 50 Mg ha−1) but high LiDAR heights (> 30 m), as well as subplots with zero AGB but LiDAR heights greater than 10 m29. This reduced the training data set from 168 to 142 subplots.
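The unit conversion and the outlier screen described above can be sketched as follows. This is an illustrative reading of the stated rules, not the authors' code: the function names and the sample records are hypothetical, and which LiDAR height metric was used for the screen is an assumption (a single canopy height value per subplot is used here).

```python
LB_AC_PER_MG_HA = 892.17  # conversion factor quoted in the text


def lb_ac_to_mg_ha(agb_lb_ac: float) -> float:
    """Convert biomass density from lb per acre to Mg per hectare."""
    return agb_lb_ac / LB_AC_PER_MG_HA


def is_spatial_outlier(agb_mg_ha: float, lidar_height_m: float) -> bool:
    """Apply the two exclusion rules stated in the text.

    Rule 1: field AGB < 50 Mg/ha but LiDAR height > 30 m.
    Rule 2: field AGB of zero but LiDAR height > 10 m.
    """
    low_biomass_tall_canopy = agb_mg_ha < 50 and lidar_height_m > 30
    zero_biomass_with_canopy = agb_mg_ha == 0 and lidar_height_m > 10
    return low_biomass_tall_canopy or zero_biomass_with_canopy


if __name__ == "__main__":
    # Hypothetical subplot records: (id, AGB in lb/ac, LiDAR height in m).
    subplots = [("A", 89217.0, 22.0), ("B", 30000.0, 31.5), ("C", 0.0, 14.0)]
    for sid, agb_lb_ac, height in subplots:
        agb = lb_ac_to_mg_ha(agb_lb_ac)
        flag = "drop" if is_spatial_outlier(agb, height) else "keep"
        print(f"{sid}: {agb:7.2f} Mg/ha, height {height:4.1f} m -> {flag}")
```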
Remote sensing variables
LiDAR data
The state of Connecticut has a statewide airborne LiDAR data set, which was collected from March 11, 2016, through April 16, 2016, at USGS Quality Level 2. The point density is 2 points/m2 covering about 13,567 km2 (5240 miles2), producing 23,381 tiles. LiDAR data and derived 1 m resolution digital elevation models are referenced to the Connecticut State Plane NAD83 (2011) feet horizontal coordinate system and the NAVD88 feet vertical datum62. We derived LiDAR metrics63 using ArcGIS Pro 3.1.2 and FUSION/LDV processing software (version 4.40)27. We extracted the mean raster value of each LiDAR metric listed in Appendix 1 at the subplot level. The canopy height model (CHM), derived from LiDAR first returns, was one of the input variables and provided tree heights to the model13,32,64. We created rasters for LiDAR height percentiles, height bins, and height densities using the distribution of the Z coordinate of the point cloud. These elevation metrics were used to inform the model about the vertical structure and complexity of the forest vegetation17,20,24,27,65.
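To make the height percentile and height density metrics concrete, the sketch below computes them from a list of normalized point heights. It is only an approximation of the idea: FUSION applies its own conventions, and the function names, the 2 m canopy height break, and the sample heights are assumptions made for illustration.

```python
def percentile(values, p):
    """Percentile (p in 0..100) with linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def height_metrics(z_heights, height_break=2.0):
    """Summarize the vertical distribution of LiDAR returns in one cell.

    height_break (m) separates ground/understory returns from canopy
    returns; 2.0 is an assumed value, not one reported in the text.
    """
    canopy = [z for z in z_heights if z >= height_break]
    return {
        "p90": percentile(z_heights, 90),
        "p95": percentile(z_heights, 95),
        "max": max(z_heights),
        "canopy_density": len(canopy) / len(z_heights),
    }


if __name__ == "__main__":
    # Hypothetical normalized heights (m above ground) for one raster cell.
    z = [0.1, 0.3, 1.2, 4.8, 9.5, 14.2, 17.9, 18.4, 19.0, 21.3]
    for name, value in height_metrics(z).items():
        print(f"{name}: {value:.2f}")
```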
Remote sensing images and auxiliary geospatial data layers
We utilized the National Agriculture Imagery Program (NAIP) data acquired in 2016 (with leaf-on condition) at a nominal resolution of 0.6 m24,31. We built the second order GLCM textural rasters of NAIP green and near-infrared (NIR) bands24,31,66, VIs24,64,67 and first four PCA rasters based on remaining NAIP bands31,34 (Appendix 2). We employed Sentinel-2 satellite images from spring 2016, coinciding with the LiDAR data collection period. The considered Sentinel-2 images only consisted of ≤ 10% cloud coverage and were preprocessed68,69. The VIs and spectral responses of image bands were derived from bottom of atmosphere reflectance (BOA) within selected Sentinel-2 images (Appendix 3)13. We derived a series of VIs including normalized difference vegetation index (NDVI)24,64,67, normalized difference red edge (NDRE)70,71, and normalized difference moisture index (NDMI)71 to represent the vegetation health, greenness and moisture content. The existing National Oceanic and Atmospheric Administration (NOAA) Coastal Change Analysis Program 1 m land cover map and Connecticut soil type shape file were also employed to extract categorical variables72.
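The three vegetation indices named above share the same normalized-difference form. A minimal sketch using the standard definitions follows; the reflectance values are made up for illustration, and the mapping of these inputs to specific Sentinel-2 bands is left to the reader since the band choices are not listed in this section.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or NaN when the denominator is zero."""
    denominator = a + b
    if denominator == 0:
        return float("nan")
    return (a - b) / denominator


def ndvi(nir: float, red: float) -> float:
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def ndre(nir: float, red_edge: float) -> float:
    """Normalized difference red edge: (NIR - RedEdge) / (NIR + RedEdge)."""
    return normalized_difference(nir, red_edge)


def ndmi(nir: float, swir: float) -> float:
    """Normalized difference moisture index: (NIR - SWIR) / (NIR + SWIR)."""
    return normalized_difference(nir, swir)


if __name__ == "__main__":
    # Hypothetical bottom-of-atmosphere reflectances for one forest pixel.
    pixel = {"nir": 0.40, "red": 0.05, "red_edge": 0.20, "swir": 0.18}
    print(f"NDVI = {ndvi(pixel['nir'], pixel['red']):.3f}")
    print(f"NDRE = {ndre(pixel['nir'], pixel['red_edge']):.3f}")
    print(f"NDMI = {ndmi(pixel['nir'], pixel['swir']):.3f}")
```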
Grouping remote sensing variables
We grouped the explanatory variables based on their original RS data source32. We identified three distinct groups: (i) Group-1: all 67 metrics combined (LiDAR, NAIP and Sentinel-2), (ii) Group-2: LiDAR metrics, and (iii) Group-3: image metrics (Appendix 1–3). This categorization enabled us to gauge the contribution of different RS sources to AGB modeling24,32. Following the approach suggested by Erdody and Moskal73, we tested nine RF model configurations to systematically evaluate the effects of different variable combinations, aiming to balance model efficiency and accuracy during tuning and training. This process also helped us identify the most important variables for training and reduce model complexity.
Random forest algorithm
Although the Random Forest (RF) algorithm is widely used in AGB research, its performance with limited training datasets remains underexplored19,20,21. Due to the FIA data non-consent policy, we conducted our analysis off-site using the computing infrastructure of the USDA Northern Research Station, without direct access to precise plot coordinates. Our innovative approach included the following treatments: (i) grouping of remote sensing (RS) variables based on RS data source32, (ii) feature selection based on cross-validated R² and IncNodePurity metrics to reduce multicollinearity, dimensionality and overfitting risk32, and (iii) meticulous hyperparameter tuning to optimize model performance13,16,32,43. This integrated strategy allowed us to effectively train the RF model despite limited access to ground-truth data.
Variable reduction and selection
The RS data extraction produced 67 explanatory variables as inputs for the RF models, with FIA-AGB estimates serving as the response variable. We identified three explanatory variable groups based on RS data source: (i) Group-1: all 67 metrics combined, (ii) Group-2: LiDAR metrics, and (iii) Group-3: image metrics24,32. For each group, we used the increase in node purity (IncNodePurity)74 and the coefficient of determination (R²) from 5-fold cross-validation (CV) to identify the most important set of variables for all the RF models in a forward selection process43,75. We trained a series of RF models using repeated k-fold cross-validation, adding one variable at a time from the most important to the least important variable based on IncNodePurity13,32,43. As the last step, partial dependency curves were plotted to illustrate the contribution of the most important variables to AGB prediction76.
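The forward selection described above (rank variables by importance, add them one at a time, keep the prefix with the highest cross-validated R²) can be sketched generically. The function below is a hypothetical illustration: it takes any scoring callable in place of the actual RF cross-validation, and the toy scorer and variable names are invented purely so the example runs without a fitted model.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    ranked_vars: Sequence[str],
    cv_score: Callable[[List[str]], float],
) -> Tuple[List[str], float, List[Tuple[int, float]]]:
    """Evaluate growing prefixes of an importance-ranked variable list.

    ranked_vars must be ordered from most to least important.
    cv_score returns a cross-validated score (e.g. mean R2) for a subset.
    Returns the best prefix, its score, and the (size, score) history.
    """
    best_subset: List[str] = []
    best_score = float("-inf")
    history: List[Tuple[int, float]] = []
    for k in range(1, len(ranked_vars) + 1):
        subset = list(ranked_vars[:k])
        score = cv_score(subset)
        history.append((k, score))
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score, history


def toy_cv_score(subset: List[str]) -> float:
    """Stand-in scorer (assumption): informative variables add to the
    score, uninformative ones subtract from it."""
    informative = {"p95", "p90", "swir"}
    gain = sum(0.10 for v in subset if v in informative)
    penalty = sum(0.05 for v in subset if v not in informative)
    return gain - penalty


if __name__ == "__main__":
    ranked = ["p95", "p90", "swir", "noise_1", "noise_2"]
    subset, score, history = forward_select(ranked, toy_cv_score)
    print("selected:", subset)
    print("history:", [(k, round(s, 2)) for k, s in history])
```

In practice `cv_score` would wrap a k-fold cross-validation of the RF model restricted to the given columns; that wrapper is deliberately left out here.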
Hyperparameter tuning and random forest model training
The RF algorithm is known for making robust predictions even if the predictor variables are highly correlated77. Further, we incorporated bootstrap sampling to reduce the effect of spatial autocorrelation among the subplots. The tree-based structure of the RF algorithm utilizes a different training dataset to build each decision tree, minimizing spatial dependency among subplots78. We implemented the RF algorithm using the randomForest package in R (version 4.3.1)36,79. The identified groups, (i) Group-1: all 67 metrics combined, (ii) Group-2: LiDAR metrics, and (iii) Group-3: image metrics24,32, separately went through a careful hyperparameter tuning process, resulting in nine models.
We employed an 80:20 split strategy to divide the data from the 142 subplots into training and testing subsets, ensuring a robust training process and providing a separate testing dataset for all three variable groups18. Given the relatively small size of the dataset (142), hyperparameter tuning (Table 1) was designed to improve performance without overfitting32. We initially trained a default RF model for each group. Secondly, we optimized two key parameters, Mtry (the number of explanatory variables randomly sampled as candidates at each split) and Ntree (the number of regression trees), to assess whether an exhaustive hyperparameter tuning was required43. Finally, a complete hyperparameter optimization was done using the tuneRF function of the randomForest package in the R environment32,43. The final hyperparameter optimization employed the expand.grid function in the R environment, providing a sequence of values for each hyperparameter of the parameter space (Appendix 5). The grid search explored this space exhaustively, with a stopping metric equal to 20 to streamline the process. This procedure resulted in nine individual RF models, three for each variable group32. We used 5-fold CV for tuning and variable selection, which avoided the need for a separate validation set43 (Table 1).
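The data partitioning and grid construction described above can be illustrated with the standard library alone. This sketch is an analogue of the workflow, not the authors' R code: `expand_grid` mimics what R's expand.grid does (all combinations of the supplied values), and the hyperparameter values, the random seed, and the function names are assumptions, since Appendix 5 is not reproduced here.

```python
import itertools
import random


def train_test_split_ids(ids, test_fraction=0.2, seed=42):
    """Shuffle ids reproducibly and hold out a fraction for testing."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n, k=5):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def expand_grid(**params):
    """All combinations of the given hyperparameter values, as dicts."""
    keys = list(params)
    return [dict(zip(keys, combo)) for combo in itertools.product(*params.values())]


if __name__ == "__main__":
    train_ids, test_ids = train_test_split_ids(range(142))
    # Hypothetical search space; the values actually used are in Appendix 5.
    grid = expand_grid(mtry=[2, 4, 8], ntree=[250, 500, 1000], nodesize=[3, 5])
    print(len(train_ids), len(test_ids), len(grid))  # 114 28 18
    for fold, (train, val) in enumerate(k_fold_indices(len(train_ids)), 1):
        print(f"fold {fold}: {len(train)} train / {len(val)} validation")
```

With 142 subplots, a 20% hold-out gives 28 test subplots, and 5-fold CV on the remaining 114 uses folds of 22 or 23 subplots, which shows how little data each tuning step sees.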
We used the held-out 20% of the data for RF model testing; these data were never used in the model training or validation process. The testing evaluation metrics, such as root mean square error (RMSE), mean square error of the out-of-bag errors (MSE), coefficient of determination (R2), and percentage variance explained, were utilized to compare model performance17,27,32. This approach supports a robust evaluation of model generalization to unseen data and provides a basis for comparative analysis43.
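For reference, the main test metrics can be written out as below. This is a sketch using the common definitions; whether the study computed R2 as 1 − SSres/SStot or as a squared correlation is not stated, so the first form is an assumption. Under that form R2 can be negative when a model predicts worse than the mean of the observations. The sample values are invented.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between paired observations and predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical AGB values (Mg/ha) for a handful of test subplots.
    obs = [85.0, 120.0, 60.0, 150.0, 95.0]
    pred = [90.0, 105.0, 75.0, 130.0, 100.0]
    print(f"RMSE = {rmse(obs, pred):.2f}")
    print(f"MAE  = {mae(obs, pred):.2f}")
    print(f"R2   = {r_squared(obs, pred):.2f}")
```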
AGB comparison with existing map products
Finally, we compared the prediction performance of our RF model with three existing regional and landscape-level AGB map products: (i) a 90 m resolution map by Ma et al.40, which employed the Ecosystem Demography (ED) model (Map_1), (ii) a 250 m resolution map developed by Blackard et al.41 using tree-based algorithms (Map_2), and (iii) a regional map created by Menlove and Healey42 (Map_3). Our plot-pixel assessment was based on absolute differences (i.e. mean, range) and similarity metrics (RMSE, R2)35 to understand the possible improvements that our RF model offers at the FIA subplot level. We compared all four sets of predicted AGB values with the growth adjusted FIA-CRM biomass values of the 28 test subplots, with resampling where necessary (Appendix 4).
Results
Variable importance
This study considered IncNodePurity value during the variable importance analysis for all three variable groups. We identified the most important set of variables to build the RF models followed by the ranking of variables based on IncNodePurity and forward selection of mean R2 of 5-fold CV (Fig. 4). The most important variable of both Group-1 (Combined) and Group-2 (LiDAR_metrics) was the 95th percentile derived from LiDAR point clouds (Fig. 4a,b). From the cohort of image metrics in Group-3, the short-wave infrared (SWIR) band of Sentinel-2 leaf-off images was identified as the most important predictor variable for building RF models (Fig. 4c). Out of all the explanatory variables (Group-1), 68% of the most important variables were LiDAR height-related metrics. The SWIR (1610 nm) band of Sentinel-2 and correlation texture rasters from the NAIP green band were the only image-derived metrics in the top 10 explanatory variables of Group-1 RF models.
Finding the best subset of the predictor variables is crucial to reduce dimensionality and overfitting of RF algorithm80. Group-1 reported the maximum cross validation R2 value of 0.3364 for the first 28 explanatory variables. Group-2 and Group-3 yielded the maximum cross validation R2 of 0.2265 and 0.2039, respectively using their top 16 variables. At this stage, we were able to reduce overfitting and ultimately reduce the complexity of variables.
Mean R2 value for 5-fold cross-validation (a) all 67 variables, (b) Light Detection and Ranging metrics only (c) Image metrics only. In all three plots, X-axis represents the ranking of variables according to the IncNodePurity value, from left to right decreasing the importance. Main blue line shows the mean R2 value of adding one variable at a time, from the most important to the least important explanatory variable. The Y-axis represents the changes of mean R2 with 5-fold cross validation (CV) with error bars. The red highlighted points illustrate the highest CV R2 value obtained.
Random forest model comparison
A careful comparison of hyperparameter tuning approaches was carried out for all three groups, as listed in Table 1. All nine models were cross-validated and tested using a 20% hold-out dataset. Model test results are shown in Table 2. We identified the grid-tuned (grid search with 5-fold CV) RF models as the highest performing models through a comparative analysis of R2, RMSE and percentage variance explained for each variable group. The Group-1 grid-tuned model (All_RF03) reported an RMSE of 27.19 Mg ha−1, an R2 of 0.41, and 40.54% variance explained. The Group-2 grid-tuned model (LiDAR_RF03) reported an RMSE of 27.88 Mg ha−1, an R2 of 0.23, and 23.49% variance explained. Lastly, the Group-3 grid-tuned model (Img_RF03) showed an RMSE of 26.50 Mg ha−1, an R2 of 0.31, and 30.85% variance explained. We observed the smallest mean absolute error (MAE) and confidence interval (CI) for the All_RF03 model.
For Group-1, the grid tuning process improved the R2 from 0.19 to 0.40 while the RMSE decreased by 23.71%. Similarly, for Group-2 the R2 increased from 0.01 to 0.23 with an RMSE reduction of 12.16%. With grid tuning, Group-3 also exhibited an increase in R2 from −0.01 to 0.31 while the RMSE was reduced by 18.06%. Increasing only the number of trees (Ntree) in the RF model does not necessarily improve performance; rather, it increases the computational burden43. A comparison between the FIA-based AGB estimates and the RF model estimates for the different hyperparameter and variable spaces is illustrated in Fig. 5.
Performance of random forest (RF) models for all three variable groups. The X-axis of the plots represents the aboveground biomass (AGB) values predicted from each RF model while the Y-axis represents the Forest Inventory and Analysis AGB estimates named as the Actual AGB in the plots. (a) RF Model predictions for all LiDAR and image metrics, (b) RF Model predictions for LiDAR metrics, and (c) RF model predictions for image metrics.
Based on the evaluation metrics in Table 2 and the residual plots in Fig. 6, the tuned RF models of each group (All_RF03, LiDAR_RF03, and Img_RF03) had the lowest RMSE and the highest R2 values compared to their default RF counterparts. We selected the All_RF03 model as the optimal RF model after comparing its evaluation metrics with those of the next best model, Img_RF03. Relative to All_RF03, the Img_RF03 model showed a 24.39% lower R2 despite a 2.54% lower RMSE (Table 2). Residuals were examined for the best model chosen from each group. Figure 6 shows that all the RF models give smaller residuals for AGB values below 100 Mg ha−1, indicating that the tuned models agree poorly with the field estimates in areas with higher AGB. Figure 6a shows the smallest dispersion of residuals about the zero line, plotted against the FIA-AGB estimates on the X-axis. Based on the evaluation metrics and the residual analysis, the All_RF03 model was selected as the optimal RF model for estimating forest aboveground tree biomass within the study area.
Residual vs. actual aboveground biomass (AGB) plots for the best random forest (RF) models of each group of variables. (a) Residual plot for the All_RF03 model for all LiDAR and image metrics (Group-1), (b) Residual plot for the LiDAR_RF03 model for LiDAR metrics (Group-2), and (c) Residual plot for Img_RF03 model for image metrics (Group-3). The field estimated biomass was considered as the actual biomass of the residual plots.
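The relative differences quoted for Img_RF03 versus All_RF03 follow directly from the reported test metrics (R2 of 0.41 versus 0.31, RMSE of 27.19 versus 26.50 Mg ha−1). The short check below reproduces that arithmetic; the helper name is hypothetical.

```python
def percent_change(reference: float, value: float) -> float:
    """Relative change of value with respect to reference, in percent."""
    return (value - reference) / reference * 100.0


if __name__ == "__main__":
    # Reported test metrics: All_RF03 is the reference, Img_RF03 the comparison.
    r2_change = percent_change(0.41, 0.31)
    rmse_change = percent_change(27.19, 26.50)
    print(f"R2 change:   {r2_change:.2f}%")    # -24.39%
    print(f"RMSE change: {rmse_change:.2f}%")  # -2.54%
```

The two models are therefore nearly tied on RMSE, and the choice of All_RF03 rests mainly on its higher R2 and tighter residuals.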
Partial dependency curves
The partial dependency curves (PDC) of the All_RF03 model depict the average relationship between each individual predictor variable and the AGB prediction of the optimal RF model. In this study, the top four explanatory variables from the optimum RF model were documented, considering interactions and non-linearities captured by the ensemble of RF decision trees. In Fig. 7, PDCs display the explanatory variables on the X-axis, while the Y-axis represents the mean RF predictions of AGB. According to Fig. 7a–c, the mean AGB response increases steadily, yet at different rates, as the 95th percentile (P95th), 90th percentile (P90th), and Sentinel-2 SWIR band (sen_swir) increase, until it flattens out. This suggests that higher values of these variables increase the predicted AGB up to a certain point, after which the response stabilizes. However, the GLCM correlation from the NAIP green band (green_layer8) exhibited a negative slope, indicating that the predicted AGB declines steeply with higher values of this correlation texture (Fig. 7d).
Partial dependency curves (PDC) for the top four variables of the optimal RF model (All_RF03). The X-axis shows the values of each explanatory variable while the Y-axis represents the mean predicted aboveground biomass, with the other variables held constant. (a) Partial dependency curve for the p95th LiDAR percentile, (b) Partial dependency curve for the p90th LiDAR percentile, (c) Partial dependency curve for the short-wave infrared band of Sentinel-2 images, and (d) Partial dependency curve for the correlation raster of the NAIP green band.
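To make the construction of a partial dependency curve concrete, the following Python sketch computes one by brute force: a single feature is forced to each grid value for every sample, and the model predictions are averaged. The `toy_model` function is a hypothetical stand-in for a fitted RF model, and all values are invented for illustration.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)             # copy so the input rows stay intact
            modified[feature_index] = value  # overwrite the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(row):
    """Hypothetical model: response rises with height, saturates at 30 m,
    and falls with a texture value (not the fitted model of this study)."""
    height, texture = row
    return 2.0 * min(height, 30.0) - 10.0 * texture

# Invented samples: [height percentile (m), texture correlation]
rows = [[10.0, 0.2], [20.0, 0.5], [35.0, 0.8]]
grid = [10.0, 20.0, 30.0, 40.0]

curve = partial_dependence(toy_model, rows, 0, grid)
for value, mean_prediction in zip(grid, curve):
    print(f"height = {value:5.1f} -> mean prediction = {mean_prediction:.2f}")
```

In this toy case the curve rises and then flattens once the height reaches the saturation point built into the stand-in model, which is the same qualitative shape described for Fig. 7a–c.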
Comparison with available map products
This section compares the performance of our RF model and three existing AGB map products against FIA-CRM estimates. The All_RF03 model achieved the lowest RMSE (28.33 Mgha−1) and the highest R2 (0.41), demonstrating better predictive accuracy than the existing regional and national maps (Table 3). Map_01 (regional 90 m) had the highest RMSE (145.35 Mgha−1) and the lowest R2 (0.01), while Map_02 (regional 250 m) and Map_03 (national scale) performed slightly better but still exhibited higher RMSE values (82.47 Mgha−1 and 81.84 Mgha−1, respectively) than our optimal RF model. This suggests that the optimal RF model substantially outperforms coarse-resolution AGB maps for plot-level biomass estimation in Connecticut.
Plot-pixel comparison of the optimal random forest (RF) model and existing map products. Scatter plot of the Forest Inventory and Analysis (FIA) aboveground biomass (AGB) estimates (X-axis) against the AGB values predicted by the RF model and the three available map products (Y-axis). The map products considered are Map_1 (ref. 40), Map_2 (ref. 41), and Map_3 (ref. 42). RF denotes the AGB prediction of our optimum RF model. The grey shaded areas represent the 95% confidence interval (CI) for each distribution, and the dashed grey line is the 1:1 reference line.
According to Fig. 8, the wider confidence intervals for the available maps, particularly at higher AGB values, further highlight the advantages of the RF approach for finer-scale forest biomass predictions. The RF model (in red) shows the closest alignment with the 1:1 reference line, indicating better agreement with FIA estimates across a wide range of biomass values. In contrast, Map_1 consistently overestimates AGB at the FIA subplot level. Map_2 and Map_3 show more moderate overestimation but still deviate further from the 1:1 line than the RF model. The RF model also had a narrower uncertainty range at low and moderate AGB values. These results suggest that our RF model outperforms the existing products in estimating plot-level AGB under restricted data conditions.
Discussion
The study aimed to evaluate the predictive capability of remotely sensed variables derived from a combination of multi-modal sources (LiDAR, Sentinel-2, and NAIP) and FIA data for enhancing state-wide forest aboveground biomass (AGB) estimation at fine spatial scales via the random forest (RF) algorithm. Initially, we combined publicly available RS data with single-year FIA data. This choice avoids introducing growth increment-related uncertainties into the model predictions18, yet it resulted in a small initial sample size. We used the 2016 FIA data to align with Connecticut’s 2016 LiDAR mission, optimizing the project timeline and addressing the challenges associated with using FIA true plot data. Accurate estimation of forest AGB is essential for a range of ecological research areas, including carbon cycling, forest management, and forest dynamics81. The RF algorithm is widely used in forestry analysis because it can cope with multicollinearity and spatial autocorrelation82,83. We studied the importance of variable selection and hyperparameter optimization of the RF algorithm in estimating AGB. The AGB values (response variable) in this study were estimated using the FIA-CRM technique. As of September 2023, FIA has transitioned to the National Scale Volume and Biomass Estimators (NSVB) for more consistent and accurate tree structure accounting84.
Hyperparameter tuning led to a substantial improvement in R2 across all three groups, particularly for the optimal RF model, whose R2 increased by 115.79% compared with its default counterpart. Negative R2 values indicate that the model fit is poor, that is, worse than simply predicting the mean AGB. Johnson et al.16 provide a comprehensive framework for using the RF algorithm, achieving an R2 of 0.49 with an RMSE of 91.5 Mgha−1. Our study achieved an R2 of 0.41, demonstrating the value of meticulous hyperparameter tuning and variable optimization even when the sample size is small (N = 142). Torre-Tojal et al.43 provide recommendations to enhance the accuracy of AGB estimates: hyperparameter tuning for small datasets (N = 55), achieving high geolocation accuracy, and using area-specific allometric equations.
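For reference, the standard definitions of the two evaluation metrics are sketched below. The notation is introduced here for illustration and is an assumption: y_i is the FIA-estimated AGB of subplot i, \hat{y}_i the corresponding RF prediction, \bar{y} the mean of the FIA estimates, and N the number of evaluated subplots. The second line shows when R2 becomes negative under this definition.

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}
\]
\[
R^2 < 0 \iff \sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 > \sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2
\]
```

Under this definition, a negative R2 means the squared prediction errors exceed the squared deviations of the observations from their own mean.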
Most of our estimates at the upper end of the AGB distribution fall outside the 95% confidence interval, indicating underestimation relative to the FIA-estimated AGB at the upper levels of the reference distribution (Fig. 5). This behavior may result from the smoothing effect of the RF algorithm85 on predictions of extreme values. We had only a few field AGB samples exceeding 100 Mgha−1 in both the training and testing datasets, which made data augmentation or a split analysis infeasible86. Synthetic data generation and multi-year data acquisition were challenging due to FIA plot data restrictions87. To avoid bias, we did not apply imbalance-handling techniques such as oversampling.
In this study, LiDAR metrics were used to quantify forest structural attributes, including tree height percentiles and slope, which are critical indicators of forest vertical complexity60. In contrast, the RS image metrics, obtained from sources such as Sentinel-2 and NAIP, provided information on vegetation heterogeneity and overall vegetation health31. Combining image data with LiDAR data increased the overall performance of the model by 24% based on R2, because the combined model includes information on both forest vertical structure and heterogeneity. These complementary datasets gave us a more comprehensive understanding of forest dynamics and contributed to the improved predictive accuracy of AGB estimation at the subplot level.
Our study also focused on reducing dimensionality by identifying the best variable subset for model tuning43. We used IncNodePurity as the selection criterion. A variable with a higher IncNodePurity contributes more to the overall performance of the RF model because splits on it improve the homogeneity at each node74. Of the 28 selected variables from the Group-1 RF models, 68% were LiDAR-derived, while 18% and 14% were derived from NAIP and Sentinel-2, respectively. The prominence of upper LiDAR height metrics, such as P95th or P99th, in RF models for biomass estimation can be attributed to their strong correlation with canopy structure and tree height. Besides diameter (DBH), tree height is one of the most crucial parameters for calculating AGB88,89. According to Wang et al.90, the relationship between tree height and AGB has an R2 of 0.77. Riggins et al.91 achieved an R2 of 0.72 using only LiDAR height percentiles with highly accurate field AGB values, but they noted that it is extremely difficult to achieve such accuracy in complex forested environments. As a solution, most recent studies use an array of LiDAR height metrics (percentiles, densities, and height bins) to train models17,27,32. We observed a strong influence of LiDAR height metrics in our RF model, as 68% of the selected predictors were LiDAR-derived. LiDAR height percentiles provide a statistical summary of the canopy height distribution within the plot. These metrics reflect important structural characteristics such as the central tendency (mean and median), variability, and asymmetry of tree heights92,93. The selected LiDAR percentiles, from the 95th to the 60th, represent the upper canopy structure, with the 95th percentile indicating the tallest trees, which correlate strongly with mean plot-level AGB90.
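The IncNodePurity criterion used for the variable ranking above can be illustrated with a single split. For regression, the randomForest package in R reports IncNodePurity as the total decrease in node impurity, measured by the residual sum of squares, from splitting on a variable, averaged over all trees. The Python sketch below computes that decrease for one hypothetical split only; the function names and the toy values are assumptions made for illustration.

```python
def rss(values):
    """Residual sum of squares around the node mean (node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def purity_gain(x, y, threshold):
    """Decrease in RSS when a node is split at x <= threshold."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    return rss(y) - (rss(left) + rss(right))

# Toy node: a LiDAR height percentile (m) and AGB (Mg/ha), invented values.
p95_height = [8.0, 12.0, 18.0, 25.0]
agb = [10.0, 20.0, 60.0, 70.0]

gain = purity_gain(p95_height, agb, threshold=15.0)
print(f"RSS decrease from splitting at 15 m: {gain:.1f}")
```

Summing such decreases over every node where a variable is used, and averaging over trees, yields the per-variable score; a variable that separates low-AGB from high-AGB subplots cleanly accumulates a large value.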
The height bins provide a normalized measure of the number of points27, whereas the density metrics represent the cumulative frequency of LiDAR returns, including all points at or above each height threshold. Together they represent the canopy closure, density, and layering of the plots94. Generally, taller and more mature trees have accumulated more biomass88. This explains the strong contribution of LiDAR metrics to the predictive accuracy of the RF model.
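The three families of LiDAR metrics discussed above (height percentiles, height bins, and density metrics) can be sketched from a list of return heights. This is a minimal illustration under stated assumptions: the percentile uses linear interpolation, the bin and threshold values are arbitrary, and the heights are invented. It is not the processing chain used in this study.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0-100) of a list of heights."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

def density_above(values, threshold):
    """Proportion of returns at or above a height threshold (cumulative)."""
    return sum(1 for v in values if v >= threshold) / len(values)

def height_bin_fraction(values, low, high):
    """Proportion of returns with low <= height < high (normalized bin count)."""
    return sum(1 for v in values if low <= v < high) / len(values)

# Invented return heights (m) above ground for one subplot.
heights = [0.5, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

p95 = percentile(heights, 95)
p50 = percentile(heights, 50)
dens_10 = density_above(heights, 10.0)
bin_5_10 = height_bin_fraction(heights, 5.0, 10.0)
print(f"P95 = {p95:.2f} m, P50 = {p50:.2f} m")
print(f"density >= 10 m: {dens_10:.3f}, bin 5-10 m: {bin_5_10:.3f}")
```

The density metric is cumulative (everything at or above the threshold), while the bin fraction counts only the returns inside one height interval, which is the distinction drawn in the paragraph above.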
Tree aboveground biomass consists mainly of organic matter and water stored in aboveground components. The Sentinel-2 SWIR band is sensitive to vegetation water content95 and to structural organic matter in trees, such as nitrogen, lignin, and cellulose96. Dang et al.97 reported that SWIR was the best predictor variable for AGB, with an R2 of 0.81. In contrast, traditional vegetation indices such as NDVI primarily reflect greenness and chlorophyll concentration and often saturate in closed-canopy or mature forest conditions2,98. Partial dependency curves (PDCs) illustrate the relationships between predictor variables and the predicted outcome while holding all other variables constant. PDCs provide valuable insights into how RF predictions change as a function of each predictor variable, allowing us to identify the importance and influence of each variable in RF regression models99. Soil-related variables and land cover classes were also examined in our analysis, but their importance was lower than that of the LiDAR height and image metrics.
Our study systematically addressed and mitigated uncertainties associated with machine learning model selection, training, and the geolocation error of FIA field measurements. In the publicly available FIA data, up to 20% of private FIA plot coordinates are swapped with those of a similar private plot within the same county, and a small subset of coordinates is fuzzed by up to 1 mile58. Even when using the true FIA subplot data, the average GPS location error of the centroid is about 8 m. Therefore, removing non-forested areas and screening for obvious spatial outliers by comparing LiDAR heights with FIA-AGB estimates were necessary to maintain the consistency and accuracy of AGB predictions29.
Trees can be found in both forested and non-forested land use areas in Connecticut17. The AGB of trees outside forests (in non-forested areas) may not be estimated accurately when the model is applied, because those areas were not represented in model training or in the selection of the allometric equations63. Therefore, our RF models are only suitable for predicting the AGB of forested areas. Adding trees outside forests could significantly improve the estimates of total AGB in Connecticut17. Adjusting tree increment factors is a useful way to increase the number of FIA subplot samples. However, this study used only the FIA data collected within 2016, because incorporating additional years would have exceeded the expected project timeline under these circumstances43.
Most previous research attained R2 values in the range of 0.30–0.60, even with fewer restrictions and more flexible access to FIA data19,43. Our research achieved a moderate R2 value of 0.41, which is within the range of previous research. The main reason for the moderate R2 is the restrictions on using actual plot locations, which substantially reduced the size of the training data. RF models typically perform better with larger datasets. However, we considered only single-year data since our code was run on off-site USDA-FIA computers. Also, the true FIA plot locations potentially have a geolocation error of 8 m (± 2 m), which adds an unknown error to the predictions and noise to the RS variables56. The research could be improved by adding more subplots and using growth increment factors, which would allow the entire data cycle to be investigated and improve generalization. Torre-Tojal et al.43 and Luo et al.34 conducted biomass estimations in various forest age classes using airborne LiDAR data with a significantly higher pulse density (4.1 pulses/m2) than that used in our study. They achieved better results with full-waveform metrics (R2 between 0.81 and 0.84) than with discrete-return metrics (R2 of 0.8), pointing to potential improvements for our RF models. The scale of analysis could also be investigated more thoroughly by considering FIA subplot, plot, or hectare plot scales. However, the size of the dataset and the restrictions on FIA plot coordinates limited our ability to use these techniques in our analysis27.
In this study, careful consideration was given to the RF tuning grids since both the sample size and the test data size were small. Additional analysis of spatial autocorrelation (Moran’s I) could clarify the specific effects of subplot-level autocorrelation beyond the assumptions and treatments used in our RF model structure. Integrating ecology-based biomass modeling concepts could enhance RF model accuracy by accounting for variations in site quality, tree age, or available nutrient content. The current study compares its results with a 90 m AGB map40, trained on LiDAR, NAIP, NLCD, and FIA data using ecological modeling, a 250 m AGB map by Blackard et al.41, and the map of Menlove and Healey42. Wider grey shaded CI areas indicate higher uncertainty, while narrower ones suggest more precise estimates (Fig. 8). Furthermore, our study was unable to provide an in-depth analysis of which forest types the model predicts best, which prevented us from using a representative stratified random sampling technique across different forest types. The forest type information incorporated from the CT 1 m landcover (Cover_type) was ranked among the least important variables based on IncNodePurity, likely for one of two reasons: (i) the dataset lacked sufficient representation to differentiate forest types (Fig. 4a), or (ii) the existing set of explanatory variables was already sufficient to achieve the current accuracy without an additional contribution from forest type information. Future studies could consider methods such as destructive sampling, integrating non-forested tree data, or establishing FIA-like plots outside the federal data system to enable more spatially robust validation strategies and increase ecological representation in training and validation32,43.
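Moran’s I, mentioned above as a possible follow-up analysis, can be computed directly from a set of values (for example, model residuals at subplots) and a spatial weights matrix. The Python sketch below is a minimal illustration: the binary neighbor weights and the toy values are assumptions, and in practice the weights would be built from subplot distances or adjacency.

```python
def morans_i(values, weights):
    """Global Moran's I for values with a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations for every weighted pair of locations
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * num / den

# Four locations along a line; each is a neighbor of the adjacent ones.
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [1.0, 2.0, 3.0, 4.0]    # similar values sit next to each other
alternating = [1.0, 4.0, 1.0, 4.0]  # neighbors are dissimilar

print(f"clustered:   I = {morans_i(clustered, weights):.3f}")
print(f"alternating: I = {morans_i(alternating, weights):.3f}")
```

Positive values indicate that neighboring locations carry similar values, negative values indicate that neighbors are dissimilar, and values near the expectation of −1/(N − 1) indicate no spatial pattern.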
We compared our AGB estimates with three existing biomass map products (Appendix 4) to evaluate the potential benefits of fine-resolution models. However, differences in spatial resolution and data sources can significantly influence the comparison metrics (Table 3). For instance, our RF model achieved the lowest RMSE (28.33 Mgha−1) and highest R2 (0.41), while Map_1 showed the highest RMSE (145.35 Mgha−1) and lowest R2 (0.01), likely due to scale mismatches at broader spatial extents. Additionally, the mean and variance suggest that these coarser-resolution maps tend to overestimate biomass and show higher variability (Map_1: variance = 5170.04 Mgha−1), while our RF model produced estimates more closely aligned with the growth-adjusted FIA-CRM estimates (mean = 43.92 Mgha−1; variance = 277.78 Mgha−1). These results highlight how spatial resolution and model selection influence the accuracy of AGB predictions, supporting the value of high-resolution AGB modeling for capturing local characteristics.
Finally, we acknowledge that ML models are fundamentally data driven. Our study addressed this challenge through extensive hyperparameter tuning, variable grouping, and variable selection. However, recent strategies such as data augmentation, ensemble modeling, or theory-guided constraints100,101 may further enhance predictive accuracy in future research. In addition, transformer-based foundation models such as Prithvi-WxC102 and SatCLIP103 show promise for future AGB estimation. Our research should help future researchers identify areas for development using improved ML and DL models, not only to evaluate spatial context but also as a guide for building a protocol when data access is limited.
Conclusion
The study used the RF algorithm to estimate forest AGB in Connecticut, showcasing the algorithm’s robustness even with a limited training dataset. Grid search for hyperparameter optimization combined with cross-validation produced promising results, with an RMSE of 27.19 Mgha−1 and an R2 of 0.41. Combining RS image data with LiDAR metrics improved R2 by 34.78% compared with using LiDAR exclusively. Of the most important variables, 68% are LiDAR height-related. Integrating LiDAR data with RS image data and hyperparameter tuning significantly enhanced model performance. The optimum RF model shows high accuracy within the interquartile range of field CRM-AGB. Future research could focus on expanding testing, refining plot locations, addressing spatial autocorrelation, and leveraging high-density LiDAR/RS data for improved performance in biomass mapping.
Data availability
All publicly accessible data supporting the findings of this study are provided within the manuscript and its supplementary information files (Appendices 1–5). However, due to the Forest Inventory and Analysis (FIA) landowners’ data protection policy, the training data cannot be shared herewith. Interested users may contact the authors directly via the email addresses provided in the manuscript for guidance or collaboration opportunities involving FIA data. Our analysis codes, trained models and workflows will be made available upon request on GitHub to ensure transparency and support reproducibility. Please contact Shashika Himandi at shashika_himandi.lamahewa@uconn.edu or Dr. Chandi Witharana at chandi.witharana@uconn.edu to request access.
References
Dixon, R. K. et al. Carbon pools and flux of global forest ecosystems. Science 263(5144), 185–190 (1994).
Baccini, A. G. S. J. et al. Estimated carbon dioxide emissions from tropical deforestation improved by carbon-density maps. Nature climate change 2(3), 182–185 (2012).
Zaki, N. A. M., Latif, Z. A. & Zainal, M. Z. Predicting above-ground biomass and carbon stocks by using geographically weighted regression (GWR). In 38th Asian Conf Remote Sens–Sp Appl Touching Hum Lives, ACRS (2017).
Oswalt, S. N., Smith, W. B., Miles, P. D. & Pugh, S. A. Forest resources of the United States, 2017. General Technical Report, US Department of Agriculture, Forest Service (2019).
Forests of Connecticut. Resource Update FS-130 (2016).
Wang, L. P., Basu, S. & Zhang, Z. M. Direct and indirect methods for calculating thermal emission from layered structures with nonuniform temperatures. 072701. (2011).
Shi, L. & Liu, S. Methods of estimating forest biomass: A review. Biomass Volume Estimation Valorization Energy. 10, 65733 (2017).
Wang, M., Im, J., Zhao, Y. & Zhen, Z. Multi-Platform lidar for Non-Destructive individual aboveground biomass Estimation for Changbai larch (Larix olgensis Henry) using a hierarchical bayesian approach. Remote Sens. 14 (17), 4361 (2022).
Jenkins, J. C., Chojnacky, D. C., Heath, L. S. & Birdsey, R. A. National-scale biomass estimators for united States tree species. For. Sci. 49 (1), 12–35 (2003).
Somogyi, Z. et al. Indirect methods of large-scale forest biomass Estimation. European J. For. Research. 126, 197–207 (2007).
Smith, W. B. Forest inventory and analysis: a National inventory and monitoring program. Environ. Pollut. 116, S233–S242 (2002).
Woudenberg, S. W. et al. The Forest Inventory and Analysis Database: Database Description and User’s Manual Version 4.0 for Phase 2 (United States Department of Agriculture, Forest Service, Rocky Mountain Research Station, 2010).
Tamiminia, H., Salehi, B., Mahdianpari, M. & Goulden, T. State-wide forest canopy height and aboveground biomass map for new York with 10 m resolution, integrating GEDI, Sentinel-1, and Sentinel-2 data. Ecol. Inf. 79, 102404 (2024).
Huang, H., Liu, C., Wang, X., Zhou, X. & Gong, P. Integration of multi-resource remotely sensed data and allometric models for forest aboveground biomass Estimation in China. Remote Sens. Environ. 221, 225–234 (2019).
Li, C., Li, Y. & Li, M. Improving Forest aboveground biomass (AGB) estimation by incorporating crown density and using landsat 8 OLI images of a subtropical forest in Western Hunan in Central China. Forests 10(2), 104 (2019).
Johnson, K. D. et al. Integrating forest inventory and analysis data into a LIDAR-based carbon monitoring system. Carbon Balance Manag. 9, 1–11 (2014).
Johnson, K. D. et al. Integrating LIDAR and forest inventories to fill the trees outside forests data gap. Environ. Monit. Assess. 187, 1–8 (2015).
Johnson, L. K. et al. Fine-resolution landscape-scale biomass mapping using a Spatiotemporal patchwork of lidar coverages. Int. J. Appl. Earth Obs. Geoinf. 114, 103059 (2022).
Hu, T. et al. Mapping global forest aboveground biomass with spaceborne lidar, optical imagery, and forest inventory data. Remote Sens. 8 (7), 565 (2016).
Hudak, A. T. et al. A carbon monitoring system for mapping regional, annual aboveground biomass across the Northwestern USA. Environmental Res. Letters. 15 (9), 095003 (2020).
Tang, H. et al. Lidar derived biomass, canopy height, and cover for New England region, USA. ORNL DAAC (2021).
Chen, H. et al. Mapping forest aboveground biomass with MODIS and Fengyun-3 C VIRR imageries in Yunnan province, Southwest China using linear regression, K-Nearest neighbor and random forest. Remote Sens. 14 (21), 5456 (2022).
Dubayah, R. O. et al. Estimation of tropical forest height and biomass dynamics using lidar remote sensing at La selva, Costa Rica. J. Geophys. Res. Biogeosci. 115, (G2) (2010).
Ehlers, D. et al. Mapping forest aboveground biomass using multisource remotely sensed data. Remote Sens. 14 (5), 1115 (2022).
Zheng, D., Heath, L. S. & Ducey, M. J. Spatial distribution of forest aboveground biomass estimated from remote sensing and forest inventory data in new england, USA. J. Appl. Remote Sens. 2 (1), 021502 (2008).
Mancini, F. et al. An integrated procedure to assess the stability of coastal Rocky cliffs: from UAV close-range photogrammetry to Geomechanical finite element modeling. Remote Sens. 9 (12), 1235 (2017).
Sheridan, R. D. et al. Modeling forest aboveground biomass and volume using airborne lidar metrics and forest inventory and analysis data in the Pacific Northwest. Remote Sens. 7 (1), 229–255 (2014).
Urbazaev, M. et al. Estimation of forest aboveground biomass and uncertainties by integration of field measurements, airborne lidar, and SAR and optical satellite data in Mexico. Carbon Balance Manag. 13, 1–20 (2018).
Duncanson, L. et al. Implications of allometric model selection for county-level biomass mapping. Carbon Balance Manag. 12, 1–11 (2017).
Haralick, R. M., Shanmugam, K. & Dinstein, I. Textural features for image classification. IEEE Trans. Syst. Man. Cybernetics. 6, 610–621 (1973).
Csillik, O., Kumar, P., Mascaro, J., O’Shea, T. & Asner, G. P. Monitoring tropical forest carbon stocks and emissions using planet satellite data. Sci. Rep. 9 (1), 1–12 (2019).
Nandy, S., Srinet, R. & Padalia, H. Mapping forest height and aboveground biomass by integrating ICESat-2, Sentinel‐1 and Sentinel‐2 data using Random Forest algorithm in northwest Himalayan foothills of India. Geophysical Research Letters 48(14), e2021GL093799 (2021).
Naik, P., Dalponte, M. & Bruzzone, L. Prediction of forest aboveground biomass using multitemporal multispectral remote sensing data. Remote Sens. 13 (7), 1282 (2021).
Luo, P., Liao, J. & Shen, G. Combining spectral and texture features for estimating leaf area index and biomass of maize using Sentinel-1/2, and Landsat-8 data. IEEE Access. 8, 53614–53626 (2020).
Gao, Y. et al. Comparative analysis of modeling algorithms for forest aboveground biomass Estimation in a subtropical region. Remote Sens. 10 (4), 627 (2018).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Han, S., Williamson, B. D. & Fong, Y. Improving random forest predictions in small datasets from two-phase sampling designs. BMC Med. Inf. Decis. Mak. 21, 1–9 (2021).
Sivasankar, T., Lone, J. M., Sarma, K. K., Qadir, A. & Raju, P. L. N. Estimation of above ground biomass using support vector. Vietnam Journal of Earth Sciences 41(2), 95–104 (2013).
Wolpert, D. H. Stacked generalization. Neural Netw. 5 (2), 241–259 (1992).
Ma, L. et al. Mapping US forest biomass using nationwide forest inventory data and moderate resolution information. Remote sensing of Environment 112(4), 1658–1677 (2022).
Menlove, J. & Healey, S. CMS: Forest Aboveground Biomass from FIA Plots across the Conterminous USA 2009–2019 (ORNL DAAC, 2021).
Fox, E. W., Ver Hoef, J. M. & Olsen, A. R. Comparing spatial regression to random forests for large environmental data sets. PLoS ONE 15 (3), e0229509 (2020).
Torre-Tojal, L., Bastarrika, A. & Boyano, A. Above-ground biomass estimation from LiDAR data using random forest algorithms. J. Comput. Sci. 58, 101517 (2022).
Luna, Soriano et al. Determinants of above-ground biomass and its spatial variability in a temperate forest managed for timber production. Forests 9(8), 490 (2018).
Riemann, R., Wilson, B. T., Lister, A. & Parks, S. An effective assessment protocol for continuous Geospatial datasets of forest characteristics using USFS forest inventory and analysis (FIA) data. Remote Sens. Environ. 114 (10), 2337–2352 (2010).
Food and Agriculture Organization of the United Nations (FAO). Global Forest Resources Assessment 2020: Main Report (FAO, 2020).
EPA. Level III and IV Ecoregions of the Continental United States. U.S. Environmental Protection Agency. (Accessed 15 May 2025). https://www.epa.gov/eco-research/ecoregions (2013).
Liu, Z., Luong, P., Boley, M. & Schmidt, D. F. Improving random forests by smoothing. https://arXiv.org/abs/2505.06852. (2025).
USDA Forest Service. Forests of Connecticut, 2020. Resource Update FS-334. Madison 2. https://doi.org/10.2737/FS-RU-334 (U.S. Department of Agriculture, Forest Service, 2021).
Brand, G. J., Nelson, M. D., Wendt, D. G. & Nimerfro, K. K. The Hexagon/panel System for Selecting FIA Plots Under an Annual Inventory (USFS FIA research, 2003).
Lu, D. et al. A survey of remote sensing-based aboveground biomass Estimation methods in forest ecosystems. Int. J. Digit. Earth. 9 (1), 63–105 (2016).
Tinkham, W. T. et al. Applications of the united States forest inventory and analysis dataset: a review and future directions. Can. J. For. Res. 48 (11), 1251–1268 (2018).
Burkman, B. Forest inventory and analysis: sampling and plot design. FIA Fact. Sheet Ser. (2005).
Butler, B. J. Forests of Connecticut, 2017. Resource Update FS-159. Newtown Square, PA: U.S. Department of Agriculture, Forest Service, Northern Research Station. 3. https://doi.org/10.2737/FS-RU-159 (2018).
Hoppus, M. & Lister, A. The status of accurately locating forest inventory and analysis plots using the global positioning system. In Proceedings of the Seventh Annual Forest Inventory and Analysis Symposium, Portland, OR, USA 36, 179–184 (2005).
Lister, A. et al., Strategies for preserving owner privacy in the national information management system of the USDA Forest Service’s Forest Inventory and Analysis unit. United States department of agriculture forest service general technical report NC 352 163 (2005).
Woodall, C. W., Heath, L. S., Domke, G. M. & Nichols, M. C. Methods and equations for estimating aboveground volume, biomass, and carbon for trees in the US forest inventory. Gen. Tech. Rep. NRS-88. Newtown Square, PA: US Department of Agriculture, Forest Service, Northern Research Station 30 (2011).
Burrill, E. A. et al. The Forest Inventory and Analysis Database: Database Description and User Guide Version 9.0.1 for Phase 2. U.S. Department of Agriculture, Forest Service. 1026. (Accessed 03 March 2022). https://research.fs.usda.gov/programs/fia#data-and-tools (2021).
Chen, Q., Laurin, G. V. & Valentini, R. Uncertainty of remotely sensed aboveground biomass over an African tropical forest: propagating errors from trees to plots to pixels. Remote Sens. Environ. 160, 134–143 (2015).
Chen, Q., Laurin, G. V., Battles, J. J. & Saah, D. Integration of airborne lidar and vegetation types derived from aerial photography for mapping aboveground live biomass. Remote Sens. Environ. 121, 108–117 (2012).
CT department of energy and environmental protection. (n.d.). Connecticut environmental conditions online. Connecticut Environmental Conditions Online Maps & Geospatial Data for Everyone. https://cteco.uconn.edu/guides/Soils.htm (2024).
Shao, G. et al. Improving Lidar-based aboveground biomass Estimation of temperate hardwood forests with varying site productivity. Remote Sens. Environ. 204, 872–882 (2018).
McPherson, E. G., van Doorn, N. S. & Peper, P. J. Urban tree database and allometric equations. Gen. Tech. Rep. PSW-GTR-253. Albany, CA: US Department of Agriculture, Forest Service. Pac. Southwest. Res. Stn. 86, 253 (2016).
Hayashi, M., Saigusa, N., Yamagata, Y. & Hirano, T. Regional forest biomass Estimation using icesat/glas spaceborne lidar over Borneo. Carbon Manag. 6 (1–2), 19–33 (2015).
Chen, L., Ren, C., Zhang, B., Wang, Z. & Xi, Y. Estimation of forest above-ground biomass by geographically weighted regression and machine learning with Sentinel imagery. Forests 9 (10), 582 (2018).
Bright, B. C., Hicke, J. A. & Hudak, A. T. Estimating aboveground carbon stocks of a forest affected by mountain pine beetle in Idaho using lidar and multispectral imagery. Remote Sens. Environ. 124, 270–281 (2012).
Pandit, S., Tsuyuki, S. & Dube, T. Estimating above-ground biomass in sub-tropical buffer zone community forests, nepal, using Sentinel 2 data. Remote Sens. 10 (4), 601 (2018).
Moradi, F., Darvishsefat, A. A., Pourrahmati, M. R., Deljouei, A. & Borz, S. A. Estimating aboveground biomass in dense Hyrcanian forests by the use of Sentinel-2 data. Forests 13(1), 104 (2022).
Parent, J. R., Gold, A. J., Vogler, E. & Lowder, K. A. Guiding decisions on the future of dams: A GIS database characterizing ecological and social considerations of dam decisions. J. Environ. Manage. 351, 119683 (2024).
Huete, A. R. A soil-adjusted vegetation index (SAVI). Remote sensing of environment 25(3), 295–309 (1988).
Mohammadpour, P., Viegas, D. X. & Viegas, C. Vegetation mapping with random forest using Sentinel 2 and GLCM texture feature - A case study for Lousã region, Portugal. Remote Sens. 14 (18), 4585 (2022).
Genuer, R., Poggi, J. M. & Tuleau-Malot, C. Variable selection using random forests. Pattern Recognit. Lett. 31 (14), 2225–2236 (2010).
Wang, Y., Zhang, X. & Guo, Z. Estimation of tree height and aboveground biomass of coniferous forests in North China using stereo ZY-3, multispectral Sentinel-2, and DEM data. Ecol. Ind. 126, 107645 (2021).
Breiman, L. Classification and Regression Trees (Routledge, 2017).
Wongchai, W., Onsree, T., Sukkam, N., Promwungkwa, A. & Tippayawong, N. Machine learning models for estimating above ground biomass of fast-growing trees. Expert Syst. Appl. 199, 117186 (2022).
Biau, G. Analysis of a random forests model. J. Mach. Learn. Res. 13 (1), 1063–1095 (2012).
Sekulić, A., Kilibarda, M., Heuvelink, G. B. M., Nikolić, M. & Bajat, B. Random forest spatial interpolation. Remote Sens. 12 (10), 1687 (2020).
Liaw, A. & Wiener, M. Classification and regression by randomForest. R News 2 (3), 18–22 (2002).
Dewi, C. & Chen, R.-C. Random forest and support vector machine on features selection for regression analysis. Int. J. Innov. Comput. Inf. Control. 15 (6), 2027–2037 (2019).
Xu, D., Wang, H., Xu, W., Luan, Z. & Xu, X. LiDAR applications to estimate forest biomass at individual tree scale: Opportunities, challenges and future perspectives. Forests 12(5), 550 (2021).
Hong, Y. et al. Combining multisource data and machine learning approaches for multiscale Estimation of forest biomass. Forests 14 (11), 2248 (2023).
Tang, Z., Xia, X., Huang, Y., Lu, Y. & Guo, Z. Estimation of National forest aboveground biomass from multi-source remotely sensed dataset with machine learning algorithms in China. Remote Sens. 14 (21), 5487 (2022).
U.S. Forest Service. National Scale Volume and Biomass Estimators (NSVB). Forest Inventory and Analysis Program. (Accessed 27 October 2024). https://research.fs.usda.gov/programs/fia/nsvb
Asner, G. P. et al. High-resolution forest carbon stocks and emissions in the Amazon. Proc. Natl. Acad. Sci. 107 (38), 16738–16742. https://doi.org/10.1073/pnas.1004875107 (2010).
Shorten, C. & Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. J. Big Data. 6 (1), 1–48 (2019).
Lu, Y. et al. Machine learning for synthetic data generation: a review. https://arXiv.org/abs/2302.04062 (2023).
Qadeer, A., Shakir, M., Wang, L. & Talha, S. M. Evaluating machine learning approaches for aboveground biomass prediction in fragmented high-elevated forests using multi-sensor satellite data. Remote Sens. Applications: Soc. Environ. 36, 101291 (2024).
White, J. et al. A model development and application guide for generating an enhanced forest inventory using airborne laser scanning data and an area-based approach. (2017).
Friedman, J. H. & Popescu, B. E. Predictive learning via rule ensembles. Ann. Appl. Stat. 2 (3), 916–954 (2008).
Riggins, J. J., Tullis, J. A. & Stephen, F. M. Per-segment aboveground forest biomass estimation using LIDAR-derived height percentile statistics. GIScience Remote Sens. 46 (2), 232–248 (2009).
Holmgren, J. Prediction of tree height, basal area and stem volume in forest stands using airborne laser scanning. Scand. J. For. Res. 19 (6), 543–553 (2004).
Lim, K. S. & Treitz, P. M. Estimation of above ground forest biomass from airborne discrete return laser scanner data using canopy-based quantile estimators. Scand. J. For. Res. 19 (6), 558–570 (2004).
Næsset, E. Airborne laser scanning as a method in operational forest inventory: status of accuracy assessments accomplished in Scandinavia. Scand. J. For. Res. 22 (5), 433–442 (2007).
Chen, L., Wang, Y., Ren, C., Zhang, B. & Wang, Z. Optimal combination of predictors and algorithms for forest above-ground biomass mapping from Sentinel and SRTM data. Remote Sens. 11 (4), 414 (2019).
Wai, P., Su, H. & Li, M. Estimating aboveground biomass of two different forest types in Myanmar from Sentinel-2 data with machine learning and geostatistical algorithms. Remote Sens. 14 (9), 2146 (2022).
Dang, A. T. N. et al. Forest aboveground biomass estimation using machine learning regression algorithm in Yok Don National Park, Vietnam. Ecol. Inf. 50, 24–32 (2019).
Huete, A. et al. Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens. Environ. 83 (1–2), 195–213 (2002).
Liu, Z., Luong, P., Boley, M. & Schmidt, D. F. Improving random forests by smoothing. https://arXiv.org/abs/2505.06852 (2025).
He, Q., Chen, E., An, R. & Li, Y. Above-ground biomass and biomass components estimation using lidar data in a coniferous forest. Forests 4 (4), 984–1002 (2013).
Chen, L. et al. Evaluating the transferability of spectral variables and prediction models for mapping forest aboveground biomass using transfer learning methods. Remote Sens. 15 (22), 5358 (2023).
Schmude, J. et al. Prithvi WxC: foundation model for weather and climate. https://arXiv.org/abs/2409.13598 (2024).
Klemmer, K., Rolf, E., Robinson, C., Mackey, L. & Rußwurm, M. SatCLIP: Global, general-purpose location embeddings with satellite imagery. In Proceedings of the AAAI Conference on Artificial Intelligence. 39 (4), 4347–4355 (2025).
Biau, G. Analysis of a random forests model. J. Mach. Learn. Res. 13, 1063–1095 (2012).
Acknowledgements
The work reported in this paper was fully funded by a USDA-NIFA McIntire-Stennis Capacity Grant (#CONS01050), USA (2022-2024). We are grateful to the U.S. Forest Service, U.S. Department of Agriculture, Northern Research Station, United States, for sharing data and expertise. We also thank Andrew J. Lister and Charles Paulson of the U.S. Forest Service, Northern Research Station, for their insightful comments and invaluable support during the data extraction process.
Author information
Contributions
Shashika Himandi: Conceptualization, Software, Formal analysis, Figures, Tables, Writing - original draft. Chandi Witharana: Conceptualization, Supervision, Funding acquisition, and Reviewing - original draft. Rachel Riemann: Conceptualization, Formal analysis, and Reviewing - original draft. Robert Fahey: Supervision and Reviewing - original draft. Thomas E. Worthley: Supervision, Funding acquisition, and Reviewing - original draft.
Ethics declarations
Competing interests
The authors declare no competing interests.
Permissions and licenses
This study was conducted across the forests of Connecticut in collaboration with scientists from the U.S. Forest Service’s Forest Inventory and Analysis (FIA) program. Because the work was carried out as part of this established collaboration, relied on publicly available data, and did not reveal true plot locations, no special permissions or licenses were required.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lamahewage, S.H.G., Witharana, C., Riemann, R. et al. Aboveground biomass estimation using multimodal remote sensing observations and machine learning in mixed temperate forest. Sci Rep 15, 31120 (2025). https://doi.org/10.1038/s41598-025-15585-6
DOI: https://doi.org/10.1038/s41598-025-15585-6