Introduction

Habitat plays a pivotal role in the distribution and abundance of species and is the focus of conservation efforts1,2. Species distribution models (SDMs) have become effective tools for species and habitat conservation because they integrate species abundance observations with environmental estimates. They can predict potentially suitable habitat for species3, identify high-priority areas for monitoring and conservation4,5,6, and assess the influence of environmental change on species7. In both theory and application, SDMs are constrained by ecological principles and assumptions8. In particular, the influence of the environment on species is complex9, so the primary limiting and determining variables must be selected for SDMs10. Machine learning algorithms, which learn patterns from extensive data to perform prediction and classification, have become an effective method for constructing SDMs11. Among these, the Random Forest (RF) algorithm not only performs extremely well in ecological predictions but also identifies key factors12. With the increasing availability of big data, the use of RF in SDMs has become more prevalent13,14. The environmental layers used to construct RF-based SDMs are typically derived from satellite remote sensing15,16. However, aquatic factors such as dissolved oxygen, which are vital for aquatic organisms17, cannot be acquired through remote sensing18. As a result, applying RF to SDMs for endangered aquatic species is challenging19,20. Empirical research on the distribution and habitat use of endangered aquatic species is urgently needed to provide references.

The Yangtze finless porpoise (YFP, Neophocaena asiaeorientalis) is a critically endangered freshwater cetacean that inhabits the middle and lower Yangtze channel and two large adjoining lakes, Poyang and Dongting21. Poyang Lake is one of the most critical habitats, supporting almost half of the YFP population22. Field surveys indicate that YFPs are distributed primarily along the main channel between Hukou and Poyang, and also in some large sandpit areas between Duchang and Yongxiu23,24. These field surveys do not explain which habitats YFPs prefer in Poyang Lake or what drives those preferences. Previous studies found that YFP sighting rates showed only a negative correlation with ship density in the main channel25,26. However, because hydrological conditions vary in Poyang Lake, the YFP distribution fluctuates drastically between seasons24, which suggests that hydrological conditions also play a considerable role in habitat preference. In Poyang Lake, increasingly frequent human activities, such as heavy shipping and sand mining, have severely squeezed the living space of YFPs, and almost all YFP populations suffer from habitat loss27. SDMs for YFPs are therefore extremely important, as they help delineate suitable areas for targeted conservation efforts.

In this study, we conducted habitat surveys in Poyang Lake, investigating the distribution of YFP populations and the aquatic factors of their habitats. The results were used to construct RF-based SDMs, and we analyzed the impact of aquatic factors on YFP distribution patterns. The objective of this study was to elucidate how YFPs select their habitats and to provide information for the management and conservation of both populations and habitats. Based on this analysis, we also discuss perspectives on SDMs for YFPs. Our research also demonstrates the applicability of machine learning algorithms in SDMs for endangered aquatic species.

Materials and methods

Study area and monitoring point

The study area is a large sandpit area in Poyang Lake (see Supplementary Fig. S1), with a total water area of approximately 20 km² and minimal evidence of human activity28,29. In accordance with the technical specifications for surface water environmental quality monitoring (HJ 91.2-2022) issued by the Ministry of Ecology and Environment of China, a total of 12 sampling points (p1-p12 in Fig. 1) were established. Four surveys were conducted, in April, June, September, and December 2023.

Fig. 1

Study area and sampling points. The map was generated with ArcMap 10.8 (https://www.esri.com/en-us/arcgis/products/arcgis-desktop/overview).

Environmental variables

The aquatic factors were measured at the sampling points in accordance with the technical specifications for surface water environmental quality monitoring (HJ 91.2-2022) and the technical guidelines for water ecological monitoring-aquatic organism monitoring and evaluation of lakes and reservoirs (HJ 1296-2023), both issued by the Ministry of Ecology and Environment of China. The data were subsequently imported into ArcGIS (version 10.8), where the inverse distance weighting (IDW) method was used to generate the aquatic factor layers.
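The IDW step can be illustrated with a minimal Python sketch. This is not the ArcGIS implementation used in the study: the function name `idw`, the distance exponent of 2, the use of all sampling points without a search radius, and the sample values are assumptions made only for illustration.

```python
import math

def idw(samples, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    samples: list of (sx, sy, value) tuples in projected coordinates (metres).
    power: distance exponent; 2 is assumed here, the source does not state it.
    """
    num = 0.0
    den = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value  # query location coincides with a sampling point
        w = 1.0 / d ** power  # closer points receive larger weights
        num += w * value
        den += w
    return num / den

# Hypothetical dissolved-oxygen readings (mg/L) at three sampling points
points = [(0.0, 0.0, 8.0), (100.0, 0.0, 6.0), (0.0, 100.0, 7.0)]
estimate = idw(points, 50.0, 0.0)
```

Evaluating such a function at the centre of every 25 m cell would give one raster layer per aquatic factor.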

A total of 57 aquatic factor layers (25 m × 25 m resolution) were created in GIS format as proxies for habitat predictors (see Supplementary Table S1). These layers included water physical and chemical indicators, the comprehensive trophic level index (TLI)30, aquatic biological factors (the density and biomass of zooplankton, phytoplankton and benthos), and biodiversity indicators (the Shannon-Wiener index (H), the Simpson index (D), the Margalef index (M) and the Pielou index (J) of zooplankton, phytoplankton and benthos).
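For readers unfamiliar with the four biodiversity indicators, the sketch below computes them from a list of per-taxon counts. The source does not state which variants were used, so the natural logarithm, the Simpson form 1 − Σp², and the function name `diversity_indices` are assumptions.

```python
import math

def diversity_indices(counts):
    """Compute H, D, M and J from per-taxon abundance counts."""
    counts = [c for c in counts if c > 0]  # ignore taxa that were not observed
    n = sum(counts)                        # total number of individuals
    s = len(counts)                        # number of taxa (richness)
    props = [c / n for c in counts]
    shannon = -sum(p * math.log(p) for p in props)       # Shannon-Wiener H
    simpson = 1.0 - sum(p * p for p in props)            # Simpson D (assumed 1 - sum p^2)
    margalef = (s - 1) / math.log(n) if n > 1 else 0.0   # Margalef M
    pielou = shannon / math.log(s) if s > 1 else 0.0     # Pielou J (evenness)
    return {"H": shannon, "D": simpson, "M": margalef, "J": pielou}

# Hypothetical sample: four taxa with equal abundance
indices = diversity_indices([10, 10, 10, 10])
```

With perfectly even abundances, as in this made-up sample, Pielou evenness reaches its maximum of 1.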

Presence and pseudo-absence points of YFPs

A research vessel was operated at a speed of approximately 12 km/h in the study area. Five observers were positioned at the bow of the vessel to observe YFPs during the voyage. The latitude and longitude of each YFP sighting were recorded with a hand-held GPS (GPSMAP639CSX, Garmin, USA)22,27. An acoustic event recorder (RPCD-A2, PinDu, China) was installed at the stern of the vessel to assist the visual observations31.

The study area was divided into equal-sized grid cells (0.1 × 0.1 km), and the geographic coordinates of the cell center points were extracted in ArcGIS 10.8. The coordinates of both the grid points and the YFP sightings were imported into RStudio, where the R package “geosphere” was used to calculate the minimum spatial distance (MSD) between each grid point and the YFP sightings. Grid points with an MSD below the threshold (thresholds from 100 m to 1000 m were tested) were classified as presence points, while the remaining points were classified as pseudo-absence points.
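The MSD labelling rule can be sketched in Python as follows. The study used the R package “geosphere”; the haversine formula, the mean Earth radius, the function names and the coordinates below are assumptions chosen only to make the rule concrete.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (an assumption)

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def label_grid_points(grid_points, sightings, threshold_m):
    """Return 1 (presence) or 0 (pseudo-absence) for each grid point."""
    labels = []
    for glon, glat in grid_points:
        # minimum spatial distance (MSD) from this grid point to any sighting
        msd = min(haversine_m(glon, glat, slon, slat) for slon, slat in sightings)
        labels.append(1 if msd < threshold_m else 0)
    return labels

# Hypothetical coordinates: one sighting and three grid points
sightings = [(116.20, 29.20)]
grid = [(116.20, 29.20), (116.20, 29.203), (116.20, 29.21)]
labels = label_grid_points(grid, sightings, threshold_m=550)
```

Repeating the labelling for each threshold between 100 m and 1000 m yields one training set per threshold, which is what allows the error rates to be compared across thresholds.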

Construction of SDMs

The 57 aquatic factor layers were imported into RStudio, and the R package “abundanceR” was used to extract the aquatic factor data at the presence and pseudo-absence points. The SDMs were built with RF using the R package “randomForest” with default parameters. The results of each survey were used to construct an independent SDM, giving a total of four initial SDMs.

Factors were ranked by RF feature importance (mean decrease accuracy and mean decrease Gini indices) over 100 iterations. The number of key factors was identified using 10-fold cross-validation implemented with the rfcv() function in the R package “randomForest”. The number of factors at which the cross-validation error was lowest was taken as the number of key factors, and the top-ranked factors that recurred across multiple SDMs were selected to construct the reconstructed SDMs.
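The final selection step, keeping factors that recur in the top-ranked lists of several SDMs, can be sketched as below. The function name `recurrent_top_factors`, the `min_models=2` cut-off for "multiple", and the placeholder factor names are hypothetical; the importance rankings themselves would come from the fitted RF models.

```python
from collections import Counter

def recurrent_top_factors(rankings, top_k, min_models=2):
    """Return (factor, n_models) pairs for factors that reach the top_k
    of at least min_models per-survey importance rankings."""
    counts = Counter()
    for ranked in rankings.values():
        counts.update(set(ranked[:top_k]))  # count each factor once per model
    kept = [(f, n) for f, n in counts.items() if n >= min_models]
    # most frequently shared factors first, ties broken alphabetically
    return sorted(kept, key=lambda item: (-item[1], item[0]))

# Hypothetical importance rankings (most important first) for four surveys
rankings = {
    "April": ["factor_a", "factor_b", "factor_c", "factor_d"],
    "June": ["factor_b", "factor_e", "factor_a", "factor_f"],
    "September": ["factor_g", "factor_b", "factor_h", "factor_a"],
    "December": ["factor_i", "factor_c", "factor_j", "factor_k"],
}
shared = recurrent_top_factors(rankings, top_k=3)
```

In this made-up example, `factor_b` reaches the top three in three models, while `factor_a` and `factor_c` each do so in two.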

The performance of SDMs

The SDMs were used to cross-predict the presence and pseudo-absence areas of YFPs with the predict() function in the R package “stats”. The predictions were then used to calculate the receiver operating characteristic (ROC) curves, the True Skill Statistic (TSS) and the area under the ROC curve (AUC) with the R packages “pROC” and “tidysdm”.
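The two summary metrics can be written out in a few lines of Python. The study computed them with “pROC” and “tidysdm”; the sketch below only restates the standard definitions (TSS = sensitivity + specificity − 1, and AUC as the probability that a presence point scores higher than a pseudo-absence point). The 0.5 cut-off and the data are assumptions.

```python
def tss(y_true, y_pred):
    """True Skill Statistic from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

def auc(y_true, scores):
    """AUC as the share of (presence, pseudo-absence) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted presence probabilities
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption
model_auc = auc(y_true, scores)
model_tss = tss(y_true, y_pred)
```

TSS ranges from −1 to 1 and AUC from 0 to 1, with 0 and 0.5 respectively corresponding to predictions no better than chance.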

Results

Results of the habitat surveys

The four surveys identified 26, 38, 37 and 21 groups of YFPs, respectively. A few YFPs were found outside the study area owing to rising water levels during the rainy season (Fig. 2).

Fig. 2

Distribution of YFPs in each survey.

The most appropriate thresholds

As the threshold increased, the out-of-bag (OOB) error rate and the classification error rate of the presence points progressively decreased, whereas the classification error rate of the pseudo-absence points increased. The three curves intersected at a threshold of approximately 550 m (Fig. 3).

Fig. 3

Relationship between SDM accuracy and the threshold.

The use of SDMs

The SDMs accurately predicted the YFP distributions for the survey period on which they were built, with all YFPs located in the predicted areas (Fig. 4 and Supplementary Figs. S2-4). The SDMs showed limited efficacy in predicting the YFP distributions for other survey periods, with only a modest number of YFPs located in the predicted areas (Fig. 4 and Supplementary Figs. S2-4). In some instances, the predictions were extreme, classifying the entire study area as either a presence area or an absence area (Supplementary Figs. S2-4).

Fig. 4

Prediction maps of the SDM based on the April habitat survey. (a) The SDM predictions for the April survey. (b) The SDM predictions for the June survey. (c) The SDM predictions for the September survey. (d) The SDM predictions for the December survey.

The contribution of aquatic factors to SDMs

The cross-validation error curve stabilized when 15 factors were used in the SDMs, and the top-ranked aquatic factors were not identical among the SDMs. Thirteen aquatic factors appeared in the top-factor lists of multiple SDMs (Fig. 5 and Supplementary Table S2). The importance of the aquatic factors for the SDMs is visualized in Supplementary Fig. S5.

Fig. 5

Degree of contribution of aquatic factors to the SDMs. (a) The 15 most important aquatic factors identified by RF in the four SDMs. (b) The occurrence of top aquatic factors across the SDMs, based on the mean decrease accuracy index. (c) The occurrence of top aquatic factors across the SDMs, based on the mean decrease Gini index.

The performance of SDMs

The AUC and TSS of the reconstructed SDMs (excluding September) were higher than those of the initial SDMs (Fig. 6). The ROC curves of the initial and reconstructed SDMs are displayed in Supplementary Figs. S6-9.

Fig. 6

(a) The AUC of the initial and reconstructed SDMs. (b) The TSS of the initial and reconstructed SDMs.

Discussion

The population distributions

Previous studies found that YFPs migrate between Poyang Lake and the main stem of the Yangtze River in response to hydrological variations32. We found that the abundance and distribution of YFPs differed between surveys (Fig. 2), which suggests that YFPs move between the sandpit and the main channel. This also indicates that YFP populations have seasonal habitat preferences28. Hydrological conditions therefore influence YFP distributions not only at the whole-lake scale but also at fine scales.

The principal limiting factors

In this study, the prediction maps did not delineate the actual presence of YFPs entirely accurately (Fig. 4). This limitation arises because SDMs predict the suitability of an area for YFP presence, and it reflects the constraints of binary classification in habitat prediction33. Suitability distributions are theoretical representations of the potential distribution of a species. However, species distributions do not necessarily align with the most suitable habitats unless the population reaches its carrying capacity2. Consequently, the aquatic factors that play a major role in the SDMs34,35 should be prioritized over the SDM predictions themselves.

We conducted comprehensive habitat surveys, including measurements of water level and water flow, which correlate with YFP distribution36,37. Aquatic factors that had not previously been validated, including dissolved oxygen and the diversity of aquatic organisms38, were also quantified. The numerous and multifaceted aquatic factors supported the accuracy of the SDMs for the survey period on which they were built. However, they also appear to have limited model transferability, as evidenced by the performance of the initial SDMs in predicting other periods (AUC = 0.557 ± 0.062, TSS = 0.18 ± 0.062). The reconstructed SDMs using top-ranked factors showed improved performance (AUC = 0.624 ± 0.056, TSS = 0.30 ± 0.11). AUC and TSS have been shown to be independent of prevalence39. In this study, model performance was evaluated with AUC and TSS based on a dichotomous presence-absence prediction, which resulted in suboptimal SDM performance; an AUC > 0.75 is generally considered a good result. Nevertheless, our study underscores the importance of selecting key factors when constructing SDMs for YFPs.

No factor appeared in the lists of top-ranked factors for all four SDMs. This suggests that the habitat preferences of YFPs vary seasonally and that the seasons cannot simply be categorized as rainy or dry. We also found that total phosphorus concentration and cyanobacteria density appeared in the lists of top-ranked factors for three SDMs (April, June and September). Previous research found that prey resources are among the most important factors affecting the distribution of YFPs40. We speculate that these two factors may have influenced the distribution of YFPs indirectly by affecting prey resources.

The perspectives on SDMs for YFP

The objective of field surveys is to ascertain the presence points of species. However, absence points are also crucial for SDMs. Researchers therefore use pseudo-absences in place of true absences, which must be established under multiple assumptions and with careful planning41. Because of the restricted distribution range and high mobility of YFPs, a series of thresholds was set for the MSD42. We found that a threshold of approximately 550 m was the most appropriate in this study, and all four SDMs performed well under this threshold. We speculate that YFPs occupy the area within about 550 m of a sighting. However, the behavioral characteristics of YFPs are extremely complex43. It is therefore necessary to adopt a suitable approach for determining the threshold for each SDM, e.g., by observing how the model performs across different thresholds.

In this study, the study area is a natural habitat with little or no human disturbance. This provides a foundation for establishing a baseline of natural habitat preferences38. However, the activity space of YFPs inevitably overlaps with that of humans in most habitats26. Research has shown that factors such as ship noise influence the natural behavior of YFPs44. Indicators of anthropogenic intensity should therefore be included in SDMs for YFPs.

Additionally, prey resources were not studied because of the constraints of the no-fishing policy. Cetaceans are highly selective in habitat use, which is influenced by prey resources45. The diet of YFPs consists only of specific small fish46. Although acoustic equipment can monitor prey populations, it cannot distinguish species47. This poses a challenge for incorporating prey resources into SDMs. In recent years, environmental DNA (eDNA) technology has matured and has proved particularly effective in analyzing fish community structure48. Integrating eDNA with acoustic monitoring is expected to address this challenge.

Conclusion

In this study, we employed RF for the first time to construct SDMs for YFPs and to investigate the principal limiting factors affecting YFP distributions. We found that YFPs were affected by different environmental factors in different seasons, and total phosphorus concentration and cyanobacteria density were identified as common key factors. Our findings provide crucial insights to inform the conservation management of YFPs and serve as a theoretical foundation for developing SDMs for endangered aquatic species.