Introduction

Reliable methods to accurately count wildlife are critical for detecting changes in population size and documenting demographic trends that can guide the conservation and management of populations. For many species, obtaining reliable data on population numbers is challenging, as species often occupy remote or inaccessible habitats, are nocturnal, or display cryptic behaviors that hinder detection. Wildlife survey methods vary in their strengths, and the optimal choice depends on the monitoring objective. For example, foot patrols are necessary for close-range assessments of body condition, while estimating survival rates requires individual identification. The use of UAVs for monitoring wildlife is promising1, but the scale of surveys is currently limited by UAV battery life, and data can be compromised by the acoustic disturbance UAVs cause to animals2. To date, Earth observation satellite imagery has been used primarily for species detection, but the approach can also provide broader ecological insights into habitat selection and migration dynamics3. Several comprehensive reviews have examined the use of satellite imagery for wildlife monitoring4,5,6; however, empirical studies remain limited relative to the number of species for which this monitoring approach is now technically viable.

The high cost of sub-meter imagery remains a major challenge; costs depend on the provider, order volume, and image specifications. Freely available imagery from satellite programs such as NASA's Landsat series (30 m resolution, 16-day revisit) and ESA's optical sensors under the Copernicus program (10–20 m resolution, 5-day revisit) offers valuable broad-scale coverage for habitat assessment; however, their coarse spatial resolution prevents detection of individual animals. In contrast, several commercial providers offer sub-meter imagery suitable for detailed object detection. For example, Maxar Technologies' WorldView-3 satellite (used in this study) and the WorldView Legion constellation provide panchromatic imagery at resolutions as fine as 0.29 m, with up to 15 revisits per day, enabling the detection of individual animals from low Earth orbit7. As an unprecedented number of high spatial- and temporal-resolution commercial constellations are launched8 and companies seek to remain competitive, costs are expected to fall.

In addition to high costs, manual analysis of large satellite images remains labor-intensive, time-consuming, and tedious. Deep-learning approaches allow this process to be automated, provided sufficient labeled training data are available. Both point- and object-based deep-learning approaches have proven effective for detecting individual animals in satellite imagery, as demonstrated for whales9, elephants10, and wildebeest3. For wildebeest, smaller body size is offset by their tendency to form large, dense aggregations that, in the case of the migratory population, number in the hundreds of thousands, increasing detectability11. In contrast, white rhinoceroses typically occur in much smaller crashes of around five, rarely exceeding 10 individuals12, making them harder to detect despite their larger individual body size.

Satellite-based monitoring is suited to species detection13 and population counting14, tasks for which aerial surveys are the main alternative; each method is subject to distinct biases that influence survey accuracy. Both techniques are subject to availability or concealment bias, where animals present in the survey area remain undetected when hidden, such as rhinos resting under trees during the hottest parts of the day15. This bias has spatial and temporal components: satellite imagery can reduce spatial bias by capturing much larger areas in a single acquisition, but temporal bias remains because of fixed orbital schedules. Aerial surveys provide greater flexibility in timing to avoid periods of low visibility, whereas the revisit limitations of satellites are expected to diminish as new constellations launch with higher revisit frequencies and improved temporal resolution.

Fig. 1

Temporal comparison of WorldView-3 satellite imagery illustrating rhino movement. A blue box highlights the presence of a rhino, while a green box shows its absence in the same location on a later acquisition, confirming the object as a moving animal rather than a static feature. The red box similarly indicates a rhino detection and its subsequent absence in the same location at a later time. Images acquired on 3 September 2019 and 13 September 2022. Imagery from Maxar Technologies.

Observer bias and detectability bias (where object visibility decreases with distance from the observer) are mitigated in satellite surveys; however, classification error, i.e., assignment to the wrong species category (e.g., mistaking a rhino for an elephant or vice versa), affects both techniques and is particularly challenging for satellite surveys because of the lower resolution of the imagery. This form of sampling error can skew population estimates, particularly in areas where multiple large-bodied species co-occur, such as rhinos, elephants, and hippos. Involving species experts familiar with diurnal movement patterns can reduce classification error by improving image interpretation.

Given the critical conservation status of rhino species, improving the accuracy and scalability of survey methods is essential for tracking population trends and informing protection efforts. As umbrella species, rhinos provide protection for many co-occurring species; conserving them also safeguards the broader ecosystems and landscapes they inhabit. Rhinos remain critically threatened throughout much of their range. Globally, five rhinoceros species and eleven subspecies have been documented, of which four subspecies have recently gone extinct. The Northern White Rhino (Ceratotherium simum cottoni) is now functionally extinct, while the Southern White Rhino (Ceratotherium simum simum), the most numerous remaining subspecies, is classified as Near Threatened by the IUCN and is most populous in South Africa16. Poaching, primarily driven by demand for rhino horn as a status symbol and in traditional medicine, remains a significant threat, with over 1,000 individuals killed annually between 2013 and 201617. In recent years, a decline in mortality has been reported18; although the reduction is partly due to reduced poaching effort, the disruption to rhino demography has been significant19.

Protection efforts for rhinos extend beyond biological management to include law enforcement initiatives to disrupt organized crime, strategies for trade and demand reduction, and the development of alternative livelihoods for local communities. Conservation strategies for rhinos range from breeding programs and translocations20,21 to demand-reduction campaigns22, the creation of synthetic horn alternatives23, dehorning24, and proposals to legalize the horn trade25. Despite these initiatives, the future of rhinos remains uncertain. Currently, around half of the global white rhino population is privately managed, while efforts continue to seek long-term solutions for protecting populations in the wild26.

In this study, we evaluate the feasibility of detecting rhinoceroses using high-resolution imagery (33–36 cm) acquired from the commercial WorldView-3 satellite (Maxar Technologies) over the world's largest private rhino reserve in South Africa, home to approximately 2,000 individual white rhino (Ceratotherium simum simum). We use temporal image differencing, as shown in Fig. 1, to identify rhinos. We then automate the detection process using the YOLOv12x object detection model27; detections from this model are shown in Fig. 2.

Fig. 2

Top row: original unlabeled satellite imagery. Bottom row: rhinoceros detections generated by the YOLOv12 model. WorldView-3 image acquired on 13 September 2022. Imagery from Maxar Technologies.

The site was selected to minimize potential confusion with other African megafauna and to provide consistent background conditions for the detection task. It is not fully representative of the broader range of habitats occupied by rhinos, and other locations may present either greater or lesser challenges for detection than those examined here. We assess whether augmenting the training dataset with synthetically generated rhino imagery improves model performance. We also investigate whether human evaluators can distinguish synthetic from real rhinos, and whether the model can reliably differentiate rhinos (Ceratotherium simum simum) from African elephants (Loxodonta africana) using satellite imagery from Addo Elephant Park, South Africa.

Results

The YOLOv12x object detection model, trained on 1,700 manually annotated sub-images and evaluated on a test set of 300 images, achieved an average precision (AP) of 0.6308, with a precision of 65.2%, recall of 61.4%, and an F1 score of 63.3%. This demonstrates the feasibility of detecting rhinoceroses in satellite imagery despite their cryptic appearance, small pixel footprint, and the complexity of heterogeneous landscapes.
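For reference, the F1 score is the harmonic mean of precision and recall; substituting the reported (rounded) values gives

$$F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.652 \times 0.614}{0.652 + 0.614} \approx 0.63$$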

A separate model trained solely on 1,000 synthetic images achieved a lower AP of 0.26 (precision: 40.5%, recall: 22.3%, F1 score: 28.6%) when evaluated on real-world imagery. However, augmenting the training set with synthetic data led to a modest improvement in detection performance, with the combined model reaching a peak AP of 0.65 (precision: 66.7%, recall: 63.1%, F1 score: 64.8%).

To assess the visual realism of the synthetic images, human annotators were asked to distinguish between real and synthetic examples. For real images, annotators correctly identified 140 out of 270 as genuine (true positives), yielding a recall of 52%. For synthetic images, 130 out of 171 were correctly identified as fake (true negatives), while 41 were misclassified as real (false positives), resulting in a specificity of 76%.
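These rates follow the standard confusion-matrix definitions; substituting the counts above gives

$$\text{recall} = \frac{TP}{TP+FN} = \frac{140}{270} \approx 0.52, \qquad \text{specificity} = \frac{TN}{TN+FP} = \frac{130}{130+41} \approx 0.76$$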

When trained on a combined dataset of 7,532 individually annotated instances of African elephants (Loxodonta africana) and white rhinoceroses (Ceratotherium simum simum), the model failed to reliably distinguish between the two species. We tested three two-class training configurations (rhino vs. elephant), noting that no single image contained both species. On the held-out test set of 324 images (1,158 annotated instances), the model achieved an overall AP of 0.247, with 32.8% precision, 28.7% recall, and an F1 score of 30.6%. Performance was notably stronger for rhinos (AP 0.350, precision 42.1%, recall 39.0%, F1 40.5%) than for elephants (AP 0.143, precision 23.5%, recall 18.3%, F1 20.6%).

Discussion

The results of this study demonstrate that white rhinoceroses (Ceratotherium simum simum) can be identified and detected in satellite imagery (Fig. 3) using manual screening and temporal image differencing. The 33–36 cm resolution imagery used in this study, combined with YOLOv12x, achieved only a moderate average precision (AP) of 0.63 without any synthetic data included in training. Future research could build on the freely available training dataset provided here to improve detection accuracy by testing alternative models, such as U-Net28 or HerdNet29. We encourage others to build on this dataset by incorporating imagery of additional rhino species from the diverse ecological contexts in which they occur across Asia and Africa. The challenge of distinguishing between species at this spatial resolution is evident from the poor performance in separating elephants from rhinos. We encourage future research to address this limitation, which should become surmountable as satellite imagery with higher spatial resolution becomes available. The capacity to detect animals directly, without relying on temporal image differencing, will be enabled by advances in satellite video acquisition30. The issue of double-counting is minimized in satellite surveys because very large areas, upwards of 4,000 km2 in extent3, can be captured in a single image at less than 50 cm resolution. When multiple images from different timestamps are used and individuals are treated as independent observations, however, the same double-counting problem that occurs in traditional aerial surveys can arise. In such cases, established principles for estimating detection probability and abundance can be applied to account for availability bias31.

In the long term, assembling a spectrally and contextually diverse library of wildlife examples in satellite imagery across multiple habitats will advance the field, enabling the refinement of robust automated detection methods and benchmarks and reducing reliance on manual annotation. Large-scale datasets have been compiled for camera-trap studies32,33, but are lacking for satellite imagery due to the high cost and proprietary nature of imagery. The cost of acquiring sub-half-meter resolution satellite imagery is substantial; however, we are at an unprecedented point in history with a record number of Earth observation satellites being launched8. As the market for very high-resolution satellite imagery grows and competition increases, acquisition costs are likely to decrease over time, making this monitoring technique more accessible. In addition, several satellite providers offer imagery at no cost for projects with ecological34 or societal benefits35.

At present, satellite-based surveying is cost-competitive with manned aerial surveys in remote or logistically challenging regions, such as Arctic environments, while in areas with established aviation infrastructure, conventional aerial surveys may remain more cost-effective in the near term. Different survey methods serve different purposes, and the optimal approach depends on the wildlife monitoring objective. Currently, the primary applications of satellite-based monitoring are species detection and counting; beyond counting, however, satellites can be used to examine how groups of animals move together across vast areas, providing a better understanding of collective behavior dynamics3. While the cost of census counts is an important consideration, accuracy and bias quantification are equally critical. Conducting aerial and satellite surveys simultaneously across representative sites would enable direct evaluation of detection accuracy and quantification of relative bias errors. Such cross-method comparisons are well established in ecology, for example in contrasts between mark–recapture and block-count estimation36. Embracing methodological pluralism is essential to validate monitoring approaches rather than relying on a single technique.

Fig. 3

Zoomed-in visualization of two rhinoceroses within the reserve, illustrating that each individual occupies fewer than 30 pixels, including body shadow. WorldView-3 image acquired on 3 September 2019. Imagery from Maxar Technologies.

In our study, incorporating synthetic images led to only a modest improvement in model performance (peak AP of 0.65). However, this limited gain may be attributed more to the challenge of dissociating pixel clusters due to current spatial resolution constraints of satellite imagery than to the effectiveness of synthetic augmentation itself. As wildlife satellite monitoring advances, continued development of processing pipelines that leverage synthetic data, as piloted here, will be essential to overcome sparse training data and unlock the full potential of satellite-based biodiversity monitoring.

While not directly examined in this study, the integration of AI-generated imagery into wildlife detection pipelines raises a potential future concern: fabricated or manipulated imagery could distort population estimates. Although synthetic data can enhance model training and address class imbalance, a common limitation in wildlife datasets37, it also introduces the risk of misuse. This risk becomes particularly salient if species population estimates are incorporated into financial mechanisms, such as biodiversity credit markets—satellite-based detection is already used to quantify the benefits of forest carbon offset projects38, and similar integration for species accounting is conceivable. Should this occur, ensuring rigorous data provenance capable of distinguishing genuine from synthetic imagery will be essential39, paralleling safeguards now being implemented in other fields vulnerable to deepfake manipulation40. This study demonstrates the feasibility of satellite-based monitoring for surveying rhino populations and provides a foundation for a scalable monitoring approach in support of their conservation.

Methods

Satellite image acquisition

We used six cloud-free, high-resolution WorldView-3 satellite images (Maxar Technologies) acquired between 2015 and 2022 (Table 1). Each image had a ground sample distance of 33–36 cm and covered a substantial portion of the farm. Across all six dates, the total area surveyed was 188.6 km², with some locations imaged more than once across the six images. The farm is located in the North West province of South Africa, where the climate is characterized by a wet summer season (November–March) and a dry winter season (April–October). All satellite imagery in this study was acquired during the dry season, spanning its early (May), mid (July), and late (September) phases, when reduced vegetation cover and clearer atmospheric conditions generally enhance detectability. This site was selected to minimize potential confusion with other African megafauna and to ensure consistent background conditions for the detection task; it is not representative of the broader range of habitats occupied by rhinos, and other locations may present either greater or lesser challenges for detection than those examined here. To produce high-quality labeled training data, annotations were carried out by a professional annotation service (https://labelyourdata.com/). Each image was independently annotated by two trained annotators, with all instances of disagreement adjudicated by a senior reviewer to verify the presence or absence of rhinoceroses. Any annotations unique to a single annotator were subjected to an additional verification step to confirm accuracy before inclusion in the final dataset. In several instances, small trees and shrubs were initially misidentified as rhinoceroses; however, temporal image differencing revealed that these objects were stationary, and they were excluded from the training dataset.

Table 1 Summary of WorldView-3 satellite images used in the training (3,229 unique rhinos), validation (513 unique rhinos), and test (481 unique rhinos) datasets; each image contains between 0 and 64 rhinos.

This consensus-based procedure yielded 4,223 validated rhinoceros annotations. To further improve annotation reliability and reduce the likelihood of false positives arising from visually similar static features (e.g., rocks, shrubs, shadows), a temporal image differencing method was employed. This approach involved confirming the presence of a suspected rhinoceros by examining its location in images acquired at different times, as shown in Fig. 1. Only those objects that exhibited positional displacement consistent with rhino movement were retained.
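A minimal sketch of this verification step is given below, assuming the two acquisitions are co-registered and candidate detections are expressed as pixel coordinates; the file names, patch size, and change threshold are illustrative rather than the values used in the study.

```python
import numpy as np
import rasterio
from rasterio.windows import Window

HALF = 16  # half-width in pixels of the patch around each candidate (illustrative)

def moved_between_acquisitions(path_t1, path_t2, row, col, threshold=0.15):
    """Return True if the patch around (row, col) differs markedly between
    two acquisitions, consistent with a moving animal rather than a static
    feature such as a rock or shrub."""
    with rasterio.open(path_t1) as src1, rasterio.open(path_t2) as src2:
        win = Window(col - HALF, row - HALF, 2 * HALF, 2 * HALF)
        patch1 = src1.read(1, window=win).astype(np.float32)
        patch2 = src2.read(1, window=win).astype(np.float32)
    # Standardize each patch to reduce illumination differences between dates.
    patch1 = (patch1 - patch1.mean()) / (patch1.std() + 1e-6)
    patch2 = (patch2 - patch2.mean()) / (patch2.std() + 1e-6)
    # Static features yield a low mean absolute difference; an object present
    # in only one acquisition yields a high difference.
    return float(np.abs(patch1 - patch2).mean()) > threshold
```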

Preprocessing and model selection

Given the large spatial dimensions of the WorldView-3 satellite imagery and the irregular extents of the georeferenced TIFF files (imagery in which each pixel is assigned precise geographic coordinates), we adopted an image tiling strategy to facilitate efficient training. The area surveyed covers 188.6 km², which corresponds to roughly 1.73 billion pixels at 33 cm resolution. Each image was subdivided into uniformly sized, non-overlapping tiles of 512 × 512 pixels. This patch size was selected to meet the input requirements of deep-learning architectures while preserving the fine spatial resolution necessary to detect small, visually cryptic targets such as rhinoceroses. Care was taken during the tiling process to preserve spatial accuracy and to prevent the truncation of rhinos at tile edges. Subsequent preprocessing and quality control were conducted using QGIS (Quantum GIS)41, an open-source geographic information system. These steps included manual inspection to ensure annotation consistency, correction of labeling discrepancies (i.e., tightening labels around each rhino), and conversion of polygon shapefiles (.shp) into the bounding-box format compatible with the YOLO training framework. The final dataset was partitioned into 1,700 training images, 150 validation images, and 300 test images. Table 2 shows the hyperparameters used during training.
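The sketch below illustrates the tiling and label-conversion steps, assuming the rasterio library and a single GeoTIFF input; the file names are illustrative, and the to_yolo_bbox helper is a hypothetical convenience function rather than part of the study's released code.

```python
import rasterio
from rasterio.windows import Window

TILE = 512  # tile edge length in pixels

with rasterio.open("worldview3_scene.tif") as src:  # illustrative file name
    profile = src.profile.copy()
    for row_off in range(0, src.height - TILE + 1, TILE):     # step = TILE,
        for col_off in range(0, src.width - TILE + 1, TILE):  # so no overlap
            window = Window(col_off, row_off, TILE, TILE)
            # Update the georeferencing so each tile keeps precise coordinates.
            profile.update(height=TILE, width=TILE,
                           transform=src.window_transform(window))
            with rasterio.open(f"tiles/tile_{row_off}_{col_off}.tif",
                               "w", **profile) as dst:
                dst.write(src.read(window=window))

def to_yolo_bbox(minx, miny, maxx, maxy, tile_size=TILE):
    """Convert a pixel-space bounding box within a tile to the normalized
    (x_center, y_center, width, height) format expected by YOLO."""
    return ((minx + maxx) / 2 / tile_size, (miny + maxy) / 2 / tile_size,
            (maxx - minx) / tile_size, (maxy - miny) / tile_size)
```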

These numbers refer to images, each of which may contain a different number of rhinos, whereas Table 1 refers to the number of unique rhino “objects” in the imagery. For object detection, we employed YOLOv12x27, the largest variant of the recently released YOLOv12 model family. We selected YOLOv12x for its high-resolution feature extraction and strong multi-scale detection performance, which enable the identification of small, visually cryptic targets such as rhinos.

Table 2 Hyperparameter values for YOLOv12 training.

Synthetic data generation

To explore the potential of synthetic data for augmenting model training, we constructed a 3D scene in Blender to emulate the satellite observation perspective. Blender is an open-source 3D computer-graphics application used for creating 3D models and simulations42. To generate synthetic imagery, a virtual camera was positioned directly overhead of a flat plane representing the ground. The plane was textured using patches cut from real satellite images to create realistic backgrounds. For each synthetic image, a random patch was selected, with all patches manually screened to remove those containing cloud cover, visible reserve infrastructure, or real rhinoceroses. Three-dimensional rhinoceros models were randomly distributed across the scene with variable scaling and rotation parameters, simulating natural variation in animal size and orientation, as shown in Fig. 4. The models were placed sparsely to reflect the realistic densities observed in the original dataset.
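The following is a minimal sketch of this scene-construction step using Blender's Python API (bpy); the object name, plane size, placement counts, and scale range are illustrative assumptions, and the background texturing and label export are omitted.

```python
# Run inside Blender's Python environment.
import random
import bpy

GROUND_SIZE = 50.0  # ground-plane extent in Blender units (assumed)
template = bpy.data.objects["rhino_model"]  # pre-made rhino mesh (assumed name)

for _ in range(random.randint(1, 5)):  # sparse placement mirrors real densities
    rhino = template.copy()
    rhino.data = template.data.copy()
    # Random position on the plane, random heading, and variable size.
    rhino.location = (random.uniform(-GROUND_SIZE / 2, GROUND_SIZE / 2),
                      random.uniform(-GROUND_SIZE / 2, GROUND_SIZE / 2), 0.0)
    rhino.rotation_euler = (0.0, 0.0, random.uniform(0.0, 6.283))
    s = random.uniform(0.9, 1.1)
    rhino.scale = (s, s, s)
    bpy.context.collection.objects.link(rhino)

# Render from the overhead (nadir) camera at the training tile size.
scene = bpy.context.scene
scene.render.resolution_x = 512
scene.render.resolution_y = 512
scene.render.filepath = "//synthetic_0001.png"
bpy.ops.render.render(write_still=True)
```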

Fig. 4

Top row: examples of real satellite detections of rhinos. Bottom row: synthetic rhino images generated in Blender using realistic 3D models overlaid on actual satellite backgrounds. WorldView-3 imagery acquired on 3 September 2019. Imagery from Maxar Technologies.

Synthetic images were rendered at a resolution of 512 × 512 pixels to match the dimensions of the real image tiles used for training. While this approach enabled the rapid generation of labeled training data, initial training runs using YOLOv12x on the purely synthetic dataset yielded low performance, with a baseline average precision (AP) of 0.26, substantially below that of the real-image model. The model had difficulty generalizing from synthetic scenes, likely due to differences in texture fidelity and visual complexity, despite extensive hyperparameter tuning using grid search. In this process, key hyperparameters, including the learning rate (0.001), momentum (0.937), weight decay (0.0005), optimizer (SGD), image size (512), batch size (8), epochs (500), and training options (augmentation, multi-scale input, mixed precision, checkpoint saving), were systematically varied across predefined values to identify the best-performing configuration.
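A sketch of a training call matching this configuration is shown below, assuming the Ultralytics Python interface through which the YOLOv12 family is distributed; the weight and dataset file names are illustrative.

```python
from ultralytics import YOLO

model = YOLO("yolo12x.pt")  # YOLOv12x weights (file name assumed)
model.train(
    data="rhino_synthetic.yaml",  # YOLO-format dataset description (assumed)
    epochs=500,
    imgsz=512,
    batch=8,
    optimizer="SGD",
    lr0=0.001,
    momentum=0.937,
    weight_decay=0.0005,
    multi_scale=True,  # multi-scale input
    amp=True,          # mixed-precision training
    save=True,         # checkpoint saving
)
# Standard augmentations (e.g., mosaic, horizontal flips) are enabled by
# default in the Ultralytics trainer.
```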

The synthetic data went through several iterations in which the size, distribution, and number of 3D model placements were varied to identify the configuration to which the detection model responded best.

Differentiating elephant and rhino

To assess whether African elephants (Loxodonta africana) and white rhinoceroses (Ceratotherium simum simum) can be distinguished in satellite imagery, we trained a YOLOv12 object detection model on a combined dataset. Elephant training data were sourced from Duporge et al. (2021)10,43 and combined with the newly annotated rhino imagery. All images were acquired from very high-resolution commercial satellites (30–50 cm per pixel), and annotations were manually generated by trained observers using bounding boxes to delineate individual animals. The final dataset comprised 7,532 annotations in the training set, 324 test images, and 125 validation images, containing 1,126 elephant and 4,223 rhino annotations with varying numbers of each species per image. We used an 80–10–10 split for training, validation, and testing.
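A minimal sketch of such an 80–10–10 split is shown below; the directory layout, file extension, and random seed are illustrative assumptions.

```python
import random
from pathlib import Path

random.seed(0)  # fixed seed for a reproducible split (assumed)
images = sorted(Path("combined/images").glob("*.tif"))  # illustrative layout
random.shuffle(images)

n = len(images)
splits = {"train": images[:int(0.8 * n)],
          "val":   images[int(0.8 * n):int(0.9 * n)],
          "test":  images[int(0.9 * n):]}

# Write one image list per split, as accepted by YOLO dataset configs.
for name, subset in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(p) for p in subset) + "\n")
```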

Human evaluation of synthetic data

To assess the visual realism of our synthetic data, we conducted a blinded classification task involving human annotators (Princeton University freshman students). Eight participants, none of whom were involved in the project and all with little to no prior exposure to satellite wildlife imagery, were asked to classify images as either real or synthetic. The dataset comprised an equal mix of real and AI-generated high-resolution satellite images of rhinoceroses. The 108 images (54 real, 54 synthetic) were divided into two labeling tasks. Annotators were instructed to categorize each image as “real” or “synthetic” without any prior training or guidance. Results revealed that participants frequently misclassified synthetic images as real, underscoring the high visual fidelity of the generated data (Fig. 4).