Earth Observation (EO) data have become widely available since the launch of the first Landsat satellite in 1972. With the development of modern observation instruments, EO data are now gathered from a range of platforms — satellites, aircraft, drones and ground-based sensors — enabling the continuous monitoring of the state and evolution of Earth. This offers a global view of processes unfolding across our planet, from the atmosphere and oceans to farmland and growing urban areas. Nevertheless, converting such a wealth of data, in the form of remote-sensing imagery and geospatial vector and raster data, into useful information for decision makers is challenging. Strategies and tools are needed that can handle the various data sources, their modalities, and their different spatial and temporal resolutions.

The rise of deep learning and the growth of computational power over the past few decades have been game changers in the processing of EO data, with applications in domains such as Earth system science, urban computing, geospatial semantics and remote sensing. Deep learning approaches have also provided solutions for several downstream EO-related tasks, such as segmentation and object detection in remote-sensing images. A recent analysis paper1 highlighted the positive role of artificial intelligence (AI) tools in remote-sensing applications towards the United Nations Sustainable Development Goals.

However, task-specific machine learning models are limited by the availability of good-quality labelled data and struggle to generalize. Researchers have therefore pivoted away from developing task-specific models towards constructing foundation models, which enable knowledge transfer across tasks in different domains through the pretraining–finetuning paradigm. Foundation models also deliver greatly improved performance and could be a more sustainable approach than developing many task-specific models.
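To make the pretraining–finetuning paradigm concrete, the minimal sketch below adapts a generically pretrained vision backbone to a downstream task. The ImageNet-pretrained ResNet-50 backbone and the ten-class land-cover task are illustrative assumptions, not a specific published EO model.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative assumptions: an ImageNet-pretrained ResNet-50 stands in for
# an EO foundation model, and the downstream task is a hypothetical
# ten-class land-cover classification problem.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained representation so only the new head is trained.
for param in backbone.parameters():
    param.requires_grad = False

# Swap in a task-specific classification head.
num_land_cover_classes = 10  # hypothetical label set
backbone.fc = nn.Linear(backbone.fc.in_features, num_land_cover_classes)

optimizer = torch.optim.AdamW(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def finetune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """Run one finetuning step on a batch of downstream imagery."""
    optimizer.zero_grad()
    loss = criterion(backbone(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the pretrained representation is reused wholesale and only the small task head is optimized, the downstream task needs far less labelled data than training a model from scratch.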

Many geo-specific foundation models have appeared in recent years, as documented in a survey paper2 that covers 58 remote-sensing vision foundation models developed between June 2021 and June 2024. These models demonstrate the importance of innovative pretraining methods that can handle diverse, multi-modal data and enhance the performance and robustness of models across tasks in remote-sensing applications. In an Article in this issue, Wu et al. develop a new pretraining strategy that combines multi-modal data with the semantic information contained in those datasets. They trained a remote-sensing foundation model on a curated dataset of 27 million images spanning three modalities and eleven satellite platforms. The model adapts to downstream tasks across seven domains, including wildlife detection in Kenya, oil spill segmentation, land surveying and flood monitoring.
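The specifics of Wu et al.'s strategy are given in their Article; purely as a generic illustration of how multi-modal pretraining can align co-located observations, the sketch below applies CLIP-style contrastive (InfoNCE) alignment between two toy modality encoders. The encoders, embedding dimension and choice of optical and SAR inputs are assumptions for illustration and do not depict Wu et al.'s method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Toy encoder mapping one modality's image patches to a shared
    embedding space (architecture and dimensions are illustrative)."""
    def __init__(self, in_channels: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)

optical_encoder = ModalityEncoder(in_channels=3)  # e.g., RGB optical
sar_encoder = ModalityEncoder(in_channels=2)      # e.g., dual-pol SAR

def contrastive_loss(optical: torch.Tensor, sar: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss: co-located optical/SAR pairs are pulled together in
    the shared space; all other pairings in the batch are pushed apart."""
    z_opt = optical_encoder(optical)
    z_sar = sar_encoder(sar)
    logits = z_opt @ z_sar.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Pretraining with an objective of this kind yields a single representation into which any modality can be projected, which is what makes one model transferable across sensors and tasks.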

More recently, Google released the AlphaEarth Foundations model3, which weaves together trillions of images acquired from observation instruments to map the planet “from local to global scales”. The team explains that AlphaEarth essentially acts as a ‘virtual satellite’, yielding highly general representations across multiple geospatial sources, and has made the generated embeddings openly available in Google Earth Engine4.
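As a hedged illustration of how such embeddings can be queried, the sketch below uses the Earth Engine Python API. The dataset identifier and the 64-band embedding layout are assumptions based on Google's public data catalogue and should be verified there; the location and year are arbitrary.

```python
import ee

ee.Initialize()  # assumes an authenticated Earth Engine account

# Assumed identifier for the AlphaEarth annual embedding dataset; check
# the Earth Engine Data Catalog for the authoritative ID.
EMBEDDINGS = "GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL"

point = ee.Geometry.Point([36.82, -1.29])  # arbitrary example: Nairobi

# Pick the embedding image covering the point for one year.
embedding_image = ee.Image(
    ee.ImageCollection(EMBEDDINGS)
    .filterDate("2023-01-01", "2024-01-01")
    .filterBounds(point)
    .first()
)

# Sample the per-pixel embedding vector (64 bands, per the catalogue)
# at the chosen location.
sample = embedding_image.sample(region=point, scale=10).first()
print(sample.getInfo())
```

Embeddings like these can then feed lightweight downstream classifiers, sidestepping the cost of reprocessing raw imagery for every new task.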

With ongoing observation platform launches, the pace of data collection is accelerating, particularly in the acquisition of high-resolution images, further expanding the scope for foundation models to scale and adapt. So far, efforts in geospatial AI models have mainly focused on increasing model complexity for improved performance. However, as these models are intended for environmental studies such as climate change prediction, deforestation monitoring and related tasks, attention needs to be paid to the models’ own environmental impact1, including that associated with the storage and processing of large amounts of data in data centres. As highlighted by Lu et al.2, a challenge for geo-specific foundation models as they grow in size and complexity is to become more resource efficient, particularly for deployment in real-time monitoring or in environments with limited computational resources. Developers should also address challenges beyond model architecture, such as privacy issues arising when high-resolution images capture personal information5.