Abstract
Remote sensing foundation models, pretrained on massive remote sensing data, have shown impressive performance in several Earth observation (EO) tasks. These models usually use single-modal temporal data for pretraining, which is insufficient for multi-modal applications. Moreover, these models require a considerable number of samples for fine-tuning in downstream tasks, posing challenges in time-sensitive scenarios such as rapid flood mapping. We present SkySense++, a multi-modal remote sensing foundation model for diverse EO tasks. SkySense++ has a factorized architecture to accommodate multi-modal images acquired by diverse sensors. We adopt a two-stage progressive pretraining strategy on meticulously curated datasets of 27 million multi-modal remote sensing images. The first, representation-enhanced pretraining stage uses multi-granularity contrastive learning to obtain general representations. The second, semantic-enhanced pretraining stage leverages masked semantic learning to obtain semantically enriched representations, endowing the model with few-shot capabilities: it can handle unseen tasks with minimal labelled data, alleviating the need for fine-tuning on extensive annotated data. SkySense++ demonstrates consistent improvements in classification, detection and segmentation over previous state-of-the-art models across 12 EO tasks in 7 domains: agriculture, forestry, oceanography, atmosphere, biology, land surveying and disaster management. This generalizability may open a new chapter in the application of remote sensing foundation models to EO tasks at scale.
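To make the two-stage recipe concrete, the sketch below shows, in PyTorch, one way the two pretraining objectives could be wired together. It is a minimal illustration under our own assumptions: the TokenEncoder, the single-granularity InfoNCE loss and the random semantic codes are placeholders rather than the SkySense++ implementation (see the released code for the latter).

```python
# Minimal sketch of the two pretraining stages described in the abstract.
# All module names, sizes and pseudo-labels are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenEncoder(nn.Module):
    """Stand-in for one modality branch of the factorized backbone."""
    def __init__(self, in_dim=256, dim=256, depth=2, heads=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):                     # tokens: (B, N, in_dim)
        return self.blocks(self.proj(tokens))      # (B, N, dim)

def info_nce(z1, z2, tau=0.07):
    """Stage 1 (representation-enhanced): contrastive loss between two
    views; the paper applies such losses at multiple granularities."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                     # (B, B) similarity matrix
    return F.cross_entropy(logits, torch.arange(len(z1)))

class MaskedSemanticHead(nn.Module):
    """Stage 2 (semantic-enhanced): replace masked tokens with a learned
    token and predict discrete semantic codes for the masked positions."""
    def __init__(self, dim=256, vocab=1024):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, vocab)

    def forward(self, encoder, tokens, mask):      # mask: (B, N) bool
        x = encoder.proj(tokens)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.head(encoder.blocks(x))        # (B, N, vocab) logits

# Toy usage: a batch of 8 images, 196 tokens each.
enc = TokenEncoder()
v1, v2 = torch.randn(8, 196, 256), torch.randn(8, 196, 256)
stage1_loss = info_nce(enc(v1).mean(1), enc(v2).mean(1))

sem = MaskedSemanticHead()
mask = torch.rand(8, 196) < 0.4                    # mask 40% of tokens
codes = torch.randint(0, 1024, (8, 196))           # placeholder targets
logits = sem(enc, v1, mask)
stage2_loss = F.cross_entropy(logits[mask], codes[mask])
```

In the released model, the semantic targets are produced by the pretraining pipeline rather than sampled at random, and the contrastive term is applied at multiple granularities; the sketch fixes only the overall control flow.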
Data availability
The pretraining data and EO benchmarks used in this work are available via Zenodo at https://doi.org/10.5281/zenodo.14994429 (ref. 94). In line with the collaborative agreement between Wuhan University and Ant Group on data redistribution compliance, we checked the user agreements of the original data providers. Some prohibit redistribution (for example, DeepGlobe) and some lack explicit redistribution guidelines (for example, Potsdam); for these datasets, the Zenodo record (ref. 94) therefore provides download links rather than the data themselves. Interested researchers are required to sign the user agreements directly with the original data providers before accessing the datasets. Source data are provided with this paper.
Code availability
The code implemented in this work is available via GitHub at https://github.com/kang-wu/SkySensePlusPlus (ref. 95).
References
Chen, S. et al. Amazon forest biogeography predicts resilience and vulnerability to drought. Nature 631, 111–117 (2024).
Rohde, M. M. et al. Groundwater-dependent ecosystem map exposes global dryland protection needs. Nature 632, 101–107 (2024).
Mo, L. et al. Integrated global assessment of the natural forest carbon potential. Nature 624, 92–101 (2023).
Paolo, F. S. et al. Satellite mapping reveals extensive industrial activity at sea. Nature 625, 85–91 (2024).
Shen, H., Meng, X. & Zhang, L. An integrated framework for the spatio–temporal–spectral fusion of remote sensing images. IEEE Trans. Geosci. Remote Sens. 54, 7135–7148 (2016).
Yuan, Q. et al. Deep learning in environmental remote sensing: achievements and challenges. Remote Sens. Environ. 241, 111716 (2020).
Sun, X. et al. RingMo: a remote sensing foundation model with masked image modeling. IEEE Trans. Geosci. Remote Sens. 61, 5612822 (2023).
Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).
Pai, S. et al. Foundation model for cancer imaging biomarkers. Nat. Mach. Intell. 6, 354–367 (2024).
Bran, A. M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).
Cong, Y. et al. SatMAE: pre-training transformers for temporal and multi-spectral satellite imagery. In Proc. Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 197–211 (Curran Associates, 2022).
Muhtar, D., Zhang, X., Xiao, P., Li, Z. & Gu, F. CMID: a unified self-supervised learning framework for remote sensing image understanding. IEEE Trans. Geosci. Remote Sens. 61, 5607817 (2023).
Mall, U., Hariharan, B. & Bala, K. Change-aware sampling and contrastive learning for satellite images. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 5261–5270 (IEEE, 2023).
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J. & Kembhavi, A. SatlasPretrain: a large-scale dataset for remote sensing image understanding. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 16772–16782 (IEEE, 2023).
Mendieta, M. et al. Towards geospatial foundation models via continual pretraining. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 16806–16816 (IEEE, 2023).
Reed, C. J. et al. Scale-MAE: a scale-aware masked autoencoder for multiscale geospatial representation learning. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 4088–4099 (IEEE, 2023).
Xiong, Z. et al. Neural plasticity-inspired foundation model for observing the Earth crossing modalities. Preprint at https://arxiv.org/abs/2403.15356 (2024).
Li, W. et al. Self-supervised learning for SAR ATR with a knowledge-guided predictive architecture. ISPRS J. Photogramm. Remote Sens. 218, 326–338 (2024).
Wang, Y., Albrecht, C. M. & Zhu, X. X. Self-supervised vision transformers for joint SAR-optical representation learning. In Proc. IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 139–142 (IEEE, 2022).
Fuller, A., Millard, K. & Green, J. R. CROMA: remote sensing representations with contrastive radar-optical masked autoencoders. In Proc. Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 5506–5538 (Curran Associates, 2023).
He, K. et al. Masked autoencoders are scalable vision learners. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 16000–16009 (IEEE, 2022).
Xie, Z. et al. SimMIM: a simple framework for masked image modeling. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 9653–9663 (IEEE, 2022).
Ahlswede, S. et al. TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing. Earth Syst. Sci. Data 15, 681–695 (2023).
Tarasiou, M., Chavez, E. & Zafeiriou, S. ViTs for SITS: vision transformers for satellite image time series. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 10418–10428 (IEEE, 2023).
Wang, D. et al. MTP: advancing remote sensing foundation model via multi-task pretraining. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 11632–11654 (2024).
Guo, X. et al. SkySense: a multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 27672–27683 (IEEE, 2024).
Brown, T. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems Vol. 33 (eds Larochelle, H. et al.) 1877–1901 (Curran Associates, 2020).
Bar, A., Gandelsman, Y., Darrell, T., Globerson, A. & Efros, A. Visual prompting via image inpainting. In Proc. Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 25005–25017 (Curran Associates, 2022).
Wang, X., Wang, W., Cao, Y., Shen, C. & Huang, T. Images speak in images: a generalist painter for in-context visual learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 6830–6839 (IEEE, 2023).
Wang, X. et al. SegGPT: towards segmenting everything in context. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 1130–1140 (IEEE, 2023).
Rußwurm, M. & Körner, M. Multi-temporal land cover classification with sequential recurrent encoders. ISPRS Int. J. Geo-Inf. 7, 129 (2018).
Ahlswede, S. et al. TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing. Earth Syst. Sci. Data 15, 681–695 (2023).
Bragagnolo, L., da Silva, R. V. & Grzybowski, J. M. V. Towards the automatic monitoring of deforestation in Brazilian rainforest. Ecol. Inform. 66, 101454 (2021).
Zhu, Q. et al. Oil spill contextual and boundary-supervised detection network based on marine SAR images. IEEE Trans. Geosci. Remote Sens. 60, 5213910 (2022).
Rowley, A. & Karakuş, O. Predicting air quality via multimodal AI and satellite imagery. Remote Sens. Environ. 293, 113609 (2023).
Eikelboom, J. A. J. et al. Improving the precision and accuracy of animal population estimates with aerial image object detection. Methods Ecol. Evol. 10, 1875–1887 (2019).
Hong, D. et al. Cross-city matters: a multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks. Remote Sens. Environ. 299, 113856 (2023).
Zhang, C. et al. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 166, 183–200 (2020).
Cloud to Street - Microsoft Flood and Clouds Dataset. Registry of Open Data https://registry.opendata.aws/c2smsfloods (2023).
Cambrin, D. R., Colomba, L. & Garza, P. CaBuAr: California burned areas dataset for delineation. IEEE Geosci. Remote Sens. Mag. 11, 106–113 (2023).
Zhang, X., Yu, W., Pun, M.-O. & Shi, W. Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning. ISPRS J. Photogramm. Remote Sens. 197, 1–17 (2023).
Gupta, R. et al. Creating xBD: a dataset for assessing building damage from satellite imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 10–17 (IEEE, 2019).
Astruc, G., Gonthier, N., Mallet, C. & Landrieu, L. OmniSat: self-supervised modality fusion for Earth observation. In Proc. European Conference on Computer Vision (ECCV) (eds Leonardis, A. et al.) 409–427 (Springer, 2024).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2017).
Wang, K., Liew, J. H., Zou, Y., Zhou, D. & Feng, J. PANet: few-shot image semantic segmentation with prototype alignment. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 9197–9206 (IEEE, 2019).
Li, X. et al. MCANet: a joint semantic segmentation framework of optical and SAR images for land use classification. Int. J. Appl. Earth Obs. Geoinf. 106, 102638 (2022).
Zhang, B., Xiao, J. & Qin, T. Self-guided and cross-guided learning for few-shot segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 8312–8321 (IEEE, 2021).
Jia, Y., Gao, J., Huang, W., Yuan, Y. & Wang, Q. Holistic mutual representation enhancement for few-shot remote sensing segmentation. IEEE Trans. Geosci. Remote Sens. 61, 5622613 (2023).
Wanyan, X., Seneviratne, S., Shen, S. & Kirley, M. Extending global-local view alignment for self-supervised learning with remote sensing imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 2443–2453 (IEEE, 2024).
Rahnemoonfar, M. et al. FloodNet: a high resolution aerial imagery dataset for post flood scene understanding. IEEE Access 9, 89644–89654 (2021).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 618–626 (IEEE, 2017).
Dai, D. & Yang, W. Satellite image classification via two-layer sparse coding with biased image representation. IEEE Geosci. Remote Sens. Lett. 8, 173–176 (2011).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. International Conference on Machine Learning 1126–1135 (PMLR, 2017).
Li, Z., Zhou, F., Chen, F. & Li, H. Meta-SGD: learning to learn quickly for few-shot learning. Preprint at https://arxiv.org/abs/1707.09835 (2017).
Snell, J., Swersky, K. & Zemel, R. Prototypical networks for few-shot learning. In Proc. Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017).
Sung, F. et al. Learning to compare: relation network for few-shot learning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1199–1208 (IEEE, 2018).
Zhang, C., Cai, Y., Lin, G. & Shen, C. DeepEMD: few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 12203–12213 (IEEE, 2020).
Zhang, P., Bai, Y., Wang, D., Bai, B. & Li, Y. Few-shot classification of aerial scene images via meta-learning. Remote Sens. 13, 108 (2021).
Li, H. et al. RS-MetaNet: deep meta metric learning for few-shot remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 59, 6983–6994 (2021).
Li, L., Han, J., Yao, X., Cheng, G. & Guo, L. DLA-MatchNet for few-shot remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 59, 7844–7853 (2021).
Zhang, B. et al. SGMNet: scene graph matching network for few-shot remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 60, 5628915 (2022).
Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).
Touvron, H. et al. Llama: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).
Dubey, A. et al. The Llama 3 herd of models. Preprint at https://arxiv.org/abs/2407.21783 (2024).
Wang, Y. et al. SSL4EO-S12: a large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. IEEE Geosci. Remote Sens. Mag. 11, 98–106 (2023).
Sumbul, G. et al. BigEarthNet: a large-scale benchmark archive for remote sensing image understanding. In Proc. IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 5901–5904 (IEEE, 2019).
Christie, G., Fendley, N., Wilson, J. & Mukherjee, R. Functional map of the world. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 6172–6180 (IEEE, 2018).
Manas, O., Lacoste, A., Giró-i Nieto, X., Vazquez, D. & Rodriguez, P. Seasonal contrast: unsupervised pre-training from uncurated remote sensing data. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 9414–9423 (IEEE, 2021).
Long, Y. et al. On creating benchmark dataset for aerial image interpretation: reviews, guidances and Million-AID. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 14, 4205–4230 (2021).
Tong, X.-Y., Xia, G.-S. & Zhu, X. X. Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS J. Photogramm. Remote Sens. 196, 178–196 (2023).
2D semantic labeling contest—Potsdam. ISPRS https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (2018).
2D semantic labeling—Vaihingen data. ISPRS https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx (2018).
Demir, I. et al. DeepGlobe 2018: a challenge to parse the Earth through satellite images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops 172–181 (IEEE, 2018).
Waqas Zamir, S. et al. iSAID: a large-scale dataset for instance segmentation in aerial images. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 28–37 (IEEE, 2019).
Wang, J., Zheng, Z., Ma, A., Lu, X. & Zhong, Y. LoveDA: a remote sensing land-cover dataset for domain adaptive semantic segmentation. In Proc. Neural Information Processing Systems Track on Datasets and Benchmarks Vol. 1 (eds Vanschoren, J. et al.) (Curran Associates, 2021).
Toker, A. et al. DynamicEarthNet: daily multi-spectral satellite dataset for semantic change segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 21158–21167 (IEEE, 2022).
Garnot, V. S. F., Landrieu, L. & Chehata, N. Multi-modal temporal attention models for crop mapping from satellite time series. ISPRS J. Photogramm. Remote Sens. 187, 294–305 (2022).
Garioud, A. et al. FLAIR: a country-scale land cover semantic segmentation dataset from multi-source optical imagery. In Proc. Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) (Curran Associates, 2024).
Schmitt, M., Hughes, L., Ghamisi, P., Yokoya, N. & Hänsch, R. 2020 IEEE GRSS data fusion contest. IEEE DataPort https://doi.org/10.21227/rha7-m332 (2020).
Wolters, P., Bastani, F. & Kembhavi, A. Zooming out on zooming in: advancing super-resolution for remote sensing. Preprint at https://arxiv.org/abs/2311.18082 (2023).
Koßmann, D., Brack, V. & Wilhelm, T. SeasoNet: a seasonal scene classification, segmentation and retrieval dataset for satellite imagery over Germany. In Proc. IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 243–246 (IEEE, 2022).
Wang, F. et al. Scaling efficient masked image modeling on large remote sensing dataset. Preprint at https://arxiv.org/abs/2406.11933 (2024).
Schmitt, M., Hughes, L. H., Qiu, C. & Zhu, X. X. SEN12MS – a curated dataset of georeferenced multi-spectral Sentinel-1/2 imagery for deep learning and data fusion. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. IV-2/W7, 153–160 (2019).
Tong, X.-Y. et al. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 237, 111322 (2020).
Johnson, N., Treible, W. & Crispell, D. OpenSentinelMap: a large-scale land use dataset using OpenStreetMap and Sentinel-2 imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 1333–1341 (IEEE, 2022).
Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. In Proc. International Conference on Learning Representations (ICLR) (OpenReview, 2021).
Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017).
Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 9650–9660 (IEEE, 2021).
He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 9729–9738 (IEEE, 2020).
Jang, E., Gu, S. & Poole, B. Categorical reparameterization with Gumbel-Softmax. In Proc. International Conference on Learning Representations (ICLR) (OpenReview, 2017).
Wu, K., Yu, L., Zhang, J. & Sun, Y. Pretraining datasets of SkySense++. Zenodo https://doi.org/10.5281/zenodo.14994429 (2025).
Wu, K., Zhang, Y. & Ru, L. Code of SkySense++. Zenodo https://doi.org/10.5281/zenodo.15378721 (2025).
Acknowledgements
Y.L. acknowledges the support of the National Natural Science Foundation of China (grant nos. 42030102 and 42371321), the National Key Research and Development Program of China (grant no. 2024YFB3909001) and Ant Group. We thank X. Guo for his valuable advice in the initial phase of this work.
Author information
Contributions
K.W., Yingying Zhang, L.R., J.C., Yongjun Zhang and Y.L.: conceptualization. K.W., Yingying Zhang and L.R.: methodology, investigation, development, analysis, writing—original draft, writing—review and editing. Y.L., Yongjun Zhang and J.C.: supervision, methodology, analysis, writing—review and editing. J.W.: supervision, methodology and analysis. M.Y.: supervision, analysis, writing—review and editing. B.D., J. Lao, J. Luo and Q.Z.: analysis. K.W., L.Y., Z.Z., Y.S. and J.Z.: data curation. K.W., J. Luo and Q.Z. contributed to this work during their internships at Ant Group.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Congcong Wen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–12, Sections 1–18 and Tables 1–19.
Source data
Source Data Fig. 1
Statistical source data for plots in Fig. 1.
Source Data Fig. 3
Statistical source data for plots in Fig. 3.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, K., Zhang, Y., Ru, L. et al. A semantic-enhanced multi-modal remote sensing foundation model for Earth observation. Nat Mach Intell 7, 1235–1249 (2025). https://doi.org/10.1038/s42256-025-01078-8