Abstract
Remote sensing foundation models, pretrained on massive remote sensing data, have shown impressive performance in several Earth observation (EO) tasks. These models usually use single-modal temporal data for pretraining, which is insufficient for multi-modal applications. Moreover, these models require a considerable number of samples for fine-tuning in downstream tasks, posing challenges in time-sensitive scenarios such as rapid flood mapping. We present SkySense++, a multi-modal remote sensing foundation model for diverse EO tasks. SkySense++ has a factorized architecture to accommodate multi-modal images acquired by diverse sensors. We adopt a two-stage progressive pretraining strategy on meticulously curated datasets of 27 million multi-modal remote sensing images. The first, representation-enhanced pretraining stage uses multi-granularity contrastive learning to obtain general representations. The second, semantic-enhanced pretraining stage leverages masked semantic learning to obtain semantically enriched representations, endowing the model with few-shot capabilities: it can handle unseen tasks with minimal labelled data, alleviating the need for fine-tuning on extensive annotated data. SkySense++ demonstrates consistent improvements in classification, detection and segmentation over previous state-of-the-art models across 12 EO tasks in 7 domains: agriculture, forestry, oceanography, atmosphere, biology, land surveying and disaster management. This generalizability may open a new chapter in the application of remote sensing foundation models to EO tasks at scale.
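To make the two-stage recipe concrete, the sketch below shows, in PyTorch, one way the two pretraining objectives could be wired together. It is a minimal illustration under our own assumptions: the TokenEncoder, the single-granularity InfoNCE loss and the random semantic codes are placeholders rather than the SkySense++ implementation (see the released code for the latter).

```python
# Minimal sketch of the two pretraining stages described in the abstract.
# All module names, sizes and pseudo-labels are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenEncoder(nn.Module):
    """Stand-in for one modality branch of the factorized backbone."""
    def __init__(self, in_dim=256, dim=256, depth=2, heads=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):                     # tokens: (B, N, in_dim)
        return self.blocks(self.proj(tokens))      # (B, N, dim)

def info_nce(z1, z2, tau=0.07):
    """Stage 1 (representation-enhanced): contrastive loss between two
    views; the paper applies such losses at multiple granularities."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                     # (B, B) similarity matrix
    return F.cross_entropy(logits, torch.arange(len(z1)))

class MaskedSemanticHead(nn.Module):
    """Stage 2 (semantic-enhanced): replace masked tokens with a learned
    token and predict discrete semantic codes for the masked positions."""
    def __init__(self, dim=256, vocab=1024):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, vocab)

    def forward(self, encoder, tokens, mask):      # mask: (B, N) bool
        x = encoder.proj(tokens)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.head(encoder.blocks(x))        # (B, N, vocab) logits

# Toy usage: a batch of 8 images, 196 tokens each.
enc = TokenEncoder()
v1, v2 = torch.randn(8, 196, 256), torch.randn(8, 196, 256)
stage1_loss = info_nce(enc(v1).mean(1), enc(v2).mean(1))

sem = MaskedSemanticHead()
mask = torch.rand(8, 196) < 0.4                    # mask 40% of tokens
codes = torch.randint(0, 1024, (8, 196))           # placeholder targets
logits = sem(enc, v1, mask)
stage2_loss = F.cross_entropy(logits[mask], codes[mask])
```

In the released model, the semantic targets are produced by the pretraining pipeline rather than sampled at random, and the contrastive term is applied at multiple granularities; the sketch fixes only the overall control flow.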
Data availability
The pretraining data and EO benchmarks used in this work are available via Zenodo at https://doi.org/10.5281/zenodo.14994429 (ref. 94). In line with the collaborative agreement between Wuhan University and Ant Group on data redistribution compliance, we checked the user agreements of the original data providers. Some prohibit redistribution (for example, DeepGlobe) and some lack explicit redistribution guidelines (for example, Potsdam); for these datasets, the Zenodo record (ref. 94) therefore provides download links rather than the data themselves. Interested researchers are required to sign the user agreements directly with the original data providers before accessing the datasets. Source data are provided with this paper.
Code availability
The code implemented in this work is available via GitHub at https://github.com/kang-wu/SkySensePlusPlus (ref. 95).
References
Chen, S. et al. Amazon forest biogeography predicts resilience and vulnerability to drought. Nature 631, 111–117 (2024).
Rohde, M. M. et al. Groundwater-dependent ecosystem map exposes global dryland protection needs. Nature 632, 101–107 (2024).
Mo, L. et al. Integrated global assessment of the natural forest carbon potential. Nature 624, 92–101 (2023).
Paolo, F. S. et al. Satellite mapping reveals extensive industrial activity at sea. Nature 625, 85–91 (2024).
Shen, H., Meng, X. & Zhang, L. An integrated framework for the spatio–temporal–spectral fusion of remote sensing images. IEEE Trans. Geosci. Remote Sens. 54, 7135–7148 (2016).
Yuan, Q. et al. Deep learning in environmental remote sensing: achievements and challenges. Remote Sens. Environ. 241, 111716 (2020).
Sun, X. et al. RingMo: a remote sensing foundation model with masked image modeling. IEEE Trans. Geosci. Remote Sens. 61, 5612822 (2023).
Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).
Pai, S. et al. Foundation model for cancer imaging biomarkers. Nat. Mach. Intell. 6, 354–367 (2024).
Bran, A. M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).
Cong, Y. et al. SatMAE: pre-training transformers for temporal and multi-spectral satellite imagery. In Proc. Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 197–211 (Curran Associates, 2022).
Muhtar, D., Zhang, X., Xiao, P., Li, Z. & Gu, F. CMID: a unified self-supervised learning framework for remote sensing image understanding. IEEE Trans. Geosci. Remote Sens. 61, 5607817 (2023).
Mall, U., Hariharan, B. & Bala, K. Change-aware sampling and contrastive learning for satellite images. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 5261–5270 (IEEE, 2023).
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J. & Kembhavi, A. SatlasPretrain: a large-scale dataset for remote sensing image understanding. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 16772–16782 (IEEE, 2023).
Mendieta, M. et al. Towards geospatial foundation models via continual pretraining. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 16806–16816 (IEEE, 2023).
Reed, C. J. et al. Scale-MAE: a scale-aware masked autoencoder for multiscale geospatial representation learning. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 4088–4099 (IEEE, 2023).
Xiong, Z. et al. Neural plasticity-inspired foundation model for observing the Earth crossing modalities. Preprint at https://arxiv.org/abs/2403.15356 (2024).
Li, W. et al. Self-supervised learning for SAR ATR with a knowledge-guided predictive architecture. ISPRS J. Photogramm. Remote Sens. 218, 326–338 (2024).
Wang, Y., Albrecht, C. M. & Zhu, X. X. Self-supervised vision transformers for joint SAR-optical representation learning. In Proc. IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 139–142 (IEEE, 2022).
Fuller, A., Millard, K. & Green, J. R. CROMA: remote sensing representations with contrastive radar-optical masked autoencoders. In Proc. Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 5506–5538 (Curran Associates, 2023).
He, K. et al. Masked autoencoders are scalable vision learners. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 16000–16009 (IEEE, 2022).
Xie, Z. et al. SimMIM: a simple framework for masked image modeling. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 9653–9663 (IEEE, 2022).
Ahlswede, S. et al. TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing. Earth Syst. Sci. Data 15, 681–695 (2023).
Tarasiou, M., Chavez, E. & Zafeiriou, S. ViTs for SITS: vision transformers for satellite image time series. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 10418–10428 (IEEE, 2023).
Wang, D. et al. MTP: advancing remote sensing foundation model via multi-task pretraining. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 11632–11654 (2024).
Guo, X. et al. SkySense: a multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 27672–27683 (IEEE, 2024).
Brown, T. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems Vol. 33 (eds Larochelle, H. et al.) 1877–1901 (Curran Associates, 2020).
Bar, A., Gandelsman, Y., Darrell, T., Globerson, A. & Efros, A. Visual prompting via image inpainting. In Proc. Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 25005–25017 (Curran Associates, 2022).
Wang, X., Wang, W., Cao, Y., Shen, C. & Huang, T. Images speak in images: a generalist painter for in-context visual learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 6830–6839 (IEEE, 2023).
Wang, X. et al. SegGPT: towards segmenting everything in context. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 1130–1140 (IEEE, 2023).
Rußwurm, M. & Körner, M. Multi-temporal land cover classification with sequential recurrent encoders. ISPRS Int. J. Geo-Inf. 7, 129 (2018).
Ahlswede, S. et al. TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing. Earth Syst. Sci. Data 15, 681–695 (2023).
Bragagnolo, L., da Silva, R. V. & Grzybowski, J. M. V. Towards the automatic monitoring of deforestation in Brazilian rainforest. Ecol. Inform. 66, 101454 (2021).
Zhu, Q. et al. Oil spill contextual and boundary-supervised detection network based on marine SAR images. IEEE Trans. Geosci. Remote Sens. 60, 5213910 (2022).
Rowley, A. & Karakuş, O. Predicting air quality via multimodal AI and satellite imagery. Remote Sens. Environ. 293, 113609 (2023).
Eikelboom, J. A. J. et al. Improving the precision and accuracy of animal population estimates with aerial image object detection. Methods Ecol. Evol. 10, 1875–1887 (2019).
Hong, D. et al. Cross-city matters: a multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks. Remote Sens. Environ. 299, 113856 (2023).
Zhang, C. et al. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 166, 183–200 (2020).
Cloud to Street - Microsoft Flood and Clouds Dataset. Registry of Open Data https://registry.opendata.aws/c2smsfloods (2023).
Cambrin, D. R., Colomba, L. & Garza, P. CaBuAr: California burned areas dataset for delineation. IEEE Geosci. Remote Sens. Mag. 11, 106–113 (2023).
Zhang, X., Yu, W., Pun, M.-O. & Shi, W. Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning. ISPRS J. Photogramm. Remote Sens. 197, 1–17 (2023).
Gupta, R. et al. Creating xBD: a dataset for assessing building damage from satellite imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 10–17 (IEEE, 2019).
Astruc, G., Gonthier, N., Mallet, C. & Landrieu, L. OmniSat: self-supervised modality fusion for Earth observation. In Proc. European Conference on Computer Vision (ECCV) (eds Leonardis, A. et al.) 409–427 (Springer, 2024).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2017).
Wang, K., Liew, J. H., Zou, Y., Zhou, D. & Feng, J. PANet: few-shot image semantic segmentation with prototype alignment. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 9197–9206 (IEEE, 2019).
Li, X. et al. MCANet: a joint semantic segmentation framework of optical and SAR images for land use classification. Int. J. Appl. Earth Obs. Geoinf. 106, 102638 (2022).
Zhang, B., Xiao, J. & Qin, T. Self-guided and cross-guided learning for few-shot segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 8312–8321 (IEEE, 2021).
Jia, Y., Gao, J., Huang, W., Yuan, Y. & Wang, Q. Holistic mutual representation enhancement for few-shot remote sensing segmentation. IEEE Trans. Geosci. Remote Sens. 61, 5622613 (2023).
Wanyan, X., Seneviratne, S., Shen, S. & Kirley, M. Extending global-local view alignment for self-supervised learning with remote sensing imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 2443–2453 (IEEE, 2024).
Rahnemoonfar, M. et al. FloodNet: a high resolution aerial imagery dataset for post flood scene understanding. IEEE Access 9, 89644–89654 (2021).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 618–626 (IEEE, 2017).
Dai, D. & Yang, W. Satellite image classification via two-layer sparse coding with biased image representation. IEEE Geosci. Remote Sens. Lett. 8, 173–176 (2011).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. International Conference on Machine Learning 1126–1135 (PMLR, 2017).
Li, Z., Zhou, F., Chen, F. & Li, H. Meta-SGD: learning to learn quickly for few-shot learning. Preprint at https://arxiv.org/abs/1707.09835 (2017).
Snell, J., Swersky, K. & Zemel, R. Prototypical networks for few-shot learning. In Proc. Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017).
Sung, F. et al. Learning to compare: relation network for few-shot learning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1199–1208 (IEEE, 2018).
Zhang, C., Cai, Y., Lin, G. & Shen, C. DeepEMD: few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 12203–12213 (IEEE, 2020).
Zhang, P., Bai, Y., Wang, D., Bai, B. & Li, Y. Few-shot classification of aerial scene images via meta-learning. Remote Sens. 13, 108 (2021).
Li, H. et al. RS-MetaNet: deep meta metric learning for few-shot remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 59, 6983–6994 (2021).
Li, L., Han, J., Yao, X., Cheng, G. & Guo, L. DLA-MatchNet for few-shot remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 59, 7844–7853 (2021).
Zhang, B. et al. SGMNet: scene graph matching network for few-shot remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 60, 5628915 (2022).
Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).
Touvron, H. et al. Llama: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).
Dubey, A. et al. The Llama 3 herd of models. Preprint at https://arxiv.org/abs/2407.21783 (2024).
Wang, Y. et al. SSL4EO-S12: a large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. IEEE Geosci. Remote Sens. Mag. 11, 98–106 (2023).
Sumbul, G. et al. BigEarthNet: a large-scale benchmark archive for remote sensing image understanding. In Proc. IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 5901–5904 (IEEE, 2019).
Christie, G., Fendley, N., Wilson, J. & Mukherjee, R. Functional map of the world. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 6172–6180 (IEEE, 2018).
Manas, O., Lacoste, A., Giró-i Nieto, X., Vazquez, D. & Rodriguez, P. Seasonal contrast: unsupervised pre-training from uncurated remote sensing data. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 9414–9423 (IEEE, 2021).
Long, Y. et al. On creating benchmark dataset for aerial image interpretation: reviews, guidances and Million-AID. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 14, 4205–4230 (2021).
Tong, X.-Y., Xia, G.-S. & Zhu, X. X. Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS J. Photogramm. Remote Sens. 196, 178–196 (2023).
2D semantic labeling contest—Potsdam. ISPRS https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (2018).
2D semantic labeling—Vaihingen data. ISPRS https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx (2018).
Demir, I. et al. DeepGlobe 2018: a challenge to parse the Earth through satellite images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops 172–181 (IEEE, 2018).
Waqas Zamir, S. et al. iSAID: a large-scale dataset for instance segmentation in aerial images. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 28–37 (IEEE, 2019).
Wang, J., Zheng, Z., Ma, A., Lu, X. & Zhong, Y. LoveDA: a remote sensing land-cover dataset for domain adaptive semantic segmentation. In Proc. Neural Information Processing Systems Track on Datasets and Benchmarks Vol. 1 (eds Vanschoren, J. et al.) (Curran Associates, 2021).
Toker, A. et al. DynamicEarthNet: daily multi-spectral satellite dataset for semantic change segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 21158–21167 (IEEE, 2022).
Garnot, V. S. F., Landrieu, L. & Chehata, N. Multi-modal temporal attention models for crop mapping from satellite time series. ISPRS J. Photogramm. Remote Sens. 187, 294–305 (2022).
Garioud, A. et al. FLAIR: a country-scale land cover semantic segmentation dataset from multi-source optical imagery. In Proc. Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) (Curran Associates, 2024).
Schmitt, M., Hughes, L., Ghamisi, P., Yokoya, N. & Hänsch, R. 2020 IEEE GRSS data fusion contest. IEEE DataPort https://doi.org/10.21227/rha7-m332 (2020).
Wolters, P., Bastani, F. & Kembhavi, A. Zooming out on zooming in: advancing super-resolution for remote sensing. Preprint at https://arxiv.org/abs/2311.18082 (2023).
Koßmann, D., Brack, V. & Wilhelm, T. SeasoNet: a seasonal scene classification, segmentation and retrieval dataset for satellite imagery over Germany. In Proc. IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 243–246 (IEEE, 2022).
Wang, F. et al. Scaling efficient masked image modeling on large remote sensing dataset. Preprint at https://arxiv.org/abs/2406.11933 (2024).
Schmitt, M., Hughes, L. H., Qiu, C. & Zhu, X. X. SEN12MS – a curated dataset of georeferenced multi-spectral Sentinel-1/2 imagery for deep learning and data fusion. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. IV-2/W7, 153–160 (2019).
Tong, X.-Y. et al. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 237, 111322 (2020).
Johnson, N., Treible, W. & Crispell, D. OpenSentinelMap: a large-scale land use dataset using OpenStreetMap and Sentinel-2 imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 1333–1341 (IEEE, 2022).
Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. In Proc. International Conference on Learning Representations (ICLR) (OpenReview, 2021).
Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017).
Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 9650–9660 (IEEE, 2021).
He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 9729–9738 (IEEE, 2020).
Jang, E., Gu, S. & Poole, B. Categorical reparameterization with Gumbel-Softmax. In Proc. International Conference on Learning Representations (ICLR) (OpenReview, 2017).
Wu, K., Yu, L., Zhang, J. & Sun, Y. Pretraining datasets of SkySense++. Zenodo https://doi.org/10.5281/zenodo.14994429 (2025).
Wu, K., Zhang, Y. & Ru, L. Code of SkySense++. Zenodo https://doi.org/10.5281/zenodo.15378721 (2025).
Acknowledgements
Y.L. acknowledges the support of the National Natural Science Foundation of China (grant nos. 42030102 and 42371321), the National Key Research and Development Program of China (grant no. 2024YFB3909001) and Ant Group. We thank X. Guo for his valuable advice in the initial phase of this work.
Author information
Contributions
K.W., Yingying Zhang, L.R., J.C., Yongjun Zhang and Y.L.: conceptualization. K.W., Yingying Zhang and L.R.: methodology, investigation, development, analysis, writing—original draft, writing—review and editing. Y.L., Yongjun Zhang and J.C.: supervision, methodology, analysis, writing—review and editing. J.W.: supervision, methodology and analysis. M.Y.: supervision, analysis, writing—review and editing. B.D., J. Lao, J. Luo and Q.Z.: analysis. K.W., L.Y., Z.Z., Y.S. and J.Z.: data curation. K.W., J. Luo and Q.Z. contributed to this work during their internships at Ant Group.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Congcong Wen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–12, Sections 1–18 and Tables 1–19.
Source data
Source Data Fig. 1
Statistical source data for plots in Fig. 1.
Source Data Fig. 3
Statistical source data for plots in Fig. 3.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, K., Zhang, Y., Ru, L. et al. A semantic-enhanced multi-modal remote sensing foundation model for Earth observation. Nat Mach Intell 7, 1235–1249 (2025). https://doi.org/10.1038/s42256-025-01078-8