A semantic-enhanced multi-modal remote sensing foundation model for Earth observation

Abstract

Remote sensing foundation models, pretrained on massive remote sensing data, have shown impressive performance in several Earth observation (EO) tasks. These models usually use single-modal temporal data for pretraining, which is insufficient for multi-modal applications. Moreover, these models require a considerable number of samples for fine-tuning in downstream tasks, posing challenges in time-sensitive scenarios, such as rapid flood mapping. We present SkySense++, a multi-modal remote sensing foundation model for diverse EO tasks. SkySense++ has a factorized architecture to accommodate multi-modal images acquired by diverse sensors. We adopt progressive pretraining, which involves two stages, on meticulously curated datasets of 27 million multi-modal remote sensing images. The first representation-enhanced pretraining stage uses multi-granularity contrastive learning to obtain general representations. The second semantic-enhanced pretraining stage leverages masked semantic learning to learn semantically enriched representations, enabling few-shot capabilities. This ability allows the model to handle unseen tasks with minimal labelled data, alleviating the need for fine-tuning on extensive annotated data. SkySense++ demonstrates consistent improvements in classification, detection and segmentation over previous state-of-the-art models across 12 EO tasks in 7 domains: agriculture, forestry, oceanography, atmosphere, biology, land surveying and disaster management. This generalizability may lead to a new chapter of remote sensing foundation model applications for EO tasks at scale.
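
The two pretraining stages named above can be made concrete with a small illustration. The following Python (PyTorch) sketch is ours, not the authors' released implementation (see Code availability below): it shows, under simplifying assumptions, what a contrastive objective over paired modalities and a masked semantic objective look like in code. All names, tensor shapes and hyperparameters here (info_nce, masked_semantic_loss, the 75% mask ratio) are hypothetical stand-ins for illustration only.

    # Minimal conceptual sketch (PyTorch) of the two pretraining objectives
    # described in the abstract. Not the authors' implementation; all names,
    # shapes and hyperparameters are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def info_nce(z_a, z_b, temperature=0.07):
        # Stage 1 (representation-enhanced): contrastive loss that pulls
        # embeddings of the same scene from two modalities (e.g. optical
        # and SAR) together; other scenes in the batch act as negatives.
        z_a = F.normalize(z_a, dim=-1)
        z_b = F.normalize(z_b, dim=-1)
        logits = z_a @ z_b.t() / temperature   # (batch, batch) similarities
        targets = torch.arange(z_a.size(0))    # positives lie on the diagonal
        return F.cross_entropy(logits, targets)

    def masked_semantic_loss(token_feats, semantic_labels, mask, head):
        # Stage 2 (semantic-enhanced): predict a semantic class for each
        # masked token, so the encoder must encode semantics, not just pixels.
        # token_feats: (batch, tokens, dim) features from masked inputs
        # semantic_labels: (batch, tokens) per-token class indices
        # mask: (batch, tokens) bool, True where a token was masked
        logits = head(token_feats[mask])       # score only the masked tokens
        return F.cross_entropy(logits, semantic_labels[mask])

    # Toy usage with random tensors standing in for encoder outputs.
    batch, tokens, dim, num_classes = 8, 196, 256, 20
    optical, sar = torch.randn(batch, dim), torch.randn(batch, dim)
    stage1_loss = info_nce(optical, sar)

    feats = torch.randn(batch, tokens, dim)
    labels = torch.randint(0, num_classes, (batch, tokens))
    mask = torch.rand(batch, tokens) < 0.75    # mask 75% of tokens (assumed)
    head = torch.nn.Linear(dim, num_classes)
    stage2_loss = masked_semantic_loss(feats, labels, mask, head)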


Fig. 1: Overview of SkySense++.
Fig. 2: Visual comparison of the previous state-of-the-art method, SkySense, and SkySense++ in diverse domains.
Fig. 3: Comparison of SkySense++ and other methods in the few-shot tasks.
Fig. 4: Representation-enhanced and semantic-enhanced pretraining.

Data availability

The pretraining data and EO benchmarks used in this work are available via Zenodo at https://doi.org/10.5281/zenodo.14994429 (ref. 94). In line with the collaborative agreement between Wuhan University and Ant Group on data-redistribution compliance, we checked the user agreements of the original data providers: some prohibit redistribution (for example, DeepGlobe) and some lack explicit redistribution guidelines (for example, Potsdam). For these datasets, the same Zenodo record (ref. 94) provides download links rather than the data themselves, and interested researchers must sign user agreements directly with the original data providers before accessing them. Source data are provided with this paper.

Code availability

The code implemented in this work is available via GitHub at https://github.com/kang-wu/SkySensePlusPlus (ref. 95).

References

  1. Chen, S. et al. Amazon forest biogeography predicts resilience and vulnerability to drought. Nature 631, 111–117 (2024).

  2. Rohde, M. M. et al. Groundwater-dependent ecosystem map exposes global dryland protection needs. Nature 632, 101–107 (2024).

  3. Mo, L. et al. Integrated global assessment of the natural forest carbon potential. Nature 624, 92–101 (2023).

  4. Paolo, F. S. et al. Satellite mapping reveals extensive industrial activity at sea. Nature 625, 85–91 (2024).

  5. Shen, H., Meng, X. & Zhang, L. An integrated framework for the spatio–temporal–spectral fusion of remote sensing images. IEEE Trans. Geosci. Remote Sens. 54, 7135–7148 (2016).

  6. Yuan, Q. et al. Deep learning in environmental remote sensing: achievements and challenges. Remote Sens. Environ. 241, 111716 (2020).

  7. Sun, X. et al. RingMo: a remote sensing foundation model with masked image modeling. IEEE Trans. Geosci. Remote Sens. 61, 5612822 (2023).

  8. Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).

  9. Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).

  10. Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).

  11. Pai, S. et al. Foundation model for cancer imaging biomarkers. Nat. Mach. Intell. 6, 354–367 (2024).

  12. Bran, A. M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).

  13. Cong, Y. et al. SatMAE: pre-training transformers for temporal and multi-spectral satellite imagery. In Proc. Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 197–211 (Curran Associates, 2022).

  14. Muhtar, D., Zhang, X., Xiao, P., Li, Z. & Gu, F. CMID: a unified self-supervised learning framework for remote sensing image understanding. IEEE Trans. Geosci. Remote Sens. 61, 5607817 (2023).

  15. Mall, U., Hariharan, B. & Bala, K. Change-aware sampling and contrastive learning for satellite images. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 5261–5270 (IEEE, 2023).

  16. Bastani, F., Wolters, P., Gupta, R., Ferdinando, J. & Kembhavi, A. SatlasPretrain: a large-scale dataset for remote sensing image understanding. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 16772–16782 (IEEE, 2023).

  17. Mendieta, M. et al. Towards geospatial foundation models via continual pretraining. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 16806–16816 (IEEE, 2023).

  18. Reed, C. J. et al. Scale-MAE: a scale-aware masked autoencoder for multiscale geospatial representation learning. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 4088–4099 (IEEE, 2023).

  19. Xiong, Z. et al. Neural plasticity-inspired foundation model for observing the Earth crossing modalities. Preprint at https://arxiv.org/abs/2403.15356 (2024).

  20. Li, W. et al. Self-supervised learning for SAR ATR with a knowledge-guided predictive architecture. ISPRS J. Photogramm. Remote Sens. 218, 326–338 (2024).

  21. Wang, Y., Albrecht, C. M. & Zhu, X. X. Self-supervised vision transformers for joint SAR-optical representation learning. In Proc. IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 139–142 (IEEE, 2022).

  22. Fuller, A., Millard, K. & Green, J. R. CROMA: remote sensing representations with contrastive radar-optical masked autoencoders. In Proc. Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 5506–5538 (Curran Associates, 2023).

  23. He, K. et al. Masked autoencoders are scalable vision learners. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 16000–16009 (IEEE, 2022).

  24. Xie, Z. et al. SimMIM: a simple framework for masked image modeling. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 9653–9663 (IEEE, 2022).

  25. Ahlswede, S. et al. TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing. Earth Syst. Sci. Data 15, 681–695 (2023).

  26. Tarasiou, M., Chavez, E. & Zafeiriou, S. ViTs for SITS: vision transformers for satellite image time series. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 10418–10428 (IEEE, 2023).

  27. Wang, D. et al. MTP: advancing remote sensing foundation model via multi-task pretraining. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 11632–11654 (2024).

  28. Guo, X. et al. SkySense: a multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 27672–27683 (IEEE, 2024).

  29. Brown, T. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems Vol. 33 (eds Larochelle, H. et al.) 1877–1901 (Curran Associates, 2020).

  30. Bar, A., Gandelsman, Y., Darrell, T., Globerson, A. & Efros, A. Visual prompting via image inpainting. In Proc. Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 25005–25017 (Curran Associates, 2022).

  31. Wang, X., Wang, W., Cao, Y., Shen, C. & Huang, T. Images speak in images: a generalist painter for in-context visual learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 6830–6839 (IEEE, 2023).

  32. Wang, X. et al. SegGPT: towards segmenting everything in context. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 1130–1140 (IEEE, 2023).

  33. Rußwurm, M. & Körner, M. Multi-temporal land cover classification with sequential recurrent encoders. ISPRS Int. J. Geo-Inf. 7, 129 (2018).

  34. Ahlswede, S. et al. TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing. Earth Syst. Sci. Data 15, 681–695 (2023).

  35. Bragagnolo, L., da Silva, R. V. & Grzybowski, J. M. V. Towards the automatic monitoring of deforestation in Brazilian rainforest. Ecol. Inform. 66, 101454 (2021).

  36. Zhu, Q. et al. Oil spill contextual and boundary-supervised detection network based on marine SAR images. IEEE Trans. Geosci. Remote Sens. 60, 5213910 (2021).

  37. Rowley, A. & Karakuş, O. Predicting air quality via multimodal AI and satellite imagery. Remote Sens. Environ. 293, 113609 (2023).

  38. Eikelboom, J. A. J. et al. Improving the precision and accuracy of animal population estimates with aerial image object detection. Methods Ecol. Evol. 10, 1875–1887 (2019).

  39. Hong, D. et al. Cross-city matters: a multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks. Remote Sens. Environ. 299, 113856 (2023).

  40. Zhang, C. et al. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 166, 183–200 (2020).

  41. Cloud to Street - Microsoft Flood and Clouds Dataset. Registry of Open Data https://registry.opendata.aws/c2smsfloods (2023).

  42. Cambrin, D. R., Colomba, L. & Garza, P. CaBuAr: California burned areas dataset for delineation. IEEE Geosci. Remote Sens. Mag. 11, 106–113 (2023).

  43. Zhang, X., Yu, W., Pun, M.-O. & Shi, W. Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning. ISPRS J. Photogramm. Remote Sens. 197, 1–17 (2023).

  44. Gupta, R. et al. Creating xBD: a dataset for assessing building damage from satellite imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 10–17 (IEEE, 2019).

  45. Astruc, G., Gonthier, N., Mallet, C. & Landrieu, L. OmniSat: self-supervised modality fusion for Earth observation. In Proc. European Conference on Computer Vision (ECCV) (eds Leonardis, A. et al.) 409–427 (Springer, 2024).

  46. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2016).

  47. Wang, K., Liew, J. H., Zou, Y., Zhou, D. & Feng, J. PANet: few-shot image semantic segmentation with prototype alignment. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 9197–9206 (IEEE, 2019).

  48. Li, X. et al. MCANet: a joint semantic segmentation framework of optical and SAR images for land use classification. Int. J. Appl. Earth Obs. Geoinf. 106, 102638 (2022).

  49. Zhang, B., Xiao, J. & Qin, T. Self-guided and cross-guided learning for few-shot segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 8312–8321 (IEEE, 2021).

  50. Jia, Y., Gao, J., Huang, W., Yuan, Y. & Wang, Q. Holistic mutual representation enhancement for few-shot remote sensing segmentation. IEEE Trans. Geosci. Remote Sens. 61, 5622613 (2023).

  51. Wanyan, X., Seneviratne, S., Shen, S. & Kirley, M. Extending global-local view alignment for self-supervised learning with remote sensing imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 2443–2453 (IEEE, 2024).

  52. Rahnemoonfar, M. et al. FloodNet: a high resolution aerial imagery dataset for post flood scene understanding. IEEE Access 9, 89644–89654 (2021).

  53. Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 618–626 (IEEE, 2017).

  54. Dai, D. & Yang, W. Satellite image classification via two-layer sparse coding with biased image representation. IEEE Geosci. Remote Sens. Lett. 8, 173–176 (2011).

  55. Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. International Conference on Machine Learning 1126–1135 (PMLR, 2017).

  56. Li, Z., Zhou, F., Chen, F. & Li, H. Meta-SGD: learning to learn quickly for few-shot learning. Preprint at https://arxiv.org/abs/1707.09835 (2017).

  57. Snell, J., Swersky, K. & Zemel, R. Prototypical networks for few-shot learning. In Proc. Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017).

  58. Sung, F. et al. Learning to compare: relation network for few-shot learning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1199–1208 (IEEE, 2018).

  59. Zhang, C., Cai, Y., Lin, G. & Shen, C. DeepEMD: few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 12203–12213 (IEEE, 2020).

  60. Zhang, P., Bai, Y., Wang, D., Bai, B. & Li, Y. Few-shot classification of aerial scene images via meta-learning. Remote Sens. 13, 108 (2020).

  61. Li, H. et al. RS-MetaNet: deep meta metric learning for few-shot remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 59, 6983–6994 (2021).

  62. Li, L., Han, J., Yao, X., Cheng, G. & Guo, L. DLA-MatchNet for few-shot remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 59, 7844–7853 (2020).

  63. Zhang, B. et al. SGMNet: scene graph matching network for few-shot remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 60, 5628915 (2022).

  64. Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).

  65. Touvron, H. et al. Llama: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).

  66. Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).

  67. Dubey, A. et al. The Llama 3 herd of models. Preprint at https://arxiv.org/abs/2407.21783 (2024).

  68. Wang, Y. et al. SSL4EO-S12: a large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. IEEE Geosci. Remote Sens. Mag. 11, 98–106 (2023).

  69. Sumbul, G. et al. BigEarthNet: a large-scale benchmark archive for remote sensing image understanding. In Proc. IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 5901–5904 (IEEE, 2019).

  70. Christie, G., Fendley, N., Wilson, J. & Mukherjee, R. Functional map of the world. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 6172–6180 (IEEE, 2018).

  71. Manas, O., Lacoste, A., Giró-i Nieto, X., Vazquez, D. & Rodriguez, P. Seasonal contrast: unsupervised pre-training from uncurated remote sensing data. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 9414–9423 (IEEE, 2021).

  72. Long, Y. et al. On creating benchmark dataset for aerial image interpretation: reviews, guidances, and Million-AID. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 14, 4205–4230 (2021).

  73. Tong, X.-Y., Xia, G.-S. & Zhu, X. X. Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS J. Photogramm. Remote Sens. 196, 178–196 (2023).

  74. 2D semantic labeling contest—Potsdam. ISPRS https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (2018).

  75. 2D semantic labeling—Vaihingen data. ISPRS https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx (2018).

  76. Demir, I. et al. DeepGlobe 2018: a challenge to parse the Earth through satellite images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops 172–181 (IEEE, 2018).

  77. Waqas Zamir, S. et al. ISAID: a large-scale dataset for instance segmentation in aerial images. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 28–37 (IEEE, 2019).

  78. Wang, J., Zheng, Z., Ma, A., Lu, X. & Zhong, Y. LoveDA: a remote sensing land-cover dataset for domain adaptive semantic segmentation. In Proc. Neural Information Processing Systems Track on Datasets and Benchmarks Vol. 1 (eds Vanschoren, J. et al.) (Curran Associates, 2021).

  79. Toker, A. et al. DynamicEarthNet: daily multi-spectral satellite dataset for semantic change segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 21158–21167 (IEEE, 2022).

  80. Garnot, V. S. F., Landrieu, L. & Chehata, N. Multi-modal temporal attention models for crop mapping from satellite time series. ISPRS J. Photogramm. Remote Sens. 187, 294–305 (2022).

  81. Garioud, A. et al. FLAIR: a country-scale land cover semantic segmentation dataset from multi-source optical imagery. In Proc. Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) (Curran Associates, 2024).

  82. Schmitt, M., Hughes, L., Ghamisi, P., Yokoya, N. & Hänsch, R. 2020 IEEE GRSS data fusion contest. IEEE DataPort https://doi.org/10.21227/rha7-m332 (2020).

  83. Wolters, P., Bastani, F. & Kembhavi, A. Zooming out on zooming in: advancing super-resolution for remote sensing. Preprint at https://arxiv.org/abs/2311.18082 (2023).

  84. Koßmann, D., Brack, V. & Wilhelm, T. SeasoNet: a seasonal scene classification, segmentation and retrieval dataset for satellite imagery over Germany. In Proc. IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 243–246 (IEEE, 2022).

  85. Wang, F. et al. Scaling efficient masked image modeling on large remote sensing dataset. Preprint at https://arxiv.org/abs/2406.11933 (2024).

  86. Schmitt, M., Hughes, L. H., Qiu, C. & Zhu, X. X. SEN12MS – a curated dataset of georeferenced multi-spectral Sentinel-1/2 imagery for deep learning and data fusion. ISPRS Annals 4, 153–160 (2019).

  87. Tong, X.-Y. et al. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 237, 111322 (2020).

  88. Johnson, N., Treible, W. & Crispell, D. OpenSentinelMap: a large-scale land use dataset using OpenStreetMap and Sentinel-2 imagery. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 1333–1341 (IEEE, 2022).

  89. Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. In Proc. International Conference on Learning Representations (ICLR) (OpenReview, 2021).

  90. Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017).

  91. Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 9650–9660 (IEEE, 2021).

  92. He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 9729–9738 (IEEE, 2020).

  93. Jang, E., Gu, S. & Poole, B. Categorical reparameterization with Gumbel-Softmax. In Proc. International Conference on Learning Representations (ICLR) (OpenReview, 2017).

  94. Wu, K., Yu, L., Zhang, J. & Sun, Y. Pretraining datasets of SkySense++. Zenodo https://doi.org/10.5281/zenodo.14994429 (2025).

  95. Wu, K., Zhang, Y. & Ru, L. Code of SkySense++. Zenodo https://doi.org/10.5281/zenodo.15378721 (2025).

Acknowledgements

Y.L. acknowledges the support of the National Natural Science Foundation of China (Grant Nos 42030102 and 42371321), the National Key Research and Development Program of China (Grant No. 2024YFB3909001), and Ant Group. We thank X. Guo for his valuable advice in the initial phase of this work.

Author information

Authors and Affiliations

Contributions

K.W., Yingying Zhang, L.R., J.C., Yongjun Zhang and Y.L.: conceptualization. K.W., Yingying Zhang and L.R.: methodology, investigation, development, analysis, writing—original draft, writing—review and editing. Y.L., Yongjun Zhang and J.C.: supervision, methodology, analysis, writing—review and editing. J.W.: supervision, methodology and analysis. M.Y.: supervision, analysis, writing—review and editing. B.D., J. Lao, J. Luo and Q.Z.: analysis. K.W., L.Y., Z.Z., Y.S. and J.Z.: data curation. K.W., J. Luo and Q.Z. contributed to this work during their internships at Ant Group.

Corresponding authors

Correspondence to Jingdong Chen, Yongjun Zhang or Yansheng Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Congcong Wen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Table 1 Summary of our Earth observation benchmark, including the domains, task types, task datasets, modalities, ground sample distances (GSDs) and image sizes
Extended Data Table 2 Summary of collected datasets in our RS-Sem datasets, including dataset names, modalities, ground sample distances (GSDs), image sizes, number of categories, number of images and number of annotated pixels

Supplementary information

Supplementary Information

Supplementary Figs. 1–12, Sections 1–18 and Tables 1–19.

Source data

Source Data Fig. 1

Statistical source data for plots in Fig. 1.

Source Data Fig. 3

Statistical source data for plots in Fig. 3.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Wu, K., Zhang, Y., Ru, L. et al. A semantic-enhanced multi-modal remote sensing foundation model for Earth observation. Nat Mach Intell 7, 1235–1249 (2025). https://doi.org/10.1038/s42256-025-01078-8
