Nature Communications
Representation learning to advance multi-institutional studies with electronic health record data from US and France
  • Article
  • Open access
  • Published: 03 April 2026

  • Doudou Zhou (ORCID: orcid.org/0000-0002-0830-2287) 1,2 na1,
  • Han Tong (ORCID: orcid.org/0009-0002-2775-215X) 3 na1,
  • Linshanshan Wang 2 na1,
  • Suqi Liu 4 na1,
  • Xin Xiong (ORCID: orcid.org/0000-0002-1162-5220) 2,
  • Ziming Gan 5,
  • Griffier Romain 6,7,
  • Boris P. Hejblum (ORCID: orcid.org/0000-0003-0646-452X) 6,8,
  • Yun-Chung Liu (ORCID: orcid.org/0000-0002-3433-5664) 9,
  • Chuan Hong 9,
  • Clara-Lea Bonzel 2,4,
  • Tianrun Cai 10,11,
  • Kevin Pan 12,
  • Yuk-Lam Ho (ORCID: orcid.org/0000-0003-3305-3830) 10,
  • Lauren Costa 10,
  • Vidul A. Panickan (ORCID: orcid.org/0000-0003-0616-0403) 4,10,
  • J. Michael Gaziano 4,10,11,
  • Kenneth D. Mandl (ORCID: orcid.org/0000-0002-9781-0477) 13,
  • Vianney Jouhet (ORCID: orcid.org/0000-0001-5272-2265) 6,7,
  • Rodolphe Thiebaut (ORCID: orcid.org/0000-0002-5235-3962) 6,7,8,
  • Zongqi Xia (ORCID: orcid.org/0000-0003-1500-2589) 14,
  • Kelly Cho 4,10,11,
  • Katherine Liao (ORCID: orcid.org/0000-0002-4797-3200) 4,10,11 na2 &
  • Tianxi Cai (ORCID: orcid.org/0000-0002-5379-2502) 2,4,10 na2

Nature Communications (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Data integration
  • Machine learning
  • Translational research

Abstract

The widespread adoption of electronic health records has created new opportunities for translational clinical research, yet this promise remains constrained by fragmented data across privacy-siloed institutions and substantial heterogeneity in local coding practices. While privacy-preserving collaborative learning allows institutions to work together without sharing patient-level data, it does not address inconsistencies in how clinical concepts are represented across sites. We introduce a graph-based framework that addresses this gap by treating data harmonization as a scalable representation learning problem. Rather than relying on fixed standards or manual mappings, the framework integrates institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information derived from large language models to learn a shared semantic space. This joint learning approach aligns diverse, site-specific vocabularies while preserving patient privacy. Evaluated across seven institutions and two languages, the framework provides a robust, data-centric foundation for training and deploying clinical models across heterogeneous healthcare systems.

Data availability

Source data are provided in the Source data file. The minimum dataset required to interpret and reproduce the main findings of this study consists of the pairwise cosine similarity data derived from the GAME embeddings, which are publicly available at https://shiny.parse-health.org/GAME/. Institution-level summary data are available under restricted access due to data use agreements (DUAs) with the participating healthcare institutions. Access to these data may be obtained by submitting a request and establishing a DUA with the relevant institution(s); requests should be directed to the corresponding author. Individual-level clinical data used for downstream analyses are not publicly available due to patient privacy protections, ethical approval constraints, and institutional regulations. These data were accessed under institution-specific DUAs and IRB approvals and cannot be shared beyond those agreements. Requests for restricted data are reviewed by the relevant institution(s), with an expected response time of approximately 4–8 weeks. The duration of access and permitted use are governed by the terms of the corresponding DUA.
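For readers unfamiliar with the quantity named above, the following is a minimal sketch of how pairwise cosine similarities can be computed from a matrix of embedding vectors. It is an illustration only and is not taken from the GAME code base: the function name `pairwise_cosine`, the code labels, and the two-dimensional toy vectors are all hypothetical.

```python
import numpy as np


def pairwise_cosine(embeddings: np.ndarray) -> np.ndarray:
    """Return the square matrix of cosine similarities between rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Clip the norms so that an all-zero row does not cause division by zero.
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Hypothetical toy embeddings for three made-up codes (not real GAME output).
codes = ["code_a", "code_b", "code_c"]
vectors = np.array([
    [1.0, 0.0],   # code_a
    [2.0, 0.0],   # code_b: same direction as code_a, different length
    [0.0, 3.0],   # code_c: orthogonal to code_a
])

sim = pairwise_cosine(vectors)
for i, row_code in enumerate(codes):
    for j, col_code in enumerate(codes):
        if i < j:
            print(f"{row_code} vs {col_code}: {sim[i, j]:.3f}")
```

Because cosine similarity depends only on direction, codes whose vectors point the same way score 1 regardless of vector length, and orthogonal vectors score 0.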

Code availability

The code used in this study is publicly available at https://github.com/celehs/GAME. The specific version used in this study has been archived on Zenodo at https://doi.org/10.5281/zenodo.18222787, ensuring long-term accessibility and reproducibility (ref. 69). An interactive visualization of the knowledge graph derived from the GAME embeddings is available at https://shiny.parse-health.org/GAME/.

References

  1. Liao, K. P. et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ 350, h1885 https://doi.org/10.1136/bmj.h1885 (2015).

  2. Wang, L. et al. Stratification of Alzheimer’s disease patients using knowledge-guided unsupervised latent factor clustering with electronic health record data. Preprint at https://doi.org/10.1101/2024.12.23.24319588 (2024).

  3. Doshi-Velez, F., Ge, Y. & Kohane, I. Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis. Pediatrics 133, e54–e63 (2014).

  4. Sheu, Y. -h. et al. An efficient landmark model for prediction of suicide attempts in multiple clinical settings. Psychiatry Res. 323, 115175 (2023).

  5. Federico, P. et al. Gnaeus: Utilizing clinical guidelines for knowledge-assisted visualisation of EHR cohorts. In Roberts, J. C. & Bertini, E. (eds.) 6th International EuroVis Workshop on Visual Analytics, EuroVA@EuroVis 2015, Cagliari, Sardinia, Italy, May 25-26, 2015, 79–83 (Eurographics Association, 2015).

  6. Ferté, T., Jouhet, V., Griffier, R., Hejblum, B. P. & Thiébaut, R. The benefit of augmenting open data with clinical data-warehouse EHR for forecasting SARS-CoV-2 hospitalizations in Bordeaux area, France. JAMIA Open 5, ooac086 (2022).

  7. Wen, J. et al. Multimodal representation learning for predicting molecule–disease relations. Bioinformatics 39, btad085 (2023).

  8. Cai, T., Xia, D., Zhang, L. & Zhou, D. Consensus knowledge graph learning via multi-view sparse low rank block model. Preprint at https://doi.org/10.48550/arXiv.2209.13762 (2022).

  9. Hur, K. et al. Unifying heterogeneous electronic health records systems via text-based code embedding. In Proc. Conference on Health, Inference, and Learning, Vol. 174 of Proc. of Machine Learning Research, (eds. Flores, G., Chen, G. H., Pollard, T., Ho, J. C. & Naumann, T.) 183–203 (PMLR, 2022).

  10. Molaei, S. et al. Federated learning for heterogeneous electronic health records utilising augmented temporal graph attention networks. In Proc. International Conference on Artificial Intelligence and Statistics, 1342–1350 (PMLR, 2024).

  11. Thakur, A. et al. Knowledge abstraction and filtering based federated learning over heterogeneous data views in healthcare. NPJ Digit. Med. 7, 283 (2024).

  12. Centers for Disease Control and Prevention et al. International classification of diseases, ninth revision (ICD-9). Cincinnati, Ohio: National Center for Health Statistics (1979).

  13. McDonald, C. J. et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin. Chem. 49, 624–633 (2003).

  14. Chen, M. et al. Privacy protection and intrusion avoidance for cloudlet-based medical data sharing. IEEE Trans. Cloud Comput. 8, 1274–1283 (2016).

  15. Sheller, M. et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci. Rep. 10, (2020).

  16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26, (2013).

  17. Pennington, J., Socher, R. & Manning, C. D. GloVe: global vectors for word representation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543 (ACL, 2014).

  18. Wang, Z., Zhang, J., Feng, J. & Chen, Z. Knowledge graph embedding by translating on hyperplanes. In Proc. AAAI Conference on Artificial Intelligence, Vol. 28 (AAAI, 2014).

  19. Balažević, I., Allen, C. & Hospedales, T. TuckER: tensor factorization for knowledge graph completion. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5185–5194 (ACL, 2019).

  20. Yuan, Z. et al. CODER: knowledge-infused cross-lingual medical term embedding for term normalization. J. Biomed. Inform. 126, 103983 (2022).

  21. Lin, Y., Lu, K., Yu, S., Cai, T. & Zitnik, M. Multimodal learning on graphs for disease relation extraction. J. Biomed. Inform. 143, 104415 (2023).

  22. Bodenreider, O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32, D267–D270 (2004).

  23. Liu, F., Shareghi, E., Meng, Z., Basaldella, M. & Collier, N. Self-alignment pretraining for biomedical entity representations. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4228–4238 (ACL, 2021).

  24. Maldonado, R., Yetisgen, M. & Harabagiu, S. M. Adversarial learning of knowledge embeddings for the Unified Medical Language System. AMIA Summits Transl. Sci. Proc. 2019, 543 (2019).

  25. Michalopoulos, G., Wang, Y., Kaka, H., Chen, H. & Wong, A. UmlsBERT: clinical domain knowledge augmentation of contextual embeddings using the Unified Medical Language System Metathesaurus. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1744–1753 (ACL, 2021).

  26. Piya, F. L., Gupta, M. & Beheshti, R. HealthGAT: node classifications in electronic health records using graph attention networks. In Proc. 2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), 132–141 (IEEE, 2024).

  27. Choi, E. et al. Multi-layer representation learning for medical concepts. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1495–1504 (ACM, 2016).

  28. Kartchner, D., Christensen, T., Humpherys, J. & Wade, S. Code2vec: embedding and clustering medical diagnosis data. In Proc. 2017 IEEE International Conference on Healthcare Informatics, 386–390 (IEEE, 2017).

  29. Hong, C. et al. Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data. NPJ Digit. Med. 4, 151 (2021).

  30. Zhou, D. et al. Multiview incomplete knowledge graph integration with application to cross-institutional EHR data harmonization. J. Biomed. Inform. 133, 104147 (2022).

  31. Gan, Z. et al. ARCH: large-scale knowledge graph via aggregated narrative codified health records analysis. J. Biomed. Inform. 162, 104761 (2025).

  32. Wang, K., Chen, N. & Chen, T. Joint medical ontology representation learning for healthcare predictions. In Proc. 2020 International Joint Conference on Neural Networks (IJCNN), 1–7 (IEEE, 2020).

  33. Ying, H., Zhao, Z., Zhao, Y., Zeng, S. & Yu, S. CoRTEx: contrastive learning for representing terms via explanations with applications on constructing biomedical knowledge graphs. J. Am. Med. Inform. Assoc. 31, 1912–1920 (2024).

  34. Gao, Y. et al. Leveraging medical knowledge graphs into large language models for diagnosis prediction: design and application study. JMIR AI 4, e58670 (2025).

  35. Cai, T., Huang, F., Nakada, R., Zhang, L. & Zhou, D. Contrastive learning on multimodal analysis of electronic health records. Preprint at https://doi.org/10.48550/arXiv.2403.14926 (2024).

  36. Levy, O. & Goldberg, Y. Neural word embedding as implicit matrix factorization. Adv. Neural Inf. Process. Syst. 27, (2014).

  37. Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 1–23 (2021).

  38. Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).

  39. Chen, J. et al. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. Findings of the Association for Computational Linguistics: ACL 2024, 2318–2335 (Association for Computational Linguistics, Bangkok, Thailand, 2024).

  40. Cipriani, A. et al. Comparative efficacy and acceptability of antimanic drugs in acute mania: a multiple-treatments meta-analysis. Lancet 378, 1306–1315 (2011).

  41. Arvanitis, L. A. & Miller, B. G. Multiple fixed doses of “Seroquel” (Quetiapine) in patients with acute exacerbation of schizophrenia: a comparison with Haloperidol and placebo. Biol. Psychiatry 42, 233–246 (1997).

  42. Ismail, Z. et al. Psychosis in Alzheimer disease-mechanisms, genetics and therapeutic opportunities. Nat. Rev. Neurol. 18, 131–144 (2022).

  43. Liu, J., Chang, L., Song, Y., Li, H. & Wu, Y. The role of NMDA receptors in Alzheimer’s disease. Front. Neurosci. 13, 43 (2019).

  44. Tariot, P. N. et al. Memantine treatment in patients with moderate to severe Alzheimer disease already receiving donepezil: a randomized controlled trial. JAMA 291, 317–324 (2004).

  45. Anthropic. Introducing the next generation of Claude https://www.anthropic.com/news/claude-3-family (2024).

  46. Meta AI. The LLaMA 4 herd: The beginning of a new era of natively multimodal AI innovation (2025) https://ai.meta.com/blog/llama-4-multimodal-intelligence/ Accessed: 2025-Apr-07.

  47. OMOP. Standardized data: The OMOP common data model (2021) https://www.ohdsi.org/data-standardization/. Accessed: Jun, 2025.

  48. Wen, J. et al. DOME: directional medical embedding vectors from electronic health records. J. Biomed. Inform. 162, 104768 (2025).

  49. Chen, L. et al. Graph optimal transport for cross-domain alignment. In Proc. International Conference on Machine Learning, 1542–1553 (PMLR, 2020).

  50. Veličković, P. et al. Graph attention networks. In Proc. International Conference on Learning Representations (ICLR, 2018).

  51. Gori, M., Monfardini, G. & Scarselli, F. A new model for learning in graph domains. In Proc. 2005 IEEE International Joint Conference on Neural Networks, Vol. 2, 729–734 (IEEE, 2005).

  52. Johnson, A. et al. MIMIC-IV (version 0.4). PhysioNet. (2020) https://physionet.org/content/mimiciv/0.4/. Accessed: June, 2025.

  53. Bousquet, C., Trombert, B., Souvignet, J., Sadou, E. & Rodrigues, J.-M. Evaluation of the CCAM hierarchy and semi structured code for retrieving relevant procedures in a hospital case mix database. In Proc. AMIA Annual Symposium Proceedings, Vol. 2010, 61 (AMIA, 2010).

  54. Beam, A. L. et al. Clinical concept embeddings learned from massive sources of multimodal medical data. In Proc. Pacific Symposium on Biocomputing, Vol. 25, 295–306 (PSB, 2020).

  55. Shin, H.-C. et al. BioMegatron: larger biomedical domain language model. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 4700–4706 (ACL, 2020).

  56. Wang, X., Han, X., Huang, W., Dong, D. & Scott, M. R. Multi-similarity loss with general pair weighting for deep metric learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5022–5030 (IEEE Computer Society, 2019).

  57. Li, L. et al. Identification of type 2 diabetes subgroups through topological analysis of patient similarity. Sci. Transl. Med. 7, 311ra174 (2015).

  58. Landi, I. et al. Deep representation learning of electronic health records to unlock patient stratification at scale. NPJ Digit. Med. 3, 96 (2020).

  59. Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, (2008).

  60. Garst, S. & Reinders, M. Federated k-means clustering. In Proc. International Conference on Pattern Recognition, 107–122 (Springer, 2024).

  61. Armstrong, M. J., Song, S., Kurasz, A. M. & Li, Z. Predictors of mortality in individuals with dementia in the National Alzheimer’s Coordinating Center. J. Alzheimer’s. Dis. 86, 1935–1946 (2022).

  62. Zheng, X., Wang, S., Huang, J., Li, C. & Shang, H. Predictors for survival in patients with Alzheimer’s disease: a large comprehensive meta-analysis. Transl. Psychiatry 14, 184 (2024).

  63. Abdelnour, C. et al. Perspectives and challenges in patient stratification in Alzheimer’s disease. Alzheimer’s. Res. Ther. 14, 112 (2022).

  64. Han, E., Kharrazi, H., Shi, L. et al. Identifying predictors of nursing home admission by using electronic health records and administrative data: scoping review. JMIR Aging 6, e42437 (2023).

  65. Favril, L., Yu, R., Uyar, A., Sharpe, M. & Fazel, S. Risk factors for suicide in adults: systematic review and meta-analysis of psychological autopsy studies. BMJ Ment. Health 25, 148–155 (2022).

  66. Sutar, R., Kumar, A. & Yadav, V. Suicide and prevalence of mental disorders: a systematic review and meta-analysis of world data on case-control psychological autopsy studies. Psychiatry Res. 329, 115492 (2023).

  67. Fazel, S. & Runeson, B. Suicide. N. Engl. J. Med. 382, 266–274 (2020).

  68. Lee, D., Jiang, X. & Yu, H. Harmonized representation learning on dynamic EHR graphs. J. Biomed. Inform. 106, 103426 (2020).

  69. Panickan, V. A., CELEHS & Tong, H. celehs/game: representation learning to advance multi-institutional studies with electronic health record data https://github.com/celehs/GAME (2026).

Acknowledgements

This research was supported by the Office of Research and Development, Veterans Health Administration, under award MVP000. This work also used resources of the Knowledge Discovery Infrastructure (KDI) at Oak Ridge National Laboratory, supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC05-00OR22725. The contents of this publication do not represent the views of the U.S. Department of Veterans Affairs or the United States Government. D.Z. was supported by the MOE AcRF Tier 1 grant A-8003569-00-00 and the NUS Start-up grant A-0009985-00-00. Z.X. was supported by NIH grant 5R01NS098023. K.L. was supported by NIH grants P30 AR072577 and K24 AR085342. T.C. was supported by NIH grants R01 LM013614, R01 HL089778, P30 AR072577, and P50 MH129699.

Author information

Author notes
  1. These authors contributed equally: Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu.

  2. These authors jointly supervised this work: Katherine Liao, Tianxi Cai.

Authors and Affiliations

  1. Department of Statistics and Data Science, National University of Singapore, Singapore, Singapore

    Doudou Zhou

  2. Harvard T.H. Chan School of Public Health, Boston, MA, USA

    Doudou Zhou, Linshanshan Wang, Xin Xiong, Clara-Lea Bonzel & Tianxi Cai

  3. Department of Statistics, Columbia University, New York, NY, USA

    Han Tong

  4. Harvard Medical School, Boston, MA, USA

    Suqi Liu, Clara-Lea Bonzel, Vidul A. Panickan, J. Michael Gaziano, Kelly Cho, Katherine Liao & Tianxi Cai

  5. Department of Statistics, University of Chicago, Chicago, IL, USA

    Ziming Gan

  6. INSERM, Bordeaux Population Health Research Center, University Bordeaux, Bordeaux, France

    Griffier Romain, Boris P. Hejblum, Vianney Jouhet & Rodolphe Thiebaut

  7. Service d’Information Médicale, CHU de Bordeaux, Bordeaux, France

    Griffier Romain, Vianney Jouhet & Rodolphe Thiebaut

  8. Inria SISTM Team, Talence, France

    Boris P. Hejblum & Rodolphe Thiebaut

  9. Duke University, Durham, NC, USA

    Yun-Chung Liu & Chuan Hong

  10. VA Boston Healthcare System, Boston, MA, USA

    Tianrun Cai, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kelly Cho, Katherine Liao & Tianxi Cai

  11. Brigham and Women’s Hospital, Boston, MA, USA

    Tianrun Cai, J. Michael Gaziano, Kelly Cho & Katherine Liao

  12. Brown University, Providence, RI, USA

    Kevin Pan

  13. Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA, USA

    Kenneth D. Mandl

  14. Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA

    Zongqi Xia

Contributions

D.Z., H.T., L.W., and S.L. conceived the study. D.Z. and H.T. contributed to the methodology and model design. D.Z. conceptualized the study and H.T. led the implementation. H.T. and L.W. conducted data analysis and validation experiments. S.L., X.X., and Z.G. contributed to data preprocessing and experimental evaluation. R.G., B.H., V.J., and R.T. contributed to the French institutional data and clinical interpretation. Y.-C.L. and C.H. contributed to data extraction and site-specific implementation at Duke University. C.-L.B., V.A.P., K.P., and Z.X. assisted with data harmonization and result interpretation. T.R.C., Y.-L.H., L.C., J.M.G., and K.C. contributed to study design, clinical interpretation, and application development. K.M. contributed to informatics design and integration strategy. K.L. and T.C. jointly supervised the study. All authors contributed to manuscript writing and approved the final version.

Corresponding authors

Correspondence to Katherine Liao or Tianxi Cai.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Majid Afshar and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

  • Supplementary Information (PDF)
  • Peer Review File (PDF)
  • Reporting Summary (PDF)

Source data

  • Source Data (XLSX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhou, D., Tong, H., Wang, L. et al. Representation learning to advance multi-institutional studies with electronic health record data from US and France. Nat Commun (2026). https://doi.org/10.1038/s41467-026-71152-1

  • Received: 24 February 2025

  • Accepted: 11 March 2026

  • Published: 03 April 2026

  • DOI: https://doi.org/10.1038/s41467-026-71152-1
