Abstract
Microbial safety is fundamental to ensuring water quality, particularly in the Yangtze River Basin, China’s most critical drinking water source. Despite its ecological and economic importance, the basin faces significant anthropogenic pressures, including wastewater discharge, which may elevate the risk of pathogenic contamination. However, fragmented sampling efforts and limited coverage have hindered a systematic understanding of pathogenic microbial diversity and distribution across this vast ecosystem. We applied a novel bioinformatic pipeline, which leverages genome-specific markers (GSMs) to accurately identify and quantify potential pathogenic taxa in metagenomic data, to 625 publicly available metagenomes spanning water, sediments, and riparian soils along the 6,300 km Yangtze River continuum. We reconstructed a potential pathogen catalog comprising 403 taxa, substantially expanding the known pathogen diversity in this large river ecosystem. We also generated richness distribution maps of potential pathogens for water, sediments, and soils along the Yangtze River. This basin-scale pathogen inventory not only establishes a baseline for potential pathogenic bacterial communities in the Yangtze River Basin but also serves as a reference library for rapid biosurveillance and risk management at genomic resolution.
Data availability
Data are available at the figshare repository (https://doi.org/10.6084/m9.figshare.30196462)29. The repository contains four items: the spatial distribution maps for water, sediments and soils, and three datasets (S1 to S3). Dataset S1 contains the metadata of the samples used for pathogen detection analysis, including the sources of the original metagenomic sequencing data used in this study. Dataset S2 lists the potential pathogen species identified by GSMer through GSM-based matching in the Yangtze River Basin, together with their potential hosts. Dataset S3 provides the georeferenced sampling locations and pathogen richness values used in spatial mapping.
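As a rough illustration of how per-sample richness values such as those in Dataset S3 could be derived from per-sample detections such as those in Dataset S2, the sketch below counts distinct taxa per sample and averages the counts by habitat. The record layout, the field names (`sample`, `habitat`, `taxon`) and the example rows are assumptions made for illustration; they are not the actual column names or contents of the figshare files, and this is not the authors' code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical detection records: one row per (sample, taxon) hit.
# Field names and values are illustrative, not taken from the figshare files.
detections = [
    {"sample": "W01", "habitat": "water", "taxon": "Escherichia coli"},
    {"sample": "W01", "habitat": "water", "taxon": "Klebsiella pneumoniae"},
    {"sample": "W01", "habitat": "water", "taxon": "Escherichia coli"},  # repeated hit
    {"sample": "W02", "habitat": "water", "taxon": "Escherichia coli"},
    {"sample": "S01", "habitat": "sediment", "taxon": "Clostridium perfringens"},
    {"sample": "R01", "habitat": "soil", "taxon": "Bacillus cereus"},
    {"sample": "R01", "habitat": "soil", "taxon": "Clostridium perfringens"},
]


def richness_per_sample(records):
    """Return {sample: number of distinct taxa detected in that sample}."""
    taxa = defaultdict(set)
    for rec in records:
        taxa[rec["sample"]].add(rec["taxon"])  # sets ignore repeated hits
    return {sample: len(names) for sample, names in taxa.items()}


def mean_richness_by_habitat(records):
    """Return {habitat: mean per-sample richness across its samples}."""
    sample_habitat = {rec["sample"]: rec["habitat"] for rec in records}
    by_habitat = defaultdict(list)
    for sample, richness in richness_per_sample(records).items():
        by_habitat[sample_habitat[sample]].append(richness)
    return {habitat: mean(values) for habitat, values in by_habitat.items()}


print(richness_per_sample(detections))
print(mean_richness_by_habitat(detections))
```

Richness is treated here simply as the number of distinct taxa per sample; any detection thresholds or normalization used in the study would need to be applied before this step.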
Code availability
The parameters of all programs used for the analysis are described in the main text. The GSM library construction code is available at https://github.com/yedeng-lab/humanpathogen-GSM.
References
Hu, Y. et al. Annual trends and health risks of antibiotics and antibiotic resistance genes in a drinking water source in East China. Science of The Total Environment 791, 148152 (2021).
Pandey, P. K., Kass, P. H., Soupir, M. L., Biswas, S. & Singh, V. P. Contamination of water resources by pathogenic bacteria. AMB Expr 4, 51 (2014).
Oon, Y.-L. et al. Waterborne pathogens detection technologies: Advances, challenges, and future perspectives. Front. Microbiol. 14, 1286923 (2023).
Liu, W. et al. Unraveling pathogen dynamics in rivers flowing into Taihu Lake: Insights from high-throughput sequencing and environmental correlations. Water Research X 29, 100406 (2025).
Carraro, L., Mächler, E., Wüthrich, R. & Altermatt, F. Environmental DNA allows upscaling spatial patterns of biodiversity in freshwater ecosystems. Nat Commun 11, 3585 (2020).
Deiner, K., Fronhofer, E. A., Mächler, E., Walser, J.-C. & Altermatt, F. Environmental DNA reveals that rivers are conveyer belts of biodiversity information. Nat Commun 7, 12544 (2016).
Ding, J. et al. Impacts of land use on surface water quality in a subtropical river basin: A case study of the Dongjiang River Basin, southeastern China. Water 7, 4427–4445 (2015).
McKee, A. M. & Cruz, M. A. Microbial and viral indicators of pathogens and human health risks from recreational exposure to waters impaired by fecal contamination. J. Sustainable Water Built Environ. 7, 03121001 (2021).
Hofstra, N. Quantifying the impact of climate change on enteric waterborne pathogen concentrations in surface water. Current Opinion in Environmental Sustainability 3, 471–479 (2011).
Hales, S. Climate change, extreme rainfall events, drinking water and enteric disease. Reviews on Environmental Health 34, 1–3 (2019).
Seymour, J. R. & McLellan, S. L. Climate change will amplify the impacts of harmful microorganisms in aquatic ecosystems. Nat Microbiol 10, 615–626 (2025).
Girones, R. et al. Molecular detection of pathogens in water–the pros and cons of molecular techniques. Water Res 44, 4325–4339 (2010).
Gu, W. et al. Rapid pathogen detection by metagenomic next-generation sequencing of infected body fluids. Nat Med 27, 115–124 (2021).
Gallagher, T., Phan, J. & Whiteson, K. Getting Our Fingers on the Pulse of Slow-Growing Bacteria in Hard-To-Reach Places. J Bacteriol 200, e00540–18 (2018).
Aw, T. G. & Rose, J. B. Detection of pathogens in water: from phylochips to qPCR to pyrosequencing. Curr Opin Biotechnol 23, 422–430 (2012).
Wang, J., Han, Y. & Feng, J. Metagenomic next-generation sequencing for mixed pulmonary infection diagnosis. BMC Pulm Med 19, 252 (2019).
Tu, Q., He, Z. & Zhou, J. Strain/species identification in metagenomes using genome-specific markers. Nucleic Acids Research 42 (2014).
Li, T. et al. Beyond water and soil: Air emerges as a major reservoir of human pathogens. Environment International 190, 108869 (2024).
NCBI sequence read archive https://identifiers.org/insdc.sra:SRP288687 (2020).
NCBI sequence read archive https://identifiers.org/insdc.sra:SRP217764 (2020).
NCBI sequence read archive https://identifiers.org/insdc.sra:SRP394638 (2023).
NCBI sequence read archive https://identifiers.org/insdc.sra:SRP201455 (2019).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA006054 (2023).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA008231 (2023).
National Microbiology Data Center (NMDC) https://nmdc.cn/resource/genomics/project/detail/NMDC10020587 (2026).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Wang, B. et al. Tackling Soil ARG‐Carrying Pathogens with Global‐Scale Metagenomics. Advanced Science 10, 2301980 (2023).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017).
Wang, J., Wang, S., Li, T., Hou, W. & Deng, Y. A watershed-scale potential pathogenic bacteria dataset from the Yangtze River Basin. figshare https://doi.org/10.6084/m9.figshare.30196462 (2026).
Acknowledgements
This work was supported by the Opening Project of the State Key Laboratory of Geomicrobiology and Environmental Changes (51830100303), the National Key Research and Development Program of China (Grant 2022YFC3204703) and the National Natural Science Foundation of China (Grant 42277104).
Author information
Contributions
J.W. generated the data and contributed to manuscript writing and revision. S.W. and Y.D. designed the study, organized the research, and contributed to manuscript writing and revision. T.L. contributed to code writing and data analysis. W.G.H. contributed to manuscript revision.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, J., Wang, S., Li, T. et al. A watershed-scale potential pathogenic bacteria dataset from the Yangtze River Basin. Sci Data (2026). https://doi.org/10.1038/s41597-026-06983-0
DOI: https://doi.org/10.1038/s41597-026-06983-0


