Background & Summary

Antimicrobial resistance (AMR) has emerged as a critical global public health threat, exacerbated by the overuse and misuse of antibiotics, leading to the rise of antibiotic-resistant bacteria. This resistance results in treatment failures, rendering certain life-threatening infections untreatable. According to the World Health Organization (WHO), AMR is projected to become a leading cause of death worldwide by 2050, with an estimated 10 million deaths annually1,2,3. Antibiotic resistance poses a significant risk not only to individual health but also to global health and economic stability. Consequently, the discovery of novel antibiotics, especially those targeting multidrug-resistant bacteria, has become an urgent global priority.

Actinomycetes, a diverse group of microorganisms, have long been a cornerstone in the discovery of antibiotics4. These bacteria are distributed across various environments, including soil, wetlands, and marine ecosystems, where their broad distribution and environmental adaptability confer substantial genetic diversity5. This genetic diversity enables actinomycetes to thrive across different ecological niches, providing a valuable reservoir for discovering new antibiotics. Marine-derived actinomycetes, in particular, face unique ecological pressures—such as spatial competition, predation, and extreme environmental conditions—leading to the evolution of distinctive biochemical pathways and bioactive compounds not found in their terrestrial counterparts. These traits suggest that marine actinomycetes may hold untapped potential for antibiotic discovery6,7. Exploring actinomycetes from diverse habitats offers the opportunity to discover natural products with broad-spectrum antimicrobial, antifungal, and antiviral properties, with significant applications in clinical medicine, agriculture, and pest control4.

However, many potential antibiotic biosynthetic gene clusters in actinomycetes remain dormant under standard laboratory conditions. Recent research has uncovered numerous “silent gene clusters” in their genomes, which are not expressed during routine cultivation, limiting their discovery potential8,9. Streptomyces, the largest genus of actinomycetes, harbors 25 to 50 biosynthetic gene clusters per genome, but up to 90% of these clusters remain inactive under conventional laboratory conditions. Despite the vast biosynthetic potential encoded in actinomycete genomes, much of it remains untapped under natural growth conditions10,11,12. While substantial genomic resources from soil-derived actinomycetes have been accumulated13,14, the genomic data from actinomycetes originating from other environments remain limited, constraining the broader application of gene cluster activation strategies. Therefore, exploring the diversity of actinomycetes from various ecosystems is crucial for uncovering new biosynthetic pathways and advancing antibiotic discovery.

In this study, we conducted a comprehensive analysis of the high-quality genome sequences of 211 actinomycete strains collected from diverse sampling sites and environments, aiming to provide a valuable dataset for downstream analyses and biological resource exploration. The average sequencing depth of the 211 actinomycete genomes was approximately 328X, with at least 95% completeness and less than 5% contamination, indicating high-quality genome sequencing and assembly. Through genome alignment and classification, our study revealed that all 211 strains belong to the phylum Actinomycetota and the class Actinomycetes, encompassing four orders, eleven families, twenty genera, and seventy-six known species. The most abundant families identified were Streptomycetaceae (n = 134), Micromonosporaceae (n = 25), and Microbacteriaceae (n = 14). Additionally, 32 actinomycete strains could not be assigned to any defined species in the Genome Taxonomy Database (GTDB). These unclassified strains may possess genomic features and metabolic potentials distinct from known actinomycetes, and their uniqueness presents opportunities for further research. In conclusion, the diverse origins of the actinomycetes render this dataset highly representative, highlighting the extensive diversity and unexplored potential of these microorganisms. It offers a valuable resource for research in natural product discovery, environmental microbiology, and biotechnology, and facilitates the identification of novel secondary metabolites and a more comprehensive understanding of microbial bioactivity.

Methods

Sample collection, isolation, and culture of microbes

Samples were systematically collected over the course of 14 regularly scheduled sampling trips, spanning a period of 20 years from September 2001 to September 2021, across China and Vietnam. Sample collection encompassed a wide range of environmental sources, with the majority derived from mangrove sediment (77 samples), mountain soil (33 samples), and sponge homogenate (29 samples). Additional sources included seawater, sediment, rhizosphere soil, gut, and others (Fig. 1, Supplementary Table S1).

Fig. 1

Map of Sample Collection Sites (a) and overview of the sample sources (b). (a) This map shows the geographic distribution of sample collection sites in southern China (including Guangxi, Guangdong, and Hainan) and Vietnam. The size of the red dots is proportional to the number of samples collected at each site (small dot: 10 samples; medium dot: 20 samples; large dot: 30 samples). A legend is provided in the upper right corner to explain the relationship between dot size and sample quantity. Provincial and national names are labeled in blue text, and a compass rose and scale bar (1000 km) are included for geographic reference. (b) The distribution of sample sources. The x-axis denotes the sources from which samples were isolated, while the y-axis indicates the total number of samples collected.

Soil, rhizosphere, and sea sand samples were systematically collected using a multi-point strategy, combining five random sampling points into a composite sample. Surface soil was collected from 0–10 cm depths with shovels and trowels, while rhizosphere soil was obtained by dislodging adhered soil from plant roots. Sea sand was collected from the surface layer. Seawater samples were collected after triple-rinsing bags in situ to ensure sterility. Marine organisms were processed by homogenizing tissues or isolating gut contents; approximately 1 g of tissue was homogenized in sterile seawater and serially diluted for analysis. All procedures adhered to aseptic techniques, with methods tailored to sample source. This approach ensures the integrity and representativeness of samples for subsequent microbiological isolation.

Actinomycete strains were isolated using the spread plate technique15 on 2216E agar (Hopebio, Qingdao, China) and Gauze’s agar (Hopebio, Qingdao, China). The 2216E agar was prepared with the following ingredients (g/L): peptone 5.0, yeast extract 1.0, ferric citrate 0.1, sodium chloride 19.45, magnesium chloride 5.98, sodium sulfate 3.24, calcium chloride 1.8, potassium chloride 0.55, sodium carbonate 0.16, potassium bromide 0.08, strontium chloride 0.034, boric acid 0.022, sodium silicate 0.004, sodium fluoride 0.0024, sodium nitrate 0.0016, and disodium hydrogen phosphate 0.008, with a pH value of 7.6 ± 0.2. The Gauze’s agar was composed of (g/L): potassium nitrate 1.0, potassium dihydrogen phosphate 0.5, magnesium sulfate 0.5, ferrous sulfate 0.01, sodium chloride 0.5, soluble starch 20.0, and agar 15.0, with a pH value of 7.2–7.4. Plates were incubated at 28 °C for 7–14 days to allow colony formation. Distinct colonies were selected by careful visual examination of the agar plates, considering differences in color, size, shape, elevation, surface characteristics, margin shape, and glossiness. These differences indicated potential genetic or phenotypic variations among the actinomycete strains. Using a sterile inoculating loop or needle, each visibly distinct colony was individually picked from the original agar plate and transferred to a fresh 2216E agar plate. Purified isolates were identified by 16S rDNA PCR analysis. For long-term storage, isolates were preserved in glycerol stocks at −80 °C.

Genomic DNA extraction and high-throughput sequencing

Isolates were cultured in 2216E broth at 28 °C for 36–48 hours, and cells in the mid-log phase were harvested by centrifugation at 4,000 rpm for 10 minutes. Genomic DNA was extracted using the TIANamp Bacteria DNA Kit (a spin-column-based kit; Tiangen, China). DNA purity was assessed from spectrophotometer-measured A260/A280 (1.7–2.0) and A260/A230 (2.0–2.2) ratios, and integrity was checked by agarose gel electrophoresis. Sequencing libraries were prepared using the Illumina TruSeq DNA Sample Preparation Kit: genomic DNA was sonicated, followed by end-repair, A-tailing, and ligation of Illumina adapters, with magnetic bead purification and size selection. PCR amplification within this workflow was performed using KAPA HiFi HotStart DNA Polymerase, typically with 15 cycles, which is crucial for high-GC-content organisms. The PCR product was then purified and resuspended in Elution Buffer. Library concentration and fragment size were measured using Qubit 4.0 and QSep400. Qualified libraries were sequenced on the Illumina NovaSeq 6000 platform: libraries were denatured with NaOH to single strands, diluted, hybridized to flow cell adapters, amplified via bridge PCR on cBot, and then sequenced.

Genome assembly

Quality control of the raw sequencing data was conducted using Fastp (v0.12.0)16 with default parameters. Reads containing adapter sequences, reads with more than five ambiguous bases (N), and low-quality reads (defined as reads in which over 40% of bases have a quality score below 15) were excluded. The genomes of the cultivated isolates were assembled using Unicycler (v0.5.0)17 with default parameters, and only contigs ≥500 bp were retained. Unicycler functions as a SPAdes optimiser when given short-read-only data sets. The completeness, contamination, and strain heterogeneity of each genome were evaluated using the ‘lineage_wf’ module of CheckM (v1.2.1)18.
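The two read-level criteria stated above (ambiguous bases and the fraction of low-quality bases) can be illustrated with a minimal sketch. This is not Fastp's implementation: the function names `phred33` and `passes_filter` are hypothetical, Phred+33 quality encoding is assumed, and adapter detection is not modeled.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in qual_string]


def passes_filter(seq, quals, max_n=5, min_q=15, max_low_frac=0.40):
    """Return True if a read survives the two filters described in the text.

    seq   : base string of the read
    quals : list of integer Phred scores, one per base
    A read is rejected if it has more than `max_n` ambiguous bases (N),
    or if more than `max_low_frac` of its bases score below `min_q`.
    """
    if not quals or len(seq) != len(quals):
        return False  # malformed or empty read (illustrative choice)
    if seq.upper().count("N") > max_n:
        return False
    low = sum(1 for q in quals if q < min_q)
    return low / len(quals) <= max_low_frac


# Toy reads (made up for illustration only)
good = passes_filter("ACGT" * 5, [30] * 20)              # all bases Q30
too_many_n = passes_filter("N" * 6 + "A" * 14, [30] * 20)  # six N bases
low_quality = passes_filter("ACGT" * 5, [10] * 10 + [30] * 10)  # 50% below Q15
```

Keeping the thresholds as keyword arguments makes it easy to see how many reads a stricter or looser setting would discard.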

After quality control, a total of 460.49 Gbp of clean, high-quality data were retained for further analysis, with 97.67% of bases achieving a quality score of ≥ 20 and an average yield of 2.18 Gbp per sample (Supplementary Table S2). The total lengths of the 211 assembled genomes ranged from 2,462,157 to 14,777,510 bp, with an average length of 7,350,478 bp. The N50 values ranged from 58,110 to 2,055,415 bp, with an average of 196,859 bp. The average sequencing depth across the 211 genomes was approximately 328X, with coverage ranging from 148X to 1485X. The first and third quartiles were 244X and 343X, respectively. The number of contigs per genome ranged from 7 to 368, with an average of 109. All 211 genomes exhibited a completeness of at least 95% and a contamination level below 5%, meeting the completeness and contamination criteria defined by MISAG for high-quality genomes (completeness >90% and contamination <5%)19 (Supplementary Table S1).
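For readers who want to recompute such summary metrics from their own assemblies, the following is a minimal sketch of the standard N50 definition, a simple yield-over-length depth estimate, and the completeness and contamination thresholds quoted above. The function names are hypothetical, and the depth estimate is an assumption about how depth might be approximated, not necessarily how the reported values were derived.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total of contigs,
    sorted from longest to shortest, first reaches half the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs supplied")
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length


def approx_depth(clean_bases, assembly_length):
    """Rough sequencing depth: clean bases divided by assembly length."""
    return clean_bases / assembly_length


def meets_quality_thresholds(completeness, contamination):
    """Completeness >90% and contamination <5%, as quoted in the text."""
    return completeness > 90.0 and contamination < 5.0


# Toy contig set (made up): total 300, half 150, reached at the 80 bp contig
toy_n50 = n50([100, 80, 60, 40, 20])

# Average yield divided by average genome length, both taken from the text
depth_from_averages = approx_depth(2.18e9, 7_350_478)
```

Dividing the average yield by the average genome length gives roughly 297X. This need not match the reported mean depth of about 328X, because a mean of per-genome ratios generally differs from the ratio of the means.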

Taxonomic classification

Taxonomic classification of each genome was performed using the Genome Taxonomy Database Toolkit (GTDB-Tk v2.4.0)20 with reference to GTDB release r220 (ref. 21). The phylogenetic affiliation and diversity of the 211 actinomycete strains were determined using the “classify_wf” module in GTDB-Tk, which identified 120 bacterial marker genes and constructed multiple sequence alignments based on them.

According to GTDB release r220 (ref. 21), all 211 genomes were classified within the phylum Actinomycetota and class Actinomycetes, encompassing 4 orders, 11 families, 20 genera, and 76 known species. Notably, 32 genomes (15.2%) could not be assigned to any defined species in GTDB, suggesting that these genomes may represent novel taxa (Supplementary Table S1).
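Genomes lacking a species assignment can be identified from GTDB-style lineage strings, in which each rank carries a prefix such as `g__` or `s__` and an unassigned species appears as an empty `s__` field. The sketch below assumes that format; the helper names and the example lineages are illustrative and are not taken from Supplementary Table S1.

```python
def parse_lineage(lineage):
    """Split a GTDB-style lineage string into a {rank_prefix: name} dict."""
    ranks = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        ranks[prefix] = name
    return ranks


def has_species(lineage):
    """True if the species field (s__) is present and non-empty."""
    return bool(parse_lineage(lineage).get("s", ""))


def percent_unassigned(lineages):
    """Percentage of lineages without a species-level assignment."""
    missing = sum(1 for lineage in lineages if not has_species(lineage))
    return 100.0 * missing / len(lineages)


# Illustrative lineages (placeholder species name, not from the dataset)
assigned = ("d__Bacteria;p__Actinomycetota;c__Actinomycetes;"
            "o__Streptomycetales;f__Streptomycetaceae;"
            "g__Streptomyces;s__Streptomyces example_species")
unassigned = ("d__Bacteria;p__Actinomycetota;c__Actinomycetes;"
              "o__Streptomycetales;f__Streptomycetaceae;"
              "g__Streptomyces;s__")

toy_percent = percent_unassigned([assigned, unassigned, unassigned, assigned])

# The proportion reported in the text: 32 of 211 genomes
reported_percent = round(100.0 * 32 / 211, 1)
```

The same arithmetic reproduces the reported proportion: 32 of 211 genomes rounds to 15.2%.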

Genome annotation

Functional annotation of the 211 actinomycete genomes was performed using Prokka v1.14.6 (ref. 22) with default parameters. According to the Prokka annotation, the number of CDS per genome ranged from 2,211 to 13,369, with an average of 6,541. The number of rRNA genes ranged from 1 to 6, with an average of 3, while the number of tRNA genes ranged from 43 to 101, with an average of 76 (Supplementary Table S1).
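Per-genome feature counts of this kind can be tallied from a GFF3 annotation file, where the third tab-separated column holds the feature type. The following is a minimal sketch under that assumption; `count_features` is a hypothetical helper and the toy records are made up, not actual Prokka output from this dataset.

```python
from collections import Counter


def count_features(gff_text, wanted=("CDS", "rRNA", "tRNA")):
    """Count selected feature types in GFF3 text.

    Comment and directive lines (starting with '#') are skipped, and
    parsing stops at a '##FASTA' directive, after which GFF3 files may
    embed the sequence itself.
    """
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) >= 9 and cols[2] in wanted:
            counts[cols[2]] += 1
    return {feature: counts[feature] for feature in wanted}


def make_record(feature_type, start, end):
    """Build one toy nine-column GFF3 record (illustrative values only)."""
    return "\t".join(["contig_1", "toy", feature_type, str(start), str(end),
                      ".", "+", "0", "ID=toy_%d" % start])


toy_gff = "\n".join([
    "##gff-version 3",
    make_record("CDS", 1, 300),
    make_record("CDS", 400, 900),
    make_record("tRNA", 1000, 1075),
    make_record("rRNA", 2000, 3500),
    "##FASTA",
    ">contig_1",
    "ACGTACGT",
])

toy_counts = count_features(toy_gff)
```

Ranges and averages across genomes then follow by applying `min`, `max`, and a mean to the per-genome counts.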

Data Records

The sequencing reads, assembled genomes (e.g., GCA_965341825.1 (ref. 23), GCA_965341695.1 (ref. 24), GCA_965342205.1 (ref. 25)) and corresponding sample metadata are available in the European Nucleotide Archive (ENA) under project accession number PRJEB89966 (ref. 26). Detailed accession numbers for these genomes are provided in Supplementary Table S3. The data above have also been deposited in the CNGB Sequence Archive (CNSA)27 of the China National GeneBank Database (CNGBdb)28 under project accession number CNP0006543 (ref. 29).

Technical Validation

To ensure the technical quality and reliability of the dataset, multiple validation steps were implemented. The extraction process for each sample source was designed to ensure sample integrity for culturable actinomycete isolation. Soil samples were collected from 0–10 cm depths using shovels and trowels, and rhizosphere soil was obtained by dislodging soil from plant roots. Sea sand samples were collected from the surface layer, and seawater samples were collected after triple-rinsing bags in situ. Marine organisms were processed by homogenizing tissues or isolating gut contents, with 1 g of tissue homogenized in sterile seawater and serially diluted. All procedures adhered to aseptic techniques, tailored to the specific sample type, to ensure the quality and reliability of the isolates. Genomic DNA quality was assessed via NanoDrop (A260/280 and A260/230 ratios), Qubit fluorometry, and agarose gel electrophoresis to confirm integrity and purity. DNA library quality was evaluated using Qubit 4.0 for concentration and QSep400 for insert size distribution. A total of 460.49 Gbp of high-quality data was generated (Supplementary Table S2). Raw reads underwent quality control using Fastp, with removal of adapter-containing reads, reads with excess ambiguous bases, and low-quality reads, followed by genome assembly using Unicycler. Assembly quality was validated using CheckM, with every genome showing completeness ≥95% and contamination <5% (Supplementary Table S1). The complete dataset, including assembly metrics and validation parameters, is publicly available in the CNGB Sequence Archive to support reproducibility and community validation.