Abstract
Staphylococcus aureus (S. aureus) can cause various infections in humans and animals, contributing to high morbidity and mortality. To prevent and control cross-species transmission of S. aureus, it is necessary to understand the host-associated genetic variants. We performed a two-stage genome-wide association study (GWAS) including initial screening and further validation to compare genomic differences between human and pig S. aureus, aiming to identify host-associated determinants. Our multiple GWAS analyses found six consensus significant k-mers associated with host species, providing novel genetic evidence for distinguishing human from pig S. aureus. The best k-mer predictor achieved a high classification accuracy of 98.12% on its own and had extremely high resolution similar to the SNPs-based phylogeny, offering a very simple target for predicting the cross-species transmission risk of S. aureus. The final k-mer model revealed that 90% of S. aureus isolates from farm workers were predicted as livestock origin, suggesting a high risk of cross-species transmission. Bayesian inference revealed different cross-species transmission directions, with the human-to-pig transmission for ST5 and the pig-to-human transmission for ST398. Our findings provide a simple and accurate k-mer model for identifying and predicting the cross-species transmission risk of S. aureus.
Introduction
Staphylococcus aureus (S. aureus), often described as a “superbug”, is one of the most important and widespread human pathogens. It can colonize the nasopharynx asymptomatically and can also enter the bloodstream, leading to life-threatening invasive disease and causing major morbidity and mortality globally1,2. Moreover, S. aureus can colonize or infect various animals (livestock, companion animals, and wildlife), and such isolates have been regarded as livestock-associated S. aureus (LA-SA)3,4. More recently, the epidemiology of S. aureus has been reshaped by the increasing emergence of LA-SA clones that infect humans through direct or indirect exposure to animals, the food chain, and environmental routes, with livestock-associated ST9 prevalent in Asia and ST398 in Europe and North America5,6,7. Growing evidence indicates that S. aureus isolates from animals and from related workers are indistinguishable8, suggesting a potential risk of cross-species transmission.
Defining the host origin of S. aureus is the first and most important step in preventing potential cross-species transmission. A growing number of studies have attempted to define livestock-associated isolates based on molecular characteristics and genotypes9,10, but the specific biomarkers that distinguish livestock-associated from human-associated isolates remain unclear. High-throughput whole-genome sequencing (WGS) has become an essential method for generating high-dimensional genomic data and performing genome-wide association studies (GWAS). Bacterial GWAS analyses are widely used to reveal statistical links between genetic variation and phenotypes of interest, such as host specificity, antimicrobial resistance, and disease susceptibility11,12,13. However, traditional GWAS analyses typically use single-nucleotide polymorphisms (SNPs) as predictors to test for genotype-phenotype associations, which makes it difficult to detect rare variants14. To overcome this limitation, we used k-mers (i.e., DNA substrings of length k) as an alternative genetic determinant and performed a k-mer-based GWAS, which can capture various types of host-associated genetic variation, such as SNPs, insertions and deletions, and genes15.
To clarify the cross-species transmission risk of S. aureus, it is important to identify host-associated genetic determinants to distinguish animal from human isolates. Here, we performed a two-stage k-mer-based GWAS of human and pig S. aureus isolates to identify host-specific genomic variation (Fig. 1) and predict host species of S. aureus. Our findings may provide novel insights for elucidating and preventing S. aureus cross-species transmission.
Results
Characteristics of the S. aureus isolates
We analyzed the genomes of 797 S. aureus isolates collected from 1997 to 2021, including 488 human isolates and 309 pig isolates (Supplementary Data 1). The human isolates were sampled from blood (n = 460), joint fluid (n = 18), ascitic fluid (n = 4), bile (n = 3), cerebrospinal fluid (n = 2), and abscess (n = 1). All isolates came from China, including Hubei (n = 208), Taiwan (n = 70), Shandong (n = 59), Jiangxi (n = 33), and Shanghai (n = 24), among other regions. The isolates corresponded to 44 different sequence types (STs) belonging to 21 clonal complexes (CCs). The predominant genotype among pig isolates was CC9 (ST9 and ST3957), whereas the most common genotypes among human isolates were the non-CC9 lineages CC59 (ST59) and CC5 (ST5). Among the 98 Staphylococcal protein A (spa) types found across all S. aureus isolates, the most frequent spa type among pig isolates was t899, mostly belonging to CC9, while the most prevalent spa type among human isolates was t437.
Identification of host-associated genotypes by association analysis
In the association analyses for host-associated genotypes, we used the Bonferroni correction (α/N) to control for false-positive rates resulting from multiple comparisons of 44 STs, 21 CCs, and 98 spa types (adjusted P value threshold being 0.0011, 0.0023, and 0.0005, respectively). Clone association analysis revealed statistical differences in the proportion of specific CCs between human and pig isolates (CC9, 0.0% versus 89.6%; CC59, 38.9% versus 0.3%; CC5, 20.5% versus 1.0%; CC239, 8.8% versus 0.6%; CC72, 4.1% versus 0.0%; all P < 0.0023, Table 1), with significantly high rates of livestock-associated CC9 in pig isolates. Similar differences were observed for certain STs (ST9, ST3597, ST59, ST5, ST965, ST239, and ST72) between the two groups, with markedly high rates of livestock-associated ST9 in pig isolates. Moreover, there were significant differences in the proportion of certain spa types (t899, t437, t2, t172, and t62) between the two groups. According to the phylogenetic trees (Fig. 2 and Supplementary Fig. S1), we observed that human isolates were present across the tree and even clustered in the same branches with pig isolates, indicating the emergence of human clones from multiple genetic backgrounds.
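The α/N arithmetic behind these thresholds can be sketched as follows. This is a minimal illustration that assumes a nominal significance level of α = 0.05, which is not stated explicitly in the text; the function name is hypothetical. Note that 0.05/21 ≈ 0.00238, which the text reports as 0.0023.

```python
# Bonferroni-adjusted significance threshold: alpha divided by the number of tests.
ALPHA = 0.05  # assumed nominal significance level (not stated in the text)


def bonferroni_threshold(n_tests: int, alpha: float = ALPHA) -> float:
    """Return the per-test P value threshold for n_tests comparisons."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Numbers of genotype categories compared in the association analysis.
comparisons = {"ST": 44, "CC": 21, "spa type": 98}
thresholds = {name: bonferroni_threshold(n) for name, n in comparisons.items()}

for name, value in thresholds.items():
    print(f"{name}: adjusted threshold = {value:.5f}")
```

The same function applies unchanged to any other number of simultaneous tests.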
Preliminary screening for host-associated k-mers by LMM
After filtering out low-frequency k-mers (minor allele frequency <1%), 14,815,802 k-mers were selected for k-mer-based GWAS using the univariate linear mixed model (LMM) method to test for genetic associations with host species (human or pig). There were 128,169 genome-wide significant host-associated k-mers after adjusting for population structure (Fig. 3a), of which 24,761 k-mers mapped to 731 unique genes with known functions. The quantile-quantile (QQ) plot showed no apparent issues, indicating adequate control for population structure as a confounder in the LMM-based GWAS (Supplementary Fig. S2). In the preliminary screening model, the presence/absence patterns of all 24,761 k-mers were included as predictors in a Random Forest (RF) model, which achieved a high classification accuracy (98.87%, Table 2) and a high AUC value (0.99).
The importance of these significant host-associated k-mers was estimated and ranked by the RF model. Figure 3b illustrates the importance of the 100 top-ranked k-mers; most were associated with hydrolase (75%), followed by antibiotic resistance (7%) and transposase (5%). To improve understanding of host-specific association, the top 100 host-associated k-mers were functionally annotated using Gene Ontology (GO) categories (Fig. 3c). The top two GO terms were hydrolase and antibiotic resistance, indicating that hydrolase played the main role in the molecular function category and antibiotic resistance played a crucial role in the biological process category.
Further validation of host-associated k-mers by multiple GWAS
To minimize false-positive associations and avoid redundancy among the 24,761 k-mer predictors, we used Scoary (https://github.com/AdmiralenOla/Scoary), least absolute shrinkage and selection operator (LASSO; https://github.com/cran/glmnet), and extreme gradient boosting (XGBoost; https://github.com/dmlc/xgboost) to identify consensus associations with host species. Interestingly, multiple GWAS based on the final subset of 2645 k-mers (out of 24,761 k-mers) revealed consensus associations (Fig. 4a), with six consensus k-mers identified by all three methods. The much simpler model with these six host-associated k-mers reached a classification accuracy of 98.58% (Table 2) and an AUC value of 0.99 (Fig. 4b), which was comparable to the model with 24,761 k-mers. Importantly, the highest-ranked predictor (kmer_2937 in mecR1) gave a classification accuracy of 98.12% on its own (Table 2), suggesting that a single k-mer classifier is very powerful. We also performed an additional external validation RF analysis on a small independent dataset of S. aureus genomes comprising 40 human and 40 pig isolates available in the NCBI database (Supplementary Data 2). The prediction models based on the six k-mers and on the best k-mer (kmer_2937 in mecR1) achieved classification accuracies of 95.00% and 88.18%, respectively.
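The consensus step amounts to intersecting the sets of k-mers selected by each method. The sketch below shows that operation only; the k-mer identifiers and the per-method selections are hypothetical toy values, not results from this study, and the function name is an assumption made for illustration.

```python
def consensus_kmers(selected_by_method: dict[str, set[str]]) -> set[str]:
    """Return the k-mers that every method selected (set intersection)."""
    if not selected_by_method:
        return set()
    return set.intersection(*selected_by_method.values())


# Hypothetical selections from three methods (toy identifiers only).
toy_selections = {
    "Scoary": {"kmer_a", "kmer_b", "kmer_c"},
    "LASSO": {"kmer_b", "kmer_c", "kmer_d"},
    "XGBoost": {"kmer_b", "kmer_c", "kmer_e"},
}

shared = consensus_kmers(toy_selections)
print(sorted(shared))
```

Requiring agreement across all methods is deliberately conservative: a k-mer flagged by only one or two methods is dropped.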
Fig. 4. a Upset plot visualization of the host-associated k-mers identified by three methods. b Receiver operating characteristic curve of the final model with six k-mers. c Importance of the six k-mers in the final model. d Proportion of k-mer predictors between human and pig isolates. e Change in risk score for a specific k-mer profile when the k-mer is present (y-axis) compared to absent (x-axis). f Heatmap of the host prediction of unknown origin S. aureus and the presence/absence patterns of k-mers.
According to the importance of the six k-mer predictors in the final model (Fig. 4c), the most important k-mers were kmer_2937 (in mecR1), kmer_1005 (in ccrA), and kmer_18092 (in ccrB), all of which were associated with antibiotic resistance. There were significant differences in the prevalence of these genetic elements (k-mers) between human and pig isolates (all P < 0.05; Fig. 4d). Figure 4e shows the effect of each k-mer on the estimated risk score (defined as the probability that an isolate is from a human host given a certain k-mer). A point above the diagonal indicates an increased risk of causing human infection when the k-mer is present. Specifically, we observed four k-mers above the diagonal (OR = 1.59 for kmer_2937 in mecR1; OR = 1.62 for kmer_18092 in ccrB; OR = 1.92 for kmer_913 in trpC; OR = 1.69 for kmer_22471 in tnpB1/tnpB2; Table 3), suggesting that the presence of these k-mers may increase the risk of causing human infection.
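The "above the diagonal" reading of Fig. 4e can be illustrated with a simple calculation. The sketch below estimates the probability of a human host from raw counts, separately for isolates with and without a k-mer. This is a simplification: the risk scores and odds ratios reported here come from the fitted models, not from crude counts, and the counts and function names below are hypothetical.

```python
def human_probability(n_human: int, n_pig: int) -> float:
    """Proportion of human isolates within one stratum (k-mer present or absent)."""
    total = n_human + n_pig
    if total == 0:
        raise ValueError("stratum contains no isolates")
    return n_human / total


def kmer_effect(present_human: int, present_pig: int,
                absent_human: int, absent_pig: int) -> dict:
    """Compare the human-host probability when a k-mer is present versus absent."""
    risk_present = human_probability(present_human, present_pig)  # y-axis
    risk_absent = human_probability(absent_human, absent_pig)     # x-axis
    return {
        "risk_present": risk_present,
        "risk_absent": risk_absent,
        # A point lies above the diagonal when presence raises the risk score.
        "above_diagonal": risk_present > risk_absent,
    }


# Hypothetical counts for a single k-mer (not data from this study).
effect = kmer_effect(present_human=80, present_pig=20,
                     absent_human=30, absent_pig=70)
print(effect)
```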
Prediction for cross-species transmission risk of S. aureus
To reveal the potential cross-species transmission risk of LA-SA, we used the RF classifier to predict the host species of S. aureus of unknown origin (40 isolates from farm workers, Supplementary Data 3). In the RF classifier with six k-mer predictors, 36 isolates (90.0%) were predicted to be of pig origin (Fig. 4f), suggesting a high risk of cross-species transmission of LA-SA. Similarly, in the classifier based on the best k-mer (kmer_2937), 37 isolates (92.5%) were predicted to be of pig origin, indicating that the prediction results were robust.
Inference for cross-species transmission directions of S. aureus
Although ST5 and ST398 were observed among both human and pig isolates, the direction of potential cross-species transmission remained unknown. We therefore conducted a Bayesian evolutionary analysis to infer the evolutionary history of ST5 and ST398 isolates. The analyses showed MCMC convergence, with an effective sample size (ESS) greater than 100 (Supplementary Figs. S3, S4), and a clear temporal signal in the tree (Supplementary Fig. S5). According to the time-based phylogenetic tree (Fig. 5), ST5 may have originated in humans around 1905 (95% CI: 1800–1957), with a host switch from humans to animals around 1937 (95% CI: 1873–1980). The time-based phylogeny of ST398 (Fig. 6) indicated that this lineage may have originated in pigs around 1888 (95% CI: 1692–1966), with an animal-to-human host switch around 1907 (95% CI: 1789–1971); we also observed host switches in the opposite direction, from humans to animals, around 1932 (95% CI: 1839–1979) and 1956 (95% CI: 1871–1995). Interestingly, the best k-mer (kmer_2937 in mecR1) was detected in almost all ST5 and ST398 isolates from humans but was absent from all isolates from pigs (Figs. 5 and 6), suggesting that the classification resolution of this k-mer is similar to that of the phylogenetic tree based on core SNPs. This provides further evidence of the importance of the best k-mer in differentiating animal from human isolates.
Discussion
To address the genetic background of cross-species infection by S. aureus, it is necessary to consider possible host-association models based on S. aureus genome data, an approach that has been used to identify pathogenicity models of S. epidermidis, E. coli, and S. pneumoniae16,17,18. The first is a simple host-specific clone model (Fig. 1a), in which only specific clones can invade humans (seen as one or a few lineages on the tree) and host-associated determinants are unevenly enriched in human and pig isolates. The second is a host-mediated infection model (Fig. 1b), in which multiple diverse clones can cause both human and animal infection (seen as several lineages on the tree) as a result of host factors, irrespective of host-associated determinants. The third is a host-associated genome model (Fig. 1c), in which enrichment of host-associated genetic elements may increase the risk of invading specific hosts, so that multiple diverse clones appear able to cause human infection.
Host-specific genotypes of S. aureus may aid in differentiating animal from human isolates and thereby clarify potential cross-species transmission. Among pig isolates in the present study, the most prevalent genotype was livestock-associated CC9 (ST9), which is consistent with previous reports from China and other regions of Asia6,19 but differs from reports from North America and Europe7. These results suggest that CC9 and CC398 may be important molecular characteristics for livestock association. Moreover, we found significant associations between specific genotypes and host species, with significantly higher rates of livestock-associated CC9/ST9/t899 in pig isolates. These findings provide evidence of host-specific genotypes that can differentiate animal from human isolates. In a simple host-specific clone model (Fig. 1a), all isolates would appear as discrete clones, and human isolates would be restricted to specific clones in the phylogeny. However, many human isolates clustered in the same branches of the phylogenetic tree as pig isolates, suggesting that this simple clone model is not suitable for all S. aureus isolates. These findings reveal that traditional genotyping techniques only partially explain the host-associated genetic variation of S. aureus.
It is known that S. aureus is a commensal and opportunistic pathogen that colonizes various species. However, there is still a lack of host-specific markers for identifying the potential cross-species transmission risk of S. aureus. Advances in WGS have improved the applicability of bacterial GWAS, elucidating potential genetic associations between genotypes and bacterial phenotypes such as antimicrobial resistance, disease status, and host adaptation11,18,20. It is clear from the phylogeny based on core SNPs that human isolates were distributed across the tree and even clustered in the same clades as pig isolates, indicating that most clones are equally able to invade various hosts (a host-mediated infection model; Fig. 1b). If this model held, host-associated genetic determinants would be evenly distributed between human and pig isolates (no significant difference). However, the GWAS in this study identified numerous host-associated k-mers, suggesting that the enrichment of host-associated genetic elements may increase the risk of invading specific hosts (the host-associated genome model; Fig. 1c). This is consistent with the divided genome pathogenicity model for avian E. coli and S. pneumoniae17,21.
To handle high-dimensional genomic data while reducing model complexity and overfitting, we constructed a two-stage GWAS process to obtain a simple and accurate prediction model, comprising initial screening by LMM and further validation by three GWAS methods. In the final model, we identified six consensus k-mers statistically associated with host species. Interestingly, the best k-mer predictor (kmer_2937 in mecR1) achieved a high classification accuracy by itself (98.12%), which is considerably higher than the values reported in previous studies (67.0% for predicting disease-associated Staphylococcus epidermidis and 79.6% for predicting infection-associated Streptococcus pneumoniae)16,22; the additional external validation analysis also reached a high classification accuracy of 88.18%, offering a very simple target for predicting the cross-species transmission risk of S. aureus. More importantly, among ST5 and ST398 isolates, which were observed in both humans and pigs, the best k-mer (in mecR1) was carried by almost all human isolates but was absent from all pig isolates, giving a resolution similar to that of the phylogenetic tree based on core SNPs and further supporting the importance of this k-mer in differentiating animal from human isolates. To elucidate the potential cross-species transmission risk of LA-SA, we used an RF classifier to predict the host species of S. aureus isolates of unknown origin (that is, S. aureus from pig farm workers); 90% of these isolates were predicted to be of pig origin, suggesting a high risk of cross-species transmission of LA-SA. Many S. aureus clones were found in both humans and pigs, but the direction of transmission between humans and livestock is still unclear. Bayesian evolutionary analysis inferred that ST398 can spread from pigs to humans while ST5 can spread from humans to pigs, indicating different cross-species transmission directions for these clones. These findings provide novel insights into identifying and predicting the cross-species transmission risk and directions of S. aureus.
A deep understanding of host-associated genetic traits that differentiate animal from human isolates could pave the way to early diagnosis and effective control of cross-species transmission. In the final prediction model, the top six k-mers were mapped to genes associated with antibiotic resistance, lyase, transposase, and virulence, indicating that host adaptation is a complex multifactorial property. These findings suggest that the enrichment of genetic variation may promote S. aureus adaptation to various host species. Of the SCCmec-associated genes, the mecR1 gene encodes methicillin receptor protein, which has an important role in triggering a resistance response23; and the novel recombinase activities of CcrA and CcrB are important for the site-specific integration and excision of SCCmec that is responsible for the resistance acquisition and spread24. These results provide more evidence that the acquisition of methicillin resistance may increase the effectiveness of adaptation to different host environments, leading to the emergence of livestock-associated clones more adapted to human colonization and infection. The tryptophan biosynthesis protein (trpC) is involved in reducing plasma coagulation, which is a critical pathogenic process in S. aureus25. In terms of transposase-associated genes, tnpB1 and tnpB2 play an important role in catalyzing the recombination reaction and DNA sequence variation, which may contribute to S. aureus rapidly adapting to different stress conditions and new hosts26,27. Signal transduction histidine-protein kinase (arlS) is involved in the regulation of adhesion, virulence, and multidrug resistance, which may enhance the ability of adhesion and adapting to different hosts28. In summary, these host-associated k-mers provide genetic evidence for identifying and elucidating cross-species transmission of S. aureus isolates.
This study is a new attempt to identify host-associated variation and predict the host species of S. aureus using a simple k-mer model, which may provide novel insights into the cross-species transmission risk of S. aureus. Because no GWAS method is perfect, we constructed a two-stage GWAS process using multiple methods to identify consensus significant host-associated k-mers, thereby controlling for population structure, reducing model complexity, and minimizing false-positive associations. However, this study has potential limitations. First, we did not replicate the GWAS analysis using S. aureus isolates from companion animals and wildlife because of the small sample sizes. Nevertheless, this is a novel comparative genomics study of human and animal S. aureus isolates to identify host-associated determinants. Second, in a GWAS model that uses k-mer presence/absence patterns as predictors, copy number variation, an important dimension of genetic diversity, may not be detected if all copies are identical29. To overcome this limitation, future work should develop more elaborate k-mer-based GWAS methods that use k-mer counts as predictors, both to confirm our findings and to broaden the scope of detectable genetic variants30.
Defining the host origin of S. aureus is the first step in preventing and controlling cross-species transmission. The two-stage multiple GWAS analyses provided consensus evidence for a subset of six significant host-associated k-mers. Surprisingly, the best k-mer predictor achieved a high classification accuracy on its own (98.12%) and had extremely high resolution similar to the phylogeny based on core SNPs, offering a very simple and precise target for identifying and predicting the cross-species transmission of S. aureus in healthcare settings. The final model based on six k-mers predicted that most S. aureus isolates from farm workers were of livestock origin (90%), suggesting a high risk of cross-species transmission. Bayesian evolutionary analysis inferred that ST398 spreads from pigs to humans while ST5 spreads from humans to pigs, indicating different cross-species transmission directions for these clones. Our findings provide important clues for tracing the source and route of S. aureus cross-species transmission.
Methods
Sample selection and quality control
In this study, we obtained from the NCBI genome database 797 assembled genomes of S. aureus isolates collected in China between 1997 and 2021 (Supplementary Data 1), including 488 (61.2%) isolated from sterile sites (e.g., blood, cerebrospinal fluid, and joint fluid) of patients with infections and without occupational livestock exposure (defined as human isolates) and 309 (38.8%) from the nasopharynx of pigs (defined as pig isolates). To exclude livestock-associated isolates in humans, livestock-associated CC9 isolates were excluded from the human isolates. Genome quality was assessed using the bacterial database in Kraken (v.1.1.1)31, and genomes with more than 95% of the total sequence mapping to S. aureus were included. The completeness and contamination of S. aureus genomes were also evaluated using CheckM (v.1.2.0) with default parameters32.
Sequence typing
The STs were inferred from the allelic profiles of the seven housekeeping genes by submitting the genome sequences to Pathogenwatch (https://pathogen.watch/). STs were assigned to specific CCs using the goeBURST algorithm in Phyloviz (v.2.0)33. spa typing was performed using the online tool SpaTyper (v.1.0) available through the Center for Genomic Epidemiology (https://cge.food.dtu.dk/services/spaTyper/)34.
Phylogenetic analysis
We identified SNPs by mapping assembled contigs to the reference genome of S. aureus NCTC 8325 (GenBank accession no. NC_007795) using Snippy (v.4.6.0). After identifying and removing recombination regions using Gubbins (v.2.4.1), maximum likelihood phylogenetic trees with and without removal of recombination sites were constructed using the GTR + Γ (Gamma) model and 100 bootstrap replicates in RAxML (v.7.0.4). We performed the Bayesian evolutionary analysis to estimate the node dates of S. aureus using BactDating (v.1.1.1; https://xavierdidelot.github.io/BactDating/), a tool for Bayesian dating of a bacterial phylogenetic tree. A relaxed-gamma model was run for 1.0 × 10⁷ MCMC steps, and the effective sample size (ESS) of all parameters was greater than 100. Visualization and annotation of the phylogeny were performed using ChiPlot (https://www.chiplot.online/).
Counting and annotating k-mers
The alignment-free method based on k-mers was used to capture the host-associated genetic variation in the core and accessory genomes of S. aureus. The k-mers of length 9–100 bp were extracted from all assemblies using fsm-lite (https://github.com/nvalimak/fsm-lite), and filtered to retain 24,116,651 k-mers seen in 1–99% of samples. In order to identify the relevant variants and genes, all k-mers were mapped to eight S. aureus reference genomes (RF122, MRSA252, M013, NCTC 8325, MSSA476, Mu3, Newman, and N315) by BWA-MEM with default parameters (v.0.7.17; https://github.com/lh3/bwa), using RMS for assessing the mapping quality (a reasonable cutoff value for the annotation is ≥50). The biological properties of k-mers were determined by GO annotations using the UniProt (https://beta.uniprot.org/).
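The idea of a filtered k-mer presence/absence matrix can be sketched as follows. This is a simplified illustration and not the fsm-lite algorithm: it uses a single fixed k rather than variable-length k-mers, ignores reverse complements, and runs on hypothetical toy sequences. The function names are assumptions made for this example.

```python
def extract_kmers(sequence: str, k: int) -> set[str]:
    """Return the set of distinct substrings of length k in a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_absence(genomes: dict[str, str], k: int,
                     min_frac: float = 0.01, max_frac: float = 0.99) -> dict:
    """Build a k-mer presence/absence matrix, keeping only k-mers whose
    frequency across samples lies within [min_frac, max_frac]."""
    per_genome = {name: extract_kmers(seq, k) for name, seq in genomes.items()}
    n_samples = len(per_genome)
    all_kmers = set().union(*per_genome.values())
    matrix = {}
    for kmer in sorted(all_kmers):
        row = {name: int(kmer in kmers) for name, kmers in per_genome.items()}
        frequency = sum(row.values()) / n_samples
        if min_frac <= frequency <= max_frac:
            matrix[kmer] = row  # 1 = present, 0 = absent
    return matrix


# Hypothetical toy sequences standing in for assembled genomes.
toy_genomes = {"iso1": "ACGTAC", "iso2": "ACGTTT", "iso3": "ACGAAC"}
matrix = presence_absence(toy_genomes, k=3)
for kmer, row in matrix.items():
    print(kmer, row)
```

In this toy case the k-mer shared by all three sequences is dropped by the frequency filter, since a k-mer present in every sample cannot discriminate between hosts.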
Two-stage GWAS analysis for screening host-associated k-mers
To reveal potential associations between genetic elements (k-mers) and host species, we performed GWAS analyses with the k-mer matrix (presence or absence) as the independent variable and S. aureus host species (human or pig) as the outcome variable. Because the genomic data are high-dimensional and highly correlated, a two-stage GWAS process was used to identify host-associated k-mers. In the first stage, we fitted a univariate LMM to initially screen host-associated k-mers using Pyseer (v.1.3.10)35, in which a genetic relatedness matrix based on core SNPs was calculated to control for the clonal population structure. In the second stage, all k-mers identified in the first stage were analyzed with multiple GWAS models (Scoary, LASSO, and XGBoost) to identify consensus host-associated k-mers. Scoary is a widely applicable, ultra-fast, and easy-to-use GWAS software for high-dimensional genome data, which directly infers and controls for bacterial population structure based on the phylogenetic structure of the sample36. LASSO regression, a linear prediction model with regularization, is suitable for highly correlated genome data; it avoids over-fitting and reduces model complexity by shrinking the coefficients of some variables to zero37. XGBoost is a decision-tree-based ensemble method under the gradient boosting framework, which has been widely used for sparse and high-dimensional genome data and supports classification, regression, and ranking38. In the LMM and Scoary models, Bonferroni correction was used to control the family-wise error rate arising from multiple comparisons of 14,815,802 k-mers in the first stage and 24,761 k-mers in the second stage (adjusted P thresholds of 3.37 × 10⁻⁹ and 2.02 × 10⁻⁶, respectively).
To validate the GWAS results, we used random forest to perform the ten-fold cross-validation with 200 repeats and additional external validation analysis on an independent dataset of S. aureus (40 human and 40 pig isolates; Supplementary Data 2) available on the NCBI database.
Risk prediction
By training a classifier using the significant host-associated k-mers from the GWAS analysis, we were able to estimate the risk score of the prediction model (defined as the probability of an isolate from a human host given a certain k-mer). We used an RF classifier based on the R package “randomForest” to predict the host origin (human or pig) of S. aureus isolates. In the RF model, the importance of the predictors was sorted from the most to the least important by the mean decrease in impurity (Mean Decrease Gini, MDG). The predictive effect of the RF model was evaluated by accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), kappa value, and receiver-operating characteristic (ROC) curve. In addition, we built an RF classifier to predict the origin of S. aureus from farm workers with occupational livestock exposure (40 isolates), which were available on the NCBI database (Supplementary Data 3).
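The evaluation measures listed above can all be derived from a 2 × 2 confusion matrix. The sketch below shows the standard formulas, treating "human" as the positive class (an assumption made here for illustration); the counts are hypothetical and are not results from this study, and the function name is illustrative.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts.

    Assumes both classes and both predicted labels occur at least once,
    so that no denominator is zero.
    """
    total = tp + fp + fn + tn
    if min(tp + fn, tn + fp, tp + fp, tn + fn) == 0:
        raise ValueError("each class and each predicted label must be non-empty")
    accuracy = (tp + tn) / total
    # Agreement expected by chance, used for Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Hypothetical counts (positive class = human isolate).
metrics = classification_metrics(tp=45, fp=5, fn=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```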
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All data generated or analyzed in this study are available as Supplementary Information (Supplementary Figs. S1–S6; Supplementary Data 1–8 available at https://doi.org/10.6084/m9.figshare.26954722).
Code availability
Pyseer (v.1.3.10) at https://github.com/mgalardini/pyseer; Scoary (v.1.6.16) at https://github.com/AdmiralenOla/Scoary; LASSO at https://github.com/cran/glmnet; XGBoost at https://github.com/dmlc/xgboost; RandomForest at https://github.com/cran/randomForest; BactDating (v.1.1.1) at https://xavierdidelot.github.io/BactDating/.
References
Tong, S. Y. C., Davis, J. S., Eichenberger, E., Holland, T. L. & Fowler, V. G. Staphylococcus aureus infections: epidemiology, pathophysiology, clinical manifestations, and management. Clin. Microbiol. Rev. 28, 603–661 (2015).
Howden, B. P. et al. Staphylococcus aureus host interactions and adaptation. Nat. Rev. Microbiol. 21, 380–395 (2023).
Feßler, A. T. et al. Antimicrobial and biocide resistance among feline and canine Staphylococcus aureus and Staphylococcus pseudintermedius isolates from diagnostic submissions. Antibiotics 11, 127 (2022).
Heaton, C. J., Gerbig, G. R., Sensius, L. D., Patel, V. & Smith, T. C. Staphylococcus aureus epidemiology in wildlife: A systematic review. Antibiotics 9, 89 (2020).
Liu, Y., Han, C., Chen, Z., Guo, D. & Ye, X. Relationship between livestock exposure and methicillin-resistant Staphylococcus aureus carriage in humans: A systematic review and dose–response meta-analysis. Int. J. Antimicrob. Agents 55, 105810 (2020).
Chuang, Y.-Y. & Huang, Y.-C. Livestock-associated meticillin-resistant Staphylococcus aureus in Asia: An emerging issue? Int. J. Antimicrob. Agents 45, 334–340 (2015).
Aires-de-Sousa, M. Methicillin-resistant Staphylococcus aureus among animals: current overview. Clin. Microbiol. Infect. 23, 373–380 (2017).
Wang, Y. et al. Transmission of livestock-associated methicillin-resistant Staphylococcus aureus between animals, environment, and humans in the farm. Environ. Sci. Pollut. Res. 30, 86521–86539 (2023).
Zou, G. et al. A survey of Chinese pig farms and human healthcare isolates reveals separate human and animal methicillin‐resistant Staphylococcus aureus populations. Adv. Sci. 9, 2103388 (2022).
Zhou, W. et al. WGS analysis of ST9-MRSA-XII isolates from live pigs in China provides insights into transmission among porcine, human and bovine hosts. J. Antimicrob. Chemother. 73, 2652–2661 (2018).
Farhat, M. R. et al. GWAS for quantitative resistance phenotypes in Mycobacterium tuberculosis reveals resistance genes and regulatory regions. Nat. Commun. 10, 2128 (2019).
Sheppard, S. K. et al. Genome-wide association study identifies vitamin B5 biosynthesis as a host specificity factor in Campylobacter. Proc. Natl Acad. Sci. 110, 11923–11927 (2013).
Young, B. C. et al. Antimicrobial resistance determinants are associated with Staphylococcus aureus bacteraemia and adaptation to the healthcare environment: a bacterial genome-wide association study. Microb. Genomics 7, 000700 (2021).
Power, R. A., Parkhill, J. & De Oliveira, T. Microbial genome-wide association studies: lessons from human GWAS. Nat. Rev. Genet. 18, 41–50 (2017).
Voichek, Y. & Weigel, D. Identifying genetic variants underlying phenotypic variation in plants without complete genomes. Nat. Genet. 52, 534–540 (2020).
Méric, G. et al. Disease-associated genotypes of the commensal skin bacterium Staphylococcus epidermidis. Nat. Commun. 9, 5034 (2018).
Mageiros, L. et al. Genome evolution and the emergence of pathogenicity in avian Escherichia coli. Nat. Commun. 12, 765 (2021).
Li, T. et al. Pan-genome-wide association study of serotype 19A pneumococci identifies disease-associated genes. Microbiol. Spectr. 11, e0407322 (2023).
Wei, W. et al. Genotypic characterization of methicillin-resistant Staphylococcus aureus isolated from pigs and retail foods in China. Biomed. Environ. Sci. 30, 570–580 (2017).
Sieber, R. N. et al. Genome investigations show host adaptation and transmission of LA-MRSA CC398 from pigs into Danish healthcare institutions. Sci. Rep. 9, 18655 (2019).
Chaguza, C. et al. Bacterial genome-wide association study of hyper-virulent pneumococcal serotype 1 identifies genetic variation associated with neurotropism. Commun. Biol. 3, 559 (2020).
Yang, S. et al. Disease-associated Streptococcus pneumoniae genetic variation. Emerg. Infect. Dis. 30, 39–49 (2024).
Belluzo, B. S. et al. An experiment-informed signal transduction model for the role of the Staphylococcus aureus MecR1 protein in β-lactam resistance. Sci. Rep. 9, 19558 (2019).
Wang, L. & Archer, G. L. Roles of CcrA and CcrB in excision and integration of staphylococcal cassette chromosome mec, a Staphylococcus aureus genomic island. J. Bacteriol. 192, 3204–3212 (2010).
Luo, D. et al. cydA, spdC, and mroQ are novel genes involved in the plasma coagulation of Staphylococcus aureus. Microbiol. Immunol. 65, 383–391 (2021).
Bastos, M. C. & Murphy, E. Transposon Tn554 encodes three products required for transposition. EMBO J. 7, 2935–2941 (1988).
Sheppard, S. K., Guttman, D. S. & Fitzgerald, J. R. Population genomics of bacterial host adaptation. Nat. Rev. Genet. 19, 549–565 (2018).
Fournier, B., Klier, A. & Rapoport, G. The two‐component system ArlS–ArlR is a regulator of virulence gene expression in Staphylococcus aureus. Mol. Microbiol. 41, 247–261 (2001).
Lemay, M., De Ronne, M., Bélanger, R. & Belzile, F. k-mer-based GWAS enhances the discovery of causal variants and candidate genes in soybean. Plant Genome 16, e20374 (2023).
Rahman, A., Hallgrímsdóttir, I., Eisen, M. & Pachter, L. Association mapping from sequencing reads using k-mers. eLife 7, e32920 (2018).
Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Nascimento, M. et al. PHYLOViZ 2.0: providing scalable data integration and visualization for multiple phylogenetic inference methods. Bioinformatics 33, 128–129 (2017).
Bartels, M. D. et al. Comparing whole-genome sequencing with sanger sequencing for spa typing of methicillin-resistant Staphylococcus aureus. J. Clin. Microbiol. 52, 4305–4308 (2014).
Lees, J. A., Galardini, M., Bentley, S. D., Weiser, J. N. & Corander, J. pyseer: a comprehensive tool for microbial pangenome-wide association studies. Bioinformatics 34, 4310–4312 (2018).
Brynildsrud, O., Bohlin, J., Scheffer, L. & Eldholm, V. Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome Biol. 17, 238 (2016).
Dai, P. et al. Retrospective study on the influencing factors and prediction of hospitalization expenses for chronic renal failure in China based on random forest and LASSO regression. Front. Public Health 9, 678276 (2021).
Milella, F., Famiglini, L., Banfi, G. & Cabitza, F. Application of machine learning to improve appropriateness of treatment in an orthopaedic setting of personalized medicine. J. Pers. Med. 12, 1706 (2022).
Acknowledgements
This work was supported by the Key Scientific Research Foundation of Guangdong Educational Committee (No. 2022ZDZX2033), the Guangdong Basic and Applied Basic Research Foundation (No. 2023A1515011583), and the National Natural Science Foundation of China (Nos. 81973069 and 81602901). The funders had no role in the study design, data collection and analysis, and interpretation of the data.
Author information
Contributions
H.Z. and X.Y. designed the study and wrote the manuscript. H.Z., W.D., D.O., and Y.L. performed all bioinformatic analyses. X.Y., Z.Y., Y.G., M.Z., and X.Z. supervised the work and reviewed the data. All authors have read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Biology thanks Toni de Dios Martinez and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling Editors: Christina Karlsson Rosenthal and Aylin Bircan. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhou, H., Du, W., Ouyang, D. et al. Simple and accurate genomic classification model for distinguishing between human and pig Staphylococcus aureus. Commun Biol 7, 1171 (2024). https://doi.org/10.1038/s42003-024-06883-2