Introduction

Colorectal cancer (CRC) is a leading cause of cancer-related morbidity and mortality, globally1,2. In Ghana, CRC is one of the prevalent gastrointestinal malignancies affecting the population3. The incidence of CRC is expected to rise in the next decade due to increasing Western habits in the daily lifestyle of the population in Ghana. Lifestyle factors that contribute to development of CRC include smoking, high intake of alcohol and processed meat, less physical activity, obesity and low intake of fruits and vegetables4,5. Family history and germline mutations in human DNA mismatch repair genes (MLH1, MSH2, MSH6, PMS1 and PMS2) are genetic risk factors for CRC2,6,7,8,9,10.

The gut microbiome have been reported to be associated with onset and progression of CRC via perturbation of host immune system and secretion of metabolites capable of triggering oncogenesis11,12,13,14,15,16. The pks-positive Enterobacteriaceae are notable gut bacteria that have been implicated in development of CRC due to their ability to synthesize colibactin, which is a genotoxin encoded by the pks-genomic island. Colibactin causes alkylation and crosslinks in DNA of host eukaryotic cells, leading to formation of DNA double-strand breaks, cell cycle arrest at G2 phase and intestinal tumorigenesis12,17,18,19. The pks-genomic island has been identified in Escherichia coli, Klebsiella pneumoniae, Enterobacter aerogenes, and Citrobacter koseri20. In Escherichia coli, the pks-genomic island is mainly found in isolates of the B2 phylogenetic group, and to a lesser extent in isolates belonging to phylogenetic groups A, B1 and D21,22. Gut bacteria also secrete metabolites such as cytotoxic necrotizing factors (Cnf), cycle-inhibiting factor (Cif) and cytolethal distending toxins (Cdt), which modulate proliferation, differentiation and apoptosis in host eukaryotic cells. Moreover, the CdtB protein causes DNA damage in the host and may predispose infected cells to oncogenesis23.

Beyond secretion of cell cycle-modulating toxins, gut bacteria encode a myriad of virulence factors that have the potential to perturb vital cellular events and signaling pathways, ultimately contributing to initiation and progression of colorectal tumorigenesis24. For instance, Escherichia coli can attach intimately to intestinal mucosa using an adhesin protein intimin (encoded by eae gene) and cause downregulation of DNA mismatch proteins in colorectal cells leading to tumourigenesis25. Virulence genes encoding fimbriae in Klebsiella pneumoniae (mrkCDF) and Escherichia coli (fimE) have also been reported to be significantly associated with CRC26. Escherichia coli isolates exhibiting profound cytoadherence and expressing the cnf1 gene were associated with CRC, which might indicate a potential role of adhesins in bacteria-mediated oncogenesis27. Genes encoding siderophores were also predominant in gut bacteria from early-stage CRC samples, possibly facilitating iron acquisition and bacterial survival in the host tissues26,28,29. Collectively, these observations from previous studies highlight an important role of bacterial virulence genes in instigating and promoting CRC.

Even though several studies have investigated CRC-associated microbiota30,31,32,33,34, very limited data exist in the context of the Ghanaian population. Since the composition and diversity of gut bacteria varies across different geographical locations35, it is essential to investigate gut microbial characteristics in CRC patients in the Ghanaian population in this era of increasing prevalence of CRC. In the present study, Enterobacteriaceae isolated from CRC patients and healthy control participants at a tertiary hospital in Ghana were analyzed via whole genome sequencing to identify novel and already known genomic features uniquely associated with gut bacteria from the cancer patients.

Results

Demographic information of the study participants

Demographic information was obtained from 58 CRC patients and 50 control participants (Table 1). The minimum ages recorded for the CRC patients and control participants were 16 years and 19 years, respectively, whiles the oldest participants were 88 years for the CRC patients, and 93 years for the control participants. The mean ages were 52.6 ± 15.0 years for the CRC patients and 48.9 ± 19.8 years for the control participants. None of the 108 participants reported a family history of CRC. Forty-two out of the 58 CRC participants had a habit of alcohol consumption. Only six CRC participants indicated they had a habit of smoking. Alcohol consumption was associated with increased likelihood of developing CRC among the participants [OR 2.625 (1.180–5.838), p = 0.018], but smoking habit was not significantly associated with the disease [OR 0.863 (0.260–2.868), p = 0.810]. Participants that were at least 30 years old also had an increased likelihood of developing CRC indicating that both young and older adults are prone to the disease [OR 4.706 (1.124–19.705), p = 0.034 for participants from 30 to 49 years; OR 4.493 (1.110-18.191), p = 0.035 for participants  50 years old].

Table 1 Demographic information of the study participants.

Enterobacteriaceae isolated from the study participants

Stool samples were obtained from 47 CRC patients and 50 healthy control participants (Fig. 1). All stool samples from the study participants were inoculated onto MacConkey agar, which led to isolation of Enterobacteriaceae from 44 out of the 47 CRC patients (94%) whereas these isolates were obtained from 36 out of the 50 healthy control participants (72%). These observations indicate that Enterobacteriaceae were more readily detected in stool samples from CRC patients compared to the control participants (p = 0.007). Two stool samples from CRC patients generated two Enterobacteriaceae each, whilst one isolate was obtained from each of the 36 control participants (Fig. 1). Hence, a total of 46 Enterobacteriaceae were isolated from the CRC patients and 36 isolates were obtained from the control participants.

Fig. 1
figure 1

Schematic summary for recruitment of study participants and collection of socio-demographic data and stool samples for isolation of Enterobacteriaceae.

Escherichia coli was the predominant Enterobacteriaceae isolated from stool samples of the CRC patients (76.1%) and the control participants (80.6%; Fig. 2A). The other Enterobacteriaceae isolated from the study participants were Klebsiella pneumoniae, Proteus mirabilis, Escherichia fergusonii, Enterobacter cloacae, Enterobacter hormaechei, Providencia stuartii, Alcaligenes faecalis, Raoultella ornithinolytica and Morganella morganii (Fig. 2A). The Enterobacteriaceae from the CRC patients were more diverse compared to the isolates from the healthy control participants, highlighting a potential effect of CRC on dysbiosis.

Fig. 2
figure 2

Enterobacteriaceae isolated from the study participants. (A) Identity of Enterobacteriaceae obtained from the study participants. (B) Phylogroups of the Escherichia coli isolates determined via the EzClermont web app36.

Genomic DNA sequencing and quality control statistics of assembled genomes

The genomic DNA of each Enterobacteriaceae was used for DNA library preparation, multiplexing and paired-end sequencing on the Illumina NextSeq 2000 system. Sequence reads were assembled using SPAdes, and the quality of the assembled genomes was assessed using Quast. Draft assembled genomes were annotated using Prokka (1.14.6) and core genomes were analyzed using Roary. The median contig length of the draft assembled genomes was 61 bp (range is 11–162 bp), and the N50 median was 221,614 bp (72,883–2,457,026 bp). The median GC content was 50.73 (38.7–57.35), with a total median genome length of 4,776,786 bp (3,872,290–5,673,707 bp), which is within the range for E. coli genome size, and there was no evidence of contamination.

Cumulative virulence genes encoded by Escherichia coli phylogroups

Nineteen (54.3%) of the 35 Escherichia coli isolates from the CRC patients were categorized as phylogroup A, and nine (25.7%) other isolates as phylogroup B1, demonstrating the majority of the Escherichia coli isolates from the CRC patients were most likely intestinal commensal strains (Fig. 2B). One (2.9%) Escherichia coli isolate from the CRC patients was categorized as phylogroup B2 and four (11.4%) other isolates were categorized as phylogroup D. The Escherichia coli isolates from the control participants were also predominantly in phylogroups A and B1. The B2 phylogroup of Escherichia coli, notable for their hypervirulence traits and possession of the pks-genomic island, was not significantly associated with CRC [OR 0.18 (0.02–1.75), p = 0.140].

The Escherichia coli isolates from the control participants encoded relatively higher number of virulence genes compared to the isolates from the CRC patients, even though no significant association was observed between number of virulence genes per isolate and CRC (p-value = 0.101; Fig. 3A). Moreover, the non-commensal isolates (phylogroups B2, D and F) from the CRC patients encoded significantly higher number of virulence genes compared to the CRC commensal (phylogroups A and B1) isolates, suggesting a hypervirulent potential of the non-commensal isolates (p < 0.0001, Fig. 3B,D). The non-commensal isolates from the control participants also encoded significantly higher number of virulence genes compared to the commensal isolates of the control (p = 0.037, Fig. 3C,D). Since the isolate of phylogroup B2 from the CRC patient encoded relatively higher number of virulence genes compared to the corresponding isolates from the control participants, we infer that the phylogroup B2 isolate from CRC patient might represent a potential hypervirulent strain compared to the four isolates of phylogroup B2 from the control participants.

Fig. 3
figure 3

Cumulative virulence genes encoded by the genome of the Enterobacteriaceae isolates. (A) Comparison of the total number of virulence genes encoded by the genome of the Escherichia coli isolates from the CRC patients and the control participants. (B) Cumulative virulence genes encoded by the phylogroups of Escherichia coli from the CRC patients. (C) Cumulative virulence genes encoded by the phylogroups of Escherichia coli from the control participants. (D) Number of virulence genes encoded by the commensal and non-commensal Escherichia coli isolates. Error bars represent the standard error of the mean. N/A denotes the non-Escherichia coli isolates, which cannot be classified into phylogroups. The seven non-Escherichia coli isolates from the control participants are K. pneumoniae, E. cloacae, K. pneumoniae, K. pneumoniae, E. hormaechei, K. pneumoniae and E. cloacae. The eight non-Escherichia coli isolates from the CRC patients are M. morganii, P. mirabilis, P. mirabilis, K. pneumoniae, K. pneumoniae, R. ornithinolytica, E. fergusonii, and E. fergusonii.

Analysis of toxin-encoding genes in the genome of the Escherichia coli isolates

The pks-genomic island (clbA-clbS), which has been reported to be overrepresented in Enterobacteriaceae from CRC patients, was not detected in any of the sixty-four Escherichia coli isolates from the study participants (Fig. 4A). Genes encoding cytolethal distending toxin (cdtA-cdtC), cytotoxic necrotizing factor (cnf1-cnf3) and cycle inhibiting factors (cif) were also not detected in the genome of the Enterobacteriaceae from the CRC patients and the control participants. Enterotoxin-encoding genes (astA and senB), and the hlyA-hlyD genes responsible for the synthesis and secretion of haemolysin, were detected in the genome of only one Escherichia coli isolate from the CRC patients. In contrast, the hlyE gene encoding the pore-forming haemolysin E (α-haemolysin) toxin was detected in the genome of most of the Escherichia coli isolates from the CRC patients, even though there was no statistically significant association between the hlyE gene and CRC (OR 1.538 (0.453–5.226); p-value = 0.49). Hence, Escherichia coli genes capable of causing genotoxic effects or cell cycle perturbations in host eukaryotic cells might have no role in colorectal tumorigenesis of the patients recruited for this study. Alternatively, the genotoxic Escherichia coli could previously have been present in the gut microbiota of the CRC patients, but were cleared and undetected in the present study.

Fig. 4
figure 4

Categories of virulence genes encoded by the genome of the Enterobacteriaceae isolates. Genes encoding toxins, invasion factors, adhesins, effector delivery systems, nutrition and metabolic factors and autotransporters are shown in (A)–(F).

Analysis of genes encoding invasion and adhesion factors

The aslA, ibeB and ibeC genes, encoding notable invasion factors, were detected in at least 69% of the Escherichia coli isolates from the CRC patients and the control participants (Fig. 4B). In contrast, the ibeA, kpsDM, ompAT and tia genes were less prevalent in the isolates from the study participants. Other invasion factor genes such kpsT, neuAC and traJ were rare in the genomes of the Escherichia coli isolates. None of these invasion factor genes encoded by the Escherichia coli genome was significantly associated with the CRC patients (Supplementary Table S1).

Genes encoding the curli fimbriae (csg) were detected in almost all the Escherichia coli isolates, indicating this class of adhesins might be relevant for adhesion and colonization phenotypes of these isolates in host eukaryotic cells during infection (Fig. 4C). Escherichia common pili (ecp) and type I fimbriae (fim) genes were detected in several of the isolates, even though they were modestly over-represented in the isolates from the control participants compared to the CRC patients. P fimbriae (pap) genes were rare in the genome of the isolates from the study participants despite the similar morphological identity of P fimbriae and type I fimbriae. The eae gene, encoding intimin, was not detected in the genome of the Escherichia coli isolates from the CRC patients but was identified in a few (2 out of 29) isolates from the control participants. Since most of the Escherichia coli isolates from the CRC patients and control participants encode several adhesins, it can be inferred that these isolates can use their attachment to host cells as a vital survival mechanism during infection. Nevertheless, none of these adhesins were significantly associated with the Escherichia coli isolates from the CRC patients in comparison to the control participants (Supplementary Table S1).

Analysis of genes encoding effector proteins, nutrition/metabolic factors and autotransporters

Generally, bacterial strains (especially enteropathogenic isolates) are capable of directly delivering effector proteins into host cells via the type III secretion system, leading to perturbation of cellular processes in the host, and ultimately disease conditions. Genes encoding notable bacterial effector proteins, such as tir, map, espG2, nleB2, nleC, nleD, nleG, nleF and nleH2 were absent in the genome of the Escherichia coli isolates from the CRC patients, but were detected in 2 out of the 29 isolates from the control participants (Fig. 4D). Less notable effector proteins including espL1, espL3, espL4, espR1, espX1, espX4 and espX5 were detected in several Escherichia coli isolates from the study participants, with modest over-representation in the isolates from the CRC patients in comparison to the control participants. None of the genes encoding effector proteins was significantly associated with CRC (Supplementary Table S1).

Bacterial genes that facilitate bioavailability or acquisition of iron in host cells are crucial for bacterial survival, especially when the bacteria reside in regions of the host where iron exist in an insoluble state or is complexed with iron-binding proteins of the host. The ent operon, which encodes enterobactin and is well known for its role in iron acquisition in a host, was detected in the genome of almost all the Escherichia coli isolates from the study participants (Fig. 4E). Ferric-enterobactin uptake (fepABCDEG) and utilization (fes) genes were also detected in almost all the Escherichia coli isolates, indicating these genes are vital for facilitating iron bioavailability and acquisition in the Escherichia coli isolates from the study participants. Nevertheless, these genes were not significantly associated with the Escherichia coli isolates from the CRC patients in comparison to the control participants (Supplementary Table S1).

Autotransporters (type V secretion system) are functionally relevant for bacterial survival and pathogenesis. The espC, pet, pic, sat and vat autotransporter genes, belonging to the chymotrypsin-like serine protease family, were rare in the genome of the study participants (Fig. 4F). ehaB was the most widespread autotransporter gene in the genome of the Escherichia coli isolates from both the CRC patients and the control participants. agn43 and cah, both encoding self-associating autotransporters associated with biofilm formation, were modestly over-represented in the genome of the isolates from the CRC patients. Moreover, the eaeX gene encoding an intimin-like protein, was only detected in the isolates from the CRC patients (4 out of 35). None of the genes encoding autotransporters was significantly associated with CRC (Supplementary Table S1).

Analysis of Escherichia coli virulence genes encoded by the non-Escherichia coli isolates

Virulome analysis was conducted for 15 out of the 18 non-Escherichia coli isolates from the study participants using the Escherichia coli virulence genes database. This analysis was vital for determining potential transmission of Escherichia coli virulence genes to the other strains of Enterobacteriaceae isolated from the study participants. Virulence genes that were detected in the genome of at least one of the non-Escherichia coli isolates are indicated in Fig. 5. The pks-genomic island was not detected in the genome of the 15 non-Escherichia coli strains used for the analysis. Most virulence genes that were commonly detected in the genome of the Escherichia coli isolates from this study (hlyE, aslA, ibeBC, csg, fimDFGH, espL1/3/4, espR1, espX1/4/5, ent, fep, fes, ehaB) were also encoded by the genome of at least one of the two Escherichia fergusonii strains. The only exceptions were eaeH, ecp and fimABCE adhesin genes, which were absent in the genome of the Escherichia fergusonii strains. The eaeH and fimABCE genes were also not detected in the genome of any of the non-Escherichia coli strains from the study participants while ecp was detected in the genome of the Klebsiella pneumoniae strains.

Fig. 5
figure 5

Detection of Escherichia coli virulence genes in the genome of the other Enterobacteriaceae isolated from the study participants. Black shaded boxes represent presence of the indicated virulence gene while unshaded boxes represent absence of the gene. a, b, and c denote Morganella morganii, R. ornithinolytica and E. hormaechei, respectively.

The ompA, entAB and fepC genes were identified in all the Klebsiella pneumoniae strains, in addition to the ecp genes. The cah gene, which was over-represented in Escherichia coli isolates from the CRC patients, was identified in Proteus mirabilis isolates from the CRC patients and one isolate of Klebsiella pneumoniae from the control participants. The agn43 gene, which was widespread in Escherichia coli strains from the CRC patients, was not detected in the genome of the non-Escherichia coli strains. None of the virulence genes analyzed in this study were identified in the Morganella morganii strain while Escherichia fergusonii encoded most of the virulence genes, depicting a close relatedness of the latter to Escherichia coli (Figs. 3B,C and 5).

Genes unique to putative hypervirulent Escherichia coli from CRC patients

The seven Escherichia coli isolates from CRC patients, encoding the highest number of virulence genes (> 200 virulence genes) were referred to as putative hypervirulent strains (Fig. 3B). The sequence types of these seven isolates, in descending order of the total virulence genes encoded, were ST1380 (two isolates of phylogroup D), ST69 (two isolates of phylogroup D), ST636 (one isolate of phylogroup B2) and ST3901 (two isolates of phylogroup F). Genomic analysis of these seven isolates showed that the transcriptional activator gene aggR, the Aai chromosomal type VI secretion system (aaiA-X) and the hdaA-D adhesion genes were unique to ST1380, even though aaiT and aaiW were detected in the genome of a few other isolates (Table 2). The hlyA-D, mcbA and tcpC genes were identified only in the genome of isolates belonging to ST636. The eaeX gene was detected in both ST1380 and ST69 only, while the gspC-M genes were identified in ST1380, ST69 and ST3901.

Table 2 Genes unique to putative hypervirulent Escherichia coli from CRC patients.

Analysis of agn43 gene and Escherichia coli phylogroups

The agn43 gene has previously been reported to be more prevalent in Escherichia coli phylogroup E isolates37. Since the agn43 gene was modestly over-represented in Escherichia coli isolates from the CRC patients, despite a q > 0.05, we investigated the distribution of agn43 in the Escherichia coli phylogroups. Three out of the thirty-three phylogroup A isolates encoded the agn43 gene, indicating the gene was significantly under-represented in Phylogroup A compared to the other phylogroups (p = 0.002; Table 3). Seven out of the nine phylogroup B1 isolates from the CRC patients encoded agn43 whilst the gene was detected in only one of the six isolates of phylogroup B1 from the control participants. All the five phylogroup D isolates encoded the agn43 gene, highlighting the prevalence and significant association of the gene with the phylogroup D isolates compared to the other Escherichia coli phylogroups (p < 0.001).

Table 3 Distribution of agn43 gene in the Escherichia coli phylogroups.

Analysis of phylogeny and antimicrobial resistance genes

Phylogenetic analysis of the Enterobacteriaceae from the study participants was conducted to identify clustering patterns of the isolates from the CRC patients and the control participants. The analysis demonstrated Escherichia coli strains were clustered according to their phylogroups and multilocus sequence sub-types, but not the source of isolation (CRC patients or control participants; Fig. 6). Furthermore, the phylogenetic analysis confirmed the close relatedness of Escherichia fergusonii to the Escherichia coli strains and a distant relationship between the Escherichia coli isolates and Morganella morganii, as was initially deduced from the virulome analysis.

Fig. 6
figure 6

Analysis of the phylogeny and antimicrobial resistance genes in the genome of the Enterobacteriaceae isolates. The maximum likelihood phylogeny was constructed using RAxML and the bootstrap are displayed ranging from black (less confident) to red (high confident) score. The antibiotic class are shown in parenthesis; AGly is aminoglycosides, Bla is β-lactam antibiotics, Fcyn is fosfomycin, Flq is fluoroquinolone, MLS is macrolide, lincosamide, and streptogramin, Phe is phenicol, Sul is sulphonamide, Tet is tetracycline and Tmt is trimethoprim. int denotes the int gene. NA denotes isolates in which the phylogroup or sequence type is not available.

Further analysis of the genomic data revealed the int gene, which is capable of facilitating acquisition of exogenous DNA into bacterial chromosomes, was modestly over-represented in the genome of Enterobacteriaceae from the CRC patients (q = 0.1128; Fig. 6 and Supplementary Table S1). This observation prompted the investigation of antimicrobial resistance genes encoded by the genome of the bacterial isolates from the study participants. β-lactamase resistance genes such as E. coli ampC2 and ampH genes were encoded by the genome of most of the Enterobacteriaceae from the study participants. Penicillin binding proteins-mediated mechanisms for resistance to β-lactams was also predominant in the Enterobacteriaceae from both the CRC patients and control participants. In contrast, the CTX-M-55 and OXA-10 β-lactamase resistance genes were encoded by two Escherichia coli isolates from the CRC patients, indicating they were less prevalent in the genome of the Enterobacteriaceae from the study participants. Resistance genes affecting aminoglycosides (strA and strB), macrolide (mphA), sulphonamides (sul1 and sul2), tetracycline (tetA and tetR) and trimethoprim (dfrA14) were also encoded by several Enterobacteriaceae from the study participants. Nonetheless, only the sul2 gene was significantly associated with Enterobacteriaceae from the CRC patients in comparison to the control participants [OR 4.79 (1.79–12.81), q = 0.045; Supplementary Table S2].

Discussion

In the present study, whole genome sequencing was used for analysis of Enterobacteriaceae isolated from CRC patients and healthy control participants at a tertiary hospital in Ghana. Socio-demographic data from study participants showed alcohol consumption and age ( 30 years), but not gender and smoking status, were significantly associated with the CRC patients. We had previously reported that smoking habit was not significantly associated with lower gastrointestinal cancers, which includes colorectal cancers, in the Ghanaian population3. However, the earlier study found no significant association between alcohol consumption and lower gastrointestinal cancers, as was detected in the present study, and this might be attributed to the small sample size used for both studies. Another intriguing observation from this present study was the lower age limit (30 years) of the Ghanaian population that was significantly associated with CRC. In the United States, the incidence rate of CRC increased by 80–100% with each 5-year group until the age of 50 years, whilst the incidence rate increased by 20–30% among the older adults ( 55 years) in the year 2015–201938. These observations highlight the need to actively conduct routine genetic test for CRC in both young and older adults, especially in the Ghanaian population.

Enterobacteriaceae, such as Escherichia coli, Klebsiella pneumonia, Enterobacter spp and Morganella morganii have been associated with CRC in previous reports39,40,41,42. The present study identified higher microbial diversity in the samples from the CRC patients compared to the control group. In addition, detection of Enterobacteriaceae from the samples of the CRC patients was significantly higher than the control group. These observations might indicate that there is perturbation in gut microbial content in the CRC patients and hence, necessitated thorough evaluation of the virulence tendencies of the isolates to confirm whether the dysbiosis favoured growth of harmful microbiota while causing loss of beneficial Enterobacteriaceae.

Escherichia coli was the predominant Enterobacteriaceae isolated from stool samples of the participants, in the present study, and they were mostly commensal strains43,44. These commensal strains usually lack virulence determinants implicated in disease conditions caused by intestinal and extraintestinal pathogenic Escherichia coli45. Our data indicate that the B2 phylogroup of Escherichia coli was under-represented in the study participants, and this could be due to the isolation of Enterobacteriaceae from fecal samples instead of the site of malignant tumors at the colon46. Escherichia coli from the CRC patients, belonging to phylogroups B2, D and F, also encoded more virulence factors compared to isolates of phylogroups A and B1 (commensal strains). Moreover, the isolates of phylogroups B2, D and F from the CRC patients had different sequence typing in comparison to the isolates of the same phylogroups from the control participants. These observations highlight the uniqueness of diverse isolates belonging to the same phylogroups but obtained from different sources, which might be indicative of differences in virulence tendencies during infection. Confounding factors caused by variations in microbial content between bowel movement were not taken into account in the present study, and might represent a potential limitation of the study design.

None of the bacterial cyclomodulin-encoding genes was detected in the genome of the Enterobacteriaceae isolated from the CRC patients and control participants. Notably, the pks-genomic island previously reported to be associated with CRC tumorigenesis was not detected in the genome of the Enterobacteriaceae from the participants in the present study47. Nevertheless, there is a possibility that Enterobacteriaceae encoding the cyclomodulin genes might have previously inhabited the gut of the CRC patients and increased the risk of CRC tumorigenesis in these patients. Extensive analysis of virulence genes available in the virulence factor database revealed the virulome profile of the Enterobacteriaceae from the CRC patients and control participants. Genes encoding curli fimbriae, type I fimbriae, Escherichia common pili, enterobactin and proteins utilized in ferric iron transport were encoded by most of the Escherichia coli isolates, as previously reported by a genomic study of Escherichia coli obtained from human, animal and food sources48. These Escherichia coli virulence genes were also identified in the genomes of Escherichia fergusonii, and to a lesser extent Klebsiella pneumoniae, highlighting the potential role of horizontal gene transfer in boosting bacterial virulence in the gut of the host. This present study also showed that the agn43 autotransporter gene, which is associated with biofilm formation, autoaggregation and attachment of bacteria to host cells was modestly over-represented in Escherichia coli isolates from the CRC patients49,50,51. Biofilm formation and attachment to host cells are mechanisms of pathogenic bacteria that have been reported to be associated with CRC52,53. Hence, it is possible that the agn43-encoding Enterobacteriaceae isolated in this study could have facilitated CRC tumorigenesis in the patients via these bacterial pathogenesis-related mechanisms.

Analysis of genes that are unique to the Escherichia coli isolates from the CRC patients encoding the highest total number of virulence genes revealed additional virulence factors and mechanisms that might be associated with CRC tumorigenesis in the patients enrolled for this study. The type II secretion system and virulence factors involved in aggregation and adhesion were identified in the isolates of phylogroups D and F. In contrast, alpha haemolysin toxin, biofilm formation and inhibition of innate immune system in host cells were the cellular processes that are affected by the virulence genes identified in the B2 phylogroup isolate. These observations highlight potential cellular process in host cells that can be targeted by pathogenic Escherichia coli in order to enhance the risk of tumorigenesis in host eukaryotic cells during exposure to these pathogens.

The int gene allows mobility of genetic elements between organisms via horizontal gene transfer, preferably transduction, and can facilitate the spread of antimicrobial resistance genes54,55. Virulome analysis in the present study showed the int gene was modestly over-represented in Escherichia coli isolates from the CRC patients. Even though the int gene is only one of the several key mediators of horizontal gene transfer in bacteria, this observation prompted the analysis of antimicrobial resistance genes encoded by the Enterobacteriaceae from the study participants. Antimicrobial resistance genes for β-lactam antibiotics were widespread in the genome of the Enterobacteriaceae from the study participants, as previously reported by a research conducted at the same study site56. Other antibiotic resistance genes we detected in the genome of the Enterobacteriaceae have also been reported in bacterial isolates from human inhabitants, livestock and water sources in Ghana57,58,59. Nonetheless, antimicrobial susceptibility assays are needed to confirm the expression of the antimicrobial resistance genes identified in the present study, and the extent of antibiotic resistance conferred on the Enterobacteriaceae.

Methods

Study design and participants

This genomic study utilized Enterobacteriaceae isolated from stool samples of CRC patients and healthy control participants at the Korle Bu Teaching Hospital (KBTH). The study participants were randomly recruited at the study site from February 2022 – January 2023. All the participants consented for using their socio-demographic data and stool samples in this study (Fig. 1). The CRC patients were either on admission at the Colorectal unit or outpatients at the Oncology and Surgical Clinics of the KBTH. The healthy participants were individuals who were not diagnosed with CRC or other forms of ailments at the time of sample collection. Individuals on antibiotic or cancer chemotherapy were excluded from the study.

Collection of socio-demographic data and stool samples

Paper-based questionnaires were administered to the study participants to obtain their socio-demographic data such as age, gender, family history of CRC, alcohol consumption and smoking status. Stool samples were also collected into 10 mL Cary-Blair transport media (Thermo Fisher Scientific) according to the manufacturer’s instructions. Socio-demographic data and stool samples were collected only once from participants that consented to participate in the study. All socio-demographic data from study participants were handled anonymously and confidentially. Anonymity was ensured by the use of codes generated from the respondents’ initials. These data were stored and analyzed electronically using Microsoft Office (Excel) version 2021.

Isolation and identification of Enterobacteriaceae from study participants

All the stool samples provided by the study participants were used for microbial analysis. Stool samples were inoculated onto MacConkey agar (BBL™, BD) for isolation of Enterobacteriaceae. Isolates of Enterobacteriaceae were initially identified using a standard biochemical test protocol. The identities of the isolates were confirmed via MALDI-TOF mass spectrometry, as previously reported59.

Whole genome sequencing and analysis

Extraction of genomic DNA from the Enterobacteriaceae isolates was conducted using QIAamp® DNA mini kit (QIAGEN Inc. GmbH, Holden, Germany), according to the manufacturer’s protocol. Genomic DNA was used for library preparation, followed by multiplexing and paired-end sequencing on the Illumina NextSeq 2000 system (Illumina, San Diego, CA, USA). Quality control was performed using FastQC (v0.12.1) and quality trimming was done using trimmomatic (v0.39) to remove low quality calls and sequencing adapters60. De novo assembly of the reads were performed using SPAdes (v3.15.5) with the following Kmers (21, 33, 55 and 77) and quality of the draft assemblies were checked using Quast (5.2.0)61,62. Draft genome assemblies were annotated using Prokka (1.14.6) and core genome were analyzed using Roary63,64. Multi-locus sequence type was determined using get_sequence_type from mlst_check of the Sanger Institute Pathogen group (https://github.com/sanger-pathogens/mlst_check). ABRicate (v1.0.1; https://github.com/tseemann/abricate) was used to identify AMR genes and E. coli virulence genes in the Virulence Factor Database (VFDB)65. A minimum coverage threshold of 80% and a minimum identity threshold of 80% was used for identification of AMR and virulence genes. Phylogenetic tree was generated from the core aligned single nucleotide polymorphisms (SNPs) using RAxML and visualised and annotated with iTOL (v8.2.12)66,67.

The fastQ files of all the sequence data were submitted to the National Center for Biotechnology Information (NCBI) and assigned accession numbers under Bioproject ID PRJNA1126843 (ID 1126843-BioProject-NCBI (nih.gov)).

Determination of the phylogroups of the Escherichia coli isolates

Fasta files from the whole genome sequence analysis were used for the determination of the Escherichia coli phylogroups via the EzClermont web app36.

Statistical analysis

Odds ratio was used to determine the association of parameters from socio-demographic data (age, gender, family history of CRC, alcohol consumption and smoking status), and CRC at 95% confidence interval. Association between virulence/antimicrobial resistance genes and CRC was also determined by odds ratio at 95% confidence interval, and the p-values were corrected using the Benjamini–Hochberg method. Odds ratio was also used for determining the associations between Escherichia coli phylogroups and agn43 gene, B2 phylogroup and CRC, and Escherichia coli commensal status and number of virulence genes per isolate. The Mann-Whitney U Test was used for evaluating statistical significance between cumulative number of virulence genes per isolate and CRC. Fisher exact test was used to determine the association between detection of Enterobacteriaceae on MacConkey agar and CRC.