Introduction

Bloodstream infections (BSI) are a major cause of sepsis, hospitalisation and death for children under the age of 5. Worldwide, there were an estimated 2.9 million sepsis-related deaths in 2017, with the highest burden in sub-Saharan Africa1. In this region, neonates can be particularly susceptible to sepsis caused by invasive bacterial infections due to co-morbidities such as HIV infection, malnutrition, malaria and sickle-cell disease2,3,4. It is often difficult to distinguish between febrile illness in infants caused by bacteria and malaria or other infectious agents within low-middle-income countries (LMICs), making them hard to diagnose and treat5,6,7. Gram-negative bacteria, including the Enterobacterales Klebsiella pneumoniae and Escherichia coli, were shown to be the main causative agents of neonatal sepsis in a study of several LMICs in Africa and South Asia8.

K. pneumoniae and E. coli both make up part of the human commensal, intestinal microbiota, and colonisation of the neonatal gastrointestinal (GI) tract with such organisms occurs rapidly after birth, influenced by the maternal flora and peri-partum environment. In the context of neonatal sepsis, the colonised GI tract has been identified as a reservoir for BSI9. This can arise from direct bacterial translocation from the gastrointestinal (GI) tract to the blood across the gut epithelium, or via secondary colonisation and infection of other body sites such as the skin, respiratory or urinary tract10,11. Bacterial translocation through the intestinal mucosal barrier into the bloodstream can occur during gut dysbiosis12 or increased intestinal permeability13. Premature neonates and those born with low birth weight are particularly vulnerable, as immature immune and gastrointestinal systems can lead to breach of the mucosal barrier11,14,15, and antibiotic exposure seen in healthcare environments such as the neonatal intensive care unit can contribute to gut dysbiosis16. Once the intestinal mucosal barrier has been traversed and bacteria are in the blood, this can lead to sepsis. To survive within two distinct environments as diverse as the intestine and blood, K. pneumoniae and E. coli isolates may already contain all the genes they need, or they may require adaptive mutations or the acquisition of genes via mobile genetic elements17.

A previous study investigated the trends of bacteraemia amongst 2226 children under 5 years of age who were hospitalised with fever in Dar es Salaam, Tanzania7. This study builds on that work by directly comparing bacterial isolates concurrently taken from blood and faecal samples of individual neonates on the same day, to assess whether the faecal and blood isolates are likely related. Highly related E. coli and K. pneumoniae isolates would likely have translocated between body sites, either from the GI tract to the blood or from the blood to the GI tract.

Studies from the USA and Taiwan have analysed faecal metagenomes of pre-term and very low birth weight neonates, showing bacterial translocation in patients11,12,18. Other studies have used comparative genomics to compare E. coli and K. pneumoniae from the urinary tract and gut in patients in the USA19,20, and further work has explored the virulence factors that contribute to bacterial translocation to extra-intestinal sites and subsequent infections in the urinary tract of patients in France21. However, to our knowledge, this is the first study that has directly compared hybrid assembled genomes of blood and faecal E. coli and K. pneumoniae isolates from the same patients at the genomic and molecular level and highlighted virulence genes and acquired mutations that may contribute to the transition between body sites within individual patient isolates. This is also the first study to investigate bacterial translocation between the GI tract and bloodstream in a sub-Saharan Africa setting. The hybrid assemblies provide unique structural and contextual information about genes and replicons, allowing an in-depth comparative genomics approach. These findings expand our understanding of the Gram-negative organisms causing BSIs and the dynamic genetic landscape likely involved in the translocation between body sites.

Results

Typing of E. coli and K. pneumoniae isolated from blood and faeces

In a previous study, 2226 blood cultures were analysed from children below the age of 5 years hospitalised with fever in Dar es Salaam, Tanzania7. Two hundred of these children (9%) were enrolled in a parallel study where faecal samples were taken at the same time as the blood samples22. Forty of the 200 patients (20%) had both ESBL-positive faecal and blood isolates. Sixteen of these 40 blood–faecal pairs of ESBL isolates (40%) displayed identical phenotypic antimicrobial susceptibility testing (AST) profiles22.

In this study, we analysed 13 of these 16 blood–faecal pairs from the same patients with identical AST profiles (one or more isolates from each of the 3 remaining pairs could not be located)22. These blood and faecal isolates were sequenced and hybrid assembled (see Supplementary Table 1) to determine their relatedness and the likelihood of bacterial translocation from the gastrointestinal (GI) tract to the bloodstream, or vice versa (see Table 1). In total, 26 isolates were therefore assessed as 13 faecal (FC) and blood (BL) pairs: 4 pairs were confirmed as Escherichia coli (n = 8), 8 pairs as Klebsiella pneumoniae (n = 16), and 1 “pair” consisted of Klebsiella pneumoniae and Klebsiella quasipneumoniae23 (n = 2). These were from neonatal patients under the age of 19 days with fevers lasting 1–3 days (Supplementary Table 2). When referencing the genomic data, the prefix FS refers to the patient ID (e.g., FS2155), the prefix FSBL refers to the blood isolate (FSBL2155), and the prefix FSFC refers to the faecal isolate (FSFC2155).

Table 1 MLST profiles of the 13 paired blood and faecal isolates (n = 26)

Firstly, the relatedness of the paired blood and faecal isolates was assessed using multi-locus sequence typing (MLST) and core genome MLST (cgMLST). Eight out of the thirteen pairs (4 E. coli and 4 K. pneumoniae) shared the same MLST, and nine shared the same cgMLST (4 E. coli and 5 K. pneumoniae). The discrepancy occurred with patient FS2155; the faecal (FSFC2155) and blood (FSBL2155) MLST profiles were ST426 and ST3559, respectively, but the sublineage (323) and clonal group (ST429) from cgMLST were the same for both, as well as the K-type, O-type and wzi capsular gene type (see Table 1). Of the four K. pneumoniae highly related pairs, all had the same within-pair K-type, O-type and wzi gene type. Of the four E. coli pairs, all isolates, both from blood and faeces, were of the phylogroup B2. The E. coli isolates from patient FS2071 belonged to ST131, and the E. coli isolates from patients FS1654, FS2240 and FS2258 belonged to ST1193.

Determining genome similarity between paired isolates using average nucleotide identity (ANI) and core genome single-nucleotide polymorphisms (SNPs)

To further assess the relatedness of the paired blood and faecal isolates, average nucleotide identities (ANI) and core genome single-nucleotide polymorphism (SNP) distances were determined for E. coli and Klebsiella spp. (Fig. 1) and between all isolates (Supplementary Fig. 1). Of the 9 pairs sharing the same cgMLST, 8 had an ANI of 99.99% or above and 23 SNPs or fewer (i.e. ≥99.99% ANI and ≤23 SNPs), while the FS2155 isolates had an ANI of 99.78% and 7 SNPs (see Fig. 1). The ANI analysis also revealed close relatedness between isolates which were not part of the same pair. For example, the K. pneumoniae isolate FSBL2111 shared an ANI of 100% and 0 SNPs with both FSFC2112 and FSBL2112, indicating they are all the same strain. The K. pneumoniae isolate FSBL2265 shared an ANI of 99.88% with FSBL2130 (0 SNPs) and 100% (0 SNPs) with FSFC2130, indicating they are likely the same sequence type or possibly the same strain. The E. coli isolates FSBL1654 and FSBL2240 shared an ANI of 100% with FSFC2258 (see Fig. 1).
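As an illustration of what a core genome SNP distance measures, the following minimal Python sketch counts differing positions between sequences of a core genome alignment. It is not the pipeline used in this study (which relied on snippy and snp-dists); the isolate names and toy sequences are hypothetical, and the choice to ignore gaps and ambiguous bases is an assumption made for the example.

```python
from itertools import combinations

# Only unambiguous bases are compared (an assumption of this sketch).
VALID_BASES = set("ACGT")


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where both bases are unambiguous and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def pairwise_snp_distances(alignment: dict) -> dict:
    """Return the SNP distance for every pair of isolates in the alignment."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Toy core genome alignment; names and sequences are hypothetical.
toy_alignment = {
    "blood_A": "ACGTACGTAC",
    "faecal_A": "ACGTACGTAC",
    "blood_B": "ACGTTCGNAC",  # one true difference, one ambiguous base (N)
}
distances = pairwise_snp_distances(toy_alignment)
```

In this toy example the blood and faecal isolates of the first hypothetical patient are 0 SNPs apart, whereas the unrelated blood isolate differs at one unambiguous position.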

Fig. 1: Average nucleotide identities (ANI) and core genome single-nucleotide polymorphisms (SNPs) across 4 E. coli and 9 Klebsiella spp. paired isolates.

ANI was calculated using the BLAST alignment algorithm (ANIb). The core genome SNPs are displayed in each cell below the ANI %; these were calculated using snippy v4.6.3 and snp-dists, with FSFC1386 as the reference (see the “Methods” section). Klebsiella spp. are displayed together, with the 17 Klebsiella pneumoniae isolates and the 1 Klebsiella quasipneumoniae isolate (FSFC0558) highlighted by blue lines.

Phylogenetics reveals the genomic relatedness across isolates

Core genome trees were built based on SNP differences between the core genomes of closely related isolates (see Fig. 2A). All 8 pairs that shared the same MLST, ≥99.99% ANI and ≤23 SNPs were on the same branch, with branch lengths of less than 0.00008 between nodes. The isolates from FS2155 were also on the same branch. As with the ANI analysis, isolates from different patients were closely related; this included the K. pneumoniae isolate FSBL2111, which was on the same branch as FSBL2112 and FSFC2112, and the K. pneumoniae isolate FSBL2265, which was on the same branch as FSFC2130 and FSBL2130. Further to this, both blood and faecal E. coli isolates from FS1654, FS2240 and FS2258 were closely related on the tree.

Fig. 2: Core genome phylogenetic trees and admission timelines reveal relatedness across isolates.

A Core genome phylogenetic tree of K. pneumoniae and E. coli isolates. The phylogenetic trees are based on SNPs derived from the core genome alignment of K. pneumoniae and E. coli isolates. The Klebsiella pneumoniae subsp. pneumoniae HS11286 genome (GCF_000240185.1) was used as a reference for Klebsiella pneumoniae isolates and the Escherichia coli str. K-12 substr. MG1655 genome (GCF_000005845.2) was used as a reference for E. coli isolates. Branch lengths are representative of the number of SNPs between isolates. For the E. coli tree, FS2071 was removed to clearly visualise the relationship between the other paired isolates (shown in a black box). The coloured shapes indicate metadata such as species (green), sample type (brown/red) and hospital of origin (blue). MNH Muhimbili National Hospital. B Timelines from admission to discharge of neonatal patients in Temeke and Amana hospitals are represented by the start and end of the green lines in the timeline. MNH was excluded as only one patient was from this study site. The colours represent species consistent with the panel above.

Isolates from patients FS2111 and FS2112 were taken from the same hospital (Amana) within 3 days of each other (see Fig. 2B), indicating possible nosocomial transmission. Isolates from patient FS2265 came from Temeke hospital, and isolates from patient FS2130 came from Amana, so the similarity in sequence type is likely only incidental. Isolates from patients FS2258, FS1654 and FS2240 were all taken from Temeke hospital, but in different months of the year, so nosocomial transmission here would indicate an endemic strain within the hospital.

Subsequently, the eight pairs (4 E. coli and 4 K. pneumoniae) shown to have the same ST from MLST, an ANI ≥ 99.99% and ≤23 SNPs (i.e., isolates from FS1386, FS1448, FS1654, FS2071, FS2112, FS2130, FS2240 and FS2258) will be referred to as highly related pairs. Together, it is likely that each of these highly related pairs represents bacterial translocation of the same isolate either from the GI tract to the bloodstream or the bloodstream to the GI tract.
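The combined criteria above can be written as a simple rule. The Python sketch below is illustrative only and is not code from this study; the class and function names are hypothetical, and only the FS2155 values (different MLST, ANI of 99.78%, 7 SNPs) are taken from the text, while the second example pair is invented.

```python
from dataclasses import dataclass

# Thresholds as stated in the text.
ANI_THRESHOLD = 99.99  # percent, inclusive lower bound
SNP_THRESHOLD = 23     # core genome SNPs, inclusive upper bound


@dataclass
class PairMetrics:
    """Relatedness metrics for one blood-faecal pair (hypothetical container)."""
    patient: str
    same_mlst: bool
    ani_percent: float
    core_snps: int


def is_highly_related(pair: PairMetrics) -> bool:
    """Apply all three criteria: same MLST, ANI >= 99.99% and <= 23 SNPs."""
    return (
        pair.same_mlst
        and pair.ani_percent >= ANI_THRESHOLD
        and pair.core_snps <= SNP_THRESHOLD
    )


# FS2155 values are from the text; the second pair is a made-up example.
fs2155 = PairMetrics("FS2155", same_mlst=False, ani_percent=99.78, core_snps=7)
made_up = PairMetrics("hypothetical", same_mlst=True, ani_percent=99.99, core_snps=23)

fs2155_related = is_highly_related(fs2155)
made_up_related = is_highly_related(made_up)
```

Requiring all three criteria at once is what excludes FS2155: its low SNP count alone would pass, but its MLST and ANI do not.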

Comparison of AMR, virulence and metal resistance genes between highly related blood and faecal paired isolates

We then compared the antimicrobial resistance (AMR), virulence factor and metal resistance genes between the highly related pairs to determine whether any genes were acquired or lost in the translocation between body sites. The acquired AMR profiles changed in two highly related pairs (see Supplementary Data 1). The percentage identity to the database reference gene changed for sul2 and aph(3'')-Ib in FS1448 isolates, and there was a gain of dfrA17 on an integron in FSBL1654 compared to FSFC1654; the dfrA17 gene confers resistance to trimethoprim (part of co-trimoxazole) (Supplementary Fig. 2). There were no differences in the genes conferring resistance to amoxycillin, gentamicin and ceftriaxone, which were the main antibiotics used to treat the patients (see Supplementary Table 2). All 8 highly related pairs shared the same virulence factors when compared against the VFDB database and the same metal resistance genes when compared against the MEGARes database (see Supplementary Data 1).

Comparative genomics of SNPs between highly related blood and faecal paired isolates

We used breseq to run a genome comparison for each pair, using the hybrid assembled faecal genome as the reference and the long reads of the blood isolate as the query. This identified SNPs (both synonymous and non-synonymous), small INDELs (<100 bp) in coding regions and intergenic mutations that had been acquired in the blood isolate compared to the faecal isolate (Fig. 3). The number of predicted mutations ranged from 1 to 516 across the nine pairs analysed, i.e. the eight highly related pairs and FS2155 (Supplementary Table 3), with a median of 10, a mean of 77.6 and a standard deviation of 167. However, for some isolates, large numbers of SNPs were concentrated within a single coding DNA sequence (CDS). For example, in isolates from FS2071 there were 97 SNPs between the faecal and blood isolates, with 19 in a gene encoding a phage lysozyme, 5 in a gene encoding a phage endopeptidase and 16 in a gene encoding a phage tail fibre protein.

Fig. 3: Predicted SNP mutations when comparing the blood genome against the faecal genome for a set of eight highly related paired isolates.

All pairs compared with SNP mutations and INDELs <100 bp categorised by gene class, gene context and eggNOG clusters of orthologous genes (COG). SNPs and INDELs were identified with breseq. Heatmaps were visualised separately for synonymous SNPs, non-synonymous SNPs, INDELs <100 bp within the coding region, and intergenic mutations (both SNPs and INDELs). For intergenic mutations, the closest gene in bp to the SNP/INDEL is displayed.

General trends across all the pairs can be seen in Fig. 3. SNPs were observed across pairs in genes involved in amino acid metabolism, DNA replication, metal transport/metabolism, sugar transport/metabolism and genes associated with mobile genetic elements and phages. Overall, most SNPs were found in genes related to prophages, which contributed to a range of functions, but mostly within genes encoding phage endopeptidase, phage lysozyme and genes involved in the assembly of the phage tail. Synonymous SNPs were found in the genes involved in the synthesis of the amino acid glutamate in FS2240 and FS1654. Mutations in and around genes of the lac operon were found in FS2240 and FS2130, which included non-synonymous SNPs, INDELs and intergenic mutations. Synonymous SNPs and INDELs were found in the virulence gene vgrG in both FS1654 and FS2240.

FS2155 is an outlier with 516 SNPs, and its SNPs are therefore displayed separately in Supplementary Data 2. Non-synonymous SNPs of note were found in genes involved in sugar metabolism/transport, including genes of the lac operon (e.g., beta‑galactosidase) and those associated with galactose, glucose, lactose, sucrose and mannose. Genes encoding transcriptional regulators of the lac operon were also found to have non-synonymous SNPs. Ten non-synonymous SNPs and 3 intergenic mutations were found in the fec genes encoding the iron dicitrate transport system, with 1 non-synonymous mutation found in the ferrichrome‑iron receptor gene. One of the non-synonymous SNPs in fecA was an A16T mutation, which is predicted not to be tolerated in the protein structure and is therefore likely to affect protein function. Together, this shows a complex picture of SNPs in a varied set of genes and gene classes contributing to, or arising as a result of, translocation between body sites.

Insertions, deletions and mobile genetic element movement between highly related paired isolates

We investigated any discrepancies across the entire genome between the highly related pairs, such as large insertions and deletions. The hybrid assemblies allowed us to compare contigs directly across pairs (see Supplementary Table 4), and a BLAST alignment of the blood and faecal isolates for each pair was used to examine large insertions and deletions between and within contigs (Fig. 4). All the highly related pairs, except FS1654 and FS2240, have 1 chromosome and between 0 and 8 plasmids. However, the term “plasmid” is used here to refer to any extra-chromosomal contig, and not all extra-chromosomal contigs are likely to be plasmids, as some are <5 kb and contain no identifiable rep gene. Generally, across the pairs there is a loss or gain of phage genes and mobile genetic elements when translocating between body sites (see Supplementary Fig. 2). The blood and faecal genomes of FS2155 have the largest disparities of all paired samples in Fig. 4, with a 400 kb disparity in chromosome size; however, this could be explained by the insertion of FC p1 into the chromosome (see Supplementary Fig. 2). The connections in Fig. 4 show that all 8 highly related pairs (from patients FS1386, FS1448, FS1654, FS2071, FS2112, FS2130, FS2240 and FS2258) have chromosomes which share >99.9% sequence identity between blood and faecal isolates, providing further evidence that all these bloodstream isolates are clonally related to the isolates from the GI tract of their respective host.
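To make the filtering of homologous regions concrete, the following Python sketch keeps BLASTn hits that exceed an identity and a length cut-off, using the >99.9% identity and >2 kb values given in the legend of Fig. 4. It assumes the standard 12-column tabular BLAST output (outfmt 6), which may not be the format used in this study; the contig names and hit values are hypothetical.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(tabular_text, min_identity=99.9, min_length_bp=2000):
    """Keep hits with identity > min_identity (%) and length > min_length_bp."""
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) != len(BLAST_FIELDS):
            continue  # skip blank or malformed lines
        hit = dict(zip(BLAST_FIELDS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        if hit["pident"] > min_identity and hit["length"] > min_length_bp:
            kept.append(hit)
    return kept


# Hypothetical hits: blood contigs (query) against faecal contigs (reference).
example_hits = "\n".join([
    "BL_chr\tFC_chr\t99.98\t150000\t30\t0\t1\t150000\t1\t150000\t0.0\t270000",
    "BL_chr\tFC_p1\t99.95\t1500\t1\t0\t200001\t201500\t1\t1500\t0.0\t2700",
    "BL_p1\tFC_p1\t98.70\t5000\t65\t0\t1\t5000\t1501\t6500\t0.0\t8500",
])
kept = filter_hits(example_hits)
```

Only the first hypothetical hit is retained: the second is shorter than 2 kb and the third falls below the identity cut-off. Setting `min_length_bp=0` would correspond to the case where no size limit is imposed.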

Fig. 4: Whole genome comparison between the blood and faecal isolates for each highly related pair of isolates.

Connections shown are homologous regions that share >99.9% sequence identity and are larger than 2 kb for all pairs except FS1654 and FS2240, where no size limit was imposed (i.e. >0 kb) due to the method of assembly. Each pair was aligned with BLASTn, blood isolate genome assemblies (red) were used as the query sequence and faecal isolate genome assemblies (blue) were used as the reference. Each genome is split into individual contigs, each with one chromosome and any extra-chromosomal contigs designated plasmids (even if a rep gene was absent). The exception is FS1654 and FS2240, where all contigs are numbered and not distinguished as a chromosome or plasmid. FC faecal isolates, BL blood isolate, chr chromosome, p plasmid, c contig.

Hypermutator assay

We investigated whether FS2155 was a hypermutator strain, which could explain the difference in MLST, the lower ANI and the large genomic changes between its blood and faecal isolates. There was no clear difference between FS2155 and the other isolates in their ability to become resistant to rifampicin (Supplementary Table 5), indicating that the differences in the FS2155 isolates are not due to it being a hypermutator strain, at least under the laboratory conditions tested.

Virulence, pathogenicity and invasiveness determinant genes across all isolates

We screened all 26 isolates against the VirulenceFinder database to determine if there were certain genes that determined invasiveness/virulence in the translocation between body sites (Fig. 5A). The E. coli and K. pneumoniae isolates can be easily distinguished by the absence or presence of certain genes. All 8 E. coli isolates have similar genes, which is expected as they are all part of phylogroup B2; the only difference is that the FS2071 isolates have 7 additional adhesion genes and lack 2 genes, one involved in host defence evasion (kpsT) and another expressing a toxin (vat), which are found in the rest of the E. coli isolates. Among the Klebsiella spp., 61% of isolates (11/18) have the same set of virulence genes; these are all involved in iron uptake and include a set of ybt genes, irp and fyuA. The ybt genes, involved in the production of the yersiniabactin siderophore, are present in the highly related paired isolates, which likely translocated between body sites, but are not present in the unpaired isolates. This is also the case for irp2, encoding an iron regulatory protein, and fyuA, encoding the yersiniabactin receptor. This could indicate that the ybt, fyuA and irp2 genes are important for the adaptation of K. pneumoniae in the bloodstream. These genes are always found on the chromosome of these isolates and are co-located within the same biosynthetic gene cluster (Fig. 5C). Of the remaining Klebsiella isolates, 22% (4/18) have only one virulence gene, yagZ/ecpA, involved in adhesion, and 17% (3/18) had no detectable virulence gene from the virulence factor database.

Fig. 5: Virulence determinant genes in blood and faecal isolates across several datasets.

A Paired and unpaired isolates from this study (n = 26) were screened against the virulence factor database, and the virulence genes present were visualised as a heatmap. Highly related pairs are in bold. The virulence genes were split into four classes. B The percentage of isolates carrying the ybt, fyuA and irp2 genes in two datasets, BSI and DASSIM, from Blantyre, Malawi, containing 472 E. coli and 244 K. pneumoniae species complex (KpSc) isolates. Out of the 472 E. coli, 23 were BSI, 31 were DASSIM community, 334 were DASSIM sepsis and 84 were DASSIM inpatient. Out of the 244 KpSc, 57 were BSI, 16 were DASSIM community, 138 were DASSIM sepsis and 33 were DASSIM inpatient. C The genomic structure of the ybt biosynthetic gene cluster in the FSFC1448 genome.

To further investigate the association of the ybt, fyuA and irp2 genes with BSI pathogenicity, we analysed two datasets, BSI and DASSIM, containing 716 E. coli and K. pneumoniae species complex (KpSc) isolates from Blantyre, Malawi. The BSI dataset contained blood isolates from sepsis patients and the DASSIM dataset contained faecal samples from community patients, inpatients and sepsis patients. We determined the percentage of these isolates containing the ybt, fyuA and irp2 genes (Fig. 5B) to see if they were overrepresented in bloodstream isolates compared to faecal isolates. Sixty-nine percent of E. coli isolates from the BSI dataset had all ybt, fyuA and irp2 genes, whereas in the DASSIM dataset the genes were found in 58% of isolates from community patients, 59% from inpatients and 59% from sepsis patients. In this case, the ybt, fyuA and irp2 genes were overrepresented in the bloodstream infection isolates. With K. pneumoniae, the picture is different: the genes are present in 18% of BSI isolates, while among the DASSIM faecal isolates they are present in 18% of inpatients, 34% of sepsis patients and 38% of community patients.

Discussion

In this study we used in-depth comparisons of hybrid assembled genomes with blood isolates and faecal samples taken from the same neonatal patients in Dar es Salaam, Tanzania7,22 to determine if the bacterial isolates translocated between body sites and what genomic changes, if any, accompanied this translocation event. We found that eight out of the 13 pairs (4 E. coli and 4 K. pneumoniae) shared the same MLST, had an ANI ≥ 99.99%, ≤23 SNPs and had a very small difference in branch length on the core genome phylogenetic tree. This high relatedness between the pairs indicates that the 4 E. coli and 4 K. pneumoniae isolates are likely clonal and either translocated from the GI tract to the blood, or vice versa, from the blood to the GI tract. The directionality cannot be deduced from our data since the faecal and blood samples were collected concurrently. In the context of neonatal sepsis, the colonised GI tract has previously been identified as a reservoir for BSI, and translocation of bacteria from the GI tract to the bloodstream has been reported from NICUs in the USA9,11. This is the most likely scenario for intestinal commensal organisms such as E. coli and K. pneumoniae. Translocation from the GI tract can either occur directly through the intestinal mucosal barrier into the bloodstream or via a secondary route by colonising other body sites such as the skin, respiratory or urinary tract10,11. The route of translocation (i.e., direct or secondary) cannot be deduced from our data; this would require extensive and invasive sampling of body sites in the same patients over time, something which was not feasible within our studies.

Alternatively, there are cases where bacteria have been shown to translocate from the bloodstream to the GI tract, such as Salmonella enterica serovar Typhi, a causative agent of typhoid fever, which can infect the GI tract from the bloodstream via a secondary route through the gallbladder24,25. After initial invasion of the gut, S. Typhi invades the bloodstream either directly through the intestinal epithelial barrier26 or via the lymphatic system to cause systemic invasion27. S. Typhi can then invade the gallbladder directly from the bloodstream or by retrograde spread through the bile28. This can then lead to chronic colonisation of the gallbladder in a subset of individuals, where S. Typhi is shed intermittently into the GI tract and in the faeces26. While a secondary route from the bloodstream to the GI tract is possible for E. coli and K. pneumoniae via organs such as the liver29, this route has been mostly reported for bacteria which are not commensals in the GI tract, such as S. Typhi, where it is a critical part of their life cycle. Further studies, including longitudinal cohort studies, should seek to determine the directionality and route of translocation of K. pneumoniae and E. coli in neonatal BSIs in this setting.

The 4 E. coli highly-related pairs (from patients FS1654, FS2071, FS2240 and FS2258) were all of the phylogroup B2 (Table 1), which has been associated with pathogenic clones, particularly extraintestinal pathogenic E. coli (ExPEC)10 and has been indicative of urinary tract21,30 and bloodstream infection31. The 4 K. pneumoniae highly related pairs each shared the same within-pair K-type, O-type and wzi gene type, yet these differed between different pairs.

The blood and faecal K. pneumoniae isolates from patient FS2155 shared the same cgMLST, K-type, O-type and wzi gene type and differed by 7 core genome SNPs, but had a different MLST and a lower ANI of 99.78%. Further comparison of whole genome sequences (Fig. 4) also indicated that they are unlikely to be the same strain, as large parts of the chromosome did not share >99.9% sequence identity, which would be expected if the strain was the same; in the eight highly related pairs, the entire chromosome shares >99.9% sequence identity. The young age of the patient (1 day old) and the short duration of fever (see Supplementary Table 2) would seemingly not accommodate enough generations of K. pneumoniae to allow for a large number of SNPs between the FS2155 blood and faecal isolates, and no isolates were found to have an elevated mutation rate based on the results of the rifampicin mutation assay we carried out (Supplementary Table 5). Alternatively, we may have isolated different but related isolates from the faeces and the blood, or the selection procedure may have missed the blood isolate in the faecal sample when in fact it was present.

The ANI/SNP comparisons and phylogenetic tree showed there was high relatedness between isolates from different patients in the same hospital, such as FS2111 and FS2112 from Amana and FS2258, FS1654 and FS2240 from Temeke (see Fig. 2), indicating potential nosocomial transmission between patients. Nosocomial transmission between patients within NICUs in the USA has been indicated prior to the onset of sepsis in another study9.

It is difficult to apply ANI thresholds to isolate relatedness; however, a recent study showed that >95% indicates the same species, >99.5% indicates the same sequence type and >99.99% indicates the same strain32. Our methodology also included SNP distances (see Fig. 1), which provide a clearer distinction between closely related isolates. In genomes containing millions of base pairs, percentages can provide misleading results. For example, the E. coli K-12 strain has a reference genome with a size of 4,639,221 bp, so two genomes differing by 400 SNPs would still have an ANI of 99.99% even though they may not be clonal. Nevertheless, there is no consensus on SNP thresholds to apply to bacterial relatedness. A recent study has shown that <100 SNPs represent isolate linkage across One Health sectors (i.e., animals, humans and the environment)33, while other studies have used lower thresholds to define relatedness in hospital settings, such as <25 SNPs per 5 Mbp to indicate likely strain sharing of K. pneumoniae34 and <17 SNPs to infer E. coli transmission35. Therefore, in this study, we have chosen a multilayered approach to determine high relatedness between isolates. Going forward, we recommend the use of both ANI and SNP distances to determine isolate relatedness within a cohort of compared strains, as well as other methodologies presented in this study such as phylogenetics, variant calling and whole genome comparison.
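The arithmetic behind the K-12 example can be checked with a short calculation. The Python sketch below is a simplification: it treats identity as one minus the fraction of substituted positions over the whole genome, whereas ANI tools compute identity over aligned fragments, so the values are approximations. The function names are hypothetical.

```python
import math

K12_GENOME_BP = 4_639_221  # E. coli K-12 reference genome size quoted in the text


def identity_after_snps(genome_length_bp: int, snps: int) -> float:
    """Percent identity if `snps` positions differ across the whole genome."""
    return 100.0 * (1 - snps / genome_length_bp)


def max_snps_at_identity(genome_length_bp: int, identity_percent: float) -> int:
    """Largest SNP count that still keeps identity at or above the given percent."""
    return math.floor(genome_length_bp * (100.0 - identity_percent) / 100.0)


identity_400 = identity_after_snps(K12_GENOME_BP, 400)
identity_23 = identity_after_snps(K12_GENOME_BP, 23)
snp_ceiling = max_snps_at_identity(K12_GENOME_BP, 99.99)
```

Under this simplification, 400 SNPs give an identity of about 99.9914%, which rounds to 99.99%, and up to 463 SNPs remain compatible with an identity of at least 99.99%. This is why a SNP threshold such as ≤23 is far more stringent than the ANI threshold alone.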

The antibiotic resistance profiles only differed in two highly related pairs, with the gain of dfrA17 in FSBL1654 compared to FSFC1654 being the only notable difference between all highly related pairs. Previously, it has been noted that the presence of ESBL-producing Gram-negative isolates in the faeces of hospitalised adults is associated with the subsequent onset of sepsis9 and all our blood isolates that caused bloodstream infections contained the common ESBL gene blaCTX-M-15, which confers resistance to 3rd generation cephalosporins and is the most common ESBL gene found in E. coli isolates worldwide36. This is likely due to the similarity in phenotypic antibiotic susceptibility profiles being a pre-requisite for selection of isolates in this study.

Comparative genomics showed there were common SNPs across patient-paired isolates in amino acid metabolism, DNA replication, metal transport/metabolism, sugar transport/metabolism and genes associated with mobile genetic elements and phages.

INDELs were found in genes of the lac operon in FS2240 and FS2130 (Fig. 3). INDEL mutations were also found in the lac operon and beta-galactosidase gene in FS2155, as well as genes involved in the metabolism of a wide set of sugars. E. coli can utilise a variety of sugars for growth in the GI tract19, and mutations in sugar metabolism/transport pathways have been shown to have an effect on the colonisation of the enterohemorrhagic E. coli strain EDL933 in the intestine of mice37. Since all the patients involved in this study were aged <19 days, the main sugar source for gut bacteria would be lactose from breast or formula milk; this could explain the mutations in the lac operon, as it would be redundant following a change of carbohydrate substrate.

When comparing blood and faecal isolates within pairs, a synonymous mutation was found in a copper resistance transcriptional regulator in FS1654, and in FS2155, 10 non-synonymous mutations were found within the fec genes, which encode the iron dicitrate transport system. These include an A16T mutation in fecA, predicted to affect protein function. In a previous study, a separate non-synonymous mutation was found in fecD when assessing E. coli moving from the gut to the urinary tract17.

Most SNPs were found in genes related to phage/prophage (Fig. 3). This supports the whole genome comparative analysis in Fig. 4 and Supplementary Fig. 2, which show that differences in the genomes were mainly accounted for by movement of mobile elements, including phage. If the translocation occurred from the GI tract to the bloodstream, it could be the case that the movement from the nutrient-rich environment of the gut to the nutrient-sparse environment of the bloodstream causes a stress response in the cell, which in turn leads to widespread movement of mobile genetic elements (MGEs) and the activation of prophages and subsequent mobilisation of phages38,39. This activation of mobile genetic elements could be beneficial to bacteria, and the MGEs, allowing them to adapt to new environmental stresses by increasing genetic variability, as the movement of MGEs intracellularly can lead to gene knockouts or upregulation of mobilised genes when their genetic context changes40. Paradoxically, the activation of prophage to a phage would likely be deleterious for the bacterial host, and therefore, there could be a selection bottleneck for mutated phage genes in isolates successfully translocating between the gut and blood. In this case, the inactivation of genes involved in the replication of phages and subsequent lysis would be advantageous to the bacterium41,42,43.

This widespread movement of MGEs, including phage, could also enable the acquisition of beneficial traits from other bacteria; however, while there is a high abundance of plasmid transfer between bacteria in the gut44, there is unlikely to be high plasmid transfer in blood, where contact between cells is transient. During conjugation, the cells must come into contact with one another and sustain the interaction long enough for genetic material to be exchanged45, and it has previously been shown that conjugation occurs at a higher rate in fixed conditions, such as biofilms, than in liquid culture46,47,48.

We also looked to see if there were any genes in our blood isolates that were indicative of virulence, pathogenicity and invasiveness by screening all isolates against the virulence factor database (Fig. 5A). All E. coli isolates had the same virulence genes except the ones from FS2071, which had 7 additional adhesion genes and lacked 2 genes when compared to the others. The Klebsiella spp. had more variation in virulence genes, with 61%, including all the highly related pairs, having 11 genes involved in iron uptake, and the rest (39%) only having 1 or 0 virulence genes. The additional 11 genes were variants of ybt, irp and fyu genes, which are all involved in iron uptake and metabolism. The ybt gene encodes the yersiniabactin siderophore, which, unlike the enterobactin siderophore, can escape the host innate immune system protein siderocalin, and is associated with pathogenicity in E. coli and K. pneumoniae49. This could indicate that the yersiniabactin siderophore is key for the translocation of K. pneumoniae into the blood since all the highly related paired isolates had the yersiniabactin genes, and the unpaired isolates did not contain these genes. These unpaired isolates could unintentionally provide a background of faecal isolates that were not concurrently in the gut and the bloodstream and therefore show a potential difference in genes of intestinal Klebsiella isolates and extra-intestinal pathogenic Klebsiella isolates. However, since single colony picks do not capture the full strain diversity in the gut microbiota, we would need to use a more complete approach, such as metagenomics or whole plate sweeps, to assess the differences. We tested this further by including two datasets from Blantyre, Malawi (BSI and DASSIM), to compare the presence of the ybt, irp and fyuA genes in 716 blood and faecal isolates and see if the genes were overrepresented in the bloodstream isolates.
While there was no clear distinction between blood and faecal Klebsiella isolates, in E. coli isolates, there was a higher abundance of these genes in the BSI isolates when compared to faecal carriage isolates from DASSIM. Together, this indicates that important genes required for invasion into and/or persistence in the blood were already present in the gut isolates. This opens the possibility for diagnostic interventions. For example, certain patient groups, such as very low birthweight neonates, could have their stool screened by PCR for the presence of the ybt, fyuA and irp2 genes to determine if they are predisposed to bloodstream infections. Since the same isolates were found in both blood and faecal samples, and the sampling of stool is much less invasive than blood sampling for neonates, this could make it easier for healthcare workers in resource-limited settings to take stool samples to identify bacteria predisposed to cause BSI. This would require further investigation; however, if successful, this targeted intervention could contribute to the prevention of BSIs in neonates.

This study has several limitations. We selected our isolates using single colony picks; therefore, we may have missed some highly related paired isolates at the selection stage and were not able to capture strain diversity within the gut microbiota. The study was cross-sectional as the faecal and blood isolates were collected concurrently; therefore, the directionality of any bacterial translocation cannot be deduced from these data alone. A longitudinal study following children over time to establish any subsequent BSIs would have required a very large group of children due to the low incidence of BSI in an unselected population. Even though such a design is ideal, it is currently only theoretically feasible in the low-resource setting of this study. Not all the originally paired isolates could be found and sequenced; therefore, only 26 out of the original 32 isolates could be sequenced as pairs. By not sequencing these, we may have lost information about genes or SNPs involved in bacterial translocation to the blood, and we cannot predict how this may have biased the results. Further to this, the sample size of 13 paired samples from blood and faeces (n = 26) is small, and as such we are not able to infer statistical significance from this low-powered sample. While most of the genomes were assembled with the same hybrid assembly method (hybracter), two were assembled with flye due to a lack of short-read sequences, and two were assembled with unicycler due to poor long-read quality. The use of different genome assemblers for some isolates also introduces variability. While this did not affect sequence typing, ANI, SNP analysis or screening for the absence or presence of genes, it did have an impact on the whole genome comparison between the blood and faecal isolates (Fig. 4).

To conclude, we have shown that 8 isolate pairs (4 E. coli and 4 K. pneumoniae) were highly related (same MLST, ≥99.99% ANI and ≤23 SNPs) and are therefore likely to have translocated between body sites in neonatal patients with bloodstream infections. The changes that marked this transition varied between the isolate pairs but were mainly found in genes associated with mobile genetic elements and phages. It is possible that most of the genes required for persistence in the blood were also present in the highly related gut isolates, as a set of important iron uptake genes was shared across blood and faecal isolates within each species. This analysis also highlighted the potential involvement of the yersiniabactin siderophore in BSI caused by both Klebsiella spp. and E. coli. However, functional validation is required to confirm this. These findings expand our understanding of the Gram-negative bacteria causing BSI and provide potential genetic markers that could be utilised in targeted interventions to prevent future BSIs in neonates.

Methods

Study design, sample collection and bacterial isolation

This hospital-based, prospective, cross-sectional study was conducted in Dar es Salaam, Tanzania, from March 2017 to July 2018. The study enrolled children below 5 years of age who were admitted to hospital with fever (>37.5 °C) at three regional hospitals, Amana, Temeke, and Mwananyamala, and one tertiary hospital, Muhimbili National Hospital (MNH). The study analysed blood cultures from 2226 children7, with 82% of participants under the age of two and about 10% having positive blood cultures. Sixty-two K. pneumoniae and 28 E. coli were isolated from the blood cultures of these patients, totalling 90 blood isolates. The collection and processing of blood samples and bacterial isolation have been described in detail previously7. Two hundred of these children (9%) were enrolled in a parallel study, which also collected faecal samples22 at the same time as the blood samples using a systematic sampling approach, whereby after every 11 children, a child was selected. The faecal samples were collected as rectal swabs from each patient, and bacteria were isolated as previously described22. Based on the blood samples of these 200 patients, 83 children had confirmed Gram-negative Enterobacteriaceae BSI, and 117 children had negative blood cultures22. From the faecal samples, 46 K. pneumoniae and 64 E. coli were isolated from these 200 patients, totalling 110 faecal isolates. Forty of the 200 patients (20%) had both ESBL-positive faecal and blood isolates. Sixteen out of these 40 blood–faecal pairs of ESBL isolates (40%) displayed identical phenotypic antimicrobial susceptibility testing (AST) profiles (same zone size in the disc diffusion method)22. We included 13 of these pairs with identical AST profiles in this study. Three pairs of the 16 were omitted as not all the isolates could be found and sequenced. The strains (n = 26) were sent to the Liverpool School of Tropical Medicine, UK and stored at −80 °C until the DNA was extracted and sequenced.

Whole genome sequencing

Short-read sequencing was carried out by MicrobesNG (Birmingham, UK). Briefly, the DNA was extracted from the blood and faecal bacterial strains and prepared into DNA libraries using the Nextera XT Library Prep Kit (Illumina, CA, USA). The short-read sequencing was carried out using an Illumina machine (HiSeq X10 platform) according to the 250 bp paired-end protocol.

For long-read sequencing, DNA was extracted from blood isolates with the Fire Monkey high molecular weight (HMW) genomic DNA (gDNA) extraction kit (Revolugen, UK) and from the faecal isolates with the Wizard HMW DNA extraction kit (Promega, WI, USA). Library prep was carried out using the SQK-NBD114.24 ligation sequencing and native barcoding kit according to the manufacturer's protocol (Oxford Nanopore Technologies, Oxford, UK). Sequencing was carried out using a FLO-MIN114 (R10.4.1) flow cell (Oxford Nanopore Technologies, Oxford, UK) on a MinION Mk1B sequencer, running for 72 h at a translocation speed of 400 bp/s. Data acquisition used the MinKNOW software (v22.08.9).

Hybrid genome assembly

The raw data (fast5) was basecalled using guppy v6.4.2 with the super accuracy (sup) model specific to the flow cell (R10.4.1), motor protein (E8.2) and translocation speed (400 bp/s). Read quality was assessed with Nanoplot v1.38.150. The basecalled files were demultiplexed with the guppy_barcoder from Guppy v6.3.8, specifying the SQK-NBD114-24 kit. The reads were assembled into de novo genome assemblies using either hybracter v0.6.051, unicycler v0.4.852 or flye v2.9.253 (see Supplementary Table 1). The majority were assembled with hybracter (n = 22), FSFC1448 and FSBL1448 were assembled with flye as no short read match could be found, and FSFC1654 and FSFC2240 were assembled with unicycler due to the short reads being better quality than the long reads. Hybracter is an automated pipeline for long-read first hybrid assembly. This workflow runs initial quality control, assembles the genome from long reads with flye53 and plasmids with plassembler54, polishes with long reads using medaka (https://github.com/nanoporetech/medaka), polishes with short reads using polypolish55 and pypolca (https://github.com/gbouras13/pypolca) and returns an output of assembly files and statistics. This workflow has been shown to be more accurate than the existing gold-standard hybrid assembly software51. Assemblies were assessed for quality with seqkit56 (see Supplementary Table 1) and visualised with Bandage v0.8.157.

Genomic data analysis and phylogenetics

The assembled genomes were annotated using RAST58 and screened against the Resfinder v 2.4.059, virulence factor database (VFDB)60 and MEGARes61 databases using abricate v1.0.1 (https://github.com/tseemann/abricate). Average nucleotide identities (ANI) between genomes were determined using JSpeciesWS v4.1.162 and SNP distances were determined using snippy v4.3.6 and snp-dists v0.8.2 with the K. pneumoniae FSFC1386 isolate as reference for Klebsiella spp. isolates and the E. coli FSBL1654 isolate as a reference for the E. coli isolates. The heatmap showing both ANI and SNP distances between isolates was visualised with the pandas v2.2.3, matplotlib v3.9.4, seaborn v0.13.2 and numpy v2.2.0 packages in Python v3.10.12. Sequence typing used multilocus sequence types (MLST) and core genome MLST (cgMLST) schemes as part of the Pathogenwatch platform63. The MLST database from Pasteur Institute, France, was used for Klebsiella species (https://bigsdb.pasteur.fr/klebsiella/) and from Warwick University, UK, for E. coli (http://mlst.warwick.ac.uk/mlst/dbs/Ecoli). The cgMLST database from Pasteur Institute, France was used for Klebsiella species (https://bigsdb.pasteur.fr/klebsiella/cgmlst-lincodes/). O- and K-types for Klebsiella species were predicted with Kaptive64, along with wzi capsular gene type65. For E. coli, phylogroups were assigned using ClermonTyping66, which is an in silico typing method based on the quadruplex PCR of the arpA, chuA, yjaA and TspE4.C2 genes67.
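
As an illustration of how the relatedness criteria used in this study (same MLST, ≥99.99% ANI and ≤23 SNPs) could be applied to the typing, ANI and SNP outputs described above, a minimal Python sketch is given here. The record layout, field names and example values are hypothetical and are not taken from the study's analysis scripts.

```python
from dataclasses import dataclass

# Thresholds for "highly related" pairs as stated in this study.
MIN_ANI_PERCENT = 99.99
MAX_SNP_DISTANCE = 23


@dataclass
class PairComparison:
    """One blood-faecal isolate pair (hypothetical record layout)."""
    pair_id: str            # illustrative label only
    blood_st: str           # MLST sequence type of the blood isolate
    faecal_st: str          # MLST sequence type of the faecal isolate
    ani_percent: float      # pairwise average nucleotide identity
    snp_distance: int       # pairwise SNP distance


def is_highly_related(pair: PairComparison) -> bool:
    """Return True only when all three criteria are met together."""
    return (
        pair.blood_st == pair.faecal_st
        and pair.ani_percent >= MIN_ANI_PERCENT
        and pair.snp_distance <= MAX_SNP_DISTANCE
    )


# Invented example values, for illustration only.
examples = [
    PairComparison("pair_A", "ST1", "ST1", 99.995, 4),
    PairComparison("pair_B", "ST1", "ST2", 99.10, 5120),
]
related = [p.pair_id for p in examples if is_highly_related(p)]
```

Requiring all three criteria at once is a conservative choice: a pair that shares a sequence type but exceeds the SNP threshold would not be counted as highly related.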

A core genome phylogenetic tree of 17 K. pneumoniae and 8 E. coli isolates was built (n = 25); the K. quasipneumoniae isolate FSFC0558 was omitted from this analysis so that intra-species relationships could be analysed. The isolates were aligned to species-specific reference genomes using snippy v4.3.6, with the Klebsiella pneumoniae subsp. pneumoniae HS11286 genome (GCF_000240185.1) used as reference for Klebsiella pneumoniae isolates and the Escherichia coli str. K-12 substr. MG1655 genome (GCF_000005845.2) used as reference for E. coli isolates. The core SNP alignments were then combined using snippy-core. To retain only vertically inherited SNPs, Gubbins v2.3.4 was used to remove any recombinant regions from the alignment. A SNP-only alignment was extracted using snp-sites v2.5.1 with the -c option to select for A, C, G and T only. This was passed to IQ-tree v 1.6.1, which was used to construct a maximum-likelihood phylogenetic tree. The model was determined using the ModelFinder function of IQ-tree (-m MF), which calculated TVM + F + ASC + R2 as the best model for K. pneumoniae alignments and TVMe+ASC for E. coli alignments. The model was run using 1000 ultrafast bootstrap replicates (-bb 1000) and keeping identical data (-keep-ident). The phylogenetic tree and associated metadata were visualised using Interactive Tree Of Life (iTOL) v668. The timelines were plotted from the admission metadata using dplyr v1.1.4 and ggplot2 v3.5.1.

Breseq v0.38.369 was used to compare the RAST-annotated genomes from the faecal isolates against the long reads from the blood isolates, with the predicted mutations and unassigned missing coverage (MC) analysed from the output. The genes of interest were classified into clusters of orthologous genes (COGs)70 using eggNOG v5.071, separated according to their mutations, and visualised with the pheatmap v1.0.12 package in R v4.3.1. Genes which were categorised as “Function unknown” within the COG framework were further investigated for putative functions, leading to an additional classification termed “class” inferred from gene annotations and BLAST72 searches against similar genes. We used SIFT73 to determine the predicted effect of genetic mutations on their protein function.

To compare whole genomes within each pair, the hybrid assemblies from blood and faecal isolates were compared with BLASTn72, with the faecal isolate as the query against a database made from the blood isolate genome. The BLAST output was wrangled with dplyr v1.1.4 and visualised as a circos diagram with the circos.genomicInitialize function of the circlize v0.4.16 package in R v4.3.1. Connections plotted were homologous regions that shared greater than 99.9% sequence identity and were larger than 2 kb, except for FS1654 and FS1654, where connections were homologous regions with > 99.9% sequence identity with no threshold limit on basepair size.
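
The filtering of homologous regions described above (greater than 99.9% identity and larger than 2 kb) can be sketched as follows. This is a minimal Python illustration, not the dplyr code used in the study; it assumes the BLASTn results were written in the standard 12-column tabular format (-outfmt 6), and the contig names and values in the example are invented.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(tsv_text, min_identity=99.9, min_length=2000):
    """Keep hits with identity > min_identity (%) and length > min_length (bp).

    Passing min_length=0 mimics the case with no size threshold.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), fieldnames=BLAST6_COLUMNS, delimiter="\t"
    )
    kept = []
    for row in reader:
        identity = float(row["pident"])
        length = int(row["length"])
        if identity > min_identity and length > min_length:
            kept.append(row)
    return kept


# Invented example rows: only the first passes both thresholds.
example_hits = "\n".join([
    "faecal_c1\tblood_c1\t99.95\t5200\t2\t0\t1\t5200\t101\t5300\t0.0\t9500",
    "faecal_c1\tblood_c2\t99.95\t1500\t1\t0\t1\t1500\t1\t1500\t0.0\t2700",
    "faecal_c2\tblood_c1\t98.70\t8000\t104\t0\t1\t8000\t1\t8000\t0.0\t14000",
])
links = filter_hits(example_hits)
```

The retained rows carry the query and subject coordinates (qstart, qend, sstart, send) needed to draw each connection between the faecal and blood genomes.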

The 26 isolates from this study were screened against the virulence factor database (VFDB)60 using abricate v1.0.1 (https://github.com/tseemann/abricate). The output was plotted as a heatmap with the pheatmap v1.0.12 package in R v4.3.1. Two datasets, BSI and DASSIM, were analysed from Blantyre, Malawi, to determine the percentage of iron uptake virulence genes (ybt, fyuA and irp) in blood and faecal isolates from another country in sub-Saharan Africa. Raw sequence read data were downloaded from the European Nucleotide Archive from the following projects: PRJEB8265, PRJEB28522, PRJEB26677, and PRJEB3648674,75,76. The BSI dataset contains isolates from blood, cerebrospinal fluid (CSF) and rectal swabs, collected as part of routine BSI surveillance; only the blood isolates were used for this analysis74,75. The DASSIM dataset involved stool sampling and ESBL-E isolate selection from patients with sepsis76. The sequence reads were quality controlled, trimmed and assembled as previously described77. A total of 716 isolates (472 E. coli and 244 Klebsiella spp.) were screened against the VFDB60 using abricate v1.0.1. The data were separated by dataset and group (DASSIM_sepsis, DASSIM_inpatient, DASSIM_community, and BSI), and the percentage of occurrences was calculated as the number of gene occurrences divided by the total number of isolates in each group per species. Calculations were performed using the “summarise” function in dplyr v1.1.4, and the barplots were visualised with ggplot2 v3.5.1. The genomic structure of the ybt biosynthetic gene cluster from the RAST58 annotated FSFC1448 genome was visualised using clinker v0.0.3178.
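
The percentage calculation described above (gene occurrences divided by the number of isolates in each group, per species) can be illustrated with a short Python sketch. The study itself used dplyr; the record layout and the toy isolates below are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def gene_percentages(isolates, genes):
    """Percentage of isolates carrying each gene, per (species, group).

    `isolates` is an iterable of dicts with keys 'species', 'group' and
    'genes' (a set of gene names detected in that isolate).
    Returns a dict keyed by (species, group, gene).
    """
    totals = Counter()  # isolates per (species, group)
    hits = Counter()    # isolates carrying a gene per (species, group, gene)
    for isolate in isolates:
        key = (isolate["species"], isolate["group"])
        totals[key] += 1
        for gene in genes:
            if gene in isolate["genes"]:
                hits[key + (gene,)] += 1
    return {
        key + (gene,): 100.0 * hits[key + (gene,)] / total
        for key, total in totals.items()
        for gene in genes
    }


# Toy data, invented for illustration only.
toy_isolates = [
    {"species": "E. coli", "group": "BSI", "genes": {"fyuA", "irp2"}},
    {"species": "E. coli", "group": "BSI", "genes": set()},
    {"species": "E. coli", "group": "DASSIM_community", "genes": {"fyuA"}},
]
percentages = gene_percentages(toy_isolates, ["fyuA", "irp2"])
```

Using the group size as the denominator, rather than the size of the whole dataset, keeps groups of very different sizes comparable.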

Calculation of mutation frequency

A rifampicin resistance assay was used to determine the mutation frequency for seven faecal isolates. The colonies were incubated at 37 °C in LB broth for 18 h, the cultures were normalised to an OD600 of 0.1, then incubated at 37 °C until the OD600 reached 0.6–0.8. The cultures were centrifuged at 4000 rpm to pellet the cells, then resuspended in 1 ml of PBS. From this resuspended solution, 100 µl was diluted in PBS to 10⁻⁶ and plated on LB agar plates, and 100 µl was plated on LB agar containing rifampicin (20 µg/ml). Plates were incubated at 37 °C for 16 h. CFU/ml was determined from colony counts of the total cell count plates. Mutation frequency was calculated by dividing the number of rifampicin-resistant colonies by the total number of viable cells plated (CFU/ml).
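
The calculation can be written out as a minimal Python sketch. It assumes that 100 µl was spread on both plate types and that 10⁻⁶ is the final dilution of the total-count plate; neither the function names nor the colony counts in the example come from the study.

```python
def cfu_per_ml(colonies, plated_volume_ml, dilution_factor=1.0):
    """Colony-forming units per ml of the undiluted suspension.

    dilution_factor is the final dilution of the plated sample
    (e.g. 1e-6), or 1.0 when the suspension was plated undiluted.
    """
    if plated_volume_ml <= 0 or dilution_factor <= 0:
        raise ValueError("volume and dilution factor must be positive")
    return colonies / (plated_volume_ml * dilution_factor)


def mutation_frequency(resistant_colonies, total_colonies,
                       plated_volume_ml=0.1, total_dilution=1e-6):
    """Rifampicin-resistant CFU/ml divided by total viable CFU/ml.

    Assumes the rifampicin plate received undiluted suspension and the
    total-count plate received the diluted suspension (assumption).
    """
    if total_colonies <= 0:
        raise ValueError("total colony count must be positive")
    total = cfu_per_ml(total_colonies, plated_volume_ml, total_dilution)
    resistant = cfu_per_ml(resistant_colonies, plated_volume_ml)
    return resistant / total


# Invented counts: 12 resistant colonies, 150 colonies on the 1e-6 plate.
frequency = mutation_frequency(resistant_colonies=12, total_colonies=150)
```

Expressing both counts per ml of the same suspension before dividing keeps the ratio independent of the volume plated, provided the same volume was used for both plates.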

Ethics declarations

The original sample collection was part of a study approved by the Senate Research and Publications Committee of Muhimbili University of Health and Allied Sciences, National Institute of Medical Research, Tanzania and the Regional Committee for Medical and Health Research Ethics (REK), Norway.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.