Introduction

The Escherichia coli sequence type 131 (E. coli ST131) clone has disseminated rapidly worldwide since it was identified in 2008 (ref. 1). E. coli ST131 is an important cause of community-onset bloodstream and urinary tract infections, accounting for up to 30% of all E. coli infections globally1. The rising prevalence of this clone is associated with increasing multidrug resistance (MDR) observed in E. coli infections, particularly against fluoroquinolones and third-generation cephalosporins, severely limiting antibiotic treatment options. Among E. coli ST131’s three sub-lineages, clades A, B and C, subclades H30R/C1 and H30Rx/C2 have dominated the spread of E. coli ST1312. These fluoroquinolone-resistant subclades possess the fimH30 allele and are the most frequently associated with CTX-M-class beta-lactamase alleles, especially blaCTX-M-15. These alleles are responsible for extended-spectrum beta-lactamase (ESBL) production and confer resistance to third-generation cephalosporins. CTX-M-type enzymes are currently recognised as the most common ESBL type globally2.

The transmission dynamics of E. coli ST131 driving its global success remain poorly understood3. Cross-sectional epidemiological studies suggested that E. coli ST131 acquisition often occurs through person-to-person transmission, particularly within the community rather than healthcare settings4,5,6. However, there is limited understanding of the reservoirs and drivers of E. coli ST131 transmission in terms of demographic characteristics and community contact patterns. Delineating the transmission dynamics of E. coli ST131 is crucial in designing infection control and public health measures to curb its associated morbidity and mortality.

In this prospective cohort study, we adopted a longitudinal, One Health sampling approach to identify the transmission pathways of E. coli ST131 as a gut-colonising strain in households. We quantified the carriage duration and acquisition risks of E. coli ST131 via Markov models and mapped transmission events using both epidemiological and genomic data. The overall aim of this study was to clarify the transmission dynamics associated with asymptomatic E. coli ST131 carriage and thereby identify key targets for control interventions.

Results

From February 2017 to November 2018, 135 participants from 34 households were enrolled (Fig. 1). The 34 index patients were initially identified from 355 consecutive hospitalised patients who had E. coli extraintestinal infections. Seventeen of these index patients were previously infected with E. coli ST131 and the other 17 were previously infected with other E. coli STs (Table 1). None of the participants developed symptomatic E. coli infection during the study. By end-of-study, 124 participants provided at least one stool sample. One index patient did not provide any stool samples, but their household coresidents did.

Fig. 1: Study enrolment flowchart.

Table 1 Participant characteristics and prior E. coli ST131 carriage status

The median age at enrolment was 38.16 years (interquartile range (IQR) 23.88–62.03), with 16 participants (12.90%) under the age of two. Seventy-four participants (59.68%) were female. The median household size was four (IQR 3–5). Fifty-one participants (41.13%) had chronic diseases, including cancer, diabetes, or heart disease. Forty-eight (38.71%) took antibiotics within the six months prior to enrolment. The majority (87.10%) spent more than 30 hours at home per week.

The participants provided stool samples over a median of 141.5 (IQR 86.5–220.0) days. In total, 601 human stool samples, 35 companion animal stool samples (from four cats and two birds), 127 environmental swabs and 73 food samples were collected. The median number of stool samples per human study participant was five (IQR 3 to 6), collected at a median of 28-day intervals (IQR 19 to 50).

We recovered 6,559 E. coli isolates, of which 6,273 (95.64%) were from humans, 157 (2.39%) from domestic animals, 44 (0.67%) from food, and 81 (1.23%) from the environment. Of these, 514 were identified as E. coli ST131 using PCR (498, 96.89% from humans; three, 0.58% from domestic animals; two, 0.39% from food; 11, 2.14% from the environment) (Supplementary Fig. 1). Since the vast majority of the E. coli ST131 isolates were recovered from humans, with toilets the next most common source (10/11 of the environmental-source E. coli ST131 isolates), we focused on human-derived isolates for subsequent whole genome sequencing (WGS) and analyses.

For human-derived isolates, WGS confirmed 338 E. coli as ST131 by in silico MLST typing. Clade counts are reported for 328 of these, which is the denominator for the percentages that follow: 135 (41.16%) were clade A, 23 (7.01%) were clade B, and 170 (51.83%) were clade C (including 122 in subclade C1-M27 (71.76% of clade C) and 48 in subclade C2 (28.24% of clade C)). These 338 isolates served as input for downstream modelling and genomic analyses. Coresidents were more likely to carry clade A than index patients: 83% of E. coli ST131 isolates from coresidents were clade A, versus 53% of E. coli ST131 isolates from index patients (p < 0.001). Of the 206 E. coli ST131 isolates derived from persistent carriers (defined as individuals with at least two sequential E. coli ST131 positive stool samples), 86 (41.75%) were clade A and 108 (52.43%) were clade C. The remainder (12 isolates, 5.83%) were clade B.

In the 212 human-source E. coli ST131 isolates subjected to antibiotic susceptibility testing, resistance to more than two classes of antibiotics was greater among E. coli ST131 clades A (71.26%, 62/87) and C (75.93%, 82/108) compared to clade B (23.53%, 4/17). None were resistant to carbapenems (Supplementary Fig. 2). Of the 212 isolates tested, 126 came from persistent carriers. In this subset, 74.60% (94/126) of isolates demonstrated resistance to more than two classes of antibiotic.

Asymptomatic E. coli ST131 carriage amongst index patients

Amongst the 34 index patients, 22 had urinary tract infections while the other 12 had bloodstream infections. The proportions of E. coli ST131 carriers amongst those infected with E. coli ST131 and those infected with other E. coli STs were similar (6/16, 37.50% versus 6/17, 35.29%, respectively). However, E. coli ST131 carriers amongst index patients previously infected with E. coli ST131 had a longer carriage duration than those previously infected with other E. coli STs (median duration 59.71 days, 80% CrI 33.92 to 111.83, versus 16.97 days, 80% CrI 3.49 to 45.52, Supplementary Table 1). In addition, the density of E. coli ST131 in each stool sample was higher in index patients who were infected with E. coli ST131 than in those infected with other E. coli STs (0.69 versus 0.27, p = 0.007 by independent samples t-test).

Duration and density of E. coli ST131 carriage in the overall study population

The majority of households (61.76%, 21/34) had at least one E. coli ST131 carrier (Table 1). The overall prevalence of E. coli ST131 carriage in individuals was 33.06% (41/124). Similar proportions of E. coli ST131 carriers were found in households with an index patient previously infected with E. coli ST131 and those with an index patient previously infected with other E. coli STs. However, E. coli ST131 carriers living with an index patient with previous E. coli ST131 infection had more stool samples positive for E. coli ST131 during the study period and also carried E. coli ST131 at higher densities (Table 1).

Of the E. coli ST131 carriers, the majority (78.05%, 32/41) carried E. coli ST131 intermittently, with one or more non-consecutive positive samples. Nine E. coli ST131 carriers persistently carried E. coli ST131 for at least two consecutive time points (Supplementary Table 2). These persistent carriers had a median carriage duration of 86.35 days (80% CrI 30.03 to 187.80), whereas intermittent carriers had a median carriage duration of 2.26 days (80% CrI 0.52 to 6.00) (Fig. 2A). The mean density of E. coli ST131 in positive stool samples collected from the persistent carriers was 57.79%, compared with 36.06% among intermittent carriers (Fig. 3). When stratified by clade, median carriage duration for clade A was 35.33 days (80% CrI 23.81 to 52.47), longer than for clade C, which was 21.55 days (80% CrI 14.14 to 33.28) (Fig. 2B).

Fig. 2: Estimates obtained for daily probability of acquisition and median duration of carriage of E. coli ST131.

Estimates were calculated by (A) epidemiological participant subgroups, defined as (i) persistent carriers, participants with at least two sequential samples positive for E. coli ST131 (n = 531 isolates), (ii) intermittent carriers, those with no more than one sequential positive sample, including those with multiple non-sequential positives, coresident with a persistent carrier (n = 1332 isolates), and (iii) all other intermittent carriers (n = 4410 isolates); and (B) E. coli ST131 clades A (n = 135 isolates) and C (n = 170 isolates), as identified by WGS. Clade B was not included in the analysis due to small sample size (<25 isolates obtained). All estimates were obtained using a two-state Markov model. Results are presented as median estimates with 80% credible intervals (CrI).

Fig. 3: Number of E. coli ST131 (n = 338 isolates) and non-ST131 E. coli isolates (n = 5935 isolates) identified per participant during each sampling timepoint (1-12).

E. coli ST131 isolates derived from persistent carriers (n = 206 isolates), defined by having two or more consecutive stool samples which recovered E. coli ST131, are highlighted in blue. E. coli ST131 isolates derived from intermittent carriers are highlighted in black. All non-ST131 E. coli isolates are plotted in grey. Numbers in the top left corner of each box denote the household.

Risk of E. coli ST131 carriage and acquisition

None of the epidemiological risk factors, including age, sex, time spent at home per week, comorbidities, incontinence, medical devices including urinary catheter, antibiotic intake in the past six months, and hospital stay in the past year, were associated with E. coli ST131 carriage in both multivariable and univariable regression analyses (Supplementary Fig. 3A). Similarly, there were no demographic or clinical risk factors associated with the status of being a persistent carrier (Supplementary Fig. 3B). Additionally, we examined the virulence profile of representative E. coli ST131 clones and found no significant association, in either the number of virulence genetic elements or their composition, with the bacteria’s persistent carriage status (Supplementary Fig. 4).

Daily probabilities of acquiring E. coli ST131 carriage amongst persistent carriers and coresidents of persistent carriers were 3.92% (80% CrI 2.01 to 11.11%) and 3.13% (80% CrI 1.22% to 10.14%), respectively. Amongst those who did not live with persistent carriers, daily acquisition probability was lower at 1.57% (80% CrI 0.65 to 5.19%) (Fig. 2A). By clade, daily probability of acquisition was 0.16% (80% CrI 0.11 to 0.24%) and 0.26% (80% CrI 0.17 to 0.40%) for E. coli ST131 clades A and C, respectively (Fig. 2B).

Sensitivity analyses found similar results when applying a higher cutoff for persistent carrier status as detection of E. coli ST131 at three or more consecutive time points (Supplementary Fig. 5). For persistent carriers, median duration of carriage and probability of acquisition were 135.81 days (80% CrI: 23.65 to 566.68) and 3.50% (80% CrI: 1.20 to 13.30%), respectively. Among coresidents of persistent carriers, median duration of carriage and probability of acquisition were 4.09 days (80% CrI: 0.74 to 17.92) and 2.05% (80% CrI: 0.63 to 7.62%). Among those who did not live with a persistent carrier, median duration of carriage and probability of acquisition were 10.13 days (80% CrI: 1.73 to 19.98) and 0.86% (80% CrI: 0.05 to 2.92%).

We also evaluated the impact of false negatives on estimated carriage duration. We first estimated carriage duration in the overall study population assuming no E. coli ST131 positive samples were missed during culturing or PCR screening. Then, we estimated carriage duration assuming up to 20% of E. coli ST131 positive samples were missed during screening (false negatives). Baseline carriage duration was 27.52 days (95% CI: 17.37 to 43.59). Assuming up to 20% false negatives, carriage duration was 58.45 days (95% CI: 25.71 to 132.88) (Supplementary Fig. 6).

Phylogenetic analysis for household transmissions of E. coli ST131

Phylogenetic grouping showed that each defined phylogenetic cluster (PC) consisted of genomes derived from a single household (Fig. 4A). However, six households (households 3, 8, 9, 20, 21, 32 in Fig. 4A) had more than one PC present across different clades including A, B, C1, and C2, highlighting the diverse E. coli ST131 populations in these households. Inter-individual genetic distance of E. coli ST131 differed markedly between within-household and between-household comparisons. The median within-household pairwise single nucleotide polymorphism (SNP) distance was 3 (IQR 2 to 11), while the between-household equivalent was 5287 (IQR 258 to 5424) (Fig. 4B). By clade, the median between-household SNP distances of E. coli ST131 isolates were 136 (clade A, IQR 126 to 213), 688.5 (clade B, IQR 687 to 878), 40.5 (clade C1, IQR 26 to 51), and 292 (clade C2, IQR 281 to 359) (Supplementary Fig. 7).
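The within-household versus between-household comparison can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (SNP differences here were derived from a SKA2 split k-mer alignment); it simply counts mismatches between already-aligned sequences. The function names, the toy sequences, and the participant labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from statistics import median


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count aligned positions that differ, skipping ambiguous or gap characters."""
    return sum(
        1
        for x, y in zip(seq_a, seq_b)
        if x != y and x not in ignore and y not in ignore
    )


def household_distances(isolates):
    """Split inter-individual pairwise SNP distances into within- and between-household.

    isolates: list of (participant_id, household_id, aligned_sequence) tuples.
    Pairs from the same participant are skipped, since the comparison is inter-individual.
    """
    within, between = [], []
    for (p1, h1, s1), (p2, h2, s2) in combinations(isolates, 2):
        if p1 == p2:
            continue
        target = within if h1 == h2 else between
        target.append(snp_distance(s1, s2))
    return within, between


# Toy aligned sequences (hypothetical, far shorter than a real SNP alignment).
toy_isolates = [
    ("2A", "2", "ACGTACGT"),
    ("2B", "2", "ACGTACGA"),
    ("8A", "8", "TCGAACGG"),
    ("8B", "8", "TCGAACGG"),
]

within, between = household_distances(toy_isolates)
print("median within-household distance:", median(within))
print("median between-household distance:", median(between))
```

With real data, the sequences would be rows of the SNP alignment and the household labels would come from study metadata.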

Fig. 4: Phylogeny of E. coli ST131 isolates.

A Maximum likelihood phylogeny of 320 E. coli ST131 genomes sequenced for this study. E. coli ST131’s defined clade (A, B, C1, C2) was annotated to the phylogeny, based on E. coli ST131Typer output. The phylogeny is rooted using the basal clade A. Branches are coloured in accordance to bootstrap values, from low (light pink) to high (black). The rings present metadata associated with each taxon, in the following order (from inner to outermost): (1) household identity, (2) households with persistent carrier(s), and (3) presence of blaCTX-M genes. Households with only one member yielding positive culture for E. coli ST131 (ID: 1, 4, 10, 12, 15, 23, 28) were masked and coloured grey. B Inter-participant pairwise mutational SNP (single nucleotide polymorphism) distance of E. coli ST131 genomes, classified by household membership of the examined pair (within and inter-household). The rectangular dashed box (lower panel) serves as a zoom-in subplot for part of the upper panel. C Genomic variation of within-household E. coli ST131 in four households (2, 8, 11, 34) with persistent carriers. The upper panels detail the sampling schedule for each household (in days), with members named A to F. Stool samples positive and negative for E. coli ST131 (via WGS results) are coloured in red and grey, respectively. Household members who have at least two consecutive E. coli ST131 positive stool samples (persistent carriers) have names highlighted in orange, with positive duration denoted as red rectangular box. The lower panels provide the phylogenetic inferences of E. coli ST131 isolates recovered within each respective household. Phylogenies were constructed using a parsimony approach (see Methods), with branch lengths denoting estimated single nucleotide polymorphisms (SNPs; see scale bars). Tree tips are coloured according to the source of isolation (household members).

Within-household sharing of E. coli ST131, defined as presence of the same E. coli ST131 clade in at least two residents, was detected in 8/21 (38.10%) households. Five of these households also had at least one persistent carrier, and four of them yielded at least 20 E. coli ST131 genomes for high-resolution phylogenetic inferences (households 2, 8, 11, 34; Fig. 4C). Notably, E. coli ST131 from persistent carriers had minimal SNP distances to sequenced ST131 isolates from within-household coresidents (Fig. 4C). In two households (households 11 and 34), persistent carriers (participants 11D and 34A in Fig. 4C) had higher E. coli ST131 genomic diversity than their coresident carriers. This evidence suggested that persistent carriers were the source of E. coli ST131 dissemination in their respective households.

Discussion

In this prospective cohort study of asymptomatic E. coli ST131 carriage in households, we found two distinct patterns of carriage: persistent, high-density carriage and intermittent, low-density carriage. Individuals living with persistent, high-density carriers acquired E. coli ST131 at twice the rate of those who were not coresident with a persistent carrier. Genomic evidence showed that these persistent carriers shared genetically similar E. coli ST131 isolates with their household members and harboured greater E. coli ST131 diversity. This indicated that the persistent carriers were the likely source of E. coli ST131 in their respective households, with coresident colonisations resulting from transmissions of a single strain from the source persistent carriers.

Our findings carry important implications for controlling the spread of E. coli ST131, identifying that a subset of individuals may carry E. coli ST131 asymptomatically for longer-than-average durations and act as a reservoir for spread to close contacts. These persistent carriers may be suitable candidates for infection prevention and control measures for E. coli ST131 in the community setting, such as the Extraintestinal Pathogenic E. coli 9-valent vaccine currently undergoing phase three clinical trial7, or other novel decolonisation strategies such as phage therapy or faecal transplant. Given the infeasibility of finding these asymptomatic reservoirs in the community, individuals with a history of E. coli ST131 infection may be priority candidates for these prevention measures given that they were more likely to be persistent carriers.

The importance of persistent asymptomatic carriers as drivers of pathogenic Gram-negative bacteria transmission has been well-described in Salmonella spp.8,9 and Vibrio cholerae. However, unlike Salmonella spp. and V. cholerae, E. coli is a common gut commensal, and many serotypes can be carried concurrently, making it challenging to identify and trace its transmission. Our systematic and longitudinal approach, which included up to 12 time points at two- to six-week intervals, provided high-resolution data and allowed application of multistate Markov models to estimate carriage duration and inter-individual transmission rates. Furthermore, we obtained up to 12 individual colonies per sample, which allowed us to ascertain the density of carriage. This wealth of isolates per sample also permitted the assessment of genomic diversity within individuals and households, demonstrating the extent of strain sharing in some households with persistent carriers.

We acknowledge several limitations in our study. Despite collecting up to 12 E. coli colonies per sample, microbiology cultures and PCR may have missed E. coli ST131 isolates. This would have led to an underestimation of carriage duration, transmission rates, and number of persistent, high-density carriers. In addition, there were three individuals in this study (2C, 17B, and 21A) with two non-sequential positive samples on either side of a negative sample. Mean SNP distances between isolates sampled at the first and third time points were estimated as 1, 1.3, and 1 for these individuals, respectively. While it was probable that, in these cases, the middle samples were false negatives, we could not confidently identify other false negative samples where there were no subsequent positive E. coli ST131 isolates. Hence, we assessed the potential impact of false negatives using a Hidden Markov Model. This model found that estimates under assumptions of perfect observation fell within the credible intervals of estimates obtained assuming that between 1% and 20% of negative samples were true positives.

Given that we only identified nine persistent carriers, our assessment of risk factors associated with persistent carrier status may have been underpowered due to small sample size. Further, for estimation of carriage duration and acquisition probability, our division of epidemiological subgroups into persistent carriers (n = 9 individuals), coresidents (n = 26 individuals), and non-coresidents (n = 89 individuals) resulted in smaller group sample sizes and thus wide credible intervals. We mitigated this uncertainty by using Bayesian methods to estimate and report the distribution of probable values. The definition of at least two sequential E. coli ST131 positive samples for persistent carriage was also arbitrary. The description of persistent carriage was meant to illustrate heterogeneity in carriage duration and risk of transmission to coresidents, and not to impose strict definitions on carrier types. Nonetheless, we performed sensitivity analyses using three sequential time points as the lower bound for defining persistent carriage, which produced similar estimates. Lastly, despite intense within-individual sampling, the limited genomic variation rendered transmission directionality inconclusive in many cases.

In conclusion, asymptomatic persistent carriers are potential reservoirs for E. coli ST131 spread to household contacts in the community setting. Further investigation to identify the risk factors for these persistent carriers will be critical, as these groups represent a potential infection prevention and control target for mitigating the spread of E. coli ST131, a highly prevalent and antimicrobial resistant strain in the community.

Methods

Participant enrolment and follow-up

From February 2017 to November 2018, we screened and sought consent from patients of all ages admitted to the National University Hospital, Singapore, whose urine or blood cultures performed for clinical indications were positive for E. coli. Enrolled patients were classified as index patients infected with E. coli ST131 or other STs using PCR performed on their respective clinical E. coli isolates. From each of these two groups, index patients were randomly selected and consented to participate in a prospective cohort study alongside their household coresidents. Informed consent was sought from individuals according to ICH Harmonised Good Clinical Practice (ICH E6 (R2)) guidelines:

  1) For individuals aged 21 and above with capacity, informed consent was sought.

  2) For minors aged between 12 and 20:

    a. with understanding: informed consent from minors and guardians was sought.

    b. unable to understand: an assent form from minors and an informed consent form from guardians/parents were sought.

  3) For minors aged 6–11, an assent form from minors and informed consent from guardians/parents were sought.

  4) For children aged <6, informed consent from guardians/parents was sought.

The sample size for the number of households was calculated using a two-sided Z-test using the expected proportion of E. coli ST131 carriers in the households with an index patient previously infected with E. coli ST131 (50%) and those with an index patient previously infected with other STs (25%). Given 80% power and 5% type I error, 116 participants were required in total, or 30 households assuming an average of four members in each household.
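As a worked check of this calculation, the sketch below applies a common textbook sample-size formula for a two-sided Z-test comparing two proportions (pooled variance under the null hypothesis). Whether this exact variant was the one used in the study is an assumption; the function name `n_per_group` is hypothetical. With the stated inputs it reproduces the total of 116 participants.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided Z-test of two proportions.

    Uses the pooled-variance form; this choice of formula is an assumption.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for type I error
    z_beta = NormalDist().inv_cdf(power)           # quantile corresponding to power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Expected carriage proportions stated in the text: 50% versus 25%.
per_group = n_per_group(0.50, 0.25)
total_participants = 2 * per_group
print(per_group, total_participants)  # prints: 58 116
```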

Study participants completed a self-reported baseline questionnaire on demographics, lifestyle, familial relationships and interactions, dietary habits, medical history, and travel history. Sex rather than gender was recorded, as this covariate was identified as a potential risk factor for E. coli ST131 urinary tract infection during literature review.

Home visits were conducted at two to six-weekly intervals, during which participants and their companion animals (if applicable) provided stool samples. The period of participation for each study participant was from consent to whenever they chose to end their participation or until twelve sampling time points had been reached. Environmental and food sampling was performed during the first home visit. Environmental samples included swabs of high-touch surfaces, such as kitchen counters, doorknobs, and toilet bowls using standard methods10. Food samples were taken from both raw and cooked meat and vegetables. Study participants contacted the study team as soon as stool samples were collected, and the study team retrieved these samples within the same day. Stool samples were stored in a 0–4 °C freezer between the time of sample collection and DNA extraction in both the participants’ home and the research facility. All DNA extractions were performed within 96 hours of collection. Ethics approval was obtained from the National Healthcare Group Domain Specific Review Board (Reference: 2016/00998).

Identification of E. coli ST131 from clinical, stool, food and environmental samples

All samples were cultured on UriSelect™ 4 medium, a chromogenic medium used for E. coli detection and isolation. Up to 12 colonies per sample were randomly picked from the initial culture plates. Species identifications were confirmed using matrix-assisted laser desorption/ionisation time-of-flight (MALDI-TOF) mass spectrometry.

E. coli isolates were tested for ST131 via PCR targeting the putative manganese transferase gene (mntH)11 using a primer sequence customised from Integrated DNA Technologies IDT (310 bp with forward: GACTGCATTTCGTCGCCATA and reverse: CCGGCGGCATCATAATGAAA). PCR was performed using the CFX96 Touch™ Real-Time PCR Detection System, with the iTaq Universal SYBR Green Supermix (Bio-Rad Laboratories, Inc; catalogue no: 1725121) and cycling conditions according to previously validated protocol11. E. coli isolates identified as ST131 by mntH PCR and sourced from human stool samples (N = 472) were subjected to whole genome sequencing (WGS). WGS failed for 20 isolates, leaving 452 genomes for downstream analyses. Subsequently in this manuscript, human-source E. coli ST131 isolates referred to are those confirmed with WGS.

Antibiotic susceptibility testing was performed on 212 of the 338 WGS-confirmed human-source E. coli ST131 isolates using the broth microdilution method. The minimum inhibitory concentrations (MICs) of 33 antimicrobials were determined by using MicroScan Neg MIC Panel Type 40 (Beckman Coulter, Inc., Brea, CA, USA), in accordance with the manufacturer’s instructions. E. coli ATCC 25922 was included as a quality control. The isolates were then classified as susceptible, intermediate, and resistant, based on the interpretation of MIC results in accordance with EUCAST guideline (2015)12. As no EUCAST interpretation was available for tetracycline, trimethoprim and cefoxitin, classifications on these antimicrobials followed the CLSI guideline (2013)13. The isolates resistant to ceftriaxone were further subjected to double-disk synergy test to confirm ESBL-production using three cephalosporins including ceftriaxone (CRO; 30 μg), ceftazidime (CAZ; 30 μg), and cefotaxime (CTX; 30 μg) (Oxoid, UK), as well as of amoxicillin/clavulanic acid (AMC; 20/10 μg) (Thermo Fisher Scientific, USA).

Statistical analysis and mathematical modelling

Descriptive and statistical analyses were performed using the following definitions. A sample was considered positive for E. coli ST131 if at least one of the E. coli isolates recovered from the sample was confirmed to be E. coli ST131. The density of E. coli ST131 for each sample was calculated using the number of E. coli ST131 isolates divided by the total number of E. coli isolates recovered from the sample. A participant was considered a carrier of E. coli ST131 if any of their stool samples were positive for E. coli ST131. Participants with at least two sequential E. coli ST131 positive samples were considered persistent carriers. All other carriers, i.e. those who had one isolated positive sample or had more than one positive sample but interspersed between negative samples, were considered intermittent carriers. We performed sensitivity analysis by defining persistent carriers as those who carried E. coli ST131 for at least three sequential time points and intermittent carriers as those with no more than two sequential positive stool samples.
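These definitions can be expressed as a short Python sketch. The function names and the example series are hypothetical; the `min_run` parameter is an illustrative way of switching between the primary definition (two sequential positives) and the sensitivity-analysis definition (three sequential positives).

```python
def sample_density(n_st131, n_total):
    """Density of ST131 in one sample: ST131 isolates / all E. coli isolates recovered."""
    return n_st131 / n_total if n_total else 0.0


def is_positive(n_st131):
    """A sample is positive if at least one recovered isolate is ST131."""
    return n_st131 >= 1


def classify_carrier(positives, min_run=2):
    """Classify a participant from their samples, listed in sampling order.

    positives: list of booleans, one per stool sample.
    min_run: number of sequential positive samples required for persistent carriage.
    """
    if not any(positives):
        return "non-carrier"
    longest = run = 0
    for positive in positives:
        run = run + 1 if positive else 0  # length of the current streak of positives
        longest = max(longest, run)
    return "persistent" if longest >= min_run else "intermittent"


# Hypothetical participant: ST131 counts out of 12 colonies picked per sample.
st131_counts = [0, 7, 9, 0, 1]
series = [is_positive(n) for n in st131_counts]
print(classify_carrier(series))             # primary definition
print(classify_carrier(series, min_run=3))  # sensitivity-analysis definition
print([round(sample_density(n, 12), 2) for n in st131_counts])
```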

To identify risk factors associated with E. coli ST131 carriage and with persistent carrier status, univariable and multivariable logistic regression analyses were conducted. Independent variables assessed in these regressions were selected following literature review, and included sex, age, time spent at home, comorbidities (including chronic metabolic conditions, major organ failures and cancer), urinary incontinence, use of medical devices, antibiotic use in the past six months, and hospital admittance within the past year14,15,16,17,18. We also included carriage status subgroup (persistent carriers, intermittent carriers who were coresident with a persistent carrier, all other intermittent carriers, non-carriers) as a categorical dependent variable.

Given that samples were collected in two-to-six-week intervals, we employed a multistate Markov modelling approach which allowed us to estimate the per-day decolonisation probability while accounting for interval censoring. This approach offered an advantage over conventional calculation methods which quantify carriage duration as the difference between first observed and last observed positive sample by taking into consideration decolonisation events which could have taken place at unsampled time points.

A two-state Markov model was constructed to assess transition rates between non-carrier and carrier states for individual participants. We chose not to model carriage at the household level, given that household sizes in our study population ranged from two to eight members, and analysis would necessitate subsetting households by size, resulting in small sample sizes for each group with weak statistical power. In this model, the transition of individuals between non-carrier and carrier states was assumed to be a Markov process, in which the probability of a particular carrier status at the next time point depends only upon carrier status at the current time point and not on the status at earlier time points. The transition from one state to another was governed by an intensity matrix Q (K \(\times\) K, where K is the number of states, i.e. two). For every individual, the transition intensities between states were modelled by the equations:

$$\log \left({q}_{12}\right)={\beta }_{0}+{\sum }_{i=1}^{D}{\beta }_{i}\times {X}_{i}$$
(1)
$$\log \left({q}_{21}\right)={\beta }_{0}+{\sum }_{i=1}^{D}{\beta }_{i}\times {X}_{i}$$
(2)

in which q12 and q21 represented the intensities (instantaneous rates) of transitioning from non-carrier to carrier state and from carrier to non-carrier state, respectively. \({\beta }_{0}\) and \({\beta }_{i}\) represented the sampled parameters (intercept and covariate coefficients) for the corresponding transition, D represented the number of covariates, and Xi represented covariate i in the set of covariates X1, X2, … XD.

Models were constructed in Stan19 to sample from the posterior distribution of the parameters q12 and q21. From these values, the median rate of acquisition of E. coli ST131 per day and the median carriage duration (1 / mean rate of transition from carrier to non-carrier state per day) were derived using the above formulae. Model fits were assessed using the Hamiltonian Monte Carlo diagnostics function in RStan20, Gelman-Rubin convergence diagnostics, and through the visualisation of trace plots and sample density plots.
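To illustrate how a two-state continuous-time Markov model handles irregular sampling intervals, the sketch below uses the standard closed-form solution of P(t) = exp(Qt) for a two-state chain with intensities q12 and q21. It is a didactic Python example, not the Stan model fitted in this study; the function names and the numerical values of q12 and q21 are hypothetical.

```python
from math import exp, log


def intensity(beta0, betas, covariates):
    """Log-linear intensity: log(q) = beta0 + sum_i beta_i * X_i."""
    return exp(beta0 + sum(b * x for b, x in zip(betas, covariates)))


def transition_probs(q12, q21, t):
    """Transition probability matrix over t days for Q = [[-q12, q12], [q21, -q21]].

    State 1 is non-carrier and state 2 is carrier. Unobserved transitions within
    the interval are accounted for, which is how interval censoring is handled.
    """
    total = q12 + q21
    decay = exp(-total * t)
    p12 = q12 / total * (1 - decay)  # non-carrier at day 0, carrier at day t
    p21 = q21 / total * (1 - decay)  # carrier at day 0, non-carrier at day t
    return [[1 - p12, p12], [p21, 1 - p21]]


def mean_carriage_duration(q21):
    """Expected time spent in the carrier state, 1 / q21 (in days)."""
    return 1 / q21


# Hypothetical intensities per day, with no covariates.
q12 = intensity(log(0.02), [], [])
q21 = intensity(log(0.05), [], [])

for interval in (14, 28, 42):  # two-, four- and six-week sampling intervals
    P = transition_probs(q12, q21, interval)
    print(interval, [round(p, 3) for row in P for p in row])
print(mean_carriage_duration(q21))
```

In a Bayesian fit, these quantities would be computed for every posterior draw of q12 and q21, and the median and credible interval taken over the draws.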

The model was fit to each carriage status subgroup to explore differences in carriage and acquisition among these groups.

The two-state Markov model was supplemented by a Hidden Markov Model, which incorporated the potential for false negative PCR and culturing when screening for E. coli ST131. These false negatives would have resulted in E. coli ST131 isolates not being eventually sequenced and confirmed via WGS. In this model, we estimated carriage duration with underlying false negative rates of up to 20% (Supplementary Fig. 6).

Statistical analyses were conducted with R version 4.3.3, using RStan (v2.32.6)20 (for the primary Markov model) and the msm package (v.1.7.1) (for the Hidden Markov Model)21.

Genomic and phylogenetic analysis

The following bioinformatic analyses used default settings unless stated otherwise. For all analysis tools used in this manuscript, we have appended the version number and included specific parameters where applicable. Low-quality bases and adaptors in the sequencing reads were removed using Trimmomatic (v0.32)22. The trimmed reads were then assembled using SPAdes (v3.11.1)23, and assemblies were subsequently subjected to species identification using Mash Screen (v2.3)24. MLST was determined by running the ‘mlst’ package (https://github.com/tseemann/mlst) on assemblies, showing that 338/452 successfully sequenced isolates belonged to E. coli ST131. E. coli ST131 assemblies were further subjected to ST131Typer to infer the O:H type, fimH allele and clade membership25. Virulence genes, antibiotic resistance genes and point mutations were identified in the assembled genomes using AMRFinderPlus (v3.11.18) with its own database (version 2023-04-17.1)26.

For downstream genomic analyses, assembly quality was assessed using assembly-stats v1.0.1, which led to the removal of 10 genomes with contamination (assembly sizes > 6.4 Mbp and exceeding 501 contigs). The cut-off ‘501’ refers to the highest number of contigs found in the remaining 328 assemblies, while the 10 removed genomes showed high genome fragmentation (>6000 contigs each; Supplementary Table 3). The remaining E. coli ST131 genomes (n = 328) had overall good assembly quality, with mean completeness >99% and contamination rate of 1.5% (as assessed by CheckM2)27, N50 of 178,879 bp, and contig count of 244 (range: 107–501). Prokka v1.14.6 was used to annotate the assemblies and two E. coli ST131 chromosome references (NCTC13441 and EC958)28. A reference-free pseudo-alignment was produced for all 330 E. coli ST131 genomes using the split k-mer approach implemented in SKA2 (‘align’ command with the following parameters: -m 0.95 --filter-ambig-as-missing --filter no-ambig-or-const --no-gap-only-sites --ambig-mask)29. This allowed for the inclusion of both coding and non-coding regions, capturing a more complete spectrum of genetic variation from which to compute SNP differences. This generated an alignment of 4,207,040 bp. A maximum-likelihood phylogeny was constructed with RAxML-NG v1.2.2 with 1,000 bootstrap replicates (GTR + G + ASC_LEWIS model, ascertainment bias correction) using the SNP-only alignment (alignment length: 8,678 bp)30. While comparing phylogenetic grouping and family information, we noticed that several genomes (n = 8) were interspersed within specific family-restricted phylogenetic clusters. These were usually lone genomes (1–2 isolates situated outside their respective family-phylogenetic cluster) and differed by 0–3 SNPs from their nearest sister taxa on the phylogeny (of a different family). We reasoned that this observation most likely stemmed from mislabelling of isolates during sample handling and processing, and did not represent a true biological signal. Therefore, we removed these outlying genomes from downstream analyses to avoid misleading interpretation. Pairwise SNP distances between isolates were calculated using the snp-dists programme (https://github.com/tseemann/snp-dists).
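The pairwise SNP distance calculation can be illustrated with a short sketch that counts differing positions between aligned sequences. It assumes that positions where either sequence carries a character other than A, C, G or T are ignored; the isolate names and sequences are hypothetical, and this is not the snp-dists source code.

```python
from itertools import combinations

UNAMBIGUOUS = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count aligned positions where both bases are A/C/G/T and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in UNAMBIGUOUS and b in UNAMBIGUOUS and a != b
    )


def pairwise_distances(alignment):
    """Map each unordered pair of isolate names to its SNP distance."""
    return {
        (name_a, name_b): snp_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment of three isolates.
toy_alignment = {
    "isolate_1": "ACGTACGT",
    "isolate_2": "ACGTACGA",
    "isolate_3": "NCGAACGA",
}
distances = pairwise_distances(toy_alignment)
```

A distance matrix of this kind is what allows isolates from different families that differ by only 0–3 SNPs to be flagged for review.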

We further investigated the transmission of E. coli ST131 in households using a phylogenetic approach. We selected households which (1) had at least one persistent carrier, (2) showed presumptive evidence of inter-member transmission (at least two members sharing the same clade of E. coli ST131), and (3) provided >20 E. coli ST131 genomes. This resulted in four households with 179 E. coli ST131 genomes. For each genome, paired-end reads were mapped to an E. coli ST131 reference genome (NCTC13441, accession number: NZ_LT632320.1) with bwa-mem v0.7.17, followed by removal of duplicate reads with Picard and indel realignment using GATK v3.7.0. Non-optimal local alignments were removed using samclip (https://github.com/tseemann/samclip) to limit false positives during variant calling. SNPs were identified using the haplotype-based caller FreeBayes v1.3.7, and we used bcftools v1.12 to remove low-quality SNPs, defined as those satisfying any of these criteria: consensus quality <30, mapping quality <30, read depth <4, proportion of reads supporting the SNP at a position (AO/DP) <85%, or coverage of the forward or reverse strand <1. For each isolate, the bcftools ‘consensus’ command was used to generate a pseudogenome sequence, with length equal to that of the reference and incorporating filtered SNPs as well as invariant sites. Regions with low mapping (depth <4) and low-quality SNPs were masked as “N”. This procedure produced four alignments (each corresponding to a household). Regions annotated as transposases and prophages (by PHASTEST web search) were further masked, followed by the removal of invariant sites and sites with at least 80% indeterminate nucleotides, generating SNP-only alignments of variable lengths. For each alignment, a maximum parsimony phylogeny was constructed using the DNApars algorithm implemented in SEAVIEW v5.0.5, with 100 bootstrap replicates. Visualisation of phylogenetic analysis was conducted in R version 4.3.1, using R packages ggplot2 v.3.5.0 and ggtree v1.8.228.
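The SNP filtering criteria and the reduction to a SNP-only alignment can be expressed as simple rules, sketched below. The dictionary keys are hypothetical names rather than VCF field names, strand coverage is assumed to mean reads supporting the SNP on each strand, and the actual filtering was done with bcftools, whose expressions are not reproduced here.

```python
def passes_filters(snp):
    """Return True if a SNP meets none of the exclusion criteria."""
    return (
        snp["qual"] >= 30          # consensus quality
        and snp["mapq"] >= 30      # mapping quality
        and snp["depth"] >= 4      # read depth (DP)
        and snp["alt_obs"] / snp["depth"] >= 0.85  # AO/DP
        and snp["alt_fwd"] >= 1    # supporting reads on forward strand (assumed meaning)
        and snp["alt_rev"] >= 1    # supporting reads on reverse strand (assumed meaning)
    )


def snp_only_columns(sequences, max_missing=0.8):
    """Indices of alignment columns kept in the SNP-only alignment.

    A column is dropped if the fraction of non-ACGT characters is at least
    max_missing, or if the called bases show fewer than two distinct states.
    """
    kept = []
    for col in range(len(sequences[0])):
        bases = [seq[col].upper() for seq in sequences]
        called = [b for b in bases if b in "ACGT"]
        missing_fraction = 1.0 - len(called) / len(bases)
        if missing_fraction >= max_missing:
            continue
        if len(set(called)) < 2:
            continue
        kept.append(col)
    return kept


# Hypothetical SNP record and toy alignment.
example_snp = {"qual": 50, "mapq": 60, "depth": 20,
               "alt_obs": 19, "alt_fwd": 10, "alt_rev": 9}
keep = passes_filters(example_snp)
columns = snp_only_columns(["ACGT", "ACGA", "NCGT"])
```

Because the depth check precedes the AO/DP ratio, records with zero depth are rejected before any division takes place.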

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.