Main

The nasopharynx is the natural habitat of the bacterium Haemophilus influenzae, where it exists in asymptomatic carriage. However, it frequently translocates to other body sites such as the inner ears, lungs and sinuses, causing a range of disease manifestations1. The most common of these is acute otitis media (AOM), which is one of the leading causes of antibiotic prescriptions in children. Global estimates suggest over 700 million AOM cases per annum caused in total by any bacterial pathogen, and a substantial fraction of these lead to further complications and sequelae, particularly in low- and middle-income countries (LMICs)2. After licensing of the polysaccharide-protein conjugate H. influenzae type b (Hib) vaccine in the late 1980s, its adoption into national vaccination programmes worldwide has led to a notable reduction of Hib colonization and its associated invasive disease manifestations, such as meningitis and pneumonia. However, the vaccine does not protect against colonization by other serotypes or unencapsulated non-typeable H. influenzae (NTHi). Therefore, H. influenzae remains a major cause of AOM, sinusitis, conjunctivitis and pneumonia and consequently is an important public health burden globally. A particular concern has arisen from the widespread antibiotic resistance observed in some strains of NTHi and particularly the possibility of multidrug resistance (MDR), as widespread β-lactamase resistance has led to difficulties in treating recurrent infections, particularly in children where alternate classes of antibiotics may not be approved3.

The evidence for NTHi as an important cause of paediatric community-acquired pneumonia (CAP) has been summarized in comprehensive reviews4,5. The determination of aetiology in paediatric CAP remains a challenge, with a minority of cases being bacteraemic. However, specimens obtained via bronchoscopy revealed NTHi to be the dominant bacterial pathogen in 250 Belgian children with recurrent or non-resolving CAP6. The nasopharyngeal colonization by non-Hib/NTHi, especially at higher densities, has also been shown to be associated with paediatric CAP in LMICs7,8.

Whole-genome sequencing (WGS) studies of H. influenzae have been mainly conducted from smaller-scale collections of disease cases9,10 but rarely from large-scale collections of both carriage and disease isolates of the same population. Furthermore, few studies have been conducted in LMICs, where nasopharyngeal pathogen colonization rates and the burden of CAP are generally much higher than in high-income countries. As a consequence, the genetic population structure and evolutionary dynamics of the species remain poorly understood in LMIC settings and at a global scale11.

This motivated us to conduct a longitudinal paediatric cohort study of both healthy colonization and pneumonia among a large birth cohort in a population located in northwestern Thailand, the Maela camp for displaced persons. The densely populated camp is located on the Thailand–Myanmar border and provided a unique opportunity to systematically sample both carriage and disease cases in a pre-Hib vaccine population. Here, we detail results from the WGS of isolates from our cohort and also of further analyses performed on the Maela data combined with all publicly available high-quality H. influenzae genomes with sufficient metadata. This combined collection of 9,849 genomes allowed us to conduct genomic analyses of the species at a global and species-wide scale, providing novel insight into how its high levels of recombination shape its global population structure.

Results

Serotype distribution across the Maela paediatric population

The infants in the Maela cohort carrying H. influenzae (Fig. 1a,b) were predominantly colonized by NTHi, despite lacking immunization against Hib (Table 1). Out of 3,970 isolates that passed the final quality control (QC) filters, 613 (15.4%) were from cases of pneumonia. Of the 3,210 host-deduplicated isolates, 524 were collected from pneumonia cases (16.3%). The counts and estimated frequencies of the six different serotypes and non-typeable (NT) (unencapsulated) isolates are listed both with and without host deduplication in Table 1. Notably, NT isolates made up 91.7% of all isolates, and serotype b isolates were the second most prevalent, making up 5.7% of the population. The remaining five serotypes each accounted for less than 1% of the population. The serotypable isolates generally form monophyletic lineages on the tree (Fig. 2). Two isolates, one NT and one serotype b by agglutination, gave only partial in silico capsule typing results because the entire capsule locus could not be identified in the data. The in silico capsule typing was generally congruent (overall congruence 95.3%) with the agglutination-based phenotypic serotyping (except for serotypes d and e) and was corrected by the latter in those 28 cases where the serological typing indicated a serotype for an NT in silico type. These included serotype a (n = 1), b (n = 19), d (n = 3) and e (n = 5). Our results are reasonably well in line with earlier comparisons between agglutination-based and in silico typing (98–100% congruence)12,13; however, the higher level of discrepancy observed here could be due to the much larger and more diverse set of genomes considered. The population frequencies of serotypes were highly similar between pneumonia and non-pneumonia cases. The distribution of serotypes was as follows. In pneumonia cases: 583 NT (95.1%), 19 serotype b (3.1%), 5 serotype e (0.8%) and 2 (0.3%) each of serotypes c, d and f. In non-pneumonia cases: 3,086 NT (91.9%), 189 serotype b (5.6%), 34 serotype f (1.0%), 26 serotype e (0.8%), 15 serotype a (0.4%), 4 serotype c (0.1%) and 3 serotype d (0.1%) (Table 1).

Fig. 1: Overview of the Maela cohort study design and the global collection of published genomes.

a, The geographical location of the study site. b, The cohort design and sample processing for 999 mother–infant pairs recruited to the study. The nasopharyngeal (NP) swabs were taken at monthly health checks (indicated by black asterisks) and at any timepoint when the infant presented symptoms of clinical pneumonia (indicated by red asterisks). Bottom: the sample numbers at each step of the study. c, The global map coloured by the number of isolates per country of origin in the systematic public collection of global H. influenzae isolates.

Fig. 2: Phylogeny of Maela H. influenzae genomes for 3,970 isolates.

The sample type is indicated by the innermost ring. The phylogeny was estimated using FastTree v.2.1.10 on the core-genome alignment mapped against the H. influenzae reference 86-028NP (NC_007146.2). The 20 largest PopPUNK clusters (>50 isolates) are indicated by coloured dots at the tips of the phylogeny, whereas smaller clusters are in grey. The in silico serotypes (second ring), inferred by using Hicap v.1.0.3, and AMR profiles (eight outer rings), screened with AMRFinderPlus v.4.0.3, are shown by colour as indicated in the legend. An interactive online phylogeny, with additional metadata including cgMLST and cgMLST cluster data, is available at ref. 84.

Table 1 Counts and proportions of isolates of each serotype and of non-typeable isolates in the Maela collection and within pneumonia and non-pneumonia cases

Genetic population structure of H. influenzae in Maela

Core genome multilocus sequence typing (cgMLST) has previously been used to study H. influenzae11, but when applied to the entire global dataset (Fig. 3), neither the typing nor the clustering by allelic profiles provided meaningful insight into the population structure of H. influenzae in the Maela cohort, owing to the excessive diversity of the allelic profiles. Of note, the number of cgMLST allelic profiles and the overall nucleotide diversity are not always strongly correlated, as many very low-frequency mutations can generate a large number of allelic profiles even when nucleotide diversity across the pangenome remains low. The PopPUNK clustering method, which combines information from core and accessory genomic variation, largely identified monophyletic clusters among Maela isolates (Fig. 2). However, the largest cluster contained 349 isolates, with only 13 PopPUNK clusters consisting of at least 100 isolates (out of a total of 122 clusters) and 20 clusters consisting of 50 or more isolates. A large proportion of clusters (50%) contained ten or fewer isolates. The serotypable isolates were found in ten clusters in total (serotypes a, b, c and f each in two clusters, with serotypes d and e each in one cluster). NTHi were the dominant type in both non-pneumonia and pneumonia timepoint samples (Fig. 2), and no particular genetic lineage of NTHi was overrepresented in pneumonia samples (Fisher’s exact test, P = 0.091; Methods).

Fig. 3: Global H. influenzae phylogeny.

The maximum-likelihood core genome phylogeny of 9,849 H. influenzae isolates, combining the Maela cohort and a systematically identified global collection of published genomes. The phylogeny was estimated using IQ-TREE v.2.4.0 on the core genome alignment. The in silico serotypes are indicated by the circles on the tips of the phylogeny; the isolation location is indicated on the outer ring, as shown by colour as indicated in the legend. An interactive online phylogeny, with additional metadata including cgMLST, cgMLST clusters and partial disease state data, is available at ref. 85.

Distribution of AMR determinants

Antimicrobial resistance (AMR) determinants were frequently identified across the phylogeny and strongly associated with MDR lineages (Methods; Fig. 2). Only one of the MDR lineages was clearly associated with serotype b (Fig. 2), and the remainder were NTHi. A total of 41 PopPUNK clusters (Methods) contained at least one MDR isolate, indicating the repeated acquisition of AMR determinants across the population. In the Maela host-deduplicated dataset, most of the MDR isolates (resistance against at least four of nine antibiotic classes; Methods; 507/3,210, 15.8%) were NT (77.3%, 392/507), followed by serotype b (22.3%, 113/507) and two serotype e isolates (0.4%). Hence, serotype b was clearly overrepresented among MDR isolates relative to its overall frequency (4.8%; Table 1), whereas NT isolates were underrepresented (overall frequency 92.7%; Table 1). A high proportion of serotype b isolates were MDR (73.4%, 113/154), compared with NT (13.2%, 392/2,977) and serotype e (7.7%, 2/26) isolates.
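The MDR definition used here (resistance to at least four of nine antibiotic classes) and the per-serotype MDR fractions can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `is_mdr` is hypothetical and the antibiotic class labels are left to the caller, whereas the counts are those quoted in the text for the host-deduplicated Maela dataset.

```python
# Minimal sketch: per-isolate MDR call and per-serotype MDR percentage.
# `is_mdr` is a hypothetical helper; the counts are taken from the text.

MDR_MIN_CLASSES = 4  # "at least four of nine antibiotic classes"


def is_mdr(resistant_classes, min_classes=MDR_MIN_CLASSES):
    """Return True if an isolate is resistant to at least `min_classes` distinct classes."""
    return len(set(resistant_classes)) >= min_classes


# Host-deduplicated counts quoted in the text: (MDR isolates, all isolates).
counts = {"NT": (392, 2977), "b": (113, 154), "e": (2, 26)}

# Percentage of isolates of each serotype that are MDR, rounded to one decimal.
mdr_percent = {
    serotype: round(100 * mdr / total, 1)
    for serotype, (mdr, total) in counts.items()
}

for serotype, pct in mdr_percent.items():
    print(f"serotype {serotype}: {pct}% MDR")
```

The contrast between serotype b and NT isolates comes from dividing by the serotype-specific totals rather than by the total number of MDR isolates.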

Within pneumonia cases (523/3,210), 17.0% (89/523) of isolates were MDR, of which 76 were NT, 12 were serotype b and 1 was serotype e, among 498 NT (95.2%), 15 serotype b (2.9%), 4 serotype e (0.8%), 2 serotype c (0.4%), 2 serotype d (0.4%) and 2 serotype f (0.4%) isolates. Within non-pneumonia cases (2,687/3,210), 15.6% (418/2,687) were MDR, of which 316 were NT, 101 were serotype b and 1 was serotype e, among 2,479 NT (92.3%), 139 serotype b (5.2%), 29 serotype f (1.1%), 22 serotype e (0.8%), 11 serotype a (0.4%), 4 serotype c (0.2%) and 3 serotype d (0.1%) isolates. Therefore, the frequency of the MDR phenotype was highly similar between the two sample types.
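One way to check the similarity of MDR frequency between the two sample types is a two-sided Fisher's exact test on the 2 × 2 table of MDR status by sample type. The sketch below uses only the counts quoted above (89/523 and 418/2,687). It is an illustration under that assumption, not a reproduction of the authors' analysis, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b
    col1 = a + c
    total = a + b + c + d
    denom = comb(total, row1)

    def pmf(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins.
        return comb(col1, x) * comb(total - col1, row1 - x) / denom

    lo = max(0, row1 - (total - col1))
    hi = min(row1, col1)
    probs = [pmf(x) for x in range(lo, hi + 1)]
    p_obs = pmf(a)
    # Sum all tables that are at most as probable as the observed one.
    p_value = sum(p for p in probs if p <= p_obs * (1 + 1e-7))
    return min(p_value, 1.0)


# Counts from the text: MDR and non-MDR isolates in each sample type.
mdr_pneumonia, n_pneumonia = 89, 523
mdr_other, n_other = 418, 2687

a, b = mdr_pneumonia, n_pneumonia - mdr_pneumonia
c, d = mdr_other, n_other - mdr_other

odds_ratio = (a * d) / (b * c)
p_value = fisher_exact_two_sided(a, b, c, d)
print(f"odds ratio = {odds_ratio:.3f}, two-sided P = {p_value:.3f}")
```

An odds ratio close to 1 together with a large P value would be consistent with the statement that the MDR frequency is highly similar in pneumonia and non-pneumonia samples.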

Quantification of homologous recombination

Because the acquisition of AMR determinants is probably aided by horizontal gene transfer in this naturally transforming species, we quantified the extent of homologous recombination. One approach is to map Illumina reads from isolates in the same PopPUNK cluster against long-read reference assemblies, producing whole-genome pseudoalignments as input to single nucleotide polymorphism (SNP)-density-based recombination analysis (Gubbins). This was not feasible for the entire Maela cohort, owing to the size of the dataset and the large number of PopPUNK clusters present.

Consequently, we leveraged the aligned pangenome genes for the 3,970 Maela isolates to perform per-gene recombination inference (Methods). Of the 7,015 genes in the inferred pangenome, at least one recombination event between PopPUNK lineages was identified in 2,672 genes (38%). On average, 193.36 recombination events were identified per gene (including recombination-free genes), and the frequency of recombination events was significantly correlated with the estimated nucleotide diversity per gene (Spearman’s r = 0.49, P < 7.15 × 10⁻²⁹³). A substantial proportion of genes with no detected recombination (64.0%) also had zero nucleotide diversity (Extended Data Fig. 1a). Finally, we also quantified the rate of decay of linkage disequilibrium in the core genome and compared this with several other common bacterial pathogens analysed in ref. 14. This showed that the decoupling of SNPs as a function of base-pair distance happens fastest in H. influenzae (Extended Data Fig. 1b), and the rate is considerably elevated compared with other species known to routinely engage in homologous recombination, such as Campylobacter jejuni and Enterococcus faecalis. Taken together, these results suggest that the H. influenzae population within Maela is extremely recombinant, to the extent that it probably reduces the overall level of diversity within the population.
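The per-gene nucleotide diversity referred to here is the average number of pairwise differences per site, π. A minimal sketch of that calculation for one aligned gene is given below. It assumes equal-length aligned sequences and treats every character, including gaps, as an ordinary state, which is a simplification. The function name is hypothetical and the sketch is not the authors' implementation.

```python
from itertools import combinations


def nucleotide_diversity(aligned_seqs):
    """Average pairwise differences per site (pi) for equal-length aligned sequences."""
    n = len(aligned_seqs)
    if n < 2:
        return 0.0
    length = len(aligned_seqs[0])
    if length == 0 or any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be non-empty and of equal length")

    total_diff = 0
    for s1, s2 in combinations(aligned_seqs, 2):
        # Count mismatching alignment columns for this pair of sequences.
        total_diff += sum(x != y for x, y in zip(s1, s2))

    n_pairs = n * (n - 1) // 2
    return total_diff / (n_pairs * length)


# Toy alignment: pairwise differences are 1, 2 and 1 over 4 sites.
toy_gene = ["AAAA", "AAAT", "AATT"]
print(nucleotide_diversity(toy_gene))
```

A gene whose aligned copies are all identical gives π = 0, which corresponds to the zero-diversity genes described above.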

Population genetic analyses of the global dataset

To understand how the genetic variation observed in the Maela cohort samples relates to internationally circulating H. influenzae, we combined the study data with a systematic collection of all publicly available H. influenzae genome data with basic metadata available (country and year of collection), for a total dataset comprising 9,849 isolates (Methods; Fig. 1c). The PopPUNK clustering of the combined dataset identified 752 monophyletic lineages, with the largest composed of 483 isolates (>99% serotype a). There were 20 clusters with at least 100 isolates and 595 clusters with <10 isolates. Many larger clusters were paraphyletic according to the core genome phylogeny (Fig. 3), whereas the monophyletic lineages mostly corresponded to small or singleton clusters.

The core genome phylogeny of the global collection clearly demonstrates that the Maela isolates are extensively interspersed within the global population of the species, suggesting rapid cross-border and intercontinental transmission. Furthermore, the isolates spanning the sampling window (1962–2023) are distributed across the phylogeny and do not form monophyletic lineages made up of temporally restricted isolates. Together, these patterns strongly suggest that migration within the global H. influenzae population is sufficiently frequent and extensive to erase any phylogeographical signal of the local clonal expansion of lineages.

Given the low level of nucleotide diversity identified in the recombination analysis of the Maela cohort, we further investigated the overall level of nucleotide diversity across the aligned pangenome of the entire global collection, which consisted of 18,265 genes. Although the ranges of nucleotide diversity in core (n = 1,103) and non-singleton accessory (n = 8,843) genes overlap (Fig. 4a), core genes are on average significantly less diverse than accessory genes (two-tailed Mann–Whitney U test, P = 1.205 × 10⁻⁵).
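The comparison of core and accessory gene diversity relies on a two-tailed Mann–Whitney U test. The sketch below shows how such a test can be computed with the normal approximation and a tie correction, without a continuity correction. The function name and the toy diversity values are hypothetical, and the authors' software and settings may differ.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-tailed Mann-Whitney U test (normal approximation with tie correction)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        avg_rank = (i + 1 + j) / 2     # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j

    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u, 1.0  # all values tied: no evidence of a difference
    z = (u - mean_u) / sqrt(var_u)
    p_value = min(1.0, 2 * NormalDist().cdf(z))
    return u, p_value


# Hypothetical per-gene diversity values, for illustration only.
core_pi = [0.001, 0.002, 0.003]
accessory_pi = [0.004, 0.005, 0.006]
print(mann_whitney_u(core_pi, accessory_pi))
```

Because the test is rank-based, it is suited to diversity distributions that overlap and are strongly skewed, as in Fig. 4a.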

Fig. 4: Nucleotide diversity and dN/dS across the H. influenzae pangenome.

a, The box plots of the estimated average pairwise nucleotide diversity, π, in each gene of the aligned pangenome of the combined dataset, split into genes present in 80% or more isolates (core, n = 1,044) and genes in fewer than 80% of isolates (accessory, n = 8,843). The blue hexagons indicate gene-frequency weighted average nucleotide diversity across all genes; the yellow line is the median, the outer edges are the first and third quartiles, and the whiskers are 1.5× the interquartile range beyond those values. The black points indicate outliers, and all data points are plotted in transparent red. b, The log-scaled histogram of the estimated dN/dS values, the ratio of nonsynonymous to synonymous mutations, across 6,853 aligned genes from the pangenome of the combined dataset.

To understand how selective forces may be influencing the diversity observed within the pangenome, we further estimated dN/dS, the ratio of nonsynonymous to synonymous nucleotide mutations within every gene of the pangenome (Methods). Consistent with the low level of diversity observed, the average dN/dS value was 0.28, and 96% of the 6,853 genes for which it was successfully estimated (Methods) had dN/dS <1 (Fig. 4b). This implies that negative selection is widespread across the coding regions of the H. influenzae genome. Of the remaining 256 genes (4%) with a dN/dS estimate, 45 had dN/dS >2, indicating potential positive directional selection. A further analysis of these genes was undertaken using three statistical tests implemented in HYPHY v.2.5.60 (Methods), and a few accessory genes showed extremely strong evidence of selection, with at least two of the three statistical tests rejecting the null hypothesis of neutral evolution (Methods). The genes involved included an unnamed gluconate transporter, the BrnT toxin protein and a third small protein of unknown function. The results of these analyses are illustrated in Extended Data Figs. 2 and 3 and are briefly summarized here. The unnamed gluconate transporter showed statistically significant results in all three HYPHY tests, which identified a branch of the gene phylogeny containing eight isolates and a specific codon (185) in the protein alignment as being under positive selection. This branch consists of seven Maela isolates and one isolate from elsewhere, which all possess a structural variant of the unnamed gluconate transporter with a large deletion of a transmembrane domain. The brnT toxin gene, which encodes the toxin of the BrnT/BrnA type II toxin–antitoxin system15, also showed statistically significant results in all three HYPHY tests, which indicated that a branch of the gene phylogeny consisting of two Maela isolates with a large deletion of an alpha helix had recently been subject to positive selection. Finally, the protein of unknown function showed a statistically significant result in two of the three HYPHY tests, identifying a glutamine–valine variable site, with the valine variant primarily associated with Maela isolates and the glutamine variant primarily associated with isolates from elsewhere. All three of these proteins correspond to low-frequency accessory genes which are globally distributed. Notably, the variants identified as under selection in the brnT toxin and the unnamed gluconate transporter are either unique to or much more prevalent among Maela isolates, with a similar split association between the two variants of the unnamed protein. This suggests either that the intensive longitudinal sampling in Maela gives elevated statistical power to detect selection, or that the circumstances of the camp result in genuinely stronger positive selection and rapid local adaptation.

Finally, to explore the geographic distribution of MDR lineages in greater detail, we focused on the lineages with ≥50 isolates and ≥30% resistance prevalence for at least four of nine antibiotic classes. This revealed that all large MDR lineages (n = 3) are widely disseminated internationally, that is, observed in at least 11 different locations (Fig. 5). One of these lineages was dominated by Hib strains and showed evidence of independent capsule switches to serotype a, whereas the rest were composed of NTHi.
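The selection of large MDR lineages can be sketched as a simple filter. The sketch assumes one reading of the criterion, namely that a lineage qualifies when it has at least 50 isolates and when at least four antibiotic classes each show a resistance prevalence of 30% or more within it. The function name, input layout and class labels are hypothetical, and the toy clusters are illustrative rather than real data.

```python
from collections import Counter

MIN_ISOLATES = 50      # ">=50 isolates"
MIN_PREVALENCE = 0.30  # ">=30% resistance prevalence"
MIN_CLASSES = 4        # "at least four of nine antibiotic classes"


def large_mdr_lineages(clusters):
    """Return IDs of clusters meeting the size and resistance-prevalence thresholds.

    `clusters` maps a cluster ID to a list with one set of resistant
    antibiotic classes per isolate (hypothetical input layout).
    """
    selected = []
    for cluster_id, isolates in clusters.items():
        n = len(isolates)
        if n < MIN_ISOLATES:
            continue
        # Number of isolates resistant to each class within this cluster.
        class_counts = Counter(c for profile in isolates for c in set(profile))
        n_prevalent = sum(
            1 for count in class_counts.values() if count / n >= MIN_PREVALENCE
        )
        if n_prevalent >= MIN_CLASSES:
            selected.append(cluster_id)
    return selected


# Toy data with hypothetical class labels.
resistant = {"class1", "class2", "class3", "class4"}
toy_clusters = {
    "A": [resistant] * 20 + [set()] * 40,  # 60 isolates, 33% prevalence in 4 classes
    "B": [resistant] * 10 + [set()] * 50,  # 60 isolates, 17% prevalence
    "C": [resistant] * 40,                 # resistant but fewer than 50 isolates
}
print(large_mdr_lineages(toy_clusters))
```

Applying both thresholds together keeps the analysis focused on lineages that are both well sampled and consistently resistant.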

Fig. 5: Phylogenies of MDR H. influenzae lineages.

Recombination-free maximum-likelihood phylogenies for each PopPUNK MDR cluster. Each cluster comprises at least 50 genomes, with at least 30% resistance prevalence for a minimum of four of nine antibiotic classes (Fig. 2). The phylogenies were inferred by using Gubbins on whole-genome pseudoalignments of each cluster, separately mapped against the H. influenzae reference 86-028NP (NC_007146.2). The in silico serotypes (inner ring) and isolation location (second ring) are shown by colour as indicated in the legend. The isolates collected in Maela represent the most common origin in all three clusters (40% in cluster 3, 30% in cluster 5 and 45% in cluster 19).

Discussion

The genomic epidemiology of non-b H. influenzae has remained largely elusive so far due to the lack of carriage studies in high-burden settings, particularly in populations from before the rollout of the Hib vaccine. Our study provides comprehensive evidence that NTHi are equally capable of causing invasive disease irrespective of their genetic background, even in a pre-Hib vaccine host population. Although only colonizing isolates were available from the Maela cohort, comparable sampling during episodes of clinical pneumonia has provided insights into disease-associated strains. Results from the multi-country PERCH pneumonia aetiology study8 confirmed a positive association between non-b H. influenzae upper respiratory colonization and chest X-ray-confirmed pneumonia. The same study demonstrated an aetiologic fraction of 4.5% for non-b H. influenzae among human immunodeficiency virus-negative, chest X-ray-confirmed cases, which is comparable with the 6.7% fraction estimated for Streptococcus pneumoniae. There is evidence from several countries of an increasing burden of invasive non-b H. influenzae disease, notably in neonates and older adults, with the vast majority being NTHi infections1,16. The high burden of pneumonia attributable to NTHi and the notable childhood mortality associated with it (the third most common bacterial pathogen) in low-resource settings in both Africa and Asia16, combined with the frequent emergence of MDR lineages as identified in the current study, serve as a reminder of the health benefits of developing an immunization programme targeting the eradication of these pathogens. This would contribute not only towards removing the public health burden of non-b H. influenzae invasive disease, but also towards substantially reducing both the incidence of AOM and the need to prescribe antibiotics to children.

A pre-vaccine carriage study conducted in The Gambia during the 1980s identified a highly variable carriage rate of serotype b, ranging between 0% and 33% across rural and urban areas, whereas the species-wide carriage rate was found to be 90% among children under 5 years of age17. To our knowledge, there are no other comparable prevaccine studies with serotyping data, but the study in The Gambia suggests that the Maela H. influenzae are not atypical in terms of serotype distribution in an unvaccinated host population in a high-burden setting. Postvaccine carriage studies across Europe and China consistently show a decline of serotype b as expected but also that NTHi are most commonly colonizing young children and that all non-b serotypes are rare18,19,20,21. A Belgian study compared carriage rates among children attending day care and those diagnosed with either AOM or invasive disease during 2016–2018. Notably, NTHi were dominating in each category (colonizing, AOM, invasive), with the percentages 95.2%, 98.2% and 68.1%, respectively19. Similarly, in Norwegian (2017–2021) and Portuguese (2011–2018) national surveillance of H. influenzae invasive disease, NTHi accounted for 71.8% and 79.2% of the cases, respectively. These findings are well aligned with our data from Maela and further with a recent study of CAP in children under 5 years vaccinated against Hib in Vietnam, where a high fraction of NTHi was also detected using real-time PCR22. Interestingly, although serotype b has been found either completely absent18,19 or very rare20 in European carriage studies, it is still found in invasive disease across the continent19,23,24, suggesting ongoing transmission from unvaccinated regions of the world.

Apart from the genomic epidemiology of H. influenzae, the overall understanding of the species’ population structure has also remained largely elusive, particularly at a global scale, despite various efforts to elucidate it over the past decade. This study, through analysing a large cohort of isolates from an understudied region combined with a systematic collection of publicly available data, suggests that the global H. influenzae population is not structured into independently evolving lineages which are predominant in certain regions but rare in others. This is unlike other well-studied bacterial species that colonize the same niche, such as S. pneumoniae or Neisseria meningitidis, where distinct, independently evolving lineages have been readily identified for decades25,26,27. Based on our analyses, H. influenzae, in particular NTHi, instead appears to have a population structure reminiscent of panmixia, where routine gene flow between members of the species prevents the formation of stable lineages. This type of population structure would account for the limited success of various methods used to cluster the population in this study and the difficulties previous efforts have encountered when using smaller datasets11. Technically, clustering methods have probably been limited by the low levels of nucleotide diversity observed within the H. influenzae genome, as we found no SNPs in over half the core genome of the Maela isolates. This, however, is not apparent in the output of clustering methods and only becomes evident when population genomics analyses are conducted at species-wide scale.

Despite the low levels of nucleotide diversity, the phylogenetic analysis of the combined Maela and globally sequenced isolates remains possible and clearly demonstrates in H. influenzae a persistent lack of phylogeographical signal (closely related isolates being highly colocalized), even with this collection of isolates spanning over 50 years. This strongly suggests that interregional and intercontinental transmission of these bacteria happens frequently. This is consistent with the high levels of recombination observed in the Maela dataset, as that would facilitate the efficient admixture of migrating isolates with the destination population. Furthermore, the frequent migration and recombination, when combined with widespread evidence of negative selection across coding regions of the genome implied by the low dN/dS values, correspond to a pool of biological forces that probably explains the low nucleotide diversity of the H. influenzae genome. It is, however, difficult to disentangle the individual contributions of migration, recombination and negative selection in producing low levels of diversity, and indeed, they are probably acting in concert, as has been demonstrated previously in other ecological settings28,29.

Our results underscore the importance of a global perspective on disease surveillance when developing public health strategies for managing invasive H. influenzae disease, as it is clear that pathogenic adaptations which arise in one part of the world have ample opportunity for global spread. Although the MDR lineages identified in this study were more frequent in Maela than any other individual sampling country, they were still mostly composed of isolates from around the world (55–65%). Due to the general bias of sampling towards high-income settings, we cannot exclude the possibility that these lineages may be further established in unsampled LMIC populations with high antibiotic use. Intensified efforts should thus be made to include H. influenzae into AMR surveillance programmes as widely as possible. Such surveillance should preferably not be limited to including only bloodstream isolates, because it will otherwise underestimate the prevalence of circulating AMR determinants among pneumonia and AOM clinical cases. Similar to S. pneumoniae, carefully conducted studies of H. influenzae colonization in pneumonia cases and controls may provide further data on the relative invasiveness of capsulated and unencapsulated strains30. Given the significant evidence of adaptation in accessory genes in the Maela population and that the MDR lineages were predominantly identified among Maela isolates, it is possible that the camp host population may be exceptionally well suited to the evolutionary adaptation of these bacteria. This could be due to either the host population density resulting in high colonization and transmission success or the level of antibiotic use in the camp, and it is further feasible that the fitness cost of maintaining such high levels of resistance beyond these settings is prohibitive. 
An alternative and more plausible explanation is that the higher sampling density in the Maela cohort has led to higher statistical power to identify adaptation using methods based on aligned gene sequences, suggesting that similar adaptation could also have taken place elsewhere. Widespread genomic surveillance in comparable settings would facilitate early detection of the spread of extensive levels of AMR. Importantly, such surveillance data would also be crucial in developing a deeper understanding of how selection drives the evolution and maintenance of AMR in H. influenzae.

Apart from the concerning implication regarding the possibility of the global spread of AMR in H. influenzae, the results of this study also highlight the importance of vaccination against serotype b H. influenzae, where we have detected extremely high rates of MDR isolates. Our results further suggest that vaccination may be a particularly effective strategy to control invasive H. influenzae disease irrespective of serotype, owing to the lower level of diversity within the core genome relative to the accessory genome and to the highly admixed population structure of the species. Given the low level of the observed allelic diversity, the pervasive negative selection we detected throughout the H. influenzae genome at a global scale may be strong enough to overcome selection driving compensatory adaptations which would generally reduce vaccine efficacy in response to rollout. Although this is a cause for optimism, it must be tempered by the fact that the high levels of recombination observed in H. influenzae may also increase the efficacy of positive selection on any mutations which do arise, as has been observed in other species31. In any case, the stark contrast between the H. influenzae population structure identified in this work and the highly stratified population structure of S. pneumoniae, both globally27 and in the Maela host population32, strongly suggests that vaccine evasion through interlineage competition and replacement, as has repeatedly been observed in S. pneumoniae33, would be much less likely to happen in H. influenzae, due to the absence of a deeply structured population and local variants. A number of conserved surface antigens have been under investigation as potential candidates for protein subunit vaccines34.
Recently, antigenic responses to several of the promising candidates have been measured for otitis-media-prone children and their controls; these include the recombinant soluble PilA (rsPilA) fused with protein E, protein D and the ubiquitous surface protein A2 (UspA2) from Moraxella catarrhalis, as well as ChimV4 (a chimera of protective epitopes from rsPilA) and the outer membrane protein P5 (OMP P5)35. Although there are many complications involved in the design of protein-based bacterial vaccines which would need to be overcome36, our work supports the conjecture that a single universal vaccine could possibly be developed to combat invasive H. influenzae disease and suggests that the eradication of invasive disease caused by H. influenzae may be a feasible end goal of widespread vaccination campaigns.

Methods

Ethical approval

Written informed consent was obtained from the participating infants’ mothers before enrolment into the cohort study. Ethical approval was granted by the ethics committees of the Faculty of Tropical Medicine, Mahidol University, Thailand (MUTM-2009-306) and Oxford University, UK (OXTREC-031-06). The sequencing work on stored isolates described here was approved by the same committees (TMEC-19-043; OxTREC-551-19).

Study design and collections

A total of 4,474 H. influenzae isolates were retrieved from a mother–infant cohort of 999 pregnant women from the Maela camp for displaced persons, Thailand, from October 2007 to November 200837,38. Within a 24-month postpartum period, the infants were sampled by NPSs monthly and when the infant presented symptoms of World Health Organization (WHO) clinical pneumonia. For comparative analyses, a systematic search was conducted for publicly available short-read genome sequences for which country and year of isolation metadata was available. The data were retrieved from the ENA for 6,129 isolates9,10,13,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66, of which 5,879 passed QC, resulting in a final dataset size of 9,849 isolates.

Sampling and sequencing procedures

Between October 2007 and November 2008, 999 pregnant women from the Maela camp for displaced persons, located on the Thailand–Myanmar border in Tak province, northwest Thailand (Fig. 1a,b), were recruited into a mother–infant colonization study. The infants were followed from birth for 24 months, and a nasopharyngeal swab (NPS) specimen was collected (Dacron-tipped swabs; Medical Wire & Equipment) at monthly intervals and whenever the infant presented to the Shoklo Malaria Research Unit clinic with symptoms and signs compatible with WHO clinical pneumonia (Fig. 1b).

Following sampling, the NPS tip was excised immediately into a sterile cryovial containing 1 ml skim milk, tryptone, glucose, glycerol medium (STGG; prepared in-house) using 70% ethanol-cleaned scissors. The NPS–STGG specimens were transferred to the Shoklo Malaria Research Unit (SMRU) microbiology laboratory in a cool box within 8 h of collection and were frozen at −80 °C until culture.

In total, 10 µl of thawed NPS–STGG specimen was cultured onto plain chocolate agar (Clinical Diagnostics), a 10-unit bacitracin disc (Oxoid) was applied to the first streak, and the plate was incubated overnight at 36 °C in 5% CO2. The bacitracin-resistant colonies were confirmed as H. influenzae by Gram stain and X + V factor-dependent growth. The serotype was determined by slide agglutination (Becton Dickinson) and by in silico capsule typing12. Two isolates, one NTHi and one serogroup b by agglutination, gave only partial in silico capsule typing12 results. The in silico capsule typing was generally congruent (overall congruence 95.3%) with the agglutination-based phenotypic serotyping, and was corrected by the latter in those 28 cases where the serological typing indicated a serotype for an NT in silico type. The pure isolates of H. influenzae were collected from an overnight culture plate into 1 ml of STGG and stored at −80 °C before DNA extraction at Qiagen using the DNeasy protocol.

Short-read WGS of the 4,474 H. influenzae isolates was performed at the Wellcome Sanger Institute on the Illumina-HTP NovaSeq 6000 platform with 150-bp paired-end sequencing.

For long-read sequencing, one reference isolate was selected for each of the 48 largest PopPUNK clusters, which together cover 3,558 (89.6%) of the 3,970 isolates of the study cohort. The reference isolates were selected on the basis of the gene presence–absence matrix from the estimated pangenome, using a published selection pipeline67,68.

The selected H. influenzae strains were subcultured on chocolate agar and incubated overnight at 35–37 °C in 5% CO2. The genomic DNA was extracted using the Qiagen MagAttract HMW DNA Kit. The WGS libraries were constructed using the Oxford Nanopore Technologies (ONT) SQK-NBD112.96 Native Barcoding Kit, and all 48 strains were pooled and sequenced on one ONT R9.4.1 flowcell using a MinION Mk1C. The hybrid assembly of the reference isolates was performed using a publicly available pipeline69 with a minimum ONT coverage of 40× and a Phred score of 20 to trim the Illumina reads, resulting in complete hybrid assemblies for 40 (83.3%) of the 48 reference isolates.

Genomic analysis

A total of 4,474 H. influenzae isolates were sequenced at the Wellcome Sanger Institute on the NovaSeq 6000 platform with 150-bp paired-end reads. Species contamination was identified using Kraken v.0.10.6 (ref. 70), and sequence data failed QC if the depth of coverage was <20× or if there was evidence of contamination or mixed strains, poor assembly or extreme violation of any of the QC parameters. In total, 3,970 isolates passed QC and were included in the genomic analyses. Short-read genome sequences, both newly sequenced and publicly available, were assembled and annotated using a published pipeline with default parameters71, and QC on all isolates was performed on the basis of the number of contigs, the number of genes and the distance from the origin in a multidimensional scaling projection of all pairwise distances. The Maela isolates were clustered using PopPUNK v.2.4.072 with a core threshold of 0.11, whereas the default threshold was used for the combined global data. Antimicrobial resistance genes and point mutations were screened from assemblies using AMRFinderPlus v.4.0.3 and the H. influenzae-curated database v.2024-12-18.1 (--organism Haemophilus_influenzae) with a minimum identity of 75% and a minimum coverage of 80%. Hicap v.1.0.312 was used to infer capsule type from assemblies. cgMLST profiles were identified using chewBBACA73 and the H. influenzae cgMLST database11, and a simple network clustering method was used to group isolates into complexes on the basis of the number of mismatches in their allele profiles (either 100 or 250). For the phylogenetic analyses of the Maela data, the sequence reads of the 3,970 genomes were mapped to the complete genome of H. influenzae 86-028NP (NC_007146.2) (ref. 74) using Snippy v.4.6.075, and an SNP-only sequence alignment was created using snp-sites v.2.5.176. A phylogeny for the Maela collection was inferred using FastTree v.2.1.10 with a generalized time-reversible model77, and a maximum-likelihood tree was inferred for the global collection using IQ-TREE v.2.4.078,79 on the Panaroo core-genome alignment, with uninformative regions masked using information entropy scores.
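The grouping of isolates into cgMLST complexes can be illustrated with a minimal sketch. It assumes that the network clustering corresponds to connected components of a graph in which two isolates are linked when their allele profiles differ at no more than a chosen number of loci (single linkage); the exact method used in this study is not specified beyond the description above. The function names, the dictionary input and the convention that missing alleles (None) are skipped are illustrative assumptions.

```python
def mismatches(a, b):
    """Count loci at which two allele profiles differ.

    Assumption: loci missing (None) in either profile are ignored.
    """
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)


def complexes(profiles, threshold):
    """Group isolates into complexes by single-linkage clustering.

    profiles: dict mapping isolate name -> tuple of allele identifiers.
    threshold: maximum number of mismatches for two isolates to be linked.
    Returns a sorted list of sorted lists of isolate names.
    """
    names = sorted(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Link every pair of isolates within the mismatch threshold.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if mismatches(profiles[a], profiles[b]) <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())
```

Because linkage is transitive in this sketch, two isolates can share a complex even if they differ at more loci than the threshold, provided a chain of intermediate profiles connects them. The pairwise comparison is quadratic in the number of isolates, which is adequate for illustration only.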

The pangenome was inferred for the Maela genome collection using Panaroo v.1.2.980 in sensitive mode with paralogs merged. The pangenome was also inferred for the entire combined global dataset by running Panaroo in strict mode. FastGEAR v.2016-12-16 was used to infer recombinations, and pixy v.1.2.7.beta1 was used to infer per-gene nucleotide diversity, π, both from aligned pangenome gene sequences. Genomegamap v.1.0.181 was used to infer the maximum-likelihood estimates of the average dN/dS for each gene in the pangenome. Only genes with an estimated nucleotide diversity (θ) >0.005 were considered to have estimates robust enough for further interpretation (n = 6,853 genes). Genes with robust estimates of dN/dS greater than 2 were further analysed using the HyPhy package v.2.5.6082, specifically the FUBAR, FEL and aBSREL statistical tests for pervasive gene-wide directional selection, specific sites subject to directional selection and subsets of branches subject to directional selection, respectively. Genes with a significant result in at least one of these HyPhy tests were manually investigated by searching the consensus and variant nucleotide sequences against the non-redundant protein database with tblastx and by searching ESMFold-predicted protein structures against the AlphaFold, UniProt and Swiss-Prot databases using Foldseek. Three-dimensional structural alignments of consensus, variant and reference protein structures were created for specific genes using TM-align.
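The two-step gene filter described above (a diversity threshold for robustness, followed by a dN/dS threshold for follow-up testing) can be sketched as follows. The record layout and the field names gene, theta and dnds are hypothetical and do not reflect the output format of any of the tools named above.

```python
def select_candidates(genes, min_theta=0.005, min_dnds=2.0):
    """Apply the two thresholds stated in the text to per-gene estimates.

    genes: iterable of dicts with hypothetical keys 'gene', 'theta', 'dnds'.
    Returns (robust, candidates):
      robust     - records whose diversity estimate exceeds min_theta
      candidates - names of robust genes whose dN/dS exceeds min_dnds
    """
    robust = [g for g in genes if g["theta"] > min_theta]
    candidates = [g["gene"] for g in robust if g["dnds"] > min_dnds]
    return robust, candidates
```

Applying the diversity filter first means that a gene with a high dN/dS estimate but too little variation is never passed on to the selection tests.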

MDR clusters in the combined collection were defined as PopPUNK clusters with ≥50 isolates, of which ≥30% harboured resistance determinants to at least four of nine antibiotic classes (aminoglycoside, β-lactam, phenicol, sulfonamide, tetracycline, trimethoprim, macrolide, quinolone and rifamycin). For the phylogenetic reconstruction of the global MDR clusters, assemblies of each cluster were mapped to the reference H. influenzae 86-028NP74 using Snippy, and phylogenies were inferred using Gubbins83.
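The MDR cluster definition can be written as a short filter. The sketch below assumes that each isolate has already been reduced to a cluster label and the set of antibiotic classes for which a resistance determinant was detected; this input format, the function name and the ASCII class labels are illustrative assumptions.

```python
from collections import defaultdict

# The nine antibiotic classes listed in the text (ASCII labels assumed).
CLASSES = {
    "aminoglycoside", "beta-lactam", "phenicol", "sulfonamide",
    "tetracycline", "trimethoprim", "macrolide", "quinolone", "rifamycin",
}


def mdr_clusters(isolates, min_size=50, min_fraction=0.30, min_classes=4):
    """Return the sorted labels of clusters meeting the MDR definition.

    isolates: iterable of (cluster_label, set_of_resistance_classes).
    A cluster qualifies if it has at least min_size isolates and at least
    min_fraction of them carry determinants to min_classes or more classes.
    """
    size = defaultdict(int)
    n_mdr = defaultdict(int)
    for cluster, classes in isolates:
        size[cluster] += 1
        # Only the nine recognized classes count towards the MDR threshold.
        if len(set(classes) & CLASSES) >= min_classes:
            n_mdr[cluster] += 1
    return sorted(
        c for c in size
        if size[c] >= min_size and n_mdr[c] / size[c] >= min_fraction
    )
```

Both thresholds are inclusive, so a cluster of exactly 50 isolates in which exactly 15 are multidrug resistant qualifies.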

To test for an association between the lineages of NTHi and invasive disease, we deduplicated isolates from the same host that were likely to be clonally related, to remove the bias associated with longitudinal sampling. The deduplication was performed by first grouping all isolates sampled from the same host within 60 days that belonged to the same PopPUNK cluster and then randomly selecting a single isolate from each group for the association analysis. The deduplicated NTHi isolates spanned 120 PopPUNK clusters of the Maela cohort, and we performed a permutation test of association between these and pneumonia by randomly shuffling sample labels (pneumonia and non-pneumonia). A two-sided Fisher’s exact test with 10,000 Monte Carlo replications was used.
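The deduplication and the Monte Carlo test can be sketched together. This is an illustration under stated assumptions, not the code used in this study: the record keys (host, cluster, day) are hypothetical; the 60-day window is interpreted as a chain, so that an isolate joins the current group when it was sampled within 60 days of the previous isolate of the same host and cluster; and the test follows the common Monte Carlo form of Fisher's exact test, in which labels are shuffled with fixed margins and the P value is the proportion of shuffled tables that are no more probable than the observed table.

```python
import math
import random
from collections import Counter, defaultdict


def deduplicate(isolates, window_days=60, seed=0):
    """Keep one randomly chosen isolate per host/cluster sampling episode.

    isolates: list of dicts with hypothetical keys 'host', 'cluster', 'day'.
    Assumption: a gap of more than window_days starts a new episode.
    """
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for iso in isolates:
        by_key[(iso["host"], iso["cluster"])].append(iso)
    kept = []
    for key in sorted(by_key):
        group = sorted(by_key[key], key=lambda r: r["day"])
        episode = [group[0]]
        for iso in group[1:]:
            if iso["day"] - episode[-1]["day"] <= window_days:
                episode.append(iso)
            else:
                kept.append(rng.choice(episode))
                episode = [iso]
        kept.append(rng.choice(episode))
    return kept


def table_stat(clusters, labels):
    """Sum of log-factorials of the cluster-by-label cell counts.

    With both margins fixed, a table becomes less probable as this
    statistic increases.
    """
    cells = Counter(zip(clusters, labels))
    return sum(math.lgamma(n + 1) for n in cells.values())


def monte_carlo_fisher(clusters, labels, n_rep=10_000, seed=0):
    """Two-sided Monte Carlo P value for a cluster-by-label association."""
    rng = random.Random(seed)
    observed = table_stat(clusters, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_rep):
        rng.shuffle(shuffled)  # permuting labels keeps both margins fixed
        if table_stat(clusters, shuffled) >= observed - 1e-9:
            hits += 1
    # Add one to numerator and denominator so the P value is never zero.
    return (hits + 1) / (n_rep + 1)
```

Under the chain interpretation, monthly samples of a persistently carried lineage collapse into a single episode regardless of its total duration; measuring the window from the first isolate of each group instead would split long carriage episodes and is an equally plausible reading of the description above.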

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.