Abstract
Tuberculosis epidemics have traditionally been conceptualized as arising from a single uniform pathogen. However, Mycobacterium tuberculosis complex (Mtbc), the pathogen causing tuberculosis in humans, encompasses multiple lineages exhibiting genetic and phenotypic diversity that may be responsible for heterogeneity in tuberculosis transmission. We analysed a population-based dataset of 1,354 Mtbc whole-genome sequences collected over four years in Botswana, a country with high HIV and tuberculosis burden. We identified Lineage 4 (L4) as the most prevalent (87.4%), followed by L1 (6.4%), L2 (5.3%), and L3 (0.9%). Within L4, multiple sublineages were identified, with L4.3.4 being the predominant sublineage. Phylodynamic analysis revealed that L4.3.4 expanded steadily from the late 1800s to early 2000s. Conversely, L1, L4.4, and L4.3.2 showed population trajectories closely aligned with the HIV epidemic. Meanwhile, L2 saw rapid expansion throughout most of the 20th century but declined sharply in the early 1990s. Additionally, pairwise genome comparison of Mtbc highlighted differences in clustering proportions due to recent transmission at the sublineage level. These findings emphasize the diverse transmission dynamics of strains of different Mtbc lineages and highlight the potential for phylodynamic analysis of routine sequences to refine our understanding of lineage-specific behaviors.
Similar content being viewed by others
Introduction
Despite being preventable and curable, tuberculosis (TB) claims an estimated 1.4 million lives each year worldwide, surpassing deaths caused by any other infectious disease besides severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2)1. The SARS-CoV-2 pandemic highlighted the public health importance of monitoring the spread of different genomic variants globally and in local settings2. Indeed, genomic epidemiology has been instrumental in outbreak investigation and public health over the previous decade by tracing the spread of various infectious pathogens and informing key stakeholders. For instance, genomic surveillance of viral pathogens guided Ebola vaccine allocation in west Africa during the 2013–2016 outbreak3, identified likely source and routes of zika virus introductions into the United States during the 2016 outbreak4, and has continuously mapped transmission networks and detected high-risk groups in the ongoing HIV pandemic5,6. These findings are facilitated through the use of phylodynamics, a statistical framework that integrates high-resolution genetic data with individual-level epidemiological data to shed light on the transmission dynamics of infectious pathogens, providing timely guidance for public health interventions7,8.
For TB research, large-scale comparative genomic analysis have provided valuable insights into the origin and genetic diversity of the human-adapted TB bacteria, collectively known as the Mycobacterium tuberculosis complex (Mtbc)9,10. There are currently a total of 9 major lineages of Mtbc and 64 sublineages that characterize their genetic diversity according to a 62-single nucleotide polymorphism (SNP) barcode11,12,13. Strain diversity in Mtbc can lead to important phenotypic variations in virulence, antibiotic resistance patterns, and transmissibility14,15. For example, Lineages 2 and 4 are widespread globally and noted for their enhanced transmissibility, as evidenced by comparing Mtbc genomes (e.g., genotypic clustering proportions) collected in many settings16,17. By contrast, Lineage 1 is less likely to cause disease shortly after infection and may have a reduced ability to spread14,16,17. However, most epidemiological studies and surveillance efforts still do not distinguish between Mtbc strains, and the conventional approach of treating TB as a uniform epidemic fails to account for pathogen heterogeneity and potential differences in the transmission dynamics of Mtbc strains, which could vary significantly across different regions and populations14,18,19. Much like the insights for viral transmission dynamics gained from genomic surveillance of SARS-CoV-2 variants2, analysing Mtbc sequences to uncover long-term epidemiological trends of specific Mtbc strains may offer a more granular understanding of the TB transmission dynamics at the population level, which could improve our ability to predict the trajectory of the TB epidemic and design more tailored interventions for a given setting.
Located in southern Africa, Botswana continues to face one of the most severe TB endemics globally and remains one of the 30 high TB/HIV burden countries1. Despite the public health implications, the evolutionary history and transmission dynamics of Mtbc lineages remain poorly understood in this high burden country. In this study, we sought to expand our understanding of the Mtbc population dynamics underlying the long-standing TB endemic in Botswana by leveraging a countrywide dataset of whole genome sequences (WGS) from Mtbc isolates collected over four years20. Our aim was to delineate the long-term spread of Mtbc subpopulations in the context of sociohistorical events and the generalized HIV epidemic in the region. In addition to the broader long-term trends revealed through phylodynamic analysis, we also examined clustering of cases by Mtbc lineages based on pairwise genetic distances to track recent transmission patterns. We hypothesized that TB in Botswana is composed of a collection of heterogeneous epidemics driven by distinct transmission dynamics of Mtbc lineages.
Results
Distribution of Mtbc lineages in Botswana
During the study period, 1,426 Mtbc isolates underwent successful WGS and were assigned into known lineages based on lineage-specific SNPs. We excluded 72 isolates with evidence of mixed Mtbc strain infection for subsequent phylogenetic and phylodynamic analysis. Among the remaining 1,354 isolates, 1,184 belonged to lineage 4 (L4; 87.4%), 86 (6.4%) were classified as lineage 1 (L1), 72 (5.3%) were lineage 2 (L2), and 12 (0.9%) were lineage 3 (L3). L4 was found to encompass a number of sublineages, including L4.1 (n = 2; 0.1%), L4.1.1 (n = 141; 10.4%), L4.1.2 (n = 149; 11.0%), L4.2.2 (n = 1; 0.1%), L4.3.2 (n = 117; 8.6%), L4.3.3 (n = 27; 2.0%), L4.3.4 (n = 400; 29.5%), L4.4 (n = 189; 14.0%), L4.6 (n = 2; 0.1%), L4.7 (n = 14; 1.0%), L4.8 (n = 92; 6.8%), and L4.9 (n = 50; 3.7%). Within L1, 22 (1.6%) were L1.1 and 64 (4.7%) were L1.2. All of the L2 isolates were classified as L2.2 (Fig. 1). All percentages reported represent proportions relative to the total 1,354 isolates. Furthermore, Gaborone harbored more Mtbc strain diversity compared to Ghanzi. L4.3.4 (S10 Fig), the most prevalent sublineage, was found in higher proportion in Ghanzi than Gaborone (67% vs. 21%; Table 1). Host characteristics were similar among the major Mtbc lineages (Table 2).
Maximum likelihood phylogeny of Mycobacterium tuberculosis–complex (Mtbc) isolates collected in Botswana, 2012–2016 (n = 1,354). (A) Phylogeny of Lineage 1 Indo-Oceanic Mtbc isolates (n = 86). (B) Phylogeny of Lineage 2 East Asian Mtbc isolates (n = 72). (C) Phylogeny of Lineage 3 East African-Indian Mtbc isolates (n = 12). (D) Phylogeny of Lineage 4 Euro-American Mtbc isolates (n = 1184). Tree tips are colored by Mtbc strains within a major lineage.
Historical origins and population dynamics of Mtbc lineages
Demographic reconstruction was performed for L1, L2, L4.1.1, L4.1.2, L4.3.2, L4.3.4, L4.4, and L4.8. The time to the most recent common ancestor (MRCA) refers to the estimated age of a common ancestral population from which current populations or strains of Mtbc have descended. We inferred the MRCA to have emerged around year 1727 (95% highest posterior density interval [HPDI], 1607–1828) and 1900 (95% HPDI 1854–1938) for strains of L1 and L2, respectively. Time to MRCA varied among L4 sublineages, ranging from year 1695 for L4.1.2 (95% HPDI 1546–1817) to 1941 for L4.3.2 (95% HPDI 1911–1966; Table 3). Our analysis revealed that strains of L4.3.4 experienced a steady expansion from late 1800s to early 2000s, and remained the most prevalent sublineage in recent years. Similar population trajectory was observed for L1 and L4.4 strains, with a substantial expansion observed from the late 1970s for two decades, followed by a continuous decline from the late 1990s. For L4.3.2 strains, we observed significant growth between 1980 to early 2000s, followed by a sharp reduction afterward. L2 strains experienced a steady increase in population growth between 1940 and early 1990s, followed by a steep decline until 2000, after which the population plateaued followed by a second decline around 2012. Other L4 sublineages experienced several waves of expansion and contraction. Except for L4.1.1, all other Mtbc lineages underwent contraction after early 2000s (Fig. 2 and S1 Fig). Additionally, we compared HIV prevalence across the analyzed Mtbc strains: compared to L1, HIV prevalence was lower among L4.1.1 and L4.8 isolates (S3 Table). We also screened for variants differentiating L4.3.2 and L4.3.4, identifying 110 SNPs unique to L4.3.2 and 69 SNPs unique to L4.3.4 (S4 Table). For the inferred clock rate and maximum clade credibility trees of these lineages, refer to S1 Table and S2-9 Fig.
Genomic clustering analysis
Using a 5-SNPs cutoff as an indicator of recent transmission, 48% of the isolates (652/1354) were clustered. The cluster sizes varied, ranging from 2 isolates per cluster to 23 isolates in the largest cluster. The median cluster size for each lineage was as follows: L1 and L2: 2 (range: 2–7), L4.1.1: 3 (range: 2–23), L4.1.2: 3 (range: 2–12), L4.3.2 and L4.4: 2 (range 2–8), L4.3.4: 2 (range: 2–19), and L4.8: 3.5 (range: 2–9). Median cluster sizes showed little variation as the distributions of cluster sizes were heavily right-skewed and > 60% of clusters across lineages or sub-lineages comprised only 2 to 3 isolates (S11 Fig). The slightly higher median of L4.8 reflects a greater proportion of clusters with 4 or more isolates, despite the absence of very large clusters within this sublineage. L4.1.1 and L4.3.4 exhibited a few markedly larger clusters, with the largest (n = 23 isolates) observed in L4.1.1, suggesting possible outbreak-associated transmission events. Examination of sampling locations within genomic clusters revealed that, although most clusters were composed solely of isolates sampled from Gaborone, all lineages included at least some clusters containing isolates from both Gaborone and Ghanzi, particularly among the larger clusters. This pattern suggests that transmission is largely localized but that occasional cross-site spread or shared transmission networks likely occurred (S11 Fig). Compared to L1, we found that isolates belonging to any other lineages were more likely to be part of a cluster (Table 4). Isolates belonging to L2, L4.1.1, and L4.8 had the highest cluster proportion. Additionally, the cluster proportion for the predominant L4.3.4 was significantly higher in Ghanzi than Gaborone (57% vs. 40%; p-value < 0.001) These findings remained robust when we repeated the analysis using a 12-SNPs cutoff (S2 Table).
Discussion
Few studies to date have examined lineage-specific, long-term transmission dynamics of Mtbc strains in a high HIV/TB setting, largely due to the lack of population-level WGS datasets over extended sampling periods16,21. By reconstructing the evolutionary history and spread of Mtbc strains of different lineages using WGS collected over a four-year period, we observed varying transmission dynamics and demographic trajectories for extant lineages circulating in Botswana. The expansion of many lineages coincided with the onset of the HIV/AIDS epidemic, while their contraction aligned with the implementation of antituberculosis and antiretroviral treatment programs.
With the exception of strains of L4.1.1, L4.1.2 and L4.8, strains of most lineages experienced an expansion of various magnitude in the second half of the 20th century, correlating closely with the region’s high TB burden22. Notably, this period coincides with the emergence and rapid escalation of the HIV epidemic. Botswana experienced one of the world’s most severe HIV epidemics, with initial cases emerging in the 1980s23. A phylodynamic study of HIV dated the virus’s emergence in Botswana around 1960, followed by rapid growth until stabilization post-199024. HIV increases the susceptibility of TB disease and is the strongest risk factor for reactivating latent TB infections due to compromised immunity25. Further, the HIV crisis was exacerbated by rapid urbanization driven by economic growth in the mining industry. A network analysis of micro-census data from Botswana between 1981 and 2011 suggests a highly mobile population with considerable movement between urban and rural areas26. Mining towns, where HIV cases initially emerged, acted as major migration hubs for both inflow and outflow of populations. In Selebi Phikwe, a mining town for copper and nickel in Botswana, HIV prevalence had surged to 50% in 200027. These towns also became hubs of TB transmission due to the high-risk conditions associated with mining and the growing HIV prevalence28,29. As the HIV epidemic escalated, the number of individuals susceptible to Mtbc infection and reactivation increased, driving the expansion of Mtbc populations. Thus, the intersection of the HIV epidemic, mining-driven urbanization, and high population mobility likely played a critical role in the significant expansion of strains from most Mtbc lineages in the latter part of the 20th century. Nevertheless, this expansion was not observed across all lineages. It’s possible that Mtbc strains may exhibit varying capacities for TB transmission or reactivation influenced by their interactions with the host in the context of HIV-related immunosuppression15,17,30. We observed a marked decline in Mtbc populations at the turn of the 21st century, which aligned with a consistent decrease in TB incidence from 2000 to 201631. Several nationwide programs likely contributed to the reduction. The most significant decrease in L1, L4.3.2, L4.3.4, L4.4, and L4.8 occurred beginning the early 2000s, coinciding with the launch of the Masa Antiretroviral Therapy (ART) Programme in 200223. This program, also known as the Masa Programme, initially aimed to provide free ART to individuals with advanced immune suppression but eligibility expanded over the years. “Masa” means “new dawn” in Setswana, symbolizing hope and a fresh start for those affected by the epidemic. Shortly after, the President’s Emergency Plan for AIDS Relief (PEPFAR) was launched, focusing on combating HIV/AIDS globally, with a significant emphasis on sub-Saharan Africa32. Botswana, with its high HIV prevalence rates, was a key focus country for PEPFAR. Both initiatives aimed to support those affected by HIV/AIDS through comprehensive treatment, prevention, and care programs, thereby reducing HIV transmission23,32,33. As a result, Botswana had significantly increased number of people receiving ART, reduced the rate of new HIV infections, and improved the overall healthcare system’s capacity to manage HIV/AIDS32,33,34. While PEPFAR and Masa Programme primarily targeted the HIV/AIDS epidemic, their impact on reducing HIV transmission indirectly led to a significant reduction in the burden of TB in Botswana.
Our study highlighted significant heterogeneity in transmission dynamics among strains of different Mtbc lineages analysed. We found that L1 strains maintained a relatively stable effective population size until experiencing a significant expansion following the emergence of HIV. L1 is most prevalent in regions bordering the Indian Ocean, and several large-scale phylogeographic studies have shown that South Asia and eastern Africa played critical roles in dispersing this lineage via maritime trade35,36. It is possible that MRCA of L1 strains was introduced into Botswana through subsequent migrations between eastern and southern regions of the continent. Post establishment, evidence suggests that L1 likely expanded due to changing environmental conditions (e.g. increased population densities, malnutrition, immunosuppression), rather than from large migration or transmission events35. Consistent with previous findings, our study found that L1 had the lowest proportion of cases that belonged to clusters14,16,37. A phylodynamic study from Vietnam, an L1-endemic region, concluded that compared to L2 and L4, the disease burden due to L1 in the region was associated with reactivation of long-term remote infections, evidenced by larger SNP differences and limited inferred migration events among L1 isolates38. Similarly, in Botswana, reactivation of latent TB infections likely contributes to new cases of L1 and may have driven this lineage’s expansion as population immunity waned due to HIV. L4.4 strains followed similar demographic trends but showed a slower decline in population size after nationwide introduction of ART interventions, which may be linked to its higher proportion of clustered cases due to recent transmission compared to L1 strains.
Notably, L2 strains underwent rapid expansion in effective population size shortly after their estimated emergence in the early 20th century. Phylogenetic studies suggest that L2.2, often associated with the Beijing genotype, likely originated in China21,39. The 19th and 20th centuries were characterized by intense interaction between Europe and China, driven by trade, imperial ambitions, and cultural exchanges. Following the Opium Wars, numerous Chinese cities became treaty ports where European powers had extraterritorial rights and conducted trade with relative freedom. Prominent treaty ports included Shanghai, Guangzhou, and Tianjin, which became hubs of international trade and cultural exchange. Additionally, during the late 19th and early 20th centuries, Chinese laborers were often recruited for mining and railway construction projects worldwide, including in Africa40. South Africa, which borders Botswana, had a significant number of Chinese laborers working in its mines51. This movement could have facilitated the introduction and spread of L2.2 (with putative origins in Beijing, China) to southern Africa50. L2 has been associated with hypervirulence (ability to cause disease shortly after infection) and high transmissibility, supported by both clinical evidence and animal experiments41,42. In our study, we also observed that L2 had one of the highest proportions of genomic clustering due to recent transmission. However, L2 strains experienced a decline in the early 1990s before the implementation of country-wide ART programs. This decline might reflect a combination of factors, including implementation of facility-based directly observed therapy, local pathogen-host dynamics and its challenges in competing with other well-adapted indigenous lineages15. The discrepancy between high genotypic clustering proportion and population dynamic of L2 strains in Botswana highlights that genotypic clustering analysis alone does not necessarily capture the dynamics of an epidemic, nor does it always correlate with the population size of a specific Mtbc lineage. Our finding also contrasts sharply with the phylodynamic study in Vietnam, where L2 showed clear patterns of overtaking endemic L1 strains over time38. Overall, these findings imply that the competitive dynamics between strains of different lineages can vary significantly depending on the region and the specific lineages involved, and that studying lineage features without accounting for geographic or host genetic background can obscure important host-pathogen interactions15,43.
Our analysis indicates that most TB cases in Botswana were caused by L4 strains, specifically L4.3 strains. This finding aligns with previous reports of L4-belonging Latin American and Mediterranean (LAM) sublineage’s dominant presence in southern Africa region44,45,46,47. Based on the sampled genetic diversity, we estimated that most L4 sublineages trace back to MRCAs that emerged between the late 17th and 19th centuries. Recent genomic studies of Mtbc suggests that the introduction and spread of L4 coincide with historical events involving European colonization and subsequent socioeconomic changes44,48. European colonization of South Africa, Botswana’s neighboring country, began in earnest with the establishment of a Dutch settlement at the Cape of Good Hope in 165249. Later, British colonization intensified in early 19th century, and Botswana (then Bechuanaland) became a British protectorate in 1885. The establishment of colonial administration and subsequent movements of European settlers between Europe, South Africa and Botswana during this period may have facilitated the establishment and spread of L4 strains still observed today.
The most common Mtbc lineage found in Botswana was L4.3.4, and it was overwhelmingly more prevalent in Ghanzi than Gaborone (67% vs. 21%). Another study using a different genotyping technique noted this, but was unable to differentiate whether it was due to recent transmission or reactivation of latent infections50. The effective population size of L4.3.4 strains grew steadily since its introduction around 1880 and only began to decline slightly after country-wide ART programs began. Unlike other lineages, L4.3.4’s expansion during the HIV epidemic was modest. This contrast may be explained by the higher prevalence of this lineage in Ghanzi, which has lower HIV prevalence compared to Gaborone. Prior to the study, approximately 36% of TB patients in Ghanzi were coinfected with HIV, compared to 70% in Gaborone51, suggesting a less pronounced impact of the HIV epidemic on TB transmission dynamics in Ghanzi. In our analysis, L4.3.4 displayed moderate clustering proportions due to recent transmission, falling between the lowest (L1) and the highest groups (e.g., L2, L4.1.1, L4.8). However, it exhibited significantly higher cluster proportions in Ghanzi compared to Gaborone. Taken collectively, we propose that the success of L4.3.4 is driven primarily by sustained ongoing transmission, rather than reactivation of latent infections50. This highlights the need for public health efforts to prioritize activities that interrupt transmission, such as enhancing contact tracing52, improving infection control measures53, and ensuring timely diagnosis and treatment to reduce disease burden, especially in a rural setting like Ghanzi54.
L4.3.4 displayed clear competitive advantage as evidenced by its dominant presence in both districts despite not being the earliest lineage to emerge in Botswana. It’s also interesting to observe that L4.3.2, a strain closely related to L4.3.4, emerged around the time of estimated HIV introduction in Botswana24. However, unlike L4.3.4’s steady growth, L4.3.2 experienced a rapid expansion followed by a sharp decline, mirroring remarkably well with the timing of the HIV epidemic. The striking difference in population dynamics between two closely related strains highlight the need for further investigation to determine whether strain-specific mutations identified in this study contribute to distinct phenotypes, and/or if potential host-pathogen interactions influence their respective trajectories and the success of L4.3.4.
L4.1.1, L4.1.2, and L4.8 strains share similar demographic trends with large overlapping CIs. Notably, L4.1.1 and L4.8 exhibited the lowest HIV prevalence but highest cluster proportions due to recent transmission, almost doubling the estimate compared to L4.3.4. This variation in genotypic clustering within L4 sublineages was also observed in San Francisco, USA55, underscoring sublineage-level heterogeneity and the necessity of investigating them individually. Furthermore, L4.1.1 was the only lineage that expanded after the implementation of country-wide ART programs. This pattern, together with the presence of a few very large clusters defined by the 5-SNP cutoff, suggests that L4.1.1 may be associated with ongoing localized outbreak-like transmission. In addition to possible sublineage-specific characteristics (e.g. virulence), other socioecological drivers, for example, overcrowding or malnutrition, may be responsible for the expansion and contraction of this sublineage56. These findings point to the need for public health programs to monitor this sublineage more closely and to further investigate any contextual drivers underlying its transmission.
Our study had limitations. First, the available genetic data may not fully represent the diversity of Mtbc strains circulating in Botswana. However, the greater Gaborone and Ghanzi district encompass a considerable segment (20%–25%) of the total population, and sampling in Gaborone likely captured a substantial portion of Mtbc strain diversity due to its role as a central urban hub for inflow and outflow of population. Second, Mtbc isolates were limited to sputa. It is possible that some lineages may be under-represented because of the predication for pulmonary clinical presentation57. Third, coalescent models (e.g. Skyride model applied in our analysis) use pathogen genomic data to estimate the timescale of population dynamics58. The relatively low mutation rate of Mtbc can result in broad uncertainty intervals for estimated effective population sizes and divergence times, making it challenging to pinpoint precise timings for events such as introductions or expansions of lineages and limiting our ability to analyze local features of their curves. Inclusion of Mtbc sequences as they become available in the future may improve the reconstruction of Mtbc population dynamics and reduce uncertainty associated with population parameters. Lastly, we acknowledge the limitation of using a fixed-SNP threshold to determine transmission clusters59. The chosen threshold is often arbitrary and might not accurately reflect recent transmission. However, a SNP cutoff provides a standardized approach, making it easier to compare results across different studies and useful for public health investigations. The chosen 5-SNP cutoff for determining TB transmission clusters is common in molecular epidemiology and has been shown to provide an informative overview of local transmission patterns60,61.
In summary, our study provides insight into the recent evolutionary origins and population dynamics of strains of different Mtbc lineages in Botswana. We highlight the heterogeneous transmission dynamics of key lineages circulating in Botswana and emphasize that increasing awareness of these heterogeneities may be essential to further reducing TB burden. Similar to viral pathogens, we recommend routine molecular surveillance and phylodynamic analysis to monitor local TB transmission. This could enable public health programs to effectively track the spread of different Mtbc strains, identify those that are expanding over time, and tailor control strategies based on the epidemiological trends of circulating Mtbc strains within a setting.
Methods
Data source and study setting
Data were obtained from a population-based TB study conducted between 2012 and 2016. Study procedures have been described in detail elsewhere20. Briefly, the study screened 5,515 individuals of all ages with presumptive TB in the greater Gaborone and Ghanzi districts of Botswana, of whom 4,331 were enrolled. Among these, 2,162 participants had culture-confirmed M. tuberculosis64. Gaborone, located in the southeast and bordering South Africa, is the capital city with a high population density of 1,370 persons/km2 in 2011. In contrast, Ghanzi is a rural district located in western Botswana with the lowest population density in the country according to the 2011 census, at 0.37 persons/km262. Despite Ghanzi’s low population density, it consistently has the highest TB incidence in the country. Prior to the study, the TB incident rate per 100,000 people in Botswana, and Gaborone and Ghanzi districts were approximately 445, 440, and 722 cases, respectively63.
During the enrollment process, participants without a documented HIV status or with negative test results over 12 months old were offered rapid HIV testing. Each participant was required to provide at least one expectorated sputum sample. Sputum induction with nebulized hypertonic saline solution was performed for those who had difficulty producing sputum spontaneously. The Kopanyo Study was approved by the U.S. Centers for Disease Control and Prevention, the Botswana Ministry of Health and Wellness, and the University of Pennsylvania Institutional Review Boards. Additionally, approval of this secondary data analysis was obtained from University of California at Irvine Institutional Review Board. All research procedures were conducted in accordance with the Declaration of Helsinki and relevant institutional guidelines and regulations. Written informed consent was obtained from all participants prior to enrollment.
Culture and DNA extraction
The sputum specimens underwent decontamination using the N-acetyl-L-cysteine and sodium hydroxide method. After processing, the specimens were inoculated into Mycobacteria Grow Indicator Tubes (MGIT) 960 system (Becton Dickinson Microbiology Systems, Sparks, MD, USA) at the Botswana national reference laboratory in Gaborone. Weekly monitoring of mycobacteria growth occurred, and if no growth was observed within 8 weeks, it was classified as culture negative. For culture positive colonies, Mtbc DNA was extracted with GenoLyse kit (Hain Lifescience, Germany) following the manufacturer’s protocol. Genomic extracts were stored at -80 °C, and those containing a sufficient amount of DNA (at least 0.05ng/uL) were sent to the Research Center Borstel for WGS.
Whole genome sequencing and bioinformatics
Genomic libraries were prepared with Illumina Nextera XT kit to generate 2 × 150 bp paired-end reads for sequencing using the Illumina NextSeq 500 platform. An automated bioinformatics pipeline, MTBseq, was used for WGS data analysis64. MTBseq combined all necessary steps needed for analysis of WGS data. Briefly, raw sequences were aligned to the pan-susceptible, H37Rv reference genome (GenBank NC000962.2) with BWA-mem software and Samtools. Next, base call recalibration and realignment of reads around insertions and deletions was performed with GATKv3. Variant calling was performed using Samtools mpileup and in-house scripts, and positions were considered reliable with 75% allele frequency, at least 4 calls with a phred score of 20 or more, and a minimum of 4 reads mapped in each direction (forward and reverse orientation). Libraries were considered as high quality and further analyzed when at least 50x coverage was achieved and 95% of the reference genome covered. Next, genomes were annotated with known associations to antibiotic resistance. Finally, phylogenetically informative SNPs based on existing literature were used for lineage classification of the input samples13, excluding genes associated with antibiotic resistance, insertions and deletions, repetitive regions (PPE and PE-PGRS gene families), and consecutive variants in a 12 bp window.
Phylogenetic reconstruction
We extracted variable sites from whole genome alignments of Mtbc isolates to create the fasta files used for phylogenetic analysis. To illustrate the overall population structure of Mtbc lineages in Botswana, we used IQ-TREE v1.6.12 to reconstruct maximum likelihood phylogenies for each major Mtbc lineage, with isolates possibly containing mixed Mtbc strains excluded from the dataset65. The Hasegawa, Kishino, and Yano (HKY) and General Time Reversible (GTR) models of nucleotide evolution were evaluated with SNP ascertainment bias correction specified66. No major differences were observed between phylogenies generated under the two models; therefore, the HKY model was selected for its simpler assumptions and parameterization. Additionally, time-measured phylogenetic trees for specific lineages included in the coalescent-based analysis were estimated using Bayesian method with details described below.
Coalescent-based demographic reconstruction
We reconstructed Mtbc population dynamics for L1, L2, and L4 sublineages that had at least 70 samples using BEAST v1.10.467. The HKY nucleotide substitution model with gamma-distributed rate heterogeneity (discretized to 4 categories) was used66. We specified a strict molecular clock model, with the substitution rate assumed to be distributed lognormal with a mean of 1 and a sigma of 1.25. This prior translates to 95% of the probability mass between 0.09 and 4.2 SNPs per site per year (s/s/y). We also explored alternative priors more aligned with the published substitution rate of Mtbc68, such as a lognormal distribution with a mean of 0.001 and a sigma of 2 (i.e. 95% of probability mass between 2.4E-10 and 0.001 s/s/y), as well as a uniform distribution with upper and lower bounds set to 5E-7 and 1E-8 s/s/y, respectively. However, the posterior clock rate remained robust to these different prior specifications. To estimate the mycobacterial population size change through time, we adopted the coalescent-based Gaussian Markov random fields Bayesian Skyride tree prior58. This approach offers greater flexibility compared to the Bayesian Skyline prior and does not require strong prior assumptions on the number of population size change points. Furthermore, to account for ascertainment bias, we made manual adjustments in the xml file to incorporate the number of invariant sites. This correction was applied individually for each xml file corresponding to a specific lineage. The Markov chain Monte Carlo (MCMC) chain for each lineage ran for 500 million iterations, with parameters and trees sampled every 50,000th iteration. We verified adequate mixing and posterior convergence using Tracer v1.7.2 after discarding the first 10% as burn-in. For a parameter to be considered sufficiently sampled, we aimed for an effective sample size of at least 200. If any parameter fell short of this threshold, we ran a second chain of 500 million iterations and combined the chains using LogCombiner v1.10.4. We note that L4.3.4 – the largest sublineage in our dataset – was the only one that required a second MCMC chain of 500 million iterations to achieve sufficient mixing and sampling. Finally, a consensus tree summarizing the sampled posterior trees for each lineage was constructed using the maximum clade credibility method.
Genomic clustering analysis
We explored differences in patterns of recent transmission between Mtbc lineages that were included in the coalescent analysis. Genetically similar Mtbc strains under a pre-specified SNP threshold (usually between 5 and 12-SNPs) are interpreted as consistent with recent transmission events60. We assigned cases into clusters, which was defined as two or more individuals with Mtbc strains sharing ≤ 5-SNPs differences, as determined by pairwise SNP comparison60,61. Next, we regressed clustered or non-clustered case on Mtbc lineage in a logistic model, adjusting for potential confounding by age, gender, enrollment district, and HIV status. To evaluate the robustness of our findings, we also conducted logistic regression analysis using a 12-SNPs cutoff for identifying clustered cases. All statistical analysis and phylogenetic tree visualizations were conducted in R version 4.2.069. WGS data from this study has been submitted to the European Nucleotide Archive (ENA) under study accession PRJEB62480.
Data availability
Whole genome sequencing data generated and/or analysed during the current study are available at the European Nucleotide Archive (ENA) under study accession PRJEB62480. The metadata for the Kopanyo study is available from the corresponding author on reasonable request.
References
Organization, W. H. Global Tuberculosis Report 2023 (2023). https://www.who.int/teams/global-tuberculosis-programme/tb-reports/global-tuberculosis-report-2023
Walensky, R. P., Walke, H. T. & Fauci, A. S. SARS-CoV-2 variants of concern in the united States-Challenges and opportunities. JAMA 325, 1037–1038. https://doi.org/10.1001/jama.2021.2294 (2021).
Kinganda-Lusamaki, E. et al. Integration of genomic sequencing into the response to the Ebola virus outbreak in Nord Kivu, Democratic Republic of the congo. Nat. Med. 27, 710–716. https://doi.org/10.1038/s41591-021-01302-z (2021).
Grubaugh, N. D. et al. Genomic epidemiology reveals multiple introductions of Zika virus into the united States. Nature 546, 401–405. https://doi.org/10.1038/nature22400 (2017).
Yuan, D. et al. HIV-1 genetic transmission networks among people living with HIV/AIDS in Sichuan, china: a genomic and Spatial epidemiological analysis. Lancet Reg. Health West. Pac. 18, 100318. https://doi.org/10.1016/j.lanwpc.2021.100318 (2022).
Skaathun, B. et al. HIV-1 transmission dynamics among people who inject drugs on the US/Mexico border during the COVID-19 pandemic: a prosepective cohort study. Lancet Reg. Health Am. 33, 100751. https://doi.org/10.1016/j.lana.2024.100751 (2024).
Attwood, S. W., Hill, S. C., Aanensen, D. M., Connor, T. R. & Pybus, O. G. Phylogenetic and phylodynamic approaches to Understanding and combating the early SARS-CoV-2 pandemic. Nat. Rev. Genet. 23, 547–562. https://doi.org/10.1038/s41576-022-00483-8 (2022).
Rife, B. D. et al. Phylodynamic applications in 21(st) century global infectious disease research. Glob Health Res. Policy. 2(13). https://doi.org/10.1186/s41256-017-0034-y (2017).
Bos, K. I. et al. Pre-Columbian mycobacterial genomes reveal seals as a source of new world human tuberculosis. Nature 514, 494–497. https://doi.org/10.1038/nature13591 (2014).
Comas, I. et al. Out-of-Africa migration and neolithic coexpansion of Mycobacterium tuberculosis with modern humans. Nat. Genet. 45, 1176–1182. https://doi.org/10.1038/ng.2744 (2013).
Coscolla, M. et al. Phylogenomics of Mycobacterium Africanum reveals a new lineage and a complex evolutionary history. Microb. Genom. 7 https://doi.org/10.1099/mgen.0.000477 (2021).
Ngabonziza, J. C. S. et al. A sister lineage of the Mycobacterium tuberculosis complex discovered in the African great lakes region. Nat. Commun. 11, 2917. https://doi.org/10.1038/s41467-020-16626-6 (2020).
Coll, F. et al. A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nat. Commun. 5, 4812. https://doi.org/10.1038/ncomms5812 (2014).
Freschi, L. et al. Population structure, biogeography and transmissibility of Mycobacterium tuberculosis. Nat. Commun. 12, 6099. https://doi.org/10.1038/s41467-021-26248-1 (2021).
Gagneux, S. Host-pathogen Coevolution in human tuberculosis. Philos. Trans. R. Soc. Lond. B Biol. Sci. 367, 850–859. https://doi.org/10.1098/rstb.2011.0316 (2012).
Guerra-Assuncao, J. A. et al. Large-scale whole genome sequencing of M. tuberculosis provides insights into transmission in a high prevalence area. Elife 4 https://doi.org/10.7554/eLife.05166 (2015).
Coscolla, M. & Gagneux, S. Consequences of genomic diversity in Mycobacterium tuberculosis. Semin Immunol. 26, 431–444. https://doi.org/10.1016/j.smim.2014.09.012 (2014).
Mathema, B. et al. Epidemiologic consequences of microvariation in Mycobacterium tuberculosis. J. Infect. Dis. 205, 964–974. https://doi.org/10.1093/infdis/jir876 (2012).
Gagneux, S. et al. Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc. Natl. Acad. Sci. U S A. 103, 2869–2873. https://doi.org/10.1073/pnas.0511240103 (2006).
Zetola, N. M. et al. Protocol for a population-based molecular epidemiology study of tuberculosis transmission in a high HIV-burden setting: the Botswana Kopanyo study. BMJ open. 6, e010046. https://doi.org/10.1136/bmjopen-2015-010046 (2016).
Liu, Q. et al. China’s tuberculosis epidemic stems from historical expansion of four strains of Mycobacterium tuberculosis. Nat. Ecol. Evol. 2, 1982–1992. https://doi.org/10.1038/s41559-018-0680-6 (2018).
Dodd, P. J. et al. Transmission modeling to infer tuberculosis incidence prevalence and mortality in settings with generalized HIV epidemics. Nat. Commun. 14, 1639. https://doi.org/10.1038/s41467-023-37314-1 (2023).
Ramogola-Masire, D. et al. Botswana’s HIV response: Policies, context, and future directions. J. Community Psychol. 48, 1066–1070. https://doi.org/10.1002/jcop.22316 (2020).
Wilkinson, E., Engelbrecht, S. & de Oliveira, T. History and origin of the HIV-1 subtype C epidemic in South Africa and the greater Southern African region. Sci. Rep. 5, 16897. https://doi.org/10.1038/srep16897 (2015).
Corbett, E. L. et al. The growing burden of tuberculosis: global trends and interactions with the HIV epidemic. Arch. Intern. Med. 163, 1009–1021. https://doi.org/10.1001/archinte.163.9.1009 (2003).
Song, J. et al. Population mobility and the development of botswana’s generalized HIV epidemic: a network analysis. MedRxiv https://doi.org/10.1101/2023.02.01.23285339 (2023).
UNAIDS. Botswana: country factsheets, (2004). https://www.unaids.org/en/regionscountries/countries/botswana
Stuckler, D., Basu, S., McKee, M. & Lurie, M. Mining and risk of tuberculosis in sub-Saharan Africa. Am. J. Public. Health. 101, 524–530. https://doi.org/10.2105/AJPH.2009.175646 (2011).
Basu, S., Stuckler, D., Gonsalves, G. & Lurie, M. The production of consumption: addressing the impact of mineral mining on tuberculosis in Southern Africa. Global Health. 5, 11. https://doi.org/10.1186/1744-8603-5-11 (2009).
Fenner, L. et al. HIV infection disrupts the sympatric host-pathogen relationship in human tuberculosis. PLoS Genet. 9, e1003318. https://doi.org/10.1371/journal.pgen.1003318 (2013).
Bank, T. W. Incidence of tuberculosis in Botswana, (2022). https://data.worldbank.org/indicator/SH.TBS.INCD?locations=BW
Chin, R. J., Sangmanee, D. & Piergallini, L. P. E. P. F. A. R. Funding and reduction in HIV infection rates in 12 focus Sub-Saharan African countries: A quantitative analysis. Int. J. MCH AIDS. 3, 150–158 (2015).
Farahani, M. et al. Outcomes of the Botswana National HIV/AIDS treatment programme from 2002 to 2010: a longitudinal analysis. Lancet Glob Health. 2, e44–50. https://doi.org/10.1016/S2214-109X(13)70149-9 (2014).
Mine, M. et al. Progress towards the UNAIDS 95-95-95 targets in the fifth Botswana AIDS impact survey (BAIS V 2021): a nationally representative survey. Lancet HIV. 11, e245–e254. https://doi.org/10.1016/S2352-3018(24)00003-1 (2024).
O’Neill, M. B. et al. Lineage specific histories of Mycobacterium tuberculosis dispersal in Africa and Eurasia. Mol. Ecol. 28, 3241–3256. https://doi.org/10.1111/mec.15120 (2019).
Menardo, F. et al. Local adaptation in populations of Mycobacterium tuberculosis endemic to the Indian Ocean Rim. F1000Res 10, 60 (2021). https://doi.org/10.12688/f1000research.28318.2
Couvin, D., Reynaud, Y. & Rastogi, N. Two tales: worldwide distribution of central Asian (CAS) versus ancestral East-African Indian (EAI) lineages of Mycobacterium tuberculosis underlines a remarkable cleavage for phylogeographical, epidemiological and demographical characteristics. PLoS One. 14, e0219706. https://doi.org/10.1371/journal.pone.0219706 (2019).
Holt, K. E. et al. Frequent transmission of the Mycobacterium tuberculosis Beijing lineage and positive selection for the EsxW Beijing variant in Vietnam. Nat. Genet. 50, 849–856. https://doi.org/10.1038/s41588-018-0117-9 (2018).
Merker, M. et al. Evolutionary history and global spread of the Mycobacterium tuberculosis Beijing lineage. Nat. Genet. 47, 242–249. https://doi.org/10.1038/ng.3195 (2015).
Altan, S. Chinese Workers of the World: Colonialism, Chinese Labor, and the Yunnan–Indochina Railway (Stanford University Press, 2024).
Hanekom, M. et al. Mycobacterium tuberculosis Beijing genotype: a template for success. Tuberculosis (Edinb). 91, 510–523. https://doi.org/10.1016/j.tube.2011.07.005 (2011).
Parwati, I., van Crevel, R. & van Soolingen, D. Possible underlying mechanisms for successful emergence of the Mycobacterium tuberculosis Beijing genotype strains. Lancet Infect. Dis. 10, 103–111. https://doi.org/10.1016/S1473-3099(09)70330-5 (2010).
Gagneux, S. Ecology and evolution of Mycobacterium tuberculosis. Nat. Rev. Microbiol. 16, 202–213. https://doi.org/10.1038/nrmicro.2018.8 (2018).
Stucki, D. et al. Mycobacterium tuberculosis lineage 4 comprises globally distributed and geographically restricted sublineages. Nat. Genet. 48, 1535–1543. https://doi.org/10.1038/ng.3704 (2016).
Mogashoa, T. et al. Genetic diversity of Mycobacterium tuberculosis strains Circulating in Botswana. PLoS One. 14, e0216306. https://doi.org/10.1371/journal.pone.0216306 (2019).
Chihota, V. N. et al. Geospatial distribution of Mycobacterium tuberculosis genotypes in Africa. PLoS One. 13, e0200632. https://doi.org/10.1371/journal.pone.0200632 (2018).
Viegas, S. O. et al. Molecular diversity of Mycobacterium tuberculosis isolates from patients with pulmonary tuberculosis in Mozambique. BMC Microbiol. 10, 195. https://doi.org/10.1186/1471-2180-10-195 (2010).
Brynildsrud, O. B. et al. Global expansion of Mycobacterium tuberculosis lineage 4 shaped by colonial migration and local adaptation. Sci. Adv. 4, eaat5869. https://doi.org/10.1126/sciadv.aat5869 (2018).
Oliver, E. & Oliver, W. H. The colonisation of South africa: A unique case. HTS Theological Stud. 73, 1–8. https://doi.org/10.4102/hts.v73i3.4498 (2017).
Click, E. S. et al. Phylogenetic diversity of Mycobacterium tuberculosis in two geographically distinct locations in Botswana - The Kopanyo study. Infect. Genet. Evol. 81, 104232. https://doi.org/10.1016/j.meegid.2020.104232 (2020).
Botswana, G. Botswana AIDS Impact Survey IV 2021 (BAIS V): Report., (2013). https://www.statsbots.org.bw/sites/default/files/publications/BOTSWANA%20AIDS%20IMPACT%20SURVEY%20IV%202013.pdf
Moonan, P. K. et al. A Neighbor-Based approach to identify tuberculosis Exposure, the Kopanyo study. Emerg. Infect. Dis. 26, 1010–1013. https://doi.org/10.3201/eid2605.191568 (2020).
Smith, J. P. et al. High-Resolution characterization of nosocomial Mycobacterium tuberculosis transmission events in Botswana. Am. J. Epidemiol. 192, 503–506. https://doi.org/10.1093/aje/kwac214 (2023).
Smith, J. P. et al. Characterizing tuberculosis transmission dynamics in high-burden urban and rural settings. Sci. Rep. 12, 6780. https://doi.org/10.1038/s41598-022-10488-2 (2022).
Anderson, J. et al. Sublineages of lineage 4 (Euro-American) Mycobacterium tuberculosis differ in genotypic clustering. Int. J. Tuberc Lung Dis. 17, 885–891. https://doi.org/10.5588/ijtld.12.0960 (2013).
Lopez, M. G. et al. Deciphering the tangible Spatio-Temporal spread of a 25-Year tuberculosis outbreak boosted by social determinants. Microbiol. Spectr. 11, e0282622. https://doi.org/10.1128/spectrum.02826-22 (2023).
Click, E. S., Moonan, P. K., Winston, C. A., Cowan, L. S. & Oeltmann, J. E. Relationship between Mycobacterium tuberculosis phylogenetic lineage and clinical site of tuberculosis. Clin. Infect. Dis. 54, 211–219. https://doi.org/10.1093/cid/cir788 (2012).
Minin, V. N., Bloomquist, E. W. & Suchard, M. A. Smooth skyride through a rough skyline: bayesian coalescent-based inference of population dynamics. Mol. Biol. Evol. 25, 1459–1471. https://doi.org/10.1093/molbev/msn090 (2008).
Hatherell, H. A. et al. Interpreting whole genome sequencing for investigating tuberculosis transmission: a systematic review. BMC Med. 14, 21. https://doi.org/10.1186/s12916-016-0566-x (2016).
Walker, T. M. et al. Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study. Lancet Infect. Dis. 13, 137–146. https://doi.org/10.1016/S1473-3099(12)70277-3 (2013).
Zhang, X. et al. Exploring programmatic indicators of tuberculosis control that incorporate routine Mycobacterium tuberculosis sequencing in low incidence settings: a comprehensive (2017–2021) patient cohort analysis. Lancet Reg. Health West. Pac. 41, 100910. https://doi.org/10.1016/j.lanwpc.2023.100910 (2023).
Botswana, S. Botswana Population and Housing Census (2011). https://www.statsbots.org.bw/sites/default/files/2011%20Population%20and%20housing%20Census.pdf
Zetola, N. M. et al. Population-Based Geospatial and molecular epidemiologic study of tuberculosis transmission Dynamics, Botswana, 2012–2016. Emerg. Infect. Dis. 27, 835–844. https://doi.org/10.3201/eid2703.203840 (2021).
Kohl, T. A. et al. MTBseq: A comprehensive pipeline for whole genome sequence analysis of Mycobacterium tuberculosis complex isolates. PeerJ 6, e5895. https://doi.org/10.7717/peerj.5895 (2018).
Nguyen, L. T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274. https://doi.org/10.1093/molbev/msu300 (2015).
Hasegawa, M., Kishino, H. & Yano, T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174. https://doi.org/10.1007/BF02101694 (1985).
Suchard, M. A. et al. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 4, vey016. https://doi.org/10.1093/ve/vey016 (2018).
Menardo, F., Duchene, S., Brites, D. & Gagneux, S. The molecular clock of Mycobacterium tuberculosis. PLoS Pathog. 15, e1008067. https://doi.org/10.1371/journal.ppat.1008067 (2019).
R. C. T. _R: A Language and Environment for Statistical Computing (2023). https://www.R-project.org/.
Acknowledgements
We would like to thank the participants who made this research possible.
Author information
Authors and Affiliations
Contributions
These authors contributed equally: Sanghyuk S. Shin and Stefan Neimann. Q.W. contributed to project conception, performed the data analysis, and drafted the manuscript. I.B. and S.N. processed the whole genome sequencing data and performed the bioinformatic analysis. C.M., N.Z., P.K.M., A.F., R.B., J.E.O., and T.L.M. contributed to the Kopanyo Study design, participant recruitment, data collection, and quality control. V.M.M., S.S.S., and T.F.B. contributed to project conception, supervised data analysis and interpretation of results. All authors were involved in revising the manuscript and approved the final version.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, Q., Barilar, I., Minin, V.M. et al. Phylodynamic analysis reveals disparate transmission dynamics of Mycobacterium tuberculosis complex lineages in Botswana. Sci Rep 16, 1813 (2026). https://doi.org/10.1038/s41598-025-31425-z
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-025-31425-z




