Intragenic DNA inversions expand bacterial coding capacity

Abstract

Bacterial populations that originate from a single bacterium are not strictly clonal and often contain subgroups with distinct phenotypes¹. Bacteria can generate heterogeneity through phase variation, a preprogrammed, reversible mechanism that alters gene expression levels across a population¹. One well-studied type of phase variation involves enzyme-mediated inversion of specific regions of genomic DNA². Frequently, these DNA inversions flip the orientation of promoters, turning transcription of adjacent coding regions on or off². Through this mechanism, inversion can affect fitness, survival or group dynamics³,⁴. Here, we describe the development of PhaVa, a computational tool that identifies DNA inversions using long-read datasets. We also identify 372 ‘intragenic invertons’, a novel class of DNA inversions found entirely within genes, in genomes of bacterial and archaeal isolates. Intragenic invertons allow a gene to encode two or more versions of a protein by flipping a DNA sequence within the coding region, thereby increasing coding capacity without increasing genome size. We validate ten intragenic invertons in the gut commensal Bacteroides thetaiotaomicron, and experimentally characterize an intragenic inverton in the thiamine biosynthesis gene thiC.

Fig. 1: Short-read metagenomic datasets reveal intragenic invertons in BTh.
Fig. 2: PhaVa analysis of long-read sequencing data from isolates reveals that intragenic inversions are prevalent across the bacterial tree of life.
Fig. 3: PhaVa analysis of 210 long-read metagenomes from human stool.
Fig. 4: Consequences of inversion in thiamine biosynthesis protein.

Data availability

Short-read adult stool sequencing data were previously published and are available under NCBI BioProject ID PRJNA707487. Short-read paediatric stool sequencing data were previously published and are available under NCBI BioProject ID PRJNA787952. Long-read metagenomic sequencing data were previously published and are available under BioProject PRJNA820119 and BioProject PRJNA940499. Assembled metagenomic contigs are available at https://doi.org/10.5281/zenodo.7662825. A list of accession numbers for long-read isolate sequencing data is available in Supplementary Table 3. Mass spectrometry raw files (.d) generated in this study have been deposited to the ProteomeXchange Consortium through the PRIDE partner repository⁷⁷ (project accession PXD054577). Long-read sequencing data for the locked thiC intragenic inverton strains and RNA-sequencing data are available under NCBI BioProject ID PRJNA1118344. Accession codes for long-read datasets are listed in Supplementary Table 3. The reference genome for B. thetaiotaomicron VPI-5482 is the NCBI reference sequence AE015928.1. The reference genome for B. fragilis FDAARGOS_1225 is the NCBI reference sequence NZ_CP069563.1. Source data are provided with this paper.

Code availability

PhaVa is available at https://github.com/patrickwest/PhaVa. Long-read datasets were analysed with PhaVa (v0.1.0) with default parameters.

References

  1. van der Woude, M. W. & Bäumler, A. J. Phase and antigenic variation in bacteria. Clin. Microbiol. Rev. 17, 581–611 (2004).

  2. Trzilova, D. & Tamayo, R. Site-specific recombination—how simple DNA inversions produce complex phenotypic heterogeneity in bacterial populations. Trends Genet. 37, 59–72 (2021).

  3. Zieg, J., Silverman, M., Hilmen, M. & Simon, M. Recombinational switch for gene expression. Science 196, 170–172 (1977).

  4. Stocker, B. A. Measurements of rate of mutation of flagellar antigenic phase in Salmonella typhimurium. J. Hyg. 47, 398–413 (1949).

  5. Meydan, S., Vázquez-Laslop, N. & Mankin, A. S. Genes within genes in bacterial genomes. Microbiol. Spectr. 6, rwr-0020-2018 (2018).

  6. Zhong, A. et al. Toxic antiphage defense proteins inhibited by intragenic antitoxin proteins. Proc. Natl Acad. Sci. USA 120, e2307382120 (2023).

  7. Moxon, R., Bayliss, C. & Hood, D. Bacterial contingency loci: the role of simple sequence DNA repeats in bacterial adaptation. Annu. Rev. Genet. 40, 307–333 (2006).

  8. Sberro, H. et al. Large-scale analyses of human microbiomes reveal thousands of small, novel genes. Cell 178, 1245–1259.e14 (2019).

  9. Schlub, T. E. & Holmes, E. C. Properties and abundance of overlapping genes in viruses. Virus Evol. 6, veaa009 (2020).

  10. Medhekar, B. & Miller, J. F. Diversity-generating retroelements. Curr. Opin. Microbiol. 10, 388–395 (2007).

  11. Andrewes, F. W. Studies in group-agglutination I. The Salmonella group and its antigenic structure. J. Pathol. Bacteriol. 25, 505–521 (1922).

  12. Goldberg, A., Fridman, O., Ronin, I. & Balaban, N. Q. Systematic identification and quantification of phase variation in commensal and pathogenic Escherichia coli. Genome Med. 6, 112 (2014).

  13. Sekulovic, O. et al. Genome-wide detection of conservative site-specific recombination in bacteria. PLoS Genet. 14, e1007332 (2018).

  14. Jiang, X. et al. Invertible promoters mediate bacterial phase variation, antibiotic resistance, and host adaptation in the gut. Science 363, 181–187 (2019).

  15. Milman, O., Yelin, I. & Kishony, R. Systematic identification of gene-altering programmed inversions across the bacterial domain. Nucleic Acids Res. 51, 553–573 (2023).

  16. Komano, T. Shufflons: multiple inversion systems and integrons. Annu. Rev. Genet. 33, 171–191 (1999).

  17. Atack, J. M., Guo, C., Yang, L., Zhou, Y. & Jennings, M. P. DNA sequence repeats identify numerous type I restriction-modification systems that are potential epigenetic regulators controlling phase-variable regulons; phasevarions. FASEB J. 34, 1038–1051 (2020).

  18. Chatzidaki-Livanis, M., Coyne, M. J., Roche-Hakansson, H. & Comstock, L. E. Expression of a uniquely regulated extracellular polysaccharide confers a large-capsule phenotype to Bacteroides fragilis. J. Bacteriol. 190, 1020–1026 (2008).

  19. Taketani, M., Donia, M. S., Jacobson, A. N., Lambris, J. D. & Fischbach, M. A. A phase-variable surface layer from the gut symbiont Bacteroides thetaiotaomicron. mBio 6, e01339-15 (2015).

  20. Troy, E. B., Carey, V. J., Kasper, D. L. & Comstock, L. E. Orientations of the Bacteroides fragilis capsular polysaccharide biosynthesis locus promoters during symbiosis and infection. J. Bacteriol. 192, 5832–5836 (2010).

  21. Severyn, C. J. et al. Microbiota dynamics in a randomized trial of gut decontamination during allogeneic hematopoietic cell transplantation. JCI Insight 7, e154344 (2022).

  22. Siranosian, B. A. et al. Rare transmission of commensal and pathogenic bacteria in the gut microbiome of hospitalized adults. Nat. Commun. 13, 586 (2022).

  23. Martens, E. C., Chiang, H. C. & Gordon, J. I. Mucosal glycan foraging enhances fitness and transmission of a saccharolytic human gut bacterial symbiont. Cell Host Microbe 4, 447–457 (2008).

  24. Martens, E. C., Roth, R., Heuser, J. E. & Gordon, J. I. Coordinate regulation of glycan degradation and polysaccharide capsule biosynthesis by a prominent human gut symbiont. J. Biol. Chem. 284, 18445–18457 (2009).

  25. Krinos, C. M. et al. Extensive surface diversity of a commensal microorganism by multiple DNA inversions. Nature 414, 555–558 (2001).

  26. Porter, N. T. et al. Phase-variable capsular polysaccharides and lipoproteins modify bacteriophage susceptibility in Bacteroides thetaiotaomicron. Nat. Microbiol. 5, 1170–1181 (2020).

  27. Round, J. L. et al. The Toll-like receptor 2 pathway establishes colonization by a commensal of the human microbiota. Science 332, 974–977 (2011).

  28. Neff, C. P. et al. Diverse intestinal bacteria contain putative zwitterionic capsular polysaccharides with anti-inflammatory properties. Cell Host Microbe 20, 535–547 (2016).

  29. Mazmanian, S. K., Liu, C. H., Tzianabos, A. O. & Kasper, D. L. An immunomodulatory molecule of symbiotic bacteria directs maturation of the host immune system. Cell 122, 107–118 (2005).

  30. Porter, N. T., Canales, P., Peterson, D. A. & Martens, E. C. A Subset of polysaccharide capsules in the human symbiont Bacteroides thetaiotaomicron promote increased competitive fitness in the mouse gut. Cell Host Microbe 22, 494–506.e8 (2017).

  31. Musumeci, O. et al. Intragenic inversion of mtDNA: a new type of pathogenic mutation in a patient with mitochondrial myopathy. Am. J. Hum. Genet. 66, 1900–1904 (2000).

  32. Smyshlyaev, G., Bateman, A. & Barabas, O. Sequence analysis of tyrosine recombinases allows annotation of mobile genetic elements in prokaryotic genomes. Mol. Syst. Biol. 17, e9880 (2021).

  33. West, P. T., Chanin, R. B. & Bhatt, A. S. From genome structure to function: insights into structural variation in microbiology. Curr. Opin. Microbiol. 69, 102192 (2022).

  34. van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).

  35. Casino, P., Rubio, V. & Marina, A. The mechanism of signal transduction by two-component systems. Curr. Opin. Struct. Biol. 20, 763–771 (2010).

  36. Loenen, W. A. M., Dryden, D. T. F., Raleigh, E. A. & Wilson, G. G. Type I restriction enzymes and their relatives. Nucleic Acids Res. 42, 20–44 (2014).

  37. De Ste Croix, M. et al. Phase-variable methylation and epigenetic regulation by type I restriction-modification systems. FEMS Microbiol. Rev. 41, S3–S15 (2017).

  38. Chen, L. et al. Short- and long-read metagenomics expand individualized structural variations in gut microbiomes. Nat. Commun. 13, 3175 (2022).

  39. Maghini, D. G. et al. Quantifying bias introduced by sample collection in relative and absolute microbiome measurements. Nat. Biotechnol. 42, 328–338 (2024).

  40. Rodionov, D. A. et al. Micronutrient requirements and sharing capabilities of the human gut microbiome. Front. Microbiol. 10, 1316 (2019).

  41. Sharma, V. et al. B-vitamin sharing promotes stability of gut microbial communities. Front. Microbiol. 10, 1485 (2019).

  42. Yatsunenko, T. et al. Human gut microbiome viewed across age and geography. Nature 486, 222–227 (2012).

  43. Costliow, Z. A. & Degnan, P. H. Thiamine acquisition strategies impact metabolism and competition in the gut microbe Bacteroides thetaiotaomicron. mSystems 2, e00116–17 (2017).

  44. Martinez-Gomez, N. C. & Downs, D. M. ThiC is an [Fe-S] cluster protein that requires AdoMet to generate the 4-amino-5-hydroxymethyl-2-methylpyrimidine moiety in thiamin synthesis. Biochemistry 47, 9054–9056 (2008).

  45. Said, H. M. Intestinal absorption of water-soluble vitamins in health and disease. Biochem. J. 437, 357–372 (2011).

  46. D’Souza, G. et al. Less is more: selective advantages can explain the prevalent loss of biosynthetic genes in bacteria. Evolution 68, 2559–2570 (2014).

  47. Jurgenson, C. T., Ealick, S. E. & Begley, T. P. Biosynthesis of thiamin pyrophosphate. EcoSal Plus https://doi.org/10.1128/ecosalplus.3.6.3.7 (2009).

  48. Rodionov, D. A., Vitreschak, A. G., Mironov, A. A. & Gelfand, M. S. Comparative genomics of thiamin biosynthesis in prokaryotes. J. Biol. Chem. 277, 48949–48959 (2002).

  49. Bacic, M. K. & Smith, C. J. Laboratory maintenance and cultivation of bacteroides species. Curr. Protoc. Microbiol. https://doi.org/10.1002/9780471729259.mc13c01s9 (2008).

  50. Zhu, W. et al. Xenosiderophore utilization promotes Bacteroides thetaiotaomicron resilience during colitis. Cell Host Microbe 27, 376–388.e8 (2020).

  51. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).

  52. Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).

  53. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).

  54. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).

  55. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  56. Yang, C., Chu, J., Warren, R. L. & Birol, I. NanoSim: nanopore sequence read simulator based on statistical characterization. Gigascience 6, gix010 (2017).

  57. Ono, Y., Asai, K. & Hamada, M. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. Bioinformatics 37, 589–595 (2021).

  58. Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).

  59. De Coster, W., D’Hert, S., Schultz, D. T., Cruts, M. & Van Broeckhoven, C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 34, 2666–2669 (2018).

  60. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).

  61. Meng, E. C. et al. UCSF ChimeraX: tools for structure building and analysis. Protein Sci. 32, e4792 (2023).

  62. Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).

  63. Aramaki, T. et al. KofamKOALA: KEGG ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics 36, 2251–2252 (2020).

  64. Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).

  65. Paysan-Lafosse, T. et al. InterPro in 2022. Nucleic Acids Res. 51, D418–D427 (2023).

  66. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. 57, 289–300 (1995).

  67. Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A. & Korobeynikov, A. Using SPAdes de novo assembler. Curr. Protoc. Bioinformatics 70, e102 (2020).

  68. Lin, Y. et al. Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl Acad. Sci. USA 113, E8396–E8405 (2016).

  69. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).

  70. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019).

  71. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).

  72. Skowronek, P. et al. Rapid and in-depth coverage of the (phospho-)proteome with deep libraries and optimal window design for dia-PASEF. Mol. Cell. Proteomics 21, 100279 (2022).

  73. Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017).

  74. Demichev, V., Messner, C. B., Vernardis, S. I., Lilley, K. S. & Ralser, M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat. Methods 17, 41–44 (2020).

  75. MacLean, B. et al. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 26, 966–968 (2010).

  76. Pino, L. K. et al. The Skyline ecosystem: informatics for quantitative mass spectrometry proteomics. Mass Spectrom. Rev. 39, 229–244 (2020).

  77. Perez-Riverol, Y. et al. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 50, D543–D552 (2022).

Acknowledgements

The authors thank D. Schmidtke, A. Natarajan, J. D. Shanahan, D. Maghini, M. Dvorak, A. Han, M. Chakraborty, X. Jin, E. Martens and Bhatt laboratory members for helpful conversations and scientific advice regarding this project; W. Zhu for plasmids (pNBU2_tet and pNBU2_erm); S. Winter for the DH5α strain; and D. Haft and F. Thibaud-Nissen for helpful discussion about accessing SRA long-read datasets. Funding was provided as follows: National Institutes of Health R01 AI148623 (A.S.B.), National Institutes of Health R01 AI143757 (A.S.B.), Stand Up 2 Cancer Foundation (A.S.B. and I.M.C.), National Institutes of Health R01 AI174515 (I.M.C.), National Institutes of Health T32 training Grant HG000044 (R.B.C.), The AP Giannini Foundation (R.B.C.), National Institutes of Health T32 training Grant HL120824 (P.T.W.), National Science Foundation Graduate Research Fellowship (M.O.G., A.S.H. and N.E.), Stanford DARE fellowship (N.E.), and National Institutes of Health TL1 training Award TL1TR003019 (K.K.L.). Computing costs were supported, in part, by an NIH S10 Shared Instrumentation grant (1S10OD02014101).

Author information

Contributions

R.B.C., P.T.W. and A.S.B. conceived and designed the study. Data acquisition and processing were performed by R.B.C., P.T.W., J.W., R.M.P., G.Z.M.G., A.S.H., M.O.G., E.F.B., A.M.M., N.E. and K.K.L. Data were visualized by R.B.C., P.T.W., J.W., R.M.P., K.K.L. and M.O.G. Data interpretation was done by R.B.C., P.T.W., J.W., M.O.G., K.K.L. and A.S.B. Funding acquisition was done by A.S.B. and I.M.C. Writing of the original draft was performed by R.B.C., P.T.W. and A.S.B. All authors contributed to the review and editing of this manuscript.

Corresponding author

Correspondence to Ami S. Bhatt.

Ethics declarations

Competing interests

P.T.W. is a contract bioinformatician at Oxford Nanopore Technologies; this position started during the review process. The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Inverton detection and confirmation in BTh.

(A) Inversion proportions of CPS loci in metagenomic samples measured with PhaseFinder (Top) and PhaVa (Bottom). Samples with no inversions in the five CPS invertons were removed. (B) Schematic of PCR confirmation. Forward and Reverse primers bind to regions of the genome upstream and downstream of the inverton on opposite strands. The Common primer binds the DNA inside of the inverton, between the inverted repeats. When the DNA is in the forward orientation (left), the Common and Forward primer will generate a PCR product. When the inverton flips, the Common and Reverse primer will generate a PCR product (right). HCT, hematopoietic cell transplantation.
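The three-primer logic in panel B can be summarized as a small lookup. The following Python sketch is illustrative only: the dictionary and function names are hypothetical, and it models nothing beyond the primer pairing described in the caption.

```python
# Illustrative model of the three-primer PCR assay; all names are hypothetical.
# Per the caption, the Common primer pairs with the Forward primer when the
# inverton is in the forward orientation, and with the Reverse primer when flipped.
PRIMER_PAIR_BY_ORIENTATION = {
    "forward": ("Common", "Forward"),
    "reverse": ("Common", "Reverse"),
}


def orientations_from_bands(forward_band: bool, reverse_band: bool) -> set:
    """Infer which inverton orientations are present from the two PCR reactions."""
    present = set()
    if forward_band:
        present.add("forward")
    if reverse_band:
        present.add("reverse")
    return present
```

A mixed population in which both orientations exist would be expected to give a product in both reactions, so `orientations_from_bands(True, True)` returns both labels.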

Extended Data Fig. 2 Taxonomic composition of short-read samples from Siranosian et al. 2020.

(A-B) Taxonomic distribution for samples at the genus level. Individual reads were taxonomically classified with Kraken2 using a Genbank reference set. Relative abundances were estimated with Bracken. Genera that represented less than 2% estimated relative abundance in a given sample were collapsed into ‘other’ for plotting. (A) Samples without detected BTh intragenic inversions and (B) samples with detected BTh intragenic inversions are shown. (C) The distribution of genus level Shannon diversity calculated for individual samples. Samples are grouped by presence or absence of BTh intragenic inversions. Center line represents the mean Shannon diversity of grouped samples, boxes represent quartiles, and whiskers extend to the furthest datapoint in either direction within 1.5 times the interquartile range.
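The two calculations mentioned in this caption, collapsing rare genera for plotting and computing Shannon diversity, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the natural logarithm is an assumption (the caption does not state the log base), and diversity is computed here on uncollapsed abundances.

```python
import math


def collapse_rare(abundances: dict, threshold: float = 0.02) -> dict:
    """Convert to relative abundances and pool genera below `threshold` into 'other'."""
    total = sum(abundances.values())
    collapsed = {}
    for genus, value in abundances.items():
        fraction = value / total
        key = genus if fraction >= threshold else "other"
        collapsed[key] = collapsed.get(key, 0.0) + fraction
    return collapsed


def shannon_diversity(abundances: dict) -> float:
    """Shannon index H = -sum(p * ln p) over genera with non-zero abundance."""
    total = sum(abundances.values())
    return -sum(
        (value / total) * math.log(value / total)
        for value in abundances.values()
        if value > 0
    )
```

For example, a sample split evenly between two genera has a Shannon diversity of ln 2, and a genus at 1% relative abundance is pooled into ‘other’.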

Extended Data Fig. 3 Developing and optimizing PhaVa, a long-read based, accurate inverton caller.

(A) Schematic of the PhaVa workflow. Putative invertons are identified and long reads are mapped to both a forward (highlighted by the black dashed lines) and reverse orientation (highlighted by the grey dashed lines) version of the inverton and surrounding genomic sequence. Reads that do not map across the entire inverton and into the flanking sequence on either side or have poor mapping characteristics are removed. See Methods for details. (B-C) Optimizing cutoffs for the minimum number of reverse reads, as both a raw number and percentage of all reads, to reduce false positive inverton calls with simulated reads. Cell color and number represent (B) the false positive rate per simulated readset and (C) the total number of unique false positives across all simulated datasets. (D) False positives in simulated data plotted per species. All measurements were made with a minimum reverse-read cutoff of three reads while varying the minimum reverse-read percentage cutoff. Dashed line indicates the minimum reverse-read percentage cutoff used for isolate and metagenomic datasets. Solid lines indicate the sample mean while colored bands indicate the 95% confidence interval. (E) Output tables of particular interest are labeled and shown below the diagram with example output.
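The two-part cutoff described in panels B-D (a minimum raw count of reverse reads plus a minimum percentage of reverse reads) can be sketched as below. This is not PhaVa's implementation: the class and function names are hypothetical, and the fraction cutoff is left as a required argument because this caption does not state the value used in the study.

```python
from dataclasses import dataclass


@dataclass
class InvertonCounts:
    """Reads retained after filtering, split by the orientation they mapped to."""
    inverton_id: str
    forward_reads: int
    reverse_reads: int


def passes_cutoffs(
    counts: InvertonCounts,
    min_reverse_fraction: float,
    min_reverse_reads: int = 3,
) -> bool:
    """Call an inverton only if reverse reads clear both the count and fraction cutoffs."""
    total = counts.forward_reads + counts.reverse_reads
    if total == 0:
        return False
    reverse_fraction = counts.reverse_reads / total
    return (
        counts.reverse_reads >= min_reverse_reads
        and reverse_fraction >= min_reverse_fraction
    )
```

Requiring both conditions means that a handful of stray reverse reads at very high coverage, or a high reverse percentage supported by only one or two reads, does not produce a call. In the tests below, 0.01 is a placeholder fraction chosen for illustration only.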

Extended Data Fig. 4 Benchmarking PhaVa with pbsim2 simulated reads.

(A) Density plot of false positive rate vs false negative rate for 100 bacterial species. Three 100x replicates were generated per species. (B) Scatterplot of simulated reads from 8 different E. coli genomes, with varying ANI to a singular reference E. coli genome, showing reduced coverage (and thus reduced detection) of invertons when mapping to a distant reference. Three 100x replicates were generated per genome. (C) False positive rate for read sets simulated for both E. coli and BTh and varying coverage levels. Three replicates were generated per coverage level. Error bars represent standard deviation between replicates. (D) False positive rates in readsets simulated from E. coli with varying mean read length. Three 100x replicates were generated per mean read length. Error bars represent standard deviation between replicates.

Extended Data Fig. 5 Very long (>750 bp), near perfect, inverted repeats can lead to false positives.

(A) Alignment of inverton NZ_CP025371.1:2124719-2124870-2125316-2125467, with its invertible sequence flipped, against the B. pertussis genome leads to perfect alignment of flanking and IR regions as expected. ‘Reference genome’ refers to the B. pertussis reference genome sequence. ‘Inverton reversed’ refers to the putative inverton sequence and flanking sequence, with the invertible sequence inverted. Red dashed lines indicate boundaries of the invertible sequence, black dashed lines indicate boundaries of the inverted repeats as detected by einverted, and purple dashed lines indicate the true boundary of inverted repeats. (B) Alignment of the reverse complement of the entire inverton NZ_CP025371.1:2124719-2124870-2125316-2125467 with its invertible sequence inverted and flanking sequence, against the B. pertussis genome leads to near perfect alignment (6 mismatches) spanning far into the flanking sequence to the true boundary of the inverted repeats, allowing for reads to map regardless of inverton orientation. (C) Example with toy nucleotide sequences. Red nucleotides indicate mismatches.
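The reason long inverted repeats cause trouble can be shown with toy sequences. The Python sketch below uses invented sequences and hypothetical function names. It demonstrates that, for perfect inverted repeats, the reverse complement of the flipped inverton region is identical to the forward region, so only the flanking sequence distinguishes the two orientations. If the true repeats extend into what was treated as flank, as in this figure, a read can align in either orientation.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def inverton_region(ir: str, invertible: str, inverted: bool = False) -> str:
    """Build IR + invertible sequence + reverse-complemented IR, optionally flipped."""
    middle = reverse_complement(invertible) if inverted else invertible
    return ir + middle + reverse_complement(ir)


# Toy sequences, invented for illustration.
ir = "GATTACA"
invertible = "CCGTAGGT"

forward = inverton_region(ir, invertible)
flipped = inverton_region(ir, invertible, inverted=True)

# With perfect repeats, reverse-complementing the flipped region recovers the
# forward region exactly; the two differ only in their flanking context.
same_after_rc = reverse_complement(flipped) == forward
```

Here `same_after_rc` evaluates to `True`, while `flipped` itself differs from `forward`.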

Extended Data Fig. 6 Overview of SRA long-read isolate sequencing samples analyzed with PhaVa.

(A) The number of unique species represented in the dataset, grouped by phylum. (B) The raw number of sequencing samples, grouped by phylum. (C) Histogram of sequencing samples per species. Species with large numbers of samples are labeled. (D) A histogram of sequencing depths for all long-read isolate sequencing samples.

Extended Data Fig. 7 Intragenic invertons that recode proteins are identified in long-read isolate datasets.

(A-C) Genome diagrams for recoding intragenic invertons are shown. Grey boxes indicate annotated protein domains. Black lines indicate the region contained within the inverton. AlphaFold structures of the forward (dark blue) and reverse (light blue) orientations are shown. Amino acids affected by the inverton are shown in pink. (A) slmA nucleoid occlusion factor in Bordetella bronchiseptica. RMSD 26.287 angstroms across all pairs. pLDDT forward: 94.5. pLDDT reverse: 51.3. (B) barA two-component sensor histidine kinase in Aeromonas hydrophila. The Receiver and HPt domains are shown. RMSD 5.492 angstroms across all pairs. pLDDT forward: 83.4. pLDDT reverse: 76.2. (C) Type I restriction enzyme S subunits hsdS1 and hsdS2 in Mycoplasma hominis. RMSD 1.809 and 4.167 angstroms across all pairs, respectively. pLDDT hsdS1 forward: 90.8, reverse: 87. pLDDT hsdS2 forward: 92.4, reverse: 84.2. HTH, helix-turn-helix; HK, histidine kinase; HPt, histidine phosphotransfer domain; TRD, target recognition domains.

Extended Data Fig. 8 Intragenic invertons are rare across genomes yet consistently enriched in some Pfam clans.

(A) Histograms showing the number of clades (genomes, species, or genera) at various numbers of invertons indicate that invertons are rare, as only one to three invertons can be detected in the majority of clades. Only clades with at least five invertons (red line; number of clades is indicated in the top-right corner of each subplot) were included for the subsequent enrichment analysis. (B) KEGG pathways and Pfam clans were tested for enrichment of intragenic (or partial intergenic) invertons in included clades, using a one-sided Fisher’s exact test per clade (see Methods). Enrichment was only calculated for sets with at least five invertons associated with genes in the set. Histograms show the number of sets with an enrichment score at each number of included clades, showing that most enrichments could be calculated for single clades only. For example, all KEGG pathways associated with enough intragenic invertons for an enrichment analysis at the genome level were specific to a single genome. Sets with enrichment scores across at least five clades (red line) are labeled with their corresponding identifiers. (C) Heatmap showing the log-odds ratio (effect size for the enrichment of intragenic invertons) across included clades for the six Pfam clans that have enrichment scores at the genus level (see panel B). Stars indicate significance of the enrichment as calculated by Fisher’s exact test and corrected for multiple hypothesis testing using the Benjamini-Hochberg procedure.
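The statistical steps named in this caption, a one-sided Fisher's exact test followed by Benjamini-Hochberg correction, can be sketched with the standard library alone. This is a generic illustration rather than the authors' pipeline. The function names are hypothetical, and the layout of the 2x2 table (genes in the set with and without an inverton versus genes outside the set with and without an inverton) is an assumption.

```python
import math


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for enrichment on the table [[a, b], [c, d]].

    Assumed layout: a = in set with inverton, b = in set without,
    c = outside set with inverton, d = outside set without.
    Returns P(X >= a) under the hypergeometric null.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = math.comb(row1 + row2, col1)
    upper = min(row1, col1)
    numerator = sum(
        math.comb(row1, k) * math.comb(row2, col1 - k)
        for k in range(a, upper + 1)
    )
    return numerator / denominator


def benjamini_hochberg(pvalues: list) -> list:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

Applying `fisher_exact_greater` once per gene set and then passing the resulting p-values to `benjamini_hochberg` mirrors the order of operations described in the caption.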

Extended Data Fig. 9 Locked thiC intragenic inverton construction and growth competition.

(A) Generation of locked intragenic invertons. The forward and locked forward thiC inverted repeat (IR) nucleotide sequences are shown. When possible, the wobble position of each codon corresponding to the IR was mutated to increase mismatches between the two palindromic sequences while maintaining the amino acid sequence. (B) Mutated nucleotides are highlighted in grey. (C) Locked thiC strains were competed against each other in thiamine-containing media in a 1:1 ratio. After 40 h, the abundance of each strain was enumerated using selective agar. Black bars indicate the locked forward strain and white bars indicate the locked reverse strain. Recovered abundances shown here correspond with the competitive index shown in Fig. 4d. Left: the locked forward strain is marked with an erythromycin resistance cassette and the locked reverse strain is marked with a tetracycline resistance cassette. Right: the locked forward strain is marked with a tetracycline resistance cassette and the locked reverse strain is marked with an erythromycin resistance cassette. Geometric mean and geometric standard deviation are shown. Each dot represents an individual replicate. Experiments were done in biological duplicate or triplicate and repeated 4 or 6 times. A two-tailed ratio paired t test was performed on the locked forward and locked reverse abundances. ***, p < 0.001; **, p < 0.01; *, p < 0.05.
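The summary statistics reported in panel C, geometric mean and geometric standard deviation, are computed on log-transformed abundances. The sketch below is a generic illustration with a hypothetical function name; using the sample (n - 1) standard deviation of the logs is an assumption, and the competitive index itself is not reproduced here because its formula is not given in this caption.

```python
import math
import statistics


def geometric_mean_sd(values: list) -> tuple:
    """Return (geometric mean, geometric SD) of strictly positive values."""
    if len(values) < 2 or any(v <= 0 for v in values):
        raise ValueError("need at least two strictly positive values")
    logs = [math.log(v) for v in values]
    geo_mean = math.exp(statistics.fmean(logs))
    # Assumption: sample standard deviation (n - 1) of the log-values.
    geo_sd = math.exp(statistics.stdev(logs))
    return geo_mean, geo_sd
```

Working on the log scale suits abundances that span orders of magnitude, and it is the same scale on which a ratio paired t test compares the two strains.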


Extended Data Fig. 10 Detection of ThiC unique peptides in the forward and reverse orientation.

(A) Schematic showing the ThiC protein sequence in the forward and predicted reverse open reading frames (ORF1, ORF2, ORF3). Arrows indicate the direction of transcription. Shared colors between the forward and reverse ORFs indicate identical amino acids. The dark orange box in ORF1 is the ThiC reverse specific peptide. (B) Mass spectrometry quantification, by data-independent acquisition, of ThiC tryptic peptides aligned with each ORF. NCPVPVGTVPIYQALEK includes cysteine carbamidomethylation. (C) Quantification of the unique ThiC tryptic peptides that align exclusively to the forward or reverse sequences. Representative extracted ion chromatograms show the identified fragment ions applied in the quantification of the forward and reverse peptides, as detected in WT and LR, respectively. Representative MS/MS spectra of the unique ThiC forward peptide GDVEQLPEITSEYGQMR detected in WT and unique ThiC predicted inverton peptide GDVEQLPEITSEYGQIR detected in LR. Spectra include the MS1 precursor ion (orange), as well as b- and y- fragment ions (blue and red, respectively). ND, not detected; WT, wild-type; LF, locked forward thiC intragenic inverton; UF, unlocked forward thiC intragenic inverton; LR, locked reverse thiC intragenic inverton; UR, unlocked reverse thiC intragenic inverton.


Extended Data Fig. 11 Effect of thiamine on BTh invertases and thiamine biosynthesis and uptake loci.

(A) Log2RPKM values for all annotated invertases in BTh are shown across the thiC mutant backgrounds and thiamine concentrations. Axis is clustered with pheatmap’s default parameters (Euclidean). (B) Log2RPKM values for transcript levels of all genes in the thiamine biosynthesis pathway (BT0647-0653) and genes involved in uptake of thiamine (BT2390, BT2396) are denoted. (C) Intensity of ThiH protein as determined by mass spectrometry using data-independent acquisition. UR, unlocked thiC reverse; LR, locked thiC reverse.
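For readers unfamiliar with the unit in panels A and B, the sketch below computes RPKM (reads per kilobase of transcript per million mapped reads) and its log2 transform. The function names and the pseudocount of 1 added before the logarithm are assumptions; this legend does not state whether or how zero counts were handled.

```python
import math

def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

def log2_rpkm(read_count, gene_length_bp, total_mapped_reads, pseudocount=1.0):
    """log2(RPKM + pseudocount); the pseudocount is an assumption to keep zeros finite."""
    return math.log2(rpkm(read_count, gene_length_bp, total_mapped_reads) + pseudocount)

# Hypothetical gene: 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
print(log2_rpkm(500, 2000, 10_000_000))
```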


Extended Data Fig. 12 Thiamine is not a strong driver of thiC intragenic inverton flipping.

(A) Schematic denoting the generation of unlocked thiC strains and the experimental setup for the long-term thiamine exposure assay. (B) Results of the long-term thiamine exposure assay. Top – two replicates of the unlocked reverse thiC strain that were serially cultured in low thiamine (0.001 μM). Bottom – two replicates of the unlocked forward thiC strain that were serially cultured in high thiamine (10 μM). LOD, limit of detection; O/N, overnight culture; ND, not detected.


Supplementary information

Supplementary Figure 1

Full image of the gel from Fig. 4b.

Reporting Summary

Peer Review File

Supplementary Table 1

B. fragilis intragenic invertons identified from short-read metagenomic sequencing samples.

Supplementary Table 2

Sheet 1: NanoSim simulated read datasets. Sheet 2: pbsim2 simulated read datasets.

Supplementary Table 3

Sheet 1: Invertons identified from long-read isolate sequencing samples. Sheet 2: Intragenic invertons identified from long-read isolate sequencing samples. Sheet 3: List of accession numbers and associated metadata for long-read isolate sequencing samples. Sheet 4: Archaeal invertons identified from long-read isolate sequencing samples.

Supplementary Table 4

Dereplicated invertons identified from long-read metagenomic sequencing samples.

Supplementary Table 5

Differentially expressed genes across thiamine concentrations.

Supplementary Table 6

Sheet 1: Additional sequences. Sheet 2: Primers used in this study. Sheet 3: Strains used in this study. Sheet 4: Recombinant DNA.

Supplementary Table 7

Variable isolation windows used in dia-PASEF method.


Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chanin, R.B., West, P.T., Wirbel, J. et al. Intragenic DNA inversions expand bacterial coding capacity. Nature 634, 234–242 (2024). https://doi.org/10.1038/s41586-024-07970-4


  • DOI: https://doi.org/10.1038/s41586-024-07970-4
