Fig. 1: General overview of the pipeline and dataset.

a Count of samples per combination of sequencing platforms, by biospecimen type. b Overview of the analytical pipeline used for this study. Bracken abundance estimation was used only with WGS (combining this study and Zhang et al.) and 16S. After decontamination, read counts above the genus level were recursively adjusted (“Methods”). Created in BioRender. McElderry, J. (2025) https://BioRender.com/8kkrqgu. c Total reads assigned to different domains and to the human genome (WGS n = 1176; RNA-seq n = 1203; 16S n = 1264). d log10 bacterial reads per million total reads (including human and other sequences), by sequencing modality and tissue type (WGS n = 811 tumors, 365 normal lung, 447 blood samples; RNA-seq n = 661 tumors, 542 normal lung samples; 16S n = 701 tumors, 563 normal lung samples). e log10 absolute bacterial read counts by sequencing modality and tissue type (WGS n = 811 tumors, 365 normal lung, 447 blood samples; RNA-seq n = 661 tumors, 542 normal lung samples; 16S n = 701 tumors, 563 normal lung samples). f log10 per-million genus-level bacterial reads in the WGS dataset compared with WGS data from other studies. Boxplot centers, upper and lower bounds, and whiskers represent the median, upper and lower quartiles, and quartiles ± 1.5 × interquartile range, respectively. WGS whole genome sequencing, RNA-seq RNA sequencing, 16S 16S rRNA gene sequencing.