Abstract
Respiratory infections pose significant challenges to global health, impacting millions of individuals annually. Understanding the molecular mechanisms underlying the pathogenicity of these infections is crucial for developing effective interventions. RNA sequencing provides insights into a patient’s global transcriptome changes, facilitating the identification of host gene signatures in response to infection and potential therapeutic targets. Here we present an extensive whole blood transcriptome dataset from a demographically diverse cohort of 502 patients with infections including COVID-19, seasonal coronavirus, influenza A or influenza B, sepsis, septic shock, and co-infections (Viral/Viral, Bacterial/Viral, Bacterial/Viral/Fungal, Viral/Fungal, Viral/Viral/Fungal). The cohort size and depth of data showcase its potential to unravel respiratory infection pathogenesis for the development of better diagnostics, treatments, and preventive strategies for respiratory infections and future global health crises.
Background & Summary
Infections continue to impact millions of individuals each year, despite the significant advances that have been made in research and development of antimicrobial therapies. Numerous studies have identified and unravelled the causative organisms underpinning infectious diseases, such as severe-acute-respiratory-syndrome-related coronavirus-2 (SARS-CoV-2). However, our understanding of the complex molecular changes occurring within the human body during infections remains limited. Therefore, there is a critical need for large-scale whole-blood transcriptome datasets from infected patients to gain a comprehensive understanding of the underlying mechanisms of disease progression resulting from infection. Such datasets should capture patient heterogeneity, improve our understanding of the immune response to emerging variants and guide future research and clinical interventions. Increasing evidence suggests that changes in the expression of host immune responses are significant contributors to the development and progression of disease following infections1,2,3,4. Molecular profiling technologies, especially RNA sequencing (RNASeq), stand out as powerful tools capable of unravelling these complex host responses during infections. This approach facilitates the profiling of gene expression changes across diverse biological samples and offers insights into their involvement in disease pathogenesis. With the significant drop in the cost of running RNASeq experiments, this technique has now become a primary step to characterise and compare gene expression and molecular pathways across many types of infections5. It also facilitates a rapid translation of research findings from bench to bedside, given that blood-based transcriptomic analysis is easily accessible and minimally invasive, and it provides a systemic view of the body’s response to infections (https://www.lexogen.com/whole-blood-rna-seq-best-practice/)6,7,8,9.
By analysing blood transcriptomes, researchers can accurately identify differentially expressed genes, discover key molecular signatures or biomarkers, and unravel key pathways underpinning pathogenesis. Due to these advantages, RNASeq has been widely accepted as an indispensable tool in studying immune responses in many physiological and pathological situations10,11.
Respiratory infections, characterised by their dynamic and complex nature, continue to challenge global health and healthcare systems worldwide. This manuscript reports next-generation RNA sequencing within a large-scale, multi-centre framework to interpret the immunological landscape of respiratory infections. To date, we have accumulated a comprehensive whole blood transcriptome dataset obtained from 502 patients with SARS-CoV-2, seasonal coronavirus, influenza A and influenza B, sepsis, septic shock, and co-infection across 11 centres in 5 countries, drawn from various patient populations and covering a diverse range of clinical presentations and settings. In addition, we provide longitudinal data for patients with Coronavirus disease 2019 (COVID-19) and co-infected groups, capturing changes in the transcriptome spanning the entire course of infection with varied disease severity. The objective of this manuscript is to highlight the dataset scope, sample characteristics, and experimental approach and to describe data quality using various assessment metrics. Additionally, it highlights the dataset’s translational potential by providing researchers with a roadmap to understand critical aspects of respiratory infection pathogenesis, thereby facilitating future research in the field. By making this valuable resource open to the scientific community, we expect to promote collaborative research efforts to inform the design of future studies, accelerate discoveries, and contribute to a more thorough understanding of respiratory infections, including SARS-CoV-2. Our goal is to aid the development of future diagnostic and prognostic tools, therapeutic interventions, and preventive strategies to address and combat future global health crises.
Methods
Ethics statement
This multicentre, observational cohort study recruited patients with respiratory infections across different sites in five countries. The study was approved by Human Research Ethics Committees (HRECs) at all participating institutions. Informed consent was obtained from all participants. Further details are provided in the Supplementary Information under the ethics statement.
Study design and participants of human cohorts
A total of 681 samples collected from 502 participants with a respiratory infection are included in this paper. Of these, 322 participants, comprising 301 adults and 21 children with confirmed COVID-19, were enrolled from 10 centres in five countries (Australia, Czech Republic, France, Indonesia, and Singapore) between February 2020 and February 2022. Samples with other respiratory infections including seasonal coronavirus (n = 9), influenza A (n = 55) or influenza B (n = 8), sepsis (n = 17), and septic shock (n = 7) were collected between July 2014 and November 2019 in Australia or Singapore. In addition, this study includes samples collected (2014–2022) from subjects with various co-infections (n = 84), either bacterial/viral (n = 56), bacterial/viral/fungal (n = 4), viral/fungal (n = 17) or viral/viral (n = 7), from Australia, France, and the Czech Republic. Longitudinal samples obtained between two and nine days post-infection were collected for the COVID-19 (65 patients) and co-infection (19 patients) groups. Seventy-two volunteer samples from Australia collected before 2019 were included as healthy controls. A summary of the relevant cohort characteristics (study population and disease demographics) is provided in Table S1 (see Supplementary Information document). Detailed clinical data for each sample are provided in supplementary file 1 ([1] PREDICT-19 clinical data.xlsx).
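As a quick consistency check, the group sizes reported above can be tallied. This is only an illustrative sketch: it assumes that each n value counts participants and that the groups do not overlap, which the text does not state explicitly. The dictionary names are ours.

```python
# Group sizes as reported in the text above (assumed to be participant counts).
cohort = {
    "COVID-19": 322,
    "seasonal coronavirus": 9,
    "influenza A": 55,
    "influenza B": 8,
    "sepsis": 17,
    "septic shock": 7,
    "co-infection": 84,
}

# Breakdown of the co-infection group by pathogen combination.
coinfection = {
    "bacterial/viral": 56,
    "bacterial/viral/fungal": 4,
    "viral/fungal": 17,
    "viral/viral": 7,
}

# COVID-19 participants split into adults and children.
covid_adults, covid_children = 301, 21

total_participants = sum(cohort.values())
total_coinfection = sum(coinfection.values())

print(total_participants)  # infected participants across all groups
print(total_coinfection)   # should match cohort["co-infection"]
```

Under these assumptions the infected groups sum to the 502 participants quoted above, which suggests that the 72 healthy controls are counted separately.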
Eligibility criteria included (1) age equal to or greater than 18 years for adults and less than 18 years for the paediatric cohort, (2) the World Health Organization definition of influenza-like illness (fever of 38 °C or higher, cough, sore throat, nasal congestion, and illness onset within the last ten days), and (3) confirmed infection by appropriate microbiological or virological assays, in addition to the presence of clinical evidence of infection (e.g. physical examination and imaging studies such as chest X-ray). For example, COVID-19 infection was confirmed by virological testing on respiratory samples (nasal swab/throat swab/sputum/bronchoalveolar lavage) by PCR or antigen detection assay, together with signs of respiratory infection (respiratory distress and chest X-ray findings) as assessed by an admitting physician. All control samples in the dataset tested negative for common respiratory infections (bacterial or viral). Study data were collected and managed using Research Electronic Data Capture (REDCap) electronic data capture tools hosted at the University of Sydney12,13.
Blood sample collection and RNA isolation
Blood samples were collected into PAXgene Blood RNA Tubes (2.5 mL blood) (PreAnalytiX, Qiagen, Germany) from participants at the time of study enrolment according to the manufacturer’s supplied protocol. Samples were stored at room temperature for 2 h, then at −20 °C for 24 h, and finally at −80 °C for long-term storage. Total RNA was extracted according to the manufacturer’s instructions and included DNase I treatment (PreAnalytiX, QIAGEN/BD, Switzerland). An aliquot of 4 μl of each extracted total RNA was used for RNA quality control assessments. The concentration and integrity of extracted RNA were evaluated by visualization of 28S and 18S band integrity on a Tapestation 4200 system (Agilent). RNA purity was estimated by examining the OD 260/280 and the OD 260/230 ratios. RNA samples were stored at −80 °C until use. Samples with 100 ng–1 μg of total RNA, a high RNA Integrity Number (RIN > 7), an OD 260/280 nm ratio of 1.8–2.0 and an OD 260/230 nm ratio of 2.0–2.2 were sent for RNASeq.
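The acceptance criteria above can be expressed as a simple filter. The following is a minimal sketch, not the authors' code: the `RnaSample` class, its field names, and the example measurements are hypothetical, and the range bounds are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class RnaSample:
    """Hypothetical container for per-sample RNA QC measurements."""
    sample_id: str
    total_rna_ng: float   # total RNA amount in nanograms
    rin: float            # RNA Integrity Number
    od_260_280: float     # purity ratio (protein contamination)
    od_260_230: float     # purity ratio (salt/solvent contamination)


def passes_rna_qc(s: RnaSample) -> bool:
    """Apply the thresholds quoted in the text (bounds assumed inclusive)."""
    return (
        100.0 <= s.total_rna_ng <= 1000.0   # 100 ng to 1 ug of total RNA
        and s.rin > 7.0                     # RIN strictly greater than 7
        and 1.8 <= s.od_260_280 <= 2.0
        and 2.0 <= s.od_260_230 <= 2.2
    )


# Invented example measurements, for illustration only.
good = RnaSample("S001", total_rna_ng=450.0, rin=8.6, od_260_280=1.9, od_260_230=2.1)
degraded = RnaSample("S002", total_rna_ng=450.0, rin=5.2, od_260_280=1.9, od_260_230=2.1)

print(passes_rna_qc(good), passes_rna_qc(degraded))
```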
Library preparation and RNASeq
Libraries were prepared from 300 ng total RNA using the Illumina Stranded Total RNA Prep with Ribo-Zero Plus with Unique Dual Indexes (Illumina, CA, USA). Briefly, human ribosomal and globin RNA were depleted, the remaining RNA was fragmented (targeting an insert size of ~190 bp), and strand-specific double-stranded cDNA was synthesised. After adapter ligation and indexing, libraries were purified, quality-checked (PerkinElmer GXII), quantified (qPCR), and pooled (32 samples/lane). Sequencing was performed on an Illumina NovaSeq 6000 (150 bp paired-end, S4-300 flow cell), yielding an average of 90.8 million read pairs per sample. Base calling and FASTQ conversion were completed with standard Illumina pipelines: NovaSeq Control Software v1.7.5, RTA v3.4.4, and DRAGEN BCL Convert v3.10.8.
Data pre-processing: sequence reads to count data
Raw RNA sequencing data were quality-controlled and pre-processed into analysis-ready count data using the highly scalable RNASeq-DE workflow (v1.0.0) (https://github.com/Sydney-Informatics-Hub/RNASeq-DE). This workflow uses OpenMPI (v4.1.0) and nci-parallel (v1.0.0a) (https://doi.org/10.1007/978-3-540-30218-6_19) to distribute tasks across multiple compute nodes for compute-efficient, parallel data pre-processing. Default or developer-recommended settings were applied unless otherwise described below. Quality reports for each FASTQ file were obtained using FastQC (v0.11.7) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and summarised with MultiQC (v1.9)14. FASTQ pairs were 3′ adapter and polyA tail trimmed using BBduk trim (v37.98) (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/), leaving an average of 90.6 million trimmed read pairs. The human reference genome GRCh38 primary assembly and gene set release 106 were obtained from Ensembl and prepared with STAR’s GenomeGenerate tool (v2.7.3a)15, with --sjdbOverhang set to 149. Each pair of trimmed FASTQ reads was mapped using STAR to the prepared reference. Sequencing batch-level binary alignment (BAM) files were merged and indexed with SAMtools (v1.10)16 to obtain sample-level BAMs. HTSeq-count (v0.12.4)17 with -s reverse was used to obtain feature-level raw counts. TPMCalculator (v0.0.4)18 was used to obtain TPM-normalized, feature-level counts. The experimental flow chart (study design) is shown in Fig. 1.
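Two numerical details in this step can be illustrated with a short sketch. The sjdbOverhang value of 149 follows the STAR recommendation of read length minus one for 150 bp reads. The `tpm` function below shows the textbook transcripts-per-million formula; it is an assumption that this conveys the idea adequately, as TPMCalculator works from BAM files and its exact handling of features may differ. Gene names, counts, and lengths are invented.

```python
READ_LENGTH_BP = 150
# STAR recommends sjdbOverhang = read length - 1, giving 149 for 150 bp reads.
sjdb_overhang = READ_LENGTH_BP - 1


def tpm(counts, lengths_bp):
    """Textbook TPM: length-normalise each count, then scale to sum to 1e6."""
    # Reads per base for each feature.
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


# Invented toy data: geneB has 3x the reads of geneA but is 3x longer,
# so both end up with the same TPM.
toy_counts = {"geneA": 100, "geneB": 300, "geneC": 0}
toy_lengths = {"geneA": 1000, "geneB": 3000, "geneC": 500}

toy_tpm = tpm(toy_counts, toy_lengths)
print(sjdb_overhang, toy_tpm)
```

The toy example shows why TPM is preferred over raw counts when comparing features of different lengths within a sample. Raw HTSeq counts, not TPM values, remain the appropriate input for DESeq2.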
Signal validation through differential expression
To validate data quality, we performed differential expression analysis between controls and each disease group and confirmed that previously reported gene markers were identified in the current dataset. First, we loaded feature counts generated from HTSeq-count into a DESeq2 object with design = ~1 and transformed the data using variance stabilizing transformation (VST). Principal component analysis (PCA) using prcomp() on VST counts was used to observe variation between disease groups (Fig. 2a) and confirm no unwanted batch effects. The batch effect caused by the collection site was of primary interest, as methods were otherwise applied consistently across samples. The PCA plot grouped by collection sites is shown in Figure S1 (see Supplementary Information document). DESeq2’s dispersion estimates closely follow the fitted trend line, with decreasing dispersion at higher mean counts and no major outliers, indicating that the model provides a good fit for the data (Fig. 2b). Differential expression was then performed with DESeq2 using HTSeq-count data, setting control samples as the base level. Significantly differentially expressed genes were defined as protein-coding genes with adjusted P-value ≤ 0.05 and |log2 FC| ≥ 2 for each pairwise comparison of a disease group against the control. Volcano plots of differential expression results were generated with the package EnhancedVolcano, version 1.18.019, with enhanced colouring and labelling shown in Figure S2 (a-j) (see Supplementary Information document).
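The significance criteria above can be sketched as a small classifier applied to a table of differential expression results. This is an illustration under stated assumptions, not the authors' analysis code (which used DESeq2 in R): the function name, the tuple layout, and all gene names and values are hypothetical, and `protein_coding` is assumed to be the Ensembl biotype label used for filtering.

```python
import math

PADJ_MAX = 0.05   # adjusted P-value threshold from the text
LFC_MIN = 2.0     # |log2 fold change| threshold from the text


def classify_gene(padj, log2_fc, biotype):
    """Return 'up', 'down', or 'ns' (not significant) for one gene."""
    # DESeq2 can report NA for padj; such genes are treated as 'ns' here.
    if padj is None or math.isnan(padj):
        return "ns"
    # Only protein-coding genes are eligible.
    if biotype != "protein_coding":
        return "ns"
    if padj <= PADJ_MAX and abs(log2_fc) >= LFC_MIN:
        return "up" if log2_fc > 0 else "down"
    return "ns"


# Invented rows: (gene, padj, log2FC vs control, biotype).
toy_results = [
    ("GENE_A", 1e-10, 3.2, "protein_coding"),
    ("GENE_B", 0.01, -2.5, "protein_coding"),
    ("GENE_C", 0.04, 1.4, "protein_coding"),   # fold change too small
    ("GENE_D", 1e-6, 4.0, "lncRNA"),           # not protein-coding
    ("GENE_E", float("nan"), 5.0, "protein_coding"),
]

calls = {gene: classify_gene(p, lfc, bt) for gene, p, lfc, bt in toy_results}
print(calls)
```

Because control samples are the base level, a positive log2 fold change here corresponds to higher expression in the disease group.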
Fig. 2 Raw and analysis-ready data quality after mapping and controlling unwanted batch effects. (a) The principal components analysis plot is grouped by disease status. (b) DESeq2-generated dispersion parameter estimates for each gene. (c) Mean quality scores of raw sequencing data per FASTQ file across the 150 bp read length, generated with FastQC and MultiQC. All bases across reads had a mean quality score of Phred 30 or greater, except at position one, where the quality was a minimum of Phred 16 (97.5% accuracy).
The proportion of males to females within each disease group is shown in Figure S3 (see Supplementary Information document).
Data Records
Raw FASTQ data discussed in this publication have been deposited in NCBI’s Sequence Read Archive under BioProject accession PRJNA901461. Count data were deposited to NCBI’s Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4944384/) and are available at GEO Series accession numbers20,21 GSE217948 and GSE282464 (see supplementary file “2 PREDICT-19 RNA-samples and data quality.xlsx”). All samples in PRJNA901461 were collected for the same project as described in this article. The two series represent two different parts of the entire dataset. Samples in GSE217948 were included in previous publications22,23,24,25. In this publication, we provide additional clinical data for these samples in the supplementary file “2 PREDICT-19 clinical data.xlsx.” For some patients, additional longitudinal samples are also available in series GSE282464. All samples in GSE282464 are newly released as part of this article.
Technical Validation
Transcriptome data quality assessment
FastQC reported that all FASTQ files containing raw sequencing data had high per-sequence quality scores (Phred >30) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). This was reflected at each base across reads, except at position one, where quality was slightly lower for some reads (minimum Phred 16, equivalent to 97.5% accuracy, Fig. 2c). Adapter content and over-representation of poly A sequence were detected in the raw data. We trimmed adapters and poly A sequence and confirmed with FastQC and MultiQC that they had been successfully removed.
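The accuracy figures quoted for the Phred scores follow from the standard definition Q = −10 log10(P), where P is the probability of an incorrect base call. A minimal sketch of the conversion (the function name is ours):

```python
def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to base-call accuracy (1 - error probability)."""
    error_probability = 10 ** (-q / 10)
    return 1.0 - error_probability


# Phred 30 corresponds to one error in 1,000 bases;
# Phred 16 corresponds to roughly one error in 40 bases.
acc_q30 = phred_to_accuracy(30)
acc_q16 = phred_to_accuracy(16)

print(f"Q30: {acc_q30:.4%}")
print(f"Q16: {acc_q16:.4%}")
```

Rounded to one decimal place, Phred 16 gives the 97.5% accuracy stated above, and Phred 30 gives 99.9%.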
Mapping metrics obtained by RSeQC’s tools are reported in Supplementary attached Files 2 and 3 ([2] PREDICT-19 RNA samples and data quality and [3] PREDICT19_DataDictionary - see Supplementary Information document). On average, 95.0% of reads were mapped (SD ± 2.2%), 73.4% of which were uniquely mapped (SD ± 7.0%), 17.0% were non-primary hits (SD ± 6.3%), and 36.7% (SD ± 6.5%) mapped to coding sequence on the GRCh38 primary assembly. RSeQC’s infer_experiment.py was used to confirm that libraries were reverse strand aware with >0.7 of reads explained by “1+-,1-+,2++,2--” for all samples. The number of paired reads, read length and mapping characteristics exceed or meet Illumina’s recommendations and ENCODE’s best practice guidelines for profiling global gene expression and obtaining some information on alternative splicing26 (https://knowledge.illumina.com/library-preparation/rna-library-prep/library-preparation-rna-library-prep-reference_material-list/000001243).
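The strandedness check described above can be automated by parsing the text report that infer_experiment.py prints. The sketch below is illustrative only: the helper names are ours, the example report uses invented fractions, and the line format is assumed to match the standard RSeQC output for paired-end data.

```python
import re

# Read-pair orientation pattern that indicates a reverse-stranded library.
REVERSE_KEY = "1+-,1-+,2++,2--"

# Matches lines such as: Fraction of reads explained by "1+-,1-+,2++,2--": 0.9440
_FRACTION_LINE = re.compile(r'Fraction of reads explained by "([^"]+)":\s*([0-9.]+)')


def parse_infer_experiment(report: str) -> dict:
    """Map each orientation pattern in the report to its read fraction."""
    return {m.group(1): float(m.group(2)) for m in _FRACTION_LINE.finditer(report)}


def is_reverse_stranded(report: str, threshold: float = 0.7) -> bool:
    """True if the reverse-strand pattern explains more than `threshold` of reads."""
    fractions = parse_infer_experiment(report)
    return fractions.get(REVERSE_KEY, 0.0) > threshold


# Invented example report for a single sample.
example_report = '''This is PairEnd Data
Fraction of reads failed to determine: 0.0350
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0210
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9440
'''

print(parse_infer_experiment(example_report))
print(is_reverse_stranded(example_report))
```

A reverse-stranded result is what makes the `-s reverse` setting of HTSeq-count appropriate for these libraries.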
Signal validation through differential expression
The PCA plot and volcano plots presented in this paper serve as powerful tools to confirm the quality of our dataset. The PCA plot shows that the data’s variance can be attributed to infection status rather than unwanted batch effects, such as collection site. This is also confirmed by differential expression analysis and identification of gene markers reported to characterise specific infection types. For instance, IFI27 was significantly upregulated in viral infections, consistent with previous reports, where IFI27 was identified as a strong biomarker distinguishing viral from bacterial respiratory infections27. Meanwhile, the volcano plots allow us to visualise significant changes in gene expression between different conditions, providing insights into the dataset’s robustness and reliability. Together, these analyses validate the quality of our dataset and enhance our confidence in its suitability for further study. The data presented here provide a valuable resource to replicate and validate similar findings from other studies (Table S2; see Supplementary Information document; adjusted P-value ≤ 0.05 and |log2 FC| ≥ 2).
Usage Notes
This study’s RNASeq data analysis is limited by the unequal distribution of male and female participants across specific cohorts, especially for the seasonal coronavirus group (which comprised only females) and the co-infected groups. As a result, some sex-specific genes were detected as ‘DEGs’ in contrasts of infected patients versus healthy controls. This sex imbalance may introduce biases, affecting the generalizability of the results. Future research should aim for a more balanced representation of males and females to ensure broader applicability and minimise potential biases. Also, the lack of follow-up data for some infected individuals or groups could limit the analytical power. This dataset (GSE282464 and associated SRA records from PRJNA901461) provides newly released clinical and transcriptomic data that complement prior datasets. While this dataset can be analyzed independently for specific research applications, it is also designed to be interoperable with previously published datasets, particularly PRJNA901461.
Code availability
Highly scalable RNASeq-DE workflow v1.0.0 was used to perform quality assessment and pre-processing of raw RNA-sequencing data to raw counts. The code, tools and versions used are publicly available and accessible, as documented on GitHub (https://github.com/Sydney-Informatics-Hub/RNASeq-DE).
References
Nicholson, L. B. The immune system. Essays Biochem 60, 275–301, https://doi.org/10.1042/ebc20160017 (2016).
Brodin, P. & Davis, M. M. Human immune system variation. Nat Rev Immunol 17, 21–29, https://doi.org/10.1038/nri.2016.125 (2017).
Netea, M. G., Schlitzer, A., Placek, K., Joosten, L. A. B. & Schultze, J. L. Innate and Adaptive Immune Memory: an Evolutionary Continuum in the Host’s Response to Pathogens. Cell Host Microbe 25, 13–26, https://doi.org/10.1016/j.chom.2018.12.006 (2019).
Ochando, J., Mulder, W. J. M., Madsen, J. C., Netea, M. G. & Duivenvoorden, R. Trained immunity - basic concepts and contributions to immunopathology. Nat Rev Nephrol 19, 23–37, https://doi.org/10.1038/s41581-022-00633-5 (2023).
Conesa, A. et al. A survey of best practices for RNA-seq data analysis. Genome Biol 17, 13, https://doi.org/10.1186/s13059-016-0881-8 (2016).
Mohr, S. & Liew, C. C. The peripheral-blood transcriptome: new insights into disease and risk assessment. Trends Mol Med 13, 422–432, https://doi.org/10.1016/j.molmed.2007.08.003 (2007).
Sweeney, T. E., Wong, H. R. & Khatri, P. Robust classification of bacterial and viral infections via integrated host gene expression diagnostics. Sci Transl Med 8, 346ra391, https://doi.org/10.1126/scitranslmed.aaf7165 (2016).
Almansa, R. et al. A host transcriptomic signature for identification of respiratory viral infections in the community. Eur J Clin Invest 51, e13626, https://doi.org/10.1111/eci.13626 (2021).
Shojaei, M. et al. Multisite validation of a host response signature for predicting likelihood of bacterial and viral infections in patients with suspected influenza. Eur J Clin Invest 53, e13957, https://doi.org/10.1111/eci.13957 (2023).
Casamassimi, A., Federico, A., Rienzo, M., Esposito, S. & Ciccodicola, A. Transcriptome Profiling in Human Diseases: New Advances and Perspectives. Int J Mol Sci 18, https://doi.org/10.3390/ijms18081652 (2017).
Kukurba, K. R. & Montgomery, S. B. RNA Sequencing and Analysis. Cold Spring Harb Protoc 2015, 951–969, https://doi.org/10.1101/pdb.top084970 (2015).
Harris, P. A. et al. Research electronic data capture (REDCap)–a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform 42, 377–381, https://doi.org/10.1016/j.jbi.2008.08.010 (2009).
Harris, P. A. et al. The REDCap consortium: Building an international community of software platform partners. J Biomed Inform 95, 103208, https://doi.org/10.1016/j.jbi.2019.103208 (2019).
Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048, https://doi.org/10.1093/bioinformatics/btw354 (2016).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21, https://doi.org/10.1093/bioinformatics/bts635 (2013).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079, https://doi.org/10.1093/bioinformatics/btp352 (2009).
Anders, S., Pyl, P. T. & Huber, W. HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169, https://doi.org/10.1093/bioinformatics/btu638 (2015).
Vera Alvarez, R., Pongor, L. S., Mariño-Ramírez, L. & Landsman, D. TPMCalculator: one-step software to quantify mRNA abundance of genomic features. Bioinformatics 35, 1960–1962, https://doi.org/10.1093/bioinformatics/bty896 (2019).
Blighe, K., Rana, S. & Lewis, M. EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling. R package version 1.22.0 (2024).
NCBI GEO https://identifiers.org/geo/GSE217948 (2025).
NCBI GEO https://identifiers.org/geo/GSE282464 (2025).
Carney, M. et al. Host transcriptomics and machine learning for secondary bacterial infections in patients with COVID-19: a prospective, observational cohort study. Lancet Microbe 5, e272–e281, https://doi.org/10.1016/s2666-5247(23)00363-4 (2024).
Wang, Y. et al. Pathway and Network Analyses Identify Growth Factor Signaling and MMP9 as Potential Mediators of Mitochondrial Dysfunction in Severe COVID-19. Int J Mol Sci 24, https://doi.org/10.3390/ijms24032524 (2023).
Wang, Y. et al. Blood transcriptome responses in patients correlate with severity of COVID-19 disease. Front Immunol 13, 1043219, https://doi.org/10.3389/fimmu.2022.1043219 (2022).
Shojaei, M. et al. IFI27 transcription is an early predictor for COVID-19 outcomes, a multi-cohort observational study. Front Immunol 13, 1060438, https://doi.org/10.3389/fimmu.2022.1060438 (2022).
ENCODE Guidelines and Best Practices for RNA-Seq: Revised. https://doi.org/10.1101/044578 (2016).
Tang, B. M. et al. A novel immune biomarker IFI27 discriminates between influenza and bacteria in patients with suspected respiratory infection. Eur Respir J 49, https://doi.org/10.1183/13993003.02098-2016 (2017).
Acknowledgements
We thank all participants involved in this study. This study was funded by the Snow Medical Research Foundation (BEAT COVID-19) (grant no. CT28701/G207593), the National Health and Medical Research Council (Australian Partnership for Preparedness Research on Infectious Disease Emergencies, APPRISE AppID 1116530), and the Jack Ma Foundation. The authors also declare that this study received funding from A2 Milk Company. The funder had no role in the study design, data collection, analysis, interpretation, writing of this article, or the decision to submit it for publication. The authors wish to acknowledge the Australian Genomics Research Facility (AGRF) for sequencing services, supported by the Australian Government’s National Collaborative Research Infrastructure Strategy through Bioplatforms Australia. RNA sample quality control was performed by Dr Joey Lai at the Westmead Scientific Platforms, which are supported by the Westmead Research Hub, the Cancer Institute New South Wales, the National Health and Medical Research Council and the Ian Potter Foundation. The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney, and the Australian BioCommons, enabled by NCRIS via Bioplatforms Australia. We acknowledge using the National Computational Infrastructure (NCI), supported by the Australian Government, and the Sydney Informatics Hub HPC Allocation Scheme, supported by the Deputy Vice-Chancellor (Research), University of Sydney. We are also especially grateful to Carl Feng, Alice Grey, Angela Ferguson, Jennifer Audsley, Rebecca Burrell, Amith Shetty, and Kevin Lai for their invaluable contributions across various aspects of the project. 
Members of the PREDICT-19 consortium are listed in alphabetical order by first name: Alberto Ballestrero, Allan Cripps, Amanda Cox, Amy L Phu, Andrea De Maria, Anthony McLean, Arutha Kulasinghe, Ben Marais, Benjamin Tang, Carl Feng, Damien Chaussabel, Darawan Rinchai, Davide Bedognetti, Gabriele Zoppoli, Gunawan Gunawan, Irani Thevarajan, Jennifer Audsley, John-Sebastian Eden, Jonathan Iredell, Karan Kim, Kirsty R Short, Klaus Schughart, Mandira Chakraborty, Marcela Kralovcova, Marek Nalos, Marko Radic, Martin Matejovic, Maryam Shojaei, Meagan Carney, Michele Bedognetti, Miroslav Prucha, Mohammed Toufiq, Nandan Deshpande, Narasaraju Teluguakula, Nicholas West, Paolo Cremonesi, Philip N. Britton, Ricardo Garcia Branco, Rodolphe Thiebaut, Rostyslav Bilyy, Sally Teoh, Stephen MacDonald, Tania Sorrell, Thomas Karvunidis, Tiana M. Pelaia, Tim Kwan, Tracy Chew, Velma Herwanto, Win Sen Kuan, Ya Wang, and Yoann Zerbib.
Author information
Contributions
Study concept and design: B.T. and M.S., ethics/governance application: B.T., M.S., Y.W., T.P., S.T., recruitment of participants, sample collection/processing: S.T., T.P., K.K., M.S., P.N.B., T.K., Y.Z., A.D.M., I.T., W.S.K. clinical data collection and REDCap database: B.T., A.M., A.P., T.P., S.T., P.N.B., M.S., RNASeq data pre-processing Q.C. and data analyses: T.C., J.S., KSch and N.D., data interpretation and discussion: T.C., KSch, D.C., A.M., J.S., M.S., manuscript writing and generating figures: T.C. and M.S., manuscript revision: T.C., KSch, B.T. A.M., M.S., D.C., funding acquisition and project supervision: B.T., M.S., A.M. All authors contributed to the article and approved the submitted version. PREDICT-19 Consortium contributed to many aspects of this study, including study concept and design, applications of material transfer agreements, recruitment of participants, sample collections, clinical data collection, setup of REDCap database, data interpretation and discussions.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chew, T., Pelaia, T.M., Phu, A.L. et al. Molecular landscape of respiratory infection: A large-scale, multi-centre blood transcriptome dataset. Sci Data 12, 1175 (2025). https://doi.org/10.1038/s41597-025-05488-6
DOI: https://doi.org/10.1038/s41597-025-05488-6