Molecular landscape of respiratory infection: A large-scale, multi-centre blood transcriptome dataset

Chew, Tracy; Pelaia, Tiana M.; Phu, Amy L.; Teoh, Sally; Wang, Ya; Deshpande, Nandan; Kim, Karan; Herwanto, Velma; Gunawan; Karvunidis, Thomas; Zerbib, Yoann; Short, Kirsty R.; Macdonald, Stephen; Thevarajan, Irani; Rinchai, Darawan; Kuan, Win Sen; Knippenberg, Ben; Iredell, Jonathan; Britton, Philip N.; Hutchings, Owen; Britton, Warwick J.; Eden, John-Sebastian; Orde, Sam; Tang, Benjamin; McLean, Anthony; Chaussabel, Damien; Schughart, Klaus; Shojaei, Maryam

doi:10.1038/s41597-025-05488-6

Data Descriptor
Open access
Published: 10 July 2025

Tracy Chew^1,na1,
Tiana M. Pelaia^2,na1,
Amy L. Phu^3,4,
Sally Teoh^2,
Ya Wang^2,5,6,
Nandan Deshpande^1,
Karan Kim^5 (ORCID: orcid.org/0000-0003-3898-3308),
Velma Herwanto^2,5,7,
Gunawan^8,
Thomas Karvunidis^9,
Yoann Zerbib^10,
Kirsty R. Short^11,
Stephen Macdonald^12,
Irani Thevarajan^13,14,
Darawan Rinchai^15,16,
Win Sen Kuan^17,18,
Ben Knippenberg^19,
Jonathan Iredell^20,21,22,23,
Philip N. Britton^24,25,
Owen Hutchings^26,
Warwick J. Britton^27,28,
John-Sebastian Eden^29 (ORCID: orcid.org/0000-0003-1374-3551),
Sam Orde^2,
Benjamin Tang^5 (ORCID: orcid.org/0000-0002-1469-9540),
Anthony McLean^2,6,
Damien Chaussabel^30,31 (ORCID: orcid.org/0000-0002-6131-7242),
Klaus Schughart^32,33 (ORCID: orcid.org/0000-0002-6824-7523) &
Maryam Shojaei^2,5,6 (ORCID: orcid.org/0000-0002-5400-4029)

Scientific Data volume 12, Article number: 1175 (2025)

Abstract

Respiratory infections pose significant challenges to global health, impacting millions of individuals annually. Understanding the molecular mechanisms underlying the pathogenicity of these infections is crucial for developing effective interventions. RNA sequencing provides insights into a patient’s global transcriptome changes, facilitating the identification of host gene signatures in response to infection and potential therapeutic targets. Here we present an extensive whole blood transcriptome dataset from a demographically diverse cohort of 502 patients with infections including COVID-19, seasonal coronavirus, influenza A or influenza B, sepsis, septic shock, and co-infections (Viral/Viral, Bacterial/Viral, Bacterial/Viral/Fungal, Viral/Fungal, Viral/Viral/Fungal). The cohort size and depth of data showcase its potential to unravel respiratory infection pathogenesis for the development of better diagnostics, treatments, and preventive strategies for respiratory infections and future global health crises.

Background & Summary

Infections continue to impact millions of individuals each year, despite the significant advances that have been made in research and development of antimicrobial therapies. Numerous studies have identified and characterised the causative organisms underpinning infectious diseases, such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However, our understanding of the complex molecular changes occurring within the human body during infections remains limited. Therefore, there is a critical need for large-scale whole-blood transcriptome datasets from infected patients to gain a comprehensive understanding of the underlying mechanisms of disease progression resulting from infection. Such datasets should capture patient heterogeneity, improve our understanding of the immune response to emerging variants and guide future research and clinical interventions. Increasing evidence suggests that changes in host immune responses are significant contributors to the development and progression of disease following infections^1,2,3,4. Molecular profiling technologies, especially RNA sequencing (RNASeq), stand out as powerful tools capable of unravelling these complex host responses during infections. This approach facilitates the profiling of gene expression changes across diverse biological samples and offers insights into their involvement in disease pathogenesis. With the significant drop in the cost of running RNASeq experiments, this technique has now become a primary step to characterise and compare gene expression and molecular pathways across many types of infections⁵. It also facilitates a rapid translation of research findings from bench to bedside, given that blood-based transcriptomic analysis is easily accessible and non-invasive, and it provides a systemic view of the body’s response to infections (https://www.lexogen.com/whole-blood-rna-seq-best-practice/)^6,7,8,9. By analysing blood transcriptomes, researchers can accurately identify differentially expressed genes, discover key molecular signatures or biomarkers, and unravel key pathways underpinning pathogenesis. Due to these advantages, RNASeq has been widely accepted as an indispensable tool in studying immune responses in many physiological and pathological situations^10,11.

Respiratory infections, characterised by their dynamic and complex nature, continue to pose significant challenges to global health, affecting people worldwide and straining healthcare systems. This manuscript reports next-generation RNA sequencing within a large-scale, multi-centre framework to interpret the immunological landscape of respiratory infections. To date, we have accumulated a comprehensive whole blood transcriptome dataset obtained from 502 patients with SARS-CoV-2, seasonal coronavirus, influenza A and influenza B, sepsis, septic shock, and co-infection across 11 centres in 5 countries, drawn from various patient populations and covering a diverse range of clinical presentations and settings. In addition, we provide longitudinal data for patients with coronavirus disease 2019 (COVID-19) and co-infected groups, capturing changes in the transcriptome over the course of infection with varied disease severity. The objective of this manuscript is to highlight the dataset scope, sample characteristics, and experimental approach and to describe data quality using various assessment metrics. Additionally, it highlights the dataset's translational potential by providing researchers with a roadmap to understand critical aspects of respiratory infection pathogenesis, thereby facilitating future research in the field. By making this valuable resource open to the scientific community, we expect to promote collaborative research efforts to inform the design of future studies, accelerate discoveries, and contribute to a more thorough understanding of respiratory infections, including SARS-CoV-2. Our goal is to aid the development of future diagnostic and prognostic tools, therapeutic interventions, and preventive strategies to address and combat future global health crises.

Methods

Ethics statement

This multicentre, observational cohort study recruited patients with respiratory infections across different sites in five countries. The study was approved by Human Research Ethics Committees (HRECs) at all participating institutions. Informed consent was obtained from all participants. Further details are provided in the Supplementary Information under the ethics statement.

Study design and participants of human cohorts

A total of 681 samples collected from 502 participants with a respiratory infection are included in this paper. Of these, 322 participants, comprising 301 adults and 21 children with confirmed COVID-19, were enrolled from 10 centres across five countries (Australia, Czech Republic, France, Indonesia, and Singapore) between February 2020 and February 2022. Samples from participants with other respiratory infections, including seasonal coronavirus (n = 9), influenza A (n = 55) or influenza B (n = 8), sepsis (n = 17), and septic shock (n = 7), were collected between July 2014 and November 2019 in Australia or Singapore. In addition, this study includes samples collected (2014–2022) from subjects with various co-infections (n = 84), either bacterial/viral (n = 56), bacterial/viral/fungal (n = 4), viral/fungal (n = 17) or viral/viral (n = 7), from Australia, France, and the Czech Republic. Longitudinal samples, obtained between two and nine days post-infection, were collected for the COVID-19 (65 patients) and co-infection (19 patients) groups. Seventy-two volunteer samples from Australia collected before 2019 were included as healthy controls. A summary of the relevant cohort characteristics (study population and disease demographics) is provided in Table S1 (see Supplementary Information document). Detailed clinical data for each sample are provided in supplementary file 1 ([1] PREDICT-19 clinical data.xlsx).

Eligibility criteria included (1) age equal to or greater than 18 years for adults and less than 18 years for the paediatric cohort; (2) the World Health Organization definition of influenza-like illness (fever of 38 °C or higher, cough, sore throat, nasal congestion, and illness onset within the last ten days); and (3) confirmed infection by appropriate microbiological or virological assays, in addition to the presence of clinical evidence of infection (e.g. physical examination and imaging studies such as chest X-ray). For example, COVID-19 infection was confirmed by virological testing on respiratory samples (nasal swab/throat swab/sputum/bronchoalveolar lavage) by PCR or antigen detection assay, together with signs of respiratory infection (respiratory distress and chest X-ray findings) as assessed by an admitting physician. All control samples in the dataset tested negative for common respiratory infections (bacterial or viral). Study data were collected and managed using Research Electronic Data Capture (REDCap) tools hosted at the University of Sydney^12,13.

Blood sample collection and RNA isolation

Blood samples were collected into PAXgene Blood RNA Tubes (2.5 mL blood) (PreAnalytiX, Qiagen, Germany) from participants at the time of study enrolment according to the manufacturer’s supplied protocol. Samples were held at room temperature for 2 h, then at −20 °C for 24 h, and finally transferred to −80 °C for long-term storage. Total RNA was extracted according to the manufacturer’s instructions and included DNase I treatment (PreAnalytiX, QIAGEN/BD, Switzerland). An aliquot of 4 μL of each extracted total RNA was used for RNA quality control assessments. The concentration and integrity of extracted RNA were evaluated by visualization of 28S and 18S band integrity on a Tapestation 4200 system (Agilent). RNA purity was estimated by examining the OD 260/280 and the OD 260/230 ratios. RNA samples were stored at −80 °C until use. Samples containing 100 ng–1 μg of total RNA with a high RNA Integrity Number (RIN > 7), an OD 260/280 ratio of 1.8–2.0 and an OD 260/230 ratio of 2.0–2.2 were sent for RNASeq.
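The acceptance thresholds above can be expressed as a simple filter. The following is a minimal sketch, not the authors' code: the `RnaQc` record, its field names, and the demo values are hypothetical, and the range bounds are assumed to be inclusive because the text does not say otherwise.

```python
from dataclasses import dataclass


@dataclass
class RnaQc:
    """QC measurements for one extracted RNA sample (field names are illustrative)."""
    sample_id: str
    total_rna_ng: float  # total RNA mass in nanograms
    rin: float           # RNA Integrity Number
    od_260_280: float    # OD 260/280 purity ratio
    od_260_230: float    # OD 260/230 purity ratio


def passes_qc(s: RnaQc) -> bool:
    """Apply the thresholds stated in the text; range bounds are assumed inclusive."""
    return (
        100.0 <= s.total_rna_ng <= 1000.0  # 100 ng to 1 ug of total RNA
        and s.rin > 7.0
        and 1.8 <= s.od_260_280 <= 2.0
        and 2.0 <= s.od_260_230 <= 2.2
    )


# Hypothetical samples for illustration only.
samples = [
    RnaQc("demo_pass", 450.0, 8.4, 1.95, 2.10),
    RnaQc("demo_low_rin", 450.0, 6.5, 1.95, 2.10),
    RnaQc("demo_impure", 450.0, 8.4, 1.60, 2.10),
]
passed = [s.sample_id for s in samples if passes_qc(s)]
print(passed)
```

Only the first demo sample meets all four criteria; the other two fail on RIN and on the OD 260/280 ratio respectively.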

Library preparation and RNASeq

Libraries were prepared from 300 ng total RNA using the Illumina Stranded Total RNA Prep with Ribo-Zero Plus with Unique Dual Indexes (Illumina, CA, USA). Briefly, human ribosomal and globin RNA were depleted, the remaining RNA was fragmented (targeting an insert size of ~190 bp), and strand-specific double-stranded cDNA was synthesised. After adapter ligation and indexing, libraries were purified, quality-checked (PerkinElmer GXII), quantified (qPCR), and pooled (32 samples/lane). Sequencing was performed on an Illumina NovaSeq 6000 (150 bp paired-end, S4-300 flow cell), yielding an average of 90.8 million read pairs per sample. Base calling and FASTQ conversion were completed with standard Illumina pipelines: NovaSeq Control Software v1.7.5, RTA v3.4.4, and DRAGEN BCL Convert v3.10.8.

Data pre-processing: sequence reads to count data

Raw RNA sequencing data were quality-controlled and pre-processed into analysis-ready count data using the highly scalable RNASeq-DE workflow (v1.0.0) (https://github.com/Sydney-Informatics-Hub/RNASeq-DE). This workflow uses OpenMPI (v4.1.0) and nci-parallel (v1.0.0a) (https://doi.org/10.1007/978-3-540-30218-6_19) to distribute tasks across multiple compute nodes for compute-efficient, parallel data pre-processing. Default or developer-recommended settings were applied unless otherwise described below. Quality reports for each FASTQ file were obtained using FastQC (v0.11.7) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and summarised with MultiQC (v1.9)¹⁴. FASTQ pairs were trimmed of 3′ adapters and polyA tails using BBDuk (v37.98) (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/), leaving an average of 90.6 million trimmed read pairs. The human reference genome GRCh38 primary assembly and gene set release 106 were obtained from Ensembl and prepared with STAR’s genomeGenerate run mode (v2.7.3a)¹⁵, with --sjdbOverhang set to 149 (read length minus one for 150 bp reads). Each pair of trimmed FASTQ reads was mapped using STAR to the prepared reference. Sequencing batch-level binary alignment (BAM) files were merged and indexed with SAMtools (v1.10)¹⁶ to obtain sample-level BAMs. HTSeq-count (v0.12.4)¹⁷ with -s reverse was used to obtain feature-level raw counts. TPMCalculator (v0.0.4)¹⁸ was used to obtain TPM-normalized, feature-level counts. The experimental flow chart (study design) is shown in Fig. 1.
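For readers unfamiliar with the TPM (transcripts per million) unit used for the normalized counts, the sketch below shows the generic textbook calculation. It is not a reimplementation of TPMCalculator, which derives its own feature lengths and read assignments from the BAM files; the function name `tpm` and the demo numbers are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Generic TPM: length-normalize counts, then scale so the values sum to one million.

    counts and lengths_bp are parallel lists (one entry per feature).
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Reads per kilobase of feature length.
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rpk)
    if total == 0:
        return [0.0] * len(rpk)
    # Rescale so the sample-wide sum is 1e6.
    return [r / total * 1e6 for r in rpk]


# Hypothetical counts and feature lengths for three features.
demo_counts = [100, 200, 300]
demo_lengths = [1000, 2000, 1000]
demo_tpm = tpm(demo_counts, demo_lengths)
print(demo_tpm)
```

In the demo, the second feature has twice the raw count of the first but is also twice as long, so both receive the same TPM value. Because TPM values sum to one million within each sample, they are convenient for comparing relative abundance, whereas the raw HTSeq-count values are the appropriate input for DESeq2.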

Signal validation through differential expression

To validate data quality, we performed differential expression analysis between controls and each disease group and confirmed that previously reported gene markers were identified in the current dataset. First, we loaded feature counts generated by HTSeq-count into a DESeq2 dataset object with design = ~1 and transformed the data using the variance stabilizing transformation (VST). Principal component analysis (PCA) using prcomp() on VST counts was used to observe variation between disease groups (Fig. 2a) and to check for unwanted batch effects. The batch effect caused by the collection site was of primary interest, as methods were otherwise applied consistently across samples. The PCA plot grouped by collection site is shown in Figure S1 (see Supplementary Information document). DESeq2’s dispersion estimates closely follow the fitted trend line, with decreasing dispersion at higher mean counts and no major outliers, indicating that the model provides a good fit for the data (Fig. 2b). Differential expression analysis was then performed with DESeq2 using HTSeq-count data, setting control samples as the base level. Significantly differentially expressed genes were defined as protein-coding genes with adjusted P-value ≤ 0.05 and |log₂ FC| ≥ 2 for each pairwise comparison of a disease group against the control. Volcano plots of the differential expression results were generated with the package EnhancedVolcano, version 1.18.0¹⁹, and are shown in Figure S2 (a-j) (see Supplementary Information document).
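The significance criteria can be applied to an exported results table with a few lines of code. The sketch below is an illustration under stated assumptions rather than the authors' script: `log2FoldChange` and `padj` are the column names DESeq2 uses in its results tables, while the `biotype` field (for example, joined from the Ensembl annotation) and all demo rows are hypothetical.

```python
import math


def is_significant_deg(row, padj_max=0.05, abs_lfc_min=2.0):
    """Return True if a result row meets the thresholds described in the text."""
    padj = row.get("padj")
    lfc = row.get("log2FoldChange")
    # DESeq2 can report NA for padj (e.g. after independent filtering);
    # such rows are treated as not significant here.
    if padj is None or lfc is None or math.isnan(padj) or math.isnan(lfc):
        return False
    return (
        row.get("biotype") == "protein_coding"
        and padj <= padj_max
        and abs(lfc) >= abs_lfc_min
    )


# Hypothetical rows for illustration only; these are not values from the dataset.
demo_results = [
    {"gene": "gene_up", "biotype": "protein_coding", "log2FoldChange": 3.1, "padj": 0.001},
    {"gene": "gene_down", "biotype": "protein_coding", "log2FoldChange": -2.4, "padj": 0.02},
    {"gene": "gene_small", "biotype": "protein_coding", "log2FoldChange": 1.2, "padj": 0.001},
    {"gene": "gene_lnc", "biotype": "lncRNA", "log2FoldChange": 4.0, "padj": 0.001},
    {"gene": "gene_na", "biotype": "protein_coding", "log2FoldChange": 2.5, "padj": float("nan")},
]
degs = [r["gene"] for r in demo_results if is_significant_deg(r)]
print(degs)
```

Note that a |log₂ FC| of 2 corresponds to a four-fold change in either direction, so this filter keeps both strongly up-regulated and strongly down-regulated protein-coding genes.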

The proportion of males to females within each disease group is shown in Figure S3 (see Supplementary Information document).

Data Records

Raw FASTQ data discussed in this publication have been deposited in NCBI’s Sequence Read Archive under BioProject accession PRJNA901461. Count data were deposited in NCBI’s Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4944384/) and are available under GEO Series accession numbers GSE217948²⁰ and GSE282464²¹ (see supplementary file “[2] PREDICT-19 RNA samples and data quality.xlsx”). All samples in PRJNA901461 were collected for the same project as described in this article. The two series represent two different parts of the entire dataset. Samples in GSE217948 were included in previous publications^22,23,24,25. In this publication, we provide additional clinical data for these samples in the supplementary file “[1] PREDICT-19 clinical data.xlsx”. For some patients, additional longitudinal samples are also available in series GSE282464. All samples in GSE282464 are newly released as part of this article.

Technical Validation

Transcriptome data quality assessment

FastQC reported that all FASTQ files containing raw sequencing data had high per-sequence quality scores (Phred >30) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). This was reflected at each base across reads, except at position one, where quality was slightly lower for some reads (minimum Phred 16, equivalent to 97.5% accuracy, Fig. 2c). Adapter content and over-representation of poly A sequence were detected in the raw data. We trimmed adapters and poly A sequence and confirmed that they were successfully removed with FastQC and MultiQC.
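The Phred figures quoted above follow from the standard definition Q = −10 log₁₀(p), where p is the probability that a base call is wrong. The short sketch below reproduces the arithmetic; the function names are illustrative, and the decoding helper assumes the Phred+33 ASCII encoding used by current Illumina FASTQ files.

```python
def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to base-call accuracy (1 - error probability)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def decode_phred33(quality_string: str) -> list:
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(ch) - 33 for ch in quality_string]


print(f"{phred_to_accuracy(16):.1%}")  # prints 97.5%
print(f"{phred_to_accuracy(30):.1%}")  # prints 99.9%
print(decode_phred33("1?"))            # prints [16, 30]
```

A Phred score of 16 therefore corresponds to roughly one miscalled base in 40, and a score of 30 to one in 1,000.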

Mapping metrics obtained with RSeQC’s tools are reported in supplementary files 2 and 3 ([2] PREDICT-19 RNA samples and data quality and [3] PREDICT19_DataDictionary - see Supplementary Information document). On average, 95.0% of reads were mapped (SD ± 2.2%), 73.4% of which were uniquely mapped (SD ± 7.0%), 17.0% were non-primary hits (SD ± 6.3%), and 36.7% (SD ± 6.5%) mapped to coding sequence on the GRCh38 primary assembly. RSeQC’s infer_experiment.py was used to confirm that libraries were reverse-stranded, with >0.7 of reads explained by “1+-,1-+,2++,2--” for all samples. The number of paired reads, read length and mapping characteristics meet or exceed Illumina’s recommendations and ENCODE’s best practice guidelines for profiling global gene expression and obtaining some information on alternative splicing²⁶ (https://knowledge.illumina.com/library-preparation/rna-library-prep/library-preparation-rna-library-prep-reference_material-list/000001243).
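The strandedness check links directly to the `-s reverse` setting used for HTSeq-count. As a rough sketch of that decision, the code below parses the two "Fraction of reads explained by" lines that infer_experiment.py prints for paired-end data and maps them to an htseq-count `-s` value (`yes`, `no` or `reverse`). The 0.7 cut-off is taken from the text; the function names, the fallback to `no` when neither orientation dominates, and the report values are assumptions made for illustration.

```python
import re

# Matches lines such as: Fraction of reads explained by "1+-,1-+,2++,2--": 0.95
FRACTION_LINE = re.compile(r'Fraction of reads explained by "([^"]+)":\s*([0-9.]+)')

FORWARD_KEY = "1++,1--,2+-,2-+"  # read 1 on the same strand as the gene
REVERSE_KEY = "1+-,1-+,2++,2--"  # read 1 on the opposite strand (reverse-stranded library)


def parse_infer_experiment(text: str) -> dict:
    """Extract the orientation fractions from infer_experiment.py output."""
    return {m.group(1): float(m.group(2)) for m in FRACTION_LINE.finditer(text)}


def htseq_strand_option(fractions: dict, threshold: float = 0.7) -> str:
    """Choose the htseq-count -s value; 'no' is an assumed fallback for ambiguous cases."""
    if fractions.get(REVERSE_KEY, 0.0) > threshold:
        return "reverse"
    if fractions.get(FORWARD_KEY, 0.0) > threshold:
        return "yes"
    return "no"


# Illustrative report text with made-up fractions, not output from this dataset.
demo_report = """This is PairEnd Data
Fraction of reads failed to determine: 0.0300
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0200
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9500
"""
demo_fractions = parse_infer_experiment(demo_report)
print(htseq_strand_option(demo_fractions))
```

Counting a reverse-stranded library with the wrong `-s` setting assigns most reads to the antisense strand and leaves sense-strand gene counts near zero, which is why this check precedes quantification.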

Signal validation through differential expression

The PCA plot and volcano plots presented in this paper help confirm the quality of our dataset. The PCA plot shows that the variance in the data can be attributed to infection status rather than unwanted batch effects, such as collection site. This is also confirmed by differential expression analysis and the identification of gene markers reported to characterise specific infection types. For instance, IFI27 was significantly upregulated in viral infections, consistent with previous reports, where IFI27 was identified as a strong biomarker distinguishing viral from bacterial respiratory infections²⁷. Meanwhile, the volcano plots allow us to visualise significant changes in gene expression between different conditions, providing insights into the dataset’s robustness and reliability. Together, these analyses validate the quality of our dataset and enhance our confidence in its suitability for further study. The data presented here provide a valuable resource to replicate and validate similar findings from other studies (Table S2 - see Supplementary Information document) (adjusted P-value ≤ 0.05 and |log₂ FC| ≥ 2).

Usage Notes

This study’s RNASeq data analysis is limited by the unequal distribution of male and female participants across specific cohorts, especially for the seasonal coronavirus group (which comprised only females) and the co-infected groups. As a result, some sex-specific genes were detected as ‘DEGs’ in contrasts of infected patients versus healthy controls. This sex imbalance may introduce biases, affecting the generalizability of the results. Future research should aim for a more balanced representation of the sexes to ensure broader applicability and minimise potential biases. Also, the lack of follow-up data for some infected individuals or groups could limit the analytical power. This dataset (GSE282464 and associated SRA records from PRJNA901461) provides newly released clinical and transcriptomic data that complement prior datasets. While this dataset can be analyzed independently for specific research applications, it is also designed to be interoperable with previously published datasets, particularly PRJNA901461.
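One pragmatic way to handle the sex imbalance when reusing the data is to flag well-known sex-linked genes in a DEG list before interpretation. The sketch below illustrates the idea only: the gene set (XIST and several Y-linked genes) is a small, commonly used example chosen for this illustration and is not taken from the article, and the function name and demo list are hypothetical. Where the design allows it, including sex as a covariate in the model is a more principled alternative.

```python
# Illustrative, non-exhaustive set of sex-linked genes (an assumption, not from the article).
SEX_LINKED_GENES = frozenset({"XIST", "RPS4Y1", "DDX3Y", "KDM5D", "UTY", "EIF1AY"})


def split_sex_linked(deg_symbols, sex_linked=SEX_LINKED_GENES):
    """Split a list of DEG symbols into (kept, flagged) while preserving order."""
    kept = [g for g in deg_symbols if g not in sex_linked]
    flagged = [g for g in deg_symbols if g in sex_linked]
    return kept, flagged


# Hypothetical DEG list for a contrast with unbalanced sexes.
demo_degs = ["IFI27", "XIST", "RPS4Y1"]
kept, flagged = split_sex_linked(demo_degs)
print(kept, flagged)
```

Flagged genes are reported separately rather than silently dropped, so that a reader can still judge whether a signal reflects infection or cohort composition.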

Code availability

Highly scalable RNASeq-DE workflow v1.0.0 was used to perform quality assessment and pre-processing of raw RNA-sequencing data to raw counts. The code, tools and versions used are publicly available and accessible, as documented on GitHub (https://github.com/Sydney-Informatics-Hub/RNASeq-DE).

References

1. Nicholson, L. B. The immune system. Essays Biochem 60, 275–301, https://doi.org/10.1042/ebc20160017 (2016).
2. Brodin, P. & Davis, M. M. Human immune system variation. Nat Rev Immunol 17, 21–29, https://doi.org/10.1038/nri.2016.125 (2017).
3. Netea, M. G., Schlitzer, A., Placek, K., Joosten, L. A. B. & Schultze, J. L. Innate and Adaptive Immune Memory: an Evolutionary Continuum in the Host’s Response to Pathogens. Cell Host Microbe 25, 13–26, https://doi.org/10.1016/j.chom.2018.12.006 (2019).
4. Ochando, J., Mulder, W. J. M., Madsen, J. C., Netea, M. G. & Duivenvoorden, R. Trained immunity - basic concepts and contributions to immunopathology. Nat Rev Nephrol 19, 23–37, https://doi.org/10.1038/s41581-022-00633-5 (2023).
5. Conesa, A. et al. A survey of best practices for RNA-seq data analysis. Genome Biol 17, 13, https://doi.org/10.1186/s13059-016-0881-8 (2016).
6. Mohr, S. & Liew, C. C. The peripheral-blood transcriptome: new insights into disease and risk assessment. Trends Mol Med 13, 422–432, https://doi.org/10.1016/j.molmed.2007.08.003 (2007).
7. Sweeney, T. E., Wong, H. R. & Khatri, P. Robust classification of bacterial and viral infections via integrated host gene expression diagnostics. Sci Transl Med 8, 346ra391, https://doi.org/10.1126/scitranslmed.aaf7165 (2016).
8. Almansa, R. et al. A host transcriptomic signature for identification of respiratory viral infections in the community. Eur J Clin Invest 51, e13626, https://doi.org/10.1111/eci.13626 (2021).
9. Shojaei, M. et al. Multisite validation of a host response signature for predicting likelihood of bacterial and viral infections in patients with suspected influenza. Eur J Clin Invest 53, e13957, https://doi.org/10.1111/eci.13957 (2023).
10. Casamassimi, A., Federico, A., Rienzo, M., Esposito, S. & Ciccodicola, A. Transcriptome Profiling in Human Diseases: New Advances and Perspectives. Int J Mol Sci 18, https://doi.org/10.3390/ijms18081652 (2017).
11. Kukurba, K. R. & Montgomery, S. B. RNA Sequencing and Analysis. Cold Spring Harb Protoc 2015, 951–969, https://doi.org/10.1101/pdb.top084970 (2015).
12. Harris, P. A. et al. Research electronic data capture (REDCap)–a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform 42, 377–381, https://doi.org/10.1016/j.jbi.2008.08.010 (2009).
13. Harris, P. A. et al. The REDCap consortium: Building an international community of software platform partners. J Biomed Inform 95, 103208, https://doi.org/10.1016/j.jbi.2019.103208 (2019).
14. Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048, https://doi.org/10.1093/bioinformatics/btw354 (2016).
15. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21, https://doi.org/10.1093/bioinformatics/bts635 (2013).
16. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079, https://doi.org/10.1093/bioinformatics/btp352 (2009).
17. Anders, S., Pyl, P. T. & Huber, W. HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169, https://doi.org/10.1093/bioinformatics/btu638 (2015).
18. Vera Alvarez, R., Pongor, L. S., Mariño-Ramírez, L. & Landsman, D. TPMCalculator: one-step software to quantify mRNA abundance of genomic features. Bioinformatics 35, 1960–1962, https://doi.org/10.1093/bioinformatics/bty896 (2019).
19. Blighe, K., Rana, S. & Lewis, M. EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling. R package version 1.22.0 (2024).
20. NCBI GEO https://identifiers.org/geo/GSE217948 (2025).
21. NCBI GEO https://identifiers.org/geo/GSE282464 (2025).
22. Carney, M. et al. Host transcriptomics and machine learning for secondary bacterial infections in patients with COVID-19: a prospective, observational cohort study. Lancet Microbe 5, e272–e281, https://doi.org/10.1016/s2666-5247(23)00363-4 (2024).
23. Wang, Y. et al. Pathway and Network Analyses Identify Growth Factor Signaling and MMP9 as Potential Mediators of Mitochondrial Dysfunction in Severe COVID-19. Int J Mol Sci 24, https://doi.org/10.3390/ijms24032524 (2023).
24. Wang, Y. et al. Blood transcriptome responses in patients correlate with severity of COVID-19 disease. Front Immunol 13, 1043219, https://doi.org/10.3389/fimmu.2022.1043219 (2022).
25. Shojaei, M. et al. IFI27 transcription is an early predictor for COVID-19 outcomes, a multi-cohort observational study. Front Immunol 13, 1060438, https://doi.org/10.3389/fimmu.2022.1060438 (2022).
26. ENCODE Guidelines and Best Practices for RNA-Seq: Revised. https://doi.org/10.1101/044578 (2016).
27. Tang, B. M. et al. A novel immune biomarker IFI27 discriminates between influenza and bacteria in patients with suspected respiratory infection. Eur Respir J 49, https://doi.org/10.1183/13993003.02098-2016 (2017).

Acknowledgements

We thank all participants involved in this study. This study was funded by the Snow Medical Research Foundation (BEAT COVID-19) (grant no. CT28701/G207593), the National Health and Medical Research Council (Australian Partnership for Preparedness Research on Infectious Disease Emergencies, APPRISE AppID 1116530), and the Jack Ma Foundation. The authors also declare that this study received funding from A2 Milk Company. The funder had no role in the study design, data collection, analysis, interpretation, writing of this article, or the decision to submit it for publication. The authors wish to acknowledge the Australian Genomics Research Facility (AGRF) for sequencing services, supported by the Australian Government’s National Collaborative Research Infrastructure Strategy through Bioplatforms Australia. RNA sample quality control was performed by Dr Joey Lai at the Westmead Scientific Platforms, which are supported by the Westmead Research Hub, the Cancer Institute New South Wales, the National Health and Medical Research Council and the Ian Potter Foundation. The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney, and the Australian BioCommons, enabled by NCRIS via Bioplatforms Australia. We acknowledge using the National Computational Infrastructure (NCI), supported by the Australian Government, and the Sydney Informatics Hub HPC Allocation Scheme, supported by the Deputy Vice-Chancellor (Research), University of Sydney. We are also especially grateful to Carl Feng, Alice Grey, Angela Ferguson, Jennifer Audsley, Rebecca Burrell, Amith Shetty, and Kevin Lai for their invaluable contributions across various aspects of the project. 
Members of the PREDICT-19 consortium are listed in alphabetical order by first name: Alberto Ballestrero, Allan Cripps, Amanda Cox, Amy L Phu, Andrea De Maria, Anthony McLean, Arutha Kulasinghe, Ben Marais, Benjamin Tang, Carl Feng, Damien Chaussabel, Darawan Rinchai, Davide Bedognetti, Gabriele Zoppoli, Gunawan Gunawan, Irani Thevarajan, Jennifer Audsley, John-Sebastian Eden, Jonathan Iredell, Karan Kim, Kirsty R Short, Klaus Schughart, Mandira Chakraborty, Marcela Kralovcova, Marek Nalos, Marko Radic, Martin Matejovic, Maryam Shojaei, Meagan Carney, Michele Bedognetti, Miroslav Prucha, Mohammed Toufiq, Nandan Deshpande, Narasaraju Teluguakula, Nicholas West, Paolo Cremonesi, Philip N. Britton, Ricardo Garcia Branco, Rodolphe Thiebaut, Rostyslav Bilyy, Sally Teoh, Stephen MacDonald, Tania Sorrell, Thomas Karvunidis, Tiana M. Pelaia, Tim Kwan, Tracy Chew, Velma Herwanto, Win Sen Kuan, Ya Wang, and Yoann Zerbib.

Author information

These authors contributed equally: Tracy Chew, Tiana M. Pelaia.

Authors and Affiliations

Sydney Informatics Hub, Core Research Facilities, University of Sydney, Sydney, NSW, Australia
Tracy Chew & Nandan Deshpande
Department of Intensive Care Medicine, Nepean Hospital, Penrith, NSW, Australia
Tiana M. Pelaia, Sally Teoh, Ya Wang, Velma Herwanto, Sam Orde, Anthony McLean & Maryam Shojaei
Research and Education Network, Western Sydney Local Health District, Westmead Hospital, Westmead, NSW, Australia
Amy L. Phu
Faculty of Medicine and Health, Sydney Medical School Westmead, Westmead Hospital, University of Sydney, Westmead, NSW, Australia
Amy L. Phu
Centre for Immunology and Allergy Research, The Westmead Institute for Medical Research, The University of Sydney, Westmead, NSW, Australia
Ya Wang, Karan Kim, Velma Herwanto, Benjamin Tang & Maryam Shojaei
Faculty of Medicine and Health, Sydney Medical School Nepean, Nepean Hospital, University of Sydney, Sydney, NSW, Australia
Ya Wang, Anthony McLean & Maryam Shojaei
Faculty of Medicine, Universitas Tarumanagara, Jakarta, Indonesia
Velma Herwanto
Medistra Hospital, Gatot Subroto Kav. 59, Jakarta, Indonesia
Gunawan
Medical ICU, 1st Department of Internal Medicine, Charles University and Teaching Hospital Pilsen, 323 00, Plzeň, Czech Republic
Thomas Karvunidis
Department of Intensive Care Medicine, Amiens University Hospital, Amiens, France
Yoann Zerbib
School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
Kirsty R. Short
Centre for Clinical Research in Emergency Medicine, Harry Perkins Institute of Medical Research, Royal Perth Hospital, University of Western Australia, Perth, WA, Australia
Stephen Macdonald
Victorian Infectious Disease Service, The Royal Melbourne Hospital, Melbourne, VIC, Australia
Irani Thevarajan
Department of Infectious Diseases, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia
Irani Thevarajan
Translational Medicine Division, Research Branch, Sidra Medicine, Doha, Qatar
Darawan Rinchai
St Jude Children’s Research Hospital, Memphis, TN, USA
Darawan Rinchai
Emergency Medicine Department, National University Hospital, Singapore, Singapore
Win Sen Kuan
Department of Surgery, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
Win Sen Kuan
Department of Infectious Diseases and Microbiology, Westmead Hospital, WSLHD, Westmead, NSW, Australia
Ben Knippenberg
Centre for Infectious Diseases and Microbiology, The Westmead Institute for Medical Research, Westmead, NSW, Australia
Jonathan Iredell
Faculty of Medicine and Health, School of Medical Sciences, The University of Sydney, Sydney, NSW, Australia
Jonathan Iredell
Westmead Hospital, Western Sydney Local Health District, Westmead, NSW, Australia
Jonathan Iredell
Sydney Institute for Infectious Disease, The University of Sydney, Sydney, NSW, Australia
Jonathan Iredell
Department of Infectious Diseases and Microbiology, the Children’s Hospital at Westmead, Westmead, NSW, Australia
Philip N. Britton
Sydney ID, Sydney Medical School, University of Sydney, Sydney, NSW, Australia
Philip N. Britton
RPA Virtual Hospital, Sydney Local Health District, Camperdown, NSW, Australia
Owen Hutchings
Department of Clinical Immunology, Royal Prince Alfred Hospital, Camperdown, NSW, Australia
Warwick J. Britton
Centenary Institute, University of Sydney, Sydney, NSW, Australia
Warwick J. Britton
Centre for Virology Research, the Westmead Institute for Medical Research, Westmead, NSW, Australia
John-Sebastian Eden
Translational Medicine Division, Research Branch, Sidra Medicine, Doha, Qatar
Damien Chaussabel
Computational Sciences Department, The Jackson Laboratory, Farmington, CT, USA
Damien Chaussabel
Department of Microbiology, Immunology and Biochemistry, University of Tennessee Health Science Centre, Memphis, Tennessee, USA
Klaus Schughart
Institute of Virology Münster, University of Münster, Münster, Germany
Klaus Schughart

Authors

Tracy Chew
Tiana M. Pelaia
Amy L. Phu
Sally Teoh
Ya Wang
Nandan Deshpande
Karan Kim
Velma Herwanto
Gunawan
Thomas Karvunidis
Yoann Zerbib
Kirsty R. Short
Stephen Macdonald
Irani Thevarajan
Darawan Rinchai
Win Sen Kuan
Ben Knippenberg
Jonathan Iredell
Philip N. Britton
Owen Hutchings
Warwick J. Britton
John-Sebastian Eden
Sam Orde
Benjamin Tang
Anthony McLean
Damien Chaussabel
Klaus Schughart
Maryam Shojaei

Contributions

Study concept and design: B.T. and M.S.; ethics/governance application: B.T., M.S., Y.W., T.P., S.T.; recruitment of participants, sample collection/processing: S.T., T.P., K.K., M.S., P.N.B., T.K., Y.Z., A.D.M., I.T., W.S.K.; clinical data collection and REDCap database: B.T., A.M., A.P., T.P., S.T., P.N.B., M.S.; RNASeq data pre-processing, Q.C. and data analyses: T.C., J.S., KSch and N.D.; data interpretation and discussion: T.C., KSch, D.C., A.M., J.S., M.S.; manuscript writing and generating figures: T.C. and M.S.; manuscript revision: T.C., KSch, B.T., A.M., M.S., D.C.; funding acquisition and project supervision: B.T., M.S., A.M. All authors contributed to the article and approved the submitted version. The PREDICT-19 Consortium contributed to many aspects of this study, including study concept and design, applications for material transfer agreements, recruitment of participants, sample collection, clinical data collection, setup of the REDCap database, and data interpretation and discussion.

Corresponding authors

Correspondence to Tracy Chew or Maryam Shojaei.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Chew, T., Pelaia, T.M., Phu, A.L. et al. Molecular landscape of respiratory infection: A large-scale, multi-centre blood transcriptome dataset. Sci Data 12, 1175 (2025). https://doi.org/10.1038/s41597-025-05488-6

Received: 02 January 2025
Accepted: 27 June 2025
Published: 10 July 2025
Version of record: 10 July 2025
DOI: https://doi.org/10.1038/s41597-025-05488-6
