Background & Summary

Infections continue to affect millions of individuals each year, despite significant advances in the research and development of antimicrobial therapies. Numerous studies have identified and characterised the causative organisms underpinning infectious diseases, such as severe-acute-respiratory-syndrome-related coronavirus-2 (SARS-CoV-2). However, our understanding of the complex molecular changes occurring within the human body during infections remains limited. There is therefore a critical need for large-scale whole-blood transcriptome datasets from infected patients to gain a comprehensive understanding of the mechanisms of disease progression resulting from infection. Such datasets should capture patient heterogeneity, improve our understanding of the immune response to emerging variants and guide future research and clinical interventions. Increasing evidence suggests that changes in the expression of host immune responses are significant contributors to the development and progression of disease following infections1,2,3,4. Molecular profiling technologies, especially RNA sequencing (RNASeq), stand out as powerful tools capable of unravelling these complex host responses during infections. This approach facilitates the profiling of gene expression changes across diverse biological samples and offers insights into their involvement in disease pathogenesis. With the significant drop in the cost of running RNASeq experiments, this technique has become a primary step in characterising and comparing gene expression and molecular pathways across many types of infections5. It also facilitates rapid translation of research findings from bench to bedside, given that blood-based transcriptomic analysis is easily accessible and non-invasive and provides a systemic view of the body’s response to infections (https://www.lexogen.com/whole-blood-rna-seq-best-practice/)6,7,8,9. By analysing blood transcriptomes, researchers can identify differentially expressed genes, discover key molecular signatures or biomarkers, and unravel key pathways underpinning pathogenesis. Owing to these advantages, RNASeq has been widely accepted as an indispensable tool for studying immune responses in many physiological and pathological situations10,11.

Respiratory infections, characterised by their dynamic and complex nature, continue to pose significant challenges to global health and to healthcare systems worldwide. This manuscript reports next-generation RNA sequencing within a large-scale, multi-centre framework to interpret the immunological landscape of respiratory infections. To date, we have accumulated a comprehensive whole-blood transcriptome dataset obtained from 502 patients with SARS-CoV-2, seasonal coronavirus, influenza A, influenza B, sepsis, septic shock, or co-infection across 11 centres in 5 countries, drawn from various patient populations and covering a diverse range of clinical presentations and settings. In addition, we provide longitudinal data for patients with coronavirus disease 2019 (COVID-19) and for co-infected groups, capturing changes in the transcriptome over the course of infection at varied disease severity. The objective of this manuscript is to describe the dataset scope, sample characteristics, and experimental approach, and to describe data quality using various assessment metrics. It also highlights the dataset’s translational potential by providing researchers with a roadmap to understand critical aspects of respiratory infection pathogenesis, thereby facilitating future research in the field. By making this resource open to the scientific community, we expect to promote collaborative research efforts that inform the design of future studies, accelerate discoveries, and contribute to a more thorough understanding of respiratory infections, including SARS-CoV-2. Our goal is to aid the development of future diagnostic and prognostic tools, therapeutic interventions, and preventive strategies to address future global health crises.

Methods

Ethics statement

This multicentre, observational cohort study recruited patients with respiratory infections across different sites in five countries. The study was approved by Human Research Ethics Committees (HRECs) at all participating institutions. Informed consent was obtained from all participants. Further details are provided in the Supplementary Information under the ethics statement.

Study design and participants of human cohorts

A total of 681 samples collected from 502 participants with a respiratory infection are included in this paper. Of these, 322 participants with confirmed COVID-19, comprising 301 adults and 21 children, were enrolled from 10 centres across five countries (Australia, Czech Republic, France, Indonesia, and Singapore) between February 2020 and February 2022. Samples from participants with other respiratory infections, including seasonal coronavirus (n = 9), influenza A (n = 55), influenza B (n = 8), sepsis (n = 17), and septic shock (n = 7), were collected between July 2014 and November 2019 in Australia or Singapore. In addition, this study includes samples collected (2014–2022) from subjects with various co-infections (n = 84), namely bacterial/viral (n = 56), bacterial/viral/fungal (n = 4), viral/fungal (n = 17) or viral/viral (n = 7), from Australia, France, and the Czech Republic. Longitudinal samples obtained between two and nine days post-infection were collected for the COVID-19 (65 patients) and co-infection (19 patients) groups. Seventy-two volunteer samples from Australia collected before 2019 were included as healthy controls. A summary of the relevant cohort characteristics (study population and disease demographics) is provided in Table S1 (see Supplementary Information document). Detailed clinical data for each sample are provided in supplementary file 1 ([1] PREDICT-19 clinical data.xlsx).

Eligibility criteria included (1) age equal to or greater than 18 years for adults and less than 18 years for the paediatric cohort, (2) the World Health Organization definition of influenza-like illness (fever of 38 °C or higher, cough, sore throat, nasal congestion, and illness onset within the last ten days), and (3) confirmed infection by appropriate microbiological or virological assays, in addition to the presence of clinical evidence of infection (e.g. physical examination and imaging studies such as chest X-ray). For example, COVID-19 infection was confirmed by virological testing of respiratory samples (nasal swab/throat swab/sputum/bronchoalveolar lavage) by PCR or antigen detection assay, together with signs of respiratory infection (respiratory distress and chest X-ray findings) as assessed by an admitting physician. All control samples in the dataset tested negative for common respiratory infections (bacterial or viral). Study data were collected and managed using Research Electronic Data Capture (REDCap) tools hosted at the University of Sydney12,13.

Blood sample collection and RNA isolation

Blood samples were collected into PAXgene Blood RNA Tubes (2.5 mL blood) (PreAnalytiX, Qiagen, Germany) from participants at the time of study enrolment according to the manufacturer’s protocol. Samples were kept at room temperature for 2 h, then at −20 °C for 24 h, and finally transferred to −80 °C for long-term storage. Total RNA was extracted according to the manufacturer’s instructions, including DNase I treatment (PreAnalytiX, QIAGEN/BD, Switzerland). A 4 μl aliquot of each extracted total RNA was used for RNA quality control assessments. The concentration and integrity of extracted RNA were evaluated by visualization of 28S and 18S band integrity on a TapeStation 4200 system (Agilent). RNA purity was estimated from the OD 260/280 and OD 260/230 ratios. RNA samples were stored at −80 °C until use. Samples containing 100 ng–1 μg of total RNA with a high RNA Integrity Number (RIN > 7), an OD 260/280 ratio of 1.8–2.0 and an OD 260/230 ratio of 2.0–2.2 were sent for RNASeq.
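The acceptance criteria above can be expressed as a simple filter. The following Python sketch is illustrative only and is not the authors’ code: the record fields, the sample values and the choice to treat the stated ranges as inclusive are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class RnaQc:
    """QC measurements for one RNA extract (field names are hypothetical)."""
    sample_id: str
    total_rna_ng: float  # total RNA amount in nanograms
    rin: float           # RNA Integrity Number
    od_260_280: float    # purity ratio, protein contamination
    od_260_230: float    # purity ratio, salt/solvent contamination


def passes_qc(s: RnaQc) -> bool:
    """Apply the thresholds stated in the text; range bounds assumed inclusive."""
    return (
        100.0 <= s.total_rna_ng <= 1000.0   # 100 ng to 1 ug total RNA
        and s.rin > 7.0                     # RIN > 7
        and 1.8 <= s.od_260_280 <= 2.0      # OD 260/280 of 1.8-2.0
        and 2.0 <= s.od_260_230 <= 2.2      # OD 260/230 of 2.0-2.2
    )


# Made-up example values, not taken from the dataset.
samples = [
    RnaQc("S1", 450.0, 8.4, 1.95, 2.10),
    RnaQc("S2", 450.0, 6.2, 1.95, 2.10),  # fails on RIN
]
passed = [s.sample_id for s in samples if passes_qc(s)]
print(passed)
```

Keeping each threshold on its own line makes it easy to adjust a single criterion if a different library preparation kit has other input requirements.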

Library preparation and RNASeq

Libraries were prepared from 300 ng total RNA using the Illumina Stranded Total RNA Prep with Ribo-Zero Plus and Unique Dual Indexes (Illumina, CA, USA). Briefly, human ribosomal and globin RNA were depleted, the remaining RNA was fragmented (targeting an insert size of ~190 bp), and strand-specific double-stranded cDNA was synthesised. After adapter ligation and indexing, libraries were purified, quality-checked (PerkinElmer GXII), quantified (qPCR), and pooled (32 samples/lane). Sequencing was performed on an Illumina NovaSeq 6000 (150 bp paired-end, S4-300 flow cell), yielding an average of 90.8 million read pairs per sample. Base calling and FASTQ conversion were completed with standard Illumina pipelines: NovaSeq Control Software v1.7.5, RTA v3.4.4, and DRAGEN BCL Convert v3.10.8.

Data pre-processing: sequence reads to count data

Raw RNA sequencing data were quality-controlled and pre-processed into analysis-ready count data using the highly scalable RNASeq-DE workflow (v1.0.0) (https://github.com/Sydney-Informatics-Hub/RNASeq-DE). This workflow uses OpenMPI (v4.1.0) and nci-parallel (v1.0.0a) (https://doi.org/10.1007/978-3-540-30218-6_19) to distribute tasks across multiple compute nodes for compute-efficient, parallel data pre-processing. Default or developer-recommended settings were applied unless otherwise described below. Quality reports for each FASTQ file were obtained using FastQC (v0.11.7) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and summarised with MultiQC (v1.9)14. FASTQ pairs were trimmed of 3′ adapters and polyA tails using BBDuk (v37.98) (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/), leaving an average of 90.6 million trimmed read pairs. The human reference genome GRCh38 primary assembly and gene set release 106 were obtained from Ensembl and prepared with STAR’s genomeGenerate run mode (v2.7.3a)15, with --sjdbOverhang set to 149 (read length minus one). Each pair of trimmed FASTQ reads was mapped to the prepared reference using STAR. Sequencing batch-level binary alignment (BAM) files were merged and indexed with SAMtools (v1.10)16 to obtain sample-level BAMs. HTSeq-count (v0.12.4)17 with -s reverse was used to obtain feature-level raw counts. TPMCalculator (v0.0.4)18 was used to obtain TPM-normalized, feature-level counts. The experimental flow chart (study design) is shown in Fig. 1.
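For readers unfamiliar with TPM (transcripts per million) normalization, the following Python sketch shows the textbook definition: counts are first divided by feature length in kilobases, then rescaled so that the values sum to one million. It illustrates the general formula only. It is not a reimplementation of TPMCalculator, whose internal length and read-assignment model may differ, and the gene names, counts and lengths are made up.

```python
def tpm(counts: dict, lengths_bp: dict) -> dict:
    """Textbook TPM: length-normalise counts, then scale to sum to 1e6.

    counts      feature -> raw read count
    lengths_bp  feature -> feature length in base pairs
    """
    # Reads per kilobase of feature length.
    rates = {g: counts[g] / (lengths_bp[g] / 1000.0) for g in counts}
    total = sum(rates.values())
    if total == 0:
        return {g: 0.0 for g in counts}
    # Rescale so that all features sum to one million.
    return {g: rate / total * 1e6 for g, rate in rates.items()}


# Hypothetical two-gene example: geneB has three times the reads of geneA
# but is also three times as long, so both receive the same TPM.
example = tpm(
    counts={"geneA": 100, "geneB": 300},
    lengths_bp={"geneA": 1000, "geneB": 3000},
)
print(example)
```

Because TPM values always sum to the same total within a sample, they are convenient for comparing relative abundance across features, whereas the raw HTSeq-count values are the appropriate input for count-based differential expression tools such as DESeq2.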

Fig. 1

Experimental flow chart outlining the study design, from sample collection to data sharing. This diagram illustrates the key steps involved in the study, including participant recruitment, sample processing, data analysis, and data deposition into public repositories for sharing.

Signal validation through differential expression

To validate data quality, we performed differential expression analysis between controls and each disease group and confirmed that previously reported gene markers were identified in the current dataset. First, we loaded feature counts generated by HTSeq-count into a DESeq2 object with design = ~1 and transformed the data using the variance stabilizing transformation (VST). Principal component analysis (PCA) using prcomp() on VST counts was used to observe variation between disease groups (Fig. 2a) and to confirm the absence of unwanted batch effects. The batch effect caused by the collection site was of primary interest, as methods were otherwise applied consistently across samples. The PCA plot grouped by collection site is shown in Figure S1 (see Supplementary Information document). DESeq2’s dispersion estimates closely follow the fitted trend line, with decreasing dispersion at higher mean counts and no major outliers, indicating that the model provides a good fit for the data (Fig. 2b). Differential expression analysis was then performed with DESeq2 using the HTSeq-count data, setting control samples as the base level. Significantly differentially expressed genes were defined as protein-coding genes with adjusted P-value ≤ 0.05 and |log2 FC| ≥ 2 for each pairwise comparison of a disease group against the control. Volcano plots of the differential expression results were generated with the EnhancedVolcano package (v1.18.0)19, with enhanced colouring and labelling, and are shown in Figure S2 (a-j) (see Supplementary Information document).
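The significance criteria described above amount to a three-part filter on each gene’s result row. The Python sketch below is a minimal illustration rather than the authors’ R code: the record type, the biotype label "protein_coding" (as used in Ensembl annotation) and all example values are assumptions. Genes for which DESeq2 reports an NA adjusted P-value are represented here as None and treated as not significant.

```python
from typing import NamedTuple, Optional


class DeResult(NamedTuple):
    """One row of a differential expression table (hypothetical layout)."""
    gene: str
    biotype: str
    log2_fc: float
    padj: Optional[float]  # None stands in for an NA adjusted P-value


def is_significant(r: DeResult, padj_max: float = 0.05, lfc_min: float = 2.0) -> bool:
    """Protein-coding, adjusted P <= 0.05 and |log2 FC| >= 2."""
    return (
        r.biotype == "protein_coding"
        and r.padj is not None
        and r.padj <= padj_max
        and abs(r.log2_fc) >= lfc_min
    )


# Made-up rows for illustration; these are not results from the dataset.
rows = [
    DeResult("GENE_A", "protein_coding", 3.2, 1e-8),    # up-regulated, kept
    DeResult("GENE_B", "protein_coding", -2.5, 0.01),   # down-regulated, kept
    DeResult("GENE_C", "protein_coding", 1.4, 1e-10),   # effect size too small
    DeResult("GENE_D", "lncRNA", 4.0, 1e-6),            # not protein-coding
    DeResult("GENE_E", "protein_coding", 2.8, None),    # NA adjusted P-value
]
degs = [r.gene for r in rows if is_significant(r)]
print(degs)
```

Using the absolute value of the log2 fold change keeps both up- and down-regulated genes, which matches the two-sided layout of a volcano plot.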

Fig. 2

Raw and analysis-ready data quality after mapping and controlling for unwanted batch effects. (a) Principal component analysis plot grouped by disease status. (b) DESeq2-generated dispersion parameter estimates for each gene. (c) Mean quality scores of raw sequencing data per FASTQ file across the 150 bp read length, generated with FastQC and MultiQC. All bases across reads had a mean quality score of Phred 30 or greater, except at position one, where the quality was a minimum of Phred 16 (97.5% accuracy).

The proportion of males to females within each disease group is shown in Figure S3 (see Supplementary Information document).

Data Records

Raw FASTQ data discussed in this publication have been deposited in NCBI’s Sequence Read Archive under BioProject accession PRJNA901461. Count data were deposited in NCBI’s Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4944384/) and are available under GEO Series accession numbers GSE217948 (ref. 20) and GSE282464 (ref. 21) (see supplementary file “2 PREDICT-19 RNA-samples and data quality.xlsx”). All samples in PRJNA901461 were collected for the same project as described in this article. The two series represent two different parts of the entire dataset. Samples in GSE217948 were included in previous publications22,23,24,25. In this publication, we provide additional clinical data for these samples in the supplementary file “1 PREDICT-19 clinical data.xlsx”. For some patients, additional longitudinal samples are also available in series GSE282464. All samples in GSE282464 are newly released as part of this article.

Technical Validation

Transcriptome data quality assessment

FastQC reported that all FASTQ files containing raw sequencing data had high per-sequence quality scores (Phred >30) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). This was reflected at each base across reads, except at position one, where quality was slightly lower for some reads (minimum Phred 16, equivalent to 97.5% accuracy, Fig. 2c). Adapter content and over-representation of polyA sequence were detected in the raw data. We trimmed adapters and polyA sequence and confirmed with FastQC and MultiQC that they had been successfully removed.
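The accuracy figures quoted for Phred scores follow from the standard definition Q = −10 log10(P), where P is the probability that a base call is wrong. A short Python check of the two values mentioned above (the function name is ours, chosen for illustration):

```python
def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to base-call accuracy (1 - error probability)."""
    error_probability = 10.0 ** (-q / 10.0)
    return 1.0 - error_probability


for q in (16, 30):
    print(f"Phred {q}: {phred_to_accuracy(q):.3%} accuracy")
# Phred 16: 97.488% accuracy
# Phred 30: 99.900% accuracy
```

Phred 16 therefore corresponds to roughly one expected error in 40 base calls, compared with one in 1,000 at Phred 30.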

Mapping metrics obtained with RSeQC’s tools are reported in supplementary files 2 and 3 ([2] PREDICT-19 RNA samples and data quality and [3] PREDICT19_DataDictionary; see Supplementary Information document). On average, 95.0% of reads were mapped (SD ± 2.2%), 73.4% of which were uniquely mapped (SD ± 7.0%), 17.0% were non-primary hits (SD ± 6.3%), and 36.7% (SD ± 6.5%) mapped to coding sequence on the GRCh38 primary assembly. RSeQC’s infer_experiment.py was used to confirm that libraries were reverse-stranded, with >0.7 of reads explained by “1+-,1-+,2++,2--” for all samples. The number of paired reads, read length and mapping characteristics meet or exceed Illumina’s recommendations and ENCODE’s best practice guidelines for profiling global gene expression and obtaining some information on alternative splicing26 (https://knowledge.illumina.com/library-preparation/rna-library-prep/library-preparation-rna-library-prep-reference_material-list/000001243).
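The strandedness check can be automated by reading the two fractions that infer_experiment.py prints for paired-end data. The Python sketch below is one possible way to do this and is not part of the published workflow. It assumes the report lines have the form Fraction of reads explained by "<pattern>": <value>; the example report uses made-up numbers, and the 0.7 cut-off is the one quoted in the text.

```python
import re

# Matches lines such as: Fraction of reads explained by "1+-,1-+,2++,2--": 0.95
FRACTION_LINE = re.compile(r'Fraction of reads explained by "([^"]+)":\s*([0-9.]+)')

FORWARD_PATTERN = "1++,1--,2+-,2-+"  # read 1 on the same strand as the gene
REVERSE_PATTERN = "1+-,1-+,2++,2--"  # read 1 on the opposite strand


def classify_strandedness(report: str, threshold: float = 0.7) -> str:
    """Return 'reverse', 'forward' or 'undetermined' for a paired-end report."""
    fractions = {m.group(1): float(m.group(2)) for m in FRACTION_LINE.finditer(report)}
    if fractions.get(REVERSE_PATTERN, 0.0) > threshold:
        return "reverse"
    if fractions.get(FORWARD_PATTERN, 0.0) > threshold:
        return "forward"
    return "undetermined"


# Hypothetical report text; the numbers are illustrative only.
example_report = '''This is PairEnd Data
Fraction of reads failed to determine: 0.0300
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0200
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9500
'''
print(classify_strandedness(example_report))
```

A "reverse" result corresponds to the -s reverse setting used with HTSeq-count, so a check of this kind can guard against counting reads on the wrong strand.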

Signal validation through differential expression

The PCA and volcano plots presented in this paper help to confirm the quality of our dataset. The PCA plot shows that the variance in the data can be attributed to infection status rather than to unwanted batch effects, such as collection site. This is also confirmed by differential expression analysis and the identification of gene markers reported to characterise specific infection types. For instance, IFI27 was significantly upregulated in viral infections, consistent with previous reports in which IFI27 was identified as a strong biomarker distinguishing viral from bacterial respiratory infections27. Meanwhile, the volcano plots allow us to visualise significant changes in gene expression between different conditions, providing insights into the dataset’s robustness and reliability. Together, these analyses validate the quality of our dataset and enhance our confidence in its suitability for further study. The data presented here provide a valuable resource to replicate and validate similar findings from other studies (Table S2; see Supplementary Information document) (adjusted P-value ≤ 0.05 and |log2 FC| ≥ 2).

Usage Notes

This study’s RNASeq data analysis is limited by the unequal distribution of male and female participants across specific cohorts, especially the seasonal coronavirus group (which comprised only females) and the co-infected groups. As a result, some sex-specific genes were detected as ‘DEGs’ in contrasts of infected groups versus healthy controls. This sex imbalance may introduce biases, affecting the generalizability of the results. Future research should aim for a more balanced representation of the sexes to ensure broader applicability and minimise potential biases. In addition, the lack of follow-up data for some infected individuals or groups could limit analytical power. This dataset (GSE282464 and associated SRA records from PRJNA901461) provides newly released clinical and transcriptomic data that complement prior datasets. While this dataset can be analyzed independently for specific research applications, it is also designed to be interoperable with previously published datasets, particularly PRJNA901461.