Introduction

Whole genome sequencing (WGS) using short-read sequencing has become a standard approach for unraveling the genetic basis of complex diseases. Short-read sequencing provides more information than genotype array data and is substantially cheaper than long-read sequencing. Several large-scale sequencing studies, such as the TOPMed program1 and the UK Biobank2, have already reported promising new findings. A major challenge in all WGS projects is the sheer amount of data. For example, at an average coverage of approximately 35×, the per-sample file size of the fastq.gz files containing the raw sequences is approximately 65 gigabytes (GB). Additional files are generated during pre-processing, and sequence data after mapping and alignment are generally stored as BAM or CRAM files3, with BAM being the current standard. At 35× coverage, these files require approximately 55 GB (BAM) or 15 GB (CRAM) per sample. If data are stored in a cloud operated by one of the commercial vendors, storage costs are approximately 0.17 USD per GB per year, assuming a 50:50 split between archive and work storage (Table 1). The annual cost for storing both the fastq.gz and BAM files of a single sample (in total 130 GB) is thus approximately 22 USD. Storing genomic data for a decade may, therefore, be even more expensive than generating it initially4, and efficient data compression technologies can help reduce the cost of long-term data storage and file transfer.
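The cost estimate above can be reproduced with a few lines of arithmetic. The following Python sketch uses only the figures quoted in the text (0.17 USD per GB per year and 130 GB per sample); the function and variable names are illustrative, and a constant price over time is assumed.

```python
# Figures quoted in the text; names are illustrative only.
PRICE_USD_PER_GB_YEAR = 0.17  # assumes a 50:50 mix of archive and work storage
SIZE_GB_PER_SAMPLE = 130      # fastq.gz plus BAM, total as stated in the text


def storage_cost_usd(size_gb, years=1, price=PRICE_USD_PER_GB_YEAR):
    """Cost of keeping size_gb in cloud storage for a number of years.

    Assumes the price per GB and year stays constant.
    """
    return size_gb * price * years


annual_cost = storage_cost_usd(SIZE_GB_PER_SAMPLE)
decade_cost = storage_cost_usd(SIZE_GB_PER_SAMPLE, years=10)
print(f"per year: {annual_cost:.0f} USD, per decade: {decade_cost:.0f} USD")
```

Under these assumptions, ten years of storage cost roughly ten times the annual amount per sample, which is the basis for comparing storage costs with the cost of data generation.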

Table 1 Prices for cloud data storage in Zurich, Switzerland.

General-purpose algorithms for compressing FASTQ files have been reviewed4,5,6, and 19 general-purpose tools have recently been benchmarked7. However, the current standard is to transfer already compressed fastq.gz files. In the past few years, specialized programs have been released to further compress fastq.gz and BAM files.

In this work, we benchmarked four fastq.gz compression programs: DRAGEN ORA8, Genozip9, repaq10, and SPRING11. These packages support paired-end read compression, which shrinks file sizes by a further 10–15% compared to compressing the two files of a read pair independently (https://www.genozip.com/fastq, accessed: March 20, 2025). Furthermore, we compared compression approaches for BAM files by converting BAM files to CRAM 3.0 and CRAM 3.1 using SAMtools12, and by compressing BAM files with Genozip. For all comparisons, we used three subjects from the Genome in a Bottle (GIAB) consortium, which were sequenced as part of the quality control of the genetic sequencing study Hamburg Davos (GENESIS-HD), using short-read sequencing on an Illumina NovaSeq 600013,14.

Methods

Subject and sample preparation

The whole genome sequencing study and its data pre-processing steps have been described in detail elsewhere13,14. In brief, three GIAB subjects, forming a trio of Ashkenazi Jewish ancestry, were ordered from the Coriell Institute (NA24385, NIST ID HG002; NA24149, NIST ID HG003; and NA24143, NIST ID HG004)15. DNA concentrations were measured by Qubit. The library was constructed according to the Illumina TruSeq DNA PCR Free Library Prep protocol HT (Illumina Inc., San Diego, CA, USA) for WGS, and protocol steps were: (1) fragmentation of 1 μg genomic DNA to 350 bp inserts by Covaris LE220-plus, (2) cleanup of fragmented DNA, (3) end repair, (4) removal of large and small DNA fragments, (5) 3′-end adenylation, and (6) adapter ligation. The resulting library was quantified and quality-assessed with the iSeq100 (Illumina). All GIAB subjects were sequenced with the NovaSeq 6000 platform (Illumina) using S4 flow cells with 300 cycles (2 × 150 reads), and each sample was sequenced twice to reach an average coverage of 35×. GIAB subjects HG002, HG003, and HG004 were sequenced and called 70, 8, and 4 times, respectively.

Variant calling

The raw sequencing files (base call files, BCL) were converted to FASTQ format and demultiplexed in a single step, without adapter trimming, on a single DRAGEN server, using Illumina’s bcl2fastq version 2.20 program16 from DRAGEN 3.8.417. Mapping, alignment, and variant calling were performed using DRAGEN 3.8.4. The integrated soft read trimmer was used with default parameters, and the human reference genome hg38 was used for mapping.

Compression software

We identified six software packages specifically tailored to compress paired-end fastq.gz data. Illumina’s DRAGEN ORA8 was run with DRAGEN version 4.3.4 on a DRAGEN v3 on-premises server equipped with an Intel Xeon Gold 6226 CPU (2.7 GHz clock speed). ORA does not use FPGA acceleration. PetaGene was unwilling to sell a license for this benchmark experiment (personal correspondence), and we therefore had to exclude PetaGene’s PetaSuite18 from this study. We excluded GeneSqueeze from this comparison because the GeneSqueeze authors stated that their software had “drawbacks in speed and memory usage when compared to SPRING”19. Genozip version 15.0.62, repaq version 0.3.0, SAMtools version 1.20, and SPRING version 1.1.1 were run on a local high-performance computing cluster at Cardio-CARE (Davos, CH). Compute nodes were equipped with 2 AMD EPYC 7742 CPUs (2.25 GHz clock speed), 2 TB RAM, and approximately 11 TB of NVMe storage, and ran the Rocky Linux 9.2 operating system. ORA and Genozip support automatic computation of the MD5sum during compression. repaq offers an option to check the consistency of the compressed files. These options were disabled during compression because checksum computation might affect runtime. Finally, with Genozip and SAMtools, we identified two software packages specifically tailored to compress BAM files.
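Independently of the built-in checksum options mentioned above, a lossless round trip can be checked by comparing MD5 sums of the original and the restored file. The following Python sketch is an illustration and not part of the benchmark code; the function names are hypothetical. Note that such a comparison is most meaningful on uncompressed FASTQ content, because two gzip files with identical content can differ at the byte level.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks of 1 MiB."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals the end of the file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_lossless(original_path, restored_path):
    """True if both files have the same MD5 digest."""
    return md5sum(original_path) == md5sum(restored_path)
```

Reading in chunks keeps memory usage small even for FASTQ files of many gigabytes.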

fastq.gz compression

Each of the 82 GIAB samples had 16 fastq.gz files since paired-end sequencing was done on two flow cells with four lanes each. fastq.gz files were compressed using ORA, Genozip, repaq, and SPRING. All compression tools but repaq were run sequentially with 8 threads per read pair. repaq was run with a single thread per read pair because a higher number of threads resulted in a lower compression ratio. For both compression and decompression, the total runtime per sample was defined as the sum of the runtimes of its read pairs. The compression ratio was calculated by dividing the total file size per sample of the fastq.gz files by the total file size per sample after compression. Decompression time was measured by outputting FASTQ files, not fastq.gz files. Memory consumption is reported as the total memory consumption per sample in GB.
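The per-sample definitions above can be written down as a short aggregation. The following Python sketch assumes that file sizes and runtimes were recorded per read pair; the dictionary keys and the numbers in the example are made up for illustration and are not measurements from this study.

```python
def per_sample_summary(read_pairs):
    """Aggregate per-read-pair measurements into per-sample values.

    read_pairs: list of dicts with the hypothetical keys
    'original_gb', 'compressed_gb', and 'runtime_s'.
    """
    original = sum(pair["original_gb"] for pair in read_pairs)
    compressed = sum(pair["compressed_gb"] for pair in read_pairs)
    return {
        "original_gb": original,
        "compressed_gb": compressed,
        # Ratio of total sizes, not the mean of per-pair ratios.
        "compression_ratio": original / compressed,
        # Read pairs were processed sequentially, so runtimes add up.
        "runtime_s": sum(pair["runtime_s"] for pair in read_pairs),
    }


# Toy example: 8 read pairs (2 flow cells x 4 lanes) with made-up values.
toy_pairs = [
    {"original_gb": 8.0, "compressed_gb": 2.0, "runtime_s": 100.0}
    for _ in range(8)
]
summary = per_sample_summary(toy_pairs)
print(summary)
```

Dividing the summed sizes weights large read pairs more strongly than averaging the per-pair ratios would.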

BAM compression

The BAM file of each GIAB sample was compressed with 8 threads with SAMtools and Genozip.
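For orientation, a BAM-to-CRAM conversion with SAMtools could be assembled as in the following Python sketch. This is not the command used in this study, which is given in the Supplementary Material; the file names are placeholders, and the sketch assumes a SAMtools release that supports CRAM 3.1 output and a reference FASTA matching the one used for alignment.

```python
import shlex


def bam_to_cram_command(bam, cram, reference, cram_version="3.1", threads=8):
    """Build (but do not run) a samtools command converting BAM to CRAM.

    All paths are placeholders supplied by the caller.
    """
    return [
        "samtools", "view",
        "-@", str(threads),                 # threads used by samtools
        "-T", reference,                    # reference FASTA needed for CRAM
        "-O", f"cram,version={cram_version}",
        "-o", cram,
        bam,
    ]


command = bam_to_cram_command("sample.bam", "sample.cram", "hg38.fa")
print(shlex.join(command))
# The list could be executed with subprocess.run(command, check=True).
```

Building the command as a list avoids shell quoting problems when file names contain spaces.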

Statistical analysis

All statistical analyses were performed with R version 4.4.020. All figures were created with ggplot2 version 3.5.121.
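Results in this work are summarized as medians with quartiles. The analyses themselves were done in R; the following Python sketch only illustrates this type of summary. It assumes the "inclusive" quantile method, which interpolates linearly between data points, and the input values are made up for illustration.

```python
import statistics


def median_with_quartiles(values):
    """Return (median, lower quartile, upper quartile) of the values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return median, q1, q3


def format_summary(values, digits=1):
    """Format values in the style 'median (q1–q3)'."""
    median, q1, q3 = median_with_quartiles(values)
    return f"{median:.{digits}f} ({q1:.{digits}f}–{q3:.{digits}f})"


# Made-up file sizes in GB, for illustration only.
toy_sizes_gb = [10, 11, 12, 13, 14]
toy_summary = format_summary(toy_sizes_gb)
print(toy_summary)
```

Different quantile definitions can give slightly different quartiles for small samples, so the method should be stated when results are compared across tools.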

Computer codes

The code for the compression and decompression of the files is provided in the Supplementary Material.

Results

Overview of compression abilities

Table 2 provides an overview of the capabilities of the different software packages. Only Genozip compressed all data formats, i.e., fastq.gz, BAM, CRAM, and GVCF files. SAMtools could compress BAM files only. ORA, repaq, and SPRING were restricted to compressing fastq.gz files.

Table 2 Ability of software to compress different data formats.

fastq.gz compression and decompression

Table 3 reports file sizes for the compressed fastq.gz files and runtimes separately for compression and decompression. While the median original fastq.gz file size was 68.3 (quartiles: 66.4–70.8) GB for the 82 GIAB samples, the compressed files were only 11.4 (11.0–12.1) GB for Genozip and 12.1 (11.7–12.9) GB for ORA (Table 3, Fig. 1). Median compression ratios were thus 1:5.99 and 1:5.64 for Genozip and ORA, respectively. Median compression ratios were only 1:1.99 and 1:3.79 for repaq and SPRING, respectively.

Table 3 File size (in gigabytes (GB)), runtimes (in seconds), and memory usage (in GB) for compression and decompression of fastq.gz files. Displayed are medians (quartiles in parentheses) for the n = 82 GIAB samples.
Fig. 1 File size (in gigabytes (GB)) for fastq.gz and compressed fastq.gz files (a) and memory usage (in GB) during compression and decompression (b) for n = 82 GIAB samples.

ORA had the lowest runtimes (Table 3) among the four packages, and ORA was 15 to 16 times faster than SPRING and repaq. However, on-site ORA compression can only be run on DRAGEN servers. In contrast, Genozip can be run on any CPU-based system and compressed the fastq.gz files more than 10 times faster than repaq and SPRING.

All software packages could be run on CPU systems for decompression. Here, ORA decompression was approximately twice as fast as Genozip decompression. Both packages substantially outperformed repaq and SPRING (Table 3). Finally, SPRING outperformed repaq in compression ratio and in both compression and decompression runtimes.

Table 3 and Fig. 1 additionally report memory usage for compression and decompression. All packages required more memory during compression than during decompression. SPRING showed the highest memory usage during compression with 57.3 (quartiles: 55.6–58.7) GB and during decompression with 48.0 (46.7–48.5) GB. In contrast, repaq required the least memory during compression (5.5 GB; 5.5–5.5 GB) and decompression (3.8 GB; 3.7–3.9 GB). Genozip used less memory than ORA during both compression and decompression (Genozip compression: 11.0 (10.9–11.1) GB; ORA compression: 40.0 (40.0–40.0) GB; Genozip decompression: 3.8 (3.7–3.9) GB; ORA decompression: 5.5 (5.5–5.6) GB).

BAM compression

Table 4 summarizes file sizes and runtimes from the compression of BAM files using SAMtools and Genozip. Genozip had the highest compression ratio (Table 4, Fig. 2) at the cost of a higher runtime and approximately 13 times higher memory consumption compared to SAMtools (Table 4, Fig. 2). CRAM files were generated using SAMtools, and CRAM 3.1 offered a slightly better compression ratio than CRAM 3.0 at almost identical runtimes, albeit with slightly higher memory usage (Table 4, Fig. 2b). We note that Genozip can also compress CRAM 3.0 and CRAM 3.1 files. The resulting files were as large as those obtained by compressing BAM files (details not shown).

Table 4 File size (in gigabytes (GB)), compression time (in seconds), and memory usage (in GB) of BAM files. Displayed are medians (quartiles in parentheses) for the n = 82 GIAB samples.
Fig. 2 File size (in gigabytes (GB)) for BAM and compressed BAM files (a) and memory usage (in GB) during compression (b) for n = 82 GIAB samples.

Discussion

This benchmark study on the compression of paired-end short-read sequence data revealed that Genozip and ORA had the highest compression ratios of approximately 1:6 for fastq.gz files. They also had the shortest compression and decompression times. In contrast, repaq and SPRING had lower compression ratios of approximately 1:2 and 1:4, respectively, and they also took more than 10× longer for file compression than Genozip and approximately 15× longer than ORA. SPRING showed the highest memory usage for both compression and decompression, whereas memory usage was lowest with repaq. Genozip used less memory than ORA. There were thus substantial differences between the software packages. All four packages offered lossless compression and the reconstruction of MD5sums. However, SPRING needs to be run single-threaded for MD5sum reconstruction, at the cost of a substantial increase in runtime. Unexpectedly, the observed compression ratio for Genozip of 1:6 was obtained in the lossless compression mode, and it was therefore substantially higher than the compression ratios originally reported in 20219. A possible explanation is that the results of Lan et al.9 were based on FASTQ files generated by a HiSeq sequencer, which does not bin quality scores. Running Genozip in the optimized mode bins these quality scores, which results in higher compression ratios. Newer sequencers automatically bin these quality scores.
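The effect of quality score binning on compressibility can be illustrated with a toy experiment. In the following Python sketch, the bin boundaries and representative values are hypothetical and do not reproduce the scheme of any sequencer or of Genozip, zlib serves as a stand-in for a real FASTQ compressor, and the quality scores are uniformly random rather than realistic.

```python
import random
import zlib


def bin_phred(score):
    """Map a Phred score to one of four values (hypothetical bin edges)."""
    if score < 10:
        return 2
    if score < 20:
        return 12
    if score < 30:
        return 23
    return 37


def to_quality_string(scores):
    """Encode Phred scores as a FASTQ quality string (Phred+33, ASCII)."""
    return "".join(chr(score + 33) for score in scores).encode("ascii")


rng = random.Random(0)  # fixed seed so the toy experiment is repeatable
raw_scores = [rng.randint(2, 40) for _ in range(100_000)]
binned_scores = [bin_phred(score) for score in raw_scores]

raw_size = len(zlib.compress(to_quality_string(raw_scores), 9))
binned_size = len(zlib.compress(to_quality_string(binned_scores), 9))
print(f"unbinned: {raw_size} bytes, binned: {binned_size} bytes")
```

Binning reduces the number of distinct symbols in the quality string, which lowers its entropy and therefore the size achievable by any lossless compressor. Binning itself discards information, so it must be distinguished from the lossless compression of already binned data.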

BAM files could be compressed with Genozip or converted to CRAM files with SAMtools. The advantage of CRAM files is that they can be directly read by many standard software packages. The new file standard CRAM 3.1 offers an almost 20% reduction in file size compared to CRAM 3.0 (Table 4)3, and CRAM 3.1 files had a compression ratio of approximately 1:4 compared to BAM files. In our benchmark study, CRAM 3.1 files were just 14% larger than Genozip-compressed BAM files (Table 4), but their generation used approximately 13 times less memory (Fig. 2b). Although the CRAM format has clear advantages and most software packages can easily handle CRAM files, its adoption seems rather slow. Because Genozip-compressed files must be decompressed before they can be further used, we prefer using CRAM 3.1 files. Moreover, the random access to sequences is slow for Genozip, and specialized algorithms have been developed for efficient random access to sequences22. However, when a hundred thousand CRAM 3.1 files need to be stored, the extra 14% file size reduction of Genozip-compressed CRAM may be worth the compression effort because approximately 200 terabytes of storage could be saved. An alternative would be deleting the intermediate BAM/CRAM files because they can be regenerated. However, we stress that the need for reproducibility may hinder the deletion of the BAM files. In contrast, the long-term storage of the fastq.gz files is a must, and the efficient compression of these files thus has the highest priority.
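The order of magnitude of the storage saving mentioned above can be checked with a short calculation. The following Python sketch assumes a CRAM file size of approximately 15 GB per sample, as quoted in the Introduction, and uses the 14% size difference from the text; the variable names are illustrative.

```python
N_FILES = 100_000          # number of samples considered in the text
CRAM_GB = 15.0             # assumed CRAM size per sample (Introduction)
CRAM_OVER_GENOZIP = 1.14   # CRAM 3.1 files are 14% larger than Genozip files

# Size of the corresponding Genozip-compressed file under these assumptions.
genozip_gb = CRAM_GB / CRAM_OVER_GENOZIP

# Total saving, converted from GB to TB (1 TB = 1000 GB).
saved_tb = N_FILES * (CRAM_GB - genozip_gb) / 1000
print(f"approximately {saved_tb:.0f} TB saved")
```

Under these assumptions the saving is somewhat below 200 TB, which agrees with the rounded figure given in the text.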

Genozip is the most complete package for file compression because it can compress the various file formats generated during secondary analysis, ranging from fastq.gz to GVCF files9. ORA compression can only be performed on special DRAGEN servers and requires a separate license, which is by default available with the newest Illumina sequencers but not necessarily with older sequencing machines. However, decompression and ingestion of FASTQ.ORA files into the DRAGEN map/align does not require a license. In our opinion, ORA compression is ideally done in a single step together with BCL demultiplexing and conversion to fastq.gz files, immediately after Illumina sequencing. The novel Illumina NovaSeq X Series sequencers have DRAGEN servers on board so that ORA compression of fastq.gz files is possible directly on the sequencing machine.

One limitation of our benchmark study is that we focused on the compression of paired-end short-read Illumina sequences. Compression of long-read sequences, such as those from Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT), has not been considered in this work23. Moreover, we could not include PetaGene’s PetaSuite for comparison18. General compression tools were also not considered because paired-end compression generally leads to higher compression ratios9. Furthermore, we did not investigate software packages that permit the compression of GVCF files, such as VCFShark24, Genotype Sparse Compression (GSC)25, or VCF Zarr26. However, the GVCF files after variant calling require approximately 3 GB per sample, thus only 5% of the size of the original fastq.gz files. Efficient compression of individual GVCF files seems to be important if the files are to be transferred. However, in large-scale association studies using WGS data, the focus might be on the efficient storage of the multi-sample VCF file. The efficient compression of individual GVCF files and of multi-sample VCF files may thus be investigated in a separate benchmark study.

Conclusions

fastq.gz files may be compressed with a compression ratio of approximately 1:6 using Genozip or ORA compression. Genozip supports the compression of fastq.gz, BAM, CRAM, and (G)VCF formats. Although it requires an annual license, its source code is freely available, ensuring sustainability. The commercial tools Genozip and ORA offer higher compression ratios than the freely available tools SPRING and repaq. SAMtools may be preferable for compressing BAM files because many software packages can directly read the produced CRAM files.