Abstract
Efficient data compression technologies are crucial to reduce the cost of long-term storage and file transfer in whole genome sequencing studies. This study benchmarked four specialized compression tools developed for paired-end fastq.gz files, namely DRAGEN ORA 4.3.4 (ORA), Genozip 15.0.62, repaq 0.3.0, and SPRING 1.1.1, using three subjects from the Genome in a Bottle (GIAB) consortium that were sequenced a total of 82 times on an Illumina NovaSeq 6000, with an average coverage of 35×. It additionally compared Genozip with SAMtools 1.20 for the compression of BAM files. All tools provided lossless compression. ORA and Genozip achieved compression ratios of approximately 1:6 when compressing fastq.gz files. repaq and SPRING had lower compression ratios of 1:2 and 1:4, respectively, and took longer for both compression and decompression than ORA and Genozip. Genozip had approximately 16% higher compression for BAM files than SAMtools. However, the BAM compression of SAMtools produces CRAM files, which are compatible with many software packages. ORA, repaq, and SPRING are limited to compressing fastq.gz files, while Genozip supports various file formats. Although Genozip requires an annual license, its source code is freely available, ensuring sustainability. In conclusion, paired-end short-read sequence data can be efficiently compressed using specialized compression software. Commercial tools offer higher compression ratios than freely available software.
Introduction
Whole genome sequencing (WGS) using short-read sequencing has become a standard approach for unraveling the genetic basis of complex diseases. Short-read sequencing provides more information than genotype array data and is substantially cheaper than long-read sequencing. Several large-scale sequencing studies have already reported promising new findings, such as the TOPMed program [1] and the UK Biobank [2]. A major challenge in all WGS projects is the sheer amount of data. For example, at an average coverage of approximately 35×, the per-sample file size of the fastq.gz files containing the raw sequences is approximately 65 gigabytes (GB). Additional files are generated during pre-processing, and sequence data after mapping and alignment are generally stored as BAM or CRAM files [3], with BAM being the current standard. At 35× coverage, these files require approximately 55 GB (BAM) or 15 GB (CRAM) per sample. If data are stored in a cloud operated by one of the commercial vendors, storage costs are approximately 0.17 USD per GB per year, assuming a 50:50 split between archive and work storage (Table 1). The annual cost for storing both the fastq.gz and BAM files of a single sample (in total 130 GB) is thus approximately 22 USD. Storing genomic data for a decade may, therefore, be even more expensive than generating it initially [4], and efficient data compression technologies can help reduce the cost of long-term data storage and file transfer.
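The storage cost quoted above follows from a single multiplication. The short sketch below reproduces it, assuming a flat price per GB and year; the function name `annual_storage_cost_usd` is illustrative and not part of the study.

```python
def annual_storage_cost_usd(size_gb: float, usd_per_gb_year: float = 0.17) -> float:
    """Annual cost of keeping size_gb in cloud storage at a flat per-GB price."""
    return size_gb * usd_per_gb_year


# Figures quoted in the text: 130 GB per sample (fastq.gz plus BAM)
# at approximately 0.17 USD per GB per year.
per_sample = annual_storage_cost_usd(130)
print(f"{per_sample:.2f} USD per sample per year")         # 22.10
print(f"{10 * per_sample:.0f} USD per sample per decade")  # 221
```

Real cloud pricing is tiered and changes over time, so the result should be read as an order-of-magnitude estimate.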
General-purpose algorithms for compressing FASTQ files have been reviewed [4,5,6], and 19 general-purpose tools have recently been benchmarked [7]. However, the current standard is to transfer already compressed fastq.gz files. In the past few years, specialized programs have been released to further compress fastq.gz and BAM files.
In this work, we benchmarked four fastq.gz compression programs: DRAGEN ORA [8], Genozip [9], repaq [10], and SPRING [11]. These packages support joint compression of paired-end reads, which shrinks file sizes by a further 10–15% compared to compressing the two files independently (https://www.genozip.com/fastq, accessed: March 20, 2025). Furthermore, we compared compression approaches for BAM files by converting BAM files to CRAM 3.0 and CRAM 3.1 using SAMtools [12], and by compressing BAM files with Genozip. For all comparisons, we used three subjects from the Genome in a Bottle (GIAB) consortium, which were sequenced as part of the quality control of the genetic sequencing study Hamburg Davos (GENESIS-HD), using short-read sequencing on an Illumina NovaSeq 6000 [13,14].
Methods
Subject and sample preparation
The whole genome sequencing study and its data pre-processing steps have been described in detail elsewhere [13,14]. In brief, three GIAB subjects, forming a trio of Ashkenazi Jewish ancestry, were ordered from the Coriell Institute (NA24385, NIST ID HG002; NA24149, NIST ID HG003; and NA24143, NIST ID HG004) [15]. DNA concentrations were measured by Qubit. The library was constructed according to the Illumina TruSeq DNA PCR Free Library Prep protocol HT (Illumina Inc., San Diego, CA, USA) for WGS, and protocol steps were: (1) fragmentation of 1 μg genomic DNA to 350 bp inserts by Covaris LE220-plus, (2) cleanup of fragmented DNA, (3) end repair, (4) removal of large and small DNA fragments, (5) 3′-end adenylation, and (6) adapter ligation. The resulting library was quantified and quality-assessed with the iSeq100 (Illumina). All GIAB subjects were sequenced with the NovaSeq 6000 platform (Illumina) using S4 flow cells with 300 cycles (2 × 150 reads) and were measured twice to reach an average coverage of 35×. GIAB subjects HG002, HG003, and HG004 were sequenced and called 70, 8, and 4 times, respectively.
Variant calling
The raw sequencing files (base call files, BCL) were converted to FASTQ format and demultiplexed in a single step, without adapter trimming, on a single DRAGEN computer, using Illumina’s bcl2fastq version 2.20 program [16] from DRAGEN 3.8.4 [17]. Mapping, alignment, and variant calling were performed using DRAGEN 3.8.4. The integrated soft read trimmer was used with default parameters, and the human reference genome hg38 was used for mapping.
Compression software
We identified six software packages specifically tailored to compress paired-end fastq.gz data. Illumina’s DRAGEN ORA [8], version 4.3.4, was run on a DRAGEN v3 on-premises server equipped with an Intel Xeon Gold 6226 CPU (2.7 GHz clock speed). ORA does not use FPGA acceleration. PetaGene was unwilling to sell a license for this benchmark experiment (personal correspondence), and we therefore had to exclude PetaGene’s PetaSuite [18] from this study. We excluded GeneSqueeze from this comparison because its authors stated that their software had “drawbacks in speed and memory usage when compared to SPRING” [19]. Genozip version 15.0.62, repaq version 0.3.0, SAMtools version 1.20, and SPRING version 1.1.1 were run on a local high-performance computing cluster at Cardio-CARE (Davos, CH). Compute nodes were equipped with 2 AMD EPYC 7742 CPUs (2.25 GHz clock speed), 2 TB RAM, approximately 11 TB NVMe storage, and the Rocky Linux 9.2 operating system. ORA and Genozip support automatic computation of the MD5sum during compression, and repaq offers an option to check the consistency of the compressed files. These options were disabled during compression because checksum computation might affect runtime. Finally, with Genozip and SAMtools, we identified two software packages specifically tailored to compress BAM files.
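Because the built-in checksum options were switched off during timing, losslessness can also be confirmed independently after decompression. The following is a minimal sketch, not the authors' supplementary code: it compares the MD5 digest of the uncompressed content of an original fastq.gz file with that of the restored FASTQ file. The function names and file paths are hypothetical.

```python
import gzip
import hashlib


def md5_of_stream(path: str, gzipped: bool, chunk_size: int = 1 << 20) -> str:
    """MD5 hex digest of a file's uncompressed content, read in chunks."""
    opener = gzip.open if gzipped else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        # Read fixed-size chunks so that multi-GB files never sit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_lossless_roundtrip(original_fastq_gz: str, restored_fastq: str) -> bool:
    """True if the restored FASTQ is byte-identical to the gunzipped original."""
    original = md5_of_stream(original_fastq_gz, gzipped=True)
    restored = md5_of_stream(restored_fastq, gzipped=False)
    return original == restored
```

A call such as `is_lossless_roundtrip("sample_R1.fastq.gz", "sample_R1.restored.fastq")` would then return `True` for a lossless round trip. The comparison is made on the uncompressed content because a re-gzipped file can differ at the byte level even when the reads are identical.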
fastq.gz compression
Each of the 82 GIAB samples had 16 fastq.gz files, i.e., 8 read pairs, because paired-end sequencing was done on two flow cells with four lanes each. fastq.gz files were compressed using ORA, Genozip, repaq, and SPRING. All compression tools except repaq were run sequentially with 8 threads per read pair. repaq was run with a single thread per read pair because a higher number of threads resulted in a lower compression ratio. The total runtime per sample, for compression and for decompression, was defined as the sum of the runtimes of its read pairs. The compression ratio was calculated by dividing the total size per sample of the fastq.gz files by the total size per sample after compression. Decompression time was measured by outputting FASTQ files, not fastq.gz files. Memory consumption is reported as the total memory consumption per sample in GB.
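The per-sample definitions above can be sketched in a few lines. This is an illustration of the stated definitions rather than the code from the Supplementary Material; the class `ReadPairRun`, the function `summarize_sample`, and the numbers in the usage example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadPairRun:
    """Measurements for one read pair (the R1 and R2 files of one lane)."""
    original_bytes: int    # combined size of both fastq.gz files
    compressed_bytes: int  # size of the tool's output for this pair
    runtime_s: float       # wall-clock seconds spent on this pair


def summarize_sample(runs: list[ReadPairRun]) -> dict[str, float]:
    """Aggregate read-pair measurements to one sample.

    Runtime is the sum over read pairs; the compression ratio is the
    total original size divided by the total compressed size.
    """
    total_original = sum(run.original_bytes for run in runs)
    total_compressed = sum(run.compressed_bytes for run in runs)
    return {
        "runtime_s": sum(run.runtime_s for run in runs),
        "compression_ratio": total_original / total_compressed,
    }


# Eight identical, made-up read pairs standing in for one sample.
example = summarize_sample([ReadPairRun(8_000, 2_000, 60.0)] * 8)
print(example)  # {'runtime_s': 480.0, 'compression_ratio': 4.0}
```

Dividing the totals, rather than averaging per-pair ratios, weights each read pair by its size.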
BAM compression
The BAM file of each GIAB sample was compressed with 8 threads with SAMtools and Genozip.
Statistical analysis
All statistical analyses were performed with R version 4.4.0 [20]. All figures were created with ggplot2 version 3.5.1 [21].
Computer codes
The code for the compression and decompression of the files is provided in the Supplementary Material.
Results
Overview of compression abilities
Table 2 provides an overview of the capabilities of the different software packages. Only Genozip compressed all data formats, i.e., fastq.gz, BAM, CRAM, and GVCF files. SAMtools could compress BAM files only. ORA, repaq, and SPRING were restricted to compressing fastq.gz files.
fastq.gz compression and decompression
Table 3 reports file sizes for the compressed fastq.gz files and runtimes separately for compression and decompression. While the median size of the original fastq.gz files was 68.3 (quartiles: 66.4–70.8) GB for the 82 GIAB samples, the compressed files had a median size of only 11.4 (11.0–12.1) GB for Genozip and 12.1 (11.7–12.9) GB for ORA (Table 3, Fig. 1). Median compression ratios were thus 1:5.99 and 1:5.64 for Genozip and ORA, respectively. In contrast, median compression ratios were only 1:1.99 and 1:3.79 for repaq and SPRING, respectively.
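As a quick plausibility check, the reported median ratios for Genozip and ORA can be recovered from the median file sizes quoted above. The sketch assumes that dividing the medians is an acceptable shortcut; the paper's values are presumably medians of per-sample ratios, which need not coincide exactly with a ratio of medians.

```python
# Median file sizes in GB as quoted in the text.
original_gb = 68.3
compressed_gb = {"Genozip": 11.4, "ORA": 12.1}

# Ratio of medians, formatted in the paper's 1:x notation.
ratios = {tool: original_gb / size for tool, size in compressed_gb.items()}
for tool, ratio in ratios.items():
    print(f"{tool}: 1:{ratio:.2f}")
# Genozip: 1:5.99
# ORA: 1:5.64
```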
ORA had the lowest runtimes (Table 3) among the four packages, and ORA was 15 to 16 times faster than SPRING and repaq. However, on-site ORA compression can only be run on DRAGEN servers. In contrast, Genozip can be run on any CPU-based system and compressed the fastq.gz files more than 10 times faster than repaq and SPRING.
All software packages could be run on CPU systems for decompression. Here, ORA decompression was approximately twice as fast as Genozip decompression. Both packages substantially outperformed repaq and SPRING (Table 3). Finally, SPRING outperformed repaq in compression ratio, and both compression and decompression runtimes.
Table 3 and Fig. 1 additionally report memory usage for compression and decompression. All packages required more memory during compression than during decompression. SPRING showed the highest memory usage during compression with 57.3 (quartiles: 55.6–58.7) GB and during decompression with 48.0 (46.7–48.5) GB. In contrast, repaq required the least memory during compression (5.5 GB; 5.5–5.5 GB) and decompression (3.8 GB; 3.7–3.9 GB). Genozip used less memory than ORA during both compression and decompression (Genozip compression: 11.0 (10.9–11.1) GB; ORA compression: 40.0 (40.0–40.0) GB; Genozip decompression: 3.8 (3.7–3.9) GB; ORA decompression: 5.5 (5.5–5.6) GB).
BAM compression
Table 4 summarizes file sizes and runtimes from the compression of BAM files using SAMtools and Genozip. Genozip had the highest compression ratio at the cost of a longer runtime and approximately 13 times higher memory consumption compared to SAMtools (Table 4, Fig. 2). CRAM files were generated using SAMtools, and CRAM 3.1 offered a slightly better compression ratio than CRAM 3.0 at almost identical runtimes, albeit with slightly higher memory usage (Table 4, Fig. 2B). We note that Genozip can also compress CRAM 3.0 and CRAM 3.1 files. The resulting files were as large as the Genozip files compressed from BAM files (details not shown).
Discussion
This benchmark study on the compression of paired-end short-read sequence data revealed that Genozip and ORA had the highest compression ratios of approximately 1:6 for fastq.gz files. They also had the shortest compression and decompression times. In contrast, repaq and SPRING had lower compression ratios of approximately 1:2 and 1:4, respectively, and they also took more than 10× longer for file compression than Genozip and approximately 15× longer than ORA. SPRING showed the highest memory usage for both compression and decompression, whereas memory usage was lowest with repaq. Genozip used less memory than ORA. There were thus substantial differences between the software packages. All four packages offered lossless compression and the reconstruction of MD5sums. However, SPRING needs to be run single-threaded for MD5sum reconstruction at the cost of a substantial increase in runtime. Unexpectedly, the observed compression ratio for Genozip of 1:6 was obtained in the lossless compression mode, and it was substantially higher than the compression ratios originally reported in 2021 [9]. A possible explanation is that the results of Lan et al. [9] were based on FASTQ files generated by a HiSeq sequencer, which does not bin quality scores. Running Genozip in the optimized mode bins these quality scores, which results in higher compression ratios. Newer sequencers automatically bin these quality scores.
BAM files could be compressed with Genozip or converted to CRAM files with SAMtools. The advantage of CRAM files is that they can be directly read by many standard software packages. The new file standard CRAM 3.1 offers an almost 20% reduction in file size compared to CRAM 3.0 (Table 4) [3], and CRAM 3.1 files had a compression ratio of 1:4 compared to BAM files. In our benchmark study, CRAM 3.1 files were just 14% larger than Genozip-compressed BAM files (Table 4), and their generation used approximately 13 times less memory (Fig. 2B). Although the CRAM format has clear advantages, its adoption seems rather slow, even though most software packages can easily handle CRAM files. Because Genozip-compressed files must be decompressed before they can be further used, we prefer using CRAM 3.1 files. Moreover, random access to sequences is slow for Genozip, and specialized algorithms have been developed for efficient random access to sequences [22]. However, when a hundred thousand CRAM 3.1 files need to be stored, the extra 14% file size reduction of Genozip-compressed CRAM may be worth the compression effort because approximately 200 terabytes of storage could be saved. An alternative would be deleting the intermediate BAM/CRAM files because they can be regenerated. However, we stress that the need for reproducibility may hinder the deletion of the BAM files. In contrast, the long-term storage of the fastq.gz files is a must, and the efficient compression of these files thus has the highest priority.
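The figure of approximately 200 terabytes can be reproduced with a back-of-the-envelope calculation. The sketch below assumes a CRAM file of 15 GB per sample (the value given in the Introduction, which does not specify the CRAM version) and 1 TB = 1000 GB. It evaluates two readings of the 14% figure, because the text uses it both as "CRAM 3.1 is 14% larger than Genozip" and as a "14% file size reduction". The function names are illustrative.

```python
def saving_tb_if_cram_is_larger(n_files: int, cram_gb: float, excess: float = 0.14) -> float:
    """Saving in TB if a CRAM file is `excess` (14%) larger than the Genozip file."""
    genozip_gb = cram_gb / (1.0 + excess)
    return n_files * (cram_gb - genozip_gb) / 1000.0


def saving_tb_if_genozip_is_smaller(n_files: int, cram_gb: float, reduction: float = 0.14) -> float:
    """Saving in TB if the Genozip file is `reduction` (14%) smaller than the CRAM file."""
    return n_files * cram_gb * reduction / 1000.0


# 100,000 samples at an assumed 15 GB per CRAM file.
lower = saving_tb_if_cram_is_larger(100_000, 15.0)
upper = saving_tb_if_genozip_is_smaller(100_000, 15.0)
print(round(lower), round(upper))  # 184 210
```

Both readings bracket the quoted value of approximately 200 TB.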
Genozip is the most complete package for file compression because it can compress the various file formats generated during secondary analysis, ranging from fastq.gz to GVCF files [9]. ORA compression can only be performed on special DRAGEN servers and requires a separate license, which is available by default with the newest Illumina sequencers but not necessarily with older sequencing machines. However, decompression and ingestion of FASTQ.ORA files into the DRAGEN map/align step does not require a license. In our opinion, ORA compression is ideally done in a single step together with BCL demultiplexing and conversion to fastq.gz files, immediately after Illumina sequencing. The novel Illumina NovaSeq X Series sequencers have DRAGEN servers on board, so that ORA compression of fastq.gz files is possible directly on the sequencing machine.
One limitation of our benchmark study is that we focused on the compression of paired-end short-read Illumina sequences. Compression of long-read sequences, such as those from Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT), has not been considered in this work [23]. Moreover, we could not include PetaGene’s PetaSuite for comparison [18]. General compression tools were also not considered because paired-end compression generally leads to higher compression ratios [9]. Furthermore, we did not investigate software packages that permit the compression of GVCF files, such as VCFShark [24], Genotype Sparse Compression (GSC) [25], or VCF Zarr [26]. However, the GVCF files after variant calling require approximately 3 GB per sample, thus only 5% of the original fastq.gz files. Efficient compression of individual GVCF files seems to be important if the files are to be transferred. However, in large-scale association studies using WGS data, the focus might be on the efficient storage of the multi-sample VCF file. The efficient compression of individual GVCF files and of multi-sample VCF files may thus be investigated in a separate benchmark study.
Conclusions
fastq.gz files may be compressed with a compression ratio of approximately 1:6 using Genozip or ORA compression. Genozip supports the compression of fastq.gz, BAM, CRAM, and (G)VCF formats. Although it requires an annual license, its source code is freely available, ensuring sustainability. The commercial tools Genozip and ORA offer higher compression ratios than the freely available tools SPRING and repaq. SAMtools may be preferable for compressing BAM files because many software packages can directly read the produced CRAM files.
Data availability
One trio dataset generated during the current study is available in the Sequence Read Archive (SRA) repository, accession number: PRJNA907182. All data are available from the corresponding author on reasonable request for collaborative projects.
References
1. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299. https://doi.org/10.1038/s41586-021-03205-y (2021).
2. Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature https://doi.org/10.1038/s41586-022-04965-x (2022).
3. Bonfield, J. K. CRAM 3.1: Advances in the CRAM file format. Bioinformatics 38, 1497–1503. https://doi.org/10.1093/bioinformatics/btac010 (2022).
4. Hernaez, M., Pavlichin, D., Weissman, T. & Ochoa, I. Genomic data compression. Annu. Rev. Biomed. Data Sci. 2, 19–37. https://doi.org/10.1146/annurev-biodatasci-072018-021229 (2019).
5. Hosseini, M., Pratas, D. & Pinho, A. J. A survey on data compression methods for biological sequences. Information 7, 56. https://doi.org/10.3390/info7040056 (2016).
6. Numanagic, I. et al. Comparison of high-throughput sequencing data compression tools. Nat. Methods 13, 1005–1008. https://doi.org/10.1038/nmeth.4037 (2016).
7. McGhee, E. & Milton, S. in Practice and Experience in Advanced Research Computing (PEARC ’23), July 23–27, 2023 (ACM, Portland, OR, USA, 2023).
8. Illumina Inc. DRAGEN ORA Compression and Decompression. https://support-docs.illumina.com/SW/dragen_v42/Content/SW/DRAGEN/ORACompression.htm. Accessed: 22.04.2025. (2023).
9. Lan, D., Tobler, R., Souilmi, Y. & Llamas, B. Genozip: A universal extensible genomic data compressor. Bioinformatics 37, 2225–2230. https://doi.org/10.1093/bioinformatics/btab102 (2021).
10. Chen, S. et al. Efficient sequencing data compression and FPGA acceleration based on a two-step framework. Front. Genet. 14, 1260531. https://doi.org/10.3389/fgene.2023.1260531 (2023).
11. Chandak, S., Tatwawadi, K., Ochoa, I., Hernaez, M. & Weissman, T. SPRING: A next-generation compressor for FASTQ data. Bioinformatics 35, 2674–2676. https://doi.org/10.1093/bioinformatics/bty1015 (2019).
12. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience. https://doi.org/10.1093/gigascience/giab008 (2021).
13. Betschart, R. O. et al. Comparison of calling pipelines for whole genome sequencing: An empirical study demonstrating the importance of mapping and alignment. Sci. Rep. 12, 21502. https://doi.org/10.1038/s41598-022-26181-3 (2022).
14. Betschart, R. O. et al. Biostatistical aspects of whole genome sequencing studies: Pre-processing and quality control. Biom. J. 66, e202300278. https://doi.org/10.1002/bimj.202300278 (2024).
15. Zook, J. M. & Salit, M. Genomes in a bottle: Creating standard reference materials for genomic variation—Why, what and how?. Genome Biol. 12, P31. https://doi.org/10.1186/1465-6906-12-S1-P31 (2011).
16. Illumina Inc. bcl2fastq2 Conversion Software v2.20. https://support.illumina.com/downloads/bcl2fastq-conversion-software-v2-20.html. Accessed: 22.04.2025. (2017).
17. Illumina Inc. Illumina DRAGEN Bio-IT platform v3.8. Instructions for using the DRAGEN Bio-IT platform. https://support-docs.illumina.com/SW/DRAGEN_v38/Content/SW/FrontPages/DRAGEN.htm. Accessed: 22.04.2025. (2021).
18. Greenfield, D., Wittorff, V. & Hultner, M. The importance of data compression in the field of genomics. IEEE Pulse 10, 20–23. https://doi.org/10.1109/MPULS.2019.2899747 (2019).
19. Nazari, F. et al. Lossless and reference-free compression of FASTQ/A files using GeneSqueeze. Sci. Rep. 15, 322. https://doi.org/10.1038/s41598-024-79258-6 (2025).
20. R Core Team. R: A language and environment for statistical computing. https://www.r-project.org. Accessed: 01.05.2025. (2024).
21. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).
22. Adhisantoso, Y. G. et al. GVC: Efficient random access compression for gene sequence variations. BMC Bioinform. 24, 121. https://doi.org/10.1186/s12859-023-05240-0 (2023).
23. Marx, V. Method of the year: Long-read sequencing. Nat. Methods 20, 6–11. https://doi.org/10.1038/s41592-022-01730-w (2023).
24. Deorowicz, S., Danek, A. & Kokot, M. VCFShark: How to squeeze a VCF file. Bioinformatics 37, 3358–3360. https://doi.org/10.1093/bioinformatics/btab211 (2021).
25. Luo, X. et al. GSC: Efficient lossless compression of VCF files with fast query. Gigascience 13, giae046. https://doi.org/10.1093/gigascience/giae046 (2024).
26. Czech, E. et al. Analysis-ready VCF at Biobank scale using Zarr. bioRxiv. https://doi.org/10.1101/2024.06.11.598241 (2025).
Acknowledgements
Cardio-CARE is a program by the Kühne Foundation, and we gratefully acknowledge funding of the whole genome sequencing study by the Kühne Foundation. Tanja Zeller is supported by the German Center for Cardiovascular Research (DZHK e.V.) (81Z0710102, partner site project).
Author information
Contributions
R.B., F.T., and A.Z. designed this study. R.B. performed the bioinformatic and statistical analysis. S.B., A.Z., and T.Z. developed the concept of the whole genome sequencing study. R.B., A.Z., and T.Z. wrote the paper. All authors made critical revisions to the manuscript.
Ethics declarations
Competing interests
T.Z. is supported by the German Research Foundation, the EU Horizon 2020 program, the EU ERANet and ERAPreMed Programs, the German Centre for Cardiovascular Research (DZHK, 81Z0710102), and the German Ministry of Education and Research. S.B., T.Z., and A.Z. are listed as co-inventors of an international patent on the use of a computing device to estimate the probability of myocardial infarction (International Publication Number WO2022043229A1). T.Z. is a shareholder of ART-EMIS Hamburg GmbH. R.B. and F.T. are employees of Cardio-CARE, A.Z. is the scientific director and CEO of Cardio-CARE, and S.B. is scientific advisor to Cardio-CARE. Cardio-CARE is a shareholder of the ART-EMIS Hamburg GmbH and a program by the Kühne Foundation. M.Z. does not report a conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Betschart, R.O., Thalén, F., Blankenberg, S. et al. A benchmark study of compression software for human short-read sequence data. Sci Rep 15, 15358 (2025). https://doi.org/10.1038/s41598-025-00491-8
DOI: https://doi.org/10.1038/s41598-025-00491-8