A benchmark study of compression software for human short-read sequence data

Betschart, Raphael O.; Thalén, Felix; Blankenberg, Stefan; Zoche, Martin; Zeller, Tanja; Ziegler, Andreas

doi:10.1038/s41598-025-00491-8


Article
Open access
Published: 02 May 2025

Scientific Reports volume 15, Article number: 15358 (2025)


Abstract

Efficient data compression technologies are crucial to reduce the cost of long-term storage and file transfer in whole genome sequencing studies. This study benchmarked four specialized compression tools developed for paired-end fastq.gz files, namely DRAGEN ORA 4.3.4 (ORA), Genozip 15.0.62, repaq 0.3.0, and SPRING 1.1.1, using three subjects from the genome-in-a-bottle consortium that were sequenced 82 times in total on an Illumina NovaSeq 6000, with an average coverage of 35×. It additionally compared Genozip with SAMtools 1.20 for the compression of BAM files. All tools provided lossless compression. ORA and Genozip achieved compression ratios of approximately 1:6 when compressing fastq.gz files. repaq and SPRING had lower compression ratios of 1:2 and 1:4, respectively. repaq and SPRING took longer for both compression and decompression than ORA and Genozip. Genozip had approximately 16% higher compression for BAM files than SAMtools. However, the BAM compression of SAMtools produces CRAM files, which are compatible with many software packages. ORA, repaq, and SPRING are limited to compressing fastq.gz files, while Genozip supports various file formats. Although Genozip requires an annual license, its source code is freely available, ensuring sustainability. In conclusion, paired-end short-read sequence data can be efficiently compressed using specialized compression software. Commercial tools offer higher compression ratios than freely available software.

Introduction

Whole genome sequencing (WGS) using short-read sequencing has become a standard approach for unraveling the genetic basis of complex diseases. Short-read sequencing provides more information than genotype array data and is substantially cheaper than long-read sequencing. Several large-scale sequencing studies have already reported promising new findings, such as the TopMed program¹ and the UK Biobank². A major challenge in all WGS projects is the sheer amount of data. For example, at an average coverage of approximately 35×, the per-sample file size of the fastq.gz files containing the raw sequences is approximately 65 Gigabytes (GB). Additional files are generated during pre-processing, and sequence data after mapping and alignment are generally stored as BAM or CRAM files³, with BAM being the current standard; these require approximately 55 GB and 15 GB per sample, respectively, at 35× coverage. If data are stored in a cloud operated by one of the commercial vendors, storage costs are approximately 0.17 USD per GB per year in case of a 50:50 storage use of archive and work storage (Table 1). The annual cost for storing both the fastq.gz and BAM files of a single sample (in total 130 GB) thus is approximately 22 USD. Storing genomic data for a decade may, therefore, be even more expensive than generating it initially⁴, and efficient data compression technologies can help reduce the cost of long-term data storage and file transfer.
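The cost estimate in the preceding paragraph can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (0.17 USD per GB per year and a stated total of 130 GB per sample); the function and constant names are illustrative, and a constant price over time is an assumption.

```python
# Figures quoted in the text above; names are illustrative only.
PRICE_USD_PER_GB_YEAR = 0.17   # 50:50 mix of archive and work storage (Table 1)
TOTAL_GB_PER_SAMPLE = 130      # fastq.gz plus BAM, as stated in the text


def storage_cost_usd(size_gb: float, years: int = 1,
                     price: float = PRICE_USD_PER_GB_YEAR) -> float:
    """Return the cumulative storage cost, assuming a constant price per GB and year."""
    return size_gb * price * years


annual = storage_cost_usd(TOTAL_GB_PER_SAMPLE)
decade = storage_cost_usd(TOTAL_GB_PER_SAMPLE, years=10)
print(f"per year:   {annual:.1f} USD")
print(f"per decade: {decade:.0f} USD")
```

Under these assumptions, a single sample costs roughly 22 USD per year and about ten times that over a decade.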

Table 1 Prices for cloud data storage in Zurich, Switzerland.

General-purpose algorithms for compressing FASTQ files have been reviewed⁴⁻⁶, and 19 general-purpose tools have recently been benchmarked⁷. However, the current standard is to transfer already compressed fastq.gz files. In the past few years, specialized programs have been released to further compress fastq.gz and BAM files.

In this work, we benchmarked four fastq.gz compression programs: DRAGEN ORA⁸, Genozip⁹, repaq¹⁰, and SPRING¹¹. These packages allow for paired-end read compression, which results in a further shrinkage of file sizes by 10–15% compared to compressing the files independently (https://www.genozip.com/fastq, accessed: March 20, 2025). Furthermore, we compared compression approaches for BAM files by converting BAM files to CRAM 3.0 and CRAM 3.1 using SAMtools¹², and by compressing BAM files with Genozip. For all comparisons, we used three subjects from the genome-in-a-bottle consortium (GIAB), which were sequenced as part of the quality control of the genetic sequencing study Hamburg Davos (GENESIS-HD), using short-read sequencing on an Illumina NovaSeq 6000¹³,¹⁴.

Methods

Subject and sample preparation

The whole genome sequencing study and its data pre-processing steps have been described in detail elsewhere¹³,¹⁴. In brief, three GIAB subjects, forming a trio of Ashkenazi Jewish ancestry, were ordered from the Coriell Institute (NA24385, NIST ID HG002; NA24149, NIST ID HG003; and NA24143, NIST ID HG004)¹⁵. DNA concentrations were measured by Qubit. The library was constructed according to the Illumina TruSeq DNA PCR Free Library Prep protocol HT (Illumina Inc., San Diego, CA, USA) for WGS, and protocol steps were: (1) fragmentation of 1 μg genomic DNA to 350 bp inserts by Covaris LE220-plus, (2) cleanup of fragmented DNA, (3) end repair, (4) removal of large and small DNA fragments, (5) 3′-end adenylation, and (6) adapter ligation. The resulting library was quantified and quality-assessed with the iSeq100 (Illumina). All GIAB subjects were sequenced with the NovaSeq 6000 platform (Illumina) using S4 flow cells with 300 cycles (2 × 150 reads) and were measured twice to reach an average coverage of 35×. GIAB subjects HG002, HG003, and HG004 were sequenced and called 70, 8, and 4 times, respectively.

Variant calling

The raw sequencing files (base call files, BCL) were converted to FASTQ format and demultiplexed in a single step, without adapter trimming, on a single DRAGEN computer, using Illumina’s bcl2fastq version 2.20 program¹⁶ from DRAGEN 3.8.4¹⁷. Mapping, alignment, and variant calling were performed using DRAGEN 3.8.4. The integrated soft read trimmer was used with default parameters, and the human reference genome hg38 was used for mapping.

Compression software

We identified six software packages specifically tailored to compress paired-end fastq.gz data. Illumina’s DRAGEN ORA⁸ was run on a DRAGEN v3 on-premises server equipped with an Intel Xeon Gold 6226 CPU (2.7 GHz clock speed) with DRAGEN version 4.3.4. ORA does not use FPGA acceleration. PetaGene was unwilling to sell a license for this benchmark experiment (personal correspondence), and therefore, we had to exclude PetaGene’s PetaSuite¹⁸ from this study. We excluded GeneSqueeze from this comparison because the GeneSqueeze authors stated that their software had “drawbacks in speed and memory usage when compared to SPRING”¹⁹. Genozip version 15.0.62, repaq version 0.3.0, SAMtools version 1.20, and SPRING version 1.1.1 were run on a local high-performance computing cluster at Cardio-CARE (Davos, CH). Compute nodes were equipped with 2 AMD EPYC 7742 CPUs (2.25 GHz clock speed), 2 TB RAM, approximately 11 TB NVMe, and the Rocky Linux 9.2 operating system. ORA and Genozip support automatic computation of the MD5sum during compression. repaq offers an option to check the consistency of the compressed files. These options were disabled during compression because checksum computation might affect runtime. Finally, with Genozip and SAMtools, we identified two software packages specifically tailored to compress BAM files.
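A checksum comparison of the kind mentioned above can be sketched in a few lines. The example below is not the procedure used in the study; it assumes that losslessness is checked by comparing the MD5 digest of the uncompressed read content before and after a compression round trip, and the function names are hypothetical.

```python
import gzip
import hashlib


def md5_of_fastq(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of the FASTQ content in `path`; .gz files are decompressed on the fly."""
    opener = gzip.open if path.endswith(".gz") else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        # Read in chunks so that multi-gigabyte files never sit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_lossless(original_fastq_gz: str, restored_fastq: str) -> bool:
    """True if the restored reads are byte-identical to the original reads."""
    return md5_of_fastq(original_fastq_gz) == md5_of_fastq(restored_fastq)
```

Hashing the decompressed content rather than the .gz file itself is a deliberate choice in this sketch: two gzip files with identical content can differ byte-wise because of compression level or header metadata.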

fastq.gz compression

Each of the 82 GIAB samples had 16 fastq.gz files since paired-end sequencing was done on two flow cells with four lanes each. fastq.gz files were compressed using ORA, Genozip, repaq, and SPRING. All compression tools but repaq were run sequentially with 8 threads per read pair. repaq was run with a single thread per read pair because a higher number of threads resulted in a lower compression ratio. The total runtime per sample, for both compression and decompression, was defined as the sum of the runtimes of its read pairs. The compression ratio was calculated by dividing the total size per sample of the fastq.gz files by the total file size per sample after compression. Decompression time was measured by outputting FASTQ files, not fastq.gz files. Memory consumption is reported as the total memory consumption per sample in GB.
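The per-sample definitions above can be illustrated with a short sketch. The class and function names are hypothetical and the numbers are made up; only the definitions (runtime as the sum over read pairs, compression ratio as total original size divided by total compressed size) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class ReadPairRun:
    """Measurements for one read pair (R1 and R2 of one lane on one flow cell)."""
    original_bytes: int      # combined size of the two fastq.gz files
    compressed_bytes: int    # size of the compressed output
    seconds: float           # runtime for this read pair


def summarize_sample(runs: list[ReadPairRun]) -> tuple[float, float]:
    """Return (compression ratio, total runtime in seconds) for one sample."""
    total_original = sum(r.original_bytes for r in runs)
    total_compressed = sum(r.compressed_bytes for r in runs)
    total_seconds = sum(r.seconds for r in runs)
    return total_original / total_compressed, total_seconds


# Made-up numbers: 2 flow cells x 4 lanes = 8 read pairs (16 fastq.gz files).
runs = [ReadPairRun(8_000_000_000, 2_000_000_000, 120.0) for _ in range(8)]
ratio, seconds = summarize_sample(runs)
print(f"compression ratio 1:{ratio:.2f}, runtime {seconds:.0f} s")
```

Computing the ratio from the summed sizes, rather than averaging per-read-pair ratios, weights each read pair by its size, which matches the per-sample definition given above.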

BAM compression

The BAM file of each GIAB sample was compressed with 8 threads with SAMtools and Genozip.
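For orientation, a BAM-to-CRAM conversion with SAMtools might be assembled as in the sketch below. This is not the command used in the study (the authors' code is in their Supplementary Material); the helper name and all file names are placeholders, and the sketch only builds the argument list rather than running it.

```python
def cram_command(bam: str, reference_fasta: str, cram: str,
                 version: str = "3.1", threads: int = 8) -> list[str]:
    """Build an argv list for converting a BAM file to CRAM with `samtools view`."""
    return [
        "samtools", "view",
        "-@", str(threads),                         # additional worker threads
        "-T", reference_fasta,                      # reference FASTA for CRAM encoding
        "--output-fmt", f"cram,version={version}",  # CRAM output, explicit version
        "-o", cram,
        bam,
    ]


# Placeholder file names; the list can be passed to subprocess.run(..., check=True).
print(" ".join(cram_command("sample.bam", "hg38.fa", "sample.cram")))
```

Setting `version="3.0"` would request the older CRAM version for comparison.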

Statistical analysis

All statistical analysis was performed with R version 4.4.0²⁰. All figures were created with ggplot2 version 3.5.1²¹.

Computer codes

The codes for the compression and decompression of the files are provided in the Supplementary Material.

Results

Overview of compression abilities

Table 2 provides an overview of the capabilities of the different software packages. Only Genozip compressed all data formats, i.e., fastq.gz, BAM, CRAM, and gVCF files. SAMtools could compress BAM files only. ORA, repaq, and SPRING were restricted to compressing fastq.gz files.

Table 2 Ability of software to compress different data formats.


fastq.gz compression and decompression

Table 3 reports file sizes for the compressed fastq.gz files and runtimes separately for compression and decompression. The median original fastq.gz file size was 68.3 (quartiles: 66.4–70.8) GB for the 82 GIAB samples, whereas the compressed files were only 11.4 (11.0–12.1) GB for Genozip and 12.1 (11.7–12.9) GB for ORA (Table 3, Fig. 1). Median compression ratios were thus 1:5.99 and 1:5.64 for Genozip and ORA, respectively. Median compression ratios were only 1:1.99 and 1:3.79 for repaq and SPRING.
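As a quick plausibility check, the reported ratios for Genozip and ORA can be recovered from the median file sizes quoted above. Note the assumption: the sketch divides medians, whereas a median of per-sample ratios need not give exactly the same value in general.

```python
ORIGINAL_GB = 68.3                               # median fastq.gz size per sample
COMPRESSED_GB = {"Genozip": 11.4, "ORA": 12.1}   # median compressed sizes

# Ratio of medians for each tool, rounded to two decimals.
ratios = {tool: round(ORIGINAL_GB / size_gb, 2)
          for tool, size_gb in COMPRESSED_GB.items()}

for tool, value in ratios.items():
    print(f"{tool}: 1:{value:.2f}")
```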

Table 3 File size (in Gigabytes (GB)), runtimes (in seconds), and memory usage (in GB) for compression and decompression of fastq.gz files. Displayed are medians (quartiles in parentheses) for the n = 82 GIAB samples.


ORA had the lowest runtimes (Table 3) among the four packages, and ORA was 15 to 16 times faster than SPRING and repaq. However, on-site ORA compression can only be run on DRAGEN servers. In contrast, Genozip can be run on any CPU-based system and compressed the fastq.gz files more than 10 times faster than repaq and SPRING.

All software packages could be run on CPU systems for decompression. Here, ORA decompression was approximately twice as fast as Genozip decompression. Both packages substantially outperformed repaq and SPRING (Table 3). Finally, SPRING outperformed repaq in compression ratio and in both compression and decompression runtimes.

Table 3 and Fig. 1 additionally report memory usage for compression and decompression. All packages required more memory during compression than during decompression. SPRING showed the highest memory usage during compression with 57.3 (quartiles: 55.6–58.7) GB and decompression with 48.0 (46.7–48.5) GB. In contrast, repaq required the least memory during compression (5.5 GB; 5.5–5.5 GB) and decompression (3.8 GB; 3.7–3.9 GB). Genozip used less memory during both compression and decompression than ORA (Genozip compression: 11.0 (10.9–11.1) GB; ORA compression: 40.0 (40.0–40.0) GB; Genozip decompression: 3.8 (3.7–3.9) GB; ORA decompression: 5.5 (5.5–5.6) GB).

BAM compression

Table 4 summarizes file sizes and runtimes from the compression of BAM files using SAMtools and Genozip. Genozip had the highest compression ratio (Table 4, Fig. 2) at the cost of a higher run time and approximately 13 times higher memory consumption compared to SAMtools (Table 4, Fig. 2). CRAM files were generated using SAMtools, and CRAM3.1 offered a slightly better compression ratio than CRAM3.0 at almost identical run times, albeit with slightly higher memory usage (Table 4, Fig. 2b). We note that Genozip can also compress CRAM3.0 and CRAM3.1 files. The resulting files were as big as files compressed from BAM files (details not shown).

Table 4 File size (in Gigabytes (GB)), compression time (in s), and memory usage (in GB) of BAM files. Displayed are medians (quartiles in parentheses) for the n = 82 GIAB samples.


Discussion

This benchmark study on the compression of paired-end short-read sequence data revealed that Genozip and ORA had the highest compression ratios of approximately 1:6 for fastq.gz files. They also had the shortest compression and decompression times. In contrast, repaq and SPRING had lower compression ratios of approximately 1:2 and 1:4, respectively, and they also took more than 10× longer for file compression than Genozip and approximately 15× longer than ORA. SPRING showed the highest memory usage for both compression and decompression, whereas memory usage was lowest with repaq. Genozip used less memory than ORA. There were thus substantial differences between the software packages. All four packages offered lossless compression and the reconstruction of MD5sums. However, SPRING needs to be run single-threaded for MD5sum reconstruction at the cost of a substantial increase in runtime. Unexpectedly, the observed compression ratio for Genozip of 1:6 was obtained in the lossless compression mode, and it was therefore substantially higher than the compression ratios originally reported in 2021⁹. A possible explanation is that the results of Lan et al.⁹ were based on FASTQ files generated by a HiSeq sequencer, which does not bin quality scores. Running Genozip in the optimized mode bins these quality scores, which results in higher compression ratios. Newer sequencers automatically bin these quality scores.

BAM files could be compressed with Genozip or converted to CRAM files with SAMtools. The advantage of CRAM files is that they can be directly read by many standard software packages. The new file standard CRAM 3.1 offers an almost 20% reduction in file size compared to CRAM 3.0 (Table 4)³, and CRAM 3.1 files had a compression ratio of 1:4 compared to BAM files. In our benchmark study, CRAM 3.1 files were just 14% larger than Genozip-compressed BAM files (Table 4) but used approximately 13 times less memory during compression (Fig. 2b). Although the CRAM format has clear advantages, its adoption seems rather slow, even though most software packages can easily handle CRAM files. Because Genozip-compressed files must be decompressed before they can be further used, we prefer using CRAM 3.1 files. Moreover, the random access to sequences is slow for Genozip, and specialized algorithms have been developed for efficient random access to sequences²². However, when a hundred thousand CRAM 3.1 files need to be stored, the extra 14% file size reduction of Genozip-compressed CRAM may be worth the compression effort because approximately 200 terabytes of storage could be saved. An alternative would be deleting the intermediate BAM/CRAM files because they can be regenerated. However, we stress that the need for reproducibility may hinder the deletion of the BAM files. In contrast, the long-term storage of the fastq.gz files is a must, and the efficient compression of these files thus has the highest priority.
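The order of magnitude of the storage saving mentioned above can be checked with a short calculation. The sketch assumes a CRAM size of about 15 GB per sample (the approximate figure given in the Introduction) and decimal terabytes; because the text does not say which file the 14% is relative to, both readings are computed. All names are illustrative.

```python
N_FILES = 100_000
CRAM_GB = 15.0   # approximate CRAM size per sample at 35x (assumed from the Introduction)
EXTRA = 0.14     # CRAM 3.1 files are about 14% larger than Genozip output


def saved_terabytes(n_files: int, cram_gb: float, extra: float) -> tuple[float, float]:
    """Return total savings in TB (1 TB = 1000 GB) under two readings of `extra`."""
    # Reading 1: CRAM = (1 + extra) * Genozip, so Genozip = CRAM / (1 + extra).
    relative_to_genozip = n_files * (cram_gb - cram_gb / (1 + extra)) / 1000
    # Reading 2: the Genozip output is `extra` smaller than the CRAM file.
    relative_to_cram = n_files * cram_gb * extra / 1000
    return relative_to_genozip, relative_to_cram


low, high = saved_terabytes(N_FILES, CRAM_GB, EXTRA)
print(f"{low:.0f} to {high:.0f} TB saved")
```

Both readings land near the approximately 200 terabytes quoted in the text.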

Genozip is the most complete package for file compression because it can compress the various file formats generated during secondary analysis, ranging from fastq.gz to GVCF files⁹. ORA compression can only be performed on special DRAGEN servers and requires a separate license, which is by default available with the newest Illumina sequencer but not necessarily with older sequencing machines. However, decompression and ingestion of FASTQ.ORA files into the DRAGEN map/align do not require a license. In our opinion, ORA compression is ideally done in a single step together with BCL demultiplexing and conversion to fastq.gz files, immediately after Illumina sequencing. The novel Illumina NovaSeq X Series sequencers have DRAGEN servers on board so that ORA compression of fastq.gz files is possible directly on the sequencing machine.

One limitation of our benchmark study is that we focused on the compression of paired-end short-read Illumina sequences. Compression of long-read sequences, such as those from Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT), has not been considered in this work²³. Moreover, we could not include PetaGene’s PetaSuite for comparison¹⁸. General compression tools were also not considered because paired-end compression generally leads to higher compression ratios⁹. Furthermore, we did not investigate software packages that permit the compression of GVCF files, such as VCFShark²⁴, Genotype Sparse Compression (GSC)²⁵, or VCF Zarr²⁶. However, the GVCF files after variant calling require approximately 3 GB per sample, thus only 5% of the original fastq.gz files. Efficient compression of individual GVCF files seems to be important if the files are to be transferred. However, in large-scale association studies using WGS data, the focus might be on the efficient storage of the multi-sample VCF file. The efficient compression of individual GVCF files and of multi-sample VCF files may thus be investigated in a separate benchmark study.

Conclusions

fastq.gz files may be compressed with a compression ratio of approximately 1:6 using Genozip or ORA compression. Genozip supports the compression of fastq.gz, BAM, CRAM, and (G)VCF formats. Although it requires an annual license, its source code is freely available, ensuring sustainability. The commercial tools Genozip and ORA offer higher compression ratios than the freely available tools SPRING and repaq. SAMtools may be preferable for compressing BAM files because many software packages can directly read the produced CRAM files.

Data availability

One trio dataset generated during the current study is available in the Sequence Read Archive (SRA) repository, accession number: PRJNA907182. All data are available from the corresponding author on reasonable request for collaborative projects.

References

1. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299. https://doi.org/10.1038/s41586-021-03205-y (2021).
2. Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature https://doi.org/10.1038/s41586-022-04965-x (2022).
3. Bonfield, J. K. CRAM 3.1: Advances in the CRAM file format. Bioinformatics 38, 1497–1503. https://doi.org/10.1093/bioinformatics/btac010 (2022).
4. Hernaez, M., Pavlichin, D., Weissman, T. & Ochoa, I. Genomic data compression. Annu. Rev. Biomed. Data Sci. 2, 19–37. https://doi.org/10.1146/annurev-biodatasci-072018-021229 (2019).
5. Hosseini, M., Pratas, D. & Pinho, A. J. A survey on data compression methods for biological sequences. Information 7, 56. https://doi.org/10.3390/info7040056 (2016).
6. Numanagic, I. et al. Comparison of high-throughput sequencing data compression tools. Nat. Methods 13, 1005–1008. https://doi.org/10.1038/nmeth.4037 (2016).
7. McGhee, E. & Milton, S. in Practice and Experience in Advanced Research Computing (PEARC ’23), July 23–27, 2023 (ACM, Portland, OR, USA, 2023).
8. Illumina Inc. DRAGEN ORA Compression and Decompression. https://support-docs.illumina.com/SW/dragen_v42/Content/SW/DRAGEN/ORACompression.htm. Accessed: 22.04.2025. (2023).
9. Lan, D., Tobler, R., Souilmi, Y. & Llamas, B. Genozip: A universal extensible genomic data compressor. Bioinformatics 37, 2225–2230. https://doi.org/10.1093/bioinformatics/btab102 (2021).
10. Chen, S. et al. Efficient sequencing data compression and FPGA acceleration based on a two-step framework. Front. Genet. 14, 1260531. https://doi.org/10.3389/fgene.2023.1260531 (2023).
11. Chandak, S., Tatwawadi, K., Ochoa, I., Hernaez, M. & Weissman, T. SPRING: A next-generation compressor for FASTQ data. Bioinformatics 35, 2674–2676. https://doi.org/10.1093/bioinformatics/bty1015 (2019).
12. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience. https://doi.org/10.1093/gigascience/giab008 (2021).
13. Betschart, R. O. et al. Comparison of calling pipelines for whole genome sequencing: An empirical study demonstrating the importance of mapping and alignment. Sci. Rep. 12, 21502. https://doi.org/10.1038/s41598-022-26181-3 (2022).
14. Betschart, R. O. et al. Biostatistical aspects of whole genome sequencing studies: Pre-processing and quality control. Biom. J. 66, e202300278. https://doi.org/10.1002/bimj.202300278 (2024).
15. Zook, J. M. & Salit, M. Genomes in a bottle: Creating standard reference materials for genomic variation - Why, what and how?. Genome Biol. 12, P31. https://doi.org/10.1186/1465-6906-12-S1-P31 (2011).
16. Illumina Inc. bcl2fastq2 Conversion Software v2.20. https://support.illumina.com/downloads/bcl2fastq-conversion-software-v2-20.html. Accessed: 22.04.2025. (2017).
17. Illumina Inc. Illumina DRAGEN Bio-IT platform v3.8. Instructions for using the DRAGEN Bio-IT platform. https://support-docs.illumina.com/SW/DRAGEN_v38/Content/SW/FrontPages/DRAGEN.htm. Accessed: 22.04.2025. (2021).
18. Greenfield, D., Wittorff, V. & Hultner, M. The importance of data compression in the field of genomics. IEEE Pulse 10, 20–23. https://doi.org/10.1109/MPULS.2019.2899747 (2019).
19. Nazari, F. et al. Lossless and reference-free compression of FASTQ/A files using GeneSqueeze. Sci. Rep. 15, 322. https://doi.org/10.1038/s41598-024-79258-6 (2025).
20. R Core Team. R: A language and environment for statistical computing. https://www.r-project.org. Accessed: 01.05.2025. (2024).
21. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).
22. Adhisantoso, Y. G. et al. GVC: Efficient random access compression for gene sequence variations. BMC Bioinform. 24, 121. https://doi.org/10.1186/s12859-023-05240-0 (2023).
23. Marx, V. Method of the year: Long-read sequencing. Nat. Methods 20, 6–11. https://doi.org/10.1038/s41592-022-01730-w (2023).
24. Deorowicz, S., Danek, A. & Kokot, M. VCFShark: How to squeeze a VCF file. Bioinformatics 37, 3358–3360. https://doi.org/10.1093/bioinformatics/btab211 (2021).
25. Luo, X. et al. GSC: Efficient lossless compression of VCF files with fast query. Gigascience 13, giae046. https://doi.org/10.1093/gigascience/giae046 (2024).
26. Czech, E. et al. Analysis-ready VCF at Biobank scale using Zarr. bioRxiv. https://doi.org/10.1101/2024.06.11.598241 (2025).

Acknowledgements

Cardio-CARE is a program by the Kühne Foundation, and we gratefully acknowledge funding of the whole genome sequencing study by the Kühne Foundation. Tanja Zeller is supported by the German Center for Cardiovascular Research (DZHK e.V.) (81Z0710102, partner site project).

Author information

These authors jointly supervised this work: Tanja Zeller and Andreas Ziegler.

Authors and Affiliations

Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 12, Davos Wolfgang, 7265, Davos, Switzerland
Raphael O. Betschart, Felix Thalén, Stefan Blankenberg & Andreas Ziegler
Institute of Cardiogenetics, University of Lübeck, Lübeck, Germany
Raphael O. Betschart & Tanja Zeller
Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
Stefan Blankenberg, Tanja Zeller & Andreas Ziegler
Centre for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
Stefan Blankenberg, Tanja Zeller & Andreas Ziegler
German Center for Cardiovascular Research, Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany
Stefan Blankenberg & Tanja Zeller
Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
Martin Zoche
School of Mathematics, Statistics and Computer Science, Scottsville, Private Bag X01, Pietermaritzburg, 3209, South Africa
Andreas Ziegler

Contributions

R.B., F.T., and A.Z. designed this study. R.B. performed the bioinformatic and statistical analysis. S.B., A.Z., and T.Z. developed the concept of the whole genome sequencing study. R.B., A.Z., and T.Z. wrote the paper. All authors made critical revisions to the manuscript.

Corresponding authors

Correspondence to Tanja Zeller or Andreas Ziegler.

Ethics declarations

Competing interests

T.Z. is supported by the German Research Foundation, the EU Horizon 2020 program, the EU ERANet and ERAPreMed Programs, the German Centre for Cardiovascular Research (DZHK, 81Z0710102), and the German Ministry of Education and Research. S.B., T.Z., and A.Z. are listed as co-inventors of an international patent on the use of a computing device to estimate the probability of myocardial infarction (International Publication Number WO2022043229A1). T.Z. is a shareholder of ART-EMIS Hamburg GmbH. R.B. and F.T. are employees of Cardio-CARE, A.Z. is the scientific director and CEO of Cardio-CARE, and S.B. is scientific advisor to Cardio-CARE. Cardio-CARE is a shareholder of the ART-EMIS Hamburg GmbH and a program by the Kühne Foundation. M.Z. does not report a conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Betschart, R.O., Thalén, F., Blankenberg, S. et al. A benchmark study of compression software for human short-read sequence data. Sci Rep 15, 15358 (2025). https://doi.org/10.1038/s41598-025-00491-8

Received: 20 December 2024
Accepted: 28 April 2025
Published: 02 May 2025
DOI: https://doi.org/10.1038/s41598-025-00491-8
