Table 3. Summary of storage allocation in each pipeline step.
From: Design and implementation of a hybrid cloud system for large-scale human genomic research
Pipeline step | Operation | Input format | Output format | Total file size (TB) | Mean size (GB) | Median size (GB) | Total number of files | Note |
|---|---|---|---|---|---|---|---|---|
Input | | | | 480.2 | 21.9 | 38.5 | 22,476 | Consists of two files per sample (paired-end protocol). |
Step 1-1 | Alignment | fastq | cram | 240.4 | 21.9 | 19.4 | 11,238 | Excludes crai index files. |
Step 1-2 | Variant call | cram | gvcf | 123.5 | 0.4 | 0.4 | 288,290 | Excludes tabix index files. |
Step 2 | Genomic DB import | gvcf | gdb | 186.8 | 60.4 | 67.7 | 28,269,278 | Mean and median sizes are the total file size per interval (3,169 interval datasets in total). |
Step 3 | Joint-genotyping | gdb | vcf | 5.8 | 226.9 | 204 | 26 | Chr1-22/X/Y/PAR/M. Excludes tabix index files. |
Step 4 | Variant quality score calculation | vcf | vcf | 10.0 | 395.6 | 355.3 | 26 | Chr1-22/X/Y/PAR/M. Excludes tabix index files. |
Step 5 | Annotation | vcf | tsv | 0.1 | 3.3 | 3.3 | 27 | |
| | | | Total | 1046.8 | | | 28,591,361 | |
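
The arithmetic behind the Total row and the mean-size column can be checked with a short script. This is a minimal sketch, not part of the original article: the per-step figures are copied from the table, the names `STEPS`, `totals`, and `mean_size_gb` are hypothetical, and the conversion factor of 1024 gigabytes per terabyte is an assumption, chosen because it reproduces the reported means for Step 1-1 and Step 2.

```python
# Per-step figures copied from Table 3:
# (total file size in terabytes, total number of files).
STEPS = {
    "Input": (480.2, 22_476),
    "Step 1-1": (240.4, 11_238),
    "Step 1-2": (123.5, 288_290),
    "Step 2": (186.8, 28_269_278),
    "Step 3": (5.8, 26),
    "Step 4": (10.0, 26),
    "Step 5": (0.1, 27),
}


def totals(steps):
    """Return (total size, total file count) summed over all steps."""
    # Round to one decimal place, matching the precision used in the table.
    total_size = round(sum(size for size, _ in steps.values()), 1)
    total_files = sum(count for _, count in steps.values())
    return total_size, total_files


def mean_size_gb(total_tb, n_units, gb_per_tb=1024):
    """Mean size per unit in gigabytes.

    gb_per_tb=1024 is an assumption; the table does not state the factor.
    """
    return total_tb * gb_per_tb / n_units


if __name__ == "__main__":
    size, files = totals(STEPS)
    print(f"Total size: {size}, total files: {files:,}")
    # Step 1-1: the mean is taken per file.
    print(f"Step 1-1 mean per file: {mean_size_gb(240.4, 11_238):.1f}")
    # Step 2: the mean is taken per interval (3,169 intervals), not per file.
    print(f"Step 2 mean per interval: {mean_size_gb(186.8, 3_169):.1f}")
```

Under these assumptions the script gives a total of 1046.8 and 28,591,361 files, with means of 21.9 for Step 1-1 and 60.4 for Step 2, matching the table. Other rows agree only approximately, which is consistent with the totals being rounded to one decimal place.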