Table 3. Summary of storage allocation at each pipeline step.

From: Design and implementation of a hybrid cloud system for large-scale human genomic research

| Pipeline step | Operation | Input format | Output format | Total file size (TB) | Mean size (GB) | Median size (GB) | Total number of files | Note |
|---|---|---|---|---|---|---|---|---|
| Input | | | | 480.2 | 21.9 | 38.5 | 22,476 | Consists of two files per sample, i.e., paired-end protocol. |
| Step 1-1 | Alignment | fastq | cram | 240.4 | 21.9 | 19.4 | 11,238 | Does not include the crai index file. |
| Step 1-2 | Variant call | cram | gvcf | 123.5 | 0.4 | 0.4 | 288,290 | Does not include the tabix index file. |
| Step 2 | Genomic DB import | gvcf | gdb | 186.8 | 60.4 | 67.7 | 28,269,278 | The mean and median sizes are the total file size per interval (3169 interval datasets in total). |
| Step 3 | Joint-genotyping | gdb | vcf | 5.8 | 226.9 | 204 | 26 | Chr1-22/X/Y/PAR/M. Does not include the tabix index file. |
| Step 4 | Variant quality score calculation | vcf | vcf | 10.0 | 395.6 | 355.3 | 26 | Chr1-22/X/Y/PAR/M. Does not include the tabix index file. |
| Step 5 | Annotation | vcf | tsv | 0.1 | 3.3 | 3.3 | 27 | |
| Total | | | | 1046.8 | | | 28,591,361 | |

 
  1. File sizes (mean, median, and total) and the total number of files are summarized for each analysis step.
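
The totals and mean sizes in the table can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original article: the values are transcribed from the table, and the conversion factor of 1024 GB per TB is an assumption (the table does not state whether binary or decimal units are used; 1024 happens to reproduce the reported means closely). The names `ROWS`, `GB_PER_TB`, `N_INTERVALS`, and the helper functions are hypothetical.

```python
# Values transcribed from Table 3: (step, total size in TB, number of files).
ROWS = [
    ("Input", 480.2, 22_476),
    ("Step 1-1", 240.4, 11_238),
    ("Step 1-2", 123.5, 288_290),
    ("Step 2", 186.8, 28_269_278),
    ("Step 3", 5.8, 26),
    ("Step 4", 10.0, 26),
    ("Step 5", 0.1, 27),
]

GB_PER_TB = 1024     # assumption: binary units, not stated in the table
N_INTERVALS = 3169   # Step 2 reports sizes per interval, not per file


def total_size_tb(rows):
    """Sum the per-step total sizes, rounded to the table's one decimal."""
    return round(sum(size_tb for _, size_tb, _ in rows), 1)


def total_files(rows):
    """Sum the per-step file counts."""
    return sum(count for _, _, count in rows)


def mean_size_gb(size_tb, n_units):
    """Mean size in GB of n_units files (or intervals) totalling size_tb."""
    return size_tb * GB_PER_TB / n_units


if __name__ == "__main__":
    print("Total size (TB):", total_size_tb(ROWS))
    print("Total files:", total_files(ROWS))
    for step, size_tb, count in ROWS:
        # Step 2 is averaged over intervals, as the table note explains.
        units = N_INTERVALS if step == "Step 2" else count
        print(step, round(mean_size_gb(size_tb, units), 1))
```

Under the 1024 assumption, the recomputed means agree with the tabulated ones to within rounding of the one-decimal totals; small discrepancies are expected for rows with very small totals, such as Step 5.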