Fig. 2
From: Lossless and reference-free compression of FASTQ/A files using GeneSqueeze

GeneSqueeze duplication removal process. The example shows the steps in order. First, an index is created that records the original order of the sequences. The sequences are then re-ordered alphabetically based on the first nucleotide of the sequence, with each sequence retaining its original index position. Next, the original index position of each duplicate sequence is associated with the original index identifier of the initial occurrence of that sequence (the ‘parent’ sequence). Finally, the duplicate sequences are removed from the data frame, and an index of the original identifiers and the duplication removal identifiers is created to store the identity of, and the relationships between, the retained sequences and any removed duplicate sequences. S denotes a parent with no duplicates. M denotes a parent with duplicate ‘child’ sequences. For the duplicated sequences, the DR_Identifier indicates the original index identifier of their parent sequence.
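The steps in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GeneSqueeze's actual implementation: the function names `remove_duplicates` and `restore`, the use of plain lists and a dictionary in place of a data frame, and the example reads are all hypothetical. The sketch relies on a stable sort, so that among identical sequences the earliest original occurrence is met first and becomes the parent.

```python
from collections import Counter


def remove_duplicates(sequences):
    """Hypothetical sketch: return (retained, dr_index) for a list of reads."""
    # Step 1: record the original order of the sequences.
    indexed = list(enumerate(sequences))
    # Step 2: re-order by first nucleotide. sorted() is stable, so reads that
    # share a first nucleotide keep their original relative order.
    ordered = sorted(indexed, key=lambda pair: pair[1][:1])

    counts = Counter(sequences)
    parent_of = {}   # sequence -> original index of its first occurrence
    retained = []    # (original index, sequence) for parents only
    dr_index = {}    # original index -> "S", "M", or the parent's original index

    for original_idx, seq in ordered:
        if seq not in parent_of:
            # First occurrence: this read is a parent and is retained.
            parent_of[seq] = original_idx
            retained.append((original_idx, seq))
            dr_index[original_idx] = "M" if counts[seq] > 1 else "S"
        else:
            # Step 3: a duplicate points at its parent and is dropped.
            dr_index[original_idx] = parent_of[seq]
    return retained, dr_index


def restore(retained, dr_index):
    """Rebuild the original list from the retained reads and the index."""
    by_idx = dict(retained)
    return [
        by_idx[i] if dr_index[i] in ("S", "M") else by_idx[dr_index[i]]
        for i in range(len(dr_index))
    ]


# Assumed example reads, not taken from the figure.
reads = ["GATTA", "ACGT", "GATTA", "CCGA", "ACGT", "TTAG"]
retained, dr_index = remove_duplicates(reads)
print(retained)   # parents only, in first-nucleotide order
print(dr_index)   # S / M for parents, parent index for duplicates
```

Because every removed duplicate keeps a pointer to its parent's original index, `restore(retained, dr_index)` returns the reads in their original order, which is what makes the step lossless in this sketch.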