Table 1 Description of BPGA Pipeline.

From: BPGA- an ultra-fast pan-genome analysis pipeline

| Features | Description | Tools/scripts | Notes | Equivalent tools | Citation |
| --- | --- | --- | --- | --- | --- |
| Preparation step | Preprocesses raw files (.faa, .fsa, any other FASTA, or .gbk) into the single input file required for clustering. | BPGA script | BPGA modifies the files by inserting the genome ID into the sequence headers. | NA | This study |
| Clustering | Clusters genes into orthologous clusters based on sequence similarity. | USEARCH#, CD-HIT*, OrthoMCL* | USEARCH is the fastest clustering tool so far. BPGA uses it as the default clustering tool and can also process clusters from the other two. | Roary, PGAP, PGAT, ITEP, Panseq | [25, 27, 28, 29, 45] |
| Matrix Generation (Pan-Matrix) | Generates a binary (1/0) presence/absence matrix from the orthologous clusters. | BPGA script | The BPGA script checks the presence or absence of genes in the individual strains and writes the result as a matrix. | Roary, PanGP, PGAP | [26, 27, 28] |
| Pan-Genome Profile Analysis | Calculates shared genes after stepwise addition of each individual genome. This trend can be plotted as core or pan-genome profile curves. | BPGA script, gnuplot | The BPGA script calculates such trends over different permutations/combinations of genomes. | Roary, PanGP, PGAP | [26, 27, 28] |
| Phylogeny Construction | Pan phylogeny: generates a phylogenetic tree based on pan-matrix data. Core/MLST phylogeny: generates a phylogenetic tree based on concatenated core/housekeeping gene alignments. | BPGA script, MUSCLE#, Librsvg | The BPGA script concatenates the core sequences from all strains and converts the pan-matrix into a Newick tree. MUSCLE is a fast and accurate tool for alignment and tree generation. | Roary, PGAP, Panseq, ITEP | [25, 27, 28, 29] |
| Function and Pathway Analysis | COG and KEGG assignments based on best hits against the respective reference databases. | USEARCH#, BPGA script, gnuplot | Best hits are processed to obtain the percentage occurrence of each COG and KEGG pathway category. | COG: PGAP, PGAT, ITEP. KEGG analysis: none | [28, 29, 45] |
| Pan-Genome Statistics | Provides genome-wise counts of core, accessory, unique and exclusively absent genes. | BPGA script | Indicates the contribution of each strain to the pan-genome. | None | This study |
| Atypical GC Content Analysis | Identifies genes whose GC content is substantially higher or lower than the genomic GC content. | BPGA script | Applicable only if GenBank files are used as input. | None | This study |
| Subset Analysis | Divides the original dataset into smaller user-defined subsets and performs the default pan-genomic analyses on each. | BPGA script | Subsets may be based on pathogenic potential, habitat, taxonomic group or any other criterion. | None | This study |
| Exclusive gene absence | Identifies clusters in which a gene is absent exclusively from one specific strain. | BPGA script | Sequences of such clusters are written to an output file. | None | This study |

  1. #Automated by the BPGA script.
  2. *Supported outputs.
  3. These are novel features of BPGA. NA: not applicable.
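
The Matrix Generation and Pan-Genome Statistics entries of Table 1 can be illustrated with a minimal Python sketch. This is not BPGA's own code. The function names, the toy clusters and the category definitions (core: present in every genome; unique: present in exactly one; accessory: present in some but not all; exclusively absent: missing from only that genome) are assumptions made for illustration.

```python
# Hypothetical sketch: binary pan-matrix and per-genome statistics.
# Input format and category definitions are assumptions, not BPGA internals.

def build_pan_matrix(clusters, genomes):
    """Map each cluster ID to a 1/0 presence row, one column per genome."""
    matrix = {}
    for cluster_id, members in clusters.items():
        present = {genome for genome, _gene in members}
        matrix[cluster_id] = [1 if g in present else 0 for g in genomes]
    return matrix


def genome_statistics(matrix, genomes):
    """Count core, accessory, unique and exclusively absent clusters per genome."""
    n = len(genomes)
    stats = {
        g: {"core": 0, "accessory": 0, "unique": 0, "exclusively_absent": 0}
        for g in genomes
    }
    for row in matrix.values():
        total = sum(row)  # number of genomes that carry this cluster
        for genome, bit in zip(genomes, row):
            if total == n:
                stats[genome]["core"] += 1
            elif bit and total == 1:
                stats[genome]["unique"] += 1
            elif bit:
                stats[genome]["accessory"] += 1
            elif total == n - 1:
                # Every other genome has the cluster; only this one lacks it.
                stats[genome]["exclusively_absent"] += 1
    return stats


# Toy data: cluster ID -> list of (genome ID, gene ID) members.
genomes = ["A", "B", "C"]
clusters = {
    "c1": [("A", "a1"), ("B", "b1"), ("C", "c1")],  # in all three genomes
    "c2": [("A", "a2"), ("B", "b2")],               # missing only from C
    "c3": [("C", "c3")],                            # only in C
    "c4": [("A", "a4"), ("A", "a5")],               # two paralogs, only in A
}

pan_matrix = build_pan_matrix(clusters, genomes)
stats = genome_statistics(pan_matrix, genomes)
```

Paralogs within one genome collapse to a single 1 in the matrix, which is what a presence/absence representation implies; copy number would need a separate count matrix.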
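
The Pan-Genome Profile Analysis entry of Table 1 describes recomputing shared genes as genomes are added one at a time, over different genome orders. The sketch below shows one way such curves could be computed. The random sampling of orders, the use of the mean, and all names (`profile_curves`, `gene_sets`, `n_permutations`) are assumptions, not a description of how BPGA does it.

```python
import random

# Hypothetical sketch: pan- and core-genome sizes after stepwise genome addition,
# averaged over randomly shuffled genome orders (both choices are assumptions).


def profile_curves(gene_sets, n_permutations=20, seed=0):
    """Return (pan_curve, core_curve): mean sizes after adding 1..N genomes."""
    rng = random.Random(seed)
    genomes = list(gene_sets)
    n = len(genomes)
    pan_sizes = [[] for _ in range(n)]
    core_sizes = [[] for _ in range(n)]

    for _ in range(n_permutations):
        order = genomes[:]
        rng.shuffle(order)
        pan, core = set(), None
        for step, genome in enumerate(order):
            pan |= gene_sets[genome]  # union grows: pan-genome
            # intersection shrinks: core genome
            core = set(gene_sets[genome]) if core is None else core & gene_sets[genome]
            pan_sizes[step].append(len(pan))
            core_sizes[step].append(len(core))

    def mean(values):
        return sum(values) / len(values)

    return [mean(v) for v in pan_sizes], [mean(v) for v in core_sizes]


# Toy data: genome ID -> set of cluster IDs present in that genome.
gene_sets = {
    "A": {"c1", "c2", "c4"},
    "B": {"c1", "c2"},
    "C": {"c1", "c3"},
}

pan_curve, core_curve = profile_curves(gene_sets)
```

The two returned lists can be written to a text file and plotted with any plotting tool to give pan and core profile curves. With many genomes the number of possible orders grows factorially, which is why this sketch samples a fixed number of shuffles instead of enumerating them.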