Scientific Reports

Table 1 Description of BPGA Pipeline.

From: BPGA- an ultra-fast pan-genome analysis pipeline

Features	Description	Tools/scripts	Notes	Equivalent tools	Citation
Preparation step	Preprocesses raw files (.faa, .fsa, any other FASTA file, or .gbk) into the single input file required for clustering.	BPGA script	BPGA modifies the files by inserting the genome ID into the sequence headers.	NA	This study
Clustering	Clusters genes into orthologous clusters based on sequence similarity.	USEARCH ^#, CD-HIT, OrthoMCL.	USEARCH is the fastest clustering tool so far. BPGA uses it as the default clustering tool and can also process clusters produced by the other two.	Roary, PGAP, PGAT, ITEP, Panseq.	[25, 27, 28, 29, 45]
Matrix Generation (Pan-Matrix)	Generates a binary (1/0) presence/absence matrix from the orthologous clusters.	BPGA script	The BPGA script checks the presence or absence of each gene in the individual strains and writes the result as a matrix.	Roary, PanGP, PGAP.	[26, 27, 28]
Pan-Genome Profile Analysis	Calculates the number of shared genes after stepwise addition of each individual genome. This trend can be plotted as core or pan-genome profile curves.	BPGA script, gnuplot.	The BPGA script calculates these trends over different permutations/combinations of genomes.	Roary, PanGP, PGAP.	[26, 27, 28]
Phylogeny Construction	Pan Phylogeny: Generates a phylogenetic tree based on pan-matrix data. Core/MLST Phylogeny: Generates a phylogenetic tree based on concatenated core/housekeeping gene alignments.	BPGA script, MUSCLE ^#, Librsvg.	The BPGA script concatenates the core sequences from all strains and converts the pan-matrix into a Newick tree. MUSCLE is a faster and more accurate alignment and tree-generation tool.	Roary, PGAP, Panseq, ITEP.	[25, 27, 28, 29]
Function and Pathway ^† Analysis	COG and KEGG assignments based on best hits against the respective reference databases.	USEARCH ^#, BPGA script, gnuplot.	Best hits are processed to obtain the percentage occurrence of each COG and KEGG pathway category.	COG: PGAP, PGAT, ITEP. KEGG analysis: None	[28, 29, 45]
Pan-Genome Statistics ^†	Provides genome-wise counts of core, accessory, unique and exclusively absent genes.	BPGA script	Indicates the contribution of each strain to the pan-genome.	None	This study
Atypical GC Content Analysis ^†	Identifies genes whose GC content is substantially higher or lower than the genomic GC content.	BPGA script	Applicable only if GenBank files are used as input.	None	This study
Subset Analysis ^†	Divides the original dataset into user-defined smaller subsets and performs the default pan-genomic analyses on each.	BPGA script	The subsets may be based on pathogenic potential, habitat, taxonomic group or any other criterion.	None	This study
Exclusive gene absence ^†	Identifies clusters in which a gene is absent exclusively from one specific strain.	BPGA script	Sequences of such clusters are given in the output file.	None	This study

^#Automated by BPGA script.
^*Supported outputs.
^†Novel features of BPGA. NA: not applicable.
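
The Matrix Generation, Pan-Genome Profile Analysis and Pan-Genome Statistics rows above can be illustrated with a minimal Python sketch. This is not BPGA's own code: the function names, the toy cluster data and the exact category definitions (for example, treating a gene as "exclusively absent" from a genome when it is present in every other genome) are assumptions made for illustration only.

```python
# Illustrative sketch only; names and data are hypothetical, not BPGA's implementation.

def build_pan_matrix(clusters, genomes):
    """Return {cluster_id: [1 or 0 per genome]} from {cluster_id: set of genome IDs}."""
    return {
        cid: [1 if g in members else 0 for g in genomes]
        for cid, members in clusters.items()
    }


def genome_statistics(matrix, genomes):
    """Count core, accessory, unique and exclusively absent genes per genome.

    Assumed definitions: core = present in all genomes; unique = present in
    exactly one genome; accessory = any other cluster the genome carries;
    exclusively absent = missing from this genome but present in all others.
    """
    n = len(genomes)
    stats = {
        g: {"core": 0, "accessory": 0, "unique": 0, "exclusively_absent": 0}
        for g in genomes
    }
    for row in matrix.values():
        total = sum(row)
        for i, g in enumerate(genomes):
            if row[i] == 1:
                if total == n:
                    stats[g]["core"] += 1
                elif total == 1:
                    stats[g]["unique"] += 1
                else:
                    stats[g]["accessory"] += 1
            elif total == n - 1:
                stats[g]["exclusively_absent"] += 1
    return stats


def profile(matrix, order):
    """Return [(pan_size, core_size), ...] after adding genomes in the given index order."""
    points = []
    for k in range(1, len(order) + 1):
        chosen = order[:k]
        pan = sum(1 for row in matrix.values() if any(row[i] for i in chosen))
        core = sum(1 for row in matrix.values() if all(row[i] for i in chosen))
        points.append((pan, core))
    return points


# Toy input: five orthologous clusters across three genomes (hypothetical data).
genomes = ["A", "B", "C"]
clusters = {
    "c1": {"A", "B", "C"},
    "c2": {"A", "B"},
    "c3": {"C"},
    "c4": {"B", "C"},
    "c5": {"A"},
}

matrix = build_pan_matrix(clusters, genomes)
stats = genome_statistics(matrix, genomes)
curve = profile(matrix, [0, 1, 2])  # one addition order: A, then B, then C
```

A single addition order is used here for brevity. Per the table, BPGA calculates the trend over different permutations/combinations of genomes, so a fuller sketch would repeat `profile` over several orders and summarise the results.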
