Fig. 1: Development of rhizoSMASH.

a The gene cluster prediction workflow of rhizoSMASH. RhizoSMASH takes a genome sequence file (GenBank or FASTA) as input and recognizes potential catabolic enzymes by scanning the sequence with profile hidden Markov models. Gene clusters encoding relevant pathways are then detected using a set of detection rules. b The tuning procedure used for curation of rCGC detection rules. An initial set of detection rules was first compiled from a comprehensive literature study. Genome sequences in our BARS collection were then scanned using this set of detection rules. The output gene clusters were grouped into cluster families with BiG-SCAPE, together with our known cluster database, rKnownCGCs. We manually curated the detection rules by visually inspecting the gene cluster family network generated by BiG-SCAPE for putative false positives and false negatives, aided by further literature searches when needed. This cycle of calibration, validation and fine-tuning was performed three times to progressively refine the detection rules. Created with BioRender.com.
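
The two-stage logic in panel a (profile hits first, rule-based cluster calling second) can be illustrated with a minimal Python sketch. This is not rhizoSMASH's implementation or its rule syntax: the `Gene` class, the `detect_clusters` function, the profile names `profA`/`profB`, and the `max_gap` parameter are all hypothetical, and the profile hits are assumed to have been computed beforehand by a profile-HMM search tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int
    end: int
    profiles: frozenset  # hypothetical: names of profile HMMs that hit this gene


def detect_clusters(genes, required, max_gap):
    """Group neighbouring hit genes, then keep groups carrying all required profiles.

    `required` and `max_gap` stand in for one toy detection rule (assumption).
    """
    # Only genes with at least one profile hit can seed or extend a cluster.
    hits = sorted((g for g in genes if g.profiles), key=lambda g: g.start)
    clusters, current = [], []
    for gene in hits:
        # Start a new group when the distance to the previous hit exceeds max_gap.
        if current and gene.start - current[-1].end > max_gap:
            clusters.append(current)
            current = []
        current.append(gene)
    if current:
        clusters.append(current)
    # Apply the rule: the union of profiles in a group must cover `required`.
    return [
        c for c in clusters
        if required <= frozenset().union(*(g.profiles for g in c))
    ]


# Toy input: g1 and g2 are adjacent hits, g3 is a distant hit, g4 has no hit.
genes = [
    Gene("g1", 100, 900, frozenset({"profA"})),
    Gene("g2", 1000, 1800, frozenset({"profB"})),
    Gene("g3", 50000, 51000, frozenset({"profA"})),
    Gene("g4", 2000, 2500, frozenset()),
]
found = detect_clusters(genes, required={"profA", "profB"}, max_gap=5000)
names = [[g.name for g in cluster] for cluster in found]
```

In this toy case only the g1/g2 pair satisfies the rule; the distant g3 hit is grouped separately and rejected because it lacks `profB`. Real detection rules are likely richer (for example, alternative or optional profiles), which this sketch does not attempt to model.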