Fig. 1: Overview of sequence classification benchmark workflow.
From: Benchmarking DNA foundation models for genomic and genetic tasks

DNA sequences are input into foundation models, which generate token embeddings from the final layer. These embeddings undergo output pooling to produce a high-dimensional representation of each input sequence. A supervised classifier (random forest) is trained on these embeddings using labeled datasets. Model performance is evaluated on an independent test set using multiple metrics, with AUROC as the primary measure.
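
The workflow can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes mean pooling over non-padding tokens as the pooling step (the caption does not specify which pooling is used), replaces the foundation model with synthetic embeddings from a hypothetical helper `fake_token_embeddings`, and uses a held-out split of the same synthetic data in place of an independent test set. The array sizes and random-forest settings are illustrative choices only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions.

    token_embeddings: (n_seqs, n_tokens, dim)
    attention_mask:   (n_seqs, n_tokens), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against empty rows
    return summed / counts


def fake_token_embeddings(labels, n_tokens=32, dim=16, seed=0):
    """Hypothetical stand-in for a foundation model's final-layer output."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(len(labels), n_tokens, dim))
    emb[:, :, 0] += labels[:, None]  # inject a class signal into one dimension
    lengths = rng.integers(n_tokens // 2, n_tokens + 1, size=len(labels))
    mask = (np.arange(n_tokens)[None, :] < lengths[:, None]).astype(int)
    return emb, mask


# Synthetic binary labels and token embeddings (assumed data, not from the paper).
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=400)
emb, mask = fake_token_embeddings(labels)

# Output pooling: one fixed-length vector per input sequence.
features = mean_pool(emb, mask)

# Train a random forest on the pooled embeddings, then score held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# AUROC is computed from the predicted probability of the positive class.
auroc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"AUROC: {auroc:.3f}")
```

In practice, `emb` and `mask` would come from the foundation model's tokenizer and final hidden layer, and the test set would be drawn from a separate labeled dataset rather than a random split.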