Supervised deep learning models have a dazzling track record in many computational genomics tasks, but their success relies on vast (and often costly) experimental data for training. Recently, genomic language models (gLMs), whose pretraining requires only DNA sequences (albeit in large numbers), have emerged as a potentially appealing alternative and are drawing interest from many researchers in the genomics and computational biology community, including Peter Koo of Cold Spring Harbor Laboratory. “We were initially excited by the growing class of gLMs that aim to learn unsupervised representations of DNA,” he says. However, after building these models with his team, “We found that they consistently underperformed well-established supervised models.” Intrigued to know whether these observations hold more generally, Koo and his colleagues shifted their project to a rigorous evaluation of gLMs.
Challenges abound for benchmarking in such a fast-paced area. Although new gLMs keep emerging, issues with code and data availability often hinder full reproducibility. “Many functional genomics modeling papers provided code and data, but these were often incomplete or difficult to adapt,” says Koo. This led the team to concentrate on a small but representative set of gLMs whose data and model baselines could be reliably obtained. Another important distinction of their benchmarking study from previous efforts is the tasks they designed. “The key innovation of our evaluation is its focus on biologically aligned tasks that are tied to open questions in gene regulation,” notes Koo. “In contrast, most existing benchmarks rely on classification tasks that originated in the machine learning literature and continue to be propagated in gLM studies, despite being disconnected from how models would be used to advance biological understanding and discovery.”