Table 2 Overview of the tasks included in DNALONGBENCH

From: DNALONGBENCH: a benchmark suite for long-range DNA prediction tasks

| LR Tasks | LR Type | Input Length (bp) | Output Shape | # Samples | Metric |
|---|---|---|---|---|---|
| Enhancer-target Gene | Binary Classification | 450,000 | 1 | 2,602 | AUROC |
| eQTL | Binary Classification | 450,000 | 1 | 31,282 | AUROC |
| Contact Map | Binned (2048 bp) 2D Regression | 1,048,576 | 99,681 | 7,840 | SCC & PCC |
| Regulatory Sequence Activity | Binned (128 bp) 1D Regression | 196,608 | Human: (896, 5313) | Human: 38,171 | PCC |
| | | | Mouse: (896, 1643) | Mouse: 33,521 | |
| Transcription Initiation Signal | Nucleotide-wise 1D Regression | 100,000 | (100,000, 10) | 100,000* | PCC |

  1. “1D” and “2D” denote one-dimensional and two-dimensional, respectively. Nucleotide-wise tasks involve predicting a sequence of labels, one for each nucleotide in the input. Sequence-wise tasks require classifying the entire input sequence. In binned tasks, multiple nucleotides are grouped into bins and share a common label. *The data for this task consist of sequences sampled from whole genomes, with 100,000 samples used for training our baselines.
  2. AUROC: area under the receiver operating characteristic curve; PCC: Pearson correlation coefficient; SCC: stratum-adjusted correlation coefficient.
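
To make the footnotes concrete, the following is a minimal NumPy sketch of how a nucleotide-wise signal could be reduced to binned labels and how PCC and AUROC can be computed. It is illustrative only and is not taken from the benchmark's code: the function names `bin_labels`, `pearson_cc`, and `auroc` are hypothetical, and averaging within each bin is an assumption about how a shared bin label might be derived. SCC is omitted because it needs a stratified computation over contact-map diagonals.

```python
import numpy as np


def bin_labels(per_nucleotide, bin_size):
    """Collapse a nucleotide-wise signal into one label per bin.

    Assumption: the shared bin label is the mean of the nucleotides
    in the bin; trailing positions that do not fill a bin are dropped.
    """
    signal = np.asarray(per_nucleotide, dtype=float)
    n_bins = signal.shape[0] // bin_size
    trimmed = signal[: n_bins * bin_size]
    return trimmed.reshape(n_bins, bin_size).mean(axis=1)


def pearson_cc(pred, target):
    """Pearson correlation coefficient over flattened arrays.

    Undefined (division by zero) if either input is constant.
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    pred_c = pred - pred.mean()
    target_c = target - target.mean()
    denom = np.sqrt((pred_c ** 2).sum() * (target_c ** 2).sum())
    return float((pred_c * target_c).sum() / denom)


def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a
    random negative, counting ties as one half.

    Uses all positive/negative pairs, so memory grows with
    n_pos * n_neg; fine for a sketch, not for large datasets.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos = scores[labels]
    neg = scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))
```

For example, `bin_labels(signal, 128)` maps a signal of length 1,280 to 10 bin labels, and `auroc` applies to the two binary classification tasks, where each input sequence has a single label.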