Fig. 1: A schematic of the model structure of HK model 2.

Subreads generated by single-molecule real-time sequencing (SMRT-seq) are aligned to the corresponding circular consensus sequence (CCS), and kinetic features are established for individual nucleotides. These kinetic features include the inter-pulse duration (IPD) and pulse width (PW) (top left). Because DNA is double-stranded, subreads can be derived from both the Watson and Crick strands. As SMRT-seq uses a circularized DNA template, the DNA polymerase (yellow) performs multiple laps of continuous, processive polymerization with fluorescently labeled nucleotides, namely A (adenine), C (cytosine), G (guanine), and T (thymine) (top right), producing multiple subreads from the same DNA template. The colors of the fluorescent pulses emitted during sequencing are used to determine the identity of each base. The trajectory of these fluorescent signals is used to measure two key kinetic features, IPD and PW. The IPD reflects the time interval between two consecutive base incorporations, while the PW indicates how long a base incorporation event lasts. Because SMRT sequencing measures the same molecule repeatedly, the collective use of subreads from that molecule can improve both sequencing accuracy and the quantification of polymerase kinetics, which are influenced by base modifications present in the template [e.g. 5mC (5-methylcytosine), 5hmC (5-hydroxymethylcytosine), or 6mA (N6-methyladenine)]. The holistic kinetic (HK) model 2 framework is illustrated at the bottom. The kinetic signals of sequenced nucleotides within a flanking region around a query site (e.g. a C nucleotide in a CG context) are organized into an input matrix according to their base identities and positions, forming a measurement window. The input matrix is processed through convolutional layers, which extract local kinetic patterns associated with base modification.
The output of these layers, combined with positional embeddings encoding relative nucleotide positions, is passed into transformer layers, which capture kinetic relationships across the measurement window. The output layer generates probabilities for the different types of base modification (referred to as base modification scores). Base modifications predicted by the current HK model 2 include 5mC, 5hmC, and 6mA.
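
As an illustration of how kinetic signals might be arranged into a measurement window, the following is a minimal sketch and not the published HK model 2 implementation. The row layout (one IPD row and one PW row per base identity), the zero padding beyond the read ends, and the names `build_input_matrix`, `BASES`, and `FEATURES` are all assumptions made for this example. The actual model may differ, for instance in how it handles the two strands, normalization, or aggregation across subreads.

```python
from typing import List, Sequence

BASES = "ACGT"            # assumed row grouping by base identity
FEATURES = ("ipd", "pw")  # assumed two kinetic features per base


def build_input_matrix(
    sequence: str,
    ipd: Sequence[float],
    pw: Sequence[float],
    query_index: int,
    flank: int,
) -> List[List[float]]:
    """Arrange per-base kinetics around a query site into a 2D matrix.

    Rows: (base identity, feature) pairs, i.e. A-IPD, A-PW, C-IPD, ...
    Columns: positions from query_index - flank to query_index + flank.
    A cell is non-zero only in the rows of the base observed at that position.
    """
    window = 2 * flank + 1
    n_rows = len(BASES) * len(FEATURES)
    matrix = [[0.0] * window for _ in range(n_rows)]

    for col, pos in enumerate(range(query_index - flank, query_index + flank + 1)):
        if pos < 0 or pos >= len(sequence):
            continue  # assumption: positions beyond the read ends stay zero
        base = sequence[pos]
        if base not in BASES:
            continue  # skip ambiguous bases such as N
        row = BASES.index(base) * len(FEATURES)
        matrix[row][col] = ipd[pos]      # IPD row for this base
        matrix[row + 1][col] = pw[pos]   # PW row for this base
    return matrix


# Toy values (not real measurements): query the C at index 5 of a CG site.
example_sequence = "ACGTACG"
example_ipd = [0.5, 1.2, 0.8, 0.4, 0.6, 2.0, 0.9]
example_pw = [0.2, 0.3, 0.25, 0.2, 0.3, 0.5, 0.35]
example_matrix = build_input_matrix(
    example_sequence, example_ipd, example_pw, query_index=5, flank=2
)
```

In this sketch, each column carries values only in the two rows of the base observed at that position, so base identity and position are both encoded in the layout that a convolutional layer would then scan.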