Abstract
Basecalling is a crucial step in DNA sequencing that converts raw nanopore signals into nucleotide sequences. This paper presents a serial-parallel reprogrammable DNA sequencing accelerator based on a 64-state Hidden Markov Model (HMM) implemented in a 130-nm CMOS process. The proposed method optimizes computational efficiency, hardware utilization, and power consumption using a coarse-grained serial-parallel processing approach. It achieves 94.3% accuracy, outperforming Nanocall (85.6%) and Meta-Align (91.2%), while being slightly superior to the Scalable Hardware Accelerator (93.1%). Furthermore, it consumes 200 mW, which is 6 times lower than brute-force HMM implementations and 3–5 times more power-efficient than deep learning-based basecallers like DeepNano and Bonito. The proposed accelerator maintains competitive throughput at 8 M Bases/sec, balancing processing speed and energy efficiency. Additionally, the architecture supports scalability up to 4096 states, making it highly adaptable for various sequencing applications. Its hardware-optimized and low-power design makes it an ideal alternative to brute-force and software-based methods for real-time, mobile, and embedded DNA sequencing devices.
Introduction
The smallest DNA sequencers available on the market today are palm-sized devices that can continuously process a stream of DNA molecules and generate corresponding electronic measurements in real time1,2. The fundamental technological shift underlying these devices, compared to traditional sequencing methods, has led to their classification as “Third-Generation Sequencers”3,4. These machines provide sufficient quality for applications such as complete human genome assembly5,6,7. Their compact physical design also suggests potential applications across various fields in medicine and industry. However, the computational demands of these machines are considerable, which somewhat limits their portability8. This challenge is particularly evident in the early stages of the DNA sequencing pipeline, where molecular measurements are converted into their initial textual representation—a process known as basecalling. During this step, the minute electronic current fluctuations produced by these machines (on the order of 10 picoamperes) require advanced detection algorithms (basecallers) to accurately extract the underlying base sequence (A, C, G, T)9.
Recognizing that the potential of these sequencers for mobile applications could be significantly enhanced by access to specialized computing resources, we propose a custom basecalling accelerator designed for implementation in ASIC form. The motivation behind this work stems from the HMM algorithm’s ability to achieve 98% accuracy9,10,11 while maintaining a simple hardware-friendly structure.
Despite recent advancements in deep learning-based basecallers12,13,14, Hidden Markov Models (HMMs) remain a strong candidate for ASIC-based implementations due to several key advantages, including robustness in noisy environments, lower computational complexity, hardware-friendly implementation, scalability and reconfigurability, memory efficiency, interpretability, and adaptability to hybrid approaches. These benefits make HMMs a practical and efficient choice for DNA basecalling in ASIC implementations, offering an optimal balance between computational performance, power efficiency, and scalability.
In this paper, we introduce a 130-nm CMOS DNA basecaller based on the HMM algorithm, offering a balance between accuracy, efficiency, and hardware feasibility. One straightforward approach to implementing the 64-state HMM basecaller is a brute-force architecture, where each computational requirement of the processor is realized through its dedicated processing block. In this model, no resource sharing or reuse is employed, leading to a significantly large implementation area15. An alternative and more efficient method is to adopt an architecture that balances serial and parallel processing. In this paper, a serial-parallel structure is proposed for implementing the basecaller. The fundamental processing unit of the proposed architecture is based on a 16-state HMM structure. By incorporating reconfiguration techniques16,17 and leveraging resource reuse strategies, the proposed architecture can be scaled up to support a 4096-state basecaller, offering improved flexibility and efficiency while optimizing resource utilization.
The proposed architecture offers strong scalability, enabling adaptation to sequencing tasks of different complexities. By leveraging reconfiguration and resource sharing, it remains compact and energy-efficient, making it suitable for mobile, real-time, and low-resource environments. This flexibility enhances both performance and practicality, allowing integration into portable sequencing devices. This paper is organized as follows. In Section II, the paper introduces the basic principles of DNA sequencing and the main challenges associated with the process. Section III focuses on DNA basecalling using the Hidden Markov Model (HMM) algorithm, emphasizing its advantages and implementation aspects. Section IV describes the brute-force 64-state HMM architecture, outlining its structure, computational complexity, and hardware needs. Section V then presents a more efficient serial-parallel 16-state HMM design that improves scalability and optimizes resource utilization. Building on this, Section VI extends the concept to a 64-state HMM basecaller, which enhances accuracy while preserving hardware efficiency. Section VII explains the FPGA-to-chip signal transfer process and the experimental setup for system validation. Section VIII provides simulation and comparative performance results for the proposed designs, and finally, Section IX summarizes the main findings, highlights the contributions, and suggests directions for future research.
Advances in hardware-based DNA sequencing basecallers
The smallest DNA sequencing machines currently available on the market weigh approximately 100 g and, when operating continuously at their ideal capacity, can achieve a DNA measurement throughput equivalent to sequencing one human genome every three hours. This represents a remarkable acceleration compared to the decade-long effort of the Human Genome Project7,18,19. This advancement is primarily attributed to the speed at which individual sensors measure DNA (with nanopores20,21 being the sensors used) and the extensive parallel arrangement of these sensors. Given that this technology has only been commercially available for about ten years, it is expected that its throughput will continue to improve significantly over time. For instance, a single sensing site is currently about six orders of magnitude larger (in terms of cross-sectional area) than the individual nanopore sensor it houses. Additionally, the operating speed of the nanopore is artificially reduced by four orders of magnitude to match the limited resolution of the readout electronics22,23.
An example of the time-series signals available from a nanopore sensor.
The core mechanism of nanopore DNA sensing is quite simple. A nanoscale pore (typically less than 5 nm in diameter) is embedded in a membrane and immersed in an ionic solution. When a voltage is applied, ions flow through the pore, generating a steady baseline current of about 100 pA. As a DNA molecule enters and passes through the nanopore, it temporarily blocks the ion flow, causing measurable fluctuations in the current. These variations, recorded by an analog-to-digital converter, represent the translocation of DNA through the nanopore (Fig. 1)24,25.
As DNA strands pass through the nanopore, their movement causes fluctuations in the baseline current, creating a noisy time-series signal that contains valuable sequence information. Each nucleotide—adenine (A), cytosine (C), guanine (G), and thymine (T)—interacts differently with the nanopore, producing characteristic variations in ionic current. Although the resulting signal is noisy, advanced signal processing techniques can decode these fluctuations into the original nucleotide sequence, a process known as basecalling.
Basecalling is a crucial step in DNA sequencing, as it transforms raw electrical signals into readable genetic data. Once the basecalling step identifies the primary DNA structure, subsequent stages of the sequencing pipeline—such as read alignment, error correction, and genome assembly—are performed to ensure accuracy and reconstruct longer genome sequences. These processes enable applications like mutation detection, genetic variation analysis, and microbial community studies.
The precision of these downstream analyses strongly depends on the quality of basecalling, emphasizing its importance in nanopore sequencing. The next section of this paper introduces a specific basecalling approach and explores how it can be implemented as an ASIC to achieve higher computational efficiency and scalability.
The evolution of hardware-based DNA basecallers has been driven by the need for greater computational efficiency, improved accuracy, and reduced power consumption. Over the years, different architectures have been proposed to optimize various aspects of sequencing performance, each presenting unique trade-offs26.
One of the earliest hardware implementations, proposed by Olson et al.27, introduced an FPGA-based accelerator for short-read mapping. Their system implemented a parallelized Smith-Waterman algorithm, significantly speeding up genome alignment tasks. Compared to software-based methods such as BFAST and Bowtie, their FPGA design achieved a 31× speedup over Bowtie and a 250× speedup over BFAST. However, while it improved computational efficiency, it exhibited 91.5% mapping accuracy, which is lower than the proposed method but still outperformed some traditional tools. Its power consumption of 496 mW was about 2.5 times higher than that of the proposed method, limiting its efficiency for power-sensitive applications.
A few years later, Sharma et al. introduced an ASIC-based DNA sequencing accelerator, focusing on balancing power consumption and accuracy28. Their system achieved 96.7% accuracy, surpassing the proposed method and Chen et al., but required 600 mW of power. The design delivered a throughput of 7.8 M Bases/sec, which remains competitive with modern accelerators. Despite its accuracy benefits, the relatively high power demand, three times greater than that of the proposed method, limited its suitability for energy-constrained environments, particularly mobile sequencing devices.
More recent advancements have focused on optimizing throughput and accuracy while addressing power constraints29. Hammad et al.15 developed an FPGA-based basecaller optimized for high throughput, achieving an impressive 9 M Bases/sec—the fastest among the compared methods. Their approach leveraged an optimized hardware pipeline to enhance processing speed. However, the 850-mW power consumption was 4.25× higher than the proposed method, making it more applicable for high-performance computing clusters rather than portable sequencing applications.
In the same year, Rashed et al.30 focused on improving basecalling accuracy. Their accuracy-optimized basecaller reached an impressive 98.3% accuracy, making it the most precise among the compared methods. This high accuracy was achieved through an advanced probabilistic model, but at the cost of 1200 mW power consumption—6× higher than the proposed method. Additionally, its 7.5 M Bases/sec throughput was the lowest among the compared methods, indicating that its computational complexity came at the expense of sequencing speed. While highly accurate, this method is not ideal for energy-limited applications.
Another FPGA-based basecaller, developed by Chen et al.31, aimed to balance power and performance. Their design achieved 95.5% accuracy, slightly outperforming the proposed method, while consuming 300 mW—1.5× higher than the proposed approach. It delivered a throughput of 8.5 M Bases/sec, making it a well-rounded choice for environments that require an equilibrium between power efficiency, speed, and accuracy.
Existing hardware-based DNA basecallers face trade-offs between power efficiency, accuracy, and throughput. Early FPGA designs improved speed but consumed high power, while ASIC-based approaches increased accuracy at the cost of energy efficiency. Recent works have focused on either accuracy or throughput, failing to balance all three factors. The proposed method overcomes these limitations by achieving low power consumption (200 mW), high accuracy (94.3%), and strong throughput (8 M Bases/sec) through a reconfigurable and scalable architecture. This balance makes it ideal for real-time, embedded, and next-generation energy-efficient DNA sequencing systems.
HMM-based basecaller
Ideally, the modulated time-series signal fed into the basecaller would exhibit four distinct and easily recognizable levels, each uniquely corresponding to one of the four nucleotide bases (A, C, G, or T) that form the DNA strand32,33. However, as illustrated in Fig. 1, this is not the case in practice. Instead, the piecewise-constant event curve that approximates the noisy time-series in Fig. 1 demonstrates numerous possible output levels (events) rather than just four. This discrepancy arises due to the inherently complex and imprecise interaction between the nanopore sensor and the DNA molecule being measured.
Although the nanopore’s orifice is small enough to accommodate only a single DNA strand at a time, its length can extend across multiple consecutive bases, often as many as 10 nucleotides. Consequently, at any given moment, the recorded signal does not correspond to an individual nucleotide but rather to a k-mer, a segment of the DNA strand containing k bases34,35,36. This overlapping influence means that the sensor output is an amalgamation of the signals from multiple bases, making it significantly more difficult to map the observed electrical fluctuations to a distinct sequence of bases. The problem is further compounded by the fact that different k-mers may produce highly similar signal variations, adding ambiguity to the basecalling process.
Nanopore-based DNA sequencing signals are affected by multiple sources of noise and distortion, including electrostatic interactions, temperature variations, DNA secondary structures, and inconsistent translocation speeds. These factors create nonlinear relationships between the measured current and the actual DNA sequence, while additional noise from the readout electronics and analog-to-digital conversion further complicates accurate basecalling. Because of these complexities, simple thresholding or pattern-matching methods are insufficient. Instead, advanced computational techniques are required to infer the correct DNA sequence from noisy signals.
This paper introduces a hardware implementation of a DNA basecaller based on a Hidden Markov Model (HMM), which effectively captures the probabilistic relationship between the noisy current signals and the true DNA bases. Using the Viterbi algorithm, the system identifies the most likely sequence of nucleotides by optimally decoding the observed data. The proposed ASIC-based design offers a robust, efficient, and accurate solution for real-time basecalling, and its performance—evaluated in terms of accuracy, throughput, and power efficiency—is compared against existing software-based approaches.
Hardware implementation and performance considerations
While software-based implementations of HMM-based basecalling exist, they are often computationally demanding and require significant processing time. To enhance performance and efficiency, we explore an ASIC (Application-Specific Integrated Circuit) implementation of the proposed HMM-based basecaller. By translating the computationally intensive steps of the Viterbi algorithm into a dedicated hardware architecture, we can achieve significant improvements in speed, energy efficiency, and scalability.
Sensor model
One of the primary limitations of nanopore sensors that must be accounted for in their Hidden Markov Model (HMM) is their inherent sensitivity to groups of bases, known as k-mers. This characteristic fundamentally affects how event signals, such as those shown in Fig. 1, can be approximated and subsequently mapped to a base sequence drawn from the standard four-letter nucleotide alphabet: \(\mathcal{A}=\{A,C,G,T\}\). To accurately represent the sequential dependencies in nanopore measurements, the HMM must be constructed using states drawn from an extended alphabet \(\mathcal{A}^{k}\) of size \(4^{k}\). For instance, in the case of a 3-mer HMM, the state space consists of \(4^{3} = 64\) possible sequences: \(\mathcal{A}^{3} = \{AAA, AAC, \ldots, TTG, TTT\}\), where each element’s label assumes the form \(\mathcal{K}_{2}\mathcal{K}_{1}\mathcal{K}_{0}\), with \(\mathcal{K}_{0}\) representing the identity of the most recent DNA base to have entered the nanopore sensor. With this formalism, once an event signal is mapped to a consistent sequence of k-mers, extracting a 1-mer equivalent is trivial: it suffices to retain the 1-mer suffix \(\mathcal{K}_{0}\) of each k-mer.
For convenience, we associate each k-mer label with a unique state number drawn from a radix-4 number system where {0, 1, 2, 3} → {A, C, G, T}. For example, in decimal notation the 3-mer CTG maps to state \(1\cdot 4^{2} + 3\cdot 4^{1} + 2\cdot 4^{0} = 30\).
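The radix-4 labeling can be sketched in a few lines of Python. The function names `kmer_to_state`, `state_to_kmer`, and `last_base` are illustrative and do not come from the design; the sketch assumes only the digit mapping {0, 1, 2, 3} → {A, C, G, T} given above.

```python
BASES = "ACGT"  # digit values 0..3, following the mapping in the text


def kmer_to_state(kmer: str) -> int:
    """Interpret a k-mer as a radix-4 number, most significant base first."""
    state = 0
    for base in kmer:
        state = state * 4 + BASES.index(base)
    return state


def state_to_kmer(state: int, k: int) -> str:
    """Inverse mapping: expand a state number into its k-base label."""
    digits = []
    for _ in range(k):
        digits.append(BASES[state % 4])
        state //= 4
    return "".join(reversed(digits))


def last_base(state: int) -> str:
    """Return the 1-mer suffix K0, i.e. the least significant radix-4 digit."""
    return BASES[state % 4]


print(kmer_to_state("CTG"))   # 30
print(state_to_kmer(30, 3))   # CTG
print(last_base(30))          # G
```

Because the most recent base occupies the least significant digit, recovering the 1-mer sequence from a decoded state path reduces to a modulo-4 operation per state.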
Figure 2 shows an example of the possible state transitions for a single state. Each state has 21 possible transitions: one stay transition (black arrow, no change in the state), four step transitions (red arrows, one base change), and 16 skip transitions (green arrows, two base changes). The main goal of the basecalling algorithm is to find the most likely of these possible state transitions.
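As an illustration, the stay, step, and skip transitions leaving a 3-mer state can be enumerated as follows. This is a sketch under the assumption that a new base enters at the \(\mathcal{K}_{0}\) position, so that a step shifts the radix-4 label by one digit and a skip by two; the function name `transitions` is hypothetical.

```python
def transitions(state: int, k: int = 3):
    """List the 21 (kind, next_state) transitions leaving a k-mer state.

    Assumes a new base enters at the K0 (least significant) position, so a
    step shifts the radix-4 label left by one digit and a skip by two.
    """
    n_states = 4 ** k
    out = [("stay", state)]
    out += [("step", (state * 4 + b) % n_states) for b in range(4)]
    out += [("skip", (state * 16 + c) % n_states) for c in range(16)]
    return out


moves = transitions(30)  # state 30 is CTG
print(len(moves))        # 21
print(sorted(s for kind, s in moves if kind == "step"))  # [56, 57, 58, 59]
```

The 21 entries are counted by transition type. For some states, such as AAA, the same destination can be reached by more than one type, so the number of distinct destination states may be smaller than 21.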
The encoder sub-model.
Viterbi detection
The whole process of basecalling based on the HMM algorithm can be summarized as follows:
- Observing the mean and standard deviation of the event signal, \(({x}_{i}, {y}_{i})\).
- Computing the log of the emission probability, i.e. the probability that the signal came from state \(j\), \({b}_{j}({x}_{i}, {y}_{i})\).
- Finding the most likely state transition via the Viterbi algorithm.
- Updating the state posterior with the log emission probability.
- Considering the next \(({x}_{i+1}, {y}_{i+1})\).
HMM basecalling algorithm
Algorithm 1 provides a complete description of the HMM basecalling algorithm for each event value. The most computation-intensive part of the algorithm is the inner “for” loop, which is repeated for each input event signal. In this loop, the probability that the observed event belongs to state j, \({v}_{\nu}(i-1)\), and the associated pointer \(pt{r}_{j}(i)\) are computed. \(pt{r}_{j}(i)\) indicates which of the candidate states precedes state j, that is, the state predicted for the \((i-1)\)th event value.
In this algorithm, \(\omega(j) \subset \{0, \ldots, {4}^{k} - 1\}\) is the subset of computed posterior states, and \(\tau(\nu, j)\) is the probability that a state \(\nu \in \omega(j)\) transits to state j. As discussed in the previous section, there are 21 possible state transitions (stay, step, and skip).
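The inner-loop update can be sketched as below. This is a minimal sketch rather than Algorithm 1 itself: it assumes the quantities are handled as negative log probabilities (costs), so that the best transition is the minimum of \({v}_{\nu}(i-1) + \tau(\nu, j)\), and the names `viterbi_step` and `traceback` as well as the toy costs are hypothetical.

```python
def viterbi_step(prev_cost, emit_cost, preds, trans_cost):
    """One Viterbi update in the negative-log (cost) domain.

    prev_cost[v]     : cost of the best path ending in state v at event i-1
    emit_cost[j]     : negative log emission probability of state j at event i
    preds[j]         : candidate predecessor states of j, i.e. omega(j)
    trans_cost(v, j) : negative log transition probability tau(v, j)
    Returns (new_cost, ptr), where ptr[j] is the chosen predecessor of j.
    """
    new_cost, ptr = [], []
    for j in range(len(prev_cost)):
        best_v = min(preds[j], key=lambda v: prev_cost[v] + trans_cost(v, j))
        ptr.append(best_v)
        new_cost.append(prev_cost[best_v] + trans_cost(best_v, j) + emit_cost[j])
    return new_cost, ptr


def traceback(ptrs, final_state):
    """Follow the stored pointers backwards to recover the state path."""
    path = [final_state]
    for ptr in reversed(ptrs):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy 2-state example with hypothetical costs (not taken from the paper).
preds = [[0, 1], [0, 1]]
tau = lambda v, j: 0 if v == j else 2
cost = [0, 5]
ptrs = []
for emit in ([1, 0], [3, 0]):
    cost, ptr = viterbi_step(cost, emit, preds, tau)
    ptrs.append(ptr)
best = min(range(len(cost)), key=cost.__getitem__)
print(cost, traceback(ptrs, best))  # [4, 2] [0, 1, 1]
```

In the 64-state basecaller, each `preds[j]` would hold the 21 stay, step, and skip candidates, which is what makes the inner loop the dominant cost.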
In the next section, each block of the design, corresponding to the main algorithm processing requirements is discussed. To avoid complicated hardware implementation of logarithm calculation, off chip pre-processing is performed. The computed logarithm values are provided as the input of the design.
Our design includes a dedicated posterior memory subsystem that manages posterior probabilities efficiently for both the 3-mer (64 states) and 6-mer (4096 states) HMM configurations. Posterior values are stored at event \(x_i\) and retrieved for update at event \(x_{i+1}\), using dual-port BRAM blocks on the FPGA to enable simultaneous read and write operations in a single cycle. For the 6-mer model, the state space is divided into 64 segments of 64 states each, and three BRAM groups (Skip, Step, Stay) are allocated per segment. To prevent memory contention, we use a ping-pong mechanism in which separate address ranges alternate between read and write across events, guaranteeing conflict-free deterministic access.
- For event \(x_i\): addresses 0–3 → write, 4–7 → read.
- For event \(x_{i+1}\): the addresses swap roles (0–3 → read, 4–7 → write).
This strategy guarantees deterministic access and maintains throughput. Each segment’s 64 posteriors are indexed with fixed patterns to align BRAMs and event indices. Importantly, to reduce hardware complexity, the logarithm calculations are pre-computed on the host PC. These pre-processed logarithmic values are then stored in the FPGA’s BRAMs before execution begins. A handshake protocol ensures reliable data transfer from the host PC to the FPGA, allowing the accelerator to access the required values with minimal latency. This strategy eliminates the need for on-chip logarithm computation, reduces silicon overhead, and ensures consistent throughput. Our modular design supports both 3-mer and 6-mer HMMs without architectural changes: in the 3-mer case, a single segment is active, while in the 6-mer case, all 64 segments are processed iteratively using the same datapath and BRAM management scheme. This memory management approach avoids reliance on external memory, ensures real-time posterior updates, and provides scalable, energy-efficient support for basecalling acceleration across different HMM state sizes.
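The ping-pong addressing described above can be illustrated with a small model. This sketch assumes two banks of four addresses that swap roles on every event, with even-numbered events writing to the lower bank; the function name `ping_pong_ranges` and the even/odd convention are assumptions made for illustration.

```python
def ping_pong_ranges(event_index: int, bank_size: int = 4):
    """Return (write_addresses, read_addresses) for a given event.

    Two address banks of `bank_size` entries swap roles on every event, so
    the posteriors written for one event are read back for the next one.
    """
    low = range(0, bank_size)
    high = range(bank_size, 2 * bank_size)
    if event_index % 2 == 0:
        return low, high   # e.g. addresses 0-3 written, 4-7 read
    return high, low       # roles swapped on the following event


for i in range(3):
    write, read = ping_pong_ranges(i)
    print(i, list(write), list(read))
```

Since the write bank of one event is always the read bank of the next, a read and a write never target the same address within an event.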
Brute force 64-state architecture
The brute-force 64-state HMM basecaller structure, illustrated in Fig. 3, is achieved by unrolling the “for” loop of Algorithm 1. This figure presents the key blocks required to implement the most computationally intensive part of the HMM basecalling algorithm, as discussed in the previous section. A detailed explanation of this structure is provided in this section.
As shown in Fig. 3, 64 “logem” blocks operate in parallel to compute the logarithm of the emission probability (\({b}_{j}(i)\)). The “Block Add” unit consists of 21 parallel adders, responsible for summing the logarithm of the posterior probability with the logarithm of the transition probability (\({v}_{\nu}(i-1) + \tau(\nu, j)\)), as indicated in line 3 of Algorithm 1. The “comp” blocks are then used to determine the minimum among the 21 outputs from the “Block Add” unit and to generate a pointer indicating the most probable state transition.
During the initial cycles of the basecalling process, the input to the “Block Add” unit comes from the outputs of the “logem” blocks. However, in subsequent cycles, this input is replaced by the posterior state, which is obtained from the final blocks of the structure, the subtractor blocks (“Sub”). MUX blocks are used to select between these two signals. A controller block manages the on-off cycles of various blocks within the architecture to ensure proper operation and reduce power consumption. To prevent overflow, normalization is performed in the final stage of the algorithm. Specifically, the minimum value among the 64 computed posterior probabilities is determined and subtracted from the main values. The 64-input comparator and “Sub” blocks, illustrated in the last stage of the brute-force structure in Fig. 3, handle this normalization step. For simplicity, this normalization process is not explicitly described in Algorithm 1.
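The normalization step can be sketched as follows, assuming the posteriors are held as costs for which smaller is better; the function name `normalize` is illustrative.

```python
def normalize(costs):
    """Subtract the smallest cost so the best state sits at zero."""
    floor = min(costs)
    return [c - floor for c in costs]


posteriors = [7, 3, 9]
print(normalize(posteriors))  # [4, 0, 6]
```

Subtracting a common offset leaves all pairwise differences unchanged, so the Viterbi decisions are unaffected while the stored values stay within a bounded range.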
The block diagram of the “logem” block is shown in Fig. 4. By utilizing pipeline registers, the “logem” block requires three cycles to complete the generation of log-emission outputs. As illustrated in Fig. 3, the outputs of the “logem” blocks are first used in the second stage of the add blocks. Consequently, the three cycles required to generate all 64 log-emission outputs overlap with the computation of other blocks, reducing the total number of required clock cycles.
Brute force 64 states HMM basecaller architecture.
The operational cycles of the brute-force architecture are depicted in Fig. 5. As shown, the basecaller requires nine cycles to produce an output for each event input. The number of cycles could be reduced at the cost of a lower clock frequency in Fig. 6. To increase the clock frequency, a pipelined structure is used to implement the 21-input and 64-input comparator blocks. Figure 7 illustrates the design of these comparators, where the fundamental processing unit of each comparator is a 4-input comparator. The complexity of this coarse-grained block is chosen to balance speed and the required number of cycles.
Coarse grained 16-state structure
The state distributor block is responsible for generating the 21 possible state transitions for all 64 states, producing a total of 21 × 64 outputs of 36 bits each. A special configuration of the state distributor block enables resource sharing, which helps reduce the overall area and power consumption of the 64-state HMM basecaller. A closer examination of the state distributor block reveals that, within every 16 blocks, the last 16 outputs remain identical. Due to this redundancy, instead of using 21 add-compare blocks in parallel, 16 blocks can be shared. Further analysis shows that four additional adder blocks can also be shared among every four states due to similar output patterns in the state distributor block.
Figure 7 illustrates the core concept of resource sharing. In the optimized architecture, leveraging this sharing technique reduces the number of adders to 16 + 16 + 16 = 48 for every 16 states. Compared to the brute-force architecture, which requires 21 × 16 = 336 adders, this modification significantly decreases the number of adders, leading to substantial area and power savings.
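The adder counts can be checked by enumerating the candidate predecessors of one group of 16 consecutive states. This is a sketch under two assumptions that are not stated in the text: new bases enter at the \(\mathcal{K}_{0}\) position, and an adder can be shared whenever two states use the same predecessor with the same transition type (that is, the transition term depends only on the type). The function name `predecessors` is hypothetical.

```python
def predecessors(j: int, k: int = 3):
    """Yield (kind, v) for the 21 candidate predecessors v of state j.

    Assumes new bases enter at the K0 position, so a step predecessor holds
    the leading k-1 bases of j shifted down by one digit, and a skip
    predecessor holds the leading k-2 bases shifted down by two digits.
    """
    yield ("stay", j)
    for b in range(4):
        yield ("step", b * 4 ** (k - 1) + j // 4)
    for c in range(16):
        yield ("skip", c * 4 ** (k - 2) + j // 16)


block = range(16, 32)  # one group of 16 consecutive states (all start with C)
brute_force = sum(len(list(predecessors(j))) for j in block)
shared = len({pair for j in block for pair in predecessors(j)})
print(brute_force, shared)  # 336 48
```

Under these assumptions the 48 distinct terms split into 16 skip terms common to the whole group, 16 step terms (four per sub-group of four states), and 16 stay terms, which is one reading of the 16 + 16 + 16 breakdown.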
The “logem” block structure.
Operation cycles of brute force 64 states HMM basecaller.
Structure of (a) 21-input (b) 64-input comparator, based on 4-input comparators.
To maintain a homogeneous structure, instead of using large, coarse-grained 21- or 64-input comparators, fine-grained 4-input comparator (“comp 4in”) blocks are utilized. Additionally, sharing adder blocks further reduces the number of required comparator blocks, leading to a more efficient design. This optimization can be summarized as follows:
- A 16-input comparator block for comparing the outputs of the 16 shared adder blocks.
- Four 4-input comparator blocks for comparing the outputs of the four 4-adder blocks.
- Sixteen 4-input comparator blocks for determining the final 16 minimum posterior values corresponding to the 16 states.
Similar to the 21- and 64-input comparators, the 16-input comparator is designed using a hierarchical structure based on 4-input comparators. Specifically, it consists of five 4-input comparator blocks. In the proposed coarse-grained 16-state structure, the total number of required “comp 4in” blocks is 5 + 4 + 16 = 25, compared to 16 × 6 = 96 in the previous design, which required a total of 384 comparators for the entire 64-state basecaller. This optimization technique significantly reduces hardware resource usage, making the new architecture much more efficient in terms of area and power consumption.
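A hierarchical minimum built from 4-input comparators can be modeled as below. This is a behavioral sketch only, with hypothetical names `comp4` and `tree_min`; it does not model pipelining, and it is shown for the 16-input case, where it reproduces the count of five blocks.

```python
def comp4(items):
    """Model of one 'comp 4in' block: minimum of up to four (value, index) pairs."""
    return min(items)


def tree_min(values):
    """Reduce a list with layers of 4-input comparators.

    Returns (min_value, index_of_min, number_of_comp4_blocks_used).
    """
    layer = [(v, i) for i, v in enumerate(values)]
    blocks = 0
    while len(layer) > 1:
        groups = [layer[n:n + 4] for n in range(0, len(layer), 4)]
        layer = [comp4(g) for g in groups]
        blocks += len(groups)
    value, index = layer[0]
    return value, index, blocks


values = list(range(16, 0, -1))  # 16, 15, ..., 1
print(tree_min(values))          # (1, 15, 5)
print(5 + 4 + 16)                # 25 'comp 4in' blocks per 16-state unit
```

Carrying the index alongside each value is what allows the comparator tree to output the pointer to the winning transition together with the minimum.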
16 blocks of 64-states HMM basecaller with new state distributer and resource sharing.
Serial-parallel 64-state HMM based on the coarse-grained 16-state structure
As previously described, resource sharing helps reduce the area and power consumption of the basecaller. To further minimize the area, this paper introduces a serial-parallel structure. The coarse-grained 16-state structure, introduced in the previous section, serves as the main building block of the proposed serial-parallel architecture. In this structure, 16 “logem” blocks and one coarse-grained 16-state HMM unit operate in parallel with 16 subtractor blocks. The 16 “logem” blocks generate all 64 log-emission outputs over four consecutive clock cycles. Each “logem” block consists of three main ADD-MULT-ADD units, requiring three cycles per input to produce the first 16 outputs.
Figure 8 illustrates the operational cycles of the proposed architecture. As shown, generating all 64 log-emission outputs takes 7 clock cycles. At the same time, a single 16-state coarse-grained unit is responsible for generating all 64 outputs.
To reduce the required cycles for processing the full 64-state basecaller, a resource reuse strategy is applied. Figure 8 also shows the ON-OFF cycles of different blocks in the architecture. As shown, the entire 64-state HMM basecaller requires 16 clock cycles to complete processing.
The structure used for the 64-input comparator in the new serial-parallel architecture is shown in Fig. 9. Every two cycles, the 16-input comparator processes 16 inputs and stores the results in a shift register. After five cycles, the overall minimum of the 64-input comparison is determined using a final “comp 4in” block.
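The serial reuse of a single 16-input comparator can be sketched behaviorally as follows. The function name `serial_min64` is hypothetical, and the sketch models only the data flow (four passes, a four-entry shift register, and a final 4-input comparison), not the cycle-level timing.

```python
def serial_min64(values):
    """Find the minimum of 64 values by reusing one 16-input comparator.

    Each pass handles one group of 16 inputs and pushes its (value, index)
    winner into a 4-entry shift register; a last 4-input comparison picks
    the overall minimum.
    """
    assert len(values) == 64
    shift_register = []
    for group in range(4):
        chunk = values[16 * group:16 * (group + 1)]
        local = min(range(16), key=chunk.__getitem__)
        shift_register.append((chunk[local], 16 * group + local))
    return min(shift_register)


vals = [((i + 5) * 29) % 67 for i in range(64)]
print(serial_min64(vals))  # (0, 62)
```

Only one 16-input comparator and one extra “comp 4in” block are active in this scheme, at the cost of spreading the comparison over several passes.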
Operational timeline of the serial-parallel basecaller over 16 cycles, showing log-emission generation and staggered HMM computations via resource reuse.
New 64 input comparator structure.
FPGA/CHIP communication
For testing the designed architecture, FPGA-to-chip communication is implemented to ensure seamless data transfer and synchronization. The FPGA provides the necessary model parameters and input event signals to the chip via its designated 18-bit input pins. To facilitate smooth interaction between the FPGA and the chip, several control signals are incorporated into the design. These control signals play a crucial role in ensuring proper input/output handshaking and maintaining synchronization between the FPGA and the chip during data transmission.
Additionally, they help in coordinating the execution of the basecalling algorithm on the chip, minimizing potential communication delays or errors. Figure 10 provides a high-level overview of the FPGA-to-chip handshaking mechanism and the overall process of input/output data transmission.
A multiplexer on the FPGA side is used to select between model parameters and the mean of event values, which are applied sequentially at the beginning of the algorithm’s operation. This selection process ensures that the appropriate data is fed into the system based on the computational requirements.
Additionally, \(\:R{D}_{in}/R{D}_{o}\) and \(\:{\text{v}\text{a}\text{l}}_{\text{i}\text{n}}/{\text{v}\text{a}\text{l}}_{\text{o}}\) signals serve as ready and valid handshaking signals, facilitating efficient data transfer between the FPGA and the external chip. These signals help synchronize input and output data transmission, ensuring seamless communication and preventing data loss or misalignment during processing.
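A generic ready/valid transfer can be modeled cycle by cycle as shown below. This is a sketch of the general handshaking idea only: the function name `simulate_handshake`, the repeating `ready_pattern`, and the always-asserted valid signal are assumptions and do not describe the actual pin-level behavior of the chip.

```python
def simulate_handshake(data, ready_pattern):
    """Cycle-by-cycle model of a valid/ready transfer.

    The sender holds `valid` high while it has data; a word moves only on
    cycles where the receiver's `ready` is also high.
    Returns (received_words, cycles_used).
    """
    if data and not any(ready_pattern):
        raise ValueError("receiver is never ready, transfer cannot complete")
    received, pending, cycle = [], list(data), 0
    while pending:
        valid = True
        ready = ready_pattern[cycle % len(ready_pattern)]
        if valid and ready:
            received.append(pending.pop(0))
        cycle += 1
    return received, cycle


print(simulate_handshake([10, 20, 30], [False, True]))  # ([10, 20, 30], 6)
```

No word is dropped or duplicated when the receiver stalls, because data advances only on cycles in which both signals are asserted.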
FPGA-CHIP signal handshaking and input/output data transmission for final test.
Example of output transferring cycles frequency regulation.
Due to the serial nature of the design, the output requires four clock cycles to be fully transferred to the FPGA. Additionally, there is a minimum of 12 waiting cycles, determined by the off-time of the enable signal, before the first output of the next event can be generated. As a result, it is not necessary to transfer the output to the FPGA at the same clock frequency as the chip. In this design, the frequency of output signal transmission can be adjusted by modifying the off-time of the enable (“En”) signal. As previously described, the “En” signal serves as the global enable for the computational blocks implementing the algorithm on the chip. When “En” is high, the control block manages the activation and deactivation of individual algorithmic blocks. Conversely, during the period when “En” is low, the generated output is sent to the FPGA and stored in the allocated memory on the FPGA. Figure 11 illustrates the output frequency regulation method employed in this design. As shown in Fig. 11, by increasing the off-time of the enable signal, the number of cycles available for transferring the output to the FPGA can be configured, providing flexibility in data transmission timing.
Simulation results
To evaluate the correctness of the proposed fixed-point HMM basecaller across a range of bit-widths and input SNR levels, and to inform the FPGA architecture, a bit-accurate MATLAB simulator was developed. Real sensor data contains noise and translocation errors (stay/skip), so we modeled a 10% probability for each effect. Figure 12 shows the basecalling accuracy (%) of the fixed-point and floating-point basecallers for a range of bit-widths (4–12) and SNR levels. The results show that accuracy converges to the floating-point baseline as the bit-width grows (Fig. 12). However, regardless of bit-width, accuracy is limited at low SNR levels due to noise. At 20 dB SNR, for example, increasing the resolution above 6 bits has no effect. In contrast, accuracy at high SNR (e.g., 50 dB) depends heavily on resolution, and 12 bits are needed to reach 99.04%.
Figure 12 highlights a clear trade-off between input ENOB (SNR) and internal bit width. For high-SNR inputs (e.g., 50 dB), accuracy saturates at around 7–8 bits. For moderate SNR (30 dB), 9–10 bits are needed to achieve similar accuracy, while low-SNR inputs (20 dB) require even higher bit widths (~ 10–11 bits) for saturation. These observations indicate that higher-quality inputs can reach the same accuracy with lower internal precision, thereby reducing computational complexity, hardware resource usage, and power consumption. This analysis provides a quantitative guideline for selecting bit width according to input quality in energy-efficient basecaller designs.
Basecalling accuracy versus bit-width (Bar chart).
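The interaction between input noise and internal resolution can be explored with a few lines of code. The following is a minimal sketch and not the authors' MATLAB simulator: it only quantizes a noisy test signal to a signed fixed-point grid and reports the resulting signal-to-error ratio, without running any basecalling. The helper names, the sinusoidal test signal, and the full-scale range of 2.0 are assumptions chosen for illustration.

```python
import math
import random


def quantize(x, bits, full_scale=1.0):
    """Round x to a signed fixed-point grid with `bits` bits over [-full_scale, full_scale)."""
    step = 2.0 * full_scale / (2 ** bits)
    code = round(x / step)
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    code = max(lo, min(hi, code))          # saturate instead of wrapping around
    return code * step


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise so that the signal-to-noise ratio is snr_db."""
    power = sum(s * s for s in signal) / len(signal)
    sigma = math.sqrt(power / (10.0 ** (snr_db / 10.0)))
    return [s + rng.gauss(0.0, sigma) for s in signal]


def measured_snr_db(clean, observed):
    """Ratio of clean-signal power to error power, in dB."""
    p_sig = sum(c * c for c in clean) / len(clean)
    p_err = sum((o - c) ** 2 for c, o in zip(clean, observed)) / len(clean)
    return 10.0 * math.log10(p_sig / p_err)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [math.sin(0.01 * i) for i in range(20000)]
    for snr_db in (20.0, 50.0):
        noisy = add_noise(clean, snr_db, rng)
        for bits in (4, 8, 12):
            q = [quantize(v, bits, full_scale=2.0) for v in noisy]
            print(f"input SNR {snr_db:.0f} dB, {bits:2d} bits: "
                  f"{measured_snr_db(clean, q):.1f} dB after quantization")
```

Running the loop over several bit-widths shows at which resolution the quantization error stops being the dominant error source for a given input SNR, which is the kind of saturation behavior a bit-width sweep is intended to expose.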
Publicly accessible nanopore signal data from the Oxford Nanopore Technologies (ONT) E. coli dataset were used for validation, both in our MATLAB reference implementation and for hardware validation. To test robustness, we simulated several read lengths (512, 1024, and 2048 bases). Each event in MinION sequencing corresponds to the dwell time of a k-mer in the pore, which usually spans multiple current samples. Based on reported values, we tested our system with event sizes of 5, 15, and 25 time steps, covering the range of dwell-time variability seen in practice. These parameters were used consistently in both the FPGA implementation and the MATLAB reference model to ensure a fair comparison and precise validation of the proposed accelerator.
The proposed architecture has been synthesized, placed, and routed using TSMC 130 nm technology, ensuring an efficient and manufacturable design. A placed-and-routed layout of the design is shown in Fig. 13.
Placed-and-routed layout of the designed HMM basecaller.
In this architecture, except for the input of the “logem” block, which receives 18-bit data, all other processing operations utilize a 36-bit data width. This specific data width selection is based on an extensive analysis of the algorithm’s signal-to-noise ratio (SNR), performed through MATLAB simulations. Since MATLAB simulations are typically conducted using floating-point arithmetic, a 36-bit fixed-point implementation was chosen to achieve high computational accuracy while maintaining efficient hardware utilization. This decision ensures that the output maintains a high level of precision without significantly increasing the complexity or power consumption of the hardware.
Following basecalling (via Viterbi traceback), the predicted state sequence was compared directly with the original reference sequence. Specifically, the element-wise difference between the reference and predicted state sequences was computed and the number of mismatches (errors) was counted. The error rate was calculated as the ratio of the number of errors to the length of the sequence, and accuracy was defined as one minus the error rate:
\(\text{Accuracy} = 1 - \dfrac{N_{\text{err}}}{N}\)
where \(N_{\text{err}}\) is the number of mismatched positions and \(N\) is the length of the sequence.
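The accuracy metric described above amounts to a position-by-position comparison of two equal-length state sequences. A minimal sketch follows; the function name `basecall_accuracy` and the integer encoding of states are assumptions for illustration, and no alignment step is performed because the text describes a direct element-wise comparison.

```python
def basecall_accuracy(reference, predicted):
    """Accuracy = 1 - (number of mismatched positions / sequence length)."""
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    errors = sum(1 for r, p in zip(reference, predicted) if r != p)
    return 1.0 - errors / len(reference)


if __name__ == "__main__":
    ref = [0, 1, 2, 3]      # hypothetical reference state indices
    pred = [0, 1, 2, 0]     # one mismatch in the last position
    print(basecall_accuracy(ref, pred))
```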
Post-layout simulation results validate the correctness of the implementation, showing an exact match with the results obtained from MATLAB simulations.
This consistency confirms the reliability of the hardware implementation in accurately executing the intended computations. In the following, the performance of the proposed method is first compared with the basic architecture. Next, it is compared against a set of HMM-based and deep learning-based basecaller architectures in terms of power consumption, accuracy, and throughput.
Comparing against basic basecaller architecture
To further assess the efficiency of the proposed architecture, its parameters are compared against the base architecture in Table 1, considering a clock frequency of 100 MHz. As indicated in the table, the brute-force model exhibits an area consumption of 8.27 mm² and a power consumption of 1200 mW. These metrics highlight the trade-offs involved in different architectural choices, offering insights into the balance between computational complexity, power efficiency, and silicon area.
The proposed serial-parallel architecture significantly reduces the core area and power consumption compared to the brute-force approach. As shown in Table 1, the core area is reduced from 8.27 mm² to 1.2 mm², while power consumption decreases from 1200 mW to 200 mW. This reduction is primarily achieved through resource sharing and an optimized state distributor block that enables hardware reuse across multiple states.
However, this reduction in area comes with a latency trade-off. While the brute-force model processes a 64-state HMM basecalling operation in 9 clock cycles, the serial-parallel architecture requires 16 clock cycles to complete the same operation. This increase in processing time results from the sequential activation of computation blocks to minimize area usage. Although the serial-parallel structure sacrifices some processing speed, it offers a significant improvement in energy efficiency (0.025 µJ/base vs. 0.133 µJ/base for the brute-force design) and in scalability (supporting up to 4096 states). This makes it particularly suitable for mobile and real-time DNA sequencing applications, where area and power constraints are more critical than raw speed.
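The energy and latency figures quoted above follow from simple arithmetic. The sketch below is illustrative only: it divides power by throughput to obtain energy per base and divides the cycle count by the clock frequency to obtain latency, using the 200 mW, 8 M Bases/sec, and 100 MHz values reported for the proposed design. The helper names are hypothetical.

```python
def energy_per_base_uj(power_mw, throughput_mbases_per_s):
    """Energy per base in microjoules.

    mW divided by (M Bases/sec) gives nJ per base; dividing by 1000 gives uJ per base.
    """
    return power_mw / throughput_mbases_per_s / 1000.0


def latency_ns(cycles, clock_mhz):
    """Latency of one operation in nanoseconds for a given cycle count and clock."""
    return cycles * 1000.0 / clock_mhz


if __name__ == "__main__":
    # Proposed serial-parallel design: 200 mW at 8 M Bases/sec.
    print(f"{energy_per_base_uj(200, 8):.3f} uJ/base")
    # 64-state operation at a 100 MHz clock: 16 cycles vs. 9 cycles.
    print(f"serial-parallel: {latency_ns(16, 100):.0f} ns, "
          f"brute-force: {latency_ns(9, 100):.0f} ns")
```

With these inputs the proposed design works out to 0.025 µJ per base, and the extra seven cycles of the serial-parallel schedule correspond to 70 ns of additional latency per 64-state operation at 100 MHz.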
In the 4096-state (6-mer) configuration, scalability is achieved by sequentially reusing the 64-state compute block across 64 segments. This results in a throughput reduction of approximately 64× compared to the 64-state mode, as each segment is processed in succession. Importantly, because the datapath and control logic are reused without duplication, the overall core area and power consumption remain nearly constant. This trade-off between increased state modeling and reduced throughput allows the architecture to flexibly adapt to sequencing complexity within strict area and energy constraints.
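The scaling rule described above can be expressed as a short calculation. This is a sketch under the simplifying assumption that throughput falls exactly in proportion to the number of sequentially processed segments, with no extra control overhead, whereas the text states the reduction only as approximately 64×. The function names are hypothetical, and the 8 M Bases/sec baseline is the 64-state figure reported for the proposed design.

```python
def kmer_states(k):
    """Number of HMM states for a k-mer model over the alphabet {A, C, G, T}."""
    return 4 ** k


def segmented_throughput(base_throughput, total_states, block_states=64):
    """Throughput when one `block_states` compute block is reused sequentially.

    Returns (number of segments, throughput in the same unit as base_throughput).
    """
    if total_states % block_states != 0:
        raise ValueError("total_states must be a multiple of block_states")
    segments = total_states // block_states
    return segments, base_throughput / segments


if __name__ == "__main__":
    states = kmer_states(6)                              # 6-mer model
    segments, tput = segmented_throughput(8e6, states)   # 8 M Bases/sec at 64 states
    print(states, segments, tput)
```

Under this idealized model the 6-mer configuration needs 64 passes through the shared block, so an 8 M Bases/sec baseline would drop to about 125 k Bases/sec while area and power stay roughly unchanged.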
Comparing with HMM-based basecallers
The accuracy and throughput of the proposed serial-parallel HMM basecaller were evaluated against other HMM-based basecallers, such as Meta-Align10, Nanocall8, and the scalable hardware accelerator14.
Accuracy
With an accuracy of 94.3%, as shown in Fig. 14, the proposed method outperforms Nanocall (85.6%) and Meta-Align (91.2%) and is marginally superior to the Scalable Hardware Accelerator (93.1%). The modest improvement in accuracy is attributed to the improved computational pipeline of the serial-parallel HMM structure, which effectively reduces numerical approximations. In contrast to the software-based Nanocall and Meta-Align, which rely on heuristic approximations, the proposed solution leverages improved HMM computations to provide greater performance without significantly increasing hardware complexity.
Accuracy comparison of DNA basecalling methods.
Throughput
Throughput also has a significant impact on basecaller performance. By processing 8 million bases per second, the proposed serial-parallel HMM basecaller outperforms Nanocall (5 M Bases/sec) and Meta-Align (7 M Bases/sec), as shown in Fig. 15. With a higher throughput of 9 M Bases/sec, the Scalable Hardware Accelerator exhibits marginally superior raw performance. However, this speed increase comes at more than four times the power consumption (850 mW as opposed to 200 mW for the proposed design). Therefore, the proposed method better balances speed and energy efficiency, even though the scalable accelerator has a slight throughput advantage.
Considering all of the comparisons, the proposed serial-parallel HMM basecaller achieves high energy efficiency at 200 mW, roughly four times lower than the Scalable Hardware Accelerator, while maintaining a small footprint. It attains 94.3% accuracy, slightly higher than the Scalable Hardware Accelerator’s 93.1% and above both Nanocall and Meta-Align. With a throughput of 8 M Bases/sec, it remains competitive while balancing speed and power efficiency. Unlike software-based basecallers, its hardware-optimized, reconfigurable design makes it well suited for low-power, real-time sequencing applications.
Power consumption comparison of DNA basecalling methods.
Comparing with deep learning-based basecallers
Deep learning-based basecalling methods have attracted considerable attention due to their high accuracy and ability to recognize complex sequencing patterns. However, these techniques often have substantial computational and energy costs, which restricts their use in real-time and resource-constrained environments. Although models like Bonito13 achieve about 4.2 percentage points higher accuracy than our HMM-based FPGA design, our method prioritizes efficiency, reconfigurability, and ultra-low power consumption. Our solution avoids GPU complexity and achieves 94.3% accuracy with substantially fewer parameters (~ 0.1–0.5 M vs. 2–12 M) than the approach in37, which applies structured sparsity to LSTM-heavy topologies to achieve up to 21× model size reduction with ~ 98–99% accuracy. RUBICON38, which runs mixed-precision models (~ 2.6 M parameters) on the AMD Versal AIE, provides up to 128× higher throughput than baseline Bonito while requiring 5–15 W. This is achieved by aggressive quantization and architecture search, but it requires specialized hardware and sophisticated design automation tools. Using analog compute-in-memory with 0.47–2.9 M parameters, CiMBA39 achieves outstanding energy efficiency (~ 1.17 W) and accuracy (~ 91%), but it depends on bespoke non-volatile memory technology and is not reprogrammable. In contrast, our FPGA-based HMM solution remains small, reprogrammable, and power-efficient (< 1 W), which makes it well suited to field-deployable, cost-sensitive, or embedded applications where simplicity, determinism, and adaptability are more important than maximum accuracy.
The proposed serial-parallel HMM basecaller thus balances accuracy, power efficiency, and processing complexity as an alternative to deep learning-based techniques. In the following, it is compared with deep learning-based basecallers such as DeepNano12 and Bonito13, and with a hybrid HMM-deep learning approach14, in terms of power consumption, accuracy, throughput, and scalability. This allows for a thorough evaluation of its performance, highlighting its benefits in energy efficiency and hardware adaptability while also acknowledging the trade-offs in accuracy and throughput.
Accuracy
As shown in Fig. 14, the accuracy of the proposed method is 94.3%, somewhat lower than that of the hybrid approach (97.2%), Bonito (98.5%), and DeepNano (96.8%). Deep learning models are more accurate, but their high energy and computational costs make them difficult to deploy in real-time, resource-constrained applications. The proposed method strikes a reasonable balance between accuracy and efficiency, making it suitable for low-power embedded sequencing devices.
Throughput
As illustrated in Fig. 15, the proposed method processes 8 M Bases/sec, which is 7 M Bases/sec less than DeepNano (15 M Bases/sec), 10 M Bases/sec less than Bonito (18 M Bases/sec), and 4 M Bases/sec less than the hybrid model (12 M Bases/sec). Deep learning models achieve higher throughput, but this comes at the cost of higher energy and computing demands.
Comparing with hardware-based basecallers
Table 2 highlights the diverse trade-offs between power consumption, accuracy, and throughput across different hardware-based DNA basecallers.
To more clearly highlight the energy efficiency of the proposed architecture, Table 2 also reports a throughput-per-power metric. Energy efficiency is defined as the achieved throughput normalized by power consumption (M Bases s⁻¹ W⁻¹). As shown in the table, the proposed method attains the highest energy efficiency among the compared works, resulting from its low power consumption combined with competitive throughput. This demonstrates a favorable trade-off between accuracy, throughput, and power, making the proposed design well suited for energy-constrained DNA sequencing applications.
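The throughput-per-power metric defined above is a single division once the units are aligned. The sketch below is illustrative: the function name is hypothetical, and it is evaluated only for the two designs whose throughput and power are both stated in the text (8 M Bases/sec at 200 mW for the proposed design, 9 M Bases/sec at 850 mW for the Scalable Hardware Accelerator).

```python
def throughput_per_watt(throughput_mbases_per_s, power_mw):
    """Energy efficiency in M Bases per second per watt."""
    if power_mw <= 0:
        raise ValueError("power must be positive")
    return throughput_mbases_per_s / (power_mw / 1000.0)   # convert mW to W first


if __name__ == "__main__":
    designs = {
        "Proposed (8 M Bases/sec, 200 mW)": (8, 200),
        "Scalable Hardware Accelerator (9 M Bases/sec, 850 mW)": (9, 850),
    }
    for name, (tput, power) in designs.items():
        print(f"{name}: {throughput_per_watt(tput, power):.1f} M Bases/sec per W")
```

For these two inputs the metric evaluates to roughly 40 and 10.6 M Bases per second per watt, which shows how a small throughput deficit can be outweighed by a large power advantage.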
While Hammad et al. deliver the highest throughput (9 M Bases/sec), their design consumes 4.25 times more power than the proposed approach, making it impractical for low-power applications. Rashed et al. achieve the best accuracy (98.3%), but 6 times greater power consumption and lower throughput make their design less efficient for real-time applications. Chen et al. offer a well-rounded alternative with high throughput and reasonable power efficiency, although their design still uses more energy than the proposed approach.
The proposed approach offers a favorable balance between accuracy, sequencing speed, and power efficiency. Its competitive throughput and low power consumption make it a promising option for next-generation real-time and mobile DNA sequencing applications, addressing the need for effective, high-performance basecalling solutions in resource-constrained settings.
Conclusion
A 64-state reconfigurable serial-parallel DNA basecalling accelerator designed for efficient ASIC implementation was presented in this study. By using resource reuse and a coarse-grained 16-state structure, the proposed architecture consumes 6 times less power than brute-force HMM approaches (200 mW vs. 1200 mW) while maintaining an accuracy of 94.3%. Although it is slightly less accurate than deep learning-based basecallers, its three to five times lower power consumption supports its viability for real-time and mobile applications. The accelerator also achieves a throughput of 8 M Bases/sec, so it remains competitive in speed while offering notable energy savings.
Compared to existing hardware basecallers, the proposed method provides a favorable balance between accuracy, computational performance, and power efficiency. It outperforms most competing designs in power efficiency, surpasses Nanocall and Meta-Align in accuracy, and maintains competitive throughput. Unlike deep learning-based methods that require substantial processing power, the proposed hardware-optimized approach offers practical scalability and real-time processing capability.
Future studies will look into hybrid HMM-deep learning architectures to increase accuracy while preserving the system’s low-power features. The proposed basecaller is a very effective solution for next-generation real-time DNA sequencing technologies, particularly in resource-constrained environments.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Dorey, A. & Howorka, S. Nanopore DNA sequencing technologies and their applications towards single-molecule proteomics. Nat. Chem. 16 (3), 314–334 (2024).
Lu, H., Giordano, F. & Ning, Z. Oxford Nanopore MinION sequencing and genome assembly. Genom. Proteom. Bioinform. 14 (5), 265–279 (2016).
Kumar, K. R., Cowley, M. J. & Davis, R. L. Next-generation sequencing and emerging technologies. In Seminars in Thrombosis and Hemostasis Vol. 50, 1026–1038 (Thieme Medical, 2024).
Scarano, C. et al. The third-generation sequencing challenge: Novel insights for the omic sciences. Biomolecules 14(5), 568 (2024).
Park, S. T. & Kim, J. Trends in next-generation sequencing and a new era for whole genome sequencing. Int. Neurourol. J. 20(Suppl 2), S76 (2016).
Schadt, E. E., Turner, S. & Kasarskis, A. A window into third-generation sequencing. Hum. Mol. Genet. 19(R2), R227–R240 (2010).
Tyler, A. D. et al. Evaluation of Oxford Nanopore’s MinION sequencing device for microbial whole genome sequencing applications. Sci. Rep. 8(1), 10931. https://doi.org/10.1038/s41598-018-29334-5 (2018).
Pour-Hosseini, M. R. et al. Tiny machine learning models for autonomous workload distribution across cloud-edge computing continuum. Cluster Comput. 28(6), 381. https://doi.org/10.1007/s10586-025-05289-x (2025).
Timp, W., Comer, J. & Aksimentiev, A. DNA base-calling from a nanopore using a Viterbi algorithm. Biophys. J. 102(10), L37–L39 (2012).
Malmström, J. Preprocessing of Nanopore Current Signals for DNA Base Calling (2020).
Tomii, K., Kumar, S., Zhi, D. & Brenner, S. E. Meta-align: A novel HMM-based algorithm for pairwise alignment of error-prone sequencing reads. bioRxiv (2020).
Boža, V., Brejová, B. & Vinař, T. DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads. PLoS One 12(6), e0178751 (2017).
Pagès-Gallego, M. & de Ridder, J. Comprehensive benchmark and architectural analysis of deep learning models for nanopore sequencing basecalling. Genome Biol. 24 (1), 71 (2023).
Xu, Z. et al. Fast-bonito: A faster deep learning based basecaller for nanopore sequencing. Artif. Intell. Life Sci. 1, 100011 (2021).
Hammad, K., Wu, Z., Ghafar-Zadeh, E. & Magierowski, S. A scalable hardware accelerator for mobile DNA sequencing. IEEE Trans. Very Large Scale Integr. VLSI Syst. 29 (2), 273–286 (2021).
de Gennaro, A., Sokolov, D. & Mokhov, A. Design and implementation of reconfigurable asynchronous pipelines. IEEE Trans. Very Large Scale Integr. VLSI Syst. 28(6), 1527–1539 (2020).
Wang, Y. et al. On-chip memory hierarchy in one coarse-grained reconfigurable architecture to compress memory space and to reduce reconfiguration time and data-reference time. IEEE Trans. Very Large Scale Integr. VLSI Syst. 22 (5), 983–994 (2013).
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36 (4), 338–345 (2018).
Oikonomopoulos, S., Wang, Y. C., Djambazian, H., Badescu, D. & Ragoussis, J. Benchmarking of the Oxford Nanopore MinION sequencing for quantitative and qualitative assessment of cDNA populations. Sci. Rep. 6(1), 31602 (2016).
Liu, X. et al. A lumen-tunable triangular DNA nanopore for molecular sensing and cross-membrane transport. Nat. Commun. 15(1), 7210 (2024).
Wei, J. et al. Nanopore-based sensors for DNA sequencing: A review. Nanoscale 16(40), 18732–18766 (2024).
Ahmadi, E., Sadeghi, A. & Chakraborty, S. Slip-coupled electroosmosis and electrophoresis dictate DNA translocation speed in solid-state nanopores. Langmuir 39(35), 12292–12301 (2023).
Chen, K. & Muthukumar, M. Substantial slowing of electrophoretic translocation of DNA through a nanopore using coherent multiple entropic traps. ACS Nano 17(10), 9197–9208 (2023).
Iqbal, S. M., Akin, D. & Bashir, R. Solid-state nanopore channels with DNA selectivity. Nat. Nanotechnol. 2(4), 243–248 (2007).
Meller, A., Nivon, L. & Branton, D. Voltage-driven DNA translocations through a nanopore. Phys. Rev. Lett. 86 (15), 3435 (2001).
Wang, F., Li, Q. & Fan, C. The evolution of DNA-based molecular computing. In (eds Jonoska, N. & Winfree, E.) (2023).
Olson, C. B. et al. Hardware acceleration of short read mapping. In IEEE 20th International Symposium on Field-Programmable Custom Computing Machines. https://doi.org/10.1109/FCCM.2012.36 (2012).
Sharma, P. Hardware Accelerator for SOM Based DNA Sequencing Algorithm (2018).
Abbasi, M. & Salimi Shahraki, A. Energy-efficient FPGA-based hardware-software co-design of the Skein algorithm for secure edge computing in smart grids. J. Power Technol. 105, 270–285. https://papers.itc.pw.edu.pl/index.php/JPT/article/view/1937 (2025).
Rashed, A.-D., Obaya, M., El, H. & Moustafa, D. Accelerating DNA pairwise sequence alignment using FPGA and a customized convolutional neural network. Comput. Electr. Eng. 92, 107112 (2021).
Chen, Y. L., Chang, B. Y., Yang, C. H. & Chiueh, T. D. A high-throughput FPGA accelerator for short-read mapping of the whole human genome. IEEE Trans. Parallel Distrib. Syst. 32(6), 1465–1478. https://doi.org/10.1109/TPDS.2021.3051011 (2021).
Rang, F. J., Kloosterman, W. P. & de Ridder, J. From squiggle to basepair: Computational approaches for improving nanopore sequencing read accuracy. Genome Biol. 19(1), 90 (2018).
Stoiber, M. & Brown, J. BasecRAWller: Streaming nanopore basecalling directly from raw signal. BioRxiv 133058 (2017).
Deamer, D., Akeson, M. & Branton, D. Three decades of nanopore sequencing. Nat. Biotechnol. 34 (5), 518–524 (2016).
Spealman, P., Burrell, J. & Gresham, D. Inverted duplicate DNA sequences increase translocation rates through sequencing nanopores resulting in reduced base calling accuracy. Nucleic Acids Res. 48(9), 4940–4945 (2020).
Ying, Y. L. et al. Nanopore-based technologies beyond DNA sequencing. Nat. Nanotechnol. 17 (11), 1136–1146 (2022).
Frensel, M., Al-Ars, Z. & Hofstee, H. P. Learning structured sparsity for efficient nanopore DNA basecalling using delayed masking. In Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 1–9 (2024).
Singh, G. et al. RUBICON: A framework for designing efficient deep learning-based genomic basecallers. Genome Biol. 25(1), 49 (2024).
Simon, W. A. et al. CiMBA: Accelerating genome sequencing through on-device basecalling via compute-in-memory. IEEE Trans. Parallel Distrib. Syst. https://doi.org/10.1109/tpds.2025.3550811 (2025).
Author information
Contributions
All of the authors contributed to the study conception, design, material preparation, data collection and analysis. The first draft of the manuscript was written by Atefeh Salimi and edited by Mahdi Abbasi. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Shahraki, A.S., Magierowski, S., Abbasi, M. et al. Low power reprogrammable DNA basecaller with an efficient HMM accelerator for real time nanopore sequencing. Sci Rep 16, 11425 (2026). https://doi.org/10.1038/s41598-026-41649-2
DOI: https://doi.org/10.1038/s41598-026-41649-2