Abstract
Combinatorial optimization problems (COPs) are critical to applications across logistics, VLSI design, and scientific discovery, yet remain challenging due to their NP-hard nature. Ising model-based annealers have emerged as promising candidates for solving COPs through probabilistic computing. However, CMOS-based Ising implementations still face barriers in scalability, energy efficiency, and hardware cost. This work presents a compact digital compute-in-memory (DCIM) Ising annealer that overcomes these limitations through several co-optimized algorithm-hardware innovations. A hierarchical clustering approach and a compact weight mapping scheme are introduced to reduce the required hardware cost. Furthermore, we utilize intrinsic process variations in SRAM bitcells to generate random probabilistic bits via pseudo-read operations, enabling an area- and energy-efficient realization of the annealing process without additional random number generators. The fabricated chip, implemented in 28 nm CMOS and featuring a 6 Kb DCIM SRAM array, successfully solves 96-city Traveling Salesman Problem (TSP) instances with a time-to-solution of 620 µs and an energy consumption of 961 nJ. Compared to prior hardware-based TSP solvers, our solution achieves a 15× to 572× improvement in the hardware cost ratio, validating the effectiveness of our architecture. This work demonstrates the feasibility of large-scale, real-time, and low-cost Ising annealing for combinatorial optimization on digital CMOS platforms.
Introduction
Combinatorial optimization problems (COPs) arise in many different fields, including logistics, VLSI routing, drug discovery, and financial modeling. However, COPs are often computationally expensive to solve exactly, especially as the problem size scales. Ising machine-based solvers have become popular for their ability to find approximate solutions efficiently through probabilistic computing. Previous CMOS Ising hardware implementations have demonstrated the capability to achieve near-ground-state solutions with high speed and energy efficiency1,2,3,4.
COPs can generally be categorized into constrained and unconstrained types. Constrained problems, such as the Travelling Salesman Problem (TSP) and Boolean Satisfiability (SAT), require satisfying specific rules or constraints, whereas unconstrained problems, like Max-Cut, focus solely on optimizing an objective function without strict constraints. Among these, TSP is a representative constrained COP and a widely used benchmark due to its well-defined structure, practical significance in routing and scheduling, and its suitability for Ising model mapping. The goal of TSP is to find the shortest route that visits each city exactly once, given a list of cities and pairwise distances. However, as the size of the problem grows, the number of spins and coupling data increases significantly. Under the Ising formulation5, an N-city TSP requires \({N}^{2}\) spins and \({N}^{4}\) all-to-all interactions. The search space grows exponentially, making the problem NP-hard6. The substantial memory requirement, extensive data movement, and limited parallelism collectively constrain the practicality of the Ising solver, as shown in Fig. 1. Moreover, the Hamiltonian energy has a global ground state and multiple local minimum states. To prevent the system from getting stuck in a local minimum, annealing processes such as simulated annealing have been widely used7, where the temperature (\(T\)) adjusts the probability of spin flips driven by “thermal fluctuation”8. During the annealing process, a large number of spatial random noise generators is required, such as digital linear feedback shift registers (LFSRs)9, resulting in significant hardware cost.
This work integrates a clustering method with a compact weight-mapping approach and leverages a DCIM architecture to efficiently solve TSP with orders-of-magnitude lower hardware cost.
Various Ising-based hardware designs have been proposed to accelerate solving TSP by emulating spin dynamics in physical or digital systems.
Physical implementations, such as D-Wave’s quantum annealer, which uses superconducting qubits and quantum tunneling to explore low-energy states10, and NTT’s optical coherent Ising machine, which is based on degenerate optical parametric oscillators (DOPOs) that converge through phase synchronization11, have demonstrated the capability to solve various COPs, including TSP12. However, these platforms are constrained by factors like bulky setups, high cost, low temperature requirement, and limited scalability.
Motivated by the natural convergence behavior of physical systems, several continuous-time (CT) Ising solvers, typically implemented using analog dynamics, have been proposed, including oscillator-based Ising machines13,14, latch-based architectures15,16, etc. These analog systems are often compact and energy-efficient and can converge quickly to low-energy states. However, they face key limitations when dealing with constraint-heavy problems like TSP. Specifically, (1) CT Ising machines are typically suited to unconstrained or weakly constrained problems, as their continuous spin dynamics do not naturally enforce the hard two-dimensional (row/column one-hot) constraint required by TSP, even with large penalty terms. (2) Their spin dynamics are difficult to regulate for permutation-based updates (e.g., PBM four-spin swaps), making it hard to remain within the valid permutation solution space. (3) CT Ising machines with analog implementations provide only limited flexibility in reconfiguring the coupling topology and scaling to different problem sizes and instances, and their sensitivity to noise and mismatch further hinders large-scale TSP deployment. These limitations further motivate the development of digitally structured architectures to effectively handle constraint preservation and regular updates in COPs.
On the discrete-time side, Hitachi’s STATICA digital annealer17, Toshiba’s simulated bifurcation machine (SBM)18, and Fujitsu’s MAQO many-core architecture19 have demonstrated the ability to solve general COPs, typically based on fully connected spin networks and pseudo-random number generators such as LFSRs. These systems provide high configurability and have demonstrated success on general COPs. However, their reliance on dense weight matrices and stochastic spin updates leads to substantial hardware overhead and a lack of structure-aware optimizations. When applied to TSP, these general-purpose digital designs suffer from limited scalability due to the inefficiency of compute-memory interaction.
Recently, several works have proposed CIM-based Ising solvers tailored for TSP. For instance, Hong et al.5 proposed an in-memory annealing unit using a memristor (RRAM) crossbar array, enabling constraint-preserving spin exchanges through in-memory computation. The design also utilizes additional bias cells as a random source during the annealing process, which introduces extra hardware overhead. While the design achieves high energy efficiency in computation, it still requires storing a large fully connected coupling matrix, which grows rapidly with problem size and limits its scalability20. To alleviate this problem, hierarchical clustering-based TSP mappings have been proposed20,21, in which the original problem is decomposed into smaller clustered subproblems. This decomposition significantly reduces the effective problem complexity and the required hardware resources. Note that this hierarchical clustering is an approximate, sparsified version of the original fully connected TSP and may discard some distance information, reducing the original solution space for certain instances.
This work integrates a clustering method with a compact weight-mapping approach and leverages a digital compute-in-memory (DCIM) architecture to efficiently solve TSP problems with orders-of-magnitude lower hardware cost. The main contributions of this work are summarized as follows: (1) A hierarchical clustered approach is introduced to exploit input sparsity, reducing the number of required spins. (2) A compact weight mapping pattern is used to exploit weight sparsity, further reducing the weight matrix hardware cost. (3) The SRAM-based DCIM architecture is leveraged to efficiently support Hamiltonian computation with minimal memory access. (4) The intrinsic process variations between SRAM devices are utilized to generate spatial noisy bit errors, realizing the annealing process. (5) The cluster-wise partial parallelism strategy is used to accelerate solver convergence by enabling simultaneous updates of non-interacting clusters. (6) A fabricated 28 nm 6 Kb DCIM Ising annealer is demonstrated on a TSP problem with up to 96 cities per chip, obtaining a solution in 620 μs with 961 nJ energy, and achieving a 15× to 572× improvement in hardware cost ratio.
This article is an extended version of a simulation study22, providing further discussion of the computation dataflow, the weight mapping pattern, and the details of the DCIM architecture and proposed techniques, together with silicon validation and additional measurement results from fabricated chips.
Results
Ising model and annealing
The Ising model-based annealer operates by minimizing the Hamiltonian energy (\(H\)) of the system23, which is expressed as:

$$H=-\sum _{i < j}{J}_{ij}{\sigma }_{i}{\sigma }_{j}-\sum _{i}{h}_{i}{\sigma }_{i}\qquad (1)$$
where \({\sigma }_{i}\) represents the spin state of the \(i\)-th node (\({\sigma }_{i}\in \{+1,-1\}\)), \({J}_{ij}\) represents the coupling coefficient between spins \(i\) and \(j\), and \({h}_{i}\) represents an external magnetic field applied to spin \(i\). The local energy of spin \(i\) can be expressed as:

$${H}_{\sigma }=-{\sigma }_{i}\left(\sum _{j}{J}_{ij}{\sigma }_{j}+{h}_{i}\right)\qquad (2)$$
where \({H}_{\sigma }\) represents the contribution of spin \(i\) to the total Hamiltonian. By updating each spin in the direction that reduces its local energy, the system seeks to minimize its total energy, thus approaching an optimal solution.
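In software terms, the total Hamiltonian and a spin's local energy reduce to simple dot products. A minimal NumPy sketch with toy coupling values (illustrative only, not chip parameters):

```python
import numpy as np

def hamiltonian(J, h, s):
    """Total Ising energy: H = -1/2 * s^T J s - h . s (J symmetric with zero
    diagonal; the 1/2 compensates for counting each coupled pair twice)."""
    return -0.5 * s @ J @ s - h @ s

def local_energy(J, h, s, i):
    """Contribution of spin i to H: H_sigma = -s_i * (sum_j J_ij s_j + h_i).
    Flipping spin i changes the total energy by -2 * H_sigma."""
    return -s[i] * (J[i] @ s + h[i])

# Toy 3-spin system (coupling values chosen for illustration only).
J = np.array([[0.0, 1.0, -2.0],
              [1.0, 0.0, 0.5],
              [-2.0, 0.5, 0.0]])
h = np.zeros(3)
s = np.array([1.0, -1.0, 1.0])
```

Evaluating the local energies in parallel is exactly the MAC workload that the DCIM array later accelerates.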
Generally, the system energy landscape has a global minimum and multiple local minimum states. To prevent the system from getting trapped in a local minimum, various annealing strategies have been developed to guide the system’s evolution towards lower-energy states, including quantum annealing (QA)10,24, simulated quantum annealing (SQA)25, quantum-inspired annealing (QIA)26,27 and simulated annealing (SA)5,7,8,9. Among these, simulated annealing is one of the most commonly used methods due to its simplicity and compatibility with digital hardware. In SA, the system updates spins based on local energy and a temperature-dependent probability. The probability of flipping a spin is given by the sigmoid function:

$$P=\frac{1}{1+\exp \left(\Delta E/{k}_{B}T\right)}\qquad (3)$$
where \(\Delta E\) is the change in energy if the spin is flipped, \({k}_{B}\) is the Boltzmann constant, and \(T\) is the temperature. This probabilistic update allows the system to explore different spin configurations, ultimately helping it settle into a low-energy state, which approximates an optimal solution.
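A software analogue of this acceptance rule, folding \({k}_{B}\) into the temperature as is common in SA implementations, paired with an illustrative geometric cooling schedule (the constants are assumptions, not chip settings):

```python
import math
import random

def accept_flip(delta_e, temperature, rng=random.random):
    """Sigmoid acceptance: P(flip) = 1 / (1 + exp(delta_e / T))."""
    x = delta_e / temperature
    if x > 700:          # avoid math.exp overflow for strongly uphill moves
        return False
    return rng() < 1.0 / (1.0 + math.exp(x))

def cooling(t0=10.0, alpha=0.95, steps=100):
    """Geometric cooling: T_k = t0 * alpha^k (illustrative constants)."""
    t = t0
    for _ in range(steps):
        yield t
        t *= alpha
```

At high temperature the rule accepts many uphill moves (escaping local minima); as \(T\) falls, it converges toward greedy descent.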
In this work, we adopt the TSP as a representative constrained COP to validate the Ising-based solver. TSP can be formulated into an Ising model using a binary spin matrix \({\sigma }_{{ik}}\), where \({\sigma }_{{ik}}=1\) indicates that city \(k\) is visited at order \(i\) in the tour.
To ensure a valid tour, one-hot constraints are imposed across cities and positions, and incorporated as quadratic penalty terms in the Ising Hamiltonian. The traveling cost is modeled by spins with pairwise coupling, where the coupling strength is proportional to the distance between city pairs. The resulting total energy function is:

$$H=a\sum _{i}\sum _{k\ne l}{W}_{kl}{\sigma }_{i,k}{\sigma }_{i+1,l}+b\sum _{k}{\left(\sum _{i}{\sigma }_{ik}-1\right)}^{2}+c\sum _{i}{\left(\sum _{k}{\sigma }_{ik}-1\right)}^{2}\qquad (4)$$
where \({W}_{kl}\) denotes the distance between the \(k\)-th and \(l\)-th cities. The first term, called the objective function, represents the total distances between city pairs with adjacent visit orders. The second and third terms are one-hot constraints that penalize tours that visit the same city multiple times or place multiple cities in the same position, respectively. The hyperparameters \(a\), \(b\), and \(c\) control the relative weight between constraint enforcement and objective optimization. The energy penalty of infeasible routes (the second and third terms in (4)) can be avoided using a permutational Boltzmann machine (PBM)8, which operates under a constraint-satisfying, permutation-based update scheme. To maintain the two-way one-hot constraint, PBM performs permutation-based updates where four spins are jointly flipped (e.g., \({\sigma }_{{ik}},{\sigma }_{{il}},{\sigma }_{{jk}},{\sigma }_{{jl}}\)) to swap the tour positions of two cities while preserving the two-dimensional (row/column) one-hot structure. This effectively exchanges city \(k\) and city \(l\) at orders \(i\) and \(j\). The local energies before/after spin flips are then evaluated, and the swap is accepted or rejected based on the total energy change.
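Behaviorally, the four-spin PBM update is a position swap on a permutation, accepted with the sigmoid probability of the resulting energy change. The sketch below recomputes the full tour length for clarity, whereas the hardware evaluates only the affected local energies:

```python
import math
import random

def tour_length(order, W):
    """Total closed-tour distance for a visiting order over distance matrix W."""
    n = len(order)
    return sum(W[order[i]][order[(i + 1) % n]] for i in range(n))

def pbm_step(order, W, T, rng):
    """Propose swapping the cities at two tour positions (the four-spin PBM
    flip) and accept with probability 1 / (1 + exp(delta_E / T))."""
    i, j = rng.sample(range(len(order)), 2)
    before = tour_length(order, W)
    order[i], order[j] = order[j], order[i]
    delta = tour_length(order, W) - before
    if rng.random() >= 1.0 / (1.0 + math.exp(min(delta / T, 700.0))):
        order[i], order[j] = order[j], order[i]  # reject: undo the swap
    return order
```

Because only positions are exchanged, every intermediate state remains a valid permutation, so the penalty terms of (4) are never violated.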
The spin energy computation is a multiply-and-accumulate (MAC) operation, as expressed in (2). The Ising model can also be represented as a Hopfield Neural Network (HNN), part of the body of work recognized by the 2024 Nobel Prize in Physics. The HNN is a fully connected recurrent network with binary neurons, where each neuron corresponds to a spin state and the weight matrix represents the interaction between spins. The MAC output of neurons and weights represents the spin energy, which is then used to determine whether to update the spin state. When deployed in a compute-in-memory (CIM) array, this approach allows full connectivity between all spins and leverages the inherent noise in memory devices as a source of randomness for the annealing process. This approach also enables efficient parallel computation and significant energy savings.
Hierarchical clustered approach for input sparsity
Clustering design has been applied to reduce the required number of spins and weights21. For example, in Fig. 2, the original spins represent 144 permutations of cities (A–L) and orders (1–12) for a 12-city (N = 12) TSP, resulting in a 144 × 144 weight matrix. After grouping the cities into four clusters with three cities each, only 36 permutations remain (9 permutations per cluster × 4 clusters). The excluded spins (permutations) and their corresponding weights can thus be eliminated.
The number of spins is reduced from \({N}^{2}\) to \(p\times N\), and the weight matrix size is reduced from \({N}^{2}\times {N}^{2}\) to \((p\times N)\times (p\times N)\) by grouping cities into clusters and sub-clusters across hierarchical levels.
As the problem size increases further, hierarchical clustering becomes necessary to maintain scalability. First, hierarchical clustering is applied in a bottom-up manner by recursively grouping every \(p\) cities (or sub-clusters represented by their centroids) into higher-level clusters using the k-means algorithm based on city coordinates or distance metrics. Here, \(p\) denotes the number of cities in each cluster. It should be noted that interactions between neighboring clusters are preserved, so the clustering process does not decompose the original TSP into multiple independent \(p\)-city subproblems. This process is repeated across multiple levels. Then, hierarchical annealing is performed top-down: at the top level, the TSP is solved among super-clusters; at the next level, the ordering of sub-clusters within each super-cluster is determined as smaller TSPs; and this continues recursively until the final order of all cities is resolved at the bottom level. In this way, the number of required neurons is reduced from \({N}^{2}\) to \(p\times N\) and the weight matrix dimension shrinks from \({N}^{2}\times {N}^{2}\) to \((p\times N)\times (p\times N)\), significantly improving memory and computational efficiency. In other words, the hardware complexity reduces from \(O({N}^{4})\) to \(O({N}^{2})\). The resulting clustered weight matrix is shown on the left side of Fig. 3. Each cluster includes both intra-cluster city interactions and adjacent city pairs for evaluating boundary connections.
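The bottom-up grouping can be sketched with a plain Lloyd's k-means over coordinates; the chip's mapping additionally balances clusters to exactly \(p\) members, which this minimal sketch omits (function names are illustrative):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over 2-D coordinates; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):            # keep old centroid if cluster empties
                centroids[c] = points[labels == c].mean(axis=0)
    return centroids, labels

def build_hierarchy(points, p=3):
    """Bottom-up: recursively group points (then their centroids) into
    clusters of roughly p members until at most p groups remain;
    returns the per-level label assignments."""
    levels = []
    current = points
    while len(current) > p:
        k = int(np.ceil(len(current) / p))
        centroids, labels = kmeans(current, k)
        levels.append(labels)
        current = centroids
    return levels
```

Annealing then proceeds top-down through `levels` in reverse, ordering clusters before the cities inside them.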
Each dot represents a stored city-pair distance, while blank regions correspond to sparsity. The red dots indicate the distances that are activated in the current Hamiltonian calculation according to the spin states.
Normally, spins in the Ising model are updated sequentially according to Gibbs sampling to ensure ergodicity and convergence. However, based on the principle of chromatic Gibbs sampling28, spins that are mutually independent (not directly coupled) can be updated in parallel without violating convergence guarantees. In clustered TSP, cities from non-adjacent clusters do not interact in the energy function, as they cannot appear as neighbors in the tour. Therefore, after assigning indices to clusters, all odd-indexed clusters (solid windows in Fig. 3) can be updated in parallel in one cycle, followed by even-indexed clusters (dashed windows in Fig. 3) in the next cycle. This alternating update schedule enables partial parallelism while preserving the correctness of the annealing process.
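The alternating schedule above amounts to a two-color chromatic Gibbs sweep over cluster indices, which can be sketched as a simple generator (illustrative only):

```python
def two_color_schedule(num_clusters, iterations):
    """Yield, per cycle, the clusters that may be updated in parallel:
    odd-indexed clusters on even cycles, even-indexed on odd cycles."""
    for it in range(iterations):
        parity = 1 - (it % 2)  # cycle 0 -> odd-indexed clusters first
        yield [c for c in range(num_clusters) if c % 2 == parity]
```

Since non-adjacent clusters share no coupling terms, updating all clusters of one parity together leaves every individual update's local energy unchanged, preserving Gibbs-sampling correctness.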
Compact weight mapping pattern for weight sparsity
Although the clustered approach solves the input sparsity and reduces the weight matrix size from \(O({N}^{4})\) to \(O({N}^{2})\), significant weight sparsity still remains. As shown on the left side of Fig. 3, non-zero weights are concentrated only in diagonal windows, because spins interact only with others within the same cluster or adjacent clusters. Each diagonal window thus covers only the interactions between the \(p\) cities (or sub-clusters) of one cluster and those of its adjacent clusters, where \(p\) is the number of cities or sub-clusters in each cluster.
To further reduce memory overhead, the valid weight windows are reorganized into a compact format, as illustrated on the right side of Fig. 3. In this compact format, distances from each city to others within the same or adjacent clusters are arranged column-wise, with each column corresponding to a different starting city. This compact mapping eliminates redundant storage and significantly reduces the overall weight memory size. Each window contains \((3p-1)\times p\) weights, and with \(N/p\) such windows, the total weight storage becomes \((3p-1)\times N\), which scales as \(O(N)\). Our analysis22 shows that the best tradeoff occurs when \(p\) is set to 3, balancing hardware cost and solution quality.
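A sketch of the packing step, under the simplifying assumption that cluster adjacency forms a ring (in practice adjacency follows the current tour ordering of clusters) and that cities are already stored cluster-by-cluster with \(N\) divisible by \(p\):

```python
import numpy as np

def compact_weights(W, p):
    """Pack a clustered N x N distance matrix into a (3p-1) x N array:
    each column k holds city k's distances to the other cities in its own
    and its two neighboring clusters (ring adjacency via modulo, an
    assumption of this sketch). Assumes N % p == 0."""
    n = W.shape[0]
    num_clusters = n // p
    packed = np.zeros((3 * p - 1, n))
    for c in range(num_clusters):
        lo, hi = (c - 1) % num_clusters, (c + 1) % num_clusters
        neighborhood = [city for cc in (lo, c, hi)
                        for city in range(cc * p, cc * p + p)]
        for k in range(c * p, c * p + p):
            others = [m for m in neighborhood if m != k]  # 3p - 1 entries
            packed[:, k] = W[others, k]
    return packed
```

For \(N=96\) and \(p=3\) this keeps \(8\times 96=768\) distances instead of the \(288\times 288\) clustered matrix, illustrating the \(O(N)\) scaling claimed above.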
Hardware implementation
As illustrated in Fig. 4, the proposed DCIM-based Ising annealer is designed to support the \({p}_{\max }\) = 3 clustering scheme identified as optimal in the design space exploration. The architecture features a 48 × 128b compact DCIM array, along with peripheral modules including the main controller, weight loading unit, data transfer unit, and spin register block.
The architecture consists of a compact DCIM array and peripheral modules including the main controller, weight loading unit, data transfer unit, and spin register block. During operation, coupling data and spin states are initialized through the interface, followed by iterative annealing that performs pseudo-read–based noise injection, Hamiltonian evaluation, and spin updates with half-parallel execution across odd and even clusters.
During initialization, weight matrix data and initial spin states are sent via interface I/O. Weights are streamed row by row into the DCIM array through the weight load unit, while spin values are stored in a dedicated spin register. To enable annealing, the main controller triggers a pseudo-read operation with reduced VDDM to inject noise into the weight array, converting spatial device noise into temporal randomness for simulated annealing29.
The system then enters the iterative computation, where each annealing step performs Hamiltonian evaluation followed by spin updates. As discussed in Section “Ising model and annealing”, unlike traditional Gibbs sampling, which updates spins sequentially, chromatic Gibbs sampling is applied to exploit partial parallelism: odd- and even-indexed clusters, which are mutually decoupled, are updated in alternating cycles. As shown in the right subfigure of Fig. 4, a counter toggles between odd and even cluster phases. The DCIM array supports parallel computation for 16 clusters per update phase. For the active clusters, the Hamiltonian energy of the current spin order is calculated over p cycles, followed by that of the swapped spin order in the next p cycles. The spin states are then updated by comparing the energies before and after the swap. Although this architecture supports scalable operation with multiple DCIM arrays operating in parallel, the current prototype integrates only a single array due to pad and area constraints, in order to validate the core functionality and feasibility of the proposed design.
Figure 5 illustrates the detailed circuit architecture of the proposed DCIM array, which performs in-memory Hamiltonian computation for spin updates. Each cluster is configured with a 3-city grouping, resulting in a 3 × 8 structure that represents 8-bit weight values for each pairwise connection. A total of 32 such clusters are arranged to form the 48 × 128b DCIM array, capable of supporting TSP instances with up to 96 cities.
All MSBs and LSBs are grouped together, allowing easier control of noise application to specific bits via VDDM. Each cell TG signal is shared across 8 rows within the same cluster block. The spin IN signal, generated from the spin states, is shared across 48 columns. The DCIM bitcell consists of a 6T SRAM bitcell, a 4T NOR gate, and a 2T TG.
To facilitate noise injection during annealing, all the most significant bits (MSBs) and least significant bits (LSBs) across the array are physically grouped. This organization enables easier control of noise injection into specific weight LSBs via pseudo-read operations under reduced VDDM. Within each cluster, the transmission gate (TG) control signal is shared across 8 rows. The 8-bit Spin IN signals, derived from current spin states, are broadcast across 48 columns and used to control the shared 8:2 multiplexers. For every update cycle, the MUX selects two 8-bit weights from 8 row inputs, which are accumulated by a full adder to generate partial Hamiltonian sums. These partial sums are subsequently used for energy comparison between “before” and “after” spin-swap configurations.
Each DCIM bitcell consists of a compact 12-transistor structure: a standard 6T SRAM cell for weight storage, a 4T NOR gate for computation, and a 2T transmission gate for selective data readout. This DCIM array allows in-memory dot-product operations to be performed in parallel across activated clusters, enabling area- and energy-efficient Hamiltonian computation without frequent data movement.
The 6T SRAM bitcell inherently stores data using a cross-coupled latch structure designed to reliably retain its state. However, process variations introduce mismatch between the two inverters, reducing the static noise margin (SNM). When subjected to external voltage disturbances that exceed this SNM, the cell may suffer bit-flip errors. While the error probability remains extremely low under normal supply conditions, it becomes controllable by lowering the latch supply voltage (VDDM).
To exploit this behavior for probabilistic annealing, we apply a pseudo-read operation, as illustrated in Fig. 6. In this operation, the precharge_n signal is first activated to precharge both BL and BLB to VDD. Word line (WL) is then activated only after the precharge phase is complete. The resulting voltage difference between the bit-lines and the internal nodes (Q/QB) of the latch induces a transient voltage bump on the low-voltage storage node. When VDDM is lowered, the SNM is further degraded, increasing the probability of a bit flip due to this disturbance. By tuning VDDM, the error probability can be adjusted, with its randomness influenced by device mismatch, allowing probabilistic state changes driven by physical circuit variations.
The pseudo-read operation avoids a true read-out through a sense amplifier, injecting a controllable error probability by lowering VDDM.
Notably, this operation does not need any actual readout of the stored value, as no sense amplifier is present to resolve bit-line differentials. Thus, the operation is named “pseudo-read”. After this step, the latch returns to the nominal VDD, and spin IN is subsequently activated to perform a NOR operation with the QB node. The result is conditionally passed to the MAC path through Cell TG. Since spin states evolve over time, different rows become activated in different cycles, and corresponding columns are dynamically selected into the computation path. In this way, spatial variation in the SRAM cells is converted into temporal noise, which supports efficient convergence in the annealing process.
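A behavioral model of this noise-injection mechanism: each of the lowest bits of every stored 8-bit weight flips independently with some probability. The per-bit flip probability here is a stand-in parameter; on chip it is set by VDDM and device mismatch, as characterized in Fig. 7:

```python
import numpy as np

def pseudo_read_noise(weights, flip_prob, num_lsbs=1, seed=None):
    """Behavioral pseudo-read model: toggle each of the lowest `num_lsbs`
    bits of every stored 8-bit weight independently with probability
    `flip_prob` (a stand-in for the VDDM-controlled error rate)."""
    rng = np.random.default_rng(seed)
    noisy = weights.copy()
    for bit in range(num_lsbs):
        flips = rng.random(weights.shape) < flip_prob
        noisy ^= flips.astype(noisy.dtype) << bit  # XOR toggles the selected bit
    return noisy
```

Lowering the modeled VDDM (raising `flip_prob`) or enabling more LSBs perturbs the stored Hamiltonian more strongly, emulating a higher annealing temperature, as the measurements in Fig. 10 confirm.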
Because the SRAM bitcell’s SNM is highly sensitive to process-voltage-temperature (PVT) variations, Fig. 7 shows 10k-sample Monte Carlo simulation results under different PVT corners. The results show that all PVT conditions exhibit similar increasing error-rate trends as VDDM decreases. Additionally, slight differences are observed between stored “0” (a) and “1” (b) cells, mainly caused by the asymmetric loading at the Q and QB nodes. This asymmetry can be further mitigated by optimizing layout matching or adjusting drive balance in future designs.
SRAM error rates for stored “0” (a) and “1” (b) after pseudo-read operation across 0.3–0.9 V VDDM under various PVT conditions, based on post-layout Monte Carlo simulations.
Measurement results
This design is fabricated in a 28 nm CMOS technology; the die photograph and the FPGA-based measurement setup are shown in Fig. 8. The 6 Kb DCIM array occupies an area of 155 × 170 µm². During testing, TSP problem instances are generated on a PC and transmitted to the FPGA via UART. The FPGA manages data formatting and timing control, sending spin initialization data and weight matrices to the DCIM chip through the test board. All the Ising computations are done on the test chip. After on-chip annealing, the final spin states are read back to the FPGA and sent to the PC for result decoding and evaluation. A breakdown of the DCIM solver’s area and energy is shown in Fig. 9, with area grouped by hardware modules and energy categorized by operation types during the annealing process, highlighting how different operations contribute to overall energy consumption.
The top panel shows the FPGA-based measurement setup, the lower-right panel shows the die photo, and the lower-left panel shows the customized DCIM bitcell with a size of 0.7 × 2.43 μm².
The left panel shows the area breakdown of the proposed DCIM Ising annealer across major functional blocks, while the right panel shows the corresponding energy breakdown during annealing operation.
The acceptance probability in hardware annealing is generated by pseudo-read operations of the SRAM bitcells. Figure 10 shows the measured probability as a function of the Hamiltonian ΔE under different VDDM and LSB configurations. As VDDM decreases or as more LSBs are injected with noise, the curves become smoother, effectively emulating higher annealing temperatures and increasing the acceptance probability of worse solutions. This behavior follows a sigmoid-like trend, validating the controllability of probabilistic behavior via supply voltage and bit-level error injection in the proposed DCIM-based annealer.
Measured acceptance probability of hardware annealing as a function of Hamiltonian ΔE for different VDDM and LSB configurations, illustrating a sigmoid-like trend.
When applied to the gr96.tsp dataset30, the measured distance evolution curve during the annealing process confirms that the test chip is able to escape local minima and converge toward lower-energy states, with a 12% average solution-quality overhead. Figure 11 shows the measured annealing distance evolution for solving the gr96.tsp dataset using hierarchical clustering, where three levels progressively converge to the final solution: 11 super-clusters, 32 sub-clusters, and finally 96 cities. At each clustering level, the on-chip annealer performs a fixed number of 800 spin-update iterations; every 200 iterations, the coupling data in the DCIM array are reloaded and a pseudo-read is applied, with different VDDM and LSB configurations to gradually reduce the annealing probability. The distributions of solution distances are obtained from 1000 runs of the algorithm (with software-based noise) with different random seeds and from repeated measurements on multiple test chips, as illustrated in Fig. 12. The best result obtained from the algorithm is 61,077, with an average solution of 62,704 over 1000 runs. In comparison, the measured optimal distance is 61,917, and the average solution is 63,030. This represents only a 0.52% degradation in solution quality relative to the software baseline, while reducing the required weight memory capacity by 1.1 × 10⁵× (from 648 Mb to 6 Kb) for the 96-city problem. To further showcase usability, a live demonstration system was developed to retrieve the chip’s annealing results in real time and display the computed tour on a geographic map based on user-defined cities.
Measured annealing process of gr96 TSP using hierarchical clustering, with distance evolution at each level.
The distribution of the 96-city TSP solution of multiple trials on multiple chips, along with a comparison to algorithm results based on software noise.
Scalability via multi-macro parallelism
To support larger-scale TSP, the proposed architecture enables multi-macro scaling by deploying multiple independent DCIM macros in parallel. Each DCIM macro is capable of solving a 96-city clustered TSP instance using in-memory annealing. Under the proposed clustered TSP mapping, each DCIM macro stores all intra-cluster and boundary distances for its assigned cluster, so there is no inter-macro data overlap or runtime communication required during annealing for the considered problems. Thus, larger problems can be partitioned into independent subproblems, each mapped to a separate macro, enabling straightforward horizontal scaling. During execution, weights and spin initialization data for all macros are sequentially loaded through a shared I/O interface. Once initialization is complete, all macros perform annealing in parallel. After the annealing process, final spin states are read out and combined to reconstruct the complete TSP solution.
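Under this mapping, scaling out reduces to a plain partition of the city set across macros; a sketch (function name illustrative):

```python
def partition_to_macros(num_cities, cities_per_macro=96):
    """Split an N-city instance into independent sub-instances, one per
    DCIM macro; after shared-I/O loading, all macros anneal in parallel."""
    return [list(range(start, min(start + cities_per_macro, num_cities)))
            for start in range(0, num_cities, cities_per_macro)]
```

For the 3038-city case this yields 32 sub-instances, matching the 32-macro configuration evaluated in Fig. 13.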
To further enhance energy efficiency, each macro is equipped with an independent power/clock gate. This allows dynamic activation of only the necessary number of macros for a given problem size, while unused macros remain power-gated to minimize leakage. Such flexibility supports a wide range of problem scales with minimal energy and area overhead.
As shown in Fig. 13, the system maintains relatively stable optimal ratios when scaling up to 3038-city TSP using 32 macros. The overall hardware area scales almost linearly with macro count. The total area of 32 macros is only 32.04× that of a single macro, with a negligible 0.04× overhead (~1550 µm²) primarily attributed to additional macro-selection logic. This area evaluation is based on post-layout measurements of the 1-macro test chip and RTL synthesis of multi-macro configurations, confirming the scalability and hardware efficiency of the proposed design for large-scale Ising problem solving.
Optimal ratio (left axis) and total area (right axis) across different TSP problem sizes using 1–32 DCIM solvers. The average and best optimal ratios are obtained from 1000 simulation runs with different random seeds, and error bars represent the standard deviation (±1σ).
Discussion
A comparison with prior TSP-specific Ising designs5,9,21,23,31,32,33,34 is shown in Table 1. This work employs a DCIM design that utilizes process variations of memory devices as a random noise source to realize the Ising annealing process, resulting in high-quality solution ratios. By using hierarchical clustering and a compact weight-mapping approach, the required weight memory capacity is significantly reduced, greatly reducing TSP solution latency and improving the hardware cost ratio (defined as weight memory capacity/problem size/precision) by 15–572× compared to state-of-the-art implementations5,9,23,32.
This work presents a compact digital compute-in-memory (DCIM) Ising annealer that leverages hierarchical clustering and compact weight mapping to address the scalability challenges of solving large-scale TSP instances. By exploiting intrinsic process variations in SRAM devices, the annealing process is efficiently implemented through probabilistic bit flips, eliminating the need for dedicated noise generators. The 28 nm fabricated chip, featuring a 6 Kb DCIM array, successfully solves 96-city TSP problems with 961 nJ of energy and 620 µs of annealing time. The proposed approach reduces the required weight memory capacity by 1.1 × 10⁵× compared to conventional implementations. Compared to prior TSP solvers, it also achieves up to 572× improvement in hardware cost ratio. Scalability is further evaluated through a 3038-city TSP case using 32 macros, demonstrating the architecture’s potential for large-scale deployment with minimal area overhead.
Methods
A. Dataset preparation and clustering configuration
Multiple TSP instances were selected from TSPLIB. All problems were mapped using hierarchical clustering with p = 3. At each hierarchical level, the annealer performed 800 spin-update iterations, and the weight matrix was reloaded every 200 iterations to update boundary distances. During the annealing process, VDDM and the number of activated LSBs for pseudo-read noise were gradually reduced, with VDDM lowered in steps of approximately 200 mV.
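A minimal sketch of this schedule, assuming an illustrative 0.9 V starting VDDM (the actual voltage range is not restated here): 800 spin-update iterations per hierarchical level, a weight reload every 200 iterations, and VDDM stepped down by ~200 mV at each reload to progressively suppress probabilistic flips.

```python
def annealing_schedule(n_iters=800, reload_every=200,
                       vddm_start=0.9, vddm_step=0.2):
    """Emit the per-level event sequence: spin updates interleaved with
    periodic weight reloads, each reload lowering VDDM by ~200 mV.
    Voltage values are placeholders for illustration."""
    events = []
    vddm = vddm_start
    for it in range(n_iters):
        if it > 0 and it % reload_every == 0:
            vddm -= vddm_step  # lower VDDM -> fewer pseudo-read bit flips
            events.append(("reload_weights", it, round(vddm, 2)))
        events.append(("spin_update", it, round(vddm, 2)))
    return events
```

The shrinking noise amplitude plays the role of the temperature decay in conventional simulated annealing: early high-VDDM iterations explore broadly, while later low-noise iterations settle into a low-energy spin configuration.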
B. Chip measurement setup
The fabricated 28 nm test chip was evaluated using an FPGA-based platform, as shown in Fig. 8. TSP instances were generated on a PC and transmitted to the FPGA. The FPGA formatted the spin initialization data and weight matrices, controlled the timing of pseudo-read operations, and sent all data to the DCIM chip. After executing all Hamiltonian computations and spin updates on chip, the final spin states were returned to the PC. Power was measured from the chip supply rails using an external meter. Time-to-solution was obtained by counting FPGA clock cycles between the start and end of annealing.
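The time-to-solution bookkeeping reduces to cycles / f_clk. The sketch below uses an assumed 100 MHz FPGA clock and an illustrative cycle count; neither number is a measurement from the paper.

```python
def time_to_solution_us(cycle_count, f_clk_hz):
    """Convert the FPGA cycle count between annealing start and end
    into microseconds of time-to-solution."""
    return cycle_count * 1e6 / f_clk_hz

# With an assumed 100 MHz clock, 62,000 counted cycles would correspond
# to a 620 µs time-to-solution.
tts = time_to_solution_us(62_000, 100e6)
```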
C. Multi-macro scalability evaluation
For scalability evaluation, the clustered TSP formulation was partitioned across multiple independent DCIM macros. Weight matrices and spin states for all macros were loaded sequentially. After initialization, all macros annealed in parallel without runtime communication. Final spin states from each macro were read back and combined to form the complete tour. Multi-macro area estimates were obtained using post-layout data of one macro and RTL simulations of parallel configurations.
Data availability
The datasets generated and/or analyzed during the current study are not publicly available due to sponsor-related restrictions and hardware access limitations, but are available from the corresponding author on reasonable request.
References
Shim, C., Bae, J. & Kim, B. 30.3 VIP-Sat: a boolean satisfiability solver featuring 5×12 variable in-memory processing elements with 98% solvability for 50-variables 218-clauses 3-SAT problems. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, 486–488 (IEEE, 2024).
Bae, J., Shim, C. & Kim, B. 15.6 e-Chimera: a scalable SRAM-based Ising macro with enhanced-chimera topology for solving combinatorial optimization problems within memory. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, 286–288 (IEEE, 2024).
Lo, H., Moy, W., Yu, H., Sapatnekar, S. & Kim, C. H. An Ising solver chip based on coupled ring oscillators with a 48-node all-to-all connected array architecture. Nat. Electron. 6, 771–778 (2023).
Xie, S. et al. Ising-CIM: a reconfigurable and scalable compute within memory analog Ising accelerator for solving combinatorial optimization problems. IEEE J. Solid State Circuits 57, 3453–3465 (2022).
Hong, M.-C. et al. In-memory annealing unit (IMAU): energy-efficient (2000 TOPS/W) combinatorial optimizer for solving travelling salesman problem. In Proc. 2021 International Electron Devices Meeting (IEDM), 21.3.1–21.3.4 (IEEE, 2021).
Karlin, A. R., Klein, N. & Gharan, S. O. A (slightly) improved approximation algorithm for metric TSP. In Proc. 53rd Annual ACM SIGACT Symposium Theory on Computing, 32–45 (ACM, 2021).
Yan, X. et al. Reconfigurable stochastic neurons based on tin oxide/MoS2 hetero-memristors for simulated annealing and the Boltzmann machine. Nat. Commun. 12, 5710 (2021).
Bagherbeik, M. P. et al. A permutational Boltzmann machine with parallel tempering for solving combinatorial optimization problems. In Proc. 12th International Conference on Parallel Problem Solving from Nature, 317–331 (Springer Nature, 2020).
Chu, Y.-C., Lin, Y.-C., Lo, Y.-C. & Yang, C.-H. 30.4 A fully integrated annealing processor for large-scale autonomous navigation optimization. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, 488–490 (IEEE, 2024).
Johnson, M. W. et al. Quantum annealing with manufactured spins. Nature 473, 194–198 (2011).
Inagaki, T. et al. A coherent Ising machine for 2000-node optimization problems. Science 354, 603–606 (2016).
Feld, S. et al. A hybrid solution method for the capacitated vehicle routing problem using a quantum annealer. Front. ICT 6, 13 (2019).
Ahmed, I., Chiu, P.-W., Moy, W. & Kim, C. H. A probabilistic compute fabric based on coupled ring oscillators for solving combinatorial optimization problems. IEEE J. Solid State Circuits 56, 2870–2880 (2021).
Moy, W. et al. A 1968-node coupled ring oscillator circuit for combinatorial optimization problem solving. Nat. Electron. 5, 310–317 (2022).
Bae, J., Oh, W., Koo, J. & Kim, B. CTLE-Ising: a 1440-spin continuous-time latch-based Ising machine with one-shot fully-parallel spin updates featuring equalization of spin states. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, 190–192 (IEEE, 2023).
Bae, J., Koo, J., Shim, C. & Kim, B. 15.5 LISA: a 576×4 all-in-one replica-spins continuous-time latch-based Ising computer using massively-parallel random-number generations and replica equalizations. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, 284–286 (IEEE, 2024).
Yamamoto, K. et al. STATICA: a 512-Spin 0.25M-weight annealing processor with an all-spin-updates-at-once architecture for combinatorial optimization with complete spin–spin interactions. IEEE J. Solid State Circuits 56, 165–178 (2021).
Tatsumura, K., Yamasaki, M. & Goto, H. Scaling out Ising machines using a multi-chip architecture for simulated bifurcation. Nat. Electron. 4, 208–217 (2021).
Bagherbeik, M. et al. MAQO: a scalable many-core annealer for quadratic optimization. In Proc. IEEE Symposium on VLSI Technology & Circuits 76–77 (IEEE, 2022).
Huang, Z., Zhang, Y., Wang, X., Jiang, D. & Yao, E. DCAP: a scalable decoupled-clustering annealing processor for large-scale traveling salesman problems. IEEE Trans. Circuits Syst. I Reg. Papers 71, 6349–6362 (2024).
Dan, A., Shimizu, R., Nishikawa, T., Bian, S. & Sato, T. Clustering approach for solving traveling salesman problems via Ising model based solver. In Proc. 57th ACM/IEEE Design Automation Conference (IEEE, 2020).
Lu, A. et al. Digital CIM with noisy SRAM bit: a compact clustered annealer for large-scale combinatorial optimization. In Proc. 61st ACM/IEEE Design Automation Conference (DAC) 78, 1–6 (2024).
Lu, A. et al. Scalable in-memory clustered annealer with temporal noise of FinFET for the travelling salesman problem. In Proc. 2022 International Electron Devices Meeting (IEDM) 22.5.1–22.5.4 (IEEE, 2022).
Boixo, S. et al. Evidence for quantum annealing with more than one hundred qubits. Nat. Phys. 10, 218–224 (2014).
Grimaldi, A. et al. Experimental evaluation of simulated quantum annealing with MTJ-augmented p-bits. In Proc. 2022 International Electron Devices Meeting (IEDM), 22.4.1–22.4.4 (IEEE, 2022).
Jiang, M., Shan, K., He, C. & Li, C. Efficient combinatorial optimization by quantum-inspired parallel annealing in analogue memristor crossbar. Nat. Commun. 14, 5927 (2023).
Shan, K. et al. One-step combinatorial optimization solver with fully integrated analog memristors and annealing module. In Proc. 2024 International Electron Devices Meeting (IEDM) 1–4 (IEEE, 2024).
Gonzalez, J., Low, Y. C., Gretton, A. & Guestrin, C. Parallel Gibbs sampling: from colored fields to thin junction trees. In Proc. Fourteenth International Conference on Artificial Intelligence and Statistics Vol. 15, 324–332 (PMLR, 2011).
Yamaoka, M. et al. A 20k-spin Ising chip to solve combinatorial optimization problems with CMOS annealing. IEEE J. Solid State Circuits 51, 303–309 (2016).
Reinelt, G. TSPLIB—A traveling salesman problem library. ORSA J. Comput. 3, 376–384 (1991).
Tao, Q., Zhang, T. & Han, J. An approximate parallel annealing Ising machine for solving traveling salesman problems. IEEE Embedded Syst. Lett. 15, 226–229 (2023).
Iimura, R., Kitamura, S. & Kawahara, T. Annealing processing architecture of 28-nm CMOS chip for Ising model with 512 fully connected spins. IEEE Trans. Circuits Syst. I Reg. Papers 68, 5061–5071 (2021).
Sanyal, S. & Roy, K. Neuro-Ising: accelerating large-scale traveling salesman problems via graph neural network guided localized Ising solvers. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 41, 5408–5420 (2022).
Yoo, S. et al. TAXI: traveling salesman problem accelerator with X-bar-based Ising macros powered by SOT-MRAMs and hierarchical clustering. In Proc. 62nd Annual ACM/IEEE Design Automation Conference Vol. 78, 1–7 (IEEE, 2025).
Acknowledgements
This work is supported by Intel Emerging Technology Strategic Research Sector (SRS) funding and mentored by Exploratory Integrated Circuits, Technology Research, Intel Corporation. The chip fabrication is supplemented by NSF-2218604.
Author information
Authors and Affiliations
Contributions
Y.K. designed the overall research project, conducted chip implementation, including tapeout verification and testing, and wrote the manuscript. A.L. contributed the initial design concept and preliminary validation, including both algorithm and hardware. H.L. optimized the algorithm and developed the testing environment. V.G. and R.W. assisted with setting up the testing environment. H.Li and I.Y. provided sponsorship from Intel, with regular supervision and guidance. S.Y. supervised the overall project. All authors reviewed and approved the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kong, Y., Lu, A., Liu, H. et al. A compact digital compute-in-memory Ising annealer with probabilistic SRAM bit in 28 nm for travelling salesman problem. npj Unconv. Comput. 3, 15 (2026). https://doi.org/10.1038/s44335-026-00060-w