Fault tolerant and quality of service aware routing algorithm based on priority technique for scalable network on chip architectures

Yu, Xiaomo; Tang, Ling; Mi, Jie; Liu, Jiajia; Long, Long

doi:10.1038/s41598-025-20381-3

Article
Open access
Published: 21 October 2025

Xiaomo Yu ORCID: orcid.org/0000-0002-7056-2362^1,3,
Ling Tang²,
Jie Mi¹,
Jiajia Liu¹ &
…
Long Long⁴

Scientific Reports volume 15, Article number: 36578 (2025)

Abstract

Network on Chip (NoC) architectures are essential subsystems for on-chip communication. They use routers and simplified protocols modeled after public data networks to transport packets from their source to their destination under potentially complex routing algorithms. Reliable communication can be severely hampered by component failures, such as malfunctioning routers or links, which can interrupt packet transfer. Traditional fault-tolerant routing algorithms use narrow criteria to find reliable routes, which may harm performance. To improve routing reliability and Quality of Service (QoS) in scalable NoC architectures, this paper proposes a novel adaptive fault-tolerant routing algorithm that incorporates the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), a multi-criteria decision-making technique. The proposed approach dynamically assesses and ranks alternate routes using path length and density information from nearby nodes, and chooses the best ones even when there are failures. On 8 × 8 meshes with 10% link failures, the approach reduces average delay by ~8–12% and increases throughput by ~2–5% compared to EDAR; on application-driven traces, it reduces delay by ~5–15% at nearly equal throughput. Under transient, thermal, and voltage disturbances, it reduces energy per flit by around 15–20% compared to EDAR, improves throughput by about 3–4%, and lowers delay by about 8–10%. The two-stage decision core maintains the improvements on 16 × 16 meshes and reroutes locally in about 3–5 cycles without adding a critical-path cost. Additionally, the approach introduces low hardware overhead, which supports scalability to large-scale NoC implementations. The proposed algorithm is a viable answer for next-generation NoC designs, meeting the requirements of high-performance, dependable, and scalable on-chip communication systems thanks to its combination of fault tolerance, QoS awareness, and resource efficiency.

Introduction

Chip networks facilitate communication among several intellectual cores or processing components¹. The primary objectives of the NoC design approach are to enhance efficiency², minimize latency³, and decrease power consumption⁴. The most prevalent communication architecture on a semiconductor is the shared bus system, which offers advantages such as simplicity and minimal implementation overhead⁵. Nonetheless, when the length of the shared bus line expands and the number of components integrated on the chip rises⁶, the parasitic signals along this shared pathway will escalate significantly⁷. Consequently, the augmentation of propagation delay resulting from this phenomenon will restrict the number of processing units that can be integrated into this communication system, thereby diminishing scalability⁸. The limited scalability, significant area overhead for point-to-point communications, and substantial communication delay associated with a shared bus are critical shortcomings of this communication method, prompting designers to adopt the NoC communication approach to mitigate overhead and enhance efficiency^9,10. The implementation of a novel communication architecture known as NoC facilitates inter-unit communication via packet routing through integrated routers and switches¹¹. Scalability, reduced cable length, and an increased number of nodes at the chip level will decrease power consumption and enhance network bandwidth^12,13,14.

One of the critical challenges in NoC design is ensuring fault tolerance, as failures in routers and links can significantly impact data transmission¹⁵. Traditional fault-tolerant algorithms often rely on predefined criteria to identify alternative paths, limiting their adaptability in dynamic network conditions¹⁶. In contrast, adaptive routing strategies offer greater flexibility by dynamically selecting paths based on real-time network conditions¹⁷. However, balancing reliability, performance, and hardware overhead remains a significant concern in designing effective routing algorithms.

This paper proposes a QoS-aware fault-tolerant routing algorithm that utilizes TOPSIS for multi-criteria decision-making. The proposed method ranks potential routing paths by considering factors such as path length and link density, so that a suitable route is selected in the presence of faults. By incorporating a stress value parameter that represents link density within the router, this approach enables efficient congestion management and improves network performance. The proposed algorithm enhances network reliability while maintaining minimal hardware overhead, making it suitable for large-scale NoC implementations.

The rapid advancement of semiconductor technologies has led to an exponential increase in the number of cores integrated on a single chip. This evolution has necessitated the development of high-performance and scalable communication architectures¹⁸, as traditional shared bus structures suffer from limitations such as excessive power consumption, high propagation delays, and scalability issues¹⁹. NoCs address these challenges by enabling efficient packet-based communication; however, their reliability remains a critical concern, particularly in large-scale implementations where component failures can significantly disrupt network performance.

Fault-tolerant routing mechanisms play a crucial role in ensuring reliable data transmission within NoCs. A robust fault-tolerant routing algorithm must be able to detect faults dynamically and reroute packets through alternative paths without causing congestion or significant performance degradation^20,21. This study introduces an innovative routing approach that enhances NoC resilience by utilizing the TOPSIS decision-making method to identify the most suitable routing paths based on multiple criteria. The design of fault-tolerant NoC routing algorithms presents several challenges, including dynamic fault detection, load balancing, scalability, latency, throughput, and deadlock prevention. Identifying faulty components in real-time without excessive hardware overhead is a complex task, and ensuring an even distribution of traffic across the network is essential for preventing congestion and performance degradation. As the number of cores increases, routing algorithms must maintain efficiency without a significant increase in computational complexity. Moreover, ensuring low latency and high throughput while dynamically adapting to faults and traffic conditions is critical for maintaining network performance. Additionally, avoiding network deadlocks and preventing starvation, where some packets are indefinitely delayed, are key concerns in NoC design.

Our proposed routing strategy addresses these challenges by introducing a novel adaptive routing methodology that leverages the TOPSIS decision-making framework. The approach ranks routing paths based on multiple criteria such as path length, congestion levels, and link density to select the most reliable and efficient route. A key innovation of this method is the introduction of a stress value parameter, which quantifies link congestion and aids in optimizing routing decisions. Unlike many existing fault-tolerant methods, our approach maintains a low hardware footprint, making it suitable for large-scale NoC implementations. The stress value is updated dynamically based on real-time network conditions using an event-driven method, ensuring rapid adaptation to traffic changes and faults. This capability enhances network reliability while improving throughput and reducing latency. Furthermore, the proposed method ensures efficient congestion management and improves load balancing, leading to a more stable and high-performing NoC architecture. The main contributions of the research are as follows.

Introduction of a QoS-aware adaptive routing mechanism that leverages TOPSIS for dynamic path selection, ensuring improved fault tolerance and efficient load balancing in NoCs.
Implementation of a novel stress value parameter that quantifies link density, aiding in congestion management and enhancing routing decisions for improved network performance.

This paper is organized as follows. The second section reviews routing methods in NoC systems, with emphasis on fault tolerance, and surveys the relevant literature. The third section describes the proposed strategy and the algorithmic details of the new method, whose objective is a fault-tolerant routing solution for NoC systems that improves quality-of-service metrics, enhances fault tolerance, and refines the evaluation criteria. The fourth section presents the simulation of the proposed method and evaluates its results. The fifth section concludes the paper and gives recommendations for future work.

Previous works

Network-on-chip (NoC) has emerged as an economical communication fabric for tiled multi-core chip architectures. Inter-core communication occurs via packet exchange. As the computing demands of applications escalate, the frequency of packet transmission between cores rises correspondingly. Inadequate routing of these packets results in significant congestion, which diminishes system performance²². This motivates congestion-aware routing in NoC. In practice, applications running on a NoC produce varied traffic, which complicates routing. These issues have prompted an increasing number of researchers to turn to machine learning methods. Nonetheless, storage overhead and packet latency remain challenges for these techniques. An adaptive routing algorithm, DeepNR, based on deep reinforcement learning is introduced in²³. It uses network information to represent the state, routing directions as the actions, and queuing delay as the reward function. Experiments with synthetic and contemporaneous traffic were performed on the Gem5 simulator to illustrate the efficacy and efficiency of DeepNR.

In the field of network resilience, several works explore event-driven control and safety/security approaches for distributed systems: Ref²⁴ addresses stable event-driven lossy interception with distributed delays; Ref²⁵ presents a fault-tolerant optimal scheme based on a zero-sum game and adaptive dynamic programming; Ref²⁶ investigates disturbance-resistant consensus with event-driven excitation and constraints; Ref²⁷ presents event-driven adaptive secure control for multi-agent systems (MAS) with operator error and false data injection (FDI) attacks; Ref²⁸ analyzes observer-based secure control under denial-of-service (DoS) attacks in networked switching systems; and Ref²⁹ advances fuzzy T–S based event-driven secure control against DoS attacks.

A dynamic detection technique for wireless interface faults in WiNoC is proposed in³⁰, which categorizes wireless interface error situations in WRs, executes wireless interface error detection during WiNoC operation, and revises the error scenarios. Additionally, an optimal path technique utilizing an error-free WR table is introduced to enhance network performance.

Multiple routes exist in these networks between any two nodes, so a function that identifies the best route to the destination is needed. The work in³¹ employs a hybrid approach termed Scored Regional Congestion Aware and DICA (ScRD) to optimize output channel selection and enhance NoC performance. In ScRD, an analyzer inspects the traffic packets to determine, from the number of hops, whether the NoC communication is local or non-local. If the traffic is local, a scoring process identifies the best output channel. Otherwise, based on the system status and a specified parameter, the output channel is chosen by the DICA or RCA selection function. The NIRGAM simulator was used to evaluate the approach across various traffic patterns and selection criteria.

Reference³² introduces a fault-aware routing methodology tailored for mesh-based NoC architectures. The suggested method seeks to enhance the fault tolerance of NoC by strategically rerouting traffic to circumvent defective components while reducing performance deterioration. This method utilizes fault diagnosis mechanisms (BIST) and VC-based routing algorithms to effectively adjust to fluctuating network circumstances, ensuring resilient communications despite the presence of faults.

In³³, a reinforcement learning-based fault-tolerant routing (RL-FTR) method is introduced to address routing issues arising from link and router failures in mesh-based NoC architectures. The efficacy of RL-FTR is examined with a SystemC-based cycle-accurate NoC simulator. Simulations are conducted with an increasing number of link and router faults across various mesh sizes. Following the simulations, the real-time efficacy of RL-FTR is evaluated on an FPGA implementation.

Proposed system

The key aspect in developing a fault-tolerant routing algorithm is the set of quality-of-service parameters taken into account when searching for the shortest paths. Typically, fault-tolerant algorithms employ restricted criteria to identify a dependable path. This work proposes an adaptive routing system that identifies the most reliable path by assessing the status of surrounding nodes and combining it with the path length. The proposed method employs the TOPSIS multi-criteria decision-making approach to rank paths according to quality-of-service metrics. The compromise-ranking indices used in the steps below follow the VIKOR formulation, whose name derives from a Serbian phrase meaning multi-criteria optimization and compromise solution, initially developed in³¹. When a failure occurs on a path, the algorithm identifies a substitute with comparable QoS attributes to transmit the packet, preserving efficiency during the failure and averting deadlock within the network.

Defining the issue

Multi-criteria decision making (MCDM) or multi-criteria decision analysis (MCDA) is a branch of operations research that systematically assesses various conflicting criteria in the decision-making process. Effectively addressing intricate issues by explicitly evaluating several factors results in superior and more informed decision-making. The multi-criteria decision problem is articulated as follows: Identifying the optimal solution from a collection of viable alternatives assessed against a set of criterion functions. This paper introduces the comprehension of quality-of-service characteristics, the TOPSIS multi-criteria decision-making methodology, and its application for routing in NoC systems. Figure 1 illustrates the flow chart of the suggested method utilizing the TOPSIS algorithm.

Step 1 Decision-matrix construction and raw feature extraction. Let $A=\{a_{1},\dots,a_{m}\}$ be the set of admissible next-hop routes from the current router $v_{c}=(x_{c},y_{c})$ toward the destination $v_{d}=(x_{d},y_{d})$. In a 2-D mesh, $A$ contains up to four output ports $\{N,E,S,W\}$ that keep the packet's progress admissible (deadlock-free). For each alternative $a_{i}$ we compute three QoS criteria (two costs, one benefit) and assemble the $m\times n$ decision matrix $X=[X_{ij}]$ with $n=3$:

$C_{1}$: remaining hop distance (cost). After selecting $a_{i}$, let the next node be $n_{i}=(x_{i}',y_{i}')$. The residual Manhattan distance is $H_{i}=|x_{d}-x_{i}'|+|y_{d}-y_{i}'|$. Set $X_{i1}=H_{i}$.
$C_{2}$: congestion/stress (cost). Each output port exposes a normalized stress $S_{i}\in[0,1]$ derived from neighbor input-buffer occupancy (an exponentially weighted moving average, EWMA) or its 3-level quantization (Low/Moderate/Severe). Set $X_{i2}=S_{i}$.
$C_{3}$: fault/health (benefit). Let $h_{i}\in[0,1]$ be the instantaneous health score of the link/next router on $a_{i}$ (1 = healthy, 0 = faulty), obtained from online detectors (parity/CRC, link-error counters) and periodic BIST. Set $X_{i3}=h_{i}$.

Thus,

$$X=\left[\begin{array}{ccc}X_{11}&\cdots&X_{1n}\\ \vdots&\ddots&\vdots\\ X_{m1}&\cdots&X_{mn}\end{array}\right]=\left[\begin{array}{ccc}H_{1}&S_{1}&h_{1}\\ \vdots&\vdots&\vdots\\ H_{m}&S_{m}&h_{m}\end{array}\right]$$

(1)

Costs ($H_{i}, S_{i}$) are minimized and the benefit $h_{i}$ is maximized. The weight vector $w=[w_{path},w_{cong},w_{fault}]$ (normalized so that $\sum_{j}w_{j}=1$) is applied after normalization in Step 2. This step defines the alternatives, the exact raw measurements, and the sign (cost/benefit) of each criterion; normalization and ideal points follow in Steps 2–3.
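As an illustration of Step 1, the following minimal Python sketch assembles the rows $[H_{i}, S_{i}, h_{i}]$ for one routing decision. The function name `build_decision_matrix`, the port orientation (N = +y), and the reduction of the admissibility check to "the neighbor lies on the mesh" are assumptions made for this example; the deadlock-freedom filter of the actual router is not reproduced here.

```python
from typing import Dict, List, Tuple

# Assumed port orientation for a 2-D mesh: N = +y, E = +x, S = -y, W = -x.
PORT_OFFSETS: Dict[str, Tuple[int, int]] = {
    "N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0),
}

def build_decision_matrix(
    current: Tuple[int, int],
    dest: Tuple[int, int],
    stress: Dict[str, float],
    health: Dict[str, float],
    mesh_size: int,
) -> Tuple[List[str], List[List[float]]]:
    """Return (ports, X); row i of X is [H_i, S_i, h_i] for port i."""
    xc, yc = current
    xd, yd = dest
    ports: List[str] = []
    rows: List[List[float]] = []
    for port, (dx, dy) in PORT_OFFSETS.items():
        nx, ny = xc + dx, yc + dy
        # Simplified admissibility: skip ports that leave the mesh.
        if not (0 <= nx < mesh_size and 0 <= ny < mesh_size):
            continue
        hops = abs(xd - nx) + abs(yd - ny)  # C1: residual Manhattan distance
        ports.append(port)
        rows.append([float(hops), stress[port], health[port]])
    return ports, rows

# Hypothetical stress and health readings for one router at (1, 1).
ports, X = build_decision_matrix(
    (1, 1), (3, 2),
    stress={"N": 0.2, "E": 0.7, "S": 0.1, "W": 0.0},
    health={"N": 1.0, "E": 1.0, "S": 1.0, "W": 0.0},
    mesh_size=4,
)
for port, row in zip(ports, X):
    print(port, row)
```

Each row keeps the raw measurements; the cost/benefit sign of the columns is only applied during normalization.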

The decision matrix $X\in\mathbb{R}^{m\times n}$ contains the raw scores of the options; each row $i$ represents a candidate path/port (west, south, east, north, or equivalent minimum-step paths), and each column $j$ represents a QoS criterion (path length in hops, neighbor stress/traffic, health/fault). The entry $X_{ij}$ is extracted from local/neighbor counters at the decision moment. Column normalization of $X$ per Eq. (2) yields the matrix $A$, which serves as the basis for calculating $A^{+}$ and $A^{-}$, followed by the utility $S_{i}$, the regret $R_{i}$, and the index $Q_{i}$.
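The stress criterion is described above as an EWMA of neighbor input-buffer occupancy, optionally quantized to three levels. A minimal sketch of such a counter follows; the smoothing factor `alpha` and the two thresholds are illustrative assumptions, not values taken from the paper.

```python
def ewma_update(prev: float, occupancy: float, alpha: float = 0.25) -> float:
    """One EWMA step; occupancy is the buffer fill ratio in [0, 1]."""
    return alpha * occupancy + (1.0 - alpha) * prev

def quantize_stress(s: float, low: float = 1 / 3, high: float = 2 / 3) -> str:
    """Map a normalized stress value to Low / Moderate / Severe.

    The thresholds are assumed for illustration only.
    """
    if s < low:
        return "Low"
    if s < high:
        return "Moderate"
    return "Severe"

# A port that sees a full neighbor buffer for two consecutive samples.
s = 0.0
for sample in (1.0, 1.0):
    s = ewma_update(s, sample)
    print(s, quantize_stress(s))
```

A smaller `alpha` smooths out short bursts at the cost of reacting more slowly to sustained congestion.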

Step 2 Normalization, i.e. making the criteria scale-free, is the second phase of all multi-criteria decision-making methods that rely on a decision matrix³³. Each column is normalized by its Euclidean norm, as in Eq. (2).

$$A_{ij}=\frac{X_{ij}}{\sqrt{\sum_{k=1}^{m}X_{kj}^{2}}}$$

(2)

In this formula, $X_{ij}$ represents the $j$-th criterion for the $i$-th path, and $m$ denotes the total number of paths between the source and destination nodes. A related "link-based transfer" approach with an efficient computational formulation for large-scale networks is given in³⁴; it is relevant to the discussion of scalability and the time budget of link-based decision making. After normalization, the squared entries of each column sum to one. Following normalization, the value of any negative (cost) criterion must be inverted using Eq. (3).

$$A_{ij}\leftarrow 1-A_{ij}$$

(3)

Consequently, the normalized decision matrix $A$ is obtained as in Eq. (4). In this matrix³⁵, $A_{ij}$ represents the normalized value of criterion $j$ for the $i$-th path between the source and destination nodes.

$$A=\left[\begin{array}{ccc}A_{11}&\cdots&A_{1n}\\ \vdots&\ddots&\vdots\\ A_{m1}&\cdots&A_{mn}\end{array}\right]$$

(4)
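Equations (2) to (4) can be sketched as follows. This is a minimal pure-Python illustration: the function name `normalize` and the `is_cost` flag list are hypothetical, and the handling of an all-zero column (left at zero before inversion) is an assumption, since the text does not cover that case.

```python
import math
from typing import List

def normalize(X: List[List[float]], is_cost: List[bool]) -> List[List[float]]:
    """Vector-normalize each column (Eq. 2) and invert cost columns (Eq. 3)."""
    m, n = len(X), len(X[0])
    A = [[0.0] * n for _ in range(m)]
    for j in range(n):
        norm = math.sqrt(sum(X[i][j] ** 2 for i in range(m)))
        for i in range(m):
            # Assumption: an all-zero column normalizes to zero.
            a = X[i][j] / norm if norm > 0 else 0.0
            A[i][j] = 1.0 - a if is_cost[j] else a
    return A

# Columns: hops (cost), stress (cost), health (benefit).
X = [[3.0, 0.0, 1.0],
     [4.0, 0.0, 0.0]]
A = normalize(X, is_cost=[True, True, False])
print(A)
```

After the inversion every column of `A` is "larger is better", which is what allows the ideal points of the next step to be taken as plain column maxima and minima.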

Step 3 The third step identifies the best and worst values for each criterion in the matrix. For positive criteria associated with benefit, the highest value represents the best outcome, while the lowest value signifies the worst result. For negative criteria related to cost, the minimal raw value represents the best outcome, whereas the maximal raw value signifies the least favorable one³⁶. The best and worst values of each criterion are denoted as $A^{+}$ and $A^{-}$, respectively, and are calculated with Eqs. (5) and (6).

$$A_{j}^{+}=\max_{i}A_{ij}$$

(5)

$$A_{j}^{-}=\min_{i}A_{ij}$$

(6)

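In this context, $A^{+}$ denotes the positive ideal, while $A^{-}$ signifies the negative ideal. Consequently, in this phase, the maximum and minimum values of each column of the normalized decision matrix are identified³⁷. Because cost criteria have already been inverted by Eq. (3), the column maximum is the best value for every criterion.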

Step 4 The fourth step calculates the utility value $S_{i}$ and the dissatisfaction (regret) value $R_{i}$ for the $i$-th path between the source and destination nodes. Here $S_{i}$ is the utility index, not the stress value of Step 1. The utility value $S_{i}$ denotes the weighted relative distance of the $i$-th path from the ideal point, while the dissatisfaction value $R_{i}$ represents the largest single-criterion shortfall of this path relative to the ideal point³⁸. These values are computed with Eqs. (7) and (8), respectively³⁹.

$$S_{i}=L_{1,i}=\sum_{j=1}^{n}w_{j}\times\frac{A_{j}^{+}-A_{ij}}{A_{j}^{+}-A_{j}^{-}}$$

(7)

$$R_{i}=L_{\infty,i}=\max_{j}\left[w_{j}\times\frac{A_{j}^{+}-A_{ij}}{A_{j}^{+}-A_{j}^{-}}\right]$$

(8)

In this context, $A_{j}^{+}$ denotes the best value for the $j$-th criterion, $A_{j}^{-}$ signifies the worst value for the $j$-th criterion, $w_{j}$ indicates the weight of the $j$-th criterion⁴⁰, reflecting its significance, and $A_{ij}$ represents the normalized value of criterion $j$ for the $i$-th path between the source and destination nodes in the normalized decision matrix.
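A minimal sketch of Eqs. (5) to (8) is given below. The function name `utility_and_regret` is hypothetical, and treating a criterion on which all candidates tie (zero span) as contributing nothing is an assumption introduced to avoid a division by zero.

```python
from typing import List, Tuple

def utility_and_regret(
    A: List[List[float]], w: List[float]
) -> Tuple[List[float], List[float]]:
    """Return (S, R): weighted-sum utility and worst-criterion regret per row."""
    m, n = len(A), len(A[0])
    best = [max(A[i][j] for i in range(m)) for j in range(n)]   # Eq. 5
    worst = [min(A[i][j] for i in range(m)) for j in range(n)]  # Eq. 6
    S: List[float] = []
    R: List[float] = []
    for i in range(m):
        terms = []
        for j in range(n):
            span = best[j] - worst[j]
            # Assumption: a criterion with zero span contributes 0.
            gap = (best[j] - A[i][j]) / span if span > 0 else 0.0
            terms.append(w[j] * gap)
        S.append(sum(terms))  # Eq. 7
        R.append(max(terms))  # Eq. 8
    return S, R

# Two candidate ports, weights [w_path, w_cong, w_fault] chosen for illustration.
A = [[0.4, 1.0, 1.0],
     [0.2, 1.0, 0.0]]
S, R = utility_and_regret(A, w=[0.5, 0.25, 0.25])
print(S, R)
```

Lower values are better for both outputs: a row that matches the ideal point on every criterion gets `S = R = 0`.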

Step 5 In this phase, the TOPSIS index $Q_{i}$ is computed with Eq. (9) for each path $i$ connecting the source and destination nodes.

$$Q_{i}=v\left[\frac{S_{i}-S^{-}}{S^{*}-S^{-}}\right]+\left(1-v\right)\left[\frac{R_{i}-R^{-}}{R^{*}-R^{-}}\right]$$

(9)

Scalable multi-agent routing with reinforcement learning has been presented as an alternative, learning-based approach, whereas TOPSIS provides a guaranteed cycle-level delay⁴¹. The values of $S^{*}$, $S^{-}$, $R^{*}$, and $R^{-}$ are calculated using (10) to (13), respectively, and the parameter $v$, with $0<v<1$, is the weight given to the utility term $S_{i}$ relative to the regret term $R_{i}$.

$$S^{*}=\max_{i}S_{i}$$

(10)

$$S^{-}=\min_{i}S_{i}$$

(11)

$$R^{*}=\max_{i}R_{i}$$

(12)

$$R^{-}=\min_{i}R_{i}$$

(13)

The parameter $v$ is also influenced by the consensus of the decision-making group. If the agreement is substantial, then $v>0.5$; if the agreement is determined by majority vote, then $v=0.5$; and if the agreement is minimal, then $v<0.5$. As $v$ increases, group viewpoints are prioritized, and as $v$ decreases, individual opinions are emphasized.
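The index of Eq. (9), with the extrema of Eqs. (10) to (13), can be sketched as below. The name `compromise_index` is hypothetical, and mapping a degenerate range (all candidates equal) to zero is an assumption that the text does not address.

```python
from typing import List

def compromise_index(S: List[float], R: List[float], v: float = 0.5) -> List[float]:
    """Q_i per Eq. 9, using S* = max S, S- = min S, R* = max R, R- = min R."""
    s_max, s_min = max(S), min(S)
    r_max, r_min = max(R), min(R)

    def scaled(x: float, lo: float, hi: float) -> float:
        # Assumption: if all candidates tie, the scaled term is 0.
        return (x - lo) / (hi - lo) if hi > lo else 0.0

    return [
        v * scaled(s, s_min, s_max) + (1.0 - v) * scaled(r, r_min, r_max)
        for s, r in zip(S, R)
    ]

# Illustrative utility and regret values for three candidate ports.
Q = compromise_index(S=[0.0, 0.75, 0.3], R=[0.0, 0.5, 0.25], v=0.5)
print(Q)
```

With this scaling each $Q_{i}$ lies in $[0, 1]$ and smaller is better; raising `v` shifts the ranking toward the summed utility term, lowering it toward the worst single criterion.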

Step 6 The next step is to sort the routes by their values of $S_{i}$, $R_{i}$, and $Q_{i}$ in order to select the ideal route under the predefined conditions. For this purpose, the routes are sorted into three lists, in ascending order of $S$, $R$, and $Q$, respectively. The best route is the one ranked first (smallest value) in all three lists. Otherwise, the best route is the one with the smallest $Q$.

Step 7 Two conditions, acceptable benefit and acceptable stability, are then checked: ⅰ) the optimal path must exhibit the lowest values across all three indicators, and ⅱ) the $Q$ values of the first-ranked path (a) and the second-ranked path (b) must differ by at least the threshold in Eq. (14).

$$Q_{b}-Q_{a}\ge\frac{1}{m-1}$$

(14)

Acceptable stability in decision-making signifies that the selected compromise option must optimize collective value while minimizing individual repercussions. If the first condition is not fulfilled, both the first- and second-ranked alternatives are regarded as preferable options⁴². If the second condition is not met, all alternatives from the first in the $Q$ ranking up to the last one that still fails the second condition, i.e. the last path $M$ with $Q_{M}-Q_{a}<\frac{1}{m-1}$, are regarded as preferable alternatives.
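Steps 6 and 7 can be sketched as a small selection routine. The name `select_routes` is hypothetical, and the order in which the two conditions are tested when both fail is an assumption made for this example; the text does not specify that case.

```python
from typing import List

def select_routes(S: List[float], R: List[float], Q: List[float]) -> List[int]:
    """Return the indices of the preferred route(s), best first."""
    m = len(Q)
    order = sorted(range(m), key=lambda i: Q[i])  # ascending Q (Step 6)
    if m == 1:
        return order
    a, b = order[0], order[1]
    threshold = 1.0 / (m - 1)
    # Condition i: the first-ranked route is also lowest in S and R.
    lowest_everywhere = S[a] == min(S) and R[a] == min(R)
    # Condition ii: the Q gap to the runner-up meets Eq. 14.
    gap_ok = Q[b] - Q[a] >= threshold
    if lowest_everywhere and gap_ok:
        return [a]
    if gap_ok:
        return [a, b]  # only condition i fails
    # Condition ii fails: keep every route whose Q gap is below the threshold.
    return [i for i in order if Q[i] - Q[a] < threshold]

# Illustrative values for three candidate ports.
print(select_routes(S=[0.0, 0.75, 0.3], R=[0.0, 0.5, 0.25], Q=[0.0, 1.0, 0.45]))
```

When more than one index is returned, a router would still need a tie-break among them; that policy is outside this sketch.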

TOPSIS routing algorithm

To implement the proposed method, a weighted path strategy is used. The number of hops between the source and destination nodes, prioritized by minimal path, and the channel status weights (busy/congested/failed) are computed in real time to establish the path priority weight $W_{path}$ using the decision matrix. Ref⁴³ deals with the low-cost orchestration of network function chains for low-latency applications, where multi-criteria decision policies can optimize service paths to minimize latency; this idea is consistent with TOPSIS path selection. In the area of hardware resilience, Ref⁴⁴ applies adaptive reconfiguration for redundancy and service continuation after failure, which is similar to the adaptive rerouting logic in NoC. Ref⁴⁵ presents a lightweight approach to fault tolerance in circuits by encoding hybrid random numbers to contain bit-width growth, which is relevant to our discussion of low overhead and energy efficiency in routing. The path priority weight is determined as per Table 1. The weight values to be computed are the path priority weight $W_{path}$, the channel congestion weight $W_{Cong}$, and the channel fault weight $W_{Fault}$. Given that, under identical traffic conditions, the algorithm first selects movement along the x-axis and then along the y-axis, the four cardinal ports north, east, south, and west (N/E/S/W) in the source node are designated as $pp_{2}, pp_{1}, pp_{3}$ based on the number of hops necessary to transmit packets to the destination. Directions with an equal number of hops to the destination node are placed at the same level.

Table 1 Calculating the weight of route priority.
