Deep reinforcement learning-driven multi-objective optimization and its applications on lighting infrastructure operation and maintenance strategy

Wang, Zhiliu; Tang, Jiabao; Wei, Pengda; Lun, Wenhao; Wang, Yulong; Yang, Weihong; Xiao, Hui

doi:10.1038/s41598-026-37811-5


Article
Open access
Published: 13 February 2026


Zhiliu Wang¹,²,
Jiabao Tang¹,
Pengda Wei¹,
Wenhao Lun¹,
Yulong Wang¹,
Weihong Yang³ &
Hui Xiao³

Scientific Reports volume 16, Article number: 8989 (2026)



Abstract

This study addresses the challenges facing tunnel lighting system maintenance, where conventional single-objective optimization strategies and traditional maintenance approaches struggle to balance multidimensional requirements. Focusing on lifecycle maintenance management of tunnel lighting infrastructure, the research transforms multi-objective optimization into a set of Pareto-optimal subproblems through decomposition strategies. The proposed framework establishes a dynamic topological network within the solution space by integrating the Double Deep Q-Network (DDQN) algorithm from deep reinforcement learning (DRL) with neighborhood gradient transfer strategies. This study proposes an innovative integration of Wiener degradation processes and the DDQN algorithm to establish a dynamic reliability-cost coupling model for equipment performance analysis. A multi-objective deep reinforcement learning (MODRL)-driven intelligent maintenance framework is developed, systematically coordinating degradation dynamics and economic constraints through computational learning mechanisms. The results show that incorporating maintenance costs and reliability as reward components in the multi-objective optimization problem (MOP) simultaneously enhances operational reliability and reduces comprehensive maintenance expenditures by 29.7%. The neighborhood-based parameter transfer strategy reduced single-episode training time by 41.9% and parameter synchronization time by 68.3%, while improving GPU utilization by 34.9%. It achieved faster convergence with 22.8 fewer threshold steps and reduced multi-objective conflict rates by 17.0%. The developed multi-objective optimization framework for tunnel lighting systems overcomes fixed maintenance threshold limitations. The framework demonstrated a 68.5% reliability decline near lighting failure conditions while effectively addressing overconfidence issues. The weight-combination-based adaptive mechanism enables scenario-specific customization of optimization objectives, offering scalable solutions for cost-prioritized, reliability-focused, or balanced operational strategies.

Introduction

Tunnel lighting systems serve as critical infrastructure that enhances traffic safety and operational stability, with their reliability and operational status directly influencing drivers’ visual perception and decision-making capabilities. System failures caused by lamp aging or circuit malfunctions trigger sudden luminance reduction, creating compromised visual conditions that increase misinterpretation risks and compromise safety parameters. Such conditions significantly elevate rear-end collision probabilities and other accident risks, thus posing substantial threats to both human lives and transportation infrastructure integrity.

Conventional lighting maintenance strategies predominantly rely on manual inspections, static threshold regulation, or single-objective optimization models, all of which adapt poorly to multidimensional requirements in complex dynamic environments. One example is energy efficiency optimization, which requires the concurrent maintenance of illumination stability, equipment longevity, and rapid failure response, objectives that exhibit inherent nonlinear conflicts. Conventional methods face significant limitations in real-time responsiveness and coordinated optimization capabilities, and high maintenance costs hinder adaptation to the granular maintenance requirements of large-scale tunnel networks. Achieving dynamic equilibrium between operational costs and system reliability through intelligent methodologies constitutes the critical challenge for advancing tunnel lighting maintenance efficacy.

Recent advancements in deep reinforcement learning (DRL) offer novel solutions for dynamic multi-objective optimization through its autonomous learning capabilities in complex decision-making environments. DRL is particularly suited to high-dimensional industrial systems with multiple constraints, and enables adaptive strategy adjustments under incomplete information via continuous agent-environment feedback mechanisms. Successful implementations in smart grids, traffic management, and geotechnical engineering demonstrate its potential for coordinated multi-objective optimization across critical infrastructure domains. Ma et al.¹ investigated deep reinforcement learning-driven maintenance decision optimization for stay cables in long-span bridges. Dong et al.² developed joint optimization of condition-based maintenance and spare parts inventory for offshore wind turbines through DRL. Fan et al.³ established an intelligent optimization framework for natural gas pipeline preventive maintenance based on gas supply reliability. Gao⁴ conducted sequential decision optimization research for multi-state system observation and maintenance strategies under partial observability. Kong et al.¹,⁵ implemented intelligent prediction of shield tunnel cutterhead torque using smart optimization algorithms and Bi-LSTM neural networks. Yang et al.⁶ created a tunnel fire prediction system integrating external smoke analysis with deep learning algorithms. Ma et al.⁷ pioneered computer vision-based tracking algorithms for tunnel fire monitoring. Jiang et al.⁸ explored millimeter-wave radar target recognition and classification in tunnel environments. Hu et al.⁹ developed an automatic tunnel crack detection post-processing algorithm utilizing segmentation mask technology.

Tunnel lighting maintenance constitutes a risk-sensitive endeavor due to unique environmental factors, operational constraints, and elevated failure risks with significant financial implications. The systematic implementation of deep reinforcement learning algorithms in lighting system operations remains in exploratory stages despite technological advancements. Research gaps persist in addressing domain-specific challenges including real-time degradation monitoring and risk-aware decision coordination under operational uncertainties.

Current research predominantly focuses on single-objective optimization, and thus tends to overlook the dynamic trade-off requirements among multiple objectives in maintenance strategies. Algorithmic designs often inadequately address real-world constraints such as equipment degradation and environmental uncertainties. To overcome the limitations of single-objective paradigms, scholars have introduced MOP with enhanced solution-space adaptability. Rong-Hua D et al.¹⁰, Mohammad J et al.¹¹, and Yiyang Q et al.¹² have respectively conducted multi-objective optimization studies in the domains of spacecraft autonomous rendezvous, energy storage systems, and integrated energy systems. These studies utilized distinct methodologies, including angle-measurement navigation with closed-loop guidance control, thermoeconomic analysis, and installation configuration optimization, to enhance key performance indicators in their respective systems. Furthermore, Armin et al.¹³ and Yu et al.¹⁴ respectively applied multi-objective optimization methods in sustainable energy systems and in material and energy flow analysis in steel plants, further demonstrating the effectiveness of multi-objective optimization in complex systems. Model-based MOP algorithms fundamentally formulate problems through explicit objective functions. When constraints are rigorously defined and objective functions are meticulously structured, these frameworks achieve robust optimization performance. However, the design of objective functions frequently necessitates weight assignment to reconcile conflicting objectives, a process complicated by the non-convex optimization landscape inherent to real-world systems.

In light of the aforementioned circumstances, it is imperative that MOP be incorporated into the framework of DRL. Liao et al.¹⁵, Ge et al.¹⁶, and Li et al.¹⁷ respectively developed hierarchical and multi-agent frameworks for optimizing HVAC energy-IAQ trade-offs, renewable hybrid systems, and hybrid electric vehicle energy management. Hou et al.¹⁸, Yuan et al.¹⁹, and Liu et al.²⁰ innovated in industrial scheduling through co-evolutionary NSGA-III integration, flexible job shop dynamic allocation, and quantum computing resource optimization. Furthermore, Wei et al.²¹ and Wang et al.²² addressed satellite imaging quality and batch milling parameter optimization using neural policy gradient methods. Theoretical extensions include fairness-to-compromise solution mapping (Qian et al.²³) and infrastructure maintenance utility optimization (Remmerden et al.²⁴), which enhance DRL’s Pareto front interpretability. Faizanbasha et al.²⁵ proposed a hybrid ensemble model integrating CNN, Transformer, and LSTM, incorporating a smoothing semi-martingale layer to achieve collaborative modeling of deterministic and stochastic degradation processes. Simultaneously, reinforcement learning dynamically optimizes parameters to enhance prediction performance. Zhu et al.²⁶ further expanded uncertainty applications in RUL prediction, introducing the 1D-CNN-Informer architecture combined with Monte Carlo Dropout to generate interval-based RUL predictions. Morato et al.²⁷ constructed a joint framework of constrained partially observable Markov decision processes (POMDP) and multi-agent DRL. Bayesian inference was employed to handle state uncertainty, while decentralized control addressed the curse of dimensionality. Faizanbasha et al.²⁸ introduced Geometric Point Processes (GPP) and Smooth Semi-Martingales (SSM) to construct a Total Expected Discounted Cost (TEDC) optimization framework. 
This addressed the shortcomings of traditional MDP models in neglecting the geometric properties of failures and costs, achieving an 18.34% reduction in maintenance costs within naval systems. Fan et al.²⁹ proposed a multi-objective optimization framework integrating ensemble learning prediction models with the TD3 algorithm to achieve coordinated optimization of boiler parameters. Faizanbasha et al.³⁰ established a semi-Markov decision process (SMDP) model for dual-cell series manufacturing systems, enabling coordinated optimization of aging screening and predictive maintenance (PdM) strategies, with validated reliability improvements in EV battery systems. Li et al.³¹ employed an XGBoost prediction model integrated with SMS-EMOA and NSGA-III algorithms to achieve multi-objective optimization of engine power, fuel consumption, and emissions. An et al.³² proposed a framework combining Kriging surrogate modeling with NSGA-II to address the high computational cost of optimizing complex systems. A systematic review of the existing literature revealed that multi-objective DRL has rarely been applied to maintenance strategy optimization. In the context of tunnel lighting system maintenance, relevant research is especially lacking.

This study integrates deep reinforcement learning with multi-objective optimization theory to develop a dynamic decision-making framework for intelligent maintenance of lighting facilities. A MODRL architecture is designed to systematically balance operational costs and equipment reliability. To address inefficiencies in traditional multi-objective optimization, a neighborhood parameter transfer strategy is introduced, constructing gradient propagation networks in the solution space. This approach enhances training speed while ensuring the completeness and convergence of Pareto frontier solutions. Additionally, a personalized optimization scheme generation mechanism based on weight space exploration is developed, which for the first time achieves dynamic adaptation of multi-objective weights in tunnel maintenance. The research findings provide theoretical support for the intelligent upgrade of lighting facilities and open new technical pathways for multi-objective optimization in complex industrial systems. Subsequent sections detail algorithm design, experimental validation, and practical case applications to verify the effectiveness and engineering applicability of the proposed methods.

Multi-objective operation and maintenance (O&M) strategy optimization framework

This section introduces a comprehensive framework designed to optimize O&M strategies through a multi-objective lens. The framework integrates advanced optimization techniques with real-world operational constraints, aiming to balance multiple performance objectives.

Foundational principles of RL

Reinforcement learning (RL) serves as a cornerstone of this optimization framework. This section reviews the fundamental principles of RL, focusing on its approach to decision-making and learning.

Integration of Markov decision processes (MDPs) and reinforcement learning

Reinforcement learning tasks are typically formulated as MDPs, mathematical frameworks that model sequential decision-making by intelligent agents in stochastic environments. An MDP comprises four fundamental components: a state space $S$ representing all possible environmental configurations; an action space $A$ containing all executable agent behaviors; a state transition probability function $P(s^{\prime}|s,a)$ quantifying the likelihood of transitioning to state $s^{\prime}$ given the current state $s$ and action $a$; and a reward function $R(s,a,s^{\prime})$ (often simplified as $r$) specifying the immediate feedback for a state-action transition. The Markov property constitutes its theoretical foundation, expressed as $P(s^{\prime}|s_{1:t},a_{1:t}) = P(s^{\prime}|s_{t},a_{t})$, indicating that future states depend solely on the current state-action pair rather than on the historical trajectory.
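As a minimal illustration of the four MDP components, the sketch below encodes a hypothetical two-state, two-action model of a single lamp. The state names, action names, probabilities, and reward values are invented for illustration and are not taken from this study.

```python
import random

# Hypothetical two-state, two-action MDP for one lamp (illustrative numbers only).
STATES = ["ok", "failed"]
ACTIONS = ["wait", "repair"]

# P[(s, a)] maps each next state s' to the probability P(s' | s, a).
P = {
    ("ok", "wait"): {"ok": 0.9, "failed": 0.1},
    ("ok", "repair"): {"ok": 1.0, "failed": 0.0},
    ("failed", "wait"): {"ok": 0.0, "failed": 1.0},
    ("failed", "repair"): {"ok": 1.0, "failed": 0.0},
}


def reward(s, a, s_next):
    """R(s, a, s'): penalise repair effort and ending up in the failed state."""
    cost = 1.0 if a == "repair" else 0.0
    penalty = 5.0 if s_next == "failed" else 0.0
    return -(cost + penalty)


def step(s, a, rng):
    """Sample s' from P(. | s, a). Only the current pair (s, a) is used,
    which is exactly the Markov property."""
    next_states = list(P[(s, a)])
    weights = [P[(s, a)][s_next] for s_next in next_states]
    s_next = rng.choices(next_states, weights=weights, k=1)[0]
    return s_next, reward(s, a, s_next)


rng = random.Random(0)
print(step("failed", "repair", rng))  # deterministic transition back to "ok"
```

Because `step` receives no history argument, any policy built on it can condition only on the current state, mirroring the definition above.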

DRL

The progressive deterioration inherent in tunnel lighting system O&M necessitates advanced methodologies for enhanced efficiency and precision. As shown in Fig. 1, the deep Q-network (DQN) algorithm offers a structured approach to address these challenges through sequential decision-making optimization.

However, neural networks may generate output noise during training processes, which frequently induces overestimation of value functions and consequently compromises policy optimization effectiveness. To mitigate this inherent limitation, researchers developed the DDQN algorithm, whose architectural configuration is shown in Fig. 2.

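Consequently, the DDQN algorithm has been selected in this study to estimate value functions during tunnel lighting infrastructure O&M. Its mathematical framework is given by Eqs. (1) to (5), where $r$ is the immediate reward, $\gamma$ is the discount factor, $\theta$ denotes the parameters of the online network, and $\theta^{\prime}$ denotes the parameters of the target network: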

$$y_{t} = r_{t} + \gamma Q(s_{t + 1} ,\mathop {argmax}\limits_{{a_{t + 1} }} Q(s_{t + 1} ,a_{t + 1} ;\theta );\theta )$$

(1)

$$y_{j} = r_{j} + \gamma Q(s_{j + 1} ,a_{\max } ;\theta ^{\prime})$$

(2)

$$y_{j} = r_{j} + \gamma Q(s_{j + 1} ,\arg \max_{a} Q(s_{j + 1} ,a;\theta );\theta ^{\prime})$$

(3)

$$\delta = \left| {Q(s_{t} ,a_{t} ) - y_{t} } \right| = \left| {Q(s_{t} ,a_{t} ;\theta ) - (r_{t} + \gamma Q(s_{t + 1} ,argmax_{a} Q(s_{t + 1} ,a;\theta );\theta ^{\prime}))} \right|$$

(4)

$${\text{loss}} = \left\{ {\begin{array}{*{20}l} {\frac{1}{2}\delta^{2} ,} & {{\text{for }}\left| \delta \right| \le 1} \\ {\left| \delta \right| - \frac{1}{2},} & {{\text{otherwise}}} \\ \end{array} } \right.$$

(5)
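To make the difference between the targets concrete, the sketch below evaluates the DDQN target of Eq. (3) and the loss of Eq. (5) on plain Python lists of Q-values. The function names, the `done` flag for terminal transitions, and the sample numbers are assumptions added for illustration; the study's own network code is not shown in the text.

```python
def ddqn_target(r, gamma, q_online_next, q_target_next, done=False):
    """Eq. (3): select the action with the online network (theta),
    evaluate that action with the target network (theta')."""
    if done:  # assumption: no bootstrapping beyond a terminal transition
        return r
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]


def huber_loss(delta):
    """Eq. (5): quadratic for |delta| <= 1, linear otherwise."""
    d = abs(delta)
    return 0.5 * d * d if d <= 1.0 else d - 0.5


# Illustrative Q-values for the next state under each network.
q_online_next = [1.0, 3.0, 2.0]   # argmax is action 1
q_target_next = [0.5, 1.5, 4.0]   # the target network's own maximum is action 2

y = ddqn_target(r=1.0, gamma=0.9, q_online_next=q_online_next,
                q_target_next=q_target_next)
delta = 2.0 - y                   # TD error of Eq. (4) for a current estimate Q(s, a) = 2.0
print(y, huber_loss(delta))
```

In this example the target network alone would bootstrap from its largest value (4.0), whereas the decoupled selection uses 1.5, which is the mechanism by which DDQN limits overestimation.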

DRL-based multi-objective optimization framework

To enhance optimization efficiency, a decomposition strategy was implemented to disaggregate the MOP into subproblems, while introducing neighborhood-based parameter transfer mechanisms.

Multi-objective issue

The MOP solved with DDQN is abbreviated as MOP-DDQN. The MOP is defined by Eq. (6).

$$\mathop {\min }\limits_{x} \{ f_{1} (x),f_{2} (x),f_{3} (x), \cdots ,f_{M} (x)\} \quad {\text{s.t.}}\;x \in X$$

(6)

In the formulation, the objective function coordinates the competitive relationships among sub-objectives within the decision space. The Pareto front (PF) delivers multi-objective equilibrium solutions by combining locally optimal configurations of each sub-objective, enabling holistic performance to approach optimality.

Decomposition strategy

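The decomposition strategy breaks the MOP down into scalar optimization subproblems, each of which yields a Pareto-optimal solution. Once all subproblems are solved, the PF can be derived. Each subproblem $j$ is associated with a weight vector $\lambda^{j} = (\lambda_{1}^{j} , \cdots ,\lambda_{M}^{j} )$ whose components sum to one:

$$\lambda_{1}^{j} + \lambda_{2}^{j} + \cdots + \lambda_{M}^{j} = 1\;(\forall j = 1,2, \cdots ,N)$$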

(7)

Each weight vector corresponds to a single-objective optimization task, and the j-th subproblem can be defined as:

$$\mathop {\min }\limits_{{x \in X}} \sum\limits_{i = 1}^{M} {\lambda_{i}^{j} f_{i} (x)}$$

(8)

The scalarized subproblems, defined by Eq. (8), are formally independent single-objective optimization tasks. Each is solved using the DDQN algorithm to find its optimal policy. However, they are not optimized in isolation. The framework employs a coordinated yet decentralized optimization paradigm. While each subproblem maintains its own value network and training process, the neighborhood-based parameter transfer strategy facilitates knowledge sharing among similar subproblems. This allows the learning progress of one subproblem to bootstrap and stabilize the training of its neighbors, effectively creating a form of implicit joint optimization that enhances overall efficiency without sacrificing the clarity of the decomposition approach.

The linear weighted sum method (Eq. 8) may fail to reach Pareto optimal solutions residing in non-convex regions of the objective space. To guarantee the robustness and completeness of the decomposition approach, the framework incorporates the Tchebycheff scalarization as its core decomposition formulation. For a given weight vector $\lambda^{j}$ and a utopian reference point $z^{*}$, the subproblem is defined as:

$$\mathop {\min }\limits_{x \in \Omega } g^{te} (x|\lambda^{j} ,z^{*} ) = \mathop {\min }\limits_{x \in \Omega } \mathop {\max }\limits_{{1 \le m \le M}} \left\{ {\lambda_{m}^{j} \cdot \left| {f_{m} (x) - z_{m}^{*} } \right|} \right\}$$

(9)

where M is the number of objectives. This method possesses the property that for any Pareto optimal solution $x^{*}$, there exists a weight vector λ such that $x^{*}$ is an optimal solution to Eq. (9). Thus, it ensures stability and the theoretical capability to obtain any Pareto optimal solution even in non-convex scenarios. The reference point $z^{*}$ is dynamically updated based on the current best-found objectives to guide the search effectively.
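The difference between the two scalarizations can be seen on a toy example. The sketch below compares the weighted sum of Eq. (8) with the Tchebycheff form of Eq. (9) on three hypothetical bi-objective points, one of which lies in a non-convex part of the front; the points, the weight grid, and the reference point are illustrative assumptions.

```python
def weighted_sum(f, lam):
    """Eq. (8): linear aggregation of the objective values f under weights lam."""
    return sum(l * fi for l, fi in zip(lam, f))


def tchebycheff(f, lam, z_star):
    """Eq. (9): largest weighted distance to the reference point z*."""
    return max(l * abs(fi - zi) for l, fi, zi in zip(lam, f, z_star))


# Three mutually non-dominated points (both objectives minimised).
# C lies above the line joining A and B, i.e. in a non-convex region.
points = {"A": (0.0, 1.0), "B": (1.0, 0.0), "C": (0.6, 0.6)}
z_star = (0.0, 0.0)  # assumed utopian reference point

# No weight on this grid makes the weighted sum prefer C ...
ws_winners = set()
for k in range(0, 11):
    lam = (k / 10, 1 - k / 10)
    ws_winners.add(min(points, key=lambda p: weighted_sum(points[p], lam)))

# ... whereas the Tchebycheff form recovers it with balanced weights.
te_winner = min(points, key=lambda p: tchebycheff(points[p], (0.5, 0.5), z_star))
print(sorted(ws_winners), te_winner)
```

For any weight, the weighted sum of C is 0.6 while the better of A and B scores at most 0.5, so C is never selected; the max-based form scores C at 0.3 against 0.5 for A and B.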

Using the linear weighted aggregation strategy, the original multi-objective optimization problem is reformulated into multiple single-objective subtasks, as illustrated in Fig. 3.

Figure 3 shows that in the bi-objective scenario, solutions to subproblems correspond to specific trade-off points on the Pareto front (blue curve). The geometric relationship between black solid reference directions and orange weight vectors visually reveals how weight allocation guides solution space distribution. By parametrically adjusting objective weights, this method transforms complex multi-objective decision-making into controllable scalar optimization sequences, significantly reducing solution complexity.

To achieve a well-distributed approximation of the Pareto front, a systematic approach to generating weight vectors is adopted. A set of uniformly distributed weight vectors $\left\{ {\lambda^{1} ,\lambda^{2} , \cdots ,\lambda^{N} } \right\}$ is constructed via quasi-random sequences. Each vector defines a unique search direction in the objective space, as depicted by the reference directions in Fig. 3. By ensuring these directions span the entire objective space uniformly, the framework guarantees that the solutions to the scalarized subproblems collectively provide a diverse and representative sampling of the Pareto optimal set. Adaptive refinement techniques can further densify the solution set in regions of high decision-maker interest.
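The text does not say which quasi-random sequence is used. As one possible sketch for the bi-objective case, the base-2 van der Corput sequence (an assumption chosen here for simplicity) gives a low-discrepancy value $w$ in (0, 1), and each weight vector is taken as $(w, 1 - w)$ so that Eq. (7) holds by construction.

```python
def van_der_corput(k, base=2):
    """k-th element (k >= 1) of the van der Corput low-discrepancy sequence."""
    x, denom = 0.0, 1.0
    while k > 0:
        k, digit = divmod(k, base)
        denom *= base
        x += digit / denom
    return x


def weight_vectors(n):
    """n bi-objective weight vectors (w, 1 - w); components sum to one."""
    return [(w, 1.0 - w) for w in (van_der_corput(k) for k in range(1, n + 1))]


for lam in weight_vectors(7):
    print(lam)
```

The first seven vectors fill the interval progressively (0.5, then 0.25 and 0.75, then the eighths), so adding subproblems later refines the coverage instead of requiring a new grid. For more than two objectives a different construction would be needed.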

Figure 4 illustrates a decomposition strategy example for tunnel lighting facilities. The figure shows how the objective values of the operational cost and reliability subproblems vary during the iterative process of decomposition-based multi-objective optimization.

Neighborhood-based parameter transfer strategy

Figure 5 illustrates the parameter transfer strategy for tunnel lighting facility optimization using DRL for subproblem decomposition. The blue Pareto frontier circles represent the optimal solution set for multi-objective optimization, while black arrows indicate subproblem reference directions, showing search directions in the objective function space. Orange points denote weight positions, reflecting weight allocations for subproblems. The figure demonstrates similarities among subproblems, as nearby reference directions and weight positions suggest potentially similar optimal solutions.

To quantify the similarity between neighboring subproblems, we introduce a cosine similarity metric based on their weight vectors. Given two subproblems with weight vectors $\lambda^{i}$ and $\lambda^{j}$, their neighborhood similarity is defined as:

$${\text{Sim}}(\lambda^{i} ,\lambda^{j} ) = \frac{{\lambda^{i} \cdot \lambda^{j} }}{{\left\| {\lambda^{i} } \right\|\left\| {\lambda^{j} } \right\|}}$$

(10)

Subproblems with similarity above a threshold τ are considered neighbors, enabling structured parameter transfer between them. This integration of subproblem decomposition and parameter transfer strategies forms the foundation of the MOP-DDQN framework.

To explicitly define the parameter transfer process within the MOP-DDQN framework, a structured transfer protocol is established based on quantified neighborhood similarity. For a given subproblem i with its associated weight vector $\lambda^{i}$, its most similar trained neighbor $j^{*}$ is identified using cosine similarity (as defined in Eq. 10). Parameter transfer is then executed by initializing the model parameters of subproblem i with the optimal parameters obtained from subproblem $j^{*}$, formally expressed as $\omega_{{\lambda^{i} }} \leftarrow \omega_{{\lambda^{{j^{*} }} }}^{*}$. This mechanism facilitates knowledge reuse, accelerating convergence across related subproblems.

Direct parameter assignment, however, can induce instability due to disparities in objective landscapes or source task overfitting. To ensure robustness, the following stabilizing measures are integrated:

Parameter transfer is only activated when $Sim(\lambda^{i} ,\lambda^{{j^{*} }} ) \ge \tau$, where τ is a predefined similarity threshold. This condition prevents negative transfer from insufficiently similar subproblems. A soft blending strategy can also be employed, in which parameters are initialized via a convex combination:

$$\omega_{{\lambda^{i} }} \leftarrow \alpha \cdot \omega_{{\lambda^{{j^{*} }} }}^{*} + (1 - \alpha ) \cdot \omega_{{\lambda^{i} }}^{init}$$

(11)

Here, $\alpha \in [0,1]$ is a blending coefficient proportional to ${\text{Sim}}(\lambda^{i} ,\lambda^{{j^{*} }} )$, and $\omega_{{\lambda^{i} }}^{init}$ denotes parameters from a random initialization. This approach ensures a smoother transition, mitigating abrupt changes in the parameter space. Following parameter transfer, the subsequent training of subproblem i incorporates regularization techniques into the DDQN loss function. This constrains parameter updates, thereby preserving valuable knowledge from the source task while adaptively learning the target task. The following algorithm extends the MOP-DDQN framework by incorporating neighborhood similarity quantification and adaptive parameter transfer:
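The sketch below is not the authors' algorithm listing. It only illustrates, under stated assumptions, how the similarity gate of Eq. (10) and the soft blend of Eq. (11) could be coded. Parameters are represented as flat lists of floats, the threshold value 0.9 is arbitrary, and $\alpha$ is set equal to the similarity (the text only says it is proportional to it).

```python
import math


def cosine_similarity(u, v):
    """Eq. (10): cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def transfer_init(lam_i, w_init, trained, tau=0.9):
    """Initialise subproblem i from its most similar trained neighbour.

    trained: list of (lambda_j, omega_j_star) pairs for solved subproblems.
    Returns w_init unchanged when no neighbour reaches the threshold tau.
    """
    if not trained:
        return list(w_init)
    lam_j, w_j = max(trained, key=lambda t: cosine_similarity(lam_i, t[0]))
    sim = cosine_similarity(lam_i, lam_j)
    if sim < tau:                      # gate against negative transfer
        return list(w_init)
    alpha = min(1.0, sim)              # assumption: alpha equals the similarity
    # Eq. (11): convex combination of transferred and freshly initialised parameters.
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(w_j, w_init)]


trained = [((0.4, 0.6), [1.0, 2.0, 3.0]), ((1.0, 0.0), [9.0, 9.0, 9.0])]
print(transfer_init((0.5, 0.5), [0.0, 0.0, 0.0], trained))
```

With non-negative weight vectors the similarity lies in [0, 1], so using it directly as the blending coefficient keeps the combination convex. The regularization step described above is not included in this sketch.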

DRL for optimization of O&M strategies for tunnel lighting facilities

Building upon the multi-objective optimization framework constructed above, this section focuses on single-objective DRL modeling for tunnel lighting facility maintenance strategies. Phase-wise validation is implemented to provide theoretical support for subsequent multi-objective collaborative optimization.

Wiener process and its reliability

Wiener process degradation model

For optimizing maintenance strategies of complex tunnel lighting systems, equipment degradation modeling serves as the core component in constructing dynamic decision-making environments. The Wiener process, recognized for its mathematical properties and engineering applicability, has become a key tool for degradation process modeling. However, traditional Wiener-based maintenance optimization studies predominantly focus on static single-objective planning (e.g., cost minimization or reliability maximization), and do not adequately address multi-objective collaborative optimization in dynamic environments. Lighting facility maintenance requires simultaneous balancing of equipment reliability and maintenance costs, two objectives that exhibit nonlinear conflicts and dynamic temporal variations. This study integrates the analytical strengths of Wiener degradation processes with DRL’s multi-objective decision-making capabilities to deliver theoretical foundations and engineering solutions for the sustainable operation of highly dynamic systems such as lighting facilities.

For a Wiener process with drift parameter $\mu$ and diffusion parameter $\sigma$, the univariate Wiener process $\left\{ {X(t),t \ge 0} \right\}$ satisfies the following properties. Stationary independent increments: when $X(0) = 0$, the process exhibits stationary independent increments. Normal distribution: at any time t, $X(t)$ follows a normal distribution with mean $\mu t$ and variance $\sigma^{2} t$. The Wiener process with drift can be mathematically expressed as:

$$X(t) = \mu t + \sigma B(t)$$

(12)

Let $\{ B(t),t \ge 0\}$ denote a standard Wiener process (or standard Brownian motion). Based on this, the Wiener process with drift is defined as follows:

(1) For any time interval $[t,t + \Delta t]$, $\Delta X = X(t + \Delta t) - X(t)\sim N(\mu \Delta t,\sigma^{2} \Delta t)$;
(2) For any two non-overlapping time intervals $[t_{1} ,t_{2} ],[t_{3} ,t_{4} ],t_{1} < t_{2} \le t_{3} < t_{4}$, the increments $X(t_{4} ) - X(t_{3} )$ and $X(t_{2} ) - X(t_{1} )$ are statistically independent.

In the Wiener process degradation model for tunnel lighting facilities, Fig. 6 and Fig. 7 depict the degradation progression over time and the evaluated degradation trends. The divergence in degradation curves across facilities highlights the inherent temporal uncertainty and individual variability in the Wiener process.
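A degradation path of this kind can be simulated directly from property (1): each increment over a step $\Delta t$ is drawn from $N(\mu \Delta t,\sigma^{2} \Delta t)$ independently of the past. The sketch below is a minimal simulation; the parameter values are arbitrary and are not the ones fitted in this study.

```python
import random


def simulate_wiener_path(mu, sigma, dt, n_steps, rng):
    """Sample X(0), X(dt), ..., X(n_steps * dt) for X(t) = mu * t + sigma * B(t)."""
    x, path = 0.0, [0.0]
    for _ in range(n_steps):
        # Independent Gaussian increment with mean mu*dt and std sigma*sqrt(dt).
        x += rng.gauss(mu * dt, sigma * dt ** 0.5)
        path.append(x)
    return path


rng = random.Random(42)
# Three hypothetical fixtures sharing the same parameters still diverge,
# which is the individual variability visible in the degradation curves.
for fixture in range(3):
    path = simulate_wiener_path(mu=0.1, sigma=0.2, dt=1.0, n_steps=10, rng=rng)
    print(fixture, round(path[-1], 3))
```

Note that a Wiener path is not monotone: individual increments can be negative even when the drift is positive.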

Reliability modeling of the Wiener process

This study models the performance degradation using a univariate Wiener process. With a failure threshold $l$, and with $\Phi ( \cdot )$ denoting the standard normal cumulative distribution function, the reliability is defined as:

$$R(t) = 1 - F(t;\mu ,\sigma ) = \Phi \left( {\frac{l - \mu t}{{\sigma \sqrt t }}} \right) - \exp \left( {\frac{2\mu l}{{\sigma^{2} }}} \right)\Phi \left( {\frac{ - l - \mu t}{{\sigma \sqrt t }}} \right)$$

(13)

When the tunnel lighting facility is still in operation at time $\tau$ and has not failed, and the current degradation amount is $x_{\tau } (x_{\tau } < l)$, then the remaining life $T_{1}$ of the lighting facility can be expressed as:

$$T_{1} = \inf \{ t\,|\,X(t + \tau ) \ge l,\;X(\tau ) = x_{\tau } ,\;t \ge 0\}$$

(14)

Using the independent increment property and homogeneous Markov property of the univariate Wiener process, we obtain:

$$T_{1} = \inf \{ t\,|\,X(t + \tau ) - X(\tau ) \ge l - x_{\tau } ,\;t \ge 0\} = \inf \{ t\,|\,X(t) \ge l - x_{\tau } ,\;t \ge 0\}$$

(15)

The remaining life $T_{1}$ also follows an inverse Gaussian distribution. Its probability density function can be derived by replacing the failure threshold $l$ in the density function of the original lifetime $T$ with $l - x_{\tau }$, i.e.:

$$f_{{T_{1} }} (t) = \frac{{l - x_{\tau } }}{{\sqrt {2\pi \sigma^{2} t^{3} } }}\exp [ - \frac{{(l - x_{\tau } - \mu t)^{2} }}{{2\sigma^{2} t}}]$$

(16)
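As a numerical sanity check on Eq. (16), the sketch below evaluates the remaining-life density and integrates it with a simple midpoint rule. The parameter values are illustrative assumptions; with positive drift the density should integrate to approximately one and have mean $(l - x_{\tau })/\mu$, a standard property of the inverse Gaussian distribution.

```python
import math


def rul_pdf(t, x_tau, l, mu, sigma):
    """Eq. (16): density of the remaining life T_1 given current degradation x_tau < l."""
    if t <= 0.0:
        return 0.0
    d = l - x_tau  # remaining distance to the failure threshold
    return (d / math.sqrt(2.0 * math.pi * sigma ** 2 * t ** 3)
            * math.exp(-((d - mu * t) ** 2) / (2.0 * sigma ** 2 * t)))


# Midpoint-rule integration over (0, 20] with illustrative parameters.
dt = 0.001
grid = [(k + 0.5) * dt for k in range(20000)]
total = sum(rul_pdf(t, 0.0, 1.0, 1.0, 0.5) * dt for t in grid)
mean = sum(t * rul_pdf(t, 0.0, 1.0, 1.0, 0.5) * dt for t in grid)
print(round(total, 4), round(mean, 4))
```

Raising `x_tau` towards `l` shortens the remaining distance `d` and shifts the density towards earlier failure times, which is the effect of substituting $l - x_{\tau }$ for $l$.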

The trend of tunnel lighting facility reliability over time is shown in Fig. 8. As depicted in Fig. 9, the changes in reliability are influenced by different parameter values (mu, the drift parameter; sigma, the diffusion parameter; and threshold, the failure threshold). A larger mu value leads to a faster decline in reliability. A larger sigma value indicates increased uncertainty in the degradation process, resulting in a less stable reliability curve. Different threshold values directly alter the calculation of reliability and the shape of the curve.
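The parameter effects described above can be reproduced numerically from Eq. (13). The sketch below uses `statistics.NormalDist` from the Python standard library for $\Phi$; the parameter values are illustrative assumptions and not those behind Figs. 8 and 9.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF


def reliability(t, l, mu, sigma):
    """Eq. (13): probability that the Wiener process has not yet reached l by time t."""
    if t <= 0.0:
        return 1.0
    s = sigma * math.sqrt(t)
    # Caveat: exp(2*mu*l/sigma**2) can overflow for very small sigma;
    # the illustrative values below stay far from that regime.
    return _PHI((l - mu * t) / s) - math.exp(2.0 * mu * l / sigma ** 2) * _PHI((-l - mu * t) / s)


for mu in (0.1, 0.2):
    row = [round(reliability(t, l=1.0, mu=mu, sigma=0.2), 3) for t in (2.0, 5.0, 10.0)]
    print(mu, row)
```

Doubling the drift from 0.1 to 0.2 in this example lowers the reliability at every listed time, consistent with the statement that a larger mu leads to a faster decline.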

Based on the Wiener process degradation model, the relationships between degradation time, facility groups, and degradation levels in tunnel lighting facilities are illustrated in Fig. 10. The correlation matrix heatmap of tunnel lighting degradation data is shown in Fig. 11. This heatmap displays the strength and direction of correlations between degradation levels across different facilities. Such correlations reflect interconnected degradation mechanisms among the facilities, indicating how their degradation processes are mutually influenced.

RL configuration

In this section, the O&M management of tunnel lighting facilities is modeled as a sequential decision-making problem through the exploration–exploitation mechanism of DRL. A Q-network is employed to approximate the state-action value function, thereby driving the generation of adaptive control strategies under dual-objective constraints of cost and reliability. The focus is particularly on optimizing the O&M strategies for lighting facilities in long and extra-long tunnels. Due to their extended length, these tunnels feature a large number of widely distributed lighting facilities, with diverse illumination requirements and complex operational environments. Given the significant impact of lighting facility maintenance on tunnel safety and operational costs, optimizing maintenance strategies is critical for enhancing efficiency and reducing costs. To achieve the optimization of tunnel lighting maintenance strategies, the state space of tunnel lighting facilities is encoded as observational inputs for the DRL agent. The distribution of maintenance actions is optimized via gradient ascent to ensure real-time response constraints under multi-objective O&M are met during stochastic degradation processes.

States

To establish a mathematical representation framework for intelligent O&M, this study implements a periodic monitoring mechanism (with a sampling interval Δt, defaulting to 60 min). The real-time operational state of the tunnel lighting system is abstracted into a high-dimensional state feature vector $s$:

$$s = [l_{1} ,l_{2} ,l_{3} , \cdots ,l_{n} ,f_{1} ,f_{2} ,f_{3} , \cdots ,f_{n} ,e_{1} ,e_{2} ,e_{3} , \cdots ,e_{n} ]$$

(17)

where $l_{i} (i = 1,2,3, \cdots ,n)$ represents the luminance value of the $i$-th lighting fixture, and $f_{i} (i = 1,2,3, \cdots ,n)$ denotes the operational status indicator of the lighting fixture. The specific rules are defined as follows:

$$f_{i} = \left\{ {\begin{array}{*{20}l} {1,} & {{\text{if }}l_{i} \ge L_{\min } } \\ {0,} & {{\text{otherwise}}} \\ \end{array} } \right.$$

(18)

That is:

(1) $f_{i} = 1$ if the luminance $l_{i} \ge L_{\min }$ (normal operation);
(2) $f_{i} = 0$ if $l_{i} < L_{\min }$ (failure or degradation).

Here, $e_{j} (j = 1,2,3, \cdots ,n)$ denotes the ambient light intensity in the $j$-th area. It is related to the luminance of the lighting fixtures through the formula:

$$e_{j} = k \cdot \sum\limits_{{i \in \Omega_{j} }} {l_{i} } + b$$

(19)

where $\Omega_{j}$ is the set of fixtures in the $j$-th area, and $k$ and $b$ are coefficients determined by the actual environment. This relationship helps evaluate lighting requirements and fixture operational status.
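A minimal sketch of assembling the state vector of Eq. (17) from raw luminance readings is given below. It uses the "greater than or equal to $L_{\min }$" convention from the listed rules for the status flag. The function name, the area layout, and the values of $L_{\min }$, $k$, and $b$ are hypothetical, and the number of areas is left free here instead of being tied to the number of fixtures.

```python
def build_state(luminance, areas, l_min, k, b):
    """Concatenate luminance values, status flags, and ambient intensities.

    luminance: luminance l_i of every fixture.
    areas:     one list of fixture indices per tunnel area.
    """
    # Status flag: 1 for normal operation, 0 for failure or degradation.
    flags = [1 if l >= l_min else 0 for l in luminance]
    # Eq. (19): ambient intensity as an affine function of summed luminance.
    ambient = [k * sum(luminance[i] for i in area) + b for area in areas]
    return list(luminance) + flags + ambient


state = build_state(
    luminance=[80.0, 40.0, 60.0],   # illustrative readings
    areas=[[0, 1], [2]],            # hypothetical assignment of fixtures to areas
    l_min=50.0, k=0.01, b=5.0,      # hypothetical threshold and coefficients
)
print(state)
```

In practice the three blocks have very different scales, so some normalization before feeding the vector to a Q-network would likely be needed; the text does not specify one.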

Action

The management decisions for tunnel lighting systems encompass multiple operation types, including luminance adjustment of fixtures, fault repair or component replacement, and maintaining the current state. Based on this, the system’s action space can be formally defined as a multidimensional discrete space composed of decision vectors for all lighting units. Its mathematical representation is given in Eq. (20):

$$a = \left[ {a_{1} ,a_{2} ,a_{3} , \cdots ,a_{n} } \right]$$

(20)

where $a_{i} (i = 1,2,3, \cdots ,n)$ represents the management action for the i-th lighting fixture. The actions are defined as follows:

1. $a_{i} = 0$: No action is taken;
2. $a_{i} = 1$: Brightness of the i-th fixture is increased by Δl;
3. $a_{i} = 2$: Brightness of the i-th fixture is decreased by Δl;
4. $a_{i} = 3$: Perform maintenance on the i-th fixture, with cost $C_{repair,i} = m \cdot t_{repair,i}$, where m is the unit-time maintenance cost and $t_{repair,i}$ is the maintenance duration;
5. $a_{i} = 4$: Replace the i-th fixture, with cost $C_{repair,i} = p_{i}$, where $p_{i}$ is the replacement price.

The action a is represented as a vector. In the deep learning algorithm, actions correspond one-to-one with the output layer indices. The neural network’s output layer dimension is determined by the number of fixtures n and the number of action types (five, $a_{i} \in \{ 0,1,2,3,4\}$), resulting in a dimension of n × 5.
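The state and action encodings described above can be illustrated with a minimal sketch. The function names, the zero-based fixture index, and the flat indexing scheme (one output unit per fixture and action value) are assumptions made for illustration and are not taken from the authors' implementation.

```python
# Illustrative sketch only: names and the flat indexing scheme are assumptions.
ACTIONS = {
    0: "no action",
    1: "increase brightness by delta_l",
    2: "decrease brightness by delta_l",
    3: "perform maintenance",
    4: "replace fixture",
}


def build_state(luminance, ambient, l_min):
    """Assemble s = [l_1..l_n, f_1..f_n, e_1..e_n] as in Eq. (17)."""
    # Status flag: 1 if the fixture meets the minimum luminance, else 0.
    flags = [1 if l >= l_min else 0 for l in luminance]
    return list(luminance) + flags + list(ambient)


def encode_action(fixture, action, n_action_types=len(ACTIONS)):
    """Flat output-layer index for (fixture i, action a_i), fixture zero-based."""
    return fixture * n_action_types + action


def decode_action(index, n_action_types=len(ACTIONS)):
    """Inverse of encode_action: returns (fixture i, action a_i)."""
    return divmod(index, n_action_types)


state = build_state(luminance=[120.0, 40.0], ambient=[65.0, 30.0], l_min=50.0)
index = encode_action(fixture=1, action=3)
print(state, index, decode_action(index))
```

With two fixtures, the state holds two luminance values, two status flags, and two ambient readings, and each output index maps back to exactly one fixture and action pair.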

Incentive mechanism

To minimize the operational costs of tunnel lighting facilities, this paper calculates the equipment failure risk based on the first passage time distribution of the Wiener process. The multi-dimensional reward function incorporates operational risks alongside energy consumption and maintenance action costs. The core task of the agent is to continuously optimize its policy during interactions with the environment, striving to learn how to maximize cumulative rewards. Therefore, the reward function is defined as a function of comprehensive benefits, with its specific expression as follows:

$$r = - {\text{cost}} = - (C_{{{\text{power}}}} + C_{{{\text{maintenance}}}} + C_{{{\text{replacement}}}} + C_{{{\text{penalty}}}} )$$

(21)

where “cost” denotes the total operating cost, with the following components:

a) $C_{{{\text{power}}}}$: Power consumption cost, calculated as $C_{{{\text{power}}}} = \sum\limits_{i = 1}^{n} {C_{{{\text{power}},i}} \cdot t_{i} }$,

where $C_{{{\text{power}},i}}$ is the unit-time power cost of the i-th lighting fixture, $t_{i}$ is the operating duration of the i-th fixture, and n is the total number of lighting fixtures in the tunnel;

b) $C_{{{\text{maintenance}}}}$: Daily maintenance cost, expressed as $C_{{{\text{maintenance}}}} = \sum\limits_{i = 1}^{n} {C_{{{\text{maintenance}},i}} } \cdot t_{i}$.

where $C_{{{\text{maintenance}},i}}$ is the unit-time maintenance cost of the i-th fixture;

c) $C_{{{\text{replacement}}}}$: Fixture replacement cost, defined as $C_{{{\text{replacement}}}} = \sum\limits_{i = 1}^{n} {R_{i} } \cdot C_{n}$.

where $R_{i}$ is the number of replacements of the i-th lighting fixture (dimensionless, taking non-negative integer values, e.g. 1 for one replacement), and $C_{n}$ is the total replacement cost per fixture, including the purchase price of the luminaire and the on-site installation labor fee (unit: yuan/unit).

d) $C_{{{\text{penalty}}}}$: Penalty cost incurred due to non-compliant lighting quality (e.g., compromised driving safety). Penalty rules can be customized based on practical requirements. For example, when the average luminance $\overline{L} < L_{{{\text{threshold}}}}$, $C_{{{\text{penalty}}}} = c_{{{\text{penalty}}}} \cdot (\overline{L} - L_{{{\text{threshold}}}} )^{2}$, with $c_{{{\text{penalty}}}}$ as the constraint violation penalty coefficient and $L_{{{\text{threshold}}}}$ as the regulatory minimum average luminance threshold.
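As a minimal sketch of how the four cost components combine into the reward of Eq. (21), the following function evaluates one decision cycle. The argument names and the numerical values in the example are hypothetical; only the structure (power, maintenance, replacement, and a quadratic luminance penalty applied below the threshold) follows the definitions above.

```python
def reward(power_rate, maint_rate, hours, replacements, unit_replacement_cost,
           mean_luminance, l_threshold, c_penalty):
    """Reward r = -(C_power + C_maintenance + C_replacement + C_penalty)."""
    # Per-fixture unit-time costs multiplied by operating duration t_i.
    c_power = sum(c * t for c, t in zip(power_rate, hours))
    c_maint = sum(c * t for c, t in zip(maint_rate, hours))
    # Replacement count R_i times the per-fixture replacement cost C_n.
    c_repl = sum(replacements) * unit_replacement_cost
    # Quadratic penalty, active only when mean luminance is below the threshold.
    shortfall = max(0.0, l_threshold - mean_luminance)
    c_pen = c_penalty * shortfall ** 2
    return -(c_power + c_maint + c_repl + c_pen)


# Hypothetical two-fixture example for a single 1 h cycle.
r = reward(power_rate=[0.1, 0.2], maint_rate=[0.01, 0.01], hours=[1.0, 1.0],
           replacements=[0, 1], unit_replacement_cost=300.0,
           mean_luminance=2.0, l_threshold=2.5, c_penalty=100.0)
print(r)
```

Because every component is a cost, the reward is never positive, and a cycle with no replacements and compliant luminance is penalized only by its power and routine maintenance terms.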

This study employs operational data from a 7.005-km extra-long highway tunnel in Yunnan Province for validation. The tunnel has been in continuous operation for four years and is equipped with 2,080 LED tunnel luminaires. Key cost parameters were calibrated based on actual operational records, market quotations, and local industry standards. The validation results are given in Table 1.

Table 1 Validation results.


The simplified cost model was compared with the actual operational cost records of the tunnel from January to June 2024 (6 months). The model calculated a total cost of 89,632 yuan, a relative error of 5.7% with respect to the actual total cost (95,018 yuan). The average error for single-cycle (60 min) costs was 4.3%, which falls within an acceptable range.

Agent-environment interaction simulation

Through an IoT sensor network, multidimensional state data, including fixture luminance (L_t), energy consumption (E_t), operating temperature (T_t), and more, are collected at high frequency (hourly intervals). Combined with a Wiener degradation process, a quantified equipment health assessment model is constructed to estimate the real-time Remaining Useful Life (RUL) and failure probability (P_fail) of the fixtures. This forms a multidimensional state vector s_t = [L_t, E_t, RUL, P_fail, T_env], where T_env represents the ambient temperature. At fixed intervals, the agent inspects the lighting facility state characterized by this multidimensional vector. Based on the current state s_t, the agent determines management actions according to a predefined reinforcement learning policy network. The tunnel lighting facility maintenance management model is illustrated in Fig. 12.

Figure 12 shows the maintenance workflow initiated upon detecting luminaire failure through P_fail threshold evaluation, where failed units are first removed (as shown in the “Lighting facilities under maintenance” operation scene) and replaced with components retrieved from the spare lighting facility repository. The dismantled faulty luminaires undergo workshop maintenance, with digitally recorded repair parameters (e.g., replaced part specifications and service time) systematically documented in maintenance logs. This closed-loop data feedback mechanism updates the Wiener degradation process parameters within the equipment health assessment model, enabling dynamic optimization of maintenance strategies through continuous model refinement.
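For a Wiener process with positive drift, the first passage time to a fixed threshold follows an inverse Gaussian distribution, so a failure probability of the kind used for P_fail can be evaluated in closed form. The sketch below is illustrative only: the function names are hypothetical, and the parameter values in the example (μ = 0.32/day, σ = 0.15/day, D = 13.2) are borrowed from values reported elsewhere in this paper rather than from the authors' code.

```python
import math


def _phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def failure_probability(t, mu, sigma, threshold, x0=0.0):
    """P(X first reaches `threshold` within time t), X(t) = x0 + mu*t + sigma*B(t)."""
    d = threshold - x0          # remaining distance to the failure threshold
    if d <= 0:
        return 1.0
    if t <= 0:
        return 0.0
    s = sigma * math.sqrt(t)
    first = _phi((mu * t - d) / s)
    tail = _phi(-(mu * t + d) / s)
    # exp(2*mu*d/sigma^2) can be huge while `tail` is tiny: combine in log space.
    if tail > 0.0:
        second = math.exp(2.0 * mu * d / sigma ** 2 + math.log(tail))
    else:
        second = 0.0
    return min(1.0, first + second)


for days in (10.0, 41.25, 100.0):
    print(days, failure_probability(days, mu=0.32, sigma=0.15, threshold=13.2))
```

With these values the mean first passage time is D/μ = 41.25 days, so the failure probability is negligible at 10 days, slightly above one half at 41.25 days, and essentially one at 100 days.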

To investigate the state evolution patterns of tunnel lighting facilities and the mechanistic impacts of management actions on their operational states, this study establishes a virtual interactive environment (as shown in Fig. 13). The environment enables efficient agent-environment interactions to simulate dynamic system behaviors under diverse maintenance strategies.

Figure 13 shows that within this simulation framework, the system evaluates tunnel lighting facility states at fixed time intervals. The agent selects optimal actions from pre-defined options based on real-time state information and receives immediate rewards incorporating operational costs, lighting quality, and other multidimensional factors.

O&M cost sheet targeting

This study utilized four years of operational data (performance metrics, electrical parameters, environmental monitoring, maintenance records, etc.) from a tunnel lighting facility in Yunnan Province, China, employing principal component analysis (PCA) for dimensionality reduction and noise suppression. As shown in Fig. 14, the first principal component (PC1) accounts for 87.2% of the variance, effectively characterizing equipment degradation trends. Further embedding the PC1 time-series data into a Wiener process (as illustrated in Fig. 15), a degradation model is established through parameter estimation (μ = 0.32/day, σ = 0.15/day). The 95% confidence interval of the model successfully captures abnormal acceleration in degradation caused by environmental temperature fluctuations (t = 60–90 days), demonstrating its adaptability to real-world operating conditions.

Considering the dataset characteristics of tunnel lighting facilities, the Wiener process parameters are defined as constants, and the degradation threshold (D) is determined by the average of the maximum values of the first principal component (PC1). Additionally, maximum likelihood estimation (MLE) is applied to estimate the values of μ and σ, with the results summarized in Table 2.

Table 2 Parameter estimation.
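When the degradation indicator is sampled at a fixed interval Δt, the increments of a Wiener process are independent and normally distributed with mean μΔt and variance σ²Δt, so the maximum likelihood estimates of μ and σ have a closed form. The following sketch applies that closed form to a simulated path; the simulation, the seed, and the function names are assumptions for illustration, and the real PC1 series is not reproduced here.

```python
import math
import random


def wiener_mle(path, dt):
    """Closed-form MLE of drift mu and diffusion sigma from an equally spaced path."""
    increments = [b - a for a, b in zip(path, path[1:])]
    n = len(increments)
    mu_hat = sum(increments) / (n * dt)
    var_hat = sum((dx - mu_hat * dt) ** 2 for dx in increments) / (n * dt)
    return mu_hat, math.sqrt(var_hat)


def simulate_path(mu, sigma, dt, steps, seed=0):
    """Simulate X(t) = mu*t + sigma*B(t) on a regular grid (illustration only)."""
    rng = random.Random(seed)
    x, path = 0.0, [0.0]
    for _ in range(steps):
        x += mu * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


path = simulate_path(mu=0.32, sigma=0.15, dt=1.0, steps=5000, seed=1)
mu_hat, sigma_hat = wiener_mle(path, dt=1.0)
print(mu_hat, sigma_hat)
```

On a sufficiently long simulated path the estimates recover the generating parameters closely, which is a useful sanity check before fitting field data.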


The first principal component (PC1) extracted by PCA was selected as the degradation indicator. The Kolmogorov–Smirnov (K-S) normality test was performed on the degradation increments of PC1 at different time points. Test results indicate that the K-S statistic D values for all time points ranged between 0.042 and 0.068, all falling below the critical value D₀.₀₅ = 0.094. Furthermore, the corresponding P-values were all greater than 0.05 (Table 3), confirming that the degradation increments satisfy the assumption of normal distribution.

Table 3 Verification results.


As shown in Table 3, the statistics of degradation increments across different time intervals reveal that the mean fluctuates between 0.31 and 0.33 (close to the estimated value μ = 0.32/day), while the standard deviation ranges from 0.148 to 0.152 (close to σ = 0.15/day, i.e. σ² = 0.0225/day²). The coefficient of variation remains below 5% for all increments, demonstrating their stationarity. An autocorrelation function (ACF) test was conducted on the degradation increment sequence. The autocorrelation coefficients at lags 1 to 10 all fell within the range [-0.08, 0.06] and did not exceed the 95% confidence interval (± 0.052). This indicates no significant correlation between degradation increments in adjacent time intervals, satisfying the assumption of independent increments. The root mean square error (RMSE) and coefficient of determination (R²) were used to quantify the fit between the Wiener process model and the actual degradation data. Results show that for the PC1 degradation trajectory, the model yields RMSE = 0.87 and R² = 0.92. The 95% confidence interval successfully covers 93.6% of the actual data points, including the accelerated degradation segment from t = 60 to 90 days caused by environmental temperature fluctuations (Fig. 15).
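The independence check on increments can be sketched with the sample autocorrelation function and the usual large-sample white-noise band of ±1.96/√N. This is a generic illustration under that assumption; the function names are hypothetical, and the paper does not state how its band was computed.

```python
import math


def autocorrelation(x, lag):
    """Sample autocorrelation of sequence x at the given positive lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return num / denom


def white_noise_band(n, z=1.96):
    """Approximate 95% band for the ACF of n independent observations."""
    return z / math.sqrt(n)


# A strictly alternating sequence is the opposite of independent increments.
alternating = [1.0, -1.0] * 50
print(autocorrelation(alternating, 1), white_noise_band(len(alternating)))
```

For independent increments, the coefficients at all tested lags are expected to stay inside the band, whereas the alternating sequence above lies far outside it at lag 1.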

The Wiener process was also compared with the Gamma and inverse Gaussian degradation models. All three models were fitted by maximum likelihood estimation (MLE), implemented with Python’s scipy.stats library. Four metrics were selected to form a multidimensional quantitative evaluation system: the coefficient of determination (R²), the root mean square error (RMSE), the Akaike Information Criterion (AIC), and the residual K-S test. The fitting results are shown in Table 4. The Wiener process achieves R² = 0.92, higher than both the Gamma process (0.89) and the inverse Gaussian process (0.90), and its RMSE of 0.87 is the smallest among the three models, indicating the best fitting accuracy for the PC1 degradation trajectory. In terms of the trade-off between fitting accuracy and model complexity, the Wiener process has an AIC of 128.6, lower than both the Gamma process (135.2) and the inverse Gaussian process (131.5), which better aligns with the principle of model simplicity. The residual K-S test shows that the residuals of all three models are consistent with a normal distribution (P > 0.05), with no systematic fitting bias. However, the residual statistic D for the Wiener process is 0.051, smaller than that of the Gamma process (0.073) and the inverse Gaussian process (0.068), indicating a more random residual distribution and a more reliable model fit.

Table 4 Comparison with gamma and Inverse Gaussian degradation models.


Results and analysis

Calculation result

A single training epoch is defined as 1000 operational cycles of the tunnel lighting facilities, where each epoch comprises a series of 1000 actions performed by the agent on the facilities (T = 1000). A total of 1000 training epochs are conducted (C = 1000). To evaluate and analyze the learning dynamics of the agent, we investigated how it balances exploration of new actions and exploitation of existing experience. The Epsilon balancing parameter trajectory is illustrated in Fig. 16.

Figure 16 shows that during the initial stages of training, the Epsilon value decreases rapidly, indicating that the agent is in the exploration phase. As training progresses, the decrease in Epsilon slows down and eventually approaches a lower value, suggesting that the agent relies more on its learned strategies while retaining some exploration to adapt to environmental changes.
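A fast initial decrease followed by a plateau at a small residual value is the typical shape of an exponentially decaying ε with a floor. The sketch below shows one such schedule together with ε-greedy action selection; the start value, floor, and decay factor are assumed for illustration and are not the settings used in this study.

```python
import random


def epsilon_at(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    """Exponentially decayed exploration rate with a lower bound."""
    return max(eps_min, eps_start * decay ** episode)


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
for episode in (0, 100, 500, 1000):
    print(episode, round(epsilon_at(episode), 4))
print(select_action([0.1, 0.9, 0.3], epsilon=0.0, rng=rng))
```

Keeping the floor above zero preserves a small amount of exploration late in training, which matches the behavior described for Fig. 16.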

To validate the effectiveness of the DDQN-optimized strategy, it was compared with four baseline strategies: Strategy A (Routine Lighting Maintenance) performs fixed-cycle maintenance every 3 months to ensure system stability; Strategy B (Illuminance-Based Maintenance) triggers operations when the average illuminance across multiple monitoring points drops below a threshold of 20 lx; Strategy C (Luminaire Usage-Time-Based Maintenance) schedules maintenance once cumulative usage reaches 5,000 h per luminaire; and Strategy D (Failure Rate Threshold-Based Maintenance) initiates full-system inspections when the calculated failure rate exceeds 10%, defined as Failure Rate = Faulty Luminaires/Total Luminaires × 100%. Cost variations under identical time intervals for all strategies are shown in Fig. 17.

As shown in Fig. 17, Strategy A (purple curve) displays a smooth, gradually rising cost trend, reflecting the sustained impact of equipment degradation on maintenance expenses; Strategy B (yellow curve) exhibits distinct step-like fluctuations, highlighting the cost uncertainty of illuminance-based maintenance, which is triggered only when illuminance drops below the 20 lx threshold; Strategy C (brown curve) follows a pattern of slow initial rise, accelerated mid-phase growth, and late-phase slowdown, aligning with the relationship between luminaire usage cycles and cost dynamics; Strategy D, the failure rate threshold-based strategy (gray curve), shows low, stable costs when the failure rate remains below 10%, but triggers sharp step-like surges followed by rapid declines upon exceeding the threshold, underscoring its vulnerability to failure rate volatility. Meanwhile, the DDQN-optimized strategy (orange curve), despite minor fluctuations, maintains a steadier overall trajectory, demonstrating improved stability in cost control by avoiding the marked step-like patterns or steep cost escalations observed in the baseline strategies.

After 10 testing cycles, Strategy A demonstrates an average cost of CNY 10,575.25. This elevated expenditure stems from conventional fixed-interval maintenance, which guarantees baseline system functionality but accumulates escalating maintenance workload and costs due to progressive equipment degradation. Strategy B achieves significantly lower mean costs at CNY 1,653.30, benefiting from initial illuminance compliance requiring minimal maintenance. However, illuminance fluctuations risk sporadic cost spikes in specific cycles. Strategy C exhibits a lower mean cost (CNY 1,433.45), with expenditure patterns correlating to luminaire service duration: lower initial costs followed by progressive increases from intensified luminaire replacements. The failure-rate-threshold-based Strategy D achieves the lowest mean cost (CNY 1,401) through low-cycle-cost operations when below critical failure rates, despite incurring substantial comprehensive maintenance costs during threshold-exceeding events. The DDQN-driven strategy balances system stability against expenditure control at an average cost of CNY 1,492.95. While this cost is higher than that of Strategies C and D, the enhanced system stability provided by DDQN makes it the preferable option for long-term maintenance decisions. Its ability to maintain tunnel lighting facilities more stably underscores its value as a sustainable choice despite the higher average cost.

Analysis

To investigate why certain traditional lighting maintenance strategies result in lower costs compared to advanced intelligent predictive maintenance approaches, a sample was selected from the testing process for analysis. This aims to uncover underlying factors driving cost efficiency in conventional methods despite their lack of adaptive optimization capabilities.

For tunnel lighting fixtures currently under monitoring, the agent initially opted to defer maintenance based on predefined rules, a decision that ultimately led to fixture failure. As shown in Table 5, the degradation margin γ (defined as γ = D − 10.5 = 13.2 − 10.5 = 2.7) follows a normal distribution γ ~ N(0.6, 0.2). Leveraging the properties of the normal distribution, the probability of fixture failure in the next cycle was precisely calculated as P{γ ≥ 2.7} = 0.03. Within the reinforcement learning (RL) framework, the agent’s decisions are driven by expected future rewards: encountering a fixture failure yields an expected reward of −0.03 × (40 + 3) = −1.29, while normal operation results in −(1 − 0.03) × 3 = −2.91. This discrepancy led to an overly conservative behavior: the agent incorrectly overestimated the likelihood of imminent failure (despite the low 3% probability) and prioritized unnecessary maintenance actions, significantly inflating costs.

Table 5 Parametric analysis.


After filtering out anomalous data such as sensor failures, the degradation margin for each luminaire was calculated at 60-min intervals. The degradation margin data underwent a Kolmogorov–Smirnov (K-S) test, yielding a test statistic D = 0.059 and a critical value D_0.05 = 0.094. Since D < D_0.05, the null hypothesis that “data follow a normal distribution” could not be rejected³³. The mean (0.59) and variance (0.22) of the empirical degradation margin data closely matched the assumed distribution γ ~ N(0.6, 0.2), validating the rationality of the parameter settings.

Over a series of 10 test cycles covering diverse operating conditions, tunnel lighting fixture failure recurrence was systematically tracked, with results illustrated in Fig. 18. This “over-conservatism” frequently occurs in critical infrastructure like highway tunnels where stable illumination is paramount, as simplistic deep reinforcement learning approaches often converge to suboptimal solutions. To refine the agent’s decision-making for precise risk assessment near failure thresholds, stability and maintenance costs were integrated into a MOP model, thus balancing operational reliability against economic efficiency.

Applying the framework to O&M cost and reliability optimization

Optimizing solely for minimal maintenance costs in tunnel lighting systems has inherent limitations, as intelligent systems may develop overconfidence biases. Consequently, co-optimizing reliability objectives with cost targets is critical to ensure balanced decision-making that avoids both excessive conservatism and undue risk exposure.

Modeling of O&M cost and reliability subproblems

In tunnel lighting facility operations research, the maintenance cost function requires inverse calibration under a system performance degradation evaluation framework to account for the time-dependent attenuation of equipment states. This calibration reflects the degradation-dependent nature of maintenance costs; in real-world operating conditions, the equipment degradation parameters are themselves time-dependent. The resulting multi-objective optimization problem is formulated in Eq. (22):

$$\arg \mathop {\max }\limits_{\tau } \Phi (\tau ) = [C(\tau ),A(\tau )]^{T}$$

(22)

The maintenance cost is defined as the first sub-objective, as shown in Eq. (22). The second sub-objective, reliability, quantifies the ability of tunnel lighting facilities to maintain normal operation over time, which is critical for ensuring driving safety and traffic flow continuity. To holistically balance these dual objectives (cost and reliability), distinct weights are assigned to each, thus enabling the derivation of a Pareto frontier. Consequently, Eq. (22) can be reformulated as:

$$\arg \mathop {\max }\limits_{\tau } \psi (\tau |\lambda ) = \lambda \cdot C(\tau ) + (1 - \lambda ) \cdot A(\tau )$$

(23)
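The weighted-sum form of Eq. (23) can be illustrated by sweeping λ over a grid and recording which candidate policy maximizes ψ for each weight. In this sketch the candidate policies and their values are hypothetical, C is taken as a normalized negative cost (so larger is better, consistent with the arg max in Eq. (23)), and A is a reliability score in [0, 1].

```python
def scalarize(c, a, lam):
    """psi(tau | lambda) = lambda * C(tau) + (1 - lambda) * A(tau)."""
    return lam * c + (1.0 - lam) * a


def best_for_weight(candidates, lam):
    """Candidate (name, C, A) that maximizes the scalarized objective."""
    return max(candidates, key=lambda p: scalarize(p[1], p[2], lam))


def sweep(candidates, n_weights=11):
    """Names of the distinct maximizers found while sweeping lambda from 0 to 1."""
    chosen = []
    for k in range(n_weights):
        lam = k / (n_weights - 1)
        name = best_for_weight(candidates, lam)[0]
        if name not in chosen:
            chosen.append(name)
    return chosen


# Hypothetical policies: (name, normalized negative cost C, reliability A).
CANDIDATES = [
    ("cheap", -0.2, 0.55),
    ("balanced", -0.5, 0.90),
    ("safe", -1.0, 0.99),
    ("dominated", -0.9, 0.85),
]
print(sweep(CANDIDATES))
```

Small λ favors the high-reliability policy, large λ favors the low-cost policy, and the dominated candidate is never selected, which is how a set of weights traces out points of the Pareto frontier.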

To reveal how the DDQN algorithm transforms the multi-objective optimization model into an actionable neural network structure, the paper investigates the dynamic interaction mechanisms between functional modules, as illustrated in Fig. 19.

The DDQN algorithm integrates the state input layer (including parameters such as illuminance, energy consumption rate, and lifespan) with model parameters. Shared hidden layers enable objective coupling, while dual Q-value branches independently calculate cost and reliability. The experience replay mechanism simulates equipment degradation processes. Specifically:

Economic Objective: Minimize costs: $\min C(\tau ) = f_{{{\text{cost}}}} (s_{t} ,a_{t} |\theta )$;

Reliability Objective: Maximize operational reliability: $\max A(\tau ) = f_{{{\text{reliability}}}} (s_{t} ,a_{t} |\phi )$.
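The defining step of Double DQN is that the online network selects the next action while the target network evaluates it, which reduces the overestimation of standard Q-learning targets. The sketch below shows this target for a single scalar value head; how it is applied to the separate cost and reliability branches is not specified here, so applying it per branch would be an assumption. Function names and numbers are illustrative.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma, done):
    """y = r + gamma * Q_target(s', argmax_a Q_online(s', a)); y = r if terminal."""
    if done:
        return reward
    # Action selection uses the online network ...
    best = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ... while its value is read from the target network.
    return reward + gamma * next_q_target[best]


def dqn_target(reward, next_q_target, gamma, done):
    """Standard DQN target, shown for comparison only."""
    return reward if done else reward + gamma * max(next_q_target)


online = [0.2, 0.9, 0.1]
target = [0.5, 0.3, 0.8]
print(double_dqn_target(-1.0, online, target, gamma=0.9, done=False))
print(dqn_target(-1.0, target, gamma=0.9, done=False))
```

In this example the Double DQN target is lower than the standard target because the action preferred by the online network is not the one the target network values most.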

Reliability reconstruction

With respect to tunnel lighting facility maintenance and management, using the original reliability function R(t) for multi-objective optimization poses limitations. The original reliability values are confined to the range [0, 1], while maintenance costs may span (−∞, 0]. The mismatch in sign conventions and magnitude scales between these two objectives impedes their synergistic optimization. This incompatibility not only distorts the algorithm’s ability to accurately assess facility states but may also lead to local optima due to conflicting gradient directions, ultimately causing misjudgments and flawed decisions. To address this, inspired by the Sigmoid function and its inverse, we propose the following reformulation:

$$R_{re} (t) = - \eta \ln \left( {\frac{1}{R(t)} - 1} \right)$$

(24)
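Equation (24) is the logit of R(t) scaled by η, since −ln(1/R − 1) = ln(R/(1 − R)). Differentiating with respect to R gives η/(R(1 − R)), which grows without bound as R approaches 0 or 1. The sketch below evaluates the transformation, its derivative, and its inverse; η = 1 and the function names are assumptions for illustration.

```python
import math


def reconstruct_reliability(r, eta=1.0):
    """R_re = -eta * ln(1/R - 1), defined for 0 < R < 1 (Eq. 24)."""
    if not 0.0 < r < 1.0:
        raise ValueError("reliability must lie strictly between 0 and 1")
    return -eta * math.log(1.0 / r - 1.0)


def reconstruct_gradient(r, eta=1.0):
    """dR_re/dR = eta / (R * (1 - R))."""
    return eta / (r * (1.0 - r))


def original_reliability(r_re, eta=1.0):
    """Inverse mapping: the sigmoid of R_re / eta."""
    return 1.0 / (1.0 + math.exp(-r_re / eta))


for r in (0.01, 0.5, 0.99):
    print(r, reconstruct_reliability(r), reconstruct_gradient(r))
```

The mapped value is zero at R = 0.5, strongly negative near R = 0, and strongly positive near R = 1, while the derivative at R = 0.01 is roughly 25 times its value at R = 0.5.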

To reveal the differences between the reconstructed model and the original model, we conducted an analysis of reliability function reconstruction and gradient response characteristics. A graphical representation of the results is shown in Fig. 20.

As shown in Fig. 20, the reconstructed reliability function $R_{re} (t)$ extends the original value range to the real number domain, resolving the sign and magnitude mismatch issues. This reformulation injects critical prior knowledge into the optimization model: when the original reliability R(t) → 0, the reconstructed gradient undergoes a sharp increase. This threshold-responsive design activates a negative-reward reinforcement mechanism when facility reliability degrades by 68.5% and post-phase stability indices fall below 3.5%, thereby compelling prioritized maintenance resource allocation. Such reliability-driven enforcement effectively prevents lighting-induced tunnel operational discontinuities and safety-critical incidents through gradient-sensitive failure mitigation.

The reconstructed function optimizes decision logic via a multi-dimensional coordination framework, with corresponding multi-objective gradient fields and 3D function space mapping illustrated in Figs. 21 and 22, respectively.

Figure 21 shows that the reconstructed reliability gradient exhibits heightened sensitivity to facility state variations, with the algorithm demonstrating sharper responsiveness to states near critical reliability thresholds.

Figure 22 visualizes how the reconstructed mechanism’s 3D function space mapping captures dynamic reliability-cost interactions in maintenance trajectories, with the functional mapping itself strategically guiding maintenance decisions through multi-objective optimization. Building on this framework, the multi-objective optimization formula is updated as follows:

$$\mathop {\min }\limits_{\pi } \left[ {E_{\pi } \left( C \right), - E_{\pi } \left( {R_{re} (t)} \right)} \right]$$

(25)

Here, the economic objective (minimizing cost C) and the reliability objective (maximizing $R_{re} (t)$) are synergistically coordinated through the DDQN algorithm, thereby enhancing both the stability of tunnel lighting systems and the robustness of maintenance decisions.

Post-reconstruction results

Upon completing the reliability function reconstruction, a dynamic weight-based subproblem optimization sequence is established for the tunnel lighting facility’s multi-objective optimization problem. The specific workflow operates as follows. Priority optimization is triggered when the gradient norm of the reconstructed reliability function exceeds a threshold γ, i.e. $\left\| {\nabla R_{re} (t)} \right\| > \gamma$, which automatically queues the corresponding subproblem for high-priority optimization. The model then inherits parameters from the dual Q-network architecture and implements targeted fine-tuning for subproblem optimization.
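The triggering rule can be sketched with a priority queue: subproblems whose gradient norm exceeds the threshold are queued, and the one with the largest norm is served first. Ordering by gradient norm, the subproblem labels, and the numerical values are assumptions made for this illustration; only the thresholding condition is taken from the description above.

```python
import heapq


def build_priority_queue(grad_norms, threshold):
    """Queue subproblems whose gradient norm exceeds the threshold gamma."""
    # heapq is a min-heap, so norms are negated to pop the largest first.
    heap = [(-g, name) for name, g in grad_norms.items() if g > threshold]
    heapq.heapify(heap)
    return heap


def pop_order(heap):
    """Subproblem names in the order they would be optimized."""
    heap = list(heap)  # work on a copy so the caller's queue is preserved
    order = []
    while heap:
        order.append(heapq.heappop(heap)[1])
    return order


# Hypothetical gradient norms for three weight-vector subproblems.
norms = {"w1": 0.2, "w2": 3.1, "w3": 1.4}
queue = build_priority_queue(norms, threshold=1.0)
print(pop_order(queue))
```

Subproblems below the threshold stay out of the queue and would continue with their regular update schedule.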

As revealed in the figure, the red strong-coupling zone demonstrates tight parameter interdependence during optimization, and visually manifests the DDQN-MOP strategy’s multi-objective collaborative optimization characteristics. This visualization offers critical perspectives for understanding dynamic reward function interactions through gradient correlation patterns. The algorithm’s gradient conflict mitigation mechanism during training is elucidated in Fig. 23, which investigates the dynamic evolution of cost gradients versus reliability gradients.

The color-coded training phases, as shown, demonstrate the DDQN-MOP strategy’s dual-objective gradient coordination capability during optimization, providing visual validation of algorithmic training stability.

Furthermore, to quantitatively validate DDQN-MOP’s performance in maintenance strategy optimization, Fig. 24 contrasts the Pareto front (PF) constructed for the MOP against scatter plots representing single-objective optimization baselines in operational cost and reliability metrics.

From Fig. 24, it is evident that under specific weight vector conditions, the proposed method outperforms single-objective optimization in both operational costs and reliability. Specifically, the incorporation of a reliability target has contributed to a 29.7% reduction in the comprehensive O&M costs of lighting facilities (as shown in Table 6). Among all the evaluated maintenance strategies, the proposed method demonstrates superior cost-effectiveness, outperforming the standard DRL baseline (DDQN, with a cost of 1492.95) and the classic multi-objective reinforcement learning (MORL) baseline (NSGA-II, with a cost of 1536.65) by a substantial margin.

Table 6 Maintenance cost comparison.


To quantitatively evaluate the ability of DDQN-MOP to capture non-convex regions, two standard multi-objective optimization performance metrics are introduced: Hypervolume (HV) and Inverted Generational Distance (IGD). HV measures the volume of the objective space dominated by the generated PF, directly reflecting the coverage of non-convex regions; IGD quantifies the average distance between the generated PF and the reference PF, evaluating both convergence and uniformity. Table 7 presents the quantitative results of these metrics for all compared maintenance strategies. It can be observed that DDQN-MOP achieves the highest HV (0.763) and the lowest IGD (0.089), significantly outperforming the MORL baseline NSGA-II (HV = 0.617, IGD = 0.154) and all single-objective/rule-based strategies. Specifically, the HV of DDQN-MOP is 23.7% higher than that of NSGA-II, indicating that it covers a more comprehensive objective space.
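For two objectives, both metrics are simple to compute directly. The sketch below assumes both objectives are minimized (so reliability would enter as a negated or otherwise inverted score), uses a hypothetical reference point, and works on toy points rather than the fronts reported in Table 7.

```python
import math


def pareto_front(points):
    """Non-dominated subset for two objectives, both minimized."""
    front = []
    for p in sorted(set(points)):
        # After sorting by f1, a point is kept only if it improves f2.
        if not front or p[1] < front[-1][1]:
            front.append(p)
    return front


def hypervolume_2d(points, ref):
    """Area dominated by the front and bounded by the reference point."""
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pareto_front(points):
        if f1 >= ref[0] or f2 >= ref[1]:
            continue  # points outside the reference box contribute nothing
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv


def igd(obtained, reference):
    """Mean distance from each reference point to its nearest obtained point."""
    return sum(min(math.dist(r, p) for p in obtained) for r in reference) / len(reference)


reference_front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(reference_front + [(3.0, 3.0)], ref=(4.0, 4.0)))
print(igd([(2.0, 2.0)], reference_front))
```

A larger HV and a smaller IGD both indicate a better front; an obtained set that coincides with the reference front has an IGD of zero.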

Table 7 Multi-objective performance metrics for Pareto front evaluation.


To validate the necessity and contribution of reliability reconstruction and parameter transfer in the DDQN-MOP strategy, three sets of ablation experiments were designed: ① Full model; ② Without reliability reconstruction, directly optimizing using the original reliability function (DDQN-MOP w/o RR); ③ Without parameter transfer, training with randomly initialized parameters (DDQN-MOP w/o PT). All experiments were conducted on the same hardware platform and training settings.

As shown in Table 8, removing either reliability reconstruction or parameter transfer significantly degrades model performance: the model without reliability reconstruction sees average maintenance costs rise to 1587.30 CNY and reliability drop to 0.72, while the model without parameter transfer also underperforms the complete model. The complete model achieves an HV value of 0.83 and a conflict rate of 0.39, demonstrating optimal multi-objective optimization results. Regarding training efficiency, parameter transfer substantially reduced single-epoch duration and convergence steps, while reliability reconstruction accelerated convergence by optimizing the objective function.

Table 8 Ablation experiment.


The proposed method was trained on a platform with an i7-13400 CPU and an NVIDIA GeForce RTX 3060 GPU, utilizing a neighborhood-based parameter transfer strategy to accelerate training. As measured through PyTorch Lightning’s Trainer with the built-in training_step_timing callback, single-epoch training time decreased from 124 to 72 s (41.9% reduction). NVIDIA Nsight Systems timeline analysis revealed that parameter synchronization time was reduced from 68.2 s to 21.6 s (68.3% improvement). Real-time Linux terminal monitoring showed that GPU utilization increased from 63 to 85% (34.9% enhancement). The decrease in convergence threshold steps (from 18,400 to 14,200; 22.8% reduction), as evidenced by reward curves logged via SummaryWriter, validates the algorithm’s accelerated optimization capability. Using the Platypus library’s HypervolumeIndicator, the multi-objective conflict rate improved from 0.47 to 0.39 (17.0% decrease), confirming solution set quality enhancement.
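The percentages quoted above are relative changes with respect to the value before parameter transfer. The short check below recomputes them from the reported before and after values; the dictionary keys are descriptive labels chosen for this illustration.

```python
def relative_change(before, after):
    """Signed relative change in percent, measured against `before`."""
    return (after - before) / before * 100.0


# Reported (before, after) pairs for the neighborhood-based parameter transfer.
reported = {
    "single-epoch training time (s)": (124.0, 72.0),
    "parameter synchronization time (s)": (68.2, 21.6),
    "GPU utilization (%)": (63.0, 85.0),
    "convergence threshold steps": (18400.0, 14200.0),
    "multi-objective conflict rate": (0.47, 0.39),
}

changes = {name: round(relative_change(b, a), 1) for name, (b, a) in reported.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Note that the GPU utilization figure is a relative gain (22 percentage points on a base of 63%), not a difference in percentage points.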

Conclusions

This study integrates deep reinforcement learning with multi-objective optimization theory to establish a dynamic decision-making framework for intelligent maintenance of lighting systems. An innovative MODRL algorithm architecture was designed to systematically balance operational costs and equipment reliability. The proposed neighborhood parameter transfer strategy employs a gradient transfer network in solution space to enhance training efficiency, resolving parameter update inefficiencies inherent in conventional multi-objective optimization. This approach simultaneously ensures the integrity of Pareto front solutions and guarantees their convergence properties. A weight-space exploration mechanism was developed to generate personalized optimization schemes, achieving the first dynamic adaptation of multi-objective weights in tunnel maintenance applications. The research outcomes not only provide theoretical foundations for intelligent upgrades of tunnel lighting infrastructure but also pioneer new technical approaches for multi-objective collaborative optimization in complex industrial systems. The main conclusions are summarized as follows.

(1) This study develops a MOP framework specifically tailored to tunnel lighting system maintenance strategies. The framework breaks away from the conventional reliance on fixed maintenance thresholds by incorporating equipment reliability as a central decision-making factor. Inspired by the Sigmoid function, an inverse Sigmoid transformation is applied to reformulate the original reliability metric, and the reformulated metric outperforms prior strategies. Experimental results show 68.5% reduced reliability decline near critical failure thresholds with post-optimization stability indices remaining below 3.5%. Overconfidence issues in maintenance scheduling were thereby effectively mitigated.
(2) The MOP framework integrates operational costs and reliability as reward components, achieving dual optimization through their coordinated evaluation. By employing a decomposition strategy that converts the complex optimization problem into manageable scalar optimization subproblems, this approach enhances lighting system reliability and reduces operational costs by 29.7%.
(3) The neighborhood-based parameter transfer strategy significantly accelerates training, enabling more efficient exploration of optimal solutions. Comparative experiments demonstrate a 41.9% reduction in per-epoch training time, a 68.3% decrease in parameter synchronization overhead, and a 34.9% improvement in GPU utilization rate compared to baseline methods. Furthermore, the approach reduces convergence threshold steps by 22.8% and lowers multi-objective conflict rates by 17.0%. These improvements establish new benchmarks for distributed optimization in complex systems.
(4) The proposed framework enables flexible selection of weight combinations to generate customized optimization objectives tailored to diverse operational requirements and application scenarios. Optimization priorities, whether emphasizing cost reduction, reliability enhancement, or balanced trade-offs, are set by adjusting the weight parameters dynamically within the same systematic paradigm. This methodology provides diverse solutions for tunnel lighting maintenance optimization and demonstrates significant practical value across various engineering contexts through its adaptable multi-criteria decision-making architecture.
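
As an illustration of conclusions (1) to (4), the following minimal Python sketch shows how a two-objective problem (cost and reliability) could be decomposed into scalar subproblems indexed by weight vectors, how a neighborhood of nearby weight vectors could be selected for parameter transfer, and how an inverse Sigmoid (logit) transform could reshape reliability. It is a sketch under stated assumptions and not the implementation used in this study: the weighted-sum scalarization, the Euclidean neighborhood rule, the clamping constant `eps`, and all function names (`logit`, `weight_vectors`, `neighbors`, `scalar_reward`) are hypothetical choices made for illustration.

```python
import math


def logit(r, eps=1e-6):
    """Inverse Sigmoid (logit) of a reliability value in (0, 1).

    Assumption: reliability is clamped to [eps, 1 - eps] to avoid log(0).
    """
    r = min(max(r, eps), 1.0 - eps)
    return math.log(r / (1.0 - r))


def weight_vectors(n):
    """Return n evenly spaced weight vectors (w_cost, w_rel) that sum to 1."""
    return [(i / (n - 1), 1.0 - i / (n - 1)) for i in range(n)]


def neighbors(weights, k):
    """For each weight vector, list the indices of its k closest other vectors."""
    result = []
    for i, wi in enumerate(weights):
        others = [j for j in range(len(weights)) if j != i]
        # Stable sort by Euclidean distance in weight space.
        others.sort(key=lambda j: math.dist(wi, weights[j]))
        result.append(others[:k])
    return result


def scalar_reward(cost, reliability, w):
    """Weighted-sum reward: penalize cost, reward logit-transformed reliability."""
    w_cost, w_rel = w
    return -w_cost * cost + w_rel * logit(reliability)


if __name__ == "__main__":
    ws = weight_vectors(5)
    nb = neighbors(ws, k=2)
    for i, w in enumerate(ws):
        # Each weight vector defines one scalar subproblem; in a neighborhood
        # transfer scheme, subproblem i could be initialized from a neighbor.
        print(i, w, nb[i], round(scalar_reward(2.0, 0.9, w), 3))
```

Under these assumptions, a cost-focused operator would pick a weight vector near `(1.0, 0.0)` and a reliability-focused operator one near `(0.0, 1.0)`. Because the logit grows without bound as reliability approaches 0 or 1, small reliability changes near either extreme produce much larger reward changes than they would under the raw metric.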

Data availability

Data will be made available on request.

Acknowledgements

The authors gratefully acknowledge the financial support and the study participants for their time and contribution.

Funding

This work is supported by the Henan Province Science and Technology Research Project (252102321008), the Natural Science Foundation of Zhongyuan University of Technology (K2023QN013), the Funding Program for Young Backbone Teachers in Zhongyuan University of Technology (2023XQG14), the National Key Research and Development Program of China (2023YFC3805900), and the Key Research and Development Projects of Metallurgical Corporation of China (YCC2023Kt01).

Author information

Authors and Affiliations

School of Civil Engineering and Architecture, Zhongyuan University of Technology, Zhengzhou, 450007, China
Zhiliu Wang, Jiabao Tang, Pengda Wei, Wenhao Lun & Yulong Wang
State Key Laboratory of Water Resource Protection and Utilization in Coal Mining, Beijing, 102211, China
Zhiliu Wang
Central Research Institute of Building and Construction Co., Ltd., MCC Group, Beijing, 100088, China
Weihong Yang & Hui Xiao

Contributions

**Zhiliu Wang:** Writing – original draft, Writing – review & editing, Supervision, Project administration, Funding acquisition. **Jiabao Tang:** Writing – original draft, Visualization, Validation, Software. **Pengda Wei:** Project administration, Formal analysis, Conceptualization. **Wenhao Lun:** Validation, Methodology, Conceptualization. **Yulong Wang:** Methodology, Investigation, Formal analysis. **Weihong Yang:** Validation, Methodology, Investigation. **Hui Xiao:** Investigation, Formal analysis, Conceptualization.

Corresponding author

Correspondence to Zhiliu Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Wang, Z., Tang, J., Wei, P. et al. Deep reinforcement learning-driven multi-objective optimization and its applications on lighting infrastructure operation and maintenance strategy. Sci Rep 16, 8989 (2026). https://doi.org/10.1038/s41598-026-37811-5

Received: 20 November 2025
Accepted: 27 January 2026
Published: 13 February 2026
Version of record: 16 March 2026
DOI: https://doi.org/10.1038/s41598-026-37811-5
