Introduction

The pressing concern of global warming is driving a global shift towards electrified mobility. With the transportation sector contributing to approximately 12% of all global emissions1, adjustments are required in order to transition to a zero-emissions energy sector. Studies by the Intergovernmental Panel on Climate Change1 and the International Energy Agency2 emphasize the critical need for clean transportation solutions to address the urgent issue of climate change. This has driven governments and policymakers to innovate and collaborate in advancing electric vehicle (EV) technologies.

Lithium-ion batteries (LIBs) are the preferred energy storage technology for EVs due to their superior power and energy density, which enables longer driving ranges compared to other battery technologies3. For a compelling and sustainable EV mass market, accurate state of health (SOH) estimation4 and remaining useful life (RUL)5 prediction of LIB systems are essential. Existing methods for SOH estimation and RUL prediction can be broadly divided into model-based and data-driven approaches. Model-based estimation approaches rely on empirical or equivalent circuit models (ECMs), or electrochemical models, and formulate estimation algorithms around them. Various ECM-based filters for SOH estimation have been proposed in the literature, including Extended Kalman Filter6, dual and joint Extended Kalman Filter7, Unscented Kalman Filter8, Adaptive Extended Kalman Filter9, Particle Filter10 and genetic algorithms11.

Other methods for SOH estimation and RUL prediction, utilizing empirical degradation models, include Unscented Kalman Filters12 and Particle Filters13. In a Bayesian Monte Carlo approach14, the parameters of an empirical capacity model are updated to compute the posterior probability density function for capacity fade prediction. Despite their simplicity, these methods lack explicit physical understanding and require significant calibration effort. Electrochemical battery models15,16,17, which demand increased computational power, have also been employed, and adaptive observers based on the enhanced single particle model18,19 have been tested in a battery-in-the-loop setup.

With the advancement of cloud computing technologies and the Internet of Things, data-driven methods for battery SOH estimation, such as linear regression, Gaussian process regression, support vector machines, or artificial neural networks, have gained traction in recent years20. For instance, multiple linear regression models have been trained using descriptive features of the voltage distribution21 or incremental capacity curves22 to predict capacity fade and resistance increase. Among the more sophisticated prediction models, Gaussian process regression models have been used for capacity estimation23, taking as inputs different statistical features extracted from the charging curves24. Furthermore, neural networks25 are used to establish relationships between input features, such as equivalent circuit model parameters and state of charge (SOC), and battery capacity fade. Regarding RUL prediction, support vector machines4 and random forests26 are utilized. These methods are effective in forecasting the remaining operational lifespan of batteries based on historical data and operational conditions. Battery SOH estimation works can be classified into three primary categories based on the dataset used for development. The first category includes datasets acquired from field operations, which accurately reflect the aging phenomena affecting batteries in real-world EV driving27. However, a challenge with these datasets is the absence of a baseline for evaluating SOH. To address this limitation, several studies28,29,30 utilize internal resistance as a direct metric for assessing battery SOH. Alternatively, some studies31 suggest using the peak values derived from incremental capacity curves to overcome this challenge. However, these metrics can be challenging to evaluate due to their strong dependence on operating conditions, such as temperature. 
Capacity fade is often measured using Coulomb counting32,33, which involves integrating battery management system (BMS) current over a limited SOC window. This method, however, may produce inaccurate results due to high sensor noise and quantization in in-vehicle sensors. Conversely, tests are conducted in a temperature-controlled environment34 to ensure consistent capacity measurements that serve as ground truth. Here, SOH is estimated using supervised learning models that directly utilize BMS signals (such as voltage, current, SOC, and pack temperature) as inputs. The second category of SOH estimation works relies on datasets collected in laboratory settings24,35,36,37,38. In these datasets, cells are cycled with current profiles that do not accurately represent actual EV battery operation. As a result, features computed using these datasets are, in general, neither transferable nor generalizable to real-world applications. The third category of datasets utilizes data collected in laboratory settings designed to mimic real-world EV use. Examples include ARTEMIS39 and the Urban Dynamometer Driving Schedule (UDDS)40. These datasets provide more realistic conditions for testing and developing battery SOH estimation algorithms, while still providing ground truth capacity through periodic reference performance tests (RPTs).

Using these datasets, different machine learning algorithms, such as support vector machine4, Gaussian process regression, or neural network41, have been used to estimate SOH for batteries undergoing EV driving cycles using statistical features from current, voltage, and temperature signals as input features. Other studies have proposed various physics-based health indicators to estimate battery capacity fade, e.g., features derived from the ECM42 or the time taken for voltage to rise from a low to a high level during the charging process43. Another crucial physical quantity linked to battery aging is internal resistance, with several nuanced indicators proposed in the literature44,45. Most of these indicators are computed using ECMs and algorithms such as recursive least squares. However, these methods typically come with increased computational requirements. Another approach27,46 involves evaluating the indicators during the vehicle acceleration (discharge) and braking (charge), requiring less computational power and facilitating its real-time implementation and integration within the vehicle BMS. Another physics-based SOH indicator is charging impedance27, which combines variations in electrolyte resistance, charge transfer resistance, and polarization due to aging. This feature can be extracted from the initial portion of the charging phase47. Additionally, the energy during charging and discharging can offer valuable insights into battery degradation. Energy metrics are typically calculated over extended portions of full charging profiles to effectively estimate capacity fade48,49,50.

The health features used in previous studies are typically based on idealized constant charging and discharging profiles. However, these profiles do not accurately reflect how electric vehicles are charged and discharged in real-world conditions. Most research has focused on extracting health indicators during complete, repetitive charging cycles, where the battery is charged from a low SOC to a high SOC. In reality, charging patterns are much more variable, and it’s uncommon for batteries to go through full cycles or always follow the same charging profile. This discrepancy makes it difficult to apply findings from controlled experiments directly to real-world EV use.

The contributions of this work are the following. First, our work systematically formulates various SOH indicators based on domain knowledge and proposes a framework for their integration into BMS. The proposed SOH indicators include: power autocorrelation, resistance, charging impedance, energy during charging, and energy during discharging. Second, unlike previous research51 that focused on voltage signal autocorrelation, here, the power autocorrelation is used to quantify the battery’s power-delivery capability over time. Additionally, the proposed SOH indicators are derived from an experimental dataset40 that replicates real-world EV battery operation. Unlike most prior studies48,49,50 that rely on constant current discharging profiles, in this work, energy consumption is evaluated during discharging under realistic EV driving cycles. Moreover, a windowed approach is proposed to assess energy consumption during charging, thereby improving the effectiveness of the energy as an indicator of battery health, especially in scenarios involving partial charging. Furthermore, we modify the formulation of the charging impedance indicator27 by calculating it over an optimized voltage window and averaging the values within this range to improve accuracy and reliability in assessing battery health during partial charging. Through correlation analysis, power autocorrelation, energy during charging, and energy during discharging emerge as the most effective indicators for capacity estimation. It is worth noting that the proposed SOH indicators are agnostic to specific battery chemistries. Moreover, they operate independently of cumulative data such as total aging cycles or ampere-hour throughput. This design choice helps mitigate inaccuracies that could arise from sensor errors or insufficient data. These indicators can be easily evaluated during EV operation. This makes them suitable for real-time deployment and integration into existing BMS strategies.

The battery capacity is estimated through the machine learning pipeline shown in Supplementary Note 1 where 1) SOH indicators are first extracted from the experimental dataset40, 2) the correlation between the indicators and SOH is analyzed through a regression analysis, and 3) a linear regression model (LRM) is trained to estimate capacity fade. The results show that models trained using power autocorrelation and energy-based features obtain capacity estimation with absolute percentage errors (APE) ranging between 1.5% and 2.5%.

Previous works that used linear regression models to estimate battery SOH21,22 are based on features selected from simplistic charge/discharge profiles that are not representative of EV driving.

Results

The SOH indicators are extracted using data from five Nickel Manganese Cobalt (NMC)/Graphite cells40 (reported in Table 1 and detailed in Sec. Cell cycling and experimental dataset). In this section, a thorough analysis of each indicator and regression analysis is carried out, and the estimation results obtained by the linear regression model are shown.

Table 1 Battery cells

SOH indicators analysis

Power autocorrelation function

The autocorrelation function of the battery’s power signal, evaluated during discharge, has been shown to offer valuable insights into the battery SOH51. Assuming that discharging occurs periodically with an identical current profile, the power autocorrelation PAutocorr indicator varies as the battery ages, providing a method to monitor battery health.

Of particular interest is the change in the central peak of the power autocorrelation function (see Fig. 1a). The reduction in this peak, defined as PAutocorr,loss = (PAutocorr,fresh − PAutocorr,i)/PAutocorr,fresh × 100, where PAutocorr,fresh is the central peak value for the fresh cell and PAutocorr,i is the central peak value during cycle i, correlates with a decrease in capacity. This relationship is illustrated in Fig. 1b, where the power autocorrelation loss shows a strong linear relation with capacity loss. Capacity loss is calculated as Qcell,loss = (Qcell,fresh − Qcell,i)/Qcell,fresh, with Qcell,fresh and Qcell,i representing the fresh cell capacity and the cell capacity at cycle i, respectively.
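The peak-loss calculation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the autocorrelation is the raw (unnormalized, not mean-removed) autocorrelation of the power signal P = V·I, and the function names and the synthetic signals are hypothetical.

```python
import numpy as np


def autocorr_central_peak(voltage, current):
    """Zero-lag (central) peak of the raw autocorrelation of the power signal.

    Assumption: raw autocorrelation, with no mean removal or normalization.
    """
    power = np.asarray(voltage, dtype=float) * np.asarray(current, dtype=float)
    # For a length-n signal, the zero-lag term of the full correlation is at index n - 1.
    acf = np.correlate(power, power, mode="full")
    return acf[len(power) - 1]


def percentage_loss(fresh_value, aged_value):
    """(fresh - aged) / fresh * 100, the form used for P_Autocorr,loss."""
    return (fresh_value - aged_value) / fresh_value * 100.0


# Synthetic example (made-up numbers): identical current profile,
# slightly lower terminal voltage for the aged cell.
t = np.arange(600)
current = 2.0 + np.sin(2 * np.pi * t / 120)
peak_fresh = autocorr_central_peak(np.full(t.size, 3.7), current)
peak_aged = autocorr_central_peak(np.full(t.size, 3.6), current)
print(f"P_Autocorr,loss = {percentage_loss(peak_fresh, peak_aged):.2f} %")
```

Because the zero-lag term equals the sum of the squared power samples, the indicator is only comparable across cycles when the same current profile is applied, which is the periodicity assumption stated in the text.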

Fig. 1: Power autocorrelation.

a Power autocorrelation (PAutocorr) profiles calculated over discharge profiles for cell W8 throughout its lifetime. b Percentage decrease of the peak amplitude at null delay (PAutocorr,loss) is plotted against the percentage capacity loss (Qcell,loss).

Despite the promising potential of this indicator for estimating capacity loss, it is important to highlight that the periodicity of the current profile may not hold true in real-driving conditions. Nevertheless, this study suggests that this indicator can be engineered in an offline setting, for example, as part of onboard diagnostics routines. In this work, the power autocorrelation function proves to be an effective SOH indicator given consistent usage of the UDDS discharge profile.

Resistance

Abrupt charge and discharge events, related to braking and acceleration maneuvers, respectively, offer the opportunity to evaluate the battery’s internal resistance27. As the battery ages, various factors such as electrode degradation, electrolyte breakdown, and formation of passivation layers contribute to an increase in its internal resistance, R. This increase limits the flow of ions within the battery, reducing conductivity and affecting the battery’s power output capability. As resistance increases, less power can be delivered to the motors due to higher Joule losses.

Demanding acceleration and braking events lead to changes in the battery current, referred to as current peaks. The resistance is calculated at each discharge current peak corresponding to an acceleration event over the discharging phase of the aging cycle, as described in Sec. Definition of SOH indicators. It is important to note that the battery’s internal resistance is influenced by factors such as SOC (see Supplementary Note 3), C-rate, and temperature. For accurate aging assessments in real-world scenarios, resistance should be measured under consistent conditions throughout the battery’s lifespan. In this work, temperature effects on this indicator are not studied since the cells are maintained in a controlled temperature environment.

A single resistance value is computed by averaging the resistances calculated during each discharge phase, which consists of multiple concatenated UDDS cycles between two charging phases, as further detailed in Sec. Cell cycling and experimental dataset. Averaging reduces noise and variation in the resistance measurements and yields a more consistent and representative value for assessing battery health, as shown in Fig. 2. Figure 2a highlights the importance of determining the average resistance. Despite the large standard deviation observed in the distribution of internal resistances for each discharge event, the average values, represented by green points, clearly exhibit an increasing trend as the battery ages.

Fig. 2: Resistance.

a Internal resistance (R) is plotted as a function of the current peak number during the discharge phases throughout the cell’s life span for cell W8. Shades of gray represent different batches of aging cycles. Batch j is defined as the period between the jth and the (j + 1)th reference performance test (RPT). Green points indicate the average internal resistance. b Percentage increase in internal resistance (Rincrease) relative to capacity loss (Qcell,loss) for all five cells.

Additionally, Fig. 2b shows the percentage increase in average internal resistance for all five cells, correlated with their corresponding capacity losses. This increase is calculated as Rincrease = (Ri − Rfresh)/Rfresh × 100, where Rfresh represents the average internal resistance measured during the first discharging phase of the cell, and Ri denotes the average internal resistance determined during the discharging phase at cycle i.
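A rough sketch of the averaging and of Rincrease is given below. The exact resistance definition is given in Sec. Definition of SOH indicators and is not reproduced here, so this example assumes the common estimate R = |ΔV/ΔI| across an abrupt change in current; the threshold `min_step`, the function names, and the sample data are all hypothetical. For brevity the sketch uses every abrupt current change, whereas the text restricts the calculation to discharge current peaks.

```python
import numpy as np


def peak_resistances(voltage, current, min_step=1.0):
    """Resistance estimates |dV / dI| at samples where the current changes abruptly.

    Assumption: min_step (in A) is an illustrative threshold for a current peak.
    """
    dv = np.diff(np.asarray(voltage, dtype=float))
    di = np.diff(np.asarray(current, dtype=float))
    is_peak = np.abs(di) >= min_step
    return np.abs(dv[is_peak] / di[is_peak])


def resistance_increase(r_fresh, r_i):
    """R_increase = (R_i - R_fresh) / R_fresh * 100."""
    return (r_i - r_fresh) / r_fresh * 100.0


# Made-up discharge segment with two abrupt current changes.
voltage = [3.80, 3.80, 3.70, 3.70, 3.79]
current = [0.0, 0.0, 4.0, 4.0, 0.4]

r_all = peak_resistances(voltage, current)
r_avg = r_all.mean()  # one averaged value per discharge phase
print(f"average R = {r_avg * 1000:.1f} mOhm from {r_all.size} current peaks")
```

Averaging over all peaks of a discharge phase is what suppresses the scatter visible in Fig. 2a.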

Charging impedance

The charging impedance27, ZCHG, represents the battery’s resistance to the flow of electrons during charging. Variations in ZCHG reflect how this resistance evolves as the battery ages. The ZCHG profiles for three cells (V4, W8, and W9), charged at different C-rates, are illustrated in Fig. 3 as a function of cell degradation and SOC.

Fig. 3: Charging impedance (ZCHG) as function of state of charge (SOC) and cycle number for cells V4 (a), W8 (b), and W9 (c).

The yellow curve represents fresh cell conditions, while the dark blue curve denotes aged cell conditions.

The rising trend of ZCHG over the cells’ lifetime aligns with the understanding that, as the battery ages, its overpotential increases due to factors such as the growth of the Solid-Electrolyte Interface, an increase in contact resistance, and changes in reaction kinetics and transport dynamics6,52. Additionally, it is important to note that the ZCHG profiles reach different SOC values at the end of charge (at 4 V). This phenomenon can be attributed to the varying polarization losses53 that result from the different C-rates used during charging for cells V4, W8, and W9. The charging impedance indicator is computed by averaging the impedance within the specific voltage range [Vin = 3.8 V, Vfin = 3.9 V], which is selected through the analysis reported in Supplementary Note 4. As shown in Fig. 4b, the increase in charging impedance (ZCHG,increase = (ZCHG,i − ZCHG,fresh)/ZCHG,fresh × 100) is highly correlated with capacity loss across all the battery cells. Therefore, the charging impedance ZCHG can be used directly as a feature to correlate with capacity loss.
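The windowed averaging step can be illustrated with a short sketch. It assumes that a ZCHG profile has already been computed sample by sample (its definition is not reproduced here) and only shows the averaging over [Vin = 3.8 V, Vfin = 3.9 V] and the percentage increase; the function names and the numbers are hypothetical.

```python
import numpy as np


def mean_in_voltage_window(voltage, z_chg, v_in=3.8, v_fin=3.9):
    """Average a precomputed Z_CHG profile over the samples with v_in <= V <= v_fin."""
    voltage = np.asarray(voltage, dtype=float)
    z_chg = np.asarray(z_chg, dtype=float)
    mask = (voltage >= v_in) & (voltage <= v_fin)
    if not mask.any():
        raise ValueError("no samples inside the voltage window")
    return z_chg[mask].mean()


def impedance_increase(z_fresh, z_i):
    """Z_CHG,increase = (Z_CHG,i - Z_CHG,fresh) / Z_CHG,fresh * 100."""
    return (z_i - z_fresh) / z_fresh * 100.0


# Made-up charging profiles (impedance in mOhm) for a fresh and an aged cycle.
v = [3.7, 3.8, 3.85, 3.9, 4.0]
z_fresh = mean_in_voltage_window(v, [10.0, 20.0, 30.0, 40.0, 50.0])
z_aged = mean_in_voltage_window(v, [12.0, 23.0, 33.0, 43.0, 55.0])
print(f"Z_CHG,increase = {impedance_increase(z_fresh, z_aged):.1f} %")
```

Restricting the average to a fixed voltage window means the indicator can still be evaluated when a charge event covers only part of the full SOC range.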

Fig. 4: Charging impedance.

a Charging Impedance (ZCHG) as a function of voltage for cell W8. Charging impedance is averaged over the voltage range [Vin = 3.8 V, Vfin = 3.9 V]. b Percentage variation of average ZCHG (ZCHG,increase) as a function of capacity fade (Qcell,loss).

Energy during charging

The energy during charging indicator, Ech, quantifies the energy stored in the battery during charging. This is computed by integrating the battery power within a specific voltage range [Vin,ch, Vfin,ch] (as detailed in Sec. Definition of SOH indicators). Figure 5a illustrates Ech in relation to the charging duration required to reach Vfin,ch from Vin,ch. Figure 5b shows the energy during charging over the voltage range [Vin,ch = 3.6 V, Vfin,ch = 3.9 V] as a function of capacity loss. The y-axis of Fig. 5b quantifies the percentage energy loss during charging for each cell. Energy loss during charging for each cell is computed as Ech,loss = (Ech,fresh − Ech,i)/Ech,fresh × 100, where Ech,fresh is the energy for the fresh cell and Ech,i is the energy charged into the battery during aging cycle i of the same cell. These results show that capacity loss is linearly correlated with energy loss during charging over the selected voltage range.
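As an illustration of the windowed energy calculation, the sketch below integrates |V·I| with the trapezoidal rule over the samples whose voltage lies in the chosen window. It assumes the voltage crosses the window only once (as in a monotonic charge), so that the selected samples are contiguous in time; the function names, units, and sample data are hypothetical.

```python
import numpy as np


def energy_in_window(time_s, voltage, current, v_in, v_fin):
    """Energy in Wh from trapezoidal integration of |V * I| inside a voltage window.

    Assumption: the voltage passes through the window once, so the masked
    samples form one contiguous time segment.
    """
    t = np.asarray(time_s, dtype=float)
    v = np.asarray(voltage, dtype=float)
    i = np.asarray(current, dtype=float)
    lo, hi = sorted((v_in, v_fin))
    mask = (v >= lo) & (v <= hi)
    t_w = t[mask]
    p_w = np.abs(v[mask] * i[mask])
    energy_j = np.sum(0.5 * (p_w[1:] + p_w[:-1]) * np.diff(t_w))
    return energy_j / 3600.0


def energy_loss(e_fresh, e_i):
    """E_loss = (E_fresh - E_i) / E_fresh * 100."""
    return (e_fresh - e_i) / e_fresh * 100.0


# Made-up constant-current charge at 2 A sampled every 30 minutes.
time_s = [0, 1800, 3600, 5400, 7200]
voltage = [3.5, 3.6, 3.75, 3.9, 4.0]
current = [2.0] * 5
e_ch = energy_in_window(time_s, voltage, current, v_in=3.6, v_fin=3.9)
print(f"E_ch = {e_ch:.2f} Wh")
```

An aged cell traverses the same voltage window in less time, so the integrated energy drops, which is the trend shown in Fig. 5a.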

Fig. 5: Energy during charging and discharging.

a Energy during charging (Ech) as a function of charging time within the voltage range [Vin,ch = 3.6 V,Vfin,ch = 3.9 V] for cell W8 throughout its cycle life. b Energy loss during charging (Ech,loss) across all 5 cells shows a linear correlation with capacity loss (Qcell,loss). c Energy during discharging (Edis) as a function of discharging time within the voltage range [Vin,dis = 3.85 V, Vfin,dis = 3.4 V] for cell W8 throughout its cycle life. d Energy loss during discharging (Edis,loss) for all five cells demonstrates a linear correlation with capacity loss (Qcell,loss).

Energy during discharging

The energy during discharging indicator, Edis, quantifies the energy delivered by the battery during its discharge phase. This is computed by integrating the battery power over a specific voltage range [Vin,dis, Vfin,dis], as detailed in Sec. Definition of SOH indicators. Figure 5c illustrates Edis in relation to the discharging duration needed to reach Vfin,dis from Vin,dis. Figure 5d displays the energy during discharging over the voltage range [Vin,dis = 3.85 V, Vfin,dis = 3.4 V] as a function of capacity loss. The y-axis of Fig. 5d quantifies the percentage energy loss during discharging for each cell. Energy loss during discharging for each cell is computed using Edis,loss = (Edis,fresh − Edis,i)/Edis,fresh × 100, where Edis,fresh represents the energy of a fresh cell, and Edis,i is the energy discharged during aging cycle i of the same cell. The results indicate a linear relationship between capacity loss and energy loss during discharging within the selected voltage range. In real EV scenarios, the variability in discharging rates complicates the consistent computation and monitoring of Edis. A practical approach is to compare Edis across driving scenarios with similar driving styles to account for this variability.

SOH indicators regression analysis

The health indicators are pre-processed according to the pipeline outlined in Supplementary Note 1. This process involves calculating incremental values for each feature: ΔPAutocorr (power autocorrelation), ΔRch (resistance), \(\Delta {Z}_{{{\rm{CHG}}}}^{{{\rm{NORM}}}}\) (normalized charging impedance), ΔEch (energy during charging), and ΔEdis (energy during discharging). These incremental values are derived by subtracting the initial feature value, measured during the first aging cycle, from the value at each subsequent aging cycle i throughout the cell’s life cycle. Additional details are provided in Sec. Methods. In this work, we use features’ incremental values to simplify the detection of aging trends. For each cell, we assess the correlation between its capacity loss and feature variations using Pearson’s correlation coefficient r, defined as:

$$r=\frac{\sum_{i=1}^{N}(X_{i}-\overline{X})(Y_{i}-\overline{Y})}{\sqrt{\sum_{i=1}^{N}(X_{i}-\overline{X})^{2}\,\sum_{i=1}^{N}(Y_{i}-\overline{Y})^{2}}}, \tag{1}$$

where Xi represents the value of a specific incremental feature for a given cell at the i-th aging cycle, Yi is the corresponding capacity loss value for the same cell at that cycle, \(\overline{X}\) is the mean of the incremental feature values, \(\overline{Y}\) is the mean of the capacity loss values across all cycles, and N is the total number of data points (aging cycles analyzed). The results are shown in the heatmap of Fig. 6a. Each cell shows a high Pearson’s correlation coefficient between capacity loss and each feature, underscoring that the variations in these features are consistent indicators of aging across all the cells. However, since feature trends can vary across different cells, an additional analysis was conducted to identify features with more generalizable trends. We performed a correlation analysis between the extracted incremental features and capacity fade across all cells. This approach helps identify features that consistently reflect cell aging, regardless of individual cell differences. Figure 6b shows that some indicators generalize better across different cells.
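Equation (1) translates directly into a few lines of code. The sketch below implements it term by term; the feature and capacity loss values are made up for illustration, and the function name is hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson's correlation coefficient, written out as in Eq. (1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))


# Made-up incremental feature values (X_i) and capacity loss values (Y_i).
delta_feature = np.array([0.0, -0.4, -0.9, -1.3, -1.8, -2.1])
capacity_loss = np.array([0.0, 1.1, 2.0, 3.2, 3.9, 5.0])

r = pearson_r(delta_feature, capacity_loss)
print(f"r = {r:.3f}")
```

A value of r close to −1 or +1 indicates a strong linear relationship; the sign only reflects whether the feature decreases or increases as capacity is lost.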

Fig. 6: Correlation analysis.

a Heatmap showing the Pearson’s correlation coefficients between incremental power autocorrelation (ΔPAutocorr), incremental energy during charging (ΔEch), incremental energy during discharging (ΔEdis), normalized incremental charging impedance (\(\Delta {{Z}}_{{{\rm{CHG}}}}^{{{\rm{NORM}}}}\)), and incremental resistance (ΔR) with capacity loss for each individual cell. b Histogram illustrating the Pearson’s correlation coefficients between incremental features and capacity loss across all cell data. Note that the correlation between the capacity of cell W7 and ΔR is not reported due to some computed resistances being deemed unreliable because of data acquisition issues (Sec. Cell cycling and experimental dataset and Supplementary Note 5), and thus interpreted as outliers during the pre-processing phase.

We select features to train an estimation model according to two different cases. In the first case, power autocorrelation (PAutocorr) is selected as the sole feature due to its superior overall performance. In the second case, we choose the best-performing feature for charging (energy during charging, Ech) and the best-performing feature for discharging (energy during discharging, Edis), excluding power autocorrelation.

The strong correlation between the extracted features and capacity fade can be attributed to the physical phenomena driving battery degradation. The linear relationship observed between charging impedance, resistance, and energy features with respect to charge throughput aligns with the linear trend of the capacity fade curve54. Given that the cells are cycled within a linear SOC window of 80% to 20% at ambient temperature, Solid-Electrolyte Interface layer growth is considered the dominant aging mechanism, leading to a linear capacity decrease trajectory. However, to thoroughly assess the aging modes present in the cells, a post-mortem analysis would be necessary.

SOH estimation

In this paper, we use the capacity calculated at C/20 during RPTs as the SOH metric. Additionally, for the purpose of training the machine learning models, the experimental C/20 capacity points are augmented using a linear data augmentation method as discussed in Sec. Data augmentation approach.
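One plausible reading of linear data augmentation is sketched below: capacity values between consecutive RPTs are filled in by linear interpolation over the aging cycle number. The actual procedure is described in Sec. Data augmentation approach, so this is an assumption, and the function name, cycle numbers, and capacities are made up.

```python
import numpy as np


def augment_capacity(rpt_cycles, rpt_capacity, all_cycles):
    """Linearly interpolate C/20 capacity between RPT points (assumed method)."""
    return np.interp(all_cycles, rpt_cycles, rpt_capacity)


# Made-up RPT results: cycle number at each RPT and measured capacity in Ah.
rpt_cycles = [0, 100, 200]
rpt_capacity = [4.85, 4.70, 4.60]

all_cycles = np.arange(0, 201, 25)
augmented = augment_capacity(rpt_cycles, rpt_capacity, all_cycles)
for cycle, q in zip(all_cycles, augmented):
    print(f"cycle {cycle:3d}: {q:.3f} Ah")
```

This gives one capacity label per aging cycle, so that every extracted feature value can be paired with a target during training.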

The features selected through the regression analysis are utilized to estimate capacity loss using a data-driven model. The performance of various models, namely, LRM, feed-forward neural networks, autoregressive moving average with extra input, and recurrent neural networks, is compared using the same training and testing datasets, as detailed in Supplementary Note 6. Despite its simplicity, the LRM achieves estimation performance comparable to that of more complex models, owing to the strong linear correlation between the SOH indicators and capacity degradation. Therefore, the LRM is chosen for capacity loss estimation due to its lower computational time. Additionally, the LRM has the advantage of requiring fewer parameters to tune and fewer training samples compared to neural network-based models55. The LRM is trained using distinct sets of incremental SOH features: first with power autocorrelation, and then with energy during charging and energy during discharging (see Sec. SOH indicators regression analysis). Additionally, the estimation capabilities of the selected features are evaluated in two Scenarios. In Scenario 1 the LRM is trained exclusively on the data from cell W8 and tested on the other cells. In Scenario 2 the LRM is trained using data from all cells except the test cell. In the second Scenario, for cross-validation, the data is split into two subsets: one for the target cell and another for the remaining cells. The model is trained on the data from the remaining cells and tested on the data from the target cell.
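The two training scenarios can be sketched with an ordinary least-squares model. Everything below is illustrative: the cell data are synthetic, the helper names are hypothetical, and the absolute percentage error is assumed to be |Q_est − Q_true|/Q_true × 100 (its definition is given in Sec. Methods and is not reproduced here).

```python
import numpy as np


def fit_lrm(features, target):
    """Ordinary least squares with an intercept term."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


def predict_lrm(coef, features):
    X = np.column_stack([np.ones(len(features)), features])
    return X @ coef


def max_ape(q_true, q_est):
    """Largest absolute percentage error (assumed definition)."""
    return np.max(np.abs(q_est - q_true) / np.abs(q_true)) * 100.0


def make_synthetic_cell(seed, n=50):
    """Synthetic cell: two noisy incremental energy features and capacity in Ah."""
    rng = np.random.default_rng(seed)
    capacity = np.linspace(4.8, 4.3, n)
    lost = capacity[0] - capacity
    d_ech = -3.0 * lost + rng.normal(0.0, 0.005, n)
    d_edis = -3.5 * lost + rng.normal(0.0, 0.005, n)
    return np.column_stack([d_ech, d_edis]), capacity


cells = {name: make_synthetic_cell(k)
         for k, name in enumerate(["V4", "W5", "W7", "W8", "W9"])}


def scenario_1(cells, train_cell="W8"):
    """Train on a single cell, test on every other cell."""
    coef = fit_lrm(*cells[train_cell])
    return {name: max_ape(q, predict_lrm(coef, x))
            for name, (x, q) in cells.items() if name != train_cell}


def scenario_2(cells):
    """Leave-one-cell-out: train on all cells except the test cell."""
    errors = {}
    for test_cell, (x_test, q_test) in cells.items():
        x_train = np.vstack([x for n, (x, _) in cells.items() if n != test_cell])
        q_train = np.concatenate([q for n, (_, q) in cells.items() if n != test_cell])
        coef = fit_lrm(x_train, q_train)
        errors[test_cell] = max_ape(q_test, predict_lrm(coef, x_test))
    return errors


print("Scenario 1:", scenario_1(cells))
print("Scenario 2:", scenario_2(cells))
```

Keeping the test cell entirely out of the training set in both scenarios is what makes the reported errors a measure of cell-to-cell generalization rather than of fit quality.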

Since the autocorrelation function of the power signal ΔPAutocorr exhibits the highest correlation with capacity fade, the data-driven model is initially trained using ΔPAutocorr as input. Figure 7 displays the capacity estimation results for both ΔPAutocorr and the energy-based features. In Scenario 1, the training dataset consists solely of data from cell W8, while in Scenario 2, it includes data from all cells except the test cell. The absolute percentage error (as defined in Sec. Methods) remains consistently below 1.5%, underscoring the relevant information provided by this individual feature. Moreover, using a more extensive set of training data from multiple cells (Scenario 2) does not improve estimation accuracy, which suggests that ΔPAutocorr is effective even with limited data. However, this feature has limitations in real-world scenarios and is better suited for offline diagnostics than for online applications. It is also important to note that gaps in the observed capacity curves are due to voltage measurement anomalies, which resulted in unreliable feature values. This irregularity is attributed to unidentified equipment issues, as discussed in Sec. Cell cycling and experimental dataset and detailed further in Supplementary Note 5.

Fig. 7: SOH estimation results from the linear regression model (LRM) using power autocorrelation (\({{{\rm{LRM}}}}_{{P}_{{{\rm{Autocorr}}}}}\)), and energy during charging and discharging (\({{{\rm{LRM}}}}_{{E}_{{{\rm{ch}}}},{E}_{{{\rm{dis}}}}}\)) as input features versus the aging cycle number (Cycle).

Profiles of capacity loss and estimation error for cells V4 (a), W5 (b), W7 (c), and W9 (d). Augmented capacity points (obtained as discussed in Sec. Data augmentation approach) are shown in red. SOH estimation using power autocorrelation as input is shown in brown (with training data from cell W8) and yellow (with training data from all cells except the test cell). The dark blue and light blue lines show SOH estimation using energy features as input, with training data from cell W8 (dark blue) and from all available cells except the test cell (light blue). Gaps in the capacity curves for cells W5 (b) and W7 (c) are due to voltage measurement anomalies affecting the reliability of feature values (see Sec. Cell cycling and experimental dataset and Supplementary Note 5). The capacity drop for cell W9 (d) results from issues with the aging protocol implementation.

The LRM is subsequently trained using features that can be calculated during vehicle operation, specifically during driving and charging. The features selected for their high linear correlation with capacity during charging and discharging are energy during charging (ΔEch) and energy during discharging (ΔEdis), respectively. As illustrated in Fig. 7, accurate capacity fade estimation is achieved with these features. Notably, when the LRM is trained using data from only cell W8 (Scenario 1), it achieves an absolute percentage error below 2.5% when tested on data from the other four cells. This result highlights the strong estimation capability of these features even with a limited dataset. For a more comprehensive analysis, the same estimation model is trained using data from multiple cells, leading to improved performance with the larger dataset. When using data from four cells for training (Scenario 2) and testing on the remaining cell, the absolute percentage error is below 1.6%. Notably, the estimation models perform well even for cells like W7 and W5, where some data is missing. This adaptability of the features and estimation models to partially available data is particularly advantageous in real-world scenarios, where acquiring complete EV battery data may not always be feasible. Moreover, to evaluate if adding extra features alongside the energy-based indicators could enhance model estimation capabilities, the LRM was also trained with incremental resistance and charging impedance included as additional inputs. However, the performance of the model with these additional features was worse than when using only energy during charging and discharging, as shown in Fig. 8. This indicates that the inclusion of resistance and charging impedance may introduce more noise than valuable information. 
It should be noted that for cell W7, only charging impedance is used as an additional feature, as the resistance data was compromised due to acquisition issues discussed in Sec. Methods. The superior performance of energy during charging and discharging as SOH indicators, compared to the increase in resistance or charging impedance, can be attributed to several factors. Energy loss reflects not only resistance increases but also other factors such as heat generation, electrode degradation, and Solid-Electrolyte Interface formation, which impact overall energy efficiency. Additionally, the integration of the power signal offers a comprehensive measure of battery energy dynamics throughout an entire cycle, whereas resistance and charging impedance are computed over shorter time periods, making them more sensitive to short-term fluctuations.

Fig. 8: SOH estimation results from the linear regression model (LRM) using charge and discharge energies, charging impedance, and resistance as features versus the aging cycle number (Cycle).

The capacity loss and estimation error profiles for cells V4 (a), W5 (b), W7 (c), and W9 (d) are shown. Augmented Capacity points (obtained as discussed in Sec. Data augmentation approach) are shown in red. Three scenarios are displayed: blue represents the LRM output trained solely with energy during charging and energy during discharging (\({{{\rm{LRM}}}}_{{E}_{{{\rm{ch}}}},{E}_{{{\rm{dis}}}}}\)); the light purple represents the LRM output trained with energy during charging, energy during discharging along with charging impedance (\({{{\rm{LRM}}}}_{{E}_{{{\rm{ch}}}},{E}_{{{\rm{dis}}}},{Z}_{{{\rm{CHG}}}}}\)); the dark purple represents the LRM output trained with energy during charging, energy during discharging, charging impedance, and resistance (\({{{\rm{LRM}}}}_{{E}_{{{\rm{ch}}}},{E}_{{{\rm{dis}}}},{Z}_{{{\rm{CHG}}}},R}\)). All models are trained exclusively using data from cell W8.

Conclusions

This work extracts and evaluates five knowledge-based SOH indicators, demonstrating their effectiveness as inputs to ML models for estimating capacity fade. The formulation of these indicators is guided by battery domain knowledge, allowing for the quantification of internal state variability due to battery degradation. Since none of the indicators rely on cumulative information (such as cycle number or Ah-throughput), they are suitable for real-world applications even with partial battery history. The high correlation between the indicators and capacity indicates that battery aging mechanisms leading to capacity fade are directly related to energy decrease and impedance rise. Two subsets of the engineered indicators, i.e., power autocorrelation on its own and the pair formed by energy during charging and energy during discharging, were used to train the estimation model for accurate cell capacity estimation. Due to their high correlation with capacity fade, combining energy during charging and energy during discharging as inputs results in accurate SOH estimation, with an absolute percentage error consistently below 2.5%. Power autocorrelation, in turn, is the most informative feature, enabling precise capacity fade estimation with an absolute percentage error below 1.5%, even with limited training data. However, its effectiveness is influenced by the periodicity of discharging events. Consequently, power autocorrelation cannot be directly used as an SOH indicator in real-world driving scenarios but could be incorporated into a diagnostic tool by applying a periodic current signal to the battery when it is not in use. These findings suggest that domain knowledge-based features have the potential to be used as online tools for real-time capacity estimation. However, the model’s effectiveness may be limited in practical applications, because the dataset used in this study does not account for temperature variations or practical discharge events typical in real-world battery usage.
Additionally, the current and voltage signals used to extract features have a high signal-to-noise ratio, which may not always be present in EV batteries. Having demonstrated the potential of these features on the studied dataset40, further investigations will be conducted using field data as future work. While this study primarily focuses on capacity estimation, utilizing a larger dataset could allow for the application of these indicators in RUL prediction. Extending the method proposed in this paper, these indicators could be integrated into forecasting models, enabling the BMS to anticipate and effectively manage battery capacity degradation.

Methods

Cell cycling and experimental dataset

The experimental dataset40 used in this work involves INR21700-M50T battery cells with graphite/silicon anode and nickel manganese cobalt oxide (NMC) cathode tested over a period of 30 months. For each cell, periodic RPTs, including C/20 capacity tests, Hybrid Pulse Power Characterization, and Electrochemical Impedance Spectroscopy, were conducted to assess the battery aging from fresh conditions. The cells underwent aging cycles as described in Supplementary Note 2. Each cycle includes a Constant Current-Constant Voltage (CC-CV) charge phase followed by a discharge phase. Specifically, there are two constant-current charge phases. Once the batteries reach 20% SOC (from the discharge phase), they are charged through the CC-a phase (at different C-rates) until reaching 4 V. They then continue charging at C/4 until 4.2 V, followed by the CV phase until the current drops below 50 mA. The discharge phase, using concatenated UDDS driving profiles, simulates EV battery discharging, reducing the cell’s SOC from 80% to 20%. Aging cycles conducted between the jth and (j+1)th RPTs for each cell are grouped into the jth batch of aging cycles. Supplementary Note 2 details the number of aging cycles in each batch for all cells used in this study. Among the ten cells (G1, V4, V5, W3, W4, W5, W7, W8, W9, W10) in the dataset, five (V4, W5, W7, W8, W9) are used in this study, as detailed in Table 1. The remaining cells were excluded for the following reasons. Cells W3, W10, and G1 were charged using a fast-charging 3C current profile during the CC-a phase, resulting in a very short charging interval that hindered feature extraction. Cell V5 was excluded due to insufficient aging, having undergone only 59 cycles with less than a 3% capacity decrease from the beginning of life. Cells W4, W5, and W7 were reported to have voltage measurement anomalies due to experimental issues, as noted in the “README” file of the dataset40 and detailed in Supplementary Note 5.
Specifically, cell W4 was affected for 310 cycles out of the total 760.

Data augmentation approach

This work uses capacity to describe battery SOH. Given the limited number of RPTs, we have adopted an approach that uses data augmentation with linear interpolation for training purposes. For each cell, to assign a capacity value at every aging cycle i contained in batch j, we use the capacity values measured at the j-th and (j + 1)-th RPTs and estimate the capacity for cycle i, Qi, as follows:

$${Q}_{i}=\frac{i-{{{\rm{cycle}}}}_{j}^{{{\rm{RPT}}}}}{{{{\rm{cycle}}}}_{j+1}^{{{\rm{RPT}}}}-{{{\rm{cycle}}}}_{j}^{{{\rm{RPT}}}}}\times \left({Q}_{j+1}^{{{\rm{RPT}}}}-{Q}_{j}^{{{\rm{RPT}}}}\right)+{Q}_{j}^{{{\rm{RPT}}}}$$
(2)

where \({{{\rm{cycle}}}}_{j}^{{{\rm{RPT}}}}\) and \({{{\rm{cycle}}}}_{j+1}^{{{\rm{RPT}}}}\) denote the numbers of the aging cycles preceding the j-th and (j + 1)-th RPTs, respectively, while \({Q}_{j}^{{{\rm{RPT}}}}\) and \({Q}_{j+1}^{{{\rm{RPT}}}}\) represent the capacity values measured during these tests for the considered cell. Index i ranges from 1 to the number of aging cycles a cell has undergone (Table 1, fourth column), while index j ranges from 1 to the number of times the cell has been tested (Table 1, third column).

For example, capacity for cell V4 at aging cycle #30, namely \({Q}_{30}^{{{\rm{V4}}}}\), is defined as:

$${Q}_{30}^{{{\rm{V4}}}}=\frac{30-{{{\rm{cycle}}}}_{2}^{{{\rm{RPT}}},{{\rm{V}}}4}}{{{{\rm{cycle}}}}_{3}^{{{\rm{RPT}}},{{\rm{V}}}4}-{{{\rm{cycle}}}}_{2}^{{{\rm{RPT}}},{{\rm{V}}}4}}\times \left({Q}_{3}^{{{\rm{RPT}}},{{\rm{V}}}4}-{Q}_{2}^{{{\rm{RPT}}},{{\rm{V}}}4}\right)+{Q}_{2}^{{{\rm{RPT}}},{{\rm{V}}}4}$$
(3)

where \({{{\rm{cycle}}}}_{2}^{{{\rm{RPT}}}}=20\) and \({{{\rm{cycle}}}}_{3}^{{{\rm{RPT}}}}=45\), since cell V4 has undergone 20 aging cycles before RPT #2 and 45 aging cycles before RPT #3.
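
The interpolation in Equations (2) and (3) can be illustrated with a minimal Python sketch. The function name `interpolate_capacity` is hypothetical, and the two capacity values in the example are placeholders chosen for illustration, not measurements from the dataset; only the cycle counts 20 and 45 come from the text.

```python
def interpolate_capacity(cycle, cycle_rpt_j, cycle_rpt_j1, q_rpt_j, q_rpt_j1):
    """Eq. (2): linear interpolation of capacity between two consecutive RPTs."""
    if cycle_rpt_j1 == cycle_rpt_j:
        raise ValueError("the two RPTs must be separated by at least one aging cycle")
    fraction = (cycle - cycle_rpt_j) / (cycle_rpt_j1 - cycle_rpt_j)
    return fraction * (q_rpt_j1 - q_rpt_j) + q_rpt_j


# Cell V4, aging cycle #30, between RPT #2 (after 20 cycles) and RPT #3 (after 45 cycles).
# The capacities 4.80 Ah and 4.70 Ah are ASSUMED placeholder values.
q_30 = interpolate_capacity(30, 20, 45, q_rpt_j=4.80, q_rpt_j1=4.70)
print(round(q_30, 3))
```

Because cycle 30 lies 40% of the way from cycle 20 to cycle 45, the interpolated capacity lies 40% of the way between the two RPT capacities.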

Definition of SOH indicators

Vch and Vdis represent the voltage profiles during charging and discharging, respectively. Ich and Idis are the current profiles during charging and discharging, respectively. Voltage variations due to acceleration peaks during discharging are indicated with ΔVacc, and the corresponding current variations with ΔIacc. The autocorrelation function measures the linear relationship between a signal x(t) and its time-delayed version x(t + τ), where τ is the time delay. In this work, power autocorrelation during the discharge phase is quantified by correlating the power signal with its delayed copies. First, cell power is calculated from the voltage and current signals as follows:

$$P(t)={V}_{{{\rm{dis}}}}(t)\cdot {I}_{{{\rm{dis}}}}(t)$$
(4)

The autocorrelation function of the power signal \({\hat{\rho }}_{\tau }\) is computed with delays τ limited to the range [−τmax, τmax]. In our study, τmax is set to 3000 s. For each value within this range, \({\hat{\rho }}_{\tau }\) is computed as follows:

$${\hat{\rho }}_{\tau }=\mathop{\sum }_{t=\tau +1}^{T}(P(t)-\bar{P})(P(t-\tau )-\bar{P})$$
(5)

where T is the duration of the discharging phase, P(t) is the power at time t, \(\bar{P}\) is the average of the power over the time window T, and P(t − τ) is the power at instant t − τ. The power autocorrelation indicator PAutocorr is defined as the autocorrelation with null delay: \({P}_{{{\rm{Autocorr}}}}={\hat{\rho }}_{\tau = 0}\).
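
As a rough illustration of Equations (4) and (5), the sketch below computes the sum in Equation (5) for one integer lag. It assumes uniformly sampled signals, so that the lag is expressed in samples rather than seconds, and it treats negative lags as symmetric to positive ones; the function name `power_autocorrelation` is hypothetical.

```python
def power_autocorrelation(voltage, current, lag=0):
    """Eq. (4)-(5): sum of products of the mean-centered power signal at a given lag."""
    if not voltage or len(voltage) != len(current):
        raise ValueError("voltage and current must be non-empty and of equal length")
    power = [v * i for v, i in zip(voltage, current)]        # Eq. (4)
    mean_power = sum(power) / len(power)
    centered = [p - mean_power for p in power]
    lag = abs(lag)                                            # assumed symmetry in the lag
    # Eq. (5): t runs from lag+1 to T (1-based), i.e. from lag to T-1 here (0-based)
    return sum(centered[t] * centered[t - lag] for t in range(lag, len(centered)))


# Made-up samples for illustration only
v_dis = [1.0, 1.0, 1.0, 1.0]
i_dis = [1.0, 2.0, 3.0, 4.0]
p_autocorr = power_autocorrelation(v_dis, i_dis, lag=0)  # the indicator uses null delay
```

With null delay the expression reduces to the sum of squared deviations of the power from its mean over the discharge phase.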

The resistance R indicator is extracted for each aging cycle during the discharging phase using the following procedure. First, acceleration peaks are identified during the discharge27 as explained in Supplementary Note 7. Then, the resistance Rpeak corresponding to the lth current peak within the ith aging cycle is computed as follows:

$${R}_{{{\rm{peak}}},l}^{i}=\frac{\Delta {V}_{{{\rm{acc}}},l}^{i}}{\Delta {I}_{{{\rm{acc}}},l}^{i}}$$
(6)

where \(\Delta {V}_{{{\rm{acc}}},l}^{i}\) and \(\Delta {I}_{{{\rm{acc}}},l}^{i}\) are the voltage and current variations at the occurrence of the lth peak, respectively, as shown in Supplementary Note 7. Thus, P resistances \({R}_{{{\rm{peak}}},1}^{i},{R}_{{{\rm{peak}}},2}^{i},\ldots ,{R}_{{{\rm{peak}}},P}^{i}\) are computed for each ith aging cycle, with i = 1, …, N, where N represents the number of aging cycles during the cell’s life and P is the total number of acceleration peaks within each cycle. Note that the total number of acceleration peaks, P, varies with the aging cycle. Subsequently, a single resistance value for each aging cycle is obtained by averaging the P resistances extracted from all acceleration peaks within that cycle:

$${R}^{i}=\frac{{\sum }_{l = 1}^{P}{R}_{{{\rm{peak}}},l}^{i}}{P}\quad i=1,2,\ldots ,N$$
(7)
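
Equations (6) and (7) amount to a per-peak ratio followed by an average, as in the short sketch below. Peak detection (Supplementary Note 7) is not reproduced here: the voltage and current variations are assumed to be already extracted, and the function name and sample values are hypothetical.

```python
def cycle_resistance(delta_v_peaks, delta_i_peaks):
    """Eq. (6)-(7): mean of the per-peak resistances within one aging cycle."""
    if not delta_v_peaks or len(delta_v_peaks) != len(delta_i_peaks):
        raise ValueError("need one voltage and one current variation per peak")
    r_peaks = [dv / di for dv, di in zip(delta_v_peaks, delta_i_peaks)]  # Eq. (6)
    return sum(r_peaks) / len(r_peaks)                                   # Eq. (7)


# Two made-up acceleration peaks (volts and amperes), for illustration only
r_cycle = cycle_resistance([0.05, 0.10], [2.0, 5.0])
```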

The instantaneous battery charging impedance \({Z}_{{{{\rm{CHG}}}}_{{{\rm{ist}}}}}\) is computed over the CC-a phase27 as follows:

$${Z}_{{{{\rm{CHG}}}}_{{{\rm{ist}}}}}({t}_{k})=-\frac{{V}_{{{\rm{ch}}}}({t}_{k})-{V}_{{{\rm{ch}}}}({t}_{k-1})}{{I}_{{{\rm{ch}}}}}$$
(8)

where Vch(tk) − Vch(tk−1) is the voltage difference over the interval Δt = tk − tk−1, and Ich is the constant charging current during the CC-a phase.

The choice of the time window Δt is crucial. Increasing Δt helps filter out noise from the voltage difference in the numerator of Equation (8) and reduces current quantization effects. However, too large a window can excessively filter and result in information loss. Therefore, Δt is tuned to balance noise reduction while preserving the information content of \({Z}_{{{{\rm{CHG}}}}_{{{\rm{ist}}}}}\). The time intervals Δt are selected based on the C-rate: Δt = 60 s for C/4, Δt = 30 s for C/2, and Δt = 1 s for 1C charging events.

After extracting the instantaneous battery impedance for all the time intervals of the charging phase, the ZCHG indicator is computed for each charging phase by averaging the \({Z}_{{{{\rm{CHG}}}}_{{{\rm{ist}}}}}\) within a specific voltage range [Vin, Vfin]:

$${Z}_{{{\rm{CHG}}}}=\frac{1}{M}\mathop{\sum }_{{t}_{k}={t}_{{{\rm{in}}}}}^{{t}_{{{\rm{fin}}}}}{Z}_{{{{\rm{CHG}}}}_{{{\rm{ist}}}}}({t}_{k})$$
(9)

where M is the number of \({Z}_{{{{\rm{CHG}}}}_{{{\rm{ist}}}}}\) measurements within the considered voltage range, and tin and tfin are the initial and final time instants such that V(tin) = Vin and V(tfin) = Vfin, respectively. The voltages Vin and Vfin were set to 3.8 V and 3.9 V, respectively, based on the sensitivity analysis presented in Supplementary Note 4.
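
A minimal sketch of Equations (8) and (9) follows. It assumes uniformly sampled voltage, with the window Δt expressed as a number of samples (`step`, a hypothetical parameter), and it assumes a sign convention in which the charging current is negative, which is what the minus sign in Equation (8) suggests. The default window of 3.8 V to 3.9 V is taken from the text.

```python
def charging_impedance(voltages, i_ch, step, v_in=3.8, v_fin=3.9):
    """Eq. (8)-(9): average instantaneous charging impedance inside [v_in, v_fin]."""
    z_values = []
    for k in range(step, len(voltages)):
        if v_in <= voltages[k] <= v_fin:
            # Eq. (8): voltage difference over one window, divided by the constant current
            z_values.append(-(voltages[k] - voltages[k - step]) / i_ch)
    if not z_values:
        raise ValueError("no samples fall inside the voltage window")
    return sum(z_values) / len(z_values)  # Eq. (9), with M = len(z_values)


# Made-up linear voltage ramp from 3.70 V to 4.00 V, for illustration only
ramp = [3.70 + 0.01 * k for k in range(31)]
z_chg = charging_impedance(ramp, i_ch=-2.0, step=5)
```

Passing the window as a sample count keeps the sketch independent of the sampling period; with 1 s sampling, `step` would equal Δt in seconds.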

An alternative formulation would be to compute the average of \({Z}_{{{{\rm{CHG}}}}_{{{\rm{ist}}}}}\) within a SOC range instead of a voltage range. However, we opted for the voltage-based formulation to avoid estimation errors affecting the SOC, which is a non-measurable quantity generally estimated by the BMS. Additionally, a different definition of charging impedance, discussed in Supplementary Note 8, has been excluded in the present work due to its lower correlation with capacity fade.

Finally, the energy during charging and discharging is computed on the CC-a charging segment (see Supplementary Note 2) and driving UDDS profile, respectively, by integrating the electrical power within a fixed voltage window, specifically [Vin,ch, Vfin,ch] and [Vin,dis, Vfin,dis]:

$${E}_{{{\rm{ch}}}} = {\int_{{t}_{{{\rm{in}}}}}^{{t}_{{{\rm{fin}}}}}}{V}_{{{\rm{ch}}}}(t)\cdot {I}_{{{\rm{ch}}}}(t)\,{{\rm{dt}}}$$
(10)
$${E}_{{{\rm{dis}}}} = {\int_{{t}_{{{\rm{in}}}}}^{{t}_{{{\rm{fin}}}}}}{V}_{{{\rm{dis}}}}(t)\cdot {I}_{{{\rm{dis}}}}(t)\,{{\rm{dt}}}$$
(11)

where Vch is the cell voltage during charging, Ich is the cell current during charging and tin and tfin are the initial and final time instants such that Vch(tin) = Vin,ch and Vch(tfin) = Vfin,ch. Similarly, Vdis is the cell voltage during discharging, Idis is the cell current during discharging and tin and tfin are the initial and final time instants such that Vdis(tin) = Vin,dis and Vdis(tfin) = Vfin,dis. Thus, energy is not only a function of the C-rate but also depends on the voltage window over which it is calculated.

We selected the fixed voltage windows [Vin,ch = 3.6 V, Vfin,ch = 3.9 V] and [Vin,dis = 3.85 V, Vfin,dis = 3.4 V] for computing Ech and Edis, respectively, to bypass the initial and final stages of charging and discharging, which are potentially prone to noise.
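
The integrals in Equations (10) and (11) can be approximated numerically, for instance with the trapezoidal rule as sketched below. Two simplifications are assumptions of this sketch rather than details given in the text: the window bounds are located at the first sample that reaches each voltage, and no interpolation is performed between samples. The function name `window_energy` is hypothetical.

```python
def window_energy(times, voltages, currents, v_start, v_end):
    """Eq. (10)-(11): trapezoidal integral of V*I between two voltage levels."""
    rising = v_end > v_start  # charging windows rise, discharging windows fall

    def reached(v, target):
        return v >= target if rising else v <= target

    k_in = next((k for k, v in enumerate(voltages) if reached(v, v_start)), None)
    if k_in is None:
        raise ValueError("voltage never reaches the start of the window")
    k_fin = next((k for k in range(k_in, len(voltages))
                  if reached(voltages[k], v_end)), None)
    if k_fin is None:
        raise ValueError("voltage never reaches the end of the window")

    energy = 0.0
    for k in range(k_in, k_fin):
        p0 = voltages[k] * currents[k]
        p1 = voltages[k + 1] * currents[k + 1]
        energy += 0.5 * (p0 + p1) * (times[k + 1] - times[k])
    return energy  # joules if times are in seconds, voltages in V, currents in A


# Made-up charging samples, for illustration only
t = [0.0, 1.0, 2.0, 3.0, 4.0]
v_ch = [3.5, 3.6, 3.7, 3.8, 3.9]
i_ch = [2.0] * 5
e_ch = window_energy(t, v_ch, i_ch, v_start=3.6, v_end=3.9)
```

The sign of the result follows the sign convention of the current signal, so a discharge energy may come out negative depending on how the current is logged.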

Sensitivity of charging energy to voltage window

To assess the feasibility of using energy during charging for partial charging profiles, the correlation between Ech and capacity loss was quantified across different voltage ranges. First, the interval [Vin,ch, Vfin,ch] was divided into sub-intervals of 0.025 V width, and the energy was computed for each sub-interval.

As shown in Fig. 9, there is a strong correlation between energy during charging and capacity loss across all voltage sub-intervals. These results show that energy can be effectively used to estimate the SOH for partial and narrow charging periods. The analysis indicates that the voltage interval with the highest correlation also depends on the charging rate. This insight facilitates straightforward integration into the BMS.
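
One way to set up such a sensitivity analysis is sketched below: the voltage window is split into equal sub-intervals, and a Pearson coefficient is computed between the per-cycle energy of a sub-interval and the capacity loss. The use of the Pearson coefficient, the function names, and the sample values are assumptions made for illustration; the text does not specify the correlation measure.

```python
import math


def sub_intervals(v_in, v_fin, width):
    """Split [v_in, v_fin] into consecutive sub-intervals of equal width."""
    count = round((v_fin - v_in) / width)
    return [(v_in + k * width, v_in + (k + 1) * width) for k in range(count)]


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-cycle values, for illustration only
energy_in_window = [10.0, 9.8, 9.5, 9.1]   # energy within one sub-interval
capacity_loss = [0.0, 0.5, 1.2, 2.0]       # percent
rho = pearson(energy_in_window, capacity_loss)
```

Repeating the correlation for every sub-interval and every C-rate would show which voltage ranges carry the most aging information.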

Fig. 9: Impact of voltage range, C-rate and charging cycle number on energy during charging.

The amount of energy during charging (Ech) depends on the voltage range and the charging rate. As the charging rate increases (from C/4 for cell V4 (a), to C/2 for cell W8 (b), to 1C for W9 (c)), the peak in charging energy shifts towards higher voltage ranges, i.e., [3.675 V - 3.7 V] at C/4, [3.7 V - 3.725 V] at C/2 and [3.825 V - 3.85 V] at 1C.

Pre-processing and incremental indicators

Data pre-processing is essential for effectively using SOH indicators in data-driven algorithms. A critical step is removing outliers—data points that deviate from the majority. Outliers can affect feature extraction and machine learning model performance. Therefore, a careful approach is used to remove outlier-containing data, ensuring more robust and reliable feature representation. A second step of the pre-processing phase is the computation of the incremental features, denoted by Δ. This subsection explains how to obtain these features, using incremental resistances as an example.

For each cell in the dataset, the vector of incremental resistances ΔR is calculated as follows:

  1.

    For each aging cycle i, with i = 1, …, N, the resistance during the discharge phase over acceleration peaks is calculated as a function of SOC. \({R}^{i}\) represents the average resistance over the SOC range between 80% and 20% during discharge. The resulting resistance vector is:

    $${{\bf{R}}}=[{R}^{1},{R}^{2},\ldots ,{R}^{N}]$$
    (12)

    where R1 is the average fresh cell resistance and RN is the average resistance at the last cycle.

  2.

    Obtain the incremental resistance vector by subtracting R1 from each value in R.

    $$\Delta {{\bf{R}}}={{\bf{R}}}-{R}^{1}$$
    (13)

This approach ensures that the first element of the incremental vector for each feature is zero, facilitating the comparison of aging trends across cells. Additionally, the charging impedance vector ZCHG requires further pre-processing because this feature strongly depends on the C-rate at which it is computed (see Fig. 3). To standardize across different C-rates, the incremental vector ΔZCHG is normalized using the fresh cell impedance value:

$$\Delta {{{\bf{Z}}}}_{{{\rm{CHG}}}}^{{{\rm{NORM}}}}=\frac{\Delta {{{\bf{Z}}}}_{{{\rm{CHG}}}}}{{Z}_{{{\rm{CHG}}}}^{1}}$$
(14)

where \({Z}_{{{\rm{CHG}}}}^{1}\) is the charging impedance calculated over the first aging cycle in Batch #1 in the voltage range [3.8 V - 3.9 V] as described in Sec. Definition of SOH indicators. Normalization reduces variations from different charging rates, providing a consistent feature representation. This pre-processing step is crucial for evaluating the ML model across cells cycled at various rates, effectively excluding C-rate as a training feature. It ensures a more refined data representation for machine learning algorithms.
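
The two pre-processing operations in Equations (13) and (14) are simple element-wise transformations, sketched here with hypothetical function names and made-up values.

```python
def incremental(values):
    """Eq. (13): subtract the first (fresh-cell) value from every element."""
    first = values[0]
    return [v - first for v in values]


def normalized_incremental(values):
    """Eq. (14): incremental vector divided by the first (fresh-cell) value."""
    first = values[0]
    if first == 0:
        raise ValueError("the fresh-cell value must be non-zero")
    return [(v - first) / first for v in values]


# Made-up indicator histories over three aging cycles, for illustration only
delta_r = incremental([0.020, 0.021, 0.023])
delta_z_norm = normalized_incremental([0.050, 0.055, 0.060])
```

Both outputs start at zero by construction, which is what allows aging trends of different cells to be compared on a common baseline.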

Estimation model

In this work, an LRM is used to estimate capacity fade because of the strong linear correlation between capacity and the SOH indicators. The LRM relates the response variable y to the input vector u as follows56:

$$y(t)={\beta }_{0}+\beta {{\boldsymbol{u}}}(t)+\epsilon (t)$$
(15)

where ϵ represents model error, capturing deviations between the model and observed data. Coefficients β are determined using the least-squares method, which minimizes the model error on the training dataset. To evaluate the accuracy of the estimation models, the root mean square error (RMSE) is calculated as:

$${{\rm{RMSE}}}=\sqrt{\frac{\mathop{\sum }_{i = 1}^{N}{e}_{i}^{2}}{N}}$$
(16)
$${e}_{i}=\frac{{Q}_{{{\rm{cell}}},i}-{Q}_{{{\rm{est}}},i}}{{Q}_{{{\rm{cell}}},i}}$$
(17)

where ei is the relative error, with Qcell,i and Qest,i representing the actual and estimated capacities at cycle i, respectively. Additionally, the absolute percentage error is given by APE(%) = |ei| × 100.
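
For illustration, Equations (15) to (17) can be sketched in plain Python by solving the least-squares normal equations with Gaussian elimination. This is a didactic sketch under stated assumptions, not the implementation used in the study: the function names are hypothetical, the training data are synthetic, and a numerical library would normally be preferred for the linear solve.

```python
import math


def fit_linear_regression(features, targets):
    """Least-squares fit of y = b0 + b.u (Eq. 15) via the normal equations."""
    rows = [[1.0] + list(u) for u in features]  # prepend 1 for the intercept
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, targets))] for i in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0:
            raise ValueError("features are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(a[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (a[i][n] - tail) / a[i][i]
    return beta


def predict(beta, u):
    return beta[0] + sum(b * x for b, x in zip(beta[1:], u))


def rmse_and_ape(q_true, q_est):
    """Eq. (16)-(17): RMSE of the relative errors and absolute percentage errors."""
    errors = [(qt - qe) / qt for qt, qe in zip(q_true, q_est)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return rmse, [abs(e) * 100 for e in errors]


# Synthetic data: two features standing in for the incremental charge and discharge energies
u_train = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
q_train = [5.0 - 0.5 * u1 - 0.2 * u2 for u1, u2 in u_train]
beta = fit_linear_regression(u_train, q_train)
rmse, ape = rmse_and_ape(q_train, [predict(beta, u) for u in u_train])
```

In a setting like Scenario 1, the fit would use the features of one cell and the error metrics would be evaluated on the remaining cells.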