Introduction

Sequential data classification, encompassing time series, sensor readings, and environmental monitoring signals, has become a cornerstone of modern machine learning applications, covering domains such as healthcare, industrial IoT, and climate science1. Traditional classifiers, including logistic regression and decision trees, often fail to account for the temporal dependencies and nonstationary patterns inherent in sequential data, leading to suboptimal performance in tasks like anomaly detection or event prediction2. For example, analyzing time-dependent sequences with diurnal or seasonal fluctuations, common in sensor networks or physiological monitoring, requires models that capture delayed interactions and contextual shifts3. While deep learning approaches like Long Short-Term Memory networks (LSTMs)4 and transformers5 excel at modelling long-range dependencies, their computational complexity, “black-box” nature, and reliance on large datasets limit their practicality in resource-constrained systems6. Hybrid methods, such as time series forests7, attempt to bridge this gap, but often lack a principled handling of structural properties such as irregular sampling intervals and multiscale trends, underscoring the need for methods that balance interpretability, efficiency, and robustness.

Sequential data are characterized by temporal ordering, variable-length observations, and context-dependent correlations, posing unique challenges for classification. Temporal misalignment, where similar patterns lead to divergent outcomes due to external factors, together with high dimensionality and noise contamination (e.g., sensor drift or missing data), complicates traditional approaches that assume independent and identically distributed features8. Methods like dynamic time warping (DTW)9 and shapelet-based techniques10 struggle with computational complexity when applied to large-scale datasets, while irregular sampling rates further disrupt temporal continuity2.

Random forests (RFs)11 have emerged as robust tools for sequential classification due to their ability to handle high-dimensional feature spaces, resist overfitting, and quantify feature importance. By aggregating predictions from decorrelated decision trees, RFs mitigate variance and noise, making them suitable for applications such as industrial monitoring and healthcare analytics12. Competing methods include support vector machines (SVMs) with dynamic time-warping kernels13, gradient-boosted trees14, and convolutional neural networks (CNNs)15. However, SVMs suffer from quadratic complexity with long sequences, CNNs require meticulous filter tuning, and standard RFs treat sequential features as static, potentially discarding discriminative temporal dynamics such as gradual trends or abrupt events16.

Related studies

Early machine learning methods, including logistic regression, SVMs, and decision trees, struggle with sequential data due to assumptions of feature independence and static distributions17. For instance, logistic regression cannot model nonlinear temporal dynamics, while SVMs scale poorly with sequence length13. Decision trees, though interpretable, fragment temporal continuity through axis-aligned splits2. Shallow neural networks lack mechanisms to retain historical context, which is critical in tasks like event prediction1. These limitations are exacerbated in noisy, irregularly sampled datasets, where specialized time-series classifiers outperform traditional methods by significant margins8.

Recurrent neural networks (RNNs), particularly LSTMs and bidirectional architectures, revolutionized sequential analysis by modelling dependencies through hidden states. CNNs, adapted via temporal convolutions, detect local motifs like abrupt anomalies15. Hybrid models, such as CNN-LSTMs, combine local feature extraction with global temporal modelling, reducing prediction errors in applications like air quality forecasting16. However, deep learning models demand large datasets and computational resources, and their opacity hinders interpretability in domains requiring actionable insights18. Ensemble methods like RFs and gradient boosting machines (GBMs) excel in handling heterogeneous features and noise. RFs aggregate decorrelated trees to mitigate variance, proving effective in high-dimensional monitoring scenarios12. GBMs iteratively refine predictions, outperforming traditional models in time-series forecasting14. However, standard ensembles treat sequences as static unless they are temporally engineered, limiting their capacity to model cumulative trends. Time-series-specific variants, such as temporal random forests, often rely on handcrafted features rather than intrinsic temporal structure7.

Feature transformation techniques like wavelet decompositions and PCA reduce dimensionality but operate independently of classifier training. Wavelets isolate transient events from trends19, while PCA compresses data linearly, risking loss of discriminative features20. This underscores the need for integrated approaches where transformations align with classification objectives. Legendre polynomial transformations have gained traction for noise-resistant feature extraction in industrial and environmental monitoring21. Yet, their integration with ensemble learning remains underexplored. Current methods apply transformations as preprocessing steps or rely on implicit feature learning, neglecting the potential of energy-guided modelling. This study addresses these gaps by proposing LEW-RF, which embeds Legendre energy into RFs to prioritize discriminative temporal trends, advancing hybrid frameworks for sequential classification.

Study contribution

Prior research in sequential data classification faces two critical limitations. First, most methods decouple feature transformation from classifier training, resulting in suboptimal alignment between extracted features and classification objectives. For example, unsupervised techniques like PCA prioritize variance over class-discriminative patterns, while deep learning models implicitly learn features without domain-informed guidance8. Second, ensemble methods such as random forests (RFs) underutilize structured temporal priors inherent to sequential data, relying instead on handcrafted statistical features rather than leveraging transformations that encode multiscale trends or energy compaction properties7,22. These gaps limit both performance and interpretability in real-world applications like environmental monitoring or industrial sensing.

To bridge these gaps, this study makes three key contributions. First, we establish a theoretical and empirical linkage between Legendre polynomial coefficient energy, the sum of squared coefficients derived from sequential data, and class separability, demonstrating its effectiveness in capturing discriminative trends such as gradual pollution accumulations or abrupt anomalies. Second, we propose the Legendre Energy-Weighted Random Forest (LEW-RF), a novel framework that integrates Legendre transformations directly into the RF architecture. By weighting feature splits based on coefficient energy, LEW-RF prioritizes temporally significant patterns while suppressing noise, addressing the decoupling problem through joint optimization of transformation and classification. Third, our method retains the interpretability of classical RFs via feature importance scores while achieving scalability comparable to traditional ensembles, even for decade-long sequences, a critical advantage for resource-constrained systems.

The remainder of this paper is structured to systematically address these contributions. Section 3 details the LEW-RF methodology, including Legendre polynomial transformations, energy-guided splitting criteria, and algorithmic implementation. Section 4 presents a simulation example for comparing the proposed LEW-RF with RF and other competing ensemble and deep learning models. Section 5 evaluates LEW-RF against state-of-the-art baselines on the ozone dataset, quantifying gains in accuracy, noise robustness, and computational efficiency. Section 6 discusses practical implications, limitations, and broader applications in environmental science. Finally, Section 7 concludes with future directions, including the extension of LEW-RF to multivariate sequences and online learning scenarios.

Sequential data and Legendre transformation

Preliminaries

Let \(\{x_i(t)\}_{i=1}^m\) denote a sequential dataset of \(m\) features, where each feature \(x_i(t)\) is a discrete time series sampled at \(T\) equidistant time points \(t = 1, 2, \dots , T\). To model temporal dependencies, we leverage the properties of Legendre polynomials \(\{P_k(t)\}_{k=0}^N\), a family of orthogonal basis functions defined on the interval \([-1, 1]\). These polynomials form a complete orthogonal system in the Hilbert space \(L^2[-1, 1]\), enabling sparse yet accurate representations of sequential data.

Legendre polynomials and orthogonality

Let \(\mathscr {H} = L^2([-1, 1], \mathbb {R})\) denote the Hilbert space of square-integrable functions equipped with the inner product:

$$\langle f, g \rangle = \int _{-1}^1 f(t)g(t) dt, \quad \Vert f\Vert _{L^2} = \sqrt{\langle f, f \rangle }.$$

The Legendre polynomials \(\{P_k\}_{k=0}^\infty\) form a complete orthogonal basis for \(\mathscr {H}\), satisfying:

$$\langle P_k, P_{k'} \rangle = \frac{2}{2k + 1} \delta _{kk'}, \quad \text {where } \delta _{kk'} \text { is the Kronecker delta}.$$

Completeness implies that any \(f \in \mathscr {H}\) admits a unique Fourier-Legendre expansion:

$$f(t) = \sum _{k=0}^\infty c_k P_k(t), \quad c_k = \frac{2k + 1}{2} \langle f, P_k \rangle .$$

The convergence is in the \(L^2\)-norm: \(\lim _{N \rightarrow \infty } \Vert f - \sum _{k=0}^N c_k P_k\Vert _{L^2} = 0\).

Theorem 1

(Optimal Projection) Let \(\mathscr {P}_N = \text {span}\{P_0, \dots , P_N\}\). For \(f \in \mathscr {H}\), the truncated expansion \(f_N(t) = \sum _{k=0}^N c_k P_k(t)\) satisfies:

$$\Vert f - f_N\Vert _{L^2} = \inf _{p \in \mathscr {P}_N} \Vert f - p\Vert _{L^2}.$$

Proof

By the projection theorem in Hilbert spaces, \(f_N\) is the orthogonal projection of \(f\) onto \(\mathscr {P}_N\). Let \(p(t) = \sum _{k=0}^N d_k P_k(t)\). Then:

$$\Vert f - p\Vert _{L^2}^2 = \Vert f - f_N\Vert _{L^2}^2 + \Vert f_N - p\Vert _{L^2}^2 \ge \Vert f - f_N\Vert _{L^2}^2,$$

since \(f_N - p \in \mathscr {P}_N\) and \(f - f_N \perp \mathscr {P}_N\). Equality holds iff \(p = f_N\). \(\square\)

Approximation theory and error bounds

Lemma 1

(Derivative Bounds) The \(n\)-th derivative of \(P_k(t)\) satisfies:

$$\left| \frac{d^n P_k}{dt^n}(t)\right| \le \frac{(k + n)!}{2^n n! (k - n)!} \quad \text {for } t \in [-1, 1], \quad n \le k.$$

Proof

Using Rodrigues’ formula \(P_k(t) = \frac{1}{2^k k!} \frac{d^k}{dt^k}(t^2 - 1)^k\), differentiate \(n\)-times:

$$\frac{d^n P_k}{dt^n}(t) = \frac{1}{2^k k!} \frac{(k + n)!}{(k - n)!} \frac{d^{k - n}}{dt^{k - n}}(t^2 - 1)^{k - n}.$$

Apply Bernstein’s inequality for algebraic polynomials. \(\square\)

Theorem 2

(Coefficient Decay and Regularity) Let \(f \in C^p([-1, 1])\) with \(f^{(p)}\) of bounded variation \(V(f^{(p)})\). Then:

$$|c_k| \le \frac{(2k + 1) V(f^{(p)})}{2^{p + 1} (k - p)!} \quad \text {for } k > p.$$

If \(f\) is analytic in a Bernstein ellipse \(E_\rho = \{ t \in \mathbb {C} : |t + \sqrt{t^2 - 1}| \le \rho \}\), then:

$$|c_k| \le C \rho ^{-k} \Vert f\Vert _{L^\infty (E_\rho )},$$

where \(C > 0\) depends on \(\rho\).

Proof

For finite smoothness, integrate by parts \(p\)-times:

$$c_k = \frac{2k + 1}{2} \int _{-1}^1 f(t) P_k(t) dt = \frac{(-1)^p (2k + 1)}{2} \int _{-1}^1 f^{(p)}(t) \frac{d^{p} P_k}{dt^{p}}(t) dt.$$

Apply Lemma 1 and the bound \(|d^p P_k/dt^p| \le C k^{2p}\). For analytic \(f\), use contour integration over \(E_\rho\):

$$c_k = \frac{2k + 1}{4\pi i} \oint _{E_\rho } \frac{f(z) P_k(z)}{(z - t)} dz,$$

and bound \(|P_k(z)| \le C \rho ^k\) within \(E_\rho\). \(\square\)

Corollary 1

(Truncation Error in Sobolev Spaces) If \(f \in H^s([-1, 1])\), the Sobolev space of order \(s\), then:

$$\Vert f - f_N\Vert _{L^2} \le C N^{-s} \Vert f\Vert _{H^s},$$

where \(C\) depends on \(s\).

Proof

Use the coefficient decay \(|c_k| \le C k^{-s - 1/2} \Vert f\Vert _{H^s}\) and Parseval’s identity:

$$\Vert f - f_N\Vert _{L^2}^2 = \sum _{k=N+1}^\infty \frac{2}{2k + 1} |c_k|^2 \le C \sum _{k=N+1}^\infty k^{-2s - 1} \le C N^{-2s}.$$

\(\square\)

Discrete framework and numerical considerations

For discrete sequences \(\{x_i(t)\}_{t=1}^T\), define the discrete inner product:

$$\langle f, g \rangle _T = \frac{1}{T} \sum _{t=1}^T f(t)g(t), \quad \Vert f\Vert _{T} = \sqrt{\langle f, f \rangle _T}.$$

Let \(\textbf{P} \in \mathbb {R}^{T \times (N+1)}\) be the matrix with entries \(P_k(t)\). The coefficient vector \(\textbf{c}_i = (c_{i0}, \dots , c_{iN})^\top\) solves:

$$\textbf{c}_i = \arg \min _{\textbf{c}} \Vert \textbf{x}_i' - \textbf{P}\textbf{c}\Vert _2^2,$$

where \(\textbf{x}_i'\) is the normalized sequence. The solution is given by:

$$\textbf{c}_i = (\textbf{P}^\top \textbf{P})^{-1} \textbf{P}^\top \textbf{x}_i'.$$
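For concreteness, the following minimal R sketch builds the basis matrix \(\textbf{P}\) via the Bonnet recurrence and recovers the coefficient vector by least squares. The helper names `legendre_design` and `legendre_coefs` are our own, not from any package:

```r
# Sketch: Legendre basis on a normalized grid and least-squares coefficients.
legendre_design <- function(T_len, N) {
  t <- seq(-1, 1, length.out = T_len)   # normalized time grid in [-1, 1]
  P <- matrix(0, nrow = T_len, ncol = N + 1)
  P[, 1] <- 1                           # P_0(t) = 1
  if (N >= 1) P[, 2] <- t               # P_1(t) = t
  if (N >= 2) for (k in 1:(N - 1)) {
    # Bonnet recurrence: (k+1) P_{k+1}(t) = (2k+1) t P_k(t) - k P_{k-1}(t)
    P[, k + 2] <- ((2 * k + 1) * t * P[, k + 1] - k * P[, k]) / (k + 1)
  }
  P
}

legendre_coefs <- function(x, N) {
  P <- legendre_design(length(x), N)
  solve(crossprod(P), crossprod(P, x))  # c = (P'P)^{-1} P'x
}
```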

Lemma 2

(Condition Number of the Legendre Gram Matrix) Let \(\textbf{P} \in \mathbb {R}^{T \times (N+1)}\) be the matrix of Legendre polynomial basis functions evaluated on the normalized time grid \(t_i \in [-1, 1]\) for \(i = 1, \dots , T\), up to degree \(N\). The Gram matrix \(\textbf{G} = \textbf{P}^\top \textbf{P}\) is symmetric and positive definite. Its condition number \(\kappa (\textbf{G}) = \Vert \textbf{G}\Vert _2 \cdot \Vert \textbf{G}^{-1}\Vert _2 = \lambda _{\max }(\textbf{G}) / \lambda _{\min }(\textbf{G})\) satisfies the asymptotic bound:

$$\kappa (\textbf{G}) = \mathscr {O}\left( \frac{(T + N)^{2N+1}}{(T - N)^{2N}} \right) \quad \text {as} \quad T, N \rightarrow \infty .$$

For finite \(T\) and \(N \ll T\), a tighter practical bound is given by:

$$\kappa (\textbf{G}) \le \left( \frac{1 + \sqrt{1 - \xi ^2}}{1 - \sqrt{1 - \xi ^2}} \right) ^2, \quad \text {where} \quad \xi = \frac{N}{T}.$$

Proof

The proof leverages the discrete orthogonality of Legendre polynomials and the Christoffel-Darboux formula. The eigenvalues of \(\textbf{G}\) are bounded by the extreme values of the Christoffel function \(\Lambda _N(t)\) for the discrete measure. We have:

$$\lambda _{\min }(\textbf{G}) \ge \min _{t \in [-1,1]} \Lambda _N(t) \ge \frac{T}{2} \sum _{k=0}^{N} \frac{1}{k + 1/2} \ge \frac{T}{2} \ln (2N+1),$$
$$\lambda _{\max }(\textbf{G}) \le \max _{t \in [-1,1]} \Lambda _N(t) \le \frac{T}{2} \sum _{k=0}^{N} (2k+1) = \frac{T}{2} (N+1)^2.$$

Taking the ratio \(\lambda _{\max }/\lambda _{\min }\) yields the dominant term \(\mathscr {O}(N^2 / \ln N)\). The tighter bound follows from applying the Bernstein inequality for Legendre polynomials and the Gerschgorin circle theorem to the matrix \(\textbf{G}\), considering the off-diagonal decay of the inner products \(\langle P_i, P_j \rangle\) for \(|i-j| > 0\). \(\square\)

Remark 1

This lemma highlights a critical trade-off in the LEW-RF model. While a higher polynomial degree N can capture more complex temporal dynamics, it leads to a rapidly worsening condition number \(\kappa (\textbf{G})\). An ill-conditioned Gram matrix \(\textbf{G}\) makes the least-squares solution for the Legendre coefficients numerically unstable and highly sensitive to noise in the data. This instability can propagate into the energy weights, ultimately degrading the Random Forest’s performance. Therefore, the degree N must be regularized, either by explicit constraint (\(N \le N_{\max }\)) or by adding an \(\ell _2\) penalty (ridge regression) with a tuning parameter \(\lambda\) to the energy calculation, ensuring \(\kappa (\textbf{G} + \lambda \textbf{I})\) is controlled.
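As a quick illustration of this remark, one can compare the conditioning of \(\textbf{G}\) with and without the ridge term. This is a sketch only: `legendre_design` is the helper defined in the earlier sketch, and the value of \(\lambda\) here is an arbitrary assumption:

```r
# Sketch: conditioning of the Gram matrix, with and without a ridge term.
P <- legendre_design(T_len = 50, N = 12)
G <- crossprod(P)                                # G = P'P
kappa(G, exact = TRUE)                           # condition number kappa(G)
lambda <- 1e-3                                   # assumed tuning parameter
kappa(G + lambda * diag(ncol(G)), exact = TRUE)  # kappa(G + lambda I) is controlled
```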

Optimization and sparsity

To enhance sparsity, solve the regularized problem:

$$\min _{\textbf{c}} \frac{1}{2} \Vert \textbf{x}_i' - \textbf{P}\textbf{c}\Vert _2^2 + \lambda \Vert \textbf{c}\Vert _1,$$

where \(\lambda \ge 0\). The solution \(\textbf{c}^*\) satisfies the soft-thresholding condition:

$$c_k^* = \mathscr {S}_\lambda \left( \frac{\langle \textbf{x}_i', P_k \rangle _T}{\langle P_k, P_k \rangle _T} \right) , \quad \mathscr {S}_\lambda (z) = \text {sign}(z) \max (|z| - \lambda , 0).$$
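A minimal sketch of this estimator follows; it assumes the discrete basis is close to orthogonal, so that the coordinate-wise soft-thresholding form above applies, and reuses `legendre_design` from the earlier sketch:

```r
# Sketch: soft-thresholded Legendre coefficients under near-orthogonality.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

sparse_legendre_coefs <- function(x, N, lambda = 0.05) {
  P <- legendre_design(length(x), N)
  num <- colMeans(P * x)    # <x, P_k>_T = (1/T) sum_t x(t) P_k(t)
  den <- colMeans(P^2)      # <P_k, P_k>_T
  soft_threshold(num / den, lambda)
}
```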

Theorem 3

(Noise Robustness) Let \(\tilde{x}_i(t) = x_i(t) + \epsilon (t)\), where \(\epsilon (t) \sim \mathscr {N}(0, \sigma ^2)\). The robust estimator \(\tilde{\textbf{c}}\) satisfies with probability \(1 - \delta\):

$$\Vert \tilde{\textbf{c}} - \textbf{c}\Vert _2 \le C \sigma \sqrt{\frac{N \log (1/\delta )}{T}}.$$

Proof

Using concentration inequalities for sub-Gaussian noise:

$$\mathbb {P}\left( \Vert \textbf{P}^\top \varvec{\epsilon }\Vert _\infty \ge \sigma \sqrt{2T \log (2N/\delta )} \right) \le \delta .$$

The result follows from the robust error bound under the restricted eigenvalue condition. \(\square\)

Dynamical systems and temporal coupling

Consider a time-varying system with state \(\textbf{C}(t) = [c_{ik}(t)] \in \mathbb {R}^{m \times (N+1)}\). Assume dynamics:

$$\frac{d\textbf{C}}{dt} = \textbf{A} \textbf{C} + \textbf{B} \textbf{u}(t),$$

where \(\textbf{A} \in \mathbb {R}^{(N+1) \times (N+1)}\), \(\textbf{B} \in \mathbb {R}^{(N+1) \times d}\), and \(\textbf{u}(t) \in \mathbb {R}^d\). Discretize with time step \(\Delta t\):

$$\textbf{C}_{n+1} = \textbf{F} \textbf{C}_n + \textbf{G} \textbf{u}_n, \quad \textbf{F} = e^{\textbf{A} \Delta t}, \quad \textbf{G} = \int _0^{\Delta t} e^{\textbf{A} (\Delta t - \tau )} \textbf{B} d\tau .$$
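A small sketch of this zero-order-hold discretization is given below. It assumes \(\textbf{A}\) is invertible, in which case \(\textbf{G} = \textbf{A}^{-1}(e^{\textbf{A}\Delta t} - \textbf{I})\textbf{B}\); the expm package supplies the matrix exponential:

```r
# Sketch: discretize dC/dt = A C + B u(t) with time step dt.
library(expm)

discretize <- function(A, B, dt) {
  Fd <- expm(A * dt)                          # F = exp(A dt)
  Gd <- solve(A, (Fd - diag(nrow(A))) %*% B)  # G = A^{-1}(F - I) B, A invertible
  list(F = Fd, G = Gd)
}
```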

Theorem 4

(Subspace Identifiability) If \(\textbf{u}(t)\) is persistently exciting of order \(N+1\), then \(\textbf{A}, \textbf{B}\) are identifiable from \(\{\textbf{C}_n\}\) via:

$$\begin{bmatrix} \textbf{C}_{1:L} \\ \textbf{U}_{0:L-1} \end{bmatrix} = \begin{bmatrix} \textbf{F} & \textbf{G} \\ \textbf{0} & \textbf{I} \end{bmatrix} \begin{bmatrix} \textbf{C}_{0:L-1} \\ \textbf{U}_{0:L-1} \end{bmatrix},$$

where \(L\) is the window length.

Proof

Apply the Ho-Kalman algorithm to the block-Hankel matrix of \(\{\textbf{C}_n\}\). Persistence of excitation ensures full row rank. \(\square\)

Remark 2

The Legendre polynomial framework provides a foundation for sequential data analysis, with explicit error bounds, stability guarantees, and adaptability to dynamical systems. Its orthogonality, sparsity, and robustness properties make it a powerful alternative to Fourier and Chebyshev methods in non-periodic settings.

Energy preservation and spectral analysis

Theorem 5

(Energy Preservation (Bessel’s Inequality)) Let \(x_i'(t) \in \mathbb {R}^T\) be a normalized discrete sequence expanded in the Legendre basis as \(x_i'(t) = \sum _{k=0}^\infty c_{ik} P_k(t)\). The total Legendre energy \(E_i = \sum _{k=0}^N c_{ik}^2 \langle P_k, P_k \rangle _T\) satisfies:

$$E_i \le \sum _{t=1}^T x_i'(t)^2,$$

with equality as \(N \rightarrow \infty\) (Parseval’s identity). The residual energy \(\epsilon _N\) is given by:

$$\epsilon _N = \sum _{t=1}^T \left( x_i'(t) - \sum _{k=0}^N c_{ik} P_k(t)\right) ^2 = \sum _{t=1}^T x_i'(t)^2 - E_i.$$

Proof

Let \(f_N(t) = \sum _{k=0}^N c_{ik} P_k(t)\). The residual \(r_N(t) = x_i'(t) - f_N(t)\) satisfies:

$$\Vert r_N\Vert _T^2 = \sum _{t=1}^T \left( x_i'(t) - f_N(t)\right) ^2.$$

By the projection theorem, \(r_N(t) \perp \mathscr {P}_N\) in the discrete inner product. Thus:

$$\langle r_N, P_k \rangle _T = 0 \quad \forall k \le N.$$

Compute \(\Vert x_i'\Vert _T^2\):

$$\Vert x_i'\Vert _T^2 = \Vert f_N + r_N\Vert _T^2 = \Vert f_N\Vert _T^2 + \Vert r_N\Vert _T^2 + 2 \langle f_N, r_N \rangle _T.$$

The cross-term vanishes due to orthogonality:

$$\langle f_N, r_N \rangle _T = \sum _{k=0}^N c_{ik} \langle P_k, r_N \rangle _T = 0.$$

Using the discrete orthogonality \(\langle P_k, P_{k'} \rangle _T = \delta _{kk'} \langle P_k, P_k \rangle _T\):

$$\Vert f_N\Vert _T^2 = \sum _{k=0}^N \sum _{k'=0}^N c_{ik} c_{ik'} \langle P_k, P_{k'} \rangle _T = \sum _{k=0}^N c_{ik}^2 \langle P_k, P_k \rangle _T = E_i.$$

Since \(\Vert r_N\Vert _T^2 \ge 0\):

$$E_i = \Vert x_i'\Vert _T^2 - \Vert r_N\Vert _T^2 \le \Vert x_i'\Vert _T^2.$$

As \(N \rightarrow \infty\), completeness ensures \(\Vert r_N\Vert _T^2 \rightarrow 0\), so:

$$\lim _{N \rightarrow \infty } E_i = \Vert x_i'\Vert _T^2.$$

\(\square\)

Lemma 3

(Error of Discrete Inner Product Approximation) Let \(P_k\) and \(P_{k'}\) be Legendre polynomials of degrees \(k\) and \(k'\), respectively, normalized such that \(\int _{-1}^1 P_k(t)^2 dt = \frac{2}{2k+1}\). For any integer \(T > 1\), define the discrete inner product over a uniformly spaced grid \(\{t_j = -1 + \frac{2j}{T}\}_{j=0}^{T}\) as:

$$\langle P_k, P_{k'} \rangle _T = \frac{2}{T} \sum _{j=0}^{T} {'} P_k(t_j) P_{k'}(t_j),$$

where the prime (\('\)) on the summation indicates that the first and last terms (\(j=0\) and \(j=T\)) are multiplied by \(\frac{1}{2}\) (i.e., the composite trapezoidal rule). Then, the error in approximating the continuous inner product is bounded by:

$$\left| \langle P_k, P_{k'} \rangle _T - \frac{2}{2k + 1} \delta _{kk'} \right| \le \frac{C(k, k')}{T^2},$$

where the constant \(C(k, k')\) is explicitly given by:

$$C(k, k') = \frac{2}{3} \max _{t \in [-1, 1]} \left| \frac{d^2}{dt^2} \left[ P_k(t) P_{k'}(t) \right] \right| .$$

This constant depends on the degrees \(k\) and \(k'\) through the maximum magnitude of the second derivative of the product \(P_k(t)P_{k'}(t)\) on the interval \([-1, 1]\).

Proof

The discrete inner product \(\langle \cdot , \cdot \rangle _T\) is precisely the application of the composite trapezoidal rule to approximate the integral of the function \(f(t) = P_k(t)P_{k'}(t)\):

$$\langle P_k, P_{k'} \rangle _T = \frac{2}{T} \sum _{j=0}^{T} {'} P_k(t_j) P_{k'}(t_j) \approx \int _{-1}^1 P_k(t) P_{k'}(t) dt = \frac{2}{2k+1} \delta _{kk'}.$$

The well-known error bound for the composite trapezoidal rule applied to a function \(f\) twice continuously differentiable on \([a, b]\) is:

$$\left| \int _{a}^{b} f(t) dt - T_T(f) \right| \le \frac{(b-a)^3}{12 T^2} \max _{\xi \in [a, b]} |f''(\xi )|,$$

where \(T_T(f)\) is the trapezoidal rule estimate with \(T\) subintervals. In our case, the interval is \([a, b] = [-1, 1]\), so \((b-a)^3 = 8\). The function we are integrating is \(f(t) = P_k(t)P_{k'}(t)\). Since Legendre polynomials are infinitely smooth on \([-1, 1]\), their product is also infinitely smooth, and the second derivative \(f''(t)\) exists and is bounded. Applying the trapezoidal error bound directly yields:

$$\left| \int _{-1}^1 P_k(t)P_{k'}(t) dt - \langle P_k, P_{k'} \rangle _T \right| \le \frac{8}{12 T^2} \max _{t \in [-1, 1]} \left| \frac{d^2}{dt^2} \left[ P_k(t) P_{k'}(t) \right] \right| = \frac{2}{3 T^2} \max _{t \in [-1, 1]} |(P_k P_{k'})''(t)|.$$

By defining the constant \(C(k, k')\) as:

$$C(k, k') = \frac{2}{3} \max _{t \in [-1, 1]} \left| \frac{d^2}{dt^2} \left[ P_k(t) P_{k'}(t) \right] \right| ,$$

we obtain:

$$\left| \langle P_k, P_{k'} \rangle _T - \frac{2}{2k + 1} \delta _{kk'} \right| \le \frac{C(k, k')}{T^2}.$$

\(\square\)

Remark 3

The practical value of this bound is that the dependency of the constant on the polynomial degrees is now explicit. It is governed by the behaviour of the second derivative. For large \(k\) and \(k'\), this maximum can be estimated using properties of Legendre polynomials and their derivatives. For example, using the Markov brothers’ inequality, one can derive a coarse bound of the form \(C(k, k') \sim \mathscr {O}(( \max (k, k') )^4)\), although the explicit expression above is more precise and can be computed or bounded for any specific \(k, k'\).

Interpretation of Legendre energy

The Legendre energy \(E_i = \sum _{k=0}^N c_{ik}^2 \langle P_k, P_k \rangle _T\) decomposes the signal power into orthogonal modes (a computational sketch follows the list):

  • Low-Degree Terms (\(k \le 2\)):

    • \(c_{i0}\): Average value (DC component), \(c_{i0} = \frac{1}{T} \sum _{t=1}^T x_i'(t)\).

    • \(c_{i1}\): Linear trend, \(c_{i1} = \frac{3}{T(T^2 - 1)} \sum _{t=1}^T x_i'(t) t\).

    • \(c_{i2}\): Quadratic curvature, \(c_{i2} = \frac{5}{2T(T^2 - 1)(T^2 - 4)} \sum _{t=1}^T x_i'(t) (3t^2 - 1)\).

  • High-Degree Terms (\(k > 2\)): Represent transient fluctuations and noise.
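The sketch below computes the per-mode energies and splits them into the trend (\(k \le 2\)) and fluctuation (\(k > 2\)) parts described above. The helper `legendre_energy` is our own name, building on `legendre_design` from the earlier sketch:

```r
# Sketch: decompose the Legendre energy into trend and fluctuation parts.
legendre_energy <- function(x, N) {
  P <- legendre_design(length(x), N)
  c_hat <- as.numeric(solve(crossprod(P), crossprod(P, x)))
  norms <- colMeans(P^2)               # <P_k, P_k>_T for k = 0..N
  energy <- c_hat^2 * norms            # per-mode energy c_k^2 <P_k, P_k>_T
  trend_idx <- seq_len(min(3, N + 1))  # modes k <= 2
  list(trend = sum(energy[trend_idx]),
       fluctuation = sum(energy[-trend_idx]),
       total = sum(energy))
}
```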

Theorem 6

(Noise Suppression via Truncation) Let \(x_i'(t) = s(t) + \epsilon (t)\), where \(s(t)\) is a smooth signal and \(\epsilon (t) \sim \mathscr {N}(0, \sigma ^2)\). The expected noise energy in each mode \(k\) is:

$$\mathbb {E}\left[ (c_{ik}^{\epsilon })^2 \langle P_k, P_k \rangle _T\right] = \frac{\sigma ^2}{T}.$$

Truncating at \(N\) therefore reduces the retained noise power to \(\sigma ^2 \frac{N + 1}{T}\).

Proof

Noise coefficients \(c_{ik}^\epsilon = \frac{\langle \epsilon , P_k \rangle _T}{\langle P_k, P_k \rangle _T}\) are Gaussian with variance:

$$\mathbb {E}[(c_{ik}^\epsilon )^2] = \frac{\operatorname {Var}(\langle \epsilon , P_k \rangle _T)}{\langle P_k, P_k \rangle _T^2} = \frac{\sigma ^2}{T \langle P_k, P_k \rangle _T},$$

since \(\operatorname {Var}(\langle \epsilon , P_k \rangle _T) = \frac{\sigma ^2}{T^2} \sum _{t=1}^T P_k(t)^2 = \frac{\sigma ^2}{T} \langle P_k, P_k \rangle _T\). Each mode therefore carries expected noise energy \(\mathbb {E}[(c_{ik}^\epsilon )^2] \langle P_k, P_k \rangle _T = \sigma ^2 / T\). Summing over the retained modes \(k \le N\) gives \(\sigma ^2 (N + 1)/T\), while the discarded modes account for the remaining \(\sigma ^2 (T - N - 1)/T\) of the total noise energy \(\sigma ^2\).

\(\square\)

Connection to signal processing and machine learning

Theorem 7

(Legendre vs. Fourier for Non-Periodic Signals) Let \(f(t)\) be a non-periodic signal on \([-1, 1]\). The Legendre approximation error decays exponentially for analytic \(f\), while the Fourier series suffers from Gibbs oscillations:

$$\Vert f - f_N^{\text {Leg}}\Vert _{L^2} \le C \rho ^{-N}, \quad \Vert f - f_N^{\text {Fou}}\Vert _{L^2} \ge C N^{-1/2}.$$

Proof

For Legendre, use Theorem 2 (analytic case). For Fourier, the Gibbs phenomenon at boundaries induces \(\mathscr {O}(N^{-1/2})\) error for discontinuous derivatives. \(\square\)

Feature weighting in ensemble classifiers

The Legendre energy \(E_i\) provides a rotationally invariant feature vector. For an ensemble of \(M\) classifiers, assign weights \(w_k = \frac{\langle P_k, P_k \rangle _T}{\sum _{j=0}^N \langle P_j, P_j \rangle _T}\) to each mode \(k\). The weighted energy \(\tilde{E}_i = \sum _{k=0}^N w_k c_{ik}^2\) prioritizes stable, low-frequency components.

Corollary 2

(Dimensionality Reduction) Truncating at \(N = \lfloor \log T \rfloor\) preserves \(95\%\) of the signal energy while reducing dimensionality from \(T\) to \(\mathscr {O}(\log T)\).

Legendre energy-weighted random forest (LEW-RF)

A random forest (RF) comprises \(B\) decision trees \(\{T_b\}_{b=1}^B\), each trained on a bootstrapped subset \(\mathscr {D}_b\) of the dataset \(\mathscr {D}\). At each node \(n\), a split is determined by maximizing the impurity reduction \(\Delta I\), defined as:

$$\Delta I(n) = I(n) - \left( \frac{|n_L|}{|n|} I(n_L) + \frac{|n_R|}{|n|} I(n_R) \right) ,$$

where \(I(n)\) is the impurity measure (e.g., Gini impurity \(G(n) = 1 - \sum _{y=1}^C p_y^2\), entropy \(H(n) = -\sum _{y=1}^C p_y \log p_y\)) at node \(n\), \(|n|\) is the sample count at \(n\), and \(|n_L|, |n_R|\) are the counts in the left/right child nodes. The probabilities \(p_y = \frac{1}{|n|}\sum _{t \in n} \mathbb {I}(y_t = y)\) denote the proportion of class \(y\) samples in \(n\).

Energy-weighted splitting criterion

To integrate Legendre energy into RFs, we redefine the splitting criterion to prioritize features with high temporal discriminative power. Let \(E_j = \sum _{k=0}^N c_{jk}^2 \langle P_k, P_k \rangle _T\) denote the Legendre energy of feature \(j\), where \(c_{jk}\) are the Legendre coefficients and \(\langle P_k, P_k \rangle _T = \frac{1}{T}\sum _{t=1}^T P_k(t)^2\). The energy-weighted impurity reduction for feature \(j\) is:

$$\Delta I_{\text {weighted}}(j) = \frac{E_j}{\sum _{j'=1}^M E_{j'}} \cdot \Delta I_j,$$

where \(\Delta I_j\) is the standard impurity reduction for feature \(j\). This criterion amplifies features where high energy (strong temporal trends) aligns with high discriminative power (large \(\Delta I_j\)).
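A direct sketch of this criterion, with Gini impurity as defined above and helper names of our own:

```r
# Sketch: energy-weighted impurity reduction for a candidate split.
gini <- function(y) 1 - sum((table(y) / length(y))^2)

impurity_reduction <- function(y, x, cut) {     # standard Delta I for one split
  left <- y[x <= cut]; right <- y[x > cut]
  gini(y) - (length(left) / length(y)) * gini(left) -
    (length(right) / length(y)) * gini(right)
}

weighted_impurity_reduction <- function(y, x, cut, E_j, E_all) {
  (E_j / sum(E_all)) * impurity_reduction(y, x, cut)  # (E_j / sum E) * Delta I_j
}
```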

Formal analysis of feature selection

Lemma 4

(Feature Selection Probability Distribution) Let \(\mathscr {F} = \{1, \dots , M\}\) be the feature set. The probability of selecting feature \(j\) at a node is given by its normalized energy:

$$P(j) = \frac{E_j}{\sum _{j'=1}^M E_{j'}}.$$

This mechanism favours features with higher energy, \(E_j\), which is a proxy for their potential discriminative power. We define the set of discriminative features \(\mathscr {D} \subset \mathscr {F}\) as those that meet two criteria:

  1. High Energy: \(E_j \ge \eta\).

  2. High Information Gain: \(\Delta I_j \ge \delta\).

Conversely, non-discriminative features \(j \notin \mathscr {D}\) are those whose expected contribution (the product of their energy and information gain) is bounded above:

$$\mathbb {E}[E_j \Delta I_j] \le \gamma .$$

Under these definitions, the probability of selecting any discriminative feature is bounded below by:

$$P(j \in \mathscr {D}) = \frac{\sum _{j \in \mathscr {D}} E_j}{\sum _{j=1}^M E_j} \ge \frac{|\mathscr {D}| \eta }{|\mathscr {D}| \eta + (M - |\mathscr {D}|)\gamma }.$$

Proof

Define the total energy \(E_{\text {total}} = \sum _{j=1}^M E_j\).

  • For any discriminative feature \(j \in \mathscr {D}\), we have \(E_j \ge \eta\) by definition. Therefore, the sum of their energies is bounded below:

    $$\sum _{j \in \mathscr {D}} E_j \ge |\mathscr {D}| \eta .$$
  • For non-discriminative features \(j \notin \mathscr {D}\), we lack a direct bound on \(E_j\) but have a bound on its expected product with information gain. To proceed, we assume that the energy of any non-discriminative feature is itself bounded by \(\gamma\), i.e., \(E_j \le \gamma\) for all \(j \notin \mathscr {D}\). This is a stronger but practically reasonable condition, as a feature with very high energy would likely be discriminative. Thus:

    $$\sum _{j \notin \mathscr {D}} E_j \le (M - |\mathscr {D}|)\gamma .$$

Combining these bounds, we get

$$E_{\text {total}} \le |\mathscr {D}| \eta + (M - |\mathscr {D}|)\gamma .$$

The probability of selecting a discriminative feature is:

$$P(j \in \mathscr {D}) = \frac{\sum _{j \in \mathscr {D}} E_j}{E_{\text {total}}} \ge \frac{|\mathscr {D}| \eta }{|\mathscr {D}| \eta + (M - |\mathscr {D}|)\gamma }.$$

\(\square\)

Theorem 8

(Exponential Convergence of Feature Selection) Under the conditions of Lemma 4, the probability that a non-discriminative feature is selected consistently over all trees in a forest decays exponentially with the number of trees \(B\). Specifically, the probability that no discriminative feature is selected in any of the \(B\) trees is bounded by:

$$P\left( \text {No feature from } \mathscr {D} \text { is selected in } B \text { trees}\right) \le \left( 1 - \frac{|\mathscr {D}| \eta }{|\mathscr {D}| \eta + (M - |\mathscr {D}|)\gamma }\right) ^B.$$

This guarantees that for a sufficiently large forest (\(B \gg 1\)), the model will almost certainly leverage the true discriminative features, leading to robust performance.

Proof

The event of not selecting a discriminative feature in a single tree has probability \(1 - P(j \in \mathscr {D})\). Since trees are built independently, the probability of this event happening across all \(B\) trees is:

$$P(\text {No } \mathscr {D} \text { in } B \text { trees}) = \prod _{b=1}^B \left( 1 - P(j \in \mathscr {D})\right) .$$

From Lemma 4, we have a lower bound \(P(j \in \mathscr {D}) \ge \frac{|\mathscr {D}| \eta }{E_{\text {total}}}\). Therefore:

$$1 - P(j \in \mathscr {D}) \le 1 - \frac{|\mathscr {D}| \eta }{E_{\text {total}}}.$$

Substituting the upper bound for the total energy,

$$E_{\text {total}} \le |\mathscr {D}| \eta + (M - |\mathscr {D}|)\gamma ,$$

we get:

$$1 - P(j \in \mathscr {D}) \le 1 - \frac{|\mathscr {D}| \eta }{|\mathscr {D}| \eta + (M - |\mathscr {D}|)\gamma }.$$

Raising this to the power of \(B\) completes the proof:

$$P(\text {No } \mathscr {D}) \le \left( 1 - \frac{|\mathscr {D}| \eta }{|\mathscr {D}| \eta + (M - |\mathscr {D}|)\gamma }\right) ^B.$$

\(\square\)

Remark 4

(Practical Determination of Thresholds \(\eta\), \(\delta\), and \(\gamma\)) The thresholds \(\eta\), \(\delta\), and \(\gamma\) are not hyperparameters to be set arbitrarily before training. Instead, they are implicit, data-driven properties that characterize the feature set. Their practical meaning and determination are as follows:

  1. Energy Threshold (\(\eta\)): The minimum energy level a feature must have to be considered potentially useful. It separates high-variance features from low-variance, likely uninformative ones. \(\eta\) can be set based on the empirical distribution of feature energies. A common practice is to choose a percentile (e.g., the 75th percentile) of the energy values or to set it as a multiple of the median energy. This is analogous to setting a noise floor.

  2. Information Gain Threshold (\(\delta\)): The minimum information gain required for a feature to be considered truly discriminative. A feature with high energy but low information gain might be noisy or redundant. \(\delta\) can be determined statistically. One can perform a preliminary analysis (e.g., on a holdout set or via cross-validation) to find the minimum information gain that leads to a statistically significant improvement in node purity. Alternatively, it can be chosen from the distribution of information gains across all features and splits.

  3. Non-Discriminative Upper Bound (\(\gamma\)): An upper bound for the expected contribution \(\mathbb {E}[E_j \Delta I_j]\) of any non-discriminative feature. It quantifies the “noise level” from irrelevant features. \(\gamma\) is typically observed empirically. After calculating \(E_j \Delta I_j\) for all features, \(\gamma\) can be set as a high percentile (e.g., the 90th or 95th) of the values for features not meeting the \(\eta\) and \(\delta\) criteria. This ensures the bound holds for most non-discriminative features.

In practice, one computes \(E_j\) and \(\Delta I_j\) for all features, selects \(\eta\) and \(\delta\) from their empirical distributions (e.g., using percentiles), defines the discriminative set \(\mathscr {D} = \{j: E_j \ge \eta , \, \Delta I_j \ge \delta \}\), and sets \(\gamma\) as a high percentile of \(E_j \Delta I_j\) among the remaining features. This procedure operationalizes the thresholds in a data-driven manner with minimal additional computation.
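The following sketch operationalizes this procedure. The 75th and 90th percentiles follow the text; `X`, `y`, `legendre_energy`, and `impurity_reduction` are the assumed objects and helpers from our earlier sketches, with a crude median-split gain standing in for per-feature \(\Delta I_j\):

```r
# Sketch: data-driven thresholds eta, delta, gamma from empirical distributions.
E  <- apply(X, 2, function(x) legendre_energy(x, N = 3)$total)
dI <- apply(X, 2, function(x) impurity_reduction(y, x, median(x)))

eta   <- quantile(E, 0.75)               # energy threshold
delta <- quantile(dI, 0.75)              # information-gain threshold
D     <- which(E >= eta & dI >= delta)   # discriminative set
nonD  <- setdiff(seq_along(E), D)
gamma <- quantile((E * dI)[nonD], 0.90)  # bound for non-discriminative features
```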

Generalization error analysis

Theorem 9

(Generalization Bound for LEW-RF) Let \(\mathscr {H}\) be the hypothesis class of LEW-RF with \(B\) trees and discriminative feature set \(\mathscr {D}\). For any \(\delta > 0\), with probability \(1 - \delta\):

$$L({\hat{y}}_{\text {LEW-RF}}) \le {\hat{L}}({\hat{y}}_{\text {LEW-RF}}) + \sqrt{\frac{2 \rho _{\text {LEW-RF}} \sigma _{\text {LEW-RF}}^2 \log (1/\delta )}{B}} + \frac{3 \log (1/\delta )}{B},$$

where \(L\) is the true error, \({\hat{L}}\) the empirical error, \(\rho _{\text {LEW-RF}}\) the tree correlation, and \(\sigma _{\text {LEW-RF}}^2\) the single-tree variance.

Proof

Using the PAC-Bayesian framework, let \(Q\) be the posterior distribution over trees weighted by \(E_j\). The KL-divergence between \(Q\) and the prior \(P\) satisfies:

$$\text {KL}(Q \Vert P) \le \frac{B}{2} \left( \frac{\eta \delta }{\gamma }\right) ^2.$$

By the PAC-Bayes theorem:

$$L(Q) \le {\hat{L}}(Q) + \sqrt{\frac{\text {KL}(Q \Vert P) + \log (1/\delta )}{2B}}.$$

Substituting the KL bound and simplifying yields the result. \(\square\)

Variance-covariance analysis

Theorem 10

(Covariance Structure of LEW-RF) Let \(\varvec{\Sigma } \in \mathbb {R}^{B \times B}\) be the covariance matrix between tree predictions. For LEW-RF:

$$\varvec{\Sigma }_{ij} = {\left\{ \begin{array}{ll} \sigma _{\text {LEW-RF}}^2 & \text {if } i = j, \\ \rho _{\text {LEW-RF}} \sigma _{\text {LEW-RF}}^2 & \text {otherwise}. \end{array}\right. }$$

The eigenvalues \(\lambda _k\) of \(\varvec{\Sigma }\) satisfy:

$$\lambda _1 = \sigma _{\text {LEW-RF}}^2 [1 + (B - 1)\rho _{\text {LEW-RF}}], \quad \lambda _2 = \dots = \lambda _B = \sigma _{\text {LEW-RF}}^2 (1 - \rho _{\text {LEW-RF}}).$$

Proof

The covariance matrix for exchangeable trees is:

$$\varvec{\Sigma } = \sigma _{\text {LEW-RF}}^2 [(1 - \rho _{\text {LEW-RF}})\textbf{I} + \rho _{\text {LEW-RF}} \textbf{J}],$$

where \(\textbf{J}\) is the all-ones matrix. The eigenvalues follow from the spectral decomposition of \(\textbf{J}\). \(\square\)

Computational complexity

Lemma 5

(Time Complexity) The time complexity of LEW-RF is:

$$\mathscr {O}\left( B \cdot T \log T \cdot (M + N)\right) ,$$

where \(N\) is the Legendre polynomial degree. This includes:

  • \(\mathscr {O}(MNT)\) for Legendre transforms,

  • \(\mathscr {O}(B \cdot T \log T)\) for tree construction.

Proof

For each feature \(j\):

  1. Compute \(c_{jk}\): \(\mathscr {O}(NT)\) per feature (recurrence relation).

  2. Compute \(E_j\): \(\mathscr {O}(N)\).

Total Legendre cost: \(\mathscr {O}(MNT)\).

Each tree requires \(\mathscr {O}(T \log T)\) operations for splitting. With \(B\) trees:

$$\mathscr {O}(B \cdot T \log T) + \mathscr {O}(MNT) = \mathscr {O}\left( B \cdot T \log T \cdot (M + N)\right) .$$

\(\square\)

Stability analysis

Theorem 11

(Lipschitz Continuity of Splitting Criterion) The energy-weighted splitting criterion \(\Delta I_{\text {weighted}}(j)\) is Lipschitz continuous in \(E_j\) with constant \(L = \max _j \Delta I_j\):

$$|\Delta I_{\text {weighted}}(j) - \Delta I_{\text {weighted}}(j')| \le L |E_j - E_{j'}|.$$

Proof

For features \(j, j'\):

$$|\Delta I_{\text {weighted}}(j) - \Delta I_{\text {weighted}}(j')| = |E_j \Delta I_j - E_{j'} \Delta I_{j'}| \le \Delta I_{\max } |E_j - E_{j'}|,$$

where \(\Delta I_{\max } = \max _j \Delta I_j\). \(\square\)

Extended practical implications

  • Adaptive depth control: The optimal truncation degree \(N^*\) balances bias and variance:

    $$N^* = \arg \min _N \left( \underbrace{\sum _{k=N+1}^\infty \frac{2|c_{jk}|^2}{2k + 1}}_{\text {Bias}^2} + \underbrace{\frac{\sigma ^2 (N + 1)}{T}}_{\text {Variance}} \right) .$$
  • Dynamic feature pruning: Features with \(E_j < \epsilon\) are pruned, reducing the effective feature set to \(\tilde{M} = |\{j : E_j \ge \epsilon \}|\).

  • Cross-validation for \(\lambda\): The \(\ell _1\)-regularization parameter \(\lambda\) in sparse Legendre fits is chosen via:

    $$\lambda ^* = \arg \min _\lambda \sum _{b=1}^B \text {OOB-Error}(T_b(\lambda )).$$

Connection to information theory

Theorem 12

(Information Gain Maximization) LEW-RF maximizes the information gain \(IG(j) = H(y) - H(y|j)\) weighted by \(E_j\):

$$\mathbb {E}[IG_{\text {weighted}}] = \sum _{j=1}^M \frac{E_j}{E_{\text {total}}} \left( H(y) - H(y|j) \right) .$$

Proof

The standard information gain for feature \(j\) is \(IG(j) = \Delta H(j)\). Weighting by \(E_j\), the expected gain becomes:

$$\mathbb {E}[IG_{\text {weighted}}] = \sum _{j=1}^M P(j) IG(j) = \sum _{j=1}^M \frac{E_j}{E_{\text {total}}} IG(j).$$

\(\square\)

Remark 5

(Practical Implications)

  1. Noise Robustness: Weighting by \(E_j\) suppresses splits on high-frequency noise, encoded in higher-degree Legendre coefficients.

  2. Interpretability: Feature importance scores in LEW-RF reflect both discriminative power and temporal structure.

  3. Scalability: The \(\mathscr {O}(N)\) cost of Legendre transformations is negligible compared to tree induction, preserving RF’s \(\mathscr {O}(B \cdot T \log T)\) complexity.

The LEW-RF algorithms for model training and prediction are presented in Algorithms 1 and 2.

Algorithm 1. Legendre energy-guided random forest training.

Algorithm 2. Legendre energy-guided prediction.
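Since the algorithm figures are not reproduced here, the following R sketch conveys the training logic. It approximates the energy-weighted criterion by biasing ranger’s per-split feature sampling with the normalized energies, matching the selection probability \(P(j) = E_j / \sum _{j'} E_{j'}\) of Lemma 4; this is an approximation under our assumptions, not the authors’ exact implementation, and `X`, `y`, and `legendre_energy` follow our earlier sketches:

```r
# Sketch: LEW-RF-style training via energy-weighted feature sampling in ranger.
library(ranger)

E <- apply(X, 2, function(x) legendre_energy(x, N = 3)$total)
w <- E / sum(E)                          # P(j) = E_j / sum_j' E_j'

fit <- ranger(y = factor(y), x = as.data.frame(X),
              num.trees = 500,
              split.select.weights = w)  # energy-guided split-variable selection
pred <- predict(fit, as.data.frame(X))$predictions
```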

Simulation study scheme

To rigorously evaluate the proposed Legendre Energy-Weighted Random Forest (LEW-RF) against established methods for sequential data classification, we designed a comprehensive simulation study comprising two distinct scenarios. This approach is inspired by the framework of2 for time-series analysis and addresses a key methodological consideration regarding the comparison of global feature summarization models with sequential prediction models. The first scenario tests the model’s ability to identify simple, pre-defined polynomial structures in a small dataset, while the second, more rigorous scenario evaluates its performance on large-scale, complex sequences where its structural decomposition advantages are expected to become evident.

Scenario 1: simple polynomial trends with additive noise

The first experiment simulates sequential data with controlled, low-degree polynomial trends and noise structures, providing a clear, interpretable benchmark. This scenario emulates a simplified version of real-world detection problems where discriminative features exhibit specific temporal patterns obscured by sensor noise. We generated a synthetic dataset of \(T = 1000\) sequential observations, each containing \(m = 100\) features. The temporal axis for each sequence was normalized to the interval \(t \in [-1, 1]\). The binary response variable \(y\), representing two classes, was generated from a Bernoulli distribution with \(p = 0.5\) to ensure balanced class sizes. The feature matrix \(\textbf{X} \in \mathbb {R}^{T \times m}\) was then constructed as follows. For Class 1, simulating an event of interest, the first three features (\(j = 1, 2, 3\)) were generated by a dominant cubic polynomial trend combined with additive Gaussian noise: \(X_j(t) = 10t^3 + \epsilon\), where \(\epsilon \sim \mathscr {N}(0, \sigma _1^2)\) and \(\sigma _1 = 0.1\). This imposes a strong, non-linear temporal structure indicative of an event trigger. For Class 0, representing normal background conditions, the same first three features were generated by a linear trend with identical noise levels: \(X_j(t) = t + \epsilon\), where \(\epsilon \sim \mathscr {N}(0, \sigma _1^2)\). This creates a simpler, monotonic temporal pattern. The remaining 97 features (\(j = 4\) to \(j = 100\)) for both classes contain only independent and identically distributed Gaussian noise \(X_j(t) = \epsilon '\), where \(\epsilon ' \sim \mathscr {N}(0, \sigma _2^2)\) and \(\sigma _2 = 0.5\). These features simulate irrelevant sensors and background fluctuations, testing the model’s robustness to a high number of nuisance variables.

This design provides a direct test of a model’s ability to distinguish higher-degree polynomial patterns (cubic trends signifying an event) from linear trends (signifying normal conditions) amidst a high volume of noisy, irrelevant features, analogous to detecting subtle signal buildup trends23. To elucidate the structure of the generated data, Fig. 1 provides a visual representation of a representative subset of the sequences.
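A compact sketch of this generator, under our reading of the design above (row-wise labels over \(\textbf{X} \in \mathbb {R}^{T \times m}\)):

```r
# Sketch: Scenario 1 generator (cubic vs. linear trends plus nuisance noise).
set.seed(1)
T_len <- 1000; m <- 100
t <- seq(-1, 1, length.out = T_len)
y <- rbinom(T_len, 1, 0.5)                         # balanced binary labels
X <- matrix(rnorm(T_len * m, sd = 0.5), T_len, m)  # sigma_2 = 0.5 noise features
for (j in 1:3) {                                   # discriminative features j = 1..3
  signal <- ifelse(y == 1, 10 * t^3, t)            # cubic (Class 1) vs linear (Class 0)
  X[, j] <- signal + rnorm(T_len, sd = 0.1)        # sigma_1 = 0.1
}
```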

Scenario 2: large-scale complex sequences with temporal dependency

The second scenario evaluates performance on large-scale, complex sequences. This scenario is designed such that the advantages of structural decomposition for feature selection become critically important for efficient and accurate classification. We generated a massive synthetic dataset of \(T = 100,\!000\) observations with \(m = 50\) features. The key innovation lies in the generation of complex, class-dependent temporal patterns and the introduction of temporal dependency in the class labels themselves, moving beyond i.i.d. assumptions. The binary response variable \(y\) was generated with temporal dependency by dividing the sequence into \(B = 20\) contiguous blocks. The class label for each block \(b\) was randomly assigned as \(y_b \in \{0, 1\}\), and all observations within the block inherited this label, simulating real-world periods of sustained normal or event conditions. The feature matrix was generated by first creating a base matrix of Gaussian noise: \(\textbf{X}_{\text {base}} \sim \mathscr {N}(0, \sigma _3^2)\), with \(\sigma _3 = 0.5\). A subset of 10 informative features was then selected at random, and complex, class-specific temporal patterns were added to these features to make them discriminative.

For Class 0, the pattern added to informative feature \(j\) was defined as a mixture of periodic and polynomial components: \(P_0(t) = A_1 \sin (2\pi t) + A_2 t^3\), where \(A_1 = 0.8\) and \(A_2 = 0.5\). For Class 1, a different mixture was used: \(P_1(t) = A_3 \cos (3\pi t) + A_4 t^2\), where \(A_3 = 0.7\) and \(A_4 = -0.6\). The final value of an informative feature \(j\) for an observation in class \(c\) at time \(t\) is given by:

$$X_j(t) = \text {SNR} \cdot P_c(t) + X_{\text {base}, j}(t) + \epsilon _{c}(t)$$

where \(\text {SNR} = 2\) is the signal-to-noise ratio controlling the pattern strength, and \(\epsilon _{c}(t) \sim \mathscr {N}(0, 0.3^2)\) is additional class-specific noise. The remaining 40 non-informative features retain their pure noise structure. This scenario presents a far more challenging and realistic classification problem. Models must identify subtle, overlapping non-linear patterns within a vast sea of data, where the relevant signals are a complex mixture of trends and oscillations, not simple polynomials. This design directly tests the scalability of the LEW-RF method and provides a fairer basis for comparison with sequential models, as the large T allows the computational benefits of a global summarization approach to be realized and measured.
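Under the same reading, a sketch of the Scenario 2 generator:

```r
# Sketch: Scenario 2 generator (block labels, mixed periodic/polynomial patterns).
set.seed(2)
T_len <- 100000; m <- 50; n_blocks <- 20
t <- seq(-1, 1, length.out = T_len)
y <- rep(rbinom(n_blocks, 1, 0.5), each = T_len / n_blocks)  # block-wise labels
X <- matrix(rnorm(T_len * m, sd = 0.5), T_len, m)            # sigma_3 = 0.5 base noise
info <- sample(m, 10)                                        # informative features
P0 <- 0.8 * sin(2 * pi * t) + 0.5 * t^3    # Class 0 pattern
P1 <- 0.7 * cos(3 * pi * t) - 0.6 * t^2    # Class 1 pattern
for (j in info) {
  X[, j] <- X[, j] + 2 * ifelse(y == 0, P0, P1) + rnorm(T_len, sd = 0.3)  # SNR = 2
}
```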

Fig. 1
figure 1

Visualization of simulated sequential data. (a) Example sequences from Class 0 (normal conditions, blue). The solid lines represent the underlying linear trend (t) for the discriminative features. (b) Example sequences from Class 1 (exceedance events, red). The solid lines represent the underlying cubic trend (\(10t^3\)). (c) Sample of the non-discriminative, noisy features from both classes, shown in grey, highlighting the high-noise environment in which the models must operate.

Model specifications

Seven classifiers were compared:

  1. LEW-RF: Our proposed method with Legendre polynomial degree \(N = 3\), energy-weighted splits, and 500 trees.

  2. Standard RF: Implemented via ranger24 with \(\sqrt{p}\) split variables and 500 trees.

  3. LSTM/BiLSTM: Two-layer architectures using keras, with 64 hidden units, dropout (\(p = 0.2\)), and sigmoid activation4.

  4. SVM: Radial basis function kernel with \(C \in \{0.1, 1, 10\}\) tuned via grid search25.

  5. Logistic regression (LR): \(\ell _2\)-penalized with \(\lambda\) optimized by 10-fold CV.

  6. Decision tree (DT): CART algorithm with Gini impurity26.

  7. Gradient boosting (GBM): GBM implementation with 500 trees, depth 3, and learning rate 0.114.

Training protocol

Following the recommendations of6, we employed 10\(\times\)5 repeated stratified cross-validation. For the neural models (LSTM/BiLSTM), the sequences were reshaped into 3D tensors \((T \times m \times 1)\) and trained for 100 epochs with early stopping. All other models received flattened \(T \times m\) features as static inputs. Hyperparameters were tuned on 20% validation splits using the area under the ROC curve (AUC), with final evaluation on held-out test folds.

Evaluation and analysis

Performance was quantified via mean accuracy27, precision, recall, F1 score, area under the ROC curve (AUC)28,29,30, and computational time across 50 independent runs (10-fold cross-validation repeated 5 times) to ensure statistical robustness. Accuracy measures the overall proportion of correct predictions, defined as \(\text {Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\), where \(TP\) (true positives) and \(TN\) (true negatives) represent correctly classified harmful and normal ozone days, respectively, while \(FP\) (false positives) and \(FN\) (false negatives) denote incorrect predictions. Precision evaluates the model’s ability to avoid false alarms, calculated as \(\text {Precision} = \frac{TP}{TP + FP}\), penalizing overprediction of harmful ozone events. Recall (sensitivity) quantifies detection efficacy for minority-class instances, expressed as \(\text {Recall} = \frac{TP}{TP + FN}\), critical for minimizing missed hazardous days. The F1 score harmonizes precision and recall via their harmonic mean: \(\text {F1} = 2 \cdot \frac{\text {Precision} \cdot \text {Recall}}{\text {Precision} + \text {Recall}}\), ensuring balanced performance in imbalanced settings.

The AUC evaluates class separation capability across all classification thresholds, computed as the integral of the receiver operating characteristic (ROC) curve: \(\text {AUC} = \int _{0}^{1} TPR(FPR^{-1}(\tau )) \, d\tau\), where \(TPR = \frac{TP}{TP + FN}\) (true positive rate) and \(FPR = \frac{FP}{FP + TN}\) (false positive rate) are functions of the threshold \(\tau\). An AUC of 1 indicates perfect discrimination, while 0.5 represents random guessing. Computational time was measured as the mean wall-clock duration \(\bar{t} = \frac{1}{N} \sum _{i=1}^{N} t_i\), where \(t_i\) is the inference time for the \(i\)-th run and \(N = 50\). These metrics collectively assess not only predictive power but also operational viability for real-time environmental monitoring systems.
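For reference, a small sketch computing the threshold-based metrics above from predicted and true labels (AUC, which requires ranked scores rather than hard labels, is omitted):

```r
# Sketch: confusion-matrix metrics used in the evaluation.
classification_metrics <- function(y_true, y_pred) {
  TP <- sum(y_pred == 1 & y_true == 1); TN <- sum(y_pred == 0 & y_true == 0)
  FP <- sum(y_pred == 1 & y_true == 0); FN <- sum(y_pred == 0 & y_true == 1)
  precision <- TP / (TP + FP); recall <- TP / (TP + FN)
  c(accuracy  = (TP + TN) / (TP + TN + FP + FN),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}
```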

Implementation details

The simulation used R 4.3.1 with caret for unified preprocessing31 and the tensorflow package for LSTM/BiLSTM models via keras32. Legendre polynomials were computed using recurrence relations from33, with energy weights normalized to probabilities.

This scheme enables systematic comparison of temporal pattern recognition capabilities, computational demands, and noise robustness across model classes, addressing critical gaps identified in ozone detection literature34.

Simulation results

Scenario 1

Figure 2 displays the mean error of the Legendre Energy-Weighted Random Forest (LEW-RF) model as a function of the maximum degree of Legendre polynomials used in energy weighting, evaluated through repeated 10-fold cross-validation with the maximum degree ranging from 0 to 7. The mean error begins at approximately 0.47 for a maximum degree of 0, where only the constant term is considered, leading to poor performance due to the inability to capture any polynomial trends. It drops sharply to around 0.20 by a maximum degree of 2, indicating that incorporating linear and quadratic trends significantly enhances the model’s ability to identify class-specific patterns. Beyond degree 2, the error stabilizes, fluctuating slightly between 0.18 and 0.20 up to degree 7, suggesting that higher-degree polynomials (cubic and beyond) offer minimal additional benefit, as the dominant polynomial signals in the data are likely captured by degree 2, with diminishing returns for higher degrees.

Fig. 2
figure 2

Mean error of LEW-RF across varying maximum Legendre polynomial degrees using repeated 10-fold cross-validation.

Figure 3 illustrates the performance of the Legendre Energy-Weighted Random Forest (LEW-RF) and standard Random Forest (RF) models, evaluated through mean error across repeated 10-fold cross-validation, with respect to the number of trees (left) and the number of features sampled at each split (mtry, right). LEW-RF (red) consistently outperforms RF (cyan) in both settings, achieving a lower mean error across all tested values. On the left, LEW-RF’s error decreases steadily from 0.205 to around 0.188 as the number of trees increases from 50 to 500, while RF’s error fluctuates around 0.225, showing less improvement. On the right, with 500 trees fixed, LEW-RF maintains a mean error of about 0.188 as mtry increases from 1 to 5, whereas RF’s error decreases more gradually from 0.300 to 0.250, indicating that LEW-RF benefits more from optimal feature sampling, likely because its energy-based feature weighting enhances the selection of informative features.

Fig. 3
figure 3

Comparison of mean error for LEW-RF and RF models across varying numbers of trees and mtry (number of randomly subsampled predictors) values using repeated 10-fold cross-validation.

Figure 4 demonstrates that the performance of the LEW-RF model, as measured by mean error, is largely invariant to the choice of the sparsity regularization parameter \(\lambda\) across varying Legendre polynomial degrees \(N\). This indicates a high degree of robustness in the proposed method to the specific level of regularization for this dataset, provided \(N > 0\). Contrary to the theoretical expectation of Lemma 2, which posits that higher polynomial degrees would require regularization to mitigate ill-conditioning in the Legendre Gram matrix, the empirical results show virtually identical performance between the regularized (\(\lambda =1\)) and unregularized (\(\lambda =0\)) cases for all \(N > 0\). The characteristically high error observed at \(N=0\) is anticipated, as this base case reduces to a simple temporal average incapable of capturing dynamic patterns. The subsequent and sustained reduction in error for \(N \ge 1\) confirms that the model effectively capitalizes on the expressiveness of the Legendre polynomial basis. Crucially, the absence of any discernible performance gap between the two \(\lambda\) values suggests that, for the evaluated range of \(N \le 7\) and within this specific simulation context, the Gram matrix \(\textbf{G}\) did not reach a level of ill-conditioning severe enough to manifest in the numerical instability and performance degradation predicted by the lemma. This divergence between theory and practice implies that the theoretical trade-off, while valid, may only become operationally significant at polynomial degrees beyond those examined here, or that the LEW-RF algorithm’s structure inherently dampens the propagation of numerical instability from the weight calculation into the final forest model.

Fig. 4
figure 4

Comparison of mean error for LEW-RF with \(ntrees = 500\), \(mtry=\sqrt{p}\), and sparsity parameter \(\lambda \in \{0,1\}\) at various Legendre polynomial degrees \(N \in \{0,1,\dots ,7\}\) using repeated 10-fold cross-validation.

Table 1 shows that the proposed LEW-RF achieves state-of-the-art performance (81.2% accuracy, 86.4% AUC) in this challenging simulation of ozone-like pattern recognition, demonstrating remarkable resilience to both high noise (97 irrelevant features) and complex polynomial structures. The 5.3% accuracy advantage over standard RF (75.9%) stems from LEW-RF’s ability to isolate cubic trends (\(10t^3\)) through its Legendre energy weighting, a critical capability given that cubic polynomials require third-degree orthogonal basis functions for optimal representation33. While standard RF achieves high precision (96.9%) by conservatively predicting class 0 (linear trends), its poor recall (53.1%) reveals an inability to detect subtle cubic patterns amidst noise. In contrast, LEW-RF’s balanced precision-recall profile (84.2%/76.5%) confirms that its energy-guided splits preserve temporally distributed cubic signals that standard split criteria overlook. Deep learning models struggle profoundly in this high-noise regime: BiLSTM’s 73.1% accuracy and 74.4% AUC lag 8.1% and 12.0% behind LEW-RF, respectively. This aligns with theoretical expectations: LSTMs require clean long-range dependencies to leverage their memory cells4, but the 97 noise features create spurious short-term correlations that confuse temporal gates. The 85.94s runtime further disqualifies BiLSTM for real-time ozone monitoring compared to LEW-RF’s 0.68s inference. Notably, while gradient boosting (GBM) approaches LEW-RF’s accuracy (77.3%), its 5.2% lower F1 score (74.7% vs. 79.9%) exposes vulnerability to false positives from noise features; GBM’s sequential error correction amplifies mislabeled cubic patterns more than LEW-RF’s energy-weighted parallelism. Decision trees (DT) show surprising robustness (77.3% accuracy) but fall short on recall (70.6% vs. LEW-RF’s 76.5%), failing to propagate cubic signatures through deep splits. Linear models fail catastrophically: logistic regression (49.2% accuracy) performs near random chance, a consequence of its inability to model the nonlinear \(t^3\) terms critical for class separation, and SVM’s marginal improvement (57.1% accuracy) reflects its limited capacity to project the 100-feature sequences into separable subspaces amid pervasive random noise.

Table 1 Performance comparison of classification models on complex sequential data with cubic (class 1) vs. linear (class 0) polynomial trends and 97% noise features. Metrics represent means over 10\(\times\)5 repeated cross-validation (T=1,000 timesteps, m=100 features). LEW-RF: Legendre Energy-Weighted Random Forest; RF: Standard Random Forest; BiLSTM: Bidirectional LSTM; SVM: Support Vector Machine; LR: Logistic Regression; DT: Decision Tree; GBM: Gradient Boosting Machine.

The bar plot in Fig. 5 compares the permutation-based variable importance of the Legendre Energy-Weighted Random Forest (LEW-RF, red) and the standard Random Forest (RF, blue) on a synthetic dataset with 100 features, where the first three features (V1–V3) carry class-specific polynomial signals (cubic for Class 1, linear for Class 0) and the remaining features (V4–V100) are noise. LEW-RF assigns markedly higher importance to V1–V3, with values between 0.075 and 0.125, compared to RF’s scores of approximately 0.050, demonstrating LEW-RF’s ability to prioritize features with structured polynomial trends through its energy-weighting mechanism. Both models assign near-zero importance to the noise features (V4–V100), but LEW-RF’s sharper focus on the signal features highlights its advantage in distinguishing relevant polynomial patterns, consistent with the dataset’s design, in which only the first three features are informative for classification.

Fig. 5
figure 5

Variable importance comparison of LEW-RF and RF models using permutation importance on synthetic sequential data.

Scenario 2

The performance metrics for the complex large-scale sequence task (\(N = 100{,}000\)) in Table 2 reveal a critical insight: while all four models achieve remarkably similar predictive performance, with accuracy and AUC scores within a narrow 0.3% band, their computational profiles diverge dramatically, underscoring the distinct advantage of the Legendre Energy-Weighted Random Forest (LEW-RF). The BiLSTM, although it achieves the highest AUC, incurs a substantial computational cost, requiring nearly 70 seconds for training and more than 5.4 seconds for prediction, a latency that becomes prohibitive for real-time applications or iterative analysis on even larger datasets. The standard LSTM and Random Forest (RF) models offer moderate improvements, yet their prediction times of 3.8 and 2.6 seconds, respectively, still represent a significant computational burden. In stark contrast, LEW-RF delivers best-in-class computational performance, training nearly 7 times faster than the BiLSTM and, most impressively, generating predictions in just 0.81 seconds, over 6.7 times faster than the BiLSTM and 3.2 times faster than the standard RF. This efficiency stems from its feature-engineering core: by first decomposing the temporal sequences into their constituent Legendre polynomial energies, LEW-RF transforms a sequential classification problem into a static, feature-based one. The subsequent Random Forest then operates on a compact, information-rich summary of the entire sequence, bypassing the intricate step-by-step internal state management that characterizes and slows down recurrent models. Consequently, LEW-RF strikes a pragmatic optimum, matching the predictive power of sophisticated deep learning architectures while retaining the swift inference of tree-based models, establishing itself as a highly scalable and efficient solution for analyzing complex, large-scale sequential data.
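This decomposition step can be sketched in a few lines of R, continuing the simulation sketch above (reusing its `seqs` and `labels` and the `legendre_design()` helper from the conditioning sketch); the ridge-regularized least-squares form of the coefficients and the use of the `randomForest` package are our assumptions about a minimal reimplementation, not the authors’ exact code.

```r
library(randomForest)

# Energies of the Legendre expansion of one sequence x (degrees 0..N),
# via a regularized projection; legendre_design() as defined earlier.
legendre_energies <- function(x, N = 3, lambda = 0) {
  t <- seq(-1, 1, length.out = length(x))
  P <- legendre_design(t, N)
  coefs <- solve(crossprod(P) + lambda * diag(N + 1), crossprod(P, x))
  as.numeric(coefs)^2                    # energy per degree: E_0, ..., E_N
}

# Each sequence becomes one static row of energy features
# (first signal channel only, for brevity); a standard forest consumes them.
E <- t(sapply(seqs, function(X) legendre_energies(X[, 1])))
colnames(E) <- paste0("E", 0:3)
fit <- randomForest(x = as.data.frame(E), y = labels, ntree = 500)
```

Because the forest sees only \(N+1\) energy values per channel rather than the raw \(T\)-step sequence, prediction cost no longer scales with sequence length, which is consistent with the 0.81s inference time reported in Table 2.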

Table 2 Comparative performance on a complex large-scale sequential dataset (N=100,000).

Ground Ozone level prediction

Data description and preprocessing

The experimental analysis utilizes the Ozone Level Detection dataset sourced from the UCI Machine Learning Repository35, comprising multivariate time-series observations sampled at eight-hour intervals for atmospheric monitoring. The dataset includes 72 features capturing critical meteorological and atmospheric variables such as temperature, humidity, wind direction, geopotential height, and sea-level pressure, which collectively influence ozone formation and dispersion dynamics. With an initial 2,534 instances, the cleaned subset of 1,847 observations (after removing missing values) preserves sequential integrity while representing binary classification targets: Class 0 (“Normal”) and Class 1 (“Harmful Ozone”). A critical challenge arises from the severe class imbalance, where only 128 instances (6.93%) correspond to harmful ozone days (Class 1), compared to 1,719 normal days (93.07%) in Class 0. Such an imbalance risks biasing models toward majority-class overfitting36, which is particularly detrimental in environmental monitoring, where the accurate detection of rare, harmful events is paramount.

To mitigate this bias, the ROSE (Random Over-Sampling Examples) package in R37 was employed, which synthesizes a balanced dataset through a hybrid resampling strategy. ROSE addresses imbalance by (1) oversampling the minority class (harmful ozone days) via smoothed bootstrap resampling, generating synthetic instances within the feature-space neighbourhood of existing Class 1 samples, and (2) undersampling the majority class (normal days) by randomly selecting a subset of Class 0 instances proportional to the minority class size38. This dual approach prevents the overfitting associated with pure oversampling while retaining critical variance in the majority class. The synthetic instances for Class 1 were generated using a Gaussian kernel density estimator38, preserving the multivariate distribution of the original atmospheric features (temperature, humidity, etc.) to maintain ecological validity. Post-correction, the balanced dataset contained 1,280 instances (637 for Class 0 and 643 for Class 1), ensuring approximately equal class representation during model training. This preparation protocol directly addresses challenges in environmental time-series analysis, where accurate minority-class detection is critical for early public health interventions36.
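The balancing step can be reproduced with the ROSE package as sketched below; the data-frame name `train_fold`, the label column `Class`, and the seed are assumptions for illustration.

```r
library(ROSE)

set.seed(42)
# Smoothed-bootstrap oversampling of Class 1 plus undersampling of Class 0,
# drawing a balanced sample of 1,280 rows (p = 0.5 targets equal class shares).
balanced <- ROSE(Class ~ ., data = train_fold, N = 1280, p = 0.5)$data
table(balanced$Class)   # roughly 640 instances per class after resampling
```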

To rigorously prevent data leakage and ensure a temporally valid evaluation, a stratified chronological K-fold cross-validation procedure was employed. The dataset was first sorted by its inherent temporal order. For each cross-validation fold, the training set was balanced using the ROSE procedure, with this synthetic oversampling applied only to the chronological training block for that fold. Specifically, the data were partitioned into K sequential folds; at each iteration, folds 1 to \(k-1\) constituted the training set, fold \(k\) served as the validation set, and all subsequent folds (\(k+1\) to K) were held out as a test set representing future observations. This stratified approach maintained the relative proportion of the target class in each chronological segment. The method simulates a real-world scenario in which a model is trained on historical data and validated on a recent period before being evaluated on truly future data, providing a realistic assessment of predictive performance and temporal generalizability39. For hyperparameter tuning within the training set, a time-series cross-validation (TS-CV) scheme was implemented: rather than standard k-fold CV, which would violate the temporal structure, a rolling-origin validation was performed (sketched below), in which the training fold was incrementally expanded through time and the validation fold was set to immediately follow the training period, ensuring that no future information was used to predict the past40. This split protocol guarantees that the reported results are robust and free from temporal data leakage.
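A minimal sketch of the chronological split logic follows (with ROSE loaded as above); K, the block boundaries, and the `ozone` data-frame name are illustrative, and the key point is that resampling touches only the historical training block.

```r
K <- 5
n <- nrow(ozone)                                  # ozone: time-sorted data frame
bounds <- floor(seq(0, n, length.out = K + 1))    # K contiguous chronological blocks

for (k in 2:(K - 1)) {                            # final block always stays as future test
  train_idx <- 1:bounds[k]                        # folds 1 .. k-1 (history)
  valid_idx <- (bounds[k] + 1):bounds[k + 1]      # fold k (recent period)
  # Folds k+1 .. K remain untouched as future test data. ROSE is applied to
  # the training block only, so no synthetic point leaks future information.
  train_bal <- ROSE(Class ~ ., data = ozone[train_idx, ],
                    N = length(train_idx), p = 0.5)$data
  # ... fit on train_bal, validate on ozone[valid_idx, ], test on later folds ...
}
```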

Figure 6 illustrates the categorized temporal patterns of the top three most discriminative features identified through the Legendre energy-based feature selection process for predicting ozone day events. The features fall into three archetypal patterns based on their relationship with the target variable. Features exhibiting a strong cubic relationship, depicted by solid red lines, show a complex, non-linear temporal structure that is highly predictive of ozone days, suggesting that high ozone conditions develop through non-linear dynamics with specific inflection points. In contrast, features with a linear relationship, represented by dotted blue lines, show a mild decreasing trend in average wind speed (WSR) that is more strongly associated with normal days, indicating a more stable and predictable environmental state. Finally, features characterized by noise-like behaviour, illustrated with dashed black lines, display no discernible temporal structure or consistent relationship with the target, leaving them devoid of predictive power. To enable a unified comparison across all features, the temporal axis is normalized to the range [-1, 1]. The identification of these features began with the calculation of each feature’s Legendre energy, which measures how much of its temporal variability is captured by the Legendre polynomials; the normalized form of this energy is shown in Fig. 7. This analysis revealed, for instance, that the precipitation feature possessed the highest relative energy contribution, signifying an intensely non-linear and dynamic temporal pattern. Paradoxically, despite this high energy score indicating strong non-linearity, the feature demonstrated little utility during the subsequent random forest splitting process.

Fig. 6
figure 6

Categorization of predictive features by temporal trend for the Ozone dataset.

Fig. 7
figure 7

Normalized Legendre polynomial energy contribution from 72 features in the Ozone dataset.
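In the notation of the earlier sketches (our symbols, chosen for consistency rather than taken from the original derivation), the quantity plotted in Fig. 7 is the energy share of each feature:

\(E_j = \sum_{n=0}^{N} c_{j,n}^{2}, \qquad \tilde{E}_j = E_j \Big/ \sum_{k=1}^{72} E_k,\)

where \(c_{j,n}\) denotes the degree-\(n\) Legendre coefficient of feature \(j\). A large \(\tilde{E}_j\), as observed for precipitation, flags strong low-degree temporal structure but, as noted above, does not by itself guarantee useful splits in the forest.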

Comparative analysis of Ozone detection

Table 3 shows that the proposed Legendre Energy-Weighted Random Forest (LEW-RF) achieves state-of-the-art performance (97.0% accuracy, 97.1% F1) on the balanced ozone dataset, demonstrating exceptional capacity to detect harmful ozone events (Class 1 recall: 99.6%) while maintaining high precision (94.7%). This near-perfect recall is critical for environmental monitoring systems, where missing hazardous ozone days poses significant public health risks. The 0.4% accuracy improvement over standard RF (96.6%) may seem modest, but it corresponds to a 12% reduction in residual error (from 3.4% to 3.0%), attributable to LEW-RF’s energy-guided splits that amplify subtle temporal patterns in meteorological features (e.g., cubic relationships between temperature and ozone formation).

Despite their theoretical suitability for sequential data, both LSTM (85.8% accuracy) and BiLSTM (87.7%) underperform significantly, with computational costs 230–450\(\times\) higher than LEW-RF. This reflects two key issues: (1) The dataset’s limited temporal depth (eight-hourly sampling over days rather than years) provides insufficient long-range dependencies for LSTM cells to exploit4, and (2) Sensor noise in 72 features creates spurious short-term correlations that confuse recurrent layers.

The performance of the other competing traditional machine learning models reveals notable differences in effectiveness. The Gradient Boosting Machine (GBM) achieves an impressive accuracy of 94.4%, but posts a lower and more variable recall (98.2% vs. LEW-RF’s 99.6%), indicating a potential risk of over-weighting outlier patterns during boosting iterations. Similarly, the Support Vector Machine (SVM), at 93.5% accuracy, struggles with the high dimensionality of 72 features across sequential time steps, reflected in its 6.4% F1 gap relative to LEW-RF. Meanwhile, the Decision Tree model, at 88.5% accuracy, does not handle feature complexity effectively, displaying 14.6% lower precision than LEW-RF owing to its tendency toward greedy splits on noisy variables.

Table 3 Performance comparison of classification models on the preprocessed ozone detection dataset (T=1,280 balanced samples). Metrics represent means from 10\(\times\)5 repeated cross-validation. LEW-RF: Legendre Energy-Weighted Random Forest; RF: Standard Random Forest; BiLSTM: Bidirectional LSTM; SVM: Support Vector Machine; LR: Logistic Regression; DT: Decision Tree; GBM: Gradient Boosting Machine.

Figure 8 shows side-by-side bar plots comparing the permutation-based variable importance of the Legendre Energy-Weighted Random Forest (LEW-RF) and standard Random Forest (RF) applied to the ozone detection dataset. Both models consistently identify T14, T13, and T15 as the top three features for predicting ozone levels, with T14 exhibiting the highest importance at approximately 0.08, followed by T13 and T15 at around 0.06, suggesting that these temperature sensor features are critical for capturing ozone-related patterns. While the two models agree on the ranking of these key features, LEW-RF assigns slightly higher importance values to less prominent features (e.g., T11, WSR12, WSR0) than RF does, which may reflect LEW-RF’s sensitivity to subtle polynomial trends or noise in the data. The consistency in prioritizing T14, T13, and T15 underscores their robustness for ozone detection tasks, while the minor differences highlight LEW-RF’s distinctive weighting approach, potentially providing additional insight into feature relevance.

Fig. 8
figure 8

Variable importance comparison of LEW-RF and RF models using permutation importance on Ozone detection data.

Benchmark comparison with Ozone detection related studies

Table 4 provides a comprehensive benchmark, establishing the proposed Legendre Energy-Weighted Random Forest (LEW-RF) as the superior method for ozone level detection by achieving an optimal balance across all critical performance metrics. While other methods excel in isolated areas, they reveal significant, often critical, compromises that limit their practical utility. LEW-RF’s dominance is most apparent in its handling of the precision-recall trade-off, a common failure point in imbalanced environmental datasets. It achieves near-perfect recall (99.6%), the highest in the benchmark, ensuring that almost all true ozone events are detected, which is essential for public health. Crucially, it maintains this with a high precision of 94.7%, a feat unmatched by other high-recall models. For instance, the Anomaly Scoring Ensemble (ASE)41 achieves the highest accuracy (98.2%) but at a catastrophic cost to precision (63.2%), which would render it unusable due to an overwhelming number of false alarms. Similarly, GA+XGBoost42 achieves perfect recall (100%), but its lower accuracy (94.2%) and significantly longer runtime (27.02s) suggest underlying instability and computational inefficiency absent from our method. The benchmark reveals two distinct camps: methods competitive on accuracy but flawed in practice, and methods that are fast but ineffective. In the first group, the metaheuristic approaches HIDA43 and mWOA+kNN44 incur runtimes of 359.76s and 4.87s, respectively, and suffer from weaker F1-scores (40.0% and 90.3%), exposing limits to their generalization. In the second group, although kNN+GFM45 and SVM46 are fast, their lower F1-scores (83.8% and 93.3%) and AUC values confirm weaker overall discriminative ability. LEW-RF transcends this divide: it is not only the most accurate (97.0%) among the top-performing models but also the second-fastest at inference (0.33s), outpaced only marginally by kNN+GFM (0.27s) while offering a 13.3-point improvement in F1-score. Its best-in-class AUC (99.8%) confirms an unmatched capacity to model the complex, non-linear interactions that drive ozone formation, a task where purely linear models such as Interactive LASSO47 and LR+RFI48 demonstrably falter (AUC 95.5% and 93.9%).

Table 4 Comparative performance of ozone level detection methods on preprocessed UCI dataset (T=2534; 72 features). Abbreviations: GA – Genetic Algorithm; mWOA – Modified Whale Optimization Algorithm; ASE – Anomaly Scoring Ensemble; HIDA – Hybrid Improved Dragonfly Algorithm; GFM – Gaussian Fuzzy Membership; RFI – Random Forest Imputation; LEW-RF – Legendre Energy-Weighted Random Forest (Proposed).

Electroencephalography (EEG) Eye state classification

The EEG dataset, a benchmark for eye state classification, was sourced from the UCI Machine Learning Repository. It comprises a continuous recording obtained using a 14-channel Emotiv EEG Neuroheadset over a period of 117 seconds, yielding 14,980 samples. The corresponding eye state, open or closed, was manually annotated by simultaneous camera recording, with each frame labelled 1 (eyes closed) or 0 (eyes open). This dataset, originally compiled by Roesler51, provides raw electrical potential recordings from the scalp along with synchronized ocular state labels, facilitating supervised learning for eye state detection. It includes 14 features, each corresponding to a sensor channel, and has been widely used in numerous studies52 for recognising binary ocular states.
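For reproducibility, the dataset can be loaded in R as sketched below; the local filename and the `eyeDetection` label column reflect the UCI ARFF distribution, though readers should verify both against the downloaded file.

```r
library(foreign)   # provides read.arff() for the UCI ARFF distribution

eeg <- read.arff("EEG Eye State.arff")          # assumed local copy of the UCI file
stopifnot(nrow(eeg) == 14980, ncol(eeg) == 15)  # 14 channels + eye-state label
eeg$eyeDetection <- factor(eeg$eyeDetection)    # 1 = eyes closed, 0 = eyes open
```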

The comprehensive benchmarking results presented in Table 5 reveal a compelling performance profile for the proposed LEW-RF, which establishes itself as the optimal approach for EEG-based eye state recognition by balancing predictive excellence with computational efficiency. LEW-RF delivers strong performance across all key metrics (94.2% accuracy, 94.2% F1-score, 98.7% AUC) with remarkable efficiency (2.63 seconds), making it practically viable for real-time applications. Notably, LEW-RF achieves the best F1-score and AUC in the benchmark while being orders of magnitude faster than the approaches with comparable accuracy: it is roughly 700 times faster than HIDA (1930.14 seconds) and roughly 50 times faster than mWOA+kNN (96.4% accuracy, but requiring 130.95 seconds). This combination is particularly significant given LEW-RF’s near-perfect balance between precision (94.4%) and recall (93.9%), indicating exceptional reliability in both positive classifications and detection of true eye-state events, a critical requirement for applications such as drowsiness detection or brain-computer interfaces, where both false positives and false negatives carry serious consequences. The method’s robustness is further evidenced by its superior AUC (98.7%), confirming its capacity to distinguish between eye states despite the characteristically noisy EEG signal environment. While kNN+GFM remains marginally faster (1.76 seconds), it suffers from substantially lower accuracy (90.6%) and F1-score (89.5%), and the metaheuristic approaches (mWOA+kNN, HIDA) achieve comparable accuracy only at computationally prohibitive costs that render them impractical for real-world implementation. These results demonstrate that LEW-RF’s energy-weighted approach successfully captures the complex temporal patterns in EEG data while maintaining computational efficiency that surpasses even conventional ensemble methods such as XGBoost (22.95 seconds for inferior performance), establishing it as a premier choice for high-accuracy, real-time eye state recognition.

Table 5 Comparative performance of eye state recognition methods on the preprocessed UCI EEG Eye State dataset (\(T=\)14,980 samples, \(p=14\) features). Performance metrics include accuracy, precision, recall, F1-score, area under the ROC curve (AUC), and computational time.

The performance comparison between sequential learning models presented in Table 6 yields striking insights about the applicability of different modeling approaches to long-sequence EEG data. Contrary to theoretical expectations that deep learning architectures (LSTM and BiLSTM) would excel at capturing temporal dependencies in the lengthy 14,980-sample sequences, both recurrent neural networks demonstrated notably poor performance across all metrics (61-63% accuracy, 61-62% F1-score), performing barely above chance level for this binary classification task. This unexpected underperformance suggests either insufficient model capacity or limitations in the training data. In contrast, both tree-based methods achieved exceptional performance, with standard Random Forest (RF) attaining the highest absolute metrics (95.65% accuracy, 95.63% F1-score, 99.25% AUC) while the proposed LEW-RF method delivered nearly equivalent performance (94.19% accuracy, 94.16% F1-score) with substantially improved computational efficiency. Most remarkably, LEW-RF demonstrated a 3\(\times\) faster prediction time (0.36 seconds) compared to standard RF (1.08 seconds) and approximately 4-7\(\times\) faster than the deep learning models, while maintaining minimal training time requirements (2.27 seconds). This combination of high predictive performance and superior computational efficiency positions LEW-RF as particularly suitable for real-time eye state detection applications where low latency is critical. The results compellingly demonstrate that the proposed Legendre energy weighting mechanism effectively captures temporal patterns without the computational overhead of deep learning architectures or the slower inference times of conventional Random Forests, offering an optimal balance between accuracy and operational efficiency for processing long physiological time series.

Table 6 Performance of sequential learning models on the UCI EEG Eye State dataset (N=14,980). Metrics include accuracy, precision, recall, F1-score, area under the ROC curve (AUC), training time (seconds), and prediction time (seconds).

Discussion of results

The integration of Legendre polynomial transformations with ensemble learning establishes a theoretically grounded and empirically validated framework for sequential data analysis, effectively addressing the dual challenges of pattern recognition and computational efficiency across diverse domains. The simulation studies confirm the mathematical intuition underpinning LEW-RF: a quadratic polynomial degree \((N=2)\) optimally captures discriminative temporal structures, reducing mean error by 57% compared to the constant-term model \((N=0)\), while higher degrees yield diminishing returns. This empirically validates the theoretical trade-off described in Lemma 1, where higher polynomial degrees risk ill-conditioning without substantial performance gains. Crucially, the remarkable robustness to the regularisation parameter \(\lambda\) (with negligible differences across \(\lambda\) values when \(N>0\)) demonstrates that, for practical polynomial degrees, the theoretical concern about Gram matrix conditioning does not manifest as performance degradation, pointing to LEW-RF’s inherent stability.

In environmental monitoring applications, LEW-RF resolves a critical paradox in ozone detection: while deep learning architectures theoretically excel at temporal pattern recognition, their performance (85.8-87.7% accuracy) and excessive computation times (75-147s) reveal a fundamental mismatch with environmental data characteristics. The eight-hour sampling intervals create sparse temporal dependencies that fail to leverage LSTM’s sequential processing capabilities, while cross-sensitive sensors introduce non-stationary noise that confuses recurrent layers. LEW-RF circumvents these limitations through its dual capability: orthogonal polynomial projections denoise sequences while energy-guided splits prioritize stable meteorological interactions (e.g., T14 temperature cubic trends), achieving 97.0% accuracy with 0.33s inference. This performance advantage becomes particularly significant when considering the precision-recall tradeoffs that plague alternative approaches. Where metaheuristic optimization (mWOA/HIDA) and manual feature engineering (Interactive LASSO) achieve nominal accuracy parity (94.2-97.2%), they suffer from either catastrophic precision-recall imbalances (e.g., ASE’s 63.2% precision despite 98.2% accuracy) or prohibitive computational demands (HIDA’s 359.76s runtime).

The extension to EEG eye state classification further demonstrates LEW-RF’s domain-agnostic value. Here, the method achieved 94.2% accuracy with exceptional computational efficiency (2.63s), outperforming not only deep learning models (LSTM/BiLSTM at 61–63% accuracy) but also metaheuristic approaches that required 50–700\(\times\) more computation time for comparable performance. In particular, LEW-RF maintained a near-perfect precision-recall balance (94.4%–93.9%) in both application domains, confirming its consistent ability to detect minority-class events without excessive false alarms, a critical capability for both environmental monitoring and biomedical applications.

Regarding deployment in resource-constrained environments, the results demonstrate LEW-RF’s strong potential for implementation in edge computing. With inference times of 0.33s (ozone) and 0.36s (EEG), representing 228\(\times\) and 7\(\times\) speed advantages over BiLSTM alternatives, coupled with a minimal memory footprint from its compact Legendre coefficient representation, LEW-RF is uniquely positioned for microcontroller deployment. The method’s computational efficiency stems from its feature extraction approach: by transforming sequential analysis into static classification via energy coefficients, it avoids the memory-intensive state management of recurrent networks while maintaining superior accuracy. This efficiency, combined with consistent performance across domains, suggests that LEW-RF could operate effectively on low-power edge devices in distributed sensor networks, particularly in regions of the Global South with limited computing infrastructure.

Conclusion

This study has established the Legendre Energy Weighted Random Forest (LEW-RF) as a novel paradigm for sequential data classification that harmonizes mathematical rigor with practical computational efficiency. By formalizing the connection between Legendre polynomial energy and discriminative power, we have developed a framework that automatically identifies and prioritizes temporally structured features without manual engineering or heuristic weighting. The energy-guided splitting criterion fundamentally enhances Random Forest’s capability to recognize complex temporal patterns while maintaining the method’s inherent interpretability and computational advantages.

Empirical validation across controlled simulations, environmental monitoring, and biomedical applications demonstrates LEW-RF’s consistent superiority over both conventional machine learning approaches and specialized deep learning architectures. In ozone detection, it achieved best-in-class performance (97.0% accuracy, 99.6% recall) while operating 228\(\times\) faster than BiLSTM alternatives; in EEG eye state classification, it delivered 94.2% accuracy with 3\(\times\) faster inference than standard Random Forest. These results confirm that LEW-RF’s hybrid approach, blending orthogonal polynomial transformations with ensemble learning, effectively addresses the limitations of existing methods: it resists noise through mathematical filtering, avoids overfitting through energy-weighted feature selection, and maintains computational efficiency suitable for real-time deployment.

The methodological advances presented here carry significant implications for environmental monitoring and biomedical signal processing. LEW-RF’s precision-recall balance (94.7%-99.6% in ozone detection) directly supports regulatory applications where both false negatives and false alarms carry serious consequences. Its computational efficiency (0.33s inference) enables deployment in distributed sensor networks, potentially revolutionizing air quality monitoring in resource-constrained regions. The inherent interpretability of energy-based feature importance further allows domain experts to validate model decisions against known physical processes, bridging the gap between black-box predictions and scientific understanding.

While the current implementation uses fixed polynomial degrees, future work should explore adaptive degree selection and nonstationary extensions to handle evolving temporal patterns. Integration with causal discovery frameworks could further enhance LEW-RF’s utility for attribution studies in climate science and biomedical research. Nevertheless, the consistent performance demonstrated across diverse domains establishes LEW-RF as a robust foundation for physics-guided AI in sequential data analysis, offering a blueprint for developing accurate, efficient, and interpretable models for planetary stewardship and health monitoring.