Introduction

The global energy sector demands sustainable solutions to ease the present energy crisis. Population growth, urbanization, industrialization, and technological development are the primary drivers of the rise in global energy demand1. The finite nature of fossil fuels, the attendant environmental concerns, and the surge in global energy demand have caused a shift towards renewable energy resources. Among these resources, biomass is unique because it can be converted into solid, liquid, or gaseous fuels, offering a versatile and sustainable alternative for electricity generation, heating, and transportation. Biogas production from organic materials through anaerobic digestion provides a promising opportunity to address this challenge. Lignocellulose materials are identified as the most available feedstock for biogas production globally, and they occur in the form of grasses, softwood, hardwood, agricultural residues, and energy crops. This feedstock has comparative advantages over other biogas feedstocks because of its availability, low cost, lack of competition with the food supply, and relatively high yield2.

Lignocellulose materials have the potential to produce biogas during anaerobic digestion. However, their efficiency is limited by the slow breakdown of complex organic materials. Lignocellulose feedstocks comprise lignin, hemicellulose, and cellulose that are strongly bonded together, making the organic matter inaccessible to microbes during hydrolysis3. Therefore, pretreatment is required before anaerobic digestion to break down the bonds within the lignocellulose, increase the surface area, reduce crystallinity, and promote depolymerization. Alkali pretreatment is a chemical pretreatment that addresses this challenge by altering the morphological arrangement of lignocellulose materials, improving biomethane yield, and reducing retention time4. Several studies have reported on the anaerobic digestion (AD) of lignocellulosic biomass subjected to alkaline pretreatment. Alkali pretreatment of rice straw was reported to accelerate methane production and increase methane yield by 87.1%5. Corn stover was pretreated with 0.5% KOH at 60 °C for 12 h, and the methane released increased by 56.40% compared to the untreated substrate6. When NaOH was applied to rice straw and date palm, the optimum improvement was recorded at 6% w/w and 18% w/w, respectively7,8. Integrating alkali pretreatment into the anaerobic digestion of lignocellulose feedstock enhances the biogas release, reduces the retention time, and maximizes the potential of the process.

Feedstock composition, process parameters, and microbial dynamics greatly impact biogas production performance, rendering optimization a challenge. This challenge threatens the economic viability of the anaerobic digestion process. Hence, intelligent models have been applied and have demonstrated efficacy in feedstock management and real-time decision-making. Various mathematical models have been investigated for the prediction and optimization of the anaerobic digestion process, all showing the intricate nature of the process. These models include the Schnute, transfer, Gompertz, modified Gompertz, cone, logistic, first-order, and superimposed models9,10,11,12, and are characterized by the combination of statistical, theoretical, and analytical methods. These models serve varying purposes in determining hydrolysis kinetics, bacterial growth, inhibition rate, lag phase duration, and biogas prediction9,13. The major challenge with these traditional models is that they are deterministic and require prior knowledge to ensure accurate prediction, hence failing to capture the non-linearity in biogas data14. This limitation has motivated the exploration of more intelligent approaches based on artificial intelligence (AI) and machine learning (ML) to enhance the accuracy of anaerobic digestion process modeling15. ML is a branch of AI that learns and adapts from data, capturing the hidden trends in digestion and providing a more effective predictive model for anaerobic digestion16. The Adaptive Neuro-fuzzy Inference System (ANFIS) is a data-driven machine learning model that can potentially address the non-linearity and complicated relationships between the input variables17. The ANFIS model processes and analyzes data trends to forecast biogas release using statistical and advanced algorithmic methods. The model is trained with historical data using different process parameters such as pH, C/N ratio, pretreatment conditions, mixing ratio, and concentration. The model then identifies patterns and relationships in the data and uses them to predict biogas yield from new input data18. Fajobi et al.19 developed an ANFIS model to optimize and predict biogas yield from the anaerobic co-digestion of mango pulp, cow dung, and Chromolaena odorata using pressure and temperature as the input parameters. The lowest Root Mean Square Error (RMSE) of 14.37 and a coefficient of determination (R2) of 0.9978 were reported, indicating that the model explained 99.78% of the variance in the biogas yield. Anaerobic digestion and biogas yield from municipal solid waste were modeled using three models, namely kinetic, artificial neural network (ANN), and ANFIS, with digestion time, pH, moisture content, and volatile solids as the input parameters20. The three models were compared using their performance metrics, and ANFIS had the best metrics, with an RMSE of 0.670 and an R2 value of 0.999. Similarly, Chong et al.21 compared three models, namely response surface methodology (RSM), ANN, and ANFIS, to predict and optimize biogas and methane yield from palm oil mill effluent. The study considered reticulation ratio, temperature, and pH as the input parameters, and it was reported that the ANFIS model had the best prediction, with an R2 of 0.9791, a mean absolute error of 0.0730, and an RMSE of 0.1438. However, studies on optimizing and predicting biomethane yield are limited when the pretreatment conditions are considered as input parameters.

Interest in advanced computational techniques for extensive data-driven insights into biogas research technology has grown in recent years, owing to the complex microbial interactions in the bio-digestion process and the pretreatment dynamics. These computational approaches can comprehend and interpret system behavior, identify hidden patterns, and facilitate biomethane process optimization, thus aiding data-driven decision-making and serving as a validation method for experimental investigations. While previous studies have extensively researched the AD of lignocellulosic biomass with generic, black-box ML models, a robust framework that combines experimental investigations with advanced data analytics, including explainable AI (XAI) based on SHapley Additive exPlanations (SHAP) and advanced ML techniques, to quantify the individual and cumulative influence of these variables on biomethane yield remains less explored in biogas research. These advanced techniques improve biomethane predictions and make the prediction outcomes explainable, which deepens process understanding and provides actionable data-driven insights for the design and control of bioenergy systems. This study develops a novel integration of experimental and multimodal ML-based computational analysis to provide in-depth data-driven insights into the AD of Xyris capensis subjected to alkaline (NaOH) pretreatment. The experimental dataset was used for statistical analysis, correlation-based parameter profiling, SHAP-based feature ranking, cluster analysis and dimensionality reduction, and ML-based predictive modeling.
This study aims to investigate the impact of alkaline pretreatment on the biomethane yield of Xyris capensis through the following objectives: (i) experimental investigation of biomethane yield under different NaOH concentrations, exposure times, and digestion retention times in mesophilic anaerobic conditions for 35 days; (ii) assessment of the linear correlation between digestion parameters, pretreatment conditions, and biomethane yield; (iii) statistical assessment and visualization of the impact of alkaline pretreatment on biomethane yield using a two-sample independent t-test; (iv) SHAP-based feature ranking of digestion and pretreatment parameters; (v) cluster analysis of the bio-digestion operational dataset using k-means clustering integrated with Principal Component Analysis (PCA); and (vi) development of ANN, Support Vector Machine (SVM), random forest, and decision tree models for biomethane yield prediction. This research demonstrates the potential of data-driven approaches as powerful standalone tools and as vital complements to experimental investigations. By offering actionable intelligence, the study contributes to improved energy recovery and enhanced process control in the anaerobic digestion of lignocellulosic biomass.

Materials and methods

The methodological framework and approach used in this study are presented in Fig. 1. It encompasses experimental, statistical, and multimodal machine learning-based analysis for investigating the impact of alkaline pretreatment on the optimal methane yield from anaerobic digestion of Xyris capensis.

Fig. 1 Methodological framework of this research.

Materials sourcing

The Xyris capensis used for the research, which Nilsson first discovered in 1892 (ref. 22), was sourced locally in Limpopo Province, South Africa (24°40′S 30°20′E), chopped into smaller sizes (2–4 cm), and sun-dried to 25% moisture content. The dried sample was kept in a plastic bag in a well-ventilated and controlled environment in the laboratory (about 4 °C). The sample was then subjected to alkaline pretreatment before the pretreated and untreated samples were characterized for ultimate and proximate composition according to the Association of Official Analytical Chemists (AOAC) procedure23. Liquid digestate from a previous anaerobic digester, in which lignocellulose feedstock and wastewater were co-digested, was collected and used as the inoculum for the experiment. The inoculum was also stored in a controlled environment in the laboratory before characterization and anaerobic digestion.

Alkali pretreatment

Xyris capensis was pretreated with dilute NaOH to alter the recalcitrant characteristics of the substrate and improve biomethane production. The alkali was purchased locally from a supplier in Johannesburg, South Africa. The NaOH concentration, exposure time, and temperature were selected based on previous studies, with small adjustments to account for the morphological structure of Xyris capensis4. As presented in Table 1, NaOH pellets were dissolved in water at different concentrations, and the chopped Xyris capensis was soaked in the solution at the set temperature for the predetermined exposure times. The substrate was soaked at a solid-to-liquid ratio of 1:10 (w/v) and stirred continuously using a magnetic stirrer set at 200 rpm. When the treatment time was completed, the mixture was filtered to separate the solid from the liquid, and the solid was washed with water until a neutral pH, measured with a digital pH meter, was achieved. The pretreated feedstock was oven-dried for 6 h at 50 °C to reduce the moisture to an acceptable level and then stored in plastic bags in a fridge set at 4 °C before characterization and anaerobic digestion.

Table 1 NaOH pretreatment conditions.

Experimental setup

The experiment to investigate the biomethane potential of NaOH-pretreated and untreated Xyris capensis was carried out according to the VDI 4630 standard using the Automatic Methane Potential Testing System II (AMPTS II)24. Twelve 500-ml digester bottles were charged with 400 g of stable inoculum, and the pretreated and untreated substrates were added. The mass of feedstock added to each digester was calculated using Eq. 1, based on volatile solids (VS) at an inoculum-to-substrate ratio of 2:1. The digesters were loaded and labeled as shown in Table 1. The experiment was conducted at mesophilic conditions; therefore, the water bath in which the digesters were arranged was set at 37 ± 2 °C. Each treatment was run in duplicate, and the average value for each treatment was recorded. Two reactors filled with only inoculum were also run simultaneously to determine the residual gas produced by the inoculum, which was used for overhead correction. The gas generated from the inoculum-only digesters was deducted from the other yields to determine the actual volume of biomethane produced by the substrate alone. The AMPTS II software was set at 60 s on and 60 s off for the mixer, 10% CO2 flush gas, and 80% stirrer speed for the experiment. The headspace of the reactor was set at 100 ml, and the biomethane generation was projected at 60%25. To remove trapped oxygen and establish anaerobic conditions, each digester was purged with nitrogen gas for about 60 s. To purify the gas released, 75 ml of NaOH (3 M) solution in a 100 ml screw bottle was used. Silicone tubes were used to transfer the gas produced from the digesters directly to the purification unit, which was linked to the measuring unit, where the volume of biomethane released was recorded. The experiment was terminated on day 35 of the retention period when it was established that the daily biomethane release was below 1%.

$$M_{s}=\frac{M_{i}C_{i}}{2C_{s}}.$$
(1)

Where \(M_{i}\) is the inoculum mass (g), \(C_{i}\) is the inoculum concentration (%), \(M_{s}\) is the substrate mass (g), and \(C_{s}\) is the substrate concentration (%)24.
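As a worked illustration of Eq. 1, the short Python sketch below computes the substrate mass for one digester. The function name and the two concentration values are hypothetical and are not measurements from this study; only the 400 g inoculum mass is taken from the experimental setup described above.

```python
def substrate_mass(inoculum_mass_g, inoculum_conc_pct, substrate_conc_pct):
    """Eq. 1: M_s = (M_i * C_i) / (2 * C_s)."""
    if substrate_conc_pct <= 0:
        raise ValueError("substrate concentration must be positive")
    return (inoculum_mass_g * inoculum_conc_pct) / (2.0 * substrate_conc_pct)


# Hypothetical concentrations, chosen only to make the arithmetic easy to follow.
m_s = substrate_mass(inoculum_mass_g=400.0,
                     inoculum_conc_pct=2.0,
                     substrate_conc_pct=80.0)
print(f"substrate mass: {m_s:.2f} g")
```

With these illustrative inputs, the calculation is 400 × 2 / (2 × 80), so the digester would receive 5 g of substrate.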

Statistical and machine learning computational framework

Correlation analysis for the biodigestion and pretreatment parameters

The linear interrelationship between the variables of anaerobic digestion and pretreatment conditions was analyzed using a Pearson correlation matrix and visualized using the correlation heat map. This expresses the potential co-linearity amongst the variables and the positive and negative correlation of the key biodigestion and pretreatment variables to the methane yield.
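A minimal sketch of how such a Pearson correlation matrix could be computed is given below, using only the Python standard library. The column names and values are hypothetical placeholders and do not reproduce the experimental dataset; the heat map itself is not drawn here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def correlation_matrix(columns):
    """Nested dict holding r for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical toy columns (not the study's data).
data = {
    "naoh_conc": [1.0, 2.0, 3.0, 4.0, 5.0],
    "exposure_time": [10.0, 20.0, 30.0, 40.0, 50.0],
    "methane_yield": [5.0, 4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(data)
print(corr["naoh_conc"]["methane_yield"])
```

In this toy example the first two columns are perfectly positively correlated, while the third decreases linearly with the first, so its coefficient is negative. Real digestion data would give intermediate values.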

T-test analysis for assessment of pretreatment impact on biomethane yield

The impact of the alkaline (NaOH) pretreatment on the methane yield was statistically validated using a two-sample independent t-test. The t-test compared the average biomethane yields of the no-treatment and NaOH-pretreatment conditions, with the null hypothesis (H₀) assuming that no significant difference existed between the means. A significance level of α = 0.05 was employed. The test statistic was calculated according to Eq. 2.

$$t=\frac{\bar{X}_{1}-\bar{X}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}.$$
(2)

Where \(n_{1}\) and \(n_{2}\) are the sample sizes, \(\bar{X}_{1}\) and \(\bar{X}_{2}\) are the sample means, and \(s_{1}^{2}\) and \(s_{2}^{2}\) are the sample variances. A boxplot depicting the mean, variance, and standard deviation of biomethane yield across the two treatment categories was further used to visualize the result of the t-test.
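The test statistic in Eq. 2 can be sketched in a few lines of Python. The two yield samples below are hypothetical and serve only to show the arithmetic; the p-value, which additionally requires the t distribution, is not computed in this sketch.

```python
import math
from statistics import mean, variance


def t_statistic(sample_1, sample_2):
    """Eq. 2: difference of means over the unpooled standard error."""
    n_1, n_2 = len(sample_1), len(sample_2)
    # statistics.variance returns the sample variance (n - 1 denominator).
    standard_error = math.sqrt(variance(sample_1) / n_1 + variance(sample_2) / n_2)
    return (mean(sample_1) - mean(sample_2)) / standard_error


# Hypothetical yields for illustration only.
pretreated = [20.0, 22.0, 24.0]
untreated = [10.0, 12.0, 14.0]
t_stat = t_statistic(pretreated, untreated)
print(round(t_stat, 3))
```

Here both samples have a variance of 4, so the standard error is sqrt(4/3 + 4/3) and the difference of means is 10.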

SHapley additive explanations (SHAP)

SHAP is an additive feature attribution technique based on co-operative game theory. It measures the impact of every feature on the prediction outcome of a machine learning model by assigning an importance value to each model feature through Shapley values26,27. In this study, SHAP provided a robust model-agnostic approach to evaluate the contribution of pretreatment conditions and bio-digestion parameters to the prediction of the biomethane yield of Xyris capensis. The Shapley value of input feature \(i\), for a model with \(n\) input features, is calculated using Eq. 3.

$$\varphi_{i}=\sum_{S\subseteq N\setminus\left\{i\right\}}\frac{\left|S\right|!\left(n-\left|S\right|-1\right)!}{n!}\left[v\left(S\cup\left\{i\right\}\right)-v\left(S\right)\right].$$
(3)

Where \(\varphi_{i}\) is the Shapley value representing the importance of feature \(i\), \(n\) denotes the number of features, and \(N\) represents the set of all input features in the dataset. \(S\) is a subset of \(N\) that excludes feature \(i\), and \(v\left(S\right)\) is the model output obtained with the features in \(S\). The basic principle of the SHAP algorithm is that the sum of all feature contributions equals the difference between the model’s predicted value \(f\left(x\right)\) and the baseline \(\varphi_{0}\), as in Eq. 427.

$$f\left(x\right)=\sum_{i=1}^{n}\varphi_{i}\left(x\right)+\varphi_{0}.$$
(4)

The value of \(\varphi_{i}\left(x\right)\) indicates how much feature \(i\) shifts the model prediction away from the baseline \(\varphi_{0}\) for the data instance \(x\). The baseline value, \(\varphi_{0}\), represents the expected output. Exact computation of SHAP values is time-consuming, especially for high-dimensional datasets. However, numerous approximation algorithms have been established to make SHAP more practical in real-world applications. Examples are “Gradient SHAP”, “Kernel SHAP”, “Tree SHAP”, and “Deep SHAP”28.
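To make Eqs. 3 and 4 concrete, the sketch below computes exact Shapley values by enumerating every subset that excludes the feature of interest. The value function is a hypothetical toy game with three illustrative feature names, not the trained model of this study, and this brute-force enumeration is practical only for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values (Eq. 3) by enumerating all subsets S without i."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):  # |S| runs from 0 to n - 1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_value(subset):
    """Hypothetical value function: two main effects plus one interaction."""
    v = 0.0
    if "naoh_conc" in subset:
        v += 4.0
    if "exposure_time" in subset:
        v += 2.0
    if "naoh_conc" in subset and "exposure_time" in subset:
        v += 2.0
    return v


features = ["naoh_conc", "exposure_time", "retention_time"]
phi = shapley_values(features, toy_value)
baseline = toy_value(frozenset())          # plays the role of phi_0 in Eq. 4
full_output = toy_value(frozenset(features))
print(phi, baseline, full_output)
```

In this toy game the interaction term is shared equally between the two interacting features, the unused feature receives zero, and the contributions plus the baseline add up to the full output, which is the additivity property stated in Eq. 4.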

Principal component analysis

Principal Component Analysis (PCA) is a mathematical method employed to reduce the number of dimensions required to represent the features of data matrices. This approach represents the original matrix through a set of new uncorrelated variables known as principal components (PCs), which retain most of the variance in the biodigester dataset. The covariance matrix C is estimated from the mean-centered dataset using Eq. 5.

$$\text{C}=\frac{1}{N-1}\,\text{x}^{\text{T}}\text{x}.$$
(5)

where \(\text{x}\) is the mean-centered data matrix, C is the covariance matrix, and N is the number of observations. The eigen decomposition can be solved using Eq. 6.

$$\text{C}\text{v}_{i}=\lambda_{i}\text{v}_{i}.$$
(6)

where \(\lambda_{i}\) is the eigenvalue of the \(i\)th PC, while \(\text{v}_{i}\) is the corresponding eigenvector. Each PC accounts for a portion of the total variance, and the explained variance ratio (EVR) is computed as in Eq. 7.

$$EVR_{i}=\frac{\lambda_{i}}{\sum_{j=1}^{p}\lambda_{j}}.$$
(7)

Where p is the number of features. To determine the PC scores, the mean-centered data are projected onto the chosen eigenvectors using Eq. 8.

$$PC=\text{x}\text{v}_{k}.$$
(8)

where \(\text{v}_{k}\) is the matrix of the top k eigenvectors.
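The PCA steps in Eqs. 5 to 8 can be sketched without external libraries as shown below. This is a simplified illustration under stated assumptions: only the first principal component is extracted, power iteration is substituted for a full eigen decomposition, and the four two-feature observations are hypothetical.

```python
import math


def center(rows):
    """Subtract each column mean so that every feature has zero mean."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def covariance(xc):
    """Eq. 5: C = x^T x / (N - 1) for mean-centered data x."""
    n, p = len(xc), len(xc[0])
    return [[sum(r[a] * r[b] for r in xc) / (n - 1) for b in range(p)]
            for a in range(p)]


def leading_eigenpair(c, iterations=200):
    """Power iteration for the largest eigenvalue and eigenvector of C (Eq. 6).

    Assumes the uniform start vector is not orthogonal to that eigenvector.
    """
    p = len(c)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(c[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[a] * sum(c[a][b] * v[b] for b in range(p)) for a in range(p))
    return lam, v


# Hypothetical observations with two features each.
rows = [[12.0, 7.0], [8.0, 3.0], [11.0, 4.0], [9.0, 6.0]]
xc = center(rows)
c = covariance(xc)
lam1, v1 = leading_eigenpair(c)
# Eq. 7: the eigenvalues sum to the trace of C.
evr1 = lam1 / sum(c[j][j] for j in range(len(c)))
# Eq. 8: project the centered data onto the first eigenvector.
scores = [sum(r[j] * v1[j] for j in range(len(v1))) for r in xc]
print(round(evr1, 3), [round(s, 3) for s in scores])
```

For these toy data the covariance matrix has eigenvalues 16/3 and 4/3, so the first component explains 80% of the variance.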

Biodigester operational clusters analysis (k-means clustering)

k-means clustering was used to identify distinct operational clusters within the biodigester operational dataset, considering the key input variables involved. The k-means algorithm partitions the dataset into \(k\) non-overlapping clusters by minimizing the within-cluster sum of squares (WCSS), effectively grouping operational states with similar characteristics. In the iterative process, each data point \(x_{j}\) is assigned to the cluster with the nearest centroid \(\mu_{i}\), using the Euclidean distance in Eq. 9, while the cluster assignment is formalized using Eq. 10.

$$d\left(x_{j},\mu_{i}\right)=\sqrt{\sum_{m=1}^{p}\left(x_{jm}-\mu_{im}\right)^{2}}.$$
(9)
$$C_{j}=\arg\min_{i}\,d\left(x_{j},\mu_{i}\right).$$
(10)

Subsequent to the assignment, cluster centroids are adjusted by calculating the mean of all data points within each cluster with Eq. 11.

$$\mu_{i}=\frac{1}{\left|c_{i}\right|}\sum_{x_{j}\in c_{i}}x_{j}.$$
(11)

Where \(c_{i}\) represents the collection of data points allocated to cluster \(i\), while \(\left|c_{i}\right|\) is the number of data points within the cluster. The total within-cluster sum of squares (WCSS) is minimized as follows in Eq. 12.

$$WCSS=\sum_{i=1}^{k}\sum_{x_{j}\in c_{i}}\left\|x_{j}-\mu_{i}\right\|^{2}.$$
(12)

Iterations continue until the change in WCSS between iterations falls below a threshold. Clustering results were further validated by projecting the clustered data onto the first two principal components derived from PCA, enabling an intuitive interpretation of the operational regime.
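The assignment, update, and stopping steps described above can be sketched as follows. This is a bare-bones illustration, assuming fixed starting centroids and hypothetical two-dimensional points; the initialization scheme and tolerance used in the study are not stated in this section.

```python
import math


def kmeans(points, initial_centroids, max_iter=100, tol=1e-9):
    """Plain k-means: assign (Eqs. 9-10), update (Eq. 11), track WCSS (Eq. 12)."""
    centroids = [list(c) for c in initial_centroids]
    previous_wcss = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
        wcss = sum(math.dist(p, centroids[lab]) ** 2
                   for p, lab in zip(points, labels))
        # Stop when the change in WCSS falls below the threshold.
        if previous_wcss is not None and abs(previous_wcss - wcss) < tol:
            break
        previous_wcss = wcss
    return labels, centroids, wcss


# Hypothetical operating points forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids, wcss = kmeans(points, [points[0], points[3]])
print(labels, centroids, round(wcss, 4))
```

Because the starting centroids are passed in explicitly, the result is deterministic. In practice, the outcome of k-means depends on the initialization, so several restarts are commonly compared by their WCSS.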

Artificial neural network

An artificial neural network (ANN) is a machine learning model that mimics the human brain with interconnected nodes that transform data into an output29. The neurons are typically organized in layers, and the output of one layer functions as the input for the subsequent layer. The feed-forward neural network (FFNN) is a specific type of ANN in which data are transmitted from the input layer through the hidden layer(s) to the output layer without any feedback. An ANN has three types of layers: the input layer, the hidden layer(s), and the output layer. The input layer receives the inputs \(u_{l}\) \((l=1,2,\dots,p)\), the hidden layer(s) consist of neurons that transform these inputs, and the output layer produces the output \(y\). In Eq. 13, \(\xi_{i}\) represents the output of neuron \(i\) in the first hidden layer of a neural network with two hidden layers. The first hidden layer has \(m_{1}\) neurons, and the second has \(m_{2}\) neurons. Weights linking the input layer to the first hidden layer are labelled \(w_{il}^{1}\), those connecting the first hidden layer to the second are labelled \(w_{ij}^{2}\), and \(w_{i}\) are the output-layer weights, as expressed in Eqs. 13 and 14. The activation function for neurons in the first hidden layer is \(\psi_{i}\), and for neurons in the second hidden layer it is \(\varphi_{i}\); the terms \(u_{0}=1\), \(\psi_{0}\left(\cdot\right)=1\), and \(\varphi_{0}\left(\cdot\right)=1\) account for the biases30,31.

$$\xi_{i}=\psi_{i}\left(\sum_{l=0}^{p}w_{il}^{1}u_{l}\right),\quad u_{0}=1\ \text{and}\ \psi_{0}\left(\cdot\right)=1.$$
(13)
$$y=\sum_{i=0}^{m_{2}}w_{i}\varphi_{i}\left(\sum_{j=0}^{m_{1}}w_{ij}^{2}\xi_{j}\right),\quad\varphi_{0}\left(\cdot\right)=1.$$
(14)
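The forward pass expressed by Eqs. 13 and 14 can be sketched as below. The layer sizes, weights, and the choice of tanh as the activation function are assumptions made purely for illustration; the biases are stored separately rather than as the index-0 terms used in the equations, which is an equivalent formulation.

```python
import math


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W . inputs + b) for each neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(u, layer_1, layer_2, output):
    """Feed-forward pass through two hidden layers and a linear output."""
    xi = dense(u, layer_1["w"], layer_1["b"], math.tanh)        # Eq. 13
    hidden_2 = dense(xi, layer_2["w"], layer_2["b"], math.tanh)
    # Eq. 14: weighted sum of second-layer outputs plus the output bias.
    return sum(w * h for w, h in zip(output["w"], hidden_2)) + output["b"]


# Hypothetical network: 2 inputs, 2 neurons per hidden layer, 1 output.
layer_1 = {"w": [[0.5, -0.2], [0.1, 0.3]], "b": [0.0, 0.1]}
layer_2 = {"w": [[0.4, 0.6], [-0.3, 0.2]], "b": [0.05, -0.05]}
output = {"w": [1.0, -1.0], "b": 0.5}
y = forward([1.0, 2.0], layer_1, layer_2, output)
print(y)
```

A trained model would obtain the weights and biases by minimizing a loss over the training data; that fitting step is omitted here.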

Support vector machine

SVM is a supervised learning technique employed for classification and regression tasks. The primary advantages of SVM are its simplicity, computational efficiency, and capability to be trained with a small number of samples. Nonetheless, identifying the ideal kernel and its parameters presents the greatest obstacle. The basic idea of SVM is to maximize the geometric margin between two classes of data while simultaneously minimizing the empirical classification error32. For regression purposes, the function \(f\left(x\right)\) is estimated to describe the relationship between the input and output variables. The objective function is defined as in Eq. 15.

$$\min_{\omega,\,b,\,\xi_{i},\,\xi_{i}^{*}}\left(\frac{1}{2}\left\|\omega\right\|^{2}+C\sum_{i=1}^{n}\left(\xi_{i}+\xi_{i}^{*}\right)\right).$$
(15)

Subject to

$$\left\{\begin{array}{l}y_{i}-\langle\omega,\varphi\left(x_{i}\right)\rangle-b\le\epsilon+\xi_{i}\\ \langle\omega,\varphi\left(x_{i}\right)\rangle+b-y_{i}\le\epsilon+\xi_{i}^{*}\\ \xi_{i},\xi_{i}^{*}\ge 0\end{array}\right..$$
(16)

Where \(y_{i}\) denotes the biomethane yield and \(x_{i}\) represents the input variables (digestion and pretreatment). The slack variables for points outside the insensitive tube are represented as \(\xi_{i}\) and \(\xi_{i}^{*}\), \(\epsilon\) is the tube width, while \(C\) represents the regularization parameter. This study used a radial basis function (RBF) kernel. Thus, the non-linearity of the data is expressed as in Eq. 17.

$$K\left(x_{i},x_{j}\right)=\exp\left(-\gamma\left\|x_{i}-x_{j}\right\|^{2}\right).$$
(17)

Where \(\gamma\) is a kernel parameter, while the Euclidean distance between the samples \(x_{i}\) and \(x_{j}\) is \(\left\|x_{i}-x_{j}\right\|\). The final regression function to compute the biomethane yield is given by \(f\left(x\right)\) in Eq. 18.

$$f\left(x\right)=\sum_{i=1}^{n}\left(\alpha_{i}-\alpha_{i}^{*}\right)\exp\left(-\gamma\left\|x_{i}-x\right\|^{2}\right)+b.$$
(18)

Where \(b\) is the bias term, while \(\alpha_{i}\) and \(\alpha_{i}^{*}\) are the Lagrange multipliers obtained during training.
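The prediction step in Eqs. 17 and 18 can be illustrated with the sketch below. The support vectors, dual coefficients (the differences between the two multipliers), and bias are hypothetical numbers; in practice they come from solving the optimization problem in Eqs. 15 and 16, which is not implemented here.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Eq. 17: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Eq. 18: weighted sum of kernel evaluations plus the bias term."""
    return sum(coef * rbf_kernel(sv, x, gamma)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias


# Hypothetical fitted quantities for illustration only.
support_vectors = [[0.0, 0.0], [1.0, 3.0]]
dual_coefs = [2.0, -1.0]   # each entry stands for (alpha_i - alpha_i*)
bias = 1.0
prediction = svr_predict([0.0, 0.0], support_vectors, dual_coefs, bias, gamma=0.1)
print(prediction)
```

The query point coincides with the first support vector, so that kernel term equals 1, while the second support vector lies at a squared distance of 10 and contributes exp(-1) scaled by its coefficient.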

Decision tree

A decision tree (DT) is a non-parametric supervised machine learning model for both regression and classification tasks, characterized by a hierarchical structure with a root node, branches, internal nodes, and leaf nodes. It is a recursive partitioning structure for decision support that models decisions and their outcomes, including chance event outcomes, resource costs, and efficiency. Each internal node represents a test on an attribute (e.g., whether a coin flip comes up heads or tails), each branch represents the test outcome, and each leaf node represents a class label (or a predicted value in regression). At each stage, a DT uses information entropy to select the next appropriate variable for separating the set of objects. Unlike Bayesian models, a DT ignores the dependence assumption and classification sequence. A major benefit of DTs is that they generate simple classification rules. These rules help analyze sensor performance and extract features.

Random forest

Random forest (RF) is an ensemble learning technique that generates numerous DTs during the training process. The rationale behind the RF model is that numerous uncorrelated models collectively exhibit superior performance compared to their isolated functioning. For a classification problem, each tree provides a classification or a “vote,” and the forest selects the class with the most votes. For a regression task, the ensemble computes the mean of the outputs from all individual trees. The regression model is expressed in Eq. 19.

$$y=\frac{1}{T}\sum_{i=1}^{T}f_{i}\left(x\right).$$
(19)

Where \(f_{i}\left(x\right)\) denotes the prediction from the \(i\)-th tree, \(y\) represents the predicted biomethane yield, and \(T\) is the total number of DTs in the forest. The input vector \(x\in\mathbb{R}^{4}\) since there are 4 input variables in the biodigester dataset.

Model development, evaluations, and hyper-parameter settings

A careful selection of the key control parameters of machine learning models is an important step in achieving optimal prediction performance. Some of the key hyper-parameters of the four models developed are presented in Table 2. A two-hidden-layer architecture with 10 neurons in each layer was selected for the ANN. Owing to the complex non-linear relationships involved in the anaerobic digestion process, a single-hidden-layer architecture may be insufficient to capture the complex biochemical reactions and the varying parameters involved31. The RBF kernel was selected for the SVM based on its effectiveness in capturing nonlinear relationships. The gamma and epsilon values were both set at 0.1 to control the influence of individual data points and ensure smooth decision boundaries.

Table 2 Hyper-parameters defined for the models.

The prediction performance of the developed machine learning models was evaluated using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Variance Accounted For (VAF), computed using Eqs. 20–23.

$$MAPE=\frac{1}{N}\sum_{k=1}^{N}\left|\frac{{y}_{k}-\widehat{{y}_{k}}}{{y}_{k}}\right|\times 100\%$$
(20)
$$RMSE=\sqrt{\frac{\sum_{k=1}^{N}{\left[{y}_{k}-\widehat{{y}_{k}}\right]}^{2}}{N}}$$
(21)
$$MAE=\frac{\sum_{k=1}^{N}\left|\widehat{{y}_{k}}-{y}_{k}\right|}{N}$$
(22)
$$VAF=\left[1-\frac{var\left(\widehat{{y}_{k}}-{y}_{k}\right)}{var\left({y}_{k}\right)}\right]\times 100.$$
(23)
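
The four metrics can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in this study. It reads \({y}_{k}\) as the experimental value, \(\widehat{{y}_{k}}\) as the predicted value, and \(N\) as the number of samples, and it takes RMSE with the squared residual and VAF as a percentage. The sample values are hypothetical.

```python
import math
from statistics import pvariance

def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    return 100.0 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, y_hat))

def rmse(y, y_hat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(b - a) for a, b in zip(y, y_hat)) / len(y)

def vaf(y, y_hat):
    """Variance accounted for, in percent."""
    residuals = [b - a for a, b in zip(y, y_hat)]
    return (1.0 - pvariance(residuals) / pvariance(y)) * 100.0

# Hypothetical experimental and predicted yields (ml CH4/gVSadded)
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

print(rmse(y_true, y_pred), mae(y_true, y_pred))
print(mape(y_true, y_pred), vaf(y_true, y_pred))
```

Because VAF is a ratio of two variances, the choice between population and sample variance does not change its value as long as the same estimator is used in the numerator and denominator.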

Results and discussion

Effect of pretreatment on feedstock composition and cumulative biomethane yield

The result of the compositional analysis of the NaOH-pretreated and untreated feedstock is presented in Table 3. NaOH pretreatment influenced the composition of the feedstock, and the impact varied with the treatment conditions. The total solids (TS) of the feedstock were affected differently, but none of the treatment conditions reduced the TS below the acceptable level. Treatments P, R, and T have the same TS value, as do treatments Q and S. However, treatment U, which is the untreated sample, has a different value from the pretreated samples, which shows that pretreatment has an impact on the feedstock. Compared with previous studies, all the treatment conditions produced higher TS6, which is favourable for biomethane production. Volatile solids (VS) represent the organic matter available for microbial degradation during biomethane production, and the pretreatment applied was observed to influence the VS of the feedstock. However, all the values were more than 90%, which is an indication of good buffering capacity of the samples. These high percentages are expected to produce more biomethane, and they exceed what was reported for some lignocellulose biomasses3,33. The elemental composition of the substrate was also affected by pretreatment, as seen when the pretreated samples are compared with the untreated feedstock in Table 3. The C/N ratio is a major property of biomass that significantly influences biomethane release. Carbon is the food source for microbes, while a certain quantity of nitrogen is also needed for their growth. An imbalance in the C/N ratio can lead to digester failure because of the overaccumulation or underproduction of volatile fatty acids34. A C/N ratio of 20–30 is recommended for optimum production35. It can be observed from Table 3 that all the treatment conditions have a good C/N ratio. Pretreatment was observed to improve the C/N ratio of the feedstock compared to the untreated sample. The characterization results of this study align with a previous study in which different chemical pretreatments were applied to another lignocellulose biomass36.

Table 3 Physicochemical characteristics of NaOH-pretreated and untreated Xyris capensis.

The cumulative biomethane yield released from NaOH-pretreated and untreated Xyris capensis after 35 days of retention time is presented in Fig. 2. Total methane yields of 258.68, 287.80, 304.02, 328.20, 310.20, and 135.06 ml CH4/gVSadded were recorded for treatments P, Q, R, S, T, and U, respectively. This result indicates that pretreatment enhanced the biomethane yield by 91.53, 113.09, 125.10, 143.00, and 129.68% for treatments P, Q, R, S, and T, respectively, compared to the untreated Xyris capensis (treatment U). This finding agrees with previous studies that reported that pretreatment increases the biomethane yield of lignocellulose feedstock37,38,39. Biogas yield from anaerobic co-digestion of cow dung and food waste was improved by 28.1, 20.23, and 13.27% when pretreated with ultrasonic, autoclave, and microwave pretreatment methods, respectively40. Enzymatic pretreatment of corn stover increased the biomethane yield by 36.9% compared to the untreated substrate41. The optimum biomethane yield of 328.20 ml CH4/gVSadded was observed when 4% w/w of NaOH was utilized for 20 min at 90 °C, which indicates that a higher concentration with a shorter exposure time favours the pretreatment of Xyris capensis. Beyond this condition, further increases in the alkali concentration did not translate into a higher biomethane yield. This can be traced to the release of inhibitory compounds like acetic acid, furfural, and 5-hydroxymethylfurfural, which can hinder the activities of the microorganisms and subsequent biomethane release37. The reduction in biomethane yield beyond this condition could also be due to the over-degradation of hemicellulose in the substrate, which reduces the organic content available for biomethane release. The pretreatment decrystallized and expanded lignin-carbohydrate bonds of the substrate, thereby increasing the accessible surface area for enzymatic hydrolysis. The properties of NaOH alter the lignin structure and significantly redistribute/eliminate the lignin content from acetyl groups, hemicellulose, and uronic acid. The performance of the NaOH pretreatment depends on the percentage of lignin in the feedstock. It was noticed from the study that treatment alters the hydrolyzable links and improves saccharification, which increases solubilization and reduces crystallinity and polymerization, as reported in a previous study42. This result aligns with what was reported when rice straw, Napier grass, and hazelnut were pretreated with NaOH: the optimum yield was obtained with 4% NaOH, but at different treatment times and temperatures43,44,45. The variation in the treatment time and temperature can be linked to the variation in the morphological arrangement of the feedstock and the purity of the chemical used.
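
The percentage improvements quoted above follow directly from the reported cumulative yields. The short Python sketch below reproduces that arithmetic. The yield values are those stated in the text, while the function and variable names are illustrative only.

```python
# Cumulative yields reported in the text (ml CH4/gVSadded); U is untreated.
yields = {"P": 258.68, "Q": 287.80, "R": 304.02,
          "S": 328.20, "T": 310.20, "U": 135.06}

def improvement(treated, control):
    """Relative yield gain over the untreated control, in percent."""
    return (treated - control) / control * 100.0

gains = {name: round(improvement(value, yields["U"]), 2)
         for name, value in yields.items() if name != "U"}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"Treatment {name}: {gain:.2f}%")
print("Highest gain:", best)
```

For example, treatment S gives (328.20 − 135.06) / 135.06 × 100, which rounds to the 143.00% reported above.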

Fig. 2

Cumulative biomethane yield of NaOH-pretreated and untreated feedstock.

Statistical evaluation of digestion process parameters and pretreatment conditions

An in-depth statistical insight into the relationship between the key biodigester parameters and pretreatment conditions is vital for understanding the complex biochemical processes within the biodigester and the biomethane production trends. A correlation analysis was conducted between the critical operating and pretreatment parameters. The correlations highlight the multifactorial nature of the anaerobic digestion process, in which no single parameter solely determines methane output. Rather, physical, chemical, and biological factors interact collectively. Table 4 presents the statistical summary of the relevant parameters.

Table 4 Statistical summary of digestion and pretreatment parameters.

Figure 3 shows the magnitude and direction of the linear correlation between the key operating conditions and methane yield. The retention time of the digestion exhibits a weak positive correlation (0.13) with the methane yield, which suggests that prolonged digestion facilitates more thorough hydrolysis and methanogenesis. The correlation value for temperature (−0.04) indicates an essentially negligible linear relationship with methane yield. The NaOH concentration exhibits a mild negative correlation with methane yield. This may result from non-linear effects, since excessive NaOH might suppress microbial activity whilst appropriate concentrations improve digestibility. Consequently, the overall linear trend becomes weak or slightly negative. Likewise, exposure time exhibits a weak negative correlation with methane yield. Prolonged exposure to NaOH can result in over-degradation or the production of inhibitory chemicals, hence diminishing yield, so the duration of exposure must be chosen carefully. These findings underscore the significance of controlling both pretreatment duration and intensity to enhance biogas production from Xyris capensis.
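
The remark that a non-linear response can produce a weak linear correlation can be illustrated with a small Python sketch. The data below are hypothetical (not from this study) and describe a yield that peaks at an intermediate NaOH concentration. The Pearson coefficient is computed from its textbook definition.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient between two samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical inverted-U response: yield peaks at 3% NaOH
naoh = [1.0, 2.0, 3.0, 4.0, 5.0]
methane = [200.0, 290.0, 320.0, 290.0, 200.0]

r = pearson(naoh, methane)
print(round(r, 3))
```

Although the yield in this toy example clearly depends on concentration, the rising and falling halves cancel and the linear coefficient is essentially zero, which is why a correlation heat map alone can understate the influence of such a parameter.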

Fig. 3

Correlation heat map of the anaerobic digestion of Xyris capensis.

SHAP feature importance analysis

The non-linear interactions among digestion and pretreatment parameters such as exposure time, retention time, NaOH concentration, and temperature are further analyzed using SHAP-based feature-ranking. This approach reveals the contribution of each variable to the model output while giving insight into the operational factors that significantly affect biomethane yield. Figure 4 presents the SHAP values of each feature, indicating their contributions to the prediction (either positive or negative) and the degree of variability over various scenarios. Exposure time has the most significant effect, with a broad SHAP range of around −200 to +50. Extended exposure time generally lowers the model output, while shorter exposure time marginally enhances it. This inverse contribution corresponds with its mean SHAP value of −14.23, affirming it as a negatively impactful variable. The distribution indicates that long periods of exposure lead to decreasing biomethane production, potentially due to excessive breakdown or structural inhibition during pretreatment. Retention time is the second most significant feature, exhibiting a positive SHAP mean of +13.81. The model output increases at high retention times, indicating a direct association with biomethane generation. Extended digestion durations generally facilitate more thorough breakdown and methanogenesis, which is consistent with essential anaerobic digestion dynamics.

The concentration of NaOH has a moderate positive SHAP mean of +2.07. The trend in the figure shows that elevated concentrations tend to enhance the output, although with less consistency across the data points. This indicates a threshold or ideal range beyond which increased concentration may not benefit biomethane generation. Temperature exhibits a small positive SHAP value of +0.94, with little variability in Fig. 4. This indicates that temperature exerted a slight positive effect on biomethane prediction, which is consistent with the assumption that mesophilic temperatures promote anaerobic microbial activity. However, the model does not assign significant variability to temperature.
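
To make the notion of a signed per-feature contribution concrete, the sketch below computes exact Shapley values by brute-force enumeration of feature subsets. It is not the SHAP library and not the model used in this study: the linear toy model, its coefficients, the baseline, and the sample are all hypothetical, and "absent" features are simply set to their baseline values.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical toy model; inputs: retention time, temperature, NaOH, exposure
def toy_model(z):
    return 200.0 + 2.0 * z[0] + 0.5 * z[1] + 5.0 * z[2] - 1.5 * z[3]

sample = [30.0, 30.0, 4.0, 20.0]
baseline = [20.0, 25.0, 2.0, 40.0]
phi = shapley_values(toy_model, sample, baseline)
print(phi)
```

In this toy case, the exposure time of the sample is shorter than the baseline and the coefficient is negative, so exposure time receives a positive contribution. The contributions also sum to the difference between the model output for the sample and for the baseline, which is the additivity property that SHAP plots rely on.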

Fig. 4

SHAP values plot of the digestion and pretreatment parameters.

Beyond the individual average importance presented in the SHAP values in Fig. 4, the cumulative importance of each parameter gives useful insight into how feature effects vary across cases. Figure 5 presents the SHAP decision plot illustrating the cumulative importance of each feature. A trend that aligns with the observation in Fig. 4 was noted for each feature. Cases with high exposure time began at a high level and then dropped swiftly, resulting in a significant decline in the predicted value and reinforcing the strong negative impact established earlier. Furthermore, retention time and NaOH concentration augment model predictions, affirming their combined influence on the enhancement of biomethane from Xyris capensis. In numerous instances, temperature results in a negligible deviation from the baseline value, corresponding with its low average SHAP value.

Fig. 5

SHAP decision plot of the digestion and pretreatment parameters.

Dimensionality reduction of anaerobic digestion parameters using PCA

The PCA results reflect both the individual contributions of the digestion and pretreatment variables and the relationships and interdependence among them. This further reinforces the insights gained from the DTR-based feature importance by reducing the dataset’s dimensionality while retaining the bio-digestion dataset’s core structure. The PCA illustrated the variance inherent in the digestion dataset by identifying the dominant principal components (PCs) accountable for significant variability. The result of the PCA is presented in Table 5 and visually illustrated in Fig. 6. From Table 5, the first principal component (PC1) accounts for about 36.64% of the variance in the data, while 28.71% of the variance is attributed to the second principal component (PC2). The cumulative variance of PC1 and PC2 (65.35%) indicates that these two components alone are insufficient to represent the dataset’s structure. While PC3 and PC4 contribute less variance, the cumulative variance of PC1 to PC3 (86.64%) shows that the 3 PCs are enough to capture the variance and the structure in the dataset.

Table 5 PCA matrix of the anaerobic digestion parameter and pretreatment conditions.

Figure 6 (a) presents the explained variance and the cumulative variance for each component, expressed in percentage. The cumulative variance plot reveals that the first three PCs (PC1 to PC3) together capture nearly 87% of the variability. While PC4 contributes less, it is still considered significant. However, the dimensionality of the bio-digestion dataset has been reduced to 3 (PC1 to PC3), implying that these 3 PCs can substantially capture the variance in the dataset. Figure 6 (b) is the scree plot depicting each principal component’s eigenvalue, with a reference line at eigenvalue = 1 to indicate significant components. Only the eigenvalues of PC1 and PC2 exceed 1, whilst those of the other components fall below 1. According to the Kaiser criterion, only PCs with eigenvalues above 1 are significant46. This establishes PC1 and PC2 as substantial contributors to the dimensional structure. The PCA outcome has a significant implication for monitoring efforts in biodigester operation, which can focus on the parameters that load strongly on PC1 and PC2. From the PCA result, PC1 positively correlates with NaOH concentration and negatively with the NaOH contact time. PC2 is positively influenced by retention time and negatively related to NaOH concentration. PC1 is negatively influenced by retention time and temperature, while PC4 is negatively influenced by time and temperature.
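
The variance shares quoted in the text can be related to the Kaiser criterion with a short worked calculation. The sketch below rests on two assumptions that are not stated in the text: that the PCA was run on standardized variables, so that the four eigenvalues sum to approximately 4, and that the shares of PC3 and PC4 can be back-calculated from the reported cumulative values. These derived shares are not quoted from Table 5.

```python
from itertools import accumulate

# Reported in the text: PC1 and PC2 shares, and the PC1-PC3 cumulative share
pc1, pc2 = 36.64, 28.71
cum_pc1_to_pc3 = 86.64

# Derived under the stated assumptions (not quoted from Table 5)
pc3 = cum_pc1_to_pc3 - (pc1 + pc2)
pc4 = 100.0 - cum_pc1_to_pc3
shares = [pc1, pc2, pc3, pc4]

cumulative = list(accumulate(shares))

# With standardized inputs, eigenvalue = share x number of variables
n_variables = 4
eigenvalues = [share / 100.0 * n_variables for share in shares]

# Kaiser criterion: retain components whose eigenvalue exceeds 1
kaiser_selected = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]

print([round(c, 2) for c in cumulative])
print([round(ev, 4) for ev in eigenvalues])
print("Components retained by the Kaiser criterion:", kaiser_selected)
```

Under these assumptions, only the first two eigenvalues exceed 1, in line with the scree plot description, while the third component is still needed to reach the cumulative variance of about 87%.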

Fig. 6

Scree plot showing (a) explained and cumulative variance (b) eigenvalues relating to each principal component.

Operational cluster analysis of the digestion process

A k-means cluster analysis visualized via the PCA further revealed the dynamics of anaerobic digestion of Xyris capensis under NaOH pretreatment, as illustrated in Fig. 7. The spatial separation of the clusters along PC1 and PC2 indicates significant underlying feature diversity, with the operational parameters of Cluster 0 and Cluster 2 being the most different. Cluster 1 bridges the traits of the two extremes in a more transitional area. Each centroid is the average location of all observations in its cluster and therefore represents the “typical digestion regime” for that group. Every cluster shows a unique mix of NaOH content, digestion temperature, and exposure time, all affecting biomethane yield. Cluster 0 is a high-performing NaOH-pretreated condition. Its position in PCA space points to a distinct biomethane-enhancing regime, likely related to ideal NaOH concentration, balanced exposure, and adequate retention duration. Conversely, Cluster 2 probably contains untreated or low-intensity pretreatment conditions defined by lesser NaOH exposure and low biomethane yields. Cluster 1 might represent intermediate procedures where moderate pretreatment was used but did not produce the synergistic effects seen in Cluster 0. When applied within optimal operational windows, alkaline (NaOH) pretreatment conditions group into separate, high-performing clusters (e.g., Cluster 0). PCA provides a clear dimensional reduction that facilitates the visualization of the underlying variations in digestion strategies. This knowledge offers a data-driven framework for targeted operational optimization, particularly for scaling up pretreatment operations or creating an adaptive digestion protocol.
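
The clustering step can be sketched with a plain implementation of Lloyd's k-means algorithm applied to points in the PC1–PC2 plane. This is an illustrative sketch only: the coordinates and the starting centroids are hypothetical, and the study's own implementation and initialization are not specified in the text.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical (PC1, PC2) scores forming three operating regimes
scores = [(-2.0, 1.0), (-2.2, 0.8), (-1.8, 1.2),
          (0.0, 0.0), (0.2, -0.1), (-0.2, 0.1),
          (2.0, -1.0), (2.1, -0.9), (1.9, -1.1)]
start = [scores[0], scores[3], scores[6]]

centroids, labels = kmeans(scores, start)
print(labels)
```

Each returned centroid is the mean of the points assigned to it, which corresponds to the "typical digestion regime" interpretation given above.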

Fig. 7

Operational clusters of the anaerobic digestion of Xyris capensis.

Figure 8 illustrates the distribution and variability of key operating parameters across the 3 clusters. Across the 3 clusters, temperatures vary from 19 °C to 31 °C. Cluster 0 operates in the higher mesophilic range (29–31 °C), which is known to support active microbial populations. Cluster 1 shifts somewhat cooler, while Cluster 2 is more diverse. When paired with NaOH pretreatment, higher operating temperatures synergistically enhance hydrolysis rates and microbial metabolism, which explains why Cluster 0 conditions can correspond to better yield. With a median retention length of 30 days, Cluster 0 indicates long-duration digestion processes. With the shortest retention times (10 days), Cluster 1 suggests rapid-cycle digestion, presumably tuned for quicker throughput. A hybrid approach is implied by Cluster 2, which spans a wider spectrum yet centers on 16 days. Longer retention durations (Cluster 0) often allow more complete digestion of substrates, which is especially advantageous when lignin content is lowered by NaOH pretreatment, hence facilitating deeper microbial action. In all three clusters, NaOH levels range from 0 to 5% w/w. With medians at 3–4% NaOH concentration, Clusters 0 and 1 reveal wider, overlapping NaOH ranges (0–5%). Cluster 2 is skewed more toward low or no pretreatment and could indicate control or minimal-pretreatment situations. This supports the view that Cluster 2 may include untreated samples with the expected lower biomethane yield, consistent with previous findings. With far longer NaOH exposure times (50–60 min), Cluster 2 may compensate for its lower NaOH concentration. With shorter exposure times (10–30 min), Clusters 0 and 1 imply more aggressive pretreatment in less time. Although combining ideal concentration and duration (as probably observed in Cluster 0) enhances performance, a longer exposure time at a lower NaOH concentration (Cluster 2) may still produce moderate efficacy.

Fig. 8

Variation of bio-digestion parameters and NaOH pretreatment conditions across the operational clusters.

Statistical insight into the effect of alkaline pretreatment on the biomethane yield

The impact of the alkaline pretreatment of the Xyris capensis biomass on its biomethane yield was also validated statistically, beyond the experimental investigation. A two-sample independent t-test was carried out to compare the average biomethane yield of the pretreated and untreated categories, with the null hypothesis that there is no significant difference in biomethane yield between the two categories. Since the p-value of the t-test is \(<\) 0.05, the null hypothesis of no difference between the two categories is rejected. Hence, the improvement in biomethane yield is statistically significant, which demonstrates the effectiveness of NaOH pretreatment in improving biomethane output from Xyris capensis. Figure 9 shows the variation in biomethane yield, together with the mean and standard deviation of the yield values, across the untreated and pretreated categories. These charts further show that NaOH pretreatment greatly increases biomethane output compared to untreated conditions. This observation could be attributed to the ability of NaOH to decompose complicated cell wall structures, lessen lignin shielding, and enhance methanogenesis and enzymatic hydrolysis47. In a related study, optimal conditions of 6% NaOH and 0.10 g·g−1 cellulase produced a notable rise in gas generation during anaerobic digestion48. Compared to untreated samples, solid-state NaOH pretreatment at 4% concentration produced 144% more methane49.
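
A pooled-variance two-sample t-test of the kind described above can be sketched as follows. The replicate yields are hypothetical and are not data from this study. Instead of computing a p-value, which would normally come from a statistics package such as `scipy.stats.ttest_ind`, the sketch compares the statistic with the tabulated two-sided 5% critical value for 8 degrees of freedom.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Student's two-sample t statistic (equal variances) and its df."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return t, df

# Hypothetical replicate yields (ml CH4/gVSadded)
pretreated = [300.0, 310.0, 290.0, 320.0, 280.0]
untreated = [130.0, 140.0, 135.0, 125.0, 145.0]

t_stat, df = pooled_t(pretreated, untreated)

# Tabulated two-sided critical value of Student's t at alpha = 0.05, df = 8
T_CRIT_DF8 = 2.306
reject_null = df == 8 and abs(t_stat) > T_CRIT_DF8

print(round(t_stat, 3), df, reject_null)
```

A statistic far above the critical value corresponds to a p-value well below 0.05, which is the condition under which the null hypothesis of equal mean yields is rejected.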

Fig. 9

Biomethane yield changes under NaOH pretreatment and no-treatment conditions.

Performance evaluation of the developed model

The models developed for predicting the biomethane yield of NaOH-treated Xyris capensis were evaluated for accuracy and reliability using the performance metrics. Table 6 presents the statistical metrics of the ANN, SVM, RF, and DT models at the training phase. The table reveals a significant variation in their prediction performance across all metrics. Based on RMSE, the RF model outperformed the other models with the lowest RMSE of 3.1480, showing that it had the least prediction error and the best accuracy at training. This implies that the RF exhibits little variation in predicting the biomethane yield at the training phase. DT follows the RF with an RMSE of 5.8544, while ANN and SVM recorded higher RMSE values of 7.2026 and 8.3786, respectively, indicating less reliable prediction during training. A similar trend was noted for MAE, with the RF exhibiting the lowest value of 2.0737, affirming its reliability. The higher MAE values of DT (\({\text{MAE}}_{\text{DT}}=4.5435\)), SVM (\({\text{MAE}}_{\text{SVM}}=6.0038\)), and ANN (\({\text{MAE}}_{\text{ANN}}=6.2926\)) indicate that the RF predictions are, on average, closer to the actual values. Furthermore, the lower MAD confirms that RF predictions are more stable and consistent. RF had the smallest MAD value of 1.7569, while the MAD values of DT, ANN, and SVM are 4.7654, 6.1675, and 6.2029, respectively. Considering MAPE, which indicates the percentage prediction accuracy of a model, RF also had the best prediction accuracy with a MAPE of 5.7488%. This indicates that the RF model is about 94% accurate at predicting the biomethane yield during training. This was better than DT (\({\text{MAPE}}_{\text{DT}}=6.9543\)), SVM (\({\text{MAPE}}_{\text{SVM}}=9.0657\)), and ANN (\({\text{MAPE}}_{\text{ANN}}=10.6767\)). Based on VAF, all models captured most of the variance in the actual biomethane output. RF gave the highest VAF of 99.0680%, followed closely by DT (\({\text{VAF}}_{\text{DT}}=98.7887\%\)), SVM (\({\text{VAF}}_{\text{SVM}}=96.9575\%\)), and ANN (\({\text{VAF}}_{\text{ANN}}=93.9540\%\)). Generally, the RF model gave the best prediction of biomethane yield across all metrics at training. While ANN and SVM approximated the yield well, their larger errors suggest suboptimal parameter tuning and learning structure. Though slightly inferior to RF, DT performed well and may be used in practice.

Table 6 Statistical metrics of the developed model at training.

The generalization capacity of the machine learning models was also assessed at the testing stage to determine their predictive consistency. As shown in Table 7, the performance metrics give a clearer picture of how each model responded to unseen data. The RF model achieved the lowest RMSE of 5.6862, confirming its good generalization to data outside the training set. The DT and SVM exhibited RMSE values of 5.9346 and 7.5069, respectively, while the ANN recorded the highest RMSE of 9.9094, which indicates a substantial divergence from the experimental values of biomethane yield and a drop in accuracy during the testing phase. Based on MAE, DT gave the lowest value across the models (3.5767), with RF slightly higher at 4.2938. ANN again showed the weakest performance with an MAE of 7.8629, while SVM was intermediate at 5.9737. DT also had the lowest MAD of 2.4676, followed by RF (3.8981), SVM (5.3245), and ANN (9.3768). With a MAPE of 7.0717%, RF made the most accurate percentage predictions, making it acceptable for operational bioenergy systems. SVM had the highest VAF (94.60%), ahead of RF (93.93%), ANN (92.96%), and DT (89.80%).

Table 7 Statistical metrics of the developed model at testing.

The RF model has the smallest area on the radar plot in Fig. 10, indicating lower RMSE, MAE, and MAPE relative to the ANN, SVM, and DT models. The wider shapes created by the ANN, SVM, and DT models indicate larger error metrics. This suggests that RF provides improved learning capacity and error minimization during model building. The pattern is consistent, with the RF model outperforming the others on most axes. The radar charts confirm visually that RF is the most accurate model overall for predicting biomethane yield from Xyris capensis. Its low errors and high accuracy throughout training and testing make it a suitable choice for practical bioenergy modeling and deployment. Though most models showed a decline in performance in testing (as expected because of unseen data), the RF model retained the lowest RMSE and MAPE on the test data.

Fig. 10

Statistical metrics for the best model (RF) at training and testing.

The experimental biomethane yield values at the training phase are compared graphically with the values predicted by the RF (most accurate) model in Fig. 11. Experimental and predicted values overlap closely across all sample indices, suggesting high prediction accuracy during training. The model’s capacity to identify nonlinear patterns in the data is further confirmed by the close clustering of points and trends between the experimental and predicted values. The error histogram shows the distribution of prediction errors during training. The distribution is centered around zero, indicating that the model does not show consistent bias. The bell-shaped error distribution closely resembles the overlaid normal distribution curve, suggesting that the prediction errors are random50. The error distribution verifies that most prediction errors are within a small range and that large departures are uncommon.

Fig. 11

Comparison plot of the actual and predicted methane yield using the RF (most accurate) model at the training phase.

Similarly, the experimental biomethane yield values at the testing phase are compared with the values predicted by the RF (most accurate) model in Fig. 12. Although certain areas show slight deviations, the predictions track the general trend consistently. The model predicts biomethane yield well on data outside the training set, which indicates that it has captured the inherent nonlinear relationships. The histogram shows the spread of prediction errors throughout testing. The error distribution is centered around zero, indicating low bias in the model predictions. It is somewhat more spread out than during training, and its fit with the normal distribution is less close. Though more dispersed than in training, the testing errors are still well-behaved and typically fall within a reasonable range.

Fig. 12

Comparison plot of the actual and predicted methane yield using the RF (most accurate) model at the testing phase.

Figure 13 presents the scatter plots of experimental and RF (most accurate) model-predicted methane yield values at both the training and testing phases. At training, the plot shows a near-linear alignment of points, indicating good model accuracy and little residual error. This shows that the model has properly captured the relationship between the input features and biomethane yield, which supports the low error values reported earlier. Similarly, the model maintains a strong relationship between experimental and predicted values at the testing stage despite being presented with unseen data. Some minor deviations from the line are observable, which is expected in real-world situations because of natural noise and complexity in anaerobic digestion processes.

Fig. 13

Scatter plot of actual and predicted methane yield using the RF model at the training and testing phase.

Overall, the RF model exhibited the best prediction performance among the models. The ensemble and random feature sampling nature of RF, which makes it resistant to noise and variance and reduces the need for parameter optimization, could account for its superior performance51. While the performance of the non-ensemble-based DT is better than that of ANN and SVM, it is less accurate than RF. This is due to its tendency to overfit during training, particularly in the absence of pruning or complexity regulation, which results in memorization rather than learning and hence poor generalization on novel data52. The less accurate prediction of ANN could be attributed to its susceptibility to overfitting when dealing with a small training dataset. The complexity and number of its parameters, as well as its high model flexibility, could result in learning noise instead of the underlying patterns, especially without a strong mitigation strategy such as regularization or data augmentation53. In a similar study by Ahmad et al.54, RF consistently gave better prediction accuracy than ANN in small-data contexts, owing to the ANN’s potential for overfitting. SVM, however, relies heavily on well-structured, low-noise datasets for good performance55. It is sensitive to hyperparameter settings, and poor tuning could result in underfitting or overfitting56. Comprehensive parameter optimization could be challenging for SVM when dealing with meagre or noisy datasets, owing to its quadratic or cubic training complexity57.

In a similar study, ANN was used to simulate the biogas yield of chemically treated grass, clover, and wheat straw co-digested during anaerobic digestion. Retention time, temperature, alkali concentration, and substrate composition were used as input parameters, while the cumulative methane yield was the output. The performance of the model varied with the input parameters, with a minimum RMSE of 0.410 reported58, which is lower than the 9.9094 obtained for ANN and the minimum of 5.6862 obtained from RF in this study. An ML application for the optimization and prediction of the fresh mass methane yield of particle-size-pretreated Arachis hypogaea shells using ANFIS was reported to have RMSE, MAPE, and MAD values of 2.7875, 9.0643, and 1.7665, respectively. These values are lower compared to this study, which shows that that model performed better than the models developed here18. Biomethane release from the organic fraction of municipal solid waste was optimized and predicted with ANN, linear regression, XGBoost, RF, and SVM. Moisture content, C/N ratio, lignocellulose contents, and age of the waste were the input parameters, with biomethane as the output. XGBoost and RF were reported as the best models with RMSE values of 305 and 496, respectively59, which are higher than the values recorded in this study. SVM was reported as the best model for the prediction of biogas and methane yield of solid-state anaerobic digestion, using existing biogas and methane data. RMSE and MAE values of 3.21 and 1.93, respectively, were reported60, and these values are lower than those of this study. The results of this study therefore deviate from the existing studies, which can be traced to differences in feedstock, input process parameters, model version, and model type. A comprehensive comparison was also difficult because the performance metrics considered were not the same. For instance, most of the studies do not include VAF in their performance metrics.

Conclusion

The impact of alkaline (NaOH) pretreatment on the process output of anaerobic digestion of Xyris capensis was investigated in this study through an integrated approach that combined experiments with multi-modal statistical and ML-based analysis. The analysis included correlation analysis, SHapley Additive exPlanations (SHAP) for feature ranking, and cluster analysis of the bio-digestion operational and pretreatment datasets using k-means integrated with Principal Component Analysis (PCA). This approach provides in-depth insights into the operational dynamics of anaerobic digestion. On the experimental front, the NaOH pretreatment substantially enhanced biomethane yield by breaking down the recalcitrant structure of Xyris capensis. Furthermore, all the pretreatment conditions investigated improved the yield, with the improvement reaching about 143% under optimal pretreatment conditions. SHAP analysis revealed exposure time as the most influential feature, with a strong negative impact on biomethane yield, while retention time and NaOH concentration were identified as key positive contributors. PCA further verified the importance of these features, capturing over 86% of the variance in three principal components. ANN, SVM, DT, and RF models were developed to predict biomethane yield. The RF model outperformed the others (ANN, DT, and SVM), achieving the lowest errors during training and testing. The integrated experimental and data-driven computational techniques presented in this study contribute to sustainable energy production by offering actionable intelligence toward improved energy recovery and enhanced process control in the anaerobic digestion of lignocellulosic biomass.

Specimen deposited in a public herbarium

A specimen of Xyris capensis was deposited at Flora of Botswana on 15th June 1996 under record number 84221, recorded by MG Bingham, with collector number 11045.

Permission to collect Xyris capensis sample

Xyris capensis is a common grass found in almost all the provinces of South Africa; therefore, official permission to collect it is not required.

Feedstock identification for the study

Author Daniel M. Madyira identified the feedstock species used for this study.