Introduction

Wastewater treatment is a crucial area of focus for environmental protection and sustainable development. It addresses the challenges of purifying and discharging industrial and domestic wastewater, which can pose significant risks to the environment1,2. Wastewater treatment plants (WWTPs) are essential facilities for treating urban wastewater, and they must discharge in accordance with standards such as the Discharge Standard of Pollutants for Municipal Wastewater Treatment Plant of China, which regulates the release of certain conventional pollutants, heavy metals, and some other characteristic pollutants3. However, traditional physicochemical indicators or pollutant concentrations alone do not fully reflect the comprehensive risk of wastewater4. Even after undergoing tertiary treatment and high dilution, urban wastewater that meets discharge standards still contains low concentrations of various pollutants such as pharmaceuticals and personal care products5, pesticides6, and halogenated flame retardants7. Moreover, the range of regulated pollutants within current effluent standards is limited. This implies that not only low-concentration regulated pollutants but also unregulated or emerging contaminants may still pose risks to the ecological environment2,5. As urban wastewater is continuously discharged into receiving waters, these pollutants and their transformation products are distributed among various media in the aquatic environment and bioaccumulate8, exerting complex toxic effects. This can cause endocrine disruption, developmental toxicity, neurotoxicity, genotoxicity, and other effects on aquatic organisms9,10,11, affecting their normal development and ultimately impacting the species abundance, community composition, structure, and ecological function of aquatic ecosystems12,13. Numerous studies have shown that WWTP effluents can cause toxicity to aquatic organisms14, making them a significant source of water environmental pollution15,16,17,18.
Therefore, controlling the ecological risk of WWTP effluents to the aquatic environment is crucial for maintaining the health of aquatic ecosystems. Nevertheless, wastewater treatment technology research currently focuses predominantly on controlling specific pollutants, with relatively few studies addressing biotoxicity comprehensively. Accurately assessing effluent toxicity risks from WWTPs helps guide updates to existing standards. Furthermore, it is imperative to strengthen wastewater treatment technology towards toxicity removal, reducing effluent toxicity levels19,20.

WWTP effluents represent a complex mixture exposure system. Relying solely on pollutant monitoring for exposure assessment may underestimate the risks of the effluent. Therefore, it is necessary to incorporate biological effect assessments to accurately identify the complex toxicity of the effluent. Toxicity Identification Evaluation (TIE) is a bioassay-based research approach that integrates various physicochemical experimental methods for toxicity assessment and is widely adopted for industrial wastewater, urban wastewater, and sediments21. In the TIE procedure, toxicity can be determined by a series of short-term toxicity or acute lethality freshwater aquatic bioassays, including algal growth, bacterial tests, macroinvertebrate survival, and fish bioassays13. Standardized in vivo bioassays for evaluating the efficacy of state-of-the-art activated sludge wastewater treatment are limited. To improve the predictive power of in vivo methods, a critical aspect of TIE is using more effective and sensitive test organisms to evaluate the biological toxicity of each separated component21. Compared to traditional test organisms in the TIE procedure, the nematode Caenorhabditis elegans is easy to culture, has a short life cycle, and has an extensive background in genetic and developmental research22,23. It can be used for multi-endpoint assessments such as survival rate, reproduction, lifespan, locomotion behavior, and metabolism24,25,26,27, and it shows especially high sensitivity to environmental stress28. Nematode toxicity testing, which determines toxicity based on the toxic components of the effluent rather than water quality parameters, has proven to be a relatively simple and economical method for evaluating the environmental hazards of wastewater22. It has become a recognized alternative model organism for animal toxicity assessment.
Therefore, using Caenorhabditis elegans as a test organism in combination with TIE for aquatic biotoxicity testing could effectively assess biological toxicity risks. Combining this approach with existing effluent pollutant concentration standards aligns more closely with the increasing emphasis on multi-target control of WWTP effluent quality in response to heightened environmental awareness, and aims to minimize adverse impacts on the surrounding environment29,30.

The anaerobic-anoxic-oxic (AAO) biological wastewater treatment process has become the predominant process in WWTPs due to its convenience in operation and high efficiency in pollutant removal1,31. However, the current processes struggle to effectively control effluent biological toxicity and lack the necessary technologies to enhance toxicity reduction level32, presenting challenges to the traditional biological wastewater treatment process. Studies have shown that treatment technologies significantly impact effluent pollutants removal efficiency33, highlighting the need for optimization in the traditional AAO process to fully address potential toxicity risks12,32. Consequently, enhancing and optimizing the AAO biological wastewater treatment process has emerged as a pivotal direction in current research, striving to meet the dynamic pollution control demands. Therefore, innovative solutions within traditional wastewater treatment processes are required to ensure stable efficiency and adaptability when addressing diverse wastewater conditions and meeting multiple objective pollutant discharge standards, especially concerning effluent toxicity.

Optimizing biological wastewater treatment processes is a complex and dynamic challenge, as influent features vary among different treatment facilities and change over time within the same facility34,35. Furthermore, traditional methods relying on heuristic rules and fixed parameters struggle to adapt to the dynamics and complexities of wastewater pollutants36,37, making it difficult to design universal processes for various scenarios38,39. For example, Fan et al. improved the sequencing batch reactor (SBR) wastewater treatment process to minimize energy consumption with the help of simulation tools40. Wang et al. pointed out that two upgraded processes, namely AAO coupled with a moving bed biofilm reactor (MBBR) and five-stage Bardenpho, exhibit higher pollutant removal efficiency than AAO at full scale41. Jiang et al. studied under-loaded performance42 and investigated the impact of aeration control43 on pollutant removal in a full-scale WWTP system. Yue et al. optimized the process structure and operation of the widely used biological treatment processes of coking WWTPs based on a general algebraic modeling system44. Liu et al. compared denitrification and phosphorus removal efficiency at low temperatures by implementing the AAO and AAO-SBR processes in a full-scale wastewater treatment plant45. Drawing on years of operational experience in WWTPs, Xiao et al. summarized optimization recommendations for the design phase of semi-underground wastewater treatment plants46. However, current research on performance comparison and optimization of biological wastewater treatment processes often depends on trial-and-error with mathematical models, practical comparison experiments, and empirical summaries40,44,45,46. These approaches are time- and resource-intensive, which hampers large-scale comparisons and limits the comprehensive modification and improvement of various treatment processes.
Faced with these complex and evolving challenges, there is a growing need for innovative, data-driven solutions to enhance the effectiveness and sustainability of biological wastewater treatment processes47,48,49. In recent years, the application of machine learning (ML) to biological wastewater treatment process optimization has attracted widespread interest50,51,52. ML is employed in wastewater quality monitoring to predict indicators such as chemical oxygen demand (COD), biochemical oxygen demand (BOD)53, and total nitrogen (TN)54. It is also used for early-warning fault diagnosis, providing advance notice of operational faults in WWTPs55,56. These applications illustrate the potential of ML in wastewater treatment and offer new avenues for addressing multi-objective optimization challenges in biological wastewater treatment processes57,58,59.

In this work, we collected integrated information on wastewater quality parameters, treatment process parameters, and wastewater biological toxicity (based on the nematode bioassay) from 122 WWTPs across China. Based on different ML algorithms, four prediction models for wastewater quality parameters and one prediction model for toxicity reduction efficiency were constructed to predict the effluent quality parameters and toxicity reduction ratios. In addition, we devised an ML-based framework for optimizing biological wastewater treatment processes to improve their performance and sustainability. This method simulates the bioreactor units in biological wastewater treatment processes and predicts the corresponding effluent quality and detoxification efficiency. Through screening, we identified improved combinations of biological treatment units that met the standards for wastewater quality parameters while achieving the highest toxicity reduction efficiency under given water quality conditions, thereby optimizing the AAO biological wastewater treatment process. Through this research, we aspire to provide new insights into the field of wastewater treatment, fostering the development of more sustainable and efficient biological wastewater treatment processes to better protect the environment and water resources.

Results

Effectiveness of the wastewater quality parameters representation

We first conducted a SHapley Additive exPlanations (SHAP) analysis of the importance of the wastewater quality parameters based on the neural network regression (NNR) algorithm. The results are shown in Fig. 1.

Fig. 1: The SHAP analysis of the wastewater quality parameters.

The points in the plot represent the SHAP value of each feature for each sample, and the color represents the magnitude of the feature value.

Figure 1 reveals that COD is the most important feature for the NNR-based prediction model. Samples with high COD values have a positive impact on the model prediction, whereas samples with low COD values have a negative impact. Moreover, the importance of the DO feature serves as the foundation for feature extraction of the multi-unit combination process. The results indicate that this representation supports a comprehensive and accurate depiction of features across various biological wastewater treatment processes, with feature contributions ranked as follows: COD, dissolved oxygen (DO), pH, temperature (Temp), total phosphorus (TP), TN, and ammonia nitrogen (NH3-N).

Effectiveness of the biological wastewater treatment process representation

To verify the validity of the features for biological wastewater treatment process, we used a toxicity reduction prediction model to conduct two sets of predictions. One set was predicted with complete information of the treatment process parameters, while the other was performed sequentially using information of the split individual units. For example, a three-unit biological treatment process combination is predicted as set 1 in accordance with the complete biological process, while it is predicted three times as three one-unit biological treatment process combinations in succession as set 2. It is worth noting that 22 WWTPs were randomly selected as independent test data, while the rest were used as training data. The prediction results are shown in Fig. 2 and Supplementary Table 1.

Fig. 2: The regression plot of the two sets of predictions.

The broken lines represent the actual values and predicted values of set 1 and set 2 for independent test samples, respectively.

As shown in Fig. 2 and Supplementary Table 1, in the independent test, the prediction of the toxicity reduction ratio of the two sets is in agreement with the actual value. The prediction of set 1 achieved a coefficient of determination (R2) of 0.85, slightly better than that of set 2, which had an R2 of 0.83. Although the predicted reliability of set 2 decreases slightly over the course of multiple predictions, the overall margin is small. The insignificant difference between the two sets suggests that it was feasible to divide the complete biological treatment process into several units for feature extraction, which laid a foundation for the subsequent unit combination steps.

Prediction performance of wastewater quality parameters models

By combining different ML algorithms and multi-unit representations, we constructed prediction models for wastewater effluent quality parameters tailored to various influent water quality scenarios. The prediction results of the wastewater quality parameters model on five-fold cross validation are displayed in Fig. 3 and Supplementary Table 2 in SI. High R2 ( ~ 0.8) and low mean absolute error (MAE) and root mean squared error (RMSE) (< 0.1) values suggest that these models can accurately predict the water quality of wastewater effluent, including COD, TN, NH3-N, and TP, demonstrating the applicability of the models for effluent quality prediction in real-world wastewater treatment plants.

Fig. 3: The prediction results of the wastewater quality parameters model based on different algorithms by five-fold cross validation.

A–D represent the model prediction performances for COD, TN, NH3-N, and TP based on four ML algorithms, respectively.

Figure 3 also shows that the model built with the NNR algorithm performs better than those built with the other three algorithms. Specifically, the NNR-based prediction model achieved the best MAE of 0.04, RMSE of 0.07, and R2 of 0.81 for the prediction of TN. It also performed well for the prediction of NH3-N, with MAE, RMSE, and R2 values of 0.05, 0.07, and 0.82, respectively. Similarly, the NNR-based model performed well for the prediction of TP, with MAE, RMSE, and R2 values of 0.06, 0.08, and 0.80, respectively. For COD, the NNR-based model outperformed the others, achieving an MAE of 0.06, an RMSE of 0.07, and the highest R2 of 0.82. These results show that the NNR algorithm gives the best prediction accuracy for wastewater quality parameters. Therefore, all four wastewater quality parameter prediction models (COD, TN, NH3-N, and TP) were built with the NNR algorithm.
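For reference, the three metrics reported above can be computed as in the following minimal Python sketch. The metric definitions are the standard ones. The helper `five_fold_indices` is a hypothetical illustration of a five-fold split (contiguous folds, no shuffling) and is an assumption, since the text does not describe how the folds were formed.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def five_fold_indices(n, k=5):
    """Hypothetical helper: split indices 0..n-1 into k contiguous folds
    of near-equal size (assumption: no shuffling or stratification)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```

Each fold would serve once as the validation set while the remaining folds are used for training, and the metrics are then computed on the held-out predictions.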

Prediction performance of toxicity reduction models

The prediction model for toxicity reduction of the wastewater effluent was constructed to evaluate how effective biological wastewater treatment processes are in reducing the ecological and toxicological impacts of the wastewater. We built prediction models for effluent toxicity reduction for different process combinations based on four ML algorithms: Support Vector Regression (SVR), Random Forest Regression (RFR), eXtreme Gradient Boosting Regression (XGBR), and Neural Network Regression (NNR). The prediction results of the toxicity reduction model on five-fold cross-validation are shown in Fig. 4 and Supplementary Table 3 in SI. High R2 (≥ 0.8) values suggest that these models can accurately predict the toxicity of wastewater effluent, indicating that the developed models could serve as powerful tools for process optimization towards toxicity risk control.

Fig. 4: The regression prediction graph of the toxicity reduction model based on different algorithms by five-fold cross-validation.

A–D represent the results of SVR, RFR, XGBR, and NNR, respectively.

Among the four models tested, the NNR-based model achieved the best performance, with an MAE of 0.05, RMSE of 0.06, and R2 of 0.86. The highest R2 and lowest MAE and RMSE values highlight the superior predictive power of the NNR algorithm over the others. This is likely because the NNR algorithm, with its high model complexity and automatic learning during training, has stronger generalization ability60. Accordingly, the optimal toxicity reduction model was constructed with the NNR algorithm and used in the following sections.

Process optimization based on toxicity control

After the establishment of individual models for predicting the water quality parameters and effluent toxicity, we integrated them together by devising a comprehensive four-step framework, aiming at optimizing the biological wastewater treatment processes towards toxicity reduction of the wastewater effluent.

For ease of understanding, we devised 12 sets of water quality data that simulated WWTP influents (refer to Supplementary Note 1 and Table 4 in SI for specific experimental data). We set pH to 7.1 and Temp to 21.3 °C based on the average values of the original dataset. The values of the other parameters in these sets vary within the following ranges: COD 25–300 mg/L, TN 20–40 mg/L, NH3-N 16–30 mg/L, and TP 1.5–6 mg/L. With reference to the urban WWTP effluent discharge standards, the baseline discharge limits were set to 50 mg/L for COD, 15 mg/L for TN, 5 mg/L for NH3-N, and 0.5 mg/L for TP. The 12 sets of data are arranged in decreasing order of C/N. The final toxicity reduction ratio and corresponding optimal short-range unit combinations are shown in Fig. 5. Note that A represents the anaerobic unit, A’ represents the anoxic unit, and O represents the aerobic unit, consistent with the encoding described in the Methods.
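The compliance check and the C/N ordering described above can be sketched as follows. This is an illustration only: the three influent sets are hypothetical values chosen within the stated ranges (they are not the 12 sets in the Supplementary Information), C/N is assumed here to be COD divided by TN, and a limit is assumed to be met when the predicted concentration is less than or equal to it.

```python
# Baseline discharge limits quoted in the text (mg/L).
DISCHARGE_LIMITS = {"COD": 50.0, "TN": 15.0, "NH3-N": 5.0, "TP": 0.5}

def meets_standard(effluent, limits=DISCHARGE_LIMITS):
    """True if every regulated parameter is at or below its limit
    (assumption: limits are inclusive upper bounds)."""
    return all(effluent[name] <= limit for name, limit in limits.items())

def cn_ratio(influent):
    """Assumption: C/N is taken as influent COD divided by influent TN."""
    return influent["COD"] / influent["TN"]

# Hypothetical influent sets within the ranges given in the text;
# pH and Temp are fixed at the dataset means (7.1 and 21.3 degrees C).
influents = [
    {"COD": 250.0, "TN": 40.0, "NH3-N": 30.0, "TP": 6.0, "pH": 7.1, "Temp": 21.3},
    {"COD": 300.0, "TN": 20.0, "NH3-N": 16.0, "TP": 1.5, "pH": 7.1, "Temp": 21.3},
    {"COD": 100.0, "TN": 40.0, "NH3-N": 30.0, "TP": 3.0, "pH": 7.1, "Temp": 21.3},
]

# Arrange the sets in decreasing order of C/N, as in the text.
ordered = sorted(influents, key=cn_ratio, reverse=True)
```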

Fig. 5: The final toxicity reduction ratio and corresponding short-range unit combinations of the biological wastewater treatment combination with wastewater quality parameters up to standards.

The columns represent the corresponding influent conditions, and the broken lines of “TRR-2”, “TRR-3” and “TRR-4” represent the toxicity reduction ratios of the optimal processes for combinations of two, three, and four units, respectively. The corresponding unit combinations that meet water quality standards with optimal toxicity removal are shown at the bottom of each group. Missing results indicate that the effluent of combinations with the specified number of units does not meet the standard. The bottom right panel shows the number of occurrences of each optimal unit combination.

Figure 5 shows that combinations with more units achieve higher toxicity reduction ratios than combinations with fewer units. Two-unit combination processes may not be sufficient to meet effluent water quality standards, let alone achieve toxicity removal. For two-unit processes, the toxicity reduction ratio tends to stabilize at around 70%, with AO being the optimal two-unit combination. Among three-unit combination processes, AA’O is the most effective for toxicity reduction, with a toxicity reduction ratio of around 74%, although its efficacy decreases as the C/N ratio decreases. The four-unit combination process is the most effective for toxicity reduction, with a reduction ratio of over 78% when the C/N ratio is maintained between 3 and 10. For example, when the C/N ratio is 6.25, the reduction ratio of the anaerobic-aerobic-anaerobic-aerobic combination reaches the highest value of 79.8%. The results also suggest that the optimal four-unit combination depends on the water quality conditions. For example, there are 5 different optimal toxicity reduction processes under the 12 sets of water quality conditions. Among them, the combination AOA’O appears four times. Therefore, the water quality characteristics must be taken into account when determining the optimal combination process for a given WWTP. Where the toxicity reduction ratios are comparable, short-range processes are recommended over four-unit combined processes in consideration of energy saving and footprint.

To sum up, the proposed framework allows the recommendation of the wastewater treatment process unit combinations that can meet the basic effluent discharge standards and meanwhile ensure optimal toxicity reduction under varied influent water quality conditions. The findings provide solid support for the optimization and improvement of the existing wastewater treatment processes.
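The screening step summarized above can be sketched as an exhaustive search over unit sequences. This is a minimal illustration under stated assumptions, not the implementation used in the study: all ordered sequences of two to four units are assumed to be candidates (including repeated adjacent units), and `predict_effluent`, `predict_trr`, and `meets_standard` are hypothetical callables standing in for the trained water quality models, the toxicity reduction model, and the discharge standard check.

```python
from itertools import product

# Unit symbols: A = anaerobic, A' = anoxic, O = aerobic.
UNITS = ("A", "A'", "O")

def enumerate_processes(min_units=2, max_units=4):
    """Yield every ordered unit sequence with min_units..max_units units
    (assumption: repeated adjacent units are allowed)."""
    for n in range(min_units, max_units + 1):
        for combo in product(UNITS, repeat=n):
            yield combo

def best_by_unit_count(influent, predict_effluent, predict_trr, meets_standard):
    """For each unit count, return the compliant sequence with the highest
    predicted toxicity reduction ratio as {n_units: (sequence, ratio)}.
    Unit counts with no compliant sequence are absent from the result."""
    best = {}
    for combo in enumerate_processes():
        # Step 1: discard sequences whose predicted effluent fails the standard.
        if not meets_standard(predict_effluent(influent, combo)):
            continue
        # Step 2: among compliant sequences, keep the highest predicted ratio.
        trr = predict_trr(influent, combo)
        n = len(combo)
        if n not in best or trr > best[n][1]:
            best[n] = (combo, trr)
    return best
```

With three unit types, this search space contains 9 + 27 + 81 = 117 candidate sequences, so exhaustive evaluation with trained models is inexpensive. A missing key in the result corresponds to a missing point in Fig. 5.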

Application of the optimal process in real WWTPs for toxicity reduction

To demonstrate the applicability of the above optimization process for improving toxicity reduction in real WWTPs, we applied it to 122 WWTPs. The comparison of the toxicity reduction ratio with and without optimization for each plant is illustrated in Fig. 6.

Fig. 6: The toxicity reduction ratios for 122 WWTPs before and after optimization.

The red broken line represents the actual biological toxicity reduction ratio, and the green broken line represents the optimized biological toxicity reduction ratio. The gray dashed line represents the increase in the biological toxicity reduction ratio after optimization.

The results demonstrate a substantial improvement in toxicity reduction achieved via process optimization. Before optimization, the existing biological wastewater treatment processes showed relatively low toxicity reduction efficiency, ranging from 0% to 71.2%, with an average value of 48.6%. With optimization, the toxicity reduction ratios of the optimal treatment process combinations improved substantially, ranging from 36.4% to 87.5%, with an average value of 70.7%. This result highlights the crucial role of process optimization in enhancing toxicity reduction efficiency. Figure 6 also shows that the enhancement in toxicity reduction varied among WWTPs, with improvements ranging from 8.9% to 54.9% and averaging 22.1%. Overall, the toxicity reduction efficiency of all WWTPs improved after optimization.

Discussion

This research fully leverages knowledge and process data from the wastewater treatment process and establishes a two-step prediction scheme that enables accurate predictions of effluent quality parameters (average R2 = 0.81) and toxicity reduction ratio (R2 = 0.86) under different influent water quality scenarios. This study also proposes an ML-based framework for the optimization of biological wastewater treatment processes, which not only facilitates the identification of optimal unit combinations that adhere to fundamental effluent quality standards but also ensures effective toxicity risk management under diverse influent water quality conditions. As a result, the four-unit combination process is the most effective for toxicity removal, with a removal ratio of over 78% (up to 79.8% for the anaerobic-aerobic-anaerobic-aerobic combination) when the C/N ratio is maintained between 3 and 10. After optimization, the average toxicity reduction ratio of the 122 WWTPs increased from 48.6% to 70.7%, with a maximum of 87.5%.

The proposed framework allows the recommendation of wastewater treatment process unit combinations that meet the basic effluent discharge standards while ensuring optimal toxicity reduction under varied influent water quality conditions. This advancement is crucial for the design and upgrading of WWTPs, helping them meet higher effluent standards that align with evolving environmental regulations and societal expectations.

Process optimization stands as our fundamental step, acknowledging the dynamic nature of wastewater treatment. Currently, optimal unit combinations enable us to adjust dissolved oxygen levels effectively, guiding alternating microbial degradation processes necessary for various pollutants. Moving forward, we aim to integrate internal and external factors that influence water treatment operations, thereby comprehensively optimizing the wastewater treatment process. Future efforts will explore internal factors and leverage technological advancements for holistic optimization. Furthermore, our current research is based on nationwide sampling for experimental analysis, providing a comprehensive understanding while not yet exploring regional variations. Future studies are planned to incorporate geographical factors to delve deeper into potential geographic patterns. In addition to our primary focus on achieving basic water quality standards and controlling biological toxicity, our optimization framework has the potential to include additional pollutant control models. Advanced models can be adapted to address the control of other pollutants, such as emerging contaminants, aligning with our objective of achieving multiple objective optimization control. It is expected to help mitigate the environmental impacts of effluent discharges, promoting sustainability in water resource management and safeguarding aquatic ecosystems.

Methods

Sample description

A comprehensive dataset was compiled by gathering information from 122 municipal WWTPs located in 26 provinces or municipalities across China, and specific information is given in Supplementary Table 5. The data collection period ranged from July 2020 to June 2021, providing a detailed overview of the biological treatment processes employed in wastewater treatment. The sampling sites are shown in Supplementary Figure 1 and detailed sampling information is provided in Supplementary Note 2.

Sampling nationwide and across the year contributes to obtaining comprehensive and accurate data from wastewater treatment plants34. This provides robust support for developing more effective biological treatment technology optimization strategies.

Framework of the proposed optimization method of biological wastewater treatment processes

To create a reliable ML-based optimization approach, we devised a comprehensive four-step framework for the proposed method aimed at optimizing the biological wastewater treatment processes. The overall framework of the proposed biological wastewater treatment processes optimization method is illustrated in Fig. 7, which consists of four modules: collection module (wastewater quality parameters, treatment process parameters, and wastewater biological toxicity), model construction module, unitization module, and process optimization module. The specific steps are described in detail in the following sections of the article.

Fig. 7: The framework of the proposed optimization method of biological wastewater treatment processes.

The framework is composed of four modules, namely collection module, model construction module, unitization module, and process optimization module.

Data collection

In the collection module, we obtained integrated information from the sampled WWTPs, including wastewater quality parameters, treatment process parameters, and wastewater biological toxicity. The specific data acquisition methods are as follows.

Quantitative analysis of wastewater quality parameters

The wastewater quality parameters gathered from the influent and effluent of WWTPs include COD, TN, NH3-N, TP, DO, Temp, and pH. DO, Temp, and pH were measured and recorded from the original water samples. The remaining wastewater quality parameters were chemically analyzed after sampling and pretreatment of water samples, as described in detail in Supplementary Note 3.

Rigorous and standardized sampling and pretreatment methodology not only aligns with established protocols but also ensures the extraction and concentration of target substances from wastewater effluent61. The resulting dataset offers a robust foundation for an in-depth analysis of the biological treatment processes employed in municipal WWTPs. The distribution of wastewater quality parameters measured is as follows: COD 3.5 ~ 241.6 mg/L (mean = 36.1, SD = 31.65), TN 0.7 ~ 66.0 mg/L (mean = 11.1, SD = 9.57), NH3-N 0 ~ 56.0 mg/L (mean = 7.6, SD = 8.61), TP 0 ~ 24.3 mg/L (mean = 3.2, SD = 1.28), DO 0.0 ~ 7.0 mg/L (mean = 1.0, SD = 4.03), Temp 4 ~ 35 °C (mean = 21.3, SD = 5.58), and pH 6 ~ 8 (mean = 7.1, SD = 0.56). Note that SD stands for standard deviation.

Treatment process parameter

In this work, we used treatment process parameters, which are often overlooked in most research58,59, as ML inputs. We processed the collected treatment process parameters as follows:

We performed feature extraction for the biological treatment process, focusing on the bioreactor unit. The essence of biological wastewater treatment lies in the microbial degradation of pollutants, with microorganisms operating under different DO conditions: aerobic, anoxic, and anaerobic. DO is an artificially controllable factor that influences the bioreactor type and reaction processes and is pivotal for optimizing the treatment process62. Adjusting DO levels enables different microbial types to function alternately, ensuring efficient pollutant removal in the biological treatment stages. Thus, to digitize the process information for model input, the bioreactor unit was categorized by DO level into three distinct units: aerobic (DO > 0.5 mg/L), anoxic (0.2 mg/L < DO < 0.5 mg/L), and anaerobic (DO < 0.2 mg/L). We applied one-hot encoding to represent these biological treatment processes62. The aerobic unit was represented as O: [1,0,0], the anoxic unit as A’: [0,1,0], the anaerobic unit as A: [0,0,1], and a blank unit as [0,0,0]. With this scheme, any biological treatment process can be decomposed into consecutive units according to DO levels and given a unified digital representation.
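The encoding described above can be sketched in Python as follows. The DO thresholds and the one-hot vectors are taken from the text. Two details are assumptions for illustration: the boundary values DO = 0.2 and DO = 0.5 mg/L, which the text does not assign, are treated as anoxic here, and processes are padded with blank units to a fixed length of four (`max_units`), which is a hypothetical choice based on the four-unit combinations discussed in the Results.

```python
# One-hot vectors as given in the text.
ONE_HOT = {
    "O": [1, 0, 0],   # aerobic unit
    "A'": [0, 1, 0],  # anoxic unit
    "A": [0, 0, 1],   # anaerobic unit
}
BLANK = [0, 0, 0]     # blank (absent) unit

def classify_unit(do_mg_per_l):
    """Map a dissolved oxygen level (mg/L) to a unit symbol."""
    if do_mg_per_l > 0.5:
        return "O"
    if do_mg_per_l < 0.2:
        return "A"
    # Assumption: the boundary values 0.2 and 0.5 are treated as anoxic.
    return "A'"

def encode_process(units, max_units=4):
    """Flatten a unit sequence into one fixed-length vector, padding with
    blank units (assumption: at most max_units units per process)."""
    if len(units) > max_units:
        raise ValueError("process has more units than max_units")
    vector = []
    for symbol in units:
        vector.extend(ONE_HOT[symbol])
    for _ in range(max_units - len(units)):
        vector.extend(BLANK)
    return vector
```

For example, the AAO process is the sequence `["A", "A'", "O"]`, which `encode_process` maps to twelve values: three one-hot triplets followed by one blank triplet.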

Wastewater biological toxicity evaluation

We conducted biological toxicity tests to quantify the adverse biological effects of the mixed pollutants in WWTPs and to provide an effective toxicity evaluation of WWTP effluents19,20. Caenorhabditis elegans was used as the test organism for toxicity evaluation. The detailed methodology of the biotoxicity assay experiments is described in Supplementary Note 4. The body length of the nematode was used as the indicator to determine the toxicity reduction ability of each section or of the whole treatment process. Since effluent toxicity is greatly affected by influent toxicity, using the toxicity reduction ratio instead of effluent toxicity ensures comparability and reliability across experiments while providing a comprehensive assessment of wastewater treatment performance63. The toxicity reduction ratio is calculated according to Eq. (1):

$$\text{Toxicity reduction ratio}=\frac{\text{inhibition ratio of influent}-\text{inhibition ratio of effluent}}{\text{inhibition ratio of influent}}\times 100\% \tag{1}$$

where inhibition ratio refers to the inhibition effect of water samples on the body length indicator of nematode, compared with the blank control group.

If the biological toxicity increases after treatment, that is, the inhibition ratio of the effluent is higher than that of the influent, the ratio calculated by Eq. (1) is negative; in this case, we set the toxicity reduction ratio to 0%. In the original sampling data, the toxicity reduction ratio of the biological treatment processes ranged from 0% to 71.2%, with an average of 48.6%.
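Eq. (1) together with the zero floor described above can be written as a short function. This is a minimal sketch: the function and argument names are hypothetical, and rejecting a non-positive influent inhibition ratio is an assumption added to avoid division by zero, a case the text does not discuss.

```python
def toxicity_reduction_ratio(inhibition_influent, inhibition_effluent):
    """Toxicity reduction ratio in percent, following Eq. (1).

    Both arguments are inhibition ratios of nematode body length relative
    to the blank control, expressed in the same units.
    """
    if inhibition_influent <= 0:
        # Assumption: the ratio is undefined without influent inhibition.
        raise ValueError("influent inhibition ratio must be positive")
    ratio = (inhibition_influent - inhibition_effluent) / inhibition_influent * 100.0
    # If toxicity increased through treatment, the ratio is set to 0%.
    return max(ratio, 0.0)
```

For instance, an influent inhibition ratio of 40% and an effluent inhibition ratio of 10% give a toxicity reduction ratio of 75%, while an effluent that is more inhibitory than the influent gives 0%.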

Data preprocessing

To ensure the quality and usability of the data, we carried out strict data preprocessing, including missing data processing, outlier processing, data normalization, and standardization64. These steps yielded high-quality biological wastewater treatment data, providing a solid foundation for the subsequent characterization, modeling, and prediction64,65. This in turn supports a deeper understanding of the effluent performance and toxicity reduction associated with wastewater treatment processes, and thereby efforts to enhance the efficiency and sustainability of wastewater treatment.
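The four preprocessing steps named above can be illustrated with the following Python sketch. The text does not specify which techniques were used, so every choice here is an assumption made for illustration: median imputation for missing values, clipping at three standard deviations for outliers, min-max scaling for normalization, and z-scores for standardization. All function names are hypothetical.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def clip_outliers(values, k=3.0):
    """Clip values lying more than k standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    low, high = mu - k * sigma, mu + k * sigma
    return [min(max(v, low), high) for v in values]


def min_max(values):
    """Scale values to the range [0, 1]; a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre values to zero mean and unit variance; a constant column maps to zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Synthetic influent COD readings (mg/L) with one missing value.
cod = [210.0, None, 250.0, 230.0]
filled = impute_median(cod)
print(filled)
print(min_max(clip_outliers(filled)))
```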

Model construction

We devised a two-step prediction scheme aimed at optimizing wastewater treatment processes: basic water quality prediction followed by toxicity reduction prediction. Predicting effluent quality parameters such as COD, TN, NH3-N, and TP allows us to assess whether the treatment process meets regulatory standards66, here Grade A of the Discharge Standard of Pollutants for Municipal Wastewater Treatment Plant of China. Predicting the toxicity reduction ratio, in turn, enables us to gauge the environmental and biological safety of the treated effluent66. The basic water quality prediction is a necessary step because it ensures compliance with essential water quality standards, laying the foundation for effective process optimization. This two-step design ensures that biological toxicity is removed efficiently without compromising basic water quality compliance, and thereby mitigates the risk of prioritizing toxicity removal at the expense of regulatory water quality standards.

Effluent wastewater quality parameters prediction model

In the first step of the scheme, we used the preprocessed wastewater quality parameters and treatment process parameters derived from the collection module. The treatment process parameters and influent quality parameters, such as COD, TN, NH3-N, TP, DO, Temp, and pH, were used as input features, while effluent quality parameters, such as COD, TN, NH3-N, and TP, served as prediction targets. Notably, each model takes only one wastewater quality parameter as its prediction target.

The approach was tailored to the multiple effluent water quality targets, with a separate prediction model constructed for each water quality indicator. Using the input features described above, we established individual prediction models for COD, TN, NH3-N, and TP with each of four machine-learning algorithms. We conducted five-fold cross-validation experiments to measure the performance of these prediction models and to determine the optimal prediction model for each basic effluent quality parameter67. This evaluation provided insights into the predictive accuracy and reliability of our models and guided the selection of the most effective approach for assessing basic effluent water quality.
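The model selection procedure described above (several candidate algorithms per target, compared by five-fold cross-validation) can be sketched as follows. This is an illustrative outline only: the two candidate "models" are deliberately trivial stand-ins (a mean predictor and a nearest-neighbour predictor), not the SVR, RFR, XGBR, or NNR models of the study. The data are synthetic, RMSE is used as the sole selection criterion for brevity, and all function names are hypothetical.

```python
import random
from math import sqrt


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_rmse(fit, X, y, k=5, seed=0):
    """Mean RMSE over k folds; fit(X_train, y_train) must return a predict function."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        squared = [(predict(X[i]) - y[i]) ** 2 for i in held_out]
        scores.append(sqrt(sum(squared) / len(squared)))
    return sum(scores) / k


def fit_mean(X_train, y_train):
    """Stand-in model: always predicts the training mean."""
    avg = sum(y_train) / len(y_train)
    return lambda x: avg


def fit_nearest(X_train, y_train):
    """Stand-in model: predicts the target of the closest training sample."""
    def predict(x):
        def dist(i):
            return sum((a - b) ** 2 for a, b in zip(X_train[i], x))
        return y_train[min(range(len(X_train)), key=dist)]
    return predict


# Synthetic single-feature data; one such comparison would be run per target
# (COD, TN, NH3-N, TP), since each model predicts a single parameter.
X = [[float(i)] for i in range(20)]
y = [2.0 * row[0] + 1.0 for row in X]

candidates = {"mean": fit_mean, "nearest": fit_nearest}
scores = {name: cross_validated_rmse(fit, X, y) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Keeping the candidate models behind a common `fit` interface means the same loop can be reused unchanged for each target parameter and for the toxicity reduction model.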

Effluent toxicity reduction prediction model

In the second step, we used the influent quality parameters and treatment process parameters as input features and the toxicity reduction ratio as the prediction target.

We used several ML algorithms during the model development phase to construct distinct wastewater toxicity reduction prediction models. We performed five-fold cross-validation experiments to verify the robustness and reliability of these models68,69 and, by comparing their performance and generalization ability, to select the optimal model for predicting the toxicity reduction ratio of the effluent after wastewater treatment.

Machine learning algorithms

The choice of regression algorithm plays a pivotal role in the prediction process, significantly influencing the accuracy and trustworthiness of the predictive outcomes. Several well-established ML algorithms are commonly employed for regression tasks, including SVR70, RFR71, and XGBR72. In parallel, we incorporated a deep learning technique, Neural Network Regression (NNR)60, for comparison with these conventional ML algorithms. Detailed descriptions of the four machine-learning algorithms are presented in Supplementary Note 5.

Model performance measures

In this study, we employed five-fold cross-validation experiments to rigorously assess the prediction performance and generalization capability of the predictive models68. Five-fold cross-validation is explained in Supplementary Note 6. The evaluation measures are MAE, RMSE, and R², described in detail in Supplementary Note 7.
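For reference, the standard definitions of these three measures are given below, on the assumption that they match those in Supplementary Note 7. Here $y_i$ is the measured value, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the measured values, and $n$ the number of samples; these symbols are introduced for illustration only.

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$$

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}$$

$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^{2}}$$

Lower MAE and RMSE indicate smaller prediction errors, while an R² closer to 1 indicates that the model explains more of the variance in the measured values.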

Process unitization

In this phase, we simulated and reconstructed all potential short-range unit combinations within the biological wastewater treatment process. These short-range unit combinations may include two-unit, three-unit, and four-unit combinations. To ensure uniformity and consistency, we represented each combination as a “four-unit” input and applied one-hot encoding. The one-hot encoding method for representing multiple units in biological treatment processes was implemented as follows:

First, the aerobic unit was represented as [1,0,0], the anoxic unit as [0,1,0], the anaerobic unit as [0,0,1], and a blank unit as [0,0,0]. Subsequently, we unified the biological wastewater treatment processes using a four-unit encoding, concatenating the one-hot vectors of the bioreactor units in the order in which they are used in the process. When a process contained fewer than four units, we padded it with blank units. The feature dimension of the “four-unit” representation is therefore 12 (four units × three dimensions). For example, the “four-unit” representation for an anaerobic-anoxic-aerobic process in biological wastewater treatment was AA’O: [0,0,1,0,1,0,1,0,0,0,0,0], while the representation for an anaerobic-aerobic process was AO: [0,0,1,1,0,0,0,0,0,0,0,0].
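The padding and concatenation just described can be expressed as a short Python sketch, which reproduces the two example vectors given above. The function and constant names are hypothetical, and the anoxic label is written with an ASCII apostrophe.

```python
# One-hot codes from the text; a blank unit pads processes shorter than four units.
UNIT_CODES = {"O": [1, 0, 0], "A'": [0, 1, 0], "A": [0, 0, 1]}
BLANK = [0, 0, 0]
MAX_UNITS = 4


def encode_process(units):
    """Concatenate unit codes in process order and pad with blanks to 12 dimensions."""
    if not 1 <= len(units) <= MAX_UNITS:
        raise ValueError("a process must contain between 1 and 4 units")
    vector = []
    for unit in units:
        vector.extend(UNIT_CODES[unit])
    for _ in range(MAX_UNITS - len(units)):
        vector.extend(BLANK)
    return vector


print(encode_process(["A", "A'", "O"]))  # AA'O: [0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
print(encode_process(["A", "O"]))        # AO:   [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```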

Process optimization

This study presents an optimization framework that integrates predictive modeling of basic water quality and toxicity reduction to achieve multi-objective control. The framework first screens unit combinations based on predicted water quality parameters to ensure compliance with effluent standards. It then compares the compliant combinations by their predicted toxicity removal. The specific steps are as follows:

First, the optimal effluent quality prediction models for the four wastewater quality parameters, obtained in the modeling module, were used to predict the effluent quality of all reconstructed short-range unit combinations. We set the effluent quality thresholds according to Grade A of the Discharge Standard of Pollutants for Municipal Wastewater Treatment Plant of China. A short-range unit combination was considered qualified only if all of its predicted effluent quality parameters met these thresholds.

Subsequently, we evaluated the toxicity reduction of the reconstructed short-range unit combinations that met the wastewater quality requirements, using the trained toxicity reduction prediction model to predict the toxicity reduction ratio of each qualified combination. We then retained the predictions that satisfied the toxicity risk criteria and identified the optimal biological wastewater treatment process on the basis of its effectiveness in controlling toxicity risk.

Consequently, we identified the reconstructed unit combinations that met the wastewater quality standards and exhibited the most effective risk control, in line with the multi-objective requirements of achieving basic water quality standards and maximum toxicity reduction efficiency. These unit combinations can potentially serve as novel biological wastewater treatment processes that are well suited to the prevailing influent water quality conditions. This approach avoids the trade-off in which overall water quality is sacrificed to remove specific pollutants, and it has the potential to enhance treatment efficiency and reduce environmental impacts.
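The two screening steps above can be sketched in Python as follows. This is an outline under several loudly stated assumptions, not the study's implementation. The enumeration assumes that every ordered sequence of two to four units, with repetition, counts as a candidate. The predictors are toy functions with no empirical meaning that stand in for the trained models, and in practice they would also take the influent quality parameters as input. The threshold values are illustrative placeholders that should be checked against the discharge standard. The `min_reduction` criterion is hypothetical, because the text does not state the toxicity risk criterion.

```python
from itertools import product

UNITS = ("O", "A'", "A")  # aerobic, anoxic, anaerobic


def enumerate_combinations(min_units=2, max_units=4):
    """All ordered unit sequences of length min_units..max_units (assumed search space)."""
    combos = []
    for n in range(min_units, max_units + 1):
        combos.extend(product(UNITS, repeat=n))
    return combos


def select_process(combos, quality_models, toxicity_model, limits, min_reduction=0.0):
    """Step 1: keep combinations whose predicted effluent meets every limit.
    Step 2: among those, return the one with the highest predicted toxicity reduction."""
    compliant = [
        c for c in combos
        if all(quality_models[name](c) <= limit for name, limit in limits.items())
    ]
    acceptable = [c for c in compliant if toxicity_model(c) >= min_reduction]
    if not acceptable:
        return None
    best = max(acceptable, key=toxicity_model)
    return best, toxicity_model(best)


# Illustrative effluent limits in mg/L (placeholders, to be verified against the standard).
limits = {"COD": 50.0, "TN": 15.0, "NH3-N": 5.0, "TP": 0.5}

# Toy predictors: arbitrary functions of the unit sequence, for demonstration only.
quality_models = {
    "COD": lambda c: 80.0 - 15.0 * c.count("O"),
    "NH3-N": lambda c: 9.0 - 2.0 * c.count("O"),
    "TN": lambda c: 20.0 - 5.0 * c.count("A'"),
    "TP": lambda c: 1.0 - 0.5 * c.count("A"),
}
weights = {"O": 3.0, "A'": 2.0, "A": 1.0}


def toxicity_model(c):
    """Toy toxicity reduction predictor (percent); not derived from any data."""
    return 30.0 + sum((i + 1) * weights[u] for i, u in enumerate(c))


combos = enumerate_combinations()
print(len(combos))
print(select_process(combos, quality_models, toxicity_model, limits))
```

Separating the compliance filter from the toxicity ranking mirrors the framework's priority order: a combination that fails any basic water quality limit is never considered, however large its predicted toxicity reduction.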