Introduction

The world today faces various challenges, from virus outbreaks to the spread of drugs1,2, creating a need for fast response actions that prevent or minimize potentially dangerous consequences. Sensitive and rapid detection of molecules, combined with real-time surveillance and automated data analysis, can deliver an extremely effective tool for monitoring and protecting public health through targeted actions at the point of need3. Traditional laboratory methods such as enzyme-linked immunosorbent assay (ELISA) or nucleic acid sequence-based amplification (NASBA) are used to investigate samples and deliver reliable, highly accurate information4,5. However, this approach requires considerable time (from sample collection to result), trained personnel and significant funds for devices and reagents6. As a result, scaling such a methodology is not feasible, creating a need for a comprehensive system capable of monitoring larger areas and their populations. Recent research highlights that traditional diagnostic systems often suffer from latency and network congestion when processing large healthcare datasets through centralized cloud infrastructures. A novel regional computing approach has been proposed to address these challenges, allowing decentralized data handling and improving real-time response capabilities in public health surveillance7.

A promising strategy to solve this problem is wastewater monitoring, especially in light of the latest advancements in biosensing4,8,9,10. Wastewater-based epidemiology (WBE) involves analyzing sewage to monitor the health of a population, focusing on detecting molecules related to human health. For example, viral particles can enter sewage pipelines through the body fluids and excretions of infected people, introducing viral material into wastewater samples11,12,13. In over 100 studies across the world, infectious diseases and pathogens were detected through wastewater surveillance14. Devianto and Sano reviewed and identified protein markers that are found in relatively high concentrations in wastewater samples and that can be useful for the development of WBE systems15. Hence, this strategy makes it possible to identify potential threats: by analyzing wastewater, targeted molecules such as pathogens, proteins, pharmaceuticals, chemical contaminants, and other substances can be detected and tracked6,15,16,17,18. A study analyzing crude wastewater from South Wales treatment plants demonstrated the robust capability of WBE to detect not only illicit drugs such as cocaine and cannabis but also medicinal compounds like buprenorphine and methadone. This supports the growing use of WBE in tracking a wide range of chemical and pharmaceutical contaminants, emphasizing its relevance for both public health and environmental monitoring19.

Installing such a system at municipal sites such as schools, hospitals, city offices and sewage treatment plants would provide a network able to indicate the area needing intervention. If a given substance is detected or its level changes, the system automatically notifies the authorities about the situation, providing a basis for further actions. Such a system can therefore provide a continuous overview of community health, follow changes in the dynamics of tracked substances, and guide timely interventions where they are needed, saving the time and resources of the services20. Empirical evaluations from China show that the implementation of smart city infrastructure, which often includes enhanced wastewater management technologies, significantly improves sewage treatment capabilities. The integration of technological and socio-economic tools through smart city frameworks enhances wastewater monitoring and treatment effectiveness across urban environments21.

To date, several research groups have reported successful detection of tracked molecules in wastewater samples. Alvarez-Serna et al. presented a label-free and portable field-effect transistor (FET) sensor coupled with a reverse transcription loop-mediated isothermal amplification (RT-LAMP) reaction for detecting the SARS-CoV-2 genome in wastewater samples. The developed sensor enabled detection and qualification of nucleic acids in wastewater samples within 30 min and showed potential for customization to other pathogens22. Sharma et al. developed a field-deployable system based on Nanotrap microbiome particles and RNA Isothermal Co-Assisted and Coupled Amplification11. The system allows targeted virus detection in wastewater, with the whole procedure taking around 60 min. Guerrero-Estaban et al. reported a carbon nanodot-amplified electrochemiluminescence immunosensor for detecting the Spike S1 protein of SARS-CoV-2 in wastewater. The disposable platform showed a low detection limit and a broad linear response23. Sen et al. proposed integrating an electrochemical aptamer-based sensor with an FPE (filtration, purification, extraction) system for in-field wastewater surveillance. The 50 × 70 cm platform allows detection in less than 60 min24. Sokołowski et al. demonstrated an optical method based on spectroscopy for tracking the dynamics of C-reactive protein (CRP) levels in a complex biological matrix20. The primary benefit of the proposed method lies in its use of an optical system that enables near-real-time monitoring of CRP level fluctuations and can be installed with minimal alterations to sewage infrastructure. With the use of machine learning (ML) algorithms, the system constitutes a promising platform for real-time CRP monitoring. Jagadeesan et al. presented a proof of concept for a wastewater-based infectious disease warning system25. Their mass spectrometry-based approach was used for concurrent monitoring of SARS-CoV-2 and CRP, chosen as representatives of the pathogens and proteins occurring in wastewater samples.

In this study, a concept of a point-of-need system able to monitor wastewater samples and classify them using a cubic support vector machine (CSVM) ML model is proposed. For the purpose of this research, a well-known and tested biomarker, CRP, was added at known concentrations to the wastewater samples; however, other substances can also be detected using the presented approach. CRP serves as a critical biomarker for inflammation and infection within the human body26. Traditionally utilized in clinical settings for diagnosing various ailments, it has recently been extended to unconventional domains, notably wastewater analysis. A recent study confirmed the feasibility of CRP detection in human urine using an optical interferometric biosensor integrated with ML classifiers, achieving up to 100% classification accuracy. This supports the continued exploration of CRP in WBE systems and highlights the potential of optical and data-driven methods in health monitoring through wastewater analysis27,28.

CRP serves as a robust and practical biomarker for assessing the inflammatory pathways activated by environmental stressors. Its responsiveness to a broad spectrum of pollutants29,30, medical conditions31 or psychosocial stress32,33, coupled with the accessibility of its measurement, positions CRP as a critical tool for environmental health surveillance. By integrating CRP monitoring into public health strategies, researchers and officials can better understand the population-level impacts of environmental quality, identify vulnerable communities, and evaluate the effectiveness of interventions aimed at mitigating environmental health risks. Mounting evidence supports the use of CRP, a sensitive marker of systemic inflammation (bacterial, viral, cancerous and others)34,35, as a valuable indicator for monitoring the health impacts of environmental exposures. This readily measurable biomarker offers a means to assess the physiological stress induced by a wide range of pollutants, providing a crucial link between environmental quality and public health29,30. Scientific research has increasingly demonstrated a strong association between elevated levels of CRP and exposure to various environmental contaminants. Initially recognized for its link to cardiovascular disease, CRP has since been connected in several studies to exposure to ambient air pollutants, including particulate matter (PM2.5) and ozone36,37. This association holds true for both short-term and long-term exposures across diverse populations, encompassing children, healthy adults, and the elderly. Beyond air quality, investigations have revealed a similar inflammatory response, indicated by increased CRP, following exposure to other common environmental toxins. Notably, studies have linked elevated CRP levels to chronic exposure to heavy metals such as lead, mercury, cadmium, and arsenic38,39. 
Furthermore, emerging research suggests a correlation between pesticide exposure and systemic inflammation as measured by CRP40. The utility of CRP as an environmental health indicator is underscored by several key advantages. Elevated CRP can result from various conditions, including infections, chronic diseases, and lifestyle factors such as smoking and obesity. The development of high-sensitivity wastewater CRP sensing allows for the precise detection of inflammation, offering a more nuanced understanding of populations' chronic or increased environmental exposures and health.

To address the significance of monitoring targeted substances in wastewater, this study focuses on the data analysis and classification stage, as the majority of reports on WBE systems focus on sensor development. Two distinct classification tasks are presented to highlight the use of ML in the field. The first classification focuses on discerning CRP presence and concentration thresholds within wastewater samples using the complete UV–Vis spectrum data. This classification is made across five classes: samples without CRP and four CRP concentration levels ranging from \(10^{-4}\, \upmu\)g/ml to \(10^{-1} \,\upmu\)g/ml. In parallel, the paper extends its analysis with an additional classification restricted to a specific spectral range, utilizing solely the spectrum from 400 nm to 700 nm, which not only reduces the required computational resources but also holds potential cost-saving implications for the development of biosensors currently in the pipeline. This spectral-based approach aligns with recent findings where UV–Vis spectroscopy was successfully applied to assess the removal of complex dye compounds in textile wastewater using biosynthesized nanoparticles. This confirms its applicability as a reliable tool for detecting biomarkers and environmental contaminants41. Thus, the contributions of this paper are the following:

  • The application of machine learning techniques to the monitoring of CRP concentrations in wastewater has been explored, providing valuable insights into environmental sample analysis.

  • The dynamism of CRP concentration levels in wastewater has been illustrated through the use of multiple classes, enabling the observation of varying CRP concentrations over time within the same wastewater region. This multi-class approach allows for capturing temporal fluctuations and spatial variability, reflecting the dynamic nature of CRP presence in complex wastewater matrices.

  • A 5-class classification scheme, distinguishing five different CRP concentration levels, has been implemented for wastewater samples, representing a novel categorization not previously reported in the literature.

In summary, the primary objective of this study is the development and evaluation of machine learning models for the classification of CRP concentrations in wastewater samples using UV–Vis spectra. Specifically, the CSVM algorithm is applied within a multi-class framework to distinguish between five concentration levels of CRP. Two classification tasks are conducted: one utilizing the broadband spectral range (220–750 nm) and another limited to a restricted spectral range (400–750 nm), reflecting potential cost and computational efficiencies for future biosensor development. Model performance is assessed through repeated experiments to ensure robustness and reproducibility. These objectives, including the design, training, and evaluation of the CSVM models, are described in more detail later in the manuscript.

The structure of the paper is as follows: section “Materials and methods” describes the materials and methods used, including data collection, preprocessing, and the application of ML. Section “Results and discussion” details the results of the classification tasks, comparing model performances across multiple metrics. Finally, section “Conclusion” concludes the study with a summary of key insights and discusses potential implications for future developments in wastewater monitoring and environmental biosensing.

Materials and methods

This section outlines the methodology adopted to investigate the classification of CRP concentrations in municipal wastewater samples using UV–Vis spectrometry signals and machine learning techniques. It begins with a description of the dataset, which consists of real influent samples spiked with known CRP concentrations. The spectral characteristics of these samples are then discussed, followed by a summary of the classification approach, including the selection of spectral markers. Finally, the machine learning models employed for multi-class classification are introduced, detailing their configurations and relevance to the study’s objectives. Together, these subsections provide a comprehensive overview of the experimental and computational framework that underpins the analysis.

Dataset

In this study, a dataset comprising 840 distinct wastewater samples, each with a different concentration of CRP (including samples without CRP), serves as the basis for analysis. These samples cover a diverse spectrum of CRP levels, ranging from negligible concentrations to notable quantities, reflecting the dynamic composition inherent to wastewater matrices. The dataset was constructed from measurements carried out with a spectrophotometer (NanoDrop ND-1000, Thermo Fisher Scientific Inc., Waltham, MA, USA). The methodology, sample description and measurement process are described elsewhere20.

Although the dataset comprises 840 individual samples, it is important to highlight that all were derived from real municipal wastewater. Specifically, 14 composite influent samples were collected between August and October 2023 from the Gdynia-Dębogórze Wastewater Treatment Plant (WWTP), located along the Baltic Sea coast in northern Poland. This WWTP is the second-largest facility of its kind in the region and serves both the city of Gdynia and neighboring municipalities. A 24-h composite, flow-proportional sampling procedure was employed to represent the daily composition of raw municipal wastewater. The influent stream is composed predominantly of domestic wastewater, with industrial and hospital sources contributing approximately 1% and 0.1%, respectively. During the sampling campaign, the plant operated at a hydraulic load of roughly 450,000 population equivalents (PE), with an average daily influent flow of 61,886.1 ± 3760.2 \(\hbox {m}^3\)/day. The treatment process employed is mechanical-biological, including advanced nutrient removal and occasional chemical phosphorus precipitation.

The physicochemical characteristics of the collected influent samples reflected typical raw wastewater complexity. On average, samples showed a chemical oxygen demand (COD) of 1268.3 ± 203.6 mg \(\hbox {O}_2\)/L, biochemical oxygen demand (\(\hbox {BOD}_5\)) of 614.2 ± 149.2 mg \(\hbox {O}_2\)/L, and total suspended solids (TSS) of 561.7 ± 90.0 mg/L. Nitrogen and phosphorus levels were also characteristic of high-load influents, with total nitrogen (TN) at 97.2 ± 5.2 mg N/L, ammonium nitrogen (N–\(\hbox {NH}_4^+\)) at 68.8 ± 2.7 mg N/L, total phosphorus (TP) at 12.3 ± 2.3 mg P/L and orthophosphates (P–\(\hbox {PO}_4^{3-}\)) at 5.9 ± 0.1 mg P/L. Other measured parameters included pH of 8.0 ± 0.1, and conductivity of 936.0 ± 73.7 \(\upmu\)S/cm. These values confirm that the experimental conditions were grounded in the complex and variable matrix of real influent wastewater, as encountered in full-scale operational facilities.

To enable ML classification of CRP levels, controlled spiking of CRP into the real wastewater samples was performed. This approach preserved the authentic background variability and interferences of raw municipal wastewater, while allowing for reliable class labeling and model evaluation. As no synthetic matrices or laboratory-prepared waters were used, the dataset represents a controlled experimental design built upon genuine environmental samples. However, while the complexity of the matrix strengthens the ecological relevance of the study, it is acknowledged that the generalizability of the models could be further enhanced by validating them on wastewater from other geographical locations or treatment configurations. Such external validation would confirm the robustness of the models under different operational and environmental conditions.

All wastewater samples utilized in this study were obtained from the influent streams of a municipal WWTP and collected over multiple days, thereby inherently reflecting the natural temporal and compositional variability characteristic of real wastewater matrices. These samples comprised complex mixtures of organic, inorganic, and colloidal constituents, including variable concentrations of nitrogen and phosphorus species, organic carbon compounds, and suspended solids, without artificial filtration or significant matrix alteration beyond minimal preprocessing for spectrophotometric analysis. Controlled additions of CRP were performed solely to establish known concentration classes for model training and evaluation. Consequently, the dataset preserves the authentic physicochemical heterogeneity and spectral interferences present in operational wastewater treatment environments. This approach ensures that the reported classification performance accounts for the challenges associated with real-world sample complexity, enhancing the ecological validity and practical relevance of the developed machine learning models.

The size and breadth of this dataset facilitate robust statistical analyses and ML model training across the classification tasks outlined in the research. By encompassing a wide array of CRP concentrations, the dataset affords a comprehensive understanding of CRP distribution patterns within wastewater, laying the groundwork for nuanced insights into the interplay between CRP levels and environmental dynamics.

Classification

For the classification task, 176 markers were considered, each coinciding with a specific point of the UV–Vis absorption spectrum. These markers correspond to 176 distinct points along the wavelength axis, providing a comprehensive dataset that captures the spectral characteristics necessary for distinguishing between different classes of wastewater samples. The wavelength range for each spectrum was 220–750 nm with an accuracy of 1 nm.

Additionally, for classification on the restricted spectral range (400–720 nm), 116 markers were chosen. By leveraging this detailed spectral information, the classification model can effectively identify subtle variations in the absorption profiles, which are indicative of the presence and concentration of CRP in the wastewater. This approach ensures a robust analysis by utilizing the full breadth (or a restricted range) of the absorption spectrum, enhancing the accuracy and reliability of the classification results. Fig. 1 shows a representation of the UV–Vis spectroscopy-based absorption spectrum.

Feature engineering in the present study was primarily conducted through explicit marker selection based on spectral resolution, encompassing 176 markers across the full UV–Vis wavelength range of 220–750 nm, and 116 markers within a restricted range of 400–720 nm. This comprehensive selection of spectral points provided a detailed dataset capturing the absorption characteristics relevant for CRP classification in wastewater samples. The dataset used in this study includes five distinct CRP concentration classes, distributed as follows: wastewater with no detectable CRP (176 samples), CRP at \(10^{-4}\) \(\upmu\)g/ml (166 samples), \(10^{-3}\) \(\upmu\)g/ml (169 samples), \(10^{-2}\) \(\upmu\)g/ml (165 samples), and \(10^{-1}\) \(\upmu\)g/ml (164 samples). This relatively balanced sample distribution across all classes minimizes potential bias and supports robust and fair model training and evaluation. No manual spectral transformations, such as derivatives, integrals, or spectral band ratios, were applied. Instead, higher-order interactions among spectral features were implicitly modeled via the cubic polynomial kernel of the Cubic Support Vector Machine (CSVM) within the Error-Correcting Output Codes (ECOC) framework. This model-driven approach enabled the capture of complex, non-linear relationships between absorption spectra and CRP concentration classes without handcrafted feature engineering. Although detailed analysis of spectral region importance was not the primary focus, comparable classification accuracies exceeding 65% were obtained using both the full spectral range and the restricted 400–720 nm region, suggesting that the visible spectrum alone contains a substantial portion of the discriminative information. This finding also indicates the potential for hardware simplification in future sensor development. 
Further internal analyses revealed that wavelengths corresponding to amide and aromatic ring absorption zones (approximately 400–500 nm), as well as protein- and lipid-associated shoulders (approximately 600–700 nm), were significant contributors to classification confidence. These observations are consistent with established absorbance characteristics of CRP and its interactions within wastewater matrices. Future work will incorporate advanced explainability methods, such as SHAP and permutation feature importance, to more precisely quantify wavelength-specific contributions, thus facilitating the optimization of optical sensor design and improving interpretability of spectral signatures related to CRP presence.
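As a quick check of the class balance reported above, the following minimal Python sketch computes each class's share of the dataset and the ratio between the largest and smallest class. The class counts are taken from the text; the label strings and variable names are illustrative assumptions.

```python
# Samples per CRP concentration class, as reported in the text.
# The label strings are illustrative only.
class_counts = {
    "no CRP": 176,
    "1e-4 ug/ml": 166,
    "1e-3 ug/ml": 169,
    "1e-2 ug/ml": 165,
    "1e-1 ug/ml": 164,
}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Ratio of the largest class to the smallest one (1.0 would be perfectly balanced).
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

# Accuracy of a trivial classifier that always predicts the largest class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label:>12}: {share:.1%}")
print(f"total = {total}, max/min = {imbalance_ratio:.3f}, "
      f"majority baseline = {majority_baseline:.1%}")
```

The counts sum to the 840 samples of the dataset, and a classifier that always predicted the largest class would reach only about 21% accuracy, which gives a reference point for the accuracies quoted above.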

Fig. 1

Absorption spectrum signal and markers from UV–Vis spectrometry.

Machine learning models

In the realm of classification, several ML models can be leveraged to enhance decision-making processes and support the proposed wastewater classification approach42,43. Models such as Support Vector Machines (SVM), Neural Networks (NN), K-Nearest Neighbors (KNN), Ensemble models, Decision Trees, Discriminants, and Naive Bayes were explored for their potential in handling the complexities of wastewater data. Among these, the cubic Support Vector Machine (CSVM) demonstrated the highest performance in both classification tasks, as shown in the results. Consequently, this subsection focuses on CSVM, providing a detailed explanation of its methodologies and advantages in effectively addressing wastewater classification challenges44,45.

SVMs are a powerful set of supervised learning methods used primarily for classification, though they can also be applied to regression and outlier detection tasks46. The fundamental concept behind SVMs is to find the optimal hyperplane that separates data points of different classes with the maximum margin, acting as a decision boundary. In a two-dimensional space, this hyperplane is a line; in three dimensions, it is a plane; and in higher dimensions, it is a hyperplane. The goal is to ensure that data points from different classes are on opposite sides of this hyperplane, achieving the best possible separation.

A key feature of SVMs is the margin, the distance between the hyperplane and the nearest data points from each class, known as support vectors. The margin should be as wide as possible because a larger margin implies better generalization and a lower risk of misclassification on new, unseen data. The optimal hyperplane is the one that maximizes this margin, making SVMs highly effective at ensuring that the classifier is robust and performs well on unseen data. An example of a binary classification optimal hyperplane is presented in Fig. 2.
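For reference, the margin maximization described above is commonly written as the following soft-margin optimization problem. This is the standard textbook formulation rather than a statement of the exact solver settings used in this study; the symbols \(\mathbf{w}\) (normal vector of the hyperplane), \(b\) (bias), \(\xi_i\) (slack variables), \(C\) (penalty parameter) and \(y_i \in \{-1, +1\}\) (class labels of the training points \(\mathbf{x}_i\)) are introduced here for illustration.

\[
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \; \frac{1}{2}\Vert \mathbf{w}\Vert^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,n.
\]

In the separable case (\(\xi_i = 0\)), the width of the margin is \(2/\Vert \mathbf{w}\Vert\), so minimizing \(\Vert \mathbf{w}\Vert\) is equivalent to maximizing the margin; the support vectors are the points for which the constraint holds with equality.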

Fig. 2

Hyperplane separation for a binary SVM classifier with two classes represented by circles and squares.

One of the strengths of SVMs is their effectiveness in high-dimensional spaces. They are particularly useful when the number of dimensions exceeds the number of samples, a scenario that can be challenging for many other algorithms. SVMs handle this by transforming the input data into a higher-dimensional space where it becomes easier to segregate classes that are not linearly separable in the original space. This transformation is done using kernel functions, which map the input data into a higher-dimensional feature space. Kernel functions are crucial to the power of SVMs. Commonly used kernels include the linear kernel, polynomial kernel, and radial basis function (RBF) kernel. These kernels allow SVMs to create complex decision boundaries that can handle a variety of classification problems, even when the data is not linearly separable in the original feature space. By applying these kernel functions, SVMs can effectively capture the underlying structure of the data, making them a versatile and powerful tool for a wide range of classification and regression tasks.
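The idea that a kernel evaluates an inner product in a higher-dimensional feature space can be illustrated with a small Python sketch for the degree-three polynomial kernel. The function names, the `scale` and `offset` parameters, and the two-dimensional toy inputs are assumptions made for illustration; the kernel parameters actually used in this study are not specified here.

```python
import math

def cubic_kernel(x, z, scale=1.0, offset=0.0):
    """Polynomial kernel of degree three: (scale * <x, z> + offset) ** 3."""
    dot = sum(a * b for a, b in zip(x, z))
    return (scale * dot + offset) ** 3

def explicit_cubic_map(x):
    """Explicit feature map for a 2-D input that reproduces cubic_kernel
    with scale=1 and offset=0: all degree-3 monomials, suitably weighted."""
    x1, x2 = x
    r3 = math.sqrt(3.0)
    return (x1**3, r3 * x1**2 * x2, r3 * x1 * x2**2, x2**3)

# Two toy 2-D points (hypothetical values, not spectral data).
x = (0.5, -1.2)
z = (2.0, 0.3)

# Kernel value computed directly in the original 2-D space.
kernel_value = cubic_kernel(x, z)

# Same value computed as an ordinary dot product in the mapped 4-D space.
mapped_value = sum(a * b for a, b in zip(explicit_cubic_map(x), explicit_cubic_map(z)))

print(kernel_value, mapped_value)
```

Both computations give the same number, but the kernel never builds the mapped vectors. For inputs with many features, such as a spectrum with well over a hundred markers, the explicit map would contain a very large number of monomials, which is why the kernel form is preferred.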

In this work, the proposed ML model designed to handle multi-class classification problems uses the Error-Correcting Output Codes (ECOC) method. ECOC decomposes a multi-class problem into multiple binary classification problems, each solved by a binary classifier. The results from these binary classifiers are combined to make the final multi-class prediction. In this specific model, the response variable being predicted is the CRP concentration level, and the model aims to classify instances into one of five classes labeled 0, 1, 2, 3, and 4 (no CRP, \(10^{-4}\, \upmu\)g/ml, \(10^{-3}\, \upmu\)g/ml, \(10^{-2}\,\upmu\)g/ml and \(10^{-1}\, \upmu\)g/ml CRP concentration levels). There are 10 binary learners in this model, arranged in a one-vs-one strategy in which one binary classifier is trained for every pair of classes. With five classes, this results in ten binary classifiers, matching the number of binary learners specified.
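The one-vs-one decomposition described above can be sketched as a coding matrix with one row per class and one column per binary learner. The following Python snippet is a minimal illustration; the sign convention (+1 for the first class of a pair, -1 for the second, 0 for classes a learner ignores) and the variable names are assumptions, not details taken from the study's implementation.

```python
from itertools import combinations

classes = [0, 1, 2, 3, 4]  # no CRP, 1e-4, 1e-3, 1e-2, 1e-1 ug/ml

# One binary learner per unordered pair of classes.
pairs = list(combinations(classes, 2))

# coding_matrix[c][j] is +1 if class c is the positive class of learner j,
# -1 if it is the negative class, and 0 if learner j never sees class c.
coding_matrix = [
    [1 if c == pos else -1 if c == neg else 0 for (pos, neg) in pairs]
    for c in classes
]

print(f"{len(pairs)} binary learners: {pairs}")
for c, row in zip(classes, coding_matrix):
    print(f"class {c}: {row}")
```

Each column contains exactly one +1 and one -1, and each class takes part in four of the ten learners.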

Each binary learner in this model is a CSVM, which uses polynomial kernel functions of degree three. This means that the decision boundary is a cubic polynomial function of the input features, allowing the model to capture more complex relationships than a linear SVM. Importantly, since each CSVM is a separate model, each binary learner has its own bias. This bias is a characteristic of the specific classifier and influences how it separates the classes it is trained on. During the prediction phase, each of the ten binary classifiers makes a prediction. The final class prediction is determined by combining these results by considering which class has the highest aggregated score from the binary classifiers.
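To make the aggregation step concrete, the sketch below combines the outputs of ten pairwise learners into a single class prediction. It uses a deliberately simple rule (each learner's signed score is credited to one class of its pair and debited from the other), which is an assumption made for illustration and may differ from the decoding scheme used by the actual model. The scores are hypothetical numbers, not outputs of the trained classifiers.

```python
from itertools import combinations

classes = [0, 1, 2, 3, 4]
pairs = list(combinations(classes, 2))

# Hypothetical decision scores of the ten binary learners for one sample.
# A positive score favours the first class of the pair, a negative score the second.
scores = {
    (0, 1): -0.4, (0, 2): -1.1, (0, 3): -0.2, (0, 4): 0.3,
    (1, 2): -0.7, (1, 3): 0.5, (1, 4): 0.9,
    (2, 3): 0.8, (2, 4): 1.2,
    (3, 4): 0.6,
}

# Aggregate: add the score to the first class and subtract it from the second.
totals = {c: 0.0 for c in classes}
for (first, second) in pairs:
    totals[first] += scores[(first, second)]
    totals[second] -= scores[(first, second)]

# The predicted class is the one with the highest aggregated score.
predicted = max(totals, key=totals.get)

print(totals)
print("predicted class:", predicted)
```

With these example scores, class 2 wins all four of its pairwise comparisons and therefore receives the highest total.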

Table 1 summarizes the configurations of various ML models evaluated for the classification task. Each model type is listed alongside its key hyperparameters and learning settings, providing insight into the diversity of algorithms tested and the range of configurations applied. This breadth of modeling approaches helps ensure a robust assessment of which types of algorithms are most suitable for predicting CRP concentration categories in wastewater samples.

Table 1 Model configurations with parameters.

Ensemble models combine multiple base learners to produce a more robust and accurate classifier. By aggregating the predictions of several weaker models, they reduce the risk of overfitting and improve generalization. Techniques like bagging (e.g., Bagged Trees) reduce variance by averaging over models trained on different data subsets, while boosting (e.g., RUSBoosted Trees) focuses sequentially on correcting the errors of previous learners, improving performance particularly on harder-to-classify instances. Subspace ensembles further enhance diversity by training each learner on a random subset of features. These models are well-suited to complex, noisy datasets and often deliver strong performance in multi-class tasks.

Decision trees are hierarchical models that split the data based on feature values to form a tree-like structure of decision rules. Each node in the tree represents a feature, and each branch corresponds to a decision based on that feature’s value. They are simple to interpret and fast to train. However, standalone trees can be sensitive to small data fluctuations, leading to overfitting, especially with deep trees (e.g., Fine Tree). Simpler trees (e.g., Coarse Tree) offer greater generalization but may underfit. Decision trees form the foundation of many ensemble methods and serve as a baseline for understanding feature importance and interactions.

KNN models are instance-based, non-parametric classifiers that predict the label of a data point based on the majority class among its k nearest neighbors in the training set. The choice of distance metric (e.g., Euclidean, cosine, Minkowski) and the number of neighbors significantly affects performance. Weighted versions further refine predictions by giving more influence to closer neighbors. KNN is simple and effective in low-dimensional problems but can become computationally expensive and less accurate in high-dimensional or imbalanced datasets. Despite these challenges, it can perform well when local structure in the data is informative.
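The majority-vote rule described above can be written in a few lines of Python. This is a minimal sketch with Euclidean distance and unweighted votes; the function name, the toy two-dimensional points and their labels are hypothetical and do not correspond to the spectral data or to the KNN configurations evaluated in the study.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k nearest neighbours and return the most frequent one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy training set: three points labelled 0 and two points labelled 4.
train_points = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.18), (0.80, 0.90), (0.85, 0.95)]
train_labels = [0, 0, 0, 4, 4]

prediction = knn_predict(train_points, train_labels, query=(0.18, 0.22), k=3)
print("predicted label:", prediction)
```

Because every prediction requires a distance to each training sample, the cost grows with both the dataset size and the number of features, which is the computational burden mentioned above.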

Neural networks (NNs) are composed of interconnected layers of artificial neurons that transform inputs through weighted connections and non-linear activation functions. They are capable of modeling complex, non-linear relationships in data. In this study, both shallow (single- or two-layer) and deep architectures were explored, with varying layer sizes and activation functions like ReLU and Softmax. While neural networks require more data and computational power, they can outperform traditional models when sufficient training data is available and proper regularization is applied. Their flexibility makes them ideal for capturing nuanced patterns in complex datasets like spectrometry data.

Naive Bayes classifiers are probabilistic models based on Bayes’ Theorem, with the simplifying assumption that features are conditionally independent given the class label. Despite this assumption often being violated in practice, Naive Bayes models tend to perform surprisingly well, especially on high-dimensional or noisy datasets. Gaussian Naive Bayes assumes normally distributed features, while Kernel Naive Bayes uses non-parametric density estimation to handle more flexible distributions. These models are fast, require little training data, and are useful as a strong baseline or in ensemble combinations.
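In formula form, the conditional-independence assumption leads to the following standard decision rule, where the symbols are introduced here for illustration: \(x_1,\dots,x_d\) are the feature values of a sample (for example, absorbance at each spectral marker), \(k\) indexes the classes, \(P(y=k)\) is the class prior and \(p(x_j \mid y=k)\) is the per-feature class-conditional density.

\[
\hat{y} = \arg\max_{k} \; P(y=k)\prod_{j=1}^{d} p\left(x_j \mid y=k\right)
\]

In the Gaussian variant, each \(p(x_j \mid y=k)\) is a normal density with a class-specific mean and variance, whereas the kernel variant estimates it non-parametrically from the training samples.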

Discriminant analysis models classify data by modeling the probability distributions of each class and using Bayes’ Rule to assign labels. Linear Discriminant Analysis (LDA) assumes that classes share a common covariance structure, resulting in linear decision boundaries. Quadratic Discriminant Analysis (QDA) relaxes this assumption and allows each class to have its own covariance, yielding more flexible (quadratic) boundaries. These models are efficient and interpretable, especially when the data is approximately Gaussian and the class structure is well-separated.

Kernel models implicitly map input data into higher-dimensional spaces using a kernel function, allowing linear algorithms to learn non-linear relationships. While this is typically associated with SVMs, some other models (like kernel-based Naive Bayes or custom SVM implementations) also use this approach. Kernel learning is especially powerful for classification tasks with complex decision boundaries, offering a balance between model complexity and interpretability.

Results and discussion

Following the previous section, several ML models are considered for classifying CRP in wastewater samples. The models are trained and evaluated using five-fold cross-validation: the dataset is partitioned into five subsets, and each model is iteratively trained on four folds while its performance is validated on the remaining fold. This iterative process makes it possible to gauge the models’ efficacy across different subsets of the data, which reduces the dependence of the results on any single train/validation split and gives a better picture of generalizability. The aim is to use the predictive power of ML to capture the relationship between the measured spectra and CRP levels.

In the first 5-class classification, classes 0 through 4 correspond to increasing CRP concentrations in the wastewater samples, and the entire UV–Vis absorption spectrum is used. Specifically, class 0 denotes samples without CRP, class 1 corresponds to a CRP concentration of \(10^{-4}\, \upmu\)g/ml, class 2 to \(10^{-3}\, \upmu\)g/ml, class 3 to \(10^{-2}\, \upmu\)g/ml, and class 4 to the highest concentration of \(10^{-1}\, \upmu\)g/ml. Additionally, a range-restricted 5-class classification is considered, which uses only the part of the UV–Vis absorption spectrum from 400 nm onwards. These classes categorize wastewater samples by their CRP content, supporting the characterization and analysis of CRP levels in the context of wastewater management and monitoring.
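The two setups can be sketched as a label mapping plus a wavelength filter. In the example below, the function names and the sample values are hypothetical, and treating 400 nm as an inclusive lower bound is an assumption.

```python
# Class labels as defined in the text (concentrations in micrograms per ml).
CONCENTRATION_TO_CLASS = {0.0: 0, 1e-4: 1, 1e-3: 2, 1e-2: 3, 1e-1: 4}

def class_label(concentration_ug_per_ml):
    return CONCENTRATION_TO_CLASS[concentration_ug_per_ml]

def restrict_spectrum(wavelengths_nm, absorbances, min_wavelength_nm=400.0):
    # Keep only the spectral points at or above the cut-off wavelength
    # (inclusive bound assumed for illustration).
    kept = [(w, a) for w, a in zip(wavelengths_nm, absorbances)
            if w >= min_wavelength_nm]
    return [w for w, _ in kept], [a for _, a in kept]

# Made-up spectrum with four points.
wl, ab = restrict_spectrum([200.0, 300.0, 400.0, 500.0], [1.2, 0.8, 0.3, 0.1])
```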

To assess performance, a set of standard metrics was employed: Accuracy, Precision, Recall, F1 Score, and Specificity. Together they give a comprehensive picture of the model’s predictive capabilities. Accuracy quantifies the ratio of correct predictions to the total number of predictions. Precision measures the proportion of true positives among all positive predictions, while Recall is the proportion of true positives among all actual positives. The F1 Score, the harmonic mean of Precision and Recall, provides a balanced evaluation of the two. Lastly, Specificity is the ratio of true negatives to the sum of true negatives and false positives.

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \tag{1}$$
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{2}$$
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{3}$$
$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
$$\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}} \tag{5}$$
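For a multi-class problem, these definitions are commonly applied per class in a one-vs-rest fashion and then averaged. The sketch below assumes such macro-averaging (the text does not state how the per-class values were aggregated) and uses a made-up 3-class confusion matrix whose rows are actual classes and whose columns are predicted classes.

```python
def per_class_counts(cm, k):
    # One-vs-rest counts for class k; rows = actual, columns = predicted.
    n = len(cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(cm[i][k] for i in range(n)) - tp
    tn = sum(map(sum, cm)) - tp - fn - fp
    return tp, fp, fn, tn

def macro_metrics(cm):
    n = len(cm)
    total = sum(map(sum, cm))
    sums = {"precision": 0.0, "recall": 0.0, "f1": 0.0, "specificity": 0.0}
    for k in range(n):
        tp, fp, fn, tn = per_class_counts(cm, k)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        specificity = tn / (tn + fp) if tn + fp else 0.0
        for name, value in (("precision", precision), ("recall", recall),
                            ("f1", f1), ("specificity", specificity)):
            sums[name] += value
    metrics = {name: value / n for name, value in sums.items()}
    metrics["accuracy"] = sum(cm[k][k] for k in range(n)) / total
    return metrics

# Made-up confusion matrix, for illustration only.
toy_cm = [[8, 2, 0],
          [1, 7, 2],
          [0, 3, 7]]
toy_metrics = macro_metrics(toy_cm)
```

Because the true negatives of each class include every correctly handled sample of all other classes, the one-vs-rest specificity is typically much higher than the recall in a multi-class setting.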

The ML models were assessed on their ability to classify CRP levels in wastewater samples, and the best model for each classification task was identified by its accuracy averaged over 30 repetitions. A stratified 5-fold cross-validation was implemented, with each fold using 80% of the data for training and 20% for validation. As no hyperparameter tuning was performed, the cross-validation procedure was repeated 30 times to obtain robust and reliable performance estimates. This approach makes efficient use of the entire dataset and mitigates the risk of overfitting, so no separate test set was held out. To streamline the presentation and focus on the most promising results, the best repetition of the top model is also reported for each classification task, alongside the mean over all repetitions. This repetition represents the highest predictive accuracy obtained within its category and offers insight into the most suitable methodology for analyzing the CRP concentration in wastewater samples.
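The repeated, stratified splitting can be sketched as follows. This is a simplified illustration with hypothetical function names and a made-up set of 50 balanced labels. Each repetition reshuffles the samples within each class and deals them out across five folds, so every validation fold preserves the class proportions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    # Shuffle indices within each class, then deal them round-robin to folds.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds

def repeated_cv_splits(labels, n_folds=5, n_repeats=30):
    # Yield (repetition, fold, train indices, validation indices).
    for repeat in range(n_repeats):
        folds = stratified_folds(labels, n_folds, seed=repeat)
        for held_out in range(n_folds):
            val_idx = folds[held_out]
            train_idx = [i for f, fold in enumerate(folds)
                         if f != held_out for i in fold]
            yield repeat, held_out, train_idx, val_idx

labels = [c for c in range(5) for _ in range(10)]  # 50 hypothetical samples
splits = list(repeated_cv_splits(labels))
```

With five folds, each split trains on 80% of the samples and validates on the remaining 20%; the per-repetition metric is the average over the five folds, and the reported mean is taken over the 30 repetitions.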

To assess whether the performance differences among models were statistically meaningful, a non-parametric Wilcoxon signed-rank test was conducted across the 30 repetitions for the five best-performing classifiers. The resulting p-values, summarized in Tables 4 and 5, indicate that, particularly under the restricted-spectrum condition, the Cubic Support Vector Machine (CSVM) significantly outperformed the other models across all evaluated metrics (accuracy, precision, recall, F1 score, and specificity), with p-values far below the conventional 0.05 threshold and reaching values as low as \(10^{-23}\). Even in the full-spectrum scenario, CSVM demonstrated statistically significant superiority over models such as QSVM (e.g., \(p = 0.0109\) for accuracy). These findings provide strong statistical evidence that CSVM offers a more robust classification performance in this context. While the absolute improvements may appear modest, the consistency and statistical reliability of CSVM’s performance underscore its suitability for classifying CRP concentrations, with moderate accuracy, in complex real wastewater matrices.
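The paired comparison can be sketched with a self-contained implementation of the Wilcoxon signed-rank test. The version below is an illustration only: it uses the normal approximation with a tie correction and no continuity correction, which may differ from the exact procedure used to produce Tables 4 and 5, and the accuracy values in the usage lines are made up.

```python
import math

def wilcoxon_signed_rank(a, b):
    # Paired differences; zero differences are discarded.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    # Sum of ranks of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return w_plus, z, p_value

# Made-up per-repetition accuracies (%) for two hypothetical models.
model_a = [65.2, 65.4, 64.9, 65.1, 65.5, 65.0, 65.3, 64.8]
model_b = [64.1, 64.6, 64.5, 64.0, 64.9, 64.7, 64.2, 64.4]
w_plus, z, p_value = wilcoxon_signed_rank(model_a, model_b)
```

Because the test is paired, both models must be evaluated on the same repetitions (the same splits) for the comparison to be meaningful.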

Tables 2 and 3 report the performance metrics for each classification task, namely the accuracy, precision, recall, F1 score, and specificity of the ML models employed. The accompanying figures show the confusion matrix and Receiver Operating Characteristic (ROC) curve for the best repetition of the most effective ML model in each task, with accuracy as the primary selection criterion. These visual aids give a detailed view of the classification outcomes and of how well the chosen model discriminates between different concentrations of CRP in wastewater samples.

Table 2 presents the performance of the ML models in classifying wastewater samples into five categories based on their CRP concentration levels. The categories are: wastewater without CRP, and wastewater with CRP concentrations of \(10^{-4}\, \upmu\)g/ml, \(10^{-3}\, \upmu\)g/ml, \(10^{-2}\, \upmu\)g/ml and \(10^{-1}\, \upmu\)g/ml. The table provides the key performance metrics for both the mean performance of the CSVM model over all repetitions and its best repetition.

Table 2 5-class classification: wastewater without CRP, \(10^{-4} \, \upmu\)g/ml, \(10^{-3} \, \upmu\)g/ml, \(10^{-2} \, \upmu\)g/ml, and \(10^{-1} \, \upmu\)g/ml CRP (all spectrum). The reported values represent the average performance across 30 repetitions. The best value for each metric, highlighted in bold, corresponds to the single best-performing repetition among all trials.

As the top-performing model across all classification tasks, the CSVM achieved a mean accuracy of 65.19% over 30 repetitions, with the best repetition reaching a slightly higher accuracy of 65.48%. The model therefore predicts the correct CRP concentration category for about 65% of the wastewater samples. This is well above the 20% expected from random guessing among five classes, but it leaves considerable room for improvement. The mean precision of the CSVM model is 65.44% (65.81% for the best repetition): when the model predicts a given CRP concentration category, it is correct about 65% of the time, so roughly one in three predictions for a class is a false positive. The mean recall is 65.11% (65.39% for the best repetition), meaning that the model correctly identifies about 65% of the samples belonging to each CRP concentration category, which indicates moderate sensitivity. The mean F1 score is 65.27% (65.60% for the best repetition); an F1 score in this range reflects a balance between precision and recall but also shows that the model’s performance could be enhanced. The mean specificity is 91.30% (91.38% for the best repetition). High specificity means that the model is very good at recognizing samples that do not belong to a given CRP concentration category, that is, it produces few false positives relative to the number of true negatives.
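As a quick consistency check, substituting the reported precision and recall into Eq. (4) reproduces the reported F1 scores. This is only a sanity check on rounded, averaged values, not a re-derivation of the results:

$$\begin{aligned} \text{F1}_{\text{mean}} &= \frac{2 \times 65.44 \times 65.11}{65.44 + 65.11} = \frac{8521.60}{130.55} \approx 65.27\% \\ \text{F1}_{\text{best}} &= \frac{2 \times 65.81 \times 65.39}{65.81 + 65.39} = \frac{8606.63}{131.20} \approx 65.60\% \end{aligned}$$

The chance level quoted for comparison follows from the number of classes: a uniform random guess among five classes is correct with probability \(1/5 = 20\%\).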

Table 3, in turn, shows the performance of the ML models when only the range-restricted spectrum is used for classification. The metrics for the CSVM model differ slightly from the full-spectrum results, showing how its performance changes when it is evaluated on a subset of the spectral features. Accuracy in this setup is 63.35% on average and 64.88% for the best repetition, slightly lower than the full-spectrum results. Precision, Recall, F1 Score, and Specificity are also slightly lower than in the full-spectrum case. The range-restricted spectrum therefore yields somewhat lower performance, but the closeness of the results implies that it carries almost the same information for classifying the wastewater samples as the full spectrum while using fewer spectral features. To put the CSVM results in context, other models, namely Quadratic SVM, Ensemble Subspace KNN, Wide NN and Fine Gaussian SVM, have been considered for comparison (Tables 4 and 5).

Table 3 5-class classification: wastewater without CRP, \(10^{-4} \, \upmu\)g/ml, \(10^{-3} \, \upmu\)g/ml, \(10^{-2} \, \upmu\)g/ml, and \(10^{-1} \, \upmu\)g/ml CRP (+400 nm spectrum). The reported values represent the average performance across 30 repetitions. The best value for each metric, highlighted in bold, corresponds to the single best-performing repetition among all trials.
Table 4 P-values comparing CSVM to other models (all spectrum).
Table 5 P-values comparing CSVM to other models (restricted spectrum).

Figs. 3 and 4 show the confusion matrices and ROC curves of the best repetition for the full-spectrum and range-restricted classifications, respectively. The confusion matrix provides a detailed breakdown of the model’s predictions versus the actual labels: it shows how many samples from each actual category were predicted correctly or incorrectly. This helps to identify where the model makes most of its errors and is the basis for computing the performance metrics. The ROC (Receiver Operating Characteristic) curve plots the true positive rate (Recall) against the false positive rate (1 - Specificity) for different threshold values, and the area under the ROC curve (AUC) provides a single measure of the model’s ability to distinguish between classes, with a higher AUC indicating better overall performance. In the figures, the ROC curve is plotted for each class, and the AUC values range from 79.89% to 92.19%, which indicates good overall performance. Overall, the performance metrics indicate that while the model is fairly good at making correct predictions, there is still significant room for improvement, especially in precision, recall, and overall accuracy.

The CSVM confusion matrices are dominated by their diagonal entries, yet a more granular error analysis reveals specific areas of confusion. In the following, the classes are numbered from 1 to 5 in order of increasing concentration, from no CRP to \(10^{-1}\, \upmu\)g/ml. In the restricted range (400 nm onwards), the model performs best for Classes 1 and 5, while the most frequent misclassifications occur between Classes 2 and 3 and between Classes 4 and 5, suggesting overlapping spectral features within this limited range. Expanding to the full spectral range improves classification accuracy across most classes, particularly reducing confusion between Classes 2 and 3. This indicates that critical discriminative spectral bands likely lie outside the restricted range, that is, below 400 nm in the ultraviolet region.
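A per-class AUC of this kind can be computed in a one-vs-rest manner directly from classifier scores. The sketch below uses the rank-based formulation, in which the AUC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function names are hypothetical, and the one-vs-rest treatment is an assumption about how the per-class curves were obtained.

```python
def binary_auc(scores, is_positive):
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("both positive and negative samples are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(class_scores, labels, n_classes):
    # class_scores[i][k] is the score of sample i for class k.
    return [binary_auc([row[k] for row in class_scores],
                       [y == k for y in labels])
            for k in range(n_classes)]
```

This pairwise form is quadratic in the number of samples, which is acceptable for small datasets; sorting-based implementations give the same value more efficiently.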

Fig. 3 Confusion matrix and ROC curve of best repetition of CSVM of Table 1.

Fig. 4 Confusion matrix and ROC curve of best repetition of CSVM of Table 1.

Additionally, the confusion matrices for both the whole and restricted spectral ranges provide insights into the model’s classification performance across the five CRP concentration classes. For the whole spectral range, the classes representing no CRP and the lowest concentration (\(10^{-4}\) \(\upmu\)g/ml) show relatively high correct classification rates, with 128 and 107 samples correctly identified, respectively. However, misclassifications frequently occur between adjacent classes, such as between \(10^{-4}\) and \(10^{-3}\) \(\upmu\)g/ml, indicating challenges in distinguishing closely spaced CRP levels. A similar trend is observed in the restricted spectral range, where the number of correct classifications improves slightly for certain classes. Notably, the number of correctly classified samples with no CRP and with the highest concentration (\(10^{-1}\) \(\upmu\)g/ml) increases to 136 and 108, respectively. The number of misclassifications between some neighboring classes decreases modestly, suggesting that focusing on a targeted spectral range can help the model to differentiate between certain close concentration levels. These findings highlight that the most frequent errors occur between adjacent CRP classes, reflecting inherent spectral similarities in these concentration ranges. This error pattern underscores the need for further model refinement and targeted sensor development to improve discrimination at these critical boundaries. Incorporating such improvements could enhance predictive accuracy and robustness in real-world monitoring applications.
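One simple way to quantify this pattern is the share of all misclassifications that fall on neighboring classes. The sketch below assumes that the classes are ordered by concentration and uses a made-up confusion matrix; it does not reproduce the matrices in Figs. 3 and 4.

```python
def adjacent_error_share(cm):
    # cm rows = actual classes, columns = predicted classes, both ordered
    # by increasing concentration.
    n = len(cm)
    errors = sum(cm[i][j] for i in range(n) for j in range(n) if i != j)
    adjacent = sum(cm[i][j] for i in range(n) for j in range(n)
                   if abs(i - j) == 1)
    return adjacent / errors if errors else 0.0

# Made-up 3-class matrix for illustration only.
example_cm = [[8, 1, 1],
              [1, 7, 2],
              [2, 1, 7]]
share = adjacent_error_share(example_cm)
```

A share close to 1 means that almost all errors are confusions between neighboring concentration levels rather than between distant ones.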

While multiple ML models were evaluated for comparative purposes, the performance differences among them were generally modest in absolute terms. The CSVM consistently outperformed the other models across most metrics, and the Wilcoxon signed-rank tests across the 30 repetitions (Tables 4 and 5) indicate that this advantage is statistically significant; however, the differences in mean accuracy, precision, recall, F1 score, and specificity remain small. In practical terms, the models therefore perform comparably, and statistical significance alone should not be read as a large practical superiority of one model over another. Further investigation using larger datasets or more robust feature representations may be necessary to establish clearer distinctions in model performance.

The findings indicate that utilizing a limited spectral range is a viable option for applications such as sensor-based tools designed to classify CRP concentrations in wastewater, facilitating a deeper understanding of its dynamics. Although there is a slight decrease in accuracy compared to using the full spectral range, the performance metrics remain close, suggesting that the reduced spectrum still contains much of the essential information required for classification. This comparable effectiveness, achieved with fewer spectral features, is especially beneficial in settings with limited resources, like portable or embedded sensor devices, where reducing data collection and processing is critical. Furthermore, the slight decreases in precision, recall, and F1 score represent only a minor compromise in predictive accuracy, which could be acceptable in practical, rapid evaluation contexts. The consistently high specificity ensures that false positive rates stay low, which is crucial for preserving confidence in automated monitoring solutions. As a result, employing the limited spectral range offers a practical and efficient choice for implementation in field-deployable diagnostic systems or mobile platforms, striking a balance between classification performance, operational efficiency, and cost.

The classification performance, particularly the observed accuracy of approximately 65% in the five-class scenario, can be attributed to several factors. First, the intrinsic complexity and variability of the spectral data derived from real wastewater matrices contribute to a non-negligible level of class overlap. This is compounded by the presence of multiple interfering substances and natural fluctuations in wastewater composition, which can affect spectral signatures even in the absence of variation in the target analyte (CRP). Additionally, the limited interclass separability between adjacent CRP concentration ranges may have further constrained the classifier’s discriminative power. Although higher classification accuracy was observed in simplified binary classification tasks that distinguish only between samples with and without CRP, a five-class framework was adopted to better reflect the continuous and dynamic nature of CRP presence in wastewater environments. The objective of the study was not solely to achieve maximum classification accuracy, but rather to provide a more nuanced view of CRP level variation. The five-class model, despite its moderate accuracy, enables the identification of concentration trends and transitions that may be masked in dichotomous scenarios. It should also be noted that the moderate performance reflects a balance between model complexity and ecological relevance. The use of real wastewater, rather than synthetic or idealized matrices, inherently introduces variability that is both a challenge and a strength. This variability ensures that the model evaluation is more representative of real-world deployment scenarios, though at the cost of reduced classification performance when using conventional ML classifiers.

CRP is primarily recognized as a systemic biomarker of inflammation in clinical diagnostics. Its detection in municipal and hospital wastewater can serve as an indicator of human biological waste and, more broadly, as a proxy for anthropogenic biochemical load. Elevated CRP levels in wastewater have been associated with increased discharge from healthcare facilities or densely populated buildings, thereby providing valuable data for public health surveillance and environmental exposure assessments. While this manuscript does not describe the development of an actual sensor, it outlines a conceptual framework for future deployment of sensitive photonic sensor technologies for in situ CRP monitoring in complex matrices such as raw or partially treated wastewater. Such sensor systems could be integrated into decentralized monitoring points within sewage infrastructure – including influent pipelines, hospital discharge outlets, and community wastewater collection nodes – enabling early warning systems for sanitary contamination and near real-time epidemiological surveillance. Furthermore, the envisioned monitoring system is compatible with existing smart water infrastructure. By coupling sensor outputs with remote telemetry and cloud-based data platforms, the approach could support next-generation digital water networks that combine chemical sensing with health-relevant biomarker tracking. This integration would provide actionable insights for wastewater-based epidemiology (WBE), enhancing environmental health strategies and preparedness efforts, particularly in the context of urban health management and post-pandemic monitoring.

Conclusion

In this study, two classification tasks were conducted to discern various concentrations of CRP in wastewater samples based on their UV–Vis absorption spectra. Across these tasks, the CSVM model, alongside additional ML models used for comparison, underwent rigorous performance assessment, with the top-performing model repetition selected for evaluation based on accuracy. Both tasks consisted of categorizing wastewater samples into five distinct CRP concentration classes ranging from zero to \(10^{-1} \, \upmu\)g/ml, using either the entire broadband UV–Vis absorption spectrum or a limited spectral range from 400 nm onwards, and each was evaluated across 30 repetitions. The top-performing model repetitions achieved accuracies ranging from 64.88% to 65.48%. Specifically, SVMs were found to be the best models for these classification tasks owing to their robustness and effectiveness when working with a limited amount of data. Overall, the results show that ML algorithms can classify wastewater samples by CRP concentration with moderate accuracy, with potential applications in biosensor development and water quality dynamics monitoring.

This manuscript outlines our concept of an expert system using point-of-need optical sensors and machine learning for wastewater surveillance. To develop it into a practical application, the following steps are crucial for a more comprehensive investigation and for future work. From a metrological point of view, one of the most important steps is to evaluate the sensing performance under diverse wastewater conditions, such as varying pH, temperature, and the presence of interfering substances, to ensure real-world robustness. In parallel, a suitable machine learning model will be developed and trained on those data. Next, the resulting prototype will be implemented in a controlled, small-scale wastewater treatment plant or a simulated environment to test its functionality, data acquisition capabilities, and initial performance in a semi-realistic setting. We will gather continuous data over an extended period to assess the system’s stability. Finally, the system will be investigated over larger geographical areas and integrated with existing public health infrastructure.

In the final step, implementing our system in collaboration with public health authorities and wastewater treatment plants would enable the identification of anomalies in wastewater data and the prediction of disease trends as part of a population-level approach that supports real-time monitoring for wastewater-based epidemiology (WBE). Integrating such information would allow public health decision-making to be faster, more accurate, and more efficient, both in terms of epidemiological response and in terms of economic resource allocation, such as more targeted deployment of funding and public health services.

This leads to the concept of an expert system that combines point-of-need optical sensors for wastewater surveillance with ML algorithms to deliver reliable information. Such a system can process significant volumes of data, identify trends and anomalies, and provide actionable insights. For instance, monitoring CRP levels in wastewater can offer evidence of population-wide inflammatory disease dynamics, supporting public health officials in making informed decisions through quick detection and automated classification of the samples from a given area. The system can be tailored to specific needs by incorporating dedicated optical biosensors and adapting the algorithm to the data acquired. Additionally, while the current study relied on default hyperparameters to mitigate overfitting risks due to dataset limitations, incorporating even modest hyperparameter optimization could enhance model accuracy and generalization. Such optimization becomes especially valuable when scaling the system for real-world deployment.