Introduction

Three-dimensional excitation-emission matrix (3D-EEM) fluorescence spectroscopy has been widely applied to characterize fluorescent substances (e.g., dissolved organic matter (DOM), soluble microbial products (SMP), and extracellular polymeric substances (EPS)) originating from environmental samples. Fluorescent components in water samples (e.g., aromatic amino acids, humic acids, and flavins) can be sensitively detected by 3D-EEM fluorescence spectroscopy1,2,3. The fluorescence signals measured by 3D-EEM fluorescence spectroscopy change depending on both the environmental conditions and the characteristics of these fluorescent components4,5,6. Therefore, qualitative and quantitative tracking of fluorescent components in water samples can be realized by 3D-EEM fluorescence spectroscopy7. However, the output of 3D-EEM fluorescence spectroscopy (i.e., 3D-EEM spectra) may be difficult to interpret directly due to abundant noise and overlapping fluorescence signals8,9.

To this end, chemometrics researchers have developed fluorescence decomposition methods, such as parallel factor analysis (PARAFAC)10,11, to disentangle overlapping fluorescence signals and output the corresponding map of each fluorescent component12. However, due to their time-consuming procedures and strict sample requirements, these methods cannot be embedded directly into online monitoring systems4,11. Our previous work proposed a fast fluorescent identification network (FFI-Net) based on the convolutional neural network (CNN)4. The trained FFI-Net could classify the number of fluorescent components in a single 3D-EEM spectrum and predict all maps of these fluorescent components within a few seconds4, which is essential for online analysis.

Unfortunately, unlike natural images in computer vision (e.g., car or bird images), the overlapping fluorescence signals in 3D-EEM spectra cannot be analyzed directly by visual inspection8. Therefore, although CNN classifiers for classifying the number of fluorescent components in 3D-EEM spectra have shown robust performance4, it remains unknown whether these classifiers make correct classifications based on the correct regions of 3D-EEM spectra (e.g., regions of fluorescent peaks). Moreover, because of the “black box” nature of deep learning models13, we do not understand why the CNN classifiers can distinguish among different 3D-EEM spectra. A CNN classifier cannot say “I don’t know” in ambiguous situations and instead returns the class with the highest probability14. Therefore, the reliability of CNN classifiers for analyzing 3D-EEM spectra should be further investigated.

The interpretability and explainability of CNN models have received great attention in recent years15,16,17. Many CNN explanation methods that visualize convolutional layers or perturb input images have been developed. For example, the gradient-weighted class activation mapping (Grad-CAM) method has been utilized to produce visual explanations of decisions made by CNNs18. Ribeiro et al.19 proposed the local interpretable model-agnostic explanations (LIME) method to build an interpretable surrogate model that is locally faithful to the CNN classifier. Moreover, Shitole et al.20 found that images may have multiple relatively localized explanations and established structured attention graphs (SAGs) to visualize how different combinations of image regions impact the confidence of a classifier. Meanwhile, strategies for improving the accuracy of CNNs based on the attention mechanism have been developed21,22,23. For instance, Woo et al.21 proposed the Convolutional Block Attention Module (CBAM), which can be integrated into any CNN architecture to emphasize meaningful features along two principal dimensions (i.e., channel and spatial axes). By integrating the CBAM, the performance of CNNs on multiple object-detection benchmarks was greatly improved. However, given the great differences between 3D-EEM spectra and the natural images (e.g., car or bird images) used in the above studies, the performance of these methods for interpreting and improving CNN classifiers for 3D-EEM spectra needs to be reevaluated.

Therefore, this work investigates the reliability of CNN classifiers for analyzing 3D-EEM spectra. The raw CNN classifiers are first interpreted by the Grad-CAM and SAGs methods. Then, strategies for refocusing and improving the attention of CNN classifiers, including modifying the data quality and the CNN structure, are proposed. The classification results of the raw and modified CNN classifiers are compared to highlight the importance of improving CNN classifiers for analyzing 3D-EEM spectra. To the best of our knowledge, this is the first work to interpret CNN classifiers designed for classifying the number of fluorescent components in 3D-EEM spectra. This work reveals and corrects the misleading attention of CNN classifiers on 3D-EEM spectra, and it will help formulate strategies for developing deep learning models for analyzing water samples and make these models more acceptable to non-deep-learning users.

Results

High accuracy and misleading attention of raw CNN classifier

The raw CNN classifier was first trained by the raw 3D-EEM spectra containing scatter peaks. The training loss of the raw CNN classifier decreased significantly to 0.092 ± 0.013 after 100 training epochs (Supplementary Fig. 1). Meanwhile, the test accuracy of the raw CNN classifier trained by the raw data reached 89.8 ± 2.5% (Supplementary Fig. 2). The raw CNN classifier trained by the raw data thus appeared to achieve acceptable classification accuracy (Table 1).

Table 1 The training loss and test accuracy of different CNN classifiers trained by different 3D-EEM spectra

However, the visual explanations from Grad-CAM and guided Grad-CAM for the trained CNN classifier revealed that misleading attention governed its classification of the 3D-EEM spectra (Fig. 1). The heatmaps for the first convolutional layer (Conv1) showed high consistency with the color maps, implying that Conv1 attended to all values in the 3D-EEM spectra. In the following convolutional layers (Conv2–5), the heatmaps gradually concentrated in a smaller area, which supports the classification decision of the CNN classifier18. These phenomena were consistent with CNN visualization results for natural images (e.g., car or bird images), where more concrete features are extracted by the last convolutional layer of the CNN classifier24,25.

Fig. 1: The gradient-weighted class activation mapping (Grad-CAM) and guided Grad-CAM interpretation of the CNN classifier trained by the raw data.

a Class 0: a 3D-EEM spectrum containing three fluorescent components; b Class 1: a 3D-EEM spectrum containing four fluorescent components; c Class 2: a 3D-EEM spectrum containing five fluorescent components. The three 3D-EEM spectra containing 3–5 components were randomly selected from the training dataset.

As shown in Fig. 1a, b, the CNN classifier trained by the raw data highlighted regions outside the fluorescent peaks in the 3D-EEM spectra. Moreover, the Rayleigh scatterings in the 3D-EEM spectra drew the most attention from the CNN classifier. Similarly, a heatmap shift from the fluorescent peaks to the Rayleigh scattering was observed in Fig. 1c. The guided Grad-CAM further provided fine-grained importance maps for the classified images (Fig. 1)18 and revealed the misleading attention of Conv2–5 on the 3D-EEM spectra. These results demonstrate that the accurate classification of the CNN classifier may originate from misleading attention on the 3D-EEM spectra.

To quantify this phenomenon, all 3D-EEM spectra were analyzed at Conv5 by the Grad-CAM method, and mathematical indices including Correct Accuracyi, Misleading Accuracyi, and Correct Attentioni were calculated by Eqs. (1)–(3). The very low Correct Accuracyi (3.8–32.3%) and Correct Attentioni (3.8–32.3%) proved that the raw CNN classifier trained by the raw data mainly classified the 3D-EEM spectra according to regions outside the fluorescent peaks, which significantly impaired the reliability of the CNN classifier (Table 2). Meanwhile, the 3D-EEM spectra in Class 2 received the highest Misleading Accuracyi (96.2%), which may be due to the highest complexity of the five fluorescent components in these 3D-EEM spectra4.

Table 2 The model accuracy and attention of different CNN classifiers trained by different 3D-EEM spectra

The classification results of the raw CNN classifier trained by the raw dataset were further analyzed by the SAGs method20, which decomposes a 3D-EEM spectrum into sub-regions and evaluates the effect of removing a particular patch on the classification confidence. As shown in Fig. 2a, the raw CNN classifier trained by the raw dataset made a correct classification according to the patches on the two Rayleigh scatterings. Meanwhile, the true confidence for this 3D-EEM spectrum decreased from 100% to 0% when the patches near the Rayleigh scatterings were removed. To understand the wrong classifications made by the CNN classifier, the SAGs of two wrong classification results are presented in Fig. 2b, c. The region near the Rayleigh scattering and outside the fluorescent peaks supported the wrong classification of five fluorescent components in Fig. 2b. Similarly, removing a particular patch outside the fluorescent peaks decreased the false confidence from 98% to 0% in Fig. 2c. The SAGs results further proved that the CNN classifier may make classifications based on misleading attention to the 3D-EEM spectra.

Fig. 2: The structured attention graphs (SAGs) of classification results from the raw CNN classifier trained by the raw data.

a A 3D-EEM spectrum containing four fluorescent components was classified as four fluorescent components; b a 3D-EEM spectrum containing three fluorescent components was classified as five fluorescent components; c a 3D-EEM spectrum containing three fluorescent components was classified as five fluorescent components. The three 3D-EEM spectra were randomly selected from the training dataset.

Improving CNN attention by modified 3D-EEM spectra

The misleading attention of the CNN classifier on the Rayleigh scatterings was observed in the former section (Fig. 1). To solve this issue, the Rayleigh scatterings were removed from the 3D-EEM spectra (called the cut data), and the raw CNN classifier was then trained by the cut data. The training loss of the CNN classifier decreased to 0.116 ± 0.019 (Supplementary Fig. 3), and the test accuracy increased from 89.8 ± 2.5% to 91.3 ± 1.2% (Supplementary Fig. 4), indicating that the raw CNN classifier trained by the cut data also achieved acceptable performance (Table 1). The heatmaps of Grad-CAM and guided Grad-CAM both highlighted the fluorescent peaks in the same 3D-EEM spectra (Fig. 3) that were tested with the raw data (Fig. 1). Meanwhile, the total Correct Accuracyi and Correct Attentioni of the raw CNN classifier trained by the cut data increased from 16.8% and 17.6% to 33.7% and 36.4%, respectively (Table 2). A significant decrease in Misleading Accuracyi for Class 0 and Class 2 was observed, implying that the raw CNN classifier learned more features of the cut 3D-EEM spectra in these classes. Moreover, the 3D-EEM spectra without Rayleigh scatterings showed clearer fluorescent peaks (Fig. 3) than the raw 3D-EEM spectra (Fig. 1), because Rayleigh scattering normally produces very strong signals that affect the normalization and transformation of 3D-EEM spectra. For the same reason, the fluorescence data in the region of Rayleigh scattering are set to zero during the PARAFAC analysis procedure10. Consequently, the number of fluorescent components classified by the trained CNN classifier should depend on the fluorescent peaks rather than on information provided by the Rayleigh scattering. By removing the strong misleading signatures of Rayleigh scattering, the raw CNN classifier showed both higher accuracy and more focused attention.

Fig. 3: The gradient-weighted class activation mapping (Grad-CAM) and guided Grad-CAM interpretation of the raw CNN classifier trained by the cut data.

a Class 0: a 3D-EEM spectrum containing three fluorescent components; b Class 1: a 3D-EEM spectrum containing four fluorescent components; c Class 2: a 3D-EEM spectrum containing five fluorescent components. The three 3D-EEM spectra containing 3–5 components were the same as those utilized in Fig. 1.

The SAGs of two correct classification results showed that combinations of several patches on the fluorescent peaks supported the correct classifications (Fig. 4). Removing a patch outside the fluorescent peaks did not reduce the confidence of the CNN classifier (Fig. 4a), whereas removing patches near the fluorescent peaks significantly reduced the classification confidence (Fig. 4b). Overall, removing the Rayleigh scatterings from the 3D-EEM spectra successfully refocused the attention of the CNN classifier from the scatterings to the fluorescent peaks to some extent (Table 2).

Fig. 4: The structured attention graphs (SAGs) of correct classification results from the raw CNN classifier trained by the cut data.

a A 3D-EEM spectrum containing three fluorescent components; b a 3D-EEM spectrum containing five fluorescent components.

Improving CNN attention by integrating CBAM

Although removing the Rayleigh scatterings from the 3D-EEM spectra optimized CNN attention to some extent, the total Correct Accuracyi and Correct Attentioni of the CNN classifier were still unsatisfactory. To further improve CNN attention on the key regions (i.e., fluorescent peaks) of 3D-EEM spectra, the CBAM was embedded into the CNN classifier. The CNN classifier with CBAM trained by the cut data also achieved acceptable training loss (0.103 ± 0.011) (Supplementary Fig. 5) and test accuracy (91.2 ± 1.2%) (Supplementary Fig. 6). According to the Grad-CAM results, the CNN classifier with CBAM possessed much higher total Correct Accuracyi (55.5%) and Correct Attentioni (57.2%) (Table 2). These results demonstrate that the spatial and channel-wise attention in CBAM was useful for improving CNN attention on the 3D-EEM spectra. As a result, the CBAM-enhanced CNN classifier focused on the target fluorescence regions more properly than the raw CNN classifiers.

The attention mechanism provided by the CBAM not only distinguishes important regions but also improves the representation of interest26,27,28. Accordingly, CBAM applied to natural images has been shown to cover target regions better than the original CNN21. Compared with natural images, 3D-EEM spectra are markedly different: overlapping signals cover the fluorescent peaks, and users cannot observe clear semantic meanings. For classifying the number of fluorescent components, the CBAM refocused the CNN attention from meaningless regions to the whole fluorescence regions, increasing the reliability of the CNN classifier on this task.

Discussion

Although deep learning methods for water samples have been increasingly investigated and applied29,30,31, most studies considered only model accuracy (e.g., classification accuracy) and ignored the risk behind the “black box”. For example, high classification accuracies were obtained for microbeads in wastewater (89%)29 and for the morphology of activated sludge (95%)30, whereas no evidence ensured that the CNN classifiers in those studies truly extracted the features of the microbeads and sludge morphology from the training images. The problem of misleading attention found in this study may occur in other CNN models for image-like data with ambiguous semantics, especially in environmental fields29,30,32,33,34. Therefore, this study focuses on analyzing 3D-EEM spectra of water samples through interpretable CNN classifiers. The misleading attention of CNN classifiers for analyzing 3D-EEM spectra was identified through the Grad-CAM and SAGs methods. This misleading attention may originate from the features of 3D-EEM data: unlike natural images with clear semantics, the overlapping fluorescence data in 3D-EEM spectra cannot be examined by the naked eye. Similar to our previous study4, Yu et al.31 designed a deep convolutional autoencoder to extract feature maps from 3D-EEM spectra; however, the feature maps generated by the autoencoder were not validated by interpretation techniques. Xie et al.35 collected 3D-EEM spectra (351 × 21) of oil samples to establish CNN classification models for oil species. The size of 3D-EEM spectra may affect the attention of a CNN, but they did not examine the attention of the trained CNN models. Yan et al.36 applied interpretable methods to show the 3D-EEM feature maps of a CNN classifier for classifying the storage year of Ningxia wolfberry samples. They suggested that the CNN attention observed on the periphery of the fluorescent peaks was caused by the more important classification contributions of weak fluorescent components for the storage year. This is a possible explanation, given that their classification target (storage year) differed from ours (number of fluorescent components). However, the high classification accuracy in their study may also have originated from misleading attention of the CNN classifiers, and the data-improvement and attention-mechanism strategies proposed here could also be utilized to improve the attention of CNN classifiers for the storage year of Ningxia wolfberry samples.

The combination of 3D-EEM spectra and decomposition methods (e.g., the PARAFAC method) has been widely applied in many environmental fields. For example, the decomposed 3D-EEM spectra can be utilized as surrogate parameters to monitor the fate of environmental substances, including different organic compounds in wastewater-affected water37, polycyclic aromatic hydrocarbon fractions in combustion particulate matter38, and DOM in surface water and groundwater5,39. Moreover, the decomposed 3D-EEM spectra can represent key indices in water industries, such as monitoring DOM and disinfection byproduct precursors in drinking water treatment processes40,41 and monitoring microbial activities in wastewater treatment processes42,43. Although 3D-EEM fluorescence spectroscopy is more sensitive, more time-efficient, and less expensive than traditional chromatographic methods38, the decomposition methods for 3D-EEM suffer from time-consuming procedures and strict data requirements10, limiting the online monitoring and analysis of 3D-EEM4. The FFI-Net developed in our previous study4 is a promising approach for decomposing the overlapping signals directly and replacing the time-consuming decomposition methods10, but its interpretability and acceptability should be improved. To this end, this study proposed a strategy combining data improvement and attention mechanisms to alleviate the misleading attention of CNN classifiers on 3D-EEM spectra. This study substantially improves the accuracy and reliability of deep learning methods applied to the fast analysis of the 3D-EEM spectra4 of water samples in different fields.

Meanwhile, there are still some limitations in this study. Only a total Correct Accuracyi of 55.5% was achieved by the CNN classifier with CBAM, which means that a great enhancement of model attention may be realized by further improving the data and model structures. On one hand, transforming 3D-EEM data from arrays (0–9999) to grey images (0–255) causes a loss of information. A more appropriate image form for 3D-EEM spectra or advanced methods for improving grey images may improve the model attention. For example, Shi et al.44 utilized the morphological greyscale reconstruction method to pre-enhance the locations of fluorescent peaks in the grey images of 3D-EEM spectra. On the other hand, the attention mechanism has been embedded in many novel model structures, such as the recurrent neural network27 and the transformer26. Therefore, a more elegant model structure coupled with a strong attention module may further improve the total Correct Accuracyi of the classification task for 3D-EEM spectra.

Moreover, the classification labels (i.e., the number of fluorescent components in each 3D-EEM spectrum) of the wastewater samples, determined by the PARAFAC method, may not represent the true numbers of fluorescent components in the 3D-EEM spectra due to the limitations of this method. The PARAFAC method, as a superposition model, assumes that all chromophores within the mixture absorb and emit light independently45. However, charge-transfer interactions between chromophores (e.g., humic-like components) in wastewater samples can alter emission properties and impact the PARAFAC calculations45,46, and evaluating and verifying this assumption remains difficult for datasets containing wastewater samples46. As a result, wrong numbers of fluorescent components may pass the model validation of PARAFAC and generate wrong classification labels, which reduces the quality of the training dataset. As a data-driven algorithm, the accuracy and attention of CNN models highly depend on the quality of the training dataset. The training dataset prepared by PARAFAC in this study may therefore be improved by using more advanced 3D-EEM analysis methods (e.g., the parallel factor framework-clustering analysis (PFFCA)47 and the three-direction resection alternating trilinear decomposition (TDR-ATLD) algorithm48) in future applications. Overall, the results of this study may have important implications for the online monitoring and analysis of environmental substances through 3D-EEM spectra. Importantly, this work provides strategies for further improving CNN classifiers for 3D-EEM spectra collected from different water fields, making them more robust and acceptable.

Methods

Dataset of 3D-EEM spectra

The 3D-EEM spectra collected from SMP and EPS in biological wastewater treatment systems were used for model development. The water samples of SMP and EPS mainly consist of substances rich in microbial products, such as proteins and humic acids, which can be detected by fluorescence spectroscopy. Due to the high complexity of biological wastewater treatment systems, the fluorescence peaks in 3D-EEM spectra collected from SMP and EPS commonly overlap3,49. Therefore, we chose these 3D-EEM spectra as the classification targets of the CNN classifiers. The SMP samples of anaerobic digestion sludge, anammox sludge, and aerobic sludge were collected from the supernatant in the reactors and filtered through a 0.45 μm membrane before measurement. The EPS samples of anaerobic digestion sludge, anammox sludge, and aerobic sludge were extracted with the cation exchange resin (CER, Amberlite 732, sodium form) method described by Frølund et al.50.

The 3D-EEM spectra of the samples were obtained with a fluorescence spectrometer F-7000 (Hitachi Co., Japan). The excitation (Ex) and emission (Em) wavelengths both ranged from 200 to 600 nm at 5 nm intervals. The excitation and emission slits were both maintained at 5 nm, and the scanning speed was 30,000 nm/min. All 3D-EEM spectra were preprocessed to a unified format (Ex = 200–450 nm, Ex interval = 5 nm, Em = 250–500 nm, Em interval = 5 nm) before being transformed into 3D-EEM images.

The collected 3D-EEM spectra were first analyzed by the PARAFAC method to provide the classification labels (i.e., 3, 4, and 5 fluorescent components denoted as Class 0, 1, and 2, respectively). Then, the raw 3D-EEM spectra (excitation/emission wavelengths of 200–450 nm/250–500 nm at 5 nm intervals) were normalized to 0–255 and transformed into an image format (51 × 51 pixels, PNG files) to form the input images.

In our previous study, the FFI-Net achieved acceptable classification accuracy with a 3D-EEM dataset containing Rayleigh scatterings4. Therefore, the raw 3D-EEM spectra containing Rayleigh scatterings (called the raw data) were first used to form a training dataset. Then, to eliminate the impact of Rayleigh scatterings on the attention of CNN classifiers, the Rayleigh scatterings were removed before transforming the 3D-EEM spectra into input images (called the cut data).
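As an illustration, the normalization and Rayleigh-scattering removal described above can be sketched in Python (NumPy and Pillow) as follows, assuming the unified 51 × 51 Ex/Em grid; the scatter band width and file names are illustrative assumptions rather than the exact settings of this study.

```python
# A minimal preprocessing sketch. Band width (15 nm) and file names are
# assumptions for illustration, not the exact values used in this study.
import numpy as np
from PIL import Image

EX = np.arange(200, 455, 5)   # 51 excitation wavelengths (nm)
EM = np.arange(250, 505, 5)   # 51 emission wavelengths (nm)

def cut_rayleigh(eem, band=15):
    """Zero out first- and second-order Rayleigh scattering bands."""
    eem = eem.copy()
    for i, ex in enumerate(EX):
        for j, em in enumerate(EM):
            if abs(em - ex) <= band or abs(em - 2 * ex) <= band:
                eem[i, j] = 0.0
    return eem

def to_png(eem, path):
    """Normalize a (51, 51) EEM array to 0-255 and save as a greyscale PNG."""
    norm = 255.0 * (eem - eem.min()) / (eem.max() - eem.min() + 1e-12)
    Image.fromarray(norm.astype(np.uint8), mode="L").save(path)

eem = np.random.rand(51, 51) * 9999          # placeholder for a measured spectrum
to_png(eem, "raw_sample.png")                # "raw data"
to_png(cut_rayleigh(eem), "cut_sample.png")  # "cut data"
```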

The numbers of 3D-EEM spectra with 3, 4, and 5 fluorescent components reached 422, 198, and 266, respectively. This imbalance across classification labels may cause implicit bias. To reduce it, the 3D-EEM spectra with 4 and 5 fluorescent components were duplicated once, so that the final classification dataset contained 422, 396, and 532 3D-EEM spectra for the three labels, respectively (1350 samples in total).
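The duplication step can be sketched as follows, assuming a hypothetical `dataset` list of (image tensor, label) pairs built from the PNG files.

```python
# Duplicate the under-represented classes once, as described above.
# `dataset` is an assumed list of (image, label) pairs.
balanced = list(dataset)
for image, label in dataset:
    if label in (1, 2):              # Class 1 (4 components), Class 2 (5 components)
        balanced.append((image, label))
# 422 + 2 * 198 + 2 * 266 = 1350 samples in total
```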

Model development of CNN classifiers

The raw CNN classifier had a structure similar to the well-known AlexNet (Fig. 5a)51. Briefly, the CNN classifier contained five convolutional layers (Conv1–Conv5) and three max-pooling layers, which extracted features from the 3D-EEM spectra. Then, two fully connected layers received the flattened feature map and passed the information to the output layer, which used a softmax function for the three classification labels (i.e., 3, 4, and 5 fluorescent components denoted as Class 0, 1, and 2, respectively). To prevent overfitting, dropout was applied between the last max-pooling layer and the first fully connected layer15. The rectified linear unit (ReLU) was used as the activation function of all convolutional layers and the two fully connected layers. The optimizer of the CNN classifier was Adam52 (a common optimizer in deep learning) with a learning rate of 0.0001, β1 of 0.9, β2 of 0.999, and epsilon of 1 × 10−8.
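A minimal PyTorch sketch of such a classifier is given below. The layer counts, activations, dropout placement, and optimizer settings follow the description above, whereas the channel widths and kernel sizes follow standard AlexNet and are assumptions.

```python
# A sketch of the AlexNet-like raw CNN classifier, assuming standard AlexNet
# channel widths; the input is a 1-channel 224 x 224 greyscale 3D-EEM image.
import torch
import torch.nn as nn

class EEMClassifier(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=11, stride=4, padding=2),   # Conv1
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),            # Conv2
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),           # Conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),           # Conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),           # Conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),                 # dropout after the last pooling layer
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, n_classes),        # softmax is applied via the loss
        )

    def forward(self, x):
        x = self.features(x)                   # (N, 256, 6, 6) for 224 x 224 input
        return self.classifier(torch.flatten(x, 1))

model = EEMClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
```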

Fig. 5: The diagram of model structures and interpretation methods.

a The model structure of the raw CNN classifier contained five convolutional layers (Conv1–Conv5), three max-pooling layers, two fully connected (FC) layers, and one output layer. b Three convolutional block attention modules (CBAM) were embedded into the raw CNN classifier. Two types of input samples (i.e., 3D-EEM spectra with and without Rayleigh scattering) were utilized. c The gradient-weighted class activation mapping (Grad-CAM), guided Grad-CAM, and structured attention graphs (SAGs) methods were utilized to interpret the CNN classifiers.

To improve the attention of the CNN classifier, the CBAM, with both spatial and channel-wise attention, was embedded into the CNN classifier (Fig. 5b)21. The CBAM was chosen because it is a lightweight and general module that can be integrated seamlessly into any CNN architecture with negligible overhead and trained end-to-end along with the base CNN21. The channel attention module and the spatial attention module were utilized to exploit the inter-channel and inter-spatial relationships of features in the 3D-EEM spectra, respectively. The channel attention module, containing both average-pooling and max-pooling layers, focuses on what is meaningful in the 3D-EEM spectra. In contrast, the spatial attention module focuses on where the informative regions are, which is complementary to channel attention. The average-pooling and max-pooling operations in the spatial attention module are applied along the channel axis to generate an efficient feature descriptor21.
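A compact PyTorch sketch of such a module, following the original CBAM formulation21, is shown below; the reduction ratio and the 7 × 7 kernel are the standard defaults from that work, not values confirmed by this study.

```python
# A sketch of CBAM: channel attention (shared MLP over average- and
# max-pooled features) followed by spatial attention (7x7 conv over
# channel-wise average and max maps).
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        n, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                  # (N, C)
        mx = self.mlp(x.amax(dim=(2, 3)))                   # (N, C)
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)    # channel attention
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)  # (N, 2, H, W)
        return x * torch.sigmoid(self.conv(s))              # spatial attention
```

In the modified classifier, three such modules were embedded into the network (Fig. 5b).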

The 3D-EEM spectra were resized to 224 × 224 pixels before entering the CNN classifier. The dataset was randomly divided into a training dataset (80%) and a test dataset (20%) for each class. The performance of the CNN classifiers was evaluated by the mean cross-entropy loss and the mean classification accuracy: the training loss reflected the training performance of the CNN classifiers, whereas the test accuracy showed the classification accuracy of the trained CNN classifiers on unseen data.
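The training procedure can be sketched as follows, reusing the `model` and `optimizer` from the classifier sketch above and the `balanced` list from the preprocessing sketches; the batch size and the non-stratified split are illustrative simplifications (the study split 80/20 within each class).

```python
# A training-loop sketch under the settings described above; the batch size
# and the non-stratified split are illustrative assumptions.
import torch.nn as nn
from torch.utils.data import DataLoader, random_split

n_test = int(0.2 * len(balanced))
train_set, test_set = random_split(balanced, [len(balanced) - n_test, n_test])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

criterion = nn.CrossEntropyLoss()            # mean cross-entropy loss
for epoch in range(100):                     # 100 training epochs
    for images, labels in train_loader:      # images resized to 224 x 224
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```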

Furthermore, to evaluate whether the CNN classifiers focused on the fluorescence regions in the 3D-EEM spectra, mathematical indices including Correct Accuracyi, Misleading Accuracyi, and Correct Attentioni were proposed:

$${\rm{Correct}}\,{{\rm{Accuracy}}}_{i}=\frac{{{\rm{CC}}}_{i}}{{{\rm{CC}}}_{i}+{{\rm{CM}}}_{i}+{{\rm{WC}}}_{i}+{{\rm{WM}}}_{i}}$$
(1)
$${\rm{Misleading}}\,{{\rm{Accuracy}}}_{i}=\frac{{{\rm{CM}}}_{i}}{{{\rm{CC}}}_{i}+{{\rm{CM}}}_{i}+{{\rm{WC}}}_{i}+{{\rm{WM}}}_{i}}$$
(2)
$${\rm{Correct}}\,{{\rm{Attention}}}_{i}=\frac{{{\rm{CC}}}_{i}+{{\rm{WC}}}_{i}}{{{\rm{CC}}}_{i}+{{\rm{CM}}}_{i}+{{\rm{WC}}}_{i}+{{\rm{WM}}}_{i}}$$
(3)

where i ∈ {0, 1, 2} represents Class 0 (3D-EEM spectra containing three fluorescent components), Class 1 (3D-EEM spectra containing four fluorescent components), and Class 2 (3D-EEM spectra containing five fluorescent components). The classification outcomes of the CNN classifiers were divided into four types: (I) correct classification with correct attention (CC); (II) correct classification with misleading attention (CM); (III) wrong classification with correct attention (WC); and (IV) wrong classification with misleading attention (WM). In this way, \({\text{Correct Accuracy}}_{i}\) and \({\text{Misleading Accuracy}}_{i}\) represent the fractions of 3D-EEM spectra correctly classified by the CNN classifier with correct and misleading attention, respectively, and \({\text{Correct Attention}}_{i}\) represents the fraction of all 3D-EEM spectra on which the CNN classifier showed correct attention. All raw data and cut data were analyzed at Conv5 by the Grad-CAM method to measure these indices. The final performance results for each CNN classifier are the averages of six independent runs.
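As a sketch, Eqs. (1)–(3) reduce to the following function of the four outcome counts for a given class i:

```python
# Eqs. (1)-(3) for one class i, given the four outcome counts obtained from
# the Grad-CAM screening described above.
def attention_indices(cc, cm, wc, wm):
    total = cc + cm + wc + wm
    return {
        "Correct Accuracy": cc / total,           # Eq. (1)
        "Misleading Accuracy": cm / total,        # Eq. (2)
        "Correct Attention": (cc + wc) / total,   # Eq. (3)
    }
```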

Interpretation methods for CNN classifiers

The Grad-CAM and guided Grad-CAM methods were utilized to interpret the CNN classifiers by visualizing the attention of the convolutional layers on the 3D-EEM spectra (Fig. 5c)18. Grad-CAM produces a coarse localization map highlighting the important regions in a 3D-EEM spectrum. The guided Grad-CAM method further combines the Grad-CAM visualizations with guided backpropagation via point-wise multiplication. Compared with the Grad-CAM method, guided Grad-CAM displays high-resolution, highly class-discriminative, and more detailed features of the 3D-EEM spectra18.
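A minimal Grad-CAM sketch for the Conv5 feature maps, implemented with PyTorch forward/backward hooks, is shown below; the layer indexing assumes the `EEMClassifier` sketch above and is not the exact implementation used in this study.

```python
# A minimal Grad-CAM sketch: gradients of the class score are global-average
# pooled into weights for the Conv5 feature maps (Selvaraju et al.).
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, layer):
    model.eval()
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(image.unsqueeze(0))[0, target_class]  # class score before softmax
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1))       # weighted feature maps
    return F.interpolate(cam.unsqueeze(0), size=image.shape[-2:],
                         mode="bilinear", align_corners=False)[0, 0]

# Conv5 is model.features[10] in the classifier sketch above:
# heatmap = grad_cam(model, image, target_class=2, layer=model.features[10])
```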

To further interpret the influence of 3D-EEM spectra on the confidence of the CNN classifier, the SAGs method was utilized to visualize how different combinations of image regions (called conjunctions) impact the classifier confidence20. The SAGs method combines multiple saliency maps for a single 3D-EEM spectrum to illustrate multiple different minimal perturbations that change the model output (i.e., the classification result of the CNN classifier). In this way, the SAGs method helps reveal how different combinations of image regions impact the confidence of the CNN classifier. Each 3D-EEM spectrum was divided into 49 (7 × 7) patches (Fig. 5c) to limit the search space of the beam search algorithm. The minimal sufficient explanation of a 3D-EEM spectrum (i.e., a minimal region in the image that achieves high classifier confidence) was presented at the root node of the SAG. The SAGs of the raw data and the cut data were compared to highlight the influence of the different 3D-EEM spectra.
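The patch-removal step underlying SAGs can be sketched as follows: each of the 7 × 7 patches is masked in turn and the resulting confidence drop is recorded. The full SAG beam search over patch conjunctions20 is omitted for brevity, and zero-masking is an assumed removal operator.

```python
# A simplified sketch of the SAG patch-removal step; not the full beam
# search of Shitole et al.
import torch
import torch.nn.functional as F

def patch_confidence_drops(model, image, target_class, grid=7):
    model.eval()
    _, h, w = image.shape                      # (1, 224, 224) greyscale input
    ph, pw = h // grid, w // grid              # 32 x 32 pixels per patch
    drops = torch.zeros(grid, grid)
    with torch.no_grad():
        base = F.softmax(model(image.unsqueeze(0)), dim=1)[0, target_class]
        for r in range(grid):
            for c in range(grid):
                masked = image.clone()
                masked[:, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw] = 0
                conf = F.softmax(model(masked.unsqueeze(0)), dim=1)[0, target_class]
                drops[r, c] = base - conf      # confidence lost without this patch
    return drops
```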

It is important to note that 3D-EEM spectra differ from the bird or car images tested with the SAGs method. The latter have clear identification features, whereas the features of 3D-EEM spectra cannot be identified directly by the naked eye. Therefore, sufficient conjunctions for classifying 3D-EEM spectra remain to be discovered. Meanwhile, not all 3D-EEM spectra could be interpreted by the SAGs method due to the generally high confidence on some 3D-EEM spectra. The interpretation results of the raw CNN classifiers and the CNN classifiers with CBAM were compared quantitatively based on the mathematical indices Correct Accuracyi, Misleading Accuracyi, and Correct Attentioni (Eqs. (1)–(3)).