Introduction

Cancer is one of the leading causes of death worldwide1. Colorectal cancer — third by incidence and second by mortality2 — accounts for a significant portion of cancer-related morbidity. Techniques usable during preoperative3, intraoperative4, or postoperative stages (e.g., to detect cancer regions and ensure tumor-free resection margins5) can aid in cancer diagnosis and improve outcomes.

Hyperspectral imaging (HSI) combines imaging and spectroscopy, creating image data containing both spatial information and the spectrum for each pixel, usually extending beyond the wavelengths seen by humans. HSI has shown potential in discriminating between tissue structures by analyzing their spectrum, thereby distinguishing healthy and pathological (e.g., cancerous) tissues in a contactless, radiation-free and non-invasive manner, and has seen increasing adoption in the medical field4.

HSI has been used for cancer detection in humans, for skin6,7, breast8, head and neck9, brain10,11, oral12, and gastric13 cancers, as well as colorectal cancer14,15, among others.

Different approaches exist for extracting diagnostic information from hypercubes: both conventional Machine Learning (ML) models, which rely more heavily on domain-specific knowledge to extract features, and Deep Learning (DL) ones, which can discover structures in high-dimensional data with very little engineering by hand16 and have obtained better classification performance than traditional approaches17. DL models used for HSI cancer detection have been primarily (but not exclusively) convolutional neural networks (CNNs)18,19,20,21, including 3D-CNNs14,22. Other approaches have been used with success as well, for instance semi-supervised Graph Convolutional Networks that leverage unlabeled pixels23.

Cancer detection with HSI and DL is challenging already at the dataset acquisition stage: HSI and ground truth labeling require devices and medical expertise, resulting in smaller datasets than usually found in non-medical domains (e.g., in the order of tens, not thousands, of images).

Datasets can also include class imbalance (e.g., far more healthy pixels than cancerous ones), light or blood reflections, and variations in tissue structure, all presenting challenges in developing accurate and robust models. Blood poses a problem due to its strong absorption and scattering properties, which can distort spectral signatures and reduce the contrast between cancerous and non-cancerous tissues — a previously neglected dimension that our work studies in detail.

Previous studies have explored the impact of pre- and post-processing steps on the supervised classification of tissues in hyperspectral images, highlighting the importance of optimized data processing techniques8,22,24.

This study builds upon these findings by addressing the challenge of optimizing the preprocessing of HSI data for use with 3D-CNNs, contributing in two separate but interrelated ways: (1) a comprehensive exploration of the effects of various preprocessing approaches (including noise reduction by smoothing and blood filtering, the latter especially being overlooked in the literature), and (2) a pipeline enabling systematic testing of preprocessing combinations under conditions where testing all combinations exhaustively is prohibitively difficult, with an emphasis on validity, statistically sound methods, and exclusion of confounding variables.

Various combinations of preprocessing are explored, including scaling, smoothing, filtering blood and light reflection pixels, and different approaches to address dataset imbalance: class weighting and sample weighting (to increase model sensitivity to smaller annotated areas).

Evaluating each combination of parameters on the entire dataset would be unfeasibly time- and resource-intensive; combinations were therefore tested on a smaller, statistically representative subset of the dataset. Reproducibility and reliability are ensured by careful exclusion of confounding variables and by fixing possible points of randomness: controlling weight initialization, dropout, and random seeds, and ensuring consistent data ordering. This ensures robust results and at the same time can serve as a blueprint for other experiments under similar constraints.

Data

The dataset includes HS datacubes from 56 patients, annotated by expert pathologists and surgeons based on histopathological slide comparisons. Each patient’s record contains an HS cube and an associated image (ground truth mask) that displays ground truth labels, shown in Fig. 1. In this study, we focus on binary classification of tissues, specifically identifying non-malignant (purple) and cancerous (yellow) tissues. It is important to note the significant imbalance in the dataset, which contains approximately 10 non-malignant to 1 cancerous pixel, in total 477,670 cancerous pixels and 4,537,769 non-malignant pixels.

Fig. 1

Examples of masks with ground truth labels: purple indicates non-malignant tissue, yellow represents cancerous tissue, and red denotes the margin of cancerous tissue.

The HS image data of resected colorectal tissue was captured within the first five minutes post-resection, using the push-broom TIVITA® Tissue device from Diaspective Vision GmbH, Germany, which was positioned 50 cm away from the tissue and all side lighting was excluded. Each image required approximately ten seconds to capture. The generated datacubes have dimensions of 480 by 640 pixels with 100 spectral bands, spanning a wavelength range from 500 to 1000 nm at 5 nm intervals. We excluded the wavelengths between 500 and 540 nm due to excessive noise, resulting in the use of 92 spectral bands corresponding to the 540–1000 nm range.

Methods

The study was approved by the local ethics committee of the Medical Faculty of the University of Leipzig (026/18-ek) and registered at ClinicalTrials.gov (NCT04230603). Written informed consent, including consent for publication, was obtained from all patients. All methods were performed in accordance with the Declaration of Helsinki.

Ensuring compatibility between combinations

A machine learning pipeline can be viewed as comprising the following independent components: Data, Preprocessing, Model, and Cross-validation.

These components are independent in the sense that, to compare different variations of one component, the remaining components must be held constant. In this study, to isolate the effect of preprocessing, all other factors including data order, model state and cross-validation were held constant. In Table 1 we outline the checklist used to ensure consistency across each component. These prerequisites ensured that the only variable altered was the preprocessing combination. Consequently, if preprocessing A outperformed preprocessing B, we could be confident that the observed differences were solely due to the preprocessing methods, with no confounding variables influencing the results.

Table 1 Checklist for fixing randomness.
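To illustrate the checklist in practice, the following minimal sketch (assuming TensorFlow/Keras, as used in this study; the function name and seed value are illustrative, not the original code) pins the common sources of randomness so that only the preprocessing combination varies between runs:

```python
import os
import random

import numpy as np
import tensorflow as tf

def fix_randomness(seed: int = 42) -> None:
    """Pin all common sources of randomness so that only the
    preprocessing combination varies between runs."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash-based ordering
    random.seed(seed)                         # Python stdlib RNG
    np.random.seed(seed)                      # NumPy RNG (shuffling, sampling)
    tf.random.set_seed(seed)                  # weight initialization, dropout masks

fix_randomness(42)
```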

Sampling a small but representative dataset

First, we needed to create a small representative subset as follows:

  1. Randomly select 1% of data from each patient and each class.

  2. Apply the Kolmogorov–Smirnov test (K-S test, α = 0.05) for every wavelength to confirm that the empirical distribution in the reduced subset does not differ significantly from the corresponding distribution in the full dataset, thereby establishing that the subset is distributionally representative.

  3. Repeat until the distributions of all wavelengths are representative of the whole dataset. In most cases, fewer than 5 repetitions were needed.

Afterwards, for each preprocessing combination, the corresponding preprocessing was performed on the selected small representative prototype. This way we ensured that the model received the same data in the same order; only the preprocessing differed.
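A minimal sketch of this subsampling loop, assuming SciPy's two-sample K-S test and spectra pooled into a single array (the per-patient, per-class stratification of step 1 is omitted for brevity; all names are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

def sample_representative_subset(pixels, fraction=0.01, alpha=0.05,
                                 max_tries=20, seed=42):
    """Draw a random subset and accept it only if, for every wavelength,
    the K-S test finds no significant difference to the full dataset.

    pixels: array of shape (n_pixels, n_bands), one spectrum per row.
    """
    rng = np.random.default_rng(seed)
    n_pixels, n_bands = pixels.shape
    size = int(n_pixels * fraction)
    for _ in range(max_tries):
        idx = rng.choice(n_pixels, size=size, replace=False)
        subset = pixels[idx]
        # One two-sample K-S test per wavelength band.
        if all(ks_2samp(subset[:, b], pixels[:, b]).pvalue >= alpha
               for b in range(n_bands)):
            return idx  # subset is distributionally representative
    raise RuntimeError("No representative subset found in max_tries draws")
```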

Preprocessing

Pipeline

The preprocessing pipeline comprised the following sequential steps:

  1. Reading Hyperspectral Datacube and Ground Truth Mask. The initial step involved importing the hyperspectral data along with the corresponding ground truth mask, which provided reference labels.

  2. Scaling. This step scaled the data values, using Normalization or Standardization methods to ensure uniformity and better convergence of neural networks.

  3. Smoothing. Noise reduction was achieved through smoothing techniques.

  4. Filtering. Filtering processes like blood and light reflection filtering were applied to remove possibly harmful artifacts.

  5. Patch extraction. Finally, for each annotated pixel we extracted 3D patches, which became the input samples for neural networks (see the sketch after this list). Each patch was generated from a labeled pixel along with its neighboring pixels. Patch sizes in this work were 3 and 5, so the input shapes of the samples were (3, 3, 92) and (5, 5, 92).
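The patch-extraction step can be sketched as follows — a simplified version assuming NumPy arrays and reflect-padding at the cube borders; the label codes and border handling are illustrative assumptions, not the original implementation:

```python
import numpy as np

def extract_patches(cube, mask, patch_size=5, labeled=(0, 1)):
    """Extract (patch_size, patch_size, n_bands) patches centred on every
    annotated pixel; cube has shape (480, 640, 92) after band selection."""
    r = patch_size // 2
    # Pad spatially so border pixels also receive full-sized patches.
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    patches, labels = [], []
    for y, x in zip(*np.where(np.isin(mask, labeled))):
        # (y, x) in the original cube maps to (y + r, x + r) in the padded one.
        patches.append(padded[y:y + patch_size, x:x + patch_size, :])
        labels.append(mask[y, x])
    return np.asarray(patches), np.asarray(labels)
```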

Scaling

Scaling is important not only for bringing all features into a consistent range, which enhances convergence, but also for eliminating biases, such as those stemming from unique spectral characteristics of each patient. To address this, we employed feature scaling techniques. Based on previous research22, we selected two algorithms: Normalization (scaling to unit length) and Standardization (Z-score). Figure 2a presents the formulas for these scaling methods and illustrates how each scaling affects the spectra.
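Both scalers can be expressed compactly. The sketch below applies them per pixel spectrum, which is one plausible reading of Fig. 2a; whether scaling is performed per spectrum or per band should be verified against the original formulas for any reimplementation:

```python
import numpy as np

def normalize(spectrum: np.ndarray) -> np.ndarray:
    """Normalization: scale the spectrum to unit (L2) length."""
    return spectrum / np.linalg.norm(spectrum)

def standardize(spectrum: np.ndarray) -> np.ndarray:
    """Standardization (Z-score): zero mean, unit standard deviation."""
    return (spectrum - spectrum.mean()) / spectrum.std()
```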

Fig. 2

Preprocessing techniques. (a) Spectra examples before (left) and after scaling (Normalization in the middle, Standardization on the right). Note the differing scales on the y-axes. (b) Three modes of smoothing: 1D, 2D and 3D, where the purple areas represent the region being subjected to the smoothing process at once for the respective mode (1D, 2D or 3D), while the red areas highlight the window which would be applied pixel by pixel sequentially across the purple area. The datacubes’ axis proportions are not drawn to scale, for clarity.

Smoothing

To reduce noise, we tried 1D (spectral), 2D (spatial), and 3D (both) smoothing on our datacubes as part of preprocessing pipelines. Figure 2b illustrates these different approaches within a hyperspectral cube, where the purple areas represent the region being subjected to the smoothing process at the same time for the respective mode (1D, 2D or 3D), while the red areas highlight the window applied pixel by pixel across the purple area. 1D smoothing is applied along the spectral dimension. 2D smoothing is applied spatially to each spectral band. 3D smoothing involves the simultaneous application of filters across both spatial and spectral dimensions.

Three filters were chosen for smoothing: Median Filter (MF), Gaussian Filter (GF) and Savitzky-Golay Filter (SGF)25, whose use led to a significant improvement in14. MF replaces each pixel value with the median of neighboring values within a specified window size. GF, in contrast, applies a Gaussian function with a specific sigma value to weigh the neighboring spectral values. SGF was applied only with 1D smoothing.

The sigmas used for GF and window sizes used for MF for each dimension are described in Sect. 3.2.6. These values were selected based on a comparative analysis of spectra visualizations before and after the application of smoothing, ensuring the chosen parameters effectively modified the spectra while preserving essential structural details. SGF was applied with a window size of 9 and a polynomial order of 2, as these parameters yielded the best results in the study by Collins et al.14.
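A sketch of the three smoothing modes using SciPy; parameter handling is simplified and the function name is illustrative (the exact window sizes and sigmas tested are those listed in Sect. 3.2.6):

```python
from scipy.ndimage import gaussian_filter, median_filter
from scipy.signal import savgol_filter

def smooth(cube, mode="1D", algorithm="GF", value=1.0):
    """Smooth a (H, W, bands) datacube.

    mode selects the dimensions smoothed at once: '1D' spectral,
    '2D' spatial per band, '3D' spatial and spectral together.
    value is the sigma for GF or the (odd, integer) window size for MF.
    """
    if algorithm == "GF":
        sigma = {"1D": (0, 0, value), "2D": (value, value, 0),
                 "3D": (value, value, value)}[mode]
        return gaussian_filter(cube, sigma=sigma)
    if algorithm == "MF":
        size = {"1D": (1, 1, value), "2D": (value, value, 1),
                "3D": (value, value, value)}[mode]
        return median_filter(cube, size=size)
    if algorithm == "SGF":  # spectral (1D) only, as in this study
        return savgol_filter(cube, window_length=9, polyorder=2, axis=-1)
    raise ValueError(f"Unknown algorithm: {algorithm}")
```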

Filtering

During our research, it became apparent that our models tended to misclassify pixels with reflected light and, to an even higher extent, pixels with blood (Fig. 3a). The likely cause is that blood spectra are similar to cancerous spectra, as shown in Fig. 3b.

Fig. 3

Filtering (a) Left: ground truth, note the blood. Right: prediction map that shows how blood is classified as cancer. Yellow shows cancerous tissues and violet non-malignant ones. (b) Screenshot from TIVITA Suite (Diaspective Vision GmbH, Am Salzhaff-Pepelow). The right picture shows spectra from the circled areas of the left picture, in the same colors. Blue is healthy tissue, yellow — cancerous tissue, red is blood. Note how the blood spectrum is much more similar to the cancerous spectrum than to the non-malignant. (c) Regions detected as reflected light (green) at different “light thresholds”. (d) Regions detected as blood at different “blood thresholds” are shown in cyan. Summary: blood spectra are more similar to cancer than to healthy tissue, explaining the high false‑positive rate and motivating blood‑pixel filtering.

Considering the above observations, we decided to exclude blood and light reflection pixels using the method for background extraction described in26 (Sect. 2.4). The background detection algorithm works by calculating three metrics — A, B, and C — from the hyperspectral image data. Metric A is the mean reflectance value over the wavelength range of 500 to 1000 nm, and metric B is the mean reflectance value within the 510 to 570 nm range. Metric C is a logarithmic function that compares the reflectance within the 650 to 710 nm range to a constant value. A pixel is classified as background if A is less than 0.1, B is larger than 0.7, and C is negative. For clarity, in this work the thresholds for A and B will be referred to as the “light threshold” and the “blood threshold”, respectively. In26, the light threshold was 0.1 and the blood threshold was 0.7. In this work, we tested different values for both thresholds to investigate their effects on training outcomes. Figure 3c and d show how many pixels were identified as light and blood for different threshold values.
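The thresholding on metrics A and B can be sketched as follows. Metric C and its logarithmic form are omitted because the constant it uses is not reproduced here; the band-endpoint handling and the pairing of A with the light threshold and B with the blood threshold follow the naming above and should be checked against26:

```python
import numpy as np

# 100 bands at 5 nm steps starting at 500 nm (endpoint handling assumed).
WAVELENGTHS = np.arange(500, 1000, 5)

def band_slice(lo_nm, hi_nm):
    """Slice of band indices covering the [lo_nm, hi_nm] range."""
    lo, hi = np.searchsorted(WAVELENGTHS, [lo_nm, hi_nm])
    return slice(lo, hi + 1)

def artifact_masks(cube, light_threshold=0.1, blood_threshold=0.7):
    """Flag light-reflection and blood pixels in a (H, W, 100)
    reflectance cube using the A and B metrics described above."""
    A = cube[..., band_slice(500, 1000)].mean(axis=-1)  # metric A
    B = cube[..., band_slice(510, 570)].mean(axis=-1)   # metric B
    light_mask = A < light_threshold
    blood_mask = B > blood_threshold
    return light_mask, blood_mask
```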

Sample and class weights

With 477,670 cancerous and 4,537,769 non-malignant pixels, our dataset is imbalanced. To address this issue, we used class weights during training. Class weights are defined with the following formula:

$$class\_weight_{class\ C}=\frac{total\ number\ of\ pixels}{number\ of\ pixels\ of\ class\ C}$$

Class weights are applied to the loss function: if, for example, cancerous tissue is misclassified, the loss is multiplied by the class weight for cancerous tissue (~ 10 in our case). Hence, the ML model learns to pay more attention to the less represented classes.

In Fig. 1, it can be seen that the cancer area of the far-right patient is significantly bigger than that of the far-left patient. Previous research has established that ML models perform better on patients with bigger cancer areas, which seems reasonable, because patients with bigger cancer areas are better represented. But it is important to pay attention to patients with smaller cancer areas as well, because each patient could contribute to better generalization of the model. Therefore, we included sample weights in our tests. Sample weights are defined as:

$$sample\_weight_{class\ C,\ patient\ P}=\frac{total\ number\ of\ pixels}{number\ of\ pixels\ of\ class\ C\ of\ patient\ P}$$

When sample weights were used, class weights were not: since sample weights are normalized by the total number of pixels, all samples from all patients are already balanced.
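Both weighting schemes follow directly from the formulas above; a minimal sketch (function names are illustrative):

```python
import numpy as np

def class_weights(labels):
    """class_weight_C = total number of pixels / pixels of class C."""
    total = len(labels)
    return {int(c): total / np.sum(labels == c) for c in np.unique(labels)}

def sample_weights(labels, patient_ids):
    """sample_weight_{C,P} = total pixels / pixels of class C of patient P."""
    total = len(labels)
    weights = np.empty(total, dtype=float)
    for p in np.unique(patient_ids):
        for c in np.unique(labels):
            sel = (patient_ids == p) & (labels == c)
            if sel.any():
                weights[sel] = total / sel.sum()
    return weights
```

In Keras, the class-weight dict is passed via class_weight= and the per-sample array via sample_weight= in model.fit; as noted above, only one of the two was used at a time.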

The complete presentation of the tested combinations

In total, 1,584 combinations were tested across both patch sizes 3 and 5 (792 for each patch size). Figure 4 presents the preprocessing options that were permuted, such as Scaling (Normalization and Standardization), Smoothing (1D, 2D, and 3D with Median, Gaussian, and Savitzky-Golay filters), and filtering specific to blood and light characteristics. Smoothing options are parametrized with various window sizes and sigma values. Additionally, Sample Weights can be applied or omitted. This organized approach allows for systematic testing of different combinations to optimize data processing for subsequent analysis.

Fig. 4

Flowchart depicting various preprocessing options for hyperspectral imaging data, including scaling, smoothing with multiple filters and settings, and filtering based on blood and light characteristics, starting from initial patch sizes of 3 and 5.

Training

Inception-based 3D-CNN architecture and cross-validation

Due to the limited dataset size (56 patients only), the decision was made to adopt pixel-wise classification, thereby increasing the number of samples, which is a common practice in medical imaging. This reframed the segmentation problem as an image classification task, for which the Inception model27 is particularly effective. For this reason, an Inception-based 3D-CNN22 was chosen. The initial architecture is depicted in Fig. 5a and incorporates a single block from the Inception architecture (designated by the gray rectangle). The Inception model offers numerous benefits, notably the concurrent utilization of multiple kernel sizes. This approach not only facilitates the integration of diverse feature maps but also reduces the likelihood of vanishing gradients. The use of ‘same’ padding in all convolutions ensures the preservation of dimensional consistency, facilitating seamless concatenation after each block.
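For illustration, a single Inception-style 3D block with ‘same’ padding might look as follows in Keras; the filter counts and kernel sizes here are placeholders, not the exact configuration of Fig. 5a:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_block_3d(x, filters=16):
    """One Inception-style block: parallel 3D convolutions with different
    kernel sizes, concatenated. 'same' padding keeps the spatial/spectral
    dimensions identical so the branches can be concatenated."""
    b1 = layers.Conv3D(filters, (1, 1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv3D(filters, (3, 3, 3), padding="same", activation="relu")(x)
    b3 = layers.Conv3D(filters, (1, 1, 7), padding="same", activation="relu")(x)
    return layers.Concatenate()([b1, b2, b3])

inputs = tf.keras.Input(shape=(5, 5, 92, 1))  # patch size 5, 92 bands
x = inception_block_3d(inputs)
x = layers.GlobalAveragePooling3D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary classification
model = tf.keras.Model(inputs, outputs)
```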

Fig. 5

(a) Initial 3D-CNN Inception-based architecture. (b) The cross-validation used.

For each combination, Leave-One-Out Cross-Validation (LOOCV) was performed. For our dataset of 56 patients, 56 folds were tested for each combination: during each fold, one patient was held out as the test set, three were used as the validation set, and the rest formed the training set. The described CV can be seen in Fig. 5b. Different numbers of validation patients were tested in preliminary experiments (1, 3, 10, 20), and three validation patients gave the best results.
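The fold construction can be sketched as follows; how the three validation patients are chosen (here randomly with a fixed seed, so folds are identical across runs) is an illustrative assumption:

```python
import numpy as np

def loocv_splits(patients, n_val=3, seed=42):
    """Yield (train, val, test) patient splits: one held-out test patient
    per fold, n_val validation patients, and the rest for training."""
    rng = np.random.default_rng(seed)  # fixed seed: identical folds every run
    patients = np.asarray(patients)
    for test_patient in patients:
        rest = patients[patients != test_patient]
        val = rng.choice(rest, size=n_val, replace=False)
        train = np.setdiff1d(rest, val)
        yield train, val, np.array([test_patient])
```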

Training parameters and infrastructure

Combinations were trained and evaluated with Python (3.7.4) and TensorFlow (2.4.0) on the University of Leipzig cluster, using eight cores of a 32-core AMD EPYC CPU and 2–8 RTX 2080 Ti GPUs.

Training parameters:

  • Maximal epochs – 50.

  • Batch size – 500.

  • Loss – Binary cross-entropy.

  • Optimizer – Adam with β1 = 0.9 and β2 = 0.99.

  • Learning rate – 0.0001.

  • Activation – ReLU; last layer – Sigmoid.

  • Number of wavelengths – 92.

Early stopping was used to avoid overfitting: If the F1-score on the validation dataset did not improve for more than five epochs, training stopped.
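A sketch of the corresponding Keras setup; the model, training arrays, and a custom validation F1 metric logged as 'val_f1' are assumed to exist, and restore_best_weights is an illustrative choice rather than a documented detail of the original setup:

```python
import tensorflow as tf

# model, x_train, y_train, x_val, y_val assumed from the previous steps.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4,
                                       beta_1=0.9, beta_2=0.99),
    loss="binary_crossentropy",
)

# Stop if the validation F1-score has not improved for more than 5 epochs.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_f1", mode="max", patience=5, restore_best_weights=True)

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=50, batch_size=500, callbacks=[early_stopping])
```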

Recommendations for implementation:

  • At least 16 GB of RAM, possibly less with smaller batch sizes.

  • A processor with several cores would be a benefit.

  • At least one GPU would be a big benefit.

  • It is important to plan some time to implement parallel processing.

In our setup, one combination needs 2–3 h to train on the small representative dataset. Training on the whole dataset needs 5–6 days, but with parallel processing this was reduced to ~ 1 day.

Results

Metrics

In this section, we use the following abbreviations:

  • TP (True Positives): Cancerous tissue correctly identified as cancerous.

  • TN (True Negatives): Non-malignant tissue correctly identified as non-malignant.

  • FP (False Positives): Non-malignant tissue incorrectly identified as cancerous.

  • FN (False Negatives): Cancerous tissue incorrectly identified as non-malignant.

Due to the imbalanced nature of the dataset, sensitivity and specificity were chosen as metrics, allowing a more nuanced understanding of model performance. To preserve clinical validity in the performance estimates, each metric is first calculated independently for every patient p:

  • Sensitivity (also known as Recall), calculated as the ratio of correctly identified positive cases (TP) to the total actual positive cases (TP + FN).

$$sensitivity_p=\frac{TP_p}{TP_p+FN_p}$$
  • Specificity, measured as the proportion of actual negatives (TN) that are correctly identified.

$$specificity_p=\frac{TN_p}{TN_p+FP_p}$$

Binary classification requires a decision threshold, which partitions predicted probabilities into non-malignant tissue values (below the threshold) and malignant tissue values (above the threshold). As described in22 (Sect. 2.5.2), systematically varying the decision threshold t across the interval [0, 1] produces two cohort-level response curves:

  • Mean sensitivity:

$$sensitivity_{mean}(t)=\frac{1}{N}\sum_{p=1}^{N}sensitivity_{p}(t)$$
  • Mean specificity:

$$specificity_{mean}(t)=\frac{1}{N}\sum_{p=1}^{N}specificity_{p}(t)$$

where N is the number of patients. The optimal threshold T is defined at the intersection point, where $sensitivity_{mean}(t) = specificity_{mean}(t)$, due to its ability to balance sensitivity and specificity, thus allowing easy model comparison while managing the trade-off between false negatives and false positives.

The intersection point at T is what we report throughout the manuscript as the “mean over sensitivity and specificity” (MSS). This procedure provides a patient‑level aggregate metric while preserving equal importance for sensitivity and specificity.
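A sketch of the MSS computation under these definitions; the threshold grid resolution and the nearest-point intersection are implementation assumptions, and every patient is assumed to contribute pixels of both classes:

```python
import numpy as np

def mss(y_true_per_patient, y_prob_per_patient,
        thresholds=np.linspace(0, 1, 101)):
    """Sweep the decision threshold, average sensitivity/specificity over
    patients, and return the intersection point (threshold T, MSS)."""
    n = len(y_true_per_patient)
    sens = np.zeros((n, len(thresholds)))
    spec = np.zeros((n, len(thresholds)))
    for i, (y, prob) in enumerate(zip(y_true_per_patient, y_prob_per_patient)):
        for j, t in enumerate(thresholds):
            pred = prob >= t
            tp = np.sum(pred & (y == 1))
            fn = np.sum(~pred & (y == 1))
            tn = np.sum(~pred & (y == 0))
            fp = np.sum(pred & (y == 0))
            sens[i, j] = tp / (tp + fn)
            spec[i, j] = tn / (tn + fp)
    mean_sens = sens.mean(axis=0)
    mean_spec = spec.mean(axis=0)
    k = np.argmin(np.abs(mean_sens - mean_spec))  # crossing of the two curves
    return thresholds[k], (mean_sens[k] + mean_spec[k]) / 2
```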

Statistical analysis

During the analysis, we explored which contributors and values affect MSS, the dependent variable we aimed to improve.

First, analysis of variance (ANOVA) was used to determine the significance of each preprocessing contributor separately (Sect. 4.3). ANOVA quantifies differences in variance among the groups within a contributor and how significant these differences are.

For smoothing (Sect. 4.4) and filtering (Sect. 4.5), we wanted to examine how each value of each contributor affected the dependent variable compared to the basic “no smoothing” or “no filtering” respectively.

MSS is a single, continuous response variable confined to the 0–1 range, so classification-oriented models such as logistic regression are not applicable.

Variance-inflation-factor (VIF) diagnostics revealed no multicollinearity among the predictors, and Q-Q plots showed the residuals closely followed a normal distribution, confirming the normality assumption. Consequently, we applied a two-tailed Ordinary Least Squares regression, whose readily interpretable coefficients quantify how each pipeline factor raises or lowers average performance.

In this work we report the following OLS outputs: coefficient, p-value, and confidence interval.

The coefficient shows the estimated change in the dependent variable (MSS) for a one-unit change in the contributor, assuming all other variables in the model are held constant. Negative coefficients imply that the contributor influences MSS negatively, and vice versa for positive values.

The 95% confidence interval gives the range within which the true coefficient is expected to lie. Represented visually, it roughly corresponds to the differences between the minimums and maximums of the compared boxplots.

The p-value shows how significant the coefficient and confidence interval are.

For the statistical models, n was 1,584 (the total number of combinations) and α = 0.05.
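The statistical workflow can be sketched with statsmodels; the DataFrame layout and column names are illustrative (df holds one row per tested combination):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df: one row per tested combination, with the achieved MSS and the
# preprocessing contributors (column names are illustrative).
X = pd.get_dummies(
    df[["scaling", "smoothing", "blood_threshold", "light_threshold", "sw"]],
    drop_first=True).astype(float)
X = sm.add_constant(X)

# VIF diagnostics: values near 1 indicate no multicollinearity.
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Two-tailed OLS of MSS on the contributors: the summary reports the
# coefficient, p-value, and 95% confidence interval for each predictor.
result = sm.OLS(df["mss"], X).fit()
print(result.summary())
```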

The Supplementary Materials include an analysis of filtering and smoothing interactions, highlighting both the most and least effective combinations. Supplementary Table S1 presents the best significant (p < 0.1) interactions between smoothing and filtering, highlighting combinations with the strongest positive coefficients and confidence intervals, whereas Supplementary Table S2 shows the worst interactions where the absence of blood filtering leads to declines. Supplementary Table S3 lists the six highest-scoring preprocessing combinations, all using patch size 5 with Standardization and yielding MSS up to 0.778 and AUC up to 0.9, while Supplementary Table S4 reports the six lowest-scoring combinations with patch size 3 and Normalization, where MSS falls to about 0.465 and AUC to as low as 0.58.

Scaling, weights and patch sizes

As Fig. 6a shows, Standardization strongly outperformed Normalization: the median MSS rose from 0.54 with Normalization to 0.71 with Standardization, a mean gain of 17 percentage points. Therefore, only results for Standardization are analyzed further in the text. The likely explanation is that Normalization compresses spectra into a narrow positive range, which limits variance and obscures subtle tissue differences, whereas Standardization preserves negative values with a wider spread, making feature learning more effective. This question is examined in more detail in the Discussion.

In Fig. 6b, the comparison between applying sample weights (“SW”) and class weights (“CW”) is shown (blue), as well as differences in input patch sizes (orange). Applying sample weights outperforms applying class weights. By forcing the model to pay more attention to small cancerous areas, sample weights improve sensitivity to rare but clinically critical patterns, helping reduce the risk of missed cancer detections. An increased patch size improves results even more (right), since the network can capture more spatial information.

Fig. 6

General results. (a) Results comparison for two scaling methods, Standardization and Normalization, with Standardization emerging as the most influential pre-processing choice. (b) Results comparison for using Sample Weights and Class Weights (blue) and different input patch sizes (orange). Summary: Sample weights outperform class weights, and larger patch sizes lead to better results. (c) Results of ANOVA analysis, sorted by “F-test statistic” (from most to least significant). The non-significant contributors are highlighted in red. Summary: Nearly all pre-processing factors proved to be significant, except for the smoothing algorithm and light filtering—their differences have little to no impact on the results.

Figure 6c shows the results of the ANOVA analysis. The most significant factor (after scaling) is patch size, which corresponds to the visual results in (a) and (b). The second most significant factor is smoothing dimension, which — as will be seen in the next subsection — reflects the fact that 3D smoothing significantly worsens the results. Next are blood filtering and SW. Notably, blood filtering is more significant than light filtering.

The ANOVA’s R2 value is 0.390, indicating that approximately 39% of the variance in the dependent variable can be explained by the independent variables. Additionally, the p-value for the F-statistic is 7.88 × 10−80, demonstrating that the overall model is statistically significant, meaning that the predictors as a group are significantly related to the dependent variable.

Smoothing

Figure 7a shows the visual results with no smoothing applied (blue) alongside all other tested smoothing combinations. For each smoothing value on the x-axis, the filter is specified in parentheses: GF – Gaussian Filter, MF – Median Filter and SGF – Savitzky-Golay Filter. For each value, results for 1D, 2D and 3D (if applicable) are shown. Dashed lines indicate the maximum (red) and median (blue) values achieved with no smoothing applied (blue boxplot). Results of OLS for smoothing are presented in Fig. 7b. R2 for the OLS is 0.24 and the p-value for the F-statistic is < 0.0001.

Outcomes:

  • Almost all smoothing combinations worsen the results: the medians of all boxplots except 1D GF (0.5) and 1D MF (7) are below the no-smoothing median (blue line). Some smoothing combinations could nevertheless surpass the no-smoothing maximum: 2D GF (2), 1D and 2D MF (3), and, with an outlier, 1D MF (5). These results correspond with the visual results in Fig. 7a and explain why in the ANOVA (Fig. 6c) “Smoothing algorithm” is not significant and “smoothing value” is much less significant than the other contributors.

  • Significant results in Fig. 7b (p-value < 0.05, highlighted in green) correspond to declining outcomes (shades of red), with the largest declines observed for all 3D smoothing, SGF, and 1D GF 1.5, confirming that they are not effective.

  • Overall, the OLS coefficient is negative in all cases except 1D GF 0.5, and even that positive change is not significant (p-value 0.709). This means that smoothing worsens mean outcomes. On the other hand, there are combinations whose 95% interval (the far-right column) is positive, meaning that their maximum exceeds the maximum without smoothing, but none of these changes are statistically significant (p-value > 0.05). The conclusion is that, if smoothing is used at all, 1D smoothing with MF 3 or 5 is the most promising.

Fig. 7

Smoothing results. (a) Visual results. The blue boxplot corresponds to the absence of smoothing, the red dashed line to the maximum obtained without smoothing, and the blue dashed line to the median. X-axis labels are smoothing values with corresponding filters written in parentheses. (b) OLS results for smoothing. Shades of green highlight significance, and the only combination that improves sensitivity and specificity (1D, GF 0.5). Shades of red represent the severity of negative impact. Standard error (the standard deviation of the estimated Coefficient) for all rows is 0.00393. Summary: 1D, 2D and especially 3D smoothing generally lowers performance. 1D GF 0.5 is the sole configuration that delivers any accuracy gain, although its p-value is > 0.05.

Filtering

Figure 8a shows the visual results for the different filtering modalities, where green represents no filtering, red – blood thresholds, and gray – light thresholds. The dashed lines represent values without filtering applied: red is the maximum value, blue – the median.

Figure 8b presents the OLS results for blood and light filtering; OLS was calculated separately for blood and light. R2 for the light OLS is 0.015 with a p-value for the F-statistic of 0.0197, which means that changes in the light threshold explain the outcomes poorly. R2 for the blood OLS is 0.029 with a p-value for the F-statistic of 0.0001.

Fig. 8

Filtering results (a) Visual results: red – blood thresholds, gray – light, green – no filtering. The red dashed line is the maximum obtained without filtering; the blue dashed line denotes the median. (b) Results of OLS for blood and light filtering. Shades of green highlight significance. Shades of red represent severity of bad impact. Standard error (the standard deviation of the estimated Coefficient) for all rows is 0.002737. Summary: Filtering is generally counter‑productive; nevertheless, if one chooses to experiment with it, the thresholds that warrant consideration are blood = 0.1 and light = 0.25 or 0.7, with even these settings offering, at best, marginal benefit.

The results show:

  • Filtering is rather counter-productive, especially light filtering. Filtering worsens median values in all cases (the medians of all boxplots are below the blue line), but some settings could surpass the no-filtering maximum, especially blood threshold 0.1 and light threshold 0.25.

  • For blood filtering, a threshold of 0.1 works much better than 0.15 (which significantly worsens the outcomes). But even the improvements at 0.1 are questionable, because their p-value is 0.106 (> 0.05).

  • According to Fig. 8b, the most promising light thresholds are 0.25 and 0.7, but the improvement is not high.

Discussion

Based on the ANOVA test and box plot analyses, the factors influencing outcomes can be ranked by importance as follows: (1) Feature scaling; (2) Patch size; (3) Application of sample weights; (4) Blood filtering; (5) Smoothing; (6) Filtering of pixels with reflected light. We will now discuss each of these factors in detail and separately.

The scaling algorithm significantly impacts the outcomes, with Standardization proving to be substantially more effective than Normalization. This can be attributed to the range of values post-scaling. Normalization constrains values to the positive range between 0 and 0.14, resulting in a tighter distribution. Conversely, Standardization yields both positive and negative values, typically ranging from − 3 to 2, facilitating network learning due to the broader value distribution. Although previous research22 has indicated that Normalization performed better with Inception-based models, this study suggests otherwise. The discrepancy may be due to differences in cross-validation and the introduction of the Early Stopping technique. Previously, after excluding test patients, the remaining patients were shuffled and split into training and validation sets. In the current study (Sect. 3.3.1), specific patients were excluded as the validation set. Additionally, Early Stopping was employed to prevent overfitting, whereas in the prior study, all networks were trained for the full 40 epochs.

Patch size appears to be an important factor as well, with a tendency that the larger the patch size (i.e., the larger the spatial context given to the model), the better the results. While larger sizes are theoretically advantageous, their practical application is significantly constrained. For instance, a patch size of 7 is already prohibitively demanding in terms of computational resources, including disk space, memory and processing time. Consequently, an alternative approach to consider is the use of super-pixels. Super-pixels can potentially offer a more efficient representation by aggregating pixels into perceptually meaningful regions, thereby reducing computational load while preserving essential spatial information.

Sample weights have a moderate effect. This can be attributed to the fact that annotated cancerous regions are typically smaller than healthy tissue regions. Consequently, the use of sample weights compels the models to give greater attention to samples from patients with fewer annotated regions, thereby forcing the models to learn more from patients with small cancerous areas. This results in improved overall average sensitivity.

The results regarding filtering are inconclusive. On the one hand, the ANOVA analysis in Fig. 6c indicates that light filtering is not a significant contributor. On the other hand, when examining the interaction between filtering and smoothing, light filtering with a threshold of 0.25 demonstrated the biggest potential for performance improvement. Additionally, the results for specific light thresholds are ambiguous, highlighting the necessity of testing each threshold individually in each specific case. Regarding blood filtering, a blood threshold of 0.1 is more effective than 0.15. It is also worth exploring additional anomaly-detection techniques for the filtering stage, most notably the Reed–Xiaoli (RX)28 detector and the Spectral Angle Mapper (SAM). The RX detector flags outliers by exploiting the data’s covariance structure, while SAM measures the angular distance between a candidate spectrum and a reference mean vector. Although both algorithms have proven effective for generic spectral anomaly detection, it remains to be determined whether they can reliably eliminate blood contamination and specular glare. Equally important is deciding how to define the reference mean: should it be derived from the entire dataset, limited to annotated pixels, or estimated separately for each patient? These questions make this line of inquiry particularly promising.

The results indicate that smoothing generally degrades model performance, particularly 3D smoothing, which significantly reduces sensitivity and specificity. This is likely due to excessive smoothing of both spatial and spectral details, which can obscure key features that 3D-CNNs need in order to learn. It could also explain why SGF improved performance in14 but performed poorly in this work: in14 a feed-forward network was used, for which smoothing helps by filtering out “noise”, whereas for CNNs this “noise” is useful. This again emphasizes the need to tune preprocessing separately for each model. Ordinary Least Squares (OLS) regression results confirm that most smoothing techniques, especially those involving 3D operations, correlate with performance declines, although 1D smoothing (along the spectral dimension) shows the least negative impact.

Our findings exhibit varying degrees of translatability to other hyperspectral datasets. First, spectral scaling was identified as the most influential preprocessing contributor. Accordingly, when adapting the pipeline to other hyperspectral datasets, scaling—especially Standardization—should be the first parameter evaluated, yet its marked impact necessitates rigorous empirical validation rather than uncritical adoption. Second, the benefit of enlarging the input patch size is almost certainly generalizable: in our earlier experiments on several external datasets, increasing the receptive field consistently improved model accuracy, in line with the theoretical expectation that larger image fragments carry richer contextual information and with other studies29. Third, favoring sample over class weighting is also likely portable, because forcing the network to pay greater attention to rare cancer regions is a broadly applicable principle. In microscopy-style datasets, however—where each hyperspectral cube carries a single label and the annotated area is effectively constant—sample weights lose their meaning, whereas class weights remain pertinent. Fourth, the utility of spectral smoothing proved algorithm-dependent: for 3D-CNNs it was counter-productive. Yet it could depend on the hyperspectral camera used: if the camera introduces substantial noise, the potential denoising benefit could outweigh the performance loss, so the procedure should be assessed empirically for each camera configuration. Finally, light and blood filtering are the most setup-specific steps; our ex-vivo dataset contains minimal bleeding, whereas in-vivo acquisitions with more blood may show different behavior altogether. In sum, while the key methodological insights on patch size and sample weighting appear widely generalizable, the impact of scaling, smoothing and optical-artifact filtering needs to be validated for every new organ and imaging environment.

Future research directions include model tuning, such as stacking two Inception blocks. From the data perspective, the exploration of super-pixels for more efficient image representation is a promising alternative to the currently impractical large patch sizes. Improvements in computational infrastructure will be crucial to support the increased complexity and resource demands of these emerging methodologies. Another avenue is to extend the amount of data using techniques such as data augmentation, autoencoders, generative adversarial networks (GANs), and zero-shot learning30,31,32,33.

Conclusions

This study systematically evaluated 1,584 unique preprocessing configurations, demonstrating that preprocessing has a profound impact on classification performance. The difference between the best- and worst-performing combinations exceeded 30 percentage points in MSS (mean over sensitivity and specificity), from 0.46 to 0.77. However, comprehensively exploring such a large configuration space is infeasible. Therefore, the results emphasize the necessity of developing automated frameworks that can efficiently select representative subsets, tune preprocessing, and ensure strict control over potential confounding factors such as random initialization, data ordering, and leave-one-out cross-validation (LOOCV) consistency. This study provides a comprehensive checklist to guide such efforts.

Several key preprocessing components were identified as particularly influential. Scaling—especially Standardization—substantially improved both sensitivity and specificity and should be considered a default step. Sample weighting, as opposed to class weighting, was more effective in addressing data imbalance, particularly by improving model focus on underrepresented cancerous regions. Additionally, blood filtering showed some slightly positive effects, while filtering of reflected light had limited or negative influence.

Contrary to expectations, smoothing techniques generally degraded performance. Most spatial and spectral smoothing approaches led to performance deterioration, with minor exceptions observed in simple 1D smoothing along the spectral dimension. These results suggest that aggressive noise reduction may remove diagnostically relevant signal features.

Additionally, classification performance improved with increasing spatial patch size, consistent with previous literature. Although further enlarging the patch size would be a logical next step, such an approach is computationally impractical. Consequently, future work will focus on segmentation-based strategies. Given that the typically small size of medical hyperspectral datasets hinders efficient use of state-of-the-art segmentation models, a promising compromise lies in the use of super-pixels or larger spatial regions for segmentation, rather than relying on pixel-wise patch classification.

In summary, the findings highlight the substantial role of preprocessing in hyperspectral image analysis for cancer detection and outline both methodological guidelines and future research directions aimed at improving model performance while maintaining reproducibility and robustness.