Abstract
The development of optical sensors for label-free quantification of cell parameters has numerous uses in the biomedical arena. However, using current optical probes requires the laborious collection of sufficiently large datasets that can be used to calibrate optical probe signals to true metabolite concentrations. Further, most practitioners find it difficult to confidently adapt black box chemometric models that are difficult to troubleshoot in high-stakes applications such as biopharmaceutical manufacturing. Replacing optical probes with contactless short-wave infrared (SWIR) hyperspectral cameras allows efficient collection of thousands of absorption signals in a handful of images. This high repetition allows for effective denoising of each spectrum, so interpretable linear models can quantify metabolites. To illustrate, an interpretable linear model called L-SLR is trained using small datasets obtained with a SWIR HSI camera to quantify fructose, viable cell density (VCD), glucose, and lactate. The performance of this model is also compared to other existing linear models, namely Partial Least Squares (PLS) and Non-negative Matrix Factorization (NMF). Using only 50% of the dataset for training, reasonable test performance of mean absolute error (MAE) and correlations (r2) are achieved for glucose (r2 = 0.88, MAE = 37 mg/dL), lactate (r2 = 0.93, MAE = 15.08 mg/dL), and VCD (r2 = 0.81, MAE = 8.6 × 105 cells/mL). Further, these models are also able to handle quantification of a metabolite like fructose in the presence of high background concentration of similar metabolite with almost identical chemical interactions in water like glucose. The model achieves reasonable quantification performance for large fructose level (100–1000 mg/dL) quantification (r2 = 0.92, MAE = 25.1 mg/dL) and small fructose level (< 60 mg/dL) concentrations (r2 = 0.85, MAE = 4.97 mg/dL) in complex media like Fetal Bovine Serum (FBS). Finally, the model provides sparse interpretable weight matrices that hint at the underlying solution changes that correlate to each cell parameter prediction.
Similar content being viewed by others
Introduction
There exist several biomedical applications for which a contactless sensor for real-time monitoring of metabolites would be invaluable1,2,3,4,5,6,7,8. These applications range from cell cultivation inside bioreactors for therapeutics manufacture1,4,6,7,8 to minimally invasive disease diagnosis2,3. For example, The association of fructose with Non-Alcoholic Fatty Liver Disease (NAFLD) is well established epidemiologically9,10,11,12, yet technologies available to measure fructolysis remain limited. The most common methods to study cellular metabolism are mass spectrometry (MS) and nuclear magnetic resonance (NMR)13,14. These methods can be destructive, bulky, and expensive and do not allow repeated measurements on the same sample. The development of inline optical sensors coupled with machine learning analysis of signals to yield metabolite concentrations within bioreactors is prevalent in the literature. Examples include NIR probes6,8,15,16,17,18,19,20, Raman probes4,16,21,22,23,24,25 and even NMR spectrometers26,27. However, these probes are prone to noise, fouling since they must be submerged within the cell sample, and sensor drift28.
As an alternative, the use of a short wave infrared hyperspectral camera (SWIR - HSI) allows for fully contactless collection of thousands of absorption spectra using a single HSI image. This wealth of signals can then be used to efficiently denoise sample spectra28. It can also train interpretable deep learning models using a small dataset of ground truth metabolite samples28. These properties enable instantaneous troubleshooting of model predictions or automatic filtering of drifting readouts. This HSI sensor setup can also produce sample measurements using less than 500 uL of liquid per sample measurement28. A truly multiplex sensor must be capable of distinguishing and quantifying similar metabolites since several biological metabolites, such as fructose and glucose, share similar chemical structures. Most studies in the literature attempt to quantify distinct metabolite relevant to bioprocessing using probes and traditional chemometric models like partial least square (PLS)4,6,7. These models tightly couple dimensionality reduction with regression analysis, so it is difficult to determine which features are both noise free and relevant to metabolite prediction29. Thus, an alternative example of interpretable linear models is explored called L-SLR. The performance of this model is compared with both PLS and well known unmixing model Non-negative Matrix Factorization (NMF).
This paper addresses the following objectives (Fig. 1): (1) enable contactless, label free rapid quantification of fructose and other cell growth parameters using a short-wave infrared (SWIR) hyperspectral camera; (2) develop and demonstrate models that easily adapts to varying cell media environments, and (3) develop model that allows practitioner to troubleshoot model performance based on sparse weight matrix maps. This leads to the development of small but interpretable models that can be extended to quantify any number of metabolites important to human health and cell growth alike.
An overview of the contactless, label free rapid quantification system using a short-wave infrared (SWIR) hyperspectral camera. (A) Collection of dataset of hyperspectral images and corresponding labels. (B) Collection of hyperspectral images in SWIR range for (C) training an encoder and sparse linear regression model for metabolite prediction.
Materials and methods
Materials
Spent Cell media samples were collected and imaged as described by Hevaganinge et al.28. In order to collect the fructose datasets, various weights of Fructose powder (Sigma F0127) were dissolved in solutions of Glucose (Sigma 150023-021) and deionized water to produce calibration datasets for fructose. Fructose was also dissolved in Fetal Bovine Serum (FBS) (Gibco 2309286) spiked with Glucose to generate further calibration curves. Sugar samples were weighed using a precise scale (Mettler-Toledo, ME204). Calibration curves were collected for fructose at varying glucose background levels, as shown in Table 1. The ID indicates the dataset number, while G denotes background glucose concentration in mg/dL,
\(\:{F}_{min\:},\:{F}_{max\:}\) denote fructose concentration range in mg/dL,\(\:\:{\varDelta\:}_{F}\) denotes fructose concentration increment in mg/dL, and N is the number of samples collected in each dataset. T represents test data, and V represents validation data in the Fold column of Table 1. Lines demarcate combinations of data used to evaluate each model. Because fructose dataset was collected with varying background levels of glucose in both water and FBS (Table 1), it was possible to isolate a single glucose level from water and FBS environments respectively for k-fold cross validation and optimization of model components, while the remaining data was divided into k folds and treated as test data.
Fructose model training and analysis
Because fructose dataset was collected with varying background levels of glucose in both water and FBS (Table 1), it was possible to isolate a single glucose level from water and FBS environments respectively for k-fold cross validation and optimization of model components, while the remaining data were treated as test data. Both validation and test data from each respective glucose background were split into k-folds consisting of 2 samples to obtain final metrics. 4 separate model sets were trained by partitioning the data as shown in Table 1, with model validation datasets bolded and all remaining datasets used as test data. Datasets used as validation data are bolded.
Fructose dataset collection
To create the water solutions, glucose and fructose were weighed using a balance (Mettler-Toledo, ME204) and placed in 15mL conical tubes. Next, the appropriate volume of water was added to reach the desired concentration. The tubes were then thoroughly mixed. Serum solutions were created with heat-inactivated Fetal Bovine Serum (FBS) (Gibco, Lot 2309286) using the same procedure as for water solutions. Since glucose is present in horse serum at unknown concentrations, the resulting solutions contained concentrations of glucose augmented by both fructose and additional glucose. Sample order was randomized before image acquisition. 500 uL of each sample was loaded into a quartz cuvette (Hellma 110-1-40) and imaged using Hyperspectral sensor (Headwall Photonics, Bolton, Massachusetts). The spatial resolution of this system was approximately 0.5 mm, which allowed collection of hundreds of transmittance readings from each cuvette photo. Each image pixel contains a transmittance spectrum which is collected in the range from 900 to 2500 nm in 8 nm increments. The quartz cuvette was washed with a single wash of water and two washes of solution of interest prior to image collection.
Cell data model training and analysis
As detailed previously28, cell permeate dataset was collected and quantified from eight different cell culture flasks, such that approximately 30 samples were collected from each flask prior to data cleaning and filtering. Further, ground truth labels for cell media samples were kept unfiltered. Only data with cell viability greater than 80% and lactate levels less than 300 mg/dL were used for model training from spent cell media dataset, resulting in approximately 20 samples per flask. The first flask was chosen as the validation flask and used to determine minimum number of flasks that could be used for model training while maintaining reasonable metabolite quantification results, as well as optimal number of components for all models. Of the remaining flasks, 4 were randomly chosen as test flasks, and remaining flasks were used as training data. 20 folds of flask combinations were generated for evaluation. To minimize model extrapolation, only data with cell viability greater than 80% was used as test data for models trained using spent cell media dataset.
Data pre-processing
Hyperspectral images were processed as described by Hevaganinge et al.28 for all datasets. First, min-max normalization of each transmission signal was conducted against respective dark and white background reference hyperspectral images that are taken at the start of each experiment. Next, transmittance spectra are converted to polynomial basis30 based on Eq. (1), where \(\:\phi \:\: \in \:\mathbb{R}^{{NxP}}\), with N samples and P terms, represents set of linear and 2nd order cross terms and \(\:{x}_{i},\:{x}_{j}\) represent transmittance values from ith and jth wavelength respectively. Also, B is the total number of wavelengths. All terms undergo min-max normalization based on extreme features of respective training dataset.
Data analysis and model training
Three models were explored for contactless quantification of various cell parameters. The respective validation dataset was used for model hyperparameter tuning, and validation data was omitted from test data evaluation as needed. Scikit-Learn implementation of Partial Least Squares with 10 components29,30 and polynomial features (\(\:\phi\:)\) was used as a baseline model to compare against the performance of two source separation models. The first alternative model used a blind source separation model, specifically Scikit-Learn implementation of Non-negative Matrix Factorization (NMF)30,31 as the encoder which takes \(\:\phi\:\) as input and returns compressed signal with 10 components for subsequent linear regression. Nonnegative Double Singular Value Decomposition (NNDSVD) was used as the initialization method, and the solver was Multiplicative Update Solver. The architecture of the proposed L-SLR model is shown in Fig. 2. First, the transmittance signal, \(\:T\: \in\:{\mathbb{R}}^{NxB}\:\), with N samples and B wavelengths, was converted to a polynomial basis (2 A). Concurrently, the matrix of metabolite concentrations, \(\:Y\: \in\:{\mathbb{R}}^{NxM}\:\) from the training dataset with M metabolites was converted into a single integer index via Hilbert transformation32 (2B). Specifically, each metabolite label was divided by maximum concentration in training dataset, multiplied by 100 and rounded to the nearest integer. Each resulting number was treated as an index in M degree space which can be spanned using a Hilbert curve using the Hilbert curve package in python33. These integer labels were used to pretrain an Scikit-Learn implementation of Orthogonal Matching Pursuit sparse regression model (OMP)30,34 (2 C) which output a sparse feature mask that was used to mask out noisy terms in original feature space that were not relevant to metabolite prediction, as shown in Eq. (2), where \(\:{H}_{i}^{}\) is the sparse cross terms feature vector of sample i from \(\:\phi\:\) after masking, and s is the sparse feature mask.
Next, these integer labels and corresponding masked features were used to pre-train a Linear Discriminant Analysis (LDA) encoder35 (2E). The LDA model forms a compressed latent space which helps to separate very redundant metabolite spectra from a variety of metabolite environments. Test signals were masked using the OMP method (2 C) then dimensionality reduction was done by pretrained LDA according to Eq. (3), where Z is the compressed signal, \(\:{H}_{i}\: \in\:{\mathbb{R}}^{Nx{P}^{+}},\:\)with N samples and \(\:{P}^{+}\:\)terms, is the masked cross terms signal, and W is the basis matrix of n = 10 highest eigenvectors obtained via optimization of LDA cost function (2E).
After, simple linear regression30 was used to quantify metabolite levels using compressed signal from LDA model, according to Eq. (4) (2 F) where \(\:\widehat{y}\) is the predicted metabolite reading, \(\:\varTheta\:\) is the sparse weights, and \(\:z\) is the LDA latent space signal. All ground truth labels are normalized by corresponding maximum training label prior to fitting linear regression model.
Finally, the sparse weight visualizations were formed by taking the weighted average of LDA weights (W) based on linear regression weights (\(\:\varTheta\:\)) before replacing nonzero elements of sparse feature mask \(\:{s}_{i}\) with the corresponding weight value. All terms used to construct weight map are highlighted in red (Fig. 2). These steps are summarized in Eq. (5), where \(\:{s}_{p}\) is the pth term in the sparse feature mask, \(\:{w}_{p}\) is the pth column of LDA eigenvector matrix, and \(\:\varTheta\:\) is the weight matrix used for metabolite quantification.
Architecture of L-SLR. First, absorbance signal is converted to a polynomial basis (A), then masked based on weights from pretrained sparse linear regression model (C). Next, masked signal is passed into pretrained LDA model (E). LDA forms compressed latent space which is used to generate metabolite predictions via linear regression (F). Coefficients from LDA and SLR are then used to form weight matrix visuals (G). Models used in (C) and (E) are pretrained using a labeled dataset consisting of integer labels computed via Hilbert curve (B) that represent unique metabolite combinations.
L-SLR weight matrix map generation
Because L-SLR is a fully linear model, the weight matrix may be visualized and analyzed similar to an attention map. The weight matrix not only reveals which features are excluded from metabolite prediction36, but also reveals which features are combined to generate final metabolite predictions. Therefore, the cumulative sum of all weight matrices generated across folds for each dataset was used as an approximation of the stable weight matrix.
Model evaluation and statistics
Predicted cell parameter concentrations were compared to ground truth readings via r2, root mean squared error (RMSE), mean absolute error (MAE), and relative mean absolute percentage error (MAPE) metrics. All metrics were calculated using Scikit-Learn implementations30. MAE, and RMSE are reported in units of mg/dL, while the relative MAPE decimal is reported in all tables. Number of latent space components used for each model is listed as n, with 10 components used for all models.
Results and discussion
Three linear models were evaluated using a series of datasets collected in house using labeled HSI images of not only fructose solutions but also spent cell media from eight CHO cell culture flasks. Tables 2, 3 and 4 summarize validation and test performance when fructose samples in various glucose backgrounds are partitioned into k-folds each containing 2 unique fructose concentrations as test samples. In dataset number 8, 5 unique fructose concentrations are sampled for each test fold since it contains a higher number of unique fructose concentrations compared to the other fructose datasets collected. The dataset number corresponds to data which was collected under conditions listed in Table 1.
Figures 3, 4 visualize PLS-R and L-SLR correlation to fructose ground truth for test datasets in all datasets of interest for a variety of fructose solution in glucose solutions (Fig. 3) and Horse Serum solutions (Fig. 4). r2 and MAE are listed in the top left corner of each graph. All measurements are reported in units of mg/dL.
Across all folds, PLS-R exhibits better test performance when compared to L-SLR, with better scoring performance across all metrics assessed. Nevertheless, the accuracies of both L-SLR and PLS fall within the range of usability for biological studies of fructose2,12 and bioprocess monitoring16,23. The performance gap between PLS-R and L-SLR widens as the fructose concentration decreases, suggesting that LDA has trouble defining discrete clusters for small fructose samples, especially in the presence of relatively high glucose backgrounds. It is also clear that folds sampled at the extreme concentrations tend to exhibit higher levels of drift compared to concentrations in the center of the fructose concentration data distribution. Perhaps more concentrations beyond the extremes of interest need to be sampled and placed as fixed training data points across all folds. In Fig. 4C and D, there are a few outlier points that have high test error for both PLS-R and L-SLR, while most of the other datapoints are quantified with relatively smaller error. The misclassification of these points may stem from the fact that background glucose concentration is more than twice the fructose concentrations imaged, so signal differences caused by fructose concentration changes are small, subtle, and difficult to differentiate properly. Further, the performance of unmixing regression model based on nonlinear matrix factorization (NMF) is shown in Tables 2, 3 and 4, since this is a well established model in the literature for analysis of spectroscopic signals37,38. However, the performance of the NMF regression is worse when compared to PLS and L-SLR, most likely since NMF is an iterative model whose performance is dependent on initialization31. Next, to evaluate model performance on cell culture flask dataset quantification, all models were trained and evaluated on 20 folds of 4 randomly sampled flasks for training data, while using remaining cell culture flasks as test data. The validation flask was omitted from test data evaluation as needed. Tables 5, 6 and 7 summarize average viable cell density (VCD), glucose and lactate quantification performance across 20 cross validation folds when four cell culture flasks (50% of dataset ) are randomly reserved for model training. This number was determined as the minimum number of flasks required to achieve comparable validation performance as using 75% (6 randomly sampled flasks) as training data and flask 1 as validation data. The use of 50% of the dataset rather than 75% of the datasets highlights the powerful generalizability and training efficiency of linear models.
Figure 5, 6, 7 visualize PLS-R and L-SLR correlation to VCD, glucose and lactate ground truth labels respectively for randomly selected test dataset folds. r2 and MAE are listed in the top left corner of each graph. All measurements are reported in units of mg/dL.
Across all folds, L-SLR exhibits similar performance to PLS-R, although PLS-R tends to display higher performance across all metrics evaluated on the test datasets. Both models achieve best test performance with lactate, test r2 values above 0.9, and MAE less than 20 mg/dL. This suggests that signals correlated to lactate are robust and bi-linear. Performance metrics are lowered slightly when evaluating glucose quantification, but r2 values are still close to 0.9. After analyzing Fig. 6, it is clear that a few cell permeate images exhibit high glucose quantification error, while most of the points have relatively low error. It is important to note that SWIR spectra mainly reflect the absorbance of water, and the subtle shifts of the water peak are used as a “molecular mirror” of substrates dissolved in the hydrogen bonding matrix of water39. Thus, signal shifts caused by glucose are not always linear, or even bi-linear40. This may explain the spurious misclassification of certain glucose images by both models. Further, PLS-R exhibits significant overfitting when compared to L-SLR performance for glucose and lactate. This is highlighted by the relatively large reduction in all metrics between training and testing performance for PLS-R when compared to L-SLR. Finally, both models achieve r2 values less than 0.9 for VCD. Since cell free permeate is imaged, perhaps hyperspectral images lack concrete signals that correlate to VCD. Most likely, the model is forced to correlate signals belonging to other cell parameters with VCD instead. It was hoped that linear unmixing methods like NMF with the help of 2nd degree cross term features could help uncover interpretable signals that directly correlate to VCD. In practice, NMF regression models display the worst performance of the models evaluated, most likely since NMF is an iterative model whose performance is dependent on initialization. Of course, L-SLR might be thought of as a unmixing model as well, since it does factor the input matrix X into eigenvectors and eigen values that best explain the variance correlated to ground truth label separation35. In the future, perhaps the Fischer’s Discriminant ratio can be modified so that chosen components group similar environment samples close together.
L-SLR interpretability
Unlike Mid-Infrared (MIR) absorption, SWIR absorption reflects weak signals caused by overtones and combination vibrations39,41,42,43. These signals are difficult to simulate and interpret, but linear weight matrices help the practitioner correlate high weights with known functional group harmonics in the literature41. Linear models afford weight matrices which directly show both which features are omitted and which features are used as well as how these features contribute to the final metabolite predictions. In contrast, attention maps formed to summarize complex non linear models only afford a glimpse of which features are omitted for model prediction36. Summary visualizations of all weight matrices produced across all folds evaluated are shown in Figs. 8 and 9.
Cumulative sum weight matrices produced by L-SLR (A–D) and PLS-R (E–H) for bands and cross terms across all models evaluated for fructose quantification via datasets 9–11 (A, E), dataset 8 (B, F), datasets 2–4 (C, G) and datasets 5–7 (D, H). All weights are normalized by the respective maximum of each cumulative matrix to facilitate comparison of weight matrices across models and datasets alike.
It is well known that hydrogen bonding within the solution matrix impacts SWIR absorption spectra41. This phenomenon is reflected in the distinct differences between weight matrices for models trained to identify fructose within glucose solutions vs. horse serum solutions spiked with glucose (Fig. 8A–D). For example, the model trained using glucose solutions pays more “attention” to absorption signals below 1400 nm, whereas the model trained using horse serum pays more “attention” to absorption signals between 1600 and 1800 nm. The former region is dominated by overtones related to O-H stretch, whereas the latter region is dominated by C-H stretch vibrations41. This implies that the correlation between fructose concentration and SWIR absorption can be related to both changes in the hydrogen bonding matrix of water and interactions of fructose with organic compounds dissolved in the horse serum. Further only cross terms in region between bands less than 1400 and bands from 1600 to 1900 or region between 1600 and 1800 are heavily used for all models other than those used to quantify small quantities of fructose in FBS (Fig. 8B). According to Fig. 4, the corresponding model displays the largest extremes in model prediction error compared to the other models. Perhaps this distribution of weights suggests that the model was not able to clearly distinguish linear or bi-linear cross term signals that correlate with fructose concentrations, which leads to overfitting to noise in the training dataset instead. The misclassification of these points may stem from the fact that background glucose concentration is very high compared to the incremental fructose concentration changes in this dataset, so feature changes are small, subtle, and difficult to differentiate properly. A similar analysis is less clear in the PLS-R weight matrices, since non-zero weights are much less sparse compared to equivalent L-SLR model. However, points of distinction are highlighted in Fig. 8E–H.
Cumulative sum weight matrices produced by L-SLR (A–C) and PLS-R (D–F) for bands and cross terms across all models evaluated for quantification of VCD (A, D), glucose (B, E), and lactate (C, F) in spent cell media. All weights are normalized by the respective maximum of each cumulative matrix to facilitate comparison of weight matrices across models and datasets alike.
Although three distinct cell culture parameters were predicted using the cell culture flask dataset, the weight profile of all three models is quite similar (Fig. 9A–C). In fact, the weight matrices formed by PLS-R (Fig. 9D–F) are virtually indistinguishable, so weight matrices formed by L-SLR will be analyzed for the remainder of the discussion. Notably, both the VCD model and the lactate model share similar regions of high weight, implying that changes in lactate concentration correlate with VCD changes. While both VCD and lactate models share a high weight band around 2200 nm, the glucose model does not weigh this band region quite as heavily. This region corresponds to overtone stretches caused by carboxylic acid41, which is the main functional group on lactic acid. Still, all three models pay “attention” to absorption signals below 1400, implying that changes to the hydrogen bonding matrix of water correlate with changes in all three cell parameters investigated. Still, SWIR spectra exhibit broad, overlapping peaks with non-linear relationships to dissolved molecules. Similar absorption spectra may correspond to very different solution matrix profiles, so the model confuses distinct cell parameter predictions. Because LDA is already designed to maximize separation of data points based on class35, it was assumed that LDA would be able to treat different sets of metabolites as distinct classes and find an appropriate lower dimensional latent space that facilitates metabolite quantification. In the future, the cost function of LDA should be modified to distance points from different environments based on environment similarity. This smooth latent space would also allow for identification and removal of ambiguous absorption spectra which overlap multiple metabolic profiles. In contrast, it is difficult to imagine how PLS-R latent space may be modified to ensure continuous latent space, since dimensionality reduction and regression are highly coupled29. Across all datasets tested, the NMF model performed the worst, most likely due to dependence on weight initialization. Unlike PLS-R and L-SLR, NMF is a blind source separation model, so it cannot take advantage of the supervised labeled dataset available. It is reasonable and arguably essential in the bioprocessing and therapeutics realm for labeled datasets to be collected prior to model training for metabolite quantification. Based on remarks in the literature, an interpretable chemometrics model in the bioprocessing and therapeutics realm should (1) produce reduced dimensionality latent space that separates spectra based on corresponding metabolite environment, (2) easily adapts to varying cell media environments, and (3) allows practitioner to troubleshoot model performance based on sparse weight matrix maps and the prediction of confidence intervals28,36. So far, L-SLR has produced a reduced dimensionality latent space coupled with sparse weight matrices that highlight which bands and cross terms are used for final prediction. Both L-SLR and PLS demonstrate reasonable metabolite quantification accuracy, and the ability to generalize to test datasets using 50% of the dataset available. This implies that these models can generalize training data environments to test environments with the same cell type and media using a 1:1 ratio of train to test data. In the future, this sensor framework may be used to quantify other valuable cell parameters such as pH, amino acids, and even antibody glycosylation5,6,20,44.
Conclusion
Herein, we presented the development of a contactless label free sensor coupled with an interpretable model for the quantification of fructose and other key cell parameters. This SWIR sensor system has succeeded in (1) contactless, label free rapid quantification of fructose and other cell growth parameters, and (2) easily adapts to varying cell media environments using (3) interpretable models that shed light on how the bands of the nonlinear absorption spectra contribute to metabolite quantification. Further, the accuracies of both L-SLR and PLS fall within the range of usability for biological studies of fructose2,12 and bioprocess monitoring16,22.Because the HSI camera is capable of recording thousands of absorbance spectra within a single image, this sensor system can efficiently train and adapt to new metabolites using relatively small calibration datasets. Currently, existing invasive time consuming quantification methods like high performance liquid chromatography (HPLC) and mass spectrometry (MS) remain the gold standard measurement devices in biomedical arena. There are several biomedical applications for which real-time monitoring of cell parameter levels coupled with intermittent ground truth readings will enable both rapid diagnosis of diseases like NAFLD, and precise control of bioprocesses.
Data availability
Raw data were generated at University of Maryland, College Park. Derived data supporting the findings of this study are available from the corresponding author Dr. Yang Tao on request.
References
Costariol, E. et al. Demonstrating the manufacture of human CAR-T cells in an automated stirred-tank bioreactor. Biotechnol. J. 15, 2000177 (2020).
Miller, M., Drucker, W. R., Owens, J. E., Craig, J. W. & Woodward, H. Metabolism of intravenous fructose and glucose in normal and diabetic subjects. J. Clin. Invest. 31, 115–125 (1952).
Kruszynska, Y. T., Meyer-Alber, A., Wollen, N. & McIntyre, N. Energy expenditure and substrate metabolism after oral fructose in patients with cirrhosis. J. Hepatol. 19, 241–251 (1993).
Schwarz, H., Mäkinen, M. E., Castan, A. & Chotteau, V. Monitoring of amino acids and antibody N-glycosylation in high cell density perfusion culture based on Raman spectroscopy. Biochem. Eng. J. 182, 108426 (2022).
Ishigaki, M. et al. Development of an amino acid sequence-dependent analytical method for peptides using near-infrared spectroscopy. Analyst 147, 3634–3642 (2022).
Hubli, G. B., Banerjee, S. & Rathore, A. S. Near-infrared spectroscopy based monitoring of all 20 amino acids in mammalian cell culture broth. Talanta 254, 124187 (2023).
Zavala-Ortiz, D. A. et al. Support vector and locally weighted regressions to monitor monoclonal antibody glycosylation during CHO cell culture processes, an enhanced alternative to partial least squares regression. Biochem. Eng. J. 154, 107457 (2020).
Thakur, G., Hebbi, V. & Rathore, A. S. An NIR-based PAT approach for real-time control of loading in Protein A chromatography in continuous manufacturing of monoclonal antibodies. Biotechnol. Bioeng. 117, 673–686 (2020).
Chiu, S. et al. Effect of fructose on markers of non-alcoholic fatty liver disease (NAFLD): A systematic review and meta-analysis of controlled feeding trials. Eur. J. Clin. Nutr. https://doi.org/10.1038/ejcn.2014.8 (2014).
Futatsugi, K. et al. Discovery of PF-06835919: A potent inhibitor of ketohexokinase (KHK) for the treatment of metabolic disorders driven by the overconsumption of fructose. J. Med. Chem. 63, 13546–13560 (2020).
Ishimoto, T. et al. High-fat and high-sucrose (western) diet induces steatohepatitis that is dependent on fructokinase. Hepatology 58, 1632–1643 (2013).
Tee, S. S. et al. Ketohexokinase-mediated fructose metabolism is lost in hepatocellular carcinoma and can be leveraged for metabolic imaging. Sci. Adv. 8, eabm7985 (2022).
Moco, S. Studying metabolism by NMR-based metabolomics. Front. Mol. Biosci. https://doi.org/10.3389/fmolb.2022.882487 (2022).
Nagana Gowda, G. A. & Djukovic, D. Overview of mass spectrometry-based metabolomics: Opportunities and challenges. Methods Mol. Biol. 1198, (2014).
Rodriguez-Saona, L. E., Fry, F. S., McLaughlin, M. A. & Calvey, E. M. Rapid analysis of sugars in fruit juices by FT-NIR spectroscopy. Carbohydr. Res. 336, 63–74 (2001).
Kozma, B. et al. On-line prediction of the glucose concentration of CHO cell cultivations by NIR and Raman spectroscopy: Comparative scalability test with a shake flask model system. J. Pharm. Biomed. Anal. 145, 346–355 (2017).
Cheng, J. H. & Sun, D. W. Partial least squares regression (PLSR) applied to NIR and HSI spectral data modeling to predict chemical properties of fish muscle. Food Eng. Rev. 2016(9), 36–49 (2016).
Sorvaniemi, J., Kinnunen, A., Tsados, A. & Mälkki, Y. Using partial least squares regression and multiplicative scatter correction for FT-NIR data evaluation of wheat flours. LWT - Food Sci. Technol. 26, 251–258 (1993).
Roychoudhury, P., O’Kennedy, R., McNeil, B. & Harvey, L. M. Multiplexing fibre optic near infrared (NIR) spectroscopy as an emerging technology to monitor industrial bioprocesses. Anal. Chim. Acta 590, 110–117 (2007).
Henriques, J. G., Buziol, S., Stocker, E., Voogd, A. & Menezes, J. C. Monitoring mammalian cell cultivations for monoclonal antibody production using near-infrared spectroscopy. Adv. Biochem. Eng. Biotechnol. 116, 29–72 (2009).
Whelan, J., Craven, S. & Glennon, B. In situ Raman spectroscopy for simultaneous monitoring of multiple process parameters in mammalian cell culture bioreactors. Biotechnol. Prog. 28, 1355–1362 (2012).
Rafferty, C. et al. Raman spectroscopic based chemometric models to support a dynamic capacitance based cell culture feeding strategy. Bioprocess Biosyst. Eng. 43, 1415–1429 (2020).
Rafferty, C. et al. Analysis of chemometric models applied to Raman spectroscopy for monitoring key metabolites of cell culture. Biotechnol. Prog. 36, e2977 (2020).
Mehdizadeh, H. et al. Generic Raman-based calibration models enabling real-time monitoring of cell culture bioreactors. Biotechnol. Prog. 31, 1004–1013 (2015).
Hu, J., Chen, G., Yongjun, Q. I. N. & Zhou, W. A raman spectroscopy integrated perfusion cell culture system for monitoring and auto-controlling perfusion cell culture. Preprint at (2020).
Mehendale, N. et al. A nuclear magnetic resonance (NMR) platform for real-time metabolic monitoring of bioprocesses. Molecules 25, 4675 (2020).
Wang, R. C. C., Campbell, D. A., Green, J. R. & Čuperlović-Culf, M. Automatic 1D 1H NMR metabolite quantification for bioreactor monitoring. Metabolites https://doi.org/10.3390/metabo11030157 (2021).
Hevaganinge, A. et al. Fast-training deep learning algorithm for multiplex quantification of mammalian bioproduction metabolites via contactless short-wave infrared hyperspectral sensing. ACS Omega 8, 14774–14783 (2023).
Wegelin, A., J. A Survey of Partial Least Squares (PLS) Methods.
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Preprint at https://doi.org/10.48550/arXiv.1201.0490 (2018).
Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).
Anders, S. Visualization of genomic data with the Hilbert curve. Bioinformatics 25, 1231–1235 (2009).
Altay, G. Programming the Hilbert Curve. (2024).
Cai, T. T. & Wang, L. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Trans. Inf. Theory 57, 4680–4688 (2011).
Zhao, S., Zhang, B., Yang, J., Zhou, J. & Xu, Y. Linear discriminant analysis. Nat. Rev. Methods Primer 4, 1–16 (2024).
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215 (2019).
Wei, J. & Wang, X. An Overview on Linear Unmixing of Hyperspectral Data. Math. Probl. Eng. 2020, 3735403 (2020).
Palsson, B., Sveinsson, J. R. & Ulfarsson, M. O. Blind hyperspectral unmixing using autoencoders: A critical comparison.. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 15, 1340–1372 (2022).
Tsenkova, R., Munćan, J., Pollner, B. & Kovacs, Z. Essentials of aquaphotomics and its chemometrics approaches. Front. Chem. 6, 363 (2018).
Czarnecki, M. A., Morisawa, Y., Futami, Y. & Ozaki, Y. Advances in molecular structure and interaction studies using near-infrared spectroscopy. Chem. Rev. 115, 9707–9744 (2015).
Workman, J. & Weyer, L. Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy. (2012). https://doi.org/10.1201/b11894
Kojić, D., Tsenkova, R. & Yasui, M. Improving accuracy and reproducibility of vibrational spectra for diluted solutions. Anal. Chim. Acta 955, 86–97 (2017).
Kojić, D., Tsenkova, R., Tomobe, K., Yasuoka, K. & Yasui, M. Water confined in the local field of ions. ChemPhysChem 15, 4077–4086 (2014).
Fontaine, J., Hörr, J. & Schirmer, B. Near-infrared reflectance spectroscopy enables the fast and accurate prediction of the essential amino acid contents in soy, rapeseed meal, sunflower meal, peas, fishmeal, meat meal products, and poultry meal. J. Agric. Food Chem. 49, 57–66 (2001).
Acknowledgements
This study was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1840340 to CMW. National Institutes of Health grant R21CA245492 (S.S.T.), We also acknowledge the support of the University of Maryland, Baltimore, Institute for Clinical & Translational Re-search (ICTR) and the National Center for Advancing Translational Sciences (NCATS) Clinical Translational Science Award (CTSA) grant number 1UL1TR003098. We also acknowledge the support of the National Cancer Institute-Cancer Center Support Grant (CCSG) - P30CA134274, as well as the Maryland Department of Health’s Cigarette Restitution Fund Program CH-649-CRFThe funders did not participate in research or review any details of this study.
Author information
Authors and Affiliations
Contributions
A.H., E.L., A.F., M.M., N.T.M., B.R., J.Y., J.E., R.H. participated in collection of fructose data as well as experimental setup for related experiments. A.H., J.Y., J.E., R.H., A.M., A.K., J.C., Y.Y. participated in collection of data related to spent cell media (glucose, lactate, viable cell density). A.H. prepared manuscript and figures. All authors reviewed manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Hevaganinge, A., Lowenstein, E., Filatova, A. et al. Exploration of linear and interpretable models for quantification of cell parameters via contactless short-wave infrared hyperspectral sensing. Sci Rep 15, 2307 (2025). https://doi.org/10.1038/s41598-025-85930-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-025-85930-2