Abstract
Remote photoplethysmography (rPPG) is an emerging contactless physiological parameter detection method utilizing cameras, showing great promise as a forefront technology for remote health assessment. While traditional rPPG methods have substantially contributed to the development of affordable camera-based health monitoring systems, their dependence on specific assumptions may lead to performance degradation when confronted with real-time dynamic interferences. This article presents CAP-rPPG, an innovative method for remote physiological measurement through video analysis. The method employs a multi-scale deep learning architecture with a Gaussian pyramid to capture facial features at different scales that are often overlooked in prior work. A channel attention module further emphasizes rPPG-rich channels, mitigating the feature dilution caused by excessive channel depth and enhancing the accuracy of physiological signal extraction from facial videos. The uniqueness of CAP-rPPG lies in its hybrid loss function, which balances the short-term and long-term characteristics of the signal with the correlation between the predicted HR and the ground-truth HR. CAP-rPPG demonstrates outstanding robustness under various challenging conditions, such as varying lighting environments and physical motion, and consistently outperforms most state-of-the-art methods on both the UBFC-rPPG and PURE datasets. Its capability to non-invasively capture subtle physiological changes from video data represents a significant leap forward in the realm of remote health monitoring technologies.
Introduction
Physiological parameters, including heart rate (HR), blood oxygen saturation (SpO2), and heart rate variability (HRV), play a pivotal role in monitoring physical health and diagnosing diseases1,2,3,4. These parameters not only provide insights into an individual’s physiological status but also prove valuable for early disease detection and monitoring the recovery process. It is worth noting that continuous HR monitoring is particularly instrumental in the timely identification and prevention of cardiovascular problems such as arrhythmia and atherosclerosis5.
In the conventional realm of physiological parameter monitoring, electrocardiogram (ECG) and photoplethysmography (PPG) are commonly employed for measurements. Both are contact-based methods and are constrained by the need for sensor or electrode placement, which can lead to discomfort or allergic reactions. Consequently, they are unsuitable for certain scenarios, such as sensitive or burnt skin, as well as newborn monitoring6.
Over the past decade, remote photoplethysmography (rPPG) has emerged as a promising non-invasive method for physiological measurement, using video captured by a camera to measure physiological parameters7,8. The fundamental principle of rPPG relies on the cyclic variations in vascular blood volume induced by cardiovascular activity, which produce periodic fluctuations in the skin’s light absorption and, consequently, subtle color variations in the video. Capturing these periodic color changes enables extraction of the rPPG signal, from which HR information can be derived9.
Unlike traditional methods, rPPG-based physiological measurement eliminates the need for specialized biomedical sensors, relying solely on ordinary cameras. This allows existing camera equipment to be leveraged without installing additional sensors10. The method is not restricted by physical distance, avoids irritation or discomfort to the skin, and reduces detection costs and operational complexity. Therefore, rPPG is particularly suitable for applications such as telemedicine11, monitoring of patients with skin sensitization, neonatal monitoring12, and driver fatigue detection13, among other applications.
However, the color changes in the human face induced by cardiovascular activity are extremely subtle and prone to noise interference, such as variations in lighting and motion artifacts, potentially impacting the accuracy of measurements10. The challenge lies in effectively removing interference noise from the complex signal and successfully extracting the rPPG signal, making it a highly sought-after research area in recent years.
In the early stages of rPPG research, many studies employed hand-crafted methods14,15,16,17,18,19. These methods extracted rPPG signals from color changes in region of interest (ROI) areas, such as the forehead or cheeks, by detecting and tracking faces. Subsequently, noise interference was mitigated through a series of filtering processes, ultimately estimating the average HR through frequency analysis. However, this approach based on hand-crafted methods exhibits significant drawbacks. It often relies on empirical knowledge for selecting ROI areas and may not fully concentrate on the most effective regions. Additionally, these models exhibit limited generalization capabilities and prove ineffective for rPPG extraction in complex environments20.
As with numerous applications in computer vision, deep learning methods show great potential in remote HR measurement. Deep learning techniques relying on rPPG7,21,22 have demonstrated the ability to effectively address the challenges posed by variations in lighting, head movements, and facial expressions, and offer robust HR detection across different scenarios23. However, the accuracy of many existing rPPG approaches remains limited, partly due to their neglect of multi-scale facial feature analysis. This oversight restricts the network’s ability to capture comprehensive information ranging from color variations to facial structural details, thereby compromising both accuracy and stability. Moreover, as deep learning networks progress through layers, the resulting feature maps often contain a large number of channels, where informative signals may be diluted or suppressed by less relevant ones. This further hinders the reliable extraction of rPPG signals.
Inspired by the above considerations, we propose a deep learning-based method CAP-rPPG. This study makes the following contributions:
(1) Multi-scale deep learning architecture: We propose a multi-scale deep learning architecture based on a Gaussian pyramid. This architecture integrates feature maps extracted at different resolutions and incorporates the temporal shift module (TSM) to effectively capture spatiotemporal information, enabling efficient and accurate prediction of rPPG signals.

(2) Channel attention module: To enhance the attention on channels containing crucial rPPG signals, we employ a channel attention module. This module assigns different weights to the deep channels of the deep learning network, directing more attention to channels with a higher concentration of rPPG information.

(3) Hybrid loss function: We introduce a hybrid loss function comprising time, frequency, and negative Pearson correlation losses. These components guide both the short-term and long-term characteristics of the target rPPG signal and the correlation between the predicted rPPG and the ground-truth PPG, providing a comprehensive approach to loss optimization.

(4) Experimental validation: Our experimental results demonstrate that, when compared with current state-of-the-art rPPG algorithms, our proposed CAP-rPPG exhibits outstanding performance on both the UBFC-rPPG and PURE datasets.
Related works
Hand-crafted methods
Verkruysse et al. initiated a groundbreaking exploration, leading to the seminal revelation that facial video data captured by a camera could be analyzed to extract photoplethysmography (PPG) signals closely associated with HR. This pivotal work marked the inception of remote HR measurement, now commonly known as remote photoplethysmography (rPPG), and prompted subsequent research aimed at refining the accuracy and robustness of rPPG extraction, resulting in numerous novel methods and frameworks18.
Since then, numerous rPPG techniques have been proposed. Notable among these are methods based on blind source separation (BSS) or optical reflection modeling of the skin. BSS is a widely adopted technique in signal processing, particularly effective in the analysis and decomposition of physiological signals24,25. Poh et al. achieved HR detection on multiple subjects using RGB color channels from a webcam, employing the independent component analysis (ICA) method based on color frequency bands16. Lewandowska et al. utilized the R and G channels, focused on the forehead as the ROI, and successfully extracted rPPG signals through principal component analysis (PCA), maintaining accuracy comparable to ICA with reduced computational complexity26. Gerard de Haan et al. proposed a chrominance-based method (CHROM), which uses a linear combination of chrominance signals to offset specular reflection components that do not contain rPPG signal information19. Wenjin Wang et al. proposed the plane orthogonal to skin (POS) algorithm, projecting features onto a plane orthogonal to the specular direction. This innovative approach eliminated the specular reflection component, maximizing changes induced by diffuse reflection27.
While these methods have proven effective in certain scenarios for extracting rPPG signals, the selection of ROI areas often relies on empirical knowledge, potentially neglecting the most effective regions. Additionally, many of these models hinge on assumptions about the light reflection model, resulting in poor generalization capabilities.
Deep learning methods
In recent years, rPPG signal extraction methods based on deep learning have emerged in large numbers and have become a major focus of current research. Owing to the flexibility and expressiveness of deep learning, these methods often outperform hand-crafted rPPG methods: they automatically extract richer spatiotemporal features from the input video, greatly improving the accuracy and robustness of the algorithm.
Špetlík et al. introduced a two-step convolutional neural network, HR-CNN, for remote HR estimation. The model shows resilience to variations in illumination and object motion8. Yu et al. proposed an end-to-end rPPG network, PhysNet, that merges the RGB projection into the subspace with the re-projection to the color subspace to achieve rPPG signal recovery20. Song et al. proposed a generative adversarial network, PulseGAN, to generate realistic rPPG pulse signals by denoising the chrominance signal28. Seeking a balance between efficiency and accuracy, Liu et al. developed a one-step neural network architecture, EfficientPhys, which eliminates the need for preprocessing steps in physiological measurements29. Gupta et al. developed RADIANT, a Transformer-based model utilizing signal embeddings to improve rPPG estimation by capturing global context and suppressing local noise30. Zhang et al. proposed a self-supervised learning network capable of estimating rPPG signals from facial videos without labeled data, leveraging the periodicity and finite bandwidth characteristics of physiological signals31. Sun et al. introduced a domain harmonization strategy to resolve domain conflicts, enhancing the generalizability of remote physiological measurements across diverse datasets32. Speth et al. presented a non-contrastive unsupervised learning framework that discovers the blood volume pulse directly from unlabeled videos by encouraging sparse power spectra within normal physiological bandlimits33. Li et al. proposed STFPNet, a simple temporal feature pyramid network that leverages low-frame-rate video features to enhance remote heart rate measurement34.
These methods reveal the broad application potential of deep learning in physiological signal extraction, which can cope with different groups of people, different motion states, and different camera settings. Their performance is generally more stable in different situations, helping to improve the accuracy and robustness of measurements.
However, previous researchers have ignored the importance of multi-scale image information, limiting the accuracy of their models. Our proposed CAP-rPPG is the first deep learning rPPG extraction architecture to use multi-scale image input. We incorporate a channel attention module to address the challenge of effectively attending to useful channels within the deep layers of the network, where numerous channels exist. We also propose a loss function based on the time domain, frequency domain, and correlation, so that the network attends to both the short-term and long-term characteristics of the video during learning. We thoroughly assess the performance of our proposed CAP-rPPG on various datasets.
Methods
We initially introduce our proposed CAP-rPPG network architecture. Additionally, we briefly describe the incorporated modules: the Gaussian pyramid, TSM, face mask module, and channel attention module. Finally, we describe the proposed hybrid loss function. We confirm that all methods were performed in accordance with the relevant guidelines and regulations, and that informed consent was obtained from all participants and/or their legal guardians.
Framework of CAP-rPPG
Before performing computations with the model, the video is first preprocessed using a facial landmark detection technique. Numerous advanced facial landmark detection methods have been proposed, including MTCNN, Dlib, and MediaPipe Face Mesh, among others. These methods are all capable of accurately identifying facial contours based on key landmarks. After comparing their detection accuracy and runtime performance, we ultimately selected MediaPipe Face Mesh. This method not only detects a large number of facial landmarks with high precision but also offers excellent real-time performance. This choice enables efficient separation of facial regions from the background, allowing for fast and accurate face localization in each video frame35,36. Subsequently, each frame is cropped so that only the face region is retained and resized to \(72\times72\) pixels.
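As a sketch of this preprocessing step, MediaPipe Face Mesh and OpenCV can be used to crop and resize each frame; the landmark-derived bounding box without any margin or temporal smoothing is an assumption here, not necessarily the exact cropping rule used in CAP-rPPG.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_face_mesh = mp.solutions.face_mesh

def crop_face_frames(frames, size=72):
    """frames: list of HxWx3 BGR frames; returns an array of size x size face crops."""
    crops = []
    with mp_face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1) as fm:
        for frame in frames:
            h, w, _ = frame.shape
            res = fm.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not res.multi_face_landmarks:
                continue  # skip frames where no face is detected
            pts = res.multi_face_landmarks[0].landmark
            xs = [int(p.x * w) for p in pts]
            ys = [int(p.y * h) for p in pts]
            x0, x1 = max(min(xs), 0), min(max(xs), w)
            y0, y1 = max(min(ys), 0), min(max(ys), h)
            face = frame[y0:y1, x0:x1]           # keep only the face region
            crops.append(cv2.resize(face, (size, size)))
    return np.stack(crops)
```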
The framework of the CAP-rPPG is shown in Fig. 1. To utilize the information contained in video images at different resolutions, we have designed a multi-scale deep learning network based on the Gaussian pyramid. Prior to entering the network for feature extraction, video image inputs at all scales undergo normalization through a layer comprising a difference layer and a batch normalization layer. The video image input from the first layer of the Gaussian pyramid, with a size of \(72\times72\), serves as the backbone of the network. Following normalization, a sequence of operations, including TSM, 2D convolution, and the face mask module, is executed within the network. Subsequently, a maximum pooling layer is employed to reduce its size to \(36\times36\), facilitating integration with other feature maps. The video image inputs for the second and third layers of the Gaussian pyramid are sized \(36\times36\) and \(18\times18\), respectively. After individual operations of TSM, 2D convolution, and the face mask module, these inputs are concatenated with the feature map extracted from the network backbone, ensuring comprehensive multi-scale feature fusion.
Given the depth of the network, Dropout layers are strategically incorporated to mitigate overfitting. Finally, the channel attention module is employed to assign distinct weights to various channels, followed by the utilization of a fully connected layer for feature extraction. The network outputs the rPPG signal corresponding to the video.
Gaussian pyramid
The Gaussian pyramid serves as a multi-scale representation of signals, involving the repeated application of Gaussian blurring and down-sampling to the same signal or image. This process generates multiple sets of signals or images at various scales, facilitating subsequent processing37.
When extracting rPPG signals, we focus on the periodic variations in skin color caused by blood pulsations18. Therefore, macroscopic changes in skin color are more important than detailed information about a face, such as contours or facial features.
To better capture these macroscopic color changes, we employ Gaussian pyramid downsampling. Progressive downsampling suppresses fine details and emphasizes the macroscopic changes in the image, which is exactly what rPPG signal extraction requires.
The construction process of the Gaussian pyramid is shown in Fig. 2, and its mathematical definition is given below.
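For reference, a standard Gaussian pyramid formulation consistent with the symbols defined in the next paragraph is shown below; the separable 5 × 5 kernel weights in Eq. (2) are the conventional choice and an assumption here.

$$G_{k}(i,j)=\sum_{m=-2}^{2}\sum_{n=-2}^{2}\omega(m,n)\,G_{k-1}(2i+m,\,2j+n) \qquad (1)$$

$$\omega(m,n)=w(m)\,w(n),\qquad w=\frac{1}{16}\left[1,\;4,\;6,\;4,\;1\right] \qquad (2)$$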
In Eq. (1), \(G_{k}\) represents the downsampled image of the pyramid at layer \(k\), \(i\) and \(j\) represent the row and column indices of the current layer image, respectively, and \(\omega(m,n)\) is the Gaussian convolution kernel defined in Eq. (2). In constructing the Gaussian pyramid, we adopt a 5 × 5 Gaussian kernel to balance filtering effectiveness and computational efficiency. Compared with smaller kernels, it provides stronger anti-aliasing smoothing, while remaining lightweight enough for efficient multi-scale processing. The filter weights follow a discrete approximation of the 2D Gaussian distribution, ensuring smooth downsampling transitions between pyramid levels.
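A minimal sketch of the three-level pyramid used as network input, assuming OpenCV's pyrDown (which applies a 5 × 5 Gaussian blur before 2× downsampling):

```python
import cv2

def gaussian_pyramid(frame, levels=3):
    """Build a Gaussian pyramid, e.g. [72x72, 36x36, 18x18] for a 72x72 face crop."""
    pyramid = [frame]
    for _ in range(levels - 1):
        # pyrDown applies a 5x5 Gaussian blur, then drops every other row and column
        frame = cv2.pyrDown(frame)
        pyramid.append(frame)
    return pyramid
```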
As can be seen from Fig. 2, the Gaussian pyramid retains the spatial low-frequency information in the image. As the image resolution decreases, spatial high-frequency information, such as edge information, is gradually lost.
Temporal shift module
Traditional 2D-CNNs exhibit low computational cost but fall short in capturing temporal relationships, while methods relying on 3D-CNNs boast strong performance at the expense of increased computational demands and higher deployment costs. To strike a balance and better capture spatiotemporal information in the video, we incorporate the temporal shift module (TSM) into the network38.
TSM first partitions the input tensor along the channel dimension into three blocks: the first block is shifted forward in time by one frame, the second block is shifted backward by one frame, and the third block remains unchanged. Frames shifted beyond the clip boundary are truncated, and the vacated positions are filled with 0. All shift operations are conducted along the time axis29. This shifting endows the current frame with information from both the preceding and succeeding frames. Consequently, a two-dimensional convolution can directly extract spatiotemporal information from the video, akin to three-dimensional convolution, thereby enhancing the model’s temporal modeling capability. The operating principle of the TSM can be seen in Fig. 3(a).
The TSM can be seamlessly integrated into a two-dimensional CNN. This innovative approach achieves the performance levels of a 3D-CNN while preserving the computational simplicity of a 2D-CNN.
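A minimal PyTorch sketch of the shift operation described above; the split ratio (one third of the channels per shifted block) follows the original TSM design and is assumed here rather than taken from the CAP-rPPG specification.

```python
import torch

def temporal_shift(x, n_frames, fold_div=3):
    """x: (N*T, C, H, W) feature tensor for clips of n_frames frames."""
    nt, c, h, w = x.shape
    x = x.view(nt // n_frames, n_frames, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)                               # vacated positions stay zero
    out[:, :-1, :fold] = x[:, 1:, :fold]                    # first block: data moves to earlier frames
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]    # second block: data moves to later frames
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # remaining channels: unchanged
    return out.view(nt, c, h, w)
```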
Face mask module
To mitigate the adverse impact of motion and lighting noise, we introduce a face mask module into the network. The face mask module functions as a soft attention layer, realized through \(1\times1\) convolution operations and a sigmoid activation function. First, the feature map is convolved with a kernel of size \(1\times1\) so that the number of channels becomes 1, and the result is then activated using the sigmoid function. This mechanism is analogous to the attention score maps in conventional attention frameworks, enabling the network to learn spatial importance weights for each location. Each element is normalized to ensure that the overall sum of the attention map remains constant, preventing gradient instability caused by excessively large values in certain regions. The result is a normalized mask that preserves the same spatial dimensions as the input. The resulting normalized face mask is then multiplied with the output of the tensor shift convolution29. The face mask adaptively focuses on more stable and pulse-rich regions of the face, such as the cheeks and forehead, guiding the subsequent network to attend to informative spatial areas and thereby improving the accuracy of rPPG signal prediction.
The face mask module is designed as a flexible and pluggable component, seamlessly integrating into any part of the network without altering its overall structure. This incorporation enhances the network’s data extraction capabilities, ensuring a more robust performance by effectively minimizing the influence of motion and lighting noise. The working principle of the face mask module can be seen in Fig. 3(b).
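A sketch of such an attention mask, following the soft-attention normalization used in prior attention-mask work29; the exact normalization constant in CAP-rPPG is an assumption here.

```python
import torch
import torch.nn as nn

class FaceMask(nn.Module):
    """1x1 conv -> sigmoid -> L1 normalization, applied as a spatial attention mask."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, x):
        # x: (N, C, H, W)
        mask = torch.sigmoid(self.conv(x))
        n, _, h, w = mask.shape
        # normalize so the mask sums to a fixed constant (H*W/2 here, an assumed choice)
        norm = 2 * mask.reshape(n, 1, -1).sum(dim=2).reshape(n, 1, 1, 1)
        mask = h * w * mask / norm
        return x * mask
```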
Channel attention module
Addressing the challenge posed by the multitude of channels in deep neural networks, where effectively attending to channels containing crucial information can be intricate, we introduce the channel attention module. This module empowers the network to intelligently prioritize channels rich in valuable information by assigning distinct weights to each channel. This innovative approach allows the neural network to autonomously learn and reinforce channels pivotal to a specific task, thereby augmenting its capability to extract pertinent information, minimizing focus on irrelevant details, and enhancing the model’s overall performance and efficiency.
The channel attention module operates on the principle of dynamically learning weights to adjust the importance of each channel (feature map), amplifying useful features while diminishing extraneous ones39,40. This dynamic adjustment significantly improves the model’s performance and generalization ability. The module unfolds in two steps:
(1) Squeeze: In this initial step, the channel attention module condenses the features of each channel through global average pooling and global max pooling operations. The channel features produced by these pooling operations are then fused, generating a vector containing the information of each channel.

(2) Excitation: The subsequent step involves the channel attention module learning a weight vector to reweight the feature map of each channel, representing the importance of each channel. To achieve this, a sigmoid function is employed for activation. Given the typical prevalence of numerous channels in the deep layers of deep learning networks, two fully connected layers, flanking a nonlinear ReLU function, are incorporated to control parameter complexity. This design ensures efficient parameter management while preserving the module’s effectiveness in emphasizing critical features and suppressing less relevant ones.
The channel attention module, illustrated in Fig. 4, introduces noteworthy performance enhancements to the deep learning model structure with minimal additional computational cost.
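A compact PyTorch sketch of this two-step module: the squeeze uses both global average and global max pooling as described above, and the bottleneck reduction ratio of 16 is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # two fully connected layers around a ReLU control parameter complexity
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # x: (N, C, H, W)
        n, c, _, _ = x.shape
        avg = self.fc(x.mean(dim=(2, 3)))                    # squeeze via global average pooling
        mx = self.fc(x.amax(dim=(2, 3)))                     # squeeze via global max pooling
        weights = torch.sigmoid(avg + mx).view(n, c, 1, 1)   # excitation: per-channel weights
        return x * weights
```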
Loss function
When supervising a deep rPPG model, different loss functions impose different constraints on the model. Many researchers use only a single loss function as a constraint, which limits the model to attending to only one type of feature. Based on this, we propose a hybrid loss function called the time frequency correlation (TFC) loss, which considers the short-term and long-term characteristics of the signal as well as the correlation between the predicted and ground-truth values, as follows:
Mean square error (MSE) loss, a loss function frequently employed in rPPG waveform extraction methods, presents a clear and well-defined optimization objective29. The model aims to minimize MSE, striving to make the predicted rPPG waveform closely match the actual PPG waveform. Notably, MSE is a differentiable loss function, facilitating direct use of the backpropagation algorithm for gradient calculation and thus enhancing the efficiency of the optimization process. By emphasizing the square of the error, MSE is more sensitive to larger errors, enabling the model to concentrate on rectifying significant errors while remaining resilient to minor local errors.
In Eq. (4), \(x\) represents the predicted rPPG signal, and \(y\) represents the ground-truth PPG. MSE loss mainly imposes instantaneous constraints in the time domain; using only a time-domain loss would result in a lack of control over global characteristics, whereas a frequency-domain loss tends to guide the overall periodic characteristics of the signal. Combining the two exploits both long-term and short-term characteristics, and the joint use of time- and frequency-domain losses provides more effective guidance41. Imposing a frequency bandwidth limit is a potent constraint for the model. Previous unsupervised approaches have employed the irrelevant power ratio (IPR) as a validation metric23,42,43, and we observed its efficacy in model training as well. The IPR penalizes the model for generating signal power outside the specified bandwidth44. With lower and upper band limits l and u, the power spectral density (PSD) loss is defined as below.
In Eq. (5), F represents the frequency-domain form of the predicted rPPG signal. Negative Pearson loss is usually used to measure the linear correlation between predicted values and true values29. Unlike MSE, the Pearson correlation coefficient is relatively less affected by outliers, meaning that if there are some outliers in the data, negative Pearson loss may be more resistant to their impact.
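A sketch of the hybrid TFC loss under stated assumptions: the three terms are summed with equal weights (the actual weighting is not specified here), the sampling rate is 30 fps, and the pass band follows the 0.75–2.5 Hz range used later for HR estimation.

```python
import torch

def tfc_loss(pred, gt, fs=30.0, low=0.75, high=2.5):
    """pred, gt: (N, T) predicted rPPG and ground-truth PPG clips."""
    # time-domain term: MSE between waveforms
    l_time = torch.mean((pred - gt) ** 2)

    # frequency-domain term: irrelevant power ratio (power outside [low, high] Hz)
    psd = torch.abs(torch.fft.rfft(pred, dim=-1)) ** 2
    freqs = torch.fft.rfftfreq(pred.shape[-1], d=1.0 / fs).to(pred.device)
    out_of_band = (freqs < low) | (freqs > high)
    l_freq = torch.mean(psd[:, out_of_band].sum(dim=-1) / (psd.sum(dim=-1) + 1e-8))

    # correlation term: negative Pearson correlation with the ground truth
    p = pred - pred.mean(dim=-1, keepdim=True)
    g = gt - gt.mean(dim=-1, keepdim=True)
    rho = (p * g).sum(-1) / (torch.sqrt((p ** 2).sum(-1) * (g ** 2).sum(-1)) + 1e-8)
    l_corr = torch.mean(1.0 - rho)

    return l_time + l_freq + l_corr  # equal weighting is an assumption
```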
Results
Initially, two benchmark datasets are introduced, followed by a description of the experimental implementation details and performance metrics. Subsequently, we compare the proposed method with previous methods and provide a visualization of the experimental results, while the efficacy of each component within the proposed method is evaluated through ablation experiments.
Dataset
UBFC-rPPG45 was created utilizing a C++ application for video capture, employing an affordable webcam (Logitech C920 HD Pro) operating at 30 fps and a resolution of \(640\times480\). Ground-truth PPG data, consisting of the PPG waveform, was obtained using a CMS50E transmissive pulse oximeter. During the data collection, the subject sat approximately 1 m away from the camera, ensuring their face was visible. A total of 42 segments of data are available.
PURE46 consists of 10 persons performing different, controlled head motions in front of a camera. The ten persons were recorded in six different setups, resulting in a total of 60 sequences. The videos were captured using an eco274CVGE camera by SVS-Vistek GmbH at a frame rate of 30 Hz with a cropped resolution of \(640\times480\) pixels. Concurrently, ground-truth data were collected using a pulse oximeter (pulox CMS50E). The test subjects were positioned in front of the camera at an average distance of 1.1 m. The six setups were: steady, talking, slow translation, fast translation, small rotation, and medium rotation.
Implementation details and metrics
For training, each resized training video was divided into non-overlapping 6-s (180-frame) clips. During intra-dataset testing, dividing the dataset resulted in a low number of testing videos, so we followed43 and divided each testing video into non-overlapping 30-s (900-frame) clips and calculated HR for each clip. During cross-dataset testing, we followed29 and conducted video-level evaluation, calculating an averaged HR for each testing video.
UBFC-rPPG: Based on the criteria of previous research28,43, we divided the 42 video sets into two subsets, containing 30 and 12 videos for training and testing respectively, without using any data augmentation methods.
PURE: Based on the standards of previous research8,43, we divided the 10 subjects into two subsets, containing 6 and 4 subjects for training and testing respectively, without using any data augmentation methods.
Our algorithm is implemented in PyTorch and trained on an NVIDIA RTX 4090 GPU. We use the AdamW optimizer instead of the Adam optimizer to train the models. All models are trained for 10 epochs with a learning rate of 0.001. The length of each video clip is set to 180 frames, and all video clips used are non-overlapping.
To validate our proposed method, we use widely recognized performance evaluation metrics to assess the performance of the model43. The evaluation criteria included mean absolute error (MAE), root mean square error (RMSE), and the Pearson correlation coefficient (ρ) between ground-truth and predicted HR.
MAE is a performance metric for evaluating prediction models, measuring the average absolute difference between predicted HR and ground-truth HR.
RMSE is a metric used to assess the performance of prediction models, representing the square root of the average squared difference between predicted HR and ground-truth HR.
\(\rho\) quantifies the linear relationship between predicted HR and ground-truth HR, ranging between −1 and 1, with 0 indicating no linear correlation.
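These metrics follow their standard definitions; with the notation introduced in the following paragraph they can be written as:

$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|HR_{i}-HR_{i}^{*}\right| \qquad (7)$$

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(HR_{i}-HR_{i}^{*}\right)^{2}} \qquad (8)$$

$$\rho=\frac{\sum_{i=1}^{N}\left(HR_{i}-\overline{HR}\right)\left(HR_{i}^{*}-\overline{HR^{*}}\right)}{\sqrt{\sum_{i=1}^{N}\left(HR_{i}-\overline{HR}\right)^{2}\sum_{i=1}^{N}\left(HR_{i}^{*}-\overline{HR^{*}}\right)^{2}}} \qquad (9)$$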
In Eqs. (7), (8), and (9), the predicted HR is denoted as \(HR_{i}\) and the ground-truth HR as \(HR_{i}^{*}\); \(\overline{(\cdot)}\) represents the mean operator, and \(N\) represents the total count of HR estimates. The predicted HR was derived by identifying the dominant frequency in the PSD of the predicted rPPG signal. We applied a bandpass filter with cutoff frequencies of 0.75 and 2.5 Hz to the predicted rPPG signal before computing HR29. Subsequently, an FFT is used to estimate HR values from each video. The ground-truth HR corresponds to the data collected by the contact oximeter sensor. We use the rPPG-toolbox47 to help with model evaluation.
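A sketch of this HR estimation step, assuming a 30 fps rPPG clip and a second-order Butterworth bandpass filter (the filter order is an assumption):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_hr(rppg, fs=30.0, low=0.75, high=2.5):
    """Estimate HR (bpm) from an rPPG clip via bandpass filtering and an FFT peak."""
    b, a = butter(2, [low / (fs / 2), high / (fs / 2)], btype="bandpass")
    filtered = filtfilt(b, a, rppg)
    freqs = np.fft.rfftfreq(len(filtered), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(filtered)) ** 2
    band = (freqs >= low) & (freqs <= high)
    dominant = freqs[band][np.argmax(psd[band])]  # dominant frequency within the pass band
    return dominant * 60.0                        # convert Hz to beats per minute
```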
Intra-dataset HR evaluation
We conducted intra-dataset testing of HR estimation on UBFC-rPPG and PURE, comparing our method with hand-crafted methods such as Green18, POS27, 2SR48, and CHROM19. Additionally, we compared it with deep learning methods such as PhysNet20, HR-CNN8, SynRhythm49, RADIANT30, and others. We followed the evaluation protocols of PulseGAN28 and Contrast-Phys43, adopting the same dataset partitioning strategy as used in their experiments. The performance results, including hand-crafted and deep learning methods, are presented in Table 1. Figure 5 uses bar charts to compare the performance of the different methods on UBFC-rPPG and PURE more visually.
In Table 1, the lowest MAE, lowest RMSE, and highest ρ are underscored, while the performance of our proposed method is highlighted in bold. For UBFC-rPPG, apart from SynRhythm, deep learning-based methods outperform all hand-crafted methods, showcasing the ability of deep learning approaches to learn more informative characteristics for remote HR estimation.
The proposed method demonstrates outstanding performance on both the UBFC-rPPG and PURE datasets, achieving MAEs of 0.43 bpm and 0.30 bpm, respectively, without the use of any data augmentation. Although STFPNet achieves a slightly lower MAE than our proposed method on the UBFC-rPPG dataset (by 0.02 bpm), its RMSE is notably higher by 0.17 bpm. This indicates that STFPNet produces more outliers and exhibits lower stability. In contrast, our method demonstrates greater robustness, as further evidenced by the results on the PURE dataset, where it outperforms STFPNet by a significant margin in both MAE and RMSE. This observation suggests that our method’s potential for greater stability and generality may result in consistently reliable performance across various real-world scenarios.
Cross-dataset HR evaluation
Since the environment and lighting conditions of samples within the same dataset are mostly similar, and in order to make a fair comparison with state-of-the-art methods, we followed the protocol of PulseGAN28 for cross-dataset testing: the model was trained on PURE only and tested on UBFC-rPPG. The performance of various methods in the cross-dataset HR evaluation is shown in Table 2. As with the intra-dataset results, deep learning-based methods perform significantly better than hand-crafted methods. Our proposed method outperforms all other methods in the cross-dataset setting, which indicates that it generalizes well to new datasets and avoids the overfitting that may result from training and testing on the same dataset.
Multiple illumination scenarios evaluation
Since most of the current datasets only consider the volunteers’ movement variations but not variations in the environment’s illumination, they cannot effectively test the robustness of algorithms under different illumination conditions. For this reason, we propose the multiple illumination scenarios (MIS) dataset, which collects data under three different illumination conditions: normal illumination, strong illumination, and weak illumination. The specific illumination conditions can be seen in Fig. 6. Natural illumination was used as normal illumination, a halogen lamp 30 cm from the face was used as the strong illumination source, and a light shield was fitted to the camera lens to simulate weak illumination. For each illumination condition, the volunteers were recorded separately in stationary and moving states. In the stationary condition, the volunteer was required to remain seated without moving, while in the motion condition, the volunteer could turn his/her head, talk, laugh, and so on at will. In total, 10 volunteers participated in the data recording process. A GoPro Hero11 was used to capture the video and a Contec CMS50E to capture the PPG signal. Videos were recorded at \(1920\times1080\) resolution and 60 fps.
We use the model trained on PURE for cross-dataset evaluation on the MIS dataset. For this evaluation, we used whole-video assessment, computing one overall heart rate estimate for each 1-minute video. To remain consistent with the previous data validation approach, we downsampled both the video and PPG signals to 30 fps during the preprocessing stage. The results show that, over a total of 60 recordings, our proposed model achieves an MAE of 2.83 bpm, an RMSE of 4.95 bpm, and a mean absolute percentage error (MAPE) of 3.62%. A MAPE within ±5% is considered tolerable according to the standards of the American National Standards Institute (ANSI)50. Our model meets the ANSI requirement on MAPE, which shows that it retains high accuracy under multiple complex lighting conditions and once again demonstrates its robustness.
Ablation study
We conduct an ablation study on our method by performing HR estimation on models trained on PURE, tested on UBFC-rPPG. We introduce ablation studies on the following modules: (1) Gaussian pyramid; (2) Channel attention; (3) TFC loss; (4) Face mask. The results are shown in the Table 3.
When using Gaussian pyramid alone, the MAE value dropped by 0.12 bpm and the detection accuracy increased by 7%. When using TFC loss alone, the MAE value dropped by 0.08 bpm and the detection accuracy increased by 5%. When TFC loss is used with the channel attention module, the MAE value drops from 1.74 bpm to 1.43 bpm, and the detection accuracy increases by 18%; when TFC loss is used with Gaussian pyramid and channel attention module, the MAE value is reduced from 1.74 bpm to 1.15 bpm, and the detection accuracy jumps by 34%. When the face mask is removed from the network, the RMSE increases from 2.97 bpm to 2.99 bpm, and the Pearson correlation coefficient (ρ) drops from 0.99 to 0.98, indicating a rise in both the number and magnitude of outliers in the predictions. The progressive improvement in performance with the addition of more modules indicates that each of the proposed components is effective for rPPG signal extraction.
Computational cost evaluation
For rPPG signal extraction tasks, low computational overhead is essential to ensure fast model responsiveness, which is critical for practical deployment. To further assess the computational efficiency of our proposed method, we conducted a comparison between CAP-rPPG and several representative benchmark models. The results are summarized in Table 4.
As illustrated in Table 4, our method requires fewer parameters than both DeepPhys and EfficientPhys, making it more suitable for deployment on real-world, resource-constrained devices. While PhysNet exhibits a relatively small model size, its heart rate estimation MAE on the PURE dataset reaches 2.10 bpm—substantially higher than that of our method. This suggests that PhysNet may have limited feature representation capability, resulting in reduced prediction accuracy.
Furthermore, in the cross-dataset evaluation on UBFC-rPPG, our method achieves an average inference time of 0.76 s per preprocessed video, which is a highly encouraging result. These findings demonstrate that our approach can achieve both high prediction accuracy and efficient inference, without incurring significant increases in model complexity. This enhances the practicality and scalability of CAP-rPPG for deployment on devices with limited computational resources.
Visualization
In Fig. 7., we provide visualizations for both the predicted rPPG signals and the ground-truth rPPG signals along with their PSDs of CAP-rPPG, extracted from two video clips in UBFC-rPPG and PURE. The remarkable resemblance between the predicted and ground-truth rPPG signals, as well as their corresponding PSDs, underscores the model’s ability to accurately capture and reproduce the physiological signals, further validating the robustness of our approach.
Figure 8 illustrates scatter plots of the predicted HR against the ground-truth HR for the CAP-rPPG test data on UBFC-rPPG and PURE, respectively. In these plots, the x-axis represents the ground-truth HR, while the y-axis represents the predicted HR. A clear alignment of the scatter points with the y = x line is observed, and this alignment persists across both low and high HR. Achieving this alignment is not easy for deep learning methods because the data distribution is imbalanced: the HRs in the training data may fall predominantly within a specific range, leading to overfitting on that range. Nonetheless, our proposed method successfully predicts diverse HR distributions without additional augmentation of HR data, effectively preventing overfitting.
Figure 9 shows the Bland-Altman consistency analysis of CAP-rPPG. The red line in the figure represents the mean error between the predicted HR and the ground-truth HR. The two dashed blue lines represent the 95% limits of agreement \([\mu-1.96\sigma,\;\mu+1.96\sigma]\), and only the points within these boundaries are deemed highly reliable. The result reveals that the majority of the HR measurements obtained by the proposed method fall within these limits, suggesting high consistency with the ground-truth values.
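For reference, a small sketch of the Bland-Altman quantities plotted in Fig. 9 (mean difference and 95% limits of agreement):

```python
import numpy as np

def bland_altman_limits(pred_hr, gt_hr):
    """Return the mean difference and 95% limits of agreement between predicted and ground-truth HR."""
    diff = np.asarray(pred_hr) - np.asarray(gt_hr)
    mu, sigma = diff.mean(), diff.std()
    return mu, (mu - 1.96 * sigma, mu + 1.96 * sigma)
```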
Discussion
Since rPPG signals are very weak and noisy, rPPG-based remote physiological measurement is challenging. This paper studies the estimation of rPPG using deep learning models, proposes a new network structure, CAP-rPPG, based on the Gaussian pyramid, uses the channel attention module to focus the network on channels containing useful information, and proposes a hybrid loss function based on the time domain, frequency domain, and correlation, allowing the model to learn short-term features, long-term features, and correlations to ultimately improve physiological measurement accuracy. The proposed CAP-rPPG achieves accurate measurement of HR on the UBFC-rPPG and PURE datasets, significantly outperforming previous hand-crafted methods and performing comparably to the current state-of-the-art models. Moreover, the performance of the model was also validated on the MIS dataset under a variety of lighting conditions, meeting the accuracy requirements specified by ANSI and once again validating the robustness of the model. The implications of this research extend beyond the accurate measurement of HR. The envisioned trajectory involves pushing the boundaries of rPPG applications to encompass broader physiological parameters, including but not limited to blood pressure and respiratory rate. This work lays the foundation for a transformative approach to healthcare monitoring.
Data availability
The UBFC-rPPG dataset45 used during the study can be obtained from the official website of UBFC-rPPG at https://sites.google.com/view/ybenezeth/ubfcrppg. The PURE dataset46 used during the study can be obtained from the official website of PURE at https://www.tu-ilmenau.de/universitaet/fakultaeten/fakultaet-informatik-und-automatisierung/profil/institute-und-fachgebiete/institut-fuer-technische-informatik-und-ingenieurinformatik/fachgebiet-neuroinformatik-und-kognitive-robotik/data-sets-code/pulse-rate-detection-dataset-pure. The MIS dataset generated and analyzed during the study are not currently publicly available. Requests for access to MIS dataset should be directed to the corresponding author.
References
Cheng, C. H., Wong, K. L., Chin, J. W., Chan, T. T. & So, R. H. Y. Deep learning methods for remote heart rate measurement: A review and future research agenda. Sensors 21, 6296 (2021).
Jeong, I. C. & Finkelstein, J. Introducing contactless blood pressure assessment using a high speed video camera. J. Med. Syst. 40, 77 (2016).
Kim, N. H., Yu, S. G., Kim, S. E. & Lee, E. C. Non-contact oxygen saturation measurement using YCgCr color space with an RGB camera. Sensors 21, 6120 (2021).
Iozzia, L., Cerina, L. & Mainardi, L. Relationships between heart-rate variability and pulse-rate variability obtained from video-PPG signal using ZCA. Physiol. Meas. 37, 1934–1944 (2016).
Scalise, L. Non contact heart monitoring. In Advances in Electrocardiograms - Methods and Analysis (ed. Millis, R.) InTech. https://doi.org/10.5772/22937 (2012).
Gupta, A., Ravelo-García, A. G. & Dias, F. M. Availability and performance of face based non-contact methods for heart rate and oxygen saturation estimations: A systematic review. Comput. Methods Prog. Biomed. 219, 106771 (2022).
Lee, E., Chen, E., & Lee, C. Y. Meta-rPPG: Remote heart rate Estimation using a transductive meta-learner. In Computer Vision – ECCV 2020 (eds Vedaldi, A., Bischof, H., Brox, T. & Frahm, J. M.). Springer International Publishing, 392–409 (2020).
Špetlík, R., Franc, V. & Matas, J. Visual heart rate estimation with convolutional neural network. In Proceedings of the British Machine Vision Conference, 3–6 (2018).
Lu, H., Han, H., & Zhou, S. K. Dual -GAN: Joint bvp and noise modeling for remote physiological measurement. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 12399–12408 https://doi.org/10.1109/CVPR46437.2021.01222 (2021).
Wang, W., Wei, Z., Yuan, J., Fang, Y. & Zheng, Y. Non-contact heart rate Estimation based on singular spectrum component reconstruction using low-rank matrix and autocorrelation. PLoS ONE. 17, e0275544 (2022).
Chen, X. et al. Video-based heart rate measurement: recent advances and future prospects. IEEE Trans. Instrum. Meas. 68, 3600–3615 (2019).
Huang, B. et al. A neonatal dataset and benchmark for non-contact neonatal heart rate monitoring based on spatio-temporal neural networks. Eng. Appl. Artif. Intell. 106, 104447 (2021).
Huang, P. W., Wu, B. J. & Wu, B. F. A heart rate monitoring framework for real-world drivers using remote photoplethysmography. IEEE J. Biomed. Health Inf. 25, 1397–1408 (2021).
Poh, M. Z., McDuff, D. J. & Picard, R. W. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Trans. Biomed. Eng. 58, 7–11 (2011).
Balakrishnan, G., Durand, F. & Guttag, J. Detecting pulse from head motions in video. In 2013 IEEE Conference on Computer Vision and Pattern Recognition IEEE, 3430–3437 https://doi.org/10.1109/CVPR.2013.440 (2013).
Poh, M. Z., McDuff, D. J. & Picard, R. W. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Opt. Express. 18, 10762 (2010).
Li, X., Chen, J., Zhao, G. & Pietikainen, M. Remote heart rate measurement from face videos under realistic situations. In IEEE Conference on Computer Vision and Pattern Recognition IEEE, 4264–4271 https://doi.org/10.1109/CVPR.2014.543 (2014).
Verkruysse, W., Svaasand, L. O. & Nelson, J. S. Remote plethysmographic imaging using ambient light. Opt. Express. 16, 21434 (2008).
De Haan, G. & Jeanne, V. Robust pulse rate from chrominance-based rPPG. IEEE Trans. Biomed. Eng. 60, 2878–2886 (2013).
Yu, Z., Li, X. & Zhao, G. Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks. In 30th British Machine Vision Conference (BMVC) (2019).
Chen, W., & McDuff, D. DeepPhys: Video-based physiological measurement using convolutional attention networks. In Computer Vision – ECCV 2018 (eds Ferrari, V., Hebert, M., Sminchisescu, C. & Weiss, Y.). Springer International Publishing, 356–373 (2018).
Niu, X. et al. Video-based remote physiological measurement via cross-verified feature disentangling. In Computer Vision – ECCV 2020 (eds Vedaldi, A., Bischof, H., Brox, T. & Frahm, J. M.). Springer International Publishing, 295–310 (2020).
Gideon, J. & Stent, S. The way to my heart is through contrastive learning: Remote photoplethysmography from unlabelled video. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) IEEE, 3975–3984. https://doi.org/10.1109/ICCV48922.2021.00396 (2021).
Gorur, K., Olmez, E., Ozer, Z. & Cetin, O. EEG-Driven biometric authentication for investigation of fourier synchrosqueezed Transform-ICA robust framework. Arab. J. Sci. Eng. 48, 10901–10923 (2023).
Gorur, K. Fourier synchrosqueezing Transform-ICA-EMD framework based EOG-Biometric sustainable and continuous authentication via voluntary eye blinking activities. Biomimetics 8, 378 (2023).
Lewandowska, M., Rumiński, J., Kocejko, T. & Nowak, J. Measuring pulse rate with a webcam – a non-contact method for evaluating cardiac activity. In Federated Conference on Computer Science and Information Systems (FedCSIS), 405–410 (2011).
Wang, W., den Brinker, A. C., Stuijk, S. & de Haan, G. Algorithmic principles of remote PPG. IEEE Trans. Biomed. Eng. 64, 1479–1491 (2017).
Song, R. et al. Pulsegan: learning to generate realistic pulse waveforms in remote photoplethysmography. IEEE J. Biomed. Health Inf. 25, 1373–1384 (2021).
Liu, X. et al. EfficientPhys: Enabling simple, fast and accurate camera-based cardiac measurement. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) IEEE, 4997–5006 https://doi.org/10.1109/WACV56688.2023.00498 (2023).
Gupta, A. K., Kumar, R., Birla, L. & Gupta, P. RADIANT: Better rPPG estimation using signal embeddings and Transformer. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 4965–4975 https://doi.org/10.1109/WACV56688.2023.00495 (2023).
Zhang, N., Sun, H. M., Ma, J. R. & Jia R.-S. A self-supervised learning network for remote heart rate measurement. Measurement 228, 114379 (2024).
Sun, W. et al. Resolve Domain Conflicts for Generalizable Remote Physiological Measurement. In Proceedings of the 31st ACM International Conference on Multimedia ACM, 8214–8224 https://doi.org/10.1145/3581783.3612265 (2023).
Speth, J., Vance, N., Flynn, P., & Czajka, A. Non-Contrastive Unsupervised Learning of Physiological Signals from Video. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) IEEE, 14464–14474 https://doi.org/10.1109/CVPR52729.2023.01390 (2023).
Li, Z. et al. STFPNet: A simple Temporal feature pyramid network for remote heart rate measurement. Measurement 252, 117287 (2025).
Selvaraju, V. et al. Continuous monitoring of vital signs using cameras: A systematic review. Sensors 22, 4097 (2022).
Warnecke, J. M., Lasenby, J. & Deserno, T. M. Robust in-vehicle heartbeat detection using multimodal signal fusion. Sci. Rep. 13, 20864 (2023).
Babaud, J., Witkin, A. P., Baudin, M. & Duda, R. O. Uniqueness of the Gaussian kernel for scale-space filtering. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-8, 26–33 (1986).
Lin, J., Gan, C. & Han, S. TSM: Temporal shift module for efficient video understanding. In IEEE/CVF International Conference on Computer Vision (ICCV) IEEE, 7082–7092 https://doi.org/10.1109/ICCV.2019.00718 (2019).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition IEEE, 7132–7141 https://doi.org/10.1109/CVPR.2018.00745 (2018).
Woo, S., Park, J., Lee, J. Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), 3–19 (2018).
Lee, J. S., Hwang, G., Ryu, M. & Lee, S. J. LSTC-rPPG: Long short-term convolutional network for remote photoplethysmography. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) IEEE, Vancouver, 6015–6023 https://doi.org/10.1109/CVPRW59228.2023.00640 (2023).
Gideon, J. & Stent, S. Estimating heart rate from unlabelled video. In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) IEEE, 2743–2749. https://doi.org/10.1109/ICCVW54120.2021.00307 (2021).
Sun, Z., & Li, X. Contrast-Phys: Unsupervised video-based remote physiological measurement via Spatiotemporal contrast. In Computer Vision – ECCV 2022 (eds Avidan, S., Brostow, G., Cissé, M., Farinella, G. M. & Hassner, T.). Springer, 492–510 (2022).
Speth, J., Vance, N., Flynn, P. & Czajka, A. Non-contrastive unsupervised learning of physiological signals from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14464–14474 (2023).
Bobbia, S., Macwan, R., Benezeth, Y., Mansouri, A. & Dubois, J. Unsupervised skin tissue segmentation for remote photoplethysmography. Pattern Recognit. Lett. 124, 82–90 (2019).
Stricker, R., Muller, S. & Gross, H. M. Non-contact video-based pulse rate measurement on a mobile service robot. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication IEEE, 1056–1062 https://doi.org/10.1109/ROMAN.2014.6926392 (2014).
Liu, X. et al. rPPG-toolbox: Deep remote PPG toolbox. Adv. Neural Inform. Process. Syst. 36 (2024).
De Haan, G. & Van Leest, A. Improved motion robustness of remote-PPG by using the blood volume pulse signature. Physiol. Meas. 35, 1913–1926 (2014).
Niu, X., Han, H., Shan, S., & Chen, X. SynRhythm: Learning a deep heart rate estimator from general to specific. In 24th International Conference on Pattern Recognition (ICPR) IEEE, 3580–3585. https://doi.org/10.1109/ICPR.2018.8546321 (2018).
Consumer Technology Association. Physical activity monitoring for heart rate, ANSI/CTA-2065 (2018).
Acknowledgements
This research was supported in part by the National Natural Science Foundation of China (No. 61405028 and 82073833) and the Fundamental Research Funds for the Central Universities (University of Electronic Science and Technology of China) (No. ZYGX2015J046, ZYGX2016J069, and ZYGX2019J053). The authors would like to thank Professor Yutang Ye and all the staff at MOEMIL Laboratory for their help in this study.
Author information
Authors and Affiliations
Contributions
J.Z., H.S., and Y.H. designed and implemented the experiments and drafted the initial manuscript, while G.Z. and F.L. conducted the data analysis. B.Y. and J.P. contributed to the development of the codebase and its optimization. X.D. and J.L. conceptualized the research and oversaw its execution. R.H. and X.W. provided critical insights during experiment design and offered necessary resources for experimentation. L.L. and Y.L. conducted additional data analysis and contributed to manuscript refinement. All authors reviewed and endorsed the final version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics approval
All experiments and methodologies were conducted in compliance with applicable guidelines and regulations. Informed consent was obtained from all subjects and/or their legal guardian(s). The human heart rate collection protocol was approved by the University of Electronic Science and Technology of China (ethical review No. 106142022111824912).
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, J., Sun, H., Hu, Y. et al. Channel attention pyramid network for remote physiological measurement. Sci Rep 15, 22495 (2025). https://doi.org/10.1038/s41598-025-06107-5
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-025-06107-5