Introduction

Physiological parameters, including heart rate (HR), blood oxygen saturation (SpO2), and heart rate variability (HRV), play a pivotal role in monitoring physical health and diagnosing diseases1,2,3,4. These parameters not only provide insights into an individual’s physiological status but also prove valuable for early disease detection and monitoring the recovery process. It is worth noting that continuous HR monitoring is particularly instrumental in the timely identification and prevention of cardiovascular problems such as arrhythmia and atherosclerosis5.

In the conventional realm of physiological parameter monitoring, electrocardiogram (ECG) and photoplethysmography (PPG) are commonly employed for measurements. These contact-based methods rely on sensor or electrode placement, which can lead to discomfort or allergic reactions. Consequently, they are unsuitable for certain scenarios, such as sensitive or burnt skin, as well as newborn baby monitoring6.

Over the past decade, remote photoplethysmography (rPPG) has emerged as a promising non-invasive method for physiological measurements, utilizing video captured by camera to measure physiological parameters7,8. The fundamental principle of rPPG relies on the cyclic variations in vascular blood volume induced by the cardiovascular activity, resulting in periodic fluctuations in the skin’s light absorption. Consequently, subtle color variations occur in the video. Capturing these periodic color changes enables the extraction of the rPPG signal, from which HR information can be derived9.

Unlike traditional methods, rPPG-based physiological measurements eliminate the need for specialized biomedical sensors, relying solely on ordinary cameras. This feature allows leveraging existing camera equipment without the requirement for additional sensor installations10. This method is not restricted by physical distance, avoids irritation or discomfort to the skin, and reduces detection costs and operational complexity. Therefore, rPPG is particularly suitable for applications such as telemedicine11, monitoring of patients with skin sensitization, neonatal monitoring12, and fatigue-driving detection13, among others.

However, the color changes in the human face induced by cardiovascular activity are extremely subtle and prone to noise interference, such as variations in lighting and motion artifacts, potentially impacting the accuracy of measurements10. The challenge lies in effectively removing interference noise from the complex signal and successfully extracting the rPPG signal, making it a highly sought-after research area in recent years.

In the early stages of rPPG research, many studies employed hand-crafted methods14,15,16,17,18,19. These methods extracted rPPG signals from color changes in region of interest (ROI) areas, such as the forehead or cheeks, by detecting and tracking faces. Subsequently, noise interference was mitigated through a series of filtering processes, ultimately estimating the average HR through frequency analysis. However, this approach based on hand-crafted methods exhibits significant drawbacks. It often relies on empirical knowledge for selecting ROI areas and may not fully concentrate on the most effective regions. Additionally, these models exhibit limited generalization capabilities and prove ineffective for rPPG extraction in complex environments20.

As with numerous applications in computer vision, deep learning methods show great potential in remote HR measurement. Deep learning techniques relying on rPPG7,21,22 demonstrate the ability to effectively address the challenges posed by variations in lighting, head movements, and facial expressions, and HR detection remains highly robust across different scenarios23. However, the accuracy of many existing rPPG approaches remains limited, partly due to their neglect of multi-scale facial feature analysis. This oversight restricts the network’s ability to capture comprehensive information ranging from color variations to facial structural details, thereby compromising both accuracy and stability. Moreover, as deep learning networks progress through layers, the resulting feature maps often contain a large number of channels, where informative signals may be diluted or suppressed by less relevant ones. This further hinders the reliable extraction of rPPG signals.

Inspired by the above considerations, we propose a deep learning-based method CAP-rPPG. This study makes the following contributions:

  1. (1)

    Multi-scale deep learning architecture: We propose a multi-scale deep learning architecture based on a Gaussian pyramid. This architecture integrates feature maps extracted at different resolutions and incorporates the temporal shift module (TSM) to effectively capture spatiotemporal information, enabling efficient and accurate prediction of rPPG signals.

  2. (2)

    Channel attention module: To enhance the attention on channels containing crucial rPPG signals, we employ a channel attention module. This module assigns different weights to the deep channels of the deep learning network, directing more attention to channels with a higher concentration of rPPG.

  3. (3)

    Hybrid loss function: We introduce a hybrid loss function comprising time, frequency, and negative Pearson correlation loss. These components guide both short-term and long-term characteristics of the target rPPG signal and the correlation between the predicted rPPG and the ground-truth PPG, providing a comprehensive approach to loss optimization.

  4. (4)

    Experimental validation: Our experimental results demonstrate that, when compared with the current state-of-the-art rPPG algorithms, our proposed CAP-rPPG exhibits outstanding performance on both UBFC-rPPG and PURE datasets.

Related works

Hand-crafted methods

Verkruysse et al. initiated a groundbreaking exploration, leading to the seminal revelation that facial video data captured by a camera could be analyzed to extract photoplethysmography (PPG) signals closely associated with HR. This pivotal work marked the inception of remote HR measurement, now commonly known as remote photoplethysmography (rPPG), and prompted subsequent research endeavors aimed at refining the accuracy and robustness of rPPG extraction, resulting in the introduction of numerous novel methods and frameworks18.

Since then, numerous rPPG techniques have been proposed. Notable among these are methods based on blind source separation (BSS) or optical reflection modeling of the skin. BSS is a widely adopted technique in signal processing, particularly effective in the analysis and decomposition of physiological signals24,25. Poh et al. achieved HR detection on multiple subjects using RGB color channels from a webcam, employing the independent component analysis (ICA) method based on color frequency bands16. Lewandowska et al. utilized the R and G channels, focused on the forehead as the ROI, and successfully extracted rPPG signals through principal component analysis (PCA), maintaining accuracy comparable to ICA with reduced computational complexity26. Gerard de Haan et al. proposed a chroma-based method (CHROM), which uses a linear combination of chroma signals to offset specular reflection components that do not contain rPPG signal information19. Wenjin Wang et al. proposed the plane orthogonal to skin (POS) algorithm, projecting features onto a plane orthogonal to the specular direction. This innovative approach eliminated the specular reflection component, maximizing changes induced by diffuse reflection27.

While these methods have proven effective in certain scenarios for extracting rPPG signals, the selection of ROI areas often relies on empirical knowledge, potentially neglecting the most effective regions. Additionally, many of these models hinge on assumptions about the light reflection model, resulting in poor generalization capabilities.

Deep learning methods

In recent years, rPPG signal extraction methods based on deep learning have emerged in large numbers and have become a hot area of current research. In many cases, owing to the flexibility and expressiveness of deep learning, their performance surpasses that of hand-crafted rPPG methods. They can automatically extract richer spatiotemporal features from the input video, greatly improving the accuracy and robustness of the algorithm.

Špetlík et al. introduced a two-step convolutional neural network, HR-CNN, for remote HR estimation; the model shows resilience to variations in illumination and object motion8. Yu et al. proposed an end-to-end rPPG network, PhysNet, that merges the RGB projection into the subspace with the re-projection to the color subspace to achieve rPPG signal recovery20. Song et al. proposed a generative adversarial network, PulseGAN, to generate realistic rPPG pulse signals by denoising the chromaticity signal28. Seeking a balance between efficiency and accuracy, Liu et al. developed a one-step neural network architecture, EfficientPhys, which eliminates the need for preprocessing steps in physiological measurements29. Gupta et al. developed RADIANT, a Transformer-based model utilizing signal embeddings to improve rPPG estimation by capturing global context and suppressing local noise30. Zhang et al. proposed a self-supervised learning network capable of estimating rPPG signals from facial videos without labeled data, leveraging the periodicity and finite bandwidth characteristics of physiological signals31. Sun et al. introduced a domain harmonization strategy to resolve domain conflicts, enhancing the generalizability of remote physiological measurements across diverse datasets32. Speth et al. presented a non-contrastive unsupervised learning framework that discovers the blood volume pulse directly from unlabeled videos by encouraging sparse power spectra within normal physiological bandlimits33. Li et al. proposed STFPNet, a simple temporal feature pyramid network that leverages low-frame-rate video features to enhance remote heart rate measurement34.

These methods reveal the broad application potential of deep learning in physiological signal extraction, which can cope with different groups of people, different motion states, and different camera settings. Their performance is generally more stable in different situations, helping to improve the accuracy and robustness of measurements.

However, previous researchers have overlooked the importance of multi-scale image information, resulting in insufficient model accuracy. Our proposed CAP-rPPG is the first deep learning rPPG extraction architecture to use multi-scale image input. We incorporate the channel attention module to address the challenge of effectively attending to useful channels within the deep layers of the network, where numerous channels exist. We also propose a loss function based on the time domain, frequency domain, and correlation, so that the network attends to both the short-term and long-term characteristics of the video during learning. We thoroughly assess the performance of our proposed CAP-rPPG on various datasets.

Methods

We initially introduce our proposed CAP-rPPG network architecture. Additionally, we briefly describe the incorporated modules: the Gaussian pyramid, TSM, face mask module, and channel attention module. Finally, we describe the proposed hybrid loss function. We confirm that all methods were performed in accordance with the relevant guidelines and regulations, and that informed consent was obtained from all participants and/or their legal guardians.

Framework of CAP-rPPG

Before performing computations with the model, the video is first preprocessed using a facial landmark detection technique. Numerous advanced facial landmark detection methods have been proposed, including MTCNN, Dlib, and MediaPipe Face Mesh, among others. These methods are all capable of accurately identifying facial contours based on key landmarks. After comparing their detection accuracy and runtime performance, we ultimately selected MediaPipe Face Mesh. This method not only detects a large number of facial landmarks with high precision but also offers excellent real-time performance. This choice enables efficient separation of facial regions from the background, allowing for fast and accurate face localization in each video frame35,36. Subsequently, we crop each video frame so that only the facial region is retained and resize it to \(\:72\times\:72\) pixels, as sketched below.
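The following is a minimal sketch of this preprocessing step, assuming MediaPipe Face Mesh and OpenCV are available; the function name, the crop strategy (a bounding box around all detected landmarks), and the handling of undetected frames are illustrative rather than the exact pipeline used in our experiments.

```python
# Sketch: per-frame face localization with MediaPipe Face Mesh, cropping to the
# face bounding box, and resizing to 72x72 pixels (names are illustrative).
import cv2
import numpy as np
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh

def crop_face_frames(video_path, size=72):
    face_mesh = mp_face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w, _ = frame.shape
        result = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if not result.multi_face_landmarks:
            continue  # skip frames where no face is detected
        pts = result.multi_face_landmarks[0].landmark
        xs = [int(p.x * w) for p in pts]
        ys = [int(p.y * h) for p in pts]
        x0, x1 = max(min(xs), 0), min(max(xs), w)
        y0, y1 = max(min(ys), 0), min(max(ys), h)
        face = frame[y0:y1, x0:x1]
        frames.append(cv2.resize(face, (size, size)))
    cap.release()
    face_mesh.close()
    return np.stack(frames)  # (T, 72, 72, 3)
```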

The framework of the CAP-rPPG is shown in Fig. 1. To utilize the information contained in video images at different resolutions, we have designed a multi-scale deep learning network based on the Gaussian pyramid. Prior to entering the network for feature extraction, video image inputs at all scales undergo normalization through a layer comprising a difference layer and a batch normalization layer. The video image input from the first layer of the Gaussian pyramid, with a size of \(\:72\times\:72\), serves as the backbone of the network. Following normalization, a sequence of operations, including TSM, 2D convolution, and face mask module, are executed within the network. Subsequently, a maximum pooling layer is employed to reduce its size to \(\:36\times\:36\), facilitating integration with other feature maps. The video image inputs for the second and third layers of the Gaussian pyramid are sized \(\:36\times\:36\) and \(\:18\times\:18\), respectively. After individual operations of TSM, 2D convolution, and face mask module, these inputs are concatenated with the feature map extracted from the network backbone, ensuring comprehensive multi-scale feature fusion.

Given the depth of the network, Dropout layers are strategically incorporated to mitigate overfitting. Finally, the channel attention module is employed to assign distinct weights to various channels, followed by the utilization of a fully connected layer for feature extraction. The network outputs the rPPG signal corresponding to the video.

Fig. 1
figure 1

Framework of CAP-rPPG. (The facial images are taken from the publicly available dataset UBFC-rPPG).
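To make the data flow concrete, the following is a highly simplified sketch of the multi-scale fusion idea described above, assuming three pyramid levels (72×72, 36×36, 18×18) and placeholder channel sizes. The TSM and face mask operations covered in the following subsections are omitted here, and the upsampling of the 18×18 branch to match the fusion resolution is an assumption of this sketch, not a statement of the exact architecture.

```python
# Simplified sketch of multi-scale feature fusion over Gaussian-pyramid inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionSketch(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.norm = nn.BatchNorm2d(3)
        self.conv_l1 = nn.Conv2d(3, ch, 3, padding=1)  # 72x72 backbone branch
        self.conv_l2 = nn.Conv2d(3, ch, 3, padding=1)  # 36x36 branch
        self.conv_l3 = nn.Conv2d(3, ch, 3, padding=1)  # 18x18 branch
        self.head = nn.Linear(ch * 3, 1)               # per-frame rPPG value

    def forward(self, levels):
        # levels: [(N,3,72,72), (N,3,36,36), (N,3,18,18)] normalized pyramid inputs
        f1 = F.max_pool2d(self.conv_l1(self.norm(levels[0])), 2)            # -> 36x36
        f2 = self.conv_l2(self.norm(levels[1]))                             # 36x36
        f3 = F.interpolate(self.conv_l3(self.norm(levels[2])), scale_factor=2)  # -> 36x36 (assumed)
        fused = torch.cat([f1, f2, f3], dim=1)          # multi-scale concatenation
        pooled = fused.mean(dim=(2, 3))                 # global average pooling
        return self.head(pooled).squeeze(-1)            # (N,) rPPG samples
```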

Gaussian pyramid

The Gaussian pyramid serves as a multi-scale representation of signals, involving the repeated application of Gaussian blurring and down-sampling to the same signal or image. This process generates multiple sets of signals or images at various scales, facilitating subsequent processing37.

When extracting rPPG signals, we focus on the periodic variations in skin color caused by blood pulsations18. Therefore, macroscopic changes in skin color are more important than detailed information about a face, such as contours or facial features.

To better capture these macroscopic color changes, we employ a Gaussian pyramid downsampling method. Downsampling step by step helps the network focus on the macroscopic changes in the image rather than tiny details, which is exactly what rPPG signal extraction requires.

The construction process of the Gaussian pyramid is shown in Fig. 2, and its mathematical definition is given below.

$$G_{k} \left( {i,j} \right) = \mathop \sum \limits_{{m = - 2}}^{2} \mathop \sum \limits_{{n = - 2}}^{2} \omega \left( {m,n} \right)G_{{k - 1}} \left( {2i + m,2j + n} \right)$$
(1)

In Eq. (1), \(\:{G}_{k}\) represents the downsampled image of the pyramid at layer \(\:k\), \(\:i\) and \(\:j\) represent the row and column indices of the current layer image, respectively, and \(\:\omega\:(m,n)\) is the Gaussian convolution kernel defined in Eq. (2). In constructing the Gaussian pyramid, we adopt a 5 × 5 Gaussian kernel to balance filtering effectiveness and computational efficiency. Compared with smaller kernels, it provides stronger anti-aliasing smoothing, while remaining lightweight enough for efficient multi-scale processing. The filter weights follow a discrete approximation of the 2D Gaussian distribution, ensuring smooth downsampling transitions between pyramid levels.

$$\omega \left( {m,n} \right) = \frac{1}{{256}}\left[ {\begin{array}{*{20}c} 1 & 4 & 6 & 4 & 1 \\ 4 & {16} & {24} & {16} & 4 \\ 6 & {24} & {36} & {24} & 6 \\ 4 & {16} & {24} & {16} & 4 \\ 1 & 4 & 6 & 4 & 1 \\ \end{array} } \right]$$
(2)
Fig. 2
figure 2

The construction process of the Gaussian pyramid. (The facial images are taken from the publicly available dataset UBFC-rPPG).

As can be seen from Fig. 2, the Gaussian pyramid retains the spatial low-frequency information in the image. As the image resolution decreases, spatial high-frequency information, such as edge details, is gradually lost.
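As a minimal sketch, the pyramid levels used as network inputs can be generated with OpenCV's cv2.pyrDown, which applies the same 5 × 5 binomial kernel as Eq. (2) before discarding every other row and column; the function name and level count below are illustrative.

```python
# Sketch: build Gaussian pyramid levels for one video frame per Eqs. (1)-(2).
import cv2

def build_gaussian_pyramid(frame, levels=3):
    """Return e.g. [72x72, 36x36, 18x18] versions of a 72x72 input frame."""
    pyramid = [frame]
    for _ in range(levels - 1):
        frame = cv2.pyrDown(frame)  # 5x5 Gaussian blur + 2x down-sampling
        pyramid.append(frame)
    return pyramid
```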

Temporal shift module

Traditional 2D-CNNs exhibit low computational cost but fall short in capturing temporal relationships, while methods relying on 3D-CNNs boast strong performance at the expense of increased computational demands and higher deployment costs. To strike a balance and better capture spatiotemporal information in the video, we incorporate the temporal shift module (TSM) into the network38.

TSM begins by partitioning the input tensor into three blocks along the channel dimension: the first block is shifted forward in time by one frame, the second block is shifted backward by one frame, and the third block remains unshifted. Frames shifted beyond the sequence boundary are truncated, while the vacated positions are filled with zeros. All shift operations are conducted along the time axis29. This block movement endows the current frame with information from both the preceding and succeeding frames. Consequently, the two-dimensional convolution operation can directly extract spatiotemporal information from the video, akin to three-dimensional convolution, thereby enhancing the model’s temporal modeling capabilities. The operating principle of the TSM can be seen in Fig. 3(a).

The TSM can be seamlessly integrated into a two-dimensional CNN. This innovative approach achieves the performance levels of a 3D-CNN while preserving the computational simplicity of a 2D-CNN.
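A minimal sketch of the temporal shift operation is given below, assuming the input is organized as (batch, time, channels, height, width) and that one third of the channels is shifted in each direction; the fold ratio is an assumption, not necessarily the setting used in CAP-rPPG.

```python
# Sketch: shift a fraction of channels forward/backward along the time axis,
# zero-filling the vacated positions.
import torch

def temporal_shift(x, fold_div=3):
    """x: (N, T, C, H, W) feature tensor; returns the temporally shifted tensor."""
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # frame t receives features from frame t-1
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # frame t receives features from frame t+1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # remaining channels: unshifted
    return out
```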

Face mask module

To mitigate the adverse impact of motion and lighting noise, we introduce a face mask module into the network. The face mask module functions as a soft attention layer, realized through \(\:1\times\:1\) convolution operations and sigmoid activation functions. First, the feature map is convolved with a kernel of size \(\:1\times\:1\) so that the number of channels becomes 1, and the result is then activated using the sigmoid function. This mechanism is analogous to the attention score maps in conventional attention frameworks, enabling the network to learn a spatial importance weight for each location. Each element is normalized so that the overall sum of the attention map remains constant, preventing gradient instability caused by excessively large values in certain regions. The result is a normalized mask that preserves the same spatial dimensions as the input. The normalized face mask is then multiplied element-wise with the output of the temporal-shift convolution29. The face mask adaptively focuses on more stable and pulse-rich regions of the face, such as the cheeks and forehead, guiding the subsequent network to attend to informative spatial areas and thereby improving the accuracy of rPPG signal prediction.

Fig. 3
figure 3

Two key modules of the network. (a) temporal shift module (b) face mask module.

The face mask module is designed as a flexible and pluggable component, seamlessly integrating into any part of the network without altering its overall structure. This incorporation enhances the network’s data extraction capabilities, ensuring a more robust performance by effectively minimizing the influence of motion and lighting noise. The working principle of the face mask module can be seen in Fig. 3(b).
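The following is a minimal sketch of such a face mask module, assuming an L1-style normalization that keeps the sum of the mask equal to the number of spatial locations; the class name and normalization constant are illustrative.

```python
# Sketch: 1x1 convolution -> sigmoid -> sum-preserving normalization -> element-wise re-weighting.
import torch
import torch.nn as nn

class FaceMask(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features):
        # features: (N, C, H, W)
        n, c, h, w = features.shape
        mask = torch.sigmoid(self.conv(features))                              # (N, 1, H, W) attention scores
        mask = mask * (h * w) / mask.abs().sum(dim=(2, 3), keepdim=True)        # keep the mask sum constant
        return features * mask                                                  # re-weighted features
```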

Channel attention module

Addressing the challenge posed by the multitude of channels in deep neural networks, where effectively attending to channels containing crucial information can be intricate, we introduce the channel attention module. This module empowers the network to intelligently prioritize channels rich in valuable information by assigning distinct weights to each channel. This innovative approach allows the neural network to autonomously learn and reinforce channels pivotal to a specific task, thereby augmenting its capability to extract pertinent information, minimizing focus on irrelevant details, and enhancing the model’s overall performance and efficiency.

The channel attention module operates on the principle of dynamically learning weights to adjust the importance of each channel (feature map), amplifying useful features while diminishing extraneous ones39,40. This dynamic adjustment significantly improves the model’s performance and generalization ability. The module unfolds in two steps:

  1. (1)

    Squeeze: In this initial step, the channel attention module condenses the features of each channel through global average pooling and global max pooling operations. The channel features produced by these pooling operations are then fused, generating a vector containing the information of each channel.

  2. (2)

    Excitation: The subsequent step involves the channel attention module learning a weight vector to reweight the feature map of each channel, representing the importance of each channel. To achieve this, a sigmoid function is employed for activation. Given the typical prevalence of numerous channels in the deep layers of deep learning networks, two fully connected layers, flanking the nonlinear ReLU function, are incorporated to control parameter complexity. This design decision ensures efficient parameter management while preserving the module’s effectiveness in emphasizing critical features and suppressing less relevant ones.

The channel attention module, illustrated in Fig. 4, introduces noteworthy performance enhancements to the deep learning model structure with minimal additional computational cost.

Fig. 4
figure 4

Channel attention module.
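A minimal sketch of this squeeze-and-excitation style channel attention is given below; the reduction ratio r and class name are assumptions for illustration.

```python
# Sketch: squeeze via global average and max pooling, excitation via two fully
# connected layers around a ReLU, sigmoid channel re-weighting.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, r=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):
        # x: (N, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))   # squeeze via global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # squeeze via global max pooling
        weights = torch.sigmoid(avg + mx)    # (N, C) per-channel importance weights
        return x * weights.unsqueeze(-1).unsqueeze(-1)
```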

Loss function

When supervising a deep rPPG model, different loss functions impose different constraints on the model. Many researchers use only a single loss function as a constraint, which limits the model to attending to only one type of feature. Based on this, we propose a hybrid loss function called the time-frequency-correlation (TFC) loss, which considers the short-term and long-term characteristics of the signal as well as the correlation between the predicted and ground-truth values, as follows:

$$L_{{TFC}} = L_{{MSE}} + L_{{PSD}} + L_{{Pearson}}$$
(3)

Mean square error (MSE) loss, a frequently employed loss function in rPPG waveform extraction methods, presents clear and well-defined optimization objectives29. The model aims to minimize MSE, striving to make the predicted rPPG waveform closely match the actual PPG waveform. Notably, MSE is a differentiable loss function, facilitating the direct utilization of the backpropagation algorithm for gradient calculation, thus enhancing the efficiency of the optimization process. Emphasizing the square of the error, MSE exhibits heightened sensitivity to larger errors, enabling the model to concentrate on rectifying significant errors while maintaining resilience to minor local errors.

$$L_{{MSE}} = \frac{1}{N}\mathop \sum \limits_{{i = 1}}^{N} \left( {x_{i} - y_{i} } \right)^{2}$$
(4)

In Eq. (4), \(\:x\) represents the predicted rPPG signal, and \(\:y\) represents the ground-truth PPG. MSE loss mainly imposes instantaneous constraints in the time domain; using only a time-domain loss would result in a lack of control over the global characteristics of the signal. A frequency-domain loss, in contrast, guides the overall periodic characteristics of the signal. Combining the two therefore exploits both long-term and short-term characteristics, and the joint use of time- and frequency-domain losses provides more effective guidance41. Imposing a frequency bandwidth limit stands out as a potent constraint for the model. Previous unsupervised approaches have employed the irrelevant power ratio (IPR) as a metric for validation23,42,43, and we observed its efficacy in model training as well. The IPR penalizes the model when it generates signal power outside the specified bandwidth limits44. With lower and upper band limits l and u, the power spectral density (PSD) loss is defined as below.

$$L_{{PSD}} = \frac{{\mathop \sum \nolimits_{{i = - \infty }}^{l} F_{i} + \mathop \sum \nolimits_{{i = u}}^{\infty } F_{i} }}{{\mathop \sum \nolimits_{{i = - \infty }}^{\infty } F_{i} }}$$
(5)

In Eq. (5), F represents the frequency-domain form of the predicted rPPG signal. Negative Pearson loss is commonly used to measure the linear correlation between predicted and ground-truth values29. Unlike MSE, the Pearson correlation coefficient is relatively less affected by outliers, which means that if there are some outliers in the data, negative Pearson loss may be more resistant to their impact.

$$L_{{Pearson}} = 1 - \frac{{\mathop \sum \nolimits_{{i = 1}}^{n} \left( {x_{i} - \overline{{x_{i} }} } \right)\left( {y_{i} - \overline{{y_{i} }} } \right)}}{{\sqrt {\mathop \sum \nolimits_{{i = 1}}^{n} \left( {x_{i} - \overline{{x_{i} }} } \right)^{2} \mathop \sum \nolimits_{{i = 1}}^{n} \left( {y_{i} - \overline{{y_{i} }} } \right)^{2} } }}$$
(6)
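A minimal sketch of the TFC loss in Eqs. (3)-(6) is given below, assuming a 30 fps sampling rate, a heart-rate band of 0.75-2.5 Hz for the PSD term, and equal weighting of the three components as in Eq. (3).

```python
# Sketch: hybrid TFC loss = MSE (time) + out-of-band power ratio (frequency) + negative Pearson (correlation).
import torch

def tfc_loss(pred, target, fs=30.0, low=0.75, high=2.5):
    # time-domain term: mean squared error, Eq. (4)
    l_mse = torch.mean((pred - target) ** 2)

    # frequency-domain term: fraction of spectral power outside the HR band, Eq. (5)
    psd = torch.abs(torch.fft.rfft(pred)) ** 2
    freqs = torch.fft.rfftfreq(pred.shape[-1], d=1.0 / fs)
    in_band = (freqs >= low) & (freqs <= high)
    l_psd = psd[~in_band].sum() / psd.sum()

    # correlation term: negative Pearson correlation, Eq. (6)
    px, ty = pred - pred.mean(), target - target.mean()
    l_pearson = 1 - (px * ty).sum() / torch.sqrt((px ** 2).sum() * (ty ** 2).sum())

    return l_mse + l_psd + l_pearson
```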

Results

Initially, two benchmark datasets are introduced, followed by a description of the experimental implementation details and performance metrics. Subsequently, we compare the proposed method with previous methods and provide a visualization of the experimental results, while the efficacy of each component within the proposed method is evaluated through ablation experiments.

Dataset

UBFC-rPPG45 was created utilizing a C++ application for video capturing, employing an affordable webcam (Logitech C920 HD Pro) operating at 30 fps and a resolution of \(\:640\times\:480\). Ground-truth PPG data, consisting of the PPG waveform, was obtained using a CMS50E transmissive pulse oximeter. During the data collection, the subject sat approximately 1 m away from the camera, ensuring their face was visible. A total of 42 segments of data are available.

PURE46 consists of 10 persons performing different, controlled head motions in front of a camera. Ten persons were recorded in six different setups, resulting in a total of 60 sequences. The videos were captured using an eco274CVGE camera by SVS-Vistek GmbH at a frame rate of 30 Hz with a cropped resolution of \(\:640\times\:480\) pixels. Concurrently, ground-truth data were collected using a pulse oximeter (pulox CMS50E). The test subjects were positioned in front of the camera at an average distance of 1.1 m. The six different setups were as follows: steady, talking, slow translation, fast translation, small rotation, medium rotation.

Implementation details and metrics

For training iteration, each resized training video was divided into segments of non-overlapping 6 s (180 frames) clips. During intra-dataset testing, dividing the dataset resulted in a low number of testing videos, so we followed43 and divided each testing video into non-overlapping 30 s (900 frames) clips and calculated HR for each clip. During cross-dataset testing, we followed29 and conducted video-level evaluation where we calculated an averaged HR for each single testing video.

UBFC-rPPG: Based on the criteria of previous research28,43, we divided the 42 video sets into two subsets, containing 30 and 12 videos for training and testing respectively, without using any data augmentation methods.

PURE: Based on the standards of previous research8,43, we divided the 10 subjects into two subsets, containing 6 and 4 subjects for training and testing respectively, without using any data augmentation methods.

Our algorithm is implemented in PyTorch and trained on an NVIDIA RTX 4090 GPU. We use the AdamW optimizer instead of the Adam optimizer to train the models. All models are trained for 10 epochs with a learning rate of 0.001. The length of each video clip is set to 180 frames, and all video clips used are non-overlapping.
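A minimal training-loop sketch reflecting these settings (AdamW, 10 epochs, learning rate 0.001, non-overlapping 180-frame clips) is shown below; the batch size, the data layout, and the tfc_loss helper (sketched in the Methods section) are illustrative assumptions rather than the exact training script.

```python
# Sketch: training loop with AdamW over non-overlapping 180-frame clips.
import torch
from torch.utils.data import DataLoader

def train(model, dataset, tfc_loss, device="cuda", epochs=10, lr=1e-3, batch_size=4):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for clips, ppg in loader:                 # clips: (B, 180, 3, 72, 72), ppg: (B, 180)
            preds = model(clips.to(device))       # (B, 180) predicted rPPG
            targets = ppg.to(device)
            loss = torch.stack([tfc_loss(p, t) for p, t in zip(preds, targets)]).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```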

To validate our proposed method, we use widely recognized performance evaluation metrics to assess the performance of the model43. The evaluation criteria included mean absolute error (MAE), root mean square error (RMSE), and the Pearson correlation coefficient (ρ) between ground-truth and predicted HR.

MAE is a performance metric for evaluating prediction models, measuring the average absolute difference between predicted HR and ground-truth HR.

$$MAE = \frac{1}{N}\mathop \sum \limits_{{i = 1}}^{N} \left| {HR_{i} - HR_{i} ^{*} } \right|$$
(7)

RMSE is a metric used to assess the performance of prediction models, representing the square root of the average squared difference between predicted HR and ground-truth HR.

$$RMSE = \sqrt {\frac{1}{N}\mathop \sum \limits_{{i = 1}}^{N} \left( {HR_{i} - HR_{i} ^{*} } \right)^{2} }$$
(8)

\(\:\rho\:\) quantifies the linear relationship between predicted HR and ground-truth HR, ranging between − 1 and 1, with 0 indicating no linear correlation.

$$\rho = \frac{{\mathop \sum \nolimits_{{i = 1}}^{N} \left( {HR_{i} - \overline{{HR}} } \right)\left( {HR_{i}^{*} - \overline{{HR_{i}^{*} }} } \right)}}{{\sqrt {\mathop \sum \nolimits_{{i = 1}}^{N} \left( {HR_{i} - \overline{{HR}} } \right)^{2} \mathop \sum \nolimits_{{i = 1}}^{N} \left( {HR_{i}^{*} - \overline{{HR_{i}^{*} }} } \right)^{2} } }}$$
(9)

In the above Eqs. (7), (8), (9), the predicted HR is denoted as \(\:{HR}_{i}\), the ground-truth HR is denoted as \(\:{{HR}_{i}}^{*}\), the overbar denotes the mean operator, and \(\:N\) represents the total count of HR values. The predicted HR was derived by identifying the dominant frequency in the predicted PSD of the rPPG signal. We applied a bandpass filter with cutoff frequencies of 0.75 to 2.5 Hz to the predicted rPPG signal before computing HR29. Subsequently, the FFT is used to estimate HR values from each video. As for the ground-truth HR, it corresponds to the data collected by the contact oximeter sensor. We use rppg-toolbox47 to help us with model evaluation.
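The HR estimation and evaluation procedure can be sketched as follows, assuming a second-order Butterworth band-pass filter over 0.75-2.5 Hz and a periodogram-based dominant-frequency search; the filter order and function names are illustrative.

```python
# Sketch: band-pass filter the predicted rPPG, take the dominant spectral peak as HR,
# then compute MAE/RMSE/Pearson per Eqs. (7)-(9).
import numpy as np
from scipy.signal import butter, filtfilt, periodogram

def estimate_hr(rppg, fs=30.0, low=0.75, high=2.5):
    b, a = butter(2, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, rppg)
    freqs, psd = periodogram(filtered, fs)
    mask = (freqs >= low) & (freqs <= high)
    return 60.0 * freqs[mask][np.argmax(psd[mask])]  # HR in beats per minute

def hr_metrics(pred_hr, true_hr):
    pred_hr, true_hr = np.asarray(pred_hr), np.asarray(true_hr)
    mae = np.mean(np.abs(pred_hr - true_hr))
    rmse = np.sqrt(np.mean((pred_hr - true_hr) ** 2))
    rho = np.corrcoef(pred_hr, true_hr)[0, 1]
    return mae, rmse, rho
```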

Intra-dataset HR evaluation

We conducted intra-dataset testing of HR estimation on UBFC-rPPG and PURE, comparing our method with hand-crafted methods such as Green18, POS27, 2SR48, and CHROM19. Additionally, we compared it with deep learning methods such as PhysNet20, HR-CNN8, SynRhythm49, RADIANT30, and so on. We followed the evaluation protocols of PulseGAN28 and Contrast-Phys43, adopting the same dataset partitioning strategy as used in their experiments. The performance results, including hand-crafted and deep learning methods, are presented in Table 1. Figure 5 uses bar charts to compare the performance of different methods more visually on UBFC-rPPG and PURE.

Table 1 Intra-dataset HR evaluation on UBFC-rPPG and PURE.
Fig. 5
figure 5

Intra-dataset HR evaluation on UBFC-rPPG and PURE.

In Table 1, the lowest MAE, lowest RMSE, and highest ρ are underscored, while the performance of our proposed method is highlighted in bold. For UBFC-rPPG, apart from SynRhythm, deep learning-based methods outperform all hand-crafted methods, showcasing the ability of deep learning approaches to learn more informative characteristics for remote HR estimation.

The proposed method demonstrates outstanding performance on both the UBFC-rPPG and PURE datasets, achieving MAEs of 0.43 bpm and 0.30 bpm, respectively, without the use of any data augmentation. Although STFPNet achieves a slightly lower MAE than our proposed method on the UBFC-rPPG dataset (by 0.02 bpm), its RMSE is notably higher by 0.17 bpm. This indicates that STFPNet produces more outliers and exhibits lower stability. In contrast, our method demonstrates greater robustness, as further evidenced by the results on the PURE dataset, where it outperforms STFPNet by a significant margin in both MAE and RMSE. This suggests that the greater stability and generality of our method can translate into consistently reliable performance across various real-world scenarios.

Cross-dataset HR evaluation

Since the environment and lighting conditions of samples within the same dataset are mostly similar, in order to make a fair comparison with the state-of-the-art methods, we followed the protocol of PulseGAN28 for cross-dataset testing: the model was trained on PURE only and tested on UBFC-rPPG. The performance of various methods on the cross-dataset HR evaluation is shown in Table 2. As in the intra-dataset evaluation, deep learning-based methods perform significantly better than hand-crafted methods. Our proposed method outperforms all other methods in the cross-dataset setting, which means that it generalizes well to new datasets and avoids the overfitting that may result from training and testing on the same dataset.

Table 2 Cross-dataset HR evaluation with models trained on PURE and tested on UBFC-rPPG.

Multiple illumination scenarios evaluation

Since most of the current datasets only consider the volunteer’s movement variations but not the environment’s illumination variations, they cannot effectively test the robustness of the algorithms under different illumination conditions. For this reason, we propose the multiple illumination scenarios (MIS) dataset, which collects data under three different illumination conditions: normal illumination, strong illumination, and weak illumination. The specific illumination conditions can be seen in Fig. 6. Natural illumination was used as normal illumination, a halogen lamp 30 cm from the face was used as a strong illumination source, and a light shield was fitted to the camera lens to simulate weak illumination. For each illumination condition, data were collected separately with the volunteers in stationary and moving states. In the stationary condition, the volunteer was required to remain seated without moving, while in the motion condition, the volunteer could turn his/her head, talk, laugh, etc. at will. In total, 10 volunteers participated in the data recording process. A GoPro Hero11 was used to capture the video and a Contec CMS50E to capture the PPG signal. Videos were recorded at \(\:1920\times\:1080\) resolution and 60 fps.

We use models trained on PURE for cross-dataset evaluation on the MIS dataset. For this assessment, we adopted whole-video evaluation, with one overall heart rate estimate for each 1-minute video. To be consistent with the previous data validation approach, we downsampled both the video and PPG signals to 30 fps during the preprocessing stage. The results show that across a total of 60 recordings, our proposed model achieves an MAE of 2.83 bpm, an RMSE of 4.95 bpm, and a mean absolute percentage error (MAPE) of 3.62%. A MAPE within ±5% is considered tolerable according to the standards of the American National Standards Institute (ANSI)50. Our model meets this ANSI requirement, showing that it retains high accuracy under multiple complex lighting conditions and once again demonstrating its robustness.

Fig. 6
figure 6

Different illumination conditions in the MIS dataset. (a) Normal illumination. (b) Strong illumination. (c) Weak illumination. (The participant provided written informed consent for the publication of identifying images in an online open access publication).

Ablation study

We conduct an ablation study on our method by performing HR estimation on models trained on PURE, tested on UBFC-rPPG. We introduce ablation studies on the following modules: (1) Gaussian pyramid; (2) Channel attention; (3) TFC loss; (4) Face mask. The results are shown in the Table 3.

Table 3 Experimental results for ablation study.

When using Gaussian pyramid alone, the MAE value dropped by 0.12 bpm and the detection accuracy increased by 7%. When using TFC loss alone, the MAE value dropped by 0.08 bpm and the detection accuracy increased by 5%. When TFC loss is used with the channel attention module, the MAE value drops from 1.74 bpm to 1.43 bpm, and the detection accuracy increases by 18%; when TFC loss is used with Gaussian pyramid and channel attention module, the MAE value is reduced from 1.74 bpm to 1.15 bpm, and the detection accuracy jumps by 34%. When the face mask is removed from the network, the RMSE increases from 2.97 bpm to 2.99 bpm, and the Pearson correlation coefficient (ρ) drops from 0.99 to 0.98, indicating a rise in both the number and magnitude of outliers in the predictions. The progressive improvement in performance with the addition of more modules indicates that each of the proposed components is effective for rPPG signal extraction.

Computational cost evaluation

For rPPG signal extraction tasks, low computational overhead is essential to ensure fast model responsiveness, which is critical for practical deployment. To further assess the computational efficiency of our proposed method, we conducted a comparison between CAP-rPPG and several representative benchmark models. The results are summarized in Table 4.

As illustrated in Table 4, our method requires fewer parameters than both DeepPhys and EfficientPhys, making it more suitable for deployment on real-world, resource-constrained devices. While PhysNet exhibits a relatively small model size, its heart rate estimation MAE on the PURE dataset reaches 2.10 bpm—substantially higher than that of our method. This suggests that PhysNet may have limited feature representation capability, resulting in reduced prediction accuracy.

Furthermore, in the cross-dataset evaluation on UBFC-rPPG, our method achieves an average inference time of 0.76 s per preprocessed video, which is a highly encouraging result. These findings demonstrate that our approach can achieve both high prediction accuracy and efficient inference, without incurring significant increases in model complexity. This enhances the practicality and scalability of CAP-rPPG for deployment on devices with limited computational resources.

Table 4 Experimental results for computational cost.

Visualization

In Fig. 7., we provide visualizations for both the predicted rPPG signals and the ground-truth rPPG signals along with their PSDs of CAP-rPPG, extracted from two video clips in UBFC-rPPG and PURE. The remarkable resemblance between the predicted and ground-truth rPPG signals, as well as their corresponding PSDs, underscores the model’s ability to accurately capture and reproduce the physiological signals, further validating the robustness of our approach.

Fig. 7
figure 7

Visualization of predicted rPPG signals and ground-truth rPPG signals on UBFC-rPPG and PURE.

Figure 8 illustrates scatter plots depicting the predicted HR against the ground-truth HR for the test data on UBFC-rPPG and PURE of CAP-rPPG, respectively. In these plots, the x-axis represents the ground-truth HR, while the y-axis represents the predicted HR. A noticeable alignment of the scatter points with the y = x line is observed, and this alignment persists across both low and high HR. Achieving this alignment is not an easy task for deep learning methods because the data distribution is imbalanced: the HR values in the training data may fall predominantly within a specific range, leading to overfitting on that particular range. Nonetheless, our proposed method successfully predicts diverse HR distributions without additional augmentation of HR data, effectively preventing overfitting.

Fig. 8
figure 8

Comparison of predicted HR with ground-truth HR on UBFC-rPPG and PURE.

Figure 9 shows the Bland-Altman consistency analysis of CAP-rPPG. The red line in the figure represents the mean error between the predicted HR and the ground-truth HR. The two dashed blue lines represent the 95% limits of agreement \(\:[\mu\:-1.96\sigma\:,\:\:\mu\:+1.96\sigma\:]\), and only the points within these boundaries are deemed highly reliable. The result reveals that a majority of the HR measurements obtained by the proposed method fall within these limits, suggesting high consistency with the ground-truth values.

Fig. 9
figure 9

Bland-Altman plot on UBFC-rPPG and PURE.

Discussion

Since rPPG signals are very weak and noisy, rPPG-based remote physiological measurements are challenging. This paper studies the estimation of rPPG using deep learning models: it proposes a new network structure, CAP-rPPG, based on the Gaussian pyramid, uses the channel attention module to focus the network on channels containing useful information, and introduces a hybrid loss function based on the time domain, frequency domain, and correlation, allowing the model to learn short-term features, long-term features, and correlations and ultimately improve physiological measurement accuracy. The proposed CAP-rPPG achieves accurate measurement of HR on the UBFC-rPPG and PURE datasets, significantly outperforming previous hand-crafted methods and matching the current state-of-the-art models. Moreover, the performance of the model was also validated on the MIS dataset under a variety of lighting conditions, meeting the accuracy requirements specified by ANSI and once again validating its robustness. The implications of this research extend beyond the accurate measurement of HR. The envisioned trajectory involves pushing the boundaries of rPPG applications to encompass broader physiological parameters, including but not limited to blood pressure and respiratory rate. This work lays the foundation for a transformative approach to healthcare monitoring.