Introduction

Head pose estimation aims to predict the specific orientation of a person’s head in three-dimensional (3D) space by analyzing facial features in images or videos. It has a broad range of applications, including safe driving1,2, human-computer interaction3,4,5, medical systems6,7, virtual/augmented reality8, and other domains9,10,11.

Fig. 1 Examples of facial images with occlusion, low resolution, or varying lighting conditions.

Factors such as occlusion and low resolution (as shown in Fig. 1) make it difficult to capture key facial pose features during learning. Additionally, the head pose angle varies continuously, and representing these continuous values with bin classification may cause information loss12, further restricting the model’s performance. To tackle these challenges, soft label strategies have been introduced as an alternative to hard labels for representing the pose of facial images13,14,15. Compared to hard labels, soft labels better capture the continuous nature of head poses and the similarity between adjacent angles. However, current soft label generation approaches in head pose estimation typically construct Gaussian distributions with identical, fixed variances for the three Euler angles, namely yaw, pitch, and roll12,13,14. Such methods fail to account for the differing estimation difficulty of the three Euler angles. We therefore propose a novel soft-label construction strategy based on facial landmark information: we measure the similarity between head pose angles by computing the displacements of 3D key points at those angles, which serves as a central step in soft label construction. The same magnitude of angle change produces different degrees of keypoint movement across the three Euler angles; in particular, changes in yaw and pitch cause large keypoint shifts, whereas roll has a relatively small effect. These differences are encoded into the soft labels.

Additionally, to enhance feature representation and improve the model’s discriminative capability, we propose the Stacked Dual Attention Module (SDAM), which includes the Multi-Receptive Attention Module (MRAM) and the Channel-wise Self-Attention Module (CSAM). Head pose estimation requires accurate modeling of the head’s global 3D orientation, which fundamentally depends on understanding the spatial arrangement of facial structures. This makes it essential to capture not only local features (e.g., contours and edges), but also the spatial relationships between local regions (e.g., the relative configuration of the eyes and nose), as well as the global structure of the head. To address these multi-level requirements, SDAM introduces a multi-scale spatial attention mechanism that leverages convolutional kernels with different receptive fields to generate attention maps sensitive to both fine-grained local cues and broader structural context. These attention maps guide the network to emphasize spatially informative and structure-related regions, facilitating the perception of key features across multiple contextual semantics. This design is well suited to the head pose estimation task, which typically requires a global perspective to accurately determine head pose. CSAM uses a self-attention mechanism to construct channel attention weights, assessing the importance of each channel by computing the global dependencies among channels.

Furthermore, we apply our head pose estimation model to the estimation of students’ gaze points in classroom scenes. By accurately estimating students’ head poses, we can infer the position of their gaze points, providing support for analyzing their level of focus and points of interest. This application promotes the research and application of intelligent education by not only verifying the feasibility and effectiveness of the proposed method in practical scenarios but also by providing innovative technical support for human-computer interaction and intelligent analysis in the field of education. The main contributions of our work are as follows:

  • We propose a novel soft-label construction strategy (SLCS) that utilizes key point displacements to encode the differences in angle changes at the three Euler angles.

  • We propose the SDAM that utilizes two different attention modules to enhance feature representation. These two attention modules model features from the spatial dimension and channel dimension respectively to exploit more comprehensive information.

  • The practical application of the proposed method is successfully validated in a classroom setting, providing technical support for improving teaching strategies through the estimation of students’ gaze points.

  • We evaluate our head pose estimation model on two publicly available datasets: AFLW2000 and BIWI. The experiments show that our method achieves competitive performance.

Related work

In this section, we present existing methods for head pose estimation and for estimating students’ gaze points in classroom scenarios. Existing head pose estimation methods can be broadly divided into landmark-based and landmark-free methods.

Landmark-based methods

Landmark-based methods rely on the accurate detection of facial landmarks and their correspondence with the 3D reference model. The methods based on 2D landmarks16,17,18 are highly dependent on the accuracy and robustness of landmark detection, and the accurate localization of landmarks directly determines the performance of head pose estimation. In addition, to achieve accurate pose alignment, the detected head pose model needs to be as similar as possible to the corresponding 3D model19,20. However, when facing changes in lighting, facial expressions, and partial occlusion, the performance of landmark detection can degrade, which in turn affects the accuracy of head pose estimation. To address the limitations of 2D landmark-based methods, researchers have proposed head pose estimation methods based on 3D landmarks21,22,23. Unlike 2D landmark-based methods, which rely on the position of landmarks in 2D images, 3D landmark-based methods use feature points directly in 3D space to determine head pose, thereby more accurately capturing the rotation of the head in space.

Landmark-free methods

Landmark-free methods extract global features directly from images using deep learning models and estimate head pose parameters in an end-to-end fashion. Since these methods do not rely on landmark detection, they are more robust against occlusion, large-angle rotations, and complex lighting conditions. HopeNet24 is a classic method for predicting Euler angles that combines classification and regression: it discretizes the range of head pose angles and optimizes a combination of cross-entropy and mean squared error losses. Subsequently, QuatNet25 adopts a quaternion representation to predict the head pose, thus avoiding the singularity problem of Euler angles. FSA-Net26 aggregates features with fine-grained spatial structure through progressive stage fusion and proposes a network with stage-wise regression and a feature aggregation scheme to predict Euler angles. 6DRepNet27 proposes a six-dimensional rotation representation that significantly improves the accuracy and stability of head pose estimation in complex environments, overcoming the limitations of traditional rotation representations.

Fig. 2 Overview of the proposed method.

Estimation of students’ gaze points

By estimating students’ gaze points, teachers can gain a clearer understanding of students’ focus and comprehension of the lesson content, which helps adjust teaching strategies in a timely manner and improves teaching quality. In a video instructional environment, Kok et al.28 use eye tracking to record students’ gaze data while watching videos. They employ visualization and statistical modeling to predict students’ understanding of the content. Soares Jr et al.29 use eye-tracking technology in elementary school classrooms to collect real-time gaze data from students and present this information to teachers, enabling them to better understand student concerns during the instructional process. Xu et al.30 use eye-tracking and object-detection techniques to automatically analyze students’ gaze distributions toward blackboards or slides in real classrooms, quantifying students’ attention characteristics and providing data references for instructional evaluation and design. In this paper, we propose a method for estimating students’ gaze points based on head pose, which is capable of estimating gaze points in low-resolution classroom scenes. This method is valuable for identifying students’ attention and evaluating their classroom participation.

Method

In this section, we provide a detailed introduction to our proposed method. The overall network architecture is shown in Fig. 2. We start with a head pose estimation strategy that combines classification and regression. Building upon this method, we propose a strategy for constructing soft labels based on facial landmarks and a Stacked Dual Attention Module. Finally, we outline methods for estimating students’ gaze points in the classroom.

Classification-regression head pose estimation

Among head pose estimation methods, HopeNet24 is a classic method that combines classification and regression to estimate the Euler angles of head pose. It restricts each Euler angle to \(\pm 99^\circ\) and divides the range into bins with a \(3^\circ\) difference between adjacent ones. The resulting bin indices are [0, 1, 2, ..., 65], i.e., 66 categories in total. For each Euler angle, such as yaw, HopeNet uses the following loss for optimization:

$$\begin{aligned} L = H(y,\hat{y}) + \alpha \cdot MSE(\psi ,\hat{\psi }) \end{aligned}$$
(1)

where \(H(\cdot ,\cdot )\) and \(MSE(\cdot ,\cdot )\) represent cross entropy and mean square error loss functions, respectively. y is the one-hot encoded true class label, and \(\hat{y}\) is the predicted probability distribution over classes. \(\psi\) and \(\hat{\psi }\) are the true angle and the predicted angle. \(\alpha\) is a balancing factor that controls the relative importance of the two loss terms.
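As a concrete illustration, Eq. (1) can be sketched in PyTorch as follows. Decoding the continuous angle \(\hat{\psi}\) as the expectation over bin centers, and the exact bin-center values, are assumptions for this sketch rather than details stated above.

```python
import torch
import torch.nn.functional as F

def hopenet_style_loss(logits, bin_label, angle_gt, alpha=2.0):
    """Combined classification-regression loss of Eq. (1) for one Euler angle.

    logits:    (B, 66) raw scores over 3-degree bins
    bin_label: (B,) integer bin index of the true angle
    angle_gt:  (B,) continuous ground-truth angle in degrees
    """
    # Cross-entropy term H(y, y_hat) on the discretized angle.
    ce = F.cross_entropy(logits, bin_label)
    # Decode a continuous angle as the expectation over the bin centers
    # (assumed: bin i spans [-99 + 3i, -96 + 3i) degrees, center at -99 + 3i + 1.5).
    centers = torch.arange(66, dtype=logits.dtype) * 3 - 99 + 1.5
    angle_pred = (F.softmax(logits, dim=1) * centers).sum(dim=1)
    mse = F.mse_loss(angle_pred, angle_gt)
    return ce + alpha * mse
```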

One-hot encoding (hard labels) is simple and intuitive, well suited to discrete classification tasks, and clearly distinguishes different categories. However, it cannot express within-category similarity or the gradual differences between categories, which matters particularly in tasks such as head pose estimation. In contrast, soft labels can model these characteristics better. Next, we introduce the construction process of soft labels.

Soft label construction strategy

Before constructing soft labels, unlike HopeNet, we set the bin size to \(1^\circ\), resulting in a total of K = 199 classes for the yaw angle. In this case, each category corresponds to a discrete angle within the range [\(-99^\circ\), \(99^\circ\)]. The same applies to the other Euler angles. We then use the Basel Face Model (BFM)31 to construct 3D face models, owing to its advantages such as precise controllability and continuous diversity. The number of 3D face models equals the number of classes. From each model, 68 commonly used 3D key points are easily extracted, distributed over key areas such as the facial contour, eyes, and mouth.
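Generating one landmark set per angular class can be sketched as follows. The random points stand in for the 68 BFM landmarks (the BFM itself is licensed), and only the yaw rotation is shown; pitch and roll follow analogously with rotations about the other axes.

```python
import numpy as np

def yaw_rotation(deg):
    """Rotation matrix for a yaw rotation (about the vertical Y axis) in degrees."""
    r = np.deg2rad(deg)
    c, s = np.cos(r), np.sin(r)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

# Random stand-in for the 68 BFM landmarks of the neutral face (pitch = roll = 0).
rng = np.random.default_rng(0)
landmarks = rng.standard_normal((68, 3))

# One landmark set per class: K = 199 discrete yaw angles in [-99, 99] degrees.
angles = np.arange(-99, 100)
models = np.stack([landmarks @ yaw_rotation(a).T for a in angles])
```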

Fig. 3 This figure illustrates the calculation process of the similarity using the displacements of key points relative to the 3D reference face model \(M_0\) (yaw=\(0^\circ\)). The comparison is conducted across head pose models within an angle range of \(-99^\circ\) to \(99^\circ\).

Figure 3 illustrates the process of calculating the similarity using the displacements of key points. In this process, only the yaw angle is considered, so the pitch and roll angles are set to zero. We observed significant differences in facial shapes between the 3D face models \(M_0\) (yaw=\(0^\circ\)) and \(M_1\) (yaw=\(-99^\circ\)), as well as between \(M_0\) and \(M_2\) (yaw=\(99^\circ\)). To quantitatively describe this difference, we introduce the normalized Euclidean distance and exponential decay function to calculate the similarity. Given two 3D face models \(X_1\) and \(X_2\), the formula is defined as follows:

$$\begin{aligned} S_L({X}_1, {X}_2) = \exp \left( - \sum _{i=1}^{68} \frac{\left\| {p}_i^{(1)} - {p}_i^{(2)} \right\| _2}{dist} \right) \end{aligned}$$
(2)
Fig. 4 This figure presents the complete computation process of soft labels for yaw = 0. The obtained similarity vector is used to fit a Gaussian curve with a learnable variance.

where \({p}_i^{(1)}\) and \({p}_i^{(2)}\) represent the 3D coordinates of the i-th landmark point in face models \({X}_1\) and \({X}_2\), respectively. dist is the normalization factor, here the distance between the outer corners of the two eyes. Using Eq. (2) to calculate the similarity between the 3D reference face model at angle \(\psi _j\) and the model sequence over the angle range \(\psi = [\psi _1,\psi _2,...,\psi _K]=[-99^\circ ,...,99^\circ ]\), we obtain a similarity vector:

$$\begin{aligned} s_j= [s_{j,1},s_{j,2},...,s_{j,i},...,s_{j,K}] \end{aligned}$$
(3)

where \(s_{j,i}\) is the similarity between angular categories j and i; j is determined by the specific angle for which the corresponding true soft label is generated. For example, in Fig. 3 we generate the soft label for \(\psi _j=0\), so in this case j is 100. Next, we normalize this vector so that its elements sum to 1:

$$\begin{aligned} s_{j,i} \leftarrow \frac{s_{j,i}}{\sum _{k=1}^Ks_{j,k}} \end{aligned}$$
(4)
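Eqs. (2)-(4) can be sketched as follows. The random landmarks stand in for the 68 BFM key points, dist is a placeholder for the outer-eye-corner distance, and the indices are 0-based, so yaw = 0 sits at index 99.

```python
import numpy as np

def similarity(p1, p2, dist):
    """Eq. (2): exponentially decayed sum of normalized landmark displacements."""
    return np.exp(-np.linalg.norm(p1 - p2, axis=1).sum() / dist)

def yaw_rotation(deg):
    r = np.deg2rad(deg)
    c, s = np.cos(r), np.sin(r)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Random stand-in for the 68 BFM landmarks of the neutral (yaw = 0) face.
rng = np.random.default_rng(0)
base = rng.standard_normal((68, 3))
dist = 1.0  # placeholder for the outer-eye-corner distance

angles = np.arange(-99, 100)                       # K = 199 one-degree classes
models = [base @ yaw_rotation(a).T for a in angles]

j = 99                                             # 0-based index of yaw = 0
s_j = np.array([similarity(models[j], m, dist) for m in models])  # Eq. (3)
s_j /= s_j.sum()                                   # Eq. (4)
```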

Then, as shown in Fig. 4, we model this distribution as a Gaussian distribution with mean \(\mu _j\) and standard deviation \(\sigma _j\). In other words, we fit a Gaussian curve that approximates the shape of the distribution; the standard deviation \(\sigma _j\) is a learnable parameter of the curve fit. The expression for the curve is as follows:

$$\begin{aligned} f(x) = \frac{1}{\sqrt{2\pi }\sigma _j} \exp \left( -\frac{(x - \mu _j)^2}{2\sigma _j^2}\right) \end{aligned}$$
(5)

where \(\mu _j=\psi _j\), and \(\sigma _j\) is obtained through the least squares method. The Gaussian values obtained above do not sum to 1, so normalization is again required. After normalization, we obtain the soft label, represented as follows:

$$\begin{aligned} d_{j,i} = \frac{f(\psi _i)}{\sum _{k=1}^{K} f(\psi _k)},\quad i=1,2,...,K \end{aligned}$$
(6)

For the Euler pitch and roll angles, we also employ the aforementioned strategy to obtain the soft label corresponding to each angle.
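Eqs. (5)-(6) can be sketched as follows. The paper fits \(\sigma _j\) by least squares; a coarse grid search over \(\sigma\) stands in here, and the toy similarity vector is purely illustrative.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Eq. (5): Gaussian curve with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def fit_soft_label(s, angles, mu):
    """Fit sigma to the normalized similarity vector s, then return Eq. (6)."""
    sigmas = np.linspace(0.1, 10.0, 200)
    errors = [np.sum((gaussian(angles, mu, sg) - s) ** 2) for sg in sigmas]
    sigma = sigmas[int(np.argmin(errors))]
    f = gaussian(angles, mu, sigma)
    return f / f.sum(), sigma                      # renormalize to sum to 1

# Toy similarity distribution peaked at yaw = 0 (index 99 of K = 199).
angles = np.arange(-99, 100).astype(float)
s = np.exp(-np.abs(angles) / 3.0)
s /= s.sum()
d, sigma = fit_soft_label(s, angles, mu=0.0)
```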

Fig. 5 This figure illustrates our proposed stacked dual attention module (SDAM), which consists of two sub-modules: the multi-receptive attention module (MRAM) and the channel-wise self-attention module (CSAM). mm represents matrix multiplication.

Stacked dual attention module

In head pose estimation tasks, feature extraction is a critical step, and common feature extractors include convolutional neural networks32, recurrent neural networks33, and autoencoders34. These methods learn highly abstract feature representations by processing images in a layered manner. ResNet-5035 adopts a multi-layer residual structure, enabling it to effectively capture detailed image features while maintaining high computational efficiency and robustness. Therefore, we choose ResNet-50 as our backbone network. ResNet-50 mainly consists of an initial convolutional layer (Conv1) and four residual blocks (Layer1, Layer2, Layer3, and Layer4). We design a Stacked Dual Attention Module (SDAM) that can be easily deployed at various levels of ResNet-50.

The SDAM is divided into two parts: the Multi-Receptive Attention Module (MRAM) and the Channel-wise Self-Attention Module (CSAM), as shown in Fig. 5. MRAM employs three different scales of convolutional kernels (3\(\times\)3, 5\(\times\)5, 7\(\times\)7) to capture contextual information with varying receptive fields through a hierarchical convolutional process. Before each convolution operation, appropriate padding is applied to maintain the feature map dimensions. The extracted multi-scale features are then normalized using Group Normalization (GN) and aggregated to generate an attention map through a sigmoid function. This process can be formulated as:

$$\begin{aligned} M_F = \sigma \big (\text {GN}(\text {Conv}_{3}(F)) + \text {GN}\big (\text {Conv}_{5}(\text {Conv}_{3}(F))\big ) + \text {GN}\big (\text {Conv}_{7}(\text {Conv}_{5}(\text {Conv}_{3}(F)))\big )\big ) \end{aligned}$$
(7)

where \(M_F \in \mathbb {R}^{C \times H \times W}\) is the attention map, and \(\sigma\) is the sigmoid function. The output of MRAM is represented as:

$$\begin{aligned} F^{\prime } =M_F \, \otimes \, F \end{aligned}$$
(8)

where \(\otimes\) denotes element-wise multiplication. CSAM uses a channel-wise self-attention mechanism to explore the inter-channel correlation of the input features. Specifically, the input of CSAM is passed through three \(1\times 1\) depthwise convolutions (DWConv) to generate the Query, Key, and Value, which are reshaped from \(C\times H \times W\) to \(C\times (H\times W)\) before the attention is applied. The attention map is then obtained through average pooling followed by a sigmoid function. The process is formalized as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} Q = \text {Reshape}(\text {DWConv}(F^{\prime }))\\ K = \text {Reshape}(\text {DWConv}(F^{\prime })) \\ V = \text {Reshape}(\text {DWConv}(F^{\prime })) \end{array}\right. } \end{aligned}$$
(9)
$$\begin{aligned} M_{C} = \sigma \left( \text {AvgPooling}\left( \text {Softmax}\left( \frac{QK^{T}}{\sqrt{dim}} \right) V \right) \right) \end{aligned}$$
(10)

where \(M_C\) is the attention map and dim is \(H\times W\). Therefore, the output of CSAM (i.e., the output of SDAM) is:

$$\begin{aligned} F^{''} =M_C \, \otimes \, F^{\prime } \end{aligned}$$
(11)

SDAM expands the receptive field by integrating convolution kernels of different sizes (\(3\times 3\), \(5\times 5\), and \(7\times 7\)), producing multi-scale spatial attention maps that emphasize spatially informative and structure-related regions, and it further exploits channel attention. In addition, like the typical CBAM36 module, SDAM keeps the output dimensions identical to those of its input features. This design ensures high compatibility and easy deployment into existing mainstream network architectures. The effectiveness of SDAM is validated in the subsequent ablation experiments.
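A minimal PyTorch sketch of SDAM, following Eqs. (7)-(11). The number of GroupNorm groups, the sharing of one GN layer across the three scales, and the exact DWConv configuration are assumptions not fixed by the text above.

```python
import torch
import torch.nn as nn

class SDAM(nn.Module):
    """Sketch of the stacked dual attention module (Eqs. 7-11)."""

    def __init__(self, channels, groups=8):
        super().__init__()
        # MRAM: hierarchical 3x3 -> 5x5 -> 7x7 convolutions, padded to keep H x W.
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.conv7 = nn.Conv2d(channels, channels, 7, padding=3)
        self.gn = nn.GroupNorm(groups, channels)
        # CSAM: 1x1 depthwise convolutions producing Query, Key, Value.
        self.q = nn.Conv2d(channels, channels, 1, groups=channels)
        self.k = nn.Conv2d(channels, channels, 1, groups=channels)
        self.v = nn.Conv2d(channels, channels, 1, groups=channels)

    def forward(self, f):
        b, c, h, w = f.shape
        # Eq. (7): multi-receptive-field spatial attention map M_F.
        f3 = self.conv3(f)
        f5 = self.conv5(f3)
        f7 = self.conv7(f5)
        m_f = torch.sigmoid(self.gn(f3) + self.gn(f5) + self.gn(f7))
        f1 = m_f * f                                        # Eq. (8)
        # Eq. (9): Q, K, V reshaped from C x H x W to C x (H*W).
        q = self.q(f1).reshape(b, c, h * w)
        k = self.k(f1).reshape(b, c, h * w)
        v = self.v(f1).reshape(b, c, h * w)
        # Eq. (10): channel-wise attention map M_C, with dim = H*W.
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        m_c = torch.sigmoid((attn @ v).mean(dim=-1, keepdim=True))
        return m_c.unsqueeze(-1) * f1                       # Eq. (11)
```

Because the output shape equals the input shape, the module can be inserted after any ResNet-50 stage without altering the surrounding architecture.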

Training loss

As shown in Fig. 2, we construct separate branches and loss functions for each Euler angle. Below, we take the yaw angle as an example and present its loss function. First, to measure the difference between the soft label and the predicted distribution, we use the KL divergence loss. For a sample x with true yaw angle \(\psi\) and discretized class label j, the corresponding soft label is denoted as \(D=[d_{j,1},d_{j,2},...,d_{j,K}]\). The KL divergence loss is given by:

$$\begin{aligned} KL\left( D, \hat{D}\right) = \sum _{i=1}^{K} d_{j,i} \text {ln} \frac{d_{j,i}}{\hat{d}_{j,i}} \end{aligned}$$
(12)

where \(\hat{d}_{j,i}\) represents the predicted probability of class i for a sample whose true class is j. For a batch of samples, the above formula can be rewritten as:

$$\begin{aligned} L_{\text {kl}} = \frac{1}{M} \sum _{m=1}^M KL\left( D^{(m)}, \hat{D}^{(m)}\right) \end{aligned}$$
(13)

where m is used to indicate the sample index, and M is the batch size. We follow the paradigm of combining classification and regression, defined as follows:

$$\begin{aligned} L_{\text {mse}} = \frac{1}{M} \sum _{m=1}^M\left( \psi ^{(m)} - \hat{\psi }^{(m)} \right) ^2 \end{aligned}$$
(14)

Finally, the total loss for yaw angle is obtained as:

$$\begin{aligned} L_{total} = L_{\text {kl}} + \alpha \cdot L_{\text {mse}} \end{aligned}$$
(15)

where \(\alpha\) represents the regression coefficient.
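The yaw branch loss (Eqs. 12-15) can be sketched in PyTorch as follows. Decoding the predicted angle \(\hat{\psi}\) as the expectation over class angles is an assumption about how the regression term is computed.

```python
import torch
import torch.nn.functional as F

def yaw_loss(logits, soft_labels, angle_gt, alpha=2.0):
    """Total yaw loss of Eq. (15): KL term (Eqs. 12-13) plus alpha * MSE (Eq. 14).

    logits:      (M, 199) raw scores over the 1-degree classes
    soft_labels: (M, 199) ground-truth soft labels D (each row sums to 1)
    angle_gt:    (M,) continuous ground-truth yaw in degrees
    """
    log_pred = F.log_softmax(logits, dim=1)
    # Batch-mean KL divergence between the soft label D and the prediction.
    l_kl = F.kl_div(log_pred, soft_labels, reduction="batchmean")
    # Decode the predicted angle as the expectation over class angles (assumption).
    angles = torch.arange(-99, 100, dtype=logits.dtype)
    angle_pred = (log_pred.exp() * angles).sum(dim=1)
    l_mse = F.mse_loss(angle_pred, angle_gt)
    return l_kl + alpha * l_mse
```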

Fig. 6 Spatial position relationship of the world coordinate system, the camera coordinate system, the student’s head coordinate system, and the auxiliary coordinate system in the classroom.

Students’ gaze point estimation in the classroom

Since there is no publicly available dataset of students’ gaze points in the classroom, we conduct a data collection experiment in a real classroom setting. Specifically, two cameras are placed in the classroom to capture students’ head pose and facial features from multiple angles. The cameras are precisely configured to ensure high-quality image data. To collect diverse gaze points data, volunteers are positioned at different locations in the classroom and asked to sequentially focus on several pre-marked points on the blackboard. Whenever the volunteers shift to a new gaze point, the two cameras simultaneously capture images, recording their head poses. Ultimately, the collected data is used to validate gaze points estimation methods.

Fig. 7 The calculation of the rotation matrix of the student’s head pose in the world coordinate system.

In the student gaze point estimation task, the central step is to compute two key quantities: the rotation matrix of the student’s head pose in the world coordinate system, and the 3D coordinates of the student’s position in the classroom in the world coordinate system. Combining the two, the 3D coordinates of the student’s gaze point in the world coordinate system are computed by geometric reasoning. To do this, we construct four coordinate systems: the world coordinate system (WCS), the camera coordinate system (CCS), the coordinate system of the student’s head (SCS), and the auxiliary coordinate system (ACS), as shown in Fig. 6. From the observer’s point of view, the origin of WCS is located in the upper right corner of the blackboard (the \(Y_W\) axis points vertically downward toward the floor, and the \(Z_W\) axis points toward the students). Two cameras are placed in the classroom, on either side of the blackboard, facing the students. When the student’s head is not rotated, the origin of SCS is the tip of the student’s nose (the \(Y_S\) axis points vertically upward toward the ceiling, and the \(Z_S\) axis points toward the blackboard). Because of the complex rotational relationship between CCS and WCS, and because the blackboard is not in the camera’s field of view, the transformation between these two coordinate systems cannot be determined directly. To solve this problem, we introduce ACS to establish the relationship between CCS and WCS. In the following, we describe the estimation of a single student’s gaze point in more detail.

Our proposed head pose model estimates a student’s head pose and converts the Euler angles into a rotation matrix. We then transform this rotation matrix into WCS, as shown in Fig. 7; that is, we compute the head pose rotation matrix \(R_{WS}\) of SCS with respect to WCS. This requires ACS to link CCS and WCS. First, a chessboard is placed directly in front of the blackboard. From the observer’s point of view, the origin of ACS is located in the upper right corner of the chessboard (the \(Y_A\) axis points toward the blackboard, and the \(Z_A\) axis points vertically upward toward the ceiling), as shown in Fig. 6. By manually marking the chessboard and applying a pose estimation method, the rotation matrix \(R_{AC}\) and the translation vector \(t_{AC}\) of CCS relative to ACS are obtained. By manually measuring the environment, the rotation matrix \(R_{WA}\) and the translation vector \(t_{WA}\) of ACS relative to WCS are obtained. Through the rotation relationships among the coordinate systems, we obtain the rotation matrix of the student’s head pose in WCS as follows:

$$\begin{aligned} R_{WS} = R_{WA} \cdot R_{AC} \cdot R_{CS} \end{aligned}$$
(16)

where \(R_{CS}\) is obtained by applying our head pose estimation model to the student’s image to predict the Euler angles and converting them into a rotation matrix relative to CCS.
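Converting the predicted Euler angles into \(R_{CS}\) can be sketched as follows. The Z-Y-X composition order and the axis sign conventions are assumptions for this sketch; they must match the convention of the head pose labels.

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Convert head pose Euler angles (degrees) into a rotation matrix.

    The Z (roll) - Y (yaw) - X (pitch) composition order and the axis signs
    are an assumed convention, not one specified by the text.
    """
    y, p, r = np.deg2rad([yaw, pitch, roll])
    rz = np.array([[np.cos(r), -np.sin(r), 0.0],
                   [np.sin(r),  np.cos(r), 0.0],
                   [0.0, 0.0, 1.0]])
    ry = np.array([[np.cos(y), 0.0, np.sin(y)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(y), 0.0, np.cos(y)]])
    rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(p), -np.sin(p)],
                   [0.0, np.sin(p),  np.cos(p)]])
    return rz @ ry @ rx
```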

As shown in Fig. 8, we need to determine the WCS 3D coordinates of the student’s position in the classroom. First, we use facial recognition and identity matching techniques to identify the same student in images captured by two cameras from different angles at the same time. After confirmation of identity consistency, we extract the image coordinates of the student’s nose tip from these two images. Subsequently, based on the student’s image coordinates, we use the triangulation method to calculate the 3D coordinates \(P_C\) of the student’s nose tip in CCS. Then, using the coordinate system conversion relationship between CCS and ACS, as well as the conversion relationship between ACS and WCS, we transform the 3D coordinates of the student’s position in the classroom from CCS to WCS, as follows:

$$\begin{aligned} P_W = R_{WA} \cdot R_{AC} \cdot P_C + R_{WA} \cdot t_{AC} + t_{WA} \end{aligned}$$
(17)
Fig. 8 The calculation of the 3D coordinates for the student’s nose tip in the world coordinate system.

where \(P_W\) is the 3D coordinate of the student’s position in the classroom in WCS. Finally, we derive the 3D coordinates of the student’s gaze point in WCS through spatial geometric relationships. Specifically, assume a point \(P_{SZ}\) in SCS with coordinates (0, 0, 1). Its 3D coordinates in WCS are obtained as follows:

$$\begin{aligned} P_{WZ} = R_{WS} \cdot P_{SZ} + P_{W} \end{aligned}$$
(18)

where \(P_{WZ}\) is the 3D coordinate of \(P_{SZ}\) in WCS. Assume that the rotation matrix \(R_{WS}\) and \(P_W\) are represented as follows:

$$\begin{aligned} P_{W} = \begin{bmatrix} t_1 \\ t_2 \\ t_3 \end{bmatrix} \,,\, R_{WS} = \begin{bmatrix} R_{11} & R_{12} & R_{13} \\ R_{21} & R_{22} & R_{23} \\ R_{31} & R_{32} & R_{33} \end{bmatrix} \end{aligned}$$
(19)

From these, we obtain the position of the SCS origin \(O_S(t_1, t_2, t_3)\) in WCS, as well as the position of the point \(P_{WZ}(R_{13} + t_1, R_{23} + t_2, R_{33} + t_3)\) in WCS. Based on these definitions, and noting that the blackboard lies in the plane \(z_W = 0\), we can calculate the 3D coordinates of the student’s gaze point \(P(x_W, y_W, z_W)\) in WCS as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} x_W = t_1 - \frac{R_{13} \times t_3}{R_{33}} \\ y_W = t_2 - \frac{R_{23} \times t_3}{R_{33}} \\ z_W = 0 \end{array}\right. } \end{aligned}$$
(20)
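Given the calibration quantities above, the full chain of Eqs. (16), (17), and (20) can be sketched as follows. The identity calibration matrices in the toy example are purely illustrative stand-ins for the chessboard calibration and manual measurements described above.

```python
import numpy as np

def gaze_point(r_wa, r_ac, r_cs, p_c, t_ac, t_wa):
    """Gaze point on the blackboard plane z_W = 0 (Eqs. 16, 17, and 20)."""
    r_ws = r_wa @ r_ac @ r_cs                            # Eq. (16)
    p_w = r_wa @ r_ac @ p_c + r_wa @ t_ac + t_wa         # Eq. (17)
    t1, t2, t3 = p_w
    r13, r23, r33 = r_ws[0, 2], r_ws[1, 2], r_ws[2, 2]
    # Eq. (20): intersect the gaze ray (the Z_S axis) with the plane z_W = 0.
    return np.array([t1 - r13 * t3 / r33,
                     t2 - r23 * t3 / r33,
                     0.0])

# Toy example: identity calibration, head looking straight at the blackboard.
eye = np.eye(3)
# SCS's Z axis points at the blackboard (opposite to Z_W), so flip Z
# (and X, to keep the rotation right-handed).
r_cs = np.diag([-1.0, 1.0, -1.0])
p = gaze_point(eye, eye, r_cs, np.array([1.0, 1.0, 3.0]),
               np.zeros(3), np.zeros(3))
```

With the head 3 m from the board and no rotation of the gaze ray, the gaze point lands directly in front of the nose tip, as expected.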

Ethics approval and consent to participate

All methods were carried out in accordance with relevant guidelines and regulations. The study was approved by the Human Ethics Committee of Guangxi Normal University, and the ethical review number is 20250810008. Informed consent was obtained from all participants involved in the experiments.

Experiments

In this section, we describe the implementation of our method, followed by validation on a public dataset, comparison with existing methods, and presentation of ablation experiments. Finally, we introduce the practical application of our method for evaluating students’ gaze points in the classroom.

Implementation details

Our method is implemented in PyTorch. In Eq. (5), each yaw angle value has an associated standard deviation. In our experiments, we observe that the variance differences between angle values are relatively small, so we use a fixed standard deviation for all yaw angle values; the pitch and roll angles are processed in the same way. Specifically, the standard deviations for yaw, pitch, and roll are set to \(\sigma _y = 0.997\), \(\sigma _p = 0.939\), and \(\sigma _r = 1.25\). All input images are first detected and cropped using the MTCNN37 face detector, and the cropped images are scaled to 224 \(\times\) 224. To enhance training data diversity and better handle variations in practical scenarios, we use various data augmentation techniques, including random horizontal flipping, random scaling, random cropping, random rotation (from \(-45^\circ\) to \(45^\circ\)), and color processing such as random blurring and random brightness and contrast changes. During optimization, we initialize the learning rate of the Adam optimizer to 1e-4 and halve it every 10 epochs over a total of 80 epochs. This dynamic learning rate schedule improves the model’s convergence in the later stages of training.
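The optimization schedule described above can be sketched in PyTorch as follows; the linear layer is a placeholder for the full network.

```python
import torch

# Sketch of the schedule: Adam at 1e-4, halved every 10 epochs, 80 epochs total.
model = torch.nn.Linear(10, 3)  # placeholder for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(80):
    # ... one training epoch over the batches would run here ...
    optimizer.step()      # placeholder optimizer step
    scheduler.step()
```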

Fig. 9 The images in the first, second, and third rows are from the 300W-LP, AFLW2000, and BIWI datasets, respectively.

Dataset

We use three popular datasets, namely 300W-LP38, AFLW200039 and BIWI40, to train and evaluate our model, ensuring a comprehensive and valuable evaluation of our network. Some samples from these datasets are shown in Fig. 9. 300W-LP and AFLW2000 are collected in an unrestricted field environment, while BIWI is collected in a restricted laboratory environment.

300W-LP dataset is a large pose dataset obtained by applying deformation flipping to the 300W dataset. It contains over 120,000 images and 3837 volunteers, with 68 key points and three head pose angles per image.

AFLW2000 dataset contains the first 2000 images from the AFLW dataset, annotated with ground truth 3D faces and corresponding 68 landmarks. It includes samples with large variations in lighting and occlusion conditions.

BIWI dataset consists of 15,678 images, recorded using Kinect equipment in a laboratory environment, covering 20 different subjects with varying head pose. In this dataset, the head occupies only a small area in the image.

Ablation study

We conduct ablation experiments on the regression coefficient \(\alpha\) in the loss function, testing \(\alpha\) values of 4, 2, 1, and 0.1. For each value, the model is retrained and the Mean Absolute Error (MAE) on the AFLW2000 and BIWI datasets is reported. The experimental results are shown in Table 1. The model exhibits relatively stable performance across \(\alpha\) values and performs best when \(\alpha\) is 2. Therefore, we use this value as the default configuration in all experiments.

Table 1 Regression loss weights on the AFLW2000 and BIWI datasets.

To verify the effectiveness of SDAM and SLCS, we design several ablation experiments, with results shown in Table 2. We choose a method combining classification and regression as the baseline and add or replace our modules on top of it. Without SDAM and SLCS (the baseline), the MAE on the AFLW2000 and BIWI datasets is 4.07 and 4.03, respectively. When only SDAM is added to the baseline, the MAE decreases from 4.07 to 3.82 on AFLW2000 and from 4.03 to 3.90 on BIWI. This indicates that SDAM effectively enhances feature representation, improving head pose estimation accuracy on both datasets.

Table 2 Analysis of different module combinations on the AFLW2000 and BIWI datasets.

When only SLCS is used, the MAE decreases from 4.07 to 3.41 on the AFLW2000 dataset and from 4.03 to 3.46 on the BIWI dataset, suggesting that soft labels represent the pose of a face image better than hard labels and significantly improve estimation accuracy on both datasets. The combination of SDAM and SLCS achieves the best performance on both datasets, significantly reducing the MAE compared with the baseline. Specifically, on the AFLW2000 dataset the MAE decreases from the baseline 4.07 to 3.33, a reduction of approximately 18.2%, with errors reduced at all angles; on the BIWI dataset it decreases from 4.03 to 3.38, a reduction of approximately 16.1%. The integration of both modules thus yields the highest accuracy and optimal results.
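The soft-label idea behind SLCS can be roughly illustrated as below: a Gaussian distribution over angle bins whose standard deviation differs per Euler angle, sharper for the harder yaw/pitch and smoother for the easier roll. The bin range, bin width, and \(\sigma\) values here are illustrative assumptions, not the paper's exact SLCS parameters, which are derived from 3D keypoint displacements.

```python
import numpy as np

# Angle bins from -99 to 99 degrees in 3-degree steps (a common
# discretization in head pose estimation; assumed here).
BINS = np.arange(-99, 100, 3)

# Per-angle standard deviations: sharper for the harder yaw/pitch,
# smoother for the easier roll (illustrative values only).
SIGMA = {"yaw": 1.5, "pitch": 1.5, "roll": 3.0}

def soft_label(angle_deg, axis):
    """Gaussian soft label over BINS, normalized to sum to 1."""
    sigma = SIGMA[axis]
    w = np.exp(-0.5 * ((BINS - angle_deg) / sigma) ** 2)
    return w / w.sum()
```

Unlike a one-hot (hard) label, the resulting target spreads probability mass over neighboring bins, encoding the similarity between adjacent angles; the per-axis \(\sigma\) encodes the differing estimation difficulty of the three Euler angles.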

Table 3 Comparisons with other methods on the AFLW2000 and BIWI datasets. All methods are trained on the 300W-LP dataset.

Comparison with state-of-the-art methods

We follow the convention of training on the synthetic 300W-LP dataset and testing on the AFLW2000 and BIWI datasets, with the MAE of the Euler angles as the evaluation metric. We compare our method with the state-of-the-art methods listed in Table 3. 3DDFA38 uses deep neural networks to fit a 3D face model and estimate head pose angles. FAN41 obtains multiscale information for pose estimation by repeatedly fusing feature blocks between network layers. QuatNet25 predicts head pose with a quaternion representation, avoiding the singularity problem of Euler angles. FSA-Net26 introduces a soft-stage regression scheme that combines feature aggregation and achieves excellent performance. FDN42 uses a feature decoupling method to obtain distinct features for different angles. Img2Pose43 estimates the six-degrees-of-freedom face pose directly from the image with an end-to-end deep learning model, achieving efficient and accurate face alignment and recognition. 6DRepNet27 proposes a six-dimensional rotation representation that accurately estimates head pose in unconstrained environments, overcoming the limitations of traditional rotation representations. ASG Learning44 encodes the column vectors of the rotation matrix into an anisotropic spherical Gaussian distribution and uses an adaptive training paradigm. TokenHPE45 implements Transformer-based head pose estimation by introducing novel token learning concepts and directional tokens. HeadDiff46 treats head pose estimation as a denoising diffusion process on the SO(3) manifold, improving pose accuracy and reducing ambiguity by exploiting facial semantic information and cyclic consistency learning. WQuatNet47 frames head pose estimation as landmark-free quaternion regression on a RepVGG-D2se backbone with quaternion-specific losses, enabling full 0-360° orientation prediction without the gimbal lock of Euler angles.
HRHPE48 focuses on facial regions of interest and heterogeneous relations between adjacent poses, combining region-attention feature generation, a rugby-style hierarchical structure, and Transformer-based relation mining to achieve robust head pose estimation.

Fig. 10

Example images with converted Euler angle visualization.

On the AFLW2000 dataset, compared with ASG Learning, our model reduces the MAE from 3.64 to 3.33, an 8.5% improvement; compared with HeadDiff, it reduces the MAE by 6.7%. On the BIWI dataset, our model reduces the MAE from 3.61 (ASG Learning) and 3.46 (HeadDiff) to 3.38, improvements of 6.4% and 2.3%, respectively. In addition, our method exhibits small errors in the yaw, pitch, and roll angles, further demonstrating its accuracy. Fig. 10 provides qualitative comparisons between HopeNet and our method; the visualizations show that our predictions align more closely with the ground truth.
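The relative improvements quoted above follow directly from the reported MAE values:

```python
def rel_improvement(before, after):
    """Relative MAE reduction, in percent."""
    return 100.0 * (before - after) / before

# AFLW2000: ASG Learning 3.64 -> ours 3.33
print(round(rel_improvement(3.64, 3.33), 1))  # 8.5
# BIWI: ASG Learning 3.61 -> ours 3.38, HeadDiff 3.46 -> ours 3.38
print(round(rel_improvement(3.61, 3.38), 1))  # 6.4
print(round(rel_improvement(3.46, 3.38), 1))  # 2.3
```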

Fig. 11

The visualization of students’ gaze points estimation results.

Gaze point estimation for students in classroom scenarios

The proposed head pose estimation model is applied to the task of estimating students' gaze points, and the results are shown in Fig. 11. The first and second rows exhibit smaller errors, mainly because the shorter distance makes the depth estimation of the binocular vision system more stable: the disparity signal is stronger, which improves depth accuracy. In contrast, the error in the third row is larger because the greater distance weakens the disparity information and increases the binocular ranging error, reducing depth accuracy. During the coordinate-system conversion, this initial detection error is amplified, increasing the deviation of the predicted gaze points and significantly affecting the accuracy of the results.
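The distance-dependent error described above follows from the standard stereo depth model \(Z = fB/d\): differentiating with respect to the disparity gives \(|\Delta Z| \approx Z^2/(fB)\,|\Delta d|\), so depth error grows quadratically with distance for a fixed disparity error. A minimal sketch, with the focal length, baseline, and disparity error as assumed values rather than our system's calibration:

```python
def stereo_depth(f_px, baseline_m, disparity_px):
    """Depth from a rectified stereo pair: Z = f * B / d."""
    return f_px * baseline_m / disparity_px

def depth_error(f_px, baseline_m, depth_m, disparity_err_px):
    """First-order depth error for a disparity error: |dZ| ~ Z^2 / (f*B) * |dd|."""
    return depth_m ** 2 / (f_px * baseline_m) * disparity_err_px

# Assumed camera: 700 px focal length, 0.12 m baseline, 0.5 px disparity error.
f, B, dd = 700.0, 0.12, 0.5
for Z in (2.0, 4.0, 8.0):  # near, middle, and far student rows
    print(f"Z = {Z} m -> depth error ~ {depth_error(f, B, Z, dd):.3f} m")
```

Doubling the distance quadruples the first-order depth error, which is consistent with the larger gaze-point deviations observed in the third row of Fig. 11.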

Conclusion

In this paper, we propose a new soft-label construction strategy for the head pose estimation task. The strategy utilizes the displacements of 3D key points across different angles, and the constructed soft labels adopt different variances for the three Euler angles: a larger variance yields a smoother distribution, suited to the easily estimated roll angle, while a smaller variance yields a sharper distribution, better suited to the more challenging yaw and pitch angles. Additionally, we propose the SDAM, which contains two sub-modules, MRAM and CSAM, and effectively enhances feature representation, as demonstrated in our ablation study. Experiments on the public AFLW2000 and BIWI datasets show that our method achieves competitive performance compared with other approaches in the head pose estimation task. Furthermore, we extend our method to compute students' gaze points in classroom scenarios, successfully inferring their gaze positions and providing technical support for attention analysis in intelligent education.