Introduction

Drowsiness is a factor that significantly degrades the quality of the task at hand, impairing it to varying degrees. It is particularly perilous in areas of daily life where continuous attention is essential for safety. A prime example is the broad field of transportation, spanning road, sea, and air. The National Center for Statistics and Analysis (NCSA) reported 693 deaths in accidents caused by driver drowsiness in 2022 (ref. 1). In aviation, drowsiness is particularly dangerous because it can lead to large-scale crashes, as in the case of Air India Express Flight 812, in which 158 people died as a result of the pilot falling asleep (according to the Gokhale report2).

To minimize the impact of drowsiness and fatigue on transportation safety, drowsiness detection systems are being developed to notify the driver or pilot before the phenomenon occurs. Such solutions help reduce the risks associated with prolonged task performance while increasing worker productivity and efficiency by minimizing errors due to drowsiness.

Methods of detecting drowsiness

Drowsiness detection methods can be classified into three main categories3:

  1. Methods based on vehicle control parameters.

  2. Methods based on physiological parameters.

  3. Methods based on behavioral parameters.

Figure 1 presents a breakdown of the methods used for drowsiness detection and the psychophysical parameters measured for each method. Table 1 summarizes the main advantages and disadvantages of each approach.

Figure 1

Classification of drowsiness detection methods4.

Table 1 Advantages and disadvantages of drowsiness detection methods.

Methods based on vehicle control parameters

Methods for detecting drowsiness based on vehicle control parameters rely on the fact that drowsiness impairs driving performance. The strong correlation between fatigue and vehicle handling has led to several approaches for detection.

Liu, Hosking, and Lenné5 compared different methods of detecting drowsiness by analyzing vehicle control parameters. They identified two of the most effective: the standard deviation of lane position (SDLP) and steering wheel movement (SWM).

In aviation, Morris and Miller6 analyzed variations in aircraft speed, heading, altitude, and vertical speed to calculate error rates. They found that the error rate increased as pilot sleepiness progressed during simulated flights.

Research confirms that vehicle dynamics, such as lane-keeping and steering control, can serve as reliable indicators of drowsiness. For example, lane departures linked to sleep deprivation correlate strongly with fatigue levels7,8. This supports the use of vehicle performance data in drowsiness detection systems.

More recently, machine learning methods have been introduced to assess and predict drowsiness using control parameters. These systems improve upon traditional methods by analyzing real-time data9. Advanced algorithms, such as machine learning-based optimization, further enhance detection by making systems more responsive to abnormal driving behavior10. This marks a shift toward vehicle-centric detection approaches, which emphasize operational parameters rather than only physiological signals.

Despite these advances, an important limitation remains. Detection based on control parameters often occurs only after risky behaviors emerge. This can endanger both the driver and others. For this reason, predictive methods that identify drowsiness before performance declines are especially valuable.

Methods based on physiological parameters

Methods for detecting drowsiness based on physiological parameters are built on the premise that fatigue or drowsiness causes changes in physiological parameters regulated by the sympathetic and parasympathetic parts of the nervous system; by identifying these changes, safety systems can detect the state of drowsiness4. Detection increasingly relies on indicators of brain activity, particularly electroencephalography (EEG). Research indicates that EEG signals can effectively distinguish different states of alertness and fatigue through fluctuations in specific wave patterns. Notably, delta and theta rhythms are associated with reduced vigilance and increased sleepiness, making them vital for drowsiness detection11,12. According to Hu and Lodewijks13, some of the most promising methods are those based on EEG, which involve measuring the spectral power of brain waves in the alpha, beta, and theta bands.

Recent studies have increasingly applied deep learning models to EEG analysis in order to improve accuracy and robustness in fatigue detection. For example, a modified Inception-Dilated ResNet architecture proposed by Alghanim et al.14 addresses the nonstationary nature of EEG signals and enhances inter-channel feature extraction. Using spectrogram representations of EEG recordings, this hybrid model demonstrated improved performance on benchmark datasets such as Figshare and SEED-VIG. These findings highlight the potential of neural architectures to capture both temporal and spatial information from EEG data, surpassing traditional feature-engineering approaches.

Other than EEG, fatigue and drowsiness are associated with changes in the autonomic nervous system (ANS), which regulates key physiological functions such as heart rate, respiration, and ocular and skin responses. In the context of operating remote systems, including unmanned aerial vehicles (UAVs), ANS responses and physiological indicators are particularly relevant due to prolonged cognitive workload and the necessity to maintain vigilance.

Studies on UAV operators have shown that heart rate variability (HRV) parameters measured via electrocardiography (ECG) are significantly affected by the level of automation and cognitive workload during UAV missions15,16. Moreover, research on the psychophysiological state of UAV operators confirms that HRV can serve as an objective biomarker for monitoring operator condition during training and operational tasks17.

Furthermore, measurement of skin conductance (EDA) can serve as an indirect marker of sympathetic activity, exhibiting changes in response to decreased arousal and sensory-cognitive load during prolonged monitoring tasks18. In addition, multi-modal approaches combining ECG, EDA, and respiratory sensors improve the detection of drowsiness and fatigue by identifying alterations such as heart rate stabilization, reduced respiratory amplitude, and decreased tonic EDA during states of lowered alertness18. The integration of these multimodal physiological signals allows for a comprehensive assessment of UAV operators and other remote system operators, providing a promising complement or alternative to EEG-based methods in operational settings.

However, these methods require participants to wear appropriate sensors, which poses practical challenges in operational environments. Like methods based on vehicle control parameters, approaches based on physiological parameters have limitations. Wearing additional devices can reduce comfort and increase stress, activating the sympathetic nervous system and potentially affecting the reliability of measurements. A summary of the discussed parameters is provided in Table 2.

Table 2 Physiological and behavioral markers for sleepiness/fatigue assessment.

Considering these limitations, this work proposes an alternative approach based on behavioral parameters for detecting drowsiness.

Methods based on behavioral parameters

Methods based on behavioral parameters are non-invasive approaches that monitor pilot fatigue by analyzing indicators such as the eye closure time ratio, blink frequency, pupil diameter, saccadic movements (rapid eye movements), and yawning.

The relationship between drowsiness and eye movement behavior has been extensively documented, with studies indicating that decreased gaze stability and prolonged blink duration correlate with increased fatigue levels23,24. Additionally, reaction time assessments revealed that lapses in attentiveness are closely tied to drowsiness states. The Psychomotor Vigilance Test (PVT) has been utilized to pinpoint such lapses, reinforcing the link between behavioral performance and cognitive alertness during driving23,25.

Data for these parameters are analyzed in real time. Relevant features are detected on the pilot’s face and recorded to measure parameters related to drowsiness detection. In the next step, a classifier evaluates the pilot’s drowsiness state. If the pilot is drowsy, the safety system informs the vehicle operator; otherwise, the algorithm continues monitoring until drowsiness is detected26. A diagram of this algorithm is shown in Fig. 2.

Figure 2

Drowsiness detection algorithm based on behavioral parameters26. During application operation, the camera feed is read in real time, and the operator’s face is detected in the image. Based on this, selected features are analyzed to measure parameters related to drowsiness. The detected parameters include EAR, PERCLOS, MAR, yawning occurrence, and Euler head tilt angles (pitch and roll). Using these parameters, the classifier determines whether the pilot shows signs of drowsiness. If drowsiness is detected, the system generates an appropriate alert for the operator and then returns to its initial state.

A study by Morad27 found that mean pupil diameter correlates with subjective feelings of fatigue, so this parameter could potentially be used to detect drowsiness in drivers. However, one problem with recognizing drowsiness from mean pupil diameter is its high sensitivity to light intensity. For this reason, if pupil diameter is selected, its measurement must account for lighting conditions within the drowsiness detection safety system.

Some of the most characteristic symptoms of drowsiness are slow closing of the eyelids and frequent blinking of the eyes. For this reason, Wierwille and Ellsworth created a parameter named the percentage of eyelids closed (PERCLOS) to determine this relationship28. Studies have shown that the PERCLOS parameter increases as drowsiness grows, leading to reduced driver efficiency and slower responses to stimuli29. Therefore, it is one of the most reliable indicators of drowsiness30.

PERCLOS measures the percentage of time the eyelids cover the pupil, indicating slow drooping of the eyelids rather than typical blinking. The parameter is calculated by dividing the time the eyelid covers about 80% of the eye by the total measurement time. It can be expressed by equation (1).

$$\begin{aligned} PERCLOS = \displaystyle \frac{n_{close}}{N_{total}}\cdot 100\% \end{aligned}$$
(1)

where \(n_{close}\) represents the number of frames in which the eyes are closed over a predefined interval and \(N_{total}\) is the total number of frames in the same interval. A higher PERCLOS percentage indicates a higher degree of sleepiness. The typical recommended alarm threshold for the PERCLOS parameter is 15%31. However, Sommer and Golz32 note in their study that measuring only this parameter is insufficient to avoid drowsiness-related accidents.
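As a minimal sketch, equation (1) can be implemented over a per-frame boolean eye-closure signal (here assumed to be derived upstream, e.g. from an eyelid-closure threshold):

```python
def perclos(eye_closed_flags):
    """PERCLOS (eq. 1): percentage of frames with eyes closed in an interval.

    eye_closed_flags: sequence of booleans, one per frame (True = closed).
    """
    if not eye_closed_flags:
        return 0.0
    n_close = sum(eye_closed_flags)          # frames with eyes closed
    return 100.0 * n_close / len(eye_closed_flags)

# Example: 9 closed frames out of 60 -> 15.0, the typical alarm threshold
flags = [True] * 9 + [False] * 51
print(perclos(flags))  # → 15.0
```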

Another parameter that can detect drowsiness in pilots relates to saccadic movements: rapid, leaping movements of the eyeballs that shift the gaze from one point to another. They are used to scan the visual field and are essential for activities such as reading or looking around in the environment. A study by Henn, Baloh, and Hepp33, which shows a relationship between sleepiness and parameters related to saccadic movements, supports their use for this purpose. Schleicher and colleagues showed that saccadic movement duration correlates with driver and pilot sleepiness34.

Diaz-Piedra and colleagues35 conducted a study of the speed of saccadic movements in pilots before and after a helicopter flight lasting more than 2 hours. They noted that the speed of saccadic movements decreased by about 3%. They concluded that the speed of saccadic movements is a promising biomarker for detecting sleepiness.

Yawning, a common phenomenon in humans and animals, is included among the parameters correlated with sleepiness36. It is characterized by a wide mouth opening, deep inhalation, and short exhalation. The reasons for this phenomenon are varied and include lowering brain temperature37, increasing arousal38, and performing social functions39.

As fatigue increases, neck muscles relax, causing involuntary drooping or tilting of the head. These movements reflect reduced alertness and may indicate a state of lethargy40. Continuous monitoring of head tilt can therefore enable earlier detection of worsening fatigue.

In summary, behavioral parameters such as pupil diameter, eyelid closure (PERCLOS), blink rate, saccadic movements, yawning frequency, and head tilt provide reliable non-invasive indicators of drowsiness. While each parameter has its limitations when used in isolation, combining multiple indicators within an integrated system significantly improves detection accuracy. Such multimodal behavioral monitoring offers a practical and effective approach for real-time assessment of driver or pilot alertness.

Classification methods for detecting drowsiness

There are numerous ways of classifying drowsiness based on behavioral parameters. This article presents a selection of methods, most of which have been evaluated on the NTHUDDD dataset41. A broader review of machine learning systems for drowsiness detection is provided by El-Nabi42. Table 3 summarizes the key approaches.

Table 3 Drowsiness detection systems based on behavioral parameters.

Liu et al.43 propose a driver fatigue detection algorithm based on a two-stream network with multi-facial feature fusion. The approach consists of four main steps: (1) locating the eyes and mouth using multi-task cascaded convolutional networks (MTCNNs), (2) extracting static features from partial facial images, (3) extracting dynamic features from partial facial optical flow, and (4) combining static and dynamic features through a two-stream neural network for classification. By focusing on partial facial regions and fusing static and dynamic information, the method emphasizes fatigue-related cues, improving detection performance. Gamma correction is applied to enhance image contrast, particularly improving results in low-light conditions. Evaluated on the NTHUDDD dataset41, the system achieved an accuracy of 97.06%, demonstrating the effectiveness of multi-facial feature fusion combined with two-stream networks for driver fatigue detection.

Rezaee et al.44 present a real-time intelligent alarm system for detecting driver fatigue based on video sequences. The system captures the driver’s face at 15 fps and converts the images from RGB to YCbCr and HSV color spaces. The face region is segmented with high precision, and eye closure is determined using thresholding combined with facial symmetry equations. Yawning frequency is then identified via K-means clustering. Evaluated on four different video sequences totaling 35,000 frames, the system achieved an average accuracy of 93.18% and a detection rate of 92.71%. The high segmentation accuracy, low error rate, and fast processing distinguish this approach, demonstrating its potential for reducing accidents caused by driver fatigue.

Dua et al.45 present an ensemble framework combining FlowImageNet, AlexNet, VGG-FaceNet, and ResNet. By integrating diverse features related to eye blinking, yawning, and nodding, the ensemble improves robustness under varying conditions, such as changes in lighting and background. This approach achieves an accuracy of 85%, which highlights the strength of ensemble learning for generalized detection.

Guo and Markoni46 introduce a hybrid CNN–LSTM model. CNNs are used for spatial feature extraction of eyes and mouth, while a novel Time-Skip Combination LSTM (TSC-LSTM) processes temporal dependencies across multiple time intervals. This reduces prediction noise and improves stability, achieving an accuracy of 84.85% on the NTHUDDD dataset41.

Moujahid et al.47 propose a framework based on handcrafted features, including HOG, covariance, and LBP descriptors. These are extracted from pyramidal multi-level face representations, reduced via PCA, and classified with SVMs. Fusion strategies further improve robustness under difficult conditions, achieving an accuracy of 79.84%.

Finally, Wijnands et al.48 focus on mobile deployment using depthwise separable 3D CNNs optimized for smartphones. Their system integrates early fusion of spatial and temporal information. Although accuracy is lower (73.9%), the lightweight design highlights the feasibility of large-scale, cost-effective applications.

In summary, drowsiness detection systems have evolved from handcrafted feature-based methods to deep learning approaches. Two-stream and hybrid CNN–LSTM networks offer high accuracy by capturing both spatial and temporal facial features, while ensemble frameworks improve robustness under varying conditions. Handcrafted descriptors with SVMs remain competitive in challenging scenarios, and lightweight 3D CNNs enable practical real-time deployment on mobile devices. Together, these methods demonstrate the trade-offs between accuracy, robustness, and computational efficiency in behavioral-based drowsiness detection.

Methods

Program description

The purpose of the system is to detect drowsiness in real time. The input data of the application is a video feed from a live transmission. The output data includes the classification of the pilot’s drowsiness state and drowsiness-related parameters, which are displayed on the Graphical User Interface (GUI) and saved in a comma-separated values (CSV) file for archiving the application’s processed data. The detected parameters include the Eye Aspect Ratio (EAR), the Percentage of Eye Closure (PERCLOS), the Mouth Aspect Ratio (MAR), as well as the Euler head tilt angles: pitch and roll. Figure 3 presents the graphical interface of the application. The system generates visual alerts when the pilot’s drowsiness is detected.

Figure 3

The GUI contains three frames presenting information about the pilot’s drowsiness status. The left frame displays the live feed with a MediaPipe face mesh overlaid on the pilot’s face. The middle frame presents the drowsiness parameters and the pilot’s drowsiness state. The right frame shows a 3D visualization of facial points, highlighting the mouth and eye regions.

During the application execution, image frames are continuously captured from the camera’s live feed. The pilot’s face is then detected within the frame. A face mesh is applied to the detected face. Then, using that face mesh, selected drowsiness parameters are calculated. Based on the value of the parameters, the classifier infers if the pilot is showing signs of drowsiness. If drowsiness is detected, the system generates a visual alert and returns to its baseline state.
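The processing loop described above can be sketched as follows. The helper callables here are hypothetical stand-ins for the application’s actual capture (OpenCV), face-mesh detection (MediaPipe), parameter computation, classification, and alerting steps:

```python
def monitoring_loop(read_frame, detect_face_mesh, compute_parameters,
                    classify_drowsiness, raise_alert):
    """Skeleton of the real-time pipeline: capture -> face mesh ->
    drowsiness parameters -> classification -> alert.

    All five arguments are callables; names are illustrative, not the
    application's actual API.
    """
    while True:
        frame = read_frame()
        if frame is None:
            break                        # feed ended
        mesh = detect_face_mesh(frame)
        if mesh is None:
            continue                     # no face found in this frame
        params = compute_parameters(mesh)  # EAR, PERCLOS, MAR, pitch, roll
        if classify_drowsiness(params):
            raise_alert()                # visual alert, then keep monitoring
```

The loop returns to its baseline state after each alert, matching the behavior described in the text.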

Data analysis

This subsection presents methods for detecting EAR, PERCLOS, MAR, and the Euler head tilt angles. These parameters were selected as the most reliable behavioral indicators.

Other indicators, including yawning, pupil diameter, and saccadic movement speed, were considered but ultimately excluded. A model containing a yawning indicator was trained, but analysis of the feature importance plot showed that this indicator did not contribute to predicting drowsiness. Pupil diameter was excluded because the facial landmark detection technology (MediaPipe) did not provide pupil landmarks; although image processing was considered for pupil diameter detection, the results were inconsistent. Saccadic movements were also evaluated, but the extracted signals were too noisy, and even with filtering, the subtle effect of drowsiness on saccade speed rendered this feature unreliable.

Face mesh detection

In the proposed application, the MediaPipe library was used to detect 478 landmark points on the pilot’s face, following an approach similar to that presented in Lee’s work49. Figure 4 shows which points are used to compute each parameter.

Figure 4

Landmark points for calculating drowsiness parameters: (a) EAR, (b) MAR, and (c) head tilt angles (pitch and roll angles).

EAR detection

After determining the position of the face mesh, the coordinates of the key points related to the eye region were extracted. These points were then connected in pairs to form segments. Two pairs represent the degree of eye openness, whereas one represents the eye’s width. This method allows for the normalization of eye dimensions relative to the camera’s distance.

Equation (2) was then used to calculate the EAR value for each eye50, where \(P_1\) through \(P_6\) are the coordinates of selected points from the face mesh.

$$\begin{aligned} EAR = \displaystyle \frac{\left| \left| P_3-P_4 \right| \right| + \left| \left| P_5-P_6 \right| \right| }{2\cdot \left| \left| P_1-P_2 \right| \right| } \end{aligned}$$
(2)

The final EAR value is calculated as the average of the EAR values for both eyes. Following the literature51, the EAR threshold was set to 0.2: if the EAR drops below 0.2, the application registers the eyes as closed in the given frame; if the EAR is greater than or equal to 0.2, the eyes are registered as open.
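A minimal sketch of equation (2) follows; the point coordinates are illustrative 2-D values, not MediaPipe’s actual mesh indices:

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR (eq. 2): (|P3-P4| + |P5-P6|) / (2 * |P1-P2|).

    P1-P2 spans the eye's width; P3-P4 and P5-P6 span its openness,
    normalizing eye dimensions with respect to camera distance."""
    return (dist(p3, p4) + dist(p5, p6)) / (2.0 * dist(p1, p2))

EAR_THRESHOLD = 0.2  # below this, the frame is registered as eyes closed

# Example: width 4, two vertical openings of 1 each -> EAR = 0.25 (open)
ear = eye_aspect_ratio((0, 0), (4, 0),
                       (1, 0.5), (1, -0.5),
                       (3, 0.5), (3, -0.5))
print(ear, ear < EAR_THRESHOLD)  # → 0.25 False
```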

PERCLOS detection

The PERCLOS value is calculated as an average value over one minute, as described in Cheng et al.’s work50. The total time the eyes were closed is determined using a moving time window method. PERCLOS is calculated via equation (1). The interval for which PERCLOS was calculated was set to 60 seconds.

MAR detection

The proposed system uses the MAR method to determine the degree of mouth opening. MAR is defined as the ratio between the height of the mouth opening and the width of the mouth, and the procedure for determining it mirrors the EAR method. The MAR value is calculated from selected key points in the mouth region using equation (3)52.

$$\begin{aligned} MAR = \displaystyle \frac{\left| \left| P_3-P_4 \right| \right| + \left| \left| P_5-P_6 \right| \right| + \left| \left| P_7-P_8 \right| \right| }{2\cdot \left| \left| P_1-P_2 \right| \right| } \end{aligned}$$
(3)
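Analogously to EAR, equation (3) can be sketched with illustrative coordinates (three vertical pairs instead of two; again, not the actual mesh indices):

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def mouth_aspect_ratio(p1, p2, p3, p4, p5, p6, p7, p8):
    """MAR (eq. 3): (|P3-P4| + |P5-P6| + |P7-P8|) / (2 * |P1-P2|).

    P1-P2 spans the mouth's width; the three remaining pairs span its
    vertical opening at different positions."""
    return (dist(p3, p4) + dist(p5, p6) + dist(p7, p8)) / (2.0 * dist(p1, p2))

# Example: width 6, three vertical openings of 2 each -> MAR = 0.5
mar = mouth_aspect_ratio((0, 0), (6, 0),
                         (1, 1), (1, -1),
                         (3, 1), (3, -1),
                         (5, 1), (5, -1))
print(mar)  # → 0.5
```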

Euler head tilt angle detection

In the proposed system, face mesh points were used to calculate the Euler angles of head tilt. Euler angles include pitch, roll, and yaw. Only pitch and roll were used because they are relevant for assessing the pilot’s drowsiness. The yaw angle was disregarded, as it does not provide useful information about drowsiness and could lead to overfitting in the trained model.

Key points from the face contour were selected to calculate these head tilt angles, and three line segments were created by pairing them. The pitch and roll angles were then calculated using trigonometric transformations, and the final values were obtained by averaging the results over all line segments. Additionally, a moving time window filter is applied to the angle values to produce a more stable result. The pitch angle was calculated using equation (4):

$$\begin{aligned} \psi = \arctan \left( \frac{y_2 -y_1}{z_2-z_1}\right) \end{aligned}$$
(4)

where:

  • \(y_1\) - y coordinate of the first point

  • \(y_2\) - y coordinate of the second point

  • \(z_1\) - z coordinate of the first point

  • \(z_2\) - z coordinate of the second point

The roll angle was computed using equation (5):

$$\begin{aligned} \theta = \arctan \left( \frac{y_2 -y_1}{x_2-x_1}\right) \end{aligned}$$
(5)

where:

  • \(y_1\) - y coordinate of the first point

  • \(y_2\) - y coordinate of the second point

  • \(x_1\) - x coordinate of the first point

  • \(x_2\) - x coordinate of the second point
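Equations (4) and (5) can be sketched as follows. `atan2` is used in place of a plain arctangent to avoid division by zero for degenerate segments, and the averaging helper mirrors the multi-segment averaging described above:

```python
import math

def pitch_deg(p1, p2):
    """Pitch (eq. 4): psi = arctan((y2 - y1) / (z2 - z1)), in degrees.

    Points are (x, y, z) tuples; atan2 avoids division by zero."""
    (_, y1, z1), (_, y2, z2) = p1, p2
    return math.degrees(math.atan2(y2 - y1, z2 - z1))

def roll_deg(p1, p2):
    """Roll (eq. 5): theta = arctan((y2 - y1) / (x2 - x1)), in degrees."""
    (x1, y1, _), (x2, y2, _) = p1, p2
    return math.degrees(math.atan2(y2 - y1, x2 - x1))

def averaged_angles(segments):
    """Average pitch and roll over several face-contour segments, as done
    before the moving-window filter. segments: list of (p1, p2) pairs."""
    pitches = [pitch_deg(a, b) for a, b in segments]
    rolls = [roll_deg(a, b) for a, b in segments]
    return sum(pitches) / len(pitches), sum(rolls) / len(rolls)
```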

Drowsiness classification

This work presents a method based on extracting selected parameters and aggregating them. The suggested method reduces the dimensionality of the input data, enables the separation of the feature detection stage from the classification stage, and minimizes the risk of model overfitting. An additional advantage of this method is the reduction of inference time and the ease of extending the model by incorporating additional parameters.

In the proposed work, the model distinguishes between two pilot states: drowsy and not drowsy. This work focuses on a random forest approach. As defined by Liu53, “Random forests are a combination machine learning algorithm that consists of a series of decision trees. Each tree casts a unit vote for the most popular class, and by combining these votes, the final classification is obtained”. Each tree must be trained on different subsets of training data and attributes.

Model training

The NTHUDDD dataset41 was used to train the model because it includes a relatively large and diverse sample of 36 individuals from various ethnic backgrounds, which minimizes the risk of overfitting to a specific population subset. Some participants wore regular glasses, others wore sunglasses, and the rest wore no eye accessories. Recordings were made both during the daytime and at nighttime, increasing the overall generality of the dataset.

The NTHUDDD dataset41 includes information about whether an individual is drowsy in specific recordings. Still, it does not provide the values of parameters such as EAR, PERCLOS, MAR, and head tilt angles (pitch and roll). Because of that, the application preprocessed the dataset so that the aforementioned parameters were computed for each frame.

A random forest model was trained on the following input parameters: EAR, MAR, and head tilt angles (pitch and roll). The scikit-learn library, which provides methods for constructing decision trees, was used to create and train the random forest model. The proposed random forest model ensembled 100 decision trees, and each decision tree was trained on a bootstrap sample consisting of 60% of the training dataset (sampled with replacement).
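Under the stated configuration (100 trees, each trained on a 60% bootstrap sample), the training setup can be sketched with scikit-learn. The feature matrix below is synthetic, standing in for the per-frame [EAR, MAR, pitch, roll] values computed from the NTHUDDD recordings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for the per-frame feature table [EAR, MAR, pitch, roll];
# the toy label rule (low first column -> drowsy) is illustrative only.
X = rng.random((500, 4))
y = (X[:, 0] < 0.2).astype(int)

# 100 trees, each fitted on a bootstrap sample (with replacement) drawn
# from 60% of the training data, matching the configuration above.
model = RandomForestClassifier(n_estimators=100, bootstrap=True,
                               max_samples=0.6, random_state=0)
model.fit(X, y)
print(f"training accuracy: {model.score(X, y):.2f}")
```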

Drowsiness classification algorithm

In the proposed work, the random forest model is employed selectively. PERCLOS is the primary determinant for classifying the drowsiness state since it is the most scientifically validated parameter for passive drowsiness detection51. If the PERCLOS value is lower than 12.5%, the system classifies the pilot as not drowsy. The pilot is classified as drowsy if the PERCLOS value exceeds 25%. If the PERCLOS values fall between 12.5% and 25%, then the random forest model determines whether the pilot is drowsy. Threshold values were adopted from Hanowski et al.’s work54. To assess their generalizability in an operational environment, functional tests were conducted on a group of five participants. The results support the suitability of the selected thresholds, though the limited sample size should be noted as a potential limitation.

For the pilot to be classified as drowsy, a certain threshold percentage of drowsiness classification needs to be exceeded within a specified time window. Because comparable methodologies are scarce in the scientific literature, the threshold values and the time window range were determined empirically. Satisfactory results were achieved using a 60-frame moving window and a critical threshold of 50% for the pilot to be alerted to potential drowsiness. Figure 5 illustrates the proposed algorithm for classifying the pilot’s drowsiness state.
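The two-stage decision rule and the 60-frame / 50% alert window can be sketched as follows; the random forest prediction is represented by a hypothetical callable:

```python
from collections import deque

def classify_frame(perclos, features, rf_is_drowsy):
    """Per-frame decision: PERCLOS thresholds first, random forest in the
    intermediate band. `rf_is_drowsy` is a callable standing in for the
    trained model's prediction on [EAR, MAR, pitch, roll]."""
    if perclos >= 25.0:
        return True                      # drowsy
    if perclos < 12.5:
        return False                     # not drowsy
    return rf_is_drowsy(features)        # 12.5% <= PERCLOS < 25%

class DrowsinessAlarm:
    """Alert when more than 50% of the last 60 frames are classified drowsy."""

    def __init__(self, window_size=60, threshold=0.5):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, frame_is_drowsy):
        self.window.append(frame_is_drowsy)
        return sum(self.window) / len(self.window) > self.threshold
```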

Figure 5

Proposed algorithm for drowsiness classification. Based on the calculated behavioral indicators (PERCLOS, EAR, MAR, pitch and roll angles), the algorithm proceeds as follows. If the PERCLOS value is greater than or equal to 25%, the operator is classified as drowsy. If the PERCLOS value is less than 12.5%, the operator is classified as not drowsy. For PERCLOS values between 12.5% and 25%, classification is performed using the random forest model, which determines the operator’s drowsiness state based on EAR, MAR, pitch, and roll values.

Results

The random forest model was tested on the NTHUDDD testing dataset41, which was used only after the model’s hyperparameters had been selected, and on the DROZY dataset55.

Partial dependence plots (PDPs) were generated for each feature involved in the model’s decision-making process: EAR, MAR, pitch angle, and roll angle (Fig. 6a–d). The methodology for constructing PDPs was detailed in the review56. These plots illustrate how the model’s predicted classification probability changes in response to variations in a single feature, assuming that all other parameters remain constant. In this context, output values closer to 0 indicate a “drowsy” classification, whereas values closer to 1 correspond to “not drowsy.”
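Conceptually, a PDP sweeps one feature across a grid while holding the others at their observed values and averages the model output. A minimal sketch of this procedure follows (scikit-learn’s `PartialDependenceDisplay` offers an off-the-shelf equivalent):

```python
import numpy as np

def partial_dependence(predict, X, feature_idx, grid):
    """Manual PDP: for each grid value, overwrite one feature column in a
    copy of the data, keep all other features at their observed values,
    and average the model's output over the dataset."""
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = v
        values.append(float(np.mean(predict(X_mod))))
    return np.array(values)
```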

Furthermore, a feature importance plot (Fig. 6e) was generated for the random forest model, illustrating each input parameter’s relative influence on the model’s final classification outcome. Higher importance indicates that a specific feature plays a larger role in the model’s decision.

A confusion matrix (Fig. 6f) was generated to evaluate the random forest model’s performance. This matrix illustrates the agreement between the model’s predictions and the ground-truth labels assigned to the samples. Each cell in the matrix represents the count of instances classified into the corresponding category (“drowsy” or “not drowsy”), with rows denoting the true labels and columns representing the predicted labels.

Figure 6

The first four subplots (a–d) present partial dependence plots (PDPs) derived from the random forest model for (a) EAR, (b) MAR, (c) pitch angle, and (d) roll angle. The x-axis represents the value of the corresponding feature, while the y-axis represents the model output, ranging from 0 to 1; values closer to 0 correspond to the drowsy state, whereas values closer to 1 correspond to the not drowsy state. Subplot (e) shows the feature importance scores for all input parameters used in the model, with higher values indicating greater importance. Subplot (f) presents the confusion matrix evaluating the classifier’s performance in drowsiness detection, where lighter colors indicate fewer classifications in a given cell and darker blue indicates more frequent classifications.

In the scientific literature42, specific metrics are used for quantitative assessment of a model, including accuracy, precision, recall, and the F1-score. Each metric serves a different purpose and examines the model from a different perspective. The model was tested on the NTHUDDD testing set41 and on the DROZY dataset55 to verify the degree of overfitting. For the random forest model, the quantitative metrics obtained on these datasets are summarized in Table 4.
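These metrics follow directly from the binary confusion counts; a minimal sketch (labels: 1 = drowsy, 0 = not drowsy):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = drowsy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```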

Table 4 Values of the quantitative metrics for the random forest model tested on different datasets.

To verify that the application operates in real time, functional tests were performed. The application was executed for one minute while FPS measurements were recorded, and this procedure was repeated twenty times. Tests were conducted on the following hardware:

  • CPU: AMD Ryzen 5 7535HS,

  • RAM: 16 GB,

  • GPU: AMD Radeon RX 6550M.

The average FPS value was \(38.0\pm 0.7\) (standard deviation).

Discussion

Figure 6a shows the PDP for the EAR value. In the EAR range of approximately 0.10 to 0.30, a gradual transition in classification from “drowsy” to “not drowsy” was observed as EAR increased, although the change was not uniform. In the 0.10–0.15 interval, a slight upward trend was noted. Between 0.15 and 0.22, the partial dependence value remained relatively stable, followed by the most pronounced increase in the 0.22–0.30 range. Around an EAR value of 0.25, the transition from the drowsy to the not drowsy state occurred, which largely aligns with the typical EAR threshold value of 0.2 reported in the literature51. Beyond 0.30, further increases in EAR had no impact on the model’s decision.

During testing, EAR values below 0.01 were not observed, even with fully closed eyelids. Minor fluctuations observed at the EAR value of 0.01 (Fig. 6a) may result from inaccuracies in the placement of facial landmarks during the creation of the training dataset. However, these do not have a significant impact on the final classification results.

Figure 6b presents the partial dependence plot showing the relationship between the MAR parameter and the model’s decision. Two characteristic ranges can be distinguished. In the 0–0.27 interval, MAR has a negligible influence on classification, as indicated by a stable partial dependence value around 0.5. In other words, within this range, the model does not interpret minor variations in mouth opening as relevant to drowsiness detection; fluctuations in the curve may result from overfitting to specific samples in the training dataset. A notable decrease in the partial dependence value is observed only in the 0.27–0.40 range, corresponding to the classification of the drowsy state. This suggests that the model considers a more pronounced mouth opening as one of the indicators of drowsiness. This is consistent with the use of yawning as an indicator of drowsiness present in the literature42.
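As a point of reference for the EAR and MAR ranges discussed above, the commonly used eye aspect ratio formulation relates the vertical eyelid distances to the horizontal eye width (MAR is computed analogously from mouth landmarks). The landmark coordinates below are synthetic and only illustrate the geometry; they are not taken from any dataset used in this study.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|), where p1 and p4 are the
    horizontal eye corners and p2, p3, p5, p6 lie on the eyelids."""
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))

# Open eye: vertical eyelid distances comparable to the eye width.
open_ear = eye_aspect_ratio((0, 0), (1, 1), (3, 1), (4, 0), (3, -1), (1, -1))
# Nearly closed eye: eyelid points almost touching.
closed_ear = eye_aspect_ratio((0, 0), (1, 0.1), (3, 0.1), (4, 0),
                              (3, -0.1), (1, -0.1))
```

With these synthetic points the open eye yields an EAR of 0.5 and the nearly closed eye 0.05, illustrating why low EAR values drive the classifier toward the drowsy state.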

Figure 6c presents the partial dependence plot of the model with respect to the pitch parameter. Negative values indicate forward head tilt, while positive values correspond to backward tilt. Excluding observations around approximately \(2^{\circ }\), a degree of symmetry can be identified relative to \(-10^{\circ }\). Analysis of the NTHUDDD41 dataset indicates that the camera was positioned higher than the operator’s head, resulting in a systematic underestimation of pitch values as an upright posture was recorded as a slight forward tilt. This should be kept in mind if one decides to use this dataset for training machine learning models.

Taking this correction into account, it can be observed that around a pitch value of \(-10^{\circ }\) the model is more likely to classify the state as not drowsy. As the value deviates from \(-10^{\circ }\), either toward more negative or more positive angles, the probability of drowsiness classification increases, with backward head tilt leading to a slightly faster transition toward the drowsy state.

It is also worth noting the range between \(2^{\circ }\) and \(6^{\circ }\), where the partial dependence curve rises, indicating a higher probability of a not drowsy classification with more pronounced backward head tilt. This is most likely an effect of overfitting to specific training set examples in which the operator maintained such a head position without exhibiting signs of fatigue.

Furthermore, the pitch range in the NTHUDDD41 dataset spans approximately from \(-24^{\circ }\) to \(6^{\circ }\), which, assuming a systematic error of about \(10^{\circ }\), corresponds to an actual head tilt distribution within \(\pm 15^{\circ }\). This is a relatively narrow range, indicating limited representation of more extreme angles in the training data. Such a limitation may reduce the model’s ability to reliably detect drowsiness in cases where the operator’s head assumes extreme positions.

Figure 6d presents the partial dependence plot of the model with respect to the roll parameter, describing lateral head tilt. Negative values indicate tilt to the right, while positive values correspond to tilt to the left. In the range from \(-3^{\circ }\) to \(-1^{\circ }\), the model classifies the operator as more prone to drowsiness. This may be due to the fact that, in the sample frames, operators typically tilted their heads forward or backward (pitch) with only slight lateral deviation (roll). The curve also shows a degree of symmetry around \(-2^{\circ }\), which may suggest a small systematic error.

When the roll value changes by approximately \(3^{\circ }\) relative to \(-2^{\circ }\), the curve stabilizes, indicating that the model is less sensitive to further deviations along this axis. At the same time, rightward head tilt is more frequently associated by the model with a not drowsy state, which may reflect overfitting to training examples where operators exhibited drowsiness primarily through forward or backward head tilt rather than lateral tilt.

It is also worth noting that the roll range observed in the NTHUDDD41 dataset spans from approximately \(-12^{\circ }\) to \(8^{\circ }\). Considering a possible systematic error of about \(2^{\circ }\), this corresponds to an actual lateral tilt distribution of roughly \(\pm 10^{\circ }\). Such a limited range may result in insufficient representation of cases with more extreme lateral head positions, thereby restricting the model’s ability to accurately detect drowsiness in situations outside the range observed in the training data.

The feature importance plot (Fig. 6e) for the random forest model revealed that the EAR is the most influential parameter in the model’s decision-making process. The impact of the EAR on classification outcomes is approximately 6.85 times greater than that of the MAR, 11.03 times greater than that of the pitch angle, and 32.2 times greater than that of the roll angle.
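Impurity-based feature importance scores like those in Fig. 6e can be obtained from a trained scikit-learn random forest via the `feature_importances_` attribute. The synthetic data below, in which the label is driven almost entirely by the first feature, is illustrative only and does not reproduce the study's importance ratios.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 400
X = rng.uniform(0.0, 1.0, size=(n, 4))  # stand-ins for EAR, MAR, pitch, roll
y = (X[:, 0] < 0.25).astype(int)        # label determined by 'EAR' alone

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = model.feature_importances_  # impurity-based, normalized to sum to 1
```

Because the synthetic label depends only on the first feature, that feature dominates the importance vector, mirroring the dominance of EAR reported above.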

Figure 6f shows the confusion matrix of the model. It can be observed that the model is more likely to predict not drowsy operators as drowsy (false positives) than drowsy operators as not drowsy (false negatives). In the authors’ opinion, this is preferable to the alternative, as undetected drowsiness could result in an accident.

These results indicate that the model exhibits a degree of overfitting to the training set, which stems from certain dataset characteristics. This is further supported by Table 4, which shows that the model performs worse on an unseen dataset, and constitutes one of the limitations of this study. To mitigate this, the training set should be expanded with a broader variety of samples from different datasets, which would reduce bias and improve the model’s generalization capability.

Guidotti et al.56 stated that artificial neural networks function as “black boxes” and lack methods for analyzing the internal workings of the model. In contrast, random forest models, despite having lower performance scores, provide ways to inspect how the algorithm processes data and, as a result, allow the identification of certain biases in datasets, which can subsequently be addressed to improve model performance.

This study has additional limitations. Although the PERCLOS thresholds were tested for generalizability in an operational environment, the sample size was small and not fully representative. Additional data are needed to confirm the validity of the selected PERCLOS values.
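PERCLOS is commonly defined as the proportion of frames within a time window during which the eyes are closed beyond some openness threshold. The sketch below illustrates this definition; the EAR threshold of 0.25 and the sample window are illustrative, not the thresholds evaluated in this study.

```python
def perclos(ear_values, closed_threshold=0.25):
    """Fraction of frames in a window where the eyes are considered
    closed, i.e. EAR falls below the threshold."""
    closed = sum(1 for ear in ear_values if ear < closed_threshold)
    return closed / len(ear_values)

# Illustrative per-frame EAR values for one window:
window = [0.30, 0.28, 0.10, 0.08, 0.12, 0.31, 0.29, 0.09, 0.27, 0.26]
p = perclos(window)  # 4 of 10 frames below threshold -> 0.4
```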

Moreover, the current model performs only binary classification, which limits its sensitivity during the early stages of drowsiness. This restriction may delay detection and increase the risk of accidents. Drowsiness can be more effectively modeled using the Karolinska Sleepiness Scale57, and even introducing a single intermediate class could improve detection performance under real-world operational conditions.

Furthermore, the model has not been evaluated against environmental and subject-specific variables, such as lighting conditions, background clutter, ethnicity, or the presence of glasses and facial hair. Further research is required to assess the influence of these factors on model accuracy, although some hypotheses can be drawn from the present analysis. Since the model relies on numerical values of extracted parameters, its accuracy is inherently dependent on the precision with which these parameters are obtained. In particular, the accuracy of drowsiness detection is affected by the performance of Mediapipe, which combines a face detector with a 3D mesh model applied to facial geometry58. Variations in lighting and background clutter may affect the detector’s performance, while partial obstructions, such as glasses or facial hair, can reduce landmark localization accuracy. This, in turn, reduces the reliability of the derived behavioral parameters and ultimately lowers the accuracy of drowsiness detection. Additional experiments are therefore necessary to quantify the impact of these variables on model performance.

Conclusions

While artificial neural networks dominate current drowsiness detection systems in the scientific literature42,59, this work aims to highlight the benefits that stem from the use of random forest models, particularly in terms of interpretability and bias analysis, which could result in improved performance of subsequent models.

This work could be improved upon by implementing the following:

  1. 1.

    Expanding the model with additional features, such as physiological and UAV control parameters.

  2. 2.

    Training the model on data from different datasets.

  3. 3.

    Verifying the effect of environmental and subject-specific variables on model performance.

  4. 4.

    Incorporating historical data into the model’s inference process by means of recurrent neural networks (RNNs), long short-term memory (LSTM) networks, or transformer-based models.

  5. 5.

    Introducing intermediate drowsiness classes beyond the current binary classification to increase sensitivity to early-stage drowsiness.