Introduction

Many animals are capable of producing a wide range of behavioural expressions, such as variations in body posture, gaze direction, and facial expressions1,2,3,4. These signals provide a continuous flow of information and can play a crucial role in social interactions, allowing individuals to convey intentions, communicate with others, and potentially express internal states5,6,7. However, studying such behaviours in detail remains challenging. Manual behavioural coding is one of the most time-intensive tasks in animal science: despite the wealth of potential information carried in an animal’s facial expressions and body movements, decoding this information still relies heavily on manual coding, typically requiring specialised and trained observers, frame-by-frame inspection, and considerable effort, which is difficult to scale to larger datasets or across studies8. In response to this bottleneck, and with the increasing use of computational methods in animal behaviour research, there is growing interest in automating the analysis of behavioural expressions using machine learning and computer vision9. This is particularly relevant when studying potentially subtle or short-lived indicators of affective states, such as emotions or pain10,11,12,13.

Emotions are complex states involving physiological, cognitive, and behavioural components14. These states are internal experiences facilitating adaptive responses and are thus expressed through observable behaviours, providing critical information about the animal’s well-being and intentions15,16. In this way, even changes in facial expressions become a key medium through which animals communicate their internal states to others.

Human emotion research has focused on facial expressions for many decades, as they provide a non-invasive, measurable way to recognise emotional states17. Similarly, most mammalian species produce facial expressions, and, as in humans, they are assumed to convey information about emotional states, motivations and future intent4,18. Therefore, new technologies in facial analysis in animals can lead to new indicators for measuring animal affective states and can revolutionise the way we assess, study, and interpret subjective states, such as stress, emotions, and pain, in domestic species.

Interest in domestic dogs’ behaviour and cognition has increased significantly over the last 30 years19. Dogs also serve as valuable clinical models for numerous human disorders20 and are commonly used in research on domestication21, attachment22, cognition, and more, leading to the multi-disciplinary research field of canine science. In this field, emotions are of increasing interest, addressing both the production of facial expressions by dogs in emotional states, representing potential canine emotion indicators23, and the perception of human emotions by dogs24 and of canine emotions by humans25. Moreover, due to the remarkable diversity of dog facial expressions26, an increasing number of works address the objective measurement of dog facial expressions as indicators of emotional states in different contexts23,27. Facial expressions have also been studied in the context of understanding dog-human communication (e.g., the impact of dog facial phenotypes on their communication abilities with humans28, the effect of facial features on the ability of humans to understand dogs29, etc.).

The gold standard for objectively assessing changes in facial expressions in human emotion research is the Facial Action Coding System (FACS)30. FACS has recently been adapted for different non-human species, including dogs. The Dog Facial Action Coding System (DogFACS31) has been applied in several studies23,27,28,32,33 to measure facial changes in dogs objectively. However, this method of facial expression analysis depends on laborious manual annotation, which requires extensive specialised human training and certification and may still be prone to at least some level of human error or bias34. Some first steps towards automating the detection of facial movements (action units) in dogs were taken by Boneh-Shitrit et al.35, but much more data is needed for substantial progress.

Facial landmark (keypoint/fiducial point) detection36 offers an appealing, more lightweight but nonetheless highly objective and precise alternative for automated facial analysis. Landmark locations have been shown to provide crucial insights in the context of face alignment, feature extraction, facial expression recognition, head pose estimation, eye gaze tracking, and other tasks37,38,39. They have also been used for automated facial movement recognition systems40,41. An important advantage of landmark-based approaches is their ease of application to video data, producing time series of multiple coordinates, which can then be analysed42, classified43,44,45, or processed for different purposes46,47.

Such uses of landmark time-series data are only beginning to be explored in the domain of animal behaviour, primarily for body landmarks (see, e.g.,48,49,50), while facial analysis remains underexplored. Significant challenges arise mainly from the wide variety of textures, shapes, and morphological structures found across different breeds and species. This is especially true for domesticated animals, such as farm and companion animals, and particularly for dogs, which exhibit the greatest variability in appearance and morphology among all mammalian species51. Moreover, typical datasets with human facial landmarks consist of thousands of images with dozens of landmarks. This abundance of data leads to better model performance, even in challenging scenarios such as occlusions or low-quality images. The animal domain, on the other hand, severely lacks landmark-related datasets and benchmarks, as highlighted in Broomé et al.52. These are just beginning to be developed for species such as cats53, dogs54, horses55, cattle56, and sheep57; however, in most cases, the limited number of instances in the training data, the small number of landmarks, and the lack of justification for their placement in terms of facial muscles make these tools inadequate for capturing the subtle facial changes necessary for emotion or pain recognition. Martvel et al.58 recently addressed this gap for cat facial analysis by introducing a dataset and a detector model for 48 anatomy-based cat facial landmarks. The detector performed well in several tasks requiring subtle facial analysis, such as breed, cephalic type, and pain recognition59,60.

The present study is the first to address these challenges in dog facial analysis by adopting a landmark-based approach. We develop an extensive landmark scheme with 46 facial landmarks grounded in dog facial anatomy and introduce the Dog Facial Landmarks in the Wild (DogFLW) dataset, containing 3732 images of dogs annotated with landmarks and facial bounding boxes. We then utilise different computer vision models to provide a benchmark landmark detection pipeline for the DogFLW. Moreover, we apply and evaluate this pipeline in two case studies related to dog facial analysis from video data: emotional state classification and DogFACS action unit detection.

Related works

In this section, we examine the current animal facial landmark datasets and their organisation. We then briefly present the Facial Action Coding System and its adaptation to dogs, commonly used for measuring facial appearance changes in dogs, and discuss its benefits and shortcomings. Finally, we discuss works on emotion recognition from animal facial expressions and review studies on the automated detection of facial action units.

Animal facial landmarks detection

Most of the existing animal datasets mentioned in this section cover a relatively small number of facial landmarks (fewer than 15). For comparison, popular human facial landmark detection datasets have several dozen landmarks61,62,63. While sufficient for general pose detection, low-dimensional landmark schemes cannot capture subtle facial movements. The advantage of datasets based on such schemes is the large number of instances with high variation in environment and appearance, since annotating images with fewer landmarks requires less time and effort.

Liu et al.54 collected a dataset of 133 dog breeds comprising 8351 dog images, annotated with eight facial landmarks: one landmark per eye, one for the nose, two landmarks on the upper base of the ears, two on the ear tips, and one on the forehead. The amount of collected data is substantial, but the proposed landmark scheme does not allow tracking of mouth and tongue movements, which can play an important role in dog behaviour analysis64.

Khan et al.65 developed the AnimalWeb dataset, which contains 21,900 images annotated with nine facial landmarks. The dataset includes images of various animals, with around 860 images of dogs. Each image is annotated with two landmarks for each eye, one for the nose, and four for the mouth. The proposed scheme does not include any ear landmarks, even though ear movements are also informative for behaviour analysis66.

The horse dataset, developed by Pessanha et al.55, contains horse head poses divided into three groups: frontal, tilted, and side. Each group has its own landmarks: 54 for frontal, 44 for tilted, and 45 for side views. This approach was chosen to deal with self-occlusions but has two issues: first, the annotator has to decide which landmark scheme to use each time, and second, having three schemes for one animal makes automated landmark detection impractical for real-life applications.

Martvel et al.58 created the CatFLW dataset using the 48-landmark cat facial scheme developed by Finka et al.67. This scheme is based on the cat’s facial anatomy and enables various applications, such as the classification of animal pain based on the analysis of facial landmarks60,68.

Other relevant animal datasets53,56,57,69,70,71,72 with facial landmarks are listed in Table 1. There are other datasets suitable for face and body detection, as well as recognition of various animal species, but they do not have facial landmark annotations73,74,75,76,77,78.

From the literature on dog facial landmarks, we can conclude that there are no comprehensive datasets covering the nuances of dog facial anatomy and that existing datasets do not share a common facial landmark scheme, which complicates the training of computer vision models and subsequent behaviour analysis.

Table 1 Comparison of animal facial landmarks datasets.

Facial action coding systems and DogFACS

The benchmark for objectively assessing changes in facial expressions in human emotion research is the Facial Action Coding System (FACS)30. In this system, each facial movement is represented by a variable, either an action unit (AU) or an action descriptor (AD), and is defined based on the underlying muscle activity that produces observable changes in facial appearance. Action units refer to specific, identifiable muscle contractions, while action descriptors are used for movements involving muscle groups or cases where the precise muscular basis is unknown or difficult to isolate. FACS has been extended to animals (AnimalFACS), covering non-human primates and domesticated animals such as horses, dogs, and cats31,79,80,81,82,83. DogFACS31 was developed on the basis of dog facial anatomy84 and includes a total of 21 AUs and ADs tracking movements in the ear, eye, and mouth regions.

Caeiro et al.32 applied DogFACS to assess the spontaneous emotional responses of individuals of different breeds in naturalistic settings using videos from the Internet. Unlike Caeiro et al., Bremhorst et al.23 investigated dogs’ facial expressions of positive anticipation and frustration in a controlled experimental setting, standardising the dog breed (Labrador Retriever). Measuring the dogs’ facial expressions using DogFACS, Bremhorst et al. showed, for example, that the ears adductor variable was more frequently observed in the positive condition, whereas blink, lips part, jaw drop, nose lick, and ears flattener were more common in the negative condition. In a subsequent study, Bremhorst et al.85 replicated their experiment with a new group of dogs and new controlled settings, disentangling expressions likely linked to emotions from those reflecting the underlying motivational state.

Boneh-Shitrit et al.35 used the dataset and DogFACS codings from Bremhorst et al.23 and showed that a machine learning model could classify the emotional state based on manual DogFACS coding with an accuracy of 71%.

Pedretti et al.27 further investigated the relation between dogs’ emotional states and facial expressions across different breeds, replicating the experimental approach of Bremhorst et al. but introducing a social factor (a human experimenter). They found that some movements (ears flattener, blink, and nose lick) are more common in the social context of frustration than in the non-social one.

Sexton et al.28 analysed dog facial movements during social interactions with humans in four contexts—non-verbal: without and with eye contact, and verbal: without and with familiar words. The authors found that dogs display various DogFACS action units more frequently with increased communicative activity, varying in quantity and diversity depending on the dogs’ facial colouration.

Most studies using DogFACS are conducted manually, leaving room for human bias and error. Slight and fast movements can be challenging to spot in long videos, and it is not always easy to determine their boundaries and duration. This motivates the automation of FACS variable detection, which we explore in this study using a landmark-based approach.

Emotional state classification in dogs

There is no consensus on the definition of animal emotions86,87. However, they are often characterised as internal states expressed in physiological, cognitive, and behavioural changes88. Measuring animal emotions is particularly challenging due to their internal and subjective nature and the lack of a verbal basis for communication14. One of the non-invasive ways of doing so is measuring behavioural changes and facial signals, which convey emotional information in most mammals89,90,91,92. Recently, the number of studies addressing emotion recognition in animals has been growing. Broomé et al.52 provide a comprehensive survey of more than twenty computer vision-based studies on recognising animal pain and emotional states. Below, we review the most relevant works that focus on dogs.

Boneh-Shitrit et al.35 used the dataset collected by Bremhorst et al.23 to compare two different approaches to emotional state classification from facial expressions: a machine learning model based on DogFACS coding and a deep learning model operating on a single-frame basis. While the latter reached better performance (above \(89\%\) accuracy), it is less explainable than the DogFACS-based one.

Hernandez et al.93 created a dataset of 7899 images of dogs in four emotional states: fear, contentment, anxiety, and aggression. Computer vision classification models trained on this dataset show good performance (0.67 F1 score), but the quality of the images (all collected from the Internet using keywords) and the reliability of the ground-truth labels (the authors report fair-to-moderate agreement for all classes) remain questionable.

Franzoni et al.94 classified three emotional states in dogs: anger, joy, and neutral, attributing a facial expression to each emotion (growl—anger, smile—joy, sleep—neutral). The authors reported \(\sim\) \(95\%\) accuracy in classifying facial expressions, stating that the created computer vision model “is indeed able to recognise in dog images what humans commonly identify as dog emotions”94. Note that the authors framed the task as emotion classification, although what they actually classified were facial expressions. We agree that emotional state classification can be based on facial expressions, but it is essential to differentiate between a direct correspondence and a statistical correlation between facial expressions and emotions. In a subsequent paper, Franzoni et al.95 used the dataset created by Caeiro et al.32, which comprises videos of dogs in states of happiness, positive anticipation, fear, frustration, and relaxation. Using different preprocessing techniques and FACS coding, the authors achieved \(62\%\) accuracy in emotional state classification. They also pointed out that the environment could bias deep learning methods for classification in the wild. For instance, a computer vision model might classify a dog on a couch as sleeping not necessarily because of its current activity but because the training data contains a prevalence of sleeping dogs on couches. Franzoni et al. used various techniques to mitigate bias, including facial and body bounding box cropping and segmentation. Their findings indicate that the best classification performance was achieved using facial bounding boxes, which aligns with our face-focused approach.

AnimalFACS event detection

Another essential task in animal behaviour analysis is movement detection and recognition. Since animals cannot explicitly communicate their emotions and pain verbally, movement detection offers a non-invasive way to infer an animal’s emotional state. Facial movements pose particular challenges, as they can be subtle or ambiguous (such as half-blinks or lip corner movements) and are hard to delineate clearly in real-world data. The abovementioned AnimalFACS systems describe facial movements using unified schemes for specific species, allowing different researchers to operate in the same terms, but this approach requires an enormous amount of manual annotation work.

Automated movement detection and recognition is a well-developed field in humans96,97,98,99,100,101. In animals, such automation is still lacking and only beginning to appear.

Morozov et al.102 classified six MaqFACS variables in macaques, reporting \(81\%\) and \(69\%\) accuracy for the upper and lower face parts within one individual, \(75\%\) and \(43\%\) accuracy across individuals of the same species (Macaca mulatta), and \(81\%\) and \(90\%\) accuracy across species (Macaca fascicularis). The studied variables had different frequencies within the dataset used, so the authors performed undersampling of video frames, obtaining 1213 and 310 images per variable for the upper and lower face parts, respectively. Additionally, Morozov et al. demonstrated an application of automated movement detection, performing a behaviour analysis on videos of macaques reacting to the appearance of other individuals.

Li103 classified nine EquiFACS variables from horse images, with F1 scores varying from 0.56 to 0.73. Despite demonstrating the potential of classifying horse facial movements from images, the author reports that ear movements cannot be detected without temporal information and suggests using “sequence models for horse facial action unit recognition from videos to learn the temporal information of AUs”.

Boneh-Shitrit et al.35 detected nine manually coded DogFACS variables, sampling frames from manually annotated videos of Labrador Retriever dogs. Performing a “one versus all” classification, the authors achieved detection performance varying from 0.34 to 0.76 F1 score, which correlates almost linearly with the number of training samples. Such an approach allows the detection of frequent FACS variables in videos processed frame by frame, but the deep learning architecture does not allow transferring the trained motion detectors to other breeds.

When creating automated methods to detect animal movements, it is vital to consider the possible bias and validity of such tools104. Varied animal appearances and complications related to real-world environments (such as the low resolution of videos of wild animals) create possible errors in movement detection. In the current study, we utilise the single-breed Labrador Retriever dataset23 for movement detection, largely mitigating variation in dog appearance; the consistent laboratory environment also avoids introducing noise (occlusions, weather, lighting, etc.) into the data. Such an approach to facial movement detection limits the applicability of the created tool but ensures its validity as much as possible.

The DogFLW dataset

In the present study, we have created the DogFLW dataset, inspired by existing ones for humans and animals, to promote the development of automated facial landmark detectors in dogs. Next, we present the processes of developing the landmark scheme and data annotation, as well as benchmark results for landmark detection on the DogFLW dataset.

Dataset

Landmark scheme

The 46 facial landmark scheme was developed by experienced dog behaviour experts and active canine science researchers certified in DogFACS. To establish the number of landmarks and each landmark location, the experts worked independently using two different approaches and then converged through expert consensus. The first approach was based on the anatomy of canine facial musculature and the range of possible expressions and facial movements, ensuring the connection between landmark positions, possible facial movements, and the underlying facial musculature. The second approach was largely inspired by the landmark scheme development for cats67, informed by the CatFACS coding system83. Analogously, the developed dog landmarks were based on DogFACS31.

After the two landmark schemes had been developed independently, the approaches were converged: the experts compared the sets of landmarks, retained those consistent across both approaches, and thoroughly discussed differing landmarks to reach expert agreement on their precise locations. The final landmark scheme is presented in Fig. 1.

As in Finka et al.67, a comprehensive manual was developed detailing landmark placement and its relevance to facial musculature and action units. This manual serves as a guide for accurately annotating canine facial landmarks, ensuring consistency and reliability in future research and applications, and is available in the Supplementary Materials and via the link https://rb.gy/r6srv9.

Fig. 1
figure 1

Annotated Dog’s Face. Image of a dog with 46 facial landmarks.

Annotation

As a source for the DogFLW dataset, we used the Stanford Dog dataset105, which contains 20,580 images of 120 dog breeds with bounding boxes.

First, we selected a random subset of images with an equal number of images per breed. Then, we filtered the images according to the following criterion: the image contains a single, mostly visible dog face. Other dogs could be present, but their faces should not be visible, to avoid ambiguity in detection. Unlike in the CatFLW, we included images with partially occluded faces, estimating the positions of landmarks that were not visible.

The resulting subset contains 16–53 images per breed (31 on average) of all 120 breeds (3732 images total), ranging in size from \(100\times 103\) to \(1944\times 2592\) pixels. Dogs in filtered images have different sizes, colours, body and head poses, as well as different surroundings and scales. All images are annotated with 46 facial landmarks using the CVAT platform106 according to the established scheme. Each image is also annotated with a face bounding box, which encompasses the dog’s entire face, along with approximately 10% of the surrounding space. This margin has proven crucial for training face detection models, as it prevents the cropping of important parts of the dog’s face, such as the tips of the ears or the mouth. Figure 2 shows examples of annotated images from the DogFLW dataset.

The dataset was annotated by an experienced data annotator; the annotated samples were then iteratively reviewed by the last author, a certified DogFACS coder, until annotation saturation was reached, i.e., no significant corrections were needed.

To make the annotation process more efficient, we followed an active learning, AI-assisted annotation process58,107,108, dividing the dataset into ten batches and annotating training data using the predictions of a machine learning model that was gradually retrained on previously corrected data. More information on the active learning paradigm can be found in the study109.
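For illustration, the batch-wise loop can be sketched as follows; the callables for prediction, human correction, and retraining are placeholders standing in for the actual CVAT-based workflow and model training code, not part of any released API.

```python
def annotate_with_active_learning(batches, predict_fn, correct_fn, train_fn):
    """Sketch of the batch-wise, AI-assisted annotation loop.

    batches    : list of image batches (DogFLW used ten).
    predict_fn : (model, image) -> predicted landmarks.
    correct_fn : (image, prediction) -> human-corrected annotation.
    train_fn   : list of corrected annotations -> retrained landmark model.
    """
    model, corrected = None, []
    for batch in batches:
        for image in batch:
            # pre-annotate with the current model (first batch is annotated from scratch)
            prediction = predict_fn(model, image) if model is not None else None
            corrected.append(correct_fn(image, prediction))  # human review and correction
        model = train_fn(corrected)  # retrain on all corrected data accumulated so far
    return model, corrected
```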

Fig. 2
figure 2

Examples of annotated images from the DogFLW.

Landmark detection benchmark

The dataset was divided into the train (3252 images) and test (480 images) sets for evaluation. The test set contains an equal number of images of all 120 breeds present in the dataset. In the preprocessing stage, we cropped all the faces by their bounding boxes to provide fair metrics for all models.

We performed landmark detection on a Supermicro 5039AD-I workstation with a single Intel Core i7-7800X CPU (6 cores, 3.5 GHz, 8.25 M cache, LGA2066), 64GB of RAM, a 500GB SSD, and an NVIDIA GP102GL GPU.

Metrics

We use the Normalised Mean Error (\(NME_{iod}\)), which keeps the error relative regardless of the image size or the scale of the face in it. It is commonly utilised in landmark detection110,111,112 and uses the mean absolute error (MAE) as its basis, normalised by the inter-ocular distance (IOD, the distance between the outer corners of the two eyes):

$$\begin{aligned} NME_{iod} = \frac{1}{M \cdot N} \sum _{i=1}^{N} \sum _{j=1}^{M} \frac{\left\| {x_i}^j - {x'_i}^j \right\| _1}{iod_i}, \end{aligned}$$

where M is the number of landmarks per image, N is the number of images in the dataset, \({x_i}^j\) and \({x'_i}^j\) are the coordinates of the predicted and ground truth landmark j in image i, respectively, and \(iod_i\) is the inter-ocular distance in image i.
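For reference, the metric can be computed directly from arrays of predicted and ground truth coordinates; the eye-corner indices below are hypothetical and depend on the ordering of the landmark scheme.

```python
import numpy as np

def nme_iod(pred, gt, left_eye_idx=0, right_eye_idx=1):
    """Normalised Mean Error with inter-ocular distance normalisation.

    pred, gt : arrays of shape (N, M, 2) holding predicted and ground truth
               landmark coordinates for N images and M landmarks.
    left_eye_idx, right_eye_idx : indices of the outer eye corners in the
               landmark scheme (hypothetical values; scheme-dependent).
    """
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    per_landmark_err = np.abs(pred - gt).sum(axis=-1)        # L1 error, shape (N, M)
    iod = np.linalg.norm(gt[:, left_eye_idx] - gt[:, right_eye_idx], axis=-1)  # shape (N,)
    return float((per_landmark_err / iod[:, None]).mean())   # average over all M*N landmarks
```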

Baselines

To measure the performance on the DogFLW, we selected the following models: Ensemble Landmark Detector (ELD)58, DeepPoseKit (DPK)108, DeepLabCut (DLC)113,114,115, and Stacked Hourglass116. These models are widely used in the animal domain and are usually viewed as defaults for animal body/facial landmark detection52,117,118. The training process for all models except ELD was performed using the DeepPoseKit platform108. All models were trained for 300 epochs with a batch size of 16, mean squared error (MSE) loss, the ADAM optimiser, and the optimal parameters for each model (as indicated in the corresponding papers).

Preprocessing

We randomly applied different augmentations (rotation, colour balance adjustment, brightness and contrast modification, sharpness alteration, application of random blur masks, and addition of random noise) to the training data, doubling the size of the training set.
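The augmentation library is not specified above; the snippet below is a possible keypoint-aware pipeline (here with albumentations) mirroring the listed transforms, with parameter values chosen for illustration only.

```python
import albumentations as A

# Keypoint-aware augmentation pipeline mirroring the transforms listed above
# (library choice and parameter values are assumptions, not the original settings).
augment = A.Compose(
    [
        A.Rotate(limit=20, p=0.5),            # rotation
        A.ColorJitter(p=0.5),                 # colour balance adjustment
        A.RandomBrightnessContrast(p=0.5),    # brightness and contrast modification
        A.Sharpen(p=0.3),                     # sharpness alteration
        A.Blur(blur_limit=5, p=0.3),          # random blur masks
        A.GaussNoise(p=0.3),                  # random noise
    ],
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=False),
)

# Usage: out = augment(image=img, keypoints=landmarks_xy); keeping both the
# original and the augmented copy doubles the training set.
```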

Results

The results of landmark detection with different models and backbones on the DogFLW are shown in Table 2.

Table 2 Comparison of landmark detection error on the DogFLW dataset using different detection models.

During the experiments, we noticed that the accuracy of detecting ear landmarks is significantly lower than for other facial parts across all models. This is likely because of the varying shapes and lengths of ears among different breeds, which may include ear types and positions not previously encountered in the training set, especially for floppy or half-floppy ear types. To test this hypothesis, we divided the data into three subsets: one with erect ears (pointy), another with hanging ears (floppy), and the third with other types (half-floppy). When dividing, we were guided by the average breed ear type in a relaxed state, obtaining a ratio of 31:50:19 for the training set and 29:52:19 for the test set. Despite the dataset’s predominance of breeds with floppy ear types, Table 3 shows that the ELD model’s landmark detection accuracy on the test set is lower for such dogs than for dogs with erect and half-floppy ears.

Table 3 Normalised mean error of landmark detection of the ELD model on the test set for dogs with different ear types.

Due to the uneven presence of breeds in the training set, the accuracy of landmark detection for rare breeds may be lower than for frequent breeds. However, it would be incorrect to attribute detection errors solely to the number of samples. The breed itself plays a significant role in detection accuracy, as many breeds have distinct fur length, texture, and facial anatomy that influence the results. Figure 3 shows the image distribution of dogs of each breed in the training set and the ELD’s detection error for the same breeds in the test set. Based on the distribution, it can be observed that breeds with long fur covering facial features or extending on ears are the most difficult to detect accurately (Irish Water Spaniel, Briard, Standard Poodle, Bedlington Terrier, Scottish Deerhound, Komondor, Kerry Blue Terrier). Moreover, we previously demonstrated that detection has a high error on dogs with long, floppy ears, resulting in low accuracy in detecting facial landmarks for breeds with such ears (Basset, Redbone). It is also worth noting that some breeds have non-obvious detection results. In some cases, this could be explained by a significant variation in the position of the ears (Ibizan Hound, Collie, Great Dane) or by their being obscured (Chow), which can cause a significant detection error. It can also be seen that breeds with short and smooth facial fur, such as Cardigan, Kelpie, Miniature Pinscher, Dingo, Chihuahua, and Malinois, tend to have the lowest average detection error, as their facial features are more clearly distinguishable.

Fig. 3
figure 3

The distribution of images in the training set for each dog breed (blue) and the ELD’s detection errors for the same breeds in the test set (red).

We additionally evaluated the generalisability of the ELD model to assess how it performs on unseen breeds. To this end, we conducted an additional experiment using a 2 × 2 design that systematically varied two key morphological features of canine faces: ear type (erect vs floppy) and snout length (short vs normal/long). This resulted in four distinct morphological combinations. For each combination, we selected one representative breed and excluded all images of that breed from the training set, using it only for testing. The selected breeds were: Labrador Retriever (normal/long snout, floppy ears), Boxer (short snout, floppy ears), French Bulldog (short snout, erect ears), and German Shepherd (normal/long snout, erect ears). Accordingly, we generated four subsets, excluding all images of dogs of the selected breed from the training set (one per subset: Labrador Retriever: 26 images, Boxer: 30 images, French Bulldog: 39 images, German Shepherd: 24 images). We then trained four models on the resulting subsets and evaluated them on the original test set. The results, provided in Table 4, demonstrate that excluding one breed from the training data barely impacts the model’s overall performance but reduces performance on that specific breed in the test set. This can be explained by the fact that each excluded breed accounts for only about \(1\%\) of the training data, so the general performance does not drop significantly. However, the varying decrease in performance between excluded breeds could be explained by the presence or absence of breeds with similar morphology in the training set.

Table 4 The normalised mean error of landmark detection for the ELD model, trained on different subsets from DogFLW, is evaluated on the test set. In the case of “included,” all breeds were present in the training set. Conversely, in the “excluded” case, the selected breed was absent from the training set. The error is reported for both the full test set and for the selected breed within the test set.

Methods

To investigate the applications of the landmark-based approach, we use the dataset from Bremhorst et al.23. The dataset contains recordings of 29 Labrador Retriever dogs (248 videos total) in a controlled laboratory setting, inducing two emotional states: positive (anticipation of a food reward) and negative (frustration due to the reward’s inaccessibility). In addition to the labels for positive/negative emotions, the videos were coded using DogFACS.

This dataset provides an excellent opportunity for testing the landmark detector trained on the DogFLW in complex, subtle video analysis tasks related to dog facial behaviour. The specific tasks we chose are (i) emotion recognition (positive/negative) and (ii) DogFACS event (variable) detection.

We used the Google Colab cloud service (https://colab.research.google.com) with the NVIDIA TESLA V100 GPU to train and evaluate the video classification and movement detection models. All models were trained and evaluated using TensorFlow 2.17119, pandas 2.2.3120, and NumPy 2.1.0121.

Emotion recognition

The first classification task is defined as follows: given a video of a dog, classify whether it is in a positive or negative emotional state. To this end, we integrated facial landmark prediction as an intermediate step and used the obtained time series for classification. The pipeline is described in Fig. 4 (left).

Fig. 4
figure 4

Video Classification Pipeline (left). In each video frame, 46 facial landmarks are detected, which are then normalised and organised into time series. These series are processed using a sliding window and an LSTM classifier model. For each window, the vote is saved, and the decision for the entire series is made based on majority voting. LSTM Classifier Architecture (right). Two bidirectional LSTM layers are followed by four narrowing Dense layers. The model outputs a probability of a landmark window belonging to a certain class.

Landmark detection

We processed all 248 videos from the dataset with the Ensemble Landmark Detector (ELD)58, predicting landmarks on each frame. We chose the ELD model for its performance, introducing two minor changes to the architecture. First, as a face detector, we use a custom-trained YOLOv8122 model. This choice is crucial for video processing, since the original model’s architecture assumes the presence of the animal’s face in the input image; in videos, this is not always the case due to the animal’s movements, head rotations, occlusions, etc. Second, we trained a fully connected verification model that takes landmarks as input and outputs a probability score. This score can be interpreted as the model’s certainty regarding the quality of the obtained landmarks. The verification model consists of three fully connected layers (128, 32, 1) and was trained on the DogFLW landmarks as positive examples and landmarks predicted on random images as negative examples.
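A minimal Keras sketch of such a verification head is given below; the layer widths (128, 32, 1) follow the description, while the activations and training settings are assumptions.

```python
import tensorflow as tf

def build_verification_model(n_landmarks=46):
    """Fully connected verifier: 92 landmark coordinates -> confidence score in [0, 1].
    Layer widths follow the description above; activations, loss, and optimiser
    are assumptions."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_landmarks * 2,)),    # flattened (x, y) pairs
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),     # certainty about landmark quality
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```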

The resulting landmarks from all videos were normalised and saved as a time series database. Each row contains 46 landmark coordinate pairs, as well as the verification model’s confidence. In some frames, landmarks were not detected (the dog was turned away, heavily obscured, etc.), so we assigned zero values to the landmark coordinates and zero confidence for the whole frame.

Preprocessing

We intentionally left frames with no detected landmarks in the obtained time series, without interpolation or filling. The absence of detected landmarks almost always means that the dog is heavily rotated or obscured and its face is mostly not visible. A human annotator can still extract some information from such frames, but this is impossible for a landmark-based model. Interpolation would introduce artificial landmark sequences that are unrelated to the actual video and may contain false signals. When aggregating windows from the time series, we skipped windows with one or more empty frames to ensure the model was trained on informative data.

We split the time series data into training and test sets. The training set was processed with a sliding window of size (L, 92) with a step of \(S=1\). We additionally experimented with sparse windows, which have the same length as regular ones but capture more temporal information by skipping rows. A sparse window can be considered a standard sliding window at a reduced frame rate. For example, a standard window with \(L=5\) at 25 fps captures 0.2 seconds, whereas a sparse window with the same length at 5 fps (3 rows skipped each time) captures 1 second. Such an approach allows investigating the impact of the temporal dimension without increasing computational resources. The resulting aggregated windows were divided into two classes based on the dog’s state in the original video and then balanced by downsampling. We used sparse windows with \(L=5\) and 14 fps for the current experiment (chosen empirically through a grid search), resulting in 11,922 windows in total.
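A minimal sketch of the sparse-window construction, under the assumption that the landmark series is stored as a (T, 92) array with a per-frame confidence vector; windows containing empty frames are discarded as described above.

```python
import numpy as np

def sparse_windows(series, confidence, L=5, skip=0, step=1):
    """Cut a (T, 92) landmark series into sparse windows of shape (L, 92).

    skip : rows skipped between consecutive window frames (skip=0 gives a
           standard sliding window; larger values emulate a reduced frame rate).
    Windows containing any empty frame (confidence == 0) are discarded.
    """
    series, confidence = np.asarray(series, dtype=float), np.asarray(confidence)
    stride = skip + 1
    span = (L - 1) * stride + 1              # frames covered by one sparse window
    windows = []
    for start in range(0, len(series) - span + 1, step):
        idx = np.arange(start, start + span, stride)
        if np.all(confidence[idx] > 0):      # skip windows with undetected frames
            windows.append(series[idx])
    return np.stack(windows) if windows else np.empty((0, L, series.shape[1]))
```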

Metrics

To measure the classification performance, we implemented a majority voting mechanism over the sliding window predictions: the model predicts a class for each window, and the dominant class is then assigned to the video as its label. After obtaining the label for each video, we measured the accuracy, precision, recall, and F1 score of the classification.
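The voting step itself reduces to a few lines; the sketch below assumes per-window sigmoid outputs for the “positive” class.

```python
import numpy as np

def video_label_by_majority(window_probs, threshold=0.5):
    """Aggregate per-window classifier outputs into a single video-level label.

    window_probs : 1-D array of per-window probabilities of the 'positive' class.
    Returns 1 ('positive') if the majority of windows vote positive, else 0.
    """
    votes = (np.asarray(window_probs) >= threshold).astype(int)
    return int(votes.sum() > len(votes) / 2)   # majority vote over all windows
```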

Since the dataset consists of 29 dogs and the same dog being in both training and test sets could introduce data leaks, we used the leave-one-animal-out cross-validation method, commonly applied to such tasks52.

Baseline

To classify the landmark windows, we implemented an LSTM-based model, which consists of two bidirectional LSTM layers123, followed by four Dense layers (see Fig. 4(right)). Taking as input a window of size (L, 92), it outputs a probability of this input window belonging to one of the two classes: positive or negative. The model was trained for 300 epochs with a batch size of 16, MSE loss, and an ADAM optimiser.
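A Keras sketch of such a classifier is shown below; only the overall structure (two bidirectional LSTM layers, four narrowing Dense layers, a probability output) follows the description, while the layer widths and activations are assumptions.

```python
import tensorflow as tf

def build_lstm_classifier(window_len=5, n_features=92):
    """Bidirectional LSTM classifier for landmark windows of shape (L, 92).
    Layer widths and activations are assumptions; the structure follows the
    description above (two bidirectional LSTM layers, four narrowing Dense layers)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window_len, n_features)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # P(window belongs to 'positive')
    ])
    model.compile(optimizer="adam", loss="mse")           # MSE loss and ADAM, as above
    return model
```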

DogFACS event detection

The second task we investigated was the following: given a video, detect movements (events) in the form of various DogFACS action units (and descriptors), and determine the start and end time of the event. In addition to investigating detection performance, we measure its usefulness for computer-assisted DogFACS annotation in a pilot experiment with a DogFACS-certified coder.

The dataset of Bremhorst et al. provides an excellent opportunity in this context as it is DogFACS-coded (start/end time for each of the DogFACS variables). Since DogFACS elements have two types (action unit and action descriptor), they are usually referred to as DogFACS variables. We refer to them in our context as “DogFACS events” for simplicity.

Our approach to this type of event detection is to treat meaningful events as time series anomalies. This is justified by the fact that facial movements are usually localised and can be separated from the animal’s general movement. That is, if we take the relaxed facial expression as the normal state, raising the brows, opening the mouth, and other facial movements disrupt this state, but in a “localised” way. By utilising facial landmarks (which change their coordinates as the face changes its expression), it is possible to translate the occurring movement into coordinate time series, to which classical time series analysis can be applied to detect this movement. For instance, if the dog is moving within the frame, in most cases we can assume that the trajectory of each facial landmark follows the same direction and has a similar pattern to the others. If a relative facial movement occurs, the trajectories of several landmarks deviate from the global pattern, and this deviation can potentially be seen in the coordinate plots.

Therefore, we treat time series obtained from DogFACS events (such as blinking or licking) as anomalies, given that such events are comparably short (typically less than a second) and rarely occur in this dataset. To effectively detect rare movements in videos, we implement an LSTM autoencoder model, which operates on multivariate facial landmark time series obtained from the videos with a landmark detection model (see Fig. 5(left)).

Fig. 5
figure 5

Video Anomaly Detection Pipeline (left). In each video frame, 46 facial landmarks are detected, normalised, and organised into time series. These series are then processed using a sliding window and an LSTM autoencoder model. The autoencoder reconstructs the output sequence and then compares it to the input sequence. An anomaly is detected if the error between the two sequences exceeds a certain threshold. LSTM Autoencoder Architecture (right). The encoder and decoder consist of two bidirectional LSTM layers with ReLU activation and a recurrent dropout.

Dataset

Bremhorst et al. coded various DogFACS variables, as well as head movements, for each video, such as tongue show, head turn left/right, or ears rotator, totalling more than 30 variables. Variables were coded every 0.2 seconds, creating 5-frame sections containing some movement. Some variables have shorter durations but are still coded as 5-frame sections.

All facial variables are categorised according to DogFACS guidelines: Upper Face Action Units, Lower Face Action Units, Action Descriptors, and Ear Action Descriptors. Other coded variables can be found in the original data and comprise body movements (body shake), activities (panting), or gaze/head directions (eyes up or head turn right). There are two main reasons for selecting only variables belonging to the facial categories and discarding the others. First, since we utilise facial landmarks, we must limit ourselves to facial movements. Second, we want to treat events as anomalies rather than trends. By that, we mean that events such as head tilt left describe a state rather than a movement and, most of the time, last for several continuous sections. A future research direction would be to use a landmark-based approach for gaze/head direction classification.

Each category has several variables, each with a specific number of occurrences in the dataset. We excluded variables with fewer than 20 occurrences to ensure meaningful statistical analysis across all variables. Sections without any variable from a specific category are labelled as None. For each variable, we also provide the intercoder reliability (ICR) reported by Bremhorst et al. Table 5 presents the final list of variables used. Although anomalous events are coded as 5-frame sections, some of them have a point-wise nature, occurring rarely and lasting no more than one sequence (blink, upper lip raiser). Later, we refer to such one-sequence anomalies as point anomalies.

Table 5 DogFACS events (variables) from23 used in the current study.

Landmark detection

In the current experiment, we used the landmark series obtained in the previous video classification experiment.

Preprocessing

For each variable M, we divided the normalised landmark series into learning and evaluation parts. To measure the performance correctly, we ensured that each evaluation sequence contained at least one anomaly section of the chosen variable. For the learning part, we sliced the time series into windows of size \(L=15\) (0.6 s, i.e., three 5-frame variable sections) with a step of \(S=1\) (0.04 s). Discarding windows that contain zero-confidence rows (undefined landmarks), we obtained 12,043 normal and anomaly windows of shape (L, 92). An anomaly window contains a full section with the specific variable M, while a normal window may include other variables or none. We then divided the normal windows into train/validation/test in an 80:15:5 proportion, leaving all anomaly windows for evaluation.

Each variable category focuses on a specific part of the animal’s face: Upper Face AU for the eyes, Lower Face AU and Action Descriptors for the nose and mouth, and Ear Action Descriptors for the ears. This approach allows us to select only N landmarks corresponding to the chosen region, reducing the window’s dimensionality to \((L, N \cdot 2)\).

Metrics

Following research on anomaly detection in time series124,125, we used unweighted precision, recall, and F1 scores to evaluate predicted anomaly sequences. When detecting anomalies in satellite signals or electrocardiograms, it is crucial to minimise false positive detections, as they can result in high time and resource costs126. In the case of detecting DogFACS variables, the manual annotation process is time-consuming and requires an expert to inspect every frame. Even a small reduction in annotation time would be beneficial, but missing even one variable (a false negative prediction) could lead to inaccurate statistics for the entire video. This is why we consider precision and recall equally important in the current experiment.

To compute these metrics, we define the confusion matrix components as follows. If a known anomalous section overlaps any predicted anomaly sequence, a true positive (TP) is recorded. If a known anomalous section does not overlap any predicted anomaly sequence, a false negative (FN) is recorded. If a predicted sequence does not overlap any labelled anomalous section, a false positive (FP) is recorded.
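Under this overlap-based definition, the scores can be computed from interval lists as in the sketch below (the interval representation is an assumption).

```python
def overlap_prf(true_sections, pred_sections):
    """Overlap-based precision, recall, and F1 for event detection.

    true_sections, pred_sections : lists of (start, end) frame intervals.
    A ground-truth section overlapping any prediction is a TP, otherwise an FN;
    a prediction overlapping no ground-truth section is an FP.
    """
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    tp = sum(any(overlaps(t, p) for p in pred_sections) for t in true_sections)
    fn = len(true_sections) - tp
    fp = sum(not any(overlaps(p, t) for t in true_sections) for p in pred_sections)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```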

Baseline

As can be seen from Table 5, some variables are present in less than 1% of the data, making supervised detection extremely challenging. Boneh-Shitrit et al.35 chose nine variables for supervised image classification, of which only four belong to anomalies as we define them, while the rest could be classified as trends. The classification results of that study show that only variables with a significant presence in the training data reach considerable detection accuracy.

To mitigate the lack of anomaly sections in the training data, we used a variation of an LSTM autoencoder model127,128. The autoencoder was trained on normal data, learning a latent-space representation of inputs without anomaly M (other variables could be present) and reconstructing the input sequence. The difference between input and output landmark sequences is generally small when no anomalies of that type are present. However, when the autoencoder encounters anomalous data with an unseen variable M, it cannot reconstruct it accurately, leading to a large difference between input and output windows. By measuring the reconstruction error and selecting a threshold, it is possible to determine whether a given window contains an anomalous variable M. Figure 5 (right) shows the model’s architecture.
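A Keras sketch of such an autoencoder is given below; the bidirectional LSTM encoder/decoder with ReLU activation and recurrent dropout follows the description, while layer widths and dropout rates are assumptions.

```python
import tensorflow as tf

def build_lstm_autoencoder(window_len=15, n_features=92, latent_dim=32):
    """LSTM autoencoder for landmark windows; reconstruction error serves as the
    anomaly score. Layer widths and dropout rates are assumptions; the structure
    (bidirectional LSTM encoder/decoder, ReLU activation, recurrent dropout)
    follows the description above."""
    inputs = tf.keras.layers.Input(shape=(window_len, n_features))
    # encoder: compress the window into a latent vector
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        64, activation="relu", recurrent_dropout=0.2, return_sequences=True))(inputs)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        latent_dim, activation="relu", recurrent_dropout=0.2))(x)
    # decoder: reconstruct the input sequence from the latent vector
    x = tf.keras.layers.RepeatVector(window_len)(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        latent_dim, activation="relu", recurrent_dropout=0.2, return_sequences=True))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        64, activation="relu", recurrent_dropout=0.2, return_sequences=True))(x)
    outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_features))(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")   # trained on 'normal' windows only
    return model
```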

Since we wanted to differentiate variables within each category and the autoencoder model’s architecture implies only binary classification (normal/anomaly), we trained a separate model for each variable M. Each model was trained for 200 epochs with a batch size of 16, MSE loss, and an ADAM optimiser.

Postprocessing

After obtaining the mean absolute error (MAE) between the predicted and input windows for each channel of the landmark window, the anomaly score was smoothed with an exponentially weighted moving average129, and a threshold was applied, yielding anomaly instances. Contiguous anomaly instances were then merged into anomalous sequences. We additionally implemented anomaly pruning (\(p = 0.13\))124 and edge masking with a minimum anomaly score125 to reduce the number of false positive detections.

To determine whether a specific error value should be considered an anomaly, we analysed the error distribution on the training set to establish a threshold for each landmark coordinate in the evaluation set. We set the threshold using a \(3\sigma\) criterion (an empirical choice). This allows the precision/recall balance to be adjusted for the DogFACS variable under consideration and follows from the assumption that errors on anomalous data appear as outliers of the non-anomalous error distribution.
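A sketch of this postprocessing chain (smoothing, 3σ thresholding, and merging of contiguous flags) is shown below; anomaly pruning and edge masking are omitted, and the smoothing span is an assumption.

```python
import numpy as np
import pandas as pd

def detect_anomaly_sections(errors, train_errors, span=10):
    """Turn per-window reconstruction errors into anomaly sections.

    errors       : 1-D array of reconstruction errors on the evaluation series.
    train_errors : errors of the same model on the training (normal) data,
                   used to set the 3-sigma threshold.
    span         : smoothing span for the exponential weighted moving average
                   (value is an assumption).
    """
    errors, train_errors = np.asarray(errors, float), np.asarray(train_errors, float)
    smoothed = pd.Series(errors).ewm(span=span).mean().to_numpy()  # EWMA smoothing
    threshold = train_errors.mean() + 3 * train_errors.std()       # 3-sigma criterion
    flags = smoothed > threshold
    # merge contiguous anomalous instances into (start, end) sections
    sections, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            sections.append((start, i - 1))
            start = None
    if start is not None:
        sections.append((start, len(flags) - 1))
    return sections
```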

Results

Emotion recognition

We compared our pipeline’s performance with the classification results on the same dataset of Boneh-Shitrit et al.35, who used DogFACS-based and deep learning approaches. The DogFACS-based approach was performed using manual codings for 39 variables done by Bremhorst et al.23 and using 11 automatically detected DogFACS variables. The deep learning approach was performed using ResNet50130 and Vision Transformer (ViT)131 backbones over the dog facial images cropped from video frames. It can be seen from Table 6 that the deep learning approach outperforms all other classification methods, while the performance of our landmark-based approach is roughly similar to the DogFACS variable-based approaches.

Table 6 Accuracy, precision, recall, and F1 score comparison using different approaches for the video classification task.

Despite outperforming other approaches, the deep learning method relies on appearance and therefore lacks scalability. For instance, a computer vision model trained on a specific appearance (breed, in this case) could perform worse on other appearances, as has been demonstrated for humans132,133. The landmark-based approach, on the other hand, operates on numerical data, which contains no information on fur colouration, length, etc., but retains important morphological details, such as the distance between lips, blink ratio, etc. This allows us to theorise that even if the landmark-based approach performs less accurately in an emotional state classification task on this particular dataset, it may perform at the same level on other datasets, in contrast to the deep learning approach, which could struggle with breed transfer. Due to the lack of annotated data to test this hypothesis, a study of this issue is planned in our future work.

DogFACS event detection

We conducted experiments on the evaluation time series separately for each DogFACS variable. The anomaly detection results are shown in Table 7. We also include the percentage of anomaly sections for each variable in the evaluation set.

Table 7 Precision, recall, and F1 scores for the DogFACS variables detection.

The results show that trend anomalies (continuous series of sequences with a high ratio of occurrence, like ears flattener or ears forward) tend to have lower recall, which means that the model interprets such sequences as normal ones, not being able to distinguish trends, which is a common issue for the anomaly detection models134.

As Wong et al.125 note, anomalies at the sequence’s beginning can reduce the anomaly detection accuracy. We report a high rate of such variables in the current dataset (54% of videos have a variable present in the first 0.2 seconds of the video, so we assume they were cropped that way), as well as a number of potentially informative frames with no landmarks detected (18% across all videos), which is a limitation of a landmark-based approach.

Pilot study for computer-assisted DogFACS coding

In the pilot experiment, we developed a prototype for computer-assisted DogFACS coding based on the trained event detector. Although the detector’s accuracy at this stage is insufficient for fully automated DogFACS coding, the idea is to support the human coder with suggestions of potentially meaningful time periods representing possible events, and to measure the time saved by this assistance compared to fully manual annotation.

Video processing

In the developed prototype, a coloured bar is displayed on top of the processed video, showing the intensity of the aggregated events: blue indicates a lower chance of an event, yellow a higher chance, with a moving slider (red) indicating the current frame on the intensity plot (see Fig. 6). We decided to merge all DogFACS events into one aggregated bar to reduce the cognitive load on the annotator.

Fig. 6
figure 6

The frame from the processed video. The bar at the bottom displays the intensity of events (blue—the low number of events, yellow—the high number of events) in the current frame, highlighted by the moving slider (red).

Experiment

An experienced DogFACS-certified annotator coded 20 previously unseen videos from the dataset. Ten videos were annotated manually, and the other ten were annotated using our prototype enhancement. The raw and enhanced videos were displayed to the annotator in random order to avoid bias. The Behavioural Observation Research Interactive Software (BORIS) annotation tool135 (version 9.3.1) was used for the coding process, as is common in such annotation tasks32,81,136.

Since each video has a different number of DogFACS variables (from one to ten, with 4.2 on average) with different durations, we measured the average time spent on annotating one variable (both start and end) by dividing the total annotation time by the number of annotated variables.

Results

A \(41\%\) reduction in mean annotation time and a \(26\%\) reduction in median annotation time per DogFACS variable was observed across all annotated videos. This pilot result is encouraging, and further work is needed to investigate the best way to enhance the DogFACS coding process to reduce annotation time and improve annotation quality. We plan to do so with more datasets, larger annotator teams (also measuring agreement), and further prototype improvements.

Discussion

This paper introduces the Dog Facial Landmarks in the Wild (DogFLW) dataset, the first large-scale, breed-diverse dataset of 3732 annotated dog images using a 46-point DogFACS-based facial landmark scheme. We provide a benchmark for landmark detection across multiple models and demonstrate the utility of this approach in two applied tasks: emotion recognition from video using landmark time series and the detection of DogFACS facial action units as rare events in time series. These results highlight the feasibility of using a landmark-based approach for fine-grained dog facial behaviour analysis, offering a scalable and explainable alternative to deep learning pipelines.

As expected, the detectors had the most difficulty with ear landmarks due to the significant variance in dog ear shapes, which can be improved by enriching the dataset with more samples of diverse ear types. A similar strategy can be taken to improve the detector’s performance on breeds that were less represented or have complex morphological features. Additionally, ear-aware augmentation techniques, such as simulating occlusions, varying fur texture, or modifying ear orientation (e.g., through semantic style transfer137), may further enhance the model’s generalisation, particularly for dogs with floppy ears or long fur.

We have demonstrated the usefulness of landmark detection for two “proof of concept” tasks requiring a fine-grained and subtle analysis of dog facial behaviour using a dataset from Bremhorst et al.23 of Labrador Retriever dogs only. The first task is a classification of emotional state from videos using landmark time series as an intermediate step, reaching 76% accuracy. While deep learning models presented in Boneh-Shitrit et al.35 outperform our pipeline, the landmark-based approach has the potential benefit of better scalability and explainability while also capturing the temporal dimension (while the deep learning approaches operate on a single-frame basis). We plan to explore this topic more in future work.

The second task we explored as an application of our landmark detector is DogFACS variable recognition. Utilising a landmark-based approach, we were able to detect rarely occurring facial movements, almost non-existent in the dataset, such as nose lick and lip corner puller. However, this approach was shown to perform worse on long-lasting movements, such as ears flattener.

As DogFACS coding is a highly laborious and time-consuming task, automated movement recognition has the potential to enhance the process significantly. To demonstrate this, we developed a pilot DogFACS annotation prototype based on the trained event detectors. The results of our pilot with a certified DogFACS coder (a 41% reduction in mean annotation time per event) further confirm that this direction has the potential to reduce the time and effort of the DogFACS coding process. One possible direction for further reducing annotation time and the annotator’s cognitive load could be to group DogFACS variables by facial part (such as ears, eyes, and mouth) and display anomaly score plots for each region. Future work includes experiments with different ways of displaying the anomaly scores and their impact on annotation time and quality.

A limitation of the conducted case studies is the use of only one dataset, including a single breed. We chose it because, to the best of our knowledge, this is the only dog dataset in which emotional states are experimentally induced in a highly controlled way, with replicated results across samples and contexts. While we acknowledge that this may limit the generalisability of our findings, the choice was intentional, as differences in facial morphology can affect both the production and appearance of facial expressions in dogs. Therefore, focusing on a single, morphologically consistent breed represents a deliberate and systematic first step. By exploring facial behaviour within a well-characterised breed, we aim to establish a solid foundation before extending our methods to more morphologically diverse dog populations. Looking ahead, future datasets should carefully consider how emotional states are annotated and should include a diversity of morphologies and appearances. In addition, we do not assert that the results of the two experiments are universal; rather, we consider them preliminary findings that demonstrate the validity of landmark-based methods. A more thorough investigation into emotion classification and the detection of FACS variables will require further experiments, including exploring a variety of models, conducting hyperparameter searches, and performing extensive ablation studies, among other considerations.

Our main direction for future research is to improve the landmark pipeline, making it generic for all dog breeds and improving performance. With a deep learning approach, such an ambitious goal is hard to achieve, as the training set would need to include all dog breeds, ages, and appearances. A landmark-based approach, however, provides more flexibility, and including several morphological, ear, and cephalic types may be sufficient.

We hope that the DogFLW dataset will aid the scientific community in utilising AI-driven methods to deepen our understanding of our best friends’ behaviour and emotional world and investigate applications for veterinary healthcare, well-being, and sheltered and working dogs domains. In addition, we believe that datasets like DogFLW and CatFLW58 will play a critical role in promoting comparative research and advancing the broader application of affective computing in animal behaviour science.