Introduction

Methane (CH\(_4\)) constitutes a significant greenhouse gas, accounting for almost one-third of the warming effect currently driving global climate change1,2. Although its atmospheric lifetime is approximately 12 years, significantly shorter than that of carbon dioxide (CO\(_2\)), methane is much more efficient at trapping heat3. Over a 20-year horizon, its warming potential is 86 times greater than that of CO\(_2\), while over a 100-year horizon this potential falls to 25–28 times due to its oxidation into CO\(_2\) and H\(_2\)O1. This efficacy indicates that mitigating methane emissions can yield immediate and substantial climate benefits4.

The agricultural sector is responsible for approximately 40% of anthropogenic methane emissions5. Within this sector, livestock production alone accounts for one-third of global methane emissions, comparable to methane emissions from fossil fuel sources5,6. In particular, ruminants have developed a digestive system adept at breaking down plant-based material. In the rumen, microbes (including bacteria, protozoa, and fungi) ferment carbohydrates and proteins, producing volatile fatty acids, with methane generated as a by-product of this fermentation and released mainly through eructation (belching).

Currently, several techniques allow measurements of methane emissions from ruminants, including respiration chambers, the GreenFeed system, the sulfur hexafluoride (SF6) tracer method, and breath sampling during milking or feeding7,8. Among these, respiration chambers are considered the standard. This method involves confining an animal within a chamber for 2 to 7 days. Methane concentrations are monitored at both the inlet and outlet air vents of the chamber, and the difference between these concentrations is multiplied by the airflow to estimate the methane emission rate8. Unfortunately, this approach subjects the animal to confinement in a non-natural environment, which can induce stress that alters methane emission dynamics and profiles. To address this, the GreenFeed system was designed to collect breath samples from animals in natural grazing environments, barns, and pastures. It captures multiple short-duration samples (3 to 7 minutes each) from individual animals several times a day, as they approach a GreenFeed feeding station. It is a portable standalone device equipped with an extractor fan for active airflow and sensors to monitor head position and other variables to ensure proper breath sampling8. The output data is processed by the manufacturer and made available in real time through an online data management platform9. However, it is an expensive solution that still depends on proprietary, manufacturer-controlled programs.

Recent advances in wearable sensor technology for livestock monitoring have unveiled new opportunities to improve behavioral analysis and overcome the limitations associated with current ruminant methane emission measurement techniques. In10, neck-mounted sensors were used to facilitate real-time tracking of critical indicators, such as posture transitions. Similarly, in11, the authors used triaxial accelerometers to characterize specific behaviors in cattle, including licking and feeding. These methodologies underscore the potential of wearable sensors to produce high-resolution data. Other studies7 use laser methane detectors (LMDs) to estimate reductions in CH\(_4\) emissions. LMDs provide a non-invasive and efficient method for evaluating the impact of dietary strategies on the greenhouse gas emissions of livestock. The work in12 examines LMDs as instruments for quantifying methane emissions in ruminants such as sheep, goats, and cattle. This review addresses several aspects of LMDs, including measurement protocols and data analysis techniques. The author also highlights limitations such as their lower accuracy compared to other methods and the labor-intensive nature of the measurements. On the other hand, Zheng et al.13 propose a method for predicting CH\(_4\) emissions using Bayesian networks that consider the relationships between various factors that influence methane production, such as food intake, dietary composition, and several animal characteristics. Ross et al.14 introduced an innovative method to estimate cattle methane emissions using a predictive model that combines machine learning techniques and statistics, coupling a traditional Linear Mixed Effects (ME) model with a Random Forest (RF) method. This approach demonstrated the potential of machine learning methods to improve the accuracy of CH\(_4\) measurements in the field.
However, this approach falls short, since most ML models used for estimating CH\(_4\) emissions require non-dispersive infrared (NDIR) sensors or expensive respirometry chambers that are impractical for large-scale use.

There is a clear need for reliable, low-cost wearable devices that can accurately measure emissions directly in the field, with little to no disturbance to the animals. In such an approach, battery operation of the wearable device demands a substantial effort to reduce power consumption by keeping most of the electronics in a low-power mode. With this in mind, it becomes paramount to properly detect the start of belching events in order to activate the main electronics of the wearable only when needed and characterize the corresponding methane emissions. However, few studies have been conducted on the use of wearable devices with integrated inertial sensors to detect events and estimate methane emissions directly from cattle burps. In this regard, this work focuses on the development of a wearable device to detect eructation events in livestock by analyzing vibrational data captured by inertial measurement units (IMUs), supporting the proper synchronization and wake-up of the main electronics modules only when needed and saving energy in low-power wearable devices (Fig. 1).

Field data collection was conducted on 7 bovine subjects to build a labeled dataset for training machine learning models that can be deployed on an embedded system, enabling real-time inference directly on the device. The wearable is equipped with IMU sensors placed around the animal’s head in an approximately orthogonal configuration, as depicted in Fig. 2(a), which detect mechanical vibrations induced during belching events. A commercial micro-electromechanical methane gas sensor was used to simplify the annotation process during the data collection stages, enabling a rapid, scalable, and human-independent approach to labeling the dataset. Machine learning (ML) models were trained and evaluated to anticipate eructation events based only on the IMU sensor data, while the CH\(_4\) sensor was used to label significant events (emissions above a certain concentration threshold) in real time. Beyond this labeling step, the CH\(_4\) sensor information is not required during training and testing, and all estimations rely only on the IMU inertial readings.

Fig. 1

Our proposed IoT system for the detection of eructation events based on inertial data.

Materials and methods

This section outlines the design and deployment of the IoT wearable system, including hardware, firmware, mobile applications, and a machine learning model to detect methane emissions in cattle. It highlights the design of the wearable device and the machine-learning pipeline for classifying and characterizing belching events. Each sub-section covers the essential elements, from hardware design to data collection and model validation.

Electronics

The wearable device was designed around a custom printed circuit board (PCB) that integrates sensors and a micro-controller. Its primary function is to collect inertial motion data and methane concentrations from cattle burps and transmit them wirelessly for further analysis. We selected Seeed Studio’s XIAO nRF52840 Sense module for several reasons: its built-in BLE antenna, 6-axis Inertial Measurement Unit (IMU), compact size, memory, and compatibility with Edge Impulse for embedded machine learning (ML) models. To capture vibrations at the animal’s neck, nape, and snout, two additional 6-axis IMUs were added (Fig. 2). An onboard MEMS methane sensor (Figaro\(\copyright\) TGS 2611-C00), powered by a 5V-500mA buck converter, automatically labels motion data. The system is powered by a 3.7V LiPo battery (500mAh), providing up to 12 hours of operation. All components were integrated into a compact 53.5\(\times\)29 mm PCB with flat connectors for the external IMUs and the methane sensor, as shown in Fig. 2.

Fig. 2

(left) Sensor placement on a cow; (right) PCB layout for the wearable device.

Firmware was developed to handle sensor data acquisition, feature extraction, and wireless communication. The accelerometers and gyroscopes of the IMUs capture 16-bit mechanical vibration data, while methane concentration levels are collected from the Figaro\(\copyright\) methane sensor. It is important to note that methane data is required only for labeling and training the ML models, since the system is able to detect burp events without any methane sensor onboard. Key features of the IMU data, such as the minimum, median, sum, maximum, and mean values, are fed to pre-trained TinyML models to estimate burp events, with start and end predictions indicated by an RGB LED on the wearable for visual feedback to the user. Data transmission is conducted over Bluetooth Low Energy (BLE): the device periodically sends the predicted start and end of burp events, together with the corresponding methane concentration and IMU data, to a paired device. The firmware flow is illustrated in Fig. 3.

Fig. 3

Flow diagram of the firmware managing sensor data acquisition, feature extraction, and communication.

Sensor calibration

Fig. 4

(a) Deployment of the CH\(_4\) wearable. (b) Controlled gas chamber setup used for sensor characterization. (c) Sensor response curve showing the ratio \(R_s/R_0\) versus CH\(_4\) concentration, including experimental data with a logarithmic adjustment.

To validate the performance of the methane sensor, a characterization procedure was conducted using a controlled gas chamber setup, as depicted in Fig. 4. Known concentrations of methane mixed with air were introduced at a constant flow rate, and the corresponding sensor voltage output was logged in real time. This process was repeated across a range of methane levels (in ppm) to establish the sensor’s response. The resulting data were used to calculate the resistance ratio (\(R_s/R_0\)), which was compared against the reference curve provided by the sensor’s datasheet15. As shown in Fig. 4(c), the experimental results closely followed the expected logarithmic response. Table 1 contains the curves with the raw values from the initial calibration of the Figaro\(\copyright\) sensor.
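To illustrate how the logged calibration data can be turned into a working conversion from sensor reading to concentration, the following sketch fits a logarithmic curve of the form \(R_s/R_0 = a\ln(C) + b\) and inverts it. The calibration pairs below are hypothetical placeholders, not the actual values of Table 1.

```python
import numpy as np

# Hypothetical (concentration [ppm], Rs/R0) pairs standing in for the raw
# calibration values of Table 1.
CAL = [(500, 1.00), (1000, 0.93), (2000, 0.86), (5000, 0.77), (10000, 0.70)]

def fit_log_response(points):
    """Least-squares fit of Rs/R0 = a*ln(C) + b over the calibration points."""
    c = np.array([p[0] for p in points], dtype=float)
    r = np.array([p[1] for p in points], dtype=float)
    a, b = np.polyfit(np.log(c), r, 1)
    return float(a), float(b)

def ratio_to_ppm(ratio, a, b):
    """Invert the fitted curve to estimate concentration from a measured ratio."""
    return float(np.exp((ratio - b) / a))
```

Because \(R_s/R_0\) decreases with concentration for this sensor family, the fitted slope \(a\) is negative, and the inversion recovers the concentration from the measured resistance ratio.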

Table 1 Raw calibration data for methane concentrations.

Given the relevance of methane concentration measurements to the events under study, significant efforts were made to characterize the sensor behavior under different conditions. Figure 5 shows the resulting characterization curves based on a normalized response, calculated as the relative change in signal \(\frac{V_f - V_i}{V_i}\), where \(V_i\) is the initial sensor signal in equilibrium and \(V_f\) is the final value after exposure to a certain concentration of methane. This normalization allows for a consistent comparison across different concentration levels expressed in parts per million (ppm). Figure 5(a) shows the variation in the Figaro\(\copyright\) sensor’s response curve under different flow rates, ranging from 500 mL/min to 4000 mL/min, while Fig. 5(b) compares the deviation in the sensor’s behavior when exposed to methane versus nitrous oxide introduced as an interfering gas. The curves reflect the repeatability of the tests, each conducted three times using the same sensor; the plotted data therefore also include the standard deviation, providing insight into the consistency and variability of the measurements.

Fig. 5

Summary of the tests performed on the Figaro\(\copyright\) sensor. (a) Variation of the characterization curve at different flow rates, ranging from 500 mL/min to 4000 mL/min. (b) Comparison of the characterization curve using methane versus using nitrous oxide as an interfering gas.

Wearable design

The wearable system was designed with comfort and functionality in mind, using a halter with three strategically placed straps: one over the muzzle, another around the neck, and one under the muzzle. Each of these straps incorporates CAD-designed components optimized to ensure mechanical resistance and stability so that the electronic components, including the inertial measurement nodes, the micro-controller, the battery, and the methane sensor, are properly positioned and protected from mechanical impacts or abrupt movements. The snout node features a 12 cm gooseneck arm to position the methane sensor directly in front of the cow’s nose. A custom housing protects the sensor while allowing optimal readings of methane emissions. The total weight of the system is 300 grams, ensuring minimal impact on the animal’s comfort and behavior during prolonged use. The actual dimensions of each component, as well as the corresponding reference images of the designed parts, are presented in Fig. 6. These illustrations provide a detailed view of the design, assembly, and distribution of the elements within the wearable device.

Fig. 6

(a) wearable design, (b) cow with onboard CH\(_4\) sensor and automatic data labeling of the emissions, (c) cow with external CH\(_4\) sensor and manual data labeling of the emissions.

Data collection HMI

An Android/iOS-compatible application was developed to handle wireless data acquisition (inertial, sound, and methane concentration), storage, and precise time-stamped labeling. The design of the user-friendly graphical user interface (GUI) was guided by mock-ups created in Figma16. The application enables users to search for and connect to available Bluetooth devices, and it provides an interactive interface with real-time graphs and control options for device operation.

Wireless data acquisition was achieved through the Bluetooth protocol, enabling seamless transmission of data from the sensors and the microphone embedded in the wearable system on the cow. We used a single communication service with two characteristics: one for sensor data and the other for event detection. The app was designed to discover and connect to Bluetooth Low-Energy (BLE) devices, ensuring automatic detection of the embedded system and establishing a connection without manual intervention. Once connected, the app maintained a stable communication channel to support real-time data acquisition. The first characteristic transmits a 40-position array containing IMU and microphone data, which is parsed into two-byte segments for visualization and storage. The variables in this array are detailed in Table 2. The second characteristic provides event detection values. All the collected data are stored in a CSV file for later training, validation, and testing of the machine learning model. Real-time graphing was implemented using SwiftUI’s declarative framework and the Charts library, dynamically updating as new data is received. This feature enhances data integrity validation and monitoring during collection. A CSV file is generated at a frequency of 10 Hz, containing gyroscopic, accelerometer, sound, and methane concentration time-stamped data.
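As a sketch of the parsing step on the receiving side, the snippet below unpacks one notification from the first characteristic into its 40 values. The little-endian, signed 16-bit layout is an assumption for illustration; the actual encoding is defined by the firmware and Table 2.

```python
import struct

N_VALUES = 40  # size of the array carried by the first BLE characteristic

def parse_packet(payload: bytes):
    """Unpack one notification payload into 40 integers.

    Assumes each position is a little-endian signed 16-bit value; the
    actual byte order and signedness are defined by the firmware.
    """
    expected = 2 * N_VALUES
    if len(payload) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(payload)}")
    return list(struct.unpack("<%dh" % N_VALUES, payload))
```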

Table 2 Structure of the raw data collected by the wearable device. IMU variables comprise accelerometer (x, y, z) and gyroscope (x, y, z) data.

The experimental work was organized into two main experiments, each including data collection, model training, and evaluation. In experiment 1, a preliminary dataset was collected from two subjects using the 3 aforementioned inertial measurement units (IMUs) placed around the head. During this stage, multiple models were evaluated, and preliminary on-device deployments were conducted to validate the accuracy of the prediction ML models running at the edge. Here, we examined the general aspects of the proposed approach while gathering essential insights to support the design of the subsequent experiment. For experiment 2, the emphasis transitioned towards an intensive training process. An additional set of data was collected from 5 new subjects, for a total of 7 animals. In light of the findings from the initial experiment, a thorough hyperparameter search was conducted to optimize certain ML model configurations. Figure 7 details the proposed experiments.

Fig. 7

Comparative overview of the two experimental setups.

Burp detection model

Fig. 8

Model training scheme summarizing how raw inertial sensor data is transformed into inputs for methane-emission detection.

Fig. 9

Block diagram of the data transformation processes used in model training.

Figure 8 provides an overview of the algorithm training framework, illustrating how the proposed approach aims to train machine learning models capable of detecting methane emissions from animals using exclusively inertial sensor data. Figure 9 presents a detailed synopsis of the data processing workflow that converts the initial raw signals into the datasets used for model training and evaluation. Finally, Algorithm 1 describes in detail the structure and procedures of the machine learning (ML) prediction model training pipeline.

The labeling of methane-emission events was performed using a custom-made desktop application. The software allows users to visualize time-series plots corresponding to the three accelerometer and gyroscope channels, as well as the methane concentration signal. The interface supports event annotation through mouse-based selection, enabling fast labeling of the raw data. The labeling strategy is grounded in the behavior of the methane concentration signal: each event is defined by an elevation of methane levels above the resting baseline, continuing until the concentration returns to this baseline. An event is annotated only when its peak methane concentration exceeds 500 ppm, since the measurement range of the Figaro\(\copyright\) sensor used to measure methane concentrations is 500 to 10,000 ppm. Any increase below 500 ppm is therefore considered noise, as the sensor reading is inaccurate at this level.
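The event definition above can be sketched as a simple segmentation routine: an event opens when the CH\(_4\) signal rises above the resting baseline, closes when it returns to it, and is kept only if its peak reaches the 500 ppm sensor limit. Signal values and indices below are illustrative, not measured data.

```python
def extract_events(ch4, baseline, peak_min=500.0):
    """Segment a CH4 trace (ppm) into candidate emission events.

    An event starts when the signal rises above the resting baseline and
    ends when it returns to it; only events whose peak reaches peak_min
    (the sensor's lower range limit) are kept.
    Returns a list of (start, end) sample indices.
    """
    events, start = [], None
    for i, v in enumerate(ch4):
        if start is None and v > baseline:
            start = i
        elif start is not None and v <= baseline:
            if max(ch4[start:i]) >= peak_min:
                events.append((start, i))
            start = None
    if start is not None and max(ch4[start:]) >= peak_min:
        events.append((start, len(ch4)))
    return events
```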

Subsequently, the analysis of the events focused on two key metrics: peak amplitude and duration. A filtering step was then applied to exclude events whose maximum concentration and duration fell below the first quartile. This strategy was adopted to reduce the influence of noise, which has been observed to dominate at lower concentration levels, and to prioritize well-defined events likely to be associated with actual methane emissions. The dataset was divided into a training set comprising 90% of the data and a test set comprising the remaining 10%. The split is performed at the subject level and at random, prior to segmentation, in order to prevent data leakage between the sets.
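A minimal sketch of this filtering and splitting step is shown below, assuming events are represented as dictionaries with hypothetical `amplitude` and `duration` keys, and subjects by placeholder identifiers. The exact data structures in the actual pipeline may differ.

```python
import random
import numpy as np

def filter_events(events):
    """Drop events whose peak amplitude or duration falls below the first
    quartile of the respective distribution across all events."""
    amps = np.array([e["amplitude"] for e in events], dtype=float)
    durs = np.array([e["duration"] for e in events], dtype=float)
    q1_amp = np.percentile(amps, 25)
    q1_dur = np.percentile(durs, 25)
    return [e for e in events
            if e["amplitude"] >= q1_amp and e["duration"] >= q1_dur]

def split_by_subject(subjects, test_frac=0.1, seed=0):
    """Random subject-level split performed before windowing, so that no
    subject contributes samples to both sets."""
    rng = random.Random(seed)
    subs = sorted(subjects)
    rng.shuffle(subs)
    n_test = max(1, round(test_frac * len(subs)))
    return subs[n_test:], subs[:n_test]
```

Splitting by subject rather than by window is what prevents leakage: overlapping windows from one animal can never appear in both the training and test sets.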

Algorithm 1

Machine learning pipeline: training algorithm.

The signals are subsequently segmented into ten-second intervals, with a variable degree of temporal overlap applied to ensure class balance. The negative class, defined as the absence of methane emissions, serves as the baseline and is generated without overlap. The positive class, corresponding to the presence of emissions, is produced with an overlap ranging from 20% to 90% to balance the classes. In addition to this class-balanced training strategy, an independent training and evaluation procedure was performed on a dataset constructed without temporal overlap and without any explicit class-balancing mechanism. This supplementary experiment was designed to evaluate model performance under realistic field conditions characterized by pronounced class imbalance.
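The windowing scheme can be sketched as follows, with the overlap expressed as a fraction of the window length (0.0 for the negative class, 0.2–0.9 for the positive class). Window length in samples and signal length are placeholders.

```python
def make_windows(n_samples, win_len, overlap):
    """(start, end) indices of fixed-length windows over a signal.

    overlap is fractional: 0.0 yields non-overlapping windows (used for
    the negative class), 0.2-0.9 yields overlapping ones (positive class).
    """
    step = max(1, int(win_len * (1.0 - overlap)))
    return [(s, s + win_len)
            for s in range(0, n_samples - win_len + 1, step)]
```

For example, a 100-sample signal yields 10 non-overlapping 10-sample windows, but 19 windows at 50% overlap, which is how the positive class is oversampled relative to the negative one.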

Furthermore, the data were standardized at the subject level through z-score scaling. This normalization reduces the variability introduced by differences in sensor placement, individual signal amplitudes, and baseline offsets, improving the consistency of the input features across subjects. Consequently, the models operate in a more homogeneous feature space, stabilizing the training process and facilitating cross-subject comparability.
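Subject-level standardization amounts to computing each channel's mean and standard deviation within one animal's data only, then scaling that animal's samples with its own statistics, as in this sketch:

```python
import numpy as np

def zscore_per_subject(X, subject_ids):
    """Z-score each feature column independently within each subject."""
    X = np.asarray(X, dtype=float)
    sids = np.asarray(subject_ids)
    out = np.empty_like(X)
    for s in np.unique(sids):
        m = sids == s
        mu = X[m].mean(axis=0)
        sd = X[m].std(axis=0)
        out[m] = (X[m] - mu) / np.where(sd == 0, 1.0, sd)
    return out
```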

A total of 10 motion features were calculated for each IMU node using tsfresh: minimum, maximum, mean, median, variance, standard deviation, sum, absolute maximum, RMS, and length. From each IMU sample, the magnitude and the pitch and roll angles are also derived. The mathematical formulas for these features are included in Table 3.
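The ten statistics and the derived orientation quantities can be computed with plain NumPy, as sketched below. The pitch/roll formulas shown follow one common accelerometer convention and are an assumption; the formulas actually used are those listed in Table 3.

```python
import numpy as np

def window_features(x):
    """The ten per-channel statistics used as model inputs."""
    x = np.asarray(x, dtype=float)
    return {
        "min": float(x.min()), "max": float(x.max()),
        "mean": float(x.mean()), "median": float(np.median(x)),
        "var": float(x.var()), "std": float(x.std()),
        "sum": float(x.sum()), "abs_max": float(np.abs(x).max()),
        "rms": float(np.sqrt(np.mean(x ** 2))), "length": len(x),
    }

def orientation(ax, ay, az):
    """Magnitude and pitch/roll angles (rad) from one accelerometer sample.

    One common convention is shown here; the exact formulas used in this
    work are those of Table 3.
    """
    mag = float(np.sqrt(ax ** 2 + ay ** 2 + az ** 2))
    pitch = float(np.arctan2(-ax, np.sqrt(ay ** 2 + az ** 2)))
    roll = float(np.arctan2(ay, az))
    return mag, pitch, roll
```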

Table 3 Features calculated from the inertial data for each node.

Using the pipeline described above, training was carried out with 10 different ML classification algorithms: ridge17, logistic regression (LR)17,18, k-nearest neighbor (kNN)19,20, decision tree (DT)21, random forest (RF)22, AdaBoost (AB)23, gradient boosting (GB)24, XGBoost25, CatBoost26, and LightGBM27. Ridge and LR were trained to provide a baseline for comparison, as they are the simplest algorithms to implement. DT/ensemble-based methods were selected for experiment 1 since they have shown the best performance for time-series management according to the Makridakis Competitions28; in addition, they facilitate extremely rapid prototyping. In experiment 2, neural networks were incorporated into the analysis, together with the best-performing models from the preceding experiment. The classification models obtained at the final stages of the pipeline were validated using the test dataset. Performance metrics were computed using the “metrics” class from the “sklearn” Python library. From the confusion matrix, the accuracy, precision, recall, and F1-score were calculated, as well as the ROC-AUC curve.
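A condensed sketch of this training and evaluation loop is shown below for a subset of the ten algorithms (the sklearn ones; XGBoost, CatBoost, and LightGBM are external packages with analogous fit/predict interfaces). The feature matrix here is synthetic, standing in for the windowed IMU features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the windowed IMU feature matrix and burp labels.
X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "ridge": RidgeClassifier(),
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=0),
    "ab": AdaBoostClassifier(random_state=0),
}
scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)
    scores[name] = {"acc": accuracy_score(y_te, y_hat),
                    "f1": f1_score(y_te, y_hat)}
```

Note that in the actual pipeline the train/test split is made by subject (as described earlier), not by random sample as in this synthetic illustration.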

Table 4 Hardware specifications of the embedded platforms used for model deployment.

The deployment workflow relies on the Edge Impulse platform, which not only manages model training and evaluation but also performs model quantization as a core step for embedded deployment. This quantization process substantially reduces memory usage and computational requirements, enabling efficient inference on resource-constrained hardware. The resulting quantized model is then exported as an Arduino-compatible library that packages all parameters and runtime components into a compact zip file, which is integrated directly into the firmware to support fully offline execution. In this study, quantized models are implemented on two embedded platforms: the Seeed Studio XIAO ESP32S3 and the Seeed Studio XIAO nRF52840 Sense. The key hardware specifications of these platforms are outlined in Table 4.

Given that a scikit-learn model is not directly deployable on a microprocessor, the model is first converted into TensorFlow Lite (TFLite) format: either the trained model is re-implemented in TensorFlow using identical weights and parameters extracted from the scikit-learn pipeline, or it is converted via an intermediate representation using JAX and ONNX29,30,31. Once the models have been converted to TFLite format, deployment is a straightforward process handled by the Edge Impulse platform.

Results

A total of 436 events were obtained from the data collected from the 7 subjects. Table 5 presents the statistics for event duration, measured in seconds, and methane concentration amplitude, for all methane emission events that constitute the dataset. Figure 10 presents all the signals that constitute a single methane emission event: the x, y, and z channels of both the accelerometer and the gyroscope for each of the three proposed inertial points.

Fig. 10

Collected data points from (a) in-field CH\(_4\) measurements [ppm], (b) tri-axial accelerometer signals per node, and (c) tri-axial gyroscope signal per node. The four labels represent the position and motion of the cow’s head during the measurements: Up-still refers to the cow standing up without head movement. Up in-motion indicates the cow standing while either walking or moving the head. Down-still corresponds to the cow lying on the ground without head movement. Down in-motion refers to the cow lying on the ground while either eating or moving the head.

Table 5 Statistical summary of Duration and CH\(_4\) Amplitude.

Experiment 1

This section presents the findings from experiment 1. The results are based on the data collected from two subjects (155B and 204B), as this experiment was specifically designed to evaluate the overall feasibility of the proposed methodology, as depicted in Fig. 7. The overall performance of the ML algorithms in detecting burp events on the test dataset in experiment 1 is summarized in Table 6. The Random Forest (RF), AdaBoost (AB), and LightGBM algorithms were the best performers in classifying the burp event.

Table 6 Performance metrics of the ML algorithms trained with the dataset obtained in the field, with a sliding window of 10 s and a time step of 1s. Results ranked by descending accuracy.

To identify the optimal combination of pre-event and event windows for detecting the start of the burp event, a time sweep of these windows was performed. The sweep was conducted from 0 to 60 seconds in 5-second increments for each window. An additional sample using the maximum event duration was included for the event window. Figure 11 illustrates the Area Under the Receiver Operating Characteristic Curve (ROC-AUC) for the various window combinations, with the pre-event window on the Y-axis and the event window on the X-axis. Performance improved with increasing window lengths.
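The sweep described above can be enumerated as a simple grid, assuming both windows range over 0–60 s in 5-s steps with one extra event length appended (the maximum event duration, represented below by a hypothetical 90 s value):

```python
def sweep_grid(step=5, max_s=60, extra_event_len=None):
    """Enumerate (pre_event, event) window-length pairs for the sweep:
    0 to max_s seconds in step-second increments for both windows, plus an
    optional extra event length (e.g. the maximum event duration)."""
    lengths = list(range(0, max_s + 1, step))
    extras = [extra_event_len] if extra_event_len is not None else []
    event_lengths = lengths + extras
    return [(pre, ev) for pre in lengths for ev in event_lengths]
```

Each pair in the grid corresponds to one cell of the ROC-AUC heatmap in Fig. 11.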

Fig. 11

ROC-AUC for different pre-event and event window combinations.

Figure 12 displays the performance of the RF models in classifying the burp event. Models trained with all available features indicated that the combination of all IMU nodes performed best (AUC=0.77), followed by the nape and neck nodes (0.74 and 0.73, respectively), with the snout node at 0.68. A summary of all performance metrics is presented in Table 7. The results in the table are highly comparable, which complicates the determination of any significant differences between the various inertial points. To assess the feasibility of deployment and evaluate the performance of the exported embedded ML model on the target micro-controller, we conducted a validation process by passing selected test vectors from the original training dataset through the deployed model. This was done using both the TensorFlow Lite model and the Edge Impulse model, ensuring consistency between the ML training environment and the embedded deployment. The micro-controller’s results were compared with those from the training phase to confirm the model’s accuracy and consistency post-deployment. The results, shown in Fig. 13, indicate that the embedded implementation yielded predictions similar to those of the ML pipeline, suggesting successful integration with minimal performance degradation. The small degradation observed is due to quantization errors during square root and trigonometric operations on the micro-controller, i.e., a higher machine epsilon, which may cause slight discrepancies in numerical calculations compared to higher-precision environments.

Fig. 12

ROC curves for different nodes. Blue: All nodes, Orange: IMU1 (Nape), Green: IMU2 (Snout), Red: IMU3 (Neck).

Table 7 Performance metrics of the RF models for the IMU nodes.
Fig. 13

True Positive Rate (TPR) comparison between the full model (PC) and deployed embedded model performance (TFLM). The R\(^2\) corresponds to 0.6992.

The optimal RF model was deployed on the microprocessor for field testing. By analyzing different datasets and examining the probabilities produced by the TensorFlow Lite Micro (TFLM) model, we found that a threshold of 0.7 yields the best performance in event detection. Figure 14 shows the CH\(_4\) concentration, probabilities, and predictions per unit of time. The probability tends to increase as an event approaches and decreases once the event concludes.
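The thresholding step that converts the model's probability stream into start-end predictions can be sketched as follows (the probability values below are illustrative, not field data):

```python
def predict_start_end(probs, threshold=0.7):
    """Turn a per-window probability stream into (start, end) index pairs:
    an event begins when the probability first reaches the threshold and
    ends when it falls back below it."""
    events, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= threshold:
            start = i
        elif start is not None and p < threshold:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(probs)))
    return events
```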

Fig. 14

(a) CH\(_4\) emissions over time. (b) Predictions using a threshold of 0.7. (c) Probabilities obtained performing inference in the embedded device.

Experiment 2

The differences between experiments 1 and 2, as depicted in Fig. 7, are twofold: i) the inclusion of all 7 subjects and ii) the restriction of data collection to a single inertial measurement unit (IMU). The main objective of this second experiment was to develop models with improved generalization and higher predictive performance. To achieve this, the two best-performing models from experiment 1, Random Forest (RF) and AdaBoost (AB), were retained for comparison, while several Neural Network (NN) architectures, previously excluded due to their higher computational cost, were introduced.

The hyperparameter optimization of the neural networks, conducted using the Optuna framework, is illustrated in Fig. 15. The study comprised 800 optimization trials, during which key parameters, including the number of layers, neurons per layer, activation functions, batch size, and learning rate, were systematically varied. Equivalent hyperparameter sweeps were also performed for the RF and AB models. As presented below, however, the neural network models were ultimately selected as the reference due to their superior performance.

As summarized in Table 8, the best-performing configuration for each algorithm is reported across multiple evaluation metrics. Figure 16 presents a comparative analysis of their respective receiver operating characteristic (ROC) curves. The differences are subtle yet significant: the neural network achieved the most consistent and overall superior results across all metrics, establishing it as the benchmark model. The most notable observation is the pronounced decline in model performance from experiment 1 to experiment 2. This reduction is primarily attributed to the increased number of subjects and the greater inter-individual variability introduced in the second dataset.

To assess model generalization across individual animals, the performance of the reference model was evaluated on test subsets corresponding to each subject (see Table 9). The overall accuracy and area under the ROC curve (AUC) across the entire test dataset were approximately 65%, with several subjects achieving accuracies exceeding 70%. Notably, subjects 164 and 187 exhibited AUC values greater than 80%, indicating strong model generalization for these individuals. In contrast, lower performance was observed for subjects 137, 155B, and 181, whose results approached the level of a random classifier (around 50% AUC). This variability highlights the influence of individual behavioral and physiological differences on model robustness and emphasizes the need for larger, more diverse datasets to improve cross-subject generalization.

Fig. 15

Evolution of neural network accuracy across optimization trials. The blue markers represent the accuracy obtained in each trial, while the red dashed line indicates the best accuracy achieved up to that point.

Table 8 Comparison of performance metrics between learning-based models and a chance-level random classifier.
Fig. 16

Receiver operating characteristic (ROC) curves of the final models evaluated during Experiment 2.

Table 9 Performance metrics per subject.

The summary statistics across the seven subjects, shown in Table 10, reveal consistent but variable performance of the neural network, reflecting the expected heterogeneity in animal-specific head-movement patterns. On average, the model achieved an AUC of 0.709 (95% CI: 0.631–0.787), an accuracy of 0.677 (95% CI: 0.598–0.756), and an F1-score of 0.660 (95% CI: 0.569–0.751). The coefficients of variation (12.8–17.7%) indicate moderate dispersion, with the F1-score exhibiting the highest variability. These results suggest that while the network is generally capable of identifying eructation events across animals, performance is influenced by individual differences and remains subject to uncertainty due to the small cohort size. Consequently, the aggregated metrics should be interpreted as indicative of feasibility rather than fully generalizable performance.

Table 10 Summary statistics of subject-level performance metrics (AUC, Accuracy, F1-score). Mean = average across 7 subjects; SD = standard deviation; CV = coefficient of variation; CI = 95% confidence interval of the mean.
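These summary quantities follow directly from the subject-level metrics. A minimal sketch, assuming a t-based 95% confidence interval with the critical value for six degrees of freedom (n = 7) hardcoded:

```python
import math
from statistics import mean, stdev

T_CRIT_DF6 = 2.447  # two-sided 95% t critical value for n = 7 (df = 6)

def summarize(values):
    """Mean, sample SD, CV (%), and 95% CI of the mean for a small sample."""
    m = mean(values)
    sd = stdev(values)        # sample standard deviation (n - 1 denominator)
    half = T_CRIT_DF6 * sd / math.sqrt(len(values))
    return {
        "mean": m,
        "sd": sd,
        "cv": 100 * sd / m,   # coefficient of variation, percent
        "ci": (m - half, m + half),
    }
```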

Table 11 presents the mean classification accuracy and corresponding standard deviation obtained for different values of the maximum window overlap. For each overlap configuration, the model was trained three times with distinct random seeds, and the table reports the average performance together with the variability across runs. The results do not reveal a clear or monotonic dependence of model accuracy on the degree of window overlap; similar performance levels are observed over a broad range of overlap settings. Furthermore, the reported standard deviations demonstrate that the choice of random seed exerts a non-negligible influence on the final outcomes, in some instances being comparable to, or exceeding, the performance differences attributed to the overlap parameter itself.

Table 11 Average model accuracy and standard deviation as a function of maximum window overlap.
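The overlap parameter can be pictured with a simple segmentation routine. The sketch below is a hypothetical windowing function in which `max_overlap` is the fraction of each window shared with its successor; the study's actual segmentation details may differ.

```python
def make_windows(signal, window_len, max_overlap):
    """Segment a 1-D signal into fixed-length windows.

    max_overlap is the fraction of each window shared with the next
    (0.0 = disjoint windows, 0.75 = 75% overlap), a hypothetical
    parameter mirroring the sweep in Table 11.
    """
    if not 0.0 <= max_overlap < 1.0:
        raise ValueError("max_overlap must be in [0, 1)")
    step = max(1, int(window_len * (1.0 - max_overlap)))
    return [signal[i:i + window_len]
            for i in range(0, len(signal) - window_len + 1, step)]
```

Higher overlap multiplies the number of training instances drawn from the same recording, which is why it was used to mitigate the scarcity of positive samples discussed later.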

The quantized model deployed through the Edge Impulse platform was evaluated on two embedded systems with distinct hardware characteristics. As shown in Table 12, these differences translate into disparities in both resource utilization and inference latency. For instance, the ESP32-S3 reports a flash usage of 647 kB and an inference time of 25 ms, whereas the nRF52840 requires 413 kB of flash and reaches an inference time of 74 ms. The results illustrate how the deployment pipeline adapts the quantized neural network to the specific limitations and architectural features of each device, enabling the model to operate within the available memory and compute budget of both platforms. Note that the default partition scheme of the ESP32-S3 is employed, which exposes only 3 MB of the total available memory.

Table 12 Resource utilization and inference performance across embedded systems.

Performance evaluation under natural class imbalance

Table 13 and Table 14 summarize the performance of the proposed model under natural class imbalance and compare it against a random baseline. To ensure robustness, the random-baseline evaluation was repeated three times, and the reported metrics correspond to their average values. All random classifiers were configured to operate under the observed natural class proportion of approximately 8:1 between negative and positive classes. In contrast to this baseline, the proposed model substantially improves positive-event detection, achieving a significantly higher MCC and PR-AUC and correctly identifying the majority of rare events at the cost of a controlled increase in false positives. These results demonstrate the practical advantage of the proposed approach in realistic deployment conditions characterized by extreme class imbalance.

Table 13 Performance comparison between the proposed model and a random baseline under natural class imbalance. Random baseline results correspond to the average over three independent runs.
Table 14 Confusion matrix comparison between the proposed model and the best-performing random baseline under natural class imbalance.
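The imbalance-aware comparison can be illustrated with a small sketch that computes the Matthews correlation coefficient from confusion-matrix counts and averages a stratified random baseline over several seeds, mirroring the three-run protocol above. The 8:1 class proportion in the usage test comes from the text; everything else is illustrative.

```python
import math
import random

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def random_baseline_mcc(labels, pos_rate, seeds=(0, 1, 2)):
    """Average MCC of a random classifier firing at pos_rate, over seeds."""
    results = []
    for seed in seeds:
        rng = random.Random(seed)
        preds = [1 if rng.random() < pos_rate else 0 for _ in labels]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
        tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
        results.append(mcc(tp, fp, fn, tn))
    return sum(results) / len(results)
```

A random classifier matched to the natural positive rate hovers near an MCC of zero, which is why MCC (rather than accuracy) separates the proposed model from chance under this imbalance.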

Discussion

We describe the field testing of the system, the datasets generated, and the performance evaluation of the embedded models. The study aims to assess the feasibility of methane-emission monitoring in real-world conditions and to evaluate the trade-offs between resource utilization and inference performance on embedded platforms.

To facilitate system validation, Experiment 1 involved field testing with two adult cows. The first animal, subject 204B, a Holstein breed, was fed a grass-based diet supplemented with Rentaleche (a high-fiber feed supplement). The second animal, subject 155B, a pregnant Jersey cow, received a grass-based diet supplemented with nutrients specifically formulated to support gestation. In Experiment 2, five additional Holstein cows were included. All animals followed an identical feeding regimen comprising two grazing sessions per day, 17% salt, and 1 kg of concentrate per 5 L of milk produced during milking.

All experiments were conducted under the supervision of trained and authorized field technicians at the San Javier Farm Laboratory of Pontificia Universidad Javeriana, a facility integrating scientific research and agricultural practices. The farm is located in Cogua, Colombia (\(5^{\circ }3'25''N\), \(73^{\circ }56'26''W\)), with average environmental conditions of \(16^{\circ }C\), 70% relative humidity, and 21 mm of precipitation. Factors such as feeding, altitude, humidity, and ambient temperature were held constant rather than manipulated, to avoid confounding effects. Validation was performed on a limited sample of seven adult bovines aged 3–7 years; with the exception of one Jersey, all were Holstein cows. Data were collected at intervals of 10–25 minutes to capture behavioral and environmental variability. During each acquisition session, the researcher remained near the animal to maintain Bluetooth connectivity and to monitor behavioral context. The complete dataset is publicly available in the OSF repository and constitutes a valuable resource for future research on methane monitoring and mitigation strategies.

The IoT device developed in this study detects eructation events associated with methane emissions by sensing mechanical vibrations from the animal’s head and neck. The results from Experiment 1 (Table 6) show that the Random Forest (RF) model outperformed the other classifiers, achieving the highest AUC (0.768) and superior scores across all metrics: accuracy (0.741), precision (0.741), recall (0.741), and F1-score (0.741). In comparison, Logistic Regression (LR) and Decision Tree (DT) models yielded lower performance, with DT exhibiting the lowest AUC (0.635) and accuracy (0.635), indicating limited capacity to model complex nonlinear patterns. AdaBoost (AB) and LightGBM also showed reasonable performance, with AUC values of 0.744 and 0.729, respectively.

The integrated burp detection model, combining data from all nodes, achieved the highest overall performance (accuracy = 0.7526, precision = 0.7682, recall = 0.7526, F1-score = 0.7489, and R\(^2\) = 0.6992) as summarized in Table 7. Although single-node models exhibited slightly lower metrics, they relied on only one-third of the total input features. A subsequent analysis incorporating data from all seven subjects (Experiment 2) revealed a noticeable decrease in model performance: the RF model’s AUC declined from 0.768 to 0.660, and AdaBoost’s from 0.744 to 0.637. This degradation is attributed to the increased inter-individual variability introduced by the expanded sample size. Based on comparative results, conventional neural networks were selected for further development, as they consistently outperformed other models across multiple evaluation criteria, despite RF’s superior initial results in Experiment 1.

This work assumes that all methane concentration levels above the defined threshold, excluding the first quartile of the distribution, correspond to eructation-related events. This assumption may introduce a degree of misclassification, as methane emissions can also exhibit natural variability unrelated to discrete eructation episodes. However, the premise is deliberately adopted for the practical benefits it provides: a simplified and reproducible event definition, reduced reliance on subjective annotation procedures, and a rapid, scalable, human-independent labeling strategy. Within this framework, the proposed approach focuses on detecting methane-emission-defined events and serves as an initial step toward understanding the relationship between inertial movement patterns and methane release dynamics. The results suggest substantial potential for accurate estimation of these belching events, a key component of a future wearable device for measuring methane emissions in livestock.

The evaluation framework adopted in this study is explicitly restricted to the detection of methane-threshold-defined events. Specifically, emission events are operationally defined as increases in methane concentration exceeding 500 ppm and satisfying amplitude and duration filtering criteria, as measured by the integrated methane sensor. Consequently, the machine learning models are trained and evaluated to identify inertial motion patterns temporally associated with these threshold-based methane elevations, rather than independently verified physiological eructation episodes.
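This operational event definition can be expressed compactly. The sketch below flags contiguous runs of methane readings above the threshold and applies simple duration and amplitude filters; the specific filter values (`min_len`, `min_peak`) are hypothetical stand-ins for the study's criteria.

```python
def label_events(ppm, threshold=500.0, min_len=3, min_peak=None):
    """Flag contiguous runs where methane concentration exceeds threshold.

    A run is kept as an event only if it lasts at least min_len samples
    and (optionally) its peak exceeds min_peak -- stand-ins for the
    duration and amplitude filtering criteria described above.
    Returns (start, end) index pairs, end exclusive.
    """
    events, start = [], None
    for i, v in enumerate(ppm + [float("-inf")]):  # sentinel closes last run
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            run = ppm[start:i]
            if len(run) >= min_len and (min_peak is None or max(run) > min_peak):
                events.append((start, i))
            start = None
    return events
```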

Accordingly, the present findings should be interpreted as demonstrating correlation between IMU-derived motion features and methane-threshold-defined events under the specified experimental protocol. The study does not establish an independent sensing capability for methane release based solely on inertial data. Independent behavioral observations, respiratory measurements, or validated reference instruments (e.g., respiration chambers or GreenFeed systems) would be required to confirm true standalone methane detection and to validate the relationship between head-vibration signatures and gas-release dynamics.

The low occurrence of eructation events (approximately 16.15 events per hour) resulted in a severely imbalanced dataset, an inherent characteristic of the biological system that cannot be experimentally controlled. This imbalance restricts the availability of positive samples, complicating both minority-class learning and realistic performance evaluation. Although data-balancing strategies and high temporal window overlap were employed to increase the number of training instances and enable the training of multiple algorithms, a sensitivity analysis revealed no significant differences in model performance across the range of maximum overlap values considered. Instead, variability in performance was more strongly influenced by the choice of random initialization seeds, likely due to the limited dataset size. When trained under natural class proportions, the model demonstrated predictive capability beyond a random classifier and achieved a Matthews correlation coefficient comparable to that obtained under balanced training conditions. Future work should focus on extending monitoring durations and expanding the dataset to enable more robust evaluation, particularly through precision–recall analysis under real-world class imbalance.

Other challenges arise from the natural variability in factors such as sensor placement, strap tension, and animal posture, all of which can introduce gradual shifts in the recorded motion signals. Although subject-level standard scaling was applied to enhance comparability across individuals, such pre-processing cannot fully compensate for the non-stationary nature of biological and environmental conditions. Consequently, the reported performance should be interpreted within these practical constraints, which remain an important area for continued methodological refinement.
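Subject-level standard scaling of the kind described above can be sketched as a per-animal z-scoring of each feature column; the data layout used here is hypothetical.

```python
from statistics import mean, stdev

def subject_standardize(features_by_subject):
    """Z-score each feature column per subject, so that animal-specific
    offsets (sensor placement, strap tension, posture) do not dominate.

    features_by_subject: {subject_id: list of feature vectors}.
    Returns the same mapping with each column scaled to zero mean and
    unit variance *within* that subject's recordings.
    """
    scaled = {}
    for sid, rows in features_by_subject.items():
        cols = list(zip(*rows))           # transpose to per-feature columns
        mus = [mean(c) for c in cols]
        sds = [stdev(c) or 1.0 for c in cols]  # guard constant columns
        scaled[sid] = [
            [(v - mu) / sd for v, mu, sd in zip(row, mus, sds)]
            for row in rows
        ]
    return scaled
```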

A possible limitation of the present study is the absence of a direct comparison with established emission measurement techniques such as GreenFeed or respiration chambers. Quantitative cross-validation against these conventional methods, covering parameters such as measurement accuracy, cost, monitoring duration, and animal welfare, will be essential in a future stage to contextualize the advantages of the proposed approach. Although the potential benefits of our wearable motion-based monitoring must still be contrasted and benchmarked against these reference instruments, the latter are considered more invasive and remove the animal from its environment, affecting its natural behavior and response.

While the IMU-based approach captures characteristic motion patterns that coincide with methane-emission events, the interpretation of these patterns requires caution. The vibrations detected by the sensors arise within a complex behavioral context, where multiple routine activities may produce partially overlapping signatures. Moreover, although these motion patterns tend to align temporally with measured increases in methane concentration, the present analysis does not establish a mechanistic link between them. As a result, the detected signals should be understood as correlational indicators within a broader behavioral and physiological landscape. Future work incorporating complementary sensing modalities will be necessary to more precisely disentangle the contributions of concurrent activities and to better characterize the relationship between motion dynamics and gas-release episodes.

All experimental procedures were reviewed and approved by the Ethics Committee at Pontificia Universidad Javeriana (approval code: FID-24-058, issued 15 February 2024). Feeding followed the farm’s established protocols under supervision, without dietary or fluid restrictions. Animals were not restrained except briefly (\(\le\) 15 minutes) during specific measurements. Occasional use of a methane measurement mask, collar, or nose harness adhered strictly to animal welfare standards. No invasive, surgical, or genetic procedures were conducted. All experimental protocols conform to ethical research standards and comply with ARRIVE guidelines regarding housing, handling, and minimization of animal stress.

Summary and conclusions

The results of this investigation demonstrate the feasibility of employing a wearable Internet of Things (IoT)-based system integrated with inertial measurement units (IMUs) for the real-time detection of eructation (burp) events in cattle. The findings indicate that machine learning models such as Random Forest (RF) and Neural Networks (NN) show important potential to predict these events with satisfactory levels of accuracy, precision, and recall. Furthermore, combining data from multiple sensor nodes yielded the highest overall performance, although individual nodes (particularly the neck-mounted IMU) also produced promising results. Subsequent research will refine the methodology by building on the improvement paths already discussed in this study, including a more systematic evaluation at the subject level, improved data-balancing strategies, and a deeper examination of the potential of inertial information to properly detect the onset of belching events in livestock. The goal of these efforts is to establish a rigorous foundation for subsequent iterations of the system.

The proposed work offers a novel alternative for methane emission monitoring in livestock. Our approach, which employs low-cost inertial sensors and a simple printed circuit board design, shows promise in detecting eructation events from inertial data alone, which could in turn activate the main electronic modules of a future low-cost methane-measuring wearable device. Such a system could become an accessible and scalable alternative to conventional approaches such as non-dispersive infrared (NDIR) sensors, laser methane detectors (LMDs), or respiration chambers, all of which typically entail substantial financial costs, extensive infrastructural requirements, and, in some cases, invasive procedures. This approach enables the possible detection of eructation events with minimal disturbance to the animals. It additionally enables the development of energy-efficient wearable architectures based on event-driven activation. Moreover, it exhibits the potential to function directly under in situ or field conditions, without the requirement for specialized protective enclosures or elaborate calibration procedures.

In conclusion, this study introduces a novel methodological framework that integrates wearable sensing and machine learning to explore the relationship between inertial motion and eructation events in cattle. While the results are encouraging, they represent an initial proof of concept that requires further development, analysis, and evaluation. Further research involving larger cohorts, longer monitoring periods, and more comprehensive experimental designs is required to assess scalability and long-term reliability. Nevertheless, the insights gained in this work represent an initial step toward methods that could contribute to more sustainable, cost-effective, and animal-friendly methane monitoring solutions.