Introduction

In the normal digestion process, food is propelled through the digestive tract by rhythmic muscle contractions known as peristalsis. The resulting movement of contents is referred to as gastrointestinal motility (throughout the digestive tract) or gastric motility (when limited to the stomach)1. Gastric motility disorders occur when normal peristalsis is disrupted, which may cause severe constipation, recurrent vomiting, bloating, diarrhea, nausea, and even death. Gastric motility may be assessed through various techniques. While direct measurements such as gastric emptying scintigraphy and wireless motility capsules (WMCs) quantify food movement, other modalities, including manometry, electrogastrography (EGG), and ultrasound, evaluate the factors affecting gastric motility. However, traditional methods of evaluating gastric motility have limitations. Manometry involves intranasal intubation, which may cause discomfort to patients and lead to the use of sedation2. Gastric emptying scintigraphy requires nuclear medicine, which exposes patients to radiation3. EGG recording systems vary widely; despite recent efforts to standardize EGG for body surface gastric mapping4, EGG remains susceptible to inter-individual physiological variability, such as differences in body mass index. Gastric ultrasound suffers from the trade-off between penetration depth and resolution5. WMCs measure physiological parameters related to gastric peristalsis, such as temperature, pH, and pressure, as they travel through the gastrointestinal tract6. However, because WMCs transit the gastrointestinal tract passively, they may fail to accurately assess motility at specific anatomical landmarks7. In contrast, the magnetically controlled capsule endoscope (MCCE) is an emerging tool for diagnosing gastric diseases that provides real-time, true-color visualization of the gastric environment. With active magnetic control, the MCCE enables precise localization and visualization of anatomical landmarks. Moreover, the MCCE provides direct, multi-angle visualization of contraction waves, which enables comprehensive and effective analysis of gastric motility. In addition, MCCE examinations are comfortable, safe, and require no anesthesia8.

However, evaluating gastric motility using the MCCE requires extensive manual labor from clinicians. For example, each MCCE frame needs to be inspected for the presence of peristalsis, and the period of peristalsis needs to be counted manually. Thus, there is a need to develop automatic algorithms for evaluating gastric motility using MCCE systems. Deep learning algorithms have been used in the field of medical imaging9,10,11 as well as to assist the diagnosis of MCCE systems12,13. Convolutional neural networks (CNNs) have been used to detect polyps14,15, ulcers16, tumors, and mucosa12. Moreover, deep reinforcement learning (DRL) approaches have been used for automated navigation of MCCE capsules within the human stomach. However, existing research focuses on detecting gastric lesions, anomaly detection, segmentation, and navigation17 based on single MCCE frames, rather than utilizing the temporal information in MCCE frame sequences.

In this paper, we develop a combination of algorithms for evaluating human gastric motility by detecting and measuring gastric peristalsis using the MCCE system. During the MCCE examination, an external magnetic head guides the capsule to move and capture images within the human stomach, which poses a challenge for action recognition algorithms18. To mitigate the effect of sudden capsule movements, we develop a camera motion detector (CMD) for processing MCCE frame sequences. We develop a framework for detecting gastric peristalsis, which is compatible with CNN + long short-term memory (LSTM)19,20 and transformer-based models21. Human gastric contraction waves present features in both the spatial and temporal domains: in the spatial domain, the waves have morphological shapes; in the temporal domain, the shapes of the waves change over time. The CNN model captures the spatial features, the LSTM model captures the temporal features, and the Video Swin Transformer21 analyzes patches across both the spatial and temporal dimensions. For detection and classification algorithms in most medical applications, reducing false negatives is more important than reducing false positives22. False negative results may lead to an omission in detecting gastric peristalsis, which would lead to an underestimation of human gastric motility. Class activation mapping (CAM)23 is capable of highlighting the regions within the stomach where peristalsis occurs. To make the detection results more reliable (i.e., to produce fewer false negatives in detecting peristalsis frames), we improve the detection sensitivity using the visual interpretations provided by CAM. Moreover, we develop a periodical feature detector for measuring the period of human gastric peristalsis based on the analysis of feature maps of MCCE frames.

We conducted extensive experiments on our MCCE dataset, which includes over 100,000 frames (specifically 100,055) from 30 subjects for the training and validation sets, and 24,183 frames from 11 subjects for the testing set. Our combination of algorithms evaluates gastric motility by detecting the presence of peristalsis as well as measuring the period of gastric peristalsis. The proposed algorithms have great potential to be integrated into clinical devices for assisting the evaluation of gastric motility.

Methods

MCCE dataset

The MCCE dataset was acquired by the Department of Research and Development at AnX Robotica. Using the NaviCam MCCE system, inspection videos of internal volunteers were collected. The MCCE system consists of four components: a swallowable, wireless, and magnetically controlled capsule endoscope (11.8 \(\times\) 27 mm), a guidance magnetic robot, a data recorder, and a computer workstation with corresponding software. An example of the components of the NaviCam MCCE system is shown in Fig. 1. The videos captured by the MCCE were recorded at 2 fps, with a size of 480 \(\times\) 480 pixels. The MCCE videos were treated as frame sequences. Our training and validation set contains more than 100,000 MCCE frames (specifically 100,055) from 30 subjects.

Fig. 1
figure 1

An illustration of the NaviCam MCCE system (https://www.anxrobotics.com/products/navicam-stomach-capsule-system/). (a) Controlled capsule endoscope. (b) Guidance magnetic robot. (c) Data recorder and computer workstation.

Design of camera motion detector

We design the CMD to filter out MCCE frames that are degraded by camera movement. The proposed CMD takes two consecutive frames, Frame N-1 and Frame N, as inputs. The CMD then determines the camera motion by analyzing histograms. The details of the CMD are described in Algorithm 1. A normalized Gaussian function with \(\mu\) at 128 and \(\sigma\) at 20 is adopted as the mask M; the choice of \(\mu\) and \(\sigma\) is based on empirical study. With a higher threshold T, the mean length of the resulting video sequences is longer, leading to higher computational cost; conversely, with a lower threshold T, the resulting video sequences contain less camera motion. We empirically set the threshold T to 200 to balance the trade-off between sequence length and camera motion within sequences.

Algorithm 1
figure a

The camera motion detector.
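The following is a minimal sketch of the CMD scoring step, assuming the residual is the absolute per-pixel difference between consecutive grayscale frames and the Gaussian mask is scaled to a unit peak; the exact residual definition and normalization used in Algorithm 1 may differ.

```python
import numpy as np

def cmd_score(frame_prev, frame_curr, mu=128.0, sigma=20.0):
    """CMD score S for two consecutive MCCE frames given as (H, W) uint8 grayscale arrays."""
    # Residual image between Frame N-1 and Frame N (assumed: absolute difference).
    residual = np.abs(frame_curr.astype(np.int16) - frame_prev.astype(np.int16))
    # Histogram H of the residual image over 256 intensity bins.
    hist, _ = np.histogram(residual, bins=256, range=(0, 256))
    # Gaussian mask M centered at mu = 128 with sigma = 20 (scaled to a unit peak);
    # it suppresses the low/high histogram bands caused by slight light-source motion.
    bins = np.arange(256)
    mask = np.exp(-0.5 * ((bins - mu) / sigma) ** 2)
    # Masked histogram H_M and its sum, the score S.
    return float(np.sum(hist * mask))

def is_camera_moving(frame_prev, frame_curr, threshold=200.0):
    """Frame pairs whose score exceeds the threshold T = 200 are marked as 'camera moving'."""
    return cmd_score(frame_prev, frame_curr) > threshold
```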

Workflow of detecting human gastric peristalsis

The workflow of detecting human gastric peristalsis is shown in Fig. 2a. For training, the MCCE dataset is processed by the CMD, which provides stable MCCE frames. For prediction (testing), the testing MCCE data is first processed by the CMD, which determines the quality of the MCCE frames. Frames with camera movement above the threshold are marked as 'camera moving'; the stable MCCE frames that pass the CMD are sent to the pre-trained deep-learning model for prediction, which outputs 'wave' or 'nowave'. The proposed framework is compatible with various deep-learning models. In Fig. 2b, we demonstrate the method of using CAM to improve the sensitivity of the framework. In Fig. 2c, we demonstrate the ensemble of the CNN and LSTM for detecting human gastric peristalsis.

Fig. 2
figure 2

(a) The workflow of detecting human gastric peristalsis using MCCE frames and deep learning algorithms. During the training and inference phases, the CMD is used to filter out unstable MCCE frames, and deep learning algorithms are used to detect gastric peristalsis based on both spatial and temporal information. (b) The protocol for improving detection sensitivity using CAM. Using the counted activated pixels, we further calibrate the prediction results from the CNN+LSTM model. (c) The ensemble of the CNN and bi-directional LSTM model for detecting gastric peristalsis. The CNN model extracts spatial features and the LSTM model extracts temporal features.
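As a concrete illustration of the ensemble in Fig. 2c, the following is a minimal PyTorch sketch of a CNN + bi-directional LSTM sequence classifier. A ResNet-18 backbone is used here for brevity; the exact backbone (EfficientNet_b7 gives the best results in Table 2), feature dimensions, and classification head are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTM(nn.Module):
    """Frame-sequence classifier: CNN for spatial features, bi-directional LSTM for temporal features."""
    def __init__(self, hidden_size=256, num_classes=2):
        super().__init__()
        backbone = models.resnet18(weights=None)  # ImageNet pre-trained weights are used in practice
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])       # 512-d feature per frame
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)                # 'wave' vs. 'nowave'

    def forward(self, x):                        # x: (batch, memory_length, 3, 480, 480)
        b, t, c, h, w = x.shape
        feats = self.cnn(x.view(b * t, c, h, w)).view(b, t, -1)          # per-frame spatial features
        out, _ = self.lstm(feats)                                         # temporal modeling
        return self.fc(out[:, -1])                                        # one label per sequence

# Example: one sequence with a memory length of 20 frames.
logits = CNNLSTM()(torch.randn(1, 20, 3, 480, 480))
```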

Improving sensitivity using CAM

We use the CAM to calibrate the detection results of the deep learning model. The CAM for a particular category indicates the discriminative image regions used by the CNN to identify that category. We calculate the CAM for each MCCE frame k. We then use a threshold \(\hbox {T}_{p}\) to select the activated pixels in the CAM and count the number of activated pixels. If the number of activated pixels in a frame is larger than the threshold \(\hbox {T}_{c}\), the frame is classified as 'wave' in the modified label list c. We then perform a calibration between the modified list c and the original prediction list p: if \(c_k\) or \(p_k\) is 'wave', the final calibrated prediction \(pr_k\) will be 'wave'. The algorithm is described in Algorithm 2. The parameter \(\hbox {T}_{p}\) determines the threshold for choosing positive CAM pixels, and \(\hbox {T}_{c}\) determines the number of positive CAM pixels required to consider a frame positive. The higher these two parameters, the stricter the CAM filter. For example, if \(\hbox {T}_{p}\) is set to 1 and \(\hbox {T}_{c}\) is set to 230,400 (the total number of pixels in an MCCE frame of size 480\(\times\)480), then no frame can pass the CAM filter and the sensitivity of the CNN+LSTM model will not be improved. If both \(\hbox {T}_{p}\) and \(\hbox {T}_{c}\) are set to 0, then every MCCE frame passes the CAM filter and the sensitivity becomes 1. Following existing research24,25, we set \(\hbox {T}_{p}\) = 0.8. The choice of \(\hbox {T}_{c}\) = 400 is based on empirical study.

Algorithm 2
figure b

Improving sensitivity using CAM.
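A minimal sketch of the calibration step in Algorithm 2 is given below, assuming each CAM is already normalized to [0, 1] and resized to the 480 \(\times\) 480 frame size; the function and variable names are illustrative.

```python
import numpy as np

def calibrate_with_cam(cams, predictions, t_p=0.8, t_c=400):
    """cams: list of (480, 480) float arrays in [0, 1]; predictions: list of 'wave'/'nowave'."""
    calibrated = []
    for cam, pred in zip(cams, predictions):
        activated = int(np.count_nonzero(cam > t_p))          # pixels above T_p
        cam_label = 'wave' if activated > t_c else 'nowave'   # modified label c_k
        # Keep 'wave' if either the model prediction p_k or the CAM label c_k is 'wave',
        # which reduces false negatives at the cost of possible extra false positives.
        calibrated.append('wave' if 'wave' in (pred, cam_label) else 'nowave')
    return calibrated
```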

Periodical feature detector for human gastric peristalsis

We design a periodical feature detector for measuring the period of human gastric peristalsis. The inputs of the periodical feature detector are a range of intervals, MCCE frames, and two thresholds \(T_{l}\) and \(T_{u}\). The periodical feature detector calculates the feature difference score S of feature maps across certain intervals i. For each interval i, a score \({\textbf {S}}_{i}^{mean}\) is calculated over the MCCE frames. The period of the human gastric peristalsis is determined by the local minimum P between the thresholds \(T_{l}\) and \(T_{u}\). The details of the periodical detector are described in Algorithm 3. The intervals i were set from 5 s to 50 s, in increments of 0.5 s (one frame at 2 fps). \(T_l\) is set to 10 s and \(T_u\) is set to 40 s. The choice of \(T_l\) and \(T_u\) is determined by the minimum number of frames needed to detect human gastric peristalsis and the average period of normal human gastric peristalsis. We will show experimental details in the Results section to confirm that 10 s (20 MCCE frames) achieves optimal performance in detecting gastric peristalsis. Thus, we take 10 s as the lower bound for the periodical feature detector and set \(T_l\) to 10 s. We set \(T_u\) (40 s) to twice the average period of normal gastric peristalsis, which is around 20 s26. Note that the periodical feature detector will detect both the period and multiples of the period; setting \(T_u\) = 40 s can therefore remove the multiples of the period of normal human gastric peristalsis. We use EfficientNet_b727 to generate the feature maps of the MCCE frames.

Algorithm 3
figure c

The periodical feature detector.
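The following is a minimal sketch of the periodical feature detector, assuming the feature difference score is the mean Euclidean distance between pooled EfficientNet_b7 features of frames that lie one interval apart, and that the period is taken as the minimum score within [\(T_l\), \(T_u\)]; the exact distance measure and local-minimum search in Algorithm 3 may differ.

```python
import numpy as np
import torch
from torchvision import models

FPS = 2  # MCCE videos are recorded at 2 fps

def extract_features(frames):
    """frames: (N, 3, 480, 480) float tensor -> (N, D) pooled feature vectors."""
    net = models.efficientnet_b7(weights=None)   # ImageNet pre-trained weights are used in practice
    net.eval()
    with torch.no_grad():
        fmap = net.features(frames)                       # (N, C, h, w) feature maps
        feats = torch.flatten(net.avgpool(fmap), 1)       # (N, C) pooled features
    return feats.numpy()

def detect_period(features, t_l=10.0, t_u=40.0, i_min=5.0, i_max=50.0):
    """Detected period P (seconds) for one sequence; the sequence must exceed the largest interval."""
    intervals = np.arange(int(i_min * FPS), int(i_max * FPS) + 1)        # intervals in frames
    scores = np.array([np.linalg.norm(features[i:] - features[:-i], axis=1).mean()
                       for i in intervals])                              # S_i^mean per interval
    seconds = intervals / FPS
    valid = (seconds >= t_l) & (seconds <= t_u)                          # restrict to [T_l, T_u]
    return seconds[np.argmin(np.where(valid, scores, np.inf))]
```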

Evaluation setup

Using the CMD, we acquired 32,431 stable MCCE frames (wave: 9501, nowave: 22,930) from the training set. The 32,431 stable MCCE frames were divided into 1028 MCCE frame sequences (wave: 336, nowave: 692), each consisting of more than 20 frames. Each MCCE frame sequence corresponds to a single label of either 'wave' or 'nowave'. For training and cross-validation, we used the 1028 stable MCCE frame sequences. For testing, we used 30 additional MCCE records from another 11 individuals (24,183 frames), which were acquired after the training data. The training and testing cohorts were divided according to the time of data acquisition. The training and testing data were acquired from the same center, using the same MCCE system.

Network training parameters

For training the CNN+LSTM and Video Swin Transformer models, we set the batch size to eight frame sequences. We used weights pre-trained on ImageNet28 for all the models in this project: for the CNN+LSTM models, we acquired the pre-trained weights from the Torchvision29 package; for the Video Swin Transformer model, we acquired the pre-trained weights from the mmaction30 package. We trained all models for 200 epochs, by which point the loss had plateaued for all models. We used the first five epochs for warm-up, during which only the CNN model was trained. The learning rate was initialized to \(10^{-4}\) and halved every 10 epochs. The experiments were carried out on a single RTX 3080 GPU.
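A minimal sketch of this schedule is given below. The Adam optimizer is an assumption (the optimizer is not specified above), and the stand-in model only mirrors the CNN/temporal split used to freeze the non-CNN layers during warm-up.

```python
import torch
import torch.nn as nn

# Stand-in model with a CNN part and a temporal part, mirroring the CNN+LSTM design.
model = nn.ModuleDict({
    "cnn": nn.Conv2d(3, 8, kernel_size=3),
    "lstm": nn.LSTM(input_size=8, hidden_size=8, batch_first=True),
    "fc": nn.Linear(16, 2),
})
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)      # initial learning rate 1e-4
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # halve every 10 epochs

for epoch in range(200):
    warmup = epoch < 5                                          # first five epochs: train the CNN only
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("cnn") or not warmup
    # ... iterate over batches of 8 frame sequences, compute the loss, and update the model ...
    scheduler.step()
```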

Evaluation metrics

We used accuracy, the F1 score, and the area under the curve (AUC) to evaluate classification performance. The accuracy score evaluates how well the model correctly predicts (true positives and true negatives) the wave/nowave video frames; the F1 score evaluates model performance while accounting for falsely predicted cases (false negatives and false positives). In this medical-related study, we need to evaluate model performance for both correctly and falsely predicted cases. The AUC score evaluates how well the trained model can distinguish between the wave and nowave video sequences.

$$\begin{aligned} Accuracy = \frac{TP+TN}{TP+TN+FP+FN}, \end{aligned}$$
(1)

where TP stands for true positive; TN stands for true negative; FP stands for false positive; FN stands for false negative.

$$\begin{aligned} F1 = 2 \times \frac{Precision \times Sensitivity}{Precision+Sensitivity}, \end{aligned}$$
(2)

where precision is defined by \(\frac{TP}{TP+FP}\); sensitivity is defined by \(\frac{TP}{TP+FN}\).

The AUC score indicates how well the classification model can distinguish between classes and is calculated from the receiver operating characteristic (ROC) curve. The AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0; one whose predictions are 100% correct has an AUC of 1.

We define the error rate to quantify the performance of the periodical feature detector.

$$\begin{aligned} \text {Error rate} = \frac{|\text {Detected period} - \text {Counted period}|}{\text {Counted period}} \times 100\%. \end{aligned}$$
(3)
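As an illustration, the metrics above can be computed with scikit-learn as follows; the labels and probabilities below are dummy values, and treating 'wave' as the positive class (1) is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # ground-truth wave (1) / nowave (0)
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.8, 0.6])    # predicted probability of 'wave'
y_pred = (y_prob > 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)                       # Eq. (1)
f1 = f1_score(y_true, y_pred)                                   # Eq. (2)
auc = roc_auc_score(y_true, y_prob)                             # AUC from the ROC curve

def error_rate(detected_period, counted_period):
    """Eq. (3): relative error of the detected peristalsis period, in percent."""
    return abs(detected_period - counted_period) / counted_period * 100.0

print(accuracy, f1, auc, error_rate(17.5, 19.2))                # e.g. case2: ~8.85% error
```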

Ethics declarations

The MCCE data adopted in this research were collected from internal healthy volunteers at Ankon, using the NaviCam system. The NaviCam system is a medical device registered with the National Medical Products Administration (NMPA).

The data collection protocol was designed and performed by the Medical Department at Ankon, in accordance with the Ankon Internal Volunteer Protocol and the Declaration of Helsinki.

Informed consent for the data collection protocol and for reuse of the data for research purposes was acquired from all the internal healthy volunteers. The internal healthy volunteers received monetary compensation.

Fig. 3
figure 3

An example of applying the camera motion detector to stable (upper row) and unstable (lower row) MCCE frame sequences. The MCCE frames are inputs of the proposed CMD. The histogram (H), mask (M), and masked histogram (\(\hbox {H}_M\)) are interim results. The output of the CMD is a score that evaluates the movement between two consecutive MCCE frames.

Results

Effects of CMD

We show the effect of using the CMD on stable and unstable MCCE frame sequences. An example of applying the CMD to stable (upper row) and unstable (lower row) MCCE images is demonstrated in Fig. 3. In the upper row, we apply the CMD to two consecutive MCCE frames captured while the capsule is stable. In this case, the main body of the histogram \({\textbf {H}}\) of the residual image between the two frames is close to zero, whereas the high band (right) and low band (left) of the histogram have high values. This is caused by the small motion of the capsule, which is equipped with a light source: slight changes in the light source shift the positions of bright/dark regions, which are captured by the high/low bands of the histogram of the residual image. Using the mask \({\textbf {M}}\), we filter out the high/low bands of \({\textbf {H}}\), which results in \({\textbf {H}}_M\). In the lower row, we apply the CMD to two consecutive MCCE frames captured while the capsule is unstable, which leads to a high CMD score S (2616).

Detecting human gastric peristalsis

Table 1 Performance of the deep learning model using different memory lengths31.

We train the CNN+LSTM models using different memory lengths (1, 5, 10, and 20 video frames). The performance for different memory lengths is reported in Table 1. We observe that model performance increases with memory length, and the CNN+LSTM model with a memory length of 20 shows the best performance.

Table 2 Performance comparison of different deep learning models.

We implemented different types of CNN models combined with an LSTM of memory length 20, including ResNet18, ResNet50, ResNet10132, ShuffleNet_v233, EfficientNet_b0, and EfficientNet_b727. We also implemented the Video Swin Transformer21 for detecting human gastric peristalsis. The results are shown in Table 2. The CNN+LSTM models with a memory length of 20 show the best performance, whereas the Video Swin Transformer model performs worse than the CNN+LSTM models. Although the Video Swin Transformer has the potential to demonstrate superior performance in classifying natural images with multiple classes21, CNN+LSTM models perform better for detecting gastric peristalsis on our dataset. EfficientNet_b7 performs better than the other CNN models27; the multi-objective neural architecture search used to design EfficientNet_b7 may account for its superior performance.

Fig. 4
figure 4

Still images from inference results of four representative cases in the testing set. The EfficientNet_b7+LSTM with a memory length of 20 is used for inference. (a) Case14 (Supplementary Video case14.mp4). (b) Case19 (Supplementary Video case19.mp4). (c) Case20 (Supplementary Video case20.mp4). (d) Case24 (Supplementary Video case24.mp4).

Improving the detection sensitivity using CAM

In Fig. 4, we show still images from the inference results of four representative cases in the testing set. We follow the inference protocol in Fig. 2a. The inference results are shown in black bold font. We also calculate the CAM for the MCCE frames during inference. The CAM is projected to a heatmap, where red corresponds to high intensity (1) and blue corresponds to low intensity (0). In the inference videos, human gastric peristalsis is highlighted by the red regions of the CAM. The CAM provides visual explanations of the CNN+LSTM model for detecting gastric peristalsis.
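A minimal sketch of computing and projecting a CAM heatmap for a single frame is given below, following the standard CAM formulation (a class-weighted sum of the last convolutional feature maps); the use of OpenCV's JET colormap for the red-to-blue projection is an assumption about the visualization.

```python
import cv2
import numpy as np
import torch

def cam_heatmap(feature_maps, fc_weights, class_idx, size=(480, 480)):
    """feature_maps: (C, h, w) tensor from the last conv layer; fc_weights: (num_classes, C)."""
    cam = torch.einsum('c,chw->hw', fc_weights[class_idx], feature_maps)   # weighted sum of maps
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)               # normalize to [0, 1]
    cam = cv2.resize(cam.detach().cpu().numpy(), size)                     # upsample to frame size
    heatmap = cv2.applyColorMap((cam * 255).astype(np.uint8), cv2.COLORMAP_JET)
    return cam, heatmap   # cam feeds the Algorithm 2 calibration; heatmap is overlaid for visualization
```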

By analyzing the activated regions in the CAM (described in Algorithm 2) and calibrating the original prediction results, we improve the sensitivity of the CNN+LSTM model in detecting gastric peristalsis. The results are reported in Fig. 5. With the explainable information provided by the CAM, we reduce the false negative results in detecting gastric peristalsis compared to the vanilla CNN+LSTM model.

Fig. 5
figure 5

Sensitivity scores before and after using the CAM. Using the explainable information provided by CAM, the sensitivity score of the CNN+LSTM models is improved.

Fig. 6
figure 6

An example of applying the periodical feature detector to an MCCE frame sequence (case2 in the testing set). For each interval value I from 10 to 50 s, the periodical detector generates a corresponding feature difference score. The I (between the predefined \(T_l\) and \(T_u\)) with the local minimum feature difference score is identified as the detected period P. In this case, the detected period of human gastric peristalsis is 17.5 s (denoted by the red line), which is close to the counted period (19.2 s).

Measuring period of human gastric peristalsis

To capture the period of human gastric peristalsis, we develop the periodical feature detector in Algorithm 3. The proposed detector extracts the periodical information by analyzing the differences between MCCE feature maps at different intervals. An example of applying the periodical feature detector to an MCCE frame sequence is shown in Fig. 6. In this case (case2 in the testing set), the detected period is 17.5 s (an error rate of 8.85% compared to the counted period of 19.2 s).

Fig. 7
figure 7

(a) The counted period of human gastric peristalsis for the testing set of 30 MCCE video sequences. (b) The error rate of the detected period of human gastric peristalsis for the testing set.

We apply the periodical detector to the testing set of 30 MCCE frame sequences. The counted periods and the error rates of the detected periods are reported in Fig. 7. The proposed periodical feature detector achieves a mean error rate of 8.36% with a standard deviation of 12.84%.

Discussion

In this paper, we explored deep learning and image processing algorithms for detecting and measuring the period of human gastric peristalsis. We developed a generic framework for detecting human gastric peristalsis using deep learning and explored multiple CNN+LSTM models and the Video Swin Transformer within this framework. We also developed a CMD for filtering out MCCE frames that are degraded by camera movement. The current design of the CMD is based on processing MCCE frames; in the future, we may incorporate additional information, such as magnetic positioning data from the NaviCam, and adapt the design of the CMD to various devices. On our MCCE dataset with more than 100,000 MCCE frames (100,055 specifically) from 30 subjects, we achieved 0.8882 accuracy, 0.8192 F1, and 0.9400 AUC scores for detecting gastric peristalsis. In the future, we will train and test the proposed algorithms on different MCCE systems from different medical centers to evaluate their generalization ability. Moreover, we improved the sensitivity of detecting gastric peristalsis using the visual interpretation provided by the CAM. To measure the period of gastric peristalsis, we designed a periodical feature detector, which achieves a mean error rate of 8.36% on our dataset and outperforms the existing method in our previous research31. We notice that the periodical feature detector has the highest error rate (67.68%) on case26, which we investigate in Fig. 8. In case26, the MCCE frame sequence captures a substantial amount of mucus. The mucus has shape features and motions different from those of gastric peristalsis, which may degrade the performance of the periodical feature detector. To mitigate this performance drop, image-denoising algorithms may be adopted to preprocess the data and reduce the level of noise. The performance of peristalsis detection can also be improved by involving more diverse training data, such as datasets with more presence of debris and mucus; with a more diverse dataset, the deep learning model can learn to ignore the noise (e.g., debris and mucus) and focus on detecting gastric peristalsis.

Fig. 8
figure 8

Still image of the MCCE frame sequence case26 (Supplementary Video case26.mp4) in the testing set. The MCCE frames are de-identified. The MCCE frame sequence captures the presence of mucus.

The proposed algorithms have great potential to be integrated into clinical workflows. For example, the algorithms can be integrated into the data recorders and computer workstations of MCCE systems (shown in Fig. 1c). During clinical diagnosis, the collected data can be analyzed to provide real-time gastric motility evaluation. Moreover, the proposed algorithms can run offline and retrospectively analyze MCCE data collected from previous clinical diagnoses. However, more diverse training data is needed before adopting the proposed algorithms in real clinical scenarios. In particular, the training data should cover a wide range of stomach environments, e.g., different genders, age ranges, ethnic groups, and previous medical conditions.

Conclusion

As an exploratory study on the automatic detection and measurement of human gastric peristalsis, the algorithms developed in this research have great potential to help both clinicians and patients. Using the proposed algorithms, the extensive manual labor in evaluating gastric peristalsis, such as inspecting each MCCE frame and counting the period of peristalsis, can be reduced for clinicians, and patients can benefit from reliable examination results. The proposed algorithms contribute to the efficient and reliable workflow that we envision for MCCE systems. However, the algorithms, especially the periodical feature detector, were developed based on a clean gastric environment without the presence of debris and mucus. Although we improved the sensitivity of the CNN+LSTM model using the explainable visual interpretations provided by CAM, the model performance may deteriorate in the presence of gastric debris and mucus. In the future, we will improve the algorithm design with a focus on the presence of debris and mucus by acquiring more data; with a more diverse dataset, the deep learning model can learn to be robust to the noise present in the images, including debris and mucus. We may also add image-denoising algorithms to pre-process the MCCE frames. We optimized the parameters (\(\mu\), \(\sigma\), and T in Algorithm 1, \(\hbox {T}_p\) and \(\hbox {T}_c\) in Algorithm 2) for the NaviCam MCCE system; we will keep optimizing these parameters using more data and for MCCE systems from other manufacturers. Moreover, our current dataset was collected from healthy volunteers. We will extend our dataset to include patients with gastric diseases, aiming to further improve the robustness of the proposed algorithms with the extended dataset. In addition, we will enable the proposed algorithms to detect and classify human gastric diseases based on peristalsis.