Table 14. Comparison of CNN/LSTM-based and transformer-based HAR models with XTinyHAR.
| Model | Architecture type | Accuracy (%) | Modality | Model complexity |
|---|---|---|---|---|
| CNN/LSTM-based HAR models | | | | |
| CNN-LSTM [19] | Deep learning hybrid | 90.89 | IMU | High |
| Attention-LSTM [20] | Attention-based RNN | 94.30 | IMU | High |
| Self-supervised CNN [21] | Conv-based SSL | 96.50 | IMU | Medium |
| DeepConvLSTM [22] | CNN + LSTM | 93.70 | IMU | High |
| MC-HARNet [23] | Multiscale CNN | 95.20 | IMU | Medium |
| Transformer-based HAR models | | | | |
| Spectro-transformer [25] | Transformer (spectrogram) | 97.80 | IMU | High |
| Contrastive transformer [26] | Contrastive + Transformer | 98.60 | IMU | High |
| RFID-transformer [27] | RFID + Transformer | 99.10 | RFID | High |
| Skeleton-ViT [28] | ViT for skeleton data | 98.50 | Skeleton | High |
| Student-teacher HAR [29] | Transformer KD | 98.70 | Multi to IMU | Medium |
| DMFT [30] | Distilled multi-modal transformer | 93.97 | Multi-modal | Medium |
| XTinyHAR (ours) | Lightweight inertial transformer (KD) | 98.71 (UTD), 98.55 (MM-Fit) | IMU | Low (2.45 MB) |