Visual prediction method based on time series-driven LSTM model

Jumahong, Huxidan; Wang, Yongjie; Aili, Abuduwaili; Wang, Weina

doi:10.1038/s41598-025-21911-9

Article
Open access
Published: 30 October 2025

Huxidan Jumahong^1,2, Yongjie Wang^3, Abuduwaili Aili^1 & Weina Wang^1,3

Scientific Reports volume 15, Article number: 38057 (2025)

Abstract

Significant progress has been made in time series prediction and image processing. However, most studies have focused on either time series or image processing separately, failing to integrate the advantages of both fields. To overcome the limitations of existing algorithms in image temporal inference, this paper proposes a novel visual prediction framework based on a time series forecasting model, which can predict single-frame or multi-frame images by analyzing their spatio-temporal features. First, a ViT image feature extraction module is constructed, which learns image representations and extracts features by randomly masking and reconstructing the image. Then, a time series construction module is designed to convert the extracted features into time series suitable for the LSTM network. Finally, the time series data are predicted by the LSTM, and the predicted time series data are transformed into the predicted image. A series of experiments is conducted on three types of cloud image datasets. The results demonstrate the effectiveness and feasibility of the proposed method in terms of image prediction performance.

Introduction

A time series is a sequence of observations recorded in chronological order and collected at fixed or variable intervals. The main characteristics of time series¹ include trend, seasonality, autocorrelation, and stationarity. These characteristics refer, respectively, to the upward or downward movement of the data on a longer time scale, periodic and regular fluctuations in the data, the correlation between the current observation and its past values, and statistical properties of a series (mean, variance) that remain constant over time. Time series forecasting aims to mine information for modeling and forecasting, and to provide reliable decision support using efficient forecasting models. As an important data mining technique, time series forecasting plays a crucial role in real applications. In the financial field², correctly predicting stock price fluctuations can help investors develop effective investment strategies. In the transportation field³, accurately predicting changes in traffic flow can help urban planners improve the traffic management system. Other application fields include environmental research⁴ and industrial processes⁵. Therefore, the accuracy and reliability of time series forecasts are crucial for achieving efficient decision-making and resource utilization. However, with the continuous advancement of data collection and storage technologies, a large amount of time series data is collected, which makes time series forecasting more important and challenging. To address these challenges, forecasting models such as statistical models⁶, machine learning models⁷, and deep learning models⁸ have been proposed. DeepAR⁹ uses RNN and autoregressive methods to predict future short-term sequences. TimesNet¹⁰ transforms the original one-dimensional time series into a two-dimensional space and captures multi-periodic features through convolution. Pathformer¹¹ achieves a more complete time-series characterization through adaptive multi-scale path design. DifFormer¹² incorporates multi-resolution differencing mechanisms to construct a general prediction framework. These models show superior performance in several benchmark tests, but there is still room for improvement in long-period dependency modeling.

As a key interdisciplinary subject connecting computer vision and artificial intelligence, digital image processing has evolved from classical algorithms to intelligent paradigms, and its technical system and application boundaries continue to expand. Traditional methods are centered on spatial and frequency domain analysis, suppressing noise through spatial domain operations such as median filtering, extracting contour features with edge detection algorithms such as Sobel¹³ and Canny¹⁴, and characterizing texture with the Fourier transform¹⁵ and wavelet decomposition¹⁶. Although hand-designed features dominated early target detection, their generalization ability is limited by manual prior knowledge, which makes it difficult to cope with complex scenes. The rise of deep learning has reshaped the technological paradigm of image processing, and Convolutional Neural Networks (CNNs) represented by ResNet¹⁷ and Mask R-CNN¹⁸ have achieved breakthroughs in image classification and semantic segmentation through hierarchical local feature learning. Generative Adversarial Networks (GANs) have opened new paths in image synthesis: CycleGAN¹⁹ enables cross-domain style transfer, and SRGAN²⁰ improves the quality of super-resolution reconstruction. Vision Transformer (ViT)²¹ captures global context through a self-attention mechanism, and variants such as Swin Transformer²² show superior performance in medical image segmentation. On the lightweight side, MobileNet²³ and the W4A8 quantization scheme²⁴ have significantly improved the efficiency of real-time inference on edge devices and promoted the deployment of industrial-grade applications. Key advances have also been made in multimodal fusion and data efficiency. D-JEPA²⁵ fuses diffusion models with masked image modeling, balancing the quality and efficiency of generation.

Self-supervised frameworks such as SimCLR²⁶ and few-shot learning algorithms alleviate the dependence on labeled data. These advances have spawned transformative applications in multiple fields. The Transformer-based TransCT²⁷ model reduces radiation dose by 80% while preserving diagnostic details. Dp-M3D²⁸ uses FEDC to leverage depth confidence guidance to determine the deformation offset of the 3D bounding box, aligning features more effectively and improving depth perception; its Dp-NMS refines the selection process by incorporating the product of classification confidence and depth confidence, ensuring that candidate boxes are ranked effectively and the most suitable detection box is retained. CTAFFNet²⁹ is built around a Local–Global Feature Fusion unit known as the Convolutional Transformation Adaptive Fusion Kernel (CTAFFK). The CTAFFK kernel uses two branches, a CNN and a Transformer, to extract local and global features from the image, and adaptively fuses the features from both branches. TS-BEV³⁰ replaces the previous multi-frame sampling method with cyclic propagation of historical frame instance information. Its temporal-spatial feature fusion attention module fully integrates temporal information and spatial features, and improves inference and training speed. A YOLOv10³¹-driven industrial defect detection system achieves a 99.2% recall rate. GAN technology supports the high-fidelity restoration of Dunhuang murals, promoting the digital protection of cultural heritage. With the rapid development of deep learning in recent years, training models are expected to accommodate hundreds of millions of labeled data points. The need for large-scale data processing has been addressed by self-supervised pretraining in the fields of natural language processing (NLP)³² and computer vision (CV)³³. Most of these solutions are based on masked modeling, such as masked language modeling in NLP or masked image modeling in CV. These methods first mask parts of the original data, and then recover these parts through learning. Masked modeling³⁴ enables models to infer the removed parts from contextual information, and thus to learn deeper semantics, which has become a benchmark for self-supervised pre-training in both NLP and CV. Pre-training with masked modeling has been shown to work well for a variety of downstream tasks, and one of the simpler and more effective approaches is the Masked Auto-Encoder (MAE)³⁵.

Based on the above analysis, the following conclusions can be drawn: 1. Static visual models dominate and lack dynamic modeling capability. Mainstream models such as ResNet focus on classification and detection tasks for static images, ignore the temporal correlation of images, and lack a mechanism for modeling temporal continuity and inter-frame evolution, which makes them unsuitable for direct use in future frame prediction tasks. 2. Time series models are difficult to use directly for image modeling. Most time series models are designed for one-dimensional numerical sequences and cannot deal with high-dimensional image data. 3. Existing time series prediction models are mainly oriented to low-dimensional structured data and are not suitable for high-dimensional inputs such as image frames. Moreover, current mainstream visual computing models generally focus on perceptual tasks such as static image classification and target detection (e.g., the YOLO architecture), and have not been applied to visual prediction. To overcome the limitations of existing algorithms in image temporal inference, this paper constructs a visual prediction framework based on the LSTM³⁶ model, which realizes generative prediction of future single-frame or multi-frame images by analyzing the spatiotemporal feature evolution of historical frame sequences. The main contributions of this paper are as follows:

1. The ViT image feature extraction module is constructed based on the masked autoencoding paradigm. The module randomly masks image blocks at a high masking rate and trains the model to reconstruct the missing blocks. This mechanism makes the model learn the contextual dependencies between image blocks and generate more discriminative feature representations.
2. A progressive reconstruction framework based on sequence-image co-transformation is designed to enhance feature acquisition and improve prediction performance. The features and local-global semantic associations of the image are extracted and converted into a time series by the row-major sequence reconstruction method, and the predicted time series are then projected back to the high-dimensional semantic space through the inverse mapping mechanism.
3. A visual prediction method based on time series-driven forecasting is proposed, which transforms the image prediction problem into a time series prediction problem. To the best of our knowledge, this is the first time that the combination of time series analysis and image processing methods has been used to solve the image prediction problem.

By integrating ViT-MAE with LSTM, the model addresses the limitations of both static models and classical time-series approaches, offering a unified predictive framework. ViT-MAE employs a high masking ratio for representation learning, forcing the model to learn robust, contextual features through global reasoning rather than local textures. This enables explicit modeling of temporal continuity and inter-frame evolution, capabilities that are absent in static models. Furthermore, by converting 2D feature maps into 1D sequences using row-major order with positional encoding, the method transforms high-dimensional image prediction into a format amenable to LSTMs, while preserving spatial topology and ensuring reversibility. Unlike pooling or flattening operations, which discard structural consistency, the sequence representation preserves patch relationships across time, enabling accurate image reconstruction from predicted sequences. Finally, the proposed framework combines state-of-the-art spatial feature learning with powerful temporal dynamics modeling, providing a generalizable blueprint for image prediction tasks.

Related work

LSTM-based time series forecasting

As an important variant of recurrent neural networks, LSTM³⁶ has demonstrated excellent stability and modeling capability for time series forecasting, and its core strength lies in the effective capture of long-range dependencies. The model achieves this through a memory cell architecture governed by three parameterized gates working in concert. The memory cell $c_t$ is essentially a self-updating information container whose update is regulated by the gating logic. When new input data enters the system, the input gate determines how much of the current information is admitted, the forget gate dynamically adjusts the retention ratio of the historical memory $c_{t-1}$, and the output gate filters the information to be delivered to the hidden state $h_t$.
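
The gating logic described above can be sketched as a single update step. The following is a minimal NumPy illustration of the standard LSTM equations, not the authors' implementation; the function name `lstm_step`, the stacked weight layout, and the toy sizes are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One standard LSTM step.

    W: (4n, d) input weights, U: (4n, n) recurrent weights, b: (4n,) bias,
    stacked in the (assumed) order input gate, forget gate, candidate, output gate.
    """
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:n])            # input gate: how much new information enters
    f = sigmoid(z[n:2 * n])        # forget gate: how much of c_{t-1} is retained
    g = np.tanh(z[2 * n:3 * n])    # candidate memory content
    o = sigmoid(z[3 * n:4 * n])    # output gate: what is exposed through h_t
    c_t = f * c_prev + i * g       # memory cell update
    h_t = o * np.tanh(c_t)         # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d, n = 3, 4                        # toy input and hidden sizes (assumed)
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), W, U, b)
```

Because the cell state is updated additively (`f * c_prev + i * g`), information can persist over many steps when the forget gate stays close to 1.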

This triple-gating mechanism not only realizes the temporal filtering of information, but also effectively mitigates the vanishing gradient problem of traditional RNN³⁷ models through the Constant Error Carousel. Compared with the basic RNN model, LSTM decouples information storage from information delivery. The memory cell acts as an independent data carrier and realizes the accumulation and forgetting of information through the differential adjustment of the gating parameters. This design not only ensures long-lasting memory of key information, but also gives the model the flexibility to dynamically adjust the memory cycle. Research practice has repeatedly verified that this gated memory architecture shows significant advantages in long sequence tasks such as speech recognition and machine translation. As a result, more and more LSTM variants have been proposed. xLSTM³⁸ was proposed to address the traditional LSTM’s inability to dynamically revise storage decisions, its limited storage capacity, and its insufficient parallelism, by means of sLSTM (scalar memory + memory mixing) and mLSTM (matrix memory + covariance update rule) together with exponential gating and matrix memory. SwinLSTM³⁹ combines the global self-attention mechanism of Swin Transformer with the temporal modeling capability of LSTM to propose a novel recurrent unit for spatiotemporal prediction tasks: Swin Transformer blocks capture global spatial dependencies, and the LSTM handles temporal dynamics. KNN-LSTM⁴⁰ replaces the linear transformation layer in the traditional LSTM with the KAN network and enhances the model’s ability to capture complex temporal patterns by combining nonlinear functions, which is suitable for high-frequency financial data and non-smooth sequence prediction.

LSTM is widely used in natural language processing and speech recognition due to its superior sequence modeling ability, and its gating structure can effectively alleviate the long-range dependency problem. However, there is still an obvious bottleneck in image sequence modeling tasks. On the one hand, LSTM struggles to capture the complex spatial structure inside an image and lacks the ability to perceive local semantics and global context. On the other hand, the sequential dependency of the traditional LSTM in processing sequence data limits its parallel processing efficiency in high-dimensional image scenarios. Although existing variants try to integrate spatial information (e.g., SwinLSTM) or enhance the memory mechanism (e.g., xLSTM), they still fail to achieve efficient, end-to-end image predictive modeling.

Vision transformer-based image processing

Vision Transformer (ViT)²¹ is a breakthrough model that applies the Transformer architecture from natural language processing to computer vision tasks. The core idea is to segment an image into fixed-size image blocks (patches) and input these blocks into the Transformer encoder as sequences, modeling global dependencies through a self-attention mechanism. ViT and its variants have been widely used in many domains, such as image classification⁴¹, target detection and segmentation (Next-ViT⁴²), point cloud and 3D data processing (PointMAE⁴³), and multimodal and edge computing (AIQViT⁴⁴). More and more ViT-based models have been proposed. DyT⁴⁵ reduces training memory consumption and accelerates inference by replacing layer normalization with Dynamic Tanh. EoMT⁴⁶ reuses the ViT encoder for image segmentation and proposes a “mask annealing strategy”, which enables masked attention at the beginning of training and gradually transitions to mask-free inference, significantly improving efficiency. BHViT⁴⁷ designs a binarized ViT for edge devices that enhances local information interactions through a hybrid convolution-transformer architecture (MSDGC module), and introduces a quantized decomposition of the attention matrix and a regularized loss function to reduce binarization noise.

Although Vision Transformer performs well in static tasks such as image classification, native ViT and its variants generally lack temporal modeling capabilities for image sequences. When dealing with tasks such as image prediction, such models are unable to model the evolutionary patterns between frames, and it is difficult to ensure the structural consistency and contextual reasoning ability between image blocks. Meanwhile, their high training and inference costs limit their application in high-frequency image sequence generation tasks.

Masked autoencoder in time series and image processing

Masked modeling paradigms have demonstrated powerful representation learning capabilities in natural language processing, and the classical approach represented by BERT has successfully constructed efficient linguistic representations for downstream tasks by randomly masking part of the input sequences and reconstructing the masked contents in the pre-training phase. This self-supervised learning idea is gradually expanding to multimodal data. In computer vision, Context Encoders pioneered image mask modeling by using convolutional neural networks to reconstruct locally occluded regions. Inspired by the success of Transformer in NLP, approaches such as BeiT⁴⁸ and ViT combine the vision Transformer architecture with pixel prediction tasks. The Masked Auto-Encoder (MAE) further proposes a high-ratio masking strategy that retains only a small number of visible image blocks as input to the encoder, effectively enhancing the feature extraction capability of the model. This MAE paradigm exhibits strong generalization ability, and the subsequent derivations PointMAE⁴³ and VideoMAE⁴⁹ validate its effectiveness in 3D point cloud and video understanding, respectively. For time series data, ExtraMAE⁵⁰ combines recurrent neural networks with a dynamic masking mechanism, which significantly improves the efficiency of time series modeling. For complex spatiotemporal data, TSformer⁵¹ constructs a mid-layer representation supporting graph neural networks through a joint spatiotemporal masking strategy. These cross-domain practices show that reasonably designed masking strategies with adapted model architectures can effectively mine the intrinsic associations of different modal data and provide a unified learning framework for multimodal intelligent systems. Although masked autoencoders have achieved remarkable results in computer vision, their core modeling mechanism is still mainly oriented to static image structure and lacks effective modeling of the temporal evolution of images. Meanwhile, the high masking rate strategy has an information recovery bottleneck in complex images, and existing methods fail to build a unified framework for structural guidance and temporal modeling. LSTM, Vision Transformer, and masked modeling have been combined to form new processing models and have achieved good results. VisionTS⁵² combines time series with image processing. MTSMAE⁵³ combines time series with a masked autoencoder. However, these components have not been unified in a single model, nor have they been used for visual image prediction.

Therefore, to further expand the application boundaries of the mask modeling paradigm, it is urgent to combine the masking mechanism with the temporal modeling structure to achieve a cross-dimensional breakthrough from image reconstruction to image prediction. In this paper, we propose a generative visual prediction framework that integrates the image and time series prediction methods, which can comprehensively improve the spatio-temporal reasoning and generative capability of the model from feature extraction and sequence modeling to image reconstruction.

Methods

Overall framework

In this paper, a visual prediction framework based on LSTM³⁶ recurrent neural network is proposed. The prediction network mainly consists of the ViT image feature extraction module, time series data construction module, and LSTM prediction module, as shown in Fig. 1. To realize the prediction of the image, the ViT feature extraction module is first constructed to analyze and learn the image and extract features by performing random masking and reconstruction. Then, the extracted features are fed into the time series construction module to generate the data that is acceptable to the LSTM. Finally, the time series data are predicted accordingly by the LSTM, and the predicted time series data are transformed into the predicted image.

ViT image feature extraction module

ViT requires a large amount of data for training, and it relies on labeled datasets. The labeling process incurs high labor costs, and labeled data are often more difficult to obtain than unlabeled data. In the field of NLP, pre-training BERT models on massive unlabeled corpora has achieved great success, and unsupervised learning based on unlabeled data has become a research direction that is attracting much attention. Therefore, the unsupervised ViT-MAE model is used to extract the features of the image, as shown in Fig. 2. The method guides the training of ViT through the backbone obtained from unsupervised learning, thus reducing the labeled training data required by ViT and helping ViT to learn a deeper representation. The specific method is as follows:

Image patch and mask representation

Let the input image be $x \in R^{H \times W \times 3}$, where H is the image height, W is the image width, and 3 is the number of RGB channels. The image is segmented into $N=(H / P) \times (W / P)$ non-overlapping square patches, each of size $P\times P$ with $P=16$, and each patch is then flattened into a one-dimensional vector $x_{i} \in R^{3P^{2} }$. The masking ratio $\rho$ is set to 85%, i.e., $M=\rho N$ patches are masked. The randomly sampled patch indices form the masked set $\omega \subset \left\{ 1, ..., N \right\} ,\left| \omega \right| = M$, and the visible set $v = \left\{ 1,..., N \right\} \setminus \omega$.
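
The patch partition and random masking can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code; the helper names `patchify` and `random_mask` are hypothetical, indices are 0-based, and rounding $\rho N$ to the nearest integer is an assumption, since $\rho N$ is not an integer for every N.

```python
import numpy as np

def patchify(x, P):
    """Split an (H, W, 3) image into N = (H/P)*(W/P) flattened patches of length 3*P*P."""
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    x = x.reshape(H // P, P, W // P, P, C)   # (rows, P, cols, P, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (rows, cols, P, P, C)
    return x.reshape(-1, P * P * C)          # patches in row-major order

def random_mask(N, rho, rng):
    """Return (masked indices omega, visible indices v) with |omega| = round(rho * N)."""
    M = int(round(rho * N))                  # rounding is an assumption of this sketch
    perm = rng.permutation(N)
    return np.sort(perm[:M]), np.sort(perm[M:])

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))            # toy input at 256 x 256 resolution
patches = patchify(image, P=16)              # shape (256, 768)
omega, v = random_mask(len(patches), rho=0.85, rng=rng)
```

With $N=256$ and $\rho=0.85$, this sketch masks 218 patches and leaves 38 visible, so the encoder only processes about 15% of the tokens.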

Patch embedding

Map all patches $x_{i}$ to vectors of dimension d through a linear layer with position coding:

$$\begin{aligned} e_{i} =W_{proj} vec\left( x_{i} \right) +P_{i} \end{aligned}$$

(1)

where $vec\left( x_{i} \right) \in R^{3P^{2} }$ is the flattened pixel vector, $W_{proj}\in R^{d\times \left( 3P^{2} \right) }$ is a learnable linear mapping, and $P_{i } \in R^{d}$ is the position embedding vector, which encodes the patch position with sine-cosine functions.
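
Eq. (1) can be illustrated with a short NumPy sketch. The text only states that the position embedding uses sine-cosine functions, so the 1D formulation below (the common Transformer variant with base 10000) is an assumption, as are the function names and the toy dimension $d=64$.

```python
import numpy as np

def sincos_position(N, d):
    """1D sine-cosine position table of shape (N, d); d is assumed to be even."""
    pos = np.arange(N)[:, None]                         # (N, 1)
    freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))    # (d/2,)
    table = np.zeros((N, d))
    table[:, 0::2] = np.sin(pos * freq)                 # even dimensions
    table[:, 1::2] = np.cos(pos * freq)                 # odd dimensions
    return table

def embed_patches(patches, W_proj, pos_table):
    """e_i = W_proj vec(x_i) + P_i for every patch, computed for all rows at once."""
    return patches @ W_proj.T + pos_table

rng = np.random.default_rng(0)
N, patch_dim, d = 256, 3 * 16 * 16, 64                  # d = 64 is a toy choice
patches = rng.random((N, patch_dim))
W_proj = rng.normal(scale=0.02, size=(d, patch_dim))    # learnable in practice
E = embed_patches(patches, W_proj, sincos_position(N, d))
```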

Encoder-encodes only visible patches

The embedding $\left\{ e_{i} \right\} _{i\in v}$ of all visible patches is fed into the standard Vision Transformer encoder, and the output is a set of hidden vectors $\left\{ z_{i} \mid i\in v \right\}$, i.e., the result of global modeling of the visible image information, providing contextual semantics to support the decoder in reconstructing the occluded content:

$$\begin{aligned} z_{v} =Encoder\left( \left\{ e_{i} \right\} _{i\in v} \right) \end{aligned}$$

(2)

Decoder-rebuilds all patch pixels

First, a shared learnable mask token $m\in R^{d}$ is introduced for the masked patches:

(1)
For each $j\in \omega$, construct the input $t_{j}=m+p_{j}$, where $p_{j} \in R^{d}$ is the position embedding vector of patch j.
(2)
Recombine all tokens (encoder outputs + mask tokens) into a complete sequence and restore the original order:

$$\begin{aligned} T=Unshuffle \left( \left\{ z_{i} \right\} _{i\in v} \cup \left\{ t_{j} \right\} _{j\in \omega } \right) \in R^{N\times d} \end{aligned}$$

(3)

where v denotes the set of visible patches, $z_{i}$ denotes the encoder’s output vector for the i-th visible patch, and Unshuffle() rearranges the merged tokens into the order of the original patches to obtain the complete sequence.

Then, the complete token sequence T is fed into a small Transformer decoder: $\hat{y}_{1}, ...,\hat{y}_{N}=Decoder\left( T \right)$, where each output vector $\hat{y}_{i}\in R^{3P^{2}}$ represents the reconstructed pixel value of patch $x_{i}$.
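
The token recombination of Eq. (3) can be sketched as an index scatter. This is an illustrative NumPy version under the definitions above (0-based indices); the function name `unshuffle` and the toy sizes are assumptions, and the Transformer decoder itself is omitted.

```python
import numpy as np

def unshuffle(z_visible, v, omega, mask_token, pos_table):
    """Build T in R^{N x d}: encoder outputs at visible slots, m + p_j at masked slots."""
    N = len(v) + len(omega)
    d = z_visible.shape[1]
    T = np.empty((N, d))
    T[v] = z_visible                             # z_i for i in v
    T[omega] = mask_token + pos_table[omega]     # t_j = m + p_j for j in omega
    return T

rng = np.random.default_rng(0)
N, d = 6, 4                                      # toy sizes (assumed)
v = np.array([1, 4])                             # visible patch indices
omega = np.array([0, 2, 3, 5])                   # masked patch indices
z_visible = rng.normal(size=(len(v), d))         # stand-in for encoder outputs
mask_token = rng.normal(size=d)                  # shared learnable token m
pos_table = rng.normal(size=(N, d))              # stand-in for position embeddings p_j
T = unshuffle(z_visible, v, omega, mask_token, pos_table)
```

Every row of `T` sits at its original patch position, so the decoder sees a full-length sequence even though the encoder processed only the visible subset.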

Time series building blocks

Based on the ViT patching algorithm, the input image is first segmented into $16 \times 16$ pixel blocks, and two-stage feature learning is performed by the improved ViT-MAE. In the first stage, a self-supervised reconstruction task with a high masking rate of 85% is used to strengthen the local-global semantic associations. In the second stage, the spatio-temporal dimensionality distribution of the feature blocks is adjusted by the dimensionality reduction and alignment mechanism, and finally the 2D features are transformed into spatio-temporally continuous 1D sequences by the row-major sequence reorganization algorithm. Let the original image be $I\in R^{H\times W\times 3}$, where H is the height, W is the width, and 3 is the number of RGB channels.

Patching

The original image is partitioned into patches of size $p\times p$, and the total number of patches is $N=\left( H/p \right) \times \left( W/p \right)$, where H/p is the number of patch rows and W/p is the number of patch columns. A particular patch in the image can then be represented as:

$$\begin{aligned} P_{i,j} =I\left[ \left( i-1 \right) p+1:ip,\left( j-1 \right) p+1:jp,: \right] \end{aligned}$$

(4)

where $\left( i-1 \right) p+1:ip$ denotes the pixel range of the patch pixel block in the row direction; $\left( j-1 \right) p+1:jp$ denotes the pixel range of the patch pixel block in the column direction, and “:” denotes that all RGB channels are preserved.

Extract features and form time series data

Each $P_{i,j}$ is fed into the ViT-MAE encoder to extract features, which are flattened into vectors in row-major order: $f_{k} =Encoder\left( P_{k} \right) \in R^{3p^{2} },k=1,...,N$, where $P_{k}$ denotes the patches arranged by rows through the subscript mapping $k=\left( i-1 \right) \frac{W}{p} +j$. All feature vectors are then arranged into a sequence in the order of k:

$$\begin{aligned} X=\left[ f_{1}, f_{2},...,f_{N} \right] \in R^{N\times \left( 3p^{2} \right) } \end{aligned}$$

(5)
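
The subscript mapping $k=\left( i-1 \right) \frac{W}{p} +j$ and its inverse can be checked with a few lines of plain Python. The function names are hypothetical; the 1-based indexing follows the text.

```python
def grid_to_seq(i, j, cols):
    """Map the 1-based grid position (i, j) to the 1-based sequence index k."""
    return (i - 1) * cols + j

def seq_to_grid(k, cols):
    """Inverse mapping: recover the 1-based grid position (i, j) from k."""
    return (k - 1) // cols + 1, (k - 1) % cols + 1

cols = 256 // 16   # W / p for a 256-pixel-wide image with p = 16
k_first = grid_to_seq(1, 1, cols)
k_second_row = grid_to_seq(2, 1, cols)
k_last = grid_to_seq(16, 16, cols)
```

Because the mapping is a bijection between grid positions and sequence indices, a predicted sequence can later be placed back on the grid without ambiguity.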

Linear dimensionality reduction

Since the $3p^{2}$-dimensional features are too high-dimensional to be used directly, a linear transformation maps each $f_{k}$ to the input dimension d of the LSTM, i.e., $e_{k} =W_{e} f_{k} +b_{e} ,W_{e}\in R^{d\times \left( 3p^{2} \right) } ,b_{e} \in R^{d}$, where $W_{e}$ denotes the linear transformation matrix and $b_{e}$ denotes the bias term. The sequence of features after dimensionality reduction is $\hat{X} =\left[ e_{1}, e_{2},...,e_{N} \right] ^{T} \in R^{N\times d}$.

Position code

To distinguish the spatial locations of different patches, a learnable position bias $\pi _{k} \in R^{d}$ is added to each sequence element. The final time series data are then $\tilde{e} _{k} =e_{k} +\pi _{k} ,k=1,...,N$, so the input to the LSTM is $\tilde{X} =\left[ \tilde{e}_{1} , \tilde{e}_{2},...,\tilde{e}_{N} \right] ^{T} \in R^{N\times d}$.

Deep prediction architecture

Using the learned spatiotemporally continuous sequence, an LSTM network is used for temporal evolution prediction. The predicted feature vectors are inversely mapped back to the original patch space, and two-dimensional reorganization is carried out based on the spatial order of the initial segmentation. Finally, the predicted patches are spliced together to generate the complete predicted image.

LSTM time series prediction

Let the history length be L and the prediction length be $H'$, then

$$\begin{aligned} X_{hist}=\left[ \tilde{e}_{1} ,...,\tilde{e}_{L} \right] ^{T} \in R^{L\times d} ,X_{true}=\left[ \tilde{e}_{L+1} ,...,\tilde{e}_{L+H'} \right] ^{T} \in R^{H'\times d} \end{aligned}$$

(6)

First, an LSTMCell is used. Denote the state at step t by ($h_{t}$,$c_{t}$), with initial values $h_{0}$=0,$c_{0}$=0. Then for $t=1,..., L$:

$$\begin{aligned} \left( h_{t},c_{t} \right) =LSTMCell\left( \tilde{e}_{t},\left( h_{t-1},c_{t-1} \right) \right) \end{aligned}$$

(7)

Then, define the linear head $W_{a} \in R^{d\times h}$ and the bias $b_{a} \in R^{d}$, where h is the LSTM hidden dimension, such that

$$\begin{aligned} \tilde{e}_{L+1}=W_{a}h_{L}+b_{a}, \left( h_{L+1},c_{L+1} \right) =LSTMCell\left( \tilde{e}_{L+1},\left( h_{L},c_{L} \right) \right) \end{aligned}$$

(8)

Sequentially, the feature vectors for future time steps can be obtained autoregressively:

$$\begin{aligned} \tilde{e}_{L+\tau }=W_{a}h_{L+\tau -1}+b_{a}, \left( h_{L+\tau },c_{L+\tau } \right) =LSTMCell\left( \tilde{e}_{L+\tau },\left( h_{L+\tau -1},c_{L+\tau -1} \right) \right) \end{aligned}$$

(9)

until $\tau =H'$, to obtain the final predicted sequence:

$$\begin{aligned} \hat{Y} =\left[ \tilde{e}_{L+1} ,...,\tilde{e}_{L+H'} \right] ^{T}\in R^{H'\times d} \end{aligned}$$

(10)
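
The recursion in Eqs. (7)-(10) can be sketched as a rollout loop. This is a minimal NumPy illustration with a standard LSTM cell and random weights, not the trained model; the function names, the stacked gate order, and the toy sizes are assumptions of this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, state, params):
    """Standard LSTM step; params = (W, U, b) with gates stacked as i, f, g, o."""
    W, U, b = params
    h_prev, c_prev = state
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2 * n]), sigmoid(z[3 * n:])
    g = np.tanh(z[2 * n:3 * n])
    c = f * c_prev + i * g
    return o * np.tanh(c), c

def rollout(X_hist, horizon, params, W_a, b_a):
    """Encode the history (Eq. 7), then predict `horizon` steps autoregressively (Eqs. 8-10)."""
    n = W_a.shape[1]
    state = (np.zeros(n), np.zeros(n))            # h_0 = 0, c_0 = 0
    for e_t in X_hist:                            # t = 1, ..., L
        state = lstm_cell(e_t, state, params)
    preds = []
    for _ in range(horizon):                      # tau = 1, ..., H'
        e_next = W_a @ state[0] + b_a             # linear head on the latest hidden state
        preds.append(e_next)
        state = lstm_cell(e_next, state, params)  # feed the prediction back as input
    return np.stack(preds)                        # Y_hat in R^{H' x d}

rng = np.random.default_rng(0)
L, H_pred, d, n = 10, 3, 8, 16                    # toy sizes (assumed)
params = (rng.normal(size=(4 * n, d)), rng.normal(size=(4 * n, n)), np.zeros(4 * n))
W_a, b_a = rng.normal(size=(d, n)), np.zeros(d)
X_hist = rng.normal(size=(L, d))
Y_hat = rollout(X_hist, H_pred, params, W_a, b_a)
```

Each predicted vector becomes the next input, so errors can accumulate as the horizon $H'$ grows; this is a general property of autoregressive rollouts rather than a finding reported in the text.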

Sequence patch reconstruction

Each prediction vector $\tilde{e}_{k} \in R^{d}$ is mapped back to a pixel block. Defining the decoding matrix $W_{d} \in R^{\left( 3p^{2} \right) \times d}$ and bias $b_{d} \in R^{3p^{2} }$, the predicted flattened patch is $\tilde{P} _{k} =W_{d} \tilde{e} _{k} +b_{d} \in R^{3p^{2} }$, which is finally reshaped into a pixel block:

$$\begin{aligned} \hat{P} _{k} =reshape\left( \tilde{P} _{k} \right) \in R^{p\times p\times 3} \end{aligned}$$

(11)

Patch pixel blocks back to full image

Each predicted patch is placed back at its initial grid position (i, j):

$$\begin{aligned} \hat{I} \left[ \left( i-1 \right) p+1:ip,\left( j-1 \right) p+1:jp,: \right] =\hat{P} _{k} ,k=\left( i-1 \right) \frac{W}{p} +j \end{aligned}$$

(12)

where $i=1,..., \frac{H}{p},j=1,...,\frac{W}{p}$. The final predicted image with the same size as the original is obtained: $\hat{I} \in R^{H\times W\times 3}$.
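
The decoding and reassembly steps of Eqs. (11)-(12) can be sketched as follows. This is an illustrative NumPy version with hypothetical helper names and 0-based slices; the round trip at the end only checks that row-major reassembly inverts the partition of Eq. (4).

```python
import numpy as np

def decode_patches(Y, W_d, b_d):
    """Map predicted vectors (N, d) to flattened patches (N, 3p^2): W_d e_k + b_d."""
    return Y @ W_d.T + b_d

def unpatchify(patches, H, W, p):
    """Place N flattened patches (row-major order) back on the (H/p) x (W/p) grid."""
    rows, cols = H // p, W // p
    x = patches.reshape(rows, cols, p, p, 3)   # each vector becomes a p x p x 3 block
    x = x.transpose(0, 2, 1, 3, 4)             # (rows, p, cols, p, 3)
    return x.reshape(H, W, 3)

H, W, p = 8, 12, 4                             # toy sizes (assumed)
image = np.arange(H * W * 3, dtype=float).reshape(H, W, 3)
# Row-major patch sequence built directly from the slicing rule of Eq. (4).
patches = np.stack([image[i * p:(i + 1) * p, j * p:(j + 1) * p, :].reshape(-1)
                    for i in range(H // p) for j in range(W // p)])
restored = unpatchify(patches, H, W, p)

rng = np.random.default_rng(0)
d = 5                                          # toy feature dimension (assumed)
decoded = decode_patches(rng.normal(size=(len(patches), d)),
                         rng.normal(size=(3 * p * p, d)), np.zeros(3 * p * p))
```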

Experiments

To systematically verify the comprehensive efficacy of the model constructed in this study, the first part presents the corresponding performance metrics. The rest of the section tests and validates the proposed method on the existing dataset through rigorous comparative experiments and conducts the corresponding ablation experiments to verify the feasibility of the proposed method in this paper, as well as its superiority in terms of prediction accuracy, robustness, and capability.

Implementation details

The experiments are based on PyTorch 1.12.0 and the vit-mae codebase. The image processing model is vit-mae-base, which takes inputs at $256 \times 256$ resolution in the pre-training phase. The model is forced to reconstruct the key visual semantics through an 85% masking rate. The time-series prediction models are the base LSTM, RNN, and Transformer. A dynamic position encoding mechanism is introduced in the fine-tuning stage to couple the 768-dimensional feature vectors obtained from pre-training with temporal coordinate information. The experiments run on an NVIDIA A800 GPU cluster, and end-to-end training is carried out with cosine annealing learning rate scheduling (initial value $3e-4$) and the AdamW optimizer; 200 training cycles are used for each dataset to ensure convergence. The evaluation metrics are mean square error (MSE) and mean absolute error (MAE), defined as follows:

$$\begin{aligned} \textrm{MSE}= & \frac{1}{L} \sum _{i=1}^{L}\left( z_{M+i}-\widetilde{z}_{M+i}\right) ^{2} \end{aligned}$$

(13)

$$\begin{aligned} \textrm{MAE}= & \frac{1}{L} \sum _{i=1}^{L}\left\| z_{M+i}-\widetilde{z}_{M+i}\right\| \end{aligned}$$

(14)
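
A minimal sketch of Eqs. (13) and (14) in plain Python follows. It assumes that targets and predictions are paired element by element over the prediction window and that each value is a scalar, which is our reading of the equations. The function names and the toy numbers are illustrative only.

```python
def mse(targets, predictions):
    """Mean squared error over paired target/prediction values."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((z - z_hat) ** 2 for z, z_hat in zip(targets, predictions))
    return total / len(targets)


def mae(targets, predictions):
    """Mean absolute error over paired target/prediction values."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(z - z_hat) for z, z_hat in zip(targets, predictions))
    return total / len(targets)


# Toy values: the per-element errors are -0.5, 0.0, 1.0 and -1.0.
targets = [1.0, 2.0, 3.0, 4.0]
predictions = [1.5, 2.0, 2.0, 5.0]
print(mse(targets, predictions))  # 0.5625
print(mae(targets, predictions))  # 0.625
```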

Both metrics are computed on each prediction window and averaged over the image predictions, with the window rolled over the entire set at stride = 1. The default ViT-MAE pre-training settings are shown in Table 1. The length of the input sequence in the model is 200 and the patch size is 4, i.e., there are 50 patches in total. The Adam optimizer is used with the same strategy as ViT-MAE, and the optimizer configuration was not modified. The settings of the original vit-mae were followed, including the linear learning rate scaling rule: lr = base lr $\times$ batch size / 256. The encoder has 3 layers and the decoder has 1 layer, and pre-training uses a high masking ratio of about 85%.

All original images from the datasets (native resolution $512 \times 512$ pixels) were resized to $256 \times 256$ pixels using bilinear interpolation via torchvision.transforms.Resize(·). This resizing was applied uniformly in both the self-supervised pre-training stage of the ViT-MAE and the subsequent training and evaluation of the temporal prediction pipeline to ensure input consistency. All input images were normalized using the mean and standard deviation of the ImageNet dataset (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]). This preprocessing step, implemented with torchvision.transforms.Normalize(·), was applied consistently to the inputs of all models (including Ours, F-CLSTM⁵⁴, CloudCast⁵⁵, SATcast⁵⁶, and the CNN/Transformer ablation baselines) during both training and inference to ensure a fair comparison. These configurations are identical on all datasets and are not tuned per dataset.
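
The learning rate settings described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the text does not say whether $3e-4$ is the base or the scaled rate (it is treated as the base rate here), the batch size of 512 is a made-up value, the 200 training cycles are read as epochs, and the minimum rate of 0 with no warm-up is assumed. The function names are hypothetical.

```python
import math


def scaled_lr(base_lr, batch_size):
    """Linear scaling rule: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256


def cosine_annealed_lr(initial_lr, epoch, total_epochs, min_lr=0.0):
    """Cosine annealing from initial_lr at epoch 0 to min_lr at total_epochs."""
    progress = epoch / total_epochs
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (initial_lr - min_lr) * cosine


# Hypothetical batch size of 512; 200 epochs as an assumed reading of the text.
lr0 = scaled_lr(base_lr=3e-4, batch_size=512)
schedule = [cosine_annealed_lr(lr0, epoch, total_epochs=200) for epoch in range(201)]
```

In this sketch the rate starts at lr0, passes through roughly half of lr0 at the midpoint, and reaches the minimum rate at the end of training.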

Table 1 The config of ViT-MAE.

Datasets

The selected three cloud map datasets (Fig. 3) are described as follows:

TCDD dataset

The TJNU Cloud Detection Database (TCDD) collects cloud detection data from 2019 to 2020 for nine provinces in China: Tianjin, Anhui, Sichuan, Gansu, Shandong, Hebei, Liaoning, Jiangsu, and Hainan. It contains 2300 ground-based cloud images and their corresponding cloud masks, split into 1874 training images and 426 test images. The cloud images are captured by a vision sensor and stored in PNG format at a resolution of $512 \times 512$ pixels.

Sample dataset

This satellite cloud image dataset contains 2200 training samples and 300 test samples and is used for training super-resolution models.

Rice dataset

This is a remote sensing image cloud-removal dataset consisting of two parts. RICE1 contains 500 pairs of images, each pair comprising a cloudy image and a cloud-free image of size $512 \times 512$. RICE2 contains 450 groups of images, each group containing three $512 \times 512$ images: a cloud-free reference image, a cloudy image, and a cloud mask.

Main results

Comparison of different models

To verify the advantages of the proposed model, systematic comparative experiments are conducted on the three datasets. Seven benchmark models are selected for comparison: F-CLSTM⁵⁴, SATcast⁵⁶, CloudCast⁵⁵, VideoMAE⁵⁷, PredRNN⁵⁸, THItoGene⁵⁹, and VGG-TSwinformer⁶⁰. The experimental results are shown in Table 2. The proposed model achieves the best MSE and MAE, which indicates that combining a time series prediction model with an image processing model is effective. To show the advantages of the proposed model more intuitively, some predicted sample images are visualized and compared with those of F-CLSTM and CloudCast, as shown in Fig. 4.

Table 2 Comparative experimental results of different models.

Prediction accuracy for different prediction lengths

To verify the effectiveness and advantages of the proposed model, the number of predicted images is set to 1, 2, and 3 on the three datasets for comparison experiments. Table 3 shows that the prediction errors of the proposed model remain stable: the MSE ranges from 0.46 to 0.98 and the MAE from 0.52 to 1.71. Notably, the model achieves its best values on both metrics when a single image is predicted. This is consistent with the temporal pattern of real scenes, in which neighboring images in the evolution of a dynamic system are strongly correlated, so a single frame can be predicted accurately from the historical image sequence. This characteristic indicates that the model has good potential for practical application.

Table 3 Prediction results for different prediction lengths.

Ablation studies

Effect of different time series forecasting models

To evaluate the generalizability and effectiveness of our core contribution, the ViT-MAE-based feature extraction and time-series conversion module, we conducted an ablation study that integrates this module into other basic and state-of-the-art prediction models (LSTM³⁶, CNN¹⁷, Transformer¹¹, NOA-LSTM⁶¹, xPatch⁶², and PPDformer⁶³). As shown in Table 4, the proposed module integrates well with different prediction models. This indicates that the module provides a general front-end that combines image and temporal prediction to improve the accuracy of image prediction. The results also show that this strategy of phased feature migration can effectively improve the quality of time-step image generation.

Table 4 Results of different temporal prediction models with the same ViT-MAE feature extraction front-end.

Effect of different masking ratios

To examine how the masking ratio affects model performance, this study sets the prediction model input lengths to 24, 48, 96, and 192 and conducts the corresponding experiments on the TCDD dataset. The experimental results are shown in Fig. 5. An 85% masking ratio is the optimal choice, a finding that contrasts with those of BERT and VideoMAE but is consistent with the experimental trend of MAE models in the image domain. A higher masking ratio forces the model to focus on high-level semantic feature extraction and integration by limiting the number of visible tokens. Notably, when the masking ratio is reduced to 45%, performance drops even though the encoder has access to more raw input information. A possible explanation is that a low masking ratio allows the model to complete the reconstruction by simple local feature interpolation, so that it overfits to low-level details.

These results suggest that an appropriately high masking ratio (85%) can effectively reduce the spatial redundancy of the input data and force the model to establish global semantic associations rather than relying on local cues. However, an extremely high masking ratio (95%) can cause significant information loss, which severely attenuates the training signal and ultimately impairs the model’s understanding of the data distribution. This inverted U-shaped performance curve reflects a balance between information retention and abstraction learning: moderate information loss can facilitate high-level representation learning, but excessive masking degrades semantic modeling capability.
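
To make the effect of the masking ratio concrete, the following sketch draws a random mask over patch indices and reports how many patches remain visible to the encoder. It is a simplified, MAE-style illustration rather than the code used in the experiments: the function name random_mask, the count of 256 patches per image, and the fixed seed are all assumptions.

```python
import random


def random_mask(num_patches, mask_ratio, seed=None):
    """Return (visible, masked) patch indices for one image.

    A random permutation of the patch indices is drawn; the first
    num_keep indices stay visible and the remaining ones are masked.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * (1.0 - mask_ratio))
    return sorted(order[:num_keep]), sorted(order[num_keep:])


# Compare the three ratios discussed in the text on a hypothetical 256-patch image.
for ratio in (0.45, 0.85, 0.95):
    visible, masked = random_mask(num_patches=256, mask_ratio=ratio, seed=0)
    print(f"ratio={ratio:.2f}: {len(visible)} visible, {len(masked)} masked")
```

Under these assumptions, an 85% ratio leaves the encoder only a few dozen visible patches per image, which is the sense in which a high ratio limits the number of visible tokens.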

Effect of different input lengths

This experiment examines the relationship between input window length and forecasting performance. The experimental results are shown in Table 5. The model accuracy first increases and then decreases with the input sequence length, with the best input length being 96. When the input length reaches 192, the MAE and MSE on the test set degrade by 36.4% and 49.3%, respectively. This phenomenon is attributed mainly to the multi-scale periodic features (daily, weekly, and seasonal cycles, etc.) implied in long time-series data, which are difficult for the standard attention mechanism to decouple effectively. As a result, the model falls into a local optimum and fails to establish cross-period correlations.
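
The role of the input window length can be illustrated with a small helper that cuts a sequence into (input, target) pairs. This is a generic sliding-window sketch, not the data loader used in the paper; the function name, the horizon parameter, and the toy integer sequence standing in for feature vectors are assumptions.

```python
def sliding_windows(sequence, input_length, horizon=1, stride=1):
    """Split a sequence into (input window, target window) pairs."""
    if input_length < 1 or horizon < 1 or stride < 1:
        raise ValueError("input_length, horizon and stride must be positive")
    pairs = []
    last_start = len(sequence) - input_length - horizon
    for start in range(0, last_start + 1, stride):
        split = start + input_length
        pairs.append((sequence[start:split], sequence[split:split + horizon]))
    return pairs


# Toy sequence of 10 steps: 4 input steps are used to predict the next 2.
series = list(range(10))
pairs = sliding_windows(series, input_length=4, horizon=2)
print(len(pairs))  # 5
print(pairs[0])    # ([0, 1, 2, 3], [4, 5])
```

For a fixed sequence, a longer input window gives the model more history per sample but also yields fewer training pairs, which is one practical aspect of the trade-off explored in this ablation.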

Table 5 Prediction results for different input lengths.

Evaluation of generalization ability

Cross-Dataset Performance Consistency: The proposed model was evaluated on three distinct cloud image datasets (TCDD, Sample, and Rice) with different characteristics (ground-based vs. satellite, different geographical origins). The consistent superiority of our model over baselines across all these datasets is a strong indicator of its robustness and generalizability. The model is not overfitting to the specific nuances of a single dataset but is learning a generalizable representation for cloud image prediction.

Performance Under Different Conditions: The model’s performance under varying prediction lengths has been further analyzed as shown in Table 3. The stable and predictable degradation pattern as the prediction horizon extends (i.e., error increases gradually rather than catastrophically) demonstrates that the proposed model has learned a reasonable temporal dynamic that generalizes across different forecasting scenarios.

Impact of patch size

The patch size P is a critical hyperparameter in this framework, governing a fundamental trade-off between spatial granularity and temporal sequence length. A smaller P yields a longer sequence of finer-grained image patches, potentially capturing more detail at the cost of increased computational complexity and longer-range dependencies for the LSTM to model. A larger P results in a shorter sequence of coarser-grained patches, which reduces the computational load but risks losing vital spatial information. To identify the most suitable patch size for this model, ablation studies were carried out with the same hyperparameter settings as the main experiment. As shown in Table 6, a medium patch size, which generates a sequence of moderate length while preserving adequate spatial information, is essential for achieving high prediction accuracy. These results support the choice of $P =16$ for the main experiments.
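
The trade-off between patch size and sequence length is simple arithmetic, sketched below for a $256 \times 256$ RGB image. The candidate patch sizes in the loop are hypothetical and are not necessarily those reported in Table 6; the function name patch_grid is likewise an assumption.

```python
def patch_grid(height, width, p, channels=3):
    """Return (number of patches, values per flattened patch)."""
    if height % p or width % p:
        raise ValueError("height and width must be divisible by p")
    num_patches = (height // p) * (width // p)
    values_per_patch = channels * p * p
    return num_patches, values_per_patch


# Hypothetical candidate patch sizes for a 256x256 RGB image.
for p in (4, 8, 16, 32):
    n, d = patch_grid(256, 256, p)
    print(f"P={p:>2}: {n:>4} patches, {d:>4} values per patch")
```

For $P = 16$ this gives 256 patches of $3 \times 16^{2} = 768$ values each; halving P quadruples the number of patches, while doubling it cuts the number to a quarter.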

Table 6 Ablation study on the impact of different patch sizes. The experiment was conducted on the TCDD dataset. The best results are highlighted in bold.

Conclusions

Research on image temporal prediction has not yet formed a complete system. For this reason, this paper constructs a visual prediction framework based on the LSTM recurrent neural network. The architecture is mainly composed of a feature learning module and a temporal prediction module, and its implementation contains three key phases. The first phase is feature learning: the original images are used for self-supervised pre-training through the mask reconstruction mechanism of ViT-MAE, which extracts the spatial semantic features of the image. The process uses a random masking strategy to mask 85% of the image blocks, forcing the model to learn the latent spatial correlations. The second phase is time-series conversion: the acquired image feature matrix is unfolded along the spatial dimensions into a multi-dimensional time series dataset with spatio-temporal correlation. This transformation maintains the spatial topology of the feature vectors and ensures an invertible mapping from the temporal dimension to the spatial dimension. The third phase is prediction and reconstruction: the LSTM network models the temporal dependencies and captures long-term dependency patterns through its gating mechanism. The predicted time series is converted into an image block matrix by inverse spatial reconstruction and finally assembled into a complete predicted image. To demonstrate the effectiveness of the method, performance and ablation experiments are conducted on three cloud image datasets. The results show that the model achieves better MSE and MAE than the compared models, which indicates that the proposed model is feasible and effective.

Although the proposed linear projection is effective, some granular spatial details might be lost in converting high-dimensional image patches into a lower-dimensional temporal sequence, which could limit the prediction fidelity for very fine-grained structures. In addition, the model’s performance may degrade on sequences exhibiting extreme, unpredictable changes (e.g., rapidly forming storm cells) that significantly deviate from the patterns learned during training. The autoregressive prediction loop could amplify errors in such scenarios.

In future work, we will investigate these limitations and explore several optimization directions, such as replacing the image processing model with a stronger one, replacing the time series model with a more stable and efficient one, and improving the method of converting image patches to time series data.

Data availability

The datasets used and analysed during the current study are available from the corresponding author on reasonable request.

References

1. Toivonen, E. & Räsänen, E. Time-series analysis approach to the characteristics and correlations of wastewater variables measured in paper industry. J. Water Process. Eng. 61, 105231 (2024).
2. Song, R., Wang, Z., Guo, L., Zhao, F. & Xu, Z. Deep belief networks (dbn) for financial time series analysis and market trends prediction. World J. Eng. 7, 1–10 (2024).
3. Yin, X. et al. Deep learning on traffic prediction: Methods, analysis, and future directions. IEEE Trans. Intell. Transp. Syst. 23, 4927–4943 (2021).
4. Zhuang, D. et al. Data-driven predictive control for smart hvac system in iot-integrated buildings with time-series forecasting and reinforcement learning. Appl. Energy 338, 120936 (2023).
5. Zhou, B. et al. Semantic-aware event link reasoning over industrial knowledge graph embedding time series data. Int. J. Prod. Res. 61, 4117–4134 (2023).
6. Du, L., Gao, R., Suganthan, P. N. & Wang, D. Z. Bayesian optimization based dynamic ensemble for time series forecasting. Inf. Sci. 591, 155–175 (2022).
7. Almutairi, M. S., Almutairi, K. & Chiroma, H. Hybrid of deep recurrent network and long short term memory for rear-end collision detection in fog based internet of vehicles. Expert Syst. Appl. 213, 119033 (2023).
8. Zou, J., Lou, J., Wang, B. & Liu, S. A novel deep reinforcement learning based automated stock trading system using cascaded lstm networks. Expert Syst. Appl. 242, 122801 (2024).
9. Amjad, F., Korotko, T. & Rosin, A. Forecasting pv energy generation using transformer-based architectures: A comparative study of lag-llama, tft, and deepar. IEEE 1–6 (2024).
10. Huang, Y., Zhou, C., Cui, K. & Lu, X. A multi-agent reinforcement learning framework for optimizing financial trading strategies based on timesnet. Expert Syst. Appl. 237, 121502 (2024).
11. Chen, P. et al. Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting. arXiv preprint arXiv:2402.05956 (2024).
12. Li, B. et al. Difformer: Multi-resolutional differencing transformer with dynamic ranging for time series analysis. IEEE Trans. Pattern Anal. 45, 13586–13598 (2023).
13. Saidani, T. et al. Design and implementation of a real-time image processing system based on sobel edge detection using model-based design methods. Int. J. Adv. Comput. Sci. 15, 150328 (2024).
14. Song, Y., Li, C., Xiao, S., Zhou, Q. & Xiao, H. A parallel canny edge detection algorithm based on opencl acceleration. PLoS ONE 19, 0292345 (2024).
15. Zhang, H. et al. Contributions of fourier-transform infrared spectroscopy technologies to the research of asphalt materials: A comprehensive review. Fuel 371, 132078 (2024).
16. Zhou, Y., Guan, W., Sun, Q., Zou, X. & He, Z. Effect of multi-scale rough surfaces on oil-phase trapping in fractures: Pore-scale modeling accelerated by wavelet decomposition. Comput. Geotech. 179, 106951 (2025).
17. Razavi, M., Mavaddati, S. & Koohi, H. Resnet deep models and transfer learning technique for classification and quality detection of rice cultivars. Expert Syst. Appl. 247, 123276 (2024).
18. Zhang, H., Yin, Z.-Y., Zhang, N., Wang, X. & Ding, Z. A scale-adaptive mask r-cnn strategy for foreground particle segmentation and geometrical analysis of granular aggregates. Appl. Soft Comput. 164, 111931 (2024).
19. Wang, Z. et al. Raman spectrum model transfer method based on cycle-gan. Spectrochim Acta A. 304, 123416 (2024).
20. Li, H., Cheng, L. & Liu, J. A new degradation model and an improved srgan for multi-image super-resolution reconstruction. Imaging Sci J. 73, 150–169 (2025).
21. Kim, G.-I. & Chung, K. Vit-based multi-scale classification using digital signal processing and image transformation. IEEE Access. 12, 58625–58638 (2024).
22. Pinasthika, K., Laksono, B. S. P., Irsal, R. B. P., Shabiyya, S. & Yudistira, N. Sparseswin: Swin transformer with sparse transformer block. Neurocomputing 580, 127433 (2024).
23. Kumar Lilhore, U. et al. A precise model for skin cancer diagnosis using hybrid u-net and improved mobilenet-v3 with hyperparameters optimization. Sci. Rep. 14, 4299 (2024).
24. Zhang, Y. et al. Qqq: Quality quattuor-bit quantization for large language models. arXiv preprint arXiv:2406.09904 (2024).
25. Chen, D., Hu, J., Wei, X. & Wu, E. Denoising with a joint-embedding predictive architecture. arXiv preprint arXiv:2410.03755 (2024).
26. Fırıldak, K., Çelik, G. & Talu, M. F. Simclr-based self-supervised learning approach for limited brain mri and unlabeled images. BEFD. 13, 1304–1313 (2024).
27. Zhang, Z., Yu, L., Liang, X., Zhao, W. & Xing, L. Transct: dual-path transformer for low dose computed tomography. MICCAI 55–64 (2021).
28. Shi, P., Dong, X., Ge, R., Liu, Z. & Yang, A. Dp-m3d: Monocular 3d object detection algorithm with depth perception capability. Knowl.-Based Syst. 318, 113539 (2025).
29. Dong, X., Shi, P., Liang, T. & Yang, A. Ctaffnet: Cnn-transformer adaptive feature fusion object detection algorithm for complex traffic scenarios. Transp. Res. Rec. 2679, 1947–1965 (2025).
30. Dong, X., Shi, P., Qi, H., Yang, A. & Liang, T. Ts-bev: Bev object detection algorithm based on temporal-spatial feature fusion. Displays 84, 102814 (2024).
31. Wang, A. et al. Yolov10: Real-time end-to-end object detection. NeurIPS 37, 107984–108011 (2024).
32. Qin, L. et al. Large language models meet nlp: A survey. arXiv preprint arXiv:2405.12819 (2024).
33. Voulodimos, A., Doulamis, N., Doulamis, A. & Protopapadakis, E. Deep learning for computer vision: A brief review. Comput. Intell. Neurosci. 2018, 7068349 (2018).
34. Guo, C., Mu, Y., Javed, M. G., Wang, S. & Cheng, L. Momask: Generative masked modeling of 3d human motions. CVPR 1900–1910 (2024).
35. Majmundar, K., Goyal, S., Netrapalli, P. & Jain, P. Met: Masked encoding for tabular data. arXiv preprint arXiv:2206.08564 (2022).
36. Yu, Y., Si, X., Hu, C. & Zhang, J. A review of recurrent neural networks: Lstm cells and network architectures. Neural Comput. 31, 1235–1270 (2019).
37. Dudukcu, H. V., Taskiran, M., Taskiran, Z. G. C. & Yildirim, T. Temporal convolutional networks with rnn approach for chaotic time series prediction. Appl. Soft Comput. 133, 109945 (2023).
38. Beck, M. et al. xlstm: Extended long short-term memory. arXiv preprint arXiv:2405.04517 (2024).
39. Tang, S., Li, C., Zhang, P. & Tang, R. Swinlstm: Improving spatiotemporal prediction accuracy using swin transformer and lstm. ICCV 13470–13479 (2023).
40. Qin, Z., Cen, C. & Guo, X. Prediction of air quality based on knn-lstm. JPCS. 1237, 042030 (2019).
41. Maurício, J., Domingues, I. & Bernardino, J. Comparing vision transformers and convolutional neural networks for image classification: A literature review. Appl. Sci. 13, 5521 (2023).
42. Dong, A., Wang, R., Zhang, X. & Liu, J. Nextseg: Automatic covid-19 lung infection segmentation from ct images based on next-vit. IJCNN 1–8 (2024).
43. Younes, R., Yaacoub, C., El Khoury, G., Possik, J. & Daou, R. A. Z. Evaluation of point-mae for robust point cloud classification across diverse datasets. IC2SPM 68–73 (2024).
44. Jiang, R., Zhang, Y., Wang, L., Yu, P. & Guo, Y. Aiqvit: Architecture-informed post-training quantization for vision transformers. arXiv preprint arXiv:2502.04628 (2025).
45. Simon, T. A. & Patel, R. C. Molecular mechanisms in dyt-prkra: Pathways regulated by pkr activator protein pact. Dystonia 4, 14224 (2025).
46. Huang, Y. et al. Eomt: A master-slave task scheduling strategy for grid environment. HPCC 226–233 (2008).
47. Gao, T. et al. Bhvit: Binarized hybrid vision transformer. arXiv preprint arXiv:2503.02394 (2025).
48. Bao, H., Dong, L., Piao, S. & Wei, F. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021).
49. Yu, J. et al. Temporal-informative adapters in videomae v2 and multi-scale feature fusion for micro-expression spotting-then-recognize. ACM MM 11484–11489 (2024).
50. Lajić, R., Divnić, D., Risojević, V. & Mirjanić, D. Generative neural network models for synthetic solar irradiance sequences. J. Renew. Sustain Energy 16, 053501 (2024).
51. Lin, J. & Wang, Y.-G. Tsformer: Tracking structure transformer for image inpainting. ACM Trans. Multim. Comput. 20, 1–23 (2024).
52. Chen, M. et al. Visionts: Visual masked autoencoders are free-lunch zero-shot time series forecasters. arXiv preprint arXiv:2408.17253 (2024).
53. Tang, P. & Zhang, X. Mtsmae: Masked autoencoders for multivariate time-series forecasting. ICTAI 982–989 (2022).
54. Tan, C., Feng, X., Long, J. & Geng, L. Forecast-clstm: A new convolutional lstm network for cloudage nowcasting. VCIP 1–4 (2018).
55. Partio, M., Hieta, L. & Kokkonen, A. Cloudcast–total cloud cover nowcasting with machine learning. arXiv preprint arXiv:2410.21329 (2024).
56. Chen, H. et al. Skillful nowcasting of convective clouds with a cascade diffusion model. arXiv preprint arXiv:2502.10957 (2025).
57. Tong, Z., Song, Y., Wang, J. & Wang, L. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. NeurIPS. 35, 10078–10093 (2022).
58. Wang, Y. et al. Predrnn: A recurrent neural network for spatiotemporal predictive learning. IEEE TPAM. 45, 2208–2225 (2022).
59. Jia, Y., Liu, J., Chen, L., Zhao, T. & Wang, Y. Thitogene: A deep learning method for predicting spatial transcriptomics from histological images. Brief Bioinform. 25, 464 (2023).
60. Hu, Z., Wang, Z., Jin, Y. & Hou, W. Vgg-tswinformer: Transformer-based deep learning model for early Alzheimer’s disease prediction. CMPB. 229, 107291 (2023).
61. Yadav, H. & Thakkar, A. Noa-lstm: An efficient lstm cell architecture for time series forecasting. Expert Syst. Appl. 238, 122333 (2024).
62. Stitsyuk, A. & Choi, J. xpatch: Dual-stream time series forecasting with exponential seasonal-trend decomposition. Proc. AAAI Conf. Artif. Intell. 39, 20601–20609 (2025).
63. Wan, M. et al. Ppdformer: Channel-specific periodic patch division for time series forecasting. ICASSP 1–5 (2025).

Acknowledgements

This work was funded by the National Natural Science Foundation of China under Grant 62266046, Innovation Project of the Research Platform at Yili Normal University under Grant CXZK2021027, and the Natural Science Foundation of Jilin Provincial Department of Education, China, under Grant JJKH20251304KJ.

Author information

Authors and Affiliations

School of Network Security and Information Technology, YiLi Normal University, Yining, 835000, China
Huxidan Jumahong, Abuduwaili Aili & Weina Wang
Yili Key Laboratory of Intelligent Computing Research and Application, YiLi Normal University, Yining, 835000, China
Huxidan Jumahong
School of Science, Jilin University of Chemical Technology, Jilin, 132022, China
Yongjie Wang & Weina Wang

Contributions

All the authors contributed extensively to the manuscript. H.J. wrote the main manuscript, and helped with the formatting review and editing of the paper. Y.W. designed the experiments and wrote the main manuscript. A.A. and W.W. reviewed and edited the original document.

Corresponding author

Correspondence to Weina Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Jumahong, H., Wang, Y., Aili, A. et al. Visual prediction method based on time series-driven LSTM model. Sci Rep 15, 38057 (2025). https://doi.org/10.1038/s41598-025-21911-9

Received: 07 July 2025
Accepted: 24 September 2025
Published: 30 October 2025
Version of record: 30 October 2025
DOI: https://doi.org/10.1038/s41598-025-21911-9
