Introduction

A time series is a sequence of observations recorded in chronological order at fixed or variable intervals. The main characteristics of a time series1 are trend, seasonality, autocorrelation, and stationarity: respectively, the upward or downward movement of the data over a longer time scale, periodic and regular fluctuations in the data, the correlation between the current observation and its past values, and statistical properties (mean, variance) that remain constant over time. Time series forecasting aims to mine information for modeling and prediction and to provide reliable decision support through efficient forecasting models. As an important data mining technique, it plays a crucial role in real applications. In the financial field2, correctly predicting stock price fluctuations can help investors develop effective investment strategies. In the transportation field3, accurately predicting changes in traffic flow can help urban planners improve traffic management systems. It is also applied in other fields such as environmental research4 and industrial processes5. The accuracy and reliability of time series forecasts are therefore crucial for efficient decision-making and resource utilization. With the continuous advancement of data collection and storage technologies, ever larger volumes of time series data are being collected, which makes forecasting both more important and more challenging. To address these challenges, forecasting models such as statistical models6, machine learning models7, and deep learning models8 have been proposed. DeepAR9 uses an RNN with an autoregressive scheme to predict short-term future sequences. TimesNet10 transforms the original one-dimensional time series into a two-dimensional space and captures multi-periodic features through convolution. Pathformer11 achieves a more complete time-series characterization through adaptive multi-scale path design. DifFormer12 incorporates multi-resolution differencing mechanisms to construct a general prediction framework. These models show superior performance on several benchmarks, but there is still room for improvement in modeling long-range dependencies.

As a key interdisciplinary subject connecting computer vision and artificial intelligence, digital image processing has evolved from classical algorithms to learning-based paradigms, and its technical scope and application boundaries continue to expand. Traditional methods are centered on spatial and frequency domain analysis: suppressing noise through spatial domain operations such as median filtering, extracting contour features with edge detection algorithms such as Sobel13 and Canny14, and characterizing texture with the Fourier Transform15 and wavelet decomposition16. Although hand-designed features dominated early target detection, their generalization ability is limited by manual prior knowledge, which makes it difficult to cope with complex scenes. The rise of deep learning has reshaped the technological paradigm of image processing. Convolutional Neural Networks (CNNs) represented by ResNet17 and Mask R-CNN18 have achieved breakthroughs in image classification and semantic segmentation through hierarchical local feature learning. Generative Adversarial Networks (GANs) have opened new paths in image synthesis: CycleGAN19 enables cross-domain style transfer, and SRGAN20 improves the quality of super-resolution reconstruction. Vision Transformer (ViT)21 captures global context through the self-attention mechanism, and variants such as Swin Transformer22 show superior performance in medical image segmentation. For lightweight architectures, MobileNet23 and the W4A8 quantization scheme24 have significantly improved the efficiency of real-time inference on edge devices and promoted the deployment of industrial-grade applications. Key advances have also been made in multimodal fusion and data efficiency. D-JEPA25 fuses diffusion models with masked image modeling, balancing the quality and efficiency of generation.

Self-supervised frameworks such as SimCLR26 and few-shot learning algorithms alleviate the dependence on labeled data. These advances have enabled transformative applications in multiple fields. The Transformer-based TransCT27 model reduces radiation dose by 80% while preserving diagnostic details. Dp-M3D28 uses FEDC to leverage depth confidence guidance when determining the deformation offset of the 3D bounding box, aligning features more effectively and improving depth perception. Its Dp-NMS refines the selection process by using the product of classification confidence and depth confidence, so that candidate boxes are ranked effectively and the most suitable detection box is retained. CTAFFNet29 is built around a Local–Global Feature Fusion unit known as the Convolutional Transformation Adaptive Fusion Kernel (CTAFFK). The CTAFFK kernel uses two branches, a CNN and a Transformer, to extract local and global features from the image, and adaptively fuses the features from both branches. TS-BEV30 replaces the previous multi-frame sampling method with cyclic propagation of historical frame instance information. Its temporal-spatial feature fusion attention module integrates temporal information and spatial features and improves inference and training speed. An industrial defect detection system driven by YOLOv1031 achieves a 99.2% recall rate. GAN technology supports the high-fidelity restoration of the Dunhuang murals, promoting the digital protection of cultural heritage. With the rapid development of deep learning in recent years, training models are expected to accommodate hundreds of millions of labeled data points. The need for large-scale data processing has been addressed by self-supervised pre-training in natural language processing (NLP)32 and computer vision (CV)33. Most of these solutions are based on masked modeling, such as masked language modeling in NLP or masked image modeling in CV. These methods first mask parts of the original data and then learn to recover them. Masked modeling34 requires the model to infer the removed parts from contextual information, which lets it learn deeper semantics, and it has become a benchmark for self-supervised pre-training in both NLP and CV. Masked pre-training has been shown to work well for a variety of downstream tasks, and one of the simpler and more effective approaches is the Masked Auto-Encoder (MAE)35.

Based on the above analysis, three conclusions can be drawn. 1. Static visual models dominate and lack dynamic modeling capability. Mainstream models such as ResNet focus on classification and detection in static images, ignore the temporal correlation between images, and lack a mechanism for modeling temporal continuity and inter-frame evolution, which makes them unsuitable for direct use in future frame prediction. 2. Time series models are difficult to use directly for image modeling. Most time series models are designed for one-dimensional numerical sequences and cannot handle high-dimensional image data. 3. Existing time series prediction models are mainly oriented to low-dimensional structured data and are not suitable for high-dimensional inputs such as image frames. Moreover, current mainstream visual computing models generally focus on perceptual tasks such as static image classification and target detection (e.g., the YOLO architecture) and have not been applied to visual prediction. To overcome the limitations of existing algorithms in image temporal inference, this paper constructs a visual prediction framework based on the LSTM36 model, which realizes generative prediction of future single-frame or multi-frame images by analyzing the spatiotemporal feature evolution of historical frame sequences. The main contributions of this paper are as follows:

  1. The ViT image feature extraction module is constructed based on the masked autoencoding paradigm. The module randomly masks image blocks at a high masking rate and trains the model to reconstruct the missing blocks. This mechanism makes the model learn the contextual dependencies between image blocks and generate more discriminative feature representations.

  2. A progressive reconstruction framework based on sequence-image co-transformation is designed to enhance feature acquisition and improve prediction performance. The features and local-global semantic associations of the image are extracted and converted into a time series by the row-first sequence reconstruction method, and the predicted time series is then projected back to the high-dimensional semantic space through the inverse mapping mechanism.

  3. A visual prediction method driven by time series forecasting is proposed, which transforms the image prediction problem into a time series prediction problem. To the best of our knowledge, this is the first time that time series analysis and image processing methods have been combined to solve the image prediction problem.

By integrating ViT-MAE with LSTM, the model addresses the limitations of both static models and classical time-series approaches and offers a unified predictive framework. ViT-MAE employs a high masking ratio for representation learning, forcing the model to learn robust, contextual features through global reasoning rather than local textures. Feeding these features to the LSTM enables explicit modeling of temporal continuity and inter-frame evolution, capabilities that are absent in static models. Furthermore, by converting 2D feature maps into 1D sequences in row-major order with positional encoding, the method transforms high-dimensional image prediction into a format amenable to LSTMs while preserving spatial topology and ensuring reversibility. Unlike pooling or flattening operations, which discard structural consistency, the sequence representation preserves patch relationships across time, enabling accurate image reconstruction from predicted sequences. Finally, the proposed framework combines state-of-the-art spatial feature learning with powerful temporal dynamics modeling, providing a generalizable blueprint for image prediction tasks.

Related work

LSTM-based time series forecasting

As an important variant of recurrent neural networks, LSTM36 has demonstrated excellent stability and modeling capability for time series forecasting, and its core strength is the effective capture of long-range dependencies. The model achieves this through a memory cell architecture regulated by three parameterized gates working in concert. The memory cell \(c_t\) is essentially a self-updating information container whose update follows the gating logic. When new input enters the system, the input gate determines how much of the current information is admitted, the forget gate dynamically adjusts how much of the historical memory \(c_{t-1}\) is retained, and the output gate filters the information delivered to the hidden state \(h_t\).
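The gating logic described above can be written out explicitly. The block below uses the standard LSTM formulation rather than equations taken from this paper; the weight names \(W_{\ast}\), \(U_{\ast}\), \(b_{\ast}\) and the candidate symbol \(g_t\) are notational assumptions introduced only for illustration.

```latex
\begin{aligned}
i_t &= \sigma\left( W_i x_t + U_i h_{t-1} + b_i \right) && \text{(input gate)} \\
f_t &= \sigma\left( W_f x_t + U_f h_{t-1} + b_f \right) && \text{(forget gate)} \\
o_t &= \sigma\left( W_o x_t + U_o h_{t-1} + b_o \right) && \text{(output gate)} \\
g_t &= \tanh\left( W_g x_t + U_g h_{t-1} + b_g \right) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh\left( c_t \right)
\end{aligned}
```

The additive form of the \(c_t\) update is what allows information, and gradients, to persist across many steps when \(f_t\) stays close to 1.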

This triple-gating mechanism not only filters information over time, but also mitigates the vanishing gradient problem of traditional RNN37 models through the Constant Error Carousel. Compared with the basic RNN model, LSTM decouples information storage from information delivery. The memory cell acts as an independent data carrier and accumulates or forgets information through the differential adjustment of the gating parameters. This design ensures that key information is retained over long spans and gives the model the flexibility to dynamically adjust its memory horizon. Research practice has repeatedly verified that this gated memory architecture shows significant advantages in long-sequence tasks such as speech recognition and machine translation. As a result, a growing number of LSTM variants have been proposed. xLSTM38 addresses the traditional LSTM's inability to dynamically adjust storage decisions, limited storage capacity, and insufficient parallelism by means of sLSTM (scalar memory + memory mixing) and mLSTM (matrix memory + covariance update rule), together with exponential gating and matrix memory. SwinLSTM39 combines the global self-attention mechanism of Swin Transformer with the temporal modeling capability of LSTM in a novel recurrent unit for spatiotemporal prediction tasks: Swin Transformer blocks capture global spatial dependencies, and the LSTM handles temporal dynamics. KNN-LSTM40 replaces the linear transformation layer in the traditional LSTM with the KAN network and enhances the model's ability to capture complex temporal patterns by combining nonlinear functions, which suits high-frequency financial data and non-smooth sequence prediction.

LSTM is widely used in natural language processing and speech recognition due to its superior sequence modeling ability, and its gating structure effectively alleviates the long-range dependency problem. However, there is still an obvious bottleneck in image sequence modeling tasks. On the one hand, LSTM has difficulty capturing the complex spatial structure inside an image and lacks the ability to perceive local semantics and global context. On the other hand, the step-by-step dependency of the traditional LSTM when processing sequence data limits its parallel processing efficiency in high-dimensional image scenarios. Although existing variants try to integrate spatial information (e.g., SwinLSTM) or enhance the memory mechanism (e.g., xLSTM), they still fail to achieve efficient, end-to-end image predictive modeling.

Vision transformer-based image processing

Vision Transformer (ViT)21 is a breakthrough model that applies the Transformer architecture from natural language processing to computer vision tasks. The core idea is to split an image into fixed-size image blocks (patches) and feed these blocks into the Transformer encoder as a sequence, modeling global dependencies through a self-attention mechanism. ViT and its variants have been widely used in many domains, such as image classification41, target detection and segmentation (Next-ViT42), point cloud and 3D data processing (PointMAE43), and multimodal and edge computing (AIQViT44). A growing number of ViT-based models have been proposed. DyT45 reduces training memory consumption and accelerates inference by replacing layer normalization with Dynamic Tanh. EoMT46 reuses the ViT encoder for image segmentation and proposes a "mask annealing strategy", which enables masked attention at the beginning of training and gradually transitions to mask-free inference, significantly improving efficiency. BHViT47 is a binarized ViT for edge devices that enhances local information interactions through a hybrid convolution-transformer architecture (MSDGC module) and introduces a quantized decomposition attention matrix and a regularized loss function to reduce binarization noise.

Although Vision Transformer performs well in static tasks such as image classification, native ViT and its variants generally lack temporal modeling capabilities for image sequences. When dealing with tasks such as image prediction, such models are unable to model the evolutionary patterns between frames, and it is difficult to ensure the structural consistency and contextual reasoning ability between image blocks. Meanwhile, their high training and inference costs limit their application in high-frequency image sequence generation tasks.

Masked autoencoder in time series and image processing

Masked modeling paradigms have demonstrated powerful representation learning capabilities in natural language processing. The classical approach represented by BERT constructs efficient linguistic representations for downstream tasks by randomly masking part of the input sequence and reconstructing the masked content in the pre-training phase. This self-supervised learning idea is gradually expanding to multimodal data. In computer vision, Context Encoders pioneered image mask modeling by using convolutional neural networks to reconstruct locally occluded regions. Inspired by the success of Transformer in NLP, approaches such as BeiT48 and ViT combine the visual Transformer architecture with pixel prediction tasks. The Masked Auto-Encoder (MAE) further proposes a high-ratio masking strategy that feeds only a small number of visible image blocks to the encoder, effectively enhancing the feature extraction capability of the model. The MAE paradigm exhibits strong generalization ability, and the subsequent derivations PointMAE43 and VideoMAE49 validate its effectiveness in 3D point cloud and video understanding, respectively. For time series data, ExtraMAE50 combines recurrent neural networks with a dynamic masking mechanism, which significantly improves the efficiency of time series modeling. For complex spatiotemporal data, TSformer51 constructs a mid-layer representation supporting graph neural networks through a joint spatiotemporal masking strategy. These cross-domain practices show that reasonably designed masking strategies with adapted model architectures can effectively mine the intrinsic associations of different modal data and provide a unified learning framework for multimodal intelligent systems.
Although masked autoencoders have achieved remarkable results in computer vision, their core modeling mechanism is still mainly oriented to static image structure and lacks effective modeling of the temporal evolution of images. Meanwhile, the high masking rate strategy faces an information recovery bottleneck on complex images, and existing methods fail to build a unified framework for structural guidance and temporal modeling. LSTM, Vision Transformer, and masked language modeling have been combined to form new processing models with good results. VisionTS52 combines time series with image processing. MTSMAE53 combines time series with a masked autoencoder. However, these methods have not been unified in a single model, nor have they been applied to visual image prediction.

Therefore, to further expand the application boundaries of the mask modeling paradigm, it is urgent to combine the masking mechanism with the temporal modeling structure to achieve a cross-dimensional breakthrough from image reconstruction to image prediction. In this paper, we propose a generative visual prediction framework that integrates the image and time series prediction methods, which can comprehensively improve the spatio-temporal reasoning and generative capability of the model from feature extraction and sequence modeling to image reconstruction.

Methods

Overall framework

In this paper, a visual prediction framework based on the LSTM36 recurrent neural network is proposed. The prediction network consists mainly of the ViT image feature extraction module, the time series data construction module, and the LSTM prediction module, as shown in Fig. 1. To predict an image, the ViT feature extraction module first analyzes the image and extracts features by performing random masking and reconstruction. The extracted features are then fed into the time series construction module to generate data in a form the LSTM accepts. Finally, the LSTM predicts the time series data, and the predicted time series data are transformed into the predicted image.

Fig. 1 Visual prediction framework.

ViT image feature extraction module

ViT requires a large amount of training data and relies on labeled datasets. Labeling incurs high labor costs, and labeled data are often harder to obtain than unlabeled data. In NLP, pre-training BERT models on massive unlabeled corpora has achieved great success, and unsupervised learning from unlabeled data has become an active research direction. Therefore, the unsupervised ViT-MAE model is used to extract image features, as shown in Fig. 2. The method guides the training of ViT through the backbone obtained from unsupervised learning, thus reducing the amount of training data ViT needs and helping it learn a deeper representation. The specific method is as follows:

Fig. 2 ViT-MAE.

Image patch and mask representation

Let the input image be \(x \in R^{H \times W \times 3}\), where H is the image height, W is the image width, and 3 is the number of RGB channels. The image is split into \(N=(H / P) \times (W / P)\) non-overlapping square patches of size \(P\times P\), where P is 16, and each patch is flattened into a one-dimensional vector \(x_{i} \in R^{3P^{2} }\). The masking ratio \(\rho\) is set to 85%, i.e., \(M=\rho N\) patches are masked. The randomly sampled patch indices form the masked set \(\omega \subset \left\{ 1, ..., N \right\}\) with \(\left| \omega \right| = M\), and the visible set is \(v = \left\{ 1,..., N \right\} \setminus \omega\).
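The patch splitting and random masking above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `patchify` and `sample_mask` are hypothetical, indices are 0-based, and rounding \(\rho N\) to the nearest integer is an assumption, since the text does not say how a non-integer \(M\) is handled.

```python
import numpy as np

def patchify(x, P):
    """Split an (H, W, 3) image into N = (H/P)*(W/P) flattened patches, row-major."""
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by P"
    x = x.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/P, W/P, P, P, C)
    return x.reshape(-1, P * P * C)         # (N, 3*P^2)

def sample_mask(N, rho, rng):
    """Return sorted masked indices (omega) and visible indices (v)."""
    M = int(round(rho * N))                 # assumption: round to nearest integer
    perm = rng.permutation(N)
    return np.sort(perm[:M]), np.sort(perm[M:])

rng = np.random.default_rng(0)
x = rng.random((256, 256, 3))               # synthetic image, not real data
patches = patchify(x, P=16)
omega, v = sample_mask(len(patches), rho=0.85, rng=rng)
```

With a 256 × 256 input and P = 16 this gives 256 patches of length 768, of which only 38 remain visible to the encoder under this rounding choice.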

Patch embedding

Map all patches \(x_{i}\) to vectors of dimension d through a linear layer with position encoding:

$$\begin{aligned} e_{i} =W_{proj} vec\left( x_{i} \right) +P_{i} \end{aligned}$$
(1)

where \(vec\left( x_{i} \right) \in R^{3P^{2} }\) is the flattened pixel vector, \(W_{proj}\in R^{d\times \left( 3P^{2} \right) }\) is a learnable linear mapping, and \(P_{i } \in R^{d}\) is the position embedding vector, which encodes the patch position with sine-cosine functions.
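As a rough illustration of Eq. (1), the sketch below projects flattened patches and adds a fixed sine-cosine position embedding. It assumes a 1D sine-cosine encoding over the flattened patch index; the text does not specify whether a 1D or 2D variant is used, and the function names and the random `W_proj` are placeholders rather than trained parameters.

```python
import numpy as np

def sincos_position(N, d):
    """Fixed 1D sine-cosine position embedding of shape (N, d); d must be even."""
    assert d % 2 == 0
    pos = np.arange(N)[:, None]                        # (N, 1)
    freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))   # (d/2,)
    pe = np.zeros((N, d))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

def embed(patches, W_proj, pos):
    """e_i = W_proj vec(x_i) + P_i for every patch (rows of `patches`)."""
    return patches @ W_proj.T + pos

rng = np.random.default_rng(0)
N, P, d = 256, 16, 64                                  # d = 64 is an arbitrary demo value
patches = rng.random((N, 3 * P * P))
W_proj = rng.normal(scale=0.02, size=(d, 3 * P * P))   # stands in for learned weights
E = embed(patches, W_proj, sincos_position(N, d))
```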

Encoder: encodes only visible patches

The embedding \(\left\{ e_{i} \right\} _{i\in v}\) of all visible patches is fed into the standard Vision Transformer encoder, and the output is a set of hidden vectors \(\left\{ z_{i} \mid i\in v \right\}\), i.e., the result of global modeling of the visible image information, providing contextual semantics to support the decoder in reconstructing the occluded content:

$$\begin{aligned} z_{v} =Encoder\left( \left\{ e_{i} \right\} _{i\in v} \right) \end{aligned}$$
(2)

Decoder: rebuilds all patch pixels

First, a shared learnable mask token \(m\in R^{d}\) is introduced for the masked patches:

  (1) For each \(j\in \omega\), construct the input \(t_{j}=m+p_{j}\), where \(p_{j} \in R^{d}\) is the position embedding vector.

  (2) Recombine all tokens (encoder outputs and mask tokens) into a complete sequence and restore the original order:

    $$\begin{aligned} T=Unshuffle \left( \left\{ z_{i} \right\} _{i\in v} \cup \left\{ t_{j} \right\} _{j\in \omega } \right) \in R^{N\times d} \end{aligned}$$
    (3)

where v denotes the set of visible (unmasked) patches, \(z_{i}\) denotes the encoder’s output vector for the ith visible patch, and Unshuffle() rearranges the merged tokens into the order of the original patches to obtain a complete sequence.

Then, the complete token sequence T is fed into a small Transformer decoder: \(\hat{y}_{1}, ...,\hat{y}_{N}=Decoder\left( T \right)\), where each output vector \(\hat{y}_{i}\in R^{3P^{2}}\) represents the reconstructed pixel value of patch \(x_{i}\).
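The token recombination of Eq. (3) amounts to scattering encoder outputs and mask tokens back to their original positions. A minimal sketch, assuming 0-based index arrays and that the rows of `z_visible` follow the sorted order of `v`; the function name `unshuffle` and the toy sizes are illustrative only.

```python
import numpy as np

def unshuffle(z_visible, v, omega, mask_token, pos):
    """Build T in R^{N x d}: encoder outputs at visible slots, m + p_j at masked slots."""
    N, d = pos.shape
    T = np.empty((N, d))
    T[v] = z_visible                     # encoder outputs keep their original positions
    T[omega] = mask_token + pos[omega]   # t_j = m + p_j for every masked index j
    return T

rng = np.random.default_rng(0)
N, d = 8, 4                              # toy sizes for illustration
perm = rng.permutation(N)
omega, v = np.sort(perm[:6]), np.sort(perm[6:])
z_visible = rng.normal(size=(len(v), d)) # stands in for encoder outputs
m = np.zeros(d)                          # mask token (learnable in practice)
pos = rng.normal(size=(N, d))            # stands in for position embeddings
T = unshuffle(z_visible, v, omega, m, pos)
```

The decoder then receives `T` as a full-length sequence, so it sees every patch position even though the encoder processed only the visible ones.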

Time series construction module

Based on the ViT patching algorithm, the input image is first split into \(16 \times 16\) pixel blocks, and two-stage feature learning is performed by the improved ViT-MAE. In the first stage, a self-supervised reconstruction task with a high masking rate of 85% strengthens the local-global semantic associations. In the second stage, the spatio-temporal dimensionality distribution of the feature blocks is adjusted by the dimensionality reduction and alignment mechanism, and the 2D features are finally transformed into spatio-temporally continuous 1D sequences by the row-major sequence reorganization algorithm. Let the original image be \(I\in R^{H\times W\times 3}\), where H is the height, W is the width, and 3 is the number of RGB channels.

Patching

The original image is partitioned into patches of size \(p\times p\), giving a total of \(N=\left( H/p \right) \times \left( W/p \right)\) patch pixel blocks, where H/p is the number of block rows and W/p is the number of block columns. A particular patch pixel block in the image can then be represented as:

$$\begin{aligned} P_{i,j} =I\left[ \left( i-1 \right) p+1:ip,\left( j-1 \right) p+1:jp,: \right] \end{aligned}$$
(4)

where \(\left( i-1 \right) p+1:ip\) denotes the pixel range of the patch pixel block in the row direction; \(\left( j-1 \right) p+1:jp\) denotes the pixel range of the patch pixel block in the column direction, and “:” denotes that all RGB channels are preserved.

Extract features and form time series data

Each \(P_{i,j}\) is fed into the ViT-MAE encoder to extract features, which are flattened into vectors in row-major order: \(f_{k} =Encoder\left( P_{k} \right) \in R^{3p^{2} },k=1,...,N\), where \(P_{k}\) denotes the kth patch in row order under the index mapping \(k=\left( i-1 \right) \frac{W}{p} +j\). All patch vectors are then arranged into a sequence by k:

$$\begin{aligned} X=\left[ f_{1}, f_{2},...,f_{N} \right] \in R^{N\times \left( 3p^{2} \right) } \end{aligned}$$
(5)
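The slicing in Eq. (4) and the row-major ordering behind Eq. (5) can be checked with a small sketch. The 1-based inclusive range \((i-1)p+1:ip\) corresponds to the 0-based slice `(i-1)*p : i*p`. The encoder here is a placeholder that simply flattens the patch, which is an assumption made to keep the example self-contained; it is not the ViT-MAE encoder.

```python
import numpy as np

def get_patch(I, i, j, p):
    """Patch P_{i,j} of Eq. (4), with 1-based block indices (i, j)."""
    return I[(i - 1) * p : i * p, (j - 1) * p : j * p, :]

def row_major_index(i, j, W, p):
    """1-based sequence position k = (i - 1) * W / p + j."""
    return (i - 1) * (W // p) + j

def build_sequence(I, p, encoder=lambda patch: patch.reshape(-1)):
    """Stack per-patch features into X of shape (N, 3*p^2), ordered by k."""
    H, W, _ = I.shape
    rows, cols = H // p, W // p
    X = np.zeros((rows * cols, 3 * p * p))
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            k = row_major_index(i, j, W, p)
            X[k - 1] = encoder(get_patch(I, i, j, p))   # placeholder encoder
    return X

I = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)  # toy 8x8 image
X = build_sequence(I, p=4)
```

In this toy case the patch at block row 2, column 1 lands at position k = 3, directly after the two patches of the first block row.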

Linear dimensionality reduction

Since the \(3p^{2}\)-dimensional features are too high-dimensional to use directly, a linear transformation maps each \(f_{k}\) to the input dimension d of the LSTM, i.e. \(e_{k} =W_{e} f_{k} +b_{e} ,W_{e}\in R^{d\times \left( 3p^{2} \right) } ,b_{e} \in R^{d}\), where \(W_{e}\) denotes the linear transformation matrix and \(b_{e}\) denotes the bias term. The sequence of features after dimensionality reduction is \(\hat{X} =\left[ e_{1}, e_{2},...,e_{N} \right] ^{T} \in R^{N\times d}\).

Position encoding

To distinguish the spatial locations of different patch pixel blocks, a learnable position bias \(\pi _{k} \in R^{d}\) is added to each sequence element. The final time series data is then \(\tilde{e} _{k} =e_{k} +\pi _{k} ,k=1,...,N\), so the input to the LSTM is \(\tilde{X} =\left[ \tilde{e}_{1} , \tilde{e}_{2},...,\tilde{e}_{N} \right] ^{T} \in R^{N\times d}\).
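The two steps above, linear reduction followed by the position bias, reduce to one matrix product and one addition. The sketch below uses random arrays in place of the learned \(W_e\), \(b_e\), and \(\pi_k\), and the sizes are arbitrary demo values, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, d = 4, 4, 8                                   # toy sizes for illustration
X = rng.random((N, 3 * p * p))                      # patch features f_k as rows
W_e = rng.normal(scale=0.1, size=(d, 3 * p * p))    # stands in for the learned projection
b_e = np.zeros(d)
pi = rng.normal(scale=0.02, size=(N, d))            # stands in for the learned position bias

X_hat = X @ W_e.T + b_e      # e_k = W_e f_k + b_e for every k
X_tilde = X_hat + pi         # e~_k = e_k + pi_k, the LSTM input sequence
```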

Deep prediction architecture

Using the learned spatiotemporally continuous sequence, an LSTM network performs temporal evolution prediction. The predicted feature vectors are inversely mapped back to the original patch pixel block space and reorganized in two dimensions according to the spatial order of the initial segmentation. The patches are finally stitched together to generate the complete predicted image.

LSTM time series prediction

Let the history length be L and the prediction length be \(H'\), then

$$\begin{aligned} X_{hist}=\left[ \tilde{e}_{1} ,...,\tilde{e}_{L} \right] ^{T} \in R^{L\times d} ,X_{true}=\left[ \tilde{e}_{L+1} ,...,\tilde{e}_{L+H'} \right] ^{T} \in R^{H'\times d} \end{aligned}$$
(6)

First, using LSTMCell, denote the state at step t as (\(h_{t}\),\(c_{t}\)), with initial values \(h_{0}\)=0,\(c_{0}\)=0. Then for \(t=1,..., L\):

$$\begin{aligned} \left( h_{t},c_{t} \right) =LSTMCell\left( \tilde{e}_{t},\left( h_{t-1},c_{t-1} \right) \right) \end{aligned}$$
(7)

Then, define the linear head \(W_{a} \in R^{d\times h}\), the bias \(b_{a} \in R^{d}\), and the LSTMHiddenDim size to be h such that

$$\begin{aligned} \tilde{e}_{L+1}=W_{a}h_{L}+b_{a}, \left( h_{L+1},c_{L+1} \right) =LSTMCell\left( \tilde{e}_{L+1},\left( h_{L},c_{L} \right) \right) \end{aligned}$$
(8)

Sequentially, the autoregressive feature vectors for future time steps can be obtained:

$$\begin{aligned} \tilde{e}_{L+\tau }=W_{a}h_{L+\tau -1}+b_{a}, \left( h_{L+\tau },c_{L+\tau } \right) =LSTMCell\left( \tilde{e}_{L+\tau },\left( h_{L+\tau -1},c_{L+\tau -1} \right) \right) \end{aligned}$$
(9)

until \(\tau =H'\), to obtain the final predicted sequence:

$$\begin{aligned} \hat{Y} =\left[ \tilde{e}_{L+1} ,...,\tilde{e}_{L+H'} \right] ^{T}\in R^{H'\times d} \end{aligned}$$
(10)
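The recursion in Eqs. (7) to (10) can be sketched as a warm-up pass over the history followed by an autoregressive rollout. This is a minimal NumPy illustration with randomly initialized weights, not the trained model; `lstm_cell` implements the standard LSTM update, and the gate ordering inside the stacked weight matrices is an arbitrary choice made here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, state, params):
    """Standard LSTM step; stacked gate order is input, forget, candidate, output."""
    h_prev, c_prev = state
    W, U, b = params                       # shapes (4n, d), (4n, n), (4n,)
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2 * n]), sigmoid(z[3 * n:])
    g = np.tanh(z[2 * n:3 * n])
    c = f * c_prev + i * g
    return o * np.tanh(c), c

def rollout(X_hist, horizon, params, W_a, b_a):
    n = W_a.shape[1]
    h, c = np.zeros(n), np.zeros(n)        # h_0 = 0, c_0 = 0
    for e_t in X_hist:                     # Eq. (7): consume the L history steps
        h, c = lstm_cell(e_t, (h, c), params)
    preds = []
    for _ in range(horizon):               # Eqs. (8)-(9): feed each prediction back in
        e_next = W_a @ h + b_a
        preds.append(e_next)
        h, c = lstm_cell(e_next, (h, c), params)
    return np.stack(preds)                 # Eq. (10): shape (H', d)

rng = np.random.default_rng(0)
d, n, L, H_pred = 6, 5, 10, 3              # toy sizes for illustration
params = (rng.normal(size=(4 * n, d)), rng.normal(size=(4 * n, n)), np.zeros(4 * n))
W_a, b_a = rng.normal(size=(d, n)), np.zeros(d)
X_hist = rng.normal(size=(L, d))
Y_hat = rollout(X_hist, H_pred, params, W_a, b_a)
```

Because each predicted vector becomes the next input, errors can accumulate over the horizon, which is a general property of this kind of autoregressive decoding.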

Sequence patch reconstruction

Each predicted vector \(\tilde{e}_{k} \in R^{d}\) is mapped back to pixel space. With the decoding matrix \(W_{d} \in R^{\left( 3p^{2} \right) \times d}\) and bias \(b_{d} \in R^{3p^{2} }\), the flattened predicted patch is \(\tilde{P} _{k} =W_{d} \tilde{e} _{k} +b_{d} \in R^{3p^{2} }\), which is then reshaped into a pixel block:

$$\begin{aligned} \hat{P} _{k} =reshape\left( \tilde{P}_{k} \right) \in R^{p\times p\times 3} \end{aligned}$$
(11)

Patch pixel blocks back to full image

Each predicted patch is placed back at its initial grid position (i, j):

$$\begin{aligned} \hat{I} \left[ \left( i-1 \right) p+1:ip,\left( j-1 \right) p+1:jp,: \right] =\hat{P} _{k} ,k=\left( i-1 \right) \frac{W}{p} +j \end{aligned}$$
(12)

where \(i=1,..., \frac{H}{p},j=1,...,\frac{W}{p}\). The final predicted image with the same size as the original is obtained: \(\hat{I} \in R^{H\times W\times 3}\).
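Equations (11) and (12) together invert the row-major flattening. The sketch below assumes that `E` holds one predicted vector per patch position in row-major order; the function name `sequence_to_image` is hypothetical. To make the round trip checkable, the demo uses an identity decoder (\(d = 3p^2\), \(W_d = I\), \(b_d = 0\)), which is a test construction and not the learned decoder.

```python
import numpy as np

def sequence_to_image(E, W_d, b_d, H, W, p):
    """Decode N vectors to p x p x 3 patches and paste them at their grid positions."""
    rows, cols = H // p, W // p
    I_hat = np.zeros((H, W, 3))
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            k = (i - 1) * cols + j                      # 1-based row-major index
            flat = W_d @ E[k - 1] + b_d                 # flattened patch, length 3*p^2
            I_hat[(i - 1) * p : i * p, (j - 1) * p : j * p, :] = flat.reshape(p, p, 3)
    return I_hat

rng = np.random.default_rng(0)
H, W, p = 8, 8, 4                                       # toy sizes for illustration
I = rng.random((H, W, 3))
# Row-major flattened patches of I, used here in place of predicted vectors.
E = np.stack([I[r * p:(r + 1) * p, c * p:(c + 1) * p, :].reshape(-1)
              for r in range(H // p) for c in range(W // p)])
W_d, b_d = np.eye(3 * p * p), np.zeros(3 * p * p)       # identity decoder for the demo
I_hat = sequence_to_image(E, W_d, b_d, H, W, p)
```

With the identity decoder the reconstruction matches the input exactly, which confirms that the flattening and the placement use the same ordering.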

Experiments

To systematically verify the comprehensive efficacy of the model constructed in this study, the first part presents the corresponding performance metrics. The rest of the section tests and validates the proposed method on the existing dataset through rigorous comparative experiments and conducts the corresponding ablation experiments to verify the feasibility of the proposed method in this paper, as well as its superiority in terms of prediction accuracy, robustness, and capability.

Implementation details

The experiments are based on PyTorch 1.12.0 and the vit-mae codebase. The image processing model is vit-mae-base (the base variant), which takes inputs at \(256 \times 256\) resolution in the pre-training phase. An 85% masking rate forces the model to reconstruct the key visual semantics. The time-series prediction models used are the base LSTM, RNN, and Transformer. In the fine-tuning stage, a dynamic position encoding mechanism couples the 768-dimensional feature vectors obtained from pre-training with temporal coordinate information. The experiments run on an NVIDIA A800 GPU cluster. End-to-end training uses the AdamW optimizer with cosine annealing learning rate scheduling (initial value \(3e-4\)), and 200 training cycles are set for each dataset to ensure convergence. The evaluation metrics are the mean squared error (MSE) and mean absolute error (MAE), defined as follows:

$$\begin{aligned} \textrm{MSE} = \frac{1}{L} \sum _{i=1}^{L}\left( z_{M+i}-\widetilde{z}_{M+i}\right) ^{2} \end{aligned}$$
(13)
$$\begin{aligned} \textrm{MAE} = \frac{1}{L} \sum _{i=1}^{L}\left\| z_{M+i}-\widetilde{z}_{M+i}\right\| \end{aligned}$$
(14)
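As a small worked example of the two metrics, the sketch below averages the squared and absolute element-wise errors over a toy prediction window. The numbers are made up for illustration and are not experimental results.

```python
import numpy as np

def mse(z_true, z_pred):
    """Mean squared error over all elements of the window."""
    return float(np.mean((z_true - z_pred) ** 2))

def mae(z_true, z_pred):
    """Mean absolute error over all elements of the window."""
    return float(np.mean(np.abs(z_true - z_pred)))

z_true = np.array([1.0, 2.0, 3.0, 4.0])    # illustrative values only
z_pred = np.array([1.5, 2.0, 2.0, 4.5])
print(mse(z_true, z_pred), mae(z_true, z_pred))   # 0.375 0.5
```

The errors are 0.5, 0, 1, and 0.5 in magnitude, so the single large miss dominates the MSE while the MAE weights all four equally.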

Both metrics are computed on each prediction window and averaged over the image predictions, with the windows rolled over the entire set with stride = 1. The ViT-MAE pre-training default settings are shown in Table 1. The length of the input sequence in the model is 200 and the patch size is 4, i.e., there are a total of 50 patches. The Adam optimizer is used with the same strategy as ViT-MAE, and the optimizer configuration was not modified. Following the settings of the original vit-mae, a linear learning rate scaling rule is used: lr = base lr \(\times\) batch size / 256. The encoder has 3 layers and the decoder has 1 layer, and a high masking rate of about 85% is used in pre-training. All original images from the datasets (with a native resolution of \(512 \times 512\) pixels) were resized to \(256 \times 256\) pixels using bilinear interpolation via the PyTorch \(torchvision.transforms.Resize\left( \cdot \right)\) function. This resizing was applied uniformly across both the self-supervised pre-training stage of the ViT-MAE and the subsequent training and evaluation of the temporal prediction pipeline to ensure input consistency. All input images were normalized using the mean and standard deviation of the ImageNet dataset (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]). This preprocessing step, implemented using \(torchvision.transforms.Normalize\left( \cdot \right)\), was applied consistently to the inputs of all models (including Ours, F-CLSTM54, CloudCast55, SATcast56, and the CNN/Transformer ablation baselines) during both training and inference to ensure a fair comparison. These configurations are identical across all datasets.
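The linear scaling rule and a cosine annealing schedule can be illustrated with plain Python. The batch size of 512 and the choice of 3e-4 as the base rate are hypothetical values picked for the example, since the text does not state how the two settings interact; the cosine formula is the standard closed form with a minimum rate of zero.

```python
import math

def scaled_lr(base_lr, batch_size):
    """Linear scaling rule: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Standard cosine annealing from lr_max at step 0 to lr_min at total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

lr0 = scaled_lr(3e-4, 512)                                  # hypothetical batch size
schedule = [cosine_lr(e, 200, lr0) for e in range(201)]     # one value per training cycle
```

Under these assumed values the rate starts at 6e-4, passes through half of that at the midpoint, and decays to zero at the final step.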

Table 1 The config of ViT-MAE.

Datasets

The selected three cloud map datasets (Fig. 3) are described as follows:

TCDD dataset

The dataset is the TJNU Cloud Detection Database (TCDD), which collects cloud detection data from 2019 to 2020 for nine provinces in China: Tianjin, Anhui, Sichuan, Gansu, Shandong, Hebei, Liaoning, Jiangsu, and Hainan. It contains 2300 ground-based cloud images and their corresponding cloud masks. TCDD consists of 1874 training images and 426 test images. The cloud images are captured by a vision sensor and stored in PNG format with a pixel resolution of 512×512.

Sample dataset

The dataset is a satellite cloud image dataset. It contains 2200 training samples and 300 test samples for training a super-resolution model.

Rice dataset

The dataset is a remote sensing image cloud-removal dataset consisting of two parts. RICE1 contains 500 pairs of images, each pair comprising a cloudy image and a cloud-free image of size 512×512. RICE2 contains 450 groups of images, each group comprising three 512×512 images: a cloud-free reference image, a cloudy image, and a cloud mask.

Fig. 3 Selected samples from three different datasets.

Main results

Comparison of different models

To verify the advantages of the proposed model, this study conducts systematic comparative experiments on three datasets. Seven benchmark models are selected, namely F-CLSTM54, SATcast56, CloudCast55, VideoMAE57, PredRNN58, THItoGene59, and VGG-TSwinformer60, and their performance is compared with that of the proposed model. The experimental results are shown in Table 2. The proposed model achieves the best results on both the MSE and MAE metrics, which indicates that combining a time series prediction model with an image processing model is effective and yields a clear improvement. To show the advantages of the proposed model more intuitively, this study visualizes some predicted sample images and compares them with those of F-CLSTM and CloudCast, as shown in Fig. 4.

Fig. 4 The predicted cloud images based on different prediction models.

Table 2 Comparative experimental results of different models.

Prediction accuracy for different prediction lengths

To verify the effectiveness and advantages of the proposed model, comparison experiments are conducted on the above three datasets with the number of predicted images set to 1, 2, and 3. Table 3 shows that the prediction errors of the proposed model remain stable: the MSE ranges from 0.46 to 0.98, and the MAE ranges from 0.52 to 1.71. Notably, the model achieves its best value on both metrics when the number of predicted images is 1, which matches the temporal pattern of real scenes. By analyzing the historical image sequence, the model can accurately predict a single frame. This behavior is consistent with the strong correlation between neighboring images in the evolution of a dynamic system, and it indicates that the model has good potential for practical application.

Table 3 Prediction results for different prediction lengths.

Ablation studies

Effect of different time series forecasting models

To evaluate the generalizability and effectiveness of our core contribution, the ViT-MAE-based feature extraction and time-series conversion module, we conducted an ablation study that integrates this module into other state-of-the-art base and variant prediction models (LSTM36, CNN17, Transformer11, NOA-LSTM61, xPatch62, and PPDformer63). As shown in Table 4, the proposed module integrates well with different prediction models. This indicates that the module provides a universal and powerful front-end that combines image and temporal prediction to improve the accuracy of image prediction. The results also show that this strategy of phased feature migration effectively improves the quality of time-step image generation.

Table 4 Evaluation of different temporal prediction models with the same ViT-MAE feature extraction front-end.

Effect of different masking ratio

To examine how the masking ratio affects model performance, this study sets the prediction model input lengths to 24, 48, 96, and 192 and conducts the corresponding experiments on the TCDD dataset. As shown in Fig. 5, an 85% masking ratio is the best choice. This finding contrasts with those reported for BERT and VideoMAE but is consistent with the trend observed for MAE models in the image domain. A higher masking ratio limits the number of visible tokens and thereby forces the model to focus on extracting and integrating high-level semantic features. Notably, when the masking ratio is reduced to 45%, the performance drops even though the encoder has access to more raw input information. A possible reason is that a low masking ratio allows the model to complete the reconstruction by simple local feature interpolation, so that it overfits to low-level details.

This analysis suggests that an appropriately high masking ratio (85%) effectively reduces the spatial redundancy of the input data and forces the model to establish global semantic associations rather than rely on local cues. However, an extremely high masking ratio (95%) causes significant information loss, which severely attenuates the training signal and ultimately impairs the model’s understanding of the data distribution. This inverted U-shaped performance curve reveals a balance between information retention and abstraction learning: moderate information loss facilitates high-level representation learning, but excessive masking degrades semantic modeling capability.
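
To make the effect of the masking ratio concrete, the following minimal sketch draws a random mask over the patch indices of one image and reports how many tokens stay visible to the encoder. It assumes 256 patches per image (a 256×256 input with patch size 16) and rounds the number of visible patches down; the function name `random_mask` and the seed are illustrative, and the sketch is not the ViT-MAE implementation itself.

```python
import random


def random_mask(num_patches, mask_ratio, seed=None):
    """Return (visible, masked) patch indices for one image."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must lie in [0, 1)")
    # Rounding down the visible count is an assumption of this sketch.
    num_visible = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)
    visible = sorted(order[:num_visible])
    masked = sorted(order[num_visible:])
    return visible, masked


for ratio in (0.45, 0.85, 0.95):
    visible, masked = random_mask(256, ratio, seed=0)
    print(f"ratio={ratio}: {len(visible)} visible, {len(masked)} masked")
```

Under these assumptions the encoder sees 140, 38, and 12 patches at ratios of 45%, 85%, and 95%, respectively, which illustrates how quickly the available training signal shrinks at the high end.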

Fig. 5 MSE and MAE of different masking ratios with different prediction lengths.

Effect of different input lengths

This study examines the relationship between input window length and forecasting performance through systematic experiments. The experimental results are shown in Table 5. The model accuracy first increases and then decreases with the input sequence length, with the best input length being 96. Once the input length crosses the critical threshold of 192, the model shows a significant degradation of 36.4% and 49.3% in the MAE and MSE metrics on the test set, respectively. Our analysis indicates that this mainly stems from the multi-scale cycle features (daily cycle, weekly cycle, seasonal cycle, etc.) implied in long time-series data, which are difficult for the standard attention mechanism to decouple effectively. As a result, the model falls into a local optimum and fails to establish cross-period correlations.
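
The input window length determines how a sequence is cut into (input, target) pairs. A minimal sketch of this windowing is given below, assuming a one-dimensional sequence and a stride of 1; the function name `make_windows` and the `horizon` parameter are illustrative and are not taken from the paper.

```python
def make_windows(series, input_len, horizon, stride=1):
    """Cut a sequence into (input, target) pairs with a sliding window."""
    windows = []
    last_start = len(series) - input_len - horizon
    for start in range(0, last_start + 1, stride):
        split = start + input_len
        windows.append((series[start:split], series[split:split + horizon]))
    return windows


# Hypothetical sequence of 200 steps, windowed at the input lengths
# studied in the text (the horizon of 1 is illustrative).
series = list(range(200))
for input_len in (24, 48, 96, 192):
    pairs = make_windows(series, input_len, horizon=1)
    print(input_len, len(pairs))
```

For a fixed sequence, a longer input window leaves fewer (input, target) pairs, so each increase in input length trades more context per sample against fewer samples.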

Table 5 Prediction results for different input lengths.

Evaluation of generalization ability

Cross-Dataset Performance Consistency: The proposed model was evaluated on three distinct cloud image datasets (TCDD, Sample, and Rice) with different characteristics (ground-based vs. satellite, different geographical origins). The consistent superiority of our model over baselines across all these datasets is a strong indicator of its robustness and generalizability. The model is not overfitting to the specific nuances of a single dataset but is learning a generalizable representation for cloud image prediction.

Performance Under Different Conditions: The model’s performance under varying prediction lengths has been further analyzed as shown in Table 3. The stable and predictable degradation pattern as the prediction horizon extends (i.e., error increases gradually rather than catastrophically) demonstrates that the proposed model has learned a reasonable temporal dynamic that generalizes across different forecasting scenarios.

Impact of patch size

The patch size P is a critical hyperparameter in this framework, governing a fundamental trade-off between spatial granularity and temporal sequence length. A smaller P yields a longer sequence of finer-grained image patches, potentially capturing more detail at the cost of increased computational complexity and longer-range dependencies for the LSTM to model. A larger P results in a shorter sequence of coarser-grained patches, which reduces the computational load but risks losing vital spatial information. To identify the most suitable patch size for this model, an ablation study was carried out with the same hyperparameter settings as the main experiments. As shown in Table 6, a medium patch size, which generates a sequence of moderate length while preserving adequate spatial information, is essential for achieving high prediction accuracy. These results support the choice of \(P =16\) for the main experiments.
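
The trade-off can be quantified with a short calculation. The sketch below assumes a square 256×256 RGB input split into non-overlapping P×P patches; the candidate sizes 8 and 32 are illustrative and are not necessarily the values tested in Table 6, and the function name `patch_grid` is hypothetical.

```python
def patch_grid(image_size, patch_size, channels=3):
    """Return (sequence_length, values_per_patch) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side, patch_size * patch_size * channels


for p in (8, 16, 32):
    seq_len, patch_dim = patch_grid(256, p)
    print(f"P={p}: sequence length {seq_len}, {patch_dim} values per patch")
```

Halving P quadruples the sequence length while each patch carries a quarter of the pixel values, which is why a medium patch size gives a sequence of moderate length.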

Table 6 Ablation study on the impact of different patch sizes. The experiment was conducted on the TCDD dataset. The best results are highlighted in bold.

Conclusions

A complete research system has not yet been formed in the field of image time-series prediction. For this reason, this paper constructs a visual prediction framework based on the LSTM recurrent neural network. The architecture is mainly composed of a feature learning module and a temporal prediction module, and its implementation contains three key phases. The first phase is feature learning: the original images are used for self-supervised pre-training through the mask reconstruction mechanism of ViT-MAE, which extracts the spatial semantic features of the image. This process uses a random masking strategy that masks 85% of the image patches, forcing the model to learn the underlying spatial correlations. The second phase is time-series conversion: the acquired image feature matrix is expanded along its spatial dimensions into a multi-dimensional time series with spatio-temporal correlation. This transformation maintains the spatial topology of the feature vectors and ensures an invertible mapping between the temporal dimension and the spatial dimension. The third phase is prediction and reconstruction: the LSTM network models the temporal dependencies and captures long-term dependency patterns through its gating mechanism. The predicted time series is then assembled into an image patch matrix by inverse spatial reconstruction and finally integrated into a complete predicted image. To demonstrate the effectiveness of this study, performance experiments and ablation experiments are conducted on three cloud image datasets. The results show that the model achieves better results on both the MSE and MAE metrics, which indicates that the proposed model is feasible and efficient.
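
The invertible mapping between the spatial and temporal dimensions can be illustrated with a minimal NumPy sketch. It operates directly on an (H, W, C) array and does not reproduce the ViT-MAE features or the linear projection used in the paper; the function names `patchify` and `unpatchify` are illustrative.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) array into a (num_patches, P*P*C) sequence."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be divisible by patch size")
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, P, P, C), row-major patches
    return x.reshape(gh * gw, patch_size * patch_size * c)


def unpatchify(sequence, patch_size, height, width, channels):
    """Inverse of patchify: rebuild the (H, W, C) array from the sequence."""
    gh, gw = height // patch_size, width // patch_size
    x = sequence.reshape(gh, gw, patch_size, patch_size, channels)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, P, gw, P, C)
    return x.reshape(height, width, channels)


# Small hypothetical input: a 32x32 RGB array with patch size 16.
image = np.arange(32 * 32 * 3).reshape(32, 32, 3)
sequence = patchify(image, 16)
restored = unpatchify(sequence, 16, 32, 32, 3)
print(sequence.shape, np.array_equal(restored, image))
```

Because both functions only reshape and transpose, the round trip loses no information; any loss in the full pipeline would come from the projection and prediction steps rather than from the reordering itself.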

Although the proposed linear projection is effective, some granular spatial details might be lost in converting high-dimensional image patches into a lower-dimensional temporal sequence, which could limit the prediction fidelity for very fine-grained structures. In addition, the model’s performance may degrade on sequences exhibiting extreme, unpredictable changes (e.g., rapidly forming storm cells) that significantly deviate from the patterns learned during training. The autoregressive prediction loop could amplify errors in such scenarios.

In future work, we will examine these limitations and explore several research and optimization directions, such as replacing the image processing model with a stronger one, replacing the time series model with one that is more stable and efficient, and improving the method of converting image patches into time series data.