Introduction

Over the past decades, the air transportation industry has boomed, with air traffic flow increasing alongside economic growth. High traffic density inevitably increases the workload of air traffic controllers (ATCos), which makes it challenging to provide safe and efficient air traffic control (ATC) services. The increasing air traffic flow in recent years has led to a rise in aviation incidents caused by various human factors (e.g., cognitive workload and weakened situational awareness), bringing a huge risk to ATC safety. According to the investigations in1,2, 70% of aviation incidents are related to human factors, which motivates us to reconsider the mitigation of human factor-related risks in aviation.

In real-time ATC operation, most of the human factor-related risks are caused by errors in the ATC communication procedure3, such as unsafe ATC decisions, mishearing, and misunderstanding. The major obstacle to detecting human factor-related risks is that current ATC systems fail to consider the human intention (represented by the controlling intent of the spoken instructions) for traffic prediction due to the human-in-the-loop (HITL) nature of ATC. Specifically, in the ATC procedure, ATCos make decisions based on their awareness of the traffic dynamics and issue spoken instructions to negotiate with the pilots via very high-frequency (VHF) radiotelephony. Once the aircrews correctly read back the ATC instruction, they perform the required aircraft operations according to its controlling intent. The flight then enters a high-maneuvering status (denoted as instruction-driven maneuvering flight scenarios in this work) with rapid and complex transition patterns of the flight trajectory.

In this procedure, the most typical accidents caused by human errors can be broadly categorized into two types: unsafe ATC instructions issued by ATCos (containing potential conflicts) and incorrect operations by pilots (misunderstanding or failure to adhere to the provided ATC instructions)4. Notable examples include the Überlingen Mid-Air Collision in 2002 and the Haneda Airport runway collision in 2024, which resulted in significant losses and impacts for passengers, airlines, etc.5,6. Despite enormous efforts devoted to enhancing safety measures in the ATC procedure7,8,9, such incidents have not yet been eliminated due to the aforementioned HITL nature, posing a continuous threat to people’s lives and safety. In this context, predicting the influence of ATCo instructions on real-time traffic operations is a promising approach to detecting and mitigating human error-related risks in ATC procedures. For example, i) if the issued ATC instruction contains potential risks, the predicted results can enable the downstream conflict detection applications to promptly identify possible flight conflicts; ii) if a pilot misunderstands or incorrectly executes an ATC instruction, the resulting flight trajectory will significantly deviate from the expectations of ATC. Nevertheless, limited by the technical ability to process spoken ATC instructions, the automation gap between human intention and ATC systems hinders the development of automated measures for current ATC systems to detect human errors.

Accurate flight trajectory prediction (FTP) results enable ATC participants to gather traffic dynamics in advance, supporting efficient decision-making and detecting potential conflicts in specific airspace regions10,11,12. Currently, short-term FTP serves as the vital application of traffic prediction in modern ATC systems, which aims to forecast the flight status of the aircraft in the future time instants13,14,15,16. Although existing FTP approaches can achieve the expected performance with constant transition patterns, such as the en-route phase, they are still facing great challenges in predicting the flight trajectory with maneuvering operations due to the intervention of human intentions. In general, the complicated maneuvering patterns caused by human intentions can be summarized below:

  • Microscopic maneuvering by pilots, including operational behaviors, real-time environmental factors, etc. In17, a sophisticated time-frequency analysis method was proposed to achieve the FTP task by implicitly capturing the complicated multi-scale maneuvering patterns, which achieves the desired performance on complicated flight patterns.

  • Macroscopic maneuvering by air traffic controllers, mainly concerning the spoken ATC instruction, which is the primary driving factor to cause maneuvering operations. Under the aircraft separation rules in the ATC domain, macroscopic maneuvering is the most decisive factor in influencing flight trends, and is able to provide confident pre-warnings to flight conflicts.

As illustrated in Fig. 1a, the controlling intent of real-time ATC spoken instructions (turn left direct to KAKMI) serves as a driving factor that induces the macroscopic flight maneuvering operations. Since conventional data-driven FTP approaches typically rely only on historical trajectory observations and fail to consider real-time intents and required parameters, they have little trend awareness and suffer from unreliable prediction results with delayed responses. Moreover, the flight trajectory can be seen as the most direct and ultimate manifestation of the controlling intent (ATC spoken instructions). Therefore, the ATC spoken instruction can be integrated into the FTP process to empower air traffic predictions, enabling the ATC systems to consider human factors automatically in a closed loop and further detect human errors. In this context, it is imperative to explicitly consider the ATC spoken instruction for the FTP task in a proper way.

Fig. 1: Comparison of conventional data-driven FTP and instruction-driven FTP.

a An example of the ATC communication procedure and the challenges faced by the conventional FTP tools. b The logic flow of the proposed instruction-driven FTP paradigm.

Inspired by this, in this work, an instruction-driven flight trajectory prediction paradigm is proposed to incorporate spoken instruction into the automation process, including controlling intent understanding and resulting flight trajectory prediction. In this way, the most important information source (spoken instructions) of the human intention in real-time ATC operations can be perceived automatically with high timeliness, thereby enhancing the predictability of the ATCos’ performance on traffic operations. Furthermore, automation tools can be developed to detect potential human-related risks4,9, and further enhance the safety and efficiency of traffic operations.

To be specific, we mainly focus on short-term FTP tasks within a few future time instants (~1–10 minutes) based on historical observations and the spoken instructions, which further support the downstream applications (e.g., conflict detection and monitoring the execution of ATC instructions). To this end, an intuitive approach is to develop multi-modal FTP approaches that explicitly consider both the spoken instructions and historical trajectory observations during the instruction-driven maneuvering flight scenarios. Unlike the concept in the human/vehicle trajectory prediction domain18,19,20,21,22, the term “multi-modal” in this work refers specifically to the fusion of data from different modalities through multi-modal learning23. However, it is difficult to incorporate the spoken instructions into the FTP process due to the following challenges:

  • In the ATC procedure, the communications between the ATCos and pilots are based on speech communication via VHF radiotelephony, while the flight trajectories are collected via binary structures and decoded into the modality of spatial-temporal data. It is clear that the distinct modality gap between trajectory and spoken instructions brings great technical challenges to considering spoken instruction in FTP tasks.

  • An expected limitation of data-driven multi-modal approaches is the requirement for well-resourced trajectory-instruction pairs in the training process. However, the collection, preprocessing, and annotation of trajectory-instruction paired samples are both time-consuming and labor-intensive. It is challenging to train a multi-modal FTP model utilizing limited trajectory-instruction pairs.

Considering the abovementioned challenges, in this paper, a spoken instruction-aware flight trajectory prediction framework, called SIA-FTP, is innovatively proposed to implement instruction-driven FTP task, which further incorporates human intentions into an ATC automation process. In general, spoken instruction is the speech signal with considerable information redundancy, such as radiotelephony noise and speaker identity, which makes it challenging to incorporate the speech signal directly into the FTP task. Fortunately, based on our previous works on the Automatic Speech Recognition (ASR) technique in the ATC domain24,25,26,27,28, the spoken instruction can be translated into high-confidence human/computer-readable transcripts, indicating the controlling intents and required detailed parameters. Benefiting from the previous efforts, as depicted in Fig. 1b, the spoken instruction can be transcribed into textual modality by the existing ASR systems, which can further reduce the modality gap. In this context, the primary challenge of the SIA-FTP framework is to incorporate the textual spoken instructions into the FTP model under limited trajectory-instruction paired samples. It is expected that the proposed SIA-FTP framework can leverage the complementary and diverse information from both textual spoken instructions and spatial-temporal trajectory modalities to enhance the performance of FTP tasks.

In practice, compared to the limited paired data, the unimodal trajectory and text instruction data are well-resourced and can be easily obtained separately, as in our previous work9,13. Therefore, in this paper, a 3-stage progressive multi-modal learning paradigm is designed to train the proposed SIA-FTP framework, including trajectory-based FTP pre-training, intent-oriented instruction embedding learning, and multi-modal FTP fine-tuning, as described below:

  • Stage 1: trajectory-based FTP pre-training stage, a multi-horizon FTP model proposed in our previous work, named FlightBERT++29, is applied to learn the spatial-temporal movement patterns only by trajectory data samples, in which the temporal trajectory prediction (predicting future trajectory only based on historical observations) serves as the pre-training task.

  • Stage 2: intent-oriented instruction embedding learning stage, a BERT-based architecture is firstly introduced to learn abstract compact text representations, followed by a multi-label intent identification (IID) task to learn the discriminative embeddings among different controlling intents from the text instructions.

  • Stage 3: multi-modal FTP finetuning stage, a simple yet effective modal fusion strategy is designed to explicitly bridge the pre-trained FTP model and the intent identification model to construct the multi-modal FTP model, which can incorporate the instruction embedding into the pre-trained FTP model.

Finally, the multi-modal FTP model is trained on the limited trajectory-instruction pairs to finetune the model parameters, which is expected to learn the flight transition patterns considering specific instructions (intent and required parameters). Thanks to the FlightBERT++ framework with the explicit fusion of controlling intents, the proposed SIA-FTP framework is able to predict future multi-horizon trajectory points in a non-autoregressive manner with macroscopic maneuvering awareness, which provides the performance and efficiency needed for real-world applicability.

To validate the proposed SIA-FTP framework, a real-world dataset collected from industrial ATC systems in China is built to conduct the experiments. The experimental results demonstrate that the proposed SIA-FTP framework achieves impressive results in instruction-driven high-maneuvering flight processes, with over 20% relative reduction of the mean deviation error across 15 prediction horizons (5 minutes) compared to the best baseline. All the proposed techniques and strategies are confirmed to provide the desired performance improvement. Most importantly, extensive visualizations and in-depth diagnostic studies are conducted to enhance the interpretability and generalizability of the proposed SIA-FTP framework. It is believed that the proposed framework can be a promising approach to empower FTP-based downstream applications, especially for conflict detection and resolution, and erroneous ATC instruction identification, thereby enhancing the safety and efficiency of ATC operations. In summary, this work contributes to human error detection in real-time ATC operations and the resulting instruction-driven FTP tasks in the following ways:

  • This work innovatively defines the instruction-driven FTP task (i.e., incorporating spoken instruction into the FTP process), which has solid significance and applicability to the ATC work. In addition, the proposed framework can incorporate textual instructions into FTP tasks to improve the model performance in instruction-driven maneuvering scenarios.

  • A 3-stage progressive multi-modal learning paradigm is designed to develop the proposed SIA-FTP framework, which enables the model to achieve the desired performance under the limited trajectory-instruction paired samples.

  • A multi-label intent identification method is proposed to understand controlling intents from text instructions, which extracts informative intent-oriented instruction embeddings and projects them into a compact embedding space to support modal fusion during joint optimization.

  • A simple yet effective multi-modal fusion mechanism is designed to fuse the trajectory embedding and the intent-oriented instruction embedding, thereby supporting spoken instruction awareness in the FTP process.

  • A trajectory-instruction dataset is built to conduct the experiments, which can serve as a benchmark for future studies on human intention automation. Extensive experiments demonstrate the efficiency and effectiveness of the proposed SIA-FTP framework.

Results

Task Overview

In general, short-term FTP tasks can be formulated as spatial-temporal sequential modeling problems, which predict the flight status over the next few minutes based on the observation sequence. Let the observation sequence be \({O}_{t-k+1:t}=\{{p}_{t-k+1},...,{p}_{t-1},{p}_{t}\}\). The FTP model aims to forecast \({P}_{t+1:t+n}=\{{p}_{t+1},{p}_{t+2},...,{p}_{t+n}\}\) based on the past k observations \({O}_{t-k+1:t}\), where \({p}_{t}\) represents a trajectory point at time step t. The conventional FTP tasks can be mathematically described as follows:

$${P}_{t+1:t+n}=\{{p}_{t+1},{p}_{t+2},...,{p}_{t+n}\}={{{{\mathcal{F}}}}}({O}_{t-k+1:t})$$
(1)

where n and k are the number of prediction trajectory points and the observed sequence length, respectively. \({{{{\mathcal{F}}}}}(\cdot )\) denotes the learnable FTP model.

In this work, the SIA-FTP tasks can be defined as Eq. (2), where the SI is the textual spoken instruction of ATCo that is confirmed by the pilot in time step t.

$${P}_{t+1:t+n}=\{{p}_{t+1},{p}_{t+2},...,{p}_{t+n}\}={{{{\mathcal{F}}}}}({O}_{t-k+1:t},{{{{\rm{SI}}}}})$$
(2)

Additionally, the definition of the trajectory point \({p}_{t}\) is presented in Eq. (3). Specifically, a total of six trajectory attributes that describe the flight status are selected to form \({p}_{t}\), including longitude (Lon), latitude (Lat), altitude (Alt), and corresponding velocities (Vx, Vy, Vz) along these dimensions. In this work, the Lon, Lat, and Alt serve as the primary attributes of the four-dimensional (4D) FTP task, while the velocities are employed as the auxiliary attributes.

$${p}_{t}=[{{{{{\rm{Lon}}}}}}_{t},{{{{{\rm{Lat}}}}}}_{t},{{{{{\rm{Alt}}}}}}_{t},{{{{{\rm{Vx}}}}}}_{t},{{{{{\rm{Vy}}}}}}_{t},{{{{{\rm{Vz}}}}}}_{t}]$$
(3)
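
As a minimal sketch of how Eqs. (1)–(3) translate into training samples, the snippet below slides a window over a single flight track and yields (observation, target) pairs. The class name, field names, and the helper `make_samples` are hypothetical and introduced only for illustration; the values k = 9 and n = 15 in the usage note follow the observation length and prediction horizons mentioned elsewhere in this paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TrajectoryPoint:
    """One trajectory point p_t with the six attributes of Eq. (3)."""
    lon: float
    lat: float
    alt: float
    vx: float
    vy: float
    vz: float


def make_samples(
    track: List[TrajectoryPoint], k: int, n: int
) -> List[Tuple[List[TrajectoryPoint], List[TrajectoryPoint]]]:
    """Slide a window over one track and return (O, P) pairs.

    O holds the k most recent observations p_{t-k+1}, ..., p_t.
    P holds the n future points p_{t+1}, ..., p_{t+n} used as targets.
    """
    samples = []
    # `end` is the index of the first future point p_{t+1}.
    for end in range(k, len(track) - n + 1):
        observed = track[end - k:end]
        future = track[end:end + n]
        samples.append((observed, future))
    return samples
```

For example, a track of 30 points with k = 9 and n = 15 yields 30 - 9 - 15 + 1 = 7 samples, each with 9 observed points and 15 target points.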

Dataset and data preprocessing

To validate the effectiveness of the proposed framework, a multi-modal situational dataset M2ATS30 is used to train the proposed SIA-FTP framework, which is collected from the real-world ATC system in China, from February 19 to February 27, 2021. The M2ATS covers diverse ATC and flight operation data, including flight trajectory data, flight plans, airspace information, and speech of the ATC communication with golden labels. In this work, we divide the M2ATS into three subsets to support the experiments of the SIA-FTP framework, i.e., trajectory subset, text instruction subset, and trajectory-instruction subset. The data split strategy of train, validation, and test sets follows the original M2ATS. Specifically, the data of the first 7 days serve as training data, whereas the data from the remaining two days are used for validation and test, respectively. It is noted that the above data split strategy is applied to all the experiments and training phases in this work. The detailed data preprocessing process is described in the Supplementary Information (Section Data Preprocessing).
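
The day-based split described above can be sketched as follows. The function name `split_of` is hypothetical, and the assignment of day 8 to validation and day 9 to test is an assumption based on the order in which the two subsets are named.

```python
from datetime import date

# First day of the collection period stated in the text.
FIRST_DAY = date(2021, 2, 19)


def split_of(day: date) -> str:
    """Map a calendar day to 'train', 'val' or 'test'.

    Days 1-7 go to training. The two remaining days go to validation
    and test; the order (day 8 = val, day 9 = test) is an assumption.
    """
    offset = (day - FIRST_DAY).days
    if not 0 <= offset <= 8:
        raise ValueError(f"{day} is outside the 9-day collection period")
    if offset < 7:
        return "train"
    return "val" if offset == 7 else "test"
```

Splitting by whole days rather than by random samples keeps trajectories from the same flight out of both the training and evaluation sets.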

Comparison baselines

In this work, a total of 7 competitive approaches serve as baselines. According to the inference style of the multi-horizon prediction process, the baselines are further categorized into iterative prediction models (LSTM, Transformer, Kalman-Filter, FlightBERT, and WTFTP) and direct prediction models (LSTM+Attention, and FlightBERT++). The iterative prediction models perform only one-step prediction in an inference procedure, and the predicted results will serve as pseudo observations to obtain the multi-horizon results iteratively. In contrast, the direct prediction models can generate multi-horizon prediction results through one-pass inference. The description of the baseline models is listed as follows.

  • LSTM: The LSTM network is applied to build the FTP model31, which is typically used for sequence modeling in many tasks.

  • Transformer: A vanilla Transformer architecture is adapted to develop the FTP model32.

  • Kalman-Filter: A typical model-driven flight state estimation algorithm based on historical observations and generally applied in target tracking scenarios. In this work, the Kalman-Filter (KF) is employed to perform the FTP task33.

  • FlightBERT: An FTP framework proposed in our previous work, which innovatively converts the FTP tasks into a multi-binary classification paradigm13.

  • WTFTP: A time-frequency analysis-based FTP framework proposed in our previous work, which has been demonstrated to be robust to high-maneuvering flight scenarios17.

  • LSTM+Attention: A Seq2Seq architecture to predict the flight state of multiple future time steps directly. In this experiment, an Attention-based Encoder-Decoder LSTM network34 is adapted to perform the FTP task.

  • FlightBERT++: As described in Section “Methods”, the FlightBERT++ is a non-autoregressive multi-horizon FTP framework that is also employed as the pre-trained FTP model in the proposed SIA-FTP framework29.
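
The difference between the iterative and direct inference styles described above can be sketched as follows. The two constant-velocity functions are toy stand-ins introduced for illustration only; they are not the paper's models, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]


def predict_iteratively(
    step_model: Callable[[List[Point]], Point],
    observations: List[Point],
    n: int,
) -> List[Point]:
    """Roll a one-step model forward n times.

    Each prediction is appended to the window as a pseudo observation,
    so any error it carries is fed into the next step.
    """
    window = list(observations)
    predictions = []
    for _ in range(n):
        nxt = step_model(window)
        predictions.append(nxt)
        window = window[1:] + [nxt]  # keep the window length fixed
    return predictions


def predict_directly(
    multi_model: Callable[[List[Point], int], List[Point]],
    observations: List[Point],
    n: int,
) -> List[Point]:
    """Produce all n horizons in a single inference pass."""
    return multi_model(list(observations), n)


def constant_velocity_step(window: List[Point]) -> Point:
    """Toy one-step model: extrapolate the last displacement once."""
    last, prev = window[-1], window[-2]
    return [2 * a - b for a, b in zip(last, prev)]


def constant_velocity_multi(window: List[Point], n: int) -> List[Point]:
    """Toy direct model: extrapolate the last displacement n times."""
    last, prev = window[-1], window[-2]
    return [[a + (h + 1) * (a - b) for a, b in zip(last, prev)]
            for h in range(n)]
```

With these noise-free toy models both styles give identical outputs. With a learned one-step model, the feedback of pseudo observations is what allows errors to accumulate over the horizons.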

Experimental Configuration

In the LSTM, Transformer, WTFTP, and LSTM+Attention baselines, z-score normalization is applied to the longitude and latitude attributes due to the sparse nature of their distributions, while the other attributes are normalized into [0,1] through min-max normalization. Additionally, we employ the MSE loss function to train these models on the trajectory subset. For the binary encoding-based baselines, the experimental configurations follow our original works13,29.
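
For reference, the two normalization schemes named above can be sketched in their textbook forms. This is an illustrative sketch, not the authors' preprocessing code; the function names are hypothetical, and the use of the population standard deviation is an assumption. Note that standard z-score scaling centers the data at zero with unit variance rather than bounding it to a fixed interval.

```python
from statistics import mean, pstdev
from typing import List


def z_score(values: List[float]) -> List[float]:
    """Standardize to zero mean and unit variance (population std)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("a constant attribute cannot be standardized")
    return [(v - mu) / sigma for v in values]


def min_max(values: List[float]) -> List[float]:
    """Linearly rescale values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant attribute cannot be rescaled")
    return [(v - lo) / (hi - lo) for v in values]
```

In practice the statistics (mean, standard deviation, minimum, maximum) would typically be computed on the training set and reused for the validation and test sets.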

In the SIA-FTP framework, all the configurations in the trajectory-based FTP pre-training stage are the same as those of FlightBERT++. For the intent-oriented instruction embedding learning stage, Chinese characters and English words serve as the basic units (tokens) for language modeling. A 256-dimensional hidden state is set to pre-train the BERT model. After the unsupervised text representation learning, a Fully Connected (FC) layer with 16 neurons is used as the prediction head to output the probabilities of the intent classes. In the multi-modal FTP finetuning stage, the hidden state dimensions of the linear layers in the MLP are set to 1024 and 512, respectively.
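
To illustrate how a multi-label prediction head can return several intents for one instruction, the sketch below applies an independent sigmoid to each output and keeps every class above a threshold. The sigmoid activation, the 0.5 threshold, and the function names are assumptions; the six label names are those shown in Fig. 3 and are used here only for brevity, whereas the actual head has 16 outputs.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_intents(
    logits: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.5,  # assumed decision threshold
) -> List[str]:
    """Return every intent whose independent probability passes the threshold.

    Unlike softmax classification, each class is scored on its own,
    so zero, one, or several intents can be active at once.
    """
    if len(logits) != len(labels):
        raise ValueError("one logit per label is required")
    return [name for z, name in zip(logits, labels)
            if sigmoid(z) >= threshold]
```

For an instruction combining two intents, a logit vector such as `[-3.0, -2.0, -4.0, 2.5, 1.2, -1.0]` over the labels `ALT_ADJ, SPD_ADJ, OFFSET, CANOFF, FLYTO, HEAD_ADJ` would decode to both `CANOFF` and `FLYTO`.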

In this work, the Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Root of Mean Square Error (RMSE) and Mean Deviation Error (MDE) are applied to evaluate the model performance and baselines. Among them, the MAE, MAPE, and RMSE are the common metrics to evaluate each attribute of trajectory point separately, while the MDE is proposed to measure the distance (Nautical Miles (NM)) of trajectory points between predictions and ground truth in three-dimensional (3D) airspace. In this way, it is believed that the performance of the proposed model and baselines can be comprehensively evaluated. The detailed definition of the abovementioned metrics can be found in Supplementary Information (Section Definition of Evaluation Metrics).
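
The three per-attribute metrics can be sketched in their standard textbook forms as below. The exact definitions used in this work, including that of the MDE, are given in the Supplementary Information and are not reproduced here; the function names are hypothetical.

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean Absolute Error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mape(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs((p - t) / t) for p, t in zip(pred, true)) / len(true)


def rmse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Root Mean Square Error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))
```

Because RMSE squares each residual before averaging, it penalizes occasional large deviations more heavily than MAE does, which is why the two are usually reported together.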

The Adam optimizer with an initial learning rate of \({10}^{-4}\) is applied to train all the above deep learning-based models. In this work, all the experiments are implemented with the open-source deep learning framework PyTorch 1.9.0. The models are trained on a server configured with the Ubuntu 16.04 operating system, 8 × NVIDIA GeForce RTX 2080 GPUs, an Intel(R) Core(TM) i7-7820X@3.6GHz CPU, and 128 GB memory.

Results and Quantitative Analysis

The experimental results are reported in Table 1. In general, thanks to the incorporation of maneuvering ATC instructions, the proposed SIA-FTP framework achieves the desired performance in terms of all four evaluation metrics for all prediction horizons. In the 1st prediction horizon, the baseline FlightBERT++ achieves comparable performance to the proposed SIA-FTP in terms of the MAE of the Lon and Lat dimensions. This can be attributed to two facts: in the first prediction horizon, the instruction-driven factors have only a minor influence on the flight trajectory, and FlightBERT++ is the trajectory representation learning model of the SIA-FTP, so it obtains comparable results on some indicators. To be specific, by analyzing the samples from the dataset, it is found that some flights might not have executed the instructions in the 1st prediction horizon (20 seconds) due to the operational habits of the pilot.

Table 1 The experimental results of the proposed framework and baselines

As can be seen, the proposed SIA-FTP framework is superior to all the baselines among the 4 evaluation metrics in the multi-horizon prediction results, even obtaining over 20% MDE reduction in the 9- and 15-horizons compared to the FlightBERT++. In general, thanks to the non-autoregressive prediction mechanism of the FlightBERT++, the SIA-FTP achieves higher multi-horizon prediction performance. Most importantly, the SIA-FTP is able to be aware of the controlling intent by integrating the spoken instruction into the FTP model explicitly. For the competitive data-driven baseline WTFTP approach, as demonstrated in17, thanks to the capability of time-frequency analysis and in-depth feature extraction, it can effectively capture microscopic maneuvering patterns of trajectory implicitly and obtain comparable performance against the proposed SIA-FTP framework within 3-horizons. However, as the prediction horizon increases and the flight enters the macroscopic maneuvering phase, it is difficult to estimate the flight intent implicitly in longer horizons, which results in performance degradation in multi-horizon predictions. The above experimental results demonstrate that, by considering the real-time spoken instructions, the proposed SIA-FTP framework is a promising solution to improve the prediction precision for FTP tasks in the ATC domain, especially for the instruction-driven high macroscopic maneuvering scenes.
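
The relative MDE reduction quoted above can be read, assuming the usual definition of relative error reduction with respect to a baseline (the symbols below are shorthand introduced here, not notation from this paper), as:

$${{\rm{Reduction}}}=\frac{{{\rm{MDE}}}_{{\rm{base}}}-{{\rm{MDE}}}_{{\rm{SIA}}}}{{{\rm{MDE}}}_{{\rm{base}}}}\times 100\%$$

Under this reading, a purely illustrative drop from 1.0 NM to 0.8 NM would correspond to a 20% reduction.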

Among the selected baselines, it can be found that the iterative multi-horizon prediction approaches are prone to error accumulation as the number of prediction steps increases. Taking the Kalman-Filter approach as an example, it has no real observations with which to update the parameters of the system equation during the multi-horizon prediction process, leading to a significant degradation in performance. In contrast, benefiting from the multi-horizon prediction strategy, the direct multi-horizon approaches facilitate capturing the global trend and are superior to the iterative approaches. It can be seen from the results that, except for the proposed SIA-FTP framework, the FlightBERT++ achieves superior performance over other baselines. This also supports our motivation to employ FlightBERT++ for trajectory-based FTP pre-training in the proposed SIA-FTP framework.
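
The effect of running a Kalman filter without measurement updates can be illustrated with a scalar toy model. The sketch below assumes a one-dimensional random-walk state with process noise variance q and measurement noise variance r; it is not the configuration of the KF baseline used in the experiments, and the function name is hypothetical.

```python
from typing import List


def kalman_variance_trace(
    steps: int, q: float, r: float, p0: float, with_updates: bool
) -> List[float]:
    """Track the error variance P of a scalar random-walk Kalman filter.

    Predict:  P <- P + q
    Update:   K = P / (P + r);  P <- (1 - K) * P
    The update is applied only when a real measurement is available.
    """
    p = p0
    trace = []
    for _ in range(steps):
        p = p + q  # prediction always inflates the uncertainty
        if with_updates:
            gain = p / (p + r)
            p = (1.0 - gain) * p  # a measurement shrinks it again
        trace.append(p)
    return trace
```

With `q = r = p0 = 1`, the predict-only trace grows linearly (2, 3, 4, ...), whereas the trace with updates stays bounded below 1. This mirrors the multi-horizon setting, where future steps have no real observations to correct the estimate.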

In addition, Fig. 2 illustrates the error distribution of the FlightBERT++ and the proposed SIA-FTP framework across the 1st, 3rd, 9th and 15th prediction horizons. Fig. 2a presents the absolute error for longitude, latitude, and altitude attributes. It is evident from the box plots that the proposed SIA-FTP framework consistently achieves lower absolute errors across all horizons compared to FlightBERT++. Notably, the proposed SIA-FTP achieves larger error reductions at longer horizons (9 and 15), where the ATC instructions exert a stronger driving influence on flight patterns. The distribution of deviation error across different prediction horizons is reported in Fig. 2b. The results show that SIA-FTP maintains a lower deviation error across all horizons, further reinforcing the effectiveness of incorporating ATC instructions. Furthermore, the error distribution based on absolute percentage error and squared error, as shown in Supplementary Fig. 1, further corroborates the above conclusions. Additionally, the statistical analysis across four metrics also confirms that SIA-FTP significantly outperforms FlightBERT++ in predicting flight trajectories over longer horizons, which can be found in Supplementary Information (Supplementary Table 1). The significant error reduction achieved by SIA-FTP confirms its potential for enhancing the performance of FTP in instruction-driven maneuvering flight scenarios.

Fig. 2: Error distribution of the FlightBERT++ and the proposed SIA-FTP.

a The distribution of absolute error for Longitude, Latitude, and Altitude attributes across different prediction horizons. b The distribution of deviation error across different prediction horizons. Boxplots of a and b show the median (center line), and 1st and 3rd quartiles (Q1 and Q3, respectively). The error bars correspond to the Q1-(1.5*IQR) and Q3 + (1.5*IQR) range (IQR = Inter-Quartile Range). Data points below Q1 - (1.5*IQR) or above Q3 + (1.5*IQR) are considered outliers and not shown in the boxplots. Source data are provided as a Source Data file.
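
The whisker rule stated in the caption can be sketched as follows. The quartile interpolation convention (`method="inclusive"`) is an assumption, since plotting libraries differ in how they compute Q1 and Q3; the function names are hypothetical.

```python
from statistics import quantiles
from typing import List, Sequence, Tuple


def whisker_bounds(errors: Sequence[float]) -> Tuple[float, float]:
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for a sample of errors."""
    q1, _, q3 = quantiles(errors, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def split_outliers(errors: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Separate values inside the whisker range from the outliers."""
    lo, hi = whisker_bounds(errors)
    inside = [e for e in errors if lo <= e <= hi]
    outliers = [e for e in errors if e < lo or e > hi]
    return inside, outliers
```

For the sample `[1, 2, 3, 4, 5, 100]`, the quartiles are 2.25 and 4.75, so the whisker range is [-1.5, 8.5] and only the value 100 is treated as an outlier.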

Visualization and Qualitative Analysis

To vividly demonstrate the effectiveness of the SIA-FTP framework across various maneuvering intents, in this section, a total of six representative flight trajectories with ATC instructions are selected to visualize the predictive results of each model. The prediction results of the proposed framework and baselines are depicted in Fig. 3, in which the blue lines indicate the model inputs (i.e., observations of 9 trajectory points), and the altitude is measured in units of 10 meters. It is important to note that certain inaccurate predictions made by baselines are removed from Fig. 3 to enhance readability.

Fig. 3: Visualization of the trajectory prediction results with different maneuvering controlling intents.

a–b ALT_ADJ. c SPD_ADJ. d OFFSET. e CANOFF & FLYTO. f HEAD_ADJ & FLYTO. The ATC instructions are presented in the chatbox, where the colored text indicates the keywords of the controlling intents. Source data are provided as a Source Data file.

As shown in Fig. 3, the visualization results show that the proposed SIA-FTP method performs consistent predictions under the instruction-driven maneuvering scenarios for all six controlling intents in a multi-horizon manner. Fig. 3a, b illustrate the prediction results of two different ALT_ADJ (altitude adjustment) ATC instructions, covering the descending and climbing flight phases, respectively. It can be seen from the visualizations that almost all baselines fail to predict the flight intents in future horizons when the flights are under complex ATC instructions. As demonstrated in our previous work29, the FlightBERT++ model has a certain capability to predict future flight intent, which benefits from learning the flight transition patterns from large volumes of historical trajectory data. As stated in the WTFTP, the multi-horizon (over 9 steps) prediction is a limitation to improving the FTP performance, which also supports the multi-horizon prediction mechanism in this work. For instance, in Fig. 3a, b, the FlightBERT++ implicitly predicts the climbing/descending action to support the FTP task. However, FlightBERT++ falls short of delivering the desired prediction results because it cannot explicitly consider the detailed controlling intent from real-time ATC instructions. For instance, in Fig. 3b, the flight trajectory predicted by FlightBERT++ only reaches an altitude of ~10,100 meters, rather than the target altitude of 10,400 meters specified in the ATC instruction. In contrast, thanks to the explicit fusion of the spoken instruction, the proposed SIA-FTP framework not only provides timely aircraft maneuvering predictions but also exhibits detailed awareness of the “maintain” intent from ATC instructions for future horizons after “climbing/descending” to the target altitude.

Similarly, the FlightBERT++ accurately predicts the flight intent of “OFFSET” in Fig. 3d but fails to exactly estimate the number of miles for the offset maneuver. In practice, the specific parameters of the controlling intents (such as altitude, speed, and miles of the offset) are typically determined by ATCos according to the real-time ATC situations. Thus, it is difficult for the trajectory-based FTP models to learn these patterns from the historical data and provide accurate prediction results, which is why we consider the explicit fusion of the controlling intents and required parameters.

In addition, the prediction results presented in Fig. 3e, f demonstrate that: i) the proposed SIA-FTP framework performs accurate predictions even with multiple controlling intents in an ATC instruction; ii) external airspace knowledge, such as the position of waypoints (IRVED, SUMUN), and projections of the direction (turn right/left), is learned from historical trajectory samples in the multi-modal FTP finetuning process. The awareness ability for multiple intents in the SIA-FTP framework can be attributed to the multi-label classification design of the IID model, enabling the model to extract comprehensive and informative features from ATC instructions.

In summary, the visualization results demonstrate that the proposed SIA-FTP learns expected flight transition patterns in instruction-driven maneuvering flight phases, which also supports our motivations for explicitly integrating the spoken ATC instructions into the FTP procedures.

Ablation Study

To further validate the advantages of the 3-stage progressive multi-modal learning paradigm in the proposed SIA-FTP framework, two further ablation studies are designed in this section as follows.

  • A1: To validate the effectiveness of the IID finetuning (Stage 2-2), in this experiment, we directly utilize the output of the BERT pre-training after Stage 2-1 as the sentence-level intent embeddings \({{\rm{SI}}}_{{\rm{emb}}}\), while other settings remain the same as in the original SIA-FTP framework. In other words, instead of finetuning with the IID model in Stage 2-2, only Stage 2-1 is conducted to learn the instruction embeddings in the intent-oriented instruction embedding learning phase.

  • A2: To validate the effectiveness of the intent-oriented instruction embedding learning, Stage 2 is fully removed from the SIA-FTP in this experiment. Specifically, after the trajectory-based FTP pre-training, the BERT model is employed as the instruction encoder to build the multi-modal SIA-FTP model, and training proceeds directly to Stage 3 with paired trajectory-instruction data.

The experimental results of the ablation studies are presented in Table 2. Compared to the original SIA-FTP framework, the performance of both A1 and A2 is degraded, which validates the effectiveness of the proposed training strategies in Stage 2. Nevertheless, thanks to the fusion of the intent embeddings, their performance still surpasses that of the baseline models, which again supports the motivation of this work.

Table 2 The experimental results of the ablation study

To be specific, the A2 model achieves higher performance than the baseline models even without the intent-oriented instruction embedding learning stage. These results further demonstrate that integrating the ATC instruction is a promising solution to improve the performance of FTP tasks in instruction-driven maneuvering flight scenarios by explicitly considering the driving factors (controlling intents). In addition, it can also be observed from the experimental results (A2 vs. A1) that directly integrating the instruction embedding into SIA-FTP without any pre-training (A2) outperforms using the instruction embedding obtained from the pre-trained BERT (A1). This can be attributed to the difficulty of learning intent-specific embeddings from universal textual representations without further fine-grained optimization through the IID task (Stage 2-2).

Furthermore, we also consider the MDE variations over 15 prediction horizons on the validation set across training epochs, as shown in Supplementary Fig. 2. The results indicate that the intent-oriented instruction embedding learning stage can effectively enhance the fusion process between trajectories and textual instructions. Consequently, the proposed SIA-FTP framework converges significantly faster during the early epochs and achieves superior performance compared to the A1 and A2 models, which also demonstrates the advantages of the proposed 3-stage progressive learning paradigm.

Interpretability Study

Analysis of the learned instruction embeddings. To further investigate the internal learning mechanisms in different stages, the t-Distributed Stochastic Neighbor Embedding (t-SNE) tool is applied in this section to project the learned high-dimensional sentence-level instruction embeddings SIemb of the test set into 2D space for visualization. In general, a well-trained IID model should be able to clearly distinguish the controlling intents of ATC instructions, i.e., instructions with the same controlling intent should form a compact cluster in the embedding space, with a clear gap from the clusters of other intents. Specifically, the instruction embeddings SIemb generated by the BERT-based pre-training model (Stage 2-1), the IID model (Stage 2-2), the SIA-FTP model (Stage 1 to Stage 3), and the models of ablation experiments A1 and A2 are illustrated in Fig. 4.

Fig. 4: Visualization of the text instruction embedding.

a–e Visualization of the text instruction embedding with different maneuvering intents via the t-SNE tool using different models. a BERT model (Stage 2-1). b IID model (Stage 2-2). c SIA-FTP model (Stage 1 to Stage 3). d A1 model (SIA-FTP without stage 2-2). e A2 model (SIA-FTP without stage 2). f, g Visualization of the text instruction embedding with CLIMB and DESCEND intents via the t-SNE tool using different models. f IID model (Stage 2-2). g SIA-FTP model (Stage 1 to Stage 3). h, i Visualization of the text instruction embedding with different descending intents via the t-SNE tool using different models. h IID model (Stage 2-2). i SIA-FTP model (Stage 1 to Stage 3). Source data are provided as a Source Data file.

As shown in Fig. 4a, the BERT-based pre-trained model can only roughly distinguish general textual embeddings for different flight instructions and intents within the embedding space. This can be attributed to the fact that ATC instructions with the same controlling intents usually contain the same keywords and named entities, which leads the pre-trained model to extract similar abstract embeddings. For instance, an ATC instruction with an altitude adjustment intent must contain the word “climbing” or “descending”, as well as the flight level (e.g., eight thousand four hundred meters). With the finetuning of the IID task in Stage 2-2, the instruction embeddings within the same controlling intent are clustered more compactly in the embedding space (Fig. 4b). This demonstrates that the intent-oriented instruction embedding finetuning stage encourages the model to extract intent-related representations rather than only general text embeddings. Furthermore, as shown in Fig. 4c, the embeddings of ATC instructions form highly compact clusters in the embedding space after the multi-modal FTP finetuning of Stage 3, i.e., the distances between samples within the same class become smaller, while the distances between classes are enlarged.

It is also noted that, as the training stages progress, samples with semantically similar intents, such as FLYTO and CANOFF&FLYTO, or HEAD_ADJ&FLYTO, form increasingly close embedding clusters in the feature space (Fig. 4c). This demonstrates that the model learns the semantic consistency (both kinematic and ATC instruction semantics) from the trajectory-instruction pairs during the multi-modal FTP finetuning, which further improves its ability to extract informative intent embeddings.

In this section, the embeddings of the ATC instructions from the A1 and A2 models are also visualized in Fig. 4d, e, respectively. Compared to the original SIA-FTP model (Fig. 4c), although the embeddings generated by the A1 and A2 models effectively distinguish samples in the embedding space among different intent classes, their embedding distributions lack compactness within each intent class. Therefore, it can be concluded that the above visualizations further validate the effectiveness and advantages of the proposed 3-stage progressive learning paradigm.

Insightful analysis. As observed from the visualization of trajectory prediction results in Section Visualization and Qualitative Analysis, the SIA-FTP model not only captures controlling intent but also attends to specific intent parameters from the ATC instructions, such as flight levels of the ALT_ADJ intent and the miles of the OFFSET intent. To further examine the ability of the SIA-FTP model to distinguish specific instruction parameters within the same intent class, the instructions with ALT_ADJ intent on the test set are further organized into two groups to perform the visualizations, i.e., climbing and descending.

The embeddings of the ALT_ADJ instructions generated by the IID model and the SIA-FTP model are visualized in Fig. 4f and Fig. 4g, respectively. Compared to the IID model, the SIA-FTP can further learn altitude-specific features that separate climbing from descending instructions through multi-modal FTP finetuning with trajectory-instruction pairs.

Similarly, to further examine the capability of the model to consider instruction parameters, the samples involving the common flight levels, including “descending 8400" and “descending 8900", within the descending instructions are visualized in Fig. 4h–i. It can be observed that the IID model fails to distinguish the samples clearly with different instruction parameters within the same class in the embedding space, since the task objective of the IID model is only to identify the intent of instruction. However, after the multi-modal FTP finetuning of stage 3, the SIA-FTP is able to distinguish the instructions with different parameters (flight level) in the embedding space, which provides interpretability for the higher performance of the proposed SIA-FTP framework.

In this section, we also validate the attention weights of the tokens in ATC instructions during the multi-modal FTP prediction process. Fig. 5 shows the token representations HSI of 3 selected samples output by the SIA-FTP model. Specifically, HSI is summed along the feature dimensions and quantifies the importance of tokens in ATC instruction. The darker colors represent larger attention weights, indicating the higher importance of tokens during the fusion process. It can be observed that the model not only focuses on intent-related keywords of the instructions but also assigns high weights to critical parameters to guide the model in predicting the flight trajectory toward target positional parameters (e.g., flight level, offset, waypoint, etc.).
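The token weighting described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `token_importance`, the min-max scaling used to map scores to color intensity, and the toy feature values are hypothetical and are not taken from the SIA-FTP implementation.

```python
def token_importance(h_si):
    """Score each token by summing its feature vector along the feature dimension.

    h_si: list of l token vectors, each a list of D floats (a stand-in for H_SI).
    Returns one score per token, min-max scaled to [0, 1]; the scaling is an
    assumption made here so that scores can be mapped to color intensity.
    """
    raw = [sum(vec) for vec in h_si]
    lo, hi = min(raw), max(raw)
    if hi == lo:  # all tokens carry the same weight
        return [0.0 for _ in raw]
    return [(r - lo) / (hi - lo) for r in raw]


# Toy token features for "turn right direct to sumun" (values are made up).
tokens = ["turn", "right", "direct", "to", "sumun"]
h_si = [[0.2, 0.1], [0.3, 0.2], [0.6, 0.5], [0.1, 0.1], [0.9, 0.8]]
weights = token_importance(h_si)
ranked = sorted(zip(tokens, weights), key=lambda pair: pair[1], reverse=True)
```

In this toy case the waypoint token receives the largest weight and the function word "to" the smallest, which mirrors the qualitative pattern reported for Fig. 5.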

Fig. 5: The weights of the token in instruction embedding output by SIA-FTP framework.

The red fonts represent the keywords indicating intents, while the fonts with blue colors denote the critical parameters. Source data are provided as a Source Data file.

In summary, the above visualizations demonstrate that the proposed SIA-FTP has the ability to perceive the controlling intent and required parameters by integrating the textual ATC instructions. Moreover, the proposed 3-stage progressive learning paradigm is an effective way to fill the modality gap between the trajectory and textual instructions under the conditions of limited paired data. It is believed that the proposed SIA-FTP framework not only provides a solution for instruction-driven maneuvering ATC scenarios but also develops a powerful FTP tool to detect the potential risks of human factors in the ATC procedure.

Generalization study

Integrating ATC instruction into LSTM+Attention. To validate the generalizability of integrating spoken instructions into the FTP procedure and of the proposed 3-stage progressive learning paradigm, the LSTM+Attention model is used in place of FlightBERT++ to build the SIA-FTP framework. Specifically, the LSTM+Attention baseline model is selected to perform the trajectory-based FTP pre-training in stage 1. As in stage 2 of the original SIA-FTP framework, the pre-trained IID model is used to extract intent-oriented instruction embeddings. In the multi-modal FTP finetuning stage, the instruction embedding is integrated into the LSTM+Attention model by the proposed multi-modal fusion mechanism during the step-by-step decoding process. The implementation details can be found in Supplementary Information (Section Implementation Details of the SIA-FTP Framework using LSTM+Attention).

The experimental results are reported in Fig. 6a. The LSTM+Attention model also achieves significant performance improvements by integrating the intent of spoken instructions using the proposed 3-stage training paradigm. Compared with the LSTM+Attention baseline, the SIA-FTP by LSTM+Attention achieves a 20.25% relative MDE reduction in 15 prediction horizons (5 minutes). Moreover, by integrating the spoken instruction, the SIA-FTP by LSTM+Attention also outperforms FlightBERT++ at the 9 (1.03 MDE) and 15 (1.55 MDE) prediction horizons, which further confirms the effectiveness of considering the controlling intent for the multi-step FTP task. However, thanks to the powerful learning ability of its well-designed model architecture, the SIA-FTP by FlightBERT++ achieves higher prediction performance, which also validates our selection of the backbone network in this work. In summary, the experimental results of the generalization study not only demonstrate the effectiveness of integrating the intent of spoken instructions into FTP procedures and of the 3-stage progressive learning paradigm but also confirm the effectiveness and efficiency of the selected backbone network.
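For readers who want to reproduce such comparisons, a relative reduction is presumably computed as the MDE gap divided by the baseline MDE. The helper below is a minimal sketch under that assumption; the function name and the example values are hypothetical and are not taken from the paper's tables.

```python
def relative_reduction(baseline_mde, model_mde):
    """Relative MDE reduction of a model with respect to a baseline, in percent."""
    if baseline_mde <= 0:
        raise ValueError("baseline_mde must be positive")
    return 100.0 * (baseline_mde - model_mde) / baseline_mde


# Hypothetical values chosen only to illustrate the arithmetic.
example = relative_reduction(2.0, 1.5)  # a drop from 2.0 to 1.5 is a 25% reduction
```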

Fig. 6: MDE comparison over the 1, 3, 9, and 15 prediction horizons for generalization studies.

a The MDE results of LSTM+Attention and SIA-FTP using LSTM+Attention in stage 1. b The MDE results of the SIA-FTP, SIA-FTP using LSTM in stage 2, and SIA-FTP using Transformer in stage 2. c The MDE results of the SIA-FTP and SIA-FTP cascaded with an ASR system. Source data are provided as a Source Data file.

Incorporating the ATC instruction into SIA-FTP using other language model architectures. Similarly, to further validate the effectiveness of the proposed SIA-FTP framework, the LSTM and Transformer architectures are also used to build the IID model in stage 2. The pre-trained LSTM- or Transformer-based IID model then replaces BERT to build the multi-modal FTP model and extract the intent-oriented instruction embeddings in stage 3. The implementation details and the experimental results of these IID models can be found in Supplementary Information (Evaluation of the Proposed IID Model).

The comparison of MDE over 1, 3, 9, and 15 prediction horizons among the SIA-FTP variants using different language model architectures is shown in Fig. 6b. The proposed SIA-FTP framework obtains the expected performance improvement even when incorporating other types of language model architectures in stages 2 and 3. Furthermore, compared to the best baseline (FlightBERT++), the LSTM- and Transformer-based SIA-FTP frameworks achieve superior performance, obtaining about 13.5% and 11.6% MDE reduction over 15 prediction horizons, respectively. The results also confirm our primary motivation and contribution, i.e., integrating ATC instructions into the FTP model is a promising way to enhance prediction performance in instruction-driven maneuvering flight scenarios. Additionally, compared to the BERT-based version, the SIA-FTP framework using LSTM or Transformer shows a slight performance reduction. This can be attributed to the better training strategy and network architecture of the BERT model, which facilitate learning informative ATC instruction embeddings.

Evaluating the proposed SIA-FTP with ASR. As described in Section Introduction, in practice, the ASR serves as the front-end component of the SIA-FTP framework, which is employed to translate speech instructions into textual instruction. To validate the applicability of the proposed SIA-FTP framework in real-world scenarios, we cascade an ASR model with the SIA-FTP to evaluate the system-level performance. Specifically, an ASR method based on Wav2vec 2.0 from our previous work28 is employed as the front-end component, which is also trained on the M2ATS dataset. During the testing phase, instead of the manually annotated textual ATC instructions, the predicted text of the ASR model for the test set is used as the input for SIA-FTP.

The ASR model achieves a 1.39% character error rate (CER) on the test set, demonstrating considerable performance in the ATC domain. The FTP results are reported in Fig. 6c. It can be seen from the results that the SIA-FTP achieves comparable performance with only a slight reduction when cascaded with an ASR model in the front, compared to using golden annotations as input. It is observed that the ASR performance is critical to the overall effectiveness of the SIA-FTP framework, in which the misrecognition of intent-related keywords by the ASR can impact the semantics of the extracted intent-oriented instruction embeddings. Consequently, cascading SIA-FTP with a high-performance ASR is essential for ensuring reliable application in real-world scenarios. Fortunately, thanks to advancements in ASR techniques in the ATC domain, the proposed SIA-FTP framework has great potential for real-world ATC applications.
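To make the CER figure and its effect on instruction parameters concrete, the sketch below computes the character error rate in its standard form, i.e., the character-level edit distance divided by the reference length. The text normalization used for scoring in this work is not stated here, so none is applied, and the example strings are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One misrecognized digit: a low CER, but a different flight level.
reference = "descend to 8400"
hypothesis = "descend to 8900"
error_rate = cer(reference, hypothesis)  # 1 substitution over 15 characters
```

The example shows why a small CER does not by itself guarantee correct intent parameters: a single substituted character can change the target flight level.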

Discussion

In this work, an instruction-driven FTP paradigm is proposed to incorporate spoken instructions into the ATC automation process, which provides a promising solution to detect the potential risks caused by human factors in real-time ATC operations. To this end, a SIA-FTP framework is proposed to consider the spoken instruction in the FTP procedure and implement a multi-modal FTP model within the controlling instruction-driven maneuvering flight phase. To address the modality gap between the textual spoken instructions and flight trajectory, we decompose the multi-modal learning of FTP tasks into 3 stages, including trajectory-based FTP pre-training, intent-oriented instruction embedding learning, and multi-modal FTP finetuning. The joint model is optimized with the limited trajectory-instruction pairs to further learn the flight transition patterns under instruction-driven maneuvering flight scenarios. The effectiveness of the proposed framework is demonstrated by extensive experiments based on a real-world dataset, and all the proposed strategies contribute to desired performance improvements. The multi-horizon prediction task is achieved with considerable performance. Most importantly, the intent and required position-guided parameters can be accurately perceived to enhance the FTP task. The fusion process between trajectory and text instruction is intuitively understood by extensive visualization results.

Although the SIA-FTP framework demonstrates significant performance improvements over comparative baselines, Fig. 4c shows that a few samples are difficult for the SIA-FTP framework to distinguish in the embedding space. In practice, intent misclassification of ATC instructions may increase the prediction errors of the proposed SIA-FTP framework. An in-depth analysis of the samples in the dataset shows that the challenging or misclassified samples in Fig. 4c can be categorized into two primary classes:

  • Fig. 4c shows that some instruction embeddings with SPD_ADJ intent are in close proximity to those with ALT_ADJ intents in feature space. This can be attributed to SPD_ADJ intents often being issued during the execution of ALT_ADJ intents. In the 3rd stage of the proposed SIA-FTP framework, the multi-modal FTP model learns both kinematic and controlling intent semantics from real-time observed trajectories and ATC instructions. Consequently, the intent-oriented instruction embeddings are fine-tuned by kinematic semantics during stage three, resulting in the proximity of ALT_ADJ and SPD_ADJ instruction embeddings in feature space. This confirms that the proposed 3-stage SIA-FTP framework can capture informative features from both real-time trajectory observations and ATC instructions, thereby enhancing FTP performance in instruction-driven maneuvering flight processes. Additionally, Supplementary Fig. 3a visualizes a representative sample where an ATCo issues a speed adjustment instruction during the climbing phase. In this case, the instruction embedding extracted by the SIA-FTP framework is clustered close to the ALT_ADJ intents in feature space, but the SIA-FTP still achieves the expected prediction results due to the learned informative features.

  • The ATC instructions with multiple intents are not distinctly separated from single-intent instructions in Fig. 4c. Analysis shows that when multiple intents are present in an ATC instruction, the SIA-FTP framework tends to focus on the “primary” intent. For instance, for instructions containing CANOFF&FLYTO (cancel offset and direct to) and HEAD_ADJ&FLYTO (heading adjustment and direct to) intents, the SIA-FTP framework typically pays more attention to the FLYTO intent since it is the main goal of the ATC instruction, with CANOFF and HEAD_ADJ being accompanying actions during the execution of FLYTO. This observation is also confirmed by the token weights in the 3rd sample of Fig. 5, where the token weights of the FLYTO intent (“direct to sumun”) are higher than those of the HEAD_ADJ intent (“turn right”). Supplementary Fig. 3b visualizes a representative sample with multiple intents that are not correctly classified in Fig. 4c to investigate the prediction errors of the SIA-FTP framework. It shows that the SIA-FTP mainly focuses on the FLYTO intent with marginal consideration of the CANOFF intent (cancel offset). Although the SIA-FTP model is delayed in predicting the cancellation of the offset, it still achieves comparable prediction performance (1.46 MDE). Most importantly, the predicted trajectory also approaches the waypoint “sumun” by considering the “primary” FLYTO intent, which is the prominent advantage of the instruction-driven approach over the conventional data-driven FTP model.

In summary, we present a perspective to consider human factors in real-time ATC operations by developing an instruction-driven FTP model and propose a solution to fuse the spatial-temporal trajectory data and textual instructions under limited data scale. It is believed that the proposed SIA-FTP framework can not only empower modern ATC systems to enhance aviation safety but also provide technical insights for multi-modal ATC data processing.

In the future, we will focus on exploring an efficient fusion of spoken instruction and flight trajectory. In addition, extracting more informative and detailed maneuvering intents from the spoken instructions is also an interesting research topic.

Methods

Overview of the proposed framework

The architecture of the proposed SIA-FTP framework is illustrated in Fig. 7. Specifically, the proposed framework decomposes the learning procedure of the FTP tasks into three stages to address the modality gap and low-resource problem of trajectory-instruction pairs, including trajectory-based FTP pre-training, intent-oriented instruction embedding learning, and multi-modal FTP finetuning.

Fig. 7: Overview of the proposed SIA-FTP framework.

Fdiff represents the differential operator that is applied to obtain the differential sequence of the observation. HACG: horizon-aware context generator. Concat: concatenation operation. GELU: activation function. MLP: multi-layer perceptron.

In the trajectory-based FTP pre-training stage, the FlightBERT++ is employed to train a data-driven FTP model based on motion patterns of the trajectory sequence (without considering controlling intents), which aims to learn the typical spatial-temporal flight transition patterns by leveraging historical flight trajectory data. In this stage, the robust spatial representations of trajectory points and generalized flight motion patterns are expected to be learned through the training process, thereby establishing a robust FTP foundation model and ensuring effective generalization for common flight scenarios. Although a generalized FTP model can be obtained in this stage, it may not be robust enough to handle instruction-driven maneuvering flight scenarios.

For the intent-oriented instruction embedding learning stage, a multi-label intent identification (IID) model is proposed to learn the discriminative features of spoken instruction within the different controlling intents. To this end, a BERT-based neural architecture is designed to perform intent identification, whose effectiveness has been demonstrated in many Natural Language Processing (NLP) tasks. The primary purpose of this stage is to project the textual ATC instructions into a shared embedding space (stage 2-1) and extract informative instruction embeddings that capture the strong semantics of the controlling intents (stage 2-2), thereby laying a foundation for bridging the modality gap between textual instructions and trajectory data.

In the multi-modal FTP finetuning stage, a multi-modal fusion mechanism is designed to incorporate the instruction embeddings into the pre-trained FTP model and to combine the pre-trained FTP and IID networks into a joint model. This mechanism bridges the data flow between the spoken instruction and trajectory modalities, which enables us to finetune the joint model using the trajectory-instruction pairs. By combining information from both modalities, this stage ensures that the multi-modal FTP model can learn both kinematic and controlling intent semantics from real-time observed trajectories and ATC instructions, enabling FTP for instruction-driven high-maneuvering flight scenarios.

By decomposing the learning procedures, the model parameters are mainly updated in the separate training stage and jointly optimized in the multi-modal FTP finetuning stage. On the one hand, we can fully leverage the well-resourced unimodal data to learn flight transition patterns and instruction embeddings separately. On the other hand, the trajectory and spoken instruction with different modalities are projected into the latent feature space in the 1st and 2nd stages, and further fused into a unified feature space in the 3rd stage. Compared to fusing the multi-modal data directly, it is believed that the proposed model and training strategy can easily learn the consistent semantics between trajectories and spoken instructions in the latent feature space. In this way, it is expected to develop a multi-modal FTP model that is robust to instruction-driven high-maneuvering scenarios even with limited trajectory-instruction pairs. A detailed description of each stage is provided in the following sections.

Trajectory-based FTP pre-training

In general, once the ATCo and aircrew reach a consensus on the ATC instruction, the pilot will proceed to maneuver the aircraft following the controlling intent of the instruction. The maneuvering process requires sustained execution from the initial operation to completion, typically taking 1-5 minutes. Therefore, in this work, the instruction-driven FTP is a multi-horizon forecasting task to predict the flight trajectory of the entire instruction execution process. In this context, the SIA-FTP task requires that the FTP model fit the complicated patterns in a data-driven manner, for both the trajectory evolution patterns and the maneuvering operations resulting from the controlling instructions. Considering that FlightBERT++29 achieves superior performance in both precision and efficiency among multi-horizon FTP approaches, FlightBERT++ serves as the prediction framework for the instruction-driven FTP task. In this section, a brief review of FlightBERT++ is provided to describe the model architecture.

Specifically, the FlightBERT++ is composed of three modules, i.e., trajectory encoder (TE), horizon-aware context generator (HACG), and differential-prompted decoder (DPD). The goal of the TE module is to project the observation trajectory into a trajectory-level high-dimensional representation. In succession, the HACG module is designed to generate multi-horizon context embeddings based on the high-dimensional representation and the prior horizon information \(horizon=\{{h}_{1},{h}_{2},\ldots,{h}_{n-1},{h}_{n}\}\). These context embeddings and differential embeddings generated by the Conv1D-based differential embedding module (DEM) are jointly fed into the DPD module to predict the differential sequence of the future time steps. Finally, the Binary Cross Entropy (BCE) loss function is applied to optimize the FTP model in the training process.
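The idea of working on differential sequences can be illustrated with a small sketch. It assumes first-order differences between consecutive trajectory points and reconstruction of future points by accumulating predicted differences onto the last observation. The function names and integer toy attributes are hypothetical; the actual FlightBERT++ operates on binary-encoded attributes, as detailed in ref. 29.

```python
def f_diff(obs):
    """First-order difference of a trajectory, computed per attribute.

    obs: list of points, each a list of attribute values.
    Returns len(obs) - 1 difference vectors, d[t] = obs[t + 1] - obs[t].
    """
    return [[b - a for a, b in zip(p0, p1)] for p0, p1 in zip(obs, obs[1:])]


def reconstruct(last_point, pred_diffs):
    """Accumulate predicted differences onto the last observed point."""
    traj, current = [], list(last_point)
    for d in pred_diffs:
        current = [c + x for c, x in zip(current, d)]
        traj.append(current)
    return traj


# Toy observation with three attributes per point (values are made up).
obs = [[100, 200, 8400], [101, 202, 8300], [102, 204, 8200]]
diffs = f_diff(obs)
# Pretend the decoder predicted the same differences for the next two steps.
future = reconstruct(obs[-1], diffs)
```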

The FlightBERT++ inherits the powerful representation capacity of the binary encoding (BE) representation proposed in ref. 13 and introduces a series of technical improvements to address the limitations of the BE representation, including high-bit prediction errors, error accumulation, and inefficient computation. The effectiveness of FlightBERT++ has been demonstrated in extensive diagnostic experiments, wherein even without external information, it can achieve high-confidence predictions in certain maneuvering scenarios. Therefore, we employ the FlightBERT++ architecture to conduct trajectory-based FTP pre-training and expect to achieve more precise and fine-grained predictions by incorporating spoken instructions. More details of FlightBERT++ can be found in ref. 29.

Intent-oriented instruction embedding learning

As described in Section Introduction, the most critical driving factor of the flight status transitions in the ATC procedures is the spoken instruction with maneuvering intents. An intuitive idea is to extract the informative features that indicate the controlling intent from maneuvering instructions and integrate them into the FTP model, thereby improving the prediction precision.

To this end, the spoken instructions are manually divided into 16 categories according to the requirements of ATC work. Furthermore, 6 maneuvering instructions that significantly impact flight status are used to build the SIA-FTP framework, as listed in Table 3. It is essential to first identify the maneuvering instruction and then extract the discriminative features to support the FTP tasks. In real-world ATC work, an instruction usually carries more than one controlling intent, which makes it challenging to capture the discriminative intent features. For instance, instruction No. 6 in Table 3 indicates a left turn followed by a direct-to toward ubdid (a waypoint).

Table 3 The descriptions of selected 6 categories maneuvering instructions
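Because one instruction can carry several intents, the classification targets are naturally multi-hot rather than one-hot. The sketch below shows one possible encoding. The intent names are those that appear in this paper's text, but their ordering, the `&` separator convention, and the function name are assumptions for illustration and may differ from the categories in Table 3.

```python
# Intent names as they appear in the text; the ordering is arbitrary here.
INTENTS = ["ALT_ADJ", "SPD_ADJ", "HEAD_ADJ", "OFFSET", "CANOFF", "FLYTO"]


def multi_hot(label):
    """Encode a label such as 'HEAD_ADJ&FLYTO' as a multi-hot target vector."""
    active = set(label.split("&"))
    unknown = active - set(INTENTS)
    if unknown:
        raise ValueError(f"unknown intents: {sorted(unknown)}")
    return [1 if name in active else 0 for name in INTENTS]


single = multi_hot("ALT_ADJ")           # one active position
combined = multi_hot("HEAD_ADJ&FLYTO")  # two active positions
```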

Considering the aforementioned specificities of spoken instructions, a multi-label intent identification (IID) model is designed to learn the discriminative features. In this work, the IID model is also trained in two stages: i) (stage 2-1) unsupervised text representation learning and ii) (stage 2-2) intent-oriented instruction embedding finetuning. Specifically, a BERT35 architecture is applied to build the IID model. In the first stage of the IID model (stage 2-1), the BERT-based network is trained on unlabeled text instructions to learn universal representations using a “masked language model” (MLM) pre-training objective. Let a textual instruction be \({\rm{SI}}=\{{w}_{1},{w}_{2},\ldots,{w}_{l}\}\); the BERT model can then be formulated as Eq. (4):

$$\mathbf{H}_{\mathrm{SI}}=\{h^{w_{1}},h^{w_{2}},\ldots,h^{w_{l}}\}=\mathrm{BERT}(\mathrm{SI})$$
(4)

where \({w}_{i}\), \(i\in \{1,2,\ldots,l\}\), is the ith basic token of the IID model, i.e., a Chinese character for Mandarin instructions or an English word for English instructions. l is the length of the instruction counted in tokens. \({h}^{{w}_{i}}\in {{\mathbb{R}}}^{D}\) represents the high-dimensional features of \({w}_{i}\), and D denotes the feature dimension.

In the stage of intent-oriented instruction embedding finetuning (stage 2-2), a multi-binary classification head is cascaded to the pre-trained BERT network to conduct the IID task. To capture the sentence-level instruction embeddings and retain more informative details, the token-level features are summed along the sequence dimension to generate the final instruction embedding, as illustrated in Eq. (5):

$$\mathbf{SI}_{emb}=\mathrm{SUM}(\mathbf{H}_{\mathrm{SI}})$$
(5)

where \(\mathbf{SI}_{emb}\in \mathbb{R}^{D}\) serves as the final instruction embedding, and SUM( ) represents the sum operation.

In succession, as illustrated in Eq. (6), the instruction embedding SIemb is fed to the multi-layer perceptron (MLP) module to generate the probabilities of each intent class Intentprob. Notably, the output of the MLP module is activated by the Sigmoid function due to the multi-label classification setting of the IID model.

$$\mathbf{Intent}_{prob}=\mathrm{Sigmoid}(\mathrm{MLP}(\mathbf{SI}_{emb}))$$
(6)

The IID model is finetuned on the textual spoken instruction dataset with intent labels to learn the discriminative features for instructions with different intents. After the training process, the classification head is removed and the finetuned BERT model is preserved to obtain the sentence-level instruction embeddings SIemb. It is believed that we can not only identify the maneuvering instructions but also extract the discriminative features in the instruction embeddings. In this way, the instruction embeddings can be further incorporated into FTP models to support instruction-driven FTP tasks.
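Eqs. (5) and (6) can be traced end to end with a small pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: random toy features stand in for the BERT output of Eq. (4), the MLP is taken to be two linear layers with a GELU in between, and the layer sizes, the 0.5 decision threshold, and all function names are hypothetical.

```python
import math
import random


def sum_pool(h_si):
    """Eq. (5): sum token features along the sequence dimension (l x D -> D)."""
    return [sum(col) for col in zip(*h_si)]


def linear(x, weight, bias):
    """Affine map y = W x + b, with weight given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    """Exact GELU based on the Gaussian CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def intent_probs(h_si, params):
    """Eq. (6): Sigmoid(MLP(SI_emb)), one independent probability per intent."""
    si_emb = sum_pool(h_si)
    hidden = [gelu(v) for v in linear(si_emb, params["w1"], params["b1"])]
    logits = linear(hidden, params["w2"], params["b2"])
    return [sigmoid(z) for z in logits]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D, HIDDEN, NUM_INTENTS = 4, 8, 6  # toy sizes, not the paper's configuration
params = {
    "w1": random_matrix(HIDDEN, D, rng), "b1": [0.0] * HIDDEN,
    "w2": random_matrix(NUM_INTENTS, HIDDEN, rng), "b2": [0.0] * NUM_INTENTS,
}
# Random stand-in for H_SI with 5 tokens (would come from BERT in practice).
h_si = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(5)]
probs = intent_probs(h_si, params)
predicted = [p >= 0.5 for p in probs]  # thresholding at 0.5 is an assumption
```

Because each output passes through its own sigmoid rather than a shared softmax, several intents can be active for the same instruction, which is what the multi-label setting requires.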

Multi-modal FTP finetuning

In this stage, we mainly focus on efficiently incorporating instruction embeddings into the pre-trained FlightBERT++ model, allowing the FTP model to be aware of the maneuvering controlling intent. As described in the Introduction, one of the primary challenges of this work is the lack of large-scale trajectory-instruction pairs to support multi-modal learning. Therefore, the modal fusion procedure should introduce as few new parameters as possible to reduce the difficulty of the finetuning process.

To this end, firstly, the trajectory embedding output by the trajectory encoder is selected to be fused with the instruction embedding to generate the intent-aware trajectory embedding, because the trajectory embedding carries the trajectory-level representation of the observations. Secondly, a simple yet effective multi-modal fusion mechanism is designed to incorporate the instruction embedding into FTP models and bridge the pre-trained FTP and BERT models.

Specifically, as shown in Eq. (7), a concatenation operation is performed to generate the multi-modal joint vector \({{{{{\bf{J}}}}}}_{mm}\in {{\mathbb{R}}}^{M}\), which coarsely fuses the trajectory embedding \({{\bf{Traj}}}_{enc}\) and the instruction embedding \({{\bf{SI}}}_{emb}\). Here, Concat[ ] is the concatenation function and M is the dimension of \({{\bf{J}}}_{mm}\).

$${{{{{\bf{J}}}}}}_{mm}={{{{\rm{Concat}}}}}[{{{{{\bf{Traj}}}}}}_{enc},{{{{{\bf{SI}}}}}}_{emb}]$$
(7)

Moreover, an MLP module is applied to conduct deep fusion of the two embeddings and project the fused vector back to the original dimension, generating the intent-aware trajectory embedding \({{\bf{Traj}}}_{mm}\), as shown in Eq. (8).

$${{{{{\bf{Traj}}}}}}_{mm}={{{{\rm{MLP}}}}}({{{{{\bf{J}}}}}}_{mm})$$
(8)
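The fusion in Eqs. (7) and (8) can be sketched in a few lines of standard-library Python. The embedding sizes, the hidden width, the ReLU activation, and the helper names (`init_fusion_mlp`, `fuse`, `count_parameters`) are hypothetical choices for illustration, and the sketch assumes that the "original dimension" is that of the trajectory embedding.

```python
import random


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row of len(x) values per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_fusion_mlp(traj_dim, instr_dim, hidden_dim, rng):
    """Create the only new parameters introduced by the fusion step."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    m = traj_dim + instr_dim  # M, the dimension of the joint vector J_mm
    return {"w1": mat(hidden_dim, m), "b1": [0.0] * hidden_dim,
            "w2": mat(traj_dim, hidden_dim), "b2": [0.0] * traj_dim}


def fuse(traj_enc, si_emb, params):
    j_mm = list(traj_enc) + list(si_emb)               # Eq. (7): concatenation
    hidden = [max(0.0, v)
              for v in linear(j_mm, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])  # Eq. (8): back to traj_dim


def count_parameters(params):
    weights = sum(len(row) for key in ("w1", "w2") for row in params[key])
    return weights + len(params["b1"]) + len(params["b2"])


rng = random.Random(0)
traj_dim, instr_dim, hidden_dim = 6, 4, 5  # toy sizes (hypothetical)
params = init_fusion_mlp(traj_dim, instr_dim, hidden_dim, rng)

traj_enc = [rng.uniform(-1, 1) for _ in range(traj_dim)]  # stand-in encoder output
si_emb = [rng.uniform(-1, 1) for _ in range(instr_dim)]   # stand-in instruction embedding
traj_mm = fuse(traj_enc, si_emb, params)
n_new = count_parameters(params)
```

Because the output has the same size as the trajectory embedding, it can take the place of that embedding downstream without changing any pre-trained module, and the number of newly introduced parameters stays small relative to the two pre-trained networks.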

Notably, the intent-aware trajectory embedding \({{\bf{Traj}}}_{mm}\) replaces the \({{\bf{Traj}}}_{enc}\) used in the trajectory-based FTP pre-training stage as the input from which the HACG module generates the context embeddings. Finally, as in the original FlightBERT++, the context embeddings are fed into the DPD to generate predictions.

Based on the above designs, we combine the pre-trained FTP and BERT models into a joint model using only an MLP module with a few parameters. In the multi-modal FTP finetuning process, the joint model is optimized with the BCE loss function on the trajectory-instruction pairs. In this way, the inner correlation between flight transitions and instructions with maneuvering intents is expected to be learned, thereby enhancing the prediction ability of the FTP model in high-maneuvering scenarios.
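For reference, the binary cross-entropy (BCE) objective can be written as a short standard-library function. This sketch assumes that the model outputs one probability per binary target and that the loss is averaged over all outputs; the clamping constant `eps` and the example values are illustrative and are not taken from this work.

```python
import math


def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    if len(probs) != len(targets) or not probs:
        raise ValueError("probs and targets must be non-empty and equal in length")
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


targets = [1, 0, 1, 1]                              # hypothetical binary targets
good = bce_loss([0.9, 0.1, 0.8, 0.7], targets)      # confident and mostly correct
poor = bce_loss([0.2, 0.8, 0.3, 0.4], targets)      # mostly wrong
```

The loss decreases as each predicted probability approaches its binary target, so `good` is smaller than `poor` in this example.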