Introduction

In recent years, the rapid development of the Internet and social media has been accompanied by exponential growth in data volume and diversity, leading to increasingly complex challenges in information acquisition and management1,2. The automatic extraction of valuable information from massive unstructured texts has become a critical research focus3, where event extraction, as a key technology, plays a pivotal role in transforming unstructured data into structured knowledge.

Event extraction typically comprises two subtasks: event detection and argument extraction. Event detection identifies triggers and classifies event types through sequence element classification, while argument extraction recognizes event attributes and annotates their corresponding roles. Conventional approaches to this problem are categorized into pipeline approaches4 (Fig. 1(a)) and joint models. Pipeline event extraction decomposes the task into sequential independent subtasks5,6, where each step operates in isolation. While this approach offers modularity and ease of implementation, it suffers from error propagation, ultimately compromising overall accuracy. In contrast, joint event extraction (Fig. 1(b)) employs an end-to-end framework7, enabling simultaneous extraction of triggers and arguments through a unified model. By leveraging interdependencies between tasks, it effectively mitigates cascading errors inherent in pipeline methods8.

Fig. 1

Comparison of Model Architectures. (a) Pipeline models suffer from error propagation. (b) Joint extraction models require complex interaction mechanisms. (c) Our proposed generative paradigm (PosEKE-GPT2) simplifies the architecture through end-to-end text generation.

Event extraction in the financial domain aims to rapidly and accurately extract event information from specialized texts9. However, such texts are typically characterized by extended length, information redundancy, complex syntactic structures, and frequent co-occurrence of multiple events, as shown in the real-world case in Fig. 2, posing significant challenges to practical extraction tasks.

Fig. 2

Illustration of multi-event extraction from financial text.

The sentence contains two events: (1) an Enterprise Financing event (trigger: financing; arguments: Financing-party: Deyi Company, Amount: 500 million yuan), and (2) an Enterprise Acquisition event (trigger: acquired; arguments: Acquirer: Deyi Company, Acquiree: Jinfang Technology Company).

To address these challenges, this paper proposes PosEKE-GPT2 (Position Extension and Knowledge Enhancement on GPT2), an enhanced GPT2 model that reformulates event extraction as a text generation task. The overall architecture of our proposed method is illustrated in Fig. 1(c). Specifically, we integrate adjacent positional encodings into the original GPT2 generation framework, overcoming the limitation of fixed-length position embeddings in the base model. Additionally, we introduce an attention mechanism to further capture nonlinear relationships between input embeddings and prompt embeddings.

Experimental results demonstrate that the proposed model achieves superior performance in joint event extraction tasks, with significant improvements in precision, recall, and F1-score across event type classification, trigger identification, and argument extraction, thereby validating the model’s effectiveness.

The primary contributions of this work are summarized as follows:

  1) This work innovatively reformulates event extraction as a text generation task. Building upon the GPT2 framework, we implement a joint extraction paradigm that simultaneously identifies triggers and arguments through unified sequence generation. This architecture enables co-optimization of subtasks within a single model, effectively eliminating error propagation caused by traditional pipeline cascades.

  2) This paper proposes a novel adjacent positional encoding fusion mechanism that doubles the input length capacity compared to conventional methods. This advancement enables precise capture of absolute positional relationships between event triggers and argument roles, thereby alleviating insufficient positional representation. The enhanced encoding significantly strengthens long-text modeling capabilities and deepens the model’s semantic and structural comprehension of complex events.

  3) This paper introduces an attention-based knowledge enhancement method that formalizes event representation via prompt engineering and integrates external knowledge embeddings to enhance contextual comprehension. This approach dynamically recalibrates knowledge relevance weights through attention mechanisms. Furthermore, during target sequence construction, a specialized token tagging strategy is introduced to explicitly delineate event types, triggers, and arguments using structural markers, thereby strengthening the model’s structural awareness and boosting extraction accuracy.

Related work

Event extraction, a pivotal task in natural language processing, involves identifying critical elements from unstructured data and presenting them as structured representations for downstream applications. While researchers worldwide have extensively explored this field, particularly achieving remarkable progress through pre-trained language models (PLMs), existing methods still grapple with persistent challenges such as domain adaptation barriers and handling diverse event types across complex scenarios.

Event extraction in the financial domain

Event extraction technology holds particular significance in the financial domain, enabling critical information extraction from massive financial texts. However, the complex structural patterns and domain-specific characteristics of financial documents pose substantial challenges for event extraction. To address these issues, researchers have conducted extensive studies. Li et al.10 proposed Fin-PTPCG, a model integrating Fin-BERT with pseudo-trigger-aware pruned complete graphs. This framework effectively achieves multi-event detection and classification by combining domain-specific prior knowledge, pseudo-trigger mechanisms, and similarity pruning strategies. He et al.11 developed DEEM-PT, an event extraction model based on graph neural networks (GNNs). It enhances multi-event information interaction through event-type-guided prompt templates and integrates critical arguments via pseudo-event proxy nodes. Zou et al.12 introduced a generative financial event extraction method that resolves argument scattering and multi-event challenges through entity-to-document level information encoding and decoding. Hu et al.13 addressed contextual awareness and cross-sentence argument dispersion in financial documents by employing RoBERTa pre-trained embeddings combined with graph convolutional networks and enhanced path reasoning mechanisms. Jin et al.14 proposed the RACNN-BiLSTM framework, which significantly improves implicit causal relationship recognition in financial texts through fusion of local syntactic features, global semantic patterns, and self-attention mechanisms.

Despite notable progress, financial event extraction continues to face persistent challenges, particularly in lengthy document modeling and robustness enhancement. Current approaches frequently suffer from insufficient positional representation mechanisms when handling long-text scenarios, leading to degraded precision in identifying event elements. These unresolved issues demand further investigation and methodological innovations.

Event extraction based on joint learning

Compared to traditional pipeline approaches, joint methods demonstrate superior performance by sharing features and enabling inter-task information interaction, particularly excelling in capturing complex contextual dependencies and cross-sentence argument extraction. Cao et al.15 proposed OneEE, a model that reformulates event extraction as word-word relation identification through parallel grid tagging. It incorporates adaptive event fusion modules and distance-aware predictors to effectively mitigate error propagation. Dai et al.16 developed a cascaded decoding architecture with multi-feature fusion and condition-enhanced mechanisms, achieving robust performance in overlapping event extraction scenarios. Feng et al.2 introduced a joint pointer labeling framework combining PERT pre-trained embeddings, event-type semantic augmentation, and SATT-BiLSTM feature extraction to resolve argument overlapping conflicts. Sheng et al.17 proposed SaltyFishes, a parameter-sharing joint learning framework that addresses low-resource event extraction through conditional normalization mechanisms, achieving state-of-the-art results in the CCKS-2020 financial event extraction competition. Lin et al.18 presented ONEIE, a global graph optimization framework integrating cross-task dependencies via beam search decoding and joint global feature modeling, enabling comprehensive performance improvements across multiple information extraction tasks. Chen et al.19 designed MLSL, a multi-layer sequence labeling approach for biomedical event extraction, which simplifies traditional complex workflows by explicitly modeling trigger-argument interactions while maintaining candidate trigger awareness.

While joint extraction methods provide streamlined architectures compared to non-joint approaches, their performance remains suboptimal in handling complex event interdependencies and long-range contextual dependencies, necessitating further optimization for domain-specific scenarios.

Generative event extraction

Generative event extraction is a paradigm that reformulates event extraction tasks as text generation problems. Unlike traditional classification or sequence labeling methods, this approach enables flexible mapping of input texts into structured event representations, unconstrained by fixed tag schemas. It demonstrates enhanced adaptability, particularly in multi-event coexistence scenarios. Jia et al.20 developed an enhanced GPT2 model incorporating generative input modules and hybrid attention mechanisms, optimizing Transformer block outputs through layer-wise vector fusion strategies. Hsu et al.21 proposed DEGREE, a data-efficient event extraction framework that models the task as a conditional generation problem, achieving robust low-resource performance via manually crafted prompts. Duan et al.22 enhanced low-resource event extraction by integrating event keywords and fine-tuning BART with joint training objectives. Shi et al.23 introduced an end-to-end joint extraction framework employing dual encoders to simultaneously leverage trigger-context interactions during text generation. Lu et al.24 presented UIE, a unified text-to-structure generation framework that standardizes cross-task encoding through structured extraction languages. Chen et al.25 designed CPEE, a generative joint event extraction model combining ChatGPT-based data augmentation with entity-aware prompt learning, demonstrating superior few-shot capabilities. Li et al.5 pioneered MQAEE, a multi-turn QA paradigm that sequentially extracts triggers and arguments via machine reading comprehension mechanisms.

Although generative approaches exhibit strong generalization capabilities and data efficiency in event extraction tasks, they still face some challenges such as generation instability and information omission. To address these issues, this paper proposes the PosEKE-GPT2 model, which enhances knowledge representation through extended positional encoding and a knowledge-augmented attention mechanism. By leveraging comprehensive textual information and capturing associations between event elements, the model significantly improves multi-event understanding and extraction capabilities, thereby mitigating information incompleteness to a certain extent.

Model design

In this section, we elaborate on transforming event extraction into a conditional generation task based on prompt strategies, and propose an extended positional encoding method combined with a knowledge-augmented attention mechanism.

The architecture of the PosEKE-GPT2 model

PosEKE-GPT2 (Position Extension and Knowledge Enhancement on GPT2) extends the original GPT2 generative framework by enhancing positional encoding and incorporates knowledge augmentation through attention mechanisms guided by prompt strategies. As illustrated in Fig. 3, the model consists of four core modules: Model Input, Knowledge Augmentation, Positional Modeling, and Model Prediction.

Fig. 3
figure 3

PosEKE-GPT2 Model Diagram.

Dual-channel vocabulary prompting and event-augmented labeling

The model’s input consists of two parts: input text and prompt text. To enable the model to better learn the meaning of text in complex contexts, this paper employs unstructured natural language text as input, allowing it to handle complex scenarios in real-world applications. Furthermore, the original data often contains multiple events, which further increases the complexity of the task. For example, Fig. 4 illustrates a multi-event extraction example from a financial news sentence.

Fig. 4

Multi-event Text Example Diagram.

To address the challenges of event argument extraction in multi-event scenarios, this paper proposes a method based on dual-channel dynamic lexicon prompting and event-enhanced annotation to optimize input representation and target sequence construction for event extraction tasks. The method employs a dual-channel architecture, where the dynamic lexicon prompting mechanism constructs event-related prompt words, while explicitly modeled text is annotated with special tokens to refine the representation of event elements.

Specifically, trigger words and arguments are dynamically imported from external lexicons to automatically generate schema-agnostic lexical prompts during training. The lexicon is built from two primary sources: (1) Event schema annotations provided in the DUEE-Fin and FewFC datasets, which supply canonical trigger and argument labels; (2) Domain-specific terminology collected from publicly available financial news corpora and knowledge bases, with expert validation for semantic relevance and contextual applicability. This design ensures traceability and reproducibility of the lexicon resource. The standardized prompt format follows:

<Trigger>\n<Argument>

where <Trigger> denotes the event trigger words, such as “announce”, “transfer”, and “bankrupt”; and <Argument> represents event entities, such as “Sony”, “Alibaba”, and “China Shandong Hi-Speed Financial Group Limited”.
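For illustration, a minimal sketch of how such a lexical prompt could be assembled from the trigger and argument lexicons is given below (the helper name and example lexicon entries are hypothetical):

```python
def build_prompt(trigger_lexicon, argument_lexicon):
    r"""Assemble a lexical prompt in the "<Trigger>\n<Argument>" format.

    Both lexicons are plain lists of strings drawn from the external financial
    vocabulary described above; the values used below are illustrative only.
    """
    trigger_line = " ".join(trigger_lexicon)
    argument_line = " ".join(argument_lexicon)
    return trigger_line + "\n" + argument_line


prompt_text = build_prompt(
    ["announce", "transfer", "bankrupt"],
    ["Sony", "Alibaba", "China Shandong Hi-Speed Financial Group Limited"],
)
```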

The event-augmented labeling strategy optimizes event element representation by introducing a special token tagging mechanism during target sequence construction. During training data preprocessing, dedicated special tokens (e.g., [1], [2], [3], [9], [10]) are assigned to key elements including event types, triggers, and arguments. These tokens are then inserted into target sequences to explicitly annotate structural information of event elements. This labeling approach ensures format consistency across target sequences, enabling the model to learn structural patterns of each event element during training. It enhances comprehension of event compositions, improves event recognition capabilities, and establishes foundational support for subsequent joint extraction tasks.

Consider the following multi-event text as an example:

“Deyi Company announced to complete financing of 500 million yuan and acquire Jinfang Technology Company”.

The target generation sequence for this text is constructed as follows:

“[1] Enterprise financing [2] Financing [3] Financing party: Deyi Company [9] Amount: 500 million [9] [10]”.

“[1] Enterprise acquisition [2] Acquisition [3] Acquirer: Deyi Company [9] Acquiree: Jinfang Technology Company [9] [10]”.

In the annotation schema, [1] denotes the start of the target sequence, [2] marks the end position of the event type, [3] indicates the end position of the trigger word, [9] signifies the end position of each argument, and [10] marks the termination of the complete target sequence.
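As a concrete illustration of this scheme, the following sketch serializes one event into the special-token target format (the event dictionary layout and helper name are hypothetical):

```python
def build_target_sequence(event):
    """Serialize one event into the special-token target format:
    [1] <type> [2] <trigger> [3] <role>: <entity> [9] ... [10]
    """
    parts = ["[1]", event["type"], "[2]", event["trigger"], "[3]"]
    for role, entity in event["arguments"]:
        parts += [f"{role}: {entity}", "[9]"]
    parts.append("[10]")
    return " ".join(parts)


financing_event = {
    "type": "Enterprise financing",
    "trigger": "Financing",
    "arguments": [("Financing party", "Deyi Company"), ("Amount", "500 million")],
}
print(build_target_sequence(financing_event))
# [1] Enterprise financing [2] Financing [3] Financing party: Deyi Company [9] Amount: 500 million [9] [10]
```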

During the training phase, the dual-channel architecture facilitates enhanced learning of event prior knowledge and textual information through prompt-guided modeling and target sequence modeling. Subsequent experiments demonstrate that the integration of dual-channel dynamic vocabulary prompting and special token tagging enhances the model’s generalization capability, thereby preventing overfitting to single-event patterns. By incorporating external knowledge bases, the method significantly improves the recognition accuracy for diverse events, effectively addressing event element extraction in multi-event scenarios.

Knowledge augmentation module

The first core component of the Knowledge Augmentation Module is the Word Embedding Encoding Layer. This component is constructed based on the vanilla GPT2 pre-trained model, with its primary function being the transformation of raw input text into embedding vectors interpretable by the model. Notably, to accommodate subsequent positional encoding expansion requirements, a position-agnostic processing strategy is adopted at this stage—retaining only the semantic embeddings of the text while temporarily excluding any positional encoding information. This process is illustrated in Fig. 5.

Fig. 5

Knowledge Augmentation Process Diagram.

The mapping equations for the input texts St (t = 1, 2, …, n) and prompt texts Sp (p = 1, 2, …, m) are given in Eqs. (1) and (2):

$$X_t=W_t \bullet E$$
(1)
$$X_p=W_p \bullet E$$
(2)

Here, the token embedding matrix of GPT2 is denoted as E ∈ R^{V×e}, where V is the vocabulary size and e is the embedding dimension. Wt ∈ R^{s×V} and Wp ∈ R^{p×V} represent the one-hot representation matrices of the input text and prompt text, respectively, where s and p are the input and prompt lengths. Xt and Xp correspond to the token embedding representations of the input text and prompt text.
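In practice this one-hot multiplication is simply an embedding-table lookup; the sketch below mirrors Eqs. (1) and (2) literally, with hypothetical sizes:

```python
import torch

# Hypothetical sizes: vocabulary V, embedding dimension e, input length s, prompt length p.
V, e, s, p = 21128, 768, 12, 6
E = torch.randn(V, e)  # stand-in for the GPT2 token embedding matrix

# One-hot representation matrices of the input text and prompt text (random token ids).
W_t = torch.zeros(s, V)
W_t[torch.arange(s), torch.randint(0, V, (s,))] = 1.0
W_p = torch.zeros(p, V)
W_p[torch.arange(p), torch.randint(0, V, (p,))] = 1.0

X_t = W_t @ E  # Eq. (1): (s, e) token embeddings of the input text
X_p = W_p @ E  # Eq. (2): (p, e) token embeddings of the prompt text
```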

To enhance the model’s awareness of domain-specific knowledge, we introduce an attention-based knowledge augmentation method. Building upon the vanilla GPT2 token embeddings and given the model’s reliance on external knowledge, an attention mechanism is adopted to compute attention scores between textual elements, enabling dynamic weighted fusion of external knowledge.

First, we compute the relevance between the input text and prompt text. As obtained in the previous step, the token embeddings of the input text and prompt text are denoted as Xt ∈ R^{b×s×e} and Xp ∈ R^{b×p×e}, respectively, where b is the batch size, s is the input text length, e is the embedding dimension, and p is the prompt sequence length. We measure their relevance via inner product computation, denoted as At, as shown in Eq. (3):

$$A_t[b,s,p]=\sum\nolimits_{e} {X_t[b,s,e] \bullet X_p[b,p,e]}$$
(3)

After obtaining the relevance scores At, we normalize them and convert them into attention weights using the Softmax function. First, we sum the exponentiated scores across all prompt positions to compute the normalization term, denoted as Z[b, s], where pi represents all possible prompt positions and m denotes the number of prompt tokens. The specific formulation is given in Eq. (4):

$$Z[b,s]=\sum\nolimits_{i=1}^{m} {e^{A_t[b,s,p_i]}}$$
(4)

We then normalize the attention scores for each prompt position, transforming them into a probability distribution. The final attention scores, denoted as Ascore, are computed where pi represents the prompt position index for the current softmax normalization, as detailed in Eqs. (5) and (6).

$$A_{score}[b,s,p_i]=\frac{e^{A_t[b,s,p_i]}}{Z[b,s]}$$
(5)
$$A_{score}[b,s,p_i]=\frac{e^{A_t[b,s,p_i]}}{\sum\nolimits_{j=1}^{m} {e^{A_t[b,s,p_j]}}}$$
(6)

The computed attention weights Ascore are combined with the token embeddings of the prompt text via weighted summation, which constitutes the core operation of the attention mechanism. For each input position s, the model calculates the weighted sum over all prompt positions pi based on their token embeddings, yielding the fused output representation as specified in Eq. (7):

$$A_{out}[b,s,e]=\sum\nolimits_{i=1}^{m} {A_{score}[b,s,p_i] \bullet X_p[b,p_i,e]}$$
(7)

Finally, the weighted prompt information Aout is integrated into the input text’s token embedding Xt to facilitate the subsequent positional encoding expansion and prediction tasks. The resulting knowledge-augmented information, denoted as Kout, is formulated as shown in Eq. (8):

$$K_{out}=X_t[b,s,e]+A_{out}[b,s,e]$$
(8)
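A minimal PyTorch sketch of this fusion, following the shapes and notation above (no scaling factor is applied to the dot products, matching Eq. (3); the tensor values are placeholders):

```python
import torch
import torch.nn.functional as F


def knowledge_augment(X_t, X_p):
    """Sketch of Eqs. (3)-(8): fuse prompt knowledge into the input embeddings.

    X_t: (b, s, e) token embeddings of the input text
    X_p: (b, p, e) token embeddings of the prompt text
    Returns K_out: (b, s, e), the knowledge-augmented representation.
    """
    A_t = torch.matmul(X_t, X_p.transpose(1, 2))  # Eq. (3): relevance scores, shape (b, s, p)
    A_score = F.softmax(A_t, dim=-1)              # Eqs. (4)-(6): normalize over prompt positions
    A_out = torch.matmul(A_score, X_p)            # Eq. (7): weighted sum of prompt embeddings, (b, s, e)
    return X_t + A_out                            # Eq. (8): residual fusion with the input embeddings


K_out = knowledge_augment(torch.randn(2, 12, 768), torch.randn(2, 6, 768))
```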

Extended positional encoding module

In Transformer models, positional encoding serves as a critical component. Since the self-attention mechanism inherently lacks the capability to discern positional relationships between elements in a sequence, positional encoding addresses this limitation by injecting positional information, thereby enabling the model to comprehend the sequential order of elements. Traditional absolute positional encoding methods typically employ sine and cosine functions to compute positional embeddings. However, GPT2, as a generative model, adopts learnable absolute positional encoding, where each position is assigned a trainable vector. This design allows the model to dynamically learn semantic patterns associated with different positions, making it better suited for complex task requirements.

GPT2 maintains a trainable positional embedding matrix with a maximum sequence length L and an embedding dimension dm, as formalized in Eq. (9).

$$P=\left[ {\begin{array}{*{20}{c}} {P_{0,0}}&{P_{0,1}}& \ldots &{P_{0,d_m-1}} \\ {P_{1,0}}&{P_{1,1}}& \ldots &{P_{1,d_m-1}} \\ \vdots & \vdots & \ddots & \vdots \\ {P_{L-1,0}}&{P_{L-1,1}}& \ldots &{P_{L-1,d_m-1}} \end{array}} \right]$$
(9)

For the input text St (t = 1, 2, …, n), the positional encoding Pi corresponding to the original position i of each word is obtained by a row lookup, where P[i] denotes the i-th row of the learnable positional embedding matrix P. The formula is shown in Eq. (10).

$$P_i=P[i]$$
(10)

However, as the input sequence length increases, the fixed-range limitation of positional encoding constrained by maximum sequence length may hinder the model’s ability to effectively capture long-range dependencies between distant words. When processing sequences exceeding the pre-defined maximum training length, the original positional encoding scheme becomes inapplicable. To address this issue, this paper proposes a novel positional encoding method that achieves smooth transition of positional information through adjacent positional encoding fusion. This approach breaks through the fixed-length constraints of conventional positional embeddings, enabling more continuous and scalable representation of positional relationships.

Fig. 6

Positional Encoding Extension Schematic Diagram.

This positional encoding method helps the model better capture positional information in long texts, overcoming the input length limitation of the original GPT2 and increasing the input text volume, thereby enhancing the model’s ability to model long texts.

The process of extending positional encoding involves generating new positional encodings by fusing adjacent ones. As illustrated in Fig. 6, the first step fuses every two adjacent rows of the original positional embedding matrix P, as expressed in Eq. (11):

$$P_{avg(i)}=\frac{P_i+P_{i+1}}{2},\quad i=1,2, \ldots ,L-1$$
(11)

Where Pi denotes the positional encoding of position i, Pi+1 denotes the positional encoding of position i + 1, and Pavg(i) represents the fused encoding, i.e., the average of the encodings at positions i and i + 1.

After obtaining the fused encodings, all positional encodings undergo interpolation-based fusion to form a new extended positional encoding matrix, as shown in Eq. (12):

$$P_{ext}=[P_1,P_{avg(1)},P_2,P_{avg(2)}, \ldots ,P_{L-1},P_{avg(L-1)},P_L]$$
(12)

This implementation is relatively simple and flexible. Although it introduces adjacent positional encoding fusion, the interleaved design keeps computational cost low and avoids the complexity of learning an independent encoding for every extended position. In addition, the new positional encoding preserves the original positional information while adding intermediate positional cues, enabling the model to capture positional relationships more accurately and ultimately enhancing the performance of event extraction tasks.
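A minimal sketch of the extension is shown below (the matrix size is hypothetical; the interleaving follows Eq. (12)):

```python
import torch


def extend_positional_encoding(P):
    """Sketch of Eqs. (11)-(12): interleave original and adjacent-averaged encodings.

    P: (L, d_m) learnable positional embedding matrix.
    Returns P_ext: (2L - 1, d_m), nearly doubling the addressable sequence length.
    """
    P_avg = (P[:-1] + P[1:]) / 2.0  # Eq. (11): average of each adjacent pair of rows
    L, d_m = P.shape
    P_ext = torch.empty(2 * L - 1, d_m)
    P_ext[0::2] = P                 # original positions P_1, P_2, ..., P_L
    P_ext[1::2] = P_avg             # fused positions P_avg(1), ..., P_avg(L-1) in between
    return P_ext


P_ext = extend_positional_encoding(torch.randn(1024, 768))  # hypothetical GPT2-sized matrix
```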

Model prediction

We reformulate the event extraction task as a joint generation task. During prediction, the model adopts a step-wise generation approach, sequentially producing outputs in the order of event type → trigger → arguments. At each prediction step, the model generates the next target output conditioned on the current task input and the partial predictions from the previous step.

To achieve joint extraction, we set different objectives for various subtasks and perform task-based autoregressive decoding in a predefined task sequence. Based on the input text, the model first predicts the event type yevent. After determining the event type, the model initializes new input and generates the event trigger word ytrigger. Subsequently, based on the trigger word, the model re-initializes new input and progressively generates arguments yargument, including roles and corresponding entities. The specific computational formulas are shown in Eqs. (13), (14), (15), and (16).

$$P_1=P(y_{event}|x)=\operatorname{Softmax} (W_{event}H_{event}+b_{event})$$
(13)
$$P_2=P(y_{trigger}|x,y_{event})=\operatorname{Softmax} (W_{trigger}H_{trigger}+b_{trigger})$$
(14)
$$P_3=P(y_{argument}|x,y_{event},y_{trigger})=\operatorname{Softmax} (W_{argument}H_{argument}+b_{argument})$$
(15)
$$P(y_{event},y_{trigger},y_{argument} |x)=P_1 \bullet P_2 \bullet P_3$$
(16)

Here, H denotes the hidden state for each task, W the task-specific weight matrix, and b the corresponding bias term. Specifically, Hevent represents the hidden state of the input text, Htrigger additionally incorporates the event-type information, and Hargument integrates both the event type and the trigger word. Accordingly, Wevent, Wtrigger, and Wargument perform the mapping transformations for event type classification, trigger word recognition, and argument extraction, respectively, while the bias terms bevent, btrigger, and bargument adjust the corresponding prediction biases. Finally, the highest-probability candidate of each task is selected to obtain the complete extraction result of event types, trigger words, and arguments.
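The following sketch illustrates this task-ordered decoding loop. It assumes a Hugging Face-style GPT2 language model and tokenizer in which the special markers ([1], [2], [3], [10]) are registered as single vocabulary entries; the helper generate_until is hypothetical and not part of any library API:

```python
import torch


@torch.no_grad()
def generate_until(model, tokenizer, context, stop_token, max_new_tokens=64):
    """Greedily decode from `context` until `stop_token` is produced (hypothetical helper)."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    pieces = []
    for _ in range(max_new_tokens):
        next_id = model(ids).logits[0, -1].argmax().view(1, 1)  # greedy next-token choice
        ids = torch.cat([ids, next_id], dim=-1)
        token = tokenizer.decode(next_id[0])
        if token.strip() == stop_token:  # assumes the marker decodes as one special token
            break
        pieces.append(token)
    return "".join(pieces).strip()


def extract_event(model, tokenizer, text, prompt):
    """Sketch of the task-ordered decoding in Eqs. (13)-(16): type, then trigger, then arguments."""
    context = prompt + "\n" + text + " [1]"
    event_type = generate_until(model, tokenizer, context, "[2]")  # Eq. (13)
    context += " " + event_type + " [2]"
    trigger = generate_until(model, tokenizer, context, "[3]")     # Eq. (14)
    context += " " + trigger + " [3]"
    arguments = generate_until(model, tokenizer, context, "[10]")  # Eq. (15)
    return event_type, trigger, arguments
```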

Loss function

Given that the entire event extraction process has been rephrased as a conditional text generation task, the training objective is to maximize the accuracy of generating the target sequence given input text and prompts. Accordingly, we employ the standard token-level cross-entropy loss for sequence generation, which is a common practice for autoregressive language models like GPT-2.

The loss is defined as:

$$\mathcal{L}=-\sum\limits_{t=1}^{T} {\log P(y_t|y_{<t},K_{out})}$$
(17)

where yt denotes the target token at decoding step t, y<t denotes the previously generated tokens, T is the target sequence length, and Kout is the knowledge-augmented representation of the input text and prompt.
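A minimal sketch of this objective, assuming the model's next-token logits are computed from the Kout representation and that padding positions (hypothetically marked with id 0) are masked out:

```python
import torch.nn.functional as F


def generation_loss(logits, target_ids, pad_id=0):
    """Token-level cross-entropy of Eq. (17).

    logits:     (b, T, V) next-token logits produced from the K_out representation
    target_ids: (b, T)    gold target-sequence token ids; pad positions are ignored
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id,
    )
```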

Experimental results and analysis

Dataset and parameters

The experiment adopted the DuEE-Fin26 financial domain open-source document-level event extraction dataset. This dataset contains a total of 7,250 annotated texts, including 1,179 test data entries, encompassing 13 event types and 9,440 events. In addition, the FewFC dataset was incorporated, consisting of 7,185 sentences from 899 texts, containing 10 event types and 3,172 event instances.

In this experiment, we adopted gpt2-distil-chinese-cluecorpussmall as the baseline model and implemented improvements upon it. The experiments were conducted using the PyTorch framework, with multiple hyperparameters adjusted during the training process, including batch size, learning rate, and optimizer. Detailed hyperparameter configurations are presented in Table 1, while the training details regarding batch iterations and epochs are further described in the section "Training Configuration and Convergence Analysis".

Table 1 Experimental parameter settings.

Analysis of experimental results

To verify the effectiveness of the PosEKE-GPT2 model in the event extraction task, we designed three groups of experiments. First, we compared the PosEKE-GPT2 with multiple mainstream baseline models to evaluate its overall performance advantages. Second, through ablation experiments, we progressively removed different modules in the model to analyze their impact on overall performance, thereby validating their necessity. Finally, we designed five distinct methods for the knowledge-enhanced module and screened out the optimal knowledge-enhancement strategy through comparative experiments to further improve model performance.

In the three subtask experiments, the model’s performance was evaluated using three metrics: Precision (P), Recall (R), and F1-score (F1)27. The calculation formulas for these three metrics are shown in Eqs. (18), (19), and (20), respectively:

$${\text{P}}=\frac{{TP}}{{TP+FP}}$$
(18)
$${\text{R}}=\frac{{TP}}{{TP+FN}}$$
(19)
$${\text{F}}1=\frac{{2 \bullet {\text{P}} \bullet {\text{R}}}}{{{\text{P+R}}}}$$
(20)
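For reference, a small helper mirroring these metric definitions (counts are assumed to be computed per subtask):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and F1 from TP, FP, and FN counts (Eqs. (18)-(20))."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```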

Comparison with mainstream baseline models

In this experimental section, we demonstrate the effectiveness of PosEKE-GPT2 in the event extraction task through comparison with the following models:

  • BERT28: BERT itself serves as a powerful pre-trained language model that can effectively capture contextual information, which is the most critical component in pipeline-based event extraction. It inherently possesses the capability to handle both sequence labeling tasks and classification tasks.

  • GPT229: As a generative model, GPT2 learns the latent relationships between trigger words and arguments based on contextual information, accomplishing event type classification, trigger identification, and argument extraction simultaneously through joint extraction.

  • BERT + MMOE + CRF30: BERT is leveraged to extract semantic information, ensuring precise modeling of semantic features for each token; a multi-gate mixture-of-experts module facilitates effective information sharing across the event extraction subtasks through shared learning and expert gating mechanisms; finally, a CRF layer in the output layer models dependency relationships among labels.

  • JEEDG30: By explicitly separating shared parameters and task-specific parameters, the introduction of a dual-layer gated network enhances the extraction and filtering capabilities of semantic knowledge.

  • CasEE31: A multi-level event extraction framework based on a BERT encoder that identifies core event elements through three sequential decoders for event type detection, trigger word extraction, and argument extraction, combined with a self-attention mechanism and a conditional fusion function to achieve structured semantic parsing.

The performance comparison between the proposed PosEKE-GPT2 model and benchmark models is shown in Table 2.

Table 2 Comparative experiment results table (DuEE-Fin).

From the experimental results in Table 2, it can be seen that the performance of each model varies significantly in the event extraction task. In event type classification, BERT achieved an F1 of 93.86, demonstrating its strong contextual understanding, while the original GPT2 reached an F1 of 90.06. The PosEKE-GPT2 model proposed in this study achieved the best performance with an F1 of 94.88 through knowledge enhancement and positional encoding extension, verifying its advantage in capturing fine-grained semantic information. In trigger extraction, PosEKE-GPT2 led all comparison models with an F1 of 91.22, outperforming BERT-MMOE-CRF’s 85.60 and JEEDG’s 86.58, highlighting the effectiveness of its knowledge enhancement in trigger recognition. In argument extraction, PosEKE-GPT2 again achieved the highest F1 of 85.74, exceeding CasEE’s 81.24 and BERT’s 66.7, indicating that the positional encoding extension effectively captures complex argument relations. Overall, PosEKE-GPT2 achieved the highest mean F1 of 90.61, surpassing GPT2’s 85.15 and CasEE’s 86.41, fully demonstrating the synergistic effect of knowledge enhancement and positional encoding extension in multi-task joint learning, particularly in integrating cross-subtask contextual information and modeling long-range dependencies.

Table 3 Comparative experiment results table (FewFC).

As shown in Table 3, similar trends were observed in the FewFC dataset. In event type classification, PosEKE-GPT2 achieved the best F1 of 92.48, outperforming GPT2’s 82.69 and CasEE’s 88.64, confirming its superior semantic representation ability. In trigger extraction, PosEKE-GPT2 reached an F1 of 91.70, clearly higher than GPT2’s 82.88 and CasEE’s 82.88, showing its robustness in identifying diverse event triggers. In argument extraction, PosEKE-GPT2 achieved an F1 of 82.36, surpassing BERT’s 68.16 and CasEE’s 76.91, further validating the contribution of the positional encoding extension in capturing complex argument dependencies. Overall, PosEKE-GPT2 achieved the highest mean F1 of 88.85, significantly outperforming GPT2’s 80.85 and CasEE’s 82.81, demonstrating the consistent effectiveness of the proposed improvements across different datasets.

In summary, the proposed PosEKE-GPT2 model consistently demonstrated superior performance on both the DuEE-Fin and FewFC datasets, confirming its robustness and effectiveness in domain-specific event extraction tasks across different benchmarks.

Ablation experiment

To analyze the contribution of each module to the overall event extraction task, this paper conducted the following ablation experiments: removing the extended position embedding (ext_pos) and knowledge-enhanced (KB) modules respectively, and observed the changes in model performance. The experimental results are shown in Table 4.

Table 4 Ablation study results Table.

From the comparison of experimental results, the full PosEKE-GPT2 model demonstrates clear advantages in event extraction. In event type classification, it achieves an F1 of 94.88. Removing the position extension module reduces the F1 to 90.88, while removing the knowledge enhancement module results in an F1 of 92.09, indicating that knowledge enhancement mainly stabilizes performance across subtasks rather than directly boosting classification. For trigger extraction, the full model attains an F1 of 91.22; removing position extension or knowledge enhancement lowers the F1 to 84.92 and 84.71 respectively, suggesting the two modules jointly balance performance rather than individually maximizing trigger recognition. In argument extraction, the full model reaches the highest F1 of 85.74, whereas removing position extension or knowledge enhancement reduces it to 81.75 and 82.94 respectively, showing that both modules are crucial for capturing complex argument relations, with position extension having a slightly stronger effect. Overall, the mean F1 of the full model is 90.61, surpassing the variant without position extension (86.07), the variant without knowledge enhancement (86.80), and the original GPT2 baseline (85.15), fully illustrating the synergistic effect of the two modules in multi-task joint learning.

As shown in Figs. 7 and 8, PosEKE-GPT2 demonstrates superior convergence in both training loss and validation loss compared to other incomplete models, exhibiting a more stable optimization process and stronger generalization capability.

Fig. 7

Ablation Study Training Loss Plot.

Fig. 8

Ablation Study Validation Loss Plot.

Qualitative case analysis

Fig. 9

Qualitative Case Analysis.

To further demonstrate the advantages of the proposed model, we present a representative case from financial news, as illustrated in Fig. 9. The example sentence is: “Tencent announced a 500-million-yuan investment yesterday and completed the acquisition of a gaming studio in Shanghai.”

As shown in Fig. 9, the baseline model correctly identified the “investment” event, extracting “Tencent” as the investor and “500 million yuan” as the amount. However, it failed to detect the subsequent “acquisition” event and did not link the target entity “a gaming studio in Shanghai” to the corresponding trigger. This suggests that the baseline model struggles with multi-event sentences involving long-range dependencies, often being influenced by the most salient event and suffering from trigger omission and argument loss.

In contrast, PosEKE-GPT2 successfully extracted both events in their entirety, accurately identifying all triggers and associated arguments, including the distantly located acquisition target. This performance underscores the complementary benefits of the two core modules. The Knowledge Enhancement Module integrates external financial knowledge bases, supplying semantic priors that aid in recognizing less frequent yet domain-relevant triggers such as “acquisition”, thereby mitigating omissions common in the baseline. Meanwhile, the Positional Encoding Extension Module improves the model’s ability to capture long-range dependencies by interpolating intermediate positional encodings through averaging adjacent positional representations. This enhancement facilitates the connection between triggers and their distant arguments, such as associating “acquisition” with “a gaming studio in Shanghai”, effectively addressing the argument-missing issue observed in the baseline.

This case clearly illustrates that knowledge enhancement boosts trigger detection, while extended positional encoding significantly improves long-distance argument linking. Overall, the two modules enable accurate and comprehensive extraction of multiple events from complex financial sentences.

Comparative experiments on knowledge-enhanced modules

In the experimental section, to verify the effectiveness of fusing knowledge vectors and input text word embedding vectors through the attention mechanism for knowledge enhancement, this paper further designs five different fusion methods for comparative experiments. These five methods are as follows:

  • Direct Addition (ADD): Directly add the knowledge vector to the input text embedding vector to test the effectiveness of the simple vector superposition approach.

  • Prepend concatenation (Pre-concat): The knowledge vector is concatenated to the beginning of the input text to provide additional contextual background information.

  • Post-concatenation (Post-concat): Concatenate the knowledge vector to the end of the input text and observe the impact of knowledge integration at different positions on text comprehension.

  • Graph Attention Network Fusion (GAT): Adopt a Graph Attention Network to fuse knowledge vectors and input text word embedding vectors, thoroughly modeling the semantic associations between them and enhancing the depth of information interaction.

  • Attention mechanism fusion (ATTN): By introducing attention mechanisms, the model can learn the relevance between input text and prompt text, enhance semantic representation, and improve the performance of generation tasks.

Detailed experimental results are presented in Table 5.

Table 5 Comparison results table of Knowledge-enhanced Methods.

From the experimental results in Table 5, it can be seen that different knowledge fusion strategies have notable effects on event extraction performance. The attention-based fusion method performs best, achieving an average F1 of 90.61 and surpassing all other strategies, demonstrating its effectiveness in enabling fine-grained interaction between text and external knowledge through dynamic weighting. The direct addition method reaches an average F1 of 88.50, suggesting that simple vector summation may introduce semantic conflicts. The prepend and post-concatenation strategies obtain average F1 values of 88.53 and 88.39 respectively; by adjusting the position of the knowledge embeddings they mitigate some information conflicts, but static concatenation still leads to uneven representation distribution. The graph attention network strategy achieves an average F1 of 88.35, indicating limited structural modeling ability in long-text event extraction scenarios. Overall, within the joint event extraction framework, the attention mechanism enables context-aware dynamic fusion, improving knowledge integration and outperforming the post-concatenation strategy by 2.22 percentage points, highlighting the key role of dynamic knowledge fusion in capturing complex semantic relationships.

Comparative analysis of prompting strategies

Fig. 10

Illustration of Different Prompt Design Strategies.

To validate the effectiveness of our proposed concise prompting strategy, which is characterized by its simplicity and direct use of lexical knowledge, we compare it against several alternative, more elaborate prompting designs in a controlled ablation study. A comparative visualization of these four prompting strategies is presented in Fig. 10. This comparison includes:

  • Ours (T1): Our proposed approach employs a concise template following the “<Trigger>\n<Argument>” format, delivering pure lexical knowledge without additional instructional markers or syntactic structures. The prompt consists solely of newline-separated lists of trigger words and argument roles, facilitating direct association learning between lexical knowledge and textual context.

  • Natural(T2): This strategy uses natural language instructions to frame the task in a human-readable form. Adopting prompts such as “Please identify the event type described in the text. Possible types include: <Trigger_list>” and “Please extract the arguments for different roles in the event. Possible roles include: <Argument_list>”, it evaluates the model’s ability to comprehend and respond to intuitive conversational directives.

  • Keywords(T3): It is a minimalist keyword-style prompt that reduces instructional context to its bare essentials. Using succinct formulations like “Event Type: <Trigger_list>” and “Arguments: <Argument_list>”, this method examines the model’s reliance on rich instructional context and its capacity to infer task requirements from minimal semantic cues.

  • QA-Format(T4): This approach reformulates the extraction task as an interactive question-answering session. With prompts such as “What type of event is described in the text? Options: <Trigger_list>” and “What participant information is contained in the text? Role types: <Argument_list>”, it explores alternative task formulations that may activate different reasoning pathways in the language model.

Table 6 Prompt strategy result Table.

As shown in Table 6, different prompt formulations lead to clear variations in performance, confirming that prompt design plays a crucial role in generative event extraction. The QA-style prompt achieved the highest F1-score of 94.35 in event type classification, but its performance in argument extraction dropped significantly to only 80.44. This indicates that framing the task as a question favors coarse-grained classification but provides limited guidance for capturing fine-grained structural information. The keyword-based prompt produced the weakest overall results with a mean F1-score of 87.84, demonstrating that overly simplified instructions are insufficient for guiding complex extraction tasks. In comparison, the natural language prompt demonstrated balanced improvements across all subtasks, attaining a mean F1-score of 89.32, which suggests that intuitive and human-readable instructions enhance the model’s generalization ability. Notably, our structured prompt design delivered the best overall performance with a mean F1-score of 90.61, while achieving particular advantages in both trigger and argument extraction. These findings highlight the sensitivity of generative models to prompt formulation and demonstrate the effectiveness of structured supervision combined with knowledge injection for improving robustness and accuracy.

Training configuration and convergence analysis

To evaluate the adequacy of the training setup, we further investigated the impact of different batch size and epoch configurations on model performance. The objective of this experiment was to assess the convergence behavior of the proposed model and to examine its robustness under varying training conditions.

Table 7 Ablation study on training configurations (Batch size & Epochs).
Fig. 11

Epoch-Level Convergence Curve.

Table 7 presents a comprehensive comparison of different batch sizes and training epochs, demonstrating that both factors significantly influence model performance. When the number of epochs is fixed at seven, larger batch sizes consistently lead to improved results. Models trained with batch sizes of four or six achieve relatively lower mean F1 scores, while a batch size of eight yields stronger performance. The highest overall mean F1 score of 90.61 is attained with a batch size of ten. Although a batch size of four results in a marginally higher F1 score for argument extraction compared to a batch size of eight, this advantage is offset by a decline in overall performance, indicating that very small batch sizes may provide limited regularization at the cost of reduced stability.

Holding the batch size constant at ten, increasing the number of training epochs from four to seven consistently improves model outcomes. The mean F1 score rises from 87.37 at four epochs to 89.28 at five epochs, and further to 89.92 at six epochs, reaching a peak of 90.61 at seven epochs under the B10E7 configuration. Similarly, the argument extraction F1 score improves to 85.74 at seven epochs, confirming that longer training enhances both overall performance and argument extraction capabilities. As shown in Fig. 11, the training loss plateaus around the seventh epoch, suggesting that the model has converged. However, extending training to eight epochs leads to a slight degradation in performance, with the mean F1 decreasing to 89.15 and the argument extraction F1 dropping to 82.43, suggesting the onset of overfitting.

These findings indicate that a batch size of ten combined with seven training epochs achieves the optimal balance between convergence and generalization.

Length-sensitivity and efficiency analysis

To further evaluate the effectiveness of the proposed extended positional encoding mechanism, experiments were conducted from two aspects: length generalization and computational efficiency.

For the length generalization test, the dataset was divided into four intervals based on sequence length: short texts (0–99 tokens), medium-short texts (100–199 tokens), medium-long texts (200–299 tokens), and long texts (300+ tokens). The sample sizes for each interval are provided in Table 8.

Table 8 Distribution of samples across sentence length intervals.

As presented in Table 9, PosEKE-GPT2 consistently outperforms the baseline GPT2 across all length intervals, achieving higher mean F1 scores and demonstrating the robustness of the proposed mechanism. Although the improvement is moderate on short and medium-short texts, a more substantial gain is observed on medium-long and long texts. Notably, in the 300+ token interval, PosEKE-GPT2 attains a mean F1 score of 87.74, significantly surpassing the 75.66 achieved by GPT2. This result confirms that the proposed positional encoding extension effectively alleviates the performance degradation commonly associated with longer sequences.

Table 9 GPT2 vs. PosEKE-GPT2 across length Intervals.

In this study, we compared PosEKE-GPT2 with two baseline models: GPT2-base and GPT2 with sinusoidal positional embeddings. As summarized in Table 10, PosEKE-GPT2 achieves the highest mean F1 score of 90.61%, while maintaining comparable inference speed and per-epoch training time to GPT2-base. These results indicate that the proposed extension incurs negligible computational overhead while consistently improving performance across evaluations.

Table 10 Performance and efficiency comparison of positional encoding Mechanisms.

These findings collectively demonstrate that the proposed positional encoding extension mechanism (1) enhances model robustness across varying text lengths, performing particularly well on long-text scenarios, and (2) delivers consistent accuracy gains without sacrificing computational efficiency. These results underscore the practical value of PosEKE-GPT2 for document-level event extraction tasks that involve long-distance dependencies.

Conclusion

In this paper, we propose a model named PosEKE-GPT2 and elaborate on its architecture. We then conduct experiments on the DuEE-Fin and FewFC datasets, validating the effectiveness of PosEKE-GPT2 in financial domain event extraction tasks. Experimental results demonstrate that our model surpasses all baseline models in the Mean F1 metric, proving its overall superiority.

Compared to traditional methods, PosEKE-GPT2 significantly improves extraction performance in jointly extracting multiple events through positional extension and knowledge enhancement strategies. The positional extension enables the model to adapt to longer texts, enhances adaptability to dataset length, and strengthens contextual understanding capabilities. The knowledge enhancement strategy utilizes external knowledge to generate prompt words, improving the model’s ability to model domain-specific terminology and contextual semantics. Ablation experiments further validate the effectiveness of both modules in the joint extraction task.

Although PosEKE-GPT2 has achieved good performance in financial event extraction tasks, there is still room for optimization. In the future, we will explore causal reasoning methods to enable the model to understand causal relationships between events and improve the interpretability of extraction.