Introduction

With the acceleration of globalization and informatization, English education has attracted increasing worldwide attention. In English language teaching, writing serves as a critical indicator of comprehensive language ability, reflecting students’ linguistic proficiency and cognitive skills1,2,3. Particularly in large-scale examinations, online education platforms, and international curricula, English writing proficiency has become a key metric for evaluating overall language competence. However, current writing instruction and assessment still rely heavily on manual grading by teachers—a process that is labor-intensive, time-consuming, and often influenced by subjective standards and inter-rater variability, making it difficult to ensure consistent and fair essay evaluation4,5,6. This issue is especially pronounced in large-scale testing scenarios, where scalable and reliable automated scoring systems are urgently needed as a replacement or complement to manual grading. Automated Essay Scoring (AES) technology has emerged as a promising solution to this challenge.

Early AES models largely depended on shallow linguistic features such as word frequency, sentence length, and spelling or grammar errors, combined with traditional machine learning methods. While these models demonstrated some effectiveness in specific tasks, their performance proved unstable when applied to essays prompted by different topics7,8. Changes in writing prompts, styles, or linguistic backgrounds often led to poor generalization and significant scoring bias, largely due to the models’ overreliance on topic-specific features in training data9,10. This phenomenon, known as the “cross-prompt scoring challenge,” remains one of the major obstacles in AES research. Addressing this challenge requires models that can capture general linguistic features while accurately assessing semantic alignment between essays and prompts across diverse topics, thereby improving scoring fairness and credibility.

Advancements in intelligent text analysis, particularly deep learning (DL)-based methods, enable automatic extraction of richer linguistic features from large-scale essay data. By capturing deep semantic representations and structural patterns, DL provides a more accurate foundation for essay evaluation11,12. Building on these advances, this study proposes a Hybrid Feature-based Cross-Prompt Automated Essay Scoring (HFC-AES) model designed to enhance scoring accuracy and consistency in multi-topic and multi-prompt scenarios. The model integrates shallow statistical features with semantic features extracted through DL. In the topic-independent stage, shared features are derived to provide stable preliminary assessments, while in the topic-specific stage, a hierarchical neural network and cross-attention mechanism are incorporated to model semantic relationships between essays and prompts more precisely. Leveraging text structure features and attention mechanisms, the proposed approach enhances robustness and adaptability in diverse prompting conditions. Ultimately, HFC-AES offers an intelligent scoring tool that supports English language teaching and facilitates the practical application of automated scoring technology in educational evaluation.

Literature review

Over the years, the rapid development of natural language processing (NLP) has significantly advanced the field of AES. Existing research can be broadly categorized into three areas: AES models based on traditional feature engineering, AES methods utilizing DL, and AES models designed for cross-prompt and multilingual contexts.

(1) AES models based on traditional feature engineering.

Most early AES systems relied on manually designed shallow linguistic features, such as lexical density and syntactic structure, for scoring and modeling. Susanti et al. (2023) conducted a comprehensive literature review of AES systems, analyzing the use of various methods and datasets to provide methodological and dataset references for future research13. Li and Huang (2022) explored the influence of composition, organization, and overall quality on the evaluation of English as a foreign language writing in Chinese higher education. Through interviews with teachers and raters and a quantitative analysis of large-scale evaluation data, they identified clear differences in scoring focus. High-quality compositions were evaluated across multiple dimensions. In contrast, low-quality compositions were assessed mainly for language accuracy and content14. These findings highlighted the limitations of traditional AES models in constructing comprehensive scoring dimensions and underscored the need to reconsider feature selection for fairness and completeness. Although feature-engineered methods offer interpretability and computational efficiency, they struggle to capture deeper semantic relationships and contextual information. Consequently, their generalization ability is limited, and they fall short in assessing semantic alignment and overall discourse coherence, particularly for complex, variable-prompt writing tasks.

(2) AES methods utilizing deep learning.

The emergence of neural network models has led many researchers to explore DL-based approaches for automatically learning semantic and structural features in student compositions. Lim et al. (2023) developed and validated a neural network-based automated assessment system tailored for Korean second-language writing. By combining NLP techniques with pre-trained neural language models, the system improved scoring performance through analyses of linguistic features such as grammatical complexity, quantitative complexity, and fluency15. This work demonstrated the value of applying neural methods to non-English AES tasks, extending the applicability of DL in multilingual contexts. Beyond NLP advances, intelligent text analysis has introduced new approaches for AES. Bai and Stede (2023) reviewed recent applications of machine learning in automated evaluation of student free-text responses, including both short answers and full essays, highlighting the predominant use of feature-based and neural network architectures16. Compared with traditional methods, DL-based approaches excel at automatically learning complex features and modeling contextual semantic relationships and textual coherence. However, existing DL models still face notable limitations: poor transferability across prompts, vulnerability to topic bias in training data, limited interpretability due to “black box” architectures, and insufficient handling of discourse-level structures, as most focus primarily on syntactic or lexical features rather than modeling macro-level semantic organization.

(3) AES models in cross-topic and multilingual contexts.

To address the challenges posed by diverse essay prompts and the uneven distribution of language resources, researchers have explored strategies for cross-prompt and multilingual AES systems. Gao et al. (2024) reviewed the integration of AI and NLP in automated writing evaluation from an educational perspective, emphasizing the potential of large language models to improve assessment efficiency17. Hossain and Goyal (2024) trained pre-trained Transformer models—such as BERT, GPT, Multilingual BERT (mBERT), and Cross-Lingual Models (XLM-R)—on multilingual corpora covering over 20 languages. These models were fine-tuned for tasks including text summarization, content generation, and sentiment analysis, demonstrating strong multilingual semantic modeling and text generation capabilities, especially in coherence and fluency18. Li (2025) proposed a novel cross-lingual sentence similarity detection approach that combined the multilingual power of XLM-R with a stepwise weighted similarity metric integrating cosine similarity and Manhattan distance, along with language-independent embeddings from BiT-Internet and XLM-R. This method significantly improved semantic equivalence detection, setting new benchmarks in cross-lingual similarity tasks19. Although these studies laid a theoretical foundation for enhancing the generalization of AES systems, current models still struggle with semantic alignment and discourse-level modeling in essays—complex, highly topic-dependent text. They often fail to accurately capture semantic correspondence between essay content and prompts.

Building on these insights, this study proposes the HFC-AES model, which integrates shallow statistical features with deep neural representations. The architecture consists of two stages: a topic-independent stage that extracts stable shared features for consistent cross-prompt scoring, and a topic-specific stage that employs a hierarchical neural network and cross-attention mechanism to precisely align essay content with prompts. By combining semantic modeling with discourse structure recognition, the model overcomes robustness limitations of existing AES systems in multi-prompt scenarios. Table 1 compares mainstream AES approaches with the proposed model in terms of feature representation, semantic alignment, discourse modeling, cross-prompt adaptability, interpretability, and model-specific enhancements:

Table 1 Comparison of different AES methods.

As Table 1 illustrates, different AES approaches exhibit distinct strengths and weaknesses across feature modeling, semantic understanding, transferability, and interpretability. Traditional methods, though highly interpretable, lack the capacity to model complex semantics and discourse structures, limiting their applicability to challenging writing tasks. DL-based methods have made significant progress in semantic representation but often lack generalization beyond specific topics or datasets, resulting in unstable cross-prompt scoring. Multilingual pre-trained models demonstrate potential in handling cross-lingual semantics but remain inadequate in modeling essay-specific discourse structures. In contrast, HFC-AES systematically optimizes semantic alignment, discourse structure modeling, and robustness in cross-prompt scenarios by integrating shallow linguistic features with deep neural representations. Its cross-attention mechanism enhances the capture of key correspondences between essay content and prompts, improving both holistic discourse understanding and scoring reliability. Overall, HFC-AES balances interpretability with advanced semantic modeling, addressing limitations of existing methods and demonstrating greater generalizability and practical value.

Research model

Model overall architecture design

The HFC-AES model employs a two-stage feature extraction process. In the first stage, DL techniques are used to extract comprehensive textual features from raw essays, encompassing syntactic structure, lexical usage, and semantic information. These features are derived through pre-trained word embedding models and syntactic analysis tools. Moreover, sentence-level discourse structure features are incorporated to capture the internal logical relations and organizational framework of the composition, thereby enhancing the model’s grasp of discourse-level structures. In the second stage, a cross-attention mechanism further refines feature processing by automatically learning the relative importance of various scoring criteria for the overall assessment. This mechanism effectively emphasizes critical sections of the essay and dynamically allocates feature weights in accordance with specific task requirements. The overall workflow of the HFC-AES model is illustrated in Fig. 1.

Fig. 1

The workflow of the HFC-AES model.

The HFC-AES model employs a dual-channel architecture. One channel is dedicated to extracting and processing global features, while the other focuses on capturing and enhancing local features. By integrating information from both channels, the model can generate scoring predictions at multiple granular levels. To enhance interpretability, a visualization technique is incorporated to display the feature weight distribution for each scoring criterion, thereby making the model’s decision-making process more transparent and reproducible. The overall structure of the HFC-AES model, including the topic-independent and topic-related feature extraction stages, is illustrated in Fig. 2.

Fig. 2

Overall flow chart of the model.

In the topic-independent stage, the model extracts shallow text features at both word and sentence levels and combines these with deep semantic features generated by DL–based text analysis methods. In the topic-related stage, a Bi-LSTM coupled with an attention mechanism constructs a hierarchical semantic network that captures semantic information relevant to both the composition and the prompt. By linking the shared layer with the task-specific layer, this stage effectively models contextual semantic relationships within the essay and integrates feature correlations across multiple tasks through a cross-attention mechanism. Finally, the scoring module combines the topic-independent and topic-related feature representations to provide precise scores for essays on the target topic, enabling comprehensive assessment of both linguistic competence and content quality in English writing. Multiple neural network architectures are employed in the design of the HFC-AES model to fully explore the multi-level semantic information of essays. Table 2 summarizes and compares the key neural network architectures used, detailing their core functions, advantages, and specific roles in this study.

Table 2 Main neural network architectures in the HFC-AES model.

From the comparison presented in Table 2, it is evident that the various neural network architectures fulfill complementary roles and collaboratively enhance the model’s capability to capture multi-level semantic information within essays. The following sections provide a detailed introduction to the specific applications and implementations of these network architectures within both the topic-independent and topic-related feature extraction stages.

Topic-independent feature extraction stage

In the topic-independent stage, the primary objective is to extract shared features between the source and target essay datasets. These features are then employed to build a preliminary scoring model for initial evaluation of the target essays. The shared features consist of shallow text features and DL features, as illustrated in Fig. 3. Shallow text features are manually designed using statistical methods and capture fundamental textual information such as vocabulary usage frequency and sentence structure. In contrast, DL features are automatically learned by deep neural networks (DNNs), which identify complex patterns and semantic relationships within the text. The hybrid approach of combining shallow text features with DL features aims to leverage the strengths of both, enhancing the accuracy and robustness of the scoring model. Shallow features provide intuitive, easily computable information—such as lexical richness and sentence length—that directly reflect students’ language proficiency and offer high interpretability. Meanwhile, DL features model text at a deeper level, capturing intricate grammatical and semantic relationships, including logical sentence connections, discourse structure, and underlying semantic intentions. These high-level features are critical for assessing essay coherence, organization, and semantic precision. By employing intelligent text analysis, this extraction approach integrates traditional lexical and syntactic information with deeper semantic and contextual insights20. The combination of shallow and DL features mitigates the potential information loss that may occur if either feature set is used alone, thereby improving the model’s capacity for comprehensive evaluation of students’ writing skills.

Fig. 3

Feature Extraction in topic-independent stage.

(1) Shallow text feature extraction.

Shallow text features are extracted at the word and sentence levels using statistical methods to capture students’ vocabulary proficiency and sentence structure skills. To improve feature relevance, the Term Frequency-Inverse Document Frequency (TF-IDF) method measures each word’s importance within an essay by balancing its frequency against its rarity in the entire corpus. This effectively filters out common but less meaningful words and highlights key terms, enhancing lexical feature discrimination and scoring accuracy.
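The following is a minimal sketch of this weighting step using scikit-learn's TfidfVectorizer; the paper does not specify its actual implementation or parameters, so the corpus and settings below are purely illustrative.

```python
# Minimal TF-IDF sketch (illustrative; not the paper's actual configuration).
from sklearn.feature_extraction.text import TfidfVectorizer

essays = [
    "The school library should stay open longer because students need quiet study space.",
    "Computers help people communicate, but too much screen time can harm concentration.",
]

# Fit TF-IDF over the essay corpus; common but uninformative words receive low weights.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf_matrix = vectorizer.fit_transform(essays)

# Inspect the highest-weighted terms of the first essay as candidate key terms.
terms = vectorizer.get_feature_names_out()
weights = tfidf_matrix[0].toarray().ravel()
print(sorted(zip(terms, weights), key=lambda x: -x[1])[:5])
```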

At the word level, features include composition length, average word length and its variance, number of spelling errors, and ratios of prepositions and conjunctions21,22. These are extracted with tools like SpellCheck and NLTK and analyzed alongside vocabulary profiles. Additionally, intelligent text analysis identifies spelling error types and word usage frequency in context, providing richer semantic insights23.

At the sentence level, features describe structural and coherence aspects, such as sentence count, average sentence length, grammatical errors, and overall coherence. Sentence count and the sentence-to-word ratio reflect essay complexity, while average sentence length indicates structural sophistication. Grammatical error counts serve as an accuracy metric24,25. Sentence coherence is calculated using the following weighted formula:

$$p=0.45\times \bar{m}+\bar{l}-21.05$$
(1)

\(p\) is the sentence coherence score; \(\bar{m}\) denotes the average number of characters per word; \(\bar{l}\) denotes the average sentence length. To compute these features, a locally deployed grammar checker (LanguageTool) is used to detect grammatical errors in sentences and to calculate sentence coherence.
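The sketch below illustrates how Eq. (1) can be computed with NLTK tokenization. It assumes \(\bar{m}\) is measured in characters per word and \(\bar{l}\) in words per sentence; the paper does not state the units explicitly.

```python
# Illustrative computation of the sentence-coherence score in Eq. (1).
# Requires nltk.download('punkt') on first use; unit assumptions noted above.
import nltk

def coherence_score(text: str) -> float:
    sentences = nltk.sent_tokenize(text)
    words = [w for s in sentences for w in nltk.word_tokenize(s) if w.isalpha()]
    m_bar = sum(len(w) for w in words) / len(words)   # average word length (characters)
    l_bar = len(words) / len(sentences)               # average sentence length (words)
    return 0.45 * m_bar + l_bar - 21.05               # Eq. (1)

print(coherence_score("Reading widely improves writing skills. Regular practice also builds fluency over time."))
```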

(2) Deep text feature extraction.

Deep text features are extracted using a DNN that vectorizes and models essay text to capture high-level semantic attributes such as coherence and discourse structure. To better extract these deeper semantic features, the model combines Convolutional Neural Network (CNN) and Long Short-Term Memory Network (LSTM), leveraging their complementary strengths. First, essay text is converted into word vectors using the Word2Vec method. These word vectors are dynamically weighted through intelligent text analysis to automatically identify key topic words and important expressions. The CNN extracts local features from the word vectors, producing sentence-level representations. Subsequently, the LSTM captures temporal dependencies and global semantic information across sentences, generating features related to coherence and text structure. Finally, these DL features are combined with shallow text features to form the model’s input, which is then fed into the essay scorer and topic discriminator modules.
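A schematic PyTorch sketch of this deep feature branch is shown below. The layer sizes, max-pooling choice, and ReLU activation are illustrative assumptions rather than the paper's actual hyperparameters.

```python
# Schematic sketch of the deep feature branch: CNN over word vectors within each
# sentence, then an LSTM over the resulting sentence vectors (dims illustrative).
import torch
import torch.nn as nn

class DeepEssayEncoder(nn.Module):
    def __init__(self, emb_dim=300, conv_channels=128, hidden_dim=128):
        super().__init__()
        # 1-D convolution extracts local n-gram features from Word2Vec embeddings.
        self.conv = nn.Conv1d(emb_dim, conv_channels, kernel_size=3, padding=1)
        # LSTM models dependencies across the sequence of sentence vectors.
        self.lstm = nn.LSTM(conv_channels, hidden_dim, batch_first=True)

    def forward(self, sent_embeddings):
        # sent_embeddings: (num_sentences, num_words, emb_dim) for one essay
        x = sent_embeddings.transpose(1, 2)              # (num_sentences, emb_dim, num_words)
        feats = torch.relu(self.conv(x))                 # local features per word position
        sent_vecs = feats.max(dim=2).values              # max-pool into one vector per sentence
        _, (h_n, _) = self.lstm(sent_vecs.unsqueeze(0))  # sentence sequence as one batch
        return h_n[-1].squeeze(0)                        # essay-level deep feature vector

essay = torch.randn(12, 30, 300)          # 12 sentences, 30 words each, 300-dim vectors
print(DeepEssayEncoder()(essay).shape)    # torch.Size([128])
```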

To further enhance performance, the feature extraction process employs a joint optimization mechanism involving the feature generator, essay scorer, and topic discriminator. The feature generator aims to produce features that benefit the scorer while confusing the topic discriminator. The essay scorer predicts the essay score accurately, and the topic discriminator attempts to identify the topic source of the features. The loss function for the feature generator is defined as follows:

$${Loss}_{{\theta\:}_{f}}={Loss}_{{\theta\:}_{y}}-\alpha\:{Loss}_{{\theta\:}_{d}}$$
(2)

\({Loss}_{{\theta\:}_{y}}\) and \({Loss}_{{\theta\:}_{d}}\) denote the loss functions of the essay scorer and the topic discriminator, respectively. \(\alpha\:\) is a hyperparameter that balances the two objectives. The loss function of the topic discriminator is:

$${Loss}_{{\theta\:}_{d}}={L}_{d}({\theta\:}_{f},{\theta\:}_{d})$$
(3)

\({L}_{d}({\theta\:}_{f},{\theta\:}_{d})\) uses cross-entropy loss to measure how strongly the features produced by the feature generator confuse the topic discriminator. The loss function of the essay scorer is:

$${Loss}_{{\theta\:}_{y}}={L}_{y}({\theta\:}_{f},{\theta\:}_{y})$$
(4)

\({L}_{y}({\theta\:}_{f},{\theta\:}_{y})\) uses mean squared error (MSE) to measure the deviation between the predicted and true scores. For joint optimization, the parameters are updated as follows:

$${\theta\:}_{f}\leftarrow\:{\theta\:}_{f}-\mu\: \left(\frac{\partial\:{L}_{y}}{\partial\:{\theta\:}_{f}}-\alpha\:\frac{\partial\:{L}_{d}}{\partial\:{\theta\:}_{f}}\right)$$
(5)
$${\theta\:}_{y}\leftarrow\:{\theta\:}_{y}-\mu\:\frac{\partial\:{L}_{y}}{\partial\:{\theta\:}_{y}}$$
(6)
$${\theta\:}_{d}\leftarrow\:{\theta\:}_{d}-\mu\:\left(\alpha\:\frac{\partial\:{L}_{d}}{\partial\:{\theta\:}_{d}}\right)$$
(7)

\(\mu\:\) is the learning rate. \({\theta\:}_{f}\), \({\theta\:}_{y}\) and \({\theta\:}_{d}\) represent the parameters of the feature generator, essay scorer, and topic discriminator, respectively.
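A hedged PyTorch sketch of one joint update step implementing Eqs. (2)-(7) is given below. The generator, scorer, and discriminator modules are assumed placeholders, not the paper's actual architectures; a gradient-reversal layer is a common equivalent way to realize the generator update.

```python
# Sketch of the joint optimization in Eqs. (2)-(7): the feature generator is
# updated to help the scorer while confusing the topic discriminator.
import torch
import torch.nn.functional as F

def joint_update(generator, scorer, discriminator, essays, scores, topics,
                 alpha=0.1, mu=1e-3):
    feats = generator(essays)
    loss_y = F.mse_loss(scorer(feats), scores)                 # Eq. (4), scoring loss
    loss_d = F.cross_entropy(discriminator(feats), topics)     # Eq. (3), topic loss
    loss_f = loss_y - alpha * loss_d                           # Eq. (2), generator objective

    # Gradients for the three parameter groups, then manual SGD as in Eqs. (5)-(7).
    g_f = torch.autograd.grad(loss_f, list(generator.parameters()), retain_graph=True)
    g_y = torch.autograd.grad(loss_y, list(scorer.parameters()), retain_graph=True)
    g_d = torch.autograd.grad(loss_d, list(discriminator.parameters()))

    with torch.no_grad():
        for p, g in zip(generator.parameters(), g_f):
            p -= mu * g                    # Eq. (5): descend L_y, ascend L_d
        for p, g in zip(scorer.parameters(), g_y):
            p -= mu * g                    # Eq. (6)
        for p, g in zip(discriminator.parameters(), g_d):
            p -= mu * alpha * g            # Eq. (7)
    return loss_y.item(), loss_d.item()
```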

Topic-related feature extraction stage

Although grammatical, lexical, and coherence features are extracted during the topic-independent stage, the composition’s topic information has yet to be fully incorporated. Topic relevance plays a critical role in accurate scoring, especially in cross-topic tasks where aligning the essay content with the prompt is essential. The topic-related feature extraction stage focuses on capturing features closely tied to the prompt from the essay text. By integrating topic information into the scoring framework via a neural network, the model improves its semantic understanding and ability to judge topic alignment. Hierarchical neural networks effectively capture multi-level semantic information, enhancing the model’s overall grasp of essay semantics and better handling the complex relationship between prompts and compositions. Compared to traditional flat neural networks, hierarchical architectures preserve the text’s structural hierarchy, reducing risks of information loss or misinterpretation. This strengthens the model’s accuracy and reliability in topic-specific scoring. Text typically contains multiple semantic levels—from local words to sentences and overall discourse. By modeling these levels, hierarchical networks simultaneously attend to local and global information, deepening the model’s understanding of topic-related content. Therefore, this stage employs a hierarchical neural network to extract topic-related information layer by layer, enabling the model to capture topic elements and their semantic connections within the essay. As illustrated in Fig. 4, this stage divides the model into shared and task-specific layers. The shared layer extracts general semantic features, while the task layer focuses on features specific to the particular scoring task.

Fig. 4

Neural network structure for topic-related feature extraction.

(1) Shared layer.

The shared layer extracts general semantic features through word embedding, word-level convolution, and attention pooling operations, ensuring broad applicability of the feature representations.

The word embedding layer encodes each word in the essay into a high-dimensional vector that captures its semantic and grammatical properties. This study uses a pre-trained BERT model for word vectorization. Specifically, let the essay \(E=\{{sent}_{1},{sent}_{2},\cdots\:,{sent}_{n}\}\), where n denotes the number of sentences, and each sentence \({sent}_{i}=\{{d}_{1},{d}_{2},\cdots\:,{d}_{m}\}\), with m representing the number of words. After encoding with BERT, each word vector is represented as follows:

$${w}_{i}=\text{BERT}^{\text{represent}}\left({d}_{i}\right)$$
(8)

where \(\text{represent}\) denotes the BERT encoding operation applied to the word \({d}_{i}\). Leveraging BERT’s pre-training capabilities, the word embedding layer effectively captures contextual semantic dependencies and lexical-level information.

In the word-level convolution layer, a one-dimensional CNN processes the word embeddings to extract sentence-level semantic features. Subsequently, attention pooling aggregates these features into a comprehensive sentence representation. Specifically, for each word \({w}_{i}\) in a sentence, the convolution operation extracts its part-of-speech feature representation \({g}_{i}\):

$${g}_{i}=f({W}_{g} \cdot \:\left[{w}_{i}:{w}_{i+{h}_{w}-1}\right]+{b}_{g})$$
(9)

The function f denotes a nonlinear activation function; \({W}_{g}\) represents the convolution kernel weight matrix; \({b}_{g}\) is the bias term; \({h}_{w}\) refers to the convolution kernel size; and \(\left[{w}_{i}:{w}_{i+{h}_{w}-1}\right]\) indicates the set of words within the current sliding window.

Next, the attention vector \({a}_{i}\) and attention score \({v}_{i}\) for each word are computed using the attention mechanism as follows:

$${a}_{i}=\text{tanh}({W}_{a} \cdot {g}_{i}+{b}_{a})$$
(10)
$${v}_{i}=\frac{\text{exp}({W}_{v} \cdot {a}_{i})}{\sum _{j}\text{exp}({W}_{v} \cdot {a}_{j})}$$
(11)

where \({W}_{a}\) and \({W}_{v}\) are trainable weight matrices and \({b}_{a}\) is the bias vector. The sentence representation s is obtained via the weighted sum of all word features:

$$s=\sum\:{v}_{i} \cdot \:{g}_{i}$$
(12)

where s represents the final sentence-level semantic feature vector.
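The word-level convolution and attention pooling of Eqs. (9)-(12) can be sketched as follows; the dimensions, ReLU activation for f, and padding choice are illustrative assumptions rather than the paper's settings.

```python
# Sketch of word-level convolution plus attention pooling (Eqs. 9-12).
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, emb_dim=768, feat_dim=128, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, feat_dim, kernel_size, padding=kernel_size // 2)  # Eq. (9)
        self.W_a = nn.Linear(feat_dim, feat_dim)        # attention projection, Eq. (10)
        self.W_v = nn.Linear(feat_dim, 1, bias=False)   # attention scoring, Eq. (11)

    def forward(self, word_vecs):                       # word_vecs: (num_words, emb_dim)
        g = torch.relu(self.conv(word_vecs.t().unsqueeze(0))).squeeze(0).t()  # (num_words, feat_dim)
        a = torch.tanh(self.W_a(g))                                           # Eq. (10)
        v = torch.softmax(self.W_v(a).squeeze(-1), dim=0)                     # Eq. (11)
        return (v.unsqueeze(-1) * g).sum(dim=0)                               # Eq. (12): sentence vector s

bert_words = torch.randn(20, 768)            # 20 BERT word vectors for one sentence
print(AttentionPooling()(bert_words).shape)  # torch.Size([128])
```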

(2) Task layer.

The task layer performs feature modeling using a Bi-LSTM, sentence-level attention mechanism, and an output layer to complete feature extraction and topic-related scoring.

Sequence information is vital for semantic modeling in compositions. Compared to traditional LSTM, Bi-LSTM captures contextual information from both past and future states, allowing more comprehensive modeling of sentence semantics26. For each task j, the Bi-LSTM processes the input sentence representations \({s}_{t}^{j}\) and outputs \({h}_{t}^{j}\), following these equations:

Input gate calculation:

$${i}_{t}^{j}=\sigma\:({a}_{i,t}^{j}+{r}_{i,t-1}^{j}+{b}_{i}^{j})$$
(13)
$${a}_{i,t}^{j}={W}_{i}^{j}{s}_{t}^{j}$$
(14)
$${r}_{i,t-1}^{j}={U}_{i}^{j}{h}_{t-1}^{j}$$
(15)

Forget gate calculation:

$${f}_{t}^{j}=\sigma\:({a}_{f,t}^{j}+{r}_{f,t-1}^{j}+{b}_{f}^{j})$$
(16)
$${a}_{f,t}^{j}={W}_{f}^{j}{s}_{t}^{j}$$
(17)
$${r}_{f,t-1}^{j}={U}_{f}^{j}{h}_{t-1}^{j}$$
(18)

Candidate unit state:

$${\stackrel{\sim}{c}}_{t}^{j}=\text{tanh}({a}_{c,t}^{j}+{r}_{c,t-1}^{j}+{b}_{c}^{j})$$
(19)
$${a}_{c,t}^{j}={W}_{c}^{j}{s}_{t}^{j}$$
(20)
$${r}_{c,t-1}^{j}={U}_{c}^{j}{h}_{t-1}^{j}$$
(21)

Unit state update:

$${c}_{t}^{j}={i}_{t}^{j}\odot\:{\stackrel{\sim}{c}}_{t}^{j}+{f}_{t}^{j}\odot\:{c}_{t-1}^{j}$$
(22)

Output gate:

$${o}_{t}^{j}=\sigma\:({a}_{o,t}^{j}+{r}_{o,t-1}^{j}+{b}_{o}^{j})$$
(23)
$${a}_{o,t}^{j}={W}_{o}^{j}{s}_{t}^{j}$$
(24)
$${r}_{o,t-1}^{j}={U}_{o}^{j}{h}_{t-1}^{j}$$
(25)

Hidden state update:

$${h}_{t}^{j}={o}_{t}^{j}\odot\:\text{tanh}\left({c}_{t}^{j}\right)$$
(26)

where \({s}_{t}^{j}\) is the input sentence representation of the j-th task at time t, and \({h}_{t}^{j}\) is the corresponding output vector. The weight matrices \({W}_{i}^{j}\), \({W}_{f}^{j}\), \({W}_{c}^{j}\), \({W}_{o}^{j}\), \({U}_{i}^{j}\), \({U}_{f}^{j}\), \({U}_{c}^{j}\), \({U}_{o}^{j}\) and bias vectors \({b}_{i}^{j}\), \({b}_{f}^{j}\), \({b}_{c}^{j}\), \({b}_{o}^{j}\) are all model parameters; \(\sigma\:\) is the gate activation function, and \(\odot\:\) denotes element-wise multiplication.

To strengthen task relevance, a sentence-level attention mechanism assigns weights to sentence features for each task j. The attention vector \({q}_{t}^{j}\) and weight \({a}_{t}^{j}\) are computed as:

$${q}_{t}^{j}=\text{tanh}({W}_{q}^{j}{h}_{t}^{j}+{b}_{q}^{j})$$
(27)

The attention weight is calculated as:

$${a}_{t}^{j}=\frac{\text{exp}({W}_{a}^{j} \cdot {q}_{t}^{j})}{\sum _{k}\text{exp}({W}_{a}^{j} \cdot {q}_{k}^{j})}$$
(28)

The weighted sentence summary \({o}^{j}\) is:

$${o}^{j}=\sum\:_{t}{a}_{t}^{j}{h}_{t}^{j}$$
(29)

where \({W}_{q}^{j}\) and \({W}_{a}^{j}\) are trainable weight matrices and \({b}_{q}^{j}\) is the bias vector. \({q}_{t}^{j}\) and \({a}_{t}^{j}\) denote the attention vector and attention weight, respectively, and \({o}^{j}\) is the final sentence-level feature representation for task j.
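A compact sketch of this task-layer computation (Eqs. 13-29) is shown below; PyTorch's nn.LSTM implements the gate equations internally, and the hidden sizes are illustrative assumptions.

```python
# Sketch of the task-layer Bi-LSTM with sentence-level attention (Eqs. 13-29).
import torch
import torch.nn as nn

class TaskLayer(nn.Module):
    def __init__(self, sent_dim=128, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(sent_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.W_q = nn.Linear(2 * hidden_dim, 2 * hidden_dim)   # Eq. (27)
        self.W_a = nn.Linear(2 * hidden_dim, 1, bias=False)    # Eq. (28)

    def forward(self, sent_vecs):              # sent_vecs: (1, num_sentences, sent_dim)
        h, _ = self.bilstm(sent_vecs)          # (1, num_sentences, 2*hidden_dim), Eqs. (13)-(26)
        q = torch.tanh(self.W_q(h))            # Eq. (27)
        a = torch.softmax(self.W_a(q), dim=1)  # Eq. (28)
        return (a * h).sum(dim=1)              # Eq. (29): task-specific summary o^j

sents = torch.randn(1, 12, 128)                # 12 sentence vectors from the shared layer
print(TaskLayer()(sents).shape)                # torch.Size([1, 256])
```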

The model further incorporates a cross-task attention mechanism to exploit semantic correlations across tasks in multi-task learning. The attention score \({u}_{i}^{j}\) for the i-th feature in task j with respect to other task features \({A}_{-j,l}\) is calculated by:

$${u}_{i}^{j}=\frac{\text{exp}\left(score\left({o}^{j},{A}_{-j,i}\right)\right)}{\sum _{l}\text{exp}\left(score\left({o}^{j},{A}_{-j,l}\right)\right)}$$
(30)

Cross-task information is then integrated via a weighted sum:

$${p}^{j}=\sum\:_{i}{u}_{i}^{j}{A}_{-j,i}$$
(31)

The final representation vector is formed by concatenating \({o}^{j}\) and \({p}^{j}\):

$${z}^{j}=[{o}^{j};{p}^{j}]$$
(32)

where \({u}_{i}^{j}\) is the attention weight of the i-th feature in the j-th task, \({A}_{-j,i}\) denotes the feature set of the other tasks, and \(score\) is the attention score function. \({p}^{j}\) is the integrated cross-task attention feature. Finally, \({o}^{j}\) and \({p}^{j}\) are concatenated to form the final task feature vector \({z}^{j}\).

Finally, the task layer predicts essay scores through the output layer, which applies a sigmoid activation function to map the feature vector \({z}^{j}\) to the range [0,1]:

$${\widehat{y}}^{j}=sigmoid({W}_{y}^{j}{z}^{j}+{b}_{y}^{j})$$
(33)

where \({W}_{y}^{j}\) and \({b}_{y}^{j}\) are weights and biases, and \({\widehat{y}}^{j}\) is the predicted score for task j.
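The cross-task attention and scoring head of Eqs. (30)-(33) can be sketched as follows, assuming a dot-product score function (the paper does not specify one) and illustrative dimensions.

```python
# Sketch of cross-task attention plus the sigmoid scoring head (Eqs. 30-33).
import torch
import torch.nn as nn

class CrossTaskScorer(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.out = nn.Linear(2 * feat_dim, 1)      # W_y, b_y in Eq. (33)

    def forward(self, o_j, other_task_feats):
        # o_j: (feat_dim,) summary of task j; other_task_feats: (num_other, feat_dim)
        scores = other_task_feats @ o_j            # assumed dot-product score(o^j, A_-j,i)
        u = torch.softmax(scores, dim=0)           # Eq. (30)
        p_j = (u.unsqueeze(-1) * other_task_feats).sum(dim=0)   # Eq. (31)
        z_j = torch.cat([o_j, p_j], dim=-1)        # Eq. (32)
        return torch.sigmoid(self.out(z_j))        # Eq. (33): predicted score in [0, 1]

o_j = torch.randn(256)
others = torch.randn(3, 256)                       # features from three other tasks
print(CrossTaskScorer()(o_j, others))              # single value in (0, 1)
```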

In summary, the proposed HFC-AES model achieves multi-level collaborative modeling of essay content and topic semantics through two stages: topic-independent and topic-related feature extraction and modeling. It fully integrates shallow textual features with deep semantic features and combines shared and task-specific layers. Moreover, the cross-task attention mechanism enhances the model’s adaptability to semantic variations across topics, improving scoring accuracy. The next section evaluates the model’s performance on cross-topic AES tasks using multiple public datasets. Results on scoring effectiveness, ablation studies, and practical applications demonstrate the model’s effectiveness and usability.

Experimental design and performance evaluation

Dataset collection

This experiment uses the Automated Student Assessment Prize (ASAP) dataset, which contains a large number of English compositions primarily designed to evaluate students’ writing proficiency. The ASAP dataset includes eight distinct topics, each corresponding to a subset of compositions labeled with overall scores. To protect privacy and reduce scoring bias, sensitive information such as specific names and locations in the compositions is anonymized by replacing them uniformly with the placeholder “@entity.” Text preprocessing also involves removing non-standard characters and special symbols, and converting all letters to lowercase to minimize noise during model training. For text segmentation, the NLTK toolkit is employed to perform sentence- and word-level tokenization, supporting subsequent hierarchical semantic modeling. Given the differing scoring intervals across topics, all original scores are normalized to the range [0,1] to ensure fairness and comparability in cross-topic scoring. To prevent data leakage in cross-topic experiments, prompt words and keywords explicitly related to the composition topics are removed. This step avoids the model “cheating” by learning the prompt content directly and helps ensure the scoring model’s generalization truly reflects writing quality. The dataset is split into training and test sets in a 3:2 ratio. For each experiment, one topic serves as the test set while the remaining seven topics form the training set. Model performance is evaluated using five-fold cross-validation.
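An illustrative preprocessing sketch following these steps is shown below; the character-filtering regular expression and the example score range are assumptions for demonstration only.

```python
# Illustrative preprocessing: character cleanup, lowercasing, NLTK tokenization,
# and min-max score normalization onto [0, 1] (assumed details, not the paper's code).
import re
import nltk  # requires nltk.download('punkt') on first use

def preprocess(essay: str) -> list[list[str]]:
    essay = re.sub(r"[^\x20-\x7E]", " ", essay)           # drop non-standard characters
    essay = essay.lower()
    sentences = nltk.sent_tokenize(essay)                 # sentence-level segmentation
    return [nltk.word_tokenize(s) for s in sentences]     # word-level segmentation

def normalize_score(score: float, min_score: float, max_score: float) -> float:
    # Map a prompt-specific score range onto [0, 1] for cross-topic comparability.
    return (score - min_score) / (max_score - min_score)

tokens = preprocess("Dear @entity, I believe computers benefit society.")
print(tokens, normalize_score(8, 2, 12))
```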

Experimental environment and parameters setting

Table 3 lists the software, hardware, and development environment used in this experiment, along with key parameter settings. Model parameters are primarily determined using empirical rules and optimized based on validation set performance.

Table 3 Experimental environment and parameter settings.

The Quadratic Weighted Kappa (QWK) is used as the evaluation metric. It measures the agreement between the model’s scores and those of human raters, applying a quadratic penalty to scoring deviations. QWK is calculated as follows:

$$QWK=1-\frac{\sum\:_{i=1}^{N}\sum\:_{j=1}^{N}{w}_{ij}{O}_{ij}}{\sum\:_{i=1}^{N}\sum\:_{j=1}^{N}{w}_{ij}{E}_{ij}}$$
(34)

N is the total number of rating levels. \({O}_{ij}\) is the number of essays with actual score i and predicted score j. \({E}_{ij}\) is the expected frequency computed from the marginal distributions of the human and predicted scores. \({w}_{ij}\) is a weight based on the score difference, for which the squared-difference weight is typically adopted:

$${w}_{ij}=\frac{{(i-j)}^{2}}{{(N-1)}^{2}}$$
(35)

QWK ranges over [−1,1], where 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate agreement worse than chance.
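In practice, QWK as defined in Eqs. (34) and (35) can be computed with scikit-learn's quadratically weighted Cohen's kappa; the labels below are illustrative only.

```python
# QWK via scikit-learn's quadratic-weighted Cohen's kappa (example labels only).
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 4, 1, 3, 2, 4]
model_scores = [2, 3, 3, 4, 1, 2, 2, 4]

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(round(qwk, 3))
```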

Performance evaluation

(1) Comparison of model scoring results.

To evaluate the effectiveness of the HFC-AES model, this study compared it with five established AES models: the Hierarchical Attention Model (Hi-att)27, Co-attention28, Two-stage Deep Neural Network (TDNN)29, Siamese Enhanced Deep Neural Network (SEDNN)30, and Cross-Task Scoring Model (CTS)31. Hi-att and Co-attention target single-topic scoring, while TDNN, SEDNN, and CTS address cross-topic scoring. To further strengthen the results, two additional mainstream Transformer-based models were included: the BERT-based AES model (BERT-AES) and the GPT-based generative AES model (GPT-AES). BERT-AES uses multi-task fine-tuning to emphasize sentence-level semantic consistency, while GPT-AES incorporates prompt information and generates scoring predictions by producing rating sequences. Figure 5 presents the QWK results of all models in cross-topic evaluation.

Fig. 5

Comparison of QWK values of various models in cross-topic scenes.

Figure 5 shows that in cross-topic AES, single-topic models such as Hi-att and Co-attention perform worse than cross-topic AES models. Among all models, HFC-AES achieves the highest performance, with an average QWK of 0.856, surpassing other cross-topic approaches and confirming its effectiveness. GPT-AES and BERT-AES achieve mean QWK scores of 0.810 and 0.791, respectively, outperforming traditional RNN and CNN models but still falling short of HFC-AES. These results indicate that while Transformer architectures excel at feature extraction, HFC-AES gains further advantages through structural optimization and cross-task modeling. Its multi-level semantic modeling and accurate topic-related feature extraction enhance the alignment between compositions and prompts. By integrating the task and shared layers with a cross-task attention mechanism, the model effectively handles semantic differences between topics, improving cross-topic scoring accuracy. Furthermore, the joint optimization mechanism enhances robustness and scoring consistency, enabling HFC-AES to achieve superior performance. To investigate the reasons behind HFC-AES’s performance advantage, a comparison was conducted with two-stage cross-topic AES models, TDNN and SEDNN, focusing on QWK results for pre-scoring compositions in the topic-independent stage, as shown in Fig. 6.

Fig. 6

QWK Comparison of Three Cross-Topic Models in the Topic-Independent Stage.

Figure 6 shows that the HFC-AES model achieved a higher QWK than TDNN and SEDNN in pre-scoring compositions during the topic-independent stage. Its average QWK across eight prompts was 0.769, outperforming TDNN (0.546) and SEDNN (0.681). These results confirm that the first stage of HFC-AES is critical for improving pre-scoring quality. Unlike the comparison models, HFC-AES better incorporates prompt information, leading to stronger cross-topic performance. In the topic-related stage, HFC-AES again performed best. Its hierarchical neural network structure effectively captured the complex semantic relationships between compositions and prompts. By extracting general semantic features in the shared layer and emphasizing topic-relevant information in the task layer, the model improved topic alignment and scoring accuracy. The Bi-LSTM and attention mechanisms further enhanced contextual modeling and feature extraction, enabling superior results in cross-topic scoring.

To assess the model’s generalization across different writing types and datasets, supplementary experiments were conducted on the publicly available TOEFL11 and International Corpus of Learner English (ICLE) datasets. TOEFL11 contains compositions from 11 groups of non-native English speakers, and ICLE consists of academic texts from multiple non-English-speaking countries. To ensure robust and unbiased evaluation, tenfold cross-validation with repeated verification was applied to each dataset to minimize overfitting. Figure 7 reports the QWK scores of all models.

Fig. 7

The QWK evaluation results of each model on TOEFL11 and ICLE datasets.

Figure 7 presents the QWK evaluation results of all models on the TOEFL11 and ICLE datasets. HFC-AES consistently outperformed the comparison models, achieving a QWK of 0.852 on TOEFL11 and demonstrating strong adaptability to non-native English writing. In contrast, traditional models such as Hi-att and Co-attention delivered lower accuracy and weaker consistency, highlighting the superiority of HFC-AES in handling compositions from diverse linguistic backgrounds. These results confirm the model’s robust generalization capability, particularly in scoring tasks involving non-native writers.

To further examine performance differences in real scoring scenarios, a qualitative error analysis was conducted on representative samples. Table 4 lists three compositions with their prompts, human-assigned scores, model predictions, and explanations for scoring discrepancies.

Table 4 Sample analysis of the differences between model and human scoring.

The discrepancies primarily stem from the model’s limited capacity to interpret rhetorical devices, nuanced tone, and deeper reasoning. Essays with complex structures or implicit meaning were more prone to misjudgment. This highlights an area for improvement: integrating advanced discourse reasoning modules or fine-tuning pre-trained language models at the discourse level to enhance recognition of implicit semantics and rhetorical strategies.

(2) Ablation experiments.

Systematic ablation experiments were designed to evaluate the contributions of different features and mechanisms in the HFC-AES model to scoring performance. Two categories were examined: feature-level ablation (discourse structure, topic-independent features, and topic-related features) and mechanism-level ablation (e.g., attention mechanisms). Each feature or mechanism was removed individually and in combination to assess its impact on performance.

For feature-level ablation, four configurations were tested: (a) structural features: discourse structure features removed; (b) topic-independent features: all topic-independent features removed; (c) topic-related features: all topic-related features removed; (d) structural + topic-independent features: both discourse structure and topic-independent features removed. The results are presented in Fig. 8.

Fig. 8

Results of the feature-level ablation experiment.

Figure 8 shows that each feature type contributes differently to model performance. Removing discourse structure features reduces the average QWK to 0.827, confirming their value in capturing overall organization and logical coherence. The impact is greater when topic-independent features are excluded, with the QWK dropping to 0.765, highlighting the importance of basic linguistic indicators such as vocabulary and syntax in modeling text complexity and writing style. Eliminating topic-related features results in a similar decline, with the QWK decreasing to 0.770, underscoring their role in assessing how well a composition aligns with its prompt. The largest drop occurs when both discourse structure and topic-independent features are removed, with the QWK falling to 0.735. This demonstrates that each feature type supports the others: removing one weakens performance, and removing both amplifies the effect. For example, tasks using Prompts 1–4 show the steepest degradation under this combination. Compared to the full HFC-AES model, which achieves an average QWK of 0.856, this ablation produces a 0.121 loss, emphasizing the need for diverse feature inputs. Overall, these results confirm that discourse structure, basic linguistic features, and topic-semantic matching work together to enable accurate scoring.

The next step evaluates the role of the attention mechanism. Three configurations are tested: (a) attention: removal of the attention mechanism; (b) attention + topic-related features: removal of both the attention mechanism and topic-related features; (c) structural + topic-independent + attention: removal of discourse structure features, topic-independent features, and the attention mechanism. The outcomes are shown in Fig. 9.

Fig. 9

Results of mechanism-level ablation experiments.

Figure 9 shows that removing the attention mechanism alone lowers the model’s average QWK to 0.818, only a slight decrease from the complete model. This indicates that the attention mechanism, though secondary, still contributes meaningfully, especially in handling compositions with complex structures or inter-sentence relationships. Its impact becomes more pronounced when combined with other features. For instance, removing both the attention mechanism and topic-related features reduces the average QWK to 0.792, a much larger drop than removing either alone, highlighting their interdependence. The attention mechanism enhances the modeling of semantic alignment between compositions and prompts, ensuring accurate topic matching. When discourse structure, topic-independent features, and the attention mechanism are all removed, the QWK further falls to 0.778, resulting in a loss of 0.078 compared with the full model (0.856). Performance declines are especially evident in Prompts 4 and 7, which require high-level semantic abstraction and contextual reasoning. Prompt 4 involves balancing ethical concerns and scientific progress, often using metaphors, concessions, and dual-argument structures that demand strong semantic and structural comprehension. Prompt 7 calls for critical analysis of social phenomena and technological impacts, with frequent logical reasoning and subjective expression. Without topic-related features and the attention mechanism, the model struggles to determine whether a composition stays focused on the prompt, reducing scoring consistency.

Overall, the attention mechanism is not the sole determinant of performance, but its synergy with semantic features significantly improves topic understanding and contextual semantic capture, making it a vital component in cross-topic scoring. To further evaluate the HFC-AES model under different feature configurations, additional ablation experiments were conducted on shallow learning (SL) and DL features. By removing each type separately, the SL-only and DL-only models were obtained, and their effects on cross-topic composition scoring are presented in Fig. 10.

Fig. 10

Experiments on ablation with different feature types.

Figure 10 shows that the overall scoring performance of the HFC-AES model drops when either shallow learning (SL) or DL features are removed. The average QWK decreases to 0.821 without SL features and to 0.812 without DL features, both lower than the complete model’s 0.856. This indicates that both feature types are essential for accurate scoring. Shallow features, such as word frequency, sentence length, and syntactic diversity, provide intuitive and stable indicators of linguistic complexity and writing style, helping the model assess basic language quality. DL features, by contrast, capture richer semantic representations and contextual relationships through neural networks, improving the model’s ability to evaluate semantic coherence and logical flow. Together, they form a complementary multi-level semantic representation of each composition, making their joint use a key factor in achieving high-precision scoring. Figures 8 and 9 further reveal that among all features, topic-related features have the greatest impact. Removing them lowers the average QWK from 0.856 to 0.827, with marked performance drops on Prompts 4 and 7. These features directly model the semantic alignment between compositions and their prompts—an especially challenging aspect of cross-topic scoring—allowing the model to more accurately judge topical relevance. In HFC-AES, this is accomplished through bidirectional LSTM and attention mechanisms in the task layer, substantially improving scoring consistency and accuracy across topics.

To enhance interpretability, attention weight distributions and feature importance were further analyzed to provide deeper insights into the model’s decision-making. Table 5 presents the attention weights assigned to specific features across different scoring dimensions.

Table 5 Interpretability analysis of attention mechanisms.

To further reveal how the model assigned attention weights within specific texts, the attention distribution for the sentence “College education should be free so that everyone can access knowledge. However, the government needs a sustainable plan to fund it.” was visualized. The visualization is shown in Fig. 11.

Fig. 11

Visualization of attention weight distribution for the example sentence.

The intra-sentence attention distribution reveals that the model assigns higher weights to phrases like “sustainable plan” and “government needs,” indicating its focus on the practical feasibility issues raised in the essay. This focus is crucial for evaluating the logical completeness of argumentative writing. However, the attention on the phrase “should be free so that everyone can access knowledge” is more dispersed, reflecting the model’s lower sensitivity to idealistic or emotional expressions compared to factual statements. This difference further highlights the model’s limitation in handling subjective stances and shifts in tone. This word-level visualization based on attention aids in explaining specific scoring discrepancies and represents a promising direction for improving model interpretability. The feature importance assessment quantifies the contribution of each feature to the scoring decisions. The results are presented in Table 6.

Table 6 Feature importance assessment.

Tables 5 and 6 reveal that, in the interpretability analysis of the attention mechanism, the model places greater emphasis on organizational structure during scoring. This suggests that the HFC-AES model prioritizes logical coherence and structural quality when evaluating compositions. Regarding feature importance, grammatical and semantic features hold significant weight, underscoring their critical role in determining final scores. In contrast, discourse structure shows relatively lower importance, possibly due to its reduced influence in certain composition types. These findings indicate that the model’s scoring decisions largely depend on grammar and semantic quality, while its attention to organizational structure supports effective assessment of coherence and logical consistency.

(3) Influence of the cross-attention mechanism on the scoring model.

The HFC-AES model incorporates a cross-attention mechanism to evaluate both overall composition quality and specific scoring dimensions, including semantics, grammar, vocabulary usage, and organizational structure. The impact of this mechanism on overall scoring performance is assessed, with results presented in Fig. 12.

Fig. 12

Feature Weight Distribution for Topic 1 in Predicting Overall and Individual Scores.

Figure 12 illustrates how the HFC-AES model dynamically adjusts the weight assigned to various features for different scoring tasks after incorporating the cross-attention mechanism. This adjustment notably enhances scoring accuracy and consistency. For the overall composition score, the cross-attention mechanism allocates weights thoughtfully across scoring dimensions. Semantic and grammatical features receive weights of 0.159 and 0.168, respectively, highlighting the model’s emphasis on semantic coherence and grammatical accuracy—aligning well with human scoring criteria. Vocabulary usage is weighted at 0.133, reflecting its importance in scoring, particularly in terms of diversity and precision. When predicting the organizational structure score, the mechanism concentrates the majority of the weight (0.173) on organizational features, significantly down-weighting other aspects. This selective focus enables the model to prioritize key features relevant to specific scoring tasks, thereby improving the accuracy of individual dimension scores. In summary, the cross-attention mechanism allows the model to flexibly reweight features depending on the scoring task, enhancing the precision and rationale of composition evaluation.

(4) Practical application of the HFC-AES model.

To evaluate the practical utility of the HFC-AES model, its automatic scoring results are compared with human evaluator scores. This comparison helps verify the model’s accuracy and feasibility in real-world settings. Figure 13 presents this comparison, with scores normalized to a maximum of 100 points.

Fig. 13

Comparison of Human Score and HFC-AES model score.

Figure 13 shows that the differences between the HFC-AES model’s scores and human ratings are minimal, with most errors falling within a 3-point range. This indicates that the HFC-AES model closely approximates human scoring standards, making it well-suited for practical automatic composition scoring tasks. While minor discrepancies may occur in individual cases, the model generally performs reliably, effectively supporting automatic scoring needs in real-world applications and demonstrating strong feasibility and potential.

To further assess the model’s practical applicability, its processing time was evaluated by measuring the average scoring time per composition. All experiments were conducted on a consistent hardware and software platform. Comparative models included HFC-AES, TDNN, SEDNN, CTS, BERT-AES, and GPT-AES. The results are summarized in Table 7.

Table 7 Comparison of processing time in model rating stage (Unit: seconds/article).

Table 7 shows that the processing time of the HFC-AES model is slightly longer than that of traditional DL models. This is primarily due to its integration of shallow features, deep semantic representations, discourse structure information, and a multi-module collaborative training mechanism. However, its processing time remains significantly shorter than that of GPT-AES and BERT-AES, which rely on large-scale pre-trained models and suffer from considerable time bottlenecks in practical applications due to their vast parameter sizes and complex inference procedures. Overall, HFC-AES achieves a strong balance between high scoring accuracy and acceptable processing efficiency, making it well-suited for scenarios that demand precise grading. In practical educational settings, this means the HFC-AES model can score approximately 69 essays per minute. For instance, in a medium-sized high school where 3,000 essays need to be graded in a single exam, the model can complete the task within 45 min. This level of performance offers a feasible and effective solution for classroom assessments, online writing platforms, and large-scale standardized testing.

Discussion

In summary, the proposed HFC-AES model integrates shallow textual features with DL representations in a two-stage framework that includes both topic-independent and topic-related feature extraction and modeling. This design significantly improves scoring consistency and robustness compared to existing approaches. For example, Li et al. (2023) developed an AES method that combined multi-scale features with Sentence-BERT embeddings and shallow linguistic and topic-related features, achieving a QWK of 0.79332. Wang (2023) extracted semantic features via CNN and LSTM and topic features through TF-IDF, which resulted in a neural network-based AES model with a QWK of 0.81633. Dhini et al. (2023) proposed an AES model based on semantic and keyword similarity using Sentence Transformers; by incorporating multilingual Paraphrase-Multilingual-MiniLM-L12-V2 and DistilBERT-Base-Multilingual-Cased-V1 models, their approach improved evaluation scores by 0.2 points34. In contrast, this model enhances the understanding of composition content and semantics and strengthens the robustness and adaptability of topic information through a cross-task attention mechanism. Consequently, it offers a more comprehensive and effective technical solution for intelligent evaluation in English language teaching.

In practical applications, computational efficiency is crucial for automated scoring systems deployed at scale. This study evaluates the HFC-AES model’s performance in processing thousands of essays in near real-time. On a single GPU machine, the model achieves an inference throughput of approximately 200 compositions per minute, satisfying the demands of most online education platforms. To increase throughput further, distributed computing and data parallelism can be employed to distribute scoring tasks across multiple servers for near-linear acceleration. Additionally, asynchronous batch processing can substantially improve overall system capacity while maintaining scoring latency within seconds. These features meet the low-latency, high-concurrency requirements of large-scale educational environments. To address scenarios with limited computing resources, lightweight model alternatives are explored. Recent advances in DL have produced compressed pretrained models like DistilBERT and TinyBERT, which maintain strong semantic understanding while greatly reducing parameter counts and computational overhead. These distilled models can be efficiently deployed on edge devices or resource-constrained classroom settings. By integrating the HFC-AES multi-stage feature fusion strategy with these lightweight models as substitutes for deep semantic extractors, the system retains high scoring accuracy while lowering latency and computational costs. This makes the scoring system more practical for large-scale real-world education. Future work will focus on systematically evaluating and optimizing these lightweight versions to further enhance the model’s applicability in educational contexts.

Although the HFC-AES model demonstrates strong efficiency and scoring consistency, deploying automated scoring systems raises important ethical concerns. The model may place excessive emphasis on surface-level features like language fluency and syntactic accuracy, potentially undervaluing creativity and critical thinking. This could lead to a bias favoring style over substance. Moreover, compositions reflecting significant differences in gender, cultural background, or language variants risk being unfairly scored due to imbalances in the training data, which can introduce algorithmic bias. To address these issues, future work should focus on enhancing training mechanisms to promote diversity, inclusiveness, and fairness—for example, by integrating fairness correction modules and improving the recognition and understanding of non-standard linguistic expressions. Additionally, quality control should be enforced through manual audits and human-in-the-loop processes to ensure that automated systems complement rather than fully replace human evaluators, thus mitigating risks of misuse or overreliance on technology.

Conclusion

Research contribution

This study proposes a cross-topic automatic English composition scoring model, HFC-AES, which integrates DL features with shallow text features through both topic-independent and topic-related feature extraction. The model aims to enhance the accuracy and reliability of English composition scoring, thereby improving the evaluation of English language teaching effectiveness. Experimental results validate the model’s effectiveness, leading to the following conclusions: (1) In cross-topic composition scoring, HFC-AES achieves the best performance with an average QWK of 0.856, surpassing other cross-topic models. Its pre-scoring QWK also outperforms TDNN and SEDNN, indicating a key role in improving pseudo-data quality. (2) Ablation experiments reveal that discourse structure features significantly impact argumentative composition scoring, with a 13.32% drop in score when these features are removed. In contrast, removing the attention mechanism has a relatively minor effect. Overall, the HFC-AES model excels when both text structure features and the attention mechanism are included, underscoring their importance in cross-topic scoring. (3) The introduction of the cross-attention mechanism substantially enhances scoring performance, aligning model predictions more closely with human judgments. By combining DL and shallow text features, HFC-AES demonstrates clear advantages in cross-topic English composition scoring, providing robust technical support and a practical foundation for evaluating English teaching effectiveness. The introduction of HFC-AES not only represents a performance breakthrough in automated scoring but also opens new possibilities for fairness, consistency, and efficiency in English teaching assessment. Traditional manual grading suffers from subjectivity and scalability limitations, while HFC-AES shows strong potential to reshape educational assessment through data-driven approaches. Its adaptability and scalability in scoring cross-topic, multi-task, and linguistically complex essays highlight its broad applicability. Moreover, the model’s interpretability modules offer transparent, visualized evidence for teachers, facilitating applications in instructional feedback, writing assistance, and educational diagnostics. This paves the way for future “human–machine collaborative” educational assessment.

Future works and research limitations

Although the HFC-AES model has demonstrated strong performance in cross-topic English composition scoring, several limitations remain. First, its generalization to different genres—such as narrative, reflective, or creative writing—needs improvement. These genres often feature nonlinear structures, subjective experiences, and emotional expression, which may not align well with the model’s current focus on structure and logical coherence. Future work could explore genre-adaptive modules or multi-genre scoring branches to enhance flexibility and applicability. Second, while preliminary consideration has been given to multilingual extension, the model’s potential cultural biases, linguistic preferences, and adaptation to geographically diverse corpora have not been systematically examined. Future research should emphasize fairness by integrating sociolinguistic and educational assessment theories to evaluate scoring consistency across students from varied socioeconomic and educational backgrounds. Strategies such as balanced training data, fairness-aware regularization, and bias mitigation techniques will be critical for reducing potential disparities. Third, despite its superior performance, the model’s computational demands remain relatively high, posing challenges for deployment in resource-constrained school settings or large-scale online examination platforms. Subsequent efforts will focus on model compression, knowledge distillation, and the development of lightweight, edge-compatible versions to improve efficiency and practical usability. Finally, to address trust and transparency issues inherent in automated scoring, future work may explore “human–machine hybrid scoring systems” where model outputs serve as decision-support tools or initial screening aids for human graders. This approach could safeguard scoring quality while enhancing efficiency and feedback speed, facilitating deeper integration of AI technologies into educational practice and advancing the intelligent transformation of composition assessment.