Introduction

Machine Reading Comprehension (MRC), as a core task in natural language processing, aims to enable machines to accurately answer questions based on a given text. In recent years, with the rapid advancement of deep learning techniques, MRC has achieved significant progress in both dataset construction and model development. Depending on the answer format, MRC tasks can be categorized into four types: cloze-style (e.g. CNN & Daily Mail dataset1), multiple-choice (e.g. RACE dataset2), span-extraction (e.g. SQuAD dataset3), and open-ended generation (e.g. MS MARCO dataset4). Among these, Multiple-Choice Reading Comprehension (MCRC) requires models to identify the correct answer from a set of given options. Often, the correct answer cannot be directly extracted from the passage but requires reasoning over the deep semantic relationships between the passage, question, and options, thus imposing higher demands on models’ semantic understanding and reasoning capabilities.

MCRC techniques have demonstrated significant applicability across multiple interdisciplinary domains. For instance, in intelligent processing of business English, these techniques have been successfully applied to scenarios such as communication and signal processing, as well as biomedical literature analysis, providing critical technological support for improving efficiency in international trade and cross-lingual collaboration. The advantages are twofold: first, they help resolve terminological ambiguities in formal texts such as legal contracts (e.g. differences in interpretation of FOB and CIF clauses across legal systems5); second, they enable inference of implicit intentions in cross-cultural business communication (e.g. indirect expressions commonly used in East Asian business emails6), thereby supporting decision-making and communication.

To systematically evaluate models’ performance on these capabilities, researchers widely adopt the RACE and DREAM multiple-choice reading comprehension datasets as benchmarks. RACE (Lai et al.2), derived from English exams for Chinese middle and high school students, is currently the largest English MCRC dataset and covers texts from multiple disciplines. DREAM7, on the other hand, centers on multi-turn dialogues and simulates authentic business interactions. Figure 1 illustrates a representative example from the RACE dataset along with its structural characteristics. The two datasets are complementary in their task characteristics: RACE features higher reasoning complexity and a dense use of domain-specific terminology, aligning closely with standardized tests that assess deep text comprehension and logical reasoning; DREAM, by contrast, leverages its dialogue-based structure to faithfully replicate authentic communicative scenarios in business English (such as client negotiations and email correspondence), emphasizing practical language application and interactive skills.

Fig. 1

An example from the RACE dataset, where the bold text indicates the correct answer.

In recent years, pre-trained language models represented by BERT8, RoBERTa9, and ALBERT10 have achieved remarkable results on multiple-choice reading comprehension (MCRC) tasks. By pre-training on large-scale corpora, these models can effectively capture deep linguistic features, thereby demonstrating superior performance in MCRC tasks. The Option Comparison Network (OCN) proposed by Ran et al. identifies correlations between answer options through word-level comparisons, thereby facilitating reasoning in MCRC. DCMN+11 employs a bidirectional matching mechanism that considers not only the relevance between the question and the passage but also the relevance between the options and the passage, providing a more comprehensive modeling of the semantic relationships among the passage, the question, and the candidate options. The MMM framework12 adopts a multi-stage task learning strategy: in the first stage, external data are used for coarse-tuning to enhance the model’s reasoning ability; in the second stage, fine-tuning is performed on MCRC datasets to further improve task-specific performance.

All of the above matching-based models rely on pre-trained models for semantic encoding of the context, considering both the correct answer and distractors simultaneously. This approach, however, limits the model’s ability to accurately capture differences between options:

  1. When options exhibit high surface-level textual similarity but significant semantic differences (e.g. near-homographs like “affect” vs. “effect”), the model often struggles to discern the semantic distinctions and tends to rely on shallow linguistic patterns, leading to incorrect selection.

  2. Although current pre-trained models can effectively capture textual features, they face difficulties in providing correct answers when certain candidate options lack explicit support in the passage or question.

This study aims to more accurately capture the differences between answer options and to simulate the human cognitive process in solving MCRC tasks. We propose a Contrastive Learning-driven Hierarchical Attention Multiple-Choice model (CL-HAMC). When humans tackle MCRC tasks, the process typically involves multi-level cognitive operations: first, semantic encoding and integration of the text are performed to establish a textbase and a situation model; subsequently, fine-grained relational reasoning occurs between the question and the options, wherein subtle differences among options are compared, distractor representations are suppressed, and the semantic association of the correct option is reinforced.

Inspired by this cognitive mechanism, the proposed model leverages the multi-head attention13 module in pre-trained language models to explicitly model interactions among the passage, question, and options at multiple levels—including word-level, phrase-level, and semantic-relation-level. Additionally, a contrastive learning mechanism is introduced: by constructing positive and negative sample pairs, the representation of the question is pulled closer to the correct option and pushed farther from incorrect options, thereby better capturing discriminative features among options.

This approach not only structurally aligns with the hierarchical cognitive strategy humans use for option comparison and reasoning, but also explicitly models option differences in the representation space via contrastive learning, enhancing the model’s discriminative capability in MCRC tasks. The contributions of this work can be summarized as follows:

  1. We propose CL-HAMC, which effectively enhances the model’s information extraction capability by improving the identification of both correlations and distinctions among answer options.

  2. The model simulates the human cognitive process in solving MCRC tasks, employing a hierarchical attention mechanism to progressively focus on the interactions among passages, questions, and options, thereby significantly mitigating errors caused by highly similar distractors.

  3. Extensive experiments were conducted on multiple benchmark datasets, including RACE, RACE-M, RACE-H, and DREAM. The results demonstrate substantial performance improvements, validating the model’s effectiveness and generalization capability.

Related work

Multiple-choice machine reading comprehension

With the release of large-scale multiple-choice reading comprehension (MCRC) datasets and the rapid advancement of pre-trained models, MCRC has evolved from traditional approaches to deep learning paradigms, achieving substantial performance gains. Early works, such as the introduction of attention mechanisms by Bahdanau et al.14, provided new perspectives for natural language processing and were subsequently applied to MCRC. Huang et al.15 proposed a co-matching framework that employed dual attention and hierarchical LSTMs to integrate information across multiple sentences, significantly enhancing cross-sentence reasoning, yet it lacked explicit modeling of inter-option contrasts. Subsequent studies have further refined representation alignment and discriminative power: Kim et al.16 progressively eliminate demonstrably incorrect choices, whereas Zhang et al.11 jointly optimize key-sentence selection and fine-grained option comparison.

Later research focused on enhancing semantic understanding and reasoning precision through multi-stage, multi-task, and multi-head collaborative mechanisms. Jin et al.12’s multi-stage, multi-task framework mitigated the scarcity of annotated data in low-resource settings; and Zhu et al.17’s dual multi-head collaborative attention model performed global-local interactions across passages, questions, and options, yielding further gains in reasoning accuracy. However, these approaches generally suffer from complex architectures, high computational cost, and optimization challenges.

Recent efforts highlight the critical role of data and training strategies in performance improvement. Longpre et al.18 demonstrated the impact of data quality on MCRC outcomes; Li et al.19 proposed next-generation dataset construction methods; Bai et al.20 introduced a multi-role collaborative data selection framework that effectively reduces misclassification of distractors. Collectively, existing research can be categorized into four areas: attention and matching mechanisms, option interaction modeling, data and training strategies, and pre-training paradigm enhancements. Nonetheless, common limitations remain, including insufficient fine-grained modeling of option differences, heavy reliance on high-quality data, and challenges in model complexity and optimization.

Contrastive learning

Contrastive Learning (CL), initially applied in computer vision, is an unsupervised approach that learns feature representations by pulling similar samples closer while pushing dissimilar samples apart. SimCLR21, a seminal contrastive learning framework, constructs positive-negative sample pairs via data augmentation and leverages the InfoNCE loss function to learn rich feature representations from these pairs. SimCSE22 first introduced contrastive learning to natural language processing, learning textual sentence embeddings based on SimCLR. Its core methodology maximizes similarity between semantically related sentence pairs while minimizing similarity between unrelated pairs, thereby acquiring sentence embeddings with semantic coherence. ArcCon23 developed an angular space contrastive learning approach, which captures angular relationships between sentences through angle-constrained cosine similarity and further refines sentence representations using <anchor, positive, negative> triplets. PairSCL24 proposed a supervised sentence-level contrastive learning method that employs cross-attention mechanisms to derive joint representations of sentence pairs, narrowing distances between same-class pairs and expanding distances between different-class pairs through contrastive objectives. Shorinwa et al.25 systematically reviewed uncertainty quantification methods for large language models (LLMs) and highlighted that semantic similarity–driven contrastive learning (e.g. Semantic-Similarity UQ) can effectively mitigate misclassification of highly similar distractors. Our work integrates contrastive learning without extra information augmentation, precisely capturing distinctions among options to significantly enhance the model’s discriminative capacity.

Methodology

Problem definition

Given a triplet \(\langle P, Q, O \rangle\), where \(P = \{P_1, P_2, \ldots , P_m\}\) represents a passage composed of m sentences, Q denotes a question related to the passage, and \(O = \{O_1, O_2, \ldots , O_n\}\) is an option set containing n candidate answers. The model’s objective is to learn a probability distribution function: \(F(O_1, O_2, \ldots , O_n \mid P, Q)\).

CL-HAMC model

The model framework proposed in this work is illustrated in Fig. 2, and consists of three core components: a context encoder, a contrastive learning–driven hierarchical attention module, and a decoder. The context encoder first transforms the input natural language text—including passages, questions, and candidate options—into machine-interpretable semantic feature representations, providing the foundation for subsequent deep analysis and reasoning. Building upon these encoded features, the contrastive learning–driven hierarchical attention module simulates the cognitive process of humans performing MCRC. In the first stage, a hierarchical attention mechanism captures interactions among passages, questions, and options across multiple granularities, from word-level to sentence-level and semantic-level. Subsequently, a contrastive learning strategy is introduced to enlarge the representational distance between correct and incorrect options, thereby enhancing the model’s sensitivity to fine-grained differences among options. Finally, the decoder integrates the multi-level, multi-granularity information extracted in the preceding stages and conducts comprehensive comparison and reasoning over all candidate options to produce the final answer.

Fig. 2

The architecture diagram of the CL-HAMC model.

Encoder

Pretrained models incorporate extensive knowledge acquired from large out-of-domain corpora, which allows them to capture contextual semantic information efficiently during text encoding. We therefore employ a pretrained model as the encoder to encode passages, questions, and options into fixed-dimensional feature vectors:

$$\begin{aligned} \textbf{h}^p&= \text {Encoder}\,(P), \end{aligned}$$
(1)
$$\begin{aligned} \textbf{h}^q&= \text {Encoder}\,(Q), \end{aligned}$$
(2)
$$\begin{aligned} \textbf{h}^o&= \text {Encoder}\,(O), \end{aligned}$$
(3)

where \(\text {Encoder}(\cdot )\) outputs the feature representation from the final linear layer of the pretrained model. Here \(\textbf{h}^p \in \mathbb {R}^{|P| \times l}\), \(\textbf{h}^q \in \mathbb {R}^{|Q| \times l}\), and \(\textbf{h}^o \in \mathbb {R}^{|O| \times l}\) denote the embedding vectors of the passage, question, and options respectively. The terms |P|, |Q|, and |O| represent input sequence lengths, while l denotes the hidden dimension of the pretrained model.
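The following minimal PyTorch sketch illustrates the encoding step in Eqs. (1)–(3), assuming a HuggingFace-style interface and the ALBERT-xxlarge checkpoint reported in the experimental setup; the function and variable names are illustrative rather than the authors’ released code.

```python
# Illustrative sketch of the encoding step (Eqs. 1-3); checkpoint and names are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-xxlarge-v2")
encoder = AutoModel.from_pretrained("albert-xxlarge-v2")

def encode(text: str) -> torch.Tensor:
    """Return token-level features h in R^{|text| x l} from the encoder's last layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (sequence length, l)

h_p = encode("The passage text ...")       # h^p
h_q = encode("What does the author ...?")  # h^q
h_o = encode("Option A text")              # h^o
```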

Contrastive learning-driven hierarchical attention module

Hierarchical attention module

Simulating human cognition in MCRC tasks—locating relevant contextual information before selecting answers from options—our modular architecture employs stackable hierarchical attention modules to extract deep features of \(\langle P, Q, O \rangle\) (default \(k=1\) layer). Passage-question fusion \(\textbf{h}^{pq}\) is derived via Eq. (4), while question-option fusion \(\textbf{h}^{qo}\) via Eq. (5):

$$\begin{aligned} \textbf{h}^{pq}= & \text {MLP}\left( [\textbf{h}^p;\textbf{h}^q] \right),\end{aligned}$$
(4)
$$\begin{aligned} \textbf{h}^{qo}= & \text {MLP}\left( [\textbf{h}^q;\textbf{h}^o] \right). \end{aligned}$$
(5)

Cross-level semantic interactions are captured through attention mechanisms: Eq. (6) computes relevance with \(\textbf{h}^{pq}\) as Query and \(\textbf{h}^o\) as Key/Value, while Eq. (7) symmetrically processes \(\textbf{h}^{qo}\) and \(\textbf{h}^p\):

$$\begin{aligned} \text {Attn}(\textbf{h}^{pq},\textbf{h}^o)= & \text {softmax}\left( \frac{\textbf{h}^{pq} (\textbf{h}^o)^\top }{\sqrt{d_k}} \right) \textbf{h}^o, \end{aligned}$$
(6)
$$\begin{aligned} \text {Attn}(\textbf{h}^{qo},\textbf{h}^p)= & \text {softmax}\left( \frac{\textbf{h}^{qo} (\textbf{h}^p)^\top }{\sqrt{d_k}} \right) \textbf{h}^p. \end{aligned}$$
(7)

Multi-head attention enriches representations by parallel computation and concatenation:

$$\begin{aligned} \text {head}_i= & \text {Attn}\left( \textbf{h}^{pq}\textbf{W}_i^Q, \textbf{h}^o\textbf{W}_i^K, \textbf{h}^o\textbf{W}_i^V \right), \end{aligned}$$
(8)
$$\begin{aligned} \text {MHA}(\textbf{h}^{pq},\textbf{h}^o)= & \text {Concat}(\text {head}_1,...,\text {head}_h) \textbf{W}^F, \end{aligned}$$
(9)
$$\begin{aligned} \textbf{MHA}_1= & \text {MHA}(\textbf{h}^{pq},\textbf{h}^o), \quad \textbf{MHA}_2 = \text {MHA}(\textbf{h}^{qo},\textbf{h}^p), \end{aligned}$$
(10)

where MLP(\(\cdot\)) outputs \(\mathbb {R}^{d_{\text {model}} \times l}\) features. Learnable parameters include: \(\textbf{W}_i^Q \in \mathbb {R}^{d_{\text {model}} \times d_q}\), \(\textbf{W}_i^K \in \mathbb {R}^{d_{\text {model}} \times d_k}\), \(\textbf{W}_i^V \in \mathbb {R}^{d_{\text {model}} \times d_v}\) per head, and fusion matrix \(\textbf{W}^F \in \mathbb {R}^{h \cdot d_v \times d_{\text {model}}}\). This hierarchical multi-head design significantly enhances MCRC performance.
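To make the interaction flow concrete, the sketch below implements one hierarchical attention layer following Eqs. (4)–(10), using PyTorch’s built-in multi-head attention. The concatenation axis, hidden sizes, and activation are assumptions reflecting one plausible reading of the equations, not the authors’ released implementation.

```python
import torch
import torch.nn as nn

class HierarchicalAttentionLayer(nn.Module):
    """Sketch of one hierarchical attention layer (Eqs. 4-10); dimensions are assumptions."""

    def __init__(self, hidden: int, n_heads: int = 8):
        super().__init__()
        # MLPs applied after concatenating the token sequences (Eqs. 4-5)
        self.mlp_pq = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))
        self.mlp_qo = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))
        # Multi-head cross-attention blocks (Eqs. 6-10)
        self.mha_pq_o = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.mha_qo_p = nn.MultiheadAttention(hidden, n_heads, batch_first=True)

    def forward(self, h_p, h_q, h_o):
        # h_p: (B, |P|, l), h_q: (B, |Q|, l), h_o: (B, |O|, l)
        h_pq = self.mlp_pq(torch.cat([h_p, h_q], dim=1))  # passage-question fusion, Eq. (4)
        h_qo = self.mlp_qo(torch.cat([h_q, h_o], dim=1))  # question-option fusion, Eq. (5)
        # h^{pq} attends to the options (Eqs. 6, 8-10); h^{qo} attends to the passage (Eq. 7)
        mha1, _ = self.mha_pq_o(query=h_pq, key=h_o, value=h_o)
        mha2, _ = self.mha_qo_p(query=h_qo, key=h_p, value=h_p)
        return mha1, mha2  # MHA_1 and MHA_2 of Eq. (10)
```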

Contrastive learning module

Contrastive learning enhances feature discrimination by comparing inter-sample similarities and differences. To capture nuanced distinctions among options, we integrate contrastive learning as follows: we first apply mean pooling to the MHA outputs to obtain feature vectors g (Eqs. 11 and 12). For each sample i in batch \(\mathcal {B}\), we construct positive pairs \((g_1^i, g_2^i)\) and treat the other samples in the batch as negatives. The contrastive loss \(\mathcal {L}_{cl}\) is computed via InfoNCE:

$$\begin{aligned} g_1= & \text {MeanPool}(\textbf{MHA}_1), \end{aligned}$$
(11)
$$\begin{aligned} g_2= & \text {MeanPool}(\textbf{MHA}_2), \end{aligned}$$
(12)
$$\begin{aligned} \mathcal {L}_{cl}= & -\log \frac{ \exp \left( \text {sim}(g_1^i, g_2^i) / \tau \right) }{ \frac{1}{2} \sum _{\begin{array}{c} k \in \mathcal {B} \\ k \ne i \end{array}} \left( \exp \left( \text {sim}(g_1^i, g_1^k)/\tau \right) + \exp \left( \text {sim}(g_2^i, g_2^k)/\tau \right) \right) }, \end{aligned}$$
(13)

where \(\text {sim}(\cdot )\) computes cosine similarity, \(\tau\) is a temperature hyperparameter, and \(k \ne i\) enforces negative sampling.
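A batch-wise implementation of Eq. (13) could look like the following sketch, assuming cosine similarity via L2-normalized dot products and the negative set exactly as written in the equation; it is illustrative, not the authors’ code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(g1: torch.Tensor, g2: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Sketch of the batch-wise contrastive loss in Eq. (13).
    g1, g2: (B, l) mean-pooled features of MHA_1 and MHA_2."""
    g1 = F.normalize(g1, dim=-1)
    g2 = F.normalize(g2, dim=-1)
    pos = torch.exp((g1 * g2).sum(-1) / tau)          # exp(sim(g1_i, g2_i) / tau)
    sim11 = torch.exp(g1 @ g1.t() / tau)              # exp(sim(g1_i, g1_k) / tau)
    sim22 = torch.exp(g2 @ g2.t() / tau)              # exp(sim(g2_i, g2_k) / tau)
    mask = ~torch.eye(g1.size(0), dtype=torch.bool, device=g1.device)  # enforce k != i
    neg = 0.5 * ((sim11 * mask).sum(-1) + (sim22 * mask).sum(-1))
    return (-torch.log(pos / neg)).mean()
```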

The fused representation \(C_i\) combines dual-path features for final prediction (Eq. 14), with three fusion strategies ablated in Sect. 4.3 (Fig. 3):

$$\begin{aligned} C_i = \text {Fuse}(g_1^i; g_2^i), \end{aligned}$$
(14)

where \(C_i \in \mathbb {R}^l\), and \(\text {Fuse}(\cdot )\) denotes element-wise product, concatenation, or MLP-based fusion.
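The three fusion variants can be sketched as follows; the 512-dimensional hidden layer follows the setting reported in the fusion-strategy ablation below, while the module structure itself is an assumption.

```python
import torch
import torch.nn as nn

class Fuse(nn.Module):
    """Sketch of Eq. (14): element-wise product, concatenation, or MLP-based fusion."""

    def __init__(self, hidden: int, mode: str = "mlp"):
        super().__init__()
        self.mode = mode
        # MLP fusion with a 512-dimensional hidden layer (see the fusion-strategy ablation)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, 512), nn.GELU(), nn.Linear(512, hidden))

    def forward(self, g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
        if self.mode == "product":
            return g1 * g2                              # element-wise product
        if self.mode == "concat":
            return torch.cat([g1, g2], dim=-1)          # concatenation
        return self.mlp(torch.cat([g1, g2], dim=-1))    # MLP-based fusion
```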

Fig. 3

Structure diagram of the fusion method.

Decoder

We compute probability distributions over candidate options, where \(O_i\) denotes the ith option, \(C_i\) represents the output features of the \(\langle P, Q, O_i \rangle\) triplet, and \(O_k\) indicates the correct answer. The predictive loss is defined in Eq. (15), while the final objective function combines it with contrastive loss via weighted summation (Eq. 16) to drive model training:

$$\begin{aligned} \mathcal {L}(O_k | P, Q)= & -\log \frac{ \exp (\textbf{W}^{\top } \textbf{C}_k) }{ \sum _{i=1}^{n} \exp (\textbf{W}^{\top } \textbf{C}_i) }, \end{aligned}$$
(15)
$$\begin{aligned} \mathcal {L}_{\text {total}}= & \mathcal {L}(O_k | P, Q) + \lambda \mathcal {L}_{cl}, \end{aligned}$$
(16)

where \(\textbf{W} \in \mathbb {R}^l\) is a learnable parameter, n is the number of candidate options, and \(\lambda\) is a weighting hyperparameter.
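A compact sketch of the scoring and joint objective in Eqs. (15) and (16) is given below, assuming the fused option features are stacked into a single tensor; the cross-entropy call is equivalent to the negative log-softmax of the correct option.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the answer scoring and total loss (Eqs. 15-16); names are illustrative."""

    def __init__(self, hidden: int):
        super().__init__()
        self.w = nn.Linear(hidden, 1, bias=False)  # W in Eq. (15)

    def forward(self, c: torch.Tensor, gold: torch.Tensor, l_cl: torch.Tensor, lam: float = 0.2):
        # c: (B, n, l) fused features C_i for n options; gold: (B,) index of the correct option
        logits = self.w(c).squeeze(-1)          # (B, n) scores W^T C_i
        l_pred = F.cross_entropy(logits, gold)  # Eq. (15)
        return l_pred + lam * l_cl              # Eq. (16): total loss
```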

Experimental results

The RACE (Reading Comprehension from Examinations) dataset2, a canonical benchmark for MCRC tasks, comprises English examination questions from Chinese secondary schools. It is stratified by difficulty into two subsets: RACE-M for middle school and RACE-H for high school. As detailed in Table 1, the complete dataset contains 27,933 passages and 97,687 questions, with RACE-M consisting of 7139 passages and 28,293 questions, while RACE-H includes 20,794 passages and 69,394 questions. Data splits allocate 90% to training, 5% to validation, and 5% to test sets.

DREAM (Dialogue-based REAding comprehension exaMination) is a dialogue-oriented multiple-choice reading comprehension dataset proposed by Sun et al.7. The dataset is collected from real English language examinations and contains 6444 dialogues and 10,197 associated questions. Unlike RACE, DREAM has several distinguishing characteristics:

  1. Dialogue structural complexity: each dialogue contains an average of 5.2 turns, with answers often distributed across multiple turns;

  2. Diverse reasoning patterns: 84% of the answers require context-dependent inference, and 34% of the questions rely on external commonsense knowledge (e.g. answering “Sam might attend the party” requires understanding the implicit commitment conveyed by “promised to come”);

  3. Dynamic options: the number of options per question is not fixed (averaging 3.3 options per question) and includes many colloquial near-synonyms (e.g. distinguishing between “likely” and “probably”).

Regarding data usage, this study strictly follows the data split methodology proposed by Zhu et al.17 (specific proportions and composition are detailed in Table 1) to ensure the comparability and reproducibility of experimental settings.

Table 1 Data composition and distribution of RACE and DREAM subsets.

Evaluation metrics

In multiple-choice machine reading comprehension (MCRC) tasks, accuracy serves as the primary evaluation metric:

$$\begin{aligned} \text {acc} = \frac{N^+}{N}, \end{aligned}$$
(17)

where \(N^+\) denotes correct predictions and N the question count.
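For completeness, Eq. (17) amounts to the following one-liner over predicted and gold option indices.

```python
def accuracy(preds, golds):
    """Eq. (17): fraction of questions answered correctly."""
    return sum(int(p == g) for p, g in zip(preds, golds)) / len(golds)
```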

Experimental environment

ALBERT-xxlarge encoder with \(k=2\) hierarchical attention modules; PyTorch implementation; 10 training epochs; learning rate \(2\times 10^{-5}\); dropout of 0.1 per layer; \(\tau =0.05\), \(\lambda =0.2\); LAMB optimizer26 (batch size 8, 2000-step warmup); 4 NVIDIA RTX 4090 Ti GPUs.
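Collected as a configuration dictionary, the reported settings would read roughly as follows (the checkpoint identifier is an assumption).

```python
# Illustrative configuration mirroring the reported hyperparameters; not the authors' config file.
config = {
    "encoder": "albert-xxlarge-v2",  # assumed checkpoint name
    "num_attention_layers": 2,       # k
    "epochs": 10,
    "learning_rate": 2e-5,
    "dropout": 0.1,
    "temperature": 0.05,             # tau
    "cl_weight": 0.2,                # lambda
    "optimizer": "LAMB",
    "batch_size": 8,
    "warmup_steps": 2000,
}
```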

Results and analysis

This section conducts a comprehensive comparative analysis between our proposed model and state-of-the-art multiple-choice reading comprehension (MCRC) approaches. To ensure experimental reliability, we perform 10 independent training runs with different random seeds, reporting the averaged results as final performance metrics. This methodology minimizes random fluctuations and enables precise assessment of model capability.

For the UDG-LLM27 model, we use the BERTScore settings from the original paper to perform semantic matching between the model outputs and the reference answers. Specifically, for each candidate option, we compute its BERTScore with the reference answer, and the option with the highest similarity is taken as the model’s prediction. If this option matches the correct reference answer, the prediction is considered correct, and the model’s overall accuracy is subsequently calculated.
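A sketch of this matching step using the public bert_score package is given below; the call follows the library’s documented interface, but the exact scoring settings of the original UDG-LLM evaluation, and this particular reading of the matching procedure, are assumptions.

```python
# Hypothetical sketch: pick the candidate option whose BERTScore F1 against the
# generated answer is highest; settings from the original paper are assumed.
from bert_score import score

def pick_option(model_output, options):
    """Return the index of the option most similar to the generated answer."""
    cands = [model_output] * len(options)
    _, _, f1 = score(cands, options, lang="en", verbose=False)
    return int(f1.argmax())
```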

Table 2 Accuracy (%) comparison on RACE benchmarks.

As evidenced in Table 2, the proposed model demonstrates superior performance across all RACE difficulty levels. Compared to the state-of-the-art (SOTA) ALBERT-xxlarge+DUMA, it achieves accuracy improvements of +1.1%, +0.4%, and +0.7% on RACE-M, RACE-H, and the full RACE dataset respectively, establishing its effectiveness for MCRC tasks. Crucially, versus the base ALBERT-xxlarge encoder, the model attains more substantial gains of +3.3%, +3.5%, and +2.7% on these benchmarks. These results indicate that by hierarchically extracting semantic features and discerning nuanced option disparities, the model not only enhances semantic representation but also significantly boosts generalization capability, validating its efficacy for complex reading comprehension scenarios.

Table 3 Accuracy (%) comparison on DREAM benchmarks.

As shown in Table 3, on the DREAM dataset, the proposed method achieves a 2.3% improvement in accuracy over the baseline ALBERT-xxlarge model, demonstrating a significant advantage over existing approaches. Compared with the state-of-the-art UDG-LLM model, our method exhibits a slight performance gap (1.27% lower), which may be attributed to UDG-LLM’s use of larger or more complex model parameters. Future work could explore integrating CL-HAMC with UDG-LLM to further exploit the potential of the MCRC task. The experimental results validate the effectiveness of our approach in handling complex dialogue structures and semantic reasoning, confirming its capability to achieve advanced semantic understanding and inference within conversational contexts.

Ablation experiment

To assess the contributions of individual components, we conducted ablation studies on the RACE dataset, focusing on the hierarchical attention module, the contrastive learning module, and the integration strategy. Systematic experiments verified the effectiveness of each module and associated parameters in enhancing semantic feature extraction and discriminating correlations among candidate options.

Context encoder module

To evaluate the impact of different pre-trained language models as the context encoder, we conducted comparative experiments using BERT-large, RoBERTa-large, XLNet-large, and ALBERT-xxlarge while keeping all other modules unchanged. This setup systematically investigates the influence of encoder architecture on overall model performance. Detailed experimental results are presented in Table 4.

Table 4 Performance of the proposed model with different pre-trained encoders on the RACE dataset (%).

It can be clearly observed from Table 4 that the ALBERT-xxlarge+CL-HAMC model achieves the best performance, attaining 92.3% accuracy on RACE-M, 89.0% on RACE-H, and 90.1% on the overall RACE dataset. This result significantly outperforms other CL-HAMC–based models, surpassing XLNet-large+CL-HAMC by 2.7% on RACE-M, 0.7% on RACE-H, and 0.6% on the full RACE dataset. These findings indicate that the combination of ALBERT-xxlarge with the CL-HAMC module not only fully leverages the strong semantic representation capability of ALBERT-xxlarge but also further enhances the model’s ability to comprehend complex textual structures and multiple-choice questions through the CL-HAMC module.

Hierarchical attention module

For the hierarchical attention module, this section experimentally validates the two hierarchical attention paths, namely passage-question pairs attending to candidate options and question-option pairs attending to the passage, while keeping all other modules unchanged. Table 5 reports the effect on the model of removing the different hierarchical attention components.

Table 5 Impact of removing hierarchical attention components.

The full model incorporates all hierarchical attention components. Table 5 demonstrates that ablating individual interaction modules degrades overall performance by 1.6% and 1.3% respectively, while removing all hierarchical attention causes a significant 2.8% decline. This confirms that our hierarchical attention mechanism enhances cross-level feature representations among passages, questions, and options, effectively capturing critical semantic relationships.

As shown in Fig. 4, we vary the stacking depth of the attention layers (k = 1 to 6) on the RACE validation and test sets. Optimal performance occurs at k = 2 layers, beyond which performance degrades sharply (particularly for k > 5). This degradation may stem from vanishing gradients or overfitting during training.

Fig. 4

Performance of different numbers of hierarchical attention layers on the RACE dataset.

Contrastive learning module

To further investigate the impact of the contrastive learning module on model performance, we conducted systematic ablation experiments on the loss weighting coefficient (\(\lambda\)). Keeping all other components fixed, the hyperparameter was varied over \(\lambda \in \{0, 0.2, 0.4, 0.8\}\). The results are illustrated in Fig. 5, where the training curves represent the overall loss trends and the validation curves indicate the convergence of the Variation of Information (VOI) metric. The shaded areas reflect the variance observed across different experimental settings.

Fig. 5

Convergence of training and validation curves under different \(\lambda\) values. Solid lines indicate training loss and validation results (VOI); shaded regions represent variance. \(\sigma _{\textrm{train}}\) and \(\sigma _{\textrm{valid}}\) denote the mean variances of the training and validation curves, respectively.

Figure 5 illustrates the effect of the weighting parameter \(\lambda\) on model convergence and stability. When \(\lambda = 0\), both training and validation curves exhibit pronounced fluctuations, with the largest variance. Settings of \(\lambda = 0.4\) and 0.8 improve stability compared to \(\lambda = 0\), but remain inferior to \(\lambda = 0.2\). Notably, \(\lambda = 0.2\) achieves the best performance, with faster convergence, minimal variance after approximately 50k iterations, and the most stable trend. This indicates that a moderate contrastive learning weight enhances feature discrimination and generalization, whereas excessive \(\lambda\) causes the contrastive loss to dominate, weakening the main task and inducing instability. Therefore, \(\lambda = 0.2\) provides the optimal balance between the main task and contrastive learning.

While keeping all other module parameters unchanged, we further extracted the association features between context–question pairs and candidate options, and visualized them as a correlation matrix heatmap (Fig. 6). In the figure, color intensity is proportional to the strength of the feature correlations, providing an intuitive illustration of the semantic association differences among options with respect to the context–question pairs.

Fig. 6

Impact of contrastive learning modules on model performance. (a) With contrastive learning module. (b) Without contrastive learning module.

Figure 6 illustrates the feature correlations between candidate options and the passage–question pairs. For Question 1, heatmap (a) shows pronounced differences among options, with option D exhibiting the highest correlation, which corresponds to the predicted answer. In contrast, heatmap (b) indicates option C as having the highest correlation, while the distinctions among options are less marked. These results suggest that incorporating contrastive learning enhances the model’s ability to discriminate fine-grained differences among candidate options.

For Question 2, the required answer cannot be directly inferred from the passage. Nevertheless, by leveraging inter-option differences and integrating passage and question information, the model successfully identifies the relevant candidate. Specifically, it recognizes that sentence 14 (“horseless carriage”) was invented by Henry Ford, and sentence 16 describes the establishment of the “Ford Motor Company,” resulting in option B having the highest correlation and being correctly predicted. These findings demonstrate the model’s capacity for reasoning beyond explicit textual evidence by capturing subtle semantic distinctions among options.

Fusion strategy

Fusion strategies play a critical role in the final output features of the model. This paper conducts experiments on three fusion methods: multiplication, concatenation, and linear layer fusion. Specifically, multiplication employs element-wise multiplication for feature fusion; concatenation directly joins two features to form the output feature vector; linear layer fusion utilizes a Multi-Layer Perceptron (MLP) to extract abstract semantic information from feature vectors, with its hidden layer dimension set to 512. Figure 7 demonstrates the performance of different fusion strategies on RACE-M, RACE-H, and RACE datasets.

Fig. 7

Performance of different fusion strategies on RACE-M, RACE-H, and RACE, where \(\sigma ^2\) denotes the variance of accuracy.

As shown in Fig. 7, the model employing the MLP integration strategy outperforms the other two strategies in terms of accuracy, while also exhibiting lower variance and higher stability. Although the introduction of the MLP integration incurs some computational overhead, it achieves the optimal trade-off between performance, stability, and accuracy. The modest increase in computational cost is justified by the significant improvement in performance and enhanced stability, making the MLP integration strategy the chosen solution for the final model.

Conclusion

We identify two core challenges in intelligent multiple-choice comprehension in English education: discriminating highly confounding options and inferring implicit answers. To address these challenges, we propose CL-HAMC, a Contrastive Learning-driven Hierarchical Attention Multiple-Choice model, which simulates human-like progressive reasoning in business contexts through hierarchical interactions among text, questions, and options, while leveraging contrastive learning to enhance sensitivity to subtle semantic differences between near-synonymous terms.

Experiments on the RACE benchmark dataset demonstrate that CL-HAMC outperforms state-of-the-art methods, achieving up to 0.7% improvement in accuracy, and also exhibits strong performance on the DREAM dataset, confirming the synergistic effectiveness of hierarchical attention and contrastive learning. Our work provides a scalable solution for automated English education, with particular applicability to standardized test item generation and AI-driven question-answering systems. Future research will focus on developing cross-domain adaptation frameworks for business knowledge (e.g. finance and law) and constructing interactive tutoring systems with real-time adaptive feedback capabilities.