Introduction

The global education landscape is expanding rapidly, with student enrollments rising faster than the capacity of institutions to support them. At the same time, a persistent shortage of qualified teachers continues to widen the gap between instructional demand and available human resources. UNESCO reports that the world will require an additional 44 million teachers by 2030 to meet universal education goals1. Even among Organisation for Economic Co-operation and Development (OECD) countries, the average student–teacher ratio remains around 14:1 in primary and 13:1 in lower secondary education, reflecting the challenge of providing individualized attention at scale2. This growing demand intensifies the workload on teachers, who already spend a significant portion of their time on non-teaching tasks such as grading and administrative work. Studies show that teachers devote nearly half of their working hours to such responsibilities, including several hours dedicated to manually evaluating student responses3,4. Under these conditions, manual assessment often becomes inconsistent, time-consuming, and susceptible to fatigue-induced errors. Auto-Evaluation-Systems (AES) have emerged as a practical response to these challenges by offering scalable, consistent, and efficient mechanisms for evaluating student responses. Rather than depending on simple pattern matching or keyword lookup, an AES compares a student's response with the teacher's answer based on semantic similarity and evaluates it accordingly.

A core requirement in AES is a model that accurately measures sentence similarity, enabling the AES to understand the textual relationship between a student's response and the teacher's answer and evaluate it accordingly. Existing tools for measuring sentence similarity, referred to as Sentence-Similarity-Tools (SST), estimate semantic similarity by generating embeddings (vectors) with pretrained transformer models such as BERT and Sentence-BERT (SBERT) and then computing the cosine similarity between these embeddings. SSTs tend to perform well on sentences without negation words, but they struggle when negation words (such as not, cannot, or do not) are present. This is because the embeddings do not adequately represent the impact of negation: a sentence and its negated counterpart are placed very close to each other in the representation space, yielding a high similarity score. This becomes especially problematic in AES, where a sentence like “The theory is not applicable here” should not be considered similar to “The theory is applicable here”; treating them as similar leads to unfair scoring. In addition, there are cases where sentences with identical words differ sharply in meaning based solely on the position of the negation word. For example, “The system is not entirely stable” expresses a minor issue, while “The system is entirely not stable” signals a major failure. Such differences arise because the position of the negation word relative to the other words in the sentence can fundamentally change its meaning.

Further, the meaning of negated sentences (sentences that contain negation words) is affected by factors beyond word position. The number of negation words in a sentence can also alter its meaning: an odd count of negation words typically reverses the meaning, whereas an even count may neutralize or dilute it. For instance, “not impossible” suggests possibility, while “not not acceptable” only partially removes the negative tone. Context also plays a key role in measuring similarity: “The method is not accurate for extreme cases” describes a specific limitation, whereas “The method is not accurate” indicates a general problem. When these aspects are overlooked, SSTs inaccurately treat negation as a superficial change rather than a substantial semantic modification. To address this persistent limitation of current SSTs in capturing the semantic effect of negation, which often leads to inaccurate AES scores, we propose a novel model, the “Negation-Aligned-Similarity Scorer (NAS Scorer)”, which computes a new score in place of the usual similarity measure to capture the impact of negation. It employs a hybrid semantic similarity framework that combines embeddings from BERT, RoBERTa, SBERT, DistilBERT, and Word2Vec, enhanced through a bidirectional LSTM and fine-tuned with a dataset curated for this purpose.
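The even/odd intuition above can be made concrete with a small illustrative sketch (not part of the proposed model); the negation-word list here is a minimal assumption, and real cue inventories are larger:

```python
# Illustrative sketch of even/odd negation parity (assumed cue list,
# not the paper's implementation).
NEGATION_WORDS = {"not", "no", "never", "cannot", "nothing", "nobody"}

def negation_count(sentence: str) -> int:
    """Count surface negation cues in a whitespace-tokenized sentence."""
    tokens = sentence.lower().replace(".", "").split()
    return sum(1 for t in tokens if t in NEGATION_WORDS)

def polarity_reversed(sentence: str) -> bool:
    """Odd negation counts typically invert polarity; even counts cancel."""
    return negation_count(sentence) % 2 == 1
```

For example, `polarity_reversed("The system is entirely not stable.")` returns `True` (one cue, odd), while `polarity_reversed("This is not not acceptable.")` returns `False` (two cues, even), matching the parity behavior described above.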

The rest of the paper is organized as follows: Section “Related work” reviews the state of the art of SSTs with respect to negation sentences. Section “Methodology” describes the methodology of the Negation-Aligned-Similarity Scorer. Section “Experimentation” describes the curation of the dataset along with the experimental setup. In Section “Results”, we compare the performance of the NAS Scorer with baseline approaches and present an ablation study. In Section “Conclusion”, we conclude the paper with a summary of our work, its limitations, and future directions.

Related work

Research related to SSTs can be categorized as follows: semantic similarity, embedding fusion, negation interpretation, and contextual modeling or structured encoders. In this section, we review the state of the art along these categories.

Semantic similarity and embedding models

Zhou et al.5 proposed Word Confusion, a classifier-confusion–based similarity measure based on an encoder based classifier instead of the cosine similarity to better capture contextual and asymmetric semantic relations. Although effective at the word level, it does not address negation or sentence-level polarity shifts.

Liu et al.6 introduced an embedding model fine-tuned on a dataset with soft-negative samples to produce appropriate embeddings for soft-negative sentences. Similarly, Wu et al.7 proposed a model that controls false negatives through masking and positive reassignment, whereas Liu et al.6 also proposed enhancing semantic discrimination through fine-tuning with hard negative samples (samples that are difficult to distinguish). Yet none of these models handles negation sentences exhaustively.

Qiu et al.8 introduced a hybrid semantic embedding that fuses conceptual knowledge with weighted word embeddings to improve domain-aware semantics. Ijebu et al.9 proposed soft-cosine and extended-cosine adaptations for transformer embeddings, integrating a correlation matrix with the cosine similarity measure and a weighting factor for vector magnitudes, respectively, to overcome cosine saturation. Further, Li et al.10 developed an angle-optimized embedding that yields improved representations; yet these models do not handle negation properly.

Negation-sentence related models

Anschütz11 proposed NegBLEURT, a negation-aware evaluation metric fine-tuned with negation samples. While their approach improves robustness in general evaluation, it does not go beyond the lexical overlap of negation sentences. Okpala12 introduced a semantic negation-detection framework leveraging word-sense disambiguation and antonymization to restore polarity for negated words. Although effective for sentiment classification, the method handles only surface-level negation and struggles with structural or multi-clause constructions. Complementing these findings, Laverghetta Jr. and Licato13 analyzed transformer-based models using an NLI-based diagnostic suite, revealing systematic weaknesses in transformer reasoning across different negation types. Their work, however, provides analysis rather than a modeling solution.

At the syntactic level, Winaya et al.14 proposed a language-specific dependency-parser–guided negation-scope detection method, integrated with XLNet to improve sentiment prediction. Additionally, Lal et al.15 developed a WordNet-based negation-handling mechanism using antonym synsets to adjust polarity, yet this lexicon-driven approach does not generalize to broader semantic tasks involving complex sentences. Further, Biradar et al.16 explored domain-general negation handling to improve cross-domain sentiment classification, but this method focuses on classification and does not deal with similarity scoring.

Contrastive, siamese learning approaches

Siamese architectures and contrastive objectives play a central role in sentence similarity modeling, enabling pairwise comparison of sentence embeddings. Advancing this line of work, Alnajem et al.17 proposed a siamese neural network architecture that integrates sentence transformers with LSTM layers and backward-flow components for similarity detection. Although the model achieves high accuracy, it lacks mechanisms to capture semantic reversals that arise due to negation words. In a related effort, Chen et al.18 introduced SDQKC, a siamese BERT framework enhanced with dynamic QK co-attention and contrastive learning to estimate short-text similarity. Despite good performance relative to baseline models, it does not handle negation sentences.

Contextual modeling using Bi-LSTMs

Bi-LSTM architectures continue to hold relevance for capturing sequential ordering, local context, and compositional semantics in sentence-level tasks.

Sumanathilaka et al.19 proposed a Bi-LSTM based model, augmented with an attention mechanism and an advanced preprocessing pipeline, for robust emotion detection. Yet the approach does not help to identify semantic inversion caused by negation. Similarly, Lam et al.20 proposed an LSTM-based sentence representation model combining C-RNN and MLP components for sequential sentence classification in medical abstracts. However, their architecture is tailored to discourse-level structure and does not explicitly model polarity-sensitive semantics.

Mohebbi et al.21 introduced a deep graph learning framework with SRL-driven graphs and a DGNN architecture on top of transformer embeddings to refine semantic similarity computation. Although this integration of SRL and graph neural networks yields strong performance improvements, the model does not incorporate negation-scope or polarity-aware adjustments in its similarity scoring process.

Summarizing this section, we observe three key limitations of existing semantic similarity models. First, negation is seldom encoded explicitly at either the embedding or similarity-scoring level; models treat negation as a single lexical entity rather than a semantic operator that reverses meaning. Second, multi-representation models fail to correlate lexical polarity cues with contextual embeddings. Third, similarity metrics remain predominantly cosine-based, lacking the dynamic adjustments required for negation sentences.

Methodology

To address the shortcomings discussed earlier, we propose a novel model, the “Negation-Aligned-Similarity Scorer” (NAS Scorer), which computes a negation-aligned similarity score instead of the usual cosine similarity score. It integrates the fusion of multiple embeddings with a Bi-LSTM for contextualization, together with an additional mechanism that dynamically adjusts the similarity score by taking negation words into account. The NAS Scorer aims to provide robust and reliable similarity evaluation for sentence pairs containing negation sentences as well as non-negation sentences. The methodology of the NAS Scorer has two sequential phases. Phase 1 extracts deep, harmonized semantic features for each sentence using a siamese encoder architecture, while Phase 2 handles the impact of negation and computes a new score, the “negation-aligned-similarity score”, by training a Bi-LSTM with a dataset curated for this purpose. Together, these two phases, as shown in Fig. 1, enable the NAS Scorer to account for negation-driven semantic shifts that traditional embedding-based SSTs frequently overlook.

Fig. 1. Flow of Negation-Aligned-Similarity Scorer.

Each phase of the NAS Scorer consists of deterministic and learnable transformations that progressively compare the two input sentences \(S_1\) and \(S_2\) and compute a negation-aligned similarity score in the range [0, 100]. Because the model relies on supervised learning to capture how negation patterns influence perceived semantic similarity, it is essential to ground these transformations in a dataset that explicitly encodes such variation. We therefore first describe the Negation-Sentence-Similarity Dataset (NSSD), specifically curated for this purpose, which provides the annotated sentence pairs and negation-aware signals used.

Dataset description

Since our work is focused on AES, we have created a dataset, the “Negation-Sentence-Similarity-Dataset (NSSD)”, which consists of sentence pairs spanning four conceptual domains. Each sample contains two sentences, a human-annotated similarity score, negation-word counts for both sentences, and a categorical label describing the type of negation variation. Table 1 summarizes the domain coverage.

Table 1 Domains represented in the Negation-Sentence-Similarity Dataset (NSSD).

Each instance in NSSD is represented as a structured record containing all information necessary for negation-aware similarity modelling, as shown in Table 2. Every sample includes a domain label indicating whether the pair belongs to Operating Systems, Databases, Computer Networks, or Machine Learning, followed by the two sentences that form the pair. Alongside the raw text, the dataset stores a similarity score that reflects the semantic relationship after accounting for negation cues. To support explicit negation reasoning, the dataset also records the number of negation words appearing in each sentence, enabling the model to capture the asymmetry or imbalance in polarity between the pair.

Table 2 Sample instances from Negation-Sentence-Similarity-Dataset (NSSD).

Finally, each entry specifies a variation type, such as single negation, even or odd negation patterns, or conjunction–negation interactions, thereby making the dataset fully transparent and interpretable for downstream analysis. The dataset incorporates several categories of negation transformations, all derived algorithmically using even–odd negation behavior and conjunction-aware meaning shifts. These categories are listed in Table 3.

Table 3 Negation types used in NSSD.
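The per-sample record structure described above (domain, sentence pair, similarity score, negation counts, variation type) can be sketched as a simple data class; the field names below are hypothetical, since the paper specifies the fields but not their labels:

```python
from dataclasses import dataclass

# Hypothetical field names for an NSSD record; the paper defines the
# fields but not their exact column labels.
@dataclass
class NSSDRecord:
    domain: str          # one of: Operating Systems, Databases,
                         # Computer Networks, Machine Learning
    sentence1: str
    sentence2: str
    similarity: float    # human-annotated score in [0, 100]
    neg_count1: int      # negation words in sentence1
    neg_count2: int      # negation words in sentence2
    variation_type: str  # e.g. "single negation", "even/odd pattern"

# Illustrative sample (invented content, not taken from the dataset).
sample = NSSDRecord(
    domain="Operating Systems",
    sentence1="The scheduler preempts the running process.",
    sentence2="The scheduler does not preempt the running process.",
    similarity=20.0,
    neg_count1=0,
    neg_count2=1,
    variation_type="single negation",
)
```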

Phase 1: Fusion embedding and feature extraction

Phase 1 converts each input sentence into a fixed-size semantic vector of dimension 256. The phase is organized into two consecutive stages: Stage 1: Fusion and Stage 2: Feature Extraction.

All operations described below are applied concurrently to Sentence 1 and Sentence 2; the notation below describes the computation for a generic sentence S and the resulting vectors are denoted as \(f_1\) and \(f_2\) for the two sentences \(S_1\) and \(S_2\) respectively.

Stage 1: Fusion

We employ a heterogeneous ensemble of pretrained encoders because each contributes a distinct and complementary semantic perspective. BERT22 provides deep contextualized token-level representations that capture fine-grained, context-dependent meaning. SBERT23, a sentence-level transformer optimized for similarity estimation, produces stable embeddings well aligned with semantic comparison tasks. RoBERTa24 contributes a robust, masked-language-model-driven contextual view that often complements the representational patterns learned by BERT. DistilBERT24, through its knowledge-distillation process, yields embeddings with smoother semantic organization and reduced representational redundancy, supporting more abstract and generalized meaning extraction. Word2Vec26 contributes classical distributional semantics based on lexical co-occurrence patterns, supplying surface-level lexical structure that transformer models may under-emphasize.

Together, these five encoders reduce the blind spots inherent to any single representation family and produce a richer, more comprehensive fused semantic embedding.

For a sentence S, we extract one pooled embedding from each encoder and denote them as follows:

$$\begin{array}{lll} e_B \,\in \mathbb {R}^{768} & & \quad \text {(BERT pooled embedding)}, \\ e_S \,\in \mathbb {R}^{384} & & \quad \text {(SBERT (Mini-LM) sentence embedding)}, \\ e_D \,\in \mathbb {R}^{768} & & \quad \text {(DistilBERT pooled embedding)}, \\ e_R \,\in \mathbb {R}^{768} & & \quad \text {(RoBERTa pooled embedding)}, \\ e_W \,\in \mathbb {R}^{200} & & \quad \text {(Word2Vec mean embedding)}. \end{array}$$

The per-model sentence embedding is produced by each encoder independently. For transformer-based encoders, this is the mean-pooled representation, obtained by averaging token embeddings to produce one vector that captures the overall meaning of the sentence, while for Word2Vec it is the average of the word vectors that occur in the sentence. These five vectors are then combined by concatenation:

$$\begin{aligned} x \;=\; e_B \oplus e_S \oplus e_R \oplus e_D\oplus e_W, \end{aligned}$$
(1)

where \(\oplus\) denotes vector concatenation, and x represents the fused embedding vector that aggregates the semantic information contributed independently by BERT, SBERT, RoBERTa, DistilBERT and Word2Vec into a single high dimensional sentence representation. The concatenation operation stacks the five vectors end-to-end, producing a single fused vector x whose dimensionality is 2888, denoted as \(D_{\text {in}}\).

The purpose of this concatenation is twofold: (i) To preserve all embeddings produced by each encoder without discarding model-specific information, and (ii) To consolidate all semantic information into one vector that serves as input for the NAS Scorer’s later learning stages. Concatenation therefore serves as a loss-less fusion step, leaving it to the later layers to determine how the information should be integrated.
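As a shape-level sketch of Eq. (1), random vectors below stand in for the actual pooled encoder embeddings; only the dimensionalities match the model:

```python
import numpy as np

# Shape-level sketch of Eq. (1): concatenating the five pooled embeddings.
# Random vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
e_B = rng.standard_normal(768)   # BERT pooled embedding
e_S = rng.standard_normal(384)   # SBERT (MiniLM) sentence embedding
e_R = rng.standard_normal(768)   # RoBERTa pooled embedding
e_D = rng.standard_normal(768)   # DistilBERT pooled embedding
e_W = rng.standard_normal(200)   # Word2Vec mean embedding

x = np.concatenate([e_B, e_S, e_R, e_D, e_W])  # fused vector
assert x.shape == (2888,)  # D_in = 768 + 384 + 768 + 768 + 200
```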

Stage 2: Feature extraction

This stage reduces the high-dimensional fused vector x into a compact, sequential representation suitable for contextual summarization, and finally produces the 256-dim sentence embedding f. This stage consists of a learnable projection followed by sequential encoding with a bidirectional LSTM.

First, the fused vector \(x \in \mathbb {R}^{D_{\text {in}}}\) is passed through a two-layer learning projection (Neural-Network) with a pointwise non-linearity. This projection learning stage implements a dimensionality reduction and harmonization:

$$\begin{aligned} h \;=\; \textrm{ReLU}\big (W_{p}^{(2)}(\,\textrm{ReLU}(W_{p}^{(1)} x + b_{p}^{(1)})\,) + b_{p}^{(2)}\big ). \end{aligned}$$
(2)

In words, Eq. (2) applies a linear map \(W_{p}^{(1)}: \mathbb {R}^{D_{\text {in}}} \rightarrow \mathbb {R}^{p}\) followed by a ReLU activation, optional dropout, and a second linear map \(W_{p}^{(2)}: \mathbb {R}^{p} \rightarrow \mathbb {R}^{p}\) with ReLU, where p is the chosen projection dimension. We choose \(p=512\) to reduce redundancy in the fused multi-encoder representation while retaining task-critical semantic features. The resulting vector h is a compact latent representation of the fused sentence embedding, preserving the key semantic information needed for later processing. Thus, the projection maps \(D_{\text {in}} = 2888 \rightarrow p = 512\), and the ReLU activations introduce non-linearity, allowing the learning projection (Neural-Network) to learn rich, non-linear transformations of the concatenated features rather than merely performing a linear compression.

After passing through this learning projection, the vector \(h \in \mathbb {R}^{p}\) is reshaped into a short sequence to enable sequential modeling. Specifically, h is split into L equal-sized chunks so that

$$h \quad \longrightarrow \quad H \in \mathbb {R}^{L \times C},$$

where \(C = p / L\). Here, H denotes the reshaped sequence representation of the sentence S. This reshape does not create new information; it simply reorganizes the p-sized vectors of h into L pseudo-timesteps that the subsequent recurrent module can process. In our setup, typical choices are \(p=512\) and \(L=8\) (hence \(C=64\)), or equivalently \(p=512\) and \(L=16\) with \(C=32\) depending on the configuration; the implementation enforces that p is divisible by L.
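A minimal NumPy sketch of Eq. (2) and the subsequent reshape is given below; the weights are randomly initialized stand-ins, whereas the actual implementation uses trained PyTorch layers:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(1)
D_in, p, L = 2888, 512, 8
C = p // L  # 64 features per pseudo-timestep

# Randomly initialized stand-ins for the learned projection weights.
W1, b1 = rng.standard_normal((p, D_in)) * 0.01, np.zeros(p)
W2, b2 = rng.standard_normal((p, p)) * 0.01, np.zeros(p)

x = rng.standard_normal(D_in)          # fused embedding from Stage 1
h = relu(W2 @ relu(W1 @ x + b1) + b2)  # Eq. (2): 2888 -> 512
H = h.reshape(L, C)                    # 512 -> 8 x 64 pseudo-timesteps
```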

H is then processed by a bidirectional LSTM. The forward LSTM ingests the pseudo-timesteps in order and produces a final forward hidden state \(\overrightarrow{h_T}\in \mathbb {R}^{H_s}\), while the backward LSTM ingests the sequence in reverse and produces \(\overleftarrow{h_1}\in \mathbb {R}^{H_s}\), where \(H_s\) denotes the per-direction LSTM hidden size (for example \(H_s=128\)). These two directional summaries are complementary and jointly encode the full bidirectional context of the sentence. They are concatenated to obtain the final sentence representation:

$$\begin{aligned} \overrightarrow{h_T}&:\ \text {taken from the final timestep } T \text { (forward LSTM)},\\ \overleftarrow{h_1}&:\ \text {taken from the first timestep (backward LSTM)}. \end{aligned}$$
$$\begin{aligned} f \;=\; \overrightarrow{h_T} \oplus \overleftarrow{h_1}, \end{aligned}$$
(3)

so that

$$f \in \mathbb {R}^{2H_s} \quad \text {(example: }f\in \mathbb {R}^{256}\text { with }H_s=128).$$
Fig. 2. Phase 1: Fusion embedding and feature extraction.

To summarize the numerical flow for a single sentence S : the per-model pooled vectors concatenate to a fused vector \(x\in \mathbb {R}^{D_{\text {in}}}\), a learnable projection reduces and harmonizes this to \(h\in \mathbb {R}^{p}\), h is reshaped into L pseudo-timesteps \(H\in \mathbb {R}^{L\times C}\) (example \(L=8\), \(C=64\)), the BiLSTM processes the sequence and produces two directional summaries of size \(H_s\) each, and their concatenation yields the 256-dimensional sentence embedding f:

$$2888 \;\longrightarrow \; 512 \;\xrightarrow {\text {reshape}}\; 8\times 64 \;\xrightarrow {\text {BiLSTM}}\; 128+128 \;=\;256.$$

The complete sequence of operations described above is illustrated in Fig. 2.

Further, all parameters involved in this transformation, from the learning projection (Neural-Network) to the BiLSTM, are learned in a supervised manner using the custom-curated NSSD, ensuring that the dimensionality reduction and sequential encoding are optimized specifically for negation-sensitive semantic similarity. The entire Stage 1 (fusion) and Stage 2 (feature extraction) pipeline described above is applied concurrently to Sentence 1 and Sentence 2, producing the pair of final semantic vectors \(f_1,f_2\in \mathbb {R}^{256}\) which become the inputs to Phase 2.

Phase 2: Negation-aligned similarity estimation

Phase 2 converts the pair of 256-dimensional sentence embeddings produced in Phase 1 into a unified representation from which a negation-aware semantic similarity score is derived. The phase is organized into four sequential stages: Stage 1: Semantic-Interaction Based Composition, Stage 2: Negation Feature Construction, Stage 3: Feature Fusion and Projection, and Stage 4: Regression. Each stage uses the embedding pair \((f_1, f_2)\) from Phase 1, gradually adding contrastive, alignment, and negation-based information until a final “Negation-Aligned-Similarity” score is produced. All transformations described below are trained end-to-end using the custom-curated Negation-Sentence-Similarity Dataset (NSSD).

Stage 1 : Semantic-interaction-based composition

Phase 1 produces two dense semantic vectors,

$$f_1, f_2 \in \mathbb {R}^{256},$$

representing the contextual interpretations of sentences \(S_1\) and \(S_2\). These vectors arise from a BiLSTM encoder with hidden dimensionality 128 per direction, producing a final 256-dimensional sentence representation. The vectors \(f_1\) and \(f_2\) encode the sentence-level semantics that form the foundation for downstream similarity estimation.

To quantify semantic contrast, the model computes the absolute difference between the two embedding vectors:

$$d = |\,f_1 - f_2\,|.$$

Each component of d expresses how strongly the two sentences diverge along its corresponding semantic dimension, enabling the model to identify areas of disagreement in meaning. To capture semantic alignment, the model also computes the element-wise product:

$$p = f_1 \odot f_2,$$

where the operator \(\odot\) denotes element-wise multiplication. Where both embeddings exhibit parallel activation patterns, p takes large values, indicating shared semantic characteristics. Together, these four vectors \(f_1\), \(f_2\), d, and p span the semantic-interaction space from which similarity is derived. Their combined dimensionality is:

$$\underbrace{256}_{f_1} + \underbrace{256}_{f_2} + \underbrace{256}_{d} + \underbrace{256}_{p} = 1024.$$

These four components together form the core similarity feature space.
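The interaction features can be sketched directly, with random vectors standing in for the Phase 1 embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
f1 = rng.standard_normal(256)  # Phase 1 embedding of S1 (random stand-in)
f2 = rng.standard_normal(256)  # Phase 1 embedding of S2 (random stand-in)

d = np.abs(f1 - f2)   # element-wise semantic contrast
p = f1 * f2           # element-wise semantic alignment

core = np.concatenate([f1, f2, d, p])  # core similarity feature space
assert core.shape == (1024,)
```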

Stage 2 : Negation feature construction

Neural embedding models frequently underrepresent the logical and structural effects of negation. To address this limitation, NAS Scorer constructs a compact three-dimensional negation-aware vector \(n \in \mathbb {R}^3\) from the tokenized sentences \(S_1\) and \(S_2\). Each component of n introduces polarity-sensitive information that semantic embeddings alone do not reliably encode.

The first component measures polarity asymmetry by comparing the negation count of the two sentences:

$$n_1 = \big |\, \text {neg}(S_1) - \text {neg}(S_2) \,\big |.$$

Here, \(\text {neg}(S)\) denotes the number of negation words in sentence S. A difference in negation count often signals a difference in semantic polarity. For instance,

\(S_1\): The system is running correctly. (0 negations)

\(S_2\): The system is not running correctly. (1 negation)

$$n_1 = |0 - 1| = 1.$$

The second component captures a parity mismatch:

$$n_2 = {\left\{ \begin{array}{ll} 1, & \text {if } (\text {neg}(S_1) \bmod 2) \ne (\text {neg}(S_2) \bmod 2), \\ 0, & \text {otherwise}, \end{array}\right. }$$

reflecting whether the sentences undergo different patterns of polarity reversal or not. Odd negation counts typically invert polarity, whereas even counts cancel and preserve it. For instance,

\(S_1\): The task is not complete. (1 negation, odd)

\(S_2\): The task is not impossible to complete. (2 negations, even)

$$n_2 = 1.$$

The third component indicates whether negation-words occupy comparable structural positions in both sentences:

$$n_3 = {\left\{ \begin{array}{ll} 1, & \text {if negations in } S_1 \text { and } S_2 \text { occur in comparable positions}, \\ 0, & \text {otherwise}. \end{array}\right. }$$

This factor informs the model whether negation modifies comparable semantic regions across the pair. For instance, the sentences with negations in aligned positions look like:

  • \(S_1\): The boy did not pass the exam.

  • \(S_2\): The boy did not clear the test.

$$n_3 = 1.$$

The sentences with negations in misaligned positions look like:

  • \(S_1\): The boy did not pass the exam.

  • \(S_2\): The boy passed the not easy exam.

$$n_3 = 0.$$

Together, the components of n supplement the semantic–interaction space with explicit polarity information that is otherwise implicit or absent in the neural embeddings.
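The three components can be sketched as follows; the positional test for \(n_3\) is approximated here by comparing relative token positions within a tolerance, which is our own assumption, since the exact criterion is implementation-specific:

```python
# Illustrative sketch of the negation feature vector n = [n1, n2, n3].
# The cue list and the positional heuristic for n3 are assumptions.
NEGATION_WORDS = {"not", "no", "never", "cannot"}

def neg_positions(sentence):
    """Relative positions (0..1) of negation cues in the sentence."""
    tokens = sentence.lower().replace(".", "").split()
    return [i / max(len(tokens) - 1, 1)
            for i, t in enumerate(tokens) if t in NEGATION_WORDS]

def negation_features(s1, s2, tol=0.1):
    p1, p2 = neg_positions(s1), neg_positions(s2)
    n1 = abs(len(p1) - len(p2))                  # polarity asymmetry
    n2 = 1 if len(p1) % 2 != len(p2) % 2 else 0  # parity mismatch
    # n3: negations in comparable relative positions (assumed heuristic).
    n3 = 1 if p1 and p2 and all(
        min(abs(a - b) for b in p2) <= tol for a in p1) else 0
    return [n1, n2, n3]
```

Under this sketch, the aligned pair from the text (“did not pass” / “did not clear”) yields `[0, 0, 1]`, while the misaligned pair (“did not pass” / “passed the not easy exam”) yields `[0, 0, 0]`.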

Stage 3: Feature fusion and projection

All semantic, interaction-based, and negation-aware features are merged into a single fused representation. This vector, denoted by z, is constructed by concatenating the semantic embeddings \(f_1\) and \(f_2\) with the interaction features d and p, along with the negation-aware vector n:

$$z = f_1 \oplus f_2 \oplus d \oplus p \oplus n,$$

where \(\oplus\) denotes vector concatenation. The vector z therefore contains all information used by the model to predict similarity, bringing together sentence-specific semantics, cross-sentence relational cues, and explicit logical polarity structure.

The dimensionality of the fused representation z is:

$$\underbrace{256 + 256 + 256 + 256}_{\text {semantic + interaction}} \;+\; \underbrace{3}_{\text {negation}} \;=\; \textbf{1027}.$$

This 1027-dimensional vector forms the complete feature representation to be processed by the regression module described later. The subsequent transformations refine this fused representation and progressively compress it towards our NAS score.

The first transformation produces the hidden vector \(h_1\):

$$h_1 = \text {ReLU}(W_1 z + b_1),$$

where \(h_1 \in \mathbb {R}^{512}\), \(W_1 \in \mathbb {R}^{512 \times 1027}\), and \(b_1 \in \mathbb {R}^{512}\). This transformation nonlinearly re-encodes the fused negation-aware feature vector z and performs the first dimensionality reduction from \(1027 \rightarrow 512\), enabling joint nonlinear integration of semantic, interaction, and negation cues into a unified latent representation.

A second transformation further compresses this representation:

$$h_2 = \text {ReLU}(W_2 h_1 + b_2),$$

where \(h_2 \in \mathbb {R}^{128}\). This second reduction from \(512 \rightarrow 128\) enforces strong regularization and retains only the most task-relevant high-level interactions, yielding a compact representation that serves as the final hidden input to the regression layer.

Fig. 3. Phase 2: Negation-aligned similarity estimation.

Algorithm 1. Negation-Aligned Similarity (NAS) Scorer.

Stage 4: Regression

In this final stage, the contextualized feature vector \(h_2\) is mapped to a scalar similarity prediction through a linear regression layer:

$$o = W_3 h_2 + b_3,$$

where \(o \in \mathbb {R}\) represents the model’s raw similarity estimate. To obtain a bounded and interpretable output, this value is passed through the sigmoid activation function \(\sigma (\cdot )\), which maps real-valued inputs to the range [0, 1]:

$$\hat{y} = \sigma (o),$$

where \(\hat{y} \in [0,1]\) denotes the normalized similarity score.

Since the Negation-Sentence-Similarity Dataset (NSSD) provides human-annotated similarity scores in the range [0, 100], the model output is rescaled as:

$$\hat{y}_{\text {scaled}} = 100 \cdot \hat{y}.$$

The quantity \(\hat{y}_{\text {scaled}}\) represents the final negation-aware semantic similarity predicted by the NAS Scorer for a given sentence pair. During training, this predicted value is directly compared with the gold similarity scores provided by NSSD. Thus, NSSD is used to supervise the training of the regression module and all preceding layers. Through this end-to-end learning process, the model learns to jointly weight semantic proximity, contrastive differences, multiplicative alignment patterns, and explicit negation cues to produce the final NAS score, as summarized in the Phase 2 architecture shown in Fig. 3. The complete workflow of the proposed NAS Scorer is summarized in Algorithm 1.
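The Stage 3 and Stage 4 computations can be traced end-to-end in a shape-level NumPy sketch; the weights are randomly initialized stand-ins for the parameters that the trained model learns from NSSD, so the resulting score is illustrative only:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(3)
f1, f2 = rng.standard_normal(256), rng.standard_normal(256)  # Phase 1 outputs
n = np.array([1.0, 1.0, 0.0])                                # negation features

# Stage 3: fuse semantic, interaction, and negation features (1027-dim).
z = np.concatenate([f1, f2, np.abs(f1 - f2), f1 * f2, n])

# Randomly initialized stand-ins for the learned weights W1, W2, W3.
W1, b1 = rng.standard_normal((512, 1027)) * 0.01, np.zeros(512)
W2, b2 = rng.standard_normal((128, 512)) * 0.01, np.zeros(128)
W3, b3 = rng.standard_normal((1, 128)) * 0.01, np.zeros(1)

h1 = relu(W1 @ z + b1)           # 1027 -> 512
h2 = relu(W2 @ h1 + b2)          # 512 -> 128
o = (W3 @ h2 + b3)[0]            # Stage 4: scalar raw estimate
nas_score = 100.0 * sigmoid(o)   # bounded NAS score in [0, 100]
```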

Experimentation

This section explains the computational environment, the frameworks and libraries used at each stage of the proposed Negation-Aligned-Similarity Scorer (NAS Scorer), and the hyperparameters adopted during training. All experiments were carried out in Google Colab, which provides a cloud-based Python environment with access to an NVIDIA GPU. GPU acceleration was enabled through CUDA support in PyTorch, allowing faster execution of the multi-encoder embedding extraction, BiLSTM operations, and the training loop.

Dataset curation process

The Negation-Sentence-Similarity Dataset (NSSD) was curated using a structured hybrid pipeline that integrates rule-based automation, Large Language Model (LLM) assistance, and human-in-the-loop validation. The curation pipeline was implemented entirely in Python, with core data handling and CSV input–output operations performed using the Pandas library. The NSSD dataset was first loaded and organized using Pandas DataFrames, which enabled systematic generation of new negation-based samples and consistent management of sentence pairs, class labels, and similarity scores.

Negation-aware sentence generation was driven by a combination of linguistic rule systems and LLM-guided rewriting. Regular-expression-based pattern matching in Python was used to detect existing negation words (not, no, never, etc.) and conjunction markers (and, but, although, etc.), as well as to identify valid insertion points for controlled negation placement. For linguistic validation and structural consistency during transformation, lightweight natural language processing support was provided by spaCy and NLTK for tokenization and basic syntactic verification. Large language models were used selectively to ensure that the generated negation variants remained grammatically fluent and semantically coherent after polarity modification.
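As a rough illustration of this rule-based detection step, the sketch below finds negation and conjunction markers and their token positions. The exact lexicons and patterns used for NSSD curation are not published; the word lists here are illustrative assumptions.

```python
# Illustrative sketch of the rule-based marker detection described above.
# NEGATION_WORDS and CONJUNCTIONS are assumed lexicons, not the curation pipeline's.
import re

NEGATION_WORDS = {"not", "no", "never", "cannot", "nothing", "none"}
CONJUNCTIONS = {"and", "but", "although", "while", "or"}

def find_markers(sentence: str):
    """Return (negation positions, conjunction positions) as token indices."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    neg_idx = [i for i, t in enumerate(tokens) if t in NEGATION_WORDS]
    conj_idx = [i for i, t in enumerate(tokens) if t in CONJUNCTIONS]
    return neg_idx, conj_idx

neg, conj = find_markers("The system is not stable, although it runs.")
# neg == [3] ("not"), conj == [5] ("although")
```

Token positions of this kind also feed the positional-alignment heuristics used later when similarity scores are adjusted.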

Similarity score adjustment during curation was implemented using deterministic heuristic functions in NumPy. Even-odd negation parity rules and conjunction-negation interaction logic were encoded as conditional transformations that modified the base similarity scores by fixed proportional ranges. These automated score adjustments were subsequently verified through human inspection to ensure alignment with perceived semantic similarity.
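A hedged sketch of the parity-based adjustment is given below. The actual proportional factors used during curation are not specified in the text, so the 0.15 and 0.80 multipliers are illustrative assumptions only.

```python
# Sketch of the even-odd negation parity rule for score adjustment.
# The 0.15 / 0.80 factors are assumed values, not the published curation constants.
import numpy as np

def adjust_score(base_score: float, neg_count_1: int, neg_count_2: int) -> float:
    """Scale a base similarity score by the even-odd negation parity rule."""
    parity_mismatch = (neg_count_1 % 2) != (neg_count_2 % 2)
    factor = 0.15 if parity_mismatch else 0.80  # assumed proportional ranges
    return float(np.clip(base_score * factor, 0.0, 100.0))

adjust_score(90.0, 1, 0)   # odd vs. even parity -> strong reduction (~13.5)
adjust_score(90.0, 2, 0)   # double negation cancels -> mild reduction (~72.0)
```

The key design point is that parity mismatch (polarity reversal) is penalized far more heavily than matched parity, where negations cancel.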

Following automated augmentation, every generated sample was manually reviewed to validate negation placement, logical polarity behavior, and the correctness of the adjusted similarity score. This human-machine collaborative workflow ensured that the final dataset maintained both linguistic naturalness and reliable negation-sensitive semantics. Through this controlled expansion process, the dataset was systematically scaled from its original seed set into the final large-scale NSSD. The resulting expansion statistics and overall distribution characteristics are summarized in Table 4.

Table 4 Summary statistics of the Negation-Sentence-Similarity Dataset (NSSD).

Model development and implementation

The implementation was done in Python 3.10. Core deep learning components, including linear layers, recurrent layers, activation functions, loss functions and the optimization routine, were implemented using the PyTorch library. Pretrained transformer encoders such as BERT, RoBERTa and DistilBERT were loaded using the Hugging Face Transformers library, while sentence-level encoders such as SBERT were instantiated from the Sentence-Transformers framework. Classical lexical embeddings were obtained using the Gensim library. Dataset handling, numerical preprocessing and feature manipulation were performed using NumPy and Pandas, while train-validation splits and scaling utilities were provided by scikit-learn. All plotting and diagnostic visualizations were generated using Matplotlib.

Phase 1: Fusion embedding and feature extraction

Phase 1 of the NAS Scorer transforms each input sentence into a 256-dimensional semantic vector through two consecutive stages: fusion of embeddings and feature extraction. For the fusion stage, five pretrained encoders were instantiated. BERT-base-uncased, RoBERTa-base and DistilBERT-base were loaded using the AutoTokenizer and AutoModel classes from the Hugging Face Transformers library. The Sentence-BERT encoder was created using the SentenceTransformer class from the Sentence-Transformers framework, which directly provides sentence-level embeddings optimized for similarity tasks. Word2Vec embeddings were loaded using Gensim’s KeyedVectors interface, giving access to pre-trained 200-dimensional word vectors. For each sentence, tokens were first converted to ids and passed through the corresponding transformer models, and the resulting token-level hidden states were mean-pooled using PyTorch tensor operations to obtain fixed-size sentence embeddings.

For Word2Vec, each token in the sentence was mapped to its vector using the Gensim model, and the sentence-level embedding was obtained by averaging these word vectors using PyTorch reduction operations. All five embeddings were finally concatenated along the feature dimension using PyTorch to form a single fused vector of dimension 2888 for each sentence.
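Because the five encoders emit different dimensionalities (768 each for BERT, RoBERTa, and DistilBERT; 384 for SBERT; 200 for Word2Vec), the fusion step itself is pure pooling and concatenation. The shape-level sketch below substitutes random tensors for the actual encoder outputs so the 2888-dimensional result can be verified without downloading the pretrained models.

```python
# Shape-level sketch of the fusion stage; random tensors stand in for real
# encoder outputs so no model downloads are needed.
import torch

def mean_pool(token_states: torch.Tensor) -> torch.Tensor:
    """Average token-level hidden states (seq_len, dim) into one sentence vector."""
    return token_states.mean(dim=0)

seq_len = 12
bert = mean_pool(torch.randn(seq_len, 768))
roberta = mean_pool(torch.randn(seq_len, 768))
distilbert = mean_pool(torch.randn(seq_len, 768))
sbert = torch.randn(384)            # SentenceTransformer returns sentence-level directly
word2vec = mean_pool(torch.randn(seq_len, 200))

fused = torch.cat([bert, roberta, distilbert, sbert, word2vec], dim=0)
assert fused.shape == (2888,)       # 3 * 768 + 384 + 200 = 2888
```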

The feature extraction stage in Phase 1 was implemented as a learnable projection followed by sequence modeling with a BiLSTM, both defined in a custom PyTorch class. The projection was realized with fully connected layers: the fused 2888-dimensional vector was first passed through a linear layer followed by a Rectified Linear Unit (ReLU) activation. The projected vector was then reshaped using PyTorch's tensor view and passed into a bidirectional LSTM.

The final forward and backward hidden states were concatenated to obtain the final 256-dimensional sentence embeddings \(f_1\) and \(f_2\). Throughout Phase 1, the fusion stage relies on frozen pretrained encoders, while the projection and BiLSTM layers are trained on the Negation-Sentence-Similarity Dataset (NSSD).
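A minimal sketch of this projection-plus-BiLSTM extractor follows. The 2888-dimensional input and the 256-dimensional output (two 128-dimensional final states concatenated) follow the text; the intermediate projection width of 512 is an assumption, as the text does not state it.

```python
# Sketch of the Phase 1 feature extractor; proj_dim=512 is an assumed width.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, fused_dim=2888, proj_dim=512, hidden=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(fused_dim, proj_dim), nn.ReLU())
        self.bilstm = nn.LSTM(proj_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, fused):                      # fused: (batch, 2888)
        x = self.proj(fused).unsqueeze(1)          # reshape to (batch, 1, proj_dim)
        _, (h_n, _) = self.bilstm(x)               # h_n: (2, batch, hidden)
        return torch.cat([h_n[0], h_n[1]], dim=1)  # forward + backward -> (batch, 256)

f = FeatureExtractor()(torch.randn(4, 2888))
assert f.shape == (4, 256)
```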

Phase 2: Negation-aligned similarity estimation

Phase 2 takes the pair of 256-dimensional sentence embeddings together with the negation features and combines them into a single scalar NAS score in the range [0, 100]. The semantic interaction features were implemented using basic PyTorch tensor arithmetic. Given the embeddings \(f_1\) and \(f_2\) as tensors of shape (\(\text {batch}\)_\(\text {size}\), 256), the absolute difference vector was computed as torch.abs(f1 - f2), and the element-wise product was obtained through pointwise multiplication f1 * f2. These operations produced the vectors d and p described in the methodology.
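As described, the two interaction vectors reduce to two lines of tensor arithmetic:

```python
# Semantic interaction features from Stage 1 of Phase 2.
import torch

f1 = torch.randn(32, 256)        # sentence embeddings from Phase 1
f2 = torch.randn(32, 256)

d = torch.abs(f1 - f2)           # contrastive (absolute-difference) features
p = f1 * f2                      # multiplicative alignment features
assert d.shape == p.shape == (32, 256)
```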

Negation features were extracted directly from each sentence pair \((S_1, S_2)\) rather than being read as precomputed attributes. Each sentence was first tokenized using standard Python string processing utilities, and negation words were identified through rule-based pattern matching against a predefined negation lexicon. The total number of detected negation tokens in each sentence was counted to obtain \(\text {neg}(S_1)\) and \(\text {neg}(S_2)\). Using these values, the three-dimensional negation-aware vector n was constructed by computing the absolute negation-count difference, the parity mismatch based on odd–even negation counts, and the relative positional alignment of negation words across the sentence pair. The resulting vector \(n \in \mathbb {R}^3\) was then converted into a PyTorch tensor and used as an explicit polarity-sensitive feature.
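The sketch below illustrates one way this three-dimensional vector can be built. The count-difference and parity terms follow the text directly; the positional-alignment term (normalized position of the first negation word) is an assumed formulation, since the exact definition is not given here, and the negation lexicon is likewise a small illustrative subset.

```python
# Sketch of the negation-aware vector n; the positional-alignment formulation
# and the NEGATION_WORDS lexicon are assumptions for demonstration.
import torch

NEGATION_WORDS = {"not", "no", "never", "cannot"}

def negation_vector(s1: str, s2: str) -> torch.Tensor:
    def stats(sentence):
        tokens = sentence.lower().split()
        positions = [i for i, t in enumerate(tokens) if t in NEGATION_WORDS]
        rel = positions[0] / len(tokens) if positions else 0.0
        return len(positions), rel

    c1, r1 = stats(s1)
    c2, r2 = stats(s2)
    return torch.tensor([
        abs(c1 - c2),                    # absolute negation-count difference
        float((c1 % 2) != (c2 % 2)),     # odd-even parity mismatch
        abs(r1 - r2),                    # relative positional alignment (assumed)
    ])

n = negation_vector("the theory is not applicable", "the theory is applicable")
assert n.shape == (3,) and n[0] == 1.0 and n[1] == 1.0
```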

The four vectors \(f_1\), \(f_2\), d, and p, together with the negation vector n, were concatenated into a single 1027-dimensional fused feature vector using torch.cat. This fused representation was passed to the regression subnetwork, which was implemented using PyTorch's nn.Sequential container. The first hidden layer was declared as nn.Linear(1027, 512), followed by a ReLU activation implemented as nn.ReLU() and a dropout layer implemented using nn.Dropout(p=0.3). The second hidden layer was defined as nn.Linear(512, 128), again followed by nn.ReLU(). The final output layer was implemented as nn.Linear(128, 1).

After the last linear layer, a Sigmoid activation was applied to bound the output between 0 and 1. The resulting value was then multiplied by 100 to rescale the prediction into the target range [0, 100] expected by the NSSD labels.
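Putting the preceding two paragraphs together, the regression head can be reconstructed from the layer sizes given in the text (1027 → 512 → 128 → 1, with a Sigmoid and ×100 rescaling):

```python
# Phase 2 regression subnetwork, reconstructed from the stated layer sizes.
import torch
import torch.nn as nn

regressor = nn.Sequential(
    nn.Linear(1027, 512), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

fused = torch.randn(8, 1027)                       # [f1, f2, d, p, n] concatenated
score = 100.0 * torch.sigmoid(regressor(fused))    # NAS score bounded to [0, 100]
assert score.shape == (8, 1)
assert (score >= 0).all() and (score <= 100).all()
```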

Training procedure and hyperparameters

The hyperparameters used for training the NAS Scorer are summarized in Table 5. The NSSD dataset was split into training and validation subsets using the train_test_split function from sklearn.model_selection. The resulting NumPy arrays were converted into PyTorch tensors and wrapped into TensorDataset objects. Mini-batches were created using the DataLoader class with a batch size of 32. The NAS Scorer was moved to the GPU. The optimizer used for training was AdamW, with a learning rate of \(2 \times 10^{-4}\).

The loss function was implemented using Mean Squared Error (MSE) loss, and the Root Mean Squared Error (RMSE) was computed by applying a square-root operation to the MSE value during evaluation. Training was carried out for 30 epochs. During each iteration, gradients were computed, and parameters were updated. To ensure reproducibility, Python’s built-in random.seed, NumPy’s numpy.random.seed, and PyTorch’s torch.manual_seed were all initialized with a fixed seed value.
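A condensed sketch of this training setup (AdamW at lr 2e-4, batch size 32, MSE loss, fixed seeds) is shown below. A tiny linear model is substituted for the full NAS Scorer so the loop runs standalone, and synthetic tensors stand in for the NSSD features and labels.

```python
# Condensed training-loop sketch; nn.Linear is a stand-in for the NAS Scorer
# and the data is synthetic, matching only the shapes described in the text.
import random
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# reproducibility, as in the paper (42 is an arbitrary example seed)
random.seed(42); np.random.seed(42); torch.manual_seed(42)

X = torch.randn(256, 1027)                 # placeholder fused features
y = torch.rand(256, 1) * 100               # placeholder NSSD scores in [0, 100]
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(1027, 1)                 # stand-in for the NAS Scorer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
criterion = nn.MSELoss()

for epoch in range(2):                     # the paper trains for 30 epochs
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

rmse = torch.sqrt(loss).item()             # RMSE derived from the final-batch MSE
```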

Table 5 Hyperparameters for training NAS Scorer.

Results

This section presents a comprehensive evaluation of the proposed Negation-Aligned Similarity (NAS) Scorer. The experimental analysis includes four complementary components: (i) Training behavior, (ii) Validation, (iii) Comparative Analysis of NAS Scorer, and (iv) Ablation Study.

Training behavior

The training loss and validation RMSE curves shown in Fig. 4 illustrate the learning behavior of the NAS Scorer over 30 training epochs and indicate both optimization quality and generalization performance. The training loss reflects how well the model fits the training data, while the validation RMSE measures how effectively it generalizes to unseen sentence pairs. The observed trend corresponds to an ideal learning scenario, with both curves decreasing smoothly and stabilizing at low error values with a small gap between them. Specifically, the validation RMSE drops from 22.67 in the first epoch to 5.5610 at convergence, demonstrating strong generalization capability. The absence of divergence between the curves shows that the model avoids both overfitting and underfitting, confirming that the fused embeddings, BiLSTM contextualization, and explicit negation-aware interaction modeling together form a stable and reliable system.

In addition to convergence analysis, the discriminative behavior of the NAS Scorer was assessed through Receiver Operating Characteristic (ROC) analysis. As shown in Fig. 5, the NAS Scorer achieves an Area Under the Curve (AUC) value of 0.9875, which is very close to the ideal value of 1.0. This indicates that the proposed model possesses an exceptionally strong ability to discriminate between semantically similar and dissimilar sentence pairs across all decision thresholds. Since the AUC represents the probability that the model ranks a randomly chosen positive sample higher than a randomly chosen negative sample, a value of 0.9875 confirms that the NAS Scorer performs near optimally in ordering sentence pairs according to their true semantic similarity.

Fig. 4 Training loss versus validation RMSE of the NAS Scorer.

Fig. 5 AUC-ROC curve of the NAS Scorer.

The ROC curve consistently remains far above the diagonal random classification baseline (AUC = 0.5), demonstrating a high true positive rate even at very low false positive rates. This behavior indicates that the model is able to correctly identify genuinely similar sentence pairs while simultaneously suppressing incorrect similarity assignments. The strong ROC performance directly validates the effectiveness of the negation-aware interaction modeling and the fusion-based semantic encoding strategy, showing that explicit negation alignment significantly enhances binary similarity discrimination beyond conventional embedding-only approaches.

Validation

Validation of the NAS Scorer was carried out on the STS Benchmark25 dataset by comparing its performance against LoRA-fine-tuned DistilBERT and LoRA-fine-tuned SBERT, and the corresponding quantitative results are summarized in Table 6. Since the STS Benchmark dataset provides similarity scores on a 0–5 scale, all scores were linearly rescaled to the 0–100 range for consistency with NAS Scorer outputs. The proposed NAS Scorer achieves a markedly lower RMSE of 5.5610, substantially outperforming the LoRA–DistilBERT model, which records an RMSE of 14.6680, and the LoRA–SBERT model with an RMSE of 14.4327. This clearly demonstrates the superior regression accuracy of the NAS Scorer.

Table 6 Performance Comparison of NAS Scorer.

In terms of classification performance after thresholding, the NAS Scorer attains the highest accuracy of 0.9759 and an F1-score of 0.9704, indicating highly reliable similarity discrimination. Although the LoRA–DistilBERT baseline achieves a high recall of 0.9491, its significantly lower precision of 0.6031 reveals a strong tendency to overestimate similarity by misclassifying dissimilar sentence pairs as similar. The LoRA–SBERT model exhibits a more balanced precision–recall profile with values of 0.7760 and 0.8434, respectively, but still remains consistently inferior to the NAS Scorer across all reported evaluation metrics.
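The classification metrics above can be derived from continuous similarity scores by thresholding, along the lines of the sketch below. The threshold of 50 and the toy scores are assumptions for illustration; the paper's exact cut-off is not stated here.

```python
# Thresholding continuous similarity scores into binary labels and computing
# classification metrics; threshold=50 and the data are illustrative assumptions.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

gold = np.array([80.0, 20.0, 95.0, 10.0, 60.0])      # human similarity scores
pred = np.array([75.0, 35.0, 90.0, 15.0, 45.0])      # model outputs

threshold = 50.0
y_true = gold >= threshold
y_pred = pred >= threshold

acc = accuracy_score(y_true, y_pred)                  # 4 of 5 pairs agree -> 0.8
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```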

To further analyze the behavior of different models under structured negation, we conduct a sentence-level similarity comparison using carefully selected sentence pairs from the STS Benchmark dataset that explicitly exhibit different types of negation variations. The similarity scores for NAS, LoRA–DistilBERT, LoRA–SBERT, and vanilla SBERT are reported in Table 7. For the transformer-based baseline models (LoRA–SBERT, LoRA–DistilBERT, and vanilla SBERT), sentence embeddings are first extracted and the similarity is then computed using the cosine similarity between these embedding vectors, which is the standard procedure followed in traditional SSTs. This setup enables a direct and fair comparison between the proposed negation-aware scoring mechanism in NAS Scorer and conventional embedding-based similarity estimation.

Table 7 Negation-Aware Similarity Comparison Across Models.

In the original (no negation) case, all models assign high similarity scores, indicating correct semantic alignment under standard conditions. In the presence of single and odd negation, NAS Scorer produces very low similarity scores (12.3 and 9.8), accurately reflecting polarity reversal, whereas SBERT and the LoRA-based models continue to assign relatively high scores, indicating systematic overestimation under negation. For the even negation case (“not impossible” vs. “possible”), NAS assigns a high similarity score (91.5), correctly capturing negation cancellation, while the baselines show only partial sensitivity. In the conjunction-negation example, NAS again assigns a low similarity score (18.7), capturing the combined effect of negation and context, whereas the other models remain inflated.

Comparative analysis of NAS scorer

To evaluate the effectiveness of the proposed NAS Scorer against strong transformer-based baselines, we perform a comparative analysis using LoRA-fine-tuned DistilBERT and LoRA-fine-tuned SBERT models. DistilBERT is selected due to its lightweight architecture and computational efficiency, making it particularly well-suited for parameter-efficient fine-tuning, while SBERT is chosen as the standard vanilla transformer model widely adopted in traditional Sentence Similarity Tools (SSTs). Both baselines are first fine-tuned on the NSSD dataset using parameter-efficient Low-Rank Adaptation (LoRA), where only a small set of adapter parameters is updated while the base transformer weights remain frozen. This ensures a fair and stable adaptation of the baselines to the negation-sensitive training distribution.

Fig. 6 Training loss and validation RMSE for the LoRA baselines.

The training and validation behavior of the LoRA–SBERT and LoRA–DistilBERT models on NSSD is jointly illustrated in Fig. 6. In both cases, the training loss decreases sharply during the initial epochs, indicating effective parameter-efficient optimization. However, the corresponding validation RMSE curves saturate early and converge to relatively high final values when compared to the NAS Scorer. This trend indicates that although LoRA fine-tuning stabilizes training, these transformer-only baselines remain limited in modeling negation-driven semantic shifts due to the absence of explicit negation-aware reasoning mechanisms.

The NAS Scorer incurs a higher memory cost compared to single-encoder baselines due to its hybrid embedding fusion design. During inference, multiple pretrained models (BERT, SBERT, RoBERTa, DistilBERT, and Word2Vec) are loaded simultaneously, and their outputs are concatenated into a high-dimensional fused embedding before dimensionality reduction and BiLSTM processing. This parallel multi-encoder operation significantly increases GPU memory consumption relative to baselines that rely on a single transformer and cosine similarity. In contrast, LoRA-based baselines are considerably more memory-efficient, since only a small set of low-rank adapter parameters is updated while the base transformer weights remain frozen. While this makes NAS comparatively more memory-heavy, the additional cost is a direct consequence of richer semantic representation and explicit negation modeling.

Ablation study

As described in the methodology, the proposed NAS Scorer is organized into two phases. Phase 1 consists of Stage 1 (fusion embedding) and Stage 2 (feature extraction), while Phase 2 consists of Stage 1 (Semantic-Interaction-Based Composition), Stage 2 (Negation Feature Construction), and Stage 3 (Feature Fusion and Regression). Since the BiLSTM module in Phase 1-Stage 2 and the negation vector constructed in Phase 2-Stage 2 represent the two most critical learning components of the NAS Scorer, we perform a stage-wise component ablation by selectively disabling each of these stages to quantify their individual contributions.

Table 8 Stage-wise Ablation Study of core NAS Components.

We evaluated the contribution of the core NAS stages using the custom-curated Negation-Sentence-Similarity Dataset (NSSD). The results of our ablation study are presented in Table 8. When Phase 2-Stage 2, which constructs the negation feature vector, is removed, performance declines, with RMSE increasing to 10.1842 and the F1-score decreasing to 0.8872. This indicates that polarity information contributes meaningfully to similarity estimation. When the negation vector is retained but Phase 1-Stage 2, the BiLSTM-based feature extraction module, is removed, RMSE increases further to 12.6395 and the F1-score decreases to 0.8248, showing that sequential contextual encoding plays an important role in modeling the scope and position of negation. The full NAS Scorer, which includes all stages of Phase 1 and Phase 2, attains the lowest RMSE of 5.5610 and an F1-score of 0.9620, reflecting the combined effect of contextual encoding and negation modeling.

Overall, these results indicate that while cosine similarity is a standard mechanism for measuring coherence between sentence embeddings, the primary limitation lies in the embeddings themselves, which fail to adequately capture the contextual scope, positional influence, and count-based effects of negation. The NAS Scorer is designed to account for these factors and thereby achieves robust negation-aware semantic behavior.

Conclusion

In conclusion, this work presents a Negation-Aligned-Similarity Scorer (NAS Scorer) implemented through a Hybrid Semantic Similarity (HSS) framework for robust similarity estimation under negation-rich conditions. By combining multi-embedding fusion, BiLSTM-based contextual encoding, and explicit negation-aware interaction features, the proposed model achieves high regression accuracy and strong classification performance, consistently outperforming LoRA-fine-tuned DistilBERT and SBERT baselines across all evaluation metrics. These results confirm the importance of explicitly modeling negation and contextual sequencing for reliable semantic similarity estimation. Despite its effectiveness, the proposed approach has certain limitations. The reliance on multiple transformer encoders introduces higher computational and memory overhead, which may constrain real-time deployment in resource-limited environments. In addition, although the NSSD dataset is systematically curated and human-verified, it may not fully capture the wide range of pragmatic and discourse-level negation phenomena present in naturally occurring text.

Future work will focus on improving scalability through knowledge distillation and lightweight fusion strategies to reduce inference cost, expanding the dataset to include more nuanced and context-dependent forms of negation, and evaluating cross-domain generalization to assess robustness across diverse linguistic settings. These directions aim to advance practical, scalable, and semantically faithful negation-aware similarity modeling for real-world NLP applications.