Introduction

The course “Safety Access Education in University Laboratories” is a fundamental course for agricultural laboratory safety and plays a positive role in raising safety standards in agricultural laboratories. This study evaluates the factors that influence the teaching effectiveness of the course, with the aim of improving teaching quality and strengthening laboratory safety.

MCDM (Multiple Criteria Decision Making) methods include AHP (Analytic Hierarchy Process), TOPSIS (Technique for Order Preference by Similarity to Ideal Solution), VIKOR (Multi-Criteria Optimization and Compromise Solution), and others1,2,3, and many studies have applied them to a variety of decision-making problems. The core idea of the AHP method is to break a complex decision problem down into levels such as objectives, criteria, and alternatives4. It determines the relative importance of each factor through pairwise comparisons and uses mathematical methods to calculate their weights, thereby providing a basis for selecting the best alternative. The interval-valued circular intuitionistic fuzzy set-based AHP provides a structured framework for evaluating innovative teaching approaches5. An improved AHP method has been applied to optimize the teaching and delivery of Physical Education (PE) programs6. A curriculum evaluation system based on AHP and clustering has been used to evaluate the performance of English hybrid teaching courses7. There are also studies on the application of the TOPSIS method in teaching. An enhanced TOPSIS has been utilized to analyse teaching evaluation in a fuzzy environment8. TOPSIS combined with the fuzzy analytic network process has been proposed to analyse the dimensions, indicators and alternatives of blended design teaching service quality9. An improved TOPSIS method has been proposed to assess knowledge-sharing quality10. A TOPSIS method with N-valued neutrosophic trapezoidal numbers has been proposed for MCDM problems11. Some related studies on the VIKOR method are as follows: VIKOR combined with type-2 neutrosophic numbers is used to assist decision-makers in making the most informed choice12. A VIKOR method based on the entropy measure has been developed to solve MCDM problems13. VIKOR has been integrated with AHP to identify the most pertinent renewable energy sources for electricity generation in developing countries14. Additionally, several improved approaches have been proposed, including modifications to various decision-making methods using trapezoidal fuzzy multi-numbers15,16,17,18, grey relational theory18, interval type-2 fuzzy sets19,20, and regret theory21. However, when these methods are applied to the evaluation of teaching effectiveness for courses such as “Safety Access Education in University Laboratories”, two key limitations remain: the lack of consideration for complex interrelationships among indicators and insufficient handling of the uncertainty inherent in linguistic evaluations.

Regarding the complex interrelationships among factors: the factors that influence teaching effectiveness are not isolated but interact in complex ways, whereas the AHP method assumes that the criteria are mutually independent, so its hierarchical structure cannot describe such network-like causal relationships. The TOPSIS and VIKOR methods likewise assume in their calculations that the indicators are orthogonal and independent. If there are strong correlations among the factors, the ranking will be distorted because information is counted repeatedly in the weighted distances or compromise solutions. To overcome these problems, and given the effectiveness of the DEMATEL19 method in analysing the mutual influence relationships among factors in complex systems20, this study introduces DEMATEL to determine the weights of the factors influencing teaching effectiveness. Unlike AHP, which relies solely on importance comparisons, DEMATEL quantifies the direct influence intensity among factors through expert judgment and then calculates the cause degree and centrality of each factor. Factors with high centrality occupy a pivotal position in the influence network, while those with a high cause degree are the fundamental sources driving system changes. Therefore, in evaluating the teaching effectiveness of the “Safety Access Education in University Laboratories” course, using DEMATEL to determine the weights can objectively reflect the actual mutual promotion or restriction relationships among the influencing factors, laying a foundation for subsequent precise evaluation.

After the interrelationship structure and weights of the influencing factors have been determined through the DEMATEL method, another challenge in teaching effectiveness evaluation is how to handle precisely the inherent uncertainty and fuzziness of the semantic evaluation information provided by different evaluators. Although improved approaches such as fuzzy AHP and fuzzy TOPSIS exist, AHP’s requirement for precise scales in pairwise comparisons and the dependence of TOPSIS/VIKOR on precise numerical performance matrices make it difficult to handle, without information loss, evaluation information characterized by probabilistic linguistic term sets, which simultaneously contain preference probability distributions and the fuzziness of linguistic terms. Linguistic terms and fuzzy sets can mathematize qualitative linguistic concepts, enabling knowledge based on experience and intuition to be formally expressed, reasoned about, and calculated, thus building a bridge between human subjective, fuzzy cognition and the precise computation of machines. Fuzzy set theory has been studied from many angles, for instance the q-Rung Orthopair Fuzzy Set21, fuzzy CoCoSo with Bonferroni methods22, the fuzzy Best Worst Method approach23, and the Random Forest algorithm with fuzzy sets24. Drawing on fuzzy decision-making theory, this study further introduces PLTS as the carrier of semantic evaluation information25. PLTS allows each linguistic term to be associated with a probability, thereby enabling the lossless encapsulation of group linguistic evaluation opinions with distribution characteristics. Since its proposal, PLTS has been applied in many fields: PLTS has been integrated with two other methods to maximize customer satisfaction26; PLTS combined with the best–worst method has been applied to the location selection of offshore wind power stations27; an improved PLTS has been used to establish a hotel evaluation model to process hotel information28; and PLTS has been connected with discrete probability distributions to assess a practical atmospheric pollutant evaluation problem29. PLTS and its variants have also been utilized in online production ranking30, air quality indexing31, hotel recommendation32, and tourism attraction selection33. In the field of teaching, the multi-granularity PLTS has been used to improve teaching quality34, and an extended PLTS has been employed to promote the teaching reform of big data technology and application courses in the new liberal arts construction scenario35. While PLTS excels at fully preserving and leveraging the probability distribution information embedded in expert group evaluations, it falls short in capturing the deviations of results during outcome comparisons.

To address the aforementioned challenges, we propose a hybrid evaluation framework that systematically integrates the DEMATEL method, PLTS, and a modified possibility degree method, aiming to tackle the dual complexity inherent in teaching effectiveness evaluation. Specifically, DEMATEL is employed to unpack and quantify the complex causal network structure among the factors influencing teaching effectiveness, thereby deriving objective weights. PLTS serves to encapsulate the fuzziness associated with probabilistic distributions in expert group semantic evaluations. A pivotal advancement lies in the introduction of a modified possibility degree method: this formula not only computes the overall preference possibility between two PLTSs but also captures discrepancies in both the mean and the variance of the compared pair. This enhancement enables the model to differentiate between evaluation outcomes with comparable expected values yet substantial divergences in expert opinions, thereby capturing risk and divergence information that traditional methods tend to overlook.

When evaluating the teaching effectiveness of courses such as “Safety Access Education in University Laboratories”, existing methods suffer from a lack of consideration of the complex correlations among indicators, insufficient handling of the uncertainty in linguistic evaluations, and the insufficiency of PLTS in result comparison. This study constructs a two-stage hybrid evaluation framework: in the first stage, DEMATEL is used to analyse the correlation network among factors and obtain objective weights that reflect the system structure; in the second stage, the probabilistic linguistic term set is combined with the improved possibility degree formula described above to construct a probabilistic linguistic possibility degree evaluation matrix for the effective comparison and ranking of PLTSs. This framework combines the strength of DEMATEL in analysing system-structure correlations with the strengths of probabilistic linguistic methods and possibility degree theory in handling the uncertainty of semantic information, making the evaluation results more scientific.

Apart from this Introduction, the remainder of the article is structured as follows. The second section, Model Construction, covers the evaluation indicators and the calculation of indicator weights via the DEMATEL method. The third section, Probabilistic Linguistic Term Sets and Possibility Degree, includes two subsections: the construction of probabilistic linguistic term sets and the possibility degree, and the probabilistic linguistic possibility degree-based teaching evaluation process. The fourth section, Teaching Effectiveness Evaluation and Validation, comprises the validation of teaching effectiveness, the robustness test, the reliability analysis of expert evaluations, Bootstrap-based confidence interval estimation, the comparative analysis of methods, and a summary. The last section presents the conclusions.

Model construction

Evaluation indicators

The “Safety Access Education in University Laboratories” course is typically taught by instructors delivering lectures followed by routine safety exams, with assessment scores used as the evaluation method. The course teaching evaluation indicators are divided into five categories: teaching contents, teaching methods, professional distinction, teaching quality assessment methods, and security knowledge acquired. The descriptions of each indicator are as follows:

  1. Teaching contents (TC): The richness and novelty of the teaching content directly determine the teaching effectiveness and student engagement in the classroom36,37.

  2. Teaching methods (TM): The rise of teaching approaches such as “online + offline” and “in-class + extracurricular,” along with the application of teaching platforms like MOOC and Rain Classroom, provides more options for teaching methods. Selecting appropriate teaching methods significantly enhances teaching effectiveness38.

  3. Professional distinction (PD): Different agricultural disciplines have varying directions, requiring different levels and scopes of knowledge39. These professional distinctions should be fully considered when constructing the scoring matrix.

  4. Teaching quality assessment methods (TQ): Choosing reasonable assessment methods not only boosts students’ learning motivation but also improves their innovative competencies40.

  5. Security knowledge acquired (SC): Adopting different teaching contents, methods, and assessment approaches leads to varying levels of safety knowledge acquired by students. This indicator serves as the ultimate evaluation of teaching effectiveness41.

Miller’s classic research indicates that the capacity of human working memory is approximately 7 ± 2 information chunks42. A seven-point scale therefore sits near the upper limit of this cognitive comfort zone: it gives evaluators enough gradation to make fine distinctions (more than a five-point scale), while avoiding the decision fatigue and cognitive unreliability caused by too many levels (such as ten). This design preserves data variance while minimizing evaluators’ subjective measurement error. The seven-point scale also has a broad application base, for example in semantic differential scales43 and self-evaluation44, and related studies have shown that seven evaluation levels ensure discrimination while avoiding cognitive overload of evaluators45. Moreover, in the early stage of the research, interviews were conducted with ten teaching experts, and the collected evaluation data showed that the scores were non-uniformly distributed, with obvious clustering near thresholds such as 0.45 and 0.75, supporting the interval division in Table 1. Therefore, this study adopts seven evaluation levels.

Each indicator is divided into seven levels based on its characteristics: Extremely High (EH), High (H), Lower High (LH), Middle (M), Lower (LE), Low (L), and Extremely Low (EL). The indicators are quantified according to semantic conventions, and the levels and corresponding scoring ranges are shown in Table 146,47.

Table 1 Indicator levels and scoring Ranges.

After quantifying the indicator scores, the evaluation indicator set \(\varvec{\varOmega}\) is constructed based on the evaluation indicators:

$$\varvec{\varOmega}={\left\{ {TC,TM,PD,TQ,SC} \right\}^T}$$
(1)

Indicator weight calculation based on the DEMATEL method

Typically, the weights of teaching evaluation indicators are determined subjectively, which introduces uncertainty into the evaluation of influencing factors. To ensure the objectivity of the weights, the DEMATEL method is used to calculate the weights48,49. The steps are as follows:

Step 1

Based on the evaluation indicator set \(\varvec{\varOmega}\), the direct influence scoring matrix \(\varvec{\varOmega}{\text{=}}{\left[ {{s_{ij}}} \right]_{m \times m}}\) is obtained. The scores in the matrix are normalized according to Eq. (2).

$${\varvec{\varOmega}_0}=t\cdot \varvec{\varOmega}$$
(2)

where \(t=\frac{1}{{\mathop {\hbox{max} }\limits_{{1 \leqslant i \leqslant m}} \sum\nolimits_{{j=1}}^{m} {{s_{ij}}} }}\) is a standardization factor, ensuring that no row sum of the standardized matrix \({\varvec{\varOmega}_0}\) exceeds 1; \({s_{ij}}\) represents the influence intensity of indicator i on indicator j, \(\sum\nolimits_{{j=1}}^{m} {{s_{ij}}}\) represents the total direct influence of indicator i on all other indicators, and \(\mathop {\hbox{max} }\limits_{{1 \leqslant i \leqslant m}} \sum\nolimits_{{j=1}}^{m} {{s_{ij}}}\) is the maximum of all row sums.

Step 2

According to Eq. (3), the comprehensive relationship matrix \({\varvec{R}}={\left[ {{r_{ij}}} \right]_{m \times m}}\) is calculated.

$${\varvec{R}}={\varvec{\varOmega}_0}{\left( {{\varvec{E}} - {\varvec{\varOmega}_0}} \right)^{ - 1}}$$
(3)

where \({\varvec{\varOmega}_0}\) represents the standardized direct influence matrix, \({\varvec{E}}\) is the identity matrix, \({\left( {{\varvec{E}} - {\varvec{\varOmega}_0}} \right)^{ - 1}}\)represents the inverse matrix of \(\left( {{\varvec{E}} - {\varvec{\varOmega}_0}} \right)\), and \({\varvec{R}}\) represents the total influence matrix (including both direct and indirect influences).

Step 3

Based on the comprehensive relationship matrix \({\varvec{R}}\), the influence degree \({{\varvec{A}}_{\varvec{i}}}={\left[ {\sum\limits_{{j=1}}^{m} {{r_{ij}}} } \right]_{m \times 1}}\) and the influenced degree \({{\varvec{B}}_{\varvec{j}}}={\left[ {\sum\limits_{{i=1}}^{m} {{r_{ij}}} } \right]_{m \times 1}}\) of each indicator are obtained by summing the rows and columns, respectively.

Step 4

The cause degree represents the influence factors between indicators, while the centrality degree indicates the importance of each indicator itself. According to Eqs. (4) and (5), the cause degree and centrality degree are calculated, respectively.

The formula for the cause degree is:

$${{\varvec{C}}_{\varvec{i}}}={{\varvec{A}}_{\varvec{i}}} - {{\varvec{B}}_{\varvec{i}}}$$
(4)

where \({{\varvec{C}}_i}\) represents the cause degree. If \({{\varvec{C}}_i}>0\), the indicator is a cause factor; if \({{\varvec{C}}_i}<0\), the indicator is a result factor; if \({{\varvec{C}}_i}=0\), the indicator is a balancing factor.

The formula for the centrality degree is:

$${{\varvec{D}}_{\varvec{i}}}={{\varvec{A}}_{\varvec{i}}}+{{\varvec{B}}_{\varvec{i}}}$$
(5)

where \({{\varvec{D}}_{\varvec{i}}}\) represents the centrality degree. The larger \({{\varvec{D}}_{\varvec{i}}}\) is, the more important indicator i is in the system. A high \({{\varvec{D}}_{\varvec{i}}}\) means that the indicator strongly influences other indicators and is also strongly influenced by them, reflecting the indicator’s pivotal position in the network.

Step 5

Calculate the comprehensive evaluation value of each indicator according to Eq. (6)

$${{\varvec{Q}}_{\varvec{i}}}=\sqrt {{{\varvec{C}}_{\varvec{i}}}^{2}+{{\varvec{D}}_{\varvec{i}}}^{2}}$$
(6)

where \({{\varvec{Q}}_{\varvec{i}}}\) represents the comprehensive evaluation value of the corresponding indicator.

Step 6

Calculate the relative weights according to Eq. (7)

$${\omega _i}=\frac{{{Q_i}}}{{\sum\limits_{{i=1}}^{m} {{Q_i}} }}$$
(7)

where \({\omega _i}\) represents the weights of each indicator, \({\omega _i} \in (0,1)\), and \(\sum\limits_{{i=1}}^{m} {{\omega _i}} =1\).

The weight set \(\varvec{\omega}={\left[ {{\omega _i}} \right]_{1 \times m}}\) is obtained through the calculations from Step 1 to Step 6.
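For illustration, Steps 1–6 (Eqs. (2)–(7)) can be sketched in a few lines of Python; the direct influence matrix below is a hypothetical placeholder rather than the study’s data, and the function name is our own.

```python
import numpy as np

def dematel_weights(direct):
    """DEMATEL weights from a direct influence matrix, following Eqs. (2)-(7)."""
    direct = np.asarray(direct, dtype=float)
    m = direct.shape[0]
    t = 1.0 / direct.sum(axis=1).max()                 # standardization factor, Eq. (2)
    omega0 = t * direct                                # standardized matrix
    R = omega0 @ np.linalg.inv(np.eye(m) - omega0)     # comprehensive relationship matrix, Eq. (3)
    A = R.sum(axis=1)                                  # influence degree (row sums)
    B = R.sum(axis=0)                                  # influenced degree (column sums)
    C = A - B                                          # cause degree, Eq. (4)
    D = A + B                                          # centrality degree, Eq. (5)
    Q = np.sqrt(C ** 2 + D ** 2)                       # comprehensive evaluation value, Eq. (6)
    return Q / Q.sum()                                 # relative weights, Eq. (7)

# Hypothetical 3x3 direct influence matrix (illustrative values only)
example = [[0.0, 0.8, 0.6],
           [0.5, 0.0, 0.8],
           [0.3, 0.4, 0.0]]
print(dematel_weights(example))   # weights sum to 1
```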

Example 1

Toy example for the DEMATEL process.

Step 1: Problem definition. Consider a simplified teaching evaluation system with three indicators: TC, TM and PD. We construct a 3 × 3 direct influence matrix based on expert linguistic assessments in Table 2.

Table 2 3 × 3 direct influence matrix based on expert linguistic assessments.
$$\varvec{\varOmega}=\left[ {\begin{array}{*{20}{c}} 0&{0.825}&{0.675} \\ {0.525}&0&{0.825} \\ {0.25}&{0.40}&0 \end{array}} \right]$$

Step 2: Normalization. First, calculate the row sums.

Row 1 (TC): 0 + 0.825 + 0.675 = 1.500.

Row 2 (TM): 0.525 + 0 + 0.825 = 1.350.

Row 3 (PD): 0.25 + 0.40 + 0 = 0.650.

Maximum row sum: \(\hbox{max} (1.500,1.350,0.650)=1.500\).

Using Eq. (2), compute the normalized matrix:

\({\varvec{\varOmega}_0}=t\cdot \varvec{\varOmega}=\frac{1}{{1.500}}\cdot \left[ {\begin{array}{*{20}{c}} 0&{0.825}&{0.675} \\ {0.525}&0&{0.825} \\ {0.25}&{0.40}&0 \end{array}} \right]=\left[ {\begin{array}{*{20}{c}} 0&{0.5500}&{0.4500} \\ {0.3500}&0&{0.5500} \\ {0.1667}&{0.2667}&0 \end{array}} \right]\)

Step 3: Calculate the comprehensive relationship matrix \({\varvec{R}}\). The identity matrix \({\varvec{E}}\) is:

\({\varvec{E}}=\left[ {\begin{array}{*{20}{c}} 1&0&0 \\ 0&1&0 \\ 0&0&1 \end{array}} \right]\)

Compute \({\varvec{E}} - {\varvec{\varOmega}_0}\):

\({\varvec{E}} - {\varvec{\varOmega}_0}=\left[ {\begin{array}{*{20}{c}} 1&{-0.5500}&{-0.4500} \\ {-0.3500}&1&{-0.5500} \\ {-0.1667}&{-0.2667}&1 \end{array}} \right]\)

The inverse of \({\varvec{E}} - {\varvec{\varOmega}_0}\) is:

\({\left( {{\varvec{E}} - {\varvec{\varOmega}_0}} \right)^{ - 1}}=\left[ {\begin{array}{*{20}{c}} {1.9547}&{1.5847}&{1.7295} \\ {1.2953}&{2.2217}&{1.8941} \\ {0.7708}&{1.0169}&{2.0780} \end{array}} \right]\)

The comprehensive relationship matrix \({\varvec{R}}\) is:

\({\varvec{R}}={\varvec{\varOmega}_0}{\left( {{\varvec{E}} - {\varvec{\varOmega}_0}} \right)^{ - 1}}=\left[ {\begin{array}{*{20}{c}} {1.0978}&{1.3651}&{1.4614} \\ {1.3383}&{1.3247}&{1.7856} \\ {0.7246}&{0.8455}&{1.0524} \end{array}} \right]\)

Step 4: Calculate the cause degree and the centrality degree.

Row sums: A1 = 1.0978 + 1.3651 + 1.4614 = 3.9243, A2 = 1.3383 + 1.3247 + 1.7856 = 4.4486, A3 = 0.7246 + 0.8455 + 1.0524 = 2.6225.

Column sums: B1 = 1.0978 + 1.3383 + 0.7246 = 3.1607, B2 = 1.3651 + 1.3247 + 0.8455 = 3.5353, B3 = 1.4614 + 1.7856 + 1.0524 = 4.2994.

According to Eq. (4), the cause degrees are: C1 (TC) = 3.9243 − 3.1607 = 0.7636, C2 (TM) = 4.4486 − 3.5353 = 0.9133, C3 (PD) = 2.6225 − 4.2994 = −1.6769.

According to Eq. (5), the centrality degrees are: D1 (TC) = 3.9243 + 3.1607 = 7.0850, D2 (TM) = 4.4486 + 3.5353 = 7.9839, D3 (PD) = 2.6225 + 4.2994 = 6.9219.

Step 5: Calculate the final weights. Using Eqs. (6) and (7), the normalized weights are:

$$\varvec{\omega}={\left[ {{\omega _i}} \right]_{1 \times m}}=[{\omega _1}\;\;{\omega _2}\;\;{\omega _{3}}]=[0.2749\;\;0.3178\;\;0.4073].$$

According to these results, TM has the highest centrality degree (D2 = 7.9839), indicating that it is the most central indicator in the system. PD (C3 = −1.6769) is identified as a net effect factor, meaning it is strongly influenced by other factors but exerts relatively little influence on them. Both TC and TM are cause factors, actively influencing other indicators in the system. The final weights reflect both the centrality and the cause degree of each indicator, with PD receiving the highest weight due to its position as an effect factor that accumulates influences from the other indicators.

This toy example demonstrates the complete DEMATEL process from linguistic assessments to numerical weights, showing how qualitative judgments can be transformed into quantitative importance measures for decision-making.

Probabilistic linguistic term sets and possibility degree

Construction of probabilistic linguistic term sets and possibility degree

Definition 1

A PLTS27 defined on a domain is represented as follows:

$$L=\left\{ {\left. {{l^k}\left( {{p^k}} \right)} \right|{l^k} \in \Omega ,{p^k} \succ 0,k=0,1,2, \ldots ,n} \right\}$$
(8)

where \({l^k}\left( {{p^k}} \right)\) represents an element in the PLTS, \({l^k}\) is the linguistic term, and \({p^k}\) is the probability corresponding to \({l^k}\).

Definition 2

If the probability information defined on the PLTS L does not satisfy \(\sum\limits_{{k=1}}^{n} {{p^k}} =1\), standardization is required. The standardized probability information is:

$${\tilde {p}^k}=\frac{{{p^k}}}{{\sum\limits_{{k=1}}^{n} {{p^k}} }}$$
(9)

The standardized PLTS after standardization is represented as:

$$\tilde {L}=\left\{ {\left. {{l^k}\left( {{{\tilde {p}}^k}} \right)} \right|{l^k} \in \Omega ,{{\tilde {p}}^k} \succ 0,k=0,1,2, \ldots ,n} \right\}$$
(10)

where \(\sum\limits_{{k=1}}^{n} {{{\tilde {p}}^k}} =1\).

Definition 3

Suppose \(\Omega =\left\{ {\left. {{\Omega _i}} \right|i=1,2, \ldots ,\delta } \right\}\) is a given linguistic term set, and the standardized PLTSs \({L_1}=\left( {\left. {l_{1}^{k}\left( {\tilde {p}_{1}^{k}} \right)} \right|k=1,2, \ldots ,{n_1}} \right)\) and \({L_2}=\left( {\left. {l_{2}^{k}\left( {\tilde {p}_{2}^{k}} \right)} \right|k=1,2, \ldots ,{n_2}} \right)\) are defined on \(\Omega\). The possibility degree formula50 for \({L_1} \geqslant {L_2}\) is defined as:

$$P\left( {{L_1} \geqslant {L_2}} \right)=0.5+\sum\limits_{{k=1}}^{{{n_1}}} {t_{1}^{k}\cdot \tilde {p}_{1}^{k}} - \sum\limits_{{k=1}}^{{{n_2}}} {t_{2}^{k}\cdot \tilde {p}_{2}^{k}}$$
(11)
$$t_{2}^{k}=\frac{{c_{2}^{k}}}{\delta } - 0.5{\left( {\frac{{c_{2}^{k}}}{\delta }} \right)^2}$$
(12)
$$t_{1}^{k}=\frac{{c_{1}^{k}}}{\delta } - 0.5{\left( {\frac{{c_{1}^{k}}}{\delta }} \right)^2}$$
(13)

where \(c_{1}^{k}\) and \(c_{2}^{k}\) are the subscripts of \(l_{1}^{k}\) and \(l_{2}^{k}\) in the linguistic term set, and \(\tilde {p}_{1}^{k}\) and \(\tilde {p}_{2}^{k}\) are the corresponding standardized probabilities.

Definition 4

For two standardized PLTSs \({L_1}\) and \({L_2}\), the possibility degree comparison relationship between \({L_1}\) and \({L_2}\) is defined as:

$$P\left( {{L_1} \geqslant {L_2}} \right)\left\{ \begin{array}{ll} >0.5, & {L_1}\ \text{is better than}\ {L_2} \\ =1, & {L_1}\ \text{is clearly better than}\ {L_2} \\ =0.5, & {L_1}\ \text{is equal to}\ {L_2} \end{array} \right.$$
(14)

Definition 5

Suppose the possibility degree matrix \({{\varvec{P}}_{\varvec{i}}}\) for a certain attribute i is calculated according to Definition 3. \({{\varvec{P}}_{\varvec{i}}}\) is defined as:

$${{\varvec{P}}_{\varvec{i}}}=\left[ {\begin{array}{*{20}{c}} {P_{{11}}^{i}}&{P_{{12}}^{i}}& \ldots &{P_{{1m}}^{i}} \\ {P_{{21}}^{i}}&{P_{{22}}^{i}}& \ldots &{P_{{2m}}^{i}} \\ \vdots & \vdots & \ddots & \vdots \\ {P_{{m1}}^{i}}&{P_{{m2}}^{i}}& \ldots &{P_{{mm}}^{i}} \end{array}} \right]$$
(15)

Based on the calculated weights \({\omega _i}\) the comprehensive possibility degree matrix \({\varvec{P}}\) is defined as:

$${\varvec{P}}=\sum\limits_{{i=1}}^{m} {{\omega _i}\cdot } {{\varvec{P}}_{\varvec{i}}}$$
(16)

Definition 6

Let \({\varvec{x}}={\left( {{x_1},{x_2}, \ldots ,{x_n}} \right)^T}\), with \({x_i} \in \left[ {0,1} \right]\), be the ranking vector of the comprehensive possibility degree matrix. \({x_i}\) is defined as:

$${x_i}=\frac{{\sum\limits_{{k=1}}^{n} {{P_{ik}}} +1}}{n} - 0.5$$
(17)

Example 2

Illustrating the advantage of the proposed possibility degree.

Consider the evaluation of teaching experts using the 7-level linguistic term set; the linguistic terms with their precise scoring ranges are shown in Table 3.

Table 3 Linguistic term set with precise scoring ranges.

All three PLTSs are based on the linguistic term set \(\Omega =\left\{ {{S_0}:EL,{S_1}:L,{S_2}:LE,{S_3}:M,{S_4}:LH,{S_5}:H,{S_6}:EH} \right\}\), where L1 = {S5(1.0)} (expectation = 5.0, variance = 0), L2 = {S4(0.5), S5(0.5)} (expectation = 4.5, variance = 0.25), and L3 = {S4(0.2), S5(0.6), S6(0.2)} (expectation = 5.0, variance = 0.4).

Mathematical comparison using subscripts:

For L1: Subscript = 5, Probability = 1.0.

For L2: Subscripts = {4, 5} with probabilities {0.5, 0.5}.

For L3: Subscripts = {4, 5, 6} with probabilities {0.2, 0.6, 0.2}.

Comparison is performed by calculating the expectation value of each PLTS:

For L1: E(L1) = 5 × 1.0 = 5.0.

For L2: E(L2) = 4 × 0.5 + 5 × 0.5 = 4.5.

For L3: E(L3) = 4 × 0.2 + 5 × 0.6 + 6 × 0.2 = 5.0.

Thus, the traditional method yields: L1 = L3 > L2.

Setting \(\delta =7\), the calculation process of the proposed possibility degree according to Eqs. (11)–(13) is as follows:

Step 1: Calculate t-values. According to the formulas (12) and (13), where \(\delta =7\):

\(t(4)=0.4081\), \(t(5)=0.4592\), \(t(6)=0.4897\).

Step 2: Calculate\(\sum {{t_k}{p_k}}\)for each PLTS.

For L1: \(\sum {{t_1}{p_1}}\) = \(t(5)\)×1.0 = 0.4592.

For L2: \(\sum {{t_2}{p_2}}\)= \(t(4)\)×0.5 + \(t(5)\)×0.5 = 0.4081 × 0.5 + 0.4592 × 0.5 = 0.4337.

For L3: \(\sum {{t_3}{p_3}}\)= \(t(4)\)×0.2 + \(t(5)\)×0.6 + \(t(6)\)×0.2 = 0.4550.

Step 3: Calculate pairwise possibility degrees. According to the formula (11),

\(P({L_1} \geqslant {L_2})\)= 0.5 + 0.4592 − 0.4337 = 0.5255,

\(P({L_1} \geqslant {L_3})\)= 0.5 + 0.4592 − 0.4550 = 0.5042,

\(P({L_3} \geqslant {L_2})\)= 0.5 + 0.4550 − 0.4337 = 0.5213.

Step 4: Determine the ranking.

Since \(P({L_1} \geqslant {L_2})>0.5\), \(P({L_1} \geqslant {L_3})>0.5\), and \(P({L_3} \geqslant {L_2})>0.5\), the final ranking is L1 > L3 > L2.

From the above calculation process and results, we can see that the traditional expectation-based method would consider L1 = L3 > L2. However, the evaluation of L3 shows significant divergence (large variance), while the evaluation of L1 is highly consistent. In actual decision-making, the more consistently evaluated L1 might be preferred. A formula based solely on expectations would yield \(P({L_1} \geqslant {L_3})=0.5\), failing to distinguish between L1 and L3. The possibility degree formula proposed in this paper gives \(P({L_1} \geqslant {L_3})=0.5042>0.5\), clearly indicating that although the expectations are the same, L1 is considered superior to L3 because of its more concentrated distribution. This counterexample demonstrates that the proposed method can capture the variance information overlooked by the compared methods, thereby making more refined and reasonable distinctions.
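The calculation in Example 2 can be checked with a short script. The minimal Python sketch below implements Eqs. (11)–(13) for standardized PLTSs given as (subscript, probability) pairs; the function names are ours, and small differences in the last decimal place relative to the text come from rounding the intermediate t-values.

```python
def t_value(c, delta=7):
    """t-value of a linguistic subscript c, Eqs. (12)-(13)."""
    return c / delta - 0.5 * (c / delta) ** 2

def possibility(L1, L2, delta=7):
    """P(L1 >= L2) for standardized PLTSs given as (subscript, probability) pairs, Eq. (11)."""
    s1 = sum(t_value(c, delta) * p for c, p in L1)
    s2 = sum(t_value(c, delta) * p for c, p in L2)
    return 0.5 + s1 - s2

L1 = [(5, 1.0)]
L2 = [(4, 0.5), (5, 0.5)]
L3 = [(4, 0.2), (5, 0.6), (6, 0.2)]

print(round(possibility(L1, L2), 4))  # ~0.5255
print(round(possibility(L1, L3), 4))  # ~0.5041 (> 0.5, so L1 > L3)
print(round(possibility(L3, L2), 4))  # ~0.5214
```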

Probabilistic linguistic possibility degree teaching evaluation process

Step 1: Construct the evaluation indicator matrix \(\varvec{\varOmega}\) based on the evaluation indicator formula (1) and the scoring table.

Step 2: Calculate the relative weights based on Eqs. (2)–(7) of the DEMATEL method to obtain the weight set \(\varvec{\omega}\).

Step 3: Calculate the standardized probabilistic linguistic sets based on Eqs. (8)–(10), compute the possibility degrees for the individual indicator probabilistic linguistic sets according to Eqs. (11)–(14), and compare the possibility degrees with each other.

Step 4: Calculate the individual possibility degree matrix \({{\varvec{P}}_{\varvec{i}}}\) based on Eq. (15), and compute the comprehensive possibility degree matrix \({\varvec{P}}\) by combining weight set \(\varvec{\omega}\) with the individual possibility degree matrix \({{\varvec{P}}_{\varvec{i}}}\) according to Eq. (16).

Step 5: Sort \({\varvec{P}}\) according to Eq. (17) to obtain the final results.
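To make Steps 3–5 concrete, the sketch below aggregates per-indicator possibility degree matrices into the comprehensive matrix (Eq. (16)) and derives the ranking vector (Eq. (17)); the matrices and weights are hypothetical placeholders, not the study’s evaluation data.

```python
import numpy as np

def ranking_vector(P):
    """Ranking vector x from a comprehensive possibility degree matrix, Eq. (17)."""
    n = P.shape[0]
    return (P.sum(axis=1) + 1.0) / n - 0.5

# Hypothetical per-indicator possibility degree matrices for 3 alternatives
P_i = [np.array([[0.50, 0.55, 0.60],
                 [0.45, 0.50, 0.52],
                 [0.40, 0.48, 0.50]]),
       np.array([[0.50, 0.47, 0.53],
                 [0.53, 0.50, 0.56],
                 [0.47, 0.44, 0.50]])]
weights = np.array([0.6, 0.4])                      # illustrative DEMATEL weights

P = sum(w * Pi for w, Pi in zip(weights, P_i))      # comprehensive matrix, Eq. (16)
x = ranking_vector(P)
print(np.argsort(-x))                               # alternative indices, best first
```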

Teaching effectiveness evaluation and validation

Teaching instance validation

Through questionnaires, interviews, and other methods, five agricultural experts responsible for laboratory safety (T1-T5) were surveyed to rate various indicators. The standardized probabilistic linguistic evaluation matrix \(\varvec{\varOmega}\) was constructed, as shown in Table 4.

Table 4 Standardized probabilistic linguistic evaluation matrix \(\varvec{\varOmega}\).
Table 5 Indicator evaluation relationship matrix.

Calculate the comprehensive relationship matrix based on Eqs. (2) and (3).

$${\varvec{R}}=\left[ {\begin{array}{*{20}{c}} {0.448}&{0.537}&{0.614}&{0.558}&{0.471} \\ {0.652}&{0.513}&{0.462}&{0.671}&{0.667} \\ {0.377}&{0.428}&{0.733}&{0.387}&{0.351} \\ {0.587}&{0.566}&{0.486}&{0.508}&{0.496} \\ {0.419}&{0.708}&{0.549}&{0.267}&{0.645} \end{array}} \right]$$

Calculate the causality degrees and centrality degrees of each indicator, as shown in Table 6.

Table 6 Causality degrees and centrality degrees of each Indicator.
Fig. 1
figure 1

DEMATEL causal diagram of the “Safety Access Education in University Laboratories” course.

The weight set \(\varvec{\omega}=\left\{ {0.308,0.264,0.070,0.104,0.254} \right\}\) is obtained after normalizing each indicator. The DEMATEL causal diagram of the “Safety Access Education in University Laboratories” course is shown in Fig. 1: TC and TQ are the core driving factors in the cause area, exerting a significant influence on the other indicators. The bar chart of DEMATEL-based weights of teaching effectiveness factors is shown in Fig. 2; the weight distribution in Fig. 2 further validates this point, with TC having the highest weight.

Fig. 2
figure 2

The bar chart of DEMATEL-based weights of teaching effectiveness factors.

Step 2

Perform comparative calculations on the possibility degrees of the individual indicator probabilistic linguistic sets based on Eqs. (11)–(14); the results are as follows.

$${{\varvec{P}}_{{\varvec{T}}{\varvec{C}}}}=\left[ {\begin{array}{*{20}{c}} {0.5}&{0.4837}&{0.5147}&{0.5}&{0.4785} \\ {0.4963}&{0.5}&{0.4621}&{0.5287}&{0.5177} \\ {0.5122}&{0.4986}&{0.5}&{0.5644}&{0.5296} \\ {0.4826}&{0.5274}&{0.5049}&{0.5}&{0.4762} \\ {0.5}&{0.5122}&{0.4966}&{0.4278}&{0.5} \end{array}} \right]$$
$${{\varvec{P}}_{{\varvec{T}}{\varvec{M}}}}=\left[ {\begin{array}{*{20}{c}} {0.5}&{0.4795}&{0.4654}&{0.4987}&{0.5375} \\ {0.5348}&{0.5}&{0.4533}&{0.5027}&{0.5971} \\ {0.5497}&{0.4922}&{0.5}&{0.4861}&{0.5128} \\ {0.5}&{0.5673}&{0.5074}&{0.5}&{0.4784} \\ {0.5011}&{0.5128}&{0.4982}&{0.5691}&{0.5} \end{array}} \right]$$
$${{\varvec{P}}_{{\varvec{P}}{\varvec{D}}}}=\left[ {\begin{array}{*{20}{c}} {0.5}&{0.5367}&{0.5}&{0.4691}&{0.5463} \\ {0.4763}&{0.5}&{0.5136}&{0.4763}&{0.5149} \\ {0.5247}&{0.4988}&{0.5}&{0.5055}&{0.5184} \\ {0.4692}&{0.5149}&{0.4959}&{0.5}&{0.4716} \\ {0.5074}&{0.4866}&{0.5189}&{0.5479}&{0.5} \end{array}} \right]$$
$${{\varvec{P}}_{{\varvec{T}}{\varvec{Q}}}}=\left[ {\begin{array}{*{20}{c}} {0.5}&{0.4934}&{0.5169}&{0.4722}&{0.5} \\ {0.5312}&{0.5}&{0.5121}&{0.5237}&{0.5049} \\ {0.4796}&{0.5047}&{0.5}&{0.4864}&{0.4628} \\ {0.4687}&{0.4633}&{0.4746}&{0.5}&{0.4793} \\ {0.5121}&{0.4978}&{0.5315}&{0.5032}&{0.5} \end{array}} \right]$$
$${{\varvec{P}}_{{\varvec{S}}{\varvec{C}}}}=\left[ {\begin{array}{*{20}{c}} {0.5}&{0.5222}&{0.4713}&{0.4874}&{0.4833} \\ {0.5621}&{0.5}&{0.5063}&{0.5429}&{0.5} \\ {0.5037}&{0.4975}&{0.5}&{0.4978}&{0.5219} \\ {0.4963}&{0.4986}&{0.5151}&{0.5}&{0.5037} \\ {0.4617}&{0.5055}&{0.4787}&{0.5317}&{0.5} \end{array}} \right]$$

Step 3

Calculate the comprehensive probabilistic linguistic possibility degree comparison based on Eq. (16), as follows.

$${\varvec{P}}=\left[ {\begin{array}{*{20}{c}} {0.5}&{0.5296}&{0.5131}&{0.4912}&{0.5018} \\ {0.5149}&{0.5}&{0.4761}&{0.5177}&{0.4871} \\ {0.4893}&{0.4921}&{0.5}&{0.5038}&{0.5149} \\ {0.5011}&{0.5064}&{0.4835}&{0.5}&{0.5247} \\ {0.5063}&{0.4876}&{0.4966}&{0.5134}&{0.5} \end{array}} \right]$$

Step 4

Calculate the ranking vector from the comprehensive possibility degree matrix based on Eq. (17), as follows.

$${\varvec{x}}=\left[ {0.2088,0.2047,0.1863,0.2028,0.1974} \right]$$

Step 5

Rank the influencing factors based on the calculation results of ranking vector \({\varvec{x}}\), with the results as follows.

$${x_{TC}} \succ {x_{TQ}} \succ {x_{TM}} \succ {x_{SC}} \succ {x_{PD}}$$

According to the above ranking results, \({x_{TC}}\) has the highest score and \({x_{PD}}\) has the lowest score.

Robustness test

To verify the robustness of the research results, we conducted a systematic sensitivity analysis, including: sensitivity analysis of parameters, sensitivity test of the language-numeric mapping scheme, and cross-validation based on leave-one-out method.

Sensitivity analysis of parameters

To verify the stability of the ranking results of the model proposed in this paper, a sensitivity analysis is conducted. We fine-tune the TC indicator, which has the largest weight, by adjusting its weight by ± 5% and ± 10% from the benchmark value (0.308), and simultaneously adjust the weights of other indicators proportionally to keep the total sum at 1, to observe whether there are significant changes in the final ranking51. The original weights of the evaluation model’s indicators are shown in Table 7. The expert scores and rankings under the original indicator weights are presented in Table 8. The calculation method for the transformation of indicator weights is as follows:

$${\omega _{inew}}={\omega _i}\cdot (1 - \mu \cdot {\omega _{TC}})$$
(18)

where \(\mu\) is the weight adjustment coefficient; since TC is adjusted by ± 5% and ± 10% from its base value (0.308), \(\mu\) takes the values 0.90, 0.95, 1.00, 1.05, and 1.10. \(\mu \cdot {\omega _{TC}}\) is the adjusted weight of TC, \((1 - \mu \cdot {\omega _{TC}})\) is the total weight of the other four indicators, \({\omega _i}\) is the original proportional share of each remaining indicator within that total, and \({\omega _{inew}}\) is its adjusted weight.
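A minimal sketch of this perturbation is given below, under one reading of Eq. (18) in which the remaining weight mass \(1 - \mu \cdot {\omega _{TC}}\) is redistributed to the other indicators in proportion to their original shares so that the weights still sum to 1; the function name is ours, and the base weights are those reported in Table 7.

```python
def adjust_weights(weights, target, mu):
    """Scale the weight of `target` by mu and redistribute the remaining mass
    proportionally among the other indicators (one reading of Eq. (18))."""
    new = {}
    rest = 1.0 - weights[target]                 # original total weight of the other indicators
    for name, w in weights.items():
        if name == target:
            new[name] = mu * w                   # adjusted TC weight
        else:
            new[name] = (w / rest) * (1.0 - mu * weights[target])
    return new

base = {"TC": 0.308, "TM": 0.264, "PD": 0.070, "TQ": 0.104, "SC": 0.254}
for mu in (0.90, 0.95, 1.00, 1.05, 1.10):
    adj = adjust_weights(base, "TC", mu)
    print(mu, {k: round(v, 3) for k, v in adj.items()}, round(sum(adj.values()), 3))
```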

Table 7 Weights of DEMATEL indicators.
Table 8 Expert scores and rankings under original weights.
Parameters sensitivity experimental results

The expert scores and rankings after adjusting the TC weights by ± 5% and ± 10% according to Eq. (18) are shown in Table 9. The impact of TC weight changes on expert scores is illustrated in Fig. 3, the changes in expert rankings under different TC weights are presented in Fig. 4, and the distribution of indicator weights under different TC weights is depicted in Fig. 5.

Table 9 The expert scores and rankings after adjusting the TC weights.
Fig. 3
figure 3

Impact of TC weight changes on expert scores.

Fig. 4
figure 4

The changes in expert rankings under different TC weights.

Fig. 5
figure 5

Distribution of indicator weights under different TC weights.

Parameters sensitivity analysis

As shown in Table 9, when the TC weight fluctuates within ± 10%, the ranking order of the experts remains unchanged as T1 > T2 > T4 > T5 > T3. The impact of TC weight changes on expert scores is very small, indicating that the model’s ranking results are highly stable with respect to changes in the weight of the most important indicator. The sensitivity analysis results show that the evaluation model proposed in this paper, based on DEMATEL and the probabilistic linguistic possibility degree, has good robustness. Even when the weight of the most important indicator fluctuates within a reasonable range, the comprehensive evaluation of the experts’ teaching effectiveness remains stable, proving that the model provides reliable and consistent ranking results.

Sensitivity test of the Language-numeric mapping scheme

To examine the sensitivity of the evaluation results to the selection of the mapping scheme, this study designed a comparative experiment in which alternative mapping schemes are compared with the scheme proposed in this paper.

Comparative experiment design

Let the mapping scheme of this article be M₀, and define two alternative schemes:

Equidistant mapping scheme M147: Divide the interval [0, 1] into 7 equal parts uniformly. The width of each interval = 1/7 ≈ 0.1429.

Logarithmic mapping scheme M252: Based on a logarithmic scale division, \(x_i=\log (1+i)/\log 8\), \(i=0,1,\ldots,6\), where \({x_i}\) is the logarithmic division of the interval.

The Language-numeric mapping scheme evaluation index

To quantitatively compare the differences in results, two indicators, namely the weight difference degree \(\vartriangle \omega\) and the ranking consistency index \(RCI\), are defined53. The weight difference degree is:

$$\vartriangle \omega =(1/n)\cdot \sum {\left| {{\omega _{ial}} - {\omega _{ior}}} \right|}$$
(19)

where \({\omega _{ior}}\) is the weight of indicator i under the original scheme, \({\omega _{ial}}\) is its weight under the alternative scheme, and n is the number of indicators.

The ranking consistency index is as:

$$RCI=\left( {{C_E}/{N_E}} \right)\cdot 100\%$$
(20)

where \({C_E}\) is the number of experts with the same ranking, and \({N_E}\) is the total number of experts.
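A brief sketch of Eqs. (19)–(20) is given below; the input weights and rankings are illustrative placeholders, not the study’s data.

```python
def weight_difference(w_original, w_alternative):
    """Mean absolute weight difference between two mapping schemes, Eq. (19)."""
    n = len(w_original)
    return sum(abs(a - b) for a, b in zip(w_alternative, w_original)) / n

def ranking_consistency(rank_original, rank_alternative):
    """Percentage of experts ranked identically under both schemes, Eq. (20)."""
    same = sum(1 for a, b in zip(rank_original, rank_alternative) if a == b)
    return 100.0 * same / len(rank_original)

# Illustrative values only
print(weight_difference([0.308, 0.264, 0.070, 0.104, 0.254],
                        [0.300, 0.270, 0.075, 0.110, 0.245]))
print(ranking_consistency([1, 2, 5, 3, 4], [1, 2, 5, 3, 4]))  # 100.0
```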

Experimental results and analysis for the Language-numeric mapping scheme

The experimental results are shown in Table 10. The weight difference degree \(\vartriangle \omega\) of all alternative schemes is less than 0.03, indicating that changing the mapping scheme has a limited impact on the indicator weights. The expert ranking consistency \(RCI\) exceeds 95% for all schemes, and the optimal expert (T1) remains the same under every scheme. Although the numerical mapping methods differ, the core conclusions of the evaluation (optimal expert selection and identification of the important indicators) remain unchanged. These comparison results demonstrate the reliability of the proposed mapping scheme.

Table 10 Comparison of results under different mapping schemes.

Cross-validation based on leave-one-out method

In the leave-one-out cross-validation process54,55, one expert is removed at a time from the five experts, and the evaluation results are recalculated using the data of the remaining four experts. Two indicators are defined here: the ranking stability index \(RSI\) and the weight consistency coefficient \({\omega _c}\). The formula for the ranking stability index is:

$$RSI=\left( {{C_I}/{T_I}} \right)\cdot 100\%$$
(21)

where \({C_I}\) represents the number of indicators with consistent ranking, and \({T_I}\) represents the total number of indicators. The formula for the weight consistency coefficient is:

$${\omega _c}=1 - \vartriangle {\omega _{\hbox{max} }}/{\omega _r}$$
(22)

where \(\vartriangle {\omega _{\hbox{max} }}\) represents the maximum variation of the weight of a given indicator across the leave-one-out runs. It measures the sensitivity of the model weights to minor data perturbations (deleting one sample). If \(\vartriangle {\omega _{\hbox{max} }}\) is small, the weights change little during leave-one-out and the model stability is high; if it is large, the model weights are sensitive to individual samples and the stability is poor. \({\omega _r}\) represents the absolute value of the weight of the indicator obtained using all the data. The closer \({\omega _c}\) is to 1, the better the consistency of the weights.
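The procedure can be sketched as follows, assuming `expert_data` is a list of expert evaluations and `evaluate` is a hypothetical callback that returns the indicator weights and the indicator ranking for a given subset; here \(RSI\) is computed as the share of indicator ranks left unchanged across the leave-one-out runs, which is one reading of Eq. (21).

```python
import numpy as np

def leave_one_out(expert_data, evaluate):
    """Drop one expert at a time and track ranking and weight stability (Eqs. (21)-(22))."""
    full_w, full_rank = evaluate(expert_data)          # weights and ranking on all data
    full_w = np.asarray(full_w, dtype=float)
    max_dev = np.zeros_like(full_w)
    consistent, total = 0, 0
    for i in range(len(expert_data)):
        subset = expert_data[:i] + expert_data[i + 1:]
        w, rank = evaluate(subset)
        max_dev = np.maximum(max_dev, np.abs(np.asarray(w) - full_w))
        consistent += sum(a == b for a, b in zip(rank, full_rank))
        total += len(full_rank)
    rsi = 100.0 * consistent / total                   # ranking stability index, Eq. (21)
    omega_c = 1.0 - max_dev / full_w                   # weight consistency coefficients, Eq. (22)
    return rsi, omega_c
```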

As shown in Table 11, even if any one expert is removed, the evaluation rankings remain exactly the same, and the weight changes do not exceed 2%. This proves that the results are not sensitive to the evaluation of any single expert.

Table 11 Leave-one-out cross-validation results.

Reliability analysis of expert evaluations

To conduct a reliability analysis of the expert evaluations, we carried out Kendall’s coefficient of concordance test and intra-class correlation coefficient analysis.

Kendall’s coefficient of concordance test

In the Kendall’s coefficient of concordance test56, the evaluation scores of the 5 experts on the 5 indicators are converted into ranks, and the Kendall’s W coefficient is calculated. Let the evaluation score of expert j on indicator i be \({x_{ij}}\); the scores are converted into ranks \({r_{ij}}\) (1 for the highest score and 5 for the lowest score). The calculation formula is:

$$W=\frac{12S}{{m^2}\left( {{n^3} - n} \right)}$$
(23)

where \(m=5\) is the number of experts, \(n=5\) is the number of indicators, \(S=\sum {{{\left( {{R_i} - \bar R} \right)}^2}}\), \({R_i}\) is the rank sum of indicator i, and \(\bar R\) is the average rank sum. When \(W>0.7\), there is good consistency among experts; when \(0.5<W \leqslant 0.7\), there is moderate consistency among experts; when \(W \leqslant 0.5\), the consistency among experts is poor.
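As a concrete check of Eq. (23), the following short sketch computes Kendall’s W from an experts × indicators rank matrix; the rank values shown are illustrative placeholders rather than the study’s data.

```python
import numpy as np

def kendalls_w(ranks):
    """Kendall's coefficient of concordance W, Eq. (23).
    `ranks` is an (m experts) x (n indicators) matrix of ranks, 1 = highest score."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)                     # R_i for each indicator
    S = ((rank_sums - rank_sums.mean()) ** 2).sum()   # S = sum of squared deviations of R_i
    return 12.0 * S / (m ** 2 * (n ** 3 - n))

# Illustrative rank matrix: each row is one expert's ranking of the five indicators
ranks = np.array([[1, 3, 5, 4, 2],
                  [1, 2, 5, 4, 3],
                  [2, 3, 5, 4, 1],
                  [1, 3, 4, 5, 2],
                  [1, 2, 5, 4, 3]])
print(round(kendalls_w(ranks), 3))   # values near 1 indicate strong agreement
```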

Intra-class correlation coefficient analysis

We selected the ICC(3,1) model57 to conduct the intra-class correlation coefficient analysis and calculate the intra-class correlation coefficient of the expert scores. The calculation formula is:

$$ICC=\frac{MSB - MSW}{MSB+\left( {k - 1} \right)\cdot MSW}$$
(24)

where \(MSB\) is the between-group mean square, indicating the variation among indicators, \(MSW\) is the within-group mean square, indicating the variation among experts, and k is the number of experts. When \(ICC>0.75\), reliability is excellent; when \(0.6<ICC \leqslant 0.75\), reliability is good; when \(ICC \leqslant 0.6\), reliability is insufficient.
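A minimal sketch of Eq. (24) is shown below, assuming an indicators × experts score matrix and the usual between-indicator and within-indicator mean squares; the scores are placeholders, not the study’s data.

```python
import numpy as np

def icc_3_1(scores):
    """ICC(3,1) following Eq. (24): `scores` is an (n indicators) x (k experts) matrix."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    row_means = scores.mean(axis=1, keepdims=True)
    msb = k * ((row_means - scores.mean()) ** 2).sum() / (n - 1)   # between-indicator mean square
    msw = ((scores - row_means) ** 2).sum() / (n * (k - 1))        # within-indicator mean square
    return (msb - msw) / (msb + (k - 1) * msw)

# Illustrative 5 x 5 score matrix (indicators x experts)
demo = np.array([[0.82, 0.80, 0.85, 0.78, 0.81],
                 [0.70, 0.72, 0.68, 0.74, 0.71],
                 [0.55, 0.58, 0.52, 0.57, 0.54],
                 [0.63, 0.60, 0.66, 0.61, 0.65],
                 [0.75, 0.77, 0.73, 0.76, 0.74]])
print(round(icc_3_1(demo), 3))
```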

Analysis of the results of expert evaluation reliability

As shown in Table 12, the Kendall’s coefficient W in the expert evaluation reliability results is 0.82, indicating a high degree of consistency among the experts’ evaluations. The intra-class correlation coefficient value \(\left( {ICC} \right)\) is 0.78, suggesting excellent evaluation reliability. These two expert evaluation consistency test results indicate that all consistency indicators have reached a good level or above, proving that the evaluations of the five experts are reliably consistent and the research results are not the accidental outcome of a small sample.

Table 12 Results of consistency check for expert evaluation.

Bootstrap-based confidence interval estimation

Bootstrap confidence interval estimation58 does not rely on theoretical assumptions about the distribution of the weights. Through Bootstrap confidence interval estimation, we can quantify the statistical uncertainty of the model weights and evaluate their stability.

Bootstrap sampling method

The process of the sampling method based on Bootstrap is as follows:

Step 1: Five samples were randomly drawn with replacement from the experts’ original data.

Step 2: Recalculate all evaluation indicators based on the sampled data.

Step 3: Repeat the above process B = 1000 times to obtain 1000 sets of estimated values.

Step 4: Calculate the 95% confidence interval based on 1000 estimated values.

The formula for calculating the 95% confidence interval \({Q_v}\)59 is as follows:

$${Q_v}=[{Q_{v2.5}},{Q_{v97.5}}]$$
(25)

where \({Q_{v2.5}}\) and \({Q_{v97.5}}\) represent the interval boundaries formed by taking the 2.5th percentile and the 97.5th percentile from the 1000 sets of estimated values obtained through Bootstrap resampling, respectively.
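The percentile interval in Eq. (25) can be obtained with a short routine like the one below, where `statistic` is a hypothetical callback that recomputes an indicator (for example a cause degree or a weight) from a resampled set of expert evaluations; the demo data are placeholders.

```python
import numpy as np

def bootstrap_ci(expert_data, statistic, B=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval, Eq. (25)."""
    rng = np.random.default_rng(seed)
    n = len(expert_data)
    estimates = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # resample experts with replacement
        sample = [expert_data[i] for i in idx]
        estimates.append(statistic(sample))         # recompute the indicator on the resample
    lo, hi = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Illustrative use: the "statistic" here is just the mean of hypothetical cause-degree values
demo_scores = [0.35, 0.31, 0.38, 0.33, 0.36]
print(bootstrap_ci(demo_scores, statistic=lambda s: sum(s) / len(s)))
```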

Confidence interval analysis of the degree of cause indicators

As shown in Table 13, the 95% confidence interval of TC cause degree ([0.315, 0.387]) is completely greater than 0, indicating it is a significant cause factor. The 95% confidence interval of TM cause degree ([0.005, 0.075]) includes positive values but is close to 0, suggesting it is a weak cause factor. The 95% confidence interval of PD cause degree ([−0.248, −0.182]) is completely less than 0, indicating it is a significant result factor. The 95% confidence interval of TQ cause degree ([0.418, 0.480]) is completely greater than 0, indicating it is a significant cause factor. The 95% confidence interval of SC cause degree ([−0.340, −0.284]) is completely less than 0, indicating it is a significant result factor.

Table 13 Bootstrap confidence intervals of the degree of causality index.
To analyse the certainty of the cause degree estimates, we define the certainty index \(DCI\), which is expressed as follows:

$$DCI=1 - ({C_I}/\left| {{P_{EV}}} \right|)$$
(26)

where \({C_I}\) represents the width of the confidence interval, and \(\left| {{P_{EV}}} \right|\) represents the absolute value of the point estimate.

After calculation, the \(DCI\) of TQ is 0.863, that of SC is 0.821, that of TC is 0.795, that of PD is 0.693, and that of TM is 0.250.

Confidence interval analysis of centrality indicators

To analyze the stability of centrality, we have defined the centrality stability index \(CCI\)61, which is expressed as follows:

$$CCI=1 - {C_{AR}}/{C_{MR}}$$
(27)

where \({C_{AR}}\) is the average relative error of centrality, and \({C_{MR}}\) is the maximum possible error of centrality.

As shown in Table 14, the centrality hierarchy of TC centrality and TM centrality is at the first level, with the highest system influence. The centrality hierarchy of SC centrality is at the second level, with medium system influence. The centrality hierarchy of TQ centrality is at the third level, with relatively low system influence. The centrality hierarchy of PD centrality is at the fourth level, with the lowest system influence.

Table 14 Bootstrap confidence intervals of centrality indicators.

Analysis of confidence interval of comprehensive weights

To analyze the stability of the comprehensive weight, we have set up the comprehensive weight stability index \(CWI\)62, which is expressed as follows:

$$CWI=1 - {A_{RE}}/{M_{PE}}$$
(28)

where \({A_{RE}}\) is the average relative error of the weights, and \({M_{PE}}\) is the maximum possible error.

As shown in Table 15, the 95% confidence intervals of the comprehensive weights calculated from centrality and causality do not include 0 for any indicator. The confidence interval of the TC weight is [0.291, 0.321], which is clearly higher than that of the other indicators and does not overlap with them. The confidence intervals of the TM and SC weights overlap to some extent, since the lower limit of SC (0.242) lies below the upper limit of TM (0.274).

Table 15 Bootstrap confidence intervals of the comprehensive weights.

Comprehensive evaluation of statistical stability

To comprehensively evaluate the stability of confidence intervals, we have established a comprehensive stability index \(CSI\)63, which is expressed as follows:

$$CSI=1 - {A_E}/{A_{BE}}$$
(29)

where \({A_E}\) represents the average relative error of the indicator, and \({A_{BE}}\) represents the maximum possible error of the indicator.

After calculation, the stability index of the cause degree is 0.652, that of the centrality degree is 0.824, and that of the weights is 0.867. The relative error of the weight estimation (CSI = 0.867) is controlled within ± 11.43%, making the weight estimation the most stable. The relative error of the centrality estimation (CSI = 0.824) is controlled within ± 6.52%, indicating that the centrality estimation is relatively stable. The cause degree estimation has CSI = 0.652, and the coefficient of variation of the TM cause degree is as high as 1.750, showing relatively low stability.

Confidence interval estimation analysis

All point estimates of the indicators are within the 95% confidence interval. TC and TQ are identified as cause factors, while PD and SC are recognized as result factors with statistical significance. The weight ranking (TC > TM > SC > TQ > PD) is supported by the confidence interval analysis. The uncertainty of the cause degree and weight estimation of the PD indicator is relatively large. The cause degree of TM is close to 0, resulting in its confidence interval including the area near 0. For high certainty indicators (TC, TM, SC), decision-makers can rely on the model results with more confidence. The overall model provides reliable evaluation results at the 95% confidence level.

The confidence interval analysis based on Bootstrap indicates that our conclusion is statistically significant. TC and TQ are confirmed as significant cause factors (the confidence interval is completely greater than 0), PD and SC are confirmed as significant result factors (the confidence interval is completely less than 0), and TM is a weak cause factor. The relative error of the weight estimation is controlled within a reasonable range, supporting the reliability of the evaluation results.

Comparative analysis of methods

To verify the effectiveness and advantages of the proposed DEMATEL method integrating PLTS and the possibility degree (DPL), we designed a systematic comparative experiment. To ensure that any differences in the comparison stem from the methodology itself rather than from different parameters or settings, all comparative methods use exactly the same original expert evaluation data, and the pre-processing steps are kept as consistent as possible.

Comparison baseline and experimental setup

We selected three categories of representative benchmark methods for fair comparison, as shown below:

(1) Variants of the classical method

Crisp-DEMATEL (CD)64: The probabilistic linguistic evaluations are aggregated into crisp numbers through the calculation of expected values, and then the classical DEMATEL is applied. This comparison aims to separate and highlight the value of probabilistic linguistic processing in handling uncertainty.

PLWA-DEMATEL (PLD)65: The probabilistic linguistic evaluations are aggregated using the probabilistic linguistic weighted averaging (PLWA) operator, followed by DEMATEL analysis. This comparison is used to evaluate the impact of different information aggregation strategies.

(2) Other probabilistic linguistic decision-making models

PL-TOPSIS (ET)66: A TOPSIS method based on probabilistic linguistic distance measures.

PL-TODIM (TL)67: A probabilistic linguistic TODIM method considering the psychological behavior of decision-makers.

(3) Fuzzy linguistic variants

Fuzzy-DEMATEL (FL)68: The original evaluations are transformed into triangular fuzzy numbers and the fuzzy DEMATEL method is applied. This comparison aims to explore the effects of different types of linguistic uncertainty processing methods.

To ensure the reproducibility of the experiments, all key parameters of the methods were standardized as shown in Table 16. The experiments were conducted in the same computing environment (MATLAB 2023b, Windows 11 operating system, Intel(R) Core (TM) i9-14900KF (3.20 GHz)), and each method was run 30 times, with the average running time of each method recorded.

Table 16 Parameter settings of comparative methods and average running time.

Results analysis and comparison

The evaluation results of the five teaching influencing factors by each method are shown in Table 17. To quantitatively compare the consistency of the results, we calculated the Spearman rank correlation coefficient between the results of each method and the results of the method in this paper (DPL)69.

Table 17 Comparison and consistency analysis of evaluation results by different methods.

The DPL method proposed here shows a high degree of consistency (ρ ≥ 0.95) with the other mainstream methods (PLD, TL, FL) in the ranking of factor importance, with all of them identifying TC and TM as the two most critical factors, which verifies the reliability of the method and the robustness of its conclusions. To further quantify the stability of DPL, we conducted a Bootstrap stability test, recalculating the factor ranking on each resample. The resulting Kendall’s coefficient of concordance of 0.82 indicates that the output ranking of the method has a high degree of internal consistency under reasonable fluctuations of the input data, verifying its good robustness.
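The consistency figures can be reproduced with a standard Spearman rank correlation. In the sketch below, the DPL ranking follows the weight ordering reported above (TC > TM > SC > TQ > PD), while the comparison ranking is a hypothetical placeholder for one of the benchmark methods.

```python
from scipy.stats import spearmanr

# Factor order: TC, TM, PD, TQ, SC; ranks from the DPL result (1 = most important)
dpl_rank = [1, 2, 5, 4, 3]
# Hypothetical ranking produced by a comparison method
other_rank = [1, 2, 5, 3, 4]

rho, p_value = spearmanr(dpl_rank, other_rank)
print(round(rho, 3))
```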

Summary

This section conducts a comprehensive validation of the effectiveness and robustness of the proposed DEMATEL method integrating the probabilistic linguistic possibility degree (DPL) in evaluating the teaching effectiveness of the “Safety Access Education in University Laboratories” course, through systematic empirical analysis and method comparison. First, a standardized probabilistic linguistic evaluation matrix was constructed, and the DPL method was applied to calculate the centrality, cause degree, and final weights of each indicator. The results show that TC is the core driving factor in the system, has a significant causal impact on teaching effectiveness, and carries the highest weight, while professional distinction (PD) has the least influence. This conclusion is visually presented in the DEMATEL causal diagram.

To verify the reliability of the results, this section conducts multi-level robustness tests. Parameter sensitivity analysis indicates that even when the weight of the highest-weighted TC indicator is perturbed by ± 10%, the comprehensive evaluation ranking of teaching effectiveness by experts remains stable. The sensitivity test of the language-numerical mapping scheme shows that using different mapping functions (equidistant, logarithmic) has a limited impact on the indicator weights (difference < 0.03), and the optimal expert identification results are consistent, demonstrating the robustness of the proposed mapping scheme. Leave-one-out cross-validation further confirms that the results are not sensitive to the evaluation of individual experts, with a ranking stability index of 100% and a weight consistency coefficient close to 1, demonstrating the method’s excellent anti-interference ability. Expert evaluation reliability analysis (Kendall’s W = 0.82, \(ICC=0.78\)) statistically confirms the high consistency and reliability of the expert group’s evaluation opinions.

To quantify the statistical uncertainty of the model results, this section uses the Bootstrap method to calculate the 95% confidence intervals of each indicator. The results show that all point estimates are within the confidence intervals, and the key conclusions are statistically significant: TC and TQ are confirmed as significant causal factors (confidence intervals completely greater than 0), and PD and SC are confirmed as significant result factors (confidence intervals completely less than 0). Through a comprehensive assessment of the stability indices of causality, centrality, and comprehensive weights, it is found that the stability of weight estimation is the highest (\(CSI=0.867\)), followed by centrality estimation, and the stability of causality estimation (especially TM) is relatively low. This provides a basis for decision-makers to understand the certainty differences of different indicators.

Finally, a systematic comparative analysis with classic DEMATEL variants (CD, PLD), other probabilistic linguistic decision-making models (ET, TL), and a fuzzy linguistic variant (FL) verifies the superiority of DPL at the methodological level. DPL's ranking results are highly consistent with those of several mainstream methods (Spearman correlation coefficient ρ ≥ 0.95), supporting the generality of its conclusions. More importantly, the comparison highlights the unique value of the DPL method: compared with methods that directly aggregate probabilistic linguistic information (CD), DPL preserves the distributional and uncertainty information in expert evaluations through the possibility-degree calculation; compared with other comprehensive decision-making models (ET and TL), DPL not only ranks factor importance but also reveals the causal network structure within the system, providing deeper systemic insights for management decisions.
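As an illustration of this consistency check, the snippet below computes Spearman's ρ between the DPL ranking and the rankings produced by the other methods; the rankings are hypothetical placeholders chosen only to mimic the reported pattern of high agreement except for ET:

```python
from scipy.stats import spearmanr

factors = ["TC", "TM", "TQ", "SC", "PD"]

# Hypothetical importance rankings (1 = most important) from each method; illustrative only.
rankings = {
    "DPL": [1, 2, 3, 4, 5],
    "PLD": [1, 2, 3, 4, 5],
    "TL":  [1, 2, 3, 4, 5],
    "FL":  [1, 2, 3, 4, 5],
    "ET":  [2, 1, 4, 5, 3],
}

# Compare every other method's ranking against the DPL ranking.
for name, ranks in rankings.items():
    if name == "DPL":
        continue
    rho, p = spearmanr(rankings["DPL"], ranks)
    print(f"Spearman rho(DPL, {name}) = {rho:.2f} (p = {p:.3f})")
```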

In conclusion, this section, through a variety of experimental designs and comprehensive validation analyses, confirms that the proposed DPL method is an effective, robust, and highly interpretable tool for evaluating teaching effectiveness. It reliably identifies key influencing factors, analyses the causal mechanisms among them, and remains robust to data perturbations, parameter settings, and alternative calculation schemes, providing scientific and reliable methodological support for analysing complex educational systems and making decisions in an uncertain linguistic environment. The above analysis shows that teaching content carries the largest weight and is the most important factor in the “Safety Access Education in University Laboratories” course, while professional differentiation has the smallest impact. This is mainly because laboratory safety access education focuses on universal, fundamental safety knowledge, so the influence of professional differentiation is relatively weak.

Furthermore, the evaluation results of the DEMATEL method based on probabilistic linguistic possibility degrees (DPL), the probabilistic linguistic fuzzy cognitive map method (FL), and the probabilistic linguistic TODIM method (TL) show a high degree of consistency, whereas the ELECTRE and TOPSIS methods with probabilistic linguistic terms (ET) yield a different ranking. The DPL method analyses the causal network and influence intensities among indicators through DEMATEL; its weights are derived from the system structure rather than subjective assumptions and therefore reflect the inherent importance of factors within the influence network. The FL method uses fuzzy cognitive maps to capture the nonlinear feedback and dynamic influences among indicators, and its output is the stable state that emerges from the system's internal interactions. The TL method is based on prospect theory and focuses on decision-makers' psychological reference points and loss-aversion behavior under risk, with its core being the simulation of choices under cognitive bias. These three methods essentially model causal, influence, and psychological reference relationships rather than simple additive aggregation of attributes. Consequently, when confronted with the same probabilistic linguistic information, they capture similar systemic structures or decision-making patterns and produce consistent results. The ET method, by contrast, usually aggregates probabilistic linguistic information into scalar expected values before entering the ELECTRE or TOPSIS procedure, which can cause early information loss and explains the difference between the ET method and the other three.

According to the results of this study, teachers of the “Safety Access Education in University Laboratories” course should prioritize allocating limited resources to optimizing teaching content and teaching methods. These two factors are central to stimulating students' interest, so greater attention should be paid to the quality and appeal of the teaching content, for example by adding suitable safety-education animations and using actual laboratory safety incidents as teaching cases.

Conclusions

To evaluate the influencing factors of teaching effectiveness in the “Safety Access Education in University Laboratories” course, this study proposes a method combining the DEMATEL method with probabilistic linguistic possibility degrees. The conclusions are as follows:

By evaluating the influencing factors of teaching effectiveness in the “Safety Access Education in University Laboratories” course, targeted improvements can be made: teaching content has the greatest impact, so enriching and optimizing the content can enhance student engagement and motivation; professional differentiation has the least impact, so its requirements can be appropriately relaxed in instructional design; and teaching method plays a positive role in improving teaching outcomes.

The framework integrating the DEMATEL method with probabilistic linguistic possibility degrees proposed in this study provides an effective tool for addressing complex and uncertain educational evaluation problems. The findings can be applied directly to guide universities in optimizing the design and implementation of laboratory safety courses, which has clear practical significance. Nevertheless, this study has limitations in sample selection: only five experts from the fields of agriculture and laboratory management were invited to participate in the evaluation, so the sample size is relatively small and the disciplinary background relatively concentrated, which may affect the stability and generalizability of the conclusions to some extent. In addition, although the evaluation index system was constructed on the basis of a literature review and theoretical analysis, it may not fully cover all potential factors affecting teaching effectiveness, introducing a risk of systematic bias.

To overcome these limitations and deepen the research, future work will proceed in three directions. First, expand the sample and diversify the disciplinary backgrounds, validating the model with a broader expert group to enhance the robustness and generality of the conclusions. Second, promote cross-integration of methodologies by combining this model with other multi-criteria decision-making methods (such as the Analytic Network Process) to construct a more refined evaluation framework better adapted to complex relationships. Finally, extend the data sources and evaluation scenarios: on the one hand, incorporate evaluation data from students to form a dual-group decision-making model alongside expert evaluations; on the other hand, apply the framework to other types of courses to further test its transferability and the boundaries of its applicability.