Introduction

Thyroid cancer is a prevalent malignancy worldwide, with an increasing incidence reported globally1,2. Cervical lymph node metastasis, encompassing both central and lateral compartments, is a critical factor affecting patient prognosis, with an incidence of 20–50%3 and a 46% increase in the risk of mortality4,5. The central compartment is widely recognized as the station for lymph node metastasis6, and numerous related studies have been conducted7,8,9. In contrast, studies on lateral compartment metastasis are scarce, primarily due to: (1) the complexity of the anatomical structure10, with the distribution pathways and connection patterns of lymphatic vessels around the thyroid varying among individuals11; (2) the dispersed distribution of lateral lymph nodes, which complicates statistical analysis and can lead to omissions12; (3) limitations of research methods, as lateral lymph nodes are often located in deep cervical tissues, and their small size and depth make accurate detection challenging, with ultrasound sensitivity in predicting lateral lymph node metastasis (LLNM) being only 62%3; and (4) difficulty in sample acquisition, as biopsy of lateral lymph nodes requires a high level of clinician expertise and poses certain risks to patients, limiting the scale and depth of related studies13. Furthermore, central compartment lymph nodes are routinely dissected, whereas prophylactic dissection of lateral lymph nodes is not considered standard procedure in some countries and regions, leading to a significant shortage of available samples for research3.

Preoperative prediction of lateral lymph node metastasis (LLNM) is crucial for surgical planning and prognostic management in thyroid cancer. A positive LLNM result typically indicates that the tumor has begun to spread more extensively, prompting physicians to adopt a more aggressive treatment strategy, which may include expanding the surgical field and considering postoperative radiotherapy or chemotherapy14. Previous studies15,16 have shown that LLNM is associated with a worse prognosis; the recurrence rate is significantly higher than in patients with central compartment lymph node metastasis (60% vs. 30%, P = 0.007). Disease-free survival and average recurrence time are also markedly shorter (30 months vs. 52 months, P = 0.035, and 7 months vs. 44 months, P = 0.004, respectively)15. Therefore, effectively predicting LLNM enables physicians to develop more appropriate treatment plans, reduce the risk of cancer progression due to missed dissections, and more accurately assess patient prognosis, providing more comprehensive support and care3,16. In some cases, patients may present with skip metastases (negative central compartment and positive lateral compartment)17,18, which are prone to being missed during preoperative evaluation and surgery19. Moreover, the prognosis of skip metastases varies among different tumor types20,21, suggesting that clinicians should consider the specific biological characteristics of the tumor and the anatomical pathways of the lymphatic system13. Currently, preoperative lymph node biopsy is the standard method for evaluating LLNM3; however, ultrasound-guided fine-needle aspiration has limitations, such as inaccurate or missed punctures3,22. Given the importance of LLNM, there is an urgent need for preoperative evaluation methods that can effectively predict it, assisting clinicians in determining the nature of the disease and taking appropriate measures, thereby contributing to improved patient survival.

Current research indicates that cervical lymph node metastasis is closely associated with the histological morphology and location of the primary tumor6,23,24. Specifically, it often results from the growth and spread of the primary tumor (thyroid nodule), with different tumor types exhibiting varying metastatic tendencies. For instance, abnormally enlarged tumors may be an important sign of lymph node metastasis25, while tumors with abnormal morphology or texture may also suggest the possibility of metastasis6,26. Moreover, the growth location of the primary tumor affects the invasion pathway of tumor cells into the lymphatic system, increasing the likelihood of superior pole metastasis23,24 and influencing the risk and prognosis of lymph node metastasis in different regions3,15.

Ultrasound imaging, known for its non-invasive, real-time, and convenient features, is one of the most common diagnostic methods for thyroid cancer. It helps physicians detect early tumor abnormalities such as increased size, irregular shape, and abnormal internal structure, thereby playing a predictive role in LLNM27,28,29. However, it suffers from low inter-organ contrast and poor image quality, and evaluation results depend heavily on the physician's experience. Deep learning techniques can enhance tumor recognition by learning image features such as tumor morphology, size, and calcification6,30,31,32,33. Techniques like foreground-background algorithms and graph convolutional networks statistically analyze positional information7, assisting physicians in preoperative diagnosis and prognosis assessment, including tumor malignancy grading, subtype evaluation, and prediction of cervical lymph node metastasis34,35. Yet large-scale cohort studies and efficient intelligent tools for precise analysis of LLNM are lacking16,27,36,37,38, and conclusions often lack qualitative or quantitative explanations23,24.

Here we present LLNM-Net, a bidirectional attention architecture that integrates multimodal data for preoperative LLNM prediction. As illustrated in Fig. 1a, we employ foreground-optimized segmentation39,40 and Central Point Distance Transformation (CPDT)41 to extract tumor morphology and precise location. Our Thyroid Multimodal Deep Learning (TMDL) transformer42 (Fig. 1b) fuses imaging features with clinical reports and demographic data via bidirectional attention exchange43,44,45. We generate 3D risk heatmaps through diffeomorphic registration and perform attention-based gradient analysis to interpret metastasis mechanisms. Evaluated on 39,451 patients from seven institutions (Fig. 2), LLNM-Net provides quantitative preoperative assessment to guide surgical planning and prognosis management.

Fig. 1: Network architecture of LLNM-Net.

a Data feature extraction process. For the ultrasound image input \(x\), an optimized YOLO-v8 model segments the nodule label \({l}_{n}\), and a U-Net++ model segments the thyroid label \({l}_{t}\). \({l}_{n}\) is then combined with \(x\) to derive the shape \({x}_{s}\), echogenicity \({x}_{e}\), and internal morphology \({x}_{t}\) features, and the margin feature \({x}_{m}\) is obtained through a random mosaic method. Combining \({l}_{t}\) and \({l}_{n}\) yields the merged label \({l}_{t}-{l}_{n}\), which is processed by the CPDT φ to convert distance information into image grayscale values, giving the location information \({x}_{l}\). b Overall workflow of TMDL. The input data consist of 8 features: morphological features (\({x}_{t},{x}_{m},{x}_{e},{x}_{s}\)), locational information \({x}_{l}\), radiological reports \({x}_{r}\), and demographics (sex \({x}_{{sex}}\) and age \({x}_{{age}}\)). In the two initial embedding layers, convolutional layers encode the image-type features \({x}_{l}\), \({x}_{t}\) and \({x}_{m}\) into a sequence of image patch tokens (\({{Tokens}}_{I}\)). The encoder encodes \({x}_{{sex}}\), \({x}_{{age}}\), \({x}_{e}\), \({x}_{s}\) and the unstructured data \({x}_{r}\) into \({{Tokens}}_{T}\). The two types of tokens are combined into unified tokens and then input into the bidirectional attention block. This block consists of two normalization layers (Norm), a bidirectional multimodal fusion layer, and a multi-layer perceptron (MLP), and it exchanges and computes the attention between the two types of data. The block is stacked into four layers, followed by 12 layers of self-attention blocks.

Fig. 2: Flow diagram summarizing the inclusion of patients.

This study consisted of 23,692 patients in the training set, 5923 patients in the validation set, and 9836 patients in external test sets. Poor image quality includes blurred nodular areas, image jitter, and incomplete imaging of the nodular region. Source data are provided as a Source Data file.

Results

Data description

We collected pathological diagnoses, preoperative ultrasound images, radiology reports, and demographic information from a cohort of 39,451 patients (Table 1). Notably, the median age was 43 years, with female patients outnumbering male patients by more than twofold, and approximately 91% of the cohort identified as Han ethnicity. The most represented categories in the Kwak Thyroid Imaging Reporting and Data System (Kwak-TIRADS)46 were 4B (48%) and 4C (28%). Patients with thyroid nodules smaller than 10 mm accounted for 73%, and the rate of LLNM-positive patients was 52%. The subtypes collected included 35,804 cases of papillary thyroid carcinoma (PTC), 2845 cases of follicular thyroid carcinoma (FTC), and 802 cases of medullary thyroid carcinoma (MTC). We trained the model using 80% of the 29,615 patients from two hospitals, with the remaining 20% used for model validation. The external test sets comprised 9836 patients from five multicenter sites. More detailed information can be found in Tables S1 and S2.
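The cohort sizes quoted above can be cross-checked with simple arithmetic. The sketch below uses only the figures stated in the text; the variable names are illustrative and not taken from the study's code.

```python
# Figures taken from the text; variable names are illustrative only.
internal_total = 29_615   # patients from the two training hospitals
external_total = 9_836    # patients from the five external sites

train = round(internal_total * 0.8)    # 80% used for training
validation = internal_total - train    # remaining 20% used for validation

# Reported subtype counts (PTC, FTC, MTC).
subtypes = {"PTC": 35_804, "FTC": 2_845, "MTC": 802}

cohort_total = train + validation + external_total
print(train, validation, cohort_total, sum(subtypes.values()))
```

Both the split sizes and the subtype counts add up to the full cohort of 39,451 patients.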

Table 1 Characteristics of the patients at baseline

Prediction performance of models and human experts

Figure 3a shows that LLNM-Net achieves strong predictive performance, with an area under the curve (AUC) of 0.948 (95% CI: 0.946–0.950) on the validation set. Furthermore, the AUC on the external test set is 0.944 (95% CI: 0.942–0.945), with an accuracy of 0.847 (95% CI: 0.840–0.854). In the comparative test, the accuracy of LLNM-Net (0.875) is significantly higher than that of human experts (0.643; paired t-test, t = 1.998, P = 0.0473) (Fig. 3a, b). Additionally, the high specificity and PPV demonstrated in the results can more effectively prevent missed diagnoses and enhance the screening performance for LLNM. The accuracy of the segmentation module is presented in Table S4, and the accuracy results for FTC and MTC can be found in Table S5. Comparative experimental results with the latest published AI methods are provided in Table S7.

Fig. 3: Prediction performance of model and human.

a Prediction performance of model. The table displays the AUC, specificity, sensitivity, accuracy, negative predictive value (NPV) and positive predictive value (PPV) of LLNM-Net for patients across the three datasets. Source data are provided as a Source Data file. b Prediction performance of the comparative test between radiologists and LLNM-Net. Radiologists performed well in the malignant classification test but poorly in the LLNM classification test. In contrast, LLNM-Net performed better on the same test dataset. Source data are provided as a Source Data file. c ROCs of LLNM-Net on the training set, validation set, and external test sets, as well as the predictive performance of senior and junior radiologists. The AUC results are presented as mean values, and 95% confidence intervals are derived from n = 100 experimental replicates for each task setting. In each replicate trial, real patient input data are selected via bootstrap sampling from the real dataset. We used a two-sample two-sided unadjusted Kolmogorov-Smirnov (KS) test for goodness of fit to examine the predictive distribution values of radiologists and LLNM-Net. Raincloud plots with violin and box diagrams are used to show the comparison of individual-level prediction probabilities between the radiologists (Doctor raincloud plot, mean accuracy of 108 radiologists, n = 200) and the LLNM-Net (LLNM-Net raincloud plot, n = 200, KS = 0.385, P < 1 × 10^−12). Each boxplot includes a box representing the median value and interquartile range (IQR). The whiskers extend from the box to the maximum and minimum values, with their length not exceeding 1.5 times the IQR. The red color indicates LLNM-positive samples, while the blue color represents LLNM-negative samples. Source data are provided as a Source Data file.
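The bootstrap procedure described in the caption (mean AUC with a 95% interval from n = 100 resampled replicates) can be sketched as follows. This is a minimal illustration in plain Python, not the study's code: the function names are hypothetical, and the percentile rule used for the interval is an assumption, since the text does not state how the interval was derived from the replicates.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc(labels, scores, n_rep=100, seed=0):
    """Mean AUC and an assumed percentile 95% interval over n_rep bootstrap replicates."""
    rng = random.Random(seed)
    indices = range(len(labels))
    aucs = []
    while len(aucs) < n_rep:
        sample = [rng.choice(indices) for _ in indices]  # resample with replacement
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:  # AUC is undefined without both classes
            continue
        aucs.append(auc(ys, [scores[i] for i in sample]))
    aucs.sort()
    mean = sum(aucs) / n_rep
    lower = aucs[int(0.025 * n_rep)]
    upper = aucs[min(n_rep - 1, int(0.975 * n_rep))]
    return mean, lower, upper

# Toy data for illustration only (not study data).
toy_labels = [0, 0, 0, 1, 1, 1, 0, 1]
toy_scores = [0.1, 0.3, 0.6, 0.7, 0.8, 0.4, 0.2, 0.9]
print(bootstrap_auc(toy_labels, toy_scores))
```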

The study recruited 108 imaging experts, including 42 senior radiologists (with over five years of clinical experience, 21 female and 21 male) and 66 junior radiologists (with three to five years of clinical experience, 33 female and 33 male). These physicians demonstrated high accuracy in diagnosing the malignancy of thyroid nodules (Fig. 3b), with an average accuracy of 0.883, specificity of 0.899, and sensitivity of 0.868. However, in the experiments predicting LLNM classification, both senior and junior physicians exhibited lower predictive capabilities, with an accuracy of 0.643, specificity of 0.642, and sensitivity of 0.644 (Fig. 3b, c). This indicates that the ability to predict LLNM based on traditional clinical knowledge and subjective experience is limited, whereas artificial intelligence models can learn important factors contributing to LLNM from a large number of features.
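For reference, the accuracy, specificity, sensitivity, PPV, and NPV reported for the model and the radiologists follow the standard confusion-matrix definitions. The sketch below shows those definitions; the counts in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Hypothetical counts for 100 LLNM-positive and 100 LLNM-negative cases.
example = binary_metrics(tp=64, fp=36, tn=64, fn=36)
print(example)
```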

Qualitative and quantitative analysis for predicting LLNM

Figure 4 provides qualitative and quantitative analyses for predicting LLNM in the locational dimension. Figure 4a displays the visualization results of nodule locational information. We defined the central point and calculated the minimum distance from all points within the nodule region, obtaining the location feature image through the transformation φ. The model learns iteratively and computes the attention heatmap through gradient-weighted calculations47, with the heatmap indicating the areas that the model identifies as having a greater impact on LLNM. Figure 4b presents a statistical analysis of the minimum distance from the nodule to the thyroid capsule. The results show that as the minimum distance decreases, the probability of LLNM correspondingly increases. When the minimum distance is less than 0.25 cm, the average probability of LLNM occurrence exceeds 72%.
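To make the distance measurement concrete, the sketch below computes the minimum nodule-to-capsule distance from two binary masks and maps a distance to a grayscale value. It is an illustration under stated assumptions rather than the CPDT implementation: the capsule is approximated as thyroid pixels touching the background, the linear distance-to-grayscale mapping and the `max_cm` cap are assumed, and all function names are hypothetical.

```python
import math

def capsule_pixels(thyroid):
    """Thyroid pixels that touch the background (4-neighbourhood) or the image border."""
    h, w = len(thyroid), len(thyroid[0])
    edge = []
    for r in range(h):
        for c in range(w):
            if not thyroid[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or not thyroid[rr][cc]:
                    edge.append((r, c))
                    break
    return edge

def min_capsule_distance(thyroid, nodule, spacing_cm):
    """Smallest Euclidean distance (in cm) between any nodule pixel and the capsule."""
    edge = capsule_pixels(thyroid)
    points = [(r, c) for r, row in enumerate(nodule) for c, v in enumerate(row) if v]
    return spacing_cm * min(math.dist(p, q) for p in points for q in edge)

def to_gray(distance_cm, max_cm=1.0):
    """Assumed linear mapping: 0 cm -> 255 (brightest), max_cm or more -> 0."""
    clipped = min(max(distance_cm, 0.0), max_cm)
    return round(255 * (1 - clipped / max_cm))

# Toy masks: a 5x5 thyroid block inside a 7x7 image, with a one-pixel nodule at its centre.
thyroid_mask = [[1 if 1 <= r <= 5 and 1 <= c <= 5 else 0 for c in range(7)] for r in range(7)]
nodule_mask = [[1 if (r, c) == (3, 3) else 0 for c in range(7)] for r in range(7)]
distance = min_capsule_distance(thyroid_mask, nodule_mask, spacing_cm=0.1)
print(distance, to_gray(distance))
```

In this toy case the nodule sits two pixels from the capsule, so with a 0.1 cm pixel spacing it falls under the 0.25 cm threshold discussed above.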

Fig. 4: Qualitative and quantitative assessment of nodule locational information.

a Visualization of nodule locational information. We obtained the location feature image using the CPDT method. Subsequently, we computed the model parameters to generate a heatmap that maps the key minimum distances the model focuses on, which are crucial for assessing the probability of LLNM occurring in relation to the nodule’s location. In the figure, a higher attention score indicates that the model pays more attention to the region. The color bar is labeled as “Attention score [AU]”, ranging from 0.0 (blue) to 1.0 (red). b Quantitative assessment of nodule locational information. The curve illustrates that as the minimum distance between the thyroid nodule and the thyroid capsule decreases, the probability of LLNM gradually increases. Two examples are presented: LLNM-positive cases (red contour) and LLNM-negative cases (purple contour). When the distance is less than 0.25 cm, the average probability of LLNM exceeds 72%. c Qualitative assessment of nodule locational information. Risk heatmaps illustrate the likelihood of LLNM occurrence when thyroid nodules are located in different regions. The depth of color represents the locational metastasis risk: a risk value of 1.00 corresponds to the maximum risk of metastasis in the locational dimension, and a risk value of 0.00 corresponds to the minimum. The color bar is labeled as “Locational metastasis risk [AU]”, with values from 0.0 (white) to 1.0 (red). d Example of a risk heatmap. Given the nodule’s location, the probability value for LLNM occurrence can be automatically generated.

Figure 4c illustrates the three-dimensional risk heatmap model for LLNM occurrence in thyroid regions, statistically analyzed based on the diffeomorphic affine transformation method48,49. This model is publicly available at: https://snowinbio.github.io/LLNM-Net/. The results indicate that the upper central region of the left lobe of the thyroid, as well as the medial and lateral regions of the upper part of the right lobe, are key areas of concern for LLNM occurrence, likely closely related to lymphatic metastasis pathways. Figure 4d showcases an example application of the heatmap model. When provided with the growth location of a thyroid nodule, the model can automatically generate the probability value for LLNM occurrence at that location.

In Fig. 5, we present an example of LLNM-Net. The model takes both imaging data and user information as inputs and automatically outputs a multi-feature score. It also provides the contribution ratio of each feature, as illustrated by the Sankey diagram on the right. This multi-feature metastatic score helps physicians understand how the model predicts metastasis. To demonstrate that our interpretation aligns with clinical knowledge, we compared it with TIRADS. It can be concluded that there is a correlation between tumor malignancy and LLNM, as shown in Figs. S1, S2.

Fig. 5: An illustrative example of LLNM-Net results.

The results output from LLNM-Net. On the left, the prediction scores for each feature response are given, while the Sankey diagram on the right shows the contribution ratio to metastasis. The prediction scores are integrated according to their contribution to finally obtain the metastasis prediction probability. In this example, the final metastasis score is 0.833, which is LLNM positive, consistent with the actual outcome. The color bar is labeled as “Metastasis score [AU]”, with values from 0.0 (white) to 1.0 (red). Source data are provided as a Source Data file.

Figure 6a shows the contribution ratio of all features to LLNM, indicating that the growth location of the thyroid nodule is the most significant factor, accounting for 48.8%. This is followed by morphological features, which account for 29.9%. Among the morphological features, shape and internal morphology contribute 7.5% and 64.1%, respectively, suggesting that clinical attention should focus on nodules with abnormal shapes or significant enlargement. Text information from clinical reports contributes 19.7% overall. Gender is an important factor in demographics (accounting for 56.7% in demographics). Figure 6b displays the attention heatmaps for LLNM-negative and LLNM-positive cases, showing that the network focuses on the edges and various internal texture features. Figure S3 illustrates examples of the association between imaging reports and internal morphological features.

Fig. 6: Attention analysis and decision curve analysis.

a Contribution analysis of features. The location information of thyroid nodules is the most significant factor influencing LLNM (accounting for 48.8%). Clinical findings accumulated by physicians, hidden in the imaging report, account for 19.7%. The internal morphology feature accounts for 64.1% in morphology, while gender accounts for 56.7% in demographics. Source data are provided as a Source Data file. b Visualization of the attention for internal morphology. The columns display the ultrasound image, the corresponding feature image, and the attention heatmap in LLNM-Net. The hotspots in the heatmap indicate the areas where the model focuses; a higher attention score indicates that the model pays more attention to the area. All nodules shown in the figure are malignant. In the morphological feature panel, the model focuses more on the overall morphological characteristics of the nodules. This suggests that the internal texture features of the nodules contribute more significantly to the model’s prediction of LLNM. The color bar is labeled as “Attention score [AU]”, with values from 0.0 (blue) to 1.0 (red). c Decision curve analysis. The standardized net benefit (the benefit of effective LLND minus the cost of LLND) of the model-based strategy, compared with the strategies of performing LLND for “All patients” and “None patients”. Source data are provided as a Source Data file. d Clinical impact of the model-based strategy. Red line: the proportion of patients who undergo LLN FNA under the model-based strategy. Blue line: the proportion of false negative LLNM cases. Source data are provided as a Source Data file.

The decision curve analysis (Fig. 6c) illustrates the clinical benefit of each strategy (the benefit of effective LLND minus the cost of LLND). Under different cost-benefit ratio settings, our model-based strategy outperforms the naïve strategies of performing lateral lymph node (LLN) fine-needle aspiration (FNA) for all or no patients. Figure 6d shows the clinical impact of our model-based strategy within our cohort. At a threshold of 0.5, the strategy improves the identification of 47.4% of patients who should undergo LLN FNA, with only 5.3% of LLNM-negative patients undergoing unnecessary LLN FNA. Additionally, we conducted a reverse cognitive test to evaluate the interpretability of our model. The results indicate that clinicians’ understanding of the model outcomes improved by 25.4% compared to general AI (Fig. S4).
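Decision curve analysis is conventionally based on the net benefit at a threshold probability p_t, defined as TP/n − (FP/n) × p_t/(1 − p_t), with the "treat all" strategy as a reference and "treat none" fixed at zero. The sketch below implements that conventional formula; whether the study used exactly this form (or a standardized variant, which divides by prevalence) is an assumption, and the toy data are illustrative only.

```python
def net_benefit(labels, scores, threshold):
    """Net benefit of intervening on every patient whose score is at or above the threshold."""
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

def treat_all_net_benefit(labels, threshold):
    """Net benefit of intervening on every patient regardless of score."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Toy data for illustration only (not study data).
dca_labels = [1, 1, 0, 0]
dca_scores = [0.9, 0.6, 0.4, 0.1]
nb_model = net_benefit(dca_labels, dca_scores, threshold=0.5)
nb_all = treat_all_net_benefit(dca_labels, threshold=0.5)
print(nb_model, nb_all)  # the "treat none" strategy has net benefit 0 by definition
```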

Predicting high-risk lymph node metastasis patients

The model is capable of predicting the stage of lateral lymph node metastasis. Based on clinical guidelines3,50,51,52, we classified lymph node metastasis into three stages according to the number and size of metastatic nodes:

  • Stage 1: Low risk. Five or fewer micro-metastases (<0.2 cm in largest dimension).

  • Stage 2: Medium risk. More than five metastatic lymph nodes, with all metastatic lymph nodes <3 cm in largest dimension.

  • Stage 3: High risk. Any metastatic lymph node >3 cm in largest dimension.
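The three stage definitions can be expressed as a small rule-based function. This is a sketch of the rules as written above, with a hypothetical function name; combinations the list does not cover (for example, five or fewer nodes that are 0.2 cm or larger but not above 3 cm) are returned as `None` rather than guessed.

```python
def llnm_stage(node_sizes_cm):
    """Map the largest dimensions (cm) of metastatic lymph nodes to a risk stage.

    Returns 1 (low), 2 (medium), 3 (high), or None when there are no metastatic
    nodes or the combination is not covered by the three rules as written.
    """
    sizes = list(node_sizes_cm)
    if not sizes:
        return None
    if any(s > 3.0 for s in sizes):                    # Stage 3: any node > 3 cm
        return 3
    if len(sizes) > 5 and all(s < 3.0 for s in sizes):  # Stage 2: > 5 nodes, all < 3 cm
        return 2
    if len(sizes) <= 5 and all(s < 0.2 for s in sizes):  # Stage 1: <= 5 micro-metastases
        return 1
    return None

print(llnm_stage([0.1, 0.15]), llnm_stage([0.5] * 6), llnm_stage([3.5]), llnm_stage([1.0]))
```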

Table 2 demonstrates that the model exhibits good predictive performance, achieving an average AUC of 0.971 in external test sets. This suggests that LLNM-Net can precisely identify individuals at medium to high risk, offering physicians reference advice for FNA testing and ensuring regular follow-ups for low-risk individuals, thereby optimizing the allocation of medical resources.

Table 2 Performance of multi-level risk of lymph node metastasis

Application of LLNM-Net in clinical practice

Figure 7 illustrates the traditional clinical guidelines3 and guidelines with LLNM-Net. In Fig. 7a, when patients undergo thyroid imaging, physicians assess the malignancy of nodules based on subjective experience and decide whether FNA is necessary. If needed, a cervical lymph node ultrasound examination is considered to evaluate the likelihood of LLNM and consider whether to perform FNA for the lateral lymph nodes. Then the decision to proceed with thyroidectomy and LLND is made based on the FNA results. However, whether the physician conducts a cervical ultrasound examination depends on subjective experience, leading to potential missed diagnoses. Incomplete coverage of the detection area during the ultrasound examination can also result in missed diagnoses, affecting the accuracy of FNA results for lymph nodes and influencing the decision to perform LLND.

Fig. 7: Clinical application of LLNM-Net.

a Traditional clinical guidelines for LLNM. The process primarily involves the following steps: Doctors assess the malignancy of thyroid nodules via ultrasound and clinical information; Determine whether to perform FNA for nodules; Cervical lymph node ultrasound examinations for LLNM; Determine whether to perform FNA for LLN; FNA for lymph nodes and LLN, among other procedures to assess the necessity of thyroidectomy and LLND. b Clinical guidelines for LLNM with LLNM-Net. LLNM-Net assists in preoperative risk prediction for LLNM, guiding the detection of cervical lymph nodes in low-risk patients and recommending FNA for lymph nodes for medium and high risk patients.

In Fig. 7b, the integration of LLNM-Net aids in preoperative LLNM risk prediction, helping doctors determine whether to conduct cervical ultrasound examinations, thereby reducing missed diagnoses. During cervical ultrasound examinations, LLNM-Net highlights patients at risk of LLNM, decreasing the rate of missed diagnoses. It also recommends FNA for lymph nodes for medium and high risk LLNM patients, reducing missed diagnoses caused by incomplete ultrasound examination areas or physician judgment errors. In summary, LLNM-Net helps improve the diagnostic and treatment process for LLNM.

In this diagnostic study, we developed an interpretable multimodal deep learning model that can be implemented as an AI support system for LLNM risk assessment based on thyroid ultrasound images. This model provides qualitative and quantitative clinical explanations for predictions based on the fusion transformer method. This study addresses the lack of effective methods for preoperative diagnosis of LLNM, providing clinical insights while accurately screening high-risk populations, and significantly improving patient survival rates and societal welfare.

Discussion

We developed LLNM-Net, achieving precise preoperative prediction of LLNM and demonstrating strong generalizability across five external centers, with predictive performance reaching an AUC of 0.944 (95% CI: 0.942–0.945). Currently, ultrasound has a low sensitivity of 0.623 for detecting lateral neck regions. Our model’s AUC improved by 7.4% compared to the best existing model, and its accuracy was 20.4% higher than that of human physicians. In high-risk population screening, the AUC reached 0.971. Furthermore, compared to general AI scores, our model improved clinicians’ understanding of the results by 25.4%. This significantly addresses the gap in LLNM research, enhances preoperative predictions for more precise treatments, and guides physicians in early patient stratification for closer monitoring and treatment, thereby improving patient quality of life and survival rates. This is crucial for enhancing fairness in AI-based clinical diagnosis.

Our study provides guidance for clinical detection area research: nodules located in the upper middle region of the left lobe and the upper middle region of the right lobe are high-risk areas for LLNM, suggesting that clinicians should focus on these regions. This may be closely related to lymphatic metastasis pathways. Statistics show that when the minimum distance between the nodule and the capsule is less than 0.25 cm, the average probability of LLNM exceeds 72%, indicating a correlation between the minimum distance, growth region, and LLNM. We quantified the risk areas and feature contributions for each patient.

We addressed the challenge of small foregrounds and high intra-class variance in backgrounds in ultrasound images by using an attention-based foreground optimization segmentation network for precise nodule segmentation. To tackle the multi-scale issue in ultrasound images, we proposed a CPDT method to accurately extract the precise location information of thyroid nodules. To efficiently integrate ultrasound images and clinical information, we designed a multimodal deep learning approach based on a bidirectional attention exchange mechanism, extracting local interconnected information between report text and imaging features and uncovering clinicians’ latent findings. This end-to-end multimodal feature extraction method can be widely applied for efficient tumor detection, growth location analysis, and nature assessment, with the potential to solve most imaging diagnostic tasks in various diseases.

In traditional diagnostic processes, physicians rely on subjective experience to decide whether to perform cervical ultrasound examinations, leading to potential missed diagnoses. During the cervical ultrasound examination, incomplete coverage of the detection area may also result in missed diagnoses, affecting the accuracy of FNA results for lymph nodes and influencing the decision for LLND. Using LLNM-Net can optimize this process by helping reduce the missed detection rate of cervical ultrasounds and minimizing missed diagnoses caused by incomplete examination areas, thus improving clinical guidelines.

There are still some limitations in our study. First, the number of collected cases and disease types is limited. In the future, we plan to collect more extensive data from a broader population, including more subtypes, countries, and regions, to enhance LLNM-Net’s generalizability and applicability. Second, the actual clinical benefits of our model have not yet been validated. We plan to design prospective experiments for validation and explore the model’s real-world effectiveness across different institutions and regions.

Methods

Ethics approval

All clinical data, including demographics, operative procedures, pathology, and complications, were retrospectively collected. This study was approved by the local Ethics Committee and the Institutional Review Board (IRB) of Ruijin Hospital, Shanghai Jiao Tong University School of Medicine Hospital, and undertaken according to the Declaration of Helsinki. Informed consent from patients with thyroid cancer and controls was exempted by the IRB because of the retrospective nature of this study.

Data collection

We conducted a retrospective analysis by gathering preoperative thyroid ultrasound images, radiological reports, and clinical information from patients undergoing thyroidectomy. The inclusion criteria were as follows: (a) age of 18 years or older, (b) a thyroid ultrasound examination with clear ultrasound images available, (c) a diagnosis of malignant thyroid nodule following thyroidectomy, (d) central lymph node dissection with a total of at least 5 lymph nodes removed, and (e) a pathologic assessment of FNA for LLNM. Exclusion criteria were as follows: missing pathological reports, non-thyroid surgery, and preoperative treatment. To maintain a high standard of image quality, we implemented rigorous control measures, which involved excluding cases with poor image quality, single images containing multifocal lesions, and images with measuring lines. Concurrently, as part of our data collection, we gathered extensive patient demographic information, radiological reports, Kwak Thyroid Imaging Reporting and Data System (Kwak-TIRADS) grade46, postoperative pathology results, and details regarding LLNM. Each patient’s data includes two clear ultrasound images from different orientations, a complete ultrasound report, and clinical information. Sex information was collected through self-reporting. Because the primary objective of this study is to predict lateral lymph node metastasis in thyroid cancer and no differential results were found for sex characteristics, no further differentiation by sex was made.

We collated patient data from seven hospitals to form training and test cohorts spanning from January 2015 to May 2021. Figure 2 delineates the process of patient inclusion and exclusion. The training cohort from two hospitals was subdivided into a training set and a validation set, while the test cohort from five additional hospitals was designated as external test sets. This strategy was intended to make our study population a representative sample of diverse individuals across various geographical and ethnic strata in China. For patient privacy protection, the researchers were granted access solely to anonymized data.

For the classification of patients as LLNM positive, we selected individuals in whom at least one positive lymph node was identified among those excised during surgery. In terms of meticulous quality control for data annotation, we implemented a two-step process:

  (1) Differentiation of malignant nodules: All malignant nodules were diagnosed based on pathological reports. Independent ultrasound physicians with over 5 years of experience were assigned to reassess the images. In cases where discrepancies between their evaluation and the original report were identified, we sought expert judgment to resolve the differences.

  (2) Pathological annotation of nodules with LLNM: In managing patients with multiple nodules, determining which nodule metastasized to the lymph nodes posed a challenge during the annotation process. To address this complexity, three ultrasound radiologists were engaged to meticulously compare the ultrasound images with the corresponding pathological reports for each patient. Their objective was to select the images of nodules most likely to have metastasized, taking into account factors such as nodule location and degree of malignancy.

Data quality control principles

We obtained preoperative thyroid ultrasound images from seven hospitals. To analyze the ultrasound images, we first removed all patient, institution, and device information from the images. Then we trained an image cropping model to crop images from different institutions and devices, applying a standardized brightness range to achieve uniform images. To preserve the morphology of the nodules (particularly the aspect ratio), we did not use any scaling methods throughout the process. There are two types of clinical text data: unstructured imaging reports (containing the physician’s expertise) and structured demographic data (age and gender). We set the maximum length for imaging report data to 50 characters: if the report length exceeded 50 characters, we used only the first 50 characters; otherwise, we applied zero-padding to meet the length requirement.
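The fixed-length rule for imaging reports can be illustrated with a minimal Python sketch. It assumes that each character of a report has already been mapped to an integer id and that id 0 is reserved for padding; the function name `fix_length` and both of these conventions are illustrative assumptions, not details taken from the study code.

```python
def fix_length(token_ids, max_len=50, pad_id=0):
    """Truncate to max_len entries, or right-pad with pad_id up to max_len."""
    clipped = list(token_ids)[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


# A short report is zero-padded on the right.
short = fix_length([5, 6, 7], max_len=5)
# A long report keeps only its first 50 entries.
long_report = fix_length(range(1, 61))
```

Padding on the right keeps the leading part of every report aligned across patients, which matches the rule of retaining only the first 50 characters.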

LLNM-Net architecture

To effectively predict LLNM preoperatively, we have developed the LLNM-Net. This model combines segmentation, distance transformation, and intra-model attention exchange modules to achieve an integrated analysis of the tumor’s morphological and locational information. It also incorporates demographic information and clinical reports to provide a comprehensive prediction of LLNM. Additionally, it performs qualitative and quantitative analysis of the metastasis mechanism through attention-based feature analysis (Fig. 1).

Figure 1a shows the process of extracting independent features within the model. Ultrasound images of thyroid nodules present challenges such as small detection targets (foreground) and high intra-class variance in the background. We employed a foreground optimization segmentation network39,40 based on an attention mechanism to achieve precise segmentation of thyroid nodules, simultaneously extracting morphological features such as internal morphology, edges, echogenicity differences, and shape. To address the issue of multi-scale input in ultrasound images, we proposed a CPDT method41 to accurately extract the precise location information of thyroid nodules.

The ultrasound report contains verbal descriptions by medical experts regarding nodule characteristics, such as “normal size and volume,” “heterogeneous echogenicity,” and “diffuse changes”53,54,55. We designed a TMDL transformer42 based on a bidirectional attention exchange mechanism43,44,45 to efficiently integrate imaging features, report text information, and patient demographic data (Fig. 1b). The TMDL consists of two embedding layers, four bidirectional attention blocks, and twelve self-attention blocks. The embedding layers convert inputs into image and text tokens, which are then processed through the bidirectional attention blocks. In these blocks, attention exchange is used to compute intermodal attention among tokens across different modalities, uncovering potential local interconnections between report text and imaging features, which provides advantages over non-integrated models. The computed multimodal representations are then fed into the twelve self-attention blocks for efficient learning.

We conducted qualitative and quantitative analyses of key factors related to LLNM. Using a flexible diffeomorphic registration method48, we created a risk heatmap from a three-dimensional perspective showing the likelihood of LLNM occurrence in different thyroid regions. Additionally, we used attention-based gradient-weighted calculations47 to analyze the relationship between various features and the prediction outcomes.

Data feature extraction process

For the input image data \(x\), we used an optimized YOLO-v8 model39,40 for segmentation, obtaining the nodule label \({l}_{n}\), and a U-Net++ network56,57 to obtain the thyroid label \({l}_{t}\). Based on the label \({l}_{n}\) and image \(x\), we applied a cropping operation to extract the texture feature \({x}_{t}\) and calculated the nodule length and width to obtain the shape58 feature \({x}_{s}\). We derived the echogenicity feature \({x}_{e}\) by calculating the difference in the mean echo values inside and outside the nodule boundary30,59. Using a random mosaic method60, we minimized the influence of nodule morphology on edge blurring, independently extracting the edge feature \({x}_{m}\). Finally, we obtained the merged label \({l}_{t}-{l}_{n}\).

We designed the CPDT method to convert positional features. For a point \({l}_{i}\) within the nodule region, where \({l}_{i}\in {l}_{n}\), and the point \({l}_{j}\) within the thyroid region, where \({l}_{j}\in {l}_{t}\), the following equation applies:

$${d}_{i}\left({l}_{i},{l}_{t}\right)={\min }_{{l}_{j}\in {l}_{t}}\left(\sqrt{{\left({p}_{{l}_{j}}-{p}_{{l}_{i}}\right)}^{2}+{\left({q}_{{l}_{j}}-{q}_{{l}_{i}}\right)}^{2}}\right)$$
(1)

Where \(p\) and \(q\) represent the horizontal and vertical coordinates of a point in the image. \({d}_{i}\) represents the minimum distance from point \({l}_{i}\) to the thyroid capsule.

We have defined the central point \(C\), which is the point within the thyroid region that has the maximum distance to the thyroid capsule. The maximum distance, \({d}_{\max }\), from any point inside the thyroid to the capsule is defined as follows:

$${d}_{\max }\left({l}_{t}\right)={\max }_{{l}_{i}\in {l}_{t}}\,{d}_{i}\left({l}_{i},{l}_{t}\right)$$
(2)

When evaluating the risk of metastasis based solely on distance metrics, the point \(C\) has the minimum risk of distance-related metastasis. Meanwhile, points on and beyond the thyroid capsule can be considered to have the maximum risk of distance-related metastasis. Therefore, we designed the distance transformation \(\varphi\) to represent the risk of distance-related metastasis for a given point:

$$\varphi \left({l}_{t},{l}_{i}\right)=1-\frac{{d}_{i}\left({l}_{i},{l}_{t}\right)}{{d}_{\max }\left({l}_{t}\right)}$$
(3)

Through \(\varphi\), the grayscale value of each point in \({l}_{n}\) is converted to a value derived from its minimum distance to the thyroid capsule, thereby representing the relative locational information of the nodule and thyroid. Ultimately, we obtained the locational information \({x}_{l}\). By calculating the distance ratio between the nodule region and point \(C\), we can extract information about the different positions of the nodule within the thyroid region. For instance, the resulting \({x}_{l}\) differs depending on whether the nodule lies to the left or to the right of point \(C\), even if the distance to the edge of the thyroid is the same.
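The distance transformation in Eqs. (1)–(3) can be illustrated with a small brute-force Python sketch on binary masks. It assumes that the thyroid capsule can be approximated by the mask pixels that touch the background, and that the thyroid mask contains at least one interior pixel so that \({d}_{\max }>0\); the helper names `capsule_pixels`, `distance_to_capsule` and `cpdt` are hypothetical and do not come from the study code.

```python
import math

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def capsule_pixels(thyroid):
    """Thyroid pixels with a 4-neighbour outside the mask (assumed capsule)."""
    h, w = len(thyroid), len(thyroid[0])
    border = []
    for r in range(h):
        for c in range(w):
            if not thyroid[r][c]:
                continue
            for dr, dc in NEIGHBOURS:
                rr, cc = r + dr, c + dc
                outside = rr < 0 or rr >= h or cc < 0 or cc >= w
                if outside or not thyroid[rr][cc]:
                    border.append((r, c))
                    break
    return border


def distance_to_capsule(point, capsule):
    """Eq. (1): minimum Euclidean distance from a point to the capsule."""
    return min(math.hypot(point[0] - r, point[1] - c) for r, c in capsule)


def cpdt(thyroid, nodule):
    """Eq. (3): 1 - d_i / d_max for every nodule pixel, 0 elsewhere."""
    capsule = capsule_pixels(thyroid)
    h, w = len(thyroid), len(thyroid[0])
    # Eq. (2): largest capsule distance over all thyroid pixels (reached at C).
    d_max = max(distance_to_capsule((r, c), capsule)
                for r in range(h) for c in range(w) if thyroid[r][c])
    x_l = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if nodule[r][c]:
                x_l[r][c] = 1.0 - distance_to_capsule((r, c), capsule) / d_max
    return x_l


# Toy example: a 5x5 thyroid inside a 7x7 image, three single-pixel "nodules".
thyroid = [[1 if 1 <= r <= 5 and 1 <= c <= 5 else 0 for c in range(7)]
           for r in range(7)]
nodule = [[0] * 7 for _ in range(7)]
for r, c in ((3, 3), (2, 3), (1, 1)):
    nodule[r][c] = 1
x_l = cpdt(thyroid, nodule)
```

In this toy case the pixel at the centre of the thyroid receives the lowest value and the pixel on the capsule the highest, in line with the risk ordering described above. The sketch covers only the capsule-distance part of the method and does not encode on which side of \(C\) a nodule lies.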

TMDL module

In practice, we pass multimodal input data (i.e., medical images and clinical text information) to the TMDL module to compute prediction logits, where binary cross-entropy is chosen as the loss function. TMDL is a unified Transformer module. Its structure mainly includes: two initial embedding layers that embed tokens from input images and text respectively; four stacked bidirectional multimodal attention blocks that learn intermediate representations of fused features by capturing interactions between tokens from the same modality and different modalities; 12 stacked self-attention blocks that learn the overall multimodal representation and enhance its discriminative power, and a classification head for generating prediction logits.

In TMDL, the multimodal input data consists of eight components: image data includes location \({x}_{l}\), texture \({x}_{t}\) and margin \({x}_{m}\), as well as imaging reports \({x}_{r}\), echogenicity \({x}_{e}\), shape \({x}_{s}\) and each patient’s gender \({x}_{{sex}}\) and age \({x}_{{age}}\). We combine \({x}_{l}\), \({x}_{t}\) and \({x}_{m}\) and pass them through a convolutional layer, which generates a series of visual tokens. Next, we add standard learnable 1D positional embeddings61,62 and dropout to each visual token, resulting in a series of image patch tokens \({{Tokens}}_{{Image}}\left(3n\right)\), where \(n\) is the length of a single image patch. At the same time, we use a tokenization encoder to encode each word in \({x}_{r}\). Specifically, we use a pre-trained BERT model62 to generate embedding feature vectors for each word in \({x}_{r}\), producing a series of word tokens \({{Tokens}}_{{Text}}\left(m\right)\), where \(m\) is the maximum length set for the text. We linearly project \({x}_{{sex}}\), \({x}_{{age}}\), \({x}_{e}\) and \({x}_{s}\) to obtain encoded feature vectors \({{Tokens}}_{{Sex}}\), \({{Tokens}}_{{Age}}\), \({{Tokens}}_{e}\) and \({{Tokens}}_{s}\). We then concatenate \(\{{{Tokens}}_{{Text}}\left(m\right),\,{{Tokens}}_{{Sex}},\,{{Tokens}}_{{Age}},\,{{Tokens}}_{e},\,{{Tokens}}_{s}\}\) to generate a series of clinical text tokens \({{Tokens}}_{T}(m+4)\). In practice, we set \(m\) to 50.

The combined tokens are fed into four stacked bidirectional multimodal attention blocks. Assume that the input to the first bidirectional multimodal attention block consists of \({{Tokens}}_{I}^{l}\) and \({{Tokens}}_{T}^{l}\), where \(l\,\left(=0\right)\) denotes the layer index, \({{Tokens}}_{I}^{0}={{Tokens}}_{{Image}}\left(3n\right)\) represents the set of image patch tokens, and \({{Tokens}}_{T}^{0}={{Tokens}}_{T}(m+4)\) represents the set of clinical text tokens. In the bidirectional multimodal attention block, the process of generating the query, key, and value matrices for each modality is as follows:

$${Q}_{I}^{l},{K}_{I}^{l},{V}_{I}^{l}={LP}\left({Norm}\left({{Tokens}}_{I}^{l}\right)\right)$$
(4)
$${Q}_{T}^{l},{K}_{T}^{l},{V}_{T}^{l}={LP}\left({Norm}\left({{Tokens}}_{T}^{l}\right)\right)$$
(5)

Where \({LP}\left(\cdot \right)\) and \({Norm}\left(\cdot \right)\) represent linear projection and layer normalization, respectively. The forward pass within the bidirectional multimodal attention block can be summarized as follows:

$${{{{\mathcal{T}}}}}_{I}^{l}={Attention}\left({Q}_{I}^{l},{K}_{I}^{l},{V}_{I}^{l}\right)+\alpha {Attention}\left({Q}_{I}^{l},{K}_{T}^{l},{V}_{T}^{l}\right)$$
(6)
$${{{{\mathcal{T}}}}}_{T}^{l}={Attention}\left({Q}_{T}^{l},{K}_{T}^{l},{V}_{T}^{l}\right)+\alpha {Attention}\left({Q}_{T}^{l},{K}_{I}^{l},{V}_{I}^{l}\right)$$
(7)

Among them, \({Attention}({Q}_{I}^{l},{K}_{I}^{l},{V}_{I}^{l})\) and \({Attention}({Q}_{T}^{l},{K}_{T}^{l},{V}_{T}^{l})\) capture intra-modal connections within the image and text modalities, respectively. \({Attention}({Q}_{I}^{l},{K}_{T}^{l},{V}_{T}^{l})\) and \({Attention}({Q}_{T}^{l},{K}_{I}^{l},{V}_{I}^{l})\) explore inter-modal connections between the image and text. Next, the intra-modal and inter-modal connections are encoded into latent representations \({{{{\mathcal{T}}}}}_{I}^{l}\) and \({{{{\mathcal{T}}}}}_{T}^{l}\). After some preliminary experiments, we set \(\alpha\) to 1.0. \({Attention}\left(Q,K,V\right)\) consists of two matrix multiplications and a scaled \({softmax}\) operation:

$${Attention}\left(Q,K,V\right)={softmax}\left(\frac{Q{K}^{\intercal}}{\sqrt{{d}_{k}}}\right)V$$
(8)

Here, \({{{\rm{T}}}}\) denotes the matrix transpose operator, and \({d}_{k}\) is a scaling hyperparameter, which we set to 64. We then introduce residual learning and pass the resulting \({{{{\mathcal{T}}}}}_{I}^{l}\), \({{{{\mathcal{T}}}}}_{T}^{l}\) to the next normalization layer and MLP:

$${{Tokens}}_{I}^{l+1}={MLP}\left({Norm}\left({{{{\mathcal{T}}}}}_{I}^{l}\right)\right)+{{Tokens}}_{I}^{l}$$
(9)
$${{Tokens}}_{T}^{l+1}={MLP}\left({Norm}\left({{{{\mathcal{T}}}}}_{T}^{l}\right)\right)+{{Tokens}}_{T}^{l}$$
(10)

\({{Tokens}}_{I}^{l+1}\) and \({{Tokens}}_{T}^{l+1}\) are passed as inputs to the next bidirectional multimodal attention block, producing \({{Tokens}}_{I}^{l+2}\) and \({{Tokens}}_{T}^{l+2}\). This operation is repeated until the fourth layer, generating \({{Tokens}}_{I}^{l+4}\) and \({{Tokens}}_{T}^{l+4}\). We then concatenate \({{Tokens}}_{I}^{l+4}\) and \({{Tokens}}_{T}^{l+4}\) to form a unified sequence of tokens, which is passed to the subsequent self-attention blocks. We use 12 attention heads42 in both the bidirectional multimodal attention blocks and the self-attention blocks. This multi-head mechanism allows the model to perform attention operations simultaneously across multiple representation subspaces and subsequently aggregate the results.
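The attention exchange in Eqs. (6)–(8) can be sketched for a single head in plain Python. This is a minimal illustration under several assumptions: the query, key and value matrices are supplied directly (the linear projections, layer normalization, MLP and residual connections of Eqs. (4), (5), (9) and (10) are omitted), \({d}_{k}\) is taken from the width of the query matrix rather than fixed at 64, and the function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def attention(q, k, v):
    """Eq. (8): softmax(Q K^T / sqrt(d_k)) V, with d_k = query width."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def attention_exchange(q_i, k_i, v_i, q_t, k_t, v_t, alpha=1.0):
    """Eqs. (6)-(7): intra-modal attention plus alpha * inter-modal attention."""
    def add(a, b):
        return [[x + alpha * y for x, y in zip(ra, rb)]
                for ra, rb in zip(a, b)]
    t_img = add(attention(q_i, k_i, v_i), attention(q_i, k_t, v_t))
    t_txt = add(attention(q_t, k_t, v_t), attention(q_t, k_i, v_i))
    return t_img, t_txt


# Toy input: 3 image tokens and 2 text tokens, each of width 2.
q_img = k_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_img = [[1.0, 2.0]] * 3
q_txt = k_txt = [[0.5, 0.5], [1.0, 0.0]]
v_txt = [[3.0, 4.0]] * 2
t_img, t_txt = attention_exchange(q_img, k_img, v_img, q_txt, k_txt, v_txt)
```

Each modality keeps its own number of tokens after the exchange, because the queries always come from the modality being updated while keys and values may come from the other one.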

Finally, we apply average pooling to the unified tokens generated from the last self-attention block to obtain the overall multimodal representation used for predicting LLNM. This representation is passed through a two-layer MLP to produce the final prediction logits. During the training phase, we compute the binary cross-entropy \({loss}\) between these logits and the lymph node metastasis labels, as given by the following formula:

$${loss}=-\frac{1}{N}{\sum }_{i=1}^{N}\left[{Y}_{i}\cdot \log \left(P\left({Y}_{i}\right)\right)+\left(1-{Y}_{i}\right)\cdot \log \left(1-P\left({Y}_{i}\right)\right)\right]$$
(11)

Here, \(N\) represents the number of samples in the training set, \({Y}_{i}\) denotes the label of a sample, and \(P\left({Y}_{i}\right)\) is the probability value predicted by the LLNM-Net output. A patient has two sets of imaging data from different directions but shares the same clinical information. Each set of patient data results in a loss value calculation, so there are two loss values per patient. We apply average pooling to these values, taking the mean, and then pass it to the two-layer MLP and \({loss}\) function.
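As a minimal illustration of Eq. (11), the sketch below computes the binary cross-entropy in plain Python and then averages the two per-view loss values of one patient. The clipping constant `eps` and the function names `bce` and `patient_loss` are assumptions introduced for the example; how the averaging is wired into the actual training loop is not specified here.

```python
import math


def bce(labels, probs, eps=1e-12):
    """Eq. (11): mean binary cross-entropy; probs are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


def patient_loss(label, prob_view_1, prob_view_2):
    """Mean of the two per-view losses for one patient (same label for both)."""
    return 0.5 * (bce([label], [prob_view_1]) + bce([label], [prob_view_2]))


uninformative = bce([1, 0], [0.5, 0.5])      # equals ln(2)
confident = bce([1, 0], [0.9, 0.1])          # lower loss for better predictions
two_views = patient_loss(1, 0.9, 0.6)
```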

Model interpretation method

We used a standard attention analysis method for feature analysis. For each layer in LLNM-Net, we computed the average attention weights across multiple heads. Considering the residual connections, we added an identity matrix to each attention matrix and normalized the resulting weight matrix. Next, we recursively multiplied the weight matrices from different layers of LLNM-Net. Finally, we obtained an attention map that includes the similarity between each input token and the CLS token. Since the CLS token is used for diagnostic prediction, these similarities indicate the correlation between the input tokens and the prediction outcome, which can then be used for visualization. We used Grad-CAM++47 to visualize the model parameters.
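The head averaging, identity addition, normalization and recursive multiplication described above correspond to what is commonly called attention rollout. A minimal plain-Python sketch follows, assuming every layer provides square attention matrices over the same token sequence; the function names and the toy matrices are illustrative only.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def average_heads(heads):
    """Element-wise mean of the per-head attention matrices of one layer."""
    n = len(heads[0])
    return [[sum(h[i][j] for h in heads) / len(heads) for j in range(n)]
            for i in range(n)]


def rollout(layer_heads):
    """Multiply the (A + I), row-normalised, matrices from the first layer upwards."""
    n = len(layer_heads[0][0])
    joint = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for heads in layer_heads:
        att = average_heads(heads)
        # Add the identity to account for the residual connection.
        att = [[att[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
        # Re-normalise so that every row sums to one again.
        att = [[v / sum(row) for v in row] for row in att]
        joint = matmul(att, joint)
    return joint


# Toy model: 2 layers, 2 heads per layer, 3 tokens.
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
joint = rollout([[uniform, eye], [uniform, eye]])
```

Row `i` of the result estimates how strongly output token `i` depends on each input token; reading the row of the token used for prediction gives the per-input relevance scores used for visualization.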

To provide a qualitative interpretation of position, we employed a symmetric diffeomorphism-based algorithm48 for registration. Thyroid images are not spatially aligned because of individual variability and differences in image acquisition, so we needed to map all thyroid data onto a unified standardized template. We denote the registration process by \(R\). It operates on an atlas feature matrix \({x}_{l}\) and a target feature matrix \({x}_{l}^{R}\), both expressed as functions \({x}_{l},{x}_{l}^{R}:\Omega \to R\). The algorithm posits that a diffeomorphism \({\mathfrak{d}}\) defined on the domain \(\Omega\) of the feature matrices connects them so that \({x}_{l}^{R}={x}_{l}\cdot {{\mathfrak{d}}}^{-1}\). The endpoint \({\mathfrak{d}}={{\mathfrak{D}}}_{1}\) of the curve \({{\mathfrak{D}}}_{t},\,t\in \left[0,1\right]\) satisfies the ordinary differential equation (ODE):

$$\frac{d{{\mathfrak{D}}}_{t}}{{dt}}={v}_{t}\left({{\mathfrak{D}}}_{t}\right),\,{{\mathfrak{D}}}_{0}={\mathrm{Id}},\,t\in \left[0,1\right]$$
(12)

In this context, \({{\mathfrak{D}}}_{0}={\mathrm{Id}}\) represents the identity transformation, while \({v}_{t}\) is the time-varying, smooth velocity field, defined as \({v}_{t}:\Omega \to R,\,t\in \left[0,1\right]\). The diffeomorphism is computed as \({\mathfrak{d}}={{\mathfrak{D}}}_{1}={\int }_{0}^{1}{v}_{t}\left({{\mathfrak{D}}}_{t}\right){dt}\) with \({{\mathfrak{D}}}_{0}={\mathrm{Id}}\). We determine the optimal \({v}_{t}\) by solving the standard Large Deformation Diffeomorphic Metric Mapping (LDDMM)63 equation:

$$\hat{v}=\mathop{{\rm{arg}}\,\min }\limits_{v:\frac{d{{\mathfrak{d}}}_{t}}{{dt}}={v}_{t}\left({{\mathfrak{d}}}_{t}\right)}\left({\int }_{0}^{1}{\left\Vert L{v}_{t}\right\Vert }_{{L}^{2}}^{2}{dt}+\frac{1}{{\sigma }^{2}}{\left\Vert {x}_{l}\cdot {{\mathfrak{d}}}^{-1}-{x}_{l}^{R}\right\Vert }_{{L}^{2}}^{2}\right)$$
(13)

Where \(L\) is the smoothness operator defined by equation: \(L=-\alpha {\nabla }^{2}+\gamma x\), where \({\nabla }^{2}\) is the Laplacian operator. We used linear interpolation for image transformation. Mutual information served as the optimization metric during the registration process, and the final evaluation index employed was the mean square error (MSE).

$${MSE}=\frac{1}{{MN}}{\sum }_{q=1}^{M}{\sum }_{p=1}^{N}{\left[{x}_{l}\left(p,q\right)-{x}_{l}^{R}\left(p,q\right)\right]}^{2}$$
(14)

where \(M\) and \(N\) represented the row and column dimension of the matrix, respectively.

We calculated the sum of \({x}_{l}^{R}\left({Meta}\right)\) for all metastatic patients and subtracted the sum of \({x}_{l}^{R}\left({Non}\right)\) for all non-metastatic patients. We then scaled the matrix to the 0–1 range, obtaining the metastasis risk distribution map \({{Risk}}_{{Meta}}\):

$${{Risk}}_{{Meta}}={Norm}\left(\sum {x}_{l}^{R}\left({Meta}\right)-\sum {x}_{l}^{R}\left({Non}\right)\right)$$
(15)
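Eq. (15) can be illustrated with a short Python sketch that sums the registered maps of each group, takes the difference, and rescales it to the 0–1 range. It assumes that \({Norm}\) denotes min-max scaling and that all maps share the same shape after registration; the function name `risk_map` and the toy matrices are hypothetical.

```python
def risk_map(meta_maps, non_maps):
    """Sum(metastatic maps) - sum(non-metastatic maps), min-max scaled to [0, 1]."""
    h, w = len(meta_maps[0]), len(meta_maps[0][0])
    diff = [[sum(m[r][c] for m in meta_maps) - sum(m[r][c] for m in non_maps)
             for c in range(w)] for r in range(h)]
    lo = min(min(row) for row in diff)
    hi = max(max(row) for row in diff)
    if hi == lo:
        # A constant difference map carries no contrast; return zeros.
        return [[0.0] * w for _ in range(h)]
    return [[(v - lo) / (hi - lo) for v in row] for row in diff]


# Toy example with one registered 2x2 map per group.
meta = [[[1.0, 0.0], [0.0, 0.0]]]
non = [[[0.0, 0.0], [0.0, 1.0]]]
risk = risk_map(meta, non)
```

Regions where metastatic cases dominate approach 1, regions where non-metastatic cases dominate approach 0, and regions with no difference fall in between.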

Model evaluation and radiologist competing test

We evaluated the performance of the predictive model using the AUC of the receiver operating characteristic (ROC) curve, as well as its sensitivity, specificity, accuracy, NPV, and PPV. To compare the predictive effectiveness of AI and human experts for LLNM, we designed an ultrasound physician test experiment. During the recruitment of physicians, a rule of equal representation of male and female experts was followed. Sex information was collected through self-reporting. However, the sex of experts is not used as a variable in the analysis of this study. All participating physicians were required to complete two tasks:

  • Task 1: Physicians were asked to diagnose thyroid cancer based on 200 ultrasound images, which included 100 benign and 100 malignant nodules.

  • Task 2: Physicians were required to predict LLNM based on 200 cases using ultrasound images, imaging reports, and clinical information, including 100 LLNM-positive and 100 LLNM-negative cases.

The purpose of Task 1 was to assess the participating physicians’ expertise in detecting thyroid lesions on ultrasound images. Task 2 was designed to evaluate the physicians’ ability to predict LLNM using a combination of images and clinical information.

Statistical analysis

We estimated the 95% confidence intervals (CI) for the performance metrics pertaining to our classification results using bootstrapping, which encompassed AUC, sensitivity, specificity, accuracy, NPV and PPV. The method we used involved implementing n-out-of-n bootstrap sampling with replacement at the image level for our datasets. For each bootstrap sample (100 samples), we calculated and retained the performance metrics specific to that sample. This process was carried out 1000 times. Subsequently, we established the 95% CIs by taking the 2.5th and 97.5th percentiles from the distribution of each metric’s empirical data. All computations and statistical analyses were conducted using Python, version 3.9 (Python Software Foundation).
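A percentile bootstrap of this kind can be sketched in standard-library Python, shown here for the AUC. The sketch rests on stated assumptions: the AUC is computed with the rank-based (Mann-Whitney) estimator, resamples containing only one class are redrawn, the percentiles are approximated by indexing the sorted statistics, and the input contains both classes. The function names and the fixed seed are illustrative and are not taken from the study code.

```python
import random


def auc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, seed=0):
    """95% percentile interval from n-out-of-n resampling with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined without both classes; redraw
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(0.025 * (n_boot - 1))]
    upper = stats[int(0.975 * (n_boot - 1))]
    return lower, upper


labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.6, 0.5, 0.8, 0.7, 0.9]
point_estimate = auc(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores, n_boot=500)
```

The same resampling loop applies to sensitivity, specificity, accuracy, NPV and PPV by swapping the statistic computed on each resample.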

NPV is the probability that a person testing negative for a disease truly does not have the disease. In other words, it’s the percentage of negative results that are correct. The formula for NPV is:

$${NPV}=\frac{{True\; negative}}{{True\; negative}+{False\; negative}}$$
(16)

PPV is the probability that a person testing positive for a disease truly has the disease. It’s the percentage of positive results that are correct. The formula for PPV is:

$${PPV}=\frac{{True\; positive}}{{True\; positive}+{False\; positive}}$$
(17)

Sensitivity (also referred to as the true positive rate and the recall) is the proportion of positives that are correctly identified as follows:

$${Sensitivity}/{TPR}=\frac{{True\; positive}}{{True\; positive}+{False\; negative}}$$
(18)

Specificity (also known as the true negative rate), which measures the proportion of correctly identified negatives, was calculated as follows:

$${Specificity}/{TNR}=\frac{{True\; negative}}{{True\; negative}+{False\; positive}}$$
(19)
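Eqs. (16)–(19), together with accuracy, follow directly from the four confusion-matrix counts. The Python sketch below assumes binary labels and predictions coded as 0 and 1 and returns `nan` when a denominator is zero; the function name `classification_metrics` is illustrative.

```python
def classification_metrics(labels, preds):
    """Sensitivity, specificity, PPV, NPV and accuracy from 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # Eq. (18)
        "specificity": ratio(tn, tn + fp),   # Eq. (19)
        "ppv": ratio(tp, tp + fp),           # Eq. (17)
        "npv": ratio(tn, tn + fn),           # Eq. (16)
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


labels = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(labels, preds)
```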

The AUC (area under the ROC curve) measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1) and was calculated as:

$$A={\int }_{0}^{1}{TPR}({FP}{R}^{-1}(x)){dx}$$
(20)

Where \({FPR}=1-{TNR}=1-{Specificity}\). Given two bounding boxes \({b}_{1}\) and \({b}_{2}\), their IoU could be computed as:

$${IOU}\left({b}_{1},{b}_{2}\right)=\frac{{{{\rm{A}}}}\left({b}_{1}\cap {b}_{2}\right)}{{{{\rm{A}}}}({b}_{1}\cup {b}_{2})}=\frac{{{{\rm{A}}}}\left({b}_{1}\cap {b}_{2}\right)}{{{{\rm{A}}}}\left({b}_{1}\right)+{{{\rm{A}}}}({b}_{2})-{{{\rm{A}}}}({b}_{1}\cap {b}_{2})}$$
(21)

Where A(·) was the area of the shape. The calculation of IoU could therefore be formulated as a problem involving the computation of the area of each spherical rectangle and the intersection of two spherical rectangles.
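For reference, Eq. (21) can be sketched in Python for the simplest case of planar, axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`. This box representation and the function names are assumptions made for the example; other geometries would need their own area and intersection computations.

```python
def area(box):
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(b1, b2):
    """Eq. (21): intersection area divided by union area."""
    overlap = (max(b1[0], b2[0]), max(b1[1], b2[1]),
               min(b1[2], b2[2]), min(b1[3], b2[3]))
    inter = area(overlap)
    union = area(b1) + area(b2) - inter
    return inter / union if union > 0 else 0.0


partial = iou((0, 0, 2, 2), (1, 1, 3, 3))     # overlap 1, union 7
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))    # no overlap
```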

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.