Background & Summary

Cancer remains one of the most significant global health challenges, despite significant advances in early detection, screening and treatment strategies1. Cancer progression is a complex, multistep process that typically begins with the transformation of normal cells into precancerous lesions, which may subsequently progress to invasive malignancies2,3,4. These transitions are characterized by abnormal histological and immunohistological features, as well as changes at the molecular level5,6,7,8,9,10,11,12. Early detection and intervention of precancerous lesions are critical for reducing cancer morbidity and mortality13,14,15,16,17,18. Understanding the molecular alterations that drive these transitions is crucial for developing effective early detection methods and therapeutic strategies. However, despite significant advances in the molecular profiling of cancer, there is still a critical gap in our ability to effectively detect and monitor pre-cancerous conditions.

High-throughput omics technologies, such as genomics, transcriptomics, proteomics, and epigenomics, have greatly advanced our understanding of precancerous lesions19,20. Numerous molecular profiling studies have revealed a large number of molecular alterations associated with tumorigenesis19,21,22. Despite these valuable contributions, significant gaps remain in the comprehensive monitoring and analysis of precancerous lesions17,19,23. Existing molecular profiling and knowledge of precancerous lesions is fragmented and scattered across various studies, publications, databases and research groups24, making it challenging to integrate and compare findings. In addition, many existing cancer databases focus primarily on invasive malignancies, neglecting the earlier, non-invasive stages of cancer progression25,26,27,28,29. This data fragmentation hinders the discovery of common molecular signatures that may be present in precancerous lesions across different cancer types. Therefore, there is a critical need for a comprehensive, centralized resource that consolidates and harmonizes multi-omics data, specifically focused on precancerous lesions.

In this study, we present the Precancerous Molecular Resource (PCMR), the first comprehensive database specifically designed to consolidate and harmonize multi-omics data focused on precancerous lesions. PCMR integrates precancerous molecular profiles, gene-precancerous lesion associations, and functional modules to provide a comprehensive online platform for data retrieval, analysis and visualization (Fig. 1). The PCMR resource incorporates two main categories of data (Table 1): (1) High-throughput multi-omics profiles (Fig. 2A), including data from precancerous lesions and paired normal and/or malignant conditions, generated from various biological materials (e.g., tissue samples, liquid biopsies, cell lines, and organoids). The integration of transcriptomic, proteomic, and epigenomic data, including mRNA, miRNA, circRNA, protein, and DNA methylation, allows for a more holistic exploration of precancerous lesions; (2) Precancer-gene associations, derived from differential analysis of precancerous profiles (Fig. 2B), text mining of abstracts using the ChatGPT large language model, and manual curation of cancer-related genes from published resources (Fig. 2C). In addition, PCMR provides abundant functionality on the ‘Search’, ‘Browser’ and ‘Analysis’ pages, enabling users to retrieve, analyze and visualize resources related to precancerous lesions. PCMR will serve as an essential resource for advancing our understanding of the cellular and biological processes underlying the onset and progression of precancer.

Fig. 1
figure 1

Data source and architecture of PCMR (PreCancerous Molecular Resource). PCMR combines precancerous profiles, precancer-gene associations, and functional modules, providing a holistic system for online retrieval, analysis, and visualization of gene signature. The PCMR platform incorporates: (1) high-throughput profiles of precancer and paired normal and/or malignant conditions across various omics, platforms and biological material sources. (2) precancer-gene associations derived from differential analysis of precancerous profiles, text mining of abstracts using the ChatGPT large language model, and manual curation of cancer-related genes from published resources. (3) abundant functionality on the ‘Search’, ‘Browser’ and ‘Analysis’ pages, facilitating the retrieval, analysis and visualization of resources related to precancerous lesions.

Table 1 Precancerous profiles and precancer-gene associations in PCMR.
Fig. 2
figure 2

Summary of precancerous profiles and precancer-gene associations in PCMR. (A) Number of precancerous profiles for each organ and tissue type, sourced from tissue biopsies, liquid biopsies, cell lines and organoids, encompassing omics data types including mRNA, microRNA, circRNA, DNA methylation and protein. (B) Precancer-gene associations identified from differential analysis were summarized. For mRNA, microRNA, circRNA, and protein analysis, a fold change greater than 2 and an adjusted P-value below 0.05 were used to identify differential genes. In the differential DNA methylation analysis, genes with an absolute methylation difference greater than 0.2 and an adjusted P-value less than 0.05 were considered differentially methylated. (C) Precancer-gene associations from text mining of literature abstracts using the ChatGPT large language model, and manual curation of cancer-related genes from five published resources across cancer types.

Methods

Precancerous profiles collection and quality control

To systematically collect high-throughput sequencing and microarray data from precancerous, paired normal and/or malignant conditions, we searched the Gene Expression Omnibus (GEO)24 database with the following keywords: ‘precancerous’, ‘premalignant’, ‘preneoplastic’, ‘preinvasive’, ‘precarcinoma lesion’, ‘cancer/tumor precursor’, ‘benign lesion’, ‘incipient neoplasia’ and ‘dysplasia’. Relevant data matrices for various premalignant lesions, including transcriptomic, epigenomic, and proteomic profiles, were manually curated and downloaded.

To ensure the quality and reliability of the data included in our repository, we implemented a rigorous multi-step filtering process. We first identified datasets using keywords related to precancerous lesions. The following criteria were used to exclude datasets: 1) datasets involving physical, chemical or biological interventions (e.g., viral infection, drug treatment, radiation exposure) that could introduce confounding effects; 2) datasets lacking precancerous lesion samples or missing both normal and cancer samples; 3) datasets without molecular omics data or sufficient annotation details; 4) datasets with incomplete or ambiguous clinical information regarding sample disease status; 5) datasets with fewer than 20 total samples; 6) datasets with the sample group (precancerous, normal/cancer) contained fewer than 3 samples; 7) datasets with duplicate dataset IDs and sample IDs to ensure unique storage and analysis.

The following criteria were used to filter samples and genes: 1) samples containing mixed tissues, where precancerous lesions and cancerous tissues could not be clearly distinguished were excluded; 2) samples with uncertain disease classification were removed; 3) genes or RNAs that could not be mapped to standardized names or IDs were excluded; 4) genes or RNAs detected in less than 20% of samples within a dataset were removed to increase data reliability.

After filtering, datasets were manually curated and cross-checked for consistency before being stored and analyzed. We manually reviewed the descriptions and literature of each dataset following keyword searches in GEO. For sample annotations, including patient clinical information, data platform annotations, and source information, we implemented a two-step verification process: independent data collection followed by cross-checking. The dataset and sample information were cross-checked by different researchers against the source databases and original publications. Any discrepancies identified during cross-checking were systematically reviewed, and corrections were made to ensure their consistency with the original source. In cases where uncertainty remained, discussions were held among the curators to reach a consensus, and a final decision was made based on the most reliable source documentation. These stringent quality control measures ensure that only high-quality, well-annotated datasets are included in our repository.

Data normalization

After completing the above steps, data normalization was performed. For each profile, sample IDs were unified into GEO accession IDs. Gene IDs in RNA and protein datasets were unified into Gene Symbols, microRNA IDs were aligned with miRBase precursor accession IDs, and circRNA IDs were converted to circBase IDs. For methylation datasets, methylation profiles were normalized to the gene level, defined as the average methylation level of probes within the promoter region (2 kb upstream to 0.5 kb downstream of the transcription start site).

Normalization was performed across arrays for each microarray dataset. High-throughput RNA-sequencing profiles were saved as either FPKM (Fragments Per Kilobase of transcript per Million mapped reads) or RPKM (reads per kilobase of transcript per million reads mapped). The human reference genome GRCh38 was used in all datasets.

Precancerous lesion-gene associations from differential analysis

In PCMR, genes relevant to precancerous lesions were primarily identified through differential analysis and text-mining using the ChatGPT large language model. Differential analysis was performed independently for precancerous samples against their paired normal or cancer counterparts in each dataset. Significantly differential genes associated with precancerous lesions were identified using thresholds for fold change, absolute difference, and adjusted P-values. Differential analyses were conducted by R (version 4.2.0), using packages including ‘rstatix’, ‘dplyr’, ‘tidyr’ and ‘limma’30. The human reference genome GRCh38 was used for all datasets and visualization. The P-value for comparing two groups in differential analysis was calculated using the Mann-Whitney U test and adjusted with the Bonferroni correction. In differential DNA methylation analysis, genes with an absolute methylation difference greater than 0.2 and an adjusted P-value less than 0.05 were considered differentially methylated. For mRNA, microRNA, circRNA, and protein analysis, a fold change greater than 2 and an adjusted P-value below 0.05 were considered as differential.

Precancerous lesion-gene associations from literatures

For precancer-gene associations derived through text mining, gene names, precancerous lesion names, and human organ/tissue names were extracted from the abstracts and unified. Initially, we searched the PubMed database using the same keywords as for high-throughput profiling, which yielded 717,875 article abstracts from relevant studies. We then used ChatGPT’s standard GPT-3.5-turbo model (unmodified by fine-tuning) to process the abstracts, implementing a custom prompt template for biomedical entity recognition via the Azure OpenAI API. The prompt is “You are a professional biologist. Your task is to accurately identify and list the following categories from the given abstract: organ, gene symbol, precancerous lesion, sentences containing both gene symbols and precancerous lesions. Organs should be categorized into given list (organ list)”. After initially filtering out articles with missing organ or gene information, we narrowed down the dataset to 147,110 articles, which were then subjected to a second round of information extraction. Finally, we obtained meaningful precancerous gene information from 94,888 articles using the ChatGPT platform.

Since our goal was to extract genes associated with precancerous lesions across various human organs and tissues, we did not impose strict criteria on lesion subtypes or anatomical locations within an organ. Instead, we focused on extracting and unifying human organ/tissue names and gene names. During the data mining process with ChatGPT, we first standardized organ and tissue nomenclature to ensure consistency. This standardized terminology was used as a reference in the prompts to guide ChatGPT in categorizing extracted terms appropriately. After ChatGPT extracted human organ/tissue names, gene names, and precancerous lesion names from the abstracts, we retained entries where both an organ/tissue and a gene name were present, without requiring specific precancerous lesion names. For gene name standardization, we removed rows containing empty or meaningless strings in the gene column, and then used the Ensembl, miRBase, and circBase databases to ensure uniformity in gene symbols and IDs. Precancerous lesion names were converted to lowercase and stripped of unnecessary or meaningless characters before storage. Finally, the precancer-gene associations were categorized into two confidence levels: those inferred to be related contextually, and those identified through co-occurrence within the same abstract.

Additionally, PCMR also incorporates cancer-associated genes across various cancer types from five publicly available, manually curated databases, including the Catalogue Of Somatic Mutations In Cancer (COSMIC)31, the Network of Cancer Genes (NCG)32, the Candidate Cancer Gene Database (CCGD)33, miR2Disease34, CircR2Disease35.

Data Records

The dataset is available at Figshare36 and the dataset contains:

  1. (i)

    The ‘precancerous profiles information.txt’ file contains detailed information of precancerous profiles paired with normal and/or malignant counterparts. The table includes organ/tissue, cancer type, disease state (normal/premalignant/cancer), disease name, biological materials, omics, platform, gender, age, cancer stage, and PubMed ID.

  2. (ii)

    The ‘precancer-gene associations from differential analysis.txt’ file contains precancer-gene associations derived from differential analysis. The table includes organ/tissue, gene symbol, cancer type, biological material origin, omics, and the relevant differential groups.

  3. (iii)

    The ‘precancer-gene associations from ChatGPT.txt’ file contains precancer-gene associations obtained from the ChatGPT large language model. The table includes organ/tissue, gene symbol, cancer type, PubMed ID, and whether the association was obtained from relation extraction.

  4. (iv)

    The ‘cancer-gene associations from manually curated databases.txt’ file contains cancer-associated genes or RNAs from five publicly available, manually curated databases, including COSMIC, NCG, CCGD, miR2Disease and CircR2Disease. The table includes organ/tissue, gene symbol, cancer type, PubMed ID, data origin.

  5. (v)

    The ‘Technical Validation data.xlsx’ file contains all the data used during technical validation of PCMR. The file includes 1) Precancerous sample clinical information and profile metadata of GSE13898 dataset cataloged in PCMR. 2) PIGR gene expression level across precancerous, normal, and/or malignant samples within GSE13898. 3) Precancer-gene associations collected in PCMR, derived from differential analysis of precancerous profiles, text mining of abstracts using the ChatGPT large language model, and manual curation of cancer-related genes from published resources. 4) Differential expression analysis results and expression level of 7 previously reported esophageal precancer-related genes (CFTR, PIGR, NAT2, CCDC25, ABCG2, CD86, GPA33), and 13 additional unreported relevant genes (HES1, RETSAT, NAT8, BMP4, SHH, TGFB1, SMO, CDKN2A, PTCH1, IGF1, GLI1, TP53, NFKB1) from GSE13898 dataset within PCMR. 5) Overall survival data of esophageal carcinoma from The Cancer Genome Atlas Program (TCGA), together with NAT8 expression level and its subgroups based on median. 6) Gene set functional items from gene functional enrichment analysis of 13 differentially expressed genes based on the Database for Annotation, Visualization and Integrated Discovery (DAVID).

Technical Validation

To ensure the accuracy of clinical information, multi-omics profiles and precancer-gene associations in PCMR, we followed a series of carefully designed procedures. For the collection of high-throughput sequencing and microarray profiles from precancerous, paired normal, and/or malignant conditions, we manually reviewed the descriptions and literature of each dataset after performing keyword searches in GEO. For annotation details of each sample such as patient clinical information, data platform annotations and data source information, we ensure the accuracy and consistency of the extracted information data by implementing a two-step verification process, including independent data collection and cross-checking. Collected dataset and sample information were then independently cross-checked by different researchers against the source databases and original publications. Any discrepancies identified during cross-checking were systematically reviewed, and corrections were made to ensure consistency with the original source information. In cases where uncertainty remained, discussions were held among the curators to reach a consensus, and a final decision was made based on the most reliable source documentation.

On the other hand, a strict and rigorous approach was followed for the retrieval of precancer gene associations. For associations derived from differential analysis of high-throughput molecular profiles, R packages (‘rstatix’, ‘dplyr’, ‘tidyr’ and ‘limma’) was used. The P-value for comparing two groups was calculated by the Mann-Whitney U test and adjusted with the Bonferroni correction. For mRNA, microRNA, circRNA, and protein analysis, a fold change greater than 2 and an adjusted P-value below 0.05 were considered as differential. In the differential DNA methylation analysis, genes with an absolute methylation difference greater than 0.2 and an adjusted P-value less than 0.05 were considered differentially methylated. By these commonly used differential analysis tools, the data underwent filtering and normalization, followed by the application of strict thresholds for each type of omics. The entire process was independently cross-verified for reliability. For associations extracted from text mining, we carefully selected ChatGPT prompts and pretested the model using a small paper dataset containing precancer-gene pairs. Cancer-related genes were collected from five widely  used, manually curated public databases.

To verify the reliability of data resources and function included in PCMR, we examined the gene PIGR in the context of esophageal premalignant development, a relationship previously reported in the literatures37,38,39. Based on premalignant and normal/cancer samples from dataset GSE1389840 stored within PCMR, we identified significant expression differences for PIGR between premalignant and normal/cancer samples (Fig. 3A, B). Furthermore, the association between esophageal precancer and PIGR derived from ChatGPT large language model was also collected in PCMR resource (Fig. 3C).

Fig. 3
figure 3

Technical validation of PCMR resource. (A) Precancerous sample clinical information and profile metadata of GSE13898 dataset cataloged in PCMR. PCMR incorporates information of each precancerous, normal and/or malignant sample, including sample group (normal/premalignant/cancer), disease name, biological tissue for data, data platform, cancer stage, gender, age, and PubMed ID. (B) Distribution of PIGR gene expression across precancerous, normal, and/or malignant conditions within GSE13898. P-value was calculated by Mann–Whitney U test, and adjusted by Bonferroni correction. (C) Precancer-gene associations collected in PCMR, originated from differential analysis of precancerous profiles, text mining of abstracts using the ChatGPT large language model, and manual curation of cancer-related genes from published resources. (D) Differential expression analysis of 7 previously reported esophageal precancer-related genes (CFTR, PIGR, NAT2, CCDC25, ABCG2, CD86, GPA33), and 13 additional unreported relevant genes (HES1, RETSAT, NAT8, BMP4, SHH, TGFB1, SMO, CDKN2A, PTCH1, IGF1, GLI1, TP53, NFKB1). Based on GSE13898 dataset, the expression level of RNAs are compared across precancerous, normal, and malignant conditions. P-value was calculated by Mann–Whitney U test, and adjusted by Bonferroni correction. (E) Kaplan-Meier survival analysis of NAT8 in esophageal carcinoma based on the Gene Expression Profiling Interactive Analysis (GEPIA) database, with data from The Cancer Genome Atlas Program (TCGA). (F) Gene functional enrichment analysis of 13 differentially expressed genes based on the Database for Annotation, Visualization and Integrated Discovery (DAVID).

For a more thorough validation, we investigate the dynamic changes, biological processes, and clinical relevance of gene signatures within precancerous conditions, by including 7 previously reported esophageal precancer-related genes (CFTR41,42, PIGR37,38,39, NAT243,44, CCDC2545, ABCG246,47,48, CD8649, GPA3350), and 13 additional genes (HES1, RETSAT, NAT8, BMP4, SHH, TGFB1, SMO, CDKN2A, PTCH1, IGF1, GLI1, TP53, NFKB1). We observed that all seven previously reported genes showed significant differential expression between premalignant and normal/cancer samples based on dataset collected in PCMR (Fig. 3D). Among 13 genes with unreported relevant genes, 6 exhibited significant differential expression changes between precancerous lesions and normal or cancer groups (Fig. 3D). RETSAT and NAT8 showed marked expression differences in both precancerous vs. normal and precancerous vs. cancer comparisons but no variation between cancer and normal samples, suggesting transient involvement in early carcinogenic processes. Conversely, BMP4, SMO, PTCH1, and HES1 displayed significant expression shifts in normal vs. precancerous and normal vs. cancer groups but not between precancerous and cancer tissues, indicating their stable association with esophageal tissue pathology but inability to discriminate malignancy levels. Conversely, TP53 showed differential expression only between normal and cancer groups, but not between premalignant and normal/cancer samples, consistent with existing knowledge that it is frequently altered in cancers51, with no evidence linking it to esophageal precancerous lesions.

The gene NAT8, which shares N-acetyltransferase activity with NAT2, displayed a consistent differential expression pattern between premalignant and normal/cancer samples. Moreover, NAT8 emerged as a significant predictor of survival in esophageal carcinoma based on survival analysis (Fig. 3E), which demonstrated that PCMR could serve as important resource for mining precancerous prognosis biomarkers. DAVID functional enrichment analysis of 13 genes with significant differential expression in esophageal precancerous lesions revealed strong enrichment in GO biological processes related to dorsal/ventral neural tube patterning and smooth muscle tissue development (Fig. 3F). Neural tube patterning involves differentiation during embryogenesis, mediated by pathways such as Wnt, Notch, and BMP signaling52,53,54, all of which are closely linked to cancer initiation and progression. Additionally, aberrant activation of genes involved in esophageal smooth muscle development may indicate early neoplastic transformation55,56,57. These findings demonstrate the reliability and biological relevance of PCMR dataset for studying precancerous lesions.

Usage Notes

In addition to Figshare36, the data associated with this work is also available at http://www.bio-data.cn/pcmr, which allows for interactive visualization of PCMR datasets.