Introduction

Advances in genomic sequencing have led to an explosion of data, making the interpretation of variants in lesser-known genes a day-to-day challenge for geneticists. Key gene-phenotype associations often remain underrepresented in widely used human disease databases such as Online Mendelian Inheritance in Man (OMIM) [1]. For example, OMIM may omit some gene-phenotype associations [2], or include them with an emphasis on symptoms different from those observed in some patients. To work around this issue, PubMed or other databases can be searched for the most relevant scientific publications linking a gene to a specific phenotype. However, this thorough approach to genomic data interpretation is time-consuming and potentially less accurate over time. This is especially true for whole-genome sequencing (WGS) analysis, where a significant number of variants located in non-OMIM morbid genes are retained by classical filters (such as “rare loss-of-function” or “rare homozygous missense for a recessive hypothesis”).

To address these challenges, we developed PubMatcher, a free online tool that simplifies the retrieval of gene-phenotype associations by querying multiple curated databases and PubMed simultaneously. PubMatcher uniquely supports batch format-free analysis, significantly reducing the time required to identify candidate genes relevant to a patient’s phenotype. We describe in this article the modus operandi of PubMatcher and its relevance in revealing non-obvious gene-phenotype associations.

Materials and methods

PubMatcher is a full-stack web application developed using Node.js version 18 [3, 4], with an Express.js backend and a Vue.js 3.5 frontend. Dependency management is handled via npm, and data persistence relies on a PostgreSQL 14 database. The application is fully containerized using Docker to ensure reproducibility and ease of deployment.

Two types of inputs are required in PubMatcher: one or more genes and one or more phenotypes (or relevant keywords) (Fig. 1). The ‘Extract from Text’ feature employs a client-side pattern-matching algorithm that relies on a cached dataset of HGNC symbols and aliases. The extraction logic prioritizes official symbols, identifying matches within the input text using case-insensitive regular expressions with word boundary enforcement (\b). If an official symbol is not found, the algorithm scans for known aliases. Validated matches are automatically mapped to their current official HGNC symbol. To minimize false positives, a common challenge in text mining, aliases shorter than three characters are automatically excluded. Furthermore, the tool incorporates a user-controlled exclusion list (‘blacklist’) stored in the browser’s local storage, allowing users to persistently exclude specific aliases or genes that may trigger false positives in their clinical context.

The PubMatcher pipeline aggregates information using a hybrid approach: it performs real-time web scraping for PubMed (keyword(s) + gene) and queries public APIs for UniProt [5], the International Mouse Phenotyping Consortium (IMPC) [6], and PanelApp [12]. To optimize performance, gnomAD constraint metrics [7], ClinVar, and Gene Curation Coalition (GenCC) data are accessed via locally stored datasets that are updated periodically, while API responses are managed with local caching strategies.
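The extraction rules described above can be sketched as follows. This is a minimal illustration in Python, not the tool's client-side code: the function names, the toy symbol/alias table, and the choice to check aliases per gene only when the official symbol is absent are all assumptions made for the example.

```python
import re

MIN_ALIAS_LENGTH = 3  # aliases shorter than three characters are ignored

# Toy stand-in for the cached HGNC dataset (official symbol -> aliases).
# Entries are illustrative only; "B1" is an invented short alias.
TOY_HGNC = {
    "MC4R": [],
    "SLC12A5": ["KCC2"],
    "BRAT1": ["BAAT1", "B1"],
}


def mentions(term, text):
    # Case-insensitive whole-word match enforced with \b word boundaries.
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def extract_genes(text, hgnc, blacklist=()):
    """Return official symbols found in text, by symbol first, then by alias."""
    blocked = {entry.upper() for entry in blacklist}
    found = []
    for symbol, aliases in hgnc.items():
        if symbol.upper() in blocked:
            continue
        if mentions(symbol, text):
            found.append(symbol)
            continue
        # Official symbol absent: fall back to aliases that pass the filters.
        for alias in aliases:
            if len(alias) < MIN_ALIAS_LENGTH or alias.upper() in blocked:
                continue
            if mentions(alias, text):
                found.append(symbol)  # map the alias back to the official symbol
                break
    return found


text = "Rare variants in mc4r, KCC2 and B1 were retained."
print(extract_genes(text, TOY_HGNC))                      # ['MC4R', 'SLC12A5']
print(extract_genes(text, TOY_HGNC, blacklist=["KCC2"]))  # ['MC4R']
```

In this sketch the blacklist is a plain argument; in the tool it is persisted in the browser's local storage.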

Fig. 1: PubMatcher query page.

Genes can be entered manually or through the “EXTRACT FROM TEXT” mode, which extracts gene names from pasted free text. One or more phenotypes can be entered manually in the lower box, separated by commas (version: January 2025).

The results page presents a summary of all the information collected (Fig. 2). To ensure wide accessibility, PubMatcher is available via web browsers at https://pubmatcher.fr, and the source code and documentation are available on GitHub (https://github.com/victormar1/PubMatcher). It does not require user registration, in line with most journals’ guidelines for software tools.

Fig. 2: Example of PubMatcher output.

(query: genes “DIDO1, MC4R, BRAT1, SLC12A5” and phenotype “obesity, diabetes”). Crosses indicate a lack of information for the gene. The “GENE” column contains gene constraint metrics; the “PUBMATCH” column contains the number of publications retrieved from PubMed and the title of the first publication; the “FUNCTION” column contains the UniProt function description of the protein and functional tags; the “PHENOTYPE KO” column contains icons representing mouse symptoms after knockout (KO), as compiled in the IMPC database; the “CLINVAR LOOKUP” column contains the number and type (missense / loss-of-function) of either pathogenic / likely pathogenic variants or variants of unknown significance; the “STATUS” column contains OMIM, Gene Curation Coalition and PanelApp England / Australia information.

Results are presented in an organized table format, where gene-phenotype pairs are listed with key metrics, such as constraint scores and publication count. The details of each query are described below.

Gene constraint metrics

For each gene, PubMatcher obtains the following constraint metrics from the gnomAD v2.1 and v4 databases [8]: pLI (probability of being loss-of-function intolerant), LOEUF (loss-of-function observed/expected upper bound fraction), MOEUF (missense observed/expected upper bound fraction) and missense Z-score. LOEUF and MOEUF indicate a gene’s tolerance to loss-of-function and missense variants, respectively, helping prioritize genes under selective constraint for clinical relevance. LOEUF and pLI values are highlighted according to constraint level: dark red for low LOEUF (< 0.35 for v2 and < 0.6 for v4) or high pLI (> 0.9), and green for low pLI (< 0.1) [9]. Discrepancies between the two gnomAD versions are flagged with an exclamation mark. Of note, gnomAD v2 and v4 differ mainly in cohort size, proportion of European ancestry, technical and sequencing quality, and cohort segmentation by disease or non-disease status. Hence, metric values and their interpretation may vary between versions because of differences in cohort size (a larger cohort improves statistical power for the same observed/expected ratio), locus quality, and recommended thresholds. Concordance between versions, together with the improved statistical power of v4, can therefore reinforce the interpretation of gene constraint, whereas differences should be interpreted with caution.
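The highlighting thresholds above can be expressed as a short sketch. The colour labels and function names are hypothetical, and the discrepancy rule (a gene flagged as constrained by LOEUF in one gnomAD version but not in the other) is an assumption, since the text does not define how discrepancies are detected.

```python
# Version-specific LOEUF cut-offs quoted in the text.
LOEUF_THRESHOLD = {"v2": 0.35, "v4": 0.6}


def loeuf_highlight(loeuf, version):
    """Return 'dark_red' when LOEUF is below the cut-off for this gnomAD version."""
    return "dark_red" if loeuf < LOEUF_THRESHOLD[version] else None


def pli_highlight(pli):
    """Return 'dark_red' for high pLI (> 0.9), 'green' for low pLI (< 0.1)."""
    if pli > 0.9:
        return "dark_red"
    if pli < 0.1:
        return "green"
    return None


def versions_disagree(loeuf_v2, loeuf_v4):
    # Assumed rule: flag when only one of the two versions calls the gene constrained.
    constrained_v2 = loeuf_highlight(loeuf_v2, "v2") is not None
    constrained_v4 = loeuf_highlight(loeuf_v4, "v4") is not None
    return constrained_v2 != constrained_v4


print(loeuf_highlight(0.5, "v2"))   # None
print(loeuf_highlight(0.5, "v4"))   # dark_red
print(versions_disagree(0.5, 0.5))  # True
```

The example shows why a single LOEUF value cannot be read the same way in both versions: 0.5 is above the v2 cut-off but below the v4 cut-off.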

PubMed

PubMed is a free online database providing access to a vast repository of biomedical research articles maintained by the National Center for Biotechnology Information (NCBI) and represents an “up-to-date” knowledge source for gene-phenotype associations [10]. For each query, PubMatcher reports the number of publications retrieved, the title of the first publication in the list, and a link to the query on PubMed and the related research articles. The PubMed search looks for the association between a gene name and a phenotype, and the gene-phenotype pairs are combined into one cumulative query per gene. An example query is shown in Fig. 2, which includes four genes and two phenotypes. The PubMed query for each gene follows this pattern: (GENE AND PHENOTYPE_1) OR (GENE AND PHENOTYPE_2). Hovering over the title of the publication displays the titles of other matching publications.
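The query pattern above can be assembled as in the following sketch. The helper names are hypothetical, and the handling of multi-word phenotypes (for example quoting) is not described in the text, so it is left out here.

```python
from urllib.parse import urlencode

PUBMED_BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def parse_phenotypes(raw):
    """Split the comma-separated phenotype box into clean keywords."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def build_pubmed_term(gene, phenotypes):
    """Build (GENE AND PHENOTYPE_1) OR (GENE AND PHENOTYPE_2) ..."""
    return " OR ".join(f"({gene} AND {phenotype})" for phenotype in phenotypes)


def pubmed_search_url(gene, phenotypes):
    """Return a link that opens the cumulative query on the PubMed website."""
    return PUBMED_BASE_URL + "?" + urlencode({"term": build_pubmed_term(gene, phenotypes)})


phenotypes = parse_phenotypes("obesity, diabetes")
print(build_pubmed_term("MC4R", phenotypes))
# (MC4R AND obesity) OR (MC4R AND diabetes)
print(pubmed_search_url("MC4R", phenotypes))
```

One query string is built per gene, so a batch of genes yields as many PubMed links as there are genes.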

UniProt

The UniProt database [5] provides information about protein functions, which may be relevant for genetic interpretation. PubMatcher requests the protein description and biological features keywords from UniProt using API access.

International mouse phenotyping consortium

The IMPC database [6] provides information about the consequences of gene knockouts in mice, which could suggest a gene’s involvement in human diseases. Different phenotypes are listed as presented on IMPC and specific symptoms can be displayed by mouseover.

ClinVar lookup

PubMatcher integrates data from ClinVar, a public database maintained by the NCBI that provides clinically relevant interpretations of genetic variants, including their pathogenicity, molecular consequences, and supporting evidence. For pathogenic and likely pathogenic small nucleotide variants, PubMatcher displays both the number of loss-of-function (LOF) variants (frameshift, nonsense, and canonical splice site alterations) and the number of missense variants. Variants of unknown significance (VUS) are also reported so that no potentially relevant finding is overlooked.
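A tally of this kind could be computed as sketched below. The record layout and the simplified consequence labels are assumptions made for illustration; they do not reproduce the format of the locally stored ClinVar dataset.

```python
from collections import Counter

# Simplified, hypothetical consequence labels standing in for
# frameshift, nonsense and canonical splice site alterations.
LOF_CONSEQUENCES = {"frameshift", "nonsense", "splice_donor", "splice_acceptor"}
PLP_LABELS = {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}


def variant_type(consequence):
    if consequence in LOF_CONSEQUENCES:
        return "LOF"
    if consequence == "missense":
        return "missense"
    return None  # other consequences are not counted in this sketch


def significance_group(significance):
    label = significance.lower()
    if label in PLP_LABELS:
        return "P/LP"
    if label == "uncertain significance":
        return "VUS"
    return None  # benign and conflicting entries are ignored here


def summarize_clinvar(records):
    """Count variants per (significance group, variant type) pair."""
    counts = Counter()
    for record in records:
        group = significance_group(record["significance"])
        kind = variant_type(record["consequence"])
        if group and kind:
            counts[(group, kind)] += 1
    return counts


toy_records = [
    {"significance": "Pathogenic", "consequence": "frameshift"},
    {"significance": "Likely pathogenic", "consequence": "missense"},
    {"significance": "Uncertain significance", "consequence": "missense"},
    {"significance": "Benign", "consequence": "missense"},
]
summary = summarize_clinvar(toy_records)
print(summary[("P/LP", "LOF")], summary[("P/LP", "missense")], summary[("VUS", "missense")])
# 1 1 1
```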

Gene curation coalition, PanelApp & OMIM

PubMatcher integrates data from GenCC [11], PanelApp England and Australia [12], and OMIM [1] to provide comprehensive information on gene-disease associations, supporting rapid and accurate curation of clinically relevant genes. GenCC aggregates gene-disease validity information from multiple expert-curated sources, facilitating the identification of genes with well-established evidence for their role in human diseases. PubMatcher displays the gene status from GenCC. The number of genes listed in both PanelApp England and PanelApp Australia is reported in the PubMatcher output because of the importance of these resources for fast gene-disease curation. Links are provided for quick access to the relevant entries on the PanelApp websites. OMIM is a comprehensive, authoritative resource that catalogs human genes and genetic phenotypes, including their relationships to disease. PubMatcher integrates data from OMIM to indicate whether a gene is associated with a known morbid condition or phenotype.

Relevance of PubMatcher in human whole-genome sequencing analysis

We evaluated the relevance of the PubMatcher tool in WGS analyses of patients with rare diseases performed at the Auragen laboratory in Lyon, France. This laboratory is part of the French 2025 genomic project, which aims to expand genomic access in human healthcare [13, 14]. First, across 20 trio-based WGS analyses, we assessed the proportion of variants retained by an example set of common WGS filters (detailed in Table S1) that were not located in OMIM morbid genes. Then, after analyzing 100 WGS cases, we present examples of variants revealed by PubMatcher in genes that proved potentially relevant for medical use.

Whole-genome sequencing was performed following the recommendations of the “France Médecine Génomique 2025” Plan. Genomic DNA extracted from whole blood was sequenced according to standard procedures for a PCR-free (polymerase chain reaction-free) genome on a NovaSeq 6000 instrument (Illumina, San Diego, California, USA). Sequencing data were aligned to the GRCh38.p13 full assembly using bwa 0.7+. Variants were called by several algorithms, including GATK 4+, Bcftools 1.10+, Manta 1.6+, and CNVnator 0.4+, and annotated using the Variant Effect Predictor. Detected variants were prioritized using in-house procedures. Further details are available on request at http://www.auragen.fr.

Results

Variants in non-OMIM genes found by common WGS filters

PubMatcher is meant to quickly identify gene-phenotype associations using the most up-to-date sources. Although the OMIM database is regularly updated, the most recent phenotype-to-gene associations may be missing, potentially leading to the exclusion of relevant variants. Therefore, we evaluated the proportion of non-OMIM morbid genes in 20 WGS trios consisting of an affected patient and unaffected parents, using a classic filtering strategy (see Table S1 for filters’ details).

After applying these filters, the remaining variant counts ranged from 80 to 150 per sample, with a mean of 114. Among these, the mean proportions of variants mapping to OMIM morbid genes, OMIM non-morbid genes, and non-OMIM genes were 31%, 51%, and 18%, respectively (Fig. 3). These results confirm a high representation of non-morbid or non-OMIM genes (approximately 70% combined) post-filtering, underscoring the utility of PubMatcher for efficiently screening them.

Fig. 3: Proportions of OMIM statuses for genes with identified variants using common WGS filters.

Morbid: 30.7% (SD = 5.72), Non-Morbid: 51.3% (SD = 6.11), Non-OMIM: 18% (SD = 5.56).
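The per-sample proportions and their mean across trios can be computed as in the sketch below. It runs on invented toy data, not on the study's variant lists, and the status labels and function names are hypothetical.

```python
from collections import Counter
from statistics import mean

STATUSES = ("morbid", "non_morbid", "non_omim")


def status_proportions(gene_statuses):
    """Percentage of retained variants per OMIM status for one sample.

    Expects a non-empty list with one status label per retained variant.
    """
    counts = Counter(gene_statuses)
    total = len(gene_statuses)
    return {status: 100 * counts[status] / total for status in STATUSES}


def mean_proportions(samples):
    """Average the per-sample percentages across all samples."""
    per_sample = [status_proportions(sample) for sample in samples]
    return {status: mean(p[status] for p in per_sample) for status in STATUSES}


# Two toy samples with four retained variants each.
toy_samples = [
    ["morbid", "non_morbid", "non_morbid", "non_omim"],
    ["morbid", "morbid", "non_morbid", "non_omim"],
]
print(mean_proportions(toy_samples))
# {'morbid': 37.5, 'non_morbid': 37.5, 'non_omim': 25.0}
```

Averaging per-sample percentages, rather than pooling all variants, gives each trio equal weight regardless of how many variants it retains.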

Misannotated or non-annotated genes with relevant variants in 100 WGS analyses

We present examples of variants found in genes either not annotated in OMIM for the researched disease or with a non-syndromic form not specified in OMIM (Table 1). These relevant variants were identified in 15 out of 100 whole-genome sequences analyzed at the French laboratory Auragen (Lyon, France) for medical purposes. The genomes included in this study were selected solely based on their availability as trios and were analyzed in chronological order in the following medical contexts: genodermatosis, chronic nephropathy, intellectual deficiency, or red blood cell diseases. None of these genomes had been analyzed previously; hence, they also include diagnoses in well-known genes correctly annotated within the OMIM database. Among the 15 genomes with relevant variants in incompletely annotated genes, only one also carried a pathogenic variant in a well-known gene (Table 1).

Table 1 Mis- or non-annotated genes in OMIM with relevant variants for 100 WGS analyses.

Integrating PubMatcher in genomic variant analysis workflows

PubMatcher is a tool that can be integrated early in the general workflow of genomic single nucleotide variant analysis. We propose a flowchart for data interpretation in a large-scale genomic approach (Fig. 4).

Fig. 4

Proposed integration of PubMatcher in interpretation of pangenomic analysis.

Starting with a conventional filtering strategy (as described in Table S1), a rapid diagnosis can be made if a causative variant is identified—for example, a previously described ClinVar pathogenic variant that matches the patient’s clinical presentation. If such a variant is not found, a more thorough variant analysis is required to explore and report relevant genetic variants.

The tool can be used for gene screening across all identified variants, allowing for a quick exploration of the most recent scientific knowledge (via PubMed and PanelApp queries), gene constraint metrics, protein functions (UniProt), and the consequences of mouse knockout models (IMPC). The mode of inheritance based on the family pedigree is also crucial. A recent publication from Chong et al. [9] compiled five key criteria for retaining genes of interest, nearly all of which are integrated into the proposed flowchart that includes PubMatcher, except for gnomAD variant co-occurrence.

After analyzing the PubMatcher output in the context of the patient’s phenotype, some genes of interest may be retained. If the gene evidence level is sufficient, the variants located within these genes can be interpreted with the usual tools and reported if classified as pathogenic or likely pathogenic (with additional exploration needed for a variant of unknown significance). Conversely, if the evidence level is low, a more research-focused approach, such as submitting to MatchMaker Exchange [15] (GeneMatcher, etc.) or conducting further fundamental post-genomic investigations, may be suggested.
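The decision points described in this section can be summarized as a small triage function. This is only a reading of the text above, not a transcription of Fig. 4: the function name, argument names, and returned labels are hypothetical, and judging whether gene evidence is "sufficient" remains an expert decision outside the code.

```python
def next_step(known_causative_variant, gene_evidence_sufficient, variant_class=None):
    """Suggest the next step for a candidate variant after conventional filtering.

    known_causative_variant: a previously described pathogenic variant matching
        the patient's presentation was found.
    gene_evidence_sufficient: expert judgement after reviewing the gene-level output.
    variant_class: "pathogenic", "likely pathogenic", "VUS", or anything else.
    """
    if known_causative_variant:
        return "rapid diagnosis"
    if not gene_evidence_sufficient:
        # Low gene-level evidence: data sharing or functional studies.
        return "research approach"
    if variant_class in {"pathogenic", "likely pathogenic"}:
        return "report"
    if variant_class == "VUS":
        return "additional exploration"
    return "not reported"


print(next_step(False, True, "likely pathogenic"))  # report
print(next_step(False, False, "VUS"))               # research approach
```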

Discussion

We believe PubMatcher is a significant advancement in clinical genomic research, addressing the need for more efficient interpretation of genomic data. By rapidly identifying relevant gene-phenotype associations, especially in lesser-known genes, PubMatcher increases both the speed and accuracy of genomic analyses. Indeed, the high proportion of candidate variants retained by filters and located outside OMIM morbid genes (about 70%) can be analyzed more quickly with PubMatcher than with individual queries, because it displays all of this information for many genes in a single window. This approach also helps ensure that rare yet important variants are not missed, which is critical for their inclusion in broader research studies; given their rarity, these cases can provide invaluable insights into disease mechanisms and phenotypic diversity.

Other tools interrogate gene-function links, such as the Open Targets Platform. That tool, however, cannot query the most recent PubMed literature, which is critical to obtain the most up-to-date information about relationships between genes and rare diseases. For instance, a query for the LEF1 gene (Table 1) returns no link with genodermatosis, despite articles published several years ago that demonstrated its involvement. Moreover, no easy-to-use batch mode is available to analyze a gene list. PubMatcher’s main strength is its ability to display in one window the main information needed to screen the variants located in non-OMIM morbid genes.

PubMatcher is a gene-level tool, which is complementary to variant-centered tools like VarSome or MobiDetails [16]. Indeed, WGS analysis in a medical context requires assessing the relevance of the gene first and then of the variant.

An important consideration is the inclusion of animal models, such as the mouse model, which provides invaluable insights into gene function and disease relevance due to its genetic similarity to humans. However, mouse models present limitations, including differences in gene expression and phenotypic responses. Expanding to other model organisms, such as zebrafish, could diversify the functional insights available to PubMatcher users, particularly for genes where murine models have limited general data or translational relevance.

The effectiveness of PubMatcher heavily depends on the quality and completeness of its external data sources. Attempts to incorporate alternative sources, such as Google Scholar, resulted in an overwhelming volume of unspecific and irrelevant data, highlighting PubMed as the most reliable and curated source for retrieving relevant literature. Advances in AI-driven text-mining tools, such as PubTator [17], offer promising avenues for improving data retrieval by extracting gene-disease relationships from biomedical literature. These tools could significantly enhance the completeness of PubMatcher’s results by identifying additional relevant publications that might otherwise remain undetected. However, current rate limitations (3 requests per second) within the PubTator API preclude its integration into PubMatcher at this stage.

PubMatcher has demonstrated effectiveness in identifying clinically relevant genes, thereby fulfilling its primary objective. Notably, several geneticists outside the development team have already integrated PubMatcher into their variant interpretation workflows, underscoring its reliability, practical utility, and adaptability in clinical genomics. Further exploration of PubMatcher’s applications in clinical settings could be beneficial. Another important consideration is the accessibility of the tool. While the current interface is user-friendly, particularly in terms of input formatting, result clarity, and advanced features upon login (such as input history), further simplifying the user experience and providing enhanced guidance and support would make the tool even more accessible to a wider audience.

Integrating artificial intelligence or machine learning could also boost PubMatcher’s capabilities by adding features like gene scoring to rank the matches by their relevance to the phenotype. Ongoing updates, as well as feedback from the user community, will be crucial for the tool’s continued development and for expanding its utility in the field of genomic research.

Conclusion

PubMatcher provides an effective solution for supporting genomic data analysis by seamlessly integrating bibliographic research into genomic interpretation workflows. This approach significantly enhances efficiency, particularly in identifying lesser-known yet clinically relevant gene-phenotype associations. As PubMatcher continues to evolve, improvements in data integration, interface design, and user-driven enhancements will further solidify its role as a valuable tool for both clinical diagnostics and genomic research.