Development of a robust and generalizable algorithm

Pathak, Abhay Kumar; Kural, Sukhad; Singh, Shweta; Kumar, Lalit; Yadav, Mahima; Gupta, Manjari; Das, Parimal; Jain, Garima

doi:10.1038/s41598-024-66770-y

Published: 13 August 2024

Development of a robust and generalizable algorithm "gQuant" for accurate normalizer gene selection in qRT-PCR analysis

Abhay Kumar Pathak²^na1,
Sukhad Kural³^na1,
Shweta Singh¹,
Lalit Kumar³,
Mahima Yadav⁴,
Manjari Gupta²,
Parimal Das¹ &
…
Garima Jain¹

Scientific Reports volume 14, Article number: 18774 (2024)

Abstract

Nucleic acid-based biomarkers, including microRNAs (miRNAs), long non-coding RNAs (lncRNAs), and messenger RNAs (mRNAs), are becoming increasingly prominent in disease diagnostics and risk assessment. qRT-PCR is the primary analytical method for the quantitative measurement of these biomarkers. Yet the relatively recent recognition of non-coding RNAs as biomarkers poses a challenge: there is no consensus on a universally accepted normalizer gene, which is an absolute requirement for accurate quantification. Current tools for normalizer selection have statistical limitations and suboptimal graphical user interfaces for data visualisation. These deficiencies underscore the need for a balanced tool tailored to qRT-PCR datasets. To address these limitations, we developed the 'gQuant' tool, which employs voting classifiers that combine predictions from multiple statistical methods. The tool's efficacy was validated on publicly available datasets and on in-house data derived from urinary exosomal miRNAs. Comparative analysis with existing tools revealed that their integrated methodologies could skew the ranking of normalizer genes, whereas 'gQuant' consistently yielded rankings characterised by lower standard deviation, reduced covariance, and higher kernel density estimation values. Given this promising performance, gQuant should improve normalizer gene identification and, in turn, the precision of gene expression quantification in a variety of research scenarios. The gQuant tool developed for this study is available for public use at https://github.com/ABHAYHBB/gQuant-Tool.

Introduction

Quantitative reverse transcription PCR (qRT-PCR) is a gold-standard technique for quantifying gene expression levels. Its importance has grown with the advent of microRNA (miRNA)-based biomarkers^1,2. The linchpin of comparative gene expression studies using qRT-PCR is the judicious selection of normalizer genes for accurate data normalization³. Internal reference genes such as ACTB, GAPDH, and 18S ribosomal RNA⁴ have been successfully employed at the tissue level for mRNA quantification, but researchers also choose specific normalizers for individual experiments, particularly where a universally accepted set of normalizers remains elusive, as in miRNA-based assays^5,6,7. Although U6 is the most commonly used reference gene in miRNA-based studies, it has variable expression and is not the optimal reference gene for miRNA analysis⁸. Existing tools such as RefFinder⁹ use Ct values from experiments to determine the stability of genes. However, such tools are limited in handling null values, and RefFinder relies on a weighted approach that can be biased, especially when dealing with complex biological variability. A detailed description of the limitations and approaches of currently available tools is provided in Supplementary Table 1. There is therefore a need for robust, consistent, and generalizable algorithms for identifying the most stable gene.

MicroRNAs (miRNAs) have emerged as promising biomolecules for diagnostic, prognostic, and therapeutic use in various cancers¹⁰. Their stability, abundance, and presence in body fluids such as urine make them ideal candidates for cancer detection and monitoring¹¹. Urinary extracellular vesicles (uEVs) have garnered significant attention because their cargo of stable miRNAs reflects the molecular signatures of their parent cells¹². One example is the exploration of uEVs to detect miRNAs released by tumors, especially in urological cancers such as bladder cancer (BCa) and prostate cancer (PCa), in the search for non-invasive methods to detect and monitor the disease^13,14,15.

However, the absence of a robust and standardized normalizer has led to varied strategies for miRNA normalization. Commonly used strategies include spike-in controls¹⁶ such as cel-miR-39³, multiple endogenous miRNAs and their arithmetic means¹⁷, the pair ratio method¹⁸, and study-specific normalizers^7,19. Such variable strategies raise significant concerns about the comparability and reliability of results across different studies.

Given this context, our effort to develop an efficient normalizer finding tool and its validation on uEV-miRNAs represents a timely and crucial endeavor. In this cross-disciplinary research initiative, we endeavor to synergize biological and mathematical fields to address this pressing issue.

While searching for a standard normalizer gene for another study, we set out to screen selected miRNAs based on existing literature. This initial assessment involved experimental qRT-PCR, followed by analysis with existing tools such as RefFinder. However, we encountered discrepancies in the results and limitations with these methods, which led us to develop a new and more robust analytical algorithmic model.

This paper introduces a novel and efficient algorithmic tool designed for analyzing qRT-PCR expression data to identify the most stable gene. We have validated our tool using a range of experimental data and existing datasets. Furthermore, we conducted a comparative analysis with current tools, pointing out their limitations, and detailed the methodological thoroughness that went into the development of this innovative analytical approach.

Material and method

Platform for mathematical analysis

We used the Jupyter Notebook interface, a web-based interactive computing environment that allows live code, visuals, and explanatory text to be combined in one document, to write scripts in Python. The libraries used were Pandas (data manipulation and analysis), NumPy (numerical computation), SciPy (scientific and technical computing), and scikit-learn (machine learning and data mining).

Subjects

Between October 2022 and August 2023, urine specimens were collected from individuals diagnosed with BCa or PCa and from asymptomatic healthy control subjects recruited from the Department of Urology, Institute of Medical Sciences, Banaras Hindu University (BHU), India. Controls were chosen to closely match patients in age (range 45–60) and had no prior history of cancer. Informed consent for participation was obtained from all individuals involved. The study protocols and samples were approved by the Human Ethical Committee, Institute of Science, BHU (Registration number: ECR/226/indt/UP/2014/RR-22) and the Institutional Ethics Committee, Institute of Medical Sciences, BHU (Registration number: ECR/526/inst/UP/2014/RR-20), confirming that all experiments were performed in accordance with relevant guidelines and regulations.

Exosomal miRNA isolation

The Norgen Urine Exosome RNA Isolation Kit (Catalog #47200, Norgen Biotek Corp.) was used to extract uEV-miRNAs. We used a 10 mL urine sample as the starting volume and processed it following the kit's instructions. RNA quality and concentration were then determined using a NanoDrop (Thermo Scientific, USA).

cDNA synthesis

Reverse transcription of RNA samples to cDNA was performed using the RevertAid First Strand cDNA Synthesis Kit (Catalog #K1622, Thermo Scientific, USA). We used miRNA-specific stem-loop primers to reverse transcribe only the targeted miRNAs. Primer details are provided in Supplementary Table 2.

Quantitative reverse transcription PCR

We conducted qRT-PCR using the Maxima SYBR Green/ROX qPCR Master Mix (2X) (Catalog #K0221, Thermo Scientific, USA). The reaction was performed using Applied Biosystems QuantStudio 6 Flex Real-Time PCR System.

Process and development of gQuant

Current tool limitation

I. Missing values: Existing tools such as geNorm, DeltaCt, BestKeeper, RefFinder, and GenExpA lack methods for handling missing values, which raises concerns about the final rankings they produce. We addressed this weakness by including an extra preprocessing unit in our tool to reduce the risks associated with NULL values.
II. Different approaches: Current tools use diverse statistical metrics that offer different views of the data, its nature, and its interpretation. Our tool uses four evaluation metrics (Kernel Density Estimation, Standard Deviation, Covariance, and Geometric Mean) to combine these perspectives and enable multidimensional analysis.
III. Scaling: In any integration system, using the raw values of different metrics can let one metric dominate, as happens in existing tools. Before democratic voting, we scale the results of each metric into the interval [0,1]. This prevents any single one of the four metrics from dominating the voting process.
IV. Interactive graphics: During the gene ranking procedure, visual depictions are a critical aid for users. They convey the core characteristics of the data, its fluctuations, its distribution, and the density of its scatter points. We employed a box plot that displays the expression of the various genes, indicating the range and distribution of the data. We also incorporated a KDE (Kernel Density Estimation) plot to illustrate the density of data points (with their specific numerical values) and their spread.
V. Democratic voting-based integration: Traditional tools such as RefFinder rely on weighted schemes that can introduce bias by assigning predetermined weights when ranking genes. To address this issue, our tool adopts a democratic voting strategy in which each gene competes for votes based on its characteristics. The gene receiving the majority of votes is selected, and a ranking index is constructed iteratively. This method provides a fair and balanced approach to gene ranking and minimizes the biases inherent in weighted systems.

Tool development

The algorithm addresses the challenge of gene ranking with a multi-metric approach that combines four statistical measures to rank genes iteratively by stability. It uses Standard Deviation, Geometric Mean, Covariance, and Kernel Density Estimation to select and rank genes from a high-dimensional dataset.

Data preprocessing

When the input dataset is loaded, the program begins preprocessing by identifying and counting the missing (null) values within each gene column. It then calculates the ratio of available values to missing values (denoted NA) for every gene column. This ratio determines the next action: if the ratio exceeds a user-specified threshold, the tool imputes each missing value (NA) with the median of the available values in that gene column; if the ratio falls below the threshold, the program automatically removes the column from the dataset. This procedure is applied to every gene column in the dataset. The median imputation strategy is predicated on an assumption of non-normal data distribution within the miRNA dataset. The threshold ratio of available to missing values is problem specific and can be chosen manually. For our results and validation, we used a ratio of 8:1 as the threshold. This procedure forms the first part of the tool's operational framework, shown as Part A in Fig. 1A.
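The preprocessing rule can be illustrated with a minimal sketch. This is not the gQuant source code: the function name `preprocess`, the parameter `min_ratio`, and the choice to keep a column whose ratio equals the threshold exactly are assumptions made for illustration, and rows are assumed to be samples with one column per gene.

```python
import numpy as np
import pandas as pd


def preprocess(ct: pd.DataFrame, min_ratio: float = 8.0) -> pd.DataFrame:
    """Impute or drop each gene column by its available-to-missing ratio."""
    kept = {}
    for gene in ct.columns:
        col = ct[gene]
        n_missing = int(col.isna().sum())
        n_available = int(col.notna().sum())
        # A column without missing values has an unbounded ratio and is kept.
        ratio = np.inf if n_missing == 0 else n_available / n_missing
        if ratio >= min_ratio:  # assumption: equality counts as passing
            # Median imputation of the missing Ct values in this column.
            kept[gene] = col.fillna(col.median())
        # Columns below the threshold are left out, i.e. dropped.
    return pd.DataFrame(kept, index=ct.index)
```

With the 8:1 threshold, a gene measured in 8 of 9 samples would be imputed, while a gene measured in 7 of 9 samples would be removed.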

Selection of the most stable gene

To develop a model that combines the strengths of existing techniques while mitigating their limitations, and that optimises the performance evaluation metrics for more nuanced and robust gene selection, we followed a systematic approach, as explained in the following sections.

The tool first takes the preprocessed data from Part A as input, as shown in Fig. 1B. It then computes all four metrics and scales the results of each into the range [0,1] to prevent any single metric from having a disproportionate impact on the voting outcome. A majority voting integration strategy then determines which gene is the most stable across the dataset, with two possible outcomes in each round: if a particular gene receives a majority of the votes, it is saved to the ranking index; if there is a tie, the tied genes are given the same rank in the index. The remaining genes repeat the process until only one gene is left.

Different metrics used in tool

Standard deviation (SD)

In our model for selecting the most stable gene, standard deviation is a key metric. We used SD to quantify the variability of gene expression across experimental groups. The algorithm computes the within-group standard deviation of each gene's expression levels, which serves as an initial filter for identifying potentially stable genes. A low standard deviation shows that a gene's expression levels are tightly clustered around the mean; conversely, genes with a high standard deviation are loosely clustered, which makes them unsuitable as reference genes.

The standard deviation $SD\left( {g_{i} } \right)$ of a gene $g_{i}$ with $m$ samples is given by:

$$SD\left( g_{i} \right) = \sqrt{\frac{1}{m}\sum\limits_{j = 1}^{m} \left( X_{ij} - \overline{X}_{i} \right)^{2}}$$

(1)

where $X_{ij}$ is the expression level of gene $g_{i}$ in sample $j$, and $\overline{X}_{i}$ is the mean expression of $g_{i}$.
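As a small worked check of Eq. (1), the sketch below uses made-up Ct values (not study data). Because the formula divides by $m$ rather than $m - 1$, it corresponds to NumPy's default `ddof=0`; pandas' `DataFrame.std` defaults to `ddof=1` and would give a slightly larger value.

```python
import numpy as np

ct = np.array([22.0, 24.0, 23.0, 25.0, 21.0])  # made-up Ct values for one gene

# Eq. (1) written out term by term: mean squared deviation, then square root.
sd_manual = np.sqrt(np.mean((ct - ct.mean()) ** 2))

# The same quantity via NumPy (population SD, ddof=0).
sd_numpy = np.std(ct, ddof=0)
```

Here the squared deviations sum to 10 over 5 samples, so both values equal the square root of 2.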

Geometric mean (GM)

The geometric mean is a measure of central tendency that is more robust to extreme values (outliers) than the arithmetic mean. GM is used to identify genes that maintain a consistent expression level across all samples, since a low GM would imply that at least one sample has a low expression level.

The geometric mean $GM\left( {g_{i} } \right)$ for a gene $g_{i}$ is

$$GM\left( g_{i} \right) = \left( \prod\limits_{j = 1}^{m} X_{ij} \right)^{\frac{1}{m}}$$

(2)

where $m$ is the number of samples.
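The geometric mean in Eq. (2) can be computed directly as a product, or equivalently as the exponential of the mean logarithm, which avoids overflow when the number of samples is large. The values below are made up for illustration, and both forms assume strictly positive inputs.

```python
import numpy as np

ct = np.array([2.0, 8.0, 4.0])  # made-up positive values; their product is 64

# Direct form of Eq. (2): m-th root of the product.
gm_direct = np.prod(ct) ** (1.0 / ct.size)

# Equivalent log form: exp of the arithmetic mean of the logs.
gm_log = np.exp(np.mean(np.log(ct)))
```

Both expressions evaluate to the cube root of 64, up to floating-point rounding.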

Covariance (CV)

Covariance provides a measure of the degree to which two variables change together. The Covariance Mean is an average of the covariances between one gene and all other genes. High covariance mean values indicate genes that are likely to be involved in similar biological quantification, and therefore may be of particular interest.

The Covariance $CV\left( {g_{i} } \right)$ for a gene $g_{i}$ is

$$CV\left( {g_{i} } \right) = \frac{1}{n}\mathop \sum \limits_{k = 1}^{n} Cov\left( {g_{i} ,g_{k} } \right)$$

(3)

where $Cov\left( g_{i}, g_{k} \right)$ is the covariance between the expression values of $g_{i}$ and $g_{k}$, and $n$ is the number of genes.
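A minimal sketch of Eq. (3) on a made-up matrix (rows are samples, columns are genes). The normalisation of the covariance is not stated in the text; dividing by $m$ (`ddof=0`) is an assumption chosen to match Eq. (1). Note that the sum over $k$ includes $k = i$, so each gene's own variance contributes to its covariance mean.

```python
import numpy as np

# Made-up expression matrix: 3 samples (rows) x 3 genes (columns).
X = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0],
              [3.0, 6.0, 0.0]])

# n x n covariance matrix between genes (columns are the variables).
cov = np.cov(X, rowvar=False, ddof=0)

# Eq. (3): average each gene's covariances with all genes, itself included.
cov_mean = cov.mean(axis=1)
```

In this toy matrix the constant third gene has a covariance mean of zero, and the second gene, which is twice the first, has a covariance mean twice as large.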

Kernel density estimation (KDE)

Kernel Density Estimation is a non-parametric way to estimate the probability density function of a random variable. In the context of gene ranking, it is used to estimate the distribution of expression levels for each gene. KDE serves to identify genes that have expression levels concentrated around the mean, which may be of biological significance.

The Kernel Density Estimation $KDE\left( {g_{i} } \right)$ for a gene $g_{i}$ is

$$KDE\left( {g_{i} } \right) = \frac{1}{m \times h}\mathop \sum \limits_{j = 1}^{m} K\left( {\frac{{\mu_{i} - X_{ij} }}{h}} \right)$$

(4)

where $K$ is the Kernel function, $h$ is the bandwidth, and $\mu_{i}$ is the mean of $g_{i}$.
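Eq. (4) can be sketched directly in NumPy with a Gaussian kernel. The function name `kde_at_mean` is hypothetical, and the default bandwidth (sample standard deviation times $m^{-1/5}$, a Scott-style rule of thumb) is an assumption, since the text does not state how $h$ is calculated.

```python
from typing import Optional

import numpy as np


def kde_at_mean(x, h: Optional[float] = None) -> float:
    """Evaluate a Gaussian KDE of x at the sample mean, following Eq. (4)."""
    x = np.asarray(x, dtype=float)
    m = x.size
    if h is None:
        # Assumed bandwidth rule; requires some variation in x (h > 0).
        h = x.std(ddof=1) * m ** (-1 / 5)
    u = (x.mean() - x) / h
    # Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi).
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return float(kernel.sum() / (m * h))
```

A gene whose Ct values sit close to their mean produces a larger value than a gene with widely spread values, which is why the voting step looks for the maximum of this metric.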

Mathematical formulation of gQuant tool algorithm

The mathematical notation is given first, followed by the steps of the algorithm.

$G$: Set of genes $g_{1}, g_{2}, g_{3}, \ldots, g_{n}$
$X$: Matrix of gene expression values, where $X_{ij}$ denotes the expression value of gene $g_{i}$ in sample $j$.
$S$: Standard deviation vector $S = [S_{1}, S_{2}, S_{3}, \ldots, S_{n}]$
$GM$: Geometric mean vector $GM = [GM_{1}, GM_{2}, GM_{3}, \ldots, GM_{n}]$
$COV$: Covariance mean vector $COV = [COV_{1}, COV_{2}, COV_{3}, \ldots, COV_{n}]$
$KDE$: Kernel density estimation vector $KDE = [KDE_{1}, KDE_{2}, KDE_{3}, \ldots, KDE_{n}]$
$R$: Ranking index list $R = [(g_{r1}, v_{1}), (g_{r2}, v_{2}), (g_{r3}, v_{3}), \ldots, (g_{rn}, v_{n})]$, where $g_{ri}$ is a gene and $v_{i}$ is its rank

Initialization

Read the gene expression data, extracting names and numerical values for further processing.

$$R = \left[ \; \right],\; v = 1,\; G = \{ g_{1}, g_{2}, g_{3}, \ldots, g_{n} \}$$

Scaling

All metrics are scaled to ensure that no single metric dominates the ranking procedure.

Each of the vectors $S$, $GM$, $COV$, and $KDE$ is rescaled so that its values lie in the range $\left[ {0,1} \right]$.
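One common way to map a metric vector onto $[0,1]$ is min-max scaling, sketched below. The text does not give the exact formula used by the tool, so both the method and the function name `min_max_scale` are assumptions.

```python
import numpy as np


def min_max_scale(values) -> np.ndarray:
    """Rescale a metric vector linearly so its minimum is 0 and maximum is 1."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All genes share the same value; return zeros to avoid dividing by 0.
        return np.zeros_like(values)
    return (values - values.min()) / span
```

Because this rescaling is monotone, it does not change which gene holds the minimum or maximum of a given metric; its effect is to put the four metrics on a common scale when they are compared or plotted together.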

Iterative ranking metric calculation

For each iteration $i = 1$ to $n - 1$, all four metrics are calculated for the remaining genes using Eqs. (1)–(4).

Voting

Each metric elects the gene with the "least" value (except for Kernel Density, which looks for the maximum).

$${\text{Votes}} = \, \left[ {argmin\left( S \right),\;argmin\left( {GM} \right),\;argmin\left( {COV} \right),\;argmax\left( {KDE} \right)} \right]$$

Majority rule

The gene that appears the most frequently in these elections is removed from the data and ranked.

$$Majority = Modes\left( {Votes} \right)$$

Append the majority gene to $R$ with rank $v$, remove it from $G$, and increment $v$.

Tie Breaking

If multiple genes tie for the most votes, all are removed and given the same rank.

Recursion

Steps are repeated until only one gene remains.

Add remaining genes in $G$ to $R$ with rank $v$.
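The steps above can be sketched end to end as follows. This is an illustrative reading of the described procedure, not the released gQuant code: the function names, the min-max scaling, the Scott-style KDE bandwidth, and the `ddof=0` covariance are all assumptions. The input is assumed to be already preprocessed, with rows as samples, columns as genes, and positive, non-missing Ct values that vary within each gene.

```python
from collections import Counter
from typing import List, Tuple

import numpy as np
import pandas as pd


def _scale01(v: np.ndarray) -> np.ndarray:
    """Min-max scale a metric vector into [0, 1] (assumed scaling rule)."""
    span = v.max() - v.min()
    return np.zeros_like(v) if span == 0 else (v - v.min()) / span


def _kde_at_mean(x: np.ndarray) -> float:
    """Gaussian KDE of x evaluated at its mean, as in Eq. (4)."""
    h = x.std(ddof=1) * x.size ** (-1 / 5)  # assumed bandwidth rule
    u = (x.mean() - x) / h
    return float(np.exp(-0.5 * u ** 2).sum() / (x.size * h * np.sqrt(2 * np.pi)))


def rank_genes(ct: pd.DataFrame) -> List[Tuple[str, int]]:
    """Return (gene, rank) pairs; rank 1 is the most stable gene."""
    remaining = list(ct.columns)
    ranking: List[Tuple[str, int]] = []
    v = 1
    while len(remaining) > 1:
        x = ct[remaining].to_numpy(dtype=float)
        sd = _scale01(x.std(axis=0))                                   # Eq. (1)
        gm = _scale01(np.exp(np.log(x).mean(axis=0)))                  # Eq. (2)
        cov = _scale01(np.cov(x, rowvar=False, ddof=0).mean(axis=1))   # Eq. (3)
        kde = _scale01(np.array([_kde_at_mean(c) for c in x.T]))       # Eq. (4)
        # One vote per metric: minimum for SD, GM, COV; maximum for KDE.
        votes = [
            remaining[int(np.argmin(sd))],
            remaining[int(np.argmin(gm))],
            remaining[int(np.argmin(cov))],
            remaining[int(np.argmax(kde))],
        ]
        counts = Counter(votes)
        top = max(counts.values())
        # Genes tied for the most votes all receive the current rank.
        winners = [g for g in remaining if counts[g] == top]
        ranking.extend((g, v) for g in winners)
        remaining = [g for g in remaining if g not in winners]
        v += 1
    # Whatever is left after the loop takes the final rank.
    ranking.extend((g, v) for g in remaining)
    return ranking
```

The covariance mean is recomputed in every round because it depends on which genes are still in the pool, whereas the other three metrics are properties of each gene alone.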

Overview of validation dataset

To ascertain gQuant's precision, three distinct qRT-PCR datasets were utilized:

I. Dataset One: Derived from research on OvCar-3 and PC-3 cancer cell lines, this dataset encompasses transcriptomic profiling of 84 genes pertinent to cell regulation and five conventional housekeeping genes. The data, accessible via GSE57888, serve as a benchmark for mRNA-based qRT-PCR analysis and allow gQuant to be validated on non-normalized data^20,21.
II. Dataset Two: For a comparative efficiency assessment of gQuant, we chose dataset GSE239868, which has more detailed variables: the expression of 1,066 human miRNAs in n = 36 tracheal aspirate samples. This dataset enabled a comparative analysis of missing-value handling and allowed us to test the suitability of our method on miRNA data. Its unnormalized Ct values provide a strict testing environment^22,23.
III. Dataset Three: To construct a more robust analytical framework, we explored the potential of microRNA (miRNA)-based normalizers as an alternative to the well-established messenger RNA (mRNA)-based normalizers. Given the nascent state of miRNA-based normalization data, we focused specifically on uEVs. Sample collection and processing are explained in the methods section. A comprehensive literature review and Qiagen's Human Urine Exosomes Focus miRCURY LNA Panel list were used to select the miRNA set used in this study for validation. This led to the selection of miR-16-5p, miR-10b-5p, miR-30a-5p, and miR-30d-5p^24,25. Additionally, when evaluating the expression levels of the Let7c cluster genes in another ongoing study, we observed high stability for let-7c-5p across both diseased and control samples, which led to its inclusion. Our preliminary study therefore focused on five miRNAs: let-7c-5p, miR-16-5p, miR-10b-5p, miR-30a-5p, and miR-30d-5p. Using qRT-PCR, we assessed their expression in uEV samples from BCa (n = 9), PCa (n = 6), and control (n = 3) subjects. The stability of miRNA expression was evaluated with both the publicly available tool RefFinder and gQuant (Table 4). Among the evaluated miRNAs, let-7c-5p emerged as the most promising normalizer candidate. Subsequently, let-7c-5p expression in uEVs was quantified in an expanded sample set (n = 30) using qRT-PCR. Rigorous quality control measures, including primer efficiency and melt curve analysis, were implemented, and any data with Ct values over 38 were excluded. The stability of the miRNA expression was validated using both RefFinder and gQuant, supporting let-7c-5p's potential as a reliable normalizer. This dataset is given in Supplementary Table 3.

Result and validation: performance of gQuant in control gene identification

Baseline evaluation with established endogenous controls

We used an external qRT-PCR dataset from the Gene Expression Omnibus (GEO) to assess the performance of our model. A dataset (Dataset One) comprising the expression values of established endogenous controls was used for this assessment. Table 1 presents the results, showing the normalizers that our model ranked highest. Traditional mRNA normalizers consistently ranked in the top 5 in our analysis. This allowed us to verify the model's applicability in a range of experimental situations and to assess its performance on independent data.

Table 1 Ranking index of most stable gene obtained using gQuant on dataset One (GSE57888).

Assessing gQuant's stability with heterogeneous miRNA expression

Comparison with RefFinder: To assess gQuant's sensitivity to expression variability and to the presence of outliers and missing values, we conducted a sensitivity analysis using a subset of Dataset Two. The subset included 51 of the 1066 genes across all 36 samples. The dataset included C. elegans miR-39 as an exogenous spiked-in control, six snoRNA/snRNA genes, and a Positive PCR Control (PPC) for PCR performance benchmarking. The comparative analysis is detailed in Table 2, which shows the ranking of normalizer genes identified by the model. gQuant ranked the normalization genes higher than RefFinder did, indicating its robustness in handling data anomalies, including outliers and missing values. Additionally, the lower ranking of the spiked-in control cel-miR-39 underscores the potential impact of manual errors on data normalization. 'SNORD 72' exceeded the specified 8:1 threshold ratio, so its rank is undetermined. In the RefFinder column, the rank of the miRNA 'cel-miR-039' was undetermined.

Table 2 Comparison of ranking obtained by gQuant and RefFinder using dataset Two (GSE239868).

Selection and validation of let-7c-5p as the superior normalizer using dataset three

Dataset Three focused on the expression profiles of five miRNAs: let-7c-5p, miR-16-5p, miR-10b-5p, miR-30a-5p, and miR-30d-5p. Our goal was to determine the most reliable normalizer among these candidates. gQuant assigned the highest stability ranking to let-7c-5p, as presented in Table 3. This ranking was supported by the distribution of expression values shown in Fig. 2(A).

Table 3 Ranking index for most stable gene predicted using gQuant on dataset three.

The gene 'let-7c-5p' served as a reference; its tighter distribution and lower values show a more consistent expression level. Conversely, 'miR-30d-5p' and 'miR-10b-5p' show a broad range of expression levels with several outlying points. Overlaying the data points on the boxplot allows detailed analysis of the spread and variation within each gene. The concentration of data points is shown as a Kernel Density Estimate (KDE) in Fig. 2(B).

The KDE plot shows the distribution of Ct values of the different genes in our dataset. The X-axis represents the Ct values, which quantify gene expression level, and the Y-axis represents the estimated density. The plot depicts the distribution of mean Ct values for the genes labelled 'let-7c-5p', 'miR-16-5p', 'miR-30a-5p', 'miR-30d-5p', and 'miR-10b-5p'. The red curve shows the averaged Ct values for let-7c-5p with a KDE value of 0.33, displaying a prominent peak at approximately 22, the modal expression level for this gene. The other genes show different distribution patterns and peaks, indicating different expression behaviour.

We used a Gaussian kernel with a calculated bandwidth to reflect the distribution accurately. KDE highlights major aspects such as dispersion, multimodality of Ct values, and central tendency, giving insight into the expression behaviour of individual genes.

Further validation was pursued through an expanded study encompassing a diverse set of uEVs samples, including PCa (n = 22), BCa (n = 30), BPH (Benign Prostatic Hyperplasia) (n = 7), and Control (n = 16). An analysis of let-7c-5p expression across these biological cohorts revealed minimal variability, thereby substantiating its consistent expression irrespective of the biological context.

The boxplot in Fig. 2(C) illustrates the distribution of cycle threshold (Ct) values across four classes (BPH, BCa, Control, and PCa), where we compared Control with BCa and BPH with PCa. To test for differences, we formulated the null hypothesis that there are no significant differences between Control and BCa or between BPH and PCa. The t-tests gave p-values of 0.1542 for the Control versus BCa comparison and 0.1688 for the BPH versus PCa comparison, neither of which is significant. We therefore found no statistically significant difference in either comparison.
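A group comparison of this kind can be sketched with SciPy as below. The Ct values are synthetic and do not reproduce the reported p-values; the variant of the t-test (here the default two-sided Student's test with equal variances) and the 0.05 significance level are assumptions, as the text specifies neither.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic Ct values for two groups (illustrative only, not study data).
control = rng.normal(loc=22.0, scale=1.0, size=16)
bca = rng.normal(loc=22.0, scale=1.0, size=30)

# Two-sided independent-samples t-test; equal_var=True is SciPy's default.
result = ttest_ind(control, bca)

# Assumed decision rule: fail to reject the null hypothesis when p >= 0.05.
not_significant = result.pvalue >= 0.05
```

Passing `equal_var=False` would switch to Welch's t-test, which may be preferable when group sizes and variances differ.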

Compared with RefFinder and other existing tools available for gene ranking

Figure 2(D) shows a snapshot of the RefFinder result on Dataset Three, which ranks 'miR-30a-5p' as the most stable gene, followed by 'miR-10b-5p', 'let-7c-5p', 'miR-30d-5p', and 'miR-16-5p'. This differs from the gQuant ranking.

For Dataset Three, Table 4 presents the stable gene ranking according to gQuant and the other available tools. The compact distribution of let-7c-5p values (Fig. 2A), the high density of data points (Fig. 2B), and the least variation among biological data points (Fig. 2C) support the stability of let-7c-5p as predicted by gQuant. However, the ranking index given by RefFinder and other tools does not follow the same pattern.

Table 4 Gene ranking index of dataset three using gQuant, RefFinder, Delta Ct, BestKeeper, NormFinder and geNorm.

Discussion

In this work, we present a novel algorithm called 'gQuant' that aims to improve the identification of stable reference genes in gene expression studies using the widely used laboratory method of qRT-PCR. gQuant is a versatile algorithmic tool designed to overcome the drawbacks of current approaches. The tool adds a data preprocessing step to handle missing values.

Our approach leverages democratic voting classifiers, combining predictions from multiple statistical methods to yield more accurate rankings than any individual measure. The tool also provides graphical output depicting data fluctuation, distribution, and density.

Our model underwent rigorous validation using external qRT-PCR datasets from the Gene Expression Omnibus (GEO) and in-house generated datasets. Dataset One represents traditional qPCR data consisting of amplification values for target genes and conventional normalizer genes. When we ran these data through gQuant, all the conventional normalizer genes ranked in the top five, as expected. This validation supports the quality of gQuant and shows its applicability to an independent dataset. Dataset Two represents a larger dataset with a high number of target genes and samples. Because this dataset has null values in each gene column, its validation reflected gQuant's ability to handle null values compared with RefFinder. The conventional normalizers in this dataset include the spike-in controls, the SNORD family (-61, -68, -72, -95, -96A), U6 small nuclear RNA, one unknown gene denoted miRTC, and the PCR positive control. When a subset of Dataset Two was run through gQuant, these normalizers ranked between 1 and 12, whereas RefFinder's rank indices went as low as 27. In both cases, however, SNORD72 could not be ranked. gQuant eliminated SNORD72 because its null-value ratio exceeded 8:1, whereas the reason for RefFinder's undetermined result for SNORD72 is not known. Nevertheless, the results from RefFinder and gQuant differ, and the improved ranking indices of the conventional normalizer genes support the better efficiency of gQuant.

Dataset Three represents amplification data from uEV-miRNAs, which lack any classical normalizer genes or small nuclear RNAs. To generate this dataset, we isolated uEV-miRNAs and collected qRT-PCR data for five selected miRNAs. These five miRNAs were previously used as stabilizers in uEV-based studies. Amplification data for these five miRNAs were first collected using a small cohort. These data were then run through gQuant, which distinctly identified let-7c-5p as the normalizer, whereas RefFinder placed let-7c-5p at rank 3 on the same dataset. To check the precision of this result, we generated let-7c-5p expression data on a larger cohort with four biological groups. The high p-values between the biological groups support the conclusion that let-7c-5p expression shows low variance among them. In addition, the high density of points depicted by the KDE curve and the compact distribution of let-7c-5p compared with the other genes also support ranking let-7c-5p as a highly stable gene in this dataset.

These validations demonstrate our tool's robustness. Nevertheless, gQuant has limitations. For example, if the criteria for missing data are excessively stringent, it may miss possible reference genes. At present, gQuant has been validated only on qRT-PCR data; multiplex RT-PCR and RNA sequencing data are not yet supported. Future research should aim to apply it to more data types.

All things considered, gQuant provides a more accurate method for locating stable reference genes in gene expression research. This progress is essential for fields such as molecular biology research and disease diagnosis. With its notable advancement in gene expression analysis, gQuant has great promise for expanding our knowledge in the realms of molecular and medical research.

Data availability

References of the datasets used: Dataset One is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE57888 under the Gene Expression Omnibus (GEO) primary accession code 'GSE57888'. Dataset Two is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE239868 under the GEO primary accession code 'GSE239868'. Dataset Three is provided in Supplementary Table 3.

Funding

This work was supported by the Institute of Eminence, BHU (MPDF to GJ); Banaras Hindu University (RET Non-NET Fellowship to AP); the Institute of Eminence, BHU (seed grant to LK); and the Institute of Eminence, BHU (seed grant research fellowship to SK).

Author information

These authors contributed equally: Abhay Kumar Pathak and Sukhad Kural.

Authors and Affiliations

Centre for Genetic Disorders, Institute of Science, Banaras Hindu University, Varanasi, India
Shweta Singh, Parimal Das & Garima Jain
DST-CIMS, Institute of Science, Banaras Hindu University, Varanasi, India
Abhay Kumar Pathak & Manjari Gupta
Department of Urology, Institute of Medical Sciences, Banaras Hindu University, Varanasi, India
Sukhad Kural & Lalit Kumar
Department of Pathology, Institute of Medical Sciences, Banaras Hindu University, Varanasi, India
Mahima Yadav

Contributions

Research concept and design: GJ; coding of gQuant: AP; patient sample collection: SK, SS and LK; pathological analysis: MY; collection and/or assembly of data: SK, AP and SS; data analysis and interpretation: GJ and AP; writing the article: GJ, SK and AP; critical revision of the article: GJ, MG and PD. All authors have approved the submitted version and agree to be personally accountable for their own contributions and for ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Corresponding author

Correspondence to Garima Jain.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Pathak, A.K., Kural, S., Singh, S. et al. Development of a robust and generalizable algorithm "gQuant" for accurate normalizer gene selection in qRT-PCR analysis. Sci Rep 14, 18774 (2024). https://doi.org/10.1038/s41598-024-66770-y

Received: 01 February 2024
Accepted: 03 July 2024
Published: 13 August 2024
DOI: https://doi.org/10.1038/s41598-024-66770-y
