Scientific Reports

Table 1 Description of diverse configurations of 8 distinct omics modalities.

From: Multi-omics driven computational framework for cancer molecular subtype classification

Data modality	Summary	Scales	Technical information
CNV	Copy Number Variation (CNV) datasets in TCGA (The Cancer Genome Atlas) contain information about alterations in the number of copies of genomic regions, which can influence gene expression and drive cancer progression. CNVs include amplifications (gain of copies) and deletions (loss of copies) in DNA segments⁷³	Gistic2 all thresholded	It contains thresholded CNV data produced by the GISTIC2 algorithm⁷⁴. This dataset simplifies the raw copy number values into discrete levels: −2 for homozygous deletions, −1 for single-copy deletions, 0 for normal diploid regions, 1 for low-level amplifications, and 2 for high-level amplifications⁷⁵. It aims to focus on significant genomic regions that are recurrently amplified or deleted across multiple cancer samples
CNV		Gistic2 all data by genes	It provides gene-specific CNV information derived from the GISTIC2 analysis. Instead of reporting alterations at the regional level, this dataset maps CNV values directly to individual genes. The data often include raw or continuous copy number values, offering finer detail about how specific genes are affected by CNVs
RNAseq	This dataset contains raw or normalized gene expression values derived from RNA sequencing (RNAseq) data produced using the Illumina HiSeq platform. The values typically represent either normalized counts, such as log-transformed reads per kilobase of transcript per million (RPKM) or fragments per kilobase of transcript per million (FPKM), or other forms of normalization⁷⁶,⁷⁷	HiSeqV2	It contains raw or normalized gene expression values derived from RNA sequencing (RNAseq) data produced using the Illumina HiSeq platform. The values typically represent either normalized counts, such as log-transformed reads per kilobase of transcript per million (RPKM) or fragments per kilobase of transcript per million (FPKM), or other forms of normalization⁷⁸
RNAseq		HiSeqV2 percentile	It provides gene expression values transformed into percentiles. Each gene’s expression level in a sample is ranked relative to other genes in that sample, and the rank is converted into a percentile value (0–100). This normalization removes inter-sample variability due to sequencing depth or other technical factors, allowing for comparisons of gene expression patterns across samples⁷¹
RNAseq		GAV2	It contains raw and normalized gene expression values derived from RNAseq data generated using the Illumina HiSeq platform. The GAV2 pipeline employed MapSplice for read alignment and RSEM for transcript quantification against the hg19 (GRCh37) reference genome. The resulting values represent either raw read counts or normalized measures such as fragments per kilobase of transcript per million mapped reads (FPKM) and upper-quartile normalized FPKM (FPKM-UQ), providing gene and isoform-level expression estimates
Exon	It contains expression measurements at the exon level, derived from RNA sequencing data produced using the Illumina HiSeq platform. Instead of summarizing reads across entire genes, the exon dataset quantifies read counts mapped to individual exons, enabling finer-resolution analysis of alternative splicing, isoform usage, and exon-specific expression changes	GAV2	The GAV2 pipeline utilizes MapSplice for accurate spliced alignment and RSEM for transcript quantification against the hg19 (GRCh37) reference genome. It outputs exon-level read counts and normalized expression metrics such as FPKM and upper-quartile normalized FPKM. This dataset is particularly valuable for characterizing differential exon usage, identifying alternative splicing events, and exploring exon-specific expression alterations in tumors
Exon		HiSeqV2	The HiSeqV2 Exon pipeline aligns reads to the reference genome using MapSplice and quantifies exon expression with RSEM. The resulting data are expressed as fragments per kilobase of exon per million mapped reads (FPKM). These exon-level profiles enable detailed analyses of isoform usage, exon inclusion/exclusion, and transcript diversity within cancer samples
miRNA	It contains expression values specifically for the mature strands of miRNAs. miRNAs are transcribed as precursor molecules that are processed into mature miRNA strands, which are the functional forms involved in gene regulation⁷⁹. This dataset focuses exclusively on these mature miRNA strands, which directly interact with mRNA targets
SNP and INDEL	This dataset encompasses all somatic mutations identified in tumor samples, including single-nucleotide polymorphisms (SNPs) and small insertions or deletions (INDELs). These mutations represent changes in the DNA sequence that occur in tumor cells and are absent in matched normal cells	Mutation wustl gene	This is a gene-level summary of somatic mutation data (SNPs and INDELs), processed and annotated by the McDonnell Genome Institute at Washington University (WUSTL)⁸⁰. Instead of reporting individual mutations, this dataset aggregates mutations for each gene, providing an overview of the mutation burden and functional impacts on specific genes
SNP and INDEL		MC3 gene-level	The MC3 gene-level somatic mutation data⁸¹ from TCGA provides a comprehensive overview of mutations (SNPs and INDELs) across various cancer types, aggregated at the gene level. It includes details on point mutations (SNPs), where a single nucleotide is replaced, and insertions or deletions (INDELs) that alter the DNA sequence. The dataset covers mutations in over 20,000 genes across hundreds of cancer samples
RPPA	This refers to the measurement of the abundance of proteins in cancer samples. The data is typically obtained through techniques such as mass spectrometry or antibody-based assays, providing insights into the levels of different proteins in tumor tissues compared to normal tissues	RPPA	The Reverse Phase Protein Array (RPPA) technique is used to measure protein levels in cancer samples. RPPA provides quantitative data on protein expression, focusing on specific proteins or signaling pathways that are often involved in cancer progression⁸². The RPPA data in TCGA allows for the analysis of protein expression across multiple cancer types, providing a high-throughput method for assessing large numbers of samples
RPPA		RPPA-RBN	The RPPA-RBN (Reverse Phase Protein Array – RNA-Based Normalization) involves normalizing the RPPA data using RNA expression data to account for variations in sample quality and other factors. This normalization helps to refine protein expression analysis, ensuring that the observed protein levels are not biased by underlying RNA expression differences⁸³
Meth.	It primarily captures methylation at cytosine residues in CpG dinucleotides, which can influence gene expression and chromatin structure. TCGA methylation data is generated using platforms like the Illumina HumanMethylation450 BeadChip (450K array) and, in some cases, the HumanMethylation27 BeadChip. Methylation is quantified as beta values, ranging from 0 (unmethylated) to 1 (fully methylated), allowing for analyses of hypermethylation and hypomethylation patterns	Human Methylation-450	It contains DNA methylation levels measured at over 450,000 CpG sites using the Illumina HumanMethylation450 BeadChip platform. Methylation levels are typically reported as beta values, ranging from 0 (unmethylated) to 1 (fully methylated)⁸⁴. The dataset provides insights into epigenetic regulation by covering CpG sites in promoters, gene bodies, intergenic regions, and CpG islands, as well as regulatory regions like shores and shelves
Meth.		Human Methylation-27	The HumanMethylation27 dataset refers to a DNA methylation array platform that measures the methylation status of approximately 27,000 CpG sites across the human genome. The array is designed to assess the methylation patterns at these CpG sites, which are crucial in gene regulation⁸⁵
Array	The Gene Expression Array datasets listed are part of TCGA’s collection, used to measure gene expression across various cancer types using different array platforms	AgilentG4502A07(1-3)	This dataset represents gene expression data obtained from the Agilent G4502A platform. The Agilent arrays use microarray technology to measure gene expression by hybridizing labeled RNA to probes on the array. The G4502A is one of Agilent’s arrays designed for high-throughput gene expression analysis, offering broad coverage of human genes. The different versions (e.g., 07-1, 07-2 and 07-3) may reflect updates to the array design or slight variations in the data, ensuring that the data from multiple sources or experiments is comparable⁸⁶
Array		HTHG-U133A	This dataset refers to gene expression data generated using the Affymetrix Human Genome U133A Array. The U133A array is widely used in gene expression profiling and measures the expression of around 22,000 genes. It is a standard platform for large-scale studies in cancer genomics and transcriptomics, providing reliable data for identifying gene expression patterns across various samples
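
The discrete levels listed for the "Gistic2 all thresholded" scale can be illustrated with a short Python sketch. This is not code from the article: the function names and the example calls are hypothetical, and only the five level definitions are taken from the table.

```python
# Discrete levels as described in the CNV row of Table 1.
GISTIC2_LEVELS = {
    -2: "homozygous deletion",
    -1: "single-copy deletion",
    0: "normal diploid",
    1: "low-level amplification",
    2: "high-level amplification",
}


def decode_thresholded(values):
    """Map thresholded GISTIC2 calls to their descriptive labels."""
    return [GISTIC2_LEVELS[int(v)] for v in values]


def alteration_fraction(values):
    """Return the fraction of calls in one sample that are non-zero."""
    values = list(values)
    if not values:
        return 0.0
    altered = sum(1 for v in values if int(v) != 0)
    return altered / len(values)


# Hypothetical thresholded calls for five genes in a single sample.
sample_calls = [-2, 0, 1, 0, 2]
labels = decode_thresholded(sample_calls)
fraction = alteration_fraction(sample_calls)
```

Keeping the level definitions in a single lookup table makes it easy to check that a loaded matrix contains only the five expected values, since any other value raises a `KeyError`.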

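The within-sample percentile transformation described for the "HiSeqV2 percentile" scale can be sketched as follows. The table states only that genes are ranked within a sample and that the rank is converted to a 0–100 percentile; the handling of ties (average rank), the exact scaling, the function name, and the example values are assumptions made for illustration.

```python
def to_percentiles(expression):
    """Rank genes within one sample and rescale the ranks to 0-100.

    `expression` maps gene name to expression value for a single sample.
    Tied values share the average of their ranks (an assumption; the
    source does not say how ties are handled).
    """
    genes = sorted(expression, key=expression.get)
    n = len(genes)
    if n == 0:
        return {}
    if n == 1:
        return {genes[0]: 100.0}

    percentiles = {}
    i = 0
    while i < n:
        # Extend j to the end of the group of genes tied with genes[i].
        j = i
        while j + 1 < n and expression[genes[j + 1]] == expression[genes[i]]:
            j += 1
        avg_rank = (i + j) / 2  # zero-based average rank of the tied group
        for k in range(i, j + 1):
            percentiles[genes[k]] = 100.0 * avg_rank / (n - 1)
        i = j + 1
    return percentiles


# Hypothetical expression values for five genes in one sample.
sample = {"TP53": 5.0, "MYC": 9.0, "GAPDH": 12.0, "EGFR": 5.0, "BRCA1": 0.0}
pct = to_percentiles(sample)
```

Because each sample is ranked independently, the output depends only on the ordering of genes within that sample and not on the absolute scale of its values.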