Table 1 Description of diverse configurations of 8 distinct omics modalities.

From: Multi-omics driven computational framework for cancer molecular subtype classification

Data modality

Summary

Scales

Technical information

CNV

Copy Number Variation (CNV) datasets in TCGA (The Cancer Genome Atlas) contain information about alterations in the number of copies of genomic regions, which can influence gene expression and drive cancer progression. CNVs include amplifications (gain of copies) and deletions (loss of copies) in DNA segments [73]

Gistic2 all thresholded

It contains thresholded CNV data produced by the GISTIC2 algorithm [74]. This dataset simplifies the raw copy number values into discrete levels: -2 for homozygous deletions, -1 for single-copy deletions, 0 for normal diploid regions, 1 for low-level amplifications, and 2 for high-level amplifications [75]. It focuses on significant genomic regions that are recurrently amplified or deleted across multiple cancer samples
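As an illustration of the discrete coding described above, the following minimal sketch tallies thresholded calls for one gene across samples. The level labels follow the description in this table; the function name and the example calls are hypothetical and not taken from any TCGA file.

```python
from collections import Counter

# Meaning of each thresholded level, as described in the table.
GISTIC2_LEVELS = {
    -2: "homozygous deletion",
    -1: "single-copy deletion",
    0: "normal diploid",
    1: "low-level amplification",
    2: "high-level amplification",
}


def summarize_gene(calls):
    """Count how many samples fall into each thresholded level for one gene."""
    counts = Counter(calls)
    return {GISTIC2_LEVELS[level]: counts.get(level, 0)
            for level in sorted(GISTIC2_LEVELS)}


# Hypothetical thresholded calls for one gene across six samples.
example_calls = [2, 0, -1, 2, 0, -2]
summary = summarize_gene(example_calls)
```

A tally like this gives a quick view of how often a gene is amplified or deleted in a cohort.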

Gistic2 all data by genes

It provides gene-specific CNV information derived from the GISTIC2 analysis. Instead of reporting alterations at the regional level, this dataset maps CNV values directly to individual genes. The data often includes raw or continuous copy number values, offering finer detail about how specific genes are affected by CNVs

RNAseq

This dataset contains raw or normalized gene expression values derived from RNA sequencing (RNAseq) data produced using the Illumina HiSeq platform. The values typically represent either normalized counts, such as log-transformed reads per kilobase of transcript per million (RPKM) or fragments per kilobase of transcript per million (FPKM), or other forms of normalization [76, 77]

HiSeqV2

It contains raw or normalized gene expression values derived from RNA sequencing (RNAseq) data produced using the Illumina HiSeq platform. The values typically represent either normalized counts, such as log-transformed reads per kilobase of transcript per million (RPKM) or fragments per kilobase of transcript per million (FPKM), or other forms of normalization [78]
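To make the normalized units concrete, here is a small sketch of the standard FPKM formula followed by a log transform. The pseudocount of 1 and all numbers in the example are assumptions for illustration; the table does not state which exact transform the dataset uses.

```python
import math


def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """FPKM = fragments * 1e9 / (gene length in bp * total mapped fragments)."""
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


def log2_transform(value, pseudocount=1.0):
    """log2(value + pseudocount); the pseudocount keeps zero expression finite."""
    return math.log2(value + pseudocount)


# Hypothetical gene: 500 fragments on a 2,000 bp transcript
# in a library of 20 million mapped fragments.
value = fpkm(500, 2000, 20_000_000)
log_value = log2_transform(value)
```

Dividing by both transcript length and library size is what makes values comparable between genes of different lengths and between libraries of different depths.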

HiSeqV2 percentile

It provides gene expression values transformed into percentiles. Each gene’s expression level in a sample is ranked relative to other genes in that sample, and the rank is converted into a percentile value (0–100). This normalization reduces inter-sample variability due to sequencing depth or other technical factors, allowing for comparisons of gene expression patterns across samples [71]
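The within-sample percentile transform described above can be sketched as follows. The tie-handling convention (the percentage of genes with expression less than or equal to the gene's value) is an assumption, and the gene names and values are hypothetical.

```python
from bisect import bisect_right


def percentile_ranks(expression):
    """Map each gene's expression in one sample to a within-sample percentile.

    A gene's percentile is the percentage of genes in the same sample whose
    expression is less than or equal to its own, so values fall in (0, 100].
    """
    ordered = sorted(expression.values())
    n = len(ordered)
    return {gene: 100.0 * bisect_right(ordered, value) / n
            for gene, value in expression.items()}


# Hypothetical expression values for four genes in one sample.
sample = {"GENE_A": 0.0, "GENE_B": 5.2, "GENE_C": 12.5, "GENE_D": 5.2}
ranks = percentile_ranks(sample)

# The same sample sequenced three times deeper yields the same percentiles,
# because only the ordering of genes within the sample matters.
deeper = {gene: 3 * value for gene, value in sample.items()}
ranks_deeper = percentile_ranks(deeper)
```

Because the transform depends only on rank order, a uniform scaling of all values in a sample leaves the percentiles unchanged.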

GAV2

It contains raw and normalized gene expression values derived from RNAseq data generated using the Illumina HiSeq platform. The GAV2 pipeline employed MapSplice for read alignment and RSEM for transcript quantification against the hg19 (GRCh37) reference genome. The resulting values represent either raw read counts or normalized measures such as fragments per kilobase of transcript per million mapped reads (FPKM) and upper-quartile normalized FPKM (FPKM-UQ), providing gene and isoform-level expression estimates

Exon

It contains expression measurements at the exon level, derived from RNA sequencing data produced using the Illumina HiSeq platform. Instead of summarizing reads across entire genes, the exon dataset quantifies read counts mapped to individual exons, enabling finer-resolution analysis of alternative splicing, isoform usage, and exon-specific expression changes

GAV2

The GAV2 pipeline utilizes MapSplice for accurate spliced alignment and RSEM for transcript quantification against the hg19 (GRCh37) reference genome. It outputs exon-level read counts and normalized expression metrics such as FPKM and upper-quartile normalized FPKM. This dataset is particularly valuable for characterizing differential exon usage, identifying alternative splicing events, and exploring exon-specific expression alterations in tumors

HiSeqV2

The HiSeqV2 Exon pipeline aligns reads to the reference genome using MapSplice and quantifies exon expression with RSEM. The resulting data are expressed as fragments per kilobase of exon per million mapped reads (FPKM). These exon-level profiles enable detailed analyses of isoform usage, exon inclusion/exclusion, and transcript diversity within cancer samples

miRNA

It contains expression values specifically for the mature strands of miRNAs. miRNAs are transcribed as precursor molecules that are processed into mature miRNA strands, which are the functional forms involved in gene regulation [79]. This dataset focuses exclusively on these mature miRNA strands, which directly interact with mRNA targets

  

SNP and INDEL

This dataset encompasses all somatic mutations identified in tumor samples, including single-nucleotide polymorphisms (SNPs) and small insertions or deletions (INDELs). These mutations represent changes in the DNA sequence that occur in tumor cells and are absent in matched normal cells

Mutation wustl gene

This is a gene-level summary of somatic mutation data (SNPs and INDELs), processed and annotated by the McDonnell Genome Institute at Washington University (WUSTL) [80]. Instead of reporting individual mutations, this dataset aggregates mutations for each gene, providing an overview of the mutation burden and functional impacts on specific genes

MC3 gene-level

The MC3 gene-level somatic mutation data [81] from TCGA provides a comprehensive overview of mutations (SNPs and INDELs) across various cancer types, aggregated at the gene level. It includes details on point mutations (SNPs), where a single nucleotide is replaced, and insertions or deletions (INDELs) that alter the DNA sequence. The dataset covers mutations in over 20,000 genes across hundreds of cancer samples
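The gene-level aggregation described for the two mutation datasets can be sketched as below. The record layout (sample, gene, variant type), the sample identifiers, and the binary "mutated" flag are assumptions for illustration, not the actual file format of either dataset.

```python
from collections import defaultdict


def aggregate_by_gene(records):
    """Collapse per-mutation records into {gene: {sample: mutation count}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for sample, gene, _variant_type in records:
        counts[gene][sample] += 1
    return {gene: dict(per_sample) for gene, per_sample in counts.items()}


def mutated_flag(gene_counts, gene, sample):
    """Return 1 if the sample carries at least one mutation in the gene, else 0."""
    return int(gene_counts.get(gene, {}).get(sample, 0) > 0)


# Hypothetical per-mutation records: (sample, gene, variant type).
records = [
    ("SAMPLE_1", "TP53", "SNP"),
    ("SAMPLE_1", "TP53", "INDEL"),
    ("SAMPLE_2", "KRAS", "SNP"),
]
gene_counts = aggregate_by_gene(records)
```

Keeping the counts and deriving the binary flag separately lets the same aggregation serve either a mutation-burden view or a mutated/not-mutated view.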

RPPA

This refers to the measurement of the abundance of proteins in cancer samples. The data is typically obtained through techniques such as mass spectrometry or antibody-based assays, providing insights into the levels of different proteins in tumor tissues compared to normal tissues

RPPA

The Reverse Phase Protein Array (RPPA) technique is used to measure protein levels in cancer samples. RPPA provides quantitative data on protein expression, focusing on specific proteins or signaling pathways that are often involved in cancer progression [82]. The RPPA data in TCGA allows for the analysis of protein expression across multiple cancer types, providing a high-throughput method for assessing large numbers of samples

RPPA-RBN

RPPA-RBN denotes RPPA data processed with replicate-based normalization (RBN). RBN uses replicate samples run across multiple batches to adjust for batch effects, so that protein levels from different sample sets can be compared and merged [83]

Meth.

It primarily captures methylation at cytosine residues in CpG dinucleotides, which can influence gene expression and chromatin structure. TCGA methylation data is generated using platforms like the Illumina HumanMethylation450 BeadChip (450K array) and, in some cases, the HumanMethylation27 BeadChip. Methylation is quantified as beta values, ranging from 0 (unmethylated) to 1 (fully methylated), allowing for analyses of hypermethylation and hypomethylation patterns

Human Methylation-450

It contains DNA methylation levels measured at over 450,000 CpG sites using the Illumina HumanMethylation450 BeadChip platform. Methylation levels are typically reported as beta values, ranging from 0 (unmethylated) to 1 (fully methylated) [84]. The dataset provides insights into epigenetic regulation by covering CpG sites in promoters, gene bodies, intergenic regions, and CpG islands, as well as regulatory regions like shores and shelves
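For reference, a beta value is commonly computed from the methylated and unmethylated probe intensities as sketched below. The stabilizing offset of 100 is the value usually quoted for Illumina arrays and is an assumption here, as are the example intensities.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Beta = M / (M + U + offset), with negative intensities clipped to zero."""
    m = max(methylated, 0.0)
    u = max(unmethylated, 0.0)
    return m / (m + u + offset)


# Hypothetical probe intensities for a mostly methylated CpG site
# and for a fully unmethylated one.
beta_high = beta_value(9000.0, 900.0)
beta_low = beta_value(0.0, 5000.0)
```

The offset keeps the ratio stable when both intensities are low, which also means the value never quite reaches 1.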

Human Methylation-27

The HumanMethylation27 dataset comes from a DNA methylation array platform that measures the methylation status of approximately 27,000 CpG sites across the human genome. Methylation patterns at these CpG sites are crucial in gene regulation [85]

Array

The Gene Expression Array datasets listed are part of TCGA’s collection, used to measure gene expression across various cancer types using different array platforms

AgilentG4502A07(1-3)

This dataset represents gene expression data obtained from the Agilent G4502A platform. The Agilent arrays use microarray technology to measure gene expression by hybridizing labeled RNA to probes on the array. The G4502A is one of Agilent’s arrays designed for high-throughput gene expression analysis, offering broad coverage of human genes. The different versions (e.g., 07-1, 07-2 and 07-3) may reflect updates to the array design or slight variations in the data, ensuring that the data from multiple sources or experiments is comparable [86]

HTHG-U133A

This dataset refers to gene expression data generated using the Affymetrix Human Genome U133A Array. The U133A array is widely used in gene expression profiling and measures the expression of around 22,000 genes. It is a standard platform for large-scale studies in cancer genomics and transcriptomics, providing reliable data for identifying gene expression patterns across various samples