Background & Summary

The T cell receptor (TCR) is a protein expressed on the surface of T cells that is critical in the detection of diverse antigens and in the initiation of immune responses. TCRs are composed of alpha (α) and beta (β), or gamma (γ) and delta (δ) chains, with most immune responses mediated by α and β chains1. The diversity of TCRs stems from small sequence variations that can significantly alter antigen recognition and immune behavior. In humans, this diversity culminates in an estimated repertoire of over \({10}^{15}\) distinct TCR sequences2. The frequency of specific TCR sequences reflects immune dynamics, including clonal expansion during activation, which is particularly relevant in the context of cancer3.

The immense diversity of the TCR repertoire is essential for the immune system to recognize a wide array of antigens and mount effective defenses against different threats, including pathogens and tumors. This diversity arises from V(D)J recombination during T-cell development in the thymus. Additional diversity is generated through random nucleotide addition and through the pairing of α and β chains. These processes ensure that the adaptive immune system can generate a variety of TCRs with high specificity and affinity for antigens, enabling robust immune surveillance and response4.

The adaptive immune system also functions as a molecular archive, encoding an individual’s immunological history. Through antigen-driven clonal expansion and diversification, the TCR repertoire reflects past infections, vaccinations, autoimmune events, and cancer immunoediting. Advances in high-throughput sequencing technologies, specifically TCR sequencing (TCR-seq), enable profiling of TCR repertoires and the hypervariable complementarity-determining region 3 (CDR3). This region is a critical determinant of antigen specificity5.

Recent advancements in repertoire sequencing (Rep-seq)6 have facilitated systematic interrogation of TCR landscapes, generating datasets that reveal baseline immune states and disease-associated changes. Computational pipelines are now essential for processing these complex data, offering mechanistic insights into immune responses7. In the field of oncology, TCR repertoires are increasingly recognized as biomarkers for diagnosis, therapeutic response prediction, and risk assessment. Importantly, these analyses often rely on peripheral blood mononuclear cells (PBMCs) as a minimally invasive source for profiling systemic immune dynamics8.

Colorectal cancer (CRC) remains a leading cause of cancer-related morbidity and mortality worldwide, with roughly 2 million new cases and over 900,000 deaths reported annually. Ranking as the third most commonly diagnosed cancer, CRC presents a significant challenge to global health systems9. Although it has traditionally affected older populations, an alarming increase in early-onset cases underscores the need for enhanced risk assessment and early detection strategies10. Early diagnosis is critical for improving survival rates and is typically achieved through colonoscopy11. Recently, however, insights into the immune system’s role—particularly the function of T cells in identifying and eliminating tumor cells—have opened new avenues for both diagnosis and treatment8.

The TNM staging system is the established framework for classifying cancer progression, categorizing disease from stages I to IV based on tumor depth of invasion, lymph node involvement, and distant metastasis. For CRC patients, stage III is further divided into high-risk and low-risk groups (based on T and N status) to guide adjuvant treatment decisions12.

While the TNM system provides essential anatomical insights, it does not capture the dynamic interplay of the immune response in cancer progression. Profiling the TCR repertoire offers a more personalized view by reflecting the adaptive immune response to CRC, potentially leading to improved risk stratification13. However, large-scale datasets capturing the pre-treatment TCR repertoire in CRC patients are scarce, leaving a critical gap in our understanding. Existing studies focus on post-treatment immune profiles or rely on small patient cohorts, limiting their utility in developing robust biomarkers or predictive models. By addressing this gap, TCR repertoire analysis has the potential to complement traditional staging systems by identifying novel biomarkers associated with treatment outcomes and informing more tailored therapeutic strategies.

In this study, as presented in Fig. 1, we use a large-scale dataset detailing the pre-treatment TCR repertoire of CRC patients. This comprehensive resource is poised to facilitate the identification of immune biomarkers, enhance risk assessment, and ultimately guide precision medicine efforts in the management of CRC.

Fig. 1

T-cell receptor sequencing pipeline. A total of 216 peripheral blood samples were drawn, total RNA was isolated, and TCRα/β genes were amplified by PCR. After library preparation and Illumina sequencing, FASTQ files underwent quality control and processing to identify clonotypes. Metadata, including TNM labels, were then added to yield the final dataset of 200 samples and ~1.49 million TCR sequences for subsequent analyses.

Methods

Collecting the samples and sequencing

We collected 216 blood samples from newly diagnosed CRC patients prior to any surgical or systemic treatment via the Sheba Hospital biobank. One library failed initial sequencing QC, leaving 215 FASTQ datasets. Of these, 15 libraries exhibited lower QC metrics and lacked TNM staging information; they were excluded from stage-stratified analyses but remain publicly accessible in the SRA, flagged Exclude_from_analysis in Supplementary Table 1. The final analysed cohort comprised 200 libraries with both high-quality sequencing and TNM data. Five milliliters of blood were collected from each patient into EDTA-coated tubes for PBMC purification. The PBMC fraction was cryopreserved in 10% DMSO/90% FCS and stored in liquid nitrogen until shipment to Bar-Ilan University, with each vial containing approximately 1–1.5 × 10^6 cells/mL.

Ethics approval

Blood samples were collected under Institutional Review Board (IRB) approval from The Sheba Medical Center (7541-20-SMC). Written informed consent was obtained from all participants, which included permission for data sharing and future use of the samples for research purposes. The study protocol, including data collection, analysis, and potential sharing for future research, was explicitly approved by the IRB. The dataset was de-identified: samples allocated for this study were preserved with unique codes, and each participant was assigned a unique serial number. Serial numbers were stored separately from the personal details of the participants and will not be shared. Any results of the study will be published jointly, in accordance with the common standards of academic collaboration between medical centers and the mutual understanding between the parties.

RNA purification from blood samples

Frozen PBMC samples were thawed, and their total RNA content was extracted using the RNeasy mini kit. The extracted RNA was assessed for concentration and purity to ensure standardized quality.

Library preparation

A fixed input of 200 ng of total RNA from each sample was processed with the SMARTer Human TCR α/β Profiling Kit V2 (Takara Bio). This kit enables the analysis of TCR repertoires from bulk RNA samples and generates Illumina-compatible sequencing libraries.

The TCR sequencing libraries were then size-selected and purified using AMPure XP beads. Library DNA concentrations were measured with Qubit, and fragment sizes were assessed with TapeStation. These measurements enabled the pooling of 24 libraries per flow cell with equal representation of each library in the final pool, resulting in a total of nine pools.
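
Equal representation in a pool is commonly achieved by converting each library's mass concentration and mean fragment size into molarity and then pipetting equal molar amounts. The following is a minimal sketch of that arithmetic, not the authors' protocol. It assumes the usual approximation of 660 g/mol per base pair of double-stranded DNA; the function names and the target amount per library are hypothetical.

```python
def library_molarity_nm(conc_ng_per_ul: float, mean_size_bp: float) -> float:
    """Convert a dsDNA concentration (ng/uL) and mean fragment size (bp) to nM."""
    # 660 g/mol per base pair is an approximation for double-stranded DNA.
    return conc_ng_per_ul * 1e6 / (660.0 * mean_size_bp)


def equimolar_volumes_ul(libraries: dict, fmol_per_library: float) -> dict:
    """Volume (uL) of each library that contributes the same molar amount.

    `libraries` maps a library name to (concentration in ng/uL, mean size in bp).
    """
    volumes = {}
    for name, (conc, size) in libraries.items():
        molarity = library_molarity_nm(conc, size)  # 1 nM equals 1 fmol/uL
        volumes[name] = fmol_per_library / molarity
    return volumes
```

A library at twice the molarity of another contributes half the volume, so every library enters the pool with the same number of molecules.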

Sequencing and reads processing

Sequencing was performed on an Illumina sequencer using the 300-cycle NextSeq 500/550 Mid Output Kit v2.5 with paired-end, 2 × 150 base pair reads. This read format allows the CDR3 region of each TCR α/β transcript to be captured. Raw reads were converted to α and β CDR3 sequences using MiXCR14, which a benchmarking study ranked among the most accurate and bias-resistant TCR callers7.

The sequencing output files were analyzed using MiXCR with the following command:

mixcr analyze amplicon -s hsa --starting-material rna --5-end no-v-primers --3-end c-primers --adapters adapters-present --receptor-type tcr
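
The command above lists only the options. To run it on many paired-end libraries, it can be assembled programmatically, as in the sketch below. The trailing positional arguments (two FASTQ files followed by an output prefix) are an assumption based on the MiXCR 3.x `analyze` convention and are not given in the text; the file names and the function name are hypothetical.

```python
import shlex


def mixcr_amplicon_command(r1: str, r2: str, out_prefix: str) -> list:
    """Build the argument list for one paired-end library."""
    return [
        "mixcr", "analyze", "amplicon",
        "-s", "hsa",
        "--starting-material", "rna",
        "--5-end", "no-v-primers",
        "--3-end", "c-primers",
        "--adapters", "adapters-present",
        "--receptor-type", "tcr",
        # Assumed positional arguments: R1 file, R2 file, output prefix.
        r1, r2, out_prefix,
    ]


cmd = mixcr_amplicon_command("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample")
print(shlex.join(cmd))  # shell-quoted form, for logging or inspection
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell-quoting problems with unusual file names.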

For the α chain we obtained \(13.3 \times {10}^{6}\) reads, with an average of 61,975 clones and 25,887 clonotypes per sample. For the β chain we obtained \(35.8 \times {10}^{6}\) reads, with an average of 186,519 clones and 73,305 clonotypes per sample. The quantified α and β clonotype data were subsequently combined for further analysis. To avoid bias from differences in library size, we removed libraries with fewer than 10,000 clones and randomly downsampled the remaining libraries to match the smallest retained clone count, yielding 200 samples for analysis.
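
The depth-normalisation step can be illustrated with a short sketch. It assumes each library is a mapping from clonotype to clone count and that downsampling draws clones without replacement; the text does not specify either detail, and the function names are hypothetical.

```python
import random
from collections import Counter


def downsample_clonotypes(counts: dict, depth: int, rng: random.Random) -> dict:
    """Draw `depth` clones without replacement from a clonotype count table."""
    # Expand the table so that every clone is one element of the pool.
    pool = [clonotype for clonotype, n in counts.items() for _ in range(n)]
    return dict(Counter(rng.sample(pool, depth)))


def normalise_depth(samples: dict, min_clones: int = 10_000, seed: int = 0) -> dict:
    """Drop shallow libraries, then downsample the rest to the smallest retained depth."""
    kept = {sid: c for sid, c in samples.items() if sum(c.values()) >= min_clones}
    if not kept:
        return {}
    depth = min(sum(c.values()) for c in kept.values())
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return {sid: downsample_clonotypes(c, depth, rng) for sid, c in kept.items()}
```

After this step every retained library has the same total clone count, so richness and diversity estimates are not driven by sequencing depth alone.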

Data Records

The raw FASTQ files for 215 PBMC libraries from colorectal cancer patients have been deposited in the NCBI Sequence Read Archive under study accession SRP565751 (ref. 15). Individual run accessions (SRR IDs), read-pair counts, BioSample accessions, and TNM stage metadata are provided in Supplementary Table 1, where 11 libraries that did not pass initial QC are flagged ‘Exclude_from_analysis = Yes’. These runs remain publicly available but were omitted from all downstream and stage-stratified analyses.
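
A reader reproducing the analysed cohort would first drop the flagged runs. The sketch below shows one way to do this on a CSV export of the metadata table. The `Exclude_from_analysis` column and its `Yes` value come from the text; the `Run` and `Stage` column names and the run identifiers are placeholders, not values from Supplementary Table 1.

```python
import csv
import io


def analysable_runs(table_csv: str) -> list:
    """Return run accessions whose exclusion flag is not 'Yes'."""
    reader = csv.DictReader(io.StringIO(table_csv))
    return [
        row["Run"]  # assumed column name for the SRR accession
        for row in reader
        if row["Exclude_from_analysis"].strip().lower() != "yes"
    ]


# Placeholder rows for illustration only.
example = """Run,Stage,Exclude_from_analysis
SRR_EXAMPLE_1,II,No
SRR_EXAMPLE_2,,Yes
SRR_EXAMPLE_3,III,No
"""
print(analysable_runs(example))
```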

Technical Validation

TCR-seq validation

To assess the raw sequencing data in our FASTQ files, we used FastQC16, a widely used tool for evaluating high-throughput sequencing quality. Since our samples were processed in nine separate sequencing pools, we then employed MultiQC17, a tool that consolidates the outputs into a single comprehensive report. The combined MultiQC report showed that each pool met or exceeded the quality thresholds necessary for downstream analyses. As shown in Fig. 2a,b, per-base and per-sequence quality scores are consistently high, with only a slight decline toward the read ends, as is typical of Illumina data. Phred scores remain in the high-confidence range throughout most read positions, reinforcing the overall reliability of the dataset. In addition, the proportion of ambiguous base calls (‘N’s) was extremely low (Fig. 2c), indicating high sequencing accuracy. We also examined the sequence length distribution (Fig. 2d), confirming that the reads were predominantly at the expected 150 base pairs. This result underscores the consistency of our library preparation and indicates minimal adapter contamination or read truncation. Taken together, these results demonstrate that our raw sequencing data are robust, accurate, and suitable for subsequent clonotype characterization and other in-depth analyses.
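
For readers less familiar with the quantities summarised in Fig. 2a and 2c, the sketch below shows how a Phred score relates to an error probability and how per-position mean quality and N content can be computed from reads. It illustrates the statistics only and is not FastQC's implementation. It assumes Phred+33 quality encoding, which is standard for current Illumina output; the function names are hypothetical.

```python
def phred_to_error_probability(q: float) -> float:
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


def per_position_summary(records):
    """Mean quality and N fraction at each read position.

    `records` is an iterable of (sequence, quality_string) pairs (Phred+33).
    """
    qual_sum, n_count, depth = [], [], []
    for seq, qual in records:
        for i, (base, qchar) in enumerate(zip(seq, qual)):
            if i == len(depth):  # first read reaching this position
                qual_sum.append(0)
                n_count.append(0)
                depth.append(0)
            qual_sum[i] += ord(qchar) - 33
            n_count[i] += base == "N"
            depth[i] += 1
    mean_quality = [s / d for s, d in zip(qual_sum, depth)]
    n_fraction = [n / d for n, d in zip(n_count, depth)]
    return mean_quality, n_fraction
```

Under this definition a score of 30 corresponds to roughly one expected error per 1,000 base calls.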

Fig. 2

MultiQC report of aggregated FastQC results across all 216 CRC samples. (a) Per‐base mean quality scores: the average quality at each base position for all reads. (b) Per‐sequence quality distribution: how overall read quality is distributed across the dataset. (c) Per‐base N content: the fraction of ‘N’ calls at each position. (d) Read length distribution: the range and frequency of read lengths present in the libraries.

Nonetheless, because the repertoire was profiled from peripheral-blood PBMCs, tumour-resident T-cell clones that do not recirculate may be under-represented; this potential compartment bias should be kept in mind when interpreting stage-specific diversity patterns.

Clonotyping validation

After validating the raw data, the next step involved clonotype validation to ensure the dataset was suitable and reliable for downstream analyses. We used Immunarch18, an R package, to perform subsampling and measure basic statistics of the quantified clonotypes. After removing samples with low clone counts or any internal mismatches identified during quality control, we performed stage-stratified analyses by grouping libraries by clinical stage (I–IV) based on T/N/M labels; 126 libraries had complete staging information. The remaining 74 high-quality libraries lacked one or more T/N/M annotations and were assigned to an “Unknown” category but were still included in the overall clonotype QC (see Supplementary Table 1 for full SRR mapping). We display these samples in Figs. 3–5 to present the complete repertoire landscape; however, they were excluded from formal stage-specific statistical comparisons (Stages I–IV) to prevent bias arising from missing clinical information. The stage counts are shown in Fig. 3, with stages defined as follows: Stage I: T1–2, N0; Stage II: T3–4, N0; Stage III: any T, N1–2; Stage IV: any T, any N, M = 1. As illustrated in Fig. 4a, the number of unique clonotypes is similar across all stages. Likewise, Fig. 4b shows that the proportion of top clonotypes remains fairly consistent among these groups. We also evaluated diversity using several metrics (true diversity index19, Inverse Simpson index20, and Gini–Simpson index21), presented in Fig. 5a–c, respectively. Although all three measures display similar overall trends, their distinct emphases (richness vs. abundance vs. evenness) provide a more nuanced picture of TCR repertoire diversity across clinical stages.
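
The staging rule stated above can be written as a small lookup. This is a sketch under stated assumptions: T, N and M are encoded as integers, a missing annotation is `None`, and sub-categories such as T4a or N1b have already been collapsed to their integer category. Stages I–III are taken to imply M = 0. The function name is hypothetical.

```python
from typing import Optional


def clinical_stage(t: Optional[int], n: Optional[int], m: Optional[int]) -> str:
    """Map T, N and M categories to stage I-IV, or 'Unknown' if any is missing."""
    if t is None or n is None or m is None:
        return "Unknown"
    if m == 1:
        return "IV"   # any T, any N, M = 1
    if n in (1, 2):
        return "III"  # any T, N1-2
    if n == 0 and t in (3, 4):
        return "II"   # T3-4, N0
    if n == 0 and t in (1, 2):
        return "I"    # T1-2, N0
    return "Unknown"  # combination not covered by the stated rule
```

Checking metastasis first and nodal status second mirrors the precedence of the rule: a higher-priority criterion decides the stage regardless of tumor depth.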

Fig. 3

Distribution of 200 samples by clinical stage (I–IV), based on T/N/M labels (data in Supplementary Table 1). Bars show the number of samples in each stage; “Unknown” (n = 74) indicates samples with incomplete or missing T/N/M data.

Fig. 4

Clonotype distributions stratified by clinical stage. Individual points represent samples. (a) Total number of TCR clonotypes per sample in each stage group (I–IV, Unknown). (b) Proportion of dominant clonotypes by abundance across stage groups.

Fig. 5

Diversity Measures in each clinical stage (I–IV) and among “Unknown” samples (represented by individual dots). (a) “True diversity” index, an estimator of effective number of clonotypes in each sample (simple clone count). (b) “Inverse Simpson” index, accounting for richness while adjusting for clonal dominance (higher values indicate more balanced repertoires). (c) “Gini‐Simpson” index, expressing diversity as the probability that two randomly drawn reads belong to different clonotypes.
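
The three indices in Fig. 5 can all be computed from the clonotype frequencies of a sample. The sketch below uses the textbook definitions and is not Immunarch's implementation. The order `q` of the true diversity (Hill number) is an assumption, since the text does not state which order was used; the function name is hypothetical.

```python
import math


def diversity_indices(counts, q: float = 1.0) -> dict:
    """Textbook diversity indices from a list of clonotype counts."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]  # clonotype frequencies
    simpson = sum(x * x for x in p)  # chance that two random clones match
    if q == 1.0:
        # Limit of the Hill number as q approaches 1: exp(Shannon entropy).
        hill = math.exp(-sum(x * math.log(x) for x in p))
    else:
        hill = sum(x ** q for x in p) ** (1.0 / (1.0 - q))
    return {
        "true_diversity": hill,
        "inverse_simpson": 1.0 / simpson,
        "gini_simpson": 1.0 - simpson,
    }
```

For a perfectly even repertoire of four clonotypes, true diversity and inverse Simpson both equal the number of clonotypes, while Gini–Simpson approaches 1 as the repertoire becomes richer and more even.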

Usage Notes

The raw T-cell receptor (TCR) repertoire data in this dataset can be repurposed for a variety of immunogenomic analyses. First, researchers can reconstruct clonotypes from the original FASTQ files under different parameter settings, extracting specific fields of interest (e.g., VDJ gene usage or CDR3 sequences) to suit diverse investigative aims. In this context, tools such as MiXCR (as used here), or alternatives such as TRUST4 (ref. 22) and IgBLAST (ref. 23), can be readily applied to produce clonotype-level annotations. Additionally, because the underlying sequences can be interpreted as short-read RNA data, investigators could employ standard RNA-seq pipelines (alignment, assembly, and expression quantification) to explore broader transcriptomic features.

Once clonotypes are identified, the resulting TCR sequences (including both α and β chains) may be paired with the associated TNM labels for each sample, thus enabling integration with clinical staging data. Multiple downstream software solutions such as Immunarch18, ImmunoSeq Analyzer24, Immcantation Framework25, and VDJtools26 can be used for downstream analysis and visualization of the repertoire. A recent review27 provides a comparative assessment of these tools, guiding researchers in choosing the most appropriate platform for their objectives.
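
As a minimal illustration of pairing clonotypes with stage labels, the sketch below groups samples by stage and reports the median number of distinct clonotypes per sample. The data layout (a mapping from sample ID to a list of CDR3 sequences, and a mapping from sample ID to stage) and the function name are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import median


def unique_clonotypes_by_stage(repertoires: dict, stages: dict) -> dict:
    """Median number of distinct clonotypes per sample, grouped by stage label."""
    grouped = defaultdict(list)
    for sample_id, clonotypes in repertoires.items():
        # Samples without a stage label fall into the "Unknown" group.
        stage = stages.get(sample_id, "Unknown")
        grouped[stage].append(len(set(clonotypes)))
    return {stage: median(values) for stage, values in grouped.items()}
```

The same grouping pattern extends to any per-sample statistic, such as a diversity index or V-gene usage frequency, before stage-wise comparison.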

Notably, this dataset contains nearly 1.5 million T-cell-receptor sequences from colorectal-cancer patients, providing a comprehensive snapshot of TCR diversity and enough depth to benchmark machine learning models that classify samples by TNM stage or other clinical labels. Subsets of clonotypes can also be examined for their functional roles in CRC staging, and the data may be integrated with external genomic, transcriptomic or clinical resources to enable broader cross-cohort comparisons.