Background & Summary

Kidney cancer is one of the most common malignancies of the urinary system, ranking behind only prostate cancer and bladder cancer, with 431,288 new cases1 worldwide in 2020. Clear cell renal cell carcinoma (ccRCC) is the most common histological subtype, accounting for about 75% of cases2. The genetic characteristics of ccRCC are associated with potentially high levels of tumour heterogeneity and genetic susceptibility3,4. Previous TCGA studies5 have identified a wide range of genetic alterations in ccRCC, including VHL, PBRM1 and SETD2 mutations, revealing the comprehensive genomic features of ccRCC. In addition, epigenetic regulation in tumour cells has emerged as another mechanism driving tumour occurrence and progression6. The organization of accessible chromatin across the genome plays an essential role in establishing and maintaining cellular identity, reflecting a network of epigenetic regulation through which enhancers, promoters, insulators and chromatin-binding factors cooperatively regulate gene expression7. In particular, alterations in chromatin accessibility have been implicated in driving cancer initiation, progression and metastasis8.

Many DNA sequencing-based methods have been developed for mapping nucleosomes and chromatin accessibility, such as ATAC-seq9, DNase-seq10 and MPE-seq11. In recent years, advances in single-cell sequencing technology have made it possible to perform single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq)12, particularly in high-throughput form13. Although the chromatin accessibility characteristics of ccRCC have been reported in a previous study14, the large dataset provided by that study (a total of 61,693 nuclei and 190,916 unique peaks) is difficult to reuse and reanalyse because doing so requires high-performance computing resources. Another study was limited to the chromatin accessibility features of immune cells and did not focus on tumour cells15.

To address these limitations, we provide a high-quality chromatin accessibility dataset for ccRCC at single-cell resolution. By performing high-throughput scATAC-seq on three human ccRCC samples (RCC30, RCC61 and RCC76), we obtained a total of 18,703 high-quality nuclei and 104,818 unique peaks covering coding regions, non-coding regions, promoters and enhancers (Fig. 1). After quality control (QC) and downstream analysis, the data may reveal comprehensive epigenetic characteristics of ccRCC across tumour cells, endothelial cells (EC), cancer-associated fibroblasts (CAF) and immune cells. In addition, we demonstrate a complete analysis workflow for scATAC-seq data, which makes its application more feasible and convenient. Taken together, these scATAC-seq data can provide valuable information and new strategies for the study and treatment of ccRCC.

Fig. 1 A schematic overview of the study design.

Methods

Human ccRCC samples

The three participants (RCC30, RCC61 and RCC76) were diagnosed with renal tumours before surgery and underwent laparoscopic radical nephrectomy at The First Affiliated Hospital of Guangxi Medical University (Table 1). After detailed communication, all three patients voluntarily donated their tumour tissues. Postoperative pathology confirmed ccRCC in all samples. This study was approved by the Institutional Review Board (IRB) of The First Affiliated Hospital of Guangxi Medical University (No. Expedited trial 2018003), which agreed to the conduct of the study and the sharing of the data. All participants signed informed consent forms and agreed to share the data.

Table 1 Details of samples and FASTQ files.

Single nuclei preparation

Single nuclei were prepared in two steps. First, we prepared a ccRCC single-cell suspension. Representative tumour tissue was obtained by multi-point sampling, with a total of 1 cm³ of tissue selected. After washing twice with DPBS (WISENT, 311-425-CL), the tissue specimens were digested in a solution of 1 mg/mL collagenase I (Gibco, 5401020001) and 1 mg/mL DNase I (Roche, 10104159001) in HBSS for 30 min at 37 °C. Digestion was terminated with DMEM (WISENT, 319-006-CL) containing 10% FBS (Gibco, 10099141). A 70 μm cell strainer (Falcon) was used to filter out large tissue fragments that had not been fully digested into single cells. We then removed red blood cells with RBC lysis buffer (10X diluted to 1X; BioLegend, 420301) for 5 min on ice and filtered the suspension again through a 40 μm cell strainer (Falcon). Finally, we obtained a single-cell suspension, which was counted with trypan blue (Gibco, 15250-061) staining. In this study, nuclei were isolated from RCC61 and RCC76 immediately, whereas RCC30 was frozen in liquid nitrogen for a period of time before nuclei isolation.

The second step, isolation of single nuclei, followed our previous study14. Briefly, we prepared the lysis buffer (10 mM Tris-HCl (pH 7.4), 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20, 0.1% Nonidet P40 Substitute, 0.01% digitonin and 1% BSA) and wash buffer (10 mM Tris-HCl (pH 7.4), 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20 and 1% BSA). Single cells were incubated with lysis buffer (100 µl) on ice for four different durations (3 min, 3.5 min, 4 min and 4.5 min), because cells are sensitive to the duration of lysis. We then added wash buffer (1 ml) to terminate lysis. The samples were centrifuged at 500 g for 5 min at 4 °C and the supernatant was removed. Nuclei were resuspended in PBS, and the quality obtained with each lysis duration was examined by microscopy. Finally, we selected the optimal samples, which were resuspended in chilled 1x Nuclei Buffer (10x Genomics, 2000153) at approximately 5,000–7,000 nuclei/μL.

DNA library construction and preliminary sequencing results

The target number of nuclei captured per sample was 7,000. DNA libraries for scATAC-seq were constructed according to the 10X Genomics ‘Chromium Single Cell ATAC Reagent Kits User Guide’ (https://support.10xgenomics.com/single-cell-atac/library-prep/doc/technical-note-chromium-next-gem-single-cell-atac-v11-reagent-workflow-and-software-updates). The libraries were then sequenced on a NovaSeq 6000 (Illumina, San Diego, CA) with 2 × 50 paired-end reads. Raw sequencing files (.bcl) were converted to FASTQ files by CellRanger ATAC (version 1.2.0, https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/what-is-cell-ranger-atac). After running cellranger-atac mkfastq, read 1, the barcode, read 2 and the sample index corresponded to the R1, R2, R3 and I1 files, respectively (Table 1). The FASTQ files were then aligned to the human reference genome GRCh37 with the cellranger-atac count function. Finally, a summary file was generated, which provides an overview of the scATAC-seq data (Table 2).
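The correspondence between FASTQ file tags and read roles described above can be captured in a small lookup. The following Python sketch is illustrative only: the helper `read_role` and the example file name are hypothetical, and the file name simply assumes the usual Illumina naming pattern rather than any file listed in Table 1.

```python
import re

# Read roles as described in the text for the cellranger-atac mkfastq output.
READ_ROLES = {
    "R1": "read 1",
    "R2": "barcode",
    "R3": "read 2",
    "I1": "sample index",
}

def read_role(fastq_name: str) -> str:
    """Return the role of a FASTQ file based on its R1/R2/R3/I1 tag."""
    match = re.search(r"_(R[123]|I1)_\d+\.fastq(\.gz)?$", fastq_name)
    if match is None:
        raise ValueError(f"no read tag found in {fastq_name!r}")
    return READ_ROLES[match.group(1)]

# Hypothetical file name, used only to show the lookup.
print(read_role("RCC30_S1_L001_R2_001.fastq.gz"))
```

Note that, unlike many paired-end layouts, the second file (R2) holds the cell barcode here, so the genomic read pair is R1 and R3.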

Table 2 The details of quality-control (QC) for three ccRCC samples.

Secondary analysis of scATAC-seq data

Here, we used the Seurat16,17 (version 4.0.0) and Signac18 (version 1.0.0) R packages for downstream analysis. Quality control (QC) relies on several important parameters: the transcriptional start site (TSS) enrichment score, nucleosome signal (NS), number of fragments in peaks, fraction of fragments in peaks (pct reads in peaks) and ratio of reads in genomic blacklist regions (Fig. 2a–c, Table 2). Following the thresholds reported in the previous study14 (peak region fragments > 1000 & peak region fragments < 20000 & pct reads in peaks > 15 & blacklist ratio < 0.05 & nucleosome signal < 4 & TSS enrichment > 1), we filtered out low-quality nuclei and obtained 18,703 high-quality nuclei (Fig. 2a–d, Table 2).
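The QC thresholds listed above amount to a conjunction of per-nucleus conditions. The Python sketch below restates that filter outside of R for illustration; the `NucleusQC` record, the barcodes and all metric values are made up, and this is not the Seurat/Signac code used in the study.

```python
from dataclasses import dataclass

@dataclass
class NucleusQC:
    barcode: str
    peak_region_fragments: int
    pct_reads_in_peaks: float  # percentage, 0-100
    blacklist_ratio: float
    nucleosome_signal: float
    tss_enrichment: float

def passes_qc(n: NucleusQC) -> bool:
    """Apply the thresholds quoted in the text; all must hold."""
    return (
        1000 < n.peak_region_fragments < 20000
        and n.pct_reads_in_peaks > 15
        and n.blacklist_ratio < 0.05
        and n.nucleosome_signal < 4
        and n.tss_enrichment > 1
    )

# Made-up nuclei, for illustration only.
nuclei = [
    NucleusQC("AAAC-1", 5200, 42.0, 0.01, 0.8, 5.3),
    NucleusQC("AAAG-1", 600, 35.0, 0.01, 0.9, 4.1),   # too few fragments in peaks
    NucleusQC("AACT-1", 8000, 50.0, 0.02, 4.6, 6.0),  # nucleosome signal too high
]
kept = [n.barcode for n in nuclei if passes_qc(n)]
print(kept)
```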

Fig. 2 Data integration and quality control (QC) of the three ccRCC samples. (a,b) Transcriptional start site (TSS) enrichment score and nucleosome signal (NS) of the data. (c) Scatterplots of the quality parameters of each nucleus. (d) QC strategy applied to the data; blue dots represent cells, and cells within the red box are high-quality cells retained for downstream analysis. (e) Correlation between sequencing depth and reduced-dimension components 1 to 50 from latent semantic indexing (LSI) analysis.

After QC, we performed term frequency-inverse document frequency (TF-IDF) normalization and obtained a total of 104,818 unique peaks. Gene annotations were obtained with the GRanges function, referencing ‘hg19’ from the University of California Santa Cruz (UCSC19). We then ran singular value decomposition (SVD) on the TF-IDF matrix using the peaks in each cell. The combination of TF-IDF normalization and SVD is known as latent semantic indexing (LSI) analysis20. We calculated the correlation between sequencing depth and reduced-dimension components 1 to 50 and selected 35 as the parameter for downstream analysis, where the correlation is close to zero (Fig. 2e). Nuclei were then clustered in an unbiased manner with the FindClusters function at a resolution of 0.5.
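To make the normalization and the depth check concrete, the following Python sketch computes one common log-scaled TF-IDF variant on a tiny peak-by-cell count matrix and then correlates sequencing depth with a component. It is a minimal illustration under stated assumptions: the exact weighting applied by the software used in the study may differ, the SVD step is omitted, and the counts and component values are made up.

```python
import math

def tf_idf(counts, scale=1e4):
    """Log-scaled TF-IDF for a peaks x cells count matrix (list of rows).

    TF  = count / total counts in the cell
    IDF = number of cells / total counts for the peak
    """
    n_cells = len(counts[0])
    depth = [sum(row[j] for row in counts) for j in range(n_cells)]
    weighted = []
    for row in counts:
        peak_total = sum(row)
        idf = n_cells / peak_total if peak_total else 0.0
        weighted.append([
            math.log1p(row[j] / depth[j] * idf * scale) if depth[j] else 0.0
            for j in range(n_cells)
        ])
    return weighted

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up counts: 3 peaks (rows) x 3 cells (columns).
counts = [[1, 0, 2], [0, 1, 1], [3, 0, 0]]
weights = tf_idf(counts)

# Made-up values of one reduced-dimension component, one value per cell.
depth = [sum(row[j] for row in counts) for j in range(3)]
component = [0.9, 0.2, 0.7]
print(pearson(depth, component))
```

A component whose correlation with depth is large in magnitude mostly reflects technical variation rather than biology, which is why this correlation is inspected before choosing the components for downstream analysis.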

Calculation of gene activity score and differential peaks

The gene activity score quantifies the activity of each gene in the genome by assessing the chromatin accessibility associated with that gene. A gene activity matrix can be created by extracting gene coordinates and extending them to include the 2 kb upstream region, a step performed with the GeneActivity function. Differential peaks are valuable for understanding how chromatin state differs among cell clusters. Here, we applied the FindAllMarkers function to obtain differential peaks for each cell cluster (Table S1).
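The idea behind the gene activity score, extending each gene by 2 kb upstream and counting the fragments that overlap it per cell, can be sketched in a few lines of Python. This is a simplified illustration rather than the GeneActivity implementation; the helper names, coordinates and barcodes are hypothetical, and intervals are assumed to be half-open.

```python
from collections import Counter

def extend_upstream(start, end, strand, upstream=2000):
    """Extend a gene interval to include the region upstream of its start.

    On the minus strand, upstream lies at higher coordinates.
    """
    if strand == "+":
        return max(0, start - upstream), end
    return start, end + upstream

def gene_activity(fragments, gene):
    """Count fragments overlapping the extended gene region, per barcode.

    fragments: iterable of (chrom, start, end, barcode)
    gene:      (chrom, start, end, strand)
    """
    chrom, start, end, strand = gene
    region_start, region_end = extend_upstream(start, end, strand)
    counts = Counter()
    for f_chrom, f_start, f_end, barcode in fragments:
        if f_chrom == chrom and f_start < region_end and f_end > region_start:
            counts[barcode] += 1
    return dict(counts)

# Hypothetical gene and fragments, for illustration only.
gene = ("chr1", 10000, 12000, "+")
fragments = [
    ("chr1", 8500, 8700, "cellA"),    # inside the 2 kb upstream extension
    ("chr1", 7000, 7500, "cellA"),    # too far upstream
    ("chr1", 11900, 12100, "cellB"),  # overlaps the gene body
    ("chr2", 8500, 8700, "cellB"),    # different chromosome
]
activity = gene_activity(fragments, gene)
print(activity)
```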

Motif and transcription factor (TF) footprinting analysis

Using the chromVAR21 R package, which is integrated into Signac, we applied the AddMotifs function to add DNA sequence motif information for motif analyses. We calculated the motif activity score in each cell and identified differential activity scores between cell types with the FindAllMarkers function. After z-score normalization, the differential activity score between cell types is reported as “avg_diff”. We then identified the most significantly enriched motifs in each cell subtype (Table S2) and selected the top one or two for secondary analysis. In addition, we visualized these motifs and identified the corresponding TFs with the MotifPlot function. After confirming the cell type-specific TFs, we used the Footprint function to gather all the required data and store it in an assay. Finally, we presented the footprint analysis of these transcription factors with the PlotFootprint function.
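As a rough illustration of how motifs can be ranked per cell type, the Python sketch below assumes that “avg_diff” denotes the difference between the mean z-scored motif activity in one group of cells and the mean in all remaining cells. That interpretation, the helper names and every score are assumptions made for the example; the statistical testing performed by FindAllMarkers is not reproduced.

```python
from statistics import mean

def avg_diff(scores, labels, cluster):
    """Mean z-score inside the cluster minus mean z-score in all other cells.

    scores: {motif: [z-score per cell]}
    labels: cluster label per cell, in the same cell order
    """
    result = {}
    for motif, values in scores.items():
        inside = [v for v, lab in zip(values, labels) if lab == cluster]
        outside = [v for v, lab in zip(values, labels) if lab != cluster]
        result[motif] = mean(inside) - mean(outside)
    return result

def top_motifs(scores, labels, cluster, n=2):
    """Return the n motifs with the largest avg_diff for the cluster."""
    diffs = avg_diff(scores, labels, cluster)
    return sorted(diffs, key=diffs.get, reverse=True)[:n]

# Made-up z-scores for four cells; motif names are taken from the text.
labels = ["tumour", "tumour", "EC", "EC"]
scores = {
    "HNF1B": [2.0, 1.0, -1.0, -2.0],
    "SOX9": [-1.0, -1.0, 1.0, 3.0],
    "ETS1": [0.0, 0.0, 0.0, 0.0],
}
print(top_motifs(scores, labels, "tumour", n=1))
print(top_motifs(scores, labels, "EC", n=1))
```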

Definition of each cell type

Compared with single-cell RNA sequencing, cell type definition is more challenging for scATAC-seq data. Following a previous report13, we recommend integrating three lines of evidence: (1) the gene activity matrix, built by extracting gene coordinates and extending them to include the 2 kb upstream region, which allows the chromatin accessibility associated with marker genes to be assessed; (2) the differential peaks of each cell subtype, which can be matched to specific genomic locations (Table S1); and (3) TF analysis, in which cell type-specific TFs are identified and then compared with the published literature.

Data Records

All processed scATAC-seq data can be accessed in the NCBI GEO database22 under accession number GSE271273. These data were obtained after preliminary analysis with CellRanger ATAC software. The raw data (.fastq files) have been deposited in the NCBI Sequence Read Archive (SRA)23 under project accession number PRJNA1130842. After secondary analysis with Signac, we saved the object file (.rds) and Tables S1 and S2, which can be accessed in figshare24.

Technical Validation

Here, we present three high-quality scATAC-seq datasets from human ccRCC samples with appropriate quality control parameters, such as fraction of fragments overlapping TSS > 30%, fraction of transposition events in peaks in cell barcodes > 40% and fraction of fragments overlapping any targeted region > 60% (Table 2). By applying Signac18 for unbiased clustering of the nuclei, a total of 18,703 high-quality cells and 20 cell subtypes were identified (Fig. 3a). According to the cell annotation strategy (Methods), we classified these cells into tumour cells, endothelial cells, T cells, macrophages, CAF, NK cells and B cells (Fig. 3a). Given that two of the three samples (RCC61 and RCC76) were fresh and one (RCC30) was frozen, we compared the cell subpopulations of the three samples. Apart from the tumour cell subtypes, the other cell subtypes contained cells from almost every sample (Fig. 3b), a result similar to previous scATAC-seq studies13,14. The samples included both sexes: RCC30 and RCC61 were obtained from male patients, and RCC76 from a female patient (Table 1). In addition, because the samples included both fresh (RCC61 and RCC76) and frozen (RCC30) material, we randomly selected three genomic regions and found a high degree of similarity in chromatin accessibility among the three samples (Fig. 3c–e), which further supports the reliability of the scATAC-seq data.

Fig. 3 Three ccRCC samples merged by Signac. (a) A total of 18,703 nuclei were clustered in an unbiased manner and classified into 20 cell subtypes, projected by UMAP. (b) Distribution of the three ccRCC samples on the UMAP. (c–e) Chromatin accessibility of the three samples in randomly selected regions on chr 3, chr 9 and chr 17.

We examined the gene activity scores of major cell subtypes, and the marker genes were consistent with previous studies: CA9 in tumour cells25, VWF in endothelial cells26, RGS5 in CAF27, PTPRC in immune cells28, MSR1 in macrophages29, IL7R in T cells30, KLRD1 in NK cells31 and SDC1 in B cells13 (Fig. 4a). The data also reveal regions of chromatin accessibility that are shared by all cells (Fig. 4b,c). Interestingly, we identified the specific peaks of each cell cluster and the chromatin locations of these regions (Fig. 5a and Table S1). Here, we show the specific peaks in the tumour cell subtypes (clusters 1, 2, 3 and 14), which are located at CTB-164N12.1, ATRNL1, KRT14 and RP11-118K6.3, respectively (Fig. 5b–e).

Fig. 4 scATAC-seq reveals the epigenetic regulatory features of ccRCC. (a) Cell type-specific gene activity scores; the colour gradient indicates the level of the score, with yellow representing high and purple representing low. (b,c) Universal regions of chromatin accessibility identified by scATAC-seq.

Fig. 5 Discovery of cell type-specific peaks by scATAC-seq. (a) The top 10 peaks in each cell subtype, labelled with the genes at which the characteristic peaks are located. (b–e) Chromatin accessibility at CTB-164N12.1, ATRNL1, KRT14 and RP11-118K6.3 was specific to ccRCC tumour cells.

Finally, using the Signac18 and chromVAR21 R packages, we present a method to discover cell type-specific transcription factors (TFs) and motifs, covering 200 variable TFs and motifs (Fig. 6a). In addition, we identified some of the most significantly enriched TFs and motifs, which have been verified by previous studies13,14,15, such as HNF1B/HNF1A in tumour cells, SOX8/SOX9 in endothelial cells, EBF2/EBF3 in CAF, SPIB/SPIC in macrophages, EOMES/TBR1 in NK cells and ETS1/FLI1 in T cells (Fig. 6b). Footprint analysis can then be performed for these motifs in each cell cluster (Fig. 7). Collectively, our data provide high-quality epigenetic information on ccRCC and further reference material for future diagnosis and treatment.

Fig. 6 Analysis of cell type-specific transcription factors (TFs) and motifs. (a) Enrichment of the 200 variable TFs in each cell cluster. The colour gradient indicates the level of the differential activity score (avg_diff). (b) Cell type-specific TFs and their motifs.

Fig. 7 Motif footprinting analysis for cell type-specific TFs.