Abstract
Multiple assessment checks are required to handle the increasingly complex engineered-cell and cell-line provenance. To manage the biosafety and efficacy demands, we developed a bioinformatic pipeline for de novo profiling of short tandem repeats and mutations (STRaM) to identify and track homologous edited/engineered cells. The core technology of STRaM comprises an error-sensing bioinformatic pipeline with 3 analysis modules (STR analysis, STR flanking analysis and EMS analysis) for profiling, and an integrated assessment system with three indices for the respective reporting of identity, purity and genetic modifications of tested cells. STRaM maintains a transformed pathway for backward compatibility with traditional capillary gel electrophoresis (CE) based DNA databases. To introduce our integrative and cost-effective STRaM system to best practices in the management of modern cell products, we applied our enhanced DNA fingerprinting technique to several basic and translational cell research examples.
Introduction
With the fast-paced development of advanced technologies for making designed changes in human cells for cell-based products and therapies, questions about the biosafety and efficacy of these products are being raised. Furthermore, with the proliferation of engineered cells and new cell lines evolving from divergent lineages via widely used CRISPR gene editing and patient-specific mutation technologies, the need for improved engineered-cell and cell-line provenance in research and clinical translation is widely acknowledged. Engineered-cell and cell-line provenance can be defined as database records containing information pertaining to the origin, source, handling, tracking and validation of engineered cells and cell lines. Advanced techniques that can be routinely deployed for the biosafety and bioefficacy management of human engineered-cell and modern cell-line provenance are in high demand1,2,3. As tandemly repetitive DNA sequences composed of 1–6 bp motifs with high allelic variability and structural polymorphisms4, short tandem repeats (STRs) are well recognized as genetic loci for cell authentication5, forensic investigations6,7, and human population genetics8. STR profiling by CE is the gold standard for cell line identification and authentication9 and is supported by the ASN-0002 standard, which was developed by the International Cell Line Authentication Committee, and by available reference databases for the STR profiles of human cell lines9,10. Despite its success in authenticating older generations of human cell lines, CE-based STR profiling does not fully meet the needs of quality control (QC) in advanced cell-based therapies. The technique is not sufficiently sensitive for detecting several types of cross-contamination. CE can only handle length variation of STR profiles for a similarity assessment and not the nucleotide variation in the repeat or flanking regions11,12.
This technical limitation has generated some fixed and incorrect numbers of repeat units (NRUs) in cell line reference databases and a new method to uncover genetic polymorphisms and match conventional CE data13 is needed. Additionally, the CE method requires the fluorescent detection of multiple STR loci, which is performed with specific instruments and software that is laborious and expensive. Single nucleotide polymorphisms (SNPs) serve as genetic markers for human cell line authentication14,15,16, but the technical and bioinformatic resources required for SNP analysis via whole-genome sequencing (WGS) can be prohibitive due to the large amount of data.
Next-generation sequencing (NGS), also known as massively parallel sequencing, is an ultrafast, high-throughput technique for sequencing DNA or RNA and is highly expandable for various applications. NGS has been suggested as a promising tool for STR and SNP genotyping in the forensic field17,18. It is not limited by the fluorescent channels used in CE-based technology and is therefore suitable for the detection of additional STR markers, which is helpful for authenticating cancer cells with genomic abnormalities. Over the last decade, several methods have been developed for identifying and genotyping STRs from NGS data (Table 1). An example is the STR identification package STR-FM19, which is likely to misidentify complex STRs, e.g., the VWA locus. STRait Razor is another tool for STR profiling on the basis of NGS data20,21. To minimize failure due to the recognition of mutated flanking sequences, STRait Razor uses an alignment module with low accuracy, which limits the selection of STRs to those with distinct flanking sequences. Here, we provide a strategy that combines STR identification analysis, flanking alignment analysis and comparison analysis to effectively improve the efficiency and accuracy of STR sequence identification.
To address the biosafety and efficacy demands of cell-based therapies, we developed an integrated strategy that combines STRs and target mutations to construct characteristic STRaM profiles for identifying and tracking homologous edited/engineered cells for contamination monitoring (Fig. 1A). The automated analysis pipeline of STRaM consists of 5 sections: a data input interface, a preprocessing unit, three analysis modules, an error-sensing segment and a 3-index assessment group (Supplementary Fig. 1A). The core technology is built into 3 sections: an error-sensing segment with 3 analysis modules (STR analysis, STR flanking analysis and EMS analysis) (Fig. 1B), a STRaM locus set that includes the genetic STR loci and edited/mutant sequence (EMS, e.g., gene mutations, transgenes) via targeted amplicon sequencing (TAS) (Fig. 1C), and the evaluation of three assessment indices: the similarity index (SI), purity index (PI) and editing/mutation index (EMI) (Fig. 1D) for the reporting of identity, purity and genetic modifications of tested cells, respectively.
A Overall schematics of STRaM. B An error-sensing bioinformatic pipeline containing 3 analysis modules: STR analysis, STR flanking analysis and EMS analysis. C A New STRaM marker set with STR loci derived from autosomes 1-22. The amelogenin gene is used to distinguish X/Y sex chromosomes. D New assessment criteria defined by 3 indices: SI, PI and EMI for handling identity, purity and editing/mutation tests, respectively.
Results
A new analytic rule for STRaM profiling
CE-based STR profiling is unable to handle nucleotide variations in the repeat or flanking regions11,12 found in many advanced cell-based products. On the other hand, NGS offers more efficient data and sequence information of STR loci on the basis of the new recommendations of the International Society for Forensic Genetics (ISFG) on the STR sequence nomenclature22. SNPs present in marginal positions of the repeat sequence and base insertions or deletions contained in the flanking sequences limit the authenticity of the CE-based reference database (Fig. 2A). To improve precision, we present a new rule by considering the SNPs in the repeat marginal regions (Fig. 2B). The true boundary for the inclusion of all repetitive tetranucleotide units in the STRaM rules, rather than the exclusion of 1 or 2 units at the 5' or 3' end of a certain STR (e.g., D5S818 and D13S317 in Supplementary Fig. 1B) is in accordance with the CE rules for compatibility with reference databases.
A Inclusivity of SNPs in the repeat marginal regions ensures the continuation of repetitive sequences, e.g., the D5S818 and D13S317 loci, n = 16 cell line alleles. B STRaM rules encompass CE rules via an annexation of the repeat units with SNPs. C, D A schematic illustration of the independent STR analysis and STR flanking analysis. Both analyses occasionally run into errors: (C) a misidentification of VWA in the A549 cells and (D) a failure to align merged reads of the T47D cells to the 3' reference flanking sequence of D2S437. E Error checks of STR analysis and STR flanking analysis via DSP, CSL and FRC as evaluated, respectively, from the genomic coordinates, STR length and read counts. F Nucmer outputs filtered out by the different sizes of aligned sequences. G Accuracy comparison of the output results of STR-FM, flanking alignment and STRaM for true sequences using the ASN-0002 STR panel and STRaM panel loci. Error bars, mean ± SEM.
Our error-sensing bioinformatic pipeline was developed on web-based Galaxy servers23 for more robust STR detection (Fig. 1B). The pipeline combines STR analysis and STR flanking analysis into one cohesive unit to avert the misidentification or failure that occurs when the two analyses are performed independently (Fig. 2C, D). The STR analysis module uses the STR detection program of the STR-FM package19 to recognize continuous STRs and calculate the repetitive motifs, lengths and chromosomal coordinates of individual merged reads. The STR flanking analysis module uses the sequence aligner Nucmer in the MUMmer4 package24 to determine the flanking sequences of STRs and computes the lengths of individual merged reads. An error-sensing comparison of the genomic coordinates, STR lengths and read counts extracted independently by the STR analysis and the STR flanking analysis from individual merged reads is built into the pipeline (Fig. 2E) to handle errors of the two analytic modules. Mismatched outputs were either corrected, or the merged reads with mismatched outputs were discarded. The EMS analysis module employs Nucmer to identify mutant, edited or engineered DNA sequences, as well as the amelogenin gene, in merged reads (Fig. 2F). Notably, the bioinformatic pipeline is capable of analyzing many more STRs and EMS than those mentioned in the study. Using the 3 analysis modules, STRaM can efficiently identify true STRs and targeted mutations.
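The comparison step can be pictured with a minimal Python sketch. It assumes that each module emits, per merged read, a locus assignment and an STR length; the `ModuleCall` record, the function names and the example values are hypothetical and are not taken from the STRaM code. The sketch only keeps reads on which both modules agree, and it omits the read-count check and the correction of mismatched outputs described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ModuleCall:
    """Hypothetical per-read output of one analysis module."""
    locus: Optional[str]       # locus assigned to the read; None if the module failed
    str_length: Optional[int]  # length of the repeat region in bp


def concordant(str_call: ModuleCall, flank_call: ModuleCall) -> bool:
    """True when both modules assign the same locus and the same STR length."""
    return (
        str_call.locus is not None
        and str_call.locus == flank_call.locus
        and str_call.str_length == flank_call.str_length
    )


def filter_reads(
    calls: Dict[str, Tuple[ModuleCall, ModuleCall]]
) -> Tuple[Dict[str, ModuleCall], List[str]]:
    """Split merged reads into concordant calls and discarded read IDs."""
    kept: Dict[str, ModuleCall] = {}
    discarded: List[str] = []
    for read_id, (str_call, flank_call) in calls.items():
        if concordant(str_call, flank_call):
            kept[read_id] = str_call
        else:
            discarded.append(read_id)
    return kept, discarded


# Made-up example values: one concordant read, one length mismatch,
# and one read on which the flanking alignment failed.
example_calls = {
    "read1": (ModuleCall("D5S818", 44), ModuleCall("D5S818", 44)),
    "read2": (ModuleCall("VWA", 40), ModuleCall("VWA", 68)),
    "read3": (ModuleCall("D2S437", 48), ModuleCall(None, None)),
}
kept_reads, discarded_reads = filter_reads(example_calls)
```

Requiring agreement between two independent extractions means that an error in either module surfaces as a mismatch rather than silently entering the profile.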
The output results of STR-FM, flanking alignment and STRaM analysis were compared against the true sequences using the ASN-0002 STR panel (left) and STRaM panel loci (right) (Fig. 2G). As depicted, STRaM correctly reflects the true sequences for distinguishing the position of mutations, the length change of repeated sequences and sequence variation with 100% accuracy. In comparison, STR-FM achieved accuracy rates of 83.3 ± 2.2% (ASN-0002 STR panel loci) and 92.6 ± 0.9% (STRaM panel loci), while the flanking alignment analysis yielded accuracy rates of 99.7 ± 0.3% (ASN-0002 STR panel loci) and 99.9 ± 0.1% (STRaM panel loci).
Customizing a simple and effective STRaM locus set for profiling
There are many STR loci with complex structures involving insertions, deletions and multiple repeat motifs that are not conducive to the rapid formation of STRaM profiles of cell samples. These include the compound and/or interrupted loci VWA and THO1 typically used in CE-based profiling (Supplementary Fig. 2A). To increase the simplicity and effectiveness of STRaM, we developed a new set that encompassed 22 STR loci and the amelogenin gene (Fig. 1C and Supplementary Table 1) and validated it via PCR (Supplementary Fig. 2B). To improve accuracy and TAS compatibility, the new STRaM set was formulated on the basis of five selection criteria: (1) simple (or uninterrupted) tetranucleotide STRs only, for an enhanced allele-to-stutter signal ratio in TAS and amplification; (2) STRs with a high heterozygosity index (Href > 0.6) to ensure allelic variability (Supplementary Fig. 2C); (3) STRs shorter than 200 bp for compatibility with paired-end sequencing (Supplementary Fig. 2D); (4) STRs with low variation in their flanking regions to reduce manual corrections in the analysis pipeline (Supplementary Fig. 2E and allele sequences in Supplementary Table 2); and (5) an STR set covering all human chromosomes to ensure the detection of tumor cells with loss of chromosomes.
Characterization algorithms for allele and stutter reads
Replication slippage of STR generates PCR artifacts (or stutters) with a deletion or an insertion of one or more repeat units25. Although stutter information is not included in STR profiling reports, the classification of alleles and stutters is important in de novo STR profiling for therapeutic cell products or their parental cells. To classify alleles and stutters in the NGS data, we employed continuous methods to first establish a cutoff model using the prominence ratio (Pr), which we define as the relative ratio of the read counts of alleles and stutters to the highest read counts at an STR locus (Methods).
The cutoff model with 8 STR loci (CSF1PO, D5S818, D7S820, D13S317, D16S539, THO1, TPOX, and VWA) was first studied via NGS data from 16 human cell lines (Supplementary Fig. 3), with allelic information for those STRs available in reference databases, e.g., Cellosaurus9,10. We applied the same cutoff model to classify alleles and stutters for the new STRaM set of STRs (Fig. 3A). The STR sequences with Pr > 0.3 in an adjacent position or Pr > 0.1 in a nonadjacent position to the max allele were considered the second allele, and the other sequences were considered stutters (Fig. 3B). We then studied stutter patterns by measuring the stutter ratio (Sr, defined in the Methods) at an STR locus26 and compared it to that of the corresponding alleles (Fig. 3C). Interestingly, the adjacent stutters (e.g., n-1 and n + 1) consistently demonstrated Sr < 0.12, whereas the nonadjacent stutters (n-2, n + 2, etc.) had Sr < 0.01 (Fig. 3D). The Sr values of n-3 and beyond (or n + 3 and beyond) did not decrease further, possibly due to technical limitations. Notably, combined stutters generally presented higher Sr values than the corresponding forward or reverse stutters. The combined stutters (e.g., na-1/nb + 1) between 2 alleles were excluded because they were influenced by both alleles (Fig. 3C, E). For example, the “combined stutter” ratio at the na + 1/nb-1 position of D6S1282 was 0.3 in the SW480 cell line. The stutter patterns of the breast cancer cell line T47D revealed consistent results in 3 independent experiments (Fig. 3F), and the forward stutters (e.g., n + 1 or n + 2) presented lower Sr values than the corresponding backward stutters (e.g., n-1 or n-2) at individual STR loci. In the 16 cell lines (Fig. 3G), 100.0 ± 0.0% of the n-1 and n + 1 stutters had Sr < 0.12. In addition, 96.9 ± 1.0% of the n-2 and n + 2 stutters and 98.6 ± 0.7% of the n-3 and n + 3 stutters had Sr < 0.01.
The de novo STRaM profiles of the 16 cell lines analyzed under the new STRaM rules were generated (Supplementary Table 3). From these data, Pr and Sr thresholds were established to separate alleles from stutters; the Sr threshold was 0.12 for the n-1 and n + 1 adjacent stutters and 0.01 for the n-2 and n + 2 nonadjacent stutters (Fig. 3H).
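The Pr cutoff can be illustrated with a short Python sketch. It assumes Pr is the read count of a sequence divided by the highest read count at the locus, as defined above, and that sequences are indexed by their NRU; the function name and the example counts are hypothetical. The source describes a single second allele, so returning every sequence that passes the cutoff is a simplifying assumption of this sketch.

```python
from typing import Dict, List, Tuple


def classify_locus(
    read_counts: Dict[int, int],
    adjacent_cutoff: float = 0.3,
    nonadjacent_cutoff: float = 0.1,
) -> Tuple[List[int], List[int]]:
    """Classify the sequences at one STR locus as alleles or stutters.

    read_counts maps NRU -> merged-read count. The sequence with the
    highest read count is taken as the max allele; any other sequence is
    called an allele when its prominence ratio (Pr) exceeds the cutoff
    for its position relative to the max allele.
    """
    top_nru = max(read_counts, key=read_counts.get)
    top_count = read_counts[top_nru]
    alleles = [top_nru]
    stutters = []
    for nru, count in read_counts.items():
        if nru == top_nru:
            continue
        pr = count / top_count
        is_adjacent = abs(nru - top_nru) == 1
        cutoff = adjacent_cutoff if is_adjacent else nonadjacent_cutoff
        if pr > cutoff:
            alleles.append(nru)
        else:
            stutters.append(nru)
    return sorted(alleles), sorted(stutters)


# Made-up read counts for a heterozygous locus with alleles at 10 and 13 repeats.
example_counts = {9: 100, 10: 1000, 11: 30, 12: 80, 13: 600}
example_alleles, example_stutters = classify_locus(example_counts)
```

In this made-up example the sequence at 13 repeats has Pr = 0.6 in a nonadjacent position and is called an allele, while the sequence at 9 repeats has Pr = 0.1 in an adjacent position and is called a stutter.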
A Log10 (Pr) plot for characterizing alleles and stutters of STRs in all the merged reads obtained from TAS data of 16 human cell lines. The relative NRU at an STR locus was defined as the increase or decrease in NRU in sequencing reads in comparison to the NRU of the STR sequence with the highest read count. The orange and black dots represent alleles and stutters, respectively. B Pr thresholds for different alleles. Compared to STR alleles with the highest read counts, the threshold for 2nd adjacent alleles was Pr > 0.3, and for 2nd nonadjacent alleles Pr > 0.1. C Three types of stutters at the STR loci: backward stutters, forward stutters and combined stutters. D Sr thresholds for different stutters. The threshold for na-1 and na + 1 adjacent stutters was Sr ≤ 0.12; and for na-2 and na + 2 nonadjacent stutters (and beyond) was Sr ≤ 0.01. E The threshold for combined stutters (between two alleles) was Sr ≤ 0.3 compared to the maximum read count. F, G Sr values of the individual STR loci of STRaM were calculated from (F, n = 3 independent experiments) T47D cells and (G, n = 16) 16 cell lines. The dashed line indicates the Sr thresholds for stutters. H A summary of the Pr thresholds for alleles and Sr thresholds for stutters in STRaM. Data were analyzed by Student’s t-test. Error bars, mean ± SEM.
PI for enhanced sensitivity detection of cross-contamination
Cross-contamination is a prevalent issue that can arise at multiple stages of sample handling, including preparation, manufacturing, and the storage of cellular materials, as well as during the sequencing process. In addition, cell lines of the same disease model are often cultured simultaneously, making them highly susceptible to cross-contamination. In this work, we employed two different breast cancer cell lines, T47D and HS578T, for STRaM identity analysis (Fig. 4A). To simulate cross-contamination, we mixed the 2 breast cancer cell lines at different ratios, and the different loci revealed different changes in alleles and stutters at varying mixing ratios (Fig. 4B and Supplementary Fig. 4A). For severe contamination, e.g., 50% contaminated cells (Supplementary Table 4), the STR alleles of both T47D and HS578T were detected by STRaM (Supplementary Fig. 4B).
A STRaM profiles of T47D and HCC827 cell lines. B Relative read contributions of alleles and stutters for the T47D/HS578T cells to the D6S1282 locus in mixtures of T47D and HS578T cells at ratios of 1:1, 50:1 and 100:1. C Cross-contamination occurred when T47D cells were spiked with 1% and 2% HS578T cells and was detected by SI and PI (n = 4 independent experiments). Unspiked T47D cells served as the control (n = 3 independent experiments). D SI and PI assessments in T47D cells mixed with A549 cells at 10%, 3% and 1% ratios (n = 4 independent experiments for each ratio). E SI and PI assessment in HEK293FT cells mixed with HCC827 cells (n = 6 independent experiments) and 5637 cells mixed with HS578T cells (n = 4 independent experiments) at a 1% rate. F Determination of a qualified or unqualified locus requires observation of the distribution of stutter products. The qualified locus was assessed according to the Sr thresholds for stutters (see Fig. 3H). G Abnormal non-adjacent stutter signals occurred when exogenous signals from non-adjacent alleles were increased. H, I PI is a more sensitive indicator of contamination. Heatmap of Sr values at STR loci of the STRaM set for purity monitoring of (H) T47D cells and (I) 100:1. The PI was computed using the formula in Fig. 1D. Note in the figure: “Y” for yes, representing the qualified locus, and “N” for no, representing the unqualified locus. J ROC curve analysis of the PI’s ability to distinguish contaminated samples from uncontaminated ones. Uncontaminated samples (n = 24 independent experiments), ≥ 2% (n = 15 independent experiments) and 1% (n = 20 independent experiments) mixed-rate contaminated samples. The data were analyzed by Student’s t-test. Error bars, mean ± SEM.
Minor contamination, e.g., 1% contaminated cells, is nearly undetectable by SI alone. The cancer cell line T47D shown in Fig. 4C possesses a high SI and PI of 100.0 ± 0.0% each. When T47D cells were mixed with 1% and 2% HS578T cells, the PI for T47D cells decreased to 46.6 ± 8.6% and 42.0 ± 7.0%, respectively. The corresponding SI decreased only to 98.4 ± 1.1% and 98.8 ± 1.2%, respectively, which by itself would imply no apparent contamination. Thus, STRaM adds a risk assessment for low-percentage contamination compared to existing CE methods or SI alone. For instance, when T47D cells were mixed with A549 cells at 10%, 3%, and 1% ratios, SI failed to detect low-level contamination, whereas PI was effective in detecting contamination down to a 1% mixture rate (Fig. 4D). To demonstrate sensitivity and robustness, additional 1% mixtures, namely HEK293FT with HCC827 (n = 6 independent mixing experiments) and 5637 with HS578T (n = 4 independent mixing experiments), were assessed by SI and PI, and the results remained consistent with the previous cases (Fig. 4E). The results showed that PI is a more sensitive indicator of cross-contamination than SI.
Upon further analysis, it was observed that the number of stutter sequencing reads was consistently lower than the number of alleles, as indicated by the characterization analysis of allele and stutter reads. Consequently, the stuttering signal was found to be more vulnerable to interference from heterologous alleles derived from contaminated samples, particularly the non-adjacent stutters. In this context, we utilized the Sr thresholds (Fig. 3H) to develop a purity algorithm, incorporating the PI, for evaluating the contamination of STR loci by foreign alleles. An STR locus was deemed qualified if it exhibited an Sr < 0.12 for adjacent stutters and Sr < 0.01 for nonadjacent stutters; otherwise, it was classified as disqualified (Fig. 4F).
This insight prompted us to devise a novel purity algorithm to detect cross-contamination based on the premise that foreign cell alleles, differing in length from those of the host cells, would lead to an abnormal increase in the read counts of host stutters at specific STR loci (Fig. 4G and Supplementary Fig. 4C). These abnormal increases serve as a sensitive indicator of cross-contamination. The PI is calculated as the numerical ratio of qualified STRs in the test cells to those in the parental cells (Fig. 1D). As illustrated in Fig. 4H, I, Supplementary Fig. 4D, and Supplementary Table 4, the PI demonstrated high sensitivity to abnormal elevations in Sr values in T47D cells mixed with 1% or 2% HS578T cells.
To test the hypothesis that foreign cell alleles are consistently different in length, we analyzed the STRaM fingerprint profiles across 16 cell lines. As shown in Supplementary Fig. 4E, the similarity of STR fingerprint profiles among various samples was typically below 80%, with more than 10 loci displaying dissimilar NRU alleles. This pattern held true for all sample pairs except two pairs of cell lines of common origin: SW480/SW620 and HEK293FT/HEK293T. Further analysis revealed that in any two randomly selected different samples, the count of non-adjacent alleles exceeded four (Supplementary Table 3).
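The purity calculation can be sketched in Python as follows. The sketch assumes that Sr values have already been computed per locus and per stutter position (the Sr definition is given in the Methods and is not reproduced here), and it takes the PI as the number of qualified loci in the test cells divided by the number of qualified loci in the parental cells, expressed as a percentage. Restricting the test count to loci that are qualified in the parental cells is an assumption of this sketch, as are the function names and example values; combined stutters between two alleles are left out.

```python
from typing import Dict

# Sr values keyed by stutter position relative to the allele,
# e.g. -1 for n-1, +1 for n+1, -2 for n-2.
SrByOffset = Dict[int, float]


def locus_qualified(
    sr_by_offset: SrByOffset,
    adjacent_max: float = 0.12,
    nonadjacent_max: float = 0.01,
) -> bool:
    """A locus is qualified when every stutter stays below its Sr threshold."""
    for offset, sr in sr_by_offset.items():
        limit = adjacent_max if abs(offset) == 1 else nonadjacent_max
        if sr >= limit:
            return False
    return True


def purity_index(
    test: Dict[str, SrByOffset], parental: Dict[str, SrByOffset]
) -> float:
    """Percentage of parental-qualified loci that are also qualified in the test cells."""
    parental_ok = [locus for locus, sr in parental.items() if locus_qualified(sr)]
    if not parental_ok:
        raise ValueError("no qualified loci in the parental profile")
    test_ok = [
        locus for locus in parental_ok
        if locus in test and locus_qualified(test[locus])
    ]
    return 100.0 * len(test_ok) / len(parental_ok)


# Made-up Sr values: the test sample shows an abnormal n-2 signal at one locus.
parental_sr = {
    "locusA": {-1: 0.08, 1: 0.02, -2: 0.004},
    "locusB": {-1: 0.10, 1: 0.03, -2: 0.006},
    "locusC": {-1: 0.07, 1: 0.01, -2: 0.002},
    "locusD": {-1: 0.09, 1: 0.02, -2: 0.005},
}
test_sr = dict(parental_sr, locusB={-1: 0.10, 1: 0.03, -2: 0.05})
example_pi = purity_index(test_sr, parental_sr)
```

Because the nonadjacent threshold is low, a small number of foreign reads landing on a nonadjacent stutter position is enough to disqualify a locus, which is why the PI responds to low-level contamination.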
To statistically assess the effectiveness of the PI in distinguishing contaminated samples from uncontaminated ones, we performed a receiver operating characteristic (ROC) curve analysis. For mixed samples with a contamination proportion of ≥2%, the ROC curve yielded an area under the curve (AUC) of 1 (n = 15 independent mixing experiments), which indicates perfect discrimination between contaminated and uncontaminated samples. For mixed samples with a 1% contamination proportion, the AUC was 0.960 (n = 20 independent mixing experiments), signifying a very high level of accuracy in identifying contamination (Fig. 4J). However, it is important to acknowledge the technical limitations of STRaM. For samples with very low levels of contamination, such as 0.1% and 0.01%, the PI was unable to reliably distinguish contaminated samples from uncontaminated ones, as indicated by the data presented in Supplementary Fig. 4F. This suggests that while STRaM is highly effective for detecting contamination at levels of 1% and above, it may not be sufficiently sensitive for detecting very low levels of contamination.
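For readers less familiar with the AUC, it equals the probability that a randomly chosen contaminated sample scores lower on the PI than a randomly chosen uncontaminated one, with ties counted as one half. The Python sketch below computes it by direct pairwise comparison; the function name and the PI values are made up for illustration and are not the study's data.

```python
from typing import Sequence


def auc_low_score_positive(
    clean_scores: Sequence[float], contaminated_scores: Sequence[float]
) -> float:
    """AUC for a score where LOWER values indicate contamination.

    Counts the fraction of (contaminated, clean) pairs in which the
    contaminated sample has the lower score; ties contribute 0.5.
    """
    if not clean_scores or not contaminated_scores:
        raise ValueError("both groups must contain at least one sample")
    wins = 0.0
    for contaminated in contaminated_scores:
        for clean in clean_scores:
            if contaminated < clean:
                wins += 1.0
            elif contaminated == clean:
                wins += 0.5
    return wins / (len(clean_scores) * len(contaminated_scores))


# Made-up PI values (percent), not taken from the study.
separated_auc = auc_low_score_positive([100.0, 100.0, 95.0], [40.0, 50.0, 60.0])
overlapping_auc = auc_low_score_positive([100.0, 100.0, 95.0], [40.0, 50.0, 95.0])
```

An AUC of 1 therefore means that every contaminated sample had a lower PI than every uncontaminated sample, while overlap between the two groups pulls the value below 1.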
Integration of an EMI for tracking engineered cells
Genetic engineering involving CAR-T cell therapy27 and CRISPR-Cas9 edited hematopoietic stem cell (HSC) therapy28, etc., is rapidly gaining popularity in clinical trials and applications and in the treatment of cancer and other diseases. Meeting the high standards of these therapies requires tracking the lineages of engineered cells, and we have therefore integrated an EMI into the STRaM framework for the detection of edited or mutated cells (Fig. 1D). The EMI is more effective than the SI in distinguishing homologous/engineered cell samples. For example, the SW480 and SW620 cell lines derived from primary colon carcinoma and lymph node metastasis from the same patient29 were clearly distinguishable by their different EMIs (0.0 ± 0.0% vs 33.1 ± 0.4% for mutant BRMS1c.708_709del, 0.0 ± 0.0% vs 50.5 ± 0.5% for mutant PTPN9c.757C>T, 0.0 ± 0.0% vs 48.1 ± 1.3% for mutant KMT2Bc.817G>A and 0.1 ± 0.1% vs 47.7 ± 0.1% for mutant KMT2Bc.875G>T, P < 0.0001), but not by CE-based STR profiling or STRaM-based SI (Fig. 5A-D). Further, genotyping of the three mutant loci indicated the presence of the corresponding mutations in only one allele (Supplementary Fig. 5A). The significant difference in EMI values between SW480 and SW620 readily distinguishes the two cell lines.
A, B The STR profiles and SI values of colon cancer cell lines SW480 and SW620 were very similar even though the two cell lines were from different cancer types. Both (A) ASN-0002 STR panel and (B) STRaM panel were assessed. C, D Four point mutations were detected by (C) Sanger sequencing, and (D) STRaM analysis showed a finite EMI in SW620 cells but not in SW480 cells (n = 3 independent experiments). Black arrows indicate the point mutations. E Schematic development of HEK293 clones with 7 silent mutations using CRISPR-Cas9 technology. F Edited HEK293 clones were compared to their parental cells using STRaM analysis with SI. The STR loci marked with red NRUs were different between cell lines and cell clones. G Sequence alignment between wild type PTEN and PTEN with 7 silent mutations (as PTEN-7SM). H, I Cell clones with CRISPR/Cas9-edited mutations of the PTEN gene were assessed for authentication, (H) SI and PI assessment and (I) editing status. Note: the difference in EMI was statistically significant, P < 0.0001. Data were analyzed by Student’s t-test. Error bars, mean ± SEM.
To monitor edited cells with genetic mutations via CRISPR-Cas9 technology, we used 3 edited clones of HEK293 cells with the same point mutations (Fig. 5E-I). STRaM analysis revealed that the 22-locus STR profiles of the 3 edited clones were very similar to those of their parental cells (Fig. 5F, H). Conversely, for the 7 silent mutations in exon 5 of the tumor suppressor gene PTEN (or PTEN-7SM), the EMI separated the 3 clones from their parental cells (98.7 ± 0.2% vs 0%, Fig. 5G, I and Supplementary Table 5, P < 0.0001). From gene editing through monoclonal cell selection, EMI was used alone for genotypic identification analysis to screen clones with target mutations (Supplementary Fig. 5B-D). Taken together, our data showed that STRaM with the integrated EMI is an effective tool for monitoring the status of genetic editing or mutation in cells.
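The EMI formula itself is given in Fig. 1D and is not reproduced in the text. As a rough illustration only, the Python sketch below assumes the EMI is the percentage of on-target reads at an EMS locus that carry the edited or mutant sequence; the function name and the read counts are hypothetical.

```python
def editing_mutation_index(edited_reads: int, wild_type_reads: int) -> float:
    """Percentage of reads at one EMS locus carrying the edited/mutant sequence.

    Assumed definition for illustration; the study's formula is in Fig. 1D.
    """
    total = edited_reads + wild_type_reads
    if total == 0:
        raise ValueError("no reads cover this locus")
    return 100.0 * edited_reads / total


# Made-up read counts: a mutation on one of two alleles, a fully edited
# clone, and an unedited parental sample.
heterozygous_emi = editing_mutation_index(505, 495)
edited_clone_emi = editing_mutation_index(987, 13)
parental_emi = editing_mutation_index(0, 1000)
```

Under this assumed definition, a mutation present on one of two alleles gives an EMI near 50%, in line with the values reported for SW620, and a fully edited clone approaches 100%.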
Furthermore, we investigated the presence of known translocations in the K562 cell line, specifically focusing on the BCR-ABL1 locus and the CDC25A-GRID1 locus30,31 (Supplementary Fig. 5E). To validate the sequences at the translocation loci junctions, we performed Sanger sequencing on samples that had passed the quality control criteria for both SI and PI (Supplementary Fig. 5F). The results confirmed that the sequences of the BCR-ABL1 and CDC25A-GRID1 translocation loci were accurate (Supplementary Fig. 5G). Next, we conducted genotyping of the translocation loci sequences and WT genes by EMI. The analysis revealed a ratio of approximately 3:1 for the WT genes to the two translocation locus genes, indicating a clear genetic distinction between the WT and translocated sequences (Supplementary Fig. 5H).
We integrated and streamlined the evaluation system of STRaM with three indices and demonstrated that they were capable of monitoring the identity, purity and genetic modifications of cells (Fig. 1D), in contrast to a single cell identity test for CE-based STR profiling. More importantly, STRaM further combines the functions of identification and distinction of homologous/engineered cells.
Matching the STRaM analysis with the CE-based STR profiling
The sequence-based STRaM is technically distinct from the length-based CE method (Fig. 2B). To demonstrate the difference, we carried out a comparative analysis of the STR profiles of a human cell line using both STRaM and the CE technique. Using the HEK293FT cell line as an example, the STR loci were amplified for data collection (Fig. 6A). The HEK293FT cell line was validated using the CE technique and exhibited the same STR profiles as those recorded in the Cellosaurus reference database10 (Fig. 6B; the profiles of the other 15 cell lines are translated in Supplementary Fig. 6).
A Representative PCR amplification of genomic DNA from the HEK293FT cell line was performed for the ASN-0002 STR panel, including CSF1PO, D5S818, D7S820, D13S317, D16S539, THO1, TPOX, VWA, and amelogenin. The PCR product size ranged from 120 to 300 base pairs. B STR profiles of specified cell lines were obtained via CE and STRaM, and compared. The CE/ reference column displayed STR profiles derived from CE analysis, which matched those in the reference database Cellosaurus. For comparison purposes, the STR profiles from the STRaM analysis were determined under the CE rules. The red NRUs at STR loci were different from corresponding NRUs in Cellosaurus. C, D STR loci of (C) D5S818 and (D) D13S317 exhibited different NRUs analyzed under CE rules versus STRaM rules. E STR profiles analyzed by STRaM under the CE STR rules were identical to those in Cellosaurus, as indicated by the STRaM SI values (n = 4 cell lines). F STRaM profiles of 5 human cell lines were generated from both WGS and TAS data and compared using SI. The WGS data of 5 human cell lines were obtained from the SRA, while the TAS data were generated in the study (n = 5 cell lines). G, H A comparison of (G) sequencing yields and (H) on-target read counts between the WGS and TAS data. Gbases, gigabases and error bars, mean ± SEM.
In contrast, STRaM analysis revealed distinct STR profiles for the HEK293FT cell line. Specifically, the STR alleles at loci D5S818 and D13S317 contained varying numbers of tetranucleotide repeat units. The true NRUs at D5S818 and D13S317 were consistent with their actual genetic structures but did not align with historical records in the reference database due to the presence of SNPs at the margins of the repetitive sequence (Fig. 6C, D).
TAS-based STR genotyping of four human cell lines (A549, HEK293FT, HS578T, and SW480) was subsequently performed using STRaM, adhering to CE-based STR rules. The results showed a 100 ± 0.0% match of their 9-locus STR profiles with those of CE-based STR profiling and the referenced cell line databases (Fig. 6E). These findings indicate that STRaM is fully compatible with traditional CE-based technology under the CE STR rules. More significantly, the repeat units with SNPs serve as the key medium that translates STRaM profile results to match the conventional CE database.
TAS makes STRaM more targeted and accessible
Depending on the type of data used for input, both WGS and TAS can be used with STRaM; however, with a focused approach to genome sequencing, the latter has the advantages of being simple in execution, relatively inexpensive, easily scalable and applicable to a wide variety of genes. To compare the 2 sequencing approaches, we compared the available WGS data of 5 cell lines (5637, A549, HCC827, SW480, and T47D) from the Sequence Read Archive (SRA) of the U.S. National Institutes of Health with the TAS data generated in the study. The read length of the paired-end WGS data was 101 bp, whereas that of the paired-end TAS data was 150 bp. An analysis of the WGS data revealed STR profiles similar to those from the TAS data for the 5 cell lines (SI = 98.4 ± 0.1%; Fig. 6F and Supplementary Fig. 7A). However, the number of on-target reads of all STRaM loci was 278 ± 39 read counts from the 138.3 ± 36.2 Gbases of WGS raw data, whereas it was 2570247 ± 825999 read counts from the 1.3 ± 0.4 Gbases of TAS data (Fig. 6G, H). Additionally, the TAS data generated an approximately 100-fold greater read depth per STR locus (more than 5000 read counts per locus) than the WGS data (less than 50 read counts per locus) (Fig. 6H).
Further, we examined eight WGS datasets generated from different library preparations32. For library preparation, we selected three kits known for their low DNA input requirements: the QIAGEN QIAseq Methyl Library Kit (QIAseq), the Illumina TruSeq DNA Methylation Kit (TruSeq), and the Swift Biosciences Accel-NGS Methyl-Seq DNA Library Kit (Swift). Compared to the cellular WGS data, the raw data from QIAseq and TruSeq increased by several tens of gigabytes, whereas the Swift raw data remained consistent (Supplementary Fig. 7B). The number of valid on-target reads for all STR loci was consistent between TruSeq and Swift when compared to the cellular genome. However, the number of valid on-target reads for all STR loci decreased for QIAseq (Supplementary Fig. 7C). By comparing WGS data with different data volumes from the same kit (QIAseq: 107.70 G vs. 165.88 G and Swift: 102.3 G vs. 132.88 G), we observed that the read count near STR loci increased with higher raw data volumes (Supplementary Fig. 7D). Nevertheless, the increase in valid on-target STR reads was not significant (Supplementary Fig. 7E). In summary, appropriate library preparation and sequencing depth can alter the number of valid on-target STR reads. However, even a 30-60 G increase in raw WGS data does not match the data yield obtained by the TAS method.
TAS and WGS are two distinct methods for DNA sequencing. Here, we compare their costs, time requirements, and overall efficiency. WGS involves sequencing the entire genome (approximately 3 G for the human genome). The cost for 100 G of raw WGS data is 2000-3000 RMB ($276-414), with an additional 25 RMB ($3.45) for each extra 1 G of data, and the data collection cycle takes 17-22 working days. When using a high-throughput platform, TAS costs 300-400 RMB ($41.40-$55.20) for 1 G of raw data, with a collection cycle of 1-2 weeks. For a medium-to-low-volume platform, it costs 100 RMB ($13.80) to collect 500,000 reads (more than 1000 per locus), with a time cycle of 6 days. Upon analysis, the number of valid on-target reads that fully covered the repetitive sequences of all 22 STR loci using the WGS method did not exceed 500, averaging less than 50 per locus. In contrast, the TAS method yielded more than 500,000 valid on-target reads, exceeding 5000 per locus (Supplementary Fig. 7F).
Tracking preclinical patient-derived cells and clinical applications
To demonstrate the feasibility and applicability of STRaM, we applied the technique to preclinical patient-derived cell models (Fig. 7A), which have become a popular replacement for cell lines33,34. Patient-derived models face the same challenges as established cell lines with respect to cell authentication and cross-contamination. We used STRaM to track and analyze the passages of patient-derived xenograft (PDX) tumors and organoids (PDOs). The first example involved human CD45+ (hCD45+) leukocytes sorted from a leukemia PDX mouse and analyzed using STRaM (Supplementary Fig. 8A). Compared with peripheral blood mononuclear cells (PBMCs) derived from leukemia patients, hCD45+ leukocytes from a PDX mouse presented identical STR profiles with an SI = 100% (Fig. 7B and Supplementary Fig. 8B). The hCD45+ cells were considered uncontaminated because the PI = 95.2%. The second example involved PDO cells derived from the hepatocellular carcinomas (HCCs) of 2 patients, which were positive for both the alpha fetoprotein (AFP) and cytokeratin-18 (CK-18) HCC markers35,36 (Supplementary Fig. 8C). More importantly, the PDO cells were genetically similar to the 2 original tumors with SI = 100.0% and 98.5%, respectively (Fig. 7B and Supplementary Fig. 8D). Furthermore, they had the same PI = 100.0%, implying that the PDO cells were not contaminated. The outcome of our experiments with PDXs and PDOs showed that STRaM is a useful tool for monitoring patient-derived models.
A STRaM analysis applied to cell line analysis, preclinical model analysis and cell therapy monitoring. B Comparison of preclinical PDO and PDX cells to parental tumors (n = 3 patients). C In vitro expansion of MSC was evaluated by STRaM analysis (n = 3 volunteers). D Scheme for autologous CAR-T cell therapy. E-G CAR-T cells were monitored via STRaM analysis for (E) authentication (SI = 100 ± 0.0%, all samples), (F) purity test (PI = 97.7 ± 0.9%, all samples), and (G) relative levels of CAR genes, which increased after CAR-T infusion (n = 3 patients). The PBMCs collected 5 days before infusion (-5) were used as parental controls for SI and PI. The data were analyzed by Student’s t-test. PBMCs, peripheral blood mononuclear cells. Error bars, mean ± SEM.
The next example involves cell-based therapy, which has become highly popular owing to its potential for treating many currently intractable diseases. We applied STRaM to investigate the issue of QC during the preparation and use of therapeutic cells. We monitored the in vitro expansion of mesenchymal stem cells (MSCs) from the umbilical cords (UCs) of 3 donors. The MSCs and UCs were then harvested and analyzed. We obtained SI = 100 ± 0.0% (Fig. 7C and Supplementary Fig. 9A), which implied that the expanded MSCs were genetically identical to the donor UCs. Further, we obtained PI = 96.9 ± 1.6%, indicating that no cross-contamination of the MSCs occurred (Fig. 7C).
As a last example, we used STRaM to monitor the preparation and therapeutic use of autologous CAR-T cells from 3 lymphoma patients (Fig. 7D). Our STRaM analysis revealed that the engineered CAR-T cells and the PBMCs after treatment had STR profiles identical to those of parental T cells, which was confirmed by SI = 100 ± 0.0% (Fig. 7E and a case in Supplementary Fig. 9B). Furthermore, these cells were also free of cross-contamination, with high PI values relative to the parental T cells: 100 ± 0.0% for the engineered CAR-T cells, and 96.9 ± 1.6%, 98.4 ± 1.6% and 95.5 ± 2.6% for the PBMCs after 5, 10 and 15 days of treatment, respectively (Fig. 7F). Note that PBMCs collected 5 days before infusion (-5) were used as parental controls for SI and PI. Our EMI tracking of CAR-T cells is based on the relative read counts of CAR transgenes. Two targeted regions of the lentiviral CAR transgene, anti-CD19-CD28 and 4-1BB-CD3ζ37, were amplified and sequenced (Supplementary Fig. 9C). Relative levels of the CAR transgene increased in PBMCs during the CAR-T treatment (Fig. 7G, Supplementary Fig. 9D and Supplementary Table 5). The in vitro production and in vivo dynamics of CAR-T cells were thus tracked using STRaM for CAR genes, and the assessment included identity and purity tests.
Discussion
In this study, we have shown that STRaM is a genetic framework for assessing the quality of therapeutic and experimental cells. Its design enables the integration of multiple tests into the genetic inspection of cells, which is well suited for fast-growing cell-based therapies that demand identity, purity, potency, and sterility tests.
STRs are genetic markers with high levels of diversity, and accurately determining their structures containing highly mutable sequences is quite challenging. STR profiling by CE is the traditional method used to authenticate and track cell line provenance9; however, it is handicapped by two limitations of the CE method – inadequate STR sequence variation identification and insufficient sensitivity in the contamination monitoring of homologous samples. The STR sequence length determination in CE is based on a comparison of an allelic ladder’s calibrated repeat number. However, the CE method necessitates the use of specialized equipment and software to ascertain STR length and does not account for sequence variants within repeat sequences or flanking regions (Table 2). Under the STRaM rules, STR structures are only consistent with true sequence changes in the human genome, but no longer focus on backward compatibility with CE data, which contain historical errors. However, STRaM analysis retains the transformed pathway for national DNA databases to support interlaboratory comparisons and bioinformatic development. STRaM not only collects comprehensive DNA sequence information but also enhances the sample evaluation system by incorporating the PI and EMI. These indices are essential for reporting the purity and genetic modifications of the tested cells, providing a more nuanced and informative analysis than traditional methods.
The STRaM data collection is based on NGS (e.g., TAS). NGS is an effective method for sequence analysis, but some studies have ignored sequence variations that can affect the NRU compatibility with the traditional CE method38,39. Numerous tools have been developed for identifying and genotyping STRs from NGS data (Table 1). These tools can be classified into three main analytical pathways: STR identification (e.g., STR-realigner40, ExpansionHunter41, STRetch42), STR flanking alignment (e.g., STRinNGS43, popSTR44, GangSTR45), and hybrid frameworks (e.g., STR-FM19, toaSTR46, SNiPSTR47). As compared in Table 2, STR identification methods directly analyze repetitive sequences in genomes using multiple algorithms, such as the hidden Markov model algorithm48, heuristic algorithms that identify repeats based on a library of known repeats42 and high-frequency k-mer based algorithms19,46,49. The strengths of these methods include the detection of all STR types within the input sequence and the provision of detailed read sequence information, such as position, sequence, length, and alignment scores. However, they are susceptible to misidentification and premature termination due to mutations in flanking sequences or repeat structures50. Additionally, processing large genomic datasets requires substantial memory and computing power, and extended processing times. STR flanking alignment analysis determines STR length by aligning the tested sequence to the flanking sequences of known STR loci. This approach is advantageous due to its analysis speed, and its principles are similar to those of CE. However, it requires a flanking sequence database for the set of known STR loci and can be compromised by variants in the flanking sequence, leading to analysis anomalies. Hybrid frameworks combine multiple detection algorithms or pathways for repeat annotations. These frameworks often include a suite of tools for repeat sequence annotation that can be integrated in series or parallel.
Series combinations enable rapid and accurate STR site localization and analysis, but the lack of comparative processing among tools can result in anomalies if one tool fails. Parallel tools, on the other hand, allow for a comparison of different models’ analysis parameters to correct errors, offering good integration properties, such as the integration of targeted sequence analysis. Programmers can easily construct databases of known STRs and mutation sites. However, reliance on known STR sites is not conducive to the discovery and study of novel loci, and analyzing large genome databases requires significant memory and runtime. In contrast to these tools, STRaM incorporates an error-check module to prevent exceptions within individual STR analysis modules. Moreover, STRaM can operate on an online platform with ease.
STRaM is an expandable framework based on NGS, which allows the integration of multiple cell quality tests. The STRaM framework incorporates 3 distinct tests: an identity test, a purity test, and a targeted gene-editing test (Fig. 1D). Multiple indices in the STRaM framework have been established or employed for the quantitative assessment of the quality of therapeutic or experimental cells1,2,3. As a significant component of the STRaM framework, the quantitative indices provide a more straightforward means of evaluating the quality of therapeutic or experimental cells. Furthermore, the framework can be expanded for additional genetic tests as needed. As a result, the expandable tool outperforms the CE-based STR profiling method9, which is limited to an identity test. An ingenious algorithm based on STR stuttering has been evaluated to assess the potential risk and extent of cross-contamination. Conventional stutter analysis treats stutter as a noise signal and applies a threshold to exclude it25,51,52. In STRaM, the STR stutters are not regarded as waste products, but rather as a valuable source of information for the algorithm. The assessment or monitoring of edited gene/transgene levels in CAR T cell products or PBMCs of treated patients commonly involves quantitative PCR (qPCR)53,54,55. The targeted gene-editing test integrated in STRaM can enhance the quality assessment and monitoring of homologous or engineered cell products and avoid the need for multiple independent tests for identification and monitoring. Furthermore, STRaM, with its integrated framework for superior performance and expandable capacity, is designed to handle the analysis cost-effectively (Supplementary Fig. 10).
Looking forward, we expect STRaM to significantly potentiate the biosafety and bioefficacy management of sophisticated engineered cells. We would like to see STRaM adopted for best practices management of advanced cell products and modern cell-based therapies.
Methods
Ethics statement and clinical samples
This study involves clinical samples sourced from several organizations and the study protocols were approved by the Institutional Review Boards of Jiangxi Cancer Hospital, Nanchang, Jiangxi Province, China (Approval No. 2023ky119), The First Affiliated Hospital of Zhejiang University, Hangzhou, Zhejiang Province, China (Approval No. IIT20220027C-R2), The Second Affiliated Hospital of Zhejiang University, Hangzhou, Zhejiang Province, China (Approval No. 2021–0497) and Zhujiang Hospital of Southern Medical University, Guangzhou, Guangdong Province, China (Approval No. 2017-GDEK-004 and 2022-KY-003–01). Clinical samples that included tumor specimens, blood and umbilical cords (Supplementary Table 6) were collected with patients’ informed consent from the First Affiliated Hospital and the Second Affiliated Hospital of Zhejiang University and Zhujiang Hospital of Southern Medical University. STRaM experiments were carried out at Zhejiang University-University of Edinburgh Institute, Haining, Zhejiang Province, China, and Jiangxi University of Chinese Medicine, Nanchang, Jiangxi Province, China in accordance with the Belmont Report.
STRaM framework
The STRaM framework encompasses a new set of STR, more inclusive rules for STR structures, an error-sensing bioinformatic pipeline established on the Galaxy server (https://usegalaxy.org/)56 and 3 assessment indices for reporting identity, purity and editing/mutation information. The bioinformatic pipeline contains 3 analytic modules: STR analysis, STR flanking analysis and EMS (gene mutations, transgenes) analysis.
STRaM bioinformatic pipeline for TAS data
The bioinformatic pipeline contains data preprocessing, 3 modular analyses (STR analysis, STR flanking analysis and EMS analysis) and a comparison of analytic outputs for the detection of STR errors, which are assembled in Galaxy servers56. The validated outputs of the STRaM pipeline are then translated into 3 indices for the testing of cell identity, cross-contamination and status of genetic modifications. The pipeline can handle both single-end reads and paired-end reads, although the latter is preferred.
The true repetitive structures of STR are programmed in STRaM and the resulting STRaM rules are different from traditional CE rules for STR identification, which adhere to the ASN-0002 human cell authentication standard9.
A STRaM set of STR loci
A new set of 22 STR loci (Supplementary Table 1) in human autosomes was selected from the database STRBase57 (https://strbase-archive.nist.gov), the Marshfield comprehensive human genetic maps58 (https://www.biostat.wisc.edu/~kbroman/publications/mfdmaps) and experimentally validated.
Data preprocessing
Generated by TAS for STR and edited/mutated sequences, the paired-end reads were preprocessed to retain high-quality sequences and then merged for bioinformatic analysis via the STRaM pipeline. Read quality was first evaluated with the program fastQC59. The 5' and/or 3' sequences of poor quality in the reads were trimmed with the program fastp60. Reads with poor quality (Phred quality score Q < 25) or shorter than 50 bases in length were discarded via fastp. The processed and selected paired-end reads were then merged via the program FLASH61 for better coverage of long STRs (up to 200 bp). The merging step was not needed for single-end reads.
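The discard criteria above can be illustrated with a minimal Python sketch. It is not the fastp implementation: it assumes that the Q < 25 criterion refers to the mean Phred score of a read and that qualities are Phred+33 encoded, and the function names `mean_phred` and `keep_read` are hypothetical.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_read(sequence: str, quality: str,
              min_mean_q: float = 25.0, min_length: int = 50) -> bool:
    """Return True if a read passes the length and quality thresholds.

    Assumption: "Q < 25" is interpreted as the mean quality of the read.
    """
    if len(sequence) < min_length:
        return False  # shorter than 50 bases: discarded
    return mean_phred(quality) >= min_mean_q
```

In the actual pipeline these thresholds are applied by fastp itself; the sketch only makes the two discard rules explicit.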
STR analysis
The analysis module was designed to perform genomic mapping and de novo identification of STR in sequencing reads. Merged reads (or single-end reads) were first mapped via the genome mapper BWA-MEM62 on the human reference genome GRCh38 (i.e., GCA_000001405.15_GRCh38_no_alt_analysis_set)63, which was obtained from the genome center of the National Center for Biotechnology Information (NCBI), USA (https://ftp.ncbi.nlm.nih.gov/). The STR sequences in individual mapped reads were recognized by the STR detection program of the STR-FM package19, which outputs STR information, including their read ID, STR lengths, repetitive motifs, hamming distances (the maximal number of substitutions in a repeat unit)64, mapped genomic coordinates (the chromosome names and the start positions of the repeat sequence) and raw sequences. The identities of STR loci in individual mapped reads were further determined according to their genomic coordinates. Since the repetitive motifs of STRs are prone to mutation, the misidentification of STR structures by STR analysis cannot be avoided.
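To make the idea of de novo repeat detection in a read concrete, the following Python sketch finds perfect tandem repeats with a regular expression. It is a deliberately simplified stand-in and not the STR-FM detection program: it tolerates no substitutions (hamming distance 0), and the minimum of 4 repeat copies and the function name `find_perfect_strs` are assumptions made for illustration.

```python
import re

# Shortest motif of 1-6 bases followed by at least 3 further exact copies
# (i.e., at least 4 copies in total; this minimum is an assumption).
TANDEM_REPEAT = re.compile(r"([ACGT]{1,6}?)\1{3,}")


def find_perfect_strs(read: str):
    """Yield (start, motif, copies) for each perfect tandem repeat in a read."""
    for match in TANDEM_REPEAT.finditer(read.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        yield match.start(), motif, copies
```

A single substitution inside the repeat splits or truncates the match in this sketch, which illustrates why repeat-based detection alone is prone to misidentification when motifs mutate.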
STR flanking analysis
The analytic module was designed to identify an STR in accordance with flanking sequences on both sides of the STR. Since the flanking sequences of the STR are much more stable and specific, the module generates fewer analytic errors. It utilizes the distinct alignment program Nucmer in the MUMmer package24, which performs all-vs-all comparisons of sequences with genomic alterations. For the Nucmer alignment, the two reference library files in the FASTA format (Supplementary Table 7) included reference sequences of 30 bp (and their reverse complement sequences) at the 5' and 3' flanking regions of the STR in the STRaM set. The merged reads (or single-end reads) were aligned via Nucmer to reference sequences at the 5' and 3' flanking regions of STR in the STRaM set. Note that individual reads aligned to the flanking regions of different STR were discarded, but those reads aligned to both flanking ends of the same STR were chosen and equipped with the STR genomic coordinates (based on GCA_000001405.15_GRCh38_no_alt_analysis_set, Supplementary Table 8) and lengths calculated via the equation below (Eq. (1)):
where the end positions of the 5' flanking sequences (P5e) and the start positions of the 3' flanking sequences (P3s) were extracted from Nucmer outputs.
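As a hedged illustration of the length calculation, the sketch below assumes 1-based, inclusive read coordinates, so that the STR occupies the bases strictly between the end of the 5' flank (P5e) and the start of the 3' flank (P3s). The exact form of Eq. (1) may use a different coordinate convention, and the function name is hypothetical.

```python
def str_length_from_flanks(p5e: int, p3s: int) -> int:
    """Number of bases strictly between the two flanking alignments.

    p5e: end position of the 5' flanking alignment in the read (1-based).
    p3s: start position of the 3' flanking alignment in the read (1-based).
    Assumption: length = P3s - P5e - 1 under inclusive coordinates.
    """
    if p3s <= p5e:
        raise ValueError("3' flank must start after the 5' flank ends")
    return p3s - p5e - 1
```

For example, a 5' flank ending at read position 30 and a 3' flank starting at position 51 leave 20 bases between them, i.e., five copies of a tetranucleotide motif.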
Error-sensing comparison of STR analysis and STR flanking analysis
Variants or alterations in STR or their flanking regions are common. Furthermore, PCR amplification and/or sequencing cause technical errors (or mutations) in the STR sequences. Despite their low occurrence, these genomic variants or technical errors likely generate misidentification of STR or miscalculation of their lengths when sequence-based bioinformatic tools are used. To minimize analytic errors, we formulated an error-sensing comparison of independent outputs from the STR analysis and the STR flanking analysis. This comparison involves 3 STR parameters: genomic coordinates, STR lengths, and read counts.
Since some STRs share repetitive motifs, it is possible that some STR-carrying reads were mapped to incorrect genomic locations. Therefore, a comparison of the genomic coordinates of the STR provided by the STR analysis and the STR flanking analysis was performed to ensure correct genomic mapping of the STR-carrying reads. The difference in the STR start positions (DSPs) was calculated as follows (Eq. (2)):
where the length of merged reads was 300 bases and that of single-end reads was 150 bases in this study. If both the STR analysis and the STR flanking analysis detected the same STR in an individual read, the value of DSP should be very close to 0. In other words, if the comparison of genomic coordinates revealed the same chromosome name and a DSP value between -0.5 and 0.5, the STR analysis and the STR flanking analysis were consistent. If the comparison did not show the same chromosome name or a DSP between -0.5 and 0.5, at least one of the two analyses had generated an error. Such misreads are filtered out by comparing genomic coordinates, i.e., by calculating the DSP: reads with a DSP > 0.5 or < -0.5 are excluded.
Since mutations or variants in an STR or its flanking sequences often result in analytic errors, a comparison of the STR length (CSL) between the STR analysis and the STR flanking analysis was also designed in the pipeline to determine whether the bioinformatic identification of the STR in reads was correct. Identical STR lengths from the 2 independent analyses demonstrated correct identification of the STR. Otherwise, misidentification of the STR was likely, which required further inspection of the 2 analytic modules. The CSL is reported as “True” when both modules produce the same result, in which case no error correction is needed, and as “False” when the results differ, in which case error correction is necessary. CSL discrepancies (“False” results) are often due to sequence mutations, particularly at the junctions between STRs and flanking sequences.
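The coordinate check can be sketched in Python as follows. The sketch assumes that the DSP is the difference between the STR start positions reported by the two analyses, normalized by the read length; this form is inferred from the read-length definition and the [-0.5, 0.5] window above and may differ from Eq. (2). The function names are hypothetical.

```python
def dsp(start_str: int, start_flank: int, read_length: int) -> float:
    """Difference in STR start positions, normalized by read length (assumed form)."""
    return (start_str - start_flank) / read_length


def coordinates_consistent(chrom_str: str, start_str: int,
                           chrom_flank: str, start_flank: int,
                           read_length: int = 300) -> bool:
    """True if both analyses place the STR on the same chromosome with |DSP| <= 0.5."""
    if chrom_str != chrom_flank:
        return False
    return -0.5 <= dsp(start_str, start_flank, read_length) <= 0.5
```

The default of 300 bases corresponds to merged reads; 150 would be passed for single-end reads.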
The STR identification was further validated by a comparison of the read counts provided by either the STR analysis or the STR flanking analysis. The comparison was illustrated by a function of the read count (FRC) at a particular STR locus (Eq. (3)):
The agreement between the read counts from the two analyses should be > 90%, i.e., the FRC value should fall within the range of [-0.05, 0.05]. FRC values in [-1, -0.05) or (0.05, 1] need to be checked for analysis exceptions.
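A small Python sketch of this check is given below. It assumes that the FRC is the normalized difference of the two read counts, (a - b) / (a + b), a form chosen because it is bounded by [-1, 1] and because |FRC| ≤ 0.05 then corresponds to the smaller count being at least about 90% of the larger one. The actual Eq. (3) may be defined differently, and the function names are hypothetical.

```python
def frc(count_str: int, count_flank: int) -> float:
    """Normalized read-count difference between the two analyses (assumed form)."""
    total = count_str + count_flank
    if total == 0:
        raise ValueError("no reads at this locus in either analysis")
    return (count_str - count_flank) / total


def read_counts_agree(count_str: int, count_flank: int,
                      tolerance: float = 0.05) -> bool:
    """True if the FRC lies within [-tolerance, tolerance]."""
    return abs(frc(count_str, count_flank)) <= tolerance
```

Loci failing this check would be flagged for inspection rather than silently dropped, in line with the error-sensing design described in this section.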
Errors in a comparison of 3 parameters indicate abnormal operations in either the STR analysis or the STR flanking analysis, which often resulted from mutations (or alterations) in STR or in their flanking sequences. The errors of the 2 analyses were either corrected manually, or the reads with errors were discarded. A comparison of DSP effectively identifies and locates STR loci, while the CSL serves to resolve STR sequences between the two analysis modes. The function of the FRC is to verify the correct operation of the two analysis modes (a representative STRaM analysis of HEK293FT cells is shown in Supplementary Table 9).
Amelogenin and EMS analysis
The EMS analysis module in the pipeline employed the aligner Nucmer to determine reads that contained either the amelogenin locus or the edited (mutated) sequences. For the Nucmer alignment of paired-end sequencing reads, the 2 reference sequence library files were customized with specific target sequences plus amelogenin sequences (Supplementary Table 10). The amelogenin locus was used to distinguish sex chromosomes, as the sequences in chromosomes X and Y are different with 12 point mutations and a 6 bp deletion65.
Identification and characterization of alleles and stutters
The amplification of STR alleles is often accompanied by stutter products, which contain the deletion or insertion of one or more repeat units in the newly synthesized sequences compared with the parental STR templates51,52. To establish de novo STRaM profiling, we need to distinguish true STR alleles from their stutters, which is based on the prominence-cutoff model. In the model, the repetitive sequence with the highest read counts at an STR locus was defined as an allele. From our data of 16 human cell lines, the prominence-cutoff model defines a prominent ratio (Pr) of read counts of different repetitive sequences to the highest read counts (Eq. (4)).
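The prominence-cutoff idea can be sketched in Python as follows. The ratio follows the definition in the text (read count of each repetitive sequence divided by the highest read count at the locus); the cutoff value of 0.5, the use of repeat numbers as keys, and the function names are assumptions for illustration only.

```python
from typing import Dict, List


def prominent_ratios(read_counts: Dict[int, int]) -> Dict[int, float]:
    """Pr for each repeat number: its read count divided by the highest count."""
    highest = max(read_counts.values())
    return {repeat: count / highest for repeat, count in read_counts.items()}


def call_alleles(read_counts: Dict[int, int], cutoff: float = 0.5) -> List[int]:
    """Repeat numbers whose Pr reaches the cutoff (hypothetical cutoff value)."""
    ratios = prominent_ratios(read_counts)
    return sorted(r for r, ratio in ratios.items() if ratio >= cutoff)
```

With read counts of 1000 and 900 for repeat numbers 10 and 13, and 80 and 60 for 9 and 12, only 10 and 13 would be called as alleles; the remaining sequences are candidate stutters.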
There are three types of stutters: backward stutter, forward stutter and combined stutter. The backward stutter of an STR allele contains a deletion of one or more repeat units (i.e., n-1, n-2, etc.), whereas the forward stutter contains an insertion of one or more repeat units (i.e., n + 1, n + 2, etc.). When there are two distinct alleles at an STR locus, the combined stutter is, for example, a mixture of the na + 1 stutter of one STR allele and the nb-1 stutter of another. The stutter behavior is characterized by a stutter ratio, Sr (Eq. (5)).
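As an illustration, the sketch below computes a stutter ratio under the conventional definition, namely the read count of the stutter product divided by the read count of its parental allele. Whether Eq. (5) has exactly this form is an assumption, as are the function name and the use of repeat numbers as keys.

```python
from typing import Dict


def stutter_ratio(read_counts: Dict[int, int], allele: int, offset: int = -1) -> float:
    """Sr of the stutter at allele + offset relative to the parental allele.

    offset = -1 gives the n-1 backward stutter, offset = +1 the n+1 forward stutter.
    """
    stutter_reads = read_counts.get(allele + offset, 0)
    return stutter_reads / read_counts[allele]
```

Combined stutters, which receive reads from two neighboring alleles, cannot be attributed to a single parental allele in this simple form.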
Three indices for similarity, purity, and gene modification assessments
To facilitate the interpretation and understanding of STRaM results, we developed 3 indices: SI, PI, and EMI, to assess the identity, purity, and gene modification of cell products, respectively.
The SI is formulated via published Tanabe algorithms66 with the Sørensen-Dice coefficient (Eq. (6), Fig. 1D).
It evaluates the similarity between tested cells and parental or reference cells. An SI value of ≥80% indicates that the tested cells are likely to originate from parental cells. SI is also used to detect samples with a high rate of similarity, e.g., 50% similarity (see one case in Supplementary Table 4, SI: 73.2 ± 0.0%, PI: 22.7 ± 0.0%, n = 3).
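The Tanabe (Sørensen-Dice) calculation can be sketched in Python as twice the number of shared alleles divided by the total number of alleles in the two profiles. Representing each profile as a mapping from locus name to a set of alleles, restricting the comparison to loci present in both profiles, and the function name are assumptions; STRaM alleles are sequence-based, whereas the example uses repeat numbers for brevity.

```python
from typing import Dict, Hashable, Set

Profile = Dict[str, Set[Hashable]]


def similarity_index(tested: Profile, reference: Profile) -> float:
    """SI (%) = 2 x shared alleles / (tested alleles + reference alleles) x 100."""
    loci = tested.keys() & reference.keys()
    shared = sum(len(tested[locus] & reference[locus]) for locus in loci)
    total = sum(len(tested[locus]) + len(reference[locus]) for locus in loci)
    if total == 0:
        raise ValueError("no alleles at shared loci")
    return 100.0 * 2 * shared / total
```

For two loci with alleles {11, 12} and {6, 9.3} in the tested cells and {11, 12} and {6, 7} in the reference, three of the eight alleles pair up, giving an SI of 75%, which is below the 80% threshold.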
The PI measures the degree of cross-contamination on the basis of an alteration of Sr by the STR profiles of foreign cells (Eq. (7), Fig. 1D).
In a cross-contamination situation with more than 1% foreign cells, an abnormal increase in Sr at an STR locus results from the read contribution of a foreign allele to the corresponding host stutter. An STR locus with Sr below the threshold is qualified, whereas an STR locus with Sr above the threshold is disqualified (Supplementary Table 4). The PI is then determined by the ratio of the qualified STR loci in the tested cells to those in the parental cells. The PI of cells without cross-contamination should be greater than 80%. Note that the combined stutters were precluded from the PI calculation.
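The qualification rule can be sketched in Python as follows. The sketch assumes a single Sr value per locus and a single threshold shared by all loci; the threshold value is not given in this section, so it is a required argument, and the value 0.15 in the example as well as the function name are hypothetical.

```python
from typing import Dict


def purity_index(tested_sr: Dict[str, float],
                 parental_sr: Dict[str, float],
                 threshold: float) -> float:
    """PI (%) = qualified loci in tested cells / qualified loci in parental cells x 100."""
    def qualified(sr_by_locus: Dict[str, float]) -> int:
        # A locus is qualified when its stutter ratio is below the threshold.
        return sum(1 for sr in sr_by_locus.values() if sr < threshold)

    parental_qualified = qualified(parental_sr)
    if parental_qualified == 0:
        raise ValueError("no qualified loci in the parental cells")
    return 100.0 * qualified(tested_sr) / parental_qualified
```

For instance, with a hypothetical threshold of 0.15, four qualified parental loci and one tested locus pushed to Sr = 0.30 by foreign reads, the PI is 75%, which falls below the 80% limit.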
The EMI determines the proportion of tested cells with edited or mutant genes in the total number of cells (Eq. (8), Fig. 1D) and is given by:
See the edited gene testing and transgene CAR monitoring in Supplementary Table 5.
Cell culture
The A549, HCC827, HEK293T, HS578T, K562, SW480, SW620, and 5637 cell lines were purchased from the National Collection of Authenticated Cell Cultures, Shanghai, China. The SUM159PT and T47D cell lines were purchased from Procell Life Science Technology, Hubei, China. A549 cells were cultured in Ham’s F-12K medium (Gibco, 21127022) supplemented with 10% (v/v) heat-inactivated fetal bovine serum (FBS, ExCell Bio, 12A230); HCC827 cells were cultured in RPMI 1640 medium (Sigma, R5886) supplemented with 10% (v/v) FBS, 1% (v/v) MEM nonessential amino acids (MEM-NEAA, Gibco, 11140050) and 1 mM sodium pyruvate (Gibco, 11360-070); HEK293T, HS578T, K562, SW480 and SW620 cells were cultured in DMEM (Gibco, 12800-017) supplemented with 10% FBS; 5637 cells were cultured in RPMI 1640 medium (11875093, Thermo) supplemented with 10% (v/v) FBS; SUM159PT and T47D cells were cultured in phenol-red-free DMEM/F12 (Sigma, D2906) supplemented with 10% FBS; and 293FT (Invitrogen, R70007) cells were cultured in DMEM supplemented with 10% FBS, 0.1 mM MEM-NEAA and 1 mM sodium pyruvate. All cells were cultured in a 37 °C incubator with 5% carbon dioxide (CO2). The cell line information is listed in Supplementary Table 11.
STRaM profiling
Sixteen human cell lines were subjected to genomic DNA extraction, PCR amplification and TAS. STRaM analysis was performed for STR profiles of both the ASN-0002 panel and the STRaM panel. The final STR profiles of the ASN-0002 panel were generated in accordance with CE rules, whereas those of the STRaM panel were generated in accordance with STRaM rules.
Detection of cross contamination in cell lines
The T47D cells were artificially mixed with the HS578T cells at ratios of 1:1, 50:1 and 100:1. The T47D cells were mixed with the A549 cells at ratios of 10%, 3% and 1%. The HEK293FT and HCC827 cells, and the 5637 and HS578T cells, were mixed at ratios of 1:100 or 100:1. After extraction from the mixed cells, the genomic DNA was subjected to PCR amplification and TAS. The sequenced data were processed via STRaM analysis, followed by a determination of their SI and PI for cellular cross-contamination.
Detection of mutated or edited cells from parental cells
The ability of STRaM to detect mutated or edited genes is superior to that of the gold-standard CE-based STR profiling. To distinguish between the SW480 and SW620 cell lines derived from the same patient29, the gene mutations BRMS1 73686-73687del, PTPN9 c.757C>T, KMT2B c.817G>A, and KMT2B c.875G>T were selected for STRaM as they appear only in the SW620 cell line (CCLE67, https://sites.broadinstitute.org/ccle/, and COSMIC68, https://cancer.sanger.ac.uk/cosmicdatabases). The primers used were designed to amplify the mutation regions of 3 genes (Supplementary Table 12). The PCR products of the 3 mutated genes and the STRaM set were combined for TAS and STRaM analysis. The EMI was calculated to assess the genetic difference between the 2 cell lines of the same origin.
To detect HEK293 cells with CRISPR/Cas9-mediated mutations in the PTEN gene, the mutation-carrying exon 5 of PTEN was amplified with primers (Supplementary Table 12). The PCR products of the PTEN exon 5 and the STRaM set were combined for subsequent TAS and STRaM analysis. The EMI was evaluated to reveal the genetic difference between the 3 edited HEK293 cells and parental cells. Clone 37 was generated from parental HEK293FT cells, whereas clones 3-19 and 3-68 were generated from parental HEK293T cells.
Umbilical cord (UC)-derived mesenchymal stem cells (MSCs)
Human UC-derived MSCs were isolated and cultured in accordance with a previous study69. Briefly, the UC tissues of 3 volunteers were digested to release MSCs, which were further cultured in human MSC medium (Tbdscience, SC2013-G-kit) in a 37 °C incubator with 5% CO2. Genomic DNA was extracted from both 2-3 × 10⁵ passage-4 MSCs and parental UC tissues for downstream PCR amplification with TAS and STRaM analysis.
Patient-derived organoids (PDOs)
Hepatocellular carcinoma cells were collected in cold high-glucose DMEM (Gibco, 11965092) supplemented with 100 U/mL penicillin and streptomycin. After removing the nonepithelial components, the tumor samples were cut into small pieces (approximately 1-3 mm³) and digested with 1:5 (v/v) collagenase IV (Gibco, 2383671) at 37 °C. The digests were diluted with phosphate-buffered saline (PBS, Tbdscience, PB2004Y), filtered through a 100 μm cell strainer (BIOFIL, CSS013100) and centrifuged (250 g, 3 min, 4 °C) for hepatocellular carcinoma cells. Approximately 2 × 10⁵ cells were mixed in Matrigel (Corning, 354234) as soon as possible and transferred onto culture plates (50 μL mixture per plate). After solidification in a 37 °C humidified incubator with 5% CO2 for 30 min, the cell-Matrigel mixtures were cultured with hepatocellular carcinoma organoid medium (BioGenous, K2105-HCC) in a 37 °C incubator with 5% CO2, which was changed every 3-4 days. The hepatocellular carcinoma organoids were fixed, permeabilized, and labeled with a rabbit monoclonal anti-AFP antibody (Abcam, ab133617, clone: EPAFP61), a mouse monoclonal anti-CK18 antibody (Cell Signaling Technology, 4548, clone: DC10) and DAPI. Genomic DNA was extracted from both organoid cells and parental hepatocellular carcinoma cells for subsequent STRaM analysis.
Patient-derived xenografts (PDXs)
The leukemia PDX model was approved by the Ethics Review Committee of Zhejiang University-University of Edinburgh (ZJE) Institute, Haining, Zhejiang Province, China (Approval No. ZJU20230534). To develop leukemia PDX mice, the NOD-Prkdcem26Cd52Il2rgem26Cd22/NjuCrl (NCG) mice of 4-6 weeks were purchased from the GemPharmatech Co., Jiangsu, China, and maintained in a specific-pathogen-free (SPF) animal facility. One million human acute myeloid leukemia cells resuspended in HBSS buffer (8 g/L NaCl, 0.4 g/L KCl, 1 g/L D-glucose, 60 mg/L KH2PO4, 126 mg/L Na2HPO4 ·12H2O, 0.35 g/L NaHCO3, pH: 6.72-6.73) were injected into the caudal vein of NCG mice preconditioned with busulfan70 (Sigma, B2635). Twelve weeks after transplantation, hematopoietic cells harvested from the bone marrow and spleen of the engrafted NCG mice were stained with fluorescence-conjugated monoclonal antibodies including PerCP-conjugated anti-mouse TER-119 (Biolegend, 116244, clone: TER-119, RRID: AB_2565872), eFluor506-conjugated anti-mouse CD45 (mCD45, eBioscience, 69-0451-82, clone: 30-F11, RRID: AB_2637147) and Pacific blue-conjugated anti-human CD45 (hCD45, Biolegend, 304022, clone: HI30, RRID: AB_493655) antibodies, along with 7-aminoactinomycin D (7-AAD, Sangon Biotech, A606804, CAS: 7240-37-1). The stained cells were then sorted via a BD Influx sorter (BD Biosciences) for subsequent STRaM analysis.
CAR-T cell therapy
The preparation and treatment of CAR-T cells were reported in two clinical studies37,71, and were approved by the Ethics and Review Committee of the First Affiliated Hospital, College of Medicine, Zhejiang University. CAR-T cells were produced in the Department of Hematology, the Second Affiliated Hospital, College of Medicine, Zhejiang University. Briefly, activated T-cells were cultured in complete fresh AIM V medium (Gibco, 12055091) supplemented with 10% human AB serum (Sigma, H4522), 300 IU/mL interleukin (IL)-2, 5 ng/mL IL-7 and IL-15 (PrimeGene, GMP-101), and infected with the CAR lentivirus in a 37 °C incubator with 5% CO2. The infected CAR-T cells were assessed according to clinical standards and used to treat leukemia patients. Since the number of CAR-T cells in the peripheral blood of patients increases to the highest level approximately 14 days after transplantation37, PBMCs were collected at days 5, 10, and 15 post-transplantation. Genomic DNA was extracted from CAR-T cells and PBMCs for subsequent STRaM analysis including detection of the transgene CAR. Two pairs of primers were designed for amplification of the CAR (Supplementary Table 12).
Genomic DNA Extraction
Cells (3-5 × 10^5) or 10-20 mg of clinical samples chopped into small pieces were digested in 1 mL of lysis buffer (100 mM Tris, 5 mM EDTA, 0.2% SDS, 200 mM NaCl, pH 8.0, 0.1-0.2 mg/mL proteinase K) overnight at 56°C. The lysate was slowly mixed with an equal volume of isopropyl alcohol until genomic DNA precipitates appeared. The DNA precipitates were isolated and dissolved in TE buffer (10 mM Tris, 2.5 mM EDTA, pH 8.0) preheated to 56°C. Alternatively, genomic DNA was extracted from 200 µL of frozen or fresh anticoagulated whole blood via the FastPure® Blood DNA Separation Mini Kit (Vazyme, DC111-01) in accordance with the manufacturer’s instructions. The extracted genomic DNA was quantified with a Nanodrop (Thermo, A30221) and either used immediately for PCR or stored at -80°C.
Polymerase chain reaction (PCR)
PCR primers were designed via Primer3 (https://bioinfo.ut.ee/primer3-0.4.0/) for 120-240 bp products and synthesized by Tsingke Biotechnology, Beijing, China. All primer sequences are listed in Supplementary Table 12. The DNA fragments were amplified from 50-150 ng of human genomic DNA by PCR with Platinum™ SuperFi II DNA polymerase (Thermo, 12361010) according to the manufacturer’s instructions. The 120-250 bp PCR products were resolved on a 2% agarose gel and purified via DNA purification columns (Transgen, EG101) for TAS. A DNA size marker (bands: 4360 bp, 1750 bp, 1060 bp, 690 bp, 380 bp, 240 bp and 120 bp) was used for gel electrophoresis.
Target amplicon sequencing (TAS)
The purified PCR products were combined for TAS by Genewiz Biotechnology, Suzhou, Jiangsu Province, China. Briefly, the PCR products were quantified via a Qubit dsDNA HS Assay Kit (Thermo, Q32851) according to the manufacturer’s instructions. More than 50 ng of purified PCR products per sample was then used for indexed library preparation with the VAHTS Universal Pro DNA Library Prep Kit for Illumina (Vazyme, ND608) in accordance with the manufacturer’s instructions. The libraries were purified via magnetic beads and assessed before paired-end sequencing on a NovaSeq 6000 Sequencing System (Illumina). The sequencing data were generated through image recognition and base calling via Illumina real-time analysis software, followed by base call conversion and demultiplexing with Illumina bcl2fastq software. The final sequencing data in FASTQ format were assessed and used for STRaM analysis.
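As a rough illustration of what a first-pass assessment of FASTQ data can involve, the following Python sketch parses four-line FASTQ records and reports the read count, mean read length and mean Phred quality. It assumes Phred+33 quality encoding and unwrapped (single-line) sequences. The function names and the two example reads are hypothetical and are not part of the STRaM workflow, which relies on dedicated tools on the Galaxy servers.

```python
import io
from statistics import mean


def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def summarize_fastq(handle, offset=33):
    """Return read count, mean read length and mean Phred score (Phred+33 assumed)."""
    lengths = []
    scores = []
    for _read_id, seq, qual in parse_fastq(handle):
        lengths.append(len(seq))
        scores.extend(ord(ch) - offset for ch in qual)
    return {
        "reads": len(lengths),
        "mean_length": mean(lengths) if lengths else 0.0,
        "mean_quality": mean(scores) if scores else 0.0,
    }


# Hypothetical two-read example; 'I' encodes Phred 40 under Phred+33.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGATTACA\n+\nIIIIIII\n")
summary = summarize_fastq(example)
print(summary)
```

In practice the same function could be pointed at an open FASTQ file handle instead of the in-memory example.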
Whole-genome sequencing (WGS) analysis
Thirty-five WGS datasets for human individuals were obtained from the International Genome Sample Resource (IGSR, https://www.internationalgenome.org/data-portal/sample), and are listed in Supplementary Table 13. These datasets were mapped via BWA-MEM on Galaxy servers and displayed using Integrative Genomics Viewer software (IGV, https://www.igv.org/). The STR profiles for the STRaM set were extracted to calculate the observed heterozygosity (Hobs)72 of each STR locus across the 35 individuals.
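For reference, the observed heterozygosity of a locus is the fraction of genotyped individuals that carry two different alleles at that locus. A minimal Python sketch of this calculation is given below. The data layout (a list of allele pairs per locus, with None for a missing genotype) and the example genotypes are hypothetical and are used only for illustration; alleles may be repeat numbers or full sequence strings.

```python
def observed_heterozygosity(genotypes):
    """Fraction of called genotypes at one locus that are heterozygous.

    genotypes: list of (allele_a, allele_b) tuples; None marks a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes at this locus")
    heterozygous = sum(1 for allele_a, allele_b in called if allele_a != allele_b)
    return heterozygous / len(called)


# Hypothetical repeat-number genotypes for four individuals at two loci.
example_profiles = {
    "D5S818": [(11, 12), (12, 12), (10, 13), None],
    "TPOX": [(8, 8), (8, 11), (8, 8), (9, 9)],
}
hobs = {locus: observed_heterozygosity(g) for locus, g in example_profiles.items()}
print(hobs)
```

Missing genotypes are excluded from the denominator here; whether that matches the handling used in the study is an assumption.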
The WGS datasets for 5 cell lines including 5637 (SRR8639140), A549 (SRR8639173), HCC827 (SRR8639147), SW480 (SRR8670707) and T47D (SRR8670674) were obtained from NCBI SRA67 (https://www.ncbi.nlm.nih.gov/sra, Supplementary Table 14). These SRA data were visualized and analyzed via NCBI Sequence Viewer (or IGV software). STR profiles of the STRaM set were extracted from SRA data and compared with STR profiles from TAS.
We also used 8 WGS datasets of human cells sourced from NCBI SRA (Supplementary Table 14). These WGS data were generated using three different library preparation kits32: QIAGEN QIAseq Methyl Library kit (QIAseq, SRR9888302, SRR9888307 and SRR9888314), Illumina TruSeq DNA Methylation kit (TruSeq, SRR9888304, SRR9888310 and SRR9888341) and Swift Biosciences Accel-NGS Methyl-Seq DNA Library kit (Swift, SRR9888338 and SRR9888340). These datasets were mapped via BWA-MEM on Galaxy servers and visualized using IGV software.
CE-based STR profiling
Eight human STR loci including CSF1PO, D5S818, D7S820, D13S317, D16S539, TH01, TPOX, and VWA plus the amelogenin locus were selected from the 13-STR panel recommended by the ASN-0002 human cell authentication standard9 for CE-based profiling in the study (named the ASN-0002 STR panel). The STR profiles of the human cell lines A549, HE293FT, HS578T and SW480 were determined through CE-based analysis by Genewiz Biotechnology, Suzhou, Jiangsu Province, China. Briefly, the STR panel was amplified from the genomic DNA of cells via the GenePrint 10 System with primers labeled with multiple fluorescent dyes (Promega). The fluorescent PCR products were resolved by CE using a 3730xl Genetic Analyzer (Applied Biosystems) and analyzed for STR profiles via GeneMapper 4.0 software (Applied Biosystems). The resulting STR profiles of these cell lines were compared with the reference STR profiles in the Cellosaurus Cell Line Databases9,10 and the STR profiles generated via STRaM analysis.
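To make the comparison step concrete, the sketch below scores the similarity of two length-based STR profiles with the Tanabe coefficient, a widely used measure in cell line authentication: twice the number of shared alleles divided by the total number of alleles in both profiles, expressed as a percentage. Whether this particular score was used for the comparison described above is an assumption; the function name and the allele values are hypothetical and do not correspond to any real cell line.

```python
def tanabe_similarity(query, reference):
    """Percent match between two STR profiles (Tanabe / Sorensen-Dice coefficient).

    Each profile maps a locus name to the set of alleles observed at that locus;
    a homozygous locus therefore contributes a single allele.
    Only loci present in both profiles are compared.
    """
    shared = 0
    total = 0
    for locus in query.keys() & reference.keys():
        q_alleles = set(query[locus])
        r_alleles = set(reference[locus])
        shared += len(q_alleles & r_alleles)
        total += len(q_alleles) + len(r_alleles)
    if total == 0:
        raise ValueError("profiles have no loci in common")
    return 100.0 * 2 * shared / total


# Hypothetical profiles: the query carries two alleles missing from the reference.
query_profile = {
    "CSF1PO": {10, 12},
    "D5S818": {11, 12},
    "TPOX": {8, 11},
    "Amelogenin": {"X", "Y"},
}
reference_profile = {
    "CSF1PO": {10, 12},
    "D5S818": {11, 12},
    "TPOX": {8},
    "Amelogenin": {"X"},
}
score = tanabe_similarity(query_profile, reference_profile)
print(f"{score:.1f}% match")
```

Because the score depends only on allele lengths, it cannot distinguish alleles of equal length that differ in sequence, which is the limitation of CE-based profiling noted in the Introduction.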
Statistics and Reproducibility
All experiments were performed with at least two independent biological replicates. No statistical method was used to predetermine the sample size. No data were excluded from the analyses. The experiments were not randomized. Student’s t-test was used to analyze the data via GraphPad Prism software (Fig. 4C, D; Fig. 5D, I; Supplementary Fig. 7B, C). The data are presented as mean ± SEM, as indicated in the figure legends.
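For readers who want to reproduce the summary statistics outside GraphPad Prism, the Python sketch below computes the mean ± SEM of a group and the t statistic of an unpaired, equal-variance Student's t-test. The choice of an unpaired test is an assumption, since the test variant is not specified above; the function names and measurement values are hypothetical. The P value would follow from the t distribution with the returned degrees of freedom and is left to a statistics package.

```python
from math import sqrt
from statistics import mean, stdev, variance


def mean_sem(values):
    """Return (mean, standard error of the mean) using the sample standard deviation."""
    return mean(values), stdev(values) / sqrt(len(values))


def student_t(group_a, group_b):
    """Unpaired, equal-variance Student's t statistic and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    dof = n_a + n_b - 2
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / dof
    t_stat = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t_stat, dof


# Hypothetical measurements from two groups of three replicates each.
control = [1.0, 2.0, 3.0]
treated = [2.0, 4.0, 6.0]
control_mean, control_sem = mean_sem(control)
t_stat, dof = student_t(control, treated)
print(f"control: {control_mean:.2f} ± {control_sem:.2f}; t = {t_stat:.3f}, df = {dof}")
```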
Genomic coordinates
Genomic mapping and coordinates are based on the human reference genome GRCh38 (GCA_000001405.15_GRCh38_no_alt_analysis_set).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The raw sequencing data used in this study (GSAs: HRA005387 and HRA005391) have been deposited in the Genome Sequence Archive of the National Genomics Data Center, China National Center for Bioinformatics, Chinese Academy of Sciences, Beijing, China, and are accessible through a web link (https://ngdc.cncb.ac.cn/gsa). The 35 WGS datasets for human individuals were obtained from the IGSR (https://www.internationalgenome.org/data-portal/sample). The five SRA datasets for the cell lines 5637 (SRR8639140), A549 (SRR8639173), HCC827 (SRR8639147), SW480 (SRR8670707) and T47D (SRR8670674) were obtained from the NCBI SRA. The bioinformatic workflow was established with multiple programs on the Galaxy servers, which are free to the public and accessible on the internet (https://usegalaxy.org/). An automated STRaM analysis workflow has already been uploaded to GitHub (https://github.com/STR-M2023/STRaM.git). Supplementary Figs. are available online at https://doi.org/10.1038/s42003-025-08547-1, supplementary tables are available in Supplementary Data 1 and the source data of main figures are available in Supplementary Data 2.
References
Zaaijer, S., Groen, S. C. & Sanjana, N. E. Tracking cell lineages to improve research reproducibility. Nat. Biotechnol. 39, 666–670 (2021).
Bashor, C. J., Hilton, I. B., Bandukwala, H., Smith, D. M. & Veiseh, O. Engineering the next generation of cell-based therapeutics. Nat. Rev. Drug Discov. 21, 655–675 (2022).
Sullivan, S. et al. Quality control guidelines for clinical-grade human induced pluripotent stem cell lines. Regenerative Med. 13, 859–866 (2018).
Hannan, A. J. Tandem repeats mediating genetic plasticity in health and disease. Nat. Rev. Genet. 19, 286–298 (2018).
ATCC SDO Workgroup. Authentication of Human Cell Lines: Standardization of Short Tandem Repeat (STR) Profiling - Revised 2022. U.S. patent (2022).
Budowle, B. & Sajantila, A. Short tandem repeats — how microsatellites became the currency of forensic genetics. Nat. Rev. Genet. 25, 450–451 (2024).
Zahra, A., Hussain, B., Jamil, A., Ahmed, Z. & Mahboob, S. Forensic STR profiling based smart barcode, a highly efficient and cost effective human identification system. Saudi J. Biol. Sci. 25, 1720–1723 (2018).
Takezaki, N. & Nei, M. Genetic Distances and Reconstruction of Phylogenetic Trees From Microsatellite DNA. Genetics 144, 389–399 (1996).
Almeida, J. L. & Koch, C. T. Authentication of Human and Mouse Cell Lines by Short Tandem Repeat (STR) DNA Genotype Analysis. In Assay Guidance Manual [Internet] (eds Markossian, S., Grossman, A., Brimacombe, K. et al.) (Eli Lilly & Company and the National Center for Advancing Translational Sciences, Bethesda (MD), 2004-) (2023).
Bairoch, A. The Cellosaurus, a Cell-Line Knowledge Resource. J. Biomolecular Tech. 29, 25–38 (2018).
Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods 14, 590–592 (2017).
Wang, D. et al. STRsearch: a new pipeline for targeted profiling of short tandem repeats in massively parallel sequencing data. Hereditas 157, 8 (2020).
Parson, W. et al. Massively parallel sequencing of forensic STRs: Considerations of the DNA commission of the International Society for Forensic Genetics (ISFG) on minimal nomenclature requirements. Forensic Sci. Int.: Genet. 22, 54–63 (2016).
Castro, F. et al. High‐throughput SNP‐based authentication of human cell lines. Int. J. Cancer 132, 308–314 (2012).
Demichelis, F. et al. SNP panel identification assay (SPIA): a genetic-based assay for the identification of cell lines. Nucleic Acids Res. 36, 2446–2456 (2008).
Chen, X. & Sullivan, P. F. Single nucleotide polymorphism genotyping: biochemistry, protocol, cost and throughput. Pharmacogenomics J. 3, 77–96 (2003).
Ballard, D., Winkler-Galicki, J. & Wesoły, J. Massive parallel sequencing in forensics: advantages, issues, technicalities, and prospects. Int. J. Leg. Med. 134, 1291–1303 (2020).
Fan, H. & Chu, J.-Y. A Brief Review of Short Tandem Repeat Mutation. Genomics, Proteom. Bioinforma. 5, 7–14 (2007).
Fungtammasan, A. et al. Accurate typing of short tandem repeats from genome-wide sequencing data and its applications. Genome Res. 25, 736–749 (2015).
Woerner, A. E., King, J. L. & Budowle, B. Fast STR allele identification with STRait Razor 3.0. Forensic Sci. Int.: Genet. 30, 18–23 (2017).
Warshauer, D. H. et al. STRait Razor: A length-based forensic STR allele-calling tool for use with second generation sequencing data. Forensic Sci. Int.: Genet. 7, 409–417 (2013).
Gettings, K. B. et al. Recommendations of the DNA Commission of the International Society for Forensic Genetics (ISFG) on short tandem repeat sequence nomenclature. Forensic Sci. Int.: Genet. 68 (2024).
Jalili, V. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2020 update. Nucleic Acids Res. 48, W395–W402 (2020).
Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol. 14, e1005944 (2018).
Agudo, M. M. et al. A comprehensive characterization of MPS-STR stutter artefacts. Forensic Sci. Int.: Genet. 60 (2022).
Bright, J.-A., Taylor, D., Curran, J. M. & Buckleton, J. S. Developing allelic and stutter peak height models for a continuous method of DNA interpretation. Forensic Sci. Int.: Genet. 7, 296–304 (2013).
Zhang, J. et al. Non-viral, specifically targeted CAR-T cells achieve high safety and efficacy in B-NHL. Nature 609, 369–374 (2022).
Daniel-Moreno, A. et al. CRISPR/Cas9-modified hematopoietic stem cells—present and future perspectives for stem cell transplantation. Bone Marrow Transplant. 54, 1940–1950 (2019).
Maamer-Azzabi, A., Ndozangue-Touriguine, O. & Bréard, J. Metastatic SW620 colon cancer cells are primed for death when detached and can be sensitized to anoikis by the BH3-mimetic ABT-737. Cell Death Dis. 4, e801 (2013).
Xie, H. et al. De novo assembly of human genome at single-cell levels. Nucleic Acids Res. 50, 7479–7492 (2022).
Zhou, B. et al. Comprehensive, integrated, and phased whole-genome analysis of the primary ENCODE cell line K562. Genome Res. 29, 472–484 (2019).
Zhou, L. et al. Systematic evaluation of library preparation methods and sequencing platforms for high-throughput whole genome bisulfite sequencing. Sci. Rep. 9, 10383 (2019).
Ledford, H. US cancer institute to overhaul tumour cell lines. Nature 539, 391 (2016).
Hou, X. et al. Opportunities and challenges of patient-derived models in cancer research: patient-derived xenografts, patient-derived organoid and patient-derived cells. World J. Surg. Oncol. 20 (2022).
Wang, X. & Wang, Q. Alpha-Fetoprotein and Hepatocellular Carcinoma Immunity. Can. J. Gastroenterol. Hepatol. 2018, 1–8 (2018).
Morris, K. L. et al. Circulating biomarkers in hepatocellular carcinoma. Cancer Chemother. Pharmacol. 74, 323–332 (2014).
Liu, H. et al. CD19-specific CAR T Cells that Express a PD-1/CD28 Chimeric Switch-Receptor are Effective in Patients with PD-L1–positive B-Cell Lymphoma. Clin. Cancer Res. 27, 473–484 (2021).
Wang, Z. et al. Massively parallel sequencing of 32 forensic markers using the Precision ID GlobalFiler™ NGS STR Panel and the Ion PGM™ System. Forensic Sci. Int.: Genet. 31, 126–134 (2017).
Kitayama, T. et al. Massively parallel sequencing data of 31 autosomal STR loci obtained using the Precision ID GlobalFiler NGS STR Panel v2 for 82 Japanese population samples. Leg. Med. 58 (2022).
Kojima, K., Kawai, Y., Misawa, K., Mimori, T. & Nagasaki, M. STR-realigner: a realignment method for short tandem repeat regions. BMC Genomics 17, 991 (2016).
Dolzhenko, E. et al. ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 35, 4754–4756 (2019).
Dashnow, H. et al. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol. 19 (2018).
Friis, S. L., Buchard, A., Rockenbauer, E., Børsting, C. & Morling, N. Introduction of the Python script STRinNGS for analysis of STR regions in FASTQ or BAM files and expansion of the Danish STR sequence database to 11 STRs. Forensic Sci. Int.: Genet. 21, 68–75 (2016).
Kristmundsdóttir, S., Sigurpálsdóttir, B. D., Kehr, B. & Halldórsson, B. V. popSTR: population-scale detection of STR variants. Bioinformatics 33, 4041–4048 (2017).
Mousavi, N., Shleizer-Burko, S., Yanicky, R. & Gymrek, M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 47, e90–e90 (2019).
Ganschow, S., Silvery, J., Kalinowski, J. & Tiemann, C. toaSTR: A web application for forensic STR genotyping by massively parallel sequencing. Forensic Sci. Int.: Genet. 37, 21–28 (2018).
Poethe, S.-S. et al. Cost-Effective Next Generation Sequencing-Based STR Typing with Improved Analysis of Minor, Degraded and Inhibitor-Containing DNA Samples. Int. J. Mol. Sci. 24 (2023).
Ummat, A. & Bashir, A. Resolving complex tandem repeats with long reads. Bioinformatics 30, 3491–3498 (2014).
Dashnow, H. et al. STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci. Genome Biol. 23, 257 (2022).
Liao, X. et al. Repetitive DNA sequence detection and its role in the human genome. Commun. Biol. 6, 954 (2023).
Brookes, C., Bright, J. A., Harbison, S. A. & Buckleton, J. Characterising stutter in forensic STR multiplexes. Forensic Sci. Int.: Genet. 6, 58–63 (2012).
Vilsen, S. B. et al. Stutter analysis of complex STR MPS data. Forensic Sci. Int.: Genet. 35, 107–112 (2018).
Kunz, A. et al. Optimized Assessment of qPCR-Based Vector Copy Numbers as a Safety Parameter for GMP-Grade CAR T Cells and Monitoring of Frequency in Patients. Mol. Ther. Methods Clin. Dev. 17, 448–454 (2020).
Till, B. G. et al. CD20-specific adoptive immunotherapy for lymphoma using a chimeric antigen receptor with both CD28 and 4-1BB domains: pilot clinical trial results. Blood 119, 3940–3950 (2012).
Hollyman, D. et al. Manufacturing validation of biologically functional T cells targeted to CD19 antigen for autologous adoptive cell therapy. J. Immunother. 32, 169–180 (2009).
Galaxy Community. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Res. 50, W345–W351 (2022).
Ruitberg, C. M., Reeder, D. J. & Butler, J. M. STRBase: a short tandem repeat DNA database for the human identity testing community. Nucleic Acids Res. 29, 320–322 (2001).
Broman, K. W., Murray, J. C., Sheffield, V. C., White, R. L. & Weber, J. L. Comprehensive Human Genetic Maps: Individual and Sex-Specific Variation in Recombination. Am. J. Hum. Genet. 63, 861–869 (1998).
de Sena Brandine, G. & Smith, A. D. Falco: high-speed FastQC emulation for quality control of sequencing data. F1000Research 8 (2019).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Magoč, T. & Salzberg, S. L. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27, 2957–2963 (2011).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://doi.org/10.48550/arXiv.1303.3997 (2013).
Zhang, Z. et al. Uniform genomic data analysis in the NCI Genomic Data Commons. Nat. Commun. 12 (2021).
Groult, R., Léonard, M. & Mouchard, L. Speeding up the detection of evolutive tandem repeats. Theor. Computer Sci. 310, 309–328 (2004).
Sullivan, K. M., Mannucci, A., Kimpton, C. P. & Gill, P. A rapid and quantitative DNA sex test: fluorescence-based PCR analysis of X-Y homologous gene amelogenin. Biotechniques 15, 636–641 (1993).
Tanabe, H. et al. Cell line individualization by STR multiplex system in the cell bank found cross-contamination between ECV304 and EJ-1/T24. Tissue Cult. Res. Comm. 18, 329–338 (1999).
Ghandi, M. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019).
Tate, J. G. et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 47, D941–D947 (2019).
Zhang, Y. et al. Schisandrin B Improves the Hypothermic Preservation of Celsior Solution in Human Umbilical Cord Mesenchymal Stem Cells. Tissue Eng. Regen. Med. (2023).
Montecino-Rodriguez, E. & Dorshkind, K. Use of Busulfan to Condition Mice for Bone Marrow Transplantation. STAR Protoc. 1 (2020).
Liang, Y. et al. CD19 CAR-T expressing PD-1/CD28 chimeric switch receptor as a salvage therapy for DLBCL patients treated with different CD19-directed CAR T-cell therapies. J. Hematol. Oncol. 14, 26 (2021).
Greenbaum, G., Templeton, A. R., Zarmi, Y. & Bar-David, S. Allelic richness following population founding events–a stochastic modeling framework incorporating gene flow and genetic drift. PLoS One 9, e115203 (2014).
Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012).
Highnam, G. et al. Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. Nucleic Acids Res. 41, e32–e32 (2013).
Cao, M. D. et al. Inferring short tandem repeat variation from paired-end short reads. Nucleic Acids Res. 42, e16 (2014).
Doi, K. et al. Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing. Bioinformatics 30, 815–822 (2014).
Gelfand, Y., Hernandez, Y., Loving, J. & Benson, G. VNTRseek—a computational tool to detect tandem repeat variants in high-throughput sequencing data. Nucleic Acids Res. 42, 8884–8894 (2014).
Anvar, S. Y. et al. TSSV: a tool for characterization of complex allelic variants in pure and mixed genomes. Bioinformatics 30, 1651–1659 (2014).
Kojima, K. et al. Short tandem repeat number estimation from paired-end reads for multiple individuals by considering coalescent tree. BMC Genomics 17, 494 (2016).
Tang, H. et al. Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes. Am. J. Hum. Genet. 101, 700–715 (2017).
Tankard, R. M. et al. Detecting Expansions of Tandem Repeats in Cohorts Sequenced with Short-Read Sequencing Data. Am. J. Hum. Genet. 103, 858–873 (2018).
Jønck, C. G., Qian, X., Simayijiang, H. & Børsting, C. STRinNGS v2.0: Improved tool for analysis and reporting of STR sequencing data. Forensic Sci. Int.: Genet. 48, 102331 (2020).
Fearnley, L. G., Bennett, M. F. & Bahlo, M. Detection of repeat expansions in large next generation DNA and RNA sequencing data without alignment. Sci. Rep. 12, 13124 (2022).
Acknowledgements
This study was supported by the Noncommunicable Chronic Diseases-National Science and Technology Major Project (Grant No. 2023ZD0501300) to W.G. and W.B.Q.; the National Natural Science Foundation of China (Grant Nos. 8217110405 and 82173351), the National Key Research and Development Program of China (Grant No. 2018YFA0107800), Jiangxi University of Chinese Medicine Innovation Team Funding (Grant No. CXTD22018) and Jiangxi University of Chinese Medicine Cancer Research Center Startup Funds (Grant No. 12418008) to R.H.; Zhejiang University Startup Funds to W.G.; and the Science and Technology Program of Guangzhou (Grant No. 2024B03J0022) to Q.P. We also thank Dr. Di Chen at the Zhejiang University-University of Edinburgh Institute (ZJE), Zhejiang University, Haining, China, for assistance with the H1 embryonic stem cell line; Dr. Mikael Bjorklund at ZJE for the hTERT-RPE1 cell line; Dr. Wanlu Liu at ZJE for the H9 embryonic stem cell line; and Dr. Xin Xie at ZJE for the LOVO cell line.
Author information
Contributions
Conception, design & development of methodology: B.L., W.G., R.H.; Data analysis and interpretation (PCR, statistical analysis, biostatistics, computational analysis, etc.): B.L., C.L., Y.Q.Z., Y.W.W., K.C.Z., Y.L., K.Y.Z., W.G.; Data acquisition (cell culture, acquired and managed patient samples, etc.): B.L., W.L., Y.Q.Z., Y.Z., Y.W.W., Y.Y.W., K.C.Z., W.Z., J.S., W.Y., S.M., X.L., L.L., H.C., S.Z., Y.T., Q.P., W.Q.; Administrative, technical, material support (i.e., reporting or organizing data, constructing databases): B.L., W.L., Y.Q.Z., Q.P., W.Q., W.G., R.H.; Writing and manuscript revision: B.L., R.H.; Study supervision: W.G., R.H. Correspondence and requests for materials should be addressed to Profs. Wei Guo & Ray P.S. Han.
Ethics declarations
Competing interests
The authors declare no competing interests. A patent has been filed for this work, Methods and Applications of Genomic Genetic Markers for the Identification and Tracking of Human Samples (202310347891.7), by W. Guo & Ray P.S. Han, Zhejiang University-University and Jiangxi University of Chinese Medicine.
Peer review
Peer review information
Communications Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Ophelia Bu. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Graphics software The authors used the SRplot online platform to generate ROC curves.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, B., Le, C., Lei, W. et al. STRaM: A genetic framework for improved cell product provenance for research and clinical translations. Commun Biol 8, 1232 (2025). https://doi.org/10.1038/s42003-025-08547-1
DOI: https://doi.org/10.1038/s42003-025-08547-1









