A pluripotent stem cell atlas of multilineage differentiation

Shen, Sophie; Werner, Tessa; Chiu, Han Sheng; Chen, Xiaoli; Nguyen, Quan; Palpant, Nathan J.

doi:10.1038/s41597-025-05549-w

Data Descriptor
Open access
Published: 15 July 2025

Sophie Shen ORCID: orcid.org/0000-0002-1867-050X¹,
Tessa Werner¹,
Han Sheng Chiu¹,
Xiaoli Chen¹,
Quan Nguyen¹ &
Nathan J. Palpant¹

Scientific Data volume 12, Article number: 1238 (2025)

Abstract

Human pluripotent stem cells offer a scalable platform to study genetic and signalling mechanisms governing cell lineage decisions during differentiation. Genome-wide and single-cell transcriptomics technologies likewise offer high-throughput analysis of heterogeneous cell differentiation states. While in vivo development has been extensively characterised using these technologies, there remains a need for comprehensive single-cell transcriptomic profiling of stem cell differentiation from pluripotency. Understanding gene expression changes governing differentiation in vitro is key to developing high fidelity differentiation protocols and understanding fundamental mechanisms of development. We generated a single-cell RNA sequencing time course to study the role of developmental signalling pathways on multilineage diversification from pluripotency in vitro. The combined dataset of over 60,000 cells spans cell types from a time course of differentiation across all germ layers, ranging from gastrulation cell states to progenitor and committed cell types. These data provide a diverse benchmarking reference point to compare against in vivo development and advance understanding of signalling regulation of differentiation, providing insights into protocol development, drug screening, and regenerative medicine applications.

Background

Cell differentiation is governed by sequential signalling cues that coordinate gene expression changes to guide specification of functionally distinct cell types. Human pluripotent stem cells (hPSCs) represent a controlled and scalable platform to study this process in a human system, providing models and products with broad potential applicability across drug screening, disease modelling, and cell therapy^1,2. Numerous differentiation protocols have been designed to derive diverse cell types from pluripotency through modulation of molecular, temporal, and structural parameters of culture conditions to achieve differentiated cell phenotypes in a dish³. There is, however, little coordination in the field for controlled benchmarking of each parameter and its effects on differentiation of cell lineages⁴, limiting reproducibility and optimisation of protocols for therapeutic use. We aim to address this by providing a study that evaluates the effects of a restricted set of signalling cues at the germ layer stage of differentiation, while controlling for all other variables. We harness the opportunity provided by high-throughput transcriptomics to characterise mesendoderm-directed multilineage differentiation and provide an in vitro reference point to benchmark against the many existing in vivo single-cell transcriptomic atlases of development^5,6. This dataset serves as a resource to better understand early human lineage trajectories and to inform the design of robust and scalable differentiation protocols relevant to applications for regenerative medicine.

We provide a two-part dataset. The first captures temporal regulation of gene expression changes during differentiation of human induced pluripotent stem cells (hiPSCs) into mesendoderm cell types over eight consecutive days of differentiation, spanning mesendodermal cells at day 2 to committed cell types at day 9 of differentiation (Fig. 1a). The second dataset interrogates the role of WNT, BMP4, and VEGF signalling pathways during differentiation through introduction of small molecules or recombinant proteins targeting these pathways at the germ layer stage of differentiation. The dataset captures the cells prior to treatment (day 2), and the resulting progenitor (day 5) and committed cell types (day 9) following these signalling perturbations (Fig. 1b & Table 1). The data are provided as two separate scRNA-seq datasets, as well as an integrated dataset with cell type annotation and characterisation. Our companion manuscript⁷ demonstrates the utility of this integrated in vitro dataset in revealing a novel role for the WNT signalling regulator TMEM88 in cardiovascular development in vivo. In the same manner, this dataset can be further interrogated to uncover signalling-, temporal-, and lineage-specific regulators of differentiation. Overall, this single-cell transcriptomic atlas of hiPSC differentiation provides a starting point for discovery and for benchmarking against in vivo development, facilitating protocol development with relevance ranging from fundamental understanding of cell lineage regulation to applications in cell therapy and synthetic biology.

Table 1 Sample assignment in signalling perturbation dataset.

Methods

Cell lines and maintenance

All human pluripotent stem cell studies were carried out in accordance with consent from the University of Queensland’s Institutional Human Research Ethics approval (HREC# 2015001434). Undifferentiated hiPSCs were cultured on Vitronectin XF (STEMCELL Technologies #07180)-coated plates in mTeSR1 media (STEMCELL Technologies #05850) with supplement at 37 °C with 5% CO2.

The cell line used to generate the time course dataset was a WTC CRISPRi TMEM88-g2.3 GCaMP hiPSC line (Karyotype: 46, XY; RRID: CVCL_VM38; generously provided by M. Mandegar and B. Conklin, Gladstone Institute, UCSF), generated as previously described⁸. In brief, a doxycycline-inducible pQM-u6g-CNKD construct containing guide RNA targeting the TMEM88 transcriptional start site was transfected into WTC CRISPRi GCaMP hiPSCs using the GeneJuice protocol (Novagen). Cryopreserved WTC CRISPRi TMEM88-g2.3 GCaMP hiPSC samples were tested for copy number variants using the STEMCELL Technologies hPSC Genetic Analysis Kit (STEMCELL Technologies #07550), carried out by StemCore (Brisbane, Australia). For this study, the cells were not exposed to doxycycline and were treated as transcriptionally wildtype. For the signalling perturbation dataset, eighteen custom barcoded hiPSC lines were used. To generate the barcoding lines, WTC-WT11 hiPSCs (Gladstone Institute of Cardiovascular Research, UCSF; Karyotype: 46, XY; RRID: CVCL_Y803, generated as previously described^9,10) were edited to enable expression of a barcoded GFP transcript driven by a CAG promoter at the human AAVS1 safe-harbour locus via CRISPR-Cas9 genome editing. These barcoding lines thus facilitate simple multiplexing of isogenic hiPSC scRNA-seq samples. 15 cells per barcoding line were karyotyped as a professional service by Sullivan Nicolaides Pathology (see accompanying manuscript⁷ for more details and quality control of the genomic barcoding strategy).

Mesendoderm-directed differentiation

On the day prior to differentiation (day −1), cells were dissociated using 0.5 mM EDTA solution and seeded onto separate coated plates in mTeSR1 pluripotency media with ROCK Inhibitor (STEMCELL Technologies #72308) and cultured overnight. Once the cells formed an ~80% confluent monolayer, differentiation was induced (day 0) by changing the culture media, with a PBS wash, to RPMI (ThermoFisher, #11845119) containing 3 µM CHIR99021 (STEMCELL Technologies, #72054), 500 µg/mL BSA (Sigma #A9418), and 213 µg/mL ascorbic acid (Sigma #A8960). On days 3 and 5, the media was replaced with the same cocktail excluding CHIR99021. On day 7 and every second day thereafter, cultures were fed with RPMI containing 1x B27 supplement plus insulin (ThermoFisher #17504001). See also Fig. 1a.

Single-cell RNA sequencing for control time course dataset

Mesendoderm-directed differentiation as described above was induced in a separate plate for each collection time point (days 2–9 of differentiation; Fig. 1a), with one biological and technical replicate per time point. Cells were dissociated using 0.5 mM EDTA in 2.5% Trypsin (ThermoFisher, #15400054) and neutralised with 50% foetal bovine serum (GE Healthcare Life Sciences, #SH30084.03) in DMEM/F12 media (Sigma #11320033). 1e6 cells from each sample were labelled with a different TotalSeq-A cell hashing antibody (BioLegend antibodies A2051-8) as per the recommended protocol¹¹ and sorted for viability on a BD Influx Cell Sorter (BD Biosciences) using propidium iodide. 5e5 live cells per time point were collected and pooled for a Chromium Single Cell 3′ V3 (10x Genomics) reaction following the manufacturer’s protocol, targeting 2e4 cells. The fraction of droplets containing the Cell Hashing Oligonucleotide (HTO)-derived cDNA was captured using HTO additive primers and Illumina TruSeq DNA D7xx_s primers according to the cell hashing protocol. Gene expression libraries were sequenced on an Illumina NovaSeq 6000 and the cell hashing libraries were sequenced on an Illumina NextSeq 550Dx. The Cell Ranger software (10x Genomics, version 3.0.2) was used to demultiplex base calls into fastq files, map the reads to GRCh38, and derive gene expression count matrices. CITE-seq-Count¹² (version 1.4.2) was used to quantify reads mapped to Cell Hashing HTOs for sample demultiplexing.

Single-cell RNA sequencing for signalling perturbation dataset

Mesendoderm-directed differentiation was performed as described above in separate plates for each time point, treatment group, and biological replicate (Fig. 1a). To allow for biological duplicates of each treatment-time point pair, 18 barcoded cell lines were used (see Methods: Cell lines and maintenance, and Table 1). On day 2 of differentiation, one sample per cell line was collected for scRNA-seq to serve as an untreated reference point. For the remaining samples, cells were treated with a small molecule or recombinant protein signalling perturbation as detailed in Table 1, added to the existing media cocktail introduced on day 3 of differentiation and removed on day 5. Resulting cells were collected on days 5 and 9 of differentiation for scRNA-seq.

For cell collection, cells were dissociated using 0.5 mM EDTA in 2.5% Trypsin (ThermoFisher, #15400054) and neutralised with 50% foetal bovine serum (GE Healthcare Life Sciences, #SH30084.03) in RPMI media (ThermoFisher, #11845119). 5e5 cells per group were pooled into three sample pools (as per Table 1, where each “Library” corresponds to a sample pool) and sorted for viability on a BD Influx Cell Sorter (BD Biosciences) using propidium iodide, compensating for the GFP signal. Chromium Single Cell 3′ v3 (10x Genomics) reactions were performed for each sample pool according to the manufacturer’s protocol, targeting 2e4 cells. Gene expression libraries were prepared according to the manufacturer’s protocol, and an additional aliquot of amplified full-length cDNA was used to generate a barcoding library for each sample pool by amplifying the cDNA regions containing the barcoding cassette and appending partial P5 and P7 sequencing adaptors. A single pool was prepared from the three gene expression and three barcoding libraries for sequencing. The samples were pooled equimolar within each library type and combined so that the gene expression libraries together made up 90% of the pool, and the barcoding libraries 10%. Samples were sequenced on an Illumina NovaSeq 6000 using a NovaSeq 6000 S4 Kit V1 (200 cycles) (Illumina, #20027466). Base call demultiplexing, fastq file generation, and read mapping to GRCh38 to derive gene expression and barcode count matrices were performed using the Cell Ranger pipeline (10x Genomics, version 3.1.0).
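The pooling arithmetic above can be sketched as follows. This is an illustration only: it assumes three gene expression and three barcoding libraries, equimolar pooling within each library type, and a 90%/10% split between types, as stated in the text. The function name and library labels are hypothetical and do not come from the authors' code.

```python
def pool_fractions(n_gex=3, n_bc=3, gex_share=0.90):
    """Fraction of the final sequencing pool contributed by each library.

    Assumes equimolar pooling within each library type (hypothetical helper).
    """
    bc_share = 1.0 - gex_share
    fractions = {}
    for i in range(1, n_gex + 1):
        fractions[f"GEX_lib{i}"] = gex_share / n_gex
    for i in range(1, n_bc + 1):
        fractions[f"BC_lib{i}"] = bc_share / n_bc
    return fractions


fractions = pool_fractions()
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

Under these assumptions each gene expression library makes up 30% of the pool and each barcoding library roughly 3.3%.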

Sample demultiplexing, cell filtering and data pre-processing

Sample demultiplexing, filtering, and pre-processing were performed on the two datasets separately. The HTODemux function in the Seurat R package¹³ was applied to the HTO and barcode count matrices for their respective datasets, using the default 0.99 quantile cutoff to determine the dominant sample barcode for each cell and identify negative and doublet cells based on their sample barcode counts. All cells annotated as “negative” based on the HTODemux output were removed, as their sample of origin cannot be retrieved. We also used the three in silico doublet detection methods in the scds R package¹⁴ to identify cell barcodes likely to be doublets based on their transcriptomic features, so that doublets that share the same sample barcode could also be identified. Cells labelled as doublets by three or more methods (including sample barcoding) were removed. Finally, cell filtering based on the raw expression matrices was performed to retain only high quality cells. For the time course dataset, retained cells had library sizes between 10,000 and 100,000 reads, between 2,500 and 10,000 features, fewer than 20% of reads mapping to mitochondrial genes, and fewer than 45% mapping to ribosomal genes. For the signalling perturbation dataset, retained cells had library sizes between 5,000 and 50,000 reads, between 2,000 and 7,500 features, and mitochondrial reads making up fewer than 25% of all reads in each cell. Data normalisation (NormalizeData), scaling (ScaleData), PCA (RunPCA), and UMAP (RunUMAP) dimensionality reduction as part of the default Seurat pipeline were performed on each dataset separately for preliminary visualisation (Fig. 2c,d). For evaluation of replicate similarity (Fig. 2e), we averaged the gene expression of cells from each individual sample (barcoding line/time point/treatment combination, Table 1) and calculated the Pearson correlations between all sample pairs, visualising the results using the pheatmap R package.
See also scripts 1_preprocess_timecourse.R, 2_preprocess_signalling.R, 3_FIG2.R.
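As a minimal sketch of the quality filters described above, the thresholds for each dataset can be written as a small configuration and applied per cell. This is not the authors' R code (see the scripts listed above). The field names are hypothetical, and treating the "between" bounds as inclusive and the percentage bounds as strict is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QCThresholds:
    min_counts: int
    max_counts: int
    min_features: int
    max_features: int
    max_pct_mito: float
    max_pct_ribo: Optional[float] = None  # None means no ribosomal filter


# Threshold values are taken from the text above.
TIME_COURSE = QCThresholds(10_000, 100_000, 2_500, 10_000, 20.0, 45.0)
SIGNALLING = QCThresholds(5_000, 50_000, 2_000, 7_500, 25.0)


def passes_qc(cell, thresholds):
    """Return True if a cell (dict of per-cell QC metrics) passes all filters."""
    if not thresholds.min_counts <= cell["n_counts"] <= thresholds.max_counts:
        return False
    if not thresholds.min_features <= cell["n_features"] <= thresholds.max_features:
        return False
    if cell["pct_mito"] >= thresholds.max_pct_mito:
        return False
    if thresholds.max_pct_ribo is not None and cell["pct_ribo"] >= thresholds.max_pct_ribo:
        return False
    return True


# Toy cells with made-up metrics, for illustration only.
cells = [
    {"id": "cell_a", "n_counts": 20_000, "n_features": 4_000, "pct_mito": 5.0, "pct_ribo": 20.0},
    {"id": "cell_b", "n_counts": 8_000, "n_features": 3_000, "pct_mito": 5.0, "pct_ribo": 20.0},
    {"id": "cell_c", "n_counts": 20_000, "n_features": 4_000, "pct_mito": 22.0, "pct_ribo": 20.0},
]
kept_time_course = [c["id"] for c in cells if passes_qc(c, TIME_COURSE)]
kept_signalling = [c["id"] for c in cells if passes_qc(c, SIGNALLING)]
```

With these toy values, only `cell_a` passes the time course thresholds, while all three cells pass the more permissive signalling perturbation thresholds.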

Dataset integration and cell type annotation

Raw count matrices from the two scRNA-seq datasets were used as input for data integration using the RCPI method (scMultiIntegrate) as part of the RISC R package¹⁵, following the recommended pipeline for pre-processing. Genes expressed in both datasets were used to generate 15 gene eigenvectors using the signalling perturbation dataset as the reference dataset to perform integration, returning 50 principal components for further analysis. UMAP dimensionality reduction (scUMAP) of the integrated data was also performed using the RISC package, using 15 components of the ‘PLS’ embeddings as recommended for the integrated values. Cell type clustering (FindClusters for resolutions 0.1, 0.2, 0.3 and 0.4) (Fig. 4a) was performed with the Seurat package, using the first 10 PLS embeddings from the RCPI integration as the reduced dimension input instead of the default PCA reductions. The Seurat function FindAllMarkers was used to find differentially expressed genes for each cluster at the 0.3 cluster resolution. For GO term and KEGG pathway enrichment analyses, the top 100 most significantly differentially expressed genes for each cluster (ranked by “p_val_adj” then “avg_log2FC”) were used as input for the enrichGO and enrichKEGG functions from the clusterProfiler R package¹⁶. For label transfer analysis to align our atlas dataset against in vivo development, we used the Seurat label transfer pipeline. First, both query and reference datasets underwent TRIAGE-transformation^17,18 to prioritise the biological signal from cell type-specific regulatory genes before input into the FindTransferAnchors and TransferData Seurat functions with default parameters. The label transfer annotations for each cell were then filtered by prediction score, with a threshold selected for each reference dataset separately based on the observed score distribution (0.4 for the mouse organogenesis¹⁹ dataset, 0.3 for the human gastrulation²⁰ dataset, and 0.05 for the prenatal mouse⁶ dataset).
Cells that did not pass the score threshold are annotated as NA in the metadata and Fig. 4d. See also scripts 4_rcpi_integration.R, 6_clustering_annotation.R, 7_label_transfer.R.
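The score-based filtering of label transfer results can be illustrated with a short sketch. The cutoffs are those given in the text. The reference keys, the function name, the use of `None` to stand in for NA, and the choice of `>=` rather than `>` at the boundary are assumptions made for illustration.

```python
# Prediction score cutoffs per reference dataset, as stated in the text.
SCORE_CUTOFFS = {
    "mouse_organogenesis": 0.4,
    "human_gastrulation": 0.3,
    "mouse_prenatal": 0.05,
}


def filter_prediction(label, score, reference):
    """Keep a transferred label only if its prediction score reaches the cutoff.

    Returns None (standing in for NA) otherwise. Hypothetical helper.
    """
    if reference not in SCORE_CUTOFFS:
        raise KeyError(f"unknown reference dataset: {reference}")
    return label if score >= SCORE_CUTOFFS[reference] else None


# Toy predictions with made-up labels and scores.
kept = filter_prediction("Cardiomyocytes", 0.62, "mouse_organogenesis")
dropped = filter_prediction("Endothelium", 0.21, "human_gastrulation")
```

Here `kept` retains its label and `dropped` becomes `None`, mirroring the NA annotation in the metadata.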

Data Records

All raw and processed data have been uploaded to the NCBI Gene Expression Omnibus repository (accession ID: GSE279710²¹). The portions of the record described in this manuscript are labelled as “Signalling perturbations library” 1–3 (Samples GSM8578510-15), “Control time course” (Samples GSM8578516-17), and the supplementary (processed) files with the prefix “GSE279710_integrated_sig_time”. Raw mapped read data (fastq files) are provided in “GSE279710_RAW.tar”.

For the signalling perturbation dataset, there is one sample for the barcoding sequencing library (labelled as “BC”; e.g. GSM8578510 for library 1) and one sample for the gene expression sequencing data (labelled as “GEX”; e.g. GSM8578511 for library 1). Both the BC and GEX libraries are provided as three separate files: cell barcodes (e.g. GSM8578510_raw_barcode_sig1_barcodes.tsv.gz), gene (or feature) names (e.g. GSM8578510_raw_barcode_sig1_features.tsv.gz), and the gene expression count matrix (e.g. GSM8578510_raw_barcode_sig1_matrix.mtx.gz). Cell barcode files are a single column containing a list of all cell barcodes in the dataset. Feature files contain three columns: Ensembl ID, gene symbol, and library type (“Gene Expression”), where the BC files contain the barcode names in the first two columns, and the GEX files also have all 20 barcodes appended to the end of the features file, prefixed with “addSeq_”. The count matrix is a sparse matrix of raw read counts with columns as cell barcodes and rows as genes, in the same order as they appear in the corresponding barcode and feature files.
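A minimal sketch of loading one such file triplet with only the Python standard library is given below. It assumes the files follow the naming pattern shown above (`<prefix>_barcodes.tsv.gz`, `<prefix>_features.tsv.gz`, `<prefix>_matrix.mtx.gz`), that the feature file is tab-separated, and that the matrix uses the Matrix Market coordinate format with integer counts. The function names are hypothetical.

```python
import gzip


def read_lines(path):
    """Read a gzipped text file into a list of non-empty lines."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def read_10x_triplet(prefix):
    """Load barcodes, features, and sparse counts for one sample.

    Returns (barcodes, features, counts), where counts maps
    (feature_index, barcode_index) -> count using 0-based indices.
    """
    barcodes = read_lines(f"{prefix}_barcodes.tsv.gz")
    features = [tuple(line.split("\t")) for line in read_lines(f"{prefix}_features.tsv.gz")]

    counts = {}
    with gzip.open(f"{prefix}_matrix.mtx.gz", "rt") as handle:
        line = handle.readline()
        while line.startswith("%"):  # skip the header and comment lines
            line = handle.readline()
        n_rows, n_cols, _n_entries = (int(x) for x in line.split())
        for line in handle:
            if not line.strip():
                continue
            row, col, value = line.split()
            # Matrix Market indices are 1-based: rows are features, columns are cells.
            counts[(int(row) - 1, int(col) - 1)] = int(value)

    if n_rows != len(features) or n_cols != len(barcodes):
        raise ValueError("matrix dimensions do not match the feature/barcode files")
    return barcodes, features, counts
```

For example, calling `read_10x_triplet("GSM8578510_raw_barcode_sig1")` in a directory containing the three downloaded files would load the barcoding library for library 1. A dictionary is used here only to keep the sketch dependency-free; a sparse matrix library would be more practical for full-size data.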

Note that for the signalling perturbation dataset, the barcode IDs were initially BC03-BC20 and renamed to be BC01-BC18 in the manuscript files. The initial numbering is reflected in the raw matrix files (Samples GSM8578510, 12 & 14), but not in the supplementary (processed) files in the record.
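When working from the raw matrix files, the initial barcode IDs may need to be translated to the manuscript numbering. The sketch below assumes that the renaming preserves order, so that BC03 maps to BC01 and BC20 maps to BC18. The text gives only the two ranges, so this mapping is an assumption that should be checked against the processed metadata before use. The function name is hypothetical.

```python
def manuscript_barcode_id(raw_id, offset=2):
    """Translate an initial barcode ID (BC03-BC20) to manuscript numbering.

    ASSUMPTION: the renaming is order-preserving (BC03 -> BC01, ..., BC20 -> BC18).
    """
    if not raw_id.startswith("BC") or not raw_id[2:].isdigit():
        raise ValueError(f"unexpected barcode ID: {raw_id}")
    number = int(raw_id[2:])
    if not 3 <= number <= 20:
        raise ValueError(f"barcode ID outside BC03-BC20: {raw_id}")
    return f"BC{number - offset:02d}"


renamed = [manuscript_barcode_id(f"BC{n:02d}") for n in range(3, 21)]
```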

The raw files for the time course dataset are organised the same as the signalling perturbation dataset, where there is one sample for the sample multiplexing library (“Control time course, HTO”, sample GSM8578516), and one for the gene expression (“Control time course, GEX”, sample GSM8578517), each containing a cell barcode, feature, and count matrix file. In the time course GEX sample, the sample (HTO) barcodes are not included in the features and count matrix.

The supplementary files (i.e. the processed files with “GSE279710_integrated_sig_time” prefixes) also include a gene file (GSE279710_integrated_sig_time_genes.txt.gz) and a count matrix file (GSE279710_integrated_sig_time_norm.mtx.gz), where the count matrix is a sparse matrix containing the read counts after integration and normalisation as described in the Methods. Instead of a cell barcode file, a cell metadata file is provided (GSE279710_integrated_sig_time_meta.txt.gz) with the following columns providing extra information for each cell: “cell_barcode”; “UMAP1” & “UMAP2” (coordinates used in Fig. 4a); “run” (sequencing sample, where “0Xav” represents the time course dataset and “lib1”, “lib2”, and “lib3” represent the three libraries in the signalling perturbation dataset (see Table 1)); “day” (time point); “treatment” (where “no_treatment_tc” represents the time course dataset, and details about the signalling perturbations are provided in Table 1); as well as several columns related to the TRIAGE-Cluster analysis introduced in our main manuscript^7,18,22 (columns: “peak_cluster” (TRIAGE-Cluster cluster assignments), “PijuanSala_LabelTransfer” (filtered label transfer assignments with reference to a scRNA-seq dataset of mouse organogenesis¹⁹), “Hutchins_CTDomains” (filtered gene set scoring assignments using gene sets derived from RNA-seq of mouse embryonic domains²³), “Tyser_GastrulaLT” (filtered label transfer assignments with reference to a scRNA-seq dataset of human gastrulation²⁰), “Annotation” (cell type label as seen in our main manuscript⁷), and “Anchor Gene” (high scoring TRIAGE gene used to guide TRIAGE-clustering²²)).

Technical Validation

Single-cell RNA sequencing and quality control

We generated the time course control and signalling perturbation scRNA-seq datasets separately, with the former captured in one Chromium Single Cell 3′ V3 (10x Genomics) reaction using Cell Hashing¹¹ for sample multiplexing and minimisation of between-reaction batch effects. This dataset comprises eight time points separated by 24 hours from day 2 to 9 of in vitro mesendoderm differentiation, with one biological replicate per time point (eight samples, with one Cell Hashing antibody each) (Fig. 1a). We note that the cell line used for the time course experiment was found to have a 1.5-fold expression increase at the Chr20q11.21 locus compared to validated diploid controls, indicating a copy number expansion which could have an impact on phenotype (see Methods).

The signalling perturbation dataset captured cells across three Chromium reactions, profiling the effects of perturbing differentiation with each of eight small molecules or recombinant proteins that modulate key developmental signalling pathways. Differentiating cells were captured at day 2 (mesendoderm cells prior to treatment), day 5 (progenitor cell types immediately following treatment), and day 9 (resulting definitive cell types), with two biological replicates, facilitated by use of eighteen isogenic barcoded cell lines for sample multiplexing to minimise costs and sequencing batch effects⁷ (Fig. 1a). Biological duplicates were collected and sequenced in different libraries to further allow for identification of between-library batch effects in data processing (Table 1). Comprehensive validation showing karyotypic and phenotypic normality of all isogenic barcoded cell lines is provided in our accompanying study⁷. After single-cell capture and RNA sequencing, the Cell Ranger pipeline (10x Genomics, see Methods) was used to generate fastq files and map reads to the GRCh38 genome to derive four gene expression matrices (one per reaction/sequencing library). Detailed read and alignment quality metrics are provided in Table 2. Standard scRNA-seq quality control metrics were assessed based on these gene expression matrices, demonstrating comparable metrics between the four libraries (Table 3).

Table 2 Sequencing quality control metrics.

Table 3 Transcriptome quality control metrics and filtering.

As both multiplexing methods relied on quantification of reads mapped to each sample barcode for sample demultiplexing (see also Methods), we performed tSNE dimensionality reduction of the cells based on their sample barcode expression (cell hashing oligonucleotide (HTO) barcodes for the time course dataset and genomic barcodes for the signalling perturbation dataset) (Fig. 2a,b). This, alongside the HTODemux function in the Seurat package¹³, allowed identification of cells with multiplet or negative barcode reads (see accompanying study⁷ for a more in-depth assessment of genomic barcoding quality), which were removed. Scds¹⁴, a transcriptome-based multiplet detection tool, was also used to identify doublets via three different algorithms (Table 3). Combining the barcode-based and three transcriptome-based doublet detection methods, we removed cells predicted to be doublets by three or more methods, with the predicted doublets being assigned the barcode with the highest expression. Finally, cell filtering was performed to remove low quality, negative, and stressed cells based on their total read count, total gene count, mitochondrial, and ribosomal read content (Table 3 & Methods).
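The consensus rule for doublet removal can be sketched as a simple vote across the four calls per cell. The method keys below are illustrative labels for the barcode-based call and the three scds algorithms, not identifiers taken from the authors' scripts, and the function name is hypothetical.

```python
# One barcode-based call plus three transcriptome-based calls per cell.
METHODS = ("barcode", "cxds", "bcds", "hybrid")


def is_consensus_doublet(calls, min_votes=3):
    """Return True if at least `min_votes` of the methods flag the cell.

    `calls` maps each method name in METHODS to a boolean doublet call.
    """
    missing = [m for m in METHODS if m not in calls]
    if missing:
        raise KeyError(f"missing doublet calls for: {missing}")
    votes = sum(1 for m in METHODS if calls[m])
    return votes >= min_votes


# Toy example with made-up calls.
flagged = is_consensus_doublet({"barcode": True, "cxds": True, "bcds": True, "hybrid": False})
retained = not is_consensus_doublet({"barcode": True, "cxds": False, "bcds": True, "hybrid": False})
```

A cell flagged by only two methods is retained under this rule, which keeps the filter conservative with respect to any single method's false positives.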

After filtering, the libraries had between 13,000 and 16,600 high quality singlets remaining (Table 3). We performed data scaling, normalisation, and dimensionality reduction on the time course and signalling perturbation datasets separately. UMAP representation of the time course dataset confirmed co-localisation of cells from the same time point (Fig. 2c). The same is true for the signalling perturbation dataset, where the UMAP visualisation and pairwise Pearson correlation between mean expression profiles of each sample indicate that biological replicates from the same time point (see also Table 1) but captured in different libraries have highly similar transcriptomic profiles that distinguish them from the other time points (Fig. 2d,e). While all samples at day 2 are treated identically, and largely form a single transcriptional cluster, we note that two samples (BC04 and BC15) appear to be more transcriptionally distinct, reflecting modest baseline variation between lines prior to treatment (Fig. 2d).
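The replicate comparison described above (mean expression per sample followed by pairwise Pearson correlation) can be sketched in plain Python as follows. The toy expression values are invented for illustration, the helper names are hypothetical, and the authors' own analysis was carried out in R (see Methods).

```python
import math


def mean_profile(cells):
    """Average expression per gene across a list of cells (lists of equal length)."""
    n_cells = len(cells)
    n_genes = len(cells[0])
    return [sum(cell[g] for cell in cells) / n_cells for g in range(n_genes)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)


# Toy data: two replicate samples, three cells each, four genes (made-up values).
sample_a = [[5, 0, 2, 9], [4, 1, 3, 8], [6, 0, 2, 10]]
sample_b = [[5, 1, 2, 8], [5, 0, 3, 9], [4, 1, 2, 9]]
replicate_r = pearson(mean_profile(sample_a), mean_profile(sample_b))
```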

Data integration

To address potential batch effects and improve comparability between the two datasets, we performed data integration using the RCPI algorithm¹⁵. Figure 3a shows principal components analysis (PCA) and UMAP representations of the combined datasets before integration, where the two datasets are almost entirely non-overlapping and transcriptionally distinct. After data integration, cells from the shared day 2, 5, and 9 time points co-localise. As anticipated, the cell types in the time course dataset are represented in the signalling perturbation dataset, with additional new cell types being generated in the latter as a result of the small molecule and recombinant protein treatments (Fig. 3a,b).

We visualise marker gene expression compared between the two datasets in the integrated UMAP space to further validate the integration and to gain insight into cell types captured in the dataset (Fig. 3c). This shows that mesoderm (MESP1, EOMES), definitive endoderm (FOXA2), endothelium (CDH5), posterior foregut (TTR), and cardiomyocytes (ACTC1, MYH6) are shared cell types between the two datasets, cell types all expected to arise from a largely undirected mesendoderm differentiation^3,24. This marker gene expression analysis also highlights divergence of the signalling perturbation dataset from the time course to produce unique populations of cells expressing NR2F2, SOX2, TBXT, NOG, and DNAAF3. Finally, Fig. 3c also shows that the increased sampling of cells at day 2 in the signalling perturbation dataset reveals two transcriptionally distinct cell types at day 2, one enriched in mesendodermal marker GSC, and the other in SOX2, a gene associated with pluripotency and neuromesodermal lineages²⁵.

Clustering and cell type annotation in the integrated dataset

We performed cell clustering using the standard Seurat pipeline at four resolutions (0.1, 0.2, 0.3, and 0.4) to characterise broad cell type heterogeneity in the integrated dataset, and selected the 0.3 resolution for further analysis (Fig. 4a). Differential gene expression analysis for each cluster, followed by GO term and KEGG pathway enrichment of the significantly differentially expressed genes in each of the 12 clusters, provides a basis for cluster annotation (Fig. 4b,c & Supplementary Table S1) and enables evaluation of the cell type assessments based on gene expression in Fig. 3c. We additionally used label transfer analysis in the Seurat pipeline to evaluate the clusters with reference to three scRNA-seq datasets of early development in vivo (Fig. 4d)^6,19,20. Together, these data confirm that the unique cell types introduced in the signalling perturbation dataset are axial mesoderm (Cluster 4: TBXT, NOG), anterior foregut endoderm (Cluster 8: NR2F2, SOX2), nodal cells (Cluster 11: NOG), and neural-like cells (Cluster 9: SOX2, NR2F2). Finally, we computed the relative proportions of cells captured from each time point and treatment condition in each cluster to facilitate interpretation of how the signalling cues contribute to different cell fate outcomes (Fig. 4e). This confirms that the two clusters primarily at day 2 (Clusters 0 & 1) show roughly even contributions of cells from each treatment group, as anticipated since they were all untreated at the day 2 time point (see Fig. 1a). As another point of validation, Fig. 4e also shows that the cardiomyocyte cluster (Cluster 10) is strongly biased towards the XAV treatment, aligning with expectations, as treatment with this WNT inhibitor at the germ layer stage of differentiation is commonly used in cardiomyocyte differentiation protocols²⁶.
These two examples confirm recapitulation of expected treatment effects, providing confidence in the dataset’s utility for interpreting the impact of the less well-characterised signalling perturbations on lineage outcomes.
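A per-cluster composition of the kind described above can be computed from the cell metadata with a short sketch such as the one below. It normalises counts within each cluster, which is an assumption: the text does not state whether proportions were additionally adjusted for the number of cells per sample. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_composition(cells):
    """Proportion of cells from each treatment within each cluster.

    `cells` is an iterable of (cluster, treatment) pairs, one per cell.
    Returns {cluster: {treatment: proportion}}.
    """
    counts = defaultdict(Counter)
    for cluster, treatment in cells:
        counts[cluster][treatment] += 1
    composition = {}
    for cluster, per_treatment in counts.items():
        total = sum(per_treatment.values())
        composition[cluster] = {t: n / total for t, n in per_treatment.items()}
    return composition


# Toy metadata: (cluster, treatment) per cell, made-up for illustration.
toy_cells = [(10, "XAV"), (10, "XAV"), (10, "XAV"), (10, "control"), (0, "XAV"), (0, "control")]
composition = cluster_composition(toy_cells)
```

In this toy example, cluster 10 is 75% XAV-treated cells while cluster 0 is split evenly, which is the kind of pattern used in the text as a point of validation.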

Taken together, this study provides a controlled scRNA-seq dataset capturing cell type dynamics resulting from signalling modulation of hPSC mesendoderm differentiation. By profiling over 60,000 cells across sequential stages of differentiation and perturbations of WNT, BMP, and VEGF signalling pathways, we offer a valuable reference point for benchmarking in vitro differentiation against in vivo development. The dataset enables dissection of dose-dependent signalling effects on lineage specification, offering insight into how pathway modulation influences fate decisions relevant to embryogenesis and organogenesis. This resource will thus support optimisation of high-fidelity cardiomyocyte differentiation protocols with utility across regenerative medicine, drug screening applications, and potential to inform synthetic biology and customised cell differentiation strategies.

Code availability

Code for the described data processing and analysis is provided at https://github.com/palpant-comp/data_descriptor.

References

Kishimoto, K. et al. Directed differentiation of human pluripotent stem cells into diverse organ-specific mesenchyme of the digestive and respiratory systems. Nature Protocols 17, 2699–2719, https://doi.org/10.1038/s41596-022-00733-3 (2022).
Article CAS PubMed PubMed Central Google Scholar
Raniga, K. et al. Strengthening cardiac therapy pipelines using human pluripotent stem cell-derived cardiomyocytes. Cell Stem Cell 31, 292–311, https://doi.org/10.1016/j.stem.2024.01.007 (2024).
Article CAS PubMed Google Scholar
Wu, Z. et al. in Advanced Technologies in Cardiovascular Bioengineering (eds Zhang, J. & Serpooshan, V.) 125–167 (Springer International Publishing, 2022).
Chow, C. S. Y. et al. A community-oriented, data-driven resource to improve protocol design for cardiac modelling from human pluripotent stem cells. bioRxiv, 2024.2009.2004.611313, https://doi.org/10.1101/2024.09.04.611313 (2024).
Tabula Sapiens Consortium et al. The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science 376, eabl4896, https://doi.org/10.1126/science.abl4896 (2022).
Article CAS Google Scholar
Qiu, C. et al. A single-cell time-lapse of mouse prenatal development from gastrula to birth. Nature 626, 1084–1093, https://doi.org/10.1038/s41586-024-07069-w (2024).
Article ADS CAS PubMed PubMed Central Google Scholar
Shen, S. et al. Atlas of multilineage stem cell differentiation reveals TMEM88 as a developmental regulator of blood pressure. Nature Communications 16, 1356, https://doi.org/10.1038/s41467-025-56533-2 (2025).
Article CAS PubMed PubMed Central Google Scholar
Mandegar, M. A. et al. CRISPR Interference Efficiently Induces Specific and Reversible Gene Silencing in Human iPSCs. Cell Stem Cell 18, 541–553, https://doi.org/10.1016/j.stem.2016.01.022 (2016).
Article CAS PubMed PubMed Central Google Scholar
Kreitzer, F. R. et al. A robust method to derive functional neural crest cells from human pluripotent stem cells. Am J Stem Cells 2, 119–131 (2013).
Miyaoka, Y. et al. Isolation of single-base genome-edited human iPS cells without antibiotic selection. Nat Methods 11, 291–293, https://doi.org/10.1038/nmeth.2840 (2014).
Stoeckius, M. et al. Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biology 19, 224, https://doi.org/10.1186/s13059-018-1603-1 (2018).
Roelli, P., bbimber, Flynn, B., santiagorevale & Gui, G. Hoohm/CITE-seq-Count: 1.4.2. Zenodo https://doi.org/10.5281/zenodo.2590196 (2019).
Stuart, T. et al. Comprehensive Integration of Single-Cell Data. Cell 177, 1888–1902.e21, https://doi.org/10.1016/j.cell.2019.05.031 (2019).
Bais, A. S. & Kostka, D. scds: computational annotation of doublets in single-cell RNA sequencing data. Bioinformatics 36, 1150–1158, https://doi.org/10.1093/bioinformatics/btz698 (2020).
Liu, Y., Wang, T., Zhou, B. & Zheng, D. Robust integration of multiple single-cell RNA sequencing datasets using a single reference space. Nature Biotechnology, https://doi.org/10.1038/s41587-021-00859-x (2021).
Xu, S. et al. Using clusterProfiler to characterize multiomics data. Nature Protocols, https://doi.org/10.1038/s41596-024-01020-z (2024).
Shim, W. J. et al. Conserved Epigenetic Regulatory Logic Infers Genes Governing Cell Identity. Cell Systems, https://doi.org/10.1016/j.cels.2020.11.001 (2020).
Zhao, Q. et al. TRIAGE: an R package for regulatory gene analysis. Briefings in Bioinformatics 26, bbaf004, https://doi.org/10.1093/bib/bbaf004 (2025).
Pijuan-Sala, B. et al. A single-cell molecular map of mouse gastrulation and early organogenesis. Nature 566, 490–495, https://doi.org/10.1038/s41586-019-0933-9 (2019).
Tyser, R. C. V. et al. Single-cell transcriptomic characterization of a gastrulating human embryo. Nature 600, 285–289, https://doi.org/10.1038/s41586-021-04158-y (2021).
Shen, S. et al. Atlas of multilineage stem cell differentiation reveals TMEM88 as a developmental regulator of blood pressure. NCBI GEO http://identifiers.org/geo:GSE279710 (2024).
Sun, Y. et al. Inferring cell diversity in single cell data using consortium-scale epigenetic data as a biological anchor for cell identity. Nucleic Acids Research, gkad307, https://doi.org/10.1093/nar/gkad307 (2023).
Hutchins, A. P. et al. Models of global gene expression define major domains of cell type and tissue identity. Nucleic Acids Research 45, 2354–2367, https://doi.org/10.1093/nar/gkx054 (2017).
Mennen, R. H., Oldenburger, M. M. & Piersma, A. H. Endoderm and mesoderm derivatives in embryonic stem cell differentiation and their use in developmental toxicity testing. Reproductive Toxicology 107, 44–59, https://doi.org/10.1016/j.reprotox.2021.11.009 (2022).
Koch, F. et al. Antagonistic Activities of Sox2 and Brachyury Control the Fate Choice of Neuro-Mesodermal Progenitors. Developmental Cell 42, 514–526.e7, https://doi.org/10.1016/j.devcel.2017.07.021 (2017).
Friedman, C. E. et al. Single-Cell Transcriptomic Analysis of Cardiac Differentiation from Human PSCs Reveals HOPX-Dependent Cardiomyocyte Maturation. Cell Stem Cell 23, 586–598.e8, https://doi.org/10.1016/j.stem.2018.09.009 (2018).

Acknowledgements

We thank the Flow Cytometry Facility at the Queensland Brain Institute for facilitating cell sorting; StemCore at the Australian Institute for Bioengineering and Nanotechnology for CNV assessment of cell lines; the University of Queensland sequencing facility at the Institute for Molecular Bioscience for performing single-cell capture and library preparation; and the Garvan Sequencing Platform for Illumina sequencing. Funding support was provided by the National Health and Medical Research Council of Australia (Grants 1143163 (N.J.P.) and 2008928 (Q.N.)); the Medical Research Future Fund (APP2016033 (N.J.P.)); and the National Heart Foundation of Australia (Grants 101889 and 106721 (N.J.P.)).

Author information

Authors and Affiliations

Institute for Molecular Bioscience, The University of Queensland, St Lucia, QLD, 4067, Australia
Sophie Shen, Tessa Werner, Han Sheng Chiu, Xiaoli Chen, Quan Nguyen & Nathan J. Palpant

Authors

Sophie Shen
Tessa Werner
Han Sheng Chiu
Xiaoli Chen
Quan Nguyen
Nathan J. Palpant

Contributions

For the time course scRNA-seq dataset, S.S. and X.C. carried out cell culture, collection, and sample preparation; S.S. performed quality control and computational analysis. For the signalling perturbation scRNA-seq dataset, T.W. and H.C. carried out cell culture, collection, and sample preparation; S.S. and T.W. performed computational analysis. S.S. performed computational analysis of the combined dataset and wrote the manuscript; N.J.P. and Q.N. provided supervision; N.J.P. conceptualised the study and raised funding.

Corresponding author

Correspondence to Nathan J. Palpant.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Table 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Shen, S., Werner, T., Chiu, H.S. et al. A pluripotent stem cell atlas of multilineage differentiation. Sci Data 12, 1238 (2025). https://doi.org/10.1038/s41597-025-05549-w

Received: 24 November 2024
Accepted: 04 July 2025
Published: 15 July 2025
Version of record: 15 July 2025
DOI: https://doi.org/10.1038/s41597-025-05549-w