Extended Data Fig. 12: Statistical analysis for quality control of the WTC-11 hiPSC Single-Cell Image Dataset v1.
From: Integrated intracellular organization and its variations in human iPS cells

a. Box plots of principal component values for all cell lines together (first bin, in dark green) and per tagged structure cell line, plotted in pipeline timeline order, that is, the order in which the structure datasets were collected (total n = 175,147; n per structure in Supplementary Data 1). The box extends from the first quartile (Q1) to the third quartile (Q3) of the data, with a line at the median. The whiskers extend from the box by 1.5× the interquartile range (IQR), and dots represent outliers beyond the whiskers. The dashed horizontal line spanning the entire plot represents the median value for all cell lines together (first bin, in dark green). The colours for each cell line refer to the pipeline workflow (see Methods for details). Triangles indicate structures for which the IQR does not overlap with the mean value for all cell lines.
b. The left plots show the distributions of cell height (top) and cell volume (bottom) for all cell lines together (first bin, in dark green; n = 202,847) and per tagged structure cell line, plotted in pipeline timeline order (n per structure in Supplementary Data 1 and Extended Data Fig. 1d). Structure names in red indicate structures imaged with an adjusted Matrigel coating protocol towards the end of the pipeline timeline. The centre plots show a comparison of cell height (top) or cell volume (bottom) between actomyosin bundle-tagged cells (via non-muscle myosin IIB) in the main dataset (Pipeline 4.1; n = 6,223) and in a repeat dataset imaged with Pipeline 4.4 settings and the adjusted Matrigel coating protocol (n = 380). The right plots show a comparison of cell height (top) or cell volume (bottom) between all cell lines imaged pre-Pipeline 4.4, during Pipeline 4.4 with the original Matrigel coating and during Pipeline 4.4 with the adjusted Matrigel coating. Percentages shown in the plot are the relative height reduction compared to the mean height of cell lines imaged pre-Pipeline 4.4.
c. The top image diagrams the circular mapping of imaged colonies (via the 12X overview images). Two cells are represented by two red dots within an FOV, which is represented by a rectangle. The FOV centre is at distance d from the closest edge of the colony. The two cells are then mapped into a unit circle that serves as a template to visualize their radial location. The radial location is the distance of the FOV to the edge of the colony relative to the colony size, ℓ = d/Reff, where Reff represents the effective radius of the colony. The angular location of a cell (θ1 and θ2 for the two cells in the image) is independently drawn from a uniform distribution of angles in the range [0, 2π]. Cells from the dataset that were associated with a colony size (see Methods) were grouped into four bins, each with a similar number of cells, based on the area of the colony they came from. The colony area ranges of the bins are 15k-230k µm², 230k-377k µm², 377k-620k µm² and 620k-14,285k µm². Each point represents one cell within the colony area bin that was mapped into the unit circle. The unit circle was then rescaled to match the mean colony area for that bin. Points are colour-coded by their corresponding cell height. Listed above each circle is the mean colony area in that bin, to which the unit circle is scaled. Below each circle are profile plots of cell height as a function of the radial distance for each of the cells (in black). The red curve represents the rolling average. Each row of circular colony mappings represents a different aggregation of the data based on the imaging mode: the first row is for all imaging modes (modes A, B and C; n = 104,269), the second row is for modes A and B only (n = 75,146) and the third row is for mode C only (n = 29,123).
d. Circular colony mappings as in (c), where points (cells) are now colour-coded by values of the shape modes. Circular colony mappings are shown for Shape Modes 1 and 2, and profile plots (as in c) for Shape Modes 3-8 (all imaging modes, n = 104,269).
e. Scatter plots on the far left show true values of cell height compared to cell height values predicted by random forest regression models (n = 95; see Methods) that include either all experimental variables (top plot) or all experimental variables except for the cell line identity (bottom plot). The error bars on the predicted values are obtained via bootstrapping (n = 100). The centre column shows box plots representing the feature importance for each of the two models, as measured by the increase in the mean squared error (MSE) when all values of the corresponding feature are shuffled across samples. The box extends from the first quartile (Q1) to the third quartile (Q3) of the data, with a line at the median. The whiskers extend from the box by 1.5× the interquartile range (IQR), and dots represent outliers beyond the whiskers. The top right plot is the Pearson correlation matrix between five continuous experimental variables used in training the regression models. The bottom right plot is the Cramér's V correlation matrix between six categorical experimental variables used in training the regression models. Variables with a correlation above the significance threshold of 0.3 are assumed to be highly correlated (ref. 53).
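The circular colony mapping described in panel c can be illustrated with a minimal sketch. This is not the authors' code, and it rests on assumptions that the legend does not state: the function names are hypothetical, Reff is taken to be the radius of a circle with the same area as the colony, a cell at relative edge distance ℓ is drawn at radius 1 − ℓ from the centre of the unit circle, and ℓ is clamped to [0, 1].

```python
import math
import random


def effective_radius(colony_area_um2):
    """ASSUMPTION: Reff is the radius of a circle with the same area as the colony."""
    return math.sqrt(colony_area_um2 / math.pi)


def map_cell_to_unit_circle(d_um, colony_area_um2, rng):
    """Map one cell into the unit circle template.

    d_um is the distance from the FOV centre to the closest colony edge.
    Returns (x, y, ell, theta), where ell = d / Reff as in the legend.
    """
    r_eff = effective_radius(colony_area_um2)
    ell = d_um / r_eff
    # ASSUMPTION: clamp so that every cell lands inside the unit circle.
    ell = min(max(ell, 0.0), 1.0)
    # ASSUMPTION: ell = 0 is the colony edge, so the plotted radius is 1 - ell.
    radius = 1.0 - ell
    # The angle carries no spatial information; it is drawn uniformly from [0, 2*pi].
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return radius * math.cos(theta), radius * math.sin(theta), ell, theta


def rescale_to_mean_area(x, y, mean_colony_area_um2):
    """Rescale unit-circle coordinates to the mean colony area of a bin."""
    r = effective_radius(mean_colony_area_um2)
    return x * r, y * r


# Hypothetical example values, not taken from the dataset.
rng = random.Random(0)
example_area = math.pi * 200.0 ** 2  # a colony with Reff = 200 um
x, y, ell, theta = map_cell_to_unit_circle(100.0, example_area, rng)
x_um, y_um = rescale_to_mean_area(x, y, example_area)
```

Because the angle is random, only the radial coordinate is meaningful, which is why the legend pairs each circle with a profile plot of cell height against radial distance.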
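For readers unfamiliar with the Cramér's V statistic used for the categorical variables in panel e, the sketch below computes its standard, uncorrected form, V = sqrt(χ² / (n · min(r − 1, c − 1))), for two categorical variables. It is an illustration only: the legend does not say whether a bias-corrected variant was used, and the function name and example category labels are hypothetical.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equally long sequences of categories."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    row_counts = Counter(xs)
    col_counts = Counter(ys)
    joint_counts = Counter(zip(xs, ys))
    k = min(len(row_counts), len(col_counts)) - 1
    if n == 0 or k == 0:
        # A variable with a single category carries no association.
        return 0.0
    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for a, n_a in row_counts.items():
        for b, n_b in col_counts.items():
            expected = n_a * n_b / n
            observed = joint_counts.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


# Hypothetical category labels, not taken from the dataset.
mode = ["A", "A", "B", "B"]
coating_same = ["old", "old", "new", "new"]    # fully determined by mode
coating_mixed = ["old", "new", "old", "new"]   # independent of mode

v_dependent = cramers_v(mode, coating_same)
v_independent = cramers_v(mode, coating_mixed)
highly_correlated = v_dependent > 0.3  # threshold quoted in the legend
```

V ranges from 0 (no association) to 1 (one variable fully determines the other), so it can be read against the 0.3 threshold in the same way as the Pearson matrix for the continuous variables.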