Extended Data Fig. 1: Genotype calling and imputation and breed prediction. | Nature Genetics

Extended Data Fig. 1: Genotype calling and imputation and breed prediction.

From: A compendium of genetic regulatory effects across pig tissues

a, Pearson’s correlation (r) between the number of clean reads and the number of called SNPs across 7,095 RNA-Seq samples. The P-value is from Pearson’s correlation test. b, Distribution of the number of SNPs called from the 7,095 RNA-Seq samples. c, Number of imputed SNPs (left axis, gray bars) from 7,008 RNA-Seq samples across the 18 pig chromosomes after quality control (DR2 ≥ 0.85, minor allele frequency ≥ 0.05). Red points represent the number of genes (right axis) on each chromosome in the Sscrofa11.1 assembly (Ensembl v100). d, Distribution of the 42,523,218 SNPs from the Pig Genomics Reference Panel (PGRP) and the 3,087,268 imputed SNPs used for molecular QTL (molQTL) mapping across eight genomic features. e, Minor allele frequency (MAF) of imputed SNPs in the 7,008 RNA-Seq samples. f, Distribution of the number of imputed SNPs within 1 Mb of the transcription start site (TSS) of 18,911 protein-coding genes. g, Concordance rate (CR) and squared correlation (r²) between imputed and observed genotypes in 50 evenly spaced MAF bins, based on individuals that are not present in the PGRP. ‘ALL’ denotes all variants. h, CR and r² between genotypes imputed from RNA-Seq only and genotypes either called directly from whole-genome sequence (WGS) data (red) or imputed from SNP array data (blue), in the same individuals. Points and whiskers show the mean and standard deviation, respectively. The x-axis labels give the breeds and the number of individuals. i, CR and r² between imputed and observed genotypes in different genomic features. Points and whiskers show the median and interquartile range, respectively. j, The overall pipeline used to predict missing breed labels for RNA-Seq samples. k, Estimated ancestry proportions of Duroc (n = 485), Landrace (n = 280), Yorkshire (n = 145), Landrace×Yorkshire (n = 165) and Duroc×Landrace×Yorkshire (n = 40) samples. l, Distribution of the sample sizes of the training and prediction sets for pure and cross breeds. m,n, Accuracy of breed prediction for pure breeds (m) and cross breeds (n), measured by cross-validation. Red triangles represent the sample size of the target breed.
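
The two imputation accuracy metrics named in panels g to i, concordance rate (CR) and squared correlation (r²), can be illustrated with a minimal sketch. It assumes genotypes are coded as 0/1/2 alternate-allele counts for a single variant across individuals. The function names and the toy data are hypothetical, and the legend does not state how the authors aggregated values across variants or samples, so this is not their pipeline.

```python
import numpy as np


def concordance_rate(observed, imputed):
    """Fraction of genotype calls (coded 0/1/2) that match exactly."""
    observed = np.asarray(observed)
    imputed = np.asarray(imputed)
    return float(np.mean(observed == imputed))


def squared_correlation(observed, imputed):
    """Squared Pearson correlation between observed and imputed genotypes."""
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    # The correlation is undefined when either vector has no variation.
    if observed.std() == 0 or imputed.std() == 0:
        return float("nan")
    r = np.corrcoef(observed, imputed)[0, 1]
    return float(r * r)


def minor_allele_frequency(genotypes):
    """MAF of one biallelic variant from 0/1/2 allele counts."""
    p = float(np.mean(genotypes)) / 2.0
    return min(p, 1.0 - p)


# Toy data (assumption): one variant genotyped in eight individuals.
observed = [0, 1, 2, 1, 0, 2, 1, 1]
imputed = [0, 1, 2, 1, 0, 1, 1, 1]

cr = concordance_rate(observed, imputed)
r2 = squared_correlation(observed, imputed)
maf = minor_allele_frequency(observed)
print(f"CR={cr:.3f}, r2={r2:.3f}, MAF={maf:.2f}")
```

To reproduce a per-bin summary like panel g, one would group variants by their MAF and average each metric within a bin. CR tends to look optimistic for rare variants because most genotypes are homozygous reference, which is one reason r² is usually reported alongside it.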
