Background & Summary

There are several native chicken varieties in Bangladesh of which Common Deshi (CD) or Non-descript Deshi, Hilly (HL) and Naked Neck (NN) chickens are noteworthy for their egg production, meat quality and survivability in harsh environmental conditions. Consumers prefer indigenous chicken meat and eggs due to their special characteristics of smell, taste and texture1,2. Some studies found that around 98% of consumers pay attention to particular qualities such as fat content, meat colour and taste, eggshell colour, size and yolk colour3. Domestic local chicken is thus preferred over intensively produced commercial hybrid chicken in Bangladesh.

However, these precious chicken populations have been undergoing genetic erosion since the introduction of improved stocks (both pure lines of different chicken breeds and commercial hybrids) from developed countries. This has occurred as a result of various factors like the incorporation of exotic chicken breeds and commercial hybrids, indiscriminate cross-breeding, sub-optimal breeding strategies and lack of conservation programmes2,4. Recognising the potential of Bangladeshi local chicken varieties, the Bangladesh Livestock Research Institute (BLRI) started a conservation and improvement programme for the three above-mentioned native chicken varieties around 20115, applying conventional breeding strategies. Both egg production and growth performance have improved significantly as compared to the foundation stock6,7,8. These native chickens maintained by the BLRI are known as ‘BLRI Improved Native Chicken’. However, apart from a few studies using RAPD markers, microsatellite markers or partial mitochondrial DNA-loop sequences to understand the maternal origin9,10,11,12, advanced genomic research on these native breeds has yet to be conducted. In-depth genome-level research on these promising native chicken species will be crucial for the identification of potential genomic regions and candidate genes responsible for productivity improvement, disease resistance and stress tolerance potential, to harvest maximum utilization. In addition, appropriate breeding strategies based on genomic surveys need to be undertaken for the conservation of native chicken germplasms.

Advances in genomics have enabled Whole Genome Sequencing (WGS), allowing scientists to uncover genomic insights, contributing to livestock breeding and development13,14,15,16,17,18,19. For instance, WGS analysis of 234 indigenous African chickens identified around 15 million SNPs, of which 14% represent unique variants, with some being associated with environmental adaptation and other important traits20.

In this article, we report whole-genome sequencing data from 96 Bangladeshi native chickens. The samples include BLRI improved native chickens and the same variety of local chickens from eight different locations across the country. The indigenous chicken samples taken from various villages in Bangladesh are referred to here as ‘Unimproved Native Chicken’. Paired-end next-generation sequencing for short reads was carried out on all samples with an average 23X coverage and reads mapped to the GRCg7b chicken reference genome (GCA_016699485.1; https://ftp.ensembl.org/pub/release-109/fasta/gallus_gallus/dna/). More than 22 million biallelic Single Nucleotide Polymorphisms (SNPs) were identified in this study, with 25% being novel variants.

The utility of data generated from the present study is expected to include helping evaluate genetic variation and diversity of Bangladeshi native chicken varieties, detecting important genomic regions and candidate genes underlying different economic traits of interest (for example, egg production, body weight and stress tolerance ability) and future investigation on genomic selection as a potential strategy in chicken breeding programmes. The data can also help make associations between the genome and the environment using Ecological Niche Modelling (ENM)21,22,23,24,25 and aid in the development of SNP chips/imputation panels26,27,28,29,30. This is the first-time whole genome sequencing from Bangladeshi native chicken varieties has been presented, at scale, thus providing a valuable resource for avian researchers to understand the genetics of local chickens. These WGS data will also help to enrich the efforts of the Chicken Genomic Diversity Consortium31 to reveal origins and adaptations of global chicken populations.

Methods

Sampling locations

390 blood samples were collected from birds from different geographical locations in Bangladesh (Fig. 1) between February and June 2022. Of the total samples, 215 blood samples were taken from BLRI-improved chickens represented by three native varieties (Common Deshi, Hilly and Naked Neck) at the BLRI Poultry Research Farm. 175 blood samples were also taken from unimproved indigenous chickens (of the same three breeds) from several villages in eight different geographical locations across the country. While selecting the sampling sites, different environmental or climatic factors like temperature and humidity variation and availability of native chicken variety in that particular region, were taken into consideration. Details of the collected samples are shown in Fig. 1 and described in Supplementary Table 1.

Fig. 1
figure 1

Sample collection sites from across Bangladesh. Blue circles indicate unimproved native chicken sampling locations (across the different villages) while the red circle indicates the improved native chicken sampling location (BLRI Headquarters, Dhaka, Bangladesh). The numbers inside the circles indicate the number of samples that were sequenced from each location.

Ethical approval

All relevant ethical approvals were obtained from the Animal Experimentation Ethics Committee (AEEC) of BLRI [AEEC/BLRI00116]. In addition, prior collection of blood samples from the village chickens was done with the full consent of the owners. The samples were transported to the United Kingdom under a material transfer agreement (MTA) between BLRI and The University of Edinburgh (MTA Roslin 3324). In addition, import authorization for blood samples was issued under the Trade in Animals and Related Products (Scotland) Regulations (2012).

Blood collection

Around 0.5 ml blood was withdrawn from the brachial wing vein (or cutaneous ulnar vein) of the selected healthy chickens using 1.0 ml insulin syringe (JMI Syringes & Medical Devices Limited, Bangladesh) following the procedure stated by Kelly and Alworth32. Chickens were handled with extreme care, maintaining standard animal handling procedures to ensure minimum stress. Immediately after collection, the blood was gently and carefully spread onto Whatman FTA classic cards (Cat. No: WHAWB120205; Merck, Germany). After completion of blood collection, the FTA cards were kept at room temperature for at least three hours to allow complete drying. Appropriate caution was taken to keep the FTA cards safe from direct sunlight. All air-dried blood samples were then stored safely in double zipper bags until further processing at the Roslin Institute (Edinburgh, UK).

Sample selection for WGS

For whole genome sequencing, a total of 96 samples were selected from the 390 samples collected. The selected samples consisted of equal numbers of male and female chickens from each of the 6 groups (3 improved and 3 unimproved indigenous chicken varieties) as shown in Table 1. In the case of selecting BLRI-improved native chickens, unrelated chickens were given priority, considering the pedigree data (parental information) of the previous two generations of the sampled chicken populations. Samples from all geographical locations were included while selecting unimproved chickens for WGS. Again, within the same sampling area, individuals were selected from different villages.

Table 1 Selected samples (total = 96) for WGS from BLRI improved native chicken (n = 48) and unimproved native chickens from local villages (n = 48).

Sample preparation for WGS

All the selected blood samples on FTA cards were processed using a QIAamp® DNA Investigator-50 kit (CAT no: 56504; QIAGEN, Germany). The quality and the quantity of the isolated genomic DNA were assessed by three different instruments - NanoDrop 1000 spectrophotometer (Thermo Fisher Scientific, USA), Qubit 4 fluorometer (Invitrogen, USA) and TapeStation 4200 (Agilent Technologies, USA). After normalising to a final volume of 50 μL, the genomic DNA samples were sent to BGI Genomics, Poland for whole genome sequencing (150 bp paired-end, 20X coverage).

Library preparation and sequencing

Library preparation was conducted at BGI Genomics in Poland. Before sequencing, concentration, integrity and purity of all the genomic DNA samples were again checked. Concentration was determined by Qubit Fluorometer (Invitrogen) and sample integrity and purity were detected by Agarose Gel Electrophoresis (concentration of agarose gel: 1%, voltage: 150 V, electrophoresis time: 40 minutes). Random fragmentation of genomic DNA was done using a Covaris (E220) instrument, then the fragmented genomic DNA was selected by Magnetic beads [Cat no: 1000005278 (MGI Easy DNA Clean Beads; https://en.mgi-tech.com/Products/reagents_info/id/7] to an average size of 200–400 bp. Fragments were then end repaired and 3′ adenylated, with adaptors then ligated to the ends of these 3′ adenylated fragments. PCR was then carried out to amplify the fragments with adaptors and PCR products were purified by magnetic beads. The double stranded PCR products were then denatured and circularised by the splint oligo sequence. The single strand circular DNA (ssCir DNA) was formatted as the final library. The library was then assessed by quality control. The library was amplified with phi29 to make DNA nanoballs (DNB) which have more than 300 copies of each molecule. The DNBs were loaded into the patterned nanoarray and paired end 150 bp reads were generated by combinational Probe-Anchor Synthesis (cPAS). The sequencing was performed using the next generation high-throughput platform at BGI Genomics (DNBSEQ-T7) in paired-end mode (~20X coverage).

Data processing

Raw reads were filtered using a series of data processing steps to remove adaptor sequences, contamination and low-quality reads. This was done using SOAPnuke software33 by BGI Genomics. Upon receipt of sequence data, quality of the sequences was checked using the FastQC programme (version 0.11.7)34. For ease of reviewing the sequence quality, FASTQC reports for all 96 samples were aggregated in a single report by the MultiQC (version 1.1) package35, examples from which are shown in Fig. 2. No adaptor sequences were present and as the quality of the raw reads was very high, no further quality-based trimming was performed on the sequence reads.

Fig. 2
figure 2

Quality control metrics from FastQC analysis. (a) Per sequence quality scores and (b) per sequence GC content.

Mapping of the sequence reads was performed against the GRCg7b chicken reference genome (https://www.ebi.ac.uk/ena/browser/view/GCA_016699485.1) using Burrows-Wheeler Aligner (bwa-version 0.7.15)36,37 with default parameters. Before alignment, sequence dictionary and FASTA index files were created using Samtools (version 1.13)38 which were used by the BWA-MEM programme. The resultant Sequence Alignment Map (SAM) files then underwent some further processing steps such as sorting according to their coordinates using the SortSam programme of Picard tools (version 2.25.4)39 and marking duplicate reads using the MarkDuplicate programme of the same tool. BAM files were then validated to troubleshoot errors such as improper formatting, faulty alignments and incorrect flag values. In addition, different WGS metrics were calculated using both Samtools and Picard tools. Base Quality Score Recalibration (BQSR) was then carried out using the BaseRecalibrator tool from the Genome Analysis Toolkit - GATK (version 4.0.10.1)40,41 to correct the biases in the quality scores assigned by the sequencer. The final recalibrated BAM files were then used for further downstream analysis. The overview of the mapping and variant calling steps is presented in Fig. 3.

Fig. 3
figure 3

Overview of the sequence alignment, variant calling and variant filtration process.

GATK best practice guidelines for germline short variant discovery were followed for variant calling and SNP detection using the HaplotypeCaller tool with the ‘-ERC GVCF’ settings to generate GVCF files which then underwent joint genotyping using the GenomicsDBImport and then GenotypeGVCFs functions. The Variant Quality Score Recalibration (VQSR)42 function was then applied to perform variant filtration using a set of around one million validated SNPs26 as a training and true set, with over 21 M chicken SNPs from the Ensembl database (release-110)43 used as known variants.

The following annotations or context statistics were considered during the VQSR step: read depth (DP), variant quality by depth (QD), root mean square mapping quality (MQ), mapping quality rank sum test statistics (MQRankSum), read position rank sum test statistics (ReadPosRankSum), and strand bias statistics (FS and SOR). A tranche sensitivity threshold of 99% was applied for filtering variants. As the final quality control of the called variants, any SNPs with a missing genotype rate more than 10% across the samples were filtered out using VCFtools44 (version 0.1.13).

All codes used for the mapping and variant calling steps are included in the Supplementary materials and also available on GitHub.

Data Records

All full-length raw sequencing data in FASTQ format can be accessed from the Sequence Read Archive (SRA) of the NCBI database under BioProject accession number PRJNA102732545. The filtered VCF file containing more than 22 million high-quality autosomal biallelic SNPs can also be accessed from the European Necleotide Archive (ENA) and the European Variation Archive (EVA) repositories under the Project accession number PRJEB7835746,47 and Analysis accession number ERZ24818048.

Technical Validation

Quality control of sequencing data

The total number of bases generated from the sequencing each sample was from 24 Gb to 28 Gb, with GC content averaging 42%. Around 95.33% of the bases had a minimum Phred scaled quality score of 30 which indicates a base calling accuracy of 99.9%. The average estimated genome coverage across all sample was ~23X (after marking duplicate reads) with the range varying from 20X to 25X. FastQC reports (shown in Fig. 2) indicate that sequencing quality of all samples was of high-quality. The average mapping rate of the sequence reads against the reference genome was 99.60%, which further confirmed the high quality of the sequencing data.

Quality control of SNP data

During the variant calling step using GATK best practice guidelines, more than 30 M total variants were identified including more than 26 M SNPs and 4.5 M insertions/deletions (INDELs). VQSR filtering was then applied to ensure identification of high-quality variants and to minimize the number of false positives. More than 1 M validated SNPs26 and about 22 M SNPs from Ensembl43 were used as known variants (training data set) during the VQSR step. The VQSR filtering retained 100% of the SNPs (26.07 M). Next, only the biallelic SNPs (22.75 M) were taken into consideration for downstream analysis which included a further filtering step. SNPs with a missing genotype rate of more than 10% were discarded, which retained around 22 M high-quality SNPs. In this step, minimum genotype quality (GQ) score was considered 20 (–minGQ 20.0), depth of sequence coverage was 3 (–minDP 3), Hardy-Weinberg Equilibrium (HWE) value was 0.00001 (–hwe 0.00001) along with the maximum missing rate of genotypes of 10% (–max-missing 0.9) using VCFtools. Annotation of the identified SNPs was done using Ensembl’s Variant Effect Predictor-VEP48 (release-110) which revealed that around 75% of the total high-quality variants are already reported in the public databases, but the rest were novel variants. We found that the majority of the identified variants (around 68%) were intronic while around 2% were exonic variants (shown in Table 2a). Again, we found a greater number of variants in unimproved chicken populations compared to improved chickens of the same varieties (Table 2b). Details of SNPs in different annotation categories in both improved and unimproved native chicken populations are given in Supplementary Table 2.

Table 2 (a) SNPs in different annotation categories for overall Bangladeshi native chicken populations. (b) Sumary statistics of SNPs in Bangladeshi improved and unimproved native chicken populations.

To determine the quality of SNP calling from the high-throughput sequencing data, the transition/transversion ratio (Ts/Tv) can be employed. We obtained a transition/transversion ratio (Ts/Tv) of 2.453 in the overall populations which is typically found to be ~2 for whole genome sequence data49. Some previous studies reported a Ts/Tv ratio between 2.17 to 2.69 for different indigenous chicken breeds20,50,51,52,53. Unless it is too high (>4), a higher Ts/Tv ratio generally indicates better SNP calling54.

The proportion of singleton SNPs in the overall population for this study was 13.64% while the unimproved chicken populations have higher percentages compared to the improved populations (Table 2b). These reflect the genetic diversity between the studied chicken populations. In addition, smaller sample size may be also responsible for a large proportion of singleton SNPs55.

We observed an average of 1 SNP in every 57 base pairs in 10 kb non-overlapping windows across the studied genome. Indigenous chicken species exhibit higher SNP density mainly due to their greater genetic diversity, exposure to varied environmental conditions and more natural settings or less controlled breeding compared to the commercial breeds53,56,57. The SNP density across various chromosomes including the sex chromosomes (Z and W) from our study is detailed in Supplementary Table 3 and illustrated in Fig. 4.

Fig. 4
figure 4

Chromosome-wise SNP distribution heat map across the Bangladeshi native chicken genome based on more than 22 M identified SNPs. The x-axis represents the chromosome length (Mb) and the y-axis denotes chromosome number.