Background & Summary

Citrus Huanglongbing (HLB, also called yellow shoot disease) is a destructive disease threatening citrus production worldwide [1]. HLB is caused by the unculturable, phloem-limited α-proteobacteria “Candidatus Liberibacter spp.”, mainly “Ca. L. asiaticus” (CLas), “Ca. L. africanus” and “Ca. L. americanus” [1,2]. Among the three species, CLas is the most widespread and is responsible for the increasing economic losses of the citrus industries in Asia and America [3]. The characteristic symptoms of CLas-infected citrus trees include yellow shoots, yellowing or mottled leaves, small and malformed fruit with aborted seeds, fruit abscission, rotted roots and, ultimately, tree death [4]. HLB severely reduces the longevity and fruit yield of citrus plants and poses a significant challenge for disease management because of its insect vector, the Asian citrus psyllid (ACP, Diaphorina citri) [5]. Literature records indicate that HLB was first observed in the Chaoshan region of Guangdong province, China, around the 1860s and became a local epidemic around the 1930s [2,6]. HLB then spread to the major citrus-producing areas in southern China after the 1930s [2,6]. To date, HLB has been found in 11 of the 19 citrus-growing provinces in China and is widely distributed in more than 50 citrus-producing countries in Asia, Africa and America, severely limiting the development of the global citrus industry [3,7].

Although CLas cannot be cultured in vitro, advances in whole-genome sequencing have greatly facilitated CLas research, including studies of its genetic diversity, evolution, gene function, pathogenicity and biology [8,9,10,11]. One major breakthrough from CLas genome sequence analyses was the discovery of CLas phages/prophages, which have since been used for CLas strain characterization and biological investigations [12,13,14]. To date, three large prophages, designated SC1 (Type 1), SC2 (Type 2) and P-JXGC-3 (Type 3), have been identified in CLas strains [12,13], providing valuable insights into CLas biology and genetic diversity [11,15,16]. Whole-genome sequence resources of CLas have also been used to analyze evolution among Liberibacter species and genetic diversity among CLas strains. Comparative genomics of Liberibacter species showed the evolutionary separation of CLas from the non-pathogenic Liberibacter crescens [10,17]. Genome-based analysis revealed genetic variation among CLas strains from different geographical locations and offered insights into the possible source of CLas introduction and HLB epidemiology in the United States [10]. A recent comparison of 35 published CLas genomes identified over 6,000 minor variations that were distributed highly heterogeneously across the CLas genome, including four highly diverse non-prophage regions and three prophage regions [18].

Although CLas is the most widely distributed species in the genus Liberibacter, the currently available CLas genome resources are very limited, mainly because the bacterium cannot be cultured in vitro. Total DNA extracted from CLas-infected citrus plants or insect vectors is therefore the only DNA source for CLas genome sequencing [8,19]. However, the high ratio of host DNA to CLas DNA in total DNA results in a very low efficiency of CLas whole-genome sequencing, making it harder to obtain high-quality CLas genome sequences. Efforts have focused on obtaining a sufficient number of CLas reads to improve the quality of CLas genome assembly and have led to two main effective strategies: the use of host tissue with a high CLas titer and an increase in sequencing depth [8,19,20,21]. Our previous study found that citrus fruit pith is an ideal host tissue source for CLas genome sequencing because it supports multiplication of CLas to a high level [21,22]. In addition, two non-natural host plants of CLas, periwinkle (Catharanthus roseus) and dodder (Cuscuta campestris), proved to be more amenable hosts for CLas proliferation and can serve as surrogate hosts that yield DNA samples with a high ratio of CLas DNA [20,21]. These efforts provide suitable DNA sources for CLas whole-genome sequencing, making it feasible to obtain high-quality genome sequences of CLas strains from different geographical locations.

The establishment of a comprehensive CLas genome database is anticipated to greatly advance CLas research, particularly on genetic diversity, evolution, epidemiology and biology. Currently, a total of 46 CLas genomes have been released, of which only 13 are at the complete level (https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=34021, accessed June 2024). However, only seven CLas genomes originate from China, and they are limited to a few HLB-epidemic areas in the country [11,13,19,21,23]. Although CLas is the most prevalent and destructive Liberibacter species throughout HLB-affected citrus-growing areas worldwide, the limited genome resources hinder the understanding of evolutionary relationships among CLas populations across geographical regions in HLB-epidemic countries, especially in a country with a long history of HLB epidemics such as China.

This study presents whole-genome sequencing data from 135 CLas strains originally collected from 20 commercial citrus cultivars distributed in ten HLB-endemic provinces in China. The average nucleotide coverage of the 135 CLas genomes was 675X, indicating high-quality genome sequencing and assembly. A total of 5,090 SNPs were identified among 148 CLas genomes, comprising the 135 sequenced in this study and 13 complete CLas genomes available in the NCBI database. These SNPs included 4,247 in the chromosome, 383 in the Type 1 prophage region, 323 in the Type 2 prophage region and 137 in the Type 3 prophage region. Our CLas genome sequence dataset will not only serve as a valuable resource for further research on the evolutionary patterns and genetic diversity of CLas strains from China and other HLB-endemic countries worldwide, but also facilitate research on CLas pathogenicity and biology.

Methods

Sample collection

Over 1,000 CLas samples were collected from ten HLB-epidemic provinces in China: Fujian, Guangdong, Guangxi, Guizhou, Hainan, Hunan, Jiangxi, Sichuan, Yunnan and Zhejiang (Fig. 1). To improve the quality of CLas genome sequencing and assembly, 135 representative CLas samples with a high CLas titer (Ct value < 25 with primer set CLas4G/HLBr) were selected for genome sequencing (Fig. 1; Supplementary Table S1). These samples were originally collected from 20 commercial citrus cultivars (Supplementary Table S1). All samples were collected from 2017 to 2022. Leaf midribs from leaves showing typical HLB symptoms (mottling or yellowing) or fruit pith tissue from HLB-symptomatic fruit (“red-nose” fruit) were sampled and stored at −20 °C before DNA extraction.

Fig. 1

Overview of sample collection, data processing and bioinformatics analysis pipeline. Province names are shown in blue on the map, and the red points represent the sampling sources. The fruit photos indicate the main citrus cultivars from which CLas samples were collected in each province.

DNA extraction and validation

A total of 100 mg of leaf midribs or 50 mg of fruit pith tissue was chopped into 2-mm sections with sterile blades. Total plant DNA was extracted using the E.Z.N.A. high-performance plant DNA kit (Omega Bio-Tek, Doraville, GA, USA) according to the manufacturer’s instructions. The quality of the extracted DNA was verified by 1% agarose gel electrophoresis. The concentration and purity were measured using the NanoDrop One microvolume UV-Vis spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA), with purity assessed by the A260/A280 absorbance ratio.

Quantification of CLas

Quantification of CLas was performed by SYBR Green real-time PCR with the primer set CLas4G/HLBr as described in a previous study [24]. The 20 μL PCR reaction mixture contained 1 μL of DNA template (~25 ng), 0.5 μL each of forward and reverse primer (10 μM), 10 μL of iQ™ SYBR Green Supermix (Bio-Rad, Hercules, CA, USA) and 8 μL of ddH2O. All PCRs were conducted in a CFX Connect Real-Time System (Bio-Rad, Hercules, CA, USA) under the following program: 95 °C for 3 min, followed by 40 cycles of 95 °C for 10 s and 60 °C for 30 s, with fluorescence signal capture at the end of each 60 °C step. The data (cycle threshold value, Ct value) were generated and analyzed using Bio-Rad CFX Manager 2.1 software with automated baseline settings and a manually set threshold of 0.1. Only CLas samples with a Ct value < 25 were selected as candidates for whole-genome sequencing (Supplementary Table S1).
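
The arithmetic of the reaction mixture and the Ct-based selection rule can be checked with a minimal Python sketch. The component volumes, the 10 μM primer stock and the Ct < 25 cutoff come from the text above; the function names and the sample IDs in the usage note are hypothetical and are not part of the authors' pipeline.

```python
# Reaction components in microlitres, as listed in the text.
REACTION_UL = {
    "dna_template": 1.0,
    "forward_primer": 0.5,
    "reverse_primer": 0.5,
    "sybr_supermix": 10.0,
    "ddH2O": 8.0,
}
CT_CUTOFF = 25.0  # samples must have Ct strictly below this value


def total_volume_ul(components=REACTION_UL):
    """Sum the component volumes of the reaction mixture."""
    return sum(components.values())


def final_concentration_um(stock_um, added_ul, total_ul):
    """Final concentration after dilution: C_final = C_stock * V_added / V_total."""
    return stock_um * added_ul / total_ul


def select_for_sequencing(ct_by_sample, cutoff=CT_CUTOFF):
    """Return sample IDs with Ct below the cutoff, lowest Ct (highest titer) first."""
    passing = [(ct, sample) for sample, ct in ct_by_sample.items() if ct < cutoff]
    return [sample for ct, sample in sorted(passing)]
```

For example, `final_concentration_um(10.0, 0.5, total_volume_ul())` gives the final concentration of each primer in the 20 μL reaction, and `select_for_sequencing({"S1": 22.4, "S2": 25.0})` keeps only `S1`, because a Ct of exactly 25 does not satisfy the strict inequality.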

High-throughput sequencing

Library preparation for each CLas sample was performed with the NEBNext Ultra DNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA, USA). Genome sequencing was performed on the Illumina HiSeq 3000 platform (Illumina, San Diego, CA, USA) with 150-bp paired-end reads by a commercial sequencing company. The raw data files obtained from high-throughput sequencing were converted to raw sequences (fastq format) by base calling with Illumina CASAVA v.1.8.2.

Data pre-processing

Raw sequencing data were filtered by removing reads containing adapters, reads in which more than 10% of bases were N (undetermined bases) and low-quality reads (those in which bases with Qphred ≤ 20 accounted for more than 50% of the read length) using fastp v.0.19.4 with default parameters to generate the clean reads [25]. Quality control of the clean data was performed with FastQC v.0.11.5 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and MultiQC v.1.14 [26] using default parameters. All clean reads were mapped to citrus genomes (Citrus maxima genome: MKYQ00000001.1, Citrus reticulata genome: NIHA00000000.1, Citrus sinensis genome: AJPS00000000.1, Citrus sinensis mitochondrion: NC_037463.1 and Citrus reticulata chloroplast: KU170678.1) using Bowtie 2 v.2.4.2 [27] to filter out citrus reads. The unmapped reads were retained for CLas genome assembly.
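
The two read-level criteria stated above (N content and proportion of low-quality bases) can be illustrated with a short Python sketch. This is not fastp's implementation: it only mirrors the thresholds quoted in the text, assumes Phred+33 quality encoding, and does not model adapter removal. The function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def passes_filter(seq, qual, max_n_frac=0.10, low_q=20, max_low_frac=0.50):
    """Apply the two thresholds from the text to a single read.

    A read is rejected if more than 10% of its bases are N, or if bases
    with Phred score <= 20 make up more than 50% of its length.
    """
    if not seq or len(seq) != len(qual):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    if n_frac > max_n_frac:
        return False
    scores = phred_scores(qual)
    low_frac = sum(score <= low_q for score in scores) / len(scores)
    return low_frac <= max_low_frac
```

Both comparisons are strict, so a read with exactly 10% N bases or exactly 50% low-quality bases is kept under this reading of the criteria.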

Genome assembly

CLas genome sequences were generated by combining reference-based assembly and de novo assembly. For reference-based assembly, three CLas genomes containing different types of prophages, YNBC (CP118771, Type 1 prophage), A4 (CP010804, Type 2 prophage) and JXGC (CP019958, Type 3 prophage), were used as references. The prophage type of each CLas strain was identified by mapping reads to three prophage sequences: Type 1 prophage (P-YNBC-1, positions 1,187,948 to 1,230,892 of strain YNBC), Type 2 prophage (P-A4-2, positions 1,189,877 to 1,603 of strain A4) and Type 3 prophage (P-JXGC-3, positions 1,192,430 to 1,582 of strain JXGC); the Type 2 and Type 3 coordinates wrap across position 1 of the circular genome. The retained reads were mapped to the reference genomes using CLC Genomics Workbench v.20.0 with default parameters. De novo assembly of the retained reads of each sample was performed using Velvet v.1.2.10 with a minimum contig length of 1,000 bp [28]. The CLas de novo contigs were identified and extracted by BLAST search using the genomes of CLas strains A4, YNBC and JXGC as queries. Gaps in the consensus sequences generated from reference-based assembly were closed using (1) de novo contigs that connected CLas reference-mapping contigs, identified with BLAST+ v.2.10.0 (E-value < 1e-5) [29], and (2) read walking following a previously published method [30]. These efforts generated high-quality CLas whole-genome sequences, including the chromosomal and prophage regions. All genomes were submitted to the NCBI genome database and annotated using the Prokaryotic Genome Annotation Pipeline (PGAP) v.6.5 [31].
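
The Type 2 and Type 3 prophage coordinates quoted above have a start position larger than the end position, which is what one would expect for an interval that crosses position 1 of a circular genome. The following Python sketch shows how the length of such an interval could be computed. The function name is hypothetical, and `genome_length` must be supplied by the reader, since the lengths of the A4 and JXGC genomes are not stated in this section.

```python
def circular_span(start, end, genome_length):
    """Length in bp of a 1-based, inclusive interval on a circular genome.

    If start <= end the interval is ordinary; otherwise it is assumed to
    run from start to the last base and continue from base 1 to end.
    """
    if not (1 <= start <= genome_length and 1 <= end <= genome_length):
        raise ValueError("coordinates must lie within 1..genome_length")
    if start <= end:
        return end - start + 1
    return (genome_length - start + 1) + end
```

For the Type 1 prophage, which does not wrap, the result is independent of the genome length: positions 1,187,948 to 1,230,892 span 42,945 bp.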

SNP calling and annotation

A total of 13 available complete CLas genomes, A4 (CP010804), CoFLP (CP054558), GDCZ (CP118922), gxpsy (CP004005), Ishi-1 (AP014595), JRPAMB1 (CP040636), JXGC (CP019958), PGD (CP100754), psy62 (CP001677), PYN (CP100417), ReuSP1 (CP061535), TaiYZ2 (CP041385) and YNBC (CP118771), were downloaded from the NCBI genome database and combined with the 135 CLas genomes sequenced in this study to form the CLas genome dataset for SNP calling. SNPs in all of these genomes were identified with the “snippy-multi” program in snippy v.4.6 (https://github.com/tseemann/snippy) using the A4 chromosomal sequence (positions 1,604 to 1,189,876 of the A4 genome) and the three prophage sequences (P-YNBC-1, P-A4-2 and P-JXGC-3) as references. The base position and alleles of each SNP were extracted (.vcf files). The SNPs were annotated using vcf-annotator (https://github.com/rpetit3/vcf-annotator) as intragenic (synonymous or non-synonymous) or intergenic mutations.
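
Because the reference consists of four separate sequences (the chromosome and three prophages), SNPs can be tallied per reference sequence directly from the VCF records. The following Python sketch assumes standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT, ...); it is not part of snippy or vcf-annotator, and the contig names in the usage note are hypothetical placeholders rather than the names used in the deposited files.

```python
from collections import Counter

BASES = ("A", "C", "G", "T")


def count_snps_by_contig(vcf_lines):
    """Count single-nucleotide variant records per reference sequence.

    A record is counted as a SNP when REF is one base and every ALT
    allele is also one base; header lines starting with '#' are skipped.
    """
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed record; ignored in this sketch
        chrom, ref, alt = fields[0], fields[3].upper(), fields[4].upper()
        alt_alleles = alt.split(",")
        if ref in BASES and all(allele in BASES for allele in alt_alleles):
            counts[chrom] += 1
    return counts
```

Applied to an opened VCF file, for example `count_snps_by_contig(open("core.vcf"))` with a hypothetical file name, the function returns a `Counter` keyed by reference sequence name, so that chromosome and prophage totals can be compared with the per-region counts reported in this paper.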

Phylogeny analysis

The phylogeny (Neighbor-Joining tree) of all CLas strains (the 135 sequenced in this study and 13 previously published) was constructed from the aligned SNPs with CLC Genomics Workbench v.20.0 under the Jukes-Cantor (JC) model. The phylogenetic tree was visualized in iTOL v.6.8.1 [32]. The p-distance between SNP profiles was calculated using VCF2Dis v.1.50 (https://github.com/BGI-shenzhen/VCF2Dis).
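
For readers unfamiliar with the two distance measures named above, the p-distance is the proportion of compared sites that differ, and the Jukes-Cantor correction converts it to an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). The Python sketch below implements these textbook definitions on two aligned sequences; it is not the code used by CLC Genomics Workbench or VCF2Dis, and the function names are hypothetical. Note that distances computed on an SNP-only alignment are relative to variable sites, not to the whole genome.

```python
import math

BASES = "ACGT"


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in BASES and b in BASES
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - 4p/3), defined for 0 <= p < 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)
```

For small p the corrected distance is close to p itself, which is why the correction matters little for closely related strains.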

Data Records

The clean sequencing data (fastq format) were deposited in the NCBI Sequence Read Archive under accession number PRJNA1123441 (https://identifiers.org/ncbi/insdc.sra:SRP513632) [33]. The genome assemblies (GenBank format) were deposited in the NCBI Genome database under project accession number PRJNA996237 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA996237) [34] (Supplementary Table S1). The VCF files are available on figshare [35].

Technical Validation

Quality control of sequencing data

The quality of all sequencing data was assessed with FastQC and MultiQC to examine the mean quality score per position and the GC content. The Illumina HiSeq platform yielded a total of 1,533 gigabases of clean data for the 135 samples, an average of 11.4 gigabases per sample. The MultiQC reports for the 150-bp paired-end reads showed that the mean quality value at each base position was higher than a Phred quality score of 30 (Fig. 2A) and that the average quality score of >94% of total reads was greater than a Phred score of 20 (Fig. 2B, Supplementary Table S1). These statistics confirmed that the mean quality scores and per-sequence metrics met a high quality standard for subsequent analyses. The GC content of reads from all samples showed a stable distribution (average of 36.5%) (Fig. 2C), suggesting no evident contamination during the sequencing process.
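
The Phred thresholds quoted above correspond to base-calling error probabilities through Q = −10 log10(P), so Q30 means an expected error rate of 1 in 1,000 and Q20 means 1 in 100. The Python sketch below shows this conversion and how the share of reads whose mean quality exceeds a threshold could be computed. It assumes Phred+33 encoding and an arithmetic mean of per-base scores; the function names are hypothetical and the code is not taken from FastQC or MultiQC.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def mean_read_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores of one read (Phred+33 assumed)."""
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


def fraction_reads_above(quality_strings, threshold=20):
    """Fraction of reads whose mean quality is strictly above the threshold."""
    qualities = [mean_read_quality(q) for q in quality_strings]
    return sum(q > threshold for q in qualities) / len(qualities)
```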

Fig. 2

Quality assessment of sequencing data and genome assembly. (A) Mean quality scores across each base position. (B) Mean quality scores per read. (C) GC content of reads. (D) Statistics of coverage (X) of genome assembly.

Quality evaluation of CLas genome assembly

Filtering reads against the citrus host genomes combined with reference-based mapping generated a total of 750,493,499 CLas reads from the 135 CLas samples, an average of 5,559,211 reads per sample. The genome assemblies of the 135 CLas strains showed high integrity, with an average length of 1,241,536 bp (range 1,221,309 bp to 1,308,521 bp) (Supplementary Table S1). The average coverage of the 135 CLas genomes was about 675X. A total of 122 CLas genomes (90.4%) showed over 10X coverage, and 93 CLas genomes (68.9%) showed over 100X coverage (Fig. 2D). The number of genes can also be used as a criterion for evaluating the quality of a whole-genome assembly. Similar numbers of genes were annotated across the 135 CLas genomes, ranging from 1,104 to 1,211, indicating the high quality of the assemblies (Supplementary Table S1). Prophage typing of the 135 CLas genomes found that 44 CLas strains contained the Type 1 prophage, 89 contained the Type 2 prophage and 44 contained the Type 3 prophage. It was noted that 28 CLas strains contained two types of prophages/phages and six of them contained three types of prophages/phages (Supplementary Table S1). These quality metrics indicate the completeness and contiguity of the CLas genome assemblies, which enhance the resolution and reliability of downstream analyses.
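
As a rough consistency check on the figures above, mean coverage can be approximated as (number of reads × read length) / genome length. The Python sketch below applies this to the reported averages, under the assumptions that each counted read is a single 150-bp read and that all bases map to the genome; the authors' exact calculation is not described in this section, and the function names are hypothetical.

```python
def mean_coverage(n_reads, read_length_bp, genome_length_bp):
    """Approximate mean depth: total sequenced bases divided by genome length."""
    return n_reads * read_length_bp / genome_length_bp


def percent_of(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Averages reported in the text.
AVG_READS_PER_SAMPLE = 5_559_211
READ_LENGTH_BP = 150
AVG_GENOME_LENGTH_BP = 1_241_536

approx_depth = mean_coverage(AVG_READS_PER_SAMPLE, READ_LENGTH_BP, AVG_GENOME_LENGTH_BP)
```

Under these assumptions `approx_depth` comes out at roughly 672X, close to the reported average of about 675X. A small difference is to be expected if the reported value is the mean of per-sample coverages, since a mean of ratios generally differs from a ratio of means. The same helper reproduces the reported shares: `percent_of(122, 135)` and `percent_of(93, 135)` give 90.4 and 68.9.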

Quality control of SNP data

The high-resolution SNP dataset for the CLas genomes was generated with the “snippy-multi” program in snippy v.4.6. A total of 5,090 high-quality SNPs were identified among 148 CLas genomes, comprising the 135 sequenced in this study and 13 complete CLas genomes downloaded from the NCBI database. The density of SNPs across the CLas genome is shown in Fig. 3. Specifically, 4,247 SNPs were located in the chromosome, 383 in the Type 1 prophage, 323 in the Type 2 prophage and 137 in the Type 3 prophage. Based on variant annotation, SNPs were classified as intragenic (synonymous or non-synonymous) or intergenic (Fig. 3). Overall, most SNPs were located in intragenic regions of both the chromosome and the three prophages. Among the intragenic SNPs, non-synonymous SNPs, which change the encoded amino acid, accounted for more than half of the total SNPs (Fig. 3). More intergenic SNPs were observed in the chromosomal region than in the three prophages (Fig. 3).

Fig. 3

Statistics of SNP annotation in chromosomal region (a), Type 1 prophage (b), Type 2 prophage (c) and Type 3 prophage (d).

Phylogeny analysis of CLas genomes

The SNP profiles of all strains were compared and clustered into a phylogenetic tree (Fig. 4). The 148 CLas strains showed high similarity in the chromosome, with pairwise evolutionary distances of less than 0.1 (Fig. 4a). Two main phylogenetic clades (Clade I and Clade II) were identified, containing similar numbers of strains: 55% (82/148) in Clade I and 45% (66/148) in Clade II (Fig. 4a). The majority of CLas strains in Clade I carried Type 2 prophages (79 of the 95 strains carrying Type 2), while those in Clade II predominantly carried Type 1 (41 of the 50 strains carrying Type 1) and/or Type 3 (39 of the 45 strains carrying Type 3) prophages (Fig. 4b). According to the geographical origin of the CLas strains, Clade I was dominant among strains from Guangdong, Guangxi and Guizhou, while Clade II was dominant among strains from Hainan, Jiangxi, Sichuan and Yunnan (Fig. 4c). Earlier studies suggested that differences in CLas population structure could be associated with bacterial environmental adaptation and phage activity [11,12,36]. Based on the comparison of this CLas genome dataset, it is therefore reasonable to hypothesize that the population differentiation of CLas is mainly related to CLas-phage interactions and the varied environmental conditions of different geographical locations. However, regional transport of CLas-infected seedlings could also lead to genetic mixing and differentiation of CLas populations [37], reflecting the possible spread and epidemiology of HLB. The phylogenetic analysis suggests that the CLas whole-genome data are reliable for further genomic research and provide information for CLas/HLB epidemiology and control.

Fig. 4

Phylogeny of “Candidatus Liberibacter asiaticus” (CLas) strains. (a) The Neighbor-Joining (NJ) tree based on genomic SNPs. (b) The proportion of prophage types. (c) The proportion of CLas strains from different provincial sources.