Introduction

The steady expansion of human–vector interactions has facilitated the emergence and re-emergence of vector-borne diseases worldwide1,2. Among these vectors, mosquitoes contribute to an extensive list of illnesses which account for more than half of vector-related fatalities annually3. Notwithstanding the medical significance of mosquitoes as vectors, there is a clear disparity in the availability of genetic data in public databases4,5,6,7, demonstrating a bias towards a relatively small number of well-studied taxa in the Aedes, Anopheles and Culex genera4,8,9,10,11. Apart from the prominent species in these genera, there are a plethora of other mosquitoes, many belonging to morphologically cryptic species complexes which play a role in maintaining and driving pathogen transmission in human and non-human cycles12.

With a human population of over 4.8 million, Harris County is the most populous county in Texas and the third most populous county in the United States13. The county’s geography consists predominantly of forests in the northern and eastern regions, with savanna grasslands and coastlands located in the southern and western regions. Houston, the largest city in the county, is situated in a gulf coastal plain ecosystem and is built on flat topography with a vast system of bayous, human-created canals and rivers, making the city prone to recurrent flooding14. This ecosystem provides an optimal breeding environment for the 56 mosquito species recorded from Harris County. These mosquitoes are classified into 10 genera15 and approximately 25% of these species are of known medical importance. Harris County is well acquainted with outbreaks of mosquito-related illnesses predominantly transmitted by Aedes and Culex species16,17,18, with recent reports from the Harris County Public Health Mosquito and Vector Control Division (HCPH-MVCD) identifying pools of Culex mosquitoes testing positive for West Nile virus (WNV) after flooding events19. For mosquito control operations like HCPH-MVCD, surveillance of mosquito counts, species identification and, when available, pathogen detection drive decisions for mitigation strategies and public engagement.

Despite having well-trained personnel, misidentification of specimens can occur due to a myriad of factors including poor specimen quality. Misidentification is particularly common in species complexes, where mosquitoes are morphologically similar and may occupy similar ecological niches20,21,22. Inaccurate identification may lead to ambiguous surveillance data, which could have a negative impact on the deployment and success of mitigation strategies23. Mosquito species are conventionally identified using taxonomic keys based on morphology and, where necessary and possible, validated with molecular barcodes targeting the cytochrome oxidase I gene (COI) and internal transcribed spacer 2 (ITS2) regions24,25. However, traditional barcoding approaches like these are constrained by the paucity of genomic reference data available for comparison against query sequences26 and in their power to redress taxonomic discrepancies in cryptic species complexes4,27. This emphasizes the importance of reference sequences from well-curated voucher specimens in genomic repositories such as NCBI’s GenBank database. These data may in turn be used to develop rapid and cost-effective toolsets for accurate identification of mosquito taxa.

Over the last few decades, mosquito phylogenetics using molecular data ranging from short genetic sequences to full genomes has been of interest due to the ability of genetic data to confirm taxonomy, which has historically been built on morphological characteristics4,7,10. Though morphological taxonomy remains an integral and valuable tool, molecular phylogenetics is important in compensating for its limitations in the accurate identification of cryptic species and damaged/incomplete specimens28,29,30. It is crucial to differentiate between vector and non-vector species for development and implementation of targeted mitigation strategies31. Advancements in sequencing technologies and computational approaches have facilitated a dramatic increase in genomic datasets, including mitochondrial genomes, for a wide range of organisms including insects32,33,34,35. Mitochondrial genomes (mitogenomes) have proven successful in resolving species identification, population structure, molecular taxonomy and evolutionary studies of metazoans4,9,10,36. The mitogenome exists as a circular, double-stranded DNA molecule which encodes 2 ribosomal RNA (rRNA) genes, 13 protein-coding genes (PCGs), 22 transfer RNA (tRNA) genes and a non-coding control region (CR) associated with DNA and RNA synthesis. Characteristics such as a high copy number, absence of introns, low incidence of recombination and maternal inheritance contribute to the usefulness of the mitogenome in molecular identification and inferring well-supported phylogenies4,29. The concatenated sequence of the 13 PCGs is preferred over the full set of 37 genes for mitogenome-based phylogenetic analyses, as the highly conserved tRNAs and complex secondary structure of the rRNA gene sequences may confound phylogenetic analysis37,38,39. Additionally, the vast majority of mitogenome analyses, including those for mosquitoes, have focused on the PCGs, which have proven highly informative and allow for more direct comparison between different studies5,10,40.

Genetic information of any kind for the mosquitoes of Harris County is currently limited, with publicly available data for less than 30% of the 56 mosquito species that have been recorded and are considered present. To strengthen the capacity of the HCPH-MVCD for accuracy and confirmation of identification of damaged specimens or cryptic species using molecular and phylogenetic approaches, this study aimed to (i) generate complete mitochondrial reference genomes for well-curated mosquito specimens from Harris County and (ii) demonstrate the resolution of phylogenetic approaches using mitogenomes to inform species identification among morphologically cryptic taxa within the Culicidae.

Results

Mosquito collection

Specimens of 37 of the 56 known mosquito species reported from Harris County were collected in April 2022 and from January to December 2023. These species represented 10 Culicidae genera (Aedes, Anopheles, Culex, Culiseta, Coquillettidia, Mansonia, Orthopodomyia, Psorophora, Toxorhynchites and Uranotaenia) from diverse habitats.

Sequencing, mitochondrial genome statistics and characteristics

The 37 single mosquito specimens yielded a total of 1,213,000,000 paired-end reads, ranging from 16.51 (Ae. vexans) to 63.64 (Ae. infirmatus) million reads each. This included genomic data for 25 newly characterized mitogenomes where no sequence data were previously available. Mitochondrial genome size was relatively consistent within genera, with Or. signifera having the largest contig size of 17,190 bp and An. crucians the smallest contig size of 15,365 bp. For most species, less than 25% of the sequence reads from each specimen were used for mitogenome assembly (Table 1).

Table 1 Statistics for 37 mosquito mitochondrial genomes from Harris County.

The mitogenomes generated in this study were comparable to reference mosquito mitogenomes retrieved from the GenBank database in encompassing 37 genes: 13 protein-coding genes (PCGs), 22 transfer RNAs (tRNAs) and 2 ribosomal RNAs (rRNAs), together with a control region (Fig. 1). Among the 10 genera, the length of the concatenated PCGs ranged from 11,220 bp for Cx. restuans to 11,243 bp for Ps. ciliata, with an average AT content of 78.74% (Table S1). Additionally, all mitochondrial genomes showed positive AT and negative GC skews (Table S2).
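The composition statistics above can be illustrated with a minimal Python sketch. It assumes the conventional strand-asymmetry definitions, AT skew = (A − T)/(A + T) and GC skew = (G − C)/(G + C); the text does not state which tool or formula was used for Table S2, and the function name and toy sequence below are hypothetical.

```python
from collections import Counter


def composition_stats(seq: str) -> dict:
    """Return AT content (%), AT skew and GC skew for one DNA strand.

    Assumed definitions (not stated in the text):
      AT skew = (A - T) / (A + T)
      GC skew = (G - C) / (G + C)
    """
    counts = Counter(seq.upper())
    a, t, g, c = (counts[base] for base in "ATGC")
    total = a + t + g + c  # ambiguous bases such as N are ignored
    if a + t == 0 or g + c == 0:
        raise ValueError("sequence needs both A/T and G/C bases to compute skews")
    return {
        "at_percent": 100.0 * (a + t) / total,
        "at_skew": (a - t) / (a + t),
        "gc_skew": (g - c) / (g + c),
    }


# Toy sequence (made up, not from the study): AT-rich, more A than T, more C than G.
toy = "A" * 45 + "T" * 35 + "G" * 8 + "C" * 12
stats = composition_stats(toy)
print(stats)
```

On the toy sequence this gives a positive AT skew and a negative GC skew, the same pattern of signs reported for the mitogenomes in this study.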

Fig. 1

Structural representation of the mitochondrial genome of Culex tarsalis, a mosquito of public health importance in Harris County, Texas. Culex tarsalis is usually captured at specific sites following flooding events. The teal, black and salmon color blocks represent the PCGs, tRNAs and rRNAs respectively.

Phylogenetic analysis and Culex species nucleotide diversity

Maximum likelihood (Figure S1) and Bayesian analyses resulted in similar tree topologies, with the Aedes, Anopheles, Culex and Psorophora genera separating into 4 strongly supported primary clades, with bootstrap values greater than 70% and posterior probabilities close to 1. Single specimens representing 4 other genera (Coquillettidia, Mansonia, Orthopodomyia, Toxorhynchites) separated into a single well-supported clade (Fig. 2). Single specimens representing Uranotaenia and Culiseta separated on single branches of their own. Mitogenomes from well-characterized species in this study clustered with reference genomes of the same species from the GenBank database, except for Ps. cyanescens and Ps. discolor, which clustered with Aedes species. Despite similar topologies, phylogenies using the 13 PCGs of the mitochondrial genome generally had better support than those utilizing only the commonly used COI region; species belonging to the Culex genus were used as the example for both Bayesian (Fig. 3) and maximum likelihood (Figure S2) analyses. Of note in the Culex-focused phylogenetic comparison, the Cx. restuans NCBI reference sequence clustered with Cx. tarsalis in both analyses. Additionally, sequences of Cx. restuans and Cx. erraticus derived from Harris County specimens clustered with Cx. quinquefasciatus in both analyses rather than with reference sequences from those taxa where available (Fig. 3). Sliding window analysis demonstrated that nucleotide diversity (Pi) for Culex species was highest for NAD2 (0.081), COXI (0.062), NAD5 (0.058) and NAD4 (0.053) (Fig. 4).

Fig. 2

Phylogenetic tree for the 37 mosquito species from Harris County, Texas. Accession numbers starting with ‘PQ’ were sequenced in this study. Mitogenomes of the 25 newly characterized species are indicated in blue. The tree was constructed from the concatenated 13 PCGs using BEAST with the General Time Reversible (GTR + G + I) model. Numbers at the nodes represent posterior probabilities based on Bayesian inference.

Fig. 3

Phylogenetic reconstruction using Bayesian inference based on (A) the concatenated 13 PCGs of Culex species mitogenomes sequenced in this study (asterisk) and 7 Culex mitogenomes from NCBI GenBank, and (B) COI genes extracted from Culex species sequenced in this study (asterisk) and COI regions from Culex species in the GenBank repository.

Fig. 4

Sliding window analysis of protein coding genes among 8 Culex mosquito mitochondrial genomes sequenced in this study.

Discussion

Despite advancements in sequencing technologies and the expanding medical importance of mosquitoes, there remains a dearth of mosquito sequence data; only a small proportion of known mosquito species have any sequence data publicly available. In NCBI’s GenBank repository, we found 12 reference accessions (Table S1) representing less than one quarter of the species historically reported from Harris County. Due to their correlation with the transmission of mosquito-borne diseases globally, searches for “Ae. albopictus” or “Ae. aegypti” generated more than 100 hits, in contrast to other medically significant mosquito species. Here we utilized a genome skimming approach to rapidly and inexpensively generate new mitogenomes for 37 mosquito species belonging to 10 genera. This shallow whole genome sequencing approach has been previously used for evolutionary investigations and mitochondrial genome recovery in a range of organisms including mosquitoes34,41,42,43.

More importantly, we were able to assemble complete mitogenomes with sufficient organelle coverage (30X to 339X) using as little as 1% (Table 1) of the reads recovered from sequencing. This is the first report to generate novel sequence data and annotations for so many (n = 25) mosquito species from the United States in a single study.

The low GC content, positive AT skews, negative GC skews and complement of 37 genes found in the assemblies generated in this study are characteristic of mosquito mitochondrial genomes and have been reported across genera in earlier studies6,9,10,36. The phylogenetic relationships among Culicidae remain poorly characterized beyond well-studied vector species due to limitations in morphological identification, reference sequence data, reliable molecular markers and comparable data across collection series, and to the growing recognition of cryptic species complexes. These limitations leave unresolved the phylogenetic status of species belonging to defined genera, some of which are common to Harris County. Although there has been an expansion in efforts to generate molecular information for a generous number of species, these efforts are often biased towards known or targeted vector species and, for understudied species, constrained by limited sampling and reliance on data mining from sequence databases4,7,10. In the sequence comparison and tree building for this study, our 25 novel mitogenomes clustered within the broader Aedes, Anopheles, Culex and Psorophora genera, with the under-represented genus/species groups separating as expected (Fig. 2). However, outliers such as Ps. cyanescens and Ps. discolor fell within Aedes clades, which may be due to the unresolved taxonomic relationship between the Psorophora and Aedes genera, which are estimated to have diverged approximately 102 million years ago (MYA)7.

As an example of the greater resolution full mitogenomes provide over traditional barcoding approaches, we narrowed our focus to Culex species phylogenetics due to the long-term occurrence of West Nile virus outbreaks in Harris County and the continued incongruence of morphological identification with molecular analyses for members of this genus44,45,46. Many species belonging to the genus Culex have a global distribution. The genus is divided into subgenera which are then further split into subgroups, which adds to the complexity of species in this broad genus and the difficulty in rectifying taxonomic discrepancies. Many studies have used a range of strategies47,48 to understand the phylogenetic relationships of the Culex subgenus, with many focusing on complexities of the Pipiens group45,46. Notably, Cx. salinarius and Cx. restuans are nominotypical members of complexes together with Cx. quinquefasciatus in the Pipiens group. We were able to generate sequence data and assemble mitochondrial genomes for 8 Culex species (Table 1) from Harris County, including 2 new records that were not present in the GenBank database. Furthermore, phylogenetic analyses using Bayesian (Fig. 3A) and maximum likelihood (Figure S2A) approaches based on the 13 concatenated PCGs derived similar topologies for Cx. quinquefasciatus representatives of the Pipiens group. Both trees based on the concatenated PCGs also showed better-supported phylogenetic relationships than trees constructed from the single COI gene alone. The single COI gene (barcode) is routinely used owing to the easy acquisition of COI sequence data and the absence of additional genomic data (Fig. 3B, S2B). However, the lack of discrete resolution within the Cx. quinquefasciatus cluster may be due to the biological complexities and genetic variability among members of this group, which were identified in previous studies49,50. Mitochondrial genome approaches have demonstrated limited power to discriminate between taxonomic groups where genetic introgression or hybridization may still occur51,52. Nuclear genomic data53,54 may provide the resolution required to unravel such biological complexities in mosquito species complexes.

Nucleotide diversity among protein coding genes varies across mosquito genera, and prior studies have suggested that NAD5 and NAD4, as well as other genes, may serve as more suitable targets for development of discriminatory markers for Culicidae55,56,57. Analysis of the protein coding genes of Culex species using a sliding window analysis (Fig. 4) demonstrated that NAD2, COXI, NAD5 and NAD4 have the highest nucleotide diversity, in that order, much more than the commonly used COXI gene. This suggests that these genes may have more potential for development of alternative barcoding tools for identification of these mosquito species4,5,36.

The rise in arboviral outbreaks in the Americas is unlikely to abate58. Approximately 25% of the mosquito species reported from Harris County are known to serve as vectors of arboviral pathogens, hence accurate identification of these species is critical for surveillance and design of appropriate mitigation strategies. The mitogenomes generated in this study will serve as reference sequences to verify accurate morphological identification both locally and globally. In addition, the 25 novel mitogenomes reported here significantly add to the volume of data currently available to better resolve the phylogenetic relationships among Culicidae taxa world-wide.

Methods

Sample collection and morphological identification

Mosquito specimens were collected in April 2022 and from January to December 2023 throughout Harris County during routine entomological surveys by the HCPH-MVCD (Fig. 5) using gravid and storm sewer traps (John W. Hock, Gainesville, FL). Adults were sorted and identified by HCPH-MVCD staff using a taxonomic key59, and morphologically intact, verified specimens of targeted species were shipped to the Johns Hopkins Bloomberg School of Public Health (Maryland, U.S.A.) for molecular analysis.

Fig. 5

Harris County entomology surveillance study sites. This map was generated using the leaflet package (v.2.2.2) in R (v.4.2.1), with OpenStreetMap as a tile provider. Blue plots indicate locations of mosquito collections utilized for this study.

Total DNA extraction and sequencing

Single mosquito specimens were pre-processed using a previously described treatment60. Briefly, single specimens were homogenized and incubated at 56 °C in a mixture containing 98 μL of PK buffer (Applied Biosystems, Waltham, MA) and 2 μL of Proteinase K (Applied Biosystems, Waltham, MA). This was followed by an extraction protocol as described by the manufacturer (Qiagen DNeasy Blood and Tissue Kit, Hilden, Germany). Extracted DNA was quantified using the Qubit dsDNA assay kit (Thermo Fisher Scientific, Waltham, MA) and stored at −20 °C before being shipped to SeqCenter (Pittsburgh, USA) for library construction and sequencing on an Illumina NovaSeq 6000 system. Libraries were sequenced to a minimum depth of 6.67 million paired-end 150 bp reads for species in the Anopheles, Coquillettidia, Culiseta, Mansonia, Orthopodomyia and Toxorhynchites genera, while specimens belonging to the Aedes, Culex, Psorophora and Uranotaenia genera were sequenced to a minimum depth of 13.3 million paired-end 150 bp reads.

Mitochondrial genome assembly, annotation and sequence analysis

Mitogenomes were assembled using NOVOPlasty (RRID:SCR_017335) version 4.3.161, with reference mitochondrial genomes (NC_035159, NC_014574, NC_064603, NC_054327, NC_060642) as seed sequences and k-mer set at 39. The MITOchondrial genome annotation server (MITOS)62 was utilized for automated annotations using the invertebrate genetic code under default settings. Start and stop codon locations were manually adjusted in Geneious Prime (RRID:SCR_010519) version 2023.2.1 (Biomatters, Auckland, New Zealand) to match reference mosquito mitochondrial genomes, and the sequences and annotations were submitted to the GenBank database. DnaSP version 6 was used to calculate nucleotide diversity (Pi) among the PCGs using a sliding window approach, with a 250 bp window advanced in 25 bp steps across the alignment of the 8 Culex mitochondrial genome sequences generated in this study.
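The sliding window calculation described above can be sketched in plain Python. This is an illustration only, not a reimplementation of DnaSP: it assumes Pi is the mean number of pairwise differences per site, that the alignment is gap-free and of equal length, and the function names and toy alignment are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site (Pi) for aligned sequences."""
    length = len(seqs[0])
    if len(seqs) < 2 or length == 0:
        raise ValueError("need at least two non-empty sequences")
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching sites over every pair of sequences.
    diffs = sum(sum(x != y for x, y in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * length)


def sliding_pi(seqs, window=250, step=25):
    """Yield (window midpoint, Pi) for each window along the alignment."""
    length = len(seqs[0])
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        yield start + window // 2, nucleotide_diversity(chunk)


# Toy alignment (made up); a small window and step keep the example readable.
toy = ["AAAAAAAA", "AAAATAAA", "AAAATTAA"]
profile = list(sliding_pi(toy, window=4, step=2))
for midpoint, pi in profile:
    print(midpoint, round(pi, 3))
```

With the defaults (`window=250`, `step=25`) the function mirrors the window and step sizes stated in the text; handling of gaps and ambiguous bases would need to follow whatever DnaSP does and is not covered here.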

Phylogenetic analysis

The protein coding genes of the mitogenomes generated in this study were aligned with those from available mosquito reference sequences (NC_035159, NC_006817, NC_065121, NC_000875, NC_036006, NC_037823, NC_014574, NC_067606, NC_067607 and NC_060642) and a Drosophila melanogaster sequence (NC_024511), all imported from the GenBank repository. Alignment used the MAFFT amino acid alignment mode as implemented in Geneious Prime (RRID:SCR_010519) version 2023.2.1 (Biomatters, Auckland, New Zealand), and the alignment was exported in nexus format. The best-fit nucleotide substitution model for the aligned sequence matrix was determined using jModelTest (v2.1.10) software63 under default settings according to the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Maximum likelihood analyses were performed in Molecular Evolutionary Genetics Analysis (MEGA) X version 10.0.564 with 1000 bootstrap replicates, and Bayesian inference was performed in Bayesian Evolutionary Analysis by Sampling Trees (BEAST) 265 using three independent runs under default settings, with a 20% burn-in applied for tree building. Trees were visualized using FigTree v.1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/).
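As a rough illustration of how the two information criteria rank candidate substitution models, the sketch below applies the standard definitions AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L. The log-likelihoods, parameter counts and alignment length are invented for the example and are not values from this study; jModelTest may also count branch lengths as parameters, which this sketch does not.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical candidates: name -> (log-likelihood, free substitution-model parameters).
candidates = {
    "JC": (-52000.0, 0),
    "HKY+G": (-50500.0, 5),
    "GTR+G+I": (-50100.0, 10),
}
n_sites = 11230  # hypothetical alignment length in bp

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
bic_scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}

best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print("best by AIC:", best_aic)
print("best by BIC:", best_bic)
```

Both criteria reward fit and penalize extra parameters; BIC penalizes more heavily as the alignment grows, so the two can disagree when the likelihood gain from a richer model is small.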