Introduction

Enhancers are short cis-acting regulatory elements in noncoding DNA that amplify transcriptional levels of target genes by tens to hundreds fold over the basal level of core promoter elements at the transcription start site. Core promoters are located proximal to the transcription start site and facilitate the binding of RNA polymerase and the initiation of transcription1,2,3. Enhancers control transcriptional activity of target genes through their interaction with activators and the promoter and are, in turn, responsible for most regulated gene expression in the transcriptome. The precise mechanisms of enhancer action is unknown and is an area of active study4,5,6. Enhancers can function at a distance from target genes and independent of their physical orientation in the chromosome7. The identities of enhancers and some of their interacting protein factors that lead to their regulatory function have been described in well-studied model organisms, but enhancers cannot be reliably predicted by sequence-based algorithms, and thus must be detected directly by functional activity using reporter assays, or indirectly inferred using methods to detect open or modified chromatin properties.

Sequence polymorphism within enhancers has been associated with phenotypic differences, including predisposition to disease, as observed in diverse organisms5,8,9,10,11. Most of the significant variants mapped in human genome-wide association studies (GWAS) are noncoding12, and at least 60–70% of significantly associated human GWAS single-nucleotide point mutations (SNPs) lie within functional enhancers8,13. In cancer studies, the majority of tumor-driving mutational changes are also thought to be in noncoding regulatory elements, especially enhancers14.

Genetic variation in enhancers can occur as SNPs, insertions and deletions (indels), and as copy number variants15,16,17. Enhancer variation among individuals can underlie both Mendelian and complex traits7,18,19. At the population level, positively-selected variation in enhancers controlling key pathways likely plays an important role in differentiation and evolution20,21. Indeed, some of the fastest-evolving parts of the human genome as compared to other primates are functional embryonic enhancers related to central nervous system development22.

Enhancer discovery in mosquitoes is limited to a few previous studies using indirect methods based on detection of chromatin properties to infer enhancer locations23,24,25. In genetic vector control strategies, it could be important to know the locations of enhancers that could cause unpredicted effects on transgene expression. Gal4 based enhancer trapping lines have been used in Anopheles stephensi to optimize methods to drive transgene expression26. Recent work in the African malaria vector Anopheles funestus highlights the effects of indel variation at a cis-regulatory element of a cytochrome P450 and its association with insecticide resistance27. Thus, relatively simple evolved sequence variation in enhancers can produce large phenotypic shifts in the organism28,29. Despite these examples, the effect of genetic variation on enhancers has barely been examined in any species, and to our knowledge, the functional effect of variation on enhancer activity has not been systematically surveyed in any organism.

Here, we analyze a set of candidate enhancers in order to benchmark the parameters needed for reliable enhancer discovery, validation, and determination of polymorphism effects in Anopheles. We validate the candidates as functional enhancers using luciferase reporter assays, and measure the effects of genetic polymorphism on enhancer activity. The results of the current report are a first step towards developing a comprehensive genome-wide catalog of Anopheles enhancers, and biological studies to characterize enhancer function in vector biology.

Results

Candidate enhancer selection

The standard approach for enhancer detection is by functional testing using luciferase reporter assays that directly measure enhancer activity, or by indirect methods such as ChIP-seq, which can infer the presence of a subset of enhancers by correlation with chromatin features. Here, we implemented for the first time in Anopheles a screen (Self-Transcribing Active Regulatory Region sequencing, STARR-seq) that detects enhancers directly by a functional reporter assay analogous to the luciferase reporter assay, but with the readout of enhancer-dependent RNA transcript output measured as sequenced cDNA, rather than by luciferase light output30.

We identified candidate A. coluzzii enhancers by manual examination of our generated sequence data from the functional screen near six selected genes mentioned below. We detected intervals where coverage of the cDNA sequence track, indicative of enhancer activity (Fig. 1, solid lines), was visibly greater than the coverage of the plasmid sequence track, which is the internal baseline control indicating background levels of the transfected plasmid (Fig. 1, dotted lines). The genes were selected because of predicted functions important for the biology of vectors, and because they were near potential enhancer signals. The genes are involved in vector immunity: Krueppel-Like Factor 6/7 (KLF, AGAP007038), Leucine-Rich Immune protein 1 (LRIM1, AGAP006348); insecticide resistance: Acetylcholinesterase 1 (ACE1, AGAP001356), GABA-gated chloride channel subunit (Rdl, AGAP006028); and developmental biology: LIM homeobox protein 2/9, ortholog of Drosophila apterous FBgn0267978 (AP, AGAP008980), and Ovo, AGAP000114 (Table 1). The regulatory regions and enhancers of the latter two genes, apterous and Ovo, have been well characterized for the Drosophila orthologs31,32,33. In addition, a negative control interval was chosen as a size-matched interval located within intron 1–2 of the gene, homeobox protein distal-less (DLX, AGAP007058) that has no visible divergence of cDNA and plasmid sequence tracks, and thus no predicted enhancer function. The candidate enhancers are named according to the most proximal coding sequence, but it is not known whether the enhancers influence the expression of the proximal gene.

Figure 1
figure 1

Detection of Anopheles coluzzii candidate enhancers. Sequence data near six Anopheles coluzzii genes were examined using the Integrative Genomics Viewer (IGV)43 to screen for candidate enhancers. Enhancers were called where coverage of the cDNA sequence track (solid lines) was greater than the baseline coverage of the plasmid sequence track (dotted lines). The cDNA sequence track is analogous to light output from luciferase reporter assays. Line color (green, red, blue) represents three biological replicates. The enhancers are named by the most proximal genes: Krueppel-Like Factor 6/7 (KLF, AGAP007038), Leucine-Rich Immune protein 1 (LRIM1, AGAP006348), Acetylcholinesterase 1 (ACE1, AGAP001356), GABA-gated chloride channel subunit (Rdl, AGAP006028), LIM homeobox protein 2/9, ortholog of Drosophila apterous FBgn0267978 (AP, AGAP008980), and Ovo, AGAP000114 (Table 1). A negative control interval within intron 1–2 of distal-less (DLX, AGAP007058) was chosen because it lacks visible divergence of cDNA and plasmid sequence tracks. Graphs display cDNA and plasmid tracks in 10 kb windows centered on the candidate enhancers. Only one candidate enhancer is seen in all windows except AP, where the central peak (arrow) was used. X-axis indicates genomic coordinates in the PEST reference genome, y-axis indicates normalized sequence depth corrected for overall plasmid depth observed in the IGV display.

Table 1 Physical location of enhancers and proximal annotated gene. Enhancer coordinates are based on the locations of PCR primers given in Supplementary Table S2. Coordinates from the PEST AgamP4 genome assembly.

Functional validation of enhancer activity

The predicted candidate enhancers predicted in Fig. 1 were functionally tested to validate the enhancer predictions. The standard test for enhancer activity is by cloning the candidate in a plasmid carrying a basal promoter and a luciferase reporter gene in an episomal assay. An active enhancer will augment the rate of transcription from the basal promoter, thus elevating the expression level of the luciferase gene. Luciferase expression is measured by adding luciferase substrate to cell extract and detecting light output as relative light units (RLU). Enhancer activity, if any, is measured as increased luciferase activation above background.

Candidate amplicons from A. coluzzii (Table 1) were cloned into the firefly luciferase reporter vector pGL-Gateway-DSCP, and co-transfected into 4a3A cells with the renilla luciferase control vector pRL-ubi-63E. Firefly luciferase RLU measurements were corrected using the renilla luciferase internal control values in the same well, and firefly/renilla RLU for the experimental insert were statistically compared to the firefly/renilla mean value for the DLX negative control, defined as the background level. At least one clone of each candidate enhancer displayed luciferase activity levels above background (p < 0.005), with activity across candidates that varied from 2-fold to more than 20-fold over background (Fig. 2). These results indicate that the enhancer predictions were accurate for all six predicted candidates, and thus validate these genomic intervals as functional A. coluzzii enhancers. The current information provides the first benchmark criteria that can be used for algorithmic genome-wide detection of A. coluzzii enhancers.

Figure 2
figure 2

Candidate Anopheles coluzzii enhancers augment expression of a luciferase reporter. The six candidate enhancers from Fig. 1 were amplified from Anopheles coluzzii mosquitoes and cloned into the pGL-Gateway-DSCP plasmid carrying a basal core promoter and firefly luciferase reporter gene. The cloned candidates were tested for influence upon luciferase expression using a dual luciferase assay system to quantify luciferase activity above background, defined by the DLX negative control (horizontal dotted line). Each of the six tested candidates displayed normalized luciferase activity significantly above background (p < 0.005), thus validating the candidates as functional A. coluzzii enhancers. Each point represents an individual replicate measure of luciferase activity for the tested candidate. Bars indicate the median and 95% confidence intervals. X-axis indicates the name of the candidate enhancer according to the nearest gene (Table 1), y-axis indicates the relative luciferase activity for each measurement, expressed as firefly luciferase corrected to the renilla luciferase internal control value, and normalized for the value of the DLX negative control (see Methods).

Screening for polymorphic alleles of validated enhancers

Having confirmed that all six predicted candidates are functional enhancers, we next wished to identify genetically variable alleles for each enhancer and measure their luciferase activity. For this, alleles of the enhancers were amplified and sequenced from A. coluzzii colonies initiated from the populations in Cameroon, Mali or Burkina Faso. For each of the six enhancers, at least two distinct genetic variants were chosen for tests of enhancer activity. The goal was to identify and test a range of genetic variants, not the mosquito colonies. Therefore, a given colony may or may not be represented for a given enhancer, depending on the variation it segregates. The enhancer alleles were cloned and sequenced, and neighbor joining (N-J) trees depict the evolutionary relatedness and degree of sequence difference of the alleles (Fig. 3). Complete sequences for all tested enhancer alleles are presented in Supplementary File S1.

Figure 3
figure 3

Phylogenetic comparison of enhancer allele genetic variation. Enhancer allelic variants were cloned and sequenced from Anopheles coluzzii colonies. Each sequenced clone represents a chromosomal haplotype. For each clone, individual sequences were aligned using MUSCLE and Mega was used to construct neighbor joining (N-J) trees for complete sequences for all haplotypes for each enhancer. Trees depict the degree of genetic similarity among alleles, and phylogenetic scale bars represent 0.5 nucleotide substitutions per site. The scale bar for the Rdl tree is long (pairwise distance 0.008 between alleles Fd03_#2 and Fd09_#1), indicating that the Rdl alleles segregate relatively little variation, while the Ovo tree scale bar is short (pairwise distance 0.0445 between alleles Fd03_#8 and Fd05_#4), indicating more than 5-fold greater genetic diversity among Ovo alleles as compared to Rdl. Alignments for complete sequences of alleles are presented in Supplementary File S1.

Genetic alleles of validated enhancers display distinct levels of enhancer activity

Luciferase activity was measured for all alleles to determine the effect of genetic variation on differences in functional enhancer activity. For five of the six enhancers, alleles displayed significantly different levels of enhancer activity (Fig. 4). For a given enhancer, alleles with the greatest difference in activity tended to be the most genetically different from each other (see also Fig. 3). For example, the alleles of the KLF enhancer cloned from colonies Fd05 and Fd03 are the most closely related genetically, and do not display a difference in luciferase activity as compared to the allele from colony Fd09. For two enhancers (LRIM1 and ACE1), at least one genetic variant displayed activity levels that were not significantly different from background, which effectively represents a naturally occurring functionally inactive null enhancer allele. Genetic variation segregating at the enhancer of Ovo did not display a significant influence on luciferase activity, and the Ovo enhancer appears to display the consistently highest luciferase activity over all alleles tested for any of the six enhancers. These results indicate that genetic alleles of validated enhancers can display significantly different levels of functional activity. The large observed differences in enhancer activity are expected to be translated into differential expression of the target genes that are regulated by the polymorphic enhancer alleles, likely leading to phenotypic differences in the organism related to the target gene functions. Enhancer allele outcomes for phenotype will need to be tested in manipulative experiments.

Figure 4
figure 4

Genetic variation influences enhancer functional activity. To test the functional effect of genetic variation within enhancer sequences, the enhancer alleles shown in Fig. 3 were cloned into luciferase reporter plasmid pGL-Gateway-DSCP, and luciferase activity was measured. Statistically significant differences in luciferase activity as determined using a non-parametric ANOVA are indicated by letters, samples labelled with different letters are significantly different from each other and samples with the same letter are not significantly different (thus samples labeled a,b are not statistically different from samples labelled either a or b). Bars indicate the median and 95% confidence intervals, n = 12 for all tests. X-axis labels indicate colony origin (Ng, Ngousso, other colony names as given) and allele name, y-axis indicates the relative luciferase activity for each measurement determined as in Fig. 2.

Enhancer activity is essentially independent of physical orientation

Enhancers tend to function independently of their physical orientation in the genome, which is testable when the candidate is cloned in a luciferase reporter plasmid. For three of the above validated enhancers, we recloned two alleles in both orientations in the reporter and measured luciferase activity. For the KLF and AP enhancers, there was no detectable effect of orientation (Fig. 5). The LRIM1 enhancer displayed a possible weak effect of orientation for allele Fd05_#1 (p = 0.042), although for both orientations of the LRIM1 enhancer the absolute activity values were lower than the other enhancers tested (indicated by y-axis values in Fig. 5), and thus the weak orientation difference for this one weak allele is not robustly supported. Thus, overall the enhancer alleles tested displayed function independent of their physical orientation with respect to the basal promoter.

Figure 5
figure 5

Enhancer activity is essentially independent of orientation. The influence of physical orientation of the enhancer within the luciferase reporter plasmid pGL-Gateway-DSCP was tested by cloning three enhancers, KLF, AP and LRIM1, in both orientations in the plasmid, and luciferase activity was measured. KLF and AP enhancers displayed no detectable effect of orientation on luciferase activity, while LRIM1 displayed a slight difference (p = 0.042) in luciferase activity for the allele Fd05_#1. Statistical differences indicated by letters as in Fig. 4, error bars as in Fig. 4, n = 12 for all tests. X-axis indicates the name of the enhancer allele tested and the enhancer insert orientation (arbitrarily defined as A and B), n indicates the number of wells measured, y-axis indicates the relative luciferase activity for each measurement as in Fig. 2.

Enhancer deletion mutagenesis reveals a modular structure of positive and negative regulatory elements

To resolve the minimal portion of the enhancer interval that carries the enhancer function, we carried out deletion mutagenesis for two different genetic alleles of the LRIM1 enhancer, one allele with high enhancer activity and the other low. The LRIM1 enhancer was used for proof of principle, and was chosen because it had alleles with distinct activity levels, and the genetic variation was spread across the enhancer. The deletion derivatives carried 50% or 25% of the length of the initial enhancer interval, reduced equally from both ends. We tested the deletion clones for luciferase activity, along with the original undeleted enhancer (Fig. 6A). Surprisingly, for LRIM1 allele Fd03_#3, the 50% construct displayed the highest luciferase activity, greater than either 100% or 25% constructs. This indicates that the integral 100% Fd03_#3 allele carries negative regulators of enhancer function, which were deleted in the 50% derivative to yield a derivative with elevated enhancer activity. The 25% derivative of allele Fd03_#3 displays significantly lower activity than the 50% derivative, suggesting that positive regulators of enhancer function are located outside the 25% derivative, but within the 50% derivative.

Figure 6
figure 6

Enhancer deletion mutagenesis reveals positive and negative regulatory elements. Deletion mutagenesis was carried out for two alleles of the LRIM1 enhancer, the high-activity allele Fd03_#3 and low-activity allele Fd05_#1 (Figs 4 and 5). The integral enhancer alleles (100%) were each deleted for one-quarter of their length from both termini (50% derivative), and one-quarter length again (25% derivative). (A) Deletion derivatives were tested for luciferase activity, along with the original integral alleles. Statistical differences indicated by letters as in Fig. 4, error bars as in Fig. 4, n = 12 for all tests. X-axis indicates allele name and deletion derivatives, y-axis indicates the relative luciferase activity for each measurement as in Fig. 2. Enhancer activity is not directly correlated with sequence length, and enhancer alleles are structured from distinct combinations of positive and negative regulators of transcription. (B) Plot of the number of variant nucleotide positions between the Fd03_#3 and Fd05_#1 alleles along the length of the enhancer sequence. Variant sites are counted within a 50 bp non-overlapping window and plotted at the midpoint of the window. The light gray shading indicates the extent of the 50% length derivatives and the dark gray shading the 25% derivatives. X-axis indicates nucleotide position in derivatives, y-axis indicates number of variable sites between the Fd03_#3 and Fd05_#1 alleles in 50 bp windows. There were a total of 60 variable sites between Fd03_#3 and Fd05_#1 alleles in the 100% integral enhancer, 37 variable sites in the 50% derivatives and 22 sites in the 25% derivatives. (C) Neighbor-joining tree depicting sequence relatedness between the integral 100% enhancer and the 50% and 25% derivatives for LRIM1 Fd03_#3 and Fd05_#1 alleles. The Fd09_#3 allele is included as an outgroup. Scale bar description as in Fig. 3.

Deletion derivatives of LRIM1 allele Fd05_#1 display a pattern distinct from the Fd03_#3 allele. For Fd05_#1, each incrementally smaller derivative was more active. This result was also surprising, because it indicates that a highly active core enhancer element within the smallest 25% derivative is attenuated by negative regulators that are progressively removed from 100% to 50% in length, and again from a 50% to 25% length interval. The deletion results indicate that enhancer activity is not directly correlated with sequence length, that there is a complex structure of functional elements and modifiers within the enhancer interval, and that different alleles of the same enhancer are comprised of distinct combinations of modular regulators that differentially influence transcription.

The density of variable sites between Fd03_#3 and Fd05_#1 varies across the interval, such that there were 60 variable nucleotide sites in the integral 100% length alleles, 37 variable sites in the 50% derivatives and 22 sites in the 25% derivatives (Fig. 6B). Finally, it is notable that the 25% derivative for allele Fd05_#1 displays activity levels indistinguishable from the Fd03_#3 50% derivative (p = 0.99), even though they are no more genetically similar than the integral 100% enhancer sequences for both alleles (Fig. 6C). This result highlights the relative independence of enhancer functional level from primary sequence patterns, unlike the fundamental dependence of protein coding gene function on the amino acid primary sequence code, and the consequent requirement for identification of enhancers by detecting functional activity. Testing of a large panel and manipulative experiments would be required to identify consistent patterns of enhancer modular organization.

Enhancer alleles segregate in the natural Anopheles coluzzii population

To confirm that the genetic variation observed in the enhancer alleles was natural and not an artifact of laboratory colonies, we compared sequence data for the six enhancers to genetic variation observed in wild A. coluzzii from whole genome sequence of the Anopheles gambiae 1000 (Ag1000) Genomes Consortium34. The comparison indicates that genetic variation is shared between the cloned A. coluzzii colony haplotypes used in luciferase assays and the natural population (Fig. 7). Representative short sequence alignments are shown, and full-length alignments with larger numbers of wild mosquitoes are presented in Supplementary File S2. Alignments are presented rather than phylogenetic trees because the wild Ag1000 sequences were called for SNPs but not indels. Therefore, all Ag1000 sequences by default share the same indels as the PEST reference, and a tree would artifactually make all wild sequences appear more similar to one another. This analysis demonstrates that genetic variants within confirmed functional enhancers, associated with differential enhancer activity, segregate in nature and do not represent variants unique to lab colonies. Natural segregation of variants associated with differential enhancer function supports the interpretation that the differential function of enhancer alleles (Fig. 4) based on a modular structure of regulatory elements (Fig. 6) likely result from natural selection for distinct phenotypic outcomes of allelic enhancer function.

Figure 7
figure 7

Genetic variants in differentially active enhancer alleles segregate in wild Anopheles coluzzii. Genetic variation observed in colonized and wild A. coluzzii was compared for the six studied enhancers. Representative short sequence alignments are shown (full-length alignments with additional samples in Supplementary File S2). Asterisks above sequence alignments indicate variant positions shared between the cloned A. coluzzii colony haplotypes used in luciferase assays and the natural population. The top sequence row in each alignment is the PEST genome reference sequence, followed by sequences of alleles tested by luciferase assays (boxed by rectangles), followed by sequences of wild A. coluzzii. Ambiguous nucleic acid codes are used for heterozygous sites only in wild samples because the cloned sequences from A. coluzzii colonies are haplotypes, which are unambiguous.

Discussion

The development of a methodology for screening and evaluation of Anopheles enhancers is an initial step towards a more comprehensive study of enhancers and their polymorphism effects. The currently examined enhancers were chosen because they were located near known functional genes, which were used to name the enhancers for convenience, but further work will be required to determine the actual influence, if any, of these enhancers upon the nearby genes, which is not known. Moreover, enhancer function is also controlled on spatial and temporal scales within the organism, and understanding Anopheles enhancers and their effects on phenotypes in detail will ultimately require incorporating this information.

Here we identify and validate candidate enhancer noncoding regulatory elements in the malaria vector, A. coluzzii. We show that naturally segregating genetic variation significantly influences enhancer activity levels. By modifying the level of enhancer activity, genetic variation in enhancers can cause quantitative changes in expression of the target genes regulated by the enhancers, leading to differences in biological function and ultimately mosquito phenotype. Different from mutations in protein coding genes, enhancers are typically located in noncoding DNA, and there is no sequence pattern to aid interpretation of noncoding variants. Most mosquito studies to date have focused solely on genes and proteins, rather than regulatory elements controlling the genes, in part due to the limited information available for the noncoding portion of the genome.

We detected variant alleles with significant difference in their functional enhancer activity, including functional null alleles that lack enhancer activity above background. For example, the LRIM1 enhancer Fd05_#1 allele or the ACE1 Fd03_#1 allele likely represent the ablation of an important transcription factor binding site, resulting in the absence of enhancer activity above background, which likely has downstream functional consequences. The range in functional enhancer activity that we observed is likely to affect phenotypes produced by the genes they transcriptionally regulate. Moreover, we demonstrate that variant alleles tested by luciferase activity in laboratory colonies also segregate in the wild population, and are therefore subject to natural selection. Thus, it is intriguing that selection has apparently generated a wide range of natural allelic forms of enhancers, including alleles that lack functional activity. This is consistent with the observation that genetic variation for enhancer function offers powerful raw material for adaptation and evolutionary change11,20,35,36. The members of the Gambiae species complex, including A. coluzzii, are highly adaptable to a range of ecological conditions, and durable to vector control measures. This is thought to be associated with their high genetic diversity34. The current study is novel in that it addresses standing genetic variation for noncoding regulatory function, which is a poorly understood but likely important contributing factor influencing the success of this mosquito and its relatives.

We functionally dissected two alleles of the LRIM1 enhancer, high and low activity variants, respectively, by deletion mutagenesis. By measuring functional activity of integral, 50% and 25% length derivatives of the intervals, we detected a modular structure of positive and negative regulators within the enhancer. Interestingly, deletion derivatives of the two alleles behaved differently, indicating that the outcome was not a simple consequence of sequence length. The high-activity allele Fd03_#3 appears to carries a negative regulator in the terminal one-quarter of its length on one or both ends, because removal of these sequences led to significantly elevated activity in the remaining 50% derivative as compared to the integral enhancer. However, removal of an additional one-quarter again of the sequence from both ends of the 50% derivative then diminished activity to a level below that of the integral enhancer, suggesting that the positive regulator(s) revealed in the 50% length derivative were no longer present in the 25% length derivative.

That the low activity of the smallest derivative of Fd05_#1 was not a simple consequence of sequence length is made clear by a similar examination of the low-activity allele Fd05_#1. In this case, each incremental length decrease of the tested sequence led to increased enhancer activity. The Fd05_#1 allele result suggests that the integral enhancer displayed low activity because it carried multiple negative regulators, which were removed by each successive deletion, revealing a highly active core enhancer element within the smallest interval tested. This latter minimal derivative of the low-activity Fd05_#1 allele carries an enhancer with, in fact, higher enhancer activity than the integral 100% sequence of the high-activity Fd03_#3 allele.

The LRIM1 deletion mutagenesis results suggest that that large functional allelic diversity can be generated for a given enhancer interval by the combinatorial effect of positive and negative modifiers. Sequence changes in enhancers can generate or remove binding motifs for transcription factors and other regulatory proteins, which can modify transcription levels directly37,38, or indirectly through loss of chromatin accessibility17. From the current results, we do not know whether the specific positive and negative modifiers found within the LRIM1 alleles are reused and combined to fine-tune the activity of different enhancers. If so, such regulator modules should be recognizable with enough genomic and functional data. Finally, the phenotypic implications of enhancer alleles will require determination of the target genes regulated by an enhancer, as well as the protein-DNA interactions underlying differential enhancer allele activity.

Vector control has been central to the malaria control effort by use of indoor residual spraying and long-lasting insecticide impregnated bednets. However, over-reliance on these methods has led to widespread insecticide resistance in wild populations, and novel methods of control are now required. The noncoding regulatory genome in Anopheles has the potential to provide novel new targets for vector control, but until now has not been interpretable or exploitable. For example, noncoding variants genetically associated with phenotypic traits could be interpreted by their functional enhancer activity. Consequently, enhancer variants could serve as markers for traits such as insecticide resistance, parasite susceptibility, behavior or adaptation to ecological conditions. Understanding the genomic enhancer landscape could improve the choice of insertion sites for exogenous transgenes for proper expression, and enhancers themselves could be genetically modified in order to alter transcription of immune or disease important genes with phenotypic consequences. The current work presents a necessary first step towards establishing an efficient, effective method for associating noncoding variation with important mosquito phenotypes.

Methods

Wild mosquito samples and DNA library

Mosquito larvae were collected in Goundry village, Burkina Faso (latitude 12.5166876, longitude −1.3921092) using described methods39, reared to adults, and were typed for species by the SINE200 X6.1 assay40. DNA from 60 A. coluzzii were pooled at equal volume and sheared using an S220 ultrasonicator (Covaris) to produce DNA fragments 800–1000 bp in length. Subsequently, DNA was processed as described for the STARR-seq assay30, cloned into the plasmid pSTARR-seq_fly (AddGene 71499), transformed into MegaX DH10B T1R Electrocomp Cells (Invitrogen), cultured in LB+ ampicillin (1 ug/ml), and plasmid DNA was purified using the Plasmid Plus Mega Kit (Qiagen). The Anopheles gambiae PEST AgamP4 genome assembly available at Vectorbase was used as the reference genome (https://www.vectorbase.org/organisms/anopheles-gambiae/pest/agamp4).

Culture of plasmid library in Anopheles 4a3A cells

Hemocyte-like 4a3A cells41 were maintained on Insect X-Press media (Lonza) supplemented with 10% Fetal Calf Serum (heat inactivated at 56 °C for 30 minutes), at 27 °C. No antibiotics were used. We confirmed that cells were derived from A. coluzzii by species typing using the Fanello assay42. The plasmid DNA library was transfected and cultured in 4a3A cells as described30 using Lipofectamine 3000 Reagent (Invitrogen) and cultured for 24 h, in three biological replicates. RNA was extracted from cells using the RNeasy Midi Kit (Qiagen) followed by mRNA purification using Dynabeads mRNA Purification Kit (ThermoFisher). Plasmid DNA was isolated using the Plasmid Plus Midi or Mini Kit (Qiagen).

Analysis of 4a3A library culture results

The mRNA purified from cells was reverse transcribed using SuperScript III or IV First-Strand cDNA Synthesis System (Invitrogen) as described for the STARR-seq assay30 using a plasmid-specific primer (RT_Rev, Supplementary Table S1), the cDNA was then amplified using primers Report_Fwd and Report_Rev (Supplementary Table S1), and the products were sequenced on an Illumina HiSeq. 2500 in 2 × 125 bp high output mode. Cell plasmid DNA was amplified and sequenced in the same manner as the cDNA samples but using primers Plasmid_Fwd and Plasmid_Rev (Supplementary Table S1).

Selection of enhancer candidates

The Integrative Genomics Viewer (IGV)43 was used to select candidate enhancers by visual examination in the proximity of six annotated genes of interest. For determination of enhancer activity, the RNA output transcribed from the STARR-seq reporter plasmid, converted to cDNA and sequenced as described above, is compared to the levels of the plasmid DNA, to control for differential plasmid replication. Thus, candidate enhancers were predicted in intervals where coverage of the cDNA sequence track was visibly greater than the baseline coverage of the plasmid sequence track. The target genes examined were Krueppel-Like Factor 6/7 (KLF, AGAP007038), Leucine-Rich Immune protein 1 (LRIM1, AGAP006348), Acetylcholinesterase 1 (ACE1, AGAP001356), GABA-gated chloride channel subunit (Rdl, AGAP006028), LIM homeobox protein 2/9, ortholog of Drosophila apterous FBgn0267978 (AP, AGAP008980), and Ovo, AGAP000114. In addition, a negative control interval was cloned, which was a size-matched interval located within intron 1 of the gene, homeobox protein distal-less (DLX, AGAP007058) that displayed no visible divergence of cDNA and plasmid sequence tracks by IGV examination, and thus no predicted enhancer function. In all but one case, the candidate enhancer was the only one in the vicinity of the target gene, for AP there were three peaks, the most gene proximal peak is likely a promoter so the next most proximal peak was chosen as shown in Fig. 1. The candidate enhancers are named according to the most proximal coding sequence (above and Table 1).

Candidate enhancers were amplified from DNA of mosquitoes from the following A. coluzzii colonies: Ngousso, initiated in Cameroon in 200644, Fd03, Mali, 2008, Fd05, Mali 2008, Fd09, Burkina Faso, 2008, and Fd33, Burkina Faso, 2014. Fd colonies were previously described45. Primers are listed in Supplementary Table S2. Amplicons were cloned into the pCR8/GW/TOPO vector (Invitrogen) and sequenced with GW1 and GW2 primers. A standard plasmid was used for luciferase assays that incorporates the engineered Drosophila synthetic core promoter (DSCP). The information about the fine structure of Anopheles core promoters does not yet exist. The results indicate that the DSCP is functional in Anopheles cells. At least two genetically distinct sequences per candidate were then cloned into the firefly luciferase reporter plasmid pGL-Gateway-DSCP (AddGene 71506) using Gateway LR Clonase II (Invitrogen), transformed into OneShot OmniMax 2T1 Phage-Resistant Cells (Invitrogen), and plasmid was purified from overnight culture.

To test the effect of enhancer orientation, the enhancer was cloned in the opposite orientation in pGL-Gateway-DSCP. To test resolved enhancers, the relevant enhancer insert was amplified with primers that generated either 50% or 25% of the initial insert size, equally reduced on both ends, and products were cloned in pGL-Gateway-DSCP. In all cases, plasmids were resequenced to confirm insert identity using the primers LucNrev and RVprimer3 (Supplementary Table S1).

Quantitation and statistical analysis of enhancer activity by luciferase assay

The Dual-Glo Luciferase Assay System (Promega) was used for luciferase assays. A. coluzzii 4a3A cells were seeded in 96 well plates at 1 × 105 cells/well, the difference in volume if any was made up to 65 µl with medium, and cells were incubated for 24 h at 27 °C. Two plasmids were transfected into the 4a3A cells, the enhancer candidate in firefly luciferase vector pGL-Gateway-DSCP, and the renilla luciferase control vector pRL-ubi-63E (AddGene 74280), at a ratio of 1:5 (renilla:firefly), using transfection reagent Lipofectamine 3000 (Invitrogen), and were then incubated for 24 h at 27 °C.

Luciferase activity was detected on a GloMax Discover instrument (Promega) at 25 °C, with two 20 min incubations, one after the addition of Dual-Glo Luciferase reagent (Promega) and another after the addition of Stop & Glo reagent (Promega). All samples were run in 6-fold replication within a single plate and across at least two independent plates, for at least two biological replicates of each candidate, yielding at least 12 measurements. Firefly luciferase measurements expressed in relative light units (RLU) were corrected using the measurements of RLU for the renilla luciferase internal control in the same well. Values for the DLX negative control on the same plate were defined as the background level. Values of firefly/renilla RLU for the experimental insert were normalized to the firefly/renilla mean value for DLX in order to combine results across replicates. Luciferase activity was declared above background if the firefly/renilla RLU ratio for the experimental insert was significantly higher than the firefly/renilla value for the DLX negative control. Luciferase activity was statistically compared using a non-parametric ANOVA (Kruskal-Wallis) with post hoc pairwise comparisons.

Analysis of enhancer allelic variants

The sequences of genetically polymorphic variants of a given enhancer, cloned from A. coluzzii colonies as described above, were analyzed for genetic relatedness. To generate neighbor joining (N-J) trees to depict the relationships between genetic variants for the same enhancer, complete sequences were aligned using MUSCLE within the package Molecular Evolutionary Genetics Analysis Mega version X46, and N-J trees constructed using Mega. When at least four variants were tested, bootstrapping was performed and bootstrap values are included on N-J trees. Scale bars of trees represent 0.5, but each bar is a different length. The longer the 0.5 scale bar, the more genetically similar the sequences. A workflow summary figure is shown in Supplementary Fig. S1.

Analysis of wild Anopheles variation data

Sequence information for 309 wild A. coluzzii from 6 West African countries; Angola (AR), Burkina Faso (AB), Cote d’Ivoire (AY), Ghana (AA), Guinea (AV) and Guinea-Bissau (AJ), generated as reported34 were downloaded from MalariaGen (https://www.malariagen.net/projects/ag1000g) as VCF files from the Ag1000G phase 2 AR1 data release and sequences of the six validated enhancers were extracted from the raw VCF files using GATK version 3.9 (SelectVariants mode)47, the consensus sequences with IUPAC ambiguity codes for the variants were extracted using BCFtools version 1.9 (consensus mode with –iupac-codes option)48, the sequences of each interval were aligned with Clustal W version 2.149 and visualized with AliView aligner version 1.2450.

The diploid sequences from the wild sequences were aligned to the cloned sequences generated from the six validated enhancers. Sequence alignments were visually examined for shared variation. Short representative sequence alignments are presented in Fig. 7 (not including indels), and complete alignments relative to the PEST AgamP3genome assembly, including indels, are presented in Supplementary File S2. Indel genotypes of the wild sequences shown in Supplementary File S2 are relative to the PEST reference haplotype, because the wild sequences were called for SNPs but not indels34.

Ethics statement

No animals or human subjects were used. Mosquito colonies were maintained on anonymous commercial human blood using an artificial membrane feeding device.