Background & Summary

Promoters, as cis-regulatory elements located upstream of genes’ transcription start sites (TSSs), are fundamental in gene regulation1. Over half of all human genes possess multiple promoters, referred to as alternative promoters (APs)2. Therefore, AP events, as a major pre-transcriptional mechanism, contribute to the generation of various 5’ untranslated regions and first exons3, thereby enriching the diversity of mRNA and protein isoforms. Additionally, some studies have demonstrated that the selection of APs can differ across various tissues, developmental stages2,4, and the process of cellular differentiation5. For instance, the selection of APs in CCND1 can change during the development of retinal cells6. Furthermore, increasing evidence also shows that AP events may lead to a range of diseases, especially cancers2. For example, the use of a specific AP in acetyl-CoA synthetase 2 (ACSS2) generates ACSS2-S2, which is associated with amplified ribosome biogenesis in hepatocellular carcinoma (HCC)7. In pan-cancer studies, AP events were also found to display cancer-specific regulation, and AP usage was significantly associated with patient survival outcomes8.

Besides humans, APs also play a vital role in other animals. For instance, distinct Rbfox1 isoforms generated by AP events in the mouse brain serve distinct functions during cortical development9. Furthermore, the study by Damir et al. on cis-regulatory elements in zebrafish revealed that signal transduction-associated genes with APs exhibit vertebrate conservation10. Recently, Alfonso-Gonzalez et al. also found that in Drosophila heads, 3′ end site choice is globally influenced by AP events11. Moreover, AP events in KATNAL1 have been shown to be associated with reproductive traits in bulls12. Overall, AP events in animals are essential in pre-transcriptional regulation, have important biological functions, and are associated with important traits.

Regarding the potential regulators of AP events, it has been shown that AP events could be regulated by cis-acting elements and trans-acting factors. Among them, enhancers, as important cis-acting elements, can form a loop structure with the target promoter and are involved in the recruitment of TFs and cofactors, thus regulating AP events13,14. Additionally, TFs, as important trans-acting factors, can recognize TF motifs in the flanking regions of TSSs and activate or inhibit transcription initiation15,16. Furthermore, DNA methylation, as an important epigenetic modification, is enriched in the promoter region and affects the selection of APs17. For example, in the human mammary gland, the overexpression of the TF Ets-1 activates the AP events of the lactoferrin gene18.

To date, with the development of high-throughput sequencing, several technologies can be used to identify promoters, such as cap analysis of gene expression (CAGE-seq)19, rapid amplification of 5’ complementary DNA ends (5’ RACE) and RNA annotation and mapping of promoters for analysis of gene expression (RAMPAGE)20. These approaches involve elaborate experimental procedures and are not as routinely used as RNA-seq. In contrast, RNA-seq data for diverse organisms, tissues, and cell types are relatively easy to produce and are plentifully available in public repositories. Although detecting alternative promoters from RNA-seq data is less sensitive than these dedicated techniques, the abundance of available data and its cost-effectiveness make RNA-seq a viable approach for investigating AP events genome-wide across multiple tissues and animal species. Hence, several algorithms have been developed to identify alternative promoters from RNA-seq data, such as SEASTAR21, proActiv8 and mountClimber22.

Given the significance of APs, numerous AP events have been detected in multiple human tissues, and relevant datasets have been constructed. For example, Demircioğlu et al. estimated promoter activity using RNA-seq data from 18,468 cancer and normal samples and found that AP events show clear tissue-specific regulation and are associated with patients’ prognosis8. The Eukaryotic Promoter Database (EPD) has collected experimentally validated promoters for model organisms and also includes some alternative promoters. However, EPD does not focus on APs and includes only a limited number of them23. Thus, the landscape of alternative promoters in animals other than humans has not been fully explored, and so far no database provides information on potential regulators of APs in animals.

Moreover, the 6,674 normal human samples in Demircioğlu’s study came from GTEx v7, whereas our study includes the updated GTEx dataset, which contains many more samples. In this study, we systematically characterized the AP profiles in 23,077 samples from 12 animal species, including human, by analyzing RNA-seq data sourced from publicly available databases. These species include chicken (Gallus gallus), cow (Bos taurus), dog (Canis familiaris), frog (Xenopus tropicalis), fruitfly (Drosophila melanogaster), human (Homo sapiens), mouse (Mus musculus), pig (Sus scrofa), rat (Rattus norvegicus), rhesus (Macaca mulatta), worm (Caenorhabditis elegans), and zebrafish (Danio rerio). Then, we analyzed the associations between alternative promoters and different animal traits, such as age and sex, to identify potential trait-related AP events. Moreover, putative AP regulators, including TFs and eRNAs, were identified. Finally, we developed Animal-APdb, a database for browsing, searching, and downloading animal AP-related information.

Methods

Collection and processing of data and identification of AP events

The aligned RNA-seq data of human normal tissues were downloaded from GTEx24 (version 8) (Table 1). Moreover, we downloaded the RNA-seq data from normal tissue samples of other animals by accessing the Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra) of the National Center for Biotechnology Information (NCBI) and EMBL’s European Bioinformatics Institute (EBI)25,26,27 (Table 1). Detailed sample information, including tissue type, age, sex, and developmental stage, was also downloaded and manually curated. The raw SRA files of RNA-seq data were processed as follows. First, they were converted into FASTQ format and subjected to quality control using FastQC (version v0.11.8). Subsequently, data cleaning was performed using Trim Galore, followed by alignment to the respective reference genome with HISAT228. In addition, we calculated gene-level read counts with featureCounts and applied transcripts per million (TPM) normalization to gene expression (Fig. 1a).
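The TPM normalization mentioned above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the gene names, counts, and lengths are made-up toy values, and using the plain gene length rather than an effective length is an assumption.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    counts:     dict gene -> read count (e.g. as produced by a read counter)
    lengths_bp: dict gene -> gene length in base pairs
                (assumption: plain length, not effective length)
    """
    # Reads per kilobase: normalize each count by gene length in kb.
    rpk = {g: counts[g] / (lengths_bp[g] / 1000.0) for g in counts}
    total = sum(rpk.values())
    if total == 0:
        # No reads at all in this sample: every TPM is zero.
        return {g: 0.0 for g in counts}
    # Rescale so that the values of one sample sum to one million.
    return {g: v / total * 1e6 for g, v in rpk.items()}


# Hypothetical toy sample with three genes.
toy_counts = {"geneA": 100, "geneB": 200, "geneC": 0}
toy_lengths = {"geneA": 1000, "geneB": 4000, "geneC": 2000}
toy_tpm = counts_to_tpm(toy_counts, toy_lengths)
```

Because TPM values sum to one million within each sample, they are comparable as proportions across samples with different sequencing depths.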

Table 1 Samples summary in Animal-APdb.
Fig. 1 Flow charts of Animal-APdb. (a) Data collection and processing of Animal-APdb. (b) Main modules of Animal-APdb. (c) Database construction of Animal-APdb.

In total, 23,077 samples across 227 tissues of 12 species were included in Animal-APdb, ranging from 199 samples in zebrafish to 16,563 samples in human (Table 1) and from one tissue in frogs to 48 tissues in human.

Based on the collected RNA-seq data, the R package proActiv8 was utilized to identify possible APs in each sample and quantify promoter activity (Fig. 1a). Briefly, proActiv is an algorithm that estimates promoter activity based on RNA short-read sequencing data by mapping and quantifying first intron junctions of the genome. ProActiv has shown high performance in promoter activity estimates29,30, as well as higher consistency with H3K4me3 histone data compared with other methods8.

Specifically, for a promoter \(p\) in a sample \(s\), we used proActiv to obtain the promoter’s absolute activity \({A}_{p,s}\) and its relative usage \({U}_{p,s}\), defined as the ratio of its activity to the cumulative activity of all promoters of the same gene:

$${U}_{p,s}=\frac{{A}_{p,s}}{\sum_{p^{\prime}\in P^{\prime}}{A}_{p^{\prime},s}}$$

Here, \({U}_{p,s}\) and \({A}_{p,s}\) are the usage and absolute activity of promoter \(p\) in sample \(s\), respectively, and \({P}^{\prime}\) denotes the set of promoters belonging to the same gene as \(p\). Compared with absolute activity, promoter usage better represents how frequently a specific AP is selected and, to some extent, helps minimize batch effects. Hence, we mainly applied promoter usage \({U}_{p,s}\) in this study.
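The usage formula can be made concrete with a small Python sketch. The promoter identifiers and activity values are hypothetical, and returning `None` for genes with zero total activity is an assumption of this sketch rather than documented proActiv behaviour.

```python
def relative_usage(abs_activity):
    """Compute relative promoter usage within one gene in one sample.

    abs_activity: dict promoter_id -> absolute activity A_{p,s} for all
                  promoters of the same gene (the set P').
    Returns dict promoter_id -> U_{p,s}.
    """
    total = sum(abs_activity.values())
    if total == 0:
        # Assumption: usage is undefined when no promoter of the gene is active.
        return {p: None for p in abs_activity}
    # Each promoter's share of the gene's summed promoter activity.
    return {p: a / total for p, a in abs_activity.items()}


# Hypothetical gene with two promoters in a single sample.
toy_gene = {"promoter_1": 6.0, "promoter_2": 2.0}
toy_usage = relative_usage(toy_gene)
```

In this toy case the first promoter accounts for three quarters of the gene's promoter activity, regardless of how highly the gene is expressed overall.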

Identification of tissue-specific AP events

In this study, we identified tissue-specific APs with Demircioğlu’s method8. Tissue-specific alternative promoters were identified by applying a tissue-specific linear model, where each sample was tested for absolute promoter activity and relative usage. A promoter was considered tissue-specific if it met a Benjamini-Hochberg adjusted p-value threshold (≤0.05) for both absolute activity and relative usage, with specific fold-change requirements to distinguish promoter activity from gene expression differences. These criteria ensured that tissue-specific promoter activity was significant, with at least a 2-fold change in activity between the target tissue and others, and minimal changes in overall gene expression.
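As a rough illustration of the significance criteria above, the following Python sketch implements the Benjamini-Hochberg adjustment and a simple threshold check. It does not fit the linear model itself; the function names and toy p-values are hypothetical. The gene-expression criterion is left out because the text gives no numerical threshold for it.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def passes_tissue_specific_filter(padj_activity, padj_usage, activity_fc,
                                  alpha=0.05, min_fc=2.0):
    """Check the thresholds stated in the text for one promoter.

    Assumption: activity_fc is the fold change of promoter activity in the
    target tissue relative to the other tissues. The requirement of minimal
    change in overall gene expression is NOT checked here.
    """
    return (padj_activity <= alpha
            and padj_usage <= alpha
            and activity_fc >= min_fc)


# Hypothetical raw p-values for four promoters.
toy_padj = bh_adjust([0.01, 0.04, 0.03, 0.005])
```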

Identification of trait-related AP events

Human trait data, comprising sex, height, weight and age, were collected from GTEx. Trait data for the other animals, comprising sex, height, weight and developmental stage for each sample in Animal-APdb, were retrieved from SRA. We analyzed the association between the usage of each AP and each trait across diverse tissues.

(1) For the trait of sex, the Mann‒Whitney U test was used to compare AP usage between the male and female groups. Significance was defined as |fold change (FC)| ≥ 1.5 and a false discovery rate (FDR) < 0.05.

(2) For the trait of developmental stage, in human samples, Spearman’s correlation was applied to evaluate the association between AP usage and sample age. Correlations with |Rho| ≥ 0.3 and FDR < 0.05 were considered statistically significant. For other animal samples, tissues were divided into two categories: tissues with both embryo and postnatal samples, and tissues with only embryo or only postnatal samples. For tissues with only embryo or only postnatal samples, Spearman’s correlation was applied, using the developmental index as a numerical variable, to evaluate the association between AP usage and the developmental index. If the developmental index was a dichotomous variable, the difference in AP usage between the two groups was instead evaluated with the Mann‒Whitney U test. For tissues with both embryo and postnatal samples, we first used the Mann‒Whitney U test to detect APs whose usage differed significantly between the embryo and postnatal groups. We then applied the same methods as above to identify development-related APs within the embryo and postnatal samples separately (Fig. 1b).
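The correlation criterion used for age and developmental index can be sketched in plain Python as follows. This is an illustrative implementation of Spearman's rank correlation with average ranks for ties, not the authors' code; the toy ages and usage values are invented, and p-values and FDR correction are deliberately omitted (in practice a statistics library would supply them).

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's Rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (vx * vy) ** 0.5


def meets_rho_threshold(rho, min_abs_rho=0.3):
    """Effect-size part of the criterion only; FDR < 0.05 is not checked."""
    return abs(rho) >= min_abs_rho


# Hypothetical ages and AP usage values for four samples.
toy_age = [20, 35, 50, 65]
toy_ap_usage = [0.1, 0.2, 0.4, 0.8]
toy_rho = spearman_rho(toy_age, toy_ap_usage)
```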

Identification of eRNAs related to AP events

Here, we used enhancer RNAs (eRNAs), non-coding RNA molecules that are transcribed from enhancer loci and whose expression reflects the activity of the corresponding enhancer31, to calculate the associations between enhancer activities and AP events. We downloaded the locus and expression data of eRNAs from Animal-eRNAdb (http://gong_lab.hzau.edu.cn/Animal-eRNAdb/)32. Putative eRNAs presumed to regulate AP events were defined as those located within 1 Mb of the target AP whose expression was significantly associated with the target AP usage (Spearman’s correlation coefficient |Rho| ≥ 0.3 and FDR < 0.05) (Fig. 1b).
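The distance criterion for candidate eRNAs can be illustrated with a short Python sketch. The tuple layouts, identifiers, and coordinates are hypothetical, and measuring the distance from the promoter's TSS to the nearest edge of the eRNA locus is an assumption, since the text does not specify how the 1 Mb window is measured.

```python
WINDOW_BP = 1_000_000  # 1 Mb, as stated in the text


def ernas_near_promoter(promoter, ernas, window=WINDOW_BP):
    """Return IDs of eRNAs within `window` bp of a promoter on the same chromosome.

    promoter: (chrom, tss_position)
    ernas:    list of (erna_id, chrom, start, end)
    """
    chrom, tss = promoter
    hits = []
    for erna_id, e_chrom, start, end in ernas:
        if e_chrom != chrom:
            continue
        if start <= tss <= end:
            dist = 0  # the TSS lies inside the eRNA locus
        else:
            # Assumption: distance to the nearest edge of the locus.
            dist = min(abs(tss - start), abs(tss - end))
        if dist <= window:
            hits.append(erna_id)
    return hits


# Hypothetical promoter and eRNA loci.
toy_promoter = ("chr1", 5_000_000)
toy_ernas = [
    ("eRNA_1", "chr1", 4_200_000, 4_201_000),  # about 0.8 Mb away
    ("eRNA_2", "chr1", 7_000_000, 7_001_000),  # 2 Mb away
    ("eRNA_3", "chr2", 5_000_000, 5_000_500),  # different chromosome
]
toy_candidates = ernas_near_promoter(toy_promoter, toy_ernas)
```

Candidates selected this way would then still need to pass the correlation criterion before being reported as putative regulators.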

We identified a total of 19,813 AP events related to 63,854 eRNAs (ranging from 304 AP events related to 380 eRNAs in worm to 9,774 AP events related to 31,671 eRNAs in mouse). More detailed information is presented in Table 2.

Table 2 Data summary of related APs.

Identification of TFs related to AP events

TFs can recognize their corresponding motifs in the flanking region of the TSS and activate or inhibit transcription initiation. To obtain TFs related to AP events, annotations of TFs were retrieved from AnimalTFDB (http://bioinfo.life.hust.edu.cn/AnimalTFDB4/#/)33, and the known TF motifs were collected from JASPAR (https://jaspar.genereg.net/)34. Combined with gene expression data, we identified candidate TFs related to AP events according to two major criteria: 1) TF expression was significantly associated with AP usage and 2) the TF might bind the flanking region of the TSS (from 2,000 bp upstream to 500 bp downstream of the TSS). Specifically, for the first criterion, we required an average TF expression of TPM > 5 in each tissue and a significant association between TF expression and AP usage (Spearman’s correlation coefficient |Rho| ≥ 0.3 and FDR < 0.05). For the second criterion, two methods were used to assess whether a specific TF could bind the flanking region of the TSS. The first used FIMO35 to scan for TF binding site (TFBS) motifs in the vicinity of each AP. The second overlapped uniformly processed ChIP-seq data of specific TFs with the flanking region of the TSS. A total of 9,675 uniformly processed ChIP-seq datasets from 32 tissues of 6 species were collected from ChIP-Atlas36. Finally, the results were combined into the database.
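The flanking region used for the binding criterion can be sketched in Python as follows. The function names and coordinates are hypothetical, and the strand-aware handling and inclusive coordinates are assumptions of this sketch, since the text does not describe how the window was computed.

```python
def tss_flank(tss, strand, upstream=2000, downstream=500):
    """Return (start, end) of the window from 2,000 bp upstream to 500 bp
    downstream of a TSS.

    Assumption: inclusive coordinates, with "upstream" interpreted relative
    to the direction of transcription.
    """
    if strand == "+":
        return max(1, tss - upstream), tss + downstream
    if strand == "-":
        # On the minus strand, upstream lies at higher coordinates.
        return max(1, tss - downstream), tss + upstream
    raise ValueError("strand must be '+' or '-'")


def overlaps(region, peak):
    """True if two inclusive (start, end) intervals share at least one base."""
    return region[0] <= peak[1] and peak[0] <= region[1]


# Hypothetical TSS on the plus strand and two hypothetical ChIP-seq peaks.
toy_region = tss_flank(10_000, "+")
toy_peak_hit = (10_400, 10_800)
toy_peak_miss = (10_600, 10_800)
```

A TF would be retained under this criterion if one of its peaks, or a motif match reported by a scanner, falls within the window of the promoter in question.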

Database framework

All data mentioned above were stored in the MongoDB database (version 3.6.8). The Animal-APdb website was built based on the Flask (version 1.0.3) framework with AngularJS (version 1.6.1) and Bootstrap, hosted on the Apache 2 webserver (version 2.4.18). In addition, ECharts and R are employed for database visualization. Animal-APdb is freely available online without registration or login for access (Fig. 1c).

Data records

These datasets are available on Figshare37, Zenodo38, and the Animal-APdb download page (http://gong_lab.hzau.edu.cn/Animal_AP#!/download). Each module file for each species is provided in ‘.tsv’ format. Files on AP usage offer detailed information about APs across multiple tissues for specific species. Trait-related AP files provide data on the correlation between APs and various traits across tissues. Regulator files include detailed information on eRNAs and TFs potentially involved in AP selection.

Technical Validation

All results mentioned above have been integrated into Animal-APdb. A summary of data entry can be found in Fig. 2 and Table 2.

Fig. 2 Data summary and technical validation of Animal-APdb. (a) The number of APs identified for each species in Animal-APdb. (b) The number of tissue-specific APs identified for each species in Animal-APdb. (c) The total number of AP genes annotated in Animal-APdb compared to those annotated in EPD. (d) Comparison of human AP genes annotated by EPD, proActiv, and Animal-APdb. (e) Distribution of distances between APs for genes annotated exclusively in EPD and those annotated in both EPD and proActiv.

Data summary of Animal-APdb

As shown in Fig. 2a, a total of 102,349 AP events were identified in these species, ranging from 1,346 in worm to 38,849 in human. The expression of many AP events varies considerably across tissues, which corroborates previous research2. Notably, the number of AP events in each species is related to the number of samples, genome complexity and the number of tissue types. Moreover, a total of 2,523 tissue-specific AP events were identified in species with two or more tissues, ranging from 34 in fruitfly to 884 in chicken (Fig. 2b).

A total of 13,340 trait-related AP events in all species (ranging from 5 in zebrafish to 6,687 in mouse) were identified. More detailed information is presented in Table 2.

We identified a total of 19,813 AP events related to 63,854 eRNAs in 8 species (ranging from 304 AP events related to 380 eRNAs in worm to 9,774 AP events related to 31,671 eRNAs in mouse). Moreover, a total of 75,195 AP events associated with 4,573 TFs were identified in all 12 species (from 408 AP events associated with 54 TFs in worm to 29,412 AP events associated with 572 TFs in human). More detailed information is presented in Table 2.

Technical validation process of Animal-APdb

To ensure the quality and validity of the data in Animal-APdb, several rigorous steps were implemented during curation. First, the meta-information for all species was manually curated from the NCBI SRA database and GTEx to guarantee accuracy and reliability. To address potential batch effects between RNA-seq data from different BioProjects, BioProjects with insufficient data were excluded, thereby maintaining the integrity and consistency of the dataset. During RNA-seq processing, stringent quality control measures were applied to remove samples with poor sequencing quality. Filtering and alignment procedures were meticulously carried out to retain only high-quality data for downstream analyses.

Second, the R package proActiv was employed to identify alternative promoters and estimate their activities. The reliability of proActiv in estimating promoter activities has been validated using H3K4me3 histone modification data, CAGE-seq data, and Iso-seq data29. To ensure biological relevance, promoters with low activity, which are unlikely to have significant functional implications, were excluded from certain tissues and species. These steps collectively contribute to a robust and high-quality dataset that underpins the Animal-APdb resource. The annotation quality of APs in Animal-APdb was validated by comparing it with experimentally verified promoters in the EPD database. For most species, Animal-APdb contains a much greater number of genes with APs compared to EPD (Fig. 2c). However, it is important to note that some discrepancies arise due to differences in the reference genome versions used by EPD and Animal-APdb, which could affect the results for certain species.

To further investigate the representation of EPD-annotated genes with APs in Animal-APdb, we analyzed human as an example (Fig. 2d). Among the 8,361 genes with APs annotated in EPD, 6,994 were also identified by proActiv. This substantial overlap highlights the consistency between the two methods when applied to the same reference genome. However, 1,367 AP genes annotated in EPD were not detected by proActiv. This discrepancy arises because proActiv categorizes transcripts with identical or closely located TSSs as being regulated by the same promoter. Supporting this, the distances between APs for genes annotated by both EPD and proActiv were significantly greater than those for genes annotated only by EPD (Fig. 2e). In addition, 4,501 AP genes were excluded due to low promoter activity, reflecting the stringency of the activity-based filtering process. In contrast, the number of EPD-validated AP genes was reduced by only 1,797 in Animal-APdb. These results highlight the efficiency and necessity of the activity-based filtering process.

Usage Notes

The Animal-APdb provides a user-friendly web interface. It contains four main modules: ‘AP events’, ‘Trait’, ‘eRNA’, and ‘Transcription Factor’ for data searching, browsing, and visualization. To maximize the utility of this resource, users can query genes of interest to identify the presence of alternative promoters in specific species and tissues. This capability enables further investigation into how APs influence associated traits and the factors regulating the selection of APs.

Additionally, the database facilitates advanced data mining by integrating information across multiple species. This integration allows researchers to explore the relationship between AP usage and species evolution, shedding light on how promoter variation may have evolved in different species. Furthermore, the inclusion of multi-omics data enables the identification of regulatory factors that drive AP usage in key genes across species, which offers a powerful framework for dissecting gene regulatory networks.