Background & Summary

The landscape of the northern Andes in Colombia is shaped by three mountain ranges (Western, Central, and Eastern), forming two extensive inter-Andean valleys drained by the Magdalena and Cauca rivers1. These valleys collectively constitute the Magdalena River basin (MRB), which is the largest trans-Andean basin. These ranges played a crucial role in determining the evolution of a unique species assemblage, constituting the ichthyofauna with the highest proportion of endemic species in Colombia2,3. This region encompasses a diverse array of aquatic environments along latitudinal and altitudinal gradients, forming a mosaic of habitats for a wide diversity of fish species, some of which are particularly significant in local fisheries4,5,6. The fish species in this region display a broad repertoire of adaptations to cope with vastly contrasting habitat conditions7. These adaptations are frequently found in specialised species, narrowly distributed in very specific habitats, unique to this mountainous region, where the level of endemism is notably high8. However, our understanding of the diversity and distribution of fish species does not progress as rapidly as the anthropogenic transformations occurring in these aquatic environments9. Therefore, developing data tools that enhance the taxonomic identification of fish species is essential to guide conservation decisions in the MRB. The lack of such information could result in the exclusion of considerations centred on the fish fauna, in development projects, with potentially severe consequences10.

Scientific research on fishes from the MRB was initiated by Humboldt (1805), who provided taxonomic descriptions of several endemic species. Taxonomic information was subsequently generated by foreign ichthyologists11, mostly naturalists from museums established in large European cities12,13,14,15,16,17. Notably, the contributions of the German-American ichthyologist Carl H. Eigenmann were based on an ambitious ichthyological reconnaissance of the territory that stands out as a landmark in the study of Colombian fishes18,19,20,21,22. Recent studies conducted by local researchers are expanding the taxonomic knowledge of the fish species from the Colombian Andes23,24,25. Historically, fish species have been identified using morphological traits. However, challenges arise when dealing with cryptic or morphologically similar species, particularly within genera of complex taxonomy such as Astroblepus, Astyanax, Characidium, Chaetostoma, Hemibrycon, Trichomycterus, among others26,27. Given these challenges, identifications based solely on morphology have required remarkable expertise in specific taxa to minimise the risk of identification errors. Therefore, new technologies and diagnostic tools such as digital image processing and pattern recognition techniques, characterization of acoustic or electric organ signals, and molecular methods are necessary28,29,30.

In recent decades, molecular methods have played a pivotal role in the discovery of hidden diversity, as identification becomes independent of the taxonomic expertise of the researcher31 or the life stage of the species32. These advancements have facilitated the integration of diverse fields, spanning taxonomy, conservation, invasive species detection, and biogeography33,34,35. Notably, the widespread application of advanced DNA sequencing technologies has allowed detailed analyses of genes and complete genomic sequences36,37,38,39. One significant breakthrough is the adoption of the metabarcoding approach, enabling the simultaneous identification of multiple samples40,41. This approach has extended its applications to studying entire ecological communities through environmental DNA sampling42,43,44. However, the efficacy of these procedures hinges on the availability and accuracy of DNA reference databases, some of which are public and crucial for bioinformatic applications (e.g. the Barcode of Life Data System, BOLD: https://www.boldsystems.org/, and GenBank: https://www.ncbi.nlm.nih.gov). The use of these reference databases has increased substantially worldwide in recent years45,46,47. These databases are linked to specimens catalogued in natural history museums, allowing reexamination to verify or revise taxonomic identifications. For fishes in the MRB, there are existing public DNA databases of the mitochondrial Cytochrome c Oxidase subunit I (COX1)48, in both the BOLD and GenBank platforms26,27,31, as well as complete mitochondrial genomes49. These resources contribute significantly to the understanding of fish species diversity and enhance the accuracy and reliability of species identification.

The biodiversity of the Magdalena basin is largely reflected and preserved in the country’s valuable biological collections. Despite the incomplete taxonomic representation of fish diversity in biological collections, in part due to the continuous description of new taxa, the Collection of Ichthyology of the University of Antioquia (CIUA) stands out for its remarkable taxonomic and geographical representation of MRB fishes (208 species and 815 sites), which is crucial for precise taxonomic identification, effective conservation, and management (Fig. S1). Valuing Colombia’s natural and cultural heritage, the CIUA’s extensive collection not only serves as a repository for biodiversity documentation, but also plays a role in advancing scientific research, education, and public outreach initiatives related to MRB fishes. To disentangle the evolution and taxonomy of complex fish groups such as those mentioned above, the University of Antioquia and the hydropower company Empresas Públicas de Medellín (EPM) began a collaboration in 2017 to establish a genetic reference database of the fishes preserved in the CIUA collection. This database has been built from specimens collected during environmental monitoring in the areas of influence of EPM’s hydropower plants and in expeditions intended to find representative topotypes of endemic species.

Our genetic database was built with specimens from 297 localities, encompassing a wide array of aquatic environments from the Atrato, Catatumbo, Cauca, Dagua, and a significant representation from the MRB (Fig. 1). The sampling conducted in these basins holds fundamental significance in comparing the taxonomic identification in use within the MRB. A total of 1,270 specimens were taxonomically classified, and preserved according to the procedures described in the methods section. Tissue samples were collected, and their DNA was extracted and sequenced for three mitochondrial regions: COX1, 12S ribosomal RNA, and 16S ribosomal RNA large subunit (Table 1). The specimens were catalogued with extensive metadata (Supplementary Table 1), resulting in a DNA database representing at least 183 species, 90 genera, and 36 families of fishes (Fig. 2). Additionally, the collection includes photographs corresponding to 168 species (Fig. 3). Species identifications were rigorously verified, based on phenotypic and genotypic data (see Methods and Technical Validation sections).

Fig. 1

Sampling locations included in the fish collection CIUA of the Magdalena, Atrato, Catatumbo, and Dagua river basins.

Table 1 Summary of the dataset for the three sequenced mitochondrial genes obtained from 1,270 specimens of the Magdalena fish collection, alongside species counts from the Atrato, Catatumbo, and Dagua river basins.
Fig. 2

Species sequenced for three mitochondrial regions (COX1, 12S rRNA, and 16S rRNA), focusing particularly on the Magdalena River basin, and including species counts from other river basins such as the Atrato, Catatumbo, and Dagua.

Fig. 3

Live photographs of endemic specimens, primarily featuring topotype specimens from the Magdalena River basin.

The data obtained have already led to the publication of taxonomic descriptions of new species50. Topotypes, accounting for 39% of the species in our dataset (Fig. 3), were crucial to ensure a reliable source of taxonomic validation. The dataset includes DNA sequences from 54 species not previously available in GenBank or BOLD for any locus, along with additional species representing the first publicly available COX1, 12S, and 16S sequences for that taxon (Table 1). Moreover, we have initiated the construction of a dataset of mitochondrial genomes with results for 10 endemic fish species, some of which are important for fisheries or are currently classified as endangered species (Fig. 4). This comprehensive dataset contributes to advancing our understanding of the genetic diversity of fish species in the region. Of the fish species in this dataset, 27.88% have been classified according to their conservation threat status (51 species), 66.67% are labelled as data deficient (122 species), and the remaining 5.46% have not yet been evaluated (10 species) (Fig. 5). The list of species and their IUCN status is available in Supplementary Table 1. These species are locally endemic, have low abundance (estimated by their capture frequency), or correspond to recently described species or to species from undersampled aquatic environments. Consequently, the lack of information on distribution, ecology, population trends, and threats hinders conservation strategies. However, it is hoped that this database will contribute to providing the necessary tools for more informed assessments within the International Union for Conservation of Nature (IUCN; https://www.iucnredlist.org/), and to granting legal status for conservation policies. In addition, 11 species of introduced (non-native) fishes that have been identified as invasive are incorporated into the database (details on these species can be found in Supplementary Table 1). This information is essential for the recognition and monitoring of invasive species by DNA metabarcoding, a technique that facilitates early detection and large-scale monitoring of these habitats.
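
The conservation-status shares quoted above can be recomputed from the species counts alone. The following is a minimal Python sketch, assuming only the three counts given in the text (51, 122, and 10 species); the group labels and the function name are illustrative and not part of the dataset. Recomputed this way, the first share comes out as 27.87%, one hundredth below the quoted 27.88%, presumably a rounding difference.

```python
# Species counts per conservation-status group, as reported in the text.
# The dictionary keys are illustrative labels, not field names from the dataset.
status_counts = {
    "threat category assigned": 51,
    "data deficient": 122,
    "not evaluated": 10,
}

# The three groups together should account for every species in the dataset.
total_species = sum(status_counts.values())


def percentages(counts):
    """Return each group's share of the total in percent, rounded to two decimals."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 2) for group, n in counts.items()}


shares = percentages(status_counts)
for group, share in shares.items():
    print(f"{group}: {status_counts[group]} species ({share}%)")
print(f"total: {total_species} species")
```

The total of 183 matches the minimum species count reported for the database, which is a quick consistency check on the three groups.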

Fig. 4

Mitochondrial genome maps of 10 fish species from the Magdalena River basin.

Fig. 5

(a) Percentage of fish species by IUCN Red List of Threatened Species threat category. (b) Chord diagram showing the number of species by taxonomic order with DNA sequences concerning threat categories.

This comprehensive database encompasses approximately 65% (153 species) of the estimated fish diversity in the MRB51. CIUA is actively contributing to the expansion of this coverage by publishing new DNA sequences on public platforms, a project outlined in detail on GenBank and CavFish (https://cavfish.unibague.edu.co/) (details on these species can be found in Supplementary Table 1). This ongoing effort aims to continually increase the representation of fish species in the region, enhancing the utility and completeness of the database. This collection of information serves as an indispensable reference for the knowledge of Colombian fishes, providing valuable resources for a diverse community of users with varied interests. This resource not only contributes to ongoing research but also lays the groundwork for future investigations. Moreover, it will provide key elements for taxonomic research to uncover undescribed species in multiple lineages, complemented by comprehensive collection data and detailed photographic catalogues of the voucher specimens of the genetic sequences constituting the database. Simultaneously, this database plays a crucial role in refining the precision of results in the growing number of studies employing DNA metabarcoding. As molecular methods become increasingly prevalent in ecological studies, the comprehensive and accurate data provided by our collection of information will contribute to more robust and reliable outcomes in research focusing on the intricate dynamics of the fish species from one of the most biodiverse regions in the world.

Methods

This study was conducted with the recommendations and approval of the Ethics Committee on Animal Testing of the University of Antioquia (CEEA). The protocol was reviewed and approved on November 14, 2017 by CEEA, and updated on February 9, 2021. Specimen collection was endorsed by the Ministry of Environment of Colombia through the Non-Commercial Scientific Research Permit granted to University of Antioquia (Resolution 0524 of May 27, 2014).

Sampling design and specimen collection

The database includes fishes captured from 2010 through 2023 (Supplementary Table 1). These captures were made in diverse aquatic ecosystems distributed between 0 and 3,500 metres above sea level (Supplementary Table 1), including rivers, streams, creeks, floodplain lakes, and Andean reservoirs. Geographic coordinates were recorded at each sampling location using a satellite geopositioner (GPS), calibrated to the WGS84 datum. Because catch methods are selective for different fish species and body sizes, we standardised the catch effort for each environment (flowing channels, floodplain lakes, and reservoirs). In flowing channels (creeks, streams, and rivers), the catch effort was 30 sets with each of the three cast nets (with different mesh sizes: 0.5, 1.5, and 3.5 cm), plus 60 min of sweeps with portable electric fishing equipment using pulsed direct current (340 V, 1–2 A) along 100 m of the flowing channel. In the littoral zones of floodplain lakes and reservoirs, sampling was carried out by deploying two gill nets, each measuring 100 m long and 3 m high, during six-hour periods when possible. This time was adjusted according to constraints due to access and security regulations in the reservoirs, or the maximum time allowed by local communities in floodplain lakes. Each gill net featured ten different mesh sizes (1–10 cm between opposite knots) to improve the chance of capturing a wide spectrum of fish species and body sizes.

Captured specimens were immersed in a light anaesthesia bath of eugenol (12.5 mg/l), prepared from a stock solution of 1:9 (eugenol:ethanol), to reduce stress and mortality associated with handling52. Subsequently, at least one specimen per species at each locality was selected for live photographic documentation using the Photafish System53. Photographs were taken by J. L. Londoño-López, J. E. García-Melo, and J. G. Ospina-Pabón using SONY Alpha 6000, SONY Alpha 7III, SONY Alpha 7RIV, and Nikon D5500 camera bodies, with 90 mm and 60 mm macro lenses, under flash lighting. The specimens were then euthanised with a double dose of eugenol until swimming activity and respiration ceased. Tissue samples from each specimen were extracted and preserved in 96% ethanol, and whole voucher specimens were fixed in 10% formalin. Voucher specimens were subsequently transferred to 75% ethanol for long-term preservation and catalogued at CIUA. Associated tissues and DNA extractions were stored at −82 °C and −20 °C, respectively, in CIUA biorepository freezers.

Taxonomic identification and validation

Field taxonomic identifications were performed on fresh specimens and then confirmed in the laboratory using the same preserved specimens. This process involved regional or taxon-specific keys complemented by systematic and taxonomic reviews at the family or genus level (when available), as well as original species descriptions and redescriptions. For some taxonomic groups, these taxonomic identifications were validated by specialists: H. D. Agudelo-Zamora (Characidium), J. G. Albornoz-Garzón (Stevardiinae), T. P. Carvalho (Bunocephalus), M. C. Castellanos-Mejía (Rineloricaria), C. DoNascimiento (Heptapteridae, Pimelodidae, Trichomycteridae), C. A. García-Alzate (Hyphessobrycon), M. A. Hernández-Cortés (Heptapteridae), F. C. T. Lima (Bryconidae), N. K. Lujan (Hypostominae), V. M. Medina-Ríos (Trichomycterus), A. Méndez-López (Bryconidae), J. G. Ospina-Pabón (Astroblepidae), A. T. Thomaz (Argopleura). Ordinal classification and valid names adhere to Near and Thacker54 and DoNascimiento et al.51, respectively.

Taxonomic identification of voucher specimens was confirmed by comparing newly generated sequences with those available in public repositories (GenBank), as detailed in the Technical Validation section below. Genomic DNA was extracted from tissue samples using the DNeasy Blood and Tissue kit (Qiagen) and the GeneJET Genomic DNA Purification kit (Thermo Scientific), following the manufacturers’ protocols. DNA samples were amplified by PCR (Table 2). PCR reactions were conducted in a 30 µl reaction volume, consisting of 0.6 µl of each primer (10 mM), 0.6 µl of dNTPs (10 mM), 3 µl of reaction buffer (10x), 0.3 µl of OneTaq® DNA polymerase (5U/µl), and 3 µl of genomic DNA. The concentration of MgCl2 varied according to the molecular marker (Table 2). The PCR temperature cycle consisted of an initial 5 min step at 95 °C, followed by 35 cycles of 45 s at 94 °C, 1 min for primer hybridization (temperatures specified in Table 2), 1 min at 72 °C, and a final extension phase of 10 min at 72 °C.
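
The reaction set-up and thermal profile described above lend themselves to a quick arithmetic check. The following Python sketch uses only the volumes and step durations given in the text. It assumes that the remainder of the 30 µl reaction is made up with water, which the text does not state, and the MgCl2 volume in the example call is a hypothetical placeholder, since the real values are marker-specific (Table 2). Function and variable names are illustrative.

```python
# Component volumes (µl) per 30 µl reaction, as listed in the text.
REACTION_VOLUME_UL = 30.0
FIXED_COMPONENTS_UL = {
    "forward primer": 0.6,
    "reverse primer": 0.6,
    "dNTPs": 0.6,
    "10x reaction buffer": 3.0,
    "OneTaq DNA polymerase": 0.3,
    "genomic DNA": 3.0,
}


def water_volume_ul(mgcl2_ul):
    """Volume of water needed to reach 30 µl (assumption: water fills the remainder)."""
    used = sum(FIXED_COMPONENTS_UL.values()) + mgcl2_ul
    if used > REACTION_VOLUME_UL:
        raise ValueError("components exceed the reaction volume")
    return round(REACTION_VOLUME_UL - used, 2)


def cycling_time_min(cycles=35, initial_s=300, denature_s=45,
                     anneal_s=60, extend_s=60, final_s=600):
    """Programmed hold time of the PCR profile in minutes (ramp times excluded)."""
    return (initial_s + cycles * (denature_s + anneal_s + extend_s) + final_s) / 60


# 1.5 µl of MgCl2 is a hypothetical value used only to exercise the function.
print(water_volume_ul(mgcl2_ul=1.5))
print(cycling_time_min())
```

With the default arguments, the programmed hold time is 111.25 min, that is, a little under two hours per run before instrument ramping is taken into account.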

Table 2 Primers and PCR conditions used to amplify targeted gene regions.

Data Records

The MRB fish database encompasses a comprehensive set of information resources, including: (1) preserved specimens catalogued and deposited in the CIUA55, (2) DNA and tissue banks obtained from voucher specimens catalogued at CIUA (details on these species can be found in Supplementary Table 1), (3) a full-resolution photographic collection of voucher specimens catalogued at CIUA and photographed while still alive, from which a selection of representative photographs (at lower image resolution for optimal web-page visualisation) of each species from each collecting site (usually corresponding also to voucher specimens of published genetic sequences) is available at the CavFish project website (https://cavfish.unibague.edu.co/) (see Supplementary Table 1), and (4) an edited database of genetic sequences corresponding to three mitochondrial loci (COX1, 12S, and 16S) along with complete mitochondrial genomes publicly available in GenBank56. These information resources ensure wide and easy accessibility for research and educational purposes, promoting an open-data approach for scientific and conservation collaborations, and further exploration and knowledge construction of the ichthyofauna diversity from Colombia and the Neotropics.

Technical Validation

Consensus sequences were assembled and edited using Geneious Prime v. 2023.1.2 (http://www.geneious.com) and aligned using the G-INS-i plugin in MAFFT v.7, with default parameters57. Subsequently, Maximum Likelihood (ML) phylogenies were individually inferred for each marker in the IQTREE 2 program58. Phylogenetic results were scrutinised to validate sequence identity, primarily based on adherence to well-corroborated monophyletic groups or anomalous phylogenetic placement of individual sequences. In instances where sequences of a single taxon are grouped together, additional validation was conducted using BLAST to ensure high correspondence (>98%) with the taxonomy inferred from morphological traits. For sequences clustered with unexpected lineages, corresponding vouchers were reexamined to confirm their taxonomic identification. These sequences were then cross-checked with BLAST to verify their actual identification. While BLAST facilitated confirmation of the genus, species identification primarily relied on morphological characteristics. This validation process aimed to ensure data accuracy and rectify any potential errors in specimen cataloguing prior to sequencing. It is important to highlight that some sequences matched at the genus level but not at the species level due to the absence of previously deposited sequences in genetic repositories for certain species (e.g., Astroblepus, Hemibrycon, Chaetostoma). This lack of representation underscores the significance of our work, emphasizing the need for further exploration and documentation of these taxa in genetic databases. Sequences falling into this category are classified as either “newly sequenced” or “newly sequenced for each molecular marker”, depending on whether they have been sequenced for one or more loci previously (Table 1).
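
The BLAST cross-check described above can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure actually used: it assumes BLAST results exported in the standard 12-column tabular format (`-outfmt 6`), uses the >98% identity criterion from the text, and runs on invented rows whose query and subject identifiers are hypothetical. Queries whose best hit does not exceed the threshold are only flagged for reexamination of the voucher, since a low identity may simply reflect a species that is absent from the reference repositories.

```python
import csv
import io

# Invented rows in BLAST tabular format (-outfmt 6): qseqid, sseqid, pident,
# length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
# All identifiers below are hypothetical.
EXAMPLE_HITS = (
    "CIUA_0001\tsubject_A\t99.54\t652\t3\t0\t1\t652\t1\t652\t0.0\t1188\n"
    "CIUA_0001\tsubject_B\t91.10\t652\t58\t0\t1\t652\t1\t652\t0.0\t870\n"
    "CIUA_0002\tsubject_C\t93.87\t650\t40\t0\t1\t650\t1\t650\t0.0\t940\n"
)


def best_hits(tabular_text):
    """Return the highest-identity hit per query as {query: (subject, pident)}."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row:
            continue
        query, subject, pident = row[0], row[1], float(row[2])
        if query not in best or pident > best[query][1]:
            best[query] = (subject, pident)
    return best


def flag_for_reexamination(tabular_text, threshold=98.0):
    """List queries whose best hit does not exceed the identity threshold."""
    hits = best_hits(tabular_text)
    return sorted(q for q, (_, pident) in hits.items() if pident <= threshold)


print(best_hits(EXAMPLE_HITS))
print(flag_for_reexamination(EXAMPLE_HITS))
```

In this toy input, only the second query is flagged, because its best hit (93.87% identity) falls below the threshold.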

Mitochondrial genomes were sequenced using the Illumina platform (Illumina Inc., San Diego, CA, USA). The DNA library preparation and sequencing were conducted by Macrogen Company, South Korea (https://dna.macrogen.com). Sequencing libraries were prepared using the TruSeq Nano 350 bp DNA kit, following the manufacturer’s protocol. Raw reads underwent quality filtering using Cutadapt v3.5.8 (ref. 59), while medium-depth analysis and detection of alternative alleles were carried out using Bowtie2 v2.4.4 (ref. 60) and SAMtools v1.14 (ref. 61). Filtered genomic reads were utilised for de novo assembly of the mitochondrial genome using SPAdes v3.15.3 (ref. 62). Graphical annotation and analysis were performed using Mitofish v3.85 (ref. 63) and Proksee (ref. 64).
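
The read-processing steps named above could be chained roughly as follows. This Python sketch only builds and prints the command lines; it does not run them. The tools are those named in the text, but every parameter value (quality cutoff, minimum length, thread count), every file name, and the choice to map the filtered reads back to the assembled contigs for depth estimation are assumptions made for illustration, as the text does not report them.

```python
import shlex


def mitogenome_commands(sample, threads=4):
    """Build illustrative command lines for one paired-end sample (all values assumed)."""
    r1, r2 = f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"
    t1, t2 = f"{sample}_trim_R1.fastq.gz", f"{sample}_trim_R2.fastq.gz"
    assembly_dir = f"{sample}_spades"
    contigs = f"{assembly_dir}/contigs.fasta"  # default SPAdes contig file name
    index = f"{sample}_index"
    sam, bam = f"{sample}.sam", f"{sample}.sorted.bam"
    return [
        # Quality trimming and minimum-length filtering of paired reads.
        ["cutadapt", "-q", "20", "-m", "50", "-o", t1, "-p", t2, r1, r2],
        # De novo assembly from the filtered reads.
        ["spades.py", "-1", t1, "-2", t2, "-t", str(threads), "-o", assembly_dir],
        # Index the contigs and map the filtered reads back to them (assumed design).
        ["bowtie2-build", contigs, index],
        ["bowtie2", "-p", str(threads), "-x", index, "-1", t1, "-2", t2, "-S", sam],
        # Sort the alignments and report per-position depth, including zero-depth sites.
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "depth", "-a", bam],
    ]


for command in mitogenome_commands("sample01"):
    print(shlex.join(command))
```

Keeping each command as a list of arguments makes it straightforward to pass the same entries to `subprocess.run` once the tools are installed and the file names are adapted.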

Usage Notes

The MRB fish database, an open-access electronic resource, compiles data on cryptic species, particularly those within genera of intricate taxonomy. This database facilitates the review of complex taxa facing taxonomic uncertainties. In addition, it incorporates records of 210 specimens from 18 genera, including Astroblepus, Astyanax, Chaetostoma, Characidium, Creagrutus, Cynodonichthys, Eigenmannia, Hemibrycon, Hypostomus, Knodus, Lasiancistrus, Lycengraulis, Parodon, Poecilia, Pseudopimelodus, Rineloricaria, Sturisomatichthys, and Trichomycterus. However, due to limitations in the taxonomic information (phenotypic and genotypic) currently available, the taxonomic status of these specimens remains unresolved (Supplementary Table 1). As our understanding of fish taxonomy for the MRB increases, additional sequences from these collections will be verified and incorporated into the GenBank project.

We have made an important correction concerning the names of primers L14841 (ref. 65) and H15915 (ref. 66), which are used to amplify Cytb. For a long time, these primer names have been incorrectly applied to other primers that amplify the 12S gene26,67,68. The correct designations of the latter primers are L941-PHE and H2010-VAL (see Table 2), as referenced by Sato69. This clarification is essential to interpret and apply our sequencing results properly.