Introduction

Size plays a pivotal role in biology due to its profound impact on the functioning of life1. For centuries, scientists have been fascinated by the causes and consequences of size-related variations among organisms. Although much of this interest has centred on overall body size, the exploration of cellular characteristics can be traced back to 1675, when human red blood cells were described2,3. Two centuries later, in 1875, George Gulliver’s illustrations revealed a remarkable variation in the sizes of blood cells across vertebrates4. Gulliver’s work enhanced the understanding of erythrocyte diversity within the animal kingdom, particularly regarding characteristics associated with variation in erythrocyte size.

In most vertebrates, erythrocytes, or red blood cells (RBCs), represent the predominant blood cell type and the most abundant cellular component, playing a central role in metabolic physiology5. Their functionality stems primarily from haemoglobin, a specialised oxygen- and carbon dioxide-binding protein that facilitates the physiological process of oxygen delivery from respiratory organs to tissues6,7. Moreover, erythrocyte characteristics provide insights into how species have physiologically adapted to different environmental conditions8,9,10,11. Recent studies have demonstrated that cell size significantly influences the responses of ectothermic species to rising ambient temperatures12,13,14,15. Additionally, studies indicate intercorrelations between cell sizes across various organs and tissues16. This systemic relationship positions erythrocyte size as a simple yet useful proxy for assessing whole-organism cellular dimensions17,18. However, despite the importance of cell size in monospecific studies and in multi-species comparisons of other vertebrate groups19, there is currently no comprehensive, up-to-date database on RBC characteristics across diverse species.

To address this knowledge gap, we developed ErythroCite, the most extensive database of cell size-related traits to date, incorporating data for 660 fish species. ErythroCite goes beyond merely cataloguing blood cell size by integrating phylogenetic relationships, biological traits, and ecological information for four lineages of fishes. We focused on fish as a starting point for several reasons. First, fish account for approximately 50% of all vertebrate species, with over 35,000 described species20. Their erythrocytes are distinctively oval, flattened, and biconvex in shape21. Unlike the enucleated red blood cells of mammals, these nucleated cells offer valuable insights into the evolutionary adaptations of other ectothermic vertebrates, such as amphibians and reptiles. Second, while both aquatic organisms and high-altitude terrestrial vertebrates face oxygen limitations, water-breathing species, such as fishes, are more frequently exposed to low and fluctuating oxygen conditions. Consequently, their gas transport systems, including the properties of red blood cells, must function efficiently to ensure oxygen delivery to tissues.

Third, the availability of trait databases for fish enables integration of ErythroCite with other datasets, enhancing our understanding of factors influencing variations in cell size. Finally, the establishment of a red blood cell database for fish is necessary to enhance and update existing initiatives, such as the Animal Genome Size Database22, which includes cell size information for several vertebrate groups. This should be achieved through a systematic, multilingual approach to literature review and data collection.

We expect ErythroCite to help researchers conduct more robust comparative analyses and investigate the adaptive significance of erythrocyte size across a diverse range of fish species, thereby facilitating a deeper understanding of its evolutionary importance. In particular, we anticipate that the creation of this database will strengthen the current theory of optimal cell size9, which relates cell size to the metabolism of organisms.

Methods

We follow MeRIT guidelines established by Nakagawa et al.23 to ensure better clarity and transparency in our reporting and description of methods. These guidelines use author initials in the methods section to attribute specific tasks to individual contributors, complementing the Contributor Roles Taxonomy system (CRediT, https://credit.niso.org/).

Literature searches

Our objective was to compile comprehensive data on the cytomorphology of red blood cells in fish species. Specifically, we identified studies that quantified parameters such as cell area and volume, as well as nuclear area and volume. Additionally, we collected associated geographical, biological, and ecological metadata for each entry and species. Furthermore, we gathered bibliometric information for each study for literature mapping on this subject.

The search for information was conducted by FPLeiva using three search engines: ISI Web of Science (core collection), Scopus, and Google Scholar (Fig. 1). The first two search engines were used exclusively for searches in English on 12 July 2024, utilising Radboud University’s subscription to these services. The combination of Boolean search terms employed was: (red blood cell* OR erythrocyte OR RBC OR haematids OR red corpuscle* OR erythroid) AND (area OR size OR dimension OR volume OR diameter OR morpho*) AND (fish OR teleost OR shark* OR ray* OR skate OR ratfish OR ghostshark OR spookfish OR aquatic vertebrate OR elasmobranchii OR chondrichthyes OR osteichthyes OR ray-finned fish* OR bony fish*). From these searches, the full records were downloaded, including abstracts, keywords, and all relevant information, across all years, editions, and document types. ISI Web of Science returned 4,341 records and Scopus returned 1,039 records.

Fig. 1

PRISMA-type diagram showing the systematic literature search for studies reporting cell size measurements in fish red blood cells. For each screening and exclusion stage, the number of studies is detailed. The diagram is based on a previous study by Pottier et al.264.

The search using Google Scholar was conducted on 22–24 July 2024, targeting studies published in Spanish, Italian, Portuguese, German, French, and Polish. To facilitate this multilingual search, we translated the English keywords into these six languages. For all languages except Spanish (the native language of FPLeiva), we used DeepL (www.deepl.com/) for initial translations. Native speakers then verified the accuracy of these translations: CAFreire for Portuguese, MShokri for Italian, KAlter for German, LSerre-Fredj for French, and AHermaniuk for Polish. We chose these languages to optimize the inclusion of non-English studies that could be read by at least one of the manuscript’s authors. The software Publish or Perish24 was used to search and extract records for each language. To accommodate Google Scholar’s 256-character search string limit, we modified our initial Boolean terms for each language. We condensed the search strings while preserving the essential concepts of our research question, ensuring comprehensive searches across all target languages despite Google Scholar’s constraints. Table 1 provides detailed translations of these modified search strings.
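The length constraint described above can be checked mechanically. The following is a minimal Python sketch, not the authors' workflow: `build_query` and `fits_google_scholar` are hypothetical helper names, and the condensed string is an invented illustration rather than one of the strings in Table 1. The full term groups are copied from the English Boolean search reported earlier.

```python
# Character limit for Google Scholar search strings, as reported in the text.
MAX_QUERY_CHARS = 256


def build_query(groups):
    """Join synonym groups: OR within a group, AND between groups."""
    return " AND ".join("(" + " OR ".join(terms) + ")" for terms in groups)


def fits_google_scholar(query, limit=MAX_QUERY_CHARS):
    """Return True when the query string is within the character limit."""
    return len(query) <= limit


# Full English term groups, copied from the Web of Science / Scopus search.
FULL_GROUPS = [
    ["red blood cell*", "erythrocyte", "RBC", "haematids",
     "red corpuscle*", "erythroid"],
    ["area", "size", "dimension", "volume", "diameter", "morpho*"],
    ["fish", "teleost", "shark*", "ray*", "skate", "ratfish", "ghostshark",
     "spookfish", "aquatic vertebrate", "elasmobranchii", "chondrichthyes",
     "osteichthyes", "ray-finned fish*", "bony fish*"],
]

# Hypothetical condensed groups (illustration only, not taken from Table 1).
CONDENSED_GROUPS = [
    ["red blood cell", "erythrocyte"],
    ["area", "size", "volume"],
    ["fish", "teleost", "shark"],
]

full_query = build_query(FULL_GROUPS)
condensed_query = build_query(CONDENSED_GROUPS)

for label, query in (("full", full_query), ("condensed", condensed_query)):
    print(label, len(query), fits_google_scholar(query))
```

Running the check on each translated string before submitting it would show whether further condensing is needed for that language.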

Table 1 Keyword combinations used to search for references in seven different languages.

The Google Scholar searches conducted across various languages yielded a total of 3,599 studies. In total, our multi-engine, multilingual search produced 8,979 records. Subsequently, we screened these records to eliminate duplicates and evaluated their relevance based on titles, abstracts, and keywords.

In addition to our systematic searches, we employed complementary strategies to improve our literature search. For backward searches, we used a subset of the Animal Genome Size database22 (http://www.genomesize.com/cellsize/fish.htm) as a starting point. On 10 June 2024, FPLeiva accessed the latest version of this database and identified eleven relevant studies. Furthermore, FPLeiva has been compiling information on cell sizes of various ectotherm clades, including fish, through non-systematic searches. This ongoing effort added nine more studies to our review (Fig. 1).

To streamline the screening process, we utilized Rayyan25, an artificial intelligence-based platform designed to expedite systematic reviews by reducing the time required for each screening step. The screening was conducted by different team members based on their language expertise: FPLeiva handled the Spanish and English records, while CAFreire screened Portuguese studies. MShokri was responsible for Italian, KAlter for German, LSerre-Fredj for French, and AHermaniuk for Polish studies.

Eligibility criteria

We applied the following inclusion criteria: (i) only primary research articles were included, ensuring original data and appropriate credit to primary sources; (ii) we focused on species-specific data for consistency and comparability, excluding genus-level data and hybrid species, and considered only studies measuring mature erythrocytes, avoiding those including immature or developing cells; (iii) we selected studies working with diploid organisms, excluding polyploids due to potential cell size variations from different chromosomal loads26, though we noted as comments when additional data were also available for polyploids; (iv) in cases involving various treatments, only studies that reported experimental control conditions as labelled in the study were considered, to ensure the results were comparable across studies in ErythroCite; (v) for the few instances in which anticoagulants were used during blood collection, we used the mean cell size, as anticoagulants can influence these measurements27; and (vi) when several techniques to obtain cell sizes were employed, we prioritised data obtained from blood smears, as they provide a more consistent measure of cell size compared to live cells, which can vary in size due to their physiological state.

Using these inclusion criteria, 186 studies across all languages were included in ErythroCite, all of which are cited here4,28–212.

Data extraction and metadata

We endeavoured to incorporate the direct estimates of cell area, cell volume, mean corpuscular volume, nucleus area and nucleus volume from the original studies as much as possible. However, in numerous studies, only the lengths of the major and minor axes of the cells and their nuclei were reported. When this was the case, we employed standard formulae to calculate the area and the volume of the cell or its nucleus, assuming that both the cell and its nucleus were shaped like ellipsoids or oblate spheroids22,213.

The formula for cell area (A) was:

$$A=\pi \times \left(\frac{a}{2}\right)\times \left(\frac{b}{2}\right)$$

The formula for cell volume (V) was:

$$V=\frac{4}{3}\times \pi \times \left(\frac{a}{2}\right)\times {\left(\frac{b}{2}\right)}^{2}$$

Here, a and b denote the full lengths of the major and minor axes of the cell (or nucleus), so that a/2 and b/2 are the semi-major and semi-minor axes. The area formula treats the cell outline as an ellipse, and the volume formula treats the cell as a spheroid whose third, unmeasured axis is taken to equal the minor axis b.
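As a worked illustration of the two formulae above, the following minimal Python sketch computes area and volume from axis lengths. It takes a and b to be the full major and minor axis lengths in µm, as the a/2 and b/2 terms imply; the function names and the example dimensions (12 µm by 8 µm) are hypothetical and not taken from the database.

```python
import math


def ellipse_area(a, b):
    """Area (µm²) of an ellipse with full major axis a and full minor axis b (µm)."""
    return math.pi * (a / 2) * (b / 2)


def spheroid_volume(a, b):
    """Volume (µm³) from the formula above: 4/3 * pi * (a/2) * (b/2)**2."""
    return (4 / 3) * math.pi * (a / 2) * (b / 2) ** 2


# Hypothetical erythrocyte measuring 12 µm by 8 µm (illustration only).
cell_length_um, cell_width_um = 12.0, 8.0
print(round(ellipse_area(cell_length_um, cell_width_um), 2))     # 24*pi
print(round(spheroid_volume(cell_length_um, cell_width_um), 2))  # 128*pi
```

The same two functions apply unchanged to nuclear length and width.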

While most methods for measuring cell volume rely on fixed blood smears, alternative approaches exist. Various studies have reported mean corpuscular volume (MCV, measured in μm³) as a proxy for cell volume. MCV is typically estimated using a standard formula, as reviewed by Witeska et al.21:

$${MCV}=\frac{{Ht}\times 10}{{RBC}}$$

where Ht is the haematocrit and RBC is the red blood cell count.
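A minimal Python sketch of the MCV formula follows. The units are an assumption, since the text does not state them: under the usual haematological convention, haematocrit is given in percent and the red blood cell count in millions of cells per microlitre, which yields MCV in femtolitres (equivalent to µm³). The function name and the example values are hypothetical.

```python
def mean_corpuscular_volume(ht_percent, rbc_millions_per_ul):
    """MCV (µm³) = Ht * 10 / RBC.

    Assumed units: Ht in percent, RBC in 10^6 cells per microlitre.
    """
    if rbc_millions_per_ul <= 0:
        raise ValueError("RBC count must be positive")
    return ht_percent * 10 / rbc_millions_per_ul


# Hypothetical values: haematocrit of 30% and 1.5 million cells per microlitre.
print(mean_corpuscular_volume(30.0, 1.5))
```

As a unit check under these assumptions, 30% of one microlitre is 0.3 × 10⁹ µm³ of packed cells, which divided among 1.5 × 10⁶ cells gives 200 µm³ per cell.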

In our database, MCV values are presented in a separate column and should be interpreted with caution when compared to cell volume measurements obtained from blood smears, as emphasized by Gregory22. This distinction is important because MCV is derived from haematological parameters, while smear measurements are obtained through direct microscopic observation. Moreover, MCV represents an average value for the entire erythrocyte population, whereas cell volume estimates from smears provide measurements of individual cells.

Despite the extensive number of studies included in this work, the collection of methodological information (metadata) related to cell size estimation was relatively limited. Nevertheless, we gathered metadata associated with collection location where the species were sourced, body size, sex, and life stage studied. When location descriptions were general (e.g., Araucanía Region, Chile), coordinates were obtained from the OpenStreetMap Data Search Engine Nominatim (http://nominatim.openstreetmap.org). For more specific locations, such as named hatcheries, institutes, or localities, we utilized Google Maps to determine precise geographical positions. Additionally, we provided, in an additional column, the description of the location from where the animals were sourced, which should be used to filter, for example, wild-collected animals, in case users are interested in testing latitudinal hypotheses of cell size variation. This is because, for instance, an institute location does not necessarily correlate with natural habitat conditions in the same area.

We converted fish body sizes reported in length units to wet body mass (in grams) using species-specific length-weight relationships obtained from FishBase20. One study, Martins et al.127,214, provided approximately 3,700 observations for 15 fish species; for this study, we averaged the cell sizes at the specimen level (five individuals per species). For studies presenting cell size data exclusively in figures without accompanying textual or tabular information, we employed Plot Digitizer, a Java-based program designed to extract X-Y coordinates from graphs (http://plotdigitizer.sourceforge.net).
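The two data-reduction steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R code: the length-weight relationship is assumed to take the common power-law form W = a × L^b (mass in grams, length in centimetres), the parameter values are invented, and the helper names and record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def length_to_mass(length_cm, a, b):
    """Wet mass (g) from length (cm), assuming the power law W = a * L**b."""
    return a * length_cm ** b


def mean_by_specimen(observations):
    """Average cell sizes for each (species, specimen) pair.

    observations: iterable of (species, specimen_id, cell_size) tuples.
    """
    grouped = defaultdict(list)
    for species, specimen_id, cell_size in observations:
        grouped[(species, specimen_id)].append(cell_size)
    return {key: mean(values) for key, values in grouped.items()}


# Invented parameters for illustration: a = 0.01, b = 3.0, length = 20 cm.
print(length_to_mass(20.0, a=0.01, b=3.0))

# Invented cell-area observations (µm²) for two specimens of one species.
toy_observations = [
    ("species_1", "fish_1", 100.0),
    ("species_1", "fish_1", 110.0),
    ("species_1", "fish_2", 90.0),
]
print(mean_by_specimen(toy_observations))
```

Averaging per specimen first keeps a study with thousands of single-cell observations from outweighing studies that report one mean per species.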

Taxonomy and phylogeny

The species names were scrutinised for synonyms and any updates that might influence the taxonomy. To accomplish this, we adopted the taxonomic harmonisation procedure outlined by Lenoir et al.215 and Leiva et al.216. This taxonomic harmonization consists of three automated steps: first, we searched for species names in the National Center for Biotechnology Information (NCBI) taxonomy database; second, we verified any unmatched taxonomic entities using the Integrated Taxonomic Information System (ITIS) database; and third, we cross-checked remaining unmatched entities against the Global Biodiversity Information Facility (GBIF) database. If a match was identified, the corrected taxonomic entity was re-evaluated through the entire verification process in NCBI and ITIS to ensure accurate classification. Ultimately, only names at the species level were retained in the database, with subspecies aggregated at the species level (e.g., Catostomus catostomus). The majority (91%) of species name verifications were sourced from NCBI, with ITIS and GBIF providing additional support. For the remaining species that could not be verified through this process, manual checks were performed using additional resources such as FishBase20 and the World Register of Marine Species (WoRMS)217. When using ITIS, several species were grouped within the class Teleostei, while GBIF left most species unassigned to any class. In these cases, we manually reassigned these species to the class Actinopterygii. To address potential issues of data interoperability, we have additionally included the taxonomy of the species based on FishBase. This will allow users to more easily combine the cell size data with other fish traits, thereby enhancing interoperability between datasets from different studies218.
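The three-step cascade described above can be sketched as follows. This is a schematic Python illustration only: the real procedure queries the NCBI, ITIS, and GBIF services (via R packages), whereas here each database is replaced by a hypothetical in-memory lookup table mapping a submitted name to an accepted name, and the species names are placeholders.

```python
def harmonise(name, ncbi, itis, gbif):
    """Return (accepted_name, source), or (None, None) if no database matches.

    ncbi, itis, gbif: dicts standing in for the real taxonomic services.
    """
    # Steps 1 and 2: try NCBI first, then ITIS for names NCBI did not match.
    for source, table in (("NCBI", ncbi), ("ITIS", itis)):
        if name in table:
            return table[name], source
    # Step 3: cross-check remaining names against GBIF.
    if name in gbif:
        corrected = gbif[name]
        # Re-evaluate the corrected name through NCBI and ITIS.
        for source, table in (("NCBI", ncbi), ("ITIS", itis)):
            if corrected in table:
                return table[corrected], source
        return corrected, "GBIF"
    # Unmatched names would go to manual checks (e.g. FishBase, WoRMS).
    return None, None


# Placeholder lookup tables with invented names (illustration only).
toy_ncbi = {"Aus bus": "Aus bus"}
toy_itis = {"Cus dus": "Cus dus"}
toy_gbif = {"Aus buss": "Aus bus", "Eus fus": "Eus fus"}

for submitted in ("Aus bus", "Cus dus", "Aus buss", "Eus fus", "Xus yus"):
    print(submitted, "->", harmonise(submitted, toy_ncbi, toy_itis, toy_gbif))
```

In this toy run the misspelled "Aus buss" is corrected by the GBIF stand-in and then confirmed by the NCBI stand-in, mirroring the re-evaluation step in the text.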

We retrieved the phylogenetic relationships of the species from Open Tree of Life (OTL)219. For Choerodon albigena, which lacked information in the OTL, we added it using the phylogenetic position of its sister species, Choerodon cephalotes.

We utilised the harmonised species list to obtain the associated realm for each species from FishBase20, accessed through WoRMS on 14 November 2024, using the WoRMS Taxon Match tool220. In WoRMS, the realms freshwater, brackish, marine, and terrestrial are each assigned as a binary variable (1 or 0). In our database, we recorded whether a species occupies more than one aquatic realm throughout its life. This process resulted in five categories: marine, marine-brackish-freshwater, marine-brackish, freshwater-brackish, and freshwater, reflecting the diversity of habitats that species occupy and recognising their ability to adapt to different environmental conditions throughout their life cycle.
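Collapsing the binary realm flags into the five categories can be sketched as a simple lookup. This Python illustration is an assumption about how such a mapping might be written, not the authors' code; the function name is hypothetical, and flag combinations outside the five reported categories are returned as None rather than guessed.

```python
# (marine, brackish, freshwater) flags mapped to the five reported categories.
REALM_CATEGORIES = {
    (1, 0, 0): "marine",
    (1, 1, 1): "marine-brackish-freshwater",
    (1, 1, 0): "marine-brackish",
    (0, 1, 1): "freshwater-brackish",
    (0, 0, 1): "freshwater",
}


def realm_category(marine, brackish, freshwater):
    """Return the realm category for a set of 0/1 flags, or None if the
    combination is not one of the five categories listed in the text."""
    key = (int(bool(marine)), int(bool(brackish)), int(bool(freshwater)))
    return REALM_CATEGORIES.get(key)


print(realm_category(1, 1, 0))
print(realm_category(0, 0, 1))
print(realm_category(0, 1, 0))  # not among the five categories
```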

All analyses were carried out in R version 4.3.1 (ref. 221). The rutils package version 0.0.0.9 (ref. 222), readxl package version 1.4.3 (ref. 223), dplyr package version 1.1.4 (ref. 224), plyr package (ref. 225), writexl package version 1.5.0 (ref. 226), tibble package version 3.2.1 (ref. 227), sessioninfo package version 1.2.2 (ref. 228), rnaturalearth package version 1.0.1 (ref. 229), tidygeocoder package version 1.0.5 (ref. 230), kableExtra package version 1.4.0 (ref. 231) and DataExplorer package version 0.8.3 (ref. 232) were used to curate, format, and inspect data. The RefManageR package (refs. 233,234) was used to manipulate references. The rgbif package version 3.7.8 (refs. 235,236), rfishbase package (ref. 237) and taxize package version 0.9.98 (refs. 238,239) were used for the taxonomic harmonization. The rotl package version 3.1.0 (ref. 240), ape package version 5.8 (ref. 241), phytools package version 2.1-1 (ref. 242) and ggtree package (refs. 243,244,245,246,247) were used to create and manipulate phylogenetic trees. The ggplot2 package version 3.5.1 (ref. 248), ggpubr package version 0.6.0 (ref. 249), fishualize package version 0.2.3 (ref. 250), cowplot package version 1.1.3 (ref. 251) and ggthemes package version 5.0.0 (ref. 252) were used to produce figures.

Data Records

All materials, including the database, R code, and additional supplementary content are available under the Creative Commons Attribution 4.0 International licence (CC BY 4.0). ErythroCite is archived on GitHub at https://github.com/felixpleiva/ErythroCite and preserved on Zenodo253. This repository contains the data, metadata, and R code (https://felixpleiva.github.io/ErythroCite/) used for data curation, as well as for generating the figures and phylogenetic tree. References are also provided as a BibTeX file. ErythroCite will be updated as necessary to incorporate new studies and any identified corrections. In all cases, updates will comply with the standards of the Semantic Versioning Specification (SemVer, https://semver.org/).

Data Overview

ErythroCite encompasses over 1,700 records derived from 186 references. After applying the steps of taxonomic harmonization, the final number of unique species included in our database was 660, of which 629 were included in the OTL phylogeny (Fig. 2). In terms of taxonomic diversity, 90.2% of the species were grouped within Actinopterygii (595 species of bony fishes), 8.6% in Chondrichthyes (57 species of cartilaginous fishes), 0.75% in Cyclostomata (5 species of jawless fishes) and 0.45% in Dipnoi (3 species of lungfishes) (Fig. 3). To our knowledge, we have compiled the most comprehensive database of erythrocyte (red blood cell) sizes in fish species to date. We anticipate that this database will significantly contribute to understanding the factors influencing cell size variation among fishes and serve as a valuable resource for future research in macroecology, macrophysiology, comparative physiology, and evolutionary biology. However, despite its extensive coverage, our database reveals geographic and taxonomic biases, as well as gaps in the reported biological metadata. In an ideal scenario, all species included in the current version of ErythroCite would have information on the five traits of interest (Figs. 2, 3), including the measurements from which these traits are derived, such as cellular and nuclear lengths and widths. To address this issue, we foresee the use of phylogenetic imputation methods to fill gaps and to enhance the comprehensiveness of the database254,255. This approach could significantly augment the utility of ErythroCite.

Specifically, ErythroCite is expected to facilitate research in two key areas: first, by investigating metabolic theories such as the optimal cell size theory9,12,13,18,256,257,258,259 and hypotheses related to the development of the cardiovascular system in fishes260; and second, by examining how external factors such as environmental temperature influence variations in fish cell sizes, particularly the observation that species with larger cells tend to inhabit colder regions such as the polar areas261,262. These efforts will help identify global-scale variations in cell size by uncovering their underlying causes and analysing their effects. By integrating available metadata, we aim to enhance our understanding of the ecological and evolutionary implications of erythrocyte size diversity in fishes.

Fig. 2

Phylogenetic relationships and cell size trait distribution among 629 fishes. For illustrative purposes only, the trait values were averaged by species and then normalised by subtracting the minimum and dividing by the range. This standardisation scales all values to a range between 0 and 1. Grey bars indicate missing data for a given species. Silhouettes represent major taxonomic groups (sourced from www.phylopic.org, public domain).
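The normalisation described in the caption of Fig. 2 (subtract the minimum, divide by the range) can be sketched in Python as follows. This is an illustration rather than the authors' plotting code: the function name is hypothetical, missing species means are represented by None, and returning 0.0 when all values are identical is a choice made for this sketch.

```python
def min_max_normalise(values):
    """Scale values to [0, 1] as (x - min) / (max - min); None stays None."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lowest, highest = min(present), max(present)
    value_range = highest - lowest
    if value_range == 0:
        # All values identical: the range is zero, so map them to 0.0 here.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - lowest) / value_range for v in values]


# Invented species means (illustration only); None marks missing data.
print(min_max_normalise([10.0, None, 20.0, 30.0]))
```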

Fig. 3

Cell size of erythrocytes among major lineages of fishes: (A) cell area (μm²), (B) nucleus area (μm²), (C) cell volume (μm³), (D) nucleus volume (μm³), and (E) mean corpuscular volume (μm³). The number of species (spp.) and records (N) measured for each of the variables is indicated above each box. No data were available on mean corpuscular volume for Cyclostomata. Boxes show median (horizontal line) ± 1.5 times the interquartile range (whiskers). Dots represent observations for each trait.

Technical Validation

To validate the entries in the ErythroCite database, we employed various approaches. FPLeiva double-checked all entries resulting from English-language searches. In addition, MShokri and FPLeiva reviewed a subset of studies representing 38% of the total records. No errors were identified during this stage of verification. We established a procedure to detect, assess, and rectify inconsistencies in the database, covering cell size measurements as well as the other discrete and continuous variables. For this purpose, we adopted some of the data verification steps outlined by Pottier et al.263.

We created frequency distribution plots for traits associated with any measure of cell size (cell area, cell volume, nuclear area, nuclear volume, mean corpuscular volume (MCV), cell length, cell width, nuclear length, and nuclear width) to check for outliers. For values in the distribution tails, we checked not only for data entry errors but also for errors in the original calculations. Identified errors were corrected in the database. For MCV, we applied the standard formula and verified whether the resulting values closely matched those indicated in the papers. In cases of discrepancy, we took the recalculated value as the corrected one; it was often quite similar to the reported value, suggesting typographical errors in the original article. All these steps and potential corrections were implemented before the release of ErythroCite.
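The MCV consistency check described above can be sketched as a comparison between reported and recalculated values. This Python illustration is not the authors' procedure: the 5% relative tolerance is an arbitrary assumption (the text does not state a threshold), the units follow the usual convention of haematocrit in percent and red blood cell count in millions per microlitre, and the function name and example values are hypothetical.

```python
def check_reported_mcv(reported_mcv, ht_percent, rbc_millions_per_ul,
                       rel_tol=0.05):
    """Recompute MCV = Ht * 10 / RBC and compare it with the reported value.

    Returns (recomputed_mcv, matches); rel_tol is an assumed tolerance.
    """
    recomputed = ht_percent * 10 / rbc_millions_per_ul
    matches = abs(recomputed - reported_mcv) <= rel_tol * recomputed
    return recomputed, matches


# Invented examples: a consistent entry and one with a misplaced decimal.
print(check_reported_mcv(198.0, 30.0, 1.5))
print(check_reported_mcv(20.0, 30.0, 1.5))
```

Entries flagged as non-matching would then be inspected by hand against the source article before any value is replaced.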

All cellular and nuclear area measurements were expressed in µm², while cell volume, nuclear volume, and MCV were expressed in µm³. Both cell and nucleus length and width were measured in µm.

The majority of entries in our database were derived from tables, primarily aggregated as means at the species, sex, or geographical location level. Alongside the mean values, we documented the corresponding sample sizes, the number of specimens analysed, and the associated error of the mean, which can be valuable for meta-analytic approaches. Where possible, all errors associated with each estimate were converted to standard deviation.
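Converting reported errors to standard deviation typically relies on two textbook relations, sketched here in Python. The text does not specify which error types were encountered or how each was converted, so both conversions are assumptions: a standard error of the mean is multiplied by the square root of the sample size, and a 95% confidence interval is converted using a normal approximation (interval width divided by 3.92, then scaled by the square root of n). Function names and example values are hypothetical.

```python
import math


def se_to_sd(standard_error, n):
    """SD from the standard error of the mean: SD = SE * sqrt(n)."""
    return standard_error * math.sqrt(n)


def ci95_to_sd(lower, upper, n):
    """SD from a 95% confidence interval, assuming a normal approximation:
    SD = sqrt(n) * (upper - lower) / 3.92."""
    return math.sqrt(n) * (upper - lower) / 3.92


# Invented examples (illustration only).
print(se_to_sd(2.0, 25))
print(round(ci95_to_sd(8.04, 11.96, 100), 2))
```

For small samples, a t-distribution multiplier would be more appropriate than 3.92, so the second function is best read as a rough approximation.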